# Natural Language Processing - Problems

Ref: https://www.tiesdekok.com

**Python version:** Python 3.6+     
**Recommended environment: `researchPython`**

In [42]:
import os
recommendedEnvironment = 'researchPython'
if os.environ['CONDA_DEFAULT_ENV'] != recommendedEnvironment:
    print('Warning: it does not appear you are using the {0} environment, did you run "conda activate {0}" before starting Jupyter?'.format(recommendedEnvironment))

# Introduction

### Relevant tutorial notebooks:

1) [`0_python_basics.ipynb`](https://nbviewer.jupyter.org/github/TiesdeKok/LearnPythonforResearch/blob/master/0_python_basics.ipynb)  


2) [`2_handling_data.ipynb`](https://nbviewer.jupyter.org/github/TiesdeKok/LearnPythonforResearch/blob/master/2_handling_data.ipynb)  


3) [`NLP_Notebook.ipynb`](https://nbviewer.jupyter.org/github/TiesdeKok/Python_NLP_Tutorial/blob/master/NLP_Notebook.ipynb)  

## Import required packages

In [1]:
import os, re
from pathlib import Path
import pandas as pd
import numpy as np

In [2]:
from sklearn.feature_extraction.text import CountVectorizer
import scipy

In [3]:
!pip install spacy



In [4]:
!python -m spacy download en_core_web_lg

Collecting en-core-web-lg==3.4.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.4.1/en_core_web_lg-3.4.1-py3-none-any.whl (587.7 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m587.7/587.7 MB[0m [31m759.7 kB/s[0m eta [36m0:00:00[0m00:01[0m00:02[0m
[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_lg')


In [11]:
import spacy
nlp = spacy.load("en_core_web_lg")

You might have to replace the above with the code below if you installed the language model in an alternative way
```python
import en_core_web_lg
nlp = en_core_web_lg.load()
```

# Part 1 

## 1) Perform basic operations on a sample earnings transcript text file

### 1a) Load the following text file: `data > example_transcript.txt` into Python

In [48]:
os.listdir('data')

['MDA_files',
 'LoughranMcDonald_MasterDictionary_2014.xlsx',
 'MDA_META_DF.xlsx',
 'example_transcript.txt',
 '.ipynb_checkpoints']

In [6]:
with open(os.path.join('data','example_transcript.txt'), 'r', encoding = 'utf-8') as f:
    transcript = f.read()

In [7]:
transcript

"\ufeffCompany Name: Hope Bancorp Inc\nCompany Ticker: HOPE US Equity\nDate: 2020-01-23\nQ4 2019 Earnings Call\nCompany Participants\nAlex Ko, Executive Vice President and Chief Financial Officer\nAngie Yang, Director, Investor Relations\nKevin S. Kim, President and Chief Executive Officer\nOther Participants\nBob Sean, Analyst\nChristopher McGratty, Analyst\nJake Stern, Analyst\nTim Coffey, Analyst\nUnidentified Participant\nPresentation\nOperator\nHello and welcome to the Hope Bancorp Q4 2019 Earnings Conference Call. All participants will be in listen-only mode. (Operator Instructions) Please note this event is being recorded.\n\nI would now like to turn the conference over to Angie Yang, Director of Investor Relations. Please go ahead.\n\nAngie Yang \nThank you, Keith. Good morning everyone and thank you for joining us for the Hope Bancorp 2019 Fourth Quarter Investor Conference Call. As usual we will begin -- we will be using a slide presentation to accompany our discussion this m

### 1b) Print the first 400 characters of the text file you just loaded

In [17]:
print(transcript[:400])

﻿Company Name: Hope Bancorp Inc
Company Ticker: HOPE US Equity
Date: 2020-01-23
Q4 2019 Earnings Call
Company Participants
Alex Ko, Executive Vice President and Chief Financial Officer
Angie Yang, Director, Investor Relations
Kevin S. Kim, President and Chief Executive Officer
Other Participants
Bob Sean, Analyst
Christopher McGratty, Analyst
Jake Stern, Analyst
Tim Coffey, Analyst
Unidentified Pa


### 1c) Count the number of times the name `Angie` is mentioned

In [14]:
transcript.count('Angie')

10

### 1c) Use the provided Regular Expression to capture all numbers prior to a "%"  
Use this regular expression: `\W([\.\d]{,})%`  
**You can play around with this regular expression here: <a href='https://bit.ly/3heIqoG'>Test on Pythex.org</a>**

In [15]:
re.findall('\W([\.\d]{,})%',transcript)

['31',
 '21',
 '2',
 '41.2',
 '48.3',
 '1.86',
 '1.88',
 '1.4',
 '5.7',
 '61',
 '31',
 '8',
 '3.16',
 '1.49',
 '1.54',
 '1.49',
 '1.44',
 '1.85',
 '2.4',
 '2.5',
 '6.2',
 '5.7',
 '58.8',
 '57.6',
 '85',
 '4',
 '2.17',
 '2.3',
 '1.9',
 '2.18',
 '1.85',
 '1.85',
 '5.04',
 '6',
 '24',
 '4.6',
 '37',
 '39',
 '39',
 '55',
 '4',
 '30',
 '40']

##\W : extracts non-alphanumeric 
. extracts any character 
\ escape special characters
\d digit
{,} from m to n, m defaults to 0, n to infinity

### It is capturing all numbers prior to %

([\.\d]{,})%


### 1d) Load the text into a Spacy object and split it into a list of  sentences

Make sure to evaluate how well it worked by inspecting various elements of the sentence list.

Note: the beginning of the document contains meta data that are not normal sentences, so you might see some weird "sentences" at the beginning. 

## nlp will help to get rid of the unecessary characters at the beginning of the text

In [18]:
nlp_t = nlp(transcript)

In [21]:
sentences = [x.text for x in nlp_t.sents]  

In [22]:
print(sentences[150])






In [23]:
sentences[1]

'the Hope Bancorp Q4 2019 Earnings Conference Call.'

In [24]:
print(sentences[1])

the Hope Bancorp Q4 2019 Earnings Conference Call.


### 1e) Parse out the following three parts of the earnings call transcript and put them in seperate variables:

* The meta data at the top (e.g., company name, period, etc)   
* The presentation portion  
* The Q&A portion

**Note:** you could do it based on the exact location (e.g, `text_file[:1234]`), however, that would only work for this file. Try to come up with a solution that would work for all files that follow the same structure. 

In [25]:
pres_split = transcript.split('Presentation')
meta = pres_split[0]

In [26]:
rest = ''.join(pres_split[1:])   

In [27]:
qa_split = rest.split('Questions And Answers')  
presentation = qa_split[0]

In [28]:
QA = ''.join(qa_split[1:]) 

### 1f) How many characters, sentences, words (tokens) do the presentation portion and the Q&A portion have?  

Hint: use `Spacy` for the sentence and word counts.

##pres_nlp.sent is a spacy object and we can see the contents after we convert it into a list by the command list(pres_nlp.sents)

In [29]:
## list(pres_nlp) gives a list of words in the nlp object pres_nlp

In [30]:
pres_nlp = nlp(presentation)

In [31]:
print('The presentation has {} characters, {} sentences and {} tokens.'.format(len(presentation), 
                                                                               len(list(pres_nlp.sents)), 
                                                                               len(list(pres_nlp))))

The presentation has 17898 characters, 172 sentences and 3396 tokens.


# Part 2

## 2) Create sentiment score based on Loughran and McDonald (2011)   

Create a sentiment score for MD&As based on the Loughran and McDonald (2011) word lists.    

#### References  

*Loughran, T., & McDonald, B. (2011). When is a liability not a liability? Textual analysis, dictionaries, and 10‐Ks. The Journal of Finance, 66(1), 35-65.*

#### Data to use

I have included a random selection of 20 pre-processed MDA filings in the `data > MDA_files` folder. The filename is the unique identifier.   

You will also find a file called `MDA_META_DF.xlsx` in the "data" folder, this contains the following meta-data for each MD&A: 
* filing date  
* cik   
* company name  
* link to filing

### 2a) Load data into a dictionary with as key the filename and as value the content of the text file

The files should all be in the following folder:  
```
os.path.join('data', 'MDA_files')
```

##os.listdir() is used to get the list of all files and directories in the specified directory. If we don’t specify any directory, then list of files and directories in the current working directory will be returned.

##os.path.join join one or more path components intelligently. Colon is the delimiter of the slice syntax to 'slice out' sub-parts in sequences [start:end]

In [32]:
mda_file_folder = os.path.join('data', 'MDA_files')
mda_files = [file for file in os.listdir(mda_file_folder) if file[-3:] == 'txt']

In [33]:
mda_data = {}
for mda_file in mda_files:
    with open(os.path.join(mda_file_folder, mda_file), 'r') as f:
     mda_data[mda_file] = f.read()

### 2b) Load the Loughran and McDonald master dictionary    
**Note:** The Loughran and McDonald dictionary is included in the "data" folder: `LoughranMcDonald_MasterDictionary_2014.xlsx `

In [34]:
lm_df = pd.read_excel(os.path.join('data', 'LoughranMcDonald_MasterDictionary_2014.xlsx'))

In [35]:
lm_df

Unnamed: 0,Word,Sequence Number,Word Count,Word Proportion,Average Proportion,Std Dev,Doc Count,Negative,Positive,Uncertainty,Litigious,Constraining,Superfluous,Interesting,Modal,Irr_Verb,Harvard_IV,Syllables,Source
0,AARDVARK,1,81,5.690194e-09,3.068740e-09,5.779943e-07,45,0,0,0,0,0,0,0,0,0,0,2,12of12inf
1,AARDVARKS,2,2,1.404986e-10,8.217606e-12,7.841870e-09,1,0,0,0,0,0,0,0,0,0,0,2,12of12inf
2,ABACI,3,8,5.619945e-10,1.686149e-10,7.096240e-08,7,0,0,0,0,0,0,0,0,0,0,3,12of12inf
3,ABACK,4,5,3.512466e-10,1.727985e-10,7.532677e-08,5,0,0,0,0,0,0,0,0,0,0,2,12of12inf
4,ABACUS,5,1752,1.230768e-07,1.198634e-07,1.110293e-05,465,0,0,0,0,0,0,0,0,0,0,3,12of12inf
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
85126,ZYGOTE,85127,35,2.458726e-09,1.025127e-09,2.320929e-07,25,0,0,0,0,0,0,0,0,0,0,2,12of12inf
85127,ZYGOTES,85128,1,7.024931e-11,2.593031e-11,2.474469e-08,1,0,0,0,0,0,0,0,0,0,0,2,12of12inf
85128,ZYGOTIC,85129,0,0.000000e+00,0.000000e+00,0.000000e+00,0,0,0,0,0,0,0,0,0,0,0,3,12of12inf
85129,ZYMURGIES,85130,0,0.000000e+00,0.000000e+00,0.000000e+00,0,0,0,0,0,0,0,0,0,0,0,3,12of12inf


### 2c) Create two lists: one containing all the negative words and the other one containing all the positive words   

Note, you can treat any number that is not 0 as a 1:

```python
0         ## <-- 0
2009      ## <-- 1
2014      ## <-- 1
2011      ## <-- 1
2012      ## <-- 1
```

They include the year instead of a one for versioning purposes. 

**Tip:** I recommend to change all words to lowercase in this step so that you don't need to worry about that later

In [36]:
neg_words = list(lm_df[lm_df.Negative != 0].Word.astype(str).str.lower().values)
pos_words = list(lm_df[lm_df.Positive != 0].Word.astype(str).str.lower().values)

## .str. is an accessor that is used to access different functions such as lower() and upper()

### 2d) For each MD&A calculate the *total* number of times negative and positive words are mentioned

**Note:** make sure you deal with uppercase vs. lowercase and substring matches.

**Hint 1:** save the counts to a list where each entry is a list that contains the following three items: [*filename*, *total pos count*, *total neg count*], like this:
> [   
    ['21344_0000021344-16-000050.txt', 1234, 1234],   
    ['21510_0000021510-16-000074.txt', 1234, 1234],  
> ....  
 ]   
 
An example to illustrate sub-string matches:

```python
### For example, consider the positive word 'win'

test_sen = "They hockey team made a big win during the winter."

test_sen.count('win')
## gives --> 2 

## We only want to count "win" not "winter", how do we solve that?
```



In [37]:
mda_term_counts = []
for mda_file, mda_text in mda_data.items():
    text_lower = mda_text.lower()
    
    pos_count = 0
    for word in pos_words:
        pos_count += text_lower.count('' + word + '')
        
    neg_count = 0
    for word in neg_words:
        neg_count += text_lower.count('' + word + '')   
        
    mda_term_counts.append([mda_file,pos_count,neg_count])


In [38]:
mda_term_counts[15]

['36146_0001564590-16-013066.txt', 684, 2094]

### 2e) Convert the list created in 3c into a Pandas DataFrame  
**Hint:** Use the `columns=[...]` parameter to name the columns

In [39]:
df = pd.DataFrame(data = mda_term_counts, columns = ['file_name', 'total pos count', 'total neg count'])

In [40]:
df.head()

Unnamed: 0,file_name,total pos count,total neg count
0,26076_0001558370-16-010242.txt,470,1214
1,30625_0000030625-16-000121.txt,400,1268
2,26324_0000026324-16-000040.txt,378,666
3,40533_0000040533-16-000056.txt,328,494
4,46250_0000046250-16-000047.txt,200,458


### 2f) Create a new column with a "sentiment score" for each MD&A

Use the following imaginary sentiment score:  
$$\frac{(Num\ Positive\ Words - Num\ Negative\ Words)}{Sum\ of Pos\ and\ Neg\ Words}$$


In [41]:
df['sentiment'] = (df['total pos count'] - df['total neg count'])/(df['total pos count'] + df['total neg count'])

In [42]:
df

Unnamed: 0,file_name,total pos count,total neg count,sentiment
0,26076_0001558370-16-010242.txt,470,1214,-0.441805
1,30625_0000030625-16-000121.txt,400,1268,-0.520384
2,26324_0000026324-16-000040.txt,378,666,-0.275862
3,40533_0000040533-16-000056.txt,328,494,-0.201946
4,46250_0000046250-16-000047.txt,200,458,-0.392097
5,23217_0001628280-16-017613.txt,450,1470,-0.53125
6,21510_0000021510-16-000074.txt,606,1078,-0.280285
7,47518_0001214659-16-014806.txt,482,974,-0.337912
8,49071_0000049071-16-000117.txt,954,1636,-0.26332
9,47217_0000047217-16-000093.txt,544,928,-0.26087


## 2g) Use the `MDA_META_DF` file to add the company name, filing date, and CIK to the sentiment dataframe

In [44]:
mda_meta_df = pd.read_excel(os.path.join('data', 'MDA_META_DF.xlsx'))

In [45]:
new_df = pd.merge(df,mda_meta_df[['file_name','fdate','cik','coname']], how = 'left', on = 'file_name') 
new_df.head(5)

Unnamed: 0,file_name,total pos count,total neg count,sentiment,fdate,cik,coname
0,26076_0001558370-16-010242.txt,470,1214,-0.441805,2016-11-22,26076,CUBIC CORP /DE/
1,30625_0000030625-16-000121.txt,400,1268,-0.520384,2016-02-18,30625,FLOWSERVE CORP
2,26324_0000026324-16-000040.txt,378,666,-0.275862,2016-02-25,26324,CURTISS WRIGHT CORP
3,40533_0000040533-16-000056.txt,328,494,-0.201946,2016-02-08,40533,GENERAL DYNAMICS CORP
4,46250_0000046250-16-000047.txt,200,458,-0.392097,2016-06-03,46250,HAWKINS INC


# Part 3

## 3a) Calculate the Term Frequency (TF) vectors for the MD&A files. 

You should end up with a matrix of the shape 20x6747 (or something along those lines). 20 reflects the number of MDA filings and 6747 reflects the number of unique tokens/words.

**Use the `CountVectorizer` from sci-kit learn:** https://scikit-learn.org/stable/modules/feature_extraction.html#common-vectorizer-usage

In [46]:
from sklearn.feature_extraction.text import CountVectorizer 
from sklearn.feature_extraction.text import TfidfVectorizer 

In [47]:
newdf = pd.DataFrame.from_dict(mda_data, orient = 'index', columns = ['content'])
newdf

Unnamed: 0,content
26076_0001558370-16-010242.txt,We are a leading international provider of cos...
30625_0000030625-16-000121.txt,The following discussion and analysis is provi...
26324_0000026324-16-000040.txt,Curtiss-Wright Corporation and its subsidiarie...
40533_0000040533-16-000056.txt,"For an overview of our business groups, includ..."
46250_0000046250-16-000047.txt,The following is a discussion and analysis of ...
23217_0001628280-16-017613.txt,The following discussion and analysis is inten...
21510_0000021510-16-000074.txt,Below is a summary of some of the quantitative...
47518_0001214659-16-014806.txt,We are a leading global medical technology com...
49071_0000049071-16-000117.txt,"Humana Inc., headquartered in Louisville, Kent..."
47217_0000047217-16-000093.txt,An analysis of our continuing financial result...


## Create a vectorizer function using CountVectorizer to convert the text into term frequencies 

In [55]:
tf_vectorizer = CountVectorizer(stop_words = 'english')

In [56]:
tf_vectors = tf_vectorizer.fit_transform(newdf.content.values)

## This function converts the text values into a matrix of term frequencies

In [57]:
type(tf_vectors)

scipy.sparse._csr.csr_matrix

## This reflects the shape of the scipy matrix with 2 files and 6519 unique tokens or words for which we have frequencies

In [104]:
tf_vectors.shape

(20, 6519)

## Set max_df and min_df values in the CountVectorizer function 

###max_df = 0.8 --> any word that occurs in more than 80% of our documents is not going to be included; min_df is the minimum number of documents in which the word should occur. 

In [51]:
tf_vectorizer = CountVectorizer(stop_words = 'english', max_df = 0.9)

In [61]:
##tf_vectorizer = CountVectorizer(stop_words = 'english', max_df = 0.9, min_df = 0.005)

In [62]:
feature_names = list(tf_vectorizer.get_feature_names())

In [63]:
type(tf_vectors.toarray())

numpy.ndarray

In [65]:
pd.DataFrame(tf_vectors.toarray())

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,6509,6510,6511,6512,6513,6514,6515,6516,6517,6518
0,0,0,0,0,0,0,2,0,0,2,...,2,0,0,2,6,0,10,4,0,0
1,2,3,1,0,0,0,0,0,0,0,...,0,4,0,0,0,0,0,0,0,0
2,2,6,0,0,0,0,0,0,0,0,...,2,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,8,0,0,10,...,8,0,0,0,0,0,0,0,0,0
4,2,2,0,0,0,0,0,0,0,0,...,0,0,0,0,2,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,6,0,0,0,0,0,0,0,0,0
6,6,0,0,0,0,0,0,0,0,2,...,2,0,0,4,0,0,0,0,0,0
7,0,2,0,0,0,0,0,0,0,0,...,2,0,0,0,0,0,0,0,0,0
8,0,34,0,0,0,0,0,0,0,0,...,0,0,0,4,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,2,2,0,0,0,0,0
