# Capstone Project - Modeling
*Author: Samuel Leadley*
## Preprocessing, Modeling, and Evaluation

## Table of Conents
1. [Load Data](#Load-Data)
2. [Preprocessingg](#Preprocessing)
3. [Lemmatizing Words](#Lemmatizing-Words)
4. [Create Variables and Train-Test-Split](#Create-Variables-and-Train-Test-Split)
5. [Create Custom Stop Words](#Create-Custom-Stop-Words)
6. [Initial Modeling](#Initial-Modeling)
7. [Inital Evaluation](#Initial-Evaluation)
8. [Resampling](#Resampling)
9. [Initial Modeling](#Initial-Modeling)
10. [Resampled Logistic Regression](#Resampled-Logistic-Regression)
11. [Resampled Random Forest](#Resampled-Random-Forest)
12. [Evaluation](#Evaluation)

In [1]:
# imports
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pandas import datetime
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
import regex as re
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn import metrics
from sklearn.utils import resample
from sklearn.metrics import precision_score, roc_curve, roc_auc_score
from sklearn.metrics import confusion_matrix

%matplotlib inline

In [2]:
import warnings 
warnings.simplefilter('ignore')

## Load Data

In [3]:
shareholder_letters = pd.read_csv('../datasets/clean_df.csv')
shareholder_letters.drop('Unnamed: 0', axis=1, inplace=True)
shareholder_letters.head()

Unnamed: 0,company,ticker,sector,year,letter_to_shareholder,net_income,target
0,Goldman Sachs,GS,Financials,1999,this is our first letter to shareholders inclu...,2.708,1.0
1,Goldman Sachs,GS,Financials,2000,was a remarkable year for goldman sachs and a...,3.067,1.0
2,Goldman Sachs,GS,Financials,2001,it is impossible to discuss without beginning...,2.31,0.0
3,Goldman Sachs,GS,Financials,2002,it was a challenging year for goldman sachs th...,2.114,0.0
4,Goldman Sachs,GS,Financials,2003,looking back on we take pride in our performa...,3.005,1.0


## Preprocessing
### Lemmatizing Words
It was evident from the EDA that similar words were frequently used like year and years or business and businesses. To reduce the number of features and help improve my model I decided to lemmatize each word to its closest root word.

In [4]:
# Instantiate tokenizer and lemmatizer
tokenizer = RegexpTokenizer(r'\w+') 
lemmatizer = WordNetLemmatizer()

In [5]:
# Tokenizing and lemmatizing the letters
lemm_letters = []

for i in shareholder_letters['letter_to_shareholder']:
    token_list = tokenizer.tokenize(i)
    i = [lemmatizer.lemmatize(i) for i in token_list]
    
    lemm_letters.append(' '.join(i))

In [6]:
# Created another data frame of the lemmatized titles
lemm_letters_df = pd.DataFrame(lemm_letters)

In [7]:
shareholder_letters['letter_to_shareholder'] = lemm_letters_df[[0]]

### Create Variables and Train-Test-Split

In [8]:
# checing the balance of classes
shareholder_letters['target'].value_counts(normalize=True)

1.0    0.692771
0.0    0.307229
Name: target, dtype: float64

The above percentage for my positive class also is the baseline score for my models.

In [9]:
# creating variables and train-test-splitting
X = shareholder_letters["letter_to_shareholder"]
y = shareholder_letters["target"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=50, stratify=y)

### Create Custom Stop Words

In [10]:
words_to_remove = pd.read_csv('../datasets/words_to_remove.csv')
words_to_remove.drop('Unnamed: 0', axis = 1, inplace=True)

In [11]:
words_to_remove.head()

Unnamed: 0,0
0,biggest
1,emerged
2,arabia
3,outstanding
4,originated


In [12]:
words_to_remove_ls = list(words_to_remove['0'])

In [13]:
len(words_to_remove_ls)

5326

In [14]:
# Add nltk stop words to my custom list
words_to_remove_ls.extend(list(stopwords.words('english')))

In [15]:
# Tokenizing and lemmatizing the words_to_remove list 
words_to_remove_ls_lemm = []

for i in words_to_remove_ls:
    token_list = tokenizer.tokenize(i)
    i = [lemmatizer.lemmatize(i) for i in token_list]
    
    words_to_remove_ls_lemm.append(' '.join(i))

In [16]:
len(words_to_remove_ls_lemm)

5505

## Initial Modeling
### Logistic Regression

In [17]:
# define a function that will run fit a model and return its accuracy score 
def pipe_searcher(pipe, params):
    gs = GridSearchCV(estimator=pipe, param_grid=params, cv=3, scoring="accuracy")
    gs.fit(X_train, y_train)
    print(f'CrossVal Score: {gs.best_score_}')
    print(f'Training Score: {gs.score(X_train, y_train)}')
    print(f'Testing Score: {gs.score(X_test, y_test)}')
    print(gs.best_params_)
    return gs

### Logistic Regression

In [18]:
# instantiate pipe for logistic regression and TfIdf
lr_pipe = Pipeline([('tfidf', TfidfVectorizer()), ('lr', LogisticRegression())])
lr_params = {
    'tfidf__max_features': [13_000, None],
    'tfidf__stop_words' : ['english', words_to_remove_ls_lemm],
    'tfidf__ngram_range': [(1,2), (2, 2)],
    'tfidf__max_df': [0.3, 0.4, 0.5],
    'lr__C' : [0.0009, 0.001],
    'lr__penalty': ['l1', 'l2']
}

In [19]:
lr_model = pipe_searcher(lr_pipe, lr_params)

CrossVal Score: 0.6935483870967742
Training Score: 0.6935483870967742
Testing Score: 0.6904761904761905
{'lr__C': 0.0009, 'lr__penalty': 'l2', 'tfidf__max_df': 0.3, 'tfidf__max_features': 13000, 'tfidf__ngram_range': (1, 2), 'tfidf__stop_words': 'english'}


In [20]:
# instantiate pipe for logistic regression and TfIdf
lr_pipe_cvec = Pipeline([('cvec', CountVectorizer()), ('lr', LogisticRegression())])
lr_params_cvec = {
    'cvec__max_features': [13_000, None],
    'cvec__stop_words' : ['english', words_to_remove_ls_lemm],
    'cvec__ngram_range': [(1,2), (2, 2)],
    'lr__C' : [0.0009, 0.0001],
    'lr__penalty': ['l1', 'l2']
    
}

In [21]:
lr_model_cvec = pipe_searcher(lr_pipe_cvec, lr_params_cvec)

CrossVal Score: 0.6935483870967742
Training Score: 0.6935483870967742
Testing Score: 0.6904761904761905
{'cvec__max_features': 13000, 'cvec__ngram_range': (1, 2), 'cvec__stop_words': 'english', 'lr__C': 0.0001, 'lr__penalty': 'l2'}


Both of my logisic models got the exact same training and testing score and both failed to beat the baseline score so I will attempt improve my score by running a more sophisticated model.

### Decision Tree

In [22]:
# instantiate pipe for decision tree classifier and TdIdf
dt_pipe = Pipeline([('tfidf', TfidfVectorizer()), ('dt', DecisionTreeClassifier(random_state=26))])
dt_params = {
    'tfidf__max_features': [13_000, None],
    'tfidf__ngram_range': [(1,2), (2,2)],
    'tfidf__stop_words' : ['english', words_to_remove_ls_lemm],
    'dt__min_samples_leaf' : [1, 2],
    'dt__max_features' : ['auto', None],
    'dt__max_depth': [500, None]
}

In [23]:
dt_model = pipe_searcher(dt_pipe, dt_params)

CrossVal Score: 0.7016129032258065
Training Score: 1.0
Testing Score: 0.6666666666666666
{'dt__max_depth': 500, 'dt__max_features': None, 'dt__min_samples_leaf': 1, 'tfidf__max_features': 13000, 'tfidf__ngram_range': (1, 2), 'tfidf__stop_words': 'english'}


The decision tree is extreamly overfit with a perfect training score and a poor testing score. To reduce variance I will run a random forest.

### Random Forest

In [24]:
# Instantiate a pipe for CountVectorizer and random forest model
rf_pipe = Pipeline([('tfidf', TfidfVectorizer()), ('rf', RandomForestClassifier(random_state=26))])
rf_params = {
    'tfidf__max_features': [13_000, 15_000, None],
    'tfidf__stop_words' : ['english', words_to_remove_ls],
    'tfidf__ngram_range': [(1,2), (2, 2)],
    'rf__n_estimators' : [45, 50],
    'rf__max_features' : ['auto', 10],
    'rf__min_samples_leaf': [3, 4],
    'rf__max_depth' :[80, 100]
}

In [25]:
rf_model = pipe_searcher(rf_pipe, rf_params)

CrossVal Score: 0.717741935483871
Training Score: 0.8790322580645161
Testing Score: 0.6904761904761905
{'rf__max_depth': 80, 'rf__max_features': 'auto', 'rf__min_samples_leaf': 3, 'rf__n_estimators': 45, 'tfidf__max_features': 13000, 'tfidf__ngram_range': (2, 2), 'tfidf__stop_words': 'english'}


Although the random forest helped reduce variance it is still very overfit and again the baseline score has not been beat. 

## Initial Evaluation

In [28]:
predictions = rf_model.predict(X_test)

In [29]:
predictions

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1.])

My predictions are all ones which could indicate that the models are just predicting the majority class each time. I will resample my data so it is more even to hopefully improve my model.

## Resampling

In [30]:
# split the dataframe by class
shareholder_letters_majority = shareholder_letters[shareholder_letters['target'] == 1]
shareholder_letters_minority = shareholder_letters[shareholder_letters['target'] == 0]

In [31]:
# use resample to upsample the minority class
shareholder_letters_minority_upsampled = resample(shareholder_letters_minority,
                                                 replace=True, n_samples=len(shareholder_letters_majority))

In [32]:
# merge dataframes
resampled_shareholder_letters = pd.concat([shareholder_letters_majority, shareholder_letters_minority_upsampled])

In [33]:
# check new balance of classes
resampled_shareholder_letters['target'].value_counts()

0.0    115
1.0    115
Name: target, dtype: int64

### Re-create variables and re-do train-test-split

In [34]:
# creating variables and train-test-splitting
X = shareholder_letters["letter_to_shareholder"]
y = shareholder_letters["target"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=50)

### Resampled Logistic Regression

In [35]:
lr_pipe_rs = Pipeline([('tfidf', TfidfVectorizer()), ('lr', LogisticRegression())])
lr_params_rs = {
    'tfidf__max_features': [13_000, None],
    'tfidf__stop_words' : ['english', words_to_remove_ls_lemm],
    'tfidf__ngram_range': [(1,2), (2, 2)],
    'tfidf__max_df': [0.3, 0.4, 0.5],
    'lr__C' : [0.0009, 0.001],
    'lr__penalty': ['l1', 'l2']
}

In [36]:
lr_model_rs = pipe_searcher(lr_pipe_rs, lr_params_rs)

CrossVal Score: 0.7016129032258065
Training Score: 0.7016129032258065
Testing Score: 0.6666666666666666
{'lr__C': 0.0009, 'lr__penalty': 'l2', 'tfidf__max_df': 0.3, 'tfidf__max_features': 13000, 'tfidf__ngram_range': (1, 2), 'tfidf__stop_words': 'english'}


The resampled dataset has fared no better than the unbalanced dataset. Maybe a more complex model will improve the socres.

### Resampled Random Forest

In [37]:
# instantiate a pipe for CountVectorizer and random forest model
rf_pipe_cvec_rs = Pipeline([('cvec', CountVectorizer()), ('rf', RandomForestClassifier(random_state=26))])
rf_params_cvec_rs = {
    'cvec__max_features': [13_000, 70_000, None],
    'cvec__stop_words' : ['english', words_to_remove_ls_lemm],
    'cvec__ngram_range': [(1,2), (2, 2)],
    'rf__n_estimators' : [45, 50],
    'rf__max_features' : ['log2', 'auto'],
    'rf__min_samples_leaf': [3, 4],
    'rf__max_depth' :[70, 80, 90]
}

In [38]:
rf_model_cvec_rs = pipe_searcher(rf_pipe_cvec_rs, rf_params_cvec_rs)

CrossVal Score: 0.7096774193548387
Training Score: 0.9274193548387096
Testing Score: 0.6428571428571429
{'cvec__max_features': 13000, 'cvec__ngram_range': (1, 2), 'cvec__stop_words': 'english', 'rf__max_depth': 70, 'rf__max_features': 'auto', 'rf__min_samples_leaf': 3, 'rf__n_estimators': 45}


The random forest model is extreamly overfit and has done very poorly. However I am interested to see if it has at least not just predicted everything to be the majority class.

In [39]:
predictions = rf_model_cvec.predict(X_test)

In [40]:
predictions

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 0., 1., 1., 1., 1., 1., 0., 0., 1., 1., 1., 1., 1., 1.,
       1., 0., 1., 1., 1., 1., 1., 1.])

Because it has actually produced varying predictions I will evaluate the random forest model. 

## Evaluation

In [41]:
# create confusion matrix
cm = confusion_matrix(y_test, predictions)

In [42]:
# convert confusion matrix into readable format
cm_df = pd.DataFrame(cm, columns=['pred negative', 'pred positive'], index=['actual negative', 'actual positive'])
cm_df

Unnamed: 0,pred negative,pred positive
actual negative,4,10
actual positive,0,28


The confusion matrix aboves shows that the negative class if very difficult to predict. Most of the actual negative letters were misclassified. 

## Conclusions and Recommendations
As stated above the models failed. Given the nature of letters to shareholders (they try to paint a rosy picture in any circumstance) and the tiny amount of data I was able to collect I believe this is the best performance I could have gotten from my models. Given the arduous data collection method used in this project a tool that could locate and extract certain sections of text should be developed before attempting to create a document analysis tool. 