# 3. Evaluation of Production Model

This notebook evaluates the production model identified in [02_NLP_Pipeline_Model](02_NLP_Pipeline_Model.ipynb) and provides conclusions and recommendations.

**Evaluation and Conceptual Understanding**
- Does the student accurately identify and explain the baseline score?
- Does the student select and use metrics relevant to the problem objective?
- Does the student interpret the results of their model for purposes of inference?
- Is domain knowledge demonstrated when interpreting results?
- Does the student provide appropriate interpretation with regards to descriptive and inferential statistics?

**Conclusion and Recommendations**
- Does the student provide appropriate context to connect individual steps back to the overall project?
- Is it clear how the final recommendations were reached?
- Are the conclusions/recommendations clearly stated?
- Does the conclusion answer the original problem statement?
- Does the student address how findings of this research can be applied for the benefit of stakeholders?
- Are future steps to move the project forward identified?


#### Imports

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.metrics import mean_squared_error, ConfusionMatrixDisplay, recall_score
from sklearn.ensemble import VotingClassifier

In [2]:
misclassifications = pd.read_csv('../datasets/misclassifications.csv')

In [3]:
df = pd.read_csv('../datasets/combined_cleaned.csv',na_filter=False)

In [4]:
%store -r vr1 X_train X_test y_train y_test

### Evaluation

#### Baseline score

In [5]:
%store -r baseline
print(f'Baseline accuracy is {round(baseline*100,1)}% if you were to always predict divorce.')

Baseline accuracy is 50.3% if you were to always predict divorce.


In [6]:
baseline_recall = recall_score(y_test, np.full_like(y_test,'Divorce'),pos_label = 'Divorce')

In [7]:
print(f'Baseline recall score is {round(baseline_recall*100,1)}%')

Baseline recall score is 100.0%
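Both baseline numbers follow directly from always predicting the positive class: every true divorce post is caught (recall of 100%), while accuracy equals the share of divorce posts in the data. The sketch below checks this on a small set of made-up labels; the label values and helper functions are illustrative assumptions, not the project's test split or scikit-learn's implementation.

```python
# Toy labels (hypothetical), not the project's test split.
y_true_toy = ['Divorce', 'Divorce', 'Divorce', 'Wedding', 'Wedding']
y_pred_toy = ['Divorce'] * len(y_true_toy)  # always predict the positive class


def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, pos_label):
    """True positives divided by all actual positives."""
    tp = sum(t == pos_label and p == pos_label for t, p in zip(y_true, y_pred))
    fn = sum(t == pos_label and p != pos_label for t, p in zip(y_true, y_pred))
    return tp / (tp + fn)


toy_accuracy = accuracy(y_true_toy, y_pred_toy)          # share of 'Divorce' labels
toy_recall = recall(y_true_toy, y_pred_toy, 'Divorce')   # no divorce post is missed
print(toy_accuracy, toy_recall)
```

This is why a high recall on its own is not informative here: the model has to keep recall close to the baseline while raising accuracy well above it.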


#### Production Model score

The production model uses the following methodology:
- Inputs
    - Selftext (including empty, removed, and deleted posts)
    - Titles
- Column Transformation: count vectorizer on both inputs
    - Stop words = stopwords.words('english') plus 'na','removed','deleted','would','like','get','really'
    - N-grams = 1, 2, and 3 word phrases
    - Minimum document frequency = 4 documents
    - Maximum document frequency = 40% of all documents
    - Strip accents = Use unicode normalizations
- Voting classifier
    - Soft voting
    - Equal weighting for 3 models:
    1. Logistic Regression
        - default settings
    2. Random Forest Classifier
        - no max depth
        - 500 estimators
    3. Ada Boost Classifier
        - Logistic Regression base estimator
        - learning rate = 2.0
        - 100 estimators
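To illustrate what soft voting with equal weights means, the sketch below averages the class probabilities from three models and picks the class with the highest average. The probabilities and the `'Wedding'` label are made up for illustration, and `soft_vote` is a hypothetical helper rather than the scikit-learn implementation.

```python
def soft_vote(probas, weights=None):
    """Average per-model class probabilities; return (winning label, averages)."""
    if weights is None:
        weights = [1.0] * len(probas)  # equal weighting
    total = sum(weights)
    classes = list(probas[0].keys())
    avg = {
        c: sum(w * p[c] for w, p in zip(weights, probas)) / total
        for c in classes
    }
    return max(avg, key=avg.get), avg


# Hypothetical predicted probabilities for a single post.
model_probas = [
    {'Divorce': 0.45, 'Wedding': 0.55},  # logistic regression
    {'Divorce': 0.45, 'Wedding': 0.55},  # random forest
    {'Divorce': 0.95, 'Wedding': 0.05},  # AdaBoost
]

label, avg = soft_vote(model_probas)
print(label, avg)
```

In this made-up case two of the three models lean towards `'Wedding'`, so a majority (hard) vote would pick it, but the one confident model pulls the averaged probability towards `'Divorce'`.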

In [8]:
train_accuracy = vr1.score(X_train, y_train)
test_accuracy = vr1.score(X_test, y_test)

In [9]:
print(f'The accuracy scores for the train and test splits are {round(train_accuracy*100,1)}% and {round(test_accuracy*100,1)}%, respectively.')

The accuracy scores for the train and test splits are 99.8% and 97.4%, respectively.


In [10]:
train_recall = recall_score(y_train,vr1.predict(X_train),pos_label='Divorce')
test_recall = recall_score(y_test,vr1.predict(X_test),pos_label='Divorce')

The recall score tells us what share of the divorce posts the model correctly classifies. This is an important metric for the divorce lawyer because each divorce post the model misses is a potential client lost.

In [11]:
print(f'The recall scores for the train and test splits are {round(train_recall*100,1)}% and {round(test_recall*100,1)}%, respectively.')

The recall scores for the train and test splits are 100.0% and 99.0%, respectively.


Note that the baseline recall score is 100%. In exchange for not advertising to wedding planners (and thus not annoying them), the model's test recall drops by 1 percentage point to 99.0%, which means roughly 1% of divorce posts (i.e. potential clients) are missed.
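The link between a recall score and the share of divorce posts that are missed can be checked with a few lines of arithmetic. The post count below is a hypothetical round number used only for illustration.

```python
def missed_share(recall):
    """Share of actual positives that the model fails to catch."""
    return 1.0 - recall


def expected_missed(recall, n_positive):
    """Expected number of positives missed out of n_positive."""
    return missed_share(recall) * n_positive


share = missed_share(0.99)             # test recall reported above
missed = expected_missed(0.99, 1000)   # 1,000 divorce posts is an assumption
print(f'{share:.1%} of divorce posts missed, about {missed:.0f} per 1,000')
```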

In [76]:
df_scores = pd.DataFrame([baseline, baseline_recall],columns=['Baseline'],index=['Accuracy','Recall'])

In [77]:
df_scores['Baseline'] = [str(round(x*100,1)) + '%' for x in df_scores['Baseline']]

In [78]:
df_scores['Model'] = [test_accuracy, test_recall]

In [79]:
df_scores['Model'] = [str(round(x*100,1)) + '%' for x in df_scores['Model']]

In [81]:
df_scores.transpose()

Unnamed: 0,Accuracy,Recall
Baseline,50.3%,100.0%
Model,97.4%,99.0%


#### Inference of the production model

In [12]:
# Create a vector of the words transformed in the count vectorizer
vect_names = vr1.named_estimators['lr'].named_steps['ct'].named_transformers_['countvectorizer-1'].get_feature_names_out() + ' (selftext)'

vect_names = np.append(vect_names,(vr1.named_estimators['lr'].named_steps['ct'].named_transformers_['countvectorizer-2'].get_feature_names_out()+' (title)'))

In [13]:
# Create a data frame that contains the coefficients of the logistic regression model and the feature importances of the random forest
df_coefs = pd.DataFrame(np.transpose(vr1.named_estimators['lr'].named_steps['lr'].coef_), index=vect_names,columns =['LR coefs'])

df_coefs['LR odds'] = df_coefs['LR coefs'].apply(np.exp)

df_coefs['RF FI'] = np.transpose(vr1.named_estimators['rf'].named_steps['rf'].feature_importances_)

In [93]:
df_coefs.loc[['rings (title)','band (selftext)','engagement ring (selftext)','engagement (selftext)',
              'sell (selftext)','hassle (selftext)'],:]

Unnamed: 0,LR coefs,LR odds,RF FI
rings (title),0.124997,1.133145,3.5e-05
band (selftext),0.196374,1.216982,0.000103
engagement ring (selftext),0.006919,1.006943,4.6e-05
engagement (selftext),0.130738,1.13967,0.000434
sell (selftext),-0.807043,0.446176,0.00012
hassle (selftext),1e-05,1.00001,5e-06


In [14]:
# Create data frames for the top coefficients for wedding and divorce, and top feature importances
top_coefs_div = df_coefs.sort_values(by=['LR coefs']).head(20)

top_coefs_wed = df_coefs.sort_values(by=['LR coefs']).tail(20)

top_feat_imp = df_coefs.sort_values(by=['RF FI'],ascending = False).head(20)

In [15]:
# Create data frames for the selftext and title vectors, and include information about which subreddit it came from
ct = vr1.named_estimators['lr'].named_steps['ct']
selftext_cv = ct.named_transformers_['countvectorizer-1']
title_cv = ct.named_transformers_['countvectorizer-2']  # title vectorizer, as used for vect_names above

selftext_vect = selftext_cv.transform(X_train['selftext'])
df_selftext_vect = pd.DataFrame(selftext_vect.toarray(), columns=selftext_cv.get_feature_names_out())
# Assign labels by position; y_train keeps its original index from the train/test split
df_selftext_vect['subreddit'] = np.asarray(y_train)
df_selftext_vect['subreddit_div'] = [1 if x == 'Divorce' else 0 for x in y_train]

title_vect = title_cv.transform(X_train['title'])
df_title_vect = pd.DataFrame(title_vect.toarray(), columns=title_cv.get_feature_names_out())
df_title_vect['subreddit'] = np.asarray(y_train)
df_title_vect['subreddit_div'] = [1 if x == 'Divorce' else 0 for x in y_train]

In [16]:
# Add the share of posts containing each word (in the selftext or title, as labelled) that come from the divorce subreddit
def div_share(word):
    term = word.replace(' (selftext)','').replace(' (title)','')
    frame = df_selftext_vect if '(selftext)' in word else df_title_vect
    return frame[frame[term]>0]['subreddit_div'].mean()

top_coefs_div['div_freq'] = [div_share(word) for word in top_coefs_div.index]

top_coefs_wed['div_freq'] = [div_share(word) for word in top_coefs_wed.index]

top_feat_imp['div_freq'] = [div_share(word) for word in top_feat_imp.index]
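The `div_freq` column answers the question: of the posts that contain a given word, what share come from the divorce subreddit? A minimal sketch on made-up posts is shown below. It uses simple whitespace tokenization instead of the fitted count vectorizer, and the documents, labels and `div_freq_toy` helper are all illustrative assumptions.

```python
def div_freq_toy(docs, labels, term, pos_label='Divorce'):
    """Share of documents containing `term` whose label equals `pos_label`."""
    hits = [lab for doc, lab in zip(docs, labels) if term in doc.lower().split()]
    if not hits:
        return float('nan')  # the term never appears
    return sum(lab == pos_label for lab in hits) / len(hits)


# Hypothetical posts and subreddit labels.
docs = [
    'my ex wants custody',
    'ring for my ex',
    'venue and ring ideas',
    'custody hearing tomorrow',
]
labels = ['Divorce', 'Divorce', 'Wedding', 'Divorce']

ring_share = div_freq_toy(docs, labels, 'ring')
custody_share = div_freq_toy(docs, labels, 'custody')
venue_share = div_freq_toy(docs, labels, 'venue')
print(ring_share, custody_share, venue_share)
```

A value near 1 means the word shows up almost only in divorce posts, a value near 0 means it shows up almost only in wedding planning posts, and values in between flag words that are shared by both communities.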

In [17]:
# Add the number of training posts whose selftext contains each word (the selftext matrix is used for every feature, including title features)
top_coefs_div['count'] = [len(df_selftext_vect[df_selftext_vect[word.replace(' (selftext)','').replace(' (title)','')]>0]) for word in top_coefs_div.index]

top_coefs_wed['count'] = [len(df_selftext_vect[df_selftext_vect[word.replace(' (selftext)','').replace(' (title)','')]>0]) for word in top_coefs_wed.index]

top_feat_imp['count'] = [len(df_selftext_vect[df_selftext_vect[word.replace(' (selftext)','').replace(' (title)','')]>0]) for word in top_feat_imp.index]

In [18]:
top_coefs_div

Unnamed: 0,LR coefs,LR odds,RF FI,div_freq,count
divorce (title),-3.276906,0.037745,0.030106,1.0,1349
divorce (selftext),-2.629223,0.072134,0.040236,0.995552,1349
divorced (title),-1.768574,0.170576,0.003233,0.9875,304
ex (title),-1.754912,0.172923,0.006712,0.988304,532
husband (title),-1.494438,0.224375,0.003224,0.988235,638
years (title),-1.253099,0.285618,0.000979,0.861538,1255
custody (title),-1.173133,0.309396,0.002048,1.0,233
wife (title),-1.129264,0.323271,0.002843,1.0,602
ex (selftext),-1.121651,0.325741,0.009057,0.960526,532
spouse (title),-1.095312,0.334435,0.002083,0.968254,160


This tells us that each occurrence of "divorce" in the title multiplies the odds that a post belongs to the wedding planning subreddit (rather than the divorce subreddit) by about 0.04, holding all other factors constant. For "divorce" in the post selftext, the multiplier is about 0.07.

Interestingly, "wedding ring" in the post title multiplies the odds of the wedding planning subreddit by about 0.34, holding all other factors constant, even though posts with the phrase in the title come from the divorce subreddit only 40% of the time. Part of this anomaly could be explained by how rarely the phrase appears in the training data (13 times).
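The "LR odds" column is the exponential of the logistic regression coefficient, so it acts as a multiplier on the odds of the wedding planning class rather than as a probability. The sketch below reproduces the multiplier for `divorce (title)` from the table above and shows what it does to a post that would otherwise sit at even odds; the even-odds starting point is an assumption chosen for illustration.

```python
import math


def odds_multiplier(coef):
    """Factor by which the odds change per one-unit increase in the feature."""
    return math.exp(coef)


def prob_from_odds(odds):
    """Convert odds (p / (1 - p)) back to a probability."""
    return odds / (1.0 + odds)


coef_divorce_title = -3.276906           # 'divorce (title)' coefficient from the table
mult = odds_multiplier(coef_divorce_title)

starting_odds = 1.0                      # assumption: 50/50 before seeing the word
p_wedding = prob_from_odds(starting_odds * mult)
print(f'odds multiplier: {mult:.4f}, probability of wedding planning: {p_wedding:.3f}')
```

Starting from different odds gives a different probability, which is why the multiplier should be read as a change in odds rather than as "4% as likely".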

In [19]:
top_coefs_wed

Unnamed: 0,LR coefs,LR odds,RF FI,div_freq,count
ceremony (title),1.110191,3.034939,0.001404,0.029851,323
covid (title),1.156924,3.180136,0.000733,0.117647,152
reception (title),1.165263,3.206766,0.001144,0.0,230
ideas (title),1.168042,3.215689,0.001617,0.0,217
honeymoon (title),1.173079,3.231929,0.000697,0.0,56
invite (title),1.190412,3.288437,0.000978,0.0,176
moh (title),1.202803,3.329437,0.000772,0.0,105
planning (title),1.218939,3.383596,0.001755,0.011628,533
invitations (title),1.220057,3.38738,0.000658,0.0,75
dress (selftext),1.238236,3.449524,0.009169,0.010178,393


This tells us that "wedding" in the title multiplies the odds that a post belongs to the wedding planning subreddit by about 22, holding all other factors constant. The multiplier drops to about 6 for "wedding" in the post selftext. Even so, posts with "wedding" in the title still come from the divorce subreddit about 1.5% of the time.

In [20]:
# len(y_train)/2 approximates the number of divorce posts, since the classes are roughly balanced
top_coefs_wed.loc['wedding (title)','div_freq']*top_coefs_wed.loc['wedding (title)','count']/(len(y_train)/2)

0.007499239433770193

If the lawyer were to simply filter out posts that have "wedding" in the title, they would potentially miss around 0.75% of divorce posts.
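The calculation behind this estimate multiplies the share of matching posts that are divorce posts by the number of matching posts, then divides by the total number of divorce posts. A sketch with the table values for `wedding (title)` is shown below; the total of 3,000 divorce posts is a hypothetical round number, not the size of the actual training split.

```python
def share_of_divorce_posts_lost(div_freq, n_posts_with_word, n_divorce_posts):
    """Share of all divorce posts dropped by filtering on a single word."""
    divorce_posts_with_word = div_freq * n_posts_with_word
    return divorce_posts_with_word / n_divorce_posts


lost = share_of_divorce_posts_lost(
    div_freq=0.014541,        # 'wedding (title)' div_freq from the table
    n_posts_with_word=1534,   # 'wedding (title)' count from the table
    n_divorce_posts=3000,     # assumption for illustration only
)
print(f'{lost:.2%} of divorce posts would be filtered out')
```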

In [21]:
top_feat_imp

Unnamed: 0,LR coefs,LR odds,RF FI,div_freq,count
wedding (selftext),1.818931,6.165265,0.040938,0.040417,1534
divorce (selftext),-2.629223,0.072134,0.040236,0.995552,1349
wedding (title),3.10763,22.367961,0.034973,0.014541,1534
divorce (title),-3.276906,0.037745,0.030106,1.0,1349
kids (selftext),-0.529632,0.588822,0.01104,0.895487,842
years (selftext),-0.831579,0.435361,0.009949,0.831076,1255
marriage (selftext),-0.800164,0.449255,0.0095,0.922861,713
dress (selftext),1.238236,3.449524,0.009169,0.010178,393
ex (selftext),-1.121651,0.325741,0.009057,0.960526,532
wife (selftext),-0.949566,0.386909,0.008419,0.925249,602


The feature importances give a general sense of how relevant each word is to the random forest. It makes sense that wedding and divorce rank highest. However, some words with high feature importance do not appear in the earlier lists of top coefficients, which is interesting.

In [22]:
# Let's look at words that are not top coefficients for wedding and divorce, but are top feature importances
top_feat_imp.loc[[word for word in top_feat_imp.index if (word not in top_coefs_wed.index) & (word not in top_coefs_div.index)],:]

Unnamed: 0,LR coefs,LR odds,RF FI,div_freq,count
kids (selftext),-0.529632,0.588822,0.01104,0.895487,842
years (selftext),-0.831579,0.435361,0.009949,0.831076,1255
marriage (selftext),-0.800164,0.449255,0.0095,0.922861,713
fiance (selftext),0.973947,2.648378,0.008064,0.028986,414
life (selftext),-0.503182,0.604604,0.007048,0.839858,843
guests (selftext),0.741002,2.098037,0.006884,0.008671,346
house (selftext),-0.822473,0.439344,0.006739,0.880759,738
husband (selftext),-0.754444,0.470272,0.006713,0.851097,638
venue (selftext),0.750514,2.118088,0.006324,0.0,341
planning (selftext),0.861247,2.366109,0.00628,0.155722,533


#### Misclassifications

In [26]:
missed_divorce = misclassifications[(misclassifications['true class'] != misclassifications['vr1 pred class']) & (misclassifications['true class'] == 'Divorce')][['selftext','title']]

In [27]:
missed_divorce.loc[85,'title']

'Can you give me a checklist of red flags to look out for ? Both during and before you get with a person..'

In [25]:
for index,title,post in zip(missed_divorce['title'].keys(),missed_divorce['title'],missed_divorce['selftext']):
    print(f'Index {index}: {title}')
    print(f'Post: {post}\n------')

Index 85: Can you give me a checklist of red flags to look out for ? Both during and before you get with a person..
Post: Going through something super unexpected, and I feel like an idiot.. :(
------
Index 990: NINE days and counting
Post: SOCK DAY 21/03/2022, 3PM.
It's nearly time!
------
Index 1040: Dads how often do you talk to your kids? its rhetorical the answer is not enough.
Post: [removed]
------
Index 1077: Discovery process Photos
Post: Those who have went threw the Discovery process or Trial. How did you decide what photos to send or use?? I have well over 300 but really do not want to print them all off since it’s all out of my pocket for ink and paper.
------
Index 1113: I live in Ohio and we're supposed to use family wizard to communicate ,
Post: how long do I have time wise to reply back?
------
Index 1182: Name Change
Post: I hate my maiden name because my dad abandoned me when I was a kid.

I actually like my married name but for obvious reasons, no matter how much I 

### Conclusions and Recommendations

The production model is ready to be deployed as a model to separate out wedding planning posts from divorce posts. 

The model classifies Reddit posts with 97.4% accuracy on the test split. This helps the lawyer avoid wasting precious advertising dollars on wedding planners and potentially annoying them, since they might eventually become clients in a few years.

Additionally, the model correctly classifies divorce posts 99% of the time, so the lawyer is advertising to 99% of the potential clients, only missing 1%. 

The next step for the lawyer would be to share this technical report with her advertising coordinators/partners so they can implement this modeling pipeline to filter out wedding planning posts. The advertising partners may wish to collect more data to further improve the model, and to take a closer look at words whose coefficients are unusually low or high even though they appear frequently in both divorce and wedding planning posts.