## Part 5 Final Model and Evaluation

With the final model chosen, I evaluate it and perform an error analysis on its predictions.

In [1]:
# Importing required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
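from sklearn.metrics import (
    confusion_matrix,
    ConfusionMatrixDisplay,  # replaces plot_confusion_matrix, removed in scikit-learn 1.2
    accuracy_score,
    roc_auc_score,
    recall_score,
    precision_score,
    f1_score,
    RocCurveDisplay  # replaces plot_roc_curve, removed in scikit-learn 1.2
)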


from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

**Summary Table**

|Model|Vectorizer|Train Score|Test Score|ROC AUC Score|TN|FP|FN|TP|
|---|---|---|---|---|---|---|---|---|
|Benchmark Model|CVEC|89.3%|83.4%| 90.3% |1446|322|280|1583|
|MultinomialNB |CVEC|86.4%|83.6%|90.2%|1460|308|288|1575|
|MultinomialNB |TVEC|86.8%|83.4%|91.3%|1437|331|271|1592|
|<font color='red'>LogisticRegression</font>|<font color='red'>CVEC</font>|94.7%|84.7%|91.8%|1466|302|255|1608|
|LogisticRegression|TVEC|91.0%|85.1%|92.7%|1477|291|252|1611|
|RandomForest|CVEC|100%|83.4%|91.9%|1460|308|294|1569|
|RandomForest|TVEC|99.99%|82.9%|91.4%|1453|315|305|1558|
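
The test score in each row can be cross-checked against the confusion-matrix counts in the same row. The following is a minimal sketch in plain Python using the TN/FP/FN/TP values of the highlighted LogisticRegression + CVEC row. The helper name `metrics_from_counts` is illustrative and is not part of the original notebook.

```python
def metrics_from_counts(tn, fp, fn, tp):
    """Derive standard binary-classification metrics from confusion-matrix counts."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tn + tp) / total,
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
    }


# Counts taken from the LogisticRegression + CVEC row of the summary table
logreg_cvec = metrics_from_counts(tn=1466, fp=302, fn=255, tp=1608)

for name, value in logreg_cvec.items():
    print(f"{name}: {value:.3f}")
```

The accuracy computed this way rounds to the 84.7% test score listed in the table, while precision and recall show how the errors are split between the two subreddits.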


In [2]:
# Importing datasets
marriage = pd.read_csv('../datasets/marriage_cleaned.csv')
relationship = pd.read_csv('../datasets/relationship_cleaned.csv')
combined = pd.read_csv('../datasets/combined.csv')
custom_stop = pd.read_csv('../datasets/custom_stop_words.csv')

In [3]:
# Converting the previously created custom stop words into a list
custom_stop_words = custom_stop['words'].tolist()

In [4]:
# Defining the features (post text) and the target (1 = marriage, 0 = relationship)
X = combined['all_text']
y = combined['is_marriage']

In [5]:
# Creating a train test split for modelling
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2,
                                                    stratify=y,
                                                    random_state=42)

In [6]:
# Creating a pipeline with previously identified best params
pipe_final_model = Pipeline([
    ('cvec', CountVectorizer(
        max_df=0.95,
        max_features=6000,
        min_df=2,
        ngram_range=(1, 2),
        stop_words=custom_stop_words
    )),
    ('logreg', LogisticRegression(
        C=0.1,
        max_iter=1000,
        penalty='l1',
        solver='liblinear'
    ))],
    verbose=2
)



In [7]:
# Fitting the model with the training data.
pipe_final_model.fit(X_train, y_train)

[Pipeline] .............. (step 1 of 2) Processing cvec, total=   8.0s
[Pipeline] ............ (step 2 of 2) Processing logreg, total=   0.2s


Pipeline(steps=[('cvec',
                 CountVectorizer(max_df=0.95, max_features=6000, min_df=2,
                                 ngram_range=(1, 2),
                                 stop_words=['i', 'me', 'my', 'myself', 'we',
                                             'our', 'ours', 'ourselves', 'you',
                                             "you're", "you've", "you'll",
                                             "you'd", 'your', 'yours',
                                             'yourself', 'yourselves', 'he',
                                             'him', 'his', 'himself', 'she',
                                             "she's", 'her', 'hers', 'herself',
                                             'it', "it's", 'its', 'itself', ...])),
                ('logreg',
                 LogisticRegression(C=0.1, max_iter=1000, penalty='l1',
                                    solver='liblinear'))],
         verbose=2)

In [8]:
# Scoring the training data
pipe_final_model.score(X_train, y_train)

0.8808097500516422

In [9]:
# Scoring the test data
pipe_final_model.score(X_test, y_test)

0.8463233269071881

### Error Analysis

In [35]:
# Creating a dataframe of predictions on the combined dataset
predictions = pd.DataFrame(pipe_final_model.predict_proba(X))

# Changing the columns to relationship and marriage
predictions.columns = ['Relationship', 'Marriage']

# Adding the original posts and actual classification
predictions['all_text'] = combined['all_text']
predictions['Actual'] = combined['is_marriage']

# Adding the predicted classification
predictions['Predicted'] = pipe_final_model.predict(X)

In [11]:
# Rounding the probability to 2 decimal places
predictions['Relationship'] = predictions['Relationship'].round(2)
predictions['Marriage'] = predictions['Marriage'].round(2)

In [12]:
predictions

| |Relationship|Marriage|all_text|Actual|Predicted|
|---|---|---|---|---|---|
|0|0.59|0.41|I (m24) am in love with my best friend (f25) a...|0|0|
|1|0.96|0.04|Blocked and then unblocked ??? Man I’ve been s...|0|0|
|2|0.75|0.25|Catty In Laws TLDR: Husband and His Sisters ar...|0|0|
|3|0.96|0.04|My (24F) best friend ended a friendship with m...|0|0|
|4|0.94|0.06|Did I cheat in my LDR? I think so, and feel re...|0|0|
|...|...|...|...|...|...|
|18149|0.48|0.52|Work to Do in Marriage Part 1|1|1|
|18150|0.49|0.51|Do you think having a lover can save your marr...|1|1|
|18151|0.48|0.52|inter caste love marriage specialist|1|1|
|18152|0.12|0.88|Moving on after infidelity if he's not sorry.....|1|1|


In [13]:
# Reordering the columns
predictions = predictions.loc[:,[
    'all_text',
    'Actual',
    'Predicted',
    'Relationship',
    'Marriage'
]]

In [14]:
predictions

| |all_text|Actual|Predicted|Relationship|Marriage|
|---|---|---|---|---|---|
|0|I (m24) am in love with my best friend (f25) a...|0|0|0.59|0.41|
|1|Blocked and then unblocked ??? Man I’ve been s...|0|0|0.96|0.04|
|2|Catty In Laws TLDR: Husband and His Sisters ar...|0|0|0.75|0.25|
|3|My (24F) best friend ended a friendship with m...|0|0|0.96|0.04|
|4|Did I cheat in my LDR? I think so, and feel re...|0|0|0.94|0.06|
|...|...|...|...|...|...|
|18149|Work to Do in Marriage Part 1|1|1|0.48|0.52|
|18150|Do you think having a lover can save your marr...|1|1|0.49|0.51|
|18151|inter caste love marriage specialist|1|1|0.48|0.52|
|18152|Moving on after infidelity if he's not sorry.....|1|1|0.12|0.88|


In [29]:
# All the False Positives as well as False Negatives
wrongly_predicted = predictions[predictions['Actual'] != predictions['Predicted']]

In [30]:
wrongly_predicted 

| |all_text|Actual|Predicted|Relationship|Marriage|
|---|---|---|---|---|---|
|19|How do I trust my stepmother after she threate...|0|1|0.45|0.55|
|29|seeking awkward nsfw advice TW: sex mention \n...|0|1|0.36|0.64|
|31|What to do after hooking up If he hasn’t asked...|0|1|0.47|0.53|
|39|Is sexual assault in a workplace??? So I work ...|0|1|0.45|0.55|
|46|This is pretty long but i need your opinion. W...|0|1|0.00|1.00|
|...|...|...|...|...|...|
|18102|Smoking us away I've been with my fiance for t...|1|0|0.50|0.50|
|18109|Wife wants to casually date Long story short m...|1|0|0.74|0.26|
|18113|Get Your Life Partner|1|0|0.57|0.43|
|18132|Phone Broke So recently my cell phone broke. I...|1|0|0.51|0.49|
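
One way to prioritise the error analysis is to separate borderline errors from confident ones. The following is a minimal sketch in plain Python, assuming the rounded `Marriage` probabilities shown in the `wrongly_predicted` output and a hypothetical cut-off of 0.10 around the 0.5 decision boundary. The cut-off is an illustration only; the notebook does not define one.

```python
# A few misclassified rows copied from the wrongly_predicted output:
# (index, actual label, predicted label, P(Marriage))
errors = [
    (19, 0, 1, 0.55),
    (29, 0, 1, 0.64),
    (46, 0, 1, 1.00),
    (18102, 1, 0, 0.50),
    (18109, 1, 0, 0.26),
    (18132, 1, 0, 0.49),
]


def margin(p_marriage):
    """Distance of the predicted probability from the 0.5 decision boundary."""
    return abs(p_marriage - 0.5)


# Hypothetical cut-off: errors within 0.10 of the boundary count as borderline
BORDERLINE = 0.10

borderline = [row for row in errors if margin(row[3]) <= BORDERLINE]
confident = [row for row in errors if margin(row[3]) > BORDERLINE]

# Most confident mistakes first, since these are usually the most informative to read
confident.sort(key=lambda row: margin(row[3]), reverse=True)

print("borderline:", [row[0] for row in borderline])
print("confident:", [row[0] for row in confident])
```

Borderline errors such as post 18102 (0.50 / 0.50 after rounding) are close calls, whereas confident errors such as post 46 (1.00 for Marriage) are more likely to point at mislabelled or off-topic posts.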


In [32]:
# False positives: relationship posts (Actual = 0) predicted as marriage (Predicted = 1)
marriage_in_relationship = wrongly_predicted[
    (wrongly_predicted['Actual'] == 0) & (wrongly_predicted['Predicted'] == 1)]


In [36]:
marriage_in_relationship.head(10)

| |all_text|Actual|Predicted|Relationship|Marriage|
|---|---|---|---|---|---|
|19|How do I trust my stepmother after she threate...|0|1|0.45|0.55|
|29|seeking awkward nsfw advice TW: sex mention \n...|0|1|0.36|0.64|
|31|What to do after hooking up If he hasn’t asked...|0|1|0.47|0.53|
|39|Is sexual assault in a workplace??? So I work ...|0|1|0.45|0.55|
|46|This is pretty long but i need your opinion. W...|0|1|0.0|1.0|
|52|Got drunk and cheated Throwaway\n\nI (42M) wen...|0|1|0.33|0.67|
|54|Feeling inadequate, almost picked on For a whi...|0|1|0.33|0.67|
|58|My SIL's bird poop all over the living room an...|0|1|0.43|0.57|
|64|My Wife [27W] is telling me [27M] that I'm not...|0|1|0.16|0.84|
|68|Having doubts in my relationship after 7 years...|0|1|0.43|0.57|


One problem I noticed is that the relationship subreddit contains posts made by married couples. This may confuse the model, so the impact needs to be investigated further.

In [19]:
# Creating a mask to identify posts that mention marriage-related keywords
married = marriage_in_relationship['all_text'].str.contains('husband|wife|spouse|marr')

In [20]:
# Checking the misclassified relationship posts that mention marriage-related keywords
marriage_in_relationship.loc[married]

| |all_text|Actual|Predicted|Relationship|Marriage|
|---|---|---|---|---|---|
|52|Got drunk and cheated Throwaway\n\nI (42M) wen...|0|1|0.33|0.67|
|58|My SIL's bird poop all over the living room an...|0|1|0.43|0.57|
|64|My Wife [27W] is telling me [27M] that I'm not...|0|1|0.16|0.84|
|157|I (37F) found my bf (33M) texting with ex, is ...|0|1|0.27|0.73|
|171|Constant state of anxiety about my wife leavin...|0|1|0.32|0.68|
|...|...|...|...|...|...|
|8728|It's a long, complicated one, but I really nee...|0|1|0.43|0.57|
|8761|Regret losing my virginity 4 years ago So when...|0|1|0.35|0.65|
|8785|Have you been the cheating spouse who was forg...|0|1|0.41|0.59|
|8791|Typing this out I think I know but I'd love to...|0|1|0.00|1.00|


Only about 400 of the relationship posts that were misclassified as marriage mention marriage-related keywords. This amounts to approximately 4.5% of all posts from the relationship subreddit, which is an acceptable margin of error. Hence, the impact is minimal.
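
The 4.5% figure can be sanity-checked with a short calculation. This is a rough sketch that assumes the total number of relationship posts can be inferred from the summary table: the test set holds TN + FP relationship posts, and the stratified 20% split should mirror the full dataset. The exact total is not stated in the notebook, so the result is an approximation.

```python
# Relationship posts in the test set = true negatives + false positives
# (counts from the LogisticRegression + CVEC row of the summary table)
relationship_in_test = 1466 + 302

# Assumption: the stratified 20% test split mirrors the full dataset,
# so the full relationship subset is about five times the test subset.
test_size = 0.2
relationship_total = relationship_in_test / test_size

# "Approximately 400" flagged posts, as stated in the text
flagged_posts = 400

share = flagged_posts / relationship_total
print(f"Estimated relationship posts: {relationship_total:.0f}")
print(f"Share flagged: {share:.1%}")
```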

Furthermore, not all of these posts were made by married couples. Some may have been written by unmarried people asking for advice about marriage or about a future husband or wife.

An example can be seen below.

In [21]:
marriage_in_relationship['all_text'][8761]

'Regret losing my virginity 4 years ago So when I was 19 I was used and abused by a sociopath who peer pressured and Coerced me into sex. During those times I was always insecure and he used that to his advantage. \n\nMy point is I still regret it because in my culture/religion clearly premarital sex isn’t allowed, and yet I had sacrificed it for the sake of wanting this persons love and affection only in return to be stuck in a cycle of being groomed, used, disrespected and then discarded, repeat. \n\nAs a women I feel like I no longer hold value and worry that no man will love me or marry me, like I’m some damaged goods. \n\nWhat I also worry about now is what if I contracted hpv from this person? He slept with 9 other women before me, and I don’t know where they had been in which makes me feel dirty. And I can’t get checked till I’m 25. Although to my understanding you can clear hpv and hpv vaccine protects you from high risk strains but I’m still paranoid by the thought of cervical
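
The post above is picked up because the substring `marr` matches "marry". The sketch below uses Python's `re` module to contrast the case-sensitive substring pattern from the keyword mask with a hypothetical stricter pattern (whole words, any letter case). The stricter pattern and the "neutral" sentence are illustrations and are not part of the original analysis.

```python
import re

# Same keywords as the mask, written as one case-sensitive pattern
SUBSTRING = re.compile(r"husband|wife|spouse|marr")

# Hypothetical stricter variant: whole words only, any letter case
WHOLE_WORD = re.compile(r"\b(husband|wife|spouse|married)\b", re.IGNORECASE)

samples = {
    # Fragment of post 8761: the author is not married
    "post_8761": "worry that no man will love me or marry me",
    # Start of post 64 as displayed in the output: the keyword is capitalised
    "post_64": "My Wife [27W] is telling me [27M] that I'm not",
    # Made-up sentence with no marriage vocabulary
    "neutral": "My boyfriend forgot my birthday",
}

for name, text in samples.items():
    substring_hit = bool(SUBSTRING.search(text))
    whole_word_hit = bool(WHOLE_WORD.search(text))
    print(f"{name}: substring={substring_hit}, whole_word={whole_word_hit}")
```

The substring pattern flags the unmarried author's post but misses the capitalised "Wife" in the displayed fragment, so the count of roughly 400 is best read as an estimate rather than an exact number of married posters.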

In [37]:
# False negatives: marriage posts (Actual = 1) predicted as relationship (Predicted = 0)
relationship_in_marriage = wrongly_predicted[
    (wrongly_predicted['Actual'] == 1) & (wrongly_predicted['Predicted'] == 0)]


In [44]:
relationship_in_marriage.head(10)

| |all_text|Actual|Predicted|Relationship|Marriage|
|---|---|---|---|---|---|
|8841|Dichotomy regarding marriage in my mind I'm a ...|1|0|0.7|0.3|
|8843|Am I the asshole to not pick her up from the g...|1|0|0.86|0.14|
|8853|What should I do Im 27 and he is 31. I just go...|1|0|0.71|0.29|
|8863|My (25f) boyfriend (24m) gave me an ultimatum?|1|0|0.68|0.32|
|8864|kids visiting grandparents Question for you al...|1|0|0.51|0.49|
|8871|Is it incest to marry your step sister? The fo...|1|0|0.61|0.39|
|8872|Me (25F) and my husband 35F) won't have sexual...|1|0|0.78|0.22|
|8886|Did you also oppose marriage before getting ma...|1|0|0.59|0.41|
|8887|Husband becomes very aggressive when tired Eve...|1|0|0.76|0.24|
|8893|Am I being rude? So after giving our 2 toddler...|1|0|0.67|0.33|


Most of these posts seem to have been misclassified because of the strict stop words used.
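
Whether the stop-word list is responsible can be checked by inspecting which tokens survive it. The following is a rough sketch in plain Python that approximates what `CountVectorizer(ngram_range=(1, 2))` extracts from one document, assuming its default lowercasing and token pattern. The stop list here is hypothetical, since the contents of `custom_stop_words.csv` are not shown, and the function name `unigrams_and_bigrams` is illustrative.

```python
import re

# Default CountVectorizer token pattern: runs of two or more word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def unigrams_and_bigrams(text, stop_words):
    """Approximate CountVectorizer(ngram_range=(1, 2)) analysis for one document."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    # Stop words are dropped before n-grams are built, so bigrams can join
    # words that were not adjacent in the original sentence.
    kept = [tok for tok in tokens if tok not in stop_words]
    bigrams = [" ".join(pair) for pair in zip(kept, kept[1:])]
    return kept + bigrams


# Hypothetical stop list: the real custom stop words are not shown in this section
strict_stop_words = {"my", "is", "when", "very", "husband", "wife", "marriage"}

# Title of post 8887 from the false-negative output
title = "Husband becomes very aggressive when tired"

print(unigrams_and_bigrams(title, stop_words=set()))
print(unigrams_and_bigrams(title, stop_words=strict_stop_words))
```

If the real stop list removes words such as "husband", as this hypothetical one does, the remaining tokens carry little signal about marriage, which would be consistent with the misclassifications above.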

### Conclusion and Recommendation

Using a LogisticRegression classifier with a CountVectorizer, I achieved a test accuracy of 84.6%, which is quite reasonable. Being able to differentiate the posts allows me to identify the concerns that are relevant to married couples and to unmarried couples.

The next step is to create the FAQ for the issues that were identified.

Other areas for future exploration:
- I will look to incorporate a sentiment library specifically tailored to relationship issues and then perform sentiment analysis on the posts. This will make the more worrying issues more prominent.

- Some of the previously identified investigation points were lost during modelling and could not be investigated. I will look into those points specifically.