<img src="../assets/a_eyes_readme.gif" style="float:right ; margin: 10px ; width:300px;"> 

<h1><left>GOODBYE WORLD: using Natural Language Processing to identify suicidal posts</left></h1>
<h4><left></left></h4>

___

## **3. Modelling**

In this section, we will use a Pipeline to score several classifier models, such as K-Nearest Neighbours and Multinomial Naive Bayes, before settling on a production model.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

pd.set_option('display.max_columns', 100)
sns.set_style("white")

from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer, TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, roc_auc_score

# import warnings filter
from warnings import simplefilter
# ignore all future warnings
simplefilter(action='ignore', category=FutureWarning)
!pip install wordninja
# IF YOU ARE MISSING "WordCloud":
# TRY INSTALLING VIA TERMINAL LIKE THIS: /anaconda3/bin/python -m pip install wordcloud
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
from PIL import Image
# IF YOU ARE MISSING "wordninja":
# TRY INSTALLING VIA TERMINAL LIKE THIS: pip install wordninja
import wordninja


In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
model_data = pd.read_csv('/content/drive/My Drive/ML mini project (Group-9) /data_for_model.csv', keep_default_na=False)

In [None]:
model_data.head(3)

Unnamed: 0,title,selftext,author,num_comments,is_suicide,url,selftext_clean,title_clean,author_clean,selftext_length,title_length,megatext_clean
0,Our most-broken and least-understood rules is ...,We understand that most people who reply immed...,SQLwitch,175,0,https://www.reddit.com/r/depression/comments/d...,understand people reply immediately op invitat...,broken least understood rule helper may invite...,sql witch,4792,144,sql witch understand people reply immediately ...
1,"Regular Check-In Post. Plus, a reminder about ...",Welcome to /r/depression's check-in post - a p...,SQLwitch,1237,0,https://www.reddit.com/r/depression/comments/i...,welcome r depression check post place take mom...,regular check post plus reminder activism rule,sql witch,1225,67,sql witch welcome r depression check post plac...
2,"I’ve wasted so much time being depressed, that...",I can’t imagine how many days I have wasted to...,RosiePosie710,104,0,https://www.reddit.com/r/depression/comments/j...,imagine many day wasted sad cry able accomplis...,wasted much time depressed depressed wasting time,rosie po ie 710,323,80,rosie po ie 710 imagine many day wasted sad cr...


In [None]:
model_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1982 entries, 0 to 1981
Data columns (total 12 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   title            1982 non-null   object
 1   selftext         1982 non-null   object
 2   author           1982 non-null   object
 3   num_comments     1982 non-null   int64 
 4   is_suicide       1982 non-null   int64 
 5   url              1982 non-null   object
 6   selftext_clean   1982 non-null   object
 7   title_clean      1982 non-null   object
 8   author_clean     1982 non-null   object
 9   selftext_length  1982 non-null   int64 
 10  title_length     1982 non-null   int64 
 11  megatext_clean   1982 non-null   object
dtypes: int64(4), object(8)
memory usage: 185.9+ KB


### 3.1 Establishing a baseline score 
We will first calculate the baseline score for our models to out-perform. In the context of our project, the baseline score is the accuracy we would achieve by predicting that every Reddit post comes from the r/SuicideWatch subreddit.


In [None]:
#BASELINE ACCURACY IS BASICALLY OUR SCORE IF WE GUESS EVERYTHING == 1,
#I.E. THE SHARE OF POSTS WITH is_suicide == 1 (SEE THE VALUE PRINTED BELOW)
model_data['is_suicide'].mean()

0.49848637739656915
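
As a rough cross-check of the baseline, the sketch below rebuilds the class counts from the two outputs shown above (1982 rows and a mean of about 0.4985). The count of 988 positive posts is inferred from those outputs rather than read from the data, so treat it as an assumption. It also shows the majority-class baseline, which is what a model that always guesses the more common class would score.

```python
# Counts inferred from the outputs above: 1982 rows, mean(is_suicide) ~= 0.4985
n_rows = 1982
n_positive = 988  # assumption: round(0.49848637739656915 * 1982)
n_negative = n_rows - n_positive

# Accuracy if every post is predicted as r/SuicideWatch (is_suicide == 1)
always_positive_accuracy = n_positive / n_rows

# Accuracy if every post is predicted as the more common class
majority_class_accuracy = max(n_positive, n_negative) / n_rows

print(f"always predict 1: {always_positive_accuracy:.4f}")
print(f"majority class:   {majority_class_accuracy:.4f}")
```

With classes this close to 50/50, either baseline sits near 0.50, so any useful model should score clearly above that.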

### 3.2 Selecting the best column to pick our features from
Before moving on to create a production model, we will run a Count Vectorizer + Naive Bayes model on different columns and score each one. This will help us pick the column on which to build further models.

In [None]:
#DEFINING A FUNCTION TO SCORE MULTIPLE MODELS USING DIFFERENT COLS OF OUR DATASET

#PLEASE DEFINE THE FOLLOWING BEFORE RUNNING THE FUNCTION
# columns_list = ['column_1', "column_2", "column_3"]
# model = "CountVec + MultinomialNB"
# df_list=[] #THIS LIST SHOULD BE AN EMPTY LIST

#DEFINING THE FUNCTION
def multi_modelling(columns_list, model):
    for i in columns_list:
        #DEFINING X and y
        X = model_data[i]
        y = model_data['is_suicide']
        
        #TRAIN-TEST SPLIT
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)
        
        #INSTANTIATING CVEC AND FITTING OUR TRAINING DATA INTO IT
        cvec = CountVectorizer()
        cvec.fit(X_train)
        
        #CREATING DATAFRAMES FROM X_train AND X_test
        #(get_feature_names_out() REPLACES get_feature_names() FROM SCIKIT-LEARN 1.0 ONWARDS)
        X_train = pd.DataFrame(cvec.transform(X_train).todense(),
                               columns=cvec.get_feature_names_out())
        X_test = pd.DataFrame(cvec.transform(X_test).todense(),
                               columns=cvec.get_feature_names_out())
        
        #INSTANTIATING AND FITTING MODEL
        nb = MultinomialNB()
        nb.fit(X_train,y_train)
        
        #GETTING PREDICTIONS FROM MODEL
        pred = nb.predict(X_test)
        
        #GETTING VALUES FROM A CONFUSION MATRIX
        tn, fp, fn, tp = confusion_matrix(y_test, pred).ravel()
        
        
        #CALCULATING AREA UNDER THE CURVE
        #(COLUMN 1 OF predict_proba HOLDS THE PROBABILITY OF is_suicide == 1)
        pred_proba = nb.predict_proba(X_test)[:, 1]
        auc = roc_auc_score(y_test, pred_proba)

        #CREATING A DICTIONARY FROM THE CLASSIFICATION REPORT(WE'LL DRAW SOME METRICS FROM HERE)
        classi_dict = (classification_report(y_test,pred, output_dict=True))

        #CREATING A DICTIONARY CONTAINING OUR RESULTS
        model_results = {}
        model_results['series used (X)'] = i
        model_results['model'] = model
        model_results['AUC Score'] = auc
        model_results['precision']= classi_dict['weighted avg']['precision']
        model_results['recall (sensitivity)']= classi_dict['weighted avg']['recall']
        model_results['confusion matrix']={"TP": tp,"FP":fp, "TN": tn, "FN": fn}
        model_results['train accuracy'] = nb.score(X_train, y_train)
        model_results['test accuracy'] = nb.score(X_test, y_test)
        model_results['baseline accuracy']=0.5166
        model_results['specificity']= tn/(tn+fp)  
        model_results['f1-score']= classi_dict['weighted avg']['f1-score']
        #model_results['support']= classi_dict['weighted avg']['support']
        df_list.append(model_results) 

    pd.set_option("display.max_colwidth", 50)
    return (pd.DataFrame(df_list)).round(2)

#### Note: Understanding our confusion matrix
In the context of our project, these are what the parameters in our confusion matrix represent: 

> **True Positives (TP)** -  We predict that an entry is from the r/SuicideWatch subreddit and we get it right. As we are seeking to identify suicide cases, our priority is to get as many of these as possible!

> **True Negatives (TN)** - We predict that an entry is from the r/depression subreddit and we get it right. This also means that we did well. 

> **False Positives (FP)** - We predict that an entry is from the r/SuicideWatch subreddit and we get it wrong. Needless to say, this is undesirable. 

> **False Negatives (FN)** - We predict that an entry is from the r/depression subreddit BUT the entry is actually from r/SuicideWatch. This is the worst outcome. That means we might be missing out on helping someone who might be thinking about ending their life.  
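
To make the four outcomes concrete, here is a minimal sketch that counts them by hand. The labels and predictions are made up for illustration (1 = r/SuicideWatch, 0 = r/depression), and `confusion_counts` is a hypothetical helper, not part of our notebook. It is meant to mirror what `confusion_matrix(...).ravel()` gives us for a binary problem.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN and FN for binary labels (1 = positive class)."""
    tp = fp = tn = fn = 0
    for truth, guess in zip(y_true, y_pred):
        if guess == 1 and truth == 1:
            tp += 1  # predicted r/SuicideWatch, and it was
        elif guess == 1 and truth == 0:
            fp += 1  # predicted r/SuicideWatch, but it was r/depression
        elif guess == 0 and truth == 0:
            tn += 1  # predicted r/depression, and it was
        else:
            fn += 1  # predicted r/depression, but it was r/SuicideWatch
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn}


# Made-up example: 4 posts from each subreddit
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

counts = confusion_counts(y_true, y_pred)
print(counts)
```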

In [None]:
#CALLING THE FUNCTION ON OUR LIST OF COLUMNS
columns_list = ['selftext', "author", "title",'selftext_clean', "author_clean", "title_clean", "megatext_clean"]
model = "CountVec + MultinomialNB"
df_list=[]
multi_modelling(columns_list, model)

Unnamed: 0,series used (X),model,AUC Score,precision,recall (sensitivity),confusion matrix,train accuracy,test accuracy,baseline accuracy,specificity,f1-score
0,selftext,CountVec + MultinomialNB,0.71,0.68,0.68,"{'TP': 183, 'FP': 96, 'TN': 153, 'FN': 64}",0.88,0.68,0.52,0.61,0.68
1,author,CountVec + MultinomialNB,0.6,0.7,0.58,"{'TP': 47, 'FP': 8, 'TN': 241, 'FN': 200}",0.98,0.58,0.52,0.97,0.51
2,title,CountVec + MultinomialNB,0.69,0.66,0.66,"{'TP': 172, 'FP': 94, 'TN': 155, 'FN': 75}",0.83,0.66,0.52,0.62,0.66
3,selftext_clean,CountVec + MultinomialNB,0.71,0.68,0.68,"{'TP': 181, 'FP': 95, 'TN': 154, 'FN': 66}",0.89,0.68,0.52,0.62,0.67
4,author_clean,CountVec + MultinomialNB,0.59,0.57,0.57,"{'TP': 120, 'FP': 86, 'TN': 163, 'FN': 127}",0.94,0.57,0.52,0.65,0.57
5,title_clean,CountVec + MultinomialNB,0.69,0.65,0.65,"{'TP': 165, 'FP': 92, 'TN': 157, 'FN': 82}",0.82,0.65,0.52,0.63,0.65
6,megatext_clean,CountVec + MultinomialNB,0.76,0.71,0.71,"{'TP': 187, 'FP': 86, 'TN': 163, 'FN': 60}",0.93,0.71,0.52,0.65,0.7


#### Final choice made: *megatext_clean* as our "Production Column"
Based on a combination of scores from our modelling exercise above, we will proceed with *megatext_clean* -- a combination of our cleaned titles, usernames and posts --  as the column we will use to draw features from. Some reasons why: 

> **Generalising Well** -  The model using *megatext_clean* scored 0.71 on its test set (the highest of all the columns), while its training set scored 0.93.

> **High ROC Area Under Curve score** - As our classes are largely balanced, it is suitable to use AUC Scores as a metric to measure the quality of our model's predictions. Our top choice performs best there. 

> **Best recall/sensitivity score** - This score measures the ratio of posts our program correctly labels as positive (is in r/SuicideWatch) to all posts that are truly in r/SuicideWatch. As that is the target of our project, the model's strong performance on this metric is important (and perhaps most important) to us.
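
As a worked check, the sketch below recomputes the key ratios for the *megatext_clean* row from its confusion matrix counts in the table above. One thing worth noting: our scoring function reads recall from the `weighted avg` entry of the classification report, which averages recall over both classes by support and therefore works out to the same value as accuracy. The recall of the r/SuicideWatch class alone is TP / (TP + FN). The variable names here are our own.

```python
# Confusion matrix counts for the megatext_clean row in the table above
counts = {"TP": 187, "FP": 86, "TN": 163, "FN": 60}

n_positive = counts["TP"] + counts["FN"]  # posts truly in r/SuicideWatch
n_negative = counts["TN"] + counts["FP"]  # posts truly in r/depression
n_total = n_positive + n_negative

# Recall of the positive class: share of r/SuicideWatch posts we caught
positive_recall = counts["TP"] / n_positive

# Specificity: recall of the negative class
specificity = counts["TN"] / n_negative

# Support-weighted average recall, as reported under 'weighted avg'
weighted_recall = (n_positive * positive_recall + n_negative * specificity) / n_total

accuracy = (counts["TP"] + counts["TN"]) / n_total

print(f"positive-class recall: {positive_recall:.2f}")
print(f"specificity:           {specificity:.2f}")
print(f"weighted recall:       {weighted_recall:.2f}")
print(f"accuracy:              {accuracy:.2f}")
```

The specificity and weighted recall agree with the 0.65 and 0.71 shown in the table, while the positive-class recall comes out slightly higher.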




### 3.3 The search for a production model
Inspired by our earlier function, we will create a similar function that runs multiple permutations of models with Count, Hashing and TF-IDF Vectorizers, using a grid search to find the best parameters for each. The resulting metrics will be held neatly in a dataframe. 

In [None]:
# DEFINING A FUNCTION THAT WILL RUN MULTIPLE MODELS AND GRIDSEARCH FOR BEST PARAMETERS

def gridsearch_multi(steps_titles, steps_list, pipe_params):
    
    #DEFINING X and y
    X = model_data["megatext_clean"]
    y = model_data['is_suicide']
    #TRAIN-TEST SPLIT
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)
    # DATAFRAME TO DISPLAY RESULTS
    gs_results = pd.DataFrame(columns=['model','AUC Score', 'precision', 'recall (sensitivity)', 
                                       'best_params', 'best score', 'confusion matrix', 
                                       'train_accuracy','test_accuracy','baseline_accuracy',
                                       'specificity', 'f1-score'])

    # FOR LOOP THROUGH STEPS LIST
    for i in range(len(steps_list)):
        # INSTANTIATE PIPELINE
        pipe = Pipeline(steps=steps_list[i])
        # INSTANTIATE GRIDSEARCHCV WITH PARAMETER ARGUMENT
        gs = GridSearchCV(pipe, pipe_params[i], cv=3) 
        gs.fit(X_train, y_train)
        
        #GETTING PREDICTIONS FROM MODEL
        pred = gs.predict(X_test)
        
        # DEFINE CONFUSION MATRIX ELEMENTS (REUSING THE PREDICTIONS FROM ABOVE)
        tn, fp, fn, tp = confusion_matrix(y_test, pred).ravel() 
        
        #CREATING A DICTIONARY FROM THE CLASSIFICATION REPORT(WE'LL DRAW SOME METRICS FROM HERE)
        classi_dict = (classification_report(y_test,pred, output_dict=True))
        
        #CALCULATING AREA UNDER THE CURVE
        #(COLUMN 1 OF predict_proba HOLDS THE PROBABILITY OF is_suicide == 1)
        pred_proba = gs.predict_proba(X_test)[:, 1]
        auc = roc_auc_score(y_test, pred_proba)
        
        #DEFINE DATAFRAME COLUMNS
        model_results = {}
        model_results['model'] = steps_titles[i]
        model_results['AUC Score'] = auc
        model_results['precision']= classi_dict['weighted avg']['precision']
        model_results['recall (sensitivity)']= classi_dict['weighted avg']['recall']
        model_results['best params'] = gs.best_params_
        model_results['best score'] = gs.best_score_
        model_results['confusion matrix']={"TP": tp,"FP":fp, "TN": tn, "FN": fn}
        model_results['train accuracy'] = gs.score(X_train, y_train)
        model_results['test accuracy'] = gs.score(X_test, y_test)
        model_results['baseline accuracy'] = 0.5166
        
        model_results['specificity']= tn/(tn+fp)  
        model_results['f1-score']= classi_dict['weighted avg']['f1-score']


        #APPEND RESULTS TO A NICE DATAFRAME
        df_list.append(model_results) 
        pd.set_option("display.max_colwidth", 200)
    return (pd.DataFrame(df_list)).round(2)

In [None]:
#USING THE FUNCTION WITH COUNT VECTORIZER

# EMPTY LIST THAT WILL HOLD RESULTS
df_list=[]

# LIST OF MODELS
steps_titles = ['cvec+ multi_nb','cvec + ss + knn','cvec + ss + logreg']

# CODE FOR PIPELINE TO INSTANTIATE MODELS
steps_list = [ 
    [('cv', CountVectorizer()),('multi_nb', MultinomialNB())],
    [('cv', CountVectorizer()),('scaler', StandardScaler(with_mean=False)),('knn', KNeighborsClassifier())], 
    [('cv', CountVectorizer()),('scaler', StandardScaler(with_mean=False)),('logreg', LogisticRegression())]
]

# LIST OF PARAMETER DICTIONARIES
pipe_params = [
    {'cv__stop_words':['english'], 'cv__ngram_range':[(1,1),(1,2)],'cv__max_features': [20, 30, 50],'cv__min_df': [2, 3],'cv__max_df': [.2, .25, .3]},
    {'cv__stop_words':['english'], 'cv__ngram_range':[(1,1),(1,2)],'cv__max_features': [20, 30, 50],'cv__min_df': [2, 3],'cv__max_df': [.2, .25, .3]},
    {'cv__stop_words':['english'], 'cv__ngram_range':[(1,1),(1,2)],'cv__max_features': [20, 30, 50],'cv__min_df': [2, 3],'cv__max_df': [.2, .25, .3]}
]   

#RUNNING THE FUNCTION
gridsearch_multi(steps_titles, steps_list, pipe_params)

Unnamed: 0,model,AUC Score,precision,recall (sensitivity),best params,best score,confusion matrix,train accuracy,test accuracy,baseline accuracy,specificity,f1-score
0,cvec+ multi_nb,0.73,0.67,0.67,"{'cv__max_df': 0.2, 'cv__max_features': 50, 'cv__min_df': 2, 'cv__ngram_range': (1, 1), 'cv__stop_words': 'english'}",0.69,"{'TP': 165, 'FP': 83, 'TN': 166, 'FN': 82}",0.71,0.67,0.52,0.67,0.67
1,cvec + ss + knn,0.66,0.62,0.62,"{'cv__max_df': 0.25, 'cv__max_features': 30, 'cv__min_df': 2, 'cv__ngram_range': (1, 1), 'cv__stop_words': 'english'}",0.61,"{'TP': 147, 'FP': 87, 'TN': 162, 'FN': 100}",0.73,0.62,0.52,0.65,0.62
2,cvec + ss + logreg,0.73,0.67,0.67,"{'cv__max_df': 0.3, 'cv__max_features': 50, 'cv__min_df': 2, 'cv__ngram_range': (1, 1), 'cv__stop_words': 'english'}",0.68,"{'TP': 156, 'FP': 73, 'TN': 176, 'FN': 91}",0.72,0.67,0.52,0.71,0.67
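
To get a feel for how much work each grid search above does, this sketch expands one of our parameter dictionaries into every combination, using only the standard library. Each key follows the Pipeline convention `<step name>__<parameter>`, so `cv__max_df` sets `max_df` on the step we named `'cv'`. The names `combinations` and `n_fits` are ours.

```python
from itertools import product

# One of the parameter dictionaries passed to GridSearchCV above
param_grid = {
    "cv__stop_words": ["english"],
    "cv__ngram_range": [(1, 1), (1, 2)],
    "cv__max_features": [20, 30, 50],
    "cv__min_df": [2, 3],
    "cv__max_df": [0.2, 0.25, 0.3],
}
cv_folds = 3  # matches cv=3 in gridsearch_multi

# Every candidate is one value picked from each list
names = list(param_grid)
combinations = [
    dict(zip(names, values)) for values in product(*param_grid.values())
]

# Each candidate is fitted once per cross-validation fold
n_fits = len(combinations) * cv_folds

print(f"{len(combinations)} candidates x {cv_folds} folds = {n_fits} fits")
print("first candidate:", combinations[0])
```

With the default `refit=True`, GridSearchCV then fits the best candidate once more on the whole training set, which is the model used for the test set scores.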


In [None]:
#USING THE FUNCTION WITH TF-IDF VECTORIZER

# LIST OF MODELS
steps_titles = ['tvec + multi_nb','tvec + ss + knn','tvec + ss + logreg']

# CODE FOR PIPELINE TO INSTANTIATE MODELS
steps_list = [ 
    [('tv', TfidfVectorizer()),('multi_nb', MultinomialNB())],
    [('tv', TfidfVectorizer()),('scaler', StandardScaler(with_mean=False)),('knn', KNeighborsClassifier())], 
    [('tv', TfidfVectorizer()),('scaler', StandardScaler(with_mean=False)),('logreg', LogisticRegression())]
]

# LIST OF PARAMETER DICTIONARIES
pipe_params = [
    {'tv__stop_words':['english'], 'tv__ngram_range':[(1,1),(1,2)],'tv__max_features': [20, 30, 50],'tv__min_df': [2, 3],'tv__max_df': [.2, .25, .3]},
    {'tv__stop_words':['english'], 'tv__ngram_range':[(1,1),(1,2)],'tv__max_features': [20, 30, 50],'tv__min_df': [2, 3],'tv__max_df': [.2, .25, .3]},
    {'tv__stop_words':['english'], 'tv__ngram_range':[(1,1),(1,2)],'tv__max_features': [20, 30, 50],'tv__min_df': [2, 3],'tv__max_df': [.2, .25, .3]}
]   

#RUNNING THE FUNCTION
gridsearch_multi(steps_titles, steps_list, pipe_params)

Unnamed: 0,model,AUC Score,precision,recall (sensitivity),best params,best score,confusion matrix,train accuracy,test accuracy,baseline accuracy,specificity,f1-score
0,cvec+ multi_nb,0.73,0.67,0.67,"{'cv__max_df': 0.2, 'cv__max_features': 50, 'cv__min_df': 2, 'cv__ngram_range': (1, 1), 'cv__stop_words': 'english'}",0.69,"{'TP': 165, 'FP': 83, 'TN': 166, 'FN': 82}",0.71,0.67,0.52,0.67,0.67
1,cvec + ss + knn,0.66,0.62,0.62,"{'cv__max_df': 0.25, 'cv__max_features': 30, 'cv__min_df': 2, 'cv__ngram_range': (1, 1), 'cv__stop_words': 'english'}",0.61,"{'TP': 147, 'FP': 87, 'TN': 162, 'FN': 100}",0.73,0.62,0.52,0.65,0.62
2,cvec + ss + logreg,0.73,0.67,0.67,"{'cv__max_df': 0.3, 'cv__max_features': 50, 'cv__min_df': 2, 'cv__ngram_range': (1, 1), 'cv__stop_words': 'english'}",0.68,"{'TP': 156, 'FP': 73, 'TN': 176, 'FN': 91}",0.72,0.67,0.52,0.71,0.67
3,tvec + multi_nb,0.74,0.65,0.65,"{'tv__max_df': 0.2, 'tv__max_features': 50, 'tv__min_df': 2, 'tv__ngram_range': (1, 1), 'tv__stop_words': 'english'}",0.69,"{'TP': 146, 'FP': 72, 'TN': 177, 'FN': 101}",0.71,0.65,0.52,0.71,0.65
4,tvec + ss + knn,0.63,0.6,0.6,"{'tv__max_df': 0.2, 'tv__max_features': 20, 'tv__min_df': 2, 'tv__ngram_range': (1, 1), 'tv__stop_words': 'english'}",0.61,"{'TP': 137, 'FP': 89, 'TN': 160, 'FN': 110}",0.71,0.6,0.52,0.64,0.6
5,tvec + ss + logreg,0.74,0.65,0.65,"{'tv__max_df': 0.2, 'tv__max_features': 50, 'tv__min_df': 2, 'tv__ngram_range': (1, 1), 'tv__stop_words': 'english'}",0.68,"{'TP': 154, 'FP': 81, 'TN': 168, 'FN': 93}",0.72,0.65,0.52,0.67,0.65


In [None]:
#USING THE FUNCTION WITH HASHING VECTORIZER

# LIST OF MODELS
steps_titles = ['hvec + multi_nb','hvec + ss + knn','hvec + ss + logreg']

# CODE FOR PIPELINE TO INSTANTIATE MODELS
steps_list = [ 
    [('hv', HashingVectorizer(alternate_sign=False)),('multi_nb', MultinomialNB())],
    [('hv', HashingVectorizer(alternate_sign=False)),('scaler', StandardScaler(with_mean=False)),('knn', KNeighborsClassifier())], 
    [('hv', HashingVectorizer(alternate_sign=False)),('scaler', StandardScaler(with_mean=False)),('logreg', LogisticRegression())]
]

# LIST OF PARAMETER DICTIONARIES
pipe_params = [
    {'hv__stop_words':['english'], 'hv__ngram_range':[(1,1),(1,2)]},
    {'hv__stop_words':['english'], 'hv__ngram_range':[(1,1),(1,2)]},
    {'hv__stop_words':['english'], 'hv__ngram_range':[(1,1),(1,2)]}
]   

#RUNNING THE FUNCTION
gridsearch_multi(steps_titles, steps_list, pipe_params)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

Unnamed: 0,model,AUC Score,precision,recall (sensitivity),best params,best score,confusion matrix,train accuracy,test accuracy,baseline accuracy,specificity,f1-score
0,cvec+ multi_nb,0.73,0.67,0.67,"{'cv__max_df': 0.2, 'cv__max_features': 50, 'cv__min_df': 2, 'cv__ngram_range': (1, 1), 'cv__stop_words': 'english'}",0.69,"{'TP': 165, 'FP': 83, 'TN': 166, 'FN': 82}",0.71,0.67,0.52,0.67,0.67
1,cvec + ss + knn,0.66,0.62,0.62,"{'cv__max_df': 0.25, 'cv__max_features': 30, 'cv__min_df': 2, 'cv__ngram_range': (1, 1), 'cv__stop_words': 'english'}",0.61,"{'TP': 147, 'FP': 87, 'TN': 162, 'FN': 100}",0.73,0.62,0.52,0.65,0.62
2,cvec + ss + logreg,0.73,0.67,0.67,"{'cv__max_df': 0.3, 'cv__max_features': 50, 'cv__min_df': 2, 'cv__ngram_range': (1, 1), 'cv__stop_words': 'english'}",0.68,"{'TP': 156, 'FP': 73, 'TN': 176, 'FN': 91}",0.72,0.67,0.52,0.71,0.67
3,tvec + multi_nb,0.74,0.65,0.65,"{'tv__max_df': 0.2, 'tv__max_features': 50, 'tv__min_df': 2, 'tv__ngram_range': (1, 1), 'tv__stop_words': 'english'}",0.69,"{'TP': 146, 'FP': 72, 'TN': 177, 'FN': 101}",0.71,0.65,0.52,0.71,0.65
4,tvec + ss + knn,0.63,0.6,0.6,"{'tv__max_df': 0.2, 'tv__max_features': 20, 'tv__min_df': 2, 'tv__ngram_range': (1, 1), 'tv__stop_words': 'english'}",0.61,"{'TP': 137, 'FP': 89, 'TN': 160, 'FN': 110}",0.71,0.6,0.52,0.64,0.6
5,tvec + ss + logreg,0.74,0.65,0.65,"{'tv__max_df': 0.2, 'tv__max_features': 50, 'tv__min_df': 2, 'tv__ngram_range': (1, 1), 'tv__stop_words': 'english'}",0.68,"{'TP': 154, 'FP': 81, 'TN': 168, 'FN': 93}",0.72,0.65,0.52,0.67,0.65
6,hvec + multi_nb,0.76,0.69,0.68,"{'hv__ngram_range': (1, 2), 'hv__stop_words': 'english'}",0.68,"{'TP': 136, 'FP': 50, 'TN': 199, 'FN': 111}",0.95,0.68,0.52,0.8,0.67
7,hvec + ss + knn,0.49,0.75,0.5,"{'hv__ngram_range': (1, 2), 'hv__stop_words': 'english'}",0.51,"{'TP': 247, 'FP': 247, 'TN': 2, 'FN': 0}",0.5,0.5,0.52,0.01,0.34
8,hvec + ss + logreg,0.71,0.64,0.64,"{'hv__ngram_range': (1, 2), 'hv__stop_words': 'english'}",0.67,"{'TP': 145, 'FP': 78, 'TN': 171, 'FN': 102}",1.0,0.64,0.52,0.69,0.64


#### Narrowing down to two models 
The *Hashing Vectorizer + Multinomial Naive Bayes* model out-performed the other models on multiple metrics, especially our much-prized AUC score (0.76) and the recall score (which measures our model's ability to predict True Positives well). Another notable performer is the *TF-IDF Vectorizer + Multinomial Naive Bayes* combination. Apart from the joint-second-highest AUC score of 0.74, its consistent performance on both the training and test sets (0.71 and 0.65 accuracy) showed that the model generalises well.

> **Next Step: Tuning Hyperparameters** -  We'll now tune the hyperparameters of both of these models further. 
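
Since `n_features` is the main knob we tune for the Hashing Vectorizer, here is a minimal sketch of the hashing trick behind it. This is an illustration only: scikit-learn's `HashingVectorizer` uses MurmurHash3 and normalises each row by default, whereas this sketch uses `hashlib.md5` as a stand-in and keeps raw counts. The function `hashed_counts` and the sample post are made up.

```python
import hashlib


def hashed_counts(text, n_features):
    """Map each token to one of n_features buckets and count the hits."""
    vector = [0] * n_features
    for token in text.split():
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        bucket = int(digest, 16) % n_features  # bucket index for this token
        vector[bucket] += 1
    return vector


post = "feel tired feel alone want sleep"  # made-up cleaned post

small = hashed_counts(post, n_features=4)
large = hashed_counts(post, n_features=1000)

# Five distinct words cannot fit into four buckets without sharing one
print("buckets used with n_features=4:   ", sum(1 for v in small if v))
print("buckets used with n_features=1000:", sum(1 for v in large if v))
```

A small `n_features` forces unrelated words to share a column (a collision), while a larger value keeps more words apart at the cost of a wider matrix. Setting `alternate_sign=False`, as we do in our pipelines, keeps the values non-negative, which Multinomial Naive Bayes requires.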

In [None]:
#TUNING OUR HYPERPARAMETERS
#USING THE FUNCTION ON OUR TOP TWO MODELS

# LIST OF MODELS
steps_titles = ['hvec + multi_nb(tuning)','tvec + multi_nb(tuning)']

# CODE FOR PIPELINE TO INSTANTIATE MODELS
steps_list = [ 
    [('hv', HashingVectorizer(alternate_sign=False)),('multi_nb', MultinomialNB())],
    [('tv', TfidfVectorizer()),('multi_nb', MultinomialNB())]
]

# LIST OF PARAMETER DICTIONARIES
pipe_params = [
    {'hv__stop_words':['english'], 'hv__ngram_range':[(1,1),(1,2),(1,3)], 'hv__n_features': [50, 150, 300, 500, 800, 1000]},
    {'tv__stop_words':['english'], 'tv__ngram_range':[(1,1),(1,2),(1,3)],'tv__max_features': [20, 30, 50, 70, 100],'tv__min_df': [2, 3],'tv__max_df': [.2, .25, .3, .35,.4]},
]   

#RUNNING THE FUNCTION
gridsearch_multi(steps_titles, steps_list, pipe_params)

Unnamed: 0,model,AUC Score,precision,recall (sensitivity),best params,best score,confusion matrix,train accuracy,test accuracy,baseline accuracy,specificity,f1-score
0,cvec+ multi_nb,0.73,0.67,0.67,"{'cv__max_df': 0.2, 'cv__max_features': 50, 'cv__min_df': 2, 'cv__ngram_range': (1, 1), 'cv__stop_words': 'english'}",0.69,"{'TP': 165, 'FP': 83, 'TN': 166, 'FN': 82}",0.71,0.67,0.52,0.67,0.67
1,cvec + ss + knn,0.66,0.62,0.62,"{'cv__max_df': 0.25, 'cv__max_features': 30, 'cv__min_df': 2, 'cv__ngram_range': (1, 1), 'cv__stop_words': 'english'}",0.61,"{'TP': 147, 'FP': 87, 'TN': 162, 'FN': 100}",0.73,0.62,0.52,0.65,0.62
2,cvec + ss + logreg,0.73,0.67,0.67,"{'cv__max_df': 0.3, 'cv__max_features': 50, 'cv__min_df': 2, 'cv__ngram_range': (1, 1), 'cv__stop_words': 'english'}",0.68,"{'TP': 156, 'FP': 73, 'TN': 176, 'FN': 91}",0.72,0.67,0.52,0.71,0.67
3,tvec + multi_nb,0.74,0.65,0.65,"{'tv__max_df': 0.2, 'tv__max_features': 50, 'tv__min_df': 2, 'tv__ngram_range': (1, 1), 'tv__stop_words': 'english'}",0.69,"{'TP': 146, 'FP': 72, 'TN': 177, 'FN': 101}",0.71,0.65,0.52,0.71,0.65
4,tvec + ss + knn,0.63,0.6,0.6,"{'tv__max_df': 0.2, 'tv__max_features': 20, 'tv__min_df': 2, 'tv__ngram_range': (1, 1), 'tv__stop_words': 'english'}",0.61,"{'TP': 137, 'FP': 89, 'TN': 160, 'FN': 110}",0.71,0.6,0.52,0.64,0.6
5,tvec + ss + logreg,0.74,0.65,0.65,"{'tv__max_df': 0.2, 'tv__max_features': 50, 'tv__min_df': 2, 'tv__ngram_range': (1, 1), 'tv__stop_words': 'english'}",0.68,"{'TP': 154, 'FP': 81, 'TN': 168, 'FN': 93}",0.72,0.65,0.52,0.67,0.65
6,hvec + multi_nb,0.76,0.69,0.68,"{'hv__ngram_range': (1, 2), 'hv__stop_words': 'english'}",0.68,"{'TP': 136, 'FP': 50, 'TN': 199, 'FN': 111}",0.95,0.68,0.52,0.8,0.67
7,hvec + ss + knn,0.49,0.75,0.5,"{'hv__ngram_range': (1, 2), 'hv__stop_words': 'english'}",0.51,"{'TP': 247, 'FP': 247, 'TN': 2, 'FN': 0}",0.5,0.5,0.52,0.01,0.34
8,hvec + ss + logreg,0.71,0.64,0.64,"{'hv__ngram_range': (1, 2), 'hv__stop_words': 'english'}",0.67,"{'TP': 145, 'FP': 78, 'TN': 171, 'FN': 102}",1.0,0.64,0.52,0.69,0.64
9,hvec + multi_nb(tuning),0.75,0.68,0.68,"{'hv__n_features': 1000, 'hv__ngram_range': (1, 1), 'hv__stop_words': 'english'}",0.69,"{'TP': 170, 'FP': 80, 'TN': 169, 'FN': 77}",0.8,0.68,0.52,0.68,0.68


In [None]:
#THE SCORES WERE CLOSE! LET'S FINE-TUNE IT JUST ONE MORE TIME
#TUNING OUR HYPERPARAMETERS PART II
#USING THE FUNCTION ON OUR TOP TWO MODELS

# LIST OF MODELS
steps_titles = ['hvec + multi_nb (tuning_2)','tvec + multi_nb (tuning_2)']

# CODE FOR PIPELINE TO INSTANTIATE MODELS
steps_list = [ 
    [('hv', HashingVectorizer(alternate_sign=False)),('multi_nb', MultinomialNB())],
    [('tv', TfidfVectorizer()),('multi_nb', MultinomialNB())]
]

# LIST OF PARAMETER DICTIONARIES
pipe_params = [
    {'hv__stop_words':['english'], 'hv__ngram_range':[(1,1)], 'hv__n_features': [1000, 1200, 1400, 2000]},
    {'tv__stop_words':['english'], 'tv__ngram_range':[(1,1),(1,2),(1,3)],'tv__max_features': [60, 65, 70, 75, 80],'tv__min_df': [1, 2, 3],'tv__max_df': [.4, .45,.5,.55, .6]},
]   

#RUNNING THE FUNCTION
gridsearch_multi(steps_titles, steps_list, pipe_params)

Unnamed: 0,model,AUC Score,precision,recall (sensitivity),best params,best score,confusion matrix,train accuracy,test accuracy,baseline accuracy,specificity,f1-score
0,cvec+ multi_nb,0.73,0.67,0.67,"{'cv__max_df': 0.2, 'cv__max_features': 50, 'cv__min_df': 2, 'cv__ngram_range': (1, 1), 'cv__stop_words': 'english'}",0.69,"{'TP': 165, 'FP': 83, 'TN': 166, 'FN': 82}",0.71,0.67,0.52,0.67,0.67
1,cvec + ss + knn,0.66,0.62,0.62,"{'cv__max_df': 0.25, 'cv__max_features': 30, 'cv__min_df': 2, 'cv__ngram_range': (1, 1), 'cv__stop_words': 'english'}",0.61,"{'TP': 147, 'FP': 87, 'TN': 162, 'FN': 100}",0.73,0.62,0.52,0.65,0.62
2,cvec + ss + logreg,0.73,0.67,0.67,"{'cv__max_df': 0.3, 'cv__max_features': 50, 'cv__min_df': 2, 'cv__ngram_range': (1, 1), 'cv__stop_words': 'english'}",0.68,"{'TP': 156, 'FP': 73, 'TN': 176, 'FN': 91}",0.72,0.67,0.52,0.71,0.67
3,tvec + multi_nb,0.74,0.65,0.65,"{'tv__max_df': 0.2, 'tv__max_features': 50, 'tv__min_df': 2, 'tv__ngram_range': (1, 1), 'tv__stop_words': 'english'}",0.69,"{'TP': 146, 'FP': 72, 'TN': 177, 'FN': 101}",0.71,0.65,0.52,0.71,0.65
4,tvec + ss + knn,0.63,0.6,0.6,"{'tv__max_df': 0.2, 'tv__max_features': 20, 'tv__min_df': 2, 'tv__ngram_range': (1, 1), 'tv__stop_words': 'english'}",0.61,"{'TP': 137, 'FP': 89, 'TN': 160, 'FN': 110}",0.71,0.6,0.52,0.64,0.6
5,tvec + ss + logreg,0.74,0.65,0.65,"{'tv__max_df': 0.2, 'tv__max_features': 50, 'tv__min_df': 2, 'tv__ngram_range': (1, 1), 'tv__stop_words': 'english'}",0.68,"{'TP': 154, 'FP': 81, 'TN': 168, 'FN': 93}",0.72,0.65,0.52,0.67,0.65
6,hvec + multi_nb,0.76,0.69,0.68,"{'hv__ngram_range': (1, 2), 'hv__stop_words': 'english'}",0.68,"{'TP': 136, 'FP': 50, 'TN': 199, 'FN': 111}",0.95,0.68,0.52,0.8,0.67
7,hvec + ss + knn,0.49,0.75,0.5,"{'hv__ngram_range': (1, 2), 'hv__stop_words': 'english'}",0.51,"{'TP': 247, 'FP': 247, 'TN': 2, 'FN': 0}",0.5,0.5,0.52,0.01,0.34
8,hvec + ss + logreg,0.71,0.64,0.64,"{'hv__ngram_range': (1, 2), 'hv__stop_words': 'english'}",0.67,"{'TP': 145, 'FP': 78, 'TN': 171, 'FN': 102}",1.0,0.64,0.52,0.69,0.64
9,hvec + multi_nb(tuning),0.75,0.68,0.68,"{'hv__n_features': 1000, 'hv__ngram_range': (1, 1), 'hv__stop_words': 'english'}",0.69,"{'TP': 170, 'FP': 80, 'TN': 169, 'FN': 77}",0.8,0.68,0.52,0.68,0.68


#### Production Model Chosen: TF-IDF Vectorizer + Multinomial Naive Bayes
The model responded well to the tuning sessions. Although the Hashing model had a slightly better AUC score, I'd prioritise this model's high recall score, as it will help identify potential suicide cases (True Positives) more accurately. This model also appears to generalise well, with only a 0.01 difference between its training and test set scores. 

### 3.4 Running the optimised production model


Our production model is a combination of two components: a TF-IDF Vectorizer and a Multinomial Naive Bayes classifier.

The first one, a TF-IDF ("Term Frequency-Inverse Document Frequency") Vectorizer, assigns scores to the terms (in our case, the top 70 terms) in our selected feature. A term's score rises with how often it appears in a post, and TF-IDF penalises terms that appear in many posts across the dataset. 
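
The scoring can be sketched by hand on a tiny made-up corpus. We assume the formula that `TfidfVectorizer` uses with its default settings, namely idf = ln((1 + n) / (1 + df)) + 1 followed by scaling each row to unit length. The three "posts" and the helper `tfidf_row` are hypothetical.

```python
import math
from collections import Counter

# Made-up, already-cleaned posts
docs = [
    "tired sad tired",
    "sad alone",
    "tired alone alone help",
]
tokenised = [doc.split() for doc in docs]
vocab = sorted({token for doc in tokenised for token in doc})
n_docs = len(docs)

# Document frequency: in how many posts does each term appear?
df = {term: sum(term in doc for doc in tokenised) for term in vocab}

# Smoothed inverse document frequency (assumed scikit-learn default)
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}


def tfidf_row(tokens):
    """Term count times idf, scaled so the row has unit length."""
    counts = Counter(tokens)
    raw = [counts[term] * idf[term] for term in vocab]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]


rows = [tfidf_row(doc) for doc in tokenised]

for term in vocab:
    print(f"{term:>6}: df={df[term]} idf={idf[term]:.3f}")
```

Here "help" appears in only one post, so it gets a higher idf than "tired", which appears in two. Rare terms therefore carry more weight than common ones.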

The matrix of "word scores" is then passed to a Multinomial Naive Bayes classifier, which predicts the most probable category for a post from how likely each of the post's words is to appear in that category.
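
The classifier's calculation can be sketched in a few lines of plain Python. This is a simplified illustration with add-one (Laplace) smoothing on made-up word counts, not the scikit-learn implementation. The functions `fit` and `predict_proba` and the toy data are our own.

```python
import math

# Made-up word counts per post; columns follow this vocabulary
vocab = ["alone", "help", "sad", "tired"]
X = [
    [0, 0, 1, 2],
    [1, 0, 1, 0],
    [2, 1, 0, 1],
    [1, 2, 0, 0],
]
y = [0, 0, 1, 1]  # 0 = r/depression, 1 = r/SuicideWatch
alpha = 1.0       # add-one smoothing


def fit(X, y, alpha):
    """Estimate log P(class) and log P(word | class) from counts."""
    log_prior, log_likelihood = {}, {}
    for label in sorted(set(y)):
        rows = [row for row, target in zip(X, y) if target == label]
        log_prior[label] = math.log(len(rows) / len(X))
        totals = [sum(column) for column in zip(*rows)]
        denominator = sum(totals) + alpha * len(totals)
        log_likelihood[label] = [
            math.log((total + alpha) / denominator) for total in totals
        ]
    return log_prior, log_likelihood


def predict_proba(x, log_prior, log_likelihood):
    """Combine prior and word likelihoods, then normalise to probabilities."""
    scores = {
        label: log_prior[label]
        + sum(count * weight for count, weight in zip(x, log_likelihood[label]))
        for label in log_prior
    }
    top = max(scores.values())
    exp_scores = {label: math.exp(score - top) for label, score in scores.items()}
    total = sum(exp_scores.values())
    return {label: value / total for label, value in exp_scores.items()}


log_prior, log_likelihood = fit(X, y, alpha)

# A new post that uses the word "help" twice
proba = predict_proba([0, 2, 0, 0], log_prior, log_likelihood)
print(proba)
```

Because "help" only appears in the class 1 training posts, the new post is assigned a much higher probability of belonging to class 1.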



In [None]:
#CHECKING SCORES OF THE OPTIMISED MODEL USING TEST DATA
#DEFINING X and y
X = model_data["megatext_clean"]
y = model_data['is_suicide']
#TRAIN-TEST SPLIT
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)

tvec_optimised = TfidfVectorizer(max_df= 0.5, max_features=70, min_df=2, ngram_range=(1, 3),stop_words = 'english')
#USING toarray() AS RECENT SCIKIT-LEARN VERSIONS REJECT THE np.matrix RETURNED BY todense()
X_train_tvec = tvec_optimised.fit_transform(X_train).toarray()
X_test_tvec = tvec_optimised.transform(X_test).toarray()

#FINDING THE ACCURACY SCORE ON THE TEST DATA
nb = MultinomialNB()
nb.fit(X_train_tvec, y_train)
accuracy = nb.score(X_test_tvec, y_test)

#CALCULATING AREA UNDER THE CURVE

pred_proba = [i[1] for i in nb.predict_proba(X_test_tvec)] 
auc = roc_auc_score(y_test, pred_proba)

print("ACCURACY: {}\nAUC SCORE: {}".format(accuracy, auc) )

ACCURACY: 0.6673387096774194
AUC SCORE: 0.7414678958749981


> **Results** - The optimised model scored well on our test set, with an AUC score of 0.74 and an accuracy of 0.67. We will proceed to understand our model a bit better before making final critiques and recommendations.
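
For intuition about the AUC score, it can be read as the chance that a randomly chosen r/SuicideWatch post receives a higher predicted probability than a randomly chosen r/depression post. The sketch below computes it that way on made-up scores. The function name `pairwise_auc` is our own, and the loop is written for clarity rather than speed.

```python
def pairwise_auc(y_true, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for s, label in zip(scores, y_true) if label == 1]
    negatives = [s for s, label in zip(scores, y_true) if label == 0]
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up predicted probabilities for six posts
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2]

auc_value = pairwise_auc(y_true, scores)
print(f"AUC: {auc_value:.2f}")
```

In this toy case 7 of the 9 pairs are ordered correctly. A score of 0.5 would mean the ranking is no better than chance, and 1.0 would mean every positive post outranks every negative one.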

In [None]:
#DEFINING A FUNCTION TO VISUALISE MOST USED WORDS
def TF_IDF_most_used_words(category_string, data_series, palette, image_mask):
    #CHECKING OUT COMMON WORDS IN r/SuicideWatch USING TVEC
    tvec_optimised = TfidfVectorizer(max_df= 0.5, max_features=70, min_df=2, ngram_range=(1, 3),stop_words = 'english')
    tvec_optimised.fit(data_series)
    #CREATING A DATAFRAME OF EXTRACTED WORDS
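    #NOTE: get_feature_names() WAS REMOVED IN scikit-learn 1.2; get_feature_names_out() NEEDS scikit-learn >= 1.0
    created_df = pd.DataFrame(tvec_optimised.transform(data_series).toarray(),
                              columns=tvec_optimised.get_feature_names_out())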
    total_words = created_df.sum(axis=0)
    
    #<<<WORDCLOUD>>>
    #CREATING A LONG STRING OF WORDS FOR THE WORD CLOUD MODULE
    top_40_words = total_words.sort_values(ascending = False).head(40)
    top_40_words_df = pd.DataFrame(top_40_words)
    top_words_cloud_df = top_40_words_df.reset_index()
    top_words_cloud_df.columns = ["words", "count"]
    one_string_list = []
    for i in range(len(top_words_cloud_df)):
        one_string = (top_words_cloud_df["words"][i] + " ")* (top_words_cloud_df["count"][i]).astype(int)
        one_string_list.append(one_string)
    long_string = " ".join(string for string in one_string_list)
    #print(long_string)
    # CREATING A WORD CLOUD IMAGE
    mask = np.array(Image.open(image_mask))
    wordcloud = WordCloud(repeat=True, collocations=False,min_font_size=2, max_font_size= 80, max_words= 10000, background_color= "white",colormap= palette,  mask= mask).generate(long_string)
    # DISPLAY IT
    #plt.axis("off")
    plt.figure(figsize = (20, 20), dpi=300)
    plt.title('\n{}\n'.format(category_string), fontsize=22)
    #plt.imshow(wordcloud, interpolation='bilinear') 
    image_colors = ImageColorGenerator(mask) #THIS MAKES THE WORDCLOUD RESPOND TO THE COLOURS IN THE MASK
    plt.imshow(wordcloud.recolor(color_func=image_colors), interpolation='bilinear')
    plt.axis("off")
    plt.show()
    
    #<<<BARPLOT>>>
    #CREATING A FINAL DATAFRAME OF THE TOP 20 WORDS
    top_20_words = total_words.sort_values(ascending = False).head(20)
    top_20_words_df = pd.DataFrame(top_20_words, columns = ["count"])
    #PLOTTING THE COUNT OF THE TOP 20 WORDS
    sns.set_style("white")
    plt.figure(figsize = (15, 8), dpi=300)
    ax = sns.barplot(y= top_20_words_df.index, x="count", data=top_20_words_df, palette = palette)
    
    plt.xlabel("Count", fontsize=9)
    plt.ylabel('Common Words in {}'.format(category_string), fontsize=9)
    plt.yticks(rotation=-5)

In [None]:
#CALLING THE FUNCTION ON OUR X FEATURES
TF_IDF_most_used_words("Words used by production model to identify r/SuicideWatch Posts", model_data["megatext_clean"], "vlag_r", image_mask="/content/drive/My Drive/ML mini project/assets/ending_mask_8.png")

FileNotFoundError: ignored

In [None]:
#SEEKING OUT FALSE NEGATIVES

#SETTING UP A DATAFRAME WITH ONLY X_test POSTS
#X_test.index HOLDS INDEX LABELS, SO WE SELECT WITH .loc (LABEL-BASED) RATHER THAN .iloc (POSITION-BASED)
index_seek = list(X_test.index)
fn_explore = model_data.loc[index_seek, ["title","selftext","author", "is_suicide"]].copy()
fn_explore["predictions"] = nb.predict(X_test_tvec)

# SEEKING OUT OUR FALSE NEGATIVES
#is_suicide == 1 and predictions == 0 (BOTH CONDITIONS COMBINED IN A SINGLE BOOLEAN MASK)
false_negs = fn_explore[(fn_explore["is_suicide"]==1) & (fn_explore["predictions"]==0)]
false_negs

In [None]:
#DEFINING A FUNCTION TO VISUALISE MOST USED WORDS
def barplot_most_used_words(category_string, data_series, palette):
    #CHECKING OUT COMMON WORDS IN r/SuicideWatch USING TVEC
    tvec_optimised = TfidfVectorizer(max_df= 0.5, max_features=70, min_df=2, ngram_range=(1, 3),stop_words = 'english')
    tvec_optimised.fit(data_series)
    #CREATING A DATAFRAME OF EXTRACTED WORDS
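    #NOTE: get_feature_names() WAS REMOVED IN scikit-learn 1.2; get_feature_names_out() NEEDS scikit-learn >= 1.0
    created_df = pd.DataFrame(tvec_optimised.transform(data_series).toarray(),
                              columns=tvec_optimised.get_feature_names_out())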
    total_words = created_df.sum(axis=0)
    
    #<<<BARPLOT>>>
    #CREATING A FINAL DATAFRAME OF THE TOP 20 WORDS
    top_20_words = total_words.sort_values(ascending = False).head(20)
    top_20_words_df = pd.DataFrame(top_20_words, columns = ["count"])
    #PLOTTING THE COUNT OF THE TOP 20 WORDS
    sns.set_style("white")
    plt.figure(figsize = (15, 8), dpi=300)
    ax = sns.barplot(y= top_20_words_df.index, x="count", data=top_20_words_df, palette = palette)
    plt.title('\nTop Words used in {}\n'.format(category_string), fontsize=22)
    plt.xlabel("Count", fontsize=9)
    plt.ylabel('Common Words in {}'.format(category_string), fontsize=9)
    plt.yticks(rotation=-5)

In [None]:
#VISUALISING WORDS USED BY FALSE NEGATIVES
barplot_most_used_words("False Negatives", false_negs["selftext"], "gray")

### 3.5 Model evaluation and possible future developments 



- **Top Words picked by TF-IDF Vectorizer** - It is important to note that the Vectorizer weighted "feel" higher than "want" although (as seen in our earlier EDA) "want" appeared more times than "feel" in suicide posts. This is probably because TF-IDF "penalises" words that appear across too many posts. Nonetheless, this set of words might be a good starting point for building other models that could predict the presence of suicidal individuals in, say, a subreddit for teachers.


- **Understanding our False Negatives** - In our barplot showing the top words in our false negatives, we can see many words similar to those that our model is prioritising. To improve our model, we need to use more robust textual analysis to lower the number of false negatives, which would help prevent individuals at risk of suicide from falling through the cracks.


- **Suicide without Depression** - One of the biggest limitations of this project is that we only looked at suicidal tendencies in the context of depression. It is important to note that suicides happen to people without depression too. Thus, future studies can focus on the wider population. 


- **Double-Posting Phenomenon** - In our exploratory data analysis, we noticed about 26 instances of individuals posting on both subreddits. The example we looked at was u/thathumbletrashcan, who posted on r/depression on March 4th that "*I don't want to die, but I don't want to live anymore*". A day later, u/thathumbletrashcan visited the r/SuicideWatch forum and posted "*I've finally grown the balls to fulfil my plan......all of you won't have to deal with me again*." Social psychologist Roy Baumeister postulates in his “Suicide as Escape from Self” theory that a person descends into the pit of self-extinction in six stages. Studying double-posting through the lens of Baumeister's theory might yield links to stage-based progression amongst our redditors.


- **Looking into Male Suicides** - According to Samaritans of Singapore, males accounted for more than 71% of all suicides in 2018. This is consistent with our EDA findings of male-signifiers in usernames. Future projects can dive deeper into suicide amongst males. This could be relevant for companies in male-dominated, high-stress industries like oil/gas, banking and tech.


- **Complexity of Depression** - Depression is complex and layered. Future development of our model might be aided by working with someone with domain knowledge, who can help us filter out factors that might be linked to other conditions like self-harm, OCD or anxiety.