# Depression Sentiment Prediction (Part-2)
## 5. Importing Libraries and Feature Extracted Data
### In the last notebook, we successfully cleaned and organised the data and extracted features from it. Now it is time to import it and proceed with further operations.


In [35]:
import os
try:
    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.linear_model import LogisticRegression
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.tree import DecisionTreeClassifier
    from xgboost import XGBClassifier
    from sklearn.model_selection import GridSearchCV
    from sklearn.metrics import roc_auc_score, confusion_matrix
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer
    from nltk.tokenize import word_tokenize
    from wordcloud import WordCloud
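except ImportError:
    print("Required libraries not found, installing them...\nOnce done, please re-run the notebook")
    # The PyPI package for sklearn is called scikit-learn; wordcloud is imported above as well
    os.system("pip install numpy pandas scikit-learn scipy xgboost nltk wordcloud")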

In [2]:
# Use the same encoding 'latin-1' as used in the previous notebook
data = pd.read_csv('feature_extracted.csv', encoding='latin-1')
data.head()

Unnamed: 0,target,text
0,1,upset cant updat facebook text might cri resul...
1,1,kenichan dive mani time ball manag save rest b...
2,1,whole bodi feel itchi like fire
3,1,nationwideclass behav mad cant see
4,1,kwesidei whole crew


In [3]:
# Also, let's just drop the NaN values. Don't know how I forgot that in the last notebook!
data = data.dropna()

## 6. Splitting the Dataset
### Before we vectorize our data, let's first split it into a training set and a test set. Splitting first lets us fit the Tf-Idf vectorizer on the training text only, and it is also much simpler than splitting the vectorized data afterwards.
### We will split the data so that 98% (*1,567,374* data points) goes into the training set and the remaining 2% (*31,988* data points) goes into the test set.

In [4]:
text = data['text']
target = data['target']
trainX, testX, trainY, testY = train_test_split(text, target, test_size=0.02)

print("Training Data Size: {} and Testing Data Size: {}".format(trainX.shape[0], testX.shape[0]))

Training Data Size: 1567374 and Testing Data Size: 31988
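
The sizes printed above follow directly from `test_size=0.02`. Here is a quick sanity check. It is only a sketch, and it assumes that the test share is rounded up and the remainder goes to training, which matches the output above:

```python
import math

# Sizes reported in the output above
n_train_reported = 1567374
n_test_reported = 31988
n_samples = n_train_reported + n_test_reported

test_size = 0.02

# Assumption: the test share is rounded up and the rest is used for training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print("Rows after dropna(): {}".format(n_samples))
print("Train: {}, Test: {}".format(n_train, n_test))
print("Test share: {:.2%}".format(n_test / n_samples))
```

Note that the split is not seeded, so passing a fixed `random_state` to `train_test_split` would be needed to reproduce the exact same rows on a re-run.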


## 7. Vectorizing data
### Machine Learning models can't work directly on text data (which is basically a collection of *strings*), so we have to find a way to convert this sequential data into numerical data that these algorithms can understand.
### One widely used method is Tf-Idf (term frequency-inverse document frequency) vectorizing. It represents every document as a vector with one weight per vocabulary word: the weight grows with how often the word appears in that document and shrinks the more documents contain the word. So it is more than a plain word frequency counter, because very common words are down-weighted.
#### Note: I know I could have written a Tf-Idf vectorizer from scratch (I even tried that as an experiment, and it worked!), but that solution is very resource-inefficient and time-consuming since our data is humongous in both size and word diversity.
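
To make the weighting concrete, here is a minimal from-scratch sketch on a toy corpus. It is not the implementation used in this notebook. It assumes whitespace tokenization and mirrors the formula documented for scikit-learn's defaults (raw counts, smoothed idf, L2-normalised rows). The function name `tfidf` and the corpus are made up for illustration:

```python
import math
from collections import Counter

def tfidf(corpus):
    """Return one {term: weight} dict per document.

    Assumptions: whitespace tokenization, raw term counts,
    idf = ln((1 + n_docs) / (1 + df)) + 1, and L2-normalised rows.
    """
    docs = [doc.split() for doc in corpus]
    n_docs = len(docs)

    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))

    rows = []
    for doc in docs:
        counts = Counter(doc)
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in counts.items()
        }
        # Guard against empty documents, whose norm would be zero
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({term: w / norm for term, w in weights.items()})
    return rows

# Hypothetical corpus, already cleaned and stemmed
corpus = ["feel sad today", "feel great today", "sad sad cri"]
rows = tfidf(corpus)
for text, row in zip(corpus, rows):
    print(text, {term: round(weight, 3) for term, weight in row.items()})
```

In the second document, "great" appears in only one document, so it ends up with a larger weight than "feel" and "today", which appear in two.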

In [6]:
# Let's Initialize our Vectorizer
vectorizer = TfidfVectorizer()

In [7]:
# Now let's vectorize our training text data
train_features = vectorizer.fit_transform(trainX)
# Also, let's just convert our training target values (trainY) into a numpy array
train_targets = np.array(trainY)

In [8]:
# Also, let's convert our Test Data into similar form for future ease of use
test_features = vectorizer.transform(testX)
test_targets = np.array(testY)

## 8. Training Machine Learning Models
### Now comes the fun part! Let's train different models on our data and see which one performs best in terms of ROC-AUC value, confusion matrix and, of course, how long it takes to train!
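
Since every model below is compared by its ROC-AUC value, it helps to see what the number means: it is the fraction of (positive, negative) pairs in which the positive example gets the higher score. A minimal sketch on made-up labels and scores follows. The helper `roc_auc` is hypothetical and only suitable for tiny inputs; the notebook itself uses `roc_auc_score`:

```python
def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a win. This O(P * N) loop is only meant for tiny inputs.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted probabilities of class 1
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))  # 3 of the 4 pairs are ranked correctly -> 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is why the 0.7 benchmark sits between the two.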

### 8.1 Logistic Regression
#### Findings: Our model performs reasonably well on the test set; however, the numbers of false positives and false negatives are not very low, so we will try other models too.
##### ROC-AUC Score: 0.85

In [9]:
# Let's first try the Simple, Good-old Logistic Regression
lr_classifier = LogisticRegression(C=1.)
lr_classifier.fit(train_features, train_targets)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [10]:
# Let's test the model on our test data
predictions = lr_classifier.predict_proba(test_features)

# Let us get the ROC-AUC score and the confusion matrix
roc_auc_lr = roc_auc_score(test_targets, predictions[:,-1])
confusion_lr = confusion_matrix(test_targets, np.round(predictions[:,-1]))

# Rows of the confusion matrix are actual labels and columns are predicted labels,
# so with label 1 as the positive class: [0][0]=TN, [0][1]=FP, [1][0]=FN, [1][1]=TP
lr_tn = confusion_lr[0][0]
lr_fp = confusion_lr[0][1]
lr_fn = confusion_lr[1][0]
lr_tp = confusion_lr[1][1]

# Print the Results
print("ROC-AUC Value is: {}".format(roc_auc_lr))
print("\nTotal Size of Test Data: {}".format(testX.shape[0]))
print("True Positives are: {}\nTrue Negatives are: {}\nFalse Positives are: {}\nFalse Negatives are: {}".format(lr_tp, lr_tn, lr_fp, lr_fn))

ROC-AUC Value is: 0.8506090412770613

Total Size of Test Data: 31988
True Positives are: 12139
True Negatives are: 12526
False Positives are: 3481
False Negatives are: 3842
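
The four counts above can be condensed into accuracy, precision and recall. This is a small sketch using the Logistic Regression counts. It assumes the scikit-learn layout (rows are actual labels, columns are predicted labels) and treats label 1 as the positive class:

```python
# Confusion matrix rebuilt from the counts printed above
# rows = actual label (0, 1), columns = predicted label (0, 1)
cm = [[12526, 3481],
      [3842, 12139]]

total = sum(sum(row) for row in cm)

# Accuracy: share of all predictions on the diagonal
accuracy = (cm[0][0] + cm[1][1]) / total

# Precision and recall for label 1 (assumed here to be the positive class)
precision_1 = cm[1][1] / (cm[0][1] + cm[1][1])
recall_1 = cm[1][1] / (cm[1][0] + cm[1][1])

print("Total: {}".format(total))
print("Accuracy: {:.4f}".format(accuracy))
print("Precision (label 1): {:.4f}".format(precision_1))
print("Recall (label 1): {:.4f}".format(recall_1))
```

The total adds up to the 31,988 test rows, and the accuracy comes out at roughly 0.77.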


### 8.2 Naive Bayes
#### Findings: Multinomial Naive Bayes performs slightly worse than our previous Logistic Regression model, but its performance on the test set is still solid and well above the benchmark I set in the proposal (0.7 on the test set). Comparing the training time of both models, though, the time versus performance trade-off is obvious.

##### ROC-AUC Score: 0.83

In [11]:
# Let's try Multinomial Naive Bayes on our data
nb_classifier = MultinomialNB()
nb_classifier.fit(train_features, train_targets)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [12]:
# Let's test the Naive Bayes Classifier on our test data
predictions = nb_classifier.predict_proba(test_features)

# Let us get the ROC-AUC score and the confusion matrix
roc_auc_nb = roc_auc_score(test_targets, predictions[:,-1])
confusion_nb = confusion_matrix(test_targets, np.round(predictions[:,-1]))

# Rows of the confusion matrix are actual labels and columns are predicted labels,
# so with label 1 as the positive class: [0][0]=TN, [0][1]=FP, [1][0]=FN, [1][1]=TP
nb_tn = confusion_nb[0][0]
nb_fp = confusion_nb[0][1]
nb_fn = confusion_nb[1][0]
nb_tp = confusion_nb[1][1]

# Print the Results
print("ROC-AUC Value is: {}".format(roc_auc_nb))
print("\nTotal Size of Test Data: {}".format(testX.shape[0]))
print("True Positives are: {}\nTrue Negatives are: {}\nFalse Positives are: {}\nFalse Negatives are: {}".format(nb_tp, nb_tn, nb_fp, nb_fn))

ROC-AUC Value is: 0.832812297754705

Total Size of Test Data: 31988
True Positives are: 12890
True Negatives are: 10944
False Positives are: 5063
False Negatives are: 3091


### 8.3 Random Forest Classifier
#### Findings: Quite surprisingly, the Random Forest Classifier turned out to be less effective than I expected. Still, all models so far have achieved better results on the test set than the benchmark I set (0.7 on the test set), so I am clearing the benchmark quite comfortably.
#### ROC-AUC Score: 0.81

In [13]:
# Let's try the model with random forest classifier
rf_classifier = RandomForestClassifier()
rf_classifier.fit(train_features, train_targets)



RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [14]:
# Let's test the Random Forest Classifier on our test data
predictions = rf_classifier.predict_proba(test_features)

# Let us get the ROC-AUC score and the confusion matrix
roc_auc_rf = roc_auc_score(test_targets, predictions[:,-1])
confusion_rf = confusion_matrix(test_targets, np.round(predictions[:,-1]))

# Rows of the confusion matrix are actual labels and columns are predicted labels,
# so with label 1 as the positive class: [0][0]=TN, [0][1]=FP, [1][0]=FN, [1][1]=TP
rf_tn = confusion_rf[0][0]
rf_fp = confusion_rf[0][1]
rf_fn = confusion_rf[1][0]
rf_tp = confusion_rf[1][1]

# Print the Results
print("ROC-AUC Value is: {}".format(roc_auc_rf))
print("\nTotal Size of Test Data: {}".format(testX.shape[0]))
print("True Positives are: {}\nTrue Negatives are: {}\nFalse Positives are: {}\nFalse Negatives are: {}".format(rf_tp, rf_tn, rf_fp, rf_fn))

ROC-AUC Value is: 0.8065653274064477

Total Size of Test Data: 31988
True Positives are: 11084
True Negatives are: 12460
False Positives are: 3547
False Negatives are: 4897


### 8.4 Decision Tree Classifier
#### Findings: The Decision Tree Classifier only just matches the benchmark I set (0.7 on the test set), which makes it a poor choice.
#### ROC-AUC Score: 0.70

In [15]:
# Let us try Decision Tree Classifier on our data
dt_classifier = DecisionTreeClassifier()
dt_classifier.fit(train_features, train_targets)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [16]:
# Let's test the Decision Tree Classifier on our test data
predictions = dt_classifier.predict_proba(test_features)

# Let us get the ROC-AUC score and the confusion matrix
roc_auc_dt = roc_auc_score(test_targets, predictions[:,-1])
confusion_dt = confusion_matrix(test_targets, np.round(predictions[:,-1]))

# Rows of the confusion matrix are actual labels and columns are predicted labels,
# so with label 1 as the positive class: [0][0]=TN, [0][1]=FP, [1][0]=FN, [1][1]=TP
dt_tn = confusion_dt[0][0]
dt_fp = confusion_dt[0][1]
dt_fn = confusion_dt[1][0]
dt_tp = confusion_dt[1][1]

# Print the Results
print("ROC-AUC Value is: {}".format(roc_auc_dt))
print("\nTotal Size of Test Data: {}".format(testX.shape[0]))
print("True Positives are: {}\nTrue Negatives are: {}\nFalse Positives are: {}\nFalse Negatives are: {}".format(dt_tp, dt_tn, dt_fp, dt_fn))

ROC-AUC Value is: 0.7042962306550173

Total Size of Test Data: 31988
True Positives are: 11396
True Negatives are: 11135
False Positives are: 4872
False Negatives are: 4585


### 8.5 K-Nearest Neighbors Classifier
#### Findings: The KNN Classifier did an 'OK' job on our test data. Though not as good as Logistic Regression or Naive Bayes, it still managed to pass the benchmark line (0.7).
#### ROC-AUC Score: 0.73

In [17]:
# Let us also try KNN Classifier on our data
kn_classifier = KNeighborsClassifier()
kn_classifier.fit(train_features, train_targets)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=5, p=2,
           weights='uniform')

In [18]:
# Let's test the KNN Classifier on our test data
predictions = kn_classifier.predict_proba(test_features)

# Let us get the ROC-AUC score and the confusion matrix
roc_auc_kn = roc_auc_score(test_targets, predictions[:,-1])
confusion_kn = confusion_matrix(test_targets, np.round(predictions[:,-1]))

# Rows of the confusion matrix are actual labels and columns are predicted labels,
# so with label 1 as the positive class: [0][0]=TN, [0][1]=FP, [1][0]=FN, [1][1]=TP
kn_tn = confusion_kn[0][0]
kn_fp = confusion_kn[0][1]
kn_fn = confusion_kn[1][0]
kn_tp = confusion_kn[1][1]

# Print the Results
print("ROC-AUC Value is: {}".format(roc_auc_kn))
print("\nTotal Size of Test Data: {}".format(testX.shape[0]))
print("True Positives are: {}\nTrue Negatives are: {}\nFalse Positives are: {}\nFalse Negatives are: {}".format(kn_tp, kn_tn, kn_fp, kn_fn))

ROC-AUC Value is: 0.7263250645063234

Total Size of Test Data: 31988
True Positives are: 11323
True Negatives are: 10125
False Positives are: 5882
False Negatives are: 4658


### 8.6 XGBoost Classifier
#### Findings: The XGB Classifier performed reasonably well and crossed the benchmark line with a score of 0.76. I still believe this model can do better, so I will run GridSearchCV on it along with Logistic Regression and Naive Bayes.
#### ROC-AUC Score: 0.76

In [19]:
# Let's try the final XGBoost Classifier
xg_classifier = XGBClassifier()
xg_classifier.fit(train_features, train_targets)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bynode=1, colsample_bytree=1, gamma=0, learning_rate=0.1,
       max_delta_step=0, max_depth=3, min_child_weight=1, missing=None,
       n_estimators=100, n_jobs=1, nthread=None,
       objective='binary:logistic', random_state=0, reg_alpha=0,
       reg_lambda=1, scale_pos_weight=1, seed=None, silent=None,
       subsample=1, verbosity=1)

In [20]:
# Let's test the XGBoost Classifier on our test data
predictions = xg_classifier.predict_proba(test_features)

# Let us get the ROC-AUC score and the confusion matrix
roc_auc_xg = roc_auc_score(test_targets, predictions[:,-1])
confusion_xg = confusion_matrix(test_targets, np.round(predictions[:,-1]))

# Rows of the confusion matrix are actual labels and columns are predicted labels,
# so with label 1 as the positive class: [0][0]=TN, [0][1]=FP, [1][0]=FN, [1][1]=TP
xg_tn = confusion_xg[0][0]
xg_fp = confusion_xg[0][1]
xg_fn = confusion_xg[1][0]
xg_tp = confusion_xg[1][1]

# Print the Results
print("ROC-AUC Value is: {}".format(roc_auc_xg))
print("\nTotal Size of Test Data: {}".format(testX.shape[0]))
print("True Positives are: {}\nTrue Negatives are: {}\nFalse Positives are: {}\nFalse Negatives are: {}".format(xg_tp, xg_tn, xg_fp, xg_fn))

ROC-AUC Value is: 0.762598276541667

Total Size of Test Data: 31988
True Positives are: 8431
True Negatives are: 13733
False Positives are: 2274
False Negatives are: 7550
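
With all six models evaluated, the test-set ROC-AUC values from the outputs of this section can be put side by side. This is a small sketch that only re-uses the numbers printed above and the 0.7 benchmark:

```python
# Test-set ROC-AUC values copied from the outputs in section 8
roc_auc_scores = {
    "Logistic Regression": 0.8506090412770613,
    "Naive Bayes": 0.832812297754705,
    "Random Forest": 0.8065653274064477,
    "Decision Tree": 0.7042962306550173,
    "K-Nearest Neighbors": 0.7263250645063234,
    "XGBoost": 0.762598276541667,
}
benchmark = 0.7  # benchmark from the project proposal

# Sort from best to worst
ranking = sorted(roc_auc_scores.items(), key=lambda item: item[1], reverse=True)
for name, score in ranking:
    print("{:<22} {:.4f}  (+{:.4f} over benchmark)".format(name, score, score - benchmark))
```

Logistic Regression comes out on top and the Decision Tree at the bottom, just above the benchmark.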


## 9. Optimal Hyperparameter Search using GridSearchCV
### In this final section, we will search for optimal hyperparameters for the 3 models I found to be worth the time:
#### 1. Logistic Regression
#### 2. Naive Bayes
#### 3. XGBoost Classifier
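
Before running the searches, here is a rough sketch of what a grid search with k-fold cross-validation does: every candidate value is trained and scored on each fold, and the candidate with the best mean validation score wins. Everything in it is hypothetical (the toy threshold "model", the data and all function names). Unlike `GridSearchCV`, it uses plain contiguous folds and does not refit on the full data:

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous validation folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def grid_search(candidates, fit, score, X, y, k=5):
    """Return (best_candidate, best_mean_score) over k-fold cross-validation."""
    folds = k_fold_indices(len(X), k)
    best = None
    for value in candidates:
        fold_scores = []
        for val_idx in folds:
            held_out = set(val_idx)
            train_idx = [i for i in range(len(X)) if i not in held_out]
            model = fit([X[i] for i in train_idx], [y[i] for i in train_idx], value)
            fold_scores.append(score(model, [X[i] for i in val_idx], [y[i] for i in val_idx]))
        mean_score = sum(fold_scores) / len(fold_scores)
        if best is None or mean_score > best[1]:
            best = (value, mean_score)
    return best

def fit_threshold(train_x, train_y, offset):
    """Toy 'model': a cut-off at the training mean plus a tunable offset (labels unused)."""
    return sum(train_x) / len(train_x) + offset

def accuracy(threshold, val_x, val_y):
    """Share of validation points classified correctly by the cut-off."""
    predictions = [1 if x > threshold else 0 for x in val_x]
    return sum(p == t for p, t in zip(predictions, val_y)) / len(val_y)

# Made-up one-dimensional data: small values are class 0, large values class 1
X = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

best = grid_search([-3, 0, 3], fit_threshold, accuracy, X, y, k=5)
print("Best offset: {}, mean validation accuracy: {}".format(*best))
```

Note that the score being maximised here is accuracy, not ROC-AUC, which is also what `GridSearchCV` reports for a classifier when no `scoring` argument is passed.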

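### 9.1 Logistic Regression Hyperparameter Search
#### Best cross-validation score found: 0.77 (mean accuracy, since no `scoring` argument was passed)
Clearly there hasn't been much of an improvement in our Logistic Regression model: the test ROC-AUC stays at 0.85.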

In [21]:
# Let us define the parameter we would like to do search on
parameters_lr = {'C':[1,10]}

# Now, let's fit our GCV model to the data
gcv_lr = GridSearchCV(lr_classifier, parameters_lr, cv=5)
gcv_lr.fit(train_features, train_targets)



GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False),
       fit_params=None, iid='warn', n_jobs=None, param_grid={'C': [1, 10]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [22]:
# Best cross-validation score is:
print("Best Score found is: {}".format(gcv_lr.best_score_))

# Let us recalculate and see if there is any change in our ROC-AUC Score
probs = gcv_lr.predict_proba(test_features)
roc_auc_gcv_lr = roc_auc_score(test_targets, probs[:,-1])
print("New ROC-AUC Score is found to be: {}".format(roc_auc_gcv_lr))

Best Score found is: 0.770955
New ROC-AUC Score is found to be: 0.8506090373678774


### 9.2 Naive Bayes Hyperparameter Search
#### Best cross-validation score found: 0.75 (mean accuracy)
So far, there is no change for Naive Bayes either: the test ROC-AUC stays at 0.83.

In [23]:
# Let's define parameters (Only, alpha)
parameters_nb = {  
    'alpha': (1, 0.1, 0.01, 0.001, 0.0001, 0.00001)  
}

gcv_nb = GridSearchCV(nb_classifier, parameters_nb)
gcv_nb.fit(train_features, train_targets)



GridSearchCV(cv='warn', error_score='raise-deprecating',
       estimator=MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'alpha': (1, 0.1, 0.01, 0.001, 0.0001, 1e-05)},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [24]:
# Best cross-validation score is:
print("Best Score found is: {}".format(gcv_nb.best_score_))

# Let us recalculate and see if there is any change in our ROC-AUC Score
probs = gcv_nb.predict_proba(test_features)
roc_auc_gcv_nb = roc_auc_score(test_targets, probs[:,-1])
print("New ROC-AUC Score is found to be: {}".format(roc_auc_gcv_nb))

Best Score found is: 0.74708
New ROC-AUC Score is found to be: 0.832812297754705


### 9.3 XGBoost Classifier Hyperparameter Search
#### Best cross-validation score found: 0.69 (mean accuracy)

In [25]:
# Let's define parameters (only alpha)
parameters_xgb = {'alpha':[0.0001, 0.001, 0.01, 0.1, 1, 10]}
gcv_xgb = GridSearchCV(xg_classifier, parameters_xgb)
gcv_xgb.fit(train_features, train_targets)



GridSearchCV(cv='warn', error_score='raise-deprecating',
       estimator=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bynode=1, colsample_bytree=1, gamma=0, learning_rate=0.1,
       max_delta_step=0, max_depth=3, min_child_weight=1, missing=None,
       n_estimators=100, n_jobs=1, nthread=None,
       objective='binary:logistic', random_state=0, reg_alpha=0,
       reg_lambda=1, scale_pos_weight=1, seed=None, silent=None,
       subsample=1, verbosity=1),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'alpha': [0.0001, 0.001, 0.01, 0.1, 1, 10]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [26]:
# Best cross-validation score is:
print("Best Score found is: {}".format(gcv_xgb.best_score_))

# Let us recalculate and see if there is any change in our ROC-AUC Score
probs = gcv_xgb.predict_proba(test_features)
roc_auc_gcv_xgb = roc_auc_score(test_targets, probs[:,-1])
print("New ROC-AUC Score is found to be: {}".format(roc_auc_gcv_xgb))

Best Score found is: 0.688375
New ROC-AUC Score is found to be: 0.762598276541667


# Testing on Real-World Sentences
Let's now test our best performing model `lr_classifier` on both a positive and a negative real-world sentence.
For this, we need to copy the `process_text()` function from the first notebook.

In [36]:
def process_text(text):
    """
    @param: text (Raw Text sentence)
    @return: final_str (Final processed string)
    
    Function -> Takes a raw string as input and processes it to extract the important features.
                It lowercases the text, strips special characters with a regex, tokenizes the words,
                removes stopwords, stems the words, then joins the resulting list back into a string.
    """
    import re
    
    # Convert text to lower and remove all special characters from it using regex
    text = text.lower()
    text = re.sub(r'[^(a-zA-Z)\s]','', text)
    
    # Tokenize the words using the word_tokenize() from nltk lib
    words = word_tokenize(text)
    
    # Only take the words whose length is greater than 2
    words = [w for w in words if len(w) > 2]
    
    # Get the stopwords for english language
    sw = stopwords.words('english')
    
    # Get only those words which are not in stopwords (those which are not stopwords)
    words = [word for word in words if word not in sw]
    
    # Get the PorterStemmer algorithm module
    stemmer = PorterStemmer()
    
    # Stem each word, stripping the more common morphological and inflexional endings
    words = [stemmer.stem(word) for word in words]
    
    # Till this point, we have a list of strings (words), we want them to be converted to a string of text
    final_str = ""
    for w in words:
        final_str += w
        final_str += " "
    
    # Return the final string
    return final_str
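
One detail of the regex in `process_text()` is worth knowing: inside a character class, parentheses are literal characters, so `[^(a-zA-Z)\s]` keeps `(` and `)` in the text. The small sketch below compares it with a letters-only variant on a made-up sentence. The function is left unchanged above so that new sentences are processed exactly like the training data, and the stray parentheses should be dropped later by the tokenizer and the length filter anyway:

```python
import re

# Hypothetical input sentence
sample = "Feeling (really) down... 100% done!"

# Pattern used in process_text(): the parentheses sit inside the character class
kept_parens = re.sub(r'[^(a-zA-Z)\s]', '', sample.lower())

# Variant without the parentheses in the class: only letters and whitespace survive
letters_only = re.sub(r'[^a-zA-Z\s]', '', sample.lower())

print(repr(kept_parens))
print(repr(letters_only))
```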

### Positive Sentence

In [44]:
# Make a sentence and process it
positive_sentence = "Having a really good time here!"
pos_sent = process_text(positive_sentence)

# Transform the sentence using TfIdf Vectorizer
vect_pos = vectorizer.transform(np.array([pos_sent]))

# Predict the Probabilities of the Sentence being Positive and Negative
pred_pos = lr_classifier.predict_proba(vect_pos)

# predict_proba returns one column per class: the first value is the probability of the text
# being positive (class 0) and the second value is the probability of it being negative (class 1)
isDepressed = pred_pos[0][0] < 0.3
print("Chances of Text Being Positive: {} %".format(pred_pos[0][0]*100))
print("Is the Person Depressed? {}".format(isDepressed))

Chances of Text Being Positive: 79.16412778833273 %
Is the Person Depressed? False


### Depressive Sentence

In [45]:
# Make a sentence and process it
negative_sentence = "Really sad that he left us and is never coming back. feel like crying"
neg_sent = process_text(negative_sentence)

# Transform the sentence using TfIdf Vectorizer
vect_neg = vectorizer.transform(np.array([neg_sent]))

# Predict the Probabilities of the Sentence being Positive and Negative
pred_neg = lr_classifier.predict_proba(vect_neg)

# predict_proba returns one column per class: the first value is the probability of the text
# being positive (class 0) and the second value is the probability of it being negative (class 1)
isDepressed = pred_neg[0][0] < 0.3
print("Chances of Text Being Positive: {} %".format(pred_neg[0][0]*100))
print("Is the Person Depressed? {}".format(isDepressed))

Chances of Text Being Positive: 0.01424067000929119 %
Is the Person Depressed? True
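
The decision rule used in both cells is a simple cut-off: a text is flagged as depressed when its probability of being positive falls below 0.3, which is the same as its probability of being negative exceeding 0.7. As a small sketch, with a hypothetical helper `is_depressed` and the two probabilities taken from the outputs above:

```python
def is_depressed(p_positive, threshold=0.3):
    """Flag a text as depressed when P(positive sentiment) is below the threshold."""
    return p_positive < threshold

# Probabilities of the positive class, taken from the two outputs above
examples = {
    "positive sentence": 0.7916412778833273,
    "depressive sentence": 0.0001424067000929119,
}

for name, p_positive in examples.items():
    p_negative = 1.0 - p_positive
    print("{}: P(negative) = {:.4f}, depressed = {}".format(
        name, p_negative, is_depressed(p_positive)))
```

Lowering the threshold makes the rule more conservative about flagging someone, while raising it flags more texts at the cost of more false alarms.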


# Conclusion
Finally, I would like to conclude this project: my best result came from the Logistic Regression model. I have also provided all the other models used in the project for users and reviewers to play around with. The models are in the `models` folder of the GitHub repository of this project.


This is Tanay Mehta, Signing Out!