## Predict Suicidal Ideation Based on Tweets

<font>Detecting suicidal ideation in online social networks is an emerging research area with major challenges. Recent research has shown that publicly available information on social media platforms holds valuable indicators for detecting individuals with suicidal intentions. The key challenge of suicide prevention is understanding and detecting the complex risk factors and warning signs that may precipitate the event. We present an approach that uses the social media platform <b>Twitter</b> to quantify suicide warning signs for individuals and to detect posts containing suicide-related content. The main originality of this approach is the automatic identification of sudden changes in a user's online behavior. To detect such changes, we use natural language processing (NLP) techniques to aggregate behavioral and textual features and pass these features through a model framework that is widely used for change detection in data.</font>

<div class="alert alert-block alert-info">
<font color='DodgerBlue'>This notebook classifies tweets as 'Positive' or 'Negative' using the following steps:
    <ul>
        <li>Import the data</li>
        <li>Data Cleaning - Removing Null, Missing Values, Renaming Columns</li>
        <li>Data Preprocessing - Lower-casing, NLTK, Removing Stop Words, Language Filtering, Lemmatization</li>
        <li>Count Vectorizer</li>
        <li>Modeling - Gaussian NB, Bernoulli NB, Random Forest, Ensemble, Decision Tree, Gradient Boosting, XGBoost, AdaBoost. Deep Learning - 1-layer LSTM, 2-layer LSTM, CNN + 2-LSTM</li>
        <li>K-Fold Cross Validation</li>
    </ul>
</font>
    </div>

### 1. Import Sentiment Train data 

In [6]:
import re
import nltk
import pickle
import numpy as np
import collections
import pandas as pd
import tensorflow as tf
from sklearn import tree
from textblob import Word 
from sklearn import metrics
import matplotlib.pyplot as plt
from wordcloud import WordCloud
from nltk.corpus import stopwords
from xgboost import XGBClassifier
from keras.models import Sequential
from sklearn.metrics import f1_score
from sklearn.metrics import recall_score
from nltk.tokenize import RegexpTokenizer
from sklearn.model_selection import KFold 
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import precision_score
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import VotingClassifier
from keras.layers import Conv1D ,MaxPooling1D
from keras.preprocessing.text import Tokenizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from keras.utils.np_utils import to_categorical
from sklearn.preprocessing import StandardScaler
from keras.layers.core import Dropout, Activation
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from keras.preprocessing.sequence import pad_sequences
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score,classification_report
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer 
from keras.layers import Dense, Embedding, LSTM, SpatialDropout1D, Dropout, Flatten
%matplotlib inline

Dataset = pd.read_csv("/Users/yeezhianliew/Desktop/Tweets.csv", encoding="ISO-8859-1")   # the file has no header row, so the first tweet is read as the column names (renamed in step 2)
Dataset.head()

Unnamed: 0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D"
0,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
1,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
2,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
3,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."
4,0,1467811372,Mon Apr 06 22:20:00 PDT 2009,NO_QUERY,joy_wolf,@Kwesidei not the whole crew
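
Because the CSV has no header row, the output above shows the first tweet being used as the column names. A minimal sketch of one way to avoid this, assuming the six fields follow the order visible above; the column names in `COLUMNS` and the two-row in-memory sample are hypothetical stand-ins, not part of the original notebook:

```python
import io
import pandas as pd

# Hypothetical column names for the six fields visible in the output above.
COLUMNS = ["Sentiment", "Tweet_ID", "Date", "Query", "Username", "Twitter_Tweet"]

# Two-row in-memory stand-in for Tweets.csv (which has no header row).
sample_csv = io.StringIO(
    '0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"Awww, that\'s a bummer."\n'
    '0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can\'t update his Facebook\n'
)

# header=None stops pandas from consuming the first tweet as the header;
# names= assigns readable column names up front.
df = pd.read_csv(sample_csv, header=None, names=COLUMNS)

# Keep only the label and the text, as the cleaning step does.
df = df[["Sentiment", "Twitter_Tweet"]]
print(df.shape)
```

Loading the data this way would make the later rename step unnecessary, at the cost of changing the column names the rest of the notebook relies on.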


### 1.1 Dataset shape

In [17]:
Dataset.shape

(250000, 6)

### 2. Data Cleaning

In [18]:
Dataset.rename(columns={
                 "@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer.  You shoulda got David Carr of Third Day to do it. ;D": "Twitter_Tweet",
                "0":'Sentiment',
                "_TheSpecialOne_":"Username"},
               inplace = True )
Dataset = Dataset.drop(['1467810369','Mon Apr 06 22:19:45 PDT 2009','NO_QUERY','Username'], axis=1)  # keep only the label and the tweet; Sentiment: 0 = negative, 4 = positive

Dataset.head()          

Unnamed: 0,Sentiment,Twitter_Tweet
0,0,is upset that he can't update his Facebook by ...
674999,0,"@leprcn I saw that! Yeah, did you get the emai..."
675000,0,Hanging with Rosy tonight and tomorrow. P.E. s...
675001,0,@BrooksLazar Yay!!! You've conformed to a soci...
675002,0,I'm hungry. I'll eat noodles and nothing but ...
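
The sentiment codes in this dataset are 0 (negative) and 4 (positive). Some classifiers expect class labels to be consecutive integers starting at 0, so remapping the codes can be a useful extra step. The sketch below is an illustration only; `LABEL_MAP` and `encode_labels` are hypothetical names that do not appear in the notebook:

```python
# Hypothetical mapping from the dataset's sentiment codes to 0/1 labels.
LABEL_MAP = {0: 0, 4: 1}  # 0 = negative, 4 = positive

def encode_labels(sentiments):
    """Map raw sentiment codes (0 or 4) to consecutive class labels (0 or 1)."""
    return [LABEL_MAP[s] for s in sentiments]

encoded = encode_labels([0, 4, 4, 0])
print(encoded)
```

On a DataFrame the same mapping could be applied with `Dataset['Sentiment'].map(LABEL_MAP)`.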


### 3. Data Preprocessing

<font color='DodgerBlue'>
    <ul><li>Lower-casing</li>
    <li>NLTK</li> 
    <li>Removing Stop Words</li>
    <li>Language Filtering</li>
        <li>Lemmatization</li></ul>
</font>

In [19]:
Dataset['lower_case'] = Dataset['Twitter_Tweet'].apply(lambda x: x.lower())       # convert to lower case

tokenizer = RegexpTokenizer(r'\w+')
Dataset['Special_word'] = Dataset.apply(lambda row: tokenizer.tokenize(row['lower_case']), axis=1)     # tokenize into words

freq = pd.Series(' '.join(Dataset['Twitter_Tweet']).split()).value_counts()[-10:]   # the 10 least frequent words
freq = list(freq.index)
Dataset['Contents'] = Dataset['Twitter_Tweet'].apply(lambda x: " ".join(x for x in x.split() if x not in freq))  # remove the least frequent words

stop = stopwords.words('english')
Dataset['stop_words'] = Dataset['Special_word'].apply(lambda x: [item for item in x if item not in stop])   # remove stop words

Dataset['stop_words'] = Dataset['stop_words'].astype('str')
Dataset['short_word'] = Dataset['stop_words'].str.findall(r'\w{3,}')            # keep only words of 3 or more characters
Dataset['string'] = Dataset['stop_words'].replace({"'": '', ',': ''}, regex=True)
Dataset['string'] = Dataset['string'].str.findall(r'\w{3,}').str.join(' ')

nltk.download('words')
words = set(nltk.corpus.words.words())
Dataset['NonEnglish'] = Dataset['string'].apply(lambda x: " ".join(x for x in x.split() if x in words))  # keep only English words

Dataset['tweet'] = Dataset['NonEnglish'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()])) # lemmatize each word to its root form
Dataset.head()

[nltk_data] Downloading package words to
[nltk_data]     /Users/yeezhianliew/nltk_data...
[nltk_data]   Package words is already up-to-date!


Unnamed: 0,Sentiment,Twitter_Tweet,lower_case,Special_word,Contents,stop_words,short_word,string,NonEnglish,tweet
0,0,is upset that he can't update his Facebook by ...,is upset that he can't update his facebook by ...,"[is, upset, that, he, can, t, update, his, fac...",is upset that he can't update his Facebook by ...,"['upset', 'update', 'facebook', 'texting', 'mi...","[upset, update, facebook, texting, might, cry,...",upset update facebook texting might cry result...,upset update might cry result school today als...,upset update might cry result school today als...
674999,0,"@leprcn I saw that! Yeah, did you get the emai...","@leprcn i saw that! yeah, did you get the emai...","[leprcn, i, saw, that, yeah, did, you, get, th...","@leprcn I saw that! Yeah, did you get the emai...","['leprcn', 'saw', 'yeah', 'get', 'email', 'sen...","[leprcn, saw, yeah, get, email, sent, couple, ...",leprcn saw yeah get email sent couple months a...,saw yeah get sent couple ago going junk mail,saw yeah get sent couple ago going junk mail
675000,0,Hanging with Rosy tonight and tomorrow. P.E. s...,hanging with rosy tonight and tomorrow. p.e. s...,"[hanging, with, rosy, tonight, and, tomorrow, ...",Hanging with Rosy tonight and tomorrow. P.E. s...,"['hanging', 'rosy', 'tonight', 'tomorrow', 'p'...","[hanging, rosy, tonight, tomorrow, starts, mon...",hanging rosy tonight tomorrow starts monday,hanging rosy tonight tomorrow,hanging rosy tonight tomorrow
675001,0,@BrooksLazar Yay!!! You've conformed to a soci...,@brookslazar yay!!! you've conformed to a soci...,"[brookslazar, yay, you, ve, conformed, to, a, ...",@BrooksLazar Yay!!! You've conformed to a soci...,"['brookslazar', 'yay', 'conformed', 'social', ...","[brookslazar, yay, conformed, social, trend, m...",brookslazar yay conformed social trend miss se...,social trend miss seeing everyday,social trend miss seeing everyday
675002,0,I'm hungry. I'll eat noodles and nothing but ...,i'm hungry. i'll eat noodles and nothing but ...,"[i, m, hungry, i, ll, eat, noodles, and, nothi...",I'm hungry. I'll eat noodles and nothing but n...,"['hungry', 'eat', 'noodles', 'nothing', 'noodl...","[hungry, eat, noodles, nothing, noodles, bad]",hungry eat noodles nothing noodles bad,hungry eat nothing bad,hungry eat nothing bad
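
The preprocessing steps above can be condensed into a single function. This is a dependency-free sketch rather than the notebook's exact pipeline: `STOP_WORDS` and `VOCABULARY` are tiny hypothetical stand-ins for the NLTK stop-word and English word lists, `preprocess` is a hypothetical helper name, and lemmatization is omitted.

```python
import re

# Tiny hypothetical stand-ins for nltk's stopwords.words('english')
# and nltk.corpus.words.words(); the real lists are far larger.
STOP_WORDS = {"is", "that", "he", "his", "by", "and", "the", "but", "i", "can", "t"}
VOCABULARY = {"upset", "update", "might", "cry", "hungry", "eat", "nothing", "bad"}

def preprocess(tweet):
    """Lower-case, tokenize, drop stop words and short tokens, keep known words."""
    tokens = re.findall(r"\w+", tweet.lower())            # same pattern as RegexpTokenizer(r'\w+')
    tokens = [t for t in tokens if t not in STOP_WORDS]   # remove stop words
    tokens = [t for t in tokens if len(t) >= 3]           # drop tokens shorter than 3 characters
    tokens = [t for t in tokens if t in VOCABULARY]       # keep only vocabulary words
    return " ".join(tokens)

cleaned = preprocess("I'm hungry. I'll eat noodles and nothing but noodles. Bad!")
print(cleaned)
```

With these stand-in lists, "noodles" is dropped only because it is missing from the toy vocabulary, which mirrors how the language filter above discards any word not found in the NLTK word list.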


### 4. Handle Null Values

In [20]:
Dataset['Sentiment'] = Dataset['Sentiment'].fillna("")   # replace missing labels with empty strings
Dataset['tweet'] = Dataset['tweet'].fillna("")           # replace missing tweets with empty strings
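
Filling nulls with empty strings keeps rows whose cleaned tweet is empty, and such rows give the vectorizer nothing to work with. One possible alternative is to drop them. The sketch below uses a hypothetical three-row frame that borrows the `Sentiment` and `tweet` column names; it is not a step the notebook performs.

```python
import pandas as pd

# Hypothetical miniature of the cleaned frame: one tweet is missing, one is empty.
frame = pd.DataFrame({
    "Sentiment": [0, 4, 0],
    "tweet": ["hungry eat nothing bad", None, ""],
})

# Treat missing tweets as empty strings, then keep only rows with some text left.
frame["tweet"] = frame["tweet"].fillna("")
frame = frame[frame["tweet"].str.strip() != ""].reset_index(drop=True)

print(len(frame))
```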

### 5. Train Test Split

In [22]:
x_train, x_test, y_train, y_test = train_test_split(Dataset["tweet"],Dataset["Sentiment"], test_size = 0.33, random_state = 42)   

### 6. Count Vectorizer + TFIDF Transformer 

<font color='DodgerBlue'>CountVectorizer is combined with TfidfTransformer to convert the raw documents into a TF-IDF matrix, so that each word is represented by a numeric weight rather than text.</font>

In [None]:
count_vect = CountVectorizer()                        # convert a collection of text documents to a matrix of token counts
transformer = TfidfTransformer(norm='l2', sublinear_tf=True)   # re-weight counts as TF-IDF; sublinear_tf replaces tf with 1 + log(tf)

X_train_counts = count_vect.fit_transform(x_train)   # learn the vocabulary from the training set only
X_train = transformer.fit_transform(X_train_counts)

X_test_counts = count_vect.transform(x_test)          # reuse the training vocabulary for the test set
X_test = transformer.transform(X_test_counts)
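
To make the weighting concrete, the dependency-free sketch below works through the arithmetic that `TfidfTransformer(norm='l2', sublinear_tf=True)` is documented to apply with its default smoothed IDF: term frequency becomes 1 + ln(tf), IDF is ln((1 + n) / (1 + df)) + 1, and each row is scaled to unit L2 length. The two-document corpus and the helper name `tfidf_rows` are hypothetical, and whitespace splitting stands in for CountVectorizer's tokenizer.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows) with sublinear TF, smoothed IDF and L2-normalized rows."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        # Sublinear TF: 1 + ln(count) for terms present, 0 otherwise.
        weights = [(1 + math.log(counts[t])) * idf[t] if counts[t] else 0.0 for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights))
        rows.append([w / norm if norm else 0.0 for w in weights])
    return vocab, rows

vocab, rows = tfidf_rows(["hungry eat nothing bad", "hungry hungry cry"])
for row in rows:
    print([round(w, 3) for w in row])
```

In the first document, "hungry" receives a lower weight than "eat" because it also occurs in the second document, which is the effect the IDF term is meant to produce.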

### 6.1 Train dataset shape

In [208]:
print (X_train.shape, y_train.shape)

(167500, 22) (167500,)


### 6.2 Test dataset shape

In [209]:
print (X_test.shape, y_test.shape)

(82500, 22) (82500,)
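
The row counts above follow from `test_size = 0.33` applied to the 250,000 rows. A quick check of the arithmetic, assuming the test share is rounded up and the remainder goes to the training set:

```python
import math

n_samples = 250_000
test_size = 0.33

# Assumption: the test share is rounded up and the rest goes to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```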


### 7. Machine Learning Model

<font color='DodgerBlue'>Various machine learning classifiers are trained on the training set, evaluated on the test set, and validated with k-fold cross-validation.
</font>

### 7.1 Decision Tree

<font color='DodgerBlue'>Using Decision Tree Classifier for Classification and generating the Classification Report.</font>

In [148]:
model_1 = tree.DecisionTreeClassifier()
model_1.fit(X_train,y_train)
y_pred1 = model_1.predict(X_test)
pd.DataFrame(
    confusion_matrix(y_test, y_pred1),                      # rows are the actual classes, columns the predicted classes
    columns=['Predicted Negative (0)', 'Predicted Positive (4)'],
    index=['Actual Negative (0)', 'Actual Positive (4)'])

Unnamed: 0,Predicted Negative (0),Predicted Positive (4)
Actual Negative (0),23694,17379
Actual Positive (4),16381,25046


In [149]:
accuracy_score(y_test, y_pred1)                                    # accuracy is computed but not displayed; wrap it in print() to show it
print(classification_report(y_test, y_pred1))                      # precision, recall and F1 per class

             precision    recall  f1-score   support

          0       0.59      0.58      0.58     41073
          4       0.59      0.60      0.60     41427

avg / total       0.59      0.59      0.59     82500
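
The figures in the classification report can be derived from the confusion matrix above. In scikit-learn's `confusion_matrix`, rows are the true classes and columns are the predicted classes, ordered by label (0, then 4). A small dependency-free check using the counts shown above; the helper name `per_class_metrics` is hypothetical:

```python
# Confusion matrix from the Decision Tree cell: rows = true class, columns = predicted class.
LABELS = [0, 4]
MATRIX = [[23694, 17379],
          [16381, 25046]]

def per_class_metrics(matrix, labels):
    """Compute precision, recall, F1 and support for each class."""
    report = {}
    for i, label in enumerate(labels):
        true_pos = matrix[i][i]
        support = sum(matrix[i])                       # samples whose true class is `label`
        predicted = sum(row[i] for row in matrix)      # samples predicted as `label`
        precision = true_pos / predicted
        recall = true_pos / support
        f1 = 2 * precision * recall / (precision + recall)
        report[label] = (round(precision, 2), round(recall, 2), round(f1, 2), support)
    return report

report = per_class_metrics(MATRIX, LABELS)
for label, row in report.items():
    print(label, row)
```

The row sums (41073 and 41427) reproduce the support column of the report, which confirms that each row of the matrix corresponds to a true class.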



### K-Fold Validation for DT

In [25]:
scores_1 = cross_val_score(model_1, X_train,y_train, cv=3)   #3 fold validation
print(accuracy_score(y_test,y_pred1))
print ("Cross-validated scores:", scores_1)

0.6871030303030303
Cross-validated scores: [0.68182469 0.6816578  0.67895331]
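
K-fold cross-validation splits the training data into k folds and, in turn, holds each fold out for scoring while fitting on the rest. The dependency-free sketch below shows only the index bookkeeping for contiguous, unshuffled folds. `kfold_indices` is a hypothetical helper; scikit-learn's `cross_val_score` additionally stratifies the folds by class when it is given a classifier and an integer `cv`.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_indices, test_indices) for contiguous, unshuffled folds."""
    # The first n_samples % n_splits folds get one extra sample.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

folds = list(kfold_indices(10, 3))
for train, test in folds:
    print("train:", train, "test:", test)
```

Each sample lands in exactly one test fold, so the three scores printed above are measured on disjoint thirds of the training set.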


### 7.2 Random Forest

<font color='DodgerBlue'>Running the Random Forest with the following parameters and capturing the performance metrics
    <ul><li>n_estimators = 50</li></ul></font>

In [210]:
model_2 = RandomForestClassifier(n_estimators=50, random_state=0)
model_2.fit(X_train,y_train)
y_pred2 = model_2.predict(X_test)
print(accuracy_score(y_test,y_pred2))
print(classification_report(y_test, y_pred2))

0.6119272727272728
             precision    recall  f1-score   support

          0       0.60      0.64      0.62     41073
          4       0.62      0.58      0.60     41427

avg / total       0.61      0.61      0.61     82500



### K-Fold Validation for RF

In [211]:
scores_2 = cross_val_score(model_2, X_train,y_train, cv=3)   #3 fold validation
print(accuracy_score(y_test,y_pred2))
print ("Cross-validated scores:", scores_2)

0.6119272727272728
Cross-validated scores: [0.60987929 0.60720363 0.60840363]


### 7.3 Random Forest 

<font color='DodgerBlue'>Running the Random Forest with the following parameters and capturing the performance metrics
    <ul><li>n_estimators = 120</li></ul></font>

In [212]:
model_3 = RandomForestClassifier(n_estimators=120, random_state=0)
model_3.fit(X_train,y_train)
y_pred3 = model_3.predict(X_test)
print(accuracy_score(y_test,y_pred3))
print(classification_report(y_test, y_pred3))

0.6167151515151515
             precision    recall  f1-score   support

          0       0.61      0.64      0.63     41073
          4       0.63      0.59      0.61     41427

avg / total       0.62      0.62      0.62     82500



### K-Fold Validation for RF

In [213]:
scores_3 = cross_val_score(model_3, X_train,y_train, cv=3)   #3 fold validation
print(accuracy_score(y_test,y_pred3))
print ("Cross-validated scores:", scores_3)

0.6167151515151515
Cross-validated scores: [0.61575384 0.61134096 0.61048126]


### 7.4 GaussianNB

<font color='DodgerBlue'>Running GaussianNB with its default parameters and capturing the performance metrics.
    </font>

In [153]:
model_4 = GaussianNB()                                # suited to continuous features; needs dense input, so a sparse matrix must first be converted with .toarray()
model_4.fit(X_train,y_train)
y_pred4 = model_4.predict(X_test)
print(accuracy_score(y_test,y_pred4))
print(classification_report(y_test, y_pred4))

0.49773333333333336
             precision    recall  f1-score   support

          0       0.50      1.00      0.66     41073
          4       0.35      0.00      0.00     41427

avg / total       0.43      0.50      0.33     82500



### K-Fold Validation for GaussianNB

In [197]:
scores_4 = cross_val_score(model_4, X_train,y_train, cv=3)   #3 fold validation
print(accuracy_score(y_test,y_pred4))
print ("Cross-validated scores:", scores_4)

0.49773333333333336
Cross-validated scores: [0.50112834 0.50097613 0.50108359]


### 7.5 BernoulliNB

<font color='DodgerBlue'>Running BernoulliNB with fit_prior=True and capturing the performance metrics.
    </font>

In [154]:
model_5 = BernoulliNB(fit_prior=True)       # features are binarized (word present or absent), so word frequency matters less
model_5.fit(X_train,y_train)
y_pred5 = model_5.predict(X_test)
print(accuracy_score(y_test,y_pred5))
print(classification_report(y_test, y_pred5))

0.5139030303030303
             precision    recall  f1-score   support

          0       0.52      0.28      0.37     41073
          4       0.51      0.74      0.61     41427

avg / total       0.52      0.51      0.49     82500



### K-Fold Validation for BernoulliNB

In [198]:
scores_5 = cross_val_score(model_5, X_train,y_train, cv=3)   #3 fold validation
print(accuracy_score(y_test,y_pred5))
print ("Cross-validated scores:", scores_5)

0.5139030303030303
Cross-validated scores: [0.51441774 0.51028961 0.51152544]


### 7.6 GradientBoostingClassifier

<font color='DodgerBlue'>Running the GradientBoostingClassifier with n_estimators=400 and max_depth=2, and capturing the performance metrics.
    </font>

In [155]:
model_6 = GradientBoostingClassifier(n_estimators=400,
                                        max_features='auto', max_depth=2,          # tuning notes: max_features log2 = 54%, auto = 55%, sqrt = 54%
                                        random_state=1, verbose=1)                  # tuning notes: auto, max_depth=2 - 63
model_6.fit(X_train,y_train)                                                           # tuning notes: auto, max_depth=7 - 70
y_pred6 = model_6.predict(X_test)
print(accuracy_score(y_test, y_pred6))
print(classification_report(y_test, y_pred6))

      Iter       Train Loss   Remaining Time 
         1           1.3855            2.31m
         2           1.3834            2.29m
         3           1.3827            3.35m
         4           1.3817            3.71m
         5           1.3800            3.31m
         6           1.3792            3.17m
         7           1.3788            2.99m
         8           1.3783            2.84m
         9           1.3777            2.78m
        10           1.3770            2.74m
        20           1.3688            2.40m
        30           1.3660            2.23m
        40           1.3628            2.25m
        50           1.3574            2.21m
        60           1.3545            2.09m
        70           1.3501            1.98m
        80           1.3465            1.96m
        90           1.3428            1.83m
       100           1.3377            1.72m
       200           1.3164            1.21m
       300           1.3027           38.10s
       40

### K-Fold Validation for Gradient Boosting Classifier

In [199]:
scores_6 = cross_val_score(model_6, X_train,y_train, cv=3)   #3 fold validation
print(accuracy_score(y_test,y_pred6))
print ("Cross-validated scores:", scores_6)

      Iter       Train Loss   Remaining Time 
         1           1.3855           56.33s
         2           1.3834            1.05m
         3           1.3827            1.35m
         4           1.3822            1.35m
         5           1.3805            1.53m
         6           1.3795            1.68m
         7           1.3788            1.70m
         8           1.3784            1.63m
         9           1.3767            1.57m
        10           1.3764            1.51m
        20           1.3684            1.82m
        30           1.3651            1.63m
        40           1.3624            1.74m
        50           1.3598            1.92m
        60           1.3540            1.99m
        70           1.3487            1.93m
        80           1.3444            1.90m
        90           1.3419            1.93m
       100           1.3386            1.90m
       200           1.3184           57.22s
       300           1.3045           23.23s
       40

### 7.7 GradientBoostingClassifier

<font color='DodgerBlue'>Running the GradientBoostingClassifier with n_estimators=800 and max_depth=4, and capturing the performance metrics.
    </font>

In [156]:
model_7 = GradientBoostingClassifier(n_estimators=800,
                                        max_features='auto', max_depth=4,           
                                        random_state=1, verbose=1)
model_7.fit(X_train, y_train)
y_pred7 = model_7.predict(X_test)
print(accuracy_score(y_test, y_pred7))
print(classification_report(y_test, y_pred7))

      Iter       Train Loss   Remaining Time 
         1           1.3840            9.68m
         2           1.3786           12.41m
         3           1.3743           12.43m
         4           1.3707           11.99m
         5           1.3677           11.38m
         6           1.3661           10.80m
         7           1.3647           10.26m
         8           1.3634           10.22m
         9           1.3598           10.00m
        10           1.3574            9.80m
        20           1.3408            9.05m
        30           1.3291            8.23m
        40           1.3210            7.70m
        50           1.3149            7.35m
        60           1.3091            7.02m
        70           1.3033            6.86m
        80           1.2977            6.68m
        90           1.2927            6.44m
       100           1.2893            6.22m
       200           1.2527            5.28m
       300           1.2293            4.43m
       40

### K-Fold Validation for Gradient Boosting Classifier

In [157]:
scores_7 = cross_val_score(model_7, X_train,y_train, cv=3)   #3 fold validation
print(accuracy_score(y_test,y_pred7))
print ("Cross-validated scores:", scores_7)

      Iter       Train Loss   Remaining Time 
         1           1.3839            3.81m
         2           1.3787            3.94m
         3           1.3744            5.24m
         4           1.3708            5.90m
         5           1.3678            5.47m
         6           1.3645            5.15m
         7           1.3619            4.97m
         8           1.3605            4.85m
         9           1.3597            4.75m
        10           1.3583            4.76m
        20           1.3423            4.44m
        30           1.3304            4.32m
        40           1.3229            4.28m
        50           1.3140            4.08m
        60           1.3085            3.96m
        70           1.3034            4.09m
        80           1.2988            4.08m
        90           1.2944            3.90m
       100           1.2910            3.77m
       200           1.2535            3.27m
       300           1.2255            2.71m
       40

### 7.8 XGB Classifier

<font color='DodgerBlue'>XGBClassifier with following parameters and capturing the performance metrics.
    <ul><li>learning_rate =0.3</li>
        <li>n_estimators=50</li>
    <li>objective= 'binary:logistic'</li></ul></font>

In [158]:
!pip install xgboost
model_8 = XGBClassifier(learning_rate=0.3, n_estimators=50,     # XGBoost notes: built-in cross-validation, prunes splits with negative gain, handles missing values
                        gamma=0, subsample=0.8, colsample_bytree=0.8,       # learning_rate slows tree growth; subsample=0.8 is a commonly used value
                        objective='binary:logistic', scale_pos_weight=1)    # scale_pos_weight helps with high class imbalance
model_8.fit(X_train, y_train)                                                               
y_pred8 = model_8.predict(X_test)                                                          
print(accuracy_score(y_test, y_pred8))
print(classification_report(y_test, y_pred8))

0.6086909090909091
             precision    recall  f1-score   support

          0       0.60      0.62      0.61     41073
          4       0.61      0.60      0.61     41427

avg / total       0.61      0.61      0.61     82500





### K-Fold Validation of XGB

In [200]:
scores_8 = cross_val_score(model_8, X_train,y_train, cv=3)   #3 fold validation
print(accuracy_score(y_test,y_pred8))
print ("Cross-validated scores:", scores_8)



0.6086909090909091
Cross-validated scores: [0.60785543 0.60534093 0.60188419]




### 7.9 XGB Classifier

<font color='DodgerBlue'>XGBClassifier with following parameters and capturing the performance metrics.
    <ul><li>learning_rate =0.2</li>
        <li>n_estimators=500</li>
    <li>objective= 'binary:logistic'</li></ul></font>

In [159]:
model_9 =XGBClassifier(learning_rate =0.2,n_estimators=500,
                        gamma=0,subsample=0.8,colsample_bytree=0.8,
                        objective= 'binary:logistic',scale_pos_weight=1)  
model_9.fit(X_train,y_train)                                                                
y_pred9 = model_9.predict(X_test)
print(accuracy_score(y_test, y_pred9))
print(classification_report(y_test, y_pred9))

0.6648242424242424
             precision    recall  f1-score   support

          0       0.67      0.65      0.66     41073
          4       0.66      0.68      0.67     41427

avg / total       0.66      0.66      0.66     82500





### K-Fold Validation for XGB

In [201]:
scores_9 = cross_val_score(model_9, X_train,y_train, cv=3)   #3 fold validation
print(accuracy_score(y_test,y_pred9))
print ("Cross-validated scores:", scores_9)



0.6648242424242424
Cross-validated scores: [0.66113837 0.66159798 0.65678004]




### 7.10 Ensemble

<font color='DodgerBlue'>Running a soft-voting ensemble that combines gradient boosting and random forest, with gradient boosting weighted twice as heavily, and capturing the performance metrics.
    </font>
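
With `voting='soft'`, a voting ensemble averages the class probabilities predicted by each model, weighted by `weights`, and picks the class with the highest average. A dependency-free sketch of that rule; the probability values and the helper name `soft_vote` are hypothetical:

```python
def soft_vote(probabilities, weights, classes):
    """Return the class with the highest weighted-average probability."""
    total = sum(weights)
    averaged = [
        sum(w * probs[k] for probs, w in zip(probabilities, weights)) / total
        for k in range(len(classes))
    ]
    best = max(range(len(classes)), key=lambda k: averaged[k])
    return classes[best], averaged

# Hypothetical probabilities for one tweet, for classes [0, 4].
gradient_boosting_probs = [0.40, 0.60]
random_forest_probs = [0.65, 0.35]

label, averaged = soft_vote(
    [gradient_boosting_probs, random_forest_probs], weights=[2, 1], classes=[0, 4]
)
print(label, [round(p, 3) for p in averaged])
```

With equal weights the same two models would favor class 0 in this example, so the 2:1 weighting is what lets the gradient boosting model decide.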

In [214]:
gradientboosting = GradientBoostingClassifier(n_estimators=800,
                                        max_features='auto', max_depth=1,
                                        random_state=1, verbose=1)
forest= RandomForestClassifier(n_estimators=120, random_state=0)
model_10=VotingClassifier(estimators=[('Gradient Boost', gradientboosting), ('Random Forest', forest)], 
                       voting='soft', weights=[2,1])                           # weights give more influence to the model expected to perform better
model_10.fit(X_train,y_train)                                                   # soft voting predicts the class with the highest weighted average probability
y_pred10 = model_10.predict(X_test)
print(accuracy_score(y_test, y_pred10))
print(classification_report(y_test, y_pred10))

      Iter       Train Loss   Remaining Time 
         1           1.3859            2.34m
         2           1.3856            2.03m
         3           1.3853            1.82m
         4           1.3850            1.73m
         5           1.3848            1.70m
         6           1.3846            1.77m
         7           1.3844            1.71m
         8           1.3842            1.66m
         9           1.3841            1.65m
        10           1.3839            1.63m
        20           1.3829            1.52m
        30           1.3821            1.45m
        40           1.3815            1.41m
        50           1.3810            1.37m
        60           1.3805            1.35m
        70           1.3800            1.31m
        80           1.3796            1.29m
        90           1.3791            1.27m
       100           1.3787            1.24m
       200           1.3747            1.04m
       300           1.3716           51.75s
       40



### K-Fold Validation for Ensemble

In [215]:
scores_10 = cross_val_score(model_10, X_train,y_train, cv=3)   #3 fold validation
print(accuracy_score(y_test,y_pred10))
print ("Cross-validated scores:", scores_10)

      Iter       Train Loss   Remaining Time 
         1           1.3859            1.38m
         2           1.3856            1.39m
         3           1.3853            1.57m
         4           1.3851            1.61m
         5           1.3848            1.44m
         6           1.3846            1.42m
         7           1.3844            1.40m
         8           1.3843            1.41m
         9           1.3841            1.40m
        10           1.3840            1.42m
        20           1.3830            1.30m
        30           1.3822            1.18m
        40           1.3817            1.06m
        50           1.3811           59.56s
        60           1.3806           56.71s
        70           1.3802           54.24s
        80           1.3797           52.42s
        90           1.3792           50.65s
       100           1.3788           49.19s
       200           1.3748           39.09s
       300           1.3715           31.62s
       40



      Iter       Train Loss   Remaining Time 
         1           1.3859           50.63s
         2           1.3856           49.60s
         3           1.3853           49.31s
         4           1.3850           51.00s
         5           1.3848           50.46s
         6           1.3846           50.04s
         7           1.3844           49.72s
         8           1.3842           50.56s
         9           1.3840           50.24s
        10           1.3839           49.75s
        20           1.3828           48.32s
        30           1.3820           47.62s
        40           1.3814           47.16s
        50           1.3809           46.57s
        60           1.3804           46.32s
        70           1.3800           45.64s
        80           1.3795           45.12s
        90           1.3791           44.43s
       100           1.3787           43.71s
       200           1.3749           37.63s
       300           1.3718           31.48s
       40



      Iter       Train Loss   Remaining Time 
         1           1.3859           48.20s
         2           1.3855           48.54s
         3           1.3852           49.90s
         4           1.3850           51.90s
         5           1.3847           51.57s
         6           1.3845           50.84s
         7           1.3843           50.45s
         8           1.3842           51.10s
         9           1.3840           51.56s
        10           1.3838           51.22s
        20           1.3827           49.86s
        30           1.3819           49.11s
        40           1.3813           48.53s
        50           1.3807           48.09s
        60           1.3802           47.78s
        70           1.3797           47.13s
        80           1.3792           46.69s
        90           1.3787           46.06s
       100           1.3783           45.42s
       200           1.3743           38.93s
       300           1.3710           32.49s
       40



### 7.11 AdaBoost

<font color='DodgerBlue'>AdaBoost with DecisionTreeClassifier</font>

In [163]:
dt = tree.DecisionTreeClassifier() 
model_11 = AdaBoostClassifier(base_estimator=dt, learning_rate=0.8, n_estimators=30)   # n_estimators controls the number of weak learners
model_11.fit(X_train,y_train)                                                   
y_pred11 = model_11.predict(X_test)
print(accuracy_score(y_test, y_pred11))
print(classification_report(y_test, y_pred11))

0.6215757575757576
             precision    recall  f1-score   support

          0       0.62      0.61      0.62     41073
          4       0.62      0.63      0.63     41427

avg / total       0.62      0.62      0.62     82500



### K-Fold Validation for AdaBoost

In [202]:
scores_11 = cross_val_score(model_11, X_train,y_train, cv=3)   #3 fold validation
print(accuracy_score(y_test,y_pred11))
print ("Cross-validated scores:", scores_11)

0.6215757575757576
Cross-validated scores: [0.61621951 0.60575287 0.61624846]


## 8. Comparing different classifiers 

#### Comparing the metrics of the different classifiers

In [165]:
from sklearn.metrics import f1_score, precision_score, recall_score

# Test-set predictions of each classifier, in the column order of the comparison table.
predictions = {'GaussianNB': y_pred4, 'BernoulliNB': y_pred5,
               'RF(est:50)': y_pred2, 'RF(est:500)': y_pred3,
               'Ensemble': y_pred10, 'XGB(n:50)': y_pred8,
               'DT': y_pred1, 'XGB(n:500)': y_pred9,
               'GB(n:400)': y_pred6, 'GB(n:800)': y_pred7,
               'AdaBoost': y_pred11}

# All metrics are reported as percentages. F1 is macro-averaged; recall and precision
# are micro-averaged, which for single-label classification equals accuracy.
Comparison = pd.DataFrame({
    name: [accuracy_score(y_test, y_pred) * 100,
           f1_score(y_test, y_pred, average='macro') * 100,
           recall_score(y_test, y_pred, average='micro') * 100,
           precision_score(y_test, y_pred, average='micro') * 100]
    for name, y_pred in predictions.items()})

Comparison.rename(index={0: 'Accuracy', 1: 'F1_score', 2: 'Recall', 3: 'Precision'}, inplace=True)
Comparison.head()

| Metric | GaussianNB | BernoulliNB | RF(est:50) | RF(est:500) | Ensemble | XGB(n:50) | DT | XGB(n:500) | GB(n:400) | GB(n:800) | AdaBoost |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Accuracy | 49.773333 | 51.390303 | 53.390303 | 53.876364 | 57.589091 | 60.869091 | 59.078788 | 66.482424 | 61.549091 | 67.701818 | 62.157576 |
| F1_score | 33.258126 | 48.635167 | 53.345975 | 53.790609 | 57.39691 | 60.865831 | 59.067795 | 66.471103 | 61.549015 | 67.685639 | 62.151768 |
| Recall | 49.773333 | 51.390303 | 53.390303 | 53.876364 | 57.589091 | 60.869091 | 59.078788 | 66.482424 | 61.549091 | 67.701818 | 62.157576 |
| Precision | 49.773333 | 51.390303 | 53.390303 | 53.876364 | 57.589091 | 60.869091 | 59.078788 | 66.482424 | 61.549091 | 67.701818 | 62.157576 |
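
The Recall and Precision rows are identical to the Accuracy row because both are computed with `average='micro'`. For single-label classification, micro-averaging pools true positives, false positives, and false negatives over all classes, and every misclassified sample counts once as a false positive (for the predicted class) and once as a false negative (for the true class). The sketch below illustrates this in plain Python with made-up labels; `micro_precision_recall` is a hypothetical helper, not a scikit-learn function.

```python
def micro_precision_recall(y_true, y_pred):
    """Pool TP/FP/FN over all classes, then compute precision and recall."""
    tp = fp = fn = 0
    for cls in set(y_true) | set(y_pred):
        for truth, pred in zip(y_true, y_pred):
            if pred == cls and truth == cls:
                tp += 1
            elif pred == cls and truth != cls:
                fp += 1
            elif pred != cls and truth == cls:
                fn += 1
    return tp / (tp + fp), tp / (tp + fn)


# Made-up labels using the notebook's class codes (0 = negative, 4 = positive).
y_true_toy = [0, 0, 0, 4, 4, 4, 4, 0]
y_pred_toy = [0, 4, 0, 4, 4, 0, 4, 0]

accuracy = sum(t == p for t, p in zip(y_true_toy, y_pred_toy)) / len(y_true_toy)
precision, recall = micro_precision_recall(y_true_toy, y_pred_toy)
print(accuracy, precision, recall)
```

Macro-averaging (as used for the F1 row) or per-class scores would show differences between classifiers that the micro-averaged rows cannot.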


### 9. Deep Learning 

### 9.1.1.  Tokenize and pad sequence

In [166]:
n_most_common_words = 10000         
max_len = 30
tokenizer = Tokenizer(num_words=n_most_common_words)
tokenizer.fit_on_texts(Dataset['tweet'].values)
sequences = tokenizer.texts_to_sequences(Dataset['tweet'].values)
X_Deep = pad_sequences(sequences, maxlen=max_len)
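
`pad_sequences` turns the variable-length lists of word indices into fixed-length rows of `max_len` entries. The sketch below mimics that behaviour in plain Python, assuming the Keras defaults apply (`padding='pre'`, `truncating='pre'`, padding value 0), since the call above does not override them. `pad_pre` is a hypothetical stand-in and the toy sequences are made up.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences and keep the last `maxlen` items of long ones."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                       # truncate from the front
        padded.append([value] * (maxlen - len(seq)) + seq)  # pad at the front
    return padded


toy_sequences = [[5, 12, 7], [3], [9, 8, 7, 6, 5, 4]]
padded = pad_pre(toy_sequences, maxlen=4)
for row in padded:
    print(row)
```

Index 0 is safe to use as padding because the Keras `Tokenizer` assigns word indices starting at 1.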

### 9.1.2. Categorical labels

In [167]:
from keras.utils import to_categorical
Dataset.loc[Dataset['Sentiment'] == 0, 'LABEL'] = 0                   # negative
Dataset.loc[Dataset['Sentiment'] == 4, 'LABEL'] = 1                   # positive

print(Dataset['LABEL'][:10])
labels = to_categorical(Dataset['LABEL'], num_classes=2)              # one-hot encode: 0 -> [1, 0], 1 -> [0, 1]
print(labels)
if 'Sentiment' in Dataset.keys():
    Dataset = Dataset.drop(['Sentiment'], axis=1)                     # drop() returns a new frame, so assign the result

0         0.0
674998    0.0
674999    0.0
675000    0.0
675001    0.0
675002    0.0
675003    0.0
675004    0.0
675005    0.0
675006    0.0
Name: LABEL, dtype: float64
[[1. 0.]
 [1. 0.]
 [1. 0.]
 ...
 [0. 1.]
 [0. 1.]
 [0. 1.]]
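
The sentiment codes 0 and 4 are first mapped to the labels 0 and 1, and `to_categorical` then expands each label into a two-element one-hot row. That is why negative tweets print as `[1. 0.]` and positive tweets as `[0. 1.]`. A plain-Python sketch of the same transformation follows; `one_hot` and `SENTIMENT_TO_LABEL` are hypothetical names used only for illustration.

```python
# Mapping used in the cell above: sentiment 0 -> label 0, sentiment 4 -> label 1.
SENTIMENT_TO_LABEL = {0: 0, 4: 1}


def one_hot(label, num_classes=2):
    """Return a list with 1.0 at position `label` and 0.0 elsewhere."""
    row = [0.0] * num_classes
    row[label] = 1.0
    return row


toy_sentiments = [0, 0, 4, 4]
toy_labels = [one_hot(SENTIMENT_TO_LABEL[s]) for s in toy_sentiments]
for row in toy_labels:
    print(row)
```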


### 9.1.3. Train test split

In [168]:
X_train, X_test, y_train, y_test = train_test_split(X_Deep , labels, test_size=0.33, random_state=42)
print((X_train.shape, y_train.shape, X_test.shape, y_test.shape))

((167500, 30), (167500, 2), (82500, 30), (82500, 2))
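
The printed shapes are consistent with the split fractions. The quick check below uses only the numbers shown in this notebook: the two shapes above, and the `validation_split=0.2` argument passed to `fit` in the models that follow (Keras takes the validation rows from the end of the training data). Variable names are illustrative.

```python
n_train, n_test = 167500, 82500        # from the printed shapes
n_total = n_train + n_test             # rows in X_Deep
test_fraction = n_test / n_total       # should match test_size=0.33

# validation_split=0.2 holds out 20% of the training rows during fit().
n_val = round(n_train * 0.2)
n_fit = n_train - n_val

print(n_total, test_fraction, n_fit, n_val)
```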


### 10. Deep Learning Model

### 10.1.1.  1-Layer LSTM - with 15 Epochs

<font color='DodgerBlue'>Running a 1-layer LSTM with the following parameters and capturing the performance metrics:
    <ul><li>LSTM units = 128</li>
        <li>activation = softmax</li>
    <li>loss = binary cross-entropy</li></ul></font>
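
This configuration pairs a two-unit softmax output with binary cross-entropy. With one-hot targets that gives the same loss value as categorical cross-entropy: the two softmax outputs sum to 1, so each unit's binary term reduces to the negative log-probability of the true class. The check below demonstrates this numerically in plain Python with made-up logits; the helper functions are hypothetical re-implementations, not Keras calls.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def binary_crossentropy(y_true, y_prob):
    """Mean of the per-unit binary cross-entropy terms."""
    terms = [-(t * math.log(p) + (1 - t) * math.log(1 - p))
             for t, p in zip(y_true, y_prob)]
    return sum(terms) / len(terms)


def categorical_crossentropy(y_true, y_prob):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_prob))


probs = softmax([1.2, -0.4])   # made-up logits for one tweet
target = [0.0, 1.0]            # one-hot "positive"
bce = binary_crossentropy(target, probs)
cce = categorical_crossentropy(target, probs)
print(bce, cce)
```

The equivalence holds only for exactly two softmax outputs; with more classes the two losses differ and categorical cross-entropy is the usual choice.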

In [169]:
epochs = 15                                                                       # train for 15 epochs
emb_dim = 128                                                                     # embedding dimension
batch_size = 100                                                                  # samples per gradient update; larger batches make each epoch faster
model0 = Sequential()
model0.add(Embedding(n_most_common_words,emb_dim, input_length=X_Deep.shape[1]))  # map each of the 10,000 word indices to a 128-d vector
model0.add(SpatialDropout1D(0.5))                                                 # drop entire embedding channels (1D feature maps) with rate 0.5
model0.add(LSTM(128, dropout=0.5, recurrent_dropout=0.5))                         # 128 hidden units; 50% dropout on the inputs and on the recurrent state
model0.add(Dense(2, activation='softmax'))                                        # two class probabilities that sum to 1
model0.compile(optimizer=tf.train.AdamOptimizer(),loss='binary_crossentropy', metrics=['acc']) # Adam optimizer, binary cross-entropy loss, accuracy metric
print(model0.summary())                                                                             # show the layer-by-layer summary
history0 = model0.fit(X_train, y_train, epochs=epochs, batch_size=batch_size,validation_split=0.2)  # hold out 20% of the training data for validation

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_18 (Embedding)     (None, 30, 128)           1280000   
_________________________________________________________________
spatial_dropout1d_9 (Spatial (None, 30, 128)           0         
_________________________________________________________________
lstm_24 (LSTM)               (None, 128)               131584    
_________________________________________________________________
dense_16 (Dense)             (None, 2)                 258       
Total params: 1,411,842
Trainable params: 1,411,842
Non-trainable params: 0
_________________________________________________________________
None
Train on 134000 samples, validate on 33500 samples
Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15
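
The parameter counts in the model summary above can be reproduced by hand. An embedding layer stores one vector per vocabulary index; an LSTM layer has four gates, each with input weights, recurrent weights, and a bias; a dense layer has one weight per input-output pair plus one bias per output. The short check below encodes these formulas (helper names are hypothetical) and compares them with the 1-layer LSTM summary.

```python
def embedding_params(vocab_size, emb_dim):
    return vocab_size * emb_dim


def lstm_params(input_dim, units):
    # 4 gates, each with (input_dim + units) weights and 1 bias per unit.
    return 4 * (input_dim + units + 1) * units


def dense_params(input_dim, units):
    return input_dim * units + units


counts = {
    'embedding': embedding_params(10000, 128),
    'lstm': lstm_params(128, 128),
    'dense': dense_params(128, 2),
}
total_params = sum(counts.values())
print(counts, total_params)
```

Dropout layers add no parameters, which is why `SpatialDropout1D` shows 0 in the summary.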


### 10.1.2.  2-Layer LSTM - with 15 Epochs 

<font color='DodgerBlue'>Running a 2-layer LSTM with the following parameters and capturing the performance metrics:
    <ul><li>first LSTM layer = 100 units</li>
         <li>second LSTM layer = 50 units</li>
        <li>activation = softmax</li>
    <li>loss = binary cross-entropy</li></ul></font>

In [187]:
epochs = 15                                                                       # train for 15 epochs
emb_dim = 256                                                                     # embedding dimension
batch_size = 200                                                                  # samples per gradient update
model2 = Sequential()
model2.add(Embedding(n_most_common_words,emb_dim ,input_length=X_Deep.shape[1]))
model2.add(LSTM(100, dropout=0.5, recurrent_dropout=0.5, return_sequences=True))  # return the full sequence so the next LSTM receives one vector per time step
model2.add(LSTM(50, dropout=0.5, recurrent_dropout=0.5))                          # return only the final hidden state
model2.add(Dense(2, activation='softmax'))                                        # two class probabilities that sum to 1
model2.compile(optimizer=tf.train.AdamOptimizer(),loss='binary_crossentropy', metrics=['acc'])
print(model2.summary())
history2 = model2.fit(X_train, y_train, epochs=epochs, batch_size=batch_size,validation_split=0.2)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_26 (Embedding)     (None, 30, 256)           2560000   
_________________________________________________________________
lstm_39 (LSTM)               (None, 30, 100)           142800    
_________________________________________________________________
lstm_40 (LSTM)               (None, 50)                30200     
_________________________________________________________________
dense_24 (Dense)             (None, 2)                 102       
Total params: 2,733,102
Trainable params: 2,733,102
Non-trainable params: 0
_________________________________________________________________
None
Train on 134000 samples, validate on 33500 samples
Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


###  10.1.3. CNN + 2LSTM with 15 epochs

<font color='DodgerBlue'>Running a CNN followed by a 2-layer LSTM with the following parameters and capturing the performance metrics:
    <ul><li>first LSTM layer = 300 units</li>
         <li>second LSTM layer = 150 units</li>
        <li>activation = relu (convolution), softmax (output)</li>
        <li>pooling size = 5</li>
    <li>loss = binary cross-entropy</li></ul></font>

In [188]:
epochs = 15
emb_dim = 128
batch_size = 1000
model1 = Sequential()
model1.add(Embedding(n_most_common_words,emb_dim, input_length=X_Deep.shape[1]))   # map word indices to 128-d vectors
model1.add(Conv1D(128, 3, padding='same'))               # 128 filters over windows of 3 tokens (Conv1D suits text, Conv2D images); 'same' padding keeps the length at 30
model1.add(Activation('relu'))                            # ReLU sets negative activations to 0
model1.add(MaxPooling1D(pool_size=5))                   # keep the maximum of every 5 steps (30 -> 6), reducing computation in the layers that follow
model1.add(Dropout(0.2))                                  # randomly zero 20% of the pooled features during training
model1.add(LSTM(300, dropout=0.5, recurrent_dropout=0.5, return_sequences=True))   # first LSTM passes its full output sequence on
model1.add(LSTM(150, dropout=0.5, recurrent_dropout=0.7))                          # second LSTM returns only its final hidden state
model1.add(Dense(2, activation='softmax'))                                         # two class probabilities that sum to 1
model1.compile(optimizer=tf.train.AdamOptimizer(),loss='binary_crossentropy', metrics=['acc'])
print(model1.summary())
history1 = model1.fit(X_train, y_train, epochs=epochs, batch_size=batch_size,validation_split=0.2)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_27 (Embedding)     (None, 30, 128)           1280000   
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 30, 128)           49280     
_________________________________________________________________
activation_2 (Activation)    (None, 30, 128)           0         
_________________________________________________________________
max_pooling1d_2 (MaxPooling1 (None, 6, 128)            0         
_________________________________________________________________
dropout_2 (Dropout)          (None, 6, 128)            0         
_________________________________________________________________
lstm_41 (LSTM)               (None, 6, 300)            514800    
_________________________________________________________________
lstm_42 (LSTM)               (None, 150)               270600    
__________
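
The convolution and pooling rows of the summary above can be checked with a little arithmetic. A `Conv1D` layer has `kernel_size * in_channels * filters` weights plus one bias per filter, and with `padding='same'` it keeps the sequence length. `MaxPooling1D(pool_size=5)`, assuming its default stride (equal to the pool size) and no padding, shortens 30 steps to 6. The helper names below are hypothetical.

```python
def conv1d_params(kernel_size, in_channels, filters):
    # One kernel of shape (kernel_size, in_channels) per filter, plus a bias each.
    return kernel_size * in_channels * filters + filters


def pooled_length(length, pool_size):
    # Non-overlapping windows (stride == pool_size), no padding.
    return (length - pool_size) // pool_size + 1


conv_params = conv1d_params(kernel_size=3, in_channels=128, filters=128)
steps_after_pool = pooled_length(length=30, pool_size=5)
print(conv_params, steps_after_pool)
```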

### 11. Save Model in hdf5 file

In [179]:
!pip install h5py
from keras.models import model_from_json                     # used later to rebuild the model from its JSON description
model1_json = model1.to_json()                                  # serialize the architecture (not the weights) to a JSON string
with open("/Users/yeezhianliew/Desktop/model1.json", "w") as json_file:
    json_file.write(model1_json)
model1.save_weights("/Users/yeezhianliew/Desktop/model1.h5")    # store the weights in HDF5 format
print("Saved model to disk")

Saved model to disk
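
The cell above imports `model_from_json` but does not use it. It is the counterpart for restoring the model later. The sketch below shows one way the saved architecture and weights might be loaded back. `load_saved_model` and its path arguments are hypothetical, and the `compile` settings are an assumption that mirrors the training configuration, with Keras's built-in `'adam'` optimizer swapped in for `tf.train.AdamOptimizer()`.

```python
def load_saved_model(json_path, weights_path):
    """Rebuild a Keras model from a JSON architecture file and a weights file."""
    # Imported lazily so this helper can be defined without Keras installed.
    from keras.models import model_from_json

    with open(json_path) as json_file:
        model = model_from_json(json_file.read())   # architecture only
    model.load_weights(weights_path)                # trained weights

    # Re-compile before evaluating or continuing training (assumed settings).
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
    return model
```

Calling the helper with the two files written above should return a model ready for `predict` or `evaluate`, provided the same Keras version is used for saving and loading.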
