The workflow of this notebook:
    1. Read the data from 'olidtrain.csv'.
    2. Data manipulation (cleaning texts, removing duplicates, renaming columns, etc.).
    3. Data exploration lives in a separate repository, in the Jupyter notebook 'NLP_Offensive_Speech_Exploratory_Analysis'.
    4. Feature engineering: TF, TF-IDF, and Word2Vec Embeddings (GloVe).
    5. Models (Logistic Regression, Naive Bayes, Random Forest, Support Vector Machine).
    6. Deep Learning (Deep Neural Network, Convolutional Neural Network, LSTM).
    7. A .py version of the whole project is in a separate repository, 'Offensive_Language_Detection_NLP'.

Installing required packages

In [1]:
!pip3 install regex



In [373]:
import re
import numpy as np
import csv
import pandas as pd
from time import time

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn import svm
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils import resample
from sklearn.model_selection import StratifiedShuffleSplit

from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.utils import shuffle

import gensim
from gensim import utils
from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument
from tqdm import tqdm
tqdm.pandas(desc="progress-bar")

import string
import nltk
from nltk.tokenize import word_tokenize
import itertools
from collections import defaultdict
from nltk.corpus import stopwords 

### 1. Read Data

In [7]:
df_train = pd.read_csv('olidtrain.csv')
print(len(df_train))
df_train.head()

13240


Unnamed: 0,id,tweet,subtask_a,subtask_b,subtask_c
0,86426,@USER She should ask a few native Americans wh...,OFF,UNT,
1,90194,@USER @USER Go home you’re drunk!!! @USER #MAG...,OFF,TIN,IND
2,16820,Amazon is investigating Chinese employees who ...,NOT,,
3,62688,"@USER Someone should'veTaken"" this piece of sh...",OFF,UNT,
4,43605,@USER @USER Obama wanted liberals &amp; illega...,NOT,,


### 2. Data Manipulation

#### 2.1 Removing the duplicated tweets

In [8]:
df_train[df_train.duplicated(['tweet'])]

Unnamed: 0,id,tweet,subtask_a,subtask_b,subtask_c
2723,16759,@USER @USER @USER @USER @USER @USER @USER @USE...,NOT,,
3121,32524,@USER Fuck off,OFF,TIN,IND
3877,65857,A new bill aims to send masked Antifa activist...,NOT,,
4072,22953,@USER Looks Like The Jokes On Liberals Again. ...,NOT,,
4103,67789,@USER Bullshit.,NOT,,
4222,15862,@USER @USER @USER @USER @USER @USER @USER @USE...,NOT,,
5154,64333,@USER @USER He is.,NOT,,
5306,46503,@USER An obvious last minute liberal ploy to d...,NOT,,
6883,17969,#Conservatives left #frustrated as #Congress p...,NOT,,
7135,88916,@USER Liberals ruin everything.,NOT,,


In [9]:
# drop duplicated tweets; keep=False removes every copy, including the first occurrence
df_train.drop_duplicates(subset='tweet', keep=False, inplace=True)

In [10]:
print('After removing duplicated tweets:')
print(len(df_train))

After removing duplicated tweets:
13179
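
Because `keep=False` is passed to `drop_duplicates`, every copy of a repeated tweet is dropped, including its first occurrence; that is why 61 rows disappear (13240 to 13179). The plain-Python sketch below contrasts that behaviour with keeping the first occurrence. The list is made up, the helper names are hypothetical, and pandas is deliberately not used.

```python
from collections import Counter

def drop_all_duplicates(items):
    """Mimic keep=False: remove every item that occurs more than once."""
    counts = Counter(items)
    return [x for x in items if counts[x] == 1]

def keep_first_occurrence(items):
    """Mimic keep='first': keep one copy of each item, in original order."""
    seen = set()
    kept = []
    for x in items:
        if x not in seen:
            seen.add(x)
            kept.append(x)
    return kept

tweets = ["a", "b", "a", "c", "b", "d"]   # toy data, not from the dataset
print(drop_all_duplicates(tweets))    # ['c', 'd']
print(keep_first_occurrence(tweets))  # ['a', 'b', 'c', 'd']
```

Which behaviour is preferable depends on whether repeated tweets are considered noise or valid single examples.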


#### 2.2 Reformatting the dataframe

In [21]:
# take an explicit copy so that later column assignments do not act on a slice of df_train
df = df_train[['tweet','subtask_a']].copy()
df.head()

Unnamed: 0,tweet,subtask_a
0,@USER She should ask a few native Americans wh...,OFF
1,@USER @USER Go home you’re drunk!!! @USER #MAG...,OFF
2,Amazon is investigating Chinese employees who ...,NOT
3,"@USER Someone should'veTaken"" this piece of sh...",OFF
4,@USER @USER Obama wanted liberals &amp; illega...,NOT


In [22]:
# change the label to 1 and 0
df['subtask_a'] = (df['subtask_a'] == 'OFF').astype(int)
df.head()



Unnamed: 0,tweet,subtask_a
0,@USER She should ask a few native Americans wh...,1
1,@USER @USER Go home you’re drunk!!! @USER #MAG...,1
2,Amazon is investigating Chinese employees who ...,0
3,"@USER Someone should'veTaken"" this piece of sh...",1
4,@USER @USER Obama wanted liberals &amp; illega...,0


In [24]:
#rename the column name
df.columns = ['text', 'label']
df.head()

Unnamed: 0,text,label
0,@USER She should ask a few native Americans wh...,1
1,@USER @USER Go home you’re drunk!!! @USER #MAG...,1
2,Amazon is investigating Chinese employees who ...,0
3,"@USER Someone should'veTaken"" this piece of sh...",1
4,@USER @USER Obama wanted liberals &amp; illega...,0


#### 2.3 Cleaning tweets

In [374]:
def cleaning(txt):
    # build the stopword set once per call: NLTK English stopwords plus
    # Twitter placeholders and punctuation tokens
    Extra = ["@USER","USER","URL",".",";",":","/","\\",",","#","@","$","&",")","(","\""]
    stopword = set(stopwords.words('english')) | set(Extra)

    cleanReview = ''
    for i in word_tokenize(txt):
        # replace consecutive non-ASCII characters with a space
        i = re.sub(r'[^\x00-\x7F]+',' ',i)
        # (general punctuation stripping via str.translate is disabled;
        #  only the punctuation tokens listed in Extra are dropped)
        # remove digits
        i = re.sub(r'\d+', '', i)
        # remove html tags
        i = re.sub(r'(?:<[^>]+>)', '',i)
        # remove all whitespace inside the token
        i = re.sub(r"\s+","", i)
        # keep fully upper-case tokens (e.g. MAGA); lower-case everything else
        if not i.isupper():
            i = i.lower()
        if i not in stopword:
            cleanReview = cleanReview + ' ' + i
    return cleanReview
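
To make the per-token steps easier to check in isolation, here is a minimal sketch that applies the same regular expressions to single tokens. It leaves out NLTK tokenization and stopword filtering, and `clean_token` is a hypothetical helper name, not part of the notebook.

```python
import re

def clean_token(token):
    """Apply the per-token steps of the cleaning function to one token."""
    token = re.sub(r'[^\x00-\x7F]+', ' ', token)  # non-ASCII runs -> space
    token = re.sub(r'\d+', '', token)             # drop digits
    token = re.sub(r'<[^>]+>', '', token)         # drop HTML-like tags
    token = re.sub(r'\s+', '', token)             # drop all whitespace
    # fully upper-case tokens are kept; everything else is lower-cased
    return token if token.isupper() else token.lower()

for raw in ["MAGA", "Hello", "h3llo", "you\u2019re"]:
    print(repr(raw), "->", repr(clean_token(raw)))
```

Note that a token made only of digits becomes an empty string, which the notebook's function still appends (with a leading space) unless it is filtered out.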

In [30]:
# apply cleaning to all tweets
df['text'] = df['text'].apply(lambda x :cleaning(x)) 
df.head()



Unnamed: 0,text,label
0,ask native americans take,1
1,go home drunk ! ! ! MAGA trump,1
2,amazon investigating chinese employees sellin...,0
3,someone should'vetaken '' piece shit volcano '',1
4,obama wanted liberals amp illegals move red s...,0


### 3. Data Exploration

* Data exploration is in a separate repository, 'NLP_Offensive_Speech_Exploratory_Analysis'.

### 4. Feature Engineering

* This project uses TF (raw term counts), TF-IDF, and pre-trained GloVe word embeddings as features. Feature extraction is performed inside the model-building functions in section 5.
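
As a rough illustration of how TF and TF-IDF differ, the sketch below computes both on a toy corpus in plain Python. It assumes the smoothed IDF formula that scikit-learn documents for its TF-IDF defaults, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and it omits the L2 row normalisation that scikit-learn applies afterwards, so the numbers will not match `TfidfVectorizer` output exactly. The corpus is made up.

```python
import math
from collections import Counter

docs = ["go home drunk", "go home", "amazon investigating employees"]  # toy corpus
tokenized = [d.split() for d in docs]

# TF: raw count of each term in each document
tf = [Counter(tokens) for tokens in tokenized]

# document frequency: number of documents containing each term
df_counts = Counter(term for tokens in tokenized for term in set(tokens))
n_docs = len(docs)

def idf(term):
    """Smoothed inverse document frequency."""
    return math.log((1 + n_docs) / (1 + df_counts[term])) + 1

# TF-IDF weight = term count * idf (no length normalisation in this sketch)
tfidf = [{t: c * idf(t) for t, c in counts.items()} for counts in tf]

print(dict(tf[0]))
print({t: round(w, 3) for t, w in tfidf[0].items()})
```

Terms that appear in fewer documents ("drunk") receive a larger weight than terms shared across documents ("go", "home").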

### 5. Building Models

#### 5.0 Baseline Model Using Polarity

* If the polarity score is below 0, the tweet is classified as 'Offensive'; otherwise (a score of 0 or above)
  the tweet is classified as 'Not Offensive'.

In [36]:
from textblob import TextBlob
%config InlineBackend.figure_format = 'retina'

In [37]:
%%time
tbresult = [TextBlob(i).sentiment.polarity for i in df['text']]
tbpred = [1 if n<0 else 0 for n in tbresult]

CPU times: user 4.19 s, sys: 244 ms, total: 4.44 s
Wall time: 4.63 s


In [43]:
# note: the predictions are passed first, so rows are predicted labels and columns are true labels
result = confusion_matrix(tbpred,df['label'])
print ("Accuracy Score: {0:.2f}%".format(accuracy_score(df['label'], tbpred)*100))
print ("-"*80)
print ("Confusion Matrix\n")
print (pd.DataFrame(result))
print ("-"*80)
print ("Classification Report\n")
print (classification_report(df['label'], tbpred))

Accuracy Score: 68.51%
--------------------------------------------------------------------------------
Confusion Matrix

      0     1
0  7159  2515
1  1635  1870
--------------------------------------------------------------------------------
Classification Report

              precision    recall  f1-score   support

           0       0.74      0.81      0.78      8794
           1       0.53      0.43      0.47      4385

    accuracy                           0.69     13179
   macro avg       0.64      0.62      0.62     13179
weighted avg       0.67      0.69      0.68     13179
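
scikit-learn's `confusion_matrix(y_true, y_pred)` puts true labels on the rows and predictions on the columns. In the cell above the predictions are passed first, so the printed matrix is the transpose: rows are predictions and columns are true labels (the column sums 8794 and 4385 match the supports). The sketch below reads the metrics off the printed matrix under that interpretation and shows, with a hypothetical pure-Python counter and toy labels, that swapping the arguments transposes the result.

```python
def count_matrix(rows, cols, n_classes=2):
    """Count pairs: entry [r][c] is how often rows == r and cols == c."""
    m = [[0] * n_classes for _ in range(n_classes)]
    for r, c in zip(rows, cols):
        m[r][c] += 1
    return m

# Matrix printed above: rows = predicted label, columns = true label.
pred_by_true = [[7159, 2515],
                [1635, 1870]]

tp = pred_by_true[1][1]                                # predicted 1, truly 1
predicted_pos = sum(pred_by_true[1])                   # everything predicted as 1
actual_pos = pred_by_true[0][1] + pred_by_true[1][1]   # everything truly 1
total = sum(map(sum, pred_by_true))

accuracy = (pred_by_true[0][0] + tp) / total
precision = tp / predicted_pos
recall = tp / actual_pos
print(f"accuracy={accuracy:.4f} precision={precision:.2f} recall={recall:.2f}")

# Swapping the arguments transposes the matrix (toy labels below).
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 1]
a = count_matrix(y_true, y_pred)   # rows = true, columns = predicted
b = count_matrix(y_pred, y_true)   # rows = predicted, columns = true
print(a, b)
```

The derived values agree with the accuracy (68.51%) and the class-1 precision and recall (0.53 and 0.43) in the classification report.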



#### 5.1 Test Train Split

In [325]:
def train_test_split(df,sampling_method):
    split = StratifiedShuffleSplit(n_splits=1,test_size=0.2,random_state =42)

    for train_index,test_index in split.split(df,df['label']):
        strat_train_set = df.iloc[train_index]
        strat_test_set = df.iloc[test_index]
    if sampling_method =='Oversampling':
        strat_train_set = OverSampling(strat_train_set)
    trainTweet=strat_train_set['text']
    testTweet = strat_test_set['text']
    trainLabel=strat_train_set['label']
    testLabel=strat_test_set['label']
    return trainTweet,testTweet,trainLabel,testLabel
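
The idea behind the stratified split can be sketched without scikit-learn: shuffle the indices of each class separately and take the same fraction of each class for the test set. This is only an illustration of the principle, not the algorithm `StratifiedShuffleSplit` uses internally, and `stratified_split` is a hypothetical helper.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with per-class proportions preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

labels = [0] * 80 + [1] * 40          # toy labels with a 2:1 imbalance
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
print(sum(labels[i] for i in test_idx), "positive examples in the test set")
```

Both subsets keep the 2:1 class ratio of the toy labels, which is the property the notebook relies on when it reports per-class support on the test set.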
   

#### 5.2 Oversampling

In [324]:
def OverSampling(strat_train_set):
    df_majority = strat_train_set[strat_train_set['label'] == 0]
    df_minority = strat_train_set[strat_train_set['label'] == 1]
    major_count = len(df_majority)
    # oversample minority class
    df_minority_oversampled = resample(df_minority, 
                                 replace = True,              # sample with replacement
                                 n_samples = major_count,     # to match majority class 
                                 random_state = 42)    
     
         
    strat_train_set = pd.concat([df_majority, df_minority_oversampled])   # Combine majority class with oversampled minority class
    print("Train dataset calss distribution: \n", strat_train_set.label.value_counts())
  
    return strat_train_set
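
The same oversampling idea in plain Python, as a sketch under simplifying assumptions (exactly two classes, rows as tuples, a hypothetical `oversample_minority` helper): the minority rows are sampled with replacement until they match the majority count, and only then combined with the majority rows.

```python
import random
from collections import Counter

def oversample_minority(rows, label_of, seed=42):
    """Resample the minority class with replacement up to the majority size."""
    rng = random.Random(seed)
    counts = Counter(label_of(r) for r in rows)
    (major, n_major), (minor, _) = counts.most_common(2)
    majority = [r for r in rows if label_of(r) == major]
    minority = [r for r in rows if label_of(r) == minor]
    resampled = rng.choices(minority, k=n_major)   # sampling with replacement
    return majority + resampled

# toy rows: (text, label), six of class 0 and three of class 1
rows = [("t%d" % i, 0) for i in range(6)] + [("o%d" % i, 1) for i in range(3)]
balanced = oversample_minority(rows, label_of=lambda r: r[1])
print(Counter(label for _, label in balanced))
```

As in the notebook, resampling should touch the training set only; duplicating minority rows before the split would leak copies of test examples into training.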


#### 5.3 A Function of Features (TF/TFIDF) + Classifiers + Sampling Method

In [367]:
def model_checker(vectorizer,classifier,sampling_method):

    print(classifier)
    print('\n')
    
    trainTweet,testTweet,trainLabel,testLabel = train_test_split(df,sampling_method)
    
    pipeline = Pipeline([('vectorizer',vectorizer),
                               ('classifier',classifier)])
    
    t0 = time() 
    sentiment_fit = pipeline.fit(trainTweet,trainLabel)  
    y_pred = sentiment_fit.predict(testTweet)
    train_test_time = time() - t0
    
    accuracy = accuracy_score(testLabel,y_pred)
    # predictions are passed first, so rows are predicted labels and columns are true labels
    confusion_result = confusion_matrix(y_pred,testLabel)
    
    print("accuracy score: {0:.2f}%".format(accuracy*100))
    print("train and test time: {0:.2f}s".format(train_test_time))
    print('-'*80)
    print ("Confusion Matrix\n")
    print (pd.DataFrame(confusion_result))
    print('-'*80)
    print ("Classification Report\n")
    print (classification_report(testLabel,y_pred))    

#### 5.4 TF/TFIDF + Models

* A few models are listed below; many other combinations of vectorizer, classifier, and sampling method are possible.
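
The vectorizers in this section share `token_pattern=r'\w{1,}'` and `ngram_range=(1, 1)`. The sketch below imitates, in plain Python, how a word analyzer with that pattern produces unigrams and what a range of `(1, 2)` would add. It lower-cases the text first, as `CountVectorizer` and `TfidfVectorizer` do by default, so tokens that the cleaning step kept in upper case are lower-cased at this stage. `word_ngrams` is a hypothetical helper, not a scikit-learn function.

```python
import re

def word_ngrams(text, ngram_range=(1, 1), token_pattern=r'\w{1,}'):
    """Tokenize with a regex and emit space-joined n-grams."""
    tokens = re.findall(token_pattern, text.lower())
    lo, hi = ngram_range
    grams = []
    for n in range(lo, hi + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

text = "go home drunk ! ! ! MAGA trump"   # a cleaned tweet from the table above
print(word_ngrams(text))
print(word_ngrams(text, ngram_range=(1, 2)))
```

The punctuation tokens disappear because `\w` does not match them, so the `!` characters that survive cleaning never become features.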

##### 5.4.1 Logistic Regression + CountVectorizer + Unigram + Without Oversampling

In [371]:
word_vectorizer = CountVectorizer(
    strip_accents='unicode',
    analyzer='word',
    token_pattern=r'\w{1,}',
    ngram_range=(1, 1), 
    max_features=10000)
lr = LogisticRegression(solver ='liblinear',penalty='l1')
model_checker(word_vectorizer,lr,'None') # without oversampling

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l1',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)


accuracy score: 76.10%
train and test time: 2.62s
--------------------------------------------------------------------------------
Confusion Matrix

      0    1
0  1560  431
1   199  446
--------------------------------------------------------------------------------
Classification Report

              precision    recall  f1-score   support

           0       0.78      0.89      0.83      1759
           1       0.69      0.51      0.59       877

    accuracy                           0.76      2636
   macro avg       0.74      0.70      0.71      2636
weighted avg       0.75      0.76      0.75      2636



##### 5.4.2 Logistic Regression + CountVectorizer + Unigram + Oversampling

In [333]:
word_vectorizer = CountVectorizer(
    strip_accents='unicode',
    analyzer='word',
    token_pattern=r'\w{1,}',
    ngram_range=(1, 1), 
    max_features=10000)
lr = LogisticRegression(solver ='liblinear',penalty='l1')
model_checker(word_vectorizer,lr,'Oversampling')

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l1',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)


Train dataset calss distribution: 
 1    7035
0    7035
Name: label, dtype: int64
accuracy score: 75.95%
train and test time: 1.13s
--------------------------------------------------------------------------------
Confusion Matrix

      0    1
0  1471  346
1   288  531
--------------------------------------------------------------------------------
Classification Report

              precision    recall  f1-score   support

           0       0.81      0.84      0.82      1759
           1       0.65      0.61      0.63       877

    accuracy                           0.76      2636
   macro avg       0.73      0.72      0.72      2636
weighted avg       0.7

##### 5.4.3 Random Forest + CountVectorizer + Unigram + Without Oversampling

In [334]:
word_vectorizer = CountVectorizer(
    strip_accents='unicode',
    analyzer='word',
    token_pattern=r'\w{1,}',
    ngram_range=(1, 1), 
    max_features=10000)
RF = RandomForestClassifier(n_estimators = 10,bootstrap=False,max_features = 'sqrt',criterion = 'entropy', random_state=100)
model_checker(word_vectorizer,RF,'None')

RandomForestClassifier(bootstrap=False, class_weight=None, criterion='entropy',
                       max_depth=None, max_features='sqrt', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=100,
                       verbose=0, warm_start=False)


accuracy score: 75.46%
train and test time: 3.42s
--------------------------------------------------------------------------------
Confusion Matrix

      0    1
0  1571  459
1   188  418
--------------------------------------------------------------------------------
Classification Report

              precision    recall  f1-score   support

           0       0.77      0.89      0.83      1759
           1       0.69      0.48      0.56       877

    accuracy                           0.75      

##### 5.4.4 Random Forest + CountVectorizer + Unigram + Oversampling

In [335]:
word_vectorizer = CountVectorizer(
    strip_accents='unicode',
    analyzer='word',
    token_pattern=r'\w{1,}',
    ngram_range=(1, 1), 
    max_features=10000)
RF = RandomForestClassifier(n_estimators = 10,bootstrap=False,max_features = 'sqrt',criterion = 'entropy', random_state=100)
model_checker(word_vectorizer,RF,'Oversampling')

RandomForestClassifier(bootstrap=False, class_weight=None, criterion='entropy',
                       max_depth=None, max_features='sqrt', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=100,
                       verbose=0, warm_start=False)


Train dataset calss distribution: 
 1    7035
0    7035
Name: label, dtype: int64
accuracy score: 75.46%
train and test time: 3.51s
--------------------------------------------------------------------------------
Confusion Matrix

      0    1
0  1623  511
1   136  366
--------------------------------------------------------------------------------
Classification Report

              precision    recall  f1-score   support

           0       0.76      0.92      0.83      1759
           1       0.7

##### 5.4.5 Random Forest + TFIDF + Unigram + Without Oversampling

In [336]:
word_vectorizer = TfidfVectorizer(
    strip_accents='unicode',
    analyzer='word', #char 66.39%
    token_pattern=r'\w{1,}',
    ngram_range=(1, 1), 
    max_features=10000)
RF = RandomForestClassifier(n_estimators = 10,bootstrap=False,max_features = 'sqrt',criterion = 'entropy', random_state=100)
model_checker(word_vectorizer,RF,'None')

RandomForestClassifier(bootstrap=False, class_weight=None, criterion='entropy',
                       max_depth=None, max_features='sqrt', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=100,
                       verbose=0, warm_start=False)


accuracy score: 75.87%
train and test time: 3.67s
--------------------------------------------------------------------------------
Confusion Matrix

      0    1
0  1622  499
1   137  378
--------------------------------------------------------------------------------
Classification Report

              precision    recall  f1-score   support

           0       0.76      0.92      0.84      1759
           1       0.73      0.43      0.54       877

    accuracy                           0.76      

##### 5.4.6 Random Forest + TFIDF + Unigram + Oversampling

In [337]:
model_checker(word_vectorizer,RF,'Oversampling')

RandomForestClassifier(bootstrap=False, class_weight=None, criterion='entropy',
                       max_depth=None, max_features='sqrt', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=100,
                       verbose=0, warm_start=False)


Train dataset calss distribution: 
 1    7035
0    7035
Name: label, dtype: int64
accuracy score: 74.81%
train and test time: 3.92s
--------------------------------------------------------------------------------
Confusion Matrix

      0    1
0  1650  555
1   109  322
--------------------------------------------------------------------------------
Classification Report

              precision    recall  f1-score   support

           0       0.75      0.94      0.83      1759
           1       0.7

##### 5.4.7 Naive Bayes + CountVectorizer + Unigram + Without Oversampling

In [338]:
NB = MultinomialNB()
word_vectorizer = CountVectorizer(
    strip_accents='unicode',
    analyzer='word', #char 66.39%
    token_pattern=r'\w{1,}',
    ngram_range=(1, 1), 
    max_features=10000)
model_checker(word_vectorizer,NB,'None')

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)


accuracy score: 74.62%
train and test time: 0.40s
--------------------------------------------------------------------------------
Confusion Matrix

      0    1
0  1493  403
1   266  474
--------------------------------------------------------------------------------
Classification Report

              precision    recall  f1-score   support

           0       0.79      0.85      0.82      1759
           1       0.64      0.54      0.59       877

    accuracy                           0.75      2636
   macro avg       0.71      0.69      0.70      2636
weighted avg       0.74      0.75      0.74      2636



##### 5.4.8 Naive Bayes + CountVectorizer + Unigram + Oversampling

In [339]:
model_checker(word_vectorizer,NB,'Oversampling')

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)


Train dataset calss distribution: 
 1    7035
0    7035
Name: label, dtype: int64
accuracy score: 70.26%
train and test time: 0.40s
--------------------------------------------------------------------------------
Confusion Matrix

      0    1
0  1305  330
1   454  547
--------------------------------------------------------------------------------
Classification Report

              precision    recall  f1-score   support

           0       0.80      0.74      0.77      1759
           1       0.55      0.62      0.58       877

    accuracy                           0.70      2636
   macro avg       0.67      0.68      0.68      2636
weighted avg       0.71      0.70      0.71      2636



##### 5.4.9 SVM + CountVectorizer + Unigram + Without Oversampling

In [340]:
SVM = svm.SVC(kernel = 'linear', probability = True, random_state = 101)
word_vectorizer = CountVectorizer(
    strip_accents='unicode',
    analyzer='word', #char - 66.39%
    token_pattern=r'\w{1,}',
    ngram_range=(1, 1), 
    max_features=10000)
model_checker(word_vectorizer,SVM,'None')

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='linear', max_iter=-1, probability=True, random_state=101,
    shrinking=True, tol=0.001, verbose=False)


accuracy score: 74.39%
train and test time: 146.28s
--------------------------------------------------------------------------------
Confusion Matrix

      0    1
0  1479  395
1   280  482
--------------------------------------------------------------------------------
Classification Report

              precision    recall  f1-score   support

           0       0.79      0.84      0.81      1759
           1       0.63      0.55      0.59       877

    accuracy                           0.74      2636
   macro avg       0.71      0.70      0.70      2636
weighted avg       0.74      0.74      0.74      2636



##### 5.4.10 SVM + CountVectorizer + Unigram + Oversampling

In [341]:
model_checker(word_vectorizer,SVM,'Oversampling')

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='linear', max_iter=-1, probability=True, random_state=101,
    shrinking=True, tol=0.001, verbose=False)


Train dataset calss distribution: 
 1    7035
0    7035
Name: label, dtype: int64
accuracy score: 74.13%
train and test time: 224.96s
--------------------------------------------------------------------------------
Confusion Matrix

      0    1
0  1460  383
1   299  494
--------------------------------------------------------------------------------
Classification Report

              precision    recall  f1-score   support

           0       0.79      0.83      0.81      1759
           1       0.62      0.56      0.59       877

    accuracy                           0.74      2636
   macro avg       0.71      0.70      0.70      2636
weighted avg       0.74      0.74      0.74      2636
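
Several of the classification reports in this section are cut off, but the confusion matrices are complete, so accuracy and the recall of the offensive class can be recomputed from them. The sketch below does that for the ten runs, reading each matrix as rows = predicted and columns = true (the order in which the arguments are passed in `model_checker`). The short run names are labels chosen here for readability.

```python
# (run name, confusion matrix as printed: rows = predicted, columns = true)
results = [
    ("LR + counts",                 [[1560, 431], [199, 446]]),
    ("LR + counts + oversampling",  [[1471, 346], [288, 531]]),
    ("RF + counts",                 [[1571, 459], [188, 418]]),
    ("RF + counts + oversampling",  [[1623, 511], [136, 366]]),
    ("RF + TF-IDF",                 [[1622, 499], [137, 378]]),
    ("RF + TF-IDF + oversampling",  [[1650, 555], [109, 322]]),
    ("NB + counts",                 [[1493, 403], [266, 474]]),
    ("NB + counts + oversampling",  [[1305, 330], [454, 547]]),
    ("SVM + counts",                [[1479, 395], [280, 482]]),
    ("SVM + counts + oversampling", [[1460, 383], [299, 494]]),
]

def accuracy(m):
    """Share of examples on the diagonal."""
    return (m[0][0] + m[1][1]) / (sum(m[0]) + sum(m[1]))

def offensive_recall(m):
    """Correctly flagged offensive tweets over all truly offensive tweets."""
    return m[1][1] / (m[0][1] + m[1][1])

for name, m in sorted(results, key=lambda r: accuracy(r[1]), reverse=True):
    print(f"{name:<28} acc={accuracy(m):.4f}  recall(OFF)={offensive_recall(m):.2f}")

best_acc = max(results, key=lambda r: accuracy(r[1]))[0]
best_recall = max(results, key=lambda r: offensive_recall(r[1]))[0]
print("best accuracy:", best_acc, "| best offensive recall:", best_recall)
```

On these numbers, logistic regression without oversampling has the highest accuracy, while Naive Bayes with oversampling recovers the largest share of offensive tweets, so the preferred model depends on which metric matters more.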



#### 5.5 Word2Vec + Models

In [270]:
def load_glove_model(glove_file):
    """
    :param glove_file: path of the GloVe text file.
    :return: dict mapping each word to its embedding (numpy array)
    """
    print("Loading Glove Model")
    model = {}
    # the context manager closes the file once loading finishes
    with open(glove_file, 'r', encoding="utf8") as f:
        for line in f:
            splitLine = line.split()
            word = splitLine[0]
            embedding = np.array([float(val) for val in splitLine[1:]])
            model[word] = embedding
    print("Done.",len(model)," words loaded!")
    return model
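
The GloVe text layout is one word per line followed by its vector components, separated by spaces. The sketch below parses two made-up 3-dimensional entries from an in-memory file to show the resulting dictionary. It stores plain lists instead of numpy arrays, and `parse_glove` is a hypothetical stand-in for the loader above.

```python
import io

def parse_glove(lines):
    """Parse 'word v1 v2 ...' lines into a dict of word -> list of floats."""
    vectors = {}
    for line in lines:
        parts = line.split()
        if not parts:
            continue            # skip blank lines
        vectors[parts[0]] = [float(v) for v in parts[1:]]
    return vectors

# two made-up 3-dimensional entries in the GloVe text layout
fake_file = io.StringIO("hello 0.1 -0.2 0.3\nfriend 0.0 0.5 -1.0\n")
toy_glove = parse_glove(fake_file)
print(toy_glove["hello"])
print(len(toy_glove), "words loaded")
```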


In [377]:
# load pre-trained glove model from https://nlp.stanford.edu/projects/glove/

In [292]:
GloveModel = load_glove_model("glove.twitter.27B.200d.txt")  

Loading Glove Model
Done. 1193514  words loaded!


In [293]:
count_total = 0        # Number of words in the tweets, counting repeated words
count_in = 0           # Number of words found in the GloVe pre-trained data
count_out = 0          # Number of words not found in the GloVe pre-trained data
out_words_list = []    # A list of the words that are not found in the GloVe pre-trained data


In [297]:
# look up the GloVe vector of each word in the tweet and return the average of the vectors found
def tweet_to_vector(tweet, GloveModel, num_features):   
    
    global count_total, count_in, count_out
    word_count = 0
    feature_vectors = np.zeros((num_features), dtype = "float32")
    
    for word in tweet.split(' '):
        count_total += 1
        if word in GloveModel:
            count_in += 1
            word_count += 1
            feature_vectors += GloveModel[word]
        else:
            count_out += 1
            out_words_list.append(word)
    if (word_count != 0):
        feature_vectors /= word_count
    return feature_vectors
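
The averaging step can be checked by hand on a toy vocabulary. The sketch below uses made-up 2-dimensional embeddings and plain lists; `average_vector` is a hypothetical helper that mirrors the logic above without the global counters.

```python
def average_vector(tweet, embeddings, num_features):
    """Average the vectors of the words that have an embedding."""
    total = [0.0] * num_features
    found = 0
    for word in tweet.split():
        vec = embeddings.get(word)
        if vec is None:
            continue            # out-of-vocabulary words are skipped
        total = [t + v for t, v in zip(total, vec)]
        found += 1
    if found:
        total = [t / found for t in total]
    return total

toy = {"hello": [1.0, 0.0], "friend": [0.0, 1.0]}   # made-up 2-d embeddings
print(average_vector("hello my friend", toy, 2))    # 'my' is not in toy
print(average_vector("unknown words", toy, 2))      # nothing found: zero vector
```

A tweet with no known words maps to the zero vector, so such tweets are indistinguishable from one another for the classifier.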

In [303]:
# example
tweet_to_vector('hello my friend',GloveModel,200)

array([ 1.57867670e-01, -7.02119991e-02, -1.88197002e-01, -1.41797006e-01,
       -3.75759989e-01,  1.30613334e-02,  5.85546672e-01,  5.71029000e-02,
       -2.23953322e-01, -2.97366858e-01, -1.39348343e-01,  2.77816653e-01,
       -4.18916672e-01,  1.90225348e-01, -2.81933337e-01, -4.32376653e-01,
        2.68950015e-01, -1.33072659e-01, -7.90596604e-02, -2.02959344e-01,
        1.28099993e-01,  7.58896694e-02, -3.03083342e-02,  1.35471657e-01,
       -3.21693346e-02,  8.35429966e-01,  8.08740035e-02,  1.01371765e-01,
        1.15334235e-01, -1.82173327e-01, -2.33048663e-01, -1.58611000e-01,
       -3.43452007e-01,  3.68813306e-01, -3.49330008e-02, -3.90546657e-02,
       -1.18868656e-01,  2.80440331e-01,  1.80993333e-01,  1.15845345e-01,
       -3.08980018e-01,  6.86606690e-02, -5.00606559e-02,  1.49970010e-01,
       -2.02950343e-01, -1.29555658e-01,  2.13615999e-01,  6.93316609e-02,
        1.42356008e-01, -7.88826719e-02,  2.55999010e-04, -1.42173335e-01,
       -2.82210320e-01, -

In [346]:
# get word2vec vector for each tweet        
def get_tweet_vectors(tweets, GloveModel, num_features):    
    curr_ind = 0
    tweet_feature_vecs = np.zeros((len(tweets), num_features), dtype = "float32")
    
    for tweet in tweets:
        if curr_ind % 2000 == 0:
            print('Word2vec vectorizing tweet %d of %d' %(curr_ind, len(tweets)))
        tweet_feature_vecs[curr_ind] = tweet_to_vector(tweet, GloveModel, num_features)
        curr_ind += 1
    return tweet_feature_vecs   

In [347]:
# build models based on word2vec embeddings    
def Word2Vec_Model(df, classifier,sampling_method):
    print(classifier)
    print('\n')
    
    trainTweet,testTweet,trainLabel,testLabel = train_test_split(df,sampling_method)        
    pipeline = Pipeline([('classifier',classifier)])
    
    global count_total, count_in, count_out
    global out_words_list
    count_total, count_in, count_out = 0, 0, 0 
    out_words_list = []    
    
    trainVec = get_tweet_vectors(trainTweet, GloveModel, 200) # it has to be same as read in txt dimension which is 200.
    testVec = get_tweet_vectors(testTweet, GloveModel, 200) # glove.twitter.27B.200d.txt
    
    print("Glove word embedding statistic\n", "count_total: %d/" %count_total, "count_in: %d/" %count_in, "count_out: %d/" %count_out)
    print("Number of unique words without embedding: %d" %len(set(out_words_list)))
    print("Words without embedding: \n", set(out_words_list))
    
    t0 = time() 
    pipeline.fit(trainVec,trainLabel)  
    y_pred = pipeline.predict(testVec)
    train_test_time = time() - t0    
                        
    accuracy = accuracy_score(testLabel,y_pred)
    # predictions are passed first, so rows are predicted labels and columns are true labels
    confusion_result = confusion_matrix(y_pred,testLabel)
    
    print("accuracy score: {0:.2f}%".format(accuracy*100))
    print("train and test time: {0:.2f}s".format(train_test_time))
    print('-'*80)
    print ("Confusion Matrix\n")
    print (pd.DataFrame(confusion_result))
    print('-'*80)
    print ("Classification Report\n")
    print (classification_report(testLabel,y_pred))

##### 5.5.1 Logistic Regression + Word2Vec Embedding

In [348]:
Word2Vec_Model(df,lr,'None')

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l1',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)


Word2vec vectorizing tweet 0 of 10543
Word2vec vectorizing tweet 2000 of 10543
Word2vec vectorizing tweet 4000 of 10543
Word2vec vectorizing tweet 6000 of 10543
Word2vec vectorizing tweet 8000 of 10543
Word2vec vectorizing tweet 10000 of 10543
Word2vec vectorizing tweet 0 of 2636
Word2vec vectorizing tweet 2000 of 2636
Glove word embedding statistic
 count_total: 176468/ count_in: 136341/ count_out: 40127/
Number of unique words without embedding: 6695
Words without embedding: 


accuracy score: 74.24%
train and test time: 5.19s
--------------------------------------------------------------------------------
Confusion Matrix

      0    1
0  1584  504
1   175  373
--------------------------------------------------------------------------------
Classification Report

              precision    recall  f1-score   support

           0       0.76      0.90      0.82      1759
           1       0.68      0.43      0.52       877

    accuracy                           0.74      2636
   macro avg       0.72      0.66      0.67      2636
weighted avg       0.73      0.74      0.72      2636
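
The embedding statistics printed in the output above translate into a token coverage of roughly 77%. One caveat, stated here as an inference from the code rather than something the notebook reports: the cleaned tweets start with a space and `tweet_to_vector` splits on `' '`, so each tweet appears to contribute at least one empty-string token that is counted as missing. The sketch below works through both points.

```python
count_total, count_in, count_out = 176468, 136341, 40127  # from the output above

# the three counters are consistent with each other
assert count_in + count_out == count_total

coverage = count_in / count_total
print(f"token coverage: {coverage:.1%}")

# splitting a string with a leading space on ' ' yields an empty first token
tokens = " go home".split(" ")
print(tokens)

# str.split() with no argument drops such empty tokens
print(" go home".split())
```

If the empty tokens were excluded, the out-of-vocabulary count would be lower than the 40127 reported, so the true coverage of real words is likely somewhat higher.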



##### 5.5.2 Random Forest + Word2Vec Embedding

In [310]:
Word2Vec_Model(df,RF,'None')

RandomForestClassifier(bootstrap=False, class_weight=None, criterion='entropy',
                       max_depth=None, max_features='sqrt', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=100,
                       verbose=0, warm_start=False)


Word2vec vectorizing tweet 0 of 10543
Word2vec vectorizing tweet 2000 of 10543
Word2vec vectorizing tweet 4000 of 10543
Word2vec vectorizing tweet 6000 of 10543
Word2vec vectorizing tweet 8000 of 10543
Word2vec vectorizing tweet 10000 of 10543
Word2vec vectorizing tweet 0 of 2636
Word2vec vectorizing tweet 2000 of 2636
Glove word embedding statistic
 count_total: 176468/ count_in: 136341/ count_out: 40127/
Number of unique words without embedding: 6695
Words without embedding: 


accuracy score: 71.62%
train and test time: 6.77s
--------------------------------------------------------------------------------
Confusion Matrix

      0    1
0  1592  581
1   167  296
--------------------------------------------------------------------------------
Classification Report

              precision    recall  f1-score   support

           0       0.73      0.91      0.81      1759
           1       0.64      0.34      0.44       877

    accuracy                           0.72      2636
   macro avg       0.69      0.62      0.63      2636
weighted avg       0.70      0.72      0.69      2636



##### 5.5.3 SVM + Word2Vec Embedding

In [312]:
SVM = svm.SVC(kernel = 'linear', probability = True, random_state = 101)

In [313]:
Word2Vec_Model(df,SVM,'None')

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='linear', max_iter=-1, probability=True, random_state=101,
    shrinking=True, tol=0.001, verbose=False)


Word2vec vectorizing tweet 0 of 10543
Word2vec vectorizing tweet 2000 of 10543
Word2vec vectorizing tweet 4000 of 10543
Word2vec vectorizing tweet 6000 of 10543
Word2vec vectorizing tweet 8000 of 10543
Word2vec vectorizing tweet 10000 of 10543
Word2vec vectorizing tweet 0 of 2636
Word2vec vectorizing tweet 2000 of 2636
Glove word embedding statistic
 count_total: 176468/ count_in: 136341/ count_out: 40127/
Number of unique words without embedding: 6695
Words without embedding: 


accuracy score: 74.20%
train and test time: 216.76s
--------------------------------------------------------------------------------
Confusion Matrix

      0    1
0  1618  539
1   141  338
--------------------------------------------------------------------------------
Classification Report

              precision    recall  f1-score   support

           0       0.75      0.92      0.83      1759
           1       0.71      0.39      0.50       877

    accuracy                           0.74      2636
   macro avg       0.73      0.65      0.66      2636
weighted avg       0.74      0.74      0.72      2636



### 6. Deep Learning 

#### 6.1 Deep Neural Network

In [150]:
from keras.models import Sequential
from keras.layers import Dense
 
model = Sequential()
 
model.add(Dense(units=500, activation='relu', input_dim=len(vectorizer.get_feature_names())))
model.add(Dense(units=1, activation='sigmoid'))
 
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

Model: "sequential_8"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_9 (Dense)              (None, 500)               2584000   
_________________________________________________________________
dense_10 (Dense)             (None, 1)                 501       
Total params: 2,584,501
Trainable params: 2,584,501
Non-trainable params: 0
_________________________________________________________________


##### Training

In [151]:
model.fit(X_train[:-100], trainLabel[:-100], 
          epochs=2, batch_size=128, verbose=1, 
          validation_data=(X_train[-100:], trainLabel[-100:]))

Train on 10443 samples, validate on 100 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.callbacks.History at 0x152507c50>

##### Evaluation

In [152]:
scores = model.evaluate(vectorizer.transform(testTweet), testLabel, verbose=1)
print("Accuracy:", scores[1])  

Accuracy: 0.7651745080947876


In [155]:
testpred_dnn

array([[0.09984761],
       [0.11356628],
       [0.10797295],
       ...,
       [0.17173888],
       [0.14873533],
       [0.03092741]], dtype=float32)

In [157]:
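testpred_dnn = model.predict(vectorizer.transform(testTweet))
# The output layer is a single sigmoid unit, so testpred_dnn has shape (n, 1) and
# argmax over axis 1 always returns 0. This is why the report below contains no
# predictions for class 1; thresholding (e.g. testpred_dnn > 0.5) would give usable labels.
testpred_dnn1 = np.argmax(testpred_dnn, axis=1)
print(classification_report(testLabel, testpred_dnn1))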

              precision    recall  f1-score   support

           0       0.67      1.00      0.80      1759
           1       0.00      0.00      0.00       877

    accuracy                           0.67      2636
   macro avg       0.33      0.50      0.40      2636
weighted avg       0.45      0.67      0.53      2636
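
The report above contains no predictions for class 1, although `model.evaluate` reports roughly 76.5% accuracy for the same model. A likely explanation is the conversion from probabilities to labels: with a single sigmoid output, `np.argmax(..., axis=1)` over a one-column array can only return 0. The sketch below illustrates this with made-up probabilities (not the model's real output) and assumes a cut-off of 0.5.

```python
import numpy as np

# Hypothetical sigmoid outputs with shape (n_samples, 1); illustrative values only.
probs = np.array([[0.10], [0.83], [0.47], [0.91]], dtype=np.float32)

# argmax over a single column can only ever return index 0.
labels_argmax = np.argmax(probs, axis=1)

# Thresholding at 0.5 (an assumed cut-off) yields the intended 0/1 labels.
labels_threshold = (probs.ravel() >= 0.5).astype(int)

print(labels_argmax)     # [0 0 0 0]
print(labels_threshold)  # [0 1 0 1]
```

The thresholded labels could then be passed to `classification_report` in place of the argmax result.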





#### 6.2 Convolutional Neural Network

In [66]:
# Map every vocabulary word to its column index in the fitted vectorizer
word2idx = {word: idx for idx, word in enumerate(vectorizer.get_feature_names())}
tokenize = vectorizer.build_tokenizer()
preprocess = vectorizer.build_preprocessor()
 
def to_sequence(tokenizer, preprocessor, index, text):
    """Convert a text to a list of vocabulary indices, dropping out-of-vocabulary words."""
    words = tokenizer(preprocessor(text))
    return [index[word] for word in words if word in index]
 
print(to_sequence(tokenize, preprocess, word2idx, "This is an important test!"))  # [2299, 4594]
X_train_sequences = [to_sequence(tokenize, preprocess, word2idx, x) for x in trainTweet]
print(X_train_sequences[0])
 

[2299, 4594]
[4669, 2659, 4626, 1967, 1941, 5062, 4782, 2101, 803, 3718, 2576, 3112, 1959, 358, 2934, 1632, 1476, 2969, 3670, 4991, 3077]


In [68]:
# Compute the maximum sequence length over the training tweets
MAX_SEQ_LENGHT = len(max(X_train_sequences, key=len))
print("MAX_SEQ_LENGHT=", MAX_SEQ_LENGHT)
 
from keras.preprocessing.sequence import pad_sequences
N_FEATURES = len(vectorizer.get_feature_names())
X_train_sequences = pad_sequences(X_train_sequences, maxlen=MAX_SEQ_LENGHT, value=N_FEATURES)
print(X_train_sequences[0])

MAX_SEQ_LENGHT= 34
[5167 5167 5167 5167 5167 5167 5167 5167 5167 5167 5167 5167 5167 4669
 2659 4626 1967 1941 5062 4782 2101  803 3718 2576 3112 1959  358 2934
 1632 1476 2969 3670 4991 3077]
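
The padded output above shows the padding index (5167, one past the largest vocabulary index) added on the left, which is the default behaviour of `pad_sequences`. The following NumPy-only sketch mimics that left-padding on toy data; `left_pad` and the toy vocabulary size are hypothetical names introduced for illustration, not part of Keras or of this notebook.

```python
import numpy as np

def left_pad(sequences, maxlen, value):
    """Left-pad (and, if needed, left-truncate) integer sequences to maxlen."""
    padded = np.full((len(sequences), maxlen), value, dtype=np.int64)
    for row, seq in enumerate(sequences):
        if not seq:
            continue                      # an empty sequence stays all padding
        kept = seq[-maxlen:]              # keep the last maxlen tokens
        padded[row, -len(kept):] = kept   # right-align the tokens
    return padded

vocab_size = 6  # hypothetical toy vocabulary; index 6 is reserved for padding
result = left_pad([[3, 1], [5, 0, 2, 4]], maxlen=4, value=vocab_size)
print(result)
# [[6 6 3 1]
#  [5 0 2 4]]
```

Because the padding index equals the vocabulary size, an embedding layer fed with these sequences needs one row more than the vocabulary has words.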


In [135]:
from keras.models import Sequential
from keras.layers import Dense, Conv1D, MaxPooling1D, Flatten, Embedding
 
model = Sequential()
model.add(Embedding(len(vectorizer.get_feature_names()) + 1,
                    64,  # Embedding size
                    input_length=MAX_SEQ_LENGHT))
model.add(Conv1D(64, 5, activation='relu'))
model.add(MaxPooling1D(5))
model.add(Flatten())
model.add(Dense(units=64, activation='relu'))
model.add(Dense(units=1, activation='sigmoid'))
 
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

Model: "sequential_7"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (None, 34, 64)            330752    
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 30, 64)            20544     
_________________________________________________________________
max_pooling1d_2 (MaxPooling1 (None, 6, 64)             0         
_________________________________________________________________
flatten_2 (Flatten)          (None, 384)               0         
_________________________________________________________________
dense_7 (Dense)              (None, 64)                24640     
_________________________________________________________________
dense_8 (Dense)              (None, 1)                 65        
Total params: 376,001
Trainable params: 376,001
Non-trainable params: 0
_________________________________________________________________
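
The shapes and parameter counts in the summary above can be checked by hand. The sketch below redoes the arithmetic, assuming a vocabulary of 5167 words (the padding index seen earlier) and a sequence length of 34; the variable names are illustrative only.

```python
vocab_size = 5167        # assumed from the padding index in the padded sequences
seq_len = 34             # MAX_SEQ_LENGHT
embed_dim = 64
filters, kernel_size = 64, 5
pool_size = 5

embedding_params = (vocab_size + 1) * embed_dim                 # +1 row for padding
conv_params = kernel_size * embed_dim * filters + filters       # weights + biases
conv_len = seq_len - kernel_size + 1                            # 'valid' convolution
pooled_len = conv_len // pool_size
flat_size = pooled_len * filters
dense_hidden_params = flat_size * 64 + 64
dense_out_params = 64 * 1 + 1

total = embedding_params + conv_params + dense_hidden_params + dense_out_params
print(conv_len, pooled_len, flat_size)   # 30 6 384
print(total)                             # 376001
```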

##### Training

In [136]:
model.fit(X_train_sequences[:-100], trainLabel[:-100], 
          epochs=3, batch_size=512, verbose=1,
          validation_data=(X_train_sequences[-100:], trainLabel[-100:]))
 

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 10443 samples, validate on 100 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.callbacks.History at 0x14ed834d0>

##### Transform

In [143]:
# Transform the text data to sequences and pad them
X_test_sequences = [to_sequence(tokenize, preprocess, word2idx, x) for x in testTweet]
X_test_sequences = pad_sequences(X_test_sequences, maxlen=MAX_SEQ_LENGHT, value=N_FEATURES)

##### Evaluation

In [144]:
scores = model.evaluate(X_test_sequences, testLabel, verbose=1)
print("Accuracy:", scores[1]) 

Accuracy: 0.6843702793121338


In [149]:
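testpred = model.predict(X_test_sequences)
# testpred has a single sigmoid column, so argmax over axis 1 labels every tweet as 0
# (see the report below); thresholding (e.g. testpred > 0.5) would give usable labels.
test_Y_pred = np.argmax(testpred, axis=1)
print(classification_report(testLabel, test_Y_pred))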

              precision    recall  f1-score   support

           0       0.67      1.00      0.80      1759
           1       0.00      0.00      0.00       877

    accuracy                           0.67      2636
   macro avg       0.33      0.50      0.40      2636
weighted avg       0.45      0.67      0.53      2636



#### 6.3 LSTM Network

A Long Short-Term Memory (LSTM) network works on sequence data: it treats a tweet as an ordered sequence of tokens rather than as a bag of words or n-grams, which makes it well suited to text classification.
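
To illustrate the difference between the two representations, the toy sketch below encodes two sentences with the same words in a different order. The vocabulary and the helpers `to_bag` and `to_seq` are hypothetical and exist only for this example.

```python
from collections import Counter

# Hypothetical toy vocabulary, for illustration only.
vocab = {"not": 0, "good": 1, "bad": 2, "just": 3}

def to_bag(text):
    """Bag of words: counts per vocabulary index, word order discarded."""
    return Counter(vocab[w] for w in text.split() if w in vocab)

def to_seq(text):
    """Sequence: vocabulary indices in their original order."""
    return [vocab[w] for w in text.split() if w in vocab]

a = "not good just bad"
b = "not bad just good"

print(to_bag(a) == to_bag(b))  # True: the bags are identical
print(to_seq(a), to_seq(b))    # [0, 1, 3, 2] [0, 2, 3, 1]
```

A bag-of-words model cannot tell the two sentences apart, whereas a sequence model such as an LSTM receives different inputs for them.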

In [124]:
from keras.models import Sequential
from keras.layers import Dense, LSTM, Embedding
 
model = Sequential()
model.add(Embedding(len(vectorizer.get_feature_names()) + 1,
                    64,  # Embedding size
                    input_length=MAX_SEQ_LENGHT))
model.add(LSTM(64))
model.add(Dense(units=1, activation='sigmoid'))
 
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

Model: "sequential_6"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 34, 64)            330752    
_________________________________________________________________
lstm_2 (LSTM)                (None, 64)                33024     
_________________________________________________________________
dense_6 (Dense)              (None, 1)                 65        
Total params: 363,841
Trainable params: 363,841
Non-trainable params: 0
_________________________________________________________________
None
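
The LSTM parameter count in the summary can be reproduced by hand: an LSTM layer has four gate-like weight blocks, each with input weights, recurrent weights and a bias. The sketch below assumes the same vocabulary size of 5167 as above; variable names are illustrative.

```python
vocab_size = 5167   # assumed from the padding index used for the sequences
embed_dim = 64
units = 64

# Each of the four blocks (input, forget, cell, output) has
# units * (input_dim + units) weights plus units biases.
per_block = units * (embed_dim + units) + units
lstm_params = 4 * per_block

embedding_params = (vocab_size + 1) * embed_dim
dense_out_params = units * 1 + 1

total = embedding_params + lstm_params + dense_out_params
print(lstm_params)  # 33024
print(total)        # 363841
```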


##### Training

In [128]:
model.fit(X_train_sequences[:-100], trainLabel[:-100], 
          epochs=2, batch_size=128, verbose=1, 
          validation_data=(X_train_sequences[-100:], trainLabel[-100:]))
 

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 10443 samples, validate on 100 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.callbacks.History at 0x149c77ed0>

##### Evaluation

In [129]:
scores = model.evaluate(X_test_sequences, testLabel, verbose=1)
print("Accuracy:", scores[1]) 

Accuracy: 0.762139618396759


In [133]:
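testpred = model.predict(X_test_sequences)
# As with the other networks, argmax over the single sigmoid column always returns 0,
# so the report below has no class 1 predictions; a 0.5 threshold would avoid this.
test_Y_pred = np.argmax(testpred, axis=1)
print(classification_report(testLabel, test_Y_pred))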

              precision    recall  f1-score   support

           0       0.67      1.00      0.80      1759
           1       0.00      0.00      0.00       877

    accuracy                           0.67      2636
   macro avg       0.33      0.50      0.40      2636
weighted avg       0.45      0.67      0.53      2636



In [134]:
confusion_matrix(testLabel,test_Y_pred)

array([[1759,    0],
       [ 877,    0]])
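
The figures in the classification report follow directly from this confusion matrix (rows are true labels, columns are predicted labels). A short sketch of the arithmetic, using the matrix printed above:

```python
import numpy as np

cm = np.array([[1759, 0],
               [877, 0]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tn + tp) / cm.sum()

# Class 0 metrics: "positive" here means predicting label 0.
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)
f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)

# Class 1 recall is 0 because no tweet was predicted as 1; its precision is
# undefined (0 / 0), which scikit-learn reports as 0 together with a warning.
recall_1 = tp / (tp + fn)

print(round(accuracy, 2), round(precision_0, 2), round(recall_0, 2), round(f1_0, 2))
# 0.67 0.67 1.0 0.8
print(recall_1)  # 0.0
```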

### 7. Conclusion

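Comparing the results above, the deep learning models do not necessarily outperform the traditional models. On the test set, the deep neural network (76.5% accuracy) and the LSTM (76.2%) are only slightly ahead of the SVM with word embeddings (74.2%), while the convolutional network (68.4%) falls behind it.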