<h2> Introduction </h2>
In this kernel we use Logistic Regression and Deep Learning to detect whether a text comment is normal or toxic, where toxicity is split into 6 categories.



<h2> Goals: </h2>
<ul>
<li> Convert text comments to vectors using TFIDF, then build logistic regression models to predict the 6 toxic labels separately</li>
<li> Convert text comments to tokens using Tokenizer, then build a deep learning model to predict the 6 toxic labels all together (multi-label classification) </li>
<li> Convert text comments to tokens using Tokenizer, build a deep learning model with GloVe weights, then predict the 6 toxic labels all together (multi-label classification) </li>
<li> Ensemble all 3 models to get a better prediction</li>
</ul>


<h2> Outline: </h2>
    
    
I. <b>Data Overview </b><br>
a) [Load Data](#load)<br>
b) [Data Overview](#overview)<br><br>
    
II. <b>TFIDF + Logistic Regression </b><br>
a) [TFIDF](#TFIDF)<br>
b) [MaxAbsScaler](#scaler)<br>
c) [Logistic Regression](#lr)<br>

III. <b>Tokenizer + Deep learning </b><br>
a) [Tokenizer](#token)<br>
b) [Padding](#padding)<br>
c) [Build Deep learning model](#dlmodel)<br>


IV. <b>Tokenizer + Deep learning + GloVe</b><br>
a) [Load Embedding from GloVe](#embedding)<br>
b) [Build deep learning model](#dlmodel2)<br>

    
V. <b>Ensemble and model comparison</b><br>
a) [Ensemble 3 models](#ensemble)<br>
b) [Model performance comparison](#performance)<br>
c) [Error analysis](#error)<br>

# Prepare the data

In [None]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from scipy.sparse import hstack
from sklearn.pipeline import make_union
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import StandardScaler, MinMaxScaler, Normalizer, MaxAbsScaler
from sklearn.metrics import roc_curve
from sklearn.metrics import classification_report, confusion_matrix

In [None]:
import keras
import tensorflow as tf
import sys, os, re, csv, codecs, numpy as np, pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation
from keras.layers import Bidirectional, GlobalMaxPool1D
from keras.models import Model,Sequential
from keras import initializers, regularizers, constraints, optimizers, layers

## Loading the train and test files
<a id="load"></a>

In [None]:

train = pd.read_csv('../input/jigsaw-toxic-comment-classification-challenge/train.csv.zip')
test = pd.read_csv('../input/jigsaw-toxic-comment-classification-challenge/test.csv.zip')
test_label = pd.read_csv('../input/jigsaw-toxic-comment-classification-challenge/test_labels.csv.zip')
print(train.head())
print(test.head())
print(test_label.head())

In [None]:
print('-'*20+'Train data label distribution'+'-'*20 +'\n')
print(train.drop(['id','comment_text'],axis=1).apply(pd.Series.value_counts,normalize=True))
print("\n")

print('-'*20+'Join test data and label then remove the -1s'+'-'*20 +'\n')
test_wlabel = pd.merge(test, test_label, how='left',on='id')

test_wlabel = test_wlabel[test_wlabel['toxic']!= -1]
print(test_wlabel.drop(['id','comment_text'],axis=1).apply(pd.Series.value_counts,normalize=True))
print("\n")
print("test data count")
print(test_wlabel.count())

## Data Overview
<a id="overview"></a>

Check whether there are any nulls in the data

In [None]:
train.isnull().any(),test.isnull().any() #No nulls

Prepare the data for modeling

In [None]:
list_classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
y_train = train[list_classes]
y_test = test_wlabel[list_classes]
list_sentences_train = train["comment_text"]
list_sentences_test = test_wlabel["comment_text"]

# TFIDF + Logistic

## TFIDF at word level and char level
<a id="TFIDF"></a>

In [None]:
# all_text = pd.concat([list_sentences_train, list_sentences_test]) # Fit only on Training data

In [None]:
word_vectorizer = TfidfVectorizer(max_features=30000, 
                                  stop_words='english',
                                  strip_accents = 'ascii', 
                                  token_pattern=r'[a-zA-Z]{1,}',
                                  ngram_range=(1, 4))

word_vectorizer.fit(list_sentences_train)
train_word_features = word_vectorizer.transform(list_sentences_train)
test_word_features = word_vectorizer.transform(list_sentences_test)

In [None]:
char_vectorizer = TfidfVectorizer(analyzer='char', 
                                  max_features=30000, 
                                  strip_accents='ascii',
                                  ngram_range=(1, 4))

char_vectorizer.fit(list_sentences_train)
train_char_features = char_vectorizer.transform(list_sentences_train)
test_char_features = char_vectorizer.transform(list_sentences_test)

In [None]:
train_features = hstack([train_char_features,train_word_features])
test_features = hstack([test_char_features,test_word_features])

## MaxAbsScaler 
<a id='scaler'> </a>

In [None]:
scaler = MaxAbsScaler()
scaler.fit(train_features)
train_features_scaled = scaler.transform(train_features)
test_features_scaled = scaler.transform(test_features)

<h2>Logistic Regression</h2>
<a id='lr'></a>
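<ul>
<li>warm_start=True is set, but it only reuses the previous solution when fit is called again on the same estimator. A new LogisticRegression is created for each class here, so nothing is carried over between classes.</li>
<li>Use C = 0.01 to specify stronger regularization, which also speeds up convergence. </li>
<li>Use solver = 'sag' to speed up convergence on this large dataset.</li>
<li>Use class_weight to counter the class imbalance.</li>
</ul>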
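As a rough illustration of the class_weight choice, the sketch below uses made-up label counts (not taken from the dataset) to show that weighting negatives by 0.2 and positives by 0.2 * n_neg / n_pos gives both classes the same total weight in the loss. The variable names are only for illustration.

```python
# Toy labels: 95 negatives and 5 positives (illustrative counts, not from the dataset).
labels = [0] * 95 + [1] * 5

n_neg = sum(1 for y in labels if y == 0)
n_pos = sum(1 for y in labels if y == 1)

neg_weight = 0.2
pos_weight = neg_weight * n_neg / n_pos  # same formula as `weight` in the cell below

# Total weight contributed by each class once the per-sample weights are applied.
total_neg = neg_weight * n_neg
total_pos = pos_weight * n_pos
print(pos_weight, total_neg, total_pos)
```

With these weights the rare positive class is not drowned out by the negatives, which is the intent of the class_weight argument in the cell that follows.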

In [None]:
%%time
class_names = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
score = []
y_pred_lr = pd.DataFrame()
for class_name in class_names:
    train_target = y_train[class_name]
    test_target = y_test[class_name]
    weight = sum(train_target==0)*0.2/sum(train_target==1)
    lr = LogisticRegression(solver = 'sag', C=0.01, class_weight={0:0.2,1: weight},max_iter=1000, warm_start=True)
    lr.fit(train_features_scaled, train_target)
    # ROC AUC needs ranked scores, so use probabilities rather than hard 0/1 labels
    y_train_proba = lr.predict_proba(train_features_scaled)[:,1]
    auc_score = roc_auc_score(train_target,y_train_proba)
    print('Train score for class {} is {}'.format(class_name, auc_score))
    y_proba = lr.predict_proba(test_features_scaled)[:,1]
    y_pred_lr[class_name] = y_proba
    auc_score = roc_auc_score(test_target,y_proba)
    score.append(auc_score)
    print('Test score for class {} is {}'.format(class_name, auc_score))

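    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
    feature_names = list(char_vectorizer.get_feature_names_out())+list(word_vectorizer.get_feature_names_out())
    feature_importance = pd.DataFrame({'coef':lr.coef_[0],'feature_name':feature_names})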
    # feature_importance
    feature_importance.sort_values('coef',ascending=False, inplace=True)
    print("top 10 variables are : {}".format(feature_importance['feature_name'].head(10).tolist()))
print("\nscore for total class is {}".format(np.mean(score)))

# Tokenizer + LSTM

## Tokenizer
<a id='token'> </a>

In [None]:
max_features = 20000
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(list_sentences_train))
list_tokenized_train = tokenizer.texts_to_sequences(list_sentences_train)
list_tokenized_test = tokenizer.texts_to_sequences(list_sentences_test)

In [None]:
# Commented out because the output is long
# Word occurrence counts:
# tokenizer.word_counts
# Word-to-index mapping:
# tokenizer.word_index

<h2>Padding</h2>
<a id='padding'> </a>
    
Pad or truncate the tokenized comments so they all have the same input length.
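To make the effect of padding concrete, here is a minimal pure-Python sketch of what Keras `pad_sequences` does with its default settings (zeros added at the front, extra tokens dropped from the front). The helper `pad_pre` is a hypothetical name used only for this illustration; the notebook itself calls `pad_sequences`.

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad or left-truncate a list of token ids to exactly `maxlen` items."""
    trimmed = sequence[-maxlen:]  # keep only the last `maxlen` tokens
    return [value] * (maxlen - len(trimmed)) + trimmed

short = pad_pre([5, 8, 2], maxlen=5)            # shorter than maxlen: zeros in front
long = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)  # longer than maxlen: front is cut off
print(short)
print(long)
```

With maxlen = 200, comments longer than 200 tokens keep only their last 200 tokens, and shorter ones are filled with leading zeros.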

In [None]:
maxlen = 200
X_t = pad_sequences(list_tokenized_train, maxlen=maxlen)
X_te = pad_sequences(list_tokenized_test, maxlen=maxlen)

In [None]:
totalNumWords = [len(one_comment) for one_comment in list_tokenized_train]

In [None]:
plt.hist(totalNumWords,bins = np.arange(0,410,10)) # tokens per comment, in bins of 10 up to 400
plt.show()

## Build the deep learning model
<a id='dlmodel'> </a>

In [None]:
inp = Input(shape=(maxlen, )) 
embed_size = 100
x = Embedding(max_features, embed_size)(inp)
x = LSTM(50, return_sequences=True,name='lstm_layer')(x)
x = GlobalMaxPool1D()(x)
x = Dropout(0.1)(x)
x = Dense(50, activation="relu")(x)
x = Dropout(0.1)(x)
x = Dense(6, activation="sigmoid")(x)
model = Model(inputs=inp, outputs=x)
model.compile(loss='binary_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])

In [None]:
model.summary()

In [None]:
batch_size = 32
epochs = 2
early_stopping_cb = keras.callbacks.EarlyStopping(patience=2) # Kept for future use; with patience=2 it cannot trigger within 2 epochs
model.fit(X_t,y_train, batch_size=batch_size, epochs=epochs, validation_split=0.1,callbacks=[early_stopping_cb])

View the output of the embedding layer

In [None]:
# from keras import backend as K

# # with a Sequential model
# get_3rd_layer_output = K.function([model.layers[0].input],
#                                   [model.layers[1].output])
# layer_output = get_3rd_layer_output([X_t[:1]])[0]
# layer_output.shape
# # layer_output

In [None]:
y_pred = model.predict(X_te)
print("AUCROC score of the model", roc_auc_score(y_test,y_pred))

Plot the ROC curve to choose an appropriate decision threshold (0.03 was chosen in this run).
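One common way to read a threshold off the ROC curve is to take the point that maximizes TPR - FPR (Youden's J), which is the criterion applied with `np.argmax(tpr - fpr)` in the model comparison section. The sketch below works through it in plain Python on made-up labels and scores; `youden_threshold` is a hypothetical helper, not part of scikit-learn.

```python
def youden_threshold(y_true, scores):
    """Return (threshold, J) for the score threshold that maximizes TPR - FPR."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    best_j, best_t = -1.0, None
    # Try every distinct score as a candidate threshold, highest first.
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        j = tp / n_pos - fp / n_neg
        if j > best_j:
            best_j, best_t = j, t
    return best_t, best_j

# Toy data: positives are rare and most scores are small.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.01, 0.02, 0.02, 0.04, 0.05, 0.10, 0.40, 0.90]
threshold, j = youden_threshold(y_true, scores)
print(threshold, j)
```

On imbalanced data like this, the selected threshold tends to sit well below 0.5, which is consistent with a small value such as 0.03 being chosen.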

# Tokenizer + LSTM + Pretrained model GloVe (100d)

## Load GloVe embedding file
<a id='embedding'> </a>

In [None]:
EMBEDDING_FILE='../input/glove-global-vectors-for-word-representation/glove.6B.100d.txt'

In [None]:
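def get_coefs(word,*arr): 
    return word, np.asarray(arr, dtype='float32')
# Read the GloVe file with an explicit encoding and close it when done
with open(EMBEDDING_FILE, encoding='utf8') as f:
    embeddings_index = dict(get_coefs(*o.strip().split()) for o in f)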

Use these vectors to create our embedding matrix, with random initialization for words that aren't in GloVe. The random values are drawn with the same mean and standard deviation as the GloVe embeddings.

In [None]:
all_embs = np.stack(list(embeddings_index.values())) # np.stack needs a sequence, not a dict view
emb_mean,emb_std = all_embs.mean(), all_embs.std()
# emb_mean,emb_std

In [None]:
## Use the same parameters as before
# embed_size = 100 # how big is each word vector
# max_features = 20000 # how many unique words to use (i.e num rows in embedding vector)
# maxlen = 200 # max number of words in a comment to use

In [None]:
word_index = tokenizer.word_index
nb_words = min(max_features, len(word_index))
embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))
for word, i in word_index.items():
    if i >= nb_words: continue # skip ids that fall outside the embedding matrix
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None: embedding_matrix[i] = embedding_vector

## Build the deep learning model with embedding weights initialization
<a id='dlmodel2'> </a>

In [None]:
inp2 = Input(shape=(maxlen,))
x = Embedding(max_features, embed_size, weights=[embedding_matrix])(inp2)
x = LSTM(50, return_sequences=True)(x)
x = GlobalMaxPool1D()(x)
x = Dropout(0.1)(x)
x = Dense(50, activation="relu")(x)
x = Dropout(0.1)(x)
x = Dense(6, activation="sigmoid")(x)
model2 = Model(inputs=inp2, outputs=x)
model2.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [None]:
model2.summary()

In [None]:
batch_size = 32
epochs = 2
early_stopping_cb = keras.callbacks.EarlyStopping(patience=2) # Kept for future use; with patience=2 it cannot trigger within 2 epochs
model2.fit(X_t,y_train, batch_size=batch_size, epochs=epochs, validation_split=0.1,callbacks=[early_stopping_cb])

In [None]:
y_pred2 = model2.predict(X_te)
print("AUCROC score of the model", roc_auc_score(y_test,y_pred2))

# Ensemble and model comparison

## Ensemble 3 models
<a id='ensemble'></a>

In [None]:
y_pred = pd.DataFrame(y_pred,columns=class_names)
y_pred2 = pd.DataFrame(y_pred2,columns=class_names)

In [None]:
y_test.head()

In [None]:
ensemble_score = pd.DataFrame()

for class_name in class_names:
    ensemble_score[class_name] = (y_pred_lr[class_name] + y_pred[class_name] + y_pred2[class_name])/3
# #     ensemble_score[class_name] = (y_pred[class_name] + y_pred2[class_name])/2
#     fpr, tpr, thresholds = roc_curve(y_test[class_name],ensemble_score[class_name])
#     idx = np.argmax(tpr - fpr ) 
#     print('Confusion matrix for class {}'.format(class_name))
#     print(confusion_matrix(y_test[class_name],ensemble_score[class_name]>=thresholds[idx]))
#     print('classification report for class {}'.format(class_name))
#     print(classification_report(y_test[class_name],ensemble_score[class_name]>=thresholds[idx]))




## Model performance comparison
<a id='performance'></a>

In [None]:
print("AUCROC score of Logistic Regression", roc_auc_score(y_test,y_pred_lr))
print("AUCROC score of LSTM", roc_auc_score(y_test,y_pred))
print("AUCROC score of LSTM+GloVe", roc_auc_score(y_test,y_pred2))
print("\n")
print("AUCROC score of Ensemble", roc_auc_score(y_test,ensemble_score))

In [None]:
fig, ax = plt.subplots(1,4,figsize=(30,8))
for i in range(6):
    ax[0].set_title('Logistic Regression')
    fpr, tpr, thresholds = roc_curve(y_test.iloc[:,i],y_pred_lr.iloc[:,i])
    ax[0].plot(fpr, tpr, label=class_names[i])
    ax[0].legend(loc='best')
    
    ax[1].set_title('LSTM')
    fpr, tpr, thresholds = roc_curve(y_test.iloc[:,i],y_pred.iloc[:,i])
    ax[1].plot(fpr, tpr, label=class_names[i])
    ax[1].legend(loc='best')
    
    ax[2].set_title('LSTM+GloVe')
    fpr, tpr, thresholds = roc_curve(y_test.iloc[:,i],y_pred2.iloc[:,i])
    ax[2].plot(fpr, tpr, label=class_names[i])
    ax[2].legend(loc='best')
    
    ax[3].set_title('Ensemble')
    fpr, tpr, thresholds = roc_curve(y_test.iloc[:,i],ensemble_score.iloc[:,i])
    ax[3].plot(fpr, tpr, label=class_names[i])
    ax[3].legend(loc='best')

ax[0].plot([0, 1], [0, 1],'r--')    
ax[1].plot([0, 1], [0, 1],'r--')    
ax[2].plot([0, 1], [0, 1],'r--')
ax[3].plot([0, 1], [0, 1],'r--')

plt.show()

In [None]:
optimal_threshods_lr = []
optimal_threshods_dl1 = []
optimal_threshods_dl2 = []
optimal_threshods_es = []

print("-"*20+"Logistic Regression decision threshold:"+"-"*20)
for i in range(len(class_names)):
    fpr, tpr, thresholds = roc_curve(y_test.iloc[:,i],y_pred_lr.iloc[:,i])
    idx = np.argmax(tpr - fpr ) 
    print("optimal fpr, tpr: ",fpr[idx], ",",tpr[idx])
    print("optimal threshold: ",thresholds[idx])
    optimal_threshods_lr.append(thresholds[idx])

    
print("-"*20+"LSTM decision threshold:"+"-"*20)

for i in range(len(class_names)):
    fpr, tpr, thresholds = roc_curve(y_test.iloc[:,i],y_pred.iloc[:,i])
    idx = np.argmax(tpr - fpr ) 
    print("optimal fpr, tpr: ",fpr[idx], ",",tpr[idx])
    print("optimal threshold: ",thresholds[idx])
    optimal_threshods_dl1.append(thresholds[idx])

print("-"*20+"LSTM+GloVe decision threshold:"+"-"*20)
for i in range(len(class_names)):
    fpr, tpr, thresholds = roc_curve(y_test.iloc[:,i],y_pred2.iloc[:,i])
    idx = np.argmax(tpr - fpr ) 
    print("optimal fpr, tpr: ",fpr[idx], ",",tpr[idx])
    print("optimal threshold: ",thresholds[idx])
    optimal_threshods_dl2.append(thresholds[idx])
    
print("-"*20+"Ensemble decision threshold:"+"-"*20)
for i in range(len(class_names)):
    fpr, tpr, thresholds = roc_curve(y_test.iloc[:,i],ensemble_score.iloc[:,i])
    idx = np.argmax(tpr - fpr ) 
    print("optimal fpr, tpr: ",fpr[idx], ",",tpr[idx])
    print("optimal threshold: ",thresholds[idx])
    optimal_threshods_es.append(thresholds[idx])

In [None]:
class_names=['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
y_pred_label1 = pd.DataFrame()
y_pred_label2 = pd.DataFrame()
y_pred_label3 = pd.DataFrame()
y_pred_label4 = pd.DataFrame()


for i in range(len(class_names)):
    print('classification report for class {}'.format(class_names[i]))
    
    print("-"*20+"Logistic Regression:"+"-"*20)
#     print('Confusion matrix for class {}'.format(class_names[i]))
#     print(confusion_matrix(y_test.iloc[:,i],y_pred[:,i]>=optimal_threshods_lr[i]))  
    y_pred_label1[class_names[i]] = y_pred_lr.iloc[:,i]>=optimal_threshods_lr[i]
    print(classification_report(y_test.iloc[:,i], y_pred_label1[class_names[i]]))
    
    
    print("-"*20+"LSTM:"+"-"*20)
    y_pred_label2[class_names[i]] = y_pred.iloc[:,i]>=optimal_threshods_dl1[i]
    print(classification_report(y_test.iloc[:,i], y_pred_label2[class_names[i]]))
    
    
    print("-"*20+"LSTM+GloVe:"+"-"*20)
    y_pred_label3[class_names[i]] = y_pred2.iloc[:,i]>=optimal_threshods_dl2[i]
    print(classification_report(y_test.iloc[:,i], y_pred_label3[class_names[i]]))
    
    print("-"*20+"ensemble_score:"+"-"*20)
    y_pred_label4[class_names[i]] = ensemble_score.iloc[:,i]>=optimal_threshods_es[i]
    print(classification_report(y_test.iloc[:,i],y_pred_label4[class_names[i]]))

In the comparison above, the ensemble performs best of the four.

## Error Analysis
<a id='error'></a>

In [None]:
test_text_score = pd.concat([list_sentences_test.reset_index(drop=True),\
                             y_test.reset_index(drop=True),\
                             y_pred_lr.add_suffix('_pred1'),\
                             y_pred.add_suffix('_pred2'),\
                             y_pred2.add_suffix('_pred3'),
                             ensemble_score.add_suffix('_pred4'),
                             y_pred_label1.add_suffix('_label1').astype(int),
                             y_pred_label2.add_suffix('_label2').astype(int),
                             y_pred_label3.add_suffix('_label3').astype(int),
                             y_pred_label4.add_suffix('_label4').astype(int)
                            ] ,axis=1)
test_text_score.head()

In [None]:
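# Look at false positives of the ensemble: the true label is 0 but the ensemble predicts 1.
# The three single-model labels are shown alongside for comparison.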
label = 'toxic'

test_text_score[(test_text_score[label]==0) & (test_text_score[label+'_label4'] !=0)][['comment_text', label+'_label1',label+'_label2',label+'_label3']].head(10)

In [None]:
# test_text_score.loc[12,'comment_text']