# Comparison of different Models on text data

In this notebook several machine learning algorithms are applied to the Jigsaw text data and compared in terms of AUC (area under the ROC curve); model accuracy is evaluated as well, while the extended AUC used in the competition is not considered. The models considered are naive Bayes, logistic regression, LightGBM and a neural network, and their performance is evaluated using cross-validation. A further aspect considered here is the imbalance of class frequencies. While most algorithms in this notebook use a feature matrix based on weights of particular words of the comment text (tf-idf), the neural-network approach is based on pre-trained word embeddings.

### Contents

* [Introduction](#first)
* [The Data](#data)
  * First exploratory analysis; Data Preparation; TF-idf
* [Model: Naive Bayes](#NB)
  * Model estimation; Cross-validation; Down-sampling
* [Model: Logistic Regression](#LR)
  * Cross-Validation; Downsampling; Rebalancing class-weights
* [Model: Gradient Boosting (LightGBM)](#XBR)
  * Cross-Validation; Rebalancing class-weights
* [Model: Neural-Network (LSTM)](#LTSM)
  * Cross-Validation; Rebalancing class-weights
* [Final Thoughts](#final)

_(author: T.Payer)_

# Introduction<a class="anchor" id="first"></a>

# The Data <a class="anchor" id="data"></a>
### First exploratory analysis

In [None]:
import numpy as np # linear algebra
import pandas as pd
import os
import matplotlib.pyplot as plt
import gc
import time
%matplotlib inline

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text  import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import average_precision_score, roc_curve, roc_auc_score
from sklearn import metrics

In the following we load the training data and the test data. The training data are used to train the prediction models. The test data are considered here only to see which information is available for future predictions (the comment text only).

In [None]:
train = pd.read_csv('../input/jigsaw-unintended-bias-in-toxicity-classification/train.csv')
test_df = pd.read_csv('../input/jigsaw-unintended-bias-in-toxicity-classification/test.csv')
print(train.shape)
print(test_df.shape)
print(test_df.head())

So the training data contain approximately 18 times more instances than the test set and, apart from an identifier, the test set contains only the comment text. In the following we consider the first lines of each of the two data sets:

In [None]:
train.head()

The training set therefore contains an identifying column (id), the comment made by the contributor stored in the column 'comment_text', and the column 'target' representing how likely the comment is considered to be toxic. There are additional columns with ratings by human raters of how toxic the comment was considered, as well as ratings of the identity of the receiver of the message and of the type of insult. The columns of the test set:

In [None]:
test_df.head()

As can be seen, the test set contains (apart from an identifying column) only the comment text and no further variables for making predictions. I will therefore focus on the text column to make predictions and model comparisons. To this end I extract the target and the text column from the training data:

In [None]:
train=train.iloc[:,1:3]
print(train.head())
y_trainAll =  np.where(train['target']>=0.5, 1, 0)
X_trainAll=train.drop('target',axis=1)

For making predictions I binarize the target: values with a toxicity probability of 50% or more are set to 1, while values below are set to 0.


In [None]:
sum(y_trainAll)/len(y_trainAll)

In [None]:
print(sum(y_trainAll))
print(len(y_trainAll))
(len(y_trainAll) - sum(y_trainAll))/sum(y_trainAll)

So we see that there are approximately 11.5 times more non-toxic comments (labelled with 0) than there are toxic comments (labelled with 1). The graph below illustrates that difference.

In [None]:
import seaborn as sns
sns.countplot(x=y_trainAll)  # recent seaborn versions expect the data as a keyword argument

### Preparation of the data

Before analysing the data some preliminary cleaning is applied to remove punctuation and numbers and transform all words to lower case.

In [None]:
import re, string, timeit, datetime

def clean(train_clean):
    """Lower-case the comments and strip digits and punctuation (modifies the frame in place)."""
    tic = datetime.datetime.now()
    text = train_clean['comment_text'].str.lower()                         # to lower case
    text = text.str.replace(r'[0-9]+', ' ', regex=True)                    # remove numbers
    text = text.str.translate(str.maketrans('', '', string.punctuation))   # remove punctuation
    train_clean['comment_text'] = text
    tac = datetime.datetime.now()
    print("Cleaning time: " + str(tac - tic))
    gc.collect()
    return train_clean


train_cl=clean(X_trainAll)
train_cl.head()

To make many machine learning algorithms applicable to text data, the text has to be transformed into something these algorithms can operate on. A common way is to create a column for each distinct word occurring in the texts; for a given text line, the cell of a particular word then holds the number of times that word appears in the comment text, while cells of words not occurring in that line are set to 0.
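
The count representation just described can be illustrated with a minimal sketch. The toy comments and the helper `count_row` are assumptions made for illustration only; they are not part of the competition data or of the notebook's pipeline.

```python
from collections import Counter

# Hypothetical toy comments, not taken from the competition data.
docs = ["you are great", "you are you", "great idea"]

# One column per distinct word, in sorted order.
vocabulary = sorted({word for doc in docs for word in doc.split()})

def count_row(doc):
    """Return the word counts of one comment, aligned with `vocabulary`."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

count_matrix = [count_row(doc) for doc in docs]

print(vocabulary)
for row in count_matrix:
    print(row)
```

Each row corresponds to one comment and each column to one word; words absent from a comment get the value 0, which is why such matrices are usually stored in sparse form.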

A variation of that concept is known as tf-idf (term-frequency inverse-document-frequency). Here again a separate column is created for every word included in the text analysis. The term frequency refers to the relative frequency of a particular word in the text line. This value is adjusted by the inverse document frequency, an adjustment based on the inverse of the frequency with which the word occurs over all text lines (in effect, this down-weights words such as 'is' or 'the', which occur in very many documents and are therefore unlikely to help in distinguishing between the two classes).
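
To make the weighting concrete, here is a small sketch under stated assumptions: the toy comments are hypothetical, and the smoothed form ln((1 + n) / (1 + df)) + 1 is one common idf convention (to my understanding the default in scikit-learn). The sketch omits sublinear scaling and row normalisation, so its numbers will not match the output of a `TfidfVectorizer`.

```python
import math
from collections import Counter

# Hypothetical toy comments; "the" occurs in every one of them.
docs = [
    "the weather is nice",
    "the idiot wrote this article",
    "the article is nice",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many comments does each word occur?
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def idf(word):
    # Smoothed variant: ln((1 + n) / (1 + df)) + 1 (assumed convention)
    return math.log((1 + n_docs) / (1 + doc_freq[word])) + 1

def tfidf(tokens):
    """Relative term frequency times idf for each word of one comment."""
    counts = Counter(tokens)
    return {w: (c / len(tokens)) * idf(w) for w, c in counts.items()}

weights = tfidf(tokenized[1])
for word, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{word:8s} {weight:.3f}")
```

In the second comment every word occurs once, so the ordering is driven by the idf term alone: "the", which appears in all comments, receives the smallest weight, while the rarer "idiot" is weighted most strongly.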

In the following the tf-idf technique is applied; in addition to single words, sequences of two words (2-grams) are also considered.

In [None]:
word_vectorizer = TfidfVectorizer(
    sublinear_tf=True, strip_accents='unicode',  analyzer='word',
     stop_words='english', ngram_range=(1, 2),token_pattern=r'(?u)\b[A-Za-z]+\b',  # increase to 2
     max_features=50000) 

tfidf_train = word_vectorizer.fit_transform(train_cl['comment_text'])
# get_feature_names() was removed in newer scikit-learn versions; get_feature_names_out() replaces it
feature_names = word_vectorizer.get_feature_names_out()
print(feature_names[:10])
print(len(feature_names))

gc.collect()

In [None]:
out=word_vectorizer.vocabulary_ ; list(out)[1:10]

**Creation of a word-cloud**

In the following I create one word cloud for toxic and another one for non-toxic comments. Each of these clouds shows the most important words among its comments. To avoid memory problems I only consider a random subset of 10% of all comments.

In [None]:
n_size=len(y_trainAll); print(n_size//10)
# draw 10% of the row indices without replacement (instead of a hard-coded sample size)
sub_sample = np.random.choice(range(0, n_size), size=n_size//10, replace=False).tolist()
#sub_sample[:20]

In [None]:
zw_df=pd.concat([train_cl, pd.DataFrame(y_trainAll)], axis=1)
zw_df.columns=['comment_text', 'target']
#print(zw_df.shape)
zw_df=zw_df.iloc[sub_sample,:]
#print(zw_df.shape)

toxic_comments = zw_df[zw_df['target'] >= .5]['comment_text'].values
toxic_comments = ' '.join(toxic_comments)

non_toxic_comments = zw_df[zw_df['target'] < .5]['comment_text'].values
non_toxic_comments = ' '.join(non_toxic_comments)
del zw_df, test_df, out ; gc.collect() 

In [None]:
import nltk
from nltk.corpus import stopwords
stop_words = set(stopwords.words("english"))

from wordcloud import WordCloud
wordcloud_toxic = WordCloud(max_font_size=100, max_words=100, background_color="white",  stopwords=stop_words).generate(toxic_comments)
plt.figure(figsize=[15,5])    
# Display the generated image:
plt.title("Wordcloud: Toxic comments")
plt.imshow(wordcloud_toxic, interpolation='bilinear')
plt.axis("off")
plt.show()
del wordcloud_toxic,X_trainAll  , toxic_comments,   train_cl, word_vectorizer, train  ; 
gc.collect()

In [None]:
wordcloud_non_toxic = WordCloud(max_font_size=100, max_words=100, background_color="white",  stopwords=stop_words).generate(non_toxic_comments)
plt.figure(figsize=[15,5])
plt.title("Wordcloud: Non-Toxic comments")
plt.imshow(wordcloud_non_toxic, interpolation='bilinear')
plt.axis("off")
plt.show()

del wordcloud_non_toxic,  non_toxic_comments, stop_words, 
gc.collect()

Many of the most important words appear in both classes. From the word clouds alone, the classification is therefore not immediately obvious.

To get more insight into the data, further EDA would be useful. There are already many very good notebook kernels treating EDA in depth for this data set; I reference a few of them at the end of this notebook. In the following I continue by considering different prediction models.

# Model: Naive Bayes <a class="anchor" id="NB"></a>

The naive Bayes model is a popular machine-learning model that is frequently applied in text mining. It is a simple probabilistic classifier derived from Bayes' theorem under the simplifying assumption that the features are conditionally independent given the class (hence 'naive').
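
The idea can be sketched in a few lines of plain Python. This is a simplified multinomial naive Bayes on word counts with additive smoothing; the toy comments, labels and function names are illustrative assumptions, and the notebook itself relies on `MultinomialNB` from sklearn rather than on this code.

```python
import math
from collections import Counter

# Hypothetical toy training set: (comment, label) with 1 = toxic.
train_docs = [
    ("you are an idiot", 1),
    ("what an idiot", 1),
    ("you are right", 0),
    ("what a nice day", 0),
    ("nice article", 0),
]
alpha = 1.0  # additive (Laplace) smoothing

vocab = {w for text, _ in train_docs for w in text.split()}
class_docs = Counter(label for _, label in train_docs)
word_counts = {0: Counter(), 1: Counter()}
for text, label in train_docs:
    word_counts[label].update(text.split())

def log_posterior(text, label):
    """Unnormalised log P(label | text) under the independence assumption."""
    log_prior = math.log(class_docs[label] / len(train_docs))
    total = sum(word_counts[label].values()) + alpha * len(vocab)
    log_lik = sum(
        math.log((word_counts[label][w] + alpha) / total)
        for w in text.split() if w in vocab  # unseen words are ignored
    )
    return log_prior + log_lik

def predict(text):
    return max((0, 1), key=lambda label: log_posterior(text, label))

print(predict("you idiot"))
print(predict("nice day"))
```

Because of the independence assumption the log-likelihood is simply a sum over words, which is what makes the model so cheap to fit on a large sparse matrix.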

In [None]:
from sklearn import metrics
from sklearn.naive_bayes import MultinomialNB, ComplementNB
from sklearn.metrics import average_precision_score, roc_curve, roc_auc_score
from sklearn.model_selection import StratifiedKFold

In [None]:
def train_and_predictNB(alpha, train_x,train_y, valid_x,valid_y):
    # Instantiate the classifier: nb_classifier
    nb_classifier = MultinomialNB(alpha=alpha)  # count_train: best auc=0.769 at alpha = 0.0
    nb_classifier.fit(train_x,train_y)
    pred_proba = nb_classifier.predict_proba(valid_x)[:, 1]  # probability of the toxic class
    auc = roc_auc_score(valid_y, pred_proba)
    pred = nb_classifier.predict(valid_x)                    # hard 0/1 predictions for accuracy
    score = metrics.accuracy_score(valid_y, pred)
    del nb_classifier, pred
    return [round(score,5), round(auc,5)]

In [None]:
X_train_tf, X_valid_tf, y_train, y_valid = train_test_split(tfidf_train, y_trainAll, test_size = 0.2, random_state = 53)

Preliminary analyses have shown that a value of 1.2 for the parameter alpha is a reasonable choice. I will thus use this value in what follows.

In [None]:
alpha=1.2
print('Alpha: ', alpha)
out=train_and_predictNB(alpha, X_train_tf, y_train, X_valid_tf, y_valid)
print('Accuracy: ', out[0])                             
print('AUC: ',out[1])

In [None]:
acc_out=[]; auc_out=[]
nfold = 5
skf = StratifiedKFold(n_splits=nfold, shuffle=True, random_state=123)
i = 1
for train_index, valid_index in skf.split(tfidf_train, y_trainAll):
    print("\nFold {}".format(i)); i+=1
    print(len(train_index));print(len(valid_index))
    out=train_and_predictNB(alpha, tfidf_train[train_index], y_trainAll[train_index], tfidf_train[valid_index], y_trainAll[valid_index])
    print(out)
    acc_out.append(out[0]); auc_out.append(out[1])
print(acc_out)   ; print(auc_out) 
print("Mean-Acc: ", round(np.mean(acc_out),5) )
print("Mean-AUC: ", round(np.mean(auc_out),5) )

### Downsampling

A common situation of concern is when data sets are imbalanced, that is, one group (such as non-toxic comments) is far more frequent in the data set than another group (such as toxic comments). Depending on the choice of performance measure, estimation results may then not be desirable.

There are different strategies to handle this situation, among them upsampling and downsampling. In downsampling only a randomly sampled subgroup of the majority group is included in the training data, while upsampling resamples from the minority group to achieve a balanced training data set. In the following I use the downsampling approach.
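
The notebook itself continues with downsampling; for comparison, the upsampling alternative described above might be sketched as follows. The function name `upsample` and the toy labels are illustrative assumptions, both classes are assumed to be present, and the standard-library `random` module is used instead of NumPy to keep the sketch self-contained.

```python
import random

def upsample(labels, seed=234):
    """Return row indices in which the minority class is resampled with
    replacement until both classes are equally frequent."""
    rng = random.Random(seed)
    idx0 = [i for i, y in enumerate(labels) if y == 0]
    idx1 = [i for i, y in enumerate(labels) if y == 1]
    minority, majority = sorted((idx0, idx1), key=len)
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return majority + minority + extra

# Hypothetical toy labels: 8 non-toxic, 2 toxic.
labels = [0, 0, 0, 1, 0, 0, 1, 0, 0, 0]
us_index = upsample(labels)
balanced = [labels[i] for i in us_index]
print(len(us_index), sum(balanced))
```

As with downsampling, such resampling would be applied to the training folds only, so that the validation folds keep the original class frequencies.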

In [None]:
np.random.seed(seed=234)
i_class0 = np.where(y_trainAll == 0)[0] ; i_class1 = np.where(y_trainAll == 1)[0]
n_class0 = len(i_class0) ; n_class1 = len(i_class1)
i_class0_downsampled = np.random.choice(i_class0, size=n_class1, replace=False)
ds_index=np.concatenate((i_class1,i_class0_downsampled))
print(n_class1); print(n_class0); print(len(ds_index))

y_train_ds=y_trainAll[ds_index]; tfidf_train_ds =tfidf_train[ds_index]

For repeated application I turn this into a function:

In [None]:
def downsample(x_orig, y_orig):
    np.random.seed(seed=234)
    i_class0 = np.where(y_orig == 0)[0] ; i_class1 = np.where(y_orig == 1)[0]
    n_class0 = len(i_class0) ; n_class1 = len(i_class1)
    if n_class0 > n_class1:
        i_class0_downsampled = np.random.choice(i_class0, size=n_class1, replace=False);
        ds_index=np.concatenate((i_class1,i_class0_downsampled))
    else: 
        i_class1_downsampled = np.random.choice(i_class1, size=n_class0, replace=False);
        ds_index=np.concatenate((i_class0,i_class1_downsampled)) 
    #print(n_class1); print(n_class0); print(len(ds_index))

    y_ds=y_orig[ds_index]; X_ds =x_orig[ds_index]
    return X_ds, y_ds
    

In [None]:
acc_out=[]; auc_out=[]
nfold = 5
skf = StratifiedKFold(n_splits=nfold, shuffle=True, random_state=123)
i = 1
for train_index, valid_index in skf.split(tfidf_train, y_trainAll):
    tfidf_train_ds, y_train_ds = downsample(tfidf_train[train_index], y_trainAll[train_index])
    out=train_and_predictNB(alpha, tfidf_train_ds, y_train_ds, tfidf_train[valid_index], y_trainAll[valid_index])
    acc_out.append(out[0]); auc_out.append(out[1])
print(acc_out)   ; print(auc_out) 
print("Mean-Acc: ", round(np.mean(acc_out),5) )
print("Mean-AUC: ", round(np.mean(auc_out),5) )

### Results

The table below shows the results of the 5-fold cross-validation runs carried out above. The mean AUC indicates a slight improvement of the downsampled over the standard version (although the improvement does not seem large relative to the variation across folds). More striking is the difference in mean accuracy between the two approaches.

|Scheme|AUC|ACC|
|---|---|---|
|5-CV|0.87622|0.92784
|5-CV (Downsample)|0.87647|0.75082
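
One plausible reading of this pattern, which I have not verified on the actual predictions, is that AUC depends only on the ranking of the predicted scores, whereas accuracy depends on where the scores fall relative to the 0.5 threshold. The sketch below uses hypothetical labels and scores; `auc` and `accuracy` are small helper functions written for this illustration, not the sklearn implementations.

```python
def auc(y_true, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def accuracy(y_true, scores, threshold=0.5):
    preds = [int(s >= threshold) for s in scores]
    return sum(p == y for p, y in zip(preds, y_true)) / len(y_true)

# Hypothetical validation labels and predicted probabilities.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.10, 0.15, 0.22, 0.25, 0.30, 0.45, 0.40, 0.60]

# Same ranking, but every score pushed upwards, mimicking a model that was
# trained on balanced data and therefore predicts "toxic" more readily.
shifted = [min(1.0, s + 0.3) for s in scores]

print(auc(y_true, scores), accuracy(y_true, scores))
print(auc(y_true, shifted), accuracy(y_true, shifted))
```

In this toy case the AUC is identical for both sets of scores, while the accuracy at the fixed threshold drops because several non-toxic comments now cross 0.5.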

# Model: Logistic Regression <a class="anchor" id="LR"></a>

Logistic regression is a common binary classification model in which a linear combination of input-variables is appropriately transformed to the probability of a binary event. The version implemented in Python (sklearn) is an extension of the classic model in which a penalty-term is added to the objective function to constrain or regularize coefficient estimates; depending on the type of regularisation imposed, measuring size of estimates in absolute or quadratic terms, one refers to L1 or L2 regularisation. Application of this regularisation is very useful if the number of variables (features) is large relative to the number of observations.
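
A rough sketch of such a penalised objective is given below. It assumes the formulation in which the summed log-loss is multiplied by `C` and the penalty is added unscaled, so that a smaller `C` means stronger regularisation. The function `penalised_loss`, the toy data and the decision to leave the bias unpenalised are assumptions of this sketch and may differ in detail from what the liblinear solver does.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def penalised_loss(weights, bias, X, y, C=0.6, penalty="l1"):
    """C * (sum of log-losses) + penalty on the weights (bias unpenalised here)."""
    log_loss = 0.0
    for row, target in zip(X, y):
        p = sigmoid(sum(w * x for w, x in zip(weights, row)) + bias)
        log_loss -= target * math.log(p) + (1 - target) * math.log(1 - p)
    if penalty == "l1":
        reg = sum(abs(w) for w in weights)        # absolute size of coefficients
    else:  # "l2"
        reg = 0.5 * sum(w * w for w in weights)   # quadratic size of coefficients
    return C * log_loss + reg

# Hypothetical toy data: two observations, two features.
X = [[1.0, 0.0], [0.0, 1.0]]
y = [1, 0]
weights, bias = [2.0, -1.0], 0.0

l1_value = penalised_loss(weights, bias, X, y, C=1.0, penalty="l1")
l2_value = penalised_loss(weights, bias, X, y, C=1.0, penalty="l2")
print(round(l1_value, 4), round(l2_value, 4))
```

The L1 penalty grows linearly with the size of a coefficient, which tends to push many coefficients to exactly zero; with 50,000 tf-idf columns this acts as a form of feature selection.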

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
def train_and_predictLogR(cl,c_weight=None):                                                      # c=0.8; l1 = 0.9444; l2 is worse
    logreg = LogisticRegression(C=cl,penalty='l1',class_weight=c_weight, solver='liblinear')    # class_weight: dict or 'balanced', optional (default=None)
    logreg.fit(X_train_tf, y_train)
    pred = logreg.predict_proba(X_valid_tf);pred=pd.DataFrame(pred)
    
    auc = roc_auc_score(y_valid, pred[1]); print('auc: ',auc)     
    pred = logreg.predict(X_valid_tf)
    score = metrics.accuracy_score(y_valid, pred)
    del logreg, pred
    return score

#print('Score: ', train_and_predictLogR(1))   
#classos = np.arange(0.001,3,.2)

classos =[.4,.6,.8 ]

for classo in classos:
    print('classo: ', classo)
    print('Score: ', train_and_predictLogR(classo))                              #0.8782946199369265
    print()

### Cross-Validation

In [None]:
def train_and_predictLogR(c_par, train_x,train_y, valid_x,valid_y, c_weight=None):
    logreg = LogisticRegression(C=c_par,penalty='l1', solver='liblinear' , class_weight=c_weight)   
    logreg.fit(train_x, train_y)
    pred = logreg.predict_proba(valid_x);pred=pd.DataFrame(pred)        
    auc = roc_auc_score(valid_y, pred[1]); #print(auc);# print(pred[1]) #print('AUC: %.3 f' % auc)
    pred = logreg.predict(valid_x)
    score = metrics.accuracy_score(valid_y, pred)
    return [round(score,5), round(auc,5)]

In [None]:
c_par=0.6
start = time.time()
acc_out=[]; auc_out=[]   ;         
nfold = 5
skf = StratifiedKFold(n_splits=nfold, shuffle=True, random_state=123)
i = 1
for train_index, valid_index in skf.split(tfidf_train, y_trainAll):
    #print("\nFold {}".format(i)); i+=1  #print(len(train_index));print(len(valid_index))
    out=train_and_predictLogR(c_par, tfidf_train[train_index], y_trainAll[train_index], tfidf_train[valid_index], y_trainAll[valid_index])    #print(out)
    acc_out.append(out[0]); auc_out.append(out[1])
    
    
print(acc_out)   ; print(auc_out) 
print("Mean-Acc: ", round(np.mean(acc_out),5) )
print("Mean-AUC: ", round(np.mean(auc_out),5) )   ;end = time.time(); print((end - start)/60)

### Downsampling

In [None]:
start = time.time()
acc_out=[]; auc_out=[]   ;       
nfold = 5
skf = StratifiedKFold(n_splits=nfold, shuffle=True, random_state=123)
i = 1
for train_index, valid_index in skf.split(tfidf_train, y_trainAll):
    tfidf_train_ds, y_train_ds = downsample(tfidf_train[train_index], y_trainAll[train_index])
    out=train_and_predictLogR(c_par, tfidf_train_ds, y_train_ds, tfidf_train[valid_index], y_trainAll[valid_index])    #print(out)
    acc_out.append(out[0]); auc_out.append(out[1])
    
    
print(acc_out)   ; print(auc_out) 
print("Mean-Acc: ", round(np.mean(acc_out),5) )
print("Mean-AUC: ", round(np.mean(auc_out),5) )   ;end = time.time(); print((end - start)/60)

### Class-Weighted Approach

Another option for dealing with class imbalance is to adjust the weight of each instance according to the class it belongs to. The logistic regression implementation in sklearn offers such an option via the 'class_weight' argument, which can be set to _balanced_. The sklearn documentation states that "The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data".
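
The sklearn documentation gives the rule as `n_samples / (n_classes * np.bincount(y))`. The following sketch reproduces that rule in plain Python; the helper name `balanced_class_weights` and the toy labels (8% toxic, roughly as in the training data) are assumptions for illustration.

```python
from collections import Counter

def balanced_class_weights(labels):
    """weight(c) = n_samples / (n_classes * count(c))."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}

# Hypothetical labels with 8% toxic comments.
labels = [1] * 8 + [0] * 92
weights = balanced_class_weights(labels)
print(weights)
print(weights[1] / weights[0])
```

With this rule the ratio of the two weights equals the inverse ratio of the class frequencies, so each toxic comment counts about 11.5 times as much as a non-toxic one in the loss.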

In [None]:
start = time.time()
acc_out=[]; auc_out=[]   ;       
nfold = 5
skf = StratifiedKFold(n_splits=nfold, shuffle=True, random_state=123)
i = 1
for train_index, valid_index in skf.split(tfidf_train, y_trainAll):
    out=train_and_predictLogR(c_par, tfidf_train[train_index], y_trainAll[train_index], tfidf_train[valid_index], y_trainAll[valid_index],c_weight='balanced')    #print(out)
    acc_out.append(out[0]); auc_out.append(out[1])
    
    
print(acc_out)   ; print(auc_out) 
print("Mean-Acc: ", round(np.mean(acc_out),5) )
print("Mean-AUC: ", round(np.mean(auc_out),5) )  ;end = time.time(); print((end - start)/60)

### Results

The results from 5-fold cross-validation show that the logistic regression model considerably improves prediction results compared with the previously considered naive Bayes approach.

No further improvement was achieved by downsampling or by applying different weights according to class membership using (exact) class-rebalancing weights. Both techniques, however, again reduce accuracy.

|Scheme|AUC|ACC|
|---|---|---|
|5-CV|0.94339|0.94725|
|5-CV (Downsample)|0.94154|0.89766|
|5-CV (balanced weights)|0.94190|0.89964|


# Model: Gradient Boosting (LightGBM) <a class="anchor" id="XBR"></a>

Tree-based models are a frequently applied class of machine-learning algorithms. Two common choices are random forests and gradient-boosting models. Owing to their good performance, boosting models have become very popular. In the following I apply LightGBM to the text data.

To find an appropriate model I allow for a high number of trees while applying 'early stopping', which stops further estimation once the validation score has not improved for a given number of rounds.
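
The stopping rule can be illustrated independently of LightGBM. The sketch below is a simplified reading of early stopping, not the library's implementation; the function `best_iteration` and the sequence of validation AUC values are hypothetical.

```python
def best_iteration(valid_scores, patience=20):
    """Return the 1-based round with the best validation score, stopping
    once `patience` consecutive rounds bring no improvement."""
    best_score, best_round = float("-inf"), 0
    for round_no, score in enumerate(valid_scores, start=1):
        if score > best_score:
            best_score, best_round = score, round_no
        elif round_no - best_round >= patience:
            break  # no improvement for `patience` rounds
    return best_round, best_score

# Hypothetical validation AUC per boosting round.
aucs = [0.80, 0.85, 0.88, 0.90, 0.90, 0.89, 0.91, 0.90, 0.90, 0.90, 0.95]
print(best_iteration(aucs, patience=3))
```

In this toy sequence training stops at round 10, three rounds after the best score in round 7, so the late jump in round 11 is never seen. This is the usual trade-off: a larger patience costs more time but is less likely to stop on a temporary plateau.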

In [None]:
import time
import lightgbm as lgb
train_data = lgb.Dataset(X_train_tf, y_train)
valid_data = lgb.Dataset(X_valid_tf, y_valid ) #tfidf_test, reference=train_data)

param = {
    'num_trees':5000,   #0.942217   ;   0.94268   #0.94338  (20; 0.1);0.94316 (30,0.1);  0.9434 (25,0.1/32min); 0.943271 (25;0.05/55)
    'learning_rate':0.1,
    "objective": "binary",
    'num_leaves':25,
    'metric': ['auc'],
    "num_threads": -1,
    "early_stopping_rounds":20,
    "verbose":1,
    'boost_from_average': False,    
}

start = time.time()
bdt = lgb.train(param, train_data, valid_sets=[valid_data], verbose_eval=100)  
end = time.time(); print((end - start)/60)

In [None]:
def train_and_predictLGBM01(train_x,train_y, valid_x,valid_y, num_trees=1):                  #[1126]	valid_0's auc: 0.943421
    param = {
    'num_trees':num_trees,    'learning_rate':0.1,  "objective": "binary",  'num_leaves':25,
    'metric': ['auc'],   "num_threads": -1,   # "early_stopping_rounds":20,
    "verbose":1,'boost_from_average': False,     #'is_unbalance': True,                       
     #'scale_pos_weight': ch_weights,                        
     }
    train_data = lgb.Dataset(train_x, train_y)
    bdt = lgb.train(param, train_data,  verbose_eval=500) 
    pred = bdt.predict(valid_x)  ;         
    auc = roc_auc_score(valid_y, pred); #print(auc);# print(pred[1]) #print('AUC: %.3 f' % auc)
    pred_dichotom=np.where(pred >=0.5, 1, 0); pred=pd.DataFrame(pred)
    #pred = bdt.predict(valid_x)
    score = metrics.accuracy_score(valid_y, pred_dichotom)
    return [round(score,5), round(auc,5)]

In [None]:
#train_and_predictLGBM01(X_train_tf, y_train, X_valid_tf, y_valid , num_trees=120)#


The above model was then estimated using 5-fold cross-validation. Because of the time restrictions on the execution of this notebook, this was carried out separately in a previous version. The model was also estimated using an internal balancing scheme of the algorithm, as well as a manual re-weighting of the toxic class by a factor of 3. Results are shown in the table below. For standard cross-validation we obtain a mean AUC of 0.9417. There is some indication that re-weighting yields slight improvements.

|Light GBM |	AUC |	ACC|
|---|---|---|
|5-CV 	|0.9417 |	0.9474|
|balanced 5-CV |	0.9418| 	0.9028|
|3 x pos-weight| 	0.9422 |	0.9420| 

In [None]:
del train_data, X_train_tf, X_valid_tf,bdt,  valid_data, out, tfidf_train, y_trainAll
gc.collect()

In [None]:
for name in dir():
    if not name.startswith('_'):
        del globals()[name]

for name in dir():
    if not name.startswith('_'):
        del locals()[name]

%reset -f        
import gc        
gc.collect()        

# Model: Neural-Network (LSTM) <a class="anchor" id="LTSM"></a>

In the following I consider a neural network. The main concept of 'what' is analysed is somewhat different. The previous methods mainly considered whether a particular word (or a particular sequence of words, an n-gram) is present in the comment text. Cases were then classified according to the combination of words present in the comment text.

In the current approach sentences and their structure are (apart from some cleaning) maintained in the analysis, so that predictions may also take the structure of the sentence into account. Words within a sentence are mapped into a d-dimensional space by the use of so-called embeddings; the resulting d-dimensional word vectors represent, at least to a certain degree, 'meaning' rather than a particular word. The procedure can be further improved by using embeddings that have already been trained on large amounts of text, for example on Wikipedia pages. This is the approach used subsequently.
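
What "representing meaning" amounts to can be sketched with cosine similarity between word vectors. The three-dimensional vectors below are invented for illustration (the pre-trained vectors used later have 300 dimensions), and the variable names are assumptions of this sketch.

```python
import math

# Hypothetical 3-dimensional embeddings; real pre-trained vectors are much longer.
embedding = {
    "stupid": [0.9, 0.1, 0.0],
    "dumb":   [0.8, 0.2, 0.1],
    "garden": [0.0, 0.3, 0.9],
}

def cosine(u, v):
    """Cosine of the angle between two vectors (1 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

sim_close = cosine(embedding["stupid"], embedding["dumb"])
sim_far = cosine(embedding["stupid"], embedding["garden"])
print(round(sim_close, 3), round(sim_far, 3))
```

Words used in similar contexts end up close together in the embedding space, so a network can treat a word it has rarely seen in the training comments much like a frequent word with a nearby vector.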

In [None]:
import pandas as pd
import numpy as np
from tqdm import tqdm
tqdm.pandas()
import time

from keras.preprocessing import text, sequence
from keras import backend as K
from keras.models import load_model
import keras
import pickle
from sklearn.model_selection import train_test_split

In [None]:
train = pd.read_csv('../input/jigsaw-unintended-bias-in-toxicity-classification/train.csv')

In the following we carry out some basic cleaning of the comment texts, removing some punctuation and special characters.

In [None]:
def clean_text(x):
    x = str(x)
    for punct in "/-'":
        x = x.replace(punct, ' ')
    for punct in '&':
        x = x.replace(punct, f' {punct} ')
    for punct in '?!.,"#$%\'()*+-/:;<=>@[\\]^_`{|}~' + '“”’':
        x = x.replace(punct, '')
    return x

train["comment_text"] = train["comment_text"].progress_apply(lambda x: clean_text(x))

In [None]:
train_data = train["comment_text"]
label_data = train.target.apply(lambda x: 0 if x < 0.5 else 1)
train_data.shape, label_data.shape

Tokenization and transformation of the sentences into integer sequences; we allow sequences a maximal length of 200 words.
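
The two steps can be mimicked in plain Python. The sketch assumes the behaviour I understand to be the Keras default, namely that shorter sequences are padded with zeros at the front and longer ones are truncated at the front; the toy comments, the small `MAX_LEN` and the helper `pad` are illustrative assumptions.

```python
from collections import Counter

MAX_LEN = 5  # the notebook uses 200; 5 keeps the toy output readable

# Hypothetical toy comments.
texts = ["you are so wrong", "no", "this is a very long and pointless comment"]

# Index words by frequency, starting at 1; index 0 is reserved for padding.
counts = Counter(word for t in texts for word in t.lower().split())
word_index = {w: i for i, (w, _) in enumerate(counts.most_common(), start=1)}

def pad(seq, maxlen):
    """Left-pad with zeros; if too long, keep only the last `maxlen` tokens."""
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq

sequences = [[word_index[w] for w in t.lower().split()] for t in texts]
padded = [pad(s, MAX_LEN) for s in sequences]
for row in padded:
    print(row)
```

After this step every comment is a row of the same length, which is what allows the comments to be stacked into one integer matrix and fed to the embedding layer in batches.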

In [None]:
MAX_LEN = 200
CHARS_TO_REMOVE = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n“”’\'∞θ÷α•à−β∅³π‘₹´°£€\×™√²—'

tokenizer = text.Tokenizer(filters=CHARS_TO_REMOVE)
tokenizer.fit_on_texts(list(train_data) )

train_data = tokenizer.texts_to_sequences(train_data)
train_data = sequence.pad_sequences(train_data, maxlen=MAX_LEN)

For a first analysis I split the data set into training and validation sets:

In [None]:
x_train, x_val, y_train, y_val = train_test_split(train_data, label_data, test_size = 0.35, random_state = 53)

We create functions to load the embeddings and build the embedding matrix:

In [None]:
def get_coefs(word, *arr):
    return word, np.asarray(arr, dtype='float32')

def load_embeddings(path):
    with open(path) as f:
        return dict(get_coefs(*line.strip().split(' ')) for line in f)

def build_matrix(word_index, path):
    embedding_index = load_embeddings(path)
    embedding_matrix = np.zeros((len(word_index) + 1, 300))
    for word, i in word_index.items():
        try:
            embedding_matrix[i] = embedding_index[word]
        except KeyError:
            pass
    return embedding_matrix



In [None]:
EMBEDDING_FILES = [ '../input/fasttext-crawl-300d-2m/crawl-300d-2M.vec']

start = time.time()

embedding_matrix = np.concatenate(
    [build_matrix(tokenizer.word_index, f) for f in EMBEDDING_FILES], axis=-1)

end = time.time(); elapsed = end - start; print(elapsed/60)
gc.collect()
embedding_matrix.shape

In [None]:
from keras.models import Sequential, Model
from keras.optimizers import  Adam
from keras.layers import Flatten, Dense, Embedding, Dropout, Bidirectional, Input, add #,  CuDNNLSTM,
from keras.layers import concatenate,  SpatialDropout1D, Conv1D, GlobalAveragePooling1D, GlobalMaxPooling1D, LSTM, CuDNNLSTM
from keras.utils import plot_model
import matplotlib.pyplot as plt
%matplotlib inline

#from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau, ModelCheckpoint
from sklearn.metrics import roc_auc_score
import tensorflow as tf
import timeit

def auroc(y_true, y_pred):
    return tf.py_func(roc_auc_score, (y_true, y_pred), tf.double)

For the current data I use a small neural network. It includes the embedding matrix, two LSTM layers and two standard dense layers. As there are computation-time restrictions for this notebook, I only use 64 units per layer.

In [None]:
n_layers=64

def build_model(embedding_matrix):
    words = Input(shape=(None,))
    x = Embedding(*embedding_matrix.shape, weights=[embedding_matrix], trainable=True)(words)
    x = SpatialDropout1D(0.2)(x)
    x = CuDNNLSTM(n_layers, return_sequences=True)(x)
    x = CuDNNLSTM(n_layers, return_sequences=True)(x)
    x = GlobalMaxPooling1D()(x)
    

    x = Dense(n_layers, activation='relu')(x)
    x = Dense(64, activation='relu')(x)
    result = Dense(1, activation='sigmoid')(x)
    
    model = Model(inputs=words, outputs=[result])
    model.compile(loss='binary_crossentropy', optimizer='adam',metrics=["accuracy",auroc])
    return model



In [None]:
#dir()

In [None]:
del tokenizer, train, train_data, label_data
gc.collect()

In [None]:
start = time.time()
model = build_model(embedding_matrix)
history = model.fit(x_train, y_train,
                    epochs=6,
                    batch_size=1024,
                    validation_data=(x_val, y_val))

end = time.time(); elapsed = end - start; print(elapsed/60)

In [None]:
def plot_accuracy(acc,val_acc):
  # Plot training & validation accuracy values
  plt.figure()
  plt.plot(acc)
  plt.plot(val_acc)
  plt.title('Model accuracy')
  plt.ylabel('Accuracy')
  plt.xlabel('Epoch')
  plt.legend(['Train', 'Test'], loc='upper left')
  plt.show()

def plot_loss(loss,val_loss):
  plt.figure()
  plt.plot(loss)
  plt.plot(val_loss)
  plt.title('Model loss')
  plt.ylabel('Loss')
  plt.xlabel('Epoch')
  plt.legend(['Train', 'Test'], loc='upper right')
  plt.show()

def plot_auc(auroc,val_auroc):
  plt.figure()
  plt.plot(auroc)
  plt.plot(val_auroc)
  plt.title('Model AUC')
  plt.ylabel('AUC')
  plt.xlabel('Epoch')
  plt.legend(['Train', 'Test'], loc='upper right')
  plt.show()

Below we see plots of the different performance measures against the epochs:

In [None]:
plot_loss(history.history['loss'], history.history['val_loss'])
plot_accuracy(history.history['acc'], history.history['val_acc'])
plot_auc(history.history['auroc'], history.history['val_auroc'])

The graphs indicate that 2 epochs are a reasonable choice.

### Cross-Validation and Re-Weighting

For the neural network presented above I previously carried out 5-fold cross-validation using a GPU (see kernel version 2). Results are presented in the table below:

|Structure| 	AUC| 	ACC|
|---|---|---|
|5-CV| 	0.9605| 	0.9512|
|3/1 - upweight 5-CV| 	0.9612| 	0.9381|


The table shows a mean AUC estimate of 0.9605, which is a considerable improvement compared to the previous models.

Up-weighting the under-represented toxic comments by a factor of 3 indicates a slight improvement of the AUC estimate.

# Final Thoughts <a class="anchor" id="final"></a>

In this notebook I compared several models for class prediction on the Jigsaw text data. In particular, I used the naive Bayes model, the logistic regression model with regularization, the boosting model LightGBM, and a small neural network employing recurrent layers. As performance measure I mainly considered the area under the curve (AUC), but I also evaluated the corresponding accuracy. The data exhibit class imbalance, with approximately 11.5 times more non-toxic than toxic comments in the data set.

To get a reliable estimate of model performance and of how it may generalize to new data, I employed cross-validation using five folds. I additionally considered some techniques to deal with class imbalance: downsampling and/or re-weighting. These often led to some improvement in mean AUC, although the improvements were often small relative to the underlying variation.

The naive Bayes model achieved the lowest performance with a mean AUC of around 0.876. The logistic regression model with an L1 penalty showed a fairly large improvement with a mean AUC of 0.943. LightGBM exhibited similar results with a mean AUC of 0.942. A considerable further improvement was achieved by a small recurrent neural network with two LSTM layers, with a mean AUC of approximately 0.961. Owing to the restrictions on notebook execution time, the computationally expensive models in particular had to be kept rather small.

The performance measure used for the competition was an extension of the AUC that takes into account performance over different classes. For some of the models considered here, one option for adapting to this extended measure may be to adjust the sample weights or loss functions accordingly.

Finally, there were several kernels I benefited from, and I want to express my gratitude. Some of them are listed below (although there were many more).



**EDA**

https://www.kaggle.com/kabure/simple-eda-hard-views-w-easy-code

https://www.kaggle.com/ekhtiar/unintended-eda-with-tutorial-notes

https://www.kaggle.com/nz0722/simple-eda-text-preprocessing-jigsaw

https://www.kaggle.com/s7anmerk/lean-import-to-save-ram-and-eda


**Embeddings**

https://www.kaggle.com/christofhenkel/how-to-preprocessing-for-glove-part1-eda

https://www.kaggle.com/christofhenkel/how-to-preprocessing-for-glove-part2-usage


**LSTM**

https://www.kaggle.com/thousandvoices/simple-lstm?scriptVersionId=12514554



**further refs**

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

https://fasttext.cc/docs/en/english-vectors.html

http://www.tfidf.com/
