# Toxicity // Second Keras LSTM
Project: [Jigsaw Unintended Bias in Toxicity Classification](https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification)



## Introduction

This is the second version of my initial kernel. I had stripped the first version down to its simplest form because the kernel ran out of memory, for reasons I still have not identified.

This second iteration was meant to add one more preprocessing step: expanding contractions.

However, the second version does not rank at all and, at the time of writing, I do not understand why.

Based on the [Simple LSTM kernel](https://www.kaggle.com/thousandvoices/simple-lstm). Credit to @thousandvoices for the base model and initial preprocessing.

My objective is not to win. This is my first real-world Keras implementation.

I updated / generalized some code, added some comments, added preprocessing steps, etc.

## Ideas for improvement

General:
* Do some EDA
* GridSearch (while avoiding memory problems...)
* Limit number of words used in the neural network to avoid overfitting
* ...


Pre-processing:
* Do something with words not found in vocabulary and sort out:
 * first names, last names, places, etc.
 * spelling mistakes (fix them)
 * acronyms (replace with words)
 * etc.
 
* Try:
 * weighted average of embeddings
 * ...

# Configuration

## Import

In [1]:
# Import libraries
# MAIN
import numpy as np
import pandas as pd
import requests
import re
import math
import seaborn as sns
import operator
import logging
import os

# Keras
from keras.models import Model
from keras.layers import Input, Dense, Embedding, SpatialDropout1D, Dropout, add, concatenate
from keras.layers import CuDNNLSTM, Bidirectional, GlobalMaxPooling1D, GlobalAveragePooling1D
from keras.preprocessing import text, sequence
from keras.callbacks import LearningRateScheduler

Using TensorFlow backend.


## Lookups

In [2]:
SPECIAL_CHARS_MAPPING = {"_":" ", "`":" "}
# Characters to pad with spaces (each listed once; the backslash is escaped)
SPECIAL_CHARS = "/-'?!.,#$%()*+:;<=>@[\\]^_`{|}~" + '"“”’' + '∞θ÷α•à−β∅³π‘₹´°£€×™√²—–&'

def clean_special_chars(text):
    for p in SPECIAL_CHARS_MAPPING:
        text = text.replace(p, SPECIAL_CHARS_MAPPING[p])    
    for p in SPECIAL_CHARS:
        text = text.replace(p, f' {p} ')     
    return text

In [3]:
CONTRACTION_LOOKUP_EN = {"ain't": "is not"
                      , "aren't": "are not"
                      ,"can't": "cannot"
                      , "'cause": "because"
                      , "could've": "could have"
                      , "couldn't": "could not", "didn't": "did not"
                      ,  "doesn't": "does not", "don't": "do not"
                      , "hadn't": "had not", "hasn't": "has not"
                      , "haven't": "have not", "he'd": "he would"
                      ,"he'll": "he will", "he's": "he is", "how'd": "how did"
                      , "how'd'y": "how do you", "how'll": "how will"
                      , "how's": "how is",  "I'd": "I would"
                      , "I'd've": "I would have", "I'll": "I will"
                      , "I'll've": "I will have","I'm": "I am", "I've": "I have"
                      , "i'd": "i would", "i'd've": "i would have"
                         , "i'll": "i will",  "i'll've": "i will have"
                         ,"i'm": "i am", "i've": "i have", "isn't": "is not"
                         , "it'd": "it would", "it'd've": "it would have"
                         , "it'll": "it will", "it'll've": "it will have"
                         ,"it's": "it is", "let's": "let us", "ma'am": "madam"
                         , "mayn't": "may not", "might've": "might have"
                         ,"mightn't": "might not"
                         ,"mightn't've": "might not have"
                         , "must've": "must have", "mustn't": "must not"
                         , "mustn't've": "must not have", "needn't": "need not"
                         , "needn't've": "need not have"
                         ,"o'clock": "of the clock", "oughtn't": "ought not"
                         , "oughtn't've": "ought not have", "shan't": "shall not", "sha'n't": "shall not"
                         , "shan't've": "shall not have", "she'd": "she would"
                         , "she'd've": "she would have", "she'll": "she will"
                         , "she'll've": "she will have", "she's": "she is", "should've": "should have"
                         , "shouldn't": "should not", "shouldn't've": "should not have"
                         , "so've": "so have","so's": "so as", "this's": "this is"
                         ,"that'd": "that would", "that'd've": "that would have"
                         , "that's": "that is", "there'd": "there would"
                         , "there'd've": "there would have", "there's": "there is"
                         , "here's": "here is","they'd": "they would", "they'd've": "they would have"
                         , "they'll": "they will", "they'll've": "they will have"
                         , "they're": "they are", "they've": "they have", "to've": "to have"
                         , "wasn't": "was not", "we'd": "we would", "we'd've": "we would have"
                         , "we'll": "we will", "we'll've": "we will have", "we're": "we are", "we've": "we have", "weren't": "were not"
                         , "what'll": "what will", "what'll've": "what will have"
                         , "what're": "what are",  "what's": "what is", "what've": "what have"
                         , "when's": "when is", "when've": "when have", "where'd": "where did"
                         , "where's": "where is", "where've": "where have", "who'll": "who will"
                         , "who'll've": "who will have", "who's": "who is", "who've": "who have"
                         , "why's": "why is", "why've": "why have", "will've": "will have"
                         , "won't": "will not", "won't've": "will not have", "would've": "would have"
                         , "wouldn't": "would not", "wouldn't've": "would not have"
                         , "y'all": "you all", "y'all'd": "you all would"
                         ,"y'all'd've": "you all would have","y'all're": "you all are"
                         ,"y'all've": "you all have","you'd": "you would"
                         , "you'd've": "you would have", "you'll": "you will"
                         , "you'll've": "you will have", "you're": "you are", "you've": "you have" }

In [4]:
def known_contractions(embed):
    """
    Returns the list of contractions from the lookup that have their own
    entry in an embedding index (a dict mapping word -> vector)
    """
    known = []
    for contract in CONTRACTION_LOOKUP_EN:
        if contract in embed:
            known.append(contract)
    return known
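
As a usage sketch, `known_contractions` can be called on any dict-like embedding index. The toy `embed` dict and the shortened lookup below are made up for illustration; in this notebook the real call would be against `embeddings_crawl`, which is only loaded further down.

```python
# Toy stand-ins (hypothetical values) for the lookup and the embedding index
CONTRACTION_LOOKUP_EN = {"can't": "cannot", "don't": "do not", "y'all": "you all"}
embed = {"can't": [0.1, 0.2], "hello": [0.3, 0.4], "y'all": [0.5, 0.6]}

def known_contractions(embed):
    """Return the contractions from the lookup that have their own embedding."""
    return [c for c in CONTRACTION_LOOKUP_EN if c in embed]

known = known_contractions(embed)
print(known)
```

Contractions that show up in this list already have a vector of their own, so expanding them is optional; the ones that do not show up are the ones the lookup actually helps with.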

## Read data

In [6]:
import os
print(os.listdir("../input"))

['fasttext-crawl-300d-2m', 'jigsaw-unintended-bias-in-toxicity-classification']


In [7]:
train = pd.read_csv('../input/jigsaw-unintended-bias-in-toxicity-classification/train.csv')
test = pd.read_csv('../input/jigsaw-unintended-bias-in-toxicity-classification/test.csv')

# Prepare

## Word embeddings

In [8]:
def get_coefs(word, *arr):
    """
    Turn one parsed line of the embedding file into a (word, vector) pair
    """
    return word, np.asarray(arr, dtype='float32')

In [9]:
def load_embeddings(path):
    """
    Load embeddings from a file path into a {word: vector} dict
    """
    with open(path, encoding='utf-8') as f:
        return dict(get_coefs(*line.strip().split(' ')) for line in f)
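
To see what these two helpers produce without loading the full 2M-word file, here is a minimal sketch that parses a tiny in-memory sample. The two sample lines and their 3-dimensional vectors are invented; only the "word v1 v2 ..." layout is taken from the parsing code above.

```python
import io
import numpy as np

def get_coefs(word, *arr):
    # first field is the word, the remaining fields are the vector components
    return word, np.asarray(arr, dtype='float32')

# Hypothetical 3-dimensional sample in the same space-separated layout
sample = io.StringIO("hello 0.1 0.2 0.3\nworld 0.4 0.5 0.6\n")
embeddings = dict(get_coefs(*line.strip().split(' ')) for line in sample)

print(sorted(embeddings))
print(embeddings['hello'].shape, embeddings['hello'].dtype)
```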

In [10]:
# FILES

filepath_crawl = '../input/fasttext-crawl-300d-2m/crawl-300d-2M.vec'
embeddings_crawl = load_embeddings(filepath_crawl)

# Preprocessing

## Configure

In [11]:
NUM_MODELS = 2
BATCH_SIZE = 512
LSTM_UNITS = 128
DENSE_HIDDEN_UNITS = 4 * LSTM_UNITS
EPOCHS = 4
MAX_LEN = 220

## Text preprocessing and tokenizer

This section is based on [this kernel](https://www.kaggle.com/theoviel/improve-your-score-with-text-preprocessing-v2) by [@theoviel](https://www.kaggle.com/theoviel), which itself is based on [this kernel](https://www.kaggle.com/christofhenkel/how-to-preprocessing-when-using-embeddings) by [@Dieter](https://www.kaggle.com/christofhenkel).

I added comments and tweaked it a little.


In [12]:
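# Expand contractions using a lookup table
def clean_contractions(text, mapping):
    # Normalize typographic apostrophes so that they match the lookup keys
    specials = ["’", "‘", "´", "`"]
    for s in specials:
        text = text.replace(s, "'")
    # Replace every space-separated token that is a known contraction
    text = ' '.join([mapping.get(t, t) for t in text.split(" ")])
    return text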

In [13]:
def preprocess(data):
    """
    Preprocess a pandas Series of strings:
    1) expand contractions
    2) clean special characters
    3) ...
    4) ... to be continued
    """

    # Expand contractions first: clean_special_chars pads every apostrophe
    # with spaces ("don't" -> "don ' t"), after which no token would match
    # a key in CONTRACTION_LOOKUP_EN.
    data = data.astype(str).apply(lambda x: clean_contractions(x, CONTRACTION_LOOKUP_EN))

    # clean special characters
    data = data.apply(clean_special_chars)
    return data

x_train = preprocess(train['comment_text'])
x_test = preprocess(test['comment_text'])
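
The order of the two cleaning steps matters. Below is a minimal sketch of what happens to a contraction under each order. `pad_apostrophes`, `expand` and the two-entry `LOOKUP` are simplified, hypothetical stand-ins, not the notebook's functions.

```python
# Simplified stand-ins for the notebook's helpers (assumptions for illustration)
LOOKUP = {"don't": "do not", "it's": "it is"}

def pad_apostrophes(text):
    # mimics what padding special characters does to the apostrophe
    return text.replace("'", " ' ")

def expand(text, mapping):
    # replace each space-separated token that is a key of the mapping
    return ' '.join(mapping.get(t, t) for t in text.split(' '))

raw = "don't worry it's fine"

pad_then_expand = expand(pad_apostrophes(raw), LOOKUP)
expand_then_pad = pad_apostrophes(expand(raw, LOOKUP))

print(pad_then_expand.split())
print(expand_then_pad.split())
```

When the apostrophes are padded first, "don't" is already split into three tokens and the lookup never matches; expanding first gives "do not".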

In [14]:
# binarize the target: 1 (toxic) if the score is at least 0.5, else 0
y_train = np.where(train['target'] >= 0.5, 1, 0)

# auxiliary targets (continuous scores), used as extra training signals
# Q: why does it include the target?
y_aux_train = train[['target', 'severe_toxicity', 'obscene', 'identity_attack', 'insult', 'threat']]
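
A small sketch of the thresholding step on made-up scores (the four values are invented for illustration), showing that a score of exactly 0.5 counts as toxic:

```python
import numpy as np

# Hypothetical target scores
scores = np.array([0.0, 0.49, 0.5, 0.9])

labels = np.where(scores >= 0.5, 1, 0)
print(labels.tolist())
```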

## Tokenize
Transform text into integer tokens.

In [None]:
tokenizer = text.Tokenizer()

# build the vocabulary from both train and test data
tokenizer.fit_on_texts(list(x_train) + list(x_test))

# Transform each text into a sequence of integers
x_train = tokenizer.texts_to_sequences(x_train)
x_test = tokenizer.texts_to_sequences(x_test)

# Pad or truncate sequences to MAX_LEN (zeros are added at the front by default)
x_train = sequence.pad_sequences(x_train, maxlen=MAX_LEN)
x_test = sequence.pad_sequences(x_test, maxlen=MAX_LEN)
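
With its default arguments (`padding='pre'`, `truncating='pre'`), `pad_sequences` pads short sequences at the front and keeps only the last `maxlen` tokens of long ones. The helper below is a pure-Python imitation of that behaviour for a single sequence; `pad_pre` is a hypothetical name, not a Keras function.

```python
def pad_pre(seq, maxlen, value=0):
    """Imitate the pad_sequences defaults for one sequence:
    keep the last `maxlen` tokens, then left-pad with `value`."""
    seq = list(seq)[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq

short = pad_pre([5, 8, 2], maxlen=5)
long = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short)
print(long)
```

For comments longer than `MAX_LEN` tokens, this means the beginning of the comment is what gets dropped.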

In [None]:
def build_matrix(word_index):
    """
    Build the embedding matrix for the tokenizer vocabulary
    
    @args: word_index, dict mapping each word to its integer index
    """
  
    # Use FastText Crawl only for memory purposes
    embedding_index = embeddings_crawl
    
    # create matrix (row 0 is reserved for padding)
    embedding_matrix = np.zeros((len(word_index) + 1, 300))
    for word, i in word_index.items():
        try:
            embedding_matrix[i] = embedding_index[word]
        except KeyError:
            # words missing from the embeddings keep an all-zero row
            pass
    return embedding_matrix

embedding_matrix = build_matrix(tokenizer.word_index)
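
To make the handling of out-of-vocabulary words concrete, here is a toy version of the same loop that also records which words were not found. The 3-dimensional `toy_embeddings`, the `toy_word_index` and the `coverage` figure are all made up for illustration.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings and a tokenizer-style word index
toy_embeddings = {"good": np.array([1.0, 0.0, 0.0]),
                  "bad": np.array([0.0, 1.0, 0.0])}
toy_word_index = {"good": 1, "bad": 2, "asdfgh": 3}

matrix = np.zeros((len(toy_word_index) + 1, 3))
missing = []
for word, i in toy_word_index.items():
    if word in toy_embeddings:
        matrix[i] = toy_embeddings[word]
    else:
        missing.append(word)  # this row stays all zeros

coverage = 1 - len(missing) / len(toy_word_index)
print(matrix.shape, missing, round(coverage, 2))
```

A list like `missing` is one possible starting point for the "words not found in vocabulary" idea listed in the introduction.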

# Model

## Configure model

In [None]:
def build_model(embedding_matrix, num_aux_targets, loss_fn='binary_crossentropy', optimizer='adam'):
    
    # Create Input Layer
    words = Input(shape=(MAX_LEN,))
    
    # Embedding layer: maps each word index to its pretrained vector (frozen, not trained)
    x = Embedding(*embedding_matrix.shape, weights=[embedding_matrix], trainable=False)(words)
    
    # Dropout regularization to avoid overfitting
    # How it works: at each training iteration, some units are randomly
    # disabled so that they cannot become too dependent on each other
    # (the network does not have the same configuration each time)
    
    # SpatialDropout1D performs the same function as Dropout, but it drops
    # entire 1D feature maps instead of individual elements.
    # If adjacent frames within feature maps are strongly correlated
    # (as is normally the case in early convolution layers), regular dropout
    # will not regularize the activations and will otherwise just result in
    # an effective learning rate decrease.
    # In this case, SpatialDropout1D will help promote independence between feature maps and should be used instead.
    x = SpatialDropout1D(0.3)(x)
    x = Bidirectional(CuDNNLSTM(LSTM_UNITS, return_sequences=True))(x)
    x = Bidirectional(CuDNNLSTM(LSTM_UNITS, return_sequences=True))(x)

    # Global average pooling (GAP) layers are commonly used to limit overfitting
    # by reducing the total number of parameters in the model.
    # Like max pooling layers, GAP layers reduce the dimensions of a three-dimensional tensor:
    # here both collapse the time axis, (batch, MAX_LEN, 2 * LSTM_UNITS) -> (batch, 2 * LSTM_UNITS).
    
    # Q: Unclear on why we are using both here.
    hidden = concatenate([
        GlobalMaxPooling1D()(x),
        GlobalAveragePooling1D()(x),
    ])
    
    # Add two ReLU hidden layers with residual (skip) connections.
    # add() needs matching shapes, hence DENSE_HIDDEN_UNITS = 4 * LSTM_UNITS
    hidden = add([hidden, Dense(DENSE_HIDDEN_UNITS, activation='relu')(hidden)])
    hidden = add([hidden, Dense(DENSE_HIDDEN_UNITS, activation='relu')(hidden)])
    
    # Output layer
    result = Dense(1, activation='sigmoid')(hidden)
    
    # Auxiliary results (categorization)
    # ex. 'severe_toxicity', 'obscene', 'identity_attack', 'insult', 'threat'
    aux_result = Dense(num_aux_targets, activation='sigmoid')(hidden)
    
    # Create model, including auxiliary results
    model = Model(inputs=words, outputs=[result, aux_result])
    
    # The same loss is applied to both outputs (main: toxic or not; auxiliary: subtype scores)
    model.compile(loss=loss_fn, optimizer=optimizer, metrics=['accuracy'])

    return model
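
The shapes flowing through the pooling step can be checked with plain NumPy. This is only a sketch of the tensor shapes, with random numbers standing in for the LSTM output and a batch size of 2 chosen arbitrarily; it does not build or train anything.

```python
import numpy as np

LSTM_UNITS = 128
batch, steps = 2, 220  # arbitrary batch size, steps = MAX_LEN

# Random stand-in for the output of the second bidirectional LSTM:
# (batch, time steps, 2 * LSTM_UNITS), since both directions are concatenated
x = np.random.rand(batch, steps, 2 * LSTM_UNITS)

max_pooled = x.max(axis=1)    # same reduction as GlobalMaxPooling1D
avg_pooled = x.mean(axis=1)   # same reduction as GlobalAveragePooling1D
hidden = np.concatenate([max_pooled, avg_pooled], axis=-1)

print(max_pooled.shape, avg_pooled.shape, hidden.shape)
```

The concatenated vector has `4 * LSTM_UNITS` features, which is why `DENSE_HIDDEN_UNITS` is set to that value: the `add([hidden, Dense(...)(hidden)])` calls only work when both operands have the same width.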

In [None]:
checkpoint_predictions = []
weights = []

for model_idx in range(NUM_MODELS):
    model = build_model(embedding_matrix, y_aux_train.shape[-1])
    for global_epoch in range(EPOCHS):
        
        #start_time = time.time()

        #print('Epoch {}/{} \t starttime={:.2f}s'.format(
        #      global_epoch + 1, EPOCHS, start_time))
        
        model.fit(
            x_train,
            [y_train, y_aux_train],
            batch_size=BATCH_SIZE,
            epochs=1,
            verbose=2,
            callbacks=[
                LearningRateScheduler(lambda epoch: 1e-3 * (0.6 ** global_epoch))
            ]
        )
        
        #elapsed_time = time.time() - start_time
        #print('Epoch {}/{} \t time={:.2f}s'.format(
        #      global_epoch + 1, EPOCHS, elapsed_time))
        
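        # keep the test predictions of the main output after every epoch
        checkpoint_predictions.append(model.predict(x_test, batch_size=2048)[0].flatten())
        # later epochs get exponentially larger weights in the final average
        weights.append(2 ** global_epoch)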
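
Because `fit` is called with `epochs=1` inside the loop, the scheduler's own `epoch` argument is always 0; the decay comes from `global_epoch`, which the lambda reads from the enclosing loop. The resulting learning rates can be listed without training anything (`BASE_LR` and `DECAY` are just names given here to the two constants in the lambda):

```python
EPOCHS = 4
BASE_LR = 1e-3   # starting rate used in the scheduler lambda
DECAY = 0.6      # multiplicative decay per global epoch

learning_rates = [BASE_LR * DECAY ** global_epoch for global_epoch in range(EPOCHS)]
for i, lr in enumerate(learning_rates, start=1):
    print(f'epoch {i}: lr = {lr:.6f}')
```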

## Predict

In [None]:
predictions = np.average(checkpoint_predictions, weights=weights, axis=0)
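
A worked example of the weighted average on made-up numbers: two checkpoints, two test comments, and weights 1 and 3 chosen only to keep the arithmetic easy to follow (the notebook itself uses 1, 2, 4, 8 per model).

```python
import numpy as np

# Hypothetical predictions from two checkpoints for two test comments
toy_checkpoint_predictions = [np.array([0.2, 0.8]), np.array([0.5, 0.5])]
toy_weights = [1, 3]

toy_predictions = np.average(toy_checkpoint_predictions, weights=toy_weights, axis=0)
print(toy_predictions)

# Same result by hand: (1 * 0.2 + 3 * 0.5) / 4 and (1 * 0.8 + 3 * 0.5) / 4
by_hand = [(1 * 0.2 + 3 * 0.5) / 4, (1 * 0.8 + 3 * 0.5) / 4]
```

The later checkpoint dominates the result, which matches the intent of giving more weight to the better-trained epochs.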

## Submit

In [None]:
submission = pd.DataFrame.from_dict({
    'id': test['id'],
    'prediction': predictions
})

In [None]:
submission.to_csv('submission.csv', index=False)