# Ranking Toxic Comments using LSTM network maybe?

So while Ray is trying the more efficient approach of Naive Bayes classification, I'm going to try training an entire LSTM to solve this problem... in a Jupyter notebook... so uh, yeah.

In [23]:
# Import
import tensorflow as tf
import pandas as pd
import numpy as np
from tensorflow.keras import Model
from tensorflow.keras.layers import Input, Embedding, GRU, Dense, Dropout
from sklearn.preprocessing import OneHotEncoder

My plan is to use two LSTM networks to encode the two comments, then output a one-hot vector with 2 values: one for "greater" and one for "less". The dataset only tells us which comment in a pair is more toxic, so that is the only thing the network can learn to predict, and the architecture reflects that.

Something like this:

![Toxicity LSTM Network](toxicity-lstm-network.svg)

I think the size of each layer can be determined by some hyperparameter tuning (or just guess & check, since I'm not Google and don't have infinite compute power).
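
A cheap version of guess & check is a small random search over layer sizes. The sketch below is only an illustration: `sample_configs` and `pick_best` are hypothetical helpers, the keys and candidate values are made-up placeholders, and `score_fn` stands in for whatever would train a model with a given configuration and return its validation accuracy.

```python
import itertools
import random

# Candidate values are illustrative guesses, not tuned recommendations.
SEARCH_SPACE = {
    'embed_units': [32, 64, 128],
    'recur_units': [32, 64, 128],
    'dense_units': [32, 64],
    'dropout_rate': [0.1, 0.3],
}

def sample_configs(space, n, seed=0):
    """Pick n distinct configurations at random from the full grid."""
    keys = sorted(space)
    grid = [dict(zip(keys, values))
            for values in itertools.product(*(space[k] for k in keys))]
    rng = random.Random(seed)
    return rng.sample(grid, min(n, len(grid)))

def pick_best(configs, score_fn):
    """Return the configuration with the highest score, plus that score."""
    # The index breaks ties so that dicts are never compared directly.
    scored = [(score_fn(config), i, config) for i, config in enumerate(configs)]
    best_score, _, best_config = max(scored)
    return best_config, best_score

# Dummy scorer so the sketch runs without training anything:
# it simply prefers smaller networks.
def dummy_score(config):
    return -(config['embed_units'] + config['recur_units'] + config['dense_units'])

configs = sample_configs(SEARCH_SPACE, n=5)
best_config, best_score = pick_best(configs, dummy_score)
print(best_config, best_score)
```

Sampling a handful of configurations instead of walking the whole grid keeps the number of training runs small, which matters when each run takes a long time.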

But first we need to download and extract the data. This requires `unzip` and the `kaggle` command-line tool.

In [21]:
%%bash
# Download and extract data (run if on Linux/MacOS)
rm -rf data
mkdir data
cd data
kaggle competitions download --force -c jigsaw-toxic-severity-rating
unzip jigsaw-toxic-severity-rating.zip

Downloading jigsaw-toxic-severity-rating.zip to /Users/surenderkharbanda/Documents/GitHub/toxicitykaggle/data


100%|██████████| 6.72M/6.72M [00:00<00:00, 16.3MB/s]



Archive:  jigsaw-toxic-severity-rating.zip
  inflating: comments_to_score.csv   
  inflating: sample_submission.csv   
  inflating: validation_data.csv     


Now that we have our data, we can write a function that loads it and converts it into a TensorFlow dataset for our model. The `worker` column is the id of the person who scored the comments. It's not really important, so we can ignore that column.

For each row, we'll need the row as is and also the reverse of the row (a,b and b,a). This way we can train the model to understand both the greater-than and the less-than case. Otherwise every output would be the same label.

Prior experience has taught me that word-based encoding is not all that useful, so we'll encode using characters instead. We can do this using `tf.keras.layers.StringLookup`, which scans the dataset to build a vocabulary and then maps each character to an integer index.
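
To make the character encoding idea concrete, here is a rough plain-Python sketch of building a character vocabulary and turning strings into integer ids. It is not how `StringLookup` is implemented, and its handling of special tokens and ties may differ. `build_char_vocab` and `encode` are hypothetical helpers, and the toy comments are made up.

```python
from collections import Counter

def build_char_vocab(texts):
    """Map each character to an integer index, most frequent first.

    Index 0 is reserved for characters that were never seen.
    """
    counts = Counter(ch for text in texts for ch in text)
    # Sort by descending frequency, then by character for a stable order.
    ordered = sorted(counts, key=lambda ch: (-counts[ch], ch))
    return {ch: index for index, ch in enumerate(ordered, start=1)}

def encode(text, vocab):
    """Convert a string into a list of integer character ids."""
    return [vocab.get(ch, 0) for ch in text]

# Made-up toy comments, not rows from the competition data.
comments = ['you are wrong', 'you are right']
vocab = build_char_vocab(comments)
print(len(vocab) + 1)          # vocabulary size including the unknown slot
print(encode('you?', vocab))   # '?' was never seen, so it maps to 0
```

Character vocabularies stay small (a few hundred entries at most for typical text), which keeps the embedding layer cheap compared to a word-level vocabulary.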

In [49]:
def input_data(train_frac=0.7, shuffle=200, batch=200, repeat=3, display=False):
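    """
    Extract and preprocess data for trainer
    
    :param train_frac: fraction of batches to use for training (the rest is for testing)
    :param shuffle: size of the shuffle buffer
    :param batch: number of rows per batch
    :param repeat: number of times to repeat dataset
    :param display: if true, print a sample of data
    :return: training dataset, testing dataset, and vocabulary size
    """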
    # Pull data from csv
    print('Pulling data...')
    csv_data = pd.read_csv('data/validation_data.csv')
    csv_data = csv_data[['less_toxic', 'more_toxic']]
    
    # Our inputs are labeled as "more toxic" and "less toxic"
    # but we want to pass in both with their comparison being
    # unknown, as the network is supposed to figure that out.
    # So, we create two sets, one of which has the order swapped,
    # and we name both sequence columns "sequence A" and
    # "sequence B". We then assign a label to each, with the
    # original having 'greater' and the swapped one having 'less'.
    # Therefore, the network will see both cases for each input
    # in the dataset and can train for both.
    print('Generating labeled data...')
    labeled_data_greater = csv_data.copy()
    labeled_data_greater.rename(
        columns={'less_toxic': 'seq_a', 'more_toxic': 'seq_b' }, 
        inplace=True)
    labeled_data_greater['label'] = 'greater'
    labeled_data_less = csv_data.copy()
    labeled_data_less.rename(
        columns={ 'more_toxic': 'seq_a', 'less_toxic': 'seq_b' }, 
        inplace=True)
    labeled_data_less['label'] = 'less'
    labeled_data = pd.concat([ labeled_data_greater, labeled_data_less ])
    labeled_data = labeled_data.sample(frac=1)
    
    # Now we take all sequences of characters and convert them to sequences 
    # of integers. We can do that using keras's StringLookup preprocessing 
    # layer, which will take a string and spit out an array of integers 
    # encoding the characters of the string. This layer will need to scan 
    # the dataset to determine the appropriate encoding vocabulary for the 
    # characters. Since both sequences essentially contain all of the data, 
    # we can just use one of the sequences for the StringLookup to scan.
    print('Encoding inputs...')
    seq_a = tf.strings.unicode_split(labeled_data['seq_a'], input_encoding='UTF-8')
    seq_b = tf.strings.unicode_split(labeled_data['seq_b'], input_encoding='UTF-8')
    encoder = tf.keras.layers.StringLookup()
    encoder.adapt(seq_a)
    vocab_size = encoder.vocabulary_size()
    if display:
        print('Vocab size:', vocab_size)
    seqint_a = encoder(seq_a)
    seqint_b = encoder(seq_b)
    
    # Then we can one-hot encode our labels using scikit-learn's 
    # OneHotEncoder class. Since we know our labels ahead of time, 
    # I figured we don't need to train it. HOWEVER, scikit-learn 
    # doesn't seem to think so, as it expects us to call fit on 
    # the data anyway... so yeah.
    print('Encoding labels...')
    label_encoder = OneHotEncoder(
        categories=[['greater', 'less']], 
        handle_unknown='ignore')
    label_array = labeled_data['label'].values.reshape(-1, 1)
    label_encoder.fit(label_array)
    labels = label_encoder.transform(label_array)
    labels = tf.constant(labels.toarray())
    
    # Create dataset. Shuffle, batch, repeat, etc.
    print('Creating dataset...')
    dataset = tf.data.Dataset.from_tensor_slices(
        ((seqint_a, seqint_b), labels))
    dataset = dataset.shuffle(shuffle)
    dataset = dataset.batch(batch)
    dataset = dataset.repeat(repeat)
    
    # Display a sample
    if display:
        print('Dataset spec:', dataset)
        # Use separate names so the full seq_a / seq_b tensors aren't shadowed
        for (sample_a, sample_b), sample_labels in dataset.take(1):
            print('Sample input sequence A:', sample_a[0])
            print('Sample input sequence B:', sample_b[0])
            print('Sample output labels:', sample_labels[0])
            
    # Split into training and testing data
    print('Splitting into training and testing...')
    train_num = int(train_frac*len(dataset))
    train_dataset = dataset.take(train_num)
    test_dataset = dataset.skip(train_num)
    
    # Return training data, testing data, and vocab size
    return train_dataset, test_dataset, vocab_size
    
# Run input data function to test it out
train_dataset, test_dataset, vocab_size = input_data(display=True)

Pulling data...
Generating labeled data...
Encoding inputs...
Vocab size: 475
Encoding labels...
Creating dataset...
Dataset spec: <RepeatDataset shapes: (((None, None), (None, None)), (None, 2)), types: ((tf.int64, tf.int64), tf.float64)>
Sample input sequence A: tf.Tensor(
[ 1 63  5  7 13  5 11  6  8 16  1 28 28 23  1  8 12 17 17  2  8  3  1 15
  4 12  1 17  2  3  1  8  4 16  2  1  2 13 12 14  5  3  6  4  7  1 22  2
 19  4  9  2  1 15  4 12  1  5 14  3  1  5  8  1  5  1 20  6 24  6 18  2
 13  6  5  1  2 13  6  3  4  9 21  1 25  5  7 13  5 11  6  8 16  1 10  5
  8  1  3 10  2  1  6  7  3  2  7  3  6  4  7  1  4 19  1 13  2  8  3  9
  4 15  6  7 17 21 28 28 23  3  8  1  6  3  1 14  4  7  8  3  9 12 14  3
  6 25  2  1  3  4  1 20  5  9  7  1 18  2  4 18 11  2  1  4 19 19  1 22
  2  6  7 17  1  5  1  9  2 25  2  9  3  1 17  4  4  7  1 51 28 28 22 12
  3  1  3 10  5  3  8  1  1 20 10  5  3  1  1  3 10  2  1  9  5  7 24  8
  1  4 19  1 20  6 24  6 18  2 13  6  5  1  5  9  2  1 19 12 11 11 

After this, we build the model using Keras, train it, and then validate it on the test set.

The original problem calls for ranking comments based on toxicity, so we can use this network as a comparator function to sort the list of toxic comments. So, first, the network. I'm creating a function that returns a model based on its parameters. This will be used for the optimization step.
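
The comparator idea can be sketched with `functools.cmp_to_key`. This is only an illustration under assumptions: `rank_comments` is a hypothetical helper, and `dummy_scorer` is a stand-in that counts exclamation marks where a real version would call the trained network on a pair of comments.

```python
from functools import cmp_to_key

def rank_comments(comments, prob_b_more_toxic):
    """Sort comments from least to most toxic using a pairwise scorer.

    prob_b_more_toxic(a, b) should return the estimated probability
    that comment b is more toxic than comment a.
    """
    def compare(a, b):
        p = prob_b_more_toxic(a, b)
        if p > 0.5:
            return -1   # a is less toxic, so it sorts first
        if p < 0.5:
            return 1
        return 0
    return sorted(comments, key=cmp_to_key(compare))

# Stand-in scorer: counts exclamation marks instead of calling a model.
def dummy_scorer(a, b):
    if a.count('!') == b.count('!'):
        return 0.5
    return 1.0 if b.count('!') > a.count('!') else 0.0

print(rank_comments(['bad!!', 'fine', 'hmm!'], dummy_scorer))
```

A learned comparator is not guaranteed to be transitive, so the sorted order may depend on the input order, and each comparison costs one forward pass through the network.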

At the last minute, I decided to use a GRU network instead of an LSTM network.

In [62]:
def create_model(vocab_size, embed_units=64, recur_units=64, dense_units=64, dropout_rate=0.1):
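    """
    Create and compile the two-branch comparison model

    :param vocab_size: number of distinct tokens the embeddings must cover
    :param embed_units: size of each character embedding
    :param recur_units: number of units in each GRU
    :param dense_units: number of units in the hidden dense layer
    :param dropout_rate: dropout rate used after the GRU and dense layers
    """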
    # Branch A: embedding + GRU encoder for sequence A
    input_a = Input((None,), name='input_a')
    embed_a = Embedding(vocab_size, embed_units, name='embed_a')(input_a)
    recur_a = GRU(recur_units, name='recur_a')(embed_a)
    drop_a = Dropout(dropout_rate, name='drop_a')(recur_a)
    
    # Branch B: embedding + GRU encoder for sequence B
    input_b = Input((None,), name='input_b')
    embed_b = Embedding(vocab_size, embed_units, name='embed_b')(input_b)
    recur_b = GRU(recur_units, name='recur_b')(embed_b)
    drop_b = Dropout(dropout_rate, name='drop_b')(recur_b)
    
    # Concatenation and dense layers
    concat = tf.keras.layers.Concatenate(axis=1, name='concatenate')([ drop_a, drop_b ])
    dense = Dense(dense_units, activation='relu', name='dense')(concat)
    drop_d = Dropout(dropout_rate, name='drop_d')(dense)
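    # 2 labels in the output layer. Softmax turns the outputs into a probability
    # distribution, which CategoricalCrossentropy expects by default; a relu
    # output can be all zeros and lead to a nan loss.
    output = Dense(2, activation='softmax', name='labels')(drop_d)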
    
    # Final model configuration
    model = Model([input_a, input_b], output)
    model.summary()
    model.compile(
        optimizer='adam',
        loss=tf.keras.losses.CategoricalCrossentropy(),
        metrics=['accuracy']
    )
    return model

# Hooray, model created
tf.keras.backend.clear_session()
model = create_model(vocab_size)

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_a (InputLayer)           [(None, None)]       0           []                               
                                                                                                  
 input_b (InputLayer)           [(None, None)]       0           []                               
                                                                                                  
 embed_a (Embedding)            (None, None, 64)     30400       ['input_a[0][0]']                
                                                                                                  
 embed_b (Embedding)            (None, None, 64)     30400       ['input_b[0][0]']                
                                                                                              

Now we fit the model.

In [63]:
model.fit(train_dataset)



  6/634 [..............................] - ETA: 1:43:49 - loss: nan - accuracy: 0.4842

Exception ignored in: <function ScopedTFGraph.__del__ at 0x112e36f70>
Traceback (most recent call last):
  File "/Users/surenderkharbanda/Documents/GitHub/toxicitykaggle/env/lib/python3.9/site-packages/tensorflow/python/framework/c_api_util.py", line 58, in __del__
    self.deleter(self.graph)
KeyboardInterrupt: 

KeyboardInterrupt
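
A `nan` loss like the one in the progress bar above is what categorical cross-entropy tends to produce when the final layer can emit all zeros, which a `relu` output activation allows and a `softmax` one does not. The plain-Python sketch below illustrates the arithmetic. It assumes the loss rescales the outputs to sum to 1 before taking the log, which is a simplified model rather than the exact Keras code, and the pre-activation values are made up.

```python
import math

def relu(xs):
    return [max(0.0, x) for x in xs]

def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, outputs, eps=1e-7):
    """Cross-entropy on outputs that are first rescaled to sum to 1."""
    total = sum(outputs)
    # Float arrays yield nan for 0/0; plain Python raises, so emulate nan.
    scaled = [o / total if total != 0.0 else math.nan for o in outputs]
    clipped = [min(max(s, eps), 1.0 - eps) for s in scaled]
    return -sum(t * math.log(c) for t, c in zip(target, clipped))

pre_activations = [-1.2, -0.3]   # made-up values for the last Dense layer
target = [1.0, 0.0]              # one-hot label

loss_relu = cross_entropy(target, relu(pre_activations))
loss_softmax = cross_entropy(target, softmax(pre_activations))
print(loss_relu, loss_softmax)
```

When both pre-activations are negative, `relu` zeroes them out, the rescaling divides zero by zero, and the `nan` propagates through the log into the loss. Softmax always returns strictly positive values that sum to 1, so the loss stays finite.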

