# Ranking Toxic Comments Using an LSTM Network, Maybe?

So while Ray is trying the more efficient approach of using Naive Bayes classification, I'll try training an entire LSTM to solve this problem... in a Jupyter notebook... so uh... yeah

In [18]:
# Import
import tensorflow as tf
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder

My plan is to use two LSTM networks to encode both comments, then output a one-hot vector with 2 values: one for "greater" and one for "less". The dataset only identifies which comment in a pair is more toxic than the other, so that comparison is all we can learn from, and the network reflects that.

Something like this:

![Toxicity%20LSTM%20Network-2.svg](attachment:Toxicity%20LSTM%20Network-2.svg)

I think the size of each layer can be determined by some hyperparameter tuning (or just guess & check, since I'm not Google and don't have infinite compute power).
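As a rough sketch of the plan above in Keras: each comment goes through its own LSTM encoder, the two encodings are concatenated, and a softmax layer produces the 2-value output. The function name `build_comparator`, the layer sizes, the shared embedding, and the use of padded dense integer inputs are all assumptions made for illustration, not settled choices.

```python
import tensorflow as tf

def build_comparator(vocab_size, embed_dim=32, lstm_units=64):
    """Hypothetical two-encoder comparator; all sizes are guesses."""
    # Each input is a batch of integer-encoded character sequences,
    # assumed here to be padded to a common length within the batch.
    seq_a = tf.keras.Input(shape=(None,), dtype='int64', name='seq_a')
    seq_b = tf.keras.Input(shape=(None,), dtype='int64', name='seq_b')

    # One embedding shared by both comments (an assumption),
    # followed by a separate LSTM encoder for each comment.
    embed = tf.keras.layers.Embedding(vocab_size, embed_dim)
    enc_a = tf.keras.layers.LSTM(lstm_units)(embed(seq_a))
    enc_b = tf.keras.layers.LSTM(lstm_units)(embed(seq_b))

    # Combine both encodings and classify as 'greater' or 'less'.
    merged = tf.keras.layers.Concatenate()([enc_a, enc_b])
    hidden = tf.keras.layers.Dense(32, activation='relu')(merged)
    output = tf.keras.layers.Dense(2, activation='softmax')(hidden)

    model = tf.keras.Model(inputs=[seq_a, seq_b], outputs=output)
    model.compile(optimizer='adam',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model
```

The softmax output pairs naturally with one-hot labels and a categorical cross-entropy loss. The sizes here would be the first things to tune.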

But first we need to download and extract the data. This requires `unzip`

In [10]:
# Download and extract data (run if on Linux/MacOS)
! rm -rf data
! kaggle competitions download --force -c jigsaw-toxic-severity-rating
! unzip jigsaw-toxic-severity-rating.zip -d data

Downloading jigsaw-toxic-severity-rating.zip to /Users/surenderkharbanda/Documents/GitHub/toxicitykaggle
 89%|█████████████████████████████████▉    | 6.00M/6.72M [00:00<00:00, 14.9MB/s]
100%|██████████████████████████████████████| 6.72M/6.72M [00:00<00:00, 16.2MB/s]
Archive:  jigsaw-toxic-severity-rating.zip
  inflating: data/comments_to_score.csv  
  inflating: data/sample_submission.csv  
  inflating: data/validation_data.csv  
[34mdata[m[m                             requirements.txt
[34menv[m[m                              toxic-lstm.ipynb
jigsaw-toxic-severity-rating.zip toxiccomments.ipynb
comments_to_score.csv sample_submission.csv validation_data.csv
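If `unzip` isn't available (on Windows, for example), the extraction step could probably be done with Python's standard `zipfile` module instead. This is only a sketch: the helper name `extract_archive` is made up, and it assumes the archive has already been downloaded into the working directory.

```python
import shutil
import zipfile
from pathlib import Path

def extract_archive(zip_path='jigsaw-toxic-severity-rating.zip', out_dir='data'):
    """Hypothetical stand-in for `rm -rf data` followed by `unzip ... -d data`."""
    out = Path(out_dir)
    # Remove any previous extraction, like `rm -rf data`.
    shutil.rmtree(out, ignore_errors=True)
    # Extract every file in the archive into the output directory.
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out)
    # Return the extracted file names so the result is easy to check.
    return sorted(p.name for p in out.iterdir())
```

Calling `extract_archive()` with the defaults should leave the same three CSV files in `data` as the shell version.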


Now that we have our data, we can create a function to pull it and convert it into a TensorFlow dataset for our model. The `worker` column is the id of the person who scored the comments. It isn't really important, so we can ignore it. For each row, we'll need the row as is and also its reverse (a,b and b,a). This way we can train the model to understand both the greater-than and less-than cases. Otherwise every label would be the same.

Prior experience has taught me that word-based encoding is not all that useful, so we'll encode using characters. We can do this using `tf.keras.layers.StringLookup`, which builds its vocabulary by scanning the dataset and then maps each character to an integer index.
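To make the character encoding concrete, here is a small sketch of `StringLookup` on a toy string (the string and the variable names are made up for illustration). With its default settings the layer reserves index 0 for out-of-vocabulary tokens and orders the remaining tokens by how often they appear in the adapted data.

```python
import tensorflow as tf

# Toy text, made up for illustration; the notebook adapts on real comments.
# Splitting a single string gives a 1-D tensor of characters.
chars = tf.strings.unicode_split('hello world', input_encoding='UTF-8')

# Build the vocabulary by scanning the characters.
encoder = tf.keras.layers.StringLookup()
encoder.adapt(chars)

# Index 0 is the out-of-vocabulary token; more frequent characters
# get smaller indices ('l' appears 3 times, 'o' twice).
vocab = encoder.get_vocabulary()
print(vocab[:3])

# Encode a few characters; 'z' was never seen, so it maps to index 0.
ids = encoder(tf.constant(['l', 'o', 'z']))
print(ids.numpy())
```

This is why the sample output further down is full of small integers: the most common characters in the comments get the lowest indices.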

In [39]:
def input_data(train_frac=0.7, shuffle=200, batch=200, repeat=3, display=False):
    """
    Extract and preprocess data for trainer
    
    :param train_frac: fraction of the data to use for training
    :param shuffle: size of groups to shuffle rows in
    :param batch: size of batches to segment data into
    :param repeat: number of times to repeat dataset
    :param display: if true, print a sample of data
    :return: tuple of (training dataset, testing dataset)
    """
    # Pull data from csv
    csv_data = pd.read_csv('data/validation_data.csv')
    csv_data = csv_data[['less_toxic', 'more_toxic']]
    
    # Our inputs are labeled as "more toxic" and "less toxic"
    # but we want to pass in both with their comparison being
    # unknown, as the network is supposed to figure that out.
    # So, we create two sets, one of which has the order swapped,
    # and we name both sequence columns "sequence A" and
    # "sequence B". We then assign a label to each, with the
    # original having 'greater' and the swapped one having 'less'.
    # Therefore, the network will see both cases for each input
    # in the dataset and can train for both.
    labeled_data_greater = csv_data.copy()
    labeled_data_greater.rename(
        columns={'less_toxic': 'seq_a', 'more_toxic': 'seq_b' }, 
        inplace=True)
    labeled_data_greater['label'] = 'greater'
    labeled_data_less = csv_data.copy()
    labeled_data_less.rename(
        columns={ 'more_toxic': 'seq_a', 'less_toxic': 'seq_b' }, 
        inplace=True)
    labeled_data_less['label'] = 'less'
    labeled_data = pd.concat([ labeled_data_greater, labeled_data_less ])
    labeled_data = labeled_data.sample(frac=1)
    
    # Now we take all sequences of characters and convert them to sequences 
    # of integers. We can do that using keras's StringLookup preprocessing 
    # layer, which will take a string and spit out an array of integers 
    # encoding the characters of the string. This layer will need to scan 
    # the dataset to determine the appropriate encoding vocabulary for the 
    # characters. Since both sequences essentially contain all of the data, 
    # we can just use one of the sequences for the StringLookup to scan.
    seq_a = tf.strings.unicode_split(labeled_data['seq_a'], input_encoding='UTF-8')
    seq_b = tf.strings.unicode_split(labeled_data['seq_b'], input_encoding='UTF-8')
    encoder = tf.keras.layers.StringLookup()
    encoder.adapt(seq_a)
    seqint_a = encoder(seq_a)
    seqint_b = encoder(seq_b)
    
    # Then we can one-hot encode our labels using scikit-learn's 
    # OneHotEncoder class. Since we know our labels ahead of time, 
    # I figured we don't need to train it. HOWEVER, scikit-learn 
    # doesn't seem to think so, as it expects us to call fit on 
    # the data anyway... so yeah.
    label_encoder = OneHotEncoder(
        categories=[['greater', 'less']], 
        handle_unknown='ignore')
    label_array = labeled_data['label'].values.reshape(-1, 1)
    label_encoder.fit(label_array)
    labels = label_encoder.transform(label_array)
    labels = tf.constant(labels.toarray())
    
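    # Create the dataset of ((seq_a, seq_b), label) examples
    dataset = tf.data.Dataset.from_tensor_slices(
        ((seqint_a, seqint_b), labels))
    
    # Display a sample
    if display:
        print('Dataset spec:', dataset)
        for (sample_a, sample_b), output in dataset.take(1):
            print('Sample input sequence A:', sample_a)
            print('Sample input sequence B:', sample_b)
            print('Sample output labels:', output)
            
    # Split into training and testing data. This happens before the
    # tf.data shuffle so that reshuffling cannot move rows between
    # the two sets (the rows were already shuffled once by
    # labeled_data.sample above).
    train_num = int(train_frac * len(dataset))
    train_dataset = dataset.take(train_num)
    test_dataset = dataset.skip(train_num)
    
    # Shuffle, batch, repeat. tf.data transformations return a new
    # dataset instead of modifying the original, so the results
    # must be assigned back.
    train_dataset = train_dataset.shuffle(shuffle).batch(batch).repeat(repeat)
    test_dataset = test_dataset.batch(batch)
    return train_dataset, test_dataset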
    
# Run input data function to test it out
train_dataset, test_dataset = input_data(display=True)

Dataset spec: <TensorSliceDataset shapes: (((None,), (None,)), (2,)), types: ((tf.int64, tf.int64), tf.float64)>
Sample input sequence A: tf.Tensor(
[32 28 28 29 10  5  3 39  8  1  5  1 32 32 18  2  9  8  4  7  5 11  1  5
  3  3  5 14 24 32 32 51  1  1 41 34 41  1  6  3 39  8  1  4 22 25  6  4
 12  8  1  3 10  5  3  1 15  4 12  1 54 12  8  3  1 10  5 25  2  1  5  1
 25  2  7 13  2  3  3  5  1  5 17  5  6  7  8  3  1  5  1 18  4  8  3  2
  9 21  1  1 44  4 12  1 10  5 25  2  1 16  5 13  2  1 19  5 11  8  2  1
  5 14 14 12  8  5  3  6  4  7  8  1  5 17  5  6  7  8  3  1 16  2  1  3
 10  5  3  1 13  6 25  2  1 16 12 14 10  1 13  2  2 18  2  9 21  1 32], shape=(167,), dtype=int64)
Sample input sequence B: tf.Tensor(
[ 1  1 34  9  5  7  2  1  8 12 14 24  8  1 11  4  3  8  1  4 19  1 18  2
  7  6  8 26  1 20  6  3 10  1 17  2  7  6  3  5 11  1 20  5  9  3  8 21
  1 27  7 13  1 10  2  1 11  6 24  2  8  1 17  2  3  3  6  7 17  1 16  4
 11  2  8  3  2 13  1 22 15  1 50  6 14 10  5  2 11  1 62  

After this, we build the model using Keras, train it, and then validate it on the test set.

The original problem calls for ranking comments based on toxicity, so we can use this network as a comparator function to sort the list of toxic comments
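A minimal sketch of the comparator idea, using Python's `functools.cmp_to_key`. The names `rank_comments` and `count_exclamations` are hypothetical, and the exclamation-mark rule is only a stand-in for the trained network's "is A more toxic than B?" decision.

```python
from functools import cmp_to_key

def rank_comments(comments, is_more_toxic):
    """Sort comments from least to most toxic using a pairwise comparator.

    is_more_toxic(a, b) should return True when a is judged more toxic
    than b. In the notebook this would wrap the trained network.
    """
    def compare(a, b):
        if is_more_toxic(a, b):
            return 1
        if is_more_toxic(b, a):
            return -1
        return 0
    return sorted(comments, key=cmp_to_key(compare))

# Stand-in comparator (an assumption for this example only):
# more exclamation marks counts as "more toxic".
def count_exclamations(a, b):
    return a.count('!') > b.count('!')

ranked = rank_comments(['a!!', 'b', 'c!'], count_exclamations)
print(ranked)
```

One caveat: a learned comparator is not guaranteed to be transitive (it might say A > B, B > C, and C > A), in which case the sorted order is only a best-effort ranking.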