# Jigsaw Rate Severity - Simple LSTM

**Work:**
 - Forked https://www.kaggle.com/elcaiseri/jigsaw-keras-embedding-lstm
 - Revised data prep and model architecture to take a single input (text) and produce a single score (relative severity of toxicity)
     - The target is built from the pairwise (less) and (more) labels: each text is assigned a value that aims to respect every (less)/(more) ordering it appears in
 - Revised optimizer and manually tuned learning rate for better performance
 - Added text augmentation

**References and Acknowledgements:**
 - https://www.kaggle.com/elcaiseri/jigsaw-keras-embedding-lstm
 - https://www.kaggle.com/elcaiseri
 - https://www.kaggle.com/c/jigsaw-toxic-severity-rating/overview
 - https://github.com/tensorflow/tensorflow/issues/38613
 - https://www.kaggle.com/yeayates21/commonlit-text-augmentation-eng-to-fre-to-eng/notebook

## Imports

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import random
from random import sample
import time

import os
from tqdm.notebook import tqdm

from sklearn.model_selection import train_test_split
from textblob import TextBlob

## Data Wrangling

In [None]:
PATH = '/kaggle/input/jigsaw-toxic-severity-rating/'
valid_data = pd.read_csv(PATH + 'validation_data.csv')
comment_data = pd.read_csv(PATH + 'comments_to_score.csv')
sub = pd.read_csv(PATH + 'sample_submission.csv')

In [None]:
valid_data.sort_values('worker', inplace=True)
valid_data.head()

In [None]:
valid_data.values.shape

## Quick EDA

Can the same text appear more than once, including in both the less_toxic and more_toxic columns?  - Answer: Yes

In [None]:
txteg = valid_data.values[0,2] # get text example from more_toxic
valid_data[valid_data['less_toxic']==txteg].head() # look for example in less_toxic
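
The single-example lookup above can be generalized to the whole dataset. The following is a minimal sketch, assuming a DataFrame with the same `less_toxic` and `more_toxic` columns as `valid_data`; the helper name `count_shared_texts` and the toy data are hypothetical and only for illustration.

```python
import pandas as pd

def count_shared_texts(df):
    """Count unique texts that appear in both the less_toxic and more_toxic columns."""
    less = set(df['less_toxic'])
    more = set(df['more_toxic'])
    return len(less & more)

# Toy data standing in for valid_data (hypothetical examples)
toy = pd.DataFrame({
    'less_toxic': ['a', 'b', 'c'],
    'more_toxic': ['b', 'c', 'd'],
})
shared = count_shared_texts(toy)
print(shared)
```

In the notebook this would be called as `count_shared_texts(valid_data)`. A large count is what makes the target construction below possible, since texts shared between columns link the pairwise comparisons together.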

## Data Preprocessing

#### Create Target

- We use the pairwise (less) and (more) labels to assign each text a value that respects those orderings
- Each text gets a value the first time it is seen; we then loop over the data repeatedly, revising values whenever a pair has its (less) text scored at or above its (more) text
- As the values settle into agreement with the (less) and (more) information, we should see fewer revisions with each round

In [None]:
#################################
# get all unique texts
#################################
uts = list(set(valid_data['more_toxic'].values.tolist() + valid_data['less_toxic'].values.tolist()))
# store texts in a dictionary with default value -1
ut_dict = {}
for ut in uts:
    ut_dict[ut] = -1

#################################
# set values for unique texts given information from valid_data (relatively more or less)
#################################
epochs = 60 # number of times to loop over the dataset and make revisions to dictionary
lrevisions = []
lreversals = []
for i in range(epochs+1):
    revisions = 0
    reversals = 0
    for index, row in valid_data.iterrows():
        if (ut_dict[row['less_toxic']]==-1) and (ut_dict[row['more_toxic']]==-1): # both undefined
            ut_dict[row['less_toxic']]=random.uniform(0, 100)
            ut_dict[row['more_toxic']]=random.uniform(0, 100)
            revisions += 2
        elif (ut_dict[row['less_toxic']]!=-1) and (ut_dict[row['more_toxic']]==-1): # less defined, more not
            cap = ut_dict[row['less_toxic']]
            val = random.uniform(cap, 100)
            ut_dict[row['more_toxic']] = val
            revisions += 1
        elif (ut_dict[row['less_toxic']]==-1) and (ut_dict[row['more_toxic']]!=-1): # less not defined, more defined
            cap = ut_dict[row['more_toxic']]
            val = random.uniform(0, cap)
            ut_dict[row['less_toxic']] = val
            revisions += 1
        else: # both defined
            if ut_dict[row['less_toxic']]<ut_dict[row['more_toxic']]:
                pass # this is good to go
            else: # more < less, which is wrong
                changeType = random.choice([1,2,3]) # select 1 of 3 different types of revisions
                if changeType==1: # reverse values
                    more = ut_dict[row['more_toxic']]
                    less = ut_dict[row['less_toxic']]
                    ut_dict[row['more_toxic']] = less + random.uniform(-1, 1) # more = less + jitter
                    ut_dict[row['less_toxic']] = more + random.uniform(-1, 1) # less = more + jitter
                elif changeType==2: # set more to less + 1-ish
                    ut_dict[row['more_toxic']] = ut_dict[row['less_toxic']] + random.uniform(0, 1)
                elif changeType==3: # set less to more - 1-ish
                    ut_dict[row['less_toxic']] = ut_dict[row['more_toxic']] - random.uniform(0, 1)
                revisions += 1
                reversals += 1
    lrevisions.append(revisions)
    lreversals.append(reversals)
    if i % 5 == 0:
        print("Round {} completed with {} total revisions and {} reversals.".format(i,revisions,reversals))
print("All rounds completed.")

In [None]:
print("Algorithm Performance - all runs")
pd.DataFrame({'Revisions':lrevisions,'Reversals':lreversals}).plot(figsize=(12, 6));

In [None]:
print("Algorithm Performance - excluding the 1st run")
pd.DataFrame({'Revisions':lrevisions[1:],'Reversals':lreversals[1:]}).plot(figsize=(12, 6));
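
Besides the revision counts plotted above, the quality of the assigned values can be checked directly by measuring how many pairs end up correctly ordered. This is a minimal sketch, assuming a DataFrame with `less_toxic` and `more_toxic` columns and a dictionary mapping each text to its value (like `ut_dict`); the helper name `pair_agreement` and the toy data are hypothetical.

```python
import pandas as pd

def pair_agreement(df, scores):
    """Return the fraction of rows where score(less_toxic) < score(more_toxic)."""
    less = df['less_toxic'].map(scores)
    more = df['more_toxic'].map(scores)
    return float((less < more).mean())

# Toy pairs and scores standing in for valid_data and ut_dict (hypothetical)
toy_pairs = pd.DataFrame({
    'less_toxic': ['a', 'b', 'a'],
    'more_toxic': ['b', 'c', 'c'],
})
toy_scores = {'a': 10.0, 'b': 40.0, 'c': 25.0}
agreement = pair_agreement(toy_pairs, toy_scores)
print(agreement)
```

In the notebook this would be called as `pair_agreement(valid_data, ut_dict)`. If the data contains contradictory pairs (the same two texts ranked in opposite orders by different workers), the result cannot reach 1.0 no matter how many rounds are run.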

#### Compile Training Data & Target

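- Add every text in valid_data to a training list with its corresponding target
- Augment about 90% of rows (augmentation_percent = 0.90), since many texts are duplicated: roughly 35% of augmented rows are back-translated (English to French to English) and the rest have one random word removed
- Add a small jitter (uniform in [-1, 1]) to each target value, since we have duplicate texts and augmentations, and for some regularization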
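
The word-removal augmentation described above can be isolated into a small helper. This is a minimal sketch rather than the notebook's exact method: the function name `drop_random_word` is hypothetical, and it removes a single whitespace-delimited token by position, whereas a `str.replace`-based approach removes every occurrence of the chosen word, including matches inside longer words.

```python
import random

def drop_random_word(text, rng=random):
    """Remove one randomly chosen whitespace-delimited token from text."""
    words = text.split()
    if len(words) < 2:
        return text  # nothing sensible to remove from empty or one-word text
    idx = rng.randrange(len(words))
    return " ".join(words[:idx] + words[idx + 1:])

# Seeded generator so the example is reproducible
rng = random.Random(0)
original = "you are a very rude person"
augmented = drop_random_word(original, rng)
print(augmented)
```

Passing a seeded `random.Random` instance keeps the augmentation reproducible across runs, which makes it easier to compare models trained on the same augmented data.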

In [None]:
# initialize lists
toxic_text = []
target = []
augmentation_percent = 0.90

#################################
# loop through valid_data and add text & target to training data lists
# - also add a small jitter since we have duplicate text examples, for some regularization
#################################
for index, row in tqdm(valid_data.iterrows(), total=len(valid_data)): # total lets tqdm show a full progress bar
    if random.uniform(0, 1)<augmentation_percent: # only augment x% of the time
        try: # augmentations
            augshuf = random.uniform(0, 1)
            if augshuf<0.35:
                french_translation = str(TextBlob(row['more_toxic']).translate(to='fr'))
                more_toxic = str(TextBlob(french_translation).translate(to='en')) # back to Eng
                french_translation = str(TextBlob(row['less_toxic']).translate(to='fr'))
                less_toxic = str(TextBlob(french_translation).translate(to='en')) # back to Eng
            else: # remove a random word
                # note: str.replace removes every occurrence of the chosen token, including inside longer words
                rand_word = random.choice(list(set(row['more_toxic'].split(" "))))
                more_toxic = row['more_toxic'].replace(rand_word, '')
                rand_word = random.choice(list(set(row['less_toxic'].split(" "))))
                less_toxic = row['less_toxic'].replace(rand_word, '')
            toxic_text.append(more_toxic)
            target.append(ut_dict[row['more_toxic']] + random.uniform(-1, 1)) # value plus small jitter
            toxic_text.append(less_toxic)
            target.append(ut_dict[row['less_toxic']] + random.uniform(-1, 1)) # value plus small jitter
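        except Exception: # fall back to the original text if augmentation fails (e.g. the translation call errors)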
            toxic_text.append(row['more_toxic'])
            target.append(ut_dict[row['more_toxic']] + random.uniform(-1, 1)) # value plus small jitter
            toxic_text.append(row['less_toxic'])
            target.append(ut_dict[row['less_toxic']] + random.uniform(-1, 1)) # value plus small jitter
    else:
        toxic_text.append(row['more_toxic'])
        target.append(ut_dict[row['more_toxic']] + random.uniform(-1, 1)) # value plus small jitter
        toxic_text.append(row['less_toxic'])
        target.append(ut_dict[row['less_toxic']] + random.uniform(-1, 1)) # value plus small jitter

#### Final Training Data

In [None]:
print("Text list length: ", len(toxic_text))
print("Target list length: ", len(target))

In [None]:
training_data = pd.DataFrame()
training_data['text'] = toxic_text
training_data['target'] = target
training_data.head()

In [None]:
plt.hist(target, label='training target distribution');
plt.legend();

In [None]:
training_data.to_csv('jigsaw_rate_severity_training_data.csv', index=False)