In [5]:
# pandas for loading and manipulating the tabular data
import pandas as pd

# NumPy for numerical operations
import numpy as np

# Regular expressions, used to strip digits during cleaning
import re

# train_test_split splits a dataset into training and validation sets
from sklearn.model_selection import train_test_split

# Keras provides the tokenizer, sequence padding, and the Embedding, LSTM, and Dense layers used by the classifier
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

# The inputs are natural-language text, so NLTK (Natural Language Toolkit)
# handles the text processing.
import nltk
# Stop-word list, used to remove common words in the data cleaning step
from nltk.corpus import stopwords

# Lemmatizer, used to reduce the words in the input to their base form
from nltk.stem import WordNetLemmatizer
# word_tokenize splits the input text into individual words/tokens
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to C:\Users\Aashish
[nltk_data]     Jai\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to C:\Users\Aashish
[nltk_data]     Jai\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to C:\Users\Aashish
[nltk_data]     Jai\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## What Does The Model Use?
A Long Short-Term Memory (LSTM) network, which is a type of recurrent neural network (RNN).

## Advantages over other models
1. LSTMs can capture long-range dependencies and sequential patterns in the input data.
2. LSTMs can handle variable-length input sequences, which suits our text classification problem.
3. LSTMs automatically learn relevant features from the sequential input, eliminating the need for manual feature engineering.
4. LSTMs are well suited to textual data because they can capture semantic relationships and dependencies between words.


## Loading the Data

In [7]:
# Loading the data
data = pd.read_csv('./Data/train/train.csv')
X_test = pd.read_csv("./Data/test/test.csv")
y_test = pd.read_csv("./Data/test_labels/test_labels.csv")


## Data Preprocessing

In this step we preprocess the data so that it can be fed into the model:
1. Tokenize the text (split it into individual words).
2. Remove the stop words.
3. Lemmatize the text (reduce words to their base form so that variations of the same word map onto one standardized form).
4. Remove digits from the remaining words.

## Reasoning Behind Preprocessing

The same preprocessing must be applied to the test inputs before we can test the model on them.

Each input is labeled against 6 categories/classes. If the input belongs to a class, the value in that class's column is 1; if it does not, the value is 0. An input can belong to several classes at once.

The rows that are removed here have a value of -1 in all their label columns. The provider of the dataset explains that these rows were added later, which is why they carry -1. Since the training set contains no -1 labels, we decided to remove those rows.
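
The row filter described above can be sketched on a toy frame. This is an illustration rather than the notebook's data: `X_demo`, `y_demo`, and their values are made up, and the mask checks every label column instead of only `toxic`, which gives the same result when the unused rows carry -1 in all columns.

```python
import pandas as pd

# Hypothetical toy data standing in for the real test set
X_demo = pd.Series(["fine comment", "unscored comment", "rude comment"])
y_demo = pd.DataFrame({
    "toxic":  [0, -1, 1],
    "insult": [0, -1, 1],
})

# True for rows where any label column equals -1
unscored = (y_demo == -1).any(axis=1)

# Apply the same boolean mask to both objects so text and labels stay aligned
X_kept = X_demo[~unscored]
y_kept = y_demo[~unscored]

print(X_kept.tolist())
print(y_kept)
```

Filtering both objects with one mask avoids any mismatch between the comments and their labels.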

In [8]:
# Built once and reused for every comment.
# NLTK's English stop words (e.g. "this", "and", "or", "is") add little meaning, so they are removed.
stop_words = set(stopwords.words('english'))
# The lemmatizer maps words to a base form, which shrinks the vocabulary. With its default
# settings it treats every word as a noun, so "cars" becomes "car" while "running" is unchanged.
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    words = word_tokenize(text)  # split the text into individual words
    # drop stop words, then lemmatize what remains
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    words = [re.sub(r'\d+', '', word) for word in words]  # remove digits using regex
    return ' '.join(words)  # rejoin the tokens into a single string

# Clean the training comments, then separate the text from the six label columns
data['comment_text'] = data['comment_text'].apply(preprocess_text)

X_train = data['comment_text']
y_train = data[['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']]

X_test['comment_text'] = X_test['comment_text'].apply(preprocess_text)
X_test = X_test['comment_text']
y_test = y_test[['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']]


# The test data contains rows that were not used for testing; their labels are all -1.
# These rows should not be used for evaluation, so they are removed.
indices_to_remove = y_test[y_test['toxic'] == -1].index

y_test = y_test.drop(indices_to_remove)
X_test = X_test.drop(indices_to_remove)

## Creating the Model

The LSTMClassifier class uses an LSTM-based architecture for text classification. In the training phase, the fit method uses a Tokenizer to convert the text to integer sequences and determines the vocabulary size and the maximum sequence length, so that shorter sequences can be padded to the length of the longest one. The model architecture consists of an embedding layer, which converts the integers to dense vectors that can capture similarity between words; an LSTM layer, which carries information about previously seen tokens through the sequence; and a dense layer with sigmoid activation for classification. Sigmoid suits this multi-label problem because it produces an independent probability for each of the six classes. The Adam optimizer is paired with binary cross-entropy loss, which treats each label as its own binary classification. Sequences are padded to a uniform length because the network requires fixed-length inputs. During training, the model learns to associate toxic comment labels with the processed text data.
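
The tokenizing and padding steps can be illustrated without Keras. The sketch below is a simplified stand-in, not the Keras implementation: `build_word_index`, `to_sequence`, and `pad_left` are hypothetical helpers, the toy tokenizer only lowercases and splits on whitespace, and the left-padding with zeros mirrors the default behaviour of `pad_sequences`.

```python
from collections import Counter

def build_word_index(texts):
    """Give each word an integer starting at 1, most frequent first (0 is kept for padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequence(text, word_index):
    """Replace each known word with its integer index; unknown words are dropped."""
    return [word_index[w] for w in text.lower().split() if w in word_index]

def pad_left(seq, maxlen):
    """Left-pad with zeros, or keep only the last maxlen items if the sequence is too long."""
    seq = seq[max(len(seq) - maxlen, 0):]
    return [0] * (maxlen - len(seq)) + seq

texts = ["you are rude", "you are very very rude today"]
word_index = build_word_index(texts)
sequences = [to_sequence(t, word_index) for t in texts]
maxlen = max(len(s) for s in sequences)
padded = [pad_left(s, maxlen) for s in sequences]

vocab_size = len(word_index) + 1  # +1 for the padding index 0
print(padded)
print(vocab_size)
```

Every padded row has the same length, which is what the embedding layer expects, and the extra vocabulary slot accounts for the padding value.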



In [6]:

class LSTMClassifier:
    def __init__(self):
        # Converts text to integer sequences that the model can process
        self.tokenizer = Tokenizer()

    def fit(self, X, y, epochs=1, batch_size=32, validation_split=0.2):
        self.tokenizer.fit_on_texts(X)
        # Vocabulary size: the number of unique words plus 1, because index 0 is reserved for padding
        self.num_words = len(self.tokenizer.word_index) + 1
        # Convert the text to integer sequences once and reuse them below
        sequences = self.tokenizer.texts_to_sequences(X)
        # Length of the longest training sequence; shorter sequences are padded to this length
        self.input_length = max(len(x) for x in sequences)

        # Sequential builds a simple linear stack of layers
        self.model = Sequential()

        # The embedding layer maps each word index to a dense 64-dimensional vector,
        # which lets the model represent similarity between words
        self.model.add(Embedding(self.num_words, 64, input_length=self.input_length))

        # The LSTM layer reads the embedded sequence and captures dependencies between words
        self.model.add(LSTM(64))
        # Dense output layer with 6 units, one per class. Sigmoid gives each class an
        # independent probability, which suits this multi-label task.
        self.model.add(Dense(6, activation='sigmoid'))

        # Binary cross-entropy treats each of the 6 labels as its own binary problem; Adam is the optimizer
        self.model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

        # Pad shorter sequences to the length of the longest one, because the network requires fixed-length inputs
        X = pad_sequences(sequences, maxlen=self.input_length)

        # Train the model
        self.model.fit(X, y, epochs=epochs, batch_size=batch_size, validation_split=validation_split)

    def evaluate(self, X, y):
        X = self.tokenizer.texts_to_sequences(X)  # convert text to integer sequences
        X = pad_sequences(X, maxlen=self.input_length)  # pad to the length used during training
        return self.model.evaluate(X, y)  # returns the loss and the compiled metrics

    def predict(self, text):
        text = self.tokenizer.texts_to_sequences([text])  # convert text to an integer sequence
        text = pad_sequences(text, maxlen=self.input_length)  # pad to the length used during training
        return self.model.predict(text)  # one probability per class
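
To make the loss choice concrete, here is a small worked sketch of sigmoid activations and binary cross-entropy averaged over six labels. The logits and targets are made-up values for illustration, and `sigmoid` and `binary_cross_entropy` are hypothetical helper names; the sketch follows the textbook definitions and is not a copy of the Keras internals.

```python
import math

def sigmoid(z):
    """Squash a real-valued score into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(targets, probs, eps=1e-7):
    """Mean of the per-label binary cross-entropy; eps guards against log(0)."""
    total = 0.0
    for t, p in zip(targets, probs):
        p = min(max(p, eps), 1.0 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(targets)

# Hypothetical raw scores (logits) for the six labels of one comment
logits = [2.0, -3.0, 1.0, -4.0, 0.5, -2.5]
targets = [1, 0, 1, 0, 1, 0]

probs = [sigmoid(z) for z in logits]
loss = binary_cross_entropy(targets, probs)

# Each probability is independent, so they need not sum to 1
print([round(p, 3) for p in probs])
print(round(loss, 3))
```

Because the six probabilities are computed independently, a comment can score high on several labels at once, which a softmax output would not allow.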




## Training the Model

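Training uses the defaults of `fit`: one epoch, a batch size of 32, and 20% of the training data held out for validation. The model performs well on both splits: training and validation loss are low, and training and validation accuracy are high.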

In [9]:
# Initializing and fitting the model
lstm_classifier = LSTMClassifier()
lstm_classifier.fit(X_train, y_train)








## Evaluating the Model
The model performs well on the test set, with a test accuracy of about 0.997 and a test loss of about 0.070.

In [10]:
# Evaluating the model
loss, accuracy = lstm_classifier.evaluate(X_test, y_test)
print('Test loss:', loss)
print('Test accuracy:', accuracy)

Test loss: 0.07041707634925842
Test accuracy: 0.9974522590637207
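
When positive labels are rare, accuracy alone can look strong even for a model that never flags anything. The sketch below uses made-up labels for a single class to show this, with precision and recall alongside; `precision_recall` is a hypothetical helper, and none of these numbers come from the notebook's model.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision_recall(y_true, y_pred):
    """Precision and recall for the positive class (0.0 when undefined)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical labels for one class: 2 positives among 100 comments
y_true = [1, 1] + [0] * 98
always_clean = [0] * 100       # a baseline that never flags a comment
one_hit = [1, 0] + [0] * 98    # a model that finds one of the two positives

print(accuracy(y_true, always_clean), precision_recall(y_true, always_clean))
print(accuracy(y_true, one_hit), precision_recall(y_true, one_hit))
```

The baseline reaches 98% accuracy while finding no positives at all, so per-label precision and recall are a useful complement to the accuracy figure.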


## Predictions

In [11]:
# Predicting labels based on a custom comment
labels = ['Toxic', 'Severely Toxic', 'Obscene', 'Threat', 'Insult', 'Identity Hate']
new_comment = 'You are a jerk and a liar! I hate you'
prediction = lstm_classifier.predict(new_comment)
pred_df = pd.DataFrame(prediction, columns=labels)
pred_df

| | Toxic | Severely Toxic | Obscene | Threat | Insult | Identity Hate |
|---|---|---|---|---|---|---|
| 0 | 0.96439 | 0.135611 | 0.835378 | 0.042641 | 0.74836 | 0.171097 |


In [12]:
# Predicting labels based on a custom comment
new_comment = 'youre the a horrible human being'
prediction = lstm_classifier.predict(new_comment)
pred_df = pd.DataFrame(prediction, columns=labels)
pred_df



| | Toxic | Severely Toxic | Obscene | Threat | Insult | Identity Hate |
|---|---|---|---|---|---|---|
| 0 | 0.807801 | 0.028417 | 0.316519 | 0.032774 | 0.410591 | 0.064306 |
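
The predicted probabilities can be turned into label decisions by applying a cutoff. The sketch below uses the probabilities from the two predictions above and a threshold of 0.5, which is an assumption rather than a value tuned in this notebook; `decode` is a hypothetical helper.

```python
# Label names in the same order as the model's six outputs
LABELS = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

def decode(probs, threshold=0.5):
    """Return the labels whose probability meets the threshold."""
    return [label for label, p in zip(LABELS, probs) if p >= threshold]

first = [0.96439, 0.135611, 0.835378, 0.042641, 0.74836, 0.171097]
second = [0.807801, 0.028417, 0.316519, 0.032774, 0.410591, 0.064306]

print(decode(first))   # ['toxic', 'obscene', 'insult']
print(decode(second))  # ['toxic']
```

A lower threshold would flag more labels at the cost of more false positives, so the cutoff is worth choosing per label on validation data.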


## Suggested Improvements
If the provided test data had not contained -1 labels, there would have been more data to test the model on, which would have given more reliable evaluation metrics. The main problem with the dataset is that it is biased towards clean comments: positive examples of the classes we want to predict are rare. Because of this, the model was not able to predict such labels with high confidence.
To counter this issue we can take multiple steps:
- We could use bootstrapping (sampling with replacement from the observed data) to get around the class imbalance.
- Another approach could be to train the model incrementally on smaller data samples with a uniform distribution for each label. This can be used in addition to bootstrapping, so that the model will effectively have seen roughly equal numbers of instances for each class.
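
As a rough sketch of the bootstrapping idea for a single label, the minority rows can be resampled with replacement until both groups are the same size. The data, the `oversample` helper, and the fixed seed are all illustrative assumptions; with six overlapping labels, a real implementation would need to decide how to balance them jointly.

```python
import random

def oversample(rows, is_positive, seed=0):
    """Resample the smaller group with replacement until it matches the larger one."""
    rng = random.Random(seed)
    pos = [r for r in rows if is_positive(r)]
    neg = [r for r in rows if not is_positive(r)]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    if not minority:
        return list(rows)  # nothing to resample from
    extra = rng.choices(minority, k=len(majority) - len(minority))
    balanced = majority + minority + extra
    rng.shuffle(balanced)
    return balanced

# Hypothetical (text, toxic) pairs: 2 toxic comments among 10
rows = [(f"clean {i}", 0) for i in range(8)] + [("rude a", 1), ("rude b", 1)]
balanced = oversample(rows, is_positive=lambda r: r[1] == 1)

n_pos = sum(label for _, label in balanced)
print(len(balanced), n_pos)
```

Resampling like this should be applied to the training split only, so that validation and test metrics still reflect the original class distribution.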

## FIN