# Compilation of NLP models

## Outline

Explore the following NLP models:
- TFIDF
- Logistic Regression
- Naive Bayes
- SVM
- XGBoost
- Word Vectors

- Simple RNNs
- Word Embeddings
- LSTM
- GRU
- Bi-Directional RNNs
- Encoder-Decoder Models
- Attention Models
- Transformers
- BERT 

Referenced sources
- Notebook #1: https://www.kaggle.com/code/tanulsingh077/deep-learning-for-nlp-zero-to-transformers-bert/notebook
- Notebook #2: https://www.kaggle.com/code/abhishek/approaching-almost-any-nlp-problem-on-kaggle
- Dataset: https://www.kaggle.com/competitions/jigsaw-multilingual-toxic-comment-classification/overview

In [49]:
# EDA tools
import pandas as pd
import numpy as np
import itertools
import datetime as dt

# display and tracking of iterative processes
from tqdm import tqdm

# plotting tools
import matplotlib.pyplot as plt
import seaborn as sns

# xgb
import xgboost as xgb

# sklearn tools
from sklearn.metrics import mean_squared_error, roc_auc_score
from sklearn.model_selection import TimeSeriesSplit, GridSearchCV
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.model_selection import train_test_split

# TensorFlow/ Keras
import tensorflow as tf
from tensorflow.keras.preprocessing import sequence, text
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, GRU, SimpleRNN, Dense, Activation, Dropout, Embedding, Bidirectional

### Configuring TPU

In [50]:
# Detect hardware, return appropriate distribution strategy
try:
    # TPU detection. No parameters necessary if TPU_NAME environment variable is
    # set: this is always the case on Kaggle.
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    print('Running on TPU ', tpu.master())
except ValueError:
    tpu = None

    
if tpu:
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    # tf.distribute.TPUStrategy replaces the deprecated experimental alias
    strategy = tf.distribute.TPUStrategy(tpu)
else:
    # Default distribution strategy in TensorFlow. Works on CPU and single GPU.
    strategy = tf.distribute.get_strategy()

print("REPLICAS: ", strategy.num_replicas_in_sync)

REPLICAS:  1


### Import and check data

In [51]:
# import and check data
train_df = pd.read_csv('/kaggle/input/jigsaw-multilingual-toxic-comment-classification/jigsaw-toxic-comment-train.csv')
validation_df = pd.read_csv('/kaggle/input/jigsaw-multilingual-toxic-comment-classification/validation.csv')
test_df = pd.read_csv('/kaggle/input/jigsaw-multilingual-toxic-comment-classification/test.csv')

In [52]:
train_df.info()
validation_df.info()
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 223549 entries, 0 to 223548
Data columns (total 8 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   id             223549 non-null  object
 1   comment_text   223549 non-null  object
 2   toxic          223549 non-null  int64 
 3   severe_toxic   223549 non-null  int64 
 4   obscene        223549 non-null  int64 
 5   threat         223549 non-null  int64 
 6   insult         223549 non-null  int64 
 7   identity_hate  223549 non-null  int64 
dtypes: int64(6), object(2)
memory usage: 13.6+ MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8000 entries, 0 to 7999
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   id            8000 non-null   int64 
 1   comment_text  8000 non-null   object
 2   lang          8000 non-null   object
 3   toxic         8000 non-null   int64 
dtypes: int64(2), object(2)
memory usage:

In [53]:
train_df.isna().sum()
# validation_df.isna().sum()
# test_df.isna().sum()

id               0
comment_text     0
toxic            0
severe_toxic     0
obscene          0
threat           0
insult           0
identity_hate    0
dtype: int64

In [54]:
train_df.head(3)
# validation_df.head(3)
# test_df.head(3)

|   | id | comment_text | toxic | severe_toxic | obscene | threat | insult | identity_hate |
|---|---|---|---|---|---|---|---|---|
| 0 | 0000997932d777bf | Explanation\nWhy the edits made under my usern... | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 000103f0d9cfb60f | D'aww! He matches this background colour I'm s... | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 000113f07ec002fd | Hey man, I'm really not trying to edit war. It... | 0 | 0 | 0 | 0 | 0 | 0 |


### Approach: Binary classification of comment toxicity

In [55]:
# keep 'toxic' as the only target; drop the other label columns

train_df.drop(columns=['severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate'], inplace=True)

In [56]:
train_df.head(3)

|   | id | comment_text | toxic |
|---|---|---|---|
| 0 | 0000997932d777bf | Explanation\nWhy the edits made under my usern... | 0 |
| 1 | 000103f0d9cfb60f | D'aww! He matches this background colour I'm s... | 0 |
| 2 | 000113f07ec002fd | Hey man, I'm really not trying to edit war. It... | 0 |


In [57]:
# check class balance of the target
train_df['toxic'].value_counts()

toxic
0    202165
1     21384
Name: count, dtype: int64
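
The counts above imply that only about 9.6% of comments are toxic, so accuracy alone can look strong for a model that rarely predicts the positive class. As a rough sketch (not part of the original notebook), the snippet below derives the positive rate and "balanced" class weights using the formula scikit-learn documents for `class_weight='balanced'`, `n_samples / (n_classes * class_count)`. The name `class_weights` is illustrative; a dict like this could be passed to Keras `model.fit(..., class_weight=...)` if reweighting is wanted.

```python
# Class counts copied from the value_counts() output above.
counts = {0: 202165, 1: 21384}

n_samples = sum(counts.values())
n_classes = len(counts)

# Share of toxic comments in the training data.
positive_rate = counts[1] / n_samples

# "Balanced" weights: n_samples / (n_classes * class_count).
class_weights = {
    label: n_samples / (n_classes * count) for label, count in counts.items()
}

print(f"positive rate: {positive_rate:.4f}")
print({label: round(weight, 3) for label, weight in class_weights.items()})
```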

### EDA

In [58]:
# max word count in a comment
max_length = train_df['comment_text'].map(lambda x: len(x.split())).max()
max_length

2321

In [59]:
# min word count in a comment
train_df['comment_text'].map(lambda x: len(x.split())).min()

1

In [60]:
# add feature: word count
train_df['word_count'] = train_df['comment_text'].map(lambda x: len(x.split()))

In [61]:
# average word count per class (non-toxic vs. toxic)
train_df.groupby(['toxic']).agg({'word_count': 'mean'})

| toxic | word_count |
|---|---|
| 0 | 68.415161 |
| 1 | 48.573466 |


The data suggest that toxic comments tend to be shorter and more direct: about 49 words on average, versus about 68 for non-toxic comments.

In [62]:
train_df['comment_text'].values

array(["Explanation\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27",
       "D'aww! He matches this background colour I'm seemingly stuck with. Thanks.  (talk) 21:51, January 11, 2016 (UTC)",
       "Hey man, I'm really not trying to edit war. It's just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page. He seems to care more about the formatting than the actual info.",
       ...,
       '==shame on you all!!!== \n\n You want to speak about gays and not about romanians...',
       'MEL GIBSON IS A NAZI BITCH WHO MAKES SHITTY MOVIES. HE HAS SO MUCH BUTTSEX THAT HIS ASSHOLE IS NOW BIG ENOUGH TO BE CONSIDERED A COUNTRY.',
       '" \n\n == Unicorn lair discovery == \n\n Supposedly, a \'unicorn lair\' has been discovered in

### Train/Test Split

In [63]:
# stratified 75/25 split; train_test_split shuffles by default
X_train, X_test, y_train, y_test = train_test_split(
    train_df['comment_text'].values, train_df['toxic'].values,
    test_size=0.25, stratify=train_df['toxic'], random_state=22)

### Recurrent Neural Network

https://stackoverflow.com/questions/76029717/having-trouble-correctly-importing-tensorflow-tokenizer-and-tensorflow-padded-se

In [64]:
# tokenise the text with the Keras Tokenizer (tensorflow.keras.preprocessing.text)

token = text.Tokenizer()

### Fit the tokenizer to the text data

In [65]:
# fit the tokenizer on the training data only; fit_on_texts accepts both arrays and lists
token.fit_on_texts(X_train)

# maps each word to its integer index
word_index = token.word_index

In [66]:
# transform on X_train and X_test
X_train_seq = token.texts_to_sequences(X_train)
X_test_seq = token.texts_to_sequences(X_test)

In [67]:
# pad the sequences to a uniform length so they can be batched and fed to the network
X_train_pad = sequence.pad_sequences(X_train_seq, maxlen=max_length)
X_test_pad = sequence.pad_sequences(X_test_seq, maxlen=max_length)
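
To make the tokenise-then-pad steps concrete, here is a dependency-free sketch that mimics what `Tokenizer.texts_to_sequences` and `pad_sequences` do with their default settings: word indices start at 1 in order of descending frequency, words not seen during fitting are dropped, and both padding and truncation happen at the front of the sequence. The helper names `build_word_index`, `to_sequences` and `pad_pre` are hypothetical, and the real Keras tokenizer also strips punctuation, which this toy version does not.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer index, most frequent first (0 is reserved for padding)."""
    counts = Counter(word for t in texts for word in t.lower().split())
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}

def to_sequences(texts, word_index):
    """Replace words with their indices; words not seen during fitting are skipped."""
    return [[word_index[w] for w in t.lower().split() if w in word_index] for t in texts]

def pad_pre(sequences, maxlen):
    """Left-pad with zeros and keep only the last `maxlen` tokens of long sequences."""
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]
        padded.append([0] * (maxlen - len(seq)) + seq)
    return padded

train_texts = ["you are great", "you are not great at all"]
toy_word_index = build_word_index(train_texts)

# The third text contains a word that was never seen during fitting.
toy_sequences = to_sequences(train_texts + ["you are unknown"], toy_word_index)
toy_padded = pad_pre(toy_sequences, maxlen=4)

print(toy_word_index)
print(toy_padded)
```

Note that with front truncation, a long comment keeps its ending rather than its beginning.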

### Simple RNN
### Issue: Vanishing Gradient Problem
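
The vanishing gradient issue can be illustrated numerically. In a scalar RNN with state update `h_t = tanh(w * h_{t-1} + x_t)`, the gradient of the last state with respect to the first is a product of per-step factors `w * (1 - h_t^2)`, each with magnitude at most `|w|`. The sketch below uses made-up values (`w = 0.5`, a constant input) purely for illustration; `gradient_through_time` is a hypothetical helper.

```python
import math

def gradient_through_time(w, inputs, h0=0.0):
    """Return d h_T / d h_0 for h_t = tanh(w * h_{t-1} + x_t), via the chain rule."""
    h, grad = h0, 1.0
    for x in inputs:
        h = math.tanh(w * h + x)
        grad *= w * (1.0 - h * h)  # local derivative d h_t / d h_{t-1}
    return grad

for steps in (5, 20, 50):
    g = gradient_through_time(w=0.5, inputs=[0.1] * steps)
    print(f"{steps:>2} steps: gradient = {g:.3e}")
```

The gradient shrinks roughly geometrically with sequence length, which is why early tokens contribute little to learning in a plain RNN.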

### Determine embedding vector space dimension: https://ai.stackexchange.com/questions/28564/how-to-determine-the-embedding-size

In [68]:
# set embedding_dim
embedding_dim = 300

In [69]:
# # instantiate a sequential model
# model = Sequential()

# # turn word indexes into dense vectors

# model.add(Embedding(
#     input_dim = len(word_index)+1, # vocabulary size, +1 because index 0 is reserved for padding
#     output_dim = embedding_dim, # size of each word vector; depends on the task, common values range from 64 to 512
#     input_length = max_length # maximum sequence length in the data
# ))

# # add a SimpleRNN layer
# model.add(SimpleRNN(embedding_dim)) # 300 = number of recurrent units; need not equal the embedding size

# # add the output layer
# model.add(Dense(
#     1, # single output unit
#     activation = 'sigmoid' # sigmoid maps the output to a probability of the positive class
# ))

# model.compile(
#     loss = 'binary_crossentropy',
#     optimizer = 'adam',
#     metrics = ['accuracy'], # tracked classification metric
# )

# model.summary()

In [70]:
# model.fit(X_train_pad, y_train, epochs=2, batch_size=512)

### Predict test result

In [71]:
# predict = model.predict(X_test_pad)

### AUC Score

In [72]:
# track auc scores
auc_scores = []

In [73]:
# # append SimpleRnn auc score
# auc_scores.append({'SimpleRnn': round(roc_auc_score(y_test,predict),4)})
# print(f'auc: {roc_auc_score(y_test,predict):.4f}')
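
ROC AUC is the score tracked for each model here. One way to read it: AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes that quantity directly on made-up labels and scores; `pairwise_auc` is a hypothetical helper, and its quadratic loop is only suitable for small illustrations, not for the full test split.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

toy_labels = [0, 0, 1, 0, 1, 0]
toy_scores = [0.1, 0.4, 0.35, 0.2, 0.8, 0.05]

print(pairwise_auc(toy_labels, toy_scores))
```

Because it depends only on the ranking of scores, AUC is less affected by the class imbalance than accuracy is.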

In [74]:
# X_train_seq[:1]

### Use pre-trained word embeddings instead of training from scratch

### Load GloVe Embeddings - Stanford GloVe link: https://nlp.stanford.edu/projects/glove/


In [75]:
# load the GloVe vectors into a dictionary mapping word -> vector

embeddings_index = {}
with open('/kaggle/input/glove840b300dtxt/glove.840B.300d.txt', 'r', encoding='utf-8') as f:
    for line in tqdm(f):
        # split on single spaces: the first field is the token, the rest are its coefficients
        values = line.split(' ')
        word = values[0]
        coefs = np.array([float(val) for val in values[1:]])
        embeddings_index[word] = coefs

print(f'Found {len(embeddings_index)} word vectors')

2196018it [03:40, 9958.96it/s] 

Found 2196017 word vectors





### LSTM - Advanced RNN with memory cells and gates. Addresses the vanishing gradient problem.
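
As a rough illustration of how the gates work, the sketch below runs a single scalar LSTM cell step by step. The forget, input and output gates are sigmoids in (0, 1); the cell state `c` is updated additively (`f * c_prev + i * g`), which is what lets information and gradients pass through many steps when the forget gate stays near 1. The weights are arbitrary made-up numbers, and `lstm_step` is a hypothetical helper, not the Keras implementation.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell. `p` maps each gate name to (w_x, w_h, bias)."""
    pre = {name: wx * x + wh * h_prev + b for name, (wx, wh, b) in p.items()}
    f = sigmoid(pre["forget"])       # how much of the old cell state to keep
    i = sigmoid(pre["input"])        # how much of the candidate to write
    o = sigmoid(pre["output"])       # how much of the cell state to expose
    g = math.tanh(pre["candidate"])  # candidate values for the cell state
    c = f * c_prev + i * g           # additive cell-state update
    h = o * math.tanh(c)
    return h, c

toy_params = {
    "forget": (0.5, 0.1, 1.0),
    "input": (0.6, -0.2, 0.0),
    "output": (0.3, 0.4, 0.0),
    "candidate": (1.0, 0.5, 0.0),
}

h, c = 0.0, 0.0
for x in [1.0, -0.5, 0.25]:
    h, c = lstm_step(x, h, c, toy_params)
    print(f"x={x:+.2f}  h={h:+.4f}  c={c:+.4f}")
```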

In [76]:
# word_index.items()

In [77]:
# create an embedding matrix of zeros with shape (len(word_index) + 1, embedding_dim)
# loop through word_index and look each word up in embeddings_index (built from the GloVe vectors)
# if found, copy the GloVe vector into the row for that word's index; otherwise the row stays all zeros

embeddings_matrix = np.zeros((len(word_index) + 1, embedding_dim))

for word, i in tqdm(word_index.items()):
    embeddings_vector = embeddings_index.get(word)
    if embeddings_vector is not None:
        embeddings_matrix[i] = embeddings_vector


100%|██████████| 248872/248872 [00:00<00:00, 333399.82it/s]
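
Words missing from GloVe keep an all-zero row in the matrix, so it can be worth checking how much of the vocabulary is actually covered. The following sketch repeats the lookup loop on tiny stand-in dictionaries (`toy_word_index` and `toy_embeddings` are invented for illustration, with 3-dimensional vectors instead of 300) and reports the share of words found. One assumption worth verifying on the real data: the Keras tokenizer lowercases text by default, while `glove.840B.300d` has a cased vocabulary, so some lookups may miss for that reason.

```python
# Stand-ins for `word_index` (from the tokenizer) and `embeddings_index` (from GloVe).
toy_word_index = {"the": 1, "edit": 2, "wikipedia": 3, "asdfgh": 4}
toy_embeddings = {
    "the": [0.1, 0.2, 0.3],
    "edit": [0.0, -0.5, 0.4],
    "wikipedia": [0.7, 0.1, -0.2],
}
dim = 3

# Row 0 is reserved for padding, hence the +1.
toy_matrix = [[0.0] * dim for _ in range(len(toy_word_index) + 1)]
missing = []
for word, i in toy_word_index.items():
    vector = toy_embeddings.get(word)
    if vector is not None:
        toy_matrix[i] = list(vector)
    else:
        missing.append(word)  # this row stays all zeros

coverage = 1 - len(missing) / len(toy_word_index)
print(f"coverage: {coverage:.0%}, missing: {missing}")
```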


In [78]:
# embeddings_matrix

### LSTM Model

In [79]:
# # initialise a sequential model
# model = Sequential()

# # Add Embedding as the first layer
# # Use the GloVe embedding matrix as weights instead of learning embeddings from scratch
# model.add(Embedding(
#     len(word_index)+1,
#     embedding_dim, # must equal the width of embeddings_matrix (300)
#     weights = [embeddings_matrix],
#     input_length = max_length,
#     trainable = False # freeze the pre-trained vectors so they are not updated during training
# ))

# # Add LSTM layer
# model.add(LSTM(
#     embedding_dim, # number of LSTM units; a free hyperparameter that need not equal the embedding size
#     dropout = 0.3,
#     recurrent_dropout = 0.3
# ))

# # Add Dense layer
# model.add(Dense(
#     1, 
#     activation = 'sigmoid'
# ))

# # Compile model with loss function and optimiser
# model.compile(
#     loss = 'binary_crossentropy',
#     optimizer = 'adam',
#     metrics = ['accuracy'] # tracked metrics; several can be given in the list
# )

# model.summary()


In [80]:
# model.fit(X_train_pad, y_train, epochs=2, batch_size=512)

In [81]:
# predict = model.predict(X_test_pad)

In [82]:
# # append LSTM auc score
# auc_scores.append({'LSTM': round(roc_auc_score(y_test,predict),4)})
# print(f'auc: {roc_auc_score(y_test,predict):.4f}')

### Gated Recurrent Unit - GRU
### Designed to mitigate the vanishing gradient problem. Similar to LSTM.

### Update Gate: How much past information to pass along to the future
### Reset Gate: How much past information to forget
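
The two gates can be sketched for a scalar GRU as follows. This uses the convention in which the update gate `z` weights the previous state (`h = z * h_prev + (1 - z) * h_candidate`); some references swap the roles of `z` and `1 - z`, so treat the exact form as an assumption. The weights are made-up numbers and `gru_step` is a hypothetical helper, not the Keras implementation.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def gru_step(x, h_prev, p):
    """One step of a scalar GRU. `p` maps each name to (w_x, w_h, bias)."""
    wx, wh, b = p["update"]
    z = sigmoid(wx * x + wh * h_prev + b)  # update gate: share of old state carried forward
    wx, wh, b = p["reset"]
    r = sigmoid(wx * x + wh * h_prev + b)  # reset gate: how much old state feeds the candidate
    wx, wh, b = p["candidate"]
    h_cand = math.tanh(wx * x + wh * (r * h_prev) + b)
    return z * h_prev + (1.0 - z) * h_cand

toy_params = {
    "update": (0.4, 0.3, 0.0),
    "reset": (0.2, -0.1, 0.0),
    "candidate": (1.0, 0.8, 0.0),
}

h = 0.0
for x in [1.0, -0.5, 0.25]:
    h = gru_step(x, h, toy_params)
    print(f"x={x:+.2f}  h={h:+.4f}")
```

With the update gate saturated near 1 the previous state is copied forward unchanged; near 0 it is replaced by the candidate.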

In [83]:
# with strategy.scope():
    
#     # initialise a sequential model
#     model = Sequential()

#     # Add Embedding as the first layer
#     # Use the GloVe embedding matrix as weights instead of learning embeddings from scratch
#     model.add(Embedding(
#         len(word_index)+1,
#         embedding_dim,
#         weights = [embeddings_matrix],
#         input_length = max_length,
#         trainable = False # freeze the pre-trained vectors so they are not updated during training
#     ))

#     # Add GRU layer
#     model.add(GRU(
#         embedding_dim,
#         dropout = 0.3,
#         recurrent_dropout = 0.3
#     ))

#     # Add Dense layer
#     model.add(Dense(
#         1, 
#         activation = 'sigmoid'
#     ))

#     # Compile model with loss function and optimiser
#     model.compile(
#         loss = 'binary_crossentropy',
#         optimizer = 'adam',
#         metrics = ['accuracy'] # tracked metrics; several can be given in the list
#     )

# model.summary()
    

In [84]:
# model.fit(X_train_pad, y_train, epochs=2, batch_size=512)

In [85]:
# predict = model.predict(X_test_pad)

In [86]:
# # append GRU auc score
# auc_scores.append({'GRU': round(roc_auc_score(y_test,predict),4)})
# print(f'auc: {roc_auc_score(y_test,predict):.4f}')

### Bi-Directional RNN

A bidirectional RNN runs two independent RNNs: the first reads the input sequence in normal time order, the second reads it in reverse time order.

The outputs of the two networks are concatenated or summed at each time step, depending on the merge option.

This gives the network both forward and backward information about the sequence at each time step.
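
A toy version of this idea, with a hand-written scalar recurrence standing in for the LSTM (`run_rnn`, `toy_inputs` and the weights are invented for illustration): one pass reads the sequence left to right, the other reads it reversed, and the two final states are concatenated. Keras `Bidirectional` uses `merge_mode='concat'` by default, so wrapping an LSTM with 300 units yields a 600-dimensional output.

```python
import math

def run_rnn(inputs, w_x, w_h):
    """Final hidden state of a scalar tanh RNN run over `inputs`."""
    h = 0.0
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)
    return h

toy_inputs = [0.9, -0.3, 0.1, 0.7]

# Each direction has its own weights, since the two RNNs are independent.
forward_state = run_rnn(toy_inputs, w_x=0.8, w_h=0.5)         # normal time order
backward_state = run_rnn(toy_inputs[::-1], w_x=0.6, w_h=0.4)  # reversed time order

# Concatenation doubles the feature size (here 1 + 1 = 2).
merged = [forward_state, backward_state]
print(merged)
```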

In [87]:
with strategy.scope():
    
    # initialise a sequential model
    model = Sequential()

    # Add Embedding as the first layer
    # Use the GloVe embedding matrix as weights instead of learning embeddings from scratch
    model.add(Embedding(
        len(word_index)+1,
        embedding_dim, # must equal the width of embeddings_matrix (300)
        weights = [embeddings_matrix],
        input_length = max_length,
        trainable = False # freeze the pre-trained vectors so they are not updated during training
    ))

    # Add a bidirectional LSTM layer
    model.add(Bidirectional(LSTM(
        embedding_dim, # number of LSTM units per direction; need not equal the embedding size
        dropout = 0.3,
        recurrent_dropout = 0.3 # note: a non-zero value rules out the fast cuDNN kernel on GPU
    )))

    # Add Dense layer
    model.add(Dense(
        1, 
        activation = 'sigmoid'
    ))

    # Compile model with loss function and optimiser
    model.compile(
        loss = 'binary_crossentropy',
        optimizer = 'adam',
        metrics = ['accuracy'] # tracked metrics; several can be given in the list
    )

model.summary()



In [None]:
model.fit(X_train_pad, y_train, epochs=2, batch_size=256)

Epoch 1/2
110/655 ━━━━━━━━━━━━━━━━━━━━ 38:38 4s/step - accuracy: 0.9040 - loss: 0.2778

In [None]:
predict = model.predict(X_test_pad)

In [None]:
# append Bi-Directional auc score
auc_scores.append({'Bi-Directional': round(roc_auc_score(y_test,predict),4)})
print(f'auc: {roc_auc_score(y_test,predict):.4f}')

### Sequence to Sequence - Revisit this

### Attention Models