# Compilation of NLP models

## Outline

Explore the following NLP models:
- TFIDF
- Logistic Regression
- Naive Bayes
- SVM
- XGBoost
- Word Vectors

- Simple RNNs
- Word Embeddings
- LSTM
- GRU
- Bi-Directional RNNs
- Encoder-Decoder Models
- Attention Models
- Transformers
- BERT 

Referenced sources:
- Notebook #1: https://www.kaggle.com/code/tanulsingh077/deep-learning-for-nlp-zero-to-transformers-bert/notebook
- Notebook #2: https://www.kaggle.com/code/abhishek/approaching-almost-any-nlp-problem-on-kaggle
- Dataset: https://www.kaggle.com/competitions/jigsaw-multilingual-toxic-comment-classification/overview

In [2]:
# EDA tools
import pandas as pd
import numpy as np
import itertools
import datetime as dt

# display and tracking of iterative processes
from tqdm import tqdm

# plotting tools
import matplotlib.pyplot as plt
import seaborn as sns

# xgb
import xgboost as xgb

# sklearn tools
from sklearn.metrics import mean_squared_error, roc_auc_score
from sklearn.model_selection import TimeSeriesSplit, GridSearchCV
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.model_selection import train_test_split

# TensorFlow/ Keras
import tensorflow as tf
from tensorflow.keras.preprocessing import sequence, text
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, GRU, SimpleRNN, Dense, Activation, Dropout, Embedding

### Configuring TPU

In [3]:
# Detect hardware, return appropriate distribution strategy
try:
    # TPU detection. No parameters necessary if TPU_NAME environment variable is
    # set: this is always the case on Kaggle.
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    print('Running on TPU ', tpu.master())
except ValueError:
    tpu = None

    
if tpu:
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.TPUStrategy(tpu)
else:
    # Default distribution strategy in TensorFlow. Works on CPU and single GPU.
    strategy = tf.distribute.get_strategy()

print("REPLICAS: ", strategy.num_replicas_in_sync)

REPLICAS:  1


### Import and check data

In [4]:
# import and check data
train_df = pd.read_csv('/kaggle/input/jigsaw-multilingual-toxic-comment-classification/jigsaw-toxic-comment-train.csv')
validation_df = pd.read_csv('/kaggle/input/jigsaw-multilingual-toxic-comment-classification/validation.csv')
test_df = pd.read_csv('/kaggle/input/jigsaw-multilingual-toxic-comment-classification/test.csv')

In [5]:
train_df.info()
validation_df.info()
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 223549 entries, 0 to 223548
Data columns (total 8 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   id             223549 non-null  object
 1   comment_text   223549 non-null  object
 2   toxic          223549 non-null  int64 
 3   severe_toxic   223549 non-null  int64 
 4   obscene        223549 non-null  int64 
 5   threat         223549 non-null  int64 
 6   insult         223549 non-null  int64 
 7   identity_hate  223549 non-null  int64 
dtypes: int64(6), object(2)
memory usage: 13.6+ MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8000 entries, 0 to 7999
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   id            8000 non-null   int64 
 1   comment_text  8000 non-null   object
 2   lang          8000 non-null   object
 3   toxic         8000 non-null   int64 
dtypes: int64(2), object(2)
memory usage:

In [6]:
train_df.isna().sum()
# validation_df.isna().sum()
# test_df.isna().sum()

id               0
comment_text     0
toxic            0
severe_toxic     0
obscene          0
threat           0
insult           0
identity_hate    0
dtype: int64

In [7]:
train_df.head(3)
# validation_df.head(3)
# test_df.head(3)

| | id | comment_text | toxic | severe_toxic | obscene | threat | insult | identity_hate |
|---|---|---|---|---|---|---|---|---|
| 0 | 0000997932d777bf | Explanation\nWhy the edits made under my usern... | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 000103f0d9cfb60f | D'aww! He matches this background colour I'm s... | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 000113f07ec002fd | Hey man, I'm really not trying to edit war. It... | 0 | 0 | 0 | 0 | 0 | 0 |


### Approach: Binary classification of comment toxicity

In [8]:
# keep 'toxic' as the only target: drop the other label columns

train_df.drop(columns = ['severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate'], inplace= True)

In [9]:
train_df.head(3)

| | id | comment_text | toxic |
|---|---|---|---|
| 0 | 0000997932d777bf | Explanation\nWhy the edits made under my usern... | 0 |
| 1 | 000103f0d9cfb60f | D'aww! He matches this background colour I'm s... | 0 |
| 2 | 000113f07ec002fd | Hey man, I'm really not trying to edit war. It... | 0 |


In [10]:
# check class balance of the target
train_df['toxic'].value_counts()

toxic
0    202165
1     21384
Name: count, dtype: int64
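
The counts above show roughly one toxic comment for every nine to ten non-toxic ones, so a model that always predicts 0 would already be about 90% accurate. The sketch below quantifies the imbalance and derives per-class weights from the printed counts. The helper name `balanced_class_weights` is hypothetical, and the formula `n_samples / (n_classes * class_count)` is the common "balanced" heuristic, not something this notebook applies.

```python
# Class counts copied from the value_counts() output above.
counts = {0: 202165, 1: 21384}

def balanced_class_weights(counts):
    """Hypothetical helper: weight = n_samples / (n_classes * class_count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * c) for label, c in counts.items()}

# Share of toxic comments in the training data.
toxic_rate = counts[1] / sum(counts.values())
weights = balanced_class_weights(counts)

print(f"toxic share: {toxic_rate:.2%}")
print({label: round(w, 3) for label, w in weights.items()})
```

A dictionary of this shape could be passed as `class_weight` to Keras `model.fit` if the imbalance turns out to hurt recall on the toxic class; whether that helps here is untested.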

### EDA

In [11]:
# max word count in a comment
max_length = train_df['comment_text'].map(lambda x: len(x.split())).max()
max_length

2321

In [12]:
# min word count in a comment
train_df['comment_text'].map(lambda x: len(x.split())).min()

1

In [13]:
# add feature: word count
train_df['word_count'] = train_df['comment_text'].map(lambda x: len(x.split()))

In [14]:
# mean word count per class (toxic vs. non-toxic)
train_df.groupby(['toxic']).agg({'word_count': 'mean'})

| toxic | word_count |
|---|---|
| 0 | 68.415161 |
| 1 | 48.573466 |


The data suggest that toxic comments tend to be more direct: they average about 49 words, compared with about 68 for non-toxic comments.

In [15]:
train_df['comment_text'].values

array(["Explanation\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27",
       "D'aww! He matches this background colour I'm seemingly stuck with. Thanks.  (talk) 21:51, January 11, 2016 (UTC)",
       "Hey man, I'm really not trying to edit war. It's just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page. He seems to care more about the formatting than the actual info.",
       ...,
       '==shame on you all!!!== \n\n You want to speak about gays and not about romanians...',
       'MEL GIBSON IS A NAZI BITCH WHO MAKES SHITTY MOVIES. HE HAS SO MUCH BUTTSEX THAT HIS ASSHOLE IS NOW BIG ENOUGH TO BE CONSIDERED A COUNTRY.',
       '" \n\n == Unicorn lair discovery == \n\n Supposedly, a \'unicorn lair\' has been discovered in

### Train/Test Split

In [16]:
# hold out 25% of the training data; stratify keeps the toxic ratio the same
# in both splits (shuffle defaults to True)
X_train, X_test, y_train, y_test = train_test_split(
    train_df['comment_text'].values,
    train_df['toxic'].values,
    test_size=0.25,
    stratify=train_df['toxic'],
    random_state=22,
)

### Recurrent Neural Network

Reference for Tokenizer and pad_sequences import issues: https://stackoverflow.com/questions/76029717/having-trouble-correctly-importing-tensorflow-tokenizer-and-tensorflow-padded-se

In [17]:
# tokenise data with keras.text.Tokenizer

token = text.Tokenizer()

### Fit the tokenizer to the text data

In [18]:
# fit the tokenizer on the training data only; fit_on_texts accepts both arrays and lists
token.fit_on_texts(X_train)

In [19]:
# convert the train and test texts to sequences of word indices
X_train_seq = token.texts_to_sequences(X_train)
X_test_seq = token.texts_to_sequences(X_test)
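
To make the two steps above concrete, here is a toy, pure-Python approximation of what fitting and transforming do: words are ranked by frequency on the training texts, the most frequent word gets index 1, and words never seen during fitting are dropped at transform time. The function names and sample sentences are made up for illustration, and the sketch ignores the punctuation filtering, `num_words`, and `oov_token` options of the real Keras `Tokenizer`.

```python
from collections import Counter

def fit_word_index(texts):
    """Toy stand-in for fit_on_texts: rank words by frequency, starting at index 1."""
    counts = Counter(word for t in texts for word in t.lower().split())
    # sorted() is stable, so ties keep their first-seen order.
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    # Index 0 is left unused so it can serve as the padding value later.
    return {word: i for i, (word, _) in enumerate(ranked, start=1)}

def texts_to_sequences(texts, word_index):
    """Toy stand-in for texts_to_sequences: words missing from the index are skipped."""
    return [[word_index[w] for w in t.lower().split() if w in word_index]
            for t in texts]

train_texts = ["you are great", "you are not great at all"]
test_texts = ["you are awful"]  # "awful" was never seen during fitting

word_index = fit_word_index(train_texts)
train_seq = texts_to_sequences(train_texts, word_index)
test_seq = texts_to_sequences(test_texts, word_index)

print(word_index)
print(train_seq)
print(test_seq)
```

Fitting on the training split only, as the notebook does, keeps held-out vocabulary from leaking into the index.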

In [20]:
# pad (or truncate) every sequence to max_length so the network receives fixed-size input
X_train_pad = sequence.pad_sequences(X_train_seq, maxlen=max_length)
X_test_pad = sequence.pad_sequences(X_test_seq, maxlen=max_length)
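
As a rough illustration of the padding step, the sketch below mimics the Keras defaults (`padding='pre'`, `truncating='pre'`) in plain Python: short sequences get zeros in front, and long sequences keep only their last `maxlen` tokens. The helper name `pad_pre` and the toy sequences are hypothetical.

```python
def pad_pre(sequences, maxlen, value=0):
    """Toy version of pad_sequences with padding='pre' and truncating='pre'."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                   # truncate from the front
        pad = [value] * (maxlen - len(seq))         # padding goes in front
        padded.append(pad + seq)
    return padded

toy = [[5, 3], [7, 1, 4, 9, 2, 8]]
padded = pad_pre(toy, maxlen=4)
print(padded)
```

Because `max_length` is the longest comment in the data (2321 words) while the average comment is well under 100 words, most of each padded row is zeros. A smaller `maxlen` would likely shorten training considerably, at the cost of truncating the longest comments.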

### Simple RNN

Reference for choosing the embedding dimension: https://ai.stackexchange.com/questions/28564/how-to-determine-the-embedding-size

In [21]:
# instantiate a sequential model
model = Sequential()

# map each word index to a dense, trainable vector
model.add(Embedding(
    input_dim = len(token.word_index)+1, # vocabulary size; +1 because index 0 is reserved for padding
    output_dim = 300, # embedding dimension; task dependent, common values range from 64 to 512
    input_length = max_length # sequence length after padding (Keras 3 flags this argument as deprecated)
))

# add a SimpleRNN layer
model.add(SimpleRNN(100)) # 100 = number of recurrent units

# output layer
model.add(Dense(
    1, # single output unit for binary classification
    activation = 'sigmoid' # squashes the output to a probability in (0, 1)
))

model.compile(
    loss = 'binary_crossentropy',
    optimizer = 'adam',
    metrics = ['accuracy'], # tracked classification metric
)
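
To get a feel for where the capacity of this model sits, the sketch below counts trainable parameters per layer by hand. The notebook does not print the vocabulary size, so `vocab_size=200_000` is a placeholder assumption, and the function name is hypothetical; the formulas are the standard ones for an embedding table, a simple recurrent layer (input kernel, recurrent kernel, bias), and a dense output unit.

```python
def simple_rnn_model_params(vocab_size, embed_dim=300, rnn_units=100):
    """Hand count of trainable parameters for Embedding -> SimpleRNN -> Dense(1)."""
    embedding = vocab_size * embed_dim
    # input kernel (embed_dim x units) + recurrent kernel (units x units) + bias (units)
    rnn = rnn_units * (embed_dim + rnn_units + 1)
    dense = rnn_units * 1 + 1
    return {"embedding": embedding, "simple_rnn": rnn, "dense": dense}

# vocab_size is an assumed placeholder, not a value taken from the notebook.
params = simple_rnn_model_params(vocab_size=200_000)
total = sum(params.values())

for name, n in params.items():
    print(f"{name:>10}: {n:,}")
print(f"{'total':>10}: {total:,}")
```

With any realistic vocabulary the embedding table dominates the count, which is one motivation for reusing pretrained vectors such as GloVe instead of learning the table from scratch. Calling `model.summary()` reports the actual numbers.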



In [22]:
model.fit(X_train_pad, y_train, epochs=2, batch_size=64)

Epoch 1/2


I0000 00:00:1730010792.313665      78 service.cc:145] XLA service 0x7c9d640ca9b0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1730010792.313734      78 service.cc:153]   StreamExecutor device (0): Tesla P100-PCIE-16GB, Compute Capability 6.0
I0000 00:00:1730010793.528297      78 device_compiler.h:188] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.


2620/2620 ━━━━━━━━━━━━━━━━━━━━ 623s 236ms/step - accuracy: 0.9180 - loss: 0.2447
Epoch 2/2
2620/2620 ━━━━━━━━━━━━━━━━━━━━ 617s 236ms/step - accuracy: 0.9642 - loss: 0.0983


<keras.src.callbacks.history.History at 0x7c9f0aae3e20>

### Predict test result

In [23]:
predict = model.predict(X_test_pad)

[1m1747/1747[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m98s[0m 56ms/step


### AUC Score

In [38]:
# track auc scores
auc_scores = []

In [39]:
# compute the SimpleRNN AUC once, then store and print it
simple_rnn_auc = roc_auc_score(y_test, predict)
auc_scores.append({'SimpleRnn': round(simple_rnn_auc, 4)})
print(f'auc: {simple_rnn_auc:.4f}')

auc: 0.8774
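
ROC AUC can be read as the probability that a randomly chosen toxic comment receives a higher score than a randomly chosen non-toxic one. The sketch below computes it that way on a four-sample toy example; the function name `pairwise_auc` and the toy data are illustrative, and the quadratic loop is only meant for intuition (`roc_auc_score` is the practical choice).

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = pairwise_auc(toy_labels, toy_scores)
print(toy_auc)  # 3 of the 4 positive/negative pairs are ordered correctly -> 0.75
```

Unlike accuracy, this measure does not depend on a 0.5 threshold or on the class ratio, which makes it a more informative score for a dataset where about 90% of comments are non-toxic.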


In [46]:
X_train_seq[2]

[21134, 993, 1013, 4, 5130, 993, 27, 12787, 7, 34,
 14, 69, 78, 40, 19, 55, 414, 1310, 993, 122,
 12, 30, 41, 27, 315, 925, 3, 53, 8, 505,
 142, 2291, 7, 50, 686, 13, 200, 3, 6960, 461,
 687, 68, 158, 25, 163, 642, 11, 10, 28, 993,
 27, 1011, 9, 88, 14, 824, 262, 1230, 25, 95,
 880, 13, 2341, 7, 669, 236, 62, 11, 33, 42,
 559, 63, 1143, 10, 29, 8, 5, 340, 221, 56,
 82, 1294, 10, 7750, 2, 1784, 178, 4284, 4, 178,
 1645, 7750, 22, 158, 941, 67, 25, 181, 340, 111,
 10, 29, 11, 8, 569, 10, 13, 394, 3, 29,
 4, 42, 151, 1, 769, 7, 19, 46, 980, 151,
 4862, 3494, 10, 5665, 29, 7750, 4, 7, 76, 14,
 718, 3, 217, 9, 7, 19, 204, 9, 2672, 114,
 3468, 114, 981, 29, 4250, 10, 30, 226, 10, 301,
 3, 1, 28, 7, 4412, 499, 7, 19, 4661, 645,
 30, 1053, 10, 97, 562, 7, 19, 55, 4661, 707,
 893, 17, 22, 7, 19, 14, 7, 3775, 6, 9,
 142, 113, 778, 2,

### Load GloVe Embeddings

Stanford GloVe project page: https://nlp.stanford.edu/projects/glove/

In [3]:
# load the GloVe vectors into a dictionary: word -> 300-dimensional vector
embeddings_index = {}
with open('/kaggle/input/glove840b300dtxt/glove.840B.300d.txt', 'r', encoding='utf-8') as f:
    for line in tqdm(f):
        # each line holds a token followed by its space-separated coefficients
        values = line.split(' ')
        word = values[0]
        coefs = np.asarray([float(val) for val in values[1:]])
        embeddings_index[word] = coefs

print(f'Found {len(embeddings_index)} word vectors.')

2196018it [03:41, 9921.70it/s] 

Found 2196017 word vectors.
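
Once the vectors are in a dictionary, a common next step is to build a matrix whose row `i` holds the pretrained vector for the word with tokenizer index `i`. The sketch below shows that step on toy data; `build_embedding_matrix`, `toy_word_index`, and `toy_embeddings` are hypothetical stand-ins for `token.word_index` and `embeddings_index`, and words without a pretrained vector are simply left as zero rows.

```python
import numpy as np

def build_embedding_matrix(word_index, embeddings_index, dim):
    """Row i holds the pretrained vector for the word with index i; row 0 is padding."""
    matrix = np.zeros((len(word_index) + 1, dim), dtype="float32")
    hits = 0
    for word, i in word_index.items():
        vector = embeddings_index.get(word)
        if vector is not None:
            matrix[i] = vector
            hits += 1
    return matrix, hits

# Toy stand-ins for token.word_index and embeddings_index (3-dimensional vectors).
toy_word_index = {"you": 1, "are": 2, "zzzunknown": 3}
toy_embeddings = {
    "you": np.array([0.1, 0.2, 0.3]),
    "are": np.array([0.4, 0.5, 0.6]),
}

matrix, hits = build_embedding_matrix(toy_word_index, toy_embeddings, dim=3)
print(matrix.shape, f"{hits}/{len(toy_word_index)} words covered")
```

With the real data the call would use `token.word_index`, `embeddings_index`, and `dim=300`. The resulting matrix is typically used to initialise a frozen `Embedding` layer, for example through a constant initializer, so the RNN starts from pretrained word representations.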



