## Content 
   * Simple RNN's
   * Word Embeddings : Definition and How to get them
   * LSTM's
   * GRU's
   * Bidirectional RNN's
   * Encoder-Decoder Models (Seq2Seq Models)
   * Attention Models
   * Transformers - Attention is all you need
   * BERT

In [5]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from tqdm import tqdm
from sklearn.model_selection import train_test_split
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, GRU, SimpleRNN
from tensorflow.keras.layers import Dense, Activation, Dropout
from tensorflow.keras.layers import Embedding
from tensorflow.keras.layers import BatchNormalization
from tensorflow.keras.utils import to_categorical  # np_utils functionality is now in to_categorical
from sklearn import preprocessing, decomposition, model_selection, metrics, pipeline
from tensorflow.keras.layers import GlobalMaxPooling1D, Conv1D, MaxPooling1D, Flatten, Bidirectional, SpatialDropout1D
from tensorflow.keras.preprocessing import sequence, text
from tensorflow.keras.callbacks import EarlyStopping


import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from plotly import graph_objs as go
import plotly.express as px
import plotly.figure_factory as ff

### Setup TPU
A Tensor Processing Unit (TPU) is a type of hardware accelerator designed by Google specifically for machine learning workloads, particularly for neural network training and inference. TPUs are part of Google's broader AI hardware strategy and are optimized to handle large-scale computations for deep learning tasks more efficiently than general-purpose processors like CPUs (Central Processing Units) or even GPUs (Graphics Processing Units).

In [1]:
import tensorflow as tf

# Detect if TPU is available and initialize it
try:
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()  # Detect TPU
    print('Running on TPU:', tpu.master())
except ValueError:
    tpu = None

if tpu:
    # Connect to TPU
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.TPUStrategy(tpu)  # Create a TPU strategy (the experimental alias is deprecated)
else:
    strategy = tf.distribute.MirroredStrategy()  # For GPU or multi-GPU machines


INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:CPU:0',)


In [6]:
train = pd.read_csv('/Users/rufen/Downloads/jigsaw-toxic-comment-train.csv')
validation = pd.read_csv('/Users/rufen/Downloads/jigsaw_validation.csv')
test = pd.read_csv('/Users/rufen/Downloads/jigsaw_test.csv')

In [7]:
train.head()

|   | id | comment_text | toxic | severe_toxic | obscene | threat | insult | identity_hate |
|---|---|---|---|---|---|---|---|---|
| 0 | 0000997932d777bf | Explanation\nWhy the edits made under my usern... | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 000103f0d9cfb60f | D'aww! He matches this background colour I'm s... | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 000113f07ec002fd | Hey man, I'm really not trying to edit war. It... | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0001b41b1c6bb37e | "\nMore\nI can't make any real suggestions on ... | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0001d958c54c6e35 | You, sir, are my hero. Any chance you remember... | 0 | 0 | 0 | 0 | 0 | 0 |


In [8]:
train.drop(['severe_toxic','obscene','threat','insult','identity_hate'],axis=1,inplace=True)

In [9]:
train = train.loc[:12000,:]

In [10]:
train.shape

(12001, 3)

In [19]:
train['comment_text'].apply(lambda x:len(str(x).split())).max()

1403

In [20]:
def roc_auc(predictions, target):
    '''
    Return the ROC AUC score given the predicted scores
    and the true binary labels.
    '''
    # Build the ROC curve, then integrate it to get the area under the curve
    fpr, tpr, thresholds = metrics.roc_curve(target, predictions)
    return metrics.auc(fpr, tpr)
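
To make the metric concrete, here is a minimal NumPy sketch of what the AUC measures: the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The helper name `pairwise_auc` and the toy arrays are made up for illustration and are not part of the notebook; in practice `metrics.roc_auc_score` from scikit-learn computes the same quantity in one call.

```python
import numpy as np

def pairwise_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    scores = np.asarray(scores, dtype=float).ravel()
    labels = np.asarray(labels).ravel()
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    # Compare every positive score against every negative score.
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

# Toy data (hypothetical): one positive is ranked below one negative.
toy_labels = np.array([0, 0, 1, 1])
toy_scores = np.array([0.1, 0.4, 0.35, 0.8])
toy_auc = pairwise_auc(toy_scores, toy_labels)
print(toy_auc)
```

Three of the four positive/negative pairs are ordered correctly here, so the value is a fraction between 0 and 1 rather than a percentage.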

In [21]:
xtrain, xvalid, ytrain, yvalid = train_test_split(train.comment_text.values, train.toxic.values, 
                                                  stratify=train.toxic.values, 
                                                  random_state=42, 
                                                  test_size=0.2, shuffle=True)

### RNN's
Recurrent Neural Networks (RNNs) are a type of neural network that is particularly well-suited for sequential data and time series analysis. They are designed to handle input data of variable length and to capture dependencies and patterns across sequences. Here are a few reasons why RNNs are preferred over simple feedforward neural networks for sequential data:

Handling sequential data: RNNs are designed to handle sequential data where the order of inputs matters. They have a "memory" element that allows them to process each input in the context of previous inputs. This makes them suitable for tasks like speech recognition, language modeling, time series prediction, and machine translation.

Variable input length: Unlike traditional feedforward neural networks, RNNs can process inputs of variable lengths. This flexibility is crucial for tasks where the length of the input sequences can vary, such as natural language processing tasks.

Temporal dependencies: RNNs are capable of capturing temporal dependencies in sequential data. They can remember information from previous time steps and use it to make predictions at the current time step. This makes them well-suited for tasks that involve analyzing time series data or other sequences in which earlier inputs affect later ones.

Parameter sharing: RNNs have shared weights across time steps, which allows them to efficiently learn patterns in sequential data. This parameter sharing helps in reducing the number of parameters to be learned compared to simple feedforward neural networks, making RNNs more effective for tasks with sequential data.

Backpropagation through time: RNNs use a technique called backpropagation through time (BPTT) to update the model's weights. This technique allows the network to learn from sequences of data by unfolding the network in time and applying the standard backpropagation algorithm. This enables the network to learn complex patterns in sequential data.

While RNNs are powerful for handling sequential data, they struggle to learn long-term dependencies: gradients propagated back through many time steps tend to vanish or explode, so inputs that are far apart in the sequence have little influence on each other during training. More advanced architectures such as Long Short-Term Memory (LSTM) networks and Gated Recurrent Units (GRUs), both extensions of the basic RNN architecture, were developed to address this limitation.
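
The points above (a hidden state acting as memory, variable-length inputs, and shared weights) can be sketched in a few lines of NumPy. This is an illustrative forward pass only, not the Keras implementation; the function name `simple_rnn_forward`, the random weights, and the toy sequences are assumptions made for the example. The sizes (300-dimensional inputs, 100 units) are chosen to match the layers used in this notebook.

```python
import numpy as np

def simple_rnn_forward(x_seq, W_x, W_h, b):
    """Run h_t = tanh(x_t @ W_x + h_{t-1} @ W_h + b) over a sequence and return the last state."""
    h = np.zeros(W_h.shape[0])
    for x_t in x_seq:  # the same weights are reused at every time step
        h = np.tanh(x_t @ W_x + h @ W_h + b)
    return h

rng = np.random.default_rng(0)
input_dim, units = 300, 100  # assumed to mirror the embedding size and SimpleRNN(100)
W_x = rng.normal(scale=0.05, size=(input_dim, units))
W_h = rng.normal(scale=0.05, size=(units, units))
b = np.zeros(units)

# Sequences of different lengths go through the same weights.
short_seq = rng.normal(size=(5, input_dim))
long_seq = rng.normal(size=(50, input_dim))
h_short = simple_rnn_forward(short_seq, W_x, W_h, b)
h_long = simple_rnn_forward(long_seq, W_x, W_h, b)

n_params = W_x.size + W_h.size + b.size
print(h_short.shape, h_long.shape, n_params)
```

Because the weights are shared across time steps, the parameter count depends only on the input size and the number of units (300 × 100 + 100 × 100 + 100 = 40,100), not on the sequence length.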

In [22]:
# using keras tokenizer here
token = text.Tokenizer(num_words=None)
max_len = 1500

token.fit_on_texts(list(xtrain) + list(xvalid))
xtrain_seq = token.texts_to_sequences(xtrain)
xvalid_seq = token.texts_to_sequences(xvalid)

#zero pad the sequences
xtrain_pad = sequence.pad_sequences(xtrain_seq, maxlen=max_len)
xvalid_pad = sequence.pad_sequences(xvalid_seq, maxlen=max_len)

word_index = token.word_index
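
As a rough illustration of what the padding step does, the sketch below mimics the default behaviour of `pad_sequences` (`padding='pre'`, `truncating='pre'`) in plain NumPy. The helper `pad_pre` and the toy sequences are hypothetical and exist only for this example.

```python
import numpy as np

def pad_pre(seqs, maxlen):
    """Left-pad with zeros and keep only the last `maxlen` tokens of each sequence."""
    out = np.zeros((len(seqs), maxlen), dtype=np.int32)
    for row, seq in enumerate(seqs):
        trimmed = seq[-maxlen:]  # long sequences lose tokens from the front
        if trimmed:
            out[row, -len(trimmed):] = trimmed  # short sequences get zeros in front
    return out

toy_seqs = [[4, 7], [1, 2, 3, 4, 5, 6]]
padded = pad_pre(toy_seqs, maxlen=4)
print(padded)
```

Index 0 is never assigned to a word by the tokenizer, which is why the embedding layers in this notebook use `len(word_index) + 1` rows.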

In [23]:
%%time
with strategy.scope():
    # A simpleRNN without any pretrained embeddings and one dense layer
    model = Sequential()
    model.add(Embedding(len(word_index) + 1,
                     300,
                     input_length=max_len))
    model.add(SimpleRNN(100)) 
    # 100 is the number of units, i.e. the dimensionality of the hidden state
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
   
    # Binary cross-entropy measures the difference between the true labels and the predicted probabilities.
    
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 1500, 300)         13049100  
                                                                 
 simple_rnn (SimpleRNN)      (None, 100)               40100     
                                                                 
 dense (Dense)               (None, 1)                 101       
                                                                 
Total params: 13089301 (49.93 MB)
Trainable params: 13089301 (49.93 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
CPU times: user 115 ms, sys: 93.8 ms, total: 209 ms
Wall time: 333 ms


In [25]:
model.fit(xtrain_pad, ytrain, epochs=5, batch_size=64*strategy.num_replicas_in_sync)  # scale the global batch size by the number of replicas (TPU cores or GPUs)

Epoch 1/5


Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.src.callbacks.History at 0x2d32f08e0>

In [26]:
scores = model.predict(xvalid_pad)
# AUC is a value between 0 and 1, not a percentage
print("AUC: %.2f" % (roc_auc(scores,yvalid)))

AUC: 0.83


In [27]:
scores_model = []
scores_model.append({'Model': 'SimpleRNN','AUC_Score': roc_auc(scores,yvalid)})

### LSTM's

Simple RNN's were certainly better than classical ML algorithms and gave state-of-the-art results, but they failed to capture the long-term dependencies that are present in sentences. LSTM's were introduced in 1997 by Hochreiter and Schmidhuber to counter this drawback.
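
The difficulty with long-term dependencies can be seen numerically. The sketch below is a toy demonstration under stated assumptions, not a measurement of the model trained above: it uses a tanh recurrence with no inputs, and it rescales random recurrent weights so that their largest singular value is 0.9. It then tracks the norm of the Jacobian of the hidden state at step t with respect to the initial state.

```python
import numpy as np

rng = np.random.default_rng(0)
units, steps = 20, 50

# Recurrent weights rescaled so the largest singular value is 0.9 (an assumption for the demo).
W_h = rng.normal(size=(units, units))
W_h *= 0.9 / np.linalg.norm(W_h, 2)

h = rng.normal(size=units)
jacobian = np.eye(units)  # accumulates d h_t / d h_0
norms = []
for _ in range(steps):
    h = np.tanh(h @ W_h)
    # One-step Jacobian of h_t = tanh(h_{t-1} @ W_h): column j is scaled by 1 - h_t[j]^2
    step_jac = W_h * (1.0 - h ** 2)
    jacobian = jacobian @ step_jac
    norms.append(np.linalg.norm(jacobian, 2))

print(norms[0], norms[-1])
```

Each step multiplies the Jacobian by a matrix whose norm is at most 0.9, so after 50 steps the sensitivity to the initial state is bounded by 0.9^50, roughly 0.005. With a largest singular value above 1 the product can instead grow, which is the exploding case.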

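### LSTM vs. RNN

A recurrent neural network (RNN) is a neural network architecture for processing sequential data. Its defining feature is a recurrent layer that carries state forward, so the network can remember earlier information and use it to model the sequence. An important RNN variant is the long short-term memory network (LSTM), which effectively addresses the vanishing and exploding gradient problems of traditional RNNs. By introducing a memory cell and gating mechanisms, an LSTM can selectively remember or forget earlier information and therefore learn the features of long sequences more effectively.

* (1) An RNN has no cell state, whereas an LSTM stores information in its cell state.
* (2) An RNN uses only the tanh activation, whereas an LSTM introduces sigmoid functions through its input, forget, and output gates, combines them with tanh, and updates the cell state by addition, which reduces the likelihood of vanishing and exploding gradients.
* (3) An RNN can only handle short-term dependencies; an LSTM can handle both short-term and long-term dependencies.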
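
The differences listed above can be sketched as a single LSTM step in NumPy. This is a minimal illustration of the standard gate equations, not the Keras implementation; the function `lstm_step`, the stacked weight layout, and the random inputs are assumptions made for the example. The sizes (300-dimensional inputs, 100 units) are chosen to match the `LSTM(100)` layer used in this notebook.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step; W, U, b stack the input, forget, candidate, and output blocks."""
    z = x_t @ W + h_prev @ U + b
    i, f, g, o = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)  # gates squashed to (0, 1)
    g = np.tanh(g)                                 # candidate values
    c_t = f * c_prev + i * g                       # additive cell-state update
    h_t = o * np.tanh(c_t)                         # hidden state exposed to the next layer
    return h_t, c_t

rng = np.random.default_rng(0)
input_dim, units = 300, 100
W = rng.normal(scale=0.05, size=(input_dim, 4 * units))
U = rng.normal(scale=0.05, size=(units, 4 * units))
b = np.zeros(4 * units)

h, c = np.zeros(units), np.zeros(units)
for x_t in rng.normal(size=(6, input_dim)):
    h, c = lstm_step(x_t, h, c, W, U, b)

n_params = W.size + U.size + b.size
print(h.shape, c.shape, n_params)
```

The cell state `c_t` is updated by a gated sum rather than being passed through a squashing function at every step, which is the "summation" referred to in point (2). The four stacked blocks give 4 × (300 × 100 + 100 × 100 + 100) = 160,400 parameters, four times the count of a simple RNN of the same size.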

In [31]:
# load the GloVe vectors in a dictionary:

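embeddings_index = {}
# a context manager closes the file even if parsing fails part-way through
with open('/Users/rufen/Downloads/glove.840B.300d.txt','r',encoding='utf-8') as f:
    for line in tqdm(f):
        values = line.split(' ')
        word = values[0]  # first field is the token, the rest are its 300 coefficients
        coefs = np.asarray([float(val) for val in values[1:]])
        embeddings_index[word] = coefs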

print('Found %s word vectors.' % len(embeddings_index))

2196018it [01:10, 31186.00it/s]

Found 2196017 word vectors.





In [32]:
# create an embedding matrix for the words we have in the dataset
embedding_matrix = np.zeros((len(word_index) + 1, 300))
for word, i in tqdm(word_index.items()):
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

100%|███████████████████████████████████| 43496/43496 [00:07<00:00, 5848.97it/s]
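
Words that are missing from the pretrained vectors keep an all-zero row in the matrix, so it can be useful to check how much of the vocabulary is actually covered. The sketch below shows the idea on toy data; `toy_word_index`, `toy_vectors`, and the two-dimensional vectors are hypothetical stand-ins for `word_index` and `embeddings_index`.

```python
import numpy as np

# Toy stand-ins for the tokenizer vocabulary and the pretrained vectors (hypothetical values).
toy_word_index = {"the": 1, "cat": 2, "zzxqy": 3}
toy_vectors = {"the": np.array([0.1, 0.2]), "cat": np.array([0.3, 0.4])}

dim = 2
toy_matrix = np.zeros((len(toy_word_index) + 1, dim))  # row 0 is reserved for padding
missing = []
for word, idx in toy_word_index.items():
    vec = toy_vectors.get(word)
    if vec is None:
        missing.append(word)  # out-of-vocabulary words keep an all-zero row
    else:
        toy_matrix[idx] = vec

coverage = 1 - len(missing) / len(toy_word_index)
print(f"coverage: {coverage:.2f}, missing: {missing}")
```

Running the same loop over the real `word_index` would show which tokens (often misspellings or usernames in comment data) receive no pretrained vector.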


In [33]:
%%time
with strategy.scope():
    
    # A simple LSTM with glove embeddings and one dense layer
    model = Sequential()
    model.add(Embedding(len(word_index) + 1,
                     300,
                     weights=[embedding_matrix],
                     input_length=max_len,
                     trainable=False))

    model.add(LSTM(100, dropout=0.3, recurrent_dropout=0.3))
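    # dropout is a regularization technique used to prevent overfitting.
    # During training, each input unit of the layer is dropped at random with probability 0.3 (30%), which helps the model generalize.
    # recurrent_dropout applies dropout to the recurrent state inside the LSTM.
    # Between time steps, the recurrent connections are dropped with probability 0.3 (30%), which helps prevent overfitting on long sequences.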
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam',metrics=['accuracy'])
    
model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 1500, 300)         13049100  
                                                                 
 lstm (LSTM)                 (None, 100)               160400    
                                                                 
 dense_1 (Dense)             (None, 1)                 101       
                                                                 
Total params: 13209601 (50.39 MB)
Trainable params: 160501 (626.96 KB)
Non-trainable params: 13049100 (49.78 MB)
_________________________________________________________________
CPU times: user 133 ms, sys: 156 ms, total: 289 ms
Wall time: 1.06 s


In [34]:
model.fit(xtrain_pad, ytrain, epochs=5, batch_size=64*strategy.num_replicas_in_sync)

Epoch 1/5


Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.src.callbacks.History at 0x481955630>

In [35]:
scores = model.predict(xvalid_pad)
# AUC is a value between 0 and 1, not a percentage
print("AUC: %.2f" % (roc_auc(scores,yvalid)))

AUC: 0.97


In [36]:
scores_model.append({'Model': 'LSTM','AUC_Score': roc_auc(scores,yvalid)})

### GRU's

Introduced by Cho et al. in 2014, the GRU (Gated Recurrent Unit) aims to solve the vanishing gradient problem that comes with a standard recurrent neural network. GRU's are a variation on the LSTM: both rely on gating to control what is remembered, and in many cases they produce comparably good results. GRU's were designed to be simpler and faster than LSTM's, so there is no clear winner between the two.
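
To show where the simplification comes from, here is a minimal NumPy sketch of one GRU step. It follows one common formulation of the update equations and is not the Keras implementation; the function `gru_step`, the stacked weight layout, and the random inputs are assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W, U, b):
    """One GRU step; W, U, b stack the update, reset, and candidate blocks."""
    n = h_prev.shape[0]
    xz, xr, xh = np.split(x_t @ W + b, 3)
    z = sigmoid(xz + h_prev @ U[:, :n])                 # update gate
    r = sigmoid(xr + h_prev @ U[:, n:2 * n])            # reset gate
    h_cand = np.tanh(xh + (r * h_prev) @ U[:, 2 * n:])  # candidate state
    return z * h_prev + (1.0 - z) * h_cand              # blend old state and candidate

rng = np.random.default_rng(0)
input_dim, units = 300, 100  # assumed to mirror the layer sizes used earlier
W = rng.normal(scale=0.05, size=(input_dim, 3 * units))
U = rng.normal(scale=0.05, size=(units, 3 * units))
b = np.zeros(3 * units)

h = np.zeros(units)
for x_t in rng.normal(size=(6, input_dim)):
    h = gru_step(x_t, h, W, U, b)

n_params = W.size + U.size + b.size
print(h.shape, n_params)
```

A GRU keeps a single hidden state and three weight blocks instead of the LSTM's separate cell state and four blocks, so this formulation has 3 × 40,100 = 120,300 parameters against 160,400 for the LSTM above. The count reported by Keras for a `GRU` layer may differ slightly depending on its `reset_after` setting, which adds a second bias vector.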