### Connect to Kaggle

We will use data from the Kaggle platform for this exercise. The data is available at https://www.kaggle.com/c/word2vec-nlp-tutorial/data. We will first connect Colab to Kaggle. Instructions for downloading Kaggle data to Colab can be found [in this post](https://towardsdatascience.com/setting-up-kaggle-in-google-colab-ebb281b61463).

In [None]:
!pip install kaggle --quiet

In [None]:
#Make a directory for Kaggle
!mkdir .kaggle

In [None]:
#Connect Google drive to colab
from google.colab import drive
drive.mount('/gdrive')

In [None]:
#Copy the kaggle.json file. Change the gdrive path to wherever you saved the json file from Kaggle
!cp '/gdrive/My Drive/AI-ML/Machine-Learning/Code/Utilities/kaggle.json' /content/.kaggle/kaggle.json

In [None]:
#Check if json file is there
!ls -l /content/.kaggle

In [None]:
!mkdir -p ~/.kaggle
!cp /content/.kaggle/kaggle.json ~/.kaggle/kaggle.json
!kaggle config set -n path -v /content
!chmod 600 ~/.kaggle/kaggle.json
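
The same credential setup can be sketched in plain Python. This is a minimal sketch, assuming the token file carries the `username` and `key` fields used by Kaggle API tokens; `install_kaggle_credentials` is a hypothetical helper written for illustration, not part of the `kaggle` package.

```python
import json
import os
import shutil
import stat
from pathlib import Path


def install_kaggle_credentials(src, home=None):
    """Copy kaggle.json into <home>/.kaggle and restrict it to the owner.

    Hypothetical helper; `src` is wherever you saved the token file.
    """
    home = Path(home) if home is not None else Path.home()
    target_dir = home / ".kaggle"
    target_dir.mkdir(parents=True, exist_ok=True)  # same effect as mkdir -p
    target = target_dir / "kaggle.json"
    shutil.copyfile(src, target)
    os.chmod(target, stat.S_IRUSR | stat.S_IWUSR)  # same as chmod 600

    # Fail early if the token does not have the expected fields
    token = json.loads(target.read_text())
    missing = {"username", "key"} - token.keys()
    if missing:
        raise ValueError(f"kaggle.json is missing fields: {sorted(missing)}")
    return target
```

Calling it with the Google Drive path used above would place the token in `~/.kaggle/kaggle.json`, where the Kaggle CLI looks for it.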

Verify Kaggle connection

In [None]:
!kaggle datasets list

#### Download Movie Reviews data

In [None]:
!kaggle competitions download -c word2vec-nlp-tutorial -p /content

In [None]:
#Confirm data has been downloaded
!ls -l

In [None]:
!unzip word2vec-nlp-tutorial.zip

In [None]:
!ls -l

Load the dataset as a pandas DataFrame

In [None]:
import numpy as np
import pandas as pd

In [None]:
#quoting=3 (csv.QUOTE_NONE) keeps the quote characters inside the reviews as plain text
df = pd.read_csv('labeledTrainData.tsv.zip', header=0, delimiter="\t", quoting=3)
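
The value `quoting=3` is the same constant as `csv.QUOTE_NONE` in the standard library. The sketch below uses the `csv` module on a made-up tab-separated row (not taken from the dataset) to show what that setting does: quote characters are treated as ordinary text, so they stay in the field.

```python
import csv
import io

# Toy tab-separated data, made up for illustration
raw = 'id\tsentiment\treview\n"1_8"\t1\t"A \\"great\\" film"\n'

# The number 3 passed to pandas is this constant
assert csv.QUOTE_NONE == 3

rows = list(csv.reader(io.StringIO(raw), delimiter="\t", quoting=csv.QUOTE_NONE))
header, first = rows[0], rows[1]
review = first[2]
print(review)  # surrounding and embedded quotes are preserved
```

This is why the raw review printed further down still starts and ends with a double quote.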

In [None]:
df.shape

In [None]:
df.head()

In [None]:
df.groupby(['sentiment']).count()

In [None]:
df.loc[1500, 'review']

Split Data into Training and Test Data

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
#test_size=0.8 holds out 80% of the reviews for testing, so only 20% are used for training
train, test = train_test_split(df, test_size=0.8, random_state=42)

In [None]:
train.shape, test.shape
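
A quick way to see what sizes to expect from a fractional `test_size`. This is a sketch of scikit-learn's documented rounding when `train_size` is not given, assuming 25,000 labeled reviews (the sum of the padded shapes printed later in the notebook); `split_sizes` is a hypothetical helper.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    The test set gets ceil(test_size * n_samples) rows and the
    training set gets whatever is left.
    """
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


n_train, n_test = split_sizes(25000, 0.8)
print(n_train, n_test)  # 5000 20000
```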

In [None]:
train.reset_index(inplace=True, drop=True)
test.reset_index(inplace=True, drop=True)

In [None]:
X_train = train['review']
y_train = train['sentiment']

In [None]:
X_test = test['review']
y_test = test['sentiment']

# Build the Tokenizer

In [None]:
import tensorflow as tf

In [None]:
desired_vocab_size = 30000 #Vocabulary size
t = tf.keras.preprocessing.text.Tokenizer(num_words=desired_vocab_size) # num_words -> Vocabulary size

In [None]:
#Fit tokenizer with actual training data
t.fit_on_texts(X_train.tolist())

In [57]:
len(t.word_index)

42770
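
The word index is larger than `num_words` because the tokenizer indexes every word it sees; the limit is only applied when texts are converted to sequences. The pure-Python sketch below imitates that behavior on a toy corpus. It is not the Keras implementation (it skips punctuation filtering, for instance), and `build_word_index` and `texts_to_sequences` are illustrative names.

```python
from collections import Counter


def build_word_index(texts):
    """Rank every word by frequency, starting at 1 (0 is left for padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked = sorted(counts.items(), key=lambda kv: -kv[1])
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


def texts_to_sequences(texts, word_index, num_words):
    """Keep only words whose index is below num_words; unknown words are dropped."""
    return [
        [word_index[w] for w in text.lower().split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]


corpus = ["the film was great", "the film was bad", "the plot was thin"]
word_index = build_word_index(corpus)
print(len(word_index))  # 7: every distinct word is indexed
print(texts_to_sequences(["the film was superb"], word_index, num_words=3))  # [[1, 2]]
```

In the toy run, "film" has index 3 and is cut by `num_words=3`, while "superb" is dropped because it was never seen during fitting.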

In [32]:
#Vocabulary
print(t.word_index)



In [None]:
#t.word_counts

# Prepare Training and Test Data

Get the word index for each word in the review

In [29]:
X_train[0]

'"This is an early film \\"Pilot\\" for the hit Canadian tv show Trailer Park Boys. It was played to executives at a few networks before Showcase decided to sign them up for a tv series. Great acting and a very funny cast make this one of the best cult comedy films. The movie plot is that these two small time criminals go around \\"exterminating\\" peoples pets for money. If you have a dog next door whos barking all night these are the guys you go to! But they get into trouble when they come across a job too big for them to deal with and end up in a shootout. Watch this movie if you want to understand the beginning of the tv series. I highly recommend it!<br /><br />Rated R for swearing, violence, and drug use.<br /><br />Its not too offensive either (they dont actually show killing animals)"'

In [30]:
X_train = t.texts_to_sequences(X_train.tolist())

In [31]:
print(X_train[0])

[11, 6, 32, 384, 19, 1707, 16, 1, 560, 1613, 260, 122, 1232, 1594, 1105, 9, 13, 250, 5, 8270, 29, 2, 164, 7605, 154, 4447, 833, 5, 1929, 93, 57, 16, 2, 260, 211, 86, 117, 3, 2, 52, 168, 175, 95, 11, 28, 4, 1, 110, 1112, 213, 107, 1, 17, 115, 6, 12, 131, 103, 390, 55, 3470, 141, 185, 24300, 4113, 5913, 16, 288, 45, 23, 26, 2, 1078, 378, 1368, 12618, 18189, 30, 295, 131, 24, 1, 448, 23, 141, 5, 18, 33, 77, 85, 1065, 51, 33, 207, 646, 2, 280, 94, 191, 16, 93, 5, 894, 15, 3, 130, 57, 8, 2, 4448, 106, 11, 17, 45, 23, 174, 5, 402, 1, 456, 4, 1, 260, 211, 10, 551, 383, 9, 7, 7, 1079, 1663, 16, 6226, 599, 3, 1412, 343, 7, 7, 98, 21, 94, 2708, 339, 33, 4621, 163, 122, 821, 1413]


In [None]:
t.sequences_to_texts([X_train[0]])

In [33]:
X_test = t.texts_to_sequences(X_test.tolist())

In [None]:
t.texts_to_sequences(['My name is XYZ'])

How many words are in each review?

In [35]:
len(X_train[2000])

207

# Pad Sequences - Important

In [36]:
#Define maximum number of words to consider in each review
max_review_length = 300

In [37]:
#Pad training and test reviews
X_train = tf.keras.preprocessing.sequence.pad_sequences(X_train,
                                                        maxlen=max_review_length,
                                                        padding='pre',
                                                        truncating='post')

X_test = tf.keras.preprocessing.sequence.pad_sequences(X_test,
                                                       maxlen=max_review_length,
                                                       padding='pre',
                                                       truncating='post')

In [38]:
X_train.shape

(5000, 300)

In [39]:
X_test.shape

(20000, 300)
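
What `padding='pre'` and `truncating='post'` do to a single review can be sketched in a few lines of plain Python. This imitates the documented behavior of `pad_sequences` for one sequence; `pad_or_truncate` is a hypothetical name.

```python
def pad_or_truncate(seq, maxlen, padding="pre", truncating="post", value=0):
    """Force one sequence to exactly maxlen items."""
    if len(seq) > maxlen:
        # 'post' drops items from the end, 'pre' drops them from the start
        seq = seq[:maxlen] if truncating == "post" else seq[-maxlen:]
    pad = [value] * (maxlen - len(seq))
    # 'pre' puts the zeros in front, 'post' puts them after the words
    return pad + list(seq) if padding == "pre" else list(seq) + pad


short = pad_or_truncate([5, 6, 7], maxlen=5)
clipped = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short)    # [0, 0, 5, 6, 7]
print(clipped)  # [1, 2, 3, 4, 5]
```

With these settings a short review gets leading zeros, and a review longer than 300 words keeps its first 300.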

In [40]:
X_train[100]

array([    0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     2,  1015,     3,  1418,   441,    42,     1,  2780,
          16,     1,  2859,  1589,  1377,   411,  9108, 11205,  2178,
         237,    36,  1826,  6705,    14,     1, 12771,  1437,   254,
           8,  2922,     4,     1,  3402,    34,   638,     1,  1113,
       10041,     5,    22,  3098,     6,     1,  1589,  1227,     8,
         830,     1,   153,  3135,   971,  1116,    34,   174,     5,
        4496,     2,  5144,   827,     7,     7,     1,    19,     6,
          44,    32,  6302,   801,    42,     1, 10042,     4,     1,
        3135,   827,

# Build the Graph

In [41]:
#Initialize model
tf.keras.backend.clear_session()
model = tf.keras.Sequential()

Add Embedding layer
 - Embedding Layer Input = [Batch_Size, Review Length]

In [42]:
model.add(tf.keras.layers.Embedding(desired_vocab_size + 1, #Vocabulary size
                                    50, #Embedding size
                                    input_length=max_review_length) #Number of words in each review
          )

In [43]:
model.input

<KerasTensor: shape=(None, 300) dtype=float32 (created by layer 'embedding_input')>

In [44]:
model.output

<KerasTensor: shape=(None, 300, 50) dtype=float32 (created by layer 'embedding')>

Embedding Layer Output -
[Batch_Size , Review Length , Embedding_Size]
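
An embedding layer is a table lookup: each word index selects one row of a weight table. The sketch below uses toy sizes and random values (the notebook's table has 30001 rows of 50 numbers) to show how a batch of index sequences becomes a 3-D output.

```python
import random

vocab_rows, embedding_size = 10, 3  # toy sizes for illustration
rng = random.Random(0)
table = [[rng.uniform(-1, 1) for _ in range(embedding_size)]
         for _ in range(vocab_rows)]

# Two padded "reviews" of four word indices each
batch = [[0, 0, 3, 7], [1, 2, 3, 4]]

# Replace every index with its row from the table
embedded = [[table[idx] for idx in review] for review in batch]
shape = (len(embedded), len(embedded[0]), len(embedded[0][0]))
print(shape)  # (2, 4, 3) -> [Batch_Size, Review Length, Embedding_Size]
```

Index 3 appears in both reviews and maps to the same vector each time.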

Flatten the Output

In [45]:
model.add(tf.keras.layers.Flatten())

In [46]:
model.output

<KerasTensor: shape=(None, 15000) dtype=float32 (created by layer 'flatten')>

Add BatchNormalization and Dense layers

In [47]:
model.add(tf.keras.layers.BatchNormalization())

In [48]:
model.output

<KerasTensor: shape=(None, 15000) dtype=float32 (created by layer 'batch_normalization')>

In [49]:
model.add(tf.keras.layers.Dense(128, activation='relu'))

In [50]:
model.output

<KerasTensor: shape=(None, 128) dtype=float32 (created by layer 'dense')>

In [51]:
model.add(tf.keras.layers.Dropout(0.25))

In [52]:
model.output

<KerasTensor: shape=(None, 128) dtype=float32 (created by layer 'dropout')>

Use Dense layer for output layer

In [53]:
model.add(tf.keras.layers.Dense(1,activation='sigmoid'))

In [54]:
#Compile the model
model.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])

In [55]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 300, 50)           1500050   
                                                                 
 flatten (Flatten)           (None, 15000)             0         
                                                                 
 batch_normalization (Batch  (None, 15000)             60000     
 Normalization)                                                  
                                                                 
 dense (Dense)               (None, 128)               1920128   
                                                                 
 dropout (Dropout)           (None, 128)               0         
                                                                 
 dense_1 (Dense)             (None, 1)                 129       
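
The parameter counts in this summary can be reproduced by hand. A short arithmetic sketch using the sizes defined earlier in the notebook:

```python
vocab_rows, embedding_size, review_length, units = 30001, 50, 300, 128

embedding = vocab_rows * embedding_size  # one 50-number vector per word index
flat = review_length * embedding_size    # width after Flatten
batch_norm = 4 * flat                    # gamma, beta, moving mean, moving variance
dense = flat * units + units             # weights plus biases
output = units * 1 + 1                   # single sigmoid unit

print(embedding, batch_norm, dense, output)  # 1500050 60000 1920128 129
```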
                                                        

# Execute the graph

In [56]:
model.fit(X_train,y_train,
          epochs=5,
          batch_size=32,
          validation_data=(X_test, y_test))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.src.callbacks.History at 0x7a52a463bdf0>

#### Using Pre-Trained Embeddings

In [58]:
import gensim.downloader as api
import numpy as np

Available Word2Vec/GloVe pretrained models -> [Link](https://github.com/RaRe-Technologies/gensim-data)

In [59]:
#Load GloVe model (similar to Word2Vec)
glove_model = api.load('glove-wiki-gigaword-50')



In [60]:
#Size of the model
glove_model.vectors.shape

(400000, 50)

In [None]:
#glove_model.key_to_index  #(gensim 4.x; older versions used glove_model.vocab)

In [61]:
glove_model['india']

array([-0.20356 , -0.8707  , -0.19172 ,  0.73862 ,  0.18494 ,  0.14926 ,
        0.48079 , -0.21633 ,  0.72753 , -0.36912 ,  0.13397 , -0.1143  ,
       -0.18075 , -0.64683 , -0.18484 ,  0.83575 ,  0.48179 ,  0.76026 ,
       -0.50381 ,  0.80743 ,  1.2195  ,  0.3459  ,  0.22185 ,  0.31335 ,
        1.2066  , -1.8441  ,  0.14064 , -0.99715 , -1.1402  ,  0.32342 ,
        3.2128  ,  0.42708 ,  0.19504 ,  0.80113 ,  0.38555 , -0.12568 ,
       -0.26533 ,  0.055264, -1.1557  ,  0.16836 , -0.82228 ,  0.20394 ,
        0.089235, -0.60125 , -0.032878,  1.3735  , -0.51661 ,  0.29611 ,
        0.23951 , -1.3801  ], dtype=float32)

In [62]:
#Initialize embedding matrix for our dataset with 30000+1 rows (1 for padding word)
#and 50 columns (as embedding size is 50)
embedding_matrix = np.zeros((desired_vocab_size + 1, 50))

In [63]:
embedding_matrix.shape

(30001, 50)

In [64]:
embedding_matrix[200]

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [None]:
#t.word_index.items()

In [65]:
words_found = 0
for word, i in sorted(t.word_index.items(), key=lambda x: x[1]):
    if i > desired_vocab_size: #Matrix rows run from 0 to desired_vocab_size
        break
    try:
        embedding_vector = glove_model[word] #Read the word's embedding from the GloVe model
        embedding_matrix[i] = embedding_vector
        words_found += 1
    except KeyError: #Word is not in the GloVe vocabulary, so its row stays all zeros
        pass

In [66]:
words_found

26397
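
To see which words end up without a pretrained vector, the same loop can collect the misses. This is a toy sketch: `pretrained` is a small stand-in dictionary for the GloVe model and `word_index` is a made-up tokenizer index, both assumptions for illustration.

```python
embedding_size = 3
pretrained = {"the": [0.1, 0.2, 0.3], "film": [0.4, 0.5, 0.6]}  # stand-in for GloVe
word_index = {"the": 1, "film": 2, "whos": 3}                    # toy tokenizer index
vocab_size = 3

# Row 0 is reserved for padding, so the matrix has vocab_size + 1 rows
matrix = [[0.0] * embedding_size for _ in range(vocab_size + 1)]
missing = []
for word, i in word_index.items():
    if i > vocab_size:
        continue
    vector = pretrained.get(word)
    if vector is None:
        missing.append(word)  # this row stays all zeros
    else:
        matrix[i] = list(vector)

coverage = 1 - len(missing) / vocab_size
print(missing, round(coverage, 2))  # ['whos'] 0.67
```

In the notebook run above, 26397 matches out of 30000 candidate words is roughly 88% coverage; the remaining rows are all zeros.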

In [67]:
embedding_matrix[1]

array([ 4.18000013e-01,  2.49679998e-01, -4.12420005e-01,  1.21699996e-01,
        3.45270008e-01, -4.44569997e-02, -4.96879995e-01, -1.78619996e-01,
       -6.60229998e-04, -6.56599998e-01,  2.78430015e-01, -1.47670001e-01,
       -5.56770027e-01,  1.46579996e-01, -9.50950012e-03,  1.16579998e-02,
        1.02040000e-01, -1.27920002e-01, -8.44299972e-01, -1.21809997e-01,
       -1.68009996e-02, -3.32789987e-01, -1.55200005e-01, -2.31309995e-01,
       -1.91809997e-01, -1.88230002e+00, -7.67459989e-01,  9.90509987e-02,
       -4.21249986e-01, -1.95260003e-01,  4.00710011e+00, -1.85939997e-01,
       -5.22870004e-01, -3.16810012e-01,  5.92130003e-04,  7.44489999e-03,
        1.77780002e-01, -1.58969998e-01,  1.20409997e-02, -5.42230010e-02,
       -2.98709989e-01, -1.57490000e-01, -3.47579986e-01, -4.56370004e-02,
       -4.42510009e-01,  1.87849998e-01,  2.78489990e-03, -1.84110001e-01,
       -1.15139998e-01, -7.85809994e-01])

Building a model using Pre-Trained embeddings

In [68]:
#Initialize model
tf.keras.backend.clear_session()
model = tf.keras.Sequential()

In [69]:
model.add(tf.keras.layers.Embedding(desired_vocab_size + 1, #Vocabulary size
                                    50, #Embedding size
                                    weights=[embedding_matrix],
                                    trainable=False,
                                    input_length=max_review_length) #Number of words in each review
          )

In [70]:
model.output

<KerasTensor: shape=(None, 300, 50) dtype=float32 (created by layer 'embedding')>

In [71]:
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.BatchNormalization())
model.add(tf.keras.layers.Dense(128, activation='relu'))
model.add(tf.keras.layers.Dropout(0.25))
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))

In [72]:
#Compile the model
model.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])

In [73]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 300, 50)           1500050   
                                                                 
 flatten (Flatten)           (None, 15000)             0         
                                                                 
 batch_normalization (Batch  (None, 15000)             60000     
 Normalization)                                                  
                                                                 
 dense (Dense)               (None, 128)               1920128   
                                                                 
 dropout (Dropout)           (None, 128)               0         
                                                                 
 dense_1 (Dense)             (None, 1)                 129       
                                                        

In [74]:
model.fit(X_train,y_train,
          epochs=5,
          batch_size=32,
          validation_data=(X_test, y_test))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.src.callbacks.History at 0x7a51ef084970>

#### Using CNN

In [75]:
#Initialize model
tf.keras.backend.clear_session()
model = tf.keras.Sequential()

In [76]:
model.add(tf.keras.layers.Embedding(desired_vocab_size + 1, #Vocabulary size
                                    50, #Embedding size
                                    weights=[embedding_matrix],
                                    trainable=False,
                                    input_length=max_review_length) #Number of words in each review
          )

In [77]:
model.add(tf.keras.layers.BatchNormalization())

In [78]:
model.output

<KerasTensor: shape=(None, 300, 50) dtype=float32 (created by layer 'batch_normalization')>

In [79]:
#Replace dense with Conv
model.add(tf.keras.layers.Conv1D(128, #Number of filters
                                 3, #Kernel size
                                 activation='relu'))

In [80]:
model.output

<KerasTensor: shape=(None, 298, 128) dtype=float32 (created by layer 'conv1d')>
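
The 298 in the output shape follows from sliding a width-3 kernel over 300 positions without padding. A small sketch of that arithmetic, together with the layer's parameter count; both helper names are hypothetical.

```python
def conv1d_output_length(input_length, kernel_size, stride=1):
    """Number of output steps with padding='valid' (no padding), the Keras default."""
    return (input_length - kernel_size) // stride + 1


def conv1d_params(in_channels, filters, kernel_size):
    """Each filter has kernel_size * in_channels weights plus one bias."""
    return (kernel_size * in_channels + 1) * filters


print(conv1d_output_length(300, 3))  # 298
print(conv1d_params(50, 128, 3))     # 19328
```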

In [81]:
model.add(tf.keras.layers.Dropout(0.25))
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))

In [82]:
#Compile the model
model.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])

In [83]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 300, 50)           1500050   
                                                                 
 batch_normalization (Batch  (None, 300, 50)           200       
 Normalization)                                                  
                                                                 
 conv1d (Conv1D)             (None, 298, 128)          19328     
                                                                 
 dropout (Dropout)           (None, 298, 128)          0         
                                                                 
 flatten (Flatten)           (None, 38144)             0         
                                                                 
 dense (Dense)               (None, 1)                 38145     
                                                        

In [84]:
model.fit(X_train,y_train,
          epochs=5,
          batch_size=32,
          validation_data=(X_test, y_test))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.src.callbacks.History at 0x7a521f30ba30>

#### With Simple RNN

In [85]:
#Initialize model
tf.keras.backend.clear_session()
model = tf.keras.Sequential()

In [86]:
model.add(tf.keras.layers.Embedding(desired_vocab_size + 1, #Vocabulary size
                                    50, #Embedding size
                                    weights=[embedding_matrix],
                                    trainable=False,
                                    input_length=max_review_length) #Number of words in each review
          )

In [87]:
model.output

<KerasTensor: shape=(None, 300, 50) dtype=float32 (created by layer 'embedding')>

Add Simple RNN layer

In [88]:
model.add(tf.keras.layers.BatchNormalization())
model.add(tf.keras.layers.SimpleRNN(128,
                                    activation='relu')) #RNN State - size of memory

In [89]:
model.output

<KerasTensor: shape=(None, 128) dtype=float32 (created by layer 'simple_rnn')>

In [90]:
model.add(tf.keras.layers.Dropout(0.25))

Output layer

In [91]:
model.add(tf.keras.layers.Dense(1,activation='sigmoid'))

In [92]:
#Compile the model
model.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])

In [93]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 300, 50)           1500050   
                                                                 
 batch_normalization (Batch  (None, 300, 50)           200       
 Normalization)                                                  
                                                                 
 simple_rnn (SimpleRNN)      (None, 128)               22912     
                                                                 
 dropout (Dropout)           (None, 128)               0         
                                                                 
 dense (Dense)               (None, 1)                 129       
                                                                 
Total params: 1523291 (5.81 MB)
Trainable params: 23141 (90.39 KB)
Non-trainable params: 1500150 (5.72 MB)
_______________
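
The totals in this summary can be checked by hand. The sketch below counts the SimpleRNN parameters (input weights, recurrent weights, and a bias per unit) and splits BatchNormalization into its trainable and non-trainable halves; `simple_rnn_params` is a hypothetical helper.

```python
def simple_rnn_params(input_dim, units):
    """Input weights + recurrent weights + bias for every unit."""
    return units * (input_dim + units + 1)


embedding = 30001 * 50   # frozen, because trainable=False
bn_trainable = 2 * 50    # gamma and beta
bn_frozen = 2 * 50       # moving mean and moving variance
rnn = simple_rnn_params(50, 128)
dense = 128 + 1

trainable = bn_trainable + rnn + dense
non_trainable = embedding + bn_frozen
print(rnn, trainable, non_trainable, trainable + non_trainable)
# 22912 23141 1500150 1523291
```

Almost all of the model's parameters sit in the frozen embedding table, so only about 23 thousand values are updated during training.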

In [94]:
model.fit(X_train,y_train,
          epochs=5,
          batch_size=32,
          validation_data=(X_test, y_test))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.src.callbacks.History at 0x7a51eefbd4b0>

In [None]:
model.fit(X_train,y_train,
          epochs=10,
          initial_epoch=5,
          batch_size=32,
          validation_data=(X_test, y_test))
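
In Keras `fit`, `epochs` is the index of the final epoch, not the number of extra epochs, so the call above continues the same model for five more passes. A tiny sketch of which epoch numbers get run; `epochs_to_run` is a hypothetical helper.

```python
def epochs_to_run(epochs, initial_epoch=0):
    """Epoch numbers as they appear in the training log (1-based)."""
    return list(range(initial_epoch + 1, epochs + 1))


print(epochs_to_run(5))                    # [1, 2, 3, 4, 5]
print(epochs_to_run(10, initial_epoch=5))  # [6, 7, 8, 9, 10]
```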

#### Using LSTM

In [None]:
#Initialize model
tf.keras.backend.clear_session()
model = tf.keras.Sequential()

In [None]:
model.add(tf.keras.layers.Embedding(desired_vocab_size + 1, #Vocabulary size
                                    50, #Embedding size
                                    weights=[embedding_matrix],
                                    trainable=False,
                                    input_length=max_review_length) #Number of words in each review
          )

In [None]:
model.output

In [None]:
model.add(tf.keras.layers.BatchNormalization())
model.add(tf.keras.layers.LSTM(128)) #RNN State - size of memory
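
An LSTM with the same state size has four times the parameters of a simple RNN, since it learns four sets of weights (three gates plus the candidate cell state). A sketch of the count for the standard LSTM layer with bias; `lstm_params` is a hypothetical helper.

```python
def lstm_params(input_dim, units):
    """Four weight sets, each with input weights, recurrent weights and a bias."""
    return 4 * units * (input_dim + units + 1)


print(lstm_params(50, 128))  # 91648
```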

In [None]:
model.output

In [None]:
model.add(tf.keras.layers.Dropout(0.25))

In [None]:
#Output
model.add(tf.keras.layers.Dense(1,activation='sigmoid'))

In [None]:
#Compile the model
model.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])

In [None]:
model.summary()

In [None]:
model.fit(X_train,y_train,
          epochs=5,
          batch_size=32,
          validation_data=(X_test, y_test))

In [None]:
model.fit(X_train,y_train,
          epochs=10,
          initial_epoch=5,
          batch_size=32,
          validation_data=(X_test, y_test))