<a href="https://colab.research.google.com/github/singhsourav0/Deep-Learning-Odyssey/blob/main/21_SimpleRNN_IntegerEncoding_Embedding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<p><font color="blueviolet">This notebook explains why Recurrent Neural Networks (RNNs) are preferred over feed-forward Artificial Neural Networks (ANNs) and Convolutional Neural Networks (CNNs) for sequential data, and shows how to feed tokenized text to an RNN using two approaches: integer encoding and embedding.</font></p>

**Why RNNs?**
- **RNNs for Sequential Data**: <font color='lime'>RNNs are specifically designed to handle sequential data, making them suitable for tasks like text analysis, time series prediction, and speech recognition.</font>

- **Capturing Temporal Dependencies**: <font color='lime'>Unlike feed-forward ANNs and CNNs, RNNs carry a hidden state from one time step to the next, so the output at each step depends on the inputs that came before it. This makes them effective for tasks where past context is important.</font>
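
The "memory" described above can be sketched in a few lines. This is a minimal illustration of a single tanh recurrence, not the Keras implementation; the function name `rnn_forward` and the weight values are made up for the example.

```python
import math

def rnn_forward(inputs, w_x, w_h, bias):
    """Run a one-layer tanh RNN over a sequence and return the final hidden state.

    inputs: list of time steps, each a list of input_dim floats
    w_x:    units x input_dim matrix (input-to-hidden weights)
    w_h:    units x units matrix (hidden-to-hidden weights)
    bias:   list of units floats
    """
    units = len(bias)
    hidden = [0.0] * units  # the hidden state starts empty
    for x_t in inputs:
        # The new state depends on the current input AND the previous state.
        hidden = [
            math.tanh(
                sum(w * x for w, x in zip(w_x[i], x_t))
                + sum(w * h for w, h in zip(w_h[i], hidden))
                + bias[i]
            )
            for i in range(units)
        ]
    return hidden

# Illustrative (hypothetical) weights for 3 units and 1 input feature.
w_x = [[0.5], [-0.3], [0.8]]
w_h = [[0.1, 0.2, -0.1], [0.0, 0.3, 0.2], [-0.2, 0.1, 0.4]]
bias = [0.0, 0.0, 0.0]

# Same values in a different order give a different final state.
state_a = rnn_forward([[1.0], [2.0], [3.0]], w_x, w_h, bias)
state_b = rnn_forward([[3.0], [2.0], [1.0]], w_x, w_h, bias)
print(state_a)
print(state_b)
```

A feed-forward layer that simply summed or averaged the same three values would produce identical outputs for both orderings; the recurrence is what makes order matter.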

**Using RNNs with Tokenization:**
1. **Integer Encoding**:
   - **Explanation**: <font color='lime'>Integer encoding maps each distinct word in the text to a unique integer, producing a numerical sequence that can be fed into the RNN. The integers are arbitrary labels: their magnitudes carry no meaning.</font>
   - **How to Use**: <font color='lime'>Fit a tokenizer on the text so that every word receives an integer, convert the texts to integer sequences, pad them to a common length, and feed the padded sequences into the RNN.</font>

2. **Embedding**:
   - **Explanation**: <font color='lime'>Embedding represents each word as a dense vector of real numbers. The vectors are much shorter than one-hot vectors over the vocabulary, and because they are learned, they can capture semantic relationships between words.</font>
   - **How to Use**: <font color='lime'>Tokenize and integer-encode the text as before, then pass the integer sequences through an embedding layer, which looks up a trainable vector for each integer. The resulting vectors are fed into the RNN.</font>
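
The two approaches share the first step and differ in what reaches the RNN. The sketch below uses a hand-written toy vocabulary and a random lookup table; the names `integer_encode`, `make_embedding_table`, and `embed` are hypothetical helpers, and in a real model the table values would be trainable weights.

```python
import random

# Hypothetical toy vocabulary; index 0 is reserved for padding.
word_index = {"go": 1, "india": 2, "hip": 3, "hurray": 4}

def integer_encode(text, word_index):
    """Map each whitespace-separated word to its integer ID."""
    return [word_index[word] for word in text.lower().split()]

def make_embedding_table(vocab_size, dim, seed=0):
    """Create a (vocab_size x dim) table of small random vectors.

    These are placeholders: a real embedding layer would learn them.
    """
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)] for _ in range(vocab_size)]

def embed(ids, table):
    """Replace every integer ID by its row in the embedding table."""
    return [table[i] for i in ids]

ids = integer_encode("go india", word_index)          # integer encoding
table = make_embedding_table(len(word_index) + 1, 2)  # +1 for the padding row
vectors = embed(ids, table)                           # embedding lookup

print(ids)                            # [1, 2]
print(len(vectors), len(vectors[0]))  # 2 2
```

With integer encoding the RNN receives one number per word; with embedding it receives one vector per word.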

**Applying RNNs on IMDB Dataset for Sentiment Analysis:**

- <font color='lime'>We demonstrate the application of RNNs for sentiment analysis using the IMDb dataset. We utilize both integer encoding and embedding approaches to preprocess the text data before training a simple RNN model.</font>

<font color='teal'>By understanding these concepts, we can use RNNs with tokenization to process and analyze sequential data such as text in our deep learning projects.</font>

<h1><font color='teal'><b>Integer Encoding</b></font></h1>

In [61]:
import numpy as np

docs = ['go india',
		'india india',
		'hip hip hurray',
		'jeetega bhai jeetega india jeetega',
		'bharat mata ki jai',
		'kohli kohli',
		'sachin sachin',
		'dhoni dhoni',
		'modi ji ki jai',
		'inquilab zindabad']

In [62]:
from keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(oov_token='<nothing>')

In [63]:
tokenizer.fit_on_texts(docs)

In [64]:
tokenizer.word_index

{'<nothing>': 1,
 'india': 2,
 'jeetega': 3,
 'hip': 4,
 'ki': 5,
 'jai': 6,
 'kohli': 7,
 'sachin': 8,
 'dhoni': 9,
 'go': 10,
 'hurray': 11,
 'bhai': 12,
 'bharat': 13,
 'mata': 14,
 'modi': 15,
 'ji': 16,
 'inquilab': 17,
 'zindabad': 18}

In [65]:
tokenizer.word_counts

OrderedDict([('go', 1),
             ('india', 4),
             ('hip', 2),
             ('hurray', 1),
             ('jeetega', 3),
             ('bhai', 1),
             ('bharat', 1),
             ('mata', 1),
             ('ki', 2),
             ('jai', 2),
             ('kohli', 2),
             ('sachin', 2),
             ('dhoni', 2),
             ('modi', 1),
             ('ji', 1),
             ('inquilab', 1),
             ('zindabad', 1)])
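
The relationship between `word_counts` and `word_index` can be reproduced by hand: words are ranked by descending count, ties keep their first-seen order, and the OOV token takes index 1. The sketch below is a plain-Python approximation, not the Keras source; `build_word_index` and `encode` are hypothetical names, and unlike `Tokenizer` it does no punctuation filtering.

```python
docs = ['go india',
        'india india',
        'hip hip hurray',
        'jeetega bhai jeetega india jeetega',
        'bharat mata ki jai',
        'kohli kohli',
        'sachin sachin',
        'dhoni dhoni',
        'modi ji ki jai',
        'inquilab zindabad']

def build_word_index(texts, oov_token=None):
    """Count words, then number them by descending frequency starting at 1."""
    counts = {}
    for text in texts:
        for word in text.lower().split():
            counts[word] = counts.get(word, 0) + 1
    # sorted() is stable, so equally frequent words keep first-seen order.
    ranked = sorted(counts, key=lambda w: counts[w], reverse=True)
    vocab = ([oov_token] if oov_token is not None else []) + ranked
    return {word: i for i, word in enumerate(vocab, start=1)}, counts

def encode(text, word_index, oov_token):
    """Convert a text to integer IDs, using the OOV ID for unknown words."""
    oov_id = word_index[oov_token]
    return [word_index.get(w, oov_id) for w in text.lower().split()]

word_index, counts = build_word_index(docs, oov_token='<nothing>')
sequences = [encode(d, word_index, '<nothing>') for d in docs]

print(word_index['india'], word_index['zindabad'])   # 2 18
print(sequences[3])                                  # [3, 12, 3, 2, 3]
print(encode('go pakistan', word_index, '<nothing>'))  # [10, 1]
```

The last line shows why the OOV token matters: a word never seen during fitting is mapped to index 1 rather than being dropped.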

In [66]:
tokenizer.document_count

10

In [67]:
sequences = tokenizer.texts_to_sequences(docs)
sequences

[[10, 2],
 [2, 2],
 [4, 4, 11],
 [3, 12, 3, 2, 3],
 [13, 14, 5, 6],
 [7, 7],
 [8, 8],
 [9, 9],
 [15, 16, 5, 6],
 [17, 18]]

In [68]:
from keras.utils import pad_sequences

In [69]:
sequences = pad_sequences(sequences, padding = 'post')

In [70]:
sequences

array([[10,  2,  0,  0,  0],
       [ 2,  2,  0,  0,  0],
       [ 4,  4, 11,  0,  0],
       [ 3, 12,  3,  2,  3],
       [13, 14,  5,  6,  0],
       [ 7,  7,  0,  0,  0],
       [ 8,  8,  0,  0,  0],
       [ 9,  9,  0,  0,  0],
       [15, 16,  5,  6,  0],
       [17, 18,  0,  0,  0]], dtype=int32)

In [71]:
# Import everything from the same `keras` namespace; mixing `keras` and
# `tensorflow.keras` objects in one model can cause errors in some versions.
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense, SimpleRNN, Embedding, Flatten

In [72]:
(x_train, y_train), (x_test, y_test) = imdb.load_data()

In [73]:
x_train[0]

[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468,
 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112,
 50, 670, 22665, 9, 35, 480, 284, 5, 150, 4, 172, 112,
 167, 21631, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38,
 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22,
 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530,
 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16,
 626, 18, 19193, 5, 62, 386, 12, 8, 316, 8, 106, 5,
 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16,
 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33,
 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82,
 10311, 8, 4, 107, 117, 5952, 15, 256, 4, 31050, 7, 3766,
 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7,
 4, 12118, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32,
 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21,
 134, 476, 26, 480, 5, ...]

In [74]:
len(x_train[0])

218

In [75]:
len(x_train[2])

141

In [76]:
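# padding='post' appends zeros to short reviews; the default truncating='pre'
# means reviews longer than 50 tokens keep only their last 50 tokens.
x_train = pad_sequences(x_train, padding='post', maxlen=50)
x_test = pad_sequences(x_test, padding='post', maxlen=50)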

In [77]:
x_train[0]

array([2071,   56,   26,  141,    6,  194, 7486,   18,    4,  226,   22,
         21,  134,  476,   26,  480,    5,  144,   30, 5535,   18,   51,
         36,   28,  224,   92,   25,  104,    4,  226,   65,   16,   38,
       1334,   88,   12,   16,  283,    5,   16, 4472,  113,  103,   32,
         15,   16, 5345,   19,  178,   32], dtype=int32)
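
The padded `x_train[0]` starts at 2071 rather than at 1 because the original review has 218 tokens and only the last 50 are kept. The sketch below mimics that behavior in plain Python, assuming the `pad_sequences` defaults of `truncating='pre'` and `value=0`; `pad_post` is a hypothetical helper, not part of Keras.

```python
def pad_post(sequences, maxlen=None, value=0):
    """Pad at the end and, when a sequence is too long, drop tokens from the front."""
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        # Keep at most the LAST maxlen tokens.
        kept = list(seq[max(len(seq) - maxlen, 0):])
        # Fill the remaining positions on the right with the padding value.
        padded.append(kept + [value] * (maxlen - len(kept)))
    return padded

print(pad_post([[10, 2], [4, 4, 11], [3, 12, 3, 2, 3]]))
# [[10, 2, 0, 0, 0], [4, 4, 11, 0, 0], [3, 12, 3, 2, 3]]

print(pad_post([[1, 2, 3, 4, 5, 6]], maxlen=4))
# [[3, 4, 5, 6]]
```

Whether the start or the end of a review is more informative for sentiment is a modeling choice; passing `truncating='post'` to `pad_sequences` would keep the first tokens instead.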

In [78]:
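model = Sequential()
# input_shape=(50, 1): 50 time steps, each carrying one feature (the raw integer word ID).
# x_train is passed with shape (25000, 50); reshaping it to (25000, 50, 1) makes the feature axis explicit.
model.add(SimpleRNN(32, input_shape=(50, 1), return_sequences=False))
# A single sigmoid unit outputs the probability that the review is positive.
model.add(Dense(1, activation='sigmoid'))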

In [79]:
model.summary()

Model: "sequential_7"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 simple_rnn_3 (SimpleRNN)    (None, 32)                1088      
                                                                 
 dense_3 (Dense)             (None, 1)                 33        
                                                                 
Total params: 1121 (4.38 KB)
Trainable params: 1121 (4.38 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
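
The parameter counts in the summary can be checked by hand. A SimpleRNN layer has one weight per (input feature, unit) pair, one per (unit, unit) pair for the recurrence, and one bias per unit. The helper names below are hypothetical and the sketch only does the arithmetic.

```python
def simple_rnn_params(units, input_dim):
    """input-to-hidden weights + hidden-to-hidden weights + biases."""
    return units * input_dim + units * units + units

def dense_params(units, input_dim):
    """weights + biases of a fully connected layer."""
    return input_dim * units + units

rnn = simple_rnn_params(units=32, input_dim=1)  # 32*1 + 32*32 + 32
out = dense_params(units=1, input_dim=32)       # 32*1 + 1
print(rnn, out, rnn + out)  # 1088 33 1121
```

Note that the count does not depend on the sequence length of 50: the same weights are reused at every time step.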


In [80]:
model.compile(loss ='binary_crossentropy', optimizer = 'adam', metrics =['accuracy'] )
model.fit(x_train, y_train,epochs =5, validation_data =(x_test, y_test))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.src.callbacks.History at 0x7fdde1d83010>

<font color='coral'>Feeding raw integer IDs to the RNN treats arbitrary labels as magnitudes: the model sees word 18 as "larger" than word 2, even though the numbering only reflects frequency rank. An embedding layer instead maps each ID to a dense vector whose values are learned during training, so words used in similar contexts can end up with similar vectors.</font>
<p><font color="teal">Overall, word embeddings capture semantic relationships between words, need far fewer dimensions than one-hot vectors, adapt to the task because they are trained with the rest of the model, let what is learned about one word carry over to similar words, and can be initialized from pre-trained representations for improved performance.</font></p>

<h1><font color='teal'><b>Embedding</b></font></h1>

In [81]:
docs = ['go india',
		'india india',
		'hip hip hurray',
		'jeetega bhai jeetega india jeetega',
		'bharat mata ki jai',
		'kohli kohli',
		'sachin sachin',
		'dhoni dhoni',
		'modi ji ki jai',
		'inquilab zindabad']

In [82]:
from keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer()

In [83]:
tokenizer.fit_on_texts(docs)

In [84]:
len(tokenizer.word_index)

17

In [85]:
sequences = tokenizer.texts_to_sequences(docs)
sequences

[[9, 1],
 [1, 1],
 [3, 3, 10],
 [2, 11, 2, 1, 2],
 [12, 13, 4, 5],
 [6, 6],
 [7, 7],
 [8, 8],
 [14, 15, 4, 5],
 [16, 17]]

In [86]:
from keras.utils import pad_sequences
sequences = pad_sequences(sequences,padding='post')
sequences

array([[ 9,  1,  0,  0,  0],
       [ 1,  1,  0,  0,  0],
       [ 3,  3, 10,  0,  0],
       [ 2, 11,  2,  1,  2],
       [12, 13,  4,  5,  0],
       [ 6,  6,  0,  0,  0],
       [ 7,  7,  0,  0,  0],
       [ 8,  8,  0,  0,  0],
       [14, 15,  4,  5,  0],
       [16, 17,  0,  0,  0]], dtype=int32)

In [102]:
model2 = Sequential()

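# Note: the IDs here range over 0..17 (0 is padding), which needs input_dim=18;
# with input_dim=17, index 17 falls outside the embedding table.
model2.add(Embedding(17, output_dim=2, input_length = 5))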

In [103]:
model2.summary()

Model: "sequential_10"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_7 (Embedding)     (None, 5, 2)              34        
                                                                 
Total params: 34 (136.00 Byte)
Trainable params: 34 (136.00 Byte)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
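
An embedding table needs one row per possible index, including 0 for padding. The sketch below checks the padded sequences shown above against tables of 17 and 18 rows using plain lists; `lookup_batch` and `make_table` are hypothetical helpers, and how a real framework reacts to an out-of-range index may differ by backend.

```python
# Padded sequences copied from the output above (vocabulary built without an OOV token).
sequences = [
    [9, 1, 0, 0, 0], [1, 1, 0, 0, 0], [3, 3, 10, 0, 0], [2, 11, 2, 1, 2],
    [12, 13, 4, 5, 0], [6, 6, 0, 0, 0], [7, 7, 0, 0, 0], [8, 8, 0, 0, 0],
    [14, 15, 4, 5, 0], [16, 17, 0, 0, 0],
]

largest_id = max(max(row) for row in sequences)
rows_needed = largest_id + 1  # indices 0..largest_id
print(largest_id, rows_needed)  # 17 18

def make_table(rows, dim=2):
    """Placeholder embedding table filled with zeros."""
    return [[0.0] * dim for _ in range(rows)]

def lookup_batch(table, batch):
    """Replace every ID in a batch of sequences by its table row."""
    return [[table[i] for i in row] for row in batch]

try:
    lookup_batch(make_table(17), sequences)
    fits_17 = True
except IndexError:
    fits_17 = False  # index 17 has no row in a 17-row table

embedded = lookup_batch(make_table(rows_needed), sequences)
print(fits_17)  # False
print(len(embedded), len(embedded[0]), len(embedded[0][0]))  # 10 5 2
```

The final shape `(10, 5, 2)` corresponds to the `(None, 5, 2)` output shape in the summary: batch size, sequence length, and embedding dimension.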


In [104]:
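# compile() takes (optimizer, loss) positionally, and 'accuracy' is a metric rather than a loss.
# This model is only inspected here, not trained, so the metric is passed by keyword and no loss is set.
model2.compile(optimizer='adam', metrics=['accuracy'])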

In [106]:
print(sequences)

[[ 9  1  0  0  0]
 [ 1  1  0  0  0]
 [ 3  3 10  0  0]
 [ 2 11  2  1  2]
 [12 13  4  5  0]
 [ 6  6  0  0  0]
 [ 7  7  0  0  0]
 [ 8  8  0  0  0]
 [14 15  4  5  0]
 [16 17  0  0  0]]


In [132]:
from keras.datasets import imdb
from keras.preprocessing.text import Tokenizer
from keras.utils import pad_sequences
from keras import Sequential
from keras.layers import Dense,SimpleRNN,Embedding,Flatten

In [117]:
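# Keep only the 10,000 most frequent words so that every index fits the Embedding(10000, ...) layer below.
(X_train,Y_train),(X_test,Y_test) = imdb.load_data(num_words=10000)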

In [118]:
X_train = pad_sequences(X_train,padding='post',maxlen=50)
X_test = pad_sequences(X_test,padding='post',maxlen=50)

In [126]:
model2 = Sequential()
model2.add(Embedding(10000, 2, input_length=50, embeddings_initializer='uniform'))
model2.add(SimpleRNN(32, return_sequences=False))
model2.add(Dense(1, activation='sigmoid'))

In [122]:
model2.summary()

Model: "sequential_12"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_9 (Embedding)     (None, 50, 2)             20000     
                                                                 
 simple_rnn_6 (SimpleRNN)    (None, 32)                1120      
                                                                 
 dense_6 (Dense)             (None, 1)                 33        
                                                                 
Total params: 21153 (82.63 KB)
Trainable params: 21153 (82.63 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


In [134]:
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Embedding, SimpleRNN, Dense
from keras.utils import pad_sequences

# Set the parameters
max_features = 10000  # Vocabulary size: keep only the 10,000 most frequent words
maxlen = 100  # Sequence length: longer reviews keep their last 100 tokens, shorter ones are zero-padded at the front
batch_size = 32

print('Loading data...')
(input_train, y_train), (input_test, y_test) = imdb.load_data(num_words=max_features)
print(len(input_train), 'train sequences')
print(len(input_test), 'test sequences')

print('Pad sequences (samples x time)')
input_train = pad_sequences(input_train, maxlen=maxlen)
input_test = pad_sequences(input_test, maxlen=maxlen)
print('input_train shape:', input_train.shape)
print('input_test shape:', input_test.shape)

# Build the model
model = Sequential()
model.add(Embedding(max_features, 32))
model.add(SimpleRNN(32))
model.add(Dense(1, activation='sigmoid'))

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
history = model.fit(input_train, y_train,
                    epochs=10,
                    batch_size=batch_size,
                    validation_split=0.2)

# Evaluate the model
loss, accuracy = model.evaluate(input_test, y_test)
print('Test loss:', loss)
print('Test accuracy:', accuracy)


Loading data...
25000 train sequences
25000 test sequences
Pad sequences (samples x time)
input_train shape: (25000, 100)
input_test shape: (25000, 100)
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test loss: 1.0124222040176392
Test accuracy: 0.7741600275039673
