## Sentiment Analysis
And now time to see a recurrent neural network in action. For this example, we are going to do something called sentiment analysis.

The formal definition of this term from Wikipedia is as follows:

*the process of computationally identifying and categorizing opinions expressed in a piece of text, especially in order to determine whether the writer's attitude towards a particular topic, product, etc. is positive, negative, or neutral.*

The example we’ll use here is classifying movie reviews as either postive, negative or neutral.


### Movie Review Dataset
Well start by loading in the IMDB movie review dataset from keras. This dataset contains 25,000 reviews from IMDB where each one is already preprocessed and has a label as either positive or negative. Each review is encoded by integers that represents how common a word is in the entire dataset. For example, a word encoded by the integer 3 means that it is the 3rd most common word in the dataset

In [62]:
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing import sequence
import tensorflow as tf
import os 
import numpy as np

vocab_size = 88584 #his value represents the number of unique words in the IMDB dataset that will be considered for training the model.
# any data getting 88584 value is the least occuring data in the code
maxlen = 250
batch_size = 64

(train_data , train_lables) , (test_data , test_lables) = imdb.load_data(num_word = vocab_size)

In [63]:
print(len(train_data[0]))
p = list(train_data[i])
for i in p:
 print(i , end=",")

218
1,13,377,1045,4,420,7,5311,88579,11,4,11282,137,13,16,581,11,1226,2415,36,69,4453,7,90,1266,11,27,1935,137,112,22825,11,2619,24,17,12,679,46,8,2910,4,2619,1066,21,8,2910,90,39,4,1071,14,16,1429,3032,5468,886,13,303,557,15,6,22,69,77,93,34,4285,44,4,420,21,12,16,93,18,298,248,25478,2886,3845,1745,188,6,1763,1360,766,11,1226,2415,13,5339,180,8,4,719,44139,87980,1315,5,16006,1247,6,3602,13,16,584,61,980,33,4,58,16,64,928,11,4,1232,7,4587,32843,5,7157,1419,4808,1318,51,6,530,22,138,286,12,626,8,6,7042,311,69,12,24,77,93,18,248,12,100,28,188,35,735,4372,42,241,50,9,57,96,8,2372,4,277,37,4,455,9,9,115,828,39,4,311,4125,18026,2235,6,371,3240,2167,29,9,4869,16859,1192,6,2104,34,27,322,5,5158,8,1813,12100,5,10473,34,27,9741,33,157,34,8000,134,3388,1421,29,214,8,235,565,976,10,10,997,8,1277,90,429,32,3986,9,6,33407,256,8,321,471,34,1660,7592,11,31,7,27,5655,354,29,215,985,4,15128,1062,7,1766,13666,4,5243,4697,29,47,33,27,11383,5,752,32,4,12665,34,27,9741,8,7539,15,4,14964,60,47,6,1435,455,4,

## More Preprocessing
If we have a look at some of our loaded in reviews, we'll notice that they are different lengths. This is an issue. We cannot pass different length data into our neural network. Therefore, we must make each review the same length. To do this we will follow the procedure below:
- if the review is greater than 250 words then trim off the extra words
- if the review is less than 250 words add the necessary amount of 0's to make it equal to 250.

In [64]:
print(type(train_data[1]))
print(type(test_data[1]))


<class 'list'>
<class 'list'>


In [65]:
train_data = sequence.pad_sequences(train_data, maxlen)
test_data = sequence.pad_sequences(test_data , maxlen)

In [None]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=32, input_length=maxlen),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(1, activation="sigmoid")
])

# Build the model by calling it with an input shape without this there will be no parameter for the model to be trained on
model.build(input_shape=(None, maxlen))

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Print the model summary
model.summary()



In [None]:
history = model.fit(train_data , train_lables , epochs=10 , validation_split=0.2) # we'ar using 20% of data as validation data

Epoch 1/10
[1m625/625[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m60s[0m 93ms/step - accuracy: 0.6933 - loss: 0.5518 - val_accuracy: 0.8514 - val_loss: 0.3541
Epoch 2/10
[1m625/625[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m74s[0m 118ms/step - accuracy: 0.9282 - loss: 0.1998 - val_accuracy: 0.8686 - val_loss: 0.3192
Epoch 3/10
[1m625/625[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m83s[0m 119ms/step - accuracy: 0.9676 - loss: 0.0985 - val_accuracy: 0.8616 - val_loss: 0.3557
Epoch 4/10
[1m625/625[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m78s[0m 113ms/step - accuracy: 0.9809 - loss: 0.0625 - val_accuracy: 0.8398 - val_loss: 0.4277
Epoch 5/10
[1m625/625[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m68s[0m 109ms/step - accuracy: 0.9804 - loss: 0.0585 - val_accuracy: 0.8646 - val_loss: 0.4520
Epoch 6/10
[1m625/625[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m69s[0m 111ms/step - accuracy: 0.9934 - loss: 0.0262 - val_accuracy: 0.8436 - val_loss: 0.6167
Epoch 7/10


In [None]:
result = model.evaluate(test_data,test_lables)
print(result)

[1m782/782[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m16s[0m 20ms/step - accuracy: 0.8320 - loss: 0.7528
[0.7459179759025574, 0.8336799740791321]


## Making Predictions
Now let’s use our network to make predictions on our own reviews. 

Since our reviews are encoded well need to convert any review that we write into that form so the network can understand it. To do that well load the encodings from the dataset and use them to encode our own data.

In [97]:
from tensorflow.keras.preprocessing.text import text_to_word_sequence
# Load the IMDB word index
word_index = imdb.get_word_index()

def encode_text(text):
    # Convert the text to a sequence of words
    tokens = text_to_word_sequence(text)
    # Encode the tokens to their corresponding indices, using 0 for unknown words
    tokens = [word_index[word] if word in word_index else 0 for word in tokens]
    # Pad the sequences to a fixed length
    return sequence.pad_sequences([tokens], maxlen=250)[0]  # Adjust maxlen as needed

In [98]:
text = "The movie was amazing"
encoded = encode_text(text)
print(encoded)

[  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   

In [100]:
# Assuming word_index is already defined and encoded is the output from encode_text
reverse_index_index = {value: key for (key, value) in word_index.items()}

def decode_integer(integers):
    text = ""  # YE EMPTY STRING HAI JISME HAM KUCH ADD KARENGE BAAD ME 
    for num in integers:
        # HAM SIRF USSI KO ADD KARENGE JO 0 NA HO KYUKI 0 KA MATLAB WAHA KUCH NAHI HAI
        if num != 0:  # Skip padding
            text += reverse_index_index.get(num, '') + " "  # Use get to avoid KeyError AND ISME TEXT STRING ME WORDS AAD KARENGE WITH NUM INDEX AUR FIR SPACE

    return text[:-1] # 

# Example usage
decoded_text = decode_integer(encoded)
print(decoded_text)

the movie was amazing


## NOW IT'S TIME TO MAKE PREDICTION

In [111]:
def predict(text):
    encodedtext = encode_text(text) # we encode the text
    pred = np.zeros((1,250)) # it is a numpy array of shape 1X250 with every single number as 0
    pred[0] = encodedtext  # the first letter of numpy array is the text we want
    result = model.predict(pred) 
    print(result[0])

positive_predict =  input("Write the sentence and see if it;s positive or negetive : ")
print(predict(positive_predict)) # HIGGER THE NUMBER THE MORE POSITIVE THE SENTENCE IS :)

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 38ms/step
[0.8886344]
None
