<a href="https://colab.research.google.com/github/shaheerzubery/Deeplearning/blob/main/TextSentiment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **Sentiment Analysis of the IMDB Dataset**

In [None]:
import tensorflow as tf

In [None]:
import tensorflow.keras as keras
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### **Import The Data File**

In [None]:
imdb_reviews = pd.read_csv("/content/imdb_reviews.csv")
Test_reviews = pd.read_csv("/content/test_reviews.csv")

In [None]:
imdb_reviews.head()

Unnamed: 0,Reviews,Sentiment
0,<START this film was just brilliant casting lo...,positive
1,<START big hair big boobs bad music and a gian...,negative
2,<START this has to be one of the worst films o...,negative
3,<START the <UNK> <UNK> at storytelling the tra...,positive
4,<START worst mistake of my life br br i picked...,negative


In [None]:
Test_reviews.head()

### **Preprocessing the data**
We cannot pass string data to the model directly, so we need to convert each review into a sequence of integers. To do this, every distinct word is mapped to a distinct integer, e.g. {'this': 14, 'the': 4}. We already have a file that contains this word-to-integer mapping.

We now load that file.


In [None]:
word_index = pd.read_csv("/content/word_indexes.csv")

In [None]:
word_index.head()

Unnamed: 0,Words,Indexes
0,tsukino,52009
1,nunnery,52010
2,sonja,16819
3,vani,63954
4,woods,1411


Next, we convert the word_index DataFrame into a Python dictionary so that we can use it to convert the reviews from strings to integers.

In [None]:
word_index = dict(zip(word_index.Words, word_index.Indexes))

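The step above pairs each entry of the Words column with the matching entry of the Indexes column and stores the pairs as dictionary keys and values. Next, we add the special tokens used in the reviews: padding, start of review, unknown word, and an unused slot. The start token appears in the data as `<START`, without a closing bracket, so the key is written the same way.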


In [None]:
word_index["<PAD>"] = 0
word_index["<START"] = 1
word_index["<UNK>"] = 2
word_index["UNUSED"] = 3
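
As a small, self-contained illustration of the two cells above, the sketch below builds a lookup dictionary from two parallel columns and then adds the reserved tokens. The plain lists standing in for the `Words` and `Indexes` columns, the toy values, and the name `toy_word_index` are assumptions made for the example, not the contents of `word_indexes.csv`.

```python
# Toy stand-ins for the Words and Indexes columns (hypothetical values).
words = ["the", "this", "film"]
indexes = [4, 14, 22]

# zip pairs each word with its index; dict turns the pairs into a lookup table.
toy_word_index = dict(zip(words, indexes))

# Reserve the low indices for special tokens, as in the notebook.
toy_word_index["<PAD>"] = 0
toy_word_index["<START"] = 1
toy_word_index["<UNK>"] = 2

print(toy_word_index["this"])  # 14
```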

Now we define a function, review_encoder, that encodes a tokenized review as a list of integers according to the mapping in word_index.

In [None]:
def review_encoder(text):
  arr=[word_index[word] for word in text]
  return arr
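
`review_encoder` raises a `KeyError` if a token is missing from `word_index`. One possible variant, sketched below with a toy dictionary, falls back to the `<UNK>` index instead. The name `review_encoder_safe` and the toy vocabulary are hypothetical and not part of the original notebook.

```python
# Toy vocabulary (hypothetical values) that includes the special tokens.
toy_word_index = {"<PAD>": 0, "<START": 1, "<UNK>": 2, "this": 14, "film": 22}

def review_encoder_safe(tokens, mapping=toy_word_index):
    """Map each token to its index, using the <UNK> index for unseen tokens."""
    unk = mapping["<UNK>"]
    return [mapping.get(token, unk) for token in tokens]

encoded = review_encoder_safe("<START this film rocks".split())
print(encoded)  # [1, 14, 22, 2]
```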

Now we split each dataset into reviews and sentiments.

In [None]:
train_data,train_labels = imdb_reviews['Reviews'], imdb_reviews['Sentiment']
test_data, test_labels = Test_reviews['Reviews'], Test_reviews['Sentiment']

Now we break each review string into tokens (tokenization).

In [None]:
train_data = train_data.apply(lambda review : review.split())
test_data = test_data.apply(lambda review : review.split())

In [None]:
train_data[0]

Now that the reviews are tokenized, we apply review_encoder to convert each token into its integer index.


In [None]:
train_data=train_data.apply(review_encoder)
test_data=test_data.apply(review_encoder)

In [None]:
train_data.head()

0    [1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, ...
1    [1, 194, 1153, 194, 8255, 78, 228, 5, 6, 1463,...
2    [1, 14, 47, 8, 30, 31, 7, 4, 249, 108, 7, 4, 5...
3    [1, 4, 2, 2, 33, 2804, 4, 2040, 432, 111, 153,...
4    [1, 249, 1323, 7, 61, 113, 10, 10, 13, 1637, 1...
Name: Reviews, dtype: object

Now we convert the sentiments into integer labels (1 for positive, 0 for negative).


In [None]:
def sentiment_encoder(sentiment):
  if sentiment == "positive":
    return 1
  else:
    return 0

train_labels = train_labels.apply(sentiment_encoder)
test_labels = test_labels.apply(sentiment_encoder)


Before giving the reviews to the model as input, we need to perform the following preprocessing steps:

Every review must have the same length for the model to work correctly.

We have chosen a length of 300 for each review.

If a review is longer than 300 words, we cut off the extra part. Note that pad_sequences truncates from the start of the sequence by default (truncating="pre"), so the last 300 tokens are kept.

If a review contains fewer than 300 words, we pad it with zeros at the end (padding="post") to increase its length to 300.

In [None]:
train_data = keras.preprocessing.sequence.pad_sequences(train_data, value=word_index["<PAD>"], padding="post" , maxlen=300)
test_data = keras.preprocessing.sequence.pad_sequences(test_data, value=word_index["<PAD>"], padding="post" , maxlen=300)
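
To make the effect of padding and truncation concrete, here is a hand-written sketch of the idea in plain Python. It is an illustration only, not the Keras implementation, and the helper name `pad_or_truncate` is hypothetical. It pads at the end, as `padding="post"` does, and drops tokens from the front of long sequences, which is what `pad_sequences` does by default (`truncating="pre"`).

```python
def pad_or_truncate(sequence, maxlen, pad_value=0):
    """Return a list of exactly maxlen items."""
    if len(sequence) > maxlen:
        # Keep the last maxlen tokens (mirrors truncating="pre").
        return list(sequence[-maxlen:])
    # Append pad_value until the target length is reached (mirrors padding="post").
    return list(sequence) + [pad_value] * (maxlen - len(sequence))

short = pad_or_truncate([1, 14, 22], maxlen=5)
long = pad_or_truncate([1, 14, 22, 16, 43, 530, 973], maxlen=5)
print(short)  # [1, 14, 22, 0, 0]
print(long)   # [22, 16, 43, 530, 973]
```

Passing `truncating="post"` to `pad_sequences` would keep the beginning of each review instead.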

### **BUILDING THE MODEL**

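Our model is a neural network that consists of the following layers:

The first layer is a word embedding layer, which maps each integer in the encoded review to an embedding vector of length 16.

The second layer is a global average pooling layer, which averages the embeddings over the sequence so that each review becomes a single vector of length 16. The layer has no trainable parameters, which keeps the model small and helps limit overfitting.

The third layer is a dense layer with 16 hidden units and ReLU as its activation function.

The final layer is the output layer, a single unit with a sigmoid activation function that outputs the probability that the review is positive.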
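
The pooling step can be sketched in plain Python: it averages the embedding vectors of all positions in a review, so a sequence of any length becomes a single vector with one value per embedding dimension. The function name `global_average_pool` and the tiny 3x2 input are assumptions for the example; `GlobalAveragePooling1D` performs this kind of averaging over the sequence axis on batched tensors.

```python
def global_average_pool(embedded_review):
    """Average a list of equal-length vectors, one dimension at a time."""
    n_steps = len(embedded_review)
    n_dims = len(embedded_review[0])
    return [sum(vec[d] for vec in embedded_review) / n_steps for d in range(n_dims)]

# Three tokens, each embedded as a 2-dimensional vector (toy numbers).
embedded = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
pooled = global_average_pool(embedded)
print(pooled)  # [3.0, 4.0]
```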

In [None]:
model=keras.Sequential([keras.layers.Embedding(10000,16, input_length=300),
                        keras.layers.GlobalAveragePooling1D(),
                        keras.layers.Dense(16, activation='relu'),
                        keras.layers.Dense(1,activation='sigmoid')])
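
As a rough check on the size of this network, the trainable parameters can be counted by hand. The arithmetic below assumes the usual layouts: one weight per (vocabulary entry, embedding dimension) pair for the embedding, and weights plus biases for each dense layer.

```python
vocab_size, embedding_dim, hidden_units = 10000, 16, 16

embedding_params = vocab_size * embedding_dim                # lookup table
pooling_params = 0                                           # averaging has no weights
hidden_params = embedding_dim * hidden_units + hidden_units  # weights + biases
output_params = hidden_units * 1 + 1                         # weights + bias

total = embedding_params + pooling_params + hidden_params + output_params
print(total)  # 160289
```

Calling `model.summary()` prints per-layer parameter counts that can be compared against these numbers.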

### **COMPILING THE MODEL**
Adam is used as the optimizer for our model.

Binary cross-entropy is used as the loss function.

Accuracy is used as the metric for evaluating the model.

In [None]:
model.compile(optimizer="adam", loss="binary_crossentropy" , metrics=['accuracy'] )
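
For a single example with label `y` (0 or 1) and predicted probability `p`, binary cross-entropy is `-(y * log(p) + (1 - y) * log(1 - p))`. The short sketch below evaluates it for two hand-picked predictions; the function name and the clipping constant `eps` are assumptions for the example, not Keras internals.

```python
import math

def binary_cross_entropy(y_true, p_pred, eps=1e-7):
    """Loss for one example; eps keeps log() away from zero."""
    p = min(max(p_pred, eps), 1 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

# A positive review (label 1) predicted with high and with low probability.
confident_right = binary_cross_entropy(1, 0.9)
confident_wrong = binary_cross_entropy(1, 0.1)
print(round(confident_right, 3))  # 0.105
print(round(confident_wrong, 3))  # 2.303
```

The loss grows quickly as the predicted probability moves away from the true label, which is what pushes the sigmoid output toward 1 for positive reviews and toward 0 for negative ones.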

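Now we train the model. Note that validation_data here is the training set itself, so the validation metrics reported during training are computed on data the model has already seen and do not measure generalization.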


In [None]:
history = model.fit(train_data,train_labels, epochs = 30, batch_size = 512, validation_data=(train_data, train_labels))

Epoch 1/30 through Epoch 30/30 (per-epoch metrics not shown)
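
Because the `fit` call above passes the training set as `validation_data`, the reported validation metrics cannot reveal overfitting. One common alternative is to hold out part of the training data. The sketch below shows the idea with plain lists; the helper name `holdout_split` and the 20% fraction are assumptions, not choices made in the original notebook.

```python
def holdout_split(data, labels, val_fraction=0.2):
    """Split off the last val_fraction of the samples for validation."""
    n_val = int(len(data) * val_fraction)
    split = len(data) - n_val
    return (data[:split], labels[:split]), (data[split:], labels[split:])

# Five toy encoded reviews with their labels (hypothetical values).
toy_data = [[1, 14], [1, 22], [1, 16], [1, 43], [1, 530]]
toy_labels = [1, 0, 0, 1, 0]
(train_x, train_y), (val_x, val_y) = holdout_split(toy_data, toy_labels)
print(len(train_x), len(val_x))  # 4 1
```

With Keras, the held-out pair would then be passed as `validation_data=(val_x, val_y)`. `fit` also accepts a `validation_split` argument that holds out the last fraction of the samples.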


In [None]:
loss,accuracy = model.evaluate(test_data,test_labels)



### **TESTING**

In [None]:
index = np.random.randint(1,1000)
user_review = Test_reviews.loc[index]
print(user_review)


Reviews      <START journalist bob <UNK> <UNK> and sometime...
Sentiment                                             negative
Name: 633, dtype: object


Here we take the encoded test review at the same randomly generated index, store it in user_review1, wrap it in a batch of size one, and apply the model to predict its sentiment.

In [None]:
print(user_review)
user_review1 = test_data[index]
user_review1 = np.array([user_review1])  # add a batch dimension: shape (1, 300)
probability = model.predict(user_review1)[0][0]  # sigmoid output between 0 and 1
if probability > 0.5:
  print("Positive Sentiment")
else:
  print("Negative Sentiment")



Reviews      <START journalist bob <UNK> <UNK> and sometime...
Sentiment                                             negative
Name: 633, dtype: object
Negative Sentiment
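
The thresholding step can be isolated in a small helper. This is a sketch only: `label_from_probability` is a hypothetical name, and the probabilities are made-up values standing in for the output of `model.predict`.

```python
def label_from_probability(probability, threshold=0.5):
    """Turn a sigmoid output into a sentiment label."""
    return "positive" if probability > threshold else "negative"

print(label_from_probability(0.83))  # positive
print(label_from_probability(0.12))  # negative
```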
