<a href="https://colab.research.google.com/github/roht20/Portfolio/blob/master/Sentiment_Analysis_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Sentiment Analysis**

Let's import the necessary libraries and set a random seed so that the same splits and values can be reproduced when the code is rerun.

In [0]:
import numpy as np
np.random.seed(42)

In [176]:
import pandas as pd

df = pd.read_csv('labeledTrainData.tsv',header=0, delimiter="\t", quoting=3)

df.head()

Unnamed: 0,id,sentiment,review
0,"""5814_8""",1,"""With all this stuff going down at the moment ..."
1,"""2381_9""",1,"""\""The Classic War of the Worlds\"" by Timothy ..."
2,"""7759_3""",0,"""The film starts with a manager (Nicholas Bell..."
3,"""3630_4""",0,"""It must be assumed that those who praised thi..."
4,"""9495_8""",1,"""Superbly trashy and wondrously unpretentious ..."


In [177]:
df.shape

(25000, 3)

The dataset has 25,000 rows and 3 columns.

Let's split the data into training and testing sets in an 80:20 ratio.

In [0]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    df['review'],
    df['sentiment'],
    test_size=0.2, 
    random_state=42
)

In [179]:
X_train.shape

(20000,)
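To make the 80:20 idea concrete, here is a minimal NumPy sketch of a shuffled hold-out split. It is not scikit-learn's implementation, and the helper name `shuffle_split` is made up for illustration; it only shows that a fixed seed gives the same 20,000 / 5,000 partition on every run.

```python
import numpy as np

def shuffle_split(n_samples, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) for a shuffled hold-out split (hypothetical helper)."""
    rng = np.random.default_rng(seed)      # fixed seed -> same shuffle every run
    order = rng.permutation(n_samples)     # random ordering of the row indices
    n_test = int(round(n_samples * test_size))
    return order[n_test:], order[:n_test]

train_idx, test_idx = shuffle_split(25000)
print(len(train_idx), len(test_idx))  # 20000 5000
```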

Prepare Data

1. Convert reviews to number sequences using Tokenizer

In [0]:
from tensorflow.keras.preprocessing.text import Tokenizer

# Keep the 5000 most frequent words (vocabulary size)
top_words = 5000
t = Tokenizer(num_words=top_words)

# Fit the tokenizer on the training data only
t.fit_on_texts(X_train.tolist())

# Replace each word in a review with its index in the tokenizer's word index
X_train = t.texts_to_sequences(X_train.tolist())
X_test = t.texts_to_sequences(X_test.tolist())
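The idea behind the tokenizer step can be sketched in plain Python. This is a simplified stand-in, not the Keras `Tokenizer` itself (it skips punctuation filtering, and the helper names `fit_word_index` and `to_sequences` are made up here): words are ranked by frequency on the training text, the most frequent word gets index 1, and words that are unseen or ranked at or beyond `num_words` are dropped.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked = sorted(counts, key=lambda w: -counts[w])  # stable: ties keep first-seen order
    return {word: rank for rank, word in enumerate(ranked, start=1)}

def to_sequences(texts, word_index, num_words):
    """Map words to indices, dropping unseen words and indices >= num_words."""
    return [
        [word_index[w] for w in text.lower().split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]

train_texts = ["the movie was great", "the movie was bad", "the plot"]  # toy data
word_index = fit_word_index(train_texts)
print(word_index)
# {'the': 1, 'movie': 2, 'was': 3, 'great': 4, 'bad': 5, 'plot': 6}
print(to_sequences(["the plot was great fun"], word_index, num_words=4))
# [[1, 3]]
```

Because dropped words leave no placeholder, a converted review can be shorter than the original text.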

In [181]:
print('Length of review# 32 is: ', len(X_train[32]))
print('Length of review# 1208 is: ', len(X_train[1208]))

Length of review# 32 is:  317
Length of review# 1208 is:  117


**As seen above, the two randomly chosen documents differ in length, so let's standardize the length of every document using padding**

In [0]:
from tensorflow.keras.preprocessing import sequence

#Length for each review
max_review_length = 300

X_train = sequence.pad_sequences(X_train,maxlen=max_review_length,
                                 padding='post')

X_test = sequence.pad_sequences(X_test, maxlen=max_review_length, 
                                padding='post')

**Now comparing the length of the same documents as above, we should get exactly 300, as that's the max_review_length chosen (shorter reviews are padded, longer ones are truncated)**

In [183]:
print('Length of review# 32 is: ', len(X_train[32]))
print('Length of review# 1208 is: ', len(X_train[1208]))

Length of review# 32 is:  300
Length of review# 1208 is:  300
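A small NumPy sketch of what this padding step does. It is a stand-in for `pad_sequences`, with a made-up helper name `pad_post`, and it assumes the Keras default `truncating='pre'`, under which an overlong review (such as review 32 with 317 tokens) keeps its last `maxlen` tokens.

```python
import numpy as np

def pad_post(seqs, maxlen):
    """Pad with trailing zeros; overlong sequences keep their last maxlen items."""
    out = np.zeros((len(seqs), maxlen), dtype="int32")
    for row, seq in enumerate(seqs):
        kept = seq[-maxlen:]           # mimics truncating='pre' (assumed Keras default)
        out[row, :len(kept)] = kept    # mimics padding='post': zeros stay at the end
    return out

padded = pad_post([[5, 8, 2], [7, 1, 9, 4, 3, 6]], maxlen=4)
print(padded)
# [[5 8 2 0]
#  [9 4 3 6]]
```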


**Just to reconfirm below, the values have been padded with 0s at the end to make every review a standard size**

In [184]:
X_train[1208]

array([  11,   17,    6,    3,  977,    3,   62,    4,    3,  183,  251,
        311,    1,  317,    2,    9,   63,  585,   21,  622,   14,    1,
         17,   18,    9,    6,    3,   82,   62,   10,   13, 4142,   31,
         11,   19,    1,  112,    2,    1,   62,  117,   82,   10,   37,
         11,   19,   85,    9,    6,    3,  278,   62,   42,   68,    3,
        543,   12,   10,   13,   46,    2,   10,  229,  788,   15,    1,
         12,    6,  396,   85,   34,  485,    5,  127,  130,  111,   12,
         94,   10,  383,   12, 1441,   25,    5,   64,   11,   19,  318,
          1,  183,  657,    2,   31,    1,  845,  138,   36,   11,   19,
         11,   19,   22,   67, 1631,    1,   17, 1689,   36, 4960,   39,
         98,  143,   62,   12,  563,    8,    1,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,   

**It's time to build the graph**

Importing the required libraries

In [0]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dropout, Dense, Embedding, Flatten, LSTM

**Here I am choosing an embedding vector length of 50, i.e. each word is represented by 50 numbers**

In [0]:
embedding_vector_length = 50 

# Instantiating the sequential model
model = Sequential()

**Adding Embedding Layer**

In [0]:
model.add(
    Embedding(top_words+1, 
                    embedding_vector_length, 
                    input_length=max_review_length))

**The output from the Embedding layer is 3-dimensional: batch_size x max_review_length x embedding_vector_length. We need to flatten it for the Dense layer.**
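The shapes involved can be checked with a NumPy sketch. The embedding matrix and token ids below are random stand-ins, not the trained weights or the real reviews, and the batch size of 4 is arbitrary.

```python
import numpy as np

# Sizes taken from the text; batch is an arbitrary choice for the demo
vocab_size, embed_dim, seq_len, batch = 5001, 50, 300, 4

rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(vocab_size, embed_dim))     # stand-in for learned weights
token_ids = rng.integers(0, vocab_size, size=(batch, seq_len))  # stand-in for padded reviews

embedded = embedding_matrix[token_ids]    # one 50-number row looked up per token id
flattened = embedded.reshape(batch, -1)   # what flattening does to each sample

print(embedded.shape)   # (4, 300, 50)
print(flattened.shape)  # (4, 15000)
```

Each review therefore reaches the first Dense layer as a single vector of 300 x 50 = 15,000 numbers.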

In [0]:
#Flattening the output from Embedding layer to feed it into the Dense Layer
model.add(Flatten())

#Dense Layers - fully connected network with 4 hidden layers, each using the 'relu' activation function
model.add(Dense(200,activation='relu'))
model.add(Dense(100,activation='relu'))
model.add(Dense(60,activation='relu'))
model.add(Dense(30,activation='relu'))

#Output layer - a single unit with a sigmoid activation, which maps the output to a value between 0 and 1
model.add(Dense(1,activation='sigmoid'))

#Compile the model
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
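As a rough sanity check on the size of this network, the number of trainable parameters can be worked out by hand. The sketch below assumes each Dense layer has a bias term (the Keras default) and uses the layer sizes from the cells above.

```python
vocab_rows, embed_dim, seq_len = 5000 + 1, 50, 300
dense_units = [200, 100, 60, 30, 1]   # four hidden layers plus the output unit

embedding_params = vocab_rows * embed_dim

dense_params = []
fan_in = seq_len * embed_dim          # 15000 inputs after flattening
for units in dense_units:
    dense_params.append(fan_in * units + units)  # weights + biases
    fan_in = units

total_params = embedding_params + sum(dense_params)
print(embedding_params)  # 250050
print(dense_params)      # [3000200, 20100, 6060, 1830, 31]
print(total_params)      # 3278271
```

With roughly 3.3 million parameters and only 20,000 training reviews, most of them in the first Dense layer, the model has plenty of capacity to memorize the training set.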

**Finally let's fit the model**

In [189]:
model.fit(X_train,y_train,
          epochs=10,
          batch_size=128,
          shuffle=True, 
          validation_data=(X_test, y_test))

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 20000 samples, validate on 5000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7fd222e58080>

**With a training accuracy of 99.63% and a validation accuracy of 84.46%, the model fits the training data almost perfectly but generalizes noticeably worse; the gap of about 15 percentage points indicates overfitting. We can improve the model with techniques such as dropout, an LSTM layer, and pretrained word embeddings**
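Of the techniques mentioned, dropout is the easiest to illustrate in isolation. The NumPy sketch below shows the usual "inverted dropout" idea applied during training: a random fraction of activations is zeroed and the survivors are rescaled so the expected sum is unchanged. The helper `inverted_dropout` is a made-up name for illustration, not the Keras `Dropout` layer.

```python
import numpy as np

def inverted_dropout(activations, rate, rng):
    """Zero a random fraction `rate` of units and rescale the remaining ones."""
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob  # True where a unit is kept
    return activations * mask / keep_prob

rng = np.random.default_rng(42)
hidden = np.ones((2, 8))   # toy activations from a hidden layer
dropped = inverted_dropout(hidden, rate=0.5, rng=rng)
print(dropped)             # every entry is either 0.0 or 2.0
```

At prediction time no units are dropped, so the network sees the full set of activations.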

**Let's predict whether a particular document reflects positive or negative sentiment, using the default threshold of 0.5**

In [191]:
model.predict(X_test[0:2])

array([[1.3534251e-06],
       [9.9997747e-01]], dtype=float32)

**The first document reflects negative sentiment, as its value (about 0.000001) is below the threshold, while the second reflects positive sentiment, as its value (about 0.99998) is above it**
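Applying the 0.5 threshold to the two probabilities printed above can be sketched as follows. The values are copied from the `model.predict` output, and label 1 is taken to mean positive sentiment, as in the `sentiment` column.

```python
import numpy as np

# Probabilities copied from the model.predict output above
probs = np.array([[1.3534251e-06], [9.9997747e-01]], dtype="float32")
threshold = 0.5

labels = (probs >= threshold).astype(int).ravel()  # 1 = positive, 0 = negative
print(labels)  # [0 1]
```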