The objective of this project is to build a text classification model that analyses customer 
sentiment from reviews in the IMDB database. The model is a deep learning network: an embedding 
layer followed by a classifier that predicts whether each review is positive or negative.

The dataset consists of 50,000 movie reviews from IMDB, labelled by sentiment (positive/negative). 
Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers). For 
convenience, the words are indexed by their frequency in the dataset, meaning the word that has index 1 is the most 
frequent word. Use the first 20 words from each review to speed up training, using a max vocabulary size of 
10,000. As a convention, "0" does not stand for a specific word, but instead is used to encode any unknown word.
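
To make the frequency-based encoding concrete, here is a minimal sketch on a toy corpus. The corpus, the `encode` helper and the small vocabulary size are made up for illustration; this is not the Keras loader, only the convention described above (index 1 is the most frequent word, 0 stands for an unknown word).

```python
from collections import Counter

# Toy corpus standing in for the review text (hypothetical data, not IMDB).
corpus = [
    "the film was great",
    "the plot was thin",
    "the cast was great",
]

# Count how often each word occurs across the whole corpus.
counts = Counter(word for line in corpus for word in line.split())

# Rank words by descending frequency (ties broken alphabetically),
# starting at 1 so that 0 stays free as the "unknown word" value.
ranked = sorted(counts, key=lambda w: (-counts[w], w))
word_to_index = {word: rank for rank, word in enumerate(ranked, start=1)}

def encode(text, vocab_size):
    """Map each word to its rank; unseen or too-rare words become 0."""
    return [
        word_to_index[w] if word_to_index.get(w, vocab_size) < vocab_size else 0
        for w in text.split()
    ]

encoded = encode("the film was thin", vocab_size=4)
print(word_to_index)
print(encoded)  # [1, 0, 2, 0]
```

With `vocab_size=4` only ranks 1 to 3 survive, so "film" and "thin" are replaced by 0. A vocabulary cap of 10,000 applies the same idea at a larger scale.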

PROJECT OBJECTIVE: Build a sequential NLP classifier that uses the input review text to determine 
customer sentiment.

1. Import and analyse the data set. 
Hint: - Use `imdb.load_data()` method
 - Get train and test set
 - Take 10000 most frequent words


In [1]:
import tensorflow
from tensorflow.keras.datasets import imdb

vocab_size = 10000

(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words = vocab_size)



In [2]:
print("Shape of X_train:",X_train.shape)
print("Shape of X_test:",X_test.shape)

Shape of X_train: (25000,)
Shape of X_test: (25000,)


2. Perform relevant sequence padding on the data. 

In [3]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np
from sklearn.model_selection import train_test_split

# Pad (or truncate) every review to exactly 500 word indexes
maxlen=500

X_train = pad_sequences(X_train, maxlen = maxlen, padding = 'pre')
X_test =  pad_sequences(X_test, maxlen = maxlen, padding = 'pre')

X = np.concatenate((X_train, X_test), axis = 0)
y = np.concatenate((y_train, y_test), axis = 0)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2 )
X_train, X_validation, y_train, y_validation = train_test_split(X_train, y_train, test_size = 0.2)
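
As a rough illustration of what `padding = 'pre'` does, the sketch below re-implements the idea in plain Python. The helper name `pad_pre` is hypothetical. It assumes the default `truncating='pre'` behaviour of `pad_sequences`, where sequences longer than `maxlen` lose items from the front.

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items (maxlen >= 1)."""
    trimmed = sequence[-maxlen:]  # drops items from the front if too long
    return [value] * (maxlen - len(trimmed)) + trimmed

short = pad_pre([5, 8, 2], maxlen=5)
long = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)

print(short)  # [0, 0, 5, 8, 2]
print(long)   # [3, 4, 5, 6, 7]
```

Padding at the front keeps the real words at the end of the sequence, which is why the decoded reviews further down start with a run of zeros.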

3. Perform following data analysis: 

• Print shape of features and labels

In [4]:
print("Number of items in X_train:",len(X_train))
print("Number of unique words in X_train:",len(np.unique(np.hstack(X_train))))
print(" ")
print("Number of items in X_test:",len(X_test))
print("Number of unique words in X_test:",len(np.unique(np.hstack(X_test))))
print(" ")
print("Number of items in X_validation:",len(X_validation))
print("Number of unique words in X_validation:",len(np.unique(np.hstack(X_validation))))

Number of items in X_train: 32000
Number of unique words in X_train: 9999
 
Number of items in X_test: 10000
Number of unique words in X_test: 9997
 
Number of items in X_validation: 8000
Number of unique words in X_validation: 9988
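
The item counts above follow from the two successive 80/20 splits. A small arithmetic sketch, using integer maths only (variable names are illustrative):

```python
total = 25_000 + 25_000              # original train and test sets, concatenated

test = total * 20 // 100             # first split: 20% held out for testing
remaining = total - test
validation = remaining * 20 // 100   # second split: 20% of what is left
train = remaining - validation

print(train, validation, test)       # 32000 8000 10000
```

So the effective proportions are 64% train, 16% validation and 20% test.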


• Print value of any one feature and its label

In [39]:
print("Performed below")

Performed below


4. Decode the feature value to get original sentence

In [5]:
def decode_review(review_index):
    word_index = imdb.get_word_index()
    reverse_index = dict([(value, key) for (key, value) in word_index.items()]) 
    review_decoded = " ".join( [reverse_index.get(i - 3, "#") for i in X_test[review_index]] )

    print("The movie review index is:",review_index)
    print("The movie review is:",X_test[review_index])
    print("The movie review decoded is:", review_decoded) 
    print("The movie review sentiment as reported by reviewer is (0 for negative and 1 for positive):",y_test[review_index])
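
The `i - 3` in `decode_review` is there because `imdb.load_data()` shifts every word index by 3 by default (`index_from=3`) and reserves the low ids for padding (0), the start marker (1) and out-of-vocabulary words (2). Below is a minimal sketch with a made-up four-word index; the real one comes from `imdb.get_word_index()`.

```python
# Hypothetical miniature word index, for illustration only.
word_index = {"the": 1, "and": 2, "film": 3, "great": 4}
reverse_index = {rank: word for word, rank in word_index.items()}

INDEX_OFFSET = 3  # default `index_from` of imdb.load_data()

def decode(encoded, unknown="#"):
    """Turn an encoded review back into text; reserved ids map to `unknown`."""
    return " ".join(reverse_index.get(i - INDEX_OFFSET, unknown) for i in encoded)

# 0 = padding, 1 = start marker, 2 = out-of-vocabulary,
# 4 -> "the", 6 -> "film", 7 -> "great"
decoded = decode([0, 0, 1, 4, 6, 2, 7])
print(decoded)  # # # # the film # great
```

This is why a decoded review begins with a run of `#` characters: they are the padding zeros and the start marker, not missing words.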

In [6]:
decode_review(0)

The movie review index is: 0
The movie review is: [   0    0    0    0    0    0    0    0    0    1    4  236 4457    9
    6   22   15    9  981    8   30  253    5 1127   12 2880   33  399
  199   10   10    9    6  194  500 2243   37    2  194  500  140  822
    2    6 1169   29    9 3601   23  170   23   35 5144   19    6  604
    7   84  587    6 3852  773 5413    6  860 1656    6  232   37  495
   18   27 1169  773 3447    5   35 2045  232   37   47 6678   90   23
  111  773    2    2    4  213    9    8 2078   51    9 2417    8   30
    2 3593 4457   15  556    4  236 5144    8    4 1609   36   80   30
  397    8   14    2 1609  656    6    2 2196   15 4136 6354  446    4
 7879  103  397    8    4 1609   36  515  169    4    2   63  497    8
 1258   21   27 1056  214    2 3447 1241 5190   15  473    8 2335    4
 4457   32  367    8  763   12    8   27 1594    7 7764 1389  137   36
   71  245    2    2   68 1250    5  304    4 7879  245   39   68 1250
  725 3582    4  604 2503  

5. Design, train, tune and test a sequential NLP model.  

In [7]:
from tensorflow.keras.layers import Dense, Embedding, LSTM, Dropout, MaxPooling1D, Conv1D
from tensorflow.keras.models import Model, Sequential

# Conv1D layers extract local word patterns; a Long Short-Term Memory (LSTM) layer then summarises the sequence
# Model
model = Sequential()
model.add(Embedding(vocab_size, 256, input_length = maxlen))
model.add(Dropout(0.25))
model.add(Conv1D(256, 5, padding = 'same', activation = 'relu'))
model.add(Conv1D(128, 5, padding = 'same', activation = 'relu'))
model.add(MaxPooling1D(pool_size = 3))
model.add(Conv1D(64, 5, padding = 'same', activation = 'relu'))
model.add(MaxPooling1D(pool_size = 3))
model.add(LSTM(75))
model.add(Dense(1, activation = 'sigmoid'))
model.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy'])
print(model.summary())

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 500, 256)          2560000   
_________________________________________________________________
dropout (Dropout)            (None, 500, 256)          0         
_________________________________________________________________
conv1d (Conv1D)              (None, 500, 256)          327936    
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 500, 128)          163968    
_________________________________________________________________
max_pooling1d (MaxPooling1D) (None, 166, 128)          0         
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 166, 64)           41024     
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 55, 64)            0
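
The parameter counts and output lengths in the summary can be checked by hand. The sketch below recomputes them from the layer settings, assuming the Keras defaults (a bias term per filter, and pooling with stride equal to the pool size and no padding). The helper names are hypothetical.

```python
def conv1d_params(kernel_size, in_channels, filters):
    """Weights (kernel_size x in_channels x filters) plus one bias per filter."""
    return kernel_size * in_channels * filters + filters

def pooled_length(length, pool_size):
    """Output length of max pooling with stride = pool_size and no padding."""
    return (length - pool_size) // pool_size + 1

vocab_size, embed_dim, maxlen = 10_000, 256, 500

embedding = vocab_size * embed_dim   # one 256-dim vector per word index
conv1 = conv1d_params(5, 256, 256)
conv2 = conv1d_params(5, 256, 128)
conv3 = conv1d_params(5, 128, 64)

len_after_pool1 = pooled_length(maxlen, 3)
len_after_pool2 = pooled_length(len_after_pool1, 3)

print(embedding, conv1, conv2, conv3)    # 2560000 327936 163968 41024
print(len_after_pool1, len_after_pool2)  # 166 55
```

The embedding layer holds most of the weights, so reducing `vocab_size` or the embedding dimension is the most direct way to shrink the model.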

In [8]:
#Fit model
model.fit(X_train, y_train, validation_data = (X_validation, y_validation), epochs = 10, batch_size = 64, verbose = True)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x28d0d830c40>

In [9]:
from sklearn.metrics import classification_report

# predict_classes() is deprecated; for a sigmoid output, threshold the probabilities at 0.5 instead
y_prediction = (model.predict(X_test) > 0.5).astype("int32")
# Note: classification_report expects (y_true, y_pred); the report below was produced with the arguments in this order
print("Classification Report:", classification_report(y_prediction, y_test))

Classification Report:               precision    recall  f1-score   support

           0       0.89      0.90      0.89      4920
           1       0.90      0.89      0.90      5080

    accuracy                           0.90     10000
   macro avg       0.90      0.90      0.90     10000
weighted avg       0.90      0.90      0.90     10000
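
To show how the columns of such a report are derived, here is a minimal sketch that thresholds sigmoid outputs at 0.5 and computes precision, recall and accuracy for the positive class by hand. The probabilities and labels are invented for illustration and are not taken from the model above.

```python
# Hypothetical sigmoid outputs and true labels (1 = positive, 0 = negative).
probabilities = [0.91, 0.12, 0.55, 0.40, 0.78, 0.05]
y_true = [1, 0, 0, 1, 1, 0]

# Binary decision: anything above 0.5 counts as positive.
y_pred = [int(p > 0.5) for p in probabilities]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

precision = tp / (tp + fp)      # share of predicted positives that are correct
recall = tp / (tp + fn)         # share of actual positives that were found
accuracy = (tp + tn) / len(pairs)

print(y_pred)                   # [1, 0, 1, 0, 1, 0]
print(round(precision, 2), round(recall, 2), round(accuracy, 2))  # 0.67 0.67 0.67
```

Swapping the roles of `y_true` and `y_pred` swaps precision with recall and changes the support column, which is worth keeping in mind because scikit-learn's `classification_report` takes the true labels first.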



6. Use the designed model to print the prediction on any one sample.  

In [10]:
print('testing...')

import random
review_index = random.randrange(len(X_test))  # randint(0, 10000) could return 10000, which is out of range

decode_review(review_index)
print('Predicted sentiment of user by model:', y_prediction[review_index][0])

testing...
The movie review index is: 5797
The movie review is: [   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0   