# [IMDB Movie reviews sentiment classification](https://keras.io/datasets/)
Dataset of 25,000 movie reviews from IMDB, labeled by sentiment (positive/negative). Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers). For convenience, words are indexed by overall frequency in the dataset, so that for instance the integer "3" encodes the 3rd most frequent word in the data. This allows for quick filtering operations such as: "only consider the top 10,000 most common words, but eliminate the top 20 most common words".

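As a convention, "0" does not stand for a specific word. With the default `load_data` arguments shown below, 0 is reserved for padding, 1 marks the start of a sequence, and 2 (`oov_char`) encodes any unknown word.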
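
The frequency filter quoted above can be sketched in plain Python. This is a minimal illustration, not the Keras implementation: `filter_by_frequency` is a hypothetical helper, the toy sequence is made up, and it assumes that any index outside `[skip_top, num_words)` is replaced by the out-of-vocabulary value.

```python
def filter_by_frequency(sequence, num_words, skip_top, oov_char=2):
    """Replace indexes outside [skip_top, num_words) with oov_char (hypothetical helper)."""
    return [i if skip_top <= i < num_words else oov_char for i in sequence]


# Toy sequence of word indexes (illustrative values only).
toy_review = [25, 14, 22, 10311, 5, 9999]

# "Only consider the top 10,000 most common words, but eliminate the top 20."
filtered = filter_by_frequency(toy_review, num_words=10000, skip_top=20)
print(filtered)  # [25, 2, 22, 2, 2, 9999]
```

Because lower indexes mean more frequent words, both limits reduce to a simple range check on the integers.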

# using the dataset



```python
from keras.datasets import imdb

(x_train, y_train), (x_test, y_test) = imdb.load_data(path="imdb.npz",
                                                      num_words=None,
                                                      skip_top=0,
                                                      maxlen=None,
                                                      seed=113,
                                                      start_char=1,
                                                      oov_char=2,
                                                      index_from=3)
```


**Returns:**

- 2 tuples:
    - x_train, x_test: list of sequences, which are lists of indexes (integers). If the num_words argument was specified, the maximum possible index value is num_words-1. If the maxlen argument was specified, the largest possible sequence length is maxlen.
    - y_train, y_test: list of integer labels (1 or 0).

**Arguments**:

- **path**: if you do not have the data locally (at '~/.keras/datasets/' + path), it will be downloaded to this location.
- **num_words**: integer or None. Top most frequent words to consider. Any less frequent word will appear as oov_char value in the sequence data.
- **skip_top**: integer. Top most frequent words to ignore (they will appear as oov_char value in the sequence data).
- **maxlen**: int. Maximum sequence length. Any longer sequence is filtered out of the returned data.
- **seed**: int. Seed for reproducible data shuffling.
- **start_char**: int. The start of a sequence will be marked with this character. Set to 1 because 0 is usually the padding character.
- **oov_char**: int. Words that were cut out because of the num_words or skip_top limit will be replaced with this character.
- **index_from**: int. Index actual words with this index and higher.
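
How `start_char` and `index_from` combine can be shown with a small sketch. `encode_ranks` is a hypothetical helper written for illustration, and the three frequency ranks are toy values. It assumes each raw rank is shifted by `index_from` and the sequence is prefixed with `start_char`.

```python
def encode_ranks(ranks, start_char=1, index_from=3):
    """Shift raw frequency ranks by index_from and prepend start_char (hypothetical helper)."""
    return [start_char] + [rank + index_from for rank in ranks]


# Toy frequency ranks for a three-word review (illustrative values).
toy_ranks = [11, 19, 13]
encoded = encode_ranks(toy_ranks)
print(encoded)  # [1, 14, 22, 16]
```

With the defaults, indexes 0 to 2 never collide with a real word, which is why they are free to serve as padding, start, and unknown markers.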

In [0]:
%tensorflow_version 2.x

In [0]:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt

In [0]:
data= tf.keras.datasets.imdb

In [39]:
(train_data,train_labels),(test_data,test_labels) = data.load_data(num_words=80000)
for i in range(10):
    print(train_data[i])

[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 22665, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 21631, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 19193, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 10311, 8, 4, 107, 117, 5952, 15, 256, 4, 31050, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 12118, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]
[1, 194, 1153, 194, 8255, 78, 228, 5

# formatting and understanding data

In [0]:
word_index = data.get_word_index()
word_index = {k:v+3 for k,v in word_index.items()} # shift every index by 3 (the default index_from) so 0-3 are free for the special tokens below
word_index['<PAD>'] = 0
word_index['<START>'] = 1
word_index['<UNK>'] = 2
word_index['<UNUSED>'] = 3

In [41]:
print(list(word_index.items())[:10])

[('fawn', 34704), ('tsukino', 52009), ('nunnery', 52010), ('sonja', 16819), ('vani', 63954), ('woods', 1411), ('spiders', 16118), ('hanging', 2348), ('woody', 2292), ('trawling', 52011)]


In [42]:
word_index.get('br')

10

In [43]:
reverse_word_index = {v:k for k,v in word_index.items()}
print(reverse_word_index)



In [44]:
reverse_word_index.get(15,'?')

'that'

In [45]:
" ".join([reverse_word_index.get(i,"?") for i in train_data[0]][:200])

"<START> this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert redford's is an amazing actor and now the same being director norman's father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for retail and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also congratulations to the two little boy's that played the part's of norman and paul they were just brilliant children are often left out of the praising list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and sh

In [0]:
def decode_review(matrix):
    return " ".join([reverse_word_index.get(i,"?") for i in matrix])
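
To see why the shift by 3 matters when decoding, here is a self-contained toy version of the lookup tables built above. The three-word vocabulary and its ranks are illustrative assumptions, not the real IMDB index, and `toy_decode` is a hypothetical stand-in for `decode_review`.

```python
# Toy raw index: word -> frequency rank (illustrative values).
toy_raw_index = {"this": 11, "film": 19, "was": 13}

# Shift by 3 and reserve the low indexes for special tokens.
toy_word_index = {word: rank + 3 for word, rank in toy_raw_index.items()}
toy_word_index.update({"<PAD>": 0, "<START>": 1, "<UNK>": 2, "<UNUSED>": 3})

# Invert the mapping so integers can be turned back into words.
toy_reverse_index = {index: word for word, index in toy_word_index.items()}


def toy_decode(sequence):
    """Map each index back to its word, using '?' for indexes not in the table."""
    return " ".join(toy_reverse_index.get(i, "?") for i in sequence)


decoded = toy_decode([1, 14, 22, 16, 2, 0])
print(decoded)  # <START> this film was <UNK> <PAD>
```

Without the shift, index 14 would be looked up against rank 14 instead of rank 11 and every word would decode to the wrong neighbor.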

In [47]:
for i in range(10):
    print(decode_review(test_data[i]))

<START> please give this one a miss br br kristy swanson and the rest of the cast rendered terrible performances the show is flat flat flat br br i don't know how michael madison could have allowed this one on his plate he almost seemed to know this wasn't going to work out and his performance was quite lacklustre so all you madison fans give this a miss
<START> this film requires a lot of patience because it focuses on mood and character development the plot is very simple and many of the scenes take place on the same set in frances austen's the sandy dennis character apartment but the film builds to a disturbing climax br br the characters create an atmosphere rife with sexual tension and psychological trickery it's very interesting that robert altman directed this considering the style and structure of his other films still the trademark altman audio style is evident here and there i think what really makes this film work is the brilliant performance by sandy dennis it's definitely 

# adding a pad sequence to the reviews

In [48]:
print("train data size :",len(train_data),"test data size :",len(test_data))

train data size : 25000 test data size : 25000


In [49]:
print("review lengths differ")
for i in range(10):
    print(len(train_data[i]))

review lengths differ
218
189
141
550
147
43
123
562
233
130


In [0]:
train_data = tf.keras.preprocessing.sequence.pad_sequences(train_data,value=word_index["<PAD>"],padding="post",maxlen=256)
test_data = tf.keras.preprocessing.sequence.pad_sequences(test_data,value=word_index["<PAD>"],padding="post",maxlen=256)

In [51]:
print("review lengths are now the same")
for i in range(10):
    print(len(train_data[i]))

review lengths are now the same
256
256
256
256
256
256
256
256
256
256
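
The effect of `padding="post"` with `maxlen` can be sketched in NumPy. `pad_post` is a hypothetical helper, not the Keras function. It assumes the `pad_sequences` default of `truncating="pre"`, which means a review longer than `maxlen` keeps its last `maxlen` tokens and loses its beginning.

```python
import numpy as np


def pad_post(sequences, maxlen, value=0):
    """Pad at the end and truncate from the front (hypothetical helper)."""
    padded = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        kept = seq[-maxlen:]             # keep only the last maxlen tokens
        padded[row, :len(kept)] = kept   # tokens first, padding value after
    return padded


toy = pad_post([[1, 2, 3], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(toy)
# [[1 2 3 0]
#  [3 4 5 6]]
```

Note that the second toy sequence loses its start marker. If the opening of a long review matters, passing `truncating="post"` to `pad_sequences` would be one option to try.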


In [52]:
print(train_data[5]) # padding added

[    1   778   128    74    12   630   163    15     4  1766  7982  1051
 43222    32    85   156    45    40   148   139   121   664   665    10
    10  1361   173     4   749     2    16  3804     8     4   226    65
    12    43   127    24 15344    10    10     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0   

# creating a model

In [0]:
tf.keras.layers.Embedding?

## Global Average Pooling
*Global Average Pooling is an operation that calculates the average output of each feature map in the previous layer. This fairly simple operation reduces the data significantly and prepares the model for the final classification layer. It also has no trainable parameters – just like Max Pooling*

![GAP](https://i0.wp.com/adventuresinmachinelearning.com/wp-content/uploads/2019/05/Global-Average-Pooling-full-network.png?resize=1024%2C287&ssl=1)


[more details](https://adventuresinmachinelearning.com/global-average-pooling-convolutional-neural-networks/)
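
For the 1D case used in this notebook, global average pooling amounts to averaging the embedding vectors over the token axis. The NumPy sketch below illustrates that idea on a toy tensor. It is not the Keras layer itself, and the shapes and values are assumptions chosen for readability.

```python
import numpy as np

# Toy batch: 1 review, 3 tokens, 4-dimensional embeddings (illustrative values).
embedded = np.arange(12, dtype="float32").reshape(1, 3, 4)

# Average over the token axis: (batch, tokens, dim) -> (batch, dim).
pooled = embedded.mean(axis=1)

print(pooled.shape)  # (1, 4)
print(pooled)        # [[4. 5. 6. 7.]]
```

The result has a fixed size regardless of review length, which is what lets a dense layer follow it. One caveat: `Embedding` does not mask index 0 unless `mask_zero=True` is set, so with the defaults the `<PAD>` positions would be averaged in as well.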

In [0]:
tf.keras.layers.GlobalAveragePooling1D?

In [55]:
model = tf.keras.Sequential()
model.add(tf.keras.layers.Embedding(80000,16))
model.add(tf.keras.layers.GlobalAveragePooling1D())
model.add(tf.keras.layers.Dense(16,activation=tf.keras.activations.relu))
model.add(tf.keras.layers.Dense(1,activation=tf.keras.activations.sigmoid))
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, None, 16)          1280000   
_________________________________________________________________
global_average_pooling1d_1 ( (None, 16)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 16)                272       
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 17        
Total params: 1,280,289
Trainable params: 1,280,289
Non-trainable params: 0
_________________________________________________________________
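
The parameter counts in the summary can be checked by hand. The sketch below redoes the arithmetic with the layer sizes used above. The variable names are mine, introduced only for illustration.

```python
vocab_size, embedding_dim, hidden_units = 80000, 16, 16

# Embedding: one 16-dimensional vector per vocabulary entry.
embedding_params = vocab_size * embedding_dim

# Dense(16): a 16x16 weight matrix plus 16 biases.
hidden_params = embedding_dim * hidden_units + hidden_units

# Dense(1): 16 weights plus 1 bias.
output_params = hidden_units * 1 + 1

total_params = embedding_params + hidden_params + output_params
print(embedding_params, hidden_params, output_params, total_params)
# 1280000 272 17 1280289
```

Almost all of the parameters sit in the embedding table, so the vocabulary size is the main lever on model size here.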


In [0]:
model.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])
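
To make the loss and metric less abstract, here is a rough NumPy sketch of binary cross-entropy and thresholded accuracy. The helper functions and the toy labels and probabilities are assumptions for illustration. Keras computes these internally and may differ in details such as clipping.

```python
import numpy as np


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean negative log-likelihood of the true labels (illustrative helper)."""
    y_true = np.asarray(y_true, dtype="float64")
    y_pred = np.clip(np.asarray(y_pred, dtype="float64"), eps, 1 - eps)
    losses = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return float(np.mean(losses))


def accuracy(y_true, y_pred, threshold=0.5):
    """Fraction of predictions on the correct side of the threshold (illustrative helper)."""
    predicted = (np.asarray(y_pred) > threshold).astype(int)
    return float(np.mean(predicted == np.asarray(y_true)))


toy_labels = [1, 0, 1, 0]
toy_probs = [0.9, 0.2, 0.4, 0.1]
toy_loss = binary_crossentropy(toy_labels, toy_probs)
toy_acc = accuracy(toy_labels, toy_probs)
print(toy_loss, toy_acc)
```

The third toy prediction (0.4 for a positive label) is the only one on the wrong side of 0.5, so it costs accuracy and also contributes the largest share of the loss.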


In [57]:
x_val = train_data[:10000]
x_train = train_data[10000:]
print(x_val.shape,x_train.shape)

(10000, 256) (15000, 256)


In [58]:
y_val = train_labels[:10000]
y_train=  train_labels[10000:]
print(y_val.shape,y_train.shape)


(10000,) (15000,)


In [59]:
%load_ext tensorboard
from datetime import datetime
logdir= "logs/scalers/"+datetime.now().strftime('%Y%m%d-%H%M%S')
tb_call = tf.keras.callbacks.TensorBoard(log_dir=logdir)


The tensorboard extension is already loaded. To reload it, use:
  %reload_ext tensorboard


In [60]:
fitted_model = model.fit(x_train,y_train,
                         epochs=20,batch_size=512,
                         validation_data=(x_val,y_val),
                         callbacks=[tb_call]
                         )

Train on 15000 samples, validate on 10000 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [72]:
%tensorboard --logdir 'logs/scalers'

Output hidden; open in https://colab.research.google.com to view.

In [62]:
results = model.evaluate(test_data,test_labels)
print(results)

[0.3112873345088959, 0.87228]


In [63]:
for i in range(10):
    print('-'* 30)
    test_review = test_data[i]
    # predict expects a batch, so add a leading batch dimension: shape (1, 256)
    predict = model.predict(np.expand_dims(test_review, axis=0))
    print("review:")
    print(decode_review(test_review))
    print("prediction:",int(round(float(predict[0][0]))))
    print("actual:",test_labels[i])

------------------------------
review:
<START> please give this one a miss br br kristy swanson and the rest of the cast rendered terrible performances the show is flat flat flat br br i don't know how michael madison could have allowed this one on his plate he almost seemed to know this wasn't going to work out and his performance was quite lacklustre so all you madison fans give this a miss <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD

In [0]:
model.save('sentiment_model.h5')

In [0]:
lmodel = tf.keras.models.load_model('sentiment_model.h5')

In [0]:
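def review_encode(s, num_words=80000):
    # start with the <START> token, matching start_char=1 in load_data
    encoded = [1]
    for word in s.split():
        index = word_index.get(word.lower(), 2)  # 2 is <UNK>
        # indexes >= num_words fall outside the Embedding layer's vocabulary
        encoded.append(index if index < num_words else 2)
    return encoded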

In [0]:
review = """Ford V Ferrari is one of the best movies I've seen this year, 
and for someone who has little interest in cars besides minivans and SUVs,
that's saying a lot. Just like race cars produced by its namesakes, Ford v. 
Ferrari is sleek and fast a powerful and expensive machine.
A supremely well balanced combination of corporate rivalry, on track competitiveness and human drama."""

In [68]:
encoded_review = review_encode(review)
print(encoded_review)

[1, 2108, 1964, 14507, 9, 31, 7, 4, 118, 102, 207, 110, 14, 2, 5, 18, 294, 37, 47, 117, 602, 11, 1880, 1371, 2, 5, 2, 198, 660, 6, 2, 43, 40, 1522, 1880, 1055, 34, 94, 2, 2108, 2, 14507, 9, 18475, 5, 702, 6, 976, 5, 3269, 2, 6, 12765, 73, 6460, 2221, 7, 4452, 2, 23, 1406, 43558, 5, 406, 2]
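
Many of the 2s in the output above come from punctuation that stays attached to words when splitting on whitespace (for example `year,` or `lot.`). One possible refinement, sketched here under my own assumptions, is to extract word characters with a regular expression before the lookup. `tokenize`, `encode_text`, and the toy index are hypothetical and not part of the notebook.

```python
import re


def tokenize(text):
    """Lowercase the text and keep runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def encode_text(text, index, num_words, start_char=1, oov_char=2):
    """Encode text as indexes, mapping unknown or too-rare words to oov_char."""
    encoded = [start_char]
    for token in tokenize(text):
        i = index.get(token, oov_char)
        encoded.append(i if i < num_words else oov_char)
    return encoded


# Toy index with made-up values; "rare" sits outside an 80000-word vocabulary.
toy_index = {"this": 14, "year": 291, "lot": 176, "rare": 90000}

tokens = tokenize("this year, that's saying a lot.")
print(tokens)  # ['this', 'year', "that's", 'saying', 'a', 'lot']

encoded = encode_text("This year, a rare lot.", toy_index, num_words=80000)
print(encoded)  # [1, 14, 291, 2, 2, 176]
```

Whether this helps depends on how the original dataset was tokenized, so it is worth comparing predictions with and without it.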


In [69]:
# pad the encoded review to the same length used for training (256)
padded_review =  tf.keras.preprocessing.sequence.pad_sequences([encoded_review],
                                                               value=word_index["<PAD>"],
                                                               padding="post",maxlen=256)
print(padded_review)

[[    1  2108  1964 14507     9    31     7     4   118   102   207   110
     14     2     5    18   294    37    47   117   602    11  1880  1371
      2     5     2   198   660     6     2    43    40  1522  1880  1055
     34    94     2  2108     2 14507     9 18475     5   702     6   976
      5  3269     2     6 12765    73  6460  2221     7  4452     2    23
   1406 43558     5   406     2     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0 

In [70]:
prediction = model.predict(padded_review)
print(prediction)

[[0.6035935]]
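
The sigmoid output is a probability-like score between 0 and 1. A small sketch of turning it into a label follows, assuming that label 1 means a positive review and that 0.5 is a reasonable cut-off. `to_label` is a hypothetical helper.

```python
def to_label(probability, threshold=0.5):
    """Map a sigmoid score to a sentiment label (hypothetical helper)."""
    return "positive" if probability >= threshold else "negative"


score = 0.6035935  # value printed above
print(to_label(score))  # positive
```

A score of about 0.60 is only mildly confident, which is plausible given how many words of the review were encoded as unknown.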
