<a href="https://colab.research.google.com/github/raj963/NLTK/blob/master/SentimentClassification_RNNs_GloVe.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Sentiment Analysis of Reviews using RNNs in TensorFlow, with pre-built embeddings

Modified from original code here: https://github.com/adeshpande3/LSTM-Sentiment-Analysis/blob/master/Oriole%20LSTM.ipynb

In [0]:
%tensorflow_version 1.3.0
import tensorflow as tf
print(tf.__version__)

`%tensorflow_version` only switches the major version: 1.x or 2.x.
You set: `1.3.0`. This will be interpreted as: `1.x`.


TensorFlow is already loaded. Please restart the runtime to change versions.
1.15.2


In [0]:
from google.colab import drive
drive.mount('/content/drive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive


#### Some imports to make code compatible with Python 2 as well as 3

In [0]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

In [0]:
import collections
import math
import os
import random
import tarfile
import re

In [0]:
from six.moves import urllib

In [0]:
import numpy as np
import matplotlib as mp
import matplotlib.pyplot as plt
# import tensorflow as tf

In [0]:
print(np.__version__)
print(mp.__version__)
print(tf.__version__)
%tensorflow_version 1.3.0

1.16.4
2.2.4
1.15.2
`%tensorflow_version` only switches the major version: 1.x or 2.x.
You set: `1.3.0`. This will be interpreted as: `1.x`.


TensorFlow is already loaded. Please restart the runtime to change versions.


#### Download, unzip and untar files in an automated way

In [0]:
DOWNLOADED_FILENAME = 'ImdbReviews.tar.gz'

def download_file(url_path):
    if not os.path.exists(DOWNLOADED_FILENAME):
        filename, _ = urllib.request.urlretrieve(url_path, DOWNLOADED_FILENAME)

    print('Found and verified file from this path: ', url_path)
    print('Downloaded file: ', DOWNLOADED_FILENAME)

### Extract reviews and the corresponding positive and negative labels from the dataset

In [0]:
TOKEN_REGEX = re.compile("[^A-Za-z0-9 ]+")


def get_reviews(dirname, positive=True):
    label = 1 if positive else 0

    reviews = []
    labels = []
    for filename in os.listdir(dirname):
        if filename.endswith(".txt"):
            with open(dirname + filename, 'r+') as f:
                review = f.read().decode('utf-8')
                review = review.lower().replace("<br />", " ")
                review = re.sub(TOKEN_REGEX, '', review)
                
                reviews.append(review)
                labels.append(label)
    
    return reviews, labels           

def extract_labels_data():
    # If the file has not already been extracted
    if not os.path.exists('aclImdb'):
        with tarfile.open(DOWNLOADED_FILENAME) as tar:
            tar.extractall()
            tar.close()
        
    positive_reviews, positive_labels = get_reviews("aclImdb/train/pos/", positive=True)
    negative_reviews, negative_labels = get_reviews("aclImdb/train/neg/", positive=False)

    data = positive_reviews + negative_reviews
    labels = positive_labels + negative_labels

    return labels, data

In [0]:
URL_PATH = 'http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'

download_file(URL_PATH)

Found and verified file from this path:  http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Downloaded file:  ImdbReviews.tar.gz


In [0]:
labels, data = extract_labels_data()

In [0]:
labels[:5]

[1, 1, 1, 1, 1]

In [0]:
data[:5]

[u'this movie has been poorly received and badly reviewed the book by rebecca west was written in 1918 soon after wwi when shell shock and traumainduced amnesia were not clichs as the reviewers call it many books and movies later it is difficult to go back in time and live as the characters lived the realities of the time the war and the horror of the experience of the first war to use lethal gas the british class system the wife thought allimportant the hopeless spinster and the lover from the past still seen with the eyes of love being as young and as beautiful as she was 20 years ago  alan bates as the amnesiac soldier who will die if he isnt allowed to see margaret the girl of his youthful dreams builds on the devotion his character showed in far from the madding crowd having seen that performance it is possible to sense his strong romantic attachment to the girl who didnt live up to the familys and societys expectations margaret says we quarreled and as you rowed away you turned y

In [0]:
len(labels), len(data)

(25000, 25000)

In [0]:
max_document_length = max([len(x.split(" ")) for x in data])
print(max_document_length)

2470


### How many words to consider in each review?

Majority of the reviews fall under 250 words. This a number we've chosen based on some analysis of the data:

* Count the number of words in each file and divide by number of files to get an average i.e. **avg_words_per_file = total_words / num_files**
* Plot the words per file on matplot lib and try find a number which includes a majority of files

Word embeddings all have the same dimensionality which you can specify. A document is a vector of word embeddings (one dbpedia instance is a document in this case)

* Each document should be of the **same length**, documents longer than the MAX_SEQUENCE_LENGTH are truncated to this length
* The other documents will be **padded** by a special symbol to be the same max length

In [0]:
MAX_SEQUENCE_LENGTH = 250

### Use a pre-trained model for embeddings

Instead of training our model on our own dataset we will use a pre-trained model.

This is much better because these word vectors will be more generalized as they have been trained on a different dataset. These embeddings are trained using GloVe, a vector generation model very simalar to word2vec. 

In [0]:
words = np.load('wordsList.npy')

In [0]:
words[:5], len(words)

(array(['0', ',', '.', 'of', 'to'], dtype='|S68'), 400000)

### Map every word to a unique index

The words are in the order and the position of the word in the word list is its index.

In [0]:
def get_word_index_dictionary(words):
    
    dictionary = {}
    
    index = 0
    for word in words:
        dictionary[word] = index
        index += 1
    
    return dictionary

dictionary = get_word_index_dictionary(words)        

#### The most common words have lower index values

In [0]:
dictionary['and'], dictionary['this'], dictionary['together'], dictionary['supreme']

(5, 37, 600, 1399)

### Convert the sentences so they're represented in the form of word indexes

Use the word index mapping that we created earlier in order to look up the index for individual words

In [0]:
review_ids = []

def convert_reviews_to_ids(data, words):
    words_list = words.tolist()

    progress = 0
    for review in data:
        review_id = []
        
        index = 0
        for word in review:
            if index >= MAX_SEQUENCE_LENGTH:
                break;
            
            try:
                review_id.append(dictionary[word])
            except KeyError:
                review_id.append(0)
            
            index += 1
        if len(review_id) < MAX_SEQUENCE_LENGTH:
            review_id = np.pad(review_id, (0, MAX_SEQUENCE_LENGTH - index), 'constant')

        review_ids.append(np.array(review_id))
        progress += 1
        
        if progress % 1000 == 0:
            print("Completed: ", progress)

In [0]:
convert_reviews_to_ids(data, words)

Completed:  1000
Completed:  2000
Completed:  3000
Completed:  4000
Completed:  5000
Completed:  6000
Completed:  7000
Completed:  8000
Completed:  9000
Completed:  10000
Completed:  11000
Completed:  12000
Completed:  13000
Completed:  14000
Completed:  15000
Completed:  16000
Completed:  17000
Completed:  18000
Completed:  19000
Completed:  20000
Completed:  21000
Completed:  22000
Completed:  23000
Completed:  24000
Completed:  25000


In [0]:
review_ids[19825]

array([3410, 1911,    7, 3814, 2159, 1110, 1968,    0, 2159, 5918,   41,
       1534,    0, 1534, 1110, 1110, 1993, 1534,    0, 5025,   41, 4652,
       1110,    0,    7,    0, 3410, 4868, 4868, 1968,    0,   41, 1968,
       1110,    7,    0, 1534, 2159, 1110, 2404, 1110,    0, 1993,    7,
       1911, 2159,   41, 3814,    0, 3410, 4868, 5025, 1968,   41, 1110,
          0, 5918,    7, 5140, 3814,    0,    7, 3814, 1968,    0, 6891,
       4868, 5918, 3814,    0, 1864, 5025, 1110, 1110, 1534, 1110,    0,
         41, 3814,    0,    7,    0, 3814, 1110,   41, 5025,    0, 1534,
         41, 1993, 4868, 3814,    0, 1864, 4868, 1993, 1110, 1968, 3524,
          0, 5140, 5918, 1110, 1911, 1110,    0, 1864,    7, 3814,    0,
       3524, 4868, 6479,    0, 3410, 4868,    0, 5140, 1911, 4868, 3814,
       3410,    0, 5140,    7, 2159, 1864, 5918,    0, 2159, 5918, 1110,
          0, 1993, 4868, 2404,   41, 1110,    0,    7, 3814, 1968,    0,
       3524, 4868, 6479, 5025, 5025,    0, 3880,   

### Load this saved file to get the reviews in the IMDB dataset represented using word indexes

These have been pre-calculated and saved, and will help you if your id mapping code takes too long to run

In [0]:
review_ids = np.load('idsMatrix.npy')

In [0]:
review_ids.shape

(25000, 250)

In [0]:
review_ids[:5]

array([[174943,    152,     14, ...,      0,      0,      0],
       [ 26494,     46, 399999, ...,   2153,    144,      7],
       [  6520, 399999,     21, ...,      0,      0,      0],
       [    37,     14,   2407, ...,      0,      0,      0],
       [    37,     14,     36, ...,      0,      0,      0]], dtype=int32)

In [0]:
x_data = review_ids
y_output = np.array(labels)

vocabulary_size = len(words)
print(vocabulary_size)

400000


In [0]:
data[3:5]

[u'no this is not no alice fairy tale my friends this wonderland fable is based on the true story of the gruesome bloody wonderland murders that occurred back in 80s california at the center of this bloodbath was no other than johnny wad himself yes john holmes daddy dingdong used other shotguns than his infamous 13inch milk machine besides being a legendary adult film actor holmes was as also a hardcore drug addict who befriended various hollywood junkies val kilmer was occasionally majestic as holmes but for once this holmes character did not milk it through completely the film possesses a whos who of supporting players josh lucas  dylan mcdermott as hollywood riffraffs  kate bosworth  lisa kudrow as the women in holmes life and eric bogosian as a menacing tinsletown entrepreneur these characters do play integral parts directly or indirectly in the wonderland murders out of this support group it was josh lucas who was the most fierce  impressive as the ardent ron launius lucas is gra

In [0]:
x_data[3:5]

array([[    37,     14,   2407, 201534,     96,  37314,    319,   7158,
        201534,   6469,   8828,   1085,     47,   9703,     20,    260,
            36,    455,      7,   7284,   1139,      3,  26494,   2633,
           203,    197,   3941,  12739,    646,      7,   7284,   1139,
             3,  11990,   7792,     46,  12608,    646,      7,   7284,
          1139,      3,   8593,     81,  36381,    109,      3, 201534,
          8735,    807,   2983,     34,    149,     37,    319,     14,
           191,  31906,      6,      7,    179,    109,  15402,     32,
            36,      5,      4,   2933,     12,    138,      6,      7,
           523,     59,     77,      3, 201534,     96,   4246,  30006,
           235,      3,    908,     14,   4702,   4571,     47,     36,
        201534,   6429,    691,     34,     47,     36,  35404,    900,
           192,     91,   4499,     14,     12,   6469,    189,     33,
          1784,   1318,   1726,      6, 201534,    410,     41, 

In [0]:
y_output[:5]

array([1, 1, 1, 1, 1])

#### Shuffle the data so the training instances are randomly fed to the RNN

In [0]:
np.random.seed(22)
shuffle_indices = np.random.permutation(np.arange(len(x_data)))

x_shuffled = x_data[shuffle_indices]
y_shuffled = y_output[shuffle_indices]

In [0]:
TRAIN_DATA = 5000
TOTAL_DATA = 6000

train_data = x_shuffled[:TRAIN_DATA]
train_target = y_shuffled[:TRAIN_DATA]

test_data = x_shuffled[TRAIN_DATA:TOTAL_DATA]
test_target = y_shuffled[TRAIN_DATA:TOTAL_DATA]

In [0]:
from tensorflow.keras.models import Sequential

In [0]:
# import tensorflow.compat.v1 as tf
# tf.disable_v2_behavior()

W0329 19:35:12.563184 140190949189504 deprecation.py:323] From /usr/local/lib/python2.7/dist-packages/tensorflow_core/python/compat/v2_compat.py:88: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term


In [0]:
tf.reset_default_graph()

x = tf.placeholder(tf.int32, [None, MAX_SEQUENCE_LENGTH])
y = tf.placeholder(tf.int32, [None])

In [0]:
batch_size = 25
embedding_size = 50
max_label = 2

### Embeddings to represent words

These embeddings have been pre-built using GloVe a word vector embedding algorithm just like word2vec. The matrix will contain 400,000 word vectors, each with a dimensionality of 50.

* *saved_embeddings* This is a matrix which holds the embeddings for every word in the vocabulary. The values have been pre-loaded and were generated using the GloVe algorithm
* *embeddings* The embeddings for the words which are input as a part of one training batch

In [0]:
saved_embeddings = np.load('wordVectors.npy')
embeddings = tf.nn.embedding_lookup(saved_embeddings, x)

In [0]:
saved_embeddings

array([[ 0.       ,  0.       ,  0.       , ...,  0.       ,  0.       ,
         0.       ],
       [ 0.013441 ,  0.23682  , -0.16899  , ..., -0.56657  ,  0.044691 ,
         0.30392  ],
       [ 0.15164  ,  0.30177  , -0.16763  , ..., -0.35652  ,  0.016413 ,
         0.10216  ],
       ...,
       [-0.51181  ,  0.058706 ,  1.0913   , ..., -0.25003  , -1.125    ,
         1.5863   ],
       [-0.75898  , -0.47426  ,  0.4737   , ...,  0.78954  , -0.014116 ,
         0.6448   ],
       [-0.79149  ,  0.86617  ,  0.11998  , ..., -0.29996  , -0.0063003,
         0.3954   ]], dtype=float32)

In [0]:
embeddings

<tf.Tensor 'embedding_lookup/Identity:0' shape=(?, 250, 50) dtype=float32>

In [0]:
# resolver = tf.contrib.cluster_resolver.TPUClusterResolver('grpc://' + os.environ['COLAB_TPU_ADDR'])
# tf.contrib.distribute.initialize_tpu_system(resolver)
# strategy = tf.contrib.distribute.TPUStrategy(resolver)

W0329 19:58:04.787231 139956296599424 lazy_loader.py:50] 
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

W0329 19:58:06.962148 139956296599424 module_wrapper.py:139] From /tensorflow-1.15.2/python2.7/tensorflow_estimator/python/estimator/api/_v1/estimator/__init__.py:12: The name tf.estimator.inputs is deprecated. Please use tf.compat.v1.estimator.inputs instead.



KeyError: ignored

In [0]:
lstmCell = tf.contrib.rnn.BasicLSTMCell(embedding_size)
lstmCell = tf.contrib.rnn.DropoutWrapper(cell=lstmCell, output_keep_prob=0.75)

W0329 19:58:25.448853 139956296599424 deprecation.py:323] From <ipython-input-41-41dee28f93a4>:1: __init__ (from tensorflow.python.ops.rnn_cell_impl) is deprecated and will be removed in a future version.
Instructions for updating:
This class is equivalent as tf.keras.layers.LSTMCell, and will be replaced by that in Tensorflow 2.0.


### Results from an RNN of LSTM cells

(ouput, (**final_state**, other_state_info))

We're interested in the final state of this RNN because those are the encodings we feed into the prediction layer of our neural network

In [0]:
_, (encoding, _) = tf.nn.dynamic_rnn(lstmCell, embeddings, dtype=tf.float32)

W0329 19:58:31.582591 139956296599424 deprecation.py:323] From <ipython-input-42-828d6e9e5bdf>:1: dynamic_rnn (from tensorflow.python.ops.rnn) is deprecated and will be removed in a future version.
Instructions for updating:
Please use `keras.layers.RNN(cell)`, which is equivalent to this API
W0329 19:58:31.689204 139956296599424 deprecation.py:323] From /tensorflow-1.15.2/python2.7/tensorflow_core/python/ops/rnn_cell_impl.py:735: add_variable (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version.
Instructions for updating:
Please use `layer.add_weight` method instead.
W0329 19:58:31.702788 139956296599424 deprecation.py:506] From /tensorflow-1.15.2/python2.7/tensorflow_core/python/ops/rnn_cell_impl.py:739: calling __init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constru

In [0]:
encoding

<tf.Tensor 'rnn/while/Exit_3:0' shape=(?, 50) dtype=float32>

#### A densely connected prediction layer

* *activation=None* because the activation will be part of the tf.nn.sparse_softmax_cross_entropy_with_logits
* *cross_entropy* the loss function for probability distributions
* *max_label* the number of outputs of the prediction layer, here is 2, positive or negative

In [0]:
logits = tf.layers.dense(encoding, max_label, activation=None)

W0329 19:58:37.929779 139956296599424 deprecation.py:323] From <ipython-input-44-30584d302e79>:1: dense (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.Dense instead.
W0329 19:58:37.932073 139956296599424 deprecation.py:323] From /tensorflow-1.15.2/python2.7/tensorflow_core/python/layers/core.py:187: apply (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version.
Instructions for updating:
Please use `layer.__call__` method instead.


In [0]:
cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=y)
loss = tf.reduce_mean(cross_entropy)

#### Find the output with the highest probability and compare against the true label

In [0]:
prediction = tf.equal(tf.argmax(logits, 1), tf.cast(y, tf.int64))
accuracy = tf.reduce_mean(tf.cast(prediction, tf.float32))

In [0]:
optimizer = tf.train.AdamOptimizer(0.01)
train_step = optimizer.minimize(loss)

In [0]:
num_epochs = 20

In [0]:
init = tf.global_variables_initializer()

In [0]:
with tf.Session() as session:
    init.run()
    
    for epoch in range(num_epochs):
        
        num_batches = int(len(train_data) // batch_size) + 1
        
        for i in range(num_batches):
            # Select train data
            min_ix = i * batch_size
            max_ix = np.min([len(train_data), ((i+1) * batch_size)])

            x_train_batch = train_data[min_ix:max_ix]
            y_train_batch = train_target[min_ix:max_ix]
            
            train_dict = {x: x_train_batch, y: y_train_batch}
            
            
            session.run(train_step, feed_dict=train_dict)
            
            train_loss, train_acc = session.run([loss, accuracy], feed_dict=train_dict)

        test_dict = {x: test_data, y: test_target}
        test_loss, test_acc = session.run([loss, accuracy], feed_dict=test_dict)    
        print('Epoch: {}, Test Loss: {:.2}, Test Acc: {:.5}'.format(epoch + 1, test_loss, test_acc)) 
        saver = tf.train.Saver()
        save_path = saver.save(session, "TessorFlowModel.ckpt")
        print("Model saved in path: %s" % save_path)

Epoch: 1, Test Loss: 0.7, Test Acc: 0.494
Model saved in path: TessorFlowModel.ckpt
Epoch: 2, Test Loss: 0.68, Test Acc: 0.544
Model saved in path: TessorFlowModel.ckpt
Epoch: 3, Test Loss: 0.74, Test Acc: 0.489
Model saved in path: TessorFlowModel.ckpt
Epoch: 4, Test Loss: 0.72, Test Acc: 0.49
Model saved in path: TessorFlowModel.ckpt
Epoch: 5, Test Loss: 0.71, Test Acc: 0.485
Model saved in path: TessorFlowModel.ckpt
Epoch: 6, Test Loss: 0.7, Test Acc: 0.506
Model saved in path: TessorFlowModel.ckpt
Epoch: 7, Test Loss: 0.71, Test Acc: 0.516
Model saved in path: TessorFlowModel.ckpt
Epoch: 8, Test Loss: 0.69, Test Acc: 0.543
Model saved in path: TessorFlowModel.ckpt
Epoch: 9, Test Loss: 0.67, Test Acc: 0.54
Model saved in path: TessorFlowModel.ckpt
Epoch: 10, Test Loss: 0.67, Test Acc: 0.547
Model saved in path: TessorFlowModel.ckpt
Epoch: 11, Test Loss: 0.68, Test Acc: 0.55
Model saved in path: TessorFlowModel.ckpt
Epoch: 12, Test Loss: 0.68, Test Acc: 0.549
Model saved in path: Tes

In [0]:
saver = tf.train.Saver()
save_path = saver.save(session, "TessorFlowModel.ckpt")
print("Model saved in path: %s" % save_path)

ERROR! Session/line number was not unique in database. History logging moved to new session 60


RuntimeError: ignored