## Imports

In [0]:
%matplotlib inline
import matplotlib.pyplot as plt
import tensorflow as tf
import numpy as np
from scipy.spatial.distance import cdist

We need to import several things from Keras.

In [0]:
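# `from tf.keras.models import Sequential` does not work, because `tf` is only
# an alias created by `import tensorflow as tf` and not an importable package name.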
from tensorflow.python.keras.models import Sequential
from tensorflow.python.keras.layers import Dense, GRU, Embedding
from tensorflow.python.keras.optimizers import Adam
from tensorflow.python.keras.preprocessing.text import Tokenizer
from tensorflow.python.keras.preprocessing.sequence import pad_sequences

In [0]:
import imdb

Change this if you want the files saved in another directory.

In [0]:
# imdb.data_dir = "data/IMDB/"

Automatically download and extract the files.

In [0]:
imdb.maybe_download_and_extract()

- Download progress: 100.0%
Download finished. Extracting files.
Done.


Load the training- and test-sets.

In [0]:
x_train_text, y_train = imdb.load_data(train=True)
x_test_text, y_test = imdb.load_data(train=False)

In [0]:
print("Train-set size: ", len(x_train_text))
print("Test-set size:  ", len(x_test_text))

Train-set size:  25000
Test-set size:   25000


Combine into one data-set for some uses below.

In [0]:
data_text = x_train_text + x_test_text

Print the class label of an example from the training-set to check that the data looks correct. A label of 1.0 denotes a positive sentiment.

In [0]:
y_train[1]

1.0

## Tokenizer

A neural network cannot work directly on text-strings, so we must convert the text into numbers. There are two steps in this conversion. The first step is the "tokenizer", which converts words to integers and is applied to the data-set before it is input to the neural network. The second step is the "embedding"-layer, which is an integrated part of the neural network itself and is described further below.

We may instruct the tokenizer to use only e.g. the 10000 most frequent words in the data-set.

In [0]:
num_words = 10000

In [0]:
tokenizer = Tokenizer(num_words=num_words)

In [0]:
%%time
tokenizer.fit_on_texts(data_text)

CPU times: user 9.47 s, sys: 82.5 ms, total: 9.55 s
Wall time: 9.58 s


In [0]:
if num_words is None:
    num_words = len(tokenizer.word_index)

In [0]:
# tokenizer.word_index
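
To make the tokenizer step more concrete, here is a minimal pure-Python sketch of the idea: words are ranked by how often they occur, and each word gets an integer index, with the most frequent word getting index 1. This is a simplified illustration and not the Keras implementation. The names `fit_word_index` and `texts_to_token_lists`, the regular expression used to split words, the cutoff rule and the toy texts are all assumptions made for this example.

```python
from collections import Counter
import re

def split_words(text):
    # Assumed, simplified word splitting: lower-case, keep letters, digits and apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())

def fit_word_index(texts):
    # Count how often each word occurs across all texts.
    counts = Counter()
    for text in texts:
        counts.update(split_words(text))
    # Rank words by descending count; index 0 is left free for padding.
    return {word: rank + 1 for rank, (word, _) in enumerate(counts.most_common())}

def texts_to_token_lists(texts, word_index, num_words=None):
    # Convert each text to a list of integers, dropping unknown or rare words.
    result = []
    for text in texts:
        tokens = []
        for word in split_words(text):
            index = word_index.get(word)
            # Assumed cutoff rule: keep only indices below num_words.
            if index is not None and (num_words is None or index < num_words):
                tokens.append(index)
        result.append(tokens)
    return result

toy_texts = ["good movie good acting", "bad movie"]
toy_index = fit_word_index(toy_texts)
toy_tokens = texts_to_token_lists(toy_texts, toy_index, num_words=3)
print(toy_index)
print(toy_tokens)
```

In this toy run the rare words "acting" and "bad" are dropped from the token lists, which is the same effect that limiting `num_words` has on the movie reviews.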

We can then use the tokenizer to convert all texts in the training-set to lists of these tokens.

In [0]:
x_train_tokens = tokenizer.texts_to_sequences(x_train_text)

For example, here is a text from the training-set:

In [0]:
# x_train_text[1]

'hello. i just watched this movie earlier today for the 14th time in 3 days. i am a history teacher that has wayyyyy too much time on my hands. i need a life. i found the movie containing a striking resemblance to broke back mountain. i also found that i look a lot like jean Lafitte if he were white. also, my favorite line in the entire movie was from Mr. Petey--"this baby can shoot a chipmunk\'s eye from 300 yards!!" oh, and my favorite scene in the movie was when the British were coming in, and the one drummer who was so devoted to his work, and he drummed till the death, as if that drum would end the war altogether....but it wouldn\'t. well, thats all i would like to say about this movie. OH, one more thing..bonnie brown is an insane physco bipolar mood swinging BEEYOTCH. that is all.'

We also need to convert the texts in the test-set to tokens.

In [0]:
x_test_tokens = tokenizer.texts_to_sequences(x_test_text)

## Padding and Truncating Data



In [0]:
num_tokens = [len(tokens) for tokens in x_train_tokens + x_test_tokens]
num_tokens = np.array(num_tokens)

The average number of tokens in a sequence is:

In [0]:
np.mean(num_tokens)

221.27716

The maximum number of tokens in a sequence is:

In [0]:
np.max(num_tokens)

2208

In [0]:
max_tokens = np.mean(num_tokens) + 2 * np.std(num_tokens)
max_tokens = int(max_tokens)
max_tokens

544

In [0]:
np.sum(num_tokens < max_tokens) / len(num_tokens)

0.94528
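
The cutoff above is the mean sequence length plus two standard deviations, which covers about 95% of the texts. The same calculation can be sketched on a handful of made-up lengths using only the standard library. The list `toy_lengths` is an assumption for illustration; `statistics.pstdev` is the population standard deviation, which is also what `np.std` computes by default.

```python
import statistics

# Hypothetical sequence lengths, made up for this example.
toy_lengths = [2, 4, 4, 4, 5, 5, 7, 9]

toy_mean = statistics.mean(toy_lengths)
toy_std = statistics.pstdev(toy_lengths)

# Same rule as above: mean plus two standard deviations, rounded down.
toy_max_tokens = int(toy_mean + 2 * toy_std)

# Fraction of sequences that are strictly shorter than the cutoff.
toy_coverage = sum(n < toy_max_tokens for n in toy_lengths) / len(toy_lengths)

print(toy_max_tokens, toy_coverage)
```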

In [0]:
pad = 'pre'

In [0]:
x_train_pad = pad_sequences(x_train_tokens, maxlen=max_tokens,
                            padding=pad, truncating=pad)

In [0]:
x_test_pad = pad_sequences(x_test_tokens, maxlen=max_tokens,
                           padding=pad, truncating=pad)

We have now transformed the training-set into one big matrix of integers (tokens) with this shape:

In [0]:
x_train_pad.shape

(25000, 544)

The matrix for the test-set has the same shape:

In [0]:
x_test_pad.shape

(25000, 544)
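
The effect of `padding` and `truncating` can be sketched in plain Python. The function `pad_or_truncate` below is a hypothetical helper written for this example and not part of Keras; it assumes the same mode is used for both padding and truncating, as in the cells above. With 'pre', zeros are added at the start of short sequences and tokens are dropped from the start of long sequences.

```python
def pad_or_truncate(tokens, maxlen, mode="pre"):
    # Long sequences: 'pre' drops tokens from the start, 'post' from the end.
    if len(tokens) >= maxlen:
        return tokens[-maxlen:] if mode == "pre" else tokens[:maxlen]
    # Short sequences: 'pre' adds zeros at the start, 'post' at the end.
    zeros = [0] * (maxlen - len(tokens))
    return zeros + tokens if mode == "pre" else tokens + zeros

print(pad_or_truncate([5, 6, 7], maxlen=5, mode="pre"))
print(pad_or_truncate([5, 6, 7], maxlen=5, mode="post"))
print(pad_or_truncate([1, 2, 3, 4, 5, 6], maxlen=4, mode="pre"))
print(pad_or_truncate([1, 2, 3, 4, 5, 6], maxlen=4, mode="post"))
```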

The token sequence for the example text above has simply been padded with zeros to create the following sequence. Note that when this is input to the Recurrent Neural Network, it first inputs a lot of zeros and then the integer-tokens. If we had padded 'post' then it would input the integer-tokens first and then a lot of zeros, which may confuse the Recurrent Neural Network.

In [0]:
# x_train_pad[1]

array([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,   

## Tokenizer Inverse Map



In [0]:
idx = tokenizer.word_index
inverse_map = dict(zip(idx.values(), idx.keys()))

Helper-function for converting a list of tokens back to a string of words.

In [0]:
def tokens_to_string(tokens):
    # Map from tokens back to words.
    words = [inverse_map[token] for token in tokens if token != 0]
    
    # Concatenate all words.
    text = " ".join(words)

    return text
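
The inverse mapping can be tried on a tiny vocabulary to see how padding zeros are skipped. The dictionary `toy_word_index` and the function `toy_tokens_to_string` are made up for this sketch and only mirror the helper-function above.

```python
# Hypothetical three-word vocabulary, in the same word -> integer format.
toy_word_index = {"the": 1, "movie": 2, "good": 3}

# Swap keys and values to map integers back to words.
toy_inverse_map = {token: word for word, token in toy_word_index.items()}

def toy_tokens_to_string(tokens):
    # Token 0 is padding and has no word, so it is skipped.
    return " ".join(toy_inverse_map[token] for token in tokens if token != 0)

toy_text = toy_tokens_to_string([0, 0, 1, 3, 2])
print(toy_text)
```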

For example, this is the original text from the data-set:

In [0]:
x_train_text[1]

'hello. i just watched this movie earlier today for the 14th time in 3 days. i am a history teacher that has wayyyyy too much time on my hands. i need a life. i found the movie containing a striking resemblance to broke back mountain. i also found that i look a lot like jean Lafitte if he were white. also, my favorite line in the entire movie was from Mr. Petey--"this baby can shoot a chipmunk\'s eye from 300 yards!!" oh, and my favorite scene in the movie was when the British were coming in, and the one drummer who was so devoted to his work, and he drummed till the death, as if that drum would end the war altogether....but it wouldn\'t. well, thats all i would like to say about this movie. OH, one more thing..bonnie brown is an insane physco bipolar mood swinging BEEYOTCH. that is all.'

We can recreate this text, except for punctuation, other symbols, and words that fall outside the `num_words` most frequent words, by converting the list of tokens back to words:

In [0]:
tokens_to_string(x_train_tokens[1])

"hello i just watched this movie earlier today for the time in 3 days i am a history teacher that has too much time on my hands i need a life i found the movie containing a striking resemblance to broke back mountain i also found that i look a lot like jean if he were white also my favorite line in the entire movie was from mr this baby can shoot a eye from 300 oh and my favorite scene in the movie was when the british were coming in and the one who was so devoted to his work and he till the death as if that drum would end the war altogether but it wouldn't well thats all i would like to say about this movie oh one more thing bonnie brown is an insane mood swinging that is all"

## Create the Recurrent Neural Network

We are now ready to create the Recurrent Neural Network (RNN). We will use the Keras API for this because of its simplicity. See Tutorial #03-C for a tutorial on Keras.

In [0]:
model = Sequential()

In [0]:
embedding_size = 8

In [0]:
model.add(Embedding(input_dim=num_words,
                    output_dim=embedding_size,
                    input_length=max_tokens,
                    name='layer_embedding'))

Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
Instructions for updating:
If using Keras pass *_constraint arguments to layers.


In [0]:
model.add(GRU(units=16, return_sequences=True))

In [0]:
model.add(GRU(units=8, return_sequences=True))

This adds the third and final GRU with 4 output units. This will be followed by a dense-layer, so it should only give the final output of the GRU and not a whole sequence of outputs.

In [0]:
model.add(GRU(units=4))
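
The difference between returning a whole sequence and returning only the final output can be illustrated with a toy recurrence. Note that this is not a GRU: `toy_recurrent_layer` is a made-up function with a single state value and fixed weights, and it only shows what the `return_sequences` flag changes about the output.

```python
import math

def toy_recurrent_layer(inputs, return_sequences=False):
    # A single state value is updated once for every time-step.
    state = 0.0
    outputs = []
    for x in inputs:
        state = math.tanh(0.5 * state + x)
        outputs.append(state)
    # Either return the output of every time-step or only the last one.
    return outputs if return_sequences else outputs[-1]

toy_inputs = [0.0, 0.0, 1.0, -0.5]
all_steps = toy_recurrent_layer(toy_inputs, return_sequences=True)
last_step = toy_recurrent_layer(toy_inputs)
print(len(all_steps), last_step)
```

A recurrent layer that feeds another recurrent layer needs the whole sequence, while the layer before the dense-layer only needs the last value.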

Add a fully-connected / dense layer which computes a value between 0.0 and 1.0 that will be used as the classification output.

In [0]:
model.add(Dense(1, activation='sigmoid'))

Use the Adam optimizer with the given learning-rate.

In [0]:
optimizer = Adam(lr=1e-3)

Compile the Keras model so it is ready for training.

In [0]:
model.compile(loss='binary_crossentropy',
              optimizer=optimizer,
              metrics=['accuracy'])

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where


In [0]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
layer_embedding (Embedding)  (None, 544, 8)            80000     
_________________________________________________________________
gru (GRU)                    (None, 544, 16)           1200      
_________________________________________________________________
gru_1 (GRU)                  (None, 544, 8)            600       
_________________________________________________________________
gru_2 (GRU)                  (None, 4)                 156       
_________________________________________________________________
dense (Dense)                (None, 1)                 5         
Total params: 81,961
Trainable params: 81,961
Non-trainable params: 0
_________________________________________________________________
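
The parameter counts in the summary can be checked by hand. The sketch below assumes that each GRU has three sets of weights (two gates and the candidate state), each with input weights, recurrent weights and a single bias vector. That assumption reproduces the numbers above; a GRU variant with separate input and recurrent biases would give larger counts. The function names are made up for this example.

```python
def embedding_params(input_dim, output_dim):
    # One vector of length output_dim per word in the vocabulary.
    return input_dim * output_dim

def gru_params(input_dim, units):
    # Three weight sets, each with input weights, recurrent weights and one bias.
    return 3 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit.
    return input_dim * units + units

counts = [
    embedding_params(10000, 8),  # layer_embedding
    gru_params(8, 16),           # gru
    gru_params(16, 8),           # gru_1
    gru_params(8, 4),            # gru_2
    dense_params(4, 1),          # dense
]
total = sum(counts)
print(counts, total)
```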


## Train the Recurrent Neural Network

We can now train the model. Note that we are using the data-set with the padded sequences. We use 5% of the training-set as a small validation-set, so we have a rough idea whether the model is generalizing well or if it is perhaps over-fitting to the training-set.

In [0]:
%%time
model.fit(x_train_pad, y_train,
          validation_split=0.05, epochs=3, batch_size=64)

Train on 23750 samples, validate on 1250 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3
CPU times: user 26min 27s, sys: 1min 11s, total: 27min 39s
Wall time: 14min 58s


<tensorflow.python.keras.callbacks.History at 0x7f86c5e52080>

## Performance on Test-Set

Now that the model has been trained we can calculate its classification accuracy on the test-set.

In [0]:
%%time
result = model.evaluate(x_test_pad, y_test)

CPU times: user 1min 29s, sys: 1.35 s, total: 1min 30s
Wall time: 1min 24s


In [0]:
print("Accuracy: {0:.2%}".format(result[1]))

Accuracy: 85.57%


## Example of Mis-Classified Text

In order to show an example of mis-classified text, we first calculate the predicted sentiment for the first 1000 texts in the test-set.

In [0]:
%%time
y_pred = model.predict(x=x_test_pad[0:1000])
y_pred = y_pred.T[0]

CPU times: user 7.96 s, sys: 128 ms, total: 8.09 s
Wall time: 4.29 s


In [0]:
cls_pred = np.array([1.0 if p>0.5 else 0.0 for p in y_pred])

In [0]:
cls_true = np.array(y_test[0:1000])

In [0]:
incorrect = np.where(cls_pred != cls_true)
incorrect = incorrect[0]

Of the 1000 texts used, how many were mis-classified?

In [0]:
len(incorrect)

45
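
The steps above (thresholding the predictions at 0.5, comparing them with the true classes, and collecting the indices that differ) can be sketched on a few made-up values without NumPy. The lists `toy_pred` and `toy_true` are assumptions for illustration.

```python
# Hypothetical model outputs and true classes.
toy_pred = [0.91, 0.18, 0.55, 0.40]
toy_true = [1.0, 1.0, 0.0, 0.0]

# Outputs above 0.5 are classified as positive (1.0), otherwise negative (0.0).
toy_cls = [1.0 if p > 0.5 else 0.0 for p in toy_pred]

# Indices where the predicted class differs from the true class.
toy_incorrect = [i for i, (c, t) in enumerate(zip(toy_cls, toy_true)) if c != t]

# Classification accuracy is the fraction of correctly classified texts.
toy_accuracy = 1 - len(toy_incorrect) / len(toy_true)
print(toy_incorrect, toy_accuracy)
```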

Let us look at the first mis-classified text. We will use its index several times.

In [0]:
idx = incorrect[0]
idx

16

The mis-classified text is:

In [0]:
text = x_test_text[idx]
text

"The title role of this western is played by Robert Walker, Jr. He's a young gun who with partner David Carradine gets separated after doing a contract hit on a Mexican general. In eluding their pursuers Carradine and Walker become separated. Walker comes upon the camp of lawman Robert Mitchum who takes a liking to Walker and makes him a protégé and reclamation project of sorts.<br /><br />This is the first of two films Robert Mitchum did with writer/director Burt Kennedy. The second was the more humorous The Good Guys and the Bad Guys. <br /><br />Not that Young Billy Young does not have its moments of hilarity. But it is a tripartite story involving the Walker reclamation, Mitchum's hunt for the bad who killed his son and a romantic triangle involving Mitchum, Angie Dickinson, and town boss Jack Kelly.<br /><br />The film abounds with nepotism. David Carradine is John's son. Dean Martin's daughter Deana is in this, Walker is the son of Robert Walker and Jennifer Jones and Mitchum's s

These are the predicted and true classes for the text:

In [0]:
y_pred[idx]

0.1878289

In [0]:
cls_true[idx]

1.0

## New Data

Let us try to classify new texts that we make up. Some of these are obvious, while others use negation and sarcasm to try to confuse the model into mis-classifying the text.

In [0]:
text1 = "This movie is fantastic! I really like it because it is so good!"
text2 = "Good movie!"
text3 = "Maybe I like this movie."
text4 = "Meh ..."
text5 = "If I were a drunk teenager then this movie might be good."
text6 = "Bad movie!"
text7 = "Not a good movie!"
text8 = "This movie really sucks! Can I get my money back please?"
texts = [text1, text2, text3, text4, text5, text6, text7, text8]

We first convert these texts to arrays of integer-tokens because that is needed by the model.

In [0]:
tokens = tokenizer.texts_to_sequences(texts)

To input texts with different lengths into the model, we also need to pad and truncate them.

In [78]:
tokens_pad = pad_sequences(tokens, maxlen=max_tokens,
                           padding=pad, truncating=pad)
tokens_pad.shape

(8, 544)

We can now use the trained model to predict the sentiment for these texts.

In [0]:
model.predict(tokens_pad)

array([[0.868934  ],
       [0.72526425],
       [0.33099633],
       [0.49190348],
       [0.3054021 ],
       [0.14959489],
       [0.5235635 ],
       [0.21565402]], dtype=float32)

A value close to 0.0 means a negative sentiment and a value close to 1.0 means a positive sentiment. These numbers will vary every time you train the model.
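
To turn these outputs into class labels we can apply the same 0.5 threshold that was used for the test-set. The sketch below copies the predicted values printed above into a plain list; as noted, the numbers will differ for another training run, and the label strings are chosen just for this example.

```python
# Predicted values copied from the output above, in the order text1 to text8.
predictions = [0.868934, 0.72526425, 0.33099633, 0.49190348,
               0.3054021, 0.14959489, 0.5235635, 0.21565402]

# Values above 0.5 are labelled positive, the rest negative.
labels = ["positive" if p > 0.5 else "negative" for p in predictions]

for number, (p, label) in enumerate(zip(predictions, labels), start=1):
    print("text{0}: {1:.3f} -> {2}".format(number, p, label))
```

With this threshold, `text7` ("Not a good movie!") lands just above 0.5 and is labelled positive, which shows how negation can mislead the model.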

## Twitter Calls


In [0]:

import tweepy

# Fill in your own Twitter API credentials.
consumer_key = ''
consumer_secret = ''

access_token = ''
access_token_secret = ''

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

api = tweepy.API(auth)


In [0]:
print('Enter the hashtag term')
searchTerm = input()
numb = 10  # Maximum number of tweets to fetch.
# Note: Tweepy 4.x renamed `api.search` to `api.search_tweets`.
tweets = tweepy.Cursor(api.search, q=searchTerm).items(numb)

Enter the hashtag term
trump


In [0]:
listTweet = []
for val in tweets:
    # Collect the tweet text so it can be tokenized below.
    listTweet.append(str(val.text))
    print(str(val.text))

In [0]:
tokens = tokenizer.texts_to_sequences(listTweet)
tokens_pad = pad_sequences(tokens, maxlen=max_tokens,
                           padding=pad, truncating=pad)
tokens_pad.shape

In [0]:
model.predict(tokens_pad)