
# Activity 11.14.3

Recurrent Neural Networks (RNNs) are a type of neural network that is particularly good at processing sequential data such as language and text. The [IMDB large movie review dataset](https://ai.stanford.edu/~amaas/data/sentiment/) contains 25,000 highly polar movie reviews for training and 25,000 for testing.

In this activity, you will build an RNN that classifies the reviews as positive or negative. As the neural networks we build become more complex, the specification of each step becomes more abstract, so in this activity you will run more pre-made code and write less of your own.

## Step 1: Import the necessary packages and functions
**Note**: Keras runs on top of the larger TensorFlow machine learning package in Python.



In [None]:
#Step 1
import numpy as np
import tensorflow_datasets as tfds
import tensorflow as tf
tfds.disable_progress_bar()
import matplotlib.pyplot as plt





## Step 2: Import the IMDB review data
* Run the following code block to load the data, which is included as part of TensorFlow Datasets.
* `train_dataset` and `test_dataset` contain the training and testing data. Each review is labeled 0 if it is negative and 1 if it is positive.


In [None]:
#Step 2

dataset, info = tfds.load('imdb_reviews', with_info=True,
                          as_supervised=True)
train_dataset, test_dataset = dataset['train'], dataset['test']

train_dataset.element_spec

Downloading and preparing dataset Unknown size (download: Unknown size, generated: Unknown size, total: Unknown size) to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0...
Dataset imdb_reviews downloaded and prepared to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0. Subsequent calls will reuse this data.


(TensorSpec(shape=(), dtype=tf.string, name=None),
 TensorSpec(shape=(), dtype=tf.int64, name=None))

## Step 3: Print examples of reviews and labels
* Run the following code block to print 6 reviews.
* Take a few minutes to read the text to get a sense of what the reviews in this dataset are like.

In [None]:
#Step 3

for example, label in train_dataset.take(6):
  print('text: ', example.numpy())
  print('label: ', label.numpy())

text:  b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it."
label:  0
text:  b'I have been known to fall asleep during films, but this is usually due to a combination of things including, really tired, being warm and comfortable on the sette and having just eaten a lot. However on this occasion I fell asleep because the film was rubbish. 

## Step 4: Format the data to speed up model training
* Run the following code block to format the data in a way that allows the model to *prefetch* elements of the dataset, so that loading data and training can happen at the same time.
* For example, while the training step is running on batch 1, the input pipeline can already read in the data for batch 2, and so on.
* Now that we are dealing with very large datasets, it is increasingly important to think about ways to speed up the model-building process.
* This step also shuffles the training data and groups the (text, label) pairs into batches of 64. You can see the first three pairs of one batch printed at the end.

In [None]:
#Step 4

BUFFER_SIZE = 10000
BATCH_SIZE = 64

train_dataset = train_dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
test_dataset = test_dataset.batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)

for example, label in train_dataset.take(1):
  print('texts: ', example.numpy()[:3])
  print()
  print('labels: ', label.numpy()[:3])

texts:  [b'Being born in the 1960\'s I grew up watching the TV "Movies of the Week" in the early 70\'s and loved the creepy movies that were routinely shown including "Crowhaven Farm", "Bad Ronald", "Satan\'s School for Girls", "Kolchak the Night Stalker", etc, but this one is just plain dumb.This is obviously the writer\'s trying to capitalize on the horrific Manson murders from a few years earlier. The movie stars Dennis Weaver of "McCloud" and "Duel" fame as a father who takes his family camping on a beach. The family encounters some hippies who for some reason decide to terrorize the family. The reason for this is never explained, and Weaver\'s pacifistic stance is hard to swallow. For God\'s sake, call the police, beat the hell of them or something, just don\'t sit there and whine about it. The acting is pretty lame, the story unbelievable, etc. Susan Dey looks cute in a bikini but that\'s about it. Ignore this if it ever airs on TV.'
 b'This is hands down the worst movie I can ev
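
To make the shuffling and batching in Step 4 more concrete, here is a minimal pure-Python sketch. The helper `shuffled_batches` and its `seed` parameter are hypothetical names used only for illustration; they are not part of `tf.data`. Unlike `tf.data`, this toy version shuffles the whole list at once (rather than through a fixed-size buffer) and does no prefetching.

```python
import random

def shuffled_batches(pairs, batch_size, seed=0):
    """Shuffle (text, label) pairs, then yield them in batches (toy helper)."""
    rng = random.Random(seed)      # fixed seed so the sketch is repeatable
    shuffled = list(pairs)         # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    for start in range(0, len(shuffled), batch_size):
        # The last batch may be smaller than batch_size.
        yield shuffled[start:start + batch_size]

# Made-up reviews, only to show the shape of the output.
reviews = [("great film", 1), ("awful plot", 0), ("loved it", 1),
           ("boring", 0), ("a classic", 1)]

for batch in shuffled_batches(reviews, batch_size=2):
    texts = [text for text, _ in batch]
    labels = [label for _, label in batch]
    print(texts, labels)
```

Each printed line corresponds to one batch, in the same spirit as the `texts` and `labels` arrays printed by the Step 4 cell.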

## Step 5: Convert the text from words to numeric values
* This should seem familiar from the natural language processing modules. Because machine learning models can only work with numbers, not text strings, we need to convert each word to a corresponding number.
* We are limiting our vocabulary size to 1,000 entries, which includes the padding and unknown tokens. Words that don't make it into the vocabulary are coded as `[UNK]` for unknown.

In [None]:
#Step 5
VOCAB_SIZE = 1000
encoder = tf.keras.layers.TextVectorization(
    max_tokens=VOCAB_SIZE)
encoder.adapt(train_dataset.map(lambda text, label: text))

Instructions for updating:
Lambda fuctions will be no more assumed to be used in the statement where they are used, or at least in the same block. https://github.com/tensorflow/tensorflow/issues/56089
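
As a rough picture of what the encoder in Step 5 does, the following sketch builds a small frequency-sorted vocabulary and maps words to indices in plain Python. The functions `tokenize`, `build_vocab`, and `encode` are hypothetical helpers written for this illustration, and the tokenization rule (lowercase, drop punctuation, split on whitespace) only approximates the default behavior of `TextVectorization`.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase, drop punctuation, and split on whitespace (an approximation)."""
    return re.sub(r"[^\w\s]", "", text.lower()).split()

def build_vocab(texts, max_tokens):
    """Return a vocabulary: padding token, unknown token, then frequent words."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    # Two slots are reserved: index 0 for padding and index 1 for unknown words.
    frequent = [word for word, _ in counts.most_common(max_tokens - 2)]
    return ['', '[UNK]'] + frequent

def encode(text, vocab):
    """Map each word to its vocabulary index, using 1 for unknown words."""
    index = {word: i for i, word in enumerate(vocab)}
    return [index.get(word, 1) for word in tokenize(text)]

# Made-up training texts, only for illustration.
texts = ["The movie was great", "the movie was bad", "the end"]
vocab = build_vocab(texts, max_tokens=5)
print(vocab)
print(encode("The movie was terrible!", vocab))
```

With such a small `max_tokens`, rare words fall outside the vocabulary and are encoded as index 1, which is the same idea as the `[UNK]` token in the real encoder.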


## Step 6: Print the vocabulary
* Run the following code block to print the top 20 words in the vocabulary.
* The first two entries are padding and unknown words.  After that, the words are sorted by frequency.  What is the most common word?

In [None]:
vocab = np.array(encoder.get_vocabulary())
vocab[:20]

array(['', '[UNK]', 'the', 'and', 'a', 'of', 'to', 'is', 'in', 'it', 'i',
       'this', 'that', 'br', 'was', 'as', 'for', 'with', 'movie', 'but'],
      dtype='<U14')

### Answer:

## Step 7: Print samples of the text vectorized movie reviews
* Run the following code block to visualize 5 movie reviews converted from text to numeric values for each word.
* Because the movie reviews are of different lengths, the encoder pads the ends of shorter reviews with 0s (zeros) so that all of the encoded reviews have the same length.

In [None]:
#Step 7

encoded_example = encoder(example)[:5].numpy()
encoded_example

array([[107,   1,   8, ...,   0,   0,   0],
       [ 11,   7, 952, ...,   0,   0,   0],
       [ 11,   7,   4, ...,   0,   0,   0],
       [ 11,  18,   7, ...,   0,   0,   0],
       [ 90,  76,  70, ...,   0,   0,   0]])
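
The zero padding described in Step 7 can be sketched in a few lines of plain Python. `pad_sequences` here is a hypothetical helper written for this illustration, not the Keras utility of the same name, and the index values are made up.

```python
def pad_sequences(sequences, pad_value=0):
    """Right-pad every sequence with pad_value up to the longest length."""
    longest = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (longest - len(seq)) for seq in sequences]

# Three encoded reviews of different lengths (made-up indices).
encoded = [[11, 7, 952], [90, 76], [5]]
for row in pad_sequences(encoded):
    print(row)
```

Index 0 is reserved for padding, which is why the `Embedding` layer in Step 8 sets `mask_zero=True`: later layers can then ignore the padded positions.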

## Step 8: Specify the RNN
* Run the following code block to build the neural network.
* For this neural network, we are going to specify a `Sequential` model with five layers.
* The first layer is the encoder that we specified above, which converts the words in the reviews into corresponding numbers.
* The first hidden layer is an embedding layer. An embedding layer stores one vector per word. This layer can be trained so that words with similar meanings will have similar vectors.
* The next layer is a bidirectional wrapper around an LSTM, which is a type of RNN layer. The wrapper processes each review both forward and backward and then concatenates the two final outputs.
* The final two dense layers do some final processing and convert the vector produced by the RNN into a single classification output.
* The final layer has no activation function, so the model returns a raw score (a logit) rather than a probability. If the score is greater than or equal to 0, the review is classified as positive. If the score is negative, the review is classified as negative.




In [None]:
#Step 8

model = tf.keras.Sequential([
    encoder,
    tf.keras.layers.Embedding(
        input_dim=len(encoder.get_vocabulary()),
        output_dim=64,
        mask_zero=True),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1)
])
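
The idea behind the bidirectional wrapper can be illustrated with a toy recurrence in plain Python. This is only a sketch under simplifying assumptions: it uses a one-unit `tanh` recurrence instead of an LSTM, the weights `w_in` and `w_rec` are arbitrary fixed values rather than trained ones, and both directions share the same weights, whereas Keras trains a separate layer for each direction.

```python
import math

def simple_rnn(inputs, w_in=0.5, w_rec=0.8):
    """Run a one-unit tanh recurrence and return the final hidden state."""
    h = 0.0
    for x in inputs:
        # The new state depends on the current input and the previous state.
        h = math.tanh(w_in * x + w_rec * h)
    return h

def bidirectional(inputs):
    """Read the sequence in both directions and concatenate the final states."""
    forward = simple_rnn(inputs)
    backward = simple_rnn(list(reversed(inputs)))
    return [forward, backward]

sequence = [1.0, 0.0, -1.0]
print(bidirectional(sequence))
```

The two numbers differ because the order in which the inputs arrive changes the final state, which is the extra information a bidirectional layer passes on to the dense layers.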

## Step 9: Compile and fit the neural network
* Run the following code to compile and fit the RNN.
* This will take a little while!

In [None]:
#Step 9

model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer=tf.keras.optimizers.Adam(1e-4),
              metrics=['accuracy'])

history = model.fit(train_dataset, epochs=10,
                    validation_data=test_dataset,
                    validation_steps=30)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
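
Step 9 compiles the model with `BinaryCrossentropy(from_logits=True)`, meaning the loss is computed directly from the raw scores. As a minimal sketch of that calculation for a single example, the hypothetical helper `bce_from_logit` below uses a standard numerically stable form of binary cross-entropy; it is an illustration of the formula, not the Keras implementation.

```python
import math

def bce_from_logit(logit, label):
    """Binary cross-entropy for one example, computed from a raw score."""
    # Equivalent to -[y*log(sigmoid(z)) + (1-y)*log(1 - sigmoid(z))],
    # rearranged so that exp() never receives a large positive argument.
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

# A confident correct score gives a small loss, a confident wrong score a large one.
print(bce_from_logit(3.0, 1))
print(bce_from_logit(3.0, 0))
```

A score of 0 corresponds to complete uncertainty, so its loss is the same for either label.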


## Step 10: Predict if a movie review is positive or negative
* Run the following code block to predict if a movie review is positive or negative.  
* Remember, an output value that is negative means the review is negative and an output value that is positive means the review is positive.


In [None]:
#Step 10

sample_text = ('This is a terrible movie.  You could not pay me a million dollars to watch it again.')
predictions = model.predict(np.array([sample_text]))

print(predictions)

[[-1.3418542]]
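
Because the model returns a raw score rather than a probability, it can help to pass the score through a sigmoid function to read it as a probability of a positive review. The sketch below does this in plain Python for the score printed above; `sigmoid` and `describe` are hypothetical helpers written for this illustration.

```python
import math

def sigmoid(logit):
    """Convert a raw score to a value between 0 and 1."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    # For negative scores, use an equivalent form that avoids overflow.
    e = math.exp(logit)
    return e / (1.0 + e)

def describe(logit):
    """Return the predicted label and the probability of a positive review."""
    label = 'positive' if logit >= 0 else 'negative'
    return label, sigmoid(logit)

label, probability = describe(-1.3418542)
print(label, round(probability, 3))
```

A score of exactly 0 maps to a probability of 0.5, which is why 0 is the dividing line between positive and negative predictions.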


## Step 11: Submit your own movie review
* Modify the sample text with your own sample movie review.  See if it is positive or negative.  
* Can you confuse the RNN and get it to return the wrong prediction?


In [None]:
#Step 11 - For example:

sample_text = ('Plane is, in essence, the Frontier Airlines of action films: It’s cut-rate to a fault, makes you endure a lot of unpleasantness on the way to its final destination, and still leaves you with the distinct feeling that you didn’t even get what you paid for.')
predictions = model.predict(np.array([sample_text]))

print(predictions)

[[2.778026]]


## Step 12: Evaluate the model accuracy on the training data
Run the following code block to calculate the accuracy on the training data.


In [None]:
train_loss, train_acc = model.evaluate(train_dataset)

print('Training Accuracy', train_acc)

Training Accuracy 0.88264000415802
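
Accuracy is the fraction of reviews whose predicted label matches the true label. The sketch below computes it in plain Python, assuming the rule described in this activity (a score of 0 or more means positive). The helper `accuracy_from_logits` and the sample scores are hypothetical, and this is not necessarily identical to how the Keras `accuracy` metric thresholds the model outputs.

```python
def accuracy_from_logits(logits, labels, threshold=0.0):
    """Fraction of examples whose thresholded score matches the label."""
    predictions = [1 if z >= threshold else 0 for z in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Made-up scores and labels: the second review is misclassified.
scores = [-1.3, 2.8, 0.4, -0.2]
labels = [0, 0, 1, 0]
print(accuracy_from_logits(scores, labels))
```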


## Step 13: Evaluate the model accuracy on the testing data
Run the following code block to calculate the accuracy on the testing data.


In [None]:
test_loss, test_acc = model.evaluate(test_dataset)

print('Test Accuracy', test_acc)

  1/391 [..............................] - ETA: 2:14 - loss: 0.3805 - accuracy: 0.8125

KeyboardInterrupt: ignored