## Recurrent Neural Networks

In this assignment, we build a recurrent neural network (RNN) and train it to classify text data.

In [None]:
import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd

In [None]:
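# `error_bad_lines` was removed in pandas 2.0; `on_bad_lines='warn'` (pandas >= 1.3)
# skips malformed rows and warns about each one.
yelp = pd.read_csv('https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/yelp_labeled.csv', on_bad_lines='warn')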

b'Skipping line 281: expected 2 fields, saw 3\nSkipping line 290: expected 2 fields, saw 3\nSkipping line 296: expected 2 fields, saw 3\nSkipping line 322: expected 2 fields, saw 3\nSkipping line 373: expected 2 fields, saw 3\nSkipping line 417: expected 2 fields, saw 3\nSkipping line 427: expected 2 fields, saw 3\nSkipping line 429: expected 2 fields, saw 3\nSkipping line 577: expected 2 fields, saw 3\nSkipping line 578: expected 2 fields, saw 3\nSkipping line 611: expected 2 fields, saw 3\nSkipping line 677: expected 2 fields, saw 3\nSkipping line 771: expected 2 fields, saw 3\nSkipping line 930: expected 2 fields, saw 3\nSkipping line 979: expected 2 fields, saw 4\nSkipping line 980: expected 2 fields, saw 3\n'


In [None]:
yelp.head()

Unnamed: 0,text,sentiment
0,Wow... Loved this place.,1
1,Crust is not good.,0
2,Not tasty and the texture was just nasty.,0
3,Stopped by during the late May bank holiday of...,1
4,The selection on the menu was great and so wer...,1


In [None]:
yelp.sentiment.value_counts()

1    494
0    482
Name: sentiment, dtype: int64

We have loaded a Yelp review dataset above. Positive reviews are labeled 1 and negative reviews are labeled 0. The two classes are nearly balanced (494 positive, 482 negative).

In [None]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
from nltk.corpus import stopwords
import re
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

def remove_stopwords(input_text):
    stopwords_list = stopwords.words('english')
    # Some words which might indicate a certain sentiment are kept via a whitelist
    whitelist = ["n't", "not", "no"]
    words = input_text.split()
    clean_words = [word for word in words if (word not in stopwords_list or word in whitelist) and len(word) > 1]
    return " ".join(clean_words)

def stem_list(word_list):
    stemmed = []
    for word in word_list:
        stemmedword = stemmer.stem(word)
        stemmed.append(stemmedword)
    return stemmed

def normalize(terms):
    terms = terms.lower()
    terms = remove_stopwords(terms)
    word_delimiters = u'[\\[\\]\n.!?,;:\t\\-\\"\\(\\)\\\'\u2019\u2013 ]'
    term_list = re.split(word_delimiters, terms)
    trimmed = [x.rstrip() for x in term_list]
    stemmed = stem_list(trimmed)
    space = ' '
    normed = space.join(stemmed)
    normed = normed.replace('  ', ' ')
    return normed

The code block above defines functions to remove stopwords, stem words, and normalize text (lowercase it, split on punctuation, and trim whitespace). Apply the `normalize` function to every Yelp review and assign the normalized text to a new column.

In [None]:
# Answer below:
yelp['normalized'] = [normalize(t) for t in yelp.text]
yelp['normalized']

0                                       wow  love place 
1                                        crust not good 
2                                not tasti textur nasti 
3      stop late may bank holiday rick steve recommen...
4                               select menu great price 
                             ...                        
971                       think food flavor textur lack 
972                              appetit instantli gone 
973                overal not impress would not go back 
974    whole experi underwhelm think we ll go ninja s...
975    wast enough life pour salt wound draw time too...
Name: normalized, Length: 976, dtype: object
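
The whitelist in `remove_stopwords` is what keeps "not good" distinguishable from "good" after filtering. Here is a minimal sketch of that idea, using a tiny hand-written stopword set (`TOY_STOPWORDS`, a hypothetical stand-in for the NLTK English list) so that it runs without any downloads:

```python
# Toy stand-in for nltk's stopwords.words('english'); an assumption for illustration only.
TOY_STOPWORDS = {"the", "is", "was", "and", "a", "not", "no", "this"}
WHITELIST = {"n't", "not", "no"}  # negations kept, as in remove_stopwords above


def remove_stopwords_sketch(text, stopwords=TOY_STOPWORDS, whitelist=WHITELIST):
    """Drop stopwords and one-character tokens, but keep whitelisted negations."""
    kept = [
        word
        for word in text.lower().split()
        if (word not in stopwords or word in whitelist) and len(word) > 1
    ]
    return " ".join(kept)


with_whitelist = remove_stopwords_sketch("The crust is not good")
without_whitelist = remove_stopwords_sketch("The crust is not good", whitelist=set())
print(with_whitelist)     # crust not good
print(without_whitelist)  # crust good
```

Without the whitelist the negation disappears and the negative review reads like a positive one, which is why the negations are kept.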

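Next, encode the normalized text with the Keras `one_hot` function. First count the distinct tokens, then choose a vocabulary size comfortably larger than that count, because `one_hot` hashes words to indices and distinct words can collide.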

In [None]:
len(set(''.join(yelp.normalized).split()))

1637

In [None]:
# Answer below:
from tensorflow.keras.preprocessing.text import one_hot
docs = yelp['normalized']
vocab_size = 5000
encoded_docs = [one_hot(doc, vocab_size) for doc in docs]
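
Despite its name, `one_hot` does not build a dictionary: it hashes each word to an integer between 1 and `vocab_size - 1`, so two different words can end up with the same index. The sketch below imitates this with a deterministic MD5-based hash (an assumption chosen for reproducibility, not necessarily the hash Keras uses) and estimates how many of the 1,637 distinct tokens counted above would be expected to share an index, assuming the hash spreads words uniformly. The helper names are hypothetical.

```python
import hashlib


def hash_index(word, n):
    """Map a word to an integer in [1, n - 1] with a stable hash (illustrative)."""
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % (n - 1) + 1


def encode(text, n):
    """Hash every whitespace-separated token of `text`."""
    return [hash_index(word, n) for word in text.split()]


def expected_shared_indices(n_words, n):
    """Expected count of words that land on an index some earlier word already took."""
    buckets = n - 1
    expected_distinct = buckets * (1.0 - (1.0 - 1.0 / buckets) ** n_words)
    return n_words - expected_distinct


print(encode("crust not good", 5000))
for size in (2000, 5000, 50000):
    print(size, round(expected_shared_indices(1637, size), 1))
```

A larger `vocab_size` lowers the expected number of collisions at the cost of a larger embedding table later on.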


In [None]:
encoded_docs

[[3363, 4869, 4964],
 [4729, 598, 1337],
 [598, 3536, 622, 1946],
 [4039, 4691, 1519, 1162, 4852, 4725, 2373, 56, 4869, 1921],
 [1370, 4228, 359, 3270],
 [2549, 3026, 1028, 618, 4314],
 [2150, 2910, 1093],
 [3929, 4698, 4427, 875, 3285, 465, 1113, 4223, 1214, 3495],
 [4668, 359, 145],
 [359, 2983],
 [4966, 2880],
 [1593, 598, 2640, 3076],
 [1640, 3559, 3972, 2746, 3492, 1027, 1515, 3142, 4862],
 [1009, 1219, 4932, 1835, 844, 1226, 2168],
 [916, 2604, 591, 2264, 1286],
 [2230, 3559, 3678, 422, 1093, 2439],
 [4055, 56],
 [1646, 2276, 4743, 4966],
 [4964, 598, 1252, 4223, 2647, 1626, 3298],
 [598, 4698, 668],
 [2908, 3098],
 [4197, 3175],
 [4966, 3534, 1769],
 [875, 3972, 2321, 2321, 4068],
 [1635],
 [2934, 2083, 4788, 613, 17, 2998, 3131, 3051, 821, 1337],
 [2683, 306, 2427, 406, 4293],
 [2575, 4210, 359, 238, 941, 1418, 4944, 1191],
 [3771,
  901,
  2549,
  4197,
  678,
  4647,
  4197,
  2818,
  1665,
  2099,
  68,
  883,
  4698,
  796,
  1457],
 [4238, 1107, 3551],
 [3534, 3666, 4698, 

Convert the encoded sequences into a numpy array and make sure all reviews are the same length using the `pad_sequences` function in Keras.
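
With its default arguments, `pad_sequences` pads each sequence at the front with zeros until it is as long as the longest one. A rough NumPy sketch of that behavior (the helper `pad_pre` and the toy data are hypothetical, and truncation to a `maxlen` is left out):

```python
import numpy as np


def pad_pre(sequences, value=0):
    """Left-pad integer sequences with `value` up to the longest sequence."""
    max_len = max(len(seq) for seq in sequences)
    padded = np.full((len(sequences), max_len), value, dtype="int32")
    for row, seq in enumerate(sequences):
        if seq:  # an empty review stays all padding
            padded[row, -len(seq):] = seq
    return padded


toy_docs = [[3363, 4869, 4964], [4729, 598], [1635]]
toy_padded = pad_pre(toy_docs)
print(toy_padded)
# [[3363 4869 4964]
#  [   0 4729  598]
#  [   0    0 1635]]
```

Because the hashed word indices start at 1, the value 0 is free to act as the padding marker.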

In [None]:
# Answer below:
# make your features same length
from tensorflow.keras.preprocessing.sequence import pad_sequences

ind_vars = pad_sequences(encoded_docs)

In [None]:
# Largest encoded index; it must stay below vocab_size
np.max(ind_vars)

Split the data into train and test. Use 20% for test. The sentiment column should be used as the target variable.

In [None]:
#@Split Train Test for Models
#Size of the test set and target variable to split the data.

df = yelp #@param dataframe
target = 'sentiment' #@param target
SIZE = 0.2 #@param split rate

y = df[target]
X = ind_vars

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=SIZE)
print('There are {:d} training samples and {:d} test samples'.format(X_train.shape[0], X_test.shape[0]))

There are 780 training samples and 196 test samples
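
The 780/196 split follows from the row count and the 20% test fraction. The sketch below reproduces the arithmetic and shows a seeded shuffle in plain Python; the rounding rule is stated as an assumption that happens to match the sizes printed above. The split in the cell above is not seeded, so its rows change on every run; `train_test_split` accepts `random_state` (and `stratify`) if a reproducible, class-balanced split is wanted.

```python
import math
import random

n_samples = 976   # rows in the Yelp frame above
test_size = 0.2

# Assumption: a fractional test size is rounded up and the rest goes to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)  # 780 196

# Shuffle row indices with a fixed seed so the same split comes back on every run.
indices = list(range(n_samples))
random.Random(42).shuffle(indices)
test_idx, train_idx = indices[:n_test], indices[n_test:]
```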


Create a sequential model. The model should start with an embedding layer whose input dim is the vocabulary size, which must be larger than the largest encoded index in the data. The output dim should be 100, and the input length is the number of columns in the training data.
After the embedding layer, add a SimpleRNN layer with 32 units, a dense layer of size 8, and a dense output layer.

In [None]:
y.nunique()

2

In [None]:
np.max(X_train)

4999
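
The reason for checking the largest index is that an embedding layer behaves like a lookup table with `input_dim` rows, so every index in the data has to be smaller than `input_dim`. A small NumPy sketch with a hypothetical 10-word vocabulary illustrates the boundary:

```python
import numpy as np

toy_vocab_size = 10   # hypothetical small vocabulary for illustration
embedding_dim = 4

rng = np.random.default_rng(0)
# One row per possible index 0 .. toy_vocab_size - 1 (row 0 serves the padding value).
table = rng.normal(size=(toy_vocab_size, embedding_dim))

padded_review = np.array([0, 0, 7, 9])   # largest index is toy_vocab_size - 1
vectors = table[padded_review]           # lookup: one vector per token
print(vectors.shape)                     # (4, 4)

try:
    table[np.array([toy_vocab_size])]    # one past the last row
    out_of_range_ok = True
except IndexError:
    out_of_range_ok = False
print(out_of_range_ok)                   # False
```

In this notebook the largest index is 4999, so a table with `vocab_size = 5000` rows is just large enough.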

In [None]:
# Answer below:
from tensorflow.keras.models import Sequential 
from tensorflow.keras.layers import Dense, Embedding, SimpleRNN
model = Sequential()

model.add(Embedding(vocab_size, 100, input_length=X_train.shape[1]))

model.add(SimpleRNN(32))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.summary()


Model: "sequential_14"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_13 (Embedding)     (None, 83, 100)           500000    
_________________________________________________________________
simple_rnn_13 (SimpleRNN)    (None, 32)                4256      
_________________________________________________________________
dense_26 (Dense)             (None, 8)                 264       
_________________________________________________________________
dense_27 (Dense)             (None, 1)                 9         
Total params: 504,529
Trainable params: 504,529
Non-trainable params: 0
_________________________________________________________________
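
The parameter counts in the summary can be checked by hand. The NumPy sketch below creates weight arrays with the same shapes as the four layers, counts their entries, and runs one padded review through an untrained forward pass. The weights are random, so the resulting probability is meaningless; the point is the shapes and the recurrent update. It assumes the default `tanh` activation for `SimpleRNN`, and all variable names are illustrative.

```python
import numpy as np

vocab_size, embed_dim, rnn_units, dense_units, seq_len = 5000, 100, 32, 8, 83
rng = np.random.default_rng(0)

# Weight shapes chosen to match the layer sizes in the summary above.
E = rng.normal(scale=0.05, size=(vocab_size, embed_dim))    # embedding table
W_x = rng.normal(scale=0.05, size=(embed_dim, rnn_units))   # input -> hidden
W_h = rng.normal(scale=0.05, size=(rnn_units, rnn_units))   # hidden -> hidden
b_h = np.zeros(rnn_units)
W_1, b_1 = rng.normal(scale=0.05, size=(rnn_units, dense_units)), np.zeros(dense_units)
W_2, b_2 = rng.normal(scale=0.05, size=(dense_units, 1)), np.zeros(1)

param_counts = {
    "embedding": E.size,                           # 5000 * 100
    "simple_rnn": W_x.size + W_h.size + b_h.size,  # 32 * (100 + 32 + 1)
    "dense_8": W_1.size + b_1.size,                # 32 * 8 + 8
    "dense_1": W_2.size + b_2.size,                # 8 * 1 + 1
}
print(param_counts, sum(param_counts.values()))


def forward(token_ids):
    """Run one padded review through embedding -> RNN -> dense -> sigmoid."""
    h = np.zeros(rnn_units)
    for x in E[token_ids]:                     # one embedded token per time step
        h = np.tanh(x @ W_x + h @ W_h + b_h)   # recurrent update
    hidden = np.maximum(0.0, h @ W_1 + b_1)    # ReLU
    logit = hidden @ W_2 + b_2
    return 1.0 / (1.0 + np.exp(-logit[0]))     # sigmoid probability


review = np.zeros(seq_len, dtype=int)
review[-3:] = [4729, 598, 1337]                # a three-token review, left-padded with zeros
probability = forward(review)
print(probability)
```

Only the final hidden state is passed on to the dense layers, which is why the `SimpleRNN` output shape is `(None, 32)` rather than `(None, 83, 32)`.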


Compile the model with the optimizer of your choice and use binary cross-entropy as the loss function. Fit the model using a batch size of 128 and 50 epochs.

In [None]:
# Answer below:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=50, batch_size=128)



Epoch 1/50
...
Epoch 50/50


<tensorflow.python.keras.callbacks.History at 0x7ff84cf05630>
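
To make the loss and batch settings concrete, here is a small NumPy sketch of binary cross-entropy on made-up predictions, plus the number of gradient steps per epoch implied by 780 training rows and a batch size of 128. The function is an illustrative re-implementation, not the Keras code path, and the probabilities are invented for the example.

```python
import math

import numpy as np


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-likelihood of 0/1 labels under predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    return float(-np.mean(y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob)))


confident_right = binary_crossentropy([1, 0], [0.9, 0.1])
confident_wrong = binary_crossentropy([1, 0], [0.1, 0.9])
print(confident_right, confident_wrong)

# 780 training rows in batches of 128: the last batch is smaller than the rest.
steps_per_epoch = math.ceil(780 / 128)
print(steps_per_epoch)  # 7
```

Confident wrong predictions are penalized far more heavily than confident correct ones, which is what pushes the sigmoid output toward the true label during training.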