# Week 6

# LSTM for Sentiment Analysis

### Long Short-Term Memory (LSTM) network

##### About the gates

###### - Forget gate

For the sake of this illustration, let's assume we are reading words in a piece of text and want to use an LSTM to keep track of grammatical structures, such as whether the subject is singular or plural. If the subject changes from a singular word to a plural word, we need a way to get rid of the previously stored memory value of the singular/plural state. In an LSTM, the forget gate lets us do this: 

$$\Gamma_f^{\langle t \rangle} = \sigma(W_f[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_f)\tag{1} $$

Here, $W_f$ are weights that govern the forget gate's behavior. We concatenate $[a^{\langle t-1 \rangle}, x^{\langle t \rangle}]$ and multiply by $W_f$. The equation above results in a vector $\Gamma_f^{\langle t \rangle}$ with values between 0 and 1. This forget gate vector will be multiplied element-wise by the previous cell state $c^{\langle t-1 \rangle}$. So if one of the values of $\Gamma_f^{\langle t \rangle}$ is 0 (or close to 0) then it means that the LSTM should remove that piece of information (e.g. the singular subject) in the corresponding component of $c^{\langle t-1 \rangle}$. If one of the values is 1, then it will keep the information. 

###### - Update gate

Once we forget that the subject being discussed is singular, we need a way to update the memory to reflect that the new subject is plural. Here is the formula for the update gate: 

$$\Gamma_u^{\langle t \rangle} = \sigma(W_u[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_u)\tag{2} $$ 

Similar to the forget gate, here $\Gamma_u^{\langle t \rangle}$ is again a vector of values between 0 and 1. This will be multiplied element-wise with $\tilde{c}^{\langle t \rangle}$, in order to compute $c^{\langle t \rangle}$.

###### - Updating the cell 

To update the new subject we need to create a new vector of numbers that we can add to our previous cell state. The equation we use is: 

$$ \tilde{c}^{\langle t \rangle} = \tanh(W_c[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_c)\tag{3} $$

Finally, the new cell state is: 

$$ c^{\langle t \rangle} = \Gamma_f^{\langle t \rangle}* c^{\langle t-1 \rangle} + \Gamma_u^{\langle t \rangle} *\tilde{c}^{\langle t \rangle} \tag{4} $$
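
As a small worked example of equation (4), the sketch below uses hand-picked gate values rather than learned ones. The function name `update_cell` and all numbers are illustrative assumptions: the first component is fully forgotten and overwritten by the candidate, while the second is kept unchanged.

```python
def update_cell(gamma_f, gamma_u, c_prev, c_tilde):
    """Element-wise cell update: c = gamma_f * c_prev + gamma_u * c_tilde."""
    return [f * cp + u * ct
            for f, u, cp, ct in zip(gamma_f, gamma_u, c_prev, c_tilde)]

# Hypothetical values: component 0 stores the "singular" flag, component 1 something else.
c_prev = [2.0, -1.0]
c_tilde = [0.5, 0.9]
gamma_f = [0.0, 1.0]   # forget component 0, keep component 1
gamma_u = [1.0, 0.0]   # write the candidate into component 0 only

c_new = update_cell(gamma_f, gamma_u, c_prev, c_tilde)
print(c_new)  # [0.5, -1.0]
```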


###### - Output gate

To decide which outputs we will use, we will use the following two formulas: 

$$ \Gamma_o^{\langle t \rangle}=  \sigma(W_o[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_o)\tag{5}$$ 
$$ a^{\langle t \rangle} = \Gamma_o^{\langle t \rangle}* \tanh(c^{\langle t \rangle})\tag{6} $$

In equation 5, a sigmoid decides which components to output; in equation 6, that gate is multiplied element-wise by the $\tanh$ of the current cell state $c^{\langle t \rangle}$ to give the new hidden state $a^{\langle t \rangle}$. 
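
Equations (1) to (6) can be put together into a single time step. The following is a minimal NumPy sketch, not the implementation Keras or TensorFlow use internally; the function `lstm_step`, the `params` dictionary layout, and the toy dimensions are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, a_prev, c_prev, params):
    """One LSTM time step following equations (1)-(6)."""
    concat = np.concatenate([a_prev, x_t])                        # [a^<t-1>, x^<t>]
    gamma_f = sigmoid(params['Wf'] @ concat + params['bf'])       # (1) forget gate
    gamma_u = sigmoid(params['Wu'] @ concat + params['bu'])       # (2) update gate
    c_tilde = np.tanh(params['Wc'] @ concat + params['bc'])       # (3) candidate
    c_t = gamma_f * c_prev + gamma_u * c_tilde                    # (4) new cell state
    gamma_o = sigmoid(params['Wo'] @ concat + params['bo'])       # (5) output gate
    a_t = gamma_o * np.tanh(c_t)                                  # (6) new hidden state
    return a_t, c_t

# Toy dimensions (assumed): 3 hidden units, 2 input features.
rng = np.random.default_rng(0)
n_a, n_x = 3, 2
params = {}
for gate in 'fuco':
    params['W' + gate] = rng.normal(size=(n_a, n_a + n_x))
    params['b' + gate] = np.zeros(n_a)

a_t, c_t = lstm_step(rng.normal(size=n_x), np.zeros(n_a), np.zeros(n_a), params)
print(a_t.shape, c_t.shape)  # (3,) (3,)
```

Each weight matrix has shape `(n_a, n_a + n_x)` because it multiplies the concatenation of the previous hidden state and the current input.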

<br>
<br>

## The IMDb Movie Review Dataset

In this section, we will classify movie reviews from the 50k IMDb review dataset collected by Maas et al.

> AL Maas, RE Daly, PT Pham, D Huang, AY Ng, and C Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA, June 2011. Association for Computational Linguistics

[Source: http://ai.stanford.edu/~amaas/data/sentiment/]

The dataset consists of 50,000 movie reviews from the original "train" and "test" subdirectories. The class labels are binary (1 = positive, 0 = negative), and the dataset contains 25,000 positive and 25,000 negative reviews.
For simplicity, I assembled the reviews in a single CSV file.


In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import tensorflow as tf
from tensorflow.contrib import rnn
from sklearn.feature_extraction.text import CountVectorizer
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM, SpatialDropout1D
from sklearn.model_selection import train_test_split
from keras.utils.np_utils import to_categorical
import re

Using TensorFlow backend.


In [2]:
#import pandas as pd
# if you want to download the original file:
#df = pd.read_csv('https://raw.githubusercontent.com/rasbt/pattern_classification/master/data/50k_imdb_movie_reviews.csv')
# otherwise load local file
df = pd.read_csv('shuffled_movie_data.csv')
df.tail()

Unnamed: 0,review,sentiment
49995,"OK, lets start with the best. the building. al...",0
49996,The British 'heritage film' industry is out of...,0
49997,I don't even know where to begin on this one. ...,0
49998,Richard Tyler is a little boy who is scared of...,0
49999,I waited long to watch this movie. Also becaus...,1


Let us shuffle the rows so that the reviews, together with their labels, appear in random order.

In [3]:
import numpy as np
## uncomment these lines if you have downloaded the original file:
#np.random.seed(0)
#df = df.reindex(np.random.permutation(df.index))
df[['review', 'sentiment']].to_csv('shuffled_movie_data.csv', index=False)

<br>
<br>

In [4]:
df=df[['review','sentiment']]

## Preprocessing Text Data

Now, let us define a simple `tokenizer` that splits the text into individual word tokens. Furthermore, we will use some simple regular expressions to remove HTML markup and all non-letter characters except "emoticons," convert the text to lower case, remove stopwords, and apply the Porter stemming algorithm to convert the words into their root form.

In [5]:
import numpy as np
from nltk.stem.porter import PorterStemmer
import re
from nltk.corpus import stopwords

stop = stopwords.words('english')
porter = PorterStemmer()

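def tokenizer(text):
    # strip HTML markup
    text = re.sub(r'<[^>]*>', '', text)
    # collect emoticons such as :) or ;-( before punctuation is removed
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    # replace non-word characters with spaces, then re-append the emoticons without noses
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    # drop stopwords
    text = [w for w in text.split() if w not in stop]
    # stem the remaining tokens and return the stemmed list
    tokenized = [porter.stem(w) for w in text]
    return tokenized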
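
To see what the regular expressions in this function do on their own, here is a small sketch that repeats only the cleaning steps, leaving out stopword removal and stemming so that it runs without NLTK. The helper name `clean_and_split` and the sample string are made up for illustration.

```python
import re

def clean_and_split(text):
    """Regex-only part of the tokenizer: HTML removal, emoticon handling, splitting."""
    text = re.sub(r'<[^>]*>', '', text)                       # drop HTML tags
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    cleaned = re.sub(r'[\W]+', ' ', text.lower())             # punctuation -> space
    cleaned += ' '.join(emoticons).replace('-', '')           # re-append emoticons
    return cleaned.split()

tokens = clean_and_split('</a>This :) is :( a test :-)!')
print(tokens)  # ['this', 'is', 'a', 'test', ':)', ':(', ':)']
```

The emoticons end up at the end of the token list, so their position in the sentence is lost but their presence is preserved.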

Next, we clean the reviews and convert them into padded integer sequences with the Keras `Tokenizer`:

In [6]:
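#df = df[df.sentiment != "Neutral"]
df['review'] = df['review'].apply(lambda x: x.lower())
# drop punctuation; note that the range A-z also lets through [ \ ] ^ _ and `,
# which the Keras Tokenizer's default filters strip later
df['review'] = df['review'].apply((lambda x: re.sub(r'[^a-zA-z0-9\s]','',x)))

# .size counts cells (rows x 2 columns), so 50000 means 25000 reviews per class
print(df[ df['sentiment'] == 1].size)
print(df[ df['sentiment'] == 0].size)

# iterrows() typically yields copies of the rows, so this replacement is not expected to change df
for idx,row in df.iterrows():
    row[0] = row[0].replace('rt',' ')
    
max_fatures = 2000
# using the Keras Tokenizer; named keras_tokenizer so it does not shadow tokenizer() from In [5]
keras_tokenizer = Tokenizer(num_words=max_fatures, split=' ')
keras_tokenizer.fit_on_texts(df['review'].values)
X = keras_tokenizer.texts_to_sequences(df['review'].values) # integer sequences
X = pad_sequences(X)

word_index = keras_tokenizer.word_index # vocabulary of the dataset
N_WORDS = len(word_index) # number of unique words in the dataset
print('%s unique words.' %N_WORDS)


50000
50000
181595 unique words.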


In [7]:
X[111]

array([  0,   0,   0, ..., 300,  39,  54], dtype=int32)
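
The leading zeros in this row come from padding. As a rough pure-Python sketch of what the tokenizer and `pad_sequences` do with their default settings (indices ordered by word frequency starting at 1, only indices below `num_words` kept, zeros added on the left up to the longest sequence), consider the following. The helper names and the toy reviews are assumptions, and the real Keras classes handle many more options.

```python
from collections import Counter

def fit_word_index(texts):
    """Most frequent word gets index 1; index 0 is reserved for padding."""
    counts = Counter(w for t in texts for w in t.split())
    ordered = sorted(counts.items(), key=lambda kv: -kv[1])
    return {w: i for i, (w, _) in enumerate(ordered, start=1)}

def to_sequences(texts, word_index, num_words):
    """Replace words by their index, dropping words whose index is >= num_words."""
    return [[word_index[w] for w in t.split()
             if w in word_index and word_index[w] < num_words]
            for t in texts]

def pad_pre(seqs):
    """Left-pad every sequence with zeros to the length of the longest one."""
    maxlen = max(len(s) for s in seqs)
    return [[0] * (maxlen - len(s)) + s for s in seqs]

texts = ['the movie was good', 'the movie was bad bad', 'the end']
index = fit_word_index(texts)
padded = pad_pre(to_sequences(texts, index, num_words=5))
print(padded)  # [[0, 0, 1, 2, 3], [1, 2, 3, 4, 4], [0, 0, 0, 0, 1]]
```

With `num_words=5`, the rarer words `good` and `end` are dropped, which mirrors how `max_fatures = 2000` restricts the sequences to the most frequent words even though `word_index` still lists every word.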

In [8]:
word_index

{'the': 1,
 'and': 2,
 'a': 3,
 'of': 4,
 'to': 5,
 'is': 6,
 'in': 7,
 'it': 8,
 'i': 9,
 'this': 10,
 'that': 11,
 'br': 12,
 'was': 13,
 'as': 14,
 'with': 15,
 'for': 16,
 'movie': 17,
 'but': 18,
 'film': 19,
 'on': 20,
 'not': 21,
 'you': 22,
 'are': 23,
 'his': 24,
 'have': 25,
 'be': 26,
 'he': 27,
 'one': 28,
 'its': 29,
 'at': 30,
 'all': 31,
 'by': 32,
 'an': 33,
 'they': 34,
 'from': 35,
 'who': 36,
 'so': 37,
 'like': 38,
 'or': 39,
 'just': 40,
 'her': 41,
 'about': 42,
 'if': 43,
 'has': 44,
 'out': 45,
 'some': 46,
 'there': 47,
 'what': 48,
 'good': 49,
 'when': 50,
 'more': 51,
 'very': 52,
 'my': 53,
 'even': 54,
 'no': 55,
 'up': 56,
 'would': 57,
 'she': 58,
 'time': 59,
 'only': 60,
 'which': 61,
 'really': 62,
 'their': 63,
 'see': 64,
 'were': 65,
 'story': 66,
 'had': 67,
 'can': 68,
 'me': 69,
 'than': 70,
 'we': 71,
 'much': 72,
 'well': 73,
 'been': 74,
 'get': 75,
 'will': 76,
 'other': 77,
 'do': 78,
 'great': 79,
 'also': 80,
 'into': 81,
 'bad': 82,
 'be

## Learning (SciKit)

First, we define a generator that returns the document body and the corresponding class label:

In [9]:
def stream_docs(path):
    with open(path, 'r') as csv:
        next(csv) # skip header
        for line in csv:
            text, label = line[:-3], int(line[-2])
            yield text, label

To confirm that the `stream_docs` function fetches the documents as intended, let us execute the following code snippet before we implement the `get_minibatch` function:

In [10]:
next(stream_docs(path='shuffled_movie_data.csv'))

('"In 1974, the teenager Martha Moxley (Maggie Grace) moves to the high-class area of Belle Haven, Greenwich, Connecticut. On the Mischief Night, eve of Halloween, she was murdered in the backyard of her house and her murder remained unsolved. Twenty-two years later, the writer Mark Fuhrman (Christopher Meloni), who is a former LA detective that has fallen in disgrace for perjury in O.J. Simpson trial and moved to Idaho, decides to investigate the case with his partner Stephen Weeks (Andrew Mitchell) with the purpose of writing a book. The locals squirm and do not welcome them, but with the support of the retired detective Steve Carroll (Robert Forster) that was in charge of the investigation in the 70\'s, they discover the criminal and a net of power and money to cover the murder.<br /><br />""Murder in Greenwich"" is a good TV movie, with the true story of a murder of a fifteen years old girl that was committed by a wealthy teenager whose mother was a Kennedy. The powerful and rich f

Having confirmed that our `stream_docs` function works, we will now implement a `get_minibatch` function to fetch a specified number (`size`) of documents:

In [11]:
def get_minibatch(doc_stream, size):
    docs, y = [], []
    for _ in range(size):
        text, label = next(doc_stream)
        docs.append(text)
        y.append(label)
    return docs, y
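
Note that `next(doc_stream)` raises `StopIteration` once the file is exhausted. The sketch below shows one possible way to handle that, returning a shorter final batch instead of failing. It reads from an in-memory string rather than the CSV file; the names `stream_docs_from` and `get_minibatch_safe` and the three sample rows are hypothetical.

```python
import io

def stream_docs_from(handle):
    """Same parsing as stream_docs, but reading from an already open file-like object."""
    next(handle)  # skip header
    for line in handle:
        # each line ends with ',<label>\n': drop those 3 characters for the text,
        # and read the single digit before the newline as the label
        text, label = line[:-3], int(line[-2])
        yield text, label

def get_minibatch_safe(doc_stream, size):
    """Fetch up to `size` documents; returns fewer (possibly none) at the end of the stream."""
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        pass  # stream exhausted: return what was collected so far
    return docs, y

sample = 'review,sentiment\n"good film",1\n"bad film",0\n"fine",1\n'
stream = stream_docs_from(io.StringIO(sample))
print(get_minibatch_safe(stream, 2))  # (['"good film"', '"bad film"'], [1, 0])
print(get_minibatch_safe(stream, 2))  # (['"fine"'], [1])
```

The slicing assumes every line, including the last one, ends with a newline and a single-digit label.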

Next, we will make use of the "hashing trick" through scikit-learn's [HashingVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html) to create a bag-of-words model of our documents. Details of the bag-of-words model for document classification can be found at [Naive Bayes and Text Classification I - Introduction and Theory](http://arxiv.org/abs/1410.5329).

In [12]:
from sklearn.feature_extraction.text import HashingVectorizer
vect = HashingVectorizer(decode_error='ignore', 
                         n_features=2**21,
                         preprocessor=None, 
                         tokenizer=tokenizer)
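
The idea behind the hashing trick can be sketched in a few lines: each token is hashed to a column index, so no vocabulary has to be stored. The example below is only an illustration and uses `zlib.crc32` as a stand-in hash function, whereas scikit-learn's `HashingVectorizer` relies on MurmurHash3 and can also flip signs and normalize the result. The function name `hash_features` and the tiny `n_features` are assumptions.

```python
import zlib

def hash_features(tokens, n_features=16):
    """Bag-of-words counts where the column of a token is hash(token) mod n_features."""
    vec = [0] * n_features
    for tok in tokens:
        idx = zlib.crc32(tok.encode('utf-8')) % n_features
        vec[idx] += 1
    return vec

vec = hash_features(['good', 'movie', 'good', 'plot'])
print(len(vec), sum(vec))  # 16 4
```

Different tokens may collide in the same column; a large table such as `n_features=2**21` keeps such collisions rare.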


### Adding a Recurrent Neural Network Model: LSTM

In [16]:
embed_dim = 50
lstm_out = 196 #64

model = Sequential()
model.add(Embedding(max_fatures, embed_dim,input_length = X.shape[1]))
model.add(SpatialDropout1D(0.4))
model.add(LSTM(lstm_out, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(2,activation='softmax'))
model.compile(loss = 'categorical_crossentropy', optimizer='adam',metrics = ['accuracy'])
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 1862, 50)          100000    
_________________________________________________________________
spatial_dropout1d_1 (Spatial (None, 1862, 50)          0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 196)               193648    
_________________________________________________________________
dense_1 (Dense)              (None, 2)                 394       
Total params: 294,042
Trainable params: 294,042
Non-trainable params: 0
_________________________________________________________________
None
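
The parameter counts in this summary can be checked by hand. An LSTM layer holds four weight blocks (forget, update, candidate and output, as in equations 1, 2, 3 and 5), each of shape `(units, units + input_dim)` plus a bias vector. The helper functions below are a hypothetical sketch of that arithmetic for the layer sizes used in this model.

```python
def embedding_params(vocab_size, embed_dim):
    # one embed_dim-dimensional vector per vocabulary entry
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # four gates/candidates, each with a (units x (units + input_dim)) matrix and a bias
    return 4 * (units * (units + input_dim) + units)

def dense_params(input_dim, output_dim):
    # weight matrix plus bias
    return input_dim * output_dim + output_dim

emb = embedding_params(2000, 50)     # max_fatures x embed_dim
lstm = lstm_params(50, 196)          # embed_dim inputs, lstm_out units
dense = dense_params(196, 2)
print(emb, lstm, dense, emb + lstm + dense)  # 100000 193648 394 294042
```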


### Training and test data

In [17]:
from sklearn.model_selection import train_test_split
Y = pd.get_dummies(df['sentiment']).values
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33, random_state = 42)
print('X : ',X.shape)
print('y : ',Y.shape)
print('X_training: ',X_train.shape)
print('y_training: ',Y_train.shape)
print('X_test: ',X_test.shape)
print('y_test: ',Y_test.shape)


X :  (50000, 1862)
y :  (50000, 2)
X_training:  (33500, 1862)
y_training:  (33500, 2)
X_test:  (16500, 1862)
y_test:  (16500, 2)


In [18]:
X_train[1]

array([   0,    0,    0, ...,   53, 1460,  949], dtype=int32)

In [19]:
batch_size = 32
model.fit(X_train, Y_train, epochs = 7, batch_size=batch_size, verbose = 2)

Epoch 1/7
 - 1450s - loss: 0.5002 - acc: 0.7558
Epoch 2/7
 - 1688s - loss: 0.4238 - acc: 0.8127
Epoch 3/7
 - 1858s - loss: 0.4197 - acc: 0.8150
Epoch 4/7
 - 1392s - loss: 0.3549 - acc: 0.8499
Epoch 5/7
 - 1392s - loss: 0.3248 - acc: 0.8639
Epoch 6/7
 - 1394s - loss: 0.2786 - acc: 0.8867
Epoch 7/7
 - 1393s - loss: 0.2616 - acc: 0.8941


<keras.callbacks.History at 0x7f545437a0b8>

### Extracting a validation set and measuring the score and error

In [22]:
validation_size = 1500
X_validate = X_test[-validation_size:]
Y_validate = Y_test[-validation_size:]
X_test = X_test[:-validation_size]
Y_test = Y_test[:-validation_size]
score,acc = model.evaluate(X_test, Y_test, verbose = 2, batch_size = batch_size)
print("score: %.2f" % (score))
print("acc: %.2f" % (acc))

score: 0.29
acc: 0.88


In [23]:
pos_cnt, neg_cnt, pos_correct, neg_correct = 0, 0, 0, 0
for x in range(len(X_validate)):
    
    result = model.predict(X_validate[x].reshape(1,X_test.shape[1]),batch_size=1,verbose = 2)[0]
   
    if np.argmax(result) == np.argmax(Y_validate[x]):
        if np.argmax(Y_validate[x]) == 0:
            neg_correct += 1
        else:
            pos_correct += 1
       
    if np.argmax(Y_validate[x]) == 0:
        neg_cnt += 1
    else:
        pos_cnt += 1



print("pos_acc", pos_correct/pos_cnt*100, "%")
print("neg_acc", neg_correct/neg_cnt*100, "%")


pos_acc 86.27717391304348 %
neg_acc 90.44502617801047 %
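
The loop above calls `model.predict` once per review. Assuming the class probabilities for the whole validation set are available as a single array (for example from one call to `model.predict(X_validate)`), the same per-class accuracy can be computed in a vectorized way. The function name `per_class_accuracy` and the toy arrays are illustrative assumptions.

```python
import numpy as np

def per_class_accuracy(y_true_onehot, y_prob):
    """Fraction of correctly predicted samples for each true class."""
    y_true = np.argmax(y_true_onehot, axis=1)
    y_pred = np.argmax(y_prob, axis=1)
    return {int(c): float(np.mean(y_pred[y_true == c] == c))
            for c in np.unique(y_true)}

# Toy data: two negative (class 0) and two positive (class 1) reviews.
y_true = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
y_prob = np.array([[0.9, 0.1], [0.2, 0.8], [0.3, 0.7], [0.4, 0.6]])
print(per_class_accuracy(y_true, y_prob))  # {0: 0.5, 1: 1.0}
```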


### Adding a New Model: Bidirectional LSTM

We start by loading our own word embeddings from GloVe.

In [24]:
glove_file = 'glove.twitter.27B/glove.twitter.27B.' + str(50) + 'd.txt'
emb_dict = {}
# each line holds a word followed by the components of its vector
with open(glove_file, encoding='utf-8') as glove:
    for line in glove:
        values = line.split()
        word = values[0]
        vector = np.asarray(values[1:], dtype='float32')
        # skip malformed lines whose vector is not 50-dimensional
        if vector.shape[0] == 50:
            emb_dict[word] = vector
print('GloVe vocabulary size: ',len(emb_dict))

GloVe vocabulary size:  1193513


In [25]:
emb_dict

{'<user>': array([ 0.78704 ,  0.72151 ,  0.29148 , -0.056527,  0.31683 ,  0.47172 ,
         0.023461,  0.69568 ,  0.20782 ,  0.60985 , -0.22386 ,  0.7481  ,
        -2.6208  ,  0.20117 , -0.48104 ,  0.12897 ,  0.035239, -0.24486 ,
        -0.36088 ,  0.026686,  0.28978 , -0.10698 , -0.34621 ,  0.021053,
         0.54514 , -1.0958  , -0.274   ,  0.2233  ,  1.0827  , -0.029018,
        -0.84029 ,  0.58619 , -0.36511 ,  0.34016 ,  0.89615 ,  0.32757 ,
         0.24267 ,  0.68404 , -0.34374 ,  0.13583 , -2.2162  , -0.42537 ,
         0.46157 ,  0.88626 , -0.22014 ,  0.025599, -0.38615 ,  0.080107,
        -0.075323, -0.61461 ], dtype=float32),
 '.': array([ 0.68661 , -1.0772  ,  0.011114, -0.24075 , -0.3422  ,  0.64456 ,
         0.54957 ,  0.30411 , -0.54682 ,  1.4695  ,  0.43648 , -0.34223 ,
        -2.7189  ,  0.46021 ,  0.016881,  0.13953 ,  0.020913,  0.050963,
        -0.48108 , -1.0764  , -0.16807 , -0.014315, -0.55055 ,  0.67823 ,
         0.24359 , -1.3179  , -0.036348, -0.228   

In [26]:
embeddings_words = np.array([emb_dict[i] for i in emb_dict.keys()])
for i in range(embeddings_words.shape[0]):
    embeddings_words[i] = embeddings_words[i].reshape(1,50)
embeddings_words[0].shape
embeddings_words.shape



(1193513, 50)
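
The rows of `embeddings_words` follow GloVe's own ordering, whereas the integer sequences in `X` use the indices from `word_index`. If the two are combined in an embedding lookup, the rows generally need to be re-ordered so that row `i` holds the vector of the word with index `i`. The following is a minimal sketch of that alignment under the assumption that unknown words and the padding index 0 get a zero vector; `build_embedding_matrix` and the toy dictionaries are hypothetical.

```python
import numpy as np

def build_embedding_matrix(word_index, emb_dict, num_words, dim):
    """Row i holds the pretrained vector of the word whose tokenizer index is i."""
    matrix = np.zeros((num_words, dim), dtype='float32')
    for word, idx in word_index.items():
        # keep only the indices the tokenizer actually emits, and only known words
        if idx < num_words and word in emb_dict:
            matrix[idx] = emb_dict[word]
    return matrix

# Toy stand-ins for word_index and emb_dict.
toy_index = {'the': 1, 'movie': 2, 'film': 3, 'rare': 4}
toy_emb = {'the': np.ones(3, dtype='float32'),
           'film': np.full(3, 2.0, dtype='float32')}
matrix = build_embedding_matrix(toy_index, toy_emb, num_words=4, dim=3)
print(matrix.shape)  # (4, 3)
```

For the data in this notebook, the call would presumably use `num_words=max_fatures` and `dim=50`, giving a much smaller matrix than the full GloVe table.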

Now we define a bidirectional RNN with at least 2 layers, using LSTM units and gradient clipping.

### Building the graph

In [40]:
tf.reset_default_graph()
batchSize = 1000
lstmUnits = 64
numClasses = 1
learning_rate = 0.01
num_layers = 2


In [41]:
# with graph.as_default():
labels = tf.placeholder(tf.int32,[batchSize,numClasses])
# each input row must hold exactly 50 token indices
inputs = tf.placeholder(tf.int32,[batchSize,50])
# this variable is unused: df_new is reassigned on the next line
df_new = tf.Variable(tf.zeros([batchSize, 200, 50]),dtype=tf.float32)
# look up one 50-dimensional embedding per token index
df_new = tf.nn.embedding_lookup(embeddings_words,inputs)
keep_prob = tf.placeholder(tf.float32, name='keep_prob')
#     y_pred = rnn_model(data,lstmUnits,numClasses)

In [42]:
def BiRNN_LSTM():
    lstm = tf.contrib.rnn.LSTMCell(lstmUnits, reuse=tf.get_variable_scope().reuse)
    return tf.contrib.rnn.DropoutWrapper(lstm, output_keep_prob=keep_prob)

cell = tf.contrib.rnn.MultiRNNCell([BiRNN_LSTM() for _ in range(num_layers)])
initial_state = cell.zero_state(batchSize, tf.float32)


In [43]:
outputs, final_state = tf.nn.dynamic_rnn(cell, df_new, initial_state=initial_state)

In [44]:
predictions = tf.contrib.layers.fully_connected(outputs[:, -1], 1, activation_fn=tf.sigmoid)
cost = tf.losses.mean_squared_error(labels, predictions)    
optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)
correct_pred = tf.equal(tf.cast(tf.round(predictions), tf.int32), labels)
accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))

In [45]:
def get_batches(x, y, batch_size=1000):
    
    n_batches = len(x)//batch_size
    x, y = x[:n_batches*batch_size], y[:n_batches*batch_size]
    for ii in range(0, len(x), batch_size):
        yield x[ii:ii+batch_size], y[ii:ii+batch_size]


In [47]:
epochs = 7
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    iteration = 1
    for e in range(epochs):
        state = sess.run(initial_state)
        
        for ii, (x, y) in enumerate(get_batches(X_train, Y_train, batchSize), 1):
            feed = {inputs: x,
                    labels: y[:, None],
                    keep_prob: 0.5,
                    initial_state: state}
            loss, state, _ = sess.run([cost, final_state, optimizer], feed_dict=feed)
            
            if iteration%5==0:
                print("Epoch: {}/{}".format(e, epochs),
                      "Iteration: {}".format(iteration),
                      "Train loss: {:.3f}".format(loss))

            if iteration%25==0:
                val_acc = []
                val_state = sess.run(cell.zero_state(batchSize, tf.float32))
                for x, y in get_batches(X_test, Y_test, batchSize):
                    feed = {inputs: x,
                            labels: y[:, None],
                            keep_prob: 1,
                            initial_state: val_state}
                    batch_acc, val_state = sess.run([accuracy, final_state], feed_dict=feed)
                    val_acc.append(batch_acc)
                print("Val acc: {:.3f}".format(np.mean(val_acc)))
            iteration +=1





ValueError: Cannot feed value of shape (1000, 1862) for Tensor 'Placeholder_1:0', which has shape '(1000, 50)'
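
This error is a shape mismatch: the `inputs` placeholder was declared with 50 time steps, while the rows of `X_train` have 1862 columns. In the same way, `Y_train` is one-hot encoded with 2 columns, whereas `labels` expects a single column. One possible way to reconcile the shapes, sketched here with NumPy only, is to keep the last 50 columns of each row (the sequences are left-padded, so the actual words sit at the end) and to turn the one-hot labels back into a column of class ids. The function name `fit_to_placeholder` and the toy arrays are assumptions, and truncating reviews this way discards most of their text.

```python
import numpy as np

def fit_to_placeholder(X, Y_onehot, seq_len=50):
    """Trim sequences to seq_len columns and convert one-hot labels to shape (N, 1)."""
    X_short = X[:, -seq_len:]                           # keep the final seq_len tokens
    y_col = np.argmax(Y_onehot, axis=1).reshape(-1, 1)  # column index equals class id
    return X_short, y_col

# Toy data: 2 "reviews" of length 6, trimmed to length 4.
X_toy = np.arange(12).reshape(2, 6)
Y_toy = np.array([[1, 0], [0, 1]])
X_short, y_col = fit_to_placeholder(X_toy, Y_toy, seq_len=4)
print(X_short.shape, y_col.shape)  # (2, 4) (2, 1)
```

With labels already shaped `(N, 1)`, the extra `[:, None]` in the feed dictionaries would no longer be needed. Alternatively, the placeholder could be declared with `X.shape[1]` time steps, at the cost of a much longer unrolled sequence.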