# RNN para analisis de sentimiento

<br>
<br>

## The IMDb Movie Review Dataset

En esta seccion, vamos a entrener una red neuronal recurrente para clasificacion de peliculas mendiente un 'review', que consiste en un conjunto de textos de 50k IMDb el cual ha sido colectado.

> AL Maas, RE Daly, PT Pham, D Huang, AY Ng, and C Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Lin- guistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA, June 2011. Association for Computational Linguistics

[Source: http://ai.stanford.edu/~amaas/data/sentiment/]

Los datos consisten de 50,000 movie reviews de un conjunto con entrenamiento y test . las etiquetas de datos de 50 mil  a la derecha estan etiquetadas con 1 si es una opinion positiva y 0 si es una opinion negativa respectivamente.
para simplificar todo esto lo leemos desde un archovo CSV, como se ve en la siguiente linea de codigo.


In [1]:
import pandas as pd
# forma de lectura online no me funciona XD
#df = pd.read_csv('https://raw.githubusercontent.com/rasbt/pattern_classification/master/data/50k_imdb_movie_reviews.csv')
# otra forma de leer desde disco local
df = pd.read_csv('shuffled_movie_data.csv')
df.tail()

Unnamed: 0,review,sentiment
49995,"OK, lets start with the best. the building. al...",0
49996,The British 'heritage film' industry is out of...,0
49997,I don't even know where to begin on this one. ...,0
49998,Richard Tyler is a little boy who is scared of...,0
49999,I waited long to watch this movie. Also becaus...,1


<br>
<br>

## Preprocesamiento de Texto

Ahora, vamos a definir una simple funcion 'tokenizer', que consiste en dividir todo una oracion en tokens, es decir divide los datos en una expresion simple de palabra, asi se eliminan, las etiquetas html, tambien vamos a eliminar las palabras que no sean necesarias como, articulos, conectores, etc, esto se consigue con los stop words tambien vamos a eliminar los emoticos y ponerlos en una simple lista de palabras como se ve en el siguente ejemplo.

In [2]:
import numpy as np
from nltk.stem.porter import PorterStemmer
import re
from nltk.corpus import stopwords

stop = stopwords.words('english')
porter = PorterStemmer()

def tokenizer(text):
    text = re.sub('<[^>]*>', '', text)
    emoticons = re.findall('(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    text = re.sub('[\W]+', ' ', text.lower()) #+ ' '.join(emoticons).replace('-', '')
    text = [w for w in text.split() if w not in stop]
    tokenized = [porter.stem(w) for w in text]
    return text

vamos a intentar ahora tokenizar con un ejemplo que contenta los suficiente para ver si funciona o no.

In [3]:
#
print(tokenizer('This :) is a <a> test!"hoa" :-)</br>'))

#tamaño de la palabra
print(len(tokenizer('This :) is a <a> test!"hoa" :-)</br>')))

#convertir tokesn a lista de lista

def toList(text):
    review = []
    for w in text:
        review.append([w])
    return review
    
print(toList(tokenizer('This :) is a <a> test!"hoa" :-)</br>')))


['test', 'hoa']
2
[['test'], ['hoa']]


## Learning (SciKit)

bien, ahora que funciona esto de tokenizar, el siguiente paso es definir nuestro generador de cuerpo de documento para pasarlo de manera que podamos tokenizar cada review de los 50 mil, en un conjunto de palabras.

In [4]:
def stream_docs(path):
    with open(path, 'r') as csv:
        next(csv) # saltar cabecera
        for line in csv:
            text, label = line[:-3], int(line[-2])
            yield text, label

para confirmar que la funcion `stream_docs`  retorna el docuento identado correctamente vamos a ejecutar el siguiente codigo antes de implementar la funcion `get-minibatch`.

In [5]:
next(stream_docs(path='shuffled_movie_data.csv'))

('"In 1974, the teenager Martha Moxley (Maggie Grace) moves to the high-class area of Belle Haven, Greenwich, Connecticut. On the Mischief Night, eve of Halloween, she was murdered in the backyard of her house and her murder remained unsolved. Twenty-two years later, the writer Mark Fuhrman (Christopher Meloni), who is a former LA detective that has fallen in disgrace for perjury in O.J. Simpson trial and moved to Idaho, decides to investigate the case with his partner Stephen Weeks (Andrew Mitchell) with the purpose of writing a book. The locals squirm and do not welcome them, but with the support of the retired detective Steve Carroll (Robert Forster) that was in charge of the investigation in the 70\'s, they discover the criminal and a net of power and money to cover the murder.<br /><br />""Murder in Greenwich"" is a good TV movie, with the true story of a murder of a fifteen years old girl that was committed by a wealthy teenager whose mother was a Kennedy. The powerful and rich f

genial!, ahora que tenemos definimos nuestra funcion `stream_docs`, vamos ahora implementar `get minibatch` esta funcion obtiene un numero especifico de tamaño (`size`) de documentos:

In [6]:
def get_minibatch(doc_stream, size):
    docs, y = [], []
    for _ in range(size):
        text, label = next(doc_stream)
        
        docs.append(toList(tokenizer(text)))
        y.append(label)
    return docs, y

In [7]:
# primero vamos cargar todos los datos en un X y Y para extraer las caracteristias 

X,Y = get_minibatch(stream_docs(path='shuffled_movie_data.csv'), size= 50000) # get string of text

#print(ar[0][0])

## Extraccion de feature  de word2vect con gensim desde Cero

en es paso vamos usar la libreria Gensim para generar nuestro proppio diccionarion de un Word2vect embedding
de tal forma que tambien podemos definir el tamaño de los vectores por palabra, para este ejemplo y por motivos de prueba solo consideramos 20 el tamaño de cada vector,sin embargo se recommienda de 100-300 para el optimo desempeño, lo que se hace aqui es:
- generar nuestro modelo word2vect = n
- almacenamos las palabras con fuerte relacion de vencidad en un diccionario = w2v
- consideramos el promedio de los vectores para su mejor representacion de los features
- finalmente se almacena  en un X_total (total de features 'matriz'50000 x 20)

In [8]:
# tamaño se texto X, Y

print('size of X:',len(X))
print('size of X[4]: ',len(X[4]))
#print('dato :',X[0])

#definir el numero de  el word embedding para cada review
import gensim

    
#word to vect para un review 

#recorrer para los 50 de X
#X_train = []

    
#print(X[0])    

X_total = []

for i in range(len(X)):
    n = gensim.models.Word2Vec(X[i], size= 20,min_count=1)
    w2v = dict(zip(n.wv.index2word, n.wv.syn0))
    letra = list(n.wv.vocab)
    da = []
    for i in range(len(w2v)):
        da.append(w2v[letra[i]])
    X_total.append(np.mean(da, axis = 0))



size of X: 50000
size of X[4]:  62




Next, we will make use of the "hashing trick" through scikit-learns [HashingVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html) to create a bag-of-words model of our documents. Details of the bag-of-words model for document classification can be found at  [Naive Bayes and Text Classification I - Introduction and Theory](http://arxiv.org/abs/1410.5329).

In [9]:
from sklearn.feature_extraction.text import HashingVectorizer
vect = HashingVectorizer(decode_error='ignore', 
                         n_features=2**21,
                         preprocessor=None, 
                         tokenizer=tokenizer)

# Exercise 1: define features based on word embeddings (pre-trained word2vec / Glove/Fastext emebddings can be used)
# Define suitable d dimension, and sequence length

#Definiremos el dato para el test
X_train,X_test=X_total[:45000], X_total[45000:]
X_test = np.asarray(X_test)
X_train = np.asarray(X_train)

X_train.shape[1], X_test.shape

(20, (5000, 20))

## Recurrent Neural Network
ahora vamos a usar LSTM de keras para entrenar nuestro modelo y luego lo evaluaremos para hacer el test ademas 
de la red neuronal recurrente bidireccional

In [27]:
from keras.models import Sequential
from keras.layers import LSTM
from keras.layers import Dense
from keras.layers import TimeDistributed
from keras.layers import Bidirectional
from keras.layers import Embedding


embed_dim = 128
lstm_out = 200
batch_size = 32

model = Sequential()
model.add(Embedding(2500, embed_dim,input_length = X_train.shape[1], dropout = 0.2))
model.add(LSTM(lstm_out, dropout = 0.2, recurrent_dropout = 0.2))
model.add(Dense(2,activation='softmax'))
model.compile(loss = 'binary_crossentropy', optimizer='adam',metrics = ['accuracy'])
print(model.summary())


  


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 20, 128)           320000    
_________________________________________________________________
lstm_3 (LSTM)                (None, 200)               263200    
_________________________________________________________________
dense_3 (Dense)              (None, 2)                 402       
Total params: 583,602
Trainable params: 583,602
Non-trainable params: 0
_________________________________________________________________
None


In [28]:
X = np.asarray(X_train)
#Y = np.asarray(Y)
#Y_train = np.zeros((5000,2))
y_1 = []
y_2 = []
#validar Y_train
for k in range(len(Y)):
    if Y[k] == 1:
        y_1.append(Y[k])  
        y_2.append(0)
    else:
        y_1.append(0)  
        y_2.append(Y[k])
Y_total = np.column_stack((np.array(y_1),np.array(y_2)))
#print('penultimo: ',Y_total[0][0])

#dividir y
Y_train = np.asarray(Y_total[:45000])
Y_test = np.asarray(Y_total[45000:])
        
Y_train.shape , Y_test.shape
# Aqui definimos el modelo.
#Y_train.shape

model.fit(X_train, Y_train, batch_size =batch_size, nbepoch = 1, verbose = 5)



Epoch 1/1


<keras.callbacks.History at 0x7f2a9a434cc0>

In [33]:
X_test.shape, Y_test.shape
score, acc= model.evaluate(X_test, Y_test, verbose = 1, batch_size = batch_size)




In [34]:
print("")
print("score: %.4f"% (score))
print("validation accuracy: %.2f" % (acc))


score: 0.5606
validation accuracy: 0.75


### Con la red neuronal MLPClassifier  (skLearn)

una ves implementado nuestro propia red neuronal ahora vamos probar con la red neuronal del sklearn
con la cual tambien haremos el entrenamiento y el test.
asi que hechemos las ganas y a probar.

In [35]:
from sklearn.neural_network import MLPClassifier
#clf = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(5,20), random_state=1)
from sklearn.linear_model import SGDClassifier
clf = SGDClassifier(loss='log', random_state=1, max_iter=300)
classes = np.array([0, 1])
clf.partial_fit(X_train, Y[:45000], classes=classes)
y_h2 = clf.predict(X_test)


print('SKLEARN NN accuracy of Test: ',accuracy_score(y_pred=y_h2,y_true=Y[45000:])*100,'%')


SKLEARN NN accuracy of Test:  49.36 %


Using the [SGDClassifier]() from scikit-learn, we will can instanciate a logistic regression classifier that learns from the documents incrementally using stochastic gradient descent. 

In [9]:
from sklearn.linear_model import SGDClassifier
clf = SGDClassifier(loss='log', random_state=1, max_iter=1)
doc_stream = stream_docs(path='shuffled_movie_data.csv')

# Exercise 2: Define at least a Three layer neural network. Define its structure (number of hidden neurons, etc)
# Define a nonlinear function for hidden layers.
# Define a suitable loss function for binary classification
# Implement the backpropagation algorithm for this structure
# Do not use Keras / Tensorflow /PyTorch etc. libraries
# Train the model using SGD

In [10]:
#import pyprind
#pbar = pyprind.ProgBar(45)

classes = np.array([0, 1])
for _ in range(45):
    X_train, y_train = get_minibatch(doc_stream, size=1000)
    X_train = vect.transform(X_train)
    clf.partial_fit(X_train, y_train, classes=classes)
    #pbar.update()

Depending on your machine, it will take about 2-3 minutes to stream the documents and learn the weights for the logistic regression model to classify "new" movie reviews. Executing the preceding code, we used the first 45,000 movie reviews to train the classifier, which means that we have 5,000 reviews left for testing:

In [13]:
X_test, y_test = get_minibatch(doc_stream, size=5000)
X_test = vect.transform(X_test)
print('Accuracy: %.3f' % clf.score(X_test, y_test))
#Exercise 3: compare  with your Neural Network

Accuracy: 0.867


I think that the predictive performance, an accuracy of ~87%, is quite "reasonable" given that we "only" used the default parameters and didn't do any hyperparameter optimization. 

After we estimated the model perfomance, let us use those last 5,000 test samples to update our model.

In [18]:
clf = clf.partial_fit(X_test, y_test)

<br>
<br>

# Model Persistence

In the previous section, we successfully trained a model to predict the sentiment of a movie review. Unfortunately, if we'd close this IPython notebook at this point, we'd have to go through the whole learning process again and again if we'd want to make a prediction on "new data."

So, to reuse this model, we could use the [`pickle`](https://docs.python.org/3.5/library/pickle.html) module to "serialize a Python object structure". Or even better, we could use the [`joblib`](https://pypi.python.org/pypi/joblib) library, which handles large NumPy arrays more efficiently.

To install:
conda install -c anaconda joblib

In [21]:
import joblib
import os
if not os.path.exists('./pkl_objects'):
    os.mkdir('./pkl_objects')
    
joblib.dump(vect, './vectorizer.pkl')
joblib.dump(clf, './clf.pkl')

['./clf.pkl']

Using the code above, we "pickled" the `HashingVectorizer` and the `SGDClassifier` so that we can re-use those objects later. However, `pickle` and `joblib` have a known issue with `pickling` objects or functions from a `__main__` block and we'd get an `AttributeError: Can't get attribute [x] on <module '__main__'>` if we'd unpickle it later. Thus, to pickle the `tokenizer` function, we can write it to a file and import it to get the `namespace` "right".

In [22]:
%%writefile tokenizer.py
from nltk.stem.porter import PorterStemmer
import re
from nltk.corpus import stopwords

stop = stopwords.words('english')
porter = PorterStemmer()

def tokenizer(text):
    text = re.sub('<[^>]*>', '', text)
    emoticons = re.findall('(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    text = re.sub('[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    text = [w for w in text.split() if w not in stop]
    tokenized = [porter.stem(w) for w in text]
    return text

Writing tokenizer.py


In [23]:
from tokenizer import tokenizer
joblib.dump(tokenizer, './tokenizer.pkl')

['./tokenizer.pkl']

Now, let us restart this IPython notebook and check if the we can load our serialized objects:

In [24]:
import joblib
tokenizer = joblib.load('./tokenizer.pkl')
vect = joblib.load('./vectorizer.pkl')
clf = joblib.load('./clf.pkl')

After loading the `tokenizer`, `HashingVectorizer`, and the tranined logistic regression model, we can use it to make predictions on new data, which can be useful, for example, if we'd want to embed our classifier into a web application -- a topic for another IPython notebook.

In [25]:
example = ['I did not like this movie']
X = vect.transform(example)
clf.predict(X)

array([0])

In [26]:
example = ['I loved this movie']
X = vect.transform(example)
clf.predict(X)

array([1])