## Building an RNN to classify text using pre-trained embeddings

We will create a very small dataset of 10 sentences, 8 for training and 2 for testing, to perform classification.

We will use pre-trained embeddings. **Note** that these embeddings are trained on general sentences. If you need more specific embeddings, you must train them on your own examples. In this case we do not have enough examples to do so.



In [0]:
from __future__ import print_function
import keras
import pandas as pd
import re
import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Embedding, LSTM, Bidirectional, Flatten
from keras.models import Sequential
import matplotlib.pyplot as plt
%matplotlib inline

## Load the data
Data is usually stored in CSV files and loaded with pandas. In this case we manually create the sentences to perform a rudimentary sentiment analysis.

In [0]:
traind = {'sentence': [
      'Quella pizza era buona',
      'Questo gelato fa schifo',
      'Il film non mi piace molto',
      'La sua nuova canzone non mi piace',
      'Adoro il gelato',
      'Quella canzone mi piace molto',
      'Che brutto!',
      'Mi piace!'
], 'sentiment': [
      'buono',
      'cattivo',
      'cattivo',
      'cattivo',
      'buono',
      'buono',
      'cattivo',
      'buono'
]}

testd = {'sentence': [
      'La zuppa non mi piace per niente',
      'Questa zuppa sembra ottima'
], 'sentiment': [
      'cattivo',
      'buono'
]}

train_data = pd.DataFrame(data=traind)
test_data = pd.DataFrame(data=testd)

print(train_data)
print(test_data)

Now we have to clean the textual data by converting it to lowercase and removing undesired characters such as punctuation.

In [0]:
train_data['sentence'] = train_data['sentence'].apply(lambda x: x.lower()) # to lowercase
train_data['sentence'] = train_data['sentence'].apply(lambda x: re.sub(r'[^a-z\s]', '', x)) # remove all characters that are not a-z or whitespace

test_data['sentence'] = test_data['sentence'].apply(lambda x: x.lower()) # to lowercase
test_data['sentence'] = test_data['sentence'].apply(lambda x: re.sub(r'[^a-z\s]', '', x)) # remove all characters that are not a-z or whitespace

print(train_data)
print(test_data)
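
To see what the cleaning step does to a single string, here is a minimal sketch using only the standard library. The helper names `clean` and `clean_keep_accents` are made up for this illustration and are not part of the notebook. Note that the pattern `[^a-z\s]` also removes accented letters, which are common in Italian. The second helper keeps a few of them, under the assumption that you want to preserve them.

```python
import re

def clean(sentence):
    """Lowercase, then drop every character that is not a-z or whitespace."""
    return re.sub(r'[^a-z\s]', '', sentence.lower())

def clean_keep_accents(sentence):
    """Same as clean, but keeps some accented vowels (hypothetical variant)."""
    return re.sub(r'[^a-zàèéìòù\s]', '', sentence.lower())

print(repr(clean('Che brutto!')))                    # 'che brutto'
print(repr(clean('Perché è così?')))                 # 'perch  cos'
print(repr(clean_keep_accents('Perché è così?')))    # 'perché è così'
```

None of the sentences in this dataset contain accented letters, so both variants give the same result here.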

The sentences must be split into single words. To do so we can use the Tokenizer.

The `Tokenizer` class of Keras tokenizes a text and maps each word to an integer, so that each sentence becomes a sequence of integers. Using this class we tokenize the sentences of the dataset.
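
To make the mapping concrete, the following is a rough pure-Python sketch of the idea: rank words by frequency, give each word an integer index starting from 1, and pad shorter sequences on the left with 0. The function names are invented for this illustration. It mimics the behaviour of `Tokenizer` and `pad_sequences` on simple input, but it is not the Keras implementation.

```python
from collections import Counter

def fit_word_index(sentences):
    """Rank words by frequency; the most frequent word gets index 1 (0 is kept for padding)."""
    counts = Counter(word for s in sentences for word in s.split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(sentences, word_index):
    """Replace each known word by its index; unknown words are skipped."""
    return [[word_index[w] for w in s.split() if w in word_index] for s in sentences]

def pad_left(sequences, maxlen=None):
    """Left-pad with zeros (and keep the last maxlen items) so all rows have the same length."""
    maxlen = maxlen or max(len(s) for s in sequences)
    return [[0] * (maxlen - len(s)) + s[-maxlen:] for s in sequences]

sentences = ['mi piace', 'adoro il gelato', 'il gelato mi piace']
word_index = fit_word_index(sentences)
print(word_index)   # {'mi': 1, 'piace': 2, 'il': 3, 'gelato': 4, 'adoro': 5}
padded = pad_left(texts_to_sequences(sentences, word_index))
print(padded)       # [[0, 0, 1, 2], [0, 5, 3, 4], [3, 4, 1, 2]]
```

Because the indices depend on word frequencies, fitting again on additional text can change the index of words that were already seen.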

In [0]:
vocab_size = 28 # the data contains 27 different words; index 0 is reserved for padding and the Tokenizer keeps only indices below num_words, so we need at least 28

tokenizer = Tokenizer(num_words=vocab_size, split=' ')
# fit once on all sentences: fitting again later would re-rank the words and change their indices
all_sentences = pd.concat([train_data['sentence'], test_data['sentence']]).values
tokenizer.fit_on_texts(all_sentences)
X_train = tokenizer.texts_to_sequences(train_data['sentence'].values)
X_train = pad_sequences(X_train)
Y_train = pd.get_dummies(train_data['sentiment']).values

print('train')
print(X_train)
print(Y_train)

# same for test data, reusing the fitted tokenizer and the training sequence length
X_test = tokenizer.texts_to_sequences(test_data['sentence'].values)
X_test = pad_sequences(X_test, maxlen=X_train.shape[1])
Y_test = pd.get_dummies(test_data['sentiment']).values

print('test')
print(X_test)
print(Y_test)

In [0]:
embed_dim = 300 # size of Word2Vec embeddings
lstm_out = 5

### Create the model
We use two new layers:

- **Embedding**: Turns positive integers (indexes) into dense vectors of fixed size, e.g. [[4], [20]] -> [[0.25, 0.1], [0.6, -0.2]]. This layer can only be used as the first layer in a model.

- **LSTM**: Implements the Long Short-Term Memory layer.


 `Embedding(input_dim, output_dim, embeddings_initializer='uniform', embeddings_regularizer=None, activity_regularizer=None, embeddings_constraint=None, mask_zero=False, input_length=None)`

- input_dim: int > 0. Size of the vocabulary, i.e. maximum integer index + 1.
- output_dim: int >= 0. Dimension of the dense embedding.
- embeddings_initializer: Initializer for the embeddings matrix 

LSTM has many parameters, some of them are:
- units: Positive integer, dimensionality of the output space.
- activation: Activation function to use (see activations). Default: hyperbolic tangent (tanh). If you pass None, no activation is applied (i.e. "linear" activation: a(x) = x).



In [0]:
model = Sequential()
model.add(Embedding(vocab_size, embed_dim, input_length=X_train.shape[1]))
model.add(LSTM(lstm_out, activation='tanh', dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(2, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
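
The parameter counts printed by `model.summary()` can be checked by hand. The sketch below uses the standard formulas for an embedding table, an LSTM layer with bias, and a dense layer. The function names are invented for this illustration, and the vocabulary size of 28 is an assumption (27 words plus the padding index 0), so substitute the value you actually use.

```python
def embedding_params(vocab_size, embed_dim):
    # one vector of size embed_dim per vocabulary index
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # four gates, each with input weights, recurrent weights, and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # weight matrix plus one bias per output unit
    return input_dim * units + units

example_vocab_size, embed_dim, lstm_out = 28, 300, 5  # example_vocab_size is an assumption
print(embedding_params(example_vocab_size, embed_dim))  # 8400
print(lstm_params(embed_dim, lstm_out))                  # 6120
print(dense_params(lstm_out, 2))                         # 12
```

With only 8 training sentences, even this small model has thousands of trainable parameters, most of them in the embedding table.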


In [0]:
history = model.fit(X_train, Y_train, epochs=50, 
                    batch_size=100, verbose=2, shuffle=True)



In [0]:
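fig, ax = plt.subplots()
# the accuracy key is "acc" in older Keras releases and "accuracy" in newer ones
acc_key = "acc" if "acc" in history.history else "accuracy"
ax.plot(history.history["loss"],'r', marker='.', label="Train Loss")
ax.plot(history.history[acc_key],'g', marker='.', label="Train acc")
ax.legend()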

In [0]:
model.predict(X_test)
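
Each row returned by `model.predict` holds one probability per class. Since the labels were encoded with `pd.get_dummies`, the columns follow the alphabetical order of the class names, here `buono` then `cattivo`. A minimal sketch of how to turn such rows back into labels, using made-up probabilities rather than real model output:

```python
labels = sorted({'buono', 'cattivo'})  # same column order as pd.get_dummies

def decode(probabilities, labels):
    """Return the label with the highest probability for each row."""
    return [labels[max(range(len(row)), key=row.__getitem__)] for row in probabilities]

# made-up values with the same shape as model.predict(X_test)
example_output = [[0.3, 0.7], [0.6, 0.4]]
print(decode(example_output, labels))  # ['cattivo', 'buono']
```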

In [0]:
model.evaluate(X_test, Y_test)

Let's try another network.

We try to use a Bidirectional LSTM, where the sequence is processed both forward and backward.

**Bidirectional** is a bidirectional wrapper for RNNs.
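
As a rough illustration of the idea, the toy sketch below runs a one-unit recurrence over a sequence once from left to right and once from right to left, each with its own weights, and concatenates the two final states. This is a simplified stand-in, not an LSTM and not the Keras implementation. The function names and weight values are arbitrary.

```python
import math

def run_rnn(xs, w, u):
    """Toy one-unit recurrence: h_t = tanh(w * x_t + u * h_{t-1}); returns the final state."""
    h = 0.0
    for x in xs:
        h = math.tanh(w * x + u * h)
    return h

def run_bidirectional(xs, fwd=(0.5, 0.8), bwd=(0.3, 0.6)):
    """Concatenate the final state of a forward pass and of a backward pass."""
    forward = run_rnn(xs, *fwd)
    backward = run_rnn(list(reversed(xs)), *bwd)
    return [forward, backward]

out = run_bidirectional([1.0, -2.0, 0.5])
print(len(out))  # 2: one value per direction
```

The output is twice as wide as that of a single direction, which is why a `Bidirectional(LSTM(lstm_out))` layer feeds `2 * lstm_out` values into the next layer.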

In [0]:
model1 = Sequential()
model1.add(Embedding(vocab_size, embed_dim, input_length=X_train.shape[1]))
model1.add(Bidirectional(LSTM(lstm_out, activation='tanh', dropout=0.2, recurrent_dropout=0.2)))
model1.add(Dense(2, activation='softmax'))
model1.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model1.summary()

history = model1.fit(X_train, Y_train, epochs=50, 
                    batch_size=100, verbose=2, shuffle=True)


In [0]:
model1.predict(X_test)

In [0]:
model1.evaluate(X_test, Y_test)

We can also add more LSTM layers.

To do so we must set the parameter `return_sequences` to True in all the LSTM layers except the last.
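
The reason is a matter of shapes: an LSTM layer expects a 3D input (batch, timesteps, features), and only returns a 3D output when `return_sequences=True`. The sketch below traces the shapes through a stack like the one that follows. The helper `lstm_output_shape` is invented for this illustration, and the 7 timesteps are an assumption about the padded sentence length.

```python
def lstm_output_shape(input_shape, units, return_sequences=False, bidirectional=False):
    """Shape produced by an LSTM layer for a (batch, timesteps, features) input."""
    batch, timesteps, _ = input_shape  # fails if the input is not 3D
    features = 2 * units if bidirectional else units
    if return_sequences:
        return (batch, timesteps, features)
    return (batch, features)

shape = (None, 7, 300)  # assumed Embedding output: (batch, timesteps, embed_dim)
shape = lstm_output_shape(shape, 5, return_sequences=True, bidirectional=True)
print(shape)  # (None, 7, 10)
shape = lstm_output_shape(shape, 5, return_sequences=True)
print(shape)  # (None, 7, 5)
shape = lstm_output_shape(shape, 5)
print(shape)  # (None, 5)
```

The last layer drops the time dimension, so its 2D output can go straight into the `Dense` layer but could not feed another LSTM.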

In [0]:
model2 = Sequential()
model2.add(Embedding(vocab_size, embed_dim, input_length=X_train.shape[1]))
model2.add(Bidirectional(LSTM(lstm_out, return_sequences=True, # this option must be used if we have more LSTM
                             activation='tanh', dropout=0.2, recurrent_dropout=0.2)))
model2.add(LSTM(lstm_out, return_sequences=True))
model2.add(LSTM(lstm_out))
model2.add(Dense(2, activation='softmax'))
model2.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model2.summary()

history = model2.fit(X_train, Y_train, epochs=50, 
                    batch_size=100, verbose=2, shuffle=True)

In [0]:
model2.predict(X_test)

In [0]:
model2.evaluate(X_test, Y_test)

### Use of pretrained Embeddings
As seen, performance is very poor because of the very small size of the dataset.

We can try to use pre-trained embeddings (with this dataset we probably won't get good results either).

To do so we must download a pre-trained model. There are models for Word2Vec, GloVe, and many others, and each may come in different versions tailored to different languages.

In our case, we should download a model for the Italian language. Such models are usually around 1-2 GB.

Once downloaded we must load them:

In [0]:
from gensim.models import Word2Vec

# Here we load a pretrained model and keep only the embeddings we need
# GloVe downloadable here: https://drive.google.com/file/d/1ZODMv0guq8OgZN0fq_V3J3s5ZqsjW99U/view?usp=sharing
# m = Word2Vec.load('./glove_WIKI')
# Word2Vec downloadable here: https://drive.google.com/file/d/17LPOi8aVISuwq4g5hn2Ddj-cQYrbXlgZ/view?usp=sharing
# the file must be downloaded to the working directory before running this cell
m = Word2Vec.load(
    './wiki_iter=5_algorithm=skipgram_window=10_size=300_neg-samples=10.m')
embedding_matrix = np.zeros((vocab_size, embed_dim))

for word, i in tokenizer.word_index.items():
  # skip indices outside the matrix and words missing from the pre-trained model
  if i < vocab_size and word in m.wv:
    print(word)
    embedding_matrix[i] = m.wv[word] # we take here only the embedding of the words we have in our vocabulary
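
The same construction can be tried without downloading anything. In the sketch below a small dictionary with made-up 2-dimensional vectors stands in for the pre-trained model, and plain lists stand in for the NumPy matrix. All names and values are hypothetical.

```python
def build_embedding_matrix(word_index, pretrained, vocab_size, embed_dim):
    """Row i holds the pre-trained vector of the word with index i, or zeros if unknown."""
    matrix = [[0.0] * embed_dim for _ in range(vocab_size)]
    for word, i in word_index.items():
        if i < vocab_size and word in pretrained:
            matrix[i] = list(pretrained[word])
    return matrix

word_index = {'gelato': 1, 'pizza': 2, 'zuppa': 3}
pretrained = {'gelato': [0.1, 0.2], 'pizza': [0.3, 0.4]}  # made-up vectors
matrix = build_embedding_matrix(word_index, pretrained, vocab_size=4, embed_dim=2)
print(matrix)  # [[0.0, 0.0], [0.1, 0.2], [0.3, 0.4], [0.0, 0.0]]
```

Row 0 stays at zero because index 0 is the padding index, and the row for `zuppa` stays at zero because the stand-in model has no vector for it.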

Now we can build our model.

In [0]:
model2 = Sequential()
# initialize the Embedding layer with the pre-trained vectors and keep them fixed
model2.add(Embedding(vocab_size, embed_dim, weights=[embedding_matrix],
                     input_length=X_train.shape[1], trainable=False))
model2.add(Bidirectional(LSTM(lstm_out, return_sequences=True, # required when another LSTM layer follows
                             activation='tanh', dropout=0.2, recurrent_dropout=0.2)))
model2.add(LSTM(lstm_out, return_sequences=True))
model2.add(LSTM(lstm_out))
model2.add(Dense(2, activation='softmax'))
model2.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model2.summary()

What changes here is the `Embedding` layer, where we initialize the weights with those of the pre-trained model (`weights=[embedding_matrix]`).
Moreover, we can decide whether to tune these weights during training or keep them fixed by using the parameter `trainable`.