# Sentiment analysis

Sentiment analysis is the task of classifying sentences according to a subjective notion or an affective state. In this notebook, we will perform such an analysis on movie reviews from the IMDb database. This dataset contains thousands of movie reviews, each labelled with its sentiment (positive or negative). We will train different models to classify the reviews.

#### Below is example code to load the data. Each review is returned as a list of integers. Each integer represents a word (the lowest integers correspond to the most frequent words).

In [None]:
from tensorflow.keras.datasets import imdb

vocab_size = 10000 # maximum number of words to keep (keeps the most frequent)
maxlen = 200 # maximum review length in words (load_data discards longer reviews, it does not truncate them)

(x_train, y_train), (x_test, y_test) = imdb.load_data(path="imdb.npz",
                                                      num_words=vocab_size,
                                                      seed=0,
                                                      index_from=3,
                                                      maxlen=maxlen)



#### Below are some helper functions to recover the actual text from the word ids

In [None]:
word_to_id = imdb.get_word_index()
word_to_id = {k:(v+3) for k,v in word_to_id.items()}
word_to_id["<PAD>"] = 0
word_to_id["<START>"] = 1
word_to_id["<UNK>"] = 2
word_to_id["<UNUSED>"] = 3

id_to_word = {value:key for key,value in word_to_id.items()}

def word_from_id(i):
    return id_to_word[i]

def words_from_id_list(l):
    return ' '.join(word_from_id(i) for i in l)

print(len(id_to_word))
words_from_id_list(x_train[1])
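
To make the `+3` offset easier to follow, here is a minimal sketch on a toy vocabulary. The three-word index and the `decode` helper are hypothetical and only stand in for `imdb.get_word_index()` and `words_from_id_list`. Unlike the cell above, this sketch falls back to `<UNK>` for ids that are not in the mapping instead of raising a `KeyError`.

```python
# Hypothetical toy index: word -> frequency rank (1 = most frequent),
# mimicking the structure returned by imdb.get_word_index().
toy_index = {"the": 1, "movie": 2, "great": 3}

OFFSET = 3  # same role as index_from=3 in the loading cell

# Shift every rank by the offset so that ids 0..3 stay reserved.
toy_word_to_id = {word: rank + OFFSET for word, rank in toy_index.items()}
toy_word_to_id.update({"<PAD>": 0, "<START>": 1, "<UNK>": 2, "<UNUSED>": 3})

toy_id_to_word = {i: word for word, i in toy_word_to_id.items()}


def decode(ids):
    """Join the words for a list of ids, using <UNK> for unmapped ids."""
    return " ".join(toy_id_to_word.get(i, "<UNK>") for i in ids)


encoded = [1, 4, 5, 2, 6]
print(decode(encoded))  # <START> the movie <UNK> great
```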

### Using 1D convolution

In this part, we will assume that all sentences have the same length. This can be achieved using `sequence.pad_sequences(corpus, maxlen=maxlen)`, where `sequence` is a module in `tensorflow.keras.preprocessing`. By default, this method pads the sentences at their beginning with a dedicated padding word id (0 by default).

#### Pad the train and test sentences to the right, up to `maxlen=200` words, using the `pad_sequences` method (pass `padding='post'`, since the default pads at the beginning). Check the shapes of the padded sequences.
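
The difference between padding at the beginning and padding to the right can be illustrated with a small pure-Python sketch. The `pad_ids` helper below is hypothetical: it only mimics, for a single sequence, what `pad_sequences` does with its `padding` and `truncating` arguments (both `'pre'` by default), and it is not the library implementation.

```python
def pad_ids(seq, maxlen, pad_id=0, padding="pre", truncating="pre"):
    """Toy version of padding/truncating one sequence to exactly `maxlen` ids."""
    seq = list(seq)
    if len(seq) > maxlen:
        # "pre" drops ids from the beginning, "post" drops them from the end.
        seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
    fill = [pad_id] * (maxlen - len(seq))
    # "pre" puts the padding before the words, "post" puts it after them.
    return fill + seq if padding == "pre" else seq + fill


short = [1, 14, 22]
print(pad_ids(short, 5))                       # [0, 0, 1, 14, 22]
print(pad_ids(short, 5, padding="post"))       # [1, 14, 22, 0, 0]
print(pad_ids([1, 2, 3, 4, 5, 6, 7], 5))       # [3, 4, 5, 6, 7]
```

With the real data, the corresponding call would look like `sequence.pad_sequences(x_train, maxlen=maxlen, padding='post')`, and the result should have shape `(number_of_reviews, 200)`.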

#### Define a model which takes a sentence as input (i.e. 200 word ids) and applies an embedding, a `Conv1D` layer and a sigmoid output to classify it. Check the documentation of the `Embedding` layer.

In [None]:

def build_cnn_model():
    # TODO: build the Embedding -> Conv1D -> sigmoid model and assign it to `model`
    return model
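
One possible shape for such a model is sketched below. It is only an illustration, not the expected solution: the function name `build_cnn_sketch`, the embedding size, the number of filters, the kernel size and the use of `GlobalMaxPooling1D` to collapse the time dimension are all assumptions chosen for the example.

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import (Conv1D, Dense, Embedding,
                                     GlobalMaxPooling1D, Input)


def build_cnn_sketch(vocab_size=10000, maxlen=200, embed_dim=32):
    """Illustrative model: embedding -> 1D convolution -> pooling -> sigmoid."""
    model = Sequential([
        Input(shape=(maxlen,), dtype="int32"),             # one padded review
        Embedding(input_dim=vocab_size, output_dim=embed_dim),  # ids -> vectors
        Conv1D(filters=64, kernel_size=5, activation="relu"),   # local word patterns
        GlobalMaxPooling1D(),                              # strongest response per filter
        Dense(1, activation="sigmoid"),                    # probability of "positive"
    ])
    return model


cnn_sketch = build_cnn_sketch()
cnn_sketch.summary()
```

The pooling layer is one way to go from a sequence of convolution outputs to a single vector; flattening the convolution output would be another option.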

#### Compile your model with an appropriate loss and optimizer. Can you think of a metric to monitor as well?

#### Fit your model, using a stopping criterion on validation data split from the training set.

#### Evaluate your model
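
The compile, fit and evaluate steps can be sketched end to end as follows. To keep the example fast and self-contained it runs on small random data with a deliberately tiny model, so the numbers it produces are meaningless. The names `x_demo`, `y_demo` and `demo_model`, as well as the choices of `adam`, `patience=2` and `validation_split=0.2`, are assumptions for the illustration.

```python
import numpy as np
from tensorflow.keras import Sequential
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.layers import (Dense, Embedding,
                                     GlobalAveragePooling1D, Input)

# Random stand-in data: 64 "reviews" of 20 word ids, with binary labels.
rng = np.random.default_rng(0)
x_demo = rng.integers(1, 50, size=(64, 20))
y_demo = rng.integers(0, 2, size=(64,))

demo_model = Sequential([
    Input(shape=(20,), dtype="int32"),
    Embedding(input_dim=50, output_dim=8),
    GlobalAveragePooling1D(),
    Dense(1, activation="sigmoid"),
])

# Binary labels with a sigmoid output: binary cross-entropy, monitored with accuracy.
demo_model.compile(loss="binary_crossentropy", optimizer="adam",
                   metrics=["accuracy"])

# Stop when the validation loss has not improved for 2 epochs.
early_stop = EarlyStopping(monitor="val_loss", patience=2,
                           restore_best_weights=True)

history = demo_model.fit(x_demo, y_demo,
                         validation_split=0.2,  # held out from the training data
                         epochs=5, batch_size=16,
                         callbacks=[early_stop], verbose=0)

# In the notebook, evaluation would use the padded test set instead.
loss, acc = demo_model.evaluate(x_demo, y_demo, verbose=0)
print(f"loss={loss:.3f}, accuracy={acc:.3f}")
```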

### Using LSTM

#### Define a new model, this time containing an embedding, an LSTM and a Dense layer, to perform the same classification task. Then, train it and evaluate it as above.

In [None]:
from tensorflow.keras.layers import LSTM

def build_lstm_model():
    # TODO: build the Embedding -> LSTM -> Dense model and assign it to `model`
    return model

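
As a starting point, a recurrent model of this kind might look like the sketch below. Again this is only an illustration under assumed hyperparameters: the name `build_lstm_sketch`, the embedding size and the number of LSTM units are not prescribed by the exercise.

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Embedding, Input, LSTM


def build_lstm_sketch(vocab_size=10000, maxlen=200, embed_dim=32, units=32):
    """Illustrative model: embedding -> LSTM -> sigmoid."""
    model = Sequential([
        Input(shape=(maxlen,), dtype="int32"),                  # one padded review
        Embedding(input_dim=vocab_size, output_dim=embed_dim),  # ids -> vectors
        LSTM(units),                     # returns only the final hidden state
        Dense(1, activation="sigmoid"),  # probability of "positive"
    ])
    return model


lstm_sketch = build_lstm_sketch()
lstm_sketch.summary()
```

It can then be compiled, fitted and evaluated in the same way as the convolutional model.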