# Opinion Mining & Sentiment Analysis: Deep Learning Techniques

**Text Mining unit**

_Prof. Gianluca Moro^, Dott. Ing. Nicola Piscaglia° – DISI, University of Bologna_

^name.surname@unibo.it


°name.surname@bbs.unibo.it


**Bologna Business School** - Alma Mater Studiorum Università di Bologna

## Setup

Import external libraries (thus verifying they are correctly installed)

In [None]:
%tensorflow_version 1.x

import gzip
import numpy as np
import pandas as pd
import gensim
import tensorflow as tf
import keras
import matplotlib.pyplot as plt
%matplotlib inline

Check GPU and limit memory usage

In [None]:
devices = tf.config.experimental_list_devices()

[print(device) for device in devices] # print all devices

#!nvidia-smi # check GPU configuration

if devices:
  try:
    gpus = tf.config.experimental.list_physical_devices('GPU')
    for gpu in gpus:
      # allocate GPU memory on demand instead of reserving it all upfront
      tf.config.experimental.set_memory_growth(gpu, True)
    logical_gpus = tf.config.experimental.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
  except RuntimeError as e:
    # Memory growth must be set before GPUs have been initialized
    print(e)

Define a utility function to download data files if they are not already present in the working directory

In [None]:
import os
from urllib.request import urlretrieve
def download(file, url):
    if not os.path.isfile(file):
        urlretrieve(url, file)

## Deep Learning and Neural Networks for Text Mining and Sentiment Analysis

_Deep learning_ denotes a general approach to machine learning employing **multi-layered models** to obtain **accurate representations of input data**: features are no longer extracted manually, but inferred in the learning process

DL is mostly based on _neural networks_, very flexible learning models with arbitrary complexity

The use of DL and NN allows deep understanding of text without manually defining rules, lexicons, etc.

## TensorFlow and Keras

- **TensorFlow** by Google is one of the most used computation frameworks for deep learning
  - TF works by building a _computational graph_ where each node represents an operation between _tensors_ (N-dimensional arrays)
    - sums, products, derivatives, ...
  - a graph can run either on CPU or (where available) on GPU for accelerated parallel computation
- **Keras** provides a high-level API for building and training neural networks using TensorFlow as a backend
  - networks can be built simply by stacking different layers with many configurable (hyper)parameters
  - high-level commands are provided to train and evaluate networks on given datasets

## Dataset: Movie Reviews

We have a collection of user reviews extracted from IMDb (the _Internet Movie Database_) labeled as positive or negative

We want to train a model to understand the positive or negative orientation of any review

We start by loading the training dataset, containing 25,000 samples with two attributes
- `label` indicates the orientation of the review, can be "pos" or "neg"
- `text` contains the full text of the review

In [None]:
download("imdb-train.csv.gz", "https://github.com/datascienceunibo/bbs-dl-lab-2019/raw/master/imdb-train.csv.gz")

In [None]:
train_set = pd.read_csv("imdb-train.csv.gz", sep="\t", names=["label", "text"])

In [None]:
train_set.shape

Let's view some rows of the dataset, after increasing the length of shown text

In [None]:
pd.options.display.max_colwidth = 100

In [None]:
train_set.head()

Some HTML tags are present within reviews, specifically `<br />` to indicate newlines: we write a function which, applied on a text, replaces such tags with ASCII newline `\n`

In [None]:
def strip_tags(text):
    return text.replace("<br />", "\n")

We apply the function to all texts in the dataset

In [None]:
train_set["text"] = train_set["text"].apply(strip_tags)

Positive and negative reviews are evenly distributed

In [None]:
train_set["label"].value_counts()

## Multi-Layer Perceptron

In their usual form, neural networks are composed of a stack of _densely connected_ layers of nodes: each node in a layer receives the output of all nodes of the underlying layer. Such networks are also known as _multi-layer perceptrons_.

An MLP receives a vector as input and its topmost layer produces a vector as output; an arbitrary number of _hidden layers_ can be inserted in between to produce intermediate representations of data

Let's start by training a neural network for sentiment classification fed with vector space model representations of reviews

We initialize a vector space using tf.idf term weighting and filtering out very rare terms

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer(min_df=3)

Such vector space is built upon the training reviews and their document-term matrix is produced

In [None]:
train_dtm = vect.fit_transform(train_set["text"])
train_dtm

Similarly to training matrices for scikit-learn models, this is a 2D array where each row (1st axis) is a training observation and each column (2nd axis) is a feature: each row is a possible input to the neural network
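
The tf.idf weights stored in this matrix can be approximated by the following sketch. It assumes raw term counts, the smoothed idf `ln((1 + n) / (1 + df)) + 1` and L2 row normalization, which scikit-learn documents as `TfidfVectorizer` defaults; tokenization is left out, and `tfidf_rows` and the toy documents are hypothetical names introduced only for illustration.

```python
import math
from collections import Counter

def tfidf_rows(tokenized_docs, min_df=1):
    """Toy tf.idf: raw counts times smoothed idf, rows scaled to unit L2 norm."""
    n_docs = len(tokenized_docs)
    # document frequency: in how many documents each term appears
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    # drop terms appearing in fewer than min_df documents
    vocab = sorted(term for term, count in df.items() if count >= min_df)
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in tokenized_docs:
        counts = Counter(t for t in doc if t in idf)
        weights = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows

# hypothetical, already tokenized mini-corpus
toy_docs = [["great", "movie"], ["terrible", "movie"], ["great", "great", "cast"]]
toy_vocab, toy_rows = tfidf_rows(toy_docs)
print(toy_vocab)
print(toy_rows[1])
```

Each row has one entry per vocabulary term, and rarer terms (here "terrible") receive a higher weight than terms shared by many documents (here "movie").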

We extract the number of distinct terms in the vector space, used to define the structure of the neural network

In [None]:
num_terms = len(vect.get_feature_names_out())

In [None]:
num_terms

We want our neural network to indicate in output the correct class of each review, either "pos" or "neg"

The common approach to classification with neural networks is to have one output node for each class and train them to output 1 on the right node and 0 on the others

For this, we must extract from the `label` column a "target" matrix, where each row contains the values which the network should give as output for each review
- `[1, 0]` for positive reviews
- `[0, 1]` for negative reviews

We define a function `make_target` which converts a given pos/neg labels series into a target matrix

In [None]:
def make_target(labels):
    return pd.DataFrame({
        "pos": labels == "pos", # if the label is "pos" then return 1 else return 0
        "neg": labels == "neg" # if the label is "neg" then return 1 else 0
    }).astype(int)

We then apply it to the training set labels

In [None]:
train_target = make_target(train_set["label"])

We obtain a matrix where each row is the expected network output, either `[1, 0]` (positive) or `[0, 1]` (negative)

In [None]:
train_target.head()

Let's define the structure of the neural network

We create a _sequential_ model, i.e. we define the network as a sequence of interconnected layers (alternatively, we could create non-sequential structures by manually connecting layers to each other using the Keras Functional API. To learn more, see: https://keras.io/guides/functional_api/)

In [None]:
from keras.models import Sequential
model = Sequential()

In [None]:
#Functional API version
# inputs = keras.Input(shape=(num_terms,))

In this first example we create a single-layered network, where inputs are directly connected to the output nodes

As discussed above, the output nodes must be 2, one for each class; we use the _softmax_ activation function to ensure that the output is a valid probability distribution
- we will never get a perfect `[1, 0]` as output in practice, but we will get outputs like `[0.99, 0.01]`
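
As a minimal sketch of what the softmax activation computes, the NumPy function below exponentiates the raw output scores and normalizes them so that they sum to 1. The scores are hypothetical values, not outputs of the network defined here.

```python
import numpy as np

def softmax(scores):
    """Turn raw output scores into a probability distribution (last axis)."""
    # subtracting the row maximum does not change the result but avoids overflow
    shifted = scores - np.max(scores, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

raw_scores = np.array([[4.0, -1.0], [0.2, 0.3]])  # hypothetical scores for two reviews
probs = softmax(raw_scores)
print(probs)
```

A large gap between the two scores yields a confident output close to `[1, 0]`, while similar scores yield probabilities close to `[0.5, 0.5]`.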

In the first layer (in this case the only one) we also have to specify with `input_dim` the size of input vectors, in this case the number of terms in the vector space

In [None]:
from keras.layers import Dense
model.add(Dense(2, activation="softmax", input_dim=num_terms))

In [None]:
# Functional API version
# outputs = keras.layers.Dense(2, activation="softmax")(inputs)
# model = keras.Model(inputs=inputs, outputs=outputs, name="my_model")

With `summary` we can analyze the structure of our network and get the count of trainable parameters

In [None]:
model.summary()

In this case we have 35,852×2 = 71,704 weights + 2 biases = 71,706 trainable parameters
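
The count above can be reproduced with a small helper: a dense layer has one weight per input-node pair plus one bias per node. The function names are hypothetical, and the second call anticipates the 128-node hidden layer introduced later in this section.

```python
def dense_params(n_inputs, n_units):
    """Weights (one per input-unit pair) plus one bias per unit."""
    return n_inputs * n_units + n_units

def mlp_params(layer_sizes):
    """Total parameters of stacked dense layers; layer_sizes starts with the input size."""
    return sum(dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:]))

single_layer = mlp_params([35852, 2])       # inputs connected directly to the 2 outputs
with_hidden = mlp_params([35852, 128, 2])   # same input size with a 128-node hidden layer
print(single_layer, with_hidden)
```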

After defining the network structure, we _compile_ it to provide some general settings of the network and initialize accordingly the underlying TensorFlow data structures

In [None]:
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

- The _optimizer_ is the algorithm used to train the network: _Adam_ and other valid options are different variants of stochastic gradient descent (SGD), including learning rate decay, momentum, etc.
- The _loss_ is the measure to be minimized in the training process: the _cross entropy_ penalizes outputs which are not close to 1 on the correct class
- Additional _metrics_ can be computed for evaluation purposes: we use the accuracy, i.e. the percentage of correctly classified examples
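
To make the loss and the metric concrete, here is a NumPy sketch of categorical cross entropy and accuracy computed on hypothetical network outputs. It illustrates the definitions only and is not the code Keras runs internally.

```python
import numpy as np

def categorical_crossentropy(targets, probs, eps=1e-7):
    """Mean over examples of -sum(target * log(predicted probability))."""
    probs = np.clip(probs, eps, 1.0)  # avoid log(0)
    return float(np.mean(-np.sum(targets * np.log(probs), axis=1)))

def accuracy(targets, probs):
    """Fraction of examples whose most probable class is the target class."""
    return float(np.mean(np.argmax(probs, axis=1) == np.argmax(targets, axis=1)))

targets = np.array([[1, 0], [0, 1], [1, 0]])            # pos, neg, pos
probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.4, 0.6]])  # hypothetical network outputs
loss = categorical_crossentropy(targets, probs)
acc = accuracy(targets, probs)
print(loss, acc)
```

The third example is misclassified, so it lowers the accuracy and contributes the largest term to the loss.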

We are now ready to train (_fit_) the network on given training examples, composed of inputs (tf.idf vectors) and target outputs (`[1, 0]` for positive reviews and `[0, 1]` for negative)

The training examples are shuffled and used to run SGD steps on _minibatches_ of a specified size (`batch_size`): this process is repeated for a given number of training _epochs_
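
The shuffling and batching can be sketched in plain Python as follows. The helper `minibatch_indices` is a hypothetical illustration of the idea, not the routine Keras uses; with 25,000 examples and a batch size of 200, each epoch consists of 125 SGD steps.

```python
import math
import random

def minibatch_indices(n_samples, batch_size, seed=0):
    """Yield lists of shuffled example indices, one list per SGD step."""
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)  # new random order of the examples
    for start in range(0, n_samples, batch_size):
        yield order[start:start + batch_size]

batches = list(minibatch_indices(n_samples=25000, batch_size=200))
steps_per_epoch = len(batches)
print(steps_per_epoch, math.ceil(25000 / 200))
```

A smaller batch size gives more (and noisier) update steps per epoch over the same data.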

The `callbacks` parameter lets us define a list of callback objects whose methods are called by Keras at specific points of training, e.g. at the end of each epoch. This option is mainly used to implement training logic (e.g. `EarlyStopping`), but if we ran out of memory we could also use it to call `gc.collect()` after each epoch to free some memory.

In [None]:
# Garbage Collector library
import gc

# Custom Callback To Include in Callbacks List At Training Time
class GarbageCollectorCallback(keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        gc.collect()

fit_history = model.fit(train_dtm, train_target, batch_size=200, epochs=10, callbacks=[GarbageCollectorCallback()])

During the training process we see how loss and accuracy measured on the training set vary; their evolution can also be obtained from the "history" object returned by `fit`

In [None]:
plt.plot(fit_history.history["loss"], "ro-")
plt.plot(fit_history.history["accuracy"], "bo-")
plt.legend(["Train loss", "Train accuracy"]);

We can see in the plot how the loss progressively decreases and accuracy progressively increases through training epochs

To get the raw output given by the network for a given input, we use the `predict` method: let's see for example the output for the first training review (labeled positive)

In [None]:
train_dtm[0]

In [None]:
model.predict(train_dtm[0])

We see that the first class (positive) has higher probability

In older Keras versions, we can directly get the predicted class index with `predict_classes` (the method has been removed from recent releases)

In [None]:
# model.predict_classes(train_dtm[0].toarray()) --> [0]

or...

In [None]:
# Alternative Version working with Functional API / Tensorflow 2.0
prediction = model.predict(train_dtm[0])
print(prediction)

np.argmax(prediction, axis=-1)

Let's now evaluate the network on a separate test set of labeled reviews, provided in the `imdb-test.csv.gz` file

In [None]:
download("imdb-test.csv.gz", "https://github.com/datascienceunibo/bbs-dl-lab-2019/raw/master/imdb-test.csv.gz")

In [None]:
test_set = pd.read_csv("imdb-test.csv.gz", sep="\t", names=["label", "text"])

In [None]:
test_set.head(5)

Also in this dataset we have 25,000 reviews evenly distributed

In [None]:
test_set["label"].value_counts()

As before, we apply the HTML strip function to reviews

In [None]:
test_set["text"] = test_set["text"].apply(strip_tags)

We represent the test reviews in the vector space created on training reviews

In [None]:
test_dtm = vect.transform(test_set["text"])
test_dtm

We then convert pos/neg labels for test examples into target vectors

In [None]:
test_target = make_target(test_set["label"])
test_target

After processing the test set, we can feed it to the neural network for evaluation using the `evaluate` method

In [None]:
model.evaluate(test_dtm, test_target)

The method reports the loss (first value) and the accuracy (second value) measured on the given test set: our final goal is to maximize the accuracy

Let's now introduce a _hidden layer_ in the network between input and output, for example a layer of 128 nodes with linear activation which receives the input vectors

In [None]:
model = Sequential()
model.add(Dense(128, input_dim=num_terms))

In [None]:
# Functional API version
# inputs = keras.layers.Input(shape=(num_terms,))
# x = Dense(128)(inputs)

The output of these 128 nodes will be fed to the output layer, composed as above of 2 nodes with softmax activation

In [None]:
model.add(Dense(2, activation="softmax"))

In [None]:
# Functional API Version
# outputs = Dense(2, activation="softmax")(x)
# model = keras.Model(inputs = inputs, outputs=outputs, name="model_with_hidden_layer")

The number of network parameters to be trained is now much higher

In [None]:
model.summary()

Let's compile the network as before

In [None]:
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

To keep compute times limited, we fit this and subsequent networks running only 3 training epochs

In [None]:
model.fit(train_dtm, train_target, batch_size=200, epochs=3, callbacks=[GarbageCollectorCallback()])

Let's evaluate this new network on the same test set as before

In [None]:
model.evaluate(test_dtm, test_target)

Thanks to the hidden layer we had a very slight improvement, despite the lower number of training epochs

To make the model more expressive, we have to introduce non-linearity in hidden layers: for example, we replicate the model above using sigmoid activation in the hidden layer
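
The need for non-linearity can be checked numerically: the scores produced by a linear hidden layer followed by a linear output layer can always be reproduced by one single linear layer, whereas this no longer holds once a sigmoid is applied in between. The sketch below uses small random matrices as hypothetical stand-ins for trained weights and looks at the scores before the softmax.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 10))                           # 5 hypothetical inputs of size 10
W1, b1 = rng.normal(size=(10, 4)), rng.normal(size=4)  # linear hidden layer
W2, b2 = rng.normal(size=(4, 2)), rng.normal(size=2)   # output layer, before softmax

two_layers = (x @ W1 + b1) @ W2 + b2
one_layer = x @ (W1 @ W2) + (b1 @ W2 + b2)             # equivalent single linear layer

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

with_sigmoid = sigmoid(x @ W1 + b1) @ W2 + b2          # no longer a linear function of x

print(np.allclose(two_layers, one_layer))
print(np.allclose(with_sigmoid, one_layer))
```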

We can create the model more concisely by providing the list of layers to be stacked

In [None]:
model = Sequential([
    Dense(128, activation="sigmoid", input_dim=num_terms),
    Dense(2, activation="softmax")
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

In [None]:
model.fit(train_dtm, train_target, batch_size=200, epochs=3, callbacks=[GarbageCollectorCallback()])

In [None]:
model.evaluate(test_dtm, test_target)

Finally, let's test a deep model with three non-linear hidden layers

In [None]:
model = Sequential([
    Dense(256, activation="sigmoid", input_dim=num_terms),
    Dense(64, activation="sigmoid"),
    Dense(16, activation="sigmoid"),
    Dense(2, activation="softmax")
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

In [None]:
model.summary()

In [None]:
model.fit(train_dtm, train_target, batch_size=200, epochs=3)

In [None]:
model.evaluate(test_dtm, test_target)

Once the right network structure has been found, you can also tune the regularization of your neural network by adding Dropout layers or playing with L1/L2 regularization values as in the following training example. Notice that due to the regularization effect, more training epochs could be needed.

In [None]:
from keras import regularizers

model = Sequential([
    Dense(256, activation="sigmoid", input_dim=num_terms, kernel_regularizer=regularizers.L1L2(l1=1e-8, l2=1e-6)),
    Dense(64, activation="sigmoid"),
    Dense(16, activation="sigmoid"),
    Dense(2, activation="softmax")
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

In [None]:
model.fit(train_dtm, train_target, batch_size=200, epochs=3)

In [None]:
model.evaluate(test_dtm, test_target)

## Word Embedding

A _word embedding_ model is a dictionary mapping each known word to an **N-dimensional vector**

Such a model is built by training a neural network on a large collection of text to predict the most likely word in a context defined by other words
- training is unsupervised: no labeling of text is needed


The resulting vector of each word somehow denotes its meaning: **semantically similar words are represented with similar vectors**. Moreover, operations between vectors can be used to **find words semantically related** to each other.

Word embedding models can be used to represent text in NLP tasks, including sentiment analysis

The **gensim** library provides means to represent and build word embedding models

## Training a Word2Vec model

We have a set of 5,000 movie reviews without any labeling: we can't train a sentiment classifier on them but we can train a word embedding model

We read the compressed text file `imdb-unsup-5k.txt.gz`, containing one review per line

In [None]:
download("imdb-unsup-5k.txt.gz", "https://github.com/datascienceunibo/bbs-dl-lab-2019/raw/master/imdb-unsup-5k.txt.gz")

In [None]:
with gzip.open("imdb-unsup-5k.txt.gz", "rt", encoding="utf8") as f: # open gzip file containing the dataset
    we_train_set = [strip_tags(line.strip()) for line in f] # for each line (review) of the file, strip the string and map tags

We have to preprocess each review by splitting text into tokens

gensim, the library used to train the word embedding model, provides a simple utility function for this
- alternatively any tokenization function can be used, e.g. `nltk.word_tokenize`

In [None]:
from gensim.utils import simple_preprocess

In [None]:
%%time

# we_train_tokens is a list with one entry per review: each entry is the list of tokens of that review
we_train_tokens = [simple_preprocess(text) for text in we_train_set]

# Wall time: time elapsed according to the computer's internal clock
# User-cpu time: the amount of time spent executing user-code 
# Sys cpu time: the amount of time spent in the kernel due to the need of privileged operations (like IO to disk)

In [None]:
we_train_set[0][:82]

In [None]:
we_train_tokens[0][:8]

We can now use the token sequences to train the Word2Vec embedding model

The most important parameter is the size of the word vectors we want to obtain
- in the original Word2Vec paper 300 is indicated as a good value
- here we use 50 as a tradeoff between accuracy and efficiency

In [None]:
wordvecs_size = 50

Other relevant parameters are
- the _window size_, i.e. the number of words before and after any word to consider as its context
- the minimum appearances of a term to be included in the model
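
The two parameters can be illustrated with a plain-Python sketch: `context_pairs` lists which words count as context of each word for a given window size, and `frequent_terms` keeps only terms with enough occurrences. Both helpers are hypothetical simplifications; actual Word2Vec implementations add further details (such as sampling) that are omitted here.

```python
from collections import Counter

def context_pairs(tokens, window=5):
    """Return (center word, context word) pairs within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        pairs.extend((center, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs

def frequent_terms(token_lists, min_count=5):
    """Return the set of terms occurring at least `min_count` times overall."""
    counts = Counter(t for tokens in token_lists for t in tokens)
    return {t for t, c in counts.items() if c >= min_count}

toy_tokens = ["this", "movie", "was", "really", "excellent"]  # hypothetical review
toy_pairs = context_pairs(toy_tokens, window=1)
print(toy_pairs)
```

With `window=1` only adjacent words form pairs; a larger window also links words that are further apart in the text.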

We specify these options in the `Word2Vec` initializer, together with the set of token sequences used to train the model

In [None]:
%%time
wv_model = gensim.models.Word2Vec(
    we_train_tokens,
    size=wordvecs_size,
    window=5,
    min_count=5
)

Our Word2Vec model is now trained: we can get a reference to the word->vector mapping itself (`wv`) and drop the rest of the model object to free some memory

In [None]:
wv = wv_model.wv
del wv_model

## Exploring the word embedding model

How many distinct terms are represented in the model?

In [None]:
len(wv.vocab)

Which are these terms? `index2word` is an ordered list with more common terms coming first

In [None]:
wv.index2word[:10]

Let's see the vector of a word, e.g. "excellent"

In [None]:
wv.word_vec("excellent")

We can also compute (L2-)normalized word vectors, i.e. vectors rescaled so that the sum of their squared components equals 1, which are used to compute cosine similarity

In [None]:
wv.init_sims()   # compute and cache normalized vectors (using L2-Normalization)

In [None]:
# True indicates to normalize the vector
wv.word_vec("excellent", True)

The word vector by itself doesn't give much information, but we can search for example which are the words with vectors most similar to this...

Let's use the `cosine_similarities` function to compute similarity between this vector and all the other ones, stored in the `vectors` array

In [None]:
similarities_to_excellent = wv.cosine_similarities(
    wv.word_vec("excellent"),
    wv.vectors
)
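
For reference, cosine similarity is the dot product between L2-normalized vectors. The NumPy sketch below computes it between one query vector and every row of a matrix, using hypothetical 2-D toy vectors; the helper names are illustrative and are not gensim functions.

```python
import numpy as np

def l2_normalize(vectors):
    """Rescale vectors (along the last axis) to unit Euclidean length."""
    return vectors / np.linalg.norm(vectors, axis=-1, keepdims=True)

def cosine_to_all(query, vectors):
    """Cosine similarity between `query` and each row of `vectors`."""
    return l2_normalize(vectors) @ l2_normalize(query)

# hypothetical toy vectors: same direction, 45 degrees, orthogonal, opposite
toy_vectors = np.array([[1.0, 0.0], [2.0, 2.0], [0.0, 3.0], [-1.0, 0.0]])
sims = cosine_to_all(toy_vectors[0], toy_vectors)
print(sims)
```

The score depends only on the angle between vectors, not on their length: it ranges from 1 (same direction) to -1 (opposite direction).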

We obtain an array of cosine similarity scores that has a component for each word represented in the model

In [None]:
similarities_to_excellent.shape

In [None]:
similarities_to_excellent[:5]

Let's label them with the term they refer to and sort by descending values

In [None]:
pd.Series(
    similarities_to_excellent,
    wv.index2word
).sort_values(ascending=False).head(10) # sort the values by descending similarity score and take the first 10

In this way we found **words** other than "excellent" with a **strong positive connotation**!

For this the model provides a `most_similar` method, which also removes the reference word from the results

In [None]:
wv.most_similar("excellent")

We can similarly see what happens with a strongly negative word, e.g. "terrible"

In [None]:
wv.most_similar("terrible")

Other strongly negative words are found!

Another powerful function of word embedding models is to find words with specific syntactic and semantic relationships using vector arithmetic

Consider the relationship _"man" is to "woman" as "actor" is to X_ where the model has to find out that X = "actress"

Word2Vec produces vectors in such a way that _"man" - "woman" = "actor" - X_, so we can find X as the term whose vector is closest to _"actor" + "woman" - "man"_
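
The arithmetic above can be illustrated on hand-made toy vectors. The 2-D values below are hypothetical and chosen so that one axis roughly encodes gender and the other the acting profession; they are not taken from any trained model.

```python
import numpy as np

# hypothetical 2-D vectors: axis 0 ~ gender, axis 1 ~ acting profession
toy = {
    "man":     np.array([1.0, 0.0]),
    "woman":   np.array([-1.0, 0.0]),
    "actor":   np.array([1.0, 1.0]),
    "actress": np.array([-1.0, 1.0]),
    "movie":   np.array([0.0, 0.5]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

target = toy["actor"] + toy["woman"] - toy["man"]
# exclude the words used to build the query, then pick the closest remaining word
candidates = [w for w in toy if w not in ("actor", "woman", "man")]
best = max(candidates, key=lambda w: cosine(target, toy[w]))
print(target, best)
```

In a trained model the relationship holds only approximately, which is why the closest word is searched for instead of an exact match.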

Let's produce the vector representation of X...

In [None]:
composition = (wv.word_vec("actor", True)
             + wv.word_vec("woman", True)
             - wv.word_vec("man", True))

...and then find the words most similar to this composition

In [None]:
pd.Series(
    wv.cosine_similarities(composition, wv.vectors),
    wv.index2word
).sort_values(ascending=False).head(10)

Also in this case we can use `most_similar`, distinguishing words with positive and negative weight

In [None]:
wv.most_similar(
    positive=["actor", "woman"],
    negative=["man"]
)

Depending on randomness in the training process, the correct answer "actress" might be the most similar word or very close to it, but still the confidence of the model is limited

We continue our analysis with a pretrained GloVe (_Global Vectors_) word embedding model, which like Word2Vec learns word vectors from unlabeled text

We use a version trimmed down to the most common 100,000 terms of the 100d model trained on Wikipedia, available here: https://nlp.stanford.edu/projects/glove/

Arrays with words and vectors are provided in the `glove.npz` file

In [None]:
download("glove.npz", "https://github.com/datascienceunibo/bbs-dl-lab-2019/raw/master/glove.npz")

In [None]:
with np.load("glove.npz") as f:
    glove_words = f["words"]
    glove_vectors = f["vectors"]

We read the vector size from the loaded array

In [None]:
wordvecs_size = glove_vectors.shape[1]
wordvecs_size

We then create the word embedding model from the words

In [None]:
wv = gensim.models.KeyedVectors(wordvecs_size)
wv[glove_words.tolist()] = glove_vectors
wv.init_sims()

Searching on this model for the answer to _man : woman = actor : X_...

In [None]:
wv.most_similar(
    positive=["actor", "woman"],
    negative=["man"]
)

...the correct answer "actress" stands out more clearly from the other candidates

Other examples with multiple pairs: finding the plural of a singular word...

In [None]:
wv.most_similar(
    positive=["mouse", "dogs", "cats"],
    negative=[         "dog",  "cat"]
)

...and finding the capital of a country

In [None]:
wv.most_similar(
    positive=["france", "rome",  "berlin"],
    negative=[          "italy", "germany"]
)

Another method provided by the model is `doesnt_match`, which finds the word that is least related to the others in a given list

In [None]:
wv.doesnt_match(["cat", "mouse", "dog", "keyboard", "frog"])

## Representing text with word embedding

We now see how to leverage the word embedding model in a neural network for sentiment classification

We start by tokenizing texts of training reviews

In [None]:
%%time
train_tokens = [gensim.utils.simple_preprocess(text) for text in train_set["text"]]

Let's see an example of tokenized review

In [None]:
train_set["text"][0][:34]

In [None]:
train_tokens[0][:5]

We now convert these lists of text tokens into lists of indices of terms in the word embedding model, leaving out terms not present in the model

In [None]:
train_indices = [
    [wv.vocab[word].index for word in text if word in wv.vocab] # for each token (word) of the review that is present in the embedding model vocabulary, get its index in that vocabulary
    for text in train_tokens # i.e. for each tokenized review
]

For example the beginning of the review above is now represented with...

In [None]:
train_indices[0][:5] # [0] is the index of the first review, [:5] is used to get the first five word indexes of the first review

...which translated back into words would be... (notice that the first term was removed because not in the embedding model)

In [None]:
[wv.index2word[i] for i in train_indices[0][:5]]

Since we want to perform a review-level sentiment analysis, we have to find a way to represent each review using the respective word vectors.
As a first solution, we represent each review with the mean of normalized vectors of words contained in it: we obtain such vectors for all train reviews and stack them together in a matrix

In [None]:
# for each review, look up the normalized vectors of its words, compute their mean and stack the resulting vectors in a matrix
train_we_repr = np.vstack([wv.vectors_norm[indices].mean(0) for indices in train_indices])

# we now have a matrix with one row per training review and as many columns as the word vector components (100)
train_we_repr.shape

We then create a MLP network with one hidden layer accepting such vectors in input

In [None]:
model = Sequential([
    Dense(128, activation="sigmoid", input_dim=wordvecs_size),
    Dense(2, activation="softmax")
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

Since the input size of the network is much smaller, so is the number of parameters

In [None]:
model.summary()

Training is much faster than before, so we can increase the number of epochs and reduce the batch size, thus making more SGD steps in each epoch

In [None]:
model.fit(train_we_repr, train_target, batch_size=20, epochs=10)

Let's preprocess test reviews as we did for training ones, thus extracting tokens and converting them to indices...

In [None]:
test_tokens = [gensim.utils.simple_preprocess(text) for text in test_set["text"]]
test_indices = [
    [wv.vocab[word].index for word in text if word in wv.vocab]
    for text in test_tokens
]

...and obtaining means of word vectors for each review

In [None]:
test_we_repr = np.vstack([wv.vectors_norm[indices].mean(0) for indices in test_indices])

We can now evaluate the network on the test reviews

In [None]:
model.evaluate(test_we_repr, test_target)

The accuracy is not as good as before: with this representation we lose the identity of the individual words in the documents as well as their order
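
The loss of word order is easy to verify: averaging is insensitive to the position of the vectors, so two reviews containing the same words in a different order get exactly the same representation. The sketch below uses random vectors as hypothetical stand-ins for the embedding model.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_vectors = rng.normal(size=(6, 4))   # 6 hypothetical words with 4-dimensional vectors

review_a = [0, 1, 2, 3]                 # word indices of a hypothetical review
review_b = [3, 2, 1, 0]                 # same words in the opposite order

mean_a = toy_vectors[review_a].mean(axis=0)
mean_b = toy_vectors[review_b].mean(axis=0)
print(np.allclose(mean_a, mean_b))
```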

## Recurrent neural networks

MLPs are _feed-forward_ networks: their output at any time depends only on their input at the same time

On the other hand, if we somehow introduce **memory** inside a network, we can make its output depend on current as well as past inputs, thus we can process **sequential data**

_Recurrent_ neural networks include **cyclic connections** between nodes, making the output depend on the state of the network at previous time steps and thus on previous inputs

### Sequential data

While an input example for an MLP must be represented with a vector of size S, an example for a recurrent NN is represented with a **sequence of vectors**, fed to the network in T subsequent time steps (T is the same for all examples)

Thus N input samples with input size S are no longer represented with an N×S array, but with an N×T×S array

Leveraging the word embedding model, we represent each review with the **sequence of word vectors** for the terms contained in it
- in this way, we consider both the identity of words (the vectors) and their order!

We start from the sequences of word indices `*_indices` (train_indices, test_indices) extracted above

We need to make all sequences the same length (the T term above): we set a desired sequence size T, then we trim longer sequences to that size (keeping the final T elements) and pad shorter sequences with zeros. Keras' `pad_sequences` function does this
- larger T values would make training much slower
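
The trimming and padding behaviour described above can be sketched in plain Python. The helper `pad_or_trim` is hypothetical and mirrors what `pad_sequences` does when called with its documented defaults (padding and truncation both applied at the start of the sequence, padding value 0).

```python
def pad_or_trim(indices, length, pad_value=0):
    """Keep the last `length` indices; left-pad shorter sequences with `pad_value`."""
    trimmed = list(indices)[-length:]
    return [pad_value] * (length - len(trimmed)) + trimmed

short = pad_or_trim([7, 8, 9], 5)               # shorter than T: zeros added in front
long_ = pad_or_trim([1, 2, 3, 4, 5, 6, 7], 5)   # longer than T: only the final 5 kept
print(short, long_)
```

Note that in this setup the padding value 0 is also a valid word index, so padded positions are indistinguishable from occurrences of the first vocabulary term.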

In [None]:
from keras.preprocessing.sequence import pad_sequences
max_words = 200
train_seq = pad_sequences(train_indices, max_words)

In [None]:
train_seq

The size of the matrix is the number of samples times the sequence length, i.e. N×T

In [None]:
train_seq.shape

### Building the network

Let's now create a neural network which gets such sequences as input

In [None]:
model = Sequential()

We first insert an `Embedding` layer, which translates each received value into the word vector from the embedding model

We need to specify the size of input and output and the word vectors to be used, taking them from the model; we also specify `trainable=False` to "freeze" our pretrained word vectors and exclude them from training

In [None]:
from keras.layers import Embedding
model.add(Embedding(            # maps each word index to its vector, e.g. [1, 2, 3] --> [vector_1, vector_2, vector_3]
    input_dim=len(wv.vocab),    # number of distinct terms in the embedding model
    output_dim=wordvecs_size,   # size of word vectors (S)
    input_length=max_words,     # length of sequences (T)
    weights=[wv.vectors],       # pretrained word vectors
    trainable=False
))
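
Conceptually, a frozen embedding layer is a lookup table: each index of an N×T input selects one row of the vector matrix, giving an N×T×S result. The NumPy sketch below shows this with hypothetical sizes and random stand-ins for the pretrained vectors and the padded sequences.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, S, T, N = 1000, 100, 200, 3             # hypothetical sizes
vectors = rng.normal(size=(vocab_size, S))          # stand-in for pretrained word vectors
seq = rng.integers(0, vocab_size, size=(N, T))      # stand-in for padded index sequences

embedded = vectors[seq]                             # one S-sized vector per index
print(embedded.shape)
```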

The output of this layer is an N×T×S tensor: we feed it to a recurrent layer which receives S-sized vectors for T time steps

_Gated Recurrent Units_ (GRU) are a simplified version of _Long Short-Term Memory_ (LSTM) units, which can potentially hold information in memory across many time steps; we use here a layer of 128 GRU cells

_Dropout_ randomly drops (sets to zero) a given ratio of input values at each time step: it is a technique to prevent model overfitting

In [None]:
from keras.layers import GRU
model.add(GRU(128, dropout=0.2))
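
The effect of dropout on a batch of values can be sketched with NumPy. This follows the common "inverted dropout" formulation, where the surviving values are rescaled so that their expected sum is unchanged; how the GRU layer applies its masks internally may differ in the details, so this is an illustration of the principle only.

```python
import numpy as np

def dropout(x, rate, rng):
    """Zero a random `rate` fraction of values and rescale the rest (training time only)."""
    keep = rng.random(x.shape) >= rate   # True where the value is kept
    return x * keep / (1.0 - rate)

rng = np.random.default_rng(0)
x = np.ones((1000, 50))                  # hypothetical input values
dropped = dropout(x, rate=0.2, rng=rng)
print((dropped == 0).mean(), dropped.mean())
```

About 20% of the values become zero, while the mean of the whole array stays close to the original one.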

While producing 128 output values at each time step, the GRU layer by default only returns the outputs at the final step, i.e. when the whole input sequence has been fed to the network, thus the output size of this layer is N×128 (the time dimension collapses)
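
To show how the time dimension collapses, here is a NumPy sketch of a GRU layer that processes an N×T×S input one time step at a time and returns only the final state. It follows one textbook formulation of the GRU equations with small random weights as hypothetical stand-ins; gate conventions and implementation details vary between libraries, so it should not be read as the exact Keras computation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_last_output(x, params):
    """Run a GRU over x of shape (N, T, S) and return the final state, shape (N, units)."""
    Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh = params
    h = np.zeros((x.shape[0], Uz.shape[0]))             # initial state (the "memory")
    for t in range(x.shape[1]):
        x_t = x[:, t, :]                                # input vectors at time step t
        z = sigmoid(x_t @ Wz + h @ Uz + bz)             # update gate
        r = sigmoid(x_t @ Wr + h @ Ur + br)             # reset gate
        h_new = np.tanh(x_t @ Wh + (r * h) @ Uh + bh)   # candidate state
        h = z * h + (1.0 - z) * h_new                   # blend old state and candidate
    return h

rng = np.random.default_rng(0)
N, T, S, units = 4, 7, 5, 3                             # hypothetical small sizes
params = []
for _ in range(3):                                      # update gate, reset gate, candidate
    params += [rng.normal(scale=0.1, size=(S, units)),
               rng.normal(scale=0.1, size=(units, units)),
               np.zeros(units)]
x = rng.normal(size=(N, T, S))
last = gru_last_output(x, params)
print(last.shape)
```

The state `h` is updated at every step but only its last value is returned, which is why the output has shape N×units regardless of T.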

We can now finalize the network with the output layer, which receives the output of the GRU layer

In [None]:
model.add(Dense(2, activation="softmax"))

The model summary gives a recap of the shapes of data across network layers as well as the parameter counts

In [None]:
model.summary()

We can now compile the network and train it on the padded sequences of word indices
- training of RNNs is quite slow, so we again limit training to 3 epochs

In [None]:
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
model.fit(train_seq, train_target, batch_size=200, epochs=3)

Let's now obtain the padded sequences also for the test reviews...

In [None]:
test_seq = pad_sequences(test_indices, max_words)

In [None]:
test_seq

...and use them to evaluate the model

In [None]:
model.evaluate(test_seq, test_target)

We have got a higher accuracy than the previous solution, thanks to the representation of reviews as word sequences and the memory capability of the GRU network

## Cross domain classification

We trained our network on reviews of movies and tested its ability to classify sentiment in reviews of movies

Can we successfully apply our model to reviews pertaining to a different domain?

The `yelp-test-10k.csv.gz` file contains 10,000 labeled user reviews about restaurants extracted from Yelp

In [None]:
download("yelp-test-10k.csv.gz", "https://github.com/datascienceunibo/bbs-dl-lab-2019/raw/master/yelp-test-10k.csv.gz")

In [None]:
xdom_set = pd.read_csv("yelp-test-10k.csv.gz", sep="\t", names=["label", "text"])

In [None]:
xdom_set.head(5)

In [None]:
xdom_set["label"].value_counts()

We apply the same preprocessing steps we applied above

In [None]:
xdom_set["text"] = xdom_set["text"].apply(strip_tags)
xdom_tokens = [gensim.utils.simple_preprocess(text) for text in xdom_set["text"]]
xdom_indices = [
    [wv.vocab[word].index for word in text if word in wv.vocab]
    for text in xdom_tokens
]
xdom_seq = pad_sequences(xdom_indices, max_words)
xdom_target = make_target(xdom_set["label"])

In [None]:
model.evaluate(xdom_seq, xdom_target)

The network is fairly accurate, although it was trained on reviews of a different domain

Can we further improve this?

## Fine tuning the network

In the `yelp-train-2k.csv.gz` file we have a set of 2,000 labeled Yelp reviews which can be used for training

We would like to make use of these in-domain reviews, without throwing away the model trained on the richer set of cross-domain reviews

We can "tune" the trained model with an additional training run on the new set of reviews, thus making it more oriented to the new domain and still using knowledge from the other

Let's load and view a summary of the file...

In [None]:
download("yelp-train-2k.csv.gz", "https://github.com/datascienceunibo/bbs-dl-lab-2019/raw/master/yelp-train-2k.csv.gz")

In [None]:
tune_set = pd.read_csv("yelp-train-2k.csv.gz", sep="\t", names=["label", "text"])

In [None]:
tune_set.head(5)

In [None]:
tune_set["label"].value_counts()

...and apply the usual preprocessing steps

In [None]:
tune_set["text"] = tune_set["text"].apply(strip_tags)
tune_tokens = [gensim.utils.simple_preprocess(text) for text in tune_set["text"]]
tune_indices = [
    [wv.vocab[word].index for word in text if word in wv.vocab]
    for text in tune_tokens
]
tune_seq = pad_sequences(tune_indices, max_words)
tune_target = make_target(tune_set["label"])

We now repeat the model training process on this set of reviews: the process is very fast due to the limited size of the dataset

In [None]:
model.fit(tune_seq, tune_target, epochs=5, batch_size=200)

Let's now repeat the evaluation on the Yelp test set loaded before

In [None]:
model.evaluate(xdom_seq, xdom_target)

We successfully boosted the model accuracy, combining even limited knowledge of the target domain with large knowledge extracted from a different domain

## Introduction to the Transformer
The transformers library is an open-source, community-based repository to train, use and share models based on
the Transformer architecture [(Vaswani & al., 2017)](https://arxiv.org/abs/1706.03762) such as BERT [(Devlin & al., 2018)](https://arxiv.org/abs/1810.04805),
RoBERTa [(Liu & al., 2019)](https://arxiv.org/abs/1907.11692), GPT-2 [(Radford & al., 2019)](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf),
XLNet [(Yang & al., 2019)](https://arxiv.org/abs/1906.08237), etc.

Along with the models, the library contains multiple variations of each of them for a large variety of
downstream tasks like **Named Entity Recognition (NER)**, **Sentiment Analysis**,
**Language Modeling**, **Question Answering** and so on.

### Before Transformer

Back in 2017, most people using neural networks for Natural Language Processing relied on
sequential processing of the input through a [Recurrent Neural Network (RNN)](https://en.wikipedia.org/wiki/Recurrent_neural_network).

![rnn](http://colah.github.io/posts/2015-09-NN-Types-FP/img/RNN-general.png)   

RNNs performed well on a large variety of tasks involving sequential dependency over the input sequence.
However, this sequentially-dependent process had trouble modeling very long-range dependencies and
was poorly suited to the hardware we currently use, because it parallelizes badly.

Some extensions were provided by the academic community, such as the Bidirectional RNN ([Schuster & Paliwal, 1997](https://www.researchgate.net/publication/3316656_Bidirectional_recurrent_neural_networks), [Graves & al., 2005](https://mediatum.ub.tum.de/doc/1290195/file.pdf)),
which can be seen as a concatenation of two sequential processes, one going forward and the other going backward over the input sequence.

![birnn](https://miro.medium.com/max/764/1*6QnPUSv_t9BY9Fv8_aLb-Q.png)


Another extension was the Attention mechanism, which brought a good improvement over "raw" RNNs by giving
a learned, weighted importance to each element in the sequence, allowing the model to focus on important elements.

![attention_rnn](https://3qeqpr26caki16dnhd19sv6by6v-wpengine.netdna-ssl.com/wp-content/uploads/2017/08/Example-of-Attention.png)  

### Then comes the Transformer  

The Transformer era started with the work of [(Vaswani & al., 2017)](https://arxiv.org/abs/1706.03762), who
demonstrated its superiority over [Recurrent Neural Networks (RNN)](https://en.wikipedia.org/wiki/Recurrent_neural_network)
on translation tasks; it then quickly extended to almost all the tasks where RNNs were State-of-the-Art at that time.

One advantage of the Transformer over its RNN counterpart is its non-sequential attention model. Remember, RNNs had to
iterate over each element of the input sequence one-by-one and carry an "updatable state" between each hop. With the Transformer, the model is able to look at every position in the sequence, at the same time, in one operation.
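
As a rough illustration of this idea, the sketch below computes scaled dot-product self-attention on three toy vectors in pure Python. It is a simplification under stated assumptions: queries, keys and values are the inputs themselves (a real Transformer applies learned projections and uses several attention heads), and the names `softmax`, `self_attention` and `toy_inputs` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention with identity projections
    (queries = keys = values = the input vectors; no learned weights)."""
    d = len(vectors[0])
    outputs = []
    for query in vectors:
        # Similarity of this position with every position, scaled by sqrt(d)
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in vectors]
        weights = softmax(scores)
        # Output = attention-weighted average of all the input vectors
        outputs.append([sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(d)])
    return outputs

toy_inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_outputs = self_attention(toy_inputs)
for vector in toy_outputs:
    print(vector)
```

No output depends on any other output, so the loop over positions can be replaced by a single matrix multiplication on a GPU; an RNN cannot do this, because each of its steps needs the state produced by the previous one.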

For a deep-dive into the Transformer architecture, [The Annotated Transformer](https://nlp.seas.harvard.edu/2018/04/03/attention.html#encoder-and-decoder-stacks)
walks you through all the details of the paper.

![transformer-encoder-decoder](https://nlp.seas.harvard.edu/images/the-annotated-transformer_14_0.png)

### BERT
For the rest of this introduction and for the sentiment classification task below, we will use the [BERT (Devlin & al., 2018)](https://arxiv.org/abs/1810.04805) architecture, as it's one of the most powerful for text processing and there is plenty of content about it
on the internet, so it will be easy to dig deeper into this architecture if you want to.

BERT was pre-trained on a large text corpus, which lets the model capture the structure of the language and the variability of its patterns, and generalize well across several NLP tasks. Being bidirectional, BERT learns from both the left and the right context of each token during training.

One key point of this model is that it can be used to generate **contextual** word embeddings: as opposed to Word2Vec and GloVe, each token is represented differently depending on its context. For instance, BERT represents the word "bank" with two different vectors in the sentences "open a bank account" and "on the river bank", as the meanings differ. Word2Vec and GloVe are instead context-free representations, so they represent "bank" with the same vector in every sentence of the corpus.

BERT is composed mostly of Encoder blocks from the Transformer architecture, which let it achieve high performance in language modeling and understanding tasks.

<img src="https://humboldt-wi.github.io/blog/img/seminar/bert/bert_architecture.png"></img>

This model is first pre-trained in two ways: 

1) First, it is (pre)trained on large corpora of text to solve a LM task (predicting masked words in sequences of tokens); 
<img src="https://jalammar.github.io/images/BERT-language-modeling-masked-lm.png"></img>

2) To make BERT better at handling relationships between multiple sentences, the pre-training process includes an additional task: Given two sentences (A and B), is B likely to be the sentence that follows A, or not?
<img src="https://jalammar.github.io/images/bert-next-sentence-prediction.png"></img>
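
The first pre-training task above can be sketched in a few lines. This is a simplified illustration, not BERT's actual data pipeline: BERT selects about 15% of the tokens, but replaces only most of them with `[MASK]` (the rest become a random token or stay unchanged), while this sketch always uses `[MASK]`. The function `mask_tokens` and the fixed seed are assumptions made for the example.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace roughly `mask_prob` of the tokens with [MASK] (simplified masking)."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    masked, targets = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            targets.append(token)   # the model is trained to recover this token
        else:
            masked.append(token)
            targets.append(None)    # position not used by the masked LM loss
    return masked, targets

toy_tokens = "a visually stunning rumination on love".split()
toy_masked, toy_targets = mask_tokens(toy_tokens)
print(toy_masked)
print(toy_targets)
```

The loss is computed only on the masked positions, so the model has to rely on both the left and the right context to fill each gap.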

Finally, BERT can be fine-tuned for downstream tasks, such as sentiment classification, by adding a classification (usually linear) layer on top of the model.

### Getting started with transformers



The transformers library allows you to benefit from large, pretrained language models without requiring a huge and costly computational
infrastructure. Most of the State-of-the-Art models are provided directly by their authors and made available in the library
in PyTorch and TensorFlow in a transparent and interchangeable way.

You will need to install the transformers library (if not already installed). You can do so with this command:

In [None]:
!pip install transformers

In [None]:
import torch
from transformers import AutoModel, AutoTokenizer, BertTokenizer

torch.set_grad_enabled(False) # we initially do not train any model so we disable gradient calculation

In [None]:
# Store the model we want to use
MODEL_NAME = "bert-base-uncased"

# We need to create the model and tokenizer
model = AutoModel.from_pretrained(MODEL_NAME)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

With only the above two lines of code, you're ready to use a pre-trained BERT model.
The tokenizer allows us to map a raw textual input to a sequence of integers representing that input
in a way the model can manipulate. Since we will be using a PyTorch model, we ask the tokenizer to return PyTorch tensors.

We can visualize this process graphically and then via code:
<img src="https://jalammar.github.io/images/distilBERT/bert-distilbert-tokenization-2-token-ids.png" />

You may have noticed that the word "rumination" has been split into two tokens (rum, ##ination). This is because BERT uses what is called a WordPiece tokenizer. It either keeps a word in its full form (one word becomes one token) or breaks it into word pieces, so that one word maps to multiple tokens.

An example of where this can be useful is where we have multiple forms of words. For example:

| Word          | Token(s)                           |
| ------------- | ---------------------------------- |
| surf          | \['surf'\]                         |
| surfing       | \['surf', '##ing'\]                 |
| surfboarding  | \['surf', '##board', '##ing'\]       |
| surfboard     | \['surf', '##board'\]               |
| snowboard     | \['snow', '##board'\]               |
| snowboarding  | \['snow', '##board', '##ing'\]       |
| snow          | \['snow'\]                         |
| snowing       | \['snow', '##ing'\]                 |


By splitting words into word pieces, we have already identified that the words "surfboard" and "snowboard" share meaning through the word piece "##board". We have done this without even encoding our tokens or processing them in any way through BERT.

Using word pieces allows BERT to easily identify related words as they will usually share some of the same input tokens, which are then fed into the first layers of BERT.
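
The splitting shown in the table can be reproduced with a greedy longest-match-first search, which is the strategy WordPiece tokenizers apply at inference time. The sketch below is a toy version under stated assumptions: `toy_vocab` contains only four pieces (the real BERT vocabulary has about 30,000), and the function name `wordpiece` is hypothetical.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of `word` into pieces found in `vocab`."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate   # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1                           # try a shorter candidate
        if piece is None:
            return [unk]                       # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

toy_vocab = {"surf", "snow", "##board", "##ing"}
for word in ["surf", "surfboarding", "snowing", "ski"]:
    print(word, wordpiece(word, toy_vocab))
```

With this toy vocabulary "ski" has no matching piece and falls back to the unknown token, which is far less likely with a full-size vocabulary.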

In [None]:
tokens_pt = tokenizer("a visually stunning rumination on love", return_tensors="pt")

for key, value in tokens_pt.items():
    print("{}:\n\t{}".format(key, value))

The tokenizer automatically converted our input to all the inputs expected by the model. It generated some additional tensors on top of the IDs: 

- token_type_ids: This tensor maps every token to its corresponding segment (see below).
- attention_mask: This tensor is used to "mask" padded values in a batch of sequences with different lengths (see below).

You can check the Transformers [glossary](https://huggingface.co/transformers/glossary.html) for more information about each of those keys. 
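
Before looking at real tokenizer output, the following pure-Python sketch shows how these two tensors can be derived by hand. The ids 101, 102 and 0 are the `[CLS]`, `[SEP]` and `[PAD]` ids of the `bert-base-uncased` vocabulary; the word ids 7, 8 and 9 are made up, and the helpers `encode_pair` and `pad_batch` are hypothetical, not part of the library.

```python
def encode_pair(ids_a, ids_b, cls_id=101, sep_id=102):
    """Join two segments as [CLS] A [SEP] B [SEP] and mark which segment each token belongs to."""
    input_ids = [cls_id] + ids_a + [sep_id] + ids_b + [sep_id]
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    return input_ids, token_type_ids

def pad_batch(batch, pad_id=0):
    """Right-pad every sequence to the longest one and build the matching attention masks."""
    longest = max(len(ids) for ids in batch)
    padded = [ids + [pad_id] * (longest - len(ids)) for ids in batch]
    masks = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch]
    return padded, masks

pair_ids, pair_types = encode_pair([7, 8], [9])
print(pair_ids, pair_types)

batch_ids, batch_masks = pad_batch([[101, 7, 102], [101, 7, 8, 9, 102]])
print(batch_ids)
print(batch_masks)
```

A 0 in the attention mask tells the model to ignore that position, so padding does not influence the representation of the real tokens.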

Let's see some tokenized input examples to better understand the meaning of the token type ids and attention masks.

In [None]:
# Single segment input
single_seg_input = tokenizer("This is a sample input")

# Multiple segment input
multi_seg_input = tokenizer("This is segment A", "This is segment B")

print("Single segment token (str): {}".format(tokenizer.convert_ids_to_tokens(single_seg_input['input_ids'])))
print("Single segment token (int): {}".format(single_seg_input['input_ids']))
print("Single segment type       : {}".format(single_seg_input['token_type_ids']))

# Segments are concatenated in the input to the model, separated by the special [SEP] token
print() 
print("Multi segment token (str): {}".format(tokenizer.convert_ids_to_tokens(multi_seg_input['input_ids'])))
print("Multi segment token (int): {}".format(multi_seg_input['input_ids']))
print("Multi segment type       : {}".format(multi_seg_input['token_type_ids']))

In [None]:
# Padding highlight
tokens = tokenizer(
    ["This is a sample", "This is another longer sample text"], 
    padding=True  # First sentence will have some PADDED tokens to match second sequence length
)

for i in range(2):
    print("Tokens (int)      : {}".format(tokens['input_ids'][i]))
    print("Tokens (str)      : {}".format([tokenizer.convert_ids_to_tokens(s) for s in tokens['input_ids'][i]]))
    print("Tokens (attn_mask): {}".format(tokens['attention_mask'][i]))
    print()

Now we can just feed the tokenized inputs directly into our model:

In [None]:
outputs = model(**tokens_pt)
last_hidden_state = outputs.last_hidden_state
pooler_output = outputs.pooler_output

print("Token wise output: {}, Pooled output: {}".format(last_hidden_state.shape, pooler_output.shape))

As you can see, BERT outputs two tensors:
 - One with the generated representation for every token in the input `(1, NB_TOKENS, REPRESENTATION_SIZE)`
 - One with an aggregated representation for the whole input `(1, REPRESENTATION_SIZE)`
 where:
  - `NB_TOKENS` represents the number of tokens in the sentence
  - `REPRESENTATION_SIZE` is the dimension of the hidden layer of the chosen BERT model (BERT-base has a 768-dim hidden layer)

The first, token-based, representation can be leveraged if your task requires keeping the sequence representation and you
want to operate at the token level. This is particularly useful for Named Entity Recognition and Question Answering.





For example, if you want the first token ([CLS]) last hidden state:

In [None]:
last_hidden_state[:, 0, :].numpy().shape

We can also visualize the tensors graphically (considering that `last_hidden_states[0]` in the image matches our `last_hidden_state` tensor).

<img src="https://jalammar.github.io/images/distilBERT/bert-output-tensor-selection.png" />

The second, aggregated, representation is especially useful if you need to extract the overall context of the sequence and don't
require fine-grained token-level information. This is the case for sentiment analysis of the sequence or information retrieval. Note that the first token of every sequence is always a special classification token ([CLS]). The final hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks. Indeed, the pooler output is computed as the last-layer hidden state of the first token of the sequence (the classification token), further processed by a linear layer and a tanh activation function.
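
The pooling step described above amounts to `tanh(W · h_cls + b)`. The toy sketch below applies it to a 3-dimensional vector instead of a 768-dimensional one; the weights are an identity matrix chosen for readability, not learned values, and the names `pooler`, `toy_cls`, `toy_weight` and `toy_bias` are hypothetical.

```python
import math

def pooler(cls_hidden, weight, bias):
    """Apply a linear layer followed by tanh to the [CLS] hidden state."""
    return [
        math.tanh(sum(w * h for w, h in zip(row, cls_hidden)) + b)
        for row, b in zip(weight, bias)
    ]

toy_cls = [0.5, -1.0, 2.0]                   # stand-in for last_hidden_state[0, 0, :]
toy_weight = [[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]]               # identity matrix, for illustration only
toy_bias = [0.0, 0.0, 0.0]

toy_pooled = pooler(toy_cls, toy_weight, toy_bias)
print(toy_pooled)
```

Because of the tanh, every component of the pooled vector lies between -1 and 1, whatever the scale of the hidden state.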

### Model loading

Here we load a pre-trained BERT-base model (12 Encoder layers and 768-dimensional hidden states) which has been fitted on cased texts.

One of the most powerful features of transformers is its ability to move seamlessly between PyTorch and TensorFlow.

For the rest of this notebook we will use the PyTorch version, which is the default in the Transformers library.

In [None]:
from transformers import TFBertModel, BertModel

# Let's load a BERT model for PyTorch
model_pt = BertModel.from_pretrained('bert-base-cased', output_hidden_states=True)

# Tensorflow Version
#model_tf = TFBertModel.from_pretrained('bert-base-cased') 

In [None]:
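from transformers import AutoTokenizer

# Use the tokenizer that matches the cased checkpoint loaded above:
# the `tokenizer` created earlier belongs to bert-base-uncased and has a different vocabulary
tokenizer_cased = AutoTokenizer.from_pretrained('bert-base-cased')

# transformers generates a ready-to-use dictionary with all the required parameters for the specific framework
input_pt = tokenizer_cased("This is a sample input", return_tensors="pt")

# TensorFlow version
# input_tf = tokenizer_cased("This is a sample input", return_tensors="tf")

# Run the PyTorch model on the tokenized input
output_pt = model_pt(**input_pt)

# TensorFlow version
# output_tf = model_tf(input_tf)

# Among its outputs, the model returns the representation of each token and the pooled representation of the input sentence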
for name in ["last_hidden_state", "pooler_output"]:
    print(name)
    print(output_pt[name].shape)
    print()

Everything is great so far, but how can we get word embeddings from this? As discussed, the BERT base model uses 12 layers of Transformer encoders, and the per-token output of each of these layers can be used as a word embedding! You probably wonder which one is the best, though.

Well, this depends on the task, but empirically the authors found that one of the best performing choices was to sum the last 4 layers, which is what we will be doing.

<img src="http://jalammar.github.io/images/bert-feature-extraction-contextualized-embeddings.png"></img>

As illustrated, the best performing option is to concatenate the last 4 layers, but in this notebook the summing approach is used for convenience. The performance difference is small, and summing keeps the embedding size unchanged, which leaves more flexibility for truncating the dimensions further without losing much information.

The last 4 hidden states can be selected by slicing the `hidden_states` output. It holds 13 tensors (the embedding output plus one per Encoder layer), so the last 4 are taken with a negative slice:

In [None]:
output_pt["hidden_states"][-4:]
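
To show what summing the last 4 layers means in practice, here is a toy sketch with plain Python lists standing in for tensors. The sizes are assumptions chosen for readability (2 tokens and 3 dimensions instead of 768), and each toy layer is filled with its own index so the result is easy to check.

```python
# 13 toy "layers" (embedding output + 12 Encoder layers), 2 tokens, 3 dimensions each;
# layer number k is filled with the value k
toy_hidden_states = [
    [[float(layer)] * 3 for _ in range(2)]
    for layer in range(13)
]

last_four = toy_hidden_states[-4:]   # layers 9, 10, 11 and 12

# Element-wise sum over the four layers: one vector per token, same size as before
summed = [
    [sum(layer[token][dim] for layer in last_four) for dim in range(3)]
    for token in range(2)
]
print(summed)
```

With the real model output, the same operation should be obtainable with `torch.stack(output_pt["hidden_states"][-4:]).sum(dim=0)`, which keeps the `(1, NB_TOKENS, REPRESENTATION_SIZE)` shape.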

[Flair](https://github.com/flairNLP/flair) is a simple and really powerful Python library for working with transformer embeddings, including BERT, across a large variety of NLP tasks.

### Want it lighter? Faster? Let's talk distillation! 

One of the main concerns when using these Transformer-based models is the computational power they require. Throughout this notebook we use BERT because it can run on common machines, but that's not the case for all models.

For example, Google released **T5**, an Encoder/Decoder architecture based on the Transformer and available in `transformers` with up to 11 billion parameters. Microsoft also recently entered the game with **Turing-NLG**, which uses 17 billion parameters. Models of this kind require tens of gigabytes just to store the weights and a tremendous compute infrastructure to run, which makes them impractical for most users!

![transformers-parameters](https://raw.githubusercontent.com/huggingface/notebooks/main/examples/images/model_parameters.png)

With the goal of making Transformer-based NLP accessible to everyone, Hugging Face developed models that take advantage of a training process called **Distillation**, which allows us to drastically reduce the resources needed to run such models with almost no drop in performance.

Intuitively you can think of distillation as a process in which a lighter model is trained to replicate the predictions made by another larger model.
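
A minimal sketch of this idea is the soft-target loss below: the student is penalized when its output distribution differs from the teacher's, with both distributions softened by a temperature. This is only one ingredient and an assumption-laden simplification; the actual DistilBERT training combines it with other loss terms, and the function names, logits and temperature used here are made up for the example.

```python
import math

def softmax_t(logits, temperature=1.0):
    """Softmax with temperature: higher temperatures give softer distributions."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between the teacher's soft targets and the student's predictions."""
    teacher_probs = softmax_t(teacher_logits, temperature)
    student_probs = softmax_t(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))

teacher_logits = [2.0, 0.0, -1.0]
good_student = [2.0, 0.0, -1.0]    # reproduces the teacher exactly
bad_student = [-1.0, 0.0, 2.0]     # ranks the classes in the opposite order

loss_good = distillation_loss(good_student, teacher_logits)
loss_bad = distillation_loss(bad_student, teacher_logits)
print(loss_good, loss_bad)
```

The loss is lowest when the student reproduces the teacher's distribution, so minimizing it pushes the lighter model towards the predictions of the larger one.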

Going over the whole distillation process is out of the scope of this notebook, but if you want more information on the subject you may refer to [this Medium article by Victor Sanh, author of the DistilBERT paper](https://medium.com/huggingface/distilbert-8cf3380435b5). You might also want to look directly at the paper [(Sanh & al., 2019)](https://arxiv.org/abs/1910.01108).

In `transformers` some models have been distilled and made available directly in the library. 

In [None]:
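from transformers import AutoTokenizer, DistilBertModel

bert_distil = DistilBertModel.from_pretrained('distilbert-base-cased')

# distilbert-base-cased shares its vocabulary with bert-base-cased,
# so a single cased tokenizer can feed both models compared below
tokenizer_cased = AutoTokenizer.from_pretrained('bert-base-cased')
input_pt = tokenizer_cased(
    'This is a sample input to demonstrate performance of distilled models especially inference time',
    return_tensors="pt"
)

# Forward pass time comparison between DistilBERT and BERT
%time _ = bert_distil(input_pt['input_ids'])
%time _ = model_pt(input_pt['input_ids'])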

## Sentiment Classification with DistilBERT



Here we replicate the in-domain sentiment classification experiment above on the IMDB dataset, but this time using a BERT-based model via the Hugging Face Transformers library. This will give you some insight into how BERT can be used to improve the results in this task.

Let's convert the labels to integer values using sklearn LabelEncoder class.

In [None]:
from sklearn import preprocessing

# String labels conversion to integers
le = preprocessing.LabelEncoder()

train_labels = le.fit_transform(train_set.label)
train_labels

In [None]:
# Reuse the mapping fitted on the training labels (transform, not fit_transform)
test_labels = le.transform(test_set["label"])
test_labels

We already have a train and test dataset, but let's also create a validation set which we can use for evaluation
and tuning without tainting our test set results. Sklearn has a convenient utility for creating such splits:

In [None]:
from sklearn.model_selection import train_test_split

train_texts, val_texts, train_labels, val_labels = train_test_split(train_set["text"].tolist(), train_labels, test_size=.1)

Alright, we have read in our dataset. Now let's tackle tokenization. We'll eventually train a classifier using
pre-trained DistilBERT, so let's use the DistilBERT tokenizer.

In [None]:
from transformers import DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

Now we can simply pass our texts to the tokenizer. We'll pass `truncation=True` and `padding=True`, which will
ensure that all of our sequences are padded to the same length and are truncated to be no longer than the model's maximum input
length. This will allow us to feed batches of sequences into the model at the same time.

In [None]:
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_set["text"].tolist(), truncation=True, padding=True)

Now, let's turn our labels and encodings into a Dataset object. In PyTorch, this is done by subclassing a
`torch.utils.data.Dataset` object and implementing `__len__` and `__getitem__`. In TensorFlow, we pass our input
encodings and labels to the `from_tensor_slices` constructor method. We put the data in this format so that the data
can be easily batched such that each key in the batch encoding corresponds to a named parameter of the
`DistilBertForSequenceClassification.forward` method of the model we will train.

In [None]:
## PYTORCH CODE
import torch

class IMDbDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = IMDbDataset(train_encodings, train_labels)
val_dataset = IMDbDataset(val_encodings, val_labels)
test_dataset = IMDbDataset(test_encodings, test_labels)

Now that our datasets are ready, we can fine-tune a model either with the 🤗
`Trainer`/`TFTrainer` or with native PyTorch/TensorFlow. See [training](https://huggingface.co/transformers/training.html).

#### Fine-tuning with Trainer

Let's create our custom metrics calculation function in order to measure the performance of the model

In [None]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }
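
To see what these metrics measure, the sketch below recomputes them by hand on a tiny made-up example, treating class 1 as the positive class as `average='binary'` does by default. The function `binary_metrics` and the toy labels are hypothetical and only illustrate the definitions.

```python
def binary_metrics(labels, preds, positive=1):
    """Accuracy, precision, recall and F1 for a binary task, computed from raw counts."""
    tp = sum(1 for y, p in zip(labels, preds) if y == positive and p == positive)
    fp = sum(1 for y, p in zip(labels, preds) if y != positive and p == positive)
    fn = sum(1 for y, p in zip(labels, preds) if y == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = sum(1 for y, p in zip(labels, preds) if y == p) / len(labels)
    return {"accuracy": accuracy, "f1": f1, "precision": precision, "recall": recall}

toy_labels = [1, 0, 1, 1, 0, 0, 1, 0]
toy_preds = [1, 0, 0, 1, 0, 0, 0, 0]   # a cautious model: it misses two positive reviews

toy_metrics = binary_metrics(toy_labels, toy_preds)
print(toy_metrics)
```

Here every predicted positive is correct (precision 1.0) but only half of the true positives are found (recall 0.5), a difference that accuracy alone would hide.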

The steps above prepared the datasets in the way the trainer expects. Now all we need to do is create a model
to fine-tune, define the `TrainingArguments`/`TFTrainingArguments` and
instantiate a `Trainer`/`TFTrainer`.

In [None]:
from transformers import DistilBertForSequenceClassification

model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')

Let's now train our DistilBERT model. We limit the fine-tuning to 1 epoch for time reasons; for better performance you should fine-tune for at least 2-4 epochs.

In [None]:
## PYTORCH CODE
from transformers import DistilBertForSequenceClassification, Trainer, TrainingArguments, BertForSequenceClassification

torch.set_grad_enabled(True) # Enable gradient calculation to perform the training

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=1,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
    do_eval=True,                    # enable/disable the evaluation on the validation set during the training
    evaluation_strategy='steps',     # whether to validate the model each N steps or at the end of each epoch
    eval_steps=200                   # if the evaluation strategy is 'steps', the model is evaluated on the validation set every 'eval_steps' steps during training
)
# To customize the training arguments further, see:
# https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments


trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=val_dataset,            # evaluation dataset
    compute_metrics=compute_metrics
)

trainer.train()
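
The `warmup_steps` argument deserves a short illustration. With the default linear scheduler, the learning rate grows from 0 to its base value during the warmup and then decays linearly to 0 at the end of training. The sketch below reproduces that shape in pure Python; the base rate 5e-5 is the `TrainingArguments` default, while `total_steps=1400` and the function name `linear_warmup_lr` are assumptions made for the example.

```python
def linear_warmup_lr(step, total_steps, warmup_steps=500, base_lr=5e-5):
    """Learning rate at `step`: linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return base_lr * remaining

total_steps = 1400   # hypothetical number of optimization steps
for step in [0, 250, 500, 950, 1400]:
    print(step, linear_warmup_lr(step, total_steps))
```

Starting with small steps avoids disrupting the pre-trained weights while the freshly initialized classification layer is still producing noisy gradients.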

Once the model is trained we can evaluate it on the test set.

In [None]:
trainer.evaluate(test_dataset)

We improved on the performance obtained previously with the RNN, even though we used the distilled version of BERT fine-tuned for only one epoch. Larger BERT models can reach accuracies of almost 96% on this dataset and task (http://nlpprogress.com/english/sentiment_analysis.html).

Finally, you can try to perform the sentiment classification on an arbitrary sentence.

In [None]:
sentence_to_be_classified = "I won't buy again this product."  # You can type an arbitrary sentence to be classified

on_demand_test_encodings = tokenizer([sentence_to_be_classified], truncation=True, padding=True)
on_demand_test_dataset = IMDbDataset(on_demand_test_encodings, le.transform(['neg']))

result = trainer.predict(on_demand_test_dataset)
print("Logits: " + str(result.predictions))

We got the logit predictions, but if you want probabilities you have to convert them using a softmax transformation. Then we can get the predicted class.

In [None]:
import torch

# Convert logits to probabilities using softmax
p = torch.nn.functional.softmax(torch.from_numpy(result.predictions), dim=1)
print("Probabilities: " + str(p))

# Get the predicted classes for each output
top_p, top_class = p.topk(1, dim = 1)
print("Top class: " + str(top_class[0][0].item()))

print() 

# LabelEncoder sorts class names, so assuming 'neg'/'pos' labels, class 1 is 'pos'
if top_class[0][0].item() == 1:
  print('The sentence polarity is positive.')
else:
  print('The sentence polarity is negative.')