# **Natural language processing tutorial**
In today's tutorial we will design and train deep neural networks to solve a text classification problem.

We will use the [**TensorFlow**](https://ekababisong.org/gcp-ml-seminar/tensorflow/) framework and the [**Keras**](https://keras.io/) open-source library to rapidly prototype deep neural networks.

# **Preliminary operations**
The following code downloads all the necessary material into the remote machine. At the end of the execution select the **File** tab to verify that everything has been correctly downloaded.

In [None]:
!wget https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz

!tar -xzf aclImdb_v1.tar.gz

!rm aclImdb_v1.tar.gz
!rm -rf aclImdb/train/unsup

# **Useful modules import**
First of all, it is necessary to import useful modules used during the tutorial.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras.preprocessing import text_dataset_from_directory
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow import keras
from tensorflow.keras import layers
from sklearn.model_selection import train_test_split
from scipy.spatial import distance

# **Utility functions**
Execute the following code to define some utility functions used in the tutorial:
- **convert_dataset_to_list** converts an instance of the TensorFlow class [**Dataset**](https://www.tensorflow.org/api_docs/python/tf/data/Dataset) containing text data and labels into two lists, the former containing text strings and the latter containing the corresponding labels;
- **plot_histograms** plots multiple histograms;
- **plot_history** draws in a graph the loss trend over epochs on both training and validation sets. Moreover, if provided, it draws in the same graph also the trend of the given metric;
- **plot_embedded_similar_words** plots an embedded vector space highlighting the most and least similar words to a selected one.

In [None]:
def convert_dataset_to_list(dataset):
  data = []
  labels=[]
  for text_batch, label_batch in dataset:
    for i in range(text_batch.shape[0]):
      data.append(text_batch.numpy()[i])
      labels.append(label_batch.numpy()[i])

  return data,labels

def plot_histograms(hist_list,title_list,x_label=None,y_label=None,max_y=None,bins=range(0, 1500, 50),figsize=(20,6)):
  _,axs=plt.subplots(1,len(hist_list),figsize=figsize)

  for i, hist in enumerate(hist_list):
    axs[i].set_title(title_list[i])
    # Use the labels passed by the caller instead of hard-coded strings
    if x_label is not None:
      axs[i].set_xlabel(x_label)
    if y_label is not None:
      axs[i].set_ylabel(y_label)
    if max_y is not None:
      axs[i].set_ylim(0,max_y)
    axs[i].hist(hist, bins=bins)

def plot_history(history,metric=None):
  fig, ax1 = plt.subplots(figsize=(10, 8))

  epoch_count=len(history.history['loss'])

  line1,=ax1.plot(range(1,epoch_count+1),history.history['loss'],label='train_loss',color='orange')
  ax1.plot(range(1,epoch_count+1),history.history['val_loss'],label='val_loss',color = line1.get_color(), linestyle = '--')
  ax1.set_xlim([1,epoch_count])
  ax1.set_ylim([0, max(max(history.history['loss']),max(history.history['val_loss']))])
  ax1.set_ylabel('loss',color = line1.get_color())
  ax1.tick_params(axis='y', labelcolor=line1.get_color())
  ax1.set_xlabel('Epochs')
  _=ax1.legend(loc='lower left')

  if (metric!=None):
    ax2 = ax1.twinx()
    line2,=ax2.plot(range(1,epoch_count+1),history.history[metric],label='train_'+metric)
    ax2.plot(range(1,epoch_count+1),history.history['val_'+metric],label='val_'+metric,color = line2.get_color(), linestyle = '--')
    ax2.set_ylim([0, max(max(history.history[metric]),max(history.history['val_'+metric]))])
    ax2.set_ylabel(metric,color=line2.get_color())
    ax2.tick_params(axis='y', labelcolor=line2.get_color())
    _=ax2.legend(loc='upper right')

def plot_embedded_similar_words(unique_reshaped_embedded_x,unique_reshaped_embedded_x_sorted_indices,sorted_val_unique_words,similar_count,figsize=(25,10),point_size=5):
  most_similar_point_coords=unique_reshaped_embedded_x[unique_reshaped_embedded_x_sorted_indices[:similar_count]]
  less_similar_point_coords=unique_reshaped_embedded_x[unique_reshaped_embedded_x_sorted_indices[-similar_count:]]

  point_colors=['blue' if i in unique_reshaped_embedded_x_sorted_indices[:similar_count] else 'red' if i in unique_reshaped_embedded_x_sorted_indices[-similar_count:] else 'gray' for i in range(unique_reshaped_embedded_x.shape[0])]

  _,axs=plt.subplots(1,3,figsize=figsize)
  axs[0].scatter(unique_reshaped_embedded_x[:,0],unique_reshaped_embedded_x[:,1],c=point_colors,s=point_size)
  axs[0].set_title('Embedded vector space')

  axs[1].scatter(most_similar_point_coords[:,0],most_similar_point_coords[:,1],c='blue',s=point_size)
  axs[1].set_title('Embedded most similar words')
  for i, label in enumerate(sorted_val_unique_words[:similar_count]):
    axs[1].annotate(label, (most_similar_point_coords[i][0], most_similar_point_coords[i][1]),fontsize=14)

  axs[2].scatter(less_similar_point_coords[:,0],less_similar_point_coords[:,1],c='red',s=point_size)
  axs[2].set_title('Embedded less similar words')
  for i, label in enumerate(sorted_val_unique_words[-similar_count:]):
    axs[2].annotate(label, (less_similar_point_coords[i][0], less_similar_point_coords[i][1]),fontsize=14)

# **Dataset**
This tutorial uses the [Stanford’s large movie review dataset](https://ai.stanford.edu/~amaas/data/sentiment/) for binary sentiment text classification.

The dataset contains 25000 highly polar movie reviews for training and 25000 for testing.

The following code loads the dataset into memory using the [**text_dataset_from_directory**](https://keras.io/api/data_loading/text/#textdatasetfromdirectory-function) function, which returns an instance of the TensorFlow class **Dataset**.

In [None]:
train_dataset = text_dataset_from_directory('aclImdb/train')
test_dataset = text_dataset_from_directory('aclImdb/test')

The **element_spec** attribute can be used to get the type specification of the elements of the dataset. In our case each element is a review and its label: 1 for “positive” and 0 for “negative”.

In [None]:
train_dataset.element_spec

The following code converts the training and test **Dataset** instances into lists of reviews and corresponding labels.

In [None]:
train_reviews,train_y=convert_dataset_to_list(train_dataset)
test_reviews,test_y=convert_dataset_to_list(test_dataset)

print('Training review count: ',len(train_reviews))
print('Training label count: ',len(train_y))
print('Test review count: ',len(test_reviews))
print('Test label count: ',len(test_y))

## **Visualization**
The first *review_count* training reviews can be shown by executing the following code.

In [None]:
review_count=5

for i in range(review_count):
  print(train_y[i],train_reviews[i])

## **Data preparation**
Most machine learning algorithms require data to be formatted in a specific way, so datasets generally require some amount of preparation before they can yield useful insights. Some datasets have values that are missing, invalid, or otherwise difficult for an algorithm to process.

### **Decode UTF-8-encoded string**
The 'b' prefix before the review strings means that they are `bytes` objects (raw UTF-8 encoded data) rather than Python strings.

The following code decodes them into regular Python (Unicode) strings using the [**decode**](https://docs.python.org/3/library/stdtypes.html#bytes.decode) method.

In [None]:
prepared_train_reviews=[review.decode('utf-8') for review in train_reviews]
prepared_test_reviews=[review.decode('utf-8') for review in test_reviews]

for i in range(review_count):
  print(train_y[i],prepared_train_reviews[i])
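
As a quick illustration of the difference between the two types, the following sketch decodes a made-up bytes literal. The `raw_review` value is an assumption chosen for illustration and is not taken from the dataset.

```python
# Hypothetical review text stored as UTF-8 encoded bytes (note the b prefix).
raw_review = b'A caf\xc3\xa9 scene I loved'

# decode() turns the bytes object into a regular Python string.
decoded_review = raw_review.decode('utf-8')

print(type(raw_review).__name__, type(decoded_review).__name__)
print(decoded_review)
```

The accented character occupies two bytes in the encoded form but is a single character once decoded, which is why the two objects have different lengths.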

### **Remove HTML line break tag**
The HTML line break tag (`<br />`) appears in many of the movie reviews.


In [None]:
br_tag_count=0
for review in prepared_train_reviews:
  br_tag_count+=review.count('<br />')
print('HTML line break tag occurrences in training set reviews: ',br_tag_count)

Because it is not an English vocabulary word, it is better to replace it with a blank space using the [**replace**](https://docs.python.org/3/library/stdtypes.html#str.replace) method.

In [None]:
prepared_train_reviews=[review.replace('<br />', ' ') for review in prepared_train_reviews]
prepared_test_reviews=[review.replace('<br />', ' ') for review in prepared_test_reviews]

## **Split data into training and validation sets**
In order to detect overfitting during training and to choose the optimal value for the hyperparameters, it is necessary to have a separate dataset (called validation set) in addition to the training and test datasets.

For this reason, *prepared_train_reviews* and *train_y* are divided into two subsets: training and validation sets. 

Scikit-learn library provides the function [**train_test_split**](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) to separate a dataset into two parts.

The *val_size* variable represents the fraction (a float between 0 and 1) or the absolute number (an integer) of patterns to include in the validation set.

By default, **train_test_split** shuffles the patterns before splitting, so that the returned datasets are unlikely to contain patterns belonging only to a subset of the classes.

In [None]:
val_size=5000

train_x, val_x, train_y, val_y = train_test_split(prepared_train_reviews, train_y, test_size=val_size, random_state=42,shuffle=True)

print('Training review count: ',len(train_x))
print('Training label count: ',len(train_y))
print('Validation review count: ',len(val_x))
print('Validation label count: ',len(val_y))
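
To make the role of shuffling concrete, here is a minimal pure-Python sketch of a shuffled hold-out split. It is not scikit-learn's implementation: the `holdout_split` helper and the toy data are hypothetical and only mimic the idea.

```python
import random

def holdout_split(samples, labels, val_count, seed=42):
    """Shuffle the indices, then reserve the first val_count of them for validation."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    val_idx, train_idx = indices[:val_count], indices[val_count:]
    pick = lambda seq, idx: [seq[i] for i in idx]
    return (pick(samples, train_idx), pick(samples, val_idx),
            pick(labels, train_idx), pick(labels, val_idx))

# Toy data sorted by class, as it could be when read from class folders
toy_x = ['review {}'.format(i) for i in range(10)]
toy_y = [0] * 5 + [1] * 5

toy_train_x, toy_val_x, toy_train_y, toy_val_y = holdout_split(toy_x, toy_y, val_count=3)
print(len(toy_train_x), len(toy_val_x), toy_val_y)
```

Without the shuffle, the first three patterns would all belong to class 0. Shuffling does not guarantee balanced classes, though; for that purpose **train_test_split** offers the *stratify* parameter.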

## **Convert label lists into Numpy arrays**
It is convenient to keep the data in the form of Numpy arrays instead of lists. While reviews need further processing before they can be transformed into Numpy arrays, the labels can be already converted.

The following code converts the labels from lists of integers into Numpy arrays using the Numpy [**array**](https://numpy.org/doc/stable/reference/generated/numpy.array.html) function.

In [None]:
train_y=np.array(train_y)
val_y=np.array(val_y)
test_y=np.array(test_y)

print('Training label shape: ',train_y.shape)
print('Validation label shape: ',val_y.shape)
print('Test label shape: ',test_y.shape)

# **Text tokenization**
Raw text cannot be directly fed into deep learning models. Text data must be encoded as numbers before it can be used as input or output for machine learning and deep learning models.

Keras provides the [**Tokenizer**](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer) class for preparing text documents for deep learning. It vectorizes a text corpus by turning each text into a sequence of integers where each integer is the index of a word (or token) in a dictionary.

The following code creates a new instance of the **Tokenizer** class.

In [None]:
tokenizer = Tokenizer()

The [**fit_on_texts**](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer#fit_on_texts) method can be used to create the internal vocabulary based on training word frequency.

In [None]:
tokenizer.fit_on_texts(train_x)

The internal vocabulary can be accessed through the *word_counts* attribute: an ordered dictionary containing all words used to create the vocabulary and their corresponding frequency.

In [None]:
print(tokenizer.word_counts)
print('Word count: ',len(tokenizer.word_counts))

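Another important attribute is *word_index*, a dictionary that maps each word to a unique integer. Indices are assigned by decreasing word frequency starting from 1 (0 is a reserved index that is not assigned to any word).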

In [None]:
print(tokenizer.word_index)

Once the **Tokenizer** has been fit on training data, the [**texts_to_sequences**](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer#texts_to_sequences) method can be used to encode documents of the training, validation and test datasets by transforming each movie review into a sequence of integers. 

The *num_words* parameter represents the maximum number of words to keep, based on word frequency: only the *num_words*-1 most frequent words are encoded, while all the others are discarded.

In [None]:
num_words=50000

tokenizer.num_words=num_words

tokenized_train_x=tokenizer.texts_to_sequences(train_x)
tokenized_val_x=tokenizer.texts_to_sequences(val_x)
tokenized_test_x=tokenizer.texts_to_sequences(prepared_test_reviews)

The following code shows the result of the tokenization process on a training movie review.

In [None]:
idx=0
print(tokenized_train_x[idx])
print('Review length: ',len(tokenized_train_x[idx]))

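To map a sequence of integers back to text, the [**sequences_to_texts**](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer#sequences_to_texts) method can be used. Note that the result is the normalized text seen by the **Tokenizer** (lowercase, without punctuation and without the discarded words), not the exact original review.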

In [None]:
print(tokenizer.sequences_to_texts([tokenized_train_x[idx]]))
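
The whole tokenization step can be summarized with a simplified pure-Python sketch. It is not the Keras implementation (which, for instance, also strips punctuation); the `toy_` names and the three-sentence corpus are assumptions made for illustration.

```python
from collections import Counter

toy_corpus = ['the movie was great', 'the plot was weak', 'great cast great movie']

# Count the words, then assign indices by decreasing frequency starting from 1
# (index 0 is left free, since it is used later for padding).
toy_counts = Counter(word for text in toy_corpus for word in text.lower().split())
toy_word_index = {word: rank for rank, (word, _) in enumerate(toy_counts.most_common(), start=1)}

def toy_texts_to_sequences(texts, word_index, num_words=None):
    """Map each word to its index; drop unknown words and indices >= num_words."""
    sequences = []
    for text in texts:
        tokens = [word_index[w] for w in text.lower().split() if w in word_index]
        if num_words is not None:
            tokens = [t for t in tokens if t < num_words]
        sequences.append(tokens)
    return sequences

print(toy_word_index)
print(toy_texts_to_sequences(['the movie was awful'], toy_word_index))
print(toy_texts_to_sequences(['the movie was awful'], toy_word_index, num_words=3))
```

The word 'awful' never appears in the toy corpus, so it is silently dropped; lowering `num_words` drops the less frequent known words as well.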

# **Make all reviews of the same length**
As shown by the following code, the tokenized reviews have (in most cases) a different number of words.

In [None]:
plot_histograms([[len(x) for x in tokenized_train_x],[len(x) for x in tokenized_val_x],[len(x) for x in tokenized_test_x]],
                ['Training set','Validation set','Test set'],x_label='# words per review',y_label='count',max_y=8000,bins=range(0, 1500, 50),figsize=(20,6))

Neural networks, however, are trained on batches of fixed-shape tensors. To obtain sequences of the same length, the [**pad_sequences**](https://keras.io/api/preprocessing/timeseries/#padsequences-function) function can be used. It transforms each sequence into a Numpy array of predefined length (*maxlen* parameter):
- sequences that are shorter than *maxlen* are padded with zeros (by default at the beginning);
- sequences longer than *maxlen* are truncated (by default by removing the first elements).

In [None]:
maxlen=500

padded_train_x=pad_sequences(tokenized_train_x, maxlen=maxlen)
padded_val_x=pad_sequences(tokenized_val_x, maxlen=maxlen)
padded_test_x=pad_sequences(tokenized_test_x, maxlen=maxlen)

print('Training feature shape: ',padded_train_x.shape)
print('Validation feature shape: ',padded_val_x.shape)
print('Test feature shape: ',padded_test_x.shape)
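
The default behaviour described above (padding and truncation both applied at the beginning of the sequence) can be reproduced with a few lines of plain Python. The `toy_pad_sequences` helper is a hypothetical stand-in written for illustration, not the Keras function.

```python
def toy_pad_sequences(sequences, maxlen, value=0):
    """Left-pad short sequences and keep only the last maxlen tokens of long ones."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                            # 'pre' truncation
        padded.append([value] * (maxlen - len(seq)) + seq)   # 'pre' padding
    return padded

print(toy_pad_sequences([[5, 8], [1, 2, 3, 4, 5, 6]], maxlen=4))
```

Keeping the end of a long review and placing the padding first means that the last timesteps seen by a recurrent layer always contain real words.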

# **Deep RNN**
In this section a deep RNN is implemented for the binary classification of movie reviews.

## **Model definition**
The following function creates a deep RNN model given:
- the number of timesteps in each input sequence (*timesteps*);
- the number of features in each timestep (*feature_count*);
- the number of units for each RNN layer (*unit_count_per_rnn_layer*).

The model returns a single target value given an entire sequence as input (*many-to-one*).

<u>Note that the number of timesteps in each input sequence (*timesteps*) is set in advance only because it improves performance during training by creating tensors of fixed shapes. A *None* value can be used to admit variable-length input sequences.</u>

In Keras, a sequential model is a stack of layers where each layer has exactly one input and one output. It can be created by passing a list of layers to the [**keras.Sequential**](https://keras.io/guides/sequential_model/) constructor or by adding the layers one at a time with the **add** method.

[**Keras layers API**](https://keras.io/api/layers/) offers a wide range of built-in layers ready for use, including:
- [**Input**](https://keras.io/api/layers/core_layers/input/) - the input of the model. Note that, you can also omit the **Input** layer. In that case the model doesn't have any weights until the first call to a training/evaluation method (since it is not yet built);
- [**SimpleRNN**](https://keras.io/api/layers/recurrent_layers/simple_rnn/) - a fully-connected RNN where the output is to be fed back to input;
- [**Dense**](https://keras.io/api/layers/core_layers/dense/) - a fully-connected layer.

The *return_sequences* parameter of the **SimpleRNN** layer serves to return the full output time sequence (True), or only the last output (False).

In [None]:
def build_deep_rnn(timesteps,feature_count,unit_count_per_rnn_layer=(1,)):
  model = keras.Sequential()
  model.add(layers.Input(shape=(timesteps,feature_count)))

  for i in range(len(unit_count_per_rnn_layer)):
    model.add(layers.SimpleRNN(unit_count_per_rnn_layer[i],activation='sigmoid',return_sequences=i<(len(unit_count_per_rnn_layer)-1)))

  if unit_count_per_rnn_layer[-1]>1:
    model.add(layers.Dense(1,activation='sigmoid'))

  return model
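
To clarify what a recurrent layer computes and what *return_sequences* changes, the following sketch runs a single recurrent unit over a one-feature sequence in plain Python. The weights and the input values are arbitrary assumptions, and the helper is a simplification rather than the Keras **SimpleRNN** code.

```python
import math

def toy_simple_rnn(inputs, w_x, w_h, bias, return_sequences=False):
    """Apply h_t = sigmoid(w_x * x_t + w_h * h_(t-1) + bias) along the sequence."""
    sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
    state, outputs = 0.0, []
    for x_t in inputs:
        # The previous output (state) is fed back together with the new input
        state = sigmoid(w_x * x_t + w_h * state + bias)
        outputs.append(state)
    return outputs if return_sequences else outputs[-1]

toy_inputs = [0.0, 1.0, 2.0]
toy_all_states = toy_simple_rnn(toy_inputs, w_x=0.5, w_h=-1.0, bias=0.1, return_sequences=True)
toy_last_state = toy_simple_rnn(toy_inputs, w_x=0.5, w_h=-1.0, bias=0.1)

print(toy_all_states)   # one output per timestep (needed by a following RNN layer)
print(toy_last_state)   # only the last output (many-to-one)
```

This is why, in the function above, every recurrent layer except the last one returns the full sequence: the next recurrent layer needs one input per timestep.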

## **Model creation**
The following code creates a deep RNN model by calling the **build_deep_rnn** function defined above.

In [None]:
deep_rnn=build_deep_rnn(maxlen,1,unit_count_per_rnn_layer=[8])

## **Model visualization**
A string summary of the network can be printed by executing the following code.

In [None]:
deep_rnn.summary()

Alternatively, a plot of the neural network graph can be visualized.

In [None]:
keras.utils.plot_model(deep_rnn,show_shapes=True,show_layer_names=False)

## **Model compilation**
The compilation is the final step in configuring the model for training. 

The following code uses the [**compile**](https://keras.io/api/models/model_training_apis/#compile-method) method to compile the model.
The important arguments are:
- the optimization algorithm (*optimizer*);
- the loss function (*loss*);
- the metrics used to evaluate the performance of the model (*metrics*).

The most common [optimization algorithms](https://keras.io/api/optimizers/#available-optimizers), [loss functions](https://keras.io/api/losses/#available-losses) and [metrics](https://keras.io/api/metrics/#available-metrics) are already available in Keras. You can either pass them to **compile** as an instance or by the corresponding string identifier. In the latter case, the default parameters will be used.

In [None]:
deep_rnn.compile(optimizer='adam',loss='binary_crossentropy', metrics=['accuracy'])
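
For reference, the quantities selected above can be computed by hand. The sketch below is a plain-Python approximation of binary cross-entropy and accuracy; the helper names, the clipping constant and the 0.5 threshold are assumptions made for illustration, and the predictions are invented.

```python
import math

def toy_binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)], with p clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

def toy_accuracy(y_true, y_pred, threshold=0.5):
    """Fraction of predictions that fall on the correct side of the threshold."""
    hits = sum(int(p > threshold) == y for y, p in zip(y_true, y_pred))
    return hits / len(y_true)

toy_labels = [1, 0, 1, 0]
toy_predictions = [0.9, 0.1, 0.6, 0.6]   # invented sigmoid outputs

print('Loss: {:.3f}'.format(toy_binary_crossentropy(toy_labels, toy_predictions)))
print('Accuracy: {:.3f}'.format(toy_accuracy(toy_labels, toy_predictions)))
```

The loss penalizes confident mistakes much more than hesitant ones, whereas the accuracy only counts on which side of the threshold each prediction falls.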

## **Data shape expansion**
Recurrent neural networks expect the input data to be provided with a specific array structure in the form of: [*samples*, *time steps*, *features*] while our data is in the form: [*samples*, *time steps*].

We can transform our data into the expected structure using the Numpy function [**expand_dims**](https://numpy.org/doc/stable/reference/generated/numpy.expand_dims.html).

In [None]:
expanded_padded_train_x =np.expand_dims(padded_train_x, axis=2)
expanded_padded_val_x =np.expand_dims(padded_val_x, axis=2)
expanded_padded_test_x =np.expand_dims(padded_test_x, axis=2)

print('Expanded training feature shape: ',expanded_padded_train_x.shape)
print('Expanded validation feature shape: ',expanded_padded_val_x.shape)
print('Expanded test feature shape: ',expanded_padded_test_x.shape)
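
The effect of the expansion is easier to see on a tiny example. The sketch below uses nested Python lists instead of Numpy arrays, purely to make the change of structure visible; the values are invented.

```python
# Shape (2, 3): [samples, time steps]
toy_batch = [[0, 4, 2],
             [1, 3, 5]]

# Shape (2, 3, 1): [samples, time steps, features], each token becomes a 1-element vector
toy_batch_3d = [[[token] for token in sequence] for sequence in toy_batch]

print(toy_batch_3d)
```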

## **Training**
Now we are ready to train our model by calling the [**fit**](https://keras.io/api/models/model_training_apis/#fit-method) method.

It trains the model for a fixed number of epochs (*epoch_count*) using the training set (*expanded_padded_train_x*) divided into mini-batches of *batch_size* elements. During the training process, the performances will be evaluated on both training and validation (*expanded_padded_val_x*) sets.

Stopping the training when a metric or the loss has stopped improving on the validation set helps to avoid overfitting.

For this purpose, Keras provides a class called [**EarlyStopping**](https://keras.io/api/callbacks/early_stopping/). Important class parameters are:
- *monitor* - the name of the metric or the loss to be observed; 
- *patience* - the number of epochs with no improvement after which training will be stopped;
- *restore_best_weights* - whether to restore model weights from the epoch with the best value of the monitored quantity.

Once an instance of the **EarlyStopping** class has been created, it can be passed to the **fit** method through the *callbacks* parameter.

In [None]:
epoch_count = 20
batch_size = 1000
patience=5

early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=patience, restore_best_weights=True)

history = deep_rnn.fit(expanded_padded_train_x,train_y,validation_data=(expanded_padded_val_x,val_y),epochs=epoch_count,batch_size = batch_size,callbacks=[early_stop])
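
The patience mechanism can be sketched in a few lines of plain Python. This is a simplified model of the logic, not the Keras callback itself; the `toy_early_stopping` helper and the validation losses are invented for illustration.

```python
def toy_early_stopping(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 1-based, for a list of per-epoch losses."""
    best_loss, best_epoch, wait = float('inf'), 0, 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # Improvement: remember this epoch and reset the counter
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

toy_val_losses = [0.70, 0.65, 0.61, 0.62, 0.63, 0.60, 0.64, 0.66, 0.65]
toy_stop_epoch, toy_best_epoch = toy_early_stopping(toy_val_losses, patience=3)
print('Stopped at epoch {}, best epoch {}'.format(toy_stop_epoch, toy_best_epoch))
```

With *restore_best_weights* enabled, the weights kept at the end are those of the best epoch rather than those of the epoch at which training stopped.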

### **Visualize the training process**
We can learn a lot about our model by observing the graph of its performance over time during training.

The **fit** method returns an object (*history*) containing loss and metrics values at successive epochs for both training and validation sets.

The following code calls the **plot_history** function defined above to draw in a graph the loss and accuracy trend over epochs on both training and validation sets.

In [None]:
plot_history(history,metric='accuracy')

## **Performance evaluation on the test set**
The performance on the test set can be easily measured by calling the **evaluate** method.

In [None]:
results = deep_rnn.evaluate(expanded_padded_test_x, test_y, batch_size=batch_size,verbose=0)
print('Loss: {:.3f} Accuracy: {:.3f}'.format(results[0],results[1]))

The results obtained are very poor. This is mainly due to the *problem of sparsity*. 

### **Problem of sparsity**

With text tokenization, each word has been turned into an integer (representing the index of the word in a dictionary), but such integer values do not capture the similarity in meaning between words.

For instance, as shown in the following cell, although ‘brilliant’ and ‘beautiful’ have a similar meaning, their tokens are further apart than those of ‘brilliant’ and ‘horrible’.

In [None]:
first_word='brilliant'
second_word='beautiful'
third_word='horrible'

first_word_token=tokenizer.texts_to_sequences([first_word])[0][0]
second_word_token=tokenizer.texts_to_sequences([second_word])[0][0]
third_word_token=tokenizer.texts_to_sequences([third_word])[0][0]

print('Distance between \'{}\' and \'{}\': |{}-{}|={}'.format(first_word,second_word,first_word_token,second_word_token,abs(first_word_token-second_word_token)))
print('Distance between \'{}\' and \'{}\': |{}-{}|={}'.format(first_word,third_word,first_word_token,third_word_token,abs(first_word_token-third_word_token)))

# **Word embedding**
*Word embedding* is a class of approaches for representing words using a dense vector representation. Individual words are represented as real-valued vectors in a predefined vector space where a real-valued vector encodes the meaning of the corresponding word such that words similar in meaning are closer in the vector space.

A word embedding can be learned as part of a deep learning model where the position of a word within the vector space is learned from text and is based on the words that surround the word when it is used.

## **Model definition**
The following function creates a word embedding model given:
- the size of the vocabulary (*vocab_size*);
- the size of the embedded vector space (*output_dim*);
- the length of the input sequences (*input_length*).

Keras offers an [**Embedding**](https://keras.io/api/layers/core_layers/embedding/#embedding) layer that can be used for neural networks on text data. It requires that the input data be integer encoded, so that each word is represented by a unique integer (as we have already done with the **Tokenizer**).

It is a flexible layer that can be used in different ways:
- it can be used alone to learn a word embedding that can be saved and used in another model later;
- it can be used as part of a deep learning model where the embedding is learned along with the model itself.

Usually the **Embedding** layer is defined as the first layer of a network specifying three arguments:
- *input_dim*: the size of the vocabulary in the text data (*num_words* variable used in text tokenization);
- *output_dim*: the size of the vector space in which words will be embedded;
- *input_length*: the length of input sequences (*maxlen* variable).

For each input sequence, the output of the **Embedding** layer is a 2D array (*input_length* × *output_dim*) with one embedding for each word in the sequence.

To directly connect an **Embedding** layer to a fully-connected layer, you must first flatten the 2D output to a 1D array using the [**Flatten**](https://keras.io/api/layers/reshaping_layers/flatten/) layer.

In [None]:
def build_embedding_model(vocab_size,output_dim,input_length):
  embedding_model=keras.Sequential()
  embedding_model.add(layers.Input(shape=(input_length,)))
  embedding_layer=layers.Embedding(vocab_size, output_dim, input_length=input_length)
  embedding_model.add(embedding_layer)
  embedding_model.add(layers.Flatten())
  embedding_model.add(layers.Dense(1, activation='sigmoid'))
            
  return embedding_model,embedding_layer
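
Conceptually, an embedding layer is a lookup table with one trainable row per token. The following plain-Python sketch shows the lookup and the subsequent flattening on a single toy sequence; the matrix values are invented, whereas in the real layer they are learned during training.

```python
# Hypothetical embedding matrix: one row per token index, output_dim = 2
toy_embedding_matrix = [
    [0.0, 0.0],    # index 0 (padding)
    [0.9, 0.1],    # index 1
    [-0.7, 0.3],   # index 2
    [0.8, 0.2],    # index 3
]
toy_sequence = [0, 0, 3, 1, 2]   # one padded review with input_length = 5

# Embedding: one row per token, giving a structure of shape (input_length, output_dim)
toy_embedded = [toy_embedding_matrix[token] for token in toy_sequence]

# Flatten: the rows are concatenated into input_length * output_dim values
toy_flattened = [value for row in toy_embedded for value in row]

print(toy_embedded)
print(toy_flattened)
```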

## **Model creation**
The following code creates a word embedding model by calling the **build_embedding_model** function defined above. The *embedded_vector_dim* variable represents the size of the vector space of the **Embedding** layer. 

In [None]:
embedded_vector_dim=2

emb_model,emb_layer=build_embedding_model(num_words,embedded_vector_dim,maxlen)

## **Model visualization**
A string summary of the network can be printed by executing the following code.

In [None]:
emb_model.summary()

Alternatively, a plot of the neural network graph can be visualized.

In [None]:
keras.utils.plot_model(emb_model,show_shapes=True,show_layer_names=False)

## **Model compilation**
The following code compiles the model as already done before.

In [None]:
emb_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

## **Training**
Now we are ready to train our word embedding model on the *padded_train_x* dataset by calling the **fit** method.

In [None]:
epoch_count = 50
batch_size = 1000
patience=5

early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=patience, restore_best_weights=True)

history = emb_model.fit(padded_train_x,train_y,validation_data=(padded_val_x,val_y),epochs=epoch_count,batch_size = batch_size,callbacks=[early_stop])

### **Visualize the training process**
The following code calls the **plot_history** function defined above to draw in a graph the loss and accuracy trend over epochs on both training and validation sets.

In [None]:
plot_history(history,metric='accuracy')

## **Data embedding**
To apply the trained **Embedding** layer to our data, it is necessary to get its output. To do this, we create a new Keras [**Model**](https://keras.io/api/models/model/), setting:
- its inputs equal to the input of the trained embedding model (*emb_model*);
- its outputs equal to the output of the trained **Embedding** layer (*emb_layer*).

By executing the following code, the new **Model** is created.

In [None]:
emb_layer_model=keras.Model(inputs=emb_model.input,outputs=emb_layer.output)

The [**predict**](https://keras.io/api/models/model_training_apis/#predict-method) method of the **Model** created above can be used to generate the word embeddings (*embedded_train_x*, *embedded_val_x* and *embedded_test_x*) of the training, validation and test sets (*padded_train_x*, *padded_val_x* and *padded_test_x*).

In [None]:
embedded_train_x=emb_layer_model.predict(padded_train_x)
embedded_val_x=emb_layer_model.predict(padded_val_x)
embedded_test_x=emb_layer_model.predict(padded_test_x)

print('Embedded training feature shape: ',embedded_train_x.shape)
print('Embedded validation feature shape: ',embedded_val_x.shape)
print('Embedded test feature shape: ',embedded_test_x.shape)

## **Embedded vector space visualization**
To use Matplotlib functionalities to plot embedded vectors, the data need to be reshaped from the form [*samples*, *time steps*, *features*] to the form [*words*, *features*].

We can transform our data into the expected structure using the Numpy function [**reshape**](https://numpy.org/doc/stable/reference/generated/numpy.reshape.html).

Due to the size of the training and test sets, here we focus only on the visualization of the validation set, which is the smallest one.

In [None]:
reshaped_embedded_val_x=embedded_val_x.reshape(-1,embedded_val_x.shape[-1])

print('Embedded validation feature shape: ',reshaped_embedded_val_x.shape)

The following code visualizes the embedded validation set.

In [None]:
plt.scatter(reshaped_embedded_val_x[:,0],reshaped_embedded_val_x[:,1],c='gray',s=3)
plt.show()

## **Compute distance between embedded vectors**
To evaluate if the problem of sparsity has been solved by word embedding, the Euclidean distance between the embedded vectors of the words used before ('brilliant', 'beautiful' and 'horrible') will be computed.

To be used as input of the trained **Embedding** layer, the three word tokens are placed at the beginning of a zero-initialized sequence of *maxlen* elements.

The embedded vectors of the three words can be shown by executing the following code.


In [None]:
word_token_sequence=np.zeros((1,maxlen))
word_token_sequence[0,0]=first_word_token
word_token_sequence[0,1]=second_word_token
word_token_sequence[0,2]=third_word_token

emb_word_token_sequence=emb_layer_model.predict(word_token_sequence)

first_word_embedding=emb_word_token_sequence[0][0]
second_word_embedding=emb_word_token_sequence[0][1]
third_word_embedding=emb_word_token_sequence[0][2]

print('\'{}\' embedding: {}'.format(first_word,first_word_embedding))
print('\'{}\' embedding: {}'.format(second_word,second_word_embedding))
print('\'{}\' embedding: {}'.format(third_word,third_word_embedding))

The [**euclidean**](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.euclidean.html) function of the [**SciPy**](https://scipy.org/) library is used to compute the Euclidean distance between the three word embeddings.

In [None]:
print('Euclidean distance between \'{}\' and \'{}\' embeddings: {:.2f}'.format(first_word,second_word,distance.euclidean(first_word_embedding,second_word_embedding)))
print('Euclidean distance between \'{}\' and \'{}\' embeddings: {:.2f}'.format(first_word,third_word,distance.euclidean(first_word_embedding,third_word_embedding)))

Unlike the token distances, the embeddings of ‘brilliant’ and ‘beautiful’ are much closer than those of ‘brilliant’ and ‘horrible’.

## **Visualize most and least similar words**
In this section, to confirm the importance of word embedding, a word is selected (*selected_word*) and its most and least similar words in the embedded vector space are visualized.

First of all, the token of the *selected_word* must be derived using the **Tokenizer** created above in the text tokenization step.

In [None]:
selected_word='brilliant'

selected_word_token=tokenizer.texts_to_sequences([selected_word])

print('\'{}\'={}'.format(selected_word,selected_word_token[0][0]))

The Numpy **zeros** function is used to create an array of *maxlen* elements starting with the *selected_word_token* and followed by all zeros.

In [None]:
word_token_sequence=np.zeros((1,maxlen))
word_token_sequence[0,0]=selected_word_token[0][0]

print('Sequence shape: ',word_token_sequence.shape)

The corresponding embedded vector can be generated using the **predict** method of the model that wraps the trained **Embedding** layer (*emb_layer_model*).

In [None]:
emb_word_token_sequence=emb_layer_model.predict(word_token_sequence)
embedded_selected_word=emb_word_token_sequence[0,0]

print('Embedded sequence shape: ',emb_word_token_sequence.shape)
print('\'{}\' embedding={}'.format(selected_word,embedded_selected_word))

To find the most and least similar words, it is necessary to compute the distance in the embedded vector space between the selected word and all words in the validation set.

Since the validation set contains multiple instances of the same word, to avoid wasting time computing the same distance multiple times, only one instance of each embedded vector is kept using the Numpy [**unique**](https://numpy.org/doc/stable/reference/generated/numpy.unique.html) function.

In [None]:
unique_reshaped_embedded_val_x,unique_reshaped_embedded_val_x_indices = np.unique(reshaped_embedded_val_x,return_index=True, axis=0)

print('Unique embedded validation set shape: ',unique_reshaped_embedded_val_x.shape)

To compute the Euclidean distance between the embedded vector of the selected word (*embedded_selected_word*) and the embedded vectors of all unique words in the validation set (*unique_reshaped_embedded_val_x*), the [**cdist**](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cdist.html) function of the SciPy library is used.

In [None]:
distances=distance.cdist([embedded_selected_word], unique_reshaped_embedded_val_x)

print('Distances shape: ',distances.shape)
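For reference, the Euclidean distance computed above can be written explicitly with NumPy. This is only an illustrative sketch on hypothetical toy vectors (*toy_selected* and *toy_vectors*); it mirrors the output shape of **cdist**, which has one row per query vector.

```python
import numpy as np

# Hypothetical 2D vectors used only for illustration
toy_selected = np.array([0.0, 0.0])
toy_vectors = np.array([[3.0, 4.0],
                        [1.0, 0.0],
                        [0.0, 2.0]])

# Square root of the sum of squared differences, computed row by row
toy_distances = np.sqrt(((toy_vectors - toy_selected) ** 2).sum(axis=1))

# Add a leading axis to obtain one row per query vector
toy_distances = toy_distances[np.newaxis, :]

print('Distances shape: ', toy_distances.shape)  # (1, 3)
print(toy_distances)                             # [[5. 1. 2.]]
```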

Now we need to recover the unique words of the validation set starting from their embedded vectors.

To do this, the following steps are executed:
1. the indices of the unique embedded vectors in the validation set are sorted in ascending order of distance from the selected word using the NumPy function [**argsort**](https://numpy.org/doc/stable/reference/generated/numpy.argsort.html);
2. the sorted indices are mapped back to 2D indices giving the corresponding sequence ID in the validation set (*sorted_val_sequence_indices*) and the word ID inside the sequence (*sorted_val_sequence_word_indices*);
3. the sorted unique word tokens (*sorted_val_unique_tokens*) corresponding to the sorted unique embedded vectors are taken;
4. the sorted unique words (*sorted_val_unique_words*) corresponding to the sorted unique word tokens are recovered using the **sequences_to_texts** method of the **Tokenizer**.

In [None]:
#1
unique_reshaped_embedded_val_x_sorted_indices=np.argsort(distances[0])

#2
sorted_val_sequence_indices=unique_reshaped_embedded_val_x_indices[unique_reshaped_embedded_val_x_sorted_indices]//embedded_val_x.shape[1]
sorted_val_sequence_word_indices=unique_reshaped_embedded_val_x_indices[unique_reshaped_embedded_val_x_sorted_indices]%embedded_val_x.shape[1]

#3
sorted_val_unique_tokens=padded_val_x[sorted_val_sequence_indices,sorted_val_sequence_word_indices]

#4
sorted_val_unique_words=tokenizer.sequences_to_texts([sorted_val_unique_tokens])[0].split()

print(sorted_val_unique_words)
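The index arithmetic of steps 1-3 can be checked on a small example. The sketch below uses hypothetical toy data and assumes that the flat indices refer to a row-major flattening of the padded sequences (one flat position per word), so that integer division and remainder by the sequence length give the sequence ID and the word ID.

```python
import numpy as np

# Hypothetical toy data: 2 padded sequences of 3 tokens each
toy_padded = np.array([[0, 7, 4],
                       [4, 9, 7]])
sequence_length = toy_padded.shape[1]

# Flat position of the first occurrence of each unique token (0, 4, 7, 9)
toy_flat_indices = np.array([0, 2, 1, 4])
# Hypothetical distance of each unique token from the selected word
toy_distances = np.array([0.9, 0.1, 0.5, 0.3])

#1 sort by ascending distance
order = np.argsort(toy_distances)

#2 map the flat indices back to (sequence ID, word ID)
sorted_flat = toy_flat_indices[order]
seq_idx = sorted_flat // sequence_length
word_idx = sorted_flat % sequence_length

#3 recover the tokens, from the closest to the farthest
sorted_tokens = toy_padded[seq_idx, word_idx]

print(sorted_tokens)  # [4 9 7 0]
```

The same mapping can be obtained with the NumPy function **unravel_index**, called as `np.unravel_index(sorted_flat, toy_padded.shape)`.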

The following code visualizes the embedded vectors of the validation set, highlighting the most and least similar words to the selected one (*selected_word*).

In [None]:
similar_count=10

plot_embedded_similar_words(unique_reshaped_embedded_val_x,unique_reshaped_embedded_val_x_sorted_indices,sorted_val_unique_words,similar_count,figsize=(20,6))

# **Deep RNN with word embedding**
In this section a deep RNN is combined with word embedding to perform binary classification of movie reviews.

## **Model creation**
The following code creates a deep RNN model by calling the **build_deep_rnn** function defined above.

In this case the number of features used to represent each word (*feature_count*) is equal to the size of the embedded vector space (*embedded_vector_dim*).

In [None]:
emb_deep_rnn=build_deep_rnn(maxlen,embedded_vector_dim,unit_count_per_rnn_layer=[8])

## **Model visualization**
A string summary of the network can be printed by executing the following code.

In [None]:
emb_deep_rnn.summary()

Alternatively, a plot of the neural network graph can be visualized.

In [None]:
keras.utils.plot_model(emb_deep_rnn,show_shapes=True,show_layer_names=False)

## **Model compilation**
The following code compiles the model as already done before.

In [None]:
emb_deep_rnn.compile(optimizer='adam',loss='binary_crossentropy', metrics=['accuracy'])

## **Training**
Now the model is ready to be trained on the embedded vectors by calling the **fit** method.

In [None]:
epoch_count = 50
batch_size = 1000
patience=5

early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=patience, restore_best_weights=True)

history = emb_deep_rnn.fit(embedded_train_x,train_y,validation_data=(embedded_val_x,val_y),epochs=epoch_count,batch_size = batch_size,callbacks=[early_stop])
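To clarify the role of the *patience* parameter, the following pure-Python sketch simulates a simplified early stopping rule on a hypothetical list of validation losses. It is not the Keras implementation, which offers further options; it only illustrates the idea of stopping after *patience* consecutive epochs without improvement and remembering the best epoch.

```python
def simulate_early_stopping(val_losses, patience):
    """Return (stop epoch, best epoch), both 0-based (simplified rule)."""
    best_loss = float('inf')
    best_epoch = -1
    wait = 0  # consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # No early stop: training ran for all the available epochs
    return len(val_losses) - 1, best_epoch

# Hypothetical validation losses, one value per epoch
toy_val_losses = [0.70, 0.60, 0.55, 0.56, 0.58, 0.57, 0.59, 0.61, 0.60]

stop_epoch, best_epoch = simulate_early_stopping(toy_val_losses, patience=5)
print('Stopped at epoch {} - best epoch {}'.format(stop_epoch, best_epoch))
```

In this example the lowest loss is reached at epoch 2 and training stops at epoch 7, after five epochs without improvement. With *restore_best_weights=True*, the weights of the best epoch are the ones kept at the end of training.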

### **Visualize the training process**
The following code calls the **plot_history** function defined above to draw in a graph the loss and accuracy trend over epochs on both training and validation sets.

In [None]:
plot_history(history,metric='accuracy')

## **Performance evaluation on the test set**
The performance on the test set can be measured by calling the **evaluate** method.

In [None]:
results = emb_deep_rnn.evaluate(embedded_test_x, test_y, batch_size=batch_size,verbose=0)
print('Loss: {:.3f} Accuracy: {:.3f}'.format(results[0],results[1]))

## **Classify user-defined movie reviews**
To evaluate the accuracy of the trained model on new data, please write, in the cell below, a positive and a negative movie review (in English).

In [None]:
reviews=['...',
         '...']

By executing the following code, the reviews will be:
1. tokenized;
2. padded;
3. expanded by inserting a new axis. 

In [None]:
#1
tokenized_reviews=tokenizer.texts_to_sequences(reviews)
print(tokenized_reviews)

#2
padded_reviews=pad_sequences(tokenized_reviews, maxlen=maxlen)
print('Padded review shape: ',padded_reviews.shape)

#3
expanded_padded_reviews =np.expand_dims(padded_reviews, axis=2)
print('Expanded review shape: ',expanded_padded_reviews.shape)
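The effect of padding and axis expansion can be reproduced on toy data with NumPy only. The helper *pre_pad* below is a hypothetical function written for this sketch; it assumes the default behaviour of **pad_sequences**, i.e. zeros are added at the beginning of short sequences and long sequences are truncated at the beginning.

```python
import numpy as np

def pre_pad(sequences, maxlen):
    """Left-pad with zeros; keep only the last maxlen tokens of long sequences."""
    padded = np.zeros((len(sequences), maxlen), dtype='int32')
    for row, seq in enumerate(sequences):
        if len(seq) == 0:
            continue  # an empty sequence stays all zeros
        truncated = seq[-maxlen:]
        padded[row, -len(truncated):] = truncated
    return padded

# Hypothetical tokenized reviews of different lengths
toy_tokenized = [[12, 5], [3, 8, 1, 9, 4]]

toy_padded = pre_pad(toy_tokenized, maxlen=4)
print(toy_padded)  # [[ 0  0 12  5] [ 8  1  9  4]]

# Insert a trailing axis: (reviews, words) -> (reviews, words, 1)
toy_expanded = np.expand_dims(toy_padded, axis=2)
print('Expanded shape: ', toy_expanded.shape)  # (2, 4, 1)
```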

The embedded vectors can be derived by applying the trained **Embedding** layer to the pre-processed reviews.

In [None]:
embedded_reviews=emb_layer_model.predict(expanded_padded_reviews)

print('Embedded review shape: ',embedded_reviews.shape)

Finally, the classification results can be obtained using the **predict** method of the trained deep RNN.

In [None]:
review_preds=emb_deep_rnn.predict(embedded_reviews)

for i,p in enumerate(review_preds):
  # p is an array containing a single probability: use its scalar value
  print('\'{}\' {} [{:.2f}]'.format(reviews[i],'POSITIVE' if p[0]>=0.5 else 'NEGATIVE',p[0]))

# **Exercise 1**
Keras provides specific layers to implement *Long Short-Term Memory* ([**LSTM**](https://keras.io/api/layers/recurrent_layers/lstm/)) and *Gated Recurrent Units* ([**GRU**](https://keras.io/api/layers/recurrent_layers/gru/)) networks.

Evaluate the performance of LSTM and GRU networks on the binary sentiment classification problem using the embedded vectors as input.

The **build_deep_rnn** function defined above can be used as a starting point by replacing the **SimpleRNN** layers with **LSTM** or **GRU** layers.

# **Exercise 2**
Define and train a 1D CNN to perform binary classification of movie reviews:
1. define a 1D CNN model implementing the **build_1d_cnn** function;
2. execute the training process;
3. evaluate the performance of the trained model on the test set.

## **Model definition**
Implement the following function to create a 1D CNN model given:
- the number of words in each input review (*word_count*);
- the number of features used to represent each word (*feature_count*).

The model returns a single target value given a movie review as input.

To create 1D convolutional layers, the [**Conv1D**](https://keras.io/api/layers/convolution_layers/convolution1d/) class provided by Keras can be used. An example of how to use 1D convolutional layers for text classification is available [here](https://keras.io/examples/nlp/text_classification_from_scratch/#build-a-model).
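Before implementing the model, it may help to see what a 1D convolution computes on a sequence of embedded words. The following NumPy sketch is a simplified illustration, not the Keras implementation and not a solution to the exercise: it assumes no padding, stride 1, no bias and no activation, and the function *conv1d_valid* and all *toy_* names are hypothetical.

```python
import numpy as np

def conv1d_valid(sequence, kernels):
    """sequence: (timesteps, features); kernels: (kernel_size, features, filters)."""
    kernel_size, _, filter_count = kernels.shape
    out_steps = sequence.shape[0] - kernel_size + 1
    output = np.zeros((out_steps, filter_count))
    for t in range(out_steps):
        # Window of kernel_size consecutive words, with all their features
        window = sequence[t:t + kernel_size]
        # Each filter yields one value: the sum of the element-wise products
        output[t] = np.tensordot(window, kernels, axes=([0, 1], [0, 1]))
    return output

# Toy review: 6 words, each represented by 2 features
toy_sequence = np.arange(12, dtype=float).reshape(6, 2)
# 4 filters, each spanning 3 consecutive words (all weights set to 1)
toy_kernels = np.ones((3, 2, 4))

toy_output = conv1d_valid(toy_sequence, toy_kernels)
print('Output shape: ', toy_output.shape)  # (4, 4)
```

Each filter slides along the word axis only, so the output has one row per window position and one column per filter.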

In [None]:
def build_1d_cnn(word_count,feature_count):
  #...

## **Model creation**
The following code creates a 1D CNN by calling the **build_1d_cnn** function defined above.

In [None]:
emb_1d_cnn=build_1d_cnn(maxlen,embedded_vector_dim)

## **Model visualization**
A string summary of the network can be printed by executing the following code.

In [None]:
emb_1d_cnn.summary()

Alternatively, a plot of the neural network graph can be visualized.

In [None]:
keras.utils.plot_model(emb_1d_cnn,show_shapes=True,show_layer_names=False)

## **Model compilation**
The following code compiles the model as already done for the deep RNN.

In [None]:
emb_1d_cnn.compile(optimizer='adam',loss='binary_crossentropy', metrics=['accuracy'])

## **Training**
Now we are ready to train our model by calling the **fit** method.

In [None]:
epoch_count = 50
batch_size = 1000
patience=5

early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=patience, restore_best_weights=True)

history = emb_1d_cnn.fit(embedded_train_x,train_y,validation_data=(embedded_val_x,val_y),epochs=epoch_count,batch_size = batch_size,callbacks=[early_stop])

### **Visualize the training process**
Call the **plot_history** function to draw the loss and accuracy trend over epochs on both training and validation sets.

In [None]:
plot_history(history,metric='accuracy')

## **Performance evaluation on the test set**
The **evaluate** method of the 1D CNN model is used to measure the performance on the test set.

In [None]:
results = emb_1d_cnn.evaluate(embedded_test_x, test_y, batch_size=batch_size,verbose=0)
print('Loss: {:.3f} Accuracy: {:.3f}'.format(results[0],results[1]))