<img src="https://www.nyp.edu.sg/content/dam/nyp/logo.png" width='200'/>

Welcome to the lab! Before we get started, here are a few pointers on Jupyter notebooks.

1. The notebook is composed of cells; cells can contain code which you can run, or they can hold text and/or images which are there for you to read.

2. You can execute code cells by clicking the ```Run``` icon in the menu, or via the following keyboard shortcuts ```Shift-Enter``` (run and advance) or ```Ctrl-Enter``` (run and stay in the current cell).

3. To interrupt cell execution, click the ```Stop``` button on the toolbar, or open the ```Kernel``` menu and select ```Interrupt```.


# Lab 2 - Sentiment Analysis with Deep Learning (Keras)

Now that we have learned how to use Naive Bayes and SVM to classify the sentiment of a text document, let's learn how to use Deep Learning to do the same.

In [None]:
from helpers import *
print ("Import helpers complete.")

## Section 2.1 - Load Data from CSV

Update the code below to load the training and test data from the CSV files.


In [None]:
# TODO:
# Update the code below with the correct file names and the names of the
# columns that hold the input text and the output class label.
#
load_text_data_from_csv(
    "???",                              # CSV File to load the training data from
    "???",                              # CSV File to load the test data from
    "???",                              # Column name of the CSV that contains the input text
    "???")                              # Column name of the CSV that contains the class
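
As a rough illustration of what loading text and labels from a CSV file involves, here is a minimal sketch using Python's standard `csv` module. The function `read_text_and_labels`, the sample data and the column names `review` and `sentiment` are all hypothetical; they are not the lab's actual files or the code inside `helpers.py`.

```python
import csv
import io

# Hypothetical CSV content. The column names "review" and "sentiment" are
# made up for this sketch and are not the lab's actual column names.
SAMPLE_CSV = """review,sentiment
"A wonderful, moving film.",positive
"Dull and far too long.",negative
"""

def read_text_and_labels(csv_file, text_column, label_column):
    """Return (texts, labels) read from an open CSV file object."""
    texts, labels = [], []
    for row in csv.DictReader(csv_file):
        texts.append(row[text_column])    # the input text
        labels.append(row[label_column])  # the class we want to predict
    return texts, labels

texts, labels = read_text_and_labels(io.StringIO(SAMPLE_CSV), "review", "sentiment")
print(texts)
print(labels)
```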


Run the following cell to see what the data loaded from the CSV files looks like once it is stored in Python variables.

In [None]:
display_trainx_trainy()

## Section 2.2 - Build a Word Dictionary and Tokenize Text

The following function creates a dictionary of n words, based on the words that appear in the training data.

It also performs tokenization, a necessary step that splits a sentence into words and assigns a numeric identifier to each word.
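
To make the idea concrete, here is a minimal sketch of dictionary building and tokenization in plain Python. It assumes simple whitespace splitting, reserves id 0 for padding, and pads at the end of the sentence; the function names are hypothetical and the helper used in this lab may do these steps differently.

```python
from collections import Counter

def build_dictionary(sentences, max_words):
    """Map the most frequent words to ids 1..max_words (0 is kept for padding)."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(max_words), start=1)}

def tokenize(sentence, dictionary, max_len):
    """Convert a sentence to a fixed-length list of word ids."""
    # Words that are not in the dictionary are simply dropped in this sketch.
    ids = [dictionary[w] for w in sentence.lower().split() if w in dictionary]
    ids = ids[:max_len]                      # truncate long sentences
    return ids + [0] * (max_len - len(ids))  # pad short ones with 0

train = ["the movie was great", "the movie was boring", "great acting"]
dictionary = build_dictionary(train, max_words=4)
print(dictionary)
print(tokenize("the acting was great", dictionary, max_len=6))
```

The two limits here play the same role as the two numbers passed to the lab's helper: one caps the size of the dictionary, the other fixes the length of every tokenized sentence.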

With Scikit-Learn, tokenization (as well as lemmatization) is handled by the NLTK toolkit and integrated into the Scikit-Learn processing pipeline. With Keras, we have to handle it ourselves. While lemmatization is beneficial for classical Machine Learning algorithms, its improvement to performance may be less significant in Deep Learning, depending on the written language that you are trying to classify. In our case, we will classify our movie reviews without lemmatizing the text.

Go ahead and run the cell below to build our dictionary and data set.

In [None]:
# Split and tokenize all the strings into individual word index
# in a dictionary.
#
# NOTE: If you want to run this again, you must re-load the data from CSV
#       in step 2.1.
#
build_dictionary_and_tokenize_data(
    50000,                                        # Max number of words in dictionary
    2500)                                         # Max number of words per sentence.

Run the cell below to see what our sentences look like now.

You can see that each sentence has been converted into a series of numbers.

In [None]:
display_trainx_trainy()

## Section 2.3 - Load the Glove Word Embeddings

Word Embeddings are a well-recognized way of representing the meaning of a word in Machine Learning. Thanks to the vast amount of written text on the internet today, from articles and news to Wikipedia and user-generated content on social media, we have amassed a huge corpus of language data that is useful for training a machine-learned representation of words.

A few of the Word Embeddings that have been pre-trained and made available for download are the Glove and Word2Vec embeddings. These embeddings are basically a dictionary look-up that maps a word to a series of numbers.
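
The "dictionary look-up" idea can be sketched in a few lines. The example below assumes the plain-text layout of the downloadable Glove files, where each line holds a word followed by its numbers separated by spaces. The two sample lines and the function name `parse_embeddings` are made up for illustration, and this is not necessarily how the lab's helper loads the file.

```python
import io

# Two made-up lines in the Glove text layout: a word followed by its numbers.
# Real entries in glove.6B.200d.txt carry 200 numbers per word.
SAMPLE = "good 0.1 0.3 -0.2\nbad -0.4 0.2 0.5\n"

def parse_embeddings(lines):
    """Return a dict mapping each word to its list of floats."""
    embeddings = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        embeddings[parts[0]] = [float(x) for x in parts[1:]]
    return embeddings

embeddings = parse_embeddings(io.StringIO(SAMPLE))
print(embeddings["good"])
```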

For this exercise, we will use the Glove Embeddings available here: https://nlp.stanford.edu/projects/glove/

We have already downloaded the Glove Embeddings file to **"data/glove.6B.200d.txt"**. Update the path to that file and run the cell to load it.

In [None]:
# TODO:
# Set the path to the Glove Word Embedding file
#
load_glove_embedding("???")

Replace the ??? with any known English word, and run the following cell to see what a real word embedding looks like.

In [None]:
# TODO:
# Set any known English word to see its Word Embedding.
#
display_word_embedding("???")

Humans can visualize nearby words if we write each of them on a Post-It and stick them on a 2-dimensional flipchart. But in practice, we need more than 2 dimensions to capture the meaning of a word. The Glove Embedding that we use represents each word with 200 numbers (dimensions). We won't be able to visualize nearby words in a 200-dimension representation, but machines have no problem computing distances between words represented with any number of dimensions.
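
As a small sketch of how a machine can compare words, the example below ranks words by cosine similarity, one common way to measure closeness between embeddings. The 3-dimensional vectors are invented for illustration, and the choice of cosine similarity is an assumption; the lab's helper may use a different distance measure.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented 3-dimensional "embeddings"; real Glove vectors here have 200 numbers.
toy = {
    "good":  [0.9, 0.1, 0.0],
    "great": [0.8, 0.2, 0.1],
    "bad":   [-0.7, 0.3, 0.1],
}

def nearby_words(word, vectors, top_n=2):
    """Return the top_n other words, most similar first."""
    target = vectors[word]
    others = [(w, cosine_similarity(target, v)) for w, v in vectors.items() if w != word]
    return sorted(others, key=lambda pair: pair[1], reverse=True)[:top_n]

print(nearby_words("good", toy))
```

The same function works unchanged for vectors of any length, which is why 200 dimensions pose no difficulty for the machine.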

By feeding the Word Embeddings of each word in a sentence to the Deep Learning model, we are essentially telling the model to make use of the meaning of each word in a paragraph to classify that paragraph.

Run the following cell with any word to see how the machine can find the words closest in meaning with the help of the Glove Word Embedding look-up.

In [None]:
# TODO:
# Set any word here and run the cell to see which word is
# close in meaning to the one you supplied.
#
display_nearby_words("???")

## Section 2.4 - Create the Deep Learning Text Classification Model

The following creates the Deep Learning model for our Text Classification task. A typical Recurrent Neural Network will look like the following:

<img src="files/we_rnn_dl.PNG" height="100">

The Word Embedding captures the meaning of the words, while the Recurrent Neural Network attempts to make sense of the order and position of the words. 

---

In the cell below, we have some code to create the above model architecture, with some hyper-parameters that you can change to alter the structure of the model and how the model learns.

For example, you may consider using GRU instead of LSTM, or you may choose to use a Bi-directional/Uni-directional model. A Uni-directional model takes in a sequence of words in the natural reading order. A Bi-directional model takes in a sentence both from first-to-last word and from last-to-first word, allowing the model to capture context in both directions.
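
The difference between the two directions can be sketched with a toy recurrence. The example below uses a single-number hidden state and hand-picked weights, and it shares the same weights for both directions to stay short; a real bi-directional layer learns separate weights for each direction and works on word embedding vectors rather than single numbers. All names and values are illustrative assumptions.

```python
import math

def run_rnn(inputs, w_x=0.5, w_h=0.8):
    """Simple scalar recurrence: h_t = tanh(w_x * x_t + w_h * h_{t-1})."""
    h = 0.0
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)
    return h  # final state, which summarises the whole sequence

def run_bidirectional(inputs):
    """Read the sequence in both directions and keep both final states."""
    forward = run_rnn(inputs)                   # first-to-last
    backward = run_rnn(list(reversed(inputs)))  # last-to-first
    return [forward, backward]

sequence = [1.0, 0.0, -1.0]
print(run_rnn(sequence))            # uni-directional summary
print(run_bidirectional(sequence))  # two summaries, one per reading direction
```

Because each step feeds on the previous state, the same inputs in a different order give a different result, which is how the recurrent layer picks up word order.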

Update the hyper-parameters and then run the cell to create the model.

1. Recurrent Neural Network variant in the Recurrent Layer: **'lstm'**
2. Neurons in Recurrent Layer: **32**
3. Bi-directional: **True**
4. Optimizer: **'adam'**



In [None]:
# TODO:
# Update the hyper-parameters before creating the model.
#
create_text_classifier_model_rnn(
    2,                                        # Number of classes to predict 
    2500,                                     # Max number of words per sentence
    'glove',                                  # Word Embedding                  ('glove' / 'new')
    '???',                                    # RNN variant                     ('rnn' / 'gru' / 'lstm')
    0,                                        # Neurons in RNN layer            (typically 16 to 1024)
    True,                                     # Use bi-directional RNN?         (False - Uni-directional / True - Bi-directional)
    '???'                                     # Optimizer to learn              ('sgd' / 'adam')
)

## Section 2.5 - Train Your Model

Update the following parameters and then run the cell below to start the training. 

1. Learning rate = **0.01** (for 'sgd' optimizer), **0.001** (for 'adam' optimizer)
2. Batch size = **100**
3. Number of epochs = **5**

Take a look at the accuracy of your classifications on the test data and compare its F1-score with those of your classical Machine Learning models. Try adjusting some of the hyper-parameters above, including the RNN variant, the number of neurons, whether to use a bi-directional network, the optimizer, the batch size and the number of epochs. Then run Sections 2.4 and 2.5 again to train.

Try experimenting with different combinations of hyper-parameters to see if you can achieve a good F1-score.
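
As a reminder of what the F1-score measures, here is a minimal sketch that computes it for one class from a list of true labels and predictions. The labels are invented, and the training helper may report the score differently (for example, averaged over both classes).

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 for one class: the harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no correct positive predictions at all
    precision = tp / (tp + fp)  # of the predicted positives, how many were right
    recall = tp / (tp + fn)     # of the real positives, how many were found
    return 2 * precision * recall / (precision + recall)

# Invented labels: 1 = positive review, 0 = negative review.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
print(f1_score(y_true, y_pred))
```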

In [None]:
# TODO:
# Set the learning rate of the deep learning network.
# A large learning rate adjusts the network weights quickly towards the optimal
# goal, but it may also cause the network to over-shoot that goal. A small learning
# rate makes the network learn very slowly, but it is more likely to settle at the
# nearest goal.
#
set_learning_rate(0)


# TODO:
# Update the batch size, and the number of epochs to train.
#
train_text_classifier_model(
    0,                         # Batch size.
    0                          # Number of epochs/iterations to train this model.
    )
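
The learning-rate trade-off described in the comments above can be seen on a toy problem. The sketch below runs plain gradient descent on the function f(w) = w², whose minimum is at w = 0. It is an illustration only: the function, starting point and rates are assumptions chosen for this example and do not reflect how the lab's optimizer behaves on a real network.

```python
def gradient_descent(learning_rate, steps=20, start=5.0):
    """Minimise f(w) = w**2, whose gradient is 2*w and whose minimum is w = 0."""
    w = start
    for _ in range(steps):
        w = w - learning_rate * 2 * w  # step against the gradient
    return w

# A tiny rate barely moves, a moderate rate gets close to 0,
# and a rate that is too large over-shoots further on every step.
for lr in (0.001, 0.1, 1.1):
    print(lr, gradient_descent(lr))
```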

## Section 2.6 - Save Your Text Classification Model

Run the code below to save your Text Classification model.


In [None]:
save_text_classifier_model("models/text_classifier.h5")

## Section 2.7 - Load Your Text Classification Model

Run the code below to load your Text Classification model.


In [None]:
load_text_classifier_model("models/text_classifier.h5")

## Section 2.8 - Try Out Your Model

Run the following code to try your model.

Discuss how you feel about this new classifier:

1. Did the accuracy of your model increase with Deep Learning?
2. What do you think you can do to further improve the accuracy of your model?
3. How does the accuracy of your model feel when you are testing it manually?


In [None]:
print ("Enter some text:")
user_text = input()
classify_text(user_text)

## Section 2.9 - Explore helpers.py

Go ahead and examine the code in helpers.py again to see how we create the Deep Learning model in Keras.