<a href="https://colab.research.google.com/github/qcbegin/DSME6635-S24/blob/main/problem_sets/PS4_NLP_DL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Problem Set 4 - Word2Vec and LSTM for Sentiment Analysis

### Due at 12:30PM, Tuesday, March 5, 2024

Please first copy the CoLab file onto your own Google Drive. Finish the questions below and submit the **CoLab link** to your solutions in [this Google Sheet](https://docs.google.com/spreadsheets/d/1nOE-saTptG73WMCONDB1Z3pt-jHhmDA_1OHpQVHqQ1M/edit#gid=840097885). This problem set is worth a total of 8 points. Please name your solution file as

- `Member1LastName_Member1FirstName-Member2LastName_Member2FirstName_PS4.ipynb` (e.g., `Cao_Leo-Zhang_Renyu_PS4.ipynb`)

In this problem set, we will start implementing a set of NLP techniques from the deep learning era. We begin by implementing the word2vec model for sentiment classification. Specifically, you will build your own deep learning model in which the first layer is an embedding layer, the second layer is an LSTM layer with 128 units, and the last layer is a single sigmoid unit fed by the LSTM output. This sigmoid layer predicts the probability that the text sentiment is positive.

We will use the continuous bag-of-words (CBOW) architecture and the tweet data, and will then visualize some of the word vectors.

We will then leverage existing NLP libraries to run sentiment classification using a word2vec model.


## 1. Word Vectors for Sentiment Classification
To make your life slightly easier, we will use [**Keras**](https://keras.io/), instead of PyTorch, to build a word to vector model to classify the sentiment.

The first function will read in the famous [IMDB review data from Keras](https://www.tensorflow.org/api_docs/python/tf/keras/datasets/imdb), and put it into `X_train`, `y_train`, `X_test`, `y_test`. In `X_train` and `X_test`, each point is a list of numbers where each number basically represents a unique word in the entire vocabulary.

We will then use the `pad_sequences()` function to pad or truncate each sequence in the training and testing data to the same length, which is set by `maxlen` (by default 96). You can find the documentation of this function [here](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/sequence/pad_sequences).
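To make the padding behaviour concrete, here is a minimal pure-Python sketch of what `pad_sequences()` does to a single sequence under its default settings (zeros added at the front, and overly long sequences cut from the front). The helper name `pad_pre` and the toy reviews are made up for illustration; this is not the Keras implementation.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad or left-truncate one sequence to exactly `maxlen` items."""
    seq = list(seq)
    if len(seq) >= maxlen:
        # Too long: keep only the last `maxlen` tokens.
        return seq[len(seq) - maxlen:]
    # Too short: put the padding value in front of the real tokens.
    return [value] * (maxlen - len(seq)) + seq


short_review = [14, 22, 16]          # hypothetical word indices
long_review = [1, 2, 3, 4, 5, 6]     # hypothetical word indices

print(pad_pre(short_review, 5))      # [0, 0, 14, 22, 16]
print(pad_pre(long_review, 4))       # [3, 4, 5, 6]
```

Because the padding goes in front, the real words of a short review end up at the end of the row, and index 0 never stands for an actual word.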

In [1]:
from keras.utils import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Embedding
from keras.layers import LSTM
from keras.datasets import imdb
import numpy as np

def load_data(num_words=20000, maxlen=96):
    """
    This function loads the IMDB data.
    Input:
        num_words: the number of most frequent unique words to keep.
        maxlen: the maximum length of each sentence.
    Output:
        X_train, X_test: a list of items, each item has a list of numbers where each
        number represents a word.
        y_train, y_test: a list of one or zeros where one represents positive review
        and zero represents negative review.
    """
    X_train = y_train = X_test = y_test = None
    ### BEGIN SOLUTION
    (X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=num_words)
    X_train = pad_sequences(X_train, maxlen=maxlen)
    X_test = pad_sequences(X_test, maxlen=maxlen)
    ### END SOLUTION
    return X_train, y_train, X_test, y_test

In [2]:
X_train, y_train, X_test, y_test = load_data()
assert X_train.shape == (25000, 96)
assert X_test.shape == (25000, 96)
assert X_train[0][0] == 12


## 2. Word Indices to Sentence

Now we are going to implement a function that shows the actual sentence behind a list of word numbers from the dataset. In particular, you will need to use the [`get_word_index()`](https://www.tensorflow.org/api_docs/python/tf/keras/datasets/imdb/get_word_index) function of the Keras IMDB dataset to get the word corresponding to each number. We will then take a list of numbers, translate these numbers into actual words, and join these words in order into one string, separated by spaces.

**Remember we need to get rid of the padding first!**
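As a rough illustration of the idea, the sketch below inverts a tiny word-to-index dictionary and skips the padding zeros while decoding. The vocabulary `toy_word_index` and the helper `decode` are hypothetical stand-ins, not the real IMDB word index.

```python
# Hypothetical three-word vocabulary (word -> index), for illustration only.
toy_word_index = {"the": 1, "movie": 2, "great": 3}


def decode(indices, word_index):
    """Turn a padded list of indices back into a space-separated string."""
    # Invert the mapping so we can look words up by their index.
    index_word = {idx: word for word, idx in word_index.items()}
    # Index 0 is padding, so it is dropped before joining.
    return " ".join(index_word[i] for i in indices if i != 0)


padded = [0, 0, 1, 2, 3]
print(decode(padded, toy_word_index))  # the movie great
```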




In [4]:
def find_sentence(train_sentence):
    """
    This function takes a list of numbers in IMDB data and translates it into words.
    Input:
        train_sentence: a list of numbers where numbers represent words
    Output:
        final_sentence: a string consisting of the words in the list, in
        sequential order and separated by spaces.
    """
    final_sentence = ""
    ### BEGIN SOLUTION
    word_index = imdb.get_word_index()
    index_word = {v: k for k, v in word_index.items()}
    index_word[0] = '[PAD]'
    final_sentence = ' '.join(index_word[i] for i in train_sentence if i != 0)
    ### END SOLUTION
    return final_sentence

In [5]:
assert find_sentence(X_test[0])[0:23] == 'the wonder own as by is'
assert find_sentence(X_test[10])[0:23] == 'and movie is him actual'

## 3. Word2Vec Classification Model

In this function, we will build the deep learning model that classifies sentence sentiment from the data we just loaded. In particular, we will complete the following steps:

1. Build a model which has 3 major stages. The first stage is a simple embedding layer that translates the list of word numbers into an embedding with length set by `model_width` (by default 128). The second stage is an LSTM with 128 neuron units and 0.2 dropout rate. The last stage is a single dense layer that uses sigmoid as the activation function and outputs one number as the probability of the sentiment being positive.

2. We will compile the model with the `adam` optimizer and the `binary_crossentropy` loss function, using `accuracy` as the metric. When fitting, we hold out 20% of the training data as the validation split.

3. We fit the model on the training features and labels with `batch_size` set to `batch_size` and `epochs` set to `epoch_number`.

4. We evaluate the model based on testing features and labels.

5. We return the model, the evaluation scores, and the testing accuracy.
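To see what the sigmoid output and the `binary_crossentropy` loss in the steps above compute, here is a small sketch for a single example. The function names are ours, and the per-example formula is a simplification: Keras averages the loss over each batch and clips probabilities for numerical stability.

```python
import math


def sigmoid(z):
    """Squash a real-valued score into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_crossentropy(y_true, p):
    """Cross-entropy for one example with label y_true in {0, 1}."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))


p = sigmoid(2.0)  # a fairly confident "positive" prediction
print(round(p, 4))
print(round(binary_crossentropy(1, p), 4))  # small loss: prediction agrees with label
print(round(binary_crossentropy(0, p), 4))  # large loss: prediction contradicts label
```

A prediction above 0.5 is counted as positive when accuracy is computed, while the loss additionally rewards being confident when correct.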

**Note: This will take some time depending on how much computing resource Google allocates to you. You are training a relatively large LSTM network with a moderate embedding on 25,000 reviews. So each epoch will take one to two minutes at least.**
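A quick back-of-the-envelope check shows how many batches each epoch has to process, which is where the training time comes from. The sketch below assumes the default arguments of `lstm_sentiment()` (`batch_size=32`, 20% validation split) and the 25,000 training and 25,000 testing reviews; the variable names are ours.

```python
import math

n_train = 25000          # training reviews
n_test = 25000           # testing reviews
validation_split = 0.2   # fraction held out during fitting
batch_size = 32

# Reviews held out for validation and reviews actually used for fitting.
n_val = round(n_train * validation_split)
n_fit = n_train - n_val

# The last batch may be smaller, so we round up.
train_steps = math.ceil(n_fit / batch_size)
test_steps = math.ceil(n_test / batch_size)

print(n_fit, train_steps)   # 20000 625
print(n_test, test_steps)   # 25000 782
```

These counts match the `625/625` and `782/782` progress counters in the training and evaluation logs.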

The model should look similar to the following:

| Layer (type) | Output Shape | Parameters |
| ----------- | ----------- | --------- |
| embedding (Embedding) | (None, None, 128) | 2,560,000 |
| lstm (LSTM) | (None, 128) | 131,584 |
| dense (Dense) | (None, 1) | 129 |

- Total params: 2,691,713
- Trainable params: 2,691,713
- Non-trainable params: 0
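The parameter counts in this summary can be reproduced by hand. The sketch below assumes the default settings (`num_words=20000`, `model_width=128`, 128 LSTM units) and the standard LSTM layout of four gates, each with an input kernel, a recurrent kernel, and a bias; the variable names are ours.

```python
num_words = 20000   # vocabulary size
width = 128         # embedding length (model_width)
units = 128         # LSTM units

# Embedding: one vector of length `width` per word.
embedding_params = num_words * width

# LSTM: 4 gates, each with input weights, recurrent weights, and a bias.
lstm_params = 4 * (units * (width + units) + units)

# Dense: one weight per LSTM unit plus one bias.
dense_params = units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# 2560000 131584 129 2691713

# The four gates are stacked side by side, so both LSTM kernels are 4 * units wide.
recurrent_kernel_shape = (units, 4 * units)
print(recurrent_kernel_shape)  # (128, 512)
```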



In [13]:
def lstm_sentiment(X_train, y_train, X_test, y_test, model_width=128, num_words=20000,
                   batch_size=32, epoch_number=5):
    """
    This function builds an LSTM classifier to predict the sentiment.
    Input:
        X_train, y_train, X_test, y_test: training and testing data
        model_width: the length of embeddings we use to represent words
        num_words: the total number of unique words in our data.
        batch_size: batch_size in processing data
        epoch_number: how many times we go over the data.
    Output:
        model: the LSTM sentiment classifier we trained.
        score: a dict of evaluation metrics (loss and accuracy) on the testing data.
        testing_accuracy: the accuracy of the model on testing data.
    """
    score = None
    testing_accuracy = 0
    model = None

    ### BEGIN SOLUTION

    # initialize the model
    model = Sequential()
    model.add(Embedding(num_words, model_width))
    model.add(LSTM(model_width, dropout=0.2, recurrent_dropout=0.2))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

    # train the model
    model.fit(X_train, y_train, batch_size=batch_size, epochs=epoch_number, validation_split=0.2)

    # evaluate the model
    score = model.evaluate(X_test, y_test, batch_size=batch_size, return_dict=True)
    testing_accuracy = score['accuracy']
    
    ### END SOLUTION

    return model, score, testing_accuracy


In [14]:
model, score, testing_accuracy = lstm_sentiment(X_train, y_train, X_test, y_test)
assert len(model.layers) == 3
assert list(model.layers[0].weights[0].shape) == [20000, 128]
assert len(model.layers[1].weights) == 3
assert list(model.layers[1].weights[1].shape) == [128, 512]

Epoch 1/5
625/625 ━━━━━━━━━━━━━━━━━━━━ 33s 53ms/step - accuracy: 0.7247 - loss: 0.5448 - val_accuracy: 0.8268 - val_loss: 0.4033
Epoch 2/5
625/625 ━━━━━━━━━━━━━━━━━━━━ 32s 51ms/step - accuracy: 0.8805 - loss: 0.3022 - val_accuracy: 0.8236 - val_loss: 0.3938
Epoch 3/5
625/625 ━━━━━━━━━━━━━━━━━━━━ 32s 51ms/step - accuracy: 0.9164 - loss: 0.2202 - val_accuracy: 0.8216 - val_loss: 0.4921
Epoch 4/5
625/625 ━━━━━━━━━━━━━━━━━━━━ 33s 52ms/step - accuracy: 0.9399 - loss: 0.1606 - val_accuracy: 0.8236 - val_loss: 0.4693
Epoch 5/5
625/625 ━━━━━━━━━━━━━━━━━━━━ 33s 52ms/step - accuracy: 0.9603 - loss: 0.1105 - val_accuracy: 0.8230 - val_loss: 0.5424
782/782 ━━━━━━━━━━━━━━━━━━━━ 18s 23ms/step - accuracy: 0.8173 - loss: 0.5578


## 4. Flair

[Flair](https://github.com/flairNLP/flair) is a reasonably up-to-date NLP library that provides a sentiment classifier built on word embeddings. In packages like this, you will find word vectors as well as classifiers that have been trained by others. There are in general three ways to use these packages and pre-trained models:

1. The first way is to directly use the model to conduct the task, which is what we will do here.

2. The second way is to use the pre-trained word vectors or other raw materials in pre-processing texts as an input of your own model. Imagine that you can repeat the previous task but, instead of training an embedding layer, you can use the word vectors from Flair at the beginning.

3. You can use pre-trained models and change the last part of the model to fit your own data. This is a process called transfer learning or fine-tuning, and has been utilized a lot in modern transformer-based models.
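To illustrate the second way in the list above, the sketch below builds an embedding matrix from pre-trained word vectors, with one row per word index. Everything here is a toy assumption: the two-dimensional vectors, the vocabulary, and the helper `build_embedding_matrix` are made up, and this is not Flair's API.

```python
# Hypothetical pre-trained vectors (word -> vector), for illustration only.
pretrained = {"good": [0.9, 0.1], "bad": [-0.8, 0.2]}

# Hypothetical vocabulary of our own dataset (word -> index, 0 is padding).
word_index = {"good": 1, "bad": 2, "meh": 3}


def build_embedding_matrix(word_index, pretrained, dim):
    """Return one row per index; words without a pre-trained vector stay zero."""
    n_rows = max(word_index.values()) + 1  # +1 for the padding row at index 0
    matrix = [[0.0] * dim for _ in range(n_rows)]
    for word, idx in word_index.items():
        vector = pretrained.get(word)
        if vector is not None:
            matrix[idx] = list(vector)
    return matrix


matrix = build_embedding_matrix(word_index, pretrained, dim=2)
for row in matrix:
    print(row)
```

A matrix like this could serve as the initial weights of an embedding layer, so the model starts from vectors learned on a much larger corpus instead of from random values.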


In this exercise, we will use the IMDB dataset but the sentiment classifier trained by Flair. [Here](https://flairnlp.github.io/docs/intro) is the documentation of the classifier and you can find the [code here](https://github.com/flairNLP/flair/blob/6b6b2f0ebcc078328f4b349f58f3f3b77e99072d/flair/models/text_classification_model.py).

    """
    Text Classification Model
    The model takes word embeddings, puts them into an RNN to obtain a text
    representation, and puts the text representation in the end into a linear
    layer to get the actual class label. The model can handle single and multi
    class data sets.
    """


In the first function below, before conducting the analysis, we have to convert the IMDB data in `X_test` back into text, since the numeric data cannot be used directly by Flair. In particular, we will take each item in `X_test` and use the function `find_sentence_2()` to turn it into a string of words joined by spaces.

In [16]:
def revert_data_to_text(original_data):
    """
    This function takes a list of original IMDB data and returns a list
    of strings.
    Input:
        original_data: a list of original numeric data from the IMDB dataset.
    Output:
        new_data: a list of strings where each string consists of the words
        represented by the original list.
    """
    new_data = []
    word_index = imdb.get_word_index()
    inverted_word_index = dict((i, word) for (word, i) in word_index.items())

    def find_sentence_2(train_sentence):
        return " ".join(inverted_word_index[i] for i in train_sentence if i !=0)

    ### BEGIN SOLUTION
    new_data = [find_sentence_2(i) for i in original_data]
    ### END SOLUTION

    return new_data

In [17]:
X_test_2 = revert_data_to_text(X_test)
assert len(X_test_2) == 25000
assert X_test_2[0][0:10] == 'the wonder'

In the following function, we will use flair to do sentiment classification and report the out-of-sample accuracy. In particular, we will do:

1. For each sentence, we will first create a Flair sentence object to convert it into Flair's internal data structure.

2. We will then create a Flair sentiment classifier model in English using `en-sentiment`.

3. For each Flair sentence, we will use the Flair model to classify it and save the results.

4. We will then compare the results with the true labels to report the accuracy of Flair classifiers.

Again, you can find more documentation about flair on [their GitHub page](https://github.com/flairNLP/flair) and [tutorial](https://flairnlp.github.io/docs/intro).
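The last step in the list above, comparing predictions with the true labels, can be sketched without Flair. The example assumes the classifier reports its decision as the strings `'POSITIVE'` and `'NEGATIVE'`; the predicted labels and true labels below are made up.

```python
# Made-up classifier outputs and ground-truth labels, for illustration only.
predicted_labels = ["POSITIVE", "NEGATIVE", "POSITIVE", "POSITIVE"]
true_labels = [1, 0, 0, 1]

# Map the string labels to the 1/0 encoding used by the IMDB labels.
classification = [1 if label == "POSITIVE" else 0 for label in predicted_labels]

# Accuracy is the fraction of predictions that match the true label.
matches = sum(p == t for p, t in zip(classification, true_labels))
accuracy = matches / len(true_labels)

print(classification)  # [1, 0, 1, 1]
print(accuracy)        # 0.75
```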

In [None]:
!pip install flair

In [53]:
import flair

def flair_classification(X_test, y_test):
    """
    This function uses Flair to do sentiment classification.
    Input:
        X_test, y_test: testing data
    Output:
        classification: a list of 1s and 0s where 1 represents the test data
        is positive and 0 otherwise.
        accuracy: the total accuracy of the model.
    """
    classification = []
    accuracy = 0

    ### BEGIN SOLUTION
    
    # prepare the data and load the model
    sentences = [flair.data.Sentence(i) for i in X_test]
    tagger = flair.models.TextClassifier.load('en-sentiment')

    # predict the sentiment
    tagger.predict(sentences)

    # calculate the accuracy
    classification = [1 if s.labels[0].value == 'POSITIVE' else 0 for s in sentences]
    accuracy = np.mean(np.array(classification) == np.asarray(y_test))

    ### END SOLUTION

    return classification, accuracy

In [54]:
classification, accuracy = flair_classification(X_test_2[0:2500], y_test[0:2500])
assert np.isclose(accuracy, 0.5232)

## End of Problem Set 4.