# Lab 7: Implement a Neural Network for Sentiment Analysis

In [None]:
import pandas as pd
import numpy as np
import os 
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
import tensorflow.keras as keras
import time


Implement a neural network that performs sentiment analysis for a binary classification problem. 

1. Load the book review data set.
2. Create training and test datasets.
3. Transform the training and test text data using a TF-IDF vectorizer. 
4. Construct a neural network.
5. Train the neural network.
6. Compare the model's performance on the training data vs test data.
7. Improve its generalization performance.

## Part 1:  Load the Data Set

We will work with the book review data set, which contains book reviews taken from Amazon.com.

You will be working with the file named "bookReviews.csv" that is located in a folder named "data".

In [None]:
df = pd.read_csv("data/bookReviews.csv", header=0)

In [None]:
df.head()

In [None]:
df.shape

## Part 2: Create Training and Test Data Sets

### Create Labeled Examples

* Get the `Positive Review` column from DataFrame `df` and assign it to the variable `y`. This will be our label.
* Get the `Review` column from DataFrame `df` and assign it to the variable `X`. This will be our feature. 


In [None]:
y = df['Positive Review']
X = df['Review']

In [None]:
X.head()

In [None]:
X.shape

### Split Labeled Examples into Training and Test Sets   

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234)

In [None]:
X_train.head()

## Part 3:  Implement TF-IDF Vectorizer to Transform Text


In the code cell below, transform the features into numerical vectors using `TfidfVectorizer`. 

1. Create a `TfidfVectorizer` object and save it to the variable `tfidf_vectorizer`.

2. Call `tfidf_vectorizer.fit()` to fit the vectorizer to the training data `X_train`.

3. Call the `tfidf_vectorizer.transform()` method to use the fitted vectorizer to transform the training data `X_train`. Save the result to `X_train_tfidf`.

4. Call the `tfidf_vectorizer.transform()` method to use the fitted vectorizer to transform the test data `X_test`. Save the result to `X_test_tfidf`.
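
To see why the vectorizer is fitted on the training data only, here is a minimal sketch on a made-up corpus. The `toy_*` names and sentences are illustrative assumptions and are not part of the lab data. Words that occur only in the test text are not in the learned vocabulary, so they contribute nothing to the test vectors.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus for illustration only; not taken from bookReviews.csv.
toy_train = ["a great book", "a boring book", "great plot and great characters"]
toy_test = ["boring plot", "unputdownable"]

toy_vectorizer = TfidfVectorizer()
toy_vectorizer.fit(toy_train)  # vocabulary and IDF weights come from the training text only

toy_train_tfidf = toy_vectorizer.transform(toy_train)
toy_test_tfidf = toy_vectorizer.transform(toy_test)

# With default settings, one-letter tokens such as "a" are not part of the vocabulary.
print(sorted(toy_vectorizer.vocabulary_))

# Both matrices have one column per vocabulary word.
print(toy_train_tfidf.shape, toy_test_tfidf.shape)

# "unputdownable" never appeared in the training text, so its row is all zeros.
print(toy_test_tfidf.toarray()[1])
```

The number of columns equals the vocabulary size, which is why that value is used later as the input shape of the network.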

In [None]:
# 1. Create a TfidfVectorizer object 
tfidf_vectorizer = TfidfVectorizer()


# 2. Fit the vectorizer to X_train
tfidf_vectorizer.fit(X_train)


# 3. Using the fitted vectorizer, transform the training data 
X_train_tfidf = tfidf_vectorizer.transform(X_train)

# 4. Using the fitted vectorizer, transform the test data 
X_test_tfidf = tfidf_vectorizer.transform(X_test)



In [None]:
vocabulary_size = len(tfidf_vectorizer.vocabulary_)

print(vocabulary_size)

## Part 4: Construct a Neural Network


### Step 1.  Define Model Structure

Next we will create our neural network structure. We will create an input layer, three hidden layers and an output layer:

* <b>Input layer</b>: The input layer will have the input shape corresponding to the vocabulary size. 
* <b>Hidden layers</b>: We will create three hidden layers of widths (number of nodes) 64, 32, and 16. They will use the ReLU activation function. 
* <b>Output layer</b>: The output layer will have a width of 1 and will use the sigmoid activation function. Since we are working with binary classification, the sigmoid maps the output to a probability between 0.0 and 1.0. We can later set a threshold and predict class 1 if the probability is greater than or equal to the threshold, or class 0 if it is lower.
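
As a rough illustration of the output layer's behavior, the sketch below applies a sigmoid to a few hypothetical raw scores and converts the resulting probabilities to classes under two different thresholds. The scores and the 0.7 threshold are made up for the example; this is NumPy, not the Keras implementation.

```python
import numpy as np

def sigmoid(z):
    """Map real-valued scores to values between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical raw scores, as the last layer might produce before its activation.
raw_scores = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
probabilities = sigmoid(raw_scores)

# The same probabilities can give different class labels under different thresholds.
classes_at_050 = (probabilities >= 0.5).astype(int)
classes_at_070 = (probabilities >= 0.7).astype(int)

print(np.round(probabilities, 3))
print(classes_at_050)
print(classes_at_070)
```

A score of exactly 0 maps to a probability of 0.5, so with a threshold of 0.5 it is assigned to class 1.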

To construct the neural network model using Keras, we will do the following:
* We will use the Keras `Sequential` class to group a stack of layers. This will be our neural network model object.
* We will use the `Dense` class to create each layer. 
* We will add each layer to the neural network model object.   
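
It can help to estimate the size of the model before building it. A fully connected (`Dense`) layer has one weight per input-node pair plus one bias per node. The sketch below applies that rule to the layer widths described above, using a hypothetical vocabulary size of 1000; the real value is whatever `vocabulary_size` printed earlier. The helper name `dense_param_counts` is an assumption for this example.

```python
def dense_param_counts(input_dim, layer_widths):
    """Return the trainable parameter count of each fully connected layer."""
    counts = []
    fan_in = input_dim
    for width in layer_widths:
        counts.append(fan_in * width + width)  # weights plus one bias per node
        fan_in = width  # this layer's output feeds the next layer
    return counts

example_vocabulary_size = 1000  # hypothetical; use the real vocabulary size in practice
counts = dense_param_counts(example_vocabulary_size, [64, 32, 16, 1])

print(counts)       # parameters per layer: three hidden layers, then the output layer
print(sum(counts))  # total trainable parameters
```

Almost all of the parameters sit in the first hidden layer because its input is as wide as the vocabulary. These counts can be compared with the output of `nn_model.summary()`; dropout layers add no parameters.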

In [None]:
# 1. Create model object
nn_model = keras.Sequential()



# 2. Create the input layer and add it to the model object: 

# Create input layer:
input_layer = keras.layers.InputLayer(input_shape=(vocabulary_size,))

# Add input_layer to the model object:
nn_model.add(input_layer)



# 3. Create the first hidden layer and add it to the model object:

# Create first hidden layer:
hidden_layer_1 = keras.layers.Dense(units = 64, activation='relu')

# Add hidden_layer_1 to the model object:
nn_model.add(hidden_layer_1)



# 4. Create the second hidden layer and add it to the model object:

# Create second hidden layer:
hidden_layer_2 = keras.layers.Dense(units = 32, activation='relu')

# Add hidden_layer_2 to the model object:
nn_model.add(hidden_layer_2)



# 5. Create the third hidden layer and add it to the model object:

# Create third hidden layer:
hidden_layer_3 = keras.layers.Dense(units = 16, activation='relu')

# Add hidden_layer_3 to the model object:
nn_model.add(hidden_layer_3)

# Add a dropout layer after the third hidden layer (drops 25% of its outputs during training):
nn_model.add(keras.layers.Dropout(.25))

# 6. Create the output layer and add it to the model object:

# Create output layer:
output_layer = keras.layers.Dense(units = 1, activation='sigmoid')

# Add output_layer to the model object:
nn_model.add(output_layer)




# Print summary of neural network model structure
nn_model.summary()


### Step 2. Define the Optimization Function

In [None]:
sgd_optimizer = keras.optimizers.SGD(learning_rate = 0.1)
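
For intuition, plain stochastic gradient descent moves each weight a small step against its gradient, with `learning_rate` setting the step size. The sketch below shows that update rule on a made-up one-dimensional problem, minimizing f(w) = (w - 3)^2. The function, starting value, and helper name `sgd_step` are assumptions for illustration; in the lab, Keras applies the same kind of update to every weight using gradients computed on mini-batches.

```python
def sgd_step(weight, gradient, learning_rate=0.1):
    """One gradient descent update: step against the gradient."""
    return weight - learning_rate * gradient

w = 0.0  # arbitrary starting value
for _ in range(50):
    grad = 2.0 * (w - 3.0)  # derivative of (w - 3)^2 with respect to w
    w = sgd_step(w, grad)

print(w)  # close to the minimizer w = 3
```

A larger learning rate takes bigger steps and can overshoot the minimum, while a smaller one converges more slowly.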

### Step 3. Define the Loss Function

In [None]:
loss_fn = keras.losses.BinaryCrossentropy(from_logits = False)
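
Setting `from_logits = False` tells the loss to expect probabilities, which is what the sigmoid output layer produces. As a minimal NumPy sketch of what binary cross-entropy measures (not the Keras implementation; the clipping constant `eps` is an assumption added for numerical safety), the function below compares made-up predictions against made-up labels.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)  # avoid log(0)
    per_example = y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob)
    return float(-np.mean(per_example))

labels = [1, 0, 1, 0]  # made-up labels

uninformed = binary_cross_entropy(labels, [0.5, 0.5, 0.5, 0.5])
confident_right = binary_cross_entropy(labels, [0.9, 0.1, 0.8, 0.2])
confident_wrong = binary_cross_entropy(labels, [0.1, 0.9, 0.2, 0.8])

print(uninformed, confident_right, confident_wrong)
```

A model that outputs 0.5 for every example gets a loss of ln 2, roughly 0.693. This is a useful reference point when reading training logs: a loss near 0.693 suggests the model has not yet learned anything useful.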

### Step 4. Compile the Model

Package the network architecture with the optimizer and the loss function using the `compile()` method.   

In [None]:
nn_model.compile(optimizer = sgd_optimizer, loss=loss_fn, metrics = ['accuracy'])

## Part 5: Fit the Model on the Training Data

In [None]:
class ProgBarLoggerNEpochs(keras.callbacks.Callback):
    """Print the logged metrics once every `every_n` epochs."""
    
    def __init__(self, num_epochs: int, every_n: int = 50):
        super().__init__()
        self.num_epochs = num_epochs
        self.every_n = every_n
    
    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}  # guard against logs being None
        if (epoch + 1) % self.every_n == 0:
            s = 'Epoch [{}/ {}]'.format(epoch + 1, self.num_epochs)
            logs_s = ['{}: {:.4f}'.format(k.capitalize(), v)
                      for k, v in logs.items()]
            s_list = [s] + logs_s
            print(', '.join(s_list))


Fit the neural network model to the vectorized training data.
<b>Note</b>: This may take a while to run.

In [None]:
num_epochs = 50

t0 = time.time() # start time
history = nn_model.fit(X_train_tfidf.toarray(), y_train, epochs = num_epochs, verbose= 0, callbacks=[ProgBarLoggerNEpochs(num_epochs, every_n=num_epochs)], validation_split = 0.2)
t1 = time.time() # stop time
print('Elapsed time: %.2fs' % (t1 - t0))
print('Number of epochs: ', (num_epochs))

### Visualize the Model's Performance Over Time

The code above outputs both the training loss and accuracy and the validation loss and accuracy. Let us visualize the model's performance over time:

In [None]:
# Plot training and validation loss
plt.plot(range(1, num_epochs + 1), history.history['loss'], label='Training Loss')
plt.plot(range(1, num_epochs + 1), history.history['val_loss'], label='Validation Loss')

plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.show()


# Plot training and validation accuracy
plt.plot(range(1, num_epochs + 1), history.history['accuracy'], label='Training Accuracy')
plt.plot(range(1, num_epochs + 1), history.history['val_accuracy'], label='Validation Accuracy')

plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.show()


## Part 6: Improve the Model and Evaluate the Performance

We just evaluated our model's performance on the training and validation data. Let's now evaluate its performance on our test data and compare the results.

In [None]:
loss, accuracy = nn_model.evaluate(X_test_tfidf.toarray(), y_test)


print('Loss: ', str(loss) , 'Accuracy: ', str(accuracy))

### Prevent Overfitting and Improve Model's Performance

Neural networks can be prone to overfitting. Notice that the training accuracy is 100% but the test accuracy is around 82%. This indicates that our model is overfitting; it will not perform as well on new, previously unseen data as it did during training. We want to have an accurate idea of how well our model will generalize. Our goal is to have our training and testing accuracy scores be as close as possible.

While there are different techniques that can be used to prevent overfitting, for the purpose of this exercise, we will focus on two methods:

1. Changing the number of epochs. Too many epochs can lead to overfitting of the training dataset, whereas too few epochs may result in underfitting.

2. Adding dropout regularization. During training, the nodes of a particular layer may come to rely on the output of a single node in the previous layer, which causes overfitting. Dropout regularization randomly drops a fraction of the nodes in a layer during training. This adds randomness and prevents nodes from becoming dependent on one another. Adding dropout regularization can reduce overfitting and also improve the performance of the model. 
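
To make the dropout idea concrete, here is a minimal NumPy sketch of "inverted" dropout, the scheme the Keras documentation describes for its `Dropout` layer: during training a random subset of values is set to zero and the remaining values are scaled up by 1 / (1 - rate), and at inference the input passes through unchanged. The function name, the seed, and the all-ones input are assumptions for illustration; this is not the Keras source code.

```python
import numpy as np

def dropout(activations, rate, training, rng):
    """Zero out roughly a `rate` fraction of the values while training; rescale the rest."""
    if not training or rate == 0.0:
        return activations  # dropout is inactive at inference time
    keep_mask = rng.random(activations.shape) >= rate  # True for values that are kept
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)    # fixed seed so the example is repeatable
hidden_output = np.ones((2, 8))   # stand-in for the output of a hidden layer

train_out = dropout(hidden_output, rate=0.25, training=True, rng=rng)
infer_out = dropout(hidden_output, rate=0.25, training=False, rng=rng)

print(train_out)  # each entry is either 0 or 1 / 0.75
print(infer_out)  # identical to hidden_output
```

Because a different random mask is drawn on every training step, no node can rely on any single node in the previous layer always being present.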

<b>Task:</b> 

1. Tweak the variable `num_epochs` above and restart and rerun all of the cells above. Evaluate the performance of the model on the training data and the test data.

2. Add Keras `Dropout` layers after one or all hidden layers. Add the following line of code after you add a hidden layer to your model object:  `nn_model.add(keras.layers.Dropout(.25))`. The parameter `.25` is the fraction of the nodes to drop. You can experiment with this value as well. Restart and rerun all of the cells above. Evaluate the performance of the model on the training data and the test data.
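
When comparing runs, it may also help to look at the per-epoch values that `fit()` stores in `history.history`, since the epoch with the lowest validation loss is a reasonable candidate for where overfitting starts. The helper below is a hypothetical convenience function, and the numbers in `toy_history` are made up to show the pattern; they are not results from this lab.

```python
def best_epoch(history_dict, metric="val_loss"):
    """Return (1-based epoch number, value) at which the metric is lowest."""
    values = history_dict[metric]
    best_index = min(range(len(values)), key=lambda i: values[i])
    return best_index + 1, values[best_index]

# Made-up history using the same keys that Keras records.
toy_history = {
    "loss":     [0.69, 0.55, 0.30, 0.10, 0.02],
    "val_loss": [0.69, 0.60, 0.50, 0.55, 0.70],
}

epoch, value = best_epoch(toy_history)
print(epoch, value)
```

In the notebook, the same helper could be called as `best_epoch(history.history)` after training. In the toy data the training loss keeps falling while the validation loss turns upward after epoch 3, which is the typical signature of overfitting.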


<b>Analysis:</b> 
In the cell below, specify the different approaches you used to reduce overfitting and summarize which configuration led to the best generalization performance.

Did changing the number of epochs prevent overfitting? Which value of `num_epochs` yielded the closest training and testing accuracy score? Recall that too few epochs can lead to underfitting (both poor training and test performance). Which value of `num_epochs` resulted in the best accuracy score when evaluating the test data?

Did adding dropout layers prevent overfitting? How so? Did it also improve the accuracy score when evaluating the test data? How many dropout layers did you add and which fraction of nodes did you drop? 

Record your findings in the cell below.

With very few epochs the model underfits: after 1 epoch the training loss is about 0.69 and both training and validation accuracy are close to 50%. By around 25 epochs the training accuracy reaches 100% and the training loss is close to 0, while the validation accuracy levels off at roughly 0.80 to 0.82. Increasing the number of epochs beyond 100 leads to very long computation times and does not noticeably improve validation accuracy in the results below; the validation loss keeps rising, which points to increasing overfitting.

Effect of adding a dropout layer: adding a dropout layer resulted in a slightly slower increase in accuracy, though the difference is mostly negligible.

A number between 25 and 50 seems to be an optimal number of epochs.

Without dropout layer

| Number of epochs | Loss | Accuracy | Val_loss | Val_accuracy |
|---|---|---|---|---|
| 1 | 0.6933 | 0.5024 | 0.6930 | 0.4905 |
| 10 | 0.6070 | 0.6537 | 0.5790 | 0.6741 |
| 25 | 0.0036 | 1.0000 | 0.5009 | 0.8196 |
| 50 | 0.0004 | 1.0000 | 0.6112 | 0.8133 |
| 100 | 0.0001 | 1.0000 | 0.6775 | 0.8196 |
| 150 | 0.0000 | 1.0000 | 0.7226 | 0.8165 |

After adding dropout layer

| Number of epochs | Loss | Accuracy | Val_loss | Val_accuracy |
|---|---|---|---|---|
| 1 | 0.6933 | 0.5071 | 0.6930 | 0.5095 |
| 10 | 0.5807 | 0.7425 | 0.5873 | 0.7753 |
| 25 | 0.0196 | 1.0000 | 0.5037 | 0.8133 |
| 50 | 0.0026 | 1.0000 | 0.7512 | 0.8038 |
| 100 | 0.0010 | 1.0000 | 0.9175 | 0.8038 |
| 150 | 0.0002 | 1.0000 | 1.0891 | 0.8038 |

### Make Predictions on the Test Set

In [None]:
probability_predictions = nn_model.predict(X_test_tfidf.toarray())

print("Predictions for the first 10 examples:")
print("Probability\t\t\tClass")
for i in range(0,10):
    if probability_predictions[i] >= .5:
        class_pred = "Good Review"
    else:
        class_pred = "Bad Review"
    print(str(probability_predictions[i]) + "\t\t\t" + str(class_pred))
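
The same thresholding can be done for all examples at once with NumPy, which also makes it easy to recompute accuracy by hand. The sketch below uses made-up probabilities shaped like the output of `predict()` (one column, one row per example) and assumes boolean labels; the `toy_*` names are illustrative.

```python
import numpy as np

# Made-up values shaped like the output of predict(): shape (n_examples, 1).
toy_probabilities = np.array([[0.91], [0.12], [0.50], [0.47], [0.78]])
toy_labels = np.array([True, False, False, False, True])  # assumed boolean labels

# Flatten to one dimension, then apply the 0.5 threshold to every example at once.
toy_class_predictions = toy_probabilities.ravel() >= 0.5

# Accuracy is the fraction of predictions that match the labels.
toy_accuracy = np.mean(toy_class_predictions == toy_labels)

print(toy_class_predictions)
print(toy_accuracy)
```

Applied to `probability_predictions` and `y_test.to_numpy()`, the same two lines should give an accuracy close to the value reported by `nn_model.evaluate()`.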

In [None]:
print('Review #1:\n')
print(X_test.to_numpy()[56])

goodReview = bool(probability_predictions[56][0] >= .5)
    
print('\nPrediction: Is this a good review? {}\n'.format(goodReview))

print('Actual: Is this a good review? {}\n'.format(y_test.to_numpy()[56]))

In [None]:
print('Review #2:\n')
print(X_test.to_numpy()[24])

goodReview = bool(probability_predictions[24][0] >= .5)

print('\nPrediction: Is this a good review? {}\n'.format(goodReview)) 

print('Actual: Is this a good review? {}\n'.format(y_test.to_numpy()[24]))