# Lab 8: Implement Your Machine Learning Project Plan

In this lab assignment, you will implement the machine learning project plan you created in the written assignment. You will:

1. Load your data set and save it to a Pandas DataFrame.
2. Perform exploratory data analysis on your data to determine which feature engineering and data preparation techniques you will use.
3. Prepare your data for your model and create features and a label.
4. Fit your model to the training data and evaluate your model.
5. Improve your model by performing model selection and/or feature selection techniques to find the best model for your problem.

### Import Packages

Before you get started, import a few packages.

In [1]:
import pandas as pd
import numpy as np
import os 
import matplotlib.pyplot as plt
import seaborn as sns

<b>Task:</b> In the code cell below, import additional packages that you have used in this course that you will need for this task.

In [None]:
# YOUR CODE HERE
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix
import tensorflow.keras as keras
import time

## Part 1: Load the Data Set


You have chosen to work with one of four data sets. The data sets are located in a folder named "data." The file names of the four data sets are as follows:

* The "adult" data set that contains Census information from 1994 is located in file `adultData.csv`
* The Airbnb NYC "listings" data set is located in file `airbnbListingsData.csv`
* The World Happiness Report (WHR) data set is located in file `WHR2018Chapter2OnlineData.csv`
* The book review data set is located in file `bookReviewsData.csv`



<b>Task:</b> In the code cell below, load your data with `pd.read_csv()`, as you have done throughout this course, and save it to DataFrame `df`.

In [None]:
# YOUR CODE HERE
filename = os.path.join(os.getcwd(), "data", "bookReviewsData.csv")
df = pd.read_csv(filename, header=0)

## Part 2: Exploratory Data Analysis

The next step is to inspect and analyze your data set with your machine learning problem and project plan in mind. 

This step will help you determine data preparation and feature engineering techniques you will need to apply to your data to build a balanced modeling data set for your problem and model. These data preparation techniques may include:
* addressing missingness, such as replacing missing values with means
* renaming features and labels
* finding and replacing outliers
* performing winsorization if needed
* performing one-hot encoding on categorical features
* performing vectorization for an NLP problem
* addressing class imbalance in your data sample to promote fair AI


Think of the different techniques you have used to inspect and analyze your data in this course. These include using Pandas to apply data filters, using the Pandas `describe()` method to get insight into key statistics for each column, using the Pandas `dtypes` property to inspect the data type of each column, and using Matplotlib and Seaborn to detect outliers and visualize relationships between features and labels. If you are working on a classification problem, use techniques you have learned to determine if there is class imbalance.


<b>Task</b>: Use the techniques you have learned in this course to inspect and analyze your data. 

<b>Note</b>: You can add code cells if needed by going to the <b>Insert</b> menu and clicking on <b>Insert Cell Below</b> in the drop-down menu.

In [None]:
df.head() # inspect the first few rows of Book Review data set

In [None]:
df.describe() # get insight into key statistics for each column 

In [None]:
# check the distribution of the label - Positive Review
print(df['Positive Review'].value_counts())

In [None]:
# visualize the distribution of the label to check for class imbalance between negative and positive reviews
sns.countplot(x='Positive Review', data=df)
plt.title('Distribution of Positive Reviews')
plt.show() 
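
The counts and the plot above show how many examples fall into each class. As a rough sketch of how those counts can be turned into a single imbalance figure, the snippet below computes class proportions and the ratio of the largest class to the smallest. The `labels` list is hypothetical toy data standing in for `df['Positive Review']`, and the standard library is used here only to keep the example self-contained; `value_counts(normalize=True)` gives the same proportions in Pandas.

```python
from collections import Counter

# Hypothetical labels standing in for df['Positive Review'].
labels = [True, False, True, True, False, False, True, False, False, False]

counts = Counter(labels)
total = len(labels)

# Share of each class; values near 0.5 indicate a balanced binary label.
proportions = {label: count / total for label, count in counts.items()}

# Ratio of the largest class to the smallest class (1.0 means perfectly balanced).
imbalance_ratio = max(counts.values()) / min(counts.values())

print(proportions)
print(f"Imbalance ratio: {imbalance_ratio:.2f}")
```

A ratio close to 1 suggests that no resampling is needed; a much larger ratio is a signal to consider the class imbalance techniques listed earlier.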

## Part 3: Implement Your Project Plan

<b>Task:</b> Use the rest of this notebook to carry out your project plan. You will:

1. Prepare your data for your model and create features and a label.
2. Fit your model to the training data and evaluate your model.
3. Improve your model by performing model selection and/or feature selection techniques to find the best model for your problem.


Add code cells below and populate the notebook with commentary, code, analyses, results, and figures as you see fit.

In [None]:
# create labeled examples
y = df['Positive Review'] # label
X = df['Review'] # feature

X.head()

In [None]:
X.shape

In [None]:
# split labeled examples into training and test sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=1234)

X_train.head() 

In [None]:
# initialize the CountVectorizer
vectorizer = CountVectorizer(max_features=5000) 

# fit and transform the training data
X_train_vec = vectorizer.fit_transform(X_train)

# transform the test data
X_test_vec = vectorizer.transform(X_test)
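
The cell above fits the vocabulary on the training data only and then reuses it for the test data. The sketch below illustrates that idea with a hand-written bag-of-words counter. It is a simplification, not how `CountVectorizer` is implemented: it splits on whitespace only, has no `max_features` cap, and the helper names `fit_vocabulary` and `transform` as well as the toy documents are hypothetical.

```python
from collections import Counter


def fit_vocabulary(documents):
    """Map each distinct lowercase token in the training documents to a column index."""
    tokens = sorted({token for doc in documents for token in doc.lower().split()})
    return {token: index for index, token in enumerate(tokens)}


def transform(documents, vocabulary):
    """Turn each document into a list of token counts, ignoring tokens not in the vocabulary."""
    rows = []
    for doc in documents:
        counts = Counter(doc.lower().split())
        rows.append([counts.get(token, 0) for token in vocabulary])
    return rows


# Hypothetical toy reviews.
train_docs = ["great book great plot", "boring plot"]
test_docs = ["great characters boring book"]

vocabulary = fit_vocabulary(train_docs)          # learned from training data only
train_counts = transform(train_docs, vocabulary)
test_counts = transform(test_docs, vocabulary)   # "characters" is unseen, so it is dropped

print(vocabulary)
print(train_counts)
print(test_counts)
```

Because the test documents are encoded with the training vocabulary, both matrices have the same number of columns, which is what the model expects at prediction time.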

In [None]:
# create a Logistic Regression model
lr_model = LogisticRegression(max_iter=1000, random_state=1234)

# train the model on the training data
lr_model.fit(X_train_vec, y_train)

# make predictions on the test data
y_pred = lr_model.predict(X_test_vec)

# calculate F1-score to evaluate the model's performance
f1 = f1_score(y_test, y_pred)

print(f"F1-Score: {f1}")
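
To make the F1-score above easier to interpret, here is a small sketch that computes it by hand as the harmonic mean of precision and recall for the positive class. The function name `f1_from_predictions` and the toy labels are hypothetical; for a binary label this is intended to mirror the quantity `f1_score` reports with its default settings.

```python
def f1_from_predictions(y_true, y_pred, positive=True):
    """Compute the F1-score of the positive class from true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy labels: 2 true positives, 1 false positive, 1 false negative.
y_true_toy = [True, True, True, False, False, False]
y_pred_toy = [True, True, False, True, False, False]

f1_toy = f1_from_predictions(y_true_toy, y_pred_toy)
print(f"F1-Score: {f1_toy:.4f}")
```

Unlike accuracy, this score ignores true negatives, so it stays informative when one class dominates the data.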

In [None]:
# calculate the confusion matrix
# (stored as `cm` so the imported confusion_matrix function is not overwritten if this cell is re-run)
cm = confusion_matrix(y_test, y_pred)

# visualize the confusion matrix - better understanding of distribution of true (or false) positives/negatives
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

In [None]:
# explore additional features/techniques that may enhance model's performance - TF-IDF
# create a TfidfVectorizer object 
tfidf_vectorizer = TfidfVectorizer()

# fit the vectorizer to X_train
tfidf_vectorizer.fit(X_train)

# using the fitted vectorizer, transform the training data 
X_train_tfidf = tfidf_vectorizer.transform(X_train)

# using the fitted vectorizer, transform the test data 
X_test_tfidf = tfidf_vectorizer.transform(X_test)
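
As a rough illustration of what TF-IDF weighting does, the sketch below computes it by hand on toy data. It assumes the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalization of each row, which matches scikit-learn's documented defaults; tokenization is simplified to whitespace splitting, and the function name `tfidf_rows` and the documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_rows(train_docs, docs):
    """Weight term counts in `docs` using IDF statistics learned from `train_docs`."""
    tokenized_train = [doc.lower().split() for doc in train_docs]
    vocabulary = sorted({token for tokens in tokenized_train for token in tokens})
    n_docs = len(tokenized_train)

    # Smoothed IDF: terms that appear in fewer training documents get larger weights.
    idf = {}
    for term in vocabulary:
        doc_freq = sum(1 for tokens in tokenized_train if term in tokens)
        idf[term] = math.log((1 + n_docs) / (1 + doc_freq)) + 1

    rows = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        weights = [counts.get(term, 0) * idf[term] for term in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights))
        # L2-normalize so that long and short documents are comparable.
        rows.append([w / norm if norm else 0.0 for w in weights])
    return vocabulary, rows


# Hypothetical toy reviews: "book" appears in both, so it is down-weighted.
toy_docs = ["great book", "boring book"]
vocabulary, toy_tfidf = tfidf_rows(toy_docs, toy_docs)

print(vocabulary)
for row in toy_tfidf:
    print([round(w, 4) for w in row])
```

In the first toy review, "great" receives a larger weight than "book" because "book" occurs in every document and therefore carries less information about the label.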

In [None]:
# when constructing neural network, specify the input_shape - meaning the dimensionality of the input layer
# this corresponds to the dimension of each of the training examples, which is our vocabulary size.
vocabulary_size = len(tfidf_vectorizer.vocabulary_)

print(vocabulary_size)

In [None]:
# create model object
nn_model = keras.Sequential()

# create the input layer and add it to the model object: 
input_layer = keras.layers.InputLayer(input_shape=(vocabulary_size,))

# add input_layer to the model object:
nn_model.add(input_layer)

# create the first hidden layer and add it to the model object:
hidden_layer_1 = keras.layers.Dense(units=64, activation='relu')
nn_model.add(hidden_layer_1)
nn_model.add(keras.layers.Dropout(.25)) # dropout - add randomness & prevent overfitting

# create the second layer and add it to the model object:
hidden_layer_2 = keras.layers.Dense(units=32, activation='relu')
nn_model.add(hidden_layer_2)
nn_model.add(keras.layers.Dropout(.25)) # dropout - add randomness & prevent overfitting

# create the third layer and add it to the model object:
hidden_layer_3 = keras.layers.Dense(units=16, activation='relu')

# add hidden_layer_3 to the model object:
nn_model.add(hidden_layer_3)
nn_model.add(keras.layers.Dropout(.25)) # dropout - add randomness & prevent overfitting

# create the output layer and add it to the model object:
output_layer = keras.layers.Dense(units=1, activation='sigmoid')
nn_model.add(output_layer)

# print summary of neural network model structure
nn_model.summary()
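
The parameter counts in the summary can be checked by hand: each dense layer has one weight per input-output pair plus one bias per unit, and dropout layers add no parameters. The sketch below does that arithmetic for the layer sizes used above. The helper `dense_param_count` and the value of `example_vocabulary_size` are hypothetical; substitute the vocabulary size printed earlier to compare against the summary.

```python
def dense_param_count(layer_sizes):
    """Count weights and biases for a stack of fully connected layers."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix plus one bias per unit
    return total


example_vocabulary_size = 1000  # hypothetical; use the printed vocabulary size instead

# input -> 64 -> 32 -> 16 -> 1, matching the Dense layers defined above
layer_sizes = [example_vocabulary_size, 64, 32, 16, 1]
total_params = dense_param_count(layer_sizes)

print(total_params)
```

Almost all parameters sit in the first hidden layer, because its input dimension is the vocabulary size.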

In [None]:
# define the optimization function
sgd_optimizer = keras.optimizers.SGD(learning_rate=0.1)

# define the loss function
loss_fn = keras.losses.BinaryCrossentropy(from_logits=False)

# compile the model
nn_model.compile(optimizer=sgd_optimizer, loss=loss_fn, metrics=['accuracy'])
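
To show what the loss function measures, here is a minimal sketch of binary cross-entropy computed directly on probabilities, which is the situation `from_logits=False` describes when the output layer uses a sigmoid. The function name, the clipping constant `eps`, and the toy values are assumptions made for illustration, not a description of Keras internals.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Average negative log-likelihood of binary labels given predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0); eps is an assumed value
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# Hypothetical predictions for one positive and one negative example.
confident_right = binary_cross_entropy([1, 0], [0.9, 0.1])
confident_wrong = binary_cross_entropy([1, 0], [0.1, 0.9])

print(f"Confident and right: {confident_right:.4f}")
print(f"Confident and wrong: {confident_wrong:.4f}")
```

The loss is small when the predicted probability is close to the true label and grows sharply when the model is confidently wrong, which is why it can keep changing even when accuracy stays flat.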

In [None]:
# callback class to output info from model when training
class ProgBarLoggerNEpochs(keras.callbacks.Callback):
    
    def __init__(self, num_epochs: int, every_n: int = 50):
        super().__init__() # initialize the base Callback state
        self.num_epochs = num_epochs
        self.every_n = every_n
    
    def on_epoch_end(self, epoch, logs=None):
        if (epoch + 1) % self.every_n == 0:
            s = 'Epoch [{}/ {}]'.format(epoch + 1, self.num_epochs)
            logs_s = ['{}: {:.4f}'.format(k.capitalize(), v)
                      for k, v in logs.items()]
            s_list = [s] + logs_s
            print(', '.join(s_list))

In [None]:
# fit the neural network model to the vectorized training data
t0 = time.time() # start time

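num_epochs = 38 # number of training epochs

# log every 5 epochs; every_n must not exceed num_epochs, otherwise nothing is printed
history = nn_model.fit(X_train_tfidf.toarray(), y_train, epochs=num_epochs, verbose=0,
                      callbacks=[ProgBarLoggerNEpochs(num_epochs, every_n=5)], validation_split=0.2)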

t1 = time.time() # stop time

print('Elapsed time: %.2fs' % (t1-t0))

In [None]:
# visualize model's performance over time 
# plot training and validation loss
plt.plot(range(1, num_epochs + 1), history.history['loss'], label='Training Loss')
plt.plot(range(1, num_epochs + 1), history.history['val_loss'], label='Validation Loss')

plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.show()

# plot training and validation accuracy
plt.plot(range(1, num_epochs + 1), history.history['accuracy'], label='Training Accuracy')
plt.plot(range(1, num_epochs + 1), history.history['val_accuracy'], label='Validation Accuracy')

plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.show()

In [None]:
# evaluate performance of the model 
loss, accuracy = nn_model.evaluate(X_test_tfidf.toarray(), y_test)

print('Loss: ', str(loss) , 'Accuracy: ', str(accuracy))

In [None]:
# make predictions on test set
probability_predictions = nn_model.predict(X_test_tfidf.toarray())

print("Predictions for the first 10 examples:")
print("Probability\t\t\tClass")
for i in range(0, 10):
    probability = probability_predictions[i][0] # predict() returns one column per output unit
    class_pred = "Good Review" if probability >= .5 else "Bad Review"
    print(str(probability) + "\t\t\t" + class_pred)

In [None]:
# verify whether the model correctly predicted if a single review is good or bad
print('Review at test index 50:\n')
print(X_test.to_numpy()[50])

goodReview = bool(probability_predictions[50][0] >= .5)
    
print('\nPrediction: Is this a good review? {}\n'.format(goodReview))

print('Actual: Is this a good review? {}\n'.format(y_test.to_numpy()[50]))