# Lab 8: Define and Solve an ML Problem of Your Choosing

In [1]:
import pandas as pd
import numpy as np
import os 
import matplotlib.pyplot as plt
import seaborn as sns

In this lab assignment, you will follow the machine learning life cycle and implement a model to solve a machine learning problem of your choosing. You will select a data set and choose a predictive problem that the data set supports.  You will then inspect the data with your problem in mind and begin to formulate a  project plan. You will then implement the machine learning project plan. 

You will complete the following tasks:

1. Build Your DataFrame
2. Define Your ML Problem
3. Perform exploratory data analysis to understand your data.
4. Define Your Project Plan
5. Implement Your Project Plan:
    * Prepare your data for your model.
    * Fit your model to the training data and evaluate your model.
    * Improve your model's performance.

## Part 1: Build Your DataFrame

You will have the option to choose one of four data sets that you have worked with in this program:

* The "census" data set that contains Census information from 1994: `censusData.csv`
* Airbnb NYC "listings" data set: `airbnbListingsData.csv`
* World Happiness Report (WHR) data set: `WHR2018Chapter2OnlineData.csv`
* Book Review data set: `bookReviewsData.csv`

Note that these are variations of the data sets that you have worked with in this program. For example, some do not include some of the preprocessing necessary for specific models. 

#### Load a Data Set and Save it as a Pandas DataFrame

The code cell below contains filenames (path + filename) for each of the four data sets available to you.

<b>Task:</b> In the code cell below, use the same method you have been using to load the data using `pd.read_csv()` and save it to DataFrame `df`. 

You can load each file as a new DataFrame to inspect the data before choosing your data set.

In [2]:
bookReviewDataSet_filename = os.path.join(os.getcwd(), "data", "bookReviewsData.csv")


df = pd.read_csv(bookReviewDataSet_filename)

df.head()

Unnamed: 0,Review,Positive Review
0,This was perhaps the best of Johannes Steinhof...,True
1,This very fascinating book is a story written ...,True
2,The four tales in this collection are beautifu...,True
3,The book contained more profanity than I expec...,False
4,We have now entered a second time of deep conc...,True


## Part 2: Define Your ML Problem

Next you will formulate your ML Problem. In the markdown cell below, answer the following questions:

1. List the data set you have chosen.
2. What will you be predicting? What is the label?
3. Is this a supervised or unsupervised learning problem? Is this a clustering, classification or regression problem? Is it a binary classification or multi-class classification problem?
4. What are your features? (note: this list may change after you explore your data)
5. Explain why this is an important problem. In other words, how would a company create value with a model that predicts this label?

1) I have chosen the book review data set (`bookReviewsData.csv`).
2) I will predict whether a book review is positive or negative, which is a sentiment analysis task. The label is `Positive Review`.
3) This is a supervised learning problem, specifically binary classification.
4) The only feature is the review text in the `Review` column.
5) This problem matters because review sentiment gives insight into customer satisfaction: a company can see which books receive positive or negative feedback and can promote the well-reviewed books as part of its marketing strategy.

## Part 3: Understand Your Data

The next step is to perform exploratory data analysis. Inspect and analyze your data set with your machine learning problem in mind. Consider the following as you inspect your data:

1. What data preparation techniques would you like to use? These data preparation techniques may include:

    * addressing missingness, such as replacing missing values with means
    * finding and replacing outliers
    * renaming features and labels
    * performing feature engineering techniques such as one-hot encoding on categorical features
    * selecting appropriate features and removing irrelevant features
    * performing specific data cleaning and preprocessing techniques for an NLP problem
    * addressing class imbalance in your data sample to promote fair AI
    

2. What machine learning model (or models) you would like to use that is suitable for your predictive problem and data?
    * Are there other data preparation techniques that you will need to apply to build a balanced modeling data set for your problem and model? For example, will you need to scale your data?
 
 
3. How will you evaluate and improve the model's performance?
    * Are there specific evaluation metrics and methods that are appropriate for your model?
    

Think of the different techniques you have used to inspect and analyze your data in this course. These include using Pandas to apply data filters, using the Pandas `describe()` method to get insight into key statistics for each column, using the Pandas `dtypes` property to inspect the data type of each column, and using Matplotlib and Seaborn to detect outliers and visualize relationships between features and labels. If you are working on a classification problem, use techniques you have learned to determine if there is class imbalance.

<b>Task</b>: Use the techniques you have learned in this course to inspect and analyze your data. You can import additional packages that you have used in this course that you will need to perform this task.

<b>Note</b>: You can add code cells if needed by going to the <b>Insert</b> menu and clicking on <b>Insert Cell Below</b> in the drop-down menu.

In [3]:
# YOUR CODE HERE
#Inspecting and analyzing data
print("Missing values?:\n", df.isna().sum())
print("(rows, cols) : ", df.shape)
print(df.describe()) #labels are nearly balanced: 993 False vs. 980 True
print('\n', df.dtypes)


Missing values?:
 Review             0
Positive Review    0
dtype: int64
(rows, cols) :  (1973, 2)
                                                   Review Positive Review
count                                                1973            1973
unique                                               1865               2
top     I have read several of Hiaasen's books and lov...           False
freq                                                    3             993

 Review             object
Positive Review      bool
dtype: object
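
The `describe()` output above reports `False` as the most frequent label, with 993 of 1973 rows, which implies 980 `True` rows. A minimal sketch of an explicit class-balance check follows. It rebuilds a stand-in label column from those counts (an assumption, so that the block runs without the CSV file); in the notebook, the same calls would be made on `df['Positive Review']`.

```python
import pandas as pd

# Stand-in for df['Positive Review'], rebuilt from the counts reported by describe():
# 993 False rows out of 1973, which leaves 980 True rows.
labels = pd.Series([False] * 993 + [True] * 980, name="Positive Review")

# Absolute counts and relative shares per class
class_counts = labels.value_counts().to_dict()
class_shares = labels.value_counts(normalize=True).round(3).to_dict()

print(class_counts)
print(class_shares)
```

With roughly 50.3% negative and 49.7% positive reviews, the two classes are close to balanced, so resampling does not appear necessary for this data set.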


## Part 4: Define Your Project Plan

Now that you understand your data, in the markdown cell below, define your plan to implement the remaining phases of the machine learning life cycle (data preparation, modeling, evaluation) to solve your ML problem. Answer the following questions:

* Do you have a new feature list? If so, what are the features that you chose to keep and remove after inspecting the data? 
* Explain different data preparation techniques that you will use to prepare your data for modeling.
* What is your model (or models)?
* Describe your plan to train your model, analyze its performance and then improve the model. That is, describe your model building, validation and selection plan to produce a model that generalizes well to new data. 

1) The only feature I kept is `Review`. After text preprocessing, the model's input is a new column, `Processed Review`, which holds the cleaned review text.
2) I will apply text preprocessing: lowercasing, punctuation removal, tokenization, stop-word removal, and stemming. I will then convert the processed text into numerical features with TF-IDF vectorization.
3) My model is logistic regression. I will first train it with default hyperparameters and then tune it, comparing F1 and AUC scores.
4) I will train on a training split and evaluate on a held-out test split. To improve the model, I will tune its hyperparameters with cross-validation and search for the classification threshold that gives the best F1 score, with the aim of a model that generalizes well to new data.

* F1 score: the harmonic mean of precision and recall, used to evaluate classification models.
* AUC: the area under the ROC curve, a single number that summarizes a classifier's performance across all possible classification thresholds.
* Logistic Regression Model: a supervised machine learning algorithm for classification that predicts the probability that an instance belongs to a given class.
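
To make the F1 definition concrete, here is a small worked sketch. The helper name `f1_from_precision_recall` is hypothetical, and the inputs 0.80 and 0.88 are the rounded precision and recall of the `True` class taken from the threshold-optimized classification report later in this notebook.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; returns 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded precision and recall for the True class (threshold-optimized report)
example_f1 = f1_from_precision_recall(0.80, 0.88)
print(round(example_f1, 3))  # 0.838
```

The result agrees with the reported F1 of about 0.84 for that class. Because the harmonic mean is pulled toward the smaller of the two values, a model cannot reach a high F1 by maximizing only precision or only recall.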

## Part 5: Implement Your Project Plan

<b>Task:</b> In the code cell below, import additional packages that you have used in this course that you will need to implement your project plan.

In [4]:
# YOUR CODE HERE
import string
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, ENGLISH_STOP_WORDS
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, f1_score, classification_report
from sklearn.model_selection import train_test_split, GridSearchCV


<b>Task:</b> Use the rest of this notebook to carry out your project plan. 

You will:

1. Prepare your data for your model.
2. Fit your model to the training data and evaluate your model.
3. Improve your model's performance by performing model selection and/or feature selection techniques to find best model for your problem.

Add code cells below and populate the notebook with commentary, code, analyses, results, and figures as you see fit. 

In [5]:
# Defining a stemming function
def stemming(word):
    suffixes = ['ing', 'ed', 'es', 's', 'er', 'ly']
    for suffix in suffixes:
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word

# Defining text preprocessing function for tokens
def preprocess_text(text):
    # Tokenize
    tokens = text.split()  # splitting on whitespace
    
    # Convert tokens to lowercase and remove punctuation
    table = str.maketrans('', '', string.punctuation)
    tokens = [token.lower().translate(table) for token in tokens]
    
    # Remove stop words
    tokens = [token for token in tokens if token not in ENGLISH_STOP_WORDS]
    
    # Stemming tokens using stemmer function
    tokens = [stemming(token) for token in tokens]
    
    return ' '.join(tokens)
    
df['Processed Review'] = df['Review'].apply(preprocess_text) #adding a new review text-preprocessed column 
df.head()

Unnamed: 0,Review,Positive Review,Processed Review
0,This was perhaps the best of Johannes Steinhof...,True,best johann steinhoff book do deal stellar tra...
1,This very fascinating book is a story written ...,True,fascinat book story written form numerou lette...
2,The four tales in this collection are beautifu...,True,tal collection beautiful compos art just stori...
3,The book contained more profanity than I expec...,False,book contain profanity expect read book rita r...
4,We have now entered a second time of deep conc...,True,enter second time deep concern science math te...
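
The processed reviews above show the effect of the simple suffix-stripping stemmer: the first matching suffix is removed regardless of what remains, which yields stems such as `johann`, `numerou` and `tal`. The sketch below is one possible variation, not the notebook's method: a hypothetical `stem_with_min_length` helper that uses the same suffix list but only strips a suffix when at least `min_stem` characters remain (both the helper and the default of 3 are assumptions for illustration).

```python
# Same suffixes, in the same order, as the notebook's stemming() function
SUFFIXES = ['ing', 'ed', 'es', 's', 'er', 'ly']


def stem_with_min_length(word, min_stem=3):
    """Strip the first suffix whose removal leaves at least min_stem characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word


for word in ["fascinating", "tales", "bed", "is", "numerous"]:
    print(word, "->", stem_with_min_length(word))
```

The guard keeps very short words such as `bed` intact instead of reducing them to a single letter, but it does not prevent over-stemming in general (`numerous` still becomes `numerou`). A dictionary-aware stemmer or lemmatizer would be needed for that.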


In [6]:
X = df['Processed Review'] #new review/feature column after text preprocessing
y = df['Positive Review'] #label
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, random_state=1234)
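
The split above is random, so the class proportions in the training and test sets can drift slightly from those of the full data set. As a hedged alternative, `train_test_split` accepts a `stratify` argument that preserves the label proportions in both parts. The sketch below uses toy stand-in data (an assumption) rather than the book reviews.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in data: 200 placeholder reviews with perfectly balanced labels
texts = np.array([f"review {i}" for i in range(200)])
labels = np.array([True] * 100 + [False] * 100)

# stratify=labels keeps the True/False ratio the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, train_size=0.75, random_state=1234, stratify=labels
)

print(len(X_tr), len(X_te))
print(y_tr.mean(), y_te.mean())
```

For a nearly balanced data set such as this one the difference is small, but stratification matters more as the classes become more imbalanced.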

In [7]:
# Feature engineering: TF-IDF

tfidf_vectorizer = TfidfVectorizer()  # transforms text data into numerical features

# Fit on the training data only, to build the vocabulary of words and their corresponding indices
tfidf_vectorizer.fit(X_train)

print("Vocabulary size {0}: ".format(len(tfidf_vectorizer.vocabulary_)))
print(str(list(tfidf_vectorizer.vocabulary_.items())[0:50])+'\n')

# Transform the training and test text into sparse TF-IDF matrices
X_train_tfidf = tfidf_vectorizer.transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)


# Print the training matrix (converted to dense for display)
print(X_train_tfidf.todense())


Vocabulary size 16338: 
[('reason', 11977), ('book', 1944), ('sold', 13583), ('180000', 83), ('copi', 3365), ('get', 6229), ('right', 12502), ('point', 11035), ('accompani', 461), ('strategy', 14002), ('visual', 15663), ('aid', 697), ('mental', 9137), ('picture', 10896), ('head', 6747), ('section', 12980), ('analyz', 891), ('stock', 13952), ('commentary', 2991), ('state', 13871), ('financial', 5653), ('statement', 13873), ('market', 8911), ('money', 9465), ('just', 8032), ('start', 13862), ('option', 10250), ('real', 11959), ('th', 14545), ('cookbook', 3350), ('author', 1377), ('learn', 8346), ('cook', 3349), ('recip', 12003), ('techniqu', 14440), ('know', 8175), ('heart', 6766), ('pam', 10495), ('anderson', 906), ('bigbreast', 1770), ('art', 1188), ('high', 6879), ('marketable', 8912), ('title', 14767), ('starter', 13864), ('artful', 1191), ('monik', 9471), ('play', 10980), ('single', 13376), ('gal', 6081)]

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 .
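
The cell above fits the vectorizer on the training text only and then reuses that vocabulary for the test text. The toy sketch below (the four documents are made up for illustration) shows what this means in practice: words that never appeared in the training data are silently ignored when the test data is transformed.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up toy documents
train_docs = ["good book", "bad book"]
test_docs = ["great novel", "good novel"]

vectorizer = TfidfVectorizer()
train_matrix = vectorizer.fit_transform(train_docs)  # learns the vocabulary and IDF weights
test_matrix = vectorizer.transform(test_docs)        # reuses them; unseen words are dropped

print(sorted(vectorizer.vocabulary_))
print(train_matrix.shape, test_matrix.shape)
print(test_matrix.toarray())
```

Both matrices have one column per vocabulary word learned from the training documents. The first test document contains only unseen words, so its row is all zeros; the second keeps a single non-zero entry for `good`.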

In [8]:
# 1. initial Logistic Regression model
#uses default hyperparameters, except max_iter = 200 
#trains a logistic regression model
print("Initial Logistic Regression model...")
model1 = LogisticRegression(max_iter=200)
model1.fit(X_train_tfidf, y_train) 

initial_prob_predictions = model1.predict_proba(X_test_tfidf)[:, 1] #gives probability estimates for the positive class
initial_class_label_predictions = model1.predict(X_test_tfidf) #provides binary class predictions

initial_auc = roc_auc_score(y_test, initial_prob_predictions) #evaluate model's performance using AUC, measures model's ability to distinguish between classes
print('Initial AUC on the test data: {:.4f}'.format(initial_auc))

print('Initial Classification Report:')
print(classification_report(y_test, initial_class_label_predictions)) #provides precision, recall, and F1 score

Initial Logistic Regression model...
Initial AUC on the test data: 0.9033
Initial Classification Report:
              precision    recall  f1-score   support

       False       0.81      0.81      0.81       237
        True       0.83      0.82      0.83       257

    accuracy                           0.82       494
   macro avg       0.82      0.82      0.82       494
weighted avg       0.82      0.82      0.82       494



In [9]:
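# Optimize the classification threshold for the initial LR model
# Note: the threshold is selected on the test set, so the F1 score reported below is an optimistic estimate
initial_thresholds = np.linspace(0, 1, 100) #creates 100 evenly spaced thresholds from 0 to 1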
initial_f1 = 0
initial_threshold = 0

for threshold in initial_thresholds:
    predictions = (initial_prob_predictions >= threshold).astype(int) #applies each threshold to create binary predictions
    current_f1 = f1_score(y_test, predictions) #computes F1 score for each threshold
    if current_f1 > initial_f1:
        initial_f1 = current_f1
        initial_threshold = threshold

final_initial_predictions = (initial_prob_predictions >= initial_threshold).astype(int)
print(f'Initial Best Threshold: {initial_threshold}')
print(f'Initial Best F1 Score: {initial_f1}')
print('Initial Classification Report with optimized threshold:')
print(classification_report(y_test, final_initial_predictions))

Initial Best Threshold: 0.4747474747474748
Initial Best F1 Score: 0.8379888268156425
Initial Classification Report with optimized threshold:
              precision    recall  f1-score   support

       False       0.85      0.77      0.81       237
        True       0.80      0.88      0.84       257

    accuracy                           0.82       494
   macro avg       0.83      0.82      0.82       494
weighted avg       0.83      0.82      0.82       494
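
The threshold search above picks the best threshold using the test labels, so the resulting F1 score is tuned to the test set. A common alternative is to choose the threshold on a separate validation split and apply it to the test set once. The sketch below illustrates that idea under stated assumptions: the data is synthetic (`make_classification`), the three-way split sizes are arbitrary, and `best_f1_threshold` is a hypothetical helper that mirrors the loop above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split


def best_f1_threshold(y_true, probs, thresholds=None):
    """Return (threshold, f1) for the first threshold with the highest F1 score."""
    if thresholds is None:
        thresholds = np.linspace(0, 1, 100)
    best_threshold, best_f1 = 0.0, 0.0
    for threshold in thresholds:
        preds = (probs >= threshold).astype(int)
        current_f1 = f1_score(y_true, preds, zero_division=0)
        if current_f1 > best_f1:
            best_threshold, best_f1 = float(threshold), current_f1
    return best_threshold, best_f1


# Synthetic stand-in data (assumption), split into train / validation / test
X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0, stratify=y_rest)

clf = LogisticRegression(max_iter=200).fit(X_train, y_train)

# Choose the threshold on validation data only
threshold, val_f1 = best_f1_threshold(y_val, clf.predict_proba(X_val)[:, 1])

# Apply the chosen threshold once to the untouched test set
test_preds = (clf.predict_proba(X_test)[:, 1] >= threshold).astype(int)
test_f1 = f1_score(y_test, test_preds)
print(f"threshold={threshold:.3f}  validation F1={val_f1:.3f}  test F1={test_f1:.3f}")
```

The test F1 obtained this way is usually a little lower than a threshold tuned directly on the test set would give, but it is a fairer estimate of performance on new data.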



In [10]:
# 2. Perform hyperparameter tuning
print("Performing hyperparameter tuning with GridSearchCV...")
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear', 'saga']
} #defines grid of hyperparameters to search over
grid_search = GridSearchCV(estimator=LogisticRegression(max_iter=200),
                           param_grid=param_grid,
                           scoring='f1',
                           cv=5,
                           verbose=1,
                           n_jobs=-1) #trains a model using cross-validation
grid_search.fit(X_train_tfidf, y_train)

model2 = grid_search.best_estimator_ #gets logistic regression model with best hyperparameter
best_predictions = model2.predict_proba(X_test_tfidf)[:, 1]
best_class_label_predictions = model2.predict(X_test_tfidf)

best_auc = roc_auc_score(y_test, best_predictions)
print('Optimized AUC on the test data: {:.4f}'.format(best_auc))

print('Optimized Classification Report:')
print(classification_report(y_test, best_class_label_predictions))

# Optimize the classification threshold for the tuned model
# Note: as with the initial model, the threshold is selected on the test set
optimized_thresholds = np.linspace(0, 1, 100)
best_f1 = 0
best_threshold = 0

for threshold in optimized_thresholds:
    predictions = (best_predictions >= threshold).astype(int)
    curr_f1 = f1_score(y_test, predictions)
    if curr_f1 > best_f1:
        best_f1 = curr_f1
        best_threshold = threshold

final_predictions = (best_predictions >= best_threshold).astype(int)
print(f'Optimized Best Threshold: {best_threshold}')
print(f'Optimized Best F1 Score: {best_f1}')
print('Optimized Classification Report with optimized threshold:')
print(classification_report(y_test, final_predictions))

Performing hyperparameter tuning with GridSearchCV...
Fitting 5 folds for each of 20 candidates, totalling 100 fits




Optimized AUC on the test data: 0.9036
Optimized Classification Report:
              precision    recall  f1-score   support

       False       0.81      0.81      0.81       237
        True       0.83      0.82      0.82       257

    accuracy                           0.82       494
   macro avg       0.82      0.82      0.82       494
weighted avg       0.82      0.82      0.82       494

Optimized Best Threshold: 0.4747474747474748
Optimized Best F1 Score: 0.8379888268156425
Optimized Classification Report with optimized threshold:
              precision    recall  f1-score   support

       False       0.85      0.77      0.81       237
        True       0.80      0.88      0.84       257

    accuracy                           0.82       494
   macro avg       0.83      0.82      0.82       494
weighted avg       0.83      0.82      0.82       494
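
In the grid search above, the TF-IDF vectorizer was fitted on the whole training set before cross-validation, so the validation part of each fold had already contributed to the vocabulary and IDF weights. One way to avoid this, sketched below as an assumption rather than as the notebook's method, is to wrap the vectorizer and the classifier in a scikit-learn `Pipeline` and pass raw text to `GridSearchCV`, so that the vectorizer is refitted inside every fold. The toy corpus and the reduced parameter grid are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up toy corpus: 20 positive and 20 negative short reviews
positive = ["great book loved it", "wonderful story well written",
            "excellent read highly recommend", "beautiful prose great characters"]
negative = ["terrible book hated it", "boring story poorly written",
            "awful read do not recommend", "dull prose weak characters"]
texts = (positive + negative) * 5
labels = ([1] * 4 + [0] * 4) * 5

# The vectorizer is a pipeline step, so it is fitted on each fold's training part only
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=200)),
])

# Step-name prefix "clf__" routes the parameter to the classifier
param_grid = {"clf__C": [0.1, 1, 10]}

grid = GridSearchCV(pipeline, param_grid=param_grid, scoring="f1", cv=5)
grid.fit(texts, labels)

print(grid.best_params_)
print(round(grid.best_score_, 3))
```

The fitted `grid.best_estimator_` is itself a pipeline, so it can be called on raw review text with `predict` or `predict_proba` without a separate vectorization step.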



In [11]:
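# Look up a review by index and compare both models' predictions with the actual label.
# Caveat: `index` selects a row of the full `df` for the review text, but a position in the
# test split for the predictions and the label, so the printed review and the printed
# predictions generally belong to different rows.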
def search_review(index):
    try:
        # Ensure the index is valid
        if index < 0 or index >= len(df):
            print("Invalid Index.")
            return
        
        print('Review #{}:\n'.format(index + 1))
        print(df.iloc[index]['Review'])  # Print the original review

        print('\nPrediction (initial model): Is this a good review? {}\n'.format(initial_class_label_predictions[index]))
        print('Prediction (optimized model): Is this a good review? {}\n'.format(best_class_label_predictions[index]))

        print('Actual: Is this a good review? {}\n'.format(y_test.iloc[index]))
    except Exception as e:
        print(f"An error occurred: {e}")

# Ask the user for the review 
try:
    user_index = int(input("Enter the index of the review you want to analyze: "))
    search_review(user_index)
except ValueError:
    print("Invalid index")

Enter the index of the review you want to analyze:  56


Review #57:

Is that those who read it and believe it, believe they actualy have girlfriends!!! Come on! admit it you are guys who wear black sabbath t-shirts and live in your parents basements. You also believe that you can get control of you live by chanting some spells from a book made to get your money.  Look, go get a hair cut, take a bath and loose 10 pounds and you will probably get that girl friend that you talk about.  Oh by the way the Necronomicon is fiction! . . except for the real copy that is in my basement, in my parents house where I used to live when I was 15


Prediction (initial model): Is this a good review? True

Prediction (optimized model): Is this a good review? True

Actual: Is this a good review? True
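
In the cell above, the review text is read from the full DataFrame by position, while the predictions and the actual label are read from the shuffled test split by position, so the two generally refer to different reviews. A minimal sketch of a consistent lookup follows. It uses a toy DataFrame and a hypothetical helper, `lookup_test_review`, which maps a position in the test split back to the original row label before fetching the text.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the book review data: even-numbered rows are positive
toy = pd.DataFrame({
    "Review": [f"review text {i}" for i in range(20)],
    "Positive Review": [i % 2 == 0 for i in range(20)],
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy["Review"], toy["Positive Review"], train_size=0.75, random_state=1234
)


def lookup_test_review(position, frame, y_test_split):
    """Return the original row label, review text and true label for a test-split position."""
    row_label = y_test_split.index[position]  # index label in the original DataFrame
    return {
        "row_label": row_label,
        "review": frame.loc[row_label, "Review"],
        "actual": bool(y_test_split.iloc[position]),
    }


result = lookup_test_review(0, toy, y_te)
print(result)
```

Model predictions made on the test split are ordered by the same positions, so `predictions[position]` would line up with the review and label returned here. Valid positions run from 0 to `len(y_te) - 1`, not to the length of the full DataFrame.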

