# CIND 820 - Deliverable 3 - Model Evaluation
## Data Analysis Section

This section of the notebook trains and tests three supervised machine learning models and compares how well each one predicts star ratings from user feedback. The notebook is divided into three main sections:

1. Data preparation

2. Model Training and Implementation

    1. Model Training for Logistic Regression Algorithm

    2. Model Training for a Naive Bayes Algorithm

    3. Model Training for a Support Vector Regression Algorithm
    
3. Model Performance Evaluation 

## Data Preparation

In [1]:
# Importing pandas library
import pandas as pd
import numpy as np

#MacOS Version of Filepath
filepathMac = r"/Users/sszhang/Documents/Learning Data Analytics/TMU Certificate copy/CIND 820/Yelp Dataset/balancedDataBoW.csv"
# filepathWindows = r"C:\Users\Sunora\iCloudDrive\Documents\Learning Data Analytics\TMU Certificate copy\CIND 820\Yelp Dataset\balancedDataBoW.csv"
data = pd.read_csv(filepathMac)
data.head()

Unnamed: 0,review_id,user_id,business_id,stars,num_Tokens,neg,neu,pos,compound,aaron,...,yr,yuck,yum,yummy,yup,zero,zone,zoo,zucchini,étouffée
0,CoCim4CRm-WCoU-CFfWpLw,McdCFYocB1hFIiDQBRQ7YA,P_nqb7lULOtx3pAJbKfFXA,1,48,0.119,0.779,0.102,-0.0209,0,...,0,0,0,0,0,0,0,0,0,0
1,8s6Eejmy24XUhgNkR2uIUA,X67DbQdqHeZ-F2UVUOhn1g,WNjrsnJVPPnv_FtHHdjklA,1,84,0.11,0.819,0.071,-0.6697,0,...,0,0,0,0,0,0,0,0,0,0
2,2GdPCXF_5fR4_od5DJTD8Q,-VPeYf78MNJAB0iR7d9-zg,QboMIy08NLnBbLXEsmnDHg,1,27,0.276,0.69,0.034,-0.9612,0,...,0,0,0,0,0,0,0,0,0,0
3,vYSCzz-jM7ibdoIUCRLysw,I0Vt1g8iK0D_cxXkJyXb0A,INz7vujcHs0AggsV__pXYQ,1,6,0.222,0.778,0.0,-0.6449,0,...,0,0,0,0,0,0,0,0,0,0
4,mLokfOcquwIP57pcOkHBZQ,TV3p-bv5yh8RgdJ3WxM7Ug,eh8WfQqPa2ZWtbXe9_wHgQ,1,44,0.055,0.919,0.026,-0.3201,0,...,0,0,0,0,0,0,0,0,0,0


In [2]:
# Dropping 'review_id', 'user_id', and 'business_id' fields
data.drop(columns=['review_id', 'user_id', 'business_id'], inplace = True)
data.head()

Unnamed: 0,stars,num_Tokens,neg,neu,pos,compound,aaron,aback,abandon,ability,...,yr,yuck,yum,yummy,yup,zero,zone,zoo,zucchini,étouffée
0,1,48,0.119,0.779,0.102,-0.0209,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,84,0.11,0.819,0.071,-0.6697,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,27,0.276,0.69,0.034,-0.9612,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,6,0.222,0.778,0.0,-0.6449,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,44,0.055,0.919,0.026,-0.3201,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Feature Selection
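All three ML models predict the star rating (the target variable) from the following features: 1) the 4 different sentiment scores generated by VADER (`neg`, `neu`, `pos`, `compound`) and 2) the Bag of Words matrix. Because the code below defines `X` as every column other than `stars`, the `Unnamed: 0` index column and `num_Tokens` are also passed to the models.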

### Training and Testing Setup
For consistency, all three models are trained and tested on the same split (`test_size=0.2`, `random_state=2025`):

- 80% of the data is used for training (equivalent to 20,000 rows)

- 20% of the data is held out for testing (equivalent to 5,000 rows)
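
As a quick sanity check on the row counts above, the following sketch applies the same `train_test_split` settings to placeholder data. The arrays `X_demo` and `y_demo` are synthetic stand-ins (an assumption for illustration), not the Yelp data; only the row count of 25,000 is taken from the figures above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic placeholder data with the same number of rows as the
# balanced dataset (20,000 + 5,000 = 25,000). Values are not real.
n_rows = 25_000
X_demo = np.zeros((n_rows, 2))
y_demo = np.arange(n_rows) % 5 + 1  # labels 1-5, evenly distributed

# Same split settings as the model cells in this notebook
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=2025
)

print("Training rows:", X_train.shape[0])
print("Testing rows:", X_test.shape[0])
```

Reusing the same `random_state` in every model cell is what keeps the three evaluations comparable, since each model is then scored on the same held-out rows.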


## Model Training and Implementation
### Preloading all dependent libraries

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, mean_squared_error, r2_score
from sklearn.svm import LinearSVR
from sklearn.preprocessing import StandardScaler

### 1. Model Training for Logistic Regression Algorithm

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Defining the Target Variable (Y = 'stars' column) and Feature Set (X = all other columns)
Y = data['stars']
X = data.drop(columns = 'stars')

# Splitting the data into training and testing components
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=2025)

# Specifying a Logistic Regression algorithm
model = LogisticRegression(max_iter=5000, solver='lbfgs', multi_class='multinomial')
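model.fit(X_train, Y_train)

# Make predictions with the Logistic Regression model
y_pred = model.predict(X_test)

# Evaluate against the held-out labels (Y_test, matching the split above)
print("Accuracy:", accuracy_score(Y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(Y_test, y_pred))
print("Classification Report:\n", classification_report(Y_test, y_pred))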



Accuracy: 0.5154
Confusion Matrix:
 [[659 233  63  24  23]
 [234 439 224  53  26]
 [ 69 238 396 238  80]
 [ 23  53 182 428 313]
 [ 15  30  62 240 655]]
Classification Report:
               precision    recall  f1-score   support

           1       0.66      0.66      0.66      1002
           2       0.44      0.45      0.45       976
           3       0.43      0.39      0.41      1021
           4       0.44      0.43      0.43       999
           5       0.60      0.65      0.62      1002

    accuracy                           0.52      5000
   macro avg       0.51      0.52      0.51      5000
weighted avg       0.51      0.52      0.51      5000



#### Notes:
Based on the classification report and confusion matrix for our **Logistic Regression model**, we observe the following:

- **Precision**, the share of a model's predictions for a given star rating that are correct, was highest for 1-star and 5-star reviews and noticeably lower for 2- to 4-star ratings. For instance, the model scored 66% on precision for 1-star ratings, which means that 66% of the reviews it predicted as 1-star were truly 1-star. In other words, precision can be thought of as the probability that a given prediction is correct.

- **Recall**, the share of all reviews with a given star rating that the model correctly identified, was also highest for 1-star and 5-star reviews and also lower for 2- to 4-star ratings. For instance, the model scored 66% on recall for 1-star ratings, which means that 66% of all 1-star reviews were identified.

- **F1-Score**, the harmonic mean of **precision** and **recall**, follows the same pattern: the 1-star and 5-star categories had the highest scores (0.66 and 0.62).

- **Accuracy**, a metric that measures the overall correctness of a model across all categories, was 52%, indicating that the model correctly predicted just over half of all reviews. While this is better than random guessing (which would yield ~20% in a balanced 5-class setup), it still reflects substantial room for improvement.
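
The metrics described above can be recomputed directly from the confusion matrix, which may help make the definitions concrete. The sketch below copies the Logistic Regression matrix from the output above and assumes scikit-learn's layout (rows are true ratings, columns are predicted ratings); the variable names are illustrative only.

```python
# Confusion matrix copied from the Logistic Regression output above.
# Rows are true star ratings (1-5); columns are predicted star ratings (1-5).
cm = [
    [659, 233,  63,  24,  23],
    [234, 439, 224,  53,  26],
    [ 69, 238, 396, 238,  80],
    [ 23,  53, 182, 428, 313],
    [ 15,  30,  62, 240, 655],
]

n = len(cm)
row_totals = [sum(row) for row in cm]                             # true count per class
col_totals = [sum(cm[r][c] for r in range(n)) for c in range(n)]  # predicted count per class

# Precision: correct predictions for a class / all predictions of that class
precision = [cm[k][k] / col_totals[k] for k in range(n)]
# Recall: correct predictions for a class / all true members of that class
recall = [cm[k][k] / row_totals[k] for k in range(n)]
# F1: harmonic mean of precision and recall
f1 = [2 * p * r / (p + r) for p, r in zip(precision, recall)]
# Accuracy: diagonal (all correct predictions) / total number of reviews
accuracy = sum(cm[k][k] for k in range(n)) / sum(row_totals)

for k in range(n):
    print(f"{k + 1}-star: precision={precision[k]:.2f} "
          f"recall={recall[k]:.2f} f1={f1[k]:.2f}")
print(f"accuracy={accuracy:.4f}")
```

Rounded to two decimals, these values agree with the classification report, and the accuracy matches the reported 0.5154.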


### 2. Model Training for Naive Bayes Algorithm

In [None]:
# Importing MultinomialNB from sklearn.naive_bayes
from sklearn.naive_bayes import MultinomialNB

# Removing negatives from the compound score: MultinomialNB requires non-negative feature values
data_nb = data.copy()
data_nb['compound'] = data_nb['compound'] + 1  # Shift VADER compound from [-1, 1] to [0, 2]

# Defining the Target Variable (Y = 'stars' column) and Feature Set (X = all other columns)
Y = data_nb['stars']
X = data_nb.drop(columns = 'stars')

# Splitting the data into training and testing components
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=2025)

# Train the Naive Bayes model
nb_model = MultinomialNB()
nb_model.fit(X_train, Y_train)

# Make predictions with the Naive Bayes model
nb_y_pred = nb_model.predict(X_test)

print("Accuracy:", accuracy_score(Y_test, nb_y_pred))
print("Confusion Matrix:\n", confusion_matrix(Y_test, nb_y_pred))
print("Classification Report:\n", classification_report(Y_test, nb_y_pred))


Accuracy: 0.5226
Confusion Matrix:
 [[670 239  73   6  14]
 [215 387 287  61  26]
 [ 86 164 427 280  64]
 [ 45  28 156 515 255]
 [ 53  14  43 278 614]]
Classification Report:
               precision    recall  f1-score   support

           1       0.63      0.67      0.65      1002
           2       0.47      0.40      0.43       976
           3       0.43      0.42      0.43      1021
           4       0.45      0.52      0.48       999
           5       0.63      0.61      0.62      1002

    accuracy                           0.52      5000
   macro avg       0.52      0.52      0.52      5000
weighted avg       0.52      0.52      0.52      5000



#### Notes:
Based on the classification report and confusion matrix for our **Naive Bayes model**, we observe the following:

- **Precision**, the share of a model's predictions for a given star rating that are correct, was highest for 1-star and 5-star reviews and noticeably lower for 2- to 4-star ratings. For instance, the model scored 63% on precision for 1-star ratings, which means that 63% of the reviews it predicted as 1-star were truly 1-star. In other words, precision can be thought of as the probability that a given prediction is correct.

- **Recall**, the share of all reviews with a given star rating that the model correctly identified, was also highest for 1-star and 5-star reviews (67% and 61%) and lower for 2- to 4-star ratings, although 4-star recall reached 52%. For instance, a recall of 67% for 1-star ratings means that 67% of all 1-star reviews were identified.

- **F1-Score**, the harmonic mean of **precision** and **recall**, follows the same pattern: the 1-star and 5-star categories had the highest scores (0.65 and 0.62).

- **Accuracy**, a metric that measures the overall correctness of a model across all categories, was 52% (0.5226), marginally higher than the Logistic Regression model (0.5154). The model correctly predicted just over half of all reviews. While this is better than random guessing (which would yield ~20% in a balanced 5-class setup), it still reflects substantial room for improvement.
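
On the preprocessing used for this model: the `compound` score is shifted by 1 because `MultinomialNB` rejects negative feature values. The sketch below illustrates this on a tiny made-up feature matrix (the numbers and the two-class labels are hypothetical, chosen only for the demonstration); the first column plays the role of `compound`.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data: column 0 mimics a VADER compound score in [-1, 1],
# columns 1-2 mimic Bag of Words counts. Labels are made-up star ratings.
X_raw = np.array([
    [-0.67, 1.0, 0.0],
    [ 0.85, 0.0, 2.0],
    [-0.32, 1.0, 1.0],
    [ 0.91, 0.0, 3.0],
])
y = np.array([1, 5, 1, 5])

# Fitting on the raw matrix fails because of the negative values
try:
    MultinomialNB().fit(X_raw, y)
    raised = False
except ValueError as err:
    raised = True
    print("Raw data rejected:", err)

# Shifting the compound column by +1 makes every value non-negative
X_shifted = X_raw.copy()
X_shifted[:, 0] += 1.0
nb_demo = MultinomialNB().fit(X_shifted, y)
print("Minimum value after shift:", X_shifted.min())
```

The shift keeps the ordering of the compound scores intact while satisfying the non-negativity requirement; it is one simple option, and other rescalings would also work.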


### 3. Model Training for Support Vector Regression Algorithm

In [4]:
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVR
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler
import numpy as np

# Defining the Target Variable (Y = 'stars' column) and Feature Set (X = all other columns)
Y = data['stars']
X = data.drop(columns = 'stars')

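# Feature Scaling
# Note: the scaler is fitted on the full dataset before the split, so test-set
# statistics influence the scaling; fitting on X_train only would avoid this.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)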

# Splitting the data into training and testing components
X_train, X_test, Y_train, Y_test = train_test_split(X_scaled, Y, test_size=0.2, random_state=2025)

# Specifying a Support Vector Regression algorithm
svr_model = LinearSVR(max_iter = 10000)
svr_model.fit(X_train, Y_train)

# Make predictions with the SVR model
svr_y_pred = svr_model.predict(X_test)

# Evaluating the SVR model
print("Mean Squared Error:", mean_squared_error(Y_test, svr_y_pred))
print("R^2 Score:", r2_score(Y_test, svr_y_pred))

# Rounding predictions to nearest integer for classification evaluation
svr_y_pred_rounded = np.clip(np.round(svr_y_pred), 1, 5).astype(int)
print("Rounded Accuracy:", accuracy_score(Y_test, svr_y_pred_rounded))
print("Confusion Matrix:\n", confusion_matrix(Y_test, svr_y_pred_rounded))
print("Classification Report:\n", classification_report(Y_test, svr_y_pred_rounded))


Mean Squared Error: 1.364657920850164
R^2 Score: 0.3170491581973893
Rounded Accuracy: 0.4018
Confusion Matrix:
 [[399 387 178  29   9]
 [212 368 301  77  18]
 [ 61 212 426 259  63]
 [ 15  53 260 449 222]
 [  7  22 162 444 367]]
Classification Report:
               precision    recall  f1-score   support

           1       0.57      0.40      0.47      1002
           2       0.35      0.38      0.36       976
           3       0.32      0.42      0.36      1021
           4       0.36      0.45      0.40       999
           5       0.54      0.37      0.44      1002

    accuracy                           0.40      5000
   macro avg       0.43      0.40      0.41      5000
weighted avg       0.43      0.40      0.41      5000





#### Notes:
Based on the classification report and confusion matrix for our **Support Vector Regression model** (computed after rounding its predictions to the nearest star), we observe the following:

- **Precision**, the share of a model's predictions for a given star rating that are correct, was highest for 1-star and 5-star reviews and noticeably lower for 2- to 4-star ratings. For instance, the model scored 57% on precision for 1-star ratings, which means that 57% of the reviews it predicted as 1-star were truly 1-star. In other words, precision can be thought of as the probability that a given prediction is correct.

- **Recall**, the share of all reviews with a given star rating that the model correctly identified, was highest for 4-star reviews. For instance, the model scored 45% on recall for 4-star ratings, which means that 45% of all 4-star reviews were identified. Unlike the two classifiers, recall was lowest at the extremes (40% for 1-star and 37% for 5-star), which is consistent with a regression model pulling predictions toward the middle of the scale.

- **F1-Score**, the harmonic mean of **precision** and **recall**, was highest for the 1-star and 5-star categories (0.47 and 0.44).

- **Accuracy**, a metric that measures the overall correctness of a model across all categories, was 40%, with a Mean Squared Error of 1.36 and an R^2 of 0.32 on the unrounded predictions. This suggests that the SVR model underperformed both the Logistic Regression and Naive Bayes models (each at roughly 52% accuracy).
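
Because the SVR model outputs continuous values, the accuracy above depends on how those values are mapped back to star ratings. The sketch below isolates the `np.clip(np.round(...), 1, 5)` step from the cell above, using made-up prediction values (an assumption for illustration, not actual model output).

```python
import numpy as np

# Hypothetical raw regression outputs, including values outside the 1-5 range
raw_predictions = np.array([0.4, 1.5, 2.5, 3.49, 5.7])

# Round to the nearest integer; NumPy rounds exact halves to the nearest even number
rounded = np.round(raw_predictions)

# Clip to the valid star range so out-of-range predictions become 1 or 5
stars = np.clip(rounded, 1, 5).astype(int)

print("Raw:    ", raw_predictions)
print("Rounded:", rounded)
print("Stars:  ", stars)
```

Note that `np.round` sends both 1.5 and 2.5 to 2, so predictions that fall exactly on a half are not always rounded up. With real-valued model output this case is rare, but it is worth knowing when interpreting the rounded accuracy.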
