#### CSC 180 Intelligent Systems 

#### William Lorence, Ajaydeep Singh

#### California State University, Sacramento


# Project 1: Yelp Business Rating Prediction using Tensorflow
## Data Preparation
The following block of code loads the Yelp dataset and shows the first 5 rows of the business dataframe and the review dataframe separately. At most 1,000,000 lines are read from each file. Note that businesses with fewer than 20 reviews are dropped from the business dataframe.

In [58]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np
import tensorflow as tf
import os
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.callbacks import EarlyStopping

path = "./yelp_dataset/"
save_path = "./models/"

df_business = pd.read_json('./yelp_dataset/yelp_academic_dataset_business.json', lines=True, nrows = 1000000)
df_business = df_business[df_business['review_count'] >= 20]

df_business.head()

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories,hours
2,tUFrWirKiKi_TAnsVWINQQ,Target,5255 E Broadway Blvd,Tucson,AZ,85711,32.223236,-110.880452,3.5,22,0,"{'BikeParking': 'True', 'BusinessAcceptsCredit...","Department Stores, Shopping, Fashion, Home & G...","{'Monday': '8:0-22:0', 'Tuesday': '8:0-22:0', ..."
3,MTSW4McQd7CbVtyjqoe9mw,St Honore Pastries,935 Race St,Philadelphia,PA,19107,39.955505,-75.155564,4.0,80,1,"{'RestaurantsDelivery': 'False', 'OutdoorSeati...","Restaurants, Food, Bubble Tea, Coffee & Tea, B...","{'Monday': '7:0-20:0', 'Tuesday': '7:0-20:0', ..."
12,il_Ro8jwPlHresjw9EGmBg,Denny's,8901 US 31 S,Indianapolis,IN,46227,39.637133,-86.127217,2.5,28,1,"{'RestaurantsReservations': 'False', 'Restaura...","American (Traditional), Restaurants, Diners, B...","{'Monday': '6:0-22:0', 'Tuesday': '6:0-22:0', ..."
14,0bPLkL0QhhPO5kt1_EXmNQ,Zio's Italian Market,2575 E Bay Dr,Largo,FL,33771,27.916116,-82.760461,4.5,100,0,"{'OutdoorSeating': 'False', 'RestaurantsGoodFo...","Food, Delis, Italian, Bakeries, Restaurants","{'Monday': '10:0-18:0', 'Tuesday': '10:0-20:0'..."
15,MUTTqe8uqyMdBl186RmNeA,Tuna Bar,205 Race St,Philadelphia,PA,19106,39.953949,-75.143226,4.0,245,1,"{'RestaurantsReservations': 'True', 'Restauran...","Sushi Bars, Restaurants, Japanese","{'Tuesday': '13:30-22:0', 'Wednesday': '13:30-..."


In [59]:
df_review = pd.read_json('./yelp_dataset/yelp_academic_dataset_review.json', lines=True, nrows = 1000000)
df_review.head()

Unnamed: 0,review_id,user_id,business_id,stars,useful,funny,cool,text,date
0,KU_O5udG6zpxOg-VcAEodg,mh_-eMZ6K5RLWhZyISBhwA,XQfwVwDr-v0ZS3_CbbE5Xw,3,0,0,0,"If you decide to eat here, just be aware it is...",2018-07-07 22:09:11
1,BiTunyQ73aT9WBnpR9DZGw,OyoGAe7OKpv6SyGZT5g77Q,7ATYjTIgM3jUlt4UM3IypQ,5,1,0,1,I've taken a lot of spin classes over the year...,2012-01-03 15:28:18
2,saUsX_uimxRlCVr67Z4Jig,8g_iMtfSiwikVnbP2etR0A,YjUWPpI6HXG530lwP-fb2A,3,0,0,0,Family diner. Had the buffet. Eclectic assortm...,2014-02-05 20:30:30
3,AqPFMleE6RsU23_auESxiA,_7bHUi9Uuf5__HHc_Q8guQ,kxX2SOes4o-D3ZQBkiMRfA,5,1,0,1,"Wow! Yummy, different, delicious. Our favo...",2015-01-04 00:01:03
4,Sx8TMOWLNuJBWer-0pcmoA,bcjbaE6dDog4jkNY91ncLQ,e4Vwtrqf-wpJfwesgvdgxQ,4,1,0,1,Cute interior and owner (?) gave us tour of up...,2017-01-14 20:54:15


The next block of code groups all reviews (text) by business_id and concatenates all the reviews for each business into a single string. This means that for each business, you'll get one entry where all its reviews are combined into one text entry.

df_ready_to_be_sent_to_sklearn converts the df_review_agg series into a dataframe.


In [60]:
df_review_agg = df_review.groupby('business_id')['text'].sum()
df_ready_to_be_sent_to_sklearn = pd.DataFrame({'business_id': df_review_agg.index, 'all_reviews': df_review_agg.values})
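
One detail of the aggregation above is worth illustrating: calling sum() on a text column joins the strings back to back, with no separator. The sketch below uses a made-up three-row dataframe (toy_reviews is hypothetical, not part of the project) to compare sum() with a space-separated join. It is an optional variation, not the method used in this notebook.

```python
import pandas as pd

# Toy stand-in for df_review; the rows are made up for illustration.
toy_reviews = pd.DataFrame({
    "business_id": ["a", "a", "b"],
    "text": ["Great food.", "Slow service.", "Friendly staff."],
})

# sum() concatenates the strings of each group back to back.
joined_by_sum = toy_reviews.groupby("business_id")["text"].sum()

# " ".join keeps a space between consecutive reviews.
joined_by_space = toy_reviews.groupby("business_id")["text"].agg(" ".join)

print(joined_by_sum["a"])    # Great food.Slow service.
print(joined_by_space["a"])  # Great food. Slow service.
```

When a review ends without punctuation, the version without a separator can fuse the last word of one review with the first word of the next into a single token, so the space-separated join may be the safer choice before vectorizing.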

The below merges df_ready_to_be_sent_to_sklearn (which contains the concatenated reviews for each business) with df_business (which contains other information about each business such as name, category, location, etc).

df_review_business.shape returns the shape of the resulting dataframe (df_review_business) as a (rows, columns) tuple.

In [61]:
df_review_business = pd.merge(df_ready_to_be_sent_to_sklearn, df_business, on='business_id')
df_review_business.shape


(11927, 15)
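
The row count above reflects the default behavior of pd.merge, which performs an inner join: only business_id values present in both dataframes survive. A minimal sketch with made-up data (the names reviews_demo and businesses_demo are hypothetical) shows the effect:

```python
import pandas as pd

# Made-up data: "c" has reviews but no business row, "d" has a business row but no reviews.
reviews_demo = pd.DataFrame({
    "business_id": ["a", "b", "c"],
    "all_reviews": ["x", "y", "z"],
})
businesses_demo = pd.DataFrame({
    "business_id": ["a", "b", "d"],
    "stars": [4.0, 3.5, 2.0],
})

# how="inner" is the default, so unmatched ids on either side are dropped.
merged_demo = pd.merge(reviews_demo, businesses_demo, on="business_id")
print(merged_demo.shape)  # (2, 3)
```

In this notebook, that means businesses filtered out for having fewer than 20 reviews, and businesses with no reviews among the rows that were read, do not appear in df_review_business.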

In [62]:
df_review_business.head()

Unnamed: 0,business_id,all_reviews,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories,hours
0,--ZVrH2X2QXBFdCilbirsw,This place is sadly perm closed. I was hoping ...,Chris's Sandwich Shop,1531 W Wynnewood Rd,Ardmore,PA,19003.0,39.997299,-75.292207,4.5,32,0,"{'GoodForKids': 'True', 'RestaurantsAttire': '...","American (Traditional), Restaurants, Pizza, Sa...","{'Monday': '11:0-21:0', 'Tuesday': '11:0-21:0'..."
1,--sXnWH9Xm6_NvIjyuA99w,Ich war das erste mal in Philadelphia und ich ...,Philadelphia,,Philadelphia,PA,,39.952584,-75.165222,4.0,29,1,{'GoodForKids': 'True'},"Public Services & Government, Local Flavor",
2,-02xFuruu85XmDn2xiynJw,Dr. Curtis Dechant has an excellent chair-side...,Family Vision Center,7475 E Tanque Verde Rd,Tucson,AZ,85715.0,32.251039,-110.833173,4.5,109,1,"{'ByAppointmentOnly': 'True', 'BusinessParking...","Shopping, Ophthalmologists, Optometrists, Doct...","{'Monday': '0:0-0:0', 'Tuesday': '8:30-17:30',..."
3,-06OYKiIzxsdymBMDAKZug,Had catalytic converters replaced on our Subur...,Washoe Metal Fabricating,905 Bergin Way,Sparks,NV,89431.0,39.525558,-119.739221,4.5,34,1,"{'BusinessAcceptsCreditCards': 'True', 'ByAppo...","RV Dealers, Home Services, Shopping, Tires, Au...","{'Monday': '7:30-17:30', 'Tuesday': '7:30-17:3..."
4,-06ngMH_Ejkm_6HQBYxB7g,I have an old main line that really should be ...,Stewart's De Rooting & Plumbing,415 E Montecito St,Santa Barbara,CA,93101.0,34.419838,-119.688029,4.0,25,1,"{'BusinessAcceptsCreditCards': 'True', 'ByAppo...","Plumbing, Home Services","{'Monday': '0:0-0:0', 'Tuesday': '0:0-0:0', 'W..."


## TF-IDF to extract features from reviews

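The following block of code converts each business's combined review text into TF-IDF features. The vectorizer drops English stop words, ignores terms that appear in more than 95% of businesses or in fewer than 2, and keeps the 1,000 most frequent remaining terms. The result is a matrix with one row per business and one column per term.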

In [63]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize the TfidfVectorizer
vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, max_features=1000, stop_words='english')

# Fit and transform the 'all_reviews' column
tfidf_matrix = vectorizer.fit_transform(df_review_business['all_reviews'])

# Check the shape of the resulting TF-IDF matrix
# This should be (n_samples, n_features), where n_samples is the number of businesses
# and n_features is the number of words (features) in the TF-IDF representation.
print("Shape of TF-IDF matrix:", tfidf_matrix.shape)

# Get the feature names (words) for printing
feature_names = vectorizer.get_feature_names_out()
print("Some feature names (words):", feature_names)

Shape of TF-IDF matrix: (11927, 1000)
Some feature names (words): ['00' '10' '100' '11' '12' '15' '20' '25' '30' '40' '45' '50' 'able'
 'absolutely' 'accommodating' 'actually' 'add' 'added' 'addition'
 'affordable' 'afternoon' 'ago' 'ahead' 'air' 'amazing' 'ambiance'
 'american' 'apparently' 'appetizer' 'appetizers' 'apple' 'appointment'
 'appreciate' 'area' 'aren' 'arrived' 'art' 'asian' 'ask' 'asked' 'asking'
 'ate' 'atmosphere' 'attention' 'attentive' 'attitude' 'authentic'
 'available' 'average' 'avocado' 'avoid' 'away' 'awesome' 'awful' 'bacon'
 'bad' 'bag' 'baked' 'bakery' 'bar' 'barbara' 'barely' 'bars' 'bartender'
 'bartenders' 'based' 'basic' 'basically' 'bathroom' 'bbq' 'beach' 'beans'
 'beat' 'beautiful' 'bed' 'beef' 'beer' 'beers' 'believe' 'best' 'better'
 'big' 'birthday' 'bit' 'bite' 'black' 'bland' 'blue' 'book' 'bottle'
 'bought' 'bourbon' 'bowl' 'box' 'boy' 'boyfriend' 'bread' 'breakfast'
 'bring' 'brisket' 'broth' 'brought' 'brown' 'brunch' 'buffet' 'building'
 'bun'

The next block of code takes the TF-IDF matrix as the input features X and each business's star rating as the target y, then holds out 20% of the businesses as a test set.

In [64]:
X = tfidf_matrix   #Review words
y = df_review_business['stars'].values  # Business star ratings (target)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
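
The training cell later in this notebook passes the test split as validation data for early stopping. One possible alternative, sketched here on made-up arrays (all names ending in _demo are hypothetical), is to carve out a separate validation set so that the test set is only used for the final evaluation:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 100 samples with 4 features each (made up for illustration).
X_demo = np.arange(400).reshape(100, 4)
y_demo = np.linspace(1.0, 5.0, 100)

# First hold out 20% as the final test set.
X_rest, X_test_demo, y_rest, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

# Then take 25% of the remainder for validation (25% of 80% = 20% overall).
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42)

print(X_train_demo.shape, X_val_demo.shape, X_test_demo.shape)  # (60, 4) (20, 4) (20, 4)
```

With this layout, the validation pair would be given to model.fit and the test pair would be reserved for computing the RMSE.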

## Model Training
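The following function builds the neural network: two hidden Dense layers followed by a single linear output unit for regression, compiled with mean squared error as the loss. By default, it uses relu as the activation function and adam as the optimizer, with 128 and 64 units in the hidden layers. Exposing these as parameters lets us easily tweak values that affect the overall outcome of the model.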

In [65]:
def build_model(activation='relu', optimizer='adam', L1=128, L2=64):
    model = Sequential()
    model.add(Dense(L1, activation=activation, input_shape=(X_train.shape[1],)))
    model.add(Dense(L2, activation=activation))
    model.add(Dense(1))  # Regression task, so no activation in the output layer
    model.compile(optimizer=optimizer, loss='mean_squared_error')
    return model
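
As a rough sense of the model's size, the number of trainable parameters with the default settings can be worked out by hand. The sketch below assumes 1,000 input features (the width of the TF-IDF matrix) and the default layer sizes; dense_params is a hypothetical helper, not a Keras function.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases for one fully connected layer."""
    return n_inputs * n_units + n_units

n_features = 1000            # columns of the TF-IDF matrix
layer_sizes = [128, 64, 1]   # L1, L2, and the single output unit

total_params = 0
n_in = n_features
for n_out in layer_sizes:
    total_params += dense_params(n_in, n_out)
    n_in = n_out  # the output of this layer feeds the next one

print(total_params)  # 136449
```

Almost all of the parameters (128,128 of them) sit in the first layer, so the vocabulary size chosen in the vectorizer largely determines the size of the network.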

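Running the following cell trains the model. It builds the network with the specified values and fits it for up to 50 epochs, stopping earlier if the validation loss fails to improve for 5 consecutive epochs (patience=5). Note that the test split is also passed as validation_data, so early stopping is tuned on the same data that is later used to compute the RMSE.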

In [66]:
# 7. Early Stopping to Prevent Overfitting
early_stopping = EarlyStopping(monitor='val_loss', patience=5)

# 8. Train the Model
model = build_model(activation='relu', optimizer='adam', L1=128, L2=64)
history = model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=50, batch_size=32, callbacks=[early_stopping])

# 9. Evaluate Model on Test Data
y_pred = model.predict(X_test).flatten()  # Flatten y_pred to make it compatible with y_test
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"Root Mean Squared Error (RMSE): {rmse}")

Epoch 1/50
299/299 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - loss: 3.0664 - val_loss: 0.2287
Epoch 2/50
299/299 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - loss: 0.1994 - val_loss: 0.1961
Epoch 3/50
299/299 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - loss: 0.1500 - val_loss: 0.2112
Epoch 4/50
299/299 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - loss: 0.1380 - val_loss: 0.1973
Epoch 5/50
299/299 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - loss: 0.1264 - val_loss: 0.1956
Epoch 6/50
299/299 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - loss: 0.1125 - val_loss: 0.1942
Epoch 7/50
299/299 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - loss: 0.0893 - val_loss: 0.1917
Epoch 8/50
299/299 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - loss: 0.0735 - val_loss: 0.1934
Epoch 9/50
299/299 ━━━━━━━━

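The patience rule used by the early stopping callback can be illustrated with a simplified sketch. The function early_stopping_epoch below is a hypothetical stand-in that mimics the basic idea (stop once the monitored loss has gone a given number of epochs without a new best); it is not the Keras implementation and ignores options such as min_delta.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss   # new best value: reset the counter
            wait = 0
        else:
            wait += 1     # no improvement this epoch
            if wait >= patience:
                return epoch
    return None

# val_loss values from the eight complete epochs in the training log
logged = [0.2287, 0.1961, 0.2112, 0.1973, 0.1956, 0.1942, 0.1917, 0.1934]
print(early_stopping_epoch(logged, patience=5))  # None

# Made-up continuation in which validation loss never improves again
extended = logged + [0.1950, 0.1960, 0.1970, 0.1980]
print(early_stopping_epoch(extended, patience=5))  # 12
```

In the logged epochs the best validation loss occurs at epoch 7, so under this rule training would continue until at least epoch 12.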
## Predictions
Finally, the next block of code allows us to test the newly trained model's performance. It takes a random sample of 1,000 businesses from the dataset and predicts their ratings. The model is a regressor and has no way of knowing that Yelp ratings are restricted to half-star steps, so each prediction is rounded to the nearest 0.5 star. For example, a predicted 2.89 is closer to 3.0 than to 2.5, so the result is rounded to 3.0.

In [67]:
sample_size = 1000

sample_businesses = df_review_business.sample(sample_size).reset_index(drop=True)  # Ensure indices are reset
sample_reviews = vectorizer.transform(sample_businesses['all_reviews']).toarray()
predicted_ratings = model.predict(sample_reviews)

print("\nSample Predictions:")

percent_correct = 100
error_total = 0

for i, row in sample_businesses.iterrows():
    true_rating = row['stars']
    predicted_rating = round(predicted_ratings[i][0] * 2, 0)/2

    error_total += abs(true_rating - predicted_rating)

    if true_rating != predicted_rating:
        percent_correct += (-1 * 100 / sample_size)

    print(f"Business: {row['business_id']}, True Rating: {true_rating}, Predicted Rating: {predicted_rating}")

[1m32/32[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 903us/step

Sample Predictions:
Business: T3tiN6g7G2XUoDjRv3d8Cw, True Rating: 4.0, Predicted Rating: 3.5
Business: aEEC4D-RylmiUyephE5t_g, True Rating: 4.0, Predicted Rating: 4.0
Business: PmFBiD-KW4U_L1MS9qcIUQ, True Rating: 2.0, Predicted Rating: 2.0
Business: zPT1Jg1MvqaBOMSqsiMF0A, True Rating: 4.5, Predicted Rating: 4.5
Business: obYxIb1jp4XDdIJvK-xdUw, True Rating: 3.5, Predicted Rating: 3.5
Business: j1WFr8rSrekUgTNy4NudxA, True Rating: 4.0, Predicted Rating: 4.5
Business: l-y7sdLCdNHnDBXu0bmnPQ, True Rating: 5.0, Predicted Rating: 5.0
Business: yMfbFFpGgDDbY_s3_FNi3Q, True Rating: 3.0, Predicted Rating: 3.0
Business: 0wQCEcpZ57TmTm6EmEDsIw, True Rating: 3.5, Predicted Rating: 3.5
Business: A6FDSdfwLPgl2T3ayKllqQ, True Rating: 2.5, Predicted Rating: 2.5
Business: F9c6M_4YwEGFcTVmDVIX2Q, True Rating: 2.5, Predicted Rating: 2.5
Business: Jx2AoB_IQOUrZ3s6fdAUSA, True Rating: 4.5, Predicted Rating: 4.5
Business: RMjHlA
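
The half-star rounding used in the loop above can also be written as a vectorized helper. The sketch below is one possible version; round_to_half is a hypothetical name, and the clipping to the 1.0 to 5.0 range is an added assumption that the notebook's own loop does not apply.

```python
import numpy as np

def round_to_half(values):
    """Round to the nearest 0.5 and clip to the 1.0-5.0 star range."""
    halves = np.round(np.asarray(values, dtype=float) * 2) / 2
    return np.clip(halves, 1.0, 5.0)

raw_demo = np.array([2.89, 3.62, 4.10, 5.40])  # made-up raw predictions
print(round_to_half(raw_demo))  # [3.  3.5 4.  5. ]
```

Clipping only matters when the regressor overshoots the valid range, as with the made-up 5.40 here.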

## Analysis
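To analyze the results, we measured two statistics: percent correct and average error. Percent correct is the share of the 1,000 sampled predictions whose rounded value exactly matches the true rating, while average error is the mean absolute difference, in stars, between the rounded prediction and the true rating. Because the sample is drawn from the full dataset, it includes businesses the model saw during training, so these figures likely overstate performance on unseen businesses.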

In [68]:
print('Percent correct: ' + str(percent_correct) + "%")

average_error = error_total/sample_size
print('Average error: ' + str(average_error) + " stars")

Percent correct: 74.70000000000144%
Average error: 0.136 stars
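
The trailing digits in the percentage above come from repeatedly subtracting 0.1 in floating point. As a sketch of an alternative, both statistics can be computed in one step with NumPy and formatted on output. The arrays below are made up for illustration and are not results from the model.

```python
import numpy as np

# Made-up true ratings and rounded predictions for four businesses.
true_demo = np.array([4.0, 4.0, 2.0, 4.5])
pred_demo = np.array([3.5, 4.0, 2.0, 5.0])

percent_demo = 100 * np.mean(true_demo == pred_demo)     # share of exact matches
avg_err_demo = np.mean(np.abs(true_demo - pred_demo))    # mean absolute error

print(f"Percent correct: {percent_demo:.1f}%")       # Percent correct: 50.0%
print(f"Average error: {avg_err_demo:.3f} stars")    # Average error: 0.250 stars
```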


## Saving the Model

The final block of code saves the current model to the "models" folder as an HDF5 file named "network1.hdf5".

In [72]:
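import os

# Create the output folder if it does not exist yet, then save the model
os.makedirs(save_path, exist_ok=True)
model.save(os.path.join(save_path, "network1.hdf5"))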

