## Yelp Rating Prediction FCNN

Author: Rahul Gupta

The goal of this project is to develop a fully connected neural network (FCNN) that predicts Yelp rating scores from the text of user-written reviews.

### Data Loading
First, load, inspect, and preprocess the data for the neural network.

In [1]:
import numpy as np
import pandas as pd

# Starter code: read the first 1,000,000 records of each JSON Lines file
review = pd.read_json('./data/yelp_academic_dataset_review.json', lines=True, nrows=1000000)
business = pd.read_json('./data/yelp_academic_dataset_business.json', lines=True, nrows=1000000)

In [2]:
# Drop rows with missing values, then confirm the shape
review.dropna(inplace=True)
print(review.shape)

(1000000, 9)


In [3]:
business.dropna(inplace=True)
print(business.shape)
business.head()

(117618, 14)


| | business_id | name | address | city | state | postal_code | latitude | longitude | stars | review_count | is_open | attributes | categories | hours |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | mpf3x-BjTdTEA3yCZrAYPw | The UPS Store | 87 Grasso Plaza Shopping Center | Affton | MO | 63123 | 38.551126 | -90.335695 | 3.0 | 15 | 1 | {'BusinessAcceptsCreditCards': 'True'} | Shipping Centers, Local Services, Notaries, Ma... | {'Monday': '0:0-0:0', 'Tuesday': '8:0-18:30', ... |
| 2 | tUFrWirKiKi_TAnsVWINQQ | Target | 5255 E Broadway Blvd | Tucson | AZ | 85711 | 32.223236 | -110.880452 | 3.5 | 22 | 0 | {'BikeParking': 'True', 'BusinessAcceptsCredit... | Department Stores, Shopping, Fashion, Home & G... | {'Monday': '8:0-22:0', 'Tuesday': '8:0-22:0', ... |
| 3 | MTSW4McQd7CbVtyjqoe9mw | St Honore Pastries | 935 Race St | Philadelphia | PA | 19107 | 39.955505 | -75.155564 | 4.0 | 80 | 1 | {'RestaurantsDelivery': 'False', 'OutdoorSeati... | Restaurants, Food, Bubble Tea, Coffee & Tea, B... | {'Monday': '7:0-20:0', 'Tuesday': '7:0-20:0', ... |
| 4 | mWMc6_wTdE0EUBKIGXDVfA | Perkiomen Valley Brewery | 101 Walnut St | Green Lane | PA | 18054 | 40.338183 | -75.471659 | 4.5 | 13 | 1 | {'BusinessAcceptsCreditCards': 'True', 'Wheelc... | Brewpubs, Breweries, Food | {'Wednesday': '14:0-22:0', 'Thursday': '16:0-2... |
| 5 | CF33F8-E6oudUQ46HnavjQ | Sonic Drive-In | 615 S Main St | Ashland City | TN | 37015 | 36.269593 | -87.058943 | 2.0 | 6 | 1 | {'BusinessParking': 'None', 'BusinessAcceptsCr... | Burgers, Fast Food, Sandwiches, Food, Ice Crea... | {'Monday': '0:0-0:0', 'Tuesday': '6:0-22:0', '... |


In [4]:
# Drop businesses with fewer than 20 reviews
business_to_drop = business[business['review_count'] < 20]

business_filtered = business[business['review_count'] >= 20]

# Keep only reviews whose business is in business_filtered
review_filtered = review[review['business_id'].isin(business_filtered['business_id'])]

### First Approach - Group All Reviews
Join the tables so that each row holds one business together with all of its reviews.

We'll apply TF-IDF to the combined reviews of each business to predict that business's rating.

In [5]:
from scipy.stats import zscore

df_review_agg = review_filtered.groupby('business_id')['text'].sum()

df_grouped = pd.DataFrame({
    'business_id': df_review_agg.index, 
    'all_reviews': df_review_agg.values,
    })

# Attach each business's star rating by joining on business_id
# Standardize the star ratings to z-scores, so errors reported later are in standard-deviation units, not stars
df_grouped = df_grouped.merge(business_filtered[['business_id', 'stars']], on='business_id', how='inner')
df_grouped['stars'] = zscore(df_grouped['stars'])

print(df_grouped.shape)
df_grouped.head()

(10769, 3)


| | business_id | all_reviews | stars |
|---|---|---|---|
| 0 | --ZVrH2X2QXBFdCilbirsw | This place is sadly perm closed. I was hoping ... | 1.055696 |
| 1 | -02xFuruu85XmDn2xiynJw | Dr. Curtis Dechant has an excellent chair-side... | 1.055696 |
| 2 | -06OYKiIzxsdymBMDAKZug | Had catalytic converters replaced on our Subur... | 1.055696 |
| 3 | -06ngMH_Ejkm_6HQBYxB7g | I have an old main line that really should be ... | 0.44279 |
| 4 | -0E7laYjwZxEAQPhFJXxow | I recently visited this dealership because the... | -0.170116 |
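
One detail worth checking in the aggregation above: summing strings with `groupby(...)['text'].sum()` concatenates them with no separator, so the last word of one review can fuse with the first word of the next. The sketch below reproduces the effect in plain Python, using the regular expression that scikit-learn's text vectorizers use as their default `token_pattern`. The two sample reviews are made up for illustration.

```python
import re

# Default token_pattern of scikit-learn's text vectorizers: words of 2+ characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lowercase and tokenize roughly the way TfidfVectorizer does by default."""
    return TOKEN_PATTERN.findall(text.lower())

# Hypothetical reviews for a single business.
reviews = ["Great tacos", "Service was slow"]

fused = "".join(reviews)     # what summing the strings produces
spaced = " ".join(reviews)   # same text with a separator between reviews

fused_tokens = tokenize(fused)
spaced_tokens = tokenize(spaced)

print(fused_tokens)
print(spaced_tokens)
```

If the fused tokens turn out to matter for the features, `review_filtered.groupby('business_id')['text'].agg(' '.join)` is one way to keep a space between reviews. Reviews that end in punctuation are less affected, since the tokenizer already splits on it.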


#### Applying TF-IDF
Apply TF-IDF to the review text to extract features, ignoring English stop words and keeping at most 250 terms.

In [6]:
import sklearn.feature_extraction.text as sk_text

vectorizer = sk_text.TfidfVectorizer(stop_words='english', max_features=250, min_df=1)

x = vectorizer.fit_transform(df_grouped['all_reviews'])
y = df_grouped[['stars']].to_numpy()

In [7]:
vectorizer.get_feature_names_out()[:40]

array(['10', '20', '30', 'able', 'absolutely', 'actually', 'amazing',
       'area', 'ask', 'asked', 'atmosphere', 'attentive', 'away',
       'awesome', 'bad', 'bar', 'beef', 'beer', 'best', 'better', 'big',
       'bit', 'bread', 'breakfast', 'burger', 'business', 'busy',
       'called', 'came', 'car', 'care', 'check', 'cheese', 'chicken',
       'city', 'clean', 'coffee', 'cold', 'come', 'coming'], dtype=object)
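
To make the weighting concrete, here is a minimal TF-IDF sketch in plain Python. It assumes the scikit-learn defaults of raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 row normalization. It tokenizes by whitespace and skips stop-word removal, so it is a simplification rather than a reimplementation of `TfidfVectorizer`. The three documents are made up.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, rows) with smoothed IDF and L2-normalized rows."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(tok for doc in tokenized for tok in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows

# Hypothetical mini corpus.
docs = ["amazing food amazing service", "bad service", "food was cold"]
vocab, rows = tfidf(docs)
for term, weight in zip(vocab, rows[0]):
    print(f"{term:8s} {weight:.3f}")
```

In the first document, "amazing" appears twice and in no other document, so it receives a larger weight than "food" or "service", which each appear once and are shared with another document.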

#### Train Test Split
Split the data into training and testing sets (80/20).

In [8]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
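
As a quick sanity check, the split sizes can be worked out by hand and compared with the `270/270` and `68/68` step counts in the training and prediction logs further down. The sketch assumes the test size is rounded up and that Keras uses its default batch size of 32, since the notebook does not set one.

```python
import math

n_samples = 10769      # rows in df_grouped, from the shape printed earlier
test_size = 0.2
batch_size = 32        # Keras default for fit() and predict()

# Assumption: the test set size is rounded up, the rest goes to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

train_steps = math.ceil(n_train / batch_size)
test_steps = math.ceil(n_test / batch_size)

print(n_train, n_test)          # 8615 2154
print(train_steps, test_steps)  # 270 68
```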

#### RMSE Function

In [9]:
from sklearn import metrics

# Predict on the test set and print the RMSE
def print_rsme(model, x_test, y_test):
    pred = model.predict(x_test)
    # scikit-learn's signature is mean_squared_error(y_true, y_pred)
    score = np.sqrt(metrics.mean_squared_error(y_test, pred))

    print("RMSE Score: {}".format(score))
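
For reference, RMSE is the square root of the mean of the squared errors. The sketch below computes it without any libraries on a few made-up values, which can help when checking the helper above by hand. The function name `rmse` and the sample numbers are illustrative only.

```python
import math

def rmse(predictions, targets):
    """Root mean squared error between two equal-length sequences."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up z-scored ratings and predictions, for illustration only.
y_true = [1.0, 0.5, -0.5, -1.0]
y_pred = [0.5, 0.5, 0.0, -1.5]

score = rmse(y_pred, y_true)
print(f"RMSE Score: {score:.4f}")
```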


#### Lift Chart Function

In [10]:
# Function from lab #4
import matplotlib.pyplot as plt

# Regression chart.
def chart_regression(pred,y,sort=True):
    t = pd.DataFrame({'pred' : pred, 'y' : y.flatten()})
    if sort:
        t.sort_values(by=['y'],inplace=True)
    a = plt.plot(t['y'].tolist(),label='expected')
    b = plt.plot(t['pred'].tolist(),label='prediction')
    plt.ylabel('output')
    plt.legend()
    plt.show()

# Plot the chart
# chart_regression(pred.flatten(),y_test, sort=True)

#### Basic FCNN
A plain fully connected network with no dropout layers: six hidden ReLU layers narrowing from 500 to 15 units, followed by a single linear output.

In [11]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.callbacks import EarlyStopping

basic_fcnn_model = Sequential()

basic_fcnn_model.add(Dense(500, input_dim=x.shape[1], activation='relu'))
basic_fcnn_model.add(Dense(250, activation='relu'))
basic_fcnn_model.add(Dense(125, activation='relu'))
basic_fcnn_model.add(Dense(60, activation='relu'))
basic_fcnn_model.add(Dense(30, activation='relu'))
basic_fcnn_model.add(Dense(15, activation='relu'))
basic_fcnn_model.add(Dense(1))

basic_fcnn_model.compile(loss='mean_squared_error', optimizer='adam')

# Stop early once val_loss has failed to improve by at least 1e-3 for 5 epochs in a row
monitor = EarlyStopping(monitor='val_loss', min_delta=1e-3, patience=5, verbose=1, mode='auto')
basic_fcnn_model.fit(x_train, y_train, validation_data=(x_test,y_test), callbacks=[monitor], verbose=2, epochs=1000)




Epoch 1/1000
270/270 - 3s - 12ms/step - loss: 0.3641 - val_loss: 0.2427
Epoch 2/1000
270/270 - 1s - 4ms/step - loss: 0.2560 - val_loss: 0.2393
Epoch 3/1000
270/270 - 1s - 4ms/step - loss: 0.2181 - val_loss: 0.2585
Epoch 4/1000
270/270 - 1s - 4ms/step - loss: 0.1779 - val_loss: 0.2416
Epoch 5/1000
270/270 - 1s - 4ms/step - loss: 0.1431 - val_loss: 0.2541
Epoch 6/1000
270/270 - 1s - 3ms/step - loss: 0.1217 - val_loss: 0.2407
Epoch 7/1000
270/270 - 1s - 4ms/step - loss: 0.1027 - val_loss: 0.2543
Epoch 7: early stopping


<keras.src.callbacks.history.History at 0x1e364f4c2f0>
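
The stopping point in the log above can be reproduced with a simplified version of the patience rule. This is a sketch of the logic under the settings used here (`min_delta=1e-3`, `patience=5`, lower is better), not Keras's actual implementation. The helper name `early_stop_epoch` is hypothetical.

```python
def early_stop_epoch(val_losses, min_delta=1e-3, patience=5):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    stop_epoch is None if training would run through every logged epoch.
    """
    best = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:   # counts as an improvement
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch

# val_loss values copied from the training log above.
basic_val_loss = [0.2427, 0.2393, 0.2585, 0.2416, 0.2541, 0.2407, 0.2543]

stop_epoch, best_epoch = early_stop_epoch(basic_val_loss)
print(stop_epoch, best_epoch)  # 7 2
```

The best epoch is 2, but training stops at epoch 7. Since `restore_best_weights` defaults to `False` in `EarlyStopping`, the model keeps the epoch-7 weights; passing `restore_best_weights=True` would roll back to the best epoch instead.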

In [12]:
print_rsme(basic_fcnn_model, x_test, y_test)

68/68 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
RMSE Score: 0.5042945869058632
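
Because the star ratings were z-scored, this RMSE is in standard-deviation units rather than stars. A rough conversion is sketched below. It assumes that the three distinct z-values shown in the `df_grouped` preview are one half-star step apart, which is an inference from the preview and not something the notebook states.

```python
# z-scored star values shown in the df_grouped preview.
z_values = [1.055696, 0.44279, -0.170116]

# Assumption: neighbouring values are one half-star step apart.
half_star = 0.5
z_step = z_values[0] - z_values[1]
star_std = half_star / z_step

rmse_z = 0.5042945869058632          # RMSE reported above, in z-score units
rmse_stars = rmse_z * star_std

print(f"std of stars ~ {star_std:.3f}")
print(f"RMSE ~ {rmse_stars:.3f} stars")
```

Under that assumption the error is a little over 0.4 stars. Computing `business_filtered['stars'].std(ddof=0)` on the merged rows before applying `zscore` would give the exact factor.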


#### Slower Learning Rate
The same architecture as the basic FCNN, trained with Adam at a learning rate of 0.0001 instead of the Keras default of 0.001.

In [13]:
from tensorflow.keras import optimizers

slow_fcnn_model = Sequential()

slow_fcnn_model.add(Dense(500, input_dim=x.shape[1], activation='relu'))
slow_fcnn_model.add(Dense(250, activation='relu'))
slow_fcnn_model.add(Dense(125, activation='relu'))
slow_fcnn_model.add(Dense(60, activation='relu'))
slow_fcnn_model.add(Dense(30, activation='relu'))
slow_fcnn_model.add(Dense(15, activation='relu'))
slow_fcnn_model.add(Dense(1))

adam_modified = optimizers.Adam(learning_rate=0.0001)

slow_fcnn_model.compile(loss='mean_squared_error', optimizer=adam_modified)

# Stop early once val_loss has failed to improve by at least 1e-3 for 5 epochs in a row
monitor = EarlyStopping(monitor='val_loss', min_delta=1e-3, patience=5, verbose=1, mode='auto')
slow_fcnn_model.fit(x_train, y_train, validation_data=(x_test,y_test), callbacks=[monitor], verbose=2, epochs=1000)

Epoch 1/1000
270/270 - 3s - 10ms/step - loss: 0.6069 - val_loss: 0.2862
Epoch 2/1000
270/270 - 1s - 4ms/step - loss: 0.2969 - val_loss: 0.2603
Epoch 3/1000
270/270 - 1s - 4ms/step - loss: 0.2657 - val_loss: 0.2488
Epoch 4/1000
270/270 - 1s - 4ms/step - loss: 0.2441 - val_loss: 0.2370
Epoch 5/1000
270/270 - 1s - 4ms/step - loss: 0.2291 - val_loss: 0.2333
Epoch 6/1000
270/270 - 1s - 4ms/step - loss: 0.2127 - val_loss: 0.2371
Epoch 7/1000
270/270 - 1s - 4ms/step - loss: 0.2005 - val_loss: 0.2377
Epoch 8/1000
270/270 - 1s - 4ms/step - loss: 0.1844 - val_loss: 0.2349
Epoch 9/1000
270/270 - 1s - 4ms/step - loss: 0.1660 - val_loss: 0.2422
Epoch 10/1000
270/270 - 1s - 4ms/step - loss: 0.1504 - val_loss: 0.2418
Epoch 10: early stopping


<keras.src.callbacks.history.History at 0x1e364f55810>

In [14]:
print_rsme(slow_fcnn_model, x_test, y_test)

68/68 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
RMSE Score: 0.49176793057386037


#### Dropout FCNN
The same architecture as the basic FCNN, with a `Dropout(0.1)` layer after every hidden layer except the last. Ideally, this helps the model generalize better.
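
As a reminder of what a dropout layer does, the sketch below implements inverted dropout in plain Python: during training each unit is zeroed with probability `rate` and the survivors are scaled by `1 / (1 - rate)`, while at inference the layer passes values through unchanged. The function and variable names are illustrative and not part of the notebook.

```python
import random

def dropout(activations, rate, training, rng):
    """Inverted dropout: zero units with probability `rate`, rescale survivors."""
    if not training or rate == 0.0:
        return list(activations)
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else a * keep_scale for a in activations]

rng = random.Random(0)
activations = [1.0] * 10_000

train_out = dropout(activations, rate=0.1, training=True, rng=rng)
eval_out = dropout(activations, rate=0.1, training=False, rng=rng)

dropped = sum(1 for v in train_out if v == 0.0)
print(f"dropped fraction: {dropped / len(train_out):.3f}")
print(f"train mean: {sum(train_out) / len(train_out):.3f}")
print(f"eval mean:  {sum(eval_out) / len(eval_out):.3f}")
```

The rescaling keeps the expected activation the same in both modes, which is why no adjustment is needed when calling `model.predict`.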

In [15]:
# Add dropout layers
from tensorflow.keras.layers import Dropout

dropout_fcnn_model = Sequential()

dropout_fcnn_model.add(Dense(500, input_dim=x.shape[1], activation='relu'))
dropout_fcnn_model.add(Dropout(0.1))
dropout_fcnn_model.add(Dense(250, activation='relu'))
dropout_fcnn_model.add(Dropout(0.1))
dropout_fcnn_model.add(Dense(125, activation='relu'))
dropout_fcnn_model.add(Dropout(0.1))
dropout_fcnn_model.add(Dense(60, activation='relu'))
dropout_fcnn_model.add(Dropout(0.1))
dropout_fcnn_model.add(Dense(30, activation='relu'))
dropout_fcnn_model.add(Dropout(0.1))
dropout_fcnn_model.add(Dense(15, activation='relu'))
dropout_fcnn_model.add(Dense(1))

dropout_fcnn_model.compile(loss='mean_squared_error', optimizer='adam')

# Stop early once val_loss has failed to improve by at least 1e-3 for 5 epochs in a row
monitor = EarlyStopping(monitor='val_loss', min_delta=1e-3, patience=5, verbose=1, mode='auto')
dropout_fcnn_model.fit(x_train, y_train, validation_data=(x_test,y_test), callbacks=[monitor], verbose=2, epochs=1000)


Epoch 1/1000
270/270 - 3s - 12ms/step - loss: 0.4053 - val_loss: 0.2751
Epoch 2/1000
270/270 - 1s - 4ms/step - loss: 0.2856 - val_loss: 0.2375
Epoch 3/1000
270/270 - 1s - 5ms/step - loss: 0.2444 - val_loss: 0.2450
Epoch 4/1000
270/270 - 1s - 4ms/step - loss: 0.2191 - val_loss: 0.2468
Epoch 5/1000
270/270 - 1s - 5ms/step - loss: 0.1886 - val_loss: 0.2294
Epoch 6/1000
270/270 - 1s - 4ms/step - loss: 0.1638 - val_loss: 0.2411
Epoch 7/1000
270/270 - 1s - 4ms/step - loss: 0.1403 - val_loss: 0.2456
Epoch 8/1000
270/270 - 1s - 4ms/step - loss: 0.1266 - val_loss: 0.2395
Epoch 9/1000
270/270 - 1s - 5ms/step - loss: 0.1115 - val_loss: 0.2421
Epoch 10/1000
270/270 - 1s - 4ms/step - loss: 0.1019 - val_loss: 0.2690
Epoch 10: early stopping


<keras.src.callbacks.history.History at 0x1e364f54410>

In [16]:
print_rsme(dropout_fcnn_model, x_test, y_test)

68/68 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
RMSE Score: 0.5186877993274409
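
Since the validation data passed to `fit` is the same test split that `print_rsme` scores, each reported RMSE should be close to the square root of the final epoch's `val_loss`. The sketch below checks that against the logged values and also shows what each model reached at its best epoch. The dictionary keys are labels chosen here for illustration.

```python
import math

# val_loss per epoch, copied from the three training logs above.
val_loss_logs = {
    "basic":   [0.2427, 0.2393, 0.2585, 0.2416, 0.2541, 0.2407, 0.2543],
    "slow_lr": [0.2862, 0.2603, 0.2488, 0.2370, 0.2333, 0.2371, 0.2377,
                0.2349, 0.2422, 0.2418],
    "dropout": [0.2751, 0.2375, 0.2450, 0.2468, 0.2294, 0.2411, 0.2456,
                0.2395, 0.2421, 0.2690],
}

summary = {}
for name, losses in val_loss_logs.items():
    summary[name] = {
        "final_rmse": math.sqrt(losses[-1]),   # weights the model ends with
        "best_rmse": math.sqrt(min(losses)),   # best epoch seen in training
    }
    print(f"{name:8s} final={summary[name]['final_rmse']:.4f} "
          f"best={summary[name]['best_rmse']:.4f}")
```

The final-epoch values match the printed RMSE scores to about three decimal places. Ranked by best epoch, the ordering differs from the ranking by final epoch, so part of the gap between models comes from where early stopping happened to end. These are single runs scored on the split that also drives early stopping, so small differences should be read with caution.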


#### Adjusting Hyperparameters
We'll keep using the Adam optimizer with the lower learning rate from before, since it yielded a slightly lower RMSE.


##### Changing Max Features
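Raise the maximum number of TF-IDF features from 250 to 500. The network below also gains a 1000-unit first layer, so the two changes are not isolated from each other.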
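
For context, `max_features` limits the vocabulary to the terms with the highest total frequency across the corpus. The sketch below mimics that selection in plain Python on a made-up corpus. The helper `top_terms` is hypothetical, and its alphabetical tie-breaking is a choice made here for determinism, not a claim about scikit-learn's behavior.

```python
from collections import Counter

def top_terms(docs, max_features):
    """Keep the `max_features` most frequent terms, returned in sorted order."""
    counts = Counter(tok for doc in docs for tok in doc.lower().split())
    # Sort by descending frequency, break ties alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return sorted(term for term, _ in ranked[:max_features])

# Made-up mini corpus, for illustration only.
docs = [
    "great food great service",
    "food was cold",
    "great coffee",
    "service was slow",
]

small_vocab = top_terms(docs, max_features=2)
large_vocab = top_terms(docs, max_features=4)
print(small_vocab)
print(large_vocab)
```

Raising the limit only adds rarer terms on top of the existing ones, so the extra features carry progressively less coverage of the corpus.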

In [17]:
vectorizer = sk_text.TfidfVectorizer(stop_words='english', max_features=500, min_df=1)

x = vectorizer.fit_transform(df_grouped['all_reviews'])
y = df_grouped[['stars']].to_numpy()

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

model = Sequential()

model.add(Dense(1000, input_dim=x.shape[1], activation='relu'))
model.add(Dense(500, activation='relu'))
model.add(Dense(250, activation='relu'))
model.add(Dense(125, activation='relu'))
model.add(Dense(60, activation='relu'))
model.add(Dense(30, activation='relu'))
model.add(Dense(15, activation='relu'))
model.add(Dense(1))

adam_modified = optimizers.Adam(learning_rate=0.0001)

model.compile(loss='mean_squared_error', optimizer=adam_modified)

# Stop early once val_loss has failed to improve by at least 1e-3 for 5 epochs in a row
monitor = EarlyStopping(monitor='val_loss', min_delta=1e-3, patience=5, verbose=1, mode='auto')
model.fit(x_train, y_train, validation_data=(x_test,y_test), callbacks=[monitor], verbose=2, epochs=1000)

print_rsme(model, x_test, y_test)


Epoch 1/1000
270/270 - 4s - 13ms/step - loss: 0.4828 - val_loss: 0.2358
Epoch 2/1000
270/270 - 2s - 8ms/step - loss: 0.2471 - val_loss: 0.2160
Epoch 3/1000
270/270 - 2s - 8ms/step - loss: 0.2083 - val_loss: 0.2134
Epoch 4/1000
270/270 - 2s - 8ms/step - loss: 0.1777 - val_loss: 0.2034
Epoch 5/1000
270/270 - 2s - 8ms/step - loss: 0.1468 - val_loss: 0.2073
Epoch 6/1000
270/270 - 2s - 8ms/step - loss: 0.1158 - val_loss: 0.2061
Epoch 7/1000
270/270 - 2s - 8ms/step - loss: 0.0920 - val_loss: 0.2157
Epoch 8/1000
270/270 - 2s - 8ms/step - loss: 0.0724 - val_loss: 0.2567
Epoch 9/1000
270/270 - 2s - 8ms/step - loss: 0.0565 - val_loss: 0.2183
Epoch 9: early stopping
68/68 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
RMSE Score: 0.46717785558954356


##### Changing Number of Neurons and Number of Layers
Adjust the number of layers and neurons to see whether the RMSE changes noticeably.
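
Before varying the architecture, it may help to know how large the existing networks are. The sketch below counts the trainable parameters (weights plus biases) of a stack of `Dense` layers for the two configurations used earlier in this notebook. The helper name `dense_param_count` is hypothetical.

```python
def dense_param_count(input_dim, layer_widths):
    """Total weights + biases for a stack of fully connected layers."""
    total = 0
    fan_in = input_dim
    for width in layer_widths:
        total += fan_in * width + width   # weight matrix plus bias vector
        fan_in = width
    return total

# Architectures used earlier in this notebook.
basic = dense_param_count(250, [500, 250, 125, 60, 30, 15, 1])
wider = dense_param_count(500, [1000, 500, 250, 125, 60, 30, 15, 1])

print(f"250 features, 6 hidden layers: {basic:,} parameters")
print(f"500 features, 7 hidden layers: {wider:,} parameters")
```

With roughly 8,600 training rows (80% of 10,769), both networks have far more parameters than samples. That is consistent with the logs above, where training loss keeps falling while `val_loss` levels off after a few epochs, and it suggests that smaller networks are worth including in the comparison.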

### Second Approach - Predict Review Scores

Instead of processing all of a business's reviews at once, we can build a model that predicts the star score of each individual review.

Afterwards, we'll average the predicted review scores for each business to predict that business's rating.
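
The averaging step could look something like the sketch below, written in plain Python so it runs without the dataset. The business IDs and predicted scores are made up, and the final rounding to half stars is an optional assumption based on the half-star values seen in the business table.

```python
from collections import defaultdict

def business_ratings(review_predictions):
    """Average per-review predictions into one rating per business.

    review_predictions: iterable of (business_id, predicted_stars) pairs.
    """
    totals = defaultdict(lambda: [0.0, 0])
    for business_id, stars in review_predictions:
        totals[business_id][0] += stars
        totals[business_id][1] += 1
    return {bid: total / count for bid, (total, count) in totals.items()}

def round_to_half_star(value):
    """Round to the nearest 0.5 and clip to the 1-5 star range."""
    return min(5.0, max(1.0, round(value * 2) / 2))

# Hypothetical business IDs and per-review predictions.
predictions = [
    ("biz_a", 4.2), ("biz_a", 3.6), ("biz_a", 4.8),
    ("biz_b", 2.1), ("biz_b", 1.6),
]

averaged = business_ratings(predictions)
rounded = {bid: round_to_half_star(v) for bid, v in averaged.items()}
print(averaged)
print(rounded)
```

With pandas, the same aggregation would typically be a `groupby('business_id')` followed by `.mean()` on the prediction column.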