## Yelp Rating Prediction FCNN

Author: Rahul Gupta

The goal of this project is to develop a fully connected neural network (FCNN) that predicts Yelp star ratings from the text of user-written reviews.

### Data Loading
Before training the network, we need to load, inspect, and preprocess the data.

In [40]:
import numpy as np
import pandas as pd

# Starter code
# lines=True reads newline-delimited JSON; nrows caps the number of records loaded
review = pd.read_json('./data/yelp_academic_dataset_review.json', lines=True, nrows=1_000_000)
business = pd.read_json('./data/yelp_academic_dataset_business.json', lines=True, nrows=1_000_000)

In [41]:
# Drop reviews with any missing field, then report the remaining size
review.dropna(inplace=True)
print(review.shape)

(1000000, 9)


In [42]:
business.dropna(inplace=True)
print(business.shape)
business.head()

(117618, 14)


Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories,hours
1,mpf3x-BjTdTEA3yCZrAYPw,The UPS Store,87 Grasso Plaza Shopping Center,Affton,MO,63123,38.551126,-90.335695,3.0,15,1,{'BusinessAcceptsCreditCards': 'True'},"Shipping Centers, Local Services, Notaries, Ma...","{'Monday': '0:0-0:0', 'Tuesday': '8:0-18:30', ..."
2,tUFrWirKiKi_TAnsVWINQQ,Target,5255 E Broadway Blvd,Tucson,AZ,85711,32.223236,-110.880452,3.5,22,0,"{'BikeParking': 'True', 'BusinessAcceptsCredit...","Department Stores, Shopping, Fashion, Home & G...","{'Monday': '8:0-22:0', 'Tuesday': '8:0-22:0', ..."
3,MTSW4McQd7CbVtyjqoe9mw,St Honore Pastries,935 Race St,Philadelphia,PA,19107,39.955505,-75.155564,4.0,80,1,"{'RestaurantsDelivery': 'False', 'OutdoorSeati...","Restaurants, Food, Bubble Tea, Coffee & Tea, B...","{'Monday': '7:0-20:0', 'Tuesday': '7:0-20:0', ..."
4,mWMc6_wTdE0EUBKIGXDVfA,Perkiomen Valley Brewery,101 Walnut St,Green Lane,PA,18054,40.338183,-75.471659,4.5,13,1,"{'BusinessAcceptsCreditCards': 'True', 'Wheelc...","Brewpubs, Breweries, Food","{'Wednesday': '14:0-22:0', 'Thursday': '16:0-2..."
5,CF33F8-E6oudUQ46HnavjQ,Sonic Drive-In,615 S Main St,Ashland City,TN,37015,36.269593,-87.058943,2.0,6,1,"{'BusinessParking': 'None', 'BusinessAcceptsCr...","Burgers, Fast Food, Sandwiches, Food, Ice Crea...","{'Monday': '0:0-0:0', 'Tuesday': '6:0-22:0', '..."
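
`dropna()` with no arguments discards a row when any column is missing, so a business lacking a value in a column such as `hours` is removed even though the later cells only use `business_id`, `stars`, and `review_count`. The following is a minimal sketch on made-up data (the `toy` table and its values are assumptions, not taken from the dataset) of checking per-column nulls and restricting the drop to the columns that are needed:

```python
import pandas as pd

# Toy stand-in for the business table; all values are made up for illustration.
toy = pd.DataFrame({
    "business_id": ["a", "b", "c", "d"],
    "stars": [4.0, 3.5, None, 2.0],
    "review_count": [25, 40, 12, 30],
    "hours": [None, "{'Monday': '8:0-22:0'}", None, None],
})

# Count missing values per column before deciding what to drop.
null_counts = toy.isna().sum()
print(null_counts)

# dropna() with no arguments drops a row if ANY column is missing.
dropped_any = toy.dropna()

# Restricting the check to the columns actually used keeps more rows.
dropped_subset = toy.dropna(subset=["business_id", "stars", "review_count"])

print(len(dropped_any), len(dropped_subset))
```

Whether the stricter or the looser drop is preferable depends on which columns the model ends up using; the sketch only shows how to compare the two.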


In [49]:
# Identify businesses with fewer than 20 reviews
business_to_drop = business[business['review_count'] < 20]

# Remove reviews that belong to the dropped businesses
review_filtered = review[~review['business_id'].isin(business_to_drop['business_id'])]
business_filtered = business[business['review_count'] >= 20]
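
One subtlety of filtering by exclusion: a review whose business is missing from the `business` table (for instance because that row was removed by `dropna`) is not in the drop list and therefore survives the filter. The later inner join discards such reviews anyway, but the difference can be seen in a small sketch. The toy tables below are made up for illustration, and the inclusion filter is an alternative, not the notebook's method:

```python
import pandas as pd

# Toy tables; ids and counts are made up for illustration.
business = pd.DataFrame({
    "business_id": ["a", "b"],
    "review_count": [25, 5],
})
review = pd.DataFrame({
    "business_id": ["a", "b", "z"],  # "z" has no row in `business`
    "text": ["great", "bad", "orphan"],
})

business_filtered = business[business["review_count"] >= 20]
business_to_drop = business[business["review_count"] < 20]

# Exclusion filter: removes "b" but keeps the orphan review for "z".
by_exclusion = review[~review["business_id"].isin(business_to_drop["business_id"])]

# Inclusion filter: keeps only reviews whose business passed the threshold.
by_inclusion = review[review["business_id"].isin(business_filtered["business_id"])]

print(by_exclusion["business_id"].tolist())
print(by_inclusion["business_id"].tolist())
```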

### First Approach - Group all reviews
Join the tables so that each row holds one business together with all of its reviews.

We then apply TF-IDF to the combined review text of each business and use the result to predict that business's star rating.
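
The grouping and join can be sketched on toy data. The two small tables below are made up for illustration. The sketch also contrasts `sum()`, which concatenates strings with no separator, with a space-separated join, which may be preferable when a review does not end in punctuation:

```python
import pandas as pd

# Toy tables; texts and ratings are made up for illustration.
reviews = pd.DataFrame({
    "business_id": ["a", "a", "b"],
    "text": ["Great food", "Slow service", "Friendly staff"],
})
businesses = pd.DataFrame({"business_id": ["a", "b"], "stars": [3.5, 4.5]})

# Summing strings concatenates them with no separator ...
summed = reviews.groupby("business_id")["text"].sum()
# ... whereas joining with a space keeps the boundary words apart.
joined = reviews.groupby("business_id")["text"].agg(" ".join)

# One row per business: id, combined review text, star rating.
grouped = joined.rename("all_reviews").reset_index()
grouped = grouped.merge(businesses, on="business_id", how="inner")

print(summed["a"])
print(joined["a"])
print(grouped)
```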

In [83]:
from scipy.stats import zscore

# Concatenate all review texts per business.
# Note: summing strings joins them with no separator between reviews.
df_review_agg = review_filtered.groupby('business_id')['text'].sum()

df_grouped = pd.DataFrame({
    'business_id': df_review_agg.index,
    'all_reviews': df_review_agg.values,
})

# Attach each business's star rating; the inner join keeps only
# businesses that appear in both tables
df_grouped = df_grouped.merge(business_filtered[['business_id', 'stars']], on='business_id', how='inner')

print(df_grouped.shape)
df_grouped.head()

(10769, 3)


Unnamed: 0,business_id,all_reviews,stars
0,--ZVrH2X2QXBFdCilbirsw,This place is sadly perm closed. I was hoping ...,4.5
1,-02xFuruu85XmDn2xiynJw,Dr. Curtis Dechant has an excellent chair-side...,4.5
2,-06OYKiIzxsdymBMDAKZug,Had catalytic converters replaced on our Subur...,4.5
3,-06ngMH_Ejkm_6HQBYxB7g,I have an old main line that really should be ...,4.0
4,-0E7laYjwZxEAQPhFJXxow,I recently visited this dealership because the...,3.5


In [84]:
unique_categories = business['stars'].unique()
unique_categories.sort()
print(f'Unique star ratings: {unique_categories}')

Unique star ratings: [1.  1.5 2.  2.5 3.  3.5 4.  4.5 5. ]


In [85]:
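# Treat the star rating as a categorical label rather than a number
df_grouped['stars'] = df_grouped['stars'].astype(str)

def encode_text_dummy(df, name):
    """One-hot encode column `name` in place as `name-value` columns, then drop it."""
    dummies = pd.get_dummies(df[name])
    for x in dummies.columns:
        dummy_name = "{}-{}".format(name, x)
        df[dummy_name] = dummies[x]
    df.drop(name, axis=1, inplace=True)

encode_text_dummy(df_grouped, 'stars')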

print(df_grouped.shape)
df_grouped.head()

(10769, 11)


Unnamed: 0,business_id,all_reviews,stars-1.0,stars-1.5,stars-2.0,stars-2.5,stars-3.0,stars-3.5,stars-4.0,stars-4.5,stars-5.0
0,--ZVrH2X2QXBFdCilbirsw,This place is sadly perm closed. I was hoping ...,False,False,False,False,False,False,False,True,False
1,-02xFuruu85XmDn2xiynJw,Dr. Curtis Dechant has an excellent chair-side...,False,False,False,False,False,False,False,True,False
2,-06OYKiIzxsdymBMDAKZug,Had catalytic converters replaced on our Subur...,False,False,False,False,False,False,False,True,False
3,-06ngMH_Ejkm_6HQBYxB7g,I have an old main line that really should be ...,False,False,False,False,False,False,True,False,False
4,-0E7laYjwZxEAQPhFJXxow,I recently visited this dealership because the...,False,False,False,False,False,True,False,False,False
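
Because the one-hot columns replace the original `stars` column, it can be useful to map them back to numeric ratings later. The following is a minimal sketch of the round trip on a made-up series. It uses `pd.get_dummies` with `prefix` and `prefix_sep` to reproduce the `stars-4.5` style of column names, which is an equivalent shortcut rather than the helper defined above:

```python
import pandas as pd

# Made-up ratings for illustration.
stars = pd.Series([4.5, 4.0, 3.5, 4.5], name="stars")

# One-hot encode; prefix/prefix_sep yield column names like "stars-4.5".
one_hot = pd.get_dummies(stars.astype(str), prefix="stars", prefix_sep="-")

# Decode: find the column holding the 1 in each row, then strip the prefix.
decoded = (
    one_hot.astype(int)
    .idxmax(axis=1)
    .str.replace("stars-", "", regex=False)
    .astype(float)
)

print(one_hot.columns.tolist())
print(decoded.tolist())
```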


In [109]:
import sklearn.feature_extraction.text as sk_text

vectorizer = sk_text.TfidfVectorizer(stop_words='english', max_features=1000, min_df=1)

x = vectorizer.fit_transform(df_grouped['all_reviews'])
y = df_grouped[['stars-1.0', 'stars-1.5', 'stars-2.0', 'stars-2.5', 'stars-3.0', 'stars-3.5', 'stars-4.0', 'stars-4.5', 'stars-5.0']]

In [110]:
vectorizer.get_feature_names_out()[:40]

array(['00', '10', '100', '11', '12', '15', '20', '25', '30', '40', '45',
       '50', 'able', 'absolutely', 'accommodating', 'actually', 'add',
       'added', 'addition', 'affordable', 'afternoon', 'ago', 'ahead',
       'air', 'amazing', 'ambiance', 'american', 'apparently',
       'appetizer', 'appetizers', 'apple', 'appointment', 'appreciate',
       'area', 'aren', 'arrived', 'art', 'asian', 'ask', 'asked'],
      dtype=object)

In [111]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
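
The training cell further down passes the test split as `validation_data`, so early stopping is tuned on the same data used to report accuracy. One common alternative, sketched here with placeholder arrays (the random `x` and `y` below are stand-ins, not the TF-IDF matrix or the one-hot targets), is to hold out a separate validation split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder features and labels standing in for the real data.
rng = np.random.default_rng(42)
x = rng.random((100, 5))
y = rng.integers(0, 3, size=100)

# First carve off the test set, then split the remainder into train/validation.
x_rest, x_test, y_rest, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42
)
x_train, x_val, y_train, y_val = train_test_split(
    x_rest, y_rest, test_size=0.25, random_state=42
)

print(len(x_train), len(x_val), len(x_test))
```

With this layout, `(x_val, y_val)` would be used for early stopping and `(x_test, y_test)` only for the final evaluation.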

In [113]:
print(x.shape[1])

1000


In [121]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Input
from tensorflow.keras.callbacks import EarlyStopping

model = Sequential()

# Declare the input shape with an explicit Input layer instead of passing
# input_dim to the first Dense layer, which triggers a Keras warning
model.add(Input(shape=(x.shape[1],)))
model.add(Dense(500, activation='relu'))
model.add(Dense(250, activation='relu'))
model.add(Dense(125, activation='relu'))
model.add(Dense(60, activation='relu'))
# One output unit per star class; softmax yields class probabilities
model.add(Dense(y.shape[1], activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# Stop once val_loss fails to improve by at least 1e-3 for 2 consecutive epochs
monitor = EarlyStopping(monitor='val_loss', min_delta=1e-3, patience=2, verbose=2, mode='auto')
model.fit(x_train, y_train, validation_data=(x_test, y_test), callbacks=[monitor], verbose=2, epochs=1000)


Epoch 1/1000
270/270 - 4s - 14ms/step - accuracy: 0.4411 - loss: 1.3329 - val_accuracy: 0.5682 - val_loss: 1.0553
Epoch 2/1000
270/270 - 2s - 7ms/step - accuracy: 0.5784 - loss: 1.0152 - val_accuracy: 0.5840 - val_loss: 1.0170
Epoch 3/1000
270/270 - 2s - 7ms/step - accuracy: 0.6113 - loss: 0.9321 - val_accuracy: 0.5645 - val_loss: 1.0373
Epoch 4/1000
270/270 - 3s - 10ms/step - accuracy: 0.6636 - loss: 0.8344 - val_accuracy: 0.5803 - val_loss: 1.0343
Epoch 4: early stopping


<keras.src.callbacks.history.History at 0x216e66ce990>
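
The validation accuracy of roughly 0.58 counts a prediction of 4.0 for a 4.5-star business as just as wrong as a prediction of 1.0. Since the classes are ordered, metrics that account for distance can complement exact accuracy. The sketch below uses made-up class indices in place of the argmax of the model's predictions and of the one-hot targets, and the within-half-star metric is an illustrative choice rather than part of the notebook:

```python
import numpy as np

# Star value for each of the nine one-hot classes, in column order.
star_values = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0])

# Made-up class indices standing in for the argmax of predictions and targets.
pred_idx = np.array([6, 7, 5, 8, 2])
true_idx = np.array([6, 6, 7, 8, 2])

pred_stars = star_values[pred_idx]
true_stars = star_values[true_idx]

exact = np.mean(pred_stars == true_stars)                       # exact-match accuracy
within_half = np.mean(np.abs(pred_stars - true_stars) <= 0.5)   # off by at most half a star
mae = np.mean(np.abs(pred_stars - true_stars))                  # mean absolute error in stars

print(exact, within_half, mae)
```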