<h1>Yelp Review Aggregator</h1>

Team: Better Late than never

<h3>Intro</h3>

- Motivations/Stakeholders
- Hypotheses

<h3>Data</h3>

Our project used the Yelp Open Dataset, a public dataset for educational use. It spans over 150K businesses, over 7 million reviews, and almost 2 million users. The dataset covers 11 metropolitan areas across 20+ states in the United States. Reviews were collected between Feb 16, 2005 and Jan 19, 2022.

To clean the data, we first removed all non-restaurants from the dataset. Next, we removed all restaurants with fewer than 30 reviews. We then removed all reviews not pertaining to the restaurants left in the dataset. Lastly, we removed all users who had not written a review of any remaining restaurant. The following code cell outputs the results of cleaning (the dataset itself is too large to submit).

Source: https://business.yelp.com/data/resources/open-dataset/

In [1]:
import data_cleaning as clean
chunk_size = 100_000

# We assume that the raw yelp dataset has been extracted to the folder 'data'
restaurants_df = clean.filter_business_data("data/yelp_academic_dataset_business.json", chunk_size)
reviews_df = clean.filter_review_data("data/yelp_academic_dataset_review.json", restaurants_df, chunk_size)
users_df = clean.filter_user_data("data/yelp_academic_dataset_user.json", reviews_df, chunk_size)

print(f'Number of restaurants after cleaning: {restaurants_df.shape[0]}')
print(f'Number of reviews after cleaning: {reviews_df.shape[0]}')
print(f'Number of users after cleaning: {users_df.shape[0]}')

Number of restaurants after cleaning: 27894
Number of reviews after cleaning: 4371282
Number of users after cleaning: 1379980
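
The `data_cleaning` module is not reproduced in this notebook. As a rough sketch of the cleaning steps described above, the filtering could look something like the following. It assumes pandas and the Yelp field names `categories`, `review_count`, `business_id`, and `user_id`. The function names `filter_restaurants` and `filter_by_ids` are hypothetical and are not the module's actual API.

```python
import pandas as pd

MIN_REVIEWS = 30  # threshold stated in the text


def filter_restaurants(path, chunk_size):
    """Keep businesses tagged as restaurants with at least MIN_REVIEWS reviews."""
    kept = []
    # lines=True reads one JSON object per line; chunksize streams the file
    # in pieces so the whole dataset never has to fit in memory at once.
    for chunk in pd.read_json(path, lines=True, chunksize=chunk_size):
        # Assumption: a restaurant has "Restaurants" in its categories string.
        is_restaurant = chunk["categories"].fillna("").str.contains("Restaurants")
        enough_reviews = chunk["review_count"] >= MIN_REVIEWS
        kept.append(chunk[is_restaurant & enough_reviews])
    return pd.concat(kept, ignore_index=True)


def filter_by_ids(path, column, valid_ids, chunk_size):
    """Keep rows whose `column` value appears in `valid_ids`."""
    valid_ids = set(valid_ids)
    kept = []
    for chunk in pd.read_json(path, lines=True, chunksize=chunk_size):
        kept.append(chunk[chunk[column].isin(valid_ids)])
    return pd.concat(kept, ignore_index=True)
```

Under these assumptions, reviews would be filtered with `filter_by_ids(review_path, "business_id", restaurants["business_id"], chunk_size)` and users with `filter_by_ids(user_path, "user_id", reviews["user_id"], chunk_size)`, mirroring the order of the steps in the text.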


Text - EDA - Distribution of Reviews and Review Scores

TODO: Describe viz and takeaways

In [None]:
## Run code for viz 1 - distribution of review scores

## Run

Text - EDA #1 - Heatmap of average score prices

TODO: Describe viz and takeaways

In [None]:
## output viz

# TODO: Harket - create .py file to output viz

Text - EDA #2 - Average score vs price

TODO: Describe viz and takeaways

In [None]:
## Output viz

## TODO: Sonya - Create .py file to output viz

<h3>ML #1 - Binary Classification of Review Text</h3>

After looking at review scores, we turned to review text. We wanted to see if the text of a review aligned with the review's score, bias and all. To do this, we created a machine learning model to classify review text as 'Bad' (score < 4) or 'Good' (score >= 4). We trained an SVC on 50K reviews using the cuML library, a GPU-accelerated version of sklearn. Our features were a matrix of TF-IDF vectors. To benchmark, we compared the accuracy of our model to a majority label classifier using 5K reviews not used in training. We cross-validated to find the most effective SVC kernel and number of data points, but our tuning did not produce a significant difference in model accuracy.
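
To make the setup concrete, here is a minimal sketch of a TF-IDF plus SVC pipeline for the two labels described above. It uses plain sklearn on the CPU rather than cuML, and the helper names `label_binary` and `train_binary_classifier` are hypothetical; they are not the functions in our `text_analysis` module. The `kernel="linear"` default is also an assumption made for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC


def label_binary(stars):
    """Map a star rating to the two classes described in the text."""
    return "Good" if stars >= 4 else "Bad"


def train_binary_classifier(texts, stars, kernel="linear"):
    """Fit a TF-IDF vectorizer and an SVC on raw review text."""
    tfidf = TfidfVectorizer()
    X = tfidf.fit_transform(texts)  # sparse matrix of TF-IDF vectors
    y = [label_binary(s) for s in stars]
    classifier = SVC(kernel=kernel).fit(X, y)
    return tfidf, classifier


# Toy usage with made-up reviews (not taken from the dataset)
texts = ["terrible cold food", "awful rude service",
         "great tasty food", "lovely friendly service"]
stars = [1, 2, 5, 4]
tfidf, classifier = train_binary_classifier(texts, stars)
print(classifier.predict(tfidf.transform(["great service"])))
```

The fitted vectorizer is returned alongside the classifier because held-out reviews have to be transformed with the same vocabulary the model was trained on.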

Our model beat our benchmark by almost 20 percentage points (0.888 vs. 0.693 accuracy), not bad. This suggests an association between the score of a review and the text of the review. To investigate further, we created a second model with three classes.

In [None]:
import text_analysis as ml
benchmark_size = 5000

train_X, train_y, binary_tfidf, binary_classifier = ml.load_model("binary")
test_X, test_y = ml.create_binary_test_data(reviews_df, benchmark_size, binary_tfidf)

ml.benchmark(test_X, test_y)
ml.evaluate_classifier(binary_classifier, test_X, test_y)

cuML: Installed accelerator for sklearn.
cuML: Successfully initialized accelerator.
Benchmark accuracy for our model to beat: 0.6928066037735849
Model Accuracy: 0.8876768867924528
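
For reference, a majority label benchmark can be sketched in a few lines. This is an illustrative stand-in, not the code behind `ml.benchmark`; the name `majority_label_accuracy` is hypothetical, and the sketch assumes the majority label is taken from the same labels it is scored on.

```python
from collections import Counter


def majority_label_accuracy(labels):
    """Accuracy of always predicting the most common label."""
    counts = Counter(labels)
    _, majority_count = counts.most_common(1)[0]
    return majority_count / len(labels)


print(majority_label_accuracy(["Good", "Good", "Bad", "Good"]))  # 0.75

# Gap between the two accuracies printed above, in percentage points
gap = (0.8876768867924528 - 0.6928066037735849) * 100
print(round(gap, 1))  # 19.5
```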


<h3>ML #2 - Three-way Classification of Review Text</h3>

This model is an SVC like the first, except that it classifies the text of a review (in TF-IDF vector form) as 'Bad' (review score < 4), 'Good' (4 <= review score < 5), or 'Great' (review score == 5). cuML was unable to use GPU acceleration on this kind of model, so we used sklearn instead. The SVC was trained using the 'One vs. Rest' strategy, meaning that it consists of 3 SVMs, one for each label. We used 50K reviews for training and 5K separate reviews for benchmarking, with the same majority label classifier as the benchmark.
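
As an illustration of the One vs. Rest setup, the sketch below wraps an sklearn `SVC` in `OneVsRestClassifier`, which fits one binary SVM per label. The helper names `label_three_way` and `train_three_way_classifier` are hypothetical and the `kernel="linear"` default is an assumption; this is not necessarily how our saved model was built.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC


def label_three_way(stars):
    """Map a star rating to the three classes described in the text."""
    if stars == 5:
        return "Great"
    if stars >= 4:
        return "Good"
    return "Bad"


def train_three_way_classifier(texts, stars, kernel="linear"):
    """Fit TF-IDF features and a One vs. Rest ensemble of SVMs."""
    tfidf = TfidfVectorizer()
    X = tfidf.fit_transform(texts)
    y = [label_three_way(s) for s in stars]
    # One binary SVM is trained per label; prediction picks the label
    # whose SVM gives the highest decision score.
    classifier = OneVsRestClassifier(SVC(kernel=kernel)).fit(X, y)
    return tfidf, classifier


# Toy usage with made-up reviews (not taken from the dataset)
texts = ["terrible cold food", "awful rude service", "decent tasty food",
         "nice quick service", "amazing perfect food", "outstanding lovely staff"]
stars = [1, 2, 4, 4, 5, 5]
tfidf, classifier = train_three_way_classifier(texts, stars)
print(len(classifier.estimators_))  # 3, one SVM per label
```

The explicit wrapper matters because sklearn's `SVC` on its own handles multiclass problems with a one-vs-one scheme internally.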

This model beat the benchmark by almost 30 percentage points (0.705 vs. 0.421 accuracy), a wider margin than the binary model. This further suggests an association between a review's score and its text. It also suggests that any bias in the review score is present in the review text as well, since both are user input.

In [3]:
train_X, train_y, three_tfidf, three_classifier = ml.load_model("3_class")
test_X, test_y = ml.create_multiclass_test_data(reviews_df, benchmark_size, three_tfidf)

ml.benchmark(test_X, test_y)
ml.evaluate_classifier(three_classifier, test_X, test_y)

Benchmark accuracy for our model to beat: 0.4206957547169811
Model Accuracy: 0.7054834905660378


<h3>Results</h3>

Our research boils down to two major takeaways:
<h5>1. Yelp data is biased</h5>
The average Yelp review score is 4/5, with the majority of users leaving 1- and 5-star ratings.

<h5>2. Yelp bias is user input, at least in part</h5>

Text - App Demo

https://yelp-recs.crowsnet.io

TODO: Describe app and data pulling/cleaning process