Group Analytics Project

Course: BANA 320 - Predictive Analytics
Dataset: Yelp Open Dataset
Group Name: Three Gits
Team: Vincent Balalian, Sameer Patel, Arish Patel

Overview

Problem Statement: Does adding sentiment features from the first 90 days of reviews significantly improve the accuracy of predicting a restaurant's 1-year Yelp rating, compared to using non-text features alone?

Target Variable: Average star rating 12 months after first review

Result: No - sentiment features did not significantly improve prediction accuracy when geographic and user-based context features were already included.

Data Pipeline

Raw Yelp JSON data (hosted in Google Cloud Storage) is transformed via dbt + BigQuery:

- Filtering: Business dataset filtered to restaurants only
- Qualification criteria:
  - Minimum 1 year of review history
  - At least 3 reviews in first 90 days
  - At least 10 reviews in first year
- Feature aggregation: Check-ins, user metrics, and zip-code comparisons aggregated to restaurant level
- Time windowing: Separate datasets for 90-day (features) and 365-day (target) review periods
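
The qualification criteria and time windows above can be sketched in plain Python. This is an illustration only, not the project's dbt models: the function names, the `(review_date, stars)` tuple layout, and `dataset_end` are hypothetical. It also assumes that "1 year of review history" means the data extends at least 365 days past the first review, that windows exclude the cutoff day, and that the target is the mean star rating of reviews in the 365-day window.

```python
from datetime import date, timedelta

FEATURE_WINDOW = timedelta(days=90)   # reviews used to build features
TARGET_WINDOW = timedelta(days=365)   # reviews used to build the target


def reviews_within(reviews, window):
    """Keep (review_date, stars) pairs dated before first review + window."""
    first = min(d for d, _ in reviews)
    return [(d, s) for d, s in reviews if d < first + window]


def qualifies(reviews, dataset_end):
    """Apply the three qualification criteria to one restaurant's reviews."""
    if not reviews:
        return False
    first = min(d for d, _ in reviews)
    # Assumption: a full year must be observable after the first review.
    has_full_year = dataset_end >= first + TARGET_WINDOW
    return (
        has_full_year
        and len(reviews_within(reviews, FEATURE_WINDOW)) >= 3
        and len(reviews_within(reviews, TARGET_WINDOW)) >= 10
    )


def target_rating(reviews):
    """Assumed target: mean stars over the 365-day window."""
    year = reviews_within(reviews, TARGET_WINDOW)
    return sum(s for _, s in year) / len(year)
```

Keeping the two windows as named constants makes it harder for 365-day information to leak into the 90-day feature set.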

Feature Categories

Category	Description	Count
Early Review Metrics	Review count & average rating (first 90 days)	2
Check-in Patterns	Day-of-week and time-of-day distributions	11
User Characteristics	Reviewer reputation, engagement, experience	10
Zip Code Context	Restaurant metrics vs. local averages	4
Sentiment (VADER)	Positive, negative, neutral, compound scores	4
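
As a rough illustration of the Zip Code Context category, the sketch below computes one such feature: a restaurant's 90-day average rating minus the average for its zip code. The field names (`business_id`, `zip_code`, `avg_rating_90d`) are hypothetical, and including the restaurant itself in its local average is an assumption; the project's actual four features may be defined differently.

```python
from collections import defaultdict
from statistics import mean


def rating_vs_zip_average(restaurants):
    """Map business_id -> (own 90-day rating - zip-code mean rating)."""
    by_zip = defaultdict(list)
    for r in restaurants:
        by_zip[r["zip_code"]].append(r["avg_rating_90d"])
    # Assumption: the local mean includes the restaurant's own rating.
    zip_mean = {z: mean(ratings) for z, ratings in by_zip.items()}
    return {
        r["business_id"]: r["avg_rating_90d"] - zip_mean[r["zip_code"]]
        for r in restaurants
    }
```

A positive value means the restaurant is rated above its local competitors, which is the kind of competitive-positioning signal the feature is meant to carry.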

Methodology

Phase 1: Train models using all features except sentiment
Phase 2: Add sentiment features derived from VADER analysis of first-90-day reviews
Comparison: Paired one-tailed t-test on absolute prediction errors (α = 0.05)
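
The comparison step can be sketched with SciPy's `ttest_rel`. The function and argument names here are hypothetical, and the direction of the alternative hypothesis (baseline errors larger than with-sentiment errors) is an assumption about how the one-tailed test was set up; the notebook may differ in detail.

```python
import numpy as np
from scipy import stats


def sentiment_improves(y_true, pred_baseline, pred_sentiment, alpha=0.05):
    """Paired one-tailed t-test on per-restaurant absolute errors."""
    y_true = np.asarray(y_true, dtype=float)
    err_base = np.abs(y_true - np.asarray(pred_baseline, dtype=float))
    err_sent = np.abs(y_true - np.asarray(pred_sentiment, dtype=float))
    # H0: mean error is unchanged; H1: baseline error > with-sentiment error.
    t_stat, p_value = stats.ttest_rel(err_base, err_sent, alternative="greater")
    return float(t_stat), float(p_value), bool(p_value < alpha)
```

The test is paired because both models predict the same restaurants, so each absolute error has a natural counterpart from the other phase.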

Models Tested

Support Vector Regression (linear kernel)
Linear Regression
Random Forest Regressor
XGBoost Regressor
Stacked Ensemble (average of all four)
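
Since the ensemble is described as an average of the four models, a minimal sketch of that combination step might look like the following. The function name and the dictionary layout are hypothetical, and this is plain unweighted averaging with no meta-learner, which is an assumption based on the description above.

```python
def average_ensemble(predictions_by_model):
    """Average per-restaurant predictions across models, element by element."""
    columns = list(predictions_by_model.values())
    n = len(columns[0])
    if any(len(col) != n for col in columns):
        raise ValueError("all models must predict the same restaurants")
    # zip(*columns) groups the predictions each model made for one restaurant.
    return [sum(group) / len(group) for group in zip(*columns)]
```

For example, passing the test-set predictions of the four fitted models keyed by name returns one averaged prediction per restaurant.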

Key Findings

Geographic context subsumes sentiment signal - Zip-code comparison features (rating vs. local average, etc.) already capture competitive positioning that sentiment would otherwise indicate.
Early rating is the dominant predictor - The 90-day average rating had the highest mutual information with 1-year rating by a wide margin.
Reviewer quality correlates with outcomes - User characteristics (average stars given, engagement metrics) ranked among top predictive features.

Tools & Technologies

Data Warehouse: Google BigQuery
Transformation: dbt
Analysis: Python (pandas, scikit-learn, XGBoost, VADER)
Environment: Google Colab
CI/CD: GitHub Actions

Analysis

The complete analysis notebook is available here. It includes all code, visualizations, and statistical test results.

Note: The data pipeline requires access to a private Google Cloud project. The notebook is provided for review purposes; reproducing it would require setting up your own BigQuery environment with the Yelp Open Dataset.

BANA 320 - Fall 2025
