# Understanding the Geographical Impact on Restaurant Reviews: A Yelp Data Analysis

## **Team Members:**

Hugh Wang (hughwang@umich.edu)  
Brain Wang

## **Overview**

This project analyzes the relationship between geographical features and restaurant review trends in the United States. We aim to explore how factors such as location type (e.g., downtown vs. suburban), proximity to landmarks, and neighborhood characteristics influence customer feedback and ratings.

## **Motivation**

We chose this topic because understanding how geography influences customer behavior is valuable for businesses and urban planners alike. Restaurants in different areas may face different customer expectations, and insights into these differences can help businesses tailor their services more effectively. We frame the project around three real-world questions:

- **How do review ratings vary by restaurant category (e.g., fast food vs. fine dining) across different regions in the U.S.?**  
  By answering this question, we aim to learn whether customers in different regions prefer certain types of restaurants and how these preferences influence review ratings. This will help identify regional trends in customer satisfaction.

- **What times of the year do restaurants tend to receive the most reviews, and how does this trend affect review sentiment?**  
  This question will help us understand seasonal patterns in review activity and whether customer sentiment (positive or negative) shifts during specific times of the year, such as holidays or summer months.

- **Are restaurants with higher average ratings more likely to have detailed reviews compared to those with lower ratings?**  
  This analysis will explore whether detailed reviews are more often associated with positive experiences or if dissatisfied customers also tend to provide more in-depth feedback. Understanding this can offer insights into the correlation between review length and customer satisfaction.
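
As a rough illustration of how the third question might be approached, the sketch below uses word count as a proxy for review detail and compares it across star ratings. The sample rows are hypothetical, and the `word_count` column and the rank-based correlation are our own illustrative choices, not a fixed part of the analysis plan.

```python
import pandas as pd

# Hypothetical reviews; the real rows would come from the Yelp review table.
reviews = pd.DataFrame({
    "stars": [1, 1, 3, 5, 5],
    "text": [
        "Cold food and a very long wait for a table",
        "Rude staff",
        "It was fine",
        "Great tacos",
        "Loved it",
    ],
})

# Use word count as a simple proxy for how detailed a review is.
reviews["word_count"] = reviews["text"].str.split().str.len()

# Mean review length at each star level.
length_by_stars = reviews.groupby("stars")["word_count"].mean()

# Spearman-style correlation: Pearson correlation computed on the ranks.
rho = reviews["stars"].rank().corr(reviews["word_count"].rank())

print(length_by_stars)
print("rank correlation:", rho)
```

On real data, a negative correlation would suggest that dissatisfied customers write longer reviews, while a positive one would point the other way.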


## **Data Sources**

In this project, we will use two key components from the Yelp dataset:

### **Yelp Business Dataset**  
This dataset contains vital information about businesses, including their name, location (address, latitude, longitude), category (e.g., restaurant, bar), and various attributes (e.g., price range, amenities like parking or takeout options). This structured data provides a comprehensive overview of the business characteristics that are essential for analysis.

- **URL**: [Yelp Dataset](https://www.yelp.com/dataset)  

### **Yelp Review Dataset**  
This dataset contains customer feedback in the form of written reviews, star ratings (1 to 5 stars), and timestamps. It captures user experiences for each business, offering a dynamic perspective on customer satisfaction, preferences, and the factors that influence ratings over time.

- **URL**: [Yelp Dataset](https://www.yelp.com/dataset)

### **How These Datasets Complement Each Other**  
The **Yelp Business** dataset provides a structured view of business attributes, such as location, category, and price range, giving essential context about each restaurant. The **Yelp Review** dataset, on the other hand, adds dynamic feedback from customers, including ratings and review text, which reflect real-world customer experiences and opinions. By integrating these two datasets, we can explore how business features—like geographic location, type of establishment, or pricing—impact customer satisfaction, offering deeper insights into the factors driving restaurant success.
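
A minimal sketch of this integration, assuming both tables share the `business_id` key described in the Data Description section. The tiny frames below are hypothetical stand-ins for the real data, and the suffix names are our own choice.

```python
import pandas as pd

# Hypothetical frames standing in for the business and review tables.
business = pd.DataFrame({
    "business_id": ["b1", "b2"],
    "state": ["PA", "FL"],
    "stars": [4.0, 3.5],
})
reviews = pd.DataFrame({
    "review_id": ["r1", "r2", "r3"],
    "business_id": ["b1", "b1", "b2"],
    "stars": [5, 3, 4],
})

# Both tables have a `stars` column, so suffixes keep them apart after the join.
merged = reviews.merge(
    business,
    on="business_id",
    how="inner",
    suffixes=("_review", "_business"),
)

# Average review rating per state.
state_means = merged.groupby("state")["stars_review"].mean()
print(state_means)
```

An inner join drops reviews whose business is absent from the business table, which seems appropriate here because location is needed for every geographic comparison.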


## **Data Description**

For this project, we will be using the **Yelp Business** and **Yelp Review** datasets, each providing valuable information from different angles:

1. **Yelp Business Dataset**:
   - **Variables of Interest**:
     - **business_id**: Unique identifier for each business (string, 22 characters).
     - **name**: Name of the business (string).
     - **address**: Full street address (string).
     - **city**: The city where the business is located (string).
     - **state**: The state where the business is located (string, 2 characters).
     - **postal_code**: Postal code (string).
     - **latitude/longitude**: Coordinates of the business (floats).
     - **stars**: Average star rating for the business (float).
     - **review_count**: Total number of reviews for the business (integer).
     - **is_open**: Indicates if the business is currently open (0 = closed, 1 = open).
     - **attributes**: Dictionary of business attributes (e.g., parking, takeout).
     - **categories**: List of categories the business falls under (e.g., "Mexican," "Burgers").
     - **hours**: Operating hours for each day of the week (object, with times in 24-hour format).

   - **Size**: The business dataset contains information for thousands of businesses, and the size depends on the region being studied. Each entry represents a unique business.
   
   - **Missing Values**: Some fields like attributes or operating hours may have missing or incomplete data for certain businesses.

2. **Yelp Review Dataset**:
   - **Variables of Interest**:
     - **review_id**: Unique identifier for each review (string, 22 characters).
     - **user_id**: Unique identifier for the user who posted the review (string, 22 characters).
     - **business_id**: Unique identifier for the business being reviewed (string, 22 characters, matches with the business dataset).
     - **stars**: Star rating given in the review (whole number from 1 to 5; it appears as a float such as `5.0` when loaded from our CSV).
     - **date**: Date and time the review was posted (string, formatted as YYYY-MM-DD HH:MM:SS).
     - **text**: Full text content of the review (string).
     - **useful, funny, cool**: Number of votes the review received for these categories (integers).

   - **Size**: The review dataset contains millions of reviews, each representing a unique instance of customer feedback for a particular business.
   
   - **Missing Values**: The review dataset is generally complete, although some reviews may have missing or zero values in vote counts (useful, funny, cool).
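
The field descriptions above suggest a few preparation steps: counting missing values, splitting categories into individual labels, and converting the review date into a datetime. The sketch below illustrates these on hypothetical rows. It assumes `categories` arrives as a comma-separated string in the CSV; `category_list` and `month` are column names introduced here for illustration.

```python
import pandas as pd

# Hypothetical sample rows mirroring the field formats described above.
business = pd.DataFrame({
    "business_id": ["b1", "b2", "b3"],
    "categories": ["Mexican, Restaurants", "Burgers, Restaurants, Fast Food", None],
    "hours": [None, "{'Monday': '9:0-17:0'}", None],
})
reviews = pd.DataFrame({
    "review_id": ["r1", "r2"],
    "date": ["2018-07-07 22:09:11", "2012-01-03 15:28:18"],
})

# Count missing values per column to see which fields need handling.
missing_counts = business.isna().sum()

# Split the comma-separated category string into a list of labels
# (assumption: categories are stored as one string per business).
business["category_list"] = business["categories"].fillna("").apply(
    lambda s: [c.strip() for c in s.split(",") if c.strip()]
)

# Convert the date string to a datetime so the month can be extracted.
reviews["date"] = pd.to_datetime(reviews["date"])
reviews["month"] = reviews["date"].dt.month

print(missing_counts)
print(business["category_list"].tolist())
print(reviews["month"].tolist())
```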

## **Data Manipulation**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [1]:
# Paths to the Yelp business and review CSV files
business_file = 'yelp_business.csv'
review_file = 'yelp_review.csv'

# Load the business data and review data
business_df = pd.read_csv(business_file)
review_df = pd.read_csv(review_file)
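
Because the review file holds millions of rows, loading it in one call can be slow or memory-hungry. One possible alternative, sketched below, reads only the needed columns in chunks and then combines them. The in-memory CSV and the chunk size are hypothetical stand-ins so the example runs on its own; it is not the loading code used in this notebook.

```python
import io
import pandas as pd

# A small in-memory CSV stands in for the large review file (hypothetical rows).
sample_csv = io.StringIO(
    "review_id,business_id,stars,text\n"
    "r1,b1,5.0,Great\n"
    "r2,b2,3.0,Okay\n"
    "r3,b1,4.0,Good\n"
)

# Read only the needed columns, a few rows at a time, then combine the chunks.
chunks = pd.read_csv(
    sample_csv,
    usecols=["business_id", "stars"],
    chunksize=2,
)
review_sample = pd.concat(chunks, ignore_index=True)
print(review_sample)
```

With the real file, the chunk size would be much larger, and each chunk could be filtered to restaurant reviews before concatenation to keep memory use down.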


In [3]:
# Drop rows missing the fields used below (business categories, review text)
business_df = business_df.dropna(subset=['categories'])
review_df = review_df.dropna(subset=['text'])

# Filter out businesses that are not restaurants
restaurant_df = business_df[business_df['categories'].str.contains('Restaurants')]
restaurant_ids = restaurant_df['business_id'].values

# Filter out reviews that are not for restaurants
restaurant_review_df = review_df[review_df['business_id'].isin(restaurant_ids)]

print('Number of reviews for restaurants:', len(restaurant_review_df))
print(review_df.head())
print(business_df.head())

Number of reviews for restaurants: 4724471
                review_id                 user_id             business_id  \
0  KU_O5udG6zpxOg-VcAEodg  mh_-eMZ6K5RLWhZyISBhwA  XQfwVwDr-v0ZS3_CbbE5Xw   
1  BiTunyQ73aT9WBnpR9DZGw  OyoGAe7OKpv6SyGZT5g77Q  7ATYjTIgM3jUlt4UM3IypQ   
2  saUsX_uimxRlCVr67Z4Jig  8g_iMtfSiwikVnbP2etR0A  YjUWPpI6HXG530lwP-fb2A   
3  AqPFMleE6RsU23_auESxiA  _7bHUi9Uuf5__HHc_Q8guQ  kxX2SOes4o-D3ZQBkiMRfA   
4  Sx8TMOWLNuJBWer-0pcmoA  bcjbaE6dDog4jkNY91ncLQ  e4Vwtrqf-wpJfwesgvdgxQ   

   stars  useful  funny  cool  \
0    3.0       0      0     0   
1    5.0       1      0     1   
2    3.0       0      0     0   
3    5.0       1      0     1   
4    4.0       1      0     1   

                                                text                 date  
0  If you decide to eat here, just be aware it is...  2018-07-07 22:09:11  
1  I've taken a lot of spin classes over the year...  2012-01-03 15:28:18  
2  Family diner. Had the buffet. Eclectic assortm...  2014-02-05 20