# Understanding the Geographical Impact on Restaurant Reviews: A Yelp Data Analysis

## **Team Members:**

Hugh Wang (hughwang@umich.edu)  
Brian Wang (bswang@umich.edu)

## **Overview**

This project analyzes the relationship between geographical features and restaurant review trends in the United States. We aim to explore how factors such as location type (e.g., downtown vs. suburban), proximity to landmarks, and neighborhood characteristics influence customer feedback and ratings.

## **Motivation**

We chose this topic because understanding how geography influences customer behavior is valuable for businesses and urban planners alike. Restaurants in different areas may have varied customer expectations, and insights into these differences can help businesses tailor their services more effectively. Here are the three real-world questions we generated for this project:

- **How do review ratings vary by restaurant category (e.g., fast food vs. fine dining) across different regions in the U.S.?**  
  By answering this question, we aim to learn whether customers in different regions prefer certain types of restaurants and how these preferences influence review ratings. This will help identify regional trends in customer satisfaction.

- **What times of the year do restaurants tend to receive the most reviews, and how does this trend affect review sentiment?**  
  This question will help us understand seasonal patterns in review activity and whether customer sentiment (positive or negative) shifts during specific times of the year, such as holidays or summer months.

- **Are restaurants with higher average ratings more likely to have detailed reviews compared to those with lower ratings?**  
  This analysis will explore whether detailed reviews are more often associated with positive experiences or if dissatisfied customers also tend to provide more in-depth feedback. Understanding this can offer insights into the correlation between review length and customer satisfaction.


## **Data Sources**

In this project, we use two components of the Yelp dataset, plus an auxiliary US population dataset:

### **Yelp Business Dataset**  
This dataset contains vital information about businesses, including their name, location (address, latitude, longitude), category (e.g., restaurant, bar), and various attributes (e.g., price range, amenities like parking or takeout options). This structured data provides a comprehensive overview of the business characteristics that are essential for analysis.

- **URL**: [Yelp Dataset](https://www.yelp.com/dataset)  

### **Yelp Review Dataset**  
This dataset contains customer feedback in the form of written reviews, star ratings (1 to 5 stars), and timestamps. It captures user experiences for each business, offering a dynamic perspective on customer satisfaction, preferences, and the factors that influence ratings over time.

- **URL**: [Yelp Dataset](https://www.yelp.com/dataset)
  
### **US Population Dataset**
This dataset contains population statistics of each US state. It captures the ranking, ratio to entire US population, and population count of each state.

- **URL**: [US Population Dataset](https://www.kaggle.com/datasets/alexandrepetit881234/us-population-by-state)

### **How These Datasets Complement Each Other**  
The **Yelp Business** dataset provides a structured view of business attributes, such as location, category, and price range, giving essential context about each restaurant. The **Yelp Review** dataset, on the other hand, adds dynamic feedback from customers, including ratings and review text, which reflect real-world customer experiences and opinions. By integrating these two datasets on `business_id`, we can explore how business features (geographic location, type of establishment, or pricing) relate to customer satisfaction. Finally, joining both with the **US Population** dataset by state lets us examine how review and business statistics correlate with population at the state level (macro correlation).


## **Data Description**

For this project, we use the **Yelp Business**, **Yelp Review**, and **US Population** datasets, each providing valuable information from a different angle:

1. **Yelp Business Dataset**:
   - **Variables of Interest**:
     - **business_id**: Unique identifier for each business (string, 22 characters).
     - **name**: Name of the business (string).
     - **address**: Full street address (string).
     - **city**: The city where the business is located (string).
     - **state**: The state where the business is located (string, 2 characters).
     - **postal_code**: Postal code (string).
     - **latitude/longitude**: Coordinates of the business (floats).
     - **stars**: Average star rating for the business (float).
     - **review_count**: Total number of reviews for the business (integer).
     - **is_open**: Indicates if the business is currently open (0 = closed, 1 = open).
     - **attributes**: Dictionary of business attributes (e.g., parking, takeout).
     - **categories**: List of categories the business falls under (e.g., "Mexican," "Burgers").
     - **hours**: Operating hours for each day of the week (object, with times in 24-hour format).

   - **Size**: The version loaded in this notebook contains 150,346 businesses with 14 columns. Each entry represents a unique business.
   
   - **Missing Values**: Some fields like attributes or operating hours may have missing or incomplete data for certain businesses.

2. **Yelp Review Dataset**:
   - **Variables of Interest**:
     - **review_id**: Unique identifier for each review (string, 22 characters).
     - **user_id**: Unique identifier for the user who posted the review (string, 22 characters).
     - **business_id**: Unique identifier for the business being reviewed (string, 22 characters, matches with the business dataset).
     - **stars**: Star rating given in the review (integer).
     - **date**: Date and time the review was posted (string, formatted as YYYY-MM-DD HH:MM:SS in the loaded CSV).
     - **text**: Full text content of the review (string).
     - **useful, funny, cool**: Number of votes the review received for these categories (integers).

   - **Size**: The version loaded in this notebook contains 6,990,280 reviews with 9 columns, each representing a unique instance of customer feedback for a particular business.
   
   - **Missing Values**: The review dataset is generally complete, although some reviews may have missing or zero values in vote counts (useful, funny, cool).

3. **US Population Dataset**:
   - **Variables of Interest**:
     - **rank**: Population rank of the state among all US states.
     - **state**: Full name of the state (string).
     - **state_code**: State code (string).
     - **2020_census**: Population count from the 2020 census.
     - **percent__of_total**: The state's share of the total US population, in percent.

## **Data Manipulation**

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Check https://github.com/matplotlib/basemap
# python -m pip install basemap-data
from utils import plot_geomap

%load_ext autoreload
%autoreload 2

In [4]:
# Path to Yelp dataset csv file

business_file = '../project_data/yelp_business.csv'
review_file = '../project_data/yelp_review.csv'

# business_file = '../project_data/yelp_business.csv'
# review_file = 'yelp_review.csv'

# Load the business data and review data
business_df = pd.read_csv(business_file)
review_df = pd.read_csv(review_file)


In [5]:
# Report the raw shapes, then drop rows with missing key fields
print('business data :', business_df.shape)
print('review data :', review_df.shape)
business_df = business_df.dropna(subset=['categories'])
review_df = review_df.dropna(subset=['text'])
# Keep a reproducible 1% random sample of the reviews
review_df = review_df.sample(frac=0.01, random_state=42)
# Note: the review shape printed below reflects both the NA drop and the 1% sample
print('After dropping NA: business data :', business_df.shape)
print('After dropping NA: review data :', review_df.shape)

# Filter out businesses that are not restaurants
restaurant_df = business_df[business_df['categories'].str.contains('Restaurants')]
restaurant_ids = restaurant_df['business_id'].values

# Filter out reviews that are not for restaurants
restaurant_review_df = review_df[review_df['business_id'].isin(restaurant_ids)]

print('Number of reviews for restaurants:', len(restaurant_review_df))
print(review_df.head())
print(business_df.head())

business data : (150346, 14)
review data : (6990280, 9)
After dropping NA: business data : (150243, 14)
After dropping NA: review data : (69903, 9)
Number of reviews for restaurants: 47195
                      review_id                 user_id  \
1295256  J5Q1gH4ACCj6CtQG7Yom7g  56gL9KEJNHiSDUoyjk2o3Q   
3297618  HlXP79ecTquSVXmjM10QxQ  bAt9OUFX9ZRgGLCXG22UmA   
1217795  JBBULrjyGx6vHto2osk_CQ  NRHPcLq2vGWqgqwVugSgnQ   
3730348  U9-43s8YUl6GWBFCpxUGEw  PAxc0qpqt5c2kA0rjDFFAg   
1826590  8T8EGa_4Cj12M6w8vRgUsQ  BqPR1Dp5Rb_QYs9_fz9RiA   

                    business_id  stars  useful  funny  cool  \
1295256  8yR12PNSMo6FBYx1u5KPlw    2.0       1      0     0   
3297618  pBNucviUkNsiqhJv5IFpjg    5.0       0      0     0   
1217795  8sf9kv6O4GgEb0j1o22N1g    5.0       0      0     0   
3730348  XwepyB7KjJ-XGJf0vKc6Vg    4.0       0      0     0   
1826590  prm5wvpp0OHJBlrvTj9uOg    5.0       0      0     0   

                                                      text  \
1295256  Went f
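
The shapes printed above show roughly 7 million reviews being loaded in full before a 1% sample is taken. As a minimal sketch, assuming the same CSV layout, the sampling could instead happen while the file is read in chunks. `sample_csv_in_chunks` is a hypothetical helper that is not part of this notebook, the toy CSV is illustrative only, and the rows it selects would differ from those chosen by `sample(frac=0.01, random_state=42)` on the full frame.

```python
import io
import pandas as pd

def sample_csv_in_chunks(path_or_buffer, frac=0.01, chunksize=100_000,
                         required=('text',), random_state=42):
    """Hypothetical helper: read a large CSV in chunks, drop rows missing
    any `required` column, and keep a random fraction of each chunk."""
    pieces = []
    reader = pd.read_csv(path_or_buffer, chunksize=chunksize)
    for i, chunk in enumerate(reader):
        chunk = chunk.dropna(subset=list(required))
        # Vary the seed per chunk so chunks are not sampled identically
        pieces.append(chunk.sample(frac=frac, random_state=random_state + i))
    return pd.concat(pieces)

# Toy stand-in for yelp_review.csv (illustrative data only)
toy_csv = io.StringIO(
    "review_id,business_id,stars,text\n"
    + "\n".join(f"r{i},b{i % 5},{i % 5 + 1},review {i}" for i in range(1000))
)
sample = sample_csv_in_chunks(toy_csv, frac=0.1, chunksize=200)
print(sample.shape)
```

Sampling per chunk keeps peak memory near one chunk plus the retained rows, at the cost of a sample that is stratified by file position rather than drawn from the whole file at once.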

In [6]:
#This is an auxiliary dataset, used in addition to the two Yelp datasets mentioned above
population_df = pd.read_csv('us_pop_by_state.csv')

## Merging DataFrames

In [7]:
#merge two dataframes
merged_df = pd.merge(restaurant_review_df, restaurant_df, on='business_id', how='inner')
merged_df.head()

| | review_id | user_id | business_id | stars_x | useful | funny | cool | text | date | name | ... | state | postal_code | latitude | longitude | stars_y | review_count | is_open | attributes | categories | hours |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | J5Q1gH4ACCj6CtQG7Yom7g | 56gL9KEJNHiSDUoyjk2o3Q | 8yR12PNSMo6FBYx1u5KPlw | 2.0 | 1 | 0 | 0 | Went for lunch and found that my burger was me... | 2018-04-04 21:09:53 | Bru Burger Bar - Indianapolis | ... | IN | 46204 | 39.773307 | -86.152091 | 4.0 | 1608 | 1 | {'BikeParking': 'True', 'GoodForKids': 'True',... | Restaurants, Gluten-Free, Bars, Food, Nightlif... | {'Monday': '0:0-0:0', 'Tuesday': '11:0-21:0', ... |
| 1 | U9-43s8YUl6GWBFCpxUGEw | PAxc0qpqt5c2kA0rjDFFAg | XwepyB7KjJ-XGJf0vKc6Vg | 4.0 | 0 | 0 | 0 | Been here a few times to get some shrimp. The... | 2013-04-27 01:55:49 | Joe's Seafood | ... | PA | 19403 | 40.139854 | -75.387321 | 3.0 | 7 | 0 | {'OutdoorSeating': 'False', 'RestaurantsGoodFo... | Seafood Markets, Food, Restaurants, Seafood, S... | {'Monday': '9:0-19:0', 'Tuesday': '9:0-19:0', ... |
| 2 | 8T8EGa_4Cj12M6w8vRgUsQ | BqPR1Dp5Rb_QYs9_fz9RiA | prm5wvpp0OHJBlrvTj9uOg | 5.0 | 0 | 0 | 0 | This is one fantastic place to eat whether you... | 2019-05-15 18:29:25 | Rumbi Island Grill | ... | ID | 83709 | 43.604645 | -116.289834 | 3.5 | 29 | 0 | {'HasTV': 'False', 'GoodForMeal': "{'dessert':... | Sandwiches, Salad, Restaurants, Hawaiian, Cari... | {'Monday': '11:0-21:0', 'Tuesday': '11:0-21:0'... |
| 3 | 18E_haOfOm8ks-A7SlVWRg | bnDZpsii_if2_wpn8oPcig | bK0j7YtVyN98UnM_8fUONg | 3.0 | 1 | 1 | 1 | Dirt cheap happy hour specials. Half priced d... | 2011-11-08 01:30:27 | Tavern on Broad | ... | PA | 19102 | 39.949119 | -75.164844 | 2.5 | 282 | 0 | {'NoiseLevel': "u'very_loud'", 'RestaurantsPri... | American (New), Bars, Sports Bars, Restaurants... | {'Monday': '11:0-2:0', 'Tuesday': '11:0-2:0', ... |
| 4 | xQVDB9xRdpLmPh9XMQ6Gvg | yy38DH7ENFTJ10-d4GUlig | S26FJcC298XNpN2cZiwOrA | 5.0 | 0 | 0 | 0 | Nothing beats pizza and beer in my book. This ... | 2012-12-24 02:18:18 | Pi Pizzeria - Central West End | ... | MO | 63108 | 38.648864 | -90.26069 | 4.0 | 488 | 1 | {'BusinessAcceptsCreditCards': 'True', 'Restau... | Restaurants, Pizza | {'Monday': '11:0-21:0', 'Tuesday': '11:0-21:0'... |


In [8]:
merged_df = pd.merge(merged_df, population_df)
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 0 entries
Data columns (total 26 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   review_id         0 non-null      object 
 1   user_id           0 non-null      object 
 2   business_id       0 non-null      object 
 3   stars_x           0 non-null      float64
 4   useful            0 non-null      int64  
 5   funny             0 non-null      int64  
 6   cool              0 non-null      int64  
 7   text              0 non-null      object 
 8   date              0 non-null      object 
 9   name              0 non-null      object 
 10  address           0 non-null      object 
 11  city              0 non-null      object 
 12  state             0 non-null      object 
 13  postal_code       0 non-null      object 
 14  latitude          0 non-null      float64
 15  longitude         0 non-null      float64
 16  stars_y           0 non-null      float64
 17  review_co
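
The `info()` output above reports 0 entries after the merge with `population_df`. A likely cause (an assumption, since the population file itself is not shown here) is that `pd.merge` without `on=` joins on the shared column name `state`, which holds two-letter codes in the Yelp data but full state names in the population data. The sketch below reproduces that behaviour on toy frames with placeholder numbers and shows a join on `state_code` instead.

```python
import pandas as pd

# Toy stand-ins (illustrative values only, not real data)
reviews = pd.DataFrame({
    'business_id': ['a', 'b', 'c'],
    'state': ['PA', 'FL', 'PA'],           # Yelp uses two-letter codes
    'stars': [4.0, 3.5, 5.0],
})
population = pd.DataFrame({
    'state': ['Pennsylvania', 'Florida'],  # full state names
    'state_code': ['PA', 'FL'],
    'population': [100, 200],              # placeholder numbers
})

# Default merge joins on the shared 'state' column; codes never equal names
naive = pd.merge(reviews, population)
print(len(naive))

# Join the Yelp code against the population file's code column instead
fixed = pd.merge(
    reviews,
    population.rename(columns={'state': 'state_name'}),
    left_on='state', right_on='state_code', how='left',
    validate='many_to_one',
)
print(fixed[['business_id', 'state', 'state_name', 'population']])
```

Renaming the population `state` column before merging avoids the `_x`/`_y` suffixes, and `validate='many_to_one'` raises an error if a state code appears more than once in the population table.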

## More Data Analysis

In [9]:
print(f'Num restaurants = {len(restaurant_ids)}')

Num restaurants = 52268


In [None]:
restaurant_review_df.info() #all columns are non-null

<class 'pandas.core.frame.DataFrame'>
Index: 47195 entries, 1295256 to 4428200
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   review_id    47195 non-null  object 
 1   user_id      47195 non-null  object 
 2   business_id  47195 non-null  object 
 3   stars        47195 non-null  float64
 4   useful       47195 non-null  int64  
 5   funny        47195 non-null  int64  
 6   cool         47195 non-null  int64  
 7   text         47195 non-null  object 
 8   date         47195 non-null  object 
dtypes: float64(1), int64(3), object(5)
memory usage: 3.6+ MB


In [None]:
#a few columns have null values, but they aren't prioritized for this analysis
restaurant_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 52268 entries, 3 to 150340
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   business_id   52268 non-null  object 
 1   name          52268 non-null  object 
 2   address       51825 non-null  object 
 3   city          52268 non-null  object 
 4   state         52268 non-null  object 
 5   postal_code   52247 non-null  object 
 6   latitude      52268 non-null  float64
 7   longitude     52268 non-null  float64
 8   stars         52268 non-null  float64
 9   review_count  52268 non-null  int64  
 10  is_open       52268 non-null  int64  
 11  attributes    51703 non-null  object 
 12  categories    52268 non-null  object 
 13  hours         44990 non-null  object 
dtypes: float64(3), int64(2), object(9)
memory usage: 6.0+ MB


In [None]:
#convert string to pandas datetime object
#(assign returns a new DataFrame, which avoids SettingWithCopyWarning on the filtered slice)
restaurant_review_df = restaurant_review_df.assign(
    date=pd.to_datetime(restaurant_review_df['date']))


In [None]:
#get date attributes for analyzing periodic patterns
#(the .dt accessor is vectorized; assign avoids SettingWithCopyWarning)
dates = restaurant_review_df['date']
restaurant_review_df = restaurant_review_df.assign(
    year=dates.dt.year, month=dates.dt.month,
    day=dates.dt.day, hour=dates.dt.hour)

In [None]:
import ast

#work on an explicit copy so adding columns does not trigger SettingWithCopyWarning
restaurant_df = restaurant_df.copy()
#convert the stringified dictionary to a dict (literal_eval is safer than eval)
restaurant_df['hours'] = restaurant_df['hours'].apply(
    lambda x: None if pd.isnull(x) else ast.literal_eval(x))
days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
#for each day, create a new column holding that day's opening hours
for day in days:
    restaurant_df[f'hr_{day.lower()}'] = restaurant_df['hours']\
        .apply(lambda x: None if not isinstance(x, dict) or day not in x else x[day])
restaurant_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 52268 entries, 3 to 150340
Data columns (total 21 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   business_id   52268 non-null  object 
 1   name          52268 non-null  object 
 2   address       51825 non-null  object 
 3   city          52268 non-null  object 
 4   state         52268 non-null  object 
 5   postal_code   52247 non-null  object 
 6   latitude      52268 non-null  float64
 7   longitude     52268 non-null  float64
 8   stars         52268 non-null  float64
 9   review_count  52268 non-null  int64  
 10  is_open       52268 non-null  int64  
 11  attributes    51703 non-null  object 
 12  categories    52268 non-null  object 
 13  hours         44990 non-null  object 
 14  hr_monday     38967 non-null  object 
 15  hr_tuesday    41783 non-null  object 
 16  hr_wednesday  43742 non-null  object 
 17  hr_thursday   44429 non-null  object 
 18  hr_friday     44644 non-null  
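
The `hr_*` columns hold strings such as `'11:0-21:0'` (see the `hours` values in the merged table earlier). As a minimal sketch, assuming every non-null value follows that `H:M-H:M` pattern, a span could be turned into a number of hours open. `hours_open` is a hypothetical helper, not part of the notebook, and treating `'0:0-0:0'` as zero hours is an assumption: it may equally mean closed or open all day.

```python
def hours_open(span):
    """Hypothetical helper: convert an 'H:M-H:M' string into hours open."""
    if not isinstance(span, str):   # None or NaN: no hours listed
        return None
    start, end = span.split('-')
    start_h, start_m = (int(part) for part in start.split(':'))
    end_h, end_m = (int(part) for part in end.split(':'))
    minutes = (end_h * 60 + end_m) - (start_h * 60 + start_m)
    if minutes < 0:                 # closing time falls after midnight
        minutes += 24 * 60
    return minutes / 60

print(hours_open('11:0-21:0'))  # 10.0
print(hours_open('11:0-2:0'))   # 15.0 (closes at 2 am the next day)
print(hours_open('0:0-0:0'))    # 0.0 under this sketch's assumption
```

It could be applied per column, for example `restaurant_df['hr_monday'].apply(hours_open)`.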

In [None]:
#plot the number of restaurants by state
restaurant_by_state = restaurant_df.groupby('state')['business_id']\
.count().reset_index().sort_values(by = 'business_id', ascending=False)
print(f'Included States = {restaurant_by_state.shape[0]}')

plt.figure(figsize=(12, 8))
plt.title('Number of Restaurants by State')
plt.xlabel('State')
plt.ylabel('Number of Restaurants')
sns.barplot(
    x = restaurant_by_state['state'],
    y = restaurant_by_state['business_id']
)

Included States = 19

In [None]:
#plot mean ratings by state
restaurant_by_state = restaurant_df.groupby('state')['stars'].mean()\
    .sort_values(ascending=False).reset_index()
plt.figure(figsize=(12, 8))
plt.title('Mean Stars by State')
plt.xlabel('State')
plt.ylabel('Stars')
sns.barplot(
    x = restaurant_by_state['state'],
    y = restaurant_by_state['stars']
)


In [None]:
#plot distribution of individual review ratings
plt.hist(restaurant_review_df['stars'])
plt.title('Distribution of restaurant review ratings')
plt.xlabel('Rating')
plt.ylabel('Count')

In [None]:
plot_geomap(restaurant_df, col = 'stars', task_name = 'ratings', by_state = False)

In [None]:
state_df = restaurant_df[restaurant_df['state'] == 'FL']
plot_geomap(state_df, col = 'stars', task_name = 'ratings', by_state = False)

In [None]:
state_df = restaurant_df[restaurant_df['state'] == 'PA']
plot_geomap(state_df, col = 'stars', task_name = 'ratings', by_state = False)

In [None]:
plt.scatter(restaurant_df['longitude'], restaurant_df['latitude'])

In [None]:
_ = plt.hist(restaurant_df['longitude'], bins = 100)
plt.title('Longitude Distribution')
plt.xlabel('Longitude')
plt.ylabel('count')

In [None]:
_ = plt.hist(restaurant_df['latitude'], bins = 100)
plt.title('Latitude Distribution')
plt.xlabel('Latitude')
plt.ylabel('count')

The plots above show the number of restaurants and the mean rating by state, followed by the geographic distribution of restaurants. Yelp's coverage is skewed toward urban areas and is geographically sparse, which makes a national-level map less effective, so we also plot a few state-level maps to look for trends. In those state-level views, ratings fall mostly between 3 and 5 stars, and the reviewed restaurants cluster in metropolitan areas.
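
The urban skew described above can also be expressed as a number. The following is a minimal sketch, not the notebook's method: it bins coordinates into a latitude/longitude grid and reports the share of rows in the busiest cells. `cell_concentration` is a hypothetical helper, and the one-degree cell size and the toy coordinates are illustrative assumptions.

```python
import pandas as pd

def cell_concentration(df, top_n=3, cell_size=1.0):
    """Hypothetical helper: share of rows that fall in the top_n busiest
    latitude/longitude grid cells of width `cell_size` degrees."""
    cells = pd.DataFrame({
        'lat_cell': (df['latitude'] // cell_size) * cell_size,
        'lon_cell': (df['longitude'] // cell_size) * cell_size,
    })
    counts = cells.value_counts()   # one count per cell, largest first
    return counts.head(top_n).sum() / len(df)

# Toy coordinates: three points in one cell, two isolated points
toy = pd.DataFrame({
    'latitude': [39.95, 39.96, 39.94, 27.95, 43.60],
    'longitude': [-75.16, -75.17, -75.15, -82.46, -116.29],
})
print(cell_concentration(toy, top_n=1))
```

On the real data this would be called as `cell_concentration(restaurant_df, top_n=10)`; a value close to 1 would indicate that most restaurants sit in a handful of grid cells.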

In [None]:
#business-level average ratings (restaurant_df['stars']), not individual review ratings
plt.hist(restaurant_df['stars'])
plt.title('Restaurant Mean Rating Distribution')
plt.xlabel('Stars')
plt.ylabel('count')


In [None]:
monthly_ratings = restaurant_review_df.groupby('month')['stars'].mean()
plt.plot(monthly_ratings)
plt.xlabel('Month')
plt.ylabel('Mean stars')

In [None]:
daily_ratings = restaurant_review_df.groupby('day')['stars'].mean()
plt.plot(daily_ratings)
plt.xlabel('Day of month')
plt.ylabel('Mean stars')

The plots above show the mean rating by month of the year and by day of the month. There are some fluctuations, but the differences in mean rating are too small to conclude that these features influence ratings. A hypothesis test could validate this observation.
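
One way to run such a test, sketched here under stated assumptions rather than as the notebook's chosen method, is a permutation test: shuffle the ratings across months many times and check how often the spread of the monthly means is at least as large as the observed one. `permutation_test_group_means` is a hypothetical helper, and the toy data are random ratings with no month effect built in.

```python
import numpy as np
import pandas as pd

def permutation_test_group_means(values, groups, n_perm=2000, seed=0):
    """Permutation test for differences between group means.
    Test statistic: variance of the per-group means."""
    values = np.asarray(values, dtype=float)
    groups = np.asarray(groups)
    rng = np.random.default_rng(seed)

    def stat(v):
        return pd.Series(v).groupby(groups).mean().var()

    observed = stat(values)
    # Shuffling the values breaks any link between rating and group
    perm_stats = np.array([stat(rng.permutation(values)) for _ in range(n_perm)])
    # Add-one correction keeps the p-value strictly positive
    p_value = (1 + np.sum(perm_stats >= observed)) / (n_perm + 1)
    return observed, p_value

# Toy data: ratings drawn independently of month (illustrative only)
toy_rng = np.random.default_rng(1)
toy = pd.DataFrame({
    'month': np.tile(np.arange(1, 13), 50),
    'stars': toy_rng.integers(1, 6, size=600),
})
observed, p_value = permutation_test_group_means(
    toy['stars'], toy['month'], n_perm=500)
print(f'variance of monthly means = {observed:.4f}, p = {p_value:.3f}')
```

On the real data the call would be `permutation_test_group_means(restaurant_review_df['stars'], restaurant_review_df['month'])`. A permutation test makes no normality assumption, which suits star ratings that take only five discrete values.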

# **Project Part II: Analysis**

## **Descriptive Statistics**

In [None]:
descriptive_stats = merged_df.describe().T
descriptive_stats['mode'] = merged_df.mode().iloc[0]
display(descriptive_stats)

IndexError: single positional indexer is out-of-bounds
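
This `IndexError` is consistent with `merged_df` being empty after the merge with the population data (the earlier `info()` output shows 0 entries): `mode()` on an empty frame returns no rows, so `.iloc[0]` has nothing to select. Below is a minimal sketch of a guarded version. `describe_with_mode` is a hypothetical helper, the toy frame is illustrative only, and restricting the summary to numeric columns is a design choice of this sketch.

```python
import pandas as pd

def describe_with_mode(df):
    """Hypothetical helper: numeric summary plus the first mode per column."""
    if df.empty:
        raise ValueError('DataFrame is empty; check the upstream merge')
    numeric = df.select_dtypes('number')
    stats = numeric.describe().T
    # mode() can return several rows when there are ties; keep the first
    stats['mode'] = numeric.mode().iloc[0]
    return stats

# Toy frame (illustrative values only)
toy = pd.DataFrame({
    'stars': [5.0, 4.0, 5.0, 3.0],
    'useful': [0, 1, 0, 0],
    'state': ['PA', 'PA', 'FL', 'MO'],
})
print(describe_with_mode(toy))
```

Failing early with a descriptive message points back to the merge step instead of surfacing as an indexing error further downstream.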