Identifying and Analyzing Patterns in NYC Restaurant Inspections

A project examining a dataset of health inspection results for restaurants in the five boroughs of NYC. I generate various EDA visualizations of the data and conduct a statistical hypothesis test relating restaurant grades to local wealth of neighborhoods. I also use geo-plotting libraries in Python to create choropleth and heat maps of the data.

Data Sources

Official dataset from the NYC Department of Health on every health inspection result for all restaurants in the city. Link
Python package uszipcode for gathering NYC median income data Link

Background Information

In 2019, NYC's restaurant industry consisted of 24,000 restaurants and 317,000 jobs -- both of these numbers were all time highs. Further, the growth rate of the restaurant industry in the preceding 10-year timeframe doubled the growth rate of overall city businesses.
The pandemic in 2020 drastically changed, along with many other things, the size of the restaurant industry in NYC. This project will only look at data in the completed calendar years 2017-2019.
The DoH conducts health inspections of every one of these establishments on a regular cycle, giving a certain number of points for each sanitary violation found:
"A" grade: 0-13 points
"B" grade: 14-27 points
"C" grade: 28 or more points

Exploratory Data Analysis

I first performed early EDA on the counts of grades and grade distributions across boroughs. I broke down how the numerical scores were distributed, and also took the top 10 cuisine types in the dataset, and looked at the proportion of grades distributed.

Distribution of scores:

By cuisine type:

Incorporating Median Income of Zip Code

I try to see if there is a relationship between the overall wealth of a neighborhood (as measured by median income) and the rate of health grades of its restaurants. To do this, I introduce new data that contains measurements of median income for all NYC postal codes - of which there are roughly ~160.

We see a small, but observable difference in the median incomes for A-graded restaurants to B and C-graded restaurants. We now perform a statistical hypothesis test to see if the difference observed is indeed statistically significant.

Hypothesis Testing

Method: Mann-Whitney U-test. See income_htest notebook for details on p-value calculation

Linear Regression Model

I create a simple linear regression model relating the median income of a postal code, and the percentage of A-graded restaurants in that zip code.

I obtained a p-value of 0.009 and an R-squared of 0.090. There appears to be a weak positive correlation between these two metrics of a given NYC zip code.

Geo-plotting with Folium

Due to the availability of lat-long coordinates for every restaurant in the dataset, along with zip-code data, I used both choropleth maps and heat maps available in the Folium python package to geographically plot the data.

Choropleth map showing percentage of A graded restaurants per zip code

Choropleth map showing median income of zip code

Heat map of where restaurant violations are cited using lat-long coordinates

Further Work

Obtain Yelp price point data for each restaurant
Incorporate ethnic demographic numbers for neighborhoods of New York
Create composite map of Percentage A vs. Income map. This would make it easier to read and provide better information at a glance.
- https://www.joshuastevens.net/cartography/make-a-bivariate-choropleth-map/

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
data		data
img		img
notebooks		notebooks
src		src
.gitignore		.gitignore
README.md		README.md
presentation.pdf		presentation.pdf

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

data

data

img

img

notebooks

notebooks

src

src

.gitignore

.gitignore

README.md

README.md

presentation.pdf

presentation.pdf

Repository files navigation

Identifying and Analyzing Patterns in NYC Restaurant Inspections

Data Sources

Background Information

Exploratory Data Analysis

Incorporating Median Income of Zip Code

Hypothesis Testing

Linear Regression Model

Geo-plotting with Folium

Further Work

About

Releases

Packages

Languages

twang94/RestaurantInspections

Folders and files

Latest commit

History

Repository files navigation

Identifying and Analyzing Patterns in NYC Restaurant Inspections

Data Sources

Background Information

Exploratory Data Analysis

Incorporating Median Income of Zip Code

Hypothesis Testing

Linear Regression Model

Geo-plotting with Folium

Further Work

About

Resources

Stars

Watchers

Forks

Languages