## Wake County - Restaurant Food Inspections Analysis

In [1]:
# import pandas, numpy, matplotlib, seaborn 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# importing the requests library
import requests

In [None]:
%pip install ipynb

### Resources
 1. [Restaurants in Wake County Data Info](https://www.arcgis.com/home/item.html?id=124c2187da8c41c59bde04fa67eb2872)
 2. [Wake County Open Data](https://data-wake.opendata.arcgis.com/search?tags=restaurants)
 3. [Food Inspection Violations Data Info](https://data.wakegov.com/datasets/Wake::food-inspection-violations/about)
 4. [Wake County Yelp Initiative](https://ash.harvard.edu/news/wake-county-yelp-initiative)
 5. [Yelp LIVES data](https://www.yelp.com/healthscores/feeds)

In [2]:
# Requires the ipynb package: run `pip install ipynb` if these imports fail.
# The first time this cell runs, it executes the imported notebooks; re-run it if you'd like.
# Known issue: arguments to the imported functions don't work, so call the no-argument versions to pull data.
from ipynb.fs.full.RestaurantInspectionsData import getFoodInspectionsDf, preprocess_inspections
from ipynb.fs.full.RestaurantsData import getRestaurantsDf, preprocess_restaurants
from ipynb.fs.full.RestaurantViolationsData import getViolationsDf, preprocess_violations
from ipynb.fs.full.WeatherData import getWeatherData, preprocess_weatherdata
from ipynb.fs.full.YelpReviewData import preprocess_restaurants_yelp
from ipynb.fs.full.CrimeData import getCrimeDataDf, preprocess_crimedata

Using pre-fetched inspections data
(20956, 8)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20956 entries, 0 to 20955
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   OBJECTID     20956 non-null  int64  
 1   HSISID       20956 non-null  int64  
 2   SCORE        20956 non-null  float64
 3   DATE         20956 non-null  object 
 4   DESCRIPTION  16880 non-null  object 
 5   TYPE         20956 non-null  object 
 6   INSPECTOR    20956 non-null  object 
 7   PERMITID     20956 non-null  int64  
dtypes: float64(1), int64(3), object(4)
memory usage: 1.3+ MB


None

{'OBJECTID': 20956,
 'HSISID': 3875,
 'SCORE': 48,
 'DATE': 782,
 'DESCRIPTION': 6030,
 'TYPE': 2,
 'INSPECTOR': 47,
 'PERMITID': 3875}

Using pre-fetched restaurants data
restaurants df shape: (3641, 15)


Unnamed: 0,OBJECTID,HSISID,NAME,ADDRESS1,ADDRESS2,CITY,STATE,POSTALCODE,PHONENUMBER,RESTAURANTOPENDATE,FACILITYTYPE,PERMITID,X,Y,GEOCODESTATUS
0,1891530,4092016487,PEACE CHINA,13220 Strickland RD,Ste 167,RALEIGH,NC,27613,(919) 676-9968,2013-08-14T04:00:00Z,Restaurant,2,-78.725938,35.908783,M
1,1891531,4092018622,Northside Bistro & Cocktails,832 SPRING FOREST RD,,RALEIGH,NC,27609,(919) 890-5225,2021-05-13T04:00:00Z,Restaurant,22,-78.622635,35.866275,M
2,1891532,4092016155,DAILY PLANET CAFE,11 W JONES ST,STE 1509,RALEIGH,NC,27601,(919) 707-8060,2012-04-12T04:00:00Z,Restaurant,26,-78.639431,35.782205,M
3,1891533,4092016161,HIBACHI 88,3416 POOLE RD,,RALEIGH,NC,27610,(919) 231-1688,2012-04-18T04:00:00Z,Restaurant,28,-78.579533,35.767246,M
4,1891534,4092017180,BOND BROTHERS BEER COMPANY,202 E CEDAR ST,,CARY,NC,27511,(919) 459-2670,2016-03-11T05:00:00Z,Restaurant,29,-78.778021,35.787986,M



Display Raw Data Info------------------------------

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3641 entries, 0 to 3640
Data columns (total 15 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   OBJECTID            3641 non-null   int64  
 1   HSISID              3641 non-null   int64  
 2   NAME                3641 non-null   object 
 3   ADDRESS1            3641 non-null   object 
 4   ADDRESS2            485 non-null    object 
 5   CITY                3641 non-null   object 
 6   STATE               3641 non-null   object 
 7   POSTALCODE          3641 non-null   object 
 8   PHONENUMBER         3487 non-null   object 
 9   RESTAURANTOPENDATE  3641 non-null   object 
 10  FACILITYTYPE        3641 non-null   object 
 11  PERMITID            3641 non-null   int64  
 12  X                   3641 non-null   float64
 13  Y                   3641 non-null   float64
 14  GEOCODESTATUS       3641 non-null   object 
dtypes

None


---------------------------------------------------



{'OBJECTID': 3641,
 'HSISID': 3641,
 'NAME': 3507,
 'ADDRESS1': 3164,
 'ADDRESS2': 298,
 'CITY': 45,
 'STATE': 1,
 'POSTALCODE': 565,
 'PHONENUMBER': 3127,
 'RESTAURANTOPENDATE': 2250,
 'FACILITYTYPE': 10,
 'PERMITID': 3641,
 'X': 2154,
 'Y': 2154,
 'GEOCODESTATUS': 3}


Preprocessing--------------------------------------

Dropping columns with more than 25% missing values: Index(['ADDRESS2'], dtype='object')
OBJECTID              0.0
HSISID                0.0
NAME                  0.0
ADDRESS1              0.0
CITY                  0.0
POSTALCODE            0.0
RESTAURANTOPENDATE    0.0
PERMITID              0.0
X                     0.0
Y                     0.0
GEOCODESTATUS         0.0
dtype: float64
(2385, 11)

Display--------------------------------------------



Unnamed: 0,OBJECTID,HSISID,NAME,ADDRESS1,CITY,POSTALCODE,RESTAURANTOPENDATE,PERMITID,X,Y,GEOCODESTATUS
0,1891530,4092016487,PEACE CHINA,13220 Strickland RD,RALEIGH,27613,2013-08-14,2,-78.725938,35.908783,M
1,1891531,4092018622,Northside Bistro & Cocktails,832 SPRING FOREST RD,RALEIGH,27609,2021-05-13,22,-78.622635,35.866275,M
2,1891532,4092016155,DAILY PLANET CAFE,11 W JONES ST,RALEIGH,27601,2012-04-12,26,-78.639431,35.782205,M
3,1891533,4092016161,HIBACHI 88,3416 POOLE RD,RALEIGH,27610,2012-04-18,28,-78.579533,35.767246,M
4,1891534,4092017180,BOND BROTHERS BEER COMPANY,202 E CEDAR ST,CARY,27511,2016-03-11,29,-78.778021,35.787986,M


Using pre-fetched violations data


ParserError: Error tokenizing data. C error: Calling read(nbytes) on source failed. Try engine='python'.
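
The `ParserError` above comes from pandas' C parser while reading the pre-fetched violations file, and the message itself suggests trying `engine='python'`. Below is a minimal sketch of a loader that retries with the Python engine. The helper name `read_csv_with_fallback` is hypothetical and is not part of the `RestaurantViolationsData` notebook. If the cached file is truncated or corrupt, the fallback may fail as well, in which case re-fetching the data is likely the safer fix.

```python
import pandas as pd


def read_csv_with_fallback(path, **kwargs):
    """Read a CSV with the default C engine, retrying with the Python engine.

    Hypothetical helper for illustration; `path` is whatever file the
    violations notebook caches locally.
    """
    try:
        return pd.read_csv(path, **kwargs)
    except pd.errors.ParserError:
        # The Python engine is slower but more tolerant of odd input
        return pd.read_csv(path, engine="python", **kwargs)
```
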

## Fetch Inspections

In [None]:
food_inspections_raw = getFoodInspectionsDf()
inspections = preprocess_inspections(food_inspections_raw.copy())
inspections.head()

## Fetch Restaurants

In [None]:
restaurants_raw = getRestaurantsDf()
restaurants = preprocess_restaurants(restaurants_raw.copy())
restaurants.head()
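
The preprocessing log for the restaurants data reports "Dropping columns with more than 25% missing values". A minimal sketch of that one step is shown here on a toy frame. This is an assumption about how such a filter could be written, not the actual body of `preprocess_restaurants`, which also drops other columns and rows. The function name `drop_sparse_columns` and the `max_missing` parameter are hypothetical.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df, max_missing=0.25):
    """Drop columns whose share of missing values exceeds max_missing."""
    missing_share = df.isna().mean()  # fraction of NaN per column
    sparse = missing_share[missing_share > max_missing].index
    print(f"Dropping columns with more than {max_missing:.0%} missing values: {sparse}")
    return df.drop(columns=sparse)


# Toy frame standing in for restaurants_raw (values are illustrative only)
toy = pd.DataFrame({
    "HSISID": [1, 2, 3, 4],
    "ADDRESS1": ["a", "b", "c", "d"],
    "ADDRESS2": ["Ste 1", np.nan, np.nan, np.nan],  # 75% missing
})
cleaned = drop_sparse_columns(toy)
```
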

## Fetch Violations

In [None]:
violations_raw = getViolationsDf()
violations = preprocess_violations(violations_raw.copy())
violations.head()

## Fetch Weather Data

In [None]:
weatherdata_raw = getWeatherData()
weatherdata = preprocess_weatherdata(weatherdata_raw.copy())
weatherdata.head()

## Fetch Yelp Ratings Data

In [None]:
# Get the list of restaurants with all relevant fields plus phone number, after preprocessing
restaurants_yelp_raw = getRestaurantsDf()
restaurants_yelp = preprocess_restaurants_yelp(restaurants_yelp_raw.copy())
restaurants_yelp.head()


In [None]:
# Write to CSV
restaurants_yelp.to_csv('restaurants_yelp.csv')

In [None]:
# Read in the processed Yelp data
yelpmatch_phone = pd.read_csv('yelpmatch_phone.csv')

In [None]:
yelpmatch_phone.head()
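
The file name `yelpmatch_phone.csv` suggests the Yelp match is keyed on phone number; that is an assumption, since the matching step itself is not shown in this notebook. The county data formats phone numbers like `(919) 676-9968`, so a match on phone would likely need a normalized form first. The sketch below converts US numbers to a digits-only string with a `+1` prefix, which is one common choice. The function name `normalize_us_phone` is hypothetical.

```python
import re

import pandas as pd


def normalize_us_phone(raw):
    """Return a +1XXXXXXXXXX string, or None if raw is not a 10-digit US number."""
    if pd.isna(raw):
        return None
    digits = re.sub(r"\D", "", str(raw))  # strip everything but digits
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop a leading country code
    return f"+1{digits}" if len(digits) == 10 else None


# Sample values in the PHONENUMBER format shown earlier, plus a missing entry
phones = pd.Series(["(919) 676-9968", None, "919-890-5225"])
normalized = phones.map(normalize_us_phone)
```
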

## Fetch Crime Proxy Data

## Next Steps (we have T-minus 2 weeks!)
0. Pull the police incidents/crime data and possibly census-tract income by location - Hearsch & Shyamal
1. Make sure Yelp data is sourced in this main notebook (minimal datapoints: ratings, dollar signs, type/cuisine, review metadata, other features at Ms. Park's discretion)
2. Clean and validate the data as part of data prep and EDA. Join tables by inspection: we want historical data per inspection, and then we want to predict risk scores to flag restaurants at high risk in future inspections. Note that although the inspection data is dated, we don't really want to do time series forecasting, because time series forecasting is a pain!
3. Deal with missing values and encode variables 
4. Feature engineering 
5. Baseline model
6. More complicated model
7. Datasheets for datasets (ask Jon in the next class whether we need a datasheet for every table or for every source) - Christine
8. Ethical checklist - Hearsch
9. Visualizations and storytelling!
10. Get started on a slideshow (FUN PART)
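
Step 2 above ("join tables by inspection") could look roughly like the sketch below: one row per inspection, restaurant attributes attached via `HSISID` (a column present in both tables), and the previous score at the same restaurant as a simple historical feature. The frames are toy stand-ins with illustrative scores and dates, and `PREV_SCORE` is a hypothetical feature name, not something the team has settled on.

```python
import pandas as pd

# Toy stand-ins for the preprocessed frames; scores and dates are illustrative only
inspections = pd.DataFrame({
    "HSISID": [4092016487, 4092016487, 4092018622],
    "DATE": pd.to_datetime(["2021-01-05", "2021-07-12", "2021-06-01"]),
    "SCORE": [96.5, 94.0, 98.0],
})
restaurants = pd.DataFrame({
    "HSISID": [4092016487, 4092018622],
    "NAME": ["PEACE CHINA", "Northside Bistro & Cocktails"],
    "CITY": ["RALEIGH", "RALEIGH"],
})

# One row per inspection; validate guards against duplicate restaurant rows
per_inspection = inspections.merge(
    restaurants, on="HSISID", how="left", validate="many_to_one"
)

# Previous score at the same restaurant (NaN for a restaurant's first inspection)
per_inspection = per_inspection.sort_values(["HSISID", "DATE"])
per_inspection["PREV_SCORE"] = per_inspection.groupby("HSISID")["SCORE"].shift(1)
```

Using lagged features like this keeps the problem as ordinary supervised learning on per-inspection rows rather than time series forecasting.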