In [1]:
import numpy as np 
import pandas as pd 
from pandas import Series, DataFrame 
import seaborn as sns 
import matplotlib.pyplot as plt 
%matplotlib inline

In [2]:
restaurant_df = pd.read_csv('DOHMH_Python.csv')

In [3]:
restaurant_df.head()

Unnamed: 0,DBA,BORO,BUILDING,STREET,ZIPCODE,CUISINE,ACTION,VIOLATION,VIOLATION_DESCRIPTION,CRITICAL_FLAG,SCORE,GRADE,INSPECTION_TYPE
0,STARBUCKS,Manhattan,78,SPRING STREET,10012.0,Café/Coffee/Tea,Violations were cited in the following area(s).,10B,Plumbing not properly installed or maintained;...,N,9.0,A,Cycle Inspection / Initial Inspection
1,110 KENNEDY FRIED CHICKEN,Staten Island,110,VICTORY BOULEVARD,10301.0,Chicken,Establishment Closed by DOHMH. Violations wer...,02C,Hot food item that has been cooked and refrige...,Y,39.0,,Pre-permit (Operational) / Initial Inspection
2,HQ CLUB,Manhattan,552,WEST 38 STREET,10018.0,American,Violations were cited in the following area(s).,04N,Filth flies or food/refuse/sewage-associated (...,Y,13.0,A,Cycle Inspection / Initial Inspection
3,RESTAURANT TATIANA,Brooklyn,3152,BRIGHTON 6 STREET,11235.0,Russian,Violations were cited in the following area(s).,10F,Non-food contact surface improperly constructe...,N,12.0,A,Cycle Inspection / Re-inspection
4,KO SUSHI,Manhattan,1329,2 AVENUE,10021.0,Japanese,Violations were cited in the following area(s).,10B,Plumbing not properly installed or maintained;...,N,7.0,,Cycle Inspection / Initial Inspection


In [4]:
restaurant_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 316935 entries, 0 to 316934
Data columns (total 13 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   DBA                    316935 non-null  object 
 1   BORO                   316935 non-null  object 
 2   BUILDING               316875 non-null  object 
 3   STREET                 316935 non-null  object 
 4   ZIPCODE                314882 non-null  float64
 5   CUISINE                316935 non-null  object 
 6   ACTION                 316935 non-null  object 
 7   VIOLATION              313672 non-null  object 
 8   VIOLATION_DESCRIPTION  311073 non-null  object 
 9   CRITICAL_FLAG          311073 non-null  object 
 10  SCORE                  304473 non-null  float64
 11  GRADE                  160251 non-null  object 
 12  INSPECTION_TYPE        316935 non-null  object 
dtypes: float64(2), object(11)
memory usage: 31.4+ MB


In [7]:
restaurant_df.describe()

Unnamed: 0,ZIPCODE,SCORE
count,314882.0,304473.0
mean,10542.432562,20.587136
std,564.366569,14.943915
min,10001.0,-1.0
25%,10018.0,11.0
50%,10310.0,16.0
75%,11215.0,26.0
max,12345.0,164.0


From this initial exploration, I learned that the dataset has 13 columns and 316,935 rows. Most of the columns are qualitative; the restaurant score is the most important quantitative column. 
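
As a rough illustration of the same kind of summary, the sketch below separates numeric from non-numeric columns and counts missing values. It runs on a small made-up frame rather than the real file: `toy_df` and its values are assumptions for illustration, and only the column names are borrowed from the dataset.

```python
import pandas as pd

# Toy frame with a few of the notebook's column names; the values are made up.
toy_df = pd.DataFrame({
    "DBA": ["A CAFE", "B DINER", "C GRILL"],
    "SCORE": [9.0, None, 30.0],
    "GRADE": ["A", None, None],
})

# Quantitative vs. qualitative columns, split by dtype.
numeric_cols = toy_df.select_dtypes(include="number").columns.tolist()
text_cols = toy_df.select_dtypes(exclude="number").columns.tolist()

# Number of missing values in each column.
missing = toy_df.isna().sum()

print(numeric_cols, text_cols)
print(missing)
```

Applied to `restaurant_df`, the same calls should give the information that `info()` reports, but as objects that can be reused later in the notebook.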

Each restaurant also has a numeric score based on the number and type of violations it receives. There are three overarching categories of violations, each with an associated point value: 


General - 2 points | 
Critical - 5 points |
Public Health Hazard - 7 points 

The final score is the sum of these points, and a letter grade is assigned based on that score, so a lower score is better. Restaurants with scores of 0-13 points are given an A, 14-27 points a B, and 28 or more points a C. 
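
The scoring rule described above can be sketched in a few lines. This is a simplified illustration that uses only the point values and grade cutoffs stated here; the names `VIOLATION_POINTS`, `total_score`, and `score_to_grade` are hypothetical and are not part of the dataset or of any DOHMH tool.

```python
# Point values and grade cutoffs as described in the text; all names here
# are hypothetical helpers for illustration only.
VIOLATION_POINTS = {"General": 2, "Critical": 5, "Public Health Hazard": 7}


def total_score(violations):
    """Sum the points for a list of violation categories."""
    return sum(VIOLATION_POINTS[v] for v in violations)


def score_to_grade(score):
    """Map a numeric inspection score to a letter grade (lower is better)."""
    if score < 0:
        raise ValueError("a score cannot be negative")
    if score <= 13:
        return "A"
    if score <= 27:
        return "B"
    return "C"


example = ["General", "Critical", "Public Health Hazard"]
score = total_score(example)
print(score, score_to_grade(score))  # prints "14 B"
```

Under this rule a negative score can never occur, which is why `score_to_grade` rejects one instead of assigning a grade.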

Even though I had cleaned the data in Excel beforehand, the describe() function provided more insight. One error I found is that the minimum score is -1, which cannot be a valid score because the score is a sum of positive point values. I will remove the rows with that score using the code below. 

In [14]:
#see the rows that have a restaurant with a score of -1 
restaurant_df[restaurant_df.SCORE == -1]

Unnamed: 0,DBA,BORO,BUILDING,STREET,ZIPCODE,CUISINE,ACTION,VIOLATION,VIOLATION_DESCRIPTION,CRITICAL_FLAG,SCORE,GRADE,INSPECTION_TYPE
2666,ELI'S ESSENTIALS,Manhattan,1291,LEXINGTON AVENUE,10028.0,American,Violations were cited in the following area(s).,08A,Facility not vermin proof. Harborage or condit...,N,-1.0,,Cycle Inspection / Initial Inspection
5174,PERK KAFE,Manhattan,534,EAST 14 STREET,10009.0,Café/Coffee/Tea,Violations were cited in the following area(s).,05A,Sewage disposal system improper or unapproved.,Y,-1.0,,Pre-permit (Operational) / Initial Inspection
10105,THE CORNER CAFE,Manhattan,729,6 AVENUE,10010.0,Sandwiches/Salads/Mixed Buffet,Violations were cited in the following area(s).,04L,Evidence of mice or live mice present in facil...,Y,-1.0,,Cycle Inspection / Initial Inspection
10838,TANG PAVILION,Manhattan,65,WEST 55 STREET,10019.0,Chinese,Violations were cited in the following area(s).,04L,Evidence of mice or live mice present in facil...,Y,-1.0,C,Cycle Inspection / Re-inspection
12244,KENKA,Manhattan,25,ST MARKS PLACE,10003.0,Japanese,Violations were cited in the following area(s).,06D,"Food contact surface not properly washed, rins...",Y,-1.0,,Cycle Inspection / Initial Inspection
...,...,...,...,...,...,...,...,...,...,...,...,...,...
305553,YURI SUSHI,Manhattan,374,WEST 46 STREET,10036.0,Japanese,Violations were cited in the following area(s).,06D,"Food contact surface not properly washed, rins...",Y,-1.0,,Cycle Inspection / Initial Inspection
306761,TANG PAVILION,Manhattan,65,WEST 55 STREET,10019.0,Chinese,Violations were cited in the following area(s).,06D,"Food contact surface not properly washed, rins...",Y,-1.0,C,Cycle Inspection / Re-inspection
313629,KENKA,Manhattan,25,ST MARKS PLACE,10003.0,Japanese,Violations were cited in the following area(s).,10H,Proper sanitization not provided for utensil w...,N,-1.0,,Cycle Inspection / Initial Inspection
315276,KAI FAN ASIAN CUISINE,Bronx,3717,RIVERDALE AVENUE,10463.0,Jewish/Kosher,Violations were cited in the following area(s).,06D,"Food contact surface not properly washed, rins...",Y,-1.0,B,Cycle Inspection / Re-inspection


In [15]:
# delete the rows with a score of -1 (rows with a missing score are kept)
restaurant_df = restaurant_df[restaurant_df.SCORE != -1]

In [16]:
#check my work using describe function again 
restaurant_df.describe()

Unnamed: 0,ZIPCODE,SCORE
count,314768.0,304356.0
mean,10542.541316,20.595434
std,564.379829,14.940791
min,10001.0,0.0
25%,10018.0,11.0
50%,10310.0,16.0
75%,11215.0,26.0
max,12345.0,164.0


In [17]:
#store this new df to use in future notebooks - Analysis of Restaurant Grades 
%store restaurant_df

Stored 'restaurant_df' (DataFrame)


See http://www.hicathy.com/restaurant_data/index.php for a full analysis. 