**In this notebook we clean the data and save the resulting dataframes so that they can be loaded later for analysis.**

Finish with this line of code:

In [83]:
%matplotlib inline
import pandas as pd
import numpy as np
import re
from matplotlib.ticker import MaxNLocator
import matplotlib.pyplot as plt
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

**Dataframe Preparation**

In [84]:
food = pd.read_csv('data/food-inspections.csv', sep=',')

In [85]:
# Drop the columns that contain only NaN values
food.drop(['Zip Codes','Historical Wards 2003-2015', 'Community Areas', 'Census Tracts','Wards'], axis=1, inplace=True)

# Drop 'State' because it is always either NaN or 'IL' (Illinois), and drop 'Address' because it is not used in the analysis.
food.drop(['State','Address'], axis=1, inplace=True)

In [86]:
food.head()

Unnamed: 0,Inspection ID,DBA Name,AKA Name,License #,Facility Type,Risk,City,Zip,Inspection Date,Inspection Type,Results,Violations,Latitude,Longitude,Location
0,2320519,SALAM RESTAURANT,SALAM RESTAURANT,2002822.0,Restaurant,Risk 1 (High),CHICAGO,60625.0,2019-10-25T00:00:00.000,Complaint Re-Inspection,Pass,,41.965719,-87.708538,"{'latitude': '-87.70853756167853', 'longitude'..."
1,2320509,TAQUERIA EL DORADO,TAQUERIA EL DORADO,2694960.0,Restaurant,Risk 1 (High),CHICAGO,60625.0,2019-10-25T00:00:00.000,License Re-Inspection,Fail,"3. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL E...",41.96882,-87.682292,"{'latitude': '-87.6822915036914', 'longitude':..."
2,2320412,"DANTE'S PIZZA,INC.",DANTE'S PIZZA,2092884.0,Restaurant,Risk 1 (High),CHICAGO,60647.0,2019-10-24T00:00:00.000,Canvass,Fail,"38. INSECTS, RODENTS, & ANIMALS NOT PRESENT - ...",41.917539,-87.703728,"{'latitude': '-87.70372788811352', 'longitude'..."
3,2320430,LAO PENG YOU LLC,LAO PENG YOU,2694477.0,Restaurant,Risk 1 (High),CHICAGO,60622.0,2019-10-24T00:00:00.000,License Re-Inspection,Pass w/ Conditions,"3. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL E...",41.896005,-87.677938,"{'latitude': '-87.6779378973854', 'longitude':..."
4,2320384,ARBOR,ARBOR,2363029.0,Restaurant,Risk 1 (High),CHICAGO,60647.0,2019-10-24T00:00:00.000,Recent Inspection,Pass w/ Conditions,14. REQUIRED RECORDS AVAILABLE: SHELLSTOCK TAG...,41.932025,-87.692169,"{'latitude': '-87.69216904438716', 'longitude'..."


**Cleaning steps**

1) Drop rows without a location, then check Zips and cities.

2) Drop the facilities having a license # equal to 0 and keep only the 4 facility types we are going to analyse: restaurants, grocery stores, schools and hospitals.

3) Convert the date to an analysis-friendly format.

4) There are 3 risk levels (low, medium and high); rows with any other value will be removed.

5) For restaurants, we are going to focus only on the 5 most inspected chains of different types: McDonald's, Subway, Taco Bell, Starbucks and Dunkin Donuts.

6) We will add spatial coordinates to the dataframe using 'Geopandas' and a map of districts found online.

7) Remove the entries having a result that is not 'Fail', 'Pass' or 'Pass w/ Conditions'.

**Step 1**

We remove the rows without Longitudes/Latitudes because those are absolutely necessary for our work.

In [87]:
food.Latitude.isna().any() or food.Longitude.isna().any()

True

In [88]:
food.dropna(subset = ["Latitude", "Longitude"], inplace=True)
food.Latitude.isna().any() or food.Longitude.isna().any() or food.Location.isna().any()

False

In [89]:
# Drop Location because it is no longer necessary
food.drop(['Location'], axis=1, inplace=True)

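The Zip code will be useful for looking at the district location within Chicago. We keep only the two digits that follow the '606' prefix and mark every other value as 'check'.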

In [90]:
def zip_cleaner(npa):
    npa = str(npa)
    if npa.startswith('606'):
        npa = npa.split('606')[1]
        npa = npa.split('.')[0]
    else:
        npa = 'check'
    return npa

In [91]:
food['Zip'] = food['Zip'].apply(zip_cleaner)
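
The same cleaning can also be expressed without a row-by-row `apply`. The block below is only a sketch on a made-up sample (the `sample` frame and its values are assumptions, not rows from the dataset). It is slightly stricter than `zip_cleaner`, because it requires exactly two digits after the '606' prefix.

```python
import pandas as pd

# Hypothetical sample mimicking the float-typed 'Zip' column of the dataset.
sample = pd.DataFrame({"Zip": [60625.0, 60606.0, 60015.0, None]})

# Render each value as text, then capture the two digits following '606'.
# Values that do not match (other prefixes, NaN) come back as NaN and are
# replaced by the same 'check' marker that zip_cleaner uses.
as_text = sample["Zip"].astype(str)
sample["Zip"] = as_text.str.extract(r"^606(\d{2})", expand=False).fillna("check")

print(sample["Zip"].tolist())
```

Keeping the result as strings preserves leading zeros such as '06', which a numeric column would lose.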

In [92]:
food.Zip.unique()

array(['25', '47', '22', '38', '55', '28', '51', '03', '13', '40', '29',
       '39', '44', '08', '66', '11', '32', '45', '54', '23', '30', '43',
       '20', '49', 'check', '06', '19', '42', '60', '07', '59', '34',
       '52', '09', '41', '26', '57', '46', '21', '02', '05', '61', '14',
       '16', '12', '31', '36', '18', '04', '56', '24', '15', '37', '01',
       '10', '17', '33', '53', '27'], dtype=object)

In [93]:
food[food['Zip'] == 'check'].shape

(1300, 14)

We delete the rows with an unknown Zip (1,300 rows, a small share of the dataset) and also drop the 'City' column, because we now know that all facilities remaining in the dataframe are located in Chicago.

In [94]:
food.drop(food[food['Zip'] == 'check'].index, inplace=True)
food.drop(['City'], axis=1, inplace=True)

**Step 2**

In [95]:
food.drop(food[food['License #'] == 0].index, inplace=True)

In [96]:
food = food[food["Facility Type"].isin(["Restaurant","Grocery Store", "School", "Hospital"])]
food["Facility Type"].value_counts()

Restaurant       128862
Grocery Store     24678
School            11688
Hospital            517
Name: Facility Type, dtype: int64

**Step 3**

In [97]:
food["Inspection Date"] = food["Inspection Date"].str.split("-").str[0]
food.head()

Unnamed: 0,Inspection ID,DBA Name,AKA Name,License #,Facility Type,Risk,Zip,Inspection Date,Inspection Type,Results,Violations,Latitude,Longitude
0,2320519,SALAM RESTAURANT,SALAM RESTAURANT,2002822.0,Restaurant,Risk 1 (High),25,2019,Complaint Re-Inspection,Pass,,41.965719,-87.708538
1,2320509,TAQUERIA EL DORADO,TAQUERIA EL DORADO,2694960.0,Restaurant,Risk 1 (High),25,2019,License Re-Inspection,Fail,"3. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL E...",41.96882,-87.682292
2,2320412,"DANTE'S PIZZA,INC.",DANTE'S PIZZA,2092884.0,Restaurant,Risk 1 (High),47,2019,Canvass,Fail,"38. INSECTS, RODENTS, & ANIMALS NOT PRESENT - ...",41.917539,-87.703728
3,2320430,LAO PENG YOU LLC,LAO PENG YOU,2694477.0,Restaurant,Risk 1 (High),22,2019,License Re-Inspection,Pass w/ Conditions,"3. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL E...",41.896005,-87.677938
4,2320384,ARBOR,ARBOR,2363029.0,Restaurant,Risk 1 (High),47,2019,Recent Inspection,Pass w/ Conditions,14. REQUIRED RECORDS AVAILABLE: SHELLSTOCK TAG...,41.932025,-87.692169


In [98]:
#Check for missing years
food["Inspection Date"].isna().any()

False
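
Splitting the string keeps the year as text. An alternative, shown here only as a sketch on made-up values (the `dates` series is an assumption), is to parse the column with `pd.to_datetime` and read the year from the parsed dates. With `errors="coerce"`, malformed entries become missing values that the `isna()` check would then reveal.

```python
import pandas as pd

# Hypothetical sample in the same ISO-like format as 'Inspection Date'.
dates = pd.Series(["2019-10-25T00:00:00.000",
                   "2018-03-02T00:00:00.000",
                   "not a date"])

# errors="coerce" turns unparseable strings into NaT instead of raising.
parsed = pd.to_datetime(dates, errors="coerce")

# .dt.year gives the year as a number; NaT rows become NaN.
years = parsed.dt.year

print(years.tolist())
```

Note that this yields numeric years rather than the string years produced by the cell above, so downstream code would need to expect that type.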

**Step 4**

In [99]:
# Drop rows with a missing Risk, then confirm that none remain
food.dropna(subset = ["Risk"], inplace = True)
food.Risk.isna().any()

False

In [100]:
#Checking for any other value than the 3 levels of risk
food.Risk.value_counts()

Risk 1 (High)      122183
Risk 2 (Medium)     33251
Risk 3 (Low)        10297
All                    11
Name: Risk, dtype: int64

In [101]:
# Remove the rows whose Risk contains 'All'
food = food[~food["Risk"].str.contains('|'.join(["All"]))]
food.Risk.value_counts()

Risk 1 (High)      122183
Risk 2 (Medium)     33251
Risk 3 (Low)        10297
Name: Risk, dtype: int64
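
Since Step 4 is defined in terms of the three levels to keep, the filter can also be written as a whitelist. This is a minimal sketch on invented rows (the `risk` frame is an assumption). It matches the labels exactly instead of searching for a substring, and it discards missing values in the same pass.

```python
import pandas as pd

# Hypothetical sample with one unwanted label and one missing value.
risk = pd.DataFrame({"Risk": ["Risk 1 (High)", "All", "Risk 3 (Low)",
                              "Risk 2 (Medium)", None]})

# The three levels named in Step 4, spelled as they appear in the data.
allowed = ["Risk 1 (High)", "Risk 2 (Medium)", "Risk 3 (Low)"]

# isin() is False for anything outside the list, including None/NaN.
cleaned = risk[risk["Risk"].isin(allowed)]

print(cleaned["Risk"].tolist())
```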

**Step 5**

In [102]:
#Preview of restaurants
restaurants = food[food["Facility Type"]=="Restaurant"].copy()
restaurants["AKA Name"].value_counts().head(25)

SUBWAY                          3198
DUNKIN DONUTS                   1322
MCDONALD'S                       758
BURGER KING                      365
MCDONALDS                        336
DUNKIN DONUTS/BASKIN ROBBINS     320
CHIPOTLE MEXICAN GRILL           318
WENDY'S                          293
STARBUCKS COFFEE                 279
POTBELLY SANDWICH WORKS          271
CORNER BAKERY CAFE               240
FRESHII                          236
STARBUCKS                        233
JIMMY JOHN'S                     222
PIZZA HUT                        212
Subway                           209
SUBWAY SANDWICHES                204
DOMINO'S PIZZA                   201
TACO BELL                        199
KFC                              198
AU BON PAIN                      187
POTBELLY SANDWICH WORKS LLC      177
HAROLD'S CHICKEN SHACK           170
MC DONALD'S                      169
SEE THRU CHINESE KITCHEN         169
Name: AKA Name, dtype: int64

In [103]:
#Drop NaN's
restaurants.dropna(subset=["AKA Name"], inplace=True)
restaurants["AKA Name"].isna().any()

False

In [104]:
#Only uppercase
restaurants["AKA Name"] = restaurants["AKA Name"].str.upper()

**Step 6**

In [105]:
from requests import get
import json
import geopandas as gpd
from shapely.geometry import Point, Polygon
import folium

folium.__version__ == '0.10.0'

True

In [106]:
url='https://data.cityofchicago.org/api/geospatial/cauq-8yn6?method=export&format=GeoJSON'
r = get(url)
geojson_data = r.json()
geojson = gpd.GeoDataFrame.from_features(geojson_data['features'])

geojson.head()

Unnamed: 0,geometry,community,area,shape_area,perimeter,area_num_1,area_numbe,comarea_id,comarea,shape_len
0,"MULTIPOLYGON (((-87.60914 41.84469, -87.60915 ...",DOUGLAS,0,46004621.1581,0,35,35,0,0,31027.0545098
1,"MULTIPOLYGON (((-87.59215 41.81693, -87.59231 ...",OAKLAND,0,16913961.0408,0,36,36,0,0,19565.5061533
2,"MULTIPOLYGON (((-87.62880 41.80189, -87.62879 ...",FULLER PARK,0,19916704.8692,0,37,37,0,0,25339.0897503
3,"MULTIPOLYGON (((-87.60671 41.81681, -87.60670 ...",GRAND BOULEVARD,0,48492503.1554,0,38,38,0,0,28196.8371573
4,"MULTIPOLYGON (((-87.59215 41.81693, -87.59215 ...",KENWOOD,0,29071741.9283,0,39,39,0,0,23325.1679062


In [107]:
geojson.drop(['perimeter', 'comarea_id','comarea'], axis=1, inplace=True)

In [108]:
geometry = [Point(xy) for xy in zip(food['Longitude'], food['Latitude'])]
crs = {'init': 'epsg:4326'}
gdf = gpd.GeoDataFrame(food, crs=crs, geometry=geometry)

In [109]:
food = gpd.sjoin(gdf, geojson, op='within', how='left')
food.reset_index(inplace=True, drop=True)
food.drop(['index_right'], axis=1, inplace=True)
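
To make the idea behind the 'within' join more concrete, here is a conceptual sketch in plain Python. It is not how Geopandas implements `sjoin`. The rectangular `areas`, their names, and the helper functions are all hypothetical, and real community areas are multipolygons with far more vertices. Each point is tested against each polygon with a ray-casting check, and a point that falls in no polygon gets `None`, much like the missing values a left join leaves for unmatched rows.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: True if (lon, lat) lies inside the polygon ring."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that horizontal line.
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


# Hypothetical rectangular "community areas" given as (lon, lat) vertices.
areas = {
    "AREA A": [(-87.72, 41.95), (-87.70, 41.95), (-87.70, 41.98), (-87.72, 41.98)],
    "AREA B": [(-87.70, 41.95), (-87.66, 41.95), (-87.66, 41.98), (-87.70, 41.98)],
}


def locate(lon, lat):
    """Return the name of the first area containing the point, else None."""
    for name, ring in areas.items():
        if point_in_polygon(lon, lat, ring):
            return name
    return None


print(locate(-87.708538, 41.965719))
print(locate(-87.682292, 41.968820))
print(locate(0.0, 0.0))
```

Note the (longitude, latitude) order of the coordinates, which mirrors the `zip(food['Longitude'], food['Latitude'])` call used to build the points.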

In [110]:
food.head()

Unnamed: 0,Inspection ID,DBA Name,AKA Name,License #,Facility Type,Risk,Zip,Inspection Date,Inspection Type,Results,Violations,Latitude,Longitude,geometry,community,area,shape_area,area_num_1,area_numbe,shape_len
0,2320519,SALAM RESTAURANT,SALAM RESTAURANT,2002822.0,Restaurant,Risk 1 (High),25,2019,Complaint Re-Inspection,Pass,,41.965719,-87.708538,POINT (-87.70854 41.96572),ALBANY PARK,0,53542230.8191,14,14,39339.0164387
1,2320509,TAQUERIA EL DORADO,TAQUERIA EL DORADO,2694960.0,Restaurant,Risk 1 (High),25,2019,License Re-Inspection,Fail,"3. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL E...",41.96882,-87.682292,POINT (-87.68229 41.96882),LINCOLN SQUARE,0,71352328.2399,4,4,36624.6030848
2,2320412,"DANTE'S PIZZA,INC.",DANTE'S PIZZA,2092884.0,Restaurant,Risk 1 (High),47,2019,Canvass,Fail,"38. INSECTS, RODENTS, & ANIMALS NOT PRESENT - ...",41.917539,-87.703728,POINT (-87.70373 41.91754),LOGAN SQUARE,0,100057566.7,22,22,49213.4217488
3,2320430,LAO PENG YOU LLC,LAO PENG YOU,2694477.0,Restaurant,Risk 1 (High),22,2019,License Re-Inspection,Pass w/ Conditions,"3. MANAGEMENT, FOOD EMPLOYEE AND CONDITIONAL E...",41.896005,-87.677938,POINT (-87.67794 41.89600),WEST TOWN,0,127562904.597,24,24,55203.7186956
4,2320384,ARBOR,ARBOR,2363029.0,Restaurant,Risk 1 (High),47,2019,Recent Inspection,Pass w/ Conditions,14. REQUIRED RECORDS AVAILABLE: SHELLSTOCK TAG...,41.932025,-87.692169,POINT (-87.69217 41.93202),LOGAN SQUARE,0,100057566.7,22,22,49213.4217488


In [111]:
chicago_coord = [41.85, -87.7]
chicago_map = folium.Map(location=chicago_coord)

In [112]:
area_area10 = food[food['area_num_1'] == '10'].copy()
area_area10.reset_index(drop=True, inplace=True)
#shape = area_zip25.shape

for index, row in area_area10.iterrows():
    if index < 30:
        folium.Marker([row["Latitude"], row["Longitude"]], popup=row['community'], 
                      icon=folium.Icon(color ='blue', icon = 'map-marker')).add_to(chicago_map)
    else: 
        break

In [113]:
folium.GeoJson(geojson_data).add_to(chicago_map)

<folium.features.GeoJson at 0x23e08ff6c88>

In [114]:
chicago_map

**Step 7**

In [115]:
food['Results'].unique()

array(['Pass', 'Fail', 'Pass w/ Conditions', 'Not Ready', 'No Entry',
       'Out of Business', 'Business Not Located'], dtype=object)

In [116]:
print('Length before cleaning: {0}'.format(food.shape[0]))

Length before cleaning: 165731


In [117]:
food = food[food['Results'] != 'Not Ready']
food = food[food['Results'] != 'Out of Business']
food = food[food['Results'] != 'No Entry']
food = food[food['Results'] != 'Business Not Located']

In [118]:
print('Length after cleaning: {0}'.format(food.shape[0]))

Length after cleaning: 147455


In [119]:
food['Results'].unique()

array(['Pass', 'Fail', 'Pass w/ Conditions'], dtype=object)

Now that a first cleaning pass is done, we concentrate on the 5 restaurant chains selected in Step 5.

**McDonald's**

In [120]:
# Check whether other restaurants contain "DONALD" to avoid adding wrong data to McDonald's
restaurants[restaurants["AKA Name"].str.contains("DONALD")]["AKA Name"].value_counts()

MCDONALD'S                           906
MCDONALDS                            405
MC DONALD'S                          169
MC DONALDS                           168
MCDONALD'S RESTAURANT                 87
MCDONALDS RESTAURANT                  42
MCDONALD'S RESTAURANTS                24
MCDONALD'S #490                       20
DONALDS FAMOUS HOT DOGS               19
MCDONALD'S RESTAURANT  (T3 H9)        18
MCDONALD'S  (T3 HK FOOD COURT)        17
MCDONALD'S CORPORATION                17
MC DONALDS # 6771                     16
MCDONALDS  (T3  K9)                   15
MCDONALD' S # 5618                    15
MCDONALDS #27672                      15
MCDONALDS #4655                       14
MCDONALD'S   (T3- L4)                 13
MCDONALDS #7069                       13
MCDONALDS#6337                        12
MCDONALD'S STORE #4061                11
MCDONALD'S  (T1-B11)                  11
MCDONALD'S  (T2   E/F)                11
MCDONALD'S  (T1-C10)                  11
MC DONALDS-MCCOR

In [121]:
# Remove "DONALDS FAMOUS HOT DOGS" (this drops every name containing "DOGS"), then unify all McDonald's variants
restaurants = restaurants[~restaurants["AKA Name"].str.contains('|'.join(["DOGS"]))]
restaurants.loc[restaurants["AKA Name"].str.contains("DONALD"), "AKA Name"] = "MCDONALDS"
restaurants[restaurants["AKA Name"].str.contains("DONALD")]["AKA Name"].value_counts()

MCDONALDS    2216
Name: AKA Name, dtype: int64

**Subway**

In [122]:
# Check whether other restaurants contain "SUBWAY" to avoid adding wrong data to Subway
restaurants[restaurants["AKA Name"].str.contains("SUBWAY")]["AKA Name"].value_counts()

SUBWAY                                           3407
SUBWAY SANDWICHES                                 231
SUBWAY SANDWICH                                    61
SUBWAY SANDWICH & SALAD                            29
SUBWAY (T3 ROTUNDA)                                25
SUBWAY RESTAURANT                                  25
SUBWAY #3333                                       21
SUBWAY SANDWICHES & SALADS                         21
SUBWAY 28330                                       19
SNAPPY CONVENIENCE CENTER/SUBWAY/DUNKIN DONUT      18
BP/SUBWAY                                          17
SUBWAY SANDWICH STORE                              16
SUBWAY #45927                                      16
SUBWAY 48735                                       14
SHELL SUBWAY                                       13
ROAD RANGER/SUBWAY                                 11
LALO SUBWAY INC                                    10
LAKEVIEW SUBWAY                                     9
MADISON SUBWAY LLC          

In [123]:
# Remove the undesired restaurants whose name contains "SUBWAY", then unify all Subway variants
removable = ["FULLERTON","MADISON", "LALO","LAKEVIEW","SNAPPY"]
restaurants = restaurants[~restaurants["AKA Name"].str.contains('|'.join(removable))]
restaurants.loc[restaurants["AKA Name"].str.contains("SUBWAY"), "AKA Name"] = "SUBWAY"
restaurants[restaurants["AKA Name"].str.contains("SUBWAY")]["AKA Name"].value_counts()

SUBWAY    3970
Name: AKA Name, dtype: int64

**Starbucks**

In [124]:
restaurants[restaurants["AKA Name"].str.contains("STARBUCKS")]["AKA Name"].value_counts().head(40)

STARBUCKS COFFEE                                   292
STARBUCKS                                          233
MARKET PLACE/STARBUCKS COFFE/FRANGO/GODIVA          22
STARBUCKS COFFEE #2370                              14
STARBUCKS HK APEX (T3 HK FOODCOURT)                 13
STARBUCKS COFFEE (T1-B5)                            13
STARBUCKS COFFEE #2334                              13
STARBUCKS (T3  G14 LL)                              12
STARBUCKS (T1/B CONCOURSE-BAGGAGE CLAIM)            12
STARBUCKS COFFEE #2410                              12
STARBUCKS/ W KITCHEN/ F1,F2 POD/WAREHSE/LA BREA     12
STARBUCKS (T2 LL ARRIVAL)                           12
STARBUCKS COFFEE #2527                              11
STARBUCKS COFFEE #228                               11
STARBUCKS  (T3 H6)                                  11
MAIN KITCHEN/STARBUCKS /ETA/ EMPL CAFE              11
STARBUCKS COFFEE #8954                              11
STARBUCKS COFFEE (T1-B14)                           11
STARBUCKS 

In [125]:
# We assume "STARBUCKS" is distinctive enough that the chance of another restaurant using it is negligible
restaurants.loc[restaurants["AKA Name"].str.contains("STARBUCKS"), "AKA Name"] = "STARBUCKS"
restaurants[restaurants["AKA Name"].str.contains("STARBUCKS")]["AKA Name"].value_counts()

STARBUCKS    1426
Name: AKA Name, dtype: int64

**Taco Bell**

In [126]:
#Checking a misspelled case
restaurants[restaurants["AKA Name"].str.contains("TACOBELL")]["AKA Name"].value_counts()

Series([], Name: AKA Name, dtype: int64)

In [127]:
# It looks like Taco Bell is always written as two separate words
restaurants[restaurants["AKA Name"].str.contains("TACO BELL")]["AKA Name"].value_counts()

TACO BELL            205
KFC/TACO BELL         25
TACO BELL #15855      15
TACO BELL CANTINA     11
TACO BELL #30407      11
TACO BELL #15875       9
TACO BELL_#4171        9
TACO BELL #2513        9
TACO BELL #5751        8
TACO BELL 32575        3
TACO BELL 34921        2
Name: AKA Name, dtype: int64

In [128]:
# Luckily, the name is also very distinctive
restaurants.loc[restaurants["AKA Name"].str.contains("TACO BELL"), "AKA Name"] = "TACO BELL"
restaurants[restaurants["AKA Name"].str.contains("TACO BELL")]["AKA Name"].value_counts()

TACO BELL    307
Name: AKA Name, dtype: int64

**Dunkin Donuts**

In [129]:
#Checking a misspelled case
restaurants[restaurants["AKA Name"].str.contains("DUNKINDONUTS")]["AKA Name"].value_counts()

DUNKINDONUTS    7
Name: AKA Name, dtype: int64

When unifying all Dunkin Donuts entries, we also need to include the 7 "DUNKINDONUTS" rows found above.

In [130]:
restaurants[restaurants["AKA Name"].str.contains("DUNKIN DONUTS")]["AKA Name"].value_counts()

DUNKIN DONUTS                           1375
DUNKIN DONUTS/BASKIN ROBBINS             320
DUNKIN DONUTS BASKIN ROBBINS             121
DUNKIN DONUTS / BASKIN ROBBINS           116
DUNKIN DONUTS/ BASKIN ROBBINS             60
DUNKIN DONUTS / BASKIN ROBINS             47
DUNKIN DONUTS & BASKIN ROBBINS            28
DUNKIN DONUTS / BASKIN & ROBBINS          25
DUNKIN DONUTS & BASKIN ROBINS             15
DUNKIN DONUTS-BASKIN ROBBINS              14
HALSTED SHELL, DUNKIN DONUTS, MR SUB      14
BASKIN ROBBINS/ DUNKIN DONUTS             13
DUNKIN DONUTS BASKIN ROBBINS TOGO'S       12
DUNKIN DONUTS (T3 HK FOODCOURT)           11
GRAND CITGO/ MR. SUB / DUNKIN DONUTS      10
DUNKIN DONUTS /  BASKIN ROBBINS           10
DUNKIN DONUTS/BASKIN ROBINS               10
BASKIN ROBBINS/DUNKIN DONUTS              10
DUNKIN DONUTS INC                          9
DUNKIN DONUTS/BASKIN  ROBBINS              8
DUNKIN DONUTS AND BASKIN ROBBINS           4
DUNKIN DONUTS /  BASKIN ROBINS             4
DUNKIN DON

In [131]:
# "DUNKIN DONUTS" is very distinctive too, which helps when unifying
restaurants.loc[restaurants["AKA Name"].str.contains("DUNKIN DONUTS"), "AKA Name"] = "DUNKIN DONUTS"
restaurants.loc[restaurants["AKA Name"].str.contains("DUNKINDONUTS"), "AKA Name"] = "DUNKIN DONUTS"
restaurants[restaurants["AKA Name"].str.contains("DUNKIN DONUTS")]["AKA Name"].value_counts()

DUNKIN DONUTS    2251
Name: AKA Name, dtype: int64

**Removing any other restaurants**

In [132]:
restaurant_list = ["MCDONALDS","SUBWAY","STARBUCKS","TACO BELL","DUNKIN DONUTS"]
restaurants = restaurants[restaurants["AKA Name"].isin(restaurant_list)]
restaurants["AKA Name"].value_counts()

SUBWAY           3970
DUNKIN DONUTS    2251
MCDONALDS        2216
STARBUCKS        1426
TACO BELL         307
Name: AKA Name, dtype: int64
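
The per-chain steps above all follow the same pattern: find every spelling variant, then overwrite it with one canonical name. As a sketch only, this could be collected into a single lookup. The `CHAIN_PATTERNS` table, the `canonical_chain` helper and the sample names are assumptions made for illustration. Unlike the cells above, this version does not apply the manual exclusions (for example the "SNAPPY" or "LALO" entries), and a combined name is assigned to the first pattern that matches.

```python
import re
import pandas as pd

# Hypothetical pattern table: canonical chain name -> regex for its variants.
CHAIN_PATTERNS = {
    "MCDONALDS": re.compile(r"\bMC\s?DONALD"),
    "SUBWAY": re.compile(r"\bSUBWAY\b"),
    "STARBUCKS": re.compile(r"STARBUCKS"),
    "TACO BELL": re.compile(r"TACO\s?BELL"),
    "DUNKIN DONUTS": re.compile(r"DUNKIN\s?DONUTS"),
}


def canonical_chain(name):
    """Return the canonical chain name, or None if no pattern matches."""
    upper = str(name).upper()
    for chain, pattern in CHAIN_PATTERNS.items():
        if pattern.search(upper):
            return chain
    return None


# Made-up names covering a few of the variants seen in the outputs above.
names = pd.Series(["McDonald's #490", "MC DONALDS", "DONALDS FAMOUS HOT DOGS",
                   "Subway Sandwiches", "TACO BELL #15855", "DUNKINDONUTS",
                   "ARBOR"])

unified = names.map(canonical_chain)

# Rows with no match are missing and can be dropped in one step.
print(unified.dropna().tolist())
```

Requiring the "MC" prefix keeps "DONALDS FAMOUS HOT DOGS" out without a separate removal step, at the cost of missing any variant written without it.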

**Final Step**

In [133]:
#Reset Indexes after cleaning
food.reset_index(drop=True, inplace=True)
restaurants.reset_index(drop=True, inplace=True)

In [134]:
#save pickles for use in analysis
food.to_pickle("food.pkl")
restaurants.to_pickle("restaurants.pkl")
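
For the analysis notebooks, the saved files can be read back with `pd.read_pickle`. The round trip below is a minimal sketch on a small made-up frame written to a temporary directory (the `frame` contents and the `restaurants_demo.pkl` file name are assumptions), so it does not touch the real `food.pkl` or `restaurants.pkl`.

```python
import os
import tempfile
import pandas as pd

# Hypothetical stand-in for the cleaned dataframe.
frame = pd.DataFrame({"AKA Name": ["SUBWAY", "TACO BELL"],
                      "Zip": ["25", "03"]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "restaurants_demo.pkl")
    frame.to_pickle(path)             # same call as in the cell above
    restored = pd.read_pickle(path)   # how a later notebook would load it

# The restored frame has the same values and dtypes as the original.
round_trip_ok = restored.equals(frame)
print(round_trip_ok)
```

A pickle keeps the column types as they were, so string Zip values such as '03' come back unchanged. Pickle files should only be loaded from sources you trust.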