# Chicago Food

## Milestone 2: Preprocessing

In this milestone we describe the contents of our dataset.

We first clean the data so that the following steps work on a coherent dataset. We focus on four facility types: restaurants, grocery stores, schools and hospitals. Within the restaurant group, we focus on five large chains. We also enrich the dataset by adding a column with the Chicago area number, which lets us analyse the data by area.

From this clean dataset we run a raw analysis of the risk levels, the types of violation and the number of inspections. We then put the data in context by analysing it across the different areas, the four facility types and time.

Finally, based on this preprocessing, we present our analysis plan.

#### Part 1. Cleaning the data

In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import re
from matplotlib.ticker import MaxNLocator
import matplotlib.pyplot as plt
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

In [2]:
#Loading the data
food = pd.read_csv('data/food-inspections.csv', sep=',')

While cleaning the data we assume that an inspection without latitude and longitude information is not useful. The `License #` feature must also be different from 0, otherwise it is impossible to tell which facility an inspection belongs to.

**Cleaning steps**

1) Drop the columns we do not need.

2) Drop rows without latitude or longitude.

3) Drop the facilities with a `License #` equal to 0.

4) Convert the date to an analysis-friendly format.

5) Keep only the 3 risk levels (low, medium and high); any other value is removed.

6) Keep only the 4 facility types we are going to analyse.

7) For restaurants, focus only on five of the most inspected chains, covering different types of food: McDonald's, Subway, Taco Bell, Starbucks and Dunkin Donuts.

8) Add spatial information to the dataframe using `geopandas` and a map of districts found online.

**Step 1.** Drop the columns we do not need

In [3]:
# Drop the columns that have no link to our further analysis.
# We do not use the zip code because some of them are wrong. We only use latitude and longitude
# to locate the facility.

food.drop(['Zip Codes','Historical Wards 2003-2015', 'Community Areas', 'Census Tracts','Wards','Zip','State','Address','Location'], axis=1, inplace=True)


**Step 2.** Drop rows without latitude or longitude

In [4]:
food.dropna(subset = ["Latitude", "Longitude"], inplace=True)

**Step 3.** Drop rows with 0 as license number

In [5]:
food.drop(food[food['License #'] == 0].index, inplace=True)

**Step 4.** Convert the date format to an analysis friendly format

In [6]:
# We only keep the year of the inspection
food["Inspection Date"] = food["Inspection Date"].str.split("-").str[0]

In [7]:
#Check for missing years
food["Inspection Date"].isna().any()

False
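Keeping only the year discards the month and day, which may matter for the time analysis. As a hedged alternative, the sketch below parses the column into datetimes and derives the year as a separate column. It runs on a small made-up frame, assumes ISO-like date strings with the year first (as the split on "-" above suggests), and the column name `Inspection Year` is an assumption, not part of the dataset.

```python
import pandas as pd

# Toy data standing in for the real column (values are made up).
toy = pd.DataFrame({"Inspection Date": ["2019-11-01T00:00:00.000",
                                        "2010-01-04T00:00:00.000",
                                        None]})

# errors="coerce" turns unparseable or missing values into NaT instead of raising.
toy["Inspection Date"] = pd.to_datetime(toy["Inspection Date"], errors="coerce")

# Hypothetical extra column; the nullable Int64 dtype keeps missing years as <NA>.
toy["Inspection Year"] = toy["Inspection Date"].dt.year.astype("Int64")

print(toy)
```

With this layout the year is still available for grouping, while the full date stays usable for finer-grained comparisons.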

**Step 5.** Drop rows whose risk is not high, medium or low

In [8]:
#Checking for any other value than the 3 levels of risk
food.Risk.value_counts()

Risk 1 (High)      139244
Risk 2 (Medium)     37831
Risk 3 (Low)        16874
All                    28
Name: Risk, dtype: int64

In [9]:
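# Keep only the three valid risk levels; this drops "All" as well as missing values.
# (Negating a str.contains mask with ~ fails when the column contains NaN.)
valid_risks = ["Risk 1 (High)", "Risk 2 (Medium)", "Risk 3 (Low)"]
food = food[food["Risk"].isin(valid_risks)]
food.Risk.value_counts()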
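A mask built with `str.contains` can hold missing values when the column itself has them, and such a mask cannot safely be negated with `~`. The sketch below, on made-up values, shows two ways around this: passing `na=False` so that the mask is purely boolean, or whitelisting the valid levels with `isin`. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up values; NaN stands in for an inspection with no recorded risk.
risk = pd.Series(["Risk 1 (High)", "All", np.nan, "Risk 3 (Low)"])

# na=False treats missing values as "does not match", so the mask is purely boolean.
bool_mask = risk.str.contains("All", na=False)

# Negating the boolean mask keeps every row that is not "All", including the NaN row.
kept = risk[~bool_mask]

# Whitelisting the valid levels drops both "All" and NaN in one step.
valid_levels = ["Risk 1 (High)", "Risk 2 (Medium)", "Risk 3 (Low)"]
strict = risk[risk.isin(valid_levels)]

print(len(kept), strict.tolist())
```

The two options differ in how they treat missing risks: the first keeps them, the second removes them, which matches the stated rule that any value other than the three levels is dropped.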

**Step 6.** Focus on the facility type: Restaurant, Grocery, School and Hospital

In [10]:
# Check the number of inspections per facility type
food.groupby('Facility Type')['Facility Type'].count().sort_values(ascending=False).head(20)

Facility Type
Restaurant                         129818
Grocery Store                       24850
School                              11816
Children's Services Facility         3044
Bakery                               2860
Daycare (2 - 6 Years)                2685
Daycare Above and Under 2 Years      2359
Long Term Care                       1322
Catering                             1181
Liquor                                855
Mobile Food Dispenser                 789
Daycare Combo 1586                    745
Mobile Food Preparer                  584
Golden Diner                          553
Wholesale                             534
Hospital                              533
TAVERN                                282
Daycare (Under 2 Years)               249
Shared Kitchen User (Long Term)       170
BANQUET HALL                          147
Name: Facility Type, dtype: int64

In [11]:
# We will focus our analysis on the facility types Restaurant, Grocery Store, School and Hospital.
food = food[food["Facility Type"].isin(["Restaurant","Grocery Store", "School", "Hospital"])]
food["Facility Type"].value_counts()

Restaurant       129818
Grocery Store     24850
School            11816
Hospital            533
Name: Facility Type, dtype: int64
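The listing of facility types above mixes capitalised and fully upper-case labels (for example `TAVERN` and `BANQUET HALL`), so an exact `isin` match could miss variants such as `RESTAURANT`. Whether such variants actually exist in this dataset is not checked here. The sketch below, on made-up values, shows one way to match case-insensitively; the helper name `filter_facility_types` is hypothetical.

```python
import pandas as pd

def filter_facility_types(df, wanted, column="Facility Type"):
    """Keep rows whose facility type matches `wanted`, ignoring case and outer whitespace."""
    wanted_normalised = {w.strip().lower() for w in wanted}
    normalised = df[column].str.strip().str.lower()
    # isin returns False for missing values, so rows without a facility type are dropped.
    return df[normalised.isin(wanted_normalised)]

# Made-up values, including a case variant, trailing whitespace and a missing entry.
toy = pd.DataFrame({"Facility Type": ["Restaurant", "RESTAURANT ", "Bakery", None, "school"]})
selected = filter_facility_types(toy, ["Restaurant", "Grocery Store", "School", "Hospital"])
print(selected["Facility Type"].tolist())
```

The original labels are left untouched in the result; only the comparison is normalised.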

**Step 7.** For restaurants, focus only on five of the most inspected chains, covering different types of food

In [12]:
#Preview of restaurants
restaurants = food[food["Facility Type"]=="Restaurant"].copy()
#Drop NaN's
restaurants.dropna(subset=["AKA Name"], inplace=True)
#Only uppercase
restaurants["AKA Name"] = restaurants["AKA Name"].str.upper()

restaurants["AKA Name"].value_counts().head(25)

SUBWAY                          3429
DUNKIN DONUTS                   1383
MCDONALD'S                       908
CHIPOTLE MEXICAN GRILL           422
MCDONALDS                        406
BURGER KING                      383
DUNKIN DONUTS/BASKIN ROBBINS     320
WENDY'S                          313
STARBUCKS COFFEE                 301
POTBELLY SANDWICH WORKS          271
AU BON PAIN                      240
CORNER BAKERY CAFE               240
FRESHII                          237
STARBUCKS                        236
SUBWAY SANDWICHES                231
JIMMY JOHN'S                     222
PIZZA HUT                        220
TACO BELL                        206
DOMINO'S PIZZA                   201
KFC                              199
JIMMY JOHNS                      198
SEE THRU CHINESE KITCHEN         180
POTBELLY SANDWICH WORKS LLC      177
HAROLD'S CHICKEN SHACK           170
MC DONALD'S                      169
Name: AKA Name, dtype: int64

We create a new dataframe with only McDonald's, Subway, Taco Bell, Starbucks and Dunkin Donuts.

**McDonald's**

In [13]:
#Let's see if there are other restaurants with "donald" in their name, to avoid adding wrong data to McDonald's
restaurants[restaurants["AKA Name"].str.contains("DONALD")]["AKA Name"].value_counts().head(8)

MCDONALD'S                908
MCDONALDS                 406
MC DONALD'S               169
MC DONALDS                168
MCDONALD'S RESTAURANT      87
MCDONALDS RESTAURANT       43
MCDONALD'S RESTAURANTS     24
MCDONALD'S #490            20
Name: AKA Name, dtype: int64

In [14]:
#Removing "Donald's famous hot dogs" (any name containing "DOGS"), then unifying all McDonald's
restaurants = restaurants[~restaurants["AKA Name"].str.contains('|'.join(["DOGS"]))]
restaurants.loc[restaurants["AKA Name"].str.contains("DONALD"), "AKA Name"] = "MCDONALDS"
restaurants[restaurants["AKA Name"].str.contains("DONALD")]["AKA Name"].value_counts()

MCDONALDS    2221
Name: AKA Name, dtype: int64

**Subway**

In [15]:
#Let's see if there are other restaurants with "subway" in their name, to avoid adding wrong data to Subway
restaurants[restaurants["AKA Name"].str.contains("SUBWAY")]["AKA Name"].value_counts().head(8)

SUBWAY                        3429
SUBWAY SANDWICHES              231
SUBWAY SANDWICH                 60
SUBWAY SANDWICH & SALAD         30
SUBWAY RESTAURANT               25
SUBWAY (T3 ROTUNDA)             25
SUBWAY #3333                    21
SUBWAY SANDWICHES & SALADS      21
Name: AKA Name, dtype: int64

In [16]:
#Removing unrelated restaurants whose name contains "subway", then unifying all Subway entries
removable = ["FULLERTON","MADISON", "LALO","LAKEVIEW","SNAPPY"]
restaurants = restaurants[~restaurants["AKA Name"].str.contains('|'.join(removable))]
restaurants.loc[restaurants["AKA Name"].str.contains("SUBWAY"), "AKA Name"] = "SUBWAY"
restaurants[restaurants["AKA Name"].str.contains("SUBWAY")]["AKA Name"].value_counts()

SUBWAY    3990
Name: AKA Name, dtype: int64

**Starbucks**

In [17]:
restaurants[restaurants["AKA Name"].str.contains("STARBUCKS")]["AKA Name"].value_counts().head(8)

STARBUCKS COFFEE                              301
STARBUCKS                                     236
MARKET PLACE/STARBUCKS COFFE/FRANGO/GODIVA     22
STARBUCKS COFFEE #2370                         14
STARBUCKS COFFEE #2334                         13
STARBUCKS HK APEX (T3 HK FOODCOURT)            13
STARBUCKS COFFEE (T1-B5)                       13
STARBUCKS COFFEE #2410                         12
Name: AKA Name, dtype: int64

In [18]:
#We assume "starbucks" is a very distinctive name with a negligible chance of being used by another restaurant
restaurants.loc[restaurants["AKA Name"].str.contains("STARBUCKS"), "AKA Name"] = "STARBUCKS"
restaurants[restaurants["AKA Name"].str.contains("STARBUCKS")]["AKA Name"].value_counts()

STARBUCKS    1452
Name: AKA Name, dtype: int64

**Taco Bell**

In [19]:
#Checking a misspelled case
restaurants[restaurants["AKA Name"].str.contains("TACOBELL")]["AKA Name"].value_counts()

Series([], Name: AKA Name, dtype: int64)

In [20]:
#Looks like Taco Bell is always written in separated words
restaurants[restaurants["AKA Name"].str.contains("TACO BELL")]["AKA Name"].value_counts()

TACO BELL                         206
KFC/TACO BELL                      25
TACO BELL #15855                   15
TACO BELL & LONG JOHN SILVER'S     13
TACO BELL #30407                   11
TACO BELL CANTINA                  11
TACO BELL_#4171                     9
TACO BELL #15875                    9
TACO BELL #2513                     9
TACO BELL #5751                     8
TACO BELL 32575                     3
TACO BELL 34921                     2
Name: AKA Name, dtype: int64

In [21]:
#Luckily, the name is very distinctive
restaurants.loc[restaurants["AKA Name"].str.contains("TACO BELL"), "AKA Name"] = "TACO BELL"
restaurants[restaurants["AKA Name"].str.contains("TACO BELL")]["AKA Name"].value_counts()

TACO BELL    321
Name: AKA Name, dtype: int64

**Dunkin Donuts**

In [47]:
#Checking a misspelled case
restaurants[restaurants["AKA Name"].str.contains("DUNKINDONUTS")]["AKA Name"].value_counts()

DUNKINDONUTS    7
Name: AKA Name, dtype: int64

When unifying all Dunkin Donuts entries, we also need to include the 7 "DUNKINDONUTS" rows found above.

In [22]:
restaurants[restaurants["AKA Name"].str.contains("DUNKIN DONUTS")]["AKA Name"].value_counts().head(8)

DUNKIN DONUTS                       1383
DUNKIN DONUTS/BASKIN ROBBINS         320
DUNKIN DONUTS BASKIN ROBBINS         122
DUNKIN DONUTS / BASKIN ROBBINS       118
DUNKIN DONUTS/ BASKIN ROBBINS         60
DUNKIN DONUTS / BASKIN ROBINS         47
DUNKIN DONUTS & BASKIN ROBBINS        28
DUNKIN DONUTS / BASKIN & ROBBINS      25
Name: AKA Name, dtype: int64

In [23]:
#Dunkin Donuts is also a very distinctive name, which helps when unifying
restaurants.loc[restaurants["AKA Name"].str.contains("DUNKIN DONUTS"), "AKA Name"] = "DUNKIN DONUTS"
restaurants.loc[restaurants["AKA Name"].str.contains("DUNKINDONUTS"), "AKA Name"] = "DUNKIN DONUTS"
restaurants[restaurants["AKA Name"].str.contains("DUNKIN DONUTS")]["AKA Name"].value_counts()

DUNKIN DONUTS    2263
Name: AKA Name, dtype: int64

**Removing any other restaurants**

In [24]:
restaurant_list = ["MCDONALDS","SUBWAY","STARBUCKS","TACO BELL","DUNKIN DONUTS"]
restaurants = restaurants[restaurants["AKA Name"].isin(restaurant_list)]
restaurants["AKA Name"].value_counts()

SUBWAY           3990
DUNKIN DONUTS    2263
MCDONALDS        2221
STARBUCKS        1452
TACO BELL         321
Name: AKA Name, dtype: int64
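The per-chain cells above all follow the same pattern: find the rows whose name matches a chain, then overwrite the name with one canonical spelling. As a hedged sketch, this could be gathered into a single helper driven by a dictionary of regular expressions. The helper name `unify_chain_names` and the patterns are assumptions for illustration, run on made-up names; real patterns would still need to be checked against `value_counts()` as done above.

```python
import pandas as pd

def unify_chain_names(names, patterns):
    """Return a copy of `names` where entries matching a regex get the canonical name."""
    unified = names.copy()
    for canonical, pattern in patterns.items():
        # Masks are computed on the original names, so one rule cannot feed another.
        mask = names.str.contains(pattern, regex=True, na=False)
        unified[mask] = canonical
    return unified

# Hypothetical patterns; anchoring with ^ avoids matches such as "DONALD'S FAMOUS HOT DOGS".
chain_patterns = {
    "MCDONALDS": r"^MC ?DONALD",
    "SUBWAY": r"^SUBWAY",
    "DUNKIN DONUTS": r"DUNKIN ?DONUTS",
}

# Made-up names in the style of the listings above.
toy = pd.Series(["MC DONALD'S", "MCDONALDS RESTAURANT", "DONALD'S FAMOUS HOT DOGS",
                 "SUBWAY #3333", "DUNKINDONUTS", "DUNKIN DONUTS/BASKIN ROBBINS", "KFC"])
unified = unify_chain_names(toy, chain_patterns)
print(unified.tolist())
```

Unlike the cells above, this variant relies on anchored patterns instead of removing unwanted rows first, so names that merely contain a chain name elsewhere (for example a co-branded location) would need their own pattern.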

**Step 8.** Add spatial coordinates to the dataframe

In [11]:
from requests import get
import json
import geopandas as gpd
from shapely.geometry import Point, Polygon
import folium

folium.__version__ == '0.10.0'

ModuleNotFoundError: No module named 'geopandas'

In [None]:
url='https://data.cityofchicago.org/api/geospatial/cauq-8yn6?method=export&format=GeoJSON'
r = get(url)
geojson_data = r.json()
geojson = gpd.GeoDataFrame.from_features(geojson_data['features'])

geojson.head()

In [None]:
geojson.drop(['perimeter', 'comarea_id','comarea'], axis=1, inplace=True)

In [None]:
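# Build one point per inspection; shapely expects (x, y) = (longitude, latitude)
geometry = [Point(xy) for xy in zip(food['Longitude'], food['Latitude'])]
# WGS84 latitude/longitude; the {'init': 'epsg:4326'} dictionary form is deprecated
crs = 'EPSG:4326'
gdf = gpd.GeoDataFrame(food, crs=crs, geometry=geometry)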

In [None]:
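# Attach to each inspection the area polygon that contains it.
# Recent geopandas versions use predicate= ; older ones named this argument op=
food = gpd.sjoin(gdf, geojson, predicate='within', how='left')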
food.reset_index(inplace=True, drop=True)
food.drop(['index_right'], axis=1, inplace=True)
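Conceptually, the `within` join assigns each inspection to the area polygon that contains its point, and `how='left'` keeps inspections that fall in no polygon. The sketch below illustrates that test with a plain ray-casting check in pure Python. It is not how geopandas implements the join, and the area names and rectangle coordinates are invented for the example.

```python
def point_in_polygon(x, y, polygon):
    """Ray casting: count how many polygon edges a horizontal ray from (x, y) crosses."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # The edge is relevant only if it straddles the horizontal line through y.
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def assign_area(lon, lat, areas):
    """Return the name of the first area containing the point, or None (as in a left join)."""
    for name, polygon in areas.items():
        if point_in_polygon(lon, lat, polygon):
            return name
    return None

# Invented rectangles in (longitude, latitude) order.
areas = {
    "AREA A": [(-87.8, 41.8), (-87.7, 41.8), (-87.7, 41.9), (-87.8, 41.9)],
    "AREA B": [(-87.7, 41.8), (-87.6, 41.8), (-87.6, 41.9), (-87.7, 41.9)],
}

print(assign_area(-87.75, 41.85, areas))
print(assign_area(-87.65, 41.85, areas))
print(assign_area(-87.50, 41.85, areas))
```

Points lying exactly on a boundary are a corner case that this sketch does not treat specially.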

In [None]:
chicago_coord = [41.85, -87.7]
chicago_map = folium.Map(location=chicago_coord)

In [None]:
# Show the first 30 inspections of area number 10 as markers on the map
area_10 = food[food['area_num_1'] == '10'].head(30)

for _, row in area_10.iterrows():
    folium.Marker([row["Latitude"], row["Longitude"]], popup=row['community'],
                  icon=folium.Icon(color='blue', icon='map-marker')).add_to(chicago_map)

In [None]:
folium.GeoJson(geojson_data).add_to(chicago_map)

In [None]:
chicago_map

**Final Step**

In [54]:
#Reset Indexes after cleaning
food.reset_index(drop=True, inplace=True)
restaurants.reset_index(drop=True, inplace=True)

In [55]:
#save pickles for use in analysis
food.to_pickle("food.pkl")
restaurants.to_pickle("restaurants.pkl")
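The analysis notebooks can load these files back with `pd.read_pickle`. The sketch below is a minimal round-trip check on a made-up frame written to a temporary directory; the file name `toy.pkl` is an assumption used only for the example.

```python
import os
import tempfile

import pandas as pd

# Made-up frame standing in for the cleaned data.
toy = pd.DataFrame({"AKA Name": ["SUBWAY", "TACO BELL"],
                    "Inspection Date": ["2019", "2018"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.pkl")
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# Unlike a CSV export, a pickle round-trip keeps dtypes and the index unchanged.
round_trip_ok = restored.equals(toy)
print(round_trip_ok)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.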

#### Part 2. Raw data analysis

#### Part 3. Contextual data analysis

#### Part 4. What's next?