# Chipotle Saturation Analysis

This notebook presents a sample analysis of Chipotle restaurant saturation by US county.  It begins by pulling in geographic shape data on US counties from a COVID-19 data set, then brings in demographic columns from another COVID-19 data set and a 2017 US census data set.  It then cleans and combines that data and fits a linear regression model of the number of Chipotles per county.  Removing predictors that are not statistically significant leaves a linear model with four variables and a relatively high R-squared value.  Finally, the model is applied to the full set of US counties, residuals are computed, and a map is generated to show potential areas for expansion and oversaturation.

**Please note**: this is my first attempt at this type of data analysis using Python, Pandas, Folium, etc.  I'm 100% sure there are things that could be improved upon, and I welcome feedback.  Also, take note of ideas and conclusions near the end of the analysis.


The analysis begins by importing the libraries that will be used throughout.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import geopandas as gpd
from shapely.geometry import Point
from shapely import wkt
import folium
from folium import Choropleth

import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
pd.plotting.register_matplotlib_converters()
%matplotlib inline
import seaborn as sns

import sklearn
from sklearn import linear_model

import statsmodels.regression.linear_model as sm

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

Reading in files, filling in Chipotle geometry.  Note that most of the information comes in as an object datatype, so there will be some later work to convert these types to float.

There were a few US counties that didn't have geometries listed.  These were in Alaska and didn't appear to be relevant to the analysis, so they were dropped.

In [None]:
chipotle = gpd.read_file("../input/chipotle-locations/chipotle_stores.csv")
geometry = [Point(xy) for xy in zip(chipotle['longitude'].astype('float'),chipotle['latitude'].astype('float'))]
chipotle['geometry'] = geometry
chipotle.head()

us_states = gpd.read_file("../input/chipotle-locations/us-states.json")
us_counties = gpd.read_file('../input/enrichednytimescovid19/us_county_pop_and_shps.csv')
us_counties = us_counties.loc[us_counties['county_geom'] != 'None']

census2017 = gpd.read_file('../input/us-census-demographic-data/acs2017_county_data.csv')

covid2019 = gpd.read_file('../input/covid19-us-county-jhu-data-demographics/us_county.csv')

The county geometry data has both a center point and a polygon.  I just want the polygon, so I'll set it to be the main geometry column.

In [None]:
us_counties['county_geom'] = us_counties['county_geom'].apply(wkt.loads)
us_counties = gpd.GeoDataFrame(us_counties, geometry='county_geom')
census2017

I used GeoPandas' spatial join feature to assign a county to each Chipotle store based on its latitude and longitude.  The CRS is set later, on the final model data.

In [None]:
chipotle_enhanced = gpd.sjoin(chipotle,us_counties)
chipotle_enhanced.head()

One of my ideas was that population density might be a predictor.  However, in order to do this, I needed to calculate area and population density of each county, and add it to the chipotle_enhanced dataframe. I was able to validate a few sample counties against Wikipedia data for reasonability.

In [None]:
#calculate population density as people / km**2
#(.area is in square degrees here; multiplying by 10000 is a rough conversion to km**2)
us_counties['popdensity'] = us_counties['county_pop_2019_est'].astype(float) / (us_counties['county_geom'].area * 10000)

#create a common key for reading in calculated density
chipotle_enhanced['statecounty'] = chipotle_enhanced['state_right'] + chipotle_enhanced['county']
us_counties['statecounty'] = us_counties['state'] + us_counties['county']

#read in calculated density
chipotle_enhanced = pd.merge(chipotle_enhanced, us_counties[['statecounty','popdensity']], on='statecounty',how='left')

#create a common key in the us census data
census2017['statecounty'] = census2017['State'] + census2017['County']
census2017['statecounty'] = census2017['statecounty'].replace(' County','', regex=True)

#create a common key in the us county covid data
covid2019['statecounty'] = covid2019['state'] + covid2019['county']
covid2019['statecounty'] = covid2019['statecounty'].replace(' County','',regex=True)
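
The density calculation above multiplies the polygon area by 10,000 because `.area` on unprojected latitude/longitude geometries comes out in square degrees.  As a rough sanity check on that factor, here is a minimal sketch, assuming a spherical Earth with about 111.32 km per degree of latitude.  The function name and the constant are illustrative assumptions, not part of any data set used here.

```python
import math

KM_PER_DEGREE = 111.32  # assumed approximate length of one degree of latitude, in km

def km2_per_square_degree(latitude_deg):
    """Approximate area in km^2 of a 1-degree by 1-degree cell centered at a latitude."""
    # A degree of longitude shrinks with the cosine of the latitude.
    return KM_PER_DEGREE * KM_PER_DEGREE * math.cos(math.radians(latitude_deg))

# Latitudes roughly spanning the contiguous US
for lat in (25, 35, 45):
    print(lat, round(km2_per_square_degree(lat)))
```

Under these assumptions the factor ranges from roughly 8,800 to 11,200 km² per square degree across those latitudes, so a flat 10,000 is a reasonable middle value but will be off by 10 percent or more for the northernmost and southernmost counties.  Reprojecting to an equal-area CRS before calling `.area` would likely be more accurate.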

At this point, I need to start bringing everything together.  First, I want to summarize a count of how many Chipotles are in each county that has one.  Then, I read in the various demographic data columns from the other datasets into a master dataframe.  

In [None]:
#create a pivot table giving the number of stores by county
chipotle_counts = pd.pivot_table(chipotle_enhanced, values=['address'], index=['statecounty'], aggfunc=lambda x: len(x.unique()))

#trim some columns from the census data
#dropping TotalPop because we have a 2019 estimate, which should be better than a 2017 estimate
census2017_to_merge = census2017.drop(['State','County','CountyId','geometry','TotalPop'], axis=1)

#read in other potentially useful fields for a regression
chipotle_counts = pd.merge(chipotle_counts, us_counties[['statecounty','popdensity','county_pop_2019_est']], on='statecounty',how='left')
chipotle_counts = pd.merge(chipotle_counts, census2017_to_merge, on='statecounty',how='left')
chipotle_counts = pd.merge(chipotle_counts, covid2019[['median_age','statecounty']],on='statecounty',how='left')

# #rename the "address" column to "count" because it annoys me
# chipotle_counts.rename(columns={'address':'count'})

#There are some missing values in a few fields.  Filling them in with the next entry, or failing that, zero
chipotle_counts = chipotle_counts.bfill(axis=0).fillna(0)

#convert values to floats so we don't have to keep doing it
#chipotle_counts = chipotle_counts.loc[:, chipotle_counts.columns != 'statecounty'].astype('float64')
#assign whole columns back (rather than through .loc) so the new numeric dtypes are kept
numeric_cols = chipotle_counts.columns.drop('statecounty')
chipotle_counts[numeric_cols] = chipotle_counts[numeric_cols].apply(pd.to_numeric)

#display data to make sure things look about right
chipotle_counts
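
The left merges above silently produce NaN wherever a `statecounty` key has no match in the other table, for example when a name carries a suffix other than " County".  A minimal sketch of how the key match could be checked with the `indicator` option of `pd.merge` follows.  The two small tables and their values are made up for illustration and are not taken from the real data.

```python
import pandas as pd

# Toy stand-ins for two tables keyed on state + county (values are made up).
counties = pd.DataFrame({
    "statecounty": ["OhioFranklin", "LouisianaOrleans", "AlaskaAnchorage"],
    "county_pop_2019_est": [1300000, 390000, 288000],
})
census = pd.DataFrame({
    "State": ["Ohio", "Louisiana", "Alaska"],
    "County": ["Franklin County", "Orleans Parish", "Anchorage Municipality"],
})

# Build the key the same way as above: concatenate, then strip " County".
census["statecounty"] = (census["State"] + census["County"]).str.replace(
    " County", "", regex=False
)

# indicator=True adds a "_merge" column saying where each row was found.
check = pd.merge(counties, census, on="statecounty", how="left", indicator=True)
unmatched = check.loc[check["_merge"] == "left_only", "statecounty"].tolist()
print(unmatched)
```

Any keys listed in `unmatched` would end up with missing demographic values after the merge, so they are worth inspecting before filling gaps.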

Let's try out a linear regression on these variables (except statecounty, which is a key field).

In [None]:
X = chipotle_counts.drop(['address','statecounty'],axis=1)
y = chipotle_counts.address

lm = linear_model.LinearRegression()
model=lm.fit(X,y)
lm.score(X,y)

The score above is an in-sample R-squared, so it suggests that a linear model fits these counties well rather than confirming out-of-sample predictive value.  Now I run several iterations to figure out which variables are statistically significant.  For that, sklearn isn't sufficient, because it doesn't report p-values for each predictor, at least not in a way I can understand.  We need statsmodels OLS, which gives some statistics on the variables.

In [None]:
mod = sm.OLS(y, X)
res = mod.fit()
print(res.summary())
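
One detail worth keeping in mind when comparing the two libraries: `sm.OLS(y, X)` fits exactly the columns it is given and does not add an intercept, whereas sklearn's `LinearRegression` fits an intercept by default.  Coefficients and R-squared values from the two fits are therefore not directly comparable unless a constant column is added (statsmodels provides `statsmodels.api.add_constant` for this) or the intercept is switched off in sklearn.  The sketch below uses plain NumPy least squares on synthetic data to show how much the slope can shift; the data and variable names are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 5.0 + 2.0 * x + rng.normal(0, 0.5, size=200)  # true intercept 5, slope 2

# No intercept: the design matrix holds only x.
X_no_const = x.reshape(-1, 1)
slope_only, *_ = np.linalg.lstsq(X_no_const, y, rcond=None)

# With intercept: prepend a column of ones to the design matrix.
X_const = np.column_stack([np.ones_like(x), x])
coef_const, *_ = np.linalg.lstsq(X_const, y, rcond=None)

print("slope without intercept:", slope_only[0])
print("intercept and slope with constant:", coef_const)
```

Without the constant column, the slope absorbs the missing intercept and comes out noticeably larger than the true value of 2.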

Trim variables with high p-values (over 0.75) and try again.  Variables dropped are:

* Asian
* IncomeErr
* PrivateWork
* PublicWork
* SelfEmployed

In [None]:
X = chipotle_counts.drop(['address','statecounty','Asian','IncomeErr','PrivateWork','PublicWork','SelfEmployed'],axis=1)

mod = sm.OLS(y, X)
res = mod.fit()
print(res.summary())

Trying again, this time with variables that have p-values over 0.5.  New variables dropped:

* popdensity
* Native
* FamilyWork

So much for my idea that population density might be a good predictor.  It also seems like the type of work isn't very important for the number of Chipotles.

In [None]:
X = chipotle_counts.drop(['address','statecounty','Asian','IncomeErr','PrivateWork','PublicWork','SelfEmployed','popdensity','Native','FamilyWork'],axis=1)

mod = sm.OLS(y, X)
res = mod.fit()
print(res.summary())

Trying again, this time trimming variables with p-value > 0.25. New variables dropped:

* White
* Black
* Poverty
* ChildPoverty
* median_age

In [None]:
X = chipotle_counts.drop(['address','statecounty','Asian','IncomeErr','PrivateWork','PublicWork','SelfEmployed','popdensity','Native','FamilyWork','White','Black','Poverty','ChildPoverty','median_age'],axis=1)

mod = sm.OLS(y, X)
res = mod.fit()
print(res.summary())

Next round omits variables with p-values > 0.1.  New variables dropped:

* Men
* Hispanic

In [None]:
X = chipotle_counts.drop(['address','statecounty','Asian','IncomeErr','PrivateWork','PublicWork','SelfEmployed','popdensity','Native','FamilyWork','White','Black','Poverty','ChildPoverty','median_age','Men','Hispanic'],axis=1)

mod = sm.OLS(y, X)
res = mod.fit()
print(res.summary())

In [None]:
X = chipotle_counts.drop(['address','statecounty','Asian','IncomeErr','PrivateWork','PublicWork','SelfEmployed','popdensity','Native','FamilyWork','White','Black','Poverty','ChildPoverty','median_age','Men','Hispanic','Office','Construction','Production','Drive','Carpool','Transit','WorkAtHome','MeanCommute','Unemployment'],axis=1)

mod = sm.OLS(y, X)
res = mod.fit()
print(res.summary())

In [None]:
X = chipotle_counts.drop(['address','statecounty','Asian','IncomeErr','PrivateWork','PublicWork','SelfEmployed','popdensity','Native','FamilyWork','White','Black','Poverty','ChildPoverty','median_age','Men','Hispanic','Office','Construction','Production','Drive','Carpool','Transit','WorkAtHome','MeanCommute','Unemployment','Income','IncomePerCap','Professional','Service','Walk','OtherTransp'],axis=1)

mod = sm.OLS(y, X)
res = mod.fit()
print(res.summary())

In [None]:
X = chipotle_counts.drop(['address','statecounty','Asian','IncomeErr','PrivateWork','PublicWork','SelfEmployed','popdensity','Native','FamilyWork','White','Black','Poverty','ChildPoverty','median_age','Men','Hispanic','Office','Construction','Production','Drive','Carpool','Transit','WorkAtHome','MeanCommute','Unemployment','Income','IncomePerCap','Professional','Service','Walk','OtherTransp','VotingAgeCitizen','IncomePerCapErr'],axis=1)

mod = sm.OLS(y, X)
res = mod.fit()
print(res.summary())

The last three rounds above omit any variables with p-value > 0.05, until all remaining variables meet that condition.  Final variables dropped:

* Office
* Construction
* Production
* Drive
* Carpool
* Transit
* WorkAtHome
* MeanCommute
* Unemployment
* Income
* IncomePerCap
* Professional
* Service
* Walk
* OtherTransp
* VotingAgeCitizen
* IncomePerCapErr
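
The rounds above were done by hand, editing the drop list each time.  The same idea could be automated as a loop.  The sketch below is one possible version, not the procedure used in this notebook: it drops a single variable per round (the one with the largest p-value) instead of whole batches, fits without an intercept like the `sm.OLS(y, X)` calls above, and approximates p-values with a normal distribution so that it only needs NumPy.  With statsmodels, `res.pvalues` could supply the exact t-based values instead of the hypothetical `approx_p_values` helper.

```python
import math
import numpy as np

def approx_p_values(X, y):
    """Two-sided p-values for OLS coefficients using a normal approximation."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - k)              # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)         # coefficient covariance
    t_stats = beta / np.sqrt(np.diag(cov))
    return np.array([math.erfc(abs(t) / math.sqrt(2)) for t in t_stats])

def backward_eliminate(X, y, names, threshold=0.05):
    """Drop the least significant column until all p-values are <= threshold."""
    names = list(names)
    while names:
        p = approx_p_values(X, y)
        worst = int(np.argmax(p))
        if p[worst] <= threshold:
            break
        X = np.delete(X, worst, axis=1)
        del names[worst]
    return names

# Synthetic demo: only columns "a" and "c" actually drive y.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(500, 4))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 2] + rng.normal(size=500)
kept = backward_eliminate(X_demo, y_demo, ["a", "b", "c", "d"])
print(kept)
```

Dropping one variable at a time matters because p-values change after every removal, so a variable that looks weak in one round can become significant once a correlated predictor is gone.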

Now that we know the significant variables are county_pop_2019_est, Women, Pacific, and Employed, we can go back to sklearn, which makes prediction easier.  The next section retrains the model using sklearn, then generates predictions for all US counties (rows with NaNs are dropped).

In [None]:
#retrain model
X = chipotle_counts.loc[:,['Women','Pacific','Employed','county_pop_2019_est']]
y = chipotle_counts.address
lm = linear_model.LinearRegression()
model=lm.fit(X,y)

#develop new predictions with all counties
census_xpred = census2017.drop(['geometry','CountyId','State','County','TotalPop','Asian','IncomeErr','PrivateWork','PublicWork','SelfEmployed','Native','FamilyWork','White','Black','Poverty','ChildPoverty','Men','Hispanic','Office','Construction','Production','Drive','Carpool','Transit','WorkAtHome','MeanCommute','Unemployment','Income','IncomePerCap','Professional','Service','Walk','OtherTransp','VotingAgeCitizen','IncomePerCapErr'],axis=1)
xpred = pd.merge(census_xpred,us_counties[['statecounty','county_pop_2019_est']], on='statecounty',how='left')
xpred = xpred.dropna()
census2017_with_pred = xpred
xpred = xpred.loc[:, xpred.columns != 'statecounty'].astype('float64')
census2017_with_pred['ypred'] = model.predict(xpred)
census2017_with_pred

This next section of code merges the prediction data back with county geometry for the final map.  I also create a "residuals" variable, which is the predicted number of Chipotles per county minus the actual number.  Typically residuals are expressed as "actual minus expected", not "expected minus actual", but I wanted areas that appear to be deficient in Chipotles to show up as positives, i.e., build this many Chipotles in this county.

In [None]:
#Merging the prediction and actual counts with county geometry
model_data = pd.merge(us_counties,census2017_with_pred, on='statecounty',how='left')
model_data['statecounty'] = model_data['state'] + model_data['county']
model_data = pd.merge(model_data, chipotle_counts[['address','statecounty']],on='statecounty',how='left')
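#assign the filled columns back; calling fillna(inplace=True) on an attribute may not update the dataframe in newer pandas
model_data['address'] = model_data['address'].fillna(0)
model_data['ypred'] = model_data['ypred'].fillna(0)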

#Normally we'd subtract "actual" minus "expected" to determine residuals.  In this case, 
#it would be better to have "expected" minus "actual" to show how many restaurants should be built.

model_data['residuals'] = model_data['ypred'] - model_data['address']
model_data.crs = "epsg:4326"
model_data
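
To make the sign convention concrete, here is a small worked example with made-up numbers for three hypothetical counties.  The column names mirror the ones used above, but none of the values come from the real data.

```python
import pandas as pd

# Made-up predictions and store counts for three hypothetical counties.
demo = pd.DataFrame({
    "statecounty": ["StateA_One", "StateA_Two", "StateB_Three"],
    "ypred": [12.4, 0.8, -0.3],
    "address": [9.0, 3.0, None],  # None: no stores were found in this county
})

# Counties without any store have no row in the counts table, so treat them as zero.
demo["address"] = demo["address"].fillna(0)

# Predicted minus actual: positive means room for more stores,
# negative means more stores than the model expects.
demo["residuals"] = demo["ypred"] - demo["address"]
print(demo)
```

In this toy table the first county comes out at about +3.4 (room for roughly three more stores), the second at about -2.2 (more stores than the model expects), and the third at about -0.3 (no stores, and none predicted).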

Time to put a map together of those residuals.  First, let's figure out the min and max to help set the color scale.

In [None]:
print('Maximum residual = ' + str(model_data.residuals.max()))
print('Minimum residual = ' + str(model_data.residuals.min()))
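
The hard-coded scale used for the map has to bracket these two numbers.  My understanding is that folium expects every value to fall inside the outermost bin edges, but treat that as an assumption and check it against the folium version in use.  A minimal sketch of deriving the outer edges from the data instead of typing them in follows; `bins_covering` is a hypothetical helper and the residual values are made up.

```python
import math

def bins_covering(values, inner_edges):
    """Return bin edges whose outer limits bracket every value."""
    lo = math.floor(min(values))
    hi = math.ceil(max(values))
    # Keep only the inner edges that fall strictly inside the data range.
    inner = sorted(e for e in inner_edges if lo < e < hi)
    return [lo] + inner + [hi]

residuals_demo = [-46.2, -3.0, 0.4, 2.5, 11.7]  # made-up residuals
print(bins_covering(residuals_demo, [-1, 1, 5, 10]))
# prints [-47, -1, 1, 5, 10, 12]
```

The resulting list could then be passed wherever the fixed scale is supplied, so the map does not need to be edited by hand if the model or data change.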

The color scale was difficult for me to set.  Counties with no need for another Chipotle show up in red, counties that are about right are in orange, and the yellow and green counties appear able to support more Chipotles than they currently have.

In [None]:
geo_data = model_data[['statecounty','county_geom']].set_index('statecounty')
residual_data = model_data[['statecounty','residuals']].set_index('statecounty')

m_1 = folium.Map(location=[40,-100], tiles='openstreetmap', zoom_start=4)

Choropleth(geo_data=geo_data.__geo_interface__, data=residual_data['residuals'], fill_color='RdYlGn', key_on='feature.id', legend_name='Chipotles to build per county', threshold_scale=[-50,-1,1,5,10,12]).add_to(m_1)

m_1

Tabular view of top 10 counties needing more Chipotles.

In [None]:
model_data.sort_values('residuals',ascending=False).head(10)

Tabular view of top 10 counties unable to support additional Chipotles.  This may include counties where there are currently no Chipotles and it's a really bad idea to build one there.

In [None]:
model_data.sort_values('residuals',ascending=True).head(10)

# Concluding Thoughts

A few observations from the final dataset and map:

* The final model has the equation: Number of Chipotles in County = 0.000005397 * 2019 County Population Estimate - 0.0000363 * Number of Women in County - 2.0156 * Percentage of Pacific Islanders in County + 0.00005169 * Number of employed people in county.  
* Variables that I had suspected might be significant turned out not to be.  In particular, population density dropped out of the running for statistically significant variables rather quickly.
* Areas that appear to have fewer Chipotles than expected seem to cluster around a few large cities: Seattle, Salt Lake City, Houston, San Antonio, Oklahoma City, Memphis, Detroit, Buffalo, Miami, and certain portions of the northeast megalopolis.
* Interestingly, New York appears to be over-saturated with Chipotles.
* The model can produce a negative number of predicted Chipotles.  While practically impossible, I don't think there's any theoretical problem with that outcome: it simply means that it's a bad place to even try starting a Chipotle.

A few other areas for future research:

* The original data gave existing Chipotle locations, but some measure of each restaurant's profitability would have added another dimension to the analysis.
* While this analysis gives a general idea of which counties appear to be able to support more or fewer Chipotles, it ignores more granular data.  There may be opportunities for additional Chipotles in a county that seems oversaturated if it is on the border with another that has fewer Chipotles than expected, for example.
* Information on competing restaurant chains could be useful.  Some of the cities above may have other, regional franchises operating that limit Chipotle's market share in those domains.
* My method of choosing significant variables may be sub-optimal.  I didn't know how to implement something like the Bayesian Information Criterion (also known as the Schwarz criterion) here.
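
On the last point, one way a BIC-based selection could look is sketched below.  It assumes a Gaussian linear model, for which BIC can be written (up to an additive constant) as n * ln(RSS / n) + k * ln(n), and it scores every subset of a small synthetic predictor set.  The helper names and the data are assumptions for illustration.  Fitted statsmodels results also expose a `bic` attribute that could be compared across the candidate models instead.

```python
from itertools import combinations

import numpy as np

def bic(X, y):
    """BIC of a Gaussian linear model, up to an additive constant."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    return n * np.log(rss / n) + k * np.log(n)

def best_subset_by_bic(X, y, names):
    """Score every non-empty subset of columns and return the lowest-BIC one."""
    best_names, best_score = None, np.inf
    for size in range(1, len(names) + 1):
        for cols in combinations(range(len(names)), size):
            score = bic(X[:, list(cols)], y)
            if score < best_score:
                best_names, best_score = [names[c] for c in cols], score
    return best_names, best_score

# Synthetic demo: only columns "a" and "c" actually drive y.
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(500, 4))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 2] + rng.normal(size=500)
best, score = best_subset_by_bic(X_demo, y_demo, ["a", "b", "c", "d"])
print(best, score)
```

Exhaustive search is only practical for a handful of predictors, since the number of subsets doubles with each added column.  With the few dozen columns in this notebook, a stepwise variant that adds or removes one variable at a time and keeps the change only when BIC decreases would be more realistic.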

A few comments on my learning experience.  As I mentioned above, this notebook was my first attempt at this type of analysis.

* I'm sure some of my methods are clunky, but it got me where I needed to go.
* My color scale in Folium is not my first choice, but it seems like there are limited options in the Choropleth map.
* I was unable to add a tooltip or pop-up marker to the Choropleth map in Folium.

I hope this helps someone think about the task in a new way.  Feel free to use my notebook as a starting point for your own.