# Where to build a pizza place in San Francisco?

## Introduction

Pizza places enjoy great popularity. For the customer, they provide a well-known and well-liked product; for the owner, they offer good added value and correspondingly high profit margins. To maximize profit, however, the location is crucial. In this project we try to find the best location for a pizza place in San Francisco.

At first it would seem logical to look for a district that does not yet have a pizza place, since a new one would fill a gap in the market there. But following [Hotelling's spatial competition](https://www.sciencedirect.com/science/article/pii/0165176582900891) (also known as [Hotelling's law](https://en.wikipedia.org/wiki/Hotelling%27s_law)), a new pizza place fits well next to an existing one. If a district has no pizza place, or only one, few customers will go to that district to eat pizza. If the district is known for pizza, however, our client's new pizza place can win over the customers of other pizzerias by offering better quality, a wider selection and more exotic creations.

In addition to the number of pizzerias per district, the crime rate in the district is also important to our client. Let us assume that he closed his last pizza place because of frequent vandalism and is therefore looking for a district with a low crime rate. The goal of this project is to cluster the districts of San Francisco according to their mix of shops and restaurants and the crime occurring in them, and to recommend suitable districts to our client.
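The intuition behind Hotelling's law can be illustrated with a toy model. The following is a minimal sketch, not part of the analysis itself: it assumes a street of length 1 with evenly spaced customers who all buy from the nearest shop at equal prices, and the function name `market_share` is hypothetical.

```python
def market_share(own, rival, n_customers=10_001):
    """Share of customers on the street [0, 1] whose nearest shop is `own`."""
    won = 0.0
    for k in range(n_customers):
        x = k / (n_customers - 1)           # customer position, evenly spaced
        d_own, d_rival = abs(x - own), abs(x - rival)
        if d_own < d_rival:
            won += 1.0
        elif d_own == d_rival:
            won += 0.5                      # ties are split evenly
    return won / n_customers

rival = 0.5                                 # existing pizzeria in the middle of the street
for own in (0.1, 0.3, 0.49):
    print(f"own shop at {own:.2f}: share = {market_share(own, rival):.3f}")
```

Under these assumptions the new shop's share grows as it moves closer to the existing one, which is the effect the argument above relies on.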


## Data

The analysis is based upon three datasets:
 1. Geometries of the districts of San Francisco, downloaded from the official website [data.sfgov.org](https://data.sfgov.org/Geographic-Locations-and-Boundaries/Current-Supervisor-Districts/8nkz-x4ny)
 2. Foursquare venue data for each district, downloaded via the Foursquare API
 3. Police Department Incident Reports, also downloaded from the official website [data.sfgov.org](https://data.sfgov.org/Public-Safety/Police-Department-Incident-Reports-2018-to-Present/wg3w-h783)
 
In the following, the datasets are read into the Jupyter notebook and their contents are displayed. The exploratory data analysis and the application of machine learning methods follow in the methodology part.

### First, import the Python packages:

In [1]:
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import seaborn
import numpy as np
import lxml
import requests 
from pandas import json_normalize  # pandas.io.json.json_normalize is deprecated since pandas 1.0
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
import folium 
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.metrics import jaccard_score
from sklearn.utils import resample

### 1. Import of District information

The geometries of the districts of San Francisco are downloaded from the official website [data.sfgov.org](https://data.sfgov.org/Geographic-Locations-and-Boundaries/Current-Supervisor-Districts/8nkz-x4ny) and imported into Python with the geopandas package. The resulting GeoDataFrame contains the geometries themselves as well as the numbers and names of the districts. The names in particular will be used throughout this project.

In [2]:
districts = gpd.read_file('./data_sfgov/Current Supervisor Districts.geojson')
districts

| | supdistpad | supdist | supname | supervisor | numbertext | geometry |
|---|---|---|---|---|---|---|
| 0 | 11 | SUPERVISORIAL DISTRICT 11 | Safai | 11 | ELEVEN | MULTIPOLYGON (((-122.42247 37.71789, -122.4224... |
| 1 | 9 | SUPERVISORIAL DISTRICT 9 | Ronen | 9 | NINE | MULTIPOLYGON (((-122.41093 37.76941, -122.4108... |
| 2 | 3 | SUPERVISORIAL DISTRICT 3 | Peskin | 3 | THREE | MULTIPOLYGON (((-122.39198 37.79387, -122.3921... |
| 3 | 1 | SUPERVISORIAL DISTRICT 1 | Fewer | 1 | ONE | MULTIPOLYGON (((-122.49374 37.78761, -122.4936... |
| 4 | 8 | SUPERVISORIAL DISTRICT 8 | Mandelman | 8 | EIGHT | MULTIPOLYGON (((-122.42327 37.77206, -122.4232... |
| 5 | 2 | SUPERVISORIAL DISTRICT 2 | Stefani | 2 | TWO | MULTIPOLYGON (((-122.41922 37.80845, -122.4192... |
| 6 | 4 | SUPERVISORIAL DISTRICT 4 | Mar | 4 | FOUR | MULTIPOLYGON (((-122.47485 37.76179, -122.4749... |
| 7 | 7 | SUPERVISORIAL DISTRICT 7 | Yee | 7 | SEVEN | MULTIPOLYGON (((-122.44854 37.75904, -122.4484... |
| 8 | 10 | SUPERVISORIAL DISTRICT 10 | Walton | 10 | TEN | MULTIPOLYGON (((-122.39905 37.76973, -122.3981... |
| 9 | 6 | SUPERVISORIAL DISTRICT 6 | Haney | 6 | SIX | MULTIPOLYGON (((-122.39382 37.79374, -122.3931... |


In [3]:
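# x is the longitude and y the latitude of each centroid
lat = districts.centroid.y.mean()
lon = districts.centroid.x.mean()

In [4]:
import folium

# folium expects the map center as [latitude, longitude]
mapa = folium.Map([lat, lon],
                  zoom_start=12)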

folium.TileLayer('stamentoner').add_to(mapa)
folium.TileLayer('Stamen Terrain').add_to(mapa)

for i in range(0,districts.shape[0]):
    fg = folium.GeoJson(
        districts.iloc[[i]],
        style_function=lambda feature: {
            'fillColor': "grey",
            'color' : "black",
            'weight' : 1,
            'fillOpacity' : 0.5,},
    )
    folium.Popup('{}'.format(districts.loc[i].supname)).add_to(fg)
    fg.add_to(mapa)
    
folium.LayerControl().add_to(mapa)    

mapa

### 2. Import Foursquare data

The Foursquare data for each district are downloaded via the API. First, the credentials for the Foursquare API are read in, and the limit on the number of venues and the search radius are set.

In [5]:
creds = pd.read_csv('../creds.csv')
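# Take the first row of each column and strip surrounding whitespace
CLIENT_ID = str(creds.CLIENT_ID.iloc[0]).strip()
CLIENT_SECRET = str(creds.CLIENT_SECRET.iloc[0]).strip()
VERSION = '20200605' # Foursquare API version
LIMIT = 2000 # maximum number of venues requested per call
radius = 3000 # search radius around each district centroid, in meters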

In [6]:
centers = districts.centroid

The Foursquare data are downloaded and transferred to a pandas dataframe containing each venue's id, name, category and coordinates. As an example, the download is first carried out for the first district only:

In [7]:
# Scalar latitude and longitude of the first district's centroid
lat = float(centers.y.iloc[0])
lon = float(centers.x.iloc[0])
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lon, 
            radius, 
            LIMIT)
results = requests.get(url).json()
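
The request URL above is assembled by string formatting. As a minimal sketch, the same query could be built from a parameter dictionary, which keeps each value next to its name. It assumes the `venues/explore` endpoint used above; the helper name `build_explore_url` and the placeholder credentials are hypothetical.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, version, lat, lon, radius, limit):
    """Assemble the venues/explore query string from named parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': f'{lat},{lon}',   # latitude first, then longitude
        'radius': radius,       # meters
        'limit': limit,
    }
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params)

# Placeholder credentials and coordinates, for illustration only
url = build_explore_url('MY_ID', 'MY_SECRET', '20200605', 37.72, -122.44, 3000, 2000)
print(url)
```

`requests.get` also accepts such a dictionary directly through its `params` argument, so the manual encoding step is optional.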

In [8]:
res_star = results["response"]['groups'][0]['items']

# Collect one record per venue, then build the dataframe in a single step
# (DataFrame.append was removed in pandas 2.0)
rows = []
for item in res_star:
    venue = item['venue']
    rows.append({'id': venue['id'],
                 'Name': venue['name'],
                 'Category': venue['categories'][0]['name'],
                 'lat': venue['location']['lat'],
                 'lon': venue['location']['lng']})
df = pd.DataFrame(rows, columns = ['id','Name','Category','lat','lon'])
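
The extraction above assumes that every venue has at least one category. The following minimal sketch wraps the same extraction in a reusable function that skips venues without a category. The function name `extract_venues` and the mock response are hypothetical; the mock only mimics the nested structure accessed above and contains made-up values.

```python
def extract_venues(results):
    """Flatten an explore-style response into a list of venue records."""
    rows = []
    for item in results['response']['groups'][0]['items']:
        venue = item['venue']
        categories = venue.get('categories', [])
        if not categories:          # skip venues without a category
            continue
        rows.append({
            'id': venue['id'],
            'Name': venue['name'],
            'Category': categories[0]['name'],
            'lat': venue['location']['lat'],
            'lon': venue['location']['lng'],
        })
    return rows

# Mock response with made-up values, shaped like the structure used above
mock = {'response': {'groups': [{'items': [
    {'venue': {'id': 'a1', 'name': 'Example Pizza',
               'categories': [{'name': 'Pizza Place'}],
               'location': {'lat': 37.72, 'lng': -122.44}}},
    {'venue': {'id': 'b2', 'name': 'Uncategorized Spot',
               'categories': [],
               'location': {'lat': 37.73, 'lng': -122.45}}},
]}]}}

venues = extract_venues(mock)
print(venues)
```

The returned list of dictionaries can be passed to `pd.DataFrame` to obtain the same column layout as `df`.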

Afterwards, the procedure is applied to all districts:

In [9]:
# The loop includes the first district again, so df is rebuilt from scratch
# rather than extended; this avoids duplicating that district's venues.
rows = []
for d in range(len(centers)):

    lat = float(centers.y.iloc[d])
    lon = float(centers.x.iloc[d])
    url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
                CLIENT_ID, 
                CLIENT_SECRET, 
                VERSION, 
                lat, 
                lon, 
                radius, 
                LIMIT)
    results = requests.get(url).json()

    res_star = results["response"]['groups'][0]['items']
    
    for item in res_star:
        venue = item['venue']
        rows.append({'id': venue['id'],
                     'Name': venue['name'],
                     'Category': venue['categories'][0]['name'],
                     'lat': venue['location']['lat'],
                     'lon': venue['location']['lng']})

# DataFrame.append was removed in pandas 2.0, so the dataframe is built in one step
df = pd.DataFrame(rows, columns = ['id','Name','Category','lat','lon'])

df

| id | Name | Category | lat | lon |
|---|---|---|---|---|
| 4a0e123af964a520c2751fe3 | Taquerias El Farolito | Mexican Restaurant | 37.721230 | -122.437395 |
| 4ec020bbb8f7963bcdde0f6b | The Dark Horse Inn | Bar | 37.716127 | -122.440373 |
| 546960f7498eac74bd5baf47 | Tao Sushi | Japanese Restaurant | 37.721037 | -122.437665 |
| 4b63b31cf964a520d28c2ae3 | Little Joe's Pizza | Pizza Place | 37.718478 | -122.439856 |
| 49f796fff964a520c06c1fe3 | Roxie Food Center | Sandwich Place | 37.726867 | -122.441398 |
| ... | ... | ... | ... | ... |
| 4a579b5ef964a52074b61fe3 | La Boulangerie de San Francisco | Bakery | 37.787892 | -122.433985 |
| 4a64a8f4f964a5206cc61fe3 | Spruce | New American Restaurant | 37.787551 | -122.452777 |
| 585c8202ca1070180ddb525c | Pearl Spa and Sauna | Bath House | 37.785642 | -122.429130 |
| 500088f7d63e64b62bc19e6e | Rich Table | New American Restaurant | 37.774891 | -122.422736 |
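
Because each query covers a circle of 3000 m around a district centroid, the circles of neighbouring districts can overlap, and the same venue may then appear more than once. A minimal sketch of removing such repeats by venue id and counting the pizza places, shown on a made-up sample with the same column layout (the sample values are hypothetical):

```python
import pandas as pd

# Made-up sample in the same column layout as df above
sample = pd.DataFrame({
    'id': ['a1', 'b2', 'a1', 'c3'],
    'Name': ['Example Pizza', 'Example Bar', 'Example Pizza', 'Another Pizza'],
    'Category': ['Pizza Place', 'Bar', 'Pizza Place', 'Pizza Place'],
    'lat': [37.72, 37.73, 37.72, 37.78],
    'lon': [-122.44, -122.45, -122.44, -122.43],
})

# Keep one row per Foursquare venue id and renumber the index
unique_venues = sample.drop_duplicates(subset='id').reset_index(drop=True)

# Count the pizza places among the unique venues
n_pizza = (unique_venues['Category'] == 'Pizza Place').sum()
print(len(unique_venues), n_pizza)
```

Whether repeats should be dropped depends on how venues are later assigned to districts, so this step is a suggestion rather than part of the original workflow.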


### 3. Import of crime data

Since our client is afraid of malicious mischief, robbery, burglary and vandalism, we want to know in which districts he would be most exposed to them. Therefore, the Police Department Incident Reports are downloaded from the official [website](https://data.sfgov.org/Public-Safety/Police-Department-Incident-Reports-2018-to-Present/wg3w-h783). The dataset includes many attributes, but for our analysis we are specifically interested in the incident category and subcategory as well as the location information.

In [10]:
crimes_full = pd.read_csv('./data_sfgov/Police_Department_Incident_Reports__2018_to_Present.csv')
crimes_full[['Incident Category','Incident Subcategory', 'Incident Description', 'Latitude', 'Longitude']]

| | Incident Category | Incident Subcategory | Incident Description | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | Offences Against The Family And Children | Other | Domestic Violence (secondary only) | 37.762569 | -122.499627 |
| 1 | Non-Criminal | Other | Mental Health Detention | 37.780535 | -122.408161 |
| 2 | Missing Person | Missing Person | Found Person | 37.721600 | -122.390745 |
| 3 | Offences Against The Family And Children | Family Offenses | Elder Adult or Dependent Abuse (not Embezzleme... | 37.794860 | -122.404876 |
| 4 | Assault | Simple Assault | Battery | 37.797716 | -122.430559 |
| ... | ... | ... | ... | ... | ... |
| 356650 | Non-Criminal | Non-Criminal | Found Property | 37.780927 | -122.413676 |
| 356651 | Larceny Theft | Larceny - From Vehicle | Theft, From Locked Vehicle, >$950 | 37.766406 | -122.424258 |
| 356652 | Assault | Simple Assault | Battery | 37.759830 | -122.425920 |
| 356653 | Robbery | Robbery - Commercial | Robbery, Chain Store, W/ Force | 37.726132 | -122.464573 |
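
Only some incident categories matter to the client, and an incident can only be assigned to a district if it has coordinates. A minimal sketch of such a filter on a made-up sample follows. The category spellings in `relevant` are assumptions and should be checked against `crimes_full['Incident Category'].unique()` before use.

```python
import pandas as pd

# Made-up sample with the column names shown above
sample = pd.DataFrame({
    'Incident Category': ['Assault', 'Robbery', 'Burglary',
                          'Malicious Mischief', 'Non-Criminal'],
    'Latitude': [37.7977, 37.7261, None, 37.7664, 37.7809],
    'Longitude': [-122.4306, -122.4646, None, -122.4243, -122.4137],
})

# Categories the client worries about (assumed spellings)
relevant = ['Malicious Mischief', 'Robbery', 'Burglary', 'Vandalism']

crimes = (sample[sample['Incident Category'].isin(relevant)]
          .dropna(subset=['Latitude', 'Longitude'])   # a location is needed to assign a district
          .reset_index(drop=True))
print(crimes)
```

In this sample the burglary row is dropped because it has no coordinates, which shows why the two filters are applied together.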
