# Opening A New Sushi Chain in San Francisco, California, USA
## The Battle of the Neighborhoods - Final
### Coursera Applied Data Science Capstone by IBM

In [21]:
print("Hello Capstone Project Course!")

Hello Capstone Project Course!


## Table of contents
* [Business Problem Introduction](#businessproblemintroduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#resultsanddiscussion)
* [Conclusion](#conclusion)

## Business Problem Introduction <a name="businessproblemintroduction"></a>

In this project we will try to find a suitable location for a Japanese chain that will be opening a new brick-and-mortar restaurant in a new geographic market. The key stakeholder is the Director of Operations, who is interested in opening a *sushi restaurant* in *San Francisco, California, USA*.

The Director has specified two criteria. First, as San Francisco is an international destination with many eateries, the Director has requested a location *not crowded with competitor cuisines*. Second, since the raw materials and ingredients for maintaining a restaurant of the chain's calibre are expensive, the Director would also prefer a city location *with low theft rates*. Because the chain maintains an active Loss Prevention program with specialized equipment, a location with low theft incidence over the past 5 years is acceptable.

We will be analyzing locations in San Francisco as well as crime data to support the restaurant chain's initiative. We will then present pros and cons of areas selected and provide recommendations to the stakeholders.

## Data <a name="data"></a>

Based on the requirements defined in the problem, factors that will influence the stakeholder's decision are:
* number of competitor restaurants in a location
* crime data for a location

For the first criterion, we plan to use data from Foursquare to generate a list of eateries in San Francisco by location. For the second criterion, we will use the public San Francisco crime data to investigate thefts and robberies by location and present a proposal for site selection.

We define San Francisco locations by police district (the `PdDistrict` field in the crime data), since the city is only about 7 miles by 7 miles. The planned data sources are:
* the number and type of eateries in each San Francisco location, obtained using the **Foursquare API**
* theft locations in the city, overlaid onto a San Francisco **Folium map** for analysis

## Methodology<a name="methodology"></a>

At a high level, we will employ the following methodology for analysis:
* Import all data libraries into the Python notebook
* Extract the 2016 San Francisco crime data, using `PdDistrict` as the basis for San Francisco districts
* Wrangle the police incidents file to select only Burglary, Larceny/Theft, and Robbery incidents
* Aggregate theft incidents by district (group and count)
* For each district, determine latitude and longitude with a geolocator
* Use the Foursquare API to get venues near each district's coordinates
* Wrangle the venues data to select only restaurants in each district
* Aggregate restaurants by district
* Create a map to overlay theft incidents data
* Merge theft incidents data with restaurants data and sort by theft incidents
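
The steps above reduce to a filter, a group-and-count, and a merge. The following is a minimal sketch of that flow on made-up rows; the incident and venue values are illustrative and are not taken from the crime file or from Foursquare.

```python
import pandas as pd

# Toy incident rows; column names mimic the 2016 incidents file.
incidents = pd.DataFrame({
    "Category": ["LARCENY/THEFT", "WARRANTS", "BURGLARY", "ROBBERY", "NON-CRIMINAL"],
    "PdDistrict": ["MISSION", "MISSION", "PARK", "MISSION", "PARK"],
})
# Toy venue rows in the shape produced by a venue lookup.
venues = pd.DataFrame({
    "Neighborhood": ["MISSION", "MISSION", "PARK", "PARK"],
    "Venue Category": ["Sushi Restaurant", "Bakery", "Thai Restaurant", "Park"],
})

# Keep theft-type incidents only, then count rows per district.
theft_categories = ["BURGLARY", "LARCENY/THEFT", "ROBBERY"]
theft = incidents[incidents["Category"].isin(theft_categories)]
theft_by_district = theft.groupby("PdDistrict").size().rename("Total Theft").reset_index()

# Keep restaurants only, then count rows per district.
restaurants = venues[venues["Venue Category"].str.contains("Restaurant")]
restaurants_by_district = (
    restaurants.groupby("Neighborhood").size().rename("Restaurant Venues").reset_index()
)

# Join the two counts on the district name and sort by theft.
summary = (
    theft_by_district
    .merge(restaurants_by_district, left_on="PdDistrict", right_on="Neighborhood")
    .drop(columns="Neighborhood")
    .sort_values("Total Theft")
    .reset_index(drop=True)
)
print(summary)
```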

## Analysis<a name="analysis"></a>

In [25]:
import requests # library to handle requests
import urllib.request
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import numpy as np
import csv
from bs4 import BeautifulSoup
!conda install -c conda-forge folium=0.5.0 --yes 
import folium # map rendering library
from folium import plugins
import json # library to handle JSON files
!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
from pandas import json_normalize # transform JSON file into a pandas dataframe (requires pandas >= 1.0)
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
print('Libraries imported.')


In [26]:
#Foursquare info
import os
# credentials are read from environment variables instead of being hard-coded (the variable names are placeholders)
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret
VERSION = '20180605' # Foursquare API version


### Get PD crimes data

In [39]:
#get PD districts from crimes data
urlDistricts='https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DV0101EN/labs/Data_Files/Police_Department_Incidents_-_Previous_Year__2016_.csv'
df_incidents = pd.read_csv(urlDistricts,
                           index_col=0)
df_incidents.drop(['Descript', 'DayOfWeek', 'Date', 'Time', 'Resolution', 'Address', 'PdId'], axis=1, inplace=True)
df_incidents.head()


IncidntNum,Category,PdDistrict,X,Y,Location
120058272,WEAPON LAWS,SOUTHERN,-122.403405,37.775421,"(37.775420706711, -122.403404791479)"
120058272,WEAPON LAWS,SOUTHERN,-122.403405,37.775421,"(37.775420706711, -122.403404791479)"
141059263,WARRANTS,BAYVIEW,-122.388856,37.729981,"(37.7299809672996, -122.388856204292)"
160013662,NON-CRIMINAL,TENDERLOIN,-122.412971,37.785788,"(37.7857883766888, -122.412970537591)"
160002740,NON-CRIMINAL,MISSION,-122.419672,37.76505,"(37.7650501214668, -122.419671780296)"


In [204]:
#get theft incidents only
df_stealing = df_incidents.loc[df_incidents['Category'].isin(["BURGLARY", "LARCENY/THEFT","ROBBERY"])].reset_index(drop=True)
#drop any NaN districts
df_stealing = df_stealing[df_stealing['PdDistrict'].notna()]
df_stealing.head()

,Category,PdDistrict,X,Y,Location
0,LARCENY/THEFT,TARAVAL,-122.477377,37.764478,"(37.7644781578695, -122.477376524003)"
1,BURGLARY,CENTRAL,-122.400909,37.791643,"(37.791642982384, -122.40090869889)"
2,ROBBERY,MISSION,-122.40687,37.75729,"(37.7572895904578, -122.406870402082)"
3,LARCENY/THEFT,SOUTHERN,-122.408421,37.78357,"(37.7835699386918, -122.408421116922)"
4,BURGLARY,NORTHERN,-122.419203,37.787438,"(37.7874378309112, -122.419203004268)"


In [205]:
#theft by district
df_district_theft = df_stealing.groupby(['PdDistrict']).count()

#drop columns not needed here
df_district_theft.drop(['X', 'Y', 'Location'], axis=1, inplace=True)
df_district_theft.columns = ['Total Theft']
df_district_theft = df_district_theft.reset_index()

df_district_theft

,PdDistrict,Total Theft
0,BAYVIEW,3192
1,CENTRAL,7780
2,INGLESIDE,2628
3,MISSION,4661
4,NORTHERN,8410
5,PARK,2758
6,RICHMOND,3629
7,SOUTHERN,10633
8,TARAVAL,3480
9,TENDERLOIN,2338
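
The count-then-drop pattern above relies on the kept column having no missing values, because `count()` tallies non-null entries per column. As a hedged alternative, `size()` counts rows directly. The sketch below uses made-up rows, with one coordinate left missing on purpose to show the difference.

```python
import pandas as pd

toy = pd.DataFrame({
    "Category": ["BURGLARY", "ROBBERY", "LARCENY/THEFT", "ROBBERY"],
    "PdDistrict": ["PARK", "PARK", "TARAVAL", "TARAVAL"],
    "X": [-122.43, None, -122.47, -122.47],  # one missing coordinate on purpose
})

# count() tallies non-null values per column, so the X column under-counts PARK.
per_column = toy.groupby("PdDistrict").count()

# size() counts rows regardless of missing values.
theft_counts = toy.groupby("PdDistrict").size().reset_index(name="Total Theft")
print(per_column)
print(theft_counts)
```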


### Here we get all the geolocation data for each of the PD districts

In [177]:
#geo locator for Bayview, San Francisco, California
bayview_address = 'Bayview, San Francisco, California'

geolocator = Nominatim(user_agent="bayview_explorer")
location = geolocator.geocode(bayview_address)
bayview_latitude = location.latitude
bayview_longitude = location.longitude
print('The geographical coordinates of Bayview, San Francisco, California are {}, {}.'.format(bayview_latitude, bayview_longitude))


The geographical coordinates of Bayview, San Francisco, California are 37.7288889, -122.3925.


In [178]:
#geo locator for Central, San Francisco, California
central_address = 'Central, San Francisco, California'

geolocator = Nominatim(user_agent="central_explorer")
location = geolocator.geocode(central_address)
central_latitude = location.latitude
central_longitude = location.longitude
print('The geographical coordinates of Central, San Francisco, California are {}, {}.'.format(central_latitude, central_longitude))

The geographical coordinates of Central, San Francisco, California are 37.790465049999995, -122.40504277897986.


In [179]:
#geo locator for Ingleside, San Francisco, California
ingleside_address = 'Ingleside, San Francisco, California'

geolocator = Nominatim(user_agent="ingleside_explorer")
location = geolocator.geocode(ingleside_address)
ingleside_latitude = location.latitude
ingleside_longitude = location.longitude
print('The geographical coordinates of Ingleside, San Francisco, California are {}, {}.'.format(ingleside_latitude, ingleside_longitude))


The geographical coordinates of Ingleside, San Francisco, California are 37.7229872, -122.4530272.


In [180]:
#geo locator for Mission, San Francisco, California
mission_address = 'Mission, San Francisco, California'

geolocator = Nominatim(user_agent="mission_explorer")
location = geolocator.geocode(mission_address)
mission_latitude = location.latitude
mission_longitude = location.longitude
print('The geographical coordinates of Mission, San Francisco, California are {}, {}.'.format(mission_latitude, mission_longitude))

The geographical coordinates of Mission, San Francisco, California are 37.7524984, -122.4128258.


In [182]:
#geo locator for the Northern district, using North Beach as a proxy address
#(a neighborhood proxy may not lie within the police district's boundary)
northern_address = 'North Beach, San Francisco, California'

geolocator = Nominatim(user_agent="northern_explorer")
location = geolocator.geocode(northern_address)
northern_latitude = location.latitude
northern_longitude = location.longitude
print('The geographical coordinates of Northern, San Francisco, California are {}, {}.'.format(northern_latitude, northern_longitude))

The geographical coordinates of Northern, San Francisco, California are 37.8011749, -122.4090021.


In [183]:
#geo locator for the Park district, using Glen Park as a proxy address
#(a neighborhood proxy may not lie within the police district's boundary)
park_address = 'Glen Park, San Francisco, California'

geolocator = Nominatim(user_agent="park_explorer")
location = geolocator.geocode(park_address)
park_latitude = location.latitude
park_longitude = location.longitude
print('The geographical coordinates of Park, San Francisco, California are {}, {}.'.format(park_latitude, park_longitude))

The geographical coordinates of Park, San Francisco, California are 37.734281, -122.4344696.


In [184]:
#geo locator for Richmond, San Francisco, California
richmond_address = 'Richmond, San Francisco, California'

geolocator = Nominatim(user_agent="richmond_explorer")
location = geolocator.geocode(richmond_address)
richmond_latitude = location.latitude
richmond_longitude = location.longitude
print('The geographical coordinates of Richmond, San Francisco, California are {}, {}.'.format(richmond_latitude, richmond_longitude))

The geographical coordinates of Richmond, San Francisco, California are 37.7770459, -122.4654532.


In [185]:
#geo locator for Southern, San Francisco, California
southern_address = 'South of Market, San Francisco, California'

geolocator = Nominatim(user_agent="southern_explorer")
location = geolocator.geocode(southern_address)
southern_latitude = location.latitude
southern_longitude = location.longitude
print('The geographical coordinates of Southern, San Francisco, California are {}, {}.'.format(southern_latitude, southern_longitude))

The geographical coordinates of Southern, San Francisco, California are 37.7808925, -122.4009518.


In [186]:
#geo locator for Taraval, San Francisco, California
taraval_address = 'Taraval, San Francisco, California'

geolocator = Nominatim(user_agent="taraval_explorer")
location = geolocator.geocode(taraval_address)
taraval_latitude = location.latitude
taraval_longitude = location.longitude
print('The geographical coordinates of Taraval, San Francisco, California are {}, {}.'.format(taraval_latitude, taraval_longitude))

The geographical coordinates of Taraval, San Francisco, California are 37.7433353, -122.4693349.


In [187]:
#geo locator for Tenderloin, San Francisco, California
tenderloin_address = 'Tenderloin, San Francisco, California'

geolocator = Nominatim(user_agent="tenderloin_explorer")
location = geolocator.geocode(tenderloin_address)
tenderloin_latitude = location.latitude
tenderloin_longitude = location.longitude
print('The geographical coordinates of Tenderloin, San Francisco, California are {}, {}.'.format(tenderloin_latitude, tenderloin_longitude))

The geographical coordinates of Tenderloin, San Francisco, California are 37.7842493, -122.4139933.
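
The ten lookups above repeat the same steps with a different address each time. One way to express them as a single loop is sketched below. The helper name `geocode_districts` and the offline stand-in geocoder are assumptions made for illustration; they are not part of the notebook or of geopy. The stand-in lets the sketch run without network access.

```python
from collections import namedtuple

# District -> lookup address, as used in the cells above.
DISTRICT_ADDRESSES = {
    "BAYVIEW": "Bayview, San Francisco, California",
    "CENTRAL": "Central, San Francisco, California",
    "INGLESIDE": "Ingleside, San Francisco, California",
    "MISSION": "Mission, San Francisco, California",
    "NORTHERN": "North Beach, San Francisco, California",
    "PARK": "Glen Park, San Francisco, California",
    "RICHMOND": "Richmond, San Francisco, California",
    "SOUTHERN": "South of Market, San Francisco, California",
    "TARAVAL": "Taraval, San Francisco, California",
    "TENDERLOIN": "Tenderloin, San Francisco, California",
}


def geocode_districts(addresses, geocode):
    """Return {district: (latitude, longitude)} for each address.

    `geocode` is any callable that takes an address string and returns an
    object with .latitude and .longitude, or None when nothing is found.
    """
    coordinates = {}
    for district, address in addresses.items():
        location = geocode(address)
        if location is None:
            raise ValueError(f"No geocoding result for {district!r} ({address!r})")
        coordinates[district] = (location.latitude, location.longitude)
    return coordinates


# Offline stand-in for a real geocoder: it knows only the Bayview result printed above.
Location = namedtuple("Location", ["latitude", "longitude"])

def offline_geocode(address):
    known = {"Bayview, San Francisco, California": Location(37.7288889, -122.3925)}
    return known.get(address)

coords = geocode_districts(
    {"BAYVIEW": DISTRICT_ADDRESSES["BAYVIEW"]}, offline_geocode
)
print(coords)

# With geopy and network access, the callable would be the geolocator's method:
#   geolocator = Nominatim(user_agent="sf_district_explorer")  # hypothetical agent name
#   coords = geocode_districts(DISTRICT_ADDRESSES, geolocator.geocode)
```

Raising on a `None` result makes a failed lookup visible immediately, instead of surfacing later as an attribute error on `location.latitude`.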


### Here we read the latitude and longitude of each district to pull venues

In [206]:
#add latitude and longitude to the specified Pd Districts (corrected)
df_district_theft.insert(2,"Latitude",[bayview_latitude, central_latitude, ingleside_latitude, mission_latitude, northern_latitude, park_latitude, richmond_latitude, southern_latitude, taraval_latitude, tenderloin_latitude], True)
df_district_theft.insert(3,"Longitude",[bayview_longitude, central_longitude, ingleside_longitude, mission_longitude, northern_longitude, park_longitude, richmond_longitude, southern_longitude, taraval_longitude, tenderloin_longitude], True)

df_district_theft


,PdDistrict,Total Theft,Latitude,Longitude
0,BAYVIEW,3192,37.728889,-122.3925
1,CENTRAL,7780,37.790465,-122.405043
2,INGLESIDE,2628,37.722987,-122.453027
3,MISSION,4661,37.752498,-122.412826
4,NORTHERN,8410,37.801175,-122.409002
5,PARK,2758,37.734281,-122.43447
6,RICHMOND,3629,37.777046,-122.465453
7,SOUTHERN,10633,37.780893,-122.400952
8,TARAVAL,3480,37.743335,-122.469335
9,TENDERLOIN,2338,37.784249,-122.413993


### Create Map of SF PD Districts Covered

In [179]:
# San Francisco latitude and longitude values
sf_latitude = 37.77
sf_longitude = -122.42

# create map of SF using latitude and longitude values
#map_sf = folium.Map(location=[sf_latitude, sf_longitude], zoom_start=12)

In [240]:
# add markers to map
#for lat, lng, label in zip(df_district_theft['Latitude'], df_district_theft['Longitude'], df_district_theft['PdDistrict']):
#    label = folium.Popup(label, parse_html=True)
#    folium.CircleMarker(
#        [lat, lng],
#        radius=5,
#        popup=label,
#        color='blue',
#        fill=True,
#        fill_color='#3186cc',
#        fill_opacity=0.7,
#        parse_html=False).add_to(map_sf)  
    
#map_sf


### Here we generate the code to find venues of each district

In [54]:
#function for getting nearby venues 
def getNearbyVenues(names, latitudes, longitudes, radius=2500, limit=50):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # build the API request; requests encodes the query parameters
        url = 'https://api.foursquare.com/v2/venues/explore'
        params = {
            'client_id': CLIENT_ID,
            'client_secret': CLIENT_SECRET,
            'v': VERSION,
            'll': '{},{}'.format(lat, lng),
            'radius': radius,
            'limit': limit}
            
        # make the GET request and fail early on HTTP errors
        response = requests.get(url, params=params)
        response.raise_for_status()
        results = response.json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    # flatten the per-district lists into one dataframe
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
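
The list comprehension in `getNearbyVenues` indexes `groups[0]` and `categories[0]` directly, so it would raise if a response ever came back without groups or a venue without a category. The sketch below shows one defensive way to flatten a response of the same shape. The helper name `extract_venue_rows` and the sample payload are assumptions for illustration; only the nesting (`response` → `groups` → `items` → `venue`) is taken from the function above.

```python
def extract_venue_rows(payload, name, lat, lng):
    """Flatten one explore-style response into rows of venue fields."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Use None when a venue carries no category instead of raising IndexError.
        category = categories[0]["name"] if categories else None
        rows.append((
            name, lat, lng,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            category,
        ))
    return rows


# Hand-written payload with one categorized and one uncategorized venue.
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Radio Africa & Kitchen",
               "location": {"lat": 37.734826, "lng": -122.390764},
               "categories": [{"name": "African Restaurant"}]}},
    {"venue": {"name": "Uncategorized Venue",
               "location": {"lat": 37.73, "lng": -122.39},
               "categories": []}},
]}]}}

rows = extract_venue_rows(sample, "BAYVIEW", 37.728889, -122.3925)
print(rows)
```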


In [55]:
#get SF venues
sf_venues = getNearbyVenues(names=df_district_theft['PdDistrict'],
                                   latitudes=df_district_theft['Latitude'],
                                   longitudes=df_district_theft['Longitude']
                                  )


BAYVIEW
CENTRAL
INGLESIDE
MISSION
NORTHERN
PARK
RICHMOND
SOUTHERN
TARAVAL
TENDERLOIN
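
Each district is queried as a circle of radius 2500 m around a single point, so neighboring circles can overlap and return the same venues for more than one district. As a rough check, the sketch below computes the great-circle distance between the Central and Tenderloin coordinates listed earlier, using the standard haversine formula with a mean Earth radius of 6,371 km.

```python
from math import asin, cos, radians, sin, sqrt


def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in meters between two (lat, lng) points."""
    earth_radius_m = 6371000
    dlat = radians(lat2 - lat1)
    dlng = radians(lng2 - lng1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlng / 2) ** 2
    return 2 * earth_radius_m * asin(sqrt(a))


# Coordinates from the district table above.
central = (37.790465, -122.405043)
tenderloin = (37.784249, -122.413993)
radius_m = 2500  # default search radius of getNearbyVenues

distance_m = haversine_m(*central, *tenderloin)
# Two circles of equal radius overlap when their centers are closer than twice the radius.
circles_overlap = distance_m < 2 * radius_m
print(round(distance_m), circles_overlap)
```

The two points are roughly 1 km apart, which is less than the search radius itself, so each center falls inside the other district's search circle.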


In [207]:
print(sf_venues.shape)
sf_venues.head()

(500, 7)


,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,BAYVIEW,37.728889,-122.3925,Curio Parlor,37.728156,-122.394723,Piercing Parlor
1,BAYVIEW,37.728889,-122.3925,Craftsman and Wolves Den,37.726908,-122.391536,Bakery
2,BAYVIEW,37.728889,-122.3925,Three Babes Bakeshop,37.72973,-122.40016,Dessert Shop
3,BAYVIEW,37.728889,-122.3925,Zaccho Dance Theatre,37.728223,-122.394916,Dance Studio
4,BAYVIEW,37.728889,-122.3925,Radio Africa & Kitchen,37.734826,-122.390764,African Restaurant


### Here we filter on relevant restaurants only 

In [222]:
# keep venues whose category names a restaurant, and only the columns needed for counting
sf_restaurants = sf_venues[sf_venues['Venue Category'].str.contains('restaurant', case=False)]
sf_restaurants = sf_restaurants[['Neighborhood', 'Venue Category']].reset_index(drop=True)

print(sf_restaurants.shape)

sf_restaurants.head()

(111, 2)


,Neighborhood,Venue Category
0,BAYVIEW,African Restaurant
1,BAYVIEW,Mexican Restaurant
2,BAYVIEW,Latin American Restaurant
3,BAYVIEW,Southern / Soul Food Restaurant
4,BAYVIEW,Vietnamese Restaurant


In [223]:
#restaurants by district

sf_district_restaurants = sf_restaurants.groupby('Neighborhood').count()

sf_district_restaurants


Neighborhood,Venue Category
BAYVIEW,12
CENTRAL,11
INGLESIDE,13
MISSION,18
NORTHERN,8
PARK,10
RICHMOND,13
SOUTHERN,7
TARAVAL,15
TENDERLOIN,4


In [235]:
sf_district_restaurants.columns = ['Restaurant Venues']
sf_district_restaurants.sort_values('Restaurant Venues')

Neighborhood,Restaurant Venues
TENDERLOIN,4
SOUTHERN,7
NORTHERN,8
PARK,10
CENTRAL,11
BAYVIEW,12
INGLESIDE,13
RICHMOND,13
TARAVAL,15
MISSION,18
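
Because the venue request is capped at 50 results per district, and 500 rows came back for 10 districts, these counts are restaurants among the 50 venues returned for each district, not a census of every restaurant there. A small sketch that expresses each count as a share of the returned venues:

```python
# Restaurant counts per district, copied from the table above.
restaurant_counts = {
    "TENDERLOIN": 4, "SOUTHERN": 7, "NORTHERN": 8, "PARK": 10, "CENTRAL": 11,
    "BAYVIEW": 12, "INGLESIDE": 13, "RICHMOND": 13, "TARAVAL": 15, "MISSION": 18,
}
VENUES_PER_DISTRICT = 50  # the result limit used in the venue request

# Share of each district's returned venues that are restaurants.
restaurant_share = {
    district: count / VENUES_PER_DISTRICT
    for district, count in restaurant_counts.items()
}
for district, share in restaurant_share.items():
    print(f"{district:<11}{share:.0%}")
```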


In [237]:
df_district_theft.sort_values('Total Theft')

,PdDistrict,Total Theft,Latitude,Longitude
9,TENDERLOIN,2338,37.784249,-122.413993
2,INGLESIDE,2628,37.722987,-122.453027
5,PARK,2758,37.734281,-122.43447
0,BAYVIEW,3192,37.728889,-122.3925
8,TARAVAL,3480,37.743335,-122.469335
6,RICHMOND,3629,37.777046,-122.465453
3,MISSION,4661,37.752498,-122.412826
1,CENTRAL,7780,37.790465,-122.405043
4,NORTHERN,8410,37.801175,-122.409002
7,SOUTHERN,10633,37.780893,-122.400952


In [230]:
df_summary = df_district_theft.merge(sf_district_restaurants, left_on='PdDistrict', right_on='Neighborhood')
#df_summary = df_summary.reset_index()
df_summary.drop(['Latitude', 'Longitude'], axis=1, inplace=True)
#df_summary = df_summary.reset_index()
#df_summary.sort_values('Total Theft')
df_summary

,PdDistrict,Total Theft,Restaurant Venues
0,BAYVIEW,3192,12
1,CENTRAL,7780,11
2,INGLESIDE,2628,13
3,MISSION,4661,18
4,NORTHERN,8410,8
5,PARK,2758,10
6,RICHMOND,3629,13
7,SOUTHERN,10633,7
8,TARAVAL,3480,15
9,TENDERLOIN,2338,4
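
The merge above uses the default inner join, which would silently drop any district that had no restaurants among its returned venues. All ten districts survived here, but a hedged way to guard against that case is to merge with `how="left"` and use the `validate` and `indicator` options of `DataFrame.merge`. The rows below are a made-up subset with one district left out on purpose.

```python
import pandas as pd

theft = pd.DataFrame({
    "PdDistrict": ["PARK", "TARAVAL", "MISSION"],
    "Total Theft": [2758, 3480, 4661],
})
# Restaurant counts indexed by district; MISSION is left out on purpose.
restaurants = pd.DataFrame(
    {"Restaurant Venues": [10, 15]},
    index=pd.Index(["PARK", "TARAVAL"], name="Neighborhood"),
)

checked = theft.merge(
    restaurants.reset_index(),
    left_on="PdDistrict",
    right_on="Neighborhood",
    how="left",               # keep every district from the theft table
    validate="one_to_one",    # raise if a district appears twice on either side
    indicator=True,           # adds a _merge column describing each row's origin
)
missing = checked.loc[checked["_merge"] == "left_only", "PdDistrict"].tolist()
print(missing)
```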


### Combined view of theft incidents and restaurant venues

In [234]:
print(df_summary.sort_values('Total Theft'))

   PdDistrict  Total Theft  Restaurant Venues
9  TENDERLOIN         2338                  4
2   INGLESIDE         2628                 13
5        PARK         2758                 10
0     BAYVIEW         3192                 12
8     TARAVAL         3480                 15
6    RICHMOND         3629                 13
3     MISSION         4661                 18
1     CENTRAL         7780                 11
4    NORTHERN         8410                  8
7    SOUTHERN        10633                  7


### Here we generate a map of theft incidents only

In [238]:
# let's start again with a clean copy of the map of San Francisco
#sf_map = folium.Map(location = [sf_latitude, sf_longitude], zoom_start = 12)

# instantiate a mark cluster object for the incidents in the dataframe
#sf_theft_incidents = plugins.MarkerCluster().add_to(sf_map)

# loop through the dataframe and add each data point to the mark cluster
#for lat, lng, label, in zip(df_stealing.Y, df_stealing.X, df_stealing.Category):
 #   folium.Marker(
 #       location=[lat, lng],
 #       icon=None,
 #       popup=label,
 #   ).add_to(sf_theft_incidents)

In [239]:
#sf_map

## Results and Discussion<a name="resultsanddiscussion"></a>

In [243]:
print(df_summary.sort_values('Total Theft'))

   PdDistrict  Total Theft  Restaurant Venues
9  TENDERLOIN         2338                  4
2   INGLESIDE         2628                 13
5        PARK         2758                 10
0     BAYVIEW         3192                 12
8     TARAVAL         3480                 15
6    RICHMOND         3629                 13
3     MISSION         4661                 18
1     CENTRAL         7780                 11
4    NORTHERN         8410                  8
7    SOUTHERN        10633                  7


### Based on the theft incident counts and the restaurant venues returned by the Foursquare API, we find a clear candidate location for our client to open a sushi restaurant:
* The **Tenderloin** district appears to have both the fewest theft incidents and the fewest restaurant venues of the ten districts in San Francisco
* Additional districts the client may wish to consider include the **Ingleside** and **Park** districts, which have the second and third lowest theft incident counts, respectively
* While the **Southern** and **Northern** districts have the second and third fewest restaurant venues in the city, they have roughly 4.5 and 3.6 times as many theft incidents, respectively, as the Tenderloin, the district with the fewest
* Only two factors were considered in this project's scope, namely **theft incidents** and **restaurant venues**, which limits the results. For example, while the Tenderloin may have the fewest theft incidents, other crime types, homelessness, or other socioeconomic measures may turn out to be higher there than in other districts, to the dismay of the client. As a result, additional iterations of data analysis are recommended to fine-tune the client's requirements and better serve their business needs
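
The reading above weighs the two criteria informally. One simple way to combine them, sketched here, is a rank sum: rank the districts on each criterion (1 = lowest), add the two ranks, and sort. Equal weighting of the two criteria is an assumption of this sketch, not a method used in the notebook.

```python
import pandas as pd

# Values copied from the summary table above.
summary = pd.DataFrame({
    "PdDistrict": ["BAYVIEW", "CENTRAL", "INGLESIDE", "MISSION", "NORTHERN",
                   "PARK", "RICHMOND", "SOUTHERN", "TARAVAL", "TENDERLOIN"],
    "Total Theft": [3192, 7780, 2628, 4661, 8410, 2758, 3629, 10633, 3480, 2338],
    "Restaurant Venues": [12, 11, 13, 18, 8, 10, 13, 7, 15, 4],
})

# Rank each criterion in ascending order; tied districts share the lower rank.
summary["Theft Rank"] = summary["Total Theft"].rank(method="min")
summary["Venue Rank"] = summary["Restaurant Venues"].rank(method="min")
summary["Rank Sum"] = summary["Theft Rank"] + summary["Venue Rank"]

# Sort by combined rank, breaking ties by the lower theft count.
ranked = summary.sort_values(["Rank Sum", "Total Theft"]).reset_index(drop=True)
print(ranked[["PdDistrict", "Theft Rank", "Venue Rank", "Rank Sum"]])
```

Under this equal-weight assumption the same three districts come out on top, with Park placing ahead of Ingleside because it has fewer restaurant venues.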

## Conclusion<a name="conclusion"></a>

Based on the findings of this analysis, we recommend the following top 3 locations to our client, based on the **theft incidents** and **restaurant venues** criteria alone. We suggest running further iterations using other criteria, such as other crime statistics or homelessness, to fine-tune the recommendation.
* Tenderloin District
* Ingleside District
* Park District