Introduction / Business Problem:

I am representing a family that is moving from Los Angeles to San Francisco. They ran a Japanese bakery in LA's Little Tokyo and hope to open another one somewhere in San Francisco. They want a location that has other food establishments nearby but few (if any) bakeries. They would also like to be near shops, movie theaters, and/or parks, since they expect those areas to have a lot of foot traffic.

Data:

I will use map data and Foursquare neighborhood data to find the best neighborhood for the family's bakery. Every food establishment in a neighborhood adds 1 count, and every shop, movie theater, or park adds 2 counts. Every bakery, cafe, or dessert shop deducts 3 counts. At the end I will total the counts for each neighborhood and recommend the neighborhood with the highest total.
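As a rough illustration of the scoring rule, the sketch below assigns points to Foursquare-style category names with simple keyword matching. The function names (`category_points`, `neighborhood_score`) and the keyword lists are my own assumptions for illustration; the notebook itself does not define them, and real category names may need a more careful mapping.

```python
def category_points(category):
    """Return the points for one venue category under the scoring rule above."""
    name = category.lower()
    # Assumed keyword lists; competitors are checked first so that
    # "Dessert Shop" is not scored as an ordinary shop.
    if any(word in name for word in ("bakery", "cafe", "café", "dessert")):
        return -3
    if any(word in name for word in ("shop", "movie theater", "park")):
        return 2
    if any(word in name for word in ("restaurant", "pizza place", "food")):
        return 1
    # Anything else (gyms, studios, hotels, ...) does not affect the score
    return 0


def neighborhood_score(categories):
    """Sum the points over all venue categories found in one neighborhood."""
    return sum(category_points(c) for c in categories)


sample = ["Sushi Restaurant", "Park", "Flower Shop", "Bakery", "Yoga Studio"]
print(neighborhood_score(sample))  # 1 + 2 + 2 - 3 + 0 = 2
```

Keyword matching is only a starting point: a category such as "Coffee Shop" would be scored as a shop here, so borderline categories are worth reviewing by hand.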

In [1]:
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner
import random # library for random number generation

import geopy
import geopandas

!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values

# libraries for displaying images
from IPython.display import Image 
from IPython.core.display import HTML 
    
# library for transforming a JSON file into a pandas dataframe
from pandas.io.json import json_normalize

!conda install -c conda-forge folium=0.5.0 --yes
import folium # plotting library

print('Folium installed')
print('Libraries imported.')

/bin/sh: conda: command not found
/bin/sh: conda: command not found
Folium installed
Libraries imported.


In [3]:
import requests # library to handle requests
from bs4 import BeautifulSoup
from urllib.request import urlopen

In [32]:
#reading the website html into my notebook
d = pd.read_html('http://www.healthysf.org/bdi/outcomes/zipmap.htm')

In [33]:
#selected the part of the website that I am interested in

d[4]

Unnamed: 0,0,1
0,Zip Code,Neighborhood
1,94102,Hayes Valley/Tenderloin/North of Market
2,94103,South of Market
3,94107,Potrero Hill
4,94108,Chinatown
5,94109,Polk/Russian Hill (Nob Hill)
6,94110,Inner Mission/Bernal Heights
7,94112,Ingelside-Excelsior/Crocker-Amazon
8,94114,Castro/Noe Valley
9,94115,Western Addition/Japantown


In [34]:
#setting a variable for this dataframe
df = d[4]

In [35]:
df.head()

Unnamed: 0,0,1
0,Zip Code,Neighborhood
1,94102,Hayes Valley/Tenderloin/North of Market
2,94103,South of Market
3,94107,Potrero Hill
4,94108,Chinatown


In [36]:
# making first row with zipcode and neighborhood as the header

new_header = df.iloc[0] # grab the first row for the header
df = df[1:] # take the data minus the header row
df.columns = new_header # set the header as the dataframe header
df.head()

Unnamed: 0,Zip Code,Neighborhood
1,94102,Hayes Valley/Tenderloin/North of Market
2,94103,South of Market
3,94107,Potrero Hill
4,94108,Chinatown
5,94109,Polk/Russian Hill (Nob Hill)


In [37]:
# removing the last entry because it is an "all zipcodes" entry
# (.copy() so that adding columns later does not trigger a SettingWithCopyWarning)
df = df[:-1].copy()
df.tail()

Unnamed: 0,Zip Code,Neighborhood
17,94127,St. Francis Wood/Miraloma/West Portal
18,94131,Twin Peaks-Glen Park
19,94132,Lake Merced
20,94133,North Beach/Chinatown
21,94134,Visitacion Valley/Sunnydale


In [102]:
# adding the coordinates to each neighborhood

!pip install uszipcode
from uszipcode import SearchEngine

search = SearchEngine(simple_zipcode=True)

latitude = []
longitude = []

for index, row in df.iterrows():
    zipcode = search.by_zipcode(row["Zip Code"]).to_dict()
    latitude.append(zipcode.get("lat"))
    longitude.append(zipcode.get("lng"))

df["Latitude"] = latitude
df["Longitude"] = longitude

df.tail()



Unnamed: 0,Zip Code,Neighborhood,Latitude,Longitude
17,94127,St. Francis Wood/Miraloma/West Portal,37.73,-122.46
18,94131,Twin Peaks-Glen Park,37.75,-122.44
19,94132,Lake Merced,37.72,-122.48
20,94133,North Beach/Chinatown,37.8,-122.44
21,94134,Visitacion Valley/Sunnydale,37.72,-122.41


Here I start by looking at a map of San Francisco, then combine the coordinate data with Foursquare data to get information on the venues in each neighborhood.

In [43]:
# getting the coordinates of San Francisco
address = 'San Francisco, California'

geolocator = Nominatim(user_agent="San Francisco_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of San Francisco are {}, {}.'.format(latitude, longitude))

The geographical coordinates of San Francisco are 37.7790262, -122.4199061.


In [48]:
# making a map of San Francisco
map_sf = folium.Map(location = [latitude, longitude], zoom_start=10)

# adding markers to map
for lat, lng, neighborhood in zip(df['Latitude'], df['Longitude'], df['Neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html = True)
    folium.CircleMarker(
        [lat, lng],
        radius = 5,
        popup = label,
        color = 'blue',
        fill = True,
        fill_color = '#3186cc',
        fill_opacity = 0.7,
        parse_html = False).add_to(map_sf)  
    
map_sf

In [49]:
# defining Foursquare with my credentials

CLIENT_ID = 'WD1VZ1O5BSLPMSQT2L3X5P35BGCQU3QMAX5RFVXI1MSOYSWJ'
CLIENT_SECRET = 'P5KJSMUCU3FKWQDIPHOWDMIRCPUCANZLLKP1CMYOIOM5CR42'
VERSION = '20161225'

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: WD1VZ1O5BSLPMSQT2L3X5P35BGCQU3QMAX5RFVXI1MSOYSWJ
CLIENT_SECRET:P5KJSMUCU3FKWQDIPHOWDMIRCPUCANZLLKP1CMYOIOM5CR42


In [88]:
# making a function that will gather info on nearby venues for each neighborhood

def getNearbyVenues(names, latitudes, longitudes, radius=500, limit=5):
    # limit=5 is an assumption: LIMIT was never defined in this notebook, but the
    # output below shows at most five venues per neighborhood
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            limit)
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    return(nearby_venues)
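
The request URL can also be assembled from a dictionary of query parameters, which avoids long format strings and handles escaping. This is a minimal sketch using only the standard library; `build_explore_url` is a hypothetical helper, the placeholder credentials are made up, and the default `limit=5` is an assumption based on the at most five venues per neighborhood seen in the output.

```python
from urllib.parse import urlencode


def build_explore_url(client_id, client_secret, version, lat, lng,
                      radius=500, limit=5):
    """Build a Foursquare v2 venues/explore URL from its query parameters."""
    base = "https://api.foursquare.com/v2/venues/explore"
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),  # latitude,longitude pair
        "radius": radius,                # search radius in meters
        "limit": limit,                  # assumed cap on venues returned
    }
    # urlencode percent-escapes values (the comma in "ll" becomes %2C)
    return base + "?" + urlencode(params)


# Placeholder credentials, for illustration only
url = build_explore_url("MY_ID", "MY_SECRET", "20161225", 37.78, -122.42)
print(url)
```

Keeping the parameters in a dictionary also makes it easy to see which values the function depends on, so an undefined name is caught when the function is written rather than when it is called.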

In [103]:
# looking at the venues in SF sorted by neighborhood

sf_venues = getNearbyVenues(names = df['Neighborhood'],
                                   latitudes = df['Latitude'],
                                   longitudes = df['Longitude']
                                  )
                                  
sf_venues.tail()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
90,North Beach/Chinatown,37.8,-122.44,SusieCakes,37.800546,-122.438142,Cupcake Shop
91,North Beach/Chinatown,37.8,-122.44,Delarosa,37.800287,-122.43911,Pizza Place
92,Visitacion Valley/Sunnydale,37.72,-122.41,John McLaren Park Lookout Point,37.717758,-122.407291,Park
93,Visitacion Valley/Sunnydale,37.72,-122.41,Visitacion Valley Greenway,37.717687,-122.407316,Garden
94,Visitacion Valley/Sunnydale,37.72,-122.41,Louis Sutter Playground,37.722388,-122.413928,Baseball Field


In [90]:
#grouping the venues by neighborhood
sf_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Bayview-Hunters Point,3,3,3,3,3,3
Castro/Noe Valley,5,5,5,5,5,5
Chinatown,5,5,5,5,5,5
Haight-Ashbury,5,5,5,5,5,5
Hayes Valley/Tenderloin/North of Market,5,5,5,5,5,5
Ingelside-Excelsior/Crocker-Amazon,5,5,5,5,5,5
Inner Mission/Bernal Heights,5,5,5,5,5,5
Inner Richmond,5,5,5,5,5,5
Lake Merced,5,5,5,5,5,5
Marina,5,5,5,5,5,5


In [91]:
# counting the unique venue categories across all neighborhoods

print('There are {} unique categories.'.format(len(sf_venues['Venue Category'].unique())))

There are 53 unique categories.


In [94]:
# one-hot encoding the venue categories (one row per venue, one column per category)

sf_onehot = pd.get_dummies(sf_venues[['Venue Category']], prefix = "", prefix_sep = "")

# add neighborhood column back to dataframe
sf_onehot['Neighborhood'] = sf_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [sf_onehot.columns[-1]] + list(sf_onehot.columns[:-1])
sf_onehot = sf_onehot[fixed_columns]

sf_onehot.head()

Unnamed: 0,Neighborhood,American Restaurant,Art Gallery,Bakery,Baseball Field,Boxing Gym,Bubble Tea Shop,Burmese Restaurant,Burrito Place,Bus Line,...,Sandwich Place,Scenic Lookout,Soccer Field,Spa,Street Food Gathering,Sushi Restaurant,Szechuan Restaurant,Tennis Court,Trail,Yoga Studio
0,Hayes Valley/Tenderloin/North of Market,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Hayes Valley/Tenderloin/North of Market,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Hayes Valley/Tenderloin/North of Market,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Hayes Valley/Tenderloin/North of Market,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Hayes Valley/Tenderloin/North of Market,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [96]:
sf_grouped = sf_onehot.groupby('Neighborhood').mean().reset_index()
sf_grouped.head()

Unnamed: 0,Neighborhood,American Restaurant,Art Gallery,Bakery,Baseball Field,Boxing Gym,Bubble Tea Shop,Burmese Restaurant,Burrito Place,Bus Line,...,Sandwich Place,Scenic Lookout,Soccer Field,Spa,Street Food Gathering,Sushi Restaurant,Szechuan Restaurant,Tennis Court,Trail,Yoga Studio
0,Bayview-Hunters Point,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Castro/Noe Valley,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.2,0.2
2,Chinatown,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0
3,Haight-Ashbury,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.2
4,Hayes Valley/Tenderloin/North of Market,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
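
Note that grouping with `.mean()` gives the share of each category among a neighborhood's venues, not the number of venues. If raw counts are needed (for example, to add up points per venue), `.sum()` can be used instead. The toy data below is made up purely to show the difference between the two.

```python
import pandas as pd

# Made-up venues for two hypothetical neighborhoods, "A" and "B"
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B"],
    "Venue Category": ["Park", "Bakery", "Park", "Café", "Park"],
})

# One row per venue, one 0/1 column per category
onehot = pd.get_dummies(venues["Venue Category"], dtype=int)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

counts = onehot.groupby("Neighborhood").sum()   # number of venues per category
shares = onehot.groupby("Neighborhood").mean()  # fraction of venues per category

print(counts)
print(shares)
```

In this toy example neighborhood "A" has two parks out of three venues, so its "Park" entry is 2 in `counts` and about 0.67 in `shares`.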


These next steps will allow me to view the top 10 venue types per neighborhood. I will use this data to determine which neighborhood I recommend for the family to open up their bakery.


In [97]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending = False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [101]:

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        # append 'st', 'nd', 'rd' to the top 3 venues
        columns.append('{}{} Most Common Venue'.format(ind + 1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind + 1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns = columns)
neighborhoods_venues_sorted['Neighborhood'] = sf_grouped['Neighborhood']

for ind in np.arange(sf_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(sf_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Bayview-Hunters Point,Coffee Shop,Art Gallery,Motorcycle Shop,Concert Hall,Garden,Fountain,Food Truck,Flower Shop,Dumpling Restaurant,Dog Run
1,Castro/Noe Valley,Yoga Studio,Dog Run,Trail,Park,Szechuan Restaurant,Bakery,Baseball Field,Art Gallery,Fountain,Food Truck
2,Chinatown,Hotel,Spa,Korean Restaurant,Pizza Place,Yoga Studio,Coffee Shop,Food Truck,Flower Shop,Dumpling Restaurant,Dog Run
3,Haight-Ashbury,Yoga Studio,Scenic Lookout,Coffee Shop,Park,Tennis Court,Baseball Field,Fountain,Food Truck,Flower Shop,Dumpling Restaurant
4,Hayes Valley/Tenderloin/North of Market,Concert Hall,Park,Dance Studio,Opera House,Yoga Studio,Fountain,Food Truck,Flower Shop,Dumpling Restaurant,Dog Run
5,Ingelside-Excelsior/Crocker-Amazon,Mexican Restaurant,Japanese Restaurant,Sandwich Place,Pizza Place,Yoga Studio,Coffee Shop,Food Truck,Flower Shop,Dumpling Restaurant,Dog Run
6,Inner Mission/Bernal Heights,Optical Shop,Massage Studio,Burrito Place,Dessert Shop,Cocktail Bar,Yoga Studio,Fountain,Food Truck,Flower Shop,Dumpling Restaurant
7,Inner Richmond,Art Gallery,Italian Restaurant,Japanese Restaurant,Burmese Restaurant,Flower Shop,Yoga Studio,Cupcake Shop,Fountain,Food Truck,Dumpling Restaurant
8,Lake Merced,Sandwich Place,Mexican Restaurant,Café,Performing Arts Venue,Coffee Shop,Yoga Studio,Concert Hall,Fountain,Food Truck,Flower Shop
9,Marina,Gym / Fitness Center,Pizza Place,Deli / Bodega,Cupcake Shop,Greek Restaurant,Boxing Gym,Bubble Tea Shop,Garden,Fountain,Food Truck


Neighborhood counts:

Bayview = 13
Castro = 13
Chinatown = 9
Haight-Ashbury = 13
Hayes = 18
Ingleside = 9
Inner Mission = 11
Inner Richmond = 10
Lake Merced = 7
Marina = 10
North Beach = 10
Parkside = 11
Polk/Nob Hill = 16
Potrero Hill = 12
South of Market = 6
West Portal = 11
Sunset = 6
Twin Peaks = 7
Visitacion Valley = 13
Western Addition/Japantown = 7

In [106]:
# adding the counts to the table

counts = [13,13,9,13,18,9,11,10,7,10,10,11,16,12,6,11,6,7,13,7]

neighborhoods_venues_sorted['Counts'] = counts
neighborhoods_venues_sorted

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Counts
0,Bayview-Hunters Point,Coffee Shop,Art Gallery,Motorcycle Shop,Concert Hall,Garden,Fountain,Food Truck,Flower Shop,Dumpling Restaurant,Dog Run,13
1,Castro/Noe Valley,Yoga Studio,Dog Run,Trail,Park,Szechuan Restaurant,Bakery,Baseball Field,Art Gallery,Fountain,Food Truck,13
2,Chinatown,Hotel,Spa,Korean Restaurant,Pizza Place,Yoga Studio,Coffee Shop,Food Truck,Flower Shop,Dumpling Restaurant,Dog Run,9
3,Haight-Ashbury,Yoga Studio,Scenic Lookout,Coffee Shop,Park,Tennis Court,Baseball Field,Fountain,Food Truck,Flower Shop,Dumpling Restaurant,13
4,Hayes Valley/Tenderloin/North of Market,Concert Hall,Park,Dance Studio,Opera House,Yoga Studio,Fountain,Food Truck,Flower Shop,Dumpling Restaurant,Dog Run,18
5,Ingelside-Excelsior/Crocker-Amazon,Mexican Restaurant,Japanese Restaurant,Sandwich Place,Pizza Place,Yoga Studio,Coffee Shop,Food Truck,Flower Shop,Dumpling Restaurant,Dog Run,9
6,Inner Mission/Bernal Heights,Optical Shop,Massage Studio,Burrito Place,Dessert Shop,Cocktail Bar,Yoga Studio,Fountain,Food Truck,Flower Shop,Dumpling Restaurant,11
7,Inner Richmond,Art Gallery,Italian Restaurant,Japanese Restaurant,Burmese Restaurant,Flower Shop,Yoga Studio,Cupcake Shop,Fountain,Food Truck,Dumpling Restaurant,10
8,Lake Merced,Sandwich Place,Mexican Restaurant,Café,Performing Arts Venue,Coffee Shop,Yoga Studio,Concert Hall,Fountain,Food Truck,Flower Shop,7
9,Marina,Gym / Fitness Center,Pizza Place,Deli / Bodega,Cupcake Shop,Greek Restaurant,Boxing Gym,Bubble Tea Shop,Garden,Fountain,Food Truck,10


Based on my counts per neighborhood, I suggest that the family open their bakery in the Hayes Valley/Tenderloin/North of Market neighborhood. My second choice is Polk/Russian Hill (Nob Hill). Both neighborhoods have many attractions (shops, activity centers, etc.) that will bring customers to the area, but no bakeries or dessert shops, so there won't be direct competition.
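
The ranking behind this recommendation can be double-checked by sorting the counts. The sketch below pairs the short neighborhood names from the list of counts (capitalization normalized) with the values in `counts`; the variable names `names`, `ranked`, and `top_two` are only illustrative.

```python
# Short neighborhood names, in the same order as the counts list
names = ["Bayview", "Castro", "Chinatown", "Haight-Ashbury", "Hayes",
         "Ingleside", "Inner Mission", "Inner Richmond", "Lake Merced",
         "Marina", "North Beach", "Parkside", "Polk/Nob Hill", "Potrero Hill",
         "South of Market", "West Portal", "Sunset", "Twin Peaks",
         "Visitacion Valley", "Western Addition/Japantown"]
counts = [13, 13, 9, 13, 18, 9, 11, 10, 7, 10, 10, 11, 16, 12, 6, 11, 6, 7, 13, 7]

# Sort neighborhoods from highest to lowest count
ranked = sorted(zip(names, counts), key=lambda pair: pair[1], reverse=True)
top_two = ranked[:2]

for name, count in top_two:
    print(name, count)
```

The two highest totals are Hayes (18) and Polk/Nob Hill (16), which matches the first and second choices above.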