# Capstone Project - The Battle of the Neighborhoods

This notebook is for the <a href="https://www.coursera.org/professional-certificates/ibm-data-science">IBM Data Science Professional Certificate.</a>.

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## Introduction: Business Problem <a name="introduction"></a>

In this project we will try to find an optimal location for a hair salon. Specifically, this report will be targeted to stakeholders interested in opening a **Hair Salon** in **Boise, Idaho**.

Boise, Idaho as of 2020 has a population of less than 250,000 people.  Based on this we are looking for areas that already have a lot of venues.  This will allow us to generate cross business from people visiting other locations.

We will use data science techniques to generate the most promissing neighborhoods based on the above criteria. Advantages of each area will then be clearly expressed so that best possible final location can be chosen by stakeholders.

## Data <a name="data"></a>

Based on the definition of our problem, factors that will influence our decission are:
* number of existing venues in the neighborhood 
* number of salons already in the neighborhood

Following data sources will be needed to extract/generate the required information:
* centers of candidate areas will be generated algorithmically and approximate addresses of centers of those areas will be obtained using **GeoPandas**
* number of hair salons and their location in every neighborhood will be obtained using **Foursquare API**
* coordinate of city center's will be obtained using **GeoPandas**.

### Make the necessary imports

In [1]:
import sys
!{sys.executable} -m pip install pyproj

import numpy as np
import pandas as pd
import folium
import requests
import warnings
import math
import pickle
from pyproj import Proj,transform
from pandas.io.json import json_normalize
from geopy import Nominatim

!pip install shapely
import shapely.geometry

warnings.simplefilter("ignore")

print('Libraries imported.')

Libraries imported.


In [2]:
geolocator = Nominatim(user_agent="myGeocoder")
address = 'Boise, Idaho'

locator = Nominatim(user_agent='myGeocoder')

boise_center = locator.geocode(address)
longitude = boise_center.longitude
latitude = boise_center.latitude
print('Boise, Idaho Coordinates: Latitude={}, Longitude = {}'.format(latitude,longitude))

Boise, Idaho Coordinates: Latitude=43.6166163, Longitude = -116.200886


Now let's create a grid of area candidates, equaly spaced, centered around the center of Boise. Our neighborhoods will be defined as circular areas with a radius of 400 meters, so our neighborhood centers will be 200 meters apart.

To accurately calculate distances we need to create our grid of locations in Cartesian 2D coordinate system which allows us to calculate distances in meters (not in latitude/longitude degrees). Then we'll project those coordinates back to latitude/longitude degrees to be shown on the Folium map. So let's create functions to convert between WGS84 spherical coordinate system (latitude/longitude degrees) and UTM Cartesian coordinate system (X/Y coordinates in meters).

In [3]:
def lonlat_to_xy(lon, lat):
    proj_latlon = Proj(proj='latlong',datum='WGS84')
    proj_xy = Proj(proj="utm", zone=33, datum='WGS84')
    xy = transform(proj_latlon, proj_xy, lon, lat)
    return xy[0], xy[1]

def xy_to_lonlat(x, y):
    proj_latlon = Proj(proj='latlong',datum='WGS84')
    proj_xy = Proj(proj="utm", zone=33, datum='WGS84')
    lonlat = transform(proj_xy, proj_latlon, x, y)
    return lonlat[0], lonlat[1]

def calc_xy_distance(x1, y1, x2, y2):
    dx = x2 - x1
    dy = y2 - y1
    return math.sqrt(dx*dx + dy*dy)

print('Coordinate transformation check')
print('-------------------------------')
print('Boise longitude={}, latitude={}'.format(longitude, latitude))

x, y = lonlat_to_xy(longitude, latitude)
print('Boise UTM X={}, Y={}'.format(x, y))

lo, la = xy_to_lonlat(x, y)
print('Boise longitude={}, latitude={}'.format(lo, la))

Coordinate transformation check
-------------------------------
Boise longitude=-116.200886, latitude=43.6166163
Boise UTM X=-3400784.902310381, Y=13858643.278638296
Boise longitude=-116.200886, latitude=43.6166163


In [4]:
boise_center_x, boise_center_y = lonlat_to_xy(longitude, latitude) # City center in Cartesian coordinates
nb_k = 10
radius = 400
k = math.sqrt(3) / 2 # Vertical offset for hexagonal grid cells

x_min = boise_center_x - radius*10
x_step = radius*2
y_min = boise_center_y - radius*2 - (int(nb_k/k)*k*radius*2 - radius*10)/2
y_step = radius*2 * k 

latitudes = []
longitudes = []

distances_from_center = []

xs = []
ys = []

for i in range(0, int(nb_k/k)):
    y = y_min + i * y_step
    x_offset = radius if i%2==0 else 0
    for j in range(0, nb_k):
        x = x_min + j * x_step + x_offset
        distance_from_center = calc_xy_distance(boise_center_x, boise_center_y, x, y)
        if (distance_from_center <= 6001):
            lon, lat = xy_to_lonlat(x, y)
            latitudes.append(lat)
            longitudes.append(lon)
            distances_from_center.append(distance_from_center)
            xs.append(x)
            ys.append(y)
            
boise_map = folium.Map(location=[latitude, longitude], zoom_start=13)
folium.Marker(location=[latitude, longitude], popup='Boise').add_to(boise_map)

for lat, lon in zip(latitudes, longitudes):
    folium.Circle([lat, lon], radius=radius, color='blue', fill=False).add_to(boise_map)

boise_map

In [5]:
addresses = []
compteur = 0

df_locations = pd.DataFrame()

loaded = False

try:
    with open('locations.pkl', 'rb') as f:
        df_locations = pickle.load(f)
    
    print('Location data loaded from pickle.')
    loaded = True
except:
    pass

if not loaded:
    print('Obtaining location addresses.')
    
    for lat, lon in zip(latitudes, longitudes):
        compteur = compteur + 1
        address = locator.reverse("43.6166163, -116.200886")
        
        if address is None:
            address = 'NO ADDRESS'
        
        addresses.append(address)

    print(' done.')

    df_locations = pd.DataFrame({'Address': addresses,
                                 'Latitude': latitudes,
                                 'Longitude': longitudes,
                                 'X': xs,
                                 'Y': ys,
                                 'Distance from center': distances_from_center})

df_locations.head(10)

Obtaining location addresses.
 done.


Unnamed: 0,Address,Latitude,Longitude,X,Y,Distance from center
0,"(Hospitality Of Nez Perce Statue, West Bannock...",43.615238,-116.154723,-3404385.0,13856030.0,4446.883373
1,"(Hospitality Of Nez Perce Statue, West Bannock...",43.618982,-116.161243,-3403585.0,13856030.0,3828.155135
2,"(Hospitality Of Nez Perce Statue, West Bannock...",43.622726,-116.167765,-3402785.0,13856030.0,3288.582025
3,"(Hospitality Of Nez Perce Statue, West Bannock...",43.62647,-116.174288,-3401985.0,13856030.0,2873.111856
4,"(Hospitality Of Nez Perce Statue, West Bannock...",43.630213,-116.180812,-3401185.0,13856030.0,2640.979314
5,"(Hospitality Of Nez Perce Statue, West Bannock...",43.633957,-116.187337,-3400385.0,13856030.0,2640.979314
6,"(Hospitality Of Nez Perce Statue, West Bannock...",43.6377,-116.193864,-3399585.0,13856030.0,2873.111856
7,"(Hospitality Of Nez Perce Statue, West Bannock...",43.641444,-116.200392,-3398785.0,13856030.0,3288.582025
8,"(Hospitality Of Nez Perce Statue, West Bannock...",43.645187,-116.206921,-3397985.0,13856030.0,3828.155135
9,"(Hospitality Of Nez Perce Statue, West Bannock...",43.64893,-116.213452,-3397185.0,13856030.0,4446.883373


### Save location data thus far

In [6]:
# df_locations.to_pickle('./locations.pkl')
df_locations.shape

(110, 6)

### Foursquare
Now that we have our locations, we will use the Foursquare API to get info on hair salons in each neighborhood.

In [7]:
foursquare_client_id = 'BX0BTGI0XYHS2YEE0P4B0MQS3RNWO2U13VSKPC20SGXBMZ2D' # your Foursquare ID
foursquare_client_secret = 'V2ZLWQM00UPQ2D1EPF0UJWQ1XJUKMCA4XE3NYTJADTCSKFJK' # your Foursquare Secret

In [8]:
# Category IDs corresponding to Hair Salons were taken from Foursquare web site 
# (https://developer.foursquare.com/docs/resources/categories):

salon_category = '4bf58dd8d48988d110951735'

def get_categories(categories):
    return [(cat['name'], cat['id']) for cat in categories]

def format_address(location):
    address = ', '.join(location['formattedAddress'])
    return address

def get_venues_near_location(lat, lon, category, client_id, client_secret, radius=500, limit=100):
    version = '20180724'
    url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&categoryId={}&radius={}&limit={}'.format(
        client_id, client_secret, version, lat, lon, category, radius, limit)
    
    try:
        results = requests.get(url).json()['response']['groups'][0]['items']
        venues = [(item['venue']['id'],
                   item['venue']['name'],
                   get_categories(item['venue']['categories']),
                   (item['venue']['location']['lat'], item['venue']['location']['lng']),
                   format_address(item['venue']['location']),
                   item['venue']['location']['distance']) for item in results]        
    except:
        # print(sys.exc_info()[0])
        venues = []
    return venues

In [9]:
def get_salons(lats, lons):
    salons = {}
    location_salons = []

    print('Obtaining venues around candidate locations.')
    
    for lat, lon in zip(lats, lons):
        venues = get_venues_near_location(lat, lon, salon_category, foursquare_client_id, 
                                          foursquare_client_secret, radius=350, limit=100)
        
        area_salons = []
        
        for venue in venues:            
            venue_id = venue[0]
            venue_name = venue[1]
            venue_categories = venue[2]
            venue_latlon = venue[3]
            venue_address = venue[4]
            venue_distance = venue[5]
        
            x, y = lonlat_to_xy(venue_latlon[1], venue_latlon[0])
            salon = (venue_id, venue_name, venue_latlon[0], venue_latlon[1], venue_address,
                          venue_distance, x, y)
            
            if venue_distance<=300:
                area_salons.append(salon)
                
            salons[venue_id] = salon
        
        location_salons.append(area_salons)
    
    return salons, location_salons

In [10]:
# Try to load from local file system in case we did this before
salons = {}
location_salons = []

loaded = False

try:
    with open('salons_350.pkl', 'rb') as f:
        salons = pickle.load(f)
    with open('location_salons_350.pkl', 'rb') as f:
        location_salons = pickle.load(f)
    print('Salon data loaded.')
    loaded = True
except:
    pass
# If load failed use the Foursquare API to get the data
if not loaded:
    salons, location_salons = get_salons(latitudes, longitudes)
    # Let's persists this in local file system
    with open('salons_350.pkl', 'wb') as f:
        pickle.dump(salons, f)
    with open('location_salons_350.pkl', 'wb') as f:
        pickle.dump(location_salons, f)

print("Successfully obtained candidate salons")

Obtaining venues around candidate locations.
Successfully obtained candidate salons


In [11]:
print('Total number of salons:', len(salons))
print('Average number of salons in neighborhood:', np.array([len(r) for r in location_salons]).mean())

Total number of salons: 99
Average number of salons in neighborhood: 0.7


In [12]:
boise_map = folium.Map(location=[latitude, longitude], zoom_start=15)
folium.Marker(location=[latitude, longitude], popup='Boise').add_to(boise_map)

for sal in salons.values():
    lat = sal[2]; lon = sal[3]
    label = '{}'.format(sal[1])
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker([lat, lon], radius=5, color='blue', fill=True, fill_color='blue',                         
                        popup=label, fill_opacity=1, parse_html=False).add_to(boise_map)
    
boise_map

In [13]:
print('List of all salons')
print('-----------------------')
for r in list(salons.values())[:10]:
    print(r)
print('...')
print('Total:', len(salons))

List of all salons
-----------------------
('4b4aaf2af964a520d68c26e3', 'Euphoria Salon', 43.62995445728302, -116.20347082614899, '1517 N 13th St (BTW Alturas & Eastman), Boise, ID 83702, United States', 75, -3399495.2389167324, 13857410.580666699)
('4cd46b7394848cfa328adfb1', "Vince's Barber Shop", 43.629886, -116.203647, '1519 N 13th St, Boise, ID 83702, United States', 82, -3399487.552309863, 13857428.197553424)
('5d9f6e1c7550d70008038ec7', "Red Betty's Hair House", 43.6428368047944, -116.2235699594021, '2503 N 28th St (Sunset Avenue), Boise, ID 83703, United States', 254, -3396920.9015173526, 13857268.808738869)
('4d043f6b7d9ba35d14f56423', 'North End Barber Shop', 43.62224406445441, -116.19821200802728, '814 W Fort St, Boise, ID 83702, United States', 282, -3400525.0856689187, 13857898.72958821)
('4bad58e9f964a520c0483be3', 'Supercuts', 43.62478398728182, -116.21189296245574, '7610 W State St, Boise, ID 83714, United States', 296, -3399283.3715022886, 13858450.062812986)
('4c13eac

In [14]:
def get_meta_venues(lats, lons, meta_category):
    meta_venues = {}
    neighborhoods_venues = []
    
    for lat, lon in zip(lats, lons):
        # Using radius=350 to meke sure we have overlaps/full coverage so we don't miss any restaurant (we're using dictionaries to remove any duplicates resulting from area overlaps)
        area_meta_venues = []
        
        for i, category in enumerate(meta_category):
            venues = get_venues_near_location(lat, lon, category, foursquare_client_id, 
                                              foursquare_client_secret, radius=350, limit=100)
            
            for venue in venues:
                venue_id = venue[0]
                venue_name = venue[1]
                venue_categories = venue[2]
                venue_latlon = venue[3]
                venue_address = venue[4]
                venue_distance = venue[5]
                
                x, y = lonlat_to_xy(venue_latlon[1], venue_latlon[0])
                company = (venue_id, venue_name, venue_latlon[0], venue_latlon[1], 
                    venue_address, venue_distance, x, y)
                
                if venue_distance<=300:
                    area_meta_venues.append(company)
                
                meta_venues[venue_id] = company

            neighborhoods_venues.append(area_meta_venues)
    return meta_venues, neighborhoods_venues

In [15]:
# 'Root' category for all companies venues
companies_category = ['4d4b7105d754a06375d81259']

# Try to load from local file system in case we did this before
meta_company = {}
neighborhoods_company = []

loaded = False

try:
    with open('meta_company_350.pkl', 'rb') as f:
        meta_company = pickle.load(f)
    with open('neighborhoods_company_350.pkl', 'rb') as f:
        neighborhoods_company = pickle.load(f)
    print('Companies data loaded.')
    loaded = True
except:
    pass

if not loaded:
    meta_company, neighborhoods_company = get_meta_venues(latitudes, longitudes, companies_category)
    # Let's persists this in local file system
    with open('meta_company_350.pkl', 'wb') as f:
        pickle.dump(meta_company, f)
    with open('neighborhoods_company_350.pkl', 'wb') as f:
        pickle.dump(neighborhoods_company, f)

print('Total number of companies:', len(meta_company))
print('Average number of companies in neighborhood:', np.array([len(r) for r in neighborhoods_company]).mean())

Total number of companies: 694
Average number of companies in neighborhood: 4.818181818181818


In [16]:
boise_map = folium.Map(location=[latitude, longitude], zoom_start=15)
folium.Marker(location=[latitude, longitude], popup='Boise').add_to(boise_map)

for sal in meta_company.values():
    lat = sal[2]; lon = sal[3]
    # is_company = res[6]
    color = 'red'
    label = '{}'.format(sal[1])
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker([lat, lon], radius=3, color=color, fill=True, fill_color=color,                         
                        popup=label, fill_opacity=1, parse_html=False).add_to(boise_map)
    
for sal in salons.values():
    lat = sal[2]; lon = sal[3]
    label = '{}'.format(sal[1])
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker([lat, lon], radius=5, color='blue', fill=True, fill_color='blue',                         
                        popup=label, fill_opacity=1, parse_html=False).add_to(boise_map)
    
boise_map

## Methodology <a name="methodology"></a>

In this project we will look for areas that have established venues in the region.



## Analysis <a name="analysis"></a>

Let's perform some basic explanatory data analysis and derive some additional info from our raw data. First let's count the **number of salons in every area candidate** area:

In [17]:
location_salons_count = [len(sal) for sal in location_salons]
df_locations['Salons in Area'] = location_salons_count

print('Average number of salons in every area with radius=300m:', np.array(location_salons_count).mean())

df_results = df_locations[df_locations['Salons in Area'] > 0]
df_results

Average number of salons in every area with radius=300m: 0.7


Unnamed: 0,Address,Latitude,Longitude,X,Y,Distance from center,Salons in Area
26,"(Hospitality Of Nez Perce Statue, West Bannock...",43.629489,-116.20279,-3399585.0,13857420.0,1714.733007,2
29,"(Hospitality Of Nez Perce Statue, West Bannock...",43.640715,-116.222378,-3397185.0,13857420.0,3802.671336,1
35,"(Hospitality Of Nez Perce Statue, West Bannock...",43.619769,-116.197461,-3400785.0,13858110.0,532.050808,1
37,"(Hospitality Of Nez Perce Statue, West Bannock...",43.627253,-116.210515,-3399185.0,13858110.0,1686.142954,3
38,"(Hospitality Of Nez Perce Statue, West Bannock...",43.630995,-116.217044,-3398385.0,13858110.0,2458.267289,2
39,"(Hospitality Of Nez Perce Statue, West Bannock...",43.634736,-116.223574,-3397585.0,13858110.0,3243.929417,2
43,"(Hospitality Of Nez Perce Statue, West Bannock...",43.610051,-116.192133,-3401985.0,13858800.0,1210.721618,2
44,"(Hospitality Of Nez Perce Statue, West Bannock...",43.613793,-116.198658,-3401185.0,13858800.0,431.099568,6
45,"(Hospitality Of Nez Perce Statue, West Bannock...",43.617534,-116.205184,-3400385.0,13858800.0,431.099568,15
46,"(Hospitality Of Nez Perce Statue, West Bannock...",43.621275,-116.211711,-3399585.0,13858800.0,1210.721618,3


In [18]:
salon_latlons = [[sal[2], sal[3]] for sal in salons.values()]

In [19]:
from folium import plugins
from folium.plugins import HeatMap

b_center = location=[latitude, longitude]

map_boise = folium.Map(b_center, zoom_start=13)

folium.TileLayer('cartodbpositron').add_to(map_boise) # cartodbpositron cartodbdark_matter
HeatMap(salon_latlons).add_to(map_boise)

folium.Marker(b_center, popup='Boise').add_to(map_boise)

folium.Circle(b_center, radius=1000, fill=False, color='white').add_to(map_boise)
folium.Circle(b_center, radius=2000, fill=False, color='white').add_to(map_boise)
folium.Circle(b_center, radius=3000, fill=False, color='white').add_to(map_boise)

map_boise

In [30]:
location_cos_count = [len(co) for co in neighborhoods_company]
df_locations['Businesses in Area'] = location_cos_count

print('Average number of salons in every area with radius=300m:', np.array(location_cos_count).mean())

df_results = df_locations[(df_locations['Salons in Area'] > 0) | (df_locations['Businesses in Area'] > 0)]
df_results

Average number of salons in every area with radius=300m: 4.818181818181818


Unnamed: 0,Address,Latitude,Longitude,X,Y,Distance from center,Salons in Area,Businesses in Area
6,"(Hospitality Of Nez Perce Statue, West Bannock...",43.637700,-116.193864,-3.399585e+06,1.385603e+07,2873.111856,0,1
8,"(Hospitality Of Nez Perce Statue, West Bannock...",43.645187,-116.206921,-3.397985e+06,1.385603e+07,3828.155135,0,4
9,"(Hospitality Of Nez Perce Statue, West Bannock...",43.648930,-116.213452,-3.397185e+06,1.385603e+07,4446.883373,0,1
18,"(Hospitality Of Nez Perce Statue, West Bannock...",43.639209,-116.208120,-3.398385e+06,1.385673e+07,3072.058025,0,3
19,"(Hospitality Of Nez Perce Statue, West Bannock...",43.642951,-116.214650,-3.397585e+06,1.385673e+07,3730.622001,0,1
...,...,...,...,...,...,...,...,...
104,"(Hospitality Of Nez Perce Statue, West Bannock...",43.589151,-116.225399,-3.401185e+06,1.386296e+07,4336.180288,3,5
105,"(Hospitality Of Nez Perce Statue, West Bannock...",43.592889,-116.231926,-3.400385e+06,1.386296e+07,4336.180288,0,1
106,"(Hospitality Of Nez Perce Statue, West Bannock...",43.596627,-116.238454,-3.399585e+06,1.386296e+07,4481.345723,0,3
107,"(Hospitality Of Nez Perce Statue, West Bannock...",43.600365,-116.244983,-3.398785e+06,1.386296e+07,4758.409344,0,2


In [20]:
company_latlons = [[cpny[2], cpny[3]] for cpny in meta_company.values()]

In [21]:
map_boise = folium.Map(b_center, zoom_start=13)

folium.TileLayer('cartodbpositron').add_to(map_boise) # cartodbpositron cartodbdark_matter
HeatMap(company_latlons).add_to(map_boise)

folium.Marker(b_center, popup='Boise').add_to(map_boise)

folium.Circle(b_center, radius=1000, fill=False, color='white').add_to(map_boise)
folium.Circle(b_center, radius=2000, fill=False, color='white').add_to(map_boise)
folium.Circle(b_center, radius=3000, fill=False, color='white').add_to(map_boise)

map_boise