# Capstone Project - The Battle of the Neighborhoods

This notebook is for the <a href="https://www.coursera.org/professional-certificates/ibm-data-science">IBM Data Science Professional Certificate.</a>.

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## Introduction: Business Problem <a name="introduction"></a>

In this project we will try to find an optimal location for a hair salon. Specifically, this report will be targeted to stakeholders interested in opening a **Hair Salon** in **Boise, Idaho**.

Boise, Idaho as of 2020 has a population of less than 250,000 people.  Based on this we are looking for areas that already have a lot of venues.  This will allow us to generate cross business from people visiting other locations.

We will use data science techniques to generate the most promissing neighborhoods based on the above criteria. Advantages of each area will then be clearly expressed so that best possible final location can be chosen by stakeholders.

## Data <a name="data"></a>

Based on the definition of our problem, factors that will influence our decission are:
* number of existing venues in the neighborhood 
* type of venues in the neighborhood

Following data sources will be needed to extract/generate the required information:
* FourSquare
* zipcodes.com
* public.opendatasoft.com //for lat-lon information

### Make the necessary imports

In [1]:
import sys
!{sys.executable} -m pip install pyproj

import numpy as np
import pandas as pd
import folium
import requests
import warnings
import math
import pickle
from pyproj import Proj,transform
from pandas.io.json import json_normalize
from geopy import Nominatim

!pip install shapely
import shapely.geometry

warnings.simplefilter("ignore")

print('Libraries imported.')

Libraries imported.


In [2]:
geolocator = Nominatim(user_agent="myGeocoder")
address = 'Boise, Idaho'

locator = Nominatim(user_agent='myGeocoder')

boise_center = locator.geocode(address)
longitude = boise_center.longitude
latitude = boise_center.latitude
print('Boise, Idaho Coordinates: Latitude={}, Longitude = {}'.format(latitude,longitude))

Boise, Idaho Coordinates: Latitude=43.6166163, Longitude = -116.200886


Now let's create a grid of area candidates, equaly spaced, centered around city center. Our neighborhoods will be defined as circular areas with a radius of 400 meters, so our neighborhood centers will be 200 meters apart.

To accurately calculate distances we need to create our grid of locations in Cartesian 2D coordinate system which allows us to calculate distances in meters (not in latitude/longitude degrees). Then we'll project those coordinates back to latitude/longitude degrees to be shown on Folium map. So let's create functions to convert between WGS84 spherical coordinate system (latitude/longitude degrees) and UTM Cartesian coordinate system (X/Y coordinates in meters).

In [3]:
def lonlat_to_xy(lon, lat):
    proj_latlon = Proj(proj='latlong',datum='WGS84')
    proj_xy = Proj(proj="utm", zone=33, datum='WGS84')
    xy = transform(proj_latlon, proj_xy, lon, lat)
    return xy[0], xy[1]

def xy_to_lonlat(x, y):
    proj_latlon = Proj(proj='latlong',datum='WGS84')
    proj_xy = Proj(proj="utm", zone=33, datum='WGS84')
    lonlat = transform(proj_xy, proj_latlon, x, y)
    return lonlat[0], lonlat[1]

def calc_xy_distance(x1, y1, x2, y2):
    dx = x2 - x1
    dy = y2 - y1
    return math.sqrt(dx*dx + dy*dy)

print('Coordinate transformation check')
print('-------------------------------')
print('Boise longitude={}, latitude={}'.format(longitude, latitude))

x, y = lonlat_to_xy(longitude, latitude)
print('Boise UTM X={}, Y={}'.format(x, y))

lo, la = xy_to_lonlat(x, y)
print('Boise longitude={}, latitude={}'.format(lo, la))

Coordinate transformation check
-------------------------------
Boise longitude=-116.200886, latitude=43.6166163
Boise UTM X=-3400784.902310381, Y=13858643.278638296
Boise longitude=-116.200886, latitude=43.6166163


In [4]:
boise_center_x, boise_center_y = lonlat_to_xy(longitude, latitude) # City center in Cartesian coordinates
nb_k = 10
radius = 400
k = math.sqrt(3) / 2 # Vertical offset for hexagonal grid cells

x_min = boise_center_x - radius*10
x_step = radius*2
y_min = boise_center_y - radius*2 - (int(nb_k/k)*k*radius*2 - radius*10)/2
y_step = radius*2 * k 

latitudes = []
longitudes = []

distances_from_center = []

xs = []
ys = []

for i in range(0, int(nb_k/k)):
    y = y_min + i * y_step
    x_offset = radius if i%2==0 else 0
    for j in range(0, nb_k):
        x = x_min + j * x_step + x_offset
        distance_from_center = calc_xy_distance(boise_center_x, boise_center_y, x, y)
        if (distance_from_center <= 6001):
            lon, lat = xy_to_lonlat(x, y)
            latitudes.append(lat)
            longitudes.append(lon)
            distances_from_center.append(distance_from_center)
            xs.append(x)
            ys.append(y)
            
boise_map = folium.Map(location=[latitude, longitude], zoom_start=13)
folium.Marker(location=[latitude, longitude], popup='Boise').add_to(boise_map)

for lat, lon in zip(latitudes, longitudes):
    folium.Circle([lat, lon], radius=radius, color='blue', fill=False).add_to(boise_map)

boise_map

In [5]:
addresses = []
compteur = 0
df_locations = pd.DataFrame()
loaded = False
try:
    with open('locations.pkl', 'rb') as f:
        df_locations = pickle.load(f)
    print('Location data loaded from pickle.')
    loaded = True
except:
    pass
if not loaded:
    print('Obtaining location addresses: ', end='')
    for lat, lon in zip(latitudes, longitudes):
        compteur = compteur + 1
        address = locator.reverse("43.6166163, -116.200886")
        if address is None:
            address = 'NO ADDRESS'
        addresses.append(address)
        if compteur > 500:
            print("Urgency exit")
            break
        print(' .', end='')
    print(' done.')
if not loaded:
    addresses
if not loaded:
    df_locations = pd.DataFrame({'Address': addresses,
                                 'Latitude': latitudes,
                                 'Longitude': longitudes,
                                 'X': xs,
                                 'Y': ys,
                                 'Distance from center': distances_from_center})

df_locations.head(10)

Obtaining location addresses:  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . done.


Unnamed: 0,Address,Latitude,Longitude,X,Y,Distance from center
0,"(Hospitality Of Nez Perce Statue, West Bannock...",43.615238,-116.154723,-3404385.0,13856030.0,4446.883373
1,"(Hospitality Of Nez Perce Statue, West Bannock...",43.618982,-116.161243,-3403585.0,13856030.0,3828.155135
2,"(Hospitality Of Nez Perce Statue, West Bannock...",43.622726,-116.167765,-3402785.0,13856030.0,3288.582025
3,"(Hospitality Of Nez Perce Statue, West Bannock...",43.62647,-116.174288,-3401985.0,13856030.0,2873.111856
4,"(Hospitality Of Nez Perce Statue, West Bannock...",43.630213,-116.180812,-3401185.0,13856030.0,2640.979314
5,"(Hospitality Of Nez Perce Statue, West Bannock...",43.633957,-116.187337,-3400385.0,13856030.0,2640.979314
6,"(Hospitality Of Nez Perce Statue, West Bannock...",43.6377,-116.193864,-3399585.0,13856030.0,2873.111856
7,"(Hospitality Of Nez Perce Statue, West Bannock...",43.641444,-116.200392,-3398785.0,13856030.0,3288.582025
8,"(Hospitality Of Nez Perce Statue, West Bannock...",43.645187,-116.206921,-3397985.0,13856030.0,3828.155135
9,"(Hospitality Of Nez Perce Statue, West Bannock...",43.64893,-116.213452,-3397185.0,13856030.0,4446.883373


### Save location data thus far

In [6]:
df_locations.to_pickle('./locations.pkl')

### Foursquare
Now that we have our locations, let's use Foursquare API to get info on restaurants in each neighborhood.
We're interested in venues in 'food' category, but only the ones who can be competitors, this mean food truck, quick food, take away, healthy, not restaurant taking too long.

In [7]:
foursquare_client_id = 'BX0BTGI0XYHS2YEE0P4B0MQS3RNWO2U13VSKPC20SGXBMZ2D' # your Foursquare ID
foursquare_client_secret = 'V2ZLWQM00UPQ2D1EPF0UJWQ1XJUKMCA4XE3NYTJADTCSKFJK' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

In [33]:
# Category IDs corresponding to Hair Salons were taken from Foursquare web site 
# (https://developer.foursquare.com/docs/resources/categories):

salon_category = '4bf58dd8d48988d110951735' # 'Root' category for all food-related venues

def get_categories(categories):
    return [(cat['name'], cat['id']) for cat in categories]

def format_address(location):
    address = ', '.join(location['formattedAddress'])
    return address

def get_venues_near_location(lat, lon, category, client_id, client_secret, radius=500, limit=100):
    version = '20180724'
    url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&categoryId={}&radius={}&limit={}'.format(
        client_id, client_secret, version, lat, lon, category, radius, limit)
    try:
        results = requests.get(url).json()['response']['groups'][0]['items']
        venues = [(item['venue']['id'],
                   item['venue']['name'],
                   get_categories(item['venue']['categories']),
                   (item['venue']['location']['lat'], item['venue']['location']['lng']),
                   format_address(item['venue']['location']),
                   item['venue']['location']['distance']) for item in results]        
    except:
        venues = []
    return venues

In [39]:
def get_salons(lats, lons):
    salons = {}
    location_salons = []

    print('Obtaining venues around candidate locations:', end='')
        
    for lat, lon in zip(lats, lons):
        venues = get_venues_near_location(lat, lon, salon_category, foursquare_client_id, 
                                          foursquare_client_secret, radius=350, limit=100)
        
        area_salons = []
        
        for venue in venues:            
            venue_id = venue[0]
            venue_name = venue[1]
            venue_categories = venue[2]
            venue_latlon = venue[3]
            venue_address = venue[4]
            venue_distance = venue[5]
                        
#             is_res, is_asian = is_restaurant(venue_categories)
            
#             if is_res:
            x, y = lonlat_to_xy(venue_latlon[1], venue_latlon[0])
            
            salon = (venue_id, venue_name, venue_latlon[0], venue_latlon[1], venue_address, 
                          venue_distance, x, y)
            
            print(venue_distance)
            
            if venue_distance<=100:
                area_salons.append(salon)
                salons[venue_id] = salon
#                 if is_asian:
#                     asian_restaurants[venue_id] = restaurant
        
#         pbar.update(1)        
        
        location_restaurants.append(area_salons)
        
    return salons #, asian_restaurants, location_restaurants

salons = get_salons(latitudes, longitudes)
salons

Obtaining venues around candidate locations:75
82
254
282
296
211
293
228
254
94
257
151
103
312
322
87
137
187
111
111
120
347
350
275
218
118
148
159
160
162
168
159
247
247
257
262
228
294
317
322
325
245
296
255
316
245
102
218
242
177
159
320
194
236
319
332
197
208
243
239
316
316
323
199
154
174
204
238
335
109
316
336
254
286
201
246
287
311
315
266
343
158
158
112
220
194
264
277
332
275
286
216
340
140
148
326
171
103
188
293
301


{'4b4aaf2af964a520d68c26e3': ('4b4aaf2af964a520d68c26e3',
  'Euphoria Salon',
  43.62995445728302,
  -116.20347082614899,
  '1517 N 13th St (BTW Alturas & Eastman), Boise, ID 83702, United States',
  75,
  -3399495.2389167324,
  13857410.580666699),
 '4cd46b7394848cfa328adfb1': ('4cd46b7394848cfa328adfb1',
  "Vince's Barber Shop",
  43.629886,
  -116.203647,
  '1519 N 13th St, Boise, ID 83702, United States',
  82,
  -3399487.552309863,
  13857428.197553424),
 '536bba87498eb74df378a371': ('536bba87498eb74df378a371',
  'Studio Chic',
  43.633907318115234,
  -116.22331237792969,
  'Boise, ID, United States',
  94,
  -3397672.6496671517,
  13858181.814938808),
 '4d07bf7437036dcb156523fb': ('4d07bf7437036dcb156523fb',
  'Cardinales Hair Salon',
  43.613369,
  -116.19956699999999,
  '435 W Main St, Boise, ID 83702, United States',
  87,
  -3401151.004297237,
  13858902.301217638)}

In [50]:
boise_map = folium.Map(location=[latitude, longitude], zoom_start=15)
folium.Marker(location=[latitude, longitude], popup='Boise').add_to(boise_map)

for sal in salons.values():
    lat = sal[2]; lon = sal[3]
    label = '{}'.format(sal[1])
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker([lat, lon], radius=5, color=7, fill=True, fill_color=7,                         
                        popup=label, fill_opacity=1, parse_html=False).add_to(boise_map)

boise_map

In [58]:
# 'Root' category for all universities venues
univ_category = ['4d4b7105d754a06372d81259']

def get_meta_venues(lats, lons, meta_category):
    meta_venues = {}
    neighborhoods_venues = []
    
    for lat, lon in zip(lats, lons):
        # Using radius=350 to meke sure we have overlaps/full coverage so we don't miss any restaurant (we're using dictionaries to remove any duplicates resulting from area overlaps)
        area_meta_venues = []
        for i, category in enumerate(meta_category):
            venues = get_venues_near_location(lat, lon, category, foursquare_client_id, 
                                              foursquare_client_secret, radius=350, limit=100)
            for venue in venues:
                venue_id = venue[0]
                venue_name = venue[1]
                venue_categories = venue[2]
                venue_latlon = venue[3]
                venue_address = venue[4]
                venue_distance = venue[5]

                x, y = lonlat_to_xy(venue_latlon[1], venue_latlon[0])
                other_venue = (venue_id, venue_name, venue_latlon[0], venue_latlon[1], 
                              venue_address, venue_distance, x, y)
                
                if venue_distance<=100:
                    area_meta_venues.append(other_venue)
                
                meta_venues[venue_id] = other_venue

            neighborhoods_venues.append(area_meta_venues)

    return meta_venues, neighborhoods_venues

# Try to load from local file system in case we did this before
meta_univ = {}
neighborhoods_univ = []
loaded = False
try:
    with open('meta_univ_350.pkl', 'rb') as f:
        meta_univ = pickle.load(f)
    with open('neighborhoods_univ_350.pkl', 'rb') as f:
        neighborhoods_univ = pickle.load(f)
    print('Universities data loaded.')
    loaded = True
except:
    pass

if not loaded:
    meta_univ, neighborhoods_univ = get_meta_venues(latitudes, longitudes, univ_category)
    # Let's persists this in local file system
    with open('meta_univ_350.pkl', 'wb') as f:
        pickle.dump(meta_univ, f)
    with open('neighborhoods_univ_350.pkl', 'wb') as f:
        pickle.dump(neighborhoods_univ, f)
        
print('Total number of universities:', len(meta_univ))
print('Average number of universities in neighborhood:', np.array([len(r) for r in neighborhoods_univ]).mean())

boise_map = folium.Map(location=[latitude, longitude], zoom_start=15)
folium.Marker(location=[latitude, longitude], popup='Boise').add_to(boise_map)

for sal in meta_univ.values():
    lat = sal[2]; lon = sal[3]
    is_univ = sal[5]
    color = 'blue' 
    label = '{}'.format(sal[1])
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker([lat, lon], radius=3, color=color, fill=True, fill_color=color,                         
                        popup=label, fill_opacity=1, parse_html=False).add_to(boise_map)

boise_map

Universities data loaded.
Total number of universities: 127
Average number of universities in neighborhood: 0.2


In [60]:
# 'Root' category for all companies venues
companies_category = ['4d4b7105d754a06375d81259', '4d4b7105d754a06378d81259', '4d4b7105d754a06379d81259',
                     '4d4b7104d754a06370d81259']
# Try to load from local file system in case we did this before
meta_company = {}
neighborhoods_company = []
loaded = False
try:
    with open('meta_company_350.pkl', 'rb') as f:
        meta_company = pickle.load(f)
    with open('neighborhoods_company_350.pkl', 'rb') as f:
        neighborhoods_company = pickle.load(f)
    print('Companies data loaded.')
    loaded = True
except:
    pass

if not loaded:
    meta_company, neighborhoods_company = get_meta_venues(latitudes, longitudes, companies_category)
    # Let's persists this in local file system
    with open('meta_company_350.pkl', 'wb') as f:
        pickle.dump(meta_company, f)
    with open('neighborhoods_company_350.pkl', 'wb') as f:
        pickle.dump(neighborhoods_company, f)
print('Total number of companies:', len(meta_company))
print('Average number of companies in neighborhood:', np.array([len(r) for r in neighborhoods_company]).mean())

boise_map = folium.Map(location=[latitude, longitude], zoom_start=15)
folium.Marker(location=[latitude, longitude], popup='Seoul').add_to(boise_map)
for sal in meta_company.values():
    lat = sal[2]; lon = sal[3]
    is_company = res[6]
    color = 'red'
    label = '{}'.format(sal[1])
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker([lat, lon], radius=3, color=color, fill=True, fill_color=color,                         
                        popup=label, fill_opacity=1, parse_html=False).add_to(boise_map)

boise_map

Total number of companies: 0
Average number of companies in neighborhood: 0.0


### Screen scrape the zipcodes from www.zip-codes.com

In [None]:
link = "https://www.zip-codes.com/city/id-boise.asp"
r = requests.get(link)

soup = BeautifulSoup(r.content)
print("Zip codes loaded in.")

In [None]:
p = re.compile("ZIP Code (\d{5})")

table_data = soup.find('table', attrs = {'class': 'statTable'})
content = table_data.find_all('a')

zip_code = []

for l in content:
    if p.match(l.text) != None:
        tmp = int (pd.Series([p.search(l.text).group(1)]))
        zip_code.append(tmp)
    
zips=pd.DataFrame(zip_code)
zips.rename(columns={0: "Zip"}, inplace=True)

### Import the lat lon coordinates 

In [None]:
link_lat = (
    "https://public.opendatasoft.com/explore/dataset/us-zip-code-latitude-and-longitude/download/" 
    "?format=csv&timezone=America/Mexico_City&lang=en&use_labels_for_header=true&csv_separator=%3B"
    )

df_lat_lon = pd.read_csv(link_lat, sep=";")
print("Latitude and Longitude data loaded in.")

In [None]:
type(df_lat_lon)

df_merged = pd.merge(zips, df_lat_lon, on="Zip", how="inner")
df_merged.drop(columns=["City", "State", "Timezone", "Daylight savings time flag", "geopoint"], inplace=True)

In [None]:
# create map of Boise using latitude and longitude values
map_boise = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, zip_code in zip(df_merged['Latitude'], df_merged['Longitude'], df_merged['Zip']):
    label = '{}'.format(zip_code)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_boise)
    
map_boise

In [None]:
boise_venues = getNearbyVenues(names=df_merged['Zip'],
                                   latitudes=df_merged['Latitude'],
                                   longitudes=df_merged['Longitude']
                                  )

boise_venues.rename(columns={"Neighborhood": "Zip"}, inplace=True)

boise_venues

In [None]:
# one hot encoding
boise_onehot = pd.get_dummies(boise_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
boise_onehot['Zip'] = boise_venues['Zip'] 

# move neighborhood column to the first column
fixed_columns = [boise_onehot.columns[-1]] + list(boise_onehot.columns[:-1])
boise_onehot = boise_onehot[fixed_columns]

boise_onehot.head()

In [None]:
boise_grouped = boise_onehot.groupby('Zip').mean().reset_index()
boise_grouped

In [None]:
num_top_venues = 5

for hood in boise_grouped['Zip']:
    print("----"+str(hood)+"----")
    temp = boise_grouped[boise_grouped['Zip'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

In [None]:
num_top_venues = 15

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Zip']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Zip'] = boise_grouped['Zip']

for ind in np.arange(boise_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(boise_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted

In [None]:
# set number of clusters
kclusters = 5

boise_grouped_clustering = boise_grouped.drop('Zip', 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(boise_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

In [None]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

boise_merged = df_merged

# merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
boise_merged = boise_merged.join(neighborhoods_venues_sorted.set_index('Zip'), on='Zip')
boise_merged.dropna(inplace=True)

boise_merged # check the last columns!

In [None]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(boise_merged['Latitude'], boise_merged['Longitude'], boise_merged['Zip'], boise_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster-1)],
        fill=True,
        fill_color=rainbow[int(cluster-1)],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## Methodology <a name="methodology"></a>

In this project we will look for areas that have established venues in the zip-code region.

In the first section the data was partitioned by zip-code.  Once we had the zip-codes for Boise, we introduced venues to the data-frame.  We then ordered the venues by most common type.

## Analysis <a name="analysis"></a>

Let's perform some basic explanatory data analysis and derive some additional info from our raw data. First let's count the **number of restaurants in every area candidate**: