# Capstone Project - The Battle of the Neighborhoods

This notebook is for the <a href="https://www.coursera.org/professional-certificates/ibm-data-science">IBM Data Science Professional Certificate</a>.

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## Introduction: Business Problem <a name="introduction"></a>

In this project we will try to find an optimal location for a hair salon. Specifically, this report is targeted at stakeholders interested in opening a **Hair Salon** in **Boise, Idaho**.

We will try to detect **locations that are not already crowded with Hair Salons**. We are also particularly interested in **areas with no Hair Salons in the vicinity**. We would also prefer locations **as close to the city center as possible**, assuming that the first two conditions are met.

We will use data science techniques to generate the most promising neighborhoods based on the above criteria. The advantages of each area will then be clearly stated so that stakeholders can choose the best possible final location.

## Data <a name="data"></a>

Based on the definition of our problem, the factors that will influence our decision are:
* number of existing hair salons in the neighborhood 
* number of and distance to hair salons in the neighborhood, if any
* distance of the neighborhood from city center
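
The last factor, distance from the city center, can be estimated directly from latitude/longitude pairs. The following is a minimal sketch using the haversine formula; the helper name `haversine_km` and the Earth-radius constant are assumptions introduced here for illustration, and the two coordinate pairs are copied from outputs that appear later in this notebook.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# coordinates taken from outputs later in this notebook
city_center = (43.6166163, -116.200886)  # geocoding result for 'Boise, Idaho'
zip_83702 = (43.627734, -116.20756)      # point listed for ZIP 83702

distance = haversine_km(*city_center, *zip_83702)
print("Distance from city center: {:.2f} km".format(distance))
```

The same helper could be applied to every candidate neighborhood to rank them by proximity to the center.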

We use Boise ZIP codes to define our neighborhoods, with each ZIP code represented by a single latitude/longitude point.

The following data sources are needed to extract/generate the required information:
* the list of Boise ZIP codes is scraped from **www.zip-codes.com**
* the latitude and longitude of each ZIP code are taken from the **US ZIP code latitude and longitude dataset** published on public.opendatasoft.com
* the venues around each ZIP code, with their category and location, are obtained using the **Foursquare API**
* the coordinates of the Boise city center are obtained by geocoding 'Boise, Idaho' with **Nominatim** (through `geopy`)

### Make the necessary imports

In [1]:
import sys

# install geopy before importing it (does nothing if it is already installed)
!{sys.executable} -m pip install geopy

import re

import requests
import pandas as pd
import numpy as np

from geopy.geocoders import Nominatim
from bs4 import BeautifulSoup

import matplotlib.cm as cm
import matplotlib.colors as colors

import folium

from sklearn.cluster import KMeans



### Define functions

#### Define a function to get nearby venues

In [2]:
CLIENT_ID = 'BX0BTGI0XYHS2YEE0P4B0MQS3RNWO2U13VSKPC20SGXBMZ2D' # your Foursquare ID
CLIENT_SECRET = 'V2ZLWQM00UPQ2D1EPF0UJWQ1XJUKMCA4XE3NYTJADTCSKFJK' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
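
`getNearbyVenues` indexes straight into the JSON response, so a missing `groups` entry or a venue without categories would raise an exception. The sketch below shows one way the same flattening could be made tolerant of such gaps. The helper name `extract_venue_rows` and the sample payload are hypothetical; the payload only mimics the keys that the function above reads and is not real API output.

```python
def extract_venue_rows(name, lat, lng, payload):
    """Flatten one explore-style response into tuples, tolerating missing pieces."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories", [])
        # use None when a venue carries no category instead of raising IndexError
        category = categories[0]["name"] if categories else None
        rows.append((name, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"], category))
    return rows


# hypothetical payload with the same nesting that getNearbyVenues relies on
sample_payload = {
    "response": {"groups": [{"items": [
        {"venue": {"name": "Aquarium of Boise",
                   "location": {"lat": 43.605527, "lng": -116.273216},
                   "categories": [{"name": "Aquarium"}]}},
        {"venue": {"name": "Uncategorized Venue",
                   "location": {"lat": 43.6, "lng": -116.27},
                   "categories": []}},
    ]}]}
}

rows = extract_venue_rows(83701, 43.603768, -116.272921, sample_payload)
empty_rows = extract_venue_rows(83701, 43.603768, -116.272921, {"response": {}})
print(len(rows), len(empty_rows))
```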

#### Define a function to get the categories

In [3]:
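# function that extracts the category of the venue
def get_category_type(row):
    # the category list is stored under 'categories' or 'venue.categories',
    # depending on how the Foursquare JSON was flattened
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']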

#### Define a function to get n common venues

In [4]:
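def return_most_common_venues(row, num_top_venues):
    # skip the first entry (the Zip value) so that only category frequencies are sorted
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    # the index of the sorted Series holds the category names
    return row_categories_sorted.index.values[0:num_top_venues]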

### Screen scrape the zipcodes from www.zip-codes.com

In [5]:
link = "https://www.zip-codes.com/city/id-boise.asp"
r = requests.get(link)

soup = BeautifulSoup(r.content, "html.parser")  # name the parser explicitly to avoid a warning
print("Zip codes loaded in.")

Zip codes loaded in.


In [6]:
p = re.compile(r"ZIP Code (\d{5})")  # raw string so that \d is not treated as a string escape

table_data = soup.find('table', attrs = {'class': 'statTable'})
content = table_data.find_all('a')

zip_code = []

for l in content:
    m = p.match(l.text)
    if m is not None:
        # the captured group is the five-digit ZIP code
        zip_code.append(int(m.group(1)))
    
zips=pd.DataFrame(zip_code)
zips.rename(columns={0: "Zip"}, inplace=True)
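
To see what the regular expression in the cell above keeps and discards, here is a small standalone sketch. The link texts are hypothetical stand-ins for the anchor texts found in the scraped table, not actual page content.

```python
import re

ZIP_PATTERN = re.compile(r"ZIP Code (\d{5})")

# hypothetical anchor texts of the kind the scraped table might contain
link_texts = ["ZIP Code 83701", "ZIP Code 83702", "Boise, ID", "Area Code 208"]

zip_codes = []
for text in link_texts:
    m = ZIP_PATTERN.match(text)  # match() only succeeds at the start of the string
    if m is not None:
        zip_codes.append(int(m.group(1)))

print(zip_codes)  # [83701, 83702]
```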

### Import the lat lon coordinates 

In [7]:
link_lat = (
    "https://public.opendatasoft.com/explore/dataset/us-zip-code-latitude-and-longitude/download/" 
    "?format=csv&timezone=America/Mexico_City&lang=en&use_labels_for_header=true&csv_separator=%3B"
    )

df_lat_lon = pd.read_csv(link_lat, sep=";")
print("Latitude and Longitude data loaded in.")

Latitude and Longitude data loaded in.


In [8]:
type(df_lat_lon)

df_merged = pd.merge(zips, df_lat_lon, on="Zip", how="inner")
df_merged.drop(columns=["City", "State", "Timezone", "Daylight savings time flag", "geopoint"], inplace=True)

In [9]:
address = 'Boise, Idaho'

geolocator = Nominatim(user_agent="ca_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

print('The geographical coordinates of Boise, Idaho are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Boise, Idaho are 43.6166163, -116.200886.


In [10]:
# create map of Boise using latitude and longitude values
map_boise = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, zip_code in zip(df_merged['Latitude'], df_merged['Longitude'], df_merged['Zip']):
    label = '{}'.format(zip_code)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_boise)
    
map_boise

In [11]:
boise_venues = getNearbyVenues(names=df_merged['Zip'],
                                   latitudes=df_merged['Latitude'],
                                   longitudes=df_merged['Longitude']
                                  )

boise_venues.rename(columns={"Neighborhood": "Zip"}, inplace=True)

boise_venues

| | Zip | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | 83701 | 43.603768 | -116.272921 | Aquarium of Boise | 43.605527 | -116.273216 | Aquarium |
| 1 | 83701 | 43.603768 | -116.272921 | Chef's Hut | 43.603031 | -116.273142 | Diner |
| 2 | 83701 | 43.603768 | -116.272921 | Subway | 43.604107 | -116.273683 | Sandwich Place |
| 3 | 83701 | 43.603768 | -116.272921 | Enterprise Rent-A-Car | 43.602539 | -116.274492 | Rental Car Location |
| 4 | 83701 | 43.603768 | -116.272921 | Larry H. Miller Honda Boise | 43.602030 | -116.278220 | Automotive Shop |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 82 | 83729 | 43.459855 | -116.243984 | Boise Shade Co. | 43.459523 | -116.242465 | Home Service |
| 83 | 83732 | 43.459855 | -116.243984 | Boise Shade Co. | 43.459523 | -116.242465 | Home Service |
| 84 | 83735 | 43.459855 | -116.243984 | Boise Shade Co. | 43.459523 | -116.242465 | Home Service |
| 85 | 83756 | 43.459855 | -116.243984 | Boise Shade Co. | 43.459523 | -116.242465 | Home Service |


In [12]:
# one hot encoding
boise_onehot = pd.get_dummies(boise_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
boise_onehot['Zip'] = boise_venues['Zip'] 

# move neighborhood column to the first column
fixed_columns = [boise_onehot.columns[-1]] + list(boise_onehot.columns[:-1])
boise_onehot = boise_onehot[fixed_columns]

boise_onehot.head()

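| | Zip | ATM | American Restaurant | Aquarium | Athletics & Sports | Auto Workshop | Automotive Shop | Bakery | Bistro | Brewery | ... | Rest Area | Sandwich Place | Smoke Shop | Soccer Field | Sports Bar | Supermarket | Toy / Game Store | Tree | Vietnamese Restaurant | Wine Shop |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 83701 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 83701 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 83701 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 83701 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 83701 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |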


In [13]:
boise_grouped = boise_onehot.groupby('Zip').mean().reset_index()
boise_grouped

| | Zip | ATM | American Restaurant | Aquarium | Athletics & Sports | Auto Workshop | Automotive Shop | Bakery | Bistro | Brewery | ... | Rest Area | Sandwich Place | Smoke Shop | Soccer Field | Sports Bar | Supermarket | Toy / Game Store | Tree | Vietnamese Restaurant | Wine Shop |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 83701 | 0.0 | 0.0 | 0.083333 | 0.0 | 0.0 | 0.083333 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.083333 | 0.0 | 0.0 | 0.0 | 0.0 | 0.083333 | 0.0 | 0.0 | 0.0 |
| 1 | 83702 | 0.037037 | 0.037037 | 0.0 | 0.0 | 0.0 | 0.0 | 0.037037 | 0.037037 | 0.037037 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.037037 | 0.0 | 0.037037 | 0.0 | 0.0 | 0.0 |
| 2 | 83703 | 0.0 | 0.0 | 0.0 | 0.0 | 0.333333 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 83704 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.5 | 0.0 | 0.0 |
| 4 | 83705 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | 83706 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.1 | 0.0 | 0.1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 6 | 83707 | 0.0 | 0.333333 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.333333 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 7 | 83708 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 8 | 83709 | 0.0 | 0.0 | 0.0 | 0.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 9 | 83711 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [14]:
num_top_venues = 5

for hood in boise_grouped['Zip']:
    print("----"+str(hood)+"----")
    temp = boise_grouped[boise_grouped['Zip'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----83701----
                    venue  freq
0                   Hotel  0.08
1          Sandwich Place  0.08
2                Aquarium  0.08
3  Furniture / Home Store  0.08
4     Rental Car Location  0.08


----83702----
                  venue  freq
0           Pizza Place  0.11
1           Coffee Shop  0.11
2  Fast Food Restaurant  0.07
3                   ATM  0.04
4        Farmers Market  0.04


----83703----
           venue  freq
0  Auto Workshop  0.33
1           Food  0.33
2           Park  0.33
3            ATM  0.00
4    Pizza Place  0.00


----83704----
            venue  freq
0            Tree   0.5
1  Farmers Market   0.5
2             ATM   0.0
3       Pet Store   0.0
4    Home Service   0.0


----83705----
          venue  freq
0      Dive Bar   1.0
1           ATM   0.0
2           Gym   0.0
3  Home Service   0.0
4         Hotel   0.0


----83706----
                    venue  freq
0             Pizza Place   0.3
1       Korean Restaurant   0.1
2  Furniture / Home Stor
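
The frequencies printed above are the mean of the one-hot columns per ZIP code, which is the same as each category's share of the venues found for that ZIP code. The sketch below reproduces that calculation with the standard library only. The `(zip, category)` pairs are illustrative assumptions chosen to give shares like those shown for 83703, 83704 and 83705; they are not the actual Foursquare rows.

```python
from collections import Counter, defaultdict

# illustrative (ZIP code, venue category) pairs, not the real Foursquare rows
venues = [
    (83703, "Auto Workshop"), (83703, "Food"), (83703, "Park"),
    (83704, "Tree"), (83704, "Farmers Market"),
    (83705, "Dive Bar"),
]

# count how often each category occurs within each ZIP code
counts = defaultdict(Counter)
for zip_code, category in venues:
    counts[zip_code][category] += 1

# divide by the number of venues per ZIP code to get the category share
frequencies = {
    zip_code: {cat: round(n / sum(cat_counts.values()), 2) for cat, n in cat_counts.items()}
    for zip_code, cat_counts in counts.items()
}

for zip_code, shares in frequencies.items():
    print(zip_code, shares)
```

A ZIP code with very few venues produces coarse shares such as 0.5 or 1.0, which is worth keeping in mind when these vectors are compared or clustered.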

In [15]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Zip']
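for ind in np.arange(num_top_venues):
    try:
        # the first three positions use 'st', 'nd', 'rd'
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # every later position falls back to 'th'
        columns.append('{}th Most Common Venue'.format(ind+1))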

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Zip'] = boise_grouped['Zip']

for ind in np.arange(boise_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(boise_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted

| | Zip | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 83701 | Clothing Store | Sandwich Place | Gym | Hotel | Diner | Coffee Shop | Gas Station | Rental Car Location | Furniture / Home Store | Aquarium |
| 1 | 83702 | Coffee Shop | Pizza Place | Fast Food Restaurant | Farmers Market | Ice Cream Shop | Gym / Fitness Center | Gift Shop | Frozen Yogurt Shop | Grocery Store | Mediterranean Restaurant |
| 2 | 83703 | Park | Auto Workshop | Food | Wine Shop | Gas Station | Furniture / Home Store | Frozen Yogurt Shop | Fried Chicken Joint | Fast Food Restaurant | Farmers Market |
| 3 | 83704 | Tree | Farmers Market | Wine Shop | Construction & Landscaping | Gas Station | Furniture / Home Store | Frozen Yogurt Shop | Fried Chicken Joint | Food | Fast Food Restaurant |
| 4 | 83705 | Dive Bar | Wine Shop | Construction & Landscaping | Gas Station | Furniture / Home Store | Frozen Yogurt Shop | Fried Chicken Joint | Food | Fast Food Restaurant | Farmers Market |
| 5 | 83706 | Pizza Place | Furniture / Home Store | Fried Chicken Joint | Fast Food Restaurant | Soccer Field | Sandwich Place | Korean Restaurant | Mobile Phone Shop | Coffee Shop | Frozen Yogurt Shop |
| 6 | 83707 | American Restaurant | Hotel | Rest Area | Wine Shop | Construction & Landscaping | Furniture / Home Store | Frozen Yogurt Shop | Fried Chicken Joint | Food | Fast Food Restaurant |
| 7 | 83708 | Home Service | Wine Shop | Golf Course | Gas Station | Furniture / Home Store | Frozen Yogurt Shop | Fried Chicken Joint | Food | Fast Food Restaurant | Farmers Market |
| 8 | 83709 | Construction & Landscaping | Athletics & Sports | Wine Shop | Gas Station | Furniture / Home Store | Frozen Yogurt Shop | Fried Chicken Joint | Food | Fast Food Restaurant | Farmers Market |
| 9 | 83711 | Home Service | Wine Shop | Golf Course | Gas Station | Furniture / Home Store | Frozen Yogurt Shop | Fried Chicken Joint | Food | Fast Food Restaurant | Farmers Market |
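
The column labels above rely on a three-element suffix list with a fallback to 'th', which is sufficient for ten columns but would produce labels such as '21th' for larger values. A more general helper could look like the sketch below; the function name `ordinal` is an assumption introduced for illustration and is not part of the notebook.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        # 11th, 12th, 13th (and 111th, 112th, ...) are exceptions to the last-digit rule
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)


num_top_venues = 10
columns = ["Zip"] + ["{} Most Common Venue".format(ordinal(i)) for i in range(1, num_top_venues + 1)]
print(columns[:4])
```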


In [16]:
# set number of clusters
kclusters = 5

# drop the ZIP code so that only the category frequencies are clustered
boise_grouped_clustering = boise_grouped.drop(columns='Zip')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(boise_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([2, 2, 2, 2, 0, 2, 2, 1, 3, 1], dtype=int32)
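
For readers unfamiliar with what `KMeans(...).fit(...)` does to the frequency vectors, the following is a simplified sketch of the underlying assign-and-update loop (Lloyd's algorithm). It is not scikit-learn's implementation, which by default also uses k-means++ initialization. The function name, the tiny three-category vectors and the hand-picked starting centroids are all assumptions made for illustration.

```python
def kmeans(points, initial_centroids, iterations=10):
    """Simplified Lloyd's algorithm: assign points to the nearest centroid, then recompute means."""
    centroids = [list(c) for c in initial_centroids]  # copy so the caller's lists stay untouched
    labels = [0] * len(points)
    for _ in range(iterations):
        # assignment step: index of the centroid with the smallest squared distance
        labels = [
            min(range(len(centroids)),
                key=lambda k: sum((x - c) ** 2 for x, c in zip(point, centroids[k])))
            for point in points
        ]
        # update step: move each centroid to the mean of its members
        for k in range(len(centroids)):
            members = [pt for pt, lab in zip(points, labels) if lab == k]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[k] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


# hypothetical category-frequency vectors for four neighborhoods (three categories each)
points = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0], [0.0, 0.1, 0.9]]
labels, centroids = kmeans(points, initial_centroids=[points[0], points[2]])
print(labels)  # [0, 0, 1, 1]
```

Neighborhoods with similar venue mixes end up with the same label, which is how the cluster labels in the array above should be read.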

In [17]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

boise_merged = df_merged

# join the cluster labels and top venues onto the ZIP code coordinates
boise_merged = boise_merged.join(neighborhoods_venues_sorted.set_index('Zip'), on='Zip')
boise_merged.dropna(inplace=True)

boise_merged # check the last columns!

| | Zip | Latitude | Longitude | Cluster Labels | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 83701 | 43.603768 | -116.272921 | 2.0 | Clothing Store | Sandwich Place | Gym | Hotel | Diner | Coffee Shop | Gas Station | Rental Car Location | Furniture / Home Store | Aquarium |
| 1 | 83702 | 43.627734 | -116.20756 | 2.0 | Coffee Shop | Pizza Place | Fast Food Restaurant | Farmers Market | Ice Cream Shop | Gym / Fitness Center | Gift Shop | Frozen Yogurt Shop | Grocery Store | Mediterranean Restaurant |
| 2 | 83703 | 43.668396 | -116.25707 | 2.0 | Park | Auto Workshop | Food | Wine Shop | Gas Station | Furniture / Home Store | Frozen Yogurt Shop | Fried Chicken Joint | Fast Food Restaurant | Farmers Market |
| 3 | 83704 | 43.63123 | -116.28716 | 2.0 | Tree | Farmers Market | Wine Shop | Construction & Landscaping | Gas Station | Furniture / Home Store | Frozen Yogurt Shop | Fried Chicken Joint | Food | Fast Food Restaurant |
| 4 | 83705 | 43.583139 | -116.2252 | 0.0 | Dive Bar | Wine Shop | Construction & Landscaping | Gas Station | Furniture / Home Store | Frozen Yogurt Shop | Fried Chicken Joint | Food | Fast Food Restaurant | Farmers Market |
| 5 | 83706 | 43.593523 | -116.19903 | 2.0 | Pizza Place | Furniture / Home Store | Fried Chicken Joint | Fast Food Restaurant | Soccer Field | Sandwich Place | Korean Restaurant | Mobile Phone Shop | Coffee Shop | Frozen Yogurt Shop |
| 6 | 83707 | 43.38469 | -115.997118 | 2.0 | American Restaurant | Hotel | Rest Area | Wine Shop | Construction & Landscaping | Furniture / Home Store | Frozen Yogurt Shop | Fried Chicken Joint | Food | Fast Food Restaurant |
| 7 | 83708 | 43.459855 | -116.243984 | 1.0 | Home Service | Wine Shop | Golf Course | Gas Station | Furniture / Home Store | Frozen Yogurt Shop | Fried Chicken Joint | Food | Fast Food Restaurant | Farmers Market |
| 8 | 83709 | 43.572671 | -116.29527 | 3.0 | Construction & Landscaping | Athletics & Sports | Wine Shop | Gas Station | Furniture / Home Store | Frozen Yogurt Shop | Fried Chicken Joint | Food | Fast Food Restaurant | Farmers Market |
| 9 | 83711 | 43.459855 | -116.243984 | 1.0 | Home Service | Wine Shop | Golf Course | Gas Station | Furniture / Home Store | Frozen Yogurt Shop | Fried Chicken Joint | Food | Fast Food Restaurant | Farmers Market |


In [18]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(boise_merged['Latitude'], boise_merged['Longitude'], boise_merged['Zip'], boise_merged['Cluster Labels']):
    cluster = int(cluster)  # labels became floats during the join; k-means labels start at 0
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## Methodology <a name="methodology"></a>

In this project we will direct our efforts toward detecting areas of Boise that have a low density of hair salons, particularly those with no hair salons nearby. We will limit our analysis to the area within ~6 km of the city center.

In the first step we have collected the required **data: the location and type (category) of the venues around each Boise ZIP code**, as reported by Foursquare. **Hair salons are identified** according to Foursquare categorization.

The second step in our analysis will be the calculation and exploration of '**hair salon density**' across different areas of Boise. We will use **heatmaps** to identify a few promising areas close to the center with a low number of hair salons and focus our attention on those areas.

In the third and final step we will focus on the most promising areas and within those create **clusters of locations that meet some basic requirements** established in discussion with stakeholders: we will consider locations with **no more than two hair salons within a radius of 250 meters**, and we prefer locations **without any hair salon within a radius of 400 meters**. We will present a map of all such locations and also create clusters (using **k-means clustering**) of those locations to identify general zones / neighborhoods / addresses that should be a starting point for the final 'street level' exploration and search for the optimal venue location by stakeholders.
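
The two radius requirements in the final step can be checked per candidate location along the following lines. This is only a sketch under stated assumptions: the helper names, the split into `counted_venues` (venues counted against the 250 m limit) and `excluded_venues` (venues that should be absent within 400 m), and all sample coordinates are hypothetical. Distances use the haversine formula.

```python
import math


def distance_m(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371000.0 * math.asin(math.sqrt(a))


def count_within(candidate, venues, radius_m):
    """Number of venues no farther than radius_m from the candidate (lat, lon)."""
    return sum(1 for lat, lon in venues
               if distance_m(candidate[0], candidate[1], lat, lon) <= radius_m)


def check_location(candidate, counted_venues, excluded_venues,
                   max_nearby=2, nearby_radius_m=250, exclusion_radius_m=400):
    """Return (not_crowded, none_in_vicinity) for one candidate location."""
    not_crowded = count_within(candidate, counted_venues, nearby_radius_m) <= max_nearby
    none_in_vicinity = count_within(candidate, excluded_venues, exclusion_radius_m) == 0
    return not_crowded, none_in_vicinity


# hypothetical candidate and venue coordinates (roughly 44 m, 100 m and 334 m to the north)
candidate = (43.6166, -116.2009)
counted_venues = [(43.6170, -116.2009), (43.6175, -116.2009)]
excluded_venues = [(43.6196, -116.2009)]

result = check_location(candidate, counted_venues, excluded_venues)
print(result)  # (True, False)
```

Returning the two checks separately leaves it to the caller to treat the 400 m condition as a hard requirement or as a preference.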