# Capstone Project - The Battle of Neighborhoods

## Introduction/Business Problem

This project uses the city of Toronto to recommend where to open a restaurant for someone who wants to make a major impact and have the best chance of getting a profitable business up and running.

Opening a restaurant is a high-risk business in which only a few survive, and location matters most when deciding where to open.

## Data

The data used to determine the best place to open a restaurant will be built from the information derived in Week 3 of this course. A list of Canadian postal codes (those beginning with M) will be obtained from Wikipedia, giving the postal codes, boroughs, and neighborhoods. This information will be paired with the latitude and longitude coordinates listed in a separate file, matched on postal code. Then the boroughs located in Toronto proper (those with "Toronto" in their name) will be selected. This information will be useful later for identifying the venues in each neighborhood. The neighborhoods with the fewest restaurants will be identified as the best places to open a new one.

Import the library used to open the Wikipedia page

In [140]:
import urllib.request

Assign the Wikipedia URL to the variable to be used


In [141]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

Open the Wikipedia URL using urllib.request and put the HTML into the page variable

In [142]:
page = urllib.request.urlopen(url)

Import the BeautifulSoup library so we can parse the HTML page from Wikipedia 

In [143]:
pip install BeautifulSoup4

Note: you may need to restart the kernel to use updated packages.


In [144]:
from bs4 import BeautifulSoup

Parse the HTML from our URL into the BeautifulSoup parse tree format

In [145]:
soup = BeautifulSoup(page, "html.parser") # name the built-in parser explicitly

Go through the HTML data to get the information from the table we want

In [146]:
# get the table that contains the data we need
right_table=soup.find('table', class_='wikitable sortable')
# right_table

# to store the data by column
postal_code=[]
borough=[]
neighborhood=[]

for row in right_table.findAll('tr'): # go through all the rows in the table
    cells=row.findAll('td') # get all the table data
    if len(cells)==3:
        postal_code.append(cells[0].find(text=True).strip('\n'))
        borough.append(cells[1].find(text=True).strip('\n'))
        neighborhood.append(cells[2].find(text=True).strip('\n'))
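
To make the row-filtering logic above concrete, here is a minimal sketch that applies the same kind of loop to a small inline HTML fragment instead of the live Wikipedia page. The fragment and the `sample_*` names are made up for illustration, and the built-in `html.parser` backend is assumed so that no extra parser has to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment standing in for the Wikipedia table
sample_html = """
<table class="wikitable sortable">
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
sample_table = sample_soup.find("table", class_="wikitable sortable")

rows = []
for row in sample_table.find_all("tr"):
    cells = row.find_all("td")  # the header row uses <th>, so it yields no <td> cells
    if len(cells) == 3:
        # get_text(strip=True) removes surrounding whitespace and newlines
        rows.append(tuple(cell.get_text(strip=True) for cell in cells))

print(rows)
```

The header row is skipped because it contains no `td` cells, which is why the `len(cells) == 3` check is enough to keep only data rows.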

Build the dataframe to display the information in the correct format required by the assignment

In [147]:
import pandas as pd
df=pd.DataFrame(postal_code,columns=['PostalCode'])
df['Borough']=borough
df['Neighborhood']=neighborhood

# Get names of indexes for which column Borough has value Not assigned
indexNames = df[df['Borough'] == 'Not assigned'].index
# Delete these row indexes from dataFrame
df.drop(indexNames, inplace=True)
df.reset_index(inplace = True, drop = True)
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


Print the number of rows in the dataframe

In [148]:
print('The dataframe has', df.shape[0], 'rows')

The dataframe has 103 rows


Get the file that contains the latitude and longitude data

In [149]:
# Load the Pandas libraries with alias 'pd' 
import pandas as pd 
# Read data from file 'https://cocl.us/Geospatial_data' 
coordinates = pd.read_csv("https://cocl.us/Geospatial_data") 

Add the latitude and longitude columns to the neighborhood dataframe

In [150]:
# Eliminate the space in the Postal Code column name so it matches 'PostalCode'
coordinates.columns = [b.replace(' ', '') for b in coordinates.columns]

# Merge coordinates into neighbourhood dataframe
df = df.merge(coordinates)

df.head(12)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village",43.667856,-79.532242
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
7,M3B,North York,Don Mills,43.745906,-79.352188
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
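
Called with no arguments, `df.merge(coordinates)` performs an inner join on every column name the two frames share, which here is only `PostalCode`. The sketch below shows a more explicit variant on two tiny frames. It names the key, keeps unmatched postal codes with a left join, and uses `validate` to fail loudly if a postal code appears more than once on either side. The postal code `M9Z` is a made-up value included only to show an unmatched row.

```python
import pandas as pd

# Small illustrative frames; 'M9Z' is a hypothetical code with no coordinates
left = pd.DataFrame({
    "PostalCode": ["M3A", "M4A", "M9Z"],
    "Borough": ["North York", "North York", "Etobicoke"],
})
right = pd.DataFrame({
    "PostalCode": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# Explicit key, left join, uniqueness check, and a '_merge' column recording the match status
merged = left.merge(right, on="PostalCode", how="left",
                    validate="one_to_one", indicator=True)

# Postal codes that found no coordinates
unmatched = merged.loc[merged["_merge"] == "left_only", "PostalCode"].tolist()
print(unmatched)
```

With the default inner join, a row such as `M9Z` would be dropped silently, so checking `unmatched` is one way to confirm that no postal codes were lost.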


Select the boroughs that have 'Toronto' in their name

In [151]:
toronto_boroughs = df[df['Borough'].str.contains("Toronto")]
toronto_boroughs.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
15,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
19,M4E,East Toronto,The Beaches,43.676357,-79.293031


Get geographical coordinates of Toronto

In [None]:
!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

address = 'Toronto, ON'

geolocator = Nominatim(user_agent="to_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))


Visualize Toronto and the neighborhoods in it.

In [None]:
# Import what is needed in this cell to visualize Toronto
!conda install -c conda-forge folium=0.5.0 --yes
import folium # map rendering library

# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(toronto_boroughs['Latitude'], toronto_boroughs['Longitude'], toronto_boroughs['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

Use the Foursquare API to explore the neighborhoods and segment them.

#### Define Foursquare Credentials and Version

In [None]:
import os

# Read the credentials from environment variables instead of hard-coding them in the notebook.
# The variable names FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET are assumptions; set them before running.
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials were loaded for CLIENT_ID: ' + CLIENT_ID)

#### Function to explore all the neighborhoods in Toronto

In [None]:
import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (pandas >= 1.0)

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    LIMIT = 100
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
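
The request URL in the function above is assembled with string formatting. As a hedged alternative, the sketch below builds the same query string with `urllib.parse.urlencode`, which takes care of escaping. The helper name `build_explore_url` and the placeholder credentials `"ID"` and `"SECRET"` are hypothetical; the endpoint and parameter names are the ones already used above.

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"  # same endpoint as above

def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=100):
    """Hypothetical helper: return the explore URL with an encoded query string."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),  # latitude,longitude as a single parameter
        "radius": radius,
        "limit": limit,
    }
    return EXPLORE_ENDPOINT + "?" + urlencode(params)

# Placeholder credentials, not real ones
example_url = build_explore_url("ID", "SECRET", "20180605", 43.65426, -79.360636)
print(example_url)
```

The resulting string can be passed to `requests.get` as before. `requests.get` also accepts a `params` dictionary directly, which would make the helper unnecessary.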

#### Code to run the above function on each neighborhood and to create a new dataframe called *toronto_venues*.

In [None]:
toronto_venues = getNearbyVenues(names=toronto_boroughs['Neighborhood'],
                                   latitudes=toronto_boroughs['Latitude'],
                                   longitudes=toronto_boroughs['Longitude']
                                  )

Check how many venues were returned for each neighborhood

In [None]:
toronto_venues.groupby('Neighborhood').count()
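
The stated goal is to find the neighborhoods with the fewest restaurants, so one possible next step is to count, per neighborhood, the venues whose category contains the word "Restaurant". The sketch below assumes a frame shaped like `toronto_venues` (columns `Neighborhood` and `Venue Category`). The rows are made up for illustration, and matching on the substring "Restaurant" is an assumption about how the categories are named.

```python
import pandas as pd

# Made-up rows in the same shape as toronto_venues
sample_venues = pd.DataFrame({
    "Neighborhood": ["The Beaches", "The Beaches", "St. James Town",
                     "St. James Town", "St. James Town"],
    "Venue Category": ["Park", "Pub", "Italian Restaurant",
                       "Coffee Shop", "Restaurant"],
})

# True for every venue whose category mentions "Restaurant"
is_restaurant = sample_venues["Venue Category"].str.contains("Restaurant", case=False)

# Sum the boolean flags per neighborhood and sort so the fewest restaurants come first
restaurant_counts = (
    is_restaurant.groupby(sample_venues["Neighborhood"])
    .sum()
    .sort_values()
    .rename("Restaurant Count")
)
print(restaurant_counts)
```

Neighborhoods for which the API returned no venues at all would not appear in this count, so they would need to be handled separately.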

Analyze each neighborhood

In [None]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
moving_col = toronto_onehot['Neighborhood']
toronto_onehot.drop(labels=['Neighborhood'], axis=1,inplace = True)
toronto_onehot.insert(0, 'Neighborhood', moving_col)

toronto_onehot.head()

#### Group rows by neighborhood and take the mean of the frequency of occurrence of each category

In [None]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped
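
To see what the one-hot encoding and the group-wise mean produce, here is a small sketch on made-up venue rows. The neighborhood names `A` and `B` and the categories are illustrative only.

```python
import pandas as pd

# Made-up venue rows: four venues in neighborhood A, two in neighborhood B
sample_rows = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Cafe", "Cafe", "Park", "Pub", "Park", "Park"],
})

# One column per category, 1.0 where the venue belongs to it
onehot = pd.get_dummies(sample_rows["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", sample_rows["Neighborhood"])

# The mean of each 0/1 column is the share of venues in that category
grouped = onehot.groupby("Neighborhood").mean().reset_index()
print(grouped)
```

Each row of `grouped` sums to 1 across the category columns, so the values can be read as the fraction of a neighborhood's venues that fall into each category.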

Function to sort the venue categories of a neighborhood in descending order of frequency.

In [None]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
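
As a quick check of what this function returns, the sketch below applies the same logic to a single made-up row. The function is repeated under the hypothetical name `top_categories` only so that the example runs on its own.

```python
import pandas as pd

def top_categories(row, num_top_venues):
    """Same logic as return_most_common_venues, repeated for a standalone example."""
    row_categories = row.iloc[1:]  # skip the leading 'Neighborhood' entry
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

# Made-up frequencies for one neighborhood
sample_row = pd.Series({"Neighborhood": "A", "Cafe": 0.5, "Park": 0.3, "Pub": 0.2})

top_two = list(top_categories(sample_row, 2))
print(top_two)
```

When two categories have the same frequency, their relative order is not guaranteed by the default sort, so ties in the lower ranks should not be over-interpreted.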

Create a new dataframe and display the top 10 venues for each neighborhood. 

In [None]:
import numpy as np # library to handle data in a vectorized manner

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError: # ranks beyond 3rd have no entry in indicators
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Run *k*-means to cluster the neighborhoods into 5 clusters.

In [None]:
# import k-means from clustering stage
from sklearn.cluster import KMeans

# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop('Neighborhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 
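
The notebook fixes the number of clusters at 5 without comparing alternatives. One common heuristic, which is not part of the original analysis, is to compare silhouette scores for a few values of k. The sketch below does this on synthetic data standing in for `toronto_grouped_clustering`. The three blobs, the range of k, and `n_init=10` are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Synthetic stand-in for the real feature matrix: three tight, well-separated blobs
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + 0.1 * rng.standard_normal((20, 2)) for c in centers])

# Fit k-means for several k and record the mean silhouette score of each clustering
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The k with the highest silhouette score is the heuristic's suggestion
best_k = max(scores, key=scores.get)
print(best_k)
```

On real venue-frequency data the scores are usually much less clear-cut than on synthetic blobs, so the result is a hint for choosing `kclusters` rather than a definitive answer.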

Create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [None]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = toronto_boroughs

# join the cluster labels and top venues onto toronto_boroughs, which holds latitude/longitude for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

toronto_merged.head() # check the last columns!

Visualize the resulting clusters

In [None]:
!conda install -c conda-forge folium=0.5.0 --yes
import folium # map rendering library

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters