## **Data Science Capstone Project: Battle of Neighborhoods**

Veena Muralidharan<br/>
January, 2019

#### **Project Title: Identifying an ideal location in the city of Toronto for starting a Cafe business**

#### **Executive Summary**
This data science capstone project leverages location data from providers such as Foursquare to explore the neighborhoods of a target city and build a clustering model. K-means clustering, one of the simplest unsupervised machine learning algorithms, groups data points so that each point belongs to the cluster whose center is nearest to it. Using this model, I intend to create a solution for small-scale start-ups that are looking for an ideal location to establish their business in an urban locality.

#### **Introduction**
In this capstone project, I use data scraped from the web for the city of Toronto, together with the Foursquare API, to address the business challenge of identifying an ideal location for a local cafe with maximum client coverage. For this purpose, I need to identify localities or boroughs that are densely populated and that have few existing cafes, in order to minimize competition.

#### **Project Objectives**
Identify an ideal business location for a cafe startup by considering areas/localities/boroughs that have:
- A high frequency of offices and colleges in the neighborhood
- A low frequency of cafes in the locality/borough

#### **Methodology**
##### **(I) Data Preparation**
The data for this project will be extracted, processed and analysed by combining the borough information for Toronto scraped from the web with venue information acquired through the Foursquare API. The web data will be extracted using a Python web scraping library, Beautiful Soup. After the HTML page is retrieved, the table will be converted into a data frame using the pandas library, which will also be used to clean and process the data into a final data frame for analysis. To render the data on a map, I will use the Folium library. To create clusters of similar regions of interest, I will use the k-means clustering technique as implemented in Python's scikit-learn (sklearn) library.

##### ***(A) Environment Preparation***

In [None]:
from bs4 import BeautifulSoup
import html5lib
import requests
import lxml
import pandas as pd
import numpy as np
!conda install -c conda-forge geopy --yes # comment out this line if geopy is already installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
# import k-means from clustering stage
from sklearn.cluster import KMeans
!conda install -c conda-forge folium=0.5.0 --yes # comment out this line if folium is already installed
import folium # map rendering library
print('Libraries imported.')

##### ***(B) Data Extraction using Web Scraping Library***

In [2]:
source=requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup=BeautifulSoup(source,'lxml')
print(soup.title)
df_table=soup.find_all('table')[0]

<title>List of postal codes of Canada: M - Wikipedia</title>


##### ***(C) Transforming the HTML page into a Data Frame***

In [3]:
tb_row=df_table.find_all('tr')
table=[]
for row in tb_row :
    head=row.find_all('th')
    head=[x.text.strip() for x in head]
    cols=row.find_all('td')
    cols=[x.text.strip() for x in cols]
    table.append(cols)
    
df=pd.DataFrame(table)
df.head(5)

Unnamed: 0,0,1,2
0,,,
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
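
The empty first row in the output above comes from the header row, which holds `th` cells rather than `td` cells. The following is a minimal sketch of one way to use the header cells as column names and skip rows without data cells. It runs on a small inline HTML fragment rather than the live Wikipedia page; the header names in the fragment and the variable `df_demo` are assumptions made for illustration.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Tiny stand-in for the Wikipedia table; the data rows mirror the output shown above.
html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")  # built-in parser, so lxml is not needed here
rows = soup.find("table").find_all("tr")

# Header cells become the column names.
header = [th.text.strip() for th in rows[0].find_all("th")]

# Keep only rows that actually contain data cells.
records = []
for row in rows[1:]:
    cells = [td.text.strip() for td in row.find_all("td")]
    if cells:
        records.append(cells)

df_demo = pd.DataFrame(records, columns=header)
print(df_demo)
```

With this approach the later renaming step and the removal of the first row would no longer be required, although the notebook keeps its original flow.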


##### **(II) Data Processing**

##### ***(A) Cleaning the Data*** 
In this step, the data frame will be cleaned of missing and erroneous values and transformed into a more workable structure for the analytical and machine learning algorithms.

In [4]:
#Renaming Columns
df.rename(columns={0:'Postcode',1:'Borough',2:'Neighbourhood'}, inplace=True)

#Ignore cells with a borough that is Not assigned
indexNames=df[df['Borough']=='Not assigned'].index
indexNames
df.drop(indexNames, inplace=True)
df=df.drop(df.index[0])

#More than one neighborhood can exist in one postal code area.
df[df.duplicated(['Postcode'], keep=False)]
df = df.groupby('Postcode').agg({'Borough':'first', 
                             'Neighbourhood': ', '.join,}).reset_index()

#If a cell has a borough but a Not assigned neighbourhood, the neighbourhood becomes the same as the borough
indexNei=df[df['Neighbourhood']=='Not assigned'].index
df.loc[indexNei, 'Neighbourhood']=df.loc[indexNei, 'Borough']
df.loc[indexNei]

df.shape

(103, 3)
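
To make the three cleaning rules easier to follow, here is a small sketch that applies them to a handful of toy rows. The rows are illustrative rather than the full scraped table, and the names `raw` and `clean` are assumptions introduced for this example. Unlike the cell above, the sketch fills unassigned neighbourhoods before grouping, which avoids depending on row positions.

```python
import pandas as pd

# Toy rows in the same shape as the scraped table (values are illustrative).
raw = pd.DataFrame(
    [
        ["M1A", "Not assigned", "Not assigned"],
        ["M5A", "Downtown Toronto", "Harbourfront"],
        ["M5A", "Downtown Toronto", "Regent Park"],
        ["M7A", "Queen's Park", "Not assigned"],
    ],
    columns=["Postcode", "Borough", "Neighbourhood"],
)

# Rule 1: drop rows whose borough is not assigned.
clean = raw[raw["Borough"] != "Not assigned"].copy()

# Rule 2: an unassigned neighbourhood takes the name of its borough.
missing = clean["Neighbourhood"] == "Not assigned"
clean.loc[missing, "Neighbourhood"] = clean.loc[missing, "Borough"]

# Rule 3: one row per postcode, with neighbourhood names joined by commas.
clean = (
    clean.groupby("Postcode")
    .agg({"Borough": "first", "Neighbourhood": ", ".join})
    .reset_index()
)
print(clean)
```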

##### ***(B) Merging Geographical Coordinates through another data file***

In [5]:
#Reading in the geospatial data
file='http://cocl.us/Geospatial_data'
geocode=pd.read_csv(file)
geocode.rename(columns={'Postal Code':'Postcode'}, inplace=True)

#Combine both data frames
dfg=pd.merge(df,geocode, on='Postcode')
dfg.head()

Unnamed: 0,Postcode,Neighbourhood,Borough,Latitude,Longitude
0,M1B,"Rouge, Malvern",Scarborough,43.806686,-79.194353
1,M1C,"Highland Creek, Rouge Hill, Port Union",Scarborough,43.784535,-79.160497
2,M1E,"Guildwood, Morningside, West Hill",Scarborough,43.763573,-79.188711
3,M1G,Woburn,Scarborough,43.770992,-79.216917
4,M1H,Cedarbrae,Scarborough,43.773136,-79.239476
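
An inner merge silently drops any postcode that has no match in the coordinates file. As a hedged sketch, a left merge with `indicator=True` can be used to list such postcodes before continuing. The frames below are toy data; `M9Z` is a made-up postcode, and `left`, `coords`, `checked` and `unmatched` are names assumed for this example.

```python
import pandas as pd

# Toy borough table; "M9Z" is a made-up postcode with no coordinates.
left = pd.DataFrame({
    "Postcode": ["M1B", "M1C", "M9Z"],
    "Borough": ["Scarborough", "Scarborough", "Hypothetical"],
})

# Toy coordinates table, using two rows from the output above.
coords = pd.DataFrame({
    "Postcode": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# A left merge keeps every postcode; the _merge column records where each row was found.
checked = pd.merge(left, coords, on="Postcode", how="left", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "Postcode"].tolist()
print(unmatched)
```

If the list is empty, the inner merge used in the cell above loses no rows.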


##### **(III) Data Exploration**

##### ***(A) Rendering the Map and Adding the Boroughs***
Using the geopy library to geocode Toronto and the Folium map rendering library, we will now view the boroughs and neighborhoods on a map of Toronto and add markers to indicate their locations.

In [None]:
#Identify the geographical extent of Toronto City
address = 'Toronto City'
geolocator = Nominatim(user_agent="toronto_explorer") # newer geopy releases require a user_agent; the name is arbitrary
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
#print('The geographical coordinates of Toronto City are {}, {}.'.format(latitude, longitude))

In [7]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, postcode, borough, neighbourhood in zip(dfg['Latitude'], dfg['Longitude'], dfg['Postcode'],dfg['Borough'], dfg['Neighbourhood']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

##### ***(B) Exploring Neighbourhoods using Foursquare API***

In [8]:
#Connecting to Foursquare API
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

In [9]:
df[df['Postcode']=='M5A'].index

Int64Index([53], dtype='int64')

In [10]:
#Getting the latitude and longitude values of the above neighborhood
lat = dfg.loc[53,'Latitude'] # neighborhood latitude value
long = dfg.loc[53,'Longitude'] # neighborhood longitude value
name = dfg.loc[53,'Postcode'] # neighborhood name

In [None]:
#Exploring the Postcode of Toronto City using Foursquare API
LIMIT = 100
radius = 1000
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    lat, 
    long, 
    radius, 
    LIMIT)
url

results = requests.get(url).json()
results
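
The request URL above is assembled with string formatting. A possible alternative, sketched here under the assumption that the same endpoint and parameter names are used, is to build the query string from a dictionary with the standard library so that every value is URL-encoded. The helper `build_explore_url` and the demo credentials are hypothetical, and the coordinates are illustrative.

```python
from urllib.parse import urlencode


def build_explore_url(client_id, client_secret, version, lat, lng, radius=1000, limit=100):
    """Return an explore URL with URL-encoded query parameters (hypothetical helper)."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)


# Placeholder credentials and illustrative coordinates; no request is sent here.
demo_url = build_explore_url("DEMO_ID", "DEMO_SECRET", "20180605", 43.65426, -79.360636)
print(demo_url)
```

The comma in the `ll` value is percent-encoded as `%2C` in the resulting URL.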

In [12]:
from pandas import json_normalize # transform JSON file into a pandas dataframe (pandas >= 1.0; older releases import it from pandas.io.json)
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

#Clean the json file and create pandas data frame for venues for postcode M5A
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]


print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

nearby_venues.head()

100 venues were returned by Foursquare.


Unnamed: 0,name,categories,lat,lng
0,Roselle Desserts,Bakery,43.653447,-79.362017
1,Tandem Coffee,Coffee Shop,43.653559,-79.361809
2,Toronto Cooper Koo Family Cherry St YMCA Centre,Gym / Fitness Center,43.653191,-79.357947
3,Impact Kitchen,Restaurant,43.656369,-79.35698
4,The Distillery Historic District,Historic Site,43.650244,-79.359323


In [None]:
def getNearbyVenues(names, latitudes, longitudes, radius=1000):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Postcode', 
                  'Postcode Latitude', 
                  'Postcode Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

toronto_venues = getNearbyVenues(names=dfg['Postcode'],
                                   latitudes=dfg['Latitude'],
                                   longitudes=dfg['Longitude']
                                  )
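
The list comprehension in `getNearbyVenues` indexes `categories[0]` directly, so a venue without any category, or a response without groups, would raise an exception. The following is a minimal sketch of a more defensive extraction step. The function `extract_venue_rows` and the hand-written payload are hypothetical; the payload only imitates the nesting that the code above indexes into and is not real API output.

```python
def extract_venue_rows(postcode, lat, lng, payload):
    """Flatten one explore response into (postcode, lat, lng, venue, vlat, vlng, category) tuples."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories", [])
        # Fall back to None when a venue has no category instead of raising IndexError.
        category = categories[0]["name"] if categories else None
        rows.append((postcode, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"], category))
    return rows


# Hand-written payload in the shape the code above expects (not real API output).
fake_payload = {"response": {"groups": [{"items": [
    {"venue": {"name": "Demo Cafe", "location": {"lat": 43.65, "lng": -79.36},
               "categories": [{"name": "Café"}]}},
    {"venue": {"name": "Uncategorised Place", "location": {"lat": 43.66, "lng": -79.37},
               "categories": []}},
]}]}}

rows = extract_venue_rows("M5A", 43.65426, -79.360636, fake_payload)
empty = extract_venue_rows("M5A", 43.65426, -79.360636, {"response": {}})
print(rows)
print(empty)
```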

In [14]:
toronto_venues.head()

Unnamed: 0,Postcode,Postcode Latitude,Postcode Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,M1B,43.806686,-79.194353,Images Salon & Spa,43.802283,-79.198565,Spa
1,M1B,43.806686,-79.194353,Caribbean Wave,43.798558,-79.195777,Caribbean Restaurant
2,M1B,43.806686,-79.194353,Wendy's,43.802008,-79.19808,Fast Food Restaurant
3,M1B,43.806686,-79.194353,Harvey's,43.800106,-79.198258,Fast Food Restaurant
4,M1B,43.806686,-79.194353,Wendy's,43.807448,-79.199056,Fast Food Restaurant


In [None]:
toronto_venues.groupby('Postcode').count()

In [16]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 331 unique categories.


In [17]:
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")
toronto_onehot['Postcode'] = toronto_venues['Postcode'] 
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]
toronto_onehot.head()

Unnamed: 0,Postcode,Accessories Store,Adult Boutique,Afghan Restaurant,African Restaurant,Airport,Airport Lounge,American Restaurant,Amphitheater,Animal Shelter,...,Video Store,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio,Zoo
0,M1B,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,M1B,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,M1B,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,M1B,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,M1B,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [18]:
toronto_grouped = toronto_onehot.groupby('Postcode').mean().reset_index()
toronto_grouped
toronto_grouped.shape

(102, 332)
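
The one-hot encoding and the group-wise mean together turn a list of venues into, for each postcode, the share of venues that fall in each category. The sketch below shows this on six made-up venues; the frame names `venues`, `onehot` and `freq` are assumptions for this example.

```python
import pandas as pd

# Six made-up venues spread over two postcodes.
venues = pd.DataFrame({
    "Postcode": ["M1B", "M1B", "M1B", "M1B", "M1C", "M1C"],
    "Venue Category": ["Coffee Shop", "Spa", "Coffee Shop", "Park", "Park", "Playground"],
})

# One column per category, with a 1 marking the category of each venue.
onehot = pd.get_dummies(venues[["Venue Category"]], prefix="", prefix_sep="", dtype=int)
onehot.insert(0, "Postcode", venues["Postcode"])

# The mean of the 0/1 columns is the share of a postcode's venues in each category.
freq = onehot.groupby("Postcode").mean().reset_index()
print(freq)
```

For `M1B`, two of the four venues are coffee shops, so the `Coffee Shop` frequency is 0.5, and the frequencies in each row sum to 1.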

In [None]:
num_top_venues=5

for hood in toronto_grouped['Postcode']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Postcode'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

In [20]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

num_top_venues = 5

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Postcode']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Postcode'] = toronto_grouped['Postcode']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Postcode,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,M1B,Fast Food Restaurant,Coffee Shop,Fruit & Vegetable Store,Paper / Office Supplies Store,Chinese Restaurant
1,M1C,Breakfast Spot,Burger Joint,Playground,Italian Restaurant,Farmers Market
2,M1E,Pizza Place,Fast Food Restaurant,Coffee Shop,Greek Restaurant,Plaza
3,M1G,Coffee Shop,Park,Chinese Restaurant,Electronics Store,Indian Restaurant
4,M1H,Bakery,Pharmacy,Coffee Shop,Indian Restaurant,Athletics & Sports


##### ***(C) Machine Learning Algorithm: Clustering and Segmentation of Neighborhoods with Similar Venues***

In [21]:
# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop(columns='Postcode')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_

array([3, 0, 3, 3, 3, 3, 3, 3, 3, 0, 3, 0, 3, 3, 3, 3, 3, 0, 3, 2, 0, 0, 3,
       3, 3, 0, 0, 3, 3, 0, 3, 4, 3, 0, 3, 3, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 3, 3, 3, 0, 0, 0, 0, 3, 3, 3, 0, 0, 0, 0, 0, 0, 0, 3, 0, 2, 0,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 1], dtype=int32)
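
To illustrate what the label array represents, here is a small sketch that runs k-means on four made-up frequency vectors and counts how many rows land in each cluster. The data and the choice of two clusters are assumptions for this example only; `n_init` is set explicitly because its default differs between scikit-learn releases.

```python
import numpy as np
from sklearn.cluster import KMeans

# Rows are hypothetical postcodes, columns are category frequencies.
X = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 0.9],
])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = model.labels_

# Number of rows assigned to each cluster.
sizes = np.bincount(labels, minlength=2)
print(labels, sizes)
```

Rows with similar venue profiles receive the same label. The label numbers themselves carry no meaning beyond identifying a group, so the same `np.bincount` call on `kmeans.labels_` would only show how unevenly the postcodes are spread across the five clusters.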

In [22]:
dfg_group=dfg.groupby('Postcode')[['Latitude','Longitude']].mean().reset_index() # average only the numeric coordinate columns
#dfg_group.head()
#dfg_group['duplicate']=dfg_group.duplicated() 
#print(dfg_group.loc[dfg_group['duplicate']==False])
#toronto_grouped['duplicate']=toronto_grouped.duplicated() 
#print(toronto_grouped.loc[toronto_grouped['duplicate']==False])

toronto_merge=pd.merge(dfg_group, toronto_grouped, on='Postcode')
toronto_merge1=toronto_merge[['Postcode','Latitude','Longitude']]
toronto_merge1
#toronto_merge=dfg

# add clustering labels
pd.options.mode.chained_assignment = None
toronto_merge1['Cluster Labels'] = kmeans.labels_


# join the most common venues for each postcode onto the coordinates and cluster labels
toronto_merge1 = toronto_merge1.join(neighborhoods_venues_sorted.set_index('Postcode'), on='Postcode')

toronto_merge1.head() # check the last columns!

Unnamed: 0,Postcode,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,M1B,43.806686,-79.194353,3,Fast Food Restaurant,Coffee Shop,Fruit & Vegetable Store,Paper / Office Supplies Store,Chinese Restaurant
1,M1C,43.784535,-79.160497,0,Breakfast Spot,Burger Joint,Playground,Italian Restaurant,Farmers Market
2,M1E,43.763573,-79.188711,3,Pizza Place,Fast Food Restaurant,Coffee Shop,Greek Restaurant,Plaza
3,M1G,43.770992,-79.216917,3,Coffee Shop,Park,Chinese Restaurant,Electronics Store,Indian Restaurant
4,M1H,43.773136,-79.239476,3,Bakery,Pharmacy,Coffee Shop,Indian Restaurant,Athletics & Sports


In [23]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merge1['Latitude'], toronto_merge1['Longitude'], toronto_merge1['Postcode'], toronto_merge1['Cluster Labels']):
    label = folium.Popup(str(poi) + ' cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

#### **Results**
##### **Clustering Model using Machine Learning to identify ideal neighbourhoods to establish a local Cafe**

Using the clustering model, which groups neighbourhoods with similar venue profiles, I will now identify locations where the client can start a cafe without much competition within a radius of one kilometer. After identifying the most common venue categories within a one-kilometer radius of each postcode, I kept only those neighbourhoods in which neither coffee shops nor cafes appear among the five most common venue categories. The code below thereby screens out areas with strong existing competition, which would be the primary concern for a client who is about to start a cafe.

In [24]:
#Identifying neighbourhoods where neither coffee shops nor cafes are among the five most common venues
array=['Coffee Shop','Café']
#Cluster0=toronto_merge1[toronto_merge1['Cluster Labels']==0]
top_cols=[c for c in toronto_merge1.columns if c.endswith('Most Common Venue')]
NoCafe=toronto_merge1[~toronto_merge1[top_cols].isin(array).any(axis=1)]
NoCafe.head()
NoCafe.shape

(26, 9)
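
The filter above checks only the five most common categories, so a postcode can pass it while still containing a few coffee shops. A stricter variant, sketched here on toy data, would test the category frequencies directly. The frame below stands in for `toronto_grouped`, and `cafe_like`, `cafe_share` and `no_cafe_postcodes` are names assumed for this example.

```python
import pandas as pd

# Toy stand-in for the per-postcode frequency table (values are illustrative).
freq = pd.DataFrame({
    "Postcode": ["M1B", "M1C", "M1E"],
    "Coffee Shop": [0.10, 0.00, 0.00],
    "Café": [0.00, 0.00, 0.05],
    "Park": [0.90, 1.00, 0.95],
})

# Use only the cafe-like categories that are actually present as columns.
cafe_like = [c for c in ["Coffee Shop", "Café"] if c in freq.columns]

# Combined share of cafe-like venues per postcode; zero means none were returned.
cafe_share = freq[cafe_like].sum(axis=1)
no_cafe_postcodes = freq.loc[cafe_share == 0, "Postcode"].tolist()
print(no_cafe_postcodes)
```

A threshold other than zero could be used in the same way to allow a small amount of competition.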

In [25]:
# Rendering these neighborhoods on Map for easy visualization and understanding
# create map
nocafe_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.ocean(np.linspace(0, 1, len(ys)))
ocean = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(NoCafe['Latitude'], NoCafe['Longitude'], NoCafe['Postcode'], NoCafe['Cluster Labels']):
    label = folium.Popup(str(poi) + ' cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=ocean[cluster-1],
        fill=True,
        fill_color=ocean[cluster-1],
        fill_opacity=0.7).add_to(nocafe_clusters)
       
nocafe_clusters

The map rendered above shows the neighborhoods in Toronto where coffee shops and cafes are not among the most common nearby venues. In addition, the proximity of airports in Etobicoke and York, and Highway 401 running across the identified neighbourhoods, make these locations an advantageous choice for the client to start their business.

#### **Discussion**

Through this Capstone Project, the learning objectives regarding the building of a data science project have been accomplished. By performing various lab exercises, we have been able to understand the purpose and function of various algorithms and machine learning models. This in turn led to the identification of the problem statement for this project, for which data science has provided a viable solution. By using existing data available on the web together with a third-party API, we have been able to create a model that offers a workable solution for the business challenge.

#### **Conclusion**

This Capstone Project has provided an understanding of how data science and machine learning algorithms can be applied to large amounts of existing data to support business decisions. By using a third-party location data API such as Foursquare together with existing neighborhood information for Toronto, we were able to develop a model that clusters similar neighbourhoods and narrows them down to candidate locations that address the business challenge.