# Capstone Project - The Battle of the Neighborhoods

## Introduction: Business Problem

This project explores a hypothetical scenario: there may not be enough Italian restaurants in the Toronto area. For an entrepreneur who wants to open one, choosing the location is one of the most important decisions, and this project is designed to help find the most suitable one.

We will try to find the best locations for this Italian restaurant by using data science methods to identify a few promising neighbourhoods that do not yet have many Italian restaurants.

Our target stakeholders are businesspeople and investors who want to open an Italian restaurant in Toronto, Canada.

## Data

The following data sources will be used to get the required information:

1. Wikipedia will be used to scrape the Toronto neighborhoods;
1. Geospatial_Coordinates.csv will be used to get latitude and longitude information;
1. The Foursquare API will be used to get restaurant data for these neighborhoods.

In [1]:
# Package install
!pip install folium
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import folium
from sklearn.cluster import KMeans
print("Libraries Installed")

Libraries Installed


In [2]:
# We will use BeautifulSoup to get the Toronto postal codes (codes starting with M) from Wikipedia
page = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
soup = BeautifulSoup(page.content, 'html.parser')

In [3]:
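table_contents = []
table = soup.find('table')
# each <td> cell holds one postal code with its borough and neighborhoods
for row in table.find_all('td'):
    cell = {}
    # skip postal codes that are not assigned to a borough
    if row.span.text == 'Not assigned':
        pass
    else:
        cell['PostalCode'] = row.p.text[:3]
        cell['Borough'] = (row.span.text).split('(')[0]
        # take the text inside the parentheses and turn ' /' separators into commas
        cell['Neighborhood'] = (((((row.span.text).split('(')[1]).strip(')')).replace(' /',',')).replace(')',' ')).strip(' ')
        table_contents.append(cell)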
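
The chained string calls in the loop above are easier to follow when pulled into a small helper. The sketch below is not part of the original notebook: the function name `split_borough_neighborhood` is hypothetical, and the sample strings are assumed to resemble the text found in each table cell.

```python
def split_borough_neighborhood(span_text):
    """Split 'Borough(Neighborhood / Neighborhood)' into (borough, neighborhoods)."""
    # everything before the first '(' is the borough, the rest is the neighborhood list
    borough, _, rest = span_text.partition('(')
    # same clean-up steps as the loop above: drop ')', turn ' /' into ','
    neighborhood = rest.strip(')').replace(' /', ',').replace(')', ' ').strip()
    return borough.strip(), neighborhood


# assumed sample inputs, shaped like the Wikipedia cell text
print(split_borough_neighborhood('North York(Parkwoods)'))
print(split_borough_neighborhood('Downtown Toronto(Regent Park / Harbourfront)'))
```

Unlike the inline version, `partition` returns an empty neighborhood instead of raising an `IndexError` when a cell has no parentheses.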

In [4]:
#We save this to dataframe (df)
df=pd.DataFrame(table_contents)
df['Borough']=df['Borough'].replace({'Downtown TorontoStn A PO Boxes25 The Esplanade':'Downtown Toronto Stn A',
                                             'East TorontoBusiness reply mail Processing Centre969 Eastern':'East Toronto Business',
                                             'EtobicokeNorthwest':'Etobicoke Northwest','East YorkEast Toronto':'East York/East Toronto',
                                             'MississaugaCanada Post Gateway Processing Centre':'Mississauga'})
df

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Queen's Park,Ontario Provincial Government
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto Business,Enclave of M4L
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


In [5]:
# we see how many records we collected for this dataframe
df.shape

(103, 3)

In [6]:
# Check the percentage of NaN/null values in each column
display(100*df.isnull().sum()/df.shape[0])

PostalCode      0.0
Borough         0.0
Neighborhood    0.0
dtype: float64

In [7]:
# Download Geospatial_Coordinates.csv and load it into a dataframe (temp_df)
URL = "https://cocl.us/Geospatial_data/Geospatial_Coordinates.csv"
temp_df = pd.read_csv(URL)

# show the first 5 rows
temp_df.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [8]:
# Rename the column 'Postal Code' to 'PostalCode' so it matches df
temp_df.rename(columns={'Postal Code': 'PostalCode'}, inplace=True)

#show the first 5 rows
temp_df.head()

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [9]:
# Merge the 2 data sets (df and temp_df)
temp_df = pd.merge(df, temp_df, on='PostalCode')

#show the first 5 rows
temp_df.head(5)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Queen's Park,Ontario Provincial Government,43.662301,-79.389494


In [10]:
temp_df.shape

(103, 5)
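
The merged frame still has 103 rows, so no postal code was lost. Because `pd.merge` defaults to an inner join, unmatched rows would otherwise disappear silently. The following is a minimal sketch of how that could be checked explicitly. The toy frames and the postal code `M9Z` are made up for illustration.

```python
import pandas as pd

# toy stand-ins for df and the coordinates table; 'M9Z' is a made-up code with no coordinates
neighborhoods = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A', 'M9Z'],
    'Borough': ['North York', 'North York', 'Hypothetical'],
})
coords = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A'],
    'Latitude': [43.753259, 43.725882],
    'Longitude': [-79.329656, -79.315572],
})

# a left join with indicator=True keeps every neighborhood and marks where each row came from
checked = pd.merge(neighborhoods, coords, on='PostalCode', how='left', indicator=True)
unmatched = checked.loc[checked['_merge'] == 'left_only', 'PostalCode'].tolist()
print(unmatched)
```

An empty `unmatched` list would confirm that every postal code received coordinates.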

## Foursquare

Now that we have our location candidates, let's use the Foursquare API to get information on the venues in each neighborhood.

We're interested in venues in the 'food' category, but only those that are proper restaurants: coffee shops, pizza places, bakeries, etc. are not direct competitors, so we don't care about those. The venues of interest are therefore those with 'restaurant' in the category name, and in particular the 'Italian Restaurant' category, since we need information on Italian restaurants in each neighborhood. Note that the retrieval function used here returns venues of every category; the Italian restaurant category is singled out later, during the analysis.

Foursquare credentials are defined in the cell below.
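
A restaurant-only filter of the kind described above could look like the sketch below. It is an illustration rather than a step of the original notebook: the frame `venues` is a toy sample built from a few venue rows that appear in this notebook's results, and the column names follow the ones returned by the venue retrieval function.

```python
import pandas as pd

# toy sample of venues with their Foursquare category names
venues = pd.DataFrame({
    'Venue': ['Pantheon', 'Cafe Fiorentina', 'La Diperie', 'MenEssentials'],
    'Venue Category': ['Greek Restaurant', 'Italian Restaurant',
                       'Ice Cream Shop', 'Cosmetics Shop'],
})

# keep only categories whose name contains 'restaurant' (case-insensitive)
is_restaurant = venues['Venue Category'].str.contains('restaurant', case=False)
restaurants = venues[is_restaurant]

# narrow further to the Italian restaurants
italian = restaurants[restaurants['Venue Category'] == 'Italian Restaurant']
print(restaurants)
print(italian)
```

Filtering before computing frequencies would change their meaning: the share of Italian restaurants among restaurants, rather than among all venues.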

In [11]:
# Define Foursquare credentials and version
CLIENT_ID = 'YOUR_CLIENT_ID' # your Foursquare ID (redacted)
CLIENT_SECRET = 'YOUR_CLIENT_SECRET' # your Foursquare Secret (redacted)
VERSION = '20180604' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: YOUR_CLIENT_ID
CLIENT_SECRET:YOUR_CLIENT_SECRET


In [12]:

# Let's get the venue data from Foursquare
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    """Return a DataFrame with one row per venue found within `radius` metres of each location."""
    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)

        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            radius,
            LIMIT)

        # make the GET request and keep only the list of returned items
        results = requests.get(url).json()["response"]['groups'][0]['items']

        # keep only the relevant information for each nearby venue
        venues_list.append([(
            name,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])

    # flatten the per-neighborhood lists into a single DataFrame
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood',
                  'Neighborhood Latitude',
                  'Neighborhood Longitude',
                  'Venue',
                  'Venue Latitude',
                  'Venue Longitude',
                  'Venue Category']

    return nearby_venues
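
The function above indexes straight into the JSON response, so a missing `groups` list or a venue without categories would raise an exception. Below is a minimal sketch of a more defensive parser. The helper name `extract_venues` is hypothetical, and `mock_response` is a hand-written dictionary that mirrors only the fields accessed above, not a real API reply.

```python
def extract_venues(response_json, neighborhood):
    """Turn one explore-style response into a list of venue tuples."""
    groups = response_json.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # fall back to None when a venue carries no category
        category = categories[0]['name'] if categories else None
        rows.append((neighborhood, venue['name'],
                     venue['location']['lat'], venue['location']['lng'], category))
    return rows


# hand-written mock with the same nesting that getNearbyVenues reads
mock_response = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Cafe Fiorentina',
               'location': {'lat': 43.677743, 'lng': -79.350115},
               'categories': [{'name': 'Italian Restaurant'}]}},
    {'venue': {'name': 'Unnamed Spot',
               'location': {'lat': 43.6777, 'lng': -79.3512},
               'categories': []}},
]}]}}

rows = extract_venues(mock_response, 'The Danforth West, Riverdale')
print(rows)
print(extract_venues({}, 'Nowhere'))
```

Keeping the parsing separate from the HTTP call also makes it possible to test it without network access or credentials.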

In [13]:
toronto_venues = getNearbyVenues(names=temp_df['Neighborhood'],
                                   latitudes=temp_df['Latitude'],
                                   longitudes=temp_df['Longitude']
                                  )

Parkwoods
Victoria Village
Regent Park, Harbourfront
Lawrence Manor, Lawrence Heights
Ontario Provincial Government
Islington Avenue
Malvern, Rouge
Don Mills North
Parkview Hill, Woodbine Gardens
Garden District, Ryerson
Glencairn
West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale
Rouge Hill, Port Union, Highland Creek
Don Mills South
Woodbine Heights
St. James Town
Humewood-Cedarvale
Eringate, Bloordale Gardens, Old Burnhamthorpe, Markland Wood
Guildwood, Morningside, West Hill
The Beaches
Berczy Park
Caledonia-Fairbanks
Woburn
Leaside
Central Bay Street
Christie
Cedarbrae
Hillcrest Village
Bathurst Manor, Wilson Heights, Downsview North
Thorncliffe Park
Richmond, Adelaide, King
Dufferin, Dovercourt Village
Scarborough Village
Fairview, Henry Farm, Oriole
Northwood Park, York University
The Danforth  East
Harbourfront East, Union Station, Toronto Islands
Little Portugal, Trinity
Kennedy Park, Ionview, East Birchmount Park
Bayview Village
Downsview East
The Danforth

In [14]:
toronto_venues.shape

(2104, 7)

In [15]:
# One-hot encode the venue categories (one dummy column per category)
to_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
to_onehot['Neighborhoods'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [to_onehot.columns[-1]] + list(to_onehot.columns[:-1])
to_onehot = to_onehot[fixed_columns]

print(to_onehot.shape)
to_onehot.head()

(2104, 269)


Unnamed: 0,Neighborhoods,Accessories Store,Adult Boutique,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [16]:
# Group by Neighborhoods
to_grouped = to_onehot.groupby(["Neighborhoods"]).mean().reset_index()

print(to_grouped.shape)
to_grouped

(99, 269)


Unnamed: 0,Neighborhoods,Accessories Store,Adult Boutique,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"Alderwood, Long Branch",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Bathurst Manor, Wilson Heights, Downsview North",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Bayview Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"Bedford Park, Lawrence Manor East",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.045455,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
94,Willowdale West,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
95,"Willowdale, Newtonbrook",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
96,Woburn,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
97,Woodbine Heights,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [17]:
# Create a new data frame with only the frequency of Italian restaurants per neighborhood
Italian_Restaurants = to_grouped[["Neighborhoods","Italian Restaurant"]]

#show the first 5 rows
Italian_Restaurants.head()

Unnamed: 0,Neighborhoods,Italian Restaurant
0,Agincourt,0.0
1,"Alderwood, Long Branch",0.0
2,"Bathurst Manor, Wilson Heights, Downsview North",0.0
3,Bayview Village,0.0
4,"Bedford Park, Lawrence Manor East",0.090909
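
The values in the `Italian Restaurant` column are shares, not counts: each is the fraction of a neighborhood's returned venues that fall in that category. A value of 0.090909 is consistent with, for example, 2 Italian restaurants among 22 venues. The sketch below reproduces the one-hot and group-by-mean steps on made-up data. The neighborhoods `A` and `B` and their venues are purely illustrative.

```python
import pandas as pd

# made-up venues: neighborhood A has 2 Italian restaurants out of 4 venues, B has none
toy = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Italian Restaurant', 'Cafe', 'Park',
                       'Italian Restaurant', 'Cafe', 'Park'],
})

# one dummy column per category, then the neighborhood name in front
onehot = pd.get_dummies(toy['Venue Category'], dtype=int)
onehot.insert(0, 'Neighborhood', toy['Neighborhood'])

# the mean of a 0/1 column is the share of venues in that category
freq = onehot.groupby('Neighborhood').mean()
print(freq['Italian Restaurant'])
```

Because these are shares, a neighborhood with very few venues can show a high frequency from a single restaurant, which is worth keeping in mind when reading the clusters.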


## Methodology 
The goal of this project is a simple study that identifies the areas of Toronto where Italian restaurants are located, so that we can define areas of opportunity for investing in or starting a new Italian restaurant.

Next, some numbers are presented to show which areas have the most Italian restaurants, which in turn helps us determine other areas where we could start our restaurant.

Finally, the last part of the study shows a map of where these Italian restaurants are located, which helps us visualize the areas of opportunity for our restaurant.

## Analysis 
First, in the data stage, we defined the areas of Toronto based on information extracted from Wikipedia, combined with geographic coordinates.

Second, we extracted venue data from Foursquare and reduced it to the frequency of Italian restaurants in each neighborhood.

Now we want to find clusters in this new dataset to see whether there are areas that do not yet have many Italian restaurants. We will do this with a type of analysis called K-means.

K-means clustering is a type of unsupervised learning, which is used when you have unlabeled data (i.e., data without defined categories or groups). The goal of this algorithm is to find groups in the data, with the number of groups represented by the variable K. The algorithm works iteratively to assign each data point to one of K groups based on the features that are provided. Data points are clustered based on feature similarity.
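
To make the idea concrete, here is a minimal sketch of K-means on a single feature, as in this project. The frequency values are made up, and `n_init=10` is an explicit choice for the illustration rather than a setting taken from the notebook.

```python
import numpy as np
from sklearn.cluster import KMeans

# made-up Italian restaurant frequencies for eight neighborhoods (one feature per row)
freq = np.array([0.0, 0.0, 0.01, 0.05, 0.06, 0.07, 0.20, 0.21]).reshape(-1, 1)

model = KMeans(n_clusters=3, n_init=10, random_state=1).fit(freq)

# raw label numbers are arbitrary, so relabel clusters by ascending center
order = np.argsort(model.cluster_centers_.ravel())
rank = np.empty_like(order)
rank[order] = np.arange(len(order))
ordered_labels = rank[model.labels_]

print(np.sort(model.cluster_centers_.ravel()))
print(ordered_labels)
```

The relabeling step matters when interpreting results: the integer assigned to a cluster says nothing about whether it is the low- or high-frequency group, so the cluster centers should always be inspected.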

In [18]:
# Cluster the Italian restaurant frequencies into 3 clusters
toclusters = 3
to_clustering = Italian_Restaurants.drop(columns=["Neighborhoods"])
kmeans = KMeans(n_clusters=toclusters, random_state=1)
kmeans.fit(to_clustering)

# show the cluster labels of the first 20 neighborhoods
kmeans.labels_[0:20]

array([1, 1, 1, 1, 0, 1, 1, 2, 1, 1, 1, 2, 0, 1, 1, 0, 1, 2, 0, 1])

In [19]:
#create dataset (to_merged)
to_merged = Italian_Restaurants.copy()

# add clustering labels
to_merged["Cluster Labels"] = kmeans.labels_

In [20]:
# Rename the columns
to_merged.rename(columns={"Neighborhoods": "Neighborhood"}, inplace=True)
to_merged.head(5)

Unnamed: 0,Neighborhood,Italian Restaurant,Cluster Labels
0,Agincourt,0.0,1
1,"Alderwood, Long Branch",0.0,1
2,"Bathurst Manor, Wilson Heights, Downsview North",0.0,1
3,Bayview Village,0.0,1
4,"Bedford Park, Lawrence Manor East",0.090909,0


In [21]:
# sort values  based on cluster labels
to_merged.sort_values(["Cluster Labels"], inplace=True)
to_merged.head()

Unnamed: 0,Neighborhood,Italian Restaurant,Cluster Labels
84,"The Danforth West, Riverdale",0.071429,0
22,Don Mills South,0.052632,0
4,"Bedford Park, Lawrence Manor East",0.090909,0
18,Davisville,0.057143,0
64,"Parkdale, Roncesvalles",0.066667,0


In [22]:
#Combine the sets and set index
to_merged = to_merged.join(toronto_venues.set_index("Neighborhood"), on="Neighborhood")

print(to_merged.shape)
to_merged.head()

(2104, 9)


Unnamed: 0,Neighborhood,Italian Restaurant,Cluster Labels,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
84,"The Danforth West, Riverdale",0.071429,0,43.679557,-79.352188,MenEssentials,43.67782,-79.351265,Cosmetics Shop
84,"The Danforth West, Riverdale",0.071429,0,43.679557,-79.352188,Pantheon,43.677621,-79.351434,Greek Restaurant
84,"The Danforth West, Riverdale",0.071429,0,43.679557,-79.352188,Cafe Fiorentina,43.677743,-79.350115,Italian Restaurant
84,"The Danforth West, Riverdale",0.071429,0,43.679557,-79.352188,La Diperie,43.677702,-79.352265,Ice Cream Shop
84,"The Danforth West, Riverdale",0.071429,0,43.679557,-79.352188,Dolce Gelato,43.677773,-79.351187,Ice Cream Shop


## Results 
Now that we have created the clusters with K-means, we first want to find out which cluster has the fewest Italian restaurants, so that we know where to invest.

First, let's visualize our findings: cluster 0 = red, cluster 1 = blue, cluster 2 = green.

In [23]:
# Look up the latitude/longitude of Toronto
!pip install geopy
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode("Toronto, ON")
print(location.address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

Toronto, Golden Horseshoe, Ontario, Canada
The geographical coordinates of Toronto are 43.6534817, -79.3839347.


In [24]:
# Build the map
map_clusters = folium.Map(location=[43.7532586, -79.3296565], zoom_start=11)

# set color scheme for the clusters
markers_colors = {0: 'red', 1: 'blue', 2: 'green'}

# add one marker per row of to_merged, colored by its cluster
for lat, lon, cluster in zip(to_merged['Neighborhood Latitude'], to_merged['Neighborhood Longitude'], to_merged['Cluster Labels']):
    folium.CircleMarker(
        [lat, lon],
        radius=7,
        color=markers_colors[cluster],
        fill=True,
        fill_color=markers_colors[cluster],
        fill_opacity=0.7).add_to(map_clusters)

map_clusters
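
The map can be complemented with a numeric summary per cluster, which answers directly which cluster has the lowest Italian restaurant frequency. The sketch below uses a handful of neighborhood rows copied from the tables above, so it covers only clusters 0 and 1. On the full data, `to_merged` would first need `drop_duplicates('Neighborhood')`, since it holds one row per venue after the join.

```python
import pandas as pd

# a few rows taken from the cluster table shown earlier (clusters 0 and 1 only)
sample = pd.DataFrame({
    'Neighborhood': ['Agincourt', 'Bayview Village',
                     'Bedford Park, Lawrence Manor East',
                     'Don Mills South', 'Davisville'],
    'Italian Restaurant': [0.0, 0.0, 0.090909, 0.052632, 0.057143],
    'Cluster Labels': [1, 1, 0, 0, 0],
})

# count, mean, min and max frequency per cluster, lowest mean first
summary = (sample.groupby('Cluster Labels')['Italian Restaurant']
                 .agg(['count', 'mean', 'min', 'max'])
                 .sort_values('mean'))
lowest = summary.index[0]
print(summary)
print('Cluster with the lowest mean frequency:', lowest)
```

In this sample, cluster 1 (blue on the map) contains the neighborhoods with a frequency of 0.0, which makes it the natural starting point when looking for areas without Italian restaurants.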