# IBM Applied Data Science Capstone

## 1. Introduction

John is moving to Toronto next year to study at the University of Toronto. He is looking for a place to rent, but he does not know which neighbourhood suits him.

His search criteria are as below:

- Close to the campus.
- As a well-trained barista, he wants to find a part-time job in a coffee shop while studying. The more coffee shops a neighbourhood has, the better his chances of finding a job.

The goal of the project is to identify neighbourhoods that meet John's criteria.


### Data

In this project, the following data will be used:

- Toronto neighbourhoods: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
- Geographical coordinates: http://cocl.us/Geospatial_data
- Venue data: retrieved from the Foursquare API using the following query: https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}

The data will be merged into one dataframe representing neighbourhoods with coordinates, venue information and the distance from the University of Toronto. The dataframe will be sorted by distance, and the neighbourhoods that do not match John's criteria will be dropped from it. The top 5 remaining neighbourhoods will be recommended to John.
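The merge, sort, filter and shortlist steps described above can be sketched on toy data. This is a minimal illustration only: the postal codes, neighbourhood names, distances and counts below are made up, and the real notebook builds the same columns from Wikipedia, the geospatial CSV and Foursquare.

```python
import pandas as pd

# Toy stand-ins for the three data sources; every value here is invented.
neighbourhoods = pd.DataFrame({
    "Postal Code": ["A1A", "B2B", "C3C"],
    "Neighbourhood": ["Alpha", "Beta", "Gamma"],
})
coordinates = pd.DataFrame({
    "Postal Code": ["A1A", "B2B", "C3C"],
    "Distance": [2.5, 0.4, 1.1],  # km from campus, precomputed for brevity
})
coffee_counts = pd.DataFrame({
    "Neighbourhood": ["Alpha", "Beta", "Gamma"],
    "Coffee Shop": [3, 0, 5],
})

# Merge everything into one dataframe and sort by distance to campus.
merged = (
    neighbourhoods
    .merge(coordinates, on="Postal Code")
    .merge(coffee_counts, on="Neighbourhood")
    .sort_values("Distance")
)

# Drop neighbourhoods without a coffee shop and keep at most five.
shortlist = merged[merged["Coffee Shop"] > 0].head(5).reset_index(drop=True)
print(shortlist)
```

Here "Beta" is the closest neighbourhood but is dropped because it has no coffee shop, so the shortlist starts with "Gamma".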

## 2. Import Libraries

In [2]:
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import geopy.distance # calculate distance between two coordinates
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

import json # library to handle JSON files
from pandas import json_normalize # transform JSON into a pandas dataframe (pandas >= 1.0; older versions used pandas.io.json)

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

## 3. Download and Explore Dataset


Scrape the list of Toronto postal codes from Wikipedia.

In [2]:
req = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
soup = BeautifulSoup(req.content,'lxml')
table = soup.find_all('table')[0]
neighbourhood_df = pd.read_html(str(table))[0]

Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.

In [3]:
neighbourhood_df = neighbourhood_df[neighbourhood_df.Borough != 'Not assigned']

More than one neighbourhood can exist in one postal code area. For example, in the table on the Wikipedia page, M5A may be listed twice with two neighbourhoods: Harbourfront and Regent Park. Such rows are combined into one row, with the neighbourhoods separated by a comma.

In [4]:
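# join the neighbourhood names that share a postal code
neighbourhood_df['Neighbourhood'] = neighbourhood_df.groupby('Postal Code')['Neighbourhood'].transform(lambda x: ','.join(x))
# drop_duplicates() returns a new dataframe, so the result must be assigned back
neighbourhood_df = neighbourhood_df.drop_duplicates()
neighbourhood_df = neighbourhood_df.reset_index(drop=True)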
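As an alternative to `transform` followed by `drop_duplicates`, the same combination can be done in a single `groupby(...).agg(...)` step. The sketch below uses a small hand-built frame (the M5A and M3A rows mirror the example in the text) and joins with `', '` for readability; it is an illustration, not the cell the notebook actually ran.

```python
import pandas as pd

# Hand-built rows imitating a table where M5A appears twice.
rows = pd.DataFrame({
    "Postal Code": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighbourhood": ["Harbourfront", "Regent Park", "Parkwoods"],
})

# One output row per (postal code, borough); names are joined with ", ".
# sort=False keeps the groups in order of first appearance.
combined = (
    rows.groupby(["Postal Code", "Borough"], as_index=False, sort=False)
        .agg({"Neighbourhood": ", ".join})
)
print(combined)
```

Grouping collapses the duplicates directly, so no separate de-duplication or index reset is needed.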

In [5]:
neighbourhood_df.head()

| | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |


In [6]:
print('The dataframe has {} boroughs and {} neighbourhoods.'.format(
        len(neighbourhood_df['Borough'].unique()),
        neighbourhood_df.shape[0]
    )
)

The dataframe has 10 boroughs and 103 neighbourhoods.


Because the geocoder package is not reliable, geographical coordinates are loaded from 'http://cocl.us/Geospatial_data' and then merged with neighbourhood_df on the postal code.

In [7]:
geospatial_df = pd.read_csv('http://cocl.us/Geospatial_data')
neighbourhood_df = neighbourhood_df.merge(geospatial_df, left_on='Postal Code', right_on='Postal Code')

In [8]:
neighbourhood_df.head()

| | Postal Code | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |
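An inner merge silently drops postal codes that have no coordinates. One way to check for that, sketched here on toy data, is a left merge with `indicator=True`. The code "M9Z" and the name "Hypothetical" are made up to force a mismatch; the other two rows reuse values from the table above.

```python
import pandas as pd

neighbourhoods = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M9Z"],  # "M9Z" is invented for this example
    "Neighbourhood": ["Parkwoods", "Victoria Village", "Hypothetical"],
})
coordinates = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# how="left" keeps every neighbourhood; the "_merge" column records
# whether a matching coordinate row was found.
checked = neighbourhoods.merge(coordinates, on="Postal Code", how="left", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "Postal Code"].tolist()
print(unmatched)
```

An empty `unmatched` list would indicate that every postal code received coordinates.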


#### Use the geopy library to get the latitude and longitude values of Toronto.

In [9]:
address = 'Toronto, Ontario'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


#### Use geopy library to get the latitude and longitude values of the University of Toronto.

In [3]:
uni_address = 'University of Toronto, Harbord'

geolocator = Nominatim(user_agent="toronto_explorer")
uni_location = geolocator.geocode(uni_address)
uni_latitude = uni_location.latitude
uni_longitude = uni_location.longitude
print('The geographical coordinates of the University of Toronto are {}, {}.'.format(uni_latitude, uni_longitude))

The geographical coordinates of the University of Toronto are 43.6640959, -79.3986695.


#### Create a function to calculate the distance in kilometres between two coordinates

In [11]:
def calculate_distance(lat1, lng1, lat2, lng2):
    return geopy.distance.geodesic((lat1, lng1), (lat2, lng2)).km
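
As a rough cross-check that does not depend on geopy, the great-circle (haversine) distance can be computed with the standard library. This is a sketch under the assumption of a spherical Earth with a mean radius of 6371.0088 km; `haversine_km` is a helper name introduced here, and its results differ slightly from the ellipsoidal geodesic distance used in the notebook.

```python
import math

def haversine_km(lat1, lng1, lat2, lng2, radius_km=6371.0088):
    """Great-circle distance in km between two (lat, lng) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lng2 - lng1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))

# Campus coordinates and the M5S centroid, both taken from this notebook.
campus_to_m5s = haversine_km(43.6640959, -79.3986695, 43.662696, -79.400049)
print(round(campus_to_m5s, 3))
```

For points this close together the spherical approximation should agree with the geodesic value to within a few metres.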

#### Add a column for distance to University of Toronto 

In [12]:
# compute the distance from the university for every row
neighbourhood_df['Distance'] = neighbourhood_df.apply(
    lambda row: calculate_distance(uni_latitude, uni_longitude, row['Latitude'], row['Longitude']),
    axis=1
)

#### Sort based on distance

In [13]:
neighbourhood_df = neighbourhood_df.sort_values(['Distance']).reset_index(drop=True)
neighbourhood_df.head()

| | Postal Code | Borough | Neighbourhood | Latitude | Longitude | Distance |
|---|---|---|---|---|---|---|
| 0 | M5S | Downtown Toronto | University of Toronto, Harbord | 43.662696 | -79.400049 | 0.191289 |
| 1 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 | 0.766481 |
| 2 | M5R | Central Toronto | The Annex, North Midtown, Yorkville | 43.67271 | -79.405678 | 1.111517 |
| 3 | M5G | Downtown Toronto | Central Bay Street | 43.657952 | -79.387383 | 1.13788 |
| 4 | M5T | Downtown Toronto | Kensington Market, Chinatown, Grange Park | 43.653206 | -79.400049 | 1.21507 |


In [14]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighbourhood in zip(neighbourhood_df['Latitude'], neighbourhood_df['Longitude'], neighbourhood_df['Borough'], neighbourhood_df['Neighbourhood']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

#### Define Foursquare Credentials and Version

In [40]:
# Client_ID and Client_Secret are hidden in the cell.

CLIENT_ID = '<Hidden>' # Foursquare ID
CLIENT_SECRET = '<Hidden>' # Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: <Hidden>
CLIENT_SECRET:<Hidden>
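
Rather than pasting credentials into a cell and hiding them afterwards, they can be read from environment variables. The sketch below is one possible approach; the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `load_foursquare_credentials` are assumptions made for this example, not something Foursquare or the notebook prescribes.

```python
import os

def load_foursquare_credentials(env=None):
    """Return (client_id, client_secret) from a mapping, os.environ by default."""
    env = os.environ if env is None else env
    client_id = env.get("FOURSQUARE_CLIENT_ID")          # hypothetical variable name
    client_secret = env.get("FOURSQUARE_CLIENT_SECRET")  # hypothetical variable name
    if not client_id or not client_secret:
        raise RuntimeError("Foursquare credentials are not set in the environment")
    return client_id, client_secret

# Example with an explicit mapping so the sketch runs without real secrets.
example_id, example_secret = load_foursquare_credentials({
    "FOURSQUARE_CLIENT_ID": "my-id",
    "FOURSQUARE_CLIENT_SECRET": "my-secret",
})
```

In the notebook this would be called without arguments, and the secret would never need to be printed.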


#### Explore the first neighborhood in our dataframe.

Get the neighborhood's name.

In [16]:
neighbourhood_df.loc[0, 'Neighbourhood']

'University of Toronto, Harbord'

Get the neighborhood's latitude and longitude values.

In [17]:
neighbourhood_latitude = neighbourhood_df.loc[0, 'Latitude'] # neighbourhood latitude value
neighbourhood_longitude = neighbourhood_df.loc[0, 'Longitude'] # neighbourhood longitude value

neighbourhood_name = neighbourhood_df.loc[0, 'Neighbourhood'] # neighbourhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighbourhood_name, 
                                                               neighbourhood_latitude, 
                                                               neighbourhood_longitude))

Latitude and longitude values of University of Toronto, Harbord are 43.6626956, -79.4000493.


#### Get the top 100 venues that are in University of Toronto, Harbord within a radius of 500 meters.

In [18]:
LIMIT = 100
radius = 500
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighbourhood_latitude, 
    neighbourhood_longitude, 
    radius, 
    LIMIT)
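
The query string can also be assembled from a dictionary, which avoids positional mistakes in a long `format` call. This is a minimal sketch using only the standard library; `build_explore_url` is a helper name introduced here, and the endpoint and parameter names are the ones from the query above.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Build the venues/explore URL from named parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    # urlencode percent-encodes reserved characters, e.g. the comma in "ll".
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)

example_url = build_explore_url("ID", "SECRET", "20180605", 43.6626956, -79.4000493, 500, 100)
print(example_url)
```

With `requests`, the same dictionary could instead be passed through the `params` argument of `requests.get`, which performs the encoding itself.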

In [19]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5f36792f022b1022e23a4640'},
 'response': {'headerLocation': 'University of Toronto',
  'headerFullLocation': 'University of Toronto, Toronto',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 36,
  'suggestedBounds': {'ne': {'lat': 43.6671956045, 'lng': -79.39384042790832},
   'sw': {'lat': 43.6581955955, 'lng': -79.4062581720917}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '5362c366498e602fbe1db395',
       'name': 'Yasu',
       'location': {'address': '81 Harbord St.',
        'lat': 43.66283719650635,
        'lng': -79.40321739973975,
        'labeledLatLngs': [{'label': 'display',
          'lat': 43.66283719650635,
          'lng': -79.40321739973975}],
        'distance': 255,
        'postalCode': 'M5S 1G

In [20]:
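# function that extracts the category of the venue
def get_category_type(row):
    # the key is 'categories' for a raw venue dict and
    # 'venue.categories' for a row of the flattened dataframe
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']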
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Clean the json and structure it into a *pandas* dataframe.

In [21]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

| | name | categories | lat | lng |
|---|---|---|---|---|
| 0 | Yasu | Japanese Restaurant | 43.662837 | -79.403217 |
| 1 | Rasa | Restaurant | 43.662757 | -79.403988 |
| 2 | The Dessert Kitchen | Dessert Shop | 43.662823 | -79.402746 |
| 3 | Almond Butterfly | Bakery | 43.662836 | -79.403365 |
| 4 | Her Father's Cider Bar + Kitchen | Beer Bar | 43.662448 | -79.404703 |


Check how many venues were returned by Foursquare.

In [22]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

36 venues were returned by Foursquare.


## 4. Explore Neighborhoods in Toronto

#### Create a function to repeat the same process for all the neighbourhoods in Toronto

In [23]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
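
The list comprehension in `getNearbyVenues` indexes `categories[0]` directly, which would raise an `IndexError` for a venue that has no category. A more defensive way to extract the rows is sketched below; `extract_venue_rows` is a helper name introduced for this example, and the second item in the sample data is made up to exercise the empty-category case.

```python
def extract_venue_rows(name, lat, lng, items):
    """Turn Foursquare 'items' into tuples, tolerating missing fields."""
    rows = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        categories = venue.get("categories") or []
        # Use None when a venue has no category instead of failing.
        category = categories[0].get("name") if categories else None
        rows.append((name, lat, lng, venue.get("name"),
                     location.get("lat"), location.get("lng"), category))
    return rows

sample_items = [
    {"venue": {"name": "Yasu",
               "location": {"lat": 43.66283719650635, "lng": -79.40321739973975},
               "categories": [{"name": "Japanese Restaurant"}]}},
    # Invented venue with an empty category list.
    {"venue": {"name": "Uncategorised Place",
               "location": {"lat": 43.66, "lng": -79.40},
               "categories": []}},
]
sample_rows = extract_venue_rows("University of Toronto, Harbord", 43.662696, -79.400049, sample_items)
```

Rows with a `None` category can then be dropped or kept explicitly, depending on the analysis.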

#### Use the above function on each neighborhood and create a new dataframe called *toronto_venues*.

In [24]:
toronto_venues = getNearbyVenues(names=neighbourhood_df['Neighbourhood'],
                                   latitudes=neighbourhood_df['Latitude'],
                                   longitudes=neighbourhood_df['Longitude']
                                  )

University of Toronto, Harbord
Queen's Park, Ontario Provincial Government
The Annex, North Midtown, Yorkville
Central Bay Street
Kensington Market, Chinatown, Grange Park
Church and Wellesley
Garden District, Ryerson
Richmond, Adelaide, King
Christie
First Canadian Place, Underground city
Commerce Court, Victoria Hotel
Toronto Dominion Centre, Design Exchange
St. James Town
Rosedale
Little Portugal, Trinity
Summerhill West, Rathnelly, South Hill, Forest Hill SE, Deer Park
St. James Town, Cabbagetown
Stn A PO Boxes
Harbourfront East, Union Station, Toronto Islands
Berczy Park
Moore Park, Summerhill East
Regent Park, Harbourfront
Dufferin, Dovercourt Village
Forest Hill North & West, Forest Hill Road Park
Brockton, Parkdale Village, Exhibition Place
CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport
Humewood-Cedarvale
The Danforth West, Riverdale
Davisville
Studio District
Parkdale, Roncesvalles
Caledonia-Fairbanks
High Park, The J

#### Check the size of the resulting dataframe

In [25]:
print(toronto_venues.shape)
toronto_venues.head()

(2124, 7)


| | Neighbourhood | Neighbourhood Latitude | Neighbourhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | University of Toronto, Harbord | 43.662696 | -79.400049 | Yasu | 43.662837 | -79.403217 | Japanese Restaurant |
| 1 | University of Toronto, Harbord | 43.662696 | -79.400049 | Rasa | 43.662757 | -79.403988 | Restaurant |
| 2 | University of Toronto, Harbord | 43.662696 | -79.400049 | The Dessert Kitchen | 43.662823 | -79.402746 | Dessert Shop |
| 3 | University of Toronto, Harbord | 43.662696 | -79.400049 | Almond Butterfly | 43.662836 | -79.403365 | Bakery |
| 4 | University of Toronto, Harbord | 43.662696 | -79.400049 | Her Father's Cider Bar + Kitchen | 43.662448 | -79.404703 | Beer Bar |


Check how many venues were returned for each neighborhood

In [26]:
toronto_venues.groupby('Neighbourhood')['Venue'].count()

Neighbourhood
Agincourt                                           4
Alderwood, Long Branch                              6
Bathurst Manor, Wilson Heights, Downsview North    22
Bayview Village                                     4
Bedford Park, Lawrence Manor East                  25
                                                   ..
Willowdale, Willowdale East                        34
Willowdale, Willowdale West                         5
Woburn                                              3
Woodbine Heights                                    6
York Mills West                                     2
Name: Venue, Length: 95, dtype: int64

#### Check how many unique categories can be curated from all the returned venues

In [27]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 269 unique categories.


## 5. Analyze Neighbourhoods

In [28]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighbourhood column back to dataframe
toronto_onehot['Neighbourhood'] = toronto_venues['Neighbourhood'] 

# move neighbourhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

| | Neighbourhood | Accessories Store | Afghan Restaurant | Airport | Airport Food Court | Airport Gate | Airport Lounge | Airport Service | Airport Terminal | American Restaurant | ... | Turkish Restaurant | Vegetarian / Vegan Restaurant | Video Game Store | Vietnamese Restaurant | Warehouse Store | Wine Bar | Wine Shop | Wings Joint | Women's Store | Yoga Studio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | University of Toronto, Harbord | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | University of Toronto, Harbord | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | University of Toronto, Harbord | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | University of Toronto, Harbord | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | University of Toronto, Harbord | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


In [29]:
toronto_onehot.shape

(2124, 270)

#### Group rows by neighbourhood and sum the occurrences of each category

In [30]:
toronto_grouped = toronto_onehot.groupby('Neighbourhood').sum().reset_index()
toronto_grouped

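| | Neighbourhood | Accessories Store | Afghan Restaurant | Airport | Airport Food Court | Airport Gate | Airport Lounge | Airport Service | Airport Terminal | American Restaurant | ... | Turkish Restaurant | Vegetarian / Vegan Restaurant | Video Game Store | Vietnamese Restaurant | Warehouse Store | Wine Bar | Wine Shop | Wings Joint | Women's Store | Yoga Studio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Agincourt | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | Alderwood, Long Branch | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | Bathurst Manor, Wilson Heights, Downsview North | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | Bayview Village | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | Bedford Park, Lawrence Manor East | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 90 | Willowdale, Willowdale East | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 91 | Willowdale, Willowdale West | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 92 | Woburn | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 93 | Woodbine Heights | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |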


In [31]:
toronto_grouped.shape

(95, 270)
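
The one-hot encoding and grouping steps are easier to follow on a tiny example. The sketch below uses four invented venues in two placeholder neighbourhoods, "A" and "B". Passing `dtype=int` is a choice made here so that the dummy columns are integers regardless of the pandas version, since recent releases return booleans by default.

```python
import pandas as pd

# Four made-up venues in two placeholder neighbourhoods.
venues = pd.DataFrame({
    "Neighbourhood": ["A", "A", "A", "B"],
    "Venue Category": ["Coffee Shop", "Coffee Shop", "Bakery", "Bakery"],
})

# One column per category, one row per venue.
onehot = pd.get_dummies(venues["Venue Category"], dtype=int)
onehot.insert(0, "Neighbourhood", venues["Neighbourhood"])

# Summing per neighbourhood turns the indicators into venue counts.
counts = onehot.groupby("Neighbourhood").sum().reset_index()

# Keep only neighbourhoods with at least one coffee shop.
with_coffee = counts[counts["Coffee Shop"] > 0]
print(counts)
```

Neighbourhood "A" ends up with two coffee shops and one bakery, while "B" has no coffee shop and is filtered out.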

#### Drop neighbourhoods that do not have a coffee shop

In [32]:
toronto_grouped = toronto_grouped[toronto_grouped['Coffee Shop'] > 0]

In [33]:
toronto_grouped.shape

(47, 270)

In [34]:
neighbourhood_df = neighbourhood_df.merge(toronto_grouped, left_on='Neighbourhood', right_on='Neighbourhood')

#### Save top 5 neighbourhoods that meet John's criteria and visualize them in a map

In [35]:
recommendations = neighbourhood_df.head()

In [36]:
recommendations

| | Postal Code | Borough | Neighbourhood | Latitude | Longitude | Distance | Accessories Store | Afghan Restaurant | Airport | Airport Food Court | ... | Turkish Restaurant | Vegetarian / Vegan Restaurant | Video Game Store | Vietnamese Restaurant | Warehouse Store | Wine Bar | Wine Shop | Wings Joint | Women's Store | Yoga Studio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | M5S | Downtown Toronto | University of Toronto, Harbord | 43.662696 | -79.400049 | 0.191289 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 | 0.766481 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | M5R | Central Toronto | The Annex, North Midtown, Yorkville | 43.67271 | -79.405678 | 1.111517 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | M5G | Downtown Toronto | Central Bay Street | 43.657952 | -79.387383 | 1.13788 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 4 | M5T | Downtown Toronto | Kensington Market, Chinatown, Grange Park | 43.653206 | -79.400049 | 1.21507 | 0 | 0 | 0 | 0 | ... | 0 | 3 | 0 | 3 | 0 | 1 | 0 | 0 | 0 | 0 |


In [37]:
# create map of Toronto using latitude and longitude values
map_recommendation = folium.Map(location=[uni_latitude, uni_longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighbourhood in zip(recommendations['Latitude'], recommendations['Longitude'], recommendations['Borough'], recommendations['Neighbourhood']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_recommendation)  
    
map_recommendation

## 6. Conclusion

The neighbourhoods recommended to John are:

In [38]:
recommendations[['Postal Code', 'Borough', 'Neighbourhood', 'Latitude', 'Longitude', 'Distance', 'Coffee Shop']]

| | Postal Code | Borough | Neighbourhood | Latitude | Longitude | Distance | Coffee Shop |
|---|---|---|---|---|---|---|---|
| 0 | M5S | Downtown Toronto | University of Toronto, Harbord | 43.662696 | -79.400049 | 0.191289 | 1 |
| 1 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 | 0.766481 | 9 |
| 2 | M5R | Central Toronto | The Annex, North Midtown, Yorkville | 43.67271 | -79.405678 | 1.111517 | 2 |
| 3 | M5G | Downtown Toronto | Central Bay Street | 43.657952 | -79.387383 | 1.13788 | 11 |
| 4 | M5T | Downtown Toronto | Kensington Market, Chinatown, Grange Park | 43.653206 | -79.400049 | 1.21507 | 4 |
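
The five recommendations are ordered by distance only. Since John also cares about the number of coffee shops, the same five rows could be re-ordered by coffee shop count, with distance as a tie-breaker. The sketch below rebuilds the table by hand from the values shown above; it is an optional second view, not the ranking the notebook itself produces.

```python
import pandas as pd

# Values copied from the recommendation table.
top_five = pd.DataFrame({
    "Neighbourhood": [
        "University of Toronto, Harbord",
        "Queen's Park, Ontario Provincial Government",
        "The Annex, North Midtown, Yorkville",
        "Central Bay Street",
        "Kensington Market, Chinatown, Grange Park",
    ],
    "Distance": [0.191289, 0.766481, 1.111517, 1.13788, 1.21507],
    "Coffee Shop": [1, 9, 2, 11, 4],
})

# Most coffee shops first; the closer neighbourhood wins a tie.
by_coffee = top_five.sort_values(
    ["Coffee Shop", "Distance"], ascending=[False, True]
).reset_index(drop=True)
print(by_coffee)
```

Under this ordering Central Bay Street (11 coffee shops, about 1.14 km from campus) comes first, while the neighbourhood closest to campus comes last with a single coffee shop.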
