## SOFE3720 | Final Project - Neighbourhoods in Toronto

## Table of Contents
* [Introduction](#introduction)
    * [Background](#background)
    * [Business Problem](#businessproblem)
* [Methodology](#methodology)
* [Data Used](#data)



## Introduction <a name="introduction"></a>

**1.1. Background** <a name="background"></a>

**1.2. Business Problem** <a name="businessproblem"></a>


## Methodology <a name="methodology"></a>


## Data Used <a name="data"></a>

### Importing Libraries

In [1]:
!wget -O GeoSpatial_Data https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs_v1/Geospatial_Coordinates.csv
!wget -O Crime_Data https://opendata.arcgis.com/datasets/af500b5abb7240399853b35a2362d0c0_0.csv

--2022-04-11 00:29:05--  https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs_v1/Geospatial_Coordinates.csv
Resolving cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)... 169.45.118.108
Connecting to cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)|169.45.118.108|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2788 (2.7K) [text/csv]
Saving to: 'GeoSpatial_Data'

     0K ..                                                    100%  926M=0s

2022-04-11 00:29:05 (926 MB/s) - 'GeoSpatial_Data' saved [2788/2788]

--2022-04-11 00:29:05--  https://opendata.arcgis.com/datasets/af500b5abb7240399853b35a2362d0c0_0.csv
Resolving opendata.arcgis.com (opendata.arcgis.com)... 54.205.192.36, 52.2.134.244, 54.173.145.175
Connecting to opendata.arcgis.com (opendata.arcgis.co

In [2]:
import pandas as pd     # library for data analysis          
import numpy as np      # library to handle data in a vectorized manner
import folium           # library for map rendering
import requests         # library to handle HTTP requests

from bs4 import BeautifulSoup as bs     # library to parse the scraped HTML
from geopy.geocoders import Nominatim   # Module to convert an address into latitude and longitude values

print("Libraries imported.")

Libraries imported.


### Extract Postal Codes, Borough, and Neighbourhood

In [3]:
# Request the page content from the Wikipedia URL
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
html_table_data = requests.get(url).text 
soup = bs(html_table_data, 'html5lib')

# Collect one record per postal code with the fields (PostalCode, Borough, Neighbourhood)
records = []
# Scrape the Wikipedia page for the rows in the table
tb_rows = soup.find('table').tbody.find_all('tr')

# Filter the scraped data, skipping postal codes that are not assigned
for row in tb_rows :
    for column in row.find_all('td') :
        if column.span.text != 'Not assigned' :
            span  = column.span.text.split('(')
            records.append({'PostalCode' : column.b.text,
                            'Borough' : span[0],
                            'Neighbourhood' : span[1][:-1]})

# Build the dataframe in one step; DataFrame.append was removed in pandas 2.0
df = pd.DataFrame(records, columns = ['PostalCode','Borough','Neighbourhood'])

# Clean up borough names that were fused with neighbouring text during scraping
df['Borough'] = df['Borough'].replace({
    'Downtown TorontoStn A PO Boxes25 The Esplanade' : 'Downtown Toronto Stn A',
    'East TorontoBusiness reply mail Processing Centre969 Eastern' : 'East Toronto Business',
    'EtobicokeNorthwest' : 'Etobicoke Northwest',
    'East YorkEast Toronto' : 'East York/East Toronto',
    'MississaugaCanada Post Gateway Processing Centre' : 'Mississauga'})

# Sort dataframe by PostalCode and reset to default indexing
df = df.sort_values('PostalCode').reset_index(drop = True)
df.shape    # shape/size of dataframe


(103, 3)
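
The filtering loop above assumes every assigned cell reads `Borough(Neighbourhood / ...)`; `span[1]` raises an `IndexError` if a cell has no parenthesis. The following is a minimal sketch of the same parsing step as a standalone helper that tolerates such cells. `parse_cell` is a hypothetical name that does not appear in the notebook, and the sample strings are modelled on the output shown here rather than taken from the live page.

```python
def parse_cell(text):
    """Split 'Borough(Neigh A / Neigh B)' into (borough, neighbourhoods).

    Returns None for 'Not assigned' cells. Hypothetical helper, not part
    of the original notebook.
    """
    text = text.strip()
    if text == 'Not assigned':
        return None
    # partition never raises: `rest` is simply empty when there is no '('
    borough, _, rest = text.partition('(')
    neighbourhoods = rest.rstrip(')').strip()
    return borough.strip(), neighbourhoods


samples = [
    'Scarborough(Malvern / Rouge)',
    'Not assigned',
    'Scarborough(Woburn)',
]
parsed = [parse_cell(s) for s in samples]
print(parsed)
```

Keeping the parsing in one function makes it easy to check against a few sample strings before running it over the whole table.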

In [4]:
df.head()   # print the first 5 in df

|   | PostalCode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Malvern / Rouge |
| 1 | M1C | Scarborough | Rouge Hill / Port Union / Highland Creek |
| 2 | M1E | Scarborough | Guildwood / Morningside / West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |


### Extracting Latitude and Longitude 

In [5]:
geospatial_data = pd.read_csv('GeoSpatial_Data')                    # Read from the csv file
geospatial_data.columns = ['PostalCode', 'Latitude', 'Longitude']   # Set the columns
geospatial_data.shape    # shape/size of dataframe


(103, 3)

In [6]:
geospatial_data.head()   # print the first 5 in df

|   | PostalCode | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


### Merging Dataframes based on Postal Code

In [7]:
# Join both dataframes on PostalCode
df = df.join(geospatial_data.set_index('PostalCode'), on = 'PostalCode')

# Split entries that list several neighbourhoods so each neighbourhood gets its own row
df = df.assign(Neighbourhood=df.Neighbourhood.str.split(" / ")).explode('Neighbourhood')
df.head() # print the first 5 rows in df


|   | PostalCode | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Malvern | 43.806686 | -79.194353 |
| 0 | M1B | Scarborough | Rouge | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Rouge Hill | 43.784535 | -79.160497 |
| 1 | M1C | Scarborough | Port Union | 43.784535 | -79.160497 |
| 1 | M1C | Scarborough | Highland Creek | 43.784535 | -79.160497 |
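
In the output above, rows that came from the same postal code share an index label (0, 0, 1, 1, 1) because `explode` repeats the original index. The sketch below reproduces the join-and-split step on toy data and resets the index afterwards. The two small dataframes are stand-ins built from values shown in this notebook, and using `merge` in place of `join` is a choice made for this sketch only.

```python
import pandas as pd

# Toy stand-ins for the scraped table and the coordinates file
areas = pd.DataFrame({
    'PostalCode': ['M1B', 'M1G'],
    'Borough': ['Scarborough', 'Scarborough'],
    'Neighbourhood': ['Malvern / Rouge', 'Woburn'],
})
coords = pd.DataFrame({
    'PostalCode': ['M1B', 'M1G'],
    'Latitude': [43.806686, 43.770992],
    'Longitude': [-79.194353, -79.216917],
})

# Attach coordinates, then give each neighbourhood its own row
merged = areas.merge(coords, on='PostalCode', how='left')
exploded = (
    merged.assign(Neighbourhood=merged['Neighbourhood'].str.split(' / '))
          .explode('Neighbourhood')
          .reset_index(drop=True)   # give every row a unique index label
)
print(exploded)
```

Every neighbourhood inherits the coordinates of its postal code, so neighbourhoods that share a code end up at the same point on the map.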


### Create Clustered Map of Toronto Neighbourhoods

In [8]:
df.Borough.value_counts()      # count neighbourhood rows per Borough, most frequently occurring first

Etobicoke                 44
Scarborough               38
North York                36
Downtown Toronto          35
Central Toronto           16
West Toronto              13
Etobicoke Northwest        9
York                       8
East Toronto               6
East York                  5
East York/East Toronto     1
Downtown Toronto Stn A     1
Queen's Park               1
Mississauga                1
East Toronto Business      1
Name: Borough, dtype: int64

In [9]:
# Use geopy library to get the latitude and longitude values of Toronto city
address = 'Toronto, Ontario'
geolocator = Nominatim(user_agent = 'ny_explorer')
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
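
`geocode` returns `None` when the address cannot be resolved, in which case `location.latitude` would raise an `AttributeError`. The sketch below shows one way to guard against that. `resolve_centre` and `TORONTO_FALLBACK` are hypothetical names, the fallback coordinates are an approximate assumption for central Toronto, and the geocoder is passed in as a plain callable so that the example runs without network access.

```python
from types import SimpleNamespace


def resolve_centre(geocode, address, fallback):
    """Return (lat, lon) from geocode(address), or fallback if the lookup fails."""
    try:
        location = geocode(address)
    except Exception:
        # Broad catch for the sketch; with geopy, service errors live in geopy.exc
        location = None
    if location is None:
        return fallback
    return location.latitude, location.longitude


# Approximate coordinates of central Toronto (assumption for this sketch)
TORONTO_FALLBACK = (43.65, -79.38)

# Stand-in geocoders so the example runs offline
latitude, longitude = resolve_centre(lambda address: None,
                                     'Toronto, Ontario', TORONTO_FALLBACK)
found = resolve_centre(lambda address: SimpleNamespace(latitude=43.7, longitude=-79.4),
                       'Toronto, Ontario', TORONTO_FALLBACK)
print(latitude, longitude, found)
```

In the notebook, `geolocator.geocode` would be passed as the first argument.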

In [10]:
# Array of Toronto boroughs
borough_array = ['North York', 'York', 'East York', 'Downtown Toronto', 'Central Toronto',
                 'West Toronto', 'East Toronto', 'Downtown Toronto Stn A', 'East Toronto Business',
                 'East York/East Toronto', 'Scarborough', 'Etobicoke', 'Etobicoke Northwest',
                 "Queen's Park", 'Mississauga']

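# Work on a copy so the merged dataframe stays untouched.
# (The previous nested loop iterated over the characters of each borough name
# and replaced nothing, so it has been dropped.)
df1 = df.copy()

# One marker colour per borough; np.empty(15, dtype = str) can only hold a
# single character per element, so build the array with np.full instead
colors_array = np.full(len(borough_array), 'blue')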

# Create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# Add markers to map
for borough, color in zip(borough_array, colors_array) :
    df2 = df1[df1.Borough == str(borough)]
    for lat, lng, borough, neighborhood in zip(df2['Latitude'], df2['Longitude'], df2['Borough'], df2['Neighbourhood']):
        label = '{}, {}'.format(neighborhood, borough)
        label = folium.Popup(label, parse_html=True)
        folium.CircleMarker(
            [lat, lng],
            radius = 5,
            popup = label,
            color = 'blue',
            fill = True,
            fill_color = 'blue',
            fill_opacity = 1,
            parse_html = False).add_to(map_toronto)  
    
map_toronto
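
Every marker above is drawn in blue, so boroughs cannot be told apart on the map. If distinct colours are wanted, one option is a borough-to-colour lookup. The sketch below uses only the standard library; the palette is an illustrative assumption, and the borough list is shortened for the example.

```python
from itertools import cycle

# Shortened borough list and an illustrative palette of CSS colour names
boroughs = ['North York', 'York', 'East York', 'Scarborough', 'Etobicoke']
palette = ['blue', 'green', 'red']

# cycle() repeats the palette when there are more boroughs than colours
borough_colours = dict(zip(boroughs, cycle(palette)))
print(borough_colours)
```

Inside the marker loop, `color = borough_colours[borough]` and `fill_color = borough_colours[borough]` would then replace the hard-coded `'blue'`.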


### Types of Crime Rates Based on Reported Locations

In [11]:
crime_data = pd.read_csv('Crime_Data')          # Read from the csv data file
# Keep only the columns needed for the analysis
crime_data = crime_data[["Neighbourhood", "Population", "Assault_Rate_2019", "AutoTheft_Rate_2019", "BreakandEnter_Rate_2019", "Homicide_Rate_2019", "Robbery_Rate_2019", "TheftOver_Rate_2019", "Shape__Area"]]

# Merge data based on Neighbourhood (inner join: rows without a matching name are dropped)
df = df.merge(crime_data, on = 'Neighbourhood')

df.head() # print the first 5 rows in df

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude,Population,Assault_Rate_2019,AutoTheft_Rate_2019,BreakandEnter_Rate_2019,Homicide_Rate_2019,Robbery_Rate_2019,TheftOver_Rate_2019,Shape__Area
0,M1B,Scarborough,Malvern,43.806686,-79.194353,43794,760.4,162.1,100.5,2.3,105.0,22.8,8866244.0
1,M1B,Scarborough,Rouge,43.806686,-79.194353,46496,391.4,187.1,126.9,0.0,68.8,28.0,37534490.0
2,M1C,Scarborough,Highland Creek,43.784535,-79.160497,12494,464.2,216.1,264.1,8.0,72.0,8.0,5248058.0
3,M1E,Scarborough,Guildwood,43.763573,-79.188711,9917,282.3,30.3,100.8,10.1,70.6,10.1,3804331.0
4,M1E,Scarborough,Morningside,43.763573,-79.188711,17455,1048.4,45.8,97.4,5.7,114.6,22.9,5740138.0
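
The merge is an inner join on the neighbourhood name, so rows whose names do not appear in the crime data drop out silently; in the output above, Rouge Hill and Port Union are no longer listed under M1C. A minimal sketch for listing the names that fail to match, using toy data built from values in this notebook:

```python
import pandas as pd

# Toy stand-ins for the neighbourhood table and the crime table
areas = pd.DataFrame({'Neighbourhood': ['Malvern', 'Rouge', 'Port Union']})
crime = pd.DataFrame({
    'Neighbourhood': ['Malvern', 'Rouge'],
    'Assault_Rate_2019': [760.4, 391.4],
})

# A left join with indicator=True adds a _merge column describing each row's origin
checked = areas.merge(crime, on='Neighbourhood', how='left', indicator=True)
unmatched = checked.loc[checked['_merge'] == 'left_only', 'Neighbourhood'].tolist()
print(unmatched)
```

Reviewing the unmatched names shows whether the mismatch comes from spelling differences between the two sources or from neighbourhoods that the crime dataset does not cover.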


### Neighbourhood Profiles

In [12]:
# Get the dataset metadata by passing the package id to the package_show endpoint
# For example, to retrieve the metadata for this dataset:
url = "https://ckan0.cf.opendata.inter.prod-toronto.ca/api/3/action/package_show"
params = { "id": "6e19a90f-971c-46b3-852c-0c48c436d1fc"}
package = requests.get(url, params = params).json()
# print(package["result"])

# Get the data by passing the resource_id to the datastore_search endpoint
# For example, to retrieve the data content for the first resource in the datastore:
for idx, resource in enumerate(package["result"]["resources"]):
    if resource["datastore_active"]:
        url = "https://ckan0.cf.opendata.inter.prod-toronto.ca/api/3/action/datastore_search"
        p = { "id": resource["id"] }
        data = requests.get(url, params = p).json()
        df = pd.DataFrame(data["result"]["records"])
        break

# df = df.transpose()
# df = df.drop(labels=["_id", "Topic", "Category", "Data Source"], axis=0)
df.head()

Unnamed: 0,_id,Category,Topic,Data Source,Characteristic,City of Toronto,Agincourt North,Agincourt South-Malvern West,Alderwood,Annex,...,Willowdale West,Willowridge-Martingrove-Richview,Woburn,Woodbine Corridor,Woodbine-Lumsden,Wychwood,Yonge-Eglinton,Yonge-St.Clair,York University Heights,Yorkdale-Glen Park
0,1,Neighbourhood Information,Neighbourhood Information,City of Toronto,Neighbourhood Number,,129,128,20,95,...,37,7,137,64,60,94,100,97,27,31
1,2,Neighbourhood Information,Neighbourhood Information,City of Toronto,TSNS2020 Designation,,No Designation,No Designation,No Designation,No Designation,...,No Designation,No Designation,NIA,No Designation,No Designation,No Designation,No Designation,No Designation,NIA,Emerging Neighbourhood
2,3,Population,Population and dwellings,Census Profile 98-316-X2016001,"Population, 2016",2731571,29113,23757,12054,30526,...,16936,22156,53485,12541,7865,14349,11817,12528,27593,14804
3,4,Population,Population and dwellings,Census Profile 98-316-X2016001,"Population, 2011",2615060,30279,21988,11904,29177,...,15004,21343,53350,11703,7826,13986,10578,11652,27713,14687
4,5,Population,Population and dwellings,Census Profile 98-316-X2016001,Population Change 2011-2016,4.50%,-3.90%,8.00%,1.30%,4.60%,...,12.90%,3.80%,0.30%,7.20%,0.50%,2.60%,11.70%,7.50%,-0.40%,0.80%
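
CKAN's `datastore_search` returns a limited number of rows per call (100 by default, as far as I know), so the dataframe above likely holds only the first part of the profile table. The sketch below shows a paging loop built on the `limit` and `offset` parameters. `fetch_all_records` and `fake_fetch_page` are hypothetical names, and the fake page function stands in for the HTTP request so the example runs offline.

```python
def fetch_all_records(fetch_page, page_size=100):
    """Collect records page by page until a short (or empty) page arrives."""
    records = []
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        records.extend(page)
        if len(page) < page_size:
            break
        offset += page_size
    return records


# Offline stand-in for the API: 250 fake rows served in slices
FAKE_ROWS = [{'_id': i} for i in range(1, 251)]

def fake_fetch_page(offset, limit):
    return FAKE_ROWS[offset:offset + limit]


all_rows = fetch_all_records(fake_fetch_page, page_size=100)
print(len(all_rows))

# Against the real endpoint, fetch_page could look like this (untested sketch):
# def ckan_fetch_page(offset, limit):
#     p = {"id": resource["id"], "limit": limit, "offset": offset}
#     return requests.get(url, params=p).json()["result"]["records"]
```

Passing the page function in as an argument keeps the loop testable without touching the network.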


### Population Based on Age

In [13]:
# Obtain the row numbers for the age-group characteristics so they can be extracted from the dataframe
age_groups = ['Children (0-14 years)', 'Youth (15-24 years)', 'Working Age (25-54 years)', 'Pre-retirement (55-64 years)', 'Seniors (65+ years)', 'Older Seniors (85+ years)']
df.index[df['Characteristic'].isin(age_groups)].tolist()

[9, 10, 11, 12, 13, 14]

In [37]:
# Slice the profile dataframe to obtain the neighbourhood numbers (row 0) and the age-group populations (rows 9-14) per Neighbourhood
pop_data=df.iloc[[0,9,10,11,12,13,14], 4:]
pop_data.head(7)

Unnamed: 0,Characteristic,City of Toronto,Agincourt North,Agincourt South-Malvern West,Alderwood,Annex,Banbury-Don Mills,Bathurst Manor,Bay Street Corridor,Bayview Village,...,Willowdale West,Willowridge-Martingrove-Richview,Woburn,Woodbine Corridor,Woodbine-Lumsden,Wychwood,Yonge-Eglinton,Yonge-St.Clair,York University Heights,Yorkdale-Glen Park
0,Neighbourhood Number,,129,128,20,95,42,34,76,52,...,37,7,137,64,60,94,100,97,27,31
9,Children (0-14 years),398135.0,3840,3075,1760,2360,3605,2325,1695,2415,...,1785,3555,9625,2325,1165,1860,1800,1210,4045,1960
10,Youth (15-24 years),340270.0,3705,3360,1235,3750,2730,1940,6860,2505,...,2230,2625,7660,1035,675,1320,1225,920,4750,1870
11,Working Age (25-54 years),1229555.0,11305,9965,5220,15040,10810,6655,13065,10310,...,7480,8140,21945,6165,3790,6420,5860,5960,12290,5860
12,Pre-retirement (55-64 years),336670.0,4230,3265,1825,3480,3555,2030,1760,2540,...,2070,2905,6245,1625,1150,1595,1325,1540,2965,1810
13,Seniors (65+ years),426945.0,6045,4105,2015,5910,6975,2940,2420,3615,...,3370,4905,8010,1380,1095,3150,1600,2905,3530,3295
14,Older Seniors (85+ years),66000.0,925,555,320,1040,1640,710,330,610,...,655,885,1130,170,125,880,165,470,400,775
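
The slice above depends on hard-coded row positions, which would silently pick the wrong rows if the source table were reordered. A minimal sketch of selecting the same kind of rows by label with `isin`, on a toy frame whose values are taken from the output shown in this notebook:

```python
import pandas as pd

# Toy excerpt of the profile table (one neighbourhood column only)
profile = pd.DataFrame({
    'Characteristic': ['Neighbourhood Number', 'Population, 2016',
                       'Children (0-14 years)', 'Youth (15-24 years)'],
    'Agincourt North': ['129', '29113', '3840', '3705'],
})

wanted = ['Neighbourhood Number', 'Children (0-14 years)', 'Youth (15-24 years)']

# isin() tests each value for membership in the list of labels
rows = profile.index[profile['Characteristic'].isin(wanted)].tolist()
subset = profile.loc[rows]
print(rows)
print(subset)
```

Comparing a column to a tuple with `==` does not test membership, which is why a membership test such as `isin` is needed here.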


In [38]:
# Rename the label column so it becomes the header name after transposing
pop_data.rename(columns={'Characteristic':'Neighbourhood'}, inplace=True)
# Set index and Transpose
pop_data=pop_data.set_index('Neighbourhood').T
pop_data.reset_index(inplace = True)
# Rename columns
pop_data.columns = ['Neighbourhood', 'Neighbourhood ID', 'Children (0-14 years)', 'Youth (15-24 years)', 'Working Age (25-54 years)', 'Pre-retirement (55-64 years)', 'Seniors (65+ years)', 'Older Seniors (85+ years)']
pop_data.head()

|   | Neighbourhood | Neighbourhood ID | Children (0-14 years) | Youth (15-24 years) | Working Age (25-54 years) | Pre-retirement (55-64 years) | Seniors (65+ years) | Older Seniors (85+ years) |
|---|---|---|---|---|---|---|---|---|
| 0 | City of Toronto |  | 398135 | 340270 | 1229555 | 336670 | 426945 | 66000 |
| 1 | Agincourt North | 129.0 | 3840 | 3705 | 11305 | 4230 | 6045 | 925 |
| 2 | Agincourt South-Malvern West | 128.0 | 3075 | 3360 | 9965 | 3265 | 4105 | 555 |
| 3 | Alderwood | 20.0 | 1760 | 1235 | 5220 | 1825 | 2015 | 320 |
| 4 | Annex | 95.0 | 2360 | 3750 | 15040 | 3480 | 5910 | 1040 |
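
The reshaping step turns one row per characteristic into one row per neighbourhood. Values fetched from the API may arrive as strings, so a numeric conversion is worth considering before any arithmetic. The sketch below shows both steps on a toy frame; the data are taken from the output in this notebook, and the string typing is an assumption.

```python
import pandas as pd

# Toy wide table: one row per characteristic, one column per neighbourhood
wide = pd.DataFrame({
    'Characteristic': ['Neighbourhood Number', 'Children (0-14 years)'],
    'Agincourt North': ['129', '3840'],
    'Alderwood': ['20', '1760'],
})

# Transpose so that each neighbourhood becomes a row
tidy = wide.set_index('Characteristic').T
tidy.index.name = 'Neighbourhood'
tidy.columns.name = None
tidy = tidy.reset_index().rename(columns={'Neighbourhood Number': 'Neighbourhood ID'})

# Convert everything except the name column to numbers; bad values become NaN
value_cols = [c for c in tidy.columns if c != 'Neighbourhood']
tidy[value_cols] = tidy[value_cols].apply(pd.to_numeric, errors='coerce')
print(tidy)
```

Renaming by label rather than assigning a full list to `columns` avoids mislabelling if the row order ever changes.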


### Unemployment Rate

In [60]:
# Load the full 2016 neighbourhood profiles from a local CSV file
neighbourhood_profile = pd.read_csv('Neighbourhood_Profiles - 2016.csv')

In [61]:
# Find the row number of "Unemployment rate" so it can be extracted from the dataframe
neighbourhood_profile.index[neighbourhood_profile['Characteristic'] == 'Unemployment rate'].tolist()
# Slice the profile dataframe: row 0 holds the neighbourhood numbers, row 1890 the unemployment rate
slice_neighbourhood_profile = neighbourhood_profile.iloc[[0, 1890], 4:]
slice_neighbourhood_profile.head()

Unnamed: 0,Characteristic,City of Toronto,Agincourt North,Agincourt South-Malvern West,Alderwood,Annex,Banbury-Don Mills,Bathurst Manor,Bay Street Corridor,Bayview Village,...,Willowdale West,Willowridge-Martingrove-Richview,Woburn,Woodbine Corridor,Woodbine-Lumsden,Wychwood,Yonge-Eglinton,Yonge-St.Clair,York University Heights,Yorkdale-Glen Park
0,Neighbourhood Number,,129.0,128.0,20.0,95.0,42.0,34.0,76.0,52.0,...,37.0,7.0,137.0,64.0,60.0,94.0,100.0,97.0,27.0,31
1890,Unemployment rate,8.2,9.8,9.8,6.1,6.7,7.2,7.2,10.2,7.7,...,9.8,8.5,10.6,7.7,6.6,5.2,6.9,5.9,10.7,8


In [62]:
# Drop irrelevant columns
slice_neighbourhood_profile.drop(labels='City of Toronto',axis=1, inplace=True)
slice_neighbourhood_profile.rename(columns={'Characteristic':'Neighbourhood'}, inplace=True)
# Set index and Transpose
slice_neighbourhood_profile=slice_neighbourhood_profile.set_index('Neighbourhood').T
slice_neighbourhood_profile.reset_index(inplace = True)
# Rename columns
slice_neighbourhood_profile.columns = ['Neighbourhood', 'Neighbourhood ID', 'Unemployment Rate']
# Set Neighbourhood ID and Unemployment Rate to numeric type
slice_neighbourhood_profile['Neighbourhood ID'] = pd.to_numeric(slice_neighbourhood_profile['Neighbourhood ID'])
slice_neighbourhood_profile['Unemployment Rate'] = pd.to_numeric(slice_neighbourhood_profile['Unemployment Rate'])
slice_neighbourhood_profile.head()

|   | Neighbourhood | Neighbourhood ID | Unemployment Rate |
|---|---|---|---|
| 0 | Agincourt North | 129 | 9.8 |
| 1 | Agincourt South-Malvern West | 128 | 9.8 |
| 2 | Alderwood | 20 | 6.1 |
| 3 | Annex | 95 | 6.7 |
| 4 | Banbury-Don Mills | 42 | 7.2 |


In [63]:
cluster_data = pd.merge(slice_neighbourhood_profile, crime_data, on = ['Neighbourhood'])
cluster_data.head()

Unnamed: 0,Neighbourhood,Neighbourhood ID,Unemployment Rate,Population,Assault_Rate_2019,AutoTheft_Rate_2019,BreakandEnter_Rate_2019,Homicide_Rate_2019,Robbery_Rate_2019,TheftOver_Rate_2019,Shape__Area
0,Agincourt North,129,9.8,29113,271.4,144.3,192.4,0.0,120.2,6.9,7261857.0
1,Agincourt South-Malvern West,128,9.8,23757,517.7,261.0,420.9,0.0,122.1,63.1,7873163.0
2,Alderwood,20,6.1,12054,298.7,116.1,215.7,0.0,41.5,58.1,4978488.0
3,Annex,95,6.7,30526,943.5,98.3,694.5,3.3,101.6,137.6,2790356.0
4,Banbury-Don Mills,42,7.2,27695,267.2,151.7,292.5,0.0,36.1,50.6,10041550.0


In [64]:
cluster_data = pd.merge(cluster_data, pop_data, on = ['Neighbourhood'])
cluster_data.head()

Unnamed: 0,Neighbourhood,Neighbourhood ID_x,Unemployment Rate,Population,Assault_Rate_2019,AutoTheft_Rate_2019,BreakandEnter_Rate_2019,Homicide_Rate_2019,Robbery_Rate_2019,TheftOver_Rate_2019,Shape__Area,Neighbourhood ID_y,Children (0-14 years),Youth (15-24 years),Working Age (25-54 years),Pre-retirement (55-64 years),Seniors (65+ years),Older Seniors (85+ years)
0,Agincourt North,129,9.8,29113,271.4,144.3,192.4,0.0,120.2,6.9,7261857.0,129,3840,3705,11305,4230,6045,925
1,Agincourt South-Malvern West,128,9.8,23757,517.7,261.0,420.9,0.0,122.1,63.1,7873163.0,128,3075,3360,9965,3265,4105,555
2,Alderwood,20,6.1,12054,298.7,116.1,215.7,0.0,41.5,58.1,4978488.0,20,1760,1235,5220,1825,2015,320
3,Annex,95,6.7,30526,943.5,98.3,694.5,3.3,101.6,137.6,2790356.0,95,2360,3750,15040,3480,5910,1040
4,Banbury-Don Mills,42,7.2,27695,267.2,151.7,292.5,0.0,36.1,50.6,10041550.0,42,3605,2730,10810,3555,6975,1640
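
Both inputs to the last merge carry a `Neighbourhood ID` column, which is why the result contains `Neighbourhood ID_x` and `Neighbourhood ID_y`. The sketch below shows two ways to avoid the duplicate on toy data; the values come from the output in this notebook, and neither option is claimed to be the approach the project settled on.

```python
import pandas as pd

left = pd.DataFrame({
    'Neighbourhood': ['Alderwood', 'Annex'],
    'Neighbourhood ID': [20, 95],
    'Unemployment Rate': [6.1, 6.7],
})
right = pd.DataFrame({
    'Neighbourhood': ['Alderwood', 'Annex'],
    'Neighbourhood ID': [20, 95],
    'Children (0-14 years)': [1760, 2360],
})

# Option 1: drop the repeated column from one side before merging
merged_a = left.merge(right.drop(columns='Neighbourhood ID'), on='Neighbourhood')

# Option 2: treat both shared columns as join keys (their dtypes must match)
merged_b = left.merge(right, on=['Neighbourhood', 'Neighbourhood ID'])

print(merged_a.columns.tolist())
```

Option 2 also acts as a consistency check, since rows whose IDs disagree between the two sources would be dropped from the result.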
