## SOFE3720 | Final Project - Neighbourhoods in Toronto

## Table of Contents
* [Introduction](#introduction)
    * [Background](#background)
    * [Business Problem](#businessproblem)
* [Methodology](#methodology)
* [Data Sources](#data)



## **Introduction** <a name="introduction"></a>

**1.1. Background** <a name="background"></a>

**1.2. Business Problem** <a name="businessproblem"></a>


## **Methodology** <a name="methodology"></a>


## **Data Sources** <a name="data"></a>
This report analyses the neighbourhoods of Toronto using several datasets in order to find the most 
suitable spot. The project uses the following datasets: 

1) **Neighbourhoods Dataset:** Contains neighbourhood names and their geographic coordinates (latitude 
and longitude). The coordinates serve two purposes: drawing the Toronto map and calling the Foursquare API. 

2) **Foursquare:** The Foursquare API is used to find the most common venues within a given radius of a 
geographic coordinate. 

3) **Neighbourhood Profile Toronto:** Contains profile data for each City of Toronto neighbourhood. 

4) **Neighbourhood Crime Rates:** Contains crime data by neighbourhood, including four-year averages and 
crime rates per 100,000 people based on the 2016 Census population. 

5) **Wikipedia page for more information about Toronto:** Provides the information needed to explore and 
cluster the neighbourhoods in Toronto. The page is scraped, wrangled, and cleaned, then read into a pandas 
DataFrame. 
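
The crime dataset in item 4 reports rates per 100,000 residents. As a minimal sketch of that arithmetic, assuming a raw incident count and a population figure are available (the function name `rate_per_100k` and the sample numbers are illustrative and not taken from the dataset):

```python
def rate_per_100k(incidents: float, population: int) -> float:
    """Return the number of incidents per 100,000 residents."""
    if population <= 0:
        raise ValueError("population must be positive")
    # Multiply first so whole-number inputs stay exact for as long as possible
    return incidents * 100_000 / population


# Hypothetical numbers, for illustration only
example_rate = rate_per_100k(incidents=50, population=20_000)
print(example_rate)
```

Normalising by population is what makes a small and a large neighbourhood comparable; raw counts alone would favour sparsely populated areas.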

### Importing Libraries

The following CSV files must be available locally. A simple `wget` command downloads each one and saves it under a short local name.

In [422]:
%%capture
!wget -O GeoSpatial_Data https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs_v1/Geospatial_Coordinates.csv
!wget -O Crime_Data https://opendata.arcgis.com/datasets/af500b5abb7240399853b35a2362d0c0_0.csv
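
Where `wget` is not available (for example on a machine without that tool installed), the same download can be sketched with the Python standard library. The helper name `download` is an assumption made for this example:

```python
import shutil
from pathlib import Path
from urllib.request import urlopen


def download(url: str, dest: str, timeout: float = 30.0) -> Path:
    """Stream the resource at `url` into the local file `dest` and return its path."""
    target = Path(dest)
    with urlopen(url, timeout=timeout) as response, target.open("wb") as out:
        shutil.copyfileobj(response, out)
    return target


# Intended usage with the URLs from the cell above (not executed here):
# download("https://opendata.arcgis.com/datasets/af500b5abb7240399853b35a2362d0c0_0.csv", "Crime_Data")
```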

The following libraries must be installed and imported before the code in the rest of the notebook will run.

In [423]:
import pandas as pd     # library for data analysis          
import numpy as np      # library to handle data in a vectorized manner
import folium           # library for map rendering
import requests         # library to handle request

from bs4 import BeautifulSoup as bs     
from geopy.geocoders import Nominatim   # Module to convert an address into latitude and longitude values

print("Libraries imported...")

Libraries imported...


## **1. Postal Code, Borough, Neighbourhood, Longitude, and Latitude**
### Scraping from Wikipedia page for Data
The table on the Wikipedia page lists every Toronto postal code together with its borough and neighbourhoods. It is scraped with `requests` and BeautifulSoup and inserted into a dataframe using the code below:

In [424]:
# Request the page HTML from the url
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
html_table_data = requests.get(url).text 
soup = bs(html_table_data, 'html5lib')

# Scrape the Wikipedia page for the rows in the table
tb_rows = soup.find('table').tbody.find_all('tr')       

# Keep assigned postal codes only, collecting one dict per table cell
records = []
for row in tb_rows :
    for cell in row.find_all('td') :
        if cell.span.text != 'Not assigned' :
            span = cell.span.text.split('(')
            records.append({'PostalCode' : cell.b.text,
                            'Borough' : span[0],
                            'Neighbourhood' : span[1][:-1]})

# Build the dataframe once from the collected rows (row-by-row DataFrame.append is unavailable in pandas 2.x)
df = pd.DataFrame(records, columns = ['PostalCode','Borough','Neighbourhood'])

# Replace the following name of borough
df['Borough']=df['Borough'].replace({'Downtown TorontoStn A PO Boxes25 The Esplanade':'Downtown Toronto Stn A',
                                             'East TorontoBusiness reply mail Processing Centre969 Eastern':'East Toronto Business',
                                             'EtobicokeNorthwest':'Etobicoke Northwest','East YorkEast Toronto':'East York/East Toronto',
                                             'MississaugaCanada Post Gateway Processing Centre':'Mississauga'})

# Sort dataframe by PostalCode and reset to default indexing
df = df.sort_values('PostalCode').reset_index(drop = True)
df.head()   # print the first 5 in df

# df.shape    # shape/size of dataframe


| | PostalCode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Malvern / Rouge |
| 1 | M1C | Scarborough | Rouge Hill / Port Union / Highland Creek |
| 2 | M1E | Scarborough | Guildwood / Morningside / West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
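
The scraping loop relies on each cell's text having the form `Borough(Neighbourhood / Neighbourhood)`. As a hedged sketch of that parsing step in isolation, the helper below (the name `parse_cell` and the sample strings are assumptions modelled on the table above) also guards against cells that lack a parenthesised part, which would otherwise raise an `IndexError`:

```python
def parse_cell(postal_code: str, span_text: str):
    """Split 'Borough(Neighbourhood A / Neighbourhood B)' into a row dict.

    Returns None for unassigned codes or for text without a parenthesised part.
    """
    text = span_text.strip()
    if text == "Not assigned" or "(" not in text:
        return None
    borough, _, rest = text.partition("(")
    return {
        "PostalCode": postal_code.strip(),
        "Borough": borough.strip(),
        "Neighbourhood": rest.rstrip(")").strip(),
    }


# Hypothetical cell contents in the assumed format
cells = [
    ("M1A", "Not assigned"),
    ("M1B", "Scarborough(Malvern / Rouge)"),
    ("M1G", "Scarborough(Woburn)"),
]
rows = [row for code, text in cells if (row := parse_cell(code, text)) is not None]
print(rows)
```

Collecting plain dicts in a list and constructing the dataframe once at the end is usually faster than growing a dataframe row by row.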


### Extracting Latitude and Longitude 
One of the files downloaded earlier (`GeoSpatial_Data`) is read into a separate dataframe.

In [425]:
geospatial_data = pd.read_csv('GeoSpatial_Data')                    # Read from the csv file
geospatial_data.columns = ['PostalCode', 'Latitude', 'Longitude']   # Set the columns
geospatial_data.head()   # print the first 5 in df

# geospatial_data.shape    # shape/size of dataframe


| | PostalCode | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


### Joining Dataframes based on Postal Code
Before joining the two dataframes, the first one must be cleaned so that each neighbourhood sits on its own row.

In [426]:
# Split the " / "-separated neighbourhoods and give each one its own row
df = df.assign(Neighbourhood=df.Neighbourhood.str.split(" / ")).explode('Neighbourhood')

# Join both data based on PostalCode
df = df.join(geospatial_data.set_index('PostalCode'), on = 'PostalCode')        

df.head() # print the first 5 in df


| | PostalCode | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Malvern | 43.806686 | -79.194353 |
| 0 | M1B | Scarborough | Rouge | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Rouge Hill | 43.784535 | -79.160497 |
| 1 | M1C | Scarborough | Port Union | 43.784535 | -79.160497 |
| 1 | M1C | Scarborough | Highland Creek | 43.784535 | -79.160497 |
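
The repeated index labels in the output above appear because `explode` keeps the index of the row each value came from. The sketch below reproduces the split-and-join step on a two-row toy frame (values copied from the tables above) and adds two optional safeguards that are not part of the original notebook: `reset_index(drop=True)` for unique row labels, and `validate="many_to_one"` so the merge fails loudly if a postal code appears twice in the coordinates table.

```python
import pandas as pd

codes = pd.DataFrame({
    "PostalCode": ["M1B", "M1G"],
    "Borough": ["Scarborough", "Scarborough"],
    "Neighbourhood": ["Malvern / Rouge", "Woburn"],
})
coords = pd.DataFrame({
    "PostalCode": ["M1B", "M1G"],
    "Latitude": [43.806686, 43.770992],
    "Longitude": [-79.194353, -79.216917],
})

# One row per neighbourhood; reset_index gives every row its own label
tidy = (
    codes.assign(Neighbourhood=codes["Neighbourhood"].str.split(" / "))
         .explode("Neighbourhood")
         .reset_index(drop=True)
)

# Attach coordinates; raises MergeError if `coords` has duplicate postal codes
tidy = tidy.merge(coords, on="PostalCode", how="left", validate="many_to_one")
print(tidy)
```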


## **2. Exploring Neighbourhoods on a Map**
### Create Clustered Map of Toronto Neighbourhoods

In [427]:
df.Borough.value_counts()      # count neighbourhoods per borough, most frequent first

Etobicoke                 44
Scarborough               38
North York                36
Downtown Toronto          35
Central Toronto           16
West Toronto              13
Etobicoke Northwest        9
York                       8
East Toronto               6
East York                  5
East Toronto Business      1
Queen's Park               1
East York/East Toronto     1
Downtown Toronto Stn A     1
Mississauga                1
Name: Borough, dtype: int64

In [428]:
# Use geopy library to get the latitude and longitude values of Toronto city
address = 'Toronto, Ontario'
geolocator = Nominatim(user_agent = 'ny_explorer')
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

print('Geographical Coordinates of Toronto City:')
print('Latitude: ', latitude)
print('Longitude: ', longitude)

Geographical Coordinates of Toronto City:
Latitude:  43.6534817
Longitude:  -79.3839347
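
geopy's `geocode` returns `None` when no match is found, in which case `location.latitude` would raise an `AttributeError`. A small guard can be sketched as follows; the helper name `coordinates_or_default` and the stand-in geocoders are assumptions made so the example runs without network access:

```python
from types import SimpleNamespace
from typing import Callable, Tuple


def coordinates_or_default(geocode: Callable, address: str,
                           default: Tuple[float, float]) -> Tuple[float, float]:
    """Return (latitude, longitude) for `address`, or `default` if nothing is found."""
    location = geocode(address)
    if location is None:
        return default
    return (location.latitude, location.longitude)


# Stand-in geocoders so the sketch runs offline (hypothetical, not part of geopy)
found = lambda address: SimpleNamespace(latitude=43.6534817, longitude=-79.3839347)
missing = lambda address: None

toronto = coordinates_or_default(found, 'Toronto, Ontario', default=(0.0, 0.0))
fallback = coordinates_or_default(missing, 'Nowhere', default=(43.6534817, -79.3839347))
```

In the notebook, `geolocator.geocode` would be passed as the first argument.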


In [429]:
# Toronto boroughs to plot
borough_array = ['North York', 'York', 'East York', 'Downtown Toronto', 'Central Toronto', 'West Toronto', 'East Toronto', 'Downtown Toronto Stn A', 'East Toronto Business', 'East York/East Toronto', 'Scarborough',
                 'Etobicoke', 'Etobicoke Northwest', "Queen's Park", 'Mississauga']

# Work on a copy with trimmed borough names so the lookups below match exactly
df1 = df.copy()
df1['Borough'] = df1['Borough'].str.strip()

# One marker colour per borough (all blue for now)
colors_array = ['blue'] * len(borough_array)

# Create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# Add markers to map
for borough, color in zip(borough_array, colors_array) :
    df2 = df1[df1.Borough == borough]
    for lat, lng, name, neighborhood in zip(df2['Latitude'], df2['Longitude'], df2['Borough'], df2['Neighbourhood']):
        label = '{}, {}'.format(neighborhood, name)
        label = folium.Popup(label, parse_html=True)
        folium.CircleMarker(
            [lat, lng],
            radius = 5,
            popup = label,
            color = color,
            fill = True,
            fill_color = color,
            fill_opacity = 1,
            parse_html = False).add_to(map_toronto)  
    
map_toronto
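
If each borough should get its own marker colour, one option is a name-to-colour mapping that cycles through a short palette. This is a sketch only: the function name `assign_colors` and the palette are assumptions, and any CSS colour names could be substituted.

```python
from itertools import cycle


def assign_colors(boroughs, palette=("blue", "green", "red", "purple", "orange")):
    """Map each distinct borough name to a colour, cycling through the palette."""
    unique = sorted({name.strip() for name in boroughs})
    return dict(zip(unique, cycle(palette)))


# Names with stray whitespace collapse onto the same key
colors = assign_colors(["York ", "Etobicoke", "York", "Scarborough"])
print(colors)
```

The resulting dictionary can then be indexed by borough name when building each `folium.CircleMarker`.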


### Types of Crime Rates Based on Reported Locations

In [430]:
crime_data = pd.read_csv('Crime_Data')          # Read from the csv data file
# Keep only the columns used in the analysis
crime_data = crime_data[["Neighbourhood", "Population", "Assault_Rate_2019", "AutoTheft_Rate_2019", "BreakandEnter_Rate_2019", "Homicide_Rate_2019", "Robbery_Rate_2019", "TheftOver_Rate_2019", "Shape__Area"]]

# Inner merge on Neighbourhood: rows without a matching crime record are dropped
df = df.merge(crime_data, on = 'Neighbourhood')

df.head() # print the first 5 rows in df

| | PostalCode | Borough | Neighbourhood | Latitude | Longitude | Population | Assault_Rate_2019 | AutoTheft_Rate_2019 | BreakandEnter_Rate_2019 | Homicide_Rate_2019 | Robbery_Rate_2019 | TheftOver_Rate_2019 | Shape__Area |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Malvern | 43.806686 | -79.194353 | 43794 | 760.4 | 162.1 | 100.5 | 2.3 | 105.0 | 22.8 | 8866244.0 |
| 1 | M1B | Scarborough | Rouge | 43.806686 | -79.194353 | 46496 | 391.4 | 187.1 | 126.9 | 0.0 | 68.8 | 28.0 | 37534490.0 |
| 2 | M1C | Scarborough | Highland Creek | 43.784535 | -79.160497 | 12494 | 464.2 | 216.1 | 264.1 | 8.0 | 72.0 | 8.0 | 5248058.0 |
| 3 | M1E | Scarborough | Guildwood | 43.763573 | -79.188711 | 9917 | 282.3 | 30.3 | 100.8 | 10.1 | 70.6 | 10.1 | 3804331.0 |
| 4 | M1E | Scarborough | Morningside | 43.763573 | -79.188711 | 17455 | 1048.4 | 45.8 | 97.4 | 5.7 | 114.6 | 22.9 | 5740138.0 |
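
The merged output no longer contains some neighbourhoods from the earlier table (Port Union, for instance), because a default merge keeps only names present in both frames. A left merge with `indicator=True` is one way to list what failed to match. The frames below are toy data with values copied from the tables above:

```python
import pandas as pd

neighbourhoods = pd.DataFrame({"Neighbourhood": ["Malvern", "Rouge", "Port Union"]})
crime = pd.DataFrame({
    "Neighbourhood": ["Malvern", "Rouge"],
    "Assault_Rate_2019": [760.4, 391.4],
})

# how="left" keeps every neighbourhood; the _merge column records where each row was found
checked = neighbourhoods.merge(crime, on="Neighbourhood", how="left", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "Neighbourhood"].tolist()
print(unmatched)
```

Reviewing the unmatched names before switching back to an inner merge helps distinguish genuine gaps in the crime data from spelling differences between the two sources.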


### Neighbourhood Profiles

In [431]:
# Get the dataset metadata by passing the package id to the package_show endpoint
# For example, to retrieve the metadata for this dataset:
url = "https://ckan0.cf.opendata.inter.prod-toronto.ca/api/3/action/package_show"
params = { "id": "6e19a90f-971c-46b3-852c-0c48c436d1fc"}
package = requests.get(url, params = params).json()
# print(package["result"])

# Get the data by passing the resource_id to the datastore_search endpoint
# For example, to retrieve the data content for the first resource in the datastore:
for idx, resource in enumerate(package["result"]["resources"]):
    if resource["datastore_active"]:
        url = "https://ckan0.cf.opendata.inter.prod-toronto.ca/api/3/action/datastore_search"
        p = { "id": resource["id"] }
        data = requests.get(url, params = p).json()
        df = pd.DataFrame(data["result"]["records"])
        break

# df = df.transpose()
# df = df.drop(labels=["_id", "Topic", "Category", "Data Source"], axis=0)
df.head()

| | _id | Category | Topic | Data Source | Characteristic | City of Toronto | Agincourt North | Agincourt South-Malvern West | Alderwood | Annex | ... | Willowdale West | Willowridge-Martingrove-Richview | Woburn | Woodbine Corridor | Woodbine-Lumsden | Wychwood | Yonge-Eglinton | Yonge-St.Clair | York University Heights | Yorkdale-Glen Park |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Neighbourhood Information | Neighbourhood Information | City of Toronto | Neighbourhood Number | | 129 | 128 | 20 | 95 | ... | 37 | 7 | 137 | 64 | 60 | 94 | 100 | 97 | 27 | 31 |
| 1 | 2 | Neighbourhood Information | Neighbourhood Information | City of Toronto | TSNS2020 Designation | | No Designation | No Designation | No Designation | No Designation | ... | No Designation | No Designation | NIA | No Designation | No Designation | No Designation | No Designation | No Designation | NIA | Emerging Neighbourhood |
| 2 | 3 | Population | Population and dwellings | Census Profile 98-316-X2016001 | Population, 2016 | 2731571 | 29113 | 23757 | 12054 | 30526 | ... | 16936 | 22156 | 53485 | 12541 | 7865 | 14349 | 11817 | 12528 | 27593 | 14804 |
| 3 | 4 | Population | Population and dwellings | Census Profile 98-316-X2016001 | Population, 2011 | 2615060 | 30279 | 21988 | 11904 | 29177 | ... | 15004 | 21343 | 53350 | 11703 | 7826 | 13986 | 10578 | 11652 | 27713 | 14687 |
| 4 | 5 | Population | Population and dwellings | Census Profile 98-316-X2016001 | Population Change 2011-2016 | 4.50% | -3.90% | 8.00% | 1.30% | 4.60% | ... | 12.90% | 3.80% | 0.30% | 7.20% | 0.50% | 2.60% | 11.70% | 7.50% | -0.40% | 0.80% |
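
CKAN's `datastore_search` returns at most 100 records per call by default, so a single request may not return the whole dataset. The sketch below pages through a resource with the `limit` and `offset` parameters. The function name `fetch_all_records` is an assumption, and the HTTP call is passed in as a callable so the example can run against a stand-in instead of the live API:

```python
from typing import Callable, Dict, List


def fetch_all_records(search: Callable[[Dict], Dict], resource_id: str,
                      page_size: int = 1000) -> List[Dict]:
    """Collect every record of a CKAN datastore resource, one page at a time.

    `search` receives the query parameters and returns the decoded JSON response.
    """
    records: List[Dict] = []
    offset = 0
    while True:
        params = {"id": resource_id, "limit": page_size, "offset": offset}
        result = search(params)["result"]
        page = result["records"]
        records.extend(page)
        offset += len(page)
        total = result.get("total")
        # Stop on an empty page, or once the reported total has been reached
        if not page or (total is not None and offset >= total):
            break
    return records


# Stand-in for the API (hypothetical data) so the sketch runs offline
_rows = [{"_id": i} for i in range(250)]

def fake_search(params: Dict) -> Dict:
    start = params["offset"]
    return {"result": {"records": _rows[start:start + params["limit"]], "total": len(_rows)}}

all_rows = fetch_all_records(fake_search, "example-resource", page_size=100)
print(len(all_rows))
```

Against the real endpoint, `search` could be `lambda p: requests.get(url, params=p).json()` with `url` pointing at `datastore_search`.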


### Population Based on Age

In [432]:
# Obtain the row numbers of the age-group characteristics so they can be extracted from the dataframe
age_groups = ['Children (0-14 years)', 'Youth (15-24 years)', 'Working Age (25-54 years)', 'Pre-retirement (55-64 years)', 'Seniors (65+ years)', 'Older Seniors (85+ years)']
df.index[df['Characteristic'].isin(age_groups)].tolist()

In [433]:
# Slice the profile dataframe: the neighbourhood-number row plus the age-group rows, from the Characteristic column onward
pop_data = df.iloc[[0, 10, 11, 12, 13, 14], 4:]
pop_data.head()

| | Characteristic | City of Toronto | Agincourt North | Agincourt South-Malvern West | Alderwood | Annex | Banbury-Don Mills | Bathurst Manor | Bay Street Corridor | Bayview Village | ... | Willowdale West | Willowridge-Martingrove-Richview | Woburn | Woodbine Corridor | Woodbine-Lumsden | Wychwood | Yonge-Eglinton | Yonge-St.Clair | York University Heights | Yorkdale-Glen Park |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Neighbourhood Number | | 129 | 128 | 20 | 95 | 42 | 34 | 76 | 52 | ... | 37 | 7 | 137 | 64 | 60 | 94 | 100 | 97 | 27 | 31 |
| 10 | Youth (15-24 years) | 340270.0 | 3705 | 3360 | 1235 | 3750 | 2730 | 1940 | 6860 | 2505 | ... | 2230 | 2625 | 7660 | 1035 | 675 | 1320 | 1225 | 920 | 4750 | 1870 |
| 11 | Working Age (25-54 years) | 1229555.0 | 11305 | 9965 | 5220 | 15040 | 10810 | 6655 | 13065 | 10310 | ... | 7480 | 8140 | 21945 | 6165 | 3790 | 6420 | 5860 | 5960 | 12290 | 5860 |
| 12 | Pre-retirement (55-64 years) | 336670.0 | 4230 | 3265 | 1825 | 3480 | 3555 | 2030 | 1760 | 2540 | ... | 2070 | 2905 | 6245 | 1625 | 1150 | 1595 | 1325 | 1540 | 2965 | 1810 |
| 13 | Seniors (65+ years) | 426945.0 | 6045 | 4105 | 2015 | 5910 | 6975 | 2940 | 2420 | 3615 | ... | 3370 | 4905 | 8010 | 1380 | 1095 | 3150 | 1600 | 2905 | 3530 | 3295 |
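
For analysis per neighbourhood it is often more convenient to have one row per neighbourhood and one column per age group. A possible reshaping is sketched below on a small excerpt of the table above, under the assumption that the values arrive as strings and need converting to numbers:

```python
import pandas as pd

pop = pd.DataFrame({
    "Characteristic": ["Youth (15-24 years)", "Seniors (65+ years)"],
    "Agincourt North": ["3705", "6045"],
    "Alderwood": ["1235", "2015"],
})

# Characteristics become columns, neighbourhoods become rows, strings become numbers
by_neighbourhood = pop.set_index("Characteristic").T.apply(pd.to_numeric)
by_neighbourhood.index.name = "Neighbourhood"
print(by_neighbourhood)
```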


### Unemployment Rate

In [434]:
# Load the 2016 neighbourhood profiles from a local CSV file
neighbourhood_profile = pd.read_csv('Neighbourhood_Profiles - 2016.csv')

In [None]:
# Look up the row number of the "Unemployment rate" characteristic
neighbourhood_profile.index[neighbourhood_profile['Characteristic'] == 'Unemployment rate'].tolist()
# Slice that row (1890), from the Characteristic column onward, to get the rate per neighbourhood
slice_neighbourhood_profile = neighbourhood_profile.iloc[[1890], 4:]
slice_neighbourhood_profile.head()

| | Characteristic | City of Toronto | Agincourt North | Agincourt South-Malvern West | Alderwood | Annex | Banbury-Don Mills | Bathurst Manor | Bay Street Corridor | Bayview Village | ... | Willowdale West | Willowridge-Martingrove-Richview | Woburn | Woodbine Corridor | Woodbine-Lumsden | Wychwood | Yonge-Eglinton | Yonge-St.Clair | York University Heights | Yorkdale-Glen Park |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1890 | Unemployment rate | 8.2 | 9.8 | 9.8 | 6.1 | 6.7 | 7.2 | 7.2 | 10.2 | 7.7 | ... | 9.8 | 8.5 | 10.6 | 7.7 | 6.6 | 5.2 | 6.9 | 5.9 | 10.7 | 8 |
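
A hard-coded row number such as 1890 stops working if the file's row order changes. Selecting the row by its label is one alternative, sketched here on a toy frame with values copied from the outputs above. The leading whitespace in the label is a hypothetical example of formatting noise, handled with `str.strip()`:

```python
import pandas as pd

profile = pd.DataFrame({
    "Characteristic": ["Population, 2016", "Population, 2011", "  Unemployment rate"],
    "City of Toronto": ["2731571", "2615060", "8.2"],
    "Agincourt North": ["29113", "30279", "9.8"],
})

# Compare against whitespace-trimmed labels so indentation does not break the match
labels = profile["Characteristic"].str.strip()
unemployment = profile.loc[labels == "Unemployment rate"]
print(unemployment)
```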
