# Toronto's Neighborhoods Recommender System
<img src="https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fwww.wallpaperup.com%2Fuploads%2Fwallpapers%2F2013%2F12%2F19%2F199807%2F4d86b2357c55ff2bc433fc0af0705b97.jpg&f=1&nofb=1/toronto.jpeg%E2%80%9D" alt="toronto" align="left" width="600" />

## Table of Contents
1. **[Introduction](#introduction)**
2. **[Data](#data)**  
3. **[Methodology](#methodology)**
4. **[Results](#results)**
5. **[Discussion](#discussion)**
6. **[Conclusion](#conclusion)**

## 1. Introduction<a name="introduction"></a>
According to __[CIC News](https://www.cicnews.com/2020/02/which-cities-in-canada-attract-the-most-immigrants-0213741.html#)__, Canada welcomed more than 341,000 immigrants in 2019, and Toronto attracted nearly 118,000 of them, almost 35% of the total. **The statistics indicate that Toronto is a leading destination for immigrants to Canada.** Why? __[VisaPlace](https://www.visaplace.com/blog-immigration-law/why-immigrants-settle-in-toronto-heres-10-reasons/)__ lists 10 reasons. For me, the most convincing one is that Toronto is Canada’s business and financial capital.

Toronto is Canada’s largest city. It has 6 boroughs: Etobicoke, North York, East York, Central Toronto, York and Scarborough. These 6 boroughs can be further divided into 140 neighborhoods. According to the __[City of Toronto](https://www.toronto.ca/community-people/moving-to-toronto/about-toronto/)__, Toronto is one of the most multicultural cities in the world thanks to its large population of immigrants from all over the world, so its neighborhoods can differ considerably from one another. **Therefore, out of 140 neighborhoods in Toronto, how can immigrants decide which neighborhood suits them best?** This is exactly the question I want to answer in this project.

**In this project, I will try to build a recommender system for Toronto's neighborhoods based on 4 factors: job opportunities, cost of living, safety and culture.** So, who would be interested in this recommender system? I can say that at least 118,000 people would, and I believe this number will keep growing. And of course, I can't wait to find out which neighborhood suits me best too, because I wish to migrate to Canada and settle in Toronto in the future. How about you?

## 2. Data<a name="data"></a> 
Previously, I mentioned that the recommender system is built on job opportunities, cost of living, safety and culture. In this section, I will explain why these factors are important, describe the data that will be used and their sources, and finally import and clean the data.

### A. Factors to consider while deciding where to settle
* **Job opportunities**: We have to make a living to support ourselves or our family, and I bet we all wish to get our dream job, right? So, we need to know the common jobs in each neighborhood.
* **Cost of living**: We would like to buy our dream house, but how much does it cost? How much should we earn to afford to live in a specific neighborhood? To answer these questions, we need to know the average house price and household income for each neighborhood.
* **Safety**: We wish to live in a safe and peaceful area, but how can we determine whether an area is safe? To answer this question, we need to know the crime rate for each neighborhood.
* **Culture**: We talk and eat every day. If possible, we would like to communicate in our favorite language and eat our favorite food, right? And it's even better if our favorite things are just around us. So, it's important to know the language spoken most often at home and the popular food in each neighborhood.

### B. Description of data and data source
|No.| Data           | Data Description  |   Data Source   | 
|:-------------| :------------- | :---------- | :----------- |
|I. | Common jobs| These data show the common jobs for each neighborhood. The data categorize jobs according to the North American Industry Classification System (NAICS) 2012. For example: 54-Professional, scientific and technical services, 23-Construction, etc. | I extracted the data from the __[2016 Toronto Neighborhood Profiles](https://ckan0.cf.opendata.inter.prod-toronto.ca/download_resource/ef0239b1-832b-4d0b-a1f3-4153e53b189e?format=csv)__. The City of Toronto uses the 2016 Canadian Census to provide a portrait of the demographic, social and economic characteristics of the people and households in each of Toronto's neighbourhoods. |
|II. | Average house price and household income | These data show the average house price and household income for each neighborhood in Canadian Dollar (CAD). The home affordability for each neighborhood is also calculated from them.| I scraped the data, current as of October 2020, from __[Realosophy](https://www.realosophy.com/toronto/neighbourhood-map)__. Realosophy is a real estate brokerage company that helps its customers make better decisions based on data. |
|III. |Crime rate| These data show the crime rate per 100,000 people for each neighborhood. | I got the data from the __[Toronto Neighborhood Crime Rates Boundary File](https://data.torontopolice.on.ca/datasets/neighbourhood-crime-rates-boundary-file-?geometry=-79.598%2C43.673%2C-79.158%2C43.760&orderBy=OBJECTID&page=6)__ by calling a REST API provided by the Toronto Police Service. The file contains the 2014-2019 crime data by neighbourhood. |
|IV. |Language spoken most often at home|  These data show the language spoken most often at home in each neighborhood. For example: English, Spanish, Italian, French, etc.  | I extracted the data from the __[2016 Toronto Neighborhood Profiles](https://ckan0.cf.opendata.inter.prod-toronto.ca/download_resource/ef0239b1-832b-4d0b-a1f3-4153e53b189e?format=csv)__. |
|V. |Popular food| These data show the popular food categories around each neighborhood according to the Foursquare API. For example: Italian food, Korean food, Japanese food, etc. | I got the data through the __[Foursquare API](https://developer.foursquare.com/docs/)__. Foursquare is a location technology platform dedicated to improving how people move through the real world. |
|VI. |Boundaries of neighborhoods| These data contain the boundary of each neighborhood as a GeoJSON file. They are used to draw the boundary of each neighborhood on a map. | I got the data from __[Boundaries of Toronto's Neighbourhoods](https://ckan0.cf.opendata.inter.prod-toronto.ca/download_resource/a083c865-6d60-4d1d-b6c6-b0c8a85f9c15?format=geojson&projection=4326)__. The City of Toronto made the data available on its open data portal. |

### C. Import data and data wrangling

#### I. Common jobs data
Now, let's import and clean the common jobs data first.

In [1]:
# import necessary library
import pandas as pd
pd.set_option('display.max_rows',None)
pd.set_option('display.max_columns',None)

In [2]:
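# define a function to add a 'Whole Toronto' row (ID 0) holding the average of each column over the 140 neighborhoods
def get_whole_toronto(dataframe):
    row_total=pd.DataFrame(dataframe.sum()).T
    row_total.iloc[0,[0,1]]=['Whole Toronto',0]
    for columns in row_total.columns[2:]:
        # take the scalar total explicitly; calling int() on a whole Series is deprecated
        row_total[columns]=int(row_total[columns].iloc[0]/140)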
    temp_df=pd.concat([dataframe,row_total])
    temp_df['ID']=temp_df['ID'].astype(int)
    temp_df=temp_df.sort_values('ID').reset_index(drop=True)
    return temp_df
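
To make the city-wide row concrete, here is a minimal sketch on a made-up three-neighborhood table. The helper name `add_city_average_row` and the toy values are hypothetical, and unlike `get_whole_toronto` it divides by the actual number of rows rather than the hard-coded 140.

```python
import pandas as pd

def add_city_average_row(df, label='Whole Toronto'):
    """Prepend a row (ID 0) with the truncated mean of every column after 'ID'."""
    value_columns = df.columns[2:]
    # integer division mirrors the int(total / n) truncation for non-negative counts
    averages = (df[value_columns].sum() // len(df)).astype(int)
    city_row = pd.DataFrame([{'Neighborhood': label, 'ID': 0, **averages.to_dict()}])
    combined = pd.concat([city_row, df], ignore_index=True)
    return combined.sort_values('ID').reset_index(drop=True)

# toy data: three hypothetical neighborhoods and two job categories
toy_df = pd.DataFrame({
    'Neighborhood': ['A', 'B', 'C'],
    'ID': [1, 2, 3],
    'Retail trade': [10, 20, 31],
    'Construction': [3, 6, 9],
})
result = add_city_average_row(toy_df)
print(result)
```

Dividing by `len(df)` keeps the helper correct if the number of neighborhoods ever changes.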

In [None]:
# define a function to return a dataframe of top 5 elements
def get_top5_elements(dataframe,column_name):
    first_element=[]
    second_element=[]
    third_element=[]
    fourth_element=[]
    fifth_element=[]
    first_column=dataframe.iloc[:,0].values
    second_column=dataframe.iloc[:,1].values
    for i in range(len(dataframe)):
        sorted_elements=dataframe.iloc[i,2:].sort_values(ascending=False).index
        first_element.append(sorted_elements[0])
        second_element.append(sorted_elements[1])
        third_element.append(sorted_elements[2])
        fourth_element.append(sorted_elements[3])
        fifth_element.append(sorted_elements[4])
    return pd.DataFrame({'Neighborhood':first_column,'ID':second_column,'1st Most Common {}'.format(column_name):first_element,
                         '2nd Most Common {}'.format(column_name):second_element,'3rd Most Common {}'.format(column_name):third_element,
                         '4th Most Common {}'.format(column_name):fourth_element,'5th Most Common {}'.format(column_name):fifth_element})
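
The ranking idea behind `get_top5_elements` can be sketched on a tiny table: for each row, sort the value columns in descending order and keep the first few column names. The helper `top_n_columns` and the toy numbers below are hypothetical, for illustration only.

```python
import pandas as pd

def top_n_columns(df, n):
    """For each row, return the names of the n largest columns after 'ID'."""
    values = df.iloc[:, 2:]
    return [
        list(values.iloc[i].sort_values(ascending=False).index[:n])
        for i in range(len(df))
    ]

# toy data: two hypothetical neighborhoods and three job categories
toy_df = pd.DataFrame({
    'Neighborhood': ['A', 'B'],
    'ID': [1, 2],
    'Retail trade': [50, 5],
    'Construction': [30, 40],
    'Finance and insurance': [10, 25],
})
top2 = top_n_columns(toy_df, 2)
print(top2)
```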

In [None]:
# import the 2016 toronto neighborhood profiles into toronto_2016_df and clean the dataframe
toronto_2016_df=pd.read_csv('https://ckan0.cf.opendata.inter.prod-toronto.ca/download_resource/ef0239b1-832b-4d0b-a1f3-4153e53b189e?format=csv')
toronto_2016_df.drop(['_id','Category','Data Source','City of Toronto'],axis=1,inplace=True)
index=['Neighbourhood Number']

# extract common jobs data from toronto_2016_df into common_jobs_df and clean the dataframe
common_jobs_df=toronto_2016_df[(toronto_2016_df['Topic']=='Industry - North American Industry Classification System (NAICS) 2012')|
                               toronto_2016_df['Characteristic'].isin(index)].drop('Topic',axis=1).set_index('Characteristic').T
common_jobs_df.columns=common_jobs_df.columns.str.strip()
common_jobs_df=common_jobs_df.drop(['Total Labour Force population aged 15 years and over by Industry - North American Industry Classification System (NAICS) 2012 - 25% sample data',
                                    'Industry - NAICS2012 - not applicable','All industry categories'],axis=1)
common_jobs_df=common_jobs_df.replace(',','',regex=True).astype(int).sort_values(index).rename_axis(None,axis=1).reset_index()
common_jobs_df.rename(columns={'index':'Neighborhood','Neighbourhood Number':'ID'},inplace=True)
common_jobs_df=get_whole_toronto(common_jobs_df)

# get top 5 common jobs and save the data into top5_common_jobs_df
top5_common_jobs_df=get_top5_elements(common_jobs_df,'Job')
top5_common_jobs_df.head()
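
The cleaning step above strips commas before casting to integers, which suggests the counts arrive as strings with thousands separators. A minimal sketch of that conversion, using made-up values rather than the real census file:

```python
import pandas as pd

# toy data: counts stored as text, some with thousands separators (values are made up)
raw = pd.DataFrame({
    'Retail trade': ['1,234', '56'],
    'Construction': ['789', '2,000'],
})

# remove every comma inside the strings, then cast each column to integers
clean = raw.replace(',', '', regex=True).astype(int)
print(clean)
```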

In [None]:
# normalize the common_jobs_df
common_jobs_df=common_jobs_df.drop(0).reset_index(drop=True)
common_jobs_df['Job Opportunity']=common_jobs_df.iloc[:,2:].sum(axis=1)
columns_to_normalize=common_jobs_df.columns[2:]
for columns in columns_to_normalize:
    common_jobs_df[columns]=common_jobs_df[columns]/common_jobs_df[columns].max()
    common_jobs_df[columns]=common_jobs_df[columns].astype(float).round(3)
common_jobs_df.head()
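
The normalization above divides each column by its maximum, so every value falls between 0 and 1 and the neighborhood with the largest count in a column scores 1. A minimal sketch on made-up numbers (the column names are only examples):

```python
import pandas as pd

# toy data: three hypothetical neighborhoods
toy_df = pd.DataFrame({
    'Neighborhood': ['A', 'B', 'C'],
    'ID': [1, 2, 3],
    'Retail trade': [10, 20, 40],
    'Construction': [5, 10, 5],
})

value_columns = toy_df.columns[2:]
normalized = toy_df.copy()
# divide every value column by its own maximum in one vectorized step
normalized[value_columns] = (toy_df[value_columns] / toy_df[value_columns].max()).round(3)
print(normalized)
```

The vectorized form gives the same result as looping over the columns one by one.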

#### II. Average house price and household income data
Now, let's scrape the average house price and household income data from Realosophy and clean the data.

In [3]:
# import necessary libraries
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
import time
from bs4 import BeautifulSoup
import numpy as np
firefox_options=Options()
firefox_options.add_argument('-headless')

In [None]:
# define a function to scrape each neighborhood's name and website by borough
def get_neighborhood_websites(borough):
    neighborhoods_list_driver=webdriver.Firefox(options=firefox_options)
    neighborhoods_list_driver.get('https://www.realosophy.com/{}-former-toronto/neighbourhood-map'.format(borough))
    time.sleep(5)
    neighborhoods_list_html=neighborhoods_list_driver.page_source
    neighborhoods_list_html_scraper=BeautifulSoup(neighborhoods_list_html,'lxml')
    neighborhoods_list_raw_data=neighborhoods_list_html_scraper.find('div',{'class':'row mt-4'})
    neighborhoods_list_data=neighborhoods_list_raw_data.find_all('a')
    neighborhood=[]
    website=[]
    for neighborhoods_list in neighborhoods_list_data:
        neighborhood.append(neighborhoods_list.text)
        website_temp=neighborhoods_list['href'].replace('/','https://www.realosophy.com/',1)
        website.append(website_temp)
    neighborhoods_list_driver.quit()
    print('...',end='')
    return pd.DataFrame({'Neighborhood':neighborhood,'Website':website})
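
The function above turns each relative `href` into a full URL by replacing its first `/`, which works as long as every link starts with a slash. As an alternative sketch, the standard library's `urljoin` handles both relative and already-absolute links. The path used below is hypothetical, not a real Realosophy page.

```python
from urllib.parse import urljoin

BASE_URL = 'https://www.realosophy.com/'

def to_absolute(href):
    """Resolve a link against the site root; absolute links are returned unchanged."""
    return urljoin(BASE_URL, href)

# hypothetical relative link, for illustration only
relative_link = to_absolute('/example-neighbourhood/real-estate')
# a link that is already absolute is left as-is
absolute_link = to_absolute('https://www.realosophy.com/example-neighbourhood/real-estate')
print(relative_link)
print(absolute_link)
```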

In [None]:
# define a function to get the avg house price or household income in numerical result
def get_number_only(money_in_str):
    # strip whitespace, the dollar sign and thousands separators, e.g. '$1.2M' -> '1.2M'
    clear_character=money_in_str.text.strip().replace('$','').replace(',','')
    # 'M' means millions; anything else is treated as thousands ('K')
    # round before casting so floating point error cannot truncate the result by one
    if 'M' in clear_character:
        return int(round(float(clear_character.replace('M',''))*1000000))
    else:
        return int(round(float(clear_character.replace('K',''))*1000))
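
To show the conversion on its own, here is a sketch that works on plain strings instead of scraped HTML elements. The name `parse_money` and the sample strings are hypothetical, and treating a value without a suffix as a plain dollar amount is an assumption that differs from `get_number_only`.

```python
def parse_money(text):
    """Convert strings such as '$1.1M' or '$587K' to an integer dollar amount."""
    cleaned = text.strip().replace('$', '').replace(',', '')
    multipliers = {'M': 1_000_000, 'K': 1_000}
    suffix = cleaned[-1].upper() if cleaned else ''
    if suffix in multipliers:
        # round() guards against results like 1100000.0000000002
        return int(round(float(cleaned[:-1]) * multipliers[suffix]))
    # assumption: no suffix means the value is already in dollars
    return int(round(float(cleaned)))

price = parse_money('$1.1M')
income = parse_money(' $587K ')
plain = parse_money('94,000')
print(price, income, plain)
```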

In [22]:
# define a function to label each numeric column as Low, Medium or High using 3 equal-width bins
# (values beyond 1.5*IQR are ignored when the bin edges are computed)
def get_categories(dataframe):
    temp_df=dataframe.drop(0)
    for columns in temp_df.columns[2:]:
        Q1=temp_df[columns].quantile(0.25)
        Q3=temp_df[columns].quantile(0.75)
        IQR=Q3-Q1
        temp_df_min=Q1-1.5*IQR
        temp_df_max=Q3+1.5*IQR
        temp_bins=temp_df[columns][~((temp_df[columns]<temp_df_min)|(temp_df[columns]>temp_df_max))]
        bins=np.linspace(min(temp_bins),max(temp_bins),4).astype(int)
        category_column='{} Categories'.format(columns)
        # assign with .loc instead of chained indexing to avoid SettingWithCopyWarning
        dataframe[category_column]='Medium ({:,d}-{:,d})'.format(bins[1],bins[2])
        dataframe.loc[dataframe[columns]<bins[1],category_column]='Low (<{:,d})'.format(bins[1])
        dataframe.loc[dataframe[columns]>bins[2],category_column]='High (>{:,d})'.format(bins[2])
    return dataframe
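
The binning logic in `get_categories` can be illustrated on a single made-up series: drop outliers with the 1.5 × IQR rule, split the remaining range into three equal-width bins, then label every value, outliers included. The helper `categorize` and the numbers are hypothetical, and the labels omit the bin edges for brevity.

```python
import numpy as np
import pandas as pd

def categorize(series):
    """Label values Low/Medium/High using equal-width bins over the non-outlier range."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    inliers = series[(series >= q1 - 1.5 * iqr) & (series <= q3 + 1.5 * iqr)]
    # four edges define three equal-width bins
    edges = np.linspace(inliers.min(), inliers.max(), 4)
    low_edge, high_edge = edges[1], edges[2]
    labels = np.select(
        [series < low_edge, series > high_edge],
        ['Low', 'High'],
        default='Medium',
    )
    return pd.Series(labels, index=series.index)

# toy data: 5000 is an outlier, so the bin edges come from the 100-700 range
prices = pd.Series([100, 200, 300, 400, 500, 600, 700, 5000])
categories = categorize(prices)
print(categories.tolist())
```

Excluding outliers when setting the edges keeps one extreme neighborhood from pushing almost everything into the Low bin.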

In [None]:
# scrape etobicoke's neighborhoods and websites into etobicoke_website_df then insert neighborhoods ID
print('Almost...',end='')
etobicoke_website_df=get_neighborhood_websites('etobicoke')
etobicoke_website_df.drop(25,inplace=True)
etobicoke_website_df['ID']=[20,11,1,14,13,17,8,9,14,6,19,12,17,18,10,14,4,7,2,16,5,15,16,3,11]

# scrape north york's neighborhoods and websites into north_york_website_df then insert neighborhoods ID
north_york_website_df=get_neighborhood_websites('north-york')
north_york_website_df.loc[len(north_york_website_df.index)]=north_york_website_df.loc[31,:]
north_york_website_df.loc[len(north_york_website_df.index)]=north_york_website_df.loc[39,:]
north_york_website_df['ID']=[38,42,34,52,49,43,24,41,30,39,33,39,42,47,45,26,44,31,25,45,53,48,41,21,23,22,38,32,
                             41,39,29,36,45,23,46,28,43,41,35,37,40,27,31,50,51]

# scrape east york's neighborhoods and websites into east_york_website_df then insert neighborhoods ID
east_york_website_df=get_neighborhood_websites('east-york')
east_york_website_df['ID']=[56,57,59,61,56,58,54,55,54,54,60]

# scrape central toronto's neighborhoods and websites into central_toronto_website_df then insert neighborhoods ID
central_toronto_website_df=get_neighborhood_websites('central-toronto')
central_toronto_website_df=central_toronto_website_df.drop([16,47,69,74]).reset_index(drop=True)
central_toronto_website_df.loc[len(central_toronto_website_df.index)]=central_toronto_website_df.loc[39,:]
central_toronto_website_df.loc[len(central_toronto_website_df.index)]=central_toronto_website_df.loc[35,:]
central_toronto_website_df.loc[len(central_toronto_website_df.index)]=central_toronto_website_df.loc[20,:]
central_toronto_website_df['ID']=[78,103,76,84,105,80,89,83,71,91,96,100,78,93,75,77,73,66,93,99,97,77,93,83,92,77,
                                  77,76,101,102,82,77,77,90,87,94,74,78,70,82,81,103,98,73,103,67,96,92,72,68,70,
                                  86,98,95,79,77,96,85,73,74,77,97,87,105,95,63,67,90,69,81,82,62,93,77,64,75,95,
                                  65,88,104]

# scrape york's neighborhoods and websites into york_website_df then insert neighborhoods ID
york_website_df=get_neighborhood_websites('york')
york_website_df=york_website_df.drop(11).reset_index(drop=True)
york_website_df['ID']=[114,112,108,109,106,110,114,115,107,114,111,113]

# scrape scarborough's neighborhoods and websites into scarborough_website_df then insert neighborhoods ID
scarborough_website_df=get_neighborhood_websites('scarborough')
scarborough_website_df.loc[len(scarborough_website_df.index)]=scarborough_website_df.loc[0,:]
scarborough_website_df['ID']=[128,127,122,120,120,123,122,126,138,140,134,125,124,117,132,119,130,135,135,121,133,
                              131,139,116,118,136,131,119,137,129]

# concatenate etobicoke_website_df, north_york_website_df, east_york_website_df, central_toronto_website_df,
# york_website_df and scarborough_website_df into neighborhood_website_df
neighborhood_website_df=pd.concat([etobicoke_website_df,north_york_website_df,east_york_website_df,
                                  central_toronto_website_df,york_website_df,scarborough_website_df])
neighborhood_website_df=neighborhood_website_df.sort_values('ID').reset_index(drop=True)
print('...Done!',end='')
neighborhood_website_df.head()

In [None]:
# scrape average house price and household income for each neighborhood into neighborhood_website_df
print('Progress:',end='')
avg_houseprice_income_driver=webdriver.Firefox(options=firefox_options)
neighborhood_website_df['Avg House Price']=0
neighborhood_website_df['Avg Household Income']=0
for i in range(len(neighborhood_website_df['ID'])):
    avg_houseprice_income_driver.get(neighborhood_website_df['Website'][i])
    time.sleep(5)
    avg_houseprice_income_html=avg_houseprice_income_driver.page_source
    avg_houseprice_income_scraper=BeautifulSoup(avg_houseprice_income_html,'lxml')
    avg_houseprice_data=avg_houseprice_income_scraper.find('div',{'class':'key-stats__avg-sale-price ng-binding ng-scope'})        
    avg_income_data=avg_houseprice_income_scraper.find('p',{'class':'h3 font-sans-caption-bold mb-0 text-center text-sm-left ng-binding ng-scope'})
    while avg_houseprice_data is None or avg_income_data is None:
        avg_houseprice_income_driver.get(neighborhood_website_df['Website'][i])
        time.sleep(5)
        avg_houseprice_income_html=avg_houseprice_income_driver.page_source
        avg_houseprice_income_scraper=BeautifulSoup(avg_houseprice_income_html,'lxml')
        avg_houseprice_data=avg_houseprice_income_scraper.find('div',{'class':'key-stats__avg-sale-price ng-binding ng-scope'})        
        avg_income_data=avg_houseprice_income_scraper.find('p',{'class':'h3 font-sans-caption-bold mb-0 text-center text-sm-left ng-binding ng-scope'})
    neighborhood_website_df.loc[i,'Avg House Price']=get_number_only(avg_houseprice_data)
    neighborhood_website_df.loc[i,'Avg Household Income']=get_number_only(avg_income_data)
    print('.',end='')
avg_houseprice_income_driver.quit()

# group neighborhood_website_df by neighborhood ID and save the data into avg_houseprice_income_df
neighborhood_website_df.drop('Neighborhood',axis=1,inplace=True)
neighborhood_website_df=neighborhood_website_df.groupby('ID').mean().reset_index()
neighborhood_website_df['Avg House Price']=neighborhood_website_df['Avg House Price'].astype(int)
neighborhood_website_df['Avg Household Income']=neighborhood_website_df['Avg Household Income'].astype(int)
avg_houseprice_income_df=common_jobs_df.iloc[:,0:2].copy()
avg_houseprice_income_df['Avg House Price']=neighborhood_website_df['Avg House Price']
avg_houseprice_income_df['Avg Household Income']=neighborhood_website_df['Avg Household Income']
avg_houseprice_income_df=get_whole_toronto(avg_houseprice_income_df)
print('...Done!')
avg_houseprice_income_df.head()
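
Several Realosophy neighborhoods can share one official neighborhood ID, which is why the scraped values are grouped by `ID` and averaged. A minimal sketch of that step with made-up prices and incomes:

```python
import pandas as pd

# toy data: three scraped areas map to ID 14, one maps to ID 1 (values are made up)
scraped = pd.DataFrame({
    'ID': [14, 14, 14, 1],
    'Avg House Price': [900000, 1000000, 1100000, 600000],
    'Avg Household Income': [100000, 110000, 120000, 90000],
})

# one row per official neighborhood ID, holding the mean of its scraped areas
per_id = scraped.groupby('ID').mean().round().astype(int).reset_index()
print(per_id)
```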

In [26]:
# load the average house price and household income data from a local CSV file
avg_houseprice_income_df=pd.read_csv('avg_houseprice_income.csv')
avg_houseprice_income_df.drop('Unnamed: 0',axis=1,inplace=True)
avg_houseprice_income_df=get_whole_toronto(avg_houseprice_income_df)

In [27]:
# calculate the home affordability for each neighborhood
avg_houseprice_income_df['Home Affordability']=avg_houseprice_income_df['Avg Household Income']/avg_houseprice_income_df['Avg House Price']
avg_houseprice_income_df['Home Affordability']=avg_houseprice_income_df['Home Affordability']/avg_houseprice_income_df['Home Affordability'].max()
avg_houseprice_income_df['Home Affordability']=avg_houseprice_income_df['Home Affordability']*100
avg_houseprice_income_df['Home Affordability']=avg_houseprice_income_df['Home Affordability'].round(3)

# get the categories of house price, household income and home affordability
avg_houseprice_income_df=get_categories(avg_houseprice_income_df)
avg_houseprice_income_df.head()


| |Neighborhood|ID|Avg House Price|Avg Household Income|Home Affordability|Avg House Price Categories|Avg Household Income Categories|Home Affordability Categories|
|:---|:---|:---|:---|:---|:---|:---|:---|:---|
|0|Whole Toronto|0|1102194|130501|52.922|Medium (964,000-1,432,000)|Medium (106,666-152,333)|Medium (48-63)|
|1|West Humber-Clairville|1|587000|94000|71.577|Low (<964,000)|Low (<106,666)|High (>63)|
|2|Mount Olive-Silverstone-Jamestown|2|578000|79000|61.092|Low (<964,000)|Low (<106,666)|Medium (48-63)|
|3|Thistletown-Beaumond Heights|3|898000|94000|46.788|Low (<964,000)|Low (<106,666)|Low (<48)|
|4|Rexdale-Kipling|4|744000|91000|54.67|Low (<964,000)|Low (<106,666)|Medium (48-63)|
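
The home affordability score is the ratio of average household income to average house price, rescaled so that the most affordable neighborhood scores 100. A minimal sketch with made-up figures (not the scraped data):

```python
import pandas as pd

# toy data: same income everywhere, so affordability depends only on price
toy = pd.DataFrame({
    'Neighborhood': ['A', 'B', 'C'],
    'Avg House Price': [500000, 1000000, 2000000],
    'Avg Household Income': [100000, 100000, 100000],
})

ratio = toy['Avg Household Income'] / toy['Avg House Price']
# rescale so the best income-to-price ratio becomes 100
toy['Home Affordability'] = (ratio / ratio.max() * 100).round(3)
print(toy)
```

Because the score is relative to the best ratio in the table, it compares neighborhoods with each other rather than measuring affordability in absolute terms.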


#### III. Crime rate data
Now, let's get the crime rate data by calling a REST API from Toronto Police Service and clean the data.

In [None]:
# import necessary library
import requests

# query the Toronto Police Service REST API for each neighborhood's population and *_AVG crime fields
url = 'https://services.arcgis.com/S9th0jAJ7bqgIRjw/arcgis/rest/services/Neighbourhood_MCI/FeatureServer/0/query?where=1%3D1&outFields=Neighbourhood,Hood_ID,Population,Assault_AVG,AutoTheft_AVG,Homicide_AVG,TheftOver_AVG,BreakandEnter_AVG,Robbery_AVG&outSR=4326&f=json'
results = requests.get(url).json()
crime_data = results['features']
# flatten the nested JSON records into a dataframe (pd.json_normalize replaces the deprecated pandas.io.json import)
dataframe = pd.json_normalize(crime_data)
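
As an offline sketch of what happens to the response, the snippet below flattens a hand-written sample shaped like an ArcGIS feature query result, where each feature nests its fields under `attributes`. The sample values are made up, and deriving a rate per 100,000 people by dividing the `_AVG` field by `Population` is an assumption about how the crime rate could be computed, not a step taken from the notebook.

```python
import pandas as pd

# hand-written stand-in for the API response (values are made up)
sample_response = {
    'features': [
        {'attributes': {'Neighbourhood': 'A', 'Hood_ID': '1',
                        'Population': 20000, 'Assault_AVG': 100.0}},
        {'attributes': {'Neighbourhood': 'B', 'Hood_ID': '2',
                        'Population': 10000, 'Assault_AVG': 25.0}},
    ]
}

# json_normalize flattens nested keys into 'attributes.<field>' columns
crime_df = pd.json_normalize(sample_response['features'])
crime_df.columns = crime_df.columns.str.replace('attributes.', '', regex=False)

# assumption: rate per 100,000 people = average count * 100,000 / population
crime_df['Assault Rate'] = (crime_df['Assault_AVG'] * 100_000 / crime_df['Population']).round(1)
print(crime_df)
```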

In [None]:
language_df = toronto_2016_df.loc[393:670]

## 3. Methodology<a name="methodology"></a>

## 4. Results<a name="results"></a>

## 5. Discussion<a name="discussion"></a>

## 6. Conclusion<a name="conclusion"></a>

### Thank you for reading this notebook! Feel free to read the __[full report]()__ and the __[blogpost]()__ too! 

## Author  
__[Titus Chin Jun Hong](https://www.linkedin.com/in/joseph-s-50398b136/)__  
**13 November 2020**