# Toronto's Neighborhoods Recommender System
<img src="https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fwww.wallpaperup.com%2Fuploads%2Fwallpapers%2F2013%2F12%2F19%2F199807%2F4d86b2357c55ff2bc433fc0af0705b97.jpg&f=1&nofb=1/toronto.jpeg%E2%80%9D" alt="toronto" align="left" width="600" />

## Table of Contents
1. **[Introduction](#introduction)**
2. **[Data](#data)**  
3. **[Methodology](#methodology)**
4. **[Results](#results)**
5. **[Discussion](#discussion)**
6. **[Conclusion](#conclusion)**

## 1. Introduction<a name="introduction"></a>
According to __[CIC News](https://www.cicnews.com/2020/02/which-cities-in-canada-attract-the-most-immigrants-0213741.html#)__, Canada welcomed more than 341,000 immigrants in 2019, and Toronto attracted nearly 118,000 of them, almost 35% of the total. **These figures suggest that immigrants choose Toronto more often than any other Canadian city.** Why? __[VisaPlace](https://www.visaplace.com/blog-immigration-law/why-immigrants-settle-in-toronto-heres-10-reasons/)__ lists 10 reasons. For me, the most convincing one is that Toronto is Canada’s business and financial capital.

Toronto is Canada’s largest city. It has 6 boroughs (Etobicoke, North York, East York, Central Toronto, York and Scarborough), which are further divided into 140 neighborhoods. According to the __[City of Toronto](https://www.toronto.ca/community-people/moving-to-toronto/about-toronto/)__, Toronto is one of the most multicultural cities in the world because of its large population of immigrants from all over the world, so its neighborhoods can differ considerably from one another. **So, out of 140 neighborhoods in Toronto, how can immigrants decide which neighborhood suits them best?** This is exactly the question I want to answer in this project.

**In this project, I will try to build a recommender system for Toronto's neighborhoods based on 5 factors: job opportunities, cost of living, ease of transportation, safety and culture.** So, who would be interested in this recommender system? I would say at least 118,000 people, and I believe this number will keep growing. And of course, I can't wait to find out which neighborhood suits me best too, because I wish to migrate to Canada and settle in Toronto in the future. How about you?

## 2. Data<a name="data"></a> 
As mentioned above, the recommender system is built on job opportunities, cost of living, ease of transportation, safety and culture. In this section, I will explain why these factors are important, describe the data and their sources, and finally import and clean the data.

### A. Factors to consider while deciding where to settle
* **Job opportunities**: We have to make a living to support ourselves or our family, and I bet we all wish to get our dream job, right? So, we need to know the common jobs in each neighborhood.
* **Cost of living**: We would like to buy our dream house, but how much does it cost? And how much should we earn to afford to live in a specific neighborhood? To answer these questions, we need to know the average house price and household income for each neighborhood.
* **Ease of transportation**: We need to travel from one point to another for different purposes, but which modes of transportation are available in each neighborhood? To find out, we need to know how people there travel.
* **Safety**: We wish to live in a safe and peaceful area, but how can we determine whether an area is safe? To answer this question, we need to know the crime rate and the number of coronavirus cases for each neighborhood.
* **Culture**: Everyone likes to have fun, right? So, it's important to know the popular places around each neighborhood. For some people, English is not their first language, so they may prefer neighborhoods where they can still communicate in their mother tongue. Hence, we need to know which non-English language is spoken most often at home in each neighborhood.

### B. Description of data and data source
|No.| Data           | Data Description  |   Data Source   | 
|:-------------| :------------- | :---------- | :----------- |
|1. | Common jobs| These data categorize jobs according to the North American Industry Classification System (NAICS) 2012. For example: 54-Professional, scientific and technical services, 23-Construction, etc. | I extracted the common jobs for each neighborhood from the __[2016 Toronto Neighborhood Profiles](https://ckan0.cf.opendata.inter.prod-toronto.ca/download_resource/ef0239b1-832b-4d0b-a1f3-4153e53b189e?format=csv)__. The City of Toronto uses the 2016 Canadian Census to provide a portrait of the demographic, social and economic characteristics of the people and households in each Toronto neighbourhood. |
|2. | Average house price and household income | These data show the average house price and household income in Canadian dollars (CAD). Data current as of Oct 2020. | I scraped the latest house price and household income for each neighborhood from __[Realosophy](https://www.realosophy.com/toronto/neighbourhood-map)__. Realosophy is a real estate brokerage that helps its customers make better decisions based on data. |
|3. | Mode of transportation | These data show how people travel to work in each neighborhood. For example: car, truck, van, public transit, etc. | I extracted the mode of transportation for each neighborhood from the __[2016 Toronto Neighborhood Profiles](https://ckan0.cf.opendata.inter.prod-toronto.ca/download_resource/ef0239b1-832b-4d0b-a1f3-4153e53b189e?format=csv)__. |
|4. |Crime rate| xxx | xxx |
|5. |Coronavirus cases | xxx | xxx |
|6. |Popular places| xxx | xxx |
|7. |Non-English language spoken most often at home| These data show the non-English language spoken most often at home in each neighborhood. For example: Spanish, Italian, Korean, etc. | I extracted the non-English language spoken most often at home in each neighborhood from the __[2016 Toronto Neighborhood Profiles](https://ckan0.cf.opendata.inter.prod-toronto.ca/download_resource/ef0239b1-832b-4d0b-a1f3-4153e53b189e?format=csv)__. |

### C. Import data and data wrangling

In [1]:
import requests
import pandas as pd
from pandas import json_normalize  # pandas.io.json.json_normalize is deprecated since pandas 1.0
import numpy as np

In [8]:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

In [15]:
language_df.head()

Unnamed: 0,_id,Category,Topic,Data Source,Characteristic,City of Toronto,Agincourt North,Agincourt South-Malvern West,Alderwood,Annex,Banbury-Don Mills,Bathurst Manor,Bay Street Corridor,Bayview Village,Bayview Woods-Steeles,Bedford Park-Nortown,Beechborough-Greenbrook,Bendale,Birchcliffe-Cliffside,Black Creek,Blake-Jones,Briar Hill-Belgravia,Bridle Path-Sunnybrook-York Mills,Broadview North,Brookhaven-Amesbury,Cabbagetown-South St. James Town,Caledonia-Fairbank,Casa Loma,Centennial Scarborough,Church-Yonge Corridor,Clairlea-Birchmount,Clanton Park,Cliffcrest,Corso Italia-Davenport,Danforth,Danforth East York,Don Valley Village,Dorset Park,Dovercourt-Wallace Emerson-Junction,Downsview-Roding-CFB,Dufferin Grove,East End-Danforth,Edenbridge-Humber Valley,Eglinton East,Elms-Old Rexdale,Englemount-Lawrence,Eringate-Centennial-West Deane,Etobicoke West Mall,Flemingdon Park,Forest Hill North,Forest Hill South,Glenfield-Jane Heights,Greenwood-Coxwell,Guildwood,Henry Farm,High Park North,High Park-Swansea,Highland Creek,Hillcrest Village,Humber Heights-Westmount,Humber Summit,Humbermede,Humewood-Cedarvale,Ionview,Islington-City Centre West,Junction Area,Keelesdale-Eglinton West,Kennedy Park,Kensington-Chinatown,Kingsview Village-The Westway,Kingsway South,Lambton Baby Point,L'Amoreaux,Lansing-Westgate,Lawrence Park North,Lawrence Park South,Leaside-Bennington,Little Portugal,Long Branch,Malvern,Maple Leaf,Markland Wood,Milliken,Mimico (includes Humber Bay Shores),Morningside,Moss Park,Mount Dennis,Mount Olive-Silverstone-Jamestown,Mount Pleasant East,Mount Pleasant West,New Toronto,Newtonbrook East,Newtonbrook West,Niagara,North Riverdale,North St. James Town,Oakridge,Oakwood Village,O'Connor-Parkview,Old East York,Palmerston-Little Italy,Parkwoods-Donalda,Pelmo Park-Humberlea,Playter Estates-Danforth,Pleasant View,Princess-Rosethorn,Regent Park,Rexdale-Kipling,Rockcliffe-Smythe,Roncesvalles,Rosedale-Moore Park,Rouge,Runnymede-Bloor West Village,Rustic,Scarborough Village,South Parkdale,South Riverdale,St.Andrew-Windfields,Steeles,Stonegate-Queensway,Tam O'Shanter-Sullivan,Taylor-Massey,The Beaches,Thistletown-Beaumond Heights,Thorncliffe Park,Trinity-Bellwoods,University,Victoria Village,Waterfront Communities-The Island,West Hill,West Humber-Clairville,Westminster-Branson,Weston,Weston-Pelham Park,Wexford/Maryvale,Willowdale East,Willowdale West,Willowridge-Martingrove-Richview,Woburn,Woodbine Corridor,Woodbine-Lumsden,Wychwood,Yonge-Eglinton,Yonge-St.Clair,York University Heights,Yorkdale-Glen Park
393,394,Language,Language spoken most often at home,Census Profile 98-316-X2016001,Moose Cree,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
394,395,Language,Mother tongue,Census Profile 98-316-X2016001,French and non-official language,3300,25,5,5,45,35,25,55,30,20,15,5,35,15,20,10,15,10,15,25,25,10,10,5,40,30,15,15,20,15,25,50,40,45,40,15,20,10,35,0,20,25,20,35,20,15,25,5,0,20,30,25,15,15,15,10,20,20,25,40,15,15,25,30,50,5,25,30,10,10,10,10,15,0,55,10,10,10,40,15,40,10,70,25,65,10,15,35,30,5,25,25,20,35,10,10,90,10,10,20,15,25,20,20,15,20,50,5,5,40,35,20,20,25,15,30,20,15,10,25,10,10,35,110,60,20,60,20,20,35,60,20,30,55,5,10,15,10,10,40,10
395,396,Language,Mother tongue,Census Profile 98-316-X2016001,"English, French and non-official language",2715,25,25,20,25,30,15,30,40,35,20,10,35,15,5,10,5,10,5,5,5,5,15,20,30,20,10,5,5,10,20,70,30,20,30,5,5,15,25,0,20,20,10,25,5,10,25,20,10,20,20,15,15,20,5,10,15,10,15,35,10,5,10,25,20,5,10,80,10,10,15,20,15,10,50,5,10,15,45,20,15,15,20,20,15,10,15,30,50,5,30,10,25,10,10,5,60,0,10,20,5,15,10,10,5,20,50,15,5,35,25,25,25,25,25,40,15,15,15,20,5,10,35,95,20,35,20,20,10,70,45,10,20,60,10,5,10,10,10,15,10
396,397,Language,Language spoken most often at home,Census Profile 98-316-X2016001,Language spoken most often at home for the tot...,2704420,28850,23735,12025,29875,27475,15635,25740,21255,12740,23215,6370,29500,21880,21570,7690,14215,9240,11380,17720,11295,9905,10870,13350,30915,26305,16425,15860,14105,9605,17100,26850,24390,36355,35020,11730,21090,15200,22570,9445,22060,18530,11470,21930,12675,10725,30245,14385,9655,15725,22105,23455,12475,16925,10365,12410,15540,14235,13610,43430,14090,11040,17110,17585,21985,9275,7985,43955,16165,14610,15175,16690,15435,10080,43775,10025,10460,26260,33375,17160,18860,13260,32830,16775,29275,11315,15560,23640,30585,11700,18400,13640,20985,18645,9215,13805,34790,10720,7755,15805,11045,10690,10360,22260,14905,20900,46090,10065,9765,16560,20915,27375,17790,24305,24950,27055,15495,21525,10125,20855,16305,7065,17195,65785,26985,32995,26095,17725,11100,27625,50285,16895,22145,53310,12445,7845,13345,11805,12530,27595,14055
397,398,Language,Language spoken most often at home,Census Profile 98-316-X2016001,Single responses,2458470,25820,21225,11215,28675,25055,14065,23900,19295,11600,21975,5780,25855,20935,18975,7245,12205,8850,10335,15745,10795,8645,10555,12430,29165,23280,14650,14745,12905,9065,16085,23275,20655,33875,30820,11030,20050,14190,19575,8500,19665,16910,10275,18270,11570,10390,26420,13585,9205,13330,20905,22440,11140,15220,9365,10710,13670,13335,11785,39725,13230,9890,14970,16520,19155,9030,7505,38965,14940,14220,14775,16245,14670,9485,37915,8680,9810,23640,31165,15190,17805,11920,28105,16100,27445,10610,13930,20515,29295,11330,15825,11670,19060,17185,8650,13270,31125,9605,7440,14055,10485,9495,9460,20120,14235,20205,40545,9715,8605,14410,19135,26195,16425,21910,23375,23620,13360,21145,8825,17205,15555,6770,15010,62260,24345,28550,22230,16325,9990,24235,45295,15380,20170,45830,12000,7445,12615,11305,12075,24205,12505
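
For the culture factor, one possible next step is to pick, for each neighborhood, the language row with the largest count. The sketch below uses a tiny made-up table in the same layout as `language_df` (a `Characteristic` column plus one column per neighborhood). The counts and the helper name `top_language` are illustrative assumptions, not values from the census file, and the sketch assumes the counts are already numeric.

```python
import pandas as pd

# Made-up counts in the same layout as language_df: one row per language,
# one column per neighborhood. The numbers are illustrative only.
sample = pd.DataFrame({
    'Characteristic': ['Spanish', 'Italian', 'Korean'],
    'Agincourt North': [120, 40, 300],
    'Alderwood': [90, 210, 15],
})

def top_language(df):
    """Return a Series mapping each neighborhood to its most common language."""
    counts = df.set_index('Characteristic')
    # idxmax along the rows gives, for each column, the row label of the largest value
    return counts.idxmax(axis=0)

result = top_language(sample)
print(result)
```

On the real data, summary rows such as "Single responses" would have to be dropped first, otherwise they would always win.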


In [12]:
print(jobs_df['Characteristic'][1946])

    54 Professional, scientific and technical services
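
The `Characteristic` value printed above combines the NAICS code and the industry name in one string. A minimal sketch of separating the two, assuming the code is always the first whitespace-delimited token (the helper name `split_naics` is hypothetical):

```python
def split_naics(characteristic):
    """Split '54 Professional, ...' into ('54', 'Professional, ...')."""
    # strip() removes the leading indentation seen in the raw value
    code, _, label = characteristic.strip().partition(' ')
    return code, label

code, label = split_naics('    54 Professional, scientific and technical services')
print(code, '|', label)
```

Because only the first space is used as the separator, range codes such as `44-45` would stay intact under the same assumption.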


In [2]:
neighborhood_profile_url = 'https://ckan0.cf.opendata.inter.prod-toronto.ca/download_resource/ef0239b1-832b-4d0b-a1f3-4153e53b189e?format=csv'
neighborhood_profile_df = pd.read_csv(neighborhood_profile_url)
jobs_df = neighborhood_profile_df.loc[1932:1954]
transport_df = neighborhood_profile_df.loc[1965:1971]
language_df = neighborhood_profile_df.loc[393:670]
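
The slices above depend on fixed row numbers, and the `language_df` preview shows rows whose `Topic` is "Mother tongue" inside the range. An alternative that may be more robust is to filter on the `Topic` column. The sketch below runs on a small stand-in frame; apart from the two language topics, the category, topic and count values are made up, and `rows_for_topic` is a hypothetical helper.

```python
import pandas as pd

# Tiny stand-in for neighborhood_profile_df; the rows are illustrative only
profile = pd.DataFrame({
    'Category': ['Language', 'Language', 'Journey to work'],
    'Topic': ['Language spoken most often at home', 'Mother tongue', 'Main mode of commuting'],
    'Characteristic': ['Spanish', 'Spanish', 'Public transit'],
    'City of Toronto': [100, 200, 300],
})

def rows_for_topic(df, topic):
    """Return the rows whose Topic column equals the given label."""
    return df[df['Topic'] == topic].reset_index(drop=True)

home_language = rows_for_topic(profile, 'Language spoken most often at home')
print(home_language)
```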

In [None]:
# Query the ArcGIS feature service for neighbourhood names, IDs and crime averages
url = 'https://services.arcgis.com/S9th0jAJ7bqgIRjw/arcgis/rest/services/Neighbourhood_MCI/FeatureServer/0/query?where=1%3D1&outFields=Neighbourhood,Hood_ID,Population,Assault_AVG,AutoTheft_AVG,Homicide_AVG,TheftOver_AVG,BreakandEnter_AVG,Robbery_AVG&outSR=4326&f=json'
results = requests.get(url).json()
crime_data = results['features']
dataframe = json_normalize(crime_data)

# Keep only the ID and name, ordered by ID
df_temp = pd.DataFrame({
    'ID': dataframe['attributes.Hood_ID'],
    'Neighborhood': dataframe['attributes.Neighbourhood'],
})
df_temp.sort_values(by='ID', inplace=True)
df_temp.reset_index(drop=True, inplace=True)

# Assign boroughs by row position. .loc slices include the end label,
# and .loc avoids the chained-assignment pitfall of df['col'][a:b] = value.
df_temp['Borough'] = ''
df_temp.loc[0:19, 'Borough'] = 'Etobicoke'
df_temp.loc[20:52, 'Borough'] = 'North York'
df_temp.loc[53:60, 'Borough'] = 'East York'
df_temp.loc[61:104, 'Borough'] = 'Central Toronto'
df_temp.loc[105:114, 'Borough'] = 'York'
df_temp.loc[115:139, 'Borough'] = 'Scarborough'
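
The borough boundaries above can also be written as a list of group sizes, which makes it easy to check that they add up to 140. This is only a sketch; the sizes are derived from the row ranges in the cell above, and `borough_labels` is a hypothetical helper.

```python
# (borough, number of neighborhoods) in row order, derived from the ranges above
BOROUGH_SIZES = [
    ('Etobicoke', 20), ('North York', 33), ('East York', 8),
    ('Central Toronto', 44), ('York', 10), ('Scarborough', 25),
]

def borough_labels(sizes):
    """Expand (name, count) pairs into one label per row."""
    labels = []
    for name, count in sizes:
        labels.extend([name] * count)
    return labels

labels = borough_labels(BOROUGH_SIZES)
print(len(labels))
```

With this list, a single assignment such as `df_temp['Borough'] = labels` would fill the whole column at once.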

In [None]:

df_temp['ID']=np.arange(1,141,1)
df_temp

In [None]:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.common.keys import Keys
import time

def get_neigh_data(borough):
    """Save the name and link of every neighbourhood listed for a borough to '<borough>.csv'."""
    url = 'https://www.realosophy.com/{}-former-toronto/neighbourhood-map'.format(borough)
    driver.get(url)
    time.sleep(5)  # give the page time to finish rendering
    html = driver.page_source

    soup = BeautifulSoup(html, 'lxml')
    # The neighbourhood links are read from the first <div class="row mt-4">
    all_divs = soup.find('div', {'class': 'row mt-4'})
    neighborhoods_data = all_divs.find_all('a')

    names = [neighborhood.text for neighborhood in neighborhoods_data]
    websites = [neighborhood['href'] for neighborhood in neighborhoods_data]

    pd.DataFrame({'Names': names, 'Websites': websites}).to_csv('{}.csv'.format(borough))
    print('{} done!'.format(borough))

firefoxOptions = Options()
firefoxOptions.add_argument('-headless')
driver = webdriver.Firefox(options=firefoxOptions)

boroughs=['etobicoke','north-york','east-york','central-toronto','york','scarborough']

for borough in boroughs:
    get_neigh_data(borough)

driver.quit()
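
The link-extraction step can be tried without a browser or BeautifulSoup. The sketch below swaps in the standard library's `HTMLParser` and runs on a hypothetical HTML snippet shaped like the block the scraper looks for; the link paths are invented for illustration. Unlike the scraper, it collects every `<a>` tag in the document rather than only those inside one `div`.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect (text, href) pairs for every <a> tag that has an href."""
    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._href = dict(attrs).get('href')
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears while an <a href=...> tag is open
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            self.links.append((''.join(self._text).strip(), self._href))
            self._href = None

# Hypothetical markup; the real page structure may differ
html = '''
<div class="row mt-4">
  <a href="/alderwood-toronto/neighbourhood-profile">Alderwood</a>
  <a href="/mimico-toronto/neighbourhood-profile">Mimico</a>
</div>
'''
collector = LinkCollector()
collector.feed(html)
links = collector.links
print(links)
```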

In [None]:
eto_df=pd.read_csv('etobicoke.csv')
north_df=pd.read_csv('north-york.csv')
east_df=pd.read_csv('east-york.csv')
central_df=pd.read_csv('central-toronto.csv')
york_df=pd.read_csv('york.csv')
scar_df=pd.read_csv('scarborough.csv')

In [None]:
eto_df.head()

In [None]:
neigh_id_website_df = pd.concat([eto_df,north_df,east_df,central_df,york_df,scar_df])

In [None]:
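# NOTE: get_neigh_data() does not write an 'ID' column, so this step assumes one
# (matching df_temp['ID']) was added to each borough CSV before it was loaded.
neigh_id_website_df.drop(['Unnamed: 0','Names'],axis=1,inplace=True)
neigh_id_website_df.sort_values('ID',inplace=True)
neigh_id_website_df.reset_index(drop=True,inplace=True)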

In [None]:
neigh_id_website_df.to_csv('neigh_id_website.csv')

In [None]:
neigh_id_website_df.head()

In [None]:
neigh_id_website_df['Avg House Price']=0
neigh_id_website_df['Avg Income']=0
neigh_id_website_df.head()

In [None]:
def get_number_only(variable):
    """Convert a scraped element whose text looks like '$1.2M' or '$850K' to an integer."""
    text = variable.text.strip().replace('$', '').replace(',', '')
    if 'M' in text:
        # Millions; round() guards against floating-point results such as 1149999.99
        return int(round(float(text.replace('M', '')) * 1000000))
    # Everything else is treated as thousands; float() also accepts values like '85.5K'
    return int(round(float(text.replace('K', '')) * 1000))
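
To try the conversion without a scraped element, here is a string-only variant. It is a sketch rather than the function used in this notebook: `parse_amount` is a hypothetical name, the sample strings are made up, and unlike `get_number_only` it treats a value without a suffix as a plain number rather than as thousands.

```python
def parse_amount(text):
    """Convert strings like '$1.2M', '$850K' or '$1,250' to an integer."""
    cleaned = text.strip().replace('$', '').replace(',', '')
    if cleaned.endswith('M'):
        return round(float(cleaned[:-1]) * 1_000_000)
    if cleaned.endswith('K'):
        return round(float(cleaned[:-1]) * 1_000)
    return round(float(cleaned))

for raw in ['$1.2M', ' $850K ', '$85.5K', '$1,250']:
    print(raw.strip(), '->', parse_amount(raw))
```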

In [None]:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
import time

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

firefoxOptions = Options()
firefoxOptions.add_argument('-headless')
driver = webdriver.Firefox(options=firefoxOptions)

# CSS classes of the elements that hold the two values on a neighbourhood page
price_class = 'key-stats__avg-sale-price ng-binding ng-scope'
income_class = 'h3 font-sans-caption-bold mb-0 text-center text-sm-left ng-binding ng-scope'

for i in range(len(neigh_id_website_df['Websites'])):
    url = 'https://www.realosophy.com{}'.format(neigh_id_website_df['Websites'][i])
    avg_house_price_data = None
    avg_income_data = None

    # Reload the page until both elements are present (this retries without a limit)
    while avg_house_price_data is None or avg_income_data is None:
        driver.get(url)
        time.sleep(5)
        soup = BeautifulSoup(driver.page_source, 'lxml')
        avg_house_price_data = soup.find('div', {'class': price_class})
        avg_income_data = soup.find('p', {'class': income_class})

    # Use .loc so the values are written to the DataFrame itself, not to a temporary copy
    neigh_id_website_df.loc[i, 'Avg House Price'] = get_number_only(avg_house_price_data)
    neigh_id_website_df.loc[i, 'Avg Income'] = get_number_only(avg_income_data)

    print(neigh_id_website_df['ID'][i])  # progress indicator

driver.quit()
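
The loop above reloads a page until both values are found, which would never finish if a page simply lacks them. One possible safeguard is a bounded retry. The sketch below is illustrative only: `fetch_with_retries` and `flaky_fetch` are hypothetical names, and the fake fetch stands in for a page load.

```python
import time

def fetch_with_retries(fetch, max_attempts=3, delay=0.0):
    """Call fetch() until it returns a non-None value or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        result = fetch()
        if result is not None:
            return result
        if attempt < max_attempts:
            time.sleep(delay)  # wait before the next attempt
    return None

# Stand-in for a page load that only succeeds on the third try
calls = {'count': 0}

def flaky_fetch():
    calls['count'] += 1
    return 'page data' if calls['count'] >= 3 else None

value = fetch_with_retries(flaky_fetch, max_attempts=5)
print(value, calls['count'])
```

A `None` result would then mark a neighbourhood whose values need to be checked by hand.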

In [None]:
neigh_id_website_df.to_csv('neigh_houseprice_raw.csv')

In [None]:
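# Average only the numeric columns; recent pandas versions raise an error
# when mean() is applied to a text column such as 'Websites'
neigh_houseprice_raw = neigh_id_website_df.groupby('ID')[['Avg House Price', 'Avg Income']].mean()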
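
Grouping by `ID` collapses rows that share an ID into a single row per neighborhood, presumably because several Realosophy pages can map to one City of Toronto neighborhood. A small sketch with made-up rows shows the effect. The values are illustrative, and selecting the numeric columns explicitly is one way to keep the text column out of the average.

```python
import pandas as pd

# Made-up rows: two scraped pages share ID 1, one page has ID 2
listings = pd.DataFrame({
    'ID': [1, 1, 2],
    'Websites': ['/a', '/b', '/c'],
    'Avg House Price': [800000, 1000000, 650000],
    'Avg Income': [90000, 110000, 70000],
})

# One row per ID, holding the mean of each numeric column
grouped = listings.groupby('ID')[['Avg House Price', 'Avg Income']].mean().astype(int)
print(grouped)
```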

In [None]:
neigh_houseprice_raw['Avg House Price']=neigh_houseprice_raw['Avg House Price'].astype(int)

In [None]:
neigh_houseprice_raw['Avg Income']=neigh_houseprice_raw['Avg Income'].astype(int)

In [None]:
neigh_houseprice_raw.to_csv('neigh_houseprice_grouped.csv')

In [None]:
import pandas as pd
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
neigh_houseprice_grouped = pd.read_csv('neigh_houseprice_grouped.csv')
neigh_houseprice_grouped

In [None]:
toronto_neigh_houseprice_df = df_temp.set_index('ID').join(neigh_houseprice_grouped.set_index('ID'))
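
`join` on the `ID` index is a left join by default, so every neighborhood in `df_temp` is kept and any ID without scraped prices ends up with missing values. The sketch below, on made-up frames, shows one way to list such IDs after the join.

```python
import pandas as pd

# Made-up frames: neighborhood 2 has no price data
neighborhoods = pd.DataFrame({'ID': [1, 2, 3], 'Neighborhood': ['A', 'B', 'C']})
prices = pd.DataFrame({'ID': [1, 3], 'Avg House Price': [900000, 650000]})

joined = neighborhoods.set_index('ID').join(prices.set_index('ID'))

# IDs that received no price after the join
missing = joined[joined['Avg House Price'].isna()].index.tolist()
print(missing)
```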

In [None]:
toronto_neigh_houseprice_df.to_csv('toronto_neigh_houseprice.csv')

## 3. Methodology<a name="methodology"></a>

## 4. Results<a name="results"></a>

## 5. Discussion<a name="discussion"></a>

## 6. Conclusion<a name="conclusion"></a>

### Thank you for reading this notebook! Feel free to read the __[full report]()__ and the __[blogpost]()__ too! 

## Author  
__[Titus Chin Jun Hong](https://www.linkedin.com/in/joseph-s-50398b136/)__  
**10 November 2020**