# Toronto's Neighborhoods Recommender System
<img src="https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fwww.wallpaperup.com%2Fuploads%2Fwallpapers%2F2013%2F12%2F19%2F199807%2F4d86b2357c55ff2bc433fc0af0705b97.jpg&f=1&nofb=1/toronto.jpeg%E2%80%9D" alt="toronto" align="left" width="600" />

## Table of Contents
1. **[Introduction](#introduction)**
2. **[Data](#data)**  
     2.1 **[Factors to consider while deciding where to settle](#factors)**  
     2.2 **[Description of data and data source](#source)**  
     2.3 **[Import data and data wrangling (optional)](#clean)**  
     2.4 **[All data for Toronto's neighborhoods recommender system](#all)**
3. **[Methodology](#methodology)**  
     3.1 **[Build a recommender system](#system)**  
     3.2 **[Data visualization](#visualization)**  
     3.3 **[Build an interactive dashboard](#dashboard)**  
     3.4 **[Deploy dashboard on Heroku](#heroku)**  
4. **[Results](#results)**
5. **[Discussion](#discussion)**
6. **[Conclusion](#conclusion)**

## 1. Introduction <a name="introduction"></a>
According to __[CIC News](https://www.cicnews.com/2020/02/which-cities-in-canada-attract-the-most-immigrants-0213741.html#)__, Canada welcomed more than 341,000 immigrants in 2019, and Toronto attracted nearly 118,000 of them, almost 35% of the total. **The statistics indicate that roughly one in three immigrants chose to settle in Toronto.** Why? __[VisaPlace](https://www.visaplace.com/blog-immigration-law/why-immigrants-settle-in-toronto-heres-10-reasons/)__ lists 10 reasons. For me, the most convincing one is that Toronto is Canada’s business and financial capital.

Toronto is Canada’s largest city. It has 6 boroughs: Etobicoke, North York, East York, Central Toronto, York and Scarborough. These 6 boroughs can be further divided into 140 neighborhoods. According to the __[City of Toronto](https://www.toronto.ca/community-people/moving-to-toronto/about-toronto/)__, Toronto is one of the most multicultural cities in the world because of its large population of immigrants from all over the world, so its neighborhoods can differ considerably from one another. **Therefore, out of 140 neighborhoods in Toronto, how can immigrants decide which neighborhood suits them best?** This is exactly the question I want to answer in this project.

**In this project, I will try to build a recommender system for Toronto's neighborhoods based on 4 factors: job opportunities, affordability, safety and culture.** So, who would be interested in this recommender system? The nearly 118,000 immigrants who settled in Toronto in 2019 are exactly the audience I have in mind, and I believe that this number will keep growing. And of course, I can't wait to find out which neighborhood suits me best too, because I wish to migrate to Canada and settle in Toronto in the future. How about you?

## 2. Data<a name="data"></a> 
Previously, I mentioned that the recommender system is built on job opportunities, affordability, safety and culture. In this section, I will explain why these factors are important, describe the data and their sources, import and clean the data, and finally list all the data that will be used to create the recommender system.

### 2.1 Factors to consider while deciding where to settle<a name="factors"></a>
* **Job opportunities**: We have to make a living to support ourselves or our family, and most of us hope to land our dream job. So, we need to know which jobs are common in each neighborhood.
* **Affordability**: We would like to buy our dream house, but how much does it cost, and how much do we need to earn to afford to live in a specific neighborhood? To answer these questions, we need to know the affordability index for each neighborhood.
* **Safety**: We wish to live in a safe and peaceful area, but how can we determine whether an area is safe? To answer this question, we need to know the crime rate for each neighborhood.
* **Culture**: We talk and eat every day. If possible, we would like to communicate in our favorite language and eat our favorite food, and it's even better if both are available close to home. So, it's important to know which languages are spoken most often at home and which foods are popular in each neighborhood.

### 2.2 Description of data and data source<a name="source"></a>
|No.| Data           | Data Description  |   Data Source   | 
|:-------------| :------------- | :---------- | :----------- |
|I. | Common jobs| These data show the common jobs for each neighborhood. The data categorize jobs according to the North American Industry Classification System (NAICS) 2012, for example professional, construction and retail trade. | I extracted the data from the __[2016 Toronto Neighborhood Profiles](https://ckan0.cf.opendata.inter.prod-toronto.ca/download_resource/ef0239b1-832b-4d0b-a1f3-4153e53b189e?format=csv)__. The City of Toronto uses the 2016 Canadian Census to provide a portrait of the demographic, social and economic characteristics of the people and households in each Toronto neighborhood. |
|II. | Affordability index| These data show the composite Housing Affordability Index (HAI) for each neighborhood. To calculate the HAI, we need the average house price and household income data for each neighborhood.| I scraped the average house price and household income data, current as of October 2020, from __[Realosophy](https://www.realosophy.com/toronto/neighbourhood-map)__. Realosophy is a real estate brokerage that helps its customers make better decisions based on data. |
|III. |Crime rate| These data show the crime rate for each neighborhood. | I retrieved the data from the __[Toronto Neighborhood Crime Rates Boundary File](https://data.torontopolice.on.ca/datasets/neighbourhood-crime-rates-boundary-file-?geometry=-79.598%2C43.673%2C-79.158%2C43.760&orderBy=OBJECTID&page=6)__ by calling a REST API from the Toronto Police Service. The file contains the 2014-2019 crime data by neighborhood. |
|IV. |Language spoken most often at home|  These data show the language spoken most often at home in each neighborhood. For example: English, Spanish, Italian, French, etc.  | I extracted the data from the __[2016 Toronto Neighborhood Profiles](https://ckan0.cf.opendata.inter.prod-toronto.ca/download_resource/ef0239b1-832b-4d0b-a1f3-4153e53b189e?format=csv)__. |
|V. |Boundaries of neighborhoods| These data contain the boundary and the latitude and longitude coordinates of each neighborhood. | I retrieved the data from __[Boundaries of Toronto's Neighborhoods](https://ckan0.cf.opendata.inter.prod-toronto.ca/download_resource/a083c865-6d60-4d1d-b6c6-b0c8a85f9c15?format=geojson&projection=4326)__. The City of Toronto made the data available on its open data portal. |
|VI. |Popular food| These data show the popular food categories around each neighborhood according to the Foursquare API, for example Italian restaurant, Korean restaurant and Japanese restaurant. | I retrieved the data through the __[Foursquare API](https://developer.foursquare.com/docs/)__. Foursquare is a location technology platform dedicated to improving how people move through the real world. |

### 2.3 Import data and data wrangling (optional)<a name="clean"></a>
This section explains in detail how to import the data and perform data wrangling. If you're not interested in data wrangling, you can skip this part and go straight to __[all data](#all)__ for Toronto's neighborhoods recommender system.

#### I. Common jobs data
Now, let's import and clean the common jobs data first.

In [None]:
# import necessary library
import pandas as pd
pd.set_option('display.max_rows',None)
pd.set_option('display.max_columns',None)

# import the 2016 toronto neighborhood profiles into toronto_df and clean the dataframe
toronto_df=pd.read_csv('https://ckan0.cf.opendata.inter.prod-toronto.ca/download_resource/ef0239b1-832b-4d0b-a1f3-4153e53b189e?format=csv')
toronto_df.drop(['_id','Category','Data Source','City of Toronto'],axis=1,inplace=True)

# extract common jobs data from toronto_df into jobs_df and clean the dataframe
topic=['Industry - North American Industry Classification System (NAICS) 2012']
index=['Neighbourhood Number']
jobs_df=toronto_df[(toronto_df['Topic'].isin(topic))|toronto_df['Characteristic'].isin(index)]
jobs_df=jobs_df.drop('Topic',axis=1).set_index('Characteristic').T
jobs_df.columns=jobs_df.columns.str.strip()
jobs_df=jobs_df.drop(jobs_df.columns[1:4],axis=1).replace(',','',regex=True).astype(int)
jobs_df=jobs_df.sort_values(index).rename_axis(None,axis=1).reset_index()
jobs_df.rename(columns={'index':'Neighborhood','Neighbourhood Number':'ID',
                        '11 Agriculture, forestry, fishing and hunting':'Agriculture',
                        '21 Mining, quarrying, and oil and gas extraction':'Mining','22 Utilities':'Utilities',
                        '23 Construction':'Construction','31-33 Manufacturing':'Manufacturing',
                        '41 Wholesale trade':'Wholesale trade','44-45 Retail trade':'Retail trade',
                        '48-49 Transportation and warehousing':'Transportation',
                        '51 Information and cultural industries':'Cultural industry',
                        '52 Finance and insurance':'Finance','53 Real estate and rental and leasing':'Real estate',
                        '54 Professional, scientific and technical services':'Professional',
                        '55 Management of companies and enterprises':'Management',
                        '56 Administrative and support, waste management and remediation services':'Admin/support',
                        '61 Educational services':'Education','62 Health care and social assistance':\
                        'Health care','71 Arts, entertainment and recreation':'Arts',
                        '72 Accommodation and food services':'Accomodation',
                        '81 Other services (except public administration)':'Other services',
                        '91 Public administration':'Public admin'},inplace=True)
dummy_df=jobs_df.copy()
dummy_df.drop(['Neighborhood','ID'],axis=1,inplace=True)
dummy_df=dummy_df.sort_index(axis=1)
dummy_df.insert(0,'ID',jobs_df['ID'])
dummy_df.insert(0,'Neighborhood',jobs_df['Neighborhood'])
jobs_df=dummy_df.copy()
print('This dataframe consists of {} jobs!'.format(jobs_df.shape[1]-2))
jobs_df.head()

Nice! Now, let's convert the counts for each job into percentages of each neighborhood's total.

In [None]:
# define a function to convert the counts to percentage
def convert_to_percentage(dataframe):
    # divide each row of counts by its row total, then express it as a percentage
    counts=dataframe.iloc[:,2:]
    percent=counts.div(counts.sum(axis=1),axis=0).mul(100).astype(float).round(2)
    # keep the Neighborhood and ID columns in front of the percentages
    return pd.concat([dataframe.iloc[:,:2],percent],axis=1)

# convert the counts to percentage and save the data into percent_jobs_df
percent_jobs_df=convert_to_percentage(jobs_df)
percent_jobs_df.head()
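
As a quick sanity check of the conversion, here is a minimal sketch on a made-up table. The neighborhoods `A` and `B`, their counts and the `toy_` names are invented for illustration; the only assumption carried over from the notebook is that the first two columns are `Neighborhood` and `ID`.

```python
import pandas as pd

# toy counts: neighborhoods and numbers are invented for illustration
toy_df = pd.DataFrame({'Neighborhood': ['A', 'B'],
                       'ID': [1, 2],
                       'Construction': [10, 5],
                       'Finance': [30, 5],
                       'Retail trade': [60, 10]})

# divide every count by its row total, then scale to a percentage
counts = toy_df.iloc[:, 2:]
toy_percent = counts.div(counts.sum(axis=1), axis=0).mul(100).round(2)

# put the identifying columns back in front
toy_percent_df = pd.concat([toy_df.iloc[:, :2], toy_percent], axis=1)
print(toy_percent_df)
```

Each row of `toy_percent` should add up to 100, which is an easy property to verify on the real dataframe as well.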

Looks great! Let's get the top 5 common jobs for each neighborhood.

In [None]:
# define a function to return a dataframe of top 5 elements
def get_top5_elements(dataframe,column_name):
    first_element=[]
    second_element=[]
    third_element=[]
    fourth_element=[]
    fifth_element=[]
    first_column=dataframe.iloc[:,0].values
    second_column=dataframe.iloc[:,1].values
    for i in range(len(dataframe)):
        sorted_elements=dataframe.iloc[i,2:].sort_values(ascending=False).index
        first_element.append(sorted_elements[0])
        second_element.append(sorted_elements[1])
        third_element.append(sorted_elements[2])
        fourth_element.append(sorted_elements[3])
        fifth_element.append(sorted_elements[4])
    return pd.DataFrame({'Neighborhood':first_column,'ID':second_column,
                         '1st Most Common {}'.format(column_name):first_element,
                         '2nd Most Common {}'.format(column_name):second_element,
                         '3rd Most Common {}'.format(column_name):third_element,
                         '4th Most Common {}'.format(column_name):fourth_element,
                         '5th Most Common {}'.format(column_name):fifth_element})

# get top 5 common jobs and save the data into top5_jobs_df
top5_jobs_df=get_top5_elements(jobs_df,'Job')
top5_jobs_df.head()
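
To illustrate what "top 5" means for a single neighborhood, here is a small sketch on invented counts. It uses `Series.nlargest` as an alternative way to pick the largest entries of a row; the toy data and the helper name `top_labels` are assumptions made for this example only.

```python
import pandas as pd

# toy counts: neighborhoods and numbers are invented for illustration
toy_df = pd.DataFrame({'Neighborhood': ['A', 'B'],
                       'ID': [1, 2],
                       'Arts': [5, 40],
                       'Construction': [10, 30],
                       'Education': [20, 25],
                       'Finance': [30, 15],
                       'Health care': [25, 10],
                       'Retail trade': [60, 5]})

def top_labels(row, n=5):
    # column labels of the n largest values in the row, largest first
    return list(row.nlargest(n).index)

# one list of labels per neighborhood, in the same order as the rows
toy_top5 = [top_labels(row) for _, row in toy_df.iloc[:, 2:].iterrows()]
print(toy_top5)
```

When two categories have the same count, the order between them depends on the sorting method, so ties in the real data may be ranked differently by different approaches.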

Cool! Let's normalize the common jobs data.

In [None]:
# define a function to normalize data (each column is scaled so that its maximum becomes 100)
def data_normalization(dataframe):
    temp_df=dataframe.copy()
    columns_to_normalize=temp_df.columns[2:]
    for column in columns_to_normalize:
        temp_df[column]=(temp_df[column]/temp_df[column].max())*100
    return temp_df

# normalize the common jobs data
norm_jobs_df=data_normalization(jobs_df)
norm_jobs_df.head()

Awesome!
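
To see what this normalization does, here is a minimal sketch on invented numbers. Each column is divided by its own maximum, so the neighborhood with the highest value in a category scores 100 for that category. Note that the notebook applies the function to the raw counts in `jobs_df`, so the scores compare neighborhoods within one category rather than categories within one neighborhood. The toy values and `toy_` names are assumptions for illustration.

```python
import pandas as pd

# toy counts: neighborhoods and numbers are invented for illustration
toy_df = pd.DataFrame({'Neighborhood': ['A', 'B', 'C'],
                       'ID': [1, 2, 3],
                       'Finance': [50, 200, 100],
                       'Arts': [8, 2, 4]})

# scale every column by its own maximum so the best neighborhood scores 100
values = toy_df.iloc[:, 2:]
toy_norm = values.div(values.max(), axis=1).mul(100)
print(toy_norm)
```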

#### II. Affordability index data
To calculate the affordability index for each neighborhood, we need to scrape the average house price and household income from Realosophy. Before scraping these data, we first need the links to the web pages that contain this information. So, let's scrape these links from Realosophy now.

In [None]:
# import necessary libraries
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
import time
from bs4 import BeautifulSoup
firefox_options=Options()
firefox_options.add_argument('-headless')

# define a function to scrape each neighborhood's name and website by borough
def get_neighborhood_websites(borough):
    driver=webdriver.Firefox(options=firefox_options)
    driver.get('https://www.realosophy.com/{}-former-toronto/neighbourhood-map'.format(borough))
    time.sleep(5)
    html=driver.page_source
    soup=BeautifulSoup(html,'lxml')
    all_data=soup.find('div',{'class':'row mt-4'})
    neighborhood_data=all_data.find_all('a')
    neighborhood=[]
    website=[]
    for data in neighborhood_data:
        neighborhood.append(data.text)
        website_temp=data['href'].replace('/','https://www.realosophy.com/',1)
        website.append(website_temp)
    driver.quit()
    print('...',end='')
    return pd.DataFrame({'Neighborhood':neighborhood,'Website':website})

# scrape etobicoke's neighborhoods and websites into etobicoke_df then insert neighborhoods id
print('Almost...',end='')
etobicoke_df=get_neighborhood_websites('etobicoke')
etobicoke_df.drop(25,inplace=True)
etobicoke_df['ID']=[20,11,1,14,13,17,8,9,14,6,19,12,17,18,10,14,4,7,2,16,5,15,16,3,11]

# scrape north york's neighborhoods and websites into northyork_df then insert neighborhoods id
northyork_df=get_neighborhood_websites('north-york')
northyork_df.loc[len(northyork_df.index)]=northyork_df.loc[31,:]
northyork_df.loc[len(northyork_df.index)]=northyork_df.loc[39,:]
northyork_df['ID']=[38,42,34,52,49,43,24,41,30,39,33,39,42,47,45,26,44,31,25,45,53,48,41,21,23,22,38,32,41,39,29,
                    36,45,23,46,28,43,41,35,37,40,27,31,50,51]

# scrape east york's neighborhoods and websites into eastyork_df then insert neighborhoods id
eastyork_df=get_neighborhood_websites('east-york')
eastyork_df['ID']=[56,57,59,61,56,58,54,55,54,54,60]

# scrape central toronto's neighborhoods and websites into centraltoronto_df then insert neighborhoods id
centraltoronto_df=get_neighborhood_websites('central-toronto')
centraltoronto_df=centraltoronto_df.drop([16,47,69,74]).reset_index(drop=True)
centraltoronto_df.loc[len(centraltoronto_df.index)]=centraltoronto_df.loc[39,:]
centraltoronto_df.loc[len(centraltoronto_df.index)]=centraltoronto_df.loc[35,:]
centraltoronto_df.loc[len(centraltoronto_df.index)]=centraltoronto_df.loc[20,:]
centraltoronto_df['ID']=[78,103,76,84,105,80,89,83,71,91,96,100,78,93,75,77,73,66,93,99,97,77,93,83,92,77,77,76,
                         101,102,82,77,77,90,87,94,74,78,70,82,81,103,98,73,103,67,96,92,72,68,70,86,98,95,79,77,
                         96,85,73,74,77,97,87,105,95,63,67,90,69,81,82,62,93,77,64,75,95,65,88,104]

# scrape york's neighborhoods and websites into york_df then insert neighborhoods id
york_df=get_neighborhood_websites('york')
york_df=york_df.drop(11).reset_index(drop=True)
york_df['ID']=[114,112,108,109,106,110,114,115,107,114,111,113]

# scrape scarborough's neighborhoods and websites into scarborough_df then insert neighborhoods id
scarborough_df=get_neighborhood_websites('scarborough')
scarborough_df.loc[len(scarborough_df.index)]=scarborough_df.loc[0,:]
scarborough_df['ID']=[128,127,122,120,120,123,122,126,138,140,134,125,124,117,132,119,130,135,135,121,133,131,139,
                      116,118,136,131,119,137,129]

# concatenate these six boroughs dataframe into website_df
website_df=pd.concat([etobicoke_df,northyork_df,eastyork_df,centraltoronto_df,york_df,scarborough_df])
website_df=website_df.sort_values('ID').reset_index(drop=True)
print('...Done!',end='')
website_df.head()

Excellent! Now, let's scrape the average house price and household income data from Realosophy and clean the data.

In [None]:
# define a function to get the avg house price or household income in numerical result
def get_number_only(data):
    result=data.text.strip().replace('$','').replace(',','')
    if 'M' in result:
        result=int((float(result.replace('M','')))*1000000)
        return result
    else:
        result=int((float(result.replace('K','')))*1000)
        return result

# scrape average house price and household income for each neighborhood into website_df
print('Progress:')
driver=webdriver.Firefox(options=firefox_options)
website_df['Avg House Price']=0
website_df['Avg Household Income']=0
for i in range(len(website_df['ID'])):
    driver.get(website_df['Website'][i])
    time.sleep(5)
    html=driver.page_source
    soup=BeautifulSoup(html,'lxml')
    houseprice_data=soup.find('div',{'class':'key-stats__avg-sale-price ng-binding ng-scope'})
    income_class='h3 font-sans-caption-bold mb-0 text-center text-sm-left ng-binding ng-scope'
    income_data=soup.find('p',{'class':income_class})
    while houseprice_data is None or income_data is None:
        driver.get(website_df['Website'][i])
        time.sleep(5)
        html=driver.page_source
        soup=BeautifulSoup(html,'lxml')
        houseprice_data=soup.find('div',{'class':'key-stats__avg-sale-price ng-binding ng-scope'})
        income_data=soup.find('p',{'class':income_class})
    website_df.iloc[i,3]=get_number_only(houseprice_data)
    website_df.iloc[i,4]=get_number_only(income_data)
    print('.',end='')
driver.quit()

# average the scraped values by neighborhood id and save the data into houseprice_df
# the text columns are dropped first so that mean() only sees numeric columns
website_df.drop(['Neighborhood','Website'],axis=1,inplace=True)
website_df=website_df.groupby('ID').mean().reset_index()
website_df['Avg House Price']=website_df['Avg House Price'].astype(int)
website_df['Avg Household Income']=website_df['Avg Household Income'].astype(int)
# copy the slice so that adding columns does not modify a view of jobs_df
houseprice_df=jobs_df.iloc[:,0:2].copy()
houseprice_df['Avg House Price']=website_df['Avg House Price']
houseprice_df['Avg Household Income']=website_df['Avg Household Income']
print('...Done!')
houseprice_df.head()
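
The scraped values arrive as short strings with a currency sign and a magnitude suffix. The sketch below shows one way to turn such strings into integers without a browser session. The function name `parse_amount` and the sample strings are invented for illustration, and unlike the helper above it treats a value without a suffix as plain dollars, which is an assumption about the input rather than something the scraped pages confirm.

```python
def parse_amount(text):
    """Convert a string such as '$1.2M' or '$850K' to an integer number of dollars."""
    cleaned = text.strip().replace('$', '').replace(',', '')
    multipliers = {'M': 1_000_000, 'K': 1_000}
    suffix = cleaned[-1].upper()
    if suffix in multipliers:
        # strip the suffix and scale; round() avoids truncating float artifacts
        return round(float(cleaned[:-1]) * multipliers[suffix])
    # assumption: no suffix means the value is already in dollars
    return round(float(cleaned))

print(parse_amount('$1.2M'))
print(parse_amount(' $850K '))
```

An empty string is not handled here, so in a real scraper the call would need a guard for missing values.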

Superb! Now, let's calculate the composite __[Housing Affordability Index (HAI)](https://www.frbsf.org/education/publications/doctor-econ/2003/december/housing-affordability-index/)__ for each neighborhood by using this formula:
___
$$ \text{Housing Affordability Index (HAI)} = \frac{\text{median household income}}{\text{qualifying income}} \times 100 $$

Assumptions:  
a. The average house price and household income are equal to the median house price and household income.  
b. The __[effective mortgage rate](https://www.ratehub.ca/best-mortgage-rates/1-year/fixed?scenario=purchase&home_price=1000000&down_payment_percent=0.2&downPayment=200000&approximateMortgageAmount=800000&amount=800000&amortization=25&live_in_property=true&pre_approval=false)__ is 2% per year.  
c. Home buyers make a 20% __[down payment](https://www.canada.ca/en/financial-consumer-agency/services/mortgages/down-payment.html)__.  
d. The maximum monthly mortgage payment is 25% of gross monthly income.  
e. The mortgage is repaid in 360 monthly payments (30 years), which is the number of payments used in the calculation in this notebook.
___
If a neighborhood's Housing Affordability Index (HAI) is higher than 100, a household with the typical income earns more than the qualifying income, so most residents are able to afford a house in the neighborhood; the greater the HAI, the higher the housing affordability. If a neighborhood's HAI is lower than 100, most residents are unable to afford a house in the neighborhood; the lower the HAI, the lower the housing affordability.
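
To make the formula concrete, here is a small worked example in plain Python. The house price and household income are invented round numbers, and the function name `housing_affordability_index` is an assumption for this sketch; the 2% rate, 20% down payment, 25% payment cap and 360 monthly payments follow the calculation used in this notebook.

```python
def housing_affordability_index(house_price, household_income,
                                annual_rate=0.02, down_payment=0.20,
                                n_payments=360, payment_share=0.25):
    # amount borrowed after the down payment
    principal = house_price * (1 - down_payment)
    monthly_rate = annual_rate / 12
    # standard fixed-rate annuity payment
    monthly_payment = principal * monthly_rate / (1 - (1 + monthly_rate) ** -n_payments)
    # annual income at which the payment equals the allowed share of gross income
    qualifying_income = monthly_payment / payment_share * 12
    return household_income / qualifying_income * 100

# invented example: a 1,000,000 house and a 100,000 household income
hai = housing_affordability_index(1_000_000, 100_000)
print(round(hai, 1))
```

With these inputs the index lands well below 100, meaning the example household earns less than the income needed to qualify for the mortgage.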

In [None]:
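# calculate the hai for each neighborhood and save the data into affordability_df
# monthly payment on 80% of the house price (20% down payment) at 2% annual interest over 360 months
mortgage_payment=houseprice_df['Avg House Price']*0.8*(0.02/12)/(1-(1/(1+0.02/12)**360))
# the payment may be at most 25% of gross monthly income, so the qualifying annual income is payment*4*12
qualifying_income=mortgage_payment*4*12
# copy the slice so that adding the HAI column does not modify a view of houseprice_df
affordability_df=houseprice_df.iloc[:,0:2].copy()
affordability_df['HAI']=(houseprice_df['Avg Household Income']/qualifying_income)*100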
affordability_df.head()

Great! Now, let's normalize the Housing Affordability Index (HAI).

In [None]:
# normalize the hai and save the data into norm_affordability_df
norm_affordability_df=data_normalization(affordability_df)
norm_affordability_df.head()

Cool! Let's convert the Housing Affordability Index (HAI) values into integers.

In [None]:
# convert hai into integer
affordability_df['HAI']=affordability_df['HAI'].astype(int)
affordability_df.head()

Awesome!

#### III. Crime rate data
Now, let's get the crimes data by calling a REST API from Toronto Police Service and clean the data.

In [None]:
# import necessary libraries
import requests
from pandas import json_normalize  # pandas.io.json.json_normalize is deprecated since pandas 1.0

# request the crimes data by using toronto police service's api and clean the data
url='https://services.arcgis.com/S9th0jAJ7bqgIRjw/arcgis/rest/services/Neighbourhood_MCI/FeatureServer/0/query?where=1%3D1&outFields=Neighbourhood,Hood_ID,Population,Assault_AVG,AutoTheft_AVG,Homicide_AVG,TheftOver_AVG,BreakandEnter_AVG,Robbery_AVG&outSR=4326&f=json'
results=requests.get(url).json()
crime_data=results['features']
crime_df=json_normalize(crime_data)
crime_df.drop('geometry.rings',axis=1,inplace=True)
crime_df.columns=crime_df.columns.str.replace('attributes.','',regex=False)
crime_df.rename(columns={'Neighbourhood':'Neighborhood','Hood_ID':'ID'},inplace=True)
crime_df['ID']=crime_df['ID'].astype(int)
crime_df=crime_df.sort_values('ID').reset_index(drop=True)
crime_df['Neighborhood']=jobs_df['Neighborhood']
crime_df.head()

Great! Now, let's calculate the __[crime rate](https://www.azcalculator.com/formula/crime-rate-calculator.php)__ for each neighborhood by using this formula:
___
$$ \text{crime rate} = \frac{\text{number of crimes}}{\text{population}} \times 100{,}000\ \text{people} $$
___
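
As a quick illustration of the formula, the sketch below computes the rate per 100,000 people for each crime category and sums the results, which mirrors the overall approach taken in this notebook. The category counts, the population and the function name `crime_rate` are invented for the example.

```python
def crime_rate(number_of_crimes, population, per=100_000):
    # crimes per `per` residents
    return number_of_crimes / population * per

# invented yearly averages for one neighborhood of 30,000 residents
toy_counts = {'Assault': 90.0, 'Robbery': 15.0, 'Break and Enter': 45.0}
toy_population = 30_000

toy_rates = {name: crime_rate(count, toy_population) for name, count in toy_counts.items()}
toy_total_rate = sum(toy_rates.values())
print(toy_rates)
print(toy_total_rate)
```

Scaling to a common population size is what makes a small neighborhood comparable with a large one.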

In [None]:
# calculate the crime rate for each neighborhood
for column in crime_df.columns[3:]:
    crime_df[column]=(crime_df[column]/crime_df['Population'])*100000
crime_df['Crime Rate']=crime_df.iloc[:,3:].sum(axis=1)
crime_df.drop(crime_df.columns[2:-1],axis=1,inplace=True)
crime_df.head()

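Cool! Now, let's normalize the crime rate data for each neighborhood. Because a lower crime rate is better, each score is the lowest crime rate divided by the neighborhood's own crime rate, so the safest neighborhood scores 100.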

In [None]:
# normalize the crime rate data and save the data into norm_crime_df
norm_crime_df=crime_df.copy()
norm_crime_df['Crime Rate']=(norm_crime_df['Crime Rate'].min()/norm_crime_df['Crime Rate'])*100
norm_crime_df.head()

Nice! Now, let's convert the crime rate values into integers.

In [None]:
# convert crime rate into integer
crime_df['Crime Rate']=crime_df['Crime Rate'].astype(int)
crime_df.head()

Nice!
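
The crime rate is normalized differently from the other factors because a smaller value is the desirable one. A minimal sketch with invented rates shows the effect: the lowest rate maps to 100 and a rate twice as high maps to 50. The `toy_` names and values are assumptions for illustration.

```python
import pandas as pd

# invented crime rates per 100,000 people for three neighborhoods
toy_rates = pd.Series([500.0, 1000.0, 2000.0], index=['A', 'B', 'C'], name='Crime Rate')

# inverse scaling: the lowest rate scores 100, higher rates score proportionally less
toy_safety_score = toy_rates.min() / toy_rates * 100
print(toy_safety_score)
```

This formulation divides by each rate, so a neighborhood with a crime rate of exactly zero would need separate handling.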

#### IV. Language spoken most often at home data
Now, let's import and clean the language spoken most often at home data first.

In [None]:
# extract language spoken most often at home data from toronto_df into language_df and clean the dataframe
topic=['Language spoken most often at home']
index=['Neighbourhood Number']
language_df=toronto_df[(toronto_df['Topic'].isin(topic))|toronto_df['Characteristic'].isin(index)]
language_df=language_df.drop('Topic',axis=1).set_index('Characteristic').T.replace(',','',regex=True).astype(int)
language_df.sort_values(index,inplace=True)
language_df.columns=language_df.columns.str.strip()
condition=(language_df.sum()==0)
keywords=('response','language','n.o.s','number')
# drop columns that are empty or whose name matches a keyword (totals and other non-language rows)
for name,is_empty in condition.items():
    if is_empty or any(keyword in name.lower() for keyword in keywords):
        language_df.drop(name,axis=1,inplace=True)
language_df.drop('English and French',axis=1,inplace=True)
language_df.insert(loc=0,column='ID',value=jobs_df['ID'].values) 
language_df=language_df.rename_axis(None,axis=1).reset_index()
language_df=language_df.rename(columns={'index':'Neighborhood','Tagalog (Pilipino, Filipino)':'Tagalog',
                                        'Punjabi (Panjabi)':'Punjabi','Assyrian Neo-Aramaic':'Assyrian',
                                        'Persian (Farsi)':'Persian',
                                        'Min Nan (Chaochow, Teochow, Fukien, Taiwanese)':'Min Nan'})
dummy_df=language_df.copy()
dummy_df.drop(['Neighborhood','ID'],axis=1,inplace=True)
dummy_df=dummy_df.sort_index(axis=1)
dummy_df.insert(0,'ID',language_df['ID'])
dummy_df.insert(0,'Neighborhood',language_df['Neighborhood'])
language_df=dummy_df.copy()
print('This dataframe consists of {} languages!'.format(language_df.shape[1]-2))
language_df.head()

Nice! Now, let's convert the counts for each language into percentages.

In [None]:
# convert the counts to percentage and save the data into percent_language_df
percent_language_df=convert_to_percentage(language_df)
percent_language_df.head()

Cool! Let's get the top 5 languages for each neighborhood.

In [None]:
# get the top 5 languages for each neighborhood and save the data into top5_language_df
top5_language_df=get_top5_elements(language_df,'Language')
top5_language_df.head()

Awesome! Now, let's normalize the language spoken most often at home data.

In [None]:
# normalize the language spoken most often at home data and save the data into norm_language_df
norm_language_df=data_normalization(language_df)
norm_language_df.head()

Great!
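
The cleaning step for this dataset keeps only columns that describe a single language with at least one speaker. The sketch below reproduces that filter on an invented table; the column names and counts are made up, and the keyword list is the one used in the cleaning cell of this section.

```python
import pandas as pd

# invented counts: two real languages, one empty column and three aggregate-style columns
toy_df = pd.DataFrame({'English': [900, 700],
                       'Single responses': [1000, 800],
                       'Aboriginal languages': [0, 0],
                       'Cree, n.o.s.': [3, 1],
                       'Inuktitut': [0, 0],
                       'Mandarin': [100, 100]})

keywords = ('response', 'language', 'n.o.s', 'number')

# keep a column only if it has speakers and its name matches none of the keywords
kept = [column for column in toy_df.columns
        if toy_df[column].sum() != 0
        and not any(keyword in column.lower() for keyword in keywords)]
toy_language_df = toy_df[kept]
print(kept)
```

Selecting the columns to keep, instead of dropping columns one at a time, also avoids modifying the dataframe while looping over it.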

#### V. Boundaries of neighborhoods data
Now let's import and clean the boundary data first.

In [None]:
# import boundary data into boundary_df and clean the dataframe
url='https://ckan0.cf.opendata.inter.prod-toronto.ca/download_resource/a083c865-6d60-4d1d-b6c6-b0c8a85f9c15?format=geojson&projection=4326'
toronto_geojson=requests.get(url).json()
boundary_data=toronto_geojson['features']
boundary_df=json_normalize(boundary_data)
boundary_df=pd.DataFrame({'Neighborhood':boundary_df['properties.AREA_NAME'],
                          'ID':boundary_df['properties.AREA_SHORT_CODE'],
                          'Latitude':boundary_df['properties.LATITUDE'],
                          'Longitude':boundary_df['properties.LONGITUDE']})
boundary_df=boundary_df.sort_values('ID').reset_index(drop=True)
boundary_df.iloc[:,0]=jobs_df.iloc[:,0]
boundary_df.head()

Nice! Now, let's visualize the boundary of each neighborhood on a map.

In [None]:
# import necessary libraries
import geocoder
import plotly.express as px

# visualize the boundary of each neighborhood on toronto_map
g=geocoder.osm('Leaside,Toronto,Ontario')
toronto_map=px.choropleth_mapbox(norm_jobs_df,
                                 geojson=toronto_geojson,
                                 locations='ID',
                                 featureidkey='properties.AREA_SHORT_CODE',
                                 mapbox_style='carto-positron',
                                 hover_data=['Neighborhood'],
                                 zoom=10,
                                 center={'lat':g.latlng[0],'lon':g.latlng[1]},
                                 opacity=0.5)
hovertemplate='<br>%{customdata}'
toronto_map.data[0]['hovertemplate']=hovertemplate
toronto_map.update_layout(margin={'r':0,'t':0,'l':0,'b':0})
toronto_map

Superb!
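
The boundary dataframe in this section depends on `json_normalize` turning the nested `properties` of each GeoJSON feature into dotted column names. Here is a minimal sketch with two invented features, assuming pandas 1.0 or later for `pd.json_normalize`. The area names, codes and coordinates are made up, and a point geometry stands in for the polygon boundaries of the real file.

```python
import pandas as pd

# two invented features; the real file stores polygon boundaries
toy_geojson = {
    'type': 'FeatureCollection',
    'features': [
        {'type': 'Feature',
         'properties': {'AREA_NAME': 'Toy East (2)', 'AREA_SHORT_CODE': 2,
                        'LATITUDE': 43.70, 'LONGITUDE': -79.35},
         'geometry': {'type': 'Point', 'coordinates': [-79.35, 43.70]}},
        {'type': 'Feature',
         'properties': {'AREA_NAME': 'Toy West (1)', 'AREA_SHORT_CODE': 1,
                        'LATITUDE': 43.65, 'LONGITUDE': -79.50},
         'geometry': {'type': 'Point', 'coordinates': [-79.50, 43.65]}},
    ],
}

# nested keys become dotted column names such as 'properties.AREA_NAME'
flat_df = pd.json_normalize(toy_geojson['features'])

toy_boundary_df = (flat_df[['properties.AREA_NAME', 'properties.AREA_SHORT_CODE',
                            'properties.LATITUDE', 'properties.LONGITUDE']]
                   .rename(columns={'properties.AREA_NAME': 'Neighborhood',
                                    'properties.AREA_SHORT_CODE': 'ID',
                                    'properties.LATITUDE': 'Latitude',
                                    'properties.LONGITUDE': 'Longitude'})
                   .sort_values('ID')
                   .reset_index(drop=True))
print(toy_boundary_df)
```

The same dotted path, `properties.AREA_SHORT_CODE`, is what the choropleth uses as `featureidkey` to match dataframe rows to map shapes.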

#### VI. Popular food data
Now, let's get the popular food data for each neighborhood by using Foursquare API.

In [None]:
# get the popular food data using foursquare api and save the data into food_df
# the credentials are read from environment variables instead of being hardcoded in the notebook
# (FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET are placeholder names, set them to your own keys)
import os
print('Progress:')
food_df=pd.DataFrame(columns=['ID','Popular Food'])
for i in range(len(boundary_df['ID'])):
    url='https://api.foursquare.com/v2/venues/explore'
    params=dict(client_id=os.environ['FOURSQUARE_CLIENT_ID'],
                client_secret=os.environ['FOURSQUARE_CLIENT_SECRET'],
                v='20201117',
                ll='{},{}'.format(boundary_df['Latitude'][i],boundary_df['Longitude'][i]),
                categoryId='4d4b7105d754a06374d81259',
                radius=5000,
                time='any',
                day='any',
                limit=50)
    results=requests.get(url=url,params=params).json()['response']['groups'][0]['items']
    food_data=[]
    for j in range(len(results)):
        food_data.append(results[j]['venue']['categories'][0]['name'])
    temp_df=pd.DataFrame({'ID':boundary_df.iloc[i,1],'Popular Food':food_data})
    food_df=pd.concat([food_df,temp_df],ignore_index=True)
    print('.',end='')
print('...done!')
temp_df=pd.get_dummies(data=food_df['Popular Food'],prefix_sep="")
temp_df.insert(0,'ID',food_df['ID'])
food_df=temp_df.groupby('ID',as_index=False).sum()
food_df.insert(0,'Neighborhood',boundary_df['Neighborhood'])
food_df.drop('Food',axis=1,inplace=True)
print('This dataframe consists of {} food categories!'.format(food_df.shape[1]-2))
food_df.head()

Awesome! Let's convert the counts for each food category into percentages.

In [None]:
# convert counts into percentage and save the data into percent_food_df
percent_food_df=convert_to_percentage(food_df)
percent_food_df.head()

Great! Now, let's get the top 5 food categories for each neighborhood.

In [None]:
# get the top 5 food categories and save the data into top5_food_df
top5_food_df=get_top5_elements(food_df,'Food')
top5_food_df.head()

Cool! Now, let's normalize the popular food data.

In [None]:
# normalize the popular food data and save the data into norm_food_df
norm_food_df=data_normalization(food_df)
norm_food_df.head()

Nice!
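
The step that turns a long list of venues into one row of category counts per neighborhood can be hard to picture, so here is a minimal sketch on invented venues. It follows the same idea as the notebook, one-hot encoding the category and then summing by `ID`; the venue list and the `toy_` names are assumptions for illustration.

```python
import pandas as pd

# invented venues: one row per venue, tagged with the neighborhood id it was found near
toy_venues = pd.DataFrame({'ID': [1, 1, 1, 2, 2],
                           'Popular Food': ['Bakery', 'Pizza Place', 'Bakery',
                                            'Sushi Restaurant', 'Bakery']})

# one indicator column per category (dtype=int keeps the values as 0/1 integers)
one_hot = pd.get_dummies(toy_venues['Popular Food'], dtype=int)
one_hot.insert(0, 'ID', toy_venues['ID'])

# summing the indicators per neighborhood gives the number of venues in each category
toy_food_counts = one_hot.groupby('ID', as_index=False).sum()
print(toy_food_counts)
```

A category that never appears near a neighborhood simply gets a count of 0 in that row.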

### 2.4 All data for Toronto's neighborhoods recommender system<a name="all"></a>
All data needed to build a Toronto's neighborhoods recommender system are listed below. For easier replication of this project, I uploaded all data to my __[GitHub repository](https://github.com/titus-chin/Toronto-Neighborhoods-Recommender-System)__.
* percent_jobs_df
* top5_jobs_df
* norm_jobs_df
* affordability_df
* norm_affordability_df
* crime_df
* norm_crime_df
* percent_language_df
* top5_language_df
* norm_language_df
* toronto_geojson
* percent_food_df
* top5_food_df
* norm_food_df

With these data, we are ready to build a Toronto's neighborhoods recommender system! Let's get started!

## 3. Methodology<a name="methodology"></a>
I want the recommender system to be a web-based interactive dashboard so that more people can access it. To achieve this, I will first build a recommender system that automatically ranks the neighborhoods based on our preferences. After that, I will use some attractive plots to visualize the results. Then I will convert the recommender system into an interactive dashboard. If everything works, I will deploy the interactive dashboard on Heroku and guess what? A Toronto's neighborhoods recommender system is born!

### 3.1 Build a recommender system<a name="system"></a>
To build a recommender system, we will need to define some functions to automatically rank the neighborhoods based on our preferences. Let's import all the data needed first.

In [None]:
# import necessary library
import pandas as pd
pd.set_option('display.max_rows',None)
pd.set_option('display.max_columns',None)

# read all data required
percent_jobs_df=pd.read_csv('https://raw.githubusercontent.com/titus-chin/Toronto-Neighborhoods-Recommender-System/main/Data/percent_jobs.csv')
top5_jobs_df=pd.read_csv('https://raw.githubusercontent.com/titus-chin/Toronto-Neighborhoods-Recommender-System/main/Data/top5_jobs.csv')
norm_jobs_df=pd.read_csv('https://raw.githubusercontent.com/titus-chin/Toronto-Neighborhoods-Recommender-System/main/Data/norm_jobs.csv')
affordability_df=pd.read_csv('https://raw.githubusercontent.com/titus-chin/Toronto-Neighborhoods-Recommender-System/main/Data/affordability.csv')
norm_affordability_df=pd.read_csv('https://raw.githubusercontent.com/titus-chin/Toronto-Neighborhoods-Recommender-System/main/Data/norm_affordability.csv')
crime_df=pd.read_csv('https://raw.githubusercontent.com/titus-chin/Toronto-Neighborhoods-Recommender-System/main/Data/crime.csv')
norm_crime_df=pd.read_csv('https://raw.githubusercontent.com/titus-chin/Toronto-Neighborhoods-Recommender-System/main/Data/norm_crime.csv')
percent_language_df=pd.read_csv('https://raw.githubusercontent.com/titus-chin/Toronto-Neighborhoods-Recommender-System/main/Data/percent_language.csv')
top5_language_df=pd.read_csv('https://raw.githubusercontent.com/titus-chin/Toronto-Neighborhoods-Recommender-System/main/Data/top5_language.csv')
norm_language_df=pd.read_csv('https://raw.githubusercontent.com/titus-chin/Toronto-Neighborhoods-Recommender-System/main/Data/norm_language.csv')
percent_food_df=pd.read_csv('https://raw.githubusercontent.com/titus-chin/Toronto-Neighborhoods-Recommender-System/main/Data/percent_food.csv')
top5_food_df=pd.read_csv('https://raw.githubusercontent.com/titus-chin/Toronto-Neighborhoods-Recommender-System/main/Data/top5_food.csv')
norm_food_df=pd.read_csv('https://raw.githubusercontent.com/titus-chin/Toronto-Neighborhoods-Recommender-System/main/Data/norm_food.csv')
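The thirteen `read_csv` calls above share one URL prefix, so the same loading step could also be written as a loop. The sketch below is only one possible arrangement: the helper names `dataset_url` and `load_datasets` and the dictionary layout are my own assumptions, while the base URL and file names are taken from the cell above.

```python
import pandas as pd

# Base URL and file names copied from the loading cell above.
BASE_URL = ("https://raw.githubusercontent.com/titus-chin/"
            "Toronto-Neighborhoods-Recommender-System/main/Data/")
DATASET_NAMES = [
    "percent_jobs", "top5_jobs", "norm_jobs",
    "affordability", "norm_affordability",
    "crime", "norm_crime",
    "percent_language", "top5_language", "norm_language",
    "percent_food", "top5_food", "norm_food",
]

def dataset_url(name, base_url=BASE_URL):
    """Build the raw GitHub URL of one CSV file (hypothetical helper)."""
    return f"{base_url}{name}.csv"

def load_datasets(names=DATASET_NAMES, reader=pd.read_csv):
    """Return a dict mapping each dataset name to whatever `reader` loads.

    `reader` defaults to pandas.read_csv; it is a parameter so the
    function can be exercised without network access.
    """
    return {name: reader(dataset_url(name)) for name in names}
```

With this in place, `data = load_datasets()` followed by `norm_jobs_df = data["norm_jobs"]` would give the same DataFrame as the explicit call above.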

We have jobs, affordability, crime rate, language and food data. So, we will rank the neighborhoods based on these 5 factors. Let's define a function to get the score of each neighborhood based on our choices.

In [None]:
# import necessary libraries
import ipywidgets as widgets
from ipywidgets import interactive,HBox,VBox
import numpy as np

# define a function to get the score of each neighborhood based on our choices
def get_score(jobs,language,food,job_weightage,hai_weightage,safety_weightage,language_weightage,food_weightage):
    # copy the slice so adding the Score column does not write into a view of norm_jobs_df
    result_df=norm_jobs_df.iloc[:,0:2].copy()
    temp_jobs=0
    for job in jobs:
        temp_jobs=temp_jobs+norm_jobs_df[job]
    if len(jobs)>0:
        temp_jobs=temp_jobs/len(jobs)   
    temp_language=0
    for lan in language:
        temp_language=temp_language+norm_language_df[lan]
    if len(language)>0:
        temp_language=temp_language/len(language)
    temp_food=0
    for f in food:
        temp_food=temp_food+norm_food_df[f]
    if len(food)>0:
        temp_food=temp_food/len(food)        
    result_df['Score']=(job_weightage*temp_jobs+\
                        hai_weightage*norm_affordability_df['HAI']+\
                        safety_weightage*norm_crime_df['Crime Rate']+\
                        language_weightage*temp_language+\
                        food_weightage*temp_food)/\
                       (job_weightage+hai_weightage+safety_weightage+language_weightage+food_weightage)
    result_df['Score']=result_df['Score'].round(2)
    result_df.sort_values('Score',ascending=False,inplace=True)
    # rank from 1 to the number of neighborhoods instead of hard-coding 140
    result_df.insert(loc=0,column='Rank',value=np.arange(1,len(result_df)+1))
    result_df.reset_index(drop=True,inplace=True)
    return display(result_df)

# set up widgets to filter the results
item=[widgets.SelectMultiple(options=list(norm_jobs_df.columns[2:]),description='Jobs',
                             value=('Professional','Management')),
      widgets.SelectMultiple(options=list(norm_language_df.columns[2:]),description='Language',
                             value=('English','Mandarin')),
      widgets.SelectMultiple(options=list(norm_food_df.columns[2:]),description='Food',
                             value=('Fish & Chips Shop','Pizza Place')),
      widgets.IntSlider(min=0,max=5,value=4,step=1,description='Job Weightage',continuous_update=False),
      widgets.IntSlider(min=0,max=5,value=5,step=1,description='Affordability Weightage',continuous_update=False),
      widgets.IntSlider(min=0,max=5,value=4,step=1,description='Safety Weightage',continuous_update=False),
      widgets.IntSlider(min=0,max=5,value=4,step=1,description='Language Weightage',continuous_update=False),
      widgets.IntSlider(min=0,max=5,value=3,step=1,description='Food Weightage',continuous_update=False)]

# use interactive to call the get_score function to rank the neighborhoods by score
intwidget=interactive(get_score,jobs=item[0],language=item[1],food=item[2],job_weightage=item[3],
                      hai_weightage=item[4],safety_weightage=item[5],language_weightage=item[6],
                      food_weightage=item[7])
intoutput=intwidget.children[-1]
left=VBox([item[3],item[4],item[5],item[6],item[7]])
middle=VBox([item[0],item[2],])
right=VBox([item[1]])
box=HBox([left,middle,right])
display(VBox([box,intoutput]))
intwidget.update()

Superb! I set the default job, affordability, safety, language and food weightages to 4, 5, 4, 4 and 3 respectively. The default jobs are professional and management, the default languages are English and Mandarin, and the default food venues are fish & chips shop and pizza place. With these settings, the top-ranked neighborhood returned by the recommender system is Waterfront Communities-The Island, with a score of 59.31%. The score is the weighted average of the five normalized factor scores, so the higher the score, the better the neighborhood matches our choices. If we change the weightage of each factor or choose other jobs, languages or food, the ranking changes and a different neighborhood may come out on top. Therefore, Toronto's neighborhoods recommender system is born!
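To make the scoring rule concrete, here is a small worked example on made-up numbers. It is a sketch, not the notebook's data: the three neighborhoods "A", "B" and "C", their factor scores (assumed to be on a 0 to 100 scale) and the helper name `weighted_score` are all hypothetical. Only the weights 4, 5, 4, 4, 3 and the weighted-average formula come from `get_score` above.

```python
import pandas as pd

def weighted_score(factor_scores, weights):
    """Weighted average of per-factor scores, rounded to 2 decimals.

    factor_scores: DataFrame with one column per factor.
    weights: dict mapping each factor name to its weightage.
    """
    total = sum(weights.values())
    if total == 0:
        # Guard added for this sketch; with all weights at zero the
        # average is undefined.
        raise ValueError("at least one weightage must be greater than zero")
    weighted_sum = sum(weights[name] * factor_scores[name] for name in weights)
    return (weighted_sum / total).round(2)

# Hypothetical normalized scores for three toy neighborhoods.
toy = pd.DataFrame(
    {
        "jobs": [80.0, 40.0, 60.0],
        "affordability": [20.0, 90.0, 50.0],
        "safety": [30.0, 70.0, 60.0],
        "language": [90.0, 50.0, 40.0],
        "food": [70.0, 30.0, 50.0],
    },
    index=["A", "B", "C"],
)
weights = {"jobs": 4, "affordability": 5, "safety": 4, "language": 4, "food": 3}

scores = weighted_score(toy, weights)
ranking = scores.sort_values(ascending=False)
print(ranking)  # B: 59.0, A: 55.5, C: 52.0
```

For neighborhood B, the calculation is (4×40 + 5×90 + 4×70 + 4×50 + 3×30) / 20 = 1180 / 20 = 59.0. Because affordability carries the largest weight, B ranks first even though A scores higher on jobs, language and food.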

### 3.2 Data visualization<a name="visualization"></a>
Now, we will use some attractive plots to visualize the results. We use the __[Plotly](https://plotly.com/python/)__ library for the data visualization. Let's show the score of each neighborhood on a map first.

In [None]:
# import necessary libraries
import requests
import geocoder
import plotly.express as px

# get the toronto boundary data into toronto_geojson
url='https://raw.githubusercontent.com/titus-chin/Toronto-Neighborhoods-Recommender-System/main/Data/boundary.geojson'
toronto_geojson=requests.get(url).json()
g=geocoder.osm('Leaside,Toronto,Ontario')

# define a function to get the score of each neighborhood based on our choices
def get_score(jobs,language,food,job_weightage,hai_weightage,safety_weightage,language_weightage,food_weightage):
    # copy the slice so adding the Score column does not write into a view of norm_jobs_df
    result_df=norm_jobs_df.iloc[:,0:2].copy()
    temp_jobs=0
    for job in jobs:
        temp_jobs=temp_jobs+norm_jobs_df[job]
    if len(jobs)>0:
        temp_jobs=temp_jobs/len(jobs)   
    temp_language=0
    for lan in language:
        temp_language=temp_language+norm_language_df[lan]
    if len(language)>0:
        temp_language=temp_language/len(language)
    temp_food=0
    for f in food:
        temp_food=temp_food+norm_food_df[f]
    if len(food)>0:
        temp_food=temp_food/len(food)        
    result_df['Score']=(job_weightage*temp_jobs+\
                        hai_weightage*norm_affordability_df['HAI']+\
                        safety_weightage*norm_crime_df['Crime Rate']+\
                        language_weightage*temp_language+\
                        food_weightage*temp_food)/\
                       (job_weightage+hai_weightage+safety_weightage+language_weightage+food_weightage)
    result_df['Score']=result_df['Score'].round(2)
    result_df.sort_values('Score',ascending=False,inplace=True)
    # rank from 1 to the number of neighborhoods instead of hard-coding 140
    result_df.insert(loc=0,column='Rank',value=np.arange(1,len(result_df)+1))
    result_df.reset_index(drop=True,inplace=True)
    
    # refine the function to visualize the score on a map
    toronto_map=px.choropleth_mapbox(result_df,
                                     geojson=toronto_geojson,
                                     color='Score',
                                     color_continuous_scale='teal',
                                     locations='ID',
                                     featureidkey='properties.AREA_SHORT_CODE',
                                     mapbox_style='carto-positron',
                                     hover_data=['Rank','Neighborhood','Score'],
                                     zoom=10,
                                     center={'lat':g.latlng[0],'lon':g.latlng[1]},
                                     opacity=0.5)
    hovertemplate='<br>Rank: %{customdata[0]}'\
                  '<br>%{customdata[1]}'\
                  '<br>Score: %{customdata[2]:.3s}%'
    toronto_map.data[0]['hovertemplate']=hovertemplate

    # draw selected neighborhood
    selected_geojson=toronto_geojson.copy()
    for feature in selected_geojson['features']:
        if feature['properties']['AREA_SHORT_CODE']==result_df.iloc[0,2]:
            selected_geojson['features']=[feature]
    data_selected={'ID':result_df.iloc[0,2],
                   'Neighborhood':result_df.iloc[0,1],
                   'default_value':[''],
                   'Rank':result_df.iloc[0,0],
                   'Score':result_df.iloc[0,3]}
    fig_temp=px.choropleth_mapbox(data_selected,
                                  geojson=selected_geojson,
                                  locations='ID',
                                  color='default_value',
                                  featureidkey='properties.AREA_SHORT_CODE',
                                  mapbox_style="carto-positron",
                                  hover_data=['Rank','Neighborhood','Score'],
                                  zoom=10,
                                  center={'lat':g.latlng[0],'lon':g.latlng[1]},
                                  opacity=1)
    toronto_map.add_trace(fig_temp.data[0])
    hovertemplate='<br>Rank: %{customdata[0]}'\
                  '<br>%{customdata[1]}'\
                  '<br>Score: %{customdata[2]:.3s}%'\
                  '<extra></extra>'
    toronto_map.data[1]['hovertemplate']=hovertemplate  
    toronto_map.update_layout(margin={'r':0,'t':0,'l':0,'b':0},coloraxis_showscale=False,showlegend=False)
    return toronto_map.show()

# use interactive to call the get_score function to rank the neighborhoods by score
intwidget=interactive(get_score,jobs=item[0],language=item[1],food=item[2],job_weightage=item[3],
                      hai_weightage=item[4],safety_weightage=item[5],language_weightage=item[6],
                      food_weightage=item[7])
intoutput=intwidget.children[-1]
left=VBox([item[3],item[4],item[5],item[6],item[7]])
middle=VBox([item[0],item[2],])
right=VBox([item[1]])
box=HBox([left,middle,right])
display(VBox([box,intoutput]))
intwidget.update()

Awesome! The choropleth map shades each neighborhood according to its score: the darker the color, the higher the score. The choropleth map produced by Plotly is also highly interactive. As we hover over a neighborhood, its rank, name and score are displayed. Now, let's visualize the top 5 most common jobs, languages and food of the top-ranked neighborhood returned by the recommender system, which is Waterfront Communities-The Island.
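Before that, one detail of the map cell is worth isolating: the top-ranked neighborhood is highlighted by drawing a second trace from a GeoJSON that contains only that neighborhood's feature. Below is a minimal sketch of that filtering step on a tiny made-up FeatureCollection. The helper `select_feature` and the toy area codes are hypothetical. Only the `AREA_SHORT_CODE` property name comes from the code above.

```python
def select_feature(geojson, area_code, id_key="AREA_SHORT_CODE"):
    """Return a new FeatureCollection holding only the matching feature(s).

    The input dict is left untouched: a new top-level dict is built with a
    freshly filtered 'features' list.
    """
    matches = [
        feature
        for feature in geojson["features"]
        if feature["properties"].get(id_key) == area_code
    ]
    return {**geojson, "features": matches}

# Toy data: three features with null geometry (allowed by the GeoJSON spec).
toy_geojson = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"AREA_SHORT_CODE": 1}, "geometry": None},
        {"type": "Feature", "properties": {"AREA_SHORT_CODE": 2}, "geometry": None},
        {"type": "Feature", "properties": {"AREA_SHORT_CODE": 3}, "geometry": None},
    ],
}

selected = select_feature(toy_geojson, 2)
print(len(selected["features"]), len(toy_geojson["features"]))  # 1 3
```

Note that the comparison uses `==`, so the code passed in must have the same type as the value stored in the GeoJSON (for instance, both integers or both strings).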

In [None]:
# import necessary library
import plotly.graph_objs as go

# set layout
fig_layout_defaults=dict(plot_bgcolor='#F9F9F9',paper_bgcolor='#F9F9F9')

# visualize the top5 jobs, languages and food of the top 1 neighborhood
top5=list(top5_jobs_df.iloc[76,2:].values)+list(top5_language_df.iloc[76,2:].values)+\
     list(top5_food_df.iloc[76,2:].values)
temp_jobs=list(percent_jobs_df.iloc[76,2:].values)
temp_jobs.sort(reverse=True)
temp_language=list(percent_language_df.iloc[76,2:].values)
temp_language.sort(reverse=True)
temp_food=list(percent_food_df.iloc[76,2:].values)
temp_food.sort(reverse=True)
percent=temp_jobs[0:5]+temp_language[0:5]+temp_food[0:5]
factor=['Jobs','Jobs','Jobs','Jobs','Jobs','Languages','Languages','Languages','Languages','Languages','Food',
       'Food','Food','Food','Food']
result_df=pd.DataFrame({'Factor':factor,'Top5':top5,'Percent':percent})
top5_fig=px.sunburst(result_df,
                     path=['Factor','Top5'],
                     values='Percent',
                     color='Percent',
                     color_continuous_scale='teal')
hovertemplate='<br>%{label}'\
              '<br>%{value:.3s}%'
top5_fig.data[0]['hovertemplate']=hovertemplate
top5_fig.update_layout(margin={'r':0,'t':0,'l':0,'b':0},coloraxis_showscale=False,showlegend=False,
                       **fig_layout_defaults)
top5_fig

Superb! The sunburst figure shows the top 5 most common jobs, languages and food of Waterfront Communities-The Island clearly and effectively. As you hover over an item in the figure, it displays information such as its label and percentage. The color scale depends on the percentage of each item: the darker the color, the higher the percentage. One more advantage of the sunburst figure is that you can click on a category you are interested in and it will display only that category. For example, if you're interested in the jobs category, you can simply click on 'Jobs', and the sunburst figure will transform to show the jobs information only. Once you're done, click 'Jobs' again and the figure returns to its original state. Comparing our choices with the sunburst figure suggests that the recommender system has done a pretty good job. Waterfront Communities-The Island has almost all the elements that we want, such as professional jobs, English and Mandarin languages, and pizza places. Now, let's visualize the crime rate and affordability index of Waterfront Communities-The Island.
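One note before moving on: the sunburst cell reads row 76 directly, which only points at Waterfront Communities-The Island as long as the tables keep their current row order. A hedged alternative is to look the row up by name and let pandas pick the largest values, which also keeps each label attached to its percentage. The sketch below uses a made-up table. The helper `top_n_for`, the toy job columns and the assumption that the first two columns are named 'Neighborhood' and 'ID' are mine, not confirmed by the data files.

```python
import pandas as pd

def top_n_for(percent_df, neighborhood, n=5,
              name_col="Neighborhood", id_cols=("Neighborhood", "ID")):
    """Return the n largest percentage columns for one neighborhood.

    The result is a Series indexed by category name, sorted descending.
    """
    rows = percent_df.loc[percent_df[name_col] == neighborhood]
    if rows.empty:
        raise KeyError(f"neighborhood not found: {neighborhood}")
    # Drop the identifier columns, keep only the numeric percentages.
    values = rows.iloc[0].drop(list(id_cols)).astype(float)
    return values.nlargest(n)

# Hypothetical percentages for two toy neighborhoods.
toy_percent = pd.DataFrame({
    "Neighborhood": ["A", "B"],
    "ID": [1, 2],
    "Sales": [30.0, 10.0],
    "Management": [20.0, 40.0],
    "Trades": [50.0, 50.0],
})

top = top_n_for(toy_percent, "B", n=2)
print(top)  # Trades 50.0, Management 40.0
```

The index of the returned Series could feed the `Top5` column and its values the `Percent` column of the sunburst DataFrame.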

In [None]:
# compute the average crime rate and affordability index for toronto city
avg_crime=int(crime_df['Crime Rate'].mean())
avg_affordability=int(affordability_df['HAI'].mean())

# visualize the crime rate and affordability index of the top 1 neighborhood
indicator=go.Figure()
indicator.add_trace(go.Indicator(mode='number+delta',
                                 value=crime_df.iloc[76,2],
                                 title={'text':'Crime Rate'},
                                 delta={'reference':avg_crime,'relative':True,'increasing_color':'#FF4136',
                                        'decreasing_color':'#3D9970'},
                                 domain={'x':[0.4,1],'y':[0,1]}))
indicator.add_trace(go.Indicator(mode='number+delta',
                                 value=affordability_df.iloc[76,2],
                                 title={'text':'Affordability Index'},
                                 delta={'reference':avg_affordability,'relative':True},
                                 domain={'x':[0,0.6],'y':[0,1]}))
indicator.update_layout(margin={'r':0,'t':0,'l':0,'b':0},**fig_layout_defaults)
indicator.show()

Great! The indicator is a simple and effective way to convey numerical information. A green arrow means a good result and a red arrow means a bad result. The affordability index of Waterfront Communities-The Island is 116, which is 43% higher than the Toronto average. However, the crime rate of this neighborhood is 1960 per 100,000 people, which is 60% higher than the average. These indicators mean that Waterfront Communities-The Island is highly affordable but less safe than the average neighborhood.
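The percentages shown by the indicators are relative changes against the city average. As far as I understand Plotly's `relative` delta option, it reports (value − reference) / reference. The sketch below reproduces that arithmetic. The reference values 1225 and 81 are not read from the data: they are back-computed from the 60% and 43% figures quoted above, so treat them as illustrative assumptions (the cell above also truncates the averages with `int`).

```python
def relative_delta(value, reference):
    """Relative change of value against reference, as a fraction."""
    if reference == 0:
        raise ValueError("reference must be non-zero")
    return (value - reference) / reference

# Assumed averages, back-computed from the percentages in the text.
assumed_avg_crime = 1225
assumed_avg_affordability = 81

crime_delta = relative_delta(1960, assumed_avg_crime)
affordability_delta = relative_delta(116, assumed_avg_affordability)

print(f"{crime_delta:.0%}")          # 60%
print(f"{affordability_delta:.0%}")  # 43%
```

A positive delta is good news for the affordability index but bad news for the crime rate, which is why the crime indicator swaps the default colors with `increasing_color` and `decreasing_color`.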

### 3.3 Build an interactive dashboard<a name="dashboard"></a>
In this part, I will convert the recommender system into an interactive dashboard. I use the __[Dash](https://dash.plotly.com/)__ library to create the dashboard. Since Dash's output won't display in Jupyter Notebook, I will just briefly explain the steps to build an interactive dashboard and attach a screenshot of the result at the end of this part. I have uploaded the full code for building the interactive dashboard to my __[GitHub repository](https://github.com/titus-chin/Toronto-Neighborhoods-Recommender-System)__, so feel free to check it out.

### 3.4 Deploy dashboard on Heroku<a name="heroku"></a>
In this part, I will deploy the interactive dashboard on __[Heroku](https://www.heroku.com/)__, and then Toronto's neighborhoods recommender system is born!

## 4. Results<a name="results"></a>

## 5. Discussion<a name="discussion"></a>

## 6. Conclusion<a name="conclusion"></a>

Thank you for reading this notebook! Feel free to read this __[article]()__ to learn how to create this recommender system. Check out the __[GitHub repository](https://github.com/titus-chin/Toronto-Neighborhoods-Recommender-System)__ for the source code. You can access the recommender system via this __[link]()__.

## Author  
__[Titus Chin Jun Hong](https://www.linkedin.com/in/titus-chin-a17ba41bb?lipi=urn%3Ali%3Apage%3Ad_flagship3_profile_view_base_contact_details%3BsLjmV3x9QaG9sLprGF5OBA%3D%3D)__   
**27 November 2020**