# Modeling Local Demographics based on Local Venues in Toronto, Canada

### This notebook is my submission for the final project in the IBM Applied Data Science Capstone course. For this project I have chosen to experiment with modeling and predicting local demographics from local venue data for neighborhoods in the city of Toronto, Canada.

*By Andrew Dahlstrom*

*6/14/2019*

### Introduction

Population demographic maps are useful tools for many humanitarian efforts, including managing disease outbreaks, water scarcity and disaster relief, and planning the expansion of electrical grids and of health or education services. Demographic data is not always readily available, so any method that improves the accuracy of population maps could be useful to these efforts. The goal of this project is to explore how accurately local venue data can predict the demographic data for a neighborhood (organized by postal code), using machine learning techniques to build a prediction model.

### Data

The city of Toronto was selected because detailed, recent demographic data is publicly available and a large collection of venue data is available for each neighborhood. The data for this project was collected from the following sources:

* Toronto postal code data can be found on [Wikipedia](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M).
* The demographic data for Toronto has been collected from the [2016 Census](https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/hlt-fst/pd-pl/Table.cfm?Lang=Eng&T=1201&S=22&O=A).
* The local venue data will be retrieved from the [FourSquare API](https://developer.foursquare.com/).
* Latitude and longitude geospatial data for postal codes is provided by the IBM course website.

The first step is to create a data frame of postal codes for the city of Toronto using a web scraper in order to organize the venue data into smaller communities.

In [1]:
from bs4 import BeautifulSoup
import requests
import numpy as np
import pandas as pd

In [2]:
# Scrape text from wikitable online and load into a Pandas data frame

source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(source, 'lxml')
#print(soup.prettify())

for table in soup.find_all('table', class_= 'wikitable'):
    neightable = []
    
    for row in table.find_all('tr'):
        neighrow = []
        
        for data in row.find_all('td'):
            neighrow.append(data.text.rstrip('\n'))
        
        neightable.append(neighrow)

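# Clean data: drop the first row (the header row has no <td> cells) and remove unassigned boroughs

neighdf = pd.DataFrame(neightable, columns = ['PostalCode', 'Borough', 'Neighborhood'])
neighdf.drop(index=0, inplace=True)
todrop = neighdf[neighdf['Borough'] == "Not assigned"].index
neighdf.drop(todrop, inplace=True)
# reset_index returns a new frame unless inplace=True, so store the result
neighdf.reset_index(drop=True, inplace=True)
neighdf.head(10)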

,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights
5,M6A,North York,Lawrence Manor
6,M7A,Queen's Park,Not assigned
7,M9A,Etobicoke,Islington Avenue
8,M1B,Scarborough,Rouge
9,M1B,Scarborough,Malvern


To clean the Toronto postal code data, we need to merge neighborhoods that share the same postal code and give unnamed neighborhoods the name of their borough.

In [3]:
# Replace unassigned neighborhoods or neighborhoods with incorrect names with their borough name 

for index, row in neighdf.iterrows(): 
    s = neighdf.at[index, 'Neighborhood']
    if s == "Not assigned" or len(s.split()) > 4:
        neighdf.at[index, 'Neighborhood'] = neighdf.at[index, 'Borough']

# Merge neighborhoods with same postalcode

neighdf = neighdf.groupby(['PostalCode','Borough'])['Neighborhood'].apply(', '.join).reset_index()
neighdf.head()       

,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
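
The same cleanup can also be written without an explicit row loop. The sketch below is only an illustration on a small hand-made frame (`toy` and `merged` are names introduced here, not part of the notebook); it assumes the same two rules as the cell above: fall back to the borough name when the neighborhood is "Not assigned" or has more than four words, then join neighborhoods that share a postal code.

```python
import pandas as pd

# Toy stand-in for the scraped table (values are illustrative only)
toy = pd.DataFrame({
    'PostalCode': ['M1B', 'M1B', 'M7A', 'M3A'],
    'Borough': ['Scarborough', 'Scarborough', "Queen's Park", 'North York'],
    'Neighborhood': ['Rouge', 'Malvern', 'Not assigned', 'Parkwoods'],
})

# Flag names that are unassigned or look malformed (more than four words)
bad_name = (toy['Neighborhood'] == 'Not assigned') | (toy['Neighborhood'].str.split().str.len() > 4)

# Replace the flagged names with the borough name from the same row
toy['Neighborhood'] = toy['Neighborhood'].mask(bad_name, toy['Borough'])

# Join the neighborhoods that share a postal code into one row
merged = (toy.groupby(['PostalCode', 'Borough'])['Neighborhood']
             .apply(', '.join)
             .reset_index())
print(merged)
```

Because `groupby` sorts its keys, the result comes back ordered by postal code, which matches the ordering seen in the output above.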


Next we download the CSV file linked from the course website, which holds the latitude and longitude coordinates for each postal code.

In [4]:
import io
geourl="http://cocl.us/Geospatial_data"
s = requests.get(geourl).content
geodata = pd.read_csv(io.StringIO(s.decode('utf-8')))
geodata.head(5)

,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Add the latitude and longitude data to the neighborhood dataframe.

In [5]:
for index, row in neighdf.iterrows():
    
    for i, r in geodata.iterrows():
        if neighdf.at[index, 'PostalCode'] == geodata.at[i, 'Postal Code']:
            neighdf.at[index, 'Latitude'] = geodata.at[i, 'Latitude']
            neighdf.at[index, 'Longitude'] = geodata.at[i, 'Longitude']

neighdf.head()

,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
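
The nested loop above compares every neighborhood row with every coordinate row. A key-based join should give the same result in one step. The following is a minimal sketch on toy frames (`neigh_toy`, `geo_toy` and the postal code `M9Z` are made up for illustration), assuming each postal code appears at most once in the coordinate table.

```python
import pandas as pd

# Toy stand-ins for neighdf and geodata (coordinates copied from the output above)
neigh_toy = pd.DataFrame({
    'PostalCode': ['M1B', 'M1C', 'M9Z'],  # 'M9Z' is a made-up code with no match
    'Neighborhood': ['Rouge, Malvern', 'Highland Creek, Rouge Hill, Port Union', 'Nowhere'],
})
geo_toy = pd.DataFrame({
    'Postal Code': ['M1C', 'M1B'],
    'Latitude': [43.784535, 43.806686],
    'Longitude': [-79.160497, -79.194353],
})

# A left merge keeps every neighborhood row and leaves NaN where no
# coordinates are found, which is also what the nested loop leaves behind
joined = neigh_toy.merge(
    geo_toy.rename(columns={'Postal Code': 'PostalCode'}),
    on='PostalCode',
    how='left',
)
print(joined)
```

Checking `joined['Latitude'].isna().sum()` afterwards is a quick way to spot postal codes that found no coordinates.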


Next we need the FourSquare API to get the venue data for each postal code.

In [6]:
# The code was removed by Watson Studio for sharing.

In [7]:
### Create dataframe containing venue data for each postal code ###

# Define function to get venue data for each postal code.
# Takes as arguments sequences of location names, latitudes and longitudes;
# returns a dataframe containing venue data for up to LIMIT venues
# within the radius (in meters) of each location

# Max number of venues within radius
LIMIT = 100

def getNearbyVenues(names, latitudes, longitudes, radius=1000):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius,
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Postal Code', 
                  'Postal Code Latitude', 
                  'Postal Code Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues

# Create the dataframe using the getNearbyVenues function

toronto_venues = getNearbyVenues(names=neighdf['PostalCode'],
                                 latitudes=neighdf['Latitude'],
                                 longitudes=neighdf['Longitude'],
                                )

toronto_venues.shape

(4886, 7)
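
The chained indexing in `getNearbyVenues` raises an exception if a response lacks one of the expected keys, and `v['venue']['categories'][0]` fails for a venue without a category. Below is a sketch of a more tolerant parser. `parse_explore_items` is a hypothetical helper, and `sample` is a hand-made payload that only mimics the fields the code above indexes into; it is not real API output.

```python
def parse_explore_items(payload, name, lat, lng):
    """Flatten one explore-style response into rows, tolerating missing pieces."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item.get('venue', {})
        location = venue.get('location', {})
        categories = venue.get('categories', [])
        # Use None rather than raising when a venue has no category
        category = categories[0].get('name') if categories else None
        rows.append((name, lat, lng, venue.get('name'),
                     location.get('lat'), location.get('lng'), category))
    return rows


# Hand-made payload shaped like the fields indexed above (not real API output)
sample = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Cafe A', 'location': {'lat': 43.80, 'lng': -79.19},
               'categories': [{'name': 'Coffee Shop'}]}},
    {'venue': {'name': 'Venue B', 'location': {'lat': 43.81, 'lng': -79.20},
               'categories': []}},
]}]}}

rows = parse_explore_items(sample, 'M1B', 43.806686, -79.194353)
empty = parse_explore_items({}, 'M1B', 43.806686, -79.194353)
print(rows)
print(empty)
```

With a helper like this, a postal code whose request returns nothing contributes an empty list instead of stopping the whole collection loop.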

In [8]:
print('There are {} unique venue categories and the following venue counts for each postal code...'.format(len(toronto_venues['Venue Category'].unique())))
toronto_venues.groupby('Postal Code').count()

There are 326 unique venue categories and the following venue counts for each postal code...


Postal Code,Postal Code Latitude,Postal Code Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
M1B,18,18,18,18,18,18
M1C,5,5,5,5,5,5
M1E,23,23,23,23,23,23
M1G,8,8,8,8,8,8
M1H,30,30,30,30,30,30
M1J,12,12,12,12,12,12
M1K,25,25,25,25,25,25
M1L,29,29,29,29,29,29
M1M,13,13,13,13,13,13
M1N,15,15,15,15,15,15


The next step is to build the dataframe of community venue profiles for each postal code from the venue data. I will use one-hot encoding to build a table of binary indicator columns, one per venue category, which can then be averaged to give the frequency of occurrence of each venue category within each postal code.

In [9]:
# One hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# Add back in postal code column and move it to front
toronto_onehot['Postal Code'] = toronto_venues['Postal Code']
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

# Build frequency table: group by postal code and take the mean of each indicator column
toronto_venue = toronto_onehot.groupby('Postal Code').mean().reset_index()
toronto_venue.head()

,Postal Code,Accessories Store,Adult Boutique,Afghan Restaurant,African Restaurant,Airport,Airport Lounge,American Restaurant,Amphitheater,Animal Shelter,...,Video Store,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio,Zoo
0,M1B,0.0,0.0,0.0,0.055556,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,M1C,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,M1E,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,M1G,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,M1H,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.033333,0.0,0.033333,0.0
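
To see why the mean of one-hot columns yields category frequencies, here is a small worked example on made-up venues (the frame `venues` and its values are illustrative). It also shows that `pd.crosstab` with `normalize='index'` produces the same table directly.

```python
import pandas as pd

# Toy venue list: two postal codes, three categories (illustrative values)
venues = pd.DataFrame({
    'Postal Code': ['M1B', 'M1B', 'M1B', 'M1C'],
    'Venue Category': ['Cafe', 'Cafe', 'Park', 'Zoo'],
})

# One 0/1 indicator column per category, then the per-postal-code mean
onehot = pd.get_dummies(venues['Venue Category'], dtype=float)
onehot['Postal Code'] = venues['Postal Code']
freq = onehot.groupby('Postal Code').mean()

# Equivalent shortcut: a contingency table normalized across each row
freq_ct = pd.crosstab(venues['Postal Code'], venues['Venue Category'], normalize='index')

print(freq)
print(freq_ct)
```

Each row sums to 1, so the profile describes the mix of venue categories in a postal code rather than how many venues it has.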


In [10]:
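# Add neighborhood names for reference, matched on postal code rather than row position
toronto_venue['Neighborhood'] = toronto_venue['Postal Code'].map(neighdf.set_index('PostalCode')['Neighborhood'])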
cols = list(toronto_venue)
cols.insert(1, cols.pop(cols.index('Neighborhood')))
toronto_venue = toronto_venue.loc[:, cols]
toronto_venue.head()

,Postal Code,Neighborhood,Accessories Store,Adult Boutique,Afghan Restaurant,African Restaurant,Airport,Airport Lounge,American Restaurant,Amphitheater,...,Video Store,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio,Zoo
0,M1B,"Rouge, Malvern",0.0,0.0,0.0,0.055556,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,M1C,"Highland Creek, Rouge Hill, Port Union",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,M1E,"Guildwood, Morningside, West Hill",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,M1G,Woburn,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,M1H,Cedarbrae,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.033333,0.0,0.033333,0.0


#### Now that the neighborhood venue profile dataframe is complete, the next step is to build the neighborhood demographics profile dataframe using data from the 2016 Canada Census.

In [38]:
from zipfile import ZipFile

# Get demographic data from Canada Census website, load into dataframe
url = "https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/dt-td/CompDataDownload.cfm?LANG=E&PID=109790&OFT=CSV"
r = requests.get(url)
z = ZipFile(io.BytesIO(r.content))
z.extractall()

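# The read skips the first 112395 lines and keeps the next 12192 data rows.
# Because the file's real header is skipped, pandas treats the first unskipped
# row as the header, so the columns are selected by that row's values
# ('L9Z', 'Average age', '50.0') and renamed afterwards.
demo = pd.read_table(z.open('98-400-X2016008_English_CSV_data.csv'), 
                     usecols = ['L9Z', 'Average age', '50.0'],
                     sep = ',', skiprows = 112395, nrows = 12192)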
demo.rename(columns = {"L9Z": "Postal Code", "Average age": "Index", 
                       "50.0" : "Population"}, inplace = True)
demo.head()

,Postal Code,Index,Population
0,M1B,Total - Age,66110.0
1,M1B,0 to 14 years,11535.0
2,M1B,0 to 4 years,3540.0
3,M1B,Under 1 year,675.0
4,M1B,1,705.0


Next, let's reorganize the table so that the index is the postal code and the columns are the population of each age group in 5-year increments.

In [41]:
demo2 = demo.pivot(index = 'Postal Code', columns = 'Index', values = 'Population')
keep_columns = ['0 to 4 years', '5 to 9 years', '10 to 14 years', '15 to 19 years', 
                '20 to 24 years', '25 to 29 years', '30 to 34 years', '35 to 39 years',
                '40 to 44 years', '45 to 49 years', '50 to 54 years', '55 to 59 years',
                '60 to 64 years', '65 to 69 years', '70 to 74 years', '75 to 79 years',
                '80 to 84 years', '85 to 89 years', '90 to 94 years', '95 to 99 years',
                '100 years and over']
demo2 = demo2[keep_columns]
demo2.head()

Postal Code,0 to 4 years,5 to 9 years,10 to 14 years,15 to 19 years,20 to 24 years,25 to 29 years,30 to 34 years,35 to 39 years,40 to 44 years,45 to 49 years,...,55 to 59 years,60 to 64 years,65 to 69 years,70 to 74 years,75 to 79 years,80 to 84 years,85 to 89 years,90 to 94 years,95 to 99 years,100 years and over
M1B,3540.0,3920.0,4080.0,4680.0,5090.0,4835.0,4415.0,3950.0,4150.0,4410.0,...,4570.0,4070.0,3615.0,2470.0,1610.0,925.0,530.0,220.0,65.0,10.0
M1C,1575.0,1760.0,1905.0,2380.0,2695.0,2130.0,1870.0,1955.0,1965.0,2390.0,...,2970.0,2670.0,2340.0,1695.0,1090.0,665.0,385.0,185.0,40.0,5.0
M1E,2405.0,2585.0,2585.0,3100.0,3450.0,2975.0,2665.0,2550.0,2720.0,3240.0,...,3610.0,3050.0,2370.0,1840.0,1545.0,1195.0,840.0,405.0,105.0,15.0
M1G,1720.0,1925.0,1950.0,2140.0,2350.0,2125.0,1835.0,1810.0,1775.0,1935.0,...,1970.0,1530.0,1320.0,1015.0,925.0,685.0,370.0,140.0,20.0,5.0
M1H,1330.0,1365.0,1175.0,1320.0,1970.0,2000.0,1910.0,1705.0,1540.0,1660.0,...,1605.0,1340.0,1010.0,820.0,670.0,655.0,385.0,185.0,35.0,5.0
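
The reshaping step can be illustrated on a few rows. This sketch uses a hand-built long-format frame (`long_df` is a name introduced here; the counts are copied from the table above) to show what `pivot` does and why the columns are selected explicitly afterwards.

```python
import pandas as pd

# Long format: one row per (postal code, age band); counts copied from the table above
long_df = pd.DataFrame({
    'Postal Code': ['M1B', 'M1B', 'M1C', 'M1C'],
    'Index': ['0 to 4 years', '5 to 9 years', '0 to 4 years', '5 to 9 years'],
    'Population': [3540.0, 3920.0, 1575.0, 1760.0],
})

# pivot() needs each (index, columns) pair to be unique, otherwise it raises
wide = long_df.pivot(index='Postal Code', columns='Index', values='Population')

# Columns come back sorted as strings, so select them explicitly to fix the order
wide = wide[['0 to 4 years', '5 to 9 years']]
print(wide)
```

With the full set of labels, string sorting would place '10 to 14 years' before '5 to 9 years', which is one reason to pass an ordered list such as `keep_columns`.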


In [42]:
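# Now we can build a demographic profile. apply() works column by column
# by default, so each value becomes the postal code's share of the total
# population in that age group across all postal codes in the table.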

toronto_demo = demo2.apply(lambda x: x / x.sum()).reset_index()
#toronto_demo.rename(index=str, columns={"Age": "Index"}, inplace=True)
#toronto_demo['Age'] = toronto_demo['Index']
toronto_demo.head(10)

,Postal Code,0 to 4 years,5 to 9 years,10 to 14 years,15 to 19 years,20 to 24 years,25 to 29 years,30 to 34 years,35 to 39 years,40 to 44 years,...,55 to 59 years,60 to 64 years,65 to 69 years,70 to 74 years,75 to 79 years,80 to 84 years,85 to 89 years,90 to 94 years,95 to 99 years,100 years and over
0,M1B,0.026023,0.029028,0.032086,0.032149,0.026134,0.020752,0.019655,0.02012,0.022752,...,0.024991,0.026441,0.027685,0.026383,0.021127,0.015259,0.012989,0.011182,0.013742,0.012821
1,M1C,0.011578,0.013033,0.014981,0.01635,0.013837,0.009142,0.008325,0.009958,0.010773,...,0.016241,0.017346,0.017921,0.018105,0.014304,0.01097,0.009435,0.009403,0.008457,0.00641
2,M1E,0.017679,0.019142,0.020329,0.021296,0.017714,0.012769,0.011864,0.012989,0.014912,...,0.019741,0.019815,0.01815,0.019654,0.020274,0.019713,0.020586,0.020584,0.022199,0.019231
3,M1G,0.012644,0.014255,0.015335,0.014701,0.012066,0.00912,0.008169,0.009219,0.009731,...,0.010773,0.00994,0.010109,0.010842,0.012138,0.0113,0.009068,0.007116,0.004228,0.00641
4,M1H,0.009777,0.010108,0.00924,0.009068,0.010115,0.008584,0.008503,0.008685,0.008443,...,0.008777,0.008706,0.007735,0.008759,0.008792,0.010805,0.009435,0.009403,0.0074,0.00641
5,M1J,0.017201,0.018698,0.01852,0.017517,0.013889,0.010408,0.010729,0.012352,0.013185,...,0.013288,0.012539,0.011488,0.012391,0.012007,0.011547,0.011151,0.010673,0.013742,0.00641
6,M1K,0.019591,0.020957,0.02139,0.020643,0.01643,0.014228,0.014268,0.016376,0.017406,...,0.019605,0.019003,0.01704,0.017144,0.016928,0.016084,0.015317,0.01169,0.008457,0.012821
7,M1L,0.016871,0.015958,0.016318,0.015319,0.011347,0.009743,0.011909,0.013804,0.013925,...,0.012988,0.011499,0.010186,0.009827,0.00912,0.009485,0.009925,0.015502,0.014799,0.012821
8,M1M,0.007425,0.009182,0.010302,0.01027,0.007676,0.005064,0.005275,0.005756,0.008114,...,0.010418,0.010687,0.010147,0.008759,0.008464,0.010145,0.010293,0.011182,0.011628,0.012821
9,M1N,0.007866,0.008553,0.008533,0.007866,0.006649,0.004678,0.005409,0.007029,0.008607,...,0.010308,0.010362,0.010109,0.009186,0.007677,0.008083,0.00919,0.010165,0.011628,0.00641
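
`DataFrame.apply` passes each column to the function by default, so `demo2.apply(lambda x: x / x.sum())` divides every value by its column total. If the goal is instead the age distribution within each postal code, the rows would need to be normalized. The sketch below contrasts the two on a cut-down table (`counts` holds three age bands copied from the first two rows of the census output above; `by_column` and `by_row` are illustrative names).

```python
import pandas as pd

# Counts copied from the first two rows of the census table (three age bands only)
counts = pd.DataFrame(
    {'0 to 4 years': [3540.0, 1575.0],
     '5 to 9 years': [3920.0, 1760.0],
     '10 to 14 years': [4080.0, 1905.0]},
    index=pd.Index(['M1B', 'M1C'], name='Postal Code'),
)

# Column-wise (what apply does by default): each column sums to 1
by_column = counts.apply(lambda x: x / x.sum())

# Row-wise: each postal code's row sums to 1, giving its own age distribution
by_row = counts.div(counts.sum(axis=1), axis=0)

print(by_column)
print(by_row)
```

Which normalization is appropriate depends on the question being asked; the row-wise form is the one that makes postal codes of different sizes directly comparable.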


#### Notes about the data: 

* The demographic data comes from the 2016 census, while the venue data is current. The demographic data therefore lags the venue data by a couple of years, which will affect the accuracy of the model.
* For financial reasons I must limit the number of venues I collect data on for each neighborhood. I therefore normalize the venue data into a neighborhood profile, which estimates the composition of venues across categories rather than the total venue count in each category. This will limit model accuracy.
* I will be exploring how well venue data predicts the age distribution of the population in each neighborhood, using 5-year age groups.
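
Before any model can be fit, the venue profile and the demographic profile have to be paired row by row on postal code. One way this might look is sketched below with toy frames (`venue_toy`, `demo_toy`, `X` and `y` are hypothetical names and the values are invented); it is not the notebook's modeling code.

```python
import pandas as pd

# Toy venue profile (features) and demographic profile (targets); values are illustrative
venue_toy = pd.DataFrame({'Postal Code': ['M1B', 'M1C', 'M1E'],
                          'Cafe': [0.2, 0.0, 0.1],
                          'Park': [0.1, 0.4, 0.0]})
demo_toy = pd.DataFrame({'Postal Code': ['M1C', 'M1B', 'M1G'],
                         '0 to 4 years': [0.05, 0.06, 0.07]})

# Inner merge keeps only postal codes present in both tables,
# so every feature row is paired with the matching target row
paired = venue_toy.merge(demo_toy, on='Postal Code', how='inner')
X = paired[['Cafe', 'Park']]
y = paired[['0 to 4 years']]
print(paired)
```

Postal codes that appear in only one of the two tables drop out of `paired`, so comparing its length with the inputs shows how much data the join loses.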