<h1 align=center><font size = 5>Segmenting and Clustering Neighborhoods in Toronto</font></h1>

Importing all the necessary dependencies.

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


In [2]:
!conda install -c anaconda beautifulsoup4 --yes
!conda install -c anaconda lxml --yes

Collecting package metadata: done
Solving environment: \ 
The environment is inconsistent, please check the package plan carefully
The following packages are causing the inconsistency:

  - defaults/linux-64::anaconda==5.3.1=py37_0
  - defaults/linux-64::astropy==3.0.4=py37h14c3975_0
  - defaults/linux-64::bkcharts==0.2=py37_0
  - defaults/linux-64::blaze==0.11.3=py37_0
  - defaults/linux-64::bokeh==0.13.0=py37_0
  - defaults/linux-64::bottleneck==1.2.1=py37h035aef0_1
  - defaults/linux-64::dask==0.19.1=py37_0
  - defaults/linux-64::datashape==0.5.4=py37_1
  - defaults/linux-64::mkl-service==1.1.2=py37h90e4bf4_5
  - defaults/linux-64::numba==0.39.0=py37h04863e7_0
  - defaults/linux-64::numexpr==2.6.8=py37hd89afb7_0
  - defaults/linux-64::odo==0.5.1=py37_0
  - defaults/linux-64::pytables==3.4.4=py37ha205bf6_0
  - defaults/linux-64::pytest-arraydiff==0.2=py37h39e3cac_0
  - defaults/linux-64::pytest-astropy==0.4.0=py37_0
  - defaults/linux-64::pytest-doctestplus==0.1.3=py37_0
  - defaults

In [2]:
from bs4 import BeautifulSoup

<a id='item1'></a>

## 1. Download and Explore Dataset

Downloading the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M to obtain its table of postal codes, which is then transformed into a pandas dataframe.

In [3]:
!wget -q -O 'toronto_data.html' https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
print('Data downloaded!')

Data downloaded!


Loading the data as a soup object

In [3]:
with open("toronto_data.html") as html_file:
    soup = BeautifulSoup(html_file,"lxml")
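
As a minimal sketch of what the soup object offers, the snippet below parses a small, made-up HTML table held in a string (not the Wikipedia page) and walks its rows with `find` and `find_all`. It uses Python's built-in `html.parser` instead of `lxml` so that it runs without the extra install; the sample markup and the `sample_*` names are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, shaped like a simple postal code table
sample_html = """
<table>
  <tbody>
    <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
    <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
    <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  </tbody>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# find() returns the first matching tag; find_all() returns every match
sample_table = sample_soup.find("tbody")
sample_rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in sample_table.find_all("tr")
]

# The header row holds <th> cells only, so it produces an empty list
print(sample_rows)
```

The empty list for the header row is why a header needs to be skipped or dropped when the rows are turned into a dataframe.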

Using the BeautifulSoup package to extract the table with the necessary information and loading the data into a dataframe for further cleansing

In [4]:
# The first <tbody> on the page holds the postal code table
table = soup.find("tbody")

# Collect one record per table row; the header row has no <td> cells and is skipped
columns = ['PostalCode', 'Borough', 'Neighborhood']
rows = []
for tr in table.find_all("tr"):
    cells = [td.text.strip() for td in tr.find_all("td")]
    if len(cells) == len(columns):
        rows.append(dict(zip(columns, cells)))
dftable = pd.DataFrame(rows, columns=columns)

# Drop postal codes with no assigned borough
dftable = dftable[dftable['Borough'] != 'Not assigned'].reset_index(drop=True)

# A neighborhood that is 'Not assigned' takes the name of its borough
mask = dftable['Neighborhood'] == 'Not assigned'
dftable.loc[mask, 'Neighborhood'] = dftable.loc[mask, 'Borough']

# Combine neighborhoods that share a postal code into one comma-separated row
dftable = dftable.groupby(['PostalCode', 'Borough'], as_index=False, sort=False).agg(', '.join)
dftable.head(10)

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront, Regent Park |
| 3 | M6A | North York | Lawrence Heights, Lawrence Manor |
| 4 | M7A | Queen's Park | Queen's Park |
| 5 | M9A | Etobicoke | Islington Avenue |
| 6 | M1B | Scarborough | Rouge, Malvern |
| 7 | M3B | North York | Don Mills North |
| 8 | M4B | East York | Woodbine Gardens, Parkview Hill |
| 9 | M5B | Downtown Toronto | Ryerson, Garden District |
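
The cleaning rules applied in the cell above can be sketched on a few made-up rows, which makes each step easier to check by eye. The `toy` frame and its values are assumptions for illustration; only the three column names come from the notebook.

```python
import pandas as pd

# Made-up rows in the same three-column shape as the scraped table
toy = pd.DataFrame({
    'PostalCode':   ['M1A', 'M5A', 'M5A', 'M7A'],
    'Borough':      ['Not assigned', 'Downtown Toronto', 'Downtown Toronto', "Queen's Park"],
    'Neighborhood': ['Not assigned', 'Harbourfront', 'Regent Park', 'Not assigned'],
})

# 1. Drop rows whose borough is not assigned
toy = toy[toy['Borough'] != 'Not assigned'].reset_index(drop=True)

# 2. Where only the neighborhood is missing, reuse the borough name
missing = toy['Neighborhood'] == 'Not assigned'
toy.loc[missing, 'Neighborhood'] = toy.loc[missing, 'Borough']

# 3. Join neighborhoods that share a postal code into one row
toy = toy.groupby(['PostalCode', 'Borough'], as_index=False, sort=False).agg(', '.join)
print(toy)
```

Passing `sort=False` keeps the postal codes in the order they first appear, rather than sorting them alphabetically.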


In [5]:
print('The dataframe has {} boroughs and {} postal codes.'.format(
        dftable['Borough'].nunique(),
        dftable.shape[0]
    )
)

The dataframe has 11 boroughs and 103 postal codes.
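
Because neighborhoods that share a postal code are joined into one comma-separated string, the row count is a count of postal codes rather than of individual neighborhoods. A rough sketch of counting the individual names follows, on a small frame built from three of the rows shown earlier. It assumes names are separated by `', '` exactly as produced by the join; the variable names are hypothetical.

```python
import pandas as pd

# Three sample rows in the grouped, one-row-per-postal-code shape
toy = pd.DataFrame({
    'PostalCode': ['M3A', 'M5A', 'M6A'],
    'Borough': ['North York', 'Downtown Toronto', 'North York'],
    'Neighborhood': ['Parkwoods', 'Harbourfront, Regent Park',
                     'Lawrence Heights, Lawrence Manor'],
})

n_boroughs = toy['Borough'].nunique()
n_postal_codes = len(toy)

# Split each joined string back into a list, then give every name its own row
names = toy['Neighborhood'].str.split(', ').explode()
n_neighborhoods = names.nunique()

print(n_boroughs, n_postal_codes, n_neighborhoods)
```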


## 2. Download and merge the coordinates into the dataframe

In [6]:
!wget -q -O 'toronto_coord.csv' http://cocl.us/Geospatial_data
print('Data downloaded!')

Data downloaded!


In [7]:
dfcoord=pd.read_csv('toronto_coord.csv')

In [12]:
df=pd.merge(dftable, dfcoord, left_on='PostalCode', right_on='Postal Code')
df=df.drop("Postal Code", axis=1)

In [13]:
df.head(10)

| | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Harbourfront, Regent Park | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Heights, Lawrence Manor | 43.718518 | -79.464763 |
| 4 | M7A | Queen's Park | Queen's Park | 43.662301 | -79.389494 |
| 5 | M9A | Etobicoke | Islington Avenue | 43.667856 | -79.532242 |
| 6 | M1B | Scarborough | Rouge, Malvern | 43.806686 | -79.194353 |
| 7 | M3B | North York | Don Mills North | 43.745906 | -79.352188 |
| 8 | M4B | East York | Woodbine Gardens, Parkview Hill | 43.706397 | -79.309937 |
| 9 | M5B | Downtown Toronto | Ryerson, Garden District | 43.657162 | -79.378937 |
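
`pd.merge` defaults to an inner join, so a postal code that is missing from the coordinates file would silently drop out of `df`. Below is a hedged sketch of one way to check for that, using made-up frames (the `M0Z` code, the `Nowhere` borough, and all variable names are hypothetical). A left join with `indicator=True` flags unmatched rows, and `validate='one_to_one'` raises an error if a postal code appears more than once on either side.

```python
import pandas as pd

left = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A', 'M0Z'],  # 'M0Z' is a made-up code with no coordinates
    'Borough': ['North York', 'North York', 'Nowhere'],
})
right = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A'],
    'Latitude': [43.753259, 43.725882],
    'Longitude': [-79.329656, -79.315572],
})

merged = pd.merge(
    left, right,
    left_on='PostalCode', right_on='Postal Code',
    how='left',             # keep every row of the left frame
    indicator=True,         # adds a '_merge' column: 'both' or 'left_only'
    validate='one_to_one',  # fail loudly on duplicate keys
)

# Postal codes that found no coordinates
unmatched = merged.loc[merged['_merge'] == 'left_only', 'PostalCode'].tolist()
print(unmatched)

# Drop the helper columns once the check is done
checked = merged.drop(columns=['Postal Code', '_merge'])
```

If `unmatched` comes back empty, the inner join used in the notebook and this left join yield the same rows.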
