<h1> Exploring and clustering the neighborhoods in Toronto </h1>

<h2> Problem 1 </h2>

Scrape the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M with requests and BeautifulSoup, then load the postal code table into a pandas DataFrame.

In [1]:
# Import Libraries
!conda install -c conda-forge bs4 --yes
from bs4 import BeautifulSoup
import pandas as pd
import requests



In [2]:
# Wikipedia page url 
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

<h4> Begin Scraping </h4>

In [3]:
# Download the page and keep its HTML as text
page  = requests.get(url).text

In [4]:
# pull out data from the page
soup = BeautifulSoup(page, 'html.parser')
# find table
table=soup.find('table')

<h4> Create Dataframe </h4>

In [5]:
# Names of the required columns
cols = ['Postalcode','Borough','Neighborhood']
# Create Dataframe from the columns
df = pd.DataFrame(columns = cols)

In [6]:
# Collect every table row that has exactly three cells: Postalcode, Borough, Neighborhood
for tr_cell in table.find_all('tr'):
    row_data=[]
    for td_cell in tr_cell.find_all('td'):
        row_data.append(td_cell.text.strip())
    if len(row_data)==3:
        df.loc[len(df)] = row_data

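The row loop above can be tried offline on a small table. The following is a minimal sketch, assuming a hypothetical three-row HTML snippet (`sample_html`) in place of the live Wikipedia page and a hypothetical helper named `extract_rows`. It gathers the rows in a plain list, which can then be passed to `pd.DataFrame(rows, columns=cols)` in one step instead of appending to the DataFrame row by row.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the postal code table, used only for illustration
sample_html = """
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M4A</td><td>North York</td><td>Victoria Village</td></tr>
</table>
"""


def extract_rows(html, n_cols=3):
    """Return the cell text of every table row with exactly n_cols <td> cells."""
    table = BeautifulSoup(html, "html.parser").find("table")
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.text.strip() for td in tr.find_all("td")]
        if len(cells) == n_cols:  # the header row uses <th>, so it is skipped
            rows.append(cells)
    return rows


rows = extract_rows(sample_html)
print(rows)
```

Building the list first avoids growing the DataFrame one row at a time, which gets slow on larger tables.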


In [7]:
# Show the first 5 rows of the DataFrame
df.head(5)

Unnamed: 0,Postalcode,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


<h4> Data Cleansing </h4>
Remove the rows whose borough is "Not assigned".

In [8]:
df=df.drop(df[(df.Borough == "Not assigned")].index)

In [9]:
df.head(5)

Unnamed: 0,Postalcode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


If a row has a borough but its neighborhood is "Not assigned", the neighborhood is set to the borough name.

In [10]:
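# Where the neighborhood is "Not assigned", copy the borough name into it
mask = df['Neighborhood'] == 'Not assigned'
df.loc[mask, 'Neighborhood'] = df.loc[mask, 'Borough']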
df.head()

Unnamed: 0,Postalcode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
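Both cleansing rules can be checked on a small frame. This is a minimal sketch, assuming three hypothetical rows (`toy`) that are not taken from the scraped table: it drops rows with an unassigned borough, then fills unassigned neighborhoods from the borough column using a boolean mask.

```python
import pandas as pd

# Hypothetical rows for illustration; not taken from the scraped table
toy = pd.DataFrame({
    "Postalcode": ["M1A", "M3A", "M7A"],
    "Borough": ["Not assigned", "North York", "Queen's Park"],
    "Neighborhood": ["Not assigned", "Parkwoods", "Not assigned"],
})

# Rule 1: keep only the rows whose borough is assigned
cleaned = toy[toy["Borough"] != "Not assigned"].copy()

# Rule 2: where the neighborhood is unassigned, copy the borough name into it
mask = cleaned["Neighborhood"] == "Not assigned"
cleaned.loc[mask, "Neighborhood"] = cleaned.loc[mask, "Borough"]

print(cleaned)
```

Assigning through `.loc` with a mask touches only the matching rows and leaves the other neighborhoods unchanged.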


More than one neighborhood can exist in one postal code area. Such rows are combined into a single row, with the neighborhoods separated by commas.

In [12]:
# Merge rows that share a postal code and borough into one row
def neighborhood_list(grouped):    
    return ', '.join(sorted(grouped['Neighborhood'].tolist()))
                    
grp = df.groupby(['Postalcode', 'Borough'])
df = grp.apply(neighborhood_list).reset_index(name='Neighborhood')
df.head()

Unnamed: 0,Postalcode,Borough,Neighborhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
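The grouping step can be illustrated on rows that really are split across several lines. The following is a sketch under the assumption of a hypothetical `toy` frame in which postal code M1B appears twice; it uses `agg` on the grouped `Neighborhood` column as an alternative to `apply` with a helper function.

```python
import pandas as pd

# Hypothetical input in which one postal code spans two rows
toy = pd.DataFrame({
    "Postalcode": ["M1B", "M1B", "M1G"],
    "Borough": ["Scarborough", "Scarborough", "Scarborough"],
    "Neighborhood": ["Rouge", "Malvern", "Woburn"],
})

# Join the sorted neighborhood names of each (Postalcode, Borough) group
merged = (
    toy.groupby(["Postalcode", "Borough"])["Neighborhood"]
       .agg(lambda names: ", ".join(sorted(names)))
       .reset_index()
)

print(merged)
```

Note that sorting acts on whole cell values, so a cell that already holds a comma-separated list is treated as one string and keeps its internal order.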


In [13]:
df.shape

(103, 3)

<h2> Problem 2 </h2>

We need to get the latitude and the longitude coordinates of each neighborhood.

In [14]:
# read geo csv
df_geo = pd.read_csv('http://cocl.us/Geospatial_data')
df_geo.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [15]:
df_geo.rename(columns={'Postal Code':'Postalcode'},inplace=True)
df2 = pd.merge(df_geo, df, on='Postalcode')

In [16]:
df2 = df2[['Postalcode','Borough','Neighborhood','Latitude','Longitude']]
df2.head()

Unnamed: 0,Postalcode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
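An inner merge silently drops postal codes that have no coordinates. As a hedged sketch of one way to check for that, the example below merges two small hypothetical frames (`codes` and `geo`, with a made-up code M9Z that has no coordinates) using `how="left"` and `indicator=True`, then lists the codes that found no match.

```python
import pandas as pd

# Hypothetical frames for illustration; M9Z is a made-up code without coordinates
codes = pd.DataFrame({
    "Postalcode": ["M1B", "M1C", "M9Z"],
    "Borough": ["Scarborough", "Scarborough", "Etobicoke"],
    "Neighborhood": ["Malvern, Rouge", "Rouge Hill, Port Union, Highland Creek", "Example"],
})
geo = pd.DataFrame({
    "Postalcode": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# Keep every postal code, record where each row came from, and
# fail fast if a postal code is duplicated on either side
checked = codes.merge(
    geo, on="Postalcode", how="left", indicator=True, validate="one_to_one"
)

# Postal codes that have no coordinates in the geospatial file
missing = checked.loc[checked["_merge"] == "left_only", "Postalcode"].tolist()
print(missing)
```

If `missing` is empty, the inner merge and the left merge give the same rows and the `_merge` column can be dropped.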


<h2> Problem 3 </h2>

Exploring and clustering the neighborhoods in Toronto.

In [19]:
import matplotlib.colors as colors
from sklearn.cluster import KMeans
import matplotlib.cm as cm
from pandas import json_normalize  # pandas >= 1.0; older versions: from pandas.io.json import json_normalize
import numpy as np
import folium
import os

In [21]:
"""
User's Foursquare credentials.
1) user's client ID
2) user's client secret
"""
# Client ID
C_ID = 'KGWVZLX1JOX4VMUSLY3VC1VTLABY22ZUVNNF2H4TMKBL5UUB'
# Client Secret
C_SE = '2LL0B2L5YQ5Y4MBUGVRZR3OLP3GWDS4PHP4RL00TJK0SMYM3'
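Credentials written into a notebook are visible to anyone who can read the file. A possible alternative, sketched here under the assumption of two hypothetical environment variable names (`FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET`) and a hypothetical helper `load_credentials`, is to read them from the environment at run time.

```python
import os


def load_credentials(env=None):
    """Read the Foursquare client ID and secret from a mapping (default: os.environ)."""
    env = os.environ if env is None else env
    # Variable names are assumptions chosen for this example
    client_id = env.get("FOURSQUARE_CLIENT_ID")
    client_secret = env.get("FOURSQUARE_CLIENT_SECRET")
    if not client_id or not client_secret:
        raise RuntimeError(
            "Set FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET before running"
        )
    return client_id, client_secret


# Demonstration with placeholder values passed in as a plain dict
demo_env = {
    "FOURSQUARE_CLIENT_ID": "demo-id",
    "FOURSQUARE_CLIENT_SECRET": "demo-secret",
}
demo_id, demo_secret = load_credentials(demo_env)
print(demo_id)
```

In the notebook itself, `C_ID, C_SE = load_credentials()` would then pick up the real values from the shell that launched Jupyter.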