# Clustering Toronto - Part 1

### Load libraries

In [1]:
import pandas as pd
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
from geopy.geocoders import Nominatim
import folium

### Data Extraction and Cleaning

Toronto neighbourhood information was extracted from [Wikipedia](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M). It was then cleaned and converted to a Pandas DataFrame.

In [2]:
# Scrape wiki

with urlopen("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M") as fp:
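    # Name the parser explicitly so the result does not depend on which parsers are installed
    soup = BeautifulSoup(fp, 'html.parser')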

In [3]:
# Extract table

table_html = soup.find('table').find_all('tr')

In [4]:
# Define function for converting html to list

def cleanhtml(raw_html):
    cleaner = re.compile('<.*?>')
    text = re.sub(cleaner, '', str(raw_html)).replace('&amp;','').splitlines()
    return list(filter(None, text))
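
To make the behaviour of `cleanhtml` concrete, here is a minimal sketch that runs it on a single hand-written row. The `sample_row` string is a made-up example, not text copied from the Wikipedia page, and a plain string is passed in place of a BeautifulSoup tag (the function calls `str()` on its argument either way).

```python
import re

def cleanhtml(raw_html):
    # Same logic as the cell above: strip tags, drop '&amp;', keep non-empty lines
    cleaner = re.compile('<.*?>')
    text = re.sub(cleaner, '', str(raw_html)).replace('&amp;', '').splitlines()
    return list(filter(None, text))

# Hypothetical table row, laid out with one cell per line
sample_row = (
    "<tr>\n<td>M5A\n</td>\n<td>Downtown Toronto\n</td>\n"
    "<td>Regent Park &amp; Harbourfront\n</td></tr>"
)

cells = cleanhtml(sample_row)
print(cells)
```

Each table cell ends up as one list item because the cells sit on separate lines in the HTML. Note that replacing `&amp;` with an empty string leaves a double space where the ampersand used to be.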

In [5]:
# Get list

table = [cleanhtml(row) for row in table_html]

In [6]:
# Convert list to df and clean

table_df = pd.DataFrame(table, columns = table[0])
table_df = table_df.drop(table_df.index[0])
table_df = table_df[table_df.Borough != 'Not assigned'].reset_index()
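
The conversion step can be traced on a toy list, as a rough sketch. The rows in `toy_table` are invented for illustration and are not the scraped data; only the sequence of operations mirrors the cell above.

```python
import pandas as pd

# Toy stand-in for the scraped list: first row is the header (illustrative values)
toy_table = [
    ['Postal Code', 'Borough', 'Neighbourhood'],
    ['M1A', 'Not assigned', 'Not assigned'],
    ['M3A', 'North York', 'Parkwoods'],
    ['M5A', 'Downtown Toronto', 'Regent Park, Harbourfront'],
]

toy_df = pd.DataFrame(toy_table, columns=toy_table[0])
toy_df = toy_df.drop(toy_df.index[0])            # remove the duplicated header row
toy_df = toy_df[toy_df.Borough != 'Not assigned'].reset_index()

print(toy_df.columns.tolist())
print(toy_df.shape)
```

Because `reset_index()` is called without `drop=True`, the old row labels are kept as an extra column named `index`. That is why the DataFrame has four columns rather than three, and why an `index` column is dropped after the merge later on.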

### Data checking and verifying

The DataFrame was checked to confirm that no postal code appears more than once and that every postal code has at least one neighbourhood assigned. Finally, the shape of the DataFrame is shown.

In [7]:
len(table_df)

103

In [8]:
len(table_df[table_df.Neighbourhood != 'Not assigned'])

103

Here we can see no neighbourhoods are unassigned.

In [9]:
table_df['Postal Code'].value_counts().value_counts()

1    103
Name: Postal Code, dtype: int64

Here we can see no Postal Codes are shared.

In [10]:
table_df.shape

(103, 4)
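
The checks above could also be written as explicit assertions, so that a later change to the source page fails loudly instead of passing unnoticed. This is only a sketch on a toy DataFrame; `toy_df` and its rows are assumptions made for illustration.

```python
import pandas as pd

# Toy data standing in for table_df (illustrative rows only)
toy_df = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A', 'M5A'],
    'Borough': ['North York', 'North York', 'Downtown Toronto'],
    'Neighbourhood': ['Parkwoods', 'Victoria Village', 'Regent Park, Harbourfront'],
})

# Count rows whose neighbourhood is still the placeholder text
unassigned = int((toy_df['Neighbourhood'] == 'Not assigned').sum())

# Count rows whose postal code repeats an earlier row
shared = int(toy_df['Postal Code'].duplicated().sum())

assert unassigned == 0, "some neighbourhoods are unassigned"
assert toy_df['Postal Code'].is_unique, "some postal codes are shared"

print(unassigned, shared)
```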

### Adding latitude and the longitude coordinates

The provided `Geospatial_Coordinates.csv` file was read with Pandas and merged with the table extracted from Wikipedia on the `Postal Code` column.

In [11]:
GeoCoords = pd.read_csv('Geospatial_Coordinates.csv')
GeoCoords.head()

|   | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


In [12]:
tor_coord = table_df.merge(GeoCoords,how='right',on='Postal Code').drop('index', axis = 1)
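
A right merge keeps every row of the coordinates file, so any postal code missing from the Wikipedia table would appear with empty borough and neighbourhood values. One way to surface such rows, sketched here on toy frames, is the `indicator` and `validate` arguments of `DataFrame.merge`. The frames `neigh` and `coords` are assumptions for illustration; the coordinates are the first three rows shown above, and the neighbourhood rows are example values.

```python
import pandas as pd

# Toy neighbourhood table (illustrative rows) that lacks M1E on purpose
neigh = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C'],
    'Borough': ['Scarborough', 'Scarborough'],
    'Neighbourhood': ['Malvern, Rouge', 'Rouge Hill, Port Union, Highland Creek'],
})

coords = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C', 'M1E'],
    'Latitude': [43.806686, 43.784535, 43.763573],
    'Longitude': [-79.194353, -79.160497, -79.188711],
})

merged = neigh.merge(
    coords,
    how='right',
    on='Postal Code',
    validate='one_to_one',   # raises if a postal code repeats on either side
    indicator=True,          # adds a '_merge' column: both / left_only / right_only
)

# Rows that did not find a partner in the other table
unmatched = merged[merged['_merge'] != 'both']
print(unmatched['Postal Code'].tolist())
```

In the notebook itself both tables have 103 rows, so this check would be expected to return an empty list.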

### Neighbourhood Clusters

All neighbourhoods whose borough name contains 'Toronto' were mapped.

In [13]:
# Get Boroughs only with 'Toronto' in name
tor_boro = tor_coord[tor_coord.Borough.str.contains("Toronto", regex = False)]
tor_boro.head()

|   | Postal Code | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 37 | M4E | East Toronto | The Beaches | 43.676357 | -79.293031 |
| 41 | M4K | East Toronto | The Danforth West, Riverdale | 43.679557 | -79.352188 |
| 42 | M4L | East Toronto | India Bazaar, The Beaches West | 43.668999 | -79.315572 |
| 43 | M4M | East Toronto | Studio District | 43.659526 | -79.340923 |
| 44 | M4N | Central Toronto | Lawrence Park | 43.72802 | -79.38879 |


In [14]:
# Use the mean of the selected coordinates as the centre of the Toronto map
latitude = tor_boro.Latitude.mean()
longitude = tor_boro.Longitude.mean()

In [15]:
# Create map object for Toronto
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=12)

# Add map markers for each neighbourhood
for lat, lng, borough, neighborhood in zip(tor_boro.Latitude, tor_boro.Longitude, tor_boro.Borough, tor_boro.Neighbourhood):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        fill_opacity=0.7).add_to(map_toronto)
    
map_toronto