# Clustering Toronto - Part 1

### Load libraries

In [1]:
import pandas as pd
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re

### Data Exctraction and Cleaning

Toronto neighbourhood information was extractred from [Wikipedia](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M). It was then cleaned and converted to a Pandas DataFrame 

In [2]:
# Scrape wiki

with urlopen("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M") as fp:
    soup = BeautifulSoup(fp)

In [3]:
# Extract table

table_html = soup.find('table').find_all('tr')

In [4]:
# Define function for converting html to list

def cleanhtml(raw_html):
    cleanr = re.compile('<.*?>')
    text = re.sub(cleanr, '', str(raw_html)).splitlines()
    return list(filter(None, text))

In [5]:
# Get list

table = [cleanhtml(row) for row in table_html]

In [6]:
# Convert list to df and clean

table_df = pd.DataFrame(table, columns = table[0])
table_df = table_df.drop(table_df.index[0])
table_df = table_df[table_df.Borough != 'Not assigned'].reset_index(drop=True)

### Data checking and verifying

The DataFrame was checked to make sure no post codes were shared and that all had neighbourhood(s) assigned. Finally the shape of the Dataframe was shown. 

In [7]:
len(table_df)

103

In [8]:
len(table_df[table_df.Neighbourhood != 'Not assigned'])

103

Here we can see no neighbourhoods are unassigned.

In [9]:
table_df['Postal Code'].value_counts().value_counts()

1    103
Name: Postal Code, dtype: int64

Here we can see no Postal Codes are shared.

In [10]:
table_df.shape

(103, 3)

### Adding latitude and the longitude coordinates

My Attempt at geolocating latitude and the longitude coordinates using geopy. This is mostly built on trial and error as some locations could not be found. Latitude and longitude for M7R are from the geospatial csv as geopy couldn't find it using any method.

In [11]:
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="aaa")

from geopy.extra.rate_limiter import RateLimiter
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)

table_df['location'] = table_df.Neighbourhood.split(',')[0].split('-')[0].apply(geocode)

AttributeError: 'Series' object has no attribute 'split'

In [None]:
table_df

In [20]:
table_df.iloc[76]#.split(',')[0].split('-')[0]

Postal Code                                        M7R
Borough                                    Mississauga
Neighbourhood    Canada Post Gateway Processing Centre
Name: 76, dtype: object

In [22]:
location = geolocator.geocode('M7R, Toronto, Ontario')
# print(location.latitude, location.longitude)
location

In [34]:
lat = []

for PC in table_df.Neighbourhood:
    try:
        lat.append(geolocator.geocode('{}, Toronto, Ontario'.format(PC.split(',')[0].split('-')[0])).latitude)
        print(PC)

    except AttributeError:
        lat.append(geolocator.geocode('{}, Toronto, Ontario'.format(PC.split(',')[1].split('-')[0])).latitude)
        print(PC, 'xysssssssssssssssssssssssssssssssssssssssssss')
    except IndexError:
        lat.append(43.6369656)

        
# lat = [geolocator.geocode('{}, Toronto, Ontario'.format(PC.split(',')[0].split('-')[0])).latitude for PC in table_df.Neighbourhood]
# lon

Parkwoods
Victoria Village
Regent Park, Harbourfront
Lawrence Manor, Lawrence Heights
Queen's Park, Ontario Provincial Government
Islington Avenue, Humber Valley Village
Malvern, Rouge
Don Mills
Parkview Hill, Woodbine Gardens
Garden District, Ryerson
Glencairn
West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale
Rouge Hill, Port Union, Highland Creek
Don Mills
Woodbine Heights
St. James Town
Humewood-Cedarvale
Eringate, Bloordale Gardens, Old Burnhamthorpe, Markland Wood
Guildwood, Morningside, West Hill
The Beaches
Berczy Park
Caledonia-Fairbanks
Woburn
Leaside
Central Bay Street
Christie
Cedarbrae
Hillcrest Village
Bathurst Manor, Wilson Heights, Downsview North
Thorncliffe Park
Richmond, Adelaide, King
Dufferin, Dovercourt Village
Scarborough Village
Fairview, Henry Farm, Oriole
Northwood Park, York University
East Toronto, Broadview North (Old East York)
Harbourfront East, Union Station, Toronto Islands
Little Portugal, Trinity
Kennedy Park, Ionview, East Birchmo

IndexError: list index out of range

In [None]:
lat

In [None]:
lat = ['{}, Toronto, Ontario'.format(PC) for PC in table_df['Postal Code']]
lat = [geolocator.geocode('{}, Toronto, Ontario'.format(PC.split(',')[0].split('-')[0])).latitude for PC in table_df.Neighbourhood]

Using csv provided

In [None]:
GC = pd.read_csv('Geospatial_Coordinates.csv')
len(GC)