# Clustering Toronto Neighbourhoods
#### Part 1: Data preparation

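This notebook builds a table of Toronto postal codes, boroughs and neighbourhoods, adds latitude and longitude coordinates to each postal code, maps the boroughs with 'Toronto' in their name, and saves the result for further analysis.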

### Load Libraries

In [1]:
import pandas as pd # Data structures
import folium # Visualising interactive maps

## Data Extraction and Cleaning

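Toronto neighbourhood information was extracted directly from [Wikipedia](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M) using the pandas read_html function, which returns a list of every table found on the page. The first table in that list holds the postal codes.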

In [2]:
# Extract table from wikipedia using Pandas
table_df = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')[0]

# Remove unassigned boroughs
table_df = table_df[table_df.Borough != 'Not assigned'].reset_index(drop=True)

table_df.head()

| | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |


The output above shows the expected columns, and the first five rows of the DataFrame look correct.

## Data checking and verifying

The DataFrame was checked to make sure that no postal code appears more than once and that every postal code has at least one neighbourhood assigned. Finally, the shape of the DataFrame was inspected.

In [3]:
table_df['Postal Code'].value_counts().value_counts()

1    103
Name: Postal Code, dtype: int64

Every postal code appears exactly once, so the table contains 103 distinct postal codes.

In [4]:
len(table_df[table_df.Neighbourhood != 'Not assigned'])

103

All 103 postal codes have at least one neighbourhood assigned.

In [5]:
table_df.shape

(103, 3)

The DataFrame has 103 rows and 3 columns. Since the row count matches both counts above, the table contains no duplicate postal codes and no postal code without an assigned neighbourhood.
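
The three checks above can also be bundled into one helper, which makes it easier to re-run them if the Wikipedia table changes. This is a minimal sketch, not part of the original notebook: the function name `check_postal_table` and the small `sample_df` are hypothetical, and the column names are assumed to match those in table_df.

```python
import pandas as pd


def check_postal_table(df: pd.DataFrame) -> dict:
    """Summarise basic integrity checks on a postal code table (hypothetical helper)."""
    return {
        "rows": len(df),
        "unique_postal_codes": df["Postal Code"].nunique(),
        "no_duplicate_codes": bool(df["Postal Code"].is_unique),
        "unassigned_neighbourhoods": int((df["Neighbourhood"] == "Not assigned").sum()),
    }


# Small illustrative frame; the rows are copied from the head of the table shown earlier
sample_df = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M5A"],
    "Borough": ["North York", "North York", "Downtown Toronto"],
    "Neighbourhood": ["Parkwoods", "Victoria Village", "Regent Park, Harbourfront"],
})

summary = check_postal_table(sample_df)
print(summary)
```

Calling `check_postal_table(table_df)` on the full table should report 103 rows and 103 unique postal codes if the checks above still hold.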

## Adding latitude and the longitude coordinates

The provided Geospatial_Coordinates.csv file was read with pandas and merged with the table extracted from Wikipedia, using the Postal Code column as the key.

In [6]:
GeoCoords = pd.read_csv('https://cocl.us/Geospatial_data')

tor_coord = table_df.merge(GeoCoords,how='right',on='Postal Code')

tor_coord.head()

| | Postal Code | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Malvern, Rouge | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Rouge Hill, Port Union, Highland Creek | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill | 43.763573 | -79.188711 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |
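
A merge like this can silently leave rows without a match on one side. The sketch below shows one way to surface that, using the `validate` and `indicator` options of `DataFrame.merge`. It is an illustration under assumptions rather than the notebook's method: it uses a left merge instead of the right merge above, the frames `table_sample` and `coords_sample` are hypothetical, and the coordinates for M3A are deliberately left out to create an unmatched row.

```python
import pandas as pd

# Illustrative rows only; values are copied from the tables shown in this notebook
table_sample = pd.DataFrame({
    "Postal Code": ["M1B", "M1C", "M3A"],
    "Borough": ["Scarborough", "Scarborough", "North York"],
    "Neighbourhood": ["Malvern, Rouge", "Rouge Hill, Port Union, Highland Creek", "Parkwoods"],
})
coords_sample = pd.DataFrame({
    "Postal Code": ["M1B", "M1C"],  # M3A intentionally omitted
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

merged = table_sample.merge(
    coords_sample,
    how="left",             # keep every row of the postal code table
    on="Postal Code",
    validate="one_to_one",  # raise MergeError if a postal code repeats on either side
    indicator=True,         # add a _merge column recording where each row came from
)

# Postal codes that found no coordinates
missing = merged.loc[merged["_merge"] == "left_only", "Postal Code"].tolist()
print(missing)
```

An empty `missing` list on the full data would indicate that every postal code received coordinates.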


## Neighbourhood Visualisation

All rows whose borough name contains 'Toronto' were saved to a new DataFrame, tor_boro, and mapped using folium. The map is centred on the mean latitude and longitude of these rows.

In [11]:
# Get Boroughs only with 'Toronto' in name
tor_boro = tor_coord[tor_coord.Borough.str.contains("Toronto", regex = False)].reset_index(drop=True)

# Get central coordinates for Toronto map
latitude = tor_boro.Latitude.mean()
longitude = tor_boro.Longitude.mean()

# Create map object for Toronto
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=12)

# Add map markers for each neighbourhood
for lat, lng, borough, neighbourhood in zip(tor_boro.Latitude, tor_boro.Longitude, 
                                           tor_boro.Borough, tor_boro.Neighbourhood):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)
    
map_toronto
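
The borough filter relies on `Series.str.contains` returning a boolean mask. The following minimal sketch shows that step in isolation on a hypothetical `boroughs` series. The `na=False` argument is an addition not used in the cell above: it treats missing borough names as non-matches, which may be relevant here because a right merge can leave Borough empty for a postal code that is absent from the Wikipedia table.

```python
import pandas as pd

# Hypothetical sample, including one missing borough name
boroughs = pd.Series(["North York", "Downtown Toronto", None], name="Borough")

# regex=False performs a plain substring match; na=False maps missing values to False
mask = boroughs.str.contains("Toronto", regex=False, na=False)

selected = boroughs[mask].tolist()
print(selected)
```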

Finally, we save the tor_boro DataFrame to a CSV file for further analysis.

In [None]:
tor_boro.to_csv('tor_boro.csv',index=False)
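
Passing index=False matters when the file is read back later. The sketch below, which writes to an in-memory buffer instead of a file and uses a hypothetical one-row frame `df`, compares the columns recovered with and without the index.

```python
import io

import pandas as pd

# Hypothetical one-row frame with values taken from the table above
df = pd.DataFrame({"Postal Code": ["M5A"], "Borough": ["Downtown Toronto"]})

# Written without the index: the original columns come back unchanged
restored = pd.read_csv(io.StringIO(df.to_csv(index=False)))

# Written with the default index: an extra unnamed column appears on reload
with_index = pd.read_csv(io.StringIO(df.to_csv()))

print(list(restored.columns))
print(list(with_index.columns))
```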