# Segmenting and Clustering Neighborhoods in Toronto

### Importing the necessary libraries

In [1]:
import pandas as pd
import numpy as np

### Data Import

#### The pandas function read_html fetches the page and returns a list of dataframes, one per HTML table; the postal code table is the first one

In [2]:
df = pd.read_html("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")

#### These are the first rows of the raw table

In [3]:
df[0].head()

| | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |


### Data Processing

#### Let's remove every row whose Borough is "Not assigned"

In [4]:
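# .copy() makes the filtered result an independent dataframe, so the column
# assignment in the next step does not raise a SettingWithCopyWarning
df = df[0][df[0].Borough != "Not assigned"].copy()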
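
The filter above is a boolean mask applied to the dataframe. As a minimal sketch of the same idea, assuming a small made-up table (the names `raw`, `mask`, and `assigned` are illustrative and not part of the notebook):

```python
import pandas as pd

# Toy stand-in for the scraped table (illustrative rows, not the real data)
raw = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M5A"],
    "Borough": ["Not assigned", "North York", "Downtown Toronto"],
    "Neighborhood": ["Not assigned", "Parkwoods", "Not assigned"],
})

# Boolean mask: True for rows whose Borough is known
mask = raw["Borough"] != "Not assigned"

# Indexing with the mask keeps only the True rows; .copy() detaches the
# result from `raw` so it can be modified safely afterwards
assigned = raw[mask].copy()
```

Note that the original index labels (here 1 and 2) are kept after filtering, which is why the notebook resets the index a few cells later.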

#### Where a Neighborhood is "Not assigned", let's use the Borough name as the Neighborhood

In [5]:
df['Neighborhood'] = np.where(df['Neighborhood'] == "Not assigned", df['Borough'], df['Neighborhood'])
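
np.where takes a condition, the value to use where the condition is true, and the value to use where it is false, and evaluates them element by element. A small sketch on made-up rows (the dataframe `toy` is hypothetical):

```python
import numpy as np
import pandas as pd

# Two illustrative rows: one with a known neighborhood, one without
toy = pd.DataFrame({
    "Borough": ["North York", "Downtown Toronto"],
    "Neighborhood": ["Parkwoods", "Not assigned"],
})

# Row by row: take Borough where the neighborhood is missing,
# otherwise keep the existing Neighborhood value
toy["Neighborhood"] = np.where(
    toy["Neighborhood"] == "Not assigned",
    toy["Borough"],
    toy["Neighborhood"],
)
```

Only the second row changes; the first keeps its original neighborhood name.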

#### Resetting the index

In [6]:
df.reset_index(drop=True, inplace=True)
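
Resetting the index replaces the labels left over from filtering with a fresh 0, 1, 2, ... sequence. A minimal sketch, assuming a made-up dataframe `toy` whose index starts at 2: passing drop=True discards the old labels directly, so no extra "index" column is created.

```python
import pandas as pd

# Illustrative frame whose index still carries the pre-filter labels
toy = pd.DataFrame({"PostalCode": ["M3A", "M4A"]}, index=[2, 3])

# drop=True throws the old labels away instead of storing them
# in a new column named "index"
toy = toy.reset_index(drop=True)
```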

#### Renaming the Postal Code column to PostalCode

In [7]:
df.rename(columns={'Postal Code':'PostalCode'},inplace=True)

#### This is the cleaned dataframe

In [8]:
df.head()

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |


#### Now let's look at the dimensions of our dataset

In [9]:
df.shape

(103, 3)

*Assumption: the source table already lists the neighborhoods that share a postal code as a single comma-separated entry, so no grouping step was required.*
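
If the table had instead listed one neighborhood per row, the rows sharing a postal code could be combined with a groupby. This is only a sketch of that skipped step on made-up input; the dataframe `toy` and the result name `combined` are hypothetical:

```python
import pandas as pd

# Hypothetical input with one neighborhood per row
toy = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Regent Park", "Harbourfront", "Parkwoods"],
})

# Group rows by postal code and borough, then join the neighborhood
# names of each group into one comma-separated string.
# sort=False keeps the groups in order of first appearance.
combined = (
    toy.groupby(["PostalCode", "Borough"], sort=False)["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)
```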

### Importing the coordinates

#### The coordinates data is read from the CSV file at the URL below

In [10]:
df_loc = pd.read_csv("http://cocl.us/Geospatial_data")

#### Now, let's merge the coordinates with the neighborhood dataset, matching rows on postal code

In [11]:
df_loc = pd.merge(left=df, right=df_loc, left_on='PostalCode', right_on='Postal Code')
df_loc.head()

| | PostalCode | Borough | Neighborhood | Postal Code | Latitude | Longitude |
|---|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | M3A | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | M4A | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | M5A | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | M6A | 43.718518 | -79.464763 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | M7A | 43.662301 | -79.389494 |
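
The merge above is an inner join, which silently drops any postal code that has no coordinates. One way to check for that, sketched here on made-up frames (`hoods`, `coords`, and the postal code "M9Z" with placeholder coordinates are illustrative assumptions), is a left join with the `indicator` and `validate` options of pd.merge:

```python
import pandas as pd

# Illustrative stand-ins for the neighborhood and coordinate tables
hoods = pd.DataFrame({
    "PostalCode": ["M3A", "M4A"],
    "Borough": ["North York", "North York"],
})
coords = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M9Z"],  # "M9Z" is a made-up extra code
    "Latitude": [43.753259, 43.725882, 0.0],
    "Longitude": [-79.329656, -79.315572, 0.0],
})

# how="left" keeps every neighborhood row, matched or not.
# validate="one_to_one" raises if a postal code repeats on either side.
# indicator=True adds a "_merge" column recording where each row came from.
checked = pd.merge(
    left=hoods, right=coords,
    left_on="PostalCode", right_on="Postal Code",
    how="left", validate="one_to_one", indicator=True,
)

# Rows not marked "both" have no coordinates
unmatched = checked[checked["_merge"] != "both"]
```

An empty `unmatched` frame means every neighborhood row found its coordinates, so an inner join would return the same rows.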


#### Let's remove the extra Postal Code column and see how our data looks now

In [12]:
df_loc.drop("Postal Code",axis=1,inplace=True)
df_loc.head()

| | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |
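
An alternative that avoids the duplicate key column altogether is to rename the key in the coordinates table before merging, so both sides share one column name. A minimal sketch on made-up single-row frames (`hoods` and `coords` are illustrative names):

```python
import pandas as pd

# Illustrative single-row stand-ins for the two tables
hoods = pd.DataFrame({"PostalCode": ["M3A"], "Neighborhood": ["Parkwoods"]})
coords = pd.DataFrame({
    "Postal Code": ["M3A"],
    "Latitude": [43.753259],
    "Longitude": [-79.329656],
})

# With the same key name on both sides, on="PostalCode" produces
# a single key column and no separate drop step is needed
merged = hoods.merge(
    coords.rename(columns={"Postal Code": "PostalCode"}),
    on="PostalCode",
)
```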


In [13]:
df_loc.shape

(103, 5)