# Segmenting and Clustering Neighborhoods in Toronto

In this assignment, you will explore, segment, and cluster the neighborhoods in the city of Toronto. Unlike New York, Toronto's neighborhood data is not readily available on the internet. Every data science project is challenging in its own way, so you need to be agile and able to pick up new libraries and tools quickly as the project demands.

For the Toronto neighborhood data, a Wikipedia page has all the information we need to explore and cluster the neighborhoods in Toronto. You will scrape the Wikipedia page, wrangle and clean the data, and then read it into a pandas dataframe so that it is in a structured format like the New York dataset.

Once the data is in a structured format, you can replicate the analysis that we did on the New York City dataset to explore and cluster the neighborhoods in the city of Toronto.

Your submission will be a link to your Jupyter Notebook in your GitHub repository.

In [163]:
import pandas as pd
import numpy as np


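Read in the data from Wikipedia using the pd.read_html helper from the pandas library. pd.read_html returns a list of dataframes, one per HTML table found on the page, so we take the first element.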

In [164]:
df = pd.read_html("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")[0]

### Data Cleaning and Preprocessing
In this section we clean and prepare the data using the following guidelines:

1. Rename the Postcode column to PostalCode
2. Remove all rows with an unassigned Borough
3. Aggregate the data so that neighbourhoods sharing the same postal code are combined into one row, as a comma-separated value
4. Update the value of any unassigned Neighbourhood with the value of its Borough
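
Before running these steps on the scraped table, it can help to see them on a tiny example. The sketch below applies the four guidelines to a made-up frame; the rows in `toy` are invented for illustration and only assume the same three column names as the Wikipedia table.

```python
import pandas as pd

# Toy stand-in for the scraped table (rows are made up for illustration).
toy = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M5A", "M5A", "M7A"],
    "Borough": ["Not assigned", "North York", "Downtown Toronto",
                "Downtown Toronto", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Harbourfront",
                      "Regent Park", "Not assigned"],
})

# 1. Rename the column.
toy = toy.rename(columns={"Postcode": "PostalCode"})

# 2. Drop rows whose borough is unassigned.
toy = toy[toy["Borough"] != "Not assigned"].reset_index(drop=True)

# 3. One row per postal code, with its neighbourhoods joined by commas.
toy_groups = (
    toy.groupby(["PostalCode", "Borough"], as_index=False)
       .agg({"Neighbourhood": lambda names: ",".join(names)})
)

# 4. Fall back to the borough name where the neighbourhood is unassigned.
mask = toy_groups["Neighbourhood"] == "Not assigned"
toy_groups.loc[mask, "Neighbourhood"] = toy_groups.loc[mask, "Borough"]

print(toy_groups)
```

In this toy frame the two M5A rows collapse into a single row, and the M7A row takes its borough name as its neighbourhood.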

In [169]:
#1. Rename column
df.rename(columns={'Postcode': 'PostalCode'}, inplace=True)

In [None]:
#2. Filter out rows whose borough is unassigned
filtered_data = df[df['Borough'] != 'Not assigned'].reset_index(drop=True)

#3. Group by postal code and borough, joining the neighbourhoods that share a postal code
groups = filtered_data.groupby(['PostalCode', 'Borough'], as_index=False).agg({'Neighbourhood': lambda x: ','.join(x)})

#4. Give any unassigned neighbourhood the name of its borough
# groups.loc[matching rows, column] = values
unassigned = groups['Neighbourhood'] == 'Not assigned'
groups.loc[unassigned, 'Neighbourhood'] = groups.loc[unassigned, 'Borough']

groups

Show the shape of the cleaned dataset as (rows, columns)

In [167]:
groups.shape

(103, 3)
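
Beyond the shape, a few quick checks can confirm that the cleaned table follows the guidelines. The sketch below is one possible way to do that; `check_cleaned` is a hypothetical helper, not part of the assignment, and it is run here on a small made-up frame rather than on the real data.

```python
import pandas as pd

def check_cleaned(frame: pd.DataFrame) -> None:
    """Raise AssertionError if the frame violates the cleaning guidelines."""
    # The three expected columns are present.
    assert {"PostalCode", "Borough", "Neighbourhood"} <= set(frame.columns)
    # Each postal code appears on exactly one row after grouping.
    assert frame["PostalCode"].is_unique
    # No unassigned boroughs or neighbourhoods remain.
    assert not (frame["Borough"] == "Not assigned").any()
    assert not (frame["Neighbourhood"] == "Not assigned").any()

# Made-up rows that already satisfy the guidelines.
sample = pd.DataFrame({
    "PostalCode": ["M3A", "M5A"],
    "Borough": ["North York", "Downtown Toronto"],
    "Neighbourhood": ["Parkwoods", "Harbourfront,Regent Park"],
})

check_cleaned(sample)  # passes silently when every check holds
```

Calling `check_cleaned(groups)` in the notebook would apply the same checks to the real cleaned table.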