# Segmenting and Clustering Neighborhoods in Toronto

## To create the above dataframe:

* The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood
* Only process the cells that have complete information and are not greyed out or marked _Not assigned_.
* Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.
* In the last cell of your notebook, use the `.shape` attribute to print the number of rows of your dataframe.

## Page Layout of Postal Codes of Canada: M

The layout of this page appears to have changed since the original instructions were written. The page now presents the Postal Codes, the Boroughs and the Neighborhoods in simple table format:

![Postal Codes of Canada: M](./postalcode_m.jpg "Postal Codes of Canada: M")


---

### Use Pandas `read_html` to get the table from the Wikipedia page
The table I am interested in is the first table on the page.

In [1]:
import pandas as pd

# Read the table
# The table headers are in row 0
table = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M', header=0)

# Create the initial dataframe from the table
df = pd.DataFrame(data = table[0])

# Print the shape
print('The shape of the raw initial dataframe is: ', df.shape)

# Output the head of the table
df.head()

The shape of the raw initial dataframe is:  (289, 3)


Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


---

## Handle rows where Borough is set but Neighbourhood is _Not assigned_

Some of the rows have Borough set but Neighbourhood is _Not assigned_. In this situation the Neighbourhood is to be set to the same value as the Borough.

In [2]:
# Find these instances
df[(df.Borough != 'Not assigned') & (df.Neighbourhood == 'Not assigned')]

Unnamed: 0,Postcode,Borough,Neighbourhood
8,M7A,Queen's Park,Not assigned


There is only one such value, so it is easiest to fix it manually.

In [3]:
df.loc[df.Borough == "Queen's Park", 'Neighbourhood'] = "Queen's Park"

In [4]:
# Check again
df[(df.Borough != 'Not assigned') & (df.Neighbourhood == 'Not assigned')]

Unnamed: 0,Postcode,Borough,Neighbourhood
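
The manual fix works here because only one row is affected. If more rows matched, the same rule could be applied in one step by copying the Borough into the Neighbourhood wherever the mask is true. The following is a minimal sketch on a made-up frame (the `toy` dataframe and its values are illustrative assumptions, not the scraped data):

```python
import pandas as pd

# Toy data using the notebook's column names (values are illustrative only)
toy = pd.DataFrame({
    'Postcode': ['M1A', 'M3A', 'M7A'],
    'Borough': ['Not assigned', 'North York', "Queen's Park"],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Not assigned'],
})

# Rows where the Borough is known but the Neighbourhood is not
mask = (toy['Borough'] != 'Not assigned') & (toy['Neighbourhood'] == 'Not assigned')

# Copy the Borough into the Neighbourhood for just those rows
toy.loc[mask, 'Neighbourhood'] = toy.loc[mask, 'Borough']

print(toy)
```

Rows where both columns are _Not assigned_ are left untouched by this step.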


---

## Remove rows where Borough & Neighbourhood are _Not assigned_

In [5]:
df = df[(df.Borough != 'Not assigned') | (df.Neighbourhood != 'Not assigned')]

# Print the shape
print('The shape of the filtered dataframe is: ', df.shape)

# Output the head of the table
df.head()

The shape of the filtered dataframe is:  (212, 3)


Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
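
The filter keeps a row when either column is assigned, which is the same as dropping rows where both are _Not assigned_. As a quick sanity check, the two forms can be compared on a small made-up frame (the `toy` data below is an assumption for illustration):

```python
import pandas as pd

# Toy data using the notebook's column names (values are illustrative only)
toy = pd.DataFrame({
    'Borough': ['Not assigned', 'North York', "Queen's Park"],
    'Neighbourhood': ['Not assigned', 'Parkwoods', "Queen's Park"],
})

# Form 1: negate "both columns are Not assigned"
both_missing = (toy['Borough'] == 'Not assigned') & (toy['Neighbourhood'] == 'Not assigned')
kept_negated = toy[~both_missing]

# Form 2: keep rows where at least one column is assigned (as in the cell above)
kept_or = toy[(toy['Borough'] != 'Not assigned') | (toy['Neighbourhood'] != 'Not assigned')]

# Both selections contain exactly the same rows
same = kept_negated.equals(kept_or)
print(same)
```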


### Count the distinct values in each column

In [6]:
print('There are %d unique Postal Codes in the table' % df.Postcode.nunique())
print('There are %d unique Boroughs in the table' % df.Borough.nunique())
print('There are %d unique Neighbourhoods in the table' % df.Neighbourhood.nunique())

There are 103 unique Postal Codes in the table
There are 11 unique Boroughs in the table
There are 211 unique Neighbourhoods in the table


In [7]:
df.reset_index(drop=True, inplace=True)

---

## Some of the Neighbourhood values need to be cleaned up

Some of the Neighbourhood values contain stray `]` characters.

In [8]:
# Match the literal ']' character rather than treating it as a regular expression
df[df.Neighbourhood.str.contains(']', regex=False)]

Unnamed: 0,Postcode,Borough,Neighbourhood
33,M1E,Scarborough,Guildwood]]
38,M6E,York,Caledonia-Fairbanks]]
114,M6N,York,]The Junction North]]
115,M6N,York,Runnymede]]


These are easy to clean up by removing every `]` character.

In [9]:
# regex=False makes the literal replacement explicit across pandas versions
df['Neighbourhood'] = df['Neighbourhood'].str.replace(']', '', regex=False)

In [10]:
df[df.Neighbourhood.str.contains(']', regex=False)]

Unnamed: 0,Postcode,Borough,Neighbourhood
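
The clean-up can be tried in isolation on a few of the affected values. This is a small sketch, assuming literal (non-regex) matching is what is wanted; the extra `.str.strip()` is my own addition to guard against leftover whitespace and is not part of the cell above:

```python
import pandas as pd

# A few values copied from the output above, plus one clean value
names = pd.Series(['Guildwood]]', ']The Junction North]]', 'Woburn'])

# Flag values that contain a literal ']'
has_bracket = names.str.contains(']', regex=False)

# Remove every ']' and trim surrounding whitespace
cleaned = names.str.replace(']', '', regex=False).str.strip()

print(cleaned.tolist())
```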


---

## Group by Postal Code and Borough

The final task is to group by Postal Code and Borough and produce a list of all Neighbourhoods in each.

In [11]:
part_01 = pd.DataFrame(df.groupby(
    ['Postcode', 'Borough'])['Neighbourhood'].apply(
    lambda x: ', '.join(x))).reset_index()

In [12]:
part_01.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


In [13]:
part_01.tail()

Unnamed: 0,Postcode,Borough,Neighbourhood
98,M9N,York,Weston
99,M9P,Etobicoke,Westmount
100,M9R,Etobicoke,"Kingsview Village, Martin Grove Gardens, Richv..."
101,M9V,Etobicoke,"Albion Gardens, Beaumond Heights, Humbergate, ..."
102,M9W,Etobicoke,Northwest
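
To make the grouping step easier to follow, here is the same idea on a three-row made-up frame. It is a sketch that uses `.agg(', '.join)` in place of the `apply` with a lambda; I expect the two to produce the same joined strings, but the `toy` data and the `grouped` name are assumptions for illustration only:

```python
import pandas as pd

# Toy data using the notebook's column names (values are illustrative only)
toy = pd.DataFrame({
    'Postcode': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighbourhood': ['Harbourfront', 'Regent Park', 'Parkwoods'],
})

# One row per (Postcode, Borough), with the Neighbourhoods joined by ', '
grouped = (
    toy.groupby(['Postcode', 'Borough'])['Neighbourhood']
       .agg(', '.join)
       .reset_index()
)

print(grouped)
```

After grouping, the number of rows should equal the number of unique Postal Codes, which is a useful check on the real dataframe as well.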


In [14]:
part_01.shape

(103, 3)