# Applied Data Science Capstone - Week 3
The purpose of this notebook is to explore and clean the postal code data for Toronto neighborhoods.

The first step is to import the required libraries and load the postal code table from Wikipedia.

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')[0]
df

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
175,M5Z,Not assigned,Not assigned
176,M6Z,Not assigned,Not assigned
177,M7Z,Not assigned,Not assigned
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."


Rows with 'Not assigned' in the 'Borough' column are dropped, keeping only the rows that have an assigned borough.

In [3]:
df = df[df["Borough"] != 'Not assigned']
df

Unnamed: 0,Postal Code,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
160,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
165,M4Y,Downtown Toronto,Church and Wellesley
168,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
169,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."
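
To make the filtering step concrete, here is a minimal sketch on a small hand-made dataframe. The name `toy` and its three rows are illustrative assumptions, not the scraped data. The comparison produces a boolean Series, and indexing with it keeps only the rows where the value is True.

```python
import pandas as pd

# Toy data standing in for the scraped table (illustrative values only).
toy = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Victoria Village"],
})

# The comparison yields a boolean Series, one True/False per row.
mask = toy["Borough"] != "Not assigned"

# Indexing with the mask keeps the True rows; reset_index renumbers them from 0.
filtered = toy[mask].reset_index(drop=True)
print(filtered)
```

The `reset_index(drop=True)` call is optional. Without it, the surviving rows keep their original index labels, which is why the output above starts at 2.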


Next, I count the unique postal codes to know how many rows the dataframe should have once rows that share a postal code are combined into one.

In [4]:
len(df['Postal Code'].unique())

103
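
As a side note, the same count can be obtained with `nunique()`, and `duplicated()` tells whether any postal code actually repeats. The sketch below uses a hypothetical three-row dataframe (`toy`) rather than the real table.

```python
import pandas as pd

# Hypothetical rows in which one postal code appears twice.
toy = pd.DataFrame({
    "Postal Code": ["M5A", "M5A", "M6A"],
    "Neighbourhood": ["Regent Park", "Harbourfront", "Lawrence Manor"],
})

# Number of distinct postal codes, equivalent to len(toy['Postal Code'].unique()).
n_unique = toy["Postal Code"].nunique()

# True if at least one postal code occurs on more than one row.
has_duplicates = bool(toy["Postal Code"].duplicated().any())

print(n_unique, has_duplicates)
```

If `has_duplicates` is False, every postal code already sits on its own row and the grouping step leaves the row count unchanged.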

I create a new dataframe with only the Postal Code and Borough columns. It is merged later with the dataframe that results from grouping the rows that share a postal code.

In [5]:
df_borough = df[['Postal Code', 'Borough']]
df_borough

Unnamed: 0,Postal Code,Borough
2,M3A,North York
3,M4A,North York
4,M5A,Downtown Toronto
5,M6A,North York
6,M7A,Downtown Toronto
...,...,...
160,M8X,Etobicoke
165,M4Y,Downtown Toronto
168,M7Y,East Toronto
169,M8Y,Etobicoke


I create a new dataframe with one row per unique postal code, joining the neighborhoods that share a postal code into a single comma-separated string.

In [6]:
df1 = df.groupby('Postal Code')['Neighbourhood'].apply(lambda x: ', '.join(set(x.dropna()))).reset_index()
df1

Unnamed: 0,Postal Code,Neighbourhood
0,M1B,"Malvern, Rouge"
1,M1C,"Rouge Hill, Port Union, Highland Creek"
2,M1E,"Guildwood, Morningside, West Hill"
3,M1G,Woburn
4,M1H,Cedarbrae
...,...,...
98,M9N,Weston
99,M9P,Westmount
100,M9R,"Kingsview Village, St. Phillips, Martin Grove ..."
101,M9V,"South Steeles, Silverstone, Humbergate, Jamest..."
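
One thing to be aware of: `set` does not preserve order, so the joined neighborhood names may not come out in the order they appear in the source table. The sketch below is a possible variant, not the cell used above, that removes duplicates while keeping first-seen order. The helper name `join_unique` and the `toy` data are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical rows: M5A appears three times, once with a repeated name.
toy = pd.DataFrame({
    "Postal Code": ["M5A", "M5A", "M6A", "M5A"],
    "Neighbourhood": ["Regent Park", "Harbourfront", "Lawrence Manor", "Regent Park"],
})

def join_unique(names):
    """Join the non-null names of one group, dropping repeats but keeping order."""
    # dict.fromkeys keeps the first occurrence of each key in insertion order.
    return ", ".join(dict.fromkeys(names.dropna()))

combined = (
    toy.groupby("Postal Code")["Neighbourhood"]
    .apply(join_unique)
    .reset_index()
)
print(combined)
```

With this variant, repeated runs produce the same strings, which makes the output easier to compare between runs.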


Merging df_borough and df1 on Postal Code produces a new dataframe with the three columns needed: Postal Code, Borough, and Neighbourhood.

In [7]:
df_merged = pd.merge(df_borough, df1, on='Postal Code')
df_merged

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."
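
The merge above yields one row per postal code only if df_borough has no repeated postal codes. If it did, each repeat would produce an extra row in the result. The following sketch, on made-up data with hypothetical names `boroughs` and `neighbourhoods`, shows one way to guard against that: drop duplicate rows first and ask `pd.merge` to verify the relationship with its `validate` parameter.

```python
import pandas as pd

# Hypothetical borough table in which M5A is listed twice.
boroughs = pd.DataFrame({
    "Postal Code": ["M5A", "M5A", "M6A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
})

# Hypothetical grouped table with one row per postal code.
neighbourhoods = pd.DataFrame({
    "Postal Code": ["M5A", "M6A"],
    "Neighbourhood": ["Regent Park, Harbourfront", "Lawrence Manor"],
})

# Remove fully identical rows so each postal code appears once.
boroughs_unique = boroughs.drop_duplicates()

# validate='one_to_one' raises MergeError if either side repeats a key.
merged = pd.merge(boroughs_unique, neighbourhoods, on="Postal Code", validate="one_to_one")
print(merged)
```

Here `validate` acts as an assertion: the merge fails with an error instead of silently returning more rows than expected.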


I check whether any rows still have 'Not assigned' in the Neighbourhood column.

In [8]:
neighbourhood_not_assigned = df_merged[df_merged['Neighbourhood'] == 'Not assigned']
neighbourhood_not_assigned, len(neighbourhood_not_assigned)

(Empty DataFrame
 Columns: [Postal Code, Borough, Neighbourhood]
 Index: [],
 0)

Because no neighborhood is marked 'Not assigned', the data needs no further cleaning for that case.
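
Had such rows existed, one possible way to handle them would be to fall back to the borough name. This is only a sketch of that option under an assumed rule, since the data here does not need it. The `toy` dataframe and its values are hypothetical.

```python
import pandas as pd

# Hypothetical rows: the first has a borough but no assigned neighborhood.
toy = pd.DataFrame({
    "Postal Code": ["M7A", "M5A"],
    "Borough": ["Queen's Park", "Downtown Toronto"],
    "Neighbourhood": ["Not assigned", "Regent Park, Harbourfront"],
})

# Select the rows whose neighborhood is missing.
mask = toy["Neighbourhood"] == "Not assigned"

# Copy the borough name into the neighborhood column for those rows only.
toy.loc[mask, "Neighbourhood"] = toy.loc[mask, "Borough"]
print(toy)
```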

Finally, the shape of the resulting dataframe gives its number of rows and columns. The row count should match the 103 unique postal codes counted earlier.

In [9]:
df_merged.shape

(103, 3)