## Segmenting and Clustering Neighborhoods in the City of Toronto, Canada Part 1
We will build code to achieve the following:
* scrape data from the following site: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
* create an organized dataframe from the data extracted

### 1. Scraping the Data

In [112]:
#import pandas
import pandas as pd

In [113]:
#create a url reference and pull the data
wiki_url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
wiki_df = pd.read_html(wiki_url)
df_setup = wiki_df[0]

In [114]:
#keep only the columns we need; .copy() gives us an independent dataframe to modify
df = df_setup[['Postal Code','Borough','Neighborhood']].copy()
#rename 'Postal Code' so the column name has no space
df.rename(columns={"Postal Code":"PostalCode"}, inplace=True)
#let's view the unmodified data
df

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 5 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 6 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
| 7 | M8A | Not assigned | Not assigned |
| 8 | M9A | Etobicoke | Islington Avenue, Humber Valley Village |
| 9 | M1B | Scarborough | Malvern, Rouge |


### 2. Cleaning up the data & creating our final dataframe
Now that we have our data in a dataframe, let's clean it up.

In [115]:
#drop all rows without assigned Boroughs
df.drop(df[df['Borough'].str.contains('Not assigned')].index, inplace=True)

#check for any Neighborhoods with 'Not assigned' values
print(df.loc[df['Neighborhood'] == 'Not assigned'])

Empty DataFrame
Columns: [PostalCode, Borough, Neighborhood]
Index: []
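The same filter can be written as a boolean mask, which avoids the in-place `drop` and uses an exact comparison instead of a substring match. The following is a minimal sketch on a hand-built sample that mirrors the first rows of the table in section 1; the names `sample`, `assigned`, and `unassigned_neighborhoods` are illustrative and not part of the notebook.

```python
import pandas as pd

# Hand-built sample mirroring the first rows of the scraped table
sample = pd.DataFrame({
    "PostalCode": ["M1A", "M2A", "M3A", "M4A"],
    "Borough": ["Not assigned", "Not assigned", "North York", "North York"],
    "Neighborhood": ["Not assigned", "Not assigned", "Parkwoods", "Victoria Village"],
})

# Keep only the rows whose Borough is assigned (exact match, not substring)
assigned = sample[sample["Borough"] != "Not assigned"].reset_index(drop=True)

# Count rows that have a Borough but still lack a Neighborhood
unassigned_neighborhoods = int((assigned["Neighborhood"] == "Not assigned").sum())

print(assigned)
print("Rows with an unassigned Neighborhood:", unassigned_neighborhoods)
```

Whether to reset the index after filtering is a design choice; the notebook keeps the original index, while this sketch resets it for readability.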


In [116]:
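#group by PostalCode and collect the unique Neighborhood values for each code (this returns a Series of arrays)
df_grouped = df['Neighborhood'].groupby(df['PostalCode']).unique()

#convert to a dataframe and unwrap each array by keeping its first element
#(each PostalCode appears on a single row here, so the first element is the only one)
df_grouped_2 = pd.DataFrame(df_grouped)
df_grouped_2['Neighborhood'] = df_grouped_2['Neighborhood'].str.get(0)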

df_grouped_2

| PostalCode | Neighborhood |
|---|---|
| M1B | Malvern, Rouge |
| M1C | Rouge Hill, Port Union, Highland Creek |
| M1E | Guildwood, Morningside, West Hill |
| M1G | Woburn |
| M1H | Cedarbrae |
| M1J | Scarborough Village |
| M1K | Kennedy Park, Ionview, East Birchmount Park |
| M1L | Golden Mile, Clairlea, Oakridge |
| M1M | Cliffside, Cliffcrest, Scarborough Village West |
| M1N | Birch Cliff, Cliffside West |
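Note that `.str.get(0)` keeps only the first entry of each group. That is enough when every postal code sits on a single row, but it would silently discard neighborhoods if a postal code were spread over several rows. As a hedged sketch of one alternative, the groups can be joined into a single comma-separated string instead. The sample below is hypothetical: splitting M5A across two rows is an assumption made purely for illustration, and `sample` and `grouped` are illustrative names.

```python
import pandas as pd

# Hypothetical layout: M5A is split across two rows (illustration only)
sample = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Regent Park", "Harbourfront", "Parkwoods"],
})

# Join every Neighborhood of a (PostalCode, Borough) pair into one string
grouped = (
    sample.groupby(["PostalCode", "Borough"])["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)

print(grouped)
```

Grouping on both `PostalCode` and `Borough` also keeps the borough column, so no separate merge is needed in this variant.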


In [117]:
#merge the grouped Neighborhoods with the Borough of each PostalCode
df_final = pd.merge(df_grouped_2, df[['PostalCode', 'Borough']], on='PostalCode')
df_final = df_final[['PostalCode','Borough','Neighborhood']]

#drop rows whose Neighborhood value repeats an earlier row, keeping the first PostalCode
#(this also removes a later PostalCode that shares a Neighborhood value with an earlier one)
df_final = df_final.drop_duplicates(subset="Neighborhood", keep="first")

#final dataframe
df_final

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Malvern, Rouge |
| 1 | M1C | Scarborough | Rouge Hill, Port Union, Highland Creek |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
| 5 | M1J | Scarborough | Scarborough Village |
| 6 | M1K | Scarborough | Kennedy Park, Ionview, East Birchmount Park |
| 7 | M1L | Scarborough | Golden Mile, Clairlea, Oakridge |
| 8 | M1M | Scarborough | Cliffside, Cliffcrest, Scarborough Village West |
| 9 | M1N | Scarborough | Birch Cliff, Cliffside West |
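Deduplicating on `Neighborhood` drops every later postal code that shares a neighborhood value with an earlier one, so the row count can end up below the number of assigned postal codes. The sketch below contrasts that with deduplicating on `PostalCode`. All rows are made up for illustration (the codes `X1A` to `X1C` and the labels are hypothetical), so it shows the mechanism only and says nothing about the real Toronto data.

```python
import pandas as pd

# Hypothetical rows: X1A and X1B share one Neighborhood label, and X1B is repeated
sample = pd.DataFrame({
    "PostalCode": ["X1A", "X1B", "X1B", "X1C"],
    "Borough": ["Borough A", "Borough A", "Borough A", "Borough B"],
    "Neighborhood": ["Shared Name", "Shared Name", "Shared Name", "Other Name"],
})

# Keeps one row per Neighborhood label: X1B disappears entirely
by_neighborhood = sample.drop_duplicates(subset="Neighborhood", keep="first")

# Keeps one row per PostalCode: only the repeated X1B row is removed
by_postal_code = sample.drop_duplicates(subset="PostalCode", keep="first")

print("Rows after deduplicating on Neighborhood:", len(by_neighborhood))
print("Rows after deduplicating on PostalCode:", len(by_postal_code))
```

Which key to deduplicate on depends on whether one row per neighborhood label or one row per postal code is wanted downstream.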


In [119]:
#final row count
print("The final dataframe contains this many rows and columns:", df_final.shape)

The final dataframe contains this many rows and columns: (99, 3)
