# Segmenting and Clustering Neighborhoods in Toronto

### Part 1 : Scrape the provided Wikipedia page and build a pandas dataframe
#### Please scroll down for part 2

In [1]:
import pandas as pd

In [2]:
# Read every table on the Wikipedia page with the pandas read_html function;
# the postal code table is the first object in the returned list
tables = pd.read_html("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
print(tables[0])

    Postcode           Borough  \
0        M1A      Not assigned   
1        M2A      Not assigned   
2        M3A        North York   
3        M4A        North York   
4        M5A  Downtown Toronto   
5        M6A        North York   
6        M6A        North York   
7        M7A      Queen's Park   
8        M8A      Not assigned   
9        M9A  Downtown Toronto   
10       M1B       Scarborough   
11       M1B       Scarborough   
12       M2B      Not assigned   
13       M3B        North York   
14       M4B         East York   
15       M4B         East York   
16       M5B  Downtown Toronto   
17       M5B  Downtown Toronto   
18       M6B        North York   
19       M7B      Not assigned   
20       M8B      Not assigned   
21       M9B         Etobicoke   
22       M9B         Etobicoke   
23       M9B         Etobicoke   
24       M9B         Etobicoke   
25       M9B         Etobicoke   
26       M1C       Scarborough   
27       M1C       Scarborough   
28       M1C  

In [3]:
# read_html already returns DataFrames; copy the first one so that later
# in-place edits do not touch the original entry in `tables`
toronto_df = tables[0].copy()
toronto_df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


In [4]:
# Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned
toronto_df.drop(toronto_df.index[toronto_df['Borough']=='Not assigned'], axis=0, inplace=True)
toronto_df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor
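
The same filtering can also be written as a boolean mask, which avoids `inplace=True`. The snippet below is a minimal sketch on a toy dataframe; `sample_df` and `assigned_df` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Toy data mirroring the first rows of the scraped table (illustrative only)
sample_df = pd.DataFrame({
    "Postcode": ["M1A", "M2A", "M3A", "M4A"],
    "Borough": ["Not assigned", "Not assigned", "North York", "North York"],
    "Neighbourhood": ["Not assigned", "Not assigned", "Parkwoods", "Victoria Village"],
})

# Keep only the rows whose borough is assigned.
# The original index labels (2 and 3) are preserved, as in the output above.
assigned_df = sample_df[sample_df["Borough"] != "Not assigned"]
print(assigned_df)
```

Both forms should select the same rows; the mask version returns a new dataframe and leaves `sample_df` unchanged.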


#### More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma.

In [5]:
toronto_df=toronto_df.groupby(['Postcode','Borough'])['Neighbourhood'].apply(lambda x: ', '.join(x.astype(str))).reset_index()
toronto_df.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"
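
To see what the grouping step does in isolation, here is a small sketch on hand-made rows. The three rows are made up for illustration and are not the full scraped table, and `combined_df` is a hypothetical name.

```python
import pandas as pd

# Two rows share the postcode M5A; one row has a unique postcode
sample_df = pd.DataFrame({
    "Postcode": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighbourhood": ["Harbourfront", "Regent Park", "Parkwoods"],
})

# Group by postcode and borough, then join the neighbourhood names of
# each group into one comma-separated string
combined_df = (
    sample_df.groupby(["Postcode", "Borough"])["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)
print(combined_df)
```

The two M5A rows collapse into a single row whose neighbourhood is "Harbourfront, Regent Park", and the result is sorted by the grouping keys.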


#### If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough

In [6]:
toronto_df.loc[toronto_df['Neighbourhood']=='Not assigned', 'Neighbourhood'] = toronto_df['Borough']

In [7]:
# Check the result for Queen's Park
toronto_df[toronto_df['Borough']=="Queen's Park"]

Unnamed: 0,Postcode,Borough,Neighbourhood
85,M7A,Queen's Park,Queen's Park
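
An alternative way to express the same replacement is `Series.mask`, which swaps in the borough wherever the condition holds. This is only a sketch on toy data; `sample_df` and `unassigned` are illustrative names.

```python
import pandas as pd

# One row with an assigned borough but no neighbourhood, one complete row
sample_df = pd.DataFrame({
    "Postcode": ["M7A", "M3A"],
    "Borough": ["Queen's Park", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods"],
})

# True where the neighbourhood is missing
unassigned = sample_df["Neighbourhood"] == "Not assigned"

# Replace the flagged values with the borough from the same row
sample_df["Neighbourhood"] = sample_df["Neighbourhood"].mask(
    unassigned, sample_df["Borough"]
)
print(sample_df)
```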


#### Use the .shape attribute to print the number of rows of your dataframe

In [8]:
toronto_df.shape

(103, 3)

### End of part 1

### Part 2 : We now need to get the latitude and longitude of each neighbourhood. Use the Geocoder package or the csv file to create a dataframe containing the latitude and longitude.

In [9]:
# we shall use the csv file as the geocoder package is unreliable
# creating a new lat,long dataframe using the csv file
url = "https://cocl.us/Geospatial_data"
latlong_df = pd.read_csv(url)

latlong_df.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


#### Great! We have the latitude and longitude of the postal codes found above. Before merging the two dataframes, we shall check whether they have the same number of rows, i.e. the same number of unique postal codes.

In [10]:
latlong_df.shape

(103, 3)

#### So both the toronto_df and latlong_df have the same number of rows. 
#### One thing to note before merging the two dataframes is that the 'Postcode' column in the toronto_df dataframe holds the same information as the 'Postal Code' column in the latlong_df dataframe. So we shall rename the column in latlong_df to 'Postcode' for consistency.

In [11]:
# rename postal code column in latlong_df
latlong_df.rename(columns={'Postal Code':'Postcode'}, inplace=True)
latlong_df.head()

Unnamed: 0,Postcode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
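
Equal row counts do not by themselves guarantee that both dataframes contain the same postal codes. A stricter check compares the two sets of codes. The sketch below uses made-up codes, and `left_df` and `right_df` are hypothetical stand-ins for `toronto_df` and `latlong_df`.

```python
import pandas as pd

# Same number of rows, but one code differs between the two frames
left_df = pd.DataFrame({"Postcode": ["M1B", "M1C", "M1E"]})
right_df = pd.DataFrame({"Postcode": ["M1B", "M1C", "M1G"]})

left_codes = set(left_df["Postcode"])
right_codes = set(right_df["Postcode"])

# Codes that would get no coordinates, and coordinates that would go unused
missing_coordinates = left_codes - right_codes
unused_coordinates = right_codes - left_codes

print("Without coordinates:", sorted(missing_coordinates))
print("Unused coordinates:", sorted(unused_coordinates))
```

If both differences are empty, every postal code has a partner in the other dataframe.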


#### Finally we shall merge the two dataframes based on their postcode

In [12]:
toronto_df_new = pd.merge(toronto_df, latlong_df, on='Postcode')
toronto_df_new.head(15)

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",43.727929,-79.262029
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848
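
`pd.merge` can also check the join itself. As a hedged sketch on toy data, `validate="one_to_one"` raises an error if a postcode repeats on either side, and `indicator=True` adds a `_merge` column showing where each row came from. The names `neighbourhoods`, `coordinates`, `merged` and `unmatched` are illustrative.

```python
import pandas as pd

neighbourhoods = pd.DataFrame({
    "Postcode": ["M1B", "M1C"],
    "Borough": ["Scarborough", "Scarborough"],
})
coordinates = pd.DataFrame({
    "Postcode": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# A left join keeps every neighbourhood row even if coordinates are missing;
# validate raises MergeError when a postcode is duplicated on either side
merged = pd.merge(
    neighbourhoods,
    coordinates,
    on="Postcode",
    how="left",
    validate="one_to_one",
    indicator=True,
)

# Rows that found no partner in `coordinates` would be marked "left_only"
unmatched = merged[merged["_merge"] != "both"]
print(merged)
print("Unmatched rows:", len(unmatched))
```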


#### Save the file as a csv to be used in part 3

In [13]:
toronto_df_new.to_csv(r'toronto_df_new.csv')
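
Note that `to_csv` writes the row index as an extra first column by default, which comes back as `Unnamed: 0` when the file is read again. The sketch below shows the difference with an in-memory round trip on a one-row toy dataframe. Whether to pass `index=False` depends on how part 3 reads the file, which is not shown here.

```python
import io

import pandas as pd

df = pd.DataFrame({"Postcode": ["M1B"], "Latitude": [43.806686]})

# With no path argument, to_csv returns the CSV text as a string
with_index = df.to_csv()
without_index = df.to_csv(index=False)

reloaded_with = pd.read_csv(io.StringIO(with_index))
reloaded_without = pd.read_csv(io.StringIO(without_index))

print(list(reloaded_with.columns))     # includes 'Unnamed: 0'
print(list(reloaded_without.columns))  # only the original columns
```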

### End of part 2