# Segmenting and Clustering Neighborhoods in Toronto

## Resources

[Coursera Task](https://www.coursera.org/learn/applied-data-science-capstone/peer/I1bDq/segmenting-and-clustering-neighborhoods-in-toronto/submit)

[List of Canadian postal codes](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M)


Use the notebook to scrape the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, extract the table of postal codes, and transform the data into a pandas dataframe like the one shown below:

![Expected dataframe with PostalCode, Borough, and Neighborhood columns](https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/7JXaz3NNEeiMwApe4i-fLg_40e690ae0e927abda2d4bde7d94ed133_Screen-Shot-2018-06-18-at-7.17.57-PM.png?expiry=1580083200000&hmac=tBqZJZLeugrqekXb-dp1kP5E0QSb18FQ65wguNFrIsQ)


### To create the above dataframe:

The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood.

- Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
- More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.
- If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough. So for the 9th cell in the table on the Wikipedia page, the value of the Borough and the Neighborhood columns will be Queen's Park.
- Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.
- In the last cell of your notebook, use the `.shape` attribute to print the number of rows of your dataframe.

## Step 1 Load Data from Wikipedia into a DataFrame

In [1]:
import requests
import pandas as pd
from bs4 import BeautifulSoup

markup = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M").text
soup = BeautifulSoup(markup, 'lxml')
postcodes_table = soup.find("table", {"class": "wikitable sortable"})

postcodes = []
boroughs = []
neighbourhoods = []

for row in postcodes_table.find_all('tr')[1:]:
    # Look up the cells once per row; skip rows that lack the three expected columns
    cells = row.find_all('td')
    if len(cells) < 3:
        continue

    postcodes.append(cells[0].text.strip())
    boroughs.append(cells[1].text.strip())
    neighbourhoods.append(cells[2].text.strip())

df = pd.DataFrame(
    {
        'PostalCode': postcodes,
        'Borough': boroughs,
        "Neighborhood": neighbourhoods
    }
)
df

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
| ... | ... | ... | ... |
| 282 | M8Z | Etobicoke | Mimico NW |
| 283 | M8Z | Etobicoke | The Queensway West |
| 284 | M8Z | Etobicoke | Royal York South West |
| 285 | M8Z | Etobicoke | South of Bloor |
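
Scraped tables often contain header rows or rows with an unexpected number of cells. The following is a minimal sketch of row parsing that tolerates them. It runs on an inline HTML snippet (`SAMPLE_HTML`, invented here as a stand-in for the real page, whose markup may differ) and uses a hypothetical helper named `parse_postcode_rows`; neither is part of the notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia markup; the real page may differ.
SAMPLE_HTML = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td colspan="3">malformed row</td></tr>
</table>
"""


def parse_postcode_rows(markup):
    """Return (postcode, borough, neighbourhood) tuples from the first wikitable."""
    # html.parser ships with Python, so no lxml install is needed for this sketch
    soup = BeautifulSoup(markup, "html.parser")
    table = soup.find("table", {"class": "wikitable"})
    if table is None:
        raise ValueError("no table with class 'wikitable' found")

    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        # Header rows have no <td> cells and malformed rows have too few: skip both
        if len(cells) != 3:
            continue
        rows.append(tuple(cells))
    return rows


rows = parse_postcode_rows(SAMPLE_HTML)
print(rows)
```

Failing loudly when the table is missing can make a changed page layout easier to notice than an `AttributeError` on `None` later in the cell.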


## Step 2 Clean the Dataframe

In [2]:
# STEP 2 clean the dataframe

#  Only process the cells that have an assigned borough.
#  Ignore cells with a borough that is Not assigned.
df = df[df.Borough != "Not assigned"]


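# If a cell has a borough but a "Not assigned" neighborhood,
# the neighborhood becomes the same as the borough.
def map_not_assigned_neighborhoods(row):
    if row.Neighborhood == "Not assigned":
        row.Neighborhood = row.Borough

    return row


df = df.apply(map_not_assigned_neighborhoods, axis=1)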

# More than one neighborhood can exist in one postal code area.
# For example, in the table on the Wikipedia page, you will notice that
# M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park.
# These two rows will be combined into one row with the neighborhoods separated
# with a comma as shown in row 11 in the above table.
df.set_index('PostalCode', drop=False, inplace=True)
df_post_codes = df.PostalCode.unique()

for code in df_post_codes:
    candidate_for_duplicate = df.loc[code]
    # A single match is a Series of 3 values; several matching rows
    # form a DataFrame, so its size is larger than 3
    if candidate_for_duplicate.size > 3:
        str_neighbourhood = ", ".join(candidate_for_duplicate['Neighborhood'])
        df.loc[code, "Neighborhood"] = str_neighbourhood
        
df.drop_duplicates('PostalCode', inplace=True)

In [3]:
df

| PostalCode (index) | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| M3A | M3A | North York | Parkwoods |
| M4A | M4A | North York | Victoria Village |
| M5A | M5A | Downtown Toronto | Harbourfront |
| M6A | M6A | North York | Lawrence Heights, Lawrence Manor |
| M7A | M7A | Downtown Toronto | Queen's Park |
| ... | ... | ... | ... |
| M8X | M8X | Etobicoke | The Kingsway, Montgomery Road, Old Mill North |
| M4Y | M4Y | Downtown Toronto | Church and Wellesley |
| M7Y | M7Y | East Toronto | Business Reply Mail Processing Centre 969 Eastern |
| M8Y | M8Y | Etobicoke | Humber Bay, King's Mill Park, Kingsway Park So... |
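
The same three cleaning rules can also be expressed without an explicit loop. The sketch below is an alternative, not the approach used in this notebook. It runs on a few toy rows invented for illustration, and `clean_postcodes` is a hypothetical helper name.

```python
import pandas as pd

# Toy rows invented for illustration; the real table has many more entries
raw = pd.DataFrame({
    "PostalCode": ["M1A", "M5A", "M5A", "M7A"],
    "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto", "Queen's Park"],
    "Neighborhood": ["Not assigned", "Harbourfront", "Regent Park", "Not assigned"],
})


def clean_postcodes(frame):
    """Apply the three cleaning rules and return one row per postal code."""
    # Rule 1: keep only rows with an assigned borough
    assigned = frame[frame["Borough"] != "Not assigned"].copy()

    # Rule 2: an unassigned neighborhood takes the borough's name
    missing = assigned["Neighborhood"] == "Not assigned"
    assigned.loc[missing, "Neighborhood"] = assigned.loc[missing, "Borough"]

    # Rule 3: join the neighborhoods that share a postal code with ", "
    return (
        assigned.groupby(["PostalCode", "Borough"], sort=False)["Neighborhood"]
        .agg(", ".join)
        .reset_index()
    )


cleaned = clean_postcodes(raw)
print(cleaned)
print(cleaned.shape)
```

Unlike the loop-based cell, this variant returns a frame with a default integer index, so `PostalCode` exists only as a column.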


## Solution for Part 1

In [4]:
print(df.shape)

(103, 3)


In [5]:
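import os
import os.path as path

# Create the cache directory if needed, so the download below has somewhere to write
os.makedirs("./data_tmp", exist_ok=True)

if not path.exists("./data_tmp/geo_data.csv"):
    print("download csv file")
    r = requests.get("https://cocl.us/Geospatial_data")
    # Stop here on an HTTP error instead of caching an error page as CSV
    r.raise_for_status()
    with open("./data_tmp/geo_data.csv", "wb") as f:
        f.write(r.content)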

csv_data = pd.read_csv("./data_tmp/geo_data.csv")

# Merge the two frames on the PostalCode column
csv_data.rename(columns={"Postal Code": "PostalCode"}, inplace=True)
# Drop the PostalCode index first: the merge fails when PostalCode
# is both an index level and a column label
df.reset_index(level="PostalCode", drop=True, inplace=True)
df_combined = pd.merge(df, csv_data, on="PostalCode")
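
An inner merge silently drops postal codes that have no coordinates. The sketch below shows one possible way to check for that, using the `how`, `validate`, and `indicator` parameters of `pd.merge`. The frames are toy data: the two coordinate pairs are taken from the output in this notebook, while `M9Z` and its borough are made up to force a mismatch.

```python
import pandas as pd

neighborhoods = pd.DataFrame({
    # M9Z is a made-up code that has no entry in the coordinates frame
    "PostalCode": ["M3A", "M4A", "M9Z"],
    "Borough": ["North York", "North York", "Hypothetical"],
})
coordinates = pd.DataFrame({
    "PostalCode": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# how="left" keeps every neighborhood row, validate raises on duplicate keys,
# and indicator adds a "_merge" column that records where each row came from
checked = pd.merge(
    neighborhoods,
    coordinates,
    on="PostalCode",
    how="left",
    validate="one_to_one",
    indicator=True,
)

unmatched = checked.loc[checked["_merge"] == "left_only", "PostalCode"].tolist()
print(unmatched)
```

If `unmatched` is empty, the left merge and the inner merge contain the same rows.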


## Solution for Part 2

In [8]:
df_combined.head(12)

| | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Heights, Lawrence Manor | 43.718518 | -79.464763 |
| 4 | M7A | Downtown Toronto | Queen's Park | 43.662301 | -79.389494 |
| 5 | M9A | Queen's Park | Queen's Park | 43.667856 | -79.532242 |
| 6 | M1B | Scarborough | Rouge, Malvern | 43.806686 | -79.194353 |
| 7 | M3B | North York | Don Mills North | 43.745906 | -79.352188 |
| 8 | M4B | East York | Woodbine Gardens, Parkview Hill | 43.706397 | -79.309937 |
| 9 | M5B | Downtown Toronto | Ryerson, Garden District | 43.657162 | -79.378937 |


In [10]:
# Export the combined dataframe to JSON

df_combined.to_json(r'./data_tmp/dataset-canada.json')
