<center> <h1> Assignment: Segmenting and Clustering Neighborhoods in Toronto </h1> </center>

**Preliminary note:** *This notebook will be developed throughout our capstone project, which looks very exciting!*

## Importing libraries

*Before we get the data and start exploring it, let's import all the libraries that we will need.*

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

from bs4 import BeautifulSoup # this module helps with web scraping
import requests  # this module helps us to download a web page

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means for the clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


## Downloading and scraping the web page

We download the contents of the web page:

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

In [3]:
canada_data  = requests.get(url).text

We create the `soup` object:

In [4]:
soup = BeautifulSoup(canada_data,"html.parser")

Let us find all the tables of the webpage.

In [5]:
tables = soup.find_all('table') # in HTML, a table is represented by the <table> tag

Now, we scrape the **postal codes table** in the desired format.

In [6]:
rows = []
for row in tables[0].tbody.find_all("tr"):
    col = row.find_all("td")
    if col:  # skip header rows, which contain no <td> cells
        rows.append({
            "PostalCode": col[0].text.strip(),
            "Borough": col[1].text.strip(),
            "Neighborhood": col[2].text.strip(),
        })

# DataFrame.append was removed in pandas 2.0, so the frame is built once from the collected rows
postal_code_data = pd.DataFrame(rows, columns=["PostalCode", "Borough", "Neighborhood"])

postal_code_data.head()


Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
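
The row-by-row scraping pattern can be tried without touching the network. The sketch below runs the same loop on a tiny inline HTML table; `sample_html`, `sample_soup`, `sample_rows` and `sample_df` are hypothetical names, and the markup is a made-up stand-in for the Wikipedia table, not its real content.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the postal codes table (not the real page markup).
sample_html = """
<table>
  <tbody>
    <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
    <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
    <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  </tbody>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

sample_rows = []
for row in sample_soup.find("table").find_all("tr"):
    cells = row.find_all("td")
    if len(cells) == 3:  # the header row only has <th> cells, so it is skipped
        sample_rows.append([cell.text.strip() for cell in cells])

# Build the dataframe once, from the list of collected rows.
sample_df = pd.DataFrame(sample_rows, columns=["PostalCode", "Borough", "Neighborhood"])
print(sample_df)
```

Collecting the rows in a plain list and building the dataframe at the end avoids growing a dataframe inside the loop.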


## Preparing the postal codes dataframe

Now, we only deal with cells with an **assigned** borough. Therefore:

In [7]:
postal_code_data.drop(postal_code_data[postal_code_data['Borough']=='Not assigned'].index, inplace=True)
postal_code_data.reset_index(drop=True, inplace=True)
postal_code_data.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
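
As an alternative to `drop` with `inplace=True`, the same filtering can be written as a boolean mask. This is only a sketch on a three-row toy dataframe (`toy` and `assigned` are hypothetical names); it should behave like the cell above, but it returns a new dataframe instead of modifying the original.

```python
import pandas as pd

# Toy data mimicking the scraped table, including one unassigned postal code.
toy = pd.DataFrame({
    "PostalCode": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighborhood": ["Not assigned", "Parkwoods", "Victoria Village"],
})

# Keep only rows with an assigned borough, then renumber the index from 0.
assigned = toy[toy["Borough"] != "Not assigned"].reset_index(drop=True)
print(assigned)
```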


Let us group the dataset by postal code and borough, joining the neighborhoods that share a postal code into a single comma-separated string:

In [8]:
df_postcode = postal_code_data.groupby(['PostalCode', 'Borough'])['Neighborhood'].apply(', '.join).reset_index()
df_postcode.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
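
To see what the grouping does when a postal code appears on several rows, here is a minimal sketch on toy data. The dataframe `toy` is hypothetical and deliberately splits `M5A` across two rows, which is an assumption made for illustration only.

```python
import pandas as pd

# Toy data: M5A is listed twice, once per neighborhood.
toy = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Regent Park", "Harbourfront", "Parkwoods"],
})

# One row per (PostalCode, Borough); neighborhoods are concatenated in their original order.
grouped = (
    toy.groupby(["PostalCode", "Borough"])["Neighborhood"]
    .apply(", ".join)
    .reset_index()
)
print(grouped)
```

Because `groupby` sorts by the group keys by default, the result is ordered by postal code, which is why the grouped dataframe above starts at `M1B` rather than `M3A`.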


Let us find the **shape** of the dataframe:

In [9]:
postal_code_data.shape

(103, 3)

There are 103 rows in the dataset!

## Obtaining the longitude and latitude of each neighborhood

In [10]:
print('We are going to read data from a .csv file')
geospatial_data = pd.read_csv('Geospatial_Coordinates.csv',index_col='Postal Code')
geospatial_data.head()

We are going to read data from a .csv file


Postal Code,Latitude,Longitude
M1B,43.806686,-79.194353
M1C,43.784535,-79.160497
M1E,43.763573,-79.188711
M1G,43.770992,-79.216917
M1H,43.773136,-79.239476
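
The effect of `index_col` can be checked without the file on disk. The sketch below reads an in-memory CSV whose layout is assumed from the output above (a `Postal Code` column followed by `Latitude` and `Longitude`); `csv_text` and `coords` are hypothetical names.

```python
import io

import pandas as pd

# Two rows copied from the output above; the header layout is an assumption.
csv_text = """Postal Code,Latitude,Longitude
M1B,43.806686,-79.194353
M1C,43.784535,-79.160497
"""

# index_col turns the postal code into the row index, so rows can be looked up by code.
coords = pd.read_csv(io.StringIO(csv_text), index_col="Postal Code")
print(coords.loc["M1B", "Latitude"])
```

Indexing the coordinates by postal code is what allows the later `join(..., on='PostalCode')` to match a column on the left against the index on the right.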


Now, we **join** the two datasets to obtain the target dataset:

In [11]:
df_postcode_joined = df_postcode.join(geospatial_data, on='PostalCode')
df_postcode_joined.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
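
Since `join` performs a left join by default, a postal code without a match in the coordinates file would be kept with `NaN` coordinates rather than dropped. A minimal sketch of how one might check for that, using toy data in which `M9Z` is a made-up code with no coordinates (`left`, `coords`, `joined` and `missing` are hypothetical names):

```python
import pandas as pd

# Left side: postal codes and boroughs; "M9Z" is a hypothetical code for illustration.
left = pd.DataFrame({
    "PostalCode": ["M1B", "M1C", "M9Z"],
    "Borough": ["Scarborough", "Scarborough", "Hypothetical"],
})

# Right side: coordinates indexed by postal code, as in the CSV file.
coords = pd.DataFrame(
    {"Latitude": [43.806686, 43.784535], "Longitude": [-79.194353, -79.160497]},
    index=pd.Index(["M1B", "M1C"], name="Postal Code"),
)

# Match the PostalCode column against the index of coords (left join by default).
joined = left.join(coords, on="PostalCode")

# Rows whose postal code had no coordinates end up with NaN in Latitude.
missing = joined[joined["Latitude"].isna()]
print(missing["PostalCode"].tolist())
```

An empty list from the same check on `df_postcode_joined` would indicate that every postal code received coordinates.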
