<center> <h1> Assignment: Segmenting and Clustering Neighborhoods in Toronto </h1> </center>

**Preliminary note:** *This notebook will be developed throughout our capstone project, which looks very exciting!*

## Importing libraries

*Before we get the data and start exploring it, let's import all the libraries that we will need.*

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

from bs4 import BeautifulSoup # this module helps with web scraping
import requests  # this module helps us to download a web page

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means for the clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


## Downloading and scraping the web page

We download the contents of the web page:

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

In [3]:
canada_data  = requests.get(url).text

We parse the downloaded HTML into a `soup` object:

In [4]:
soup = BeautifulSoup(canada_data,"html.parser")

Let us find all the tables of the webpage.

In [5]:
tables = soup.find_all('table') # in html table is represented by the tag <table>

In [6]:
len(tables) # number of tables on the webpage

3

In [7]:
# Prettify the first table
print(tables[0].prettify())

<table class="wikitable sortable">
 <tbody>
  <tr>
   <th>
    Postal Code
   </th>
   <th>
    Borough
   </th>
   <th>
    Neighbourhood
   </th>
  </tr>
  <tr>
   <td>
    M1A
   </td>
   <td>
    Not assigned
   </td>
   <td>
    Not assigned
   </td>
  </tr>
  <tr>
   <td>
    M2A
   </td>
   <td>
    Not assigned
   </td>
   <td>
    Not assigned
   </td>
  </tr>
  <tr>
   <td>
    M3A
   </td>
   <td>
    North York
   </td>
   <td>
    Parkwoods
   </td>
  </tr>
  <tr>
   <td>
    M4A
   </td>
   <td>
    North York
   </td>
   <td>
    Victoria Village
   </td>
  </tr>
  <tr>
   <td>
    M5A
   </td>
   <td>
    Downtown Toronto
   </td>
   <td>
    Regent Park, Harbourfront
   </td>
  </tr>
  <tr>
   <td>
    M6A
   </td>
   <td>
    North York
   </td>
   <td>
    Lawrence Manor, Lawrence Heights
   </td>
  </tr>
  <tr>
   <td>
    M7A
   </td>
   <td>
    Downtown Toronto
   </td>
   <td>
    Queen's Park, Ontario Provincial Government
   </td>
  </tr>
  <tr>
   <td>
    M8

Now, we scrape the **postal codes table** in the desired format.

In [8]:
# Collect one dict per table row, then build the dataframe in a single step
# (DataFrame.append was removed in pandas 2.0)
rows = []
for row in tables[0].tbody.find_all("tr"):
    col = row.find_all("td")
    if col:  # the header row has only <th> cells, so it is skipped
        rows.append({
            "PostalCode": col[0].text.strip(),
            "Borough": col[1].text.strip(),
            "Neighborhood": col[2].text.strip(),
        })

postal_code_data = pd.DataFrame(rows, columns=["PostalCode", "Borough", "Neighborhood"])
postal_code_data.head()


Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
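
The scraping cell reads `tables[0]`, which works as long as the postal codes table stays first on the page. As a rough sketch of a less position-dependent alternative, the table can be selected by its `wikitable` class, which appears in the prettified output. The HTML below is a hypothetical miniature of the page (the `infobox` decoy table and all `sample_*` names are assumptions for illustration), so the example runs without a network request.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical miniature of the page: a decoy table followed by the one we want.
sample_html = """
<table class="infobox"><tbody><tr><td>decoy</td></tr></tbody></table>
<table class="wikitable sortable"><tbody>
<tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
<tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
<tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</tbody></table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Select the table by CSS class rather than by its position on the page.
sample_table = sample_soup.find("table", class_="wikitable")

records = []
for row in sample_table.find_all("tr"):
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    if len(cells) == 3:  # the header row has <th> cells only, so it is skipped
        records.append(dict(zip(["PostalCode", "Borough", "Neighborhood"], cells)))

sample_df = pd.DataFrame(records)
print(sample_df)
```

Selecting by class still depends on the page markup, so checking the header cells before scraping remains a sensible precaution.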


## Preparing the postal codes dataframe

We keep only the rows with an **assigned** borough and drop those marked `Not assigned`:

In [9]:
# Keep only rows whose borough is assigned, then renumber the index from 0
postal_code_data = postal_code_data[postal_code_data['Borough'] != 'Not assigned']
postal_code_data = postal_code_data.reset_index(drop=True)
postal_code_data.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
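
Before and after a filter like this, it can help to count how many rows are affected. The sketch below uses a small stand-in dataframe built from the first rows shown earlier; `sample` and `assigned` are illustrative names, not variables from this notebook.

```python
import pandas as pd

# Stand-in for the scraped table, using the first four rows shown earlier.
sample = pd.DataFrame({
    "PostalCode": ["M1A", "M2A", "M3A", "M4A"],
    "Borough": ["Not assigned", "Not assigned", "North York", "North York"],
    "Neighborhood": ["Not assigned", "Not assigned", "Parkwoods", "Victoria Village"],
})

# How many rows fall under each borough value, including "Not assigned"?
borough_counts = sample["Borough"].value_counts()
print(borough_counts)

# Apply the same filter as in the notebook, on the stand-in data.
assigned = sample[sample["Borough"] != "Not assigned"].reset_index(drop=True)

# Sanity checks: the number of dropped rows matches the count above,
# and no unassigned borough survives the filter.
dropped = len(sample) - len(assigned)
remaining_unassigned = int((assigned["Borough"] == "Not assigned").sum())
print(dropped, "rows dropped;", remaining_unassigned, "unassigned remain")
```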


Let us find the **shape** of the dataframe:

In [10]:
postal_code_data.shape

(103, 3)

The dataframe has 103 rows and 3 columns.

## Obtaining the longitude and latitude of each neighborhood

In [11]:
print('We are going to read data from a .csv file')
geospatial_data = pd.read_csv('Geospatial_Coordinates.csv')
geospatial_data.head()

We are going to read data from a .csv file


Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
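
A few quick checks on the coordinates file can catch problems before the join. The sketch below inlines the five rows shown above with `io.StringIO` so it runs without `Geospatial_Coordinates.csv`; the rename to `PostalCode` and the name `coords` are assumptions made for illustration, not steps taken in this notebook.

```python
import io
import pandas as pd

# The five rows shown above, inlined so the sketch runs without the file.
csv_text = """Postal Code,Latitude,Longitude
M1B,43.806686,-79.194353
M1C,43.784535,-79.160497
M1E,43.763573,-79.188711
M1G,43.770992,-79.216917
M1H,43.773136,-79.239476
"""

coords = pd.read_csv(io.StringIO(csv_text))

# Optional: align the key column name with the scraped dataframe.
coords = coords.rename(columns={"Postal Code": "PostalCode"})

# Each postal code should appear once, and coordinates should be plausible.
codes_unique = coords["PostalCode"].is_unique
in_range = coords["Latitude"].between(-90, 90) & coords["Longitude"].between(-180, 180)

print(coords.dtypes)
print("postal codes unique:", codes_unique)
print("all coordinates in range:", in_range.all())
```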


Now, we **join** the two datasets to obtain the target dataset:

In [21]:
postal_code_data_joined = postal_code_data.join(geospatial_data.set_index('Postal Code'),on='PostalCode')
postal_code_data_joined.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
