# Applied Data Science Capstone

Welcome to my Coursera Capstone Project notebook. In this notebook I'll develop my final project for the [IBM Data Science](https://www.coursera.org/professional-certificates/ibm-data-science) course.

In [2]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup

## Segmenting and Clustering
### Convert table to dataframe

First, I fetched and parsed the Wikipedia page.

In [3]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')

The Wikipedia page has a table containing all the postal codes, so I located its HTML node and converted it to a dataframe.

In [4]:
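from io import StringIO
table = soup.find('table')  # the first table on the page holds the postal codes
# Recent pandas versions deprecate literal HTML strings, so wrap the markup in StringIO
codes = pd.read_html(StringIO(str(table)))[0]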
codes.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
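
To make the table-to-dataframe step easier to follow, here is a minimal offline sketch. It parses a tiny hand-written HTML table, not the real Wikipedia page, and builds the dataframe row by row with BeautifulSoup. The `sample_html` string and the `table_to_dataframe` helper are hypothetical names made up for this example; on the real page, `pd.read_html` does the same job in a single call.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia table, kept tiny so it runs offline.
sample_html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M4A</td><td>North York</td><td>Victoria Village</td></tr>
</table>
"""

def table_to_dataframe(html):
    """Turn the first <table> in `html` into a dataframe (illustrative helper)."""
    table = BeautifulSoup(html, 'html.parser').find('table')
    rows = table.find_all('tr')
    # The first row holds the column names, the remaining rows hold the data.
    header = [cell.get_text(strip=True) for cell in rows[0].find_all('th')]
    data = [[cell.get_text(strip=True) for cell in row.find_all('td')]
            for row in rows[1:]]
    return pd.DataFrame(data, columns=header)

sample_codes = table_to_dataframe(sample_html)
print(sample_codes)
```

Building the rows by hand like this is only meant to show what the conversion involves; it assumes a simple table with one header row and no merged cells.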


To avoid invalid borough values, I removed every row whose borough is `Not assigned`.

In [13]:
codes = codes[(codes.Borough != 'Not assigned')]
codes.head(12)

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor
7,M7A,Downtown Toronto,Queen's Park
9,M9A,Etobicoke,Islington Avenue
10,M1B,Scarborough,Rouge
11,M1B,Scarborough,Malvern
13,M3B,North York,Don Mills North
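
The filtering step can be sketched on a few toy rows. This is only an illustration: the `toy`, `has_borough`, and `assigned` names are made up here, and the `.copy()` call is an optional addition that is not part of the cell above.

```python
import pandas as pd

# Toy rows shaped like the `codes` dataframe (values taken from the first output above).
toy = pd.DataFrame({
    'Postcode': ['M1A', 'M2A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'Not assigned', 'North York', 'North York'],
    'Neighbourhood': ['Not assigned', 'Not assigned', 'Parkwoods', 'Victoria Village'],
})

# Boolean mask: True for rows that have a real borough.
has_borough = toy['Borough'] != 'Not assigned'

# Optional .copy(): makes the result an independent dataframe, so later
# in-place changes do not trigger pandas' SettingWithCopyWarning.
assigned = toy[has_borough].copy()
print(assigned)
```

Filtering keeps the original row labels, which is why the index in the output above starts at 2 and has gaps.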


In [6]:
codes.shape

(210, 3)

### Merge postal codes and geolocations dataframes

Next, I created the `geo` dataframe from the provided `Geospatial_Coordinates.csv` file.

In [10]:
geo = pd.read_csv('https://raw.githubusercontent.com/thiagobodruk/Coursera_Capstone/master/Geospatial_Coordinates.csv')
geo.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


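Then I renamed the columns `Postcode` to `Postal Code` and `Neighbourhood` to `Neighborhood`, combined the neighborhoods that share a postal code into a single comma-separated row, and merged the two dataframes, `codes` and `geo`, into a new `df` dataframe containing all the information.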

In [32]:
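# Rename the columns so the key matches the `geo` dataframe
codes = codes.rename(columns={'Postcode': 'Postal Code', 'Neighbourhood': 'Neighborhood'})
# Combine neighborhoods that share a postal code and borough into one comma-separated string
codes = codes.groupby(by=['Postal Code', 'Borough'], sort=False).agg(', '.join).reset_index()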
df = codes.merge(geo, on = 'Postal Code')
df.head(12)

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,Harbourfront,43.65426,-79.360636
3,M6A,North York,"Lawrence Heights, Lawrence Manor",43.718518,-79.464763
4,M7A,Downtown Toronto,Queen's Park,43.662301,-79.389494
5,M9A,Etobicoke,Islington Avenue,43.667856,-79.532242
6,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
7,M3B,North York,Don Mills North,43.745906,-79.352188
8,M4B,East York,"Woodbine Gardens, Parkview Hill",43.706397,-79.309937
9,M5B,Downtown Toronto,"Ryerson, Garden District",43.657162,-79.378937
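
The group-then-merge step can be tried on a handful of rows taken from the outputs above. This is a sketch under a few assumptions: the `toy_codes`, `toy_geo`, `grouped`, and `merged` names are made up for the example, and the `how='inner'` and `validate='one_to_one'` arguments are optional extras that are not used in the cell above.

```python
import pandas as pd

# A few rows from the filtered postal-code table, after the column rename.
toy_codes = pd.DataFrame({
    'Postal Code': ['M6A', 'M6A', 'M7A', 'M1B', 'M1B'],
    'Borough': ['North York', 'North York', 'Downtown Toronto',
                'Scarborough', 'Scarborough'],
    'Neighborhood': ['Lawrence Heights', 'Lawrence Manor', "Queen's Park",
                     'Rouge', 'Malvern'],
})

# Matching coordinates, one row per postal code.
toy_geo = pd.DataFrame({
    'Postal Code': ['M1B', 'M6A', 'M7A'],
    'Latitude': [43.806686, 43.718518, 43.662301],
    'Longitude': [-79.194353, -79.464763, -79.389494],
})

# Join the neighborhood names of each (postal code, borough) pair.
# sort=False keeps the groups in order of first appearance.
grouped = (
    toy_codes
    .groupby(['Postal Code', 'Borough'], sort=False)['Neighborhood']
    .agg(', '.join)
    .reset_index()
)

# validate='one_to_one' raises an error if a postal code is duplicated on
# either side, which would silently multiply rows in the merged result.
merged = grouped.merge(toy_geo, on='Postal Code', how='inner',
                       validate='one_to_one')
print(merged)
```

Grouping before merging matters here: merging first would attach the same coordinates to every neighborhood row, while grouping first leaves one row per postal code.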
