# Assignment No. 1
## Segmenting and Clustering Neighbourhoods in Toronto

### Question 1 - Scraping the Data

In [1]:
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
from tabulate import tabulate
import requests

We will use requests.get() to download the Wikipedia page; the .text attribute of the response gives us the HTML as a string.

In [2]:
website_url = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
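
The download step can be made a little more defensive. The following is a minimal sketch, not the notebook's own code: the helper name fetch_html, the WIKI_URL constant, the injectable get parameter and the 10-second timeout are all assumptions introduced for illustration.

```python
import requests

# URL taken from the cell above
WIKI_URL = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'


def fetch_html(url, get=requests.get, timeout=10):
    """Return the page body as text, raising on HTTP errors.

    `get` is a hypothetical injection point so the function can be
    exercised without network access; it defaults to requests.get.
    """
    response = get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return response.text
```

With this helper, the cell above would become website_url = fetch_html(WIKI_URL). A failed request then stops the notebook immediately instead of passing an error page on to the parser.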

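BeautifulSoup() parses the HTML string into a searchable tree, and prettify() returns that tree as indented HTML text. We print the first 10 lines to confirm that the page was parsed; reading further into the output shows the table from which we have to scrape the data.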

In [52]:
soup = BeautifulSoup(website_url,'lxml')
result = soup.prettify().splitlines()
print('\n'.join(result[:10]))

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of postal codes of Canada: M - Wikipedia
  </title>
  <script>
   document.documentElement.className=document.documentElement.className.replace(/(^|\s)client-nojs(\s|$)/,"$1client-js$2");RLCONF={"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_postal_codes_of_Canada:_M","wgTitle":"List of postal codes of Canada: M","wgCurRevisionId":906439794,"wgRevisionId":906439794,"wgArticleId":539066,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Communications in Ontario","Postal codes in Canada","Toronto","Ontario-related lists"],"wgBreakFrames":!1,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June",

From the HTML of the page, we found that the data we want is in a table element with the class "wikitable sortable".
We will use soup.find() to locate that table and store it in My_table.

In [4]:
My_table= soup.find('table',{'class':'wikitable sortable'})
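
To see how the class lookup behaves, here is a small sketch on a hand-written HTML snippet. The snippet is a hypothetical stand-in for the Wikipedia page, and the sketch uses Python's built-in 'html.parser' instead of 'lxml' so that it runs without extra dependencies. It also shows select_one() with a CSS selector as a possible alternative that does not depend on the order of the class names.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page standing in for the Wikipedia HTML
html = """
<html><body>
  <table class="infobox"><tr><td>ignored</td></tr></table>
  <table class="wikitable sortable">
    <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
    <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  </table>
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')

# Match on the full class attribute string, as in the cell above
table = soup.find('table', {'class': 'wikitable sortable'})

# CSS selector alternative: matches a table carrying both classes, in any order
table_css = soup.select_one('table.wikitable.sortable')

# Read the header cells to confirm the right table was found
headers = [th.get_text(strip=True) for th in table.find_all('th')]
print(headers)
```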

Now we will parse My_table with pd.read_html(), which returns a list of DataFrames, one for each table it finds.
We then print the first DataFrame in tabular form using tabulate().

In [5]:
df = pd.read_html(str(My_table))
print( tabulate(df[0], headers='keys', tablefmt='psql') )


+-----+----------+------------------+---------------------------------------------------+
|     | 0        | 1                | 2                                                 |
|-----+----------+------------------+---------------------------------------------------|
|   0 | Postcode | Borough          | Neighbourhood                                     |
|   1 | M1A      | Not assigned     | Not assigned                                      |
|   2 | M2A      | Not assigned     | Not assigned                                      |
|   3 | M3A      | North York       | Parkwoods                                         |
|   4 | M4A      | North York       | Victoria Village                                  |
|   5 | M5A      | Downtown Toronto | Harbourfront                                      |
|   6 | M5A      | Downtown Toronto | Regent Park                                       |
|   7 | M6A      | North York       | Lawrence Heights                                  |
|   8 | M6

Note that df is a list of DataFrames, not a DataFrame itself.
We take its first element, promote the first row (which holds the column names) to the header, and store the result in df_final.

In [6]:
df_final=df[0]
df_final=df_final.rename(columns=df_final.iloc[0]).drop(df_final.index[0])
df_final.head()

|   | Postcode | Borough | Neighbourhood |
|---|----------|---------|---------------|
| 1 | M1A | Not assigned | Not assigned |
| 2 | M2A | Not assigned | Not assigned |
| 3 | M3A | North York | Parkwoods |
| 4 | M4A | North York | Victoria Village |
| 5 | M5A | Downtown Toronto | Harbourfront |
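
The header promotion used above can be tried in isolation. This is a minimal sketch on a hand-built frame (the name raw and its three rows are illustrative stand-ins for df[0]), showing that rename() maps the integer column labels to the values in row 0 and that drop() then removes that row.

```python
import pandas as pd

# Hypothetical stand-in for the raw table, whose first row holds the column names
raw = pd.DataFrame([
    ['Postcode', 'Borough', 'Neighbourhood'],
    ['M1A', 'Not assigned', 'Not assigned'],
    ['M3A', 'North York', 'Parkwoods'],
])

# raw.iloc[0] maps column labels 0, 1, 2 to the names stored in the first row;
# after renaming, that row is no longer data, so it is dropped
promoted = raw.rename(columns=raw.iloc[0]).drop(raw.index[0])

print(list(promoted.columns))
print(promoted.index.tolist())
```

The remaining rows keep their original index labels, which is why the data starts at index 1 at this stage.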


We will now clean the dataframe by removing all rows whose Borough is "Not assigned".

In [7]:
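# Filter, then renumber the rows from 0; reset_index() returns a new DataFrame,
# so later assignments to df_fin do not act on a filtered view of df_final
df_fin = df_final[df_final.Borough != 'Not assigned'].reset_index(drop=True)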
df_fin.head(10)

|   | Postcode | Borough | Neighbourhood |
|---|----------|---------|---------------|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront |
| 3 | M5A | Downtown Toronto | Regent Park |
| 4 | M6A | North York | Lawrence Heights |
| 5 | M6A | North York | Lawrence Manor |
| 6 | M7A | Queen's Park | Not assigned |
| 7 | M9A | Etobicoke | Islington Avenue |
| 8 | M1B | Scarborough | Rouge |
| 9 | M1B | Scarborough | Malvern |
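
The filtering step can be sketched on a few toy rows. The frame sample and the variable names keep and cleaned are assumptions for illustration; the sketch chains reset_index(drop=True) onto the boolean filter so that the result is a new frame numbered from 0.

```python
import pandas as pd

# Toy rows in the same shape as df_final
sample = pd.DataFrame({
    'Postcode': ['M1A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'North York', 'North York'],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Victoria Village'],
}, index=[1, 2, 3])

# Boolean mask: True for the rows we want to keep
keep = sample['Borough'] != 'Not assigned'

# drop=True discards the old index labels instead of adding them as a column
cleaned = sample[keep].reset_index(drop=True)
print(cleaned)
```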


Next, we replace each "Not assigned" Neighbourhood with its corresponding Borough, and then combine the neighbourhoods that share a postcode into a single row.

In [50]:
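# Where Neighbourhood is "Not assigned", fall back to the Borough name
not_assigned = df_fin['Neighbourhood'] == 'Not assigned'
df_fin.loc[not_assigned, 'Neighbourhood'] = df_fin.loc[not_assigned, 'Borough']
# Combine the neighbourhoods that share a Postcode and Borough into one row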
df_neigh=df_fin.groupby(['Postcode','Borough'])['Neighbourhood'].apply(' ,'.join).reset_index()
print('Shape ',df_neigh.shape)
df_neigh.head()

Shape  (103, 3)


|   | Postcode | Borough | Neighbourhood |
|---|----------|---------|---------------|
| 0 | M1B | Scarborough | Rouge ,Malvern |
| 1 | M1C | Scarborough | Highland Creek ,Rouge Hill ,Port Union |
| 2 | M1E | Scarborough | Guildwood ,Morningside ,West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
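
Both cleaning steps can be reproduced on three toy rows. This is a sketch under assumptions: the frame sample is illustrative, and the separator is ', ' (comma then space), where the notebook joins with ' ,'.

```python
import pandas as pd

# Toy rows: one postcode with two neighbourhoods, one with none assigned
sample = pd.DataFrame({
    'Postcode': ['M5A', 'M5A', 'M7A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', "Queen's Park"],
    'Neighbourhood': ['Harbourfront', 'Regent Park', 'Not assigned'],
})

# Step 1: use the Borough name wherever the Neighbourhood is "Not assigned"
missing = sample['Neighbourhood'] == 'Not assigned'
sample.loc[missing, 'Neighbourhood'] = sample.loc[missing, 'Borough']

# Step 2: one row per (Postcode, Borough), neighbourhoods joined into one string
grouped = (
    sample.groupby(['Postcode', 'Borough'])['Neighbourhood']
    .apply(', '.join)
    .reset_index()
)
print(grouped)
```

Grouping on both Postcode and Borough keeps the Borough column in the result; this relies on each postcode belonging to a single borough, as it does in the rows shown here.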


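### Question 2 - Creating a Dataframe for Geolocation

We load the latitude and longitude of each postal code from a CSV file, then left-join them onto df_neigh by postcode, so that every row of df_neigh is kept and gains its coordinates.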

In [9]:
df_geo= pd.read_csv('https://cocl.us/Geospatial_data')

In [51]:
df_geoloc= pd.merge(df_neigh, df_geo, left_on='Postcode', right_on='Postal Code', how='left').drop('Postal Code', axis=1)
df_geoloc.head()

|   | Postcode | Borough | Neighbourhood | Latitude | Longitude |
|---|----------|---------|---------------|----------|-----------|
| 0 | M1B | Scarborough | Rouge ,Malvern | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Highland Creek ,Rouge Hill ,Port Union | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Guildwood ,Morningside ,West Hill | 43.763573 | -79.188711 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |
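
A left join keeps every postcode even when no coordinates are found, so it can be worth checking for unmatched rows afterwards. The following sketch uses two small hand-built frames (neigh and geo are illustrative names, and the coordinates for M1E are deliberately left out) and adds validate='one_to_one' as an optional safeguard that is not part of the notebook's merge.

```python
import pandas as pd

# Toy stand-ins for df_neigh and df_geo
neigh = pd.DataFrame({
    'Postcode': ['M1B', 'M1C', 'M1E'],
    'Borough': ['Scarborough', 'Scarborough', 'Scarborough'],
})
# Coordinates for only two of the three codes, to show what a failed match looks like
geo = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C'],
    'Latitude': [43.806686, 43.784535],
    'Longitude': [-79.194353, -79.160497],
})

merged = pd.merge(
    neigh, geo,
    left_on='Postcode', right_on='Postal Code',
    how='left',
    validate='one_to_one',  # raises MergeError if a postcode repeats on either side
).drop('Postal Code', axis=1)

# Postcodes that found no coordinates end up with NaN in Latitude and Longitude
unmatched = merged.loc[merged['Latitude'].isna(), 'Postcode'].tolist()
print(unmatched)
```

An empty unmatched list on the real data would indicate that every postcode in df_neigh received coordinates.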
