# Explore and Cluster Neighborhoods in Toronto
### Scrape a Wikipedia page for a table of postal codes and preprocess it into a dataframe.
### Use the dataframe to explore and cluster neighborhoods

#### Explore the BeautifulSoup library for web scraping.

##### Documentation: https://beautiful-soup-4.readthedocs.io/en/latest/
##### Youtube video: https://www.youtube.com/watch?v=ng2o98k983k

In [1]:
from bs4 import BeautifulSoup
import requests

import pandas as pd

In [2]:
weblink="https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

toronto_data=requests.get(weblink)
toronto_data.raise_for_status()  # stop early if the download failed
#toronto_data.text

Next, scrape the page for the required data, which is held in a table.

Start with find_all('table') to list the tables on the page and identify the class of the required one.

Then use BeautifulSoup to assign that table to a variable.
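
The inspection step can be tried offline first. The sketch below is only an illustration: `sample_html` is a made-up stand-in for the downloaded page, and it uses Python's built-in `html.parser` instead of `lxml` so that it runs without extra dependencies.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page: two tables with different classes.
sample_html = """
<table class="infobox"><tr><td>ignore me</td></tr></table>
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th></tr>
  <tr><td>M3A</td><td>North York</td></tr>
</table>
"""

# html.parser ships with Python; the notebook itself uses the lxml parser.
sample_soup = BeautifulSoup(sample_html, "html.parser")

# List the class attribute of every table to spot the one we need.
table_classes = [t.get("class") for t in sample_soup.find_all("table")]
print(table_classes)

# Select the target table by its full class string.
target = sample_soup.find("table", {"class": "wikitable sortable"})
first_header = target.find("th").get_text(strip=True)
print(first_header)
```

Listing the classes first avoids guessing which table holds the postal codes when a page contains several.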

In [4]:
soup = BeautifulSoup(toronto_data.text,'lxml')
#print(soup.prettify())
#print(soup.get_text())

#table=soup.find_all('table')
#soup.table['class']
Toronto = soup.find('table',{'class':'wikitable sortable'})
Toronto

<table class="wikitable sortable">
<tbody><tr>
<th>Postcode</th>
<th>Borough</th>
<th>Neighbourhood
</th></tr>
<tr>
<td>M1A</td>
<td>Not assigned</td>
<td>Not assigned
</td></tr>
<tr>
<td>M2A</td>
<td>Not assigned</td>
<td>Not assigned
</td></tr>
<tr>
<td>M3A</td>
<td><a href="/wiki/North_York" title="North York">North York</a></td>
<td><a href="/wiki/Parkwoods" title="Parkwoods">Parkwoods</a>
</td></tr>
<tr>
<td>M4A</td>
<td><a href="/wiki/North_York" title="North York">North York</a></td>
<td><a href="/wiki/Victoria_Village" title="Victoria Village">Victoria Village</a>
</td></tr>
<tr>
<td>M5A</td>
<td><a href="/wiki/Downtown_Toronto" title="Downtown Toronto">Downtown Toronto</a></td>
<td><a href="/wiki/Harbourfront_(Toronto)" title="Harbourfront (Toronto)">Harbourfront</a>
</td></tr>
<tr>
<td>M5A</td>
<td><a href="/wiki/Downtown_Toronto" title="Downtown Toronto">Downtown Toronto</a></td>
<td><a href="/wiki/Regent_Park" title="Regent Park">Regent Park</a>
</td></tr>
<tr>
<td>M6A</td>

The output shows the table's structure:
1) Every row is a "tr" element
2) Header cells are "th" elements
3) Data cells are "td" elements
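
One detail of this structure is worth checking before looping over the rows: a "td" that holds a link followed by a newline has two children, so its `.string` attribute is `None`. The sketch below, built on a hypothetical one-row snippet (`row_html`), compares `.string` with `get_text(strip=True)`, which is one possible way to read such cells.

```python
from bs4 import BeautifulSoup

# Hypothetical single row modelled on the table output above.
row_html = (
    "<table><tr>"
    "<td>M5A</td>"
    '<td><a href="/wiki/Downtown_Toronto">Downtown Toronto</a></td>'
    '<td><a href="/wiki/Regent_Park">Regent Park</a>\n</td>'
    "</tr></table>"
)
cells = BeautifulSoup(row_html, "html.parser").find_all("td")

# .string is None for the last cell because it contains a link plus a newline.
raw_strings = [c.string for c in cells]
print(raw_strings)

# get_text(strip=True) gathers all nested text and trims the whitespace.
clean = [c.get_text(strip=True) for c in cells]
print(clean)
```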

In [5]:
toto_dict=[]

for tr in Toronto.find_all('tr')[1:-1]:
    data=tr.find_all(['th','td'])
    pcode=data[0].string
    bor=data[1].string
    neigh=(data[2].text).split("\n")[0]
    #print(neigh)
    
    toto_dict.append({'Postcode':pcode, 'Borough':bor, 'Neighbourhood':neigh})

toto_dict

[{'Postcode': 'M1A',
  'Borough': 'Not assigned',
  'Neighbourhood': 'Not assigned'},
 {'Postcode': 'M2A',
  'Borough': 'Not assigned',
  'Neighbourhood': 'Not assigned'},
 {'Postcode': 'M3A', 'Borough': 'North York', 'Neighbourhood': 'Parkwoods'},
 {'Postcode': 'M4A',
  'Borough': 'North York',
  'Neighbourhood': 'Victoria Village'},
 {'Postcode': 'M5A',
  'Borough': 'Downtown Toronto',
  'Neighbourhood': 'Harbourfront'},
 {'Postcode': 'M5A',
  'Borough': 'Downtown Toronto',
  'Neighbourhood': 'Regent Park'},
 {'Postcode': 'M6A',
  'Borough': 'North York',
  'Neighbourhood': 'Lawrence Heights'},
 {'Postcode': 'M6A',
  'Borough': 'North York',
  'Neighbourhood': 'Lawrence Manor'},
 {'Postcode': 'M7A',
  'Borough': "Queen's Park",
  'Neighbourhood': 'Not assigned'},
 {'Postcode': 'M8A',
  'Borough': 'Not assigned',
  'Neighbourhood': 'Not assigned'},
 {'Postcode': 'M9A',
  'Borough': 'Etobicoke',
  'Neighbourhood': 'Islington Avenue'},
 {'Postcode': 'M1B', 'Borough': 'Scarborough', 'Nei

In [6]:
#Convert the list of row dicts to a pandas DataFrame with a fixed column order
Table=pd.DataFrame(toto_dict, columns=['Postcode','Borough','Neighbourhood'])
print(Table.shape)
Table.head()

(288, 3)


Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


#### Filter Table to keep only rows with an assigned Borough

In [8]:
Table=Table[Table.Borough != 'Not assigned']
print(Table.shape)
Table.reset_index(drop=True,inplace=True)
Table.head()

(212, 3)


Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights


Next, handle missing neighborhoods:
if a row has a borough but its neighborhood is 'Not assigned', the neighborhood is set to the borough name.
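
A vectorised way to apply this rule to every affected row at once can be sketched on a toy frame. The frame `toy` and the postcode `M0X` with its borough are invented for illustration, and `Series.mask` is just one option among several.

```python
import pandas as pd

# Toy data: the last row (M0X) is hypothetical and not part of the scraped table.
toy = pd.DataFrame({
    "Postcode": ["M7A", "M9A", "M0X"],
    "Borough": ["Queen's Park", "Etobicoke", "Example Borough"],
    "Neighbourhood": ["Not assigned", "Islington Avenue", "Not assigned"],
})

# Boolean mask marking the rows whose neighbourhood is missing.
missing = toy["Neighbourhood"] == "Not assigned"

# Where the mask is True, take the value from the Borough column instead.
toy["Neighbourhood"] = toy["Neighbourhood"].mask(missing, toy["Borough"])
print(toy)
```

Working from a boolean mask means the rule still holds if the source table later gains more than one such row.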

In [9]:
# Replace every 'Not assigned' neighbourhood with its borough name
not_assigned=Table['Neighbourhood']=='Not assigned'
Table.loc[not_assigned,'Neighbourhood']=Table.loc[not_assigned,'Borough']
Table.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights
5,M6A,North York,Lawrence Manor
6,M7A,Queen's Park,Queen's Park
7,M9A,Etobicoke,Islington Avenue
8,M1B,Scarborough,Rouge
9,M1B,Scarborough,Malvern


Merge neighborhoods that share the same postcode and borough into a single row
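
The merge can be illustrated on three rows taken from the table above. This is a minimal sketch on a toy frame (`toy`), and it assumes that a plain comma is the desired separator.

```python
import pandas as pd

# Three rows from the scraped data: M5A appears twice.
toy = pd.DataFrame({
    "Postcode": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighbourhood": ["Harbourfront", "Regent Park", "Parkwoods"],
})

# Group on postcode and borough, then join each group's names into one string.
merged = (
    toy.groupby(["Postcode", "Borough"])["Neighbourhood"]
       .agg(",".join)
       .reset_index()
)
print(merged)
```

Note that `groupby` sorts the group keys by default, so the merged rows come out ordered by postcode.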

In [10]:
Table1=Table.groupby(['Postcode','Borough'])['Neighbourhood'].apply(list).reset_index()
print(Table1.shape)
# Join each list of neighbourhoods into one comma-separated string
Table1['Neighbourhood']=Table1['Neighbourhood'].apply(",".join)
Table1

(103, 3)


Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge,Malvern"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park,Ionview,Kennedy Park"
7,M1L,Scarborough,"Clairlea,Golden Mile,Oakridge"
8,M1M,Scarborough,"Cliffcrest,Cliffside,Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff,Cliffside West"


Print the shape of the resulting table

In [12]:
print("Table dimensions are:",Table1.shape)

Table dimensions are: (103, 3)
