# Segmenting and Clustering Neighborhoods in Toronto

![Image of Toronto](http://www.marks-clerk.fr/MarksClerk/media/MCMediaLib/Office%20Page%20Images%20fr/Toronto.jpg?width=976&height=340&ext=.jpg)


***

## Step 1: Retrieve Neighborhood Data

The first step in clustering and segmenting Toronto neighborhoods is to get data on them. Unfortunately, this data is not available as a ready-made dataset, so we use Wikipedia to retrieve it and build our own dataset.

To do this:

- [x] Source the Wikipedia page
- [x] Scrape the page
- [x] Clean the data and create a data frame


Install the required packages:

In [75]:
#Install (or upgrade) BeautifulSoup
#!pip install -U beautifulsoup4

#Install the lxml parser (optional: the code below uses Python's built-in 'html.parser')
#!pip install lxml

In [76]:
#import BeautifulSoup
import bs4 #for html parsing
import requests #to retrieve the HTML of a page from its URL
import pandas as pd

Source the Wikipedia page and get the HTML code:

In [77]:
url='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
responses=requests.get(url)
#responses.text

In [78]:
#Parse the HTML so that it can be searched (and displayed in a more readable form)
html_soup = bs4.BeautifulSoup(responses.text, 'html.parser')
#html_soup

<code> html_soup </code> contains the html source code of the page.

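We will keep only what we need: the table of neighborhoods. In HTML, it is marked up with the <code>table</code> tag, with <code>tr</code> tags for rows, <code>th</code> tags for header cells and <code>td</code> tags for data cells.

Below, **table_rows** contains all rows of the table, header included.
**table_rows** behaves like a list, and each of its elements is one row (a <code>tr</code> element) whose cells can be searched in turn.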
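
Note that the `find_all('tr')` call used below runs on the whole parsed document, so it returns the rows of every table on the page, not only those of the neighborhood table. A minimal sketch of one way to narrow the search, assuming the table of interest is the first `<table>` in the document. That assumption is checked here only against a tiny made-up HTML sample (`sample_html`), not against the live Wikipedia page:

```python
import bs4

# Made-up sample with two tables, standing in for the real page.
sample_html = """
<table>
<tr><th>Postal code</th><th>Borough</th><th>Neighborhood</th></tr>
<tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
<table>
<tr><td>NL</td><td>NS</td><td>PE</td></tr>
</table>
"""

sample_soup = bs4.BeautifulSoup(sample_html, 'html.parser')

# Searching the whole document returns the rows of every table.
all_rows = sample_soup.find_all('tr')

# Searching inside the first <table> returns only that table's rows.
first_table = sample_soup.find('table')
first_table_rows = first_table.find_all('tr')

print(len(all_rows), len(first_table_rows))  # 3 2
```

Whether the first table is the right one on the real page would need to be verified by inspecting the page source.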

In [79]:
#Find all table rows (<tr> tags) in the page
table_rows = html_soup.find_all('tr')
table_rows[0:2]

[<tr>
 <th>Postal code
 </th>
 <th>Borough
 </th>
 <th>Neighborhood
 </th></tr>,
 <tr>
 <td>M1A
 </td>
 <td>Not assigned
 </td>
 <td>
 </td></tr>]

First, we retrieve the header for our DataFrame.

In [80]:
#Standardize the column names (cleaning): keep only the text of each header cell
columns=[th.get_text(strip=True) for th in table_rows[0].find_all('th')]
#print(columns)

In [81]:
#Subset the table to keep only the content (remove header)
table_content=table_rows[1:]

Each element of the <code>table_content</code> list is an HTML row.

In [82]:
table_content[0]

<tr>
<td>M1A
</td>
<td>Not assigned
</td>
<td>
</td></tr>

In [83]:
#Retrieve the content of each row and strip the unwanted HTML tags.
d=[]
for tr in table_content:
    cells = tr.find_all('td')
    row = [cell.text for cell in cells]
    
    #Clean the text of each cell (line breaks and '/' separators)
    for i in range(0,len(row)):
        row[i]=row[i].replace('\n','')
        row[i]=row[i].replace('/',',')
        
    #Pad two-cell rows to avoid an error with the last row of the table
    if len(row)==2:
        row=row+['']
        
    #If a borough is assigned but the neighborhood is not, use the borough name.
    if (row[2]=='Not assigned') and (row[1] != 'Not assigned'):
        row[2]=row[1]
    
    #Keep only rows with a borough assigned
    if (row[1] != '') and (row[1] != 'Not assigned'):
        d.append({'Postal code':row[0], 'Borough':row[1], 'Neighborhood':row[2]})
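
The cleaning rules in the loop above can also be written as a small function that works on plain strings, which makes them easy to check without downloading the page. This is only a sketch: `clean_row` is a hypothetical helper name, it pads every short row (the loop above pads only two-cell rows), and the third sample row is made up for illustration.

```python
def clean_row(cells):
    """Apply the cleaning rules to one row given as a list of cell texts.

    Returns a dict for rows to keep, or None for rows to skip.
    """
    # Remove line breaks and replace '/' separators with commas.
    row = [c.replace('\n', '').replace('/', ',') for c in cells]

    # Pad short rows so that indexes 0..2 always exist.
    row = row + [''] * (3 - len(row))

    postal_code, borough, neighborhood = row[0], row[1], row[2]

    # Skip rows without an assigned borough.
    if borough in ('', 'Not assigned'):
        return None

    # Fall back to the borough name when the neighborhood is not assigned.
    if neighborhood == 'Not assigned':
        neighborhood = borough

    return {'Postal code': postal_code,
            'Borough': borough,
            'Neighborhood': neighborhood}


print(clean_row(['M1A\n', 'Not assigned\n', '\n']))           # skipped: None
print(clean_row(['M3A\n', 'North York\n', 'Parkwoods\n']))    # kept as-is
print(clean_row(['M0X\n', 'Some Borough\n', 'Not assigned\n']))  # made-up row: borough reused
```

Keeping the rules in one function also means they can be applied to each row with `clean_row([cell.text for cell in tr.find_all('td')])` and the `None` results filtered out.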


The result is a list containing all the information. We convert it into a DataFrame.

In [84]:
data=pd.DataFrame(d,columns=columns)


Inspecting the data shows that the last 3 rows are odd, so we remove them from our dataset.

In [85]:
data.head()

|   | Postal code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park , Harbourfront |
| 3 | M6A | North York | Lawrence Manor , Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park , Ontario Provincial Government |


In [86]:
data.tail()

|   | Postal code | Borough | Neighborhood |
|---|---|---|---|
| 101 | M8Y | Etobicoke | Old Mill South , King's Mill Park , Sunnylea ,... |
| 102 | M8Z | Etobicoke | Mimico NW , The Queensway West , South of Bloo... |
| 103 | NLNSPENBQCONMBSKABBCNU,NTYTABCEGHJKLMNPRSTVXY | NL | NS |
| 104 | NL | NS | PE |
| 105 | A | B | C |
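
The last three rows do not hold Toronto postal codes; they most likely come from other tables on the page, since the row search covered the whole document. Instead of dropping rows by position, one option is to keep only the rows whose code has the same shape as the others (the letter M, a digit, then a letter). A minimal sketch, where `is_toronto_code` is a hypothetical helper and the pattern is an assumption inferred from the codes shown in this notebook:

```python
import re

# Assumed pattern: 'M', one digit, one uppercase letter (e.g. 'M3A').
TORONTO_CODE = re.compile(r'M\d[A-Z]')


def is_toronto_code(code):
    """Return True when the whole string matches the assumed pattern."""
    return TORONTO_CODE.fullmatch(code) is not None


# Values taken from the 'Postal code' column of the tail shown above.
codes = ['M8Y', 'M8Z', 'NLNSPENBQCONMBSKABBCNU,NTYTABCEGHJKLMNPRSTVXY', 'NL', 'A']
kept = [c for c in codes if is_toronto_code(c)]
print(kept)  # ['M8Y', 'M8Z']
```

With the DataFrame, the same check could be applied as `data[data['Postal code'].map(is_toronto_code)]`, which avoids hard-coding the row indexes.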


In [87]:
data = data.drop([103,104,105],axis=0)
print('Dataset has ',data.shape[0],' rows (Borough)')
print(data.head())
print('...')
print(data.tail())

Dataset has  103  rows (Borough)
  Postal code           Borough                                  Neighborhood
0         M3A        North York                                     Parkwoods
1         M4A        North York                              Victoria Village
2         M5A  Downtown Toronto                    Regent Park , Harbourfront
3         M6A        North York             Lawrence Manor , Lawrence Heights
4         M7A  Downtown Toronto  Queen's Park , Ontario Provincial Government
...
    Postal code           Borough  \
98          M8X         Etobicoke   
99          M4Y  Downtown Toronto   
100         M7Y      East Toronto   
101         M8Y         Etobicoke   
102         M8Z         Etobicoke   

                                          Neighborhood  
98    The Kingsway , Montgomery Road  , Old Mill North  
99                                Church and Wellesley  
100              Business reply mail Processing CentrE  
101  Old Mill South , King's Mill Park , Sun

 ----

## Step 2: Enrich our dataset with geographic coordinates

## Additional sources used to complete all the tasks above (from beginning to end):
* https://hackersandslackers.com/scraping-urls-with-beautifulsoup/
* https://www.dataquest.io/blog/web-scraping-beautifulsoup/
* https://fr.python-requests.org/en/latest/user/advanced.html
* https://geocoder.readthedocs.io/api.html#install
