# Segmenting and Clustering Neighborhoods in Toronto

![Image of Toronto](http://www.marks-clerk.fr/MarksClerk/media/MCMediaLib/Office%20Page%20Images%20fr/Toronto.jpg?width=976&height=340&ext=.jpg)


## Step 1: Retrieve Neighborhood Data

The first step in segmenting and clustering Toronto's neighborhoods is to get data on them. Unfortunately, this data does not exist as a ready-made dataset, so we use Wikipedia to retrieve it and build our own.

To do this:

- [x] Source the Wikipedia page
- [x] Scrape it
- [x] Clean the data and create a data frame


In [4]:
#Install (or upgrade) BeautifulSoup
#!pip install -U beautifulsoup4


In [5]:
#install the lxml parser
#!pip install lxml

In [6]:
#import BeautifulSoup
import bs4 #for html parsing
import requests #to fetch the HTML of a page from a URL
import pandas as pd

In [7]:
url='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
responses=requests.get(url)
#responses.text

In [8]:
#Parse the HTML document to make it more readable
html_soup = bs4.BeautifulSoup(responses.text, 'html.parser') 
#html_soup

<code>html_soup</code> contains the HTML source code of the page. We keep only what we want: the table of neighborhoods. In HTML, this is the <code>table</code> tag together with the <code>tr</code>, <code>th</code> and <code>td</code> tags (rows, header cells and data cells).
Below, **table_rows** contains all rows of the table, header included. It behaves like a list, and each of its elements is one HTML row.
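
As a rough illustration of that structure, the sketch below parses a tiny hand-written table into a list of rows, each row being a list of cell strings. It uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra installs; the `RowCollector` class and the sample HTML are made up for this example and are not part of the notebook.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # cells of the row being read (None outside <tr>)
        self._cell = None  # text pieces of the cell being read (None outside a cell)

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag in ('th', 'td') and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ('th', 'td') and self._cell is not None:
            # Join the text pieces and drop the trailing newline
            self._row.append(''.join(self._cell).strip())
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hand-written sample mimicking the first two rows of the page
sample_html = """
<table>
<tr><th>Postal code
</th><th>Borough
</th><th>Neighborhood
</th></tr>
<tr><td>M1A
</td><td>Not assigned
</td><td>
</td></tr>
</table>
"""

parser = RowCollector()
parser.feed(sample_html)
header, first_row = parser.rows
print(header)
print(first_row)
```

The header row yields the column names and the second row yields three cell strings, the last one empty, which is the shape the cleaning step further down has to deal with.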

In [9]:
#find all table rows (<tr> tags)
table_rows = html_soup.find_all('tr')
table_rows[0:2]

[<tr>
 <th>Postal code
 </th>
 <th>Borough
 </th>
 <th>Neighborhood
 </th></tr>, <tr>
 <td>M1A
 </td>
 <td>Not assigned
 </td>
 <td>
 </td></tr>]

First we retrieve the header for our dataframe.

In [10]:
#Standardize the name of columns (cleaning)
#get_text() drops the <th> tags and strip() removes the trailing newline
columns=[th.get_text().strip() for th in table_rows[0].find_all('th')]
#    print(columns)

In [11]:
#Subset the table to keep only the content (remove header)
table_content=table_rows[1:]

Each element of the <code>table_content</code> list is an HTML row.

In [12]:
table_content[0]

<tr>
<td>M1A
</td>
<td>Not assigned
</td>
<td>
</td></tr>

In [13]:
d=[]
for tr in table_content:
    cells = tr.find_all('td')
    row = [cell.text for cell in cells]
    
    #clean the cell text: drop newlines and turn '/' separators into commas
    for i in range(0,len(row)):
        row[i]=row[i].replace('\n','')
        row[i]=row[i].replace('/',',')
        
    #to avoid error with the last row of the table
    if len(row)==2:
        row=row+['']
        
    #If Borough is assigned but neighborhood not assigned, replace by the borough name.
    if (row[2]=='Not assigned') and (row[1] != 'Not assigned'):
        row[2]=row[1]
    
    #Keep only rows with a Borough assigned
    if (row[1] != '') and (row[1] != 'Not assigned'):
        d.append({'Postal code':row[0], 'Borough':row[1], 'Neighborhood':row[2]})
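
The cleaning rules in the cell above can also be pulled out into a small pure function, which makes them easy to try on hand-made rows without fetching the page. This is only a sketch: `clean_row` is a hypothetical helper name, the sample rows are invented for illustration, and unlike the cell above it pads any short row rather than only rows of length 2.

```python
def clean_row(cells):
    """Apply the notebook's cleaning rules to one row of cell strings.

    Returns a dict for rows worth keeping, or None for rows to skip.
    """
    # Drop newlines and turn '/' separators into commas
    row = [c.replace('\n', '').replace('/', ',') for c in cells]

    # Pad short rows so that three values are always available
    row += [''] * (3 - len(row))
    postal_code, borough, neighborhood = row[:3]

    # Skip rows without an assigned borough
    if borough in ('', 'Not assigned'):
        return None

    # A borough with no assigned neighborhood lends its name to it
    if neighborhood == 'Not assigned':
        neighborhood = borough

    return {'Postal code': postal_code, 'Borough': borough, 'Neighborhood': neighborhood}


samples = [
    ['M1A\n', 'Not assigned\n', '\n'],
    ['M5A\n', 'Downtown Toronto\n', 'Regent Park / Harbourfront\n'],
    ['M9Z\n', 'Example Borough\n', 'Not assigned\n'],  # invented row
]
cleaned = [r for r in map(clean_row, samples) if r is not None]
for record in cleaned:
    print(record)
```

A list of such dicts can be passed to `pd.DataFrame` exactly as the notebook does with `d`.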


In [78]:
data=pd.DataFrame(d,columns=columns)


The last 3 rows look odd, so we remove them from our dataset.

In [77]:
data.tail()


| | Postal code | Borough | Neighborhood |
|---|---|---|---|
| 98 | M8X | Etobicoke | The Kingsway , Montgomery Road , Old Mill North |
| 99 | M4Y | Downtown Toronto | Church and Wellesley |
| 100 | M7Y | East Toronto | Business reply mail Processing CentrE |
| 101 | M8Y | Etobicoke | Old Mill South , King's Mill Park , Sunnylea ,... |
| 102 | M8Z | Etobicoke | Mimico NW , The Queensway West , South of Bloo... |


In [79]:
data = data.drop([103,104,105],axis=0)
print('Dataset has ',data.shape[0],' rows (Borough)')
print(data.head())
print('...')
print(data.tail())

Dataset has  103  rows (Borough)
  Postal code           Borough                                  Neighborhood
0         M3A        North York                                     Parkwoods
1         M4A        North York                              Victoria Village
2         M5A  Downtown Toronto                    Regent Park , Harbourfront
3         M6A        North York             Lawrence Manor , Lawrence Heights
4         M7A  Downtown Toronto  Queen's Park , Ontario Provincial Government
...
    Postal code           Borough  \
98          M8X         Etobicoke   
99          M4Y  Downtown Toronto   
100         M7Y      East Toronto   
101         M8Y         Etobicoke   
102         M8Z         Etobicoke   

                                          Neighborhood  
98    The Kingsway , Montgomery Road  , Old Mill North  
99                                Church and Wellesley  
100              Business reply mail Processing CentrE  
101  Old Mill South , King's Mill Park , Sun

## Step 2: Enrich our dataset with geographic coordinates

Now we have our dataset. The next step is to enrich it with the geographic coordinates of each location.
To do this, we use the geocoder library.

In [82]:
#!pip install geocoder

In [51]:
import geocoder

The assignment description suggests Google as the provider for the coordinates, but this did not work: after 1 hour, no coordinates had been retrieved. So we decided to change provider.

The library's documentation lists other providers, such as geocodefarm. The cells below use MapQuest, which does return coordinates for our queries, unlike Google in our case.

In [None]:
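# @hidden_cell
#key for mapquest, read from an environment variable instead of being hard-coded
#(the variable name MAPQUEST_API_KEY is our own choice)
import os
api_key=os.environ['MAPQUEST_API_KEY']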

In [143]:
test = data.loc[83,'Postal code'] + ', Toronto, Ontario'
print(test)
g = geocoder.mapquest(test,key=api_key)
print(g.latlng)


M4T, Toronto, Ontario
[43.651893, -79.381713]


To use it on the whole dataset, we first create two empty columns to store the results.

In [106]:
import numpy as np
data['latitude']=np.nan
data['longitude']=np.nan
data.head()

| | Postal code | Borough | Neighborhood | latitude | longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | NaN | NaN |
| 1 | M4A | North York | Victoria Village | NaN | NaN |
| 2 | M5A | Downtown Toronto | Regent Park , Harbourfront | NaN | NaN |
| 3 | M6A | North York | Lawrence Manor , Lawrence Heights | NaN | NaN |
| 4 | M7A | Downtown Toronto | Queen's Park , Ontario Provincial Government | NaN | NaN |


Then we run the loop to fill in our dataframe with the coordinates.

In [138]:

#.items() yields (index label, postal code) pairs, which is what data.loc expects
for i,code in data['Postal code'].items():
   
    # initialize the result to None
    lat_lng_coords = None
    location= code + ',Toronto, Ontario'
    
    #try up to 6 times before giving up on this postal code
    k=0
    while (lat_lng_coords is None) and (k<6):
        k= k+1
        g = geocoder.mapquest(location, key=api_key)
        lat_lng_coords = g.latlng
        
    if lat_lng_coords is None:
        #report the postal codes that could not be geocoded
        print(i, ': ', code, ' - ' ,lat_lng_coords)
    else:
        data.loc[i,'latitude']=lat_lng_coords[0]
        data.loc[i,'longitude']=lat_lng_coords[1]
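
The retry pattern in that loop (try up to a fixed number of times and stop as soon as a result arrives) can be isolated as follows. `fetch_with_retries` and `flaky_lookup` are hypothetical names; the fake lookup stands in for the real geocoder call so that the sketch runs offline.

```python
def fetch_with_retries(lookup, query, max_attempts=6):
    """Call lookup(query) until it returns something other than None.

    Gives up after max_attempts calls and returns None in that case.
    """
    for _ in range(max_attempts):
        result = lookup(query)
        if result is not None:
            return result
    return None


# Stand-in for a geocoder call: fails twice, then answers (made-up behaviour)
calls = {'count': 0}


def flaky_lookup(query):
    calls['count'] += 1
    if calls['count'] < 3:
        return None
    return [43.6625, -79.39255]


coords = fetch_with_retries(flaky_lookup, 'M7A, Toronto, Ontario')
print(coords, 'after', calls['count'], 'calls')
```

In the notebook, `lookup` would wrap the geocoder call and return its `latlng`. Depending on the provider's rate limits, a short pause between attempts may also be worth adding.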

Checking whether we have fetched all coordinates:

In [141]:
data[data['latitude'].isnull()]

| | Postal code | Borough | Neighborhood | latitude | longitude |
|---|---|---|---|---|---|


No row is returned, so every postal code received coordinates.

In [144]:
data

| | Postal code | Borough | Neighborhood | latitude | longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.654060 | -79.368190 |
| 1 | M4A | North York | Victoria Village | 43.654060 | -79.368190 |
| 2 | M5A | Downtown Toronto | Regent Park , Harbourfront | 43.654060 | -79.368190 |
| 3 | M6A | North York | Lawrence Manor , Lawrence Heights | 43.654060 | -79.368190 |
| 4 | M7A | Downtown Toronto | Queen's Park , Ontario Provincial Government | 43.662500 | -79.392550 |
| ... | ... | ... | ... | ... | ... |
| 98 | M8X | Etobicoke | The Kingsway , Montgomery Road , Old Mill North | 43.651893 | -79.381713 |
| 99 | M4Y | Downtown Toronto | Church and Wellesley | 43.651893 | -79.381713 |
| 100 | M7Y | East Toronto | Business reply mail Processing CentrE | 43.651893 | -79.381713 |
| 101 | M8Y | Etobicoke | Old Mill South , King's Mill Park , Sunnylea ,... | 43.651893 | -79.381713 |


We load the CSV file to check our fetched coordinates against it:

In [145]:
csv_link='https://cocl.us/Geospatial_data'
geo = pd.read_csv(csv_link)
geo.columns=['Postal code','Latitude','Longitude']

In [146]:
Toronto = pd.merge(data,geo,on='Postal code')
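
`pd.merge` defaults to an inner join, so any postal code missing from the CSV would silently disappear from `Toronto`. One way to check for this, sketched here on two small made-up frames rather than the real data, is a left join with `indicator=True`:

```python
import pandas as pd

# Tiny stand-ins for `data` and `geo` (the M9Z row is invented for illustration)
left = pd.DataFrame({'Postal code': ['M3A', 'M4A', 'M9Z'],
                     'Borough': ['North York', 'North York', 'Example Borough']})
right = pd.DataFrame({'Postal code': ['M3A', 'M4A'],
                      'Latitude': [43.753259, 43.725882],
                      'Longitude': [-79.329656, -79.315572]})

# The extra `_merge` column records where each row was found
checked = pd.merge(left, right, on='Postal code', how='left', indicator=True)
unmatched = checked.loc[checked['_merge'] == 'left_only', 'Postal code'].tolist()
print(unmatched)
```

An empty `unmatched` list on the real frames would mean the inner join lost nothing.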

In [147]:
Toronto.head()

| | Postal code | Borough | Neighborhood | latitude | longitude | Latitude | Longitude |
|---|---|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.65406 | -79.36819 | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.65406 | -79.36819 | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park , Harbourfront | 43.65406 | -79.36819 | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor , Lawrence Heights | 43.65406 | -79.36819 | 43.718518 | -79.464763 |
| 4 | M7A | Downtown Toronto | Queen's Park , Ontario Provincial Government | 43.6625 | -79.39255 | 43.662301 | -79.389494 |


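The coordinates differ between the two sources. For M7A the gap is small, but M3A, M4A, M5A and M6A all received the same point from the API (43.65406, -79.36819), while the CSV gives a distinct location for each. We keep these 4 columns and rename them to tell the two sources apart.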

In [149]:
Toronto.columns=['Postal code','Borough','Neighborhood','Latitude_API','Longitude_API','Latitude_csv','Longitude_csv']

In [150]:
Toronto.head()

| | Postal code | Borough | Neighborhood | Latitude_API | Longitude_API | Latitude_csv | Longitude_csv |
|---|---|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.65406 | -79.36819 | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.65406 | -79.36819 | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park , Harbourfront | 43.65406 | -79.36819 | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor , Lawrence Heights | 43.65406 | -79.36819 | 43.718518 | -79.464763 |
| 4 | M7A | Downtown Toronto | Queen's Park , Ontario Provincial Government | 43.6625 | -79.39255 | 43.662301 | -79.389494 |
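
To put a number on the gap between the two coordinate sources, one option is the great-circle (haversine) distance between each pair of points. The sketch below uses only the standard library and the first and fifth rows of the table above; `haversine_km` is a hypothetical helper, and the Earth radius of 6371 km is the usual mean-radius approximation.

```python
from math import asin, cos, radians, sin, sqrt


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * radius_km * asin(sqrt(a))


# (API latitude, API longitude, CSV latitude, CSV longitude) taken from the table
rows = {
    'M3A': (43.65406, -79.36819, 43.753259, -79.329656),
    'M7A': (43.6625, -79.39255, 43.662301, -79.389494),
}
distances = {code: haversine_km(*coords) for code, coords in rows.items()}
for code, km in distances.items():
    print(code, round(km, 2), 'km')
```

Applied to every row of `Toronto`, a distance like this would flag the postal codes where the two sources disagree the most.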


In [1]:
#Save the dataframe so we don't have to rerun the entire code
local_path=r'C:\Users\ASUS\Documents\Projects\GithubRepository\course_project\Capstone Project\Toronto.csv'
remote_path='repo/course_project/Capstone Project/Toronto.csv'
Toronto.to_csv(remote_path,index=False)

NameError: name 'Toronto' is not defined

## Step 3: First Mapping

In [6]:
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim 

#!conda install -c conda-forge geocoder --yes
import geocoder

import requests 
from pandas import json_normalize 

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans


import folium # map rendering library

print('Libraries imported.')


In [15]:
file='Toronto.csv'
Toronto=pd.read_csv(file)
Toronto.head()

| | Postal code | Borough | Neighborhood | Latitude_API | Longitude_API | Latitude_csv | Longitude_csv |
|---|---|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.65406 | -79.36819 | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.65406 | -79.36819 | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park , Harbourfront | 43.65406 | -79.36819 | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor , Lawrence Heights | 43.65406 | -79.36819 | 43.718518 | -79.464763 |
| 4 | M7A | Downtown Toronto | Queen's Park , Ontario Provincial Government | 43.6625 | -79.39255 | 43.662301 | -79.389494 |


In [8]:
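# @hidden_cell
#key for mapquest, read from an environment variable instead of being hard-coded
#(the variable name MAPQUEST_API_KEY is our own choice)
import os
api_key=os.environ['MAPQUEST_API_KEY']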

In [9]:
toronto_coord= geocoder.mapquest('Toronto', key=api_key)
toronto_coord

<[OK] Mapquest - Geocode [Toronto]>

In [10]:
toronto_lat=toronto_coord.latlng[0]
toronto_long=toronto_coord.latlng[1]



In [16]:
Toronto_map = folium.Map(location=[toronto_lat, toronto_long], zoom_start=12)

# add markers to map
for lat, lng, label in zip(Toronto['Latitude_csv'], Toronto['Longitude_csv'], Toronto['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(Toronto_map)  
    
Toronto_map

## Sources used to complete the tasks above:
* https://hackersandslackers.com/scraping-urls-with-beautifulsoup/
* https://www.dataquest.io/blog/web-scraping-beautifulsoup/
* https://fr.python-requests.org/en/latest/user/advanced.html
