# **Segmenting and Clustering Neighborhoods in Toronto**
This notebook is part 1 of a two-part project.

### *Goal: Obtain and Prepare Data for Project*
#### *Steps:*
    1. Scrape Wikipedia for Postal Codes, Boroughs and Neighbourhoods in Toronto.
        For this step I used the BeautifulSoup module with Python's built-in html.parser to pull the data from Wikipedia.
    2. Clean the data and convert it into a Pandas DataFrame.
    3. Obtain location coordinates for all Postal Codes.
         You may use Python's geocoder module for this task. For this project I already had the location data in a CSV.
    4. Create a new DataFrame containing Postal Codes, Boroughs, Neighbourhoods and their coordinates.

##### **Import Libraries for Project**

In [1]:
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import numpy as np
import matplotlib.pyplot as plt
import bs4 
import requests
import html.parser
from bs4 import BeautifulSoup as soup
print('done')

done


##### **Set Output Parameters**

In [None]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

#### *STEP 1:*

#### *1. Download the Toronto postal code table from the [Wikipedia](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M "Toronto Table") page and parse it with BeautifulSoup*

In [2]:
url=requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text

In [3]:
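soup_url= soup(url, 'html.parser')
# All tables on the page (kept for reference; not used below)
web_tab= soup_url.find_all('table')
# The first <tbody> on the page, which holds the postal code table
web_table= soup_url.find('tbody')
# Calling a Tag is shorthand for find_all(); show the first 20 descendant tags
web_table(limit = 20)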

[<tr>
 <th>Postcode</th>
 <th>Borough</th>
 <th>Neighbourhood
 </th></tr>,
 <th>Postcode</th>,
 <th>Borough</th>,
 <th>Neighbourhood
 </th>,
 <tr>
 <td>M1A</td>
 <td>Not assigned</td>
 <td>Not assigned
 </td></tr>,
 <td>M1A</td>,
 <td>Not assigned</td>,
 <td>Not assigned
 </td>,
 <tr>
 <td>M2A</td>
 <td>Not assigned</td>
 <td>Not assigned
 </td></tr>,
 <td>M2A</td>,
 <td>Not assigned</td>,
 <td>Not assigned
 </td>,
 <tr>
 <td>M3A</td>
 <td><a href="/wiki/North_York" title="North York">North York</a></td>
 <td><a href="/wiki/Parkwoods" title="Parkwoods">Parkwoods</a>
 </td></tr>,
 <td>M3A</td>,
 <td><a href="/wiki/North_York" title="North York">North York</a></td>,
 <a href="/wiki/North_York" title="North York">North York</a>,
 <td><a href="/wiki/Parkwoods" title="Parkwoods">Parkwoods</a>
 </td>,
 <a href="/wiki/Parkwoods" title="Parkwoods">Parkwoods</a>,
 <tr>
 <td>M4A</td>
 <td><a href="/wiki/North_York" title="North York">North York</a></td>
 <td><a href="/wiki/Victoria_Village" titl
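
The output above repeats each cell because calling a tag, as in `web_table(limit = 20)`, is BeautifulSoup shorthand for `find_all()`, which returns every descendant tag. Each `<tr>` is therefore followed by its own `<th>`/`<td>` children and any nested `<a>` links. The following is a minimal sketch on a small, made-up HTML snippet (`html_doc` and the variable names are illustrative and not part of the notebook) that contrasts this with searching for rows only:

```python
from bs4 import BeautifulSoup

# Hypothetical miniature version of the Wikipedia table, for illustration only
html_doc = """
<table><tbody>
<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
<tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
<tr><td>M3A</td><td><a href="/wiki/North_York">North York</a></td><td>Parkwoods</td></tr>
</tbody></table>
"""

sample_body = BeautifulSoup(html_doc, "html.parser").find("tbody")

# Calling the tag walks all descendants: the first <tr>, then its <th> cells, and so on
all_tags = sample_body(limit=5)
# Restricting the search to <tr> gives exactly one entry per table row
rows_only = sample_body.find_all("tr")

print([tag.name for tag in all_tags])
print(len(rows_only))
```

Searching for `"tr"` directly is what Step 2 relies on when it loops over the rows.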

### **Step 2.**

### *A. Extract & Clean raw html output above to obtain desired data and convert it to a Pandas DataFrame*

In [4]:
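# Collect the text of every <td> cell, one list per table row
tab = []
for tr in web_table.find_all("tr"):
    td = tr.find_all("td")
    # The header row contains only <th> cells, so it yields an empty list
    row = [cell.text.strip() for cell in td]
    tab.append(row)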
    
tdf = pd.DataFrame(tab, columns=["Postcode","Borough", "Neighborhood"])
print(tdf.shape)
tdf.head()

(288, 3)


Unnamed: 0,Postcode,Borough,Neighborhood
0,,,
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village


#### *2.B. Drop empty row at index 0*

In [5]:
# Drop the first row, which is empty (all None) because the header row has no <td> cells
tdf.drop(tdf.index[0], inplace= True)
tdf.reset_index(drop =True, inplace = True)
print(tdf.shape)
tdf.head()

(287, 3)


Unnamed: 0,Postcode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
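
As an alternative to dropping index 0 after the fact, the empty header row could be filtered out before the DataFrame is built. This is only a sketch on hand-written sample rows (`sample_rows` is hypothetical and stands in for the scraped `tab` list):

```python
import pandas as pd

# Hypothetical sample of scraped rows; the header <tr> has no <td> cells,
# so it appears as an empty list at the front
sample_rows = [
    [],
    ["M1A", "Not assigned", "Not assigned"],
    ["M3A", "North York", "Parkwoods"],
]

# Keep only rows that actually contain cells
non_empty_rows = [row for row in sample_rows if row]

sample_df = pd.DataFrame(non_empty_rows, columns=["Postcode", "Borough", "Neighborhood"])
print(sample_df.shape)
```

This does not depend on the empty row being first, so it should also cope with stray empty rows elsewhere in the table.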


#### *2.C. Drop any row that has no assigned Borough from the 'Borough' Column* 

In [6]:
#drop not assigned Borough 
tdf.drop(tdf[tdf['Borough']=="Not assigned"].index, axis=0, inplace=True)
tdf.head()

Unnamed: 0,Postcode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor
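
The same filter can also be written as a boolean mask that returns a new DataFrame instead of modifying `tdf` in place. A minimal sketch on made-up sample data (the `sample` frame is illustrative only):

```python
import pandas as pd

# Small stand-in for the scraped table
sample = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighborhood": ["Not assigned", "Parkwoods", "Victoria Village"],
})

# Keep only rows whose Borough is assigned, then renumber the index from 0
assigned = sample[sample["Borough"] != "Not assigned"].reset_index(drop=True)
print(assigned)
```

Unlike the cell above, this sketch resets the index, so the first remaining row is numbered 0 rather than 2.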


#### *2.D. Set every 'Neighborhood' that is not assigned to the same value as its Borough*

In [7]:
#If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough
tdf.loc[tdf['Neighborhood']=="Not assigned",'Neighborhood']=tdf.loc[tdf['Neighborhood']=="Not assigned",'Borough']
print("Shape of DataFrame = ",tdf.shape) 
print('No. of Unique Postcodes =', len(tdf['Postcode'].unique())) 
tdf.head()

Shape of DataFrame =  (210, 3)
No. of Unique Postcodes = 103


Unnamed: 0,Postcode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor
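
An equivalent way to express this replacement is `Series.mask`, which swaps in values from another Series wherever a condition is true. The sketch below uses invented sample rows, not the notebook's data:

```python
import pandas as pd

# Made-up rows: the first one has a Borough but no assigned Neighborhood
sample = pd.DataFrame({
    "Postcode": ["M7A", "M3A"],
    "Borough": ["Queen's Park", "North York"],
    "Neighborhood": ["Not assigned", "Parkwoods"],
})

# Where Neighborhood is "Not assigned", take the value from Borough instead
sample["Neighborhood"] = sample["Neighborhood"].mask(
    sample["Neighborhood"] == "Not assigned", sample["Borough"]
)
print(sample)
```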


#### *2.E. Group all Neighborhoods that share the same Postcode into one row to remove duplicate Postcodes.*
    In the output above, the number of unique Postcodes is 103 while the number of rows is 210, so some Neighborhoods share a Postcode.

In [8]:
# When several neighbourhoods share one postal code, their rows are combined into a single row
# with the neighbourhoods separated by commas (see M6A in the output below).
# Note: the lambda is applied to every column, so the Borough value is repeated as well.
tdf1= tdf.groupby('Postcode', sort =False).agg(lambda x: ','.join(x))
print(tdf1.shape)
tdf1.head(10)

(103, 2)


Postcode,Borough,Neighborhood
M3A,North York,Parkwoods
M4A,North York,Victoria Village
M5A,Downtown Toronto,Harbourfront
M6A,"North York,North York","Lawrence Heights,Lawrence Manor"
M7A,Downtown Toronto,Queen's Park
M9A,Queen's Park,Queen's Park
M1B,"Scarborough,Scarborough","Rouge,Malvern"
M3B,North York,Don Mills North
M4B,"East York,East York","Woodbine Gardens,Parkview Hill"
M5B,"Downtown Toronto,Downtown Toronto","Ryerson,Garden District"
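
Because the same join is applied to every column, the Borough is repeated once per neighbourhood in the output above (for example "North York,North York"). If a single Borough per Postcode is preferred, one possible variant is to aggregate each column differently. This is a sketch on made-up sample rows, assuming every Postcode belongs to exactly one Borough:

```python
import pandas as pd

# Illustrative rows: M6A appears twice with the same Borough
sample = pd.DataFrame({
    "Postcode": ["M5A", "M6A", "M6A"],
    "Borough": ["Downtown Toronto", "North York", "North York"],
    "Neighborhood": ["Harbourfront", "Lawrence Heights", "Lawrence Manor"],
})

# Keep the first Borough of each group and join only the Neighborhood names
grouped = (
    sample.groupby("Postcode", sort=False, as_index=False)
    .agg({"Borough": "first", "Neighborhood": ", ".join})
)
print(grouped)
```

With `as_index=False`, 'Postcode' stays a regular column, so no separate index reset is needed in this variant.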


### **Project Task 1**

#### *2.F.* 
#### *Reset the index so that 'Postcode' becomes a regular column again instead of the table index*

In [9]:
tdf1.reset_index(level="Postcode", inplace = True)
print(tdf1.shape)
tdf1.head(10)

(103, 3)


Unnamed: 0,Postcode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,"North York,North York","Lawrence Heights,Lawrence Manor"
4,M7A,Downtown Toronto,Queen's Park
5,M9A,Queen's Park,Queen's Park
6,M1B,"Scarborough,Scarborough","Rouge,Malvern"
7,M3B,North York,Don Mills North
8,M4B,"East York,East York","Woodbine Gardens,Parkview Hill"
9,M5B,"Downtown Toronto,Downtown Toronto","Ryerson,Garden District"


#### *Step 3.* 
#### *Download location data via the geocoder module, or read it from a CSV if one is provided.*

In [10]:
geodata= pd.read_csv('https://cocl.us/Geospatial_data')
geodata.rename(columns = {'Postal Code':'Postcode'}, inplace=True)
geodata.head()

Unnamed: 0,Postcode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
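
If no CSV is available, the coordinates would have to be looked up one postal code at a time, and geocoding services can fail intermittently. The sketch below shows one way to wrap such a lookup in a bounded retry loop. `lookup_latlng` and `fake_geocode` are hypothetical helpers written for this illustration; the real geocoder call appears only in a comment, since it needs the package and network access.

```python
def lookup_latlng(postal_code, geocode_fn, max_attempts=5):
    """Return (latitude, longitude) for a postal code, or None if every attempt fails."""
    query = "{}, Toronto, Ontario".format(postal_code)
    for _ in range(max_attempts):
        coords = geocode_fn(query)
        if coords:  # a geocoder may return None or an empty result on failure
            return tuple(coords)
    return None


# Stand-in geocoder for demonstration: fails once, then returns the CSV value for M1B
calls = {"count": 0}

def fake_geocode(query):
    calls["count"] += 1
    return None if calls["count"] == 1 else [43.806686, -79.194353]

m1b_coords = lookup_latlng("M1B", fake_geocode)
print(m1b_coords)

# With the geocoder package (assumed installed), the lookup function could be, for example:
#   import geocoder
#   geocode_fn = lambda q: geocoder.arcgis(q).latlng
```

Passing the lookup function as an argument keeps the retry logic testable without network access.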


### **Project Task 2**

#### *Step: 4*
#### Join the location data to the table above, which contains the "Postcode", "Borough" and "Neighborhood" data

In [11]:
dfj = tdf1.join(geodata.set_index('Postcode'), on="Postcode")

In [12]:
dfj

Unnamed: 0,Postcode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,Harbourfront,43.654260,-79.360636
3,M6A,"North York,North York","Lawrence Heights,Lawrence Manor",43.718518,-79.464763
4,M7A,Downtown Toronto,Queen's Park,43.662301,-79.389494
...,...,...,...,...,...
98,M8X,"Etobicoke,Etobicoke,Etobicoke","The Kingsway,Montgomery Road,Old Mill North",43.653654,-79.506944
99,M4Y,Downtown Toronto,Church and Wellesley,43.665860,-79.383160
100,M7Y,East Toronto,Business Reply Mail Processing Centre 969 Eastern,43.662744,-79.321558
101,M8Y,"Etobicoke,Etobicoke,Etobicoke,Etobicoke,Etobic...","Humber Bay,King's Mill Park,Kingsway Park Sout...",43.636258,-79.498509
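
A left join keeps every Postcode from the first table, so any Postcode missing from the location data would silently end up with empty coordinates. One way to check for that is `DataFrame.merge` with `indicator=True`. The sketch below uses small made-up frames; "M0Z" is an invented code with no coordinates:

```python
import pandas as pd

neighborhoods = pd.DataFrame({
    "Postcode": ["M3A", "M4A", "M0Z"],  # "M0Z" is invented for this example
    "Borough": ["North York", "North York", "Nowhere"],
})
coords = pd.DataFrame({
    "Postcode": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# indicator=True adds a "_merge" column recording where each row was found
merged = neighborhoods.merge(coords, on="Postcode", how="left", indicator=True)

# Rows marked "left_only" had no match in the location data
unmatched = merged.loc[merged["_merge"] == "left_only", "Postcode"].tolist()
print(unmatched)
```

An empty `unmatched` list would indicate that every Postcode received coordinates.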


### **Project Task 3**

#### *Step: 5*
#### Visualize Final Data on Map using Folium Module to show Toronto Neighborhood Cluster

In [21]:
from geopy.geocoders import Nominatim
address = 'Toronto'

# Nominatim requires a user_agent string that identifies the application
geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.653963, -79.387207.


In [33]:
# create map of Toronto using latitude and longitude values
import folium
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add one circle marker per postal code: popup shows the neighborhood(s), tooltip shows the borough
for lat, lng, borough, neighborhood in zip(dfj['Latitude'], dfj['Longitude'], dfj['Borough'], dfj['Neighborhood']):
    popup = folium.Popup(str(neighborhood), parse_html=True)
    tooltip = folium.Tooltip(str(borough))
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=popup,
        tooltip=tooltip,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)
    
map_toronto