# Segmenting and Clustering Neighborhoods in Toronto

Before we get the data and start exploring it, let's download all the dependencies that we will need.

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (top-level import since pandas 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

from bs4 import BeautifulSoup

print('Libraries imported.')

Libraries imported.


## 2. Download and Refine Dataset

Download the Wikipedia page that lists the Toronto postal codes and save it to a local file.


In [2]:
!wget -q -O 'toronto_data.html' https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

Open the file and parse it with Beautiful Soup. Take the first table in the HTML and build one row of the table_data array for each row of the table. The first row of the table uses TH cells rather than TD cells, so it comes back empty; replace it with the column headers.

In [3]:
with open("toronto_data.html") as html_doc:
    soup = BeautifulSoup(html_doc, 'lxml') 
    table_data = soup.find_all('table')[0] 
    table_data = [[cell.text.replace('\n', '') for cell in row("td")]
                         for row in table_data("tr")]
    table_data[0] = ["PostalCode","Borough","Neighborhood"]
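To see why the header row comes back empty, here is a minimal sketch on a small inline HTML table. The table contents and the `html.parser` backend are illustrative assumptions, not the Wikipedia page or the `lxml` parser used above. Calling a tag, as in `row("td")`, is shorthand for `row.find_all("td")`, so a row made only of TH cells yields an empty list.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page: one header row, two data rows.
html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find_all("table")[0]

# One list of cell texts per <tr>; only <td> cells are collected.
rows = [[cell.text.replace("\n", "") for cell in row("td")]
        for row in table("tr")]

first_row = list(rows[0])  # the header row has only <th> cells, so this is empty
rows[0] = ["PostalCode", "Borough", "Neighborhood"]

print(first_row)
print(rows)
```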

Using NumPy, convert the list into an array and then build a dataframe from it.

In [4]:
data = np.array(table_data)
df = pd.DataFrame({'PostalCode':data[1:,0],'Borough':data[1:,1],'Neighborhood':data[1:,2]})

not_assigned = 'Not assigned'

Now the cleanup process

Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.

In [5]:
df = df[df['Borough'] != not_assigned].copy() # copy so that later assignments modify this frame, not a view

If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough. 
So for the 9th cell in the table on the Wikipedia page, the value of the Borough and the Neighborhood columns will be Queen's Park.

In [6]:
# assign through a single .loc call; chained indexing may write to a temporary copy
df.loc[df['Neighborhood'] == not_assigned, 'Neighborhood'] = df['Borough']
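The fallback rule can be checked on a toy frame. This is a minimal sketch with illustrative rows (the `demo` frame is hypothetical, not the scraped data); it uses a single `.loc[mask, column]` assignment, which writes to the frame itself.

```python
import pandas as pd

not_assigned = 'Not assigned'

# Hypothetical rows: one neighborhood is missing, one is present.
demo = pd.DataFrame({
    'PostalCode': ['M7A', 'M5A'],
    'Borough': ["Queen's Park", 'Downtown Toronto'],
    'Neighborhood': ['Not assigned', 'Harbourfront'],
})

# Rows whose neighborhood is missing take the borough name instead.
mask = demo['Neighborhood'] == not_assigned
demo.loc[mask, 'Neighborhood'] = demo.loc[mask, 'Borough']

print(demo)
```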

More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, 
you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. 
These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.

In [7]:
toronto_data = df.groupby(['PostalCode', 'Borough'],as_index=False).agg({'Neighborhood': lambda x : ', '.join(x) } ) 
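As a small illustration of the grouping step, the sketch below applies the same groupby and join to three made-up rows (the `demo` frame is an assumption for illustration). Rows that share a postal code and borough collapse into one row with a comma-separated neighborhood list.

```python
import pandas as pd

# Hypothetical rows: M5A appears twice with different neighborhoods.
demo = pd.DataFrame({
    'PostalCode': ['M5A', 'M5A', 'M1G'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'Scarborough'],
    'Neighborhood': ['Harbourfront', 'Regent Park', 'Woburn'],
})

# Group on the key columns and join the neighborhood names of each group.
combined = (demo.groupby(['PostalCode', 'Borough'], as_index=False)
                .agg({'Neighborhood': ', '.join}))

print(combined)
```

Note that groupby sorts the group keys by default, so the combined frame comes back ordered by postal code.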

Use the .shape attribute to print the number of rows and columns of the dataframe.

In [8]:
toronto_data.reset_index(drop=True,inplace=True)
toronto_data.shape

(103, 3)

## 3. Localize the data

First, download the CSV file with the geospatial data.

In [9]:
!wget -q -O 'geo_data.csv' https://cocl.us/Geospatial_data

Read the local file into a dataframe.

In [10]:
df_loc = pd.read_csv("geo_data.csv")

Sort both dataframes by postal code so that their rows line up.

In [11]:
# sort_values returns a new frame, so assign the result back
df_loc = df_loc.sort_values(by="Postal Code").reset_index(drop=True)
toronto_data = toronto_data.sort_values(by="PostalCode").reset_index(drop=True)

Now concatenate both dataframes and drop the extra column.

In [12]:
# join_axes was removed in pandas 1.0; reindex keeps the rows of toronto_data
localized_data = pd.concat([toronto_data, df_loc], axis=1).reindex(toronto_data.index)
localized_data.drop(["Postal Code"], axis=1,inplace=True)
localized_data.head()

| | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill | 43.763573 | -79.188711 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |
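Positional concatenation relies on both frames holding the same postal codes in the same order. A key-based join does not need that assumption. The sketch below is an alternative, not what the notebook ran: `left` and `right` are small hypothetical stand-ins for toronto_data and df_loc, and `DataFrame.merge` matches rows on the two differently named key columns.

```python
import pandas as pd

# Hypothetical stand-ins; note the two frames are in different row order.
left = pd.DataFrame({
    'PostalCode': ['M1C', 'M1B'],
    'Borough': ['Scarborough', 'Scarborough'],
    'Neighborhood': ['Highland Creek, Rouge Hill, Port Union', 'Rouge, Malvern'],
})
right = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C'],
    'Latitude': [43.806686, 43.784535],
    'Longitude': [-79.194353, -79.160497],
})

# Match rows by postal code, keep the left frame's order, drop the duplicate key.
merged = (left.merge(right, left_on='PostalCode', right_on='Postal Code', how='left')
              .drop(columns='Postal Code'))

print(merged)
```

With a left join, a postal code missing from the coordinates file would show up as NaN in Latitude and Longitude instead of silently pairing with the wrong row.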


## 4. Final section

Next, we look only at the neighborhoods located in Scarborough. The first step is to filter the data down to the rows whose borough is Scarborough.

In [16]:
scar_data = localized_data[localized_data['Borough'] == 'Scarborough'].reset_index(drop=True)
scar_data.head()

| | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill | 43.763573 | -79.188711 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |


Now we need to find the latitude and longitude of Scarborough.

In [17]:
address = 'Scarborough, Toronto'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Scarborough are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Scarborough are 43.773077, -79.257774.


At this point we can create a map of Scarborough using the latitude and longitude values obtained above.

In [18]:
map_scar = folium.Map(location=[latitude, longitude], zoom_start=11)

Finally, add a marker to the map for every neighborhood in Scarborough.


In [19]:
for lat, lng, label in zip(scar_data['Latitude'], scar_data['Longitude'], scar_data['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_scar)  
    
map_scar