<h1 align=center><font size = 10><em>Battle of Neighbourhood</em></font></h1>


<h1 align=center>A Capstone Project for Segmenting and Clustering Neighbourhoods in Toronto</h1>

##### Project By:
*** Ranjeet Sahay ***

## Introduction

This notebook is divided into 3 sections:
1. <a href='#dw'>Data Wrangling</a> - Web scraping and collecting the data, cleaning it, and preparing it for analysis.
2. <a href='#gm'>Geo Mapping data</a> - Mapping the boroughs and neighbourhoods of the City of Toronto to their geographical coordinates.
3. <a href='#map'>Plotting Toronto</a> - Plotting Toronto boroughs and neighbourhoods on a Folium map.


In [23]:
##Import the required libraries

import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

## 1. Web Scraping and Data Cleaning<a id='dw'>.</a>

#### Webscraping Postal Codes
The first step is to read the page that lists the postal codes with their Borough and Neighbourhood in tabular format.
Since the page is HTML, the pandas <code>read_html</code> function is used here. It parses every table found in the HTML page and returns a list of dataframes, one per table.
We are interested only in the first table, so we take <code>tables[0]</code>, which is a pandas dataframe.

The <code>header=0</code> argument indicates that the first row of the table holds the column headers.

In [24]:
#Use the pandas function read_html to read every table on the HTML page.
tables = pd.read_html("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M", header=0)
df_pctab = tables[0] # take the first table, which contains the postal codes.
df_pctab.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
9,M8A,Not assigned,Not assigned


In [25]:
df_pctab.shape

(288, 3)

There are 288 rows in this dataframe.<br><br>
This completes the first requirement of this assignment.
___

#### Cleaning Boroughs not assigned

The Borough column contains "Not assigned" values, which must be removed to obtain a clean dataframe. This is the second part of the assignment.
In this code section:
1. The rows where the Borough value is "Not assigned" are selected with the filter <code>df_clean["Borough"]=='Not assigned'</code>.
2. <code>.index</code> returns the index labels of all rows matching that filter.
3. The <code>drop</code> method of the pandas dataframe removes the rows with those labels. <code>inplace=True</code> modifies the dataframe in place.
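
The same cleaning step can also be written as a boolean mask that keeps the wanted rows instead of dropping the unwanted ones. The following is a minimal sketch on a toy dataframe; the rows are taken from the table shown earlier, and the names `toy`, `unwanted`, `dropped` and `kept` are illustrative only.

```python
import pandas as pd

# Toy stand-in for the scraped table (a few rows from the output above).
toy = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M4A", "M7A"],
    "Borough": ["Not assigned", "North York", "North York", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Victoria Village", "Not assigned"],
})

# Approach described in the text: collect the index labels of the
# unwanted rows, then drop those labels.
unwanted = toy[toy["Borough"] == "Not assigned"].index
dropped = toy.drop(unwanted, axis=0)

# Equivalent approach: keep only the rows whose Borough is assigned.
kept = toy[toy["Borough"] != "Not assigned"]

print(dropped.equals(kept))
print(kept.shape)
```

Both approaches give the same rows; the mask form returns a new dataframe rather than modifying an existing one in place.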

In [26]:
df_clean = df_pctab.copy() #Creating an independent copy, so the in-place drop below does not act on a slice of df_pctab

df_clean.drop(df_clean[df_clean["Borough"]=='Not assigned'].index, axis=0, inplace=True) #remove all rows where Borough is "Not assigned"

df_clean.shape


(211, 3)

><b>Note:</b><br>
>    <i>The number of rows drops from 288 to 211 after removing the rows whose Borough is "Not assigned".</i> 
    
    



#### Aggregating the Neighbourhoods

Since a postal code can cover more than one neighbourhood, the requirement here is to join them as <code>,</code> separated values.

1. Grouping by Postcode and Borough collects the rows that share the same postal code.
2. More than one Neighbourhood might be present for the same borough and postal code,
   so they need to be aggregated.
3. The <code>agg</code> function of the grouped dataframe aggregates the rows for each Postcode and Borough pair.
4. <code>', '.join</code> joins the Neighbourhood values for each unique Postcode and Borough.
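
As a small illustration of the steps above, the sketch below aggregates a toy dataframe. The rows are copied from the first rows of the table shown earlier; the names `toy` and `grouped` are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({
    "Postcode": ["M3A", "M5A", "M5A", "M6A", "M6A"],
    "Borough": ["North York", "Downtown Toronto", "Downtown Toronto",
                "North York", "North York"],
    "Neighbourhood": ["Parkwoods", "Harbourfront", "Regent Park",
                      "Lawrence Heights", "Lawrence Manor"],
})

# Group rows that share a Postcode and Borough, then join the
# Neighbourhood strings of each group with ", ".
grouped = (
    toy.groupby(["Postcode", "Borough"])
       .agg({"Neighbourhood": ", ".join})
       .reset_index()
)
print(grouped)
```

Five input rows collapse to three, one per postal code, with the neighbourhood names joined in their original order.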

In [27]:
# Aggregate Neighbourhoods by Postcode and Borough

df_grp_pc= df_clean.groupby(["Postcode","Borough"]).agg({'Neighbourhood':', '.join}).reset_index()
df_grp_pc.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"


In [28]:
df_grp_pc.shape

(103, 3)

><b>Note:</b><br>
>    <i>Notice that aggregating the rows reduces the row count from 211 to 103.</i> 
    


#### Updating Neighbourhood data

If a borough has no assigned neighbourhood, the requirement is to use the borough name as the neighbourhood. To do this:

1. Iterate over the rows of the cleaned dataframe.
2. For each row, if the third column (Neighbourhood) has the value "Not assigned", replace it with the value of the second column (Borough) from the same row.
3. Aggregate the dataframe again, as in the earlier step, so that the neighbourhoods are joined as <code>,</code> separated values.

A new dataframe is created to present the output.
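
A vectorized alternative to the row-by-row loop is to select the affected rows with a boolean mask and assign through `.loc`. This is only a sketch on two rows taken from the table shown earlier, not the notebook's own method; `toy` and `mask` are illustrative names.

```python
import pandas as pd

toy = pd.DataFrame({
    "Postcode": ["M3A", "M7A"],
    "Borough": ["North York", "Queen's Park"],
    "Neighbourhood": ["Parkwoods", "Not assigned"],
})

# True for the rows whose Neighbourhood has no assigned value.
mask = toy["Neighbourhood"] == "Not assigned"

# Copy the Borough value into Neighbourhood for exactly those rows.
toy.loc[mask, "Neighbourhood"] = toy.loc[mask, "Borough"]
print(toy)
```

Assigning through `.loc` on the dataframe itself writes to the dataframe, whereas the row objects yielded by `iterrows` are copies.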


In [29]:
#Replacing Neighbourhood with Borough where value is "Not assigned"

for i, row in df_clean.iterrows():
    if row['Neighbourhood'] == 'Not assigned':
        # iterrows yields copies of the rows, so write back through the dataframe itself
        df_clean.loc[i, 'Neighbourhood'] = row['Borough']

#Aggregate again after the reassignment
df_grp_pc= df_clean.groupby(["Postcode","Borough"]).agg({'Neighbourhood':', '.join}).reset_index()
df_grp_pc.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"


In [30]:
df_grp_pc.shape

(103, 3)

> <b>Note</b><br>
>    <i>Notice that the row count matches that of the earlier aggregation.</i> 
---

## 2. Mapping Geographical Co-ordinates of Postal Codes<a id='gm'>.</a>
In this section we get the geographical coordinates of each postal code from http://cocl.us/Geospatial_data and merge the latitude and longitude values into the dataframe by postal code.  

<b>Read the CSV file into a pandas dataframe</b>

In [31]:
df_latlog = pd.read_csv("http://cocl.us/Geospatial_data")
df_latlog.head(10)

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
5,M1J,43.744734,-79.239476
6,M1K,43.727929,-79.262029
7,M1L,43.711112,-79.284577
8,M1M,43.716316,-79.239476
9,M1N,43.692657,-79.264848


In [32]:
df_latlog.shape

(103, 3)

##### Identify common column
* The geographical coordinates imported from the CSV are loaded into the dataframe <code>df_latlog</code>, which contains 3 columns: postal code, latitude, and longitude. <br>
* From the shape of the "df_latlog" dataframe, it can be seen that it has the same number of postal codes as the grouped dataframe "df_grp_pc" from the previous section.
* The two dataframes can be merged into a single dataframe, with latitude and longitude becoming new columns for each matching postal code.
* However, the postal code column is named differently in the two dataframes ("Postal Code" versus "Postcode"), so first rename it so that the names match.
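
Renaming is one option. As an alternative sketch, `pd.merge` also accepts `left_on` and `right_on` for key columns that are named differently; the result then carries both key columns, so one of them is dropped afterwards. The toy rows below reuse values shown in this notebook, and the names `left`, `right`, `same_keys` and `merged` are illustrative only.

```python
import pandas as pd

left = pd.DataFrame({
    "Postcode": ["M1B", "M1C"],
    "Borough": ["Scarborough", "Scarborough"],
})
right = pd.DataFrame({
    "Postal Code": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# Confirm that both frames describe the same set of postal codes.
same_keys = set(left["Postcode"]) == set(right["Postal Code"])

# Merge on differently named key columns, then drop the duplicate key.
merged = pd.merge(left, right, left_on="Postcode", right_on="Postal Code")
merged = merged.drop(columns="Postal Code")

print(same_keys)
print(merged.columns.tolist())
```

Comparing the key sets is a stronger check than comparing shapes alone, since two frames can have the same row count but different postal codes.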

In [33]:
# rename the postal code column so that it has the same name in both dataframes
df_latlog.rename(columns ={"Postal Code":"Postcode"}, inplace=True)
df_latlog.head(10)

Unnamed: 0,Postcode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
5,M1J,43.744734,-79.239476
6,M1K,43.727929,-79.262029
7,M1L,43.711112,-79.284577
8,M1M,43.716316,-79.239476
9,M1N,43.692657,-79.264848


##### Merge Dataframes 
* Both dataframes now have a common reference column, "Postcode".
* Merge the dataframes df_grp_pc and df_latlog into a new dataframe, using "Postcode" as the key column.

In [34]:
df_merged = pd.merge(df_grp_pc,df_latlog, on="Postcode")  
df_merged.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",43.727929,-79.262029
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848


#### Merged dataframe
- The new dataframe "df_merged" contains 5 columns; the new Latitude and Longitude columns are matched to each row by "Postcode".
- Verify that the row count of the merged dataframe matches the number of postal codes.
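
Beyond comparing shapes, the merge itself can report mismatches. The sketch below, on toy frames with illustrative names, uses the `validate` and `indicator` parameters of `pd.merge`. The postal code `M9Z` and the borough `Example` are made-up values, added only to show what an unmatched row looks like.

```python
import pandas as pd

boroughs = pd.DataFrame({
    "Postcode": ["M1B", "M1C", "M9Z"],  # "M9Z" is hypothetical and has no coordinates
    "Borough": ["Scarborough", "Scarborough", "Example"],
})
coords = pd.DataFrame({
    "Postcode": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# validate raises a MergeError if a postcode is duplicated on either side;
# indicator adds a _merge column recording where each row was found.
check = pd.merge(boroughs, coords, on="Postcode", how="outer",
                 validate="one_to_one", indicator=True)

# Rows present in only one of the two frames.
unmatched = check[check["_merge"] != "both"]
print(unmatched[["Postcode", "_merge"]])
```

An empty `unmatched` frame would mean that every postal code was found in both dataframes.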

In [35]:
#the shape confirms that the merged dataframe has the same number of rows as the geographical coordinates CSV for Toronto
df_merged.shape

(103, 5)

### The merged dataframe "df_merged" contains all the rows required by the assignment
___

## 3. Plotting Toronto Neighbourhood Maps<a id='map'>.</a>

#### Install and import the library for plotting geolocations and neighbourhood maps

In [14]:
!conda install -c conda-forge folium=0.5.0 --yes # Library for Map - Folium
import folium # map rendering library


Solving environment: | ^C
failed

CondaError: KeyboardInterrupt



##### Import geopy, the geolocator for getting the geographical coordinates of an address

In [15]:
!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
print('Geo Libraries imported.')

Solving environment: / ^C
failed

CondaError: KeyboardInterrupt

Geo Libraries imported.


First, get the geographical coordinates of Toronto, Ontario, using geopy's Nominatim geocoder.

In [36]:
address = "Toronto, Ontario"
geolocator = Nominatim(user_agent="my-application")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto, Ontario are {}, {}.'.format(latitude, longitude))


The geographical coordinates of Toronto, Ontario are 43.653963, -79.387207.


We will use a Folium map to locate the neighbourhoods from the postcode location data.
To do that:
1. Create a map of Toronto using Folium.
2. Use circle markers to draw one circle per postcode, positioned at its latitude and longitude.
3. Fill each circle with the color <code>#3186cc</code>.
4. Add each marker to the Toronto map.

In [37]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)
# add markers to map
for lat, lng, borough, neighbourhood in zip(df_merged['Latitude'], df_merged['Longitude'], df_merged['Borough'], df_merged['Neighbourhood']):
    #print(lat, lng)
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)

    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

#### Next, the map is narrowed to the boroughs whose name contains "Toronto". 
- We will filter the merged dataframe and create a new dataframe holding only the rows whose borough name contains "Toronto".

Let's get the geographical coordinates of Toronto again.

In [38]:
address = "Toronto, Ontario"
geolocator = Nominatim(user_agent="my-application")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto, Ontario are {}, {}.'.format(latitude, longitude))


The geographical coordinates of Toronto, Ontario are 43.653963, -79.387207.


Get all boroughs whose name contains "Toronto". Examples are East Toronto, West Toronto, and Downtown Toronto.
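
`Series.str.contains` returns a boolean Series that can be used directly as a row filter. A minimal sketch on toy borough names taken from this notebook follows; `regex=False` is an optional choice made here so that the pattern is treated as a plain substring, and the names `toy`, `mask` and `filtered` are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({
    "Borough": ["Scarborough", "East Toronto", "North York", "Downtown Toronto"],
})

# True where the borough name contains the substring "Toronto".
mask = toy["Borough"].str.contains("Toronto", regex=False)

# Keep the matching rows and renumber the index from 0.
filtered = toy[mask].reset_index(drop=True)
print(filtered)
```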

In [41]:
borough_toronto = df_merged[df_merged['Borough'].str.contains('Toronto')].reset_index(drop=True)
borough_toronto.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M4E,East Toronto,The Beaches,43.676357,-79.293031
1,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188
2,M4L,East Toronto,"The Beaches West, India Bazaar",43.668999,-79.315572
3,M4M,East Toronto,Studio District,43.659526,-79.340923
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879
5,M4P,Central Toronto,Davisville North,43.712751,-79.390197
6,M4R,Central Toronto,North Toronto West,43.715383,-79.405678
7,M4S,Central Toronto,Davisville,43.704324,-79.38879
8,M4T,Central Toronto,"Moore Park, Summerhill East",43.689574,-79.38316
9,M4V,Central Toronto,"Deer Park, Forest Hill SE, Rathnelly, South Hi...",43.686412,-79.400049


* Create the borough map for all rows whose borough name contains "Toronto".
* Again, use circle markers to highlight the neighbourhoods of the boroughs in the filtered list.
* Show the neighbourhood and borough names in an HTML popup when a circle is clicked.

In [40]:
# create map of the Toronto boroughs using latitude and longitude values
toronto_borough = folium.Map(location=[latitude, longitude], zoom_start=10)
# add markers to map
for lat, lng, borough, neighbourhood in zip(borough_toronto['Latitude'], borough_toronto['Longitude'], borough_toronto['Borough'], borough_toronto['Neighbourhood']):
    #print(lat, lng)
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)

    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(toronto_borough)  
    
toronto_borough

### -- End of Assignment --