# Week 3 Exercise 1: Create Neighbourhood Dataframe

In this notebook, we segment and cluster the neighbourhoods of Toronto, Canada. The neighbourhood data comes from a Wikipedia page (https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M) that lists the postal codes, boroughs, and neighbourhoods we need.
We scrape the page, clean the data, and load it into a pandas dataframe so that it is in a structured format.

In [1]:
import pandas as pd
from bs4 import BeautifulSoup 
import requests

In [2]:
List_url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
source = requests.get(List_url).text 

In [3]:
soup = BeautifulSoup(source, "xml")

In [4]:
table = soup.find("table")

In [5]:
# The dataframe will consist of three columns: PostalCode, Borough, and Neighbourhood
col_names = ["PostalCode", "Borough", "Neighbourhood"]
df = pd.DataFrame(columns = col_names)  
df.head() # Check that column names were initialised correctly

Unnamed: 0,PostalCode,Borough,Neighbourhood


In [6]:
# Collect postal codes, boroughs and neighbourhoods into dataframe 
for tr_cell in table.find_all("tr"):
    row_data = []
    for td_cell in tr_cell.find_all("td"):
        row_data.append(td_cell.text.strip())
    if len(row_data) == 3:
        df.loc[len(df)] = row_data

In [7]:
df.head(10)

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
7,M8A,Not assigned,Not assigned
8,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
9,M1B,Scarborough,"Malvern, Rouge"
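
The collection loop above appends to the dataframe one row at a time with `df.loc[len(df)]`. As a minimal sketch of an alternative, the rows can be gathered in a plain list and the dataframe built once at the end. The snippet assumes a small made-up HTML fragment (`sample_html`) in place of the live Wikipedia page, and it uses the `html.parser` backend, whereas the notebook itself passes `"xml"`.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia table; the real page has many more rows
sample_html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M4A</td><td>North York</td><td>Victoria Village</td></tr>
</table>
"""

sample_table = BeautifulSoup(sample_html, "html.parser").find("table")

rows = []
for tr in sample_table.find_all("tr"):
    cells = [td.text.strip() for td in tr.find_all("td")]
    if len(cells) == 3:  # the header row has only <th> cells, so it is skipped
        rows.append(cells)

# Build the dataframe in one step from the collected rows
sample_df = pd.DataFrame(rows, columns=["PostalCode", "Borough", "Neighbourhood"])
print(sample_df)
```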


**Data Cleaning Tasks**

1. Only process the cells that have an assigned borough. Ignore cells with a borough that is *Not assigned*,
<br> i.e. remove rows where the value for *Borough* is *Not assigned*

2. If a cell has a *Borough* but a *Not assigned* neighbourhood, then the neighbourhood will be the same as the borough,
<br> i.e. for cells with a *Borough* value AND *Not assigned* neighbourhood, neighbourhood = Borough
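
The two tasks can be sketched on a toy dataframe before applying them to the scraped data. The rows below are made up for illustration (including the borough and neighbourhood values on the `M7A` row) so that each case appears once.

```python
import pandas as pd

# Toy rows (made up for illustration) covering the cases the tasks describe
toy = pd.DataFrame(
    {
        "PostalCode": ["M1A", "M3A", "M7A"],
        "Borough": ["Not assigned", "North York", "Queen's Park"],
        "Neighbourhood": ["Not assigned", "Parkwoods", "Not assigned"],
    }
)

# Task 1: keep only rows with an assigned borough
cleaned = toy[toy["Borough"] != "Not assigned"].reset_index(drop=True)

# Task 2: where the neighbourhood is unassigned, fall back to the borough name
unassigned = cleaned["Neighbourhood"] == "Not assigned"
cleaned.loc[unassigned, "Neighbourhood"] = cleaned.loc[unassigned, "Borough"]

print(cleaned)
```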

In [8]:
# To start off, drop rows where Borough is missing
df = df[~df["Borough"].isnull()] 

# Task 1: Remove rows where Borough = Not assigned
df.drop(df[df.Borough == "Not assigned"].index, inplace=True)
df.reset_index(drop=True, inplace=True)
df

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"


In [9]:
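# Task 2: For any row in which [Borough = VALUE] AND [Neighbourhood = Not assigned], [Neighbourhood VALUE = Borough VALUE]
# First combine rows that share a postal code into one comma-separated Neighbourhood entry
df = df.groupby(["PostalCode","Borough"])["Neighbourhood"].apply(lambda x: ",".join(x)).reset_index()
# Then copy the Borough into Neighbourhood wherever the Neighbourhood is unassigned
unassigned = df["Neighbourhood"] == "Not assigned"
df.loc[unassigned, "Neighbourhood"] = df.loc[unassigned, "Borough"]
df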

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"Kennedy Park, Ionview, East Birchmount Park"
7,M1L,Scarborough,"Golden Mile, Clairlea, Oakridge"
8,M1M,Scarborough,"Cliffside, Cliffcrest, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"
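
The `groupby` step in the cell above only changes the data when a postal code appears on more than one row. A small sketch with made-up rows (the postal code `M9X` and the names `Alpha` and `Beta` are hypothetical) shows how such rows collapse into a single entry. The `", "` separator is a choice made here for readability.

```python
import pandas as pd

# Made-up rows: the first two share a postal code and borough
rows = pd.DataFrame(
    {
        "PostalCode": ["M9X", "M9X", "M3A"],
        "Borough": ["Etobicoke", "Etobicoke", "North York"],
        "Neighbourhood": ["Alpha", "Beta", "Parkwoods"],
    }
)

# Join the neighbourhood names of each (PostalCode, Borough) group into one string
combined = (
    rows.groupby(["PostalCode", "Borough"])["Neighbourhood"]
    .apply(", ".join)
    .reset_index()
)
print(combined)
```

Because `groupby` sorts by its keys, the result is also ordered by postal code, which is why the output above starts at `M1B` rather than `M3A`.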


In [10]:
df.shape

(103, 3)

In [11]:
whos

Variable        Type             Data/Info
------------------------------------------
BeautifulSoup   type             <class 'bs4.BeautifulSoup'>
List_url        str              https://en.wikipedia.org/<...>postal_codes_of_Canada:_M
col_names       list             n=3
df              DataFrame            PostalCode           <...>n\n[103 rows x 3 columns]
pd              module           <module 'pandas' from '/o<...>ages/pandas/__init__.py'>
requests        module           <module 'requests' from '<...>es/requests/__init__.py'>
row_data        list             n=3
soup            BeautifulSoup    <?xml version="1.0" encod<...>);</script></body></html>
source          str              \n<!DOCTYPE html>\n<html <...></script></body></html>\n
table           Tag              <table class="wikitable s<...>/td></tr></tbody></table>
td_cell         Tag              <td>Not assigned\n</td>
tr_cell         Tag              <tr>\n<td>M9Z\n</td>\n<td<...>>Not assigned\n</td></tr>


# Week 3 Exercise 2: Geographical Coordinate Dataframe

Now that we have a dataframe with the postal code, borough name, and neighbourhood name of each area, we need the latitude and longitude of each postal code in order to use the Foursquare location data.


In [12]:
# load coordinates into a new dataframe 
geo_data_url = "http://cocl.us/Geospatial_data"
geo_coords_df = pd.read_csv(geo_data_url)
geo_coords_df.head(10)

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
5,M1J,43.744734,-79.239476
6,M1K,43.727929,-79.262029
7,M1L,43.711112,-79.284577
8,M1M,43.716316,-79.239476
9,M1N,43.692657,-79.264848


In [13]:
# Verify that the dimensions of the coordinate dataframe match those of the neighbourhood dataframe 
if geo_coords_df.shape != df.shape:
    print("ERROR: dataframe dimensions do not match!")

In [14]:
# Change the column names as appropriate and merge the two dataframes 
geo_coords_df.rename(columns = {"Postal Code": "PostalCode"}, inplace=True)
merged_data = pd.merge(geo_coords_df, df, on="PostalCode")
full_data = merged_data[["PostalCode", "Borough", "Neighbourhood", "Latitude", "Longitude"]]
full_data.head(10)

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"Kennedy Park, Ionview, East Birchmount Park",43.727929,-79.262029
7,M1L,Scarborough,"Golden Mile, Clairlea, Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffside, Cliffcrest, Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848
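
A matching shape shows only that the two dataframes have the same number of rows, not that their postal codes line up. The following is a hedged sketch on two made-up frames (`left` and `right`, with illustrative coordinates). It relies on the `how`, `indicator`, and `validate` parameters of `pd.merge` to flag postal codes present on one side only and to raise an error if a code is duplicated.

```python
import pandas as pd

# Made-up frames: M1E is missing from `right`, M1G is missing from `left`
left = pd.DataFrame({"PostalCode": ["M1B", "M1C", "M1E"],
                     "Borough": ["Scarborough"] * 3})
right = pd.DataFrame({"PostalCode": ["M1B", "M1C", "M1G"],
                      "Latitude": [43.81, 43.78, 43.77],
                      "Longitude": [-79.19, -79.16, -79.22]})

# The outer join keeps every key; `_merge` records which side each key came from
checked = pd.merge(left, right, on="PostalCode", how="outer",
                   indicator=True, validate="one_to_one")

unmatched = checked[checked["_merge"] != "both"]
print(unmatched[["PostalCode", "_merge"]])
```

An empty `unmatched` frame would indicate that every postal code found a partner.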


In [15]:
whos

Variable        Type             Data/Info
------------------------------------------
BeautifulSoup   type             <class 'bs4.BeautifulSoup'>
List_url        str              https://en.wikipedia.org/<...>postal_codes_of_Canada:_M
col_names       list             n=3
df              DataFrame            PostalCode           <...>n\n[103 rows x 3 columns]
full_data       DataFrame            PostalCode           <...>n\n[103 rows x 5 columns]
geo_coords_df   DataFrame            PostalCode   Latitude<...>n\n[103 rows x 3 columns]
geo_data_url    str              http://cocl.us/Geospatial_data
merged_data     DataFrame            PostalCode   Latitude<...>n\n[103 rows x 5 columns]
pd              module           <module 'pandas' from '/o<...>ages/pandas/__init__.py'>
requests        module           <module 'requests' from '<...>es/requests/__init__.py'>
row_data        list             n=3
soup            BeautifulSoup    <?xml version="1.0" encod<...>);</script></body></html>
sou

# Week 3 Exercise 3: Explore and Cluster the Neighbourhoods in Toronto

In [None]:
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import json # library to handle JSON files
from pandas import json_normalize # transform JSON file into a pandas dataframe (pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes # comment this line out if folium is already installed
import folium # map rendering library

print("Libraries imported")

Solving environment: / 

In [None]:
# Output the number of boroughs and neighbourhoods in Toronto 
print("The dataframe has {} boroughs and {} neighbourhoods".format(
        len(full_data['Borough'].unique()),
        full_data.shape[0]
    )
)

The dataframe has 10 boroughs and 103 neighbourhoods


#### Use the geopy library to get the latitude and longitude values of Toronto

In [18]:
address = "Toronto, ON"

geolocator = Nominatim(user_agent = "Toronto")
location = geolocator.geocode(address)
latitude_toronto = location.latitude
longitude_toronto = location.longitude
print("The geographical coordinates of Toronto are {}, {}.".format(latitude_toronto, longitude_toronto))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.
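
`geolocator.geocode` returns `None` when it finds no match, in which case reading `location.latitude` would raise an `AttributeError`. A minimal sketch of a guard follows. The helper `coords_or_raise` is hypothetical and takes any geocode-like callable, so it can be tried without network access. `FakeLocation` and `stub_geocode` are stand-ins for the real Nominatim service.

```python
from collections import namedtuple

# Stand-in for geopy's Location object; only the two attributes used here
FakeLocation = namedtuple("FakeLocation", ["latitude", "longitude"])

def coords_or_raise(geocode, address):
    """Return (latitude, longitude) for address, or raise if nothing was found."""
    location = geocode(address)
    if location is None:
        raise ValueError("No geocoding result for {!r}".format(address))
    return location.latitude, location.longitude

def stub_geocode(address):
    # Hypothetical lookup table replacing the real geocoding service
    known = {"Toronto, ON": FakeLocation(43.6534817, -79.3839347)}
    return known.get(address)

print(coords_or_raise(stub_geocode, "Toronto, ON"))
```

With the real service, the same helper would be called as `coords_or_raise(geolocator.geocode, address)`.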


#### Create a map of Toronto with neighbourhoods superimposed on top

In [None]:
# Create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude_toronto, longitude_toronto], zoom_start=10)

# add markers to map
for lat, lng, borough, neighbourhood in zip(full_data['Latitude'], full_data['Longitude'], full_data['Borough'], full_data['Neighbourhood']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)
    
map_toronto
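
All markers on the map above share one colour. A sketch of one way to colour them by borough uses the `matplotlib.cm` and `matplotlib.colors` modules imported earlier. The borough list here is a short sample rather than the full dataframe column, and the `rainbow` colormap is an assumption.

```python
import matplotlib.cm as cm
import matplotlib.colors as colors

# Short sample; in the notebook this could come from full_data['Borough'].unique()
boroughs = ["Scarborough", "North York", "Etobicoke", "East York"]

# Spread the boroughs evenly over the colormap and convert each RGBA tuple to hex
borough_colours = {
    name: colors.rgb2hex(cm.rainbow(i / max(len(boroughs) - 1, 1)))
    for i, name in enumerate(boroughs)
}
print(borough_colours)
```

Inside the marker loop, `color=borough_colours[borough]` could then replace the fixed `'blue'`.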