# Segmenting and Clustering Neighbourhoods in Toronto - 2

In this assignment, I explore, segment, and cluster the neighborhoods in the city of Toronto. Unlike for New York, however, the neighborhood data is not readily available on the internet. Each data science project is challenging in its own way, so it pays to be agile and to learn new libraries and tools quickly as the project demands.

A Wikipedia page has all the information needed to explore and cluster the neighborhoods in Toronto. I will scrape that page and:

   - Wrangle the data
   - Clean it 
   - Read it into a pandas dataframe so that it is in a structured format like the New York dataset

Once the data is in a structured format, I will replicate the analysis I performed on the New York City dataset to explore and cluster the neighborhoods in the city of Toronto.

### Import Relevant Libraries

In [2]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

from bs4 import BeautifulSoup # Library to handle web scraping

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (top-level import, pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns

# import k-means from clustering stage
from sklearn.cluster import KMeans

import folium # map rendering library

print('Libraries imported.')

Libraries imported.


### Download Dataset

The neighbourhood data for Toronto can be downloaded from Wikipedia.

The following revision of the page is used, and it can be scraped with the BeautifulSoup library: https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&oldid=945633050

In [3]:
# gather HTML data via request

data = requests.get('https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&oldid=945633050').text
#data = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text

# parse the HTML into a BeautifulSoup object
soup = BeautifulSoup(data, 'html.parser')

In [4]:
# create a list to store data
postalCodeList = []
boroughList = []
neighbourhoodList = []

#### BeautifulSoup Functionality

```python
# find the first table
soup.find('table')

# find all the rows of the table
soup.find('table').find_all('tr')

# for each row of the table, find all the table data
for row in soup.find('table').find_all('tr'):
    cells = row.find_all('td')
```
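As a small self-contained illustration of the row and cell traversal above, the sketch below parses an inline HTML string instead of the live page. The `sample_html` table and the `sample_soup` and `sample_rows` names are hypothetical stand-ins made up for this example; only the `find`, `find_all`, and `.text` calls mirror what the notebook uses.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the Wikipedia table, used only for illustration.
sample_html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

sample_rows = []
for row in sample_soup.find("table").find_all("tr"):
    cells = row.find_all("td")
    if cells:  # the header row holds only <th> cells, so it is skipped
        # strip() removes the trailing newline inside the last cell
        sample_rows.append([cell.text.strip() for cell in cells])

print(sample_rows)
# [['M1A', 'Not assigned', 'Not assigned'], ['M3A', 'North York', 'Parkwoods']]
```

Rows without any `<td>` cells produce an empty list, which is why a simple truthiness check is enough to skip the header.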

In [5]:
# append data to each respective list

for row in soup.find('table').find_all('tr'):
    cells = row.find_all('td')
    if cells:  # skip rows without <td> cells, such as the header row
        postalCodeList.append(cells[0].text.strip())
        boroughList.append(cells[1].text.strip())
        neighbourhoodList.append(cells[2].text.strip())  # drop the trailing newline in the neighbourhood cell

In [6]:
# create a new DataFrame from the three lists

toronto_df1 = pd.DataFrame({"PostalCode": postalCodeList,
                           "Borough": boroughList,
                           "Neighbourhood": neighbourhoodList})

print(toronto_df1.shape)
toronto_df1.head()

(287, 3)


| | PostalCode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |


### Data Preprocessing

The data must be processed into a usable format.

- The dataframe will consist of three columns: PostalCode, Borough, and Neighbourhood.
- Only rows with an assigned borough are processed; rows whose borough is 'Not assigned' are ignored.
- More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated by a comma.
- If a row has a borough but a 'Not assigned' neighborhood, the neighborhood will be set to the borough name.
- The `.shape` attribute is used to print the dimensions of the dataframe, for reference.
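The rules above can be sketched on a toy table. This is a minimal illustration under stated assumptions, not the notebook's own cells: the `raw` rows are invented for the example (they are not the scraped data), the names `assigned`, `unnamed`, and `tidy` are hypothetical, and the neighbourhood fill is applied before grouping so that every value being joined is a string.

```python
import pandas as pd

# Hypothetical rows that mimic the scraped table (not the real data).
raw = pd.DataFrame({
    "PostalCode": ["M1A", "M5A", "M5A", "M7A"],
    "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Harbourfront", "Regent Park", "Not assigned"],
})

# 1. Keep only rows whose Borough is assigned.
assigned = raw[raw["Borough"] != "Not assigned"].copy()

# 2. An unassigned neighbourhood takes the name of its borough.
unnamed = assigned["Neighbourhood"] == "Not assigned"
assigned.loc[unnamed, "Neighbourhood"] = assigned.loc[unnamed, "Borough"]

# 3. Combine neighbourhoods that share a postal code into one comma-separated row.
tidy = (
    assigned.groupby(["PostalCode", "Borough"], as_index=False)["Neighbourhood"]
    .agg(", ".join)
)

print(tidy)
print(tidy.shape)  # (2, 3)
```

Filtering on the `Borough` column alone, rather than replacing 'Not assigned' everywhere, keeps the `Neighbourhood` values as strings, which `", ".join` requires.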

In [7]:
# drop rows with a Borough that is 'Not assigned'
# (filter on the Borough column only, so Neighbourhood values stay strings for the later join)

toronto_df2 = toronto_df1[toronto_df1['Borough'] != 'Not assigned']
print(toronto_df2.shape)
toronto_df2.head()

(210, 3)


| | PostalCode | Borough | Neighbourhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
| 5 | M6A | North York | Lawrence Heights |
| 6 | M6A | North York | Lawrence Manor |


In [8]:
# combine neighbourhoods that share a postal code (and borough) into one comma-separated row

toronto_df2_grouped = toronto_df2.groupby(["PostalCode", "Borough"], as_index=False).agg(lambda x: ", ".join(x))
print(toronto_df2_grouped.shape)
toronto_df2_grouped.head()

(103, 3)


| | PostalCode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |


In [9]:
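# when Neighbourhood is missing or 'Not assigned', make the value the same as Borough
# (rows yielded by iterrows() are copies, so the assignment is done through .loc)

unassigned = toronto_df2_grouped["Neighbourhood"].isna() | (toronto_df2_grouped["Neighbourhood"] == "Not assigned")
toronto_df2_grouped.loc[unassigned, "Neighbourhood"] = toronto_df2_grouped.loc[unassigned, "Borough"]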

print(toronto_df2_grouped.shape)
toronto_df2_grouped.head()

(103, 3)


| | PostalCode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |


In [10]:
# print the number of rows in the DataFrame

toronto_df3 = toronto_df2_grouped
toronto_df3.shape

(103, 3)

### Load Coordinates CSV File

We now have a dataframe with the postal code, borough name, and neighborhood name of each area. To use the Foursquare location data, we also need the latitude and longitude of each postal code, which are loaded here from a CSV file.

In [11]:
coordinates = pd.read_csv('http://cocl.us/Geospatial_data')
coordinates.head()

| | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


In [12]:
# rename column to 'PostalCode' for easy merge

coordinates.rename(columns= {"Postal Code": "PostalCode"}, inplace=True)

In [13]:
# merge coordinates with toronto neighbourhood data set

toronto_df4 = toronto_df3.merge(coordinates, on = "PostalCode", how = "left")
print(toronto_df4.shape)
toronto_df4.head()

(103, 5)


| | PostalCode | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill | 43.763573 | -79.188711 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |
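
A left merge keeps every row of the neighbourhood table even when no coordinates match, so it can be worth checking for postal codes that came back without a latitude and longitude. The sketch below shows one possible check on toy data, using the `validate` and `indicator` parameters of `DataFrame.merge`. The `areas` and `coords` tables, the `M9Z` code, and the `checked` and `missing` names are hypothetical and exist only for this example.

```python
import pandas as pd

# Hypothetical toy tables; "M9Z" deliberately has no coordinates.
areas = pd.DataFrame({
    "PostalCode": ["M1B", "M1C", "M9Z"],
    "Borough": ["Scarborough", "Scarborough", "Example"],
})
coords = pd.DataFrame({
    "PostalCode": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# validate raises MergeError if a postal code is duplicated on either side;
# indicator adds a "_merge" column recording where each row was found.
checked = areas.merge(coords, on="PostalCode", how="left",
                      validate="one_to_one", indicator=True)

# Rows marked "left_only" found no matching coordinates.
missing = checked.loc[checked["_merge"] == "left_only", "PostalCode"].tolist()
print(missing)  # ['M9Z']
```

An empty `missing` list would suggest that every postal code received coordinates.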
