# Segmenting and Clustering Neighborhoods in Toronto

_Anna A. Stepanova, Ph.D_

## Introduction

In this project, I will scrape web data using the *BeautifulSoup* package to obtain the neighborhood information, and then convert addresses into their equivalent latitude and longitude values. I will also use the Foursquare API to explore neighborhoods in Toronto. I will use the **explore** function to get the most common venue categories in each neighborhood, and then use this feature to group the neighborhoods into clusters with the *k*-means clustering algorithm. Finally, I will use the Folium library to visualize the neighborhoods in Toronto and their emerging clusters.

## Table of Contents


1. [Scrape Wikipedia Page](#item1)
2. [Add Geospatial Data](#item2)



Let's import the required packages before we explore the data.

In [53]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
#from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.11.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

from bs4 import BeautifulSoup # web scraping library

#!pip install geocoder
import geocoder # import geocoder

print('Libraries imported.')

Libraries imported.


<a id="item1"></a>
## 1. Scrape Wikipedia Page

Let's build the code to scrape the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, in order to obtain the data that is in the table of postal codes, boroughs and neighborhoods.
We'll use BeautifulSoup library to extract the table from the web-page.

In [54]:
# import the library we use to open URLs
import urllib.request

# specify the URL of the Wikipedia page we are going to scrape
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

Using the *urllib.request* library, we query the page and store the HTTP response in a variable (which we have called ‘page’):

In [55]:
page = urllib.request.urlopen(url)
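
Some sites may reject requests that carry no descriptive User-Agent, and `urlopen` raises an exception on HTTP or network errors, so it can help to build the request explicitly. The sketch below is one way to do that with the standard library only. The `build_request` and `fetch_html` helpers, the User-Agent string, and the timeout value are illustrative assumptions and not part of the original notebook.

```python
import urllib.request

URL = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

def build_request(url, user_agent="toronto-neighborhoods-notebook/0.1"):
    """Return a Request carrying an explicit User-Agent (hypothetical value)."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})

def fetch_html(url, timeout=10):
    """Download the page and return its raw bytes, or None on a network error."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
            return response.read()
    except OSError as err:  # HTTPError, URLError and timeouts all derive from OSError
        print(f"Could not fetch {url}: {err}")
        return None

# Building the request does not touch the network; fetch_html(URL) would.
request = build_request(URL)
print(request.get_method(), request.full_url)
```

The bytes returned by `fetch_html` could be passed to BeautifulSoup in place of the `page` object.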

Then we use Beautiful Soup to parse the HTML data we stored in our ‘page’ variable and keep the resulting parse tree in a new variable called ‘soup’. We use the “lxml” parser option:

In [56]:
# parse the HTML from our URL into the BeautifulSoup parse tree format
soup = BeautifulSoup(page, "lxml")

Extract the table from the parsed HTML using the class "wikitable sortable":

In [57]:
# find and extract table data from the Wikipedia page
table=soup.find('table', class_='wikitable sortable')
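
To see what `find` with `class_` returns, and how the rows of the result can be walked, without touching the network, the sketch below parses a tiny hand-written HTML fragment. The fragment and the `sample_html` name are made up for illustration, and the built-in `html.parser` backend is used instead of `lxml` only so that the example needs no extra dependency.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the Wikipedia table (illustrative data only)
sample_html = """
<table class="wikitable sortable">
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A\n</td><td>Not assigned\n</td><td>Not assigned\n</td></tr>
  <tr><td>M3A\n</td><td>North York\n</td><td>Parkwoods\n</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# A class_ string with a space matches the exact value of the class attribute
table = soup.find("table", class_="wikitable sortable")

rows = []
for tr in table.find_all("tr"):
    cells = tr.find_all("td")
    if len(cells) == 3:  # the header row uses <th>, so it is skipped
        rows.append([cell.get_text(strip=True) for cell in cells])

print(rows)
```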

Now that the table has been found, let's use BeautifulSoup to extract its rows into three lists, one per future column. Then we'll use *pandas* to create a data frame. All entries end with a newline character (**\n**), which we remove before further analysis. We also exclude rows whose borough is **Not assigned** and reset the index of the modified data frame.
Let's display the first rows:

In [64]:
### Let's get the column data
# Initialize one list per column
A=[]
B=[]
C=[]


# 'table' is the element found in the previous cell
for row in table.find_all('tr'):
    cells=row.find_all('td')
    if len(cells)==3:
        A.append(cells[0].find(string=True))
        B.append(cells[1].find(string=True))
        C.append(cells[2].find(string=True))
        
#### Now let's create a data frame using pandas library
tor=pd.DataFrame(A,columns=['PostalCode'])
tor['Borough']=B
tor['Neighborhood']=C


# First filter out rows that
# do not contain any data
tor = tor.dropna(how = 'all')

# Remove \n from the data frame
tor = tor.replace('\n','', regex=True)

### Drop rows whose borough is Not assigned
tor.drop(tor[tor['Borough'] == 'Not assigned'].index, inplace = True)

### Reset the index
tor.reset_index(drop=True, inplace = True)

### Display the first 12 rows of the modified data frame
tor.head(12)
     


| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
| 5 | M9A | Etobicoke | Islington Avenue, Humber Valley Village |
| 6 | M1B | Scarborough | Malvern, Rouge |
| 7 | M3B | North York | Don Mills |
| 8 | M4B | East York | Parkview Hill, Woodbine Gardens |
| 9 | M5B | Downtown Toronto | Garden District, Ryerson |


The data frame consists of three columns (PostalCode, Borough, and Neighborhood) and 103 rows.

In [51]:
### print the shape of the data frame
print(tor.shape)


(103, 3)
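
The cleaning steps used in this section (removing newlines, dropping `Not assigned` boroughs, resetting the index) can be tried on a small hand-made frame. The sample rows and the `raw` and `clean` names below are illustrative assumptions, not data scraped from the page, and the chained style is just one possible way to write the same steps.

```python
import pandas as pd

# Illustrative rows that mimic the scraped cells, trailing newlines included
raw = pd.DataFrame({
    "PostalCode": ["M1A\n", "M3A\n", "M4A\n"],
    "Borough": ["Not assigned\n", "North York\n", "North York\n"],
    "Neighborhood": ["Not assigned\n", "Parkwoods\n", "Victoria Village\n"],
})

clean = (
    raw.replace("\n", "", regex=True)       # remove newline characters
       .query("Borough != 'Not assigned'")  # keep only assigned boroughs
       .reset_index(drop=True)              # renumber the rows from 0
)

print(clean)
print(clean.shape)
```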


<a id="item2"></a>
## 2. Add Geospatial Data

The original plan was to get the latitude and longitude coordinates of each postal code using the *Geocoder* package, but this did not work for me.

As I was not able to get the geographical coordinates of the neighborhoods using the Geocoder package, I used a CSV file that has the geographical coordinates of each postal code: http://cocl.us/Geospatial_data
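
Geocoding services can fail to return a result intermittently, so one common workaround is to retry a bounded number of times before falling back to another source such as the CSV file. The sketch below shows that pattern with a stand-in lookup function. `geocode_with_retry`, `fake_lookup`, and the retry parameters are hypothetical names and values chosen for illustration, and no real geocoding service is called.

```python
import time

def geocode_with_retry(lookup, address, max_attempts=5, delay_seconds=0.0):
    """Call lookup(address) until it returns coordinates or attempts run out.

    lookup is any callable returning a (latitude, longitude) pair or None.
    """
    for _ in range(max_attempts):
        coords = lookup(address)
        if coords is not None:
            return coords
        time.sleep(delay_seconds)  # pause before the next attempt
    return None  # the caller should fall back to another data source

# Stand-in for a real geocoder: fails twice, then succeeds (illustrative only)
calls = {"count": 0}

def fake_lookup(address):
    calls["count"] += 1
    if calls["count"] < 3:
        return None
    return (43.806686, -79.194353)  # M1B coordinates as listed in the CSV

result = geocode_with_retry(fake_lookup, "M1B, Toronto, Ontario")
print(result, "after", calls["count"], "calls")
```

Bounding the number of attempts avoids an endless loop when the service never answers, which is the situation described above.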

In [52]:
### Read Geospatial information from csv
postal_data = pd.read_csv('http://cocl.us/Geospatial_data')

postal_data.head()

| | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


Now we can create a new data frame containing the neighborhoods and their coordinates by merging the neighborhood data with the geospatial data. The first rows are shown below.

In [72]:
# Merge toronto data frame with postal_data
geo_tor = tor.merge(postal_data, left_on='PostalCode', right_on='Postal Code')

# drop the duplicate Postal Code column
geo_tor.drop(["Postal Code"], axis = 1, inplace = True)

# Print first 12 rows
geo_tor.head(12)


| | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |
| 5 | M9A | Etobicoke | Islington Avenue, Humber Valley Village | 43.667856 | -79.532242 |
| 6 | M1B | Scarborough | Malvern, Rouge | 43.806686 | -79.194353 |
| 7 | M3B | North York | Don Mills | 43.745906 | -79.352188 |
| 8 | M4B | East York | Parkview Hill, Woodbine Gardens | 43.706397 | -79.309937 |
| 9 | M5B | Downtown Toronto | Garden District, Ryerson | 43.657162 | -79.378937 |


Our new data frame contains 103 entries and now has neighborhood coordinates.

In [71]:
### print the shape of the data frame
geo_tor.shape

(103, 5)
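
`merge` performs an inner join by default, so a postal code missing from the coordinates file would be dropped without warning. A left join with `indicator=True` is one way to check for that. The two small frames below are illustrative: the coordinates are copied from the outputs in this section, and the made-up code `M0Z` with borough `Nowhere` is added only to force a mismatch.

```python
import pandas as pd

neighborhoods = pd.DataFrame({
    "PostalCode": ["M3A", "M4A", "M0Z"],  # M0Z is a made-up code with no match
    "Borough": ["North York", "North York", "Nowhere"],
})
coordinates = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

checked = neighborhoods.merge(
    coordinates,
    left_on="PostalCode",
    right_on="Postal Code",
    how="left",      # keep every neighborhood row, matched or not
    indicator=True,  # adds a _merge column: 'both' or 'left_only'
)

# Postal codes that found no coordinates in the right-hand frame
unmatched = checked.loc[checked["_merge"] == "left_only", "PostalCode"].tolist()
print(unmatched)
```

An empty `unmatched` list on the real data would confirm that the row count is unchanged because every postal code found its coordinates.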