<h1>Segmenting and Clustering Neighborhoods in Toronto</h1>
<h2>Robert Leon</h2>
<h3>Coursera IBM Data Science Capstone</h3>
<p>In this exercise, I will use the Foursquare API and scikit-learn's k-means clustering to classify Toronto neighborhoods according to the makeup of local business categories.</p>

The first step will be retrieving a list of neighborhoods in Toronto. A list is available for scraping at https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

In [4]:
from bs4 import BeautifulSoup
import pandas as pd
import requests

In [5]:
#URL of info source
zip_codes_url='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

#create http request to query URL content
wiki_response = requests.get(zip_codes_url)

In [253]:
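#split cell to avoid repeating the http request every time I rerun the cell
#create BeautifulSoup object to parse response content
#name the parser explicitly so BeautifulSoup does not have to guess one (and warn about it)
wikipage = BeautifulSoup(wiki_response.text, 'html.parser')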
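
Splitting the request into its own cell only helps while the kernel stays alive. A minimal sketch of another option, assuming it is acceptable to keep a local copy of the page: cache the HTML on disk so a restarted kernel does not query Wikipedia again. The helper name `load_page` and its parameters are hypothetical and not part of the original notebook.

```python
from pathlib import Path

def load_page(cache_path, fetch):
    """Return page HTML, calling fetch() only when no cached copy exists.

    load_page is a hypothetical helper; fetch is any zero-argument
    callable that returns the page text.
    """
    cache_file = Path(cache_path)
    if cache_file.exists():
        # Reuse the saved copy instead of making another HTTP request
        return cache_file.read_text(encoding='utf-8')
    html = fetch()
    cache_file.write_text(html, encoding='utf-8')
    return html
```

In this notebook it could be called with something like `load_page('postal_codes.html', lambda: requests.get(zip_codes_url).text)`, where the file name is an arbitrary choice.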



In [7]:
#use pd.read_html to create a list of dataframes, save the first dataframe to a variable for easy referencing
#wikipage.table is a BeautifulSoup Tag rather than a string, so prettify() converts it to the HTML text that read_html parses
wiki_page_table = pd.read_html(wikipage.table.prettify())[0]

In [256]:
#Confirm success!
wiki_page_table.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


In [9]:
#Create mask that is True for every row whose borough is assigned (anything other than 'Not assigned')
table_mask = wiki_page_table['Borough']!='Not assigned'
#Apply mask to wiki_page_table and save as 'clean' version to use
clean_wiki_page_table = wiki_page_table[table_mask]
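
As a small illustration of how the boolean mask behaves, here is a sketch on four rows copied from the table above. It also shows that filtering keeps the original index labels, which is why a filtered table can have gaps in its index. The variable names are illustrative only.

```python
import pandas as pd

# Toy rows copied from the table above
sample = pd.DataFrame({
    'Postal Code': ['M1A', 'M2A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'Not assigned', 'North York', 'North York'],
    'Neighbourhood': ['Not assigned', 'Not assigned', 'Parkwoods', 'Victoria Village'],
})

# True for rows to keep, False for rows to drop
keep = sample['Borough'] != 'Not assigned'
filtered = sample[keep]

# Filtering keeps the original index labels (2 and 3 here)
print(list(filtered.index))

# reset_index(drop=True) renumbers from 0 if a contiguous index is wanted
renumbered = filtered.reset_index(drop=True)
print(list(renumbered.index))
```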

In [10]:
#Pandas truncates long dataframes by default; setting 'display.max_rows' to None shows every row
pd.set_option('display.max_rows', None)

#Displaying dataframe in its entirety 
clean_wiki_page_table

Unnamed: 0,Postal Code,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
8,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
9,M1B,Scarborough,"Malvern, Rouge"
11,M3B,North York,Don Mills
12,M4B,East York,"Parkview Hill, Woodbine Gardens"
13,M5B,Downtown Toronto,"Garden District, Ryerson"
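
Changing `display.max_rows` with `pd.set_option` affects every later cell. If the full view is only needed once, a possible alternative is `pd.option_context`, which restores the previous setting when the block ends. This is a sketch on a stand-in dataframe, not the notebook's own data.

```python
import pandas as pd

frame = pd.DataFrame({'n': range(100)})  # stand-in for a long table
default_max_rows = pd.get_option('display.max_rows')

# Inside the with-block the repr includes every row
with pd.option_context('display.max_rows', None):
    full_view = repr(frame)

# Outside the block the previous setting is back, so the repr is truncated again
short_view = repr(frame)
restored = pd.get_option('display.max_rows') == default_max_rows
```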


In [22]:
#Display number of rows x number of columns for easy reference
clean_wiki_page_table.shape

(103, 3)

The next step is getting the geographic coordinates for each postal code in the list.

In [44]:
#import Geocoder library
import geocoder
#import json to handle geocoding api responses
import json
#import statistics package to average out boxed coordinates
import statistics

Here I create my dataframe to store each postal code's coordinates.

In [None]:
#used a for loop to loop through Postal codes, and save the center coordinates for each
#coordinates are presented as corners of an area, center of each area seems best if doing a radius search
#rows are collected in a list first because DataFrame.append was removed in pandas 2.0
lat_lon_columns=['Latitude','Longitude','Code Searched']
lat_lon_rows=[]

for zipcode in clean_wiki_page_table['Postal Code']:
    g=geocoder.google('{}, Toronto, Ontario'.format(zipcode), key='HIDDEN')
    #each bbox corner is a [latitude, longitude] pair
    bbox=g.json['bbox']
    latitude = statistics.mean([bbox['northeast'][0],bbox['southwest'][0]])
    longitude = statistics.mean([bbox['northeast'][1],bbox['southwest'][1]])
    lat_lon_rows.append({'Latitude':latitude,'Longitude':longitude,'Code Searched':zipcode})

#create the dataframe to store my results
lat_lon_df=pd.DataFrame(lat_lon_rows, columns=lat_lon_columns)
lat_lon_df

After a long time troubleshooting null responses from the Foursquare API further below, I found that the above block originally logged the coordinates in the wrong order.
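
A cheap range check right after geocoding would likely have caught the swapped coordinates much earlier. The sketch below assumes each bounding-box corner is a `[latitude, longitude]` pair, as the indexing in the loop implies. The helpers `bbox_center` and `looks_like_toronto`, the loose bounds, and the example corner values are all hypothetical and only for illustration.

```python
import statistics

def bbox_center(bbox):
    """Return (latitude, longitude) at the center of a bounding box.

    Hypothetical helper. Assumes bbox looks like
    {'northeast': [lat, lon], 'southwest': [lat, lon]}.
    """
    latitude = statistics.mean([bbox['northeast'][0], bbox['southwest'][0]])
    longitude = statistics.mean([bbox['northeast'][1], bbox['southwest'][1]])
    return latitude, longitude

def looks_like_toronto(latitude, longitude):
    """Rough sanity check using a loose box around Toronto (illustrative bounds)."""
    return 43.0 <= latitude <= 44.5 and -80.5 <= longitude <= -78.5

# Made-up corner values for illustration only
example_bbox = {'northeast': [43.76, -79.32], 'southwest': [43.74, -79.34]}
lat, lon = bbox_center(example_bbox)

assert looks_like_toronto(lat, lon)
# Swapping the pair fails the check, which would flag an ordering bug early
assert not looks_like_toronto(lon, lat)
```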

In [131]:
#Create new DataFrame to hold merged Toronto neighborhood data
t_neighborhood_columns = ['PostalCode', 'Burough', 'Neighborhood', 'Latitude', 'Longitude']
lat_lon_df.set_index('Code Searched', inplace=True)

#Add the longitude and latitude coordinates to the new DF
#rows are collected in a list first because DataFrame.append was removed in pandas 2.0
neighborhood_rows = []
for index, row in clean_wiki_page_table.iterrows():
    postcode = row['Postal Code']
    neighborhood_rows.append({
        'PostalCode':postcode,
        'Burough':row['Borough'],
        'Neighborhood':row['Neighbourhood'],
        'Latitude':lat_lon_df.loc[postcode, 'Latitude'],
        'Longitude':lat_lon_df.loc[postcode, 'Longitude']
    })
toronto_neighborhoods = pd.DataFrame(neighborhood_rows, columns=t_neighborhood_columns)

#Display dataframe
toronto_neighborhoods
    


Unnamed: 0,PostalCode,Burough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.750384,-79.335351
1,M4A,North York,Victoria Village,43.729376,-79.312923
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.649954,-79.352845
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.723346,-79.450757
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662415,-79.389786
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village",43.662091,-79.528267
6,M1B,Scarborough,"Malvern, Rouge",43.810176,-79.190328
7,M3B,North York,Don Mills,43.748739,-79.35641
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.70738,-79.311904
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657723,-79.378585
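
The same combined table can probably be built without an explicit loop by joining the two dataframes on the postal code. This is a sketch on two rows copied from the tables above, assuming the coordinates still carry the postal code as a regular `Code Searched` column rather than as the index. The variable names are illustrative.

```python
import pandas as pd

# Two rows copied from the tables above
codes = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A'],
    'Borough': ['North York', 'North York'],
    'Neighbourhood': ['Parkwoods', 'Victoria Village'],
})
coords = pd.DataFrame({
    'Code Searched': ['M4A', 'M3A'],
    'Latitude': [43.729376, 43.750384],
    'Longitude': [-79.312923, -79.335351],
})

# Join on postal code; validate raises if a code appears more than once on either side
merged = codes.merge(
    coords,
    left_on='Postal Code',
    right_on='Code Searched',
    how='left',
    validate='one_to_one',
).drop(columns='Code Searched')

print(merged)
```

A left join keeps the row order of the postal code table, so the result lines up with the scraped list even when the coordinates arrive in a different order.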
