# Segmenting and Clustering Neighborhoods in Toronto

## Part 1: Web Scraping, Transforming Data into a Pandas Dataframe and Cleaning Data

This is the first part of the week 3 assignment. Our task is to scrape a Wikipedia page that contains the table of Canadian postal codes beginning with M (Toronto), clean the table data, and transform it into a usable *pandas dataframe*.

The first step is to import the libraries and packages we need:

In [1]:
import requests
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
from geopy.geocoders import Nominatim
import folium
import json 
from pandas import json_normalize  # on pandas < 1.0 use: from pandas.io.json import json_normalize
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans

The next step is to scrape the data we need and turn it into a pandas dataframe:

In [2]:
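# Download the Wikipedia page that lists the "M" (Toronto) postal codes
res = requests.get(
    'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
res.raise_for_status()  # stop early if the request failed

# Parse the HTML and take the first table on the page
soup = BeautifulSoup(res.text, 'html.parser')
table = soup.find('table')

# read_html returns a list of dataframes; keep the first one
df = pd.read_html(str(table))[0]
df.head(11)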

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor
7,M7A,Downtown Toronto,Queen's Park
8,M8A,Not assigned,Not assigned
9,M9A,Etobicoke,Islington Avenue


As we can see, some rows have 'Not assigned' in the 'Borough' column. We need to clean the data and keep only the rows that have an assigned borough:

In [3]:
df = df[df.Borough != 'Not assigned']
df = df.reset_index(drop=True)
df.head(11)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,Lawrence Heights
4,M6A,North York,Lawrence Manor
5,M7A,Downtown Toronto,Queen's Park
6,M9A,Etobicoke,Islington Avenue
7,M1B,Scarborough,Rouge
8,M1B,Scarborough,Malvern
9,M3B,North York,Don Mills North


(There's an alternative way to get the same result:

    df = df.replace('Not assigned', np.nan)
    df = df.dropna()

But here we'll stick to the first method.)

Now let's rename the columns as suggested in the assignment description, and combine the rows that share a postal code into a single row, with the neighborhood names separated by commas:

In [4]:
df.rename(columns={'Postcode': 'PostalCode',
                   'Neighbourhood': 'Neighborhood'}, inplace=True)

df = df.groupby(['PostalCode', 'Borough'])[
    'Neighborhood'].apply(', '.join).reset_index()

df.head(11)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"


Great. Now, let's see how many rows our dataframe consists of:

In [5]:
df.shape

(103, 3)

Good. We have scraped the data, turned it into a pandas dataframe, and cleaned it. This is the dataframe we'll use in the next steps.
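
The cleaning, renaming, and grouping steps above can be tried on a small hand-made sample without any scraping. This is only a sketch: the `sample` dataframe is typed in by hand to mimic the shape of the scraped table, and the variable names `sample`, `cleaned`, and `grouped` are illustrative.

```python
import pandas as pd

# Hand-made sample that mimics the scraped table (not real scraped output)
sample = pd.DataFrame({
    'Postcode': ['M1A', 'M5A', 'M6A', 'M6A'],
    'Borough': ['Not assigned', 'Downtown Toronto', 'North York', 'North York'],
    'Neighbourhood': ['Not assigned', 'Harbourfront',
                      'Lawrence Heights', 'Lawrence Manor'],
})

# 1. Drop the rows without an assigned borough
cleaned = sample[sample['Borough'] != 'Not assigned']

# 2. Rename the columns
cleaned = cleaned.rename(columns={'Postcode': 'PostalCode',
                                  'Neighbourhood': 'Neighborhood'})

# 3. Join the neighborhoods that share a postal code into one row
grouped = (cleaned.groupby(['PostalCode', 'Borough'])['Neighborhood']
           .apply(', '.join)
           .reset_index())

print(grouped)
```

The two M6A rows collapse into one row whose 'Neighborhood' value is the comma-separated list of both names.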

## Part 2: Latitude and Longitude Coordinates of Toronto Neighborhoods

The most convenient way to get the coordinates we need is to use this CSV file: https://cocl.us/Geospatial_data.

In [6]:
df_longlat = pd.read_csv('https://cocl.us/Geospatial_data')
df_longlat.head(11)

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
5,M1J,43.744734,-79.239476
6,M1K,43.727929,-79.262029
7,M1L,43.711112,-79.284577
8,M1M,43.716316,-79.239476
9,M1N,43.692657,-79.264848


In [7]:
df_longlat.shape

(103, 3)

If we compare *df* and *df_longlat*, we can see that both dataframes have 103 rows, and that their first rows list the same postal codes in the same order.
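
Rather than comparing the two dataframes by eye, the postal code columns can be compared directly. The following is a minimal sketch on toy data: `left` and `right` are hypothetical stand-ins for *df* and *df_longlat*, with only a few rows.

```python
import pandas as pd

# Toy stand-ins for df and df_longlat (only a few rows, for illustration)
left = pd.DataFrame({'PostalCode': ['M1B', 'M1C', 'M1E']})
right = pd.DataFrame({'Postal Code': ['M1B', 'M1C', 'M1E'],
                      'Latitude': [43.806686, 43.784535, 43.763573]})

# True only if both columns hold the same codes in the same order
same_order = left['PostalCode'].tolist() == right['Postal Code'].tolist()
print(same_order)
```

If a check like this returned False on the real data, copying columns across by position would attach coordinates to the wrong postal codes.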

OK, now let's add the two columns from the latter dataframe to the first dataframe: 

In [8]:
df = df.assign(Latitude=df_longlat.Latitude.values, Longitude=df_longlat.Longitude.values)
df.head(11)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",43.727929,-79.262029
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848


Good. Every postal code now has its latitude and longitude.
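
The `assign` call above copies the coordinates by row position, so it relies on both dataframes being in the same order. A possible alternative, sketched here on toy data, is to merge on the postal code so that row order no longer matters. The dataframes `neighborhoods` and `coords` are hypothetical stand-ins for *df* and *df_longlat*.

```python
import pandas as pd

neighborhoods = pd.DataFrame({
    'PostalCode': ['M1B', 'M1C', 'M1E'],
    'Borough': ['Scarborough'] * 3,
})

# Deliberately shuffled to show that row order does not matter for a merge
coords = pd.DataFrame({
    'Postal Code': ['M1E', 'M1B', 'M1C'],
    'Latitude': [43.763573, 43.806686, 43.784535],
    'Longitude': [-79.188711, -79.194353, -79.160497],
})

merged = neighborhoods.merge(
    coords.rename(columns={'Postal Code': 'PostalCode'}),
    on='PostalCode',
    how='left',               # keep every neighborhood row
    validate='one_to_one',    # raise if a postal code appears twice
)
print(merged)
```

With `how='left'`, a postal code missing from the coordinates file would show up as NaN in 'Latitude' and 'Longitude' instead of silently shifting the remaining rows.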

## Part 3: Exploring and Clustering the Neighborhoods in Toronto 

In [9]:
address = 'Toronto, Ontario'

geolocator = Nominatim(user_agent="tor_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

print('The geographical coordinates of ' + address + ' are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto, Ontario are 43.653963, -79.387207.
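
geopy's `geocode` returns `None` when it finds no match, in which case `location.latitude` would raise an `AttributeError`. A small wrapper can make that failure explicit. This is a sketch under assumptions: `coordinates_for` is a hypothetical helper, and `fake_geocode` is a stand-in used here so the example runs without a network call. In the notebook one would pass `geolocator.geocode` instead.

```python
from types import SimpleNamespace


def coordinates_for(address, geocode):
    """Return (latitude, longitude) for address, or raise if nothing is found.

    `geocode` is any callable that behaves like Nominatim(...).geocode:
    it returns an object with .latitude and .longitude, or None.
    """
    location = geocode(address)
    if location is None:
        raise ValueError('No geocoding result for {!r}'.format(address))
    return location.latitude, location.longitude


def fake_geocode(address):
    # Hypothetical stand-in for the real network lookup
    known = {
        'Toronto, Ontario': SimpleNamespace(latitude=43.653963,
                                            longitude=-79.387207),
    }
    return known.get(address)


latitude, longitude = coordinates_for('Toronto, Ontario', fake_geocode)
print(latitude, longitude)
```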


In [10]:
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

for lat, lng, borough, neighborhood in zip(df['Latitude'], df['Longitude'], df['Borough'], df['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='black',
        fill=True,
        fill_color='gray',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto) 
    
map_toronto

In [11]:
downtown_toronto = df[df['Borough'] == 'Downtown Toronto'].reset_index(drop=True)
downtown_toronto.head(11)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M4W,Downtown Toronto,Rosedale,43.679563,-79.377529
1,M4X,Downtown Toronto,"Cabbagetown, St. James Town",43.667967,-79.367675
2,M4Y,Downtown Toronto,Church and Wellesley,43.66586,-79.38316
3,M5A,Downtown Toronto,Harbourfront,43.65426,-79.360636
4,M5B,Downtown Toronto,"Ryerson, Garden District",43.657162,-79.378937
5,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
6,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
7,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
8,M5H,Downtown Toronto,"Adelaide, King, Richmond",43.650571,-79.384568
9,M5J,Downtown Toronto,"Harbourfront East, Toronto Islands, Union Station",43.640816,-79.381752


In [12]:
address = 'Downtown Toronto, Toronto'

geolocator = Nominatim(user_agent="tor_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

print('The geographical coordinates of ' + address + ' are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Downtown Toronto, Toronto are 43.6541737, -79.38081164513409.


In [13]:
downtown_map = folium.Map(location=[latitude, longitude], zoom_start=11)

for lat, lng, label in zip(downtown_toronto['Latitude'], downtown_toronto['Longitude'], downtown_toronto['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='black',
        fill=True,
        fill_color='gray',
        fill_opacity=0.7,
        parse_html=False).add_to(downtown_map)  
    
downtown_map

In [14]:
CLIENT_ID = 'DLEPKVUPBD22IWQUTN2PAYMXRPPSSNG322V3KVLF35CLNV0F'
CLIENT_SECRET = 'MFVKOQGWMBP3IU35Y04EJECLOGISGUEQPFSKQZ2VE2JMBUSE'
VERSION = '20180605'

print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

CLIENT_ID: DLEPKVUPBD22IWQUTN2PAYMXRPPSSNG322V3KVLF35CLNV0F
CLIENT_SECRET:MFVKOQGWMBP3IU35Y04EJECLOGISGUEQPFSKQZ2VE2JMBUSE
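
Credentials that are hard-coded and printed in a notebook are visible to anyone who can read it. One common alternative is to read them from environment variables. The following is a minimal sketch: the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helpers `load_credentials` and `mask` are choices made for this example, not something the API requires.

```python
import os


def load_credentials(env=os.environ):
    """Read the API credentials from a mapping of environment variables.

    The variable names are an assumption made for this sketch.
    """
    try:
        return env['FOURSQUARE_CLIENT_ID'], env['FOURSQUARE_CLIENT_SECRET']
    except KeyError as missing:
        raise RuntimeError('Missing environment variable: {}'.format(missing))


def mask(value, visible=4):
    """Keep the first few characters and replace the rest with asterisks."""
    return value[:visible] + '*' * (len(value) - visible)


# Demonstration with a plain dict of placeholder values instead of os.environ
demo_env = {'FOURSQUARE_CLIENT_ID': 'EXAMPLEID123',
            'FOURSQUARE_CLIENT_SECRET': 'EXAMPLESECRET456'}
client_id, client_secret = load_credentials(demo_env)
print('CLIENT_ID: ' + mask(client_id))
print('CLIENT_SECRET: ' + mask(client_secret))
```

In the notebook itself, `load_credentials()` with no argument would read from the real environment.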


In [15]:
downtown_toronto.loc[13]

PostalCode                                            M5T
Borough                                  Downtown Toronto
Neighborhood    Chinatown, Grange Park, Kensington Market
Latitude                                          43.6532
Longitude                                           -79.4
Name: 13, dtype: object

In [16]:
neighborhood_latitude = downtown_toronto.loc[13, 'Latitude'] 
neighborhood_longitude = downtown_toronto.loc[13, 'Longitude'] 
neighborhood_name = downtown_toronto.loc[13, 'Neighborhood']

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Chinatown, Grange Park, Kensington Market are 43.6532057, -79.4000493.


In [17]:
radius = 500
LIMIT = 100
url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(
    CLIENT_ID,
    CLIENT_SECRET,
    neighborhood_latitude,
    neighborhood_longitude,
    VERSION, 
    radius, 
    LIMIT)

url

'https://api.foursquare.com/v2/venues/explore?client_id=DLEPKVUPBD22IWQUTN2PAYMXRPPSSNG322V3KVLF35CLNV0F&client_secret=MFVKOQGWMBP3IU35Y04EJECLOGISGUEQPFSKQZ2VE2JMBUSE&ll=43.6532057,-79.4000493&v=20180605&radius=500&limit=100'
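
The same URL can also be assembled from a dictionary of query parameters, which keeps each parameter next to its name and leaves the encoding to the standard library. This is a sketch only: `build_explore_url` is a hypothetical helper, and `'YOUR_ID'` and `'YOUR_SECRET'` are placeholders.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_explore_url(client_id, client_secret, lat, lng,
                      version='20180605', radius=500, limit=100):
    """Build the venues/explore URL used above from named parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'll': '{},{}'.format(lat, lng),
        'v': version,
        'radius': radius,
        'limit': limit,
    }
    # urlencode percent-encodes the values (the comma in 'll' becomes %2C)
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params)


url = build_explore_url('YOUR_ID', 'YOUR_SECRET', 43.6532057, -79.4000493)
print(url)

# Decoding the query string recovers the original values
print(parse_qs(urlsplit(url).query)['ll'])
```

`requests.get` also accepts such a dictionary directly through its `params` argument, which avoids building the string by hand.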

In [18]:
results = requests.get(url).json()
results

[... beginning of output truncated ...]
        'city': 'Toronto',
        'state': 'ON',
        'country': 'Canada',
        'formattedAddress': ['421 Dundas St W',
         'Toronto ON M5T 2W4',
         'Canada']},
       'categories': [{'id': '4bf58dd8d48988d1f5931735',
         'name': 'Dim Sum Restaurant',
         'pluralName': 'Dim Sum Restaurants',
         'shortName': 'Dim Sum',
         'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/food/dimsum_',
          'suffix': '.png'},
         'primary': True}],
       'photos': {'count': 0, 'groups': []}},
      'referralId': 'e-0-4ddbe8697d8b771c0b09b885-71'},
     {'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4adb5472f964a520fc2521e3',
       'name': 'Asian Legend 味香村',
       'location': {'address': '418 Dundas St W',
        'crossStreet': 'btwn Beverley & Huron St',
        'lat': 43.65360271388312,
        'lng': -7
[... remainder of output truncated ...]

In [19]:
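def get_category_type(row):
    """Return the name of a venue's first category, or None if it has none."""
    # The category list is stored under 'categories' or, when the nested
    # response has been flattened, under 'venue.categories'
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']

    if len(categories_list) == 0:
        return None
    return categories_list[0]['name']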
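
A helper like `get_category_type` is typically applied row by row to a flattened version of the response. The sketch below assumes each item has the `venue` / `location` / `categories` nesting seen in the output fragment above. The two items in `items` are made up for illustration and are not real API results.

```python
import pandas as pd


def get_category_type(row):
    """Return the name of a venue's first category, or None if it has none."""
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
    return categories_list[0]['name'] if categories_list else None


# Made-up items that mimic the nesting of the response fragment above
items = [
    {'venue': {'name': 'Venue A',
               'location': {'lat': 43.6536, 'lng': -79.3950},
               'categories': [{'name': 'Dim Sum Restaurant'}]}},
    {'venue': {'name': 'Venue B',
               'location': {'lat': 43.6540, 'lng': -79.4010},
               'categories': []}},
]

# Flatten the nested dicts into columns such as 'venue.name'
venues = pd.json_normalize(items)

# Extract one category name per row, then keep only the columns of interest
venues['category'] = venues.apply(get_category_type, axis=1)
venues = venues[['venue.name', 'category',
                 'venue.location.lat', 'venue.location.lng']]
print(venues)
```

A venue with an empty category list ends up with a missing value in the 'category' column.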