# Segmenting and Clustering Neighborhoods in Toronto

### This notebook is part of the IBM Applied Data Science Capstone course. In this practice assignment, location and venue data from Toronto are segmented and clustered.

*By Andrew Dahlstrom*

*04/22/2019*

In [1]:
from bs4 import BeautifulSoup
import requests
import numpy as np
import pandas as pd

### The first step is to scrape the text from the wikitable of postal codes, which can be found here:
https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
### Then we need to parse, clean, and load that text data into a pandas dataframe.

In [2]:
# Scrape text from wikitable online and load into a Pandas dataframe

source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(source, 'lxml')
#print(soup.prettify())

for table in soup.find_all('table', class_= 'wikitable'):
    neightable = []
    
    for row in table.find_all('tr'):
        neighrow = []
        
        for data in row.find_all('td'):
            neighrow.append(data.text.rstrip('\n'))
        
        neightable.append(neighrow)

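# Clean data: drop the empty header row, then remove rows with unassigned boroughs

neighdf = pd.DataFrame(neightable, columns = ['PostalCode', 'Borough', 'Neighborhood'])
neighdf.drop(index=0, inplace=True)  # the header row has no <td> cells, so it parses as an empty row
todrop = neighdf[neighdf['Borough'] == "Not assigned"].index
neighdf.drop(todrop, inplace=True)
neighdf.reset_index(drop=True, inplace=True)  # renumber rows from 0 in place
neighdf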

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights
5,M6A,North York,Lawrence Manor
6,M7A,Queen's Park,Not assigned
7,M9A,Etobicoke,Islington Avenue
8,M1B,Scarborough,Rouge
9,M1B,Scarborough,Malvern
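
The borough filter can also be written as a single boolean mask, without `inplace` mutation. This is a minimal sketch on three hand-made rows (illustrative only, not the scraped table), assuming the same column names as `neighdf`; the names `sample` and `cleaned` are hypothetical.

```python
import pandas as pd

# Illustrative rows only; the real table comes from the Wikipedia scrape.
sample = pd.DataFrame(
    [
        ["M1A", "Not assigned", "Not assigned"],
        ["M3A", "North York", "Parkwoods"],
        ["M7A", "Queen's Park", "Not assigned"],
    ],
    columns=["PostalCode", "Borough", "Neighborhood"],
)

# Keep rows whose borough is assigned, then renumber the index from 0.
cleaned = sample[sample["Borough"] != "Not assigned"].reset_index(drop=True)
print(cleaned)
```

Rows with an assigned borough but an unassigned neighborhood survive this filter, so they can be renamed in a later step.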


### The next step is to organize the table by merging neighborhoods that share a postal code and giving unnamed neighborhoods the name of their borough.

In [3]:
# Merge neighborhoods that share a postal code into one comma-separated string

neighdf = neighdf.groupby(['PostalCode','Borough'])['Neighborhood'].apply(', '.join).reset_index()
neighdf

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"
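
To see what the `groupby` and `join` combination does, here is a small sketch on three rows taken from the tables above. The variable names `rows` and `merged` are hypothetical, and `sort=False` is an optional choice made here to keep first-appearance order.

```python
import pandas as pd

rows = pd.DataFrame(
    {
        "PostalCode": ["M1B", "M1B", "M1G"],
        "Borough": ["Scarborough", "Scarborough", "Scarborough"],
        "Neighborhood": ["Rouge", "Malvern", "Woburn"],
    }
)

# Each (PostalCode, Borough) group collapses to one row whose
# Neighborhood is the group's names joined with ", ".
merged = (
    rows.groupby(["PostalCode", "Borough"], sort=False)["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)
print(merged)
```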


In [4]:
# Replace unassigned neighborhoods with borough names 

for index, row in neighdf.iterrows(): 
    if neighdf.at[index, 'Neighborhood'] == "Not assigned":
        neighdf.at[index, 'Neighborhood'] = neighdf.at[index, 'Borough']
      
neighdf.head(5)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
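
The row-by-row loop above works, but the same replacement can be sketched as one vectorized assignment. This assumes the same column names as `neighdf`; the two-row frame `df` and the mask name `unnamed` are hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "PostalCode": ["M1G", "M7A"],
        "Borough": ["Scarborough", "Queen's Park"],
        "Neighborhood": ["Woburn", "Not assigned"],
    }
)

# Boolean mask of rows that still lack a neighborhood name.
unnamed = df["Neighborhood"] == "Not assigned"

# Copy the borough name into those rows only; other rows are untouched.
df.loc[unnamed, "Neighborhood"] = df.loc[unnamed, "Borough"]
print(df)
```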


### The shape of our final dataframe looks like this:

In [5]:
neighdf.shape

(103, 3)

### Next we need to get the CSV file from http://cocl.us/Geospatial_data, which provides the latitude and longitude coordinates for each postal code.

In [6]:
import io
geourl="http://cocl.us/Geospatial_data"
s = requests.get(geourl).content
geodata = pd.read_csv(io.StringIO(s.decode('utf-8')))
geodata.head(5)

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


### Add latitude and longitude data to the neighborhood dataframe.

In [7]:

for index, row in neighdf.iterrows():
    
    for i, r in geodata.iterrows():
        if neighdf.at[index, 'PostalCode'] == geodata.at[i, 'Postal Code']:
            neighdf.at[index, 'Latitude'] = geodata.at[i, 'Latitude']
            neighdf.at[index, 'Longitude'] = geodata.at[i, 'Longitude']

neighdf

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",43.727929,-79.262029
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848
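
The nested loops above compare every row of `neighdf` with every row of `geodata`. A possible alternative is a left join on the postal code, sketched here on two rows from the tables above. The frame names `neigh`, `geo`, and `joined` are hypothetical; the column names match the notebook.

```python
import pandas as pd

neigh = pd.DataFrame(
    {
        "PostalCode": ["M1B", "M1C"],
        "Borough": ["Scarborough", "Scarborough"],
        "Neighborhood": ["Rouge, Malvern", "Highland Creek, Rouge Hill, Port Union"],
    }
)
# Deliberately in a different row order to show that the join matches on keys.
geo = pd.DataFrame(
    {
        "Postal Code": ["M1C", "M1B"],
        "Latitude": [43.784535, 43.806686],
        "Longitude": [-79.160497, -79.194353],
    }
)

# Left join keeps every neighborhood row, then drops the duplicate key column.
joined = neigh.merge(
    geo, how="left", left_on="PostalCode", right_on="Postal Code"
).drop(columns="Postal Code")
print(joined)
```

With `how="left"`, a postal code missing from the coordinate file would keep its row and get `NaN` coordinates instead of being dropped.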


In [8]:
neighdf.shape

(103, 5)

In [9]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
!conda install -c conda-forge folium=0.5.0 --yes
import folium # map rendering library

### Draw map with markers indicating postal codes.

In [10]:
# Use the mean of the postal code coordinates as the map center for Toronto
tlatitude = neighdf['Latitude'].mean()
tlongitude = neighdf['Longitude'].mean()

map_toronto = folium.Map(location=[tlatitude, tlongitude], zoom_start=11)

# add markers to map
for lat, lng, borough, neighborhood in zip(neighdf['Latitude'], neighdf['Longitude'], neighdf['Borough'], neighdf['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

### Use the Foursquare API to get venue data for postal codes in Toronto

In [12]:
# Specify Foursquare credentials
CLIENT_ID = 'YOUR_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

### Create a dataframe containing venue data for each postal code

In [19]:
# Define a function to get venue data for each postal code.
# Takes location names, coordinates, and a venue category ID as arguments
# and returns a dataframe containing venue data.

# Maximum number of venues returned per request
LIMIT = 100

def getNearbyVenues(names, latitudes, longitudes, category, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&categoryId={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius,
            category,
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Postal Code', 
                  'Postal Code Latitude', 
                  'Postal Code Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues

#FourSquare category code for bus stop
#category = '52f2ab2ebcbc57f1066b8b4f'

toronto_venues = getNearbyVenues(names=neighdf['PostalCode'],
                                   latitudes=neighdf['Latitude'],
                                   longitudes=neighdf['Longitude'],
                                   category='52f2ab2ebcbc57f1066b8b4f'
                                  )

toronto_venues.shape

M1B
M1C
M1E
M1G
M1H
M1J
M1K
M1L
M1M
M1N
M1P
M1R
M1S
M1T
M1V
M1W
M1X
M2H
M2J
M2K
M2L
M2M
M2N
M2P
M2R
M3A
M3B
M3C
M3H
M3J
M3K
M3L
M3M
M3N
M4A
M4B
M4C
M4E
M4G
M4H
M4J
M4K
M4L
M4M
M4N
M4P
M4R
M4S
M4T
M4V
M4W
M4X
M4Y
M5A
M5B
M5C
M5E
M5G
M5H
M5J
M5K
M5L
M5M
M5N
M5P
M5R
M5S
M5T
M5V
M5W
M5X
M6A
M6B
M6C
M6E
M6G
M6H
M6J
M6K
M6L
M6M
M6N
M6P
M6R
M6S
M7A
M7R
M7Y
M8V
M8W
M8X
M8Y
M8Z
M9A
M9B
M9C
M9L
M9M
M9N
M9P
M9R
M9V
M9W


(131, 7)
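
`getNearbyVenues` assumes every response contains at least one group and that every venue has a category. The sketch below shows one way to build the same request URL with encoded parameters and to flatten a response defensively. Both helper names (`build_explore_url`, `extract_venues`) are hypothetical, the payload is hand-written to mirror only the keys the notebook reads, and the shape of an empty response is an assumption rather than documented Foursquare behavior.

```python
from urllib.parse import urlencode

EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(client_id, client_secret, version, lat, lng,
                      category, radius=500, limit=100):
    """Hypothetical helper: build the explore URL with encoded parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "categoryId": category,
        "limit": limit,
    }
    return EXPLORE_URL + "?" + urlencode(params)


def extract_venues(payload, name, lat, lng):
    """Hypothetical helper: flatten one explore response into row tuples."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        # Fall back to an empty category when the list is missing or empty.
        categories = venue.get("categories") or [{}]
        rows.append((
            name, lat, lng,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            categories[0].get("name"),
        ))
    return rows


# Hand-written payload using one row from the venue table; not a real API response.
payload = {"response": {"groups": [{"items": [{"venue": {
    "name": "TTC Stop 14795 Bus 68",
    "location": {"lat": 43.711002, "lng": -79.279967},
    "categories": [{"name": "Bus Stop"}],
}}]}]}}

url = build_explore_url("ID", "SECRET", "20180605", 43.711112, -79.284577,
                        "52f2ab2ebcbc57f1066b8b4f")
rows = extract_venues(payload, "M1L", 43.711112, -79.284577)
empty = extract_venues({"response": {}}, "M1B", 43.806686, -79.194353)
print(url)
print(rows)
print(empty)
```

Returning an empty list for a postal code with no matches keeps the loop running instead of raising a `KeyError` or `IndexError` partway through the 103 requests.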

### The analysis I am performing with Foursquare location data estimates access to public transportation by comparing the number of bus stops found within 500 meters of each postal code's coordinates.

In [32]:
toronto_venues

Unnamed: 0,Postal Code,Postal Code Latitude,Postal Code Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,M1E,43.763573,-79.188711,Bus Stop: 116 @ Danzig,43.765708,-79.184802,Bus Stop
1,M1E,43.763573,-79.188711,Leslie & Eastern Avenue Bus Stop,43.762175,-79.183684,Bus Stop
2,M1L,43.711112,-79.284577,TTC Stop 14795 Bus 68,43.711002,-79.279967,Bus Stop
3,M1W,43.799525,-79.318389,TTC Stop # 2494/2495,43.797300,-79.313362,Bus Stop
4,M2J,43.778517,-79.346556,Don Mills Vivastation,43.777496,-79.347885,Bus Stop
5,M2J,43.778517,-79.346556,Yrt Leith Hill Stop,43.777443,-79.347690,Bus Stop
6,M2J,43.778517,-79.346556,169 huntingwood,43.775903,-79.346930,Bus Stop
7,M2J,43.778517,-79.346556,167 pharmacy,43.775761,-79.346530,Bus Stop
8,M2J,43.778517,-79.346556,TTC Parkway Forest Dr (West Side),43.775685,-79.344813,Bus Stop
9,M2J,43.778517,-79.346556,TTC Stop Sheppard Ave E / Don Mills Rd (West),43.775044,-79.347812,Bus Stop


### Next we need to count the number of bus stops in each postal code

In [64]:
toronto_bus = toronto_venues.groupby(['Postal Code']).size().reset_index(name='Bus Stops').sort_values(by=['Bus Stops'], ascending=[False])
toronto_bus.reset_index(inplace=True,drop=True)
toronto_bus

Unnamed: 0,Postal Code,Bus Stops
0,M5L,10
1,M5K,9
2,M5X,7
3,M5W,7
4,M2J,6
5,M4C,6
6,M6J,5
7,M5J,5
8,M5A,4
9,M6K,4
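
Because `groupby(...).size()` only sees postal codes that returned at least one venue, codes with no bus stops are absent from `toronto_bus`. A minimal sketch of how zero counts could be kept, assuming a list of all postal codes is available (for example from `neighdf['PostalCode']`). The names `all_codes`, `venues`, and `counts` are hypothetical, and the three venue rows mirror the first rows of the venue table above.

```python
import pandas as pd

# Hypothetical subset of postal codes; M1B has no rows in the venue table.
all_codes = ["M1B", "M1E", "M1L"]
venues = pd.DataFrame(
    {
        "Postal Code": ["M1E", "M1E", "M1L"],
        "Venue Category": ["Bus Stop", "Bus Stop", "Bus Stop"],
    }
)

counts = (
    venues.groupby("Postal Code").size()
    .reindex(all_codes, fill_value=0)   # add missing codes with a count of 0
    .rename("Bus Stops")
    .rename_axis("Postal Code")
    .sort_values(ascending=False)
    .reset_index()
)
print(counts)
```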


### Thanks for viewing this notebook!