## Segmenting and Clustering Neighborhoods in Toronto



# Part I: Data Import and Preprocessing

## 1. Data import by scraping the Wikipedia page

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import folium
from pandas import json_normalize  # pandas.io.json.json_normalize is deprecated since pandas 1.0
from geopy.geocoders import Nominatim
import wget

#### Scraping the web page

In [2]:
url="https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
wp=requests.get(url).text
wp_scrape=BeautifulSoup(wp, 'lxml')

#### Find the target table
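In the browser, open the page, choose Inspect, and locate the table in the HTML:   
 i) div class="mw-parser-output"  
 ii) table class="wikitable sortable jquery-tablesorter"  
          tag: "table"   
          class: "wikitable sortable jquery-tablesorter"  

The jquery-tablesorter class is added by JavaScript in the browser, so the HTML returned by requests is searched with "wikitable sortable" only.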

In [3]:
wp_table=wp_scrape.find('table', class_="wikitable sortable")

print("the data type is:",type(wp_table))
print("the name is:", wp_table.name)
print("the Table Header is:", wp_table.tr.text, wp_table.tr)

the data type is: <class 'bs4.element.Tag'>
the name is: table
the Table Header is: 
Postcode
Borough
Neighbourhood
 <tr>
<th>Postcode</th>
<th>Borough</th>
<th>Neighbourhood
</th></tr>


## 2. Read the data into a pandas Dataframe

#### The dataframe will consist of three columns: Postcode, Borough, and Neighborhood

In [4]:

col_name=['Postcode','Borough','Neighborhood']
wp_df=pd.DataFrame(columns=col_name)
wp_df

Unnamed: 0,Postcode,Borough,Neighborhood


In [5]:
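rows=[]
for tr in wp_table.find_all('tr'):
    # the header row has <th> cells only, so it yields one empty row (row 0 below)
    tx=['','','']
    for i, td in enumerate(tr.find_all('td')[:3]):
        tx[i]=td.text
    rows.append({'Postcode':tx[0],'Borough':tx[1], 'Neighborhood': tx[2].rstrip('\n')})
# DataFrame.append was removed in pandas 2.0, so build the frame once from the collected rows
wp_df=pd.DataFrame(rows, columns=col_name)
print("DataFrame's shape:", wp_df.shape)
wp_df.head(10)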


DataFrame's shape: (289, 3)


Unnamed: 0,Postcode,Borough,Neighborhood
0,,,
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Harbourfront
6,M5A,Downtown Toronto,Regent Park
7,M6A,North York,Lawrence Heights
8,M6A,North York,Lawrence Manor
9,M7A,Queen's Park,Not assigned
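
The blank row 0 above comes from the header row, which contains only th cells. As a minimal sketch of how such rows could be skipped during extraction, the example below uses the standard library's html.parser instead of BeautifulSoup. The inline HTML snippet and the TableRowParser class are hypothetical stand-ins, not part of the notebook.

```python
from html.parser import HTMLParser

# Hypothetical inline snippet standing in for the Wikipedia table.
SAMPLE_HTML = """
<table class="wikitable sortable">
<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood
</th></tr>
<tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
<tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
</table>
"""


class TableRowParser(HTMLParser):
    """Collect the text of every <td> cell, one list per <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None      # cells of the <tr> currently being read
        self._in_td = False   # True while inside a <td> element

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag == 'td' and self._row is not None:
            self._in_td = True
            self._row.append('')

    def handle_endtag(self, tag):
        if tag == 'td':
            self._in_td = False
        elif tag == 'tr' and self._row is not None:
            # Header rows hold <th> cells only, so their cell list is empty: skip them.
            if self._row:
                self.rows.append([cell.strip() for cell in self._row])
            self._row = None

    def handle_data(self, data):
        if self._in_td:
            self._row[-1] += data


parser = TableRowParser()
parser.feed(SAMPLE_HTML)
rows = parser.rows
print(rows)
```

Skipping empty rows at this stage would make the later filter on empty Borough values unnecessary, although the notebook keeps that filter.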


## 3. Data Preprocessing


#### 3-1) Filter out "Not Assigned":
Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.

In [6]:
wp_df['Borough'].value_counts()  

Not assigned        77
Etobicoke           45
North York          38
Downtown Toronto    37
Scarborough         37
Central Toronto     17
West Toronto        13
York                 9
East Toronto         7
East York            6
Mississauga          1
Queen's Park         1
                     1
Name: Borough, dtype: int64

In [7]:
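# .copy() makes the filtered frame independent, so later assignments do not raise SettingWithCopyWarning
wp_df=wp_df[(wp_df['Borough']!='Not assigned') & (wp_df['Borough']!="")].copy()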
wp_df['Borough'].value_counts()  

Etobicoke           45
North York          38
Downtown Toronto    37
Scarborough         37
Central Toronto     17
West Toronto        13
York                 9
East Toronto         7
East York            6
Mississauga          1
Queen's Park         1
Name: Borough, dtype: int64
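
The same filter can be written with isin, which keeps the list of excluded values in one place. This is a sketch on a few hypothetical toy rows that mirror the scraped table; the names df, keep and assigned are illustrative only.

```python
import pandas as pd

# Hypothetical toy rows, including the blank row produced by the table header.
df = pd.DataFrame({
    'Postcode': ['', 'M1A', 'M3A', 'M7A'],
    'Borough': ['', 'Not assigned', 'North York', "Queen's Park"],
    'Neighborhood': ['', 'Not assigned', 'Parkwoods', 'Not assigned'],
})

# True for rows whose Borough is neither 'Not assigned' nor empty.
keep = ~df['Borough'].isin(['Not assigned', ''])

# .copy() gives an independent frame that is safe to modify afterwards.
assigned = df[keep].copy()
print(assigned)
```

Note that the original index labels are preserved, which is why the filtered table in the notebook starts at index 3.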

#### 3-2) Not Assigned neighborhood
If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough. So for the 9th cell in the table on the Wikipedia page, the value of the Borough and the Neighborhood columns will be Queen's Park.

In [8]:
# Approach 1: vectorized assignment through a boolean mask
na_mask=wp_df['Neighborhood']=='Not assigned'
wp_df.loc[na_mask,'Neighborhood']=wp_df.loc[na_mask,'Borough']

# Approach 2 (not used): assigning to `row` inside iterrows() may only change a copy,
# so the update is not guaranteed to reach wp_df
wp_df.head(10)

Unnamed: 0,Postcode,Borough,Neighborhood
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Harbourfront
6,M5A,Downtown Toronto,Regent Park
7,M6A,North York,Lawrence Heights
8,M6A,North York,Lawrence Manor
9,M7A,Queen's Park,Queen's Park
11,M9A,Etobicoke,Islington Avenue
12,M1B,Scarborough,Rouge
13,M1B,Scarborough,Malvern
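
An alternative way to express this rule is Series.mask, which replaces values wherever a condition is true. The sketch below runs on two hypothetical toy rows; the variable names are illustrative.

```python
import pandas as pd

# Hypothetical toy rows: one normal row and one with an unassigned neighborhood.
df = pd.DataFrame({
    'Postcode': ['M3A', 'M7A'],
    'Borough': ['North York', "Queen's Park"],
    'Neighborhood': ['Parkwoods', 'Not assigned'],
})

# Where the neighborhood is 'Not assigned', take the value from Borough instead.
unassigned = df['Neighborhood'].eq('Not assigned')
df['Neighborhood'] = df['Neighborhood'].mask(unassigned, df['Borough'])
print(df)
```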


#### 3-3) Combine the neighborhoods that share an FSA:
An FSA (forward sortation area) is the first three characters of a Canadian postal code, and more than one neighborhood can exist in one FSA. For example, in the table on the Wikipedia page, M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows are combined into one row, with the neighborhoods separated by a comma.


In [9]:
wp_df=wp_df.groupby(['Postcode','Borough']).agg(','.join)
# reset_index turns the Postcode and Borough group keys back into ordinary columns
wp_df_final=wp_df.reset_index()
wp_df_final.head(5)

Unnamed: 0,Postcode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge,Malvern"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
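
To make the grouping step easier to follow, here is a small sketch on three hypothetical toy rows. It selects the Neighborhood column explicitly before joining, which is a slight variation on the cell above rather than the notebook's exact code.

```python
import pandas as pd

# Hypothetical toy rows: M5A appears twice, M3A once.
df = pd.DataFrame({
    'Postcode': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighborhood': ['Harbourfront', 'Regent Park', 'Parkwoods'],
})

# Join the neighborhoods of each (Postcode, Borough) group with a comma,
# then turn the group keys back into ordinary columns.
combined = (
    df.groupby(['Postcode', 'Borough'])['Neighborhood']
      .agg(','.join)
      .reset_index()
)
print(combined)
```

Because groupby sorts the keys by default, the result is ordered by Postcode, which matches the ordering of the notebook output above.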


#### 3-4) show the shape of the DataFrame
Use the .shape attribute to show the number of rows and columns of the final dataframe.

In [10]:
wp_df_final.shape

(103, 3)

# Part II: Get the coordinates (latitude, longitude) for each FSA

### (1) Get the Postcode, Latitude, Longitude data file

In [11]:
import wget
   
wget.download('http://cocl.us/Geospatial_data', 'tor_fsa_lng_lat.csv')
print('Data downloaded!')

100% [................................................................................] 2891 / 2891
Data downloaded!


In [12]:
tor_ll=pd.read_csv('tor_fsa_lng_lat.csv')
tor_ll.columns=['Postcode', 'Latitude', 'Longitude']
tor_ll.head()

Unnamed: 0,Postcode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


### (2) Merge the Toronto FSA and Long-Lat files

In [13]:
Tor_df=pd.merge(wp_df_final, tor_ll, on='Postcode')
Tor_df.head(10)

Unnamed: 0,Postcode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge,Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park,Ionview,Kennedy Park",43.727929,-79.262029
7,M1L,Scarborough,"Clairlea,Golden Mile,Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest,Cliffside,Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff,Cliffside West",43.692657,-79.264848
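
The default inner merge silently drops any postcode that has no coordinates. As a sketch of one way to check for that, the example below uses a left merge with indicator=True and validate='one_to_one' on hypothetical toy frames. The postcode M9Z is invented for the illustration and is not in the real data.

```python
import pandas as pd

# Hypothetical toy frames; 'M9Z' is an invented postcode with no coordinates.
fsa = pd.DataFrame({
    'Postcode': ['M1B', 'M1C', 'M9Z'],
    'Borough': ['Scarborough', 'Scarborough', 'Example Borough'],
})
coords = pd.DataFrame({
    'Postcode': ['M1B', 'M1C'],
    'Latitude': [43.806686, 43.784535],
    'Longitude': [-79.194353, -79.160497],
})

# how='left' keeps every FSA row, validate raises if a postcode repeats on either
# side, and indicator adds a _merge column that records where each row came from.
merged = fsa.merge(coords, on='Postcode', how='left',
                   validate='one_to_one', indicator=True)

# Postcodes that found no matching coordinates.
missing = merged.loc[merged['_merge'] == 'left_only', 'Postcode'].tolist()
print(missing)
```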


# Part III: Neighborhood Analysis

## 1. Show the Downtown Toronto neighborhoods on the map

#### Retrieve Toronto's coordinates

In [14]:
address='Toronto, ON'
geolocator=Nominatim(user_agent="Toronto_Explorer")
tor_loc=geolocator.geocode(address)
tor_long=tor_loc.longitude
tor_lat=tor_loc.latitude
print("Toronto's geographic coordinates are: latitude {}, longitude {}".format(tor_lat, tor_long))

Toronto's geographic coordinates are: latitude 43.653963, longitude -79.387207


#### DataFrame for "Downtown Toronto"

In [15]:
print(Tor_df['Borough'].value_counts())
DT_Tor=Tor_df.loc[Tor_df['Borough']=='Downtown Toronto'].reset_index(drop=True)
DT_Tor

North York          24
Downtown Toronto    18
Scarborough         17
Etobicoke           12
Central Toronto      9
West Toronto         6
East York            5
East Toronto         5
York                 5
Mississauga          1
Queen's Park         1
Name: Borough, dtype: int64


Unnamed: 0,Postcode,Borough,Neighborhood,Latitude,Longitude
0,M4W,Downtown Toronto,Rosedale,43.679563,-79.377529
1,M4X,Downtown Toronto,"Cabbagetown,St. James Town",43.667967,-79.367675
2,M4Y,Downtown Toronto,Church and Wellesley,43.66586,-79.38316
3,M5A,Downtown Toronto,"Harbourfront,Regent Park",43.65426,-79.360636
4,M5B,Downtown Toronto,"Ryerson,Garden District",43.657162,-79.378937
5,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
6,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
7,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
8,M5H,Downtown Toronto,"Adelaide,King,Richmond",43.650571,-79.384568
9,M5J,Downtown Toronto,"Harbourfront East,Toronto Islands,Union Station",43.640816,-79.381752


#### Show Toronto & Downtown Toronto on the map

In [16]:
map_toronto=folium.Map(location=[tor_lat, tor_long], zoom_start=13)

for lat, lng, borough, ngbr in zip(DT_Tor['Latitude'],DT_Tor['Longitude'],DT_Tor['Borough'],DT_Tor['Neighborhood']):
    label='{}, {}'.format(borough, ngbr)
    # parse_html belongs to the Popup, not to the CircleMarker
    label=folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat,lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)  
    
map_toronto

## 2. Analyze Downtown Toronto's neighborhoods through Foursquare
Get the top 100 venues within a radius of 500 meters of a Downtown Toronto neighborhood.  
A row/column selection with .loc[] or .iloc[] returns a Series; use .values[0] or .item() to convert a one-element Series into a scalar value.

In [17]:
Limit=100
radius=500

CLIENT_ID = 'BlockedforPrivacy' # your Foursquare ID
CLIENT_SECRET = 'BlockedforPrivacy' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

#lat=DT_Tor.loc[DT_Tor['Neighborhood']=='Cabbagetown,St. James Town','Latitude']
lat=DT_Tor.loc[DT_Tor['Neighborhood']=='Cabbagetown,St. James Town','Latitude'].values[0]
lng=DT_Tor[DT_Tor['Neighborhood']=='Cabbagetown,St. James Town']['Longitude'].item()
ngbr=DT_Tor[DT_Tor['Neighborhood']=='Cabbagetown,St. James Town']['Neighborhood']
print(lat)
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    lat, 
    lng, 
    radius, 
    Limit)
url # display URL

43.667967


'https://api.foursquare.com/v2/venues/explore?&client_id=BlockedforPrivacy&client_secret=BlockedforPrivacy&v=20180605&ll=43.667967,-79.3676753&radius=500&limit=100'
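
As an alternative to formatting the query string by hand, the parameters can be kept in a dictionary and encoded with the standard library. This is a sketch under assumptions: the environment variable names FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET are hypothetical, chosen here so that credentials stay out of the notebook and its output.

```python
import os
from urllib.parse import urlencode

# Hypothetical environment variable names; placeholders are used if they are unset.
params = {
    'client_id': os.environ.get('FOURSQUARE_CLIENT_ID', 'YOUR_CLIENT_ID'),
    'client_secret': os.environ.get('FOURSQUARE_CLIENT_SECRET', 'YOUR_CLIENT_SECRET'),
    'v': '20180605',
    'll': '{},{}'.format(43.667967, -79.3676753),  # latitude,longitude
    'radius': 500,
    'limit': 100,
}

# safe=',' keeps the comma in the ll value unescaped, as in the URL above.
url = 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params, safe=',')
print(url.split('?')[0])  # print only the endpoint, not the credentials
```

The same dictionary could also be passed to requests.get through its params argument, which builds the query string itself.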

In [18]:
results=requests.get(url).json()
#results

In [19]:
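# the recommended venues sit in the first group of the response
venues=results['response']['groups'][0]['items']
# flatten the nested JSON into one column per dotted key path
nearby_venues=json_normalize(venues)
print(type(nearby_venues))
print(nearby_venues.columns)
nearby_venues.head(2)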

<class 'pandas.core.frame.DataFrame'>
Index(['reasons.count', 'reasons.items', 'referralId', 'venue.categories',
       'venue.id', 'venue.location.address', 'venue.location.cc',
       'venue.location.city', 'venue.location.country',
       'venue.location.crossStreet', 'venue.location.distance',
       'venue.location.formattedAddress', 'venue.location.labeledLatLngs',
       'venue.location.lat', 'venue.location.lng',
       'venue.location.neighborhood', 'venue.location.postalCode',
       'venue.location.state', 'venue.name', 'venue.photos.count',
       'venue.photos.groups', 'venue.venuePage.id'],
      dtype='object')


Unnamed: 0,reasons.count,reasons.items,referralId,venue.categories,venue.id,venue.location.address,venue.location.cc,venue.location.city,venue.location.country,venue.location.crossStreet,...,venue.location.labeledLatLngs,venue.location.lat,venue.location.lng,venue.location.neighborhood,venue.location.postalCode,venue.location.state,venue.name,venue.photos.count,venue.photos.groups,venue.venuePage.id
0,0,"[{'summary': 'This spot is popular', 'type': '...",e-0-4b646a6ff964a5205cb12ae3-0,"[{'id': '4bf58dd8d48988d147941735', 'name': 'D...",4b646a6ff964a5205cb12ae3,601 Parliament St.,CA,Toronto,Canada,at Wellesley St. E,...,"[{'label': 'display', 'lat': 43.6678427705951,...",43.667843,-79.369407,,M4X 1P9,ON,Cranberries,0,[],
1,0,"[{'summary': 'This spot is popular', 'type': '...",e-0-4e4e7aa06365e1419d021044-1,"[{'id': '4bf58dd8d48988d110941735', 'name': 'I...",4e4e7aa06365e1419d021044,12 Amelia St,CA,Toronto,Canada,Parliament St,...,"[{'label': 'display', 'lat': 43.66753590663226...",43.667536,-79.368613,,M4X 1E1,ON,F'Amelia,0,[],
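
A natural next step is to keep only a few of the flattened columns and reduce the category list to a single name. The sketch below shows one way to do this on hypothetical records that only imitate the structure of the response; the venue names, categories and coordinates are invented.

```python
import pandas as pd

# Hypothetical records imitating the nested structure of the 'items' list.
items = [
    {'venue': {'name': 'Example Cafe',
               'categories': [{'name': 'Coffee Shop'}],
               'location': {'lat': 43.6678, 'lng': -79.3694}}},
    {'venue': {'name': 'Example Park',
               'categories': [],
               'location': {'lat': 43.6675, 'lng': -79.3686}}},
]

# Nested dictionaries become dotted column names; lists are kept as they are.
flat = pd.json_normalize(items)

# Keep only the columns of interest.
cols = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
venues = flat[cols].copy()

# Use the first category's name, or None when the list is empty.
venues['venue.categories'] = venues['venue.categories'].apply(
    lambda cats: cats[0]['name'] if cats else None
)

# Shorten 'venue.location.lat' to 'lat', and so on.
venues.columns = [c.split('.')[-1] for c in venues.columns]
print(venues)
```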
