# Question 1:

Import the necessary libraries

In [1]:
from bs4 import BeautifulSoup as bs
import lxml
import requests
import numpy as np
import pandas as pd

Use the Notebook to build the code to scrape the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, in order to obtain the data that is in the table of postal codes and to transform the data into a pandas dataframe.

In [2]:
# Scrape the page and locate the postal code table

html = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = bs(html, 'lxml')
table = soup.find('table', class_='wikitable sortable')
# print(soup.prettify())
# The prettified HTML is very long and hard to read, so the print is commented out

The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood.

In [3]:
toronto = pd.DataFrame(columns = ['PostalCode','Borough','Neighborhood'])
toronto.head()

Unnamed: 0,PostalCode,Borough,Neighborhood


Reading in the Data

In [4]:
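# Walk each table row (tr) and collect the text of its cells (td);
# the header row has no td cells, so it is skipped by the length check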
for tr_cell in table.find_all('tr'):
    row_data = []
    for td_cell in tr_cell.find_all('td'):
        row_data.append(td_cell.text.strip())
    if len(row_data) == 3:
        toronto.loc[len(toronto)] = row_data

toronto.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
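The loop above adds one row to the dataframe per iteration. As a minimal sketch of the same parsing, the rows can also be collected in a plain list and turned into a dataframe once at the end. The HTML snippet below is made-up sample data standing in for the Wikipedia table, and `html.parser` is used only so the sketch does not need lxml.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Made-up sample standing in for the real Wikipedia table (assumption).
sample_html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M4A</td><td>North York</td><td>Victoria Village</td></tr>
</table>
"""

sample_table = BeautifulSoup(sample_html, "html.parser").find(
    "table", class_="wikitable sortable"
)

rows = []
for tr in sample_table.find_all("tr"):
    cells = [td.text.strip() for td in tr.find_all("td")]
    if len(cells) == 3:  # the header row only has <th> cells, so it is skipped
        rows.append(cells)

# Build the dataframe once from the collected rows.
sample_df = pd.DataFrame(rows, columns=["PostalCode", "Borough", "Neighborhood"])
print(sample_df)
```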


Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.

In [5]:
toronto = toronto[toronto['Borough'] != 'Not assigned']
toronto.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


More than one neighborhood can exist in one postal code area. For example, M5A has two neighborhoods: Harbourfront and Regent Park. Rows that share a postal code and borough will be combined into one row with the neighborhoods separated by a comma, which is how M5A already appears at index 4 in the table above.

In [6]:
toronto_grp = toronto.groupby(['PostalCode', 'Borough'], sort=False, as_index=False).agg(', '.join)
toronto_grp.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
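The scraped rows shown here already have one row per postal code, so the effect of the grouping is hard to see. A small sketch on toy rows (made up for illustration, with M5A deliberately split across two rows) shows what the group-and-join step does:

```python
import pandas as pd

# Toy rows (assumed for illustration): M5A appears twice.
toy_rows = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Harbourfront", "Regent Park", "Parkwoods"],
})

# Group on postal code and borough, joining the neighborhood names with a comma.
combined = (
    toy_rows.groupby(["PostalCode", "Borough"], sort=False, as_index=False)
    .agg({"Neighborhood": ", ".join})
)
print(combined)
```

Passing `sort=False` keeps the groups in the order they first appear, which is why the grouped output keeps the same order as the scraped table.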


If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.

In [7]:
toronto_grp.loc[toronto_grp['Neighborhood'] == 'Not assigned', 'Neighborhood'] = toronto_grp['Borough']
toronto_grp.head()
# I spent...like a whole hour wondering why my code wasn't working only to figure out I misspelled 'Neighborhood' in the grouping function

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
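None of the first five rows has a Not assigned neighborhood, so the replacement is not visible in the output. A minimal sketch on toy data (the rows are made up for illustration) shows the borough being copied into the neighborhood column:

```python
import pandas as pd

# Toy rows (assumed for illustration): the first one has no neighborhood.
toy = pd.DataFrame({
    "PostalCode": ["M7A", "M3A"],
    "Borough": ["Queen's Park", "North York"],
    "Neighborhood": ["Not assigned", "Parkwoods"],
})

# Copy the borough into the neighborhood wherever the neighborhood is missing.
missing = toy["Neighborhood"] == "Not assigned"
toy.loc[missing, "Neighborhood"] = toy.loc[missing, "Borough"]
print(toy)
```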


In the last cell of your notebook, use the .shape attribute to print the number of rows of your dataframe.

In [8]:
toronto_grp.shape

(103, 3)

# Question 2:

Now that you have built a dataframe of the postal code of each neighborhood along with the borough name and neighborhood name, in order to utilize the Foursquare location data, we need to get the latitude and the longitude coordinates of each neighborhood.

In [9]:
!wget -q -O 'Toronto.csv' http://cocl.us/Geospatial_data
toronto_loc = pd.read_csv('Toronto.csv')
toronto_loc.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Renaming the 'Postal Code' column to 'PostalCode' (no space) so the merge key matches the first dataframe

In [10]:
toronto_loc.columns = ['PostalCode','Latitude','Longitude']
toronto_loc.head()

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
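Assigning a full list to `.columns` depends on the column order in the CSV. As an alternative sketch, `DataFrame.rename` changes only the named column and leaves the others alone. The one-row frame below reuses the first row of the output above as sample data.

```python
import pandas as pd

# One-row sample mirroring the CSV layout shown above.
sample_loc = pd.DataFrame({
    "Postal Code": ["M1B"],
    "Latitude": [43.806686],
    "Longitude": [-79.194353],
})

# Rename just the key column; Latitude and Longitude are left as they are.
sample_loc = sample_loc.rename(columns={"Postal Code": "PostalCode"})
print(sample_loc.columns.tolist())
```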


Merge the two dataframes on their shared PostalCode column to create a new dataframe

In [11]:
tor_grploc = pd.merge(toronto_grp, toronto_loc, on = 'PostalCode')
tor_grploc.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
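`pd.merge` defaults to an inner join, so any postal code without coordinates would be dropped without warning. A minimal sketch of one way to check for that, using a left join with `indicator=True`; the frames are toy data and the `M9Z` row is hypothetical:

```python
import pandas as pd

# Toy frames (assumed for illustration); M9Z has no matching coordinates.
left = pd.DataFrame({
    "PostalCode": ["M3A", "M4A", "M9Z"],
    "Borough": ["North York", "North York", "Hypothetical"],
})
right = pd.DataFrame({
    "PostalCode": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# A left join keeps every postal code; the _merge column records where each row came from.
checked = pd.merge(left, right, on="PostalCode", how="left", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "PostalCode"].tolist()
print(unmatched)
```

An empty `unmatched` list would mean the inner join used above loses no rows.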


# Question 3:

Explore and cluster the neighborhoods in Toronto. You can decide to work with only boroughs that contain the word Toronto and then replicate the same analysis we did to the New York City data. It is up to you.

In [12]:
from geopy.geocoders import Nominatim
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
import folium
from pandas import json_normalize  # pandas >= 1.0; the pandas.io.json import path is deprecated
print('Libraries imported.')

Libraries imported.


In [13]:
address = 'Toronto, ON'

geolocator = Nominatim(user_agent="Toronto")
location = geolocator.geocode(address)
toronto_lat = location.latitude
toronto_long = location.longitude
print('Toronto Coordinates: {}, {}.'.format(toronto_lat, toronto_long))

Toronto Coordinates: 43.6534817, -79.3839347.


In [14]:
toronto_map = folium.Map(location=[toronto_lat, toronto_long], zoom_start=10)

# add one circle marker per postal code, labelled with its neighborhood and borough
for lat, lng, borough, neighborhood in zip(tor_grploc['Latitude'], tor_grploc['Longitude'], tor_grploc['Borough'], tor_grploc['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='orange',
        fill=True,
        fill_color='gold',
        fill_opacity=0.5).add_to(toronto_map)

toronto_map

Define Foursquare Credentials (Copy-Pasted from Lab)

In [15]:
CLIENT_ID = '5G1RAVU2YF0UDWUNCMCCKEIC5IAWNQ0FN42VNAHFJLHAVOAX'
CLIENT_SECRET = 'AVKNN5YHYHCNRMEUTI04HCABA4YKAFVWEFEHK4QGDWIIZJIK'
VERSION = '20180605'

Get first neighborhood name.

In [16]:
tor_grploc.loc[0, 'Neighborhood']

'Parkwoods'

Set the latitude and longitude for Parkwoods, along with a search radius (in meters) and a result limit, then build the request URL

In [17]:
neighborlat = tor_grploc.loc[0, 'Latitude']
neighborlong = tor_grploc.loc[0, 'Longitude']
radius = 500
LIMIT = 100

url2 = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborlat, 
    neighborlong, 
    radius, 
    LIMIT)

url2

'https://api.foursquare.com/v2/venues/explore?&client_id=5G1RAVU2YF0UDWUNCMCCKEIC5IAWNQ0FN42VNAHFJLHAVOAX&client_secret=AVKNN5YHYHCNRMEUTI04HCABA4YKAFVWEFEHK4QGDWIIZJIK&v=20180605&ll=43.7532586,-79.3296565&radius=500&limit=100'
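The URL above is assembled with string formatting, which leaves a stray `&` right after the `?`. As a sketch of an alternative, the query string can be built from a dictionary with the standard library's `urlencode`. The helper name `build_explore_url` and the placeholder credentials are hypothetical; only the endpoint and parameter names come from the cell above.

```python
from urllib.parse import urlencode


def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Build an explore URL from its parts (hypothetical helper)."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",
        "radius": radius,
        "limit": limit,
    }
    # safe="," keeps the comma in the ll value unescaped, as in the URL above.
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params, safe=",")


# Placeholder credentials, not real ones.
example_url = build_explore_url(
    "YOUR_ID", "YOUR_SECRET", "20180605", 43.7532586, -79.3296565, 500, 100
)
print(example_url)
```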

Get Results (I am not printing them because it would look cluttered)

In [18]:
results = requests.get(url2).json()
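Since the results are not printed, a failed request would only show up later as a confusing KeyError. Below is a minimal sketch of a check that could run on the parsed JSON first. The helper `extract_items` is hypothetical, the payloads are made up, and the `meta.code` field is an assumption about the response layout rather than something shown in this notebook.

```python
def extract_items(payload):
    """Return the venue items from a parsed explore response (hypothetical helper)."""
    # Assumption: the response carries a status code under meta.code.
    code = payload.get("meta", {}).get("code")
    if code != 200:
        raise RuntimeError(f"Foursquare request failed with code {code}")
    groups = payload.get("response", {}).get("groups", [])
    # An empty response yields an empty list instead of an IndexError.
    return groups[0].get("items", []) if groups else []


# Made-up payloads for illustration.
ok_payload = {
    "meta": {"code": 200},
    "response": {"groups": [{"items": [{"venue": {"name": "Brookbanks Park"}}]}]},
}
empty_payload = {"meta": {"code": 200}, "response": {}}

print(len(extract_items(ok_payload)))
print(extract_items(empty_payload))
```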

In [19]:
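def get_category_type(row):
    """Return the name of the first category for a venue row, or None if it has none."""
    try:
        categories_list = row['categories']
    except KeyError:
        # flattened rows from json_normalize use the dotted column name
        categories_list = row['venue.categories']

    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']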
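A quick sketch of how `get_category_type` behaves on the three shapes of input it handles. The function is repeated here (catching only `KeyError`) so the example runs on its own, and the sample rows are plain dictionaries made up for illustration; in the notebook the rows are pandas Series, which also raise `KeyError` for a missing label.

```python
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']

    if len(categories_list) == 0:
        return None
    return categories_list[0]['name']


# Made-up sample rows for illustration.
raw_venue = {'name': 'Variety Store', 'categories': [{'name': 'Food & Drink Shop'}]}
flattened_row = {'venue.name': 'Brookbanks Park', 'venue.categories': [{'name': 'Park'}]}
no_category = {'name': 'Unnamed place', 'categories': []}

print(get_category_type(raw_venue))
print(get_category_type(flattened_row))
print(get_category_type(no_category))
```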

Clean the JSON and load it into a pandas dataframe

In [20]:
venues = results['response']['groups'][0]['items']

nearby_venues = json_normalize(venues)  # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Brookbanks Park,Park,43.751976,-79.33214
1,Variety Store,Food & Drink Shop,43.751974,-79.333114
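To make the flattening step easier to follow, here is a small sketch on a hand-written list shaped like the `items` entries used above. The two venues reuse the values from the output table; the nesting is an assumption inferred from the dotted column names. `json_normalize` turns nested dictionaries into dotted column names and leaves lists, such as the categories, untouched.

```python
import pandas as pd

# Hand-written items mirroring the two venues in the table above (structure assumed).
sample_items = [
    {"venue": {"name": "Brookbanks Park",
               "categories": [{"name": "Park"}],
               "location": {"lat": 43.751976, "lng": -79.33214}}},
    {"venue": {"name": "Variety Store",
               "categories": [{"name": "Food & Drink Shop"}],
               "location": {"lat": 43.751974, "lng": -79.333114}}},
]

flat = pd.json_normalize(sample_items)
dotted_columns = set(flat.columns)  # nested keys become dotted names

# Keep the first category name, then strip the prefixes from the column names.
flat["venue.categories"] = flat["venue.categories"].apply(
    lambda cats: cats[0]["name"] if cats else None
)
flat.columns = [col.split(".")[-1] for col in flat.columns]
print(flat)
```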


In [21]:
print('{} nearby venues returned.'.format(nearby_venues.shape[0]))

2 nearby venues returned.
