## Coursera IBM Data Science Capstone Project

By Tamela Maciel, June 2020

This Jupyter notebook completes the capstone project for Coursera's IBM Data Science Professional Certificate.

It uses the Foursquare API, BeautifulSoup for web scraping, and several Python libraries to gather location data for the city of Toronto, then compares and clusters its neighborhoods.

### Import libraries

In [1]:
import pandas as pd  # dataframe wrangling
import numpy as np  # numerical arrays
import string
from bs4 import BeautifulSoup  # web scraping

from pandas import json_normalize  # transform JSON into a pandas dataframe (pandas >= 1.0)

import requests # library to handle requests

import geocoder # convert a postcode into latitude and longitude values

#from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# k-means for the clustering stage
from sklearn.cluster import KMeans

import folium # map rendering library

### Part 1 - Create Database of Toronto Neighborhoods
Use BeautifulSoup to scrape postcodes, boroughs, and neighborhoods from this Wikipedia table:
https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

Step 1. Request the HTML from the website URL

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')  # name the parser explicitly to avoid a warning

Step 2. Find the first table in the HTML

In [3]:
table = soup.find('table')
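
`find('table')` returns the first `<table>` element in document order, so this step relies on the postcode table being the first one on the page. A minimal sketch on a made-up HTML snippet (the `sample_*` names and the `id` attributes are illustrative, not taken from the Wikipedia page) shows the difference between `find` and `find_all`:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for the downloaded page
sample_html = """
<html><body>
<table id="first"><tr><td>M3A</td></tr></table>
<table id="second"><tr><td>other</td></tr></table>
</body></html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
first_table = sample_soup.find("table")      # first <table> in document order
all_tables = sample_soup.find_all("table")   # list of every <table> on the page

print(first_table["id"], len(all_tables))
```

If the page layout ever changes, `find_all('table')` makes it easy to inspect the candidates and pick the right one by index.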

Step 3. Iterate through the table rows (denoted by 'tr' tags) and add each row to the 'neighborhood_df' dataframe if its borough is assigned a name

In [4]:
# read the table rows with BeautifulSoup's find_all() and get_text()
column_names = ['PostalCode', 'Borough', 'Neighborhood']
rows = []
for row in table.find_all('tr'):
    columns = row.find_all('td')
    if len(columns) < 3:
        continue  # skip rows without three 'td' cells, such as a header row
    postcode = columns[0].get_text().rstrip('\n')
    borough = columns[1].get_text().rstrip('\n')
    neighborhood = columns[2].get_text().rstrip('\n')
    # keep only rows whose borough has been assigned a name
    if borough != 'Not assigned':
        rows.append({'PostalCode': postcode,
                     'Borough': borough,
                     'Neighborhood': neighborhood})

# build the dataframe in one step (DataFrame.append was removed in pandas 2.0)
neighborhood_df = pd.DataFrame(rows, columns=column_names)

neighborhood_df.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"
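
To see how the row filter in Step 3 behaves, here is a small sketch on a hypothetical three-row table in the same shape as the Wikipedia one. The HTML and the `sample_*` names are made up for illustration, and the sketch assumes the header row uses `th` cells, which is why `find_all('td')` returns nothing for it:

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical table: one header row, one unassigned row, one valid row
sample_html = """
<table>
<tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
<tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
<tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

sample_table = BeautifulSoup(sample_html, "html.parser").find("table")

records = []
for row in sample_table.find_all("tr"):
    cells = [td.get_text().strip() for td in row.find_all("td")]
    if len(cells) != 3:
        continue  # the header row has no <td> cells, so cells is empty
    if cells[1] == "Not assigned":
        continue  # drop rows whose borough has no name
    records.append(cells)

sample_df = pd.DataFrame(records, columns=["PostalCode", "Borough", "Neighborhood"])
print(sample_df)
```

Only the `M3A` row survives: the header row is skipped by the length check and the `M1A` row by the borough check.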


In [5]:
print("There are {} rows in the 'neighborhood_df' dataframe".format(neighborhood_df.shape[0]))

There are 103 rows in the 'neighborhood_df' dataframe


### Part 2 - Get neighborhood latitude and longitude using geocoder

After a lot of trial and error with geocoder and geopy.geocoders, I was unable to get latitude and longitude values consistently for all neighborhoods. Most postcodes repeatedly return 'None'.

So I will read the coordinates from a CSV file instead.

In [20]:
### OLD CODE
#postal_code='M1B'
## initialize your variable to None
#lat_lng_coords = None
#
## loop until you get the coordinates
##while(lat_lng_coords is None):
#g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
#lat_lng_coords = g.latlng
#
#latitude = lat_lng_coords[0]
#longitude = lat_lng_coords[1]
#print(latitude,longitude)
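
The old code above sketches a loop that keeps calling the geocoder until it returns coordinates. Because most postcodes kept returning `None`, an unbounded loop like that may never finish. One possible alternative, shown here only as a sketch, is to cap the number of attempts and return `None` so the caller can fall back to the CSV file. The helper `lookup_with_retries` and the stand-in functions `flaky_lookup` and `always_none` are hypothetical and do not call any real geocoding service:

```python
import time

def lookup_with_retries(lookup, postal_code, max_attempts=3, delay_seconds=0.0):
    """Call lookup(postal_code) until it returns coordinates or attempts run out."""
    for _ in range(max_attempts):
        coords = lookup(postal_code)
        if coords is not None:
            return coords
        time.sleep(delay_seconds)  # pause before the next attempt
    return None  # caller decides on a fallback, e.g. a CSV lookup

# Hypothetical stand-in for a geocoding call: fails twice, then succeeds
calls = {"count": 0}

def flaky_lookup(postal_code):
    calls["count"] += 1
    return [43.806686, -79.194353] if calls["count"] >= 3 else None

# Hypothetical stand-in that never finds anything
def always_none(postal_code):
    return None

found = lookup_with_retries(flaky_lookup, "M1B")
missing = lookup_with_retries(always_none, "M1B")
print(found, missing)
```

In the notebook, `lookup` could wrap whichever geocoder call is being tried, with `delay_seconds` set high enough to respect the service's rate limits.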

In [21]:
lat_lng_file = pd.read_csv("Geospatial_Coordinates.csv")

In [24]:
lat_lng_file = lat_lng_file.rename(columns={'Postal Code': 'PostalCode'})
lat_lng_file.head()

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [29]:
neighborhood_df=neighborhood_df.merge(lat_lng_file, on="PostalCode")
neighborhood_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
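
`merge` defaults to an inner join, which silently drops any postcode that has no match in the coordinates file. A minimal sketch of how to check for such losses, using small made-up frames (`M0X` and the borough "Nowhere" are hypothetical; the two coordinate pairs are copied from the output above):

```python
import pandas as pd

# Small illustrative frames; "M0X" has no coordinates on purpose
left = pd.DataFrame({"PostalCode": ["M3A", "M4A", "M0X"],
                     "Borough": ["North York", "North York", "Nowhere"]})
coords = pd.DataFrame({"PostalCode": ["M3A", "M4A"],
                       "Latitude": [43.753259, 43.725882],
                       "Longitude": [-79.329656, -79.315572]})

# how="left" keeps every row of `left`; indicator=True adds a "_merge" column;
# validate="one_to_one" raises if a postcode is duplicated on either side
checked = left.merge(coords, on="PostalCode", how="left",
                     validate="one_to_one", indicator=True)

unmatched = checked.loc[checked["_merge"] == "left_only", "PostalCode"].tolist()
print(unmatched)
```

An empty `unmatched` list on the real data would confirm that every postcode received coordinates.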


In [31]:
print("There are now {} rows and {} columns in the 'neighborhood_df' dataframe".format(neighborhood_df.shape[0], neighborhood_df.shape[1]))

There are now 103 rows and 5 columns in the 'neighborhood_df' dataframe
