# Notebook for Coursera Capstone Project #

## First step: Load and prepare the data ##

These steps closely follow the approach taken in the previous peer-reviewed assignment. The goal is to load the data, clean it, and prepare it for the prediction task.

In [3]:
import urllib.request  # the request submodule must be imported explicitly for urlopen
from bs4 import BeautifulSoup
import pandas as pd
#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes # comment this line out if folium is already installed
import folium # map rendering library
import json
import requests
import numpy as np

Fetching package metadata .............
Solving package specifications: .

Package plan for installation in environment /opt/conda/envs/DSX-Python35:

The following NEW packages will be INSTALLED:

    altair:  2.2.2-py35_1 conda-forge
    branca:  0.3.1-py_0   conda-forge
    folium:  0.5.0-py_0   conda-forge
    vincent: 0.4.4-py_1   conda-forge

altair-2.2.2-p 100% |################################| Time: 0:00:00  52.10 MB/s
branca-0.3.1-p 100% |################################| Time: 0:00:00  35.50 MB/s
vincent-0.4.4- 100% |################################| Time: 0:00:00  39.07 MB/s
folium-0.5.0-p 100% |################################| Time: 0:00:00  47.36 MB/s


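We first load the Wikipedia page on Toronto postal codes, extract the table rows, clean them, and put the result in a Pandas DataFrame. Postal codes without a borough are dropped, a missing neighborhood falls back to the borough name, and neighborhoods that share a postal code are combined into one comma-separated entry.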

In [4]:
wiki_url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
wiki_file = urllib.request.urlopen(wiki_url)
wiki_raw = wiki_file.read()


soup = BeautifulSoup(wiki_raw, "lxml")
table_rows = soup.body.table.tbody
first = True
rows = []
for table_row in table_rows.find_all('tr'):
    if first:
        first = False
        continue

    fields = []
    for field in table_row.find_all('td'):
        fields.append(field.get_text().strip())
    rows.append(fields)

    
pc_dict = {}
for row in rows:
    if row[1] == 'Not assigned':
        continue
    if row[2] == 'Not assigned':
        row[2] = row[1]
    if row[0] in pc_dict:
        pc_dict[row[0]][1] = pc_dict[row[0]][1] + ', ' + row[2]
    else:
        pc_dict[row[0]] = [row[1], row[2]]
cleaned_data = []
for key, value in pc_dict.items():
    cleaned_data.append([key, value[0], value[1]])
cleaned_data = sorted(cleaned_data)


pc_data = pd.DataFrame(cleaned_data)
pc_data.columns = ['PostalCode', 'Borough', 'Neighborhood']
pc_data.head(20)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"
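
The cleaning rules used in the cell above can also be written with pandas alone. The following is a minimal sketch rather than the code that produced the table: the sample rows and the name `sample_pc_data` are made up for illustration.

```python
import pandas as pd

# Hypothetical sample rows in the same (postal code, borough, neighborhood)
# shape as the scraped table; the values are illustrative only.
rows = [
    ["M1A", "Not assigned", "Not assigned"],
    ["M5A", "Downtown Toronto", "Harbourfront"],
    ["M5A", "Downtown Toronto", "Regent Park"],
    ["M7A", "Queen's Park", "Not assigned"],
]
raw = pd.DataFrame(rows, columns=["PostalCode", "Borough", "Neighborhood"])

# Drop postal codes that have no borough.
assigned = raw[raw["Borough"] != "Not assigned"].copy()

# Fall back to the borough name where the neighborhood is missing.
missing = assigned["Neighborhood"] == "Not assigned"
assigned.loc[missing, "Neighborhood"] = assigned.loc[missing, "Borough"]

# Keep the first borough per postal code and join its neighborhoods with commas.
sample_pc_data = (
    assigned.groupby("PostalCode")
    .agg({"Borough": "first", "Neighborhood": ", ".join})
    .reset_index()
)
print(sample_pc_data)
```

Grouping by postal code sorts the result, which mirrors the `sorted(cleaned_data)` step in the loop-based version.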


We use the previous assignment's .csv file to get the longitudes and latitudes of Toronto neighborhoods.

In [5]:
lats_and_longs = pd.read_csv("https://cocl.us/Geospatial_data")
lats_and_longs.columns = ['PostalCode', 'Latitude', 'Longitude']

We join the two data frames on the postal code.

In [6]:
to_data = pc_data.join(lats_and_longs.set_index('PostalCode'), 'PostalCode', 'left')
to_data.head(20)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",43.727929,-79.262029
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848
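
A left join keeps every postal code even when no coordinates are found for it, and such rows would carry `NaN` into the Foursquare queries. The sketch below shows one way to check for them, assuming pandas' `merge` with `indicator=True`; the frames are toy data and the postal code `M9Z` is hypothetical.

```python
import pandas as pd

# Illustrative frames; the column names follow the notebook, the rows are made up.
codes = pd.DataFrame({
    "PostalCode": ["M1B", "M1C", "M9Z"],
    "Borough": ["Scarborough", "Scarborough", "Hypothetical"],
})
coords = pd.DataFrame({
    "PostalCode": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# A left merge keeps every postal code; indicator=True records where each row matched.
merged = codes.merge(coords, on="PostalCode", how="left", indicator=True)

# Rows marked "left_only" found no coordinates and would carry NaN into later steps.
unmatched = merged.loc[merged["_merge"] == "left_only", "PostalCode"].tolist()
print(unmatched)
```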


The following code defines a function that queries the Foursquare explore endpoint for up to `LIMIT` venues within `radius` meters of each neighborhood's coordinates.

In [9]:
CLIENT_ID = 'ODSOBKP0KNUEBOLKIIHACB3OVMEG5NYJLIP5VEDBU1FT3CVS'
CLIENT_SECRET = 'PZHGYID3TNFODPSMZATNIWFKBA1O4BKK22QFNZ2SXNCMMDOO'
VERSION = '20180605'
LIMIT = 100

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
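
The function above indexes straight into `["response"]['groups'][0]['items']` and `['categories'][0]`, so it raises if a response has no groups or a venue has no category. Whether the API actually returns such responses is an assumption here. The sketch below shows a more defensive way to flatten one decoded response; `extract_venues`, the `"Uncategorized"` label, and the payload are all hypothetical.

```python
def extract_venues(name, lat, lng, payload):
    """Flatten one explore response (already decoded from JSON) into tuples."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    venues = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Venues without a category are labelled explicitly instead of raising IndexError.
        category = categories[0]["name"] if categories else "Uncategorized"
        venues.append((name, lat, lng, venue["name"],
                       venue["location"]["lat"], venue["location"]["lng"], category))
    return venues

# Hand-written stand-in for a decoded response; not real API output.
fake_payload = {"response": {"groups": [{"items": [
    {"venue": {"name": "Example Cafe", "location": {"lat": 43.0, "lng": -79.0},
               "categories": [{"name": "Coffee Shop"}]}},
    {"venue": {"name": "Unlabelled Spot", "location": {"lat": 43.1, "lng": -79.1},
               "categories": []}},
]}]}}

found = extract_venues("Woburn", 43.770992, -79.216917, fake_payload)
empty = extract_venues("Woburn", 43.770992, -79.216917, {"response": {}})
print(found)
print(empty)
```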

We now query Foursquare for the venue information of each neighborhood. The function prints each neighborhood name as it is processed.

In [10]:
to_venues = getNearbyVenues(names=to_data['Neighborhood'],
                            latitudes=to_data['Latitude'],
                            longitudes=to_data['Longitude'])

Rouge, Malvern
Highland Creek, Rouge Hill, Port Union
Guildwood, Morningside, West Hill
Woburn
Cedarbrae
Scarborough Village
East Birchmount Park, Ionview, Kennedy Park
Clairlea, Golden Mile, Oakridge
Cliffcrest, Cliffside, Scarborough Village West
Birch Cliff, Cliffside West
Dorset Park, Scarborough Town Centre, Wexford Heights
Maryvale, Wexford
Agincourt
Clarks Corners, Sullivan, Tam O'Shanter
Agincourt North, L'Amoreaux East, Milliken, Steeles East
L'Amoreaux West, Steeles West
Upper Rouge
Hillcrest Village
Fairview, Henry Farm, Oriole
Bayview Village
Silver Hills, York Mills
Newtonbrook, Willowdale
Willowdale South
York Mills West
Willowdale West
Parkwoods
Don Mills North
Flemingdon Park, Don Mills South
Bathurst Manor, Downsview North, Wilson Heights
Northwood Park, York University
CFB Toronto, Downsview East
Downsview West
Downsview Central
Downsview Northwest
Victoria Village
Woodbine Gardens, Parkview Hill
Woodbine Heights
The Beaches
Leaside
Thorncliffe Park
East Toronto
The D

In [11]:
to_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Rouge, Malvern",43.806686,-79.194353,Wendy's,43.807448,-79.199056,Fast Food Restaurant
1,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497,RIGHT WAY TO GOLF,43.785177,-79.161108,Golf Course
2,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497,Royal Canadian Legion,43.782533,-79.163085,Bar
3,"Guildwood, Morningside, West Hill",43.763573,-79.188711,Swiss Chalet Rotisserie & Grill,43.767697,-79.189914,Pizza Place
4,"Guildwood, Morningside, West Hill",43.763573,-79.188711,G & G Electronics,43.765309,-79.191537,Electronics Store


We consider cafes and coffee shops to be interchangeable, so we replace references to 'Cafe' with 'Coffee Shop'.

In [18]:
to_venues.replace('Cafe', 'Coffee Shop', inplace=True)
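
`DataFrame.replace` with two plain strings matches whole cell values only, so a category spelled with an accent ("Café") or a longer name containing "Cafe" would be left untouched. Whether the live data uses the accented spelling is an assumption. The sketch below, on made-up rows, lists the candidate categories first and then merges each spelling explicitly.

```python
import pandas as pd

# Made-up venue rows; "Café" with an accent is included on the assumption that
# the live data may spell the category that way.
venues = pd.DataFrame({
    "Venue": ["A", "B", "C", "D"],
    "Venue Category": ["Cafe", "Café", "Coffee Shop", "Internet Cafe"],
})

# List every category that mentions "Caf" before deciding what to merge.
mask = venues["Venue Category"].str.contains("Caf", regex=False)
candidates = sorted(venues.loc[mask, "Venue Category"].unique())
print(candidates)

# replace matches whole cell values, so each spelling is listed explicitly.
venues["Venue Category"] = venues["Venue Category"].replace(
    {"Cafe": "Coffee Shop", "Café": "Coffee Shop"}
)
print(venues["Venue Category"].tolist())
```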

In [20]:
to_onehot = pd.get_dummies(to_venues[['Venue Category']], prefix="", prefix_sep="")
to_onehot['Neighborhood'] = to_venues['Neighborhood'] 
to_onehot.head()

Unnamed: 0,Accessories Store,Adult Boutique,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [22]:
to_grouped = to_onehot.groupby('Neighborhood').mean().reset_index()
to_grouped.head()

Unnamed: 0,Neighborhood,Accessories Store,Adult Boutique,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,...,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,"Adelaide, King, Richmond",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.01,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0
1,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Agincourt North, L'Amoreaux East, Milliken, St...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"Albion Gardens, Beaumond Heights, Humbergate, ...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"Alderwood, Long Branch",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
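
The two cells above turn the venue list into one row per neighborhood, where each value is the share of that neighborhood's venues in a given category. A toy example (the neighborhoods "A" and "B" are made up) makes the mechanics visible.

```python
import pandas as pd

toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Coffee Shop", "Coffee Shop", "Bar", "Park", "Bar", "Bar"],
})

# One indicator column per category; dtype=float keeps the columns numeric.
onehot = pd.get_dummies(toy["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", toy["Neighborhood"])

# The mean of a 0/1 column within a neighborhood is that category's share of venues.
shares = onehot.groupby("Neighborhood").mean()
print(shares)
```

Neighborhood "A" has two coffee shops among four venues, so its "Coffee Shop" share is 0.5.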


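Next, we normalize the frequencies by dividing each neighborhood's row by its row sum, so that the category frequencies in every row add up to 1.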

In [44]:
to_norm = to_grouped.drop(['Neighborhood'], axis=1).div(to_grouped.sum(axis=1, numeric_only=True), axis=0)
to_norm.head()

Unnamed: 0,Accessories Store,Adult Boutique,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.030303,...,0.010101,0.0,0.0,0.0,0.0,0.010101,0.0,0.0,0.010101,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
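
The row-wise division relies on `div(..., axis=0)`, which aligns the row totals with the row index rather than with the columns. A small sketch with illustrative numbers:

```python
import pandas as pd

# Toy shares whose first row does not quite sum to 1 (values are illustrative).
shares = pd.DataFrame(
    {"Bar": [0.33, 0.5], "Coffee Shop": [0.33, 0.25], "Park": [0.33, 0.25]},
    index=["A", "B"],
)

row_totals = shares.sum(axis=1)
# axis=0 aligns the totals with the row index, so each row is divided by its own sum.
normalized = shares.div(row_totals, axis=0)
print(normalized.sum(axis=1))
```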


In [45]:
to_neighborhood = to_grouped['Neighborhood']
to_neighborhood.head()

0                             Adelaide, King, Richmond
1                                            Agincourt
2    Agincourt North, L'Amoreaux East, Milliken, St...
3    Albion Gardens, Beaumond Heights, Humbergate, ...
4                               Alderwood, Long Branch
Name: Neighborhood, dtype: object

We now separate the target, the coffee shop frequency, from the remaining columns, which serve as the features.

In [46]:
to_target_all = to_norm['Coffee Shop']
to_target_all.head()

0    0.060606
1    0.000000
2    0.000000
3    0.090909
4    0.111111
Name: Coffee Shop, dtype: float64

In [47]:
to_feature_all = to_norm.drop(['Coffee Shop'], axis=1)
to_feature_all.head()

Unnamed: 0,Accessories Store,Adult Boutique,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.030303,...,0.010101,0.0,0.0,0.0,0.0,0.010101,0.0,0.0,0.010101,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [48]:
from sklearn.model_selection import train_test_split

In [49]:
X_train, X_test, y_train, y_test = train_test_split(to_feature_all, to_target_all, test_size=0.1, random_state=7)

In [50]:
X_train.head()

Unnamed: 0,Accessories Store,Adult Boutique,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
13,0.0,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
22,0.0,0.011905,0.011905,0.0,0.0,0.0,0.0,0.0,0.0,0.011905,...,0.0,0.011905,0.0,0.011905,0.0,0.0,0.011905,0.011905,0.0,0.011905
49,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.010101,0.0,0.0,0.0,0.0,0.010101,0.0,0.0,0.0,0.0
12,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.058824
85,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [51]:
X_test.head()

Unnamed: 0,Accessories Store,Adult Boutique,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
20,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.05,0.0,0.0,0.03,0.0,0.01,0.0,0.0,0.0,0.0
70,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
77,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.025,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
62,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
95,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [52]:
y_train.head()

13    0.000000
22    0.071429
49    0.131313
12    0.000000
85    0.050000
Name: Coffee Shop, dtype: float64

In [53]:
y_test.head()

20    0.040000
70    0.250000
77    0.075000
62    0.000000
95    0.666667
Name: Coffee Shop, dtype: float64
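
The split keeps the original row labels (13, 22, 49, ... above), which is what later allows a test row to be mapped back to its neighborhood name with `.loc`. A small sketch on toy stand-ins for `to_neighborhood`, `to_feature_all` and `to_target_all`:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-ins for to_neighborhood, to_feature_all and to_target_all.
names = pd.Series([f"Area {i}" for i in range(10)], name="Neighborhood")
features = pd.DataFrame({"Bar": [i / 10 for i in range(10)]})
target = pd.Series([i / 20 for i in range(10)], name="Coffee Shop")

X_tr, X_te, y_tr, y_te = train_test_split(
    features, target, test_size=0.1, random_state=7
)

# The split keeps the original row labels, so they still line up with the names.
held_out = names.loc[y_te.index]
print(held_out)
```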

We use a random forest regressor to predict the coffee shop frequency from the frequencies of the other venue categories.

In [54]:
from sklearn.ensemble import RandomForestRegressor

In [55]:
forest = RandomForestRegressor(max_depth=20, random_state=7, n_estimators=10)

In [56]:
forest.fit(X_train, y_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=20,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=7, verbose=0, warm_start=False)

In [57]:
y_pred = forest.predict(X_test)

In [59]:
from sklearn.metrics import mean_squared_error

In [60]:
mean_squared_error(y_test, y_pred)

0.048139203056855963

At this point, we subtract the actual test values from the predictions. This gives us the difference between the predicted and the actual coffee shop frequency. The higher the value, the greater the capacity for more coffee shops in a neighborhood.

In [62]:
y_pred - y_test

20   -0.010000
70   -0.175000
77    0.025595
62    0.000000
95   -0.666667
15   -0.031043
54    0.026129
37    0.062619
26    0.000000
65    0.000000
Name: Coffee Shop, dtype: float64

Eyeballing the differences shows that index 37 has the greatest value.

In [69]:
to_neighborhood.loc[37]

'Downsview Northwest'
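
As a cross-check, the mean squared error reported earlier can be recomputed by hand from the ten test differences printed above. The sketch also shows how much of the error comes from the single largest miss.

```python
# Test-set differences (prediction minus actual) copied from the output above.
residuals = {
    20: -0.010000, 70: -0.175000, 77: 0.025595, 62: 0.000000, 95: -0.666667,
    15: -0.031043, 54: 0.026129, 37: 0.062619, 26: 0.000000, 65: 0.000000,
}

squared = {idx: r ** 2 for idx, r in residuals.items()}
mse = sum(squared.values()) / len(squared)
print(round(mse, 6))

# Share of the total squared error contributed by the single largest miss.
worst = max(squared, key=squared.get)
print(worst, round(squared[worst] / sum(squared.values()), 3))
```

The recomputed value agrees with the reported 0.0481 up to rounding of the printed differences, and index 95 alone accounts for roughly 92% of the squared error.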

Now let's do something a bit shady: predict all the neighborhoods. This is shady because we used much of this data for training the model. Still, it's our best answer for the neighborhood with the greatest coffee capacity.

In [70]:
all_pred = forest.predict(to_feature_all)

In [71]:
diffs = all_pred - to_target_all

In [77]:
top_neighborhood = diffs.max()
top_neighborhood

0.06261904761904763

In [79]:
top_neighborhood_idx = diffs.idxmax()
top_neighborhood_idx

37

In [80]:
to_neighborhood.loc[37]

'Downsview Northwest'

We have a winner! Downsview Northwest is the best neighborhood in which to open a coffee shop.
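
Because most neighborhoods were part of the training set, their predictions are in-sample. One possible alternative, sketched here and not part of the analysis above, is to use out-of-fold predictions so that every neighborhood is scored by a model that never saw it. The sketch assumes scikit-learn's `cross_val_predict` with a 5-fold split; the data is synthetic and the names `X`, `y`, `oof_pred` and `oof_diffs` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic stand-ins for to_feature_all and to_target_all (random, not real data).
rng = np.random.default_rng(7)
X = pd.DataFrame(rng.random((40, 5)), columns=[f"cat_{i}" for i in range(5)])
y = pd.Series(0.5 * X["cat_0"] + 0.1 * rng.random(40), name="Coffee Shop")

model = RandomForestRegressor(max_depth=20, random_state=7, n_estimators=10)
folds = KFold(n_splits=5, shuffle=True, random_state=7)

# Each row is predicted by a model fitted on the other four folds only.
oof_pred = pd.Series(cross_val_predict(model, X, y, cv=folds), index=y.index)

# Same "predicted minus actual" gap as in the notebook, computed out of sample.
oof_diffs = oof_pred - y
print(oof_diffs.idxmax(), oof_diffs.max())
```

With the real data, `to_neighborhood.loc[oof_diffs.idxmax()]` would then name the neighborhood with the largest out-of-sample gap.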