# Project intro Week 4

## A description of the problem and a discussion of the background.
Suppose we want to open a bakery in Toronto. Some considerations are:
*  How much competition (other bakeries) is there in the area? Ideally we want low competition.
*  People are more likely to visit a bakery if it is close to other shops, so how many cafes and restaurants, and how much foot traffic, are in the area? Ideally we want more shops to generate foot traffic, but not other bakeries selling competing products.

## A description of the data and how it will be used to solve the problem.
Here are the steps we go through to collect and use the data:
*  Collect postal codes from [Wikipedia](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M) and format them the same way as in the previous week.
*  Use the Foursquare API to collect venue information for each of these areas.
*  Export the collected data to CSV so that it is ready to work with without having to run the Foursquare API queries again.
*  Look for clusters of bakeries. Do the same for venues such as cafes and restaurants.
*  As per the problem description, we are looking for an area that has strong clustering of cafes and restaurants, but where there aren't many competing bakeries.
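
The last two steps can be sketched as a simple per-neighbourhood tally. This is a minimal illustration, not the analysis itself: the rows in `venues` are made-up sample data, and `FOOD_CATEGORIES` and the `score` column (food venues minus bakeries) are assumptions introduced here. The column names mirror the ones produced by the Foursquare step later on.

```python
import pandas as pd

# Made-up sample rows for illustration only; real data comes from Foursquare.
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B", "B"],
    "Venue Category": ["Café", "Restaurant", "Bakery",
                       "Café", "Coffee Shop", "Restaurant"],
})

# Hypothetical set of categories treated as foot-traffic generators.
FOOD_CATEGORIES = {"Café", "Coffee Shop", "Restaurant"}

summary = (
    venues.assign(
        is_bakery=venues["Venue Category"].eq("Bakery"),
        is_food=venues["Venue Category"].isin(FOOD_CATEGORIES),
    )
    .groupby("Neighborhood")[["is_bakery", "is_food"]]
    .sum()
    .rename(columns={"is_bakery": "bakeries", "is_food": "food_venues"})
)

# Assumed score: more food venues is better, more bakeries is worse.
summary["score"] = summary["food_venues"] - summary["bakeries"]
print(summary.sort_values("score", ascending=False))
```

In this toy data, neighbourhood B ranks first because it has three food venues and no bakery. A real analysis would likely need a broader category list and some normalisation by area size.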

In [1]:
import requests
import lxml.html as lh
import pandas as pd

url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

In [2]:
page = requests.get(url)
doc = lh.fromstring(page.content)

In [3]:
tr_elements = doc.xpath('//tr')

In [4]:
data = []
for tr in tr_elements:
    tr_content = tr.text_content().split("\n")
    row = [item for item in tr_content if str(item).strip() != ""]
    if len(row) == 3:
        data.append(row)

In [5]:
df = pd.DataFrame(data)
df.columns = df.loc[0].values.tolist()
df.drop(df.index[0], inplace=True)
df = df[df["Borough"] != "Not assigned"]
df.reset_index(inplace=True, drop=True)
df.head(3)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront


In [6]:
pd.options.mode.chained_assignment = None  # default='warn'

postcodes = df["Postcode"].unique()
neighbourhoods = []
boroughs = []
for postcode in postcodes:
    boro = df[df["Postcode"] == postcode]["Borough"].unique().tolist()
    if len(boro) != 1:
        raise ValueError("There should only be one Borough")
    boroughs.append(boro[0])
    
    hood = df[df["Postcode"] == postcode]["Neighbourhood"].unique().tolist()
    neighbourhoods.append(", ".join(hood))

dft = pd.DataFrame({"Postcode": postcodes, "Borough": boroughs, "Neighbourhood": neighbourhoods})

# If a cell has a borough but a "Not assigned" neighbourhood, the neighbourhood is set to the borough name.
# Specifically, this applies to M7A, Queen's Park.
borough_tocopy = dft.loc[dft["Neighbourhood"]=="Not assigned"]["Borough"].values.tolist()
#print(borough_tocopy)
dft.loc[dft["Neighbourhood"]=="Not assigned", "Neighbourhood"] = borough_tocopy

# rename the postal code column
dft.rename(columns={'Postcode':'PostCode'}, inplace=True)
dft.head(3)

Unnamed: 0,PostCode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Harbourfront, Regent Park"


In [7]:
dft.shape  # we have 103 rows, 3 columns

(103, 3)
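
The loop in cell 6 combines rows that share a postcode. One possible alternative, shown here on a small hand-made frame rather than the scraped table, is a `groupby` with `agg`. The sketch assumes each postcode maps to a single borough (the same assumption the loop enforces with its `ValueError`), and unlike the loop it does not remove duplicate neighbourhood names before joining.

```python
import pandas as pd

# Small hand-made sample in the same shape as the scraped table.
sample = pd.DataFrame({
    "Postcode": ["M5A", "M5A", "M7A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "Queen's Park"],
    "Neighbourhood": ["Harbourfront", "Regent Park", "Not assigned"],
})

# One row per postcode: keep the first borough, join the neighbourhood names.
combined = (
    sample.groupby("Postcode", sort=False)
    .agg({"Borough": "first", "Neighbourhood": ", ".join})
    .reset_index()
)

# Where no neighbourhood was assigned, fall back to the borough name.
unassigned = combined["Neighbourhood"] == "Not assigned"
combined.loc[unassigned, "Neighbourhood"] = combined.loc[unassigned, "Borough"]
print(combined)
```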

Now that we have the Wikipedia data, add the latitude and longitude for each postal code from Geospatial_Coordinates.csv.

In [8]:
df_coords = pd.read_csv("Geospatial_Coordinates.csv")
df_coords.rename(columns={'Postal Code':'PostCode'}, inplace=True)
df_coords.set_index("PostCode", inplace=True)

lat = []
lng = []
for pc in dft["PostCode"].values:
    lat.append(df_coords.at[pc, "Latitude"])
    lng.append(df_coords.at[pc, "Longitude"])

dft["Latitude"] = lat
dft["Longitude"] = lng

dft.head()

Unnamed: 0,PostCode,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Harbourfront, Regent Park",43.65426,-79.360636
3,M6A,North York,"Lawrence Heights, Lawrence Manor",43.718518,-79.464763
4,M7A,Queen's Park,Queen's Park,43.662301,-79.389494
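
The loop above looks up each postal code one at a time. The same join could probably be written as a single `merge`; the sketch below uses two hand-made frames (`areas` and `coords` are illustrative names, with coordinates copied from the output above) rather than the real `dft` and `df_coords`.

```python
import pandas as pd

# Hand-made sample rows; coordinates copied from the output above.
areas = pd.DataFrame({
    "PostCode": ["M3A", "M4A"],
    "Borough": ["North York", "North York"],
})
coords = pd.DataFrame({
    "PostCode": ["M4A", "M3A"],
    "Latitude": [43.725882, 43.753259],
    "Longitude": [-79.315572, -79.329656],
})

# A left join keeps every row of `areas` in its original order;
# validate raises an error if `coords` contains duplicate postcodes.
merged = areas.merge(coords, on="PostCode", how="left", validate="many_to_one")
print(merged)
```

With a left join, a postcode missing from the coordinates file shows up as `NaN` instead of raising a `KeyError`, so it is worth checking for missing values afterwards.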


## Import the libraries we need

In [9]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

import folium # map rendering library

print('Libraries imported.')

Libraries imported.


Get the latitude and longitude for Toronto.

In [10]:
# https://www.latlong.net/place/toronto-on-canada-27230.html
tlat = 43.651070
tlng = -79.347015

dft.columns

Index(['PostCode', 'Borough', 'Neighbourhood', 'Latitude', 'Longitude'], dtype='object')

Filter the rows to keep only the boroughs whose name contains the word Toronto.

In [11]:
dff = dft[dft["Borough"].str.contains("Toronto")]
dff.head()

Unnamed: 0,PostCode,Borough,Neighbourhood,Latitude,Longitude
2,M5A,Downtown Toronto,"Harbourfront, Regent Park",43.65426,-79.360636
9,M5B,Downtown Toronto,"Ryerson, Garden District",43.657162,-79.378937
15,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
19,M4E,East Toronto,The Beaches,43.676357,-79.293031
20,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306


In [12]:
map_toronto = folium.Map(location=[tlat, tlng], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(dff['Latitude'], dff['Longitude'], dff['Borough'], dff['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  

map_toronto

So dff is the filtered dataframe. 

## Collect venue data for these filtered neighbourhoods
The getNearbyVenues function below is taken directly from the labs. It queries Foursquare for each neighbourhood and appends the results to a venues list:

In [13]:
# first we need credentials
CLIENT_ID = '2NOYPO3T3EULZM4SMI5Q4B00SBBLEYD3LWNYARIQNGQMIPWB'     # your Foursquare ID
CLIENT_SECRET = 'RHT5F4BM1TSDIMA33YRGST1POCB3VZQ2UNNQR2ZC3245Q4UI' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: 2NOYPO3T3EULZM4SMI5Q4B00SBBLEYD3LWNYARIQNGQMIPWB
CLIENT_SECRET:RHT5F4BM1TSDIMA33YRGST1POCB3VZQ2UNNQR2ZC3245Q4UI
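
Hard-coding and printing API credentials makes them easy to leak when a notebook is shared. One common alternative is to read them from environment variables. The sketch below is only an illustration: the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helpers `load_credentials` and `mask_secret` are assumptions made here, not part of the lab code.

```python
import os

def load_credentials(env=os.environ):
    """Read the Foursquare credentials from environment variables.

    The variable names are assumptions made for this sketch.
    """
    try:
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError(f"Missing environment variable: {missing}") from None

def mask_secret(secret, visible=4):
    """Show only the first few characters so a value can be checked without exposing it."""
    return secret[:visible] + "*" * (len(secret) - visible)
```

It could then be used as `CLIENT_ID, CLIENT_SECRET = load_credentials()`, printing `mask_secret(CLIENT_SECRET)` instead of the secret itself.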


In [14]:
LIMIT=500

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                             'Neighborhood Latitude', 
                             'Neighborhood Longitude', 
                             'Venue', 
                             'Venue Latitude', 
                             'Venue Longitude', 
                             'Venue Category']
    
    return nearby_venues
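
The list comprehension above assumes that every response contains at least one group and that every venue has at least one category. A more defensive parsing step might look like the sketch below. `extract_venues` is a hypothetical helper, and the sample payload is hand-made to mimic the structure the function above indexes into; it is not a real API response.

```python
def extract_venues(payload, name, lat, lng):
    """Turn one explore-style response dict into a list of row tuples.

    Assumes the layout used above: response -> groups[0] -> items -> venue.
    """
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Fall back to a placeholder label when a venue has no category.
        category = categories[0]["name"] if categories else "Uncategorized"
        rows.append((name, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"],
                     category))
    return rows

# Hand-made payload for illustration only.
sample_payload = {"response": {"groups": [{"items": [
    {"venue": {"name": "Example Bakery",
               "location": {"lat": 43.65, "lng": -79.36},
               "categories": [{"name": "Bakery"}]}},
    {"venue": {"name": "Unlabelled Place",
               "location": {"lat": 43.66, "lng": -79.37},
               "categories": []}},
]}]}}

rows = extract_venues(sample_payload, "Harbourfront, Regent Park",
                      43.65426, -79.360636)
print(rows)
```

Keeping the parsing separate from the HTTP request also makes it possible to test it without calling the API.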


In [15]:
toronto_venues = getNearbyVenues(names=dff['Neighbourhood'],
                                 latitudes=dff['Latitude'],
                                 longitudes=dff['Longitude'])

Harbourfront, Regent Park
Ryerson, Garden District
St. James Town
The Beaches
Berczy Park
Central Bay Street
Christie
Adelaide, King, Richmond
Dovercourt Village, Dufferin
Harbourfront East, Toronto Islands, Union Station
Little Portugal, Trinity
The Danforth West, Riverdale
Design Exchange, Toronto Dominion Centre
Brockton, Exhibition Place, Parkdale Village
The Beaches West, India Bazaar
Commerce Court, Victoria Hotel
Studio District
Lawrence Park
Roselawn
Davisville North
Forest Hill North, Forest Hill West
High Park, The Junction South
North Toronto West
The Annex, North Midtown, Yorkville
Parkdale, Roncesvalles
Davisville
Harbord, University of Toronto
Runnymede, Swansea
Moore Park, Summerhill East
Chinatown, Grange Park, Kensington Market
Deer Park, Forest Hill SE, Rathnelly, South Hill, Summerhill West
CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara
Rosedale
Stn A PO Boxes 25 The Esplanade
Cabbagetown, St. James Town
Fir

In [20]:
toronto_venues["Venue Category"].value_counts().head()

Coffee Shop           143
Café                   94
Restaurant             49
Italian Restaurant     47
Bakery                 45
Name: Venue Category, dtype: int64

In [26]:
import pandas as pd

toronto_venues.to_csv("toronto_venues.csv")

The data has now been exported and is ready for analysis.
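
Note that `to_csv` writes the row index as an unnamed first column by default, which comes back as `Unnamed: 0` when the file is read again. A small round-trip sketch, using an in-memory buffer and made-up rows instead of `toronto_venues.csv`:

```python
import io
import pandas as pd

# Made-up rows standing in for the exported venue data.
venues = pd.DataFrame({"Venue": ["Example Café", "Example Bakery"],
                       "Venue Category": ["Café", "Bakery"]})

buffer = io.StringIO()
venues.to_csv(buffer)          # the index is written as the first, unnamed column
buffer.seek(0)

naive = pd.read_csv(buffer)    # the index comes back as a column named "Unnamed: 0"
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)  # treat the first column as the index again

print(list(naive.columns))
print(list(restored.columns))
```

Passing `index=False` to `to_csv` is another option when the index carries no information.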

***