# Capstone Project
### Applied Data Science Capstone by IBM/Coursera

## Introduction: Business Problem

The goal is to choose a location for a new coffee shop in the city of Toronto by analyzing how existing coffee shops are distributed across its neighborhoods.

A group of coffee shop owners wants to open a new shop in Toronto but has not decided where. We use folium map visualizations to locate the existing coffee shops across Toronto's neighborhoods and look for the areas with the fewest of them. Choosing such an area should mean less competition from similar businesses, which should help the venture grow in that locality.

## Data

The data sources and processing steps are:

1. The list of boroughs and their associated neighborhoods from https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M. The table is downloaded and saved as 'table.csv'.
2. The latitude and longitude of each neighborhood from "https://cocl.us/Geospatial_data".
3. The two datasets above are merged.
4. The boroughs belonging to Toronto, i.e. East Toronto, West Toronto, Downtown Toronto and Central Toronto, and their associated neighborhoods are selected.
5. Using the Foursquare API, venues near each neighborhood are retrieved, and all coffee shops are selected from them.
6. All the coffee shops in the neighborhoods are then plotted on the map. The locations with the lowest density are the ideal places to start a new coffee shop.

## The code for the above project starts from here

In [3]:
# Import pandas 
import pandas as pd 
  
# Read the Canadian postal codes CSV extracted from https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
# It has already been saved to the file "table.csv".

postcanada = pd.read_csv("table.csv") 
postcanada

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
175,M5Z,Not assigned,
176,M6Z,Not assigned,
177,M7Z,Not assigned,
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."


In [4]:
# Boroughs with unassigned values are dropped.
postcanada_clean = postcanada.dropna()
postcanada_clean

Unnamed: 0,Postal Code,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
160,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
165,M4Y,Downtown Toronto,Church and Wellesley
168,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
169,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."
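`dropna()` works here because unassigned postal codes have an empty Neighborhood cell. A more explicit alternative is to filter on the Borough label itself. The snippet below is a minimal sketch of that idea; the small frame is made-up sample data in the shape of `postcanada`, not the real table.

```python
import pandas as pd

# Hypothetical sample rows shaped like `postcanada` (not the real table).
sample = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M5A"],
    "Borough": ["Not assigned", "North York", "Downtown Toronto"],
    "Neighborhood": [None, "Parkwoods", "Regent Park, Harbourfront"],
})

# Keep only the rows whose borough is actually assigned.
assigned = sample[sample["Borough"] != "Not assigned"]
print(assigned)
```

This version would still keep a row that has an assigned borough but a missing neighborhood name, which `dropna()` would discard.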


In [5]:
#Resetting index
p = postcanada_clean.reset_index()
p.drop(["index"], axis = 1, inplace = True)
p

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."
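The reset-then-drop pattern above can usually be written as a single call. As a small sketch on a made-up frame (the data is illustrative only), `reset_index(drop=True)` renumbers the rows without adding the old index as a column:

```python
import pandas as pd

# Hypothetical frame with a gappy index, as left behind by dropna().
df = pd.DataFrame({"Postal Code": ["M3A", "M4A"]}, index=[2, 3])

# drop=True discards the old index instead of inserting it as an "index" column.
renumbered = df.reset_index(drop=True)
print(renumbered.index.tolist())
```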


In [6]:
#shape of dataset
p.shape

(103, 3)

We can see that there are 103 neighborhoods across different boroughs.

Next, we append the latitude and longitude of each neighborhood from "https://cocl.us/Geospatial_data".

In [7]:
latlong = pd.read_csv("https://cocl.us/Geospatial_data") 
latlong

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
...,...,...,...
98,M9N,43.706876,-79.518188
99,M9P,43.696319,-79.532242
100,M9R,43.688905,-79.554724
101,M9V,43.739416,-79.588437


In [8]:
p.sort_values(by=["Postal Code"], inplace = True)

In [9]:
# Neighborhoods are sorted by Postal code 
p = p.reset_index()
p.drop(["index"], axis = 1, inplace = True)
p

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
...,...,...,...
98,M9N,York,Weston
99,M9P,Etobicoke,Westmount
100,M9R,Etobicoke,"Kingsview Village, St. Phillips, Martin Grove ..."
101,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest..."


In [10]:
# The postal code is dropped so that latitude and longitude can be appended column-wise.
# This relies on both frames being in the same postal-code order.
latlong.drop(["Postal Code"], axis = 1, inplace = True)
latlong

Unnamed: 0,Latitude,Longitude
0,43.806686,-79.194353
1,43.784535,-79.160497
2,43.763573,-79.188711
3,43.770992,-79.216917
4,43.773136,-79.239476
...,...,...
98,43.706876,-79.518188
99,43.696319,-79.532242
100,43.688905,-79.554724
101,43.739416,-79.588437


In [11]:
result = pd.concat([p, latlong], axis=1, join='inner')
result

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
...,...,...,...,...,...
98,M9N,York,Weston,43.706876,-79.518188
99,M9P,Etobicoke,Westmount,43.696319,-79.532242
100,M9R,Etobicoke,"Kingsview Village, St. Phillips, Martin Grove ...",43.688905,-79.554724
101,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest...",43.739416,-79.588437
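Concatenating column-wise pairs rows by position, so it only gives correct coordinates when both frames are in exactly the same order. A key-based join avoids that dependency. The sketch below uses tiny made-up stand-ins for `p` and for `latlong` before its "Postal Code" column is dropped; the frames are deliberately in different orders.

```python
import pandas as pd

# Hypothetical miniature versions of `p` and the original `latlong`.
boroughs = pd.DataFrame({
    "Postal Code": ["M1B", "M1C"],
    "Borough": ["Scarborough", "Scarborough"],
    "Neighborhood": ["Malvern, Rouge", "Rouge Hill, Port Union, Highland Creek"],
})
coords = pd.DataFrame({
    "Postal Code": ["M1C", "M1B"],
    "Latitude": [43.784535, 43.806686],
    "Longitude": [-79.160497, -79.194353],
})

# Join on the shared key; validate raises if a postal code is duplicated on either side.
merged = boroughs.merge(coords, on="Postal Code", how="inner", validate="one_to_one")
print(merged)
```

With a key-based merge, the sort and the postal-code drop are no longer needed for correctness.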


In [12]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(result['Borough'].unique()),
        result.shape[0]
    )
)

The dataframe has 10 boroughs and 103 neighborhoods.


In [13]:
#List all boroughs in the table
result.Borough.unique()

array(['Scarborough', 'North York', 'East York', 'East Toronto',
       'Central Toronto', 'Downtown Toronto', 'York', 'West Toronto',
       'Mississauga', 'Etobicoke'], dtype=object)

In [14]:
# We are interested in setting up the shop in the city of Toronto, so select only the boroughs with "Toronto" in their name.

toronto_data1 = result[result['Borough'] == 'East Toronto']
toronto_data2= result[result['Borough'] == 'Central Toronto']
toronto_data3 = result[result['Borough'] == 'Downtown Toronto']   
toronto_data4 = result[result['Borough'] == 'West Toronto']

toronto_data = pd.concat([toronto_data1, toronto_data2, toronto_data3, toronto_data4], \
                         ignore_index=True, sort=False)

toronto_data

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M4E,East Toronto,The Beaches,43.676357,-79.293031
1,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188
2,M4L,East Toronto,"India Bazaar, The Beaches West",43.668999,-79.315572
3,M4M,East Toronto,Studio District,43.659526,-79.340923
4,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",43.662744,-79.321558
5,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879
6,M4P,Central Toronto,Davisville North,43.712751,-79.390197
7,M4R,Central Toronto,"North Toronto West, Lawrence Park",43.715383,-79.405678
8,M4S,Central Toronto,Davisville,43.704324,-79.38879
9,M4T,Central Toronto,"Moore Park, Summerhill East",43.689574,-79.38316


There are 39 neighborhoods in the city of Toronto.
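The four per-borough filters followed by a concat can also be expressed as one boolean mask. This is a sketch on made-up rows shaped like `result`, not the real data:

```python
import pandas as pd

# Hypothetical sample shaped like `result` (not the real data).
sample = pd.DataFrame({
    "Postal Code": ["M1B", "M4E", "M4N", "M9V"],
    "Borough": ["Scarborough", "East Toronto", "Central Toronto", "Etobicoke"],
})

toronto_boroughs = ["East Toronto", "Central Toronto", "Downtown Toronto", "West Toronto"]

# One mask selects every row whose borough is in the list.
toronto_sample = sample[sample["Borough"].isin(toronto_boroughs)].reset_index(drop=True)
print(toronto_sample)
```

One difference to be aware of: `isin` keeps the rows in their original (postal code) order, whereas the concat approach groups them borough by borough.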

Import necessary libraries for plotting

In [15]:
import json # library to handle JSON files

!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (top-level import, pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans


import folium # map rendering library

print('Libraries imported.')


In [16]:
#Use geopy library to get the latitude and longitude values of Toronto.
address = 'Toronto, Canada'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


Locate venues near the neighborhoods of the above four boroughs.

In [17]:
import numpy as np # library to handle data in a vectorized manner

In [66]:
#save the required foursquare credentials
CLIENT_ID = '1F0G5M51KKVCY4BOV1XVKVXYGWC1WJICRCSBCYQM4GLVITXB' # your Foursquare ID
CLIENT_SECRET = '2SHP52TXURGLZ22FCEKQKUFH0OG2HOJMDGYDKKFVTOXJX0MW' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: 1F0G5M51KKVCY4BOV1XVKVXYGWC1WJICRCSBCYQM4GLVITXB
CLIENT_SECRET:2SHP52TXURGLZ22FCEKQKUFH0OG2HOJMDGYDKKFVTOXJX0MW
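Hard-coding and printing API credentials makes them visible to anyone who can read the notebook. One common alternative is to read them from environment variables and print only a masked form. The sketch below assumes this approach; the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and both helper functions are hypothetical, not part of the Foursquare API.

```python
import os

def load_foursquare_credentials(env=os.environ):
    """Return (client_id, client_secret) read from the given mapping."""
    # FOURSQUARE_CLIENT_ID / FOURSQUARE_CLIENT_SECRET are hypothetical variable names.
    try:
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError(f"Set the environment variable {missing} first") from None

def mask(secret, visible=4):
    """Show only the first few characters so the value can be checked but not copied."""
    return secret[:visible] + "*" * max(len(secret) - visible, 0)

# Demonstration with a plain dict standing in for the real environment.
client_id, client_secret = load_foursquare_credentials(
    {"FOURSQUARE_CLIENT_ID": "demo-client-id", "FOURSQUARE_CLIENT_SECRET": "demo-secret"}
)
print("CLIENT_ID:", mask(client_id))
```

In the notebook itself, calling `load_foursquare_credentials()` with no argument would read from the real environment.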


In [72]:
# Use this function to get the nearby venues of Toronto neighborhoods within 3000m
def getNearbyVenues(names, latitudes, longitudes, radius=3000):
    LIMIT = 300   
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
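The list comprehension in the function above indexes `categories[0]` directly, which would raise an `IndexError` for any venue returned without a category. A more defensive way to flatten the items might look like the sketch below. The helper name `venue_rows` and the sample items are hypothetical; the items only mimic the nested shape that the function above reads.

```python
def venue_rows(name, lat, lng, items):
    """Flatten explore-style items into tuples for one neighborhood."""
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Fall back to None instead of raising IndexError on an empty category list.
        category = categories[0]["name"] if categories else None
        rows.append((name, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"], category))
    return rows

# Hypothetical items in the shape the function above indexes into.
items = [
    {"venue": {"name": "Example Cafe", "location": {"lat": 43.67, "lng": -79.29},
               "categories": [{"name": "Coffee Shop"}]}},
    {"venue": {"name": "Uncategorised Spot", "location": {"lat": 43.68, "lng": -79.30},
               "categories": []}},
]
rows = venue_rows("The Beaches", 43.676357, -79.293031, items)
print(rows)
```

Keeping the parsing separate from the HTTP request also makes it possible to check the flattening logic without calling the API.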

In [73]:

toronto_venues = getNearbyVenues(names=toronto_data['Neighborhood'],
                                   latitudes=toronto_data['Latitude'],
                                   longitudes=toronto_data['Longitude']
                                  )

The Beaches
The Danforth West, Riverdale
India Bazaar, The Beaches West
Studio District
Business reply mail Processing Centre, South Central Letter Processing Plant Toronto
Lawrence Park
Davisville North
North Toronto West, Lawrence Park
Davisville
Moore Park, Summerhill East
Summerhill West, Rathnelly, South Hill, Forest Hill SE, Deer Park
Roselawn
Forest Hill North & West, Forest Hill Road Park
The Annex, North Midtown, Yorkville
Rosedale
St. James Town, Cabbagetown
Church and Wellesley
Regent Park, Harbourfront
Garden District, Ryerson
St. James Town
Berczy Park
Central Bay Street
Richmond, Adelaide, King
Harbourfront East, Union Station, Toronto Islands
Toronto Dominion Centre, Design Exchange
Commerce Court, Victoria Hotel
University of Toronto, Harbord
Kensington Market, Chinatown, Grange Park
CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport
Stn A PO Boxes
First Canadian Place, Underground city
Christie
Queen's Park, Ontar

In [74]:
print(toronto_venues.shape)
toronto_venues.head()

(3900, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,The Beaches,43.676357,-79.293031,Glen Manor Ravine,43.676821,-79.293942,Trail
1,The Beaches,43.676357,-79.293031,The Fox Theatre,43.672801,-79.287272,Indie Movie Theater
2,The Beaches,43.676357,-79.293031,The Beech Tree,43.680493,-79.288846,Gastropub
3,The Beaches,43.676357,-79.293031,Tori's Bakeshop,43.672114,-79.290331,Vegetarian / Vegan Restaurant
4,The Beaches,43.676357,-79.293031,Beaches Bake Shop,43.680363,-79.289692,Bakery


A total of 3900 venues are returned along with their coordinates. Next, we select the coffee shops among them.

In [75]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 199 unique categories.


In [76]:
# Find the 10 most common venue categories across all neighborhoods
t = toronto_venues.groupby('Venue Category').count()
t['Neighborhood'].nlargest(10)

Venue Category
Coffee Shop            307
Park                   250
Café                   208
Italian Restaurant     133
Bakery                 123
Restaurant              94
Brewery                 66
Japanese Restaurant     60
Pizza Place             60
Sandwich Place          60
Name: Neighborhood, dtype: int64

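We can see that 307 of the returned venues are coffee shops across all the neighborhoods. Because the 3000 m search areas of nearby neighborhoods overlap, the same shop can appear under more than one neighborhood, so this figure counts rows rather than distinct shops.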
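Since each neighborhood is searched with a 3000 m radius, a shop that lies within range of two neighborhood centres is returned once for each. To estimate the number of distinct shops, the rows can be de-duplicated on venue name and coordinates. This is a sketch on made-up rows shaped like `toronto_venues`; treating identical name plus coordinates as one shop is an assumption.

```python
import pandas as pd

# Hypothetical rows: the same shop returned for two neighborhoods.
sample = pd.DataFrame({
    "Neighborhood": ["The Beaches", "Studio District", "The Beaches"],
    "Venue": ["Example Coffee", "Example Coffee", "Other Coffee"],
    "Venue Latitude": [43.6700, 43.6700, 43.6725],
    "Venue Longitude": [-79.3000, -79.3000, -79.2870],
    "Venue Category": ["Coffee Shop", "Coffee Shop", "Coffee Shop"],
})

coffee = sample[sample["Venue Category"] == "Coffee Shop"]

# Treat rows with the same name and coordinates as one physical shop.
unique_coffee = coffee.drop_duplicates(subset=["Venue", "Venue Latitude", "Venue Longitude"])
print(len(coffee), "rows,", len(unique_coffee), "distinct shops")
```

Plotting the de-duplicated frame would also avoid drawing several markers on top of each other for the same shop.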

In [77]:
#List all the coffee shops
coffee_shop = toronto_venues[toronto_venues['Venue Category'] == 'Coffee Shop']
coffee_shop

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
10,The Beaches,43.676357,-79.293031,The Remarkable Bean,43.672801,-79.287038,Coffee Shop
14,The Beaches,43.676357,-79.293031,Buds Coffee Bar,43.669375,-79.303218,Coffee Shop
37,The Beaches,43.676357,-79.293031,Press Books Coffee Vinyl,43.687672,-79.304457,Coffee Shop
70,The Beaches,43.676357,-79.293031,Pomarosa Coffee & Kitchen,43.683201,-79.325849,Coffee Shop
72,The Beaches,43.676357,-79.293031,Starbucks,43.668539,-79.307821,Coffee Shop
...,...,...,...,...,...,...,...
3807,"Runnymede, Swansea",43.651571,-79.484450,Wibke's Espresso Bar,43.649132,-79.484802,Coffee Shop
3821,"Runnymede, Swansea",43.651571,-79.484450,Starbucks,43.651395,-79.475990,Coffee Shop
3859,"Runnymede, Swansea",43.651571,-79.484450,Outpost Coffee Roasters,43.656059,-79.454106,Coffee Shop
3877,"Runnymede, Swansea",43.651571,-79.484450,Starbucks,43.627931,-79.489286,Coffee Shop


In [81]:
# create map of Coffee Shops in all the Toronto neighborhoods using latitude and longitude values
map_coffee_shop = folium.Map(location=[latitude, longitude], zoom_start=11)
# add markers to map
for lat, lng, venue, neighborhood in zip(coffee_shop['Venue Latitude'], coffee_shop['Venue Longitude'], \
                                           coffee_shop['Venue'], coffee_shop['Neighborhood']):
    label = '{}, {}'.format(neighborhood, venue)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#2186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_coffee_shop)  
    
map_coffee_shop

From the map, we can identify the locations where the density of coffee shops is low or zero. These are some
of the ideal places to start a new coffee shop. Other factors that determine the optimum
location include proximity to places such as parks, universities, businesses, and stadiums.
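The visual reading of the map can be complemented with a simple count of coffee shops found around each neighborhood, sorted so that the least served neighborhoods come first. The sketch below uses made-up data; in the notebook, the list of neighborhoods would come from `toronto_data` and the rows from `coffee_shop`.

```python
import pandas as pd

# Hypothetical data: all neighborhoods searched, and the coffee-shop rows found.
neighborhoods = ["The Beaches", "Rosedale", "Christie"]
coffee = pd.DataFrame({
    "Neighborhood": ["The Beaches", "The Beaches", "Christie"],
    "Venue": ["Shop A", "Shop B", "Shop C"],
})

# Count rows per neighborhood, keep neighborhoods with zero shops, sort ascending.
counts = (
    coffee.groupby("Neighborhood").size()
    .reindex(neighborhoods, fill_value=0)
    .sort_values()
)
print(counts)
```

The `reindex` step matters: without it, neighborhoods with no coffee shops at all, which are the most interesting candidates, would be missing from the result.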