# Applied Data Science Capstone Project

## 1. Introduction

**The Problem: *Where should I open a new coffee shop in London?***

If I am interested in opening a new coffee shop in London - a bustling city with plenty of independent cafes and restaurants scattered over many neighbourhoods - where would the best location be? In order to find the perfect spot, we should probably consider the following as markers of a "good" location:

1. Few existing coffee shops (low competition)
2. Near students (high demand for caffeine!)
3. Near retail, museums or other attractions (passing trade)

In this project, I will attempt to find the best possible neighbourhoods by exploring the central London area and using clustering of postal codes based on the above criteria to select a group of candidate locations.

## 2. Data

The data used for this project comes from Wikipedia (locations) and Foursquare (information on surrounding venues).  

Let's start by importing some useful packages:

In [1]:
# libraries
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import numpy as np
from sklearn.cluster import KMeans
from geopy.geocoders import Nominatim
import geocoder
import folium
import requests

import matplotlib.cm as cm
import matplotlib.pyplot as plt
import matplotlib.colors as colors

%matplotlib inline

# my foursquare credentials (saved as .py file for privacy)
import foursquare_id as login

### 2.1. London neighbourhoods

First, I need to get the London postal codes I'm interested in as well as their longitude and latitude. I'm going to focus on unique postal codes that have "London" as the postal town.

#### *Postal codes*

I'll use pandas to import the London areas from this page: https://en.wikipedia.org/wiki/List_of_areas_of_London

In [2]:
locations = pd.read_html("https://en.wikipedia.org/wiki/List_of_areas_of_London")[1]
locations.head()

| | Location | London borough | Post town | Postcode district | Dial code | OS grid ref |
|---|---|---|---|---|---|---|
| 0 | Abbey Wood | Bexley, Greenwich [7] | LONDON | SE2 | 20 | TQ465785 |
| 1 | Acton | Ealing, Hammersmith and Fulham[8] | LONDON | W3, W4 | 20 | TQ205805 |
| 2 | Addington | Croydon[8] | CROYDON | CR0 | 20 | TQ375645 |
| 3 | Addiscombe | Croydon[8] | CROYDON | CR0 | 20 | TQ345665 |
| 4 | Albany Park | Bexley | BEXLEY, SIDCUP | DA5, DA14 | 20 | TQ478728 |


In [3]:
print('Size of London locations dataframe:')
locations.shape

Size of London locations dataframe:


(533, 6)

Next, I need to clean up the location data.  
There are a lot of locations in the above dataframe, so let's restrict them to those with a postal town of "London". Here are the top 5 rows of the resulting, cleaned dataframe:  

In [4]:
locations = locations[locations['Post town'].str.contains('LONDON')].reset_index(drop=True)
locations.head()

| | Location | London borough | Post town | Postcode district | Dial code | OS grid ref |
|---|---|---|---|---|---|---|
| 0 | Abbey Wood | Bexley, Greenwich [7] | LONDON | SE2 | 20 | TQ465785 |
| 1 | Acton | Ealing, Hammersmith and Fulham[8] | LONDON | W3, W4 | 20 | TQ205805 |
| 2 | Aldgate | City[10] | LONDON | EC3 | 20 | TQ334813 |
| 3 | Aldwych | Westminster[10] | LONDON | WC2 | 20 | TQ307810 |
| 4 | Anerley | Bromley[11] | LONDON | SE20 | 20 | TQ345695 |


In [5]:
print('Size of London locations dataframe restricted to London postal town:')
locations.shape

Size of London locations dataframe restricted to London postal town:


(310, 6)

As you can see from the above dataframe, some of the London neighbourhoods cover more than one postal code, and some postal codes cover more than one neighbourhood.  
So I'll split up the postal codes so that each one is a separate entry, and count the total number of unique postcodes in my dataset:

In [6]:
postcodes = locations['Postcode\xa0district'].str.split(', ', expand=True).stack().reset_index(drop=True).unique().tolist()
print('** Using',len(postcodes),'unique London postal codes **')

** Using 133 unique London postal codes **
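
To see what the split step does, here is a minimal sketch on a toy dataframe. The column name and the three rows are made up for illustration, and `Series.explode` is used as a stand-in for the `expand=True` / `stack()` chain above, so this is not the notebook's exact code:

```python
import pandas as pd

# toy data (hypothetical rows): one neighbourhood has two postcode
# districts, and W4 is shared by two neighbourhoods
toy = pd.DataFrame({
    "Location": ["Abbey Wood", "Acton", "Chiswick"],
    "Postcode district": ["SE2", "W3, W4", "W4"],
})

# split "W3, W4" into a list, give each district its own row,
# then keep each district once, in order of first appearance
unique_postcodes = (
    toy["Postcode district"]
    .str.split(", ")
    .explode()
    .unique()
    .tolist()
)

print(unique_postcodes)  # ['SE2', 'W3', 'W4']
```

Four district entries across three neighbourhoods collapse to three unique postcodes, which is the same kind of reduction that takes the 310 London locations down to 133 postal codes.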


#### *Longitude and Latitude*

Now that I have my postal codes, I'll fetch the longitude and latitude of each one.  
To do this, I'll use the geocoder package. Here's a function to get coordinates from a postcode:

In [7]:
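def get_latlng(loc):
    """Return [latitude, longitude, loc] for a London postcode district."""
    lat_lng_coords = None

    # keep retrying until the geocoder returns a result
    # (g.latlng is None when the lookup fails)
    while lat_lng_coords is None:
        g = geocoder.arcgis('{}, London, United Kingdom'.format(loc))
        lat_lng_coords = g.latlng

    # add the location to the output only after a successful lookup,
    # so .append is never called on None
    return lat_lng_coords + [loc]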
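
The loop in `get_latlng` retries for as long as the geocoder returns nothing, which could in principle run forever for a postcode the service cannot resolve. Below is a minimal sketch of a bounded variant. The names `get_latlng_bounded`, `max_tries` and `pause` are hypothetical, and the geocoder is passed in as a plain function so that the example runs without a network connection; in the notebook that role would be played by `geocoder.arcgis(...).latlng`.

```python
import time

def get_latlng_bounded(loc, geocode, max_tries=3, pause=0.0):
    """Return [lat, lng, loc], or None if every attempt comes back empty.

    `geocode` is any callable that takes a query string and returns
    [lat, lng] or None (an assumption made for this sketch).
    """
    query = '{}, London, United Kingdom'.format(loc)
    for _ in range(max_tries):
        latlng = geocode(query)
        if latlng is not None:
            return [latlng[0], latlng[1], loc]
        time.sleep(pause)  # brief wait before the next attempt
    return None

# stand-in geocoder that fails on the first call and succeeds on the second
calls = {"n": 0}

def flaky_geocode(query):
    calls["n"] += 1
    return None if calls["n"] < 2 else [51.51324, -0.26746]

result = get_latlng_bounded("W3", flaky_geocode)
print(result)
```

Returning `None` after a fixed number of attempts means a single unresolvable postcode shows up as a missing row that can be filtered out, rather than stalling the whole list comprehension.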

I'll apply this function to all of my London postcodes, ending up with a dataframe in which each row contains a postcode with its latitude and longitude values. Here are the top 5 rows of the resulting dataframe:  

In [8]:
coords = pd.DataFrame([get_latlng(i) for i in postcodes])
coords.columns = ["Latitude","Longitude","Postcode"]
coords.head()

| | Latitude | Longitude | Postcode |
|---|---|---|---|
| 0 | 51.49245 | 0.12127 | SE2 |
| 1 | 51.51324 | -0.26746 | W3 |
| 2 | 51.48944 | -0.26194 | W4 |
| 3 | 51.512 | -0.08058 | EC3 |
| 4 | 51.51651 | -0.11968 | WC2 |


Here's a map of all of my locations, generated using folium.
To centre the map, I'm using the geopy package to fetch the latitude and longitude of London:

In [9]:
address = 'London, United Kingdom'
geolocator = Nominatim(user_agent="ln_explorer")
location = geolocator.geocode(address)
lon_lat = location.latitude
lon_lng = location.longitude
print('The coordinates of London are {}, {}.'.format(lon_lat, lon_lng))

The coordinates of London are 51.5073219, -0.1276474.


In [10]:
# create map of London using latitude and longitude values
map_london = folium.Map(location=[lon_lat, lon_lng], zoom_start=11)

# add markers to map
for lat, lng, label in zip(coords['Latitude'], coords['Longitude'], coords['Postcode']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=8,
        popup=label,
        color='#4e54c8',
        fill=True,
        fill_color='#8f94fb',
        fill_opacity=0.7,
    ).add_to(map_london)
    
map_london

### 2.2. Foursquare data

Now that I have my London locations, I'll set up the Foursquare API to obtain data about venues near each of them. To do this, I need to load in my credentials and specify the relevant search terms and parameters.

Note that, for privacy, my Foursquare credentials were loaded in from the file *foursquare_id.py* as *login* at the beginning of the *Data* section.

In [11]:
# stored in file foursquare_id.py (gitignored) to avoid showing here
CLIENT_ID = login.foursquare['accessID'] # Foursquare ID
CLIENT_SECRET = login.foursquare['secretID'] # Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value
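
For reference, the cell above only requires that *foursquare_id.py* expose a dictionary named `foursquare` with `accessID` and `secretID` keys. A minimal sketch of such a file follows; the values are placeholders, and the real file is not shown here and may contain more than this:

```python
# foursquare_id.py (hypothetical contents; values are placeholders)
foursquare = {
    'accessID': 'YOUR_CLIENT_ID',      # Foursquare client ID
    'secretID': 'YOUR_CLIENT_SECRET',  # Foursquare client secret
}
```

Keeping this file out of version control lets the notebook be shared without exposing the keys.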

Finally, here are the search terms that I'll use within Foursquare to gather information about my London locations:

In [12]:
QUERY_1 = "Coffee"
QUERY_2 = "Universities"
QUERY_3 = "Shopping"
QUERY_4 = "Fun"
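
As a rough illustration of how a search term and the parameters above might be combined into a request, here is a sketch that builds a URL for one location. It assumes the legacy Foursquare v2 `venues/search` endpoint; the function name `build_search_url`, the `radius` value and the placeholder credentials are all assumptions, since the request itself is not shown in this section. The code only builds the URL and does not call the API.

```python
from urllib.parse import urlencode

def build_search_url(client_id, client_secret, version,
                     lat, lng, query, radius=500, limit=100):
    """Build a venue search URL (assumed v2 endpoint; nothing is sent)."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,                     # API version date, e.g. '20180605'
        'll': '{},{}'.format(lat, lng),   # latitude,longitude of the postcode
        'query': query,                   # search term, e.g. "Coffee"
        'radius': radius,                 # search radius in metres (assumed)
        'limit': limit,                   # maximum number of venues returned
    }
    return 'https://api.foursquare.com/v2/venues/search?' + urlencode(params)

# placeholder credentials and the W3 coordinates from the table above
url = build_search_url('YOUR_CLIENT_ID', 'YOUR_CLIENT_SECRET', '20180605',
                       51.51324, -0.26746, 'Coffee')
print(url)
```

Running this once per postcode and per search term would give one request for each (location, query) pair, whose results could then be counted to score each postcode.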