# Coursera Capstone Project - A brand new bicycle shop in Cambridge, UK

## Introduction

In a few months, I'm moving to Cambridge, UK to start a new job as a software developer. I'm currently looking for a flat, and I'll soon be looking for a bicycle to commute easily and keep healthy.  
This brought me to a simple yet stimulating idea for this final project.  

Suppose we want to open a new bicycle shop in Cambridge, UK. I've already lived there for some time, and from what I've seen, Cambridge is a very bike-friendly city, with plenty of places where people can enjoy a ride on their two-wheeled companion and discover amazing views and fascinating landscapes. As brand new entrepreneurs in the bicycle business, we want to exploit the great opportunities that this city offers in order to open a profitable bicycle shop.  

Of course, we need to know whether other bicycle shops already exist (spoiler alert: they do) and where they're located, so as to avoid tight competition with other bicycle shop owners.  
An important consideration (and the one I'll focus on in this project) is where our shop will be located in relation to the places where most people riding a bicycle can be found. Specifically, we will look for spots close to the riverside, parks, riding routes, or gyms and sports venues, in order to maximise our profit: if someone (say, a tourist) walks along the river or across a park and sees all those people riding their bikes and having fun, they will probably think something like "Oh man, I wish I had a bicycle too".  
And there we are, with our brand new shop full of shiny bicycles that anyone can either buy or rent for a day!
Likewise, these are the places where someone is most likely to have a bike that needs fixing, so we may also profit from bicycle repair and maintenance. 

___

## Data

First of all, for the sake of simplicity, I chose to focus my attention on the main city of Cambridge, namely neighborhoods whose postcodes start with CB1 up to CB5. So, we'll need all these postcodes and the related neighborhood names; luckily, I found [this great resource](https://www.doogal.co.uk/AdministrativeAreas.php?district=E07000008) which offers these data as a simple CSV file, complete with latitude and longitude coordinates of each postcode.  

We will then use Foursquare to find existing bicycle shops and remove these places from our list of candidate spots for the new shop; on the other hand, we will identify local venues such as parks, cycling routes and other places of natural interest where bikers are most commonly found, as well as gyms and sports venues, because people will probably cycle to these places to warm up before their favourite sports class.  
These places will be our best candidates for our shop. We might also use a clustering approach to further inspect our results, but I suspect it would be overkill for this rather simple task. 
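
The selection logic described above can be sketched in a few lines of plain Python. This is only one possible reading of the plan, not the final method: the category name `"Bike Shop"`, the `ATTRACTIVE` set, the `rank_candidates` helper and the toy rows are all assumptions made up for illustration.

```python
from collections import Counter

# Assumed Foursquare category name for existing competitors
BIKE_SHOP = "Bike Shop"
# A few categories assumed to attract cyclists (illustrative subset)
ATTRACTIVE = {"Park", "Gym", "River"}


def rank_candidates(venues):
    """Rank areas by number of attractive venues, skipping areas with a bike shop.

    venues is an iterable of (area, venue_name, category) tuples.
    """
    venues = list(venues)
    # Areas that already host a competitor are removed from the candidates
    taken = {area for area, _, category in venues if category == BIKE_SHOP}
    counts = Counter(area for area, _, category in venues
                     if category in ATTRACTIVE and area not in taken)
    return counts.most_common()


# Made-up rows, for illustration only
toy = [
    ("CB1", "Some Gym", "Gym"),
    ("CB1", "Some Park", "Park"),
    ("CB2", "Another Park", "Park"),
    ("CB3", "Riverside Walk", "River"),
    ("CB3", "Some Cycle Store", "Bike Shop"),
]
print(rank_candidates(toy))
```

In this toy data the `CB3` area is dropped because it already contains a bike shop, and the remaining areas are ordered by how many attractive venues they contain.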

___ 

## Methodology

In [3]:
# !pip install folium 
# !pip install geopy

In [1]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt 
import matplotlib.cm as cm 
import matplotlib.colors as colors
from geopy.geocoders import Nominatim
import folium
import requests

First of all, let's download the Cambridge postcodes data, from the above-mentioned URL. 

In [5]:
!wget -O ../data/cambridge_postcodes.csv "https://www.doogal.co.uk/AdministrativeAreasCSV.ashx?district=E07000008"

/bin/sh: wget: command not found
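
Since `wget` is not installed on every system (as the error above shows), the same download can be done from Python. The following is a minimal sketch using only the standard library; `download_file` is a hypothetical helper name, and it assumes the server accepts a plain `urllib` request.

```python
import shutil
import urllib.request
from pathlib import Path


def download_file(url, dest_path):
    """Download url to dest_path, creating the parent folder if needed."""
    dest = Path(dest_path)
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Stream the response straight to disk instead of holding it in memory
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest
```

For this notebook the call would be `download_file("https://www.doogal.co.uk/AdministrativeAreasCSV.ashx?district=E07000008", "../data/cambridge_postcodes.csv")`.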


In [2]:
cam_codes = pd.read_csv("../data/cambridge_postcodes.csv")
cam_codes.head()

Unnamed: 0,Postcode,In Use?,Latitude,Longitude,Easting,Northing,Grid Ref,Ward,Parish,Introduced,Terminated,Altitude,Country,Last Updated,Quality,LSOA Code
0,CB1 0AA,No,52.192267,0.137208,546184,257045,TL461570,Coleridge,"Cambridge, unparished area",2017-04-01,2018-05-01,11,England,2019-05-29,Within the building of the matched address clo...,E01017966
1,CB1 0AB,No,52.192267,0.137208,546184,257045,TL461570,Coleridge,"Cambridge, unparished area",2017-05-01,2017-12-01,11,England,2019-05-29,Within the building of the matched address clo...,E01017966
2,CB1 0AD,No,52.192267,0.137208,546184,257045,TL461570,Coleridge,"Cambridge, unparished area",2017-08-01,2019-04-01,11,England,2019-05-29,Within the building of the matched address clo...,E01017966
3,CB1 0AE,No,52.192267,0.137208,546184,257045,TL461570,Coleridge,"Cambridge, unparished area",2017-09-01,2018-04-01,11,England,2019-05-29,Within the building of the matched address clo...,E01017966
4,CB1 0AF,No,52.192267,0.137208,546184,257045,TL461570,Coleridge,"Cambridge, unparished area",2017-10-01,2018-04-01,11,England,2019-05-29,Within the building of the matched address clo...,E01017966


In [3]:
cam_codes.shape

(5864, 16)

As we can see, we have a lot of information here. Let's first remove all the data that we're not interested in.  
The first rows of the dataframe report some postcodes that are no longer used, so we'll drop them. 

In [4]:
cam_codes = cam_codes[cam_codes["In Use?"] == "Yes"]

We also want to focus on `CB1` to `CB5` postcodes, avoiding outskirts. Luckily our dataframe already fulfills this need, so we're fine. 

In [5]:
cam_codes[~cam_codes["Postcode"].str.startswith("CB1") & 
          ~cam_codes["Postcode"].str.startswith("CB2") & 
          ~cam_codes["Postcode"].str.startswith("CB3") & 
          ~cam_codes["Postcode"].str.startswith("CB4") & 
          ~cam_codes["Postcode"].str.startswith("CB5")]

Unnamed: 0,Postcode,In Use?,Latitude,Longitude,Easting,Northing,Grid Ref,Ward,Parish,Introduced,Terminated,Altitude,Country,Last Updated,Quality,LSOA Code
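
One caveat about prefix tests, offered as a side note: `"CB1"` is also a prefix of `"CB10"`, and `"CB2"` of `"CB21"`, so a `startswith` check cannot tell those districts apart. A stricter test matches the whole outward code (the part before the space). The sketch below uses plain `re`; `is_central_cambridge` is a hypothetical helper, and apart from `CB1 0AH` the sample postcodes are made up.

```python
import re

# Outward codes CB1..CB5 followed by a space, so that "CB10 ..." is not matched
CENTRAL_CAMBRIDGE = re.compile(r"^CB[1-5] ")


def is_central_cambridge(postcode):
    """Return True when the postcode's outward code is CB1, CB2, CB3, CB4 or CB5."""
    return bool(CENTRAL_CAMBRIDGE.match(postcode))


samples = ["CB1 0AH", "CB5 8AA", "CB10 1AB", "CB21 4XY"]
central = [p for p in samples if is_central_cambridge(p)]
print(central)
```

On the dataframe, the equivalent filter would be `cam_codes["Postcode"].str.match(r"CB[1-5] ")`.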


Now let's drop features that are useless for our purpose and keep only the data that we need, to obtain a clean and simple dataset. 

In [6]:
cam_codes = cam_codes[["Postcode", "Latitude", "Longitude"]]
cam_codes.head()

Unnamed: 0,Postcode,Latitude,Longitude
6,CB1 0AH,52.192254,0.137179
9,CB1 0AN,52.192254,0.137179
14,CB1 0AU,52.192267,0.137208
18,CB1 0AZ,52.192267,0.137208
20,CB1 0BB,52.192267,0.137208


In [7]:
cam_codes.shape

(2788, 3)

However, we still have too many postcodes to deal with! I tried to show all these points on a map, and my session froze mercilessly (by the way, [this](https://checkmypostcode.uk/cambridgeshire/cambridge#.XQZbnFXVL4a) is more or less what our map would look like... quite overwhelming if you ask me).  
So I thought I'd simply leave out the last character of each postcode, and use the mean latitude and longitude for that area. Using `CB1 0A_` as an example, its new latitude and longitude coordinates would be:  

In [8]:
cam_codes[cam_codes["Postcode"].str.startswith("CB1 0A")][["Latitude", "Longitude"]].mean()

Latitude     52.192261
Longitude     0.137193
dtype: float64
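
To get a feeling for how much positional detail the averaging discards, we can measure the distance between two coordinate pairs that appear among the `CB1 0A_` postcodes listed earlier. The sketch below uses the haversine formula with a spherical Earth of radius 6371 km, which is an approximation; `haversine_m` is a helper name introduced here.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres (spherical approximation)


def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in metres between two (latitude, longitude) points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lng2 - lng1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


# Coordinates of CB1 0AH and CB1 0AU, taken from the table above
spread = haversine_m(52.192254, 0.137179, 52.192267, 0.137208)
print(round(spread, 1), "m")
```

For this group the two points are only a few metres apart, so replacing them with their mean loses very little.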

This seems to be a good compromise to reduce our data without losing too much information, so let's code something more programmatic to achieve this. 

In [9]:
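# Drop the last character of each postcode; assign() returns a new dataframe instead of writing into a slice
cam_codes = cam_codes.assign(new_postcode=cam_codes["Postcode"].str[:-1])
# Average only the coordinate columns (the full postcode is text and cannot be averaged)
cam_df = cam_codes.groupby("new_postcode")[["Latitude", "Longitude"]].mean()
cam_df.reset_index(inplace=True)
cam_df.rename({"new_postcode": "Postcode"}, axis=1, inplace=True)
cam_df.head()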

Unnamed: 0,Postcode,Latitude,Longitude
0,CB1 0A,52.192261,0.137193
1,CB1 0B,52.192265,0.137203
2,CB1 0D,52.192267,0.137208
3,CB1 0E,52.192267,0.137208
4,CB1 0F,52.192266,0.137205


In [10]:
cam_df.shape

(294, 3)

Now we should be ready to go on. 

Let's create a basic map to visually check where we are (sometimes the `Nominatim` geocoder doesn't work well; in this case, Cambridge latitude and longitude coordinates can be found [on this page](https://postal-code.co.uk/postcode/Cambridge)). 

In [11]:
address = "Cambridge, UK"
geoloc = Nominatim(user_agent="cambridge_explorer")
loc = geoloc.geocode(address)
cam_lat = loc.latitude
cam_lng = loc.longitude
# cam_lat = 52.2053370
# cam_lng = 0.1218170
print(cam_lat, cam_lng)

52.2034823 0.1235817
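
Because the geocoder occasionally fails, the lookup can be wrapped so that it falls back to known coordinates. This is a minimal sketch, not part of geopy: `geocode_with_fallback` is a hypothetical helper, and it assumes that catching any exception is acceptable here (geopy's own errors derive from `geopy.exc.GeopyError`, which could be caught instead).

```python
def geocode_with_fallback(geocode, address, fallback):
    """Return (latitude, longitude) from geocode(address), or fallback if it fails.

    geocode is any callable behaving like Nominatim.geocode: it returns an
    object with .latitude and .longitude, or None when nothing is found.
    """
    try:
        loc = geocode(address)
    except Exception:  # network problems, timeouts, service errors
        loc = None
    if loc is None:
        return fallback
    return loc.latitude, loc.longitude
```

In this notebook the call would be `cam_lat, cam_lng = geocode_with_fallback(geoloc.geocode, address, (52.2053370, 0.1218170))`, reusing the coordinates that are commented out in the cell above.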


In [32]:
cam_map = folium.Map(location=[cam_lat, cam_lng], zoom_start=13)
for lat, lng, postcode in zip(cam_df["Latitude"], cam_df["Longitude"], cam_df["Postcode"]):
    label = "{}_".format(postcode)
    label = folium.Popup(label, parse_html=True, min_width=50, max_width=200)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color="#ff7f00",
        fill=True,
        fill_color='#fdbf6f',
        fill_opacity=0.7,
        parse_html=False).add_to(cam_map)

cam_map

We can now proceed to fetch information about the most common venues in each postcode, using Foursquare's API. 

In [17]:
CLIENT_ID = "YOUR_FOURSQUARE_CLIENT_ID"  # replace with your own Foursquare credentials
CLIENT_SECRET = "YOUR_FOURSQUARE_CLIENT_SECRET"
VERSION = "20180605" 
LIMIT = 20
RADIUS = 500

In [19]:
def getNearbyVenues(names, latitudes, longitudes):
    """Collect the venues Foursquare recommends around each (latitude, longitude) point."""
    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        # Explore endpoint: up to LIMIT venues within RADIUS metres of the point
        url = "https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}".format(
            CLIENT_ID, CLIENT_SECRET, VERSION, lat, lng, RADIUS, LIMIT)
        results = requests.get(url).json()["response"]["groups"][0]["items"]

        # One tuple per venue: the postcode area it was found from, then the venue's own details
        venues_list.append([(name, lat, lng,
            v["venue"]["name"],
            v["venue"]["location"]["lat"],
            v["venue"]["location"]["lng"],
            v["venue"]["categories"][0]["name"]) for v in results])

    # Flatten the per-postcode lists into a single dataframe
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ["Postcode", "PostcodeLatitude", "PostcodeLongitude",
                             "Venue", "VenueLatitude", "VenueLongitude", "VenueCategory"]

    return nearby_venues
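
The function above indexes straight into the JSON response, so a response without groups or a venue without categories would raise an exception partway through the loop. A more defensive way to flatten one response is sketched below; `extract_venue_rows` is a hypothetical helper, the payload is hand-written to mimic the structure accessed above, and whether such incomplete responses actually occur is an assumption.

```python
def extract_venue_rows(name, lat, lng, payload):
    """Flatten one explore response into rows matching the cam_venues columns."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Use None rather than raising IndexError on an uncategorised venue
        category = categories[0]["name"] if categories else None
        rows.append((name, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"], category))
    return rows


# Hand-written payload; the second venue is made up and has no category
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "PureGym",
               "location": {"lat": 52.19016, "lng": 0.136967},
               "categories": [{"name": "Gym / Fitness Center"}]}},
    {"venue": {"name": "Uncategorised venue",
               "location": {"lat": 52.19, "lng": 0.137},
               "categories": []}},
]}]}}
rows = extract_venue_rows("CB1 0A", 52.192261, 0.137193, sample)
print(rows)
```

Rows with a `None` category are simply ignored by a later `isin` filter on `VenueCategory`, so they do no harm downstream.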

In [20]:
cam_venues = getNearbyVenues(cam_df["Postcode"], cam_df["Latitude"], cam_df["Longitude"])
cam_venues.head()

Unnamed: 0,Postcode,PostcodeLatitude,PostcodeLongitude,Venue,VenueLatitude,VenueLongitude,VenueCategory
0,CB1 0A,52.192261,0.137193,Caffè Nero,52.194526,0.136673,Coffee Shop
1,CB1 0A,52.192261,0.137193,Five Guys,52.190194,0.137075,Burger Joint
2,CB1 0A,52.192261,0.137193,Nando's,52.190552,0.136846,Portuguese Restaurant
3,CB1 0A,52.192261,0.137193,PureGym,52.19016,0.136967,Gym / Fitness Center
4,CB1 0A,52.192261,0.137193,Ibis Hotel,52.19483,0.13726,Hotel


In [21]:
cam_venues.shape

(3379, 7)

This last command took quite a while to run, so we'll save this dataset for future use. 

In [22]:
cam_venues.to_csv("../data/cambridge_venues.csv", index=False)

Let's have a look at the categories of the venues found. 

In [24]:
cam_venues["VenueCategory"].unique()

array(['Coffee Shop', 'Burger Joint', 'Portuguese Restaurant',
       'Gym / Fitness Center', 'Hotel', 'Buffet', 'Café', 'Grocery Store',
       'French Restaurant', 'Pub', 'Bar', 'Indian Restaurant',
       'Sandwich Place', 'BBQ Joint', 'Restaurant', 'Multiplex',
       'Performing Arts Venue', 'Bookstore', 'Theater',
       'English Restaurant', 'Market', 'Lounge', 'American Restaurant',
       'Record Shop', 'Science Museum', 'Thai Restaurant', 'Gym',
       'Clothing Store', 'Furniture / Home Store', 'Pharmacy', 'Gym Pool',
       'Supermarket', 'Electronics Store', 'Sporting Goods Shop',
       'Rental Car Location', "Women's Store", 'Dumpling Restaurant',
       'Noodle House', 'Chinese Restaurant', 'Park', 'Gastropub',
       'Bakery', 'Indie Movie Theater', 'Department Store',
       'Breakfast Spot', 'Steakhouse', 'Eastern European Restaurant',
       'Shopping Mall', 'African Restaurant', 'Salad Place',
       'Korean Restaurant', 'Brewery', 'Hookah Bar', 'Pool',
       'Del

The venue categories that might be interesting for our bicycle shop are `Gym / Fitness Center`, `Gym`, `Gym Pool`, `Park`, `Pool`, `Playground`, `Soccer Field`, `Campground`, `Tennis Court`, `Canal`, `Soccer Stadium`, `Golf Course`, `Lake`, `Hockey Field`, `Cricket Ground`, `Rugby Stadium`, `Harbor / Marina`, `River`, `Golf Driving Range`, `Athletics & Sports`, `Field`, `Sports Club`. 

In [25]:
cam_venues = cam_venues[cam_venues["VenueCategory"].isin(["Gym / Fitness Center", "Gym", "Gym Pool", "Park", 
                                                          "Pool", "Playground", "Soccer Field", "Campground", 
                                                          "Tennis Court", "Canal", "Soccer Stadium", "Golf Course", 
                                                          "Lake", "Hockey Field", "Cricket Ground", "Rugby Stadium", 
                                                          "Harbor / Marina", "River", "Golf Driving Range", 
                                                          "Athletics & Sports", "Field", "Sports Club"])]
cam_venues.shape

(234, 7)

We have now greatly reduced our data, so we can create a map showing our venues of interest. 

In [30]:
cam_map = folium.Map(location=[cam_lat, cam_lng], zoom_start=13)
for lat, lng, venue, categ in zip(cam_venues["VenueLatitude"], cam_venues["VenueLongitude"], 
                                  cam_venues["Venue"], cam_venues["VenueCategory"]):
    label = "{} ({})".format(venue, categ)
    label = folium.Popup(label, parse_html=True, min_width=150, max_width=300)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color="#ff7f00",
        fill=True,
        fill_color='#fdbf6f',
        fill_opacity=0.7,
        parse_html=False).add_to(cam_map)

cam_map

___

## Results 

___ 

## Discussion

___

## Conclusion