# Introduction

This project was in fulfilment of the [Applied Data Science Capstone](https://www.coursera.org/learn/applied-data-science-capstone) course offered by IBM and hosted on Coursera.
The task was to think of an idea that leveraged [Foursquare](https://foursquare.com) location data to explore or
compare neighborhoods or cities.

In this notebook, we consider how someone can analyse the existing venues in a seaside town in the UK in order to decide on the best place in which to open a new Fish & Chips shop.

# Business Problem

There are **10,500** Fish & Chips shops in the UK, with an annual spend of **£1.2 billion** on fish and chips ([source](https://www.nfff.co.uk/pages/fish-and-chips)).
[Bournemouth](https://www.bournemouth.co.uk) was ranked as the number one most popular seaside resort of 2019
in the UK ([source](https://www.independent.co.uk/travel/news-and-advice/uk-seaside-towns-beach-best-heatwave-summer-staycation-british-a8978111.html)).
We want to open a new Fish & Chips shop in **Bournemouth** to capitalise on its popularity with a food shop that is likely to be popular with the locals.

There are some considerations that we need to make when choosing where to open our new shop.
We want it to be in an area of the town where people would want to eat this kind of food: one that has lots of activities, such as drinking, shopping, and entertainment venues.
Preferably, such an area would have few existing food venues, so that competition is as low as possible.

# Data

To tackle this problem, we need to understand what venues are already in Bournemouth, so that we can analyse them and decide on the best area in which to open our own shop.
We will use the [venue explore API](https://developer.foursquare.com/docs/api/venues/explore) by Foursquare to gain insight into the existing venues in Bournemouth.
We will then categorize these venues into six high-level groups: `Drink`, `Entertainment`, `Food`, `Hotel`, `Shopping`, and `Transport`.

We will use these groups to understand how venues are dispersed by business type.
Then, we will use [DBSCAN](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html), an unsupervised data clustering algorithm, to cluster venues based on their relative distance from each other.
This information will help us to determine where to open up our own Fish & Chips shop based upon existing venue density and business type.

To begin, let's import our dependent modules.

In [None]:
import requests
import os
import pandas as pd
import seaborn as sns
import numpy as np
import folium
from random import randint
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt
%matplotlib inline

Let's also define the constants we will use in this notebook.

In [None]:
BOURNEMOUTH = (50.721680, -1.878530)
ZOOM = 15

COL_LAT = 'Latitude'
COL_LNG = 'Longitude'
COL_VENUE_NAME = 'Venue Name'
COL_VENUE_LAT = 'Venue Latitude'
COL_VENUE_LNG = 'Venue Longitude'
COL_VENUE_CAT = 'Venue Category'
COL_VENUE_GRP = 'Venue Group'
COL_VENUE_CLS = 'Venue Cluster'

DRINK = 'Drink'
ENTERTAINMENT = 'Entertainment'
FOOD = 'Food'
HOTEL = 'Hotel'
SHOPPING = 'Shopping'
TRANSPORT = 'Transport'

Now, we will import our Bournemouth venue data that we had previously requested from the Foursquare API on `June 30 2019`, which was downloaded with the following specification:

* We used the Foursquare API version no later than `June 30 2019`.
* We used a latitude of `50.721680` and longitude of `-1.878530` to point to Bournemouth.
* We wanted venues up to `1.5km` around Bournemouth.
* We limited our results to `100` venues.
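
A request matching this specification could be sketched as follows. This is a minimal sketch rather than the code originally used to build the CSV file. It assumes the Foursquare v2 `venues/explore` endpoint and its usual response layout (`response.groups[].items[].venue`). The helper names `build_explore_params`, `venues_to_rows` and `fetch_venues`, the environment variable names, and the version string `20190630` are illustrative assumptions.

```python
import os

# Assumed Foursquare v2 endpoint for the venue explore API
EXPLORE_URL = 'https://api.foursquare.com/v2/venues/explore'


def build_explore_params(lat, lng, radius_m=1500, limit=100, version='20190630'):
    """Build query parameters that mirror the specification listed above."""
    return {
        # Hypothetical environment variable names for the API credentials
        'client_id': os.environ.get('FOURSQUARE_CLIENT_ID', ''),
        'client_secret': os.environ.get('FOURSQUARE_CLIENT_SECRET', ''),
        'v': version,                     # API version date (YYYYMMDD)
        'll': '{},{}'.format(lat, lng),   # latitude,longitude
        'radius': radius_m,               # search radius in metres
        'limit': limit,                   # maximum number of venues
    }


def venues_to_rows(payload):
    """Flatten an explore response (assumed layout) into one dict per venue."""
    rows = []
    for group in payload.get('response', {}).get('groups', []):
        for item in group.get('items', []):
            venue = item['venue']
            categories = venue.get('categories') or [{}]
            rows.append({
                'Venue Name': venue['name'],
                'Venue Latitude': venue['location']['lat'],
                'Venue Longitude': venue['location']['lng'],
                'Venue Category': categories[0].get('name'),
            })
    return rows


def fetch_venues(lat, lng):
    """Call the API and return venue rows; needs valid credentials and network access."""
    import requests  # imported lazily so the helpers above work offline
    response = requests.get(EXPLORE_URL, params=build_explore_params(lat, lng), timeout=10)
    response.raise_for_status()
    return venues_to_rows(response.json())


params = build_explore_params(50.721680, -1.878530)
print(params['ll'], params['radius'], params['limit'])
```

The rows returned by `venues_to_rows` use the same column names as the constants defined earlier, so they could be passed straight to `pd.DataFrame` and saved as a CSV.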

We store these data in a dataframe called `df` and show its shape, as well as the first five rows.

In [None]:
path_data = os.path.join('..', 'input', 'bournemouth_venues.csv')

df = pd.read_csv(path_data)

print(df.shape)
df.head()

# Methodology

Let's quickly visualise the venues in Bournemouth.
We use `generate_map` to visualise venue information.

In [None]:
def generate_map(df, lat, lng, zoom, col_lat, col_lng,
                 col_popup=None, popup_colors=False, def_color='red',
                 tiles='cartodbpositron'):
    folmap = folium.Map(location=[lat, lng], zoom_start=zoom, tiles=tiles)
    
    # Colour markers by popup value only when a popup column is given;
    # looking up df[None] would otherwise raise a KeyError
    use_colors = popup_colors and col_popup is not None
    
    if use_colors:
        popup = list(df[col_popup].unique())
        colors = make_color_palette(len(popup))
    
    for index, row in df.iterrows():
        folium.CircleMarker(
            location=(row[col_lat], row[col_lng]),
            radius=6,
            popup=row[col_popup] if col_popup is not None else None,
            fill=True,
            color=colors[popup.index(row[col_popup])] if use_colors else def_color,
            fill_opacity=0.6
            ).add_to(folmap)
    
    return folmap

This function uses `make_color_palette` to help produce random colours for different categories.

In [None]:
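def make_color_palette(size, n_min=50, n_max=205):
    # Draw each RGB channel from [n_min, n_max] and zero-pad it to two hex
    # digits, so that every colour is a valid '#rrggbb' string
    r = lambda: '{:02x}'.format(randint(n_min, n_max))
    colors = []
    
    while len(colors) < size:
        c = '#{}{}{}'.format(r(), r(), r())
        
        # Keep only colours that have not been generated already
        if c not in colors:
            colors.append(c)
    
    return colors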

We plot all `100` venues as follows.
Clicking on the markers will show the name of the venue.

In [None]:
generate_map(df, BOURNEMOUTH[0], BOURNEMOUTH[1], ZOOM, COL_VENUE_LAT, COL_VENUE_LNG, col_popup=COL_VENUE_NAME)

## Venue Groups

How many different types of venue category have been returned?

In [None]:
venue_cat = df[COL_VENUE_CAT].unique()
venue_cat.sort()

print('Category count:', len(venue_cat))
venue_cat

This is quite a lot!
Let's put these categories into six high-level groups:
`Drink`, `Entertainment`, `Food`, `Hotel`, `Shopping`, and `Transport`.
We will use `change_group` to help change many groups into a single group.

In [None]:
def change_group(df, grp_from_list, grp_to):
    for grp_from in grp_from_list:
        df.loc[df[COL_VENUE_GRP] == grp_from, COL_VENUE_GRP] = grp_to

In [None]:
# Quickly set venue groups to the last word in each venue category
df[COL_VENUE_GRP] = df[COL_VENUE_CAT].str.split(' ').str[-1]

# Remove the train station platform venue because we already have the nearby train station as a venue
# (copy the filtered frame so that later assignments do not raise SettingWithCopyWarning)
df = df[df[COL_VENUE_GRP] != 'Platform'].copy()

# Change the crude, last-word groups into more high-level groups
change_group(df,
             ['Bar', 'Brewery', 'Nightclub', 'Pub'],
             DRINK)

change_group(df,
             ['Aquarium', 'Beach', 'Center', 'Garden', 'Gym', 'Lookout',
              'Multiplex', 'Museum', 'Outdoors', 'Park', 'Theater'],
             ENTERTAINMENT)

change_group(df,
             ['Café', 'Diner', 'House', 'Joint', 'Place', 'Restaurant'],
             FOOD)

change_group(df,
             ['Plaza', 'Shop', 'Store'],
             SHOPPING)

change_group(df,
             ['Station', 'Stop'],
             TRANSPORT)

venue_grp = df[COL_VENUE_GRP].unique()
venue_grp.sort()

print('Group count:', len(venue_grp))
venue_grp
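
To make the last-word heuristic concrete, here is a toy example on a handful of made-up venue categories. It is a sketch only: the categories and the `last_word_to_group` dictionary are hypothetical, and it uses `Series.replace` with a dictionary as a compact alternative to calling `change_group` repeatedly.

```python
import pandas as pd

# Hypothetical venue categories, chosen only to illustrate the heuristic
toy = pd.DataFrame({'Venue Category': ['Cocktail Bar', 'Pub', 'Italian Restaurant',
                                       'Fish & Chips Shop', 'Train Station', 'Hotel']})

# Step 1: the crude group is the last word of each category
toy['Venue Group'] = toy['Venue Category'].str.split(' ').str[-1]

# Step 2: map crude groups to high-level groups; values that are not
# listed in the dictionary (here 'Hotel') are left unchanged
last_word_to_group = {'Bar': 'Drink', 'Pub': 'Drink', 'Restaurant': 'Food',
                      'Shop': 'Shopping', 'Station': 'Transport'}
toy['Venue Group'] = toy['Venue Group'].replace(last_word_to_group)

print(toy)
```

One caveat of grouping by last word: a category such as `Fish & Chips Shop` ends in `Shop`, so it lands in `Shopping` rather than `Food`. The resulting groups are therefore best treated as approximate.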

Now we graph venues based upon their new group instead.
We see that `Entertainment` venues are closer to the beach, and most `Food` and `Drink` venues cluster in the center of town, along with all of the `Shopping` venues.
`Hotel` venues are dispersed across town, and `Transport` is the furthest out of town.

In [None]:
generate_map(df, BOURNEMOUTH[0], BOURNEMOUTH[1], ZOOM, COL_VENUE_LAT, COL_VENUE_LNG, col_popup=COL_VENUE_GRP, popup_colors=True)

## Density-Based Clustering

Let's now perform a density-based clustering to locate areas of high venue density.
We will use the **DBSCAN** data clustering algorithm for this task.
Firstly, we need to convert the **latitude** and **longitude** of each venue into a format that is more appropriate for DBSCAN to process.
We extract the latitude and longitude, as follows.

In [None]:
df_latlng = df[[COL_VENUE_LAT, COL_VENUE_LNG]]
df_latlng.head()

We will standardise these data using a [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html), so that latitude and longitude are on a comparable scale when DBSCAN measures the distance between venues.

In [None]:
latlng = StandardScaler().fit_transform(np.nan_to_num(df_latlng))
latlng[:5]
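
As a small illustration of what the scaler does, the sketch below standardises four made-up coordinates (they are not taken from the venue data) and checks the result against the formula `(x - mean) / std`, applied to each column separately.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical coordinates near Bournemouth, for illustration only
coords = np.array([[50.7200, -1.8800],
                   [50.7210, -1.8780],
                   [50.7220, -1.8760],
                   [50.7190, -1.8740]])

scaled = StandardScaler().fit_transform(coords)

# The same transformation written out by hand, column by column
manual = (coords - coords.mean(axis=0)) / coords.std(axis=0)

print(np.allclose(scaled, manual))  # both approaches agree
print(scaled.mean(axis=0))          # approximately 0 for each column
print(scaled.std(axis=0))           # approximately 1 for each column
```

Because each column is divided by its own standard deviation, distances in the scaled space are measured in standard deviations rather than in metres. The `eps` value given to DBSCAN should be read in those units.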

Now, we train a DBSCAN model, called `dbscan`, which generates `8` labels, as well as the outlier label `-1`.

In [None]:
dbscan = DBSCAN(eps=0.2, min_samples=3)
dbscan.fit(latlng)

print('labels:', np.unique(dbscan.labels_))
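
To show how DBSCAN assigns labels, here is a toy run on seven made-up points: two tight groups of three and one isolated point. The points are hypothetical and unscaled; only the `eps` and `min_samples` values are taken from the cell above.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight groups of three points each, plus one isolated point
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                   [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
                   [10.0, 0.0]])

# A point is a core point if at least min_samples points (itself included)
# lie within a distance of eps; connected core points form a cluster
toy_dbscan = DBSCAN(eps=0.2, min_samples=3).fit(points)
toy_labels = toy_dbscan.labels_

print('labels:', toy_labels)
print('clusters found:', len(set(toy_labels) - {-1}))
```

The first three points share one label, the next three share another, and the isolated point receives the outlier label `-1`, because it has no neighbours within `eps`.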

We add the labels to the data frame as `Venue Cluster`.

In [None]:
df[COL_VENUE_CLS] = dbscan.labels_
df.head()

Now, we map the venues again, but color code them by the cluster value to which they've been assigned.

In [None]:
generate_map(df, BOURNEMOUTH[0], BOURNEMOUTH[1], ZOOM, COL_VENUE_LAT, COL_VENUE_LNG, col_popup=COL_VENUE_CLS, popup_colors=True)

## Cluster Analysis

Let's check which clusters are the most densely populated.

In [None]:
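# Count venues per cluster label; a bar chart keeps each label in its own bar
df[COL_VENUE_CLS].value_counts().sort_index().plot(kind='bar')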

We see that, outside of the outlier class `-1`, the two most densely populated clusters are `0` and `2`, which are both in the center of town.
Let's put these into their own separate data frames.

In [None]:
df_0 = df.loc[df[COL_VENUE_CLS] == 0]
df_2 = df.loc[df[COL_VENUE_CLS] == 2]

print('df_0:', df_0.shape)
print('df_2:', df_2.shape)

## Which Cluster?

We have identified the two largest venue clusters in Bournemouth.
Now, we want to analyse the venue distribution within each cluster, to help us reason about which cluster we should open our own Fish & Chips shop in.
We create two bar charts showing venue frequency by group for clusters `0` and `2`.

In [None]:
df_0[COL_VENUE_GRP].value_counts().plot(kind='barh', title='df_0: {} venues'.format(df_0.shape[0]))

In [None]:
df_2[COL_VENUE_GRP].value_counts().plot(kind='barh', title='df_2: {} venues'.format(df_2.shape[0]))

We see that, of the two clusters, cluster `0` has fewer `Food` venues, which is desirable, although only marginally fewer than cluster `2`.
However, cluster `2` has more non-`Food` venues than cluster `0`, which is preferable.
Areas with many places to `Drink` are a hotspot for Fish & Chips food after a night of drinking, and people who do `Shopping` in the day time might like to eat Fish & Chips for lunch or dinner.
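
The same comparison can be put into a single table. The sketch below is one possible way to do it, using `pd.crosstab` on made-up venues; the data and the `Non-Food per Food` column are illustrative assumptions, not the notebook's results.

```python
import pandas as pd

# Made-up venues for illustration; these are not the notebook's results
toy = pd.DataFrame({
    'Venue Cluster': [0, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2],
    'Venue Group': ['Food', 'Food', 'Drink', 'Hotel',
                    'Food', 'Food', 'Food', 'Drink', 'Drink', 'Drink',
                    'Shopping', 'Shopping'],
})

# Rows are clusters, columns are venue groups, cells are venue counts
counts = pd.crosstab(toy['Venue Cluster'], toy['Venue Group'])

food = counts['Food']
non_food = counts.drop(columns='Food').sum(axis=1)

# Hypothetical summary: how many non-Food venues there are per Food venue
summary = pd.DataFrame({'Food': food,
                        'Non-Food': non_food,
                        'Non-Food per Food': non_food / food})
print(summary)
```

A higher `Non-Food per Food` value points to more footfall-generating venues for each existing competitor, which is the trade-off described above.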

# Conclusion

In conclusion, we would want to open our Fish & Chips shop in **cluster 2**, given that it is densely populated with venues that are not `Food` related, despite having slightly more `Food` venues than cluster `0`.