# Data Science Capstone Project - Solving Business Problems with Location Data

## Table of Contents

1. Introduction/Business Problem
2. Data

## Introduction/Business Problem

For this project, we have been tasked with finding the best possible location for a company's first cafe and coffee shop. The business owners have ambitious plans for growth in the coming years, building on their business acumen and experience in different hospitality ventures.

They have chosen South Dublin for their first location, for two main reasons. The first is that Dublin City Centre is so densely populated with coffee shops, cafes and restaurants that it would be difficult for a new and unknown brand to flourish there.

The second reason is that South Dublin is considered a wealthy and affluent area. The business owners assume that if they can provide a leading service and quality of product, the inhabitants of the south side of the River Liffey will spend more money and keep coming back. They hope that this location will be the first of many; once they develop the brand here, they can branch out into the city centre with a better-known brand and quality of coffee.

Beyond the south of the city being an ideal area to start this business, the owners consider the southern coastal area to offer the higher quality of life. This area is also served by the DART (Dublin Area Rapid Transit) railway network leading into the city, so the owners would ideally like to open a location close to one of these stations, hoping to attract early-morning commuters as well as a lunchtime crowd. To identify the primary location for the first outlet, we need to take into account the number of competitor outlets already operating there, while also exploring whether there are many shops and other services in the area. We will achieve this by using the FourSquare API to return such locations.

## Data

### **Part 1:**

Before we do any data preprocessing, let's import the packages that we will need throughout this project.

In [29]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (top-level import, available since pandas 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means for the clustering stage
from sklearn.cluster import KMeans

import folium # map rendering library

The first part of our data process is to obtain the list of DART stations on the Dublin Southside. One such list can be found [here](https://data.smartdublin.ie/dataset/dlr-dart-stations/resource/db0ff728-884e-40c0-9e3c-e13344d00dfc). After reading in this file and inspecting the data, we see that the X and Y columns do not hold latitude and longitude values (their magnitudes suggest projected grid coordinates), so we will obtain latitude and longitude with the Nominatim geocoder from the geopy package at a later step. For now, we remove these columns along with the *stop_id* column, as it is not used in this process.

In [30]:
# Read in csv file and inspect the first few rows

filename = "https://data.smartdublin.ie/dataset/3d15a715-0dc3-416a-a2c7-71534494efdc/resource/db0ff728-884e-40c0-9e3c-e13344d00dfc/download/dlr_east_coast_dart_stations.csv"
df = pd.read_csv(filename)
df.head()

Unnamed: 0,X_Coord,Y_Coord,stop_id,stop_name
0,716232,734404.0,825GA00065,Tara Street Train Station
1,716699,735015.0,825GA00167,Connolly Train Station
2,717213,734847.0,825GA00184,Docklands Train Station
3,716656,733978.0,825GA00204,Pearse Train Station
4,719156,731539.0,825GA00079,Sydney Parade Train Station


In [31]:
# Remove unwanted columns

df = df.drop(df.columns[[0, 1, 2]], axis=1)
df.head()

Unnamed: 0,stop_name
0,Tara Street Train Station
1,Connolly Train Station
2,Docklands Train Station
3,Pearse Train Station
4,Sydney Parade Train Station


The second issue with the dataset is that it contains duplicate rows. We only want the 19 unique stations (rows 0-18), so we need to remove the duplicates.

In [32]:
# Remove duplicate rows across the dataframe

df = df.drop_duplicates()
df

Unnamed: 0,stop_name
0,Tara Street Train Station
1,Connolly Train Station
2,Docklands Train Station
3,Pearse Train Station
4,Sydney Parade Train Station
5,Shankill Train Station
6,Seapoint Train Station
7,Sandymount Train Station
8,Glenageary Train Station
9,Sandycove and Glasthule Train Station


The next step in the process is to replace the words *Train Station* with the string *, Dublin* so that the locations can be correctly picked up when we fetch the coordinates using geopy.

In [33]:
# Replace part of string with another

df = df.replace({'Train Station':', Dublin'}, regex=True)
df

Unnamed: 0,stop_name
0,"Tara Street , Dublin"
1,"Connolly , Dublin"
2,"Docklands , Dublin"
3,"Pearse , Dublin"
4,"Sydney Parade , Dublin"
5,"Shankill , Dublin"
6,"Seapoint , Dublin"
7,"Sandymount , Dublin"
8,"Glenageary , Dublin"
9,"Sandycove and Glasthule , Dublin"


Next we export the sole remaining column (stop_name) to a list so that we can loop over it and append the results to a new dataframe.

In [34]:
# Export entire column to a workable list for looping purposes

stations1 = df['stop_name'].tolist()
stations1

['Tara Street , Dublin',
 'Connolly , Dublin',
 'Docklands , Dublin',
 'Pearse , Dublin',
 'Sydney Parade , Dublin',
 'Shankill , Dublin',
 'Seapoint , Dublin',
 'Sandymount , Dublin',
 'Glenageary , Dublin',
 'Sandycove and Glasthule , Dublin',
 'Salthill and Monkstown , Dublin',
 'Blackrock , Dublin',
 'Booterstown , Dublin',
 'Dalkey , Dublin',
 'Dun Laoghaire , Dublin',
 'Lansdowne Road , Dublin',
 'Grand Canal Dock , Dublin',
 'Killiney , Dublin',
 'Bray , Dublin']

We then create an empty dataframe called *stations* with the column names *Station* (replacing stop_name), *Latitude* and *Longitude*.

In [35]:
# Create a variable to store column names

column_names = ['Station', 'Latitude', 'Longitude'] 

# Create blank dataframe using the previously-created column names

stations = pd.DataFrame(columns=column_names)

After supplying the list of stops to the geopy function, we obtain the correct coordinates for each of the stations and then append these along with the station names to our new dataframe.

In [36]:
# Loop through the list, collecting station name and coordinates obtained from the geolocator

# Create the geocoder once rather than on every iteration
geolocator = Nominatim(user_agent="ny_explorer")

rows = []
for i in stations1:
    location = geolocator.geocode(i)
    latitude = location.latitude
    longitude = location.longitude
    rows.append({'Station': i,
                 'Latitude': latitude,
                 'Longitude': longitude})

# DataFrame.append was removed in pandas 2.0, so build the stations dataframe from the collected rows
stations = pd.DataFrame(rows, columns=column_names)

stations

Unnamed: 0,Station,Latitude,Longitude
0,"Tara Street , Dublin",53.347063,-6.254314
1,"Connolly , Dublin",53.350949,-6.249872
2,"Docklands , Dublin",53.353666,-6.228333
3,"Pearse , Dublin",53.343335,-6.248463
4,"Sydney Parade , Dublin",53.320787,-6.211552
5,"Shankill , Dublin",53.230228,-6.124181
6,"Seapoint , Dublin",53.466102,-6.191389
7,"Sandymount , Dublin",53.327928,-6.22105
8,"Glenageary , Dublin",53.281238,-6.123108
9,"Sandycove and Glasthule , Dublin",53.288252,-6.127045
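The geocoding loop assumes that every lookup succeeds. geopy's `geocode` returns `None` when nothing matches, in which case reading `.latitude` raises an `AttributeError`. The following is a minimal sketch of a more defensive loop. The function name `geocode_stations` is hypothetical, and the geocoder is passed in as a callable so the logic can be tried without network access.

```python
from typing import Callable, Iterable, Optional


def geocode_stations(names: Iterable[str],
                     geocode: Callable[[str], Optional[object]]):
    """Return (rows, not_found) for the given station names.

    `geocode` is any callable that returns an object with `.latitude` and
    `.longitude` attributes, or None when nothing matches.
    """
    rows, not_found = [], []
    for name in names:
        location = geocode(name)
        if location is None:
            # Record the miss instead of failing on a None result
            not_found.append(name)
            continue
        rows.append({'Station': name,
                     'Latitude': location.latitude,
                     'Longitude': location.longitude})
    return rows, not_found


# Hypothetical usage with the notebook's objects (requires network access):
# geolocator = Nominatim(user_agent="ny_explorer")
# rows, not_found = geocode_stations(stations1, geolocator.geocode)
# stations = pd.DataFrame(rows, columns=column_names)
```

Any names that end up in `not_found` can then be inspected and corrected by hand before continuing.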


We want to tidy our dataset a little further by removing the Tara Street, Connolly and Docklands stations, as these are in the city centre, which is not an area we are interested in.

In [37]:
# Remove certain rows from the dataset

stations2 = stations[~stations.Station.isin(['Tara Street , Dublin', 'Connolly , Dublin', 
                                          'Docklands , Dublin'])]

stations2

Unnamed: 0,Station,Latitude,Longitude
3,"Pearse , Dublin",53.343335,-6.248463
4,"Sydney Parade , Dublin",53.320787,-6.211552
5,"Shankill , Dublin",53.230228,-6.124181
6,"Seapoint , Dublin",53.466102,-6.191389
7,"Sandymount , Dublin",53.327928,-6.22105
8,"Glenageary , Dublin",53.281238,-6.123108
9,"Sandycove and Glasthule , Dublin",53.288252,-6.127045
10,"Salthill and Monkstown , Dublin",53.295391,-6.152424
11,"Blackrock , Dublin",53.301864,-6.178834
12,"Booterstown , Dublin",53.308629,-6.196652


### **Part 2:**

The second part of the data process is to utilize the FourSquare API to get a list of competing businesses in the areas we want to explore. Firstly, we must define our API credentials.

In [38]:
# Define API credentials

CLIENT_ID = 'S220GACGVSNRKG1XWRUG4GGR0UEJ5P2IHQF3KFGSDDK15ADK' # Foursquare ID
CLIENT_SECRET = 'ULHQFM5ED0KBFZH4Y04XUON2Y2OBQJ2423OCHWWMQ30UAYS4' # Foursquare Secret
VERSION = '20180605' # Foursquare API version
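API credentials written directly into a notebook are visible to anyone who can read it. One common alternative, sketched here, is to read them from environment variables. The function name and the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` are hypothetical and not part of the original notebook.

```python
import os


def load_foursquare_credentials(env=os.environ):
    """Read the Foursquare credentials from environment variables.

    The variable names used here are hypothetical; any naming scheme works.
    """
    try:
        client_id = env['FOURSQUARE_CLIENT_ID']
        client_secret = env['FOURSQUARE_CLIENT_SECRET']
    except KeyError as missing:
        raise RuntimeError(f'Missing environment variable: {missing}') from None
    return client_id, client_secret


# Hypothetical usage in the notebook:
# CLIENT_ID, CLIENT_SECRET = load_foursquare_credentials()
```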

We call the [venues endpoint](https://developer.foursquare.com/docs/places-api/endpoints/) to obtain each venue's name, coordinates and business category, and associate them with the stations that we defined in Part 1.

In [39]:
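# Define request parameters

LIMIT = 1000 # limit of number of venues returned by Foursquare API

radius = 500 # define radius in metres

# create a sample URL using the latitude and longitude variables last set in the geocoding loop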
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    latitude, 
    longitude, 
    radius, 
    LIMIT)
url

'https://api.foursquare.com/v2/venues/explore?&client_id=S220GACGVSNRKG1XWRUG4GGR0UEJ5P2IHQF3KFGSDDK15ADK&client_secret=ULHQFM5ED0KBFZH4Y04XUON2Y2OBQJ2423OCHWWMQ30UAYS4&v=20180605&ll=53.203912,-6.2322203&radius=500&limit=1000'
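As an alternative to formatting the query string by hand, the parameters can be collected in a dictionary and encoded with the standard library. This is a minimal sketch; the function name `build_explore_url` and the placeholder credentials in the usage note are hypothetical, while the endpoint and parameter names are taken from the URL above.

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = 'https://api.foursquare.com/v2/venues/explore'


def build_explore_url(client_id, client_secret, version, lat, lng,
                      radius=500, limit=1000):
    """Build the explore URL with properly encoded query parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': f'{lat},{lng}',  # latitude and longitude as one comma-separated value
        'radius': radius,
        'limit': limit,
    }
    return f'{EXPLORE_ENDPOINT}?{urlencode(params)}'
```

For example, `build_explore_url('MY_ID', 'MY_SECRET', '20180605', 53.343335, -6.248463)` produces a URL for the Pearse coordinates. The same dictionary could also be passed to `requests.get` through its `params` argument, which performs the encoding itself.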

In [40]:
# Obtain nearby venues to the stations from FourSquare

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['station', 
                  'station_latitude', 
                  'station_longitude', 
                  'venue', 
                  'venue_latitude', 
                  'venue_longitude', 
                  'venue_category']
    
    return nearby_venues
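The function above indexes `categories[0]` directly, which would raise an `IndexError` if a venue were returned with an empty category list. The sketch below shows one way to flatten a single response more defensively. The helper name `extract_venues` is hypothetical, and the payload shape is assumed to match the one accessed in `getNearbyVenues`.

```python
def extract_venues(payload, station, lat, lng):
    """Flatten one explore response into rows for the venues dataframe."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # Fall back to None when a venue carries no category
        category = categories[0]['name'] if categories else None
        rows.append((station, lat, lng,
                     venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     category))
    return rows


# Hypothetical usage inside the loop of getNearbyVenues:
# venues_list.append(extract_venues(requests.get(url).json(), name, lat, lng))
```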

Once that is done we create a new dataframe combining venue and original information.

In [41]:
# Create new dataframe with updated venue information

dublin_venues = getNearbyVenues(names=stations2['Station'],
                                   latitudes=stations2['Latitude'],
                                  longitudes=stations2['Longitude'])

Pearse , Dublin
Sydney Parade , Dublin
Shankill , Dublin
Seapoint , Dublin
Sandymount , Dublin
Glenageary , Dublin
Sandycove and Glasthule , Dublin
Salthill and Monkstown , Dublin
Blackrock , Dublin
Booterstown , Dublin
Dalkey , Dublin
Dun Laoghaire , Dublin
Lansdowne Road , Dublin
Grand Canal Dock , Dublin
Killiney , Dublin
Bray , Dublin


Let's print the number of rows and columns and inspect the first 20 rows.

In [42]:
# Print the number of rows and columns

print(dublin_venues.shape)

# Inspect the first 20 rows

dublin_venues.head(20)

(285, 7)


Unnamed: 0,station,station_latitude,station_longitude,venue,venue_latitude,venue_longitude,venue_category
0,"Pearse , Dublin",53.343335,-6.248463,Science Gallery,53.344186,-6.250524,Science Museum
1,"Pearse , Dublin",53.343335,-6.248463,Bread 41,53.344812,-6.251619,Bakery
2,"Pearse , Dublin",53.343335,-6.248463,Oscar Wilde Statue,53.340937,-6.250692,Outdoor Sculpture
3,"Pearse , Dublin",53.343335,-6.248463,Science Gallery Café,53.344348,-6.250779,Coffee Shop
4,"Pearse , Dublin",53.343335,-6.248463,Honey Truffle,53.344089,-6.248893,Coffee Shop
5,"Pearse , Dublin",53.343335,-6.248463,Sweny's Pharmacy,53.34191,-6.250392,Bookstore
6,"Pearse , Dublin",53.343335,-6.248463,Merrion Square Park,53.340138,-6.250451,Park
7,"Pearse , Dublin",53.343335,-6.248463,Arabica Coffee House,53.343676,-6.247063,Coffee Shop
8,"Pearse , Dublin",53.343335,-6.248463,Probus Wines & Spirits,53.341578,-6.248789,Wine Shop
9,"Pearse , Dublin",53.343335,-6.248463,The Ginger Man,53.341859,-6.24958,Pub
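As a sanity check on the returned venues, the distance between a station and a venue can be compared with the 500 m search radius. The following is a minimal sketch using the haversine formula. The function name `haversine_m` and the mean Earth radius constant are assumptions introduced for illustration.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres (assumed constant)


def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in metres between two latitude/longitude points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlmb = radians(lng2 - lng1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


# Pearse station to the Science Gallery, using coordinates from the table above
distance = haversine_m(53.343335, -6.248463, 53.344186, -6.250524)
print(round(distance))
```

For this pair the result is well inside the 500 m radius. Applying the same function to every row would show how far each venue lies from its station.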


Upon inspecting the data, we see that the FourSquare location data does not return a train station venue for every station. Only 10 are returned, whereas there should be 16 in total.

In [43]:
# Filter the venues that are categorised as train or light rail stations

trains = dublin_venues[dublin_venues.venue_category.isin(['Train Station', 'Light Rail Station'])]
trains

Unnamed: 0,station,station_latitude,station_longitude,venue,venue_latitude,venue_longitude,venue_category
33,"Sydney Parade , Dublin",53.320787,-6.211552,Sydney Parade DART Station,53.320489,-6.211176,Train Station
55,"Sandymount , Dublin",53.327928,-6.22105,Sandymount DART Station,53.328439,-6.223726,Light Rail Station
83,"Sandycove and Glasthule , Dublin",53.288252,-6.127045,Sandycove & Glasthule Dart Station,53.28798,-6.127124,Train Station
100,"Salthill and Monkstown , Dublin",53.295391,-6.152424,Salthill & Monkstown DART Station,53.295372,-6.152205,Train Station
118,"Blackrock , Dublin",53.301864,-6.178834,Blackrock Dart Station,53.302744,-6.178733,Train Station
131,"Booterstown , Dublin",53.308629,-6.196652,Booterstown DART Station,53.310137,-6.195415,Train Station
220,"Lansdowne Road , Dublin",53.335233,-6.228178,Lansdowne Road DART Station,53.333999,-6.228834,Train Station
275,"Grand Canal Dock , Dublin",53.339819,-6.238188,Grand Canal Dock Railway Station,53.339532,-6.237297,Train Station
282,"Killiney , Dublin",53.255384,-6.11304,Killiney Railway Station,53.255588,-6.113045,Train Station
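Rather than comparing the table by eye, the stations that lack a train venue can be listed with a set difference. This is a minimal sketch in plain Python; the function name `stations_without_train_venue` is hypothetical, and the two category names are the ones used in the filter above.

```python
TRAIN_CATEGORIES = {'Train Station', 'Light Rail Station'}


def stations_without_train_venue(rows):
    """Return stations that have no venue in a train category.

    `rows` is an iterable of (station, venue_category) pairs.
    """
    rows = list(rows)
    all_stations = {station for station, _ in rows}
    covered = {station for station, category in rows
               if category in TRAIN_CATEGORIES}
    return sorted(all_stations - covered)


# Hypothetical usage with the notebook's dataframe:
# pairs = zip(dublin_venues['station'], dublin_venues['venue_category'])
# stations_without_train_venue(pairs)
```

Note that a station for which the API returned no venues at all would not appear in `dublin_venues`, so it would need to be checked against the original station list separately.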


We need to create a new dataframe with the missing stations and their coordinates, and concatenate it with our main dataframe.

In [44]:
# Read in csv of missing stations

missing_stations = pd.read_csv('https://raw.githubusercontent.com/shaneconn860/data-science-capstone-py/master/extra_stations.csv')
missing_stations

Unnamed: 0,station,station_latitude,station_longitude,venue,venue_latitude,venue_longitude,venue_category
0,"Bray, Dublin",53.202405,-6.104793,Bray Daly Station,53.202405,-6.104793,Train Station
1,"Seapoint, Dublin",53.299106,-6.167547,Seapoint Train Station,53.299106,-6.167547,Train Station
2,"Pearse, Dublin",53.343327,-6.250215,Pearse Station,53.343327,-6.250215,Train Station
3,"Shankhill, Dublin",53.236608,-6.121142,Shankhill Station,53.236608,-6.121142,Train Station
4,"Glenageary, Dublin",53.280683,-6.13829,Glenageary Station,53.280683,-6.13829,Train Station
5,"Dun Laoghaire, Dublin",53.292308,-6.137815,Dun Laoghaire Station,53.292308,-6.137815,Train Station
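The station labels in this file are written as "Bray, Dublin", whereas the main dataframe uses "Bray , Dublin" with a space before the comma. If the rows are later grouped by station, the two spellings would be treated as different stations. The sketch below shows one way to bring both to a single form; the function name `normalise_station` is hypothetical.

```python
import re


def normalise_station(label):
    """Collapse whitespace around commas and trim the ends of the label."""
    return re.sub(r'\s*,\s*', ', ', label).strip()


# Hypothetical usage after concatenating the two dataframes:
# dublin_venues2['station'] = dublin_venues2['station'].map(normalise_station)
```

A spelling difference such as "Shankhill" versus "Shankill" is not handled by this function and would need an explicit mapping.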


In [45]:
# Concatenate missing stations with main dataframe

dublin_venues2 = pd.concat([dublin_venues, missing_stations])

In [46]:
# View number of rows to verify increase

print(dublin_venues2.shape)

(291, 7)


The number of rows has increased by 6, from 285 to 291, which confirms that the missing stations were added. We now have our raw dataset and are ready to start exploring the data to analyze and select the best possible location for the first outlet.