# Data Science Capstone Project - Solving Business Problems with Location Data

## Table of Contents

1. Introduction/Business Problem
2. Data

## Introduction/Business Problem

For this project, we have been tasked with finding the best possible location for a company's first Cafe and Coffee Shop. The business owners have ambitious plans for growth in the coming years, drawing on their business acumen and experience in different hospitality ventures. Our job is to recommend an ideal location for opening the business.

They have identified South Dublin as the area for their first location, for two main reasons. The first is that Dublin City Centre is so densely populated with coffee shops, cafes and restaurants that it would be difficult for a new and unknown brand to flourish there. It should be noted that the business owners are open to locations immediately outside the city on the South side.

The second reason is that South Dublin is considered a very wealthy and affluent area. The business owners assume that if they can provide a leading service and quality of product, the inhabitants of the South side of the River Liffey will spend more money and keep coming back. They hope that this location will be the first of many; once they develop the brand here, they can then branch out into the city centre with a better-known brand and quality of coffee.

Beyond the South of the city being an ideal area to start this business, it is the coastal area of the South where the quality of life is higher. This area is also served by the DART (Dublin Area Rapid Transit) railway network leading into the city, so the owners would ideally like to open a location close to one of these stations, hoping to attract early-morning commuters as well as catering for a lunchtime crowd. In order to identify the primary location for the first outlet, we need to take into account the number of competitor outlets already operating there, while also exploring whether there are many shops and other services in the area. We will achieve this by using the FourSquare API to return such locations.

## Data

### **Part 1:**

Before we do any data preprocessing, let's import the packages that we will need throughout this project.

In [2]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (top-level import since pandas 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means for the clustering stage
from sklearn.cluster import KMeans

import folium # map rendering library

The first part of our data process will be to obtain the list of DART stations on the Dublin Southside. One such list can be found [here](https://data.smartdublin.ie/dataset/dlr-dart-stations/resource/db0ff728-884e-40c0-9e3c-e13344d00dfc). After reading in this file and inspecting the data, we see that the X and Y coordinates are not latitude and longitude values (they appear to be projected grid coordinates), so we will obtain usable coordinates with the Nominatim geocoder from the geopy package at a later step. For now we remove these columns along with the *stop_id* column, which is not needed in this process.

In [3]:
# Read in csv file and inspect the first few rows

filename = "https://data.smartdublin.ie/dataset/3d15a715-0dc3-416a-a2c7-71534494efdc/resource/db0ff728-884e-40c0-9e3c-e13344d00dfc/download/dlr_east_coast_dart_stations.csv"
df = pd.read_csv(filename)
df.head()

Unnamed: 0,X_Coord,Y_Coord,stop_id,stop_name
0,716232,734404.0,825GA00065,Tara Street Train Station
1,716699,735015.0,825GA00167,Connolly Train Station
2,717213,734847.0,825GA00184,Docklands Train Station
3,716656,733978.0,825GA00204,Pearse Train Station
4,719156,731539.0,825GA00079,Sydney Parade Train Station


In [4]:
# Remove unwanted columns

df = df.drop(df.columns[[0, 1, 2]], axis=1)
df.head()

Unnamed: 0,stop_name
0,Tara Street Train Station
1,Connolly Train Station
2,Docklands Train Station
3,Pearse Train Station
4,Sydney Parade Train Station


The second issue with the dataset is that there are duplicate rows. We only want rows 0-18, so we need to remove the duplicates.

In [5]:
# Remove duplicate rows across the dataframe

df = df.drop_duplicates()
df

Unnamed: 0,stop_name
0,Tara Street Train Station
1,Connolly Train Station
2,Docklands Train Station
3,Pearse Train Station
4,Sydney Parade Train Station
5,Shankill Train Station
6,Seapoint Train Station
7,Sandymount Train Station
8,Glenageary Train Station
9,Sandycove and Glasthule Train Station


The next step in the process is to replace the words *Train Station* with the string *, Dublin* so that the locations can be correctly picked up when we fetch the coordinates using geopy.

In [6]:
# Replace part of string with another

df = df.replace({'Train Station':', Dublin'}, regex=True)
df

Unnamed: 0,stop_name
0,"Tara Street , Dublin"
1,"Connolly , Dublin"
2,"Docklands , Dublin"
3,"Pearse , Dublin"
4,"Sydney Parade , Dublin"
5,"Shankill , Dublin"
6,"Seapoint , Dublin"
7,"Sandymount , Dublin"
8,"Glenageary , Dublin"
9,"Sandycove and Glasthule , Dublin"
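The replacement above leaves a stray space before the comma (for example `Tara Street , Dublin`), and later cells then have to match that spelling exactly. A minimal sketch of a tighter replacement, assuming the same *stop_name* column; the sample frame and the name `clean` are made up for illustration:

```python
import pandas as pd

# Hypothetical sample with the same shape as the stop_name column above
sample = pd.DataFrame({'stop_name': ['Tara Street Train Station',
                                     'Sydney Parade Train Station']})

# Strip the trailing " Train Station" (including the space before it)
# and append ", Dublin" so that no stray space is left before the comma
clean = sample['stop_name'].str.replace(r'\s*Train Station$', '', regex=True) + ', Dublin'

print(clean.tolist())
```

If this form were adopted, the station names used in later filters would need the same spelling.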


Next we export the sole remaining column (stop_name) to a list so that we can use it in a for loop and append to the new dataframe that we create.

In [7]:
# Export entire column to a workable list for looping purposes

stations1 = df['stop_name'].tolist()
stations1

['Tara Street , Dublin',
 'Connolly , Dublin',
 'Docklands , Dublin',
 'Pearse , Dublin',
 'Sydney Parade , Dublin',
 'Shankill , Dublin',
 'Seapoint , Dublin',
 'Sandymount , Dublin',
 'Glenageary , Dublin',
 'Sandycove and Glasthule , Dublin',
 'Salthill and Monkstown , Dublin',
 'Blackrock , Dublin',
 'Booterstown , Dublin',
 'Dalkey , Dublin',
 'Dun Laoghaire , Dublin',
 'Lansdowne Road , Dublin',
 'Grand Canal Dock , Dublin',
 'Killiney , Dublin',
 'Bray , Dublin']

We then create an empty dataframe called *stations* with the column names *Station* (replacing stop_name), *Latitude* and *Longitude*.

In [8]:
# Create a variable to store column names

column_names = ['Station', 'Latitude', 'Longitude'] 

# Create a blank dataframe using the previously-created column names

stations = pd.DataFrame(columns=column_names)

After supplying the list of stops to the geopy function, we obtain the correct coordinates for each of the stations and then append these along with the station names to our new dataframe.

In [9]:
# Loop through the list appending station name and coordinates obtained from geolocator in to stations dataframe

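geolocator = Nominatim(user_agent="ny_explorer")  # create the geocoder once, outside the loop

rows = []
for name in stations1:
    location = geolocator.geocode(name)
    if location is None:  # skip names the geocoder cannot resolve
        continue
    latitude = location.latitude
    longitude = location.longitude
    rows.append({'Station': name, 'Latitude': latitude, 'Longitude': longitude})

# DataFrame.append was removed in pandas 2.0, so build the frame from the collected rows
stations = pd.DataFrame(rows, columns=column_names)

stations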

Unnamed: 0,Station,Latitude,Longitude
0,"Tara Street , Dublin",53.347063,-6.254314
1,"Connolly , Dublin",53.350949,-6.249872
2,"Docklands , Dublin",53.353666,-6.228333
3,"Pearse , Dublin",53.343335,-6.248463
4,"Sydney Parade , Dublin",53.320787,-6.211552
5,"Shankill , Dublin",53.230228,-6.124181
6,"Seapoint , Dublin",53.466102,-6.191389
7,"Sandymount , Dublin",53.327928,-6.22105
8,"Glenageary , Dublin",53.281238,-6.123108
9,"Sandycove and Glasthule , Dublin",53.288252,-6.127045
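Geocoding by name can resolve to a different place with the same name, so the returned coordinates are worth a quick sanity check. For instance, the latitude returned for Seapoint (53.466102) lies north of every other station in the table. The sketch below flags any station whose coordinates fall outside a rough bounding box; the box limits and the helper name `outside_box` are assumptions chosen for illustration, not values from the original analysis. The sample rows are copied from the table above.

```python
# Rough limits for the area of interest (illustrative assumptions only)
LAT_MIN, LAT_MAX = 53.19, 53.36
LON_MIN, LON_MAX = -6.30, -6.05

def outside_box(rows):
    """Return the names of stations whose coordinates fall outside the box."""
    return [name for name, lat, lon in rows
            if not (LAT_MIN <= lat <= LAT_MAX and LON_MIN <= lon <= LON_MAX)]

# A few rows copied from the geocoding output above
geocoded = [
    ('Sydney Parade , Dublin', 53.320787, -6.211552),
    ('Shankill , Dublin', 53.230228, -6.124181),
    ('Seapoint , Dublin', 53.466102, -6.191389),
]

print(outside_box(geocoded))
```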


We tidy the dataset further by removing the Tara Street, Connolly and Docklands stations, as these are in the city centre and are not locations we are interested in.

In [10]:
# Remove certain rows from the dataset

stations2 = stations[~stations.Station.isin(['Tara Street , Dublin', 'Connolly , Dublin', 
                                          'Docklands , Dublin'])]

stations2

Unnamed: 0,Station,Latitude,Longitude
3,"Pearse , Dublin",53.343335,-6.248463
4,"Sydney Parade , Dublin",53.320787,-6.211552
5,"Shankill , Dublin",53.230228,-6.124181
6,"Seapoint , Dublin",53.466102,-6.191389
7,"Sandymount , Dublin",53.327928,-6.22105
8,"Glenageary , Dublin",53.281238,-6.123108
9,"Sandycove and Glasthule , Dublin",53.288252,-6.127045
10,"Salthill and Monkstown , Dublin",53.295391,-6.152424
11,"Blackrock , Dublin",53.301864,-6.178834
12,"Booterstown , Dublin",53.308629,-6.196652


### **Part 2:**

The second part of the data process is to utilize the FourSquare API to get a list of competing businesses in the areas we want to explore. Firstly, we must define our API credentials.

In [11]:
# Define API credentials

CLIENT_ID = 'S220GACGVSNRKG1XWRUG4GGR0UEJ5P2IHQF3KFGSDDK15ADK' # Foursquare ID
CLIENT_SECRET = 'ULHQFM5ED0KBFZH4Y04XUON2Y2OBQJ2423OCHWWMQ30UAYS4' # Foursquare Secret
VERSION = '20180605' # Foursquare API version

We call the [venues endpoint](https://developer.foursquare.com/docs/places-api/endpoints/) and obtain the venue name, coordinates and the type of business it is and associate them with the stations that we have defined in Part 1.

In [12]:
# Create URL

LIMIT = 1000 # limit of number of venues returned by Foursquare API

radius = 500 # define radius

# create URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    latitude, 
    longitude, 
    radius, 
    LIMIT)
url

'https://api.foursquare.com/v2/venues/explore?&client_id=S220GACGVSNRKG1XWRUG4GGR0UEJ5P2IHQF3KFGSDDK15ADK&client_secret=ULHQFM5ED0KBFZH4Y04XUON2Y2OBQJ2423OCHWWMQ30UAYS4&v=20180605&ll=53.203912,-6.2322203&radius=500&limit=1000'
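As an alternative to formatting the query string by hand, the parameters can be kept in a dictionary and encoded in one step. This is only a sketch: the credentials are placeholders, the helper name `build_explore_url` and its default values are hypothetical, and the endpoint and parameter names are the ones already used in the cell above.

```python
from urllib.parse import urlencode

BASE_URL = 'https://api.foursquare.com/v2/venues/explore'

def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=100):
    """Build the explore URL from named parameters instead of positional format fields."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    # urlencode percent-encodes the values and joins them with '&'
    return BASE_URL + '?' + urlencode(params)

# Placeholder credentials, not real ones
print(build_explore_url('YOUR_ID', 'YOUR_SECRET', '20180605', 53.343335, -6.248463))
```

Keeping the parameters named makes it harder to swap latitude and longitude by accident, and it keeps credentials out of a long format string.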

In [13]:
# Obtain nearby venues to the stations from FourSquare

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['station', 
                  'station_latitude', 
                  'station_longitude', 
                  'venue', 
                  'venue_latitude', 
                  'venue_longitude', 
                  'venue_category']
    
    return nearby_venues
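The list comprehension inside `getNearbyVenues` assumes every venue has at least one category. A minimal sketch of the same flattening step with a guard for an empty category list follows; the function name `flatten_items` and the sample payload are made up for illustration, shaped only after the keys the function above reads.

```python
def flatten_items(station, lat, lng, items):
    """Turn Foursquare-style 'items' into flat tuples, tolerating missing categories."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        category = categories[0]['name'] if categories else None
        rows.append((station, lat, lng,
                     venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     category))
    return rows

# Hypothetical payload with the same nesting as results[...]['items']
sample_items = [
    {'venue': {'name': 'Example Café',
               'location': {'lat': 53.3441, 'lng': -6.2505},
               'categories': [{'name': 'Café'}]}},
    {'venue': {'name': 'Uncategorised Place',
               'location': {'lat': 53.3430, 'lng': -6.2490},
               'categories': []}},
]

for row in flatten_items('Pearse , Dublin', 53.343335, -6.248463, sample_items):
    print(row)
```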

Once that is done we create a new dataframe combining venue and original information.

In [14]:
# Create new dataframe with updated venue information

dublin_venues = getNearbyVenues(names=stations2['Station'],
                                   latitudes=stations2['Latitude'],
                                  longitudes=stations2['Longitude'])

Pearse , Dublin
Sydney Parade , Dublin
Shankill , Dublin
Seapoint , Dublin
Sandymount , Dublin
Glenageary , Dublin
Sandycove and Glasthule , Dublin
Salthill and Monkstown , Dublin
Blackrock , Dublin
Booterstown , Dublin
Dalkey , Dublin
Dun Laoghaire , Dublin
Lansdowne Road , Dublin
Grand Canal Dock , Dublin
Killiney , Dublin
Bray , Dublin


Let's print the number of rows and columns and inspect the first 20 rows.

In [15]:
# Print the number of rows and columns

print(dublin_venues.shape)

# Inspect the first 20 rows

dublin_venues.head(20)

(286, 7)


Unnamed: 0,station,station_latitude,station_longitude,venue,venue_latitude,venue_longitude,venue_category
0,"Pearse , Dublin",53.343335,-6.248463,Science Gallery,53.344186,-6.250524,Science Museum
1,"Pearse , Dublin",53.343335,-6.248463,Bread 41,53.344812,-6.251619,Bakery
2,"Pearse , Dublin",53.343335,-6.248463,Oscar Wilde Statue,53.340937,-6.250692,Outdoor Sculpture
3,"Pearse , Dublin",53.343335,-6.248463,Science Gallery Café,53.344348,-6.250779,Coffee Shop
4,"Pearse , Dublin",53.343335,-6.248463,Honey Truffle,53.344089,-6.248893,Coffee Shop
5,"Pearse , Dublin",53.343335,-6.248463,Sweny's Pharmacy,53.34191,-6.250392,Bookstore
6,"Pearse , Dublin",53.343335,-6.248463,Merrion Square Park,53.340138,-6.250451,Park
7,"Pearse , Dublin",53.343335,-6.248463,Arabica Coffee House,53.343676,-6.247063,Coffee Shop
8,"Pearse , Dublin",53.343335,-6.248463,Probus Wines & Spirits,53.341578,-6.248789,Wine Shop
9,"Pearse , Dublin",53.343335,-6.248463,The Ginger Man,53.341859,-6.24958,Pub


Upon inspecting the data, we see that not all train stations are picked up by the FourSquare location data: only 10 are returned, whereas there should be 17 in total.

In [16]:
trains = dublin_venues[dublin_venues.venue_category.isin(['Train Station', 'Light Rail Station'])]
trains

Unnamed: 0,station,station_latitude,station_longitude,venue,venue_latitude,venue_longitude,venue_category
33,"Sydney Parade , Dublin",53.320787,-6.211552,Sydney Parade DART Station,53.320489,-6.211176,Train Station
55,"Sandymount , Dublin",53.327928,-6.22105,Sandymount DART Station,53.328439,-6.223726,Light Rail Station
68,"Glenageary , Dublin",53.281238,-6.123108,Glenageary Dart Station,53.281148,-6.122844,Train Station
86,"Sandycove and Glasthule , Dublin",53.288252,-6.127045,Sandycove & Glasthule Dart Station,53.28798,-6.127124,Train Station
101,"Salthill and Monkstown , Dublin",53.295391,-6.152424,Salthill & Monkstown DART Station,53.295372,-6.152205,Train Station
120,"Blackrock , Dublin",53.301864,-6.178834,Blackrock Dart Station,53.302744,-6.178733,Train Station
134,"Booterstown , Dublin",53.308629,-6.196652,Booterstown DART Station,53.310137,-6.195415,Train Station
226,"Lansdowne Road , Dublin",53.335233,-6.228178,Lansdowne Road DART Station,53.333999,-6.228834,Train Station
279,"Grand Canal Dock , Dublin",53.339819,-6.238188,Grand Canal Dock Railway Station,53.339532,-6.237297,Train Station
283,"Killiney , Dublin",53.255384,-6.11304,Killiney Railway Station,53.255588,-6.113045,Train Station
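The gap can also be found programmatically by comparing the stations that were queried with those that have a station-type venue. A sketch on a toy frame, using the column names from `dublin_venues`; the sample rows and variable names are illustrative:

```python
import pandas as pd

# Toy stand-in for dublin_venues with only the columns needed here
venues = pd.DataFrame({
    'station': ['A , Dublin', 'A , Dublin', 'B , Dublin', 'C , Dublin'],
    'venue_category': ['Train Station', 'Café', 'Pub', 'Light Rail Station'],
})

station_types = ['Train Station', 'Light Rail Station']

# Stations that returned at least one station-type venue
covered = set(venues.loc[venues['venue_category'].isin(station_types), 'station'])

# Stations that were queried but have no station-type venue
missing = sorted(set(venues['station']) - covered)
print(missing)
```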


We need to create a new dataframe with those stations and their coordinates and merge them with our main dataframe.

In [17]:
# Read in csv of missing stations

missing_stations = pd.read_csv('https://raw.githubusercontent.com/shaneconn860/data-science-capstone-py/master/extra_stations.csv')
missing_stations

Unnamed: 0,station,station_latitude,station_longitude,venue,venue_latitude,venue_longitude,venue_category
0,"Bray, Dublin",53.20452,-6.100993,Bray Daly Station,53.20452,-6.100993,Train Station
1,"Seapoint, Dublin",53.299106,-6.165473,Seapoint Train Station,53.299106,-6.165473,Train Station
2,"Pearse, Dublin",53.343327,-6.248315,Pearse Station,53.343327,-6.248315,Train Station
3,"Shankhill, Dublin",53.236508,-6.117142,Shankhill Station,53.236508,-6.117142,Train Station
4,"Glenageary, Dublin",53.281183,-6.12309,Glenageary Station,53.281183,-6.12309,Train Station
5,"Dun Laoghaire, Dublin",53.2949,-6.134548,Dun Laoghaire Station,53.2949,-6.134548,Train Station
6,"Dalkey, Dublin",53.2758,-6.1033,Dalkey Station,53.2758,-6.1033,Train Station


In [18]:
# Concatenate missing stations with main dataframe

dublin_venues2 = pd.concat([dublin_venues, missing_stations])

In [19]:
# View number of rows to verify increase

print(dublin_venues2.shape)

(293, 7)


In [20]:
dublin_venues2.to_csv('venues.csv')

The row count has increased by 7, as expected. We now have our raw dataset and are ready to explore the data and select the best possible location for the first outlet.
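One detail to watch when concatenating: the labels in `missing_stations` are written without the space before the comma (`Bray, Dublin`), while the geocoded labels have it (`Bray , Dublin`), so the same station can appear under two spellings in later group-bys. A minimal sketch of normalising the labels first, on toy data with hypothetical variable names:

```python
import pandas as pd

main = pd.DataFrame({'station': ['Dalkey , Dublin'], 'venue_category': ['Café']})
extra = pd.DataFrame({'station': ['Dalkey, Dublin'], 'venue_category': ['Train Station']})

combined = pd.concat([main, extra], ignore_index=True)

# Collapse any whitespace before the comma so both spellings match
combined['station'] = combined['station'].str.replace(r'\s+,', ',', regex=True)

print(combined['station'].nunique())
```

Spelling differences such as *Shankhill* versus *Shankill* are not covered by this and would need separate handling.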

## Methodology

### Analysis

We want to have a look at the list of business types as we may want to locate the business in an area where there are shopping centres, department stores or other amenities. Facilities in these areas indicate there would be a certain level of foot traffic meaning there should be more people around to enjoy a cup of coffee or a tasty sandwich during the normal hours of business.

In [21]:
# Get list of venue types and count them

dublin_venues2['venue_category'].value_counts()

Café                       26
Pub                        25
Coffee Shop                24
Hotel                      18
Train Station              16
Bar                        10
Italian Restaurant         10
Restaurant                  9
Supermarket                 6
Park                        5
Farmers Market              5
Bakery                      5
Gastropub                   5
Gym                         5
Plaza                       4
Chinese Restaurant          4
Breakfast Spot              4
Pizza Place                 4
Steakhouse                  3
Fish & Chips Shop           3
Ice Cream Shop              3
Wine Shop                   3
Fast Food Restaurant        3
Deli / Bodega               3
Gourmet Shop                3
Shopping Mall               3
Scenic Lookout              3
Gym / Fitness Center        3
Sports Bar                  2
Hotel Bar                   2
Bookstore                   2
Bistro                      2
Salad Place                 2
Museum    

We see from this list the types of businesses and public amenities we want to include in our exploration. These are: 

- Plaza
- Shopping Mall
- Department Store
- Clothing Store
- Waterfront
- Science Museum
- Art Museum
- Museum
- Flea Market
- Market
- Nail Salon
- Yoga Studio
- Playground

We can now refine our data even further by selecting only these venues along with cafes, coffee shops and sandwich places, as these are the types of labels that our business will be associated with. We also add the train stations to this query to ensure all of our stations are picked up.

In [22]:
# Return only venues that we specifically define

dub3 = dublin_venues2[dublin_venues2.venue_category.isin(['Train Station', 'Light Rail Station',
                                                'Café', 'Coffee Shop', 'Sandwich Place', 'Plaza',
                                                       'Shopping Mall', 'Department Store', 'Clothing Store',
                                                       'Waterfront', 'Science Museum', 'Art Museum',
                                                       'Museum', 'Flea Market', 'Market', 'Nail Salon', 'Yoga Studio', 'Playground'])]
dub3

Unnamed: 0,station,station_latitude,station_longitude,venue,venue_latitude,venue_longitude,venue_category
0,"Pearse , Dublin",53.343335,-6.248463,Science Gallery,53.344186,-6.250524,Science Museum
3,"Pearse , Dublin",53.343335,-6.248463,Science Gallery Café,53.344348,-6.250779,Coffee Shop
4,"Pearse , Dublin",53.343335,-6.248463,Honey Truffle,53.344089,-6.248893,Coffee Shop
7,"Pearse , Dublin",53.343335,-6.248463,Arabica Coffee House,53.343676,-6.247063,Coffee Shop
11,"Pearse , Dublin",53.343335,-6.248463,National Gallery of Ireland,53.341558,-6.252528,Art Museum
12,"Pearse , Dublin",53.343335,-6.248463,Coffeeangel,53.342071,-6.254019,Coffee Shop
14,"Pearse , Dublin",53.343335,-6.248463,The Gallery Café,53.341766,-6.252516,Café
17,"Pearse , Dublin",53.343335,-6.248463,Lolly & Cooks,53.34109,-6.245572,Café
18,"Pearse , Dublin",53.343335,-6.248463,Pearse Square,53.343367,-6.241793,Plaza
28,"Pearse , Dublin",53.343335,-6.248463,Angel Park Café & Deli,53.339326,-6.245574,Café


Now we can produce our first visual display of these businesses and areas which will make things much easier for identifying the most ideal locations to set up our business.

In [38]:
# create map of Dublin venues using latitude and longitude values

map_venues = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for city, lat, lng in zip (dub3['venue'], dub3['venue_latitude'], dub3['venue_longitude']):
    label = '{}'.format(city)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_venues)  
    
map_venues

We can now also group this data by the number of businesses per station location, telling us easily which locations we might want to avoid (a large number of cafes at one location) and which ones we are more likely to consider (a smaller number).

In [24]:
# Print venue type, group by Station

dub_grouped = dub3.groupby(['station', 'venue_category']).size().to_frame('size').reset_index().sort_values(['station', 'size'], ascending=[True, False])
dub_grouped

Unnamed: 0,station,venue_category,size
0,"Blackrock , Dublin",Café,3
1,"Blackrock , Dublin",Coffee Shop,3
4,"Blackrock , Dublin",Shopping Mall,2
2,"Blackrock , Dublin",Department Store,1
3,"Blackrock , Dublin",Flea Market,1
5,"Blackrock , Dublin",Train Station,1
6,"Booterstown , Dublin",Playground,1
7,"Booterstown , Dublin",Train Station,1
8,"Bray, Dublin",Train Station,1
9,"Dalkey , Dublin",Café,2


We can use **one hot encoding** to analyze the area around each station in more detail, taking the mean of the frequency of occurrence of each venue.

In [41]:
# one hot encoding
dublin_onehot = pd.get_dummies(dub3[['venue_category']], prefix="", prefix_sep="")

# add station column back to dataframe
dublin_onehot['Station'] = dub3['station'] 

# move station column to the first column
fixed_columns = [dublin_onehot.columns[-1]] + list(dublin_onehot.columns[:-1])
dublin_onehot = dublin_onehot[fixed_columns]

dublin_grouped = dublin_onehot.groupby('Station').mean().reset_index()
dublin_grouped

Unnamed: 0,Station,Art Museum,Café,Clothing Store,Coffee Shop,Department Store,Flea Market,Light Rail Station,Museum,Nail Salon,Playground,Plaza,Sandwich Place,Science Museum,Shopping Mall,Train Station,Waterfront,Yoga Studio
0,"Blackrock , Dublin",0.0,0.272727,0.0,0.272727,0.090909,0.090909,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.181818,0.090909,0.0,0.0
1,"Booterstown , Dublin",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.5,0.0,0.0
2,"Bray, Dublin",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,"Dalkey , Dublin",0.0,0.666667,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"Dalkey, Dublin",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
5,"Dun Laoghaire , Dublin",0.0,0.307692,0.076923,0.538462,0.0,0.0,0.0,0.076923,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,"Dun Laoghaire, Dublin",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
7,"Glenageary , Dublin",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
8,"Glenageary, Dublin",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
9,"Grand Canal Dock , Dublin",0.0,0.357143,0.0,0.214286,0.0,0.0,0.0,0.0,0.0,0.0,0.214286,0.071429,0.0,0.0,0.071429,0.0,0.071429
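To see why the mean of the one-hot columns is a frequency, consider a toy station with three venues, two of which are cafés. The data below is invented for illustration:

```python
import pandas as pd

toy = pd.DataFrame({
    'station': ['X', 'X', 'X', 'Y'],
    'venue_category': ['Café', 'Café', 'Plaza', 'Train Station'],
})

# One indicator column per category, 1.0 where the row belongs to it
onehot = pd.get_dummies(toy['venue_category'], dtype=float)
onehot['station'] = toy['station']

# Averaging the indicators gives each category's share of the station's venues
freq = onehot.groupby('station').mean()
print(freq.round(2))
```

Station X ends up with a Café share of two thirds and a Plaza share of one third, which is the same kind of figure shown in the table above.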


Let's print each area along with the top 5 most common venues:

In [40]:
# Print top 5 venues per area based on mean figure calculated above

num_top_venues = 5

for area in dublin_grouped['Station']:
    print("----"+area+"----")
    temp = dublin_grouped[dublin_grouped['Station'] == area].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Blackrock , Dublin----
              venue  freq
0       Coffee Shop  0.27
1              Café  0.27
2     Shopping Mall  0.18
3  Department Store  0.09
4       Flea Market  0.09


----Booterstown , Dublin----
           venue  freq
0     Playground   0.5
1  Train Station   0.5
2     Art Museum   0.0
3     Waterfront   0.0
4  Shopping Mall   0.0


----Bray, Dublin----
           venue  freq
0  Train Station   1.0
1     Art Museum   0.0
2     Playground   0.0
3     Waterfront   0.0
4  Shopping Mall   0.0


----Dalkey , Dublin----
         venue  freq
0         Café  0.67
1  Coffee Shop  0.33
2   Art Museum  0.00
3        Plaza  0.00
4   Waterfront  0.00


----Dalkey, Dublin----
           venue  freq
0  Train Station   1.0
1     Art Museum   0.0
2     Playground   0.0
3     Waterfront   0.0
4  Shopping Mall   0.0


----Dun Laoghaire , Dublin----
            venue  freq
0     Coffee Shop  0.54
1            Café  0.31
2  Clothing Store  0.08
3          Museum  0.08
4      Art Museum  

In [42]:
# Sort venues in descending order

def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Now let's create the new dataframe and display the top 10 venues for each area.

In [45]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Station']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
stations_venues_sorted = pd.DataFrame(columns=columns)
stations_venues_sorted['Station'] = dublin_grouped['Station']

for ind in np.arange(dublin_grouped.shape[0]):
    stations_venues_sorted.iloc[ind, 1:] = return_most_common_venues(dublin_grouped.iloc[ind, :], num_top_venues)

#stations_venues_sorted.drop_duplicates()
stations_venues_sorted.head(17)

Unnamed: 0,Station,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Blackrock , Dublin",Café,Coffee Shop,Shopping Mall,Train Station,Department Store,Flea Market,Yoga Studio,Light Rail Station,Clothing Store,Nail Salon
1,"Booterstown , Dublin",Train Station,Playground,Yoga Studio,Light Rail Station,Café,Clothing Store,Coffee Shop,Department Store,Flea Market,Nail Salon
2,"Bray, Dublin",Train Station,Yoga Studio,Museum,Café,Clothing Store,Coffee Shop,Department Store,Flea Market,Light Rail Station,Nail Salon
3,"Dalkey , Dublin",Café,Coffee Shop,Yoga Studio,Museum,Clothing Store,Department Store,Flea Market,Light Rail Station,Nail Salon,Waterfront
4,"Dalkey, Dublin",Train Station,Yoga Studio,Museum,Café,Clothing Store,Coffee Shop,Department Store,Flea Market,Light Rail Station,Nail Salon
5,"Dun Laoghaire , Dublin",Coffee Shop,Café,Museum,Clothing Store,Yoga Studio,Department Store,Flea Market,Light Rail Station,Nail Salon,Waterfront
6,"Dun Laoghaire, Dublin",Train Station,Yoga Studio,Museum,Café,Clothing Store,Coffee Shop,Department Store,Flea Market,Light Rail Station,Nail Salon
7,"Glenageary , Dublin",Train Station,Yoga Studio,Museum,Café,Clothing Store,Coffee Shop,Department Store,Flea Market,Light Rail Station,Nail Salon
8,"Glenageary, Dublin",Train Station,Yoga Studio,Museum,Café,Clothing Store,Coffee Shop,Department Store,Flea Market,Light Rail Station,Nail Salon
9,"Grand Canal Dock , Dublin",Café,Plaza,Coffee Shop,Yoga Studio,Train Station,Sandwich Place,Light Rail Station,Clothing Store,Department Store,Flea Market
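The `indicators` list above is enough for ten columns, but the same pattern would label a twenty-first column as "21th". A small sketch of a general ordinal helper, should more columns ever be needed; the function name `ordinal` is hypothetical:

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        # 11th, 12th and 13th are exceptions to the last-digit rule
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

columns = ['Station'] + ['{} Most Common Venue'.format(ordinal(i)) for i in range(1, 11)]
print(columns[:4])
```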


### Clustering of Stations

The next step in the analysis is to perform clustering of our stations, so we can easily see the main clusters we want to explore.

In [46]:
# set number of clusters
kclusters = 5

dublin_clustering = dublin_grouped.drop(columns='Station')  # positional axis argument was removed in pandas 2.0
dublin_clustering

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(dublin_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([2, 3, 1, 0, 1, 2, 1, 1, 1, 2], dtype=int32)
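The value `kclusters = 5` is fixed by hand. One common way to sanity-check such a choice is to compare silhouette scores over a range of k. The sketch below does this on synthetic data, since the real frequency table is not reproduced here; the data, the range of k and the variable names are all assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Synthetic stand-in for the station-by-category table: three loose groups
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(scale=0.3, size=(10, 2)) for c in centers])

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=0, n_init=10).fit_predict(X)
    # Silhouette ranges from -1 to 1; higher means tighter, better-separated clusters
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

With only a handful of stations, k has to stay well below the number of rows, because `silhouette_score` needs at least 2 and at most n_samples - 1 distinct labels.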

In [47]:
# add clustering labels
stations_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

In [48]:
dublin_merged = dub3

# join the cluster labels and top venues onto each venue row, matching on station name
dublin_merged = dublin_merged.join(stations_venues_sorted.set_index('Station'), on='station')

dublin_merged.head(5)

Unnamed: 0,station,station_latitude,station_longitude,venue,venue_latitude,venue_longitude,venue_category,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Pearse , Dublin",53.343335,-6.248463,Science Gallery,53.344186,-6.250524,Science Museum,2,Coffee Shop,Café,Art Museum,Science Museum,Sandwich Place,Plaza,Light Rail Station,Clothing Store,Department Store,Flea Market
3,"Pearse , Dublin",53.343335,-6.248463,Science Gallery Café,53.344348,-6.250779,Coffee Shop,2,Coffee Shop,Café,Art Museum,Science Museum,Sandwich Place,Plaza,Light Rail Station,Clothing Store,Department Store,Flea Market
4,"Pearse , Dublin",53.343335,-6.248463,Honey Truffle,53.344089,-6.248893,Coffee Shop,2,Coffee Shop,Café,Art Museum,Science Museum,Sandwich Place,Plaza,Light Rail Station,Clothing Store,Department Store,Flea Market
7,"Pearse , Dublin",53.343335,-6.248463,Arabica Coffee House,53.343676,-6.247063,Coffee Shop,2,Coffee Shop,Café,Art Museum,Science Museum,Sandwich Place,Plaza,Light Rail Station,Clothing Store,Department Store,Flea Market
11,"Pearse , Dublin",53.343335,-6.248463,National Gallery of Ireland,53.341558,-6.252528,Art Museum,2,Coffee Shop,Café,Art Museum,Science Museum,Sandwich Place,Plaza,Light Rail Station,Clothing Store,Department Store,Flea Market


In [49]:
# Remove unwanted columns for clustering process

# drop the venue-level columns in one call (positional axis argument was removed in pandas 2.0)
df4 = dublin_merged.drop(columns=['venue', 'venue_category', 'venue_latitude', 'venue_longitude'])

df4 = df4.drop_duplicates().reset_index()

df4

Unnamed: 0,index,station,station_latitude,station_longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,0,"Pearse , Dublin",53.343335,-6.248463,2,Coffee Shop,Café,Art Museum,Science Museum,Sandwich Place,Plaza,Light Rail Station,Clothing Store,Department Store,Flea Market
1,32,"Sydney Parade , Dublin",53.320787,-6.211552,2,Train Station,Coffee Shop,Waterfront,Yoga Studio,Light Rail Station,Café,Clothing Store,Department Store,Flea Market,Nail Salon
2,40,"Shankill , Dublin",53.230228,-6.124181,4,Shopping Mall,Coffee Shop,Yoga Studio,Museum,Café,Clothing Store,Department Store,Flea Market,Light Rail Station,Nail Salon
3,52,"Sandymount , Dublin",53.327928,-6.22105,0,Café,Light Rail Station,Yoga Studio,Museum,Clothing Store,Coffee Shop,Department Store,Flea Market,Nail Salon,Waterfront
4,68,"Glenageary , Dublin",53.281238,-6.123108,1,Train Station,Yoga Studio,Museum,Café,Clothing Store,Coffee Shop,Department Store,Flea Market,Light Rail Station,Nail Salon
5,70,"Sandycove and Glasthule , Dublin",53.288252,-6.127045,2,Coffee Shop,Café,Train Station,Playground,Yoga Studio,Light Rail Station,Clothing Store,Department Store,Flea Market,Nail Salon
6,93,"Salthill and Monkstown , Dublin",53.295391,-6.152424,0,Train Station,Café,Yoga Studio,Museum,Clothing Store,Coffee Shop,Department Store,Flea Market,Light Rail Station,Nail Salon
7,105,"Blackrock , Dublin",53.301864,-6.178834,2,Café,Coffee Shop,Shopping Mall,Train Station,Department Store,Flea Market,Yoga Studio,Light Rail Station,Clothing Store,Nail Salon
8,132,"Booterstown , Dublin",53.308629,-6.196652,3,Train Station,Playground,Yoga Studio,Light Rail Station,Café,Clothing Store,Coffee Shop,Department Store,Flea Market,Nail Salon
9,141,"Dalkey , Dublin",53.275624,-6.103204,0,Café,Coffee Shop,Yoga Studio,Museum,Clothing Store,Department Store,Flea Market,Light Rail Station,Nail Salon,Waterfront


Now let's create a map with these clusters.

In [50]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map, colored by cluster label
for lat, lon, poi, cluster in zip(df4['station_latitude'], df4['station_longitude'], df4['station'], df4['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)],  # labels run from 0 to kclusters - 1
        fill=True,
        fill_color=rainbow[int(cluster)],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [51]:
df4.head()

Unnamed: 0,index,station,station_latitude,station_longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,0,"Pearse , Dublin",53.343335,-6.248463,2,Coffee Shop,Café,Art Museum,Science Museum,Sandwich Place,Plaza,Light Rail Station,Clothing Store,Department Store,Flea Market
1,32,"Sydney Parade , Dublin",53.320787,-6.211552,2,Train Station,Coffee Shop,Waterfront,Yoga Studio,Light Rail Station,Café,Clothing Store,Department Store,Flea Market,Nail Salon
2,40,"Shankill , Dublin",53.230228,-6.124181,4,Shopping Mall,Coffee Shop,Yoga Studio,Museum,Café,Clothing Store,Department Store,Flea Market,Light Rail Station,Nail Salon
3,52,"Sandymount , Dublin",53.327928,-6.22105,0,Café,Light Rail Station,Yoga Studio,Museum,Clothing Store,Coffee Shop,Department Store,Flea Market,Nail Salon,Waterfront
4,68,"Glenageary , Dublin",53.281238,-6.123108,1,Train Station,Yoga Studio,Museum,Café,Clothing Store,Coffee Shop,Department Store,Flea Market,Light Rail Station,Nail Salon


In [52]:
# Remove duplicate rows from df4 (this second reset_index() adds the 'level_0' column seen below)

df4 = df4.drop_duplicates().reset_index()
df4

Unnamed: 0,level_0,index,station,station_latitude,station_longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,0,0,"Pearse , Dublin",53.343335,-6.248463,2,Coffee Shop,Café,Art Museum,Science Museum,Sandwich Place,Plaza,Light Rail Station,Clothing Store,Department Store,Flea Market
1,1,32,"Sydney Parade , Dublin",53.320787,-6.211552,2,Train Station,Coffee Shop,Waterfront,Yoga Studio,Light Rail Station,Café,Clothing Store,Department Store,Flea Market,Nail Salon
2,2,40,"Shankill , Dublin",53.230228,-6.124181,4,Shopping Mall,Coffee Shop,Yoga Studio,Museum,Café,Clothing Store,Department Store,Flea Market,Light Rail Station,Nail Salon
3,3,52,"Sandymount , Dublin",53.327928,-6.22105,0,Café,Light Rail Station,Yoga Studio,Museum,Clothing Store,Coffee Shop,Department Store,Flea Market,Nail Salon,Waterfront
4,4,68,"Glenageary , Dublin",53.281238,-6.123108,1,Train Station,Yoga Studio,Museum,Café,Clothing Store,Coffee Shop,Department Store,Flea Market,Light Rail Station,Nail Salon
5,5,70,"Sandycove and Glasthule , Dublin",53.288252,-6.127045,2,Coffee Shop,Café,Train Station,Playground,Yoga Studio,Light Rail Station,Clothing Store,Department Store,Flea Market,Nail Salon
6,6,93,"Salthill and Monkstown , Dublin",53.295391,-6.152424,0,Train Station,Café,Yoga Studio,Museum,Clothing Store,Coffee Shop,Department Store,Flea Market,Light Rail Station,Nail Salon
7,7,105,"Blackrock , Dublin",53.301864,-6.178834,2,Café,Coffee Shop,Shopping Mall,Train Station,Department Store,Flea Market,Yoga Studio,Light Rail Station,Clothing Store,Nail Salon
8,8,132,"Booterstown , Dublin",53.308629,-6.196652,3,Train Station,Playground,Yoga Studio,Light Rail Station,Café,Clothing Store,Coffee Shop,Department Store,Flea Market,Nail Salon
9,9,141,"Dalkey , Dublin",53.275624,-6.103204,0,Café,Coffee Shop,Yoga Studio,Museum,Clothing Store,Department Store,Flea Market,Light Rail Station,Nail Salon,Waterfront


Let's examine each cluster further.

In [53]:
cluster0 = df4.loc[df4['Cluster Labels'] == 0] 
cluster0

Unnamed: 0,level_0,index,station,station_latitude,station_longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
3,3,52,"Sandymount , Dublin",53.327928,-6.22105,0,Café,Light Rail Station,Yoga Studio,Museum,Clothing Store,Coffee Shop,Department Store,Flea Market,Nail Salon,Waterfront
6,6,93,"Salthill and Monkstown , Dublin",53.295391,-6.152424,0,Train Station,Café,Yoga Studio,Museum,Clothing Store,Coffee Shop,Department Store,Flea Market,Light Rail Station,Nail Salon
9,9,141,"Dalkey , Dublin",53.275624,-6.103204,0,Café,Coffee Shop,Yoga Studio,Museum,Clothing Store,Department Store,Flea Market,Light Rail Station,Nail Salon,Waterfront
11,11,207,"Lansdowne Road , Dublin",53.335233,-6.228178,0,Café,Nail Salon,Museum,Train Station,Clothing Store,Coffee Shop,Department Store,Flea Market,Light Rail Station,Yoga Studio


In [59]:
cluster1 = df4.loc[df4['Cluster Labels'] == 1]
cluster1

Unnamed: 0,level_0,index,station,station_latitude,station_longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
4,4,68,"Glenageary , Dublin",53.281238,-6.123108,1,Train Station,Yoga Studio,Museum,Café,Clothing Store,Coffee Shop,Department Store,Flea Market,Light Rail Station,Nail Salon
13,13,283,"Killiney , Dublin",53.255384,-6.11304,1,Train Station,Yoga Studio,Museum,Café,Clothing Store,Coffee Shop,Department Store,Flea Market,Light Rail Station,Nail Salon
14,14,0,"Bray, Dublin",53.20452,-6.100993,1,Train Station,Yoga Studio,Museum,Café,Clothing Store,Coffee Shop,Department Store,Flea Market,Light Rail Station,Nail Salon
15,15,1,"Seapoint, Dublin",53.299106,-6.165473,1,Train Station,Yoga Studio,Museum,Café,Clothing Store,Coffee Shop,Department Store,Flea Market,Light Rail Station,Nail Salon
16,16,2,"Pearse, Dublin",53.343327,-6.248315,1,Train Station,Yoga Studio,Museum,Café,Clothing Store,Coffee Shop,Department Store,Flea Market,Light Rail Station,Nail Salon
17,17,3,"Shankhill, Dublin",53.236508,-6.117142,1,Train Station,Yoga Studio,Museum,Café,Clothing Store,Coffee Shop,Department Store,Flea Market,Light Rail Station,Nail Salon
18,18,4,"Glenageary, Dublin",53.281183,-6.12309,1,Train Station,Yoga Studio,Museum,Café,Clothing Store,Coffee Shop,Department Store,Flea Market,Light Rail Station,Nail Salon
19,19,5,"Dun Laoghaire, Dublin",53.2949,-6.134548,1,Train Station,Yoga Studio,Museum,Café,Clothing Store,Coffee Shop,Department Store,Flea Market,Light Rail Station,Nail Salon
20,20,6,"Dalkey, Dublin",53.2758,-6.1033,1,Train Station,Yoga Studio,Museum,Café,Clothing Store,Coffee Shop,Department Store,Flea Market,Light Rail Station,Nail Salon
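
The tables above list some stations twice under slightly different spellings, for example "Glenageary , Dublin" and "Glenageary, Dublin". Because the strings differ, an exact-match `drop_duplicates()` treats them as separate stations. The sketch below shows one possible way to merge such rows by normalising the whitespace in the station name first. It uses a small hand-copied sample rather than the full `df4`, the helper `normalise_station` is a hypothetical name, and whether these rows should really be merged is an assumption that would need checking against the source data.

```python
import re

import pandas as pd

# Small sample copied from the tables above (not the full df4).
stations = pd.DataFrame({
    "station": [
        "Glenageary , Dublin",
        "Killiney , Dublin",
        "Glenageary, Dublin",
        "Dalkey, Dublin",
    ],
    "station_latitude": [53.281238, 53.255384, 53.281183, 53.2758],
    "station_longitude": [-6.123108, -6.11304, -6.12309, -6.1033],
})


def normalise_station(name: str) -> str:
    """Collapse repeated whitespace and remove any space before a comma."""
    collapsed = " ".join(name.split())
    return re.sub(r"\s+,", ",", collapsed)


# Normalise the names, then keep the first row seen for each station.
# The coordinates of the two spellings differ slightly, so we deduplicate
# on the name only (an assumption, not the notebook's original behaviour).
deduped = (
    stations.assign(station=stations["station"].map(normalise_station))
    .drop_duplicates(subset="station", keep="first")
    .reset_index(drop=True)
)

print(deduped)
```

Note that this only handles whitespace differences. A spelling variant such as "Shankhill" versus "Shankill" would still need to be corrected by hand or with an explicit mapping.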


In [56]:
cluster2 = df4.loc[df4['Cluster Labels'] == 2]
cluster2

Unnamed: 0,level_0,index,station,station_latitude,station_longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,0,0,"Pearse , Dublin",53.343335,-6.248463,2,Coffee Shop,Café,Art Museum,Science Museum,Sandwich Place,Plaza,Light Rail Station,Clothing Store,Department Store,Flea Market
1,1,32,"Sydney Parade , Dublin",53.320787,-6.211552,2,Train Station,Coffee Shop,Waterfront,Yoga Studio,Light Rail Station,Café,Clothing Store,Department Store,Flea Market,Nail Salon
5,5,70,"Sandycove and Glasthule , Dublin",53.288252,-6.127045,2,Coffee Shop,Café,Train Station,Playground,Yoga Studio,Light Rail Station,Clothing Store,Department Store,Flea Market,Nail Salon
7,7,105,"Blackrock , Dublin",53.301864,-6.178834,2,Café,Coffee Shop,Shopping Mall,Train Station,Department Store,Flea Market,Yoga Studio,Light Rail Station,Clothing Store,Nail Salon
10,10,167,"Dun Laoghaire , Dublin",53.292279,-6.136008,2,Coffee Shop,Café,Museum,Clothing Store,Yoga Studio,Department Store,Flea Market,Light Rail Station,Nail Salon,Waterfront
12,12,234,"Grand Canal Dock , Dublin",53.339819,-6.238188,2,Café,Plaza,Coffee Shop,Yoga Studio,Train Station,Sandwich Place,Light Rail Station,Clothing Store,Department Store,Flea Market


In [57]:
cluster3 = df4.loc[df4['Cluster Labels'] == 3]
cluster3

Unnamed: 0,level_0,index,station,station_latitude,station_longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
8,8,132,"Booterstown , Dublin",53.308629,-6.196652,3,Train Station,Playground,Yoga Studio,Light Rail Station,Café,Clothing Store,Coffee Shop,Department Store,Flea Market,Nail Salon


In [58]:
cluster4 = df4.loc[df4['Cluster Labels'] == 4]
cluster4

Unnamed: 0,level_0,index,station,station_latitude,station_longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
2,2,40,"Shankill , Dublin",53.230228,-6.124181,4,Shopping Mall,Coffee Shop,Yoga Studio,Museum,Café,Clothing Store,Department Store,Flea Market,Light Rail Station,Nail Salon
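
Rather than filtering one cluster at a time, the per-cluster views above can also be condensed into a single summary table. The following is a minimal sketch of that idea using `groupby`. It runs on a hand-copied subset of the rows shown above with shortened station names, and the function name `summarise_clusters` and the output column names are our own illustrative choices, not part of the original notebook.

```python
import pandas as pd

# Hand-copied subset of df4 (station names shortened for readability).
stations = pd.DataFrame({
    "station": [
        "Pearse", "Sydney Parade", "Shankill", "Sandymount",
        "Glenageary", "Booterstown", "Dalkey",
    ],
    "Cluster Labels": [2, 2, 4, 0, 1, 3, 0],
    "1st Most Common Venue": [
        "Coffee Shop", "Train Station", "Shopping Mall", "Café",
        "Train Station", "Train Station", "Café",
    ],
})


def summarise_clusters(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per cluster: size, member stations, typical top venue."""
    return df.groupby("Cluster Labels").agg(
        n_stations=("station", "count"),
        stations=("station", lambda s: ", ".join(s)),
        # mode() returns the most frequent value(s); take the first on ties.
        top_venue=("1st Most Common Venue", lambda s: s.mode().iloc[0]),
    )


summary = summarise_clusters(stations)
print(summary)
```

On the full `df4`, the same call would give a quick overview of how many stations fall into each cluster and which venue type dominates it.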


## Results

Now that we have performed our main analysis, we can start to deduce the best locations for the business. Based on the maps and tables above, we will split the areas into three sections.

**1. Stations with a lack of services**

Right off the bat, we can omit certain areas from consideration. We included the three furthest stations, but they are too far from the city and have very few other services around them:

- Killiney
- Shankill
- Bray

Despite being relatively close to the centre of Dublin, the following stations have no meaningful services around them, so it would not be wise to place a coffee shop or cafe here:

- Booterstown
- Seapoint

**2. Stations with tough competition**

The areas around two stations would present fierce competition: **Blackrock** and **Dun Laoghaire**. Although both are built-up areas with many services, each already has a high number of coffee shops (Blackrock: 6, Dun Laoghaire: 12). Closer to the city, Pearse and Grand Canal Dock also have tough competition and, because of their proximity to the city centre, would not satisfy the criterion of being a truly South Dublin brand, as set out in the Introduction.
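
How the competitor counts quoted above were derived is not shown in this section. One possible way to obtain such counts from a venue-level table shaped like `dublin_merged` (with `station`, `venue` and `venue_category` columns) is sketched below. The rows and venue names are made up for illustration, `competitor_counts` is a hypothetical helper, and treating both "Coffee Shop" and "Café" as competitors is our assumption.

```python
import pandas as pd

# Made-up rows in the shape of dublin_merged (venue names are placeholders).
venues = pd.DataFrame({
    "station": [
        "Blackrock", "Blackrock", "Blackrock",
        "Dalkey", "Dalkey", "Booterstown",
    ],
    "venue": ["Venue A", "Venue B", "Venue C", "Venue D", "Venue E", "Venue F"],
    "venue_category": [
        "Coffee Shop", "Café", "Shopping Mall",
        "Café", "Museum", "Playground",
    ],
})

# Assumption: these two categories are the direct competitors.
COMPETITOR_CATEGORIES = ["Coffee Shop", "Café"]


def competitor_counts(df: pd.DataFrame, categories=COMPETITOR_CATEGORIES) -> pd.DataFrame:
    """Count competitor venues and all venues around each station."""
    flagged = df.assign(is_competitor=df["venue_category"].isin(categories))
    counts = flagged.groupby("station").agg(
        competitors=("is_competitor", "sum"),  # True counts as 1
        total_venues=("venue", "count"),
    )
    return counts.sort_values("competitors", ascending=False)


counts = competitor_counts(venues)
print(counts)
```

Stations with a high `competitors` value correspond to the tough-competition group, while a low `total_venues` value points to the stations lacking services.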

**3. Areas to consider**

- Lansdowne Road
- Sandymount
- Sydney Parade
- Monkstown
- Sandycove & Glasthule
- Dalkey