## Day 46 Lecture 1 Assignment

In this assignment, we will calculate a distance matrix for geographical Starbucks data and use it to identify locations that are close together and far apart. We will perform clustering on this dataset later on.

We will be using the "haversine" package to compute geographical distance. It can be pip installed.

In [3]:
!pip install haversine



In [2]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from haversine import haversine

Below is a convenience function for calculating geographical distance matrices using lat-long data.

In [1]:
def geo_sim_matrix(df, col_name = 'Coordinates'):
    """
    A function that computes a geographical distance matrix (in miles).
    Each row in the dataframe should correspond to one location.
    In addition, the dataframe must have a column containing the lat-long of each location as a tuple (i.e. (lat, long)).
    
    Parameters:
        df (pandas dataframe): an nxm dataframe containing the locations to compute similarities between.
        col (string): the name of the column containing the lat-long tuples.
        
    Returns:
        distance (pandas dataframe): an nxn distance matrix between the geographical coordinates of each location.
    """
    
    df = df.copy()
    df.reset_index(inplace=True)
    haver_vec = np.vectorize(haversine, otypes=[np.float32])
    distance = df.groupby('index').apply(lambda x: pd.Series(haver_vec(df[col_name], x[col_name])))
    distance = distance / 1.609344  # converts to miles
    distance.columns = distance.index
    
    return distance


This dataset contains the latitude and longitude (as well as several other details we will not be using) of every Starbucks in the world as of February 2017. Each row consists of the following features, which are generally self-explanatory:

- Brand
- Store Number
- Store Name
- Ownership Type
- Street Address
- City
- State/Province
- Country
- Postcode
- Phone Number
- Timezone
- Longitude
- Latitude

Load in the dataset.

In [4]:
# answer goes here
locations = pd.read_csv('https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/Data%20Sets%20Clustering/starbucks_locations.csv')

locations.head()

Unnamed: 0,Brand,Store Number,Store Name,Ownership Type,Street Address,City,State/Province,Country,Postcode,Phone Number,Timezone,Longitude,Latitude
0,Starbucks,47370-257954,"Meritxell, 96",Licensed,"Av. Meritxell, 96",Andorra la Vella,7,AD,AD500,376818720.0,GMT+1:00 Europe/Andorra,1.53,42.51
1,Starbucks,22331-212325,Ajman Drive Thru,Licensed,"1 Street 69, Al Jarf",Ajman,AJ,AE,,,GMT+04:00 Asia/Dubai,55.47,25.42
2,Starbucks,47089-256771,Dana Mall,Licensed,Sheikh Khalifa Bin Zayed St.,Ajman,AJ,AE,,,GMT+04:00 Asia/Dubai,55.47,25.39
3,Starbucks,22126-218024,Twofour 54,Licensed,Al Salam Street,Abu Dhabi,AZ,AE,,,GMT+04:00 Asia/Dubai,54.38,24.48
4,Starbucks,17127-178586,Al Ain Tower,Licensed,"Khaldiya Area, Abu Dhabi Island",Abu Dhabi,AZ,AE,,,GMT+04:00 Asia/Dubai,54.54,24.51


Begin by narrowing down the dataset to a specific geographic area of interest. Since we will need to manually compute a distance matrix, which will be on the order of $n^{2}$ in terms of size, we would recommend choosing an area with 3000 or less locations. In this example, we will use Hawaii, which has about 100 locations; for reference, California has about 2800 locations. Feel free to choose a different region that is of more interest to you, if desired.

Subset the dataframe to only include records for Starbucks locations in Hawaii.

In [5]:
# answer goes here

locations.Country.value_counts().head()

US    13608
CN     2734
CA     1468
JP     1237
KR      993
Name: Country, dtype: int64

In [6]:
US_locations = locations.loc[locations['Country'] == 'US']
US_locations['Timezone'].value_counts().head()

GMT-05:00 America/New_York       4871
GMT-08:00 America/Los_Angeles    4194
GMT-06:00 America/Chicago        2901
GMT-07:00 America/Denver          804
GMT+000000 America/Phoenix        487
Name: Timezone, dtype: int64

In [7]:
GMT6_US_locations = US_locations.loc[US_locations['Timezone'].str.contains('GMT-06:00')]
GMT6_US_locations.head()

Unnamed: 0,Brand,Store Number,Store Name,Ownership Type,Street Address,City,State/Province,Country,Postcode,Phone Number,Timezone,Longitude,Latitude
12013,Starbucks,76795-96265,Target Alabaster T-2276,Licensed,250 S Colonial Dr,Alabaster,AL,US,350074657,205-564-2608,GMT-06:00 America/Chicago,-86.81,33.23
12014,Starbucks,13210-95453,I-65 & US HWY 31,Company Owned,345 South Colonial Dr,Alabaster,AL,US,350074690,205-664-3797,GMT-06:00 America/Chicago,-86.8,33.22
12015,Starbucks,10796-102254,Hwy 119 & Kent Dairy,Company Owned,2171 Kent Dairy Rd,Alabaster,AL,US,350075387,205-685-9705,GMT-06:00 America/Chicago,-86.83,33.21
12016,Starbucks,10248-100069,Hwy 72 & Braly,Company Owned,1286 Hwy 72 East,Athens,AL,US,356114404,256-230-9385,GMT-06:00 America/Chicago,-86.95,34.78
12017,Starbucks,47255-121766,"Kroger-Auburn, AL #260",Licensed,300 Dean Rd,Auburn,AL,US,368304404,334-821-1325,GMT-06:00 America/Chicago,-85.46,32.61


In [8]:
missings = GMT6_US_locations.isna().sum()*100/GMT6_US_locations.count()
missings.sort_values(ascending=False)

Phone Number      3.978495
Latitude          0.000000
Longitude         0.000000
Timezone          0.000000
Postcode          0.000000
Country           0.000000
State/Province    0.000000
City              0.000000
Street Address    0.000000
Ownership Type    0.000000
Store Name        0.000000
Store Number      0.000000
Brand             0.000000
dtype: float64

The haversine package takes tuples with 2 numeric elements and interprets them as lat-long to calculate distance, so add a new column called "Coordinates" that converts the lat and long in each row into a tuple. In other words, the last two columns of the dataframe should initially look like this:

**Latitude, Longitude**  
-121.64, 39.14  
-116.40, 34.13  
...

After adding the new column, the last three columns should look like this:

**Latitude, Longitude, Coordinates**  
-121.64, 39.14, (-121.64, 39.14)  
-116.40, 34.13, (-116.40, 34.13)  
...

In [9]:
# answer goes here

GMT6_US_locations['Coordinates'] = tuple(zip(GMT6_US_locations['Latitude'], GMT6_US_locations['Longitude']))
GMT6_US_locations['Coordinates'].head()

12013    (33.23, -86.81)
12014     (33.22, -86.8)
12015    (33.21, -86.83)
12016    (34.78, -86.95)
12017    (32.61, -85.46)
Name: Coordinates, dtype: object

Calculate the distance matrix using the starter code/function geo_sim_matrix() provided earlier in the notebook. It assumes the column containing the coordinates for each location is called "Coordinates". Examine the docstring for more details.

Note: the latitude and longitudes provided only go out to two decimal places, which limits the resolution of the distance calculations to about 0.5 miles. Distances that are very small may not be accurately represented here (e.g. several instances of "0 distance" for distinct Starbucks locations in very close proximity).

In [10]:
# answer goes here

geo_sim_matrix(GMT6_US_locations)

index,12013,12014,12015,12016,12017,12018,12019,12020,12021,12022,...,25514,25515,25516,25517,25518,25519,25520,25521,25522,25523
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
12013,0.000000,0.900808,1.801659,107.394554,89.250038,88.234200,88.070801,9.786924,14.976054,21.368532,...,681.602051,680.437866,683.186096,679.063721,677.689697,706.564575,679.336365,679.336365,685.922485,735.638123
12014,0.900808,0.000000,1.866726,108.127602,88.414093,87.396118,87.229431,10.684411,15.411889,21.910694,...,682.342041,681.179443,683.927490,679.805420,678.431519,707.308899,680.074890,680.074890,686.658264,736.432434
12015,1.801659,1.866726,0.000000,108.694221,89.629837,88.602936,88.421814,10.105261,16.705015,23.002934,...,682.876160,681.707886,684.456726,680.333557,678.959229,707.827698,680.613831,680.613831,687.206848,736.756775
12016,107.394554,108.127602,108.694221,0.000000,172.668030,172.100784,172.424911,100.192001,94.281158,87.362122,...,574.222595,573.065247,575.812500,571.691650,570.318115,599.206299,571.952087,571.952087,578.531311,629.319641
12017,89.250038,88.414093,89.629837,172.668030,0.000000,1.164028,1.877870,98.667526,91.945122,96.761063,...,734.532532,733.593750,736.298096,732.241699,730.889832,759.960876,732.094482,732.094482,738.270081,795.990784
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
25519,706.564575,707.308899,707.827698,599.206299,759.960876,759.751221,760.326782,699.033936,693.409607,686.478271,...,25.652605,26.389349,23.721310,27.730398,29.075272,0.000000,28.305248,28.305248,23.639866,82.095642
25520,679.336365,680.074890,680.613831,571.952087,732.094482,731.892578,732.472839,671.877869,666.102539,659.169800,...,2.894371,4.099537,5.315348,4.100186,4.542682,28.305248,0.000000,0.000000,7.543904,100.438660
25521,679.336365,680.074890,680.613831,571.952087,732.094482,731.892578,732.472839,671.877869,666.102539,659.169800,...,2.894371,4.099537,5.315348,4.100186,4.542682,28.305248,0.000000,0.000000,7.543904,100.438660
25522,685.922485,686.658264,687.206848,578.531311,738.270081,738.074890,738.659241,678.498352,672.650146,665.716980,...,6.989707,9.412443,7.863058,10.378148,11.429653,23.639866,7.543904,7.543904,0.000000,100.543930


For each Starbucks, identify its nearest neighboring location in Hawaii (and presumably in the world). Save the output to a dataframe with three columns: Location, Nearest Neighbor, and Distance (Miles).

In [11]:
# answer goes here

HI_locations = US_locations.loc[US_locations['State/Province'] == 'HI']
HI_locations['Coordinates'] = tuple(zip(HI_locations['Latitude'], HI_locations['Longitude']))

HI_dist_matrix = geo_sim_matrix(HI_locations)
HI_dist_matrix.head()

index,17202,17203,17204,17205,17206,17207,17208,17209,17210,17211,...,17291,17292,17293,17294,17295,17296,17297,17298,17299,17300
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
17202,0.0,0.690934,0.643386,0.944091,1.4605,6.743576,6.743576,218.428635,218.976944,217.137772,...,109.065346,12.792754,2.373938,16.607107,166.981064,107.264542,98.783943,5.329021,5.675488,6.470637
17203,0.690934,0.0,0.944121,1.524305,1.888212,6.417289,6.417289,218.067474,218.61673,216.778824,...,108.83419,13.21521,2.050284,16.793444,166.565231,106.962166,98.552231,5.548625,6.120771,6.580589
17204,0.643386,0.944121,0.0,0.690934,0.944091,6.199704,6.199704,218.976944,219.525787,217.687332,...,109.671234,12.281363,2.921233,15.985024,167.494705,107.842613,99.389603,4.710641,5.179933,5.831356
17205,0.944091,1.524305,0.690934,0.0,0.643342,6.609524,6.609524,219.338745,219.886642,218.046921,...,109.904938,11.865932,3.304557,15.819399,167.911057,108.146912,99.6241,4.555939,4.747018,5.790081
17206,1.4605,1.888212,0.944091,0.643342,0.0,6.121677,6.121677,219.886642,220.435089,218.596085,...,110.509941,11.349262,3.826992,15.192343,168.424301,108.72406,100.22879,3.921273,4.240594,5.146738


In [20]:
def get_nearest_index(matrix, index):
    return matrix[index].sort_values().index[0]

indices = HI_locations.index

nearest_index = pd.Series([get_nearest_index(HI_dist_matrix.mask(HI_dist_matrix==0), ind) for ind in indices], index=indices)

In [21]:
nearests = pd.DataFrame(HI_locations['Store Name'])

nearests['Nearest Neighbor'] = [HI_locations.loc[ind, 'Store Name'] for ind in nearest_index]
nearests['Distance (Miles)'] =  [HI_dist_matrix.mask(HI_dist_matrix==0)[ind].min() for ind in nearest_index]

nearests

Unnamed: 0,Store Name,Nearest Neighbor,Distance (Miles)
17202,Aiea Shopping Center,Kaonohi St & Kam Hwy - Pearlridge,0.643386
17203,Stadium Marketplace,Aiea Shopping Center,0.643386
17204,Kaonohi St & Kam Hwy - Pearlridge,Aiea Shopping Center,0.643386
17205,Pearlridge Mall Uptown,Waimalu Shopping Center,0.643342
17206,Waimalu Shopping Center,Pearlridge Mall Uptown,0.643342
...,...,...,...
17296,Wailea Beach Resort - Marriott Maui,Kukui Mall,1.381868
17297,Safeway - Wailuku 3092,Queen Kaahumanu Center,1.291034
17298,Waikele Premium Outlets,Kunia Shopping Center,1.460423
17299,Laniakea Plaza at Ka Uka Blvd,Mililani Town Center,0.943881


If the nearest neighbor of a Starbucks location is far away, we could consider that Starbucks to be "on an island". Which five Starbucks in Hawaii are the most "on an island"?

In [22]:
# answer goes here


nearests.sort_values(by='Distance (Miles)', ascending=False).head()

Unnamed: 0,Store Name,Nearest Neighbor,Distance (Miles)
17267,Parker Ranch Center,Queens Marketplace,15.612447
17295,Queens Marketplace,Parker Ranch Center,15.612447
17294,Waianae Mall - Farrington Hwy,Schofield Barracks Main Store Mall,4.955147
17286,Kukui Grove Center,LIH Rotunda (Kauai),2.042614
17271,Kauai Village Shopping Center Kapaa,LIH Rotunda (Kauai),2.042614
