## Day 46 Lecture 1 Assignment

In this assignment, we will calculate a distance matrix for geographical Starbucks data and use it to identify locations that are close together and far apart. We will perform clustering on this dataset later on.

We will use the "haversine" package to compute geographical distances. It can be installed with pip.

In [1]:
!pip install haversine

Collecting haversine
  Downloading haversine-2.2.0-py2.py3-none-any.whl (4.9 kB)
Installing collected packages: haversine
Successfully installed haversine-2.2.0


In [2]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from haversine import haversine
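
For intuition about what the package computes, the haversine formula gives the great-circle distance between two points on a sphere. The block below is a minimal pure-Python sketch of that formula, not the package's own code; the name `haversine_km` and the Earth radius constant are assumptions made for illustration. Points are given as (latitude, longitude) in degrees, so the order of the tuple elements matters.

```python
import math

# Assumed mean Earth radius in kilometers (illustrative constant).
EARTH_RADIUS_KM = 6371.0088


def haversine_km(point1, point2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, point1)
    lat2, lon2 = map(math.radians, point2)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    # Haversine of the central angle between the two points.
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


# Two hypothetical points one degree of latitude apart on the same meridian.
d_km = haversine_km((21.0, -158.0), (22.0, -158.0))
d_miles = d_km / 1.609344  # same km-to-mile factor used in this notebook
print(round(d_km, 1), round(d_miles, 1))
```

Dividing by 1.609344 converts kilometers to miles, which is the same conversion the helper function below applies to its result.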

Below is a convenience function for calculating geographical distance matrices using lat-long data.

In [12]:
def geo_sim_matrix(df, col_name = 'Coordinates'):
    """
    A function that computes a geographical distance matrix (in miles).
    Each row in the dataframe should correspond to one location.
    In addition, the dataframe must have a column containing the lat-long of each location as a tuple (i.e. (lat, long)).
    
    Parameters:
        df (pandas dataframe): an nxm dataframe containing the locations to compute similarities between.
        col_name (string): the name of the column containing the lat-long tuples.
        
    Returns:
        distance (pandas dataframe): an nxn distance matrix between the geographical coordinates of each location.
    """
    
    df = df.copy()
    df.reset_index(inplace=True)
    haver_vec = np.vectorize(haversine, otypes=[np.float32])
    distance = df.groupby('index').apply(lambda x: pd.Series(haver_vec(df[col_name], x[col_name])))
    distance = distance / 1.609344  # converts to miles
    distance.columns = distance.index
    
    return distance


This dataset contains the latitude and longitude (as well as several other details we will not be using) of every Starbucks in the world as of February 2017. Each row consists of the following features, which are generally self-explanatory:

- Brand
- Store Number
- Store Name
- Ownership Type
- Street Address
- City
- State/Province
- Country
- Postcode
- Phone Number
- Timezone
- Longitude
- Latitude

Load in the dataset.

In [4]:
starbucks = pd.read_csv('https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/Data%20Sets%20Clustering/starbucks_locations.csv')
starbucks

Unnamed: 0,Brand,Store Number,Store Name,Ownership Type,Street Address,City,State/Province,Country,Postcode,Phone Number,Timezone,Longitude,Latitude
0,Starbucks,47370-257954,"Meritxell, 96",Licensed,"Av. Meritxell, 96",Andorra la Vella,7,AD,AD500,376818720,GMT+1:00 Europe/Andorra,1.53,42.51
1,Starbucks,22331-212325,Ajman Drive Thru,Licensed,"1 Street 69, Al Jarf",Ajman,AJ,AE,,,GMT+04:00 Asia/Dubai,55.47,25.42
2,Starbucks,47089-256771,Dana Mall,Licensed,Sheikh Khalifa Bin Zayed St.,Ajman,AJ,AE,,,GMT+04:00 Asia/Dubai,55.47,25.39
3,Starbucks,22126-218024,Twofour 54,Licensed,Al Salam Street,Abu Dhabi,AZ,AE,,,GMT+04:00 Asia/Dubai,54.38,24.48
4,Starbucks,17127-178586,Al Ain Tower,Licensed,"Khaldiya Area, Abu Dhabi Island",Abu Dhabi,AZ,AE,,,GMT+04:00 Asia/Dubai,54.54,24.51
...,...,...,...,...,...,...,...,...,...,...,...,...,...
25595,Starbucks,21401-212072,Rex,Licensed,"141 Nguyễn Huệ, Quận 1, Góc đường Pasteur và L...",Thành Phố Hồ Chí Minh,SG,VN,70000,08 3824 4668,GMT+000000 Asia/Saigon,106.70,10.78
25596,Starbucks,24010-226985,Panorama,Licensed,"SN-44, Tòa Nhà Panorama, 208 Trần Văn Trà, Quận 7",Thành Phố Hồ Chí Minh,SG,VN,70000,08 5413 8292,GMT+000000 Asia/Saigon,106.71,10.72
25597,Starbucks,47608-253804,Rosebank Mall,Licensed,"Cnr Tyrwhitt and Cradock Avenue, Rosebank",Johannesburg,GT,ZA,2194,27873500159,GMT+000000 Africa/Johannesburg,28.04,-26.15
25598,Starbucks,47640-253809,Menlyn Maine,Licensed,"Shop 61B, Central Square, Cnr Aramist & Coroba...",Menlyn,GT,ZA,181,,GMT+000000 Africa/Johannesburg,28.28,-25.79


Begin by narrowing down the dataset to a specific geographic area of interest. Since we will compute the full distance matrix ourselves, and its size grows on the order of $n^{2}$, we recommend choosing an area with 3,000 or fewer locations. In this example, we will use Hawaii, which has about 100 locations; for reference, California has about 2,800. Feel free to choose a different region that is of more interest to you.
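
To make the $n^{2}$ growth concrete, here is a rough back-of-the-envelope sketch. The helper name `distance_matrix_size` is hypothetical, and the 4 bytes per entry is an assumption (one 32-bit float per distance); the full-dataset row count of about 25,600 is read off the index shown above.

```python
def distance_matrix_size(n_locations, bytes_per_entry=4):
    """Return (entry count, approximate megabytes) for an n x n matrix."""
    entries = n_locations ** 2
    megabytes = entries * bytes_per_entry / 1e6
    return entries, megabytes


# Location counts quoted in the text, plus the approximate full dataset.
regions = [
    ("Hawaii", 100),
    ("California", 2800),
    ("Suggested cap", 3000),
    ("Whole dataset", 25600),
]
for label, n in regions:
    entries, mb = distance_matrix_size(n)
    print(f"{label}: {entries:,} entries, about {mb:,.1f} MB")
```

The number of pairwise distance calculations grows the same way, so run time is likely to become a problem well before memory does.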

Subset the dataframe to only include records for Starbucks locations in Hawaii.

In [9]:
hawaii = starbucks.loc[starbucks['State/Province']=='HI']
hawaii

Unnamed: 0,Brand,Store Number,Store Name,Ownership Type,Street Address,City,State/Province,Country,Postcode,Phone Number,Timezone,Longitude,Latitude
17202,Starbucks,21034-73360,Aiea Shopping Center,Company Owned,99-115 Aiea Heights Drive #125,Aiea,HI,US,967013913,808-484-1488,GMT-10:00 Pacific/Honolulu,-157.93,21.38
17203,Starbucks,21053-99755,Stadium Marketplace,Company Owned,4561 Salt Lake Boulevard,Aiea,HI,US,968183167,808-488-9313,GMT-10:00 Pacific/Honolulu,-157.93,21.37
17204,Starbucks,21006-10033,Kaonohi St & Kam Hwy - Pearlridge,Company Owned,98-125 Kaonohi Street,Aiea,HI,US,967012318,808-484-9548,GMT-10:00 Pacific/Honolulu,-157.94,21.38
17205,Starbucks,21005-10034,Pearlridge Mall Uptown,Company Owned,98-1005 Moanalua Road,Aiea,HI,US,967014705,808-484-9355,GMT-10:00 Pacific/Honolulu,-157.94,21.39
17206,Starbucks,21063-101700,Waimalu Shopping Center,Company Owned,"98-1277 Kaahumanu Street, Building E, Unit 7, ...",Aiea,HI,US,967015314,808-484-5802,GMT-10:00 Pacific/Honolulu,-157.95,21.39
...,...,...,...,...,...,...,...,...,...,...,...,...,...
17296,Starbucks,70063-139304,Wailea Beach Resort - Marriott Maui,Licensed,3700 Wailea Alanui Dr,Wailea,HI,US,967538347,808-874-7981,GMT-10:00 Pacific/Honolulu,-156.44,20.69
17297,Starbucks,19214-196545,Safeway - Wailuku 3092,Licensed,"58 Maui lani Pkwy, Waikele Center",Wailuku,HI,US,96793,808-243-3522,GMT-10:00 Pacific/Honolulu,-156.49,20.89
17298,Starbucks,21044-88761,Waikele Premium Outlets,Company Owned,"94-799 Lumiaina Street, Laniakea Plaza",Waipahu,HI,US,967975041,808-678-3418,GMT-10:00 Pacific/Honolulu,-158.01,21.40
17299,Starbucks,21061-99913,Laniakea Plaza at Ka Uka Blvd,Company Owned,"94-1221 Ka Uka Boulevard, Unit A-101",Waipahu,HI,US,967976202,808-680-9213,GMT-10:00 Pacific/Honolulu,-158.00,21.43


The haversine package takes tuples with two numeric elements and interprets them as lat-long to calculate distance, so add a new column called "Coordinates" that combines the two coordinate values in each row into a tuple. In other words, the last two columns of the dataframe should initially look like this:

**Longitude, Latitude**  
-121.64, 39.14  
-116.40, 34.13  
...

After adding the new column, the last three columns should look like this:

**Longitude, Latitude, Coordinates**  
-121.64, 39.14, (-121.64, 39.14)  
-116.40, 34.13, (-116.40, 34.13)  
...

In [10]:
# Build one (Longitude, Latitude) tuple per row, matching the expected output above.
# assign() returns a new dataframe, which avoids pandas' SettingWithCopyWarning on the subset.
hawaii = hawaii.assign(Coordinates=list(zip(hawaii['Longitude'], hawaii['Latitude'])))
hawaii

Unnamed: 0,Brand,Store Number,Store Name,Ownership Type,Street Address,City,State/Province,Country,Postcode,Phone Number,Timezone,Longitude,Latitude,Coordinates
17202,Starbucks,21034-73360,Aiea Shopping Center,Company Owned,99-115 Aiea Heights Drive #125,Aiea,HI,US,967013913,808-484-1488,GMT-10:00 Pacific/Honolulu,-157.93,21.38,"(-157.93, 21.38)"
17203,Starbucks,21053-99755,Stadium Marketplace,Company Owned,4561 Salt Lake Boulevard,Aiea,HI,US,968183167,808-488-9313,GMT-10:00 Pacific/Honolulu,-157.93,21.37,"(-157.93, 21.37)"
17204,Starbucks,21006-10033,Kaonohi St & Kam Hwy - Pearlridge,Company Owned,98-125 Kaonohi Street,Aiea,HI,US,967012318,808-484-9548,GMT-10:00 Pacific/Honolulu,-157.94,21.38,"(-157.94, 21.38)"
17205,Starbucks,21005-10034,Pearlridge Mall Uptown,Company Owned,98-1005 Moanalua Road,Aiea,HI,US,967014705,808-484-9355,GMT-10:00 Pacific/Honolulu,-157.94,21.39,"(-157.94, 21.39)"
17206,Starbucks,21063-101700,Waimalu Shopping Center,Company Owned,"98-1277 Kaahumanu Street, Building E, Unit 7, ...",Aiea,HI,US,967015314,808-484-5802,GMT-10:00 Pacific/Honolulu,-157.95,21.39,"(-157.95, 21.39)"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17296,Starbucks,70063-139304,Wailea Beach Resort - Marriott Maui,Licensed,3700 Wailea Alanui Dr,Wailea,HI,US,967538347,808-874-7981,GMT-10:00 Pacific/Honolulu,-156.44,20.69,"(-156.44, 20.69)"
17297,Starbucks,19214-196545,Safeway - Wailuku 3092,Licensed,"58 Maui lani Pkwy, Waikele Center",Wailuku,HI,US,96793,808-243-3522,GMT-10:00 Pacific/Honolulu,-156.49,20.89,"(-156.49, 20.89)"
17298,Starbucks,21044-88761,Waikele Premium Outlets,Company Owned,"94-799 Lumiaina Street, Laniakea Plaza",Waipahu,HI,US,967975041,808-678-3418,GMT-10:00 Pacific/Honolulu,-158.01,21.40,"(-158.01, 21.4)"
17299,Starbucks,21061-99913,Laniakea Plaza at Ka Uka Blvd,Company Owned,"94-1221 Ka Uka Boulevard, Unit A-101",Waipahu,HI,US,967976202,808-680-9213,GMT-10:00 Pacific/Honolulu,-158.00,21.43,"(-158.0, 21.43)"


Calculate the distance matrix using the starter code/function geo_sim_matrix() provided earlier in the notebook. It assumes the column containing the coordinates for each location is called "Coordinates". Examine the docstring for more details.

Note: the latitudes and longitudes provided are rounded to two decimal places, which limits the resolution of the distance calculations to about 0.5 miles. Very small distances may not be represented accurately here (e.g. several distinct Starbucks locations in very close proximity show a distance of 0).
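
As a rough check on that resolution figure, the sketch below estimates how far apart two neighboring points on a 0.01-degree grid are. The helper name `grid_step_miles` and the Earth radius constant are assumptions for illustration; the east-west step uses the usual cosine-of-latitude shrinkage, which is an approximation that holds for small steps.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius
KM_PER_MILE = 1.609344


def grid_step_miles(lat_deg, step_deg=0.01):
    """Approximate (north-south, east-west) size in miles of one grid step."""
    step_rad = math.radians(step_deg)
    north_south = EARTH_RADIUS_KM * step_rad / KM_PER_MILE
    # Meridians converge toward the poles, so east-west steps shrink with latitude.
    east_west = north_south * math.cos(math.radians(lat_deg))
    return north_south, east_west


# Latitude roughly in the middle of the Hawaii rows shown above.
ns, ew = grid_step_miles(21.4)
print(f"north-south: {ns:.2f} mi, east-west: {ew:.2f} mi")
```

A single grid step works out to somewhat under 0.7 miles, so rounding can misplace a store by a sizable fraction of a mile, which is the same order of magnitude as the rough figure quoted in the note.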

In [13]:
distances = geo_sim_matrix(hawaii, col_name = 'Coordinates')
distances

index,17202,17203,17204,17205,17206,17207,17208,17209,17210,17211,...,17291,17292,17293,17294,17295,17296,17297,17298,17299,17300
17202,0.000000,0.640306,0.690934,0.942025,1.523027,6.994635,6.994635,224.446625,225.053207,223.240814,...,115.111580,13.110185,2.436418,17.696173,169.674576,111.935181,104.275177,5.673964,5.800602,6.938969
17203,0.640306,0.000000,0.942025,1.455155,1.884081,6.725522,6.725522,224.146896,224.754303,222.943039,...,114.923065,13.465416,2.169428,17.846395,169.325500,111.686714,104.086296,5.851921,6.177308,7.027100
17204,0.690934,0.942025,0.000000,0.640352,0.942056,6.388272,6.388272,225.057007,225.664078,223.852325,...,115.771530,12.533210,3.045977,17.022440,170.253662,112.571602,104.934967,5.003312,5.238478,6.251312
17205,0.942025,1.455155,0.640352,0.000000,0.690934,6.732244,6.732244,225.357330,225.963547,224.150681,...,115.962173,12.184464,3.365657,16.889206,170.603195,112.821724,105.126198,4.878767,4.873359,6.218408
17206,1.523027,1.884081,0.942056,0.690934,0.000000,6.177646,6.177646,225.967407,226.574112,224.761871,...,116.621422,11.602576,3.952774,16.211374,171.181946,113.457397,105.785225,4.194797,4.301023,5.527473
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17296,111.935181,111.686714,112.571602,112.821724,113.457397,116.532761,116.532761,113.356705,113.932938,112.089897,...,11.744016,124.882446,109.527550,129.379044,62.665909,0.000000,13.131763,117.528198,117.642769,118.564980
17297,104.275177,104.086296,104.934967,105.126198,105.785225,109.367378,109.367378,123.483475,124.031357,122.172470,...,10.836831,116.976448,101.916878,121.919121,74.924126,13.131763,0.000000,109.934196,109.857498,111.071297
17298,5.673964,5.851921,5.003312,4.878767,4.194797,4.537746,4.537746,229.933716,230.542389,228.732803,...,120.770874,7.988654,8.020373,12.022514,175.014755,117.528198,109.934196,0.000000,2.042359,1.523179
17299,5.800602,6.177308,5.238478,4.873359,4.301023,6.554017,6.554017,230.220810,230.826599,229.013107,...,120.691475,7.311107,8.236157,12.453324,175.475052,117.642769,109.857498,2.042359,0.000000,3.296098


For each Starbucks, identify its nearest neighboring location in Hawaii (and presumably in the world). Save the output to a dataframe with three columns: Location, Nearest Neighbor, and Distance (Miles).

In [14]:
import sys

In [60]:
out_list = []
for index, row in distances.iterrows():
    # Drop zero distances: the store itself, plus any stores whose
    # rounded coordinates coincide with it
    nonzero = row[row != 0]
    store = nonzero.idxmin()  # label of the closest remaining store
    out_list.append({
        'Location': starbucks.loc[index, 'Street Address'],
        'Neighbor': starbucks.loc[store, 'Street Address'],
        'Distance': nonzero.min(),
    })

out_df = pd.DataFrame(out_list)
out_df

Unnamed: 0,Location,Neighbor,Distance
0,99-115 Aiea Heights Drive #125,4561 Salt Lake Boulevard,0.640306
1,4561 Salt Lake Boulevard,99-115 Aiea Heights Drive #125,0.640306
2,98-125 Kaonohi Street,98-1005 Moanalua Road,0.640352
3,98-1005 Moanalua Road,98-125 Kaonohi Street,0.640352
4,"98-1277 Kaahumanu Street, Building E, Unit 7, ...",98-1005 Moanalua Road,0.690934
...,...,...,...
94,3700 Wailea Alanui Dr,1819 South Kihei Road,2.625981
95,"58 Maui lani Pkwy, Waikele Center",1 Keolani Airport Rd,1.381868
96,"94-799 Lumiaina Street, Laniakea Plaza","94-673 Kupuohi Street, A201",1.523179
97,"94-1221 Ka Uka Boulevard, Unit A-101","95-1249 Meheula Parkway, 172",1.455712
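
One design choice in the search above is worth spelling out: ignoring every zero distance also ignores distinct stores whose rounded coordinates happen to coincide. As an alternative, the sketch below excludes only the diagonal entry (a store compared with itself). It is a dependency-free illustration on a made-up 3 x 3 matrix; the function name `nearest_neighbors` and the toy labels are hypothetical.

```python
def nearest_neighbors(matrix, labels):
    """For each row, return (label, nearest other label, distance)."""
    results = []
    for i, row in enumerate(matrix):
        # Skip only the diagonal entry, so a distinct store at distance 0 still counts.
        candidates = [(dist, j) for j, dist in enumerate(row) if j != i]
        dist, j = min(candidates)
        results.append((labels[i], labels[j], dist))
    return results


# Hypothetical symmetric distance matrix in miles.
toy_labels = ["A", "B", "C"]
toy_matrix = [
    [0.0, 2.0, 5.0],
    [2.0, 0.0, 0.0],  # B and C share the same rounded coordinates
    [5.0, 0.0, 0.0],
]
for result in nearest_neighbors(toy_matrix, toy_labels):
    print(result)
```

Which behavior is preferable depends on the question being asked: masking only the diagonal reports co-located stores as each other's nearest neighbor, while filtering out all zeros reports the nearest store at a measurably different position.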


If the nearest neighbor of a Starbucks location is far away, we could consider that Starbucks to be "on an island". Which five Starbucks in Hawaii are the most "on an island"?

In [64]:
out_df.sort_values('Distance',ascending=False).head(5)

Unnamed: 0,Location,Neighbor,Distance
93,"69-201 Waikoloa Beach Drive, #1001 K-1",67-1185 Mamalahoa Highway D108,16.080042
65,67-1185 Mamalahoa Highway D108,"69-201 Waikoloa Beach Drive, #1001 K-1",16.080042
76,"2360 Kiahuna Plantation Drive, Suites E-70 & E-80",4454 Nuhou St,8.027675
89,55 Pukalani Street,New Terminal Bldg @ Bldg 340,7.599956
92,"86-120 Farrington Highway, Waikoloa Beach Resort",Bldg. 693,6.996286
