## Day 48 Lecture 2 Assignment

In this assignment, we will apply density-based clustering to the locations of Starbucks stores in the U.S., drawn from a worldwide dataset of store locations.

This assignment will also use the haversine and plotly packages, which you should already have installed from the previous assignment.

In [None]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
import plotly.express as px

This dataset contains the latitude and longitude (as well as several other details we will not be using) of every Starbucks in the world as of February 2017. Each row consists of the following features, which are generally self-explanatory:

- Brand
- Store Number
- Store Name
- Ownership Type
- Street Address
- City
- State/Province
- Country
- Postcode
- Phone Number
- Timezone
- Longitude
- Latitude

Load in the dataset.

In [None]:
# answer goes here
df = pd.read_csv('https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/Data%20Sets%20Clustering/starbucks_locations.csv')




Begin by narrowing down the dataset to a specific geographic area of interest. Try just the United States; since you won't be calculating a distance matrix you can use more than just one state.
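As a minimal sketch of the filtering step, the example below uses a small hypothetical DataFrame (the rows in `toy` are made up for illustration; only the column names match the dataset). It selects rows with a boolean mask and calls `.copy()` so that columns added later belong to an independent DataFrame rather than to a slice of the original.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the Starbucks table (values are illustrative only)
toy = pd.DataFrame({
    "Store Name": ["A", "B", "C", "D"],
    "Country": ["US", "US", "CA", "US"],
    "State/Province": ["NY", "CA", "ON", "NY"],
    "Longitude": [-73.82, -118.24, -79.38, -73.85],
    "Latitude": [42.71, 34.05, 43.65, 40.93],
})

# Keep only U.S. rows; .copy() detaches the result from `toy`
us_only = toy.loc[toy["Country"] == "US"].copy()

# Adding a column now modifies `us_only` only, with no SettingWithCopyWarning
us_only["Lat_Rad"] = np.deg2rad(us_only["Latitude"])

print(len(us_only))
```

The same pattern should apply to the full dataset by swapping `toy` for `df`.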

In [None]:
# answer goes here
# Restrict to New York State. Calling .copy() makes dfny an independent
# DataFrame, which avoids pandas' SettingWithCopyWarning when columns are added.
dfny = df.loc[df['State/Province'] == 'NY', :].copy()
dfny['Coordinates'] = list(zip(dfny['Longitude'], dfny['Latitude']))
dfny



Unnamed: 0,Brand,Store Number,Store Name,Ownership Type,Street Address,City,State/Province,Country,Postcode,Phone Number,Timezone,Longitude,Latitude,Coordinates
20858,Starbucks,7379-1056,Wolf Road,Company Owned,"18 Wolf Rd, Crossgates Mall",Albany,NY,US,122052609,(518) 435-9280,GMT-05:00 America/New_York,-73.82,42.71,"(-73.82, 42.71)"
20859,Starbucks,11064-103915,Crossgates Mall,Company Owned,"1 Crossgates Mall Road, B231",Albany,NY,US,122035367,518-218-1520,GMT-05:00 America/New_York,-73.85,42.69,"(-73.85, 42.69)"
20860,Starbucks,15207-156777,Target Colonie T-1268,Licensed,1440 Central Ave,Albany,NY,US,122055118,518-489-1112,GMT-05:00 America/New_York,-73.82,42.71,"(-73.82, 42.71)"
20861,Starbucks,7922-92120,North Pearl Street,Company Owned,10 North Pearl St,Albany,NY,US,122072702,518-463-6990,GMT-05:00 America/New_York,-73.75,42.65,"(-73.75, 42.65)"
20862,Starbucks,75393-105057,College St. Rose,Licensed,"420 Western Ave, Hilton Garden Inn at Albany M...",Albany,NY,US,122031400,518-485-3946,GMT-05:00 America/New_York,-73.79,42.66,"(-73.79, 42.66)"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21498,Teavana,28235-250120,Teavana - Westchester Ridge HIll,Company Owned,"117 Market St., Cross County Shopping Center",Yonkers,NY,US,10710,914-963-8670,GMT-05:00 America/New_York,-73.86,40.96,"(-73.86, 40.96)"
21499,Starbucks,13632-105641,Cross County Shopping Center,Company Owned,"8000 Mall Walk, Space 7080",Yonkers,NY,US,107041006,3479312521,GMT-05:00 America/New_York,-73.85,40.93,"(-73.85, 40.93)"
21500,Starbucks,14369-115444,"Yonkers, Bronx River Road",Company Owned,841-851 Bronx River Road,Yonkers,NY,US,107087058,914-237-3681,GMT-05:00 America/New_York,-73.84,40.93,"(-73.84, 40.93)"
21501,Starbucks,7901-67260,Yonkers,Company Owned,2458 Central Park Avenue,Yonkers,NY,US,107101125,914-337-0139,GMT-05:00 America/New_York,-73.83,40.98,"(-73.83, 40.98)"


Build a DBSCAN clustering model using eps=2 (miles) and min_samples=5. Some tips that may be helpful:

1. Unlike our approach for hierarchical clustering, we do not need to calculate the NxN distance matrix for DBSCAN upfront. It directly supports the haversine distance metric, provided the nearest-neighbors algorithm is a ball tree. Set the "algorithm" and "metric" parameters to the appropriate values. 
2. Scikit-learn's implementation of haversine distance expects radians instead of degrees. Therefore, it would be advisable to create two new columns, Lat_Rad and Lon_Rad, that convert the Latitude and Longitude columns into radians. (Hint: there is a numpy function that does this.)  
3. The eps parameter, which corresponds to the radius of the neighborhood, will also need to be in radians. The conversion factor for miles to radians is approximately 1/3958.748; in other words, if you want the neighborhood to have a radius of 3 miles, set eps = 3/3958.748.  
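To make the unit conversion in the tips above concrete, here is a small sketch that computes the haversine angle by hand with numpy and converts it to miles. The helper name `haversine_radians` is hypothetical (it is not part of any library); the two points are the coordinates of two Albany rows from the New York subset.

```python
import numpy as np

EARTH_RADIUS_MILES = 3958.748  # same constant as in the tips above

def haversine_radians(lat1, lon1, lat2, lon2):
    """Great-circle angle in radians between two points given in radians."""
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * np.arcsin(np.sqrt(a))

# (latitude, longitude) in degrees, converted to radians
lat1, lon1 = np.deg2rad([42.71, -73.82])
lat2, lon2 = np.deg2rad([42.69, -73.85])

angle = haversine_radians(lat1, lon1, lat2, lon2)  # distance on the unit sphere
miles = angle * EARTH_RADIUS_MILES                 # radians -> miles
eps = 2 / EARTH_RADIUS_MILES                       # 2 miles -> radians

print(f"angle = {angle:.6f} rad, distance = {miles:.2f} miles, eps = {eps:.6f} rad")
```

Multiplying a haversine distance in radians by the Earth's radius gives miles, so dividing a radius in miles by the same constant gives the value to pass as eps.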

Side note: a ball tree is an indexing structure that is very useful for nearest-neighbor calculations. Once the tree is built, a single nearest-neighbor query takes roughly O(log(n)) time on average, so finding the neighbors of all n points costs about O(nlog(n)). This is a vast improvement over the naive O($n^{2}$) pairwise comparison and allows us to cluster on much larger subsets of the data, like the entire country. Scikit-learn directly supports creating ball trees through sklearn.neighbors.BallTree; if inclined, you could extend the analysis in the first after-lecture assignment (in which we calculated a similarity matrix for Hawaii) to the entire country using a BallTree and identify "island Starbucks locations" on a much larger scale.
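The following is a rough sketch of how a BallTree could be used to find the most isolated location, assuming a handful of hypothetical coordinates rather than the real dataset. It queries each point's two nearest neighbors, because the nearest one is always the point itself at distance 0.

```python
import numpy as np
from sklearn.neighbors import BallTree

EARTH_RADIUS_MILES = 3958.748

# Hypothetical (latitude, longitude) points in degrees:
# three near Albany and one far to the south
coords_deg = np.array([
    [42.71, -73.82],
    [42.69, -73.85],
    [42.65, -73.75],
    [40.93, -73.85],
])
coords_rad = np.deg2rad(coords_deg)  # haversine expects [lat, lon] in radians

tree = BallTree(coords_rad, metric="haversine")

# k=2: column 0 is the point itself, column 1 is its nearest other point
dist_rad, idx = tree.query(coords_rad, k=2)
nearest_miles = dist_rad[:, 1] * EARTH_RADIUS_MILES

most_isolated = int(np.argmax(nearest_miles))
print(most_isolated, nearest_miles.round(1))
```

On the full dataset, sorting `nearest_miles` in descending order would rank locations by how far they are from any other store.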

Additionally, save the predicted cluster assignments as a new column in your dataframe.
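The tips above can be sketched end to end on toy data. This is an illustration under stated assumptions, not the expected answer: the coordinates in `toy` are hypothetical (six points packed closely together plus one distant point), and the column names Lat_Rad, Lon_Rad, and Cluster are just one possible choice.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

EARTH_RADIUS_MILES = 3958.748

# Hypothetical data: six nearby points and one point far away
toy = pd.DataFrame({
    "Latitude":  [40.750, 40.751, 40.752, 40.753, 40.754, 40.755, 42.710],
    "Longitude": [-73.990, -73.991, -73.992, -73.993, -73.994, -73.995, -73.820],
})

# Haversine in scikit-learn expects [latitude, longitude] in radians
toy["Lat_Rad"] = np.deg2rad(toy["Latitude"])
toy["Lon_Rad"] = np.deg2rad(toy["Longitude"])

model = DBSCAN(
    eps=2 / EARTH_RADIUS_MILES,  # 2 miles expressed in radians
    min_samples=5,
    algorithm="ball_tree",
    metric="haversine",
)

# fit_predict returns one label per row; -1 marks noise (outliers)
toy["Cluster"] = model.fit_predict(toy[["Lat_Rad", "Lon_Rad"]].to_numpy())
print(toy[["Latitude", "Longitude", "Cluster"]])
```

Note the column order passed to `fit_predict`: latitude first, then longitude.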

In [None]:
dfny['latrad'] = np.deg2rad(dfny['Latitude']) 
dfny['longrad'] = np.deg2rad(dfny['Longitude']) 
dfny




Unnamed: 0,Brand,Store Number,Store Name,Ownership Type,Street Address,City,State/Province,Country,Postcode,Phone Number,Timezone,Longitude,Latitude,Coordinates,latrad,longrad
20858,Starbucks,7379-1056,Wolf Road,Company Owned,"18 Wolf Rd, Crossgates Mall",Albany,NY,US,122052609,(518) 435-9280,GMT-05:00 America/New_York,-73.82,42.71,"(-73.82, 42.71)",0.745430,-1.288402
20859,Starbucks,11064-103915,Crossgates Mall,Company Owned,"1 Crossgates Mall Road, B231",Albany,NY,US,122035367,518-218-1520,GMT-05:00 America/New_York,-73.85,42.69,"(-73.85, 42.69)",0.745081,-1.288926
20860,Starbucks,15207-156777,Target Colonie T-1268,Licensed,1440 Central Ave,Albany,NY,US,122055118,518-489-1112,GMT-05:00 America/New_York,-73.82,42.71,"(-73.82, 42.71)",0.745430,-1.288402
20861,Starbucks,7922-92120,North Pearl Street,Company Owned,10 North Pearl St,Albany,NY,US,122072702,518-463-6990,GMT-05:00 America/New_York,-73.75,42.65,"(-73.75, 42.65)",0.744383,-1.287180
20862,Starbucks,75393-105057,College St. Rose,Licensed,"420 Western Ave, Hilton Garden Inn at Albany M...",Albany,NY,US,122031400,518-485-3946,GMT-05:00 America/New_York,-73.79,42.66,"(-73.79, 42.66)",0.744557,-1.287878
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21498,Teavana,28235-250120,Teavana - Westchester Ridge HIll,Company Owned,"117 Market St., Cross County Shopping Center",Yonkers,NY,US,10710,914-963-8670,GMT-05:00 America/New_York,-73.86,40.96,"(-73.86, 40.96)",0.714887,-1.289100
21499,Starbucks,13632-105641,Cross County Shopping Center,Company Owned,"8000 Mall Walk, Space 7080",Yonkers,NY,US,107041006,3479312521,GMT-05:00 America/New_York,-73.85,40.93,"(-73.85, 40.93)",0.714363,-1.288926
21500,Starbucks,14369-115444,"Yonkers, Bronx River Road",Company Owned,841-851 Bronx River Road,Yonkers,NY,US,107087058,914-237-3681,GMT-05:00 America/New_York,-73.84,40.93,"(-73.84, 40.93)",0.714363,-1.288751
21501,Starbucks,7901-67260,Yonkers,Company Owned,2458 Central Park Avenue,Yonkers,NY,US,107101125,914-337-0139,GMT-05:00 America/New_York,-73.83,40.98,"(-73.83, 40.98)",0.715236,-1.288577


In [None]:
# answer goes here
# Define the DBSCAN model. eps must be in radians: miles / Earth's radius in miles.
radius = 2  # neighborhood radius in miles, as specified above
dbscan_cluster = DBSCAN(eps=radius/3958.748, min_samples=5, algorithm='ball_tree', metric='haversine')

# Fit the model on [latitude, longitude] in radians and save the cluster labels
dfny['radcoordinates'] = list(zip(dfny['latrad'], dfny['longrad']))
dfny['clusters'] = dbscan_cluster.fit_predict(dfny[['latrad', 'longrad']].to_numpy()).astype('object')








In [None]:
np.array(dfny[['latrad','longrad']])

array([[ 0.74543012, -1.28840205],
       [ 0.74508106, -1.28892565],
       [ 0.74543012, -1.28840205],
       ...,
       [ 0.71436326, -1.28875112],
       [ 0.71523593, -1.28857659],
       [ 0.72029738, -1.28770392]])

Finally, plot the resulting clusters on a map using the "scatter_geo" function from plotly.express. The map defaults to the entire world; the "scope" parameter is useful for narrowing down the region plotted in the map. The documentation can be found here:

https://www.plotly.express/plotly_express/#plotly_express.scatter_geo

How many clusters did DBSCAN produce? How many locations were treated as outliers (cluster = -1)?
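One way to answer these two questions is to count the distinct labels while excluding -1. The sketch below uses a hypothetical `labels` array standing in for the DBSCAN output; in the notebook, the saved cluster column would take its place.

```python
import numpy as np
import pandas as pd

# Hypothetical labels standing in for DBSCAN output (-1 = outlier)
labels = np.array([0, 0, 1, -1, 1, 2, -1, 0])

# Distinct labels other than -1 are the clusters
n_clusters = len(set(labels.tolist()) - {-1})

# Rows labeled -1 were not assigned to any cluster
n_outliers = int(np.sum(labels == -1))

print(n_clusters, n_outliers)  # 3 2

# Cluster sizes, including the outlier group
print(pd.Series(labels).value_counts().sort_index())
```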

In [None]:
# answer goes here
# Note: this helper plots a dendrogram for a hierarchical (agglomerative) model;
# it does not use the DBSCAN results above.
from scipy.cluster.hierarchy import dendrogram

def plot_dendrogram(model, **kwargs):
    """
    A basic function for plotting a dendrogram. Sourced from the following link:
    https://github.com/scikit-learn/scikit-learn/blob/70cf4a676caa2d2dad2e3f6e4478d64bcb0506f7/examples/cluster/plot_hierarchical_clustering_dendrogram.py
    
    Parameters:
        model (sklearn.cluster.AgglomerativeClustering): a fitted scikit-learn hierarchical clustering model.
    
    Output: a dendrogram based on the model passed in the parameters.
    
    Returns: N/A    
    """
    # Children of hierarchical clustering
    children = model.children_

    # Distances between each pair of children
    # Since we don't have this information, we can use a uniform one for plotting
    distance = np.arange(children.shape[0])

    # The number of observations contained in each cluster level
    no_of_observations = np.arange(2, children.shape[0]+2)

    # Create linkage matrix and then plot the dendrogram
    linkage_matrix = np.column_stack([children, distance, no_of_observations]).astype(float)

    # Plot the corresponding dendrogram
    dendrogram(linkage_matrix, **kwargs)






In [None]:
import plotly.express as px

# Plot each store, colored by its DBSCAN cluster label (-1 marks outliers)
fig = px.scatter_geo(dfny, lat='Latitude', lon='Longitude', scope='usa', color='clusters')
fig.update_traces(marker=dict(size=4))
fig.show()


From the previous plot, we should see a very large number of clusters (400+). This would suggest that our definition of neighborhood may have been too strict. Experiment with other values of eps and min_samples and see how your changes affect the output. Output a map with what you think is the "best" clustering result below.
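A simple way to experiment is to loop over a few parameter values and record the number of clusters and outliers for each. The sketch below does this on synthetic data: the two blobs are hypothetical, and `summarize_dbscan` is a helper name introduced here for illustration only.

```python
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_MILES = 3958.748

def summarize_dbscan(coords_rad, eps_miles, min_samples):
    """Return (number of clusters, number of outliers) for one parameter setting."""
    labels = DBSCAN(
        eps=eps_miles / EARTH_RADIUS_MILES,
        min_samples=min_samples,
        algorithm="ball_tree",
        metric="haversine",
    ).fit_predict(coords_rad)
    n_clusters = len(set(labels.tolist()) - {-1})
    n_outliers = int((labels == -1).sum())
    return n_clusters, n_outliers

# Hypothetical data: two blobs of 30 points each, in (lat, lon) degrees
rng = np.random.default_rng(0)
blob_a = np.array([40.75, -73.99]) + rng.normal(scale=0.01, size=(30, 2))
blob_b = np.array([42.70, -73.80]) + rng.normal(scale=0.01, size=(30, 2))
coords_rad = np.deg2rad(np.vstack([blob_a, blob_b]))

for eps_miles in (0.5, 2, 10):
    for min_samples in (3, 5, 10):
        n_clusters, n_outliers = summarize_dbscan(coords_rad, eps_miles, min_samples)
        print(f"eps={eps_miles} mi, min_samples={min_samples}: "
              f"{n_clusters} clusters, {n_outliers} outliers")
```

In general, a larger eps or a smaller min_samples tends to merge clusters and reduce the number of outliers, so scanning a small grid like this can help narrow down settings worth mapping.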

In [None]:
# answer goes here



