## Day 48 Lecture 2 Assignment

In this assignment, we will apply density-based clustering to a dataset containing the locations of all Starbucks in the U.S.

This assignment will also use the haversine and plotly packages, which you should already have installed from the previous assignment.

In [0]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
import plotly.express as px

This dataset contains the latitude and longitude (as well as several other details we will not be using) of every Starbucks in the world as of February 2017. Each row consists of the following features, which are generally self-explanatory:

- Brand
- Store Number
- Store Name
- Ownership Type
- Street Address
- City
- State/Province
- Country
- Postcode
- Phone Number
- Timezone
- Longitude
- Latitude

Load in the dataset.

In [0]:
# answer goes here

df = pd.read_csv('https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/Data%20Sets%20Clustering/starbucks_locations.csv')



Begin by narrowing down the dataset to a specific geographic area of interest. Try just the United States; since you won't be calculating a distance matrix you can use more than just one state.

In [0]:
# answer goes here
df1 = df[(df['State/Province']== 'OR') | (df['State/Province']== 'WY')]
df1.shape

(382, 13)

Build a DBSCAN clustering model using eps=2 (miles) and min_samples=5. Some tips that may be helpful:

1. Unlike our approach for hierarchical clustering, we do not need to calculate the NxN distance matrix for DBSCAN upfront. It directly supports the haversine distance metric, provided the nearest-neighbors algorithm is a ball tree. Set the "algorithm" and "metric" parameters to the appropriate values. 
2. Scikit-learn's implementation of haversine distance expects radians instead of degrees. Therefore, it would be advisable to create two new columns, Lat_Rad and Lon_Rad, that convert the Latitude and Longitude columns into radians. (Hint: there is a numpy function that does this.)  
3. The eps parameter, which corresponds to the radius of the neighborhood, will also need to be in radians. The conversion factor for miles to radians is approximately 1/3958.748; in other words, if you want the neighborhood to have a radius of 3 miles, set eps = 3/3958.748.  

Side note: ball-tree is an indexing structure that is very useful for nearest-neighbor calculations. The general time-complexity of finding a nearest neighbor using a Ball Tree is O(nlog(n)). This is a vast improvement over the naive O($n^{2}$) and allows us to cluster on much larger subsets of the data, like the entire country. Scikit-learn directly supports creating ball-trees through sklearn.neighbors.BallTree; if inclined, you could extend the analysis in the first after-lecture assignment (in which we calculated a similarity matrix for Hawaii) to the entire country using a BallTree and identify "island Starbucks locations" on a much larger scale.

Additionally, save the predicted cluster assignments as a new column in your dataframe.

In [0]:
# answer goes here
df['Lat_Rad'] = np.radians(df['Latitude'])
df['Lon_Rad'] = np.radians(df['Longitude'])
df.head(3)

Unnamed: 0,Brand,Store Number,Store Name,Ownership Type,Street Address,City,State/Province,Country,Postcode,Phone Number,Timezone,Longitude,Latitude,Lat_Rad,Lon_Rad
0,Starbucks,47370-257954,"Meritxell, 96",Licensed,"Av. Meritxell, 96",Andorra la Vella,7,AD,AD500,376818720.0,GMT+1:00 Europe/Andorra,1.53,42.51,0.741939,0.026704
1,Starbucks,22331-212325,Ajman Drive Thru,Licensed,"1 Street 69, Al Jarf",Ajman,AJ,AE,,,GMT+04:00 Asia/Dubai,55.47,25.42,0.443663,0.968134
2,Starbucks,47089-256771,Dana Mall,Licensed,Sheikh Khalifa Bin Zayed St.,Ajman,AJ,AE,,,GMT+04:00 Asia/Dubai,55.47,25.39,0.443139,0.968134


In [0]:
df = df.dropna()

In [0]:
X = df[['Lat_Rad','Lon_Rad']]
X.shape

(18084, 2)

In [0]:
dbscan_cluster = DBSCAN(eps=0.0005, min_samples=5,algorithm='ball_tree')
clusters = dbscan_cluster.fit_predict(X)


Finally, plot the resulting clusters on a map using the "scatter_geo" function from plotly.express. The map defaults to the entire world; the "scope" parameter is useful for narrowing down the region plotted in the map. The documentation can be found here:

https://www.plotly.express/plotly_express/#plotly_express.scatter_geo

How many clusters did DBSCAN produce? How many locations were treated as outliers (cluster = -1)?

In [0]:
# answer goes here
df['cluster'] = clusters.astype(object)
px.scatter_geo(data_frame=df, lon = 'Longitude', lat='Latitude',scope='usa', color='cluster')


In [0]:
df['cluster'].value_counts()

-1      9803
 182     303
 407     293
 37      173
 56      164
        ... 
 359       3
 302       3
 14        3
 43        3
 44        2
Name: cluster, Length: 567, dtype: int64

There are 233 outliers 

From the previous plot, we should see a very large number of clusters (400+). This would suggest that our definition of neighborhood may have been too strict. Experiment with other values of eps and min_samples and see how your changes affect the output. Output a map with what you think is the "best" clustering result below.

In [0]:
# answer goes here
dbscan_cluster = DBSCAN(eps=0.02, min_samples=10,algorithm='ball_tree')
clusters = dbscan_cluster.fit_predict(X)


In [0]:
df['cluster'] = clusters.astype(object)
px.scatter_geo(data_frame=df, lon = 'Longitude', lat='Latitude',scope='usa', color='cluster')

We believe the "best" clustering is an eps of 0.02 and min_sample of 10. This follow our belief the Starbucks opened stores based on population density around cities in the US.

In [0]:
df['cluster'] = dbscan_cluster.fit_predict(df[['Lat_Rad', 'Lon_Rad']]).astype(object)
px.scatter_geo(data_frame=df, lon = 'Longitude', lat='Latitude',scope='usa', color='cluster')