# IBM Data Science Capstone: Where to Open a Coffee Shop?
## Author: Zack Kenyon

## Introduction

### Business Problem
This project focuses on determining the best location to open a coffee shop for anyone wanting to take advantage of Arizona State University's very large student population. ASU has 4 campuses: Tempe, Downtown Phoenix, West (Glendale), and Polytechnic (East Mesa). Each of these campuses offers a vastly different landscape in its surrounding area. By analyzing these areas, the goal is to find neighborhoods near campus with an abundance of non-coffee-related businesses. Consideration will also be given to the number of major chains (Starbucks and Dutch Bros.) compared to non-chains in the area.

### Target Audience
The audience for a project such as this would be any aspiring restaurant owner, especially one interested in opening a cafe. Most restaurants close within a year of opening, and while other factors certainly contribute to this, the hardest to overcome is a poor location, as it cannot easily be improved.

## Data Collection 

### Geopy 
To convert the campus addresses to latitudes and longitudes, geopy's Nominatim geocoder will be used.

### FourSquare API
Information regarding the surrounding businesses will be collected through FourSquare's API, with each campus as the center point for searches. The radius will be set to 1600 meters, as that gives approximately 1 mile's worth of businesses to sample. From the results, the business names, categories, and locations will be used to measure the density of coffee shops surrounding each campus, the number of chains, and finally the number of other non-restaurant businesses.

This information will ultimately determine which of the campuses should be explored as the best option for opening a coffee shop, as it gives insight into the amount of competition in the area and the number of potential customers outside of just ASU. The best location is the one that minimizes the former while maximizing the latter.
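FourSquare's `radius` parameter is expressed in meters, so the "approximately 1 mile" figure can be checked with a quick conversion. The snippet below is only a sanity check; the helper name `meters_to_miles` is an illustrative assumption and is not used elsewhere in the notebook.

```python
METERS_PER_MILE = 1609.344  # definition of the international mile

def meters_to_miles(meters):
    """Convert a distance in meters to miles."""
    return meters / METERS_PER_MILE

radius_m = 1600  # search radius used for every campus
radius_mi = meters_to_miles(radius_m)
print(round(radius_mi, 3))
```

A 1600 m radius therefore falls just short of a full mile, which is close enough for sampling the businesses around each campus.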

### Trending Venues
DISCLAIMER: Trending data will be excluded from this analysis due to the COVID-19 outbreak. As it stands, trending data will be largely skewed with many businesses, including ASU itself, being temporarily closed. 

In [None]:
import pandas as pd
from pandas.io.json import json_normalize
import numpy as np 
import matplotlib.pyplot as plt
import requests

from sklearn.cluster import KMeans 

!pip install -q geopy
from geopy.geocoders import Nominatim
!pip install -q geojson
import geojson
!conda install -c conda-forge folium=0.5.0 --yes
import folium

import json
with open('az_arizona_zip_codes_geo.min.json') as json_data: #this gives us the outline of zipcodes for choropleth later
    AZ_Zipcodes= json.load(json_data)
    

In [None]:
Campuses = ['NE_Tempe','SE_Tempe', 'Downtown Phx', 'West', 'Polytechnic']

NE_Tempe = '1001 S McAllister Ave. Tempe, Arizona'
SW_Tempe = '1151 S. Forest Ave. Tempe, Arizona'
DTP = '411 N. Central Ave. Phoenix, Arizona'
West = '13590 N. 47th Ave. Glendale, Arizona'
Poly = '7001 E Williams Field Rd. Mesa, Arizona'

addresses = [NE_Tempe, SW_Tempe, DTP, West, Poly]

Geolocator = Nominatim(user_agent = 'ZDK')
Lat = []
Long = []

for loc in addresses:
    location = Geolocator.geocode(loc)
    Lat.append(location.latitude)
    Long.append(location.longitude)

In [None]:
#columns expects a list of column labels, one per zipped field
ASU = pd.DataFrame(list(zip(Campuses, addresses, Lat, Long)), columns = ['Campus', 'Address', 'Latitude', 'Longitude'])

In [None]:
ASU.head()

In [None]:
import os

def getVenues(campuses, lats, longs):
    #Credentials are read from the environment instead of being hard-coded in the notebook.
    #The variable names below are assumptions; set them before running this cell.
    Client_ID = os.environ['FOURSQUARE_CLIENT_ID']
    Client_Secret = os.environ['FOURSQUARE_CLIENT_SECRET']
    Version = '20200320'
    LIMIT = 200 #the API appears to return at most 100 venues per call with these credentials
    Radius = 1600 #meters; collecting venues within roughly 1 mile from campus 
    
    Venues = []
    #loop variables are named differently from the arguments so the inputs are not shadowed
    for campus, lat, long in zip(campuses, lats, longs):
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(
            Client_ID, Client_Secret, lat, long, Version, Radius, LIMIT)
    
        results = requests.get(url).json()["response"]['groups'][0]['items']
        Venues.append([(
            campus, 
            lat, 
            long, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    #flattening the per-campus lists into one row per venue
    nearby_venues = pd.DataFrame([item for venue_list in Venues for item in venue_list])
    nearby_venues.columns = ['Campus', 
                  'Latitude', 
                  'Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
        

In [None]:
Campus_Venues = getVenues(ASU['Campus'], ASU['Latitude'], ASU['Longitude'])
Campus_Venues.head()

In [None]:
#Building the list of zipcodes surrounding each campus by reverse-geocoding each venue's coordinates
Zip = [] 
for lat, long in zip(Campus_Venues['Venue Latitude'], Campus_Venues['Venue Longitude']):
    Venue_Zip = Geolocator.reverse((lat, long))
    #collecting only Zipcodes; the GeoJson file only contains 5-digit codes, 
    #so any nine-digit code is trimmed to its 5-digit form
    Zip.append(Venue_Zip.raw['address']['postcode'][:5])

In [None]:
Campus_Venues.insert(4, 'Zipcode', Zip)

#Recombining NE & SE Tempe and dropping duplicates that may have been picked up in their overlapping search radius 
Campus_Venues.drop_duplicates(['Venue Latitude', 'Venue Longitude'], keep = 'first', inplace = True)
#assigning the result back avoids the chained inplace replace on a single column
Campus_Venues['Campus'] = Campus_Venues['Campus'].replace(['SE_Tempe', 'NE_Tempe'], 'Tempe')

#getting a sense of the number of venues around each campus
Counted_Venues = Campus_Venues.groupby('Campus')[['Venue']].count()
Counted_Venues

### Number of Venues within 1 Mile of Each Campus
The Tempe campus was divided into 2 segments due to its size. Duplicates were then removed from the DataFrame, checking only the subset of 'Venue Latitude' and 'Venue Longitude'. The 2 segments were then recombined into Tempe, with 80 overlapping venues removed.
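The de-duplication step can be illustrated on a small example. The sketch below uses invented toy rows (the venue names and coordinates are hypothetical, not FourSquare results) to show how checking only the coordinate columns removes a venue picked up by both search points.

```python
import pandas as pd

# Toy data, invented for illustration: two search points whose radii overlap
venues = pd.DataFrame({
    'Campus': ['NE_Tempe', 'NE_Tempe', 'SE_Tempe', 'SE_Tempe'],
    'Venue': ['Cafe A', 'Bookstore B', 'Cafe A', 'Gym C'],
    'Venue Latitude': [33.42, 33.43, 33.42, 33.41],
    'Venue Longitude': [-111.93, -111.94, -111.93, -111.92],
})

# A row is a duplicate when both coordinates match an earlier row
deduped = venues.drop_duplicates(subset=['Venue Latitude', 'Venue Longitude'], keep='first')

# Collapse the two search points into a single campus label
deduped = deduped.assign(Campus=deduped['Campus'].replace(['NE_Tempe', 'SE_Tempe'], 'Tempe'))

removed = len(venues) - len(deduped)
print(removed)
```

Matching on coordinates rather than on the venue name keeps separate branches of the same chain, which share a name but not a location.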

Regarding the number of venues, the first thing to notice is that Polytechnic is the least saturated of the four campuses, followed by the West campus, then Downtown, and finally Tempe. According to FourSquare's documentation, 100 venues appears to be the maximum that can be returned per call with my current credentials. For this reason, the Downtown numbers may actually be higher than what is listed here. However, unlike Tempe, the Downtown campus is the smallest campus, so it would not make sense to pull from 2 separate location coordinates.

Moving forward, we will use the zipcodes surrounding the campuses to analyze the total venues in the area, total coffee shops, and number of chains.  

In [None]:
Hot_Campus = pd.get_dummies(Campus_Venues[['Venue Category']], prefix = '', prefix_sep = '')
Hot_Campus.insert(0, 'Zipcode', Campus_Venues['Zipcode'])
totals = Hot_Campus[['Zipcode', 'Coffee Shop']]
totals = totals.groupby('Zipcode').sum()
Hot_Campus = Hot_Campus.groupby('Zipcode').mean()
Hot_Campus

In [None]:
#Now that we have the one-hot matrix, we can determine the average frequency of coffee-shops in each Zipcode
Coffee_freq = Hot_Campus[['Coffee Shop']].reset_index()
Total_Coffee_Shops = totals[['Coffee Shop']].reset_index()
Coffee_freq

In [None]:
#Venue Category needed to be collected here so the search term could include all Restaurant Variants from FourSquare
Restaurants = Campus_Venues[['Zipcode','Venue Category']]
Rest = Restaurants[Restaurants['Venue Category'].str.contains('Restaurant|Place|Joint')]
total_Rest = Rest.groupby('Zipcode').count().reset_index()
total_Rest

In [None]:
All_Venues = Campus_Venues[['Zipcode', 'Venue']]
Chains = All_Venues[All_Venues['Venue'].str.contains('Starbucks|Dutch')]
Total_Chains = Chains.groupby('Zipcode').count().reset_index()
Total_Chains

In [None]:
Zip_Info = pd.merge_ordered(Coffee_freq, total_Rest)
Zip_Info = Zip_Info.fillna(0)
Zip_Info.rename(columns = {'Venue Category':'Total Restaurants', 'Coffee Shop' : 'Coffee Shop Freq'}, inplace = True)
Zip_Info.insert(2, 'Total Coffee Shops', Total_Coffee_Shops['Coffee Shop'])
Zip_Info

### Coffee Shops, Restaurants in Zipcode
We can see in the Zip_Info DF the frequency of coffee shops compared to other venues in the zipcode, along with total coffee shops and total restaurants. 

From this we can see that although 85280 has the greatest frequency, it has only 1 coffee shop and 0 restaurants, so frequency alone can be misleading. However, there are a couple of tiers in regard to totals. We can fairly safely divide the total restaurants into >20, 10-20, and <5. Something similar can be seen with total coffee shops.
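The difference between frequency and totals is easy to reproduce on toy data. The sketch below uses invented venues and hypothetical Zipcodes (`00001`, `00002`) to show how a Zipcode with a single venue can reach the highest coffee shop frequency while having fewer coffee shops than a busier one.

```python
import pandas as pd

# Toy data, invented for illustration (hypothetical Zipcodes)
venues = pd.DataFrame({
    'Zipcode': ['00001', '00002', '00002', '00002', '00002'],
    'Venue Category': ['Coffee Shop', 'Coffee Shop', 'Coffee Shop', 'Pizza Place', 'Gym'],
})

# One-hot encode the categories, then attach the Zipcode for grouping
one_hot = pd.get_dummies(venues['Venue Category'], dtype=int)
one_hot.insert(0, 'Zipcode', venues['Zipcode'])

freq = one_hot.groupby('Zipcode')['Coffee Shop'].mean()  # share of venues that are coffee shops
total = one_hot.groupby('Zipcode')['Coffee Shop'].sum()  # raw count of coffee shops

summary = pd.DataFrame({'Coffee Shop Freq': freq, 'Total Coffee Shops': total})
print(summary)
```

Reading both columns side by side is what motivates using totals alongside frequency in the rest of the analysis.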

Moving forward, we will pull the top 10 venues for each zipcode and use that in coordination with Zip_Info for K-Means clustering. 

In [None]:
columns = []

for ind in np.arange(10):
    columns.append('#{} Most Common Venue'.format(ind+1))

Zip_Top10 = pd.DataFrame(columns=columns)

for row in Zip_Info['Zipcode']:
    freqs = Hot_Campus.loc[row,:]
    freqs = freqs.T.sort_values(ascending = False)
    freqs = freqs[0:10]
    top10 = freqs.index.values[0:10]
    Zip_Top10.loc[row, :] = top10.T

Zip_Top10.reset_index(inplace = True)
Zip_Top10.drop(['index'], axis = 1, inplace = True)

Zip_Info = Zip_Info.join(Zip_Top10, rsuffix = '')
Zip_Info.head()

In [None]:
#Using folium map package to create a map of Arizona with a zoom_start that encompasses all 4 campuses
ASU_map = folium.Map(location = [Lat[0], Long[0]], zoom_start = 11)

#Creating a Choropleth using the Zipcode Data, and total number of Coffee Shops
ASU_map.choropleth(
    geo_data=AZ_Zipcodes,
    data=Zip_Info,
    columns=['Zipcode', 'Total Coffee Shops'],
    key_on='feature.properties.ZCTA5CE10',
    threshold_scale = [0, 1, 2, 4, 6, 8],
    fill_color='Blues', 
    fill_opacity=0.7, 
    line_opacity=0.2,
    legend_name='Coffee Shops in Zipcode'
)

#Need to add markers for each of the campuses that outline the search radius
for campus, lat, long in zip(ASU['Campus'], ASU['Latitude'], ASU['Longitude']):
    folium.Circle(
        [lat, long], 
        radius = 1600, 
        fill = False).add_to(ASU_map)

for campus, lat, long in zip(ASU['Campus'], ASU['Latitude'], ASU['Longitude']):
    label = folium.Popup(str(campus))
    folium.Marker([lat, long], popup = label).add_to(ASU_map)
    
# display map
ASU_map

### Coffee Shops Map
In the map above, a choropleth was created using the GeoJSON data for Arizona Zipcodes. Each campus was given a popup marker and a circle to indicate the FourSquare search radius. Tempe kept the dual markers since its search area consists of two separate rings. We can see from the map that the highest number of coffee shops within 1 mile of an ASU campus is located on the east side of the Downtown campus. The Tempe campus and the west side of Downtown have similar numbers of coffee shops. Finally, the West and Polytechnic campuses each have only 1-2 coffee shops in the surrounding Zipcodes.

In [None]:
#K-Means clustering based on a one-hot-encoded DF with freq, total shops/restaurants, and most common venues
clusters = 5
Zip_Cls = Zip_Info.drop(['Zipcode'], axis = 1)
Hot_Zip = pd.get_dummies(Zip_Cls)
K_Zip = KMeans(n_clusters = clusters, random_state = 0).fit(Hot_Zip)
K_Zip.labels_ 

In [None]:
#Recombining the cluster labels with the Zipcode data
Clustered_Zip = Zip_Info.copy(deep = True)
Clustered_Zip.insert(1, 'Cluster', K_Zip.labels_)
Clustered_Zip.sort_values('Cluster')

In [None]:
#To get Full_Data, merge Campus_Venues w/ Clustered_Zip on the Zipcode column. The outer merge keeps every row from 
#both DFs; note that validate = 'many_to_many' is accepted but performs no uniqueness check on the shared column. 
Full_Data = pd.merge(Campus_Venues, Clustered_Zip, on = 'Zipcode', how = 'outer', validate = 'many_to_many')
Full_Data.head()

In [None]:
#Creating the base layer for the map
Full_map = folium.Map([Full_Data['Latitude'].iloc[0], Full_Data['Longitude'].iloc[0]], zoom_start = 11)

#A list of colors to assign to the clusters, with matching display names for the layer control
Colors = ['lightblue', 'darkgreen', 'yellow', 'orange', 'purple']
Color_Names = ['Light Blue', 'Dark Green', 'Yellow', 'Orange', 'Purple']

#Creating one feature group per cluster so the user can toggle the clusters 
Feature_Groups = []
for i, name in enumerate(Color_Names):
    fg = folium.FeatureGroup(name = 'Cluster: {}, {}'.format(i, name))
    Full_map.add_child(fg)
    Feature_Groups.append(fg)

#Creating circle markers for every venue collected in the initial search and assigning them a color and feature group
#based on its assigned cluster 
for cluster, lat, long in zip(Full_Data['Cluster'], Full_Data['Venue Latitude'], Full_Data['Venue Longitude']):
    if pd.isna(cluster): #rows without a cluster are skipped, as in the original if/elif chain
        continue
    folium.CircleMarker(
        [lat, long],
        radius = 7,
        popup = folium.Popup('Cluster: ' + str(cluster)),
        weight = 0.1,
        fill = True,
        fill_color = Colors[int(cluster)],
        fill_opacity = 1
        ).add_to(Feature_Groups[int(cluster)])

#Showing the search radius and campus location for all 4 campuses
for campus, lat, long in zip(ASU['Campus'], ASU['Latitude'], ASU['Longitude']):
    folium.Circle(
        [lat, long], 
        radius = 1600, 
        fill = False).add_to(Full_map)

for campus, lat, long in zip(ASU['Campus'], ASU['Latitude'], ASU['Longitude']):
    label = folium.Popup(str(campus))
    folium.CircleMarker(
        [lat, long], 
        radius= 10, 
        fill = True, 
        fill_opacity = .7, 
        fill_color = 'darkred', 
        popup = label).add_to(Full_map)   
    
folium.LayerControl(collapsed = False).add_to(Full_map)   #collapsed set to false so users can see additional functionality
Full_map

### Cluster Map
There are a total of 5 clusters, which can be toggled on and off to get a view of the general areas around each campus. Due to the limitations of folium's choropleth map, the (Lat, Long) data for each venue was used to create CircleMarkers, color-coded by cluster, which 'paint' the area where that cluster is prominent. This method helps to visualize the various groupings at a glance, which can then be further explored.

**When restarting the kernel, the cluster contents stay the same, but the labels for clusters 0 and 3 flip almost every time. For some reason, these two are the only ones to change labels. For clarification, just pay attention to the two locations where these clusters show up.**
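K-Means label numbers carry no meaning of their own, so one possible way to make them comparable between runs is to renumber the clusters by a stable property of their members. The sketch below is not part of the original analysis: the helper `relabel_by_first_member` and the Zipcodes are hypothetical, and it orders clusters by the smallest Zipcode each one contains.

```python
def relabel_by_first_member(keys, labels):
    """Renumber clusters so that cluster 0 contains the smallest key,
    cluster 1 the smallest key not in cluster 0, and so on."""
    # Smallest key found in each original cluster
    smallest = {}
    for key, label in zip(keys, labels):
        if label not in smallest or key < smallest[label]:
            smallest[label] = key
    # Order the original labels by their smallest key and map them to 0, 1, 2, ...
    ordered = sorted(smallest, key=lambda label: smallest[label])
    mapping = {old: new for new, old in enumerate(ordered)}
    return [mapping[label] for label in labels]

# Hypothetical Zipcodes and two runs that differ only in label numbering
zipcodes = ['00001', '00002', '00003', '00004']
run_a = [0, 3, 0, 3]
run_b = [3, 0, 3, 0]
stable_a = relabel_by_first_member(zipcodes, run_a)
stable_b = relabel_by_first_member(zipcodes, run_b)
print(stable_a, stable_b)
```

In the notebook, the same idea could be applied to `Zip_Info['Zipcode']` and `K_Zip.labels_` before the labels are inserted into `Clustered_Zip`, assuming the cluster contents themselves do not change between runs.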

1) **Cluster 0/Cluster 3** was the most populated in terms of venues but consisted of only 2 Zipcodes, located primarily on the southwest side of the Downtown campus and on the north side of the West campus. These 2 Zipcodes contain 18 and 24 total restaurants. As mentioned previously, there appear to be a couple of obvious tiers in that regard. Their top 10 venues have only two categories in common: Pizza Place (1st and 3rd) and Coffee Shop (2nd and 9th).

2) **Cluster 1** was the most common cluster with 5 Zipcodes, but it was the least populated with only 13 venues. Each Zipcode in Cluster 1 has 0-1 coffee shops and fewer than 5 restaurants. With 5 Zipcodes in this cluster, it appears at all four campuses. The top 10 venues are nearly identical among 3 of the 5 Zipcodes, with major venues being museums, yoga studios, and college stadiums. The remaining two are likewise very similar in their 5th-10th venues, sharing the presence of yoga studios with the rest of the group.

A safe assumption would be that the similarity between these Zipcodes is misleading. Once all venues present in a Zipcode were listed, the most common remaining frequency would have been 0, which could explain why the latter halves of the top 10 lists match so well.

3) **Cluster 2** was the second most populated cluster, with 88 venues packed into only 1 Zipcode located at the Tempe campus. It has 4 coffee shops and 45 restaurants, giving it more restaurants than clusters 1, 3, and 4 combined. The majority of Cluster 2 is located on the northwest side of the Tempe campus, making that location arguably the most competitive for opening a business.

4) **Cluster 3/Cluster 0** contains 75 venues spread across 2 Zipcodes. The majority of this cluster is located on the east side of the Downtown campus. Both of its Zipcodes have a coffee shop as their most common venue, with breakfast and deli shops following close behind. A coffee shop in this cluster would face the most direct competition.

5) **Cluster 4** is similar to Cluster 1 in that it has very few coffee shops and restaurants spread among its 3 Zipcodes. It is located primarily at the Polytechnic campus, with a few venues at West. Of its 46 venues, only 2 are coffee shops and 12 are restaurants. Further, the top 10 venues have next to no commonality, making a decision regarding viability for this cluster a bit difficult.
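The padding effect suspected for Cluster 1, where zero-frequency categories fill out the end of a top-venue list, can be seen on a toy frequency row. The sketch below uses invented values rather than the notebook's data, and the non-zero filter is one possible adjustment, not the method used in the analysis.

```python
import pandas as pd

# Toy frequency row for one Zipcode, invented for illustration
freqs = pd.Series({'Museum': 0.5, 'Yoga Studio': 0.5, 'Bakery': 0.0, 'Bar': 0.0, 'Gym': 0.0})

# Taking the top 3 directly pads the list with a category that never occurs
top3_padded = list(freqs.sort_values(ascending=False).index[:3])

# Keeping only categories that actually occur avoids the padding
nonzero = freqs[freqs > 0].sort_values(ascending=False)
top3_present = list(nonzero.index[:3])

print(top3_padded)
print(top3_present)
```

Because the padded entries all tie at 0, Zipcodes with few venues end up sharing the same filler categories, which can make them look more alike to K-Means than their actual venues justify.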