# The Way to a City's Heart...

How much can we tell about an area in a city based on the restaurants it has? Answering that question is the goal of this project.

## Introduction

People of varying backgrounds, economic statuses, and tastes shape the restaurant options located near them. So it seems likely that different neighborhoods in a city, and different cities, would be composed of different sorts of food establishments. This project investigates how closely established city neighborhoods align with clusters determined by restaurants and what sorts of restaurants are present in the various clusters. It also compares the restaurant compositions of different cities.

## Using the Notebook

The first few cells in this notebook gather the dataset for analysis. Run them only once and save the data they gather, so that you don't need to keep making API calls. To run the code, you will need your own Foursquare API credentials, and you will have to edit the code cell that loads mine from a file.

The second half of the notebook is dedicated to analyzing the dataset once it is saved. Since I'm focused on Chicago, I place a few markers on the airports in that city for all my maps, but obviously they should be removed for other cities.

## Data

Foursquare's API has a search feature that allows the user to search for venues in a city that belong to certain categories. Among these categories are Food and Bars. Each of these categories is divided into subcategories such as Afghan Restaurant, Burger Joint, etc. Given a city, we can consider all of its restaurants and the categories each one belongs to. Using this data, we can hope to understand the "culinary neighborhoods" of a city.

In [None]:
# imports

import numpy as np
import pandas as pd
import requests
import json
import folium
import matplotlib.pyplot as plt

from matplotlib import cm
from matplotlib.colors import to_hex

from math import sqrt

from collections import defaultdict

from sklearn.cluster import DBSCAN
from sklearn.cluster import KMeans
from sklearn.preprocessing import OneHotEncoder

In [None]:
# get foursquare creds

with open('foursquare_api_key.json', 'r') as f:
    api_info = json.loads(f.read())
    client_id = api_info['client_id']
    client_secret = api_info['client_secret']
    v = api_info['v']

In [None]:
# set up variables for loading in data

loc_list = []
limit = 50
offset = 0
radius = 10000
cats = '4d4b7105d754a06374d81259,4bf58dd8d48988d116941735'
city = 'Chicago'
restaurants_id_set = set()

In [None]:
# helper function to parse the venue data

def get_restaurant_info(group_item):
    rest_id = group_item['id']
    name = group_item['name']
    lat = group_item['location']['lat']
    lng = group_item['location']['lng']
    cat = group_item['categories'][0]['name']
    return (rest_id, name, lat, lng, cat)

In [None]:
# populate location list

while len(loc_list) == offset:
    # pull api data for locations
    url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&near={}&radius={}&categoryId={}&offset={}&limit={}'.format(
        client_id,
        client_secret,
        v,
        city,
        radius,
        cats,
        offset,
        limit
    )
    response = requests.get(url)
    result = json.loads(response.content)['response']
    total_results = int(result['totalResults'])
    for item in result['groups'][0]['items']:
        info = get_restaurant_info(item['venue'])
        loc_list.append(info)
        restaurants_id_set.add(info[0])
    offset += limit
print('Found {} locations/ {} unique locations.'.format(len(loc_list), len(restaurants_id_set)))

In [None]:
# use the locations found to come up with a place to center the map

def getAvgCoords(df):
    return df['Lat'].mean(), df['Lng'].mean()

def getAvgCoords_list(loc_list):
    lat = 0
    lng = 0
    for item in loc_list:
        lat += item[2]
        lng += item[3]
    return lat/len(loc_list), lng/len(loc_list)

In [None]:
# create map

my_map = folium.Map(location = getAvgCoords_list(loc_list), zoom_start = 11, width='50%')

# these are the airports in Chicago for reference
folium.CircleMarker(location = [41.9742, -87.9073], popup="O'Hare", color='blue').add_to(my_map) 
folium.CircleMarker(location = [41.7868, -87.7522], popup="Midway", color='blue').add_to(my_map) 


for entry in loc_list:
    folium.CircleMarker(
        location=[entry[2],entry[3]], 
        popup = entry[1], 
        fill = True, 
        radius=3, 
        weight=2,
        color='green', 
        fill_color='white',
        fill_opacity=0.8).add_to(my_map)
my_map

We need more data! To get it, I propose the following: use DBSCAN to locate isolated points and boundary points. Then pick an isolated point and a boundary point at random and query around each of them. If no isolated points or boundary points are found, decrease the DBSCAN radius slightly and continue. Stop when either
1. The DBSCAN radius hits a given threshold
2. Some maximum number of points has been added.
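
The selection step above can be sketched on synthetic coordinates as follows. This is only an illustration, assuming scikit-learn's `DBSCAN` with its default `min_samples`; the helper name `split_points` and the toy coordinates are hypothetical and are not part of the notebook's data.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def split_points(coords, eps):
    """Return boolean masks (core, boundary, outlier) for an (n, 2) array."""
    scanner = DBSCAN(eps=eps).fit(coords)
    core = np.zeros(len(coords), dtype=bool)
    core[scanner.core_sample_indices_] = True
    outlier = scanner.labels_ == -1      # DBSCAN labels noise points as -1
    boundary = ~core & ~outlier          # in a cluster, but not a core sample
    return core, boundary, outlier

# toy data (hypothetical): one dense blob plus a single far-away point
rng = np.random.default_rng(0)
dense = rng.normal(loc=0.0, scale=0.001, size=(30, 2))
far = np.array([[1.0, 1.0]])
coords = np.vstack([dense, far])

core, boundary, outlier = split_points(coords, eps=0.003)

# the points we would query around to grow the dataset
seeds = coords[boundary | outlier]
```

Every point falls into exactly one of the three masks, so `seeds` collects the edges of the known data, which is where new venues are most likely to be found.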

In [None]:
# set up initial search parameters

search_radius = 1000
eps = 0.003

In [None]:
# form a dataframe from the initial location data collected

def create_df(loc_list, step):
    # each entry of loc_list is a tuple (id, name, lat, lng, category)
    new_restaurant_data = pd.DataFrame(loc_list, columns=['id','Name','Lat','Lng','Category'])
    new_restaurant_data['step'] = step
    # index by the venue id so later lookups and concatenations line up
    new_restaurant_data.set_index('id', inplace=True)
    return new_restaurant_data

restaurant_data = create_df(loc_list,0)

In [None]:
# run dbscan

def run_dbscan(eps, data):
    scanner = DBSCAN(eps=eps)
    data['label'] = scanner.fit_predict(data[['Lat','Lng']])
    # core_sample_indices_ are positional, so mark core samples by position;
    # everything else is either a boundary point or an outlier
    data['isCore'] = False
    data.iloc[scanner.core_sample_indices_, data.columns.get_loc('isCore')] = True
    return data

restaurant_data = run_dbscan(0.003, restaurant_data)

In [None]:
# visualize results...
# first set up functions to color the points


def get_fill_color(cmap, label):
    if label == -1:
        return '#444444'
    else:
        return to_hex(cmap(label))
    
def get_boundary_color(cmap, is_core):
    if is_core:
        return '#000000'
    else:
        return '#666666'



In [None]:
# create map

cmap = cm.get_cmap(name='cool', lut=restaurant_data['label'].max())
my_map = folium.Map(location = getAvgCoords(restaurant_data), zoom_start = 11, width='50%')

# these are the airports in Chicago for reference but delete them if you use a new city
folium.CircleMarker(location = [41.9742, -87.9073], popup="O'Hare", color='blue').add_to(my_map) 
folium.CircleMarker(location = [41.7868, -87.7522], popup="Midway", color='blue').add_to(my_map) 


for i in restaurant_data.index:
    entry = restaurant_data.loc[i,:]
    folium.CircleMarker(
        location=[entry['Lat'],entry['Lng']], 
        popup = entry['Name'], 
        fill = True, 
        radius=3, 
        weight=2,
        color=get_boundary_color(cmap, entry['isCore']), 
        fill_color=get_fill_color(cmap, entry['label']),
        fill_opacity=0.8).add_to(my_map)
my_map

In [None]:
# this function will take the venues in data and add them to loc_list if they do not appear in restaurants_id_set

def make_queries(data, search_radius):
    for venue_id in data.index:

        venue = data.loc[venue_id]
        url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&categoryId={}&limit={}'.format(
            client_id,
            client_secret,
            v,
            venue['Lat'],
            venue['Lng'],
            str(search_radius),
            cats,
            limit
        )
        response = None
        try:
            response = requests.get(url)
            result = json.loads(response.content)['response']
        except Exception:
            # report the failure and move on to the next venue
            print('At the time of failure...')
            print(response)
            if response is not None:
                print(response.content)
            continue
        if 'venues' in result:
            for item in result['venues']:
                info = get_restaurant_info(item)
                if info[0] not in restaurants_id_set:
                    loc_list.append(info)
                    restaurants_id_set.add(info[0])
        elif 'venue' in result:
            info = get_restaurant_info(result['venue'])
            if info[0] not in restaurants_id_set:
                loc_list.append(info)
                restaurants_id_set.add(info[0])
    

In [None]:
# restaurants_id_set and restaurant_data should be initialized with first data
# you then run this code cell repeatedly until you get enough restaurants and bars
# up the number in the loop if you want to run more at a time
# sometimes a 502 error arises, but that's probably ok

for step in range(1):
    # 1 query around the boundary points and outliers
    # 2 adjust parameters as follows:
    # 2a if there are few boundary points and outliers, decrease eps by about 10%
    # 2b if there are no new restaurants, increase search radius by about 10%
    # 3 make a new df with results for analysis
    # 4 run DBscan to get outliers and boundary points
    
    
    outliers = restaurant_data[restaurant_data['label'] == -1]
    boundaries = restaurant_data[(restaurant_data['label'] != -1) & (restaurant_data['isCore'] == False)]
    
    loc_list = []
    
    make_queries(outliers, search_radius)
    make_queries(boundaries, search_radius)
    
    
    
    if len(loc_list) < 10:
        search_radius = int(1.1*search_radius)
    
    if len(outliers.index) + len(boundaries.index) < 20:
        eps /= 1.1
    if len(loc_list) > 0:  
        new_restaurant_data = create_df(loc_list, step)

        restaurant_data = pd.concat([restaurant_data[['Name','Lat','Lng','Category','step']],new_restaurant_data])

        restaurant_data = run_dbscan(eps, restaurant_data)


    

In [None]:
# create map

cmap = cm.get_cmap(name='cool', lut=restaurant_data['label'].max())
my_map = folium.Map(location = getAvgCoords(restaurant_data), zoom_start = 11, width='50%')

# these are the airports in Chicago for reference
folium.CircleMarker(location = [41.9742, -87.9073], popup="O'Hare", color='blue').add_to(my_map) 
folium.CircleMarker(location = [41.7868, -87.7522], popup="Midway", color='blue').add_to(my_map) 


for i in restaurant_data.index:
    entry = restaurant_data.loc[i,:]
    folium.CircleMarker(
        location=[entry['Lat'],entry['Lng']], 
        popup = entry['Name'], 
        fill = True, 
        radius=3, 
        weight=2,
        color=get_boundary_color(cmap, entry['isCore']), 
        fill_color=get_fill_color(cmap, entry['label']),
        fill_opacity=0.8).add_to(my_map)
my_map

In [None]:
# you've worked hard for your data, so save it

restaurant_data.to_csv('restaurant_data')



## Analysis

Now it's time to cluster and explore the data we've gathered. I'll start by examining how k-means clustering behaves on the lat-long coordinates of the venues combined with a one-hot encoding of the venue category. Each venue gets a 2D spatial vector $X$ and an $N$-dimensional "category vector" $V$, where $N$ is the number of categories. I'll standardize $X$ and then run k-means clustering on the concatenated vector $(\alpha X, (1-\alpha)V)$, so $\alpha = 1$ clusters on location alone and $\alpha = 0$ on category alone. The initial exploration will be carried out for various choices of $k$ and $\alpha$.
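
To make the weighting concrete, here is a minimal sketch of how the feature matrix could be assembled for a handful of venues. The helper `weighted_features` and the four toy venues are hypothetical; the sketch only assumes the `Lat`, `Lng`, and `Category` column names used elsewhere in this notebook.

```python
import pandas as pd

def weighted_features(df, alpha):
    """Standardize the coordinates, one-hot encode the category, then weight."""
    coords = df[['Lat', 'Lng']]
    coords = (coords - coords.mean()) / coords.std()
    cats = pd.get_dummies(df['Category'], dtype=float)
    # alpha scales the spatial block, (1 - alpha) scales the category block
    return pd.concat([alpha * coords, (1 - alpha) * cats], axis=1)

# toy venues (made up for illustration)
toy = pd.DataFrame({
    'Lat': [41.88, 41.89, 41.95, 41.96],
    'Lng': [-87.63, -87.62, -87.70, -87.71],
    'Category': ['Pizza Place', 'Bar', 'Pizza Place', 'Bar'],
})

geo_only = weighted_features(toy, alpha=1.0)   # category columns are all zero
cat_only = weighted_features(toy, alpha=0.0)   # coordinate columns are all zero
```

With `alpha=1.0` the category columns vanish, so k-means sees only geography; with `alpha=0.0` the coordinates vanish and venues are grouped purely by category.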

In [None]:
# read data from the csv file and clean

def make_clusters(k = 20, alpha = 1):
    restaurant_data = pd.read_csv('restaurant_data')
    restaurant_data.set_index('id', inplace=True)
    restaurant_data.drop(['Name', 'step','label','isCore'], axis = 1, inplace=True)
    coord_data = restaurant_data[['Lat','Lng']].copy()

    # encode category column

    restaurant_data = pd.get_dummies(restaurant_data, columns = ['Category'], prefix = '', prefix_sep = '')

    # standardize

    for col in ['Lat','Lng']:
        restaurant_data[col] = (restaurant_data[col] - restaurant_data[col].mean())/restaurant_data[col].std()

    # weight

    dummy_cols = [col for col in restaurant_data.columns if col not in ['Lat','Lng']]
    restaurant_data[['Lat','Lng']] *= alpha
    restaurant_data[dummy_cols] *= 1-alpha

    # predict, setting seed for comparability
    np.random.seed(0)
    clusterer = KMeans(n_clusters = k)
    restaurant_data['label'] = clusterer.fit_predict(restaurant_data)
    return restaurant_data


In [None]:
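# create map
# make sure to load getAvgCoords and get_fill_color from above

# make_clusters keeps its coordinates local, so rebuild the unscaled coordinates
# and the cluster labels here before plotting
coord_data = pd.read_csv('restaurant_data').set_index('id')[['Lat','Lng']]
restaurant_data = make_clusters()

cmap = cm.get_cmap(name='cool', lut=restaurant_data['label'].max())
my_map = folium.Map(location = getAvgCoords(coord_data), zoom_start = 10, width='50%')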

# these are the airports in Chicago for reference
folium.CircleMarker(location = [41.9742, -87.9073], popup="O'Hare", color='blue').add_to(my_map) 
folium.CircleMarker(location = [41.7868, -87.7522], popup="Midway", color='blue').add_to(my_map) 


for i in restaurant_data.index:
    if np.random.rand() < 0.2:
        entry = coord_data.loc[i,:]
        folium.CircleMarker(
            location=[entry['Lat'],entry['Lng']],
            popup = "group" + str(restaurant_data.loc[i,'label']),
            fill = True, 
            radius=3, 
            weight=2,
            color=get_fill_color(cmap, restaurant_data.loc[i,'label']),
            fill_color=get_fill_color(cmap, restaurant_data.loc[i,'label']),
            fill_opacity=1).add_to(my_map)
my_map

We also need to examine what sorts of values we got for our labels.

In [None]:
onehotcats = make_clusters(20,0)

cat_counts = onehotcats.drop(['Lat','Lng','label'], axis=1).sum().sort_values(ascending=False)
print(cat_counts.head())

cat_counts[20::-1].plot(kind='barh')
#plt.xticks(rotation=45)


Some things to note:
1. Many restaurant categories have only one entry, so we greatly increase the dimensionality of the data space by keeping them. Perhaps they should be dropped.
2. Some non-restaurant categories have crept into the dataset! It's unclear why those have appeared with the search criteria. Perhaps the first category isn't always the one we need.
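
If we did decide to drop the rare categories, one possible approach is sketched below. The helper name `drop_rare_categories` and the `min_count` threshold are assumptions made for illustration; the notebook itself keeps every category.

```python
import pandas as pd

def drop_rare_categories(df, min_count=2, column='Category'):
    """Keep only rows whose category appears at least min_count times."""
    counts = df[column].value_counts()
    keep = counts[counts >= min_count].index
    return df[df[column].isin(keep)]

# toy data (made up): 'Bar' appears three times, the others once each
toy = pd.DataFrame({'Category': ['Bar', 'Bar', 'Pizza Place', 'Bar', 'Afghan Restaurant']})
filtered = drop_rare_categories(toy, min_count=2)
```

Applying this before the one-hot encoding would shrink the category vector, at the cost of discarding the venues in the rare categories.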

The next step is to understand whether the groupings are actually picking out neighborhoods based on the restaurants in them. We start by clustering locations based only on their geographical coordinates and finding the proportions of different kinds of restaurants in each cluster. To measure how concentrated the restaurant types in a neighborhood are, we'll use the quantity $\sum_i p_i^2$, where $p_i$ is the proportion of restaurant type $i$ in that neighborhood. It equals 1 when a neighborhood contains a single type and falls to $1/N$ when $N$ types are equally common. Averaging this across all the neighborhoods shows how concentrated the restaurants are for a given setting of $\alpha$.
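
As a quick worked example of the measure $\sum_i p_i^2$ (which has the same form as the Herfindahl-Hirschman index), the sketch below computes it for two made-up neighborhoods. The function name `concentration` is hypothetical.

```python
import pandas as pd

def concentration(categories):
    """Sum of squared category proportions for one neighborhood."""
    p = pd.Series(categories).value_counts(normalize=True)
    return float((p ** 2).sum())

# four different types, each with proportion 1/4: 4 * (1/4)**2 = 0.25
mixed = concentration(['Bar', 'Pizza Place', 'Taco Place', 'Café'])

# a single type with proportion 1: 1**2 = 1.0
mono = concentration(['Bar', 'Bar', 'Bar', 'Bar'])
```

A higher value therefore means the neighborhood is dominated by fewer restaurant types.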

In [None]:
# create dataframe which stores the onehot info and lat/long info without any scaling at all

restaurant_data = pd.read_csv('restaurant_data')
restaurant_data.set_index('id', inplace=True)
restaurant_data.drop(['Name', 'step','label','isCore'], axis = 1, inplace=True)

restaurant_data = pd.get_dummies(restaurant_data, columns = ['Category'], prefix = '', prefix_sep = '')


In [None]:
# check how clustered the restaurant types are with geographical clustering only

# form df with labels and tally 
loc_only = pd.concat([restaurant_data,make_clusters(20,1)['label']],axis=1).drop(['Lat','Lng'],axis=1)
cluster_info = loc_only.groupby('label').sum()
cluster_info['total'] =  cluster_info.sum(axis=1)

# walk through each group and compute concentration 
base_result_table = []
base_prop_measure = 0
for i in cluster_info.index:
    cluster_info.loc[i,:] /= cluster_info.loc[i,'total']
    tmp = 100*cluster_info.loc[i,:].drop(['total']).sort_values(ascending=False).transpose()
    base_prop_measure += (tmp*tmp).sum()/10000
                                
print('Geographically clustered concentration measure:', base_prop_measure/len(cluster_info.index))

In [None]:
# try again with alpha = 0.5

# form df with labels and tally
alpha8 = pd.concat([restaurant_data,make_clusters(20,0.5)['label']],axis=1)
# drop the coordinates before tallying so their sums don't distort the totals
cluster_info = alpha8.groupby('label').sum().drop(['Lat','Lng'],axis=1)
cluster_info['total'] =  cluster_info.sum(axis=1)

# walk through each group and compute concentration
alpha8_result_table = []
alpha8_prop_measure = 0
for i in cluster_info.index:
    cluster_info.loc[i,:] /= cluster_info.loc[i,'total']
    tmp = 100*cluster_info.loc[i,:].drop(['total']).sort_values(ascending=False).transpose()
    alpha8_prop_measure += (tmp*tmp).sum()/10000


print('Alpha = 0.5 concentration measure:', alpha8_prop_measure/len(cluster_info.index))


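We can also get a quantitative measure of how geographically tight the clusters are by computing, for each cluster, the spread of the squared distances from the cluster centroid (the code below uses the standard deviation) and averaging over the clusters.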
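
A compact way to express this measure is sketched here on toy data. The helper `mean_sqdist_spread` is hypothetical and only assumes a dataframe with `Lat`, `Lng`, and `label` columns; it uses the sample standard deviation, matching the `.std()` call in the cells below.

```python
import pandas as pd

def mean_sqdist_spread(df):
    """Average, over clusters, of the std of squared distances to the centroid."""
    # per-row centroid of the cluster that row belongs to
    centroids = df.groupby('label')[['Lat', 'Lng']].transform('mean')
    distsq = ((df[['Lat', 'Lng']] - centroids) ** 2).sum(axis=1)
    return distsq.groupby(df['label']).std().mean()

# toy data (made up): cluster 0 is symmetric, cluster 1 has one far point
toy = pd.DataFrame({
    'Lat':   [0.0, 2.0, 0.0, 0.0, 3.0],
    'Lng':   [0.0, 0.0, 0.0, 0.0, 0.0],
    'label': [0,   0,   1,   1,   1],
})
spread = mean_sqdist_spread(toy)
```

In cluster 0 both points sit at squared distance 1 from the centroid, so the spread is 0; in cluster 1 the squared distances are 1, 1, and 4, so the far point drives the spread up.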

In [None]:
# compute variances of sqdists from mean with geographically clustered

loc_only = make_clusters(20,1)

# find centroids
centroids = loc_only[['Lat','Lng','label']].groupby('label').mean().reset_index()
loc_only = loc_only.merge(centroids, on='label')

# compute sqdist
loc_only['distsq'] = (loc_only['Lat_x']-loc_only['Lat_y'])**2 + (loc_only['Lng_x']-loc_only['Lng_y'])**2
                                
print('Mean variance of sqdist for geographically clustered:', loc_only.groupby('label').std()['distsq'].mean())

In [None]:
# same, but with alpha = 0.5

alpha5 = pd.concat([restaurant_data,make_clusters(20,0.5)['label']],axis=1)

# find centroids
centroids = alpha5[['Lat','Lng','label']].groupby('label').mean().reset_index()
alpha5 = alpha5.merge(centroids, on='label')

# compute sqdist
alpha5['distsq'] = (alpha5['Lat_x']-alpha5['Lat_y'])**2 + (alpha5['Lng_x']-alpha5['Lng_y'])**2

print('Mean variance of sqdist for alpha 0.5:', alpha5.groupby('label').std()['distsq'].mean())

### Compute concentrations and mean sqdist variance for varying alpha

Let's repeat the above computations for several values of $\alpha$ and plot the results to compare.

In [None]:
alphas = []
concentrations = []
sqdistvars = []

for i in range(11):
    alpha = i/10
    alphas.append(alpha)
    clusters = make_clusters(20,alpha)
    results = pd.concat([restaurant_data,clusters['label']],axis=1)
    
    
    # add concentration info
    cluster_info = results.groupby('label').sum().drop(['Lat','Lng'], axis=1)
    cluster_info['total'] =  cluster_info.sum(axis=1)

    prop_measure = 0
    for i in cluster_info.index:
        cluster_info.loc[i,:] /= cluster_info.loc[i,'total']
        tmp = 100*cluster_info.loc[i,:].drop('total').sort_values(ascending=False).transpose()
        prop_measure += (tmp*tmp).sum()/10000
    concentrations.append(prop_measure/len(cluster_info.index))
    
    # add geographical dispersion info
    centroids = results[['Lat','Lng','label']].groupby('label').mean().reset_index()
    results = results.merge(centroids, on='label')
    results['distsq'] = (results['Lat_x']-results['Lat_y'])**2 + (results['Lng_x']-results['Lng_y'])**2
    sqdistvars.append(results.groupby('label').std()['distsq'].mean())

In [None]:
plt.plot(alphas, concentrations)
plt.plot(alphas, sqdistvars)
plt.title("Comparing clustering")
plt.xlabel("Alpha")
plt.legend(['Venue Concentrations', 'Sq Distance Variance'])
#pd.DataFrame(alpha8_result_table)

Since we want to choose $\alpha$ to make venue concentration high and distance variance small, let's choose $\alpha$ that maximizes $$\frac{\mbox{venue concentration}}{\mbox{square distance variance}}.$$
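
Picking the maximizing $\alpha$ can also be done programmatically rather than by reading it off the plot. The sketch below assumes three equal-length lists like the `alphas`, `concentrations`, and `sqdistvars` built above; the helper `best_alpha` is hypothetical, and the numbers are made up purely to exercise the function, not results from this notebook.

```python
import numpy as np

def best_alpha(alphas, concentrations, sqdistvars):
    """Return the alpha with the largest concentration / spread ratio, and that ratio."""
    ratios = np.asarray(concentrations, dtype=float) / np.asarray(sqdistvars, dtype=float)
    best = int(np.argmax(ratios))
    return alphas[best], float(ratios[best])

# made-up inputs for illustration only
demo_alphas = [0.0, 0.5, 1.0]
demo_conc = [0.9, 0.6, 0.1]
demo_var = [3.0, 1.0, 0.5]

# ratios are 0.3, 0.6, 0.2, so the middle alpha wins
alpha_star, ratio_star = best_alpha(demo_alphas, demo_conc, demo_var)
```

Note that the ratio is undefined where the spread is zero, so a real run may need to guard against that case.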

In [None]:
ratios = []
for i in range(len(alphas)):
    ratios.append(concentrations[i]/sqdistvars[i])
plt.plot(alphas, ratios)
plt.title("Ratio of interest")
plt.xlabel("Alpha")

So we'll form our final results using $\alpha = 0.4$, which has a good mix of geographic and cuisine-type clustering features.

## Final Evaluation

In [None]:
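final_cluster = make_clusters(20,0.4)

# unscaled coordinates used to center and draw the maps below
coord_data = restaurant_data[['Lat','Lng']].copy()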

In [None]:
# create map

cmap = cm.get_cmap(name='cool', lut=final_cluster['label'].max())
my_map = folium.Map(location = getAvgCoords(coord_data), zoom_start = 10, width='50%')

# these are the airports in Chicago for reference
folium.CircleMarker(location = [41.9742, -87.9073], popup="O'Hare", color='blue').add_to(my_map) 
folium.CircleMarker(location = [41.7868, -87.7522], popup="Midway", color='blue').add_to(my_map) 


for i in restaurant_data.index:
    if np.random.rand() < 0.2:
        entry = coord_data.loc[i,:]
        folium.CircleMarker(
            location=[entry['Lat'],entry['Lng']],
            popup = "group" + str(final_cluster.loc[i,'label']),
            fill = True, 
            radius=3, 
            weight=2,
            color=get_fill_color(cmap, final_cluster.loc[i,'label']),#'#555555',#get_boundary_color(cmap, label[i]), 
            fill_color=get_fill_color(cmap, final_cluster.loc[i,'label']),
            fill_opacity=1).add_to(my_map)
# centroids = pd.concat([restaurant_data, final_cluster['label']],axis=1).groupby('label').mean()[['Lat','Lng']]
# for i in centroids.index:
#     entry = centroids.loc[i,:]
#     folium.CircleMarker(
#             location=[entry['Lat'],entry['Lng']],
#             popup = "group" + str(i),
#             fill = True, 
#             radius=3, 
#             weight=5,
#             color=get_fill_color(cmap, i),#'#555555',#get_boundary_color(cmap, label[i]), 
#             fill_color=get_fill_color(cmap, i),
#             fill_opacity=1).add_to(my_map)

print('Locations of Final Groups')
my_map

In [None]:
# print out the information about restaurant types in each cluster

def process_row(row):
    sorted_row = row.sort_values(ascending=False)
    results = []
    for i in sorted_row.index:
        if sorted_row[i] > 1:
            results.append(str(i) + ', ' + '{:.2f}'.format(sorted_row[i]) + '%')
        else:
            break
    print('Group {}\n{}\n'.format(row.name, '\n'.join(results[:5])))

final_info = pd.concat([restaurant_data, final_cluster['label']], axis=1)
tallies = final_info.groupby('label').sum().drop(['Lat','Lng'],axis=1)
tallies['total'] = tallies.sum(axis=1)

for i in tallies.index:
    tallies.loc[i,:] /= tallies.loc[i,'total']
tallies.drop('total', axis = 1, inplace=True)
tallies = tallies*100
for i in tallies.index:
    process_row(tallies.loc[i,:])

In [None]:
# manually sort categories into lists for analysis below

mexican_restaurants = [0,6]
sandwich_places = [1,13]
fast_food = [2,3]
bars = [4,10,16]
american_and_food = [5,7,9,14,19]
restaurant = [8]
chinese_restaurants = [11]
coffee = [12]
pizza = [15,18]
donuts = [17]


What's happened? A few very popular sorts of restaurants, such as Mexican, Chinese, and American, have each formed clusters that contain only that category. If we examine the geographic distribution of one of these mono-category clusters, we get:

In [None]:
# only mexican clusters

cmap = cm.get_cmap(name='cool', lut=final_cluster['label'].max())
my_map = folium.Map(location = getAvgCoords(coord_data), zoom_start = 10, width='50%')

# these are the airports in Chicago for reference
folium.CircleMarker(location = [41.9742, -87.9073], popup="O'Hare", color='blue').add_to(my_map) 
folium.CircleMarker(location = [41.7868, -87.7522], popup="Midway", color='blue').add_to(my_map) 

LABELS = mexican_restaurants
for i in final_info.index:
    if np.random.rand() < 1 and final_info.loc[i,'label'] in LABELS:
        entry = final_info.loc[i,:]
        folium.CircleMarker(
            location=[entry['Lat'],entry['Lng']],
            popup = "group" + str(final_cluster.loc[i,'label']),
            fill = True, 
            radius=3, 
            weight=2,
            color=get_fill_color(cmap, final_cluster.loc[i,'label']),#'#555555',#get_boundary_color(cmap, label[i]), 
            fill_color=get_fill_color(cmap, final_cluster.loc[i,'label']),
            fill_opacity=1).add_to(my_map)
my_map

Compare this to the fast food clusters, another mono-category group:

In [None]:
# only fast food clusters

# since it's hard to see the colors, let's make a custom function

def get_fastfood_color(label):
    if label == 2:
        return '#FF0000'
    else:
        return '#0000FF'
cmap = cm.get_cmap(name='cool', lut=final_cluster['label'].max())
my_map = folium.Map(location = getAvgCoords(coord_data), zoom_start = 10, width='50%')

# these are the airports in Chicago for reference
folium.CircleMarker(location = [41.9742, -87.9073], popup="O'Hare", color='blue').add_to(my_map) 
folium.CircleMarker(location = [41.7868, -87.7522], popup="Midway", color='blue').add_to(my_map) 

LABELS = fast_food
for i in final_info.index:
    if np.random.rand() < 1 and final_info.loc[i,'label'] in LABELS:
        entry = final_info.loc[i,:]
        folium.CircleMarker(
            location=[entry['Lat'],entry['Lng']],
            popup = "group" + str(final_cluster.loc[i,'label']),
            fill = True, 
            radius=3, 
            weight=2,
            color=get_fastfood_color(final_cluster.loc[i,'label']),#'#555555',#get_boundary_color(cmap, label[i]), 
            fill_color=get_fastfood_color(final_cluster.loc[i,'label']),
            fill_opacity=1).add_to(my_map)
my_map

In [None]:
# bars

cmap = cm.get_cmap(name='cool', lut=final_cluster['label'].max())
my_map = folium.Map(location = getAvgCoords(coord_data), zoom_start = 10, width='50%')

# these are the airports in Chicago for reference
folium.CircleMarker(location = [41.9742, -87.9073], popup="O'Hare", color='blue').add_to(my_map) 
folium.CircleMarker(location = [41.7868, -87.7522], popup="Midway", color='blue').add_to(my_map) 

LABELS = bars
for i in final_info.index:
    if np.random.rand() < 1 and final_info.loc[i,'label'] in LABELS:
        entry = final_info.loc[i,:]
        folium.CircleMarker(
            location=[entry['Lat'],entry['Lng']],
            popup = "group" + str(final_cluster.loc[i,'label']),
            fill = True, 
            radius=3, 
            weight=2,
            color=get_fill_color(cmap, final_cluster.loc[i,'label']),#'#555555',#get_boundary_color(cmap, label[i]), 
            fill_color=get_fill_color(cmap, final_cluster.loc[i,'label']),
            fill_opacity=1).add_to(my_map)
my_map

In [None]:
# sandwich places

cmap = cm.get_cmap(name='cool', lut=final_cluster['label'].max())
my_map = folium.Map(location = getAvgCoords(coord_data), zoom_start = 10, width='50%')

# these are the airports in Chicago for reference
folium.CircleMarker(location = [41.9742, -87.9073], popup="O'Hare", color='blue').add_to(my_map) 
folium.CircleMarker(location = [41.7868, -87.7522], popup="Midway", color='blue').add_to(my_map) 

LABELS = sandwich_places
for i in final_info.index:
    if np.random.rand() < 1 and final_info.loc[i,'label'] in LABELS:
        entry = final_info.loc[i,:]
        folium.CircleMarker(
            location=[entry['Lat'],entry['Lng']],
            popup = "group" + str(final_cluster.loc[i,'label']),
            fill = True, 
            radius=3, 
            weight=2,
            color=get_fill_color(cmap, final_cluster.loc[i,'label']),#'#555555',#get_boundary_color(cmap, label[i]), 
            fill_color=get_fill_color(cmap, final_cluster.loc[i,'label']),
            fill_opacity=1).add_to(my_map)
my_map

In [None]:
# "food" and "american" restaurants

cmap = cm.get_cmap(name='cool', lut=final_cluster['label'].max())
my_map = folium.Map(location = getAvgCoords(coord_data), zoom_start = 10, width='50%')

# these are the airports in Chicago for reference
folium.CircleMarker(location = [41.9742, -87.9073], popup="O'Hare", color='blue').add_to(my_map) 
folium.CircleMarker(location = [41.7868, -87.7522], popup="Midway", color='blue').add_to(my_map) 

LABELS = american_and_food 
for i in final_info.index:
    if np.random.rand() < 0.5 and final_info.loc[i,'label'] in LABELS:
        entry = final_info.loc[i,:]
        folium.CircleMarker(
            location=[entry['Lat'],entry['Lng']],
            popup = "group" + str(final_cluster.loc[i,'label']),
            fill = True, 
            radius=3, 
            weight=2,
            color=get_fill_color(cmap, final_cluster.loc[i,'label']),#'#555555',#get_boundary_color(cmap, label[i]), 
            fill_color=get_fill_color(cmap, final_cluster.loc[i,'label']),
            fill_opacity=1).add_to(my_map)
my_map

In [None]:
# coffee and donut shops

cmap = cm.get_cmap(name='cool', lut=final_cluster['label'].max())
my_map = folium.Map(location = getAvgCoords(coord_data), zoom_start = 10, width='50%')

# these are the airports in Chicago for reference
folium.CircleMarker(location = [41.9742, -87.9073], popup="O'Hare", color='blue').add_to(my_map) 
folium.CircleMarker(location = [41.7868, -87.7522], popup="Midway", color='blue').add_to(my_map) 

LABELS = coffee+donuts
for i in final_info.index:
    if np.random.rand() < 1 and final_info.loc[i,'label'] in LABELS:
        entry = final_info.loc[i,:]
        folium.CircleMarker(
            location=[entry['Lat'],entry['Lng']],
            popup = "group" + str(final_cluster.loc[i,'label']),
            fill = True, 
            radius=3, 
            weight=2,
            color=get_fill_color(cmap, final_cluster.loc[i,'label']),#'#555555',#get_boundary_color(cmap, label[i]), 
            fill_color=get_fill_color(cmap, final_cluster.loc[i,'label']),
            fill_opacity=1).add_to(my_map)
my_map

In [None]:
# view distribution of clusters with multiple venue categories

cmap = cm.get_cmap(name='cool', lut=final_cluster['label'].max())
my_map = folium.Map(location = getAvgCoords(coord_data), zoom_start = 10, width='50%')

# these are the airports in Chicago for reference
folium.CircleMarker(location = [41.9742, -87.9073], popup="O'Hare", color='blue').add_to(my_map) 
folium.CircleMarker(location = [41.7868, -87.7522], popup="Midway", color='blue').add_to(my_map) 

mono_cats = [0, 2, 3, 6, 8, 9, 10, 11, 12, 15, 16,17]
LABELS = [i for i in range(20) if i not in mono_cats]
for i in final_info.index:
    if np.random.rand() < 1 and final_info.loc[i,'label'] in LABELS:
        entry = final_info.loc[i,:]
        folium.CircleMarker(
            location=[entry['Lat'],entry['Lng']],
            popup = "group" + str(final_cluster.loc[i,'label']),
            fill = True, 
            radius=3, 
            weight=2,
            color=get_fill_color(cmap, final_cluster.loc[i,'label']),#'#555555',#get_boundary_color(cmap, label[i]), 
            fill_color=get_fill_color(cmap, final_cluster.loc[i,'label']),
            fill_opacity=1).add_to(my_map)
        
centroids = final_info[final_info['label'].isin(LABELS)].groupby('label').mean()[['Lat','Lng']]

for i in centroids.index:
    text = 'Group '+ str(i)
    iframe = folium.IFrame(text, width=30, height=30)
    popup = folium.Popup(iframe, max_width=3000)

    divIcon = folium.DivIcon(icon_size=(50,50), html='<div style="text-align:center;color:{}; background:#FFFFFF;"> {} </div>'.format(get_fill_color(cmap,i),i))
    Text = folium.Marker(location=[centroids.loc[i,'Lat'], centroids.loc[i,'Lng']], popup=popup,
                     icon=divIcon)

    my_map.add_child(Text)
centroids
my_map.save('clusters.html')
my_map

In [None]:
# for one last comparison, we do 8-means on the coordinates of all venues to see how they compare
# (swap in the commented line below to restrict to the multi-category clusters instead)

multilabel_venues = restaurant_data[['Lat','Lng']].copy()
#multilabel_venues = final_info[final_info['label'].isin(LABELS)][['Lat','Lng']].copy()

multilabel_clusterer = KMeans(n_clusters = 8)
multilabel_venues['new_label'] = multilabel_clusterer.fit_predict(multilabel_venues)

cmap = cm.get_cmap(name='cool', lut=multilabel_venues['new_label'].max())
my_map = folium.Map(location = getAvgCoords(coord_data), zoom_start = 10, width='50%')

# these are the airports in Chicago for reference
folium.CircleMarker(location = [41.9742, -87.9073], popup="O'Hare", color='blue').add_to(my_map) 
folium.CircleMarker(location = [41.7868, -87.7522], popup="Midway", color='blue').add_to(my_map) 


for i in multilabel_venues.index:
    if np.random.rand() < 1:
        entry = multilabel_venues.loc[i,:]
        folium.CircleMarker(
            location=[entry['Lat'],entry['Lng']],
            popup = "group" + str(multilabel_venues.loc[i,'new_label']),
            fill = True, 
            radius=3, 
            weight=2,
            color=get_fill_color(cmap,multilabel_venues.loc[i,'new_label']),#'#555555',#get_boundary_color(cmap, label[i]), 
            fill_color=get_fill_color(cmap, multilabel_venues.loc[i,'new_label']),
            fill_opacity=1).add_to(my_map)
        
centroids = multilabel_venues.groupby('new_label').mean()[['Lat','Lng']]

for i in centroids.index:
    text = 'Group '+ str(i)
    iframe = folium.IFrame(text, width=30, height=30)
    popup = folium.Popup(iframe, max_width=3000)

    divIcon = folium.DivIcon(icon_size=(50,50), html='<div style="text-align:center;color:{}; background:#FFFFFF;"> {} </div>'.format(get_fill_color(cmap,i),i))
    Text = folium.Marker(location=[centroids.loc[i,'Lat'], centroids.loc[i,'Lng']], popup=popup,
                     icon=divIcon)

    my_map.add_child(Text)
centroids
my_map.save('k_means_8_all.html')
my_map

In [None]:
# a scratchpad to plot lat/long data if Chrome won't load the map


raw_data = pd.read_csv('restaurant_data')
#print(raw_data['Category'].value_counts())
X = raw_data[raw_data['Category'].isin(['Sandwich Place'])][['Lat','Lng']].reset_index()
ff_rest_clusterer = KMeans(n_clusters=1)
labels = ff_rest_clusterer.fit_predict(X[['Lat','Lng']])

for i in X.index:
    if labels[i] == 0:
        plt.plot(X.loc[i,'Lng'], X.loc[i,'Lat'], 'bo',alpha=.5)
    elif labels[i] == 1:
        plt.plot(X.loc[i,'Lng'], X.loc[i,'Lat'], 'bo')
    else:
        plt.plot(X.loc[i,'Lng'], X.loc[i,'Lat'], 'bo')
        
for i in final_info.index:
    if final_info.loc[i,'label'] in [13]:
        if final_info.loc[i,'label'] == 13:
            plt.plot(final_info.loc[i,'Lng'], final_info.loc[i,'Lat'], 'ro',alpha=.5)
        elif final_info.loc[i,'label'] == 7:
            plt.plot(final_info.loc[i,'Lng'], final_info.loc[i,'Lat'], 'bo')
        else:
            plt.plot(final_info.loc[i,'Lng'], final_info.loc[i,'Lat'], 'go')
plt.title('Sandwich Shops')
plt.legend([plt.Line2D([0], [0], marker='o', color='w', label='Scatter',
                          markerfacecolor='b', markersize=10),plt.Line2D([0], [0], marker='o', color='w', label='Scatter',
                          markerfacecolor='r', markersize=10)],['Original Dataset','Sandwich Place Cluster'],bbox_to_anchor=(1.05, 1), loc='upper left')