# Capstone Project - Assignment 5

## Objective

Machine Learning Tour Group (MLTG) is creating a tour package to New York City that includes a one-day free-and-easy stay in Manhattan. The company wants to identify neighborhoods that are within walking distance of popular F&B venues, places of interest and shopping locations. The popularity of a venue is measured by its "likes" count in the Foursquare database. Suitable accommodation can then be arranged in the identified neighborhoods.

## Data required

We will be using:  
A. Geospatial coordinates of neighborhoods, obtained from https://geo.nyu.edu/catalog/nyu_2451_34572.  
B. Venue data from the Foursquare API, including locations, categories and likes.

## Keywords

Tour, accommodations, Manhattan, DBSCAN, Foursquare

## Libraries

In [1]:
import numpy as np
import pandas as pd
import json
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
from geopy import distance # compute distance in km between two pairs of latitude-longitude coordinates.
import requests # library to handle requests
from pandas import json_normalize
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import DBSCAN, KMeans
import folium # map rendering library
import wget # using pypi wget library


## Methodology

1. Geospatial data is downloaded from https://cocl.us/new_york_dataset. The neighborhoods in Manhattan are then filtered from this data.
2. Using Foursquare, we get geospatial data of venues within 8 km of the Manhattan city centre coordinates.
3. Based on their "categories", the venues are placed into three groups: F&B, places of interest and shops. The new column in the data frame is called "tag", with labels FNB, POI and SHOP respectively. This helps the analysis. Venues whose categories do not belong to any of these tags are removed from the data frame.
4. Using Foursquare, we obtain the count of "likes" for each venue. The 25th, 50th and 75th percentiles of each tag are computed. A new column "rating" assigns each venue a categorical rating based on these percentiles, with 4 being the highest and 1 the lowest.
5. We perform one-hot encoding. One data set with only "rating" encoded and another with both "tag" and "rating" encoded are then used for clustering. The latitude and longitude coordinates are included in both encoded data sets.
6. Clustering is done with DBSCAN from the scikit-learn library. We also look at clustering with KMeans for comparison. The results are shown in pivot tables.
7. After determining suitable DBSCAN hyper-parameters, we add the cluster labels to the data frame and present the clusters on a map.
8. The mean distance from each neighborhood to the venues in the selected cluster is computed. The 5 neighborhoods with the lowest mean distance are recommended for scouting accommodation.

## Pre-processing

Get geospatial data of neighborhoods in New York.

In [2]:
# Uncomment the line below if newyork_data.json file has not been downloaded.
# wget.download("https://cocl.us/new_york_dataset", out="newyork_data.json")

In [3]:
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)

In [4]:
neighborhoods_data = newyork_data['features']

Transforming the data into a pandas dataframe

In [5]:
# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# instantiate the dataframe
neighborhoods = pd.DataFrame(columns=column_names)

In [6]:
# collect one row per neighborhood, then build the dataframe in one step
# (DataFrame.append was removed in pandas 2.0)
rows = []
for data in neighborhoods_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']

    # GeoJSON stores coordinates as [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]

    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

neighborhoods = pd.DataFrame(rows, columns=column_names)

In particular, we will be selecting neighborhoods in Manhattan.

In [7]:
manhattan_data = neighborhoods[neighborhoods['Borough'] == 'Manhattan'].reset_index(drop=True)
manhattan_data['Neighborhood'].unique()

array(['Marble Hill', 'Chinatown', 'Washington Heights', 'Inwood',
       'Hamilton Heights', 'Manhattanville', 'Central Harlem',
       'East Harlem', 'Upper East Side', 'Yorkville', 'Lenox Hill',
       'Roosevelt Island', 'Upper West Side', 'Lincoln Square', 'Clinton',
       'Midtown', 'Murray Hill', 'Chelsea', 'Greenwich Village',
       'East Village', 'Lower East Side', 'Tribeca', 'Little Italy',
       'Soho', 'West Village', 'Manhattan Valley', 'Morningside Heights',
       'Gramercy', 'Battery Park City', 'Financial District',
       'Carnegie Hill', 'Noho', 'Civic Center', 'Midtown South',
       'Sutton Place', 'Turtle Bay', 'Tudor City', 'Stuyvesant Town',
       'Flatiron', 'Hudson Yards'], dtype=object)

Using Nominatim, we find the geographical coordinates of Manhattan.

In [8]:
address = 'Manhattan, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Manhattan are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Manhattan are 40.7896239, -73.9598939.


## Foursquare API

#### User credentials

In [9]:
CLIENT_ID = '' # your Foursquare ID
CLIENT_SECRET = '' # your Foursquare Secret
VERSION = '20180604'

In [10]:
LIMIT = 500 # number of venues
radius = 8000 # within radius of 8km

#### Utility functions

In [11]:
# function that extracts the category name of a venue row
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        # rows from the explore endpoint use the 'venue.' prefix
        categories_list = row['venue.categories']

    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

#### Pull data from Foursquare API

First, obtain the venues that are within 'radius' (8000 m, i.e. 8 km) of the coordinates of Manhattan.

In [12]:
# create URL
url = 'https://api.foursquare.com/v2/venues/explore?\
&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    latitude, 
    longitude, 
    radius, 
    LIMIT)

In [13]:
#pull the actual data from the Foursquare API
all_data_json = requests.get(url).json()

Place data into dataframe

In [14]:
all_venues = all_data_json['response']['groups'][0]['items']
all_venues = json_normalize(all_venues)
all_venues.head()

Unnamed: 0,referralId,reasons.count,reasons.items,venue.id,venue.name,venue.location.address,venue.location.crossStreet,venue.location.lat,venue.location.lng,venue.location.labeledLatLngs,...,venue.photos.groups,venue.venuePage.id,venue.delivery.id,venue.delivery.url,venue.delivery.provider.name,venue.delivery.provider.icon.prefix,venue.delivery.provider.icon.sizes,venue.delivery.provider.icon.name,venue.events.count,venue.events.summary
0,e-0-4ba233dbf964a5206fe337e3-0,0,"[{'summary': 'This spot is popular', 'type': '...",4ba233dbf964a5206fe337e3,East Meadow,Central Park,5th Ave btwn 98th & 101st St,40.79016,-73.955498,"[{'label': 'display', 'lat': 40.79015961797437...",...,[],,,,,,,,,
1,e-0-4a229fa8f964a520797d1fe3-1,0,"[{'summary': 'This spot is popular', 'type': '...",4a229fa8f964a520797d1fe3,Jacqueline Kennedy Onassis Reservoir,Central Park,btwn 85th & 96th St,40.784519,-73.960966,,...,[],,,,,,,,,
2,e-0-4a746fb2f964a52025de1fe3-2,0,"[{'summary': 'This spot is popular', 'type': '...",4a746fb2f964a52025de1fe3,The Jewish Museum,1109 5th Ave,at E 92nd St,40.785276,-73.957411,"[{'label': 'display', 'lat': 40.7852755619908,...",...,[],33720137.0,,,,,,,,
3,e-0-412d2800f964a520df0c1fe3-3,0,"[{'summary': 'This spot is popular', 'type': '...",412d2800f964a520df0c1fe3,Central Park,59th St to 110th St,5th Ave to Central Park West,40.784083,-73.964853,"[{'label': 'display', 'lat': 40.78408342593807...",...,[],,,,,,,,,
4,e-0-4a9ad8d2f964a520213320e3-4,0,"[{'summary': 'This spot is popular', 'type': '...",4a9ad8d2f964a520213320e3,Conservatory Garden,1231 5th Ave,at 105th St,40.793531,-73.952032,"[{'label': 'display', 'lat': 40.793531, 'lng':...",...,[],,,,,,,,,


For ease of analysis, select relevant columns.

In [15]:
selectedCols = ['venue.name', 'venue.id', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
all_venues = all_venues.loc[:, selectedCols]
all_venues['venue.categories'] = all_venues.apply(get_category_type, axis=1)

In [16]:
all_venues.head()

Unnamed: 0,venue.name,venue.id,venue.categories,venue.location.lat,venue.location.lng
0,East Meadow,4ba233dbf964a5206fe337e3,Field,40.79016,-73.955498
1,Jacqueline Kennedy Onassis Reservoir,4a229fa8f964a520797d1fe3,Reservoir,40.784519,-73.960966
2,The Jewish Museum,4a746fb2f964a52025de1fe3,Museum,40.785276,-73.957411
3,Central Park,412d2800f964a520df0c1fe3,Park,40.784083,-73.964853
4,Conservatory Garden,4a9ad8d2f964a520213320e3,Garden,40.793531,-73.952032


The column headers all start with the prefix "venue.", so we rename them.

In [17]:
# We split the string and select only the last word.
new_col_headers = [col_header.split(".")[-1] for col_header in all_venues.columns]
all_venues.columns = new_col_headers

#### Grouping by categories

In [18]:
all_venues['categories'].unique()

array(['Field', 'Reservoir', 'Museum', 'Park', 'Garden', 'Fountain',
       'Flower Shop', 'Exhibit', 'Art Museum', 'Gym', 'Plaza',
       'Yoga Studio', 'Bakery', 'Track', 'Cycle Studio', 'Beer Store',
       'Boxing Gym', 'Scenic Lookout', 'Waterfront', 'Pizza Place',
       'Thai Restaurant', 'Theater', 'Event Space',
       'Performing Arts Venue', 'Concert Hall', 'Opera House',
       'Italian Restaurant', 'Gym / Fitness Center', 'Jazz Club',
       'Seafood Restaurant', 'Hotel', 'Library', 'Community Center',
       'Salon / Barbershop', 'Dance Studio', 'Gift Shop',
       'American Restaurant', 'Church', 'Grocery Store', 'Food Truck',
       'Resort', 'Gourmet Shop', 'Taco Place', 'Bar', 'Ice Cream Shop',
       'Indie Theater', 'Sandwich Place', 'Bookstore'], dtype=object)

Clean up the dataframe by removing categories that are not needed. Then, create lists for grouping different categories.

In [19]:
remove_categories = ['Field', 'Gym', 'Yoga Studio', 'Track', 'Cycle Studio', 'Boxing Gym', 'Event Space', 
                     'Gym / Fitness Center', 'Hotel', 'Community Center', 'Salon / Barbershop', 'Dance Studio']

FNB_categories = ['Bakery', 'Beer Store', 'Pizza Place', 'Thai Restaurant', 'Italian Restaurant', 
                  'Seafood Restaurant', 'American Restaurant', 'Food Truck', 'Taco Place', 'Bar', 
                  'Ice Cream Shop', 'Sandwich Place']

places_of_interest_categories = ['Reservoir','Museum', 'Park', 'Garden', 'Fountain',
                                 'Exhibit', 'Art Museum', 'Plaza', 'Scenic Lookout', 'Waterfront', 
                                 'Theater', 'Performing Arts Venue', 'Concert Hall', 'Opera House',
                                 'Jazz Club', 'Library', 'Church', 'Resort', 'Indie Theater']

shopping_categories  = ['Flower Shop', 'Gift Shop', 'Grocery Store', 'Gourmet Shop', 'Bookstore']

### Check lists

If there are mismatches, update the respective lists in cell above.

In [20]:
all_split_categories = remove_categories + FNB_categories + places_of_interest_categories + shopping_categories
all_categories_list = all_venues['categories'].unique().tolist()

missing_categories = []
for category in all_categories_list:
    if category not in all_split_categories:
        missing_categories.append(category)

print('Missing categories: ')
print(missing_categories)

Missing categories: 
[]


In [21]:
# Check for redundant elements.
redundant_categories = []
for category in all_split_categories:
    if category not in all_categories_list:
        redundant_categories.append(category)

print('Redundant categories: ')
print(redundant_categories)

Redundant categories: 
[]
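The two checks above can also be expressed as set differences. The following is a minimal sketch on a small illustrative subset of the lists, not the full data; the names `expected` and `observed` are hypothetical and 'Museum' is left unassigned on purpose to show a mismatch.

```python
# Illustrative subset of the hand-written lists (not the full data).
remove_categories = ['Field', 'Gym']
FNB_categories = ['Bakery', 'Bar']

# Categories as they might appear in the venue data.
observed_categories = ['Field', 'Bakery', 'Bar', 'Museum']

expected = set(remove_categories) | set(FNB_categories)
observed = set(observed_categories)

# Categories seen in the data but not assigned to any list.
missing_categories = sorted(observed - expected)
# Categories listed by hand that never occur in the data.
redundant_categories = sorted(expected - observed)

print('Missing categories:', missing_categories)
print('Redundant categories:', redundant_categories)
```

Both results are empty lists when the hand-written lists and the data agree, as in the outputs above.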


NOTE: We will be using updated_all_venues for the rest of the code and appending new columns/info to it.

In [22]:
# create a new dataframe without the categories that are not needed;
# .copy() makes it independent of all_venues so new columns can be added safely
updated_all_venues = all_venues[~all_venues['categories'].isin(remove_categories)].copy()

Create a new column called tag and fill it with NaN.

In [23]:
updated_all_venues = updated_all_venues.assign(tag=[np.nan] * updated_all_venues.shape[0])

Tag each row with FNB, POI or SHOP categories under the 'tag' column.  

These groupings are created to distinguish different activities for tourists. Moreover, venues of different groups may receive different numbers of likes. For example, places of interest usually receive more likes. Computing the percentiles separately for each group later therefore gives thresholds that better reflect that group.

In [24]:
indexes = updated_all_venues['categories'].isin(FNB_categories)
updated_all_venues.loc[indexes, 'tag'] = "FNB"

In [25]:
indexes = updated_all_venues['categories'].isin(places_of_interest_categories)
updated_all_venues.loc[indexes, 'tag'] = "POI"

In [26]:
indexes = updated_all_venues['categories'].isin(shopping_categories)
updated_all_venues.loc[indexes, 'tag'] = "SHOP"
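The three tagging cells above apply the same pattern once per list. As a sketch of an alternative, the lists can be merged into a single category-to-tag lookup. The lists below are an illustrative subset, and `tag_by_category` is a hypothetical name.

```python
# Illustrative subset of the category lists defined earlier.
FNB_categories = ['Bakery', 'Pizza Place']
places_of_interest_categories = ['Museum', 'Park']
shopping_categories = ['Bookstore']

# Build one category -> tag lookup from the three lists.
tag_by_category = {}
for tag, categories in [('FNB', FNB_categories),
                        ('POI', places_of_interest_categories),
                        ('SHOP', shopping_categories)]:
    for category in categories:
        tag_by_category[category] = tag

# Look up the tag of each venue category; unknown categories give None.
venue_categories = ['Park', 'Bakery', 'Bookstore', 'Museum']
tags = [tag_by_category.get(c) for c in venue_categories]
print(tags)
```

With pandas, the same dictionary could be passed to `Series.map`, for example `updated_all_venues['categories'].map(tag_by_category)`.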

In [27]:
updated_all_venues.head()

Unnamed: 0,name,id,categories,lat,lng,tag
1,Jacqueline Kennedy Onassis Reservoir,4a229fa8f964a520797d1fe3,Reservoir,40.784519,-73.960966,POI
2,The Jewish Museum,4a746fb2f964a52025de1fe3,Museum,40.785276,-73.957411,POI
3,Central Park,412d2800f964a520df0c1fe3,Park,40.784083,-73.964853,POI
4,Conservatory Garden,4a9ad8d2f964a520213320e3,Garden,40.793531,-73.952032,POI
5,Conservatory Garden Center Fountain,4d2b4592d86aa0907fa322c0,Fountain,40.793896,-73.952816,POI


#### Get "likes" values

Using a list of IDs, we can retrieve likes from Foursquare for each ID.

In [28]:
IDs = updated_all_venues['id'].tolist()

In [29]:
likes_list = []

for ID in IDs:
    url = 'https://api.foursquare.com/v2/venues/{}/likes?client_id={}&client_secret={}&v={}'.format(
        ID, CLIENT_ID, CLIENT_SECRET, VERSION)
    like_data = requests.get(url).json()
    likes = like_data['response']['likes']['count']
    likes_list.append(likes)
    
print(likes_list)

[717, 791, 21527, 280, 69, 180, 32, 295, 14196, 63, 502, 134, 127, 64, 809, 181, 370, 31, 317, 722, 96, 211, 784, 2398, 89, 585, 616, 853, 1369, 129, 806, 293, 98, 1705, 255, 728, 229, 403, 111, 207, 1007, 132, 1255, 257, 560, 250, 325, 13306, 473, 427, 193, 3246, 1097, 3288, 504, 274, 4198, 6864, 21, 205, 1327, 136, 154, 1173, 551, 226, 89, 115, 310, 77, 224, 406, 417, 10580, 260, 419, 258, 952, 746]
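The loop above assumes that every response contains `response -> likes -> count`. If a request fails, for example because of a rate limit, that lookup raises a KeyError and the loop stops. A minimal sketch of a more defensive parser follows; `parse_like_count` is a hypothetical helper and the payloads are made up for illustration.

```python
def parse_like_count(payload):
    """Return the like count from a decoded likes response, or None if absent."""
    try:
        return int(payload['response']['likes']['count'])
    except (KeyError, TypeError, ValueError):
        return None


# Hypothetical payloads for illustration only.
ok_payload = {'response': {'likes': {'count': 717}}}
failed_payload = {'response': {}}

print(parse_like_count(ok_payload))
print(parse_like_count(failed_payload))
```

Venues that return None could then be retried or dropped before the percentiles are computed.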


In [30]:
updated_all_venues['total_likes'] = likes_list

In [31]:
updated_all_venues.head()

Unnamed: 0,name,id,categories,lat,lng,tag,total_likes
1,Jacqueline Kennedy Onassis Reservoir,4a229fa8f964a520797d1fe3,Reservoir,40.784519,-73.960966,POI,717
2,The Jewish Museum,4a746fb2f964a52025de1fe3,Museum,40.785276,-73.957411,POI,791
3,Central Park,412d2800f964a520df0c1fe3,Park,40.784083,-73.964853,POI,21527
4,Conservatory Garden,4a9ad8d2f964a520213320e3,Garden,40.793531,-73.952032,POI,280
5,Conservatory Garden Center Fountain,4d2b4592d86aa0907fa322c0,Fountain,40.793896,-73.952816,POI,69


#### Finding quantile values

For each tag, determine the 25th, 50th and 75th percentiles of total_likes. These values are then used to partition total_likes into 4 groups.

In [32]:
def find_percentiles(df):
    # use numpy's percentile function to find the 25th, 50th and 75th percentiles
    p25 = np.percentile(df['total_likes'], 25)
    p50 = np.percentile(df['total_likes'], 50)
    p75 = np.percentile(df['total_likes'], 75)
    print(p25, p50, p75)
    return p25, p50, p75
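To make the percentile values concrete, the sketch below reproduces linear interpolation between ranks in plain Python, assuming the default interpolation of np.percentile. The helper `percentile_linear` and the toy data are assumptions for illustration only.

```python
def percentile_linear(values, q):
    """q-th percentile (0-100) using linear interpolation between ranks."""
    data = sorted(values)
    pos = (q / 100) * (len(data) - 1)   # fractional rank of the percentile
    lower = int(pos)
    frac = pos - lower
    if lower + 1 >= len(data):
        return float(data[lower])
    return data[lower] + frac * (data[lower + 1] - data[lower])


likes = [1, 2, 3, 4]  # toy data
p25 = percentile_linear(likes, 25)
p50 = percentile_linear(likes, 50)
p75 = percentile_linear(likes, 75)
print(p25, p50, p75)  # 1.75 2.5 3.25
```

Interpolation explains why a percentile can be a half value even though like counts are integers: it falls between two neighbouring counts.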

In [33]:
p25_FNB, p50_FNB, p75_FNB = find_percentiles(updated_all_venues[updated_all_venues['tag'] == "FNB"])

158.0 313.5 452.0


In [34]:
p25_POI, p50_POI, p75_POI = find_percentiles(updated_all_venues[updated_all_venues['tag'] == "POI"])

205.5 439.5 1036.0


In [35]:
p25_SHOP, p50_SHOP, p75_SHOP = find_percentiles(updated_all_venues[updated_all_venues['tag'] == "SHOP"])

136.0 274.0 427.0


Based on the percentiles found, we add a new column 'rating', with 4 being the highest rating and 1 the lowest within each tag. We use the pandas apply() method to fill this column, so we first define a function to pass to apply().

In [36]:
# NOTE: Values are stored as string for one-hot encoding in subsequent steps.
# The column will be converted to integers when finding the pivot table.
def partition_tag(df):
    if df['tag'] == "FNB":
        if df['total_likes'] > p75_FNB:
            return "4"
        elif df['total_likes'] > p50_FNB:
            return "3"
        elif df['total_likes'] > p25_FNB:
            return "2"
        else:
            return "1"
    elif df['tag'] == "POI":
        if df['total_likes'] > p75_POI:
            return "4"
        elif df['total_likes'] > p50_POI:
            return "3"
        elif df['total_likes'] > p25_POI:
            return "2"
        else:
            return "1"
    else:
        if df['total_likes'] > p75_SHOP:
            return "4"
        elif df['total_likes'] > p50_SHOP:
            return "3"
        elif df['total_likes'] > p25_SHOP:
            return "2"
        else:
            return "1"
        

In [37]:
updated_all_venues['rating']=updated_all_venues.apply(partition_tag, axis=1)

In [38]:
updated_all_venues.head()

Unnamed: 0,name,id,categories,lat,lng,tag,total_likes,rating
1,Jacqueline Kennedy Onassis Reservoir,4a229fa8f964a520797d1fe3,Reservoir,40.784519,-73.960966,POI,717,3
2,The Jewish Museum,4a746fb2f964a52025de1fe3,Museum,40.785276,-73.957411,POI,791,3
3,Central Park,412d2800f964a520df0c1fe3,Park,40.784083,-73.964853,POI,21527,4
4,Conservatory Garden,4a9ad8d2f964a520213320e3,Garden,40.793531,-73.952032,POI,280,2
5,Conservatory Garden Center Fountain,4d2b4592d86aa0907fa322c0,Fountain,40.793896,-73.952816,POI,69,1
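The nested if/elif chains in partition_tag follow one rule: the rating is 1 plus the number of thresholds that lie strictly below total_likes. A compact sketch of that rule using the standard-library bisect module follows, with the thresholds printed in the cells above. The names `thresholds` and `rate` are hypothetical.

```python
from bisect import bisect_left

# Percentile thresholds (p25, p50, p75) printed above for each tag.
thresholds = {
    'FNB': (158.0, 313.5, 452.0),
    'POI': (205.5, 439.5, 1036.0),
    'SHOP': (136.0, 274.0, 427.0),
}


def rate(tag, total_likes):
    """Return '1'..'4'; a value equal to a threshold stays in the lower rating."""
    # bisect_left counts the thresholds strictly below total_likes.
    return str(1 + bisect_left(thresholds[tag], total_likes))


print(rate('POI', 717), rate('POI', 21527), rate('POI', 280), rate('POI', 69))
```

Unlike partition_tag, which treats every tag other than FNB and POI as SHOP, this sketch raises a KeyError for an unknown tag.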


#### One-hot encoding

We create a new dataframe with one-hot encoding for tag and rating before performing DBSCAN clustering.

#### Not tagged

In [39]:
# one hot encoding
onehot_all_venues_nottagged = pd.get_dummies(updated_all_venues[['rating']], prefix="", prefix_sep="")

# add latitude and longitude columns back to dataframe
onehot_all_venues_nottagged['lat'] = updated_all_venues['lat']
onehot_all_venues_nottagged['lng'] = updated_all_venues['lng']

# the name column doesn't play a role actually. For viewing purpose only.
onehot_all_venues_nottagged['Name'] = updated_all_venues['name']

# move Name column to the first column
fixed_columns = [onehot_all_venues_nottagged.columns[-1]] + list(onehot_all_venues_nottagged.columns[:-1])
onehot_all_venues_nottagged = onehot_all_venues_nottagged[fixed_columns]

onehot_all_venues_nottagged.head()

Unnamed: 0,Name,1,2,3,4,lat,lng
1,Jacqueline Kennedy Onassis Reservoir,0,0,1,0,40.784519,-73.960966
2,The Jewish Museum,0,0,1,0,40.785276,-73.957411
3,Central Park,0,0,0,1,40.784083,-73.964853
4,Conservatory Garden,0,1,0,0,40.793531,-73.952032
5,Conservatory Garden Center Fountain,1,0,0,0,40.793896,-73.952816


#### Tagged

In [40]:
# one hot encoding
onehot_all_venues_tagged = pd.get_dummies(updated_all_venues[['tag', 'rating']], prefix="", prefix_sep="")

# add latitude and longitude columns back to dataframe
onehot_all_venues_tagged['lat'] = updated_all_venues['lat']
onehot_all_venues_tagged['lng'] = updated_all_venues['lng']

# the name column doesn't play a role actually. For viewing purpose only.
onehot_all_venues_tagged['Name'] = updated_all_venues['name']

# move Name column to the first column
fixed_columns = [onehot_all_venues_tagged.columns[-1]] + list(onehot_all_venues_tagged.columns[:-1])
onehot_all_venues_tagged = onehot_all_venues_tagged[fixed_columns]

onehot_all_venues_tagged.head()

Unnamed: 0,Name,FNB,POI,SHOP,1,2,3,4,lat,lng
1,Jacqueline Kennedy Onassis Reservoir,0,1,0,0,0,1,0,40.784519,-73.960966
2,The Jewish Museum,0,1,0,0,0,1,0,40.785276,-73.957411
3,Central Park,0,1,0,0,0,0,1,40.784083,-73.964853
4,Conservatory Garden,0,1,0,0,1,0,0,40.793531,-73.952032
5,Conservatory Garden Center Fountain,0,1,0,1,0,0,0,40.793896,-73.952816


## DBSCAN

NOTE (From Course 8): DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. It is one of the most common clustering algorithms and works based on the density of points. The idea is that if a particular point belongs to a cluster, it should be near many other points in that cluster.

#### Not tagged

In [51]:
epsilon = 0.3  # try the value used in the course
minimumSamples = 5 # intuitively the number of similar & hopefully 'close' locations for tourists to choose.

cluster_df_nottagged = onehot_all_venues_nottagged.drop('Name', axis=1)

db_nottagged = DBSCAN(eps=epsilon, min_samples=minimumSamples).fit(cluster_df_nottagged)
DBSCAN_labels_nottagged = db_nottagged.labels_
DBSCAN_labels_nottagged

array([0, 0, 1, 2, 3, 3, 3, 2, 1, 3, 0, 3, 3, 3, 1, 2, 0, 3, 0, 0, 3, 2,
       0, 1, 3, 0, 1, 0, 1, 3, 0, 2, 3, 1, 2, 0, 2, 0, 3, 2, 1, 3, 1, 2,
       0, 2, 2, 1, 0, 0, 2, 1, 1, 1, 0, 2, 1, 1, 3, 3, 1, 3, 3, 1, 1, 2,
       3, 3, 2, 3, 2, 2, 0, 1, 2, 0, 2, 1, 0])

In [52]:
# label -1 marks noise points, so it is not counted as a cluster
DBSCAN_clusters_nottagged = len(np.unique(DBSCAN_labels_nottagged[DBSCAN_labels_nottagged != -1]))
print('Number of clusters: ', DBSCAN_clusters_nottagged)

Number of clusters:  4


Pivot Table

In [53]:
venues_cluster_nottagged = updated_all_venues.copy()
venues_cluster_nottagged['DBSCAN'] = DBSCAN_labels_nottagged
venues_cluster_nottagged['rating'] = venues_cluster_nottagged['rating'].astype('int')

table_nottagged = pd.pivot_table(venues_cluster_nottagged, values='categories', index=['DBSCAN'], 
                                 columns=['rating', 'tag'], aggfunc=len, fill_value=0)
table_nottagged

rating,1,1,1,2,2,2,3,3,3,4,4,4
tag,FNB,POI,SHOP,FNB,POI,SHOP,FNB,POI,SHOP,FNB,POI,SHOP
DBSCAN,,,,,,,,,,,,
0,0,0,0,0,0,0,5,13,1,0,0,0
1,0,0,0,0,0,0,0,0,0,5,14,1
2,0,0,0,5,13,1,0,0,0,0,0,0
3,5,14,2,0,0,0,0,0,0,0,0,0
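In the table above each cluster holds exactly one rating. One possible reading, assuming DBSCAN's default Euclidean metric, is that the one-hot columns dominate the distance. Two venues with different ratings are at least sqrt(2) apart, which is above eps = 0.3, while coordinate differences inside Manhattan are a small fraction of a degree. The sketch below checks this on three rows copied from the not-tagged table; the variable names are hypothetical.

```python
import math

eps = 0.3  # same value as in the DBSCAN cells above

# Feature rows (one-hot rating 1..4, then lat, lng) from the not-tagged table.
reservoir = (0, 0, 1, 0, 40.784519, -73.960966)      # rating 3
jewish_museum = (0, 0, 1, 0, 40.785276, -73.957411)  # rating 3
central_park = (0, 0, 0, 1, 40.784083, -73.964853)   # rating 4

# Euclidean distance over all six features.
same_rating = math.dist(reservoir, jewish_museum)
different_rating = math.dist(reservoir, central_park)

print('same rating:', round(same_rating, 4), 'within eps:', same_rating <= eps)
print('different rating:', round(different_rating, 4), 'within eps:', different_rating <= eps)
```

Under these assumptions, venues with the same rating are neighbours regardless of where they are on the map, and venues with different ratings never are.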


#### Tagged

In [69]:
epsilon = 0.3  # try the value used in the course
minimumSamples = 5 # intuitively the number of similar & hopefully 'close' locations for tourists to choose.

cluster_df_tagged = onehot_all_venues_tagged.drop('Name', axis=1)

db_tagged = DBSCAN(eps=epsilon, min_samples=minimumSamples).fit(cluster_df_tagged)
DBSCAN_labels_tagged = db_tagged.labels_
DBSCAN_labels_tagged

array([ 0,  0,  1,  2,  3,  3, -1,  2,  1,  3,  0,  3,  3,  4,  5,  6,  7,
        3,  7,  0,  3,  2,  0,  5,  4,  0,  5,  0,  1,  3,  0,  2,  3,  1,
        2,  0,  2,  7,  3,  2,  5,  3,  1,  6,  0,  2,  2,  1,  0, -1,  6,
        1,  1,  1,  0, -1,  1,  1,  4,  3,  1, -1,  3,  1,  5,  2,  4,  3,
        6,  4,  2,  2,  7,  1,  6,  7,  2, -1,  0])

In [70]:
# label -1 marks noise points, so it is not counted as a cluster
DBSCAN_clusters_tagged = len(np.unique(DBSCAN_labels_tagged[DBSCAN_labels_tagged != -1]))
print('Number of clusters: ', DBSCAN_clusters_tagged)

Number of clusters:  8


In [71]:
venues_cluster_tagged = updated_all_venues.copy()
venues_cluster_tagged['DBSCAN'] = DBSCAN_labels_tagged
venues_cluster_tagged['rating'] = venues_cluster_tagged['rating'].astype('int')

table_tagged = pd.pivot_table(venues_cluster_tagged, values='categories', index=['DBSCAN'], 
                              columns=['rating', 'tag'], aggfunc=len, fill_value=0)
table_tagged

rating,1,1,1,2,2,2,3,3,3,4,4,4
tag,FNB,POI,SHOP,FNB,POI,SHOP,FNB,POI,SHOP,FNB,POI,SHOP
DBSCAN,,,,,,,,,,,,
-1,0,0,2,0,0,1,0,0,1,0,0,1
0,0,0,0,0,0,0,0,13,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,14,0
2,0,0,0,0,13,0,0,0,0,0,0,0
3,0,14,0,0,0,0,0,0,0,0,0,0
4,5,0,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,5,0,0
6,0,0,0,5,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,5,0,0,0,0,0


REMARK: With minimumSamples = 5, this clustering puts almost every tag and rating combination into its own cluster, and all SHOP venues are labelled as noise (-1). Fewer clusters are formed when minimumSamples is increased to 10: the cluster with rating 4 then consists only of places of interest, and FNB and SHOP venues are grouped as outliers.

## KMeans

In [75]:
cluster_kmeans = onehot_all_venues_tagged.drop('Name', axis=1)

k_clusters = 4
kmeans = KMeans(n_clusters=k_clusters, random_state=0).fit(cluster_kmeans)

In [77]:
venues_kmeans = updated_all_venues.copy()
venues_kmeans['KMEANS_labels'] = kmeans.labels_
venues_kmeans['rating'] = venues_kmeans['rating'].astype('int')

table_kmeans = pd.pivot_table(venues_kmeans, values='categories', index=['KMEANS_labels'], 
                              columns=['rating', 'tag'], aggfunc=len, fill_value=0)
table_kmeans

rating,1,1,1,2,2,2,3,3,3,4,4,4
tag,FNB,POI,SHOP,FNB,POI,SHOP,FNB,POI,SHOP,FNB,POI,SHOP
KMEANS_labels,,,,,,,,,,,,
0,5,0,0,5,0,0,5,0,0,5,0,0
1,0,14,2,0,0,0,0,0,0,0,0,0
2,0,0,0,0,13,1,0,0,0,0,0,0
3,0,0,0,0,0,0,0,13,1,0,14,1


REMARK: Here k = 4, which matches the number of clusters found by DBSCAN on the not-tagged data earlier. It is interesting to observe that cluster 0 is formed entirely by FNB venues, while the other clusters are made up of POI and SHOP venues.

### Update dataframe

REMARK: Cluster 1 from the not-tagged DBSCAN run looks promising, as it groups all the venues with the highest rating (4) together. Thus, we use the labels from DBSCAN_labels_nottagged to update our main dataframe.

In [78]:
updated_all_venues['DBSCAN'] = DBSCAN_labels_nottagged

# convert ratings to integer values
updated_all_venues['rating'] = updated_all_venues['rating'].astype('int')

table = pd.pivot_table(updated_all_venues, values='categories', index=['DBSCAN'], columns=['rating', 'tag'], aggfunc=len, fill_value=0)
table

rating,1,1,1,2,2,2,3,3,3,4,4,4
tag,FNB,POI,SHOP,FNB,POI,SHOP,FNB,POI,SHOP,FNB,POI,SHOP
DBSCAN,,,,,,,,,,,,
0,0,0,0,0,0,0,5,13,1,0,0,0
1,0,0,0,0,0,0,0,0,0,5,14,1
2,0,0,0,5,13,1,0,0,0,0,0,0
3,5,14,2,0,0,0,0,0,0,0,0,0


## Map

DBSCAN clusters are plotted in solid circles while neighborhoods are in black hollow circles. Click on the circles to see the description of the venues/neighborhoods.

In [88]:
map_clusters_DBSCAN = folium.Map(location=[latitude, longitude], zoom_start=13)

# set color scheme for the clusters
x = np.arange(DBSCAN_clusters_nottagged)
ys = [i+x+(i*x)**2 for i in range(DBSCAN_clusters_nottagged)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(updated_all_venues['lat'], updated_all_venues['lng'], updated_all_venues['name'], updated_all_venues['DBSCAN']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters_DBSCAN)
    
# add neighborhoods to map
for lat, lng, label in zip(manhattan_data['Latitude'], manhattan_data['Longitude'], manhattan_data['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='black',
        fill=False,
        parse_html=False).add_to(map_clusters_DBSCAN)     
       
map_clusters_DBSCAN
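In the cell above, `rainbow[cluster-1]` gives cluster 0 the last colour through negative indexing, and a noise label of -1 would share a colour with another cluster. A sketch of an explicit label-to-colour lookup follows. It uses the standard-library colorsys module instead of matplotlib, so the colours differ from those on the map above, and `cluster_colors` is a hypothetical helper.

```python
import colorsys


def cluster_colors(labels, noise_color='#808080'):
    """Map each distinct cluster label to a hex colour; -1 (noise) gets grey."""
    clusters = sorted(set(labels) - {-1})
    palette = {}
    for i, label in enumerate(clusters):
        # Spread the hues evenly around the colour wheel.
        hue = i / max(len(clusters), 1)
        r, g, b = colorsys.hsv_to_rgb(hue, 0.8, 0.9)
        palette[label] = '#{:02x}{:02x}{:02x}'.format(
            int(r * 255), int(g * 255), int(b * 255))
    if -1 in labels:
        palette[-1] = noise_color
    return palette


palette = cluster_colors([0, 0, 1, 2, 3, -1])
print(palette)
```

A marker could then be coloured with `palette[cluster]` in place of `rainbow[cluster-1]`.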

REMARK: Purple points are from Cluster 1. Visually, we can see that they are in close proximity to one another. Moreover, it makes sense that popular eateries are close to popular places of interest since tourist traffic is higher in those places. Thus, the clustering is reasonable.

## Selecting neighborhoods

NOTE: We set selected_DBSCAN_cluster = 1.

In [82]:
selected_DBSCAN_cluster = 1

We use distance.geodesic(coords_1, coords_2).km to compute distance in km.  

Example:  
coords_1 = (52.2296756, 21.0122287)  
coords_2 = (52.406374, 16.9251681)  
distance.geodesic(coords_1, coords_2).km

In [83]:
distance_df = pd.DataFrame()
distance_df['Neighborhood'] = manhattan_data['Neighborhood']
distance_df = distance_df.assign(mean_distance=[np.nan] * distance_df.shape[0])

We compute the average distance from each neighborhood to every venue in DBSCAN cluster 1. The lower the average distance, the closer the neighborhood is to the popular venues.

In [84]:
for n_ind in manhattan_data.index: 
    n_str = manhattan_data['Neighborhood'][n_ind]
    n_lat = manhattan_data['Latitude'][n_ind]
    n_lng = manhattan_data['Longitude'][n_ind]
    count = 0
    total_dist = 0
    for v_ind in updated_all_venues.index:
        if updated_all_venues['DBSCAN'][v_ind] == selected_DBSCAN_cluster:
            count += 1
            v_lat = updated_all_venues['lat'][v_ind]
            v_lng = updated_all_venues['lng'][v_ind]
            coords_1 = (n_lat, n_lng)
            coords_2 = (v_lat, v_lng)
            total_dist += distance.geodesic(coords_1, coords_2).km

    assert distance_df['Neighborhood'][n_ind] == n_str
    if count > 0:
        distance_df.at[n_ind, 'mean_distance'] = total_dist / count
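The averaging step can be illustrated without geopy. The sketch below uses the haversine formula, which assumes a spherical Earth and therefore differs slightly from geopy's geodesic distance. The mean Earth radius of 6371.0088 km and the helper names are assumptions. The Manhattan coordinates from Nominatim stand in for a neighborhood, and the three venue coordinates are copied from the tables above.

```python
import math


def haversine_km(coord_a, coord_b, radius_km=6371.0088):
    """Great-circle distance in km between two (lat, lng) pairs in degrees."""
    lat1, lng1 = map(math.radians, coord_a)
    lat2, lng2 = map(math.radians, coord_b)
    dlat, dlng = lat2 - lat1, lng2 - lng1
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlng / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


def mean_distance_km(origin, venues):
    """Average distance from one origin to a list of venue coordinates."""
    if not venues:
        return float('nan')
    return sum(haversine_km(origin, v) for v in venues) / len(venues)


# Stand-in origin: the Manhattan coordinates found with Nominatim above.
origin = (40.7896239, -73.9598939)
# Central Park, the reservoir and The Jewish Museum, from the tables above.
venues = [(40.784083, -73.964853), (40.784519, -73.960966), (40.785276, -73.957411)]

mean_km = mean_distance_km(origin, venues)
print(round(mean_km, 3))
```

Returning NaN for an empty venue list mirrors the notebook loop, which leaves mean_distance as NaN when count is 0.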

## Conclusion

### Venues with highest rating and DBSCAN cluster 1 are:

In [85]:
venue_short = updated_all_venues.copy()

# Drop the rows that are not in the selected cluster
indexNames = venue_short[venue_short['DBSCAN'] != selected_DBSCAN_cluster].index
venue_short.drop(indexNames, inplace=True)

In [86]:
venue_short[['name', 'categories']]

Unnamed: 0,name,categories
3,Central Park,Park
9,The Metropolitan Museum of Art (Metropolitan M...,Art Museum
18,Two Little Red Hens,Bakery
30,Levain Bakery,Bakery
33,Up Thai,Thai Restaurant
36,Lincoln Center for the Performing Arts,Performing Arts Venue
41,The Metropolitan Opera (Metropolitan Opera),Opera House
51,Marea,Seafood Restaurant
54,Carnegie Hall,Concert Hall
63,Museum of Modern Art (MoMA),Art Museum


REMARK: This list of venues can be used for review / recommendation.

### The top 5 neighborhoods for accommodation arrangements are:

In [87]:
distance_df.sort_values(by='mean_distance').head(5)

Unnamed: 0,Neighborhood,mean_distance
15,Midtown,1.420808
13,Lincoln Square,1.579648
34,Sutton Place,1.803459
14,Clinton,1.891847
10,Lenox Hill,1.99341


REMARK: The average distances from these neighborhoods to the popular venues are all within 2 km, which is a comfortable walking distance. These are good candidates for accommodation arrangements.