## 1. A description of the problem and a discussion of the background

# Introduction/Business Problem
When companies deploy new projects, they usually do so through experimentation. An experiment is run in a controlled environment or, in many cases, in a suitable pilot location, and if it succeeds, it is expanded to other locations. When deciding where else to deploy such a project, it helps to compare the candidate locations with the pilot location on appropriate parameters, that is, to measure how similar or dissimilar the locations are.

In this project we aim to solve three separate scenarios (all instances of the problem above) through clustering:

1. A company like Amazon or Walmart is thinking long-term and wants its own logistics fleet for better control and reliability. It ran an experiment in New York City to determine where to establish new package drop-off locations, making it more convenient for customers to return items. The experiment was a success, and the company has recognized that the best target cities are Toronto and San Francisco.
We solve the problem of which areas in Toronto and San Francisco it should target for these drop-off locations.

2. A company like Trip Advisor wants to recommend locations in San Francisco to me based on what I did in New York City.
We solve the problem of recommending locations in a new city based on similarities with the previous one.

3. Clusters across cities can also help companies decide where to pilot self-driving cars in new cities, based on a successful pilot in one city. Let's assume Waymo, Uber or Tesla has reached the stage where level 5 cars can be tested on roads with real customers. After a successful test (again) in New York City, they would like to know where else they can deploy their cars. They have identified San Francisco and Toronto as two metros that are open to this.
Alternatively, this can help ride-share companies like Uber and Lyft deploy more cars around a particular area for a specific type of service (such as shared rides), so that people would rather take a shared Uber/Lyft than, for example, a bus.
Here we solve the problem of identifying where they can deploy a specific type of service in San Francisco based on experimentation in NYC.

*Note that many assumptions were made to create these scenarios. The focus of this work is clustering neighborhoods in these three cities based on the top 10 venues in each neighborhood, to see which areas are similar across the three cities and how they can be identified once the target cities (San Francisco and Toronto) have been chosen.

** This study does not go into much detail and leaves out many real-world factors and other data. It is meant as a presentation of Data Science and Machine Learning abilities, to showcase academic comprehension of the subject matter and tools.


## 2. A description of the data and how it will be used to solve the problem

# Data

In order to cluster different areas of the three cities mentioned above, we treat each area within a city as a "neighborhood". The data for all three cities was cleaned from the sources below so that each 'Neighborhood' is paired with the latitude and longitude of its approximate center, and is stored in the "combined_df" dataframe. Details on the three individual data sources:

1. Toronto DataSet: This dataframe (stored as csv) was built from two different datasets:
    - Postal codes list for Toronto (which was exported and saved as an xlsx file). This file contained the postal codes and corresponding neighborhood names. https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
    - Geospatial file with the latitudes and longitudes, provided by Coursera http://cocl.us/Geospatial_data
2. New York DataSet: This data set was extracted from https://geo.nyu.edu/catalog/nyu_2451_34572
3. San Francisco DataSet: This data was extracted from https://geodata.lib.berkeley.edu/catalog/ark28722-s75c8t

We will use the three datasets to create one merged dataset that holds all neighborhoods of all three cities. We will extract venues for each neighborhood from Foursquare and cluster the neighborhoods to find similarities between neighborhoods, both across cities (inter-city) and within a city (intra-city). We will focus on the inter-city relationships, since these are what solve the three problems stated in the first section.

# METHODOLOGY

This section explains the main body of the work conducted in the project. It is divided into 6 sub-sections (if you want to follow along in the notebook). After these sub-sections, the next main section, "RESULTS", is presented.

To summarize the methodology before diving into details: neighborhood and location data for three cities (NYC, SF and Toronto) was taken and combined into a single dataframe. Venues were extracted for these neighborhoods from FourSquare, and each neighborhood was explored based on its top venues. Then partitioning clustering (K-means) was used to cluster the neighborhoods into 'four' clusters, to see which neighborhoods in different cities fall into the same cluster. The result of this clustering enables one to judge which locations in the new cities (SF, Toronto) can be used to deploy an experiment that succeeded in a pilot city (NYC).

1. Importing and preparing all three data sets (NY, SF, Toronto): As mentioned in the data section, all three datasets were publicly available and were formatted and cleaned to get a combined dataframe "combined_df", which consists of four columns: 'City', 'Neighborhood', 'Latitude', and 'Longitude'.

2. Creating Maps and Pinouts to visualize the data: A map of North America was created with Folium to visualize the neighborhoods in the three cities. This confirmed that all the neighborhoods were imported properly and that the location data was correct.

3. Using FourSquare to get the Venue information for combined_df: FourSquare is an application that holds data on venues, users and experiences. Venue data (limited to 100 venues per neighborhood) was extracted through API calls (more on APIs here https://developer.foursquare.com/docs/). The main interest of this project was to segment the neighborhoods by the types of venues they contain. Hence 'Venue_Categories' were exported and cleaned up to create a master_df containing all cities, neighborhoods, venues and venue categories, along with their location information.

4. Create dataframe for Clustering: The goal of this sub-section was to obtain a dataframe that could be used for clustering. This dataframe is "master_grouped" and was created by grouping the one-hot encoded venues table by Neighborhood and City, to maintain the uniqueness of each neighborhood. Another dataframe was also created to show the top 10 venues by neighborhood, so that cluster labels can be added to it and the contents of each cluster analyzed after clustering. This dataframe is "neighborhoods_venues_sorted".

5. Clustering: Once the dataframes above were available, partitioning clustering (K-means) was used to create 6 clusters of neighborhoods. Cluster labels were inserted into "neighborhoods_venues_sorted", and this dataframe was merged with "combined_df", which contains the latitude and longitude (location) information, to create a final dataframe "master_merged" that could be analyzed for results.

6. Created a MAP for visualization: Finally, another map was created using Folium to visualize the clusters and how they were spread across the cities. Because Toronto and NYC are close to each other, their spread could be compared conveniently on one view. SF is on the opposite coast, so it was compared using snippets.
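As a rough illustration of steps 4 and 5, the sketch below clusters a toy table of venue-category frequencies. It is not the notebook's code: the neighborhood names, the frequencies, the helper `kmeans` and the choice of `k=2` are all made up for the example, and the loop is a plain NumPy version of Lloyd's algorithm written only to keep the example self-contained. A library implementation such as scikit-learn's `KMeans` would be the usual choice in practice.

```python
import numpy as np
import pandas as pd

def kmeans(X, k, n_iter=20, seed=0):
    """Plain Lloyd's algorithm: returns one integer label per row of X."""
    rng = np.random.default_rng(seed)
    # start from k distinct rows picked at random (fancy indexing copies them)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # distance from every row to every centroid, shape (n_rows, k)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # keep the old centroid if a cluster empties
                centroids[j] = X[labels == j].mean(axis=0)
    return labels

# hypothetical frequency table: one row per (Neighborhood, City)
toy = pd.DataFrame({
    "Neighborhood": ["Hood A", "Hood B", "Hood C", "Hood D"],
    "City": ["NYC", "SF", "NYC", "Toronto"],
    "Coffee Shop": [0.5, 0.6, 0.0, 0.0],
    "Bar": [0.5, 0.4, 0.0, 0.0],
    "Park": [0.0, 0.0, 0.5, 0.4],
    "Playground": [0.0, 0.0, 0.5, 0.6],
})

features = toy.drop(columns=["Neighborhood", "City"])
labels = kmeans(features.to_numpy(dtype=float), k=2)
toy["Cluster"] = labels
print(toy[["Neighborhood", "City", "Cluster"]])
```

Rows that share a label have a similar venue mix regardless of city, which is the inter-city comparison the scenarios rely on.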

*The following cells implement the METHODOLOGY section. RESULTS are shared after sub-section 6. You will need to install folium and run the cells to view the maps.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [3]:
import json # library to handle JSON files

!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (top-level import since pandas 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors


!conda install -c conda-forge folium=0.5.0 --yes
import folium # map rendering library

Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.

Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.



## 1. Importing and preparing all three data sets (NY, SF, Toronto).
To create maps and identify neighborhoods on them, we need dataframes that hold each neighborhood together with its latitude and longitude:

toronto_df  
newyork_df  
sf_df  


#### 1.1. Creating Toronto Dataset

In [4]:
toronto_df = pd.read_csv(r"C:\Users\The Godfather\Desktop\totonto_neighborhoods.csv")
toronto_df.drop(columns=["Unnamed: 0","PostalCode"], inplace=True)
toronto_df.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,North York,Parkwoods,43.753259,-79.329656
1,North York,Victoria Village,43.725882,-79.315572
2,Downtown Toronto,"Regent Park , Harbourfront",43.65426,-79.360636
3,North York,"Lawrence Manor , Lawrence Heights",43.718518,-79.464763
4,Downtown Toronto,"Queen's Park , Ontario Provincial Government",43.662301,-79.389494


In [5]:
toronto_df.shape

(103, 4)

In [6]:
#Adding City column
toronto_df["City"] = "Toronto"
toronto_df = toronto_df[["City", "Neighborhood", "Latitude", "Longitude"]]
toronto_df.head()

Unnamed: 0,City,Neighborhood,Latitude,Longitude
0,Toronto,Parkwoods,43.753259,-79.329656
1,Toronto,Victoria Village,43.725882,-79.315572
2,Toronto,"Regent Park , Harbourfront",43.65426,-79.360636
3,Toronto,"Lawrence Manor , Lawrence Heights",43.718518,-79.464763
4,Toronto,"Queen's Park , Ontario Provincial Government",43.662301,-79.389494


In [7]:
toronto_df.shape

(103, 4)

#### 1.2. Creating NYC Dataset

In [8]:
#1.2.1 Importing the json file
with open(r"C:\Users\The Godfather\Desktop\NYC_neighborhood_names-geojson.json") as json_data:
    newyork_data = json.load(json_data)
    
newyork_data

{'type': 'FeatureCollection',
 'totalFeatures': 306,
 'features': [{'type': 'Feature',
   'id': 'nyu_2451_34572.1',
   'geometry': {'type': 'Point',
    'coordinates': [-73.84720052054902, 40.89470517661]},
   'geometry_name': 'geom',
   'properties': {'name': 'Wakefield',
    'stacked': 1,
    'annoline1': 'Wakefield',
    'annoline2': None,
    'annoline3': None,
    'annoangle': 0.0,
    'borough': 'Bronx',
    'bbox': [-73.84720052054902,
     40.89470517661,
     -73.84720052054902,
     40.89470517661]}},
  {'type': 'Feature',
   'id': 'nyu_2451_34572.2',
   'geometry': {'type': 'Point',
    'coordinates': [-73.82993910812398, 40.87429419303012]},
   'geometry_name': 'geom',
   'properties': {'name': 'Co-op City',
    'stacked': 2,
    'annoline1': 'Co-op',
    'annoline2': 'City',
    'annoline3': None,
    'annoangle': 0.0,
    'borough': 'Bronx',
    'bbox': [-73.82993910812398,
     40.87429419303012,
     -73.82993910812398,
     40.87429419303012]}},
  {'type': 'Feature',
 

In [9]:
newyork_data = newyork_data['features']
newyork_data[0]

{'type': 'Feature',
 'id': 'nyu_2451_34572.1',
 'geometry': {'type': 'Point',
  'coordinates': [-73.84720052054902, 40.89470517661]},
 'geometry_name': 'geom',
 'properties': {'name': 'Wakefield',
  'stacked': 1,
  'annoline1': 'Wakefield',
  'annoline2': None,
  'annoline3': None,
  'annoangle': 0.0,
  'borough': 'Bronx',
  'bbox': [-73.84720052054902,
   40.89470517661,
   -73.84720052054902,
   40.89470517661]}}

In [10]:
#1.2.2. Converting the above details into a dataframe
# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# instantiate the dataframe
newyork_df = pd.DataFrame(columns=column_names)

In [11]:
# collect one record per GeoJSON feature, then build the dataframe in one step
# (DataFrame.append was removed in pandas 2.0 and was slow inside loops)
rows = []
for data in newyork_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']

    # GeoJSON stores coordinates as [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]

    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

newyork_df = pd.DataFrame(rows, columns=column_names)

In [12]:
newyork_df.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Bronx,Wakefield,40.894705,-73.847201
1,Bronx,Co-op City,40.874294,-73.829939
2,Bronx,Eastchester,40.887556,-73.827806
3,Bronx,Fieldston,40.895437,-73.905643
4,Bronx,Riverdale,40.890834,-73.912585


In [13]:
newyork_df.shape

(306, 4)

In [14]:
#Adding City column
newyork_df["City"] = "NYC"
newyork_df = newyork_df[["City", "Neighborhood", "Latitude", "Longitude"]]
newyork_df.head()

Unnamed: 0,City,Neighborhood,Latitude,Longitude
0,NYC,Wakefield,40.894705,-73.847201
1,NYC,Co-op City,40.874294,-73.829939
2,NYC,Eastchester,40.887556,-73.827806
3,NYC,Fieldston,40.895437,-73.905643
4,NYC,Riverdale,40.890834,-73.912585


#### 1.3 Creating San Francisco Dataset

In [15]:
import json

with open(r"C:\Users\The Godfather\Desktop\sf_neighborhood_bounds-geojson.json") as read_file:
    data = json.load(read_file)

In [16]:
list(data.keys())

['type', 'totalFeatures', 'features', 'crs']

In [17]:
#data['features'][0]['geometry']['coordinates'][0][0]
#data['features'][0]['properties']['NEIGHBORHO']
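# collect one frame of boundary vertices per neighborhood, then concatenate once
# (DataFrame.append was removed in pandas 2.0)
frames = []
for feature in data['features']:
    # first ring of the first polygon; GeoJSON vertices are [longitude, latitude]
    df = pd.DataFrame(feature['geometry']['coordinates'][0][0]).rename({0:"Longitude", 1:"Latitude"}, axis=1)
    df['Neighbrhood'] = feature['properties']['NEIGHBORHO']
    frames.append(df)
DDFF = pd.concat(frames)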

In [18]:
sf_df = DDFF.groupby(['Neighbrhood']).mean().reset_index()
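The cell above takes the plain average of each neighborhood's boundary vertices as its approximate center. The sketch below shows how that can differ from an area-weighted centroid. It uses a made-up polygon and two hypothetical helpers (`vertex_mean`, `polygon_centroid`), and it treats longitude and latitude as planar coordinates, which is only a reasonable assumption over city-sized areas.

```python
def vertex_mean(ring):
    """Average of the boundary vertices, returned as (x, y)."""
    xs = [p[0] for p in ring]
    ys = [p[1] for p in ring]
    return sum(xs) / len(xs), sum(ys) / len(ys)

def polygon_centroid(ring):
    """Area-weighted centroid of a simple polygon (shoelace formula)."""
    area2 = 0.0  # twice the signed area
    cx = cy = 0.0
    # pair each vertex with the next one, wrapping around to the first
    for (x0, y0), (x1, y1) in zip(ring, ring[1:] + ring[:1]):
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    return cx / (3 * area2), cy / (3 * area2)

# toy unit square whose top edge is sampled more densely than the others
ring = [(0, 0), (1, 0), (1, 1), (0.5, 1), (0.25, 1), (0, 1)]

print(vertex_mean(ring))       # pulled toward the densely sampled edge
print(polygon_centroid(ring))  # geometric center of the square
```

For roughly evenly sampled boundaries the two agree closely, so the vertex mean is a fair shortcut when only an approximate center is needed for a venue search radius.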

In [19]:
sf_df

Unnamed: 0,Neighbrhood,Longitude,Latitude
0,Bayview,-122.378681,37.724948
1,Bernal Heights,-122.418704,37.738714
2,Castro/Upper Market,-122.441684,37.761834
3,Chinatown,-122.40762,37.792535
4,Crocker Amazon,-122.436243,37.711069
5,Diamond Heights,-122.44191,37.74103
6,Downtown/Civic Center,-122.412572,37.786747
7,Excelsior,-122.424322,37.723815
8,Financial District,-122.394722,37.798139
9,Glen Park,-122.4341,37.738684


In [20]:
sf_df = sf_df[['Neighbrhood', 'Latitude', 'Longitude']]
sf_df

Unnamed: 0,Neighbrhood,Latitude,Longitude
0,Bayview,37.724948,-122.378681
1,Bernal Heights,37.738714,-122.418704
2,Castro/Upper Market,37.761834,-122.441684
3,Chinatown,37.792535,-122.40762
4,Crocker Amazon,37.711069,-122.436243
5,Diamond Heights,37.74103,-122.44191
6,Downtown/Civic Center,37.786747,-122.412572
7,Excelsior,37.723815,-122.424322
8,Financial District,37.798139,-122.394722
9,Glen Park,37.738684,-122.4341


In [24]:
sf_df.shape

(36, 4)

In [25]:
# assign returns a new dataframe, which avoids pandas' SettingWithCopyWarning on the sliced sf_df
sf_df = sf_df.assign(City="SF")
sf_df = sf_df[["City", "Neighbrhood", "Latitude", "Longitude"]]
sf_df.head()


Unnamed: 0,City,Neighbrhood,Latitude,Longitude
0,SF,Bayview,37.724948,-122.378681
1,SF,Bernal Heights,37.738714,-122.418704
2,SF,Castro/Upper Market,37.761834,-122.441684
3,SF,Chinatown,37.792535,-122.40762
4,SF,Crocker Amazon,37.711069,-122.436243


In [32]:
sf_df = sf_df.rename(columns={"Neighbrhood": "Neighborhood"})  # fix the misspelled column name
sf_df.head()

Unnamed: 0,City,Neighborhood,Latitude,Longitude
0,SF,Bayview,37.724948,-122.378681
1,SF,Bernal Heights,37.738714,-122.418704
2,SF,Castro/Upper Market,37.761834,-122.441684
3,SF,Chinatown,37.792535,-122.40762
4,SF,Crocker Amazon,37.711069,-122.436243


In [33]:
sf_df.shape

(36, 4)

#### 1.4. Align the NY and Toronto dataframes with SF by dropping Boroughs, then merge the three dataframes into the new combined_df

In [26]:
toronto_df.head()


Unnamed: 0,City,Neighborhood,Latitude,Longitude
0,Toronto,Parkwoods,43.753259,-79.329656
1,Toronto,Victoria Village,43.725882,-79.315572
2,Toronto,"Regent Park , Harbourfront",43.65426,-79.360636
3,Toronto,"Lawrence Manor , Lawrence Heights",43.718518,-79.464763
4,Toronto,"Queen's Park , Ontario Provincial Government",43.662301,-79.389494


In [27]:
toronto_df.shape

(103, 4)

In [28]:
newyork_df.head()

Unnamed: 0,City,Neighborhood,Latitude,Longitude
0,NYC,Wakefield,40.894705,-73.847201
1,NYC,Co-op City,40.874294,-73.829939
2,NYC,Eastchester,40.887556,-73.827806
3,NYC,Fieldston,40.895437,-73.905643
4,NYC,Riverdale,40.890834,-73.912585


In [29]:
newyork_df.shape

(306, 4)

In [34]:
sf_df.head()

Unnamed: 0,City,Neighborhood,Latitude,Longitude
0,SF,Bayview,37.724948,-122.378681
1,SF,Bernal Heights,37.738714,-122.418704
2,SF,Castro/Upper Market,37.761834,-122.441684
3,SF,Chinatown,37.792535,-122.40762
4,SF,Crocker Amazon,37.711069,-122.436243


In [35]:
#Now merge all three dfs vertically
combined_df = pd.concat([sf_df, newyork_df, toronto_df])
combined_df.head()

Unnamed: 0,City,Neighborhood,Latitude,Longitude
0,SF,Bayview,37.724948,-122.378681
1,SF,Bernal Heights,37.738714,-122.418704
2,SF,Castro/Upper Market,37.761834,-122.441684
3,SF,Chinatown,37.792535,-122.40762
4,SF,Crocker Amazon,37.711069,-122.436243


In [36]:
combined_df.shape

(445, 4)

#### There are a total of 445 Neighborhoods in all three cities combined that we will take into consideration, of which 431 are unique.
We keep the duplicate names at this stage, since the 'City' column makes each row unique for further processing. As we will see later, these get weeded out step by step: a few neighborhoods are ignored while clustering, leaving a clean set of 423 rows (after removing the NaNs from one-hot encoding).
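One way to check how far the 'City' column resolves repeated names is sketched below on a toy frame. The rows are illustrative only and are not taken from combined_df; the point is the two `duplicated` calls, one keyed on the name alone and one keyed on city plus name.

```python
import pandas as pd

# illustrative rows only: one name shared across cities, one repeated within a city
toy = pd.DataFrame({
    "City": ["SF", "NYC", "NYC", "NYC", "Toronto"],
    "Neighborhood": ["Chinatown", "Chinatown", "Chelsea", "Chelsea", "Parkwoods"],
})

n_rows = len(toy)
n_unique_names = toy["Neighborhood"].nunique()

# every row whose name occurs more than once anywhere
repeated = toy[toy["Neighborhood"].duplicated(keep=False)]

# rows that are still ambiguous after adding City to the key
same_city = toy[toy.duplicated(subset=["City", "Neighborhood"], keep=False)]

print(n_rows, n_unique_names)
print(repeated)
print(same_city)
```

If the second frame is non-empty, City plus Neighborhood is not yet a unique key, and a later group-by on those two columns will pool the repeated rows together.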

## 2. Creating Maps and Pinouts to visualize the data

In [37]:
address = 'USA'
geolocator = Nominatim(user_agent="usa-explorer")

location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of USA are {}, {}.'.format(latitude, longitude))

The geographical coordinates of USA are 39.7837304, -100.4458825.


In [38]:
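# create map of USA/Canada using latitude and longitude values
# (a low zoom level keeps all three cities in view; zoom 10 would show only the area around the center point)
map_northamerica = folium.Map(location=[latitude, longitude], zoom_start=4)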

# add markers to map
for lat, lng, neighborhood in zip(combined_df['Latitude'], combined_df['Longitude'], combined_df['Neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_northamerica)  
    
map_northamerica

## 3. Using FourSquare to get the Venue information for combined_df


#### 3.1. Create foursquare credentials

In [39]:
import os

# read the Foursquare credentials from environment variables rather than hard-coding them in the notebook
# (the variable names FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET are placeholders chosen here)
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret
VERSION = '20180605' # Foursquare API version


#### 3.2 Create a function to get all nearby venues in combined_df

In [40]:
LIMIT = 100

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
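The function above indexes straight into the JSON response, so a neighborhood with no results or a venue without a category would raise an error partway through the run. A minimal sketch of a more defensive extraction step follows. The helper `extract_venues` and the sample payload are hypothetical; the payload only mimics the keys that the function above reads and is not a recorded API response.

```python
def extract_venues(name, lat, lng, payload):
    """Turn one explore-style response into rows, tolerating missing pieces."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # venues without a category get None instead of raising IndexError
        category = categories[0]["name"] if categories else None
        rows.append((name, lat, lng,
                     venue["name"],
                     venue["location"]["lat"],
                     venue["location"]["lng"],
                     category))
    return rows

# hypothetical payload shaped like the keys used above
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Cafe X", "location": {"lat": 37.72, "lng": -122.38},
               "categories": [{"name": "Coffee Shop"}]}},
    {"venue": {"name": "Unlabeled Spot", "location": {"lat": 37.73, "lng": -122.39},
               "categories": []}},
]}]}}

rows = extract_venues("Bayview", 37.724948, -122.378681, sample)
empty = extract_venues("Nowhere", 0.0, 0.0, {})
print(rows)
print(empty)
```

Rows with a `None` category can then be dropped or counted explicitly before one-hot encoding, instead of silently stopping the loop.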

In [41]:
all_venues = getNearbyVenues(names=combined_df['Neighborhood'],
                                   latitudes=combined_df['Latitude'],
                                   longitudes=combined_df['Longitude']
                                  )

Bayview
Bernal Heights
Castro/Upper Market
Chinatown
Crocker Amazon
Diamond Heights
Downtown/Civic Center
Excelsior
Financial District
Glen Park
Golden Gate Park
Haight Ashbury
Inner Richmond
Inner Sunset
Lakeshore
Marina
Mission
Nob Hill
Noe Valley
North Beach
Ocean View
Outer Mission
Outer Richmond
Outer Sunset
Pacific Heights
Parkside
Potrero Hill
Presidio
Presidio Heights
Russian Hill
Seacliff
South of Market
Twin Peaks
Visitacion Valley
West of Twin Peaks
Western Addition
Wakefield
Co-op City
Eastchester
Fieldston
Riverdale
Kingsbridge
Marble Hill
Woodlawn
Norwood
Williamsbridge
Baychester
Pelham Parkway
City Island
Bedford Park
University Heights
Morris Heights
Fordham
East Tremont
West Farms
High  Bridge
Melrose
Mott Haven
Port Morris
Longwood
Hunts Point
Morrisania
Soundview
Clason Point
Throgs Neck
Country Club
Parkchester
Westchester Square
Van Nest
Morris Park
Belmont
Spuyten Duyvil
North Riverdale
Pelham Bay
Schuylerville
Edgewater Park
Castle Hill
Olinville
Pelham Gardens


In [42]:
all_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Bayview,37.724948,-122.378681,Maya Organic Jewelry,37.726282,-122.380073,Jewelry Store
1,Bayview,37.724948,-122.378681,MotoTireGuy - Motorcycle Tire Services,37.726075,-122.380314,Motorcycle Shop
2,Bayview,37.724948,-122.378681,Crêpe & Brioche Inc.,37.725685,-122.37937,Restaurant
3,Bayview,37.724948,-122.378681,Brave Matter,37.725871,-122.379711,Art Gallery
4,Bayview,37.724948,-122.378681,Com#,37.72395,-122.382486,Park


In [43]:
all_venues.shape

(12946, 7)

#### There are a total of 12946 Venues that were exported from Foursquare.

#### 3.3 Add the city column from combined_df into all_venues and create a new dataframe called "master_df"

In [44]:
# attach the City column by matching on neighborhood name and coordinates
neighborhood_keys = combined_df.drop_duplicates().rename(
    {'Latitude': 'Neighborhood Latitude', 'Longitude': 'Neighborhood Longitude'}, axis=1)
master_df = pd.merge(all_venues, neighborhood_keys,
                     on=['Neighborhood', 'Neighborhood Longitude', 'Neighborhood Latitude'],
                     how='left')
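Because neighborhood names repeat across cities, the merge relies on the coordinates to pick the right city. A small sketch of how such a merge can be checked is given below, using toy frames with invented coordinates; `validate` and `indicator` are standard options of `DataFrame.merge`, while the data and variable names are assumptions for the example.

```python
import pandas as pd

# toy venue rows: the same neighborhood name at two different centers
venues = pd.DataFrame({
    "Neighborhood": ["Chinatown", "Chinatown", "Chinatown"],
    "Neighborhood Latitude": [37.79, 37.79, 40.72],
    "Neighborhood Longitude": [-122.41, -122.41, -74.00],
    "Venue": ["A", "B", "C"],
})

# toy lookup table: one row per (name, coordinates) with its city
hoods = pd.DataFrame({
    "City": ["SF", "NYC"],
    "Neighborhood": ["Chinatown", "Chinatown"],
    "Neighborhood Latitude": [37.79, 40.72],
    "Neighborhood Longitude": [-122.41, -74.00],
})

keys = ["Neighborhood", "Neighborhood Latitude", "Neighborhood Longitude"]

# validate raises if a key matches several lookup rows; indicator marks unmatched venues
merged = venues.merge(hoods, on=keys, how="left",
                      validate="many_to_one", indicator=True)

print(merged[["Venue", "City", "_merge"]])
```

A row count that is unchanged by the merge, together with `_merge` equal to `both` on every row, indicates that each venue picked up exactly one city.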

In [45]:
master_df.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category,City
0,Bayview,37.724948,-122.378681,Maya Organic Jewelry,37.726282,-122.380073,Jewelry Store,SF
1,Bayview,37.724948,-122.378681,MotoTireGuy - Motorcycle Tire Services,37.726075,-122.380314,Motorcycle Shop,SF
2,Bayview,37.724948,-122.378681,Crêpe & Brioche Inc.,37.725685,-122.37937,Restaurant,SF
3,Bayview,37.724948,-122.378681,Brave Matter,37.725871,-122.379711,Art Gallery,SF
4,Bayview,37.724948,-122.378681,Com#,37.72395,-122.382486,Park,SF


In [46]:
master_df.shape

(12946, 8)

#### 3.4. Make sure data is good

In [47]:
master_df.isnull().any()

Neighborhood              False
Neighborhood Latitude     False
Neighborhood Longitude    False
Venue                     False
Venue Latitude            False
Venue Longitude           False
Venue Category            False
City                      False
dtype: bool

In [48]:
# let's see the number of unique categories that were returned from Foursquare

print('There are {} unique categories.'.format(master_df['Venue Category'].nunique()))

There are 482 unique categories.


In [49]:
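# rows with a missing neighborhood (isna catches real NaN values, not just the string 'NaN')
print(master_df.loc[master_df['Neighborhood'].isna()])
# rows where the neighborhood was filled with 0
print(master_df.loc[master_df['Neighborhood'] == 0])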

Empty DataFrame
Columns: [Neighborhood, Neighborhood Latitude, Neighborhood Longitude, Venue, Venue Latitude, Venue Longitude, Venue Category, City]
Index: []
Empty DataFrame
Columns: [Neighborhood, Neighborhood Latitude, Neighborhood Longitude, Venue, Venue Latitude, Venue Longitude, Venue Category, City]
Index: []


## 4. Create dataframe for Clustering
1. Create a dataframe grouped by 'City' and 'Neighborhood' with one-hot encoded venue categories.
2. Cluster them using K-means

In [50]:
master_df.groupby(['Neighborhood', 'City']).count()

Unnamed: 0_level_0,Unnamed: 1_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,City,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Agincourt,Toronto,4,4,4,4,4,4
"Alderwood , Long Branch",Toronto,10,10,10,10,10,10
Allerton,NYC,30,30,30,30,30,30
Annadale,NYC,14,14,14,14,14,14
Arden Heights,NYC,5,5,5,5,5,5
Arlington,NYC,7,7,7,7,7,7
Arrochar,NYC,23,23,23,23,23,23
Arverne,NYC,17,17,17,17,17,17
Astoria,NYC,98,98,98,98,98,98
Astoria Heights,NYC,11,11,11,11,11,11


In [51]:
# one hot encoding
master_onehot = pd.get_dummies(master_df[['Venue Category']], prefix="", prefix_sep="")


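# add neighborhood and city columns back to the dataframe
# (note: the venue categories appear to include one named "Neighborhood", so this
# assignment overwrites that dummy column in place instead of appending a new one)
master_onehot[['Neighborhood', 'City']] = master_df[['Neighborhood', 'City']] 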

# move city column to the first column
fixed_columns = [master_onehot.columns[-1]] + list(master_onehot.columns[:-1])
master_onehot = master_onehot[fixed_columns]

# move neighborhood column to the first column
#fixed_columns = [master_onehot.columns[306]] + list(master_onehot.columns[:-1])
#master_onehot = master_onehot[fixed_columns]

master_onehot.head()

Unnamed: 0,City,ATM,Acai House,Accessories Store,Adult Boutique,Afghan Restaurant,African Restaurant,Airport,Airport Food Court,Airport Gate,...,Weight Loss Center,Whisky Bar,Windmill,Wine Bar,Wine Shop,Winery,Wings Joint,Women's Store,Yoga Studio,Zoo Exhibit
0,SF,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,SF,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,SF,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,SF,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,SF,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [52]:
master_onehot.shape

(12946, 483)

In [53]:

master_onehot.columns.get_loc("Neighborhood")
     

304
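A position of 304 rather than the last column suggests that "Neighborhood" already existed among the alphabetically ordered category columns before the names were added back. The toy example below reproduces that situation under the assumption that one venue category is literally called "Neighborhood". The data and the `cat_` prefix are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "Neighborhood": ["Bayview", "Bayview", "Mission"],
    "City": ["SF", "SF", "SF"],
    "Venue Category": ["Park", "Neighborhood", "Park"],
})

# same call pattern as the notebook: dummy columns are named after the categories
onehot = pd.get_dummies(toy[["Venue Category"]], prefix="", prefix_sep="")
n_dummies = onehot.shape[1]

# "Neighborhood" already exists as a dummy column, so it is overwritten in place;
# only "City" is appended as a new column
onehot[["Neighborhood", "City"]] = toy[["Neighborhood", "City"]]
print(onehot.columns.tolist())

# one way to avoid the collision: prefix the dummy columns before adding the keys
safe = pd.get_dummies(toy["Venue Category"]).add_prefix("cat_")
safe.insert(0, "City", toy["City"])
safe.insert(0, "Neighborhood", toy["Neighborhood"])
print(safe.columns.tolist())
```

In the first frame the category is lost as a feature; in the second it survives as `cat_Neighborhood` next to the key columns.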

In [54]:
master_onehot["Neighborhood"]

0                                                  Bayview
1                                                  Bayview
2                                                  Bayview
3                                                  Bayview
4                                                  Bayview
5                                                  Bayview
6                                           Bernal Heights
7                                           Bernal Heights
8                                           Bernal Heights
9                                           Bernal Heights
10                                          Bernal Heights
11                                          Bernal Heights
12                                          Bernal Heights
13                                          Bernal Heights
14                                          Bernal Heights
15                                          Bernal Heights
16                                          Bernal Heigh

#### You will notice each neighborhood has multiple rows. This is because one-hot encoding is done per venue: a neighborhood with 5 venues has 5 rows, each with a '1' in the column of that venue's category.
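To make the next step concrete, here is a small sketch on invented data of how per-venue one-hot rows turn into per-neighborhood frequencies once they are averaged. The venue list is hypothetical; the calls mirror the ones used in this notebook.

```python
import pandas as pd

# hypothetical venues: four in one neighborhood, one in another
toy = pd.DataFrame({
    "Neighborhood": ["Bayview", "Bayview", "Bayview", "Bayview", "Mission"],
    "City": ["SF", "SF", "SF", "SF", "SF"],
    "Venue Category": ["Park", "Park", "Cafe", "Bar", "Cafe"],
})

# one row per venue, a single 1 in the column of its category
onehot = pd.get_dummies(toy["Venue Category"], dtype=float)
onehot.insert(0, "City", toy["City"])
onehot.insert(0, "Neighborhood", toy["Neighborhood"])

# the mean of 0/1 indicators is the share of venues in each category
grouped = onehot.groupby(["Neighborhood", "City"]).mean().reset_index()
print(grouped)
```

Each row of the grouped frame sums to 1 across the category columns, so neighborhoods with very different venue counts remain comparable.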


In [55]:
master_grouped = master_onehot.groupby(["Neighborhood", "City"]).mean().reset_index()
master_grouped

Unnamed: 0,Neighborhood,City,ATM,Acai House,Accessories Store,Adult Boutique,Afghan Restaurant,African Restaurant,Airport,Airport Food Court,...,Weight Loss Center,Whisky Bar,Windmill,Wine Bar,Wine Shop,Winery,Wings Joint,Women's Store,Yoga Studio,Zoo Exhibit
0,Agincourt,Toronto,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.0
1,"Alderwood , Long Branch",Toronto,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.0
2,Allerton,NYC,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.0
3,Annadale,NYC,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.0
4,Arden Heights,NYC,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.0
5,Arlington,NYC,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.0
6,Arrochar,NYC,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.0
7,Arverne,NYC,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.000000,0.058824,0.0,0.000000,0.000000,0.000000,0.0
8,Astoria,NYC,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.000000,0.010204,0.0,0.000000,0.000000,0.000000,0.0
9,Astoria Heights,NYC,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.0


In [57]:
# Now lets see what are the top 5 venues in each neighborhood. Remove the "City" column

num_top_venues = 5

for hood in master_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = master_grouped[master_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[2:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Agincourt----
                       venue  freq
0             Breakfast Spot  0.25
1                     Lounge  0.25
2         Chinese Restaurant  0.25
3  Latin American Restaurant  0.25
4    North Indian Restaurant  0.00


----Alderwood , Long Branch----
                venue  freq
0         Pizza Place   0.2
1            Pharmacy   0.1
2        Dance Studio   0.1
3         Coffee Shop   0.1
4  Athletics & Sports   0.1


----Allerton----
              venue  freq
0       Pizza Place  0.17
1     Deli / Bodega  0.13
2       Supermarket  0.07
3  Department Store  0.07
4          Pharmacy  0.03


----Annadale----
              venue  freq
0       Pizza Place  0.14
1  Sushi Restaurant  0.07
2            Bakery  0.07
3     Train Station  0.07
4             Diner  0.07


----Arden Heights----
           venue  freq
0       Pharmacy   0.2
1       Bus Stop   0.2
2    Coffee Shop   0.2
3    Pizza Place   0.2
4  Deli / Bodega   0.2


----Arlington----
                 venue  freq
0        

                venue  freq
0          Restaurant  0.19
1                Park  0.12
2  Chinese Restaurant  0.06
3        Home Service  0.06
4          Playground  0.06


----Bulls Head----
           venue  freq
0       Bus Stop  0.09
1    Pizza Place  0.07
2  Deli / Bodega  0.04
3     Food Truck  0.04
4      Gift Shop  0.04


----Bushwick----
                venue  freq
0                 Bar  0.10
1         Coffee Shop  0.07
2  Mexican Restaurant  0.07
3       Deli / Bodega  0.06
4              Bakery  0.04


----Business reply mail Processing CentrE----
                  venue  freq
0    Light Rail Station  0.12
1                   Spa  0.06
2                Garden  0.06
3  Fast Food Restaurant  0.06
4        Farmers Market  0.06


----Butler Manor----
                   venue  freq
0                   Pool  0.33
1         Baseball Field  0.33
2      Convenience Store  0.17
3               Bus Stop  0.17
4  Outdoors & Recreation  0.00


----CN Tower , King and Spadina , Railway Lands

ValueError: Length mismatch: Expected axis has 3 elements, new values have 2 elements
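
The `ValueError` above reports three columns where two names were supplied. That is what the transposed frame looks like when the filter `master_grouped['Neighborhood'] == hood` matches two rows, i.e. when a neighborhood name occurs more than once in `master_grouped` (for example in two different cities). A minimal sketch of one way around it, iterating row by row instead of filtering by name, is shown below. The toy frame and the helper name `top_venues` are assumptions made for illustration and are not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for master_grouped: "Chinatown" appears in two cities.
toy_grouped = pd.DataFrame({
    "Neighborhood": ["Chinatown", "Chinatown", "Astoria"],
    "City": ["NYC", "SF", "NYC"],
    "Bakery": [0.10, 0.30, 0.00],
    "Coffee Shop": [0.20, 0.10, 0.50],
    "Park": [0.05, 0.00, 0.25],
})

def top_venues(row, n=5):
    """Return the n most frequent venue categories for one row (hypothetical helper)."""
    freqs = row.drop(["Neighborhood", "City"]).astype(float).round(2)
    return freqs.sort_values(ascending=False).head(n)

# Iterating over rows means a repeated name never produces a multi-row selection.
for _, row in toy_grouped.iterrows():
    print("----{} ({})----".format(row["Neighborhood"], row["City"]))
    print(top_venues(row, n=2))
    print()
```

Printing the city next to the name also makes it clear which of the repeated neighborhoods each block of output refers to.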

#### Let's put the above info in a data frame

In [82]:

def return_most_common_venues(row, num_top_venues):
    # skip the first two entries ('Neighborhood' and 'City'); the rest are venue frequencies
    row_categories = row.iloc[2:]
    row_categories_sorted = row_categories.sort_values(ascending=False)

    return row_categories_sorted.index.values[0:num_top_venues]

In [83]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:  # ranks beyond the 3rd use the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = master_grouped['Neighborhood']

for ind in np.arange(master_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(master_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Agincourt,Breakfast Spot,Lounge,Chinese Restaurant,Latin American Restaurant,Zoo Exhibit,Factory,Egyptian Restaurant,Electronics Store,Empanada Restaurant,English Restaurant
1,"Alderwood , Long Branch",Pizza Place,Pub,Athletics & Sports,Sandwich Place,Dance Studio,Coffee Shop,Pharmacy,Skating Rink,Gym,Empanada Restaurant
2,Allerton,Pizza Place,Deli / Bodega,Supermarket,Department Store,Spanish Restaurant,Spa,Bus Station,Chinese Restaurant,Cosmetics Shop,Playground
3,Annadale,Pizza Place,Food,Dance Studio,Bakery,Liquor Store,Train Station,Pharmacy,Diner,Deli / Bodega,Restaurant
4,Arden Heights,Bus Stop,Pizza Place,Pharmacy,Deli / Bodega,Coffee Shop,Egyptian Restaurant,Electronics Store,Empanada Restaurant,English Restaurant,Entertainment Service


In [84]:
neighborhoods_venues_sorted.shape

(431, 11)

## 5. CLUSTERING

In [79]:
from sklearn.cluster import KMeans 


In [80]:
# set number of clusters
kclusters = 6


master_grouped_clustering = master_grouped.drop(columns =['Neighborhood', 'City'])

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(master_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([5, 5, 5, 5, 0, 0, 0, 5, 5, 5])
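
The notebook fixes `kclusters = 6` without showing how that value was chosen. One common way to sanity-check such a choice is to compare the mean silhouette score and the cluster sizes over a range of `k`. The sketch below does this on synthetic data that stands in for `master_grouped_clustering`; the data, the range of `k` and the variable names are assumptions, and results on the real venue frequencies may look quite different.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for master_grouped_clustering: three tight, well-separated blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + 0.1 * rng.standard_normal((20, 2)) for c in centers])

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # mean silhouette in [-1, 1]; higher is better
    sizes = np.bincount(labels)              # number of points in each cluster
    print(k, round(scores[k], 3), sizes.tolist())

best_k = max(scores, key=scores.get)
print("best k by silhouette:", best_k)
```

Looking at the sizes alongside the score helps spot solutions where nearly every point falls into a single cluster.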

In [85]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

master_merged = combined_df

# join the sorted venue table (with cluster labels) onto combined_df, which holds latitude/longitude for each neighborhood
master_merged = master_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

master_merged.head() # check the last columns!

Unnamed: 0,City,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,SF,Bayview,37.724948,-122.378681,5.0,Jewelry Store,Restaurant,Food & Drink Shop,Art Gallery,Park,Motorcycle Shop,Exhibit,Event Space,Event Service,Ethiopian Restaurant
1,SF,Bernal Heights,37.738714,-122.418704,5.0,Coffee Shop,Italian Restaurant,Indian Restaurant,Gourmet Shop,Bakery,Mexican Restaurant,Café,Cocktail Bar,Asian Restaurant,Peruvian Restaurant
2,SF,Castro/Upper Market,37.761834,-122.441684,5.0,Park,Monument / Landmark,Hill,Japanese Restaurant,Trail,Szechuan Restaurant,Dessert Shop,Shoe Store,Scenic Lookout,Road
3,SF,Chinatown,37.792535,-122.40762,5.0,Bakery,Cocktail Bar,Chinese Restaurant,American Restaurant,Salon / Barbershop,Optical Shop,Bar,Coffee Shop,Spa,Noodle House
3,SF,Chinatown,37.792535,-122.40762,5.0,Hotel,Coffee Shop,Bakery,Tea Room,Szechuan Restaurant,Spa,Bubble Tea Shop,Chinese Restaurant,Sushi Restaurant,Vegetarian / Vegan Restaurant
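
The repeated Chinatown row above, with two different venue lists, suggests that the join on `Neighborhood` alone matches neighborhoods that share a name across cities. Deduplicating afterwards keeps one of the two rows, but not necessarily the one that belongs to the right city. A minimal sketch of joining on both keys follows. It assumes the venue table also carries a `City` column, which `neighborhoods_venues_sorted` in the notebook does not, and all values in the toy frames are invented.

```python
import pandas as pd

# Toy stand-ins; coordinates and venue names are illustrative only.
toy_coords = pd.DataFrame({
    "City": ["SF", "NYC"],
    "Neighborhood": ["Chinatown", "Chinatown"],
    "Latitude": [37.79, 40.72],
})
toy_venues = pd.DataFrame({
    "City": ["SF", "NYC"],
    "Neighborhood": ["Chinatown", "Chinatown"],
    "1st Most Common Venue": ["Venue A", "Venue B"],
})

# Joining on the name alone pairs every Chinatown with every Chinatown: 4 rows.
by_name = toy_coords.merge(toy_venues.drop(columns="City"), on="Neighborhood")

# Joining on both keys keeps one row per (City, Neighborhood): 2 rows.
by_city_and_name = toy_coords.merge(toy_venues, on=["City", "Neighborhood"], how="left")

print(len(by_name), len(by_city_and_name))
```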


In [86]:
master_merged = master_merged.drop_duplicates(subset="Neighborhood")
master_merged

Unnamed: 0,City,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,SF,Bayview,37.724948,-122.378681,5.0,Jewelry Store,Restaurant,Food & Drink Shop,Art Gallery,Park,Motorcycle Shop,Exhibit,Event Space,Event Service,Ethiopian Restaurant
1,SF,Bernal Heights,37.738714,-122.418704,5.0,Coffee Shop,Italian Restaurant,Indian Restaurant,Gourmet Shop,Bakery,Mexican Restaurant,Café,Cocktail Bar,Asian Restaurant,Peruvian Restaurant
2,SF,Castro/Upper Market,37.761834,-122.441684,5.0,Park,Monument / Landmark,Hill,Japanese Restaurant,Trail,Szechuan Restaurant,Dessert Shop,Shoe Store,Scenic Lookout,Road
3,SF,Chinatown,37.792535,-122.407620,5.0,Bakery,Cocktail Bar,Chinese Restaurant,American Restaurant,Salon / Barbershop,Optical Shop,Bar,Coffee Shop,Spa,Noodle House
4,SF,Crocker Amazon,37.711069,-122.436243,5.0,Tennis Court,Bus Station,Liquor Store,Gastropub,Scenic Lookout,Zoo Exhibit,Event Service,Exhibit,Event Space,Ethiopian Restaurant
5,SF,Diamond Heights,37.741030,-122.441910,5.0,Trail,Park,Playground,Grocery Store,Pharmacy,Dim Sum Restaurant,Salon / Barbershop,Bus Station,Coffee Shop,Baseball Field
6,SF,Downtown/Civic Center,37.786747,-122.412572,5.0,Cocktail Bar,Coffee Shop,Theater,Speakeasy,Vietnamese Restaurant,Breakfast Spot,Spa,Café,Wine Bar,Hotel
7,SF,Excelsior,37.723815,-122.424322,2.0,Trail,Park,Convenience Store,Scenic Lookout,Lake,Dumpling Restaurant,Eastern European Restaurant,Egyptian Restaurant,Electronics Store,Empanada Restaurant
8,SF,Financial District,37.798139,-122.394722,5.0,Coffee Shop,Hotel,Gym / Fitness Center,Café,Pizza Place,Salad Place,Juice Bar,Falafel Restaurant,Park,American Restaurant
9,SF,Glen Park,37.738684,-122.434100,5.0,Trail,Park,Alternative Healer,Scenic Lookout,Dog Run,Coffee Shop,Pet Store,Gift Shop,Dive Bar,Furniture / Home Store


In [87]:
master_merged.dropna()

Unnamed: 0,City,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,SF,Bayview,37.724948,-122.378681,5.0,Jewelry Store,Restaurant,Food & Drink Shop,Art Gallery,Park,Motorcycle Shop,Exhibit,Event Space,Event Service,Ethiopian Restaurant
1,SF,Bernal Heights,37.738714,-122.418704,5.0,Coffee Shop,Italian Restaurant,Indian Restaurant,Gourmet Shop,Bakery,Mexican Restaurant,Café,Cocktail Bar,Asian Restaurant,Peruvian Restaurant
2,SF,Castro/Upper Market,37.761834,-122.441684,5.0,Park,Monument / Landmark,Hill,Japanese Restaurant,Trail,Szechuan Restaurant,Dessert Shop,Shoe Store,Scenic Lookout,Road
3,SF,Chinatown,37.792535,-122.407620,5.0,Bakery,Cocktail Bar,Chinese Restaurant,American Restaurant,Salon / Barbershop,Optical Shop,Bar,Coffee Shop,Spa,Noodle House
4,SF,Crocker Amazon,37.711069,-122.436243,5.0,Tennis Court,Bus Station,Liquor Store,Gastropub,Scenic Lookout,Zoo Exhibit,Event Service,Exhibit,Event Space,Ethiopian Restaurant
5,SF,Diamond Heights,37.741030,-122.441910,5.0,Trail,Park,Playground,Grocery Store,Pharmacy,Dim Sum Restaurant,Salon / Barbershop,Bus Station,Coffee Shop,Baseball Field
6,SF,Downtown/Civic Center,37.786747,-122.412572,5.0,Cocktail Bar,Coffee Shop,Theater,Speakeasy,Vietnamese Restaurant,Breakfast Spot,Spa,Café,Wine Bar,Hotel
7,SF,Excelsior,37.723815,-122.424322,2.0,Trail,Park,Convenience Store,Scenic Lookout,Lake,Dumpling Restaurant,Eastern European Restaurant,Egyptian Restaurant,Electronics Store,Empanada Restaurant
8,SF,Financial District,37.798139,-122.394722,5.0,Coffee Shop,Hotel,Gym / Fitness Center,Café,Pizza Place,Salad Place,Juice Bar,Falafel Restaurant,Park,American Restaurant
9,SF,Glen Park,37.738684,-122.434100,5.0,Trail,Park,Alternative Healer,Scenic Lookout,Dog Run,Coffee Shop,Pet Store,Gift Shop,Dive Bar,Furniture / Home Store


In [88]:
master_merged['Cluster Labels'].value_counts()

5.0    377
0.0     23
2.0     20
4.0      5
3.0      2
1.0      1
Name: Cluster Labels, dtype: int64

In [89]:
master_merged.loc[master_merged['Cluster Labels'].isnull()]

Unnamed: 0,City,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
207,NYC,Port Ivory,40.639683,-74.174645,,,,,,,,,,,
257,NYC,Howland Hook,40.638433,-74.186223,,,,,,,,,,,
5,Toronto,Islington Avenue,43.667856,-79.532242,,,,,,,,,,,
11,Toronto,"West Deane Park , Princess Gardens , Martin Gr...",43.650943,-79.554724,,,,,,,,,,,
95,Toronto,Upper Rouge,43.836125,-79.205636,,,,,,,,,,,


In [91]:
master_merged.drop(index=[207,257,5,11,95], inplace=True)
master_merged.head()

Unnamed: 0,City,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,SF,Bayview,37.724948,-122.378681,5.0,Jewelry Store,Restaurant,Food & Drink Shop,Art Gallery,Park,Motorcycle Shop,Exhibit,Event Space,Event Service,Ethiopian Restaurant
1,SF,Bernal Heights,37.738714,-122.418704,5.0,Coffee Shop,Italian Restaurant,Indian Restaurant,Gourmet Shop,Bakery,Mexican Restaurant,Café,Cocktail Bar,Asian Restaurant,Peruvian Restaurant
2,SF,Castro/Upper Market,37.761834,-122.441684,5.0,Park,Monument / Landmark,Hill,Japanese Restaurant,Trail,Szechuan Restaurant,Dessert Shop,Shoe Store,Scenic Lookout,Road
3,SF,Chinatown,37.792535,-122.40762,5.0,Bakery,Cocktail Bar,Chinese Restaurant,American Restaurant,Salon / Barbershop,Optical Shop,Bar,Coffee Shop,Spa,Noodle House
4,SF,Crocker Amazon,37.711069,-122.436243,5.0,Tennis Court,Bus Station,Liquor Store,Gastropub,Scenic Lookout,Zoo Exhibit,Event Service,Exhibit,Event Space,Ethiopian Restaurant


In [92]:
master_merged.shape

(423, 15)

In [93]:
master_merged["Cluster Labels"] = master_merged["Cluster Labels"].astype(np.int64)

In [94]:
master_merged.head()

Unnamed: 0,City,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,SF,Bayview,37.724948,-122.378681,5,Jewelry Store,Restaurant,Food & Drink Shop,Art Gallery,Park,Motorcycle Shop,Exhibit,Event Space,Event Service,Ethiopian Restaurant
1,SF,Bernal Heights,37.738714,-122.418704,5,Coffee Shop,Italian Restaurant,Indian Restaurant,Gourmet Shop,Bakery,Mexican Restaurant,Café,Cocktail Bar,Asian Restaurant,Peruvian Restaurant
2,SF,Castro/Upper Market,37.761834,-122.441684,5,Park,Monument / Landmark,Hill,Japanese Restaurant,Trail,Szechuan Restaurant,Dessert Shop,Shoe Store,Scenic Lookout,Road
3,SF,Chinatown,37.792535,-122.40762,5,Bakery,Cocktail Bar,Chinese Restaurant,American Restaurant,Salon / Barbershop,Optical Shop,Bar,Coffee Shop,Spa,Noodle House
4,SF,Crocker Amazon,37.711069,-122.436243,5,Tennis Court,Bus Station,Liquor Store,Gastropub,Scenic Lookout,Zoo Exhibit,Event Service,Exhibit,Event Space,Ethiopian Restaurant


In [95]:
master_merged['Cluster Labels'].value_counts()

5    372
0     23
2     20
4      5
3      2
1      1
Name: Cluster Labels, dtype: int64
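
Comparing the two `value_counts()` outputs above, cluster 5 shrinks from 377 to 372 rows even though the five rows targeted by the drop had no cluster label. A plausible cause is that `master_merged` keeps each city's original row index, so labels such as 5 occur more than once (index 5 is both Diamond Heights in SF and Islington Avenue in Toronto), and `drop(index=...)` removes every row carrying a listed label. The sketch below reproduces the effect on a toy frame and shows one alternative, dropping on the missing value itself; the toy data is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame whose index labels repeat across cities, as in the tables above.
toy = pd.DataFrame(
    {
        "City": ["SF", "SF", "Toronto", "Toronto"],
        "Neighborhood": ["A", "B", "C", "D"],
        "Cluster Labels": [5.0, 2.0, np.nan, 0.0],
    },
    index=[4, 5, 5, 6],
)

# Dropping by label removes BOTH rows labelled 5, including the valid SF row.
by_label = toy.drop(index=[5])

# Dropping on the missing value removes only the unlabeled row.
by_value = toy.dropna(subset=["Cluster Labels"]).copy()
by_value["Cluster Labels"] = by_value["Cluster Labels"].astype(int)

print(len(by_label), len(by_value))
```

Resetting the index after combining the three cities would be another way to make label-based drops safe.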

## 6. Create a MAP to visualize if needed

In [96]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(master_merged['Latitude'], master_merged['Longitude'], master_merged['Neighborhood'], master_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

# RESULTS & DISCUSSIONS

The clustering divided the neighborhoods of all three cities into six clusters, labeled 0 to 5. Three of them (0, 2 and 5) are considered for analysis.

1.Cluster 0 (RED Colored Clusters): These were areas where the most common venue was typically a bus stop. Two main locations in SF fit this category (Outer Richmond and West of Twin Peaks). This is a bit surprising, but a plausible explanation is that many students and tourists reach these two destinations from the city using multiple modes of transport, then prefer a bus (a single mode of transport, even though it is slower) to get back to their residence, because they are tired after a day at the beach or walking and hiking around Twin Peaks.

2.Cluster 1 (PURPLE Colored Clusters): Just had one location, so we will not consider it.

3.Cluster 2 (DARK BLUE Colored Clusters): These areas had trails, parks, playgrounds and zoos. These are places people visit for outdoor activities or weekend trips. Many tourists also visit them to see local nature and points of interest. This cluster is a good recommendation for a company like Trip Advisor to give to a person who enjoys outdoor activities and is visiting San Francisco or Toronto.

4.Cluster 3 (LIGHT BLUE Colored Clusters): Just had 2 locations, so we will not consider it.

5.Cluster 4 (GREEN Colored Clusters): Pizza places. Not of much use for our applications, so we will not consider it.

6.Cluster 5 (ORANGE Colored Clusters): Restaurants, coffee shops, cafés, convenience stores, pubs and bars, etc. This is the most prominent cluster, as all three locations are densely populated cities. These are the places where a lot of movement happens: people visit cafés and coffee shops on a regular basis, and neighborhood cafés, coffee shops, bakeries and restaurants are places where people regularly go with friends and family. These locations can be used by Amazon/Walmart to establish package drop-off locations.
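
The cluster descriptions above can be checked against the data with a small summary: how many neighborhoods each city contributes to each cluster, and which venue category most often ranks first within a cluster. The sketch below uses an invented toy frame with the same column names as `master_merged`; the rows are assumptions for illustration, not results from the notebook.

```python
import pandas as pd

# Toy stand-in for master_merged with only the columns needed here.
toy_merged = pd.DataFrame({
    "City": ["SF", "SF", "NYC", "NYC", "Toronto", "Toronto"],
    "Cluster Labels": [0, 5, 0, 5, 2, 5],
    "1st Most Common Venue": ["Bus Stop", "Coffee Shop", "Bus Stop",
                              "Pizza Place", "Trail", "Coffee Shop"],
})

# Rows: clusters, columns: cities, cells: number of neighborhoods.
city_counts = pd.crosstab(toy_merged["Cluster Labels"], toy_merged["City"])

# Most frequent top-ranked venue category in each cluster.
top_venue = (
    toy_merged.groupby("Cluster Labels")["1st Most Common Venue"]
    .agg(lambda s: s.value_counts().idxmax())
)

print(city_counts)
print(top_venue)
```

Run on the real `master_merged`, the same two tables would show directly whether a cluster is spread across all three cities or concentrated in one.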

In [101]:
# Cluster [0]
master_merged.loc[master_merged['Cluster Labels'] == 0, master_merged.columns[[0,1] + list(range(5, master_merged.shape[1]))]]

Unnamed: 0,City,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
22,SF,Outer Richmond,Bus Stop,Grocery Store,Beach,Garden,Park,Chinese Restaurant,Liquor Store,Café,Garden Center,Gym
34,SF,West of Twin Peaks,Bus Stop,Trail,Monument / Landmark,Mountain,Tree,Park,Empanada Restaurant,English Restaurant,Falafel Restaurant,Electronics Store
77,NYC,Manhattan Beach,Bus Stop,Café,Pizza Place,Beach,Playground,Harbor / Marina,Ice Cream Shop,Sandwich Place,Event Service,Event Space
193,NYC,Brookville,Recording Studio,Deli / Bodega,Farm,Egyptian Restaurant,Electronics Store,Empanada Restaurant,English Restaurant,Entertainment Service,Ethiopian Restaurant,Event Service
198,NYC,New Brighton,Bus Stop,Deli / Bodega,Park,Bowling Alley,Discount Store,Playground,Construction & Landscaping,Flea Market,Empanada Restaurant,Food Court
202,NYC,Grymes Hill,Bus Stop,Dog Run,American Restaurant,Zoo Exhibit,Electronics Store,Empanada Restaurant,English Restaurant,Entertainment Service,Ethiopian Restaurant,Event Service
204,NYC,South Beach,Pier,Deli / Bodega,Beach,Bus Stop,Athletics & Sports,Exhibit,Falafel Restaurant,Factory,Eye Doctor,Zoo Exhibit
206,NYC,Mariner's Harbor,Italian Restaurant,Deli / Bodega,Athletics & Sports,Moving Target,Pizza Place,Event Space,Falafel Restaurant,Factory,Eye Doctor,Exhibit
217,NYC,Tottenville,Italian Restaurant,Thrift / Vintage Store,Home Service,Mexican Restaurant,Deli / Bodega,Bus Stop,Frame Store,Cosmetics Shop,Empanada Restaurant,English Restaurant
219,NYC,Silver Lake,Bus Stop,Burger Joint,Golf Course,American Restaurant,Farm,Electronics Store,Empanada Restaurant,English Restaurant,Entertainment Service,Ethiopian Restaurant


In [102]:
# Cluster [1]:
master_merged.loc[master_merged['Cluster Labels'] == 1, master_merged.columns[[0,1] + list(range(5, master_merged.shape[1]))]]

Unnamed: 0,City,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
6,Toronto,"Malvern , Rouge",Fast Food Restaurant,Zoo Exhibit,Falafel Restaurant,Eastern European Restaurant,Egyptian Restaurant,Electronics Store,Empanada Restaurant,English Restaurant,Entertainment Service,Ethiopian Restaurant


In [103]:
# Cluster [2]
master_merged.loc[master_merged['Cluster Labels'] == 2, master_merged.columns[[0,1] + list(range(5, master_merged.shape[1]))]]

Unnamed: 0,City,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
7,SF,Excelsior,Trail,Park,Convenience Store,Scenic Lookout,Lake,Dumpling Restaurant,Eastern European Restaurant,Egyptian Restaurant,Electronics Store,Empanada Restaurant
13,SF,Inner Sunset,Trail,Park,Bus Stop,Mountain,Garden,Event Service,Factory,Eye Doctor,Exhibit,Event Space
20,SF,Ocean View,Playground,Liquor Store,Light Rail Station,Park,Dumpling Restaurant,Eastern European Restaurant,Egyptian Restaurant,Electronics Store,Empanada Restaurant,English Restaurant
27,NYC,Clason Point,Park,Bus Stop,Boat or Ferry,Grocery Store,South American Restaurant,Pool,Event Space,Falafel Restaurant,Factory,Eye Doctor
192,NYC,Somerville,Park,Zoo Exhibit,Convention Center,Egyptian Restaurant,Electronics Store,Empanada Restaurant,English Restaurant,Entertainment Service,Ethiopian Restaurant,Event Service
203,NYC,Todt Hill,Park,Zoo Exhibit,Convention Center,Egyptian Restaurant,Electronics Store,Empanada Restaurant,English Restaurant,Entertainment Service,Ethiopian Restaurant,Event Service
303,NYC,Bayswater,Park,Playground,Zoo Exhibit,Duty-free Shop,Egyptian Restaurant,Electronics Store,Empanada Restaurant,English Restaurant,Entertainment Service,Ethiopian Restaurant
0,Toronto,Parkwoods,Bus Stop,Food & Drink Shop,Park,Falafel Restaurant,Egyptian Restaurant,Electronics Store,Empanada Restaurant,English Restaurant,Entertainment Service,Ethiopian Restaurant
16,Toronto,Humewood-Cedarvale,Trail,Park,Field,Hockey Arena,Zoo Exhibit,Eye Doctor,Eastern European Restaurant,Egyptian Restaurant,Electronics Store,Empanada Restaurant
19,Toronto,The Beaches,Pub,Trail,Health Food Store,Park,Zoo Exhibit,Eye Doctor,Eastern European Restaurant,Egyptian Restaurant,Electronics Store,Empanada Restaurant


In [104]:
# Cluster [3]
master_merged.loc[master_merged['Cluster Labels'] == 3, master_merged.columns[[0,1] + list(range(5, master_merged.shape[1]))]]

Unnamed: 0,City,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
33,SF,Visitacion Valley,Pool,Zoo Exhibit,Farm,Egyptian Restaurant,Electronics Store,Empanada Restaurant,English Restaurant,Entertainment Service,Ethiopian Restaurant,Event Service
76,NYC,Mill Island,Pool,Zoo Exhibit,Farm,Egyptian Restaurant,Electronics Store,Empanada Restaurant,English Restaurant,Entertainment Service,Ethiopian Restaurant,Event Service


In [105]:
# Cluster [4]
master_merged.loc[master_merged['Cluster Labels'] == 4, master_merged.columns[[0,1] + list(range(5, master_merged.shape[1]))]]

Unnamed: 0,City,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
92,NYC,Midwood,Pizza Place,Convenience Store,Bagel Shop,Bakery,Candy Store,Pharmacy,Video Game Store,Ice Cream Shop,Fish Market,Fish & Chips Shop
10,Toronto,Glencairn,Pizza Place,Pub,Japanese Restaurant,Zoo Exhibit,Ethiopian Restaurant,Eye Doctor,Exhibit,Event Space,Event Service,English Restaurant
50,Toronto,Humber Summit,Pizza Place,Zoo Exhibit,Duty-free Shop,Egyptian Restaurant,Electronics Store,Empanada Restaurant,English Restaurant,Entertainment Service,Ethiopian Restaurant,Event Service
70,Toronto,Westmount,Pizza Place,Intersection,Chinese Restaurant,Discount Store,Sandwich Place,Coffee Shop,Eastern European Restaurant,Egyptian Restaurant,Electronics Store,Empanada Restaurant
77,Toronto,"Kingsview Village , St. Phillips , Martin Grov...",Pizza Place,Sandwich Place,Bus Line,Mobile Phone Shop,Zoo Exhibit,Egyptian Restaurant,Electronics Store,Empanada Restaurant,English Restaurant,Entertainment Service


In [106]:
# Cluster [5]
master_merged.loc[master_merged['Cluster Labels'] == 5, master_merged.columns[[0,1] + list(range(5, master_merged.shape[1]))]]

Unnamed: 0,City,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,SF,Bayview,Jewelry Store,Restaurant,Food & Drink Shop,Art Gallery,Park,Motorcycle Shop,Exhibit,Event Space,Event Service,Ethiopian Restaurant
1,SF,Bernal Heights,Coffee Shop,Italian Restaurant,Indian Restaurant,Gourmet Shop,Bakery,Mexican Restaurant,Café,Cocktail Bar,Asian Restaurant,Peruvian Restaurant
2,SF,Castro/Upper Market,Park,Monument / Landmark,Hill,Japanese Restaurant,Trail,Szechuan Restaurant,Dessert Shop,Shoe Store,Scenic Lookout,Road
3,SF,Chinatown,Bakery,Cocktail Bar,Chinese Restaurant,American Restaurant,Salon / Barbershop,Optical Shop,Bar,Coffee Shop,Spa,Noodle House
4,SF,Crocker Amazon,Tennis Court,Bus Station,Liquor Store,Gastropub,Scenic Lookout,Zoo Exhibit,Event Service,Exhibit,Event Space,Ethiopian Restaurant
6,SF,Downtown/Civic Center,Cocktail Bar,Coffee Shop,Theater,Speakeasy,Vietnamese Restaurant,Breakfast Spot,Spa,Café,Wine Bar,Hotel
8,SF,Financial District,Coffee Shop,Hotel,Gym / Fitness Center,Café,Pizza Place,Salad Place,Juice Bar,Falafel Restaurant,Park,American Restaurant
9,SF,Glen Park,Trail,Park,Alternative Healer,Scenic Lookout,Dog Run,Coffee Shop,Pet Store,Gift Shop,Dive Bar,Furniture / Home Store
10,SF,Golden Gate Park,Lake,Thai Restaurant,Park,Bus Station,Trail,Windmill,Golf Course,Ethiopian Restaurant,Event Space,Event Service
12,SF,Inner Richmond,Thai Restaurant,Wine Bar,Burger Joint,Chinese Restaurant,Japanese Restaurant,Pizza Place,Burmese Restaurant,Wine Shop,Bakery,Furniture / Home Store


# CONCLUSION
This study successfully segmented neighborhoods based on similarities in their most popular venues (between cities), which helped companies decide what area to deploy a project in within a new city, based on successful experimentation in a pilot city.
This was extended to three cases introduced in the first section.

1.Companies like Amazon/Walmart were able to deploy "package drop-off locations" around neighborhood venues that belong to cluster 5 (Orange Colored Clusters) in Toronto and San Francisco.

2.Trip Advisor (or similar) was able to give relevant recommendations, drawn from cluster 2 (Dark Blue Colored Clusters), to its users who were into outdoor activities and were either tourists or just exploring the cities of Toronto and San Francisco.

3.Companies like Uber/Lyft or SuperShuttle were able to deploy new ride share service routes in San Francisco, in cluster 0 (Red Colored Clusters), where the majority of the population would wait around a bus stop and take a single mode of transportation rather than use other modes of transport where they would have to change stations/modes multiple times.


** Again, this study does not go into too much detail and does not take into consideration a lot of real-world factors and other data. It is meant to be a presentation of Data Science and Machine Learning abilities, to showcase academic comprehension of the subject and tools.