# Ideal Restaurant Venues  
### By: Vatsal Patel

### Introduction:
**Problem Statement:**  
A ramen shop owner notices that his sales hit their peak during lunch time due to a large amount of businesses located near his shop. Furthermore, his shop is located near a subway station which makes it readily accessible to individuals who are staying / working further away. With his savings now large enough to open another shop, he is looking to expand his business. With factors such as taste, quality of food, authenticity and price of the food remaining the same, the shop owner wants to at the very least achieve similar levels of sale (ideally the goal would be to outsell his current shop). With this being said what location is ideal for him to open up shop at? And what factors would deem a location ideal for a ramen business. The shop owner wants to better understand the demographics of the customers visiting his shop. Doing this he can create an environment that is best suitable for the demographic. In doing so, he will achieve similar levels of sales, improve restaurant ratings and create a loyal customer base. 


For the restaurant owner to maintain / improve his sales, we can use the Four Square API to analyze location based data in New York. From the dataset we would want to identify the frequency and time of visits from the area’s inhabitants  

Based on these factors we can identify a location which is frequently visited, in an enjoyable neighborhood and be able to create an environment that is ideal to the age group of the possible customers visiting the area. Creating this positive and engaging environment at the restaurant should encourage the customers to write positive reviews and aid in essentially marketing the restaurant.


### Data  
The data can be easily access through the Four Square API. The data will focus on the latitude and longitude of subway station in New York and the various restaurants and businesses located near the station. The subway location dataset can be accessed from here: https://data.cityofnewyork.us/api/views/kk4q-3rt2/rows.csv?accessType=DOWNLOAD  which can be retrieved from http://web.mta.info/nyct/service/. 


The dataset will be segmented into sectors for the various subway stations in NYC. Following that I was able to identify the names, latitude and longitude of the locations. Following that, from importing the Four Square API for NYC we can identify and visualize the subway location and name based on the latitude and longitudes. Then we are able to identify various restaurants and businesses in a small radius of the subway station. Finally we are able to see the unique locations after removing repeated venues that were in the same radius of multiple subway stations. 


**Segmenting NY Subway Stations**

In [1]:
import pandas as pd
import requests
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
import json
from pandas.io.json import json_normalize
from geopy.geocoders import Nominatim

In [2]:

geo_path = "https://data.cityofnewyork.us/api/views/kk4q-3rt2/rows.csv?accessType=DOWNLOAD"
subway_df = pd.read_csv(geo_path)
subway_df.drop(["URL","OBJECTID","LINE","NOTES"],axis = 1, inplace = True)
subway_splt = subway_df["the_geom"].str.split("(", n = 2, expand = True)
subway_splt = subway_splt[1].str.split(")", n = 2, expand = True)
subway_splt = subway_splt[0].str.split(" ", n = 2, expand = True)
subway_df["LAT"] = pd.to_numeric(subway_splt[1],errors = "coerce")
subway_df["LON"] = pd.to_numeric(subway_splt[0],errors = "coerce")
subway_df.drop("the_geom",axis=1,inplace = True)
subway_df.drop_duplicates(subset= "NAME", keep = "first",inplace = True)
subway_df.reset_index(drop = True, inplace = True)
subway_df

Unnamed: 0,NAME,LAT,LON
0,Astor Pl,40.730054,-73.991070
1,Canal St,40.718803,-74.000193
2,50th St,40.761728,-73.983849
3,Bergen St,40.680862,-73.974999
4,Pennsylvania Ave,40.664714,-73.894886
5,238th St,40.884667,-73.900870
6,Cathedral Pkwy (110th St),40.800582,-73.958067
7,Kingston - Throop Aves,40.679919,-73.940859
8,65th St,40.749720,-73.898788
9,36th St,40.751960,-73.929018


In [3]:
import requests # library to handle requests
import pandas as pd # library for data analsysis
import numpy as np # library to handle data in a vectorized manner
import random # library for random number generation

!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values

# libraries for displaying images
from IPython.display import Image 
from IPython.core.display import HTML 
    
# tranforming json file into a pandas dataframe library
from pandas.io.json import json_normalize

!conda install -c conda-forge folium=0.5.0 --yes
import folium # plotting library

print('Folium installed')
print('Libraries imported.')

Fetching package metadata .............
Solving package specifications: .

Package plan for installation in environment /opt/conda/envs/DSX-Python35:

The following NEW packages will be INSTALLED:

    geographiclib: 1.49-py_0   conda-forge
    geopy:         1.18.1-py_0 conda-forge

geographiclib- 100% |################################| Time: 0:00:00  24.67 MB/s
geopy-1.18.1-p 100% |################################| Time: 0:00:00  34.01 MB/s
Fetching package metadata .............
Solving package specifications: .

Package plan for installation in environment /opt/conda/envs/DSX-Python35:

The following NEW packages will be INSTALLED:

    altair:  2.2.2-py35_1 conda-forge
    branca:  0.3.1-py_0   conda-forge
    folium:  0.5.0-py_0   conda-forge
    vincent: 0.4.4-py_1   conda-forge

altair-2.2.2-p 100% |################################| Time: 0:00:00  56.42 MB/s
branca-0.3.1-p 100% |################################| Time: 0:00:00  34.32 MB/s
vincent-0.4.4- 100% |###################

In [4]:
sub = 'New York City, NY'
geolocator = Nominatim(user_agent="ny_explorer")
lat_lon = geolocator.geocode(sub)
lat = lat_lon.latitude
lon = lat_lon.longitude
#longitude = location.longitude
print('The geograpical coordinate of Times Square are {}, {}.'.format(lat, lon))

The geograpical coordinate of Times Square are 40.7308619, -73.9871558.


In [5]:
map_newyork = folium.Map(location = [lat, lon], zoom_start = 11)
for lat, lng, label in zip(subway_df['LAT'], subway_df['LON'], subway_df['NAME']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
    [lat,lng],
    radius = 5,
    popup = label,
    color = 'blue',
    fill = True,
    fill_color = '#3186cc',
    fill_opacity = 0.7,
    parse_html = False).add_to(map_newyork)
    
map_newyork

![NYC_Subway](https://serving.photos.photobox.com/879997025dd731d51a0835a1e73bc7d9a253270a4155ff05873564afa93b60b783d792b1.jpg)

In [6]:
# The code was removed by Watson Studio for sharing.

In [7]:
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [8]:

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])
        nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    return(nearby_venues)

In [9]:
LIMIT = 10
subway_venues = getNearbyVenues(names=subway_df['NAME'],
                                   latitudes=subway_df['LAT'],
                                   longitudes=subway_df['LON']
                                  )

Astor Pl
Canal St
50th St
Bergen St
Pennsylvania Ave
238th St
Cathedral Pkwy (110th St)
Kingston - Throop Aves
65th St
36th St
Delancey St - Essex St
Van Siclen Ave
Norwood Ave
104th-102nd Sts
DeKalb Ave
Beach 105th St
Beach 90th St
Freeman St
Intervale Ave
182nd-183rd Sts
174th-175th Sts
167th St
Mets - Willets Point
Junction Blvd
Flushing - Main St
Buhre Ave
3rd Ave - 138th St
Castle Hill Ave
Brooklyn Bridge - City Hall
Zerega Ave
Grand Central - 42nd St
33rd St
96th St
77th St
Chauncey St
Union St
Elmhurst Ave
Ralph Ave
Pelham Pkwy
Gun Hill Rd
Nereid Ave (238 St)
Franklin Ave
Simpson St
Bronx Park East
Winthrop St
149th St - Grand Concourse
161st St - Yankee Stadium
Lexington Ave - 59th St
E 149th St
Morrison Av - Soundview
Whitlock Ave
St Lawrence Ave
Woodside - 61st St
Far Rockaway - Mott Ave
72nd St
168th St
Kingsbridge Rd
42nd St - Bryant Pk
Prospect Park
55th St
Jamaica - Van Wyck
Kew Gardens - Union Tpke
Sutphin Blvd - Archer Av
Court Sq - 23rd St
67th Ave
Grand Ave - Newtown


In [10]:
print("Size of subway_venues is: ", subway_venues.shape)
print('There are {} uniques categories.'.format(len(subway_venues['Venue Category'].unique())))

Size of subway_venues is:  (3518, 7)
There are 319 uniques categories.


Now that all the data has been gathered and cleaned for modeling purposes we are able to move forward with our plans

### Methodology
We will use a clustering algorithm called K-means. Clusters are a group of objects similar to other objects in the cluster and dissimilar to data points in other clusters. We can partition data sets into groups with similar characteristics. Partition based clustering is also known as K-means clustering. This type of partition based clustering divides the data into non-overlapping subsets. The purpose of K-means is to minimize the intra-cluster distances and maximize the inter-cluster distances.   

Using the K-means clustering algorithm we can create clusters of various venues which have high frequency visits near subway stations in New York. Then we are able to identify the venues and the frequency of customers based on the clusters created.

In [11]:
# one hot encoding
subway_onehot = pd.get_dummies(subway_venues[['Venue Category']], prefix="", prefix_sep="")
subway_onehot[0] = subway_venues['Neighborhood'] 
fixed_columns = [subway_onehot.columns[-1]] + list(subway_onehot.columns[:-1])
subway_onehot = subway_onehot[fixed_columns]
subway_onehot.rename(columns= {0:"Subway"},inplace=True)
subway_onehot.head()

Unnamed: 0,Subway,African Restaurant,American Restaurant,Animal Shelter,Antique Shop,Aquarium,Arcade,Arepa Restaurant,Argentinian Restaurant,Art Gallery,...,Vietnamese Restaurant,Warehouse Store,Weight Loss Center,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio,Zoo
0,Astor Pl,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Astor Pl,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Astor Pl,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Astor Pl,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Astor Pl,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [12]:
subway_grouped = subway_onehot.groupby('Subway').mean().reset_index()
subway_grouped.head()

Unnamed: 0,Subway,African Restaurant,American Restaurant,Animal Shelter,Antique Shop,Aquarium,Arcade,Arepa Restaurant,Argentinian Restaurant,Art Gallery,...,Vietnamese Restaurant,Warehouse Store,Weight Loss Center,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio,Zoo
0,103rd St,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1,0.0
1,103rd St - Corona Plaza,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,104th St,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,104th-102nd Sts,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,110th St,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [13]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [14]:
num_top_venues = 3

indicators = ['st', 'nd', 'rd']

columns = ['Subway']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Subway'] = subway_grouped['Subway']

for ind in np.arange(subway_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(subway_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Subway,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
0,103rd St,Pizza Place,Bar,Ice Cream Shop
1,103rd St - Corona Plaza,Latin American Restaurant,Park,Deli / Bodega
2,104th St,Discount Store,Pharmacy,Pizza Place
3,104th-102nd Sts,Metro Station,Deli / Bodega,Ice Cream Shop
4,110th St,Cuban Restaurant,Latin American Restaurant,Farmers Market


In [15]:
kclusters = 4
subway_grouped_clustering = subway_grouped.drop('Subway', 1)
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(subway_grouped_clustering)
kmeans.labels_[0:10]

array([3, 2, 3, 0, 2, 3, 0, 0, 2, 3], dtype=int32)

In [16]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
subway_merged = subway_df

# merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
subway_merged = subway_merged.join(neighborhoods_venues_sorted)
subway_merged.dropna(axis = "rows",inplace = True) #Drop any NaN
subway_merged.head()

Unnamed: 0,NAME,LAT,LON,Cluster Labels,Subway,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
0,Astor Pl,40.730054,-73.99107,3,103rd St,Pizza Place,Bar,Ice Cream Shop
1,Canal St,40.718803,-74.000193,2,103rd St - Corona Plaza,Latin American Restaurant,Park,Deli / Bodega
2,50th St,40.761728,-73.983849,3,104th St,Discount Store,Pharmacy,Pizza Place
3,Bergen St,40.680862,-73.974999,0,104th-102nd Sts,Metro Station,Deli / Bodega,Ice Cream Shop
4,Pennsylvania Ave,40.664714,-73.894886,2,110th St,Cuban Restaurant,Latin American Restaurant,Farmers Market


In [17]:
map_clusters = folium.Map(location=[lat, lon], zoom_start=11)

rainbow = ["green","blue","yellow","red"]

markers_colors = []
for lat, lon, poi, cluster in zip(subway_merged['LAT'], subway_merged['LON'], 
                                  subway_merged['NAME'], subway_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)-1],
        fill=True,
        fill_color=rainbow[int(cluster)-1],
        fill_opacity=0.7).add_to(map_clusters)

map_clusters

![Clustering NYC](https://serving.photos.photobox.com/030846659f51f10c2d9d92eb0c828dbc69928c651bce06b5a499aceec1e2cb9039638ab1.jpg)

In [34]:
subway_cluster1 = subway_merged[subway_merged["Cluster Labels"]==0].groupby(by="1st Most Common Venue",
                                                                           as_index=False).count()
subway_cluster1.rename(columns = {"NAME":"Frequency"},inplace = True)
subway_cluster1.drop(["LAT","LON","Cluster Labels","Subway","2nd Most Common Venue","3rd Most Common Venue"],
                     axis=1,inplace=True)
subway_cluster1 = subway_cluster1[subway_cluster1.Frequency!=1]
subway_cluster1.set_index("Frequency",drop = True,inplace = True)
subway_cluster1.sort_index(ascending=False,inplace = True)

In [35]:
subway_cluster1.head()

Unnamed: 0_level_0,1st Most Common Venue
Frequency,Unnamed: 1_level_1
20,Italian Restaurant
14,Coffee Shop
11,Park
6,Japanese Restaurant
5,Theater


In [20]:
subway_cluster2 = subway_merged[subway_merged["Cluster Labels"]==1].groupby(by="1st Most Common Venue",
                                                                           as_index=False).count()
subway_cluster2.rename(columns = {"NAME":"Frequency"},inplace = True)
subway_cluster2.drop(["LAT","LON","Cluster Labels","Subway","2nd Most Common Venue","3rd Most Common Venue"],
                     axis=1,inplace=True)
subway_cluster2 = subway_cluster2[subway_cluster2.Frequency!=1]
subway_cluster2.set_index("Frequency",drop = True,inplace = True)
subway_cluster2.sort_index(ascending=False,inplace = True)

subway_cluster2.head()

Unnamed: 0_level_0,1st Most Common Venue
Frequency,Unnamed: 1_level_1
13,Caribbean Restaurant
3,Latin American Restaurant
2,Bakery
2,Café
2,Discount Store


In [21]:
subway_cluster3 = subway_merged[subway_merged["Cluster Labels"]==2].groupby(by="1st Most Common Venue",
                                                                           as_index=False).count()
subway_cluster3.rename(columns = {"NAME":"Frequency"},inplace = True)
subway_cluster3.drop(["LAT","LON","Cluster Labels","Subway","2nd Most Common Venue","3rd Most Common Venue"],
                     axis=1,inplace=True)
subway_cluster3 = subway_cluster3[subway_cluster3.Frequency!=1]
subway_cluster3.set_index("Frequency",drop = True,inplace = True)
subway_cluster3.sort_index(ascending=False,inplace = True)

subway_cluster3.head()

Unnamed: 0_level_0,1st Most Common Venue
Frequency,Unnamed: 1_level_1
17,Bar
10,Mexican Restaurant
6,Coffee Shop
5,Latin American Restaurant
4,Asian Restaurant


In [22]:
subway_cluster4 = subway_merged[subway_merged["Cluster Labels"]==3].groupby(by="1st Most Common Venue",
                                                                           as_index=False).count()
subway_cluster4.rename(columns = {"NAME":"Frequency"},inplace = True)
subway_cluster4.drop(["LAT","LON","Cluster Labels","Subway","2nd Most Common Venue","3rd Most Common Venue"],
                     axis=1,inplace=True)
subway_cluster4 = subway_cluster4[subway_cluster4.Frequency!=1]
subway_cluster4.set_index("Frequency",drop = True,inplace = True)
subway_cluster4.sort_index(ascending=False,inplace = True)
subway_cluster4

Unnamed: 0_level_0,1st Most Common Venue
Frequency,Unnamed: 1_level_1
27,Pizza Place
6,Discount Store
5,Bakery
5,Bank
4,Pharmacy
4,Supermarket
3,Bagel Shop
3,Café
2,Deli / Bodega
2,Diner


### Results & Discussion
The analysis shows that there are various spots around New York that would generate similar levels of revenues from the frequency of the visits by the customers. A high concentration of locations are near pizza shops, Italian restaurants, parks, bars near the same location of downtown Manhattan. The model also showed a large clustering frequency of data points in the Brooklyn vicinity. The results of the K-means algorithm displayed various locations throughout the city that has high frequency of visitors. Based on this, a restaurant owner can see ample locations where they would want to open shop and receive a large amount of customers. From the first cluster we see that Italian Restaurants have high frequency of 13, second cluster shows that Pizza Places have high clusters of 27, the third cluster shows that Bars have a large cluster of 23 and finally the fourth cluster shows that Theaters have a large cluster. Following from that they can customize their services and environment in the restaurant that is best suitable for the customer to maximize customer satisfaction. From this they are able to build a loyal and regular customer base which can market their brand and quality of food for them.

### Conclusion  
Based on the K-means clustering algorithm we can see that there are various locations around NYC where there is a high frequency of visits by consumers. Based on that, opening shop around those locations and creating an environment suitable to those locations would be ideal for the restaurant owner to maximize customer satisfaction.