<div class="alert alert-block alert-info">
<b>IBM Capstone Project for Applied Data Science</b>
</div>

## Battle of the Neighborhoods: Ikinari Steak
### What Lies Ahead after a Precipitous Fall?
     by Byung Kim

<div class="alert alert-block alert-info">
Section 1

<h2> The Business Problem </h2>
<h3> A.K.A. What's up with Ikinari Steak? </h3>
</div>


### The Ikinari Steak Story in Japan
Ikinari Steak has a near cult following in Japan.
The business model is quite simple to understand and can be broken down into three major points:
1. Create a high customer turnover environment by making the shops __["tachigui"](https://en.wiktionary.org/wiki/%E7%AB%8B%E3%81%A1%E9%A3%9F%E3%81%84)__, literally "standing and eating". <br> There are no seats, so meals are done in 15-20 minutes.
2. Higher turnover allows you to provide higher quality meat for lower prices. <br> Improve your margins by hiring retired chefs who don't mind working a couple hours for lunch or dinner and can cook a cut of meat from rare to well-done.
3. Find locations near business centers to capture the hearts and stomachs of busy employees who don't mind eating the same delicious thing day-in, day-out. <br> Bring them back by creating rewards programs, including one that records how much meat you ate compared to others.

There are over __[300 locations](https://ikinaristeakusa.com/story.html)__ in Japan, with more on the way, and revenue looks strong.

### The Bloated Start and Precipitous Fall
The success of Ikinari Steak in Japan stirred ambitions to go global, and Ikinari Steak opened its first New York location in the __[East Village of Manhattan, New York in February 2017](https://www.foodnewsfeed.com/content/stand-restaurant-ikinari-steak-opening-nyc)__.

The New York team wasted no time: by December 2018, it had opened a total of __[11 locations throughout Manhattan](https://www.thevillager.com/2018/12/steaks-are-sizzling-at-ikinari-on-bleecker-st/)__. These included high-profile locations such as Times Square, Chelsea, and Grand Central.

Then, on February 15, 2019, an article in Eater announced that __["Japanese Standing Steakhouse Ikinari Will Shutter 9 of Its 11 NYC Locations"](https://ny.eater.com/2019/2/15/18226734/ikinari-closing-pepper-lunch-opening-nyc)__. Ikinari Steak was closing 7 locations and rebranding 2 due to slowing sales and unfulfilled expectations.

What happened in that short period of time?

### Anecdotal Evidence
It is easy to point out anecdotal missteps.
- When New Yorkers hear the word "steakhouse", they are probably not thinking about dashing in and chowing everything down in 15 minutes. Sparks and Peter Luger's have taught us over the years to be well-dressed, bring the family and friends we love, and get ready for white gloves and wine by the bottle. Even Japanese marketers have taught us this with their lessons for American brands overseas--it is nigh-impossible to change cultural norms. It takes grueling patience.
- The rent is too damn high. Even with the subprime mortgage crisis, commercial real estate remains as strong as ever. There is a slight slowdown at the time of this writing, but there is an almost religious belief that the New York City real estate market will endure anything and everything. And so...
- The food is cheap, but not cheap enough. 18-25 dollars on lunch is not unheard of in Midtown, but with food carts now selling steak at lower prices, Ikinari Steak is losing face on its value proposition. But you can't just cut prices--the rent is too damn high!

### That's great, but what does the DATA say?
I don't have the sales figures and COGS for Ikinari Steak, but I can use data from Foursquare to begin to understand what the surrounding area looked like for the 3 still-standing Ikinari Steak locations, the 7 closed locations, and the 2 rebranded locations.

I hope to begin to answer the following overarching business questions:

<div class="alert alert-block alert-warning">
    
<h4>Question 1: What are some common characteristics of the surrounding area for the 11 Ikinari Steak locations?</h4>
<ul>
  <li>Can we draw conclusions about the Ikinari Steak New York team's strategy for opening locations?</li>
  <li>Can we draw conclusions about why some locations failed, why some are being rebranded, and why some still stand?</li>
</ul> 

<h4> Question 2: If Ikinari Steak were to try expanding once again, where should they challenge? </h4>
<ul>
  <li>Perhaps we can begin looking in the larger New York City area for new Ikinari Steak locations.</li>
</ul> 
</div>

<div class="alert alert-block alert-info">
Section 2

<h2>The Data</h2>
<h3>The Clean-Up and the Breakdown</h3>
</div>

In [2]:
# import necessary libraries
import numpy as np
import pandas as pd

import json

from geopy.geocoders import Nominatim

import requests
from pandas import json_normalize  # available at the top level since pandas 1.0

from sklearn.cluster import KMeans

import folium

#### Data Collection: Ikinari Steak
I will need the addresses of all the Ikinari Steak New York locations,
which I will load into a pandas dataframe.

The locations and statuses of the different Ikinari Steak locations
come from the __[official website](https://ikinaristeakusa.com/location.html)__.

#### Data Processing: Ikinari Steak
The following steps will be taken to process the Ikinari Steak data:
1. I will import a pre-made csv file with Ikinari Steak data
that includes the names, addresses, grand opening dates, and current status
of the different Ikinari Steak locations.

2. However, the file will be missing the latitudes and longitudes.
These will be filled in using geopy's Nominatim to get the
coordinates of each Ikinari Steak location.

3. After filling in the coordinates, we will map the findings
to see a general overview of where the locations are and conclude
this section.
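
Step 2 can be sketched in a way that is easy to test without calling the geocoding service. This is a minimal sketch, not the notebook's actual cell: `fill_coordinates` and `fake_geocode` are hypothetical helper names, and the geocoder is passed in as a plain callable so that an unresolved address simply leaves the coordinates empty.

```python
from collections import namedtuple

import pandas as pd


def fill_coordinates(df, geocode, address_col="Address"):
    """Return a copy of df with Latitude/Longitude filled from geocode(address).

    geocode is any callable that returns an object with .latitude and
    .longitude attributes, or None when the address cannot be resolved.
    """
    out = df.copy()
    out["Latitude"] = out["Latitude"].astype(float)
    out["Longitude"] = out["Longitude"].astype(float)
    for index, address in out[address_col].items():
        location = geocode(address)
        if location is None:
            continue  # leave the coordinates as NaN
        out.at[index, "Latitude"] = location.latitude
        out.at[index, "Longitude"] = location.longitude
    return out


# Stand-in geocoder for illustration only (assumption: not a real service)
FakeLocation = namedtuple("FakeLocation", ["latitude", "longitude"])


def fake_geocode(address):
    known = {"90 E 10th St, New York, NY 10003": FakeLocation(40.730806, -73.989727)}
    return known.get(address)


sample = pd.DataFrame({
    "Address": ["90 E 10th St, New York, NY 10003", "an address that does not resolve"],
    "Latitude": [float("nan"), float("nan")],
    "Longitude": [float("nan"), float("nan")],
})
filled = fill_coordinates(sample, fake_geocode)
```

With geopy, the callable would be `geolocator.geocode`; wrapping it in `geopy.extra.rate_limiter.RateLimiter` with `min_delay_seconds=1` is one way to stay polite toward the public Nominatim service.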

In [238]:
# This imports a prepared csv file as detailed above
i_steak_data = pd.read_csv('ikinari_steak.csv')
i_steak_data

Unnamed: 0,Official_Name,Address,Latitude,Longitude,Date_Opened,Current_Status
0,East Village,"90 E 10th St, New York, NY 10003",,,2017-02-23,Open
1,Chelsea 7th Ave,"154 7th Ave, New York, NY 10011",,,2017-12-15,Rebrand
2,Times Square,"368 W 46th St, New York, NY 10036",,,2018-01-19,Open
3,5th Ave,"37 W 46th St, New York, NY 10036",,,2018-02-16,Open
4,Chelsea 8th Ave,"96 8th Ave, New York, NY 10011",,,2018-02-23,Closed
5,Park Ave,"455 Park Avenue South, New York, NY 10016",,,2018-03-16,Closed
6,Broadway,"243 W 54th St, New York, NY 10019",,,2018-05-04,Rebrand
7,Upper West,"2233 Broadway, New York, NY 10024",,,2018-07-06,Closed
8,Lexington Ave,"1007 Lexington Ave, New York, NY 10021",,,2018-08-03,Closed
9,Madison Ave,"295 Madison Ave, New York, NY 10017",,,2018-10-19,Closed


In [240]:
# Use geopy's Nominatim geocoder to get the latitude and longitude of
# each Ikinari Steak address,
# then use a loop to set the values into our pandas dataframe

geolocator = Nominatim(user_agent="is_explorer")

for index, row in i_steak_data.iterrows():
    location = geolocator.geocode(row['Address'])
    # geocode() returns None when an address cannot be resolved
    if location is None:
        continue
    i_steak_data.at[index, 'Latitude'] = location.latitude
    i_steak_data.at[index, 'Longitude'] = location.longitude

i_steak_data

Unnamed: 0,Official_Name,Address,Latitude,Longitude,Date_Opened,Current_Status
0,East Village,"90 E 10th St, New York, NY 10003",40.730806,-73.989727,2017-02-23,Open
1,Chelsea 7th Ave,"154 7th Ave, New York, NY 10011",40.741916,-73.997605,2017-12-15,Rebrand
2,Times Square,"368 W 46th St, New York, NY 10036",40.760638,-73.990395,2018-01-19,Open
3,5th Ave,"37 W 46th St, New York, NY 10036",40.756872,-73.980425,2018-02-16,Open
4,Chelsea 8th Ave,"96 8th Ave, New York, NY 10011",40.74018,-74.001972,2018-02-23,Closed
5,Park Ave,"455 Park Avenue South, New York, NY 10016",40.744894,-73.982617,2018-03-16,Closed
6,Broadway,"243 W 54th St, New York, NY 10019",40.764537,-73.983327,2018-05-04,Rebrand
7,Upper West,"2233 Broadway, New York, NY 10024",40.784303,-73.979768,2018-07-06,Closed
8,Lexington Ave,"1007 Lexington Ave, New York, NY 10021",40.770717,-73.961706,2018-08-03,Closed
9,Madison Ave,"295 Madison Ave, New York, NY 10017",40.751781,-73.979337,2018-10-19,Closed


In [18]:
# turn the new dataframe into a csv for future use
i_steak_data.to_csv('comp_ikinari_steak.csv', index=False)

In [241]:
# mapping the Ikinari Steak locations
geolocator = Nominatim(user_agent="ny_explorer")

location = geolocator.geocode('Midtown Manhattan, NY')
latitude = location.latitude
longitude = location.longitude

map_ikinari_steak = folium.Map(location=[latitude, longitude], zoom_start=12)

# add markers
for lat, lng, name, date, stat in zip(
        i_steak_data['Latitude'],
        i_steak_data['Longitude'],
        i_steak_data['Official_Name'],
        i_steak_data['Date_Opened'],
        i_steak_data['Current_Status']):
    label = '{}, {}, {}'.format(name, date, stat)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='red',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_ikinari_steak)

map_ikinari_steak

#### Data Collection: NYC Neighborhoods
This section of the notebook will follow many of the same steps as the sample notebook
from the __[IBM Coursera Capstone Course](https://www.coursera.org/learn/applied-data-science-capstone/)__.

I will be using the 2014 neighborhood data from __[NYU's Spatial Data Repository](https://geo.nyu.edu/catalog/nyu_2451_34572)__.

#### Data Processing: NYC Neighborhoods
The data will be processed with the following steps:
1. I will first import the json file and extract the borough, neighborhood name,
and coordinates of each neighborhood.
2. I will then save the result into a pandas dataframe and a csv file for future use.

In [3]:
# Loading the json file
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)

In [4]:
# Pertinent information is within "features" key
neighborhoods_data = newyork_data['features']
neighborhoods_data[0]

{'type': 'Feature',
 'id': 'nyu_2451_34572.1',
 'geometry': {'type': 'Point',
  'coordinates': [-73.84720052054902, 40.89470517661]},
 'geometry_name': 'geom',
 'properties': {'name': 'Wakefield',
  'stacked': 1,
  'annoline1': 'Wakefield',
  'annoline2': None,
  'annoline3': None,
  'annoangle': 0.0,
  'borough': 'Bronx',
  'bbox': [-73.84720052054902,
   40.89470517661,
   -73.84720052054902,
   40.89470517661]}}

In [5]:
# Instantiate the pandas dataframe for the neighborhood data
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 
neighborhoods_df = pd.DataFrame(columns=column_names)
neighborhoods_df

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude


In [6]:
# Loop through the neighborhoods_data list
# and collect one row per neighborhood
rows = []
for data in neighborhoods_data:
    # GeoJSON stores coordinates as [longitude, latitude]
    neighborhood_lon, neighborhood_lat = data['geometry']['coordinates']
    rows.append({'Borough': data['properties']['borough'],
                 'Neighborhood': data['properties']['name'],
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

# DataFrame.append was removed in pandas 2.0, so build the dataframe in one step
neighborhoods_df = pd.DataFrame(rows, columns=column_names)

In [10]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(neighborhoods_df['Borough'].unique()),
        neighborhoods_df.shape[0]
    )
)

neighborhoods_df.head()

The dataframe has 5 boroughs and 306 neighborhoods.


Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Bronx,Wakefield,40.894705,-73.847201
1,Bronx,Co-op City,40.874294,-73.829939
2,Bronx,Eastchester,40.887556,-73.827806
3,Bronx,Fieldston,40.895437,-73.905643
4,Bronx,Riverdale,40.890834,-73.912585


In [11]:
# Create a csv for convenience
neighborhoods_df.to_csv('nyc_neighborhoods.csv', index=False)

#### Data Collection and Processing: Foursquare
##### Phase 1: Venues Nearby Ikinari Steak at Grand Opening
Using the Foursquare API, I am going to analyze the surrounding area of the various
Ikinari Steak locations. 

Following in the footsteps of the labs in the IBM Coursera course, I am going to request up to 50 venues
within a 250 meter (0.15 mile) radius of each location. The relative density of New York City
should allow for enough results, especially in the locations that Ikinari Steak
has chosen.
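
To make the 250 meter radius concrete, here is a small sketch that converts it to miles and checks the distance between one Ikinari Steak location and one nearby venue. The haversine formula is my own choice of sanity check, not something the Foursquare API exposes; the coordinates are the East Village location and the Shake Shack venue as they appear in this notebook's output.

```python
import math


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points."""
    earth_radius_m = 6371000.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * earth_radius_m * math.asin(math.sqrt(a))


RADIUS_M = 250
radius_miles = RADIUS_M / 1609.344  # meters per mile

# Ikinari Steak East Village vs. Shake Shack (coordinates from the notebook output)
distance_m = haversine_m(40.730806, -73.989727, 40.729998, -73.989696)
within_radius = distance_m <= RADIUS_M
```

The venue sits roughly 90 meters away, comfortably inside the search radius.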

The API's version parameter (`v`) will be set to the grand opening date of each location,
in an attempt to gauge what the environment was like when the location opened.

All of the data will be placed into a pandas dataframe.
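
As a rough sketch of how one such request could be assembled, the query can be kept in a dictionary and encoded separately instead of being formatted into one long string. `build_explore_params` is a hypothetical helper, and `"MY_ID"` / `"MY_SECRET"` are placeholder credentials; the endpoint and parameter names are the ones this notebook uses.

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"


def build_explore_params(client_id, client_secret, version, lat, lng,
                         radius=250, limit=50):
    """Collect the query parameters for one venues/explore request."""
    return {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,                      # YYYYMMDD, no hyphens
        "ll": "{},{}".format(lat, lng),    # latitude,longitude
        "radius": radius,                  # meters
        "limit": limit,                    # maximum number of venues
    }


# Opening date of the East Village location, with the hyphens removed
version = "2017-02-23".replace("-", "")
params = build_explore_params("MY_ID", "MY_SECRET", version, 40.730806, -73.989727)
url = EXPLORE_ENDPOINT + "?" + urlencode(params)
```

With `requests`, the same dictionary can be passed directly as `requests.get(EXPLORE_ENDPOINT, params=params)`, which handles the encoding.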

##### Phase 2: Venues in NYC Neighborhoods
We will then use the most current API version on the many neighborhoods of NYC to find the best location
for a possible Ikinari Steak expansion. We will again request up to
50 venues within a 250 meter (0.15 mile) radius of each neighborhood center.

We will filter out any neighborhoods that cannot produce at least 30 nearby venues, because we can assume
that such a place does not have enough foot traffic for Ikinari Steak's high-turnover strategy (though keeping only venue-dense neighborhoods inevitably means there will be more competition to deal with).
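
The venue-count filter can be illustrated on toy data. This is a sketch under assumptions: the dataframe is made up, and `MIN_VENUES` is set to 3 here purely so the example stays small; the analysis itself uses 30.

```python
import pandas as pd

# Made-up venues: one dense neighborhood and one sparse neighborhood
toy_venues = pd.DataFrame({
    "Neighborhood": ["Dense", "Dense", "Dense", "Sparse"],
    "Venue": ["a", "b", "c", "d"],
})

MIN_VENUES = 3  # stands in for the threshold of 30 used in the analysis

# Attach each row's neighborhood size, then keep rows from large enough groups
venue_counts = toy_venues.groupby("Neighborhood")["Venue"].transform("size")
kept = toy_venues[venue_counts >= MIN_VENUES]
```

Only the rows belonging to "Dense" survive, which mirrors how sparse neighborhoods drop out of the candidate list.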

### Phase 1: Venues Nearby Ikinari Steak at Grand Opening

In [13]:
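# Client ID and Client Secret in local txt file (one per line)
filepath = 'foursquare_cred.txt'
with open(filepath) as cred:
    # strip the trailing newline so it does not end up in the request URL
    CLIENT_ID = cred.readline().strip()
    CLIENT_SECRET = cred.readline().strip()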

In [20]:
# Function to get venues from latitudes, longitudes
# Function was tweaked so that the version is set to the grand opening date
def getNearbyPastVenues(names, latitudes, longitudes, versions, radius=250):
    print('Getting venues for')
    venues_list=[]
    for name, lat, lng, ver in zip(names, latitudes, longitudes, versions):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            ver, 
            lat, 
            lng, 
            radius, 
            50) # limit set to 50 results
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng,
            ver,
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Official_Name', 
                  'IS_Latitude', 
                  'IS_Longitude',
                  'Version',
                  'Venue', 
                  'Venue_Latitude', 
                  'Venue_Longitude', 
                  'Venue_Category']
    
    print('Done.')
    return(nearby_venues)

In [16]:
# For function to work, I need to remove hyphens from
# the dates in the dataframe
i_steak_data = pd.read_csv('comp_ikinari_steak.csv')

for index, row in i_steak_data.iterrows():
    date = row['Date_Opened']
    dehyphened_date = date.replace('-', '')
    i_steak_data.at[index, 'Date_Opened'] = dehyphened_date

i_steak_data

Unnamed: 0,Official_Name,Address,Latitude,Longitude,Date_Opened,Current_Status
0,East Village,"90 E 10th St, New York, NY 10003",40.730806,-73.989727,20170223,Open
1,Chelsea 7th Ave,"154 7th Ave, New York, NY 10011",40.741916,-73.997605,20171215,Rebrand
2,Times Square,"368 W 46th St, New York, NY 10036",40.760638,-73.990395,20180119,Open
3,5th Ave,"37 W 46th St, New York, NY 10036",40.756872,-73.980425,20180216,Open
4,Chelsea 8th Ave,"96 8th Ave, New York, NY 10011",40.74018,-74.001972,20180223,Closed
5,Park Ave,"455 Park Avenue South, New York, NY 10016",40.744894,-73.982617,20180316,Closed
6,Broadway,"243 W 54th St, New York, NY 10019",40.764537,-73.983327,20180504,Rebrand
7,Upper West,"2233 Broadway, New York, NY 10024",40.784303,-73.979768,20180706,Closed
8,Lexington Ave,"1007 Lexington Ave, New York, NY 10021",40.770717,-73.961706,20180803,Closed
9,Madison Ave,"295 Madison Ave, New York, NY 10017",40.751781,-73.979337,20181019,Closed


In [21]:
# Run the function
iks_venues = getNearbyPastVenues(names=i_steak_data['Official_Name'],
                            latitudes=i_steak_data['Latitude'],
                            longitudes=i_steak_data['Longitude'],
                            versions=i_steak_data['Date_Opened'])

Getting venues for
East Village
Chelsea 7th Ave
Times Square
5th Ave
Chelsea 8th Ave
Park Ave
Broadway
Upper West
Lexington Ave
Madison Ave
Bleecker St
Done.


In [41]:
# check size and head of the dataframe
print(iks_venues.shape)
iks_venues.head()

(525, 8)


Unnamed: 0,Official_Name,IS_Latitude,IS_Longitude,Version,Venue,Venue_Latitude,Venue_Longitude,Venue_Category
0,East Village,40.730806,-73.989727,20170223,Shake Shack,40.729998,-73.989696,Burger Joint
1,East Village,40.730806,-73.989727,20170223,Ippudo,40.730948,-73.990287,Ramen Restaurant
2,East Village,40.730806,-73.989727,20170223,Angel’s Share,40.729755,-73.98936,Speakeasy
3,East Village,40.730806,-73.989727,20170223,Switch Playground 12th Street,40.732184,-73.988699,Gym
4,East Village,40.730806,-73.989727,20170223,Fabio Clemente Jiu Jitsu,40.732304,-73.989069,Martial Arts Dojo


In [42]:
# check how many venues for each Ikinari Steak
iks_venues.groupby('Official_Name').count()

Unnamed: 0_level_0,IS_Latitude,IS_Longitude,Version,Venue,Venue_Latitude,Venue_Longitude,Venue_Category
Official_Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
5th Ave,50,50,50,50,50,50,50
Bleecker St,50,50,50,50,50,50,50
Broadway,50,50,50,50,50,50,50
Chelsea 7th Ave,50,50,50,50,50,50,50
Chelsea 8th Ave,50,50,50,50,50,50,50
East Village,36,36,36,36,36,36,36
Lexington Ave,39,39,39,39,39,39,39
Madison Ave,50,50,50,50,50,50,50
Park Ave,50,50,50,50,50,50,50
Times Square,50,50,50,50,50,50,50


In [43]:
# Check how many unique categories there are
print('There are {} unique categories.'.format(len(iks_venues['Venue_Category'].unique())))

There are 161 unique categories.


In [44]:
# Output as csv for future use
iks_venues.to_csv('ikinari_steak_venues.csv', index=False)

### Phase 2: Venues in NYC Neighborhoods

In [27]:
# Set the version to date of this report
VERSION = '20190720'

In [32]:
# Use a similar function, but without the version changes
def getNearbyVenues(names, latitudes, longitudes, radius=250):
    print('Loading')
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print('[]', end="")
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            50)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    print('Done!')
    return(nearby_venues)

In [37]:
# Load csv file if necessary
# neighborhoods_df = pd.read_csv('nyc_neighborhoods.csv')

# Use the function above to get the venues in each neighborhood
nyc_venues = getNearbyVenues(names=neighborhoods_df['Neighborhood'],
                                   latitudes=neighborhoods_df['Latitude'],
                                   longitudes=neighborhoods_df['Longitude']
                                  )

Loading
[][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][]Done!


In [38]:
print(nyc_venues.shape)
nyc_venues.head()

(3644, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Wakefield,40.894705,-73.847201,Lollipops Gelato,40.894123,-73.845892,Dessert Shop
1,Wakefield,40.894705,-73.847201,Shell,40.894187,-73.845862,Gas Station
2,Wakefield,40.894705,-73.847201,Pitman Deli,40.894149,-73.845748,Food
3,Wakefield,40.894705,-73.847201,The Upper Room,40.892567,-73.846406,Music Venue
4,Co-op City,40.874294,-73.829939,Capri II Pizza,40.876374,-73.82994,Pizza Place


In [39]:
# Output as csv file for future use
nyc_venues.to_csv('full_nyc_venues.csv', index=False)

In [40]:
# Group by neighborhood and check the count
nyc_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Allerton,15,15,15,15,15,15
Arden Heights,1,1,1,1,1,1
Arlington,3,3,3,3,3,3
Arrochar,3,3,3,3,3,3
Arverne,3,3,3,3,3,3
Astoria,10,10,10,10,10,10
Astoria Heights,10,10,10,10,10,10
Auburndale,1,1,1,1,1,1
Bath Beach,10,10,10,10,10,10
Battery Park City,31,31,31,31,31,31


In [50]:
# Filter out neighborhoods with fewer than 30 venues
filtered_neighborhoods = nyc_venues.groupby('Neighborhood').filter(lambda x : len(x)>=30)
filtered_neighborhoods.groupby('Neighborhood').first()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Battery Park City,40.711932,-74.016869,Battery Park City Esplanade,40.711622,-74.017907,Park
Brooklyn Heights,40.695864,-73.993782,Brooklyn Historical Society,40.694942,-73.992333,History Museum
Carnegie Hill,40.782683,-73.953256,Kitchen Arts & Letters,40.784226,-73.952135,Bookstore
Carroll Gardens,40.68054,-73.994654,East One Coffee Roasters,40.681128,-73.996526,Coffee Shop
Chelsea,40.744035,-74.003116,Milk & Hops Chelsea,40.744751,-74.002595,Beer Bar
Chinatown,40.715618,-73.994279,Spicy Village,40.71701,-73.99353,Chinese Restaurant
Civic Center,40.715229,-74.005415,Atera,40.716752,-74.005712,Molecular Gastronomy Restaurant
Clinton,40.759101,-73.996119,Pershing Square Signature Theater,40.759228,-73.995232,Theater
Clinton Hill,40.693229,-73.967843,Cardiff Giant,40.693215,-73.969203,Bar
Downtown,40.690844,-73.983463,Alamo Drafthouse Cinema,40.691016,-73.983686,Movie Theater


In [61]:
# Create a csv file with the candidate neighborhoods
filtered_neighborhoods.to_csv('candidates_venues.csv', index=False)

The neighborhoods shown above are the candidates for Ikinari Steak locations.

They all have a high density of venues and, from a cursory look, are recognizable by name alone.

In [55]:
# First, I create a dataframe for the candidate neighborhoods
candidates_df = filtered_neighborhoods.groupby('Neighborhood').first().reset_index()
candidates_df.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Battery Park City,40.711932,-74.016869,Battery Park City Esplanade,40.711622,-74.017907,Park
1,Brooklyn Heights,40.695864,-73.993782,Brooklyn Historical Society,40.694942,-73.992333,History Museum
2,Carnegie Hill,40.782683,-73.953256,Kitchen Arts & Letters,40.784226,-73.952135,Bookstore
3,Carroll Gardens,40.68054,-73.994654,East One Coffee Roasters,40.681128,-73.996526,Coffee Shop
4,Chelsea,40.744035,-74.003116,Milk & Hops Chelsea,40.744751,-74.002595,Beer Bar


In [242]:
# Mapping the candidates
# Then I create a map using folium
geolocator = Nominatim(user_agent="nyc_explorer")

location = geolocator.geocode('Midtown Manhattan, NY')
latitude = location.latitude
longitude = location.longitude

map_candidates = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(candidates_df['Neighborhood Latitude'], 
                           candidates_df['Neighborhood Longitude'], 
                           candidates_df['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_candidates)  
    
map_candidates

#### Putting Together the Ikinari Steak and NYC Neighborhoods Data
Finally, I am going to bring together the two dataframes into one large dataframe.

In [71]:
# Load csv files if necessary
iks_venues = pd.read_csv('ikinari_steak_venues.csv')
cds_venues = pd.read_csv('candidates_venues.csv')

In [72]:
print(iks_venues.columns)
print(cds_venues.columns)

Index(['Official_Name', 'IS_Latitude', 'IS_Longitude', 'Version', 'Venue',
       'Venue_Latitude', 'Venue_Longitude', 'Venue_Category'],
      dtype='object')
Index(['Neighborhood', 'Neighborhood Latitude', 'Neighborhood Longitude',
       'Venue', 'Venue Latitude', 'Venue Longitude', 'Venue Category'],
      dtype='object')


In [83]:
# First, we will drop the version in iks_venues
# And add the string 'Ikinari Steak ' in front of the Official_Name
temp_iks_venues = iks_venues.drop(['Version'], axis=1)
temp_iks_venues['Official_Name'] = 'Ikinari Steak ' + temp_iks_venues['Official_Name'].astype(str)
temp_iks_venues.rename(columns={"Official_Name": "Name", 
                                "IS_Latitude": "Latitude", 
                                "IS_Longitude": "Longitude", 
                                "Venue_Latitude": "Venue Latitude",
                                "Venue_Longitude": "Venue Longitude",
                                "Venue_Category": "Venue Category"}, inplace=True)
temp_iks_venues.groupby('Name').count()

Unnamed: 0_level_0,Latitude,Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Ikinari Steak 5th Ave,50,50,50,50,50,50
Ikinari Steak Bleecker St,50,50,50,50,50,50
Ikinari Steak Broadway,50,50,50,50,50,50
Ikinari Steak Chelsea 7th Ave,50,50,50,50,50,50
Ikinari Steak Chelsea 8th Ave,50,50,50,50,50,50
Ikinari Steak East Village,36,36,36,36,36,36
Ikinari Steak Lexington Ave,39,39,39,39,39,39
Ikinari Steak Madison Ave,50,50,50,50,50,50
Ikinari Steak Park Ave,50,50,50,50,50,50
Ikinari Steak Times Square,50,50,50,50,50,50


In [90]:
# Also need to rename the columns of the candidate venues (filtered_neighborhoods)
temp_cds_df = filtered_neighborhoods.rename(columns={"Neighborhood": "Name", 
                                                     "Neighborhood Latitude": "Latitude", 
                                                     "Neighborhood Longitude": "Longitude"})
temp_cds_df.head()

Unnamed: 0,Name,Latitude,Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
105,Fordham,40.860997,-73.896427,Fresh Frutii,40.861722,-73.898682,Juice Bar
106,Fordham,40.860997,-73.896427,188 Bakery Cuchifritos,40.861602,-73.898311,Latin American Restaurant
107,Fordham,40.860997,-73.896427,Pollo Campero,40.86096,-73.897599,Fried Chicken Joint
108,Fordham,40.860997,-73.896427,Paradise Theater,40.860499,-73.898463,Music Venue
109,Fordham,40.860997,-73.896427,Best Italian Pizza,40.862475,-73.896898,Pizza Place


In [93]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows the same way
full_venues = pd.concat([temp_cds_df, temp_iks_venues], sort=False)
print(full_venues.shape)
full_venues.groupby('Name').count()

(2222, 7)


Unnamed: 0_level_0,Latitude,Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Battery Park City,31,31,31,31,31,31
Brooklyn Heights,44,44,44,44,44,44
Carnegie Hill,36,36,36,36,36,36
Carroll Gardens,49,49,49,49,49,49
Chelsea,33,33,33,33,33,33
Chinatown,50,50,50,50,50,50
Civic Center,40,40,40,40,40,40
Clinton,50,50,50,50,50,50
Clinton Hill,32,32,32,32,32,32
Downtown,50,50,50,50,50,50


In [106]:
# Check how many unique categories there are
print('There are {} unique categories.'.format(len(full_venues['Venue Category'].unique())))
unique_vc = full_venues['Venue Category'].unique()
unique_vc.sort()
unique_vc

There are 284 unique categories.


array(['Accessories Store', 'Adult Boutique', 'Afghan Restaurant',
       'American Restaurant', 'Animal Shelter', 'Antique Shop',
       'Arepa Restaurant', 'Art Gallery', 'Art Museum',
       'Arts & Crafts Store', 'Asian Restaurant', 'Australian Restaurant',
       'Austrian Restaurant', 'Automotive Shop', 'BBQ Joint',
       'Bagel Shop', 'Bakery', 'Bank', 'Bar', 'Beach', 'Beer Bar',
       'Beer Garden', 'Beer Store', 'Big Box Store',
       'Bike Rental / Bike Share', 'Bike Shop', 'Bistro', 'Board Shop',
       'Boat or Ferry', 'Bookstore', 'Boutique', 'Boxing Gym',
       'Brazilian Restaurant', 'Breakfast Spot', 'Bridal Shop',
       'Bubble Tea Shop', 'Building', 'Burger Joint', 'Burrito Place',
       'Butcher', 'Café', 'Cajun / Creole Restaurant', 'Camera Store',
       'Candy Store', 'Cantonese Restaurant', 'Caribbean Restaurant',
       'Carpet Store', 'Cheese Shop', 'Chinese Restaurant',
       'Chocolate Shop', 'Circus', 'Climbing Gym', 'Clothing Store',
       'Cocktail

Although there are some problematic categories (e.g. Gym vs. Gym / Fitness Center) that require more investigation, I do not anticipate that they will affect the clustering or decision tree dramatically.
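
If the overlapping categories do turn out to matter, one possible clean-up is to merge near-duplicate labels before creating dummy variables. This is only a sketch: `CATEGORY_ALIASES` is a hypothetical mapping, and which labels should be merged is a judgment call that this notebook does not make.

```python
import pandas as pd

# Hypothetical mapping from a near-duplicate label to the label to keep
CATEGORY_ALIASES = {"Gym / Fitness Center": "Gym"}

toy = pd.DataFrame({"Venue Category": ["Gym", "Gym / Fitness Center", "Bakery"]})

# Replace aliased labels; labels not in the mapping are left unchanged
toy["Venue Category"] = toy["Venue Category"].replace(CATEGORY_ALIASES)
n_unique = toy["Venue Category"].nunique()
```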

In [95]:
# Again, save the full venues dataframe as a csv for future reference
full_venues.to_csv('full_venues.csv', index=False)

<div class="alert alert-block alert-info">
Section 3

<h2>The Methodology</h2>
<h3>Clustering and a Decision Tree</h3>
</div>

### Approaching from Two Directions
#### Approach 1: Clustering the Candidates
For the first section, I am going to do a k-means clustering analysis on the larger dataframe:
1. Using dummy variables, I am going to cluster the different neighborhoods and Ikinari Steak locations.
   - There will be 3 clusters, based on the amount of data available.
2. By exploring this result, we can begin to understand whether the clustering algorithm places the Ikinari Steak locations appropriately and look for patterns.
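
The clustering approach can be sketched end to end on toy data: one-hot encode the venue categories, average them per place to get a category profile, and run k-means with 3 clusters. The names and categories below are made up, and `n_init` and `random_state` are my own assumptions chosen for reproducibility.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Made-up venues for four places (neighborhoods or Ikinari Steak locations)
toy_venues = pd.DataFrame({
    "Name": ["A", "A", "B", "B", "C", "C", "D", "D"],
    "Venue Category": ["Bar", "Bar", "Bar", "Bakery", "Park", "Park", "Gym", "Park"],
})

# One dummy column per category, then the mean per place gives category shares
onehot = pd.get_dummies(toy_venues["Venue Category"], dtype=float)
onehot.insert(0, "Name", toy_venues["Name"])
profiles = onehot.groupby("Name").mean()

# Cluster the category profiles into 3 groups
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(profiles)
clusters = pd.Series(kmeans.labels_, index=profiles.index, name="Cluster")
```

Each row of `profiles` sums to 1, so places are compared by the mix of nearby venues rather than by how many venues were returned.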

#### Approach 2: Predicting a Potential Ikinari Steak Location's Status
For the second section, I am going to train a decision tree classifier to predict whether an Ikinari Steak location in a given neighborhood would be Open, Closed, or Rebranded.
1. Using the same dummy variables, I am going to train the decision tree algorithm using the Ikinari Steak locations, then apply it to the different neighborhoods.
2. I will then compare the clustering analysis with the decision tree predictions. 

In [None]:
# Load the csv file if necessary
# full_venues = pd.read_csv('full_venues.csv')

### Approach 1: Clustering

In [108]:
# getting dummy variables for the different categories
# then putting them into a dataframe
venues_onehot = pd.get_dummies(full_venues[['Venue Category']], prefix="", prefix_sep="")
venues_onehot['Name'] = full_venues['Name']
fixed_columns = [venues_onehot.columns[-1]] + list(venues_onehot.columns[:-1])
venues_onehot = venues_onehot[fixed_columns]

print(venues_onehot.shape)
venues_onehot.head()

(2222, 285)


Unnamed: 0,Name,Accessories Store,Adult Boutique,Afghan Restaurant,American Restaurant,Animal Shelter,Antique Shop,Arepa Restaurant,Art Gallery,Art Museum,...,Used Bookstore,Vape Store,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Whisky Bar,Wine Bar,Wine Shop,Women's Store,Yoga Studio
105,Fordham,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
106,Fordham,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
107,Fordham,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
108,Fordham,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
109,Fordham,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [202]:
# group the new dataframe by neighborhood/Ikinari Steak location name
venues_grouped = venues_onehot.groupby('Name').mean().reset_index()
venues_grouped

Unnamed: 0,Name,Accessories Store,Adult Boutique,Afghan Restaurant,American Restaurant,Animal Shelter,Antique Shop,Arepa Restaurant,Art Gallery,Art Museum,...,Used Bookstore,Vape Store,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Whisky Bar,Wine Bar,Wine Shop,Women's Store,Yoga Studio
0,Battery Park City,0.0,0.0,0.0,0.032258,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.032258,0.0,0.032258,0.0
1,Brooklyn Heights,0.0,0.0,0.0,0.022727,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.022727,0.0,0.0,0.045455,0.022727,0.068182
2,Carnegie Hill,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.027778,0.027778,0.0,0.0
3,Carroll Gardens,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.020408,0.0,0.0,0.0,0.0,0.020408,0.0,0.0
4,Chelsea,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.030303,0.0,0.060606,0.0,0.0,0.0,0.0,0.0
5,Chinatown,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.04,0.0,0.0,0.0,0.0,0.0
6,Civic Center,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025
7,Clinton,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0
8,Clinton Hill,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03125,0.0,0.03125
9,Downtown,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Taking the grouped one-hot dataframe, I sort the rows into four clusters.

Because k-means starts from randomly chosen centroids, the labels can differ from run to run, so I rerun the algorithm until the clusters roughly match what we already know about the data.
For example: Ikinari Steak East Village should be in the same cluster as East Village.

I did this by checking for three conditions:
1. The Open stores should all be in the same cluster
2. Ikinari Steak East Village is in the same cluster as East Village
3. No more than 5 of the Ikinari Steak locations should fall in the Open cluster
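
A possible variation on this rerun-until-it-fits idea is sketched below: it caps the number of attempts and records the seed that satisfied the check, so the accepted result can be reproduced. The function name `cluster_until`, the toy blobs, and the single "these rows share a cluster" check are assumptions for illustration, not the conditions used in this notebook.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_until(X, same_cluster, n_clusters=4, max_tries=50):
    """Refit k-means with seeds 0..max_tries-1 until the rows listed in
    `same_cluster` all receive the same label. Returns (seed, labels),
    or (None, None) if no attempt satisfies the check."""
    for seed in range(max_tries):
        labels = KMeans(n_clusters=n_clusters, n_init=10,
                        random_state=seed).fit(X).labels_
        if len({labels[i] for i in same_cluster}) == 1:
            return seed, labels
    return None, None

# Toy data: four well-separated blobs of five points each (made up)
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 10], [0, 10], [10, 0]])
X = np.vstack([c + rng.normal(scale=0.1, size=(5, 2)) for c in centers])

# Rows 0, 1 and 2 belong to the first blob, so they should share a label
seed, labels = cluster_until(X, same_cluster=[0, 1, 2])
print(seed, labels)
```

Passing the recorded seed as `random_state` later would regenerate the same labels, and the attempt cap avoids a loop that never ends if the conditions cannot be met.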

In [206]:
venues_clustering = venues_grouped.drop('Name', axis=1)

while True:
    # run k-means clustering (no random_state, so each pass can give different labels)
    kmeans = KMeans(n_clusters=4).fit(venues_clustering)
    k_lbls = kmeans.labels_
    # rows 21-31 of venues_grouped are the Ikinari Steak locations;
    # count how many of them share a cluster with row 21
    num_open = k_lbls.tolist()[21:32].count(k_lbls[21])
    
    # Check for three conditions: 
    # 1. Open stores should be in the same cluster
    # 2. Ikinari Steak East Village is in same cluster as East Village
    # 3. Number of stores in Open cluster should not exceed 5
    if (k_lbls[21] == k_lbls[26] == k_lbls[30] and
        k_lbls[11] == k_lbls[26] and
        num_open <= 5):
        break

# check cluster labels generated for each row in the dataframe
print(num_open)
print(kmeans.labels_[21:32])
print(kmeans.labels_)

5
[1 0 0 2 1 1 0 2 3 1 1]
[2 1 2 0 1 1 0 0 1 1 2 1 1 2 1 1 2 2 1 0 1 1 0 0 2 1 1 0 2 3 1 1 1 1 0 1 2
 1 3 3 1 1 1 1 1 2 2 1 1 0 1]


In [224]:
# add clustering labels to the dataframe
venues_grouped.insert(0, 'Cluster Labels', kmeans.labels_)

neighborhoods_clustered = neighborhoods_df.rename(columns={"Neighborhood": "Name"})

# merge neighborhoods_clustered with venues_grouped to add latitude/longitude for each neighborhood
neighborhoods_clustered = neighborhoods_clustered.join(venues_grouped.set_index('Name'), on='Name')

neighborhoods_clustered = neighborhoods_clustered.dropna(axis=0, how='any')
neighborhoods_clustered.reset_index(drop=True, inplace=True)

print(neighborhoods_clustered.shape)
neighborhoods_clustered

(42, 289)


Unnamed: 0,Borough,Name,Latitude,Longitude,Cluster Labels,Accessories Store,Adult Boutique,Afghan Restaurant,American Restaurant,Animal Shelter,...,Used Bookstore,Vape Store,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Whisky Bar,Wine Bar,Wine Shop,Women's Store,Yoga Studio
0,Bronx,Fordham,40.860997,-73.896427,1.0,0.029412,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.029412,0.0,0.0,0.0,0.0,0.0,0.0
1,Brooklyn,Greenpoint,40.730201,-73.954241,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.03125,0.03125,0.0,0.0,0.0,0.0,0.0,0.0,0.03125
2,Brooklyn,Brooklyn Heights,40.695864,-73.993782,1.0,0.0,0.0,0.0,0.022727,0.0,...,0.0,0.0,0.0,0.0,0.022727,0.0,0.0,0.045455,0.022727,0.068182
3,Brooklyn,Carroll Gardens,40.68054,-73.994654,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.020408,0.0,0.0,0.0,0.0,0.020408,0.0,0.0
4,Brooklyn,Clinton Hill,40.693229,-73.967843,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03125,0.0,0.03125
5,Brooklyn,Downtown,40.690844,-73.983463,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Brooklyn,East Williamsburg,40.708492,-73.938858,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.03125,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,Brooklyn,North Side,40.714823,-73.958809,1.0,0.0,0.0,0.0,0.04,0.0,...,0.0,0.0,0.06,0.0,0.0,0.0,0.02,0.0,0.0,0.04
8,Brooklyn,South Side,40.710861,-73.958001,1.0,0.0,0.0,0.0,0.022222,0.0,...,0.0,0.0,0.022222,0.0,0.0,0.0,0.022222,0.022222,0.0,0.0
9,Manhattan,Chinatown,40.715618,-73.994279,1.0,0.0,0.0,0.0,0.02,0.0,...,0.0,0.0,0.0,0.0,0.04,0.0,0.0,0.0,0.0,0.0


In [230]:
# I am going to do the same with the Ikinari Steak data
ikisteak_clustered = venues_grouped[venues_grouped['Name'].str.contains('Ikinari')]
ikisteak_clustered

Unnamed: 0,Cluster Labels,Name,Accessories Store,Adult Boutique,Afghan Restaurant,American Restaurant,Animal Shelter,Antique Shop,Arepa Restaurant,Art Gallery,...,Used Bookstore,Vape Store,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Whisky Bar,Wine Bar,Wine Shop,Women's Store,Yoga Studio
21,1,Ikinari Steak 5th Ave,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.02,...,0.0,0.0,0.02,0.02,0.0,0.0,0.0,0.0,0.02,0.0
22,0,Ikinari Steak Bleecker St,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,...,0.0,0.0,0.02,0.0,0.02,0.0,0.02,0.0,0.0,0.02
23,0,Ikinari Steak Broadway,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02
24,2,Ikinari Steak Chelsea 7th Ave,0.0,0.0,0.0,0.08,0.0,0.04,0.02,0.0,...,0.0,0.0,0.0,0.0,0.0,0.02,0.02,0.0,0.0,0.04
25,1,Ikinari Steak Chelsea 8th Ave,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.02,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.02,0.02
26,1,Ikinari Steak East Village,0.0,0.0,0.0,0.027778,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.027778
27,0,Ikinari Steak Lexington Ave,0.0,0.0,0.0,0.025641,0.0,0.0,0.0,0.0,...,0.0,0.0,0.025641,0.0,0.0,0.0,0.0,0.0,0.025641,0.0
28,2,Ikinari Steak Madison Ave,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0
29,3,Ikinari Steak Park Ave,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0
30,1,Ikinari Steak Times Square,0.0,0.0,0.0,0.06,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.04,0.0,0.0


In [231]:
# Load Ikinari Steak data if necessary
ikisteak_data = pd.read_csv('comp_ikinari_steak.csv')

# Add string 'Ikinari Steak ' to the Official Name column
# and rename column to Name, drop Address, Date_Opened, and Current_Status
ikisteak_data['Official_Name'] = 'Ikinari Steak ' + ikisteak_data['Official_Name'].astype(str)
ikisteak_data.reset_index()
ikisteak_data.rename(columns={"Official_Name": "Name"}, inplace=True)
ikisteak_data.drop(['Address', 'Date_Opened', 'Current_Status'], axis=1, inplace=True)

# Finally, I will merge the two Ikinari Steak dataframes
ikisteak_clustered = ikisteak_clustered.join(ikisteak_data.set_index('Name'), on='Name')
ikisteak_clustered.reset_index(drop=True, inplace=True)
ikisteak_clustered.insert(0, 'Borough', 'Manhattan')

print(ikisteak_clustered.shape)
ikisteak_clustered

(11, 289)


Unnamed: 0,Borough,Cluster Labels,Name,Accessories Store,Adult Boutique,Afghan Restaurant,American Restaurant,Animal Shelter,Antique Shop,Arepa Restaurant,...,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Whisky Bar,Wine Bar,Wine Shop,Women's Store,Yoga Studio,Latitude,Longitude
0,Manhattan,1,Ikinari Steak 5th Ave,0.0,0.0,0.0,0.02,0.0,0.0,0.0,...,0.02,0.02,0.0,0.0,0.0,0.0,0.02,0.0,40.756872,-73.980425
1,Manhattan,0,Ikinari Steak Bleecker St,0.0,0.0,0.0,0.02,0.0,0.0,0.0,...,0.02,0.0,0.02,0.0,0.02,0.0,0.0,0.02,40.729355,-74.001377
2,Manhattan,0,Ikinari Steak Broadway,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,40.764537,-73.983327
3,Manhattan,2,Ikinari Steak Chelsea 7th Ave,0.0,0.0,0.0,0.08,0.0,0.04,0.02,...,0.0,0.0,0.0,0.02,0.02,0.0,0.0,0.04,40.741916,-73.997605
4,Manhattan,1,Ikinari Steak Chelsea 8th Ave,0.0,0.0,0.0,0.02,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.02,0.02,0.02,40.74018,-74.001972
5,Manhattan,1,Ikinari Steak East Village,0.0,0.0,0.0,0.027778,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.027778,40.730806,-73.989727
6,Manhattan,0,Ikinari Steak Lexington Ave,0.0,0.0,0.0,0.025641,0.0,0.0,0.0,...,0.025641,0.0,0.0,0.0,0.0,0.0,0.025641,0.0,40.770717,-73.961706
7,Manhattan,2,Ikinari Steak Madison Ave,0.0,0.0,0.0,0.02,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,40.751781,-73.979337
8,Manhattan,3,Ikinari Steak Park Ave,0.0,0.0,0.0,0.02,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,40.744894,-73.982617
9,Manhattan,1,Ikinari Steak Times Square,0.0,0.0,0.0,0.06,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.02,0.04,0.0,0.0,40.760638,-73.990395


In [233]:
# I bring it all together in this cell
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
all_clustered = pd.concat([neighborhoods_clustered, ikisteak_clustered], sort=False)
all_clustered

Unnamed: 0,Borough,Name,Latitude,Longitude,Cluster Labels,Accessories Store,Adult Boutique,Afghan Restaurant,American Restaurant,Animal Shelter,...,Used Bookstore,Vape Store,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Whisky Bar,Wine Bar,Wine Shop,Women's Store,Yoga Studio
0,Bronx,Fordham,40.860997,-73.896427,1.0,0.029412,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.029412,0.0,0.0,0.0,0.0,0.0,0.0
1,Brooklyn,Greenpoint,40.730201,-73.954241,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.03125,0.03125,0.0,0.0,0.0,0.0,0.0,0.0,0.03125
2,Brooklyn,Brooklyn Heights,40.695864,-73.993782,1.0,0.0,0.0,0.0,0.022727,0.0,...,0.0,0.0,0.0,0.0,0.022727,0.0,0.0,0.045455,0.022727,0.068182
3,Brooklyn,Carroll Gardens,40.68054,-73.994654,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.020408,0.0,0.0,0.0,0.0,0.020408,0.0,0.0
4,Brooklyn,Clinton Hill,40.693229,-73.967843,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03125,0.0,0.03125
5,Brooklyn,Downtown,40.690844,-73.983463,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Brooklyn,East Williamsburg,40.708492,-73.938858,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.03125,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,Brooklyn,North Side,40.714823,-73.958809,1.0,0.0,0.0,0.0,0.04,0.0,...,0.0,0.0,0.06,0.0,0.0,0.0,0.02,0.0,0.0,0.04
8,Brooklyn,South Side,40.710861,-73.958001,1.0,0.0,0.0,0.0,0.022222,0.0,...,0.0,0.0,0.022222,0.0,0.0,0.0,0.022222,0.022222,0.0,0.0
9,Manhattan,Chinatown,40.715618,-73.994279,1.0,0.0,0.0,0.0,0.02,0.0,...,0.0,0.0,0.0,0.0,0.04,0.0,0.0,0.0,0.0,0.0


In [234]:
# As we have done so far, it is helpful to create csv files for important dataframes
venues_grouped.to_csv('venues_clustered.csv', index=False)
neighborhoods_clustered.to_csv('neighborhoods_clustered.csv', index=False)
ikisteak_clustered.to_csv('ikinari_steak_clustered.csv', index=False)
all_clustered.to_csv('all_clustered.csv', index=False)

### Approach 2: Making Predictions using a Decision Tree
I am going to train the decision tree on the Ikinari Steak data,
then predict whether an Ikinari Steak location in each of the candidate neighborhoods
would still be Open, Closed, or Rebranded. 

I will be using the Scikit-Learn DecisionTreeClassifier class.
- I use the entropy criterion, so each split is chosen to maximize information gain.
- I also set max_depth to 5; with a deeper tree the model overfits and finds similarities where there are none.
    - Case in point: most venue categories have a frequency of 0 in most rows, which gives the algorithm very little real signal. When max_depth is set to 6 or above, it claims that about half of the neighborhoods are great places for a new Ikinari Steak location, which is doubtful.
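
To make the entropy criterion concrete, here is a small worked sketch using the eleven status labels assigned to the Ikinari Steak locations in this notebook. The helper names `entropy` and `information_gain` and the example split are assumptions for illustration; scikit-learn performs the equivalent calculation internally when `criterion='entropy'`.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(labels, goes_left):
    """Entropy reduction from splitting `labels` by the boolean list `goes_left`."""
    left = [lab for lab, flag in zip(labels, goes_left) if flag]
    right = [lab for lab, flag in zip(labels, goes_left) if not flag]
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted

statuses = ['Open', 'Closed', 'Rebrand', 'Rebrand', 'Closed', 'Open',
            'Closed', 'Closed', 'Closed', 'Open', 'Closed']

parent_entropy = entropy(statuses)

# Hypothetical split that happens to isolate every Open store on one side
gain = information_gain(statuses, [s == 'Open' for s in statuses])

print(round(parent_entropy, 3), round(gain, 3))
```

A split with a larger gain leaves purer groups behind, which is what the tree looks for at every node.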

In [18]:
# Import the libraries I need for a Decision Tree
from sklearn.tree import DecisionTreeClassifier
from sklearn import preprocessing
from six import StringIO
import matplotlib.image as mpimg
from sklearn import tree
%matplotlib inline

In [256]:
# .copy() makes an independent dataframe, so adding a column raises no SettingWithCopyWarning
ikisteak_train = all_clustered[all_clustered['Name'].str.contains('Ikinari')].copy()
ikisteak_statuses = ['Open', 'Closed', 'Rebrand', 'Rebrand', 'Closed', 'Open', 'Closed', 'Closed', 'Closed', 'Open', 'Closed']
ikisteak_train['Status'] = ikisteak_statuses
ikisteak_train.columns[5:-1]


Index(['Accessories Store', 'Adult Boutique', 'Afghan Restaurant',
       'American Restaurant', 'Animal Shelter', 'Antique Shop',
       'Arepa Restaurant', 'Art Gallery', 'Art Museum', 'Arts & Crafts Store',
       ...
       'Used Bookstore', 'Vape Store', 'Vegetarian / Vegan Restaurant',
       'Video Game Store', 'Vietnamese Restaurant', 'Whisky Bar', 'Wine Bar',
       'Wine Shop', 'Women's Store', 'Yoga Studio'],
      dtype='object', length=284)

In [257]:
# train_X is the variable that will store the values
# to train the Decision Tree algorithm
train_X = ikisteak_train[ikisteak_train.columns[5:-1]].values
train_X[0:5]

array([[0.  , 0.  , 0.  , ..., 0.  , 0.02, 0.  ],
       [0.  , 0.  , 0.  , ..., 0.  , 0.  , 0.02],
       [0.  , 0.  , 0.  , ..., 0.  , 0.  , 0.02],
       [0.  , 0.  , 0.  , ..., 0.  , 0.  , 0.04],
       [0.  , 0.  , 0.  , ..., 0.02, 0.02, 0.02]])

In [259]:
# Likewise, train_Y will have the statuses
# that we want to predict using the Decision Tree
train_Y = ikisteak_train['Status']
train_Y

0        Open
1      Closed
2     Rebrand
3     Rebrand
4      Closed
5        Open
6      Closed
7      Closed
8      Closed
9        Open
10     Closed
Name: Status, dtype: object

In [295]:
# I will be using the Decision Tree from Scikit-Learn
# Please see above for more details
ikisteakTree = DecisionTreeClassifier(criterion='entropy', max_depth=5)
ikisteakTree.fit(train_X, train_Y)

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=5,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

In [296]:
# Then I prepare the test set, using the neighborhood values
ikisteak_test = all_clustered[~all_clustered['Name'].str.contains('Ikinari')]
print(ikisteak_test.shape)
ikisteak_test.head()

(42, 289)


Unnamed: 0,Borough,Name,Latitude,Longitude,Cluster Labels,Accessories Store,Adult Boutique,Afghan Restaurant,American Restaurant,Animal Shelter,...,Used Bookstore,Vape Store,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Whisky Bar,Wine Bar,Wine Shop,Women's Store,Yoga Studio
0,Bronx,Fordham,40.860997,-73.896427,1.0,0.029412,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.029412,0.0,0.0,0.0,0.0,0.0,0.0
1,Brooklyn,Greenpoint,40.730201,-73.954241,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.03125,0.03125,0.0,0.0,0.0,0.0,0.0,0.0,0.03125
2,Brooklyn,Brooklyn Heights,40.695864,-73.993782,1.0,0.0,0.0,0.0,0.022727,0.0,...,0.0,0.0,0.0,0.0,0.022727,0.0,0.0,0.045455,0.022727,0.068182
3,Brooklyn,Carroll Gardens,40.68054,-73.994654,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.020408,0.0,0.0,0.0,0.0,0.020408,0.0,0.0
4,Brooklyn,Clinton Hill,40.693229,-73.967843,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03125,0.0,0.03125


In [297]:
test_X = ikisteak_test[ikisteak_test.columns[5:]].values
test_X[0:5]

array([[0.02941176, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.03125   ],
       [0.        , 0.        , 0.        , ..., 0.04545455, 0.02272727,
        0.06818182],
       [0.        , 0.        , 0.        , ..., 0.02040816, 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.03125   , 0.        ,
        0.03125   ]])

In [298]:
# Here I set up the prediction of values using the test set
predTree = ikisteakTree.predict(test_X)

In [299]:
# What the prediction produced is...
predTree

array(['Closed', 'Closed', 'Closed', 'Closed', 'Closed', 'Closed',
       'Closed', 'Open', 'Closed', 'Closed', 'Closed', 'Closed', 'Closed',
       'Closed', 'Closed', 'Closed', 'Open', 'Closed', 'Closed', 'Closed',
       'Closed', 'Rebrand', 'Closed', 'Closed', 'Closed', 'Closed',
       'Closed', 'Closed', 'Closed', 'Closed', 'Closed', 'Closed',
       'Closed', 'Closed', 'Closed', 'Closed', 'Closed', 'Closed',
       'Closed', 'Closed', 'Closed', 'Closed'], dtype=object)

In [304]:
# I will now place these predictions into the original dataframe
# and create a new dataframe
# .copy() makes an independent dataframe, so adding a column raises no SettingWithCopyWarning
neighborhoods_predicted = ikisteak_test.copy()
neighborhoods_predicted['Status'] = predTree
neighborhoods_predicted.head()


Unnamed: 0,Borough,Name,Latitude,Longitude,Cluster Labels,Accessories Store,Adult Boutique,Afghan Restaurant,American Restaurant,Animal Shelter,...,Vape Store,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Whisky Bar,Wine Bar,Wine Shop,Women's Store,Yoga Studio,Status
0,Bronx,Fordham,40.860997,-73.896427,1.0,0.029412,0.0,0.0,0.0,0.0,...,0.0,0.0,0.029412,0.0,0.0,0.0,0.0,0.0,0.0,Closed
1,Brooklyn,Greenpoint,40.730201,-73.954241,1.0,0.0,0.0,0.0,0.0,0.0,...,0.03125,0.03125,0.0,0.0,0.0,0.0,0.0,0.0,0.03125,Closed
2,Brooklyn,Brooklyn Heights,40.695864,-73.993782,1.0,0.0,0.0,0.0,0.022727,0.0,...,0.0,0.0,0.0,0.022727,0.0,0.0,0.045455,0.022727,0.068182,Closed
3,Brooklyn,Carroll Gardens,40.68054,-73.994654,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.020408,0.0,0.0,0.0,0.0,0.020408,0.0,0.0,Closed
4,Brooklyn,Clinton Hill,40.693229,-73.967843,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.03125,0.0,0.03125,Closed


In [305]:
# As always, here's the csv file for posterity.
neighborhoods_predicted.to_csv('neighborhoods_predicted.csv', index=False)

This concludes the Methodology section, and I will discuss my findings in the following sections.

<div class="alert alert-block alert-info">
Section 4

<h2>The Results</h2>
<h3>A Walkthrough of the Analysis and Science</h3>
</div>

The results for the analysis can be broken down into answering three questions:
1. What are the findings and insights from the Clustering Analysis?
2. What are the findings and insights from the Decision Tree Analysis?
3. Do these two methods combined provide a more complete picture and how so?

But first, I need to clean up the results from the analyses...

In [15]:
# I will begin by having new dataframes that remove the dummy variables.
# And once again, just in case we need it:
all_clustered = pd.read_csv('all_clustered.csv')
neighborhoods_predicted = pd.read_csv('neighborhoods_predicted.csv')

In [22]:
# Removing all the dummy variable category columns for the neighborhoods
cleaned_columns = ['Borough', 'Name', 'Latitude', 'Longitude', 'Cluster Labels', 'Status']
neighborhood_results = neighborhoods_predicted[cleaned_columns].reset_index(drop=True)
print(neighborhood_results.shape)
neighborhood_results.head()

(42, 6)


Unnamed: 0,Borough,Name,Latitude,Longitude,Cluster Labels,Status
0,Bronx,Fordham,40.860997,-73.896427,1.0,Closed
1,Brooklyn,Greenpoint,40.730201,-73.954241,1.0,Closed
2,Brooklyn,Brooklyn Heights,40.695864,-73.993782,1.0,Closed
3,Brooklyn,Carroll Gardens,40.68054,-73.994654,0.0,Closed
4,Brooklyn,Clinton Hill,40.693229,-73.967843,1.0,Closed


In [23]:
# Doing the same for the Ikinari Steak locations
# .copy() makes an independent dataframe, so adding a column raises no SettingWithCopyWarning
ikisteak_results = all_clustered[all_clustered['Name'].str.contains('Ikinari')].copy()
ikisteak_statuses = ['Open', 'Closed', 'Rebrand', 'Rebrand', 'Closed', 'Open', 'Closed', 'Closed', 'Closed', 'Open', 'Closed']
ikisteak_results['Status'] = ikisteak_statuses
ikinari_steak_results = ikisteak_results[cleaned_columns]
print(ikinari_steak_results.shape)
ikinari_steak_results.reset_index(drop=True)

(11, 6)


Unnamed: 0,Borough,Name,Latitude,Longitude,Cluster Labels,Status
0,Manhattan,Ikinari Steak 5th Ave,40.756872,-73.980425,1.0,Open
1,Manhattan,Ikinari Steak Bleecker St,40.729355,-74.001377,0.0,Closed
2,Manhattan,Ikinari Steak Broadway,40.764537,-73.983327,0.0,Rebrand
3,Manhattan,Ikinari Steak Chelsea 7th Ave,40.741916,-73.997605,2.0,Rebrand
4,Manhattan,Ikinari Steak Chelsea 8th Ave,40.74018,-74.001972,1.0,Closed
5,Manhattan,Ikinari Steak East Village,40.730806,-73.989727,1.0,Open
6,Manhattan,Ikinari Steak Lexington Ave,40.770717,-73.961706,0.0,Closed
7,Manhattan,Ikinari Steak Madison Ave,40.751781,-73.979337,2.0,Closed
8,Manhattan,Ikinari Steak Park Ave,40.744894,-73.982617,3.0,Closed
9,Manhattan,Ikinari Steak Times Square,40.760638,-73.990395,1.0,Open


In [24]:
# Lastly, we can append the results together for one big dataframe
# ignore_index=True renumbers the combined rows from 0
all_results = pd.concat([neighborhood_results, ikinari_steak_results], sort=False, ignore_index=True)
all_results

Unnamed: 0,Borough,Name,Latitude,Longitude,Cluster Labels,Status
0,Bronx,Fordham,40.860997,-73.896427,1.0,Closed
1,Brooklyn,Greenpoint,40.730201,-73.954241,1.0,Closed
2,Brooklyn,Brooklyn Heights,40.695864,-73.993782,1.0,Closed
3,Brooklyn,Carroll Gardens,40.68054,-73.994654,0.0,Closed
4,Brooklyn,Clinton Hill,40.693229,-73.967843,1.0,Closed
5,Brooklyn,Downtown,40.690844,-73.983463,1.0,Closed
6,Brooklyn,East Williamsburg,40.708492,-73.938858,1.0,Closed
7,Brooklyn,North Side,40.714823,-73.958809,1.0,Open
8,Brooklyn,South Side,40.710861,-73.958001,1.0,Closed
9,Manhattan,Chinatown,40.715618,-73.994279,1.0,Closed


In [25]:
# Well, just one more thing...
neighborhood_results.to_csv('neighborhood_results.csv', index=False)
ikinari_steak_results.to_csv('ikinari_steak_results.csv', index=False)
all_results.to_csv('all_results.csv', index=False)

### Part I: What does Clustering reveal?
First, I plot the clustering results on a map.

In [26]:
import matplotlib.cm as cm
import matplotlib.colors as colors

In [44]:
# We need to cast the cluster as type int
all_results = all_results.astype({'Cluster Labels':int})

In [51]:
geolocator = Nominatim(user_agent="nyc_explorer")

location = geolocator.geocode('Midtown Manhattan, NY')
latitude = location.latitude
longitude = location.longitude

map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(4)
ys = [i + x + (i*x)**2 for i in range(4)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map; Ikinari Steak locations get a larger radius
for lat, lon, poi, cluster in zip(all_results['Latitude'], 
                                  all_results['Longitude'], 
                                  all_results['Name'], 
                                  all_results['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=8 if 'Ikinari' in poi else 5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

#### Looking more closely at the Clusters
To do this, I am going to return the most common venues for each cluster.

In [52]:
# This function from the labs returns the most common venues in the results
def return_most_common_venues(row, num_top_venues):
    # skip the 5 metadata columns; the -1 also drops the final column
    # (a Status column if present, otherwise the last venue category)
    row_categories = row.iloc[5:-1]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
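
As an alternative sketch, the same ranking can be done by naming the category columns explicitly instead of slicing by position, which avoids depending on the column order. The function name `top_venues_by_name`, the `category_cols` list, and the toy row are assumptions for illustration.

```python
import pandas as pd

def top_venues_by_name(row, category_cols, num_top_venues):
    """Return the names of the categories with the highest frequency in one row."""
    freqs = row[category_cols].astype(float)
    return freqs.nlargest(num_top_venues).index.tolist()

# Made-up row with the same metadata fields as all_clustered
toy_row = pd.Series({
    'Borough': 'Manhattan', 'Name': 'Toy Place',
    'Latitude': 40.0, 'Longitude': -74.0, 'Cluster Labels': 1,
    'Bar': 0.5, 'Cafe': 0.3, 'Park': 0.2,
})
category_cols = ['Bar', 'Cafe', 'Park']

top_two = top_venues_by_name(toy_row, category_cols, 2)
print(top_two)
```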

In [56]:
num_top_venues = 10
indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Name']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # no 'st'/'nd'/'rd' suffix left, so fall back to 'th'
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
cresults_sorted = pd.DataFrame(columns=columns)
cresults_sorted['Name'] = all_results['Name']

for ind in np.arange(all_clustered.shape[0]):
    cresults_sorted.iloc[ind, 1:] = (
        return_most_common_venues(all_clustered.iloc[ind, :], num_top_venues)
    )

print(cresults_sorted.shape)

(53, 11)


In [57]:
# Then I insert the cluster labels into the most common venues dataframe
cresults_sorted.insert(0, 'Cluster Labels', all_clustered['Cluster Labels'].values)
cresults_sorted.head()

Unnamed: 0,Cluster Labels,Name,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,1.0,Fordham,Bank,Shoe Store,Gym / Fitness Center,Fried Chicken Joint,Clothing Store,Fast Food Restaurant,Mobile Phone Shop,Gym,Supplement Shop,Sporting Goods Shop
1,1.0,Greenpoint,Grocery Store,Sushi Restaurant,Spa,Café,Coffee Shop,Mexican Restaurant,Italian Restaurant,Deli / Bodega,Record Shop,Pizza Place
2,1.0,Brooklyn Heights,Deli / Bodega,Ice Cream Shop,Wine Shop,Cosmetics Shop,Pharmacy,Thai Restaurant,Asian Restaurant,Eastern European Restaurant,Burger Joint,Chocolate Shop
3,0.0,Carroll Gardens,Italian Restaurant,Deli / Bodega,Bank,Café,Coffee Shop,Cocktail Bar,Pizza Place,Playground,Shoe Store,Greek Restaurant
4,1.0,Clinton Hill,Mexican Restaurant,Juice Bar,Restaurant,Thai Restaurant,Italian Restaurant,Convenience Store,Bagel Shop,Lounge,Massage Studio,Sushi Restaurant


##### The First Cluster: The Older Crowd
We can almost map out a day in this cluster: a trip to the spa to relax, followed by a pasta dish at a long-standing Italian Restaurant, ending at the theater or the opera house in an evening gown.

This cluster is also home to a few failed Ikinari Steak locations.

In [62]:
# every column except 'Cluster Labels' for the rows in cluster 0
cluster_0 = cresults_sorted.loc[cresults_sorted['Cluster Labels'] == 0, cresults_sorted.columns[1:]]
print(cluster_0.shape)
cluster_0

(9, 11)


Unnamed: 0,Name,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
3,Carroll Gardens,Italian Restaurant,Deli / Bodega,Bank,Café,Coffee Shop,Cocktail Bar,Pizza Place,Playground,Shoe Store,Greek Restaurant
14,Lincoln Square,Theater,Indie Movie Theater,Performing Arts Venue,Concert Hall,Opera House,American Restaurant,Café,Park,Gym / Fitness Center,Plaza
15,Clinton,Theater,Spa,Sandwich Place,Hotel,Restaurant,Lounge,Gym / Fitness Center,French Restaurant,Steakhouse,Sports Bar
19,Greenwich Village,Italian Restaurant,Cosmetics Shop,Café,Clothing Store,Ice Cream Shop,French Restaurant,Sushi Restaurant,Pilates Studio,Optical Shop,Gourmet Shop
24,West Village,Italian Restaurant,Cosmetics Shop,Gastropub,Coffee Shop,Cocktail Bar,Wine Bar,Chinese Restaurant,Gourmet Shop,Bakery,American Restaurant
34,Civic Center,Sandwich Place,Coffee Shop,Italian Restaurant,Bakery,Gym / Fitness Center,Gym,Martial Arts Dojo,Falafel Restaurant,Dance Studio,Molecular Gastronomy Restaurant
43,Ikinari Steak Bleecker St,Italian Restaurant,Café,Ice Cream Shop,Gourmet Shop,Pizza Place,Comedy Club,Jazz Club,Dessert Shop,Sushi Restaurant,Chinese Restaurant
44,Ikinari Steak Broadway,Theater,Italian Restaurant,Cuban Restaurant,Japanese Restaurant,Food Truck,Coffee Shop,Russian Restaurant,Sandwich Place,Hotel,Performing Arts Venue
48,Ikinari Steak Lexington Ave,Italian Restaurant,Cocktail Bar,Gift Shop,French Restaurant,Coffee Shop,Toy / Game Store,Bakery,Boutique,BBQ Joint,Paper / Office Supplies Store


##### The Second Cluster: The Younger Crowd
Here we see everything from the bodega to the bubble tea place. There's plenty of evidence that this is a younger crowd: from night clubs to the ice cream shops. You can also see a bit more ethnic diversity in the food selection. 

The second cluster holds all the successful Ikinari Steak locations.
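
A cross-tabulation is one way to check this claim. The sketch below retypes the ten Ikinari Steak rows displayed in the results table earlier in this section instead of loading the saved csv, so treat it as an illustration rather than part of the pipeline.

```python
import pandas as pd

# Cluster labels and statuses as displayed in the Ikinari Steak results table
ikinari = pd.DataFrame({
    'Name': ['5th Ave', 'Bleecker St', 'Broadway', 'Chelsea 7th Ave',
             'Chelsea 8th Ave', 'East Village', 'Lexington Ave',
             'Madison Ave', 'Park Ave', 'Times Square'],
    'Cluster Labels': [1, 0, 0, 2, 1, 1, 0, 2, 3, 1],
    'Status': ['Open', 'Closed', 'Rebrand', 'Rebrand', 'Closed',
               'Open', 'Closed', 'Closed', 'Closed', 'Open'],
})

# Rows are clusters, columns are statuses, cells are store counts
status_by_cluster = pd.crosstab(ikinari['Cluster Labels'], ikinari['Status'])
print(status_by_cluster)
```

Every Open store falls in the row for cluster label 1, although that cluster also contains closed locations.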

In [63]:
# every column except 'Cluster Labels' for the rows in cluster 1
cluster_1 = cresults_sorted.loc[cresults_sorted['Cluster Labels'] == 1, cresults_sorted.columns[1:]]
print(cluster_1.shape)
cluster_1

(29, 11)


Unnamed: 0,Name,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Fordham,Bank,Shoe Store,Gym / Fitness Center,Fried Chicken Joint,Clothing Store,Fast Food Restaurant,Mobile Phone Shop,Gym,Supplement Shop,Sporting Goods Shop
1,Greenpoint,Grocery Store,Sushi Restaurant,Spa,Café,Coffee Shop,Mexican Restaurant,Italian Restaurant,Deli / Bodega,Record Shop,Pizza Place
2,Brooklyn Heights,Deli / Bodega,Ice Cream Shop,Wine Shop,Cosmetics Shop,Pharmacy,Thai Restaurant,Asian Restaurant,Eastern European Restaurant,Burger Joint,Chocolate Shop
4,Clinton Hill,Mexican Restaurant,Juice Bar,Restaurant,Thai Restaurant,Italian Restaurant,Convenience Store,Bagel Shop,Lounge,Massage Studio,Sushi Restaurant
5,Downtown,Coffee Shop,Spanish Restaurant,Burger Joint,Chinese Restaurant,Pie Shop,Pizza Place,Polish Restaurant,Creperie,Restaurant,Italian Restaurant
6,East Williamsburg,Bar,Coffee Shop,Concert Hall,Music Venue,Cocktail Bar,Café,Vegetarian / Vegan Restaurant,Bakery,Food Truck,Pizza Place
7,North Side,Vegetarian / Vegan Restaurant,Salon / Barbershop,South American Restaurant,American Restaurant,Jewelry Store,Seafood Restaurant,Bookstore,Juice Bar,Sushi Restaurant,Grocery Store
8,South Side,Bar,Tapas Restaurant,Mexican Restaurant,Latin American Restaurant,Coffee Shop,Burger Joint,Pizza Place,Breakfast Spot,Pub,Chinese Restaurant
9,Chinatown,Bubble Tea Shop,Chinese Restaurant,Hotel,Noodle House,Korean Restaurant,Spa,Sandwich Place,Japanese Restaurant,Hotpot Restaurant,Vietnamese Restaurant
10,Washington Heights,Café,Chinese Restaurant,Deli / Bodega,Coffee Shop,Park,Pizza Place,Grocery Store,Mobile Phone Shop,Bakery,Paper / Office Supplies Store


##### The Third Cluster: Why don't you stay awhile?
Wine shops, rivers, boats, and parks. This cluster seems to be home to the places that reward a long, romantic sit-down meal rather than a quick bite of steak.

In [64]:
# Select the third cluster once, dropping the leading 'Cluster Labels' column
cluster_2 = cresults_sorted.loc[cresults_sorted['Cluster Labels'] == 2, cresults_sorted.columns[1:]]
print(cluster_2.shape)
cluster_2

(11, 11)


Unnamed: 0,Name,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
21,Tribeca,Park,Spa,Wine Shop,American Restaurant,Café,Salad Place,Cycle Studio,Boutique,Coffee Shop,River
25,Gramercy,Thrift / Vintage Store,Spa,Coffee Shop,Wine Shop,Latin American Restaurant,Liquor Store,Beer Bar,Bike Rental / Bike Share,Supplement Shop,Convenience Store
26,Battery Park City,Park,Department Store,Food Court,Sandwich Place,Cupcake Shop,Women's Store,Salad Place,Coffee Shop,Sushi Restaurant,Boat or Ferry
27,Financial District,Coffee Shop,Steakhouse,Event Space,Gym / Fitness Center,Pizza Place,Park,Gym,Spa,Wine Shop,Jewelry Store
29,Long Island City,Coffee Shop,Hotel,Café,Mexican Restaurant,Deli / Bodega,Bar,Donut Shop,Italian Restaurant,Dessert Shop,Sandwich Place
32,Carnegie Hill,Gym / Fitness Center,Café,Italian Restaurant,Coffee Shop,Gym,Spa,Karaoke Bar,Grocery Store,Wine Shop,Shoe Store
38,Turtle Bay,Coffee Shop,Hotel,Italian Restaurant,Sushi Restaurant,Karaoke Bar,Café,Pharmacy,Garden,Farmers Market,Bookstore
40,Fulton Ferry,Boat or Ferry,Park,Ice Cream Shop,Pizza Place,American Restaurant,Hotel Bar,Café,Seafood Restaurant,Scenic Lookout,Leather Goods Store
41,Dumbo,Coffee Shop,Bookstore,Bakery,Gym,Scenic Lookout,American Restaurant,Playground,Art Gallery,Men's Store,Wine Shop
45,Ikinari Steak Chelsea 7th Ave,American Restaurant,Coffee Shop,Furniture / Home Store,Gym / Fitness Center,Gym,Antique Shop,Italian Restaurant,Paella Restaurant,Camera Store,Salon / Barbershop
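
One way to back up the "parks, boats, and wine shops" reading is to count how often each venue category appears across a cluster's top-ten lists. The following is a minimal sketch, not part of the original analysis: it uses only three rows copied from the table above, and the names `rows`, `venue_cols`, and `venue_counts` are placeholders introduced for illustration.

```python
import pandas as pd

# Three rows copied from the third-cluster table above (illustrative subset only).
rows = {
    "Tribeca": ["Park", "Spa", "Wine Shop", "American Restaurant", "Café",
                "Salad Place", "Cycle Studio", "Boutique", "Coffee Shop", "River"],
    "Battery Park City": ["Park", "Department Store", "Food Court", "Sandwich Place",
                          "Cupcake Shop", "Women's Store", "Salad Place", "Coffee Shop",
                          "Sushi Restaurant", "Boat or Ferry"],
    "Fulton Ferry": ["Boat or Ferry", "Park", "Ice Cream Shop", "Pizza Place",
                     "American Restaurant", "Hotel Bar", "Café", "Seafood Restaurant",
                     "Scenic Lookout", "Leather Goods Store"],
}

# Hypothetical column names standing in for '1st Most Common Venue', etc.
venue_cols = [f"venue_{i}" for i in range(1, 11)]
cluster_df = pd.DataFrame.from_dict(rows, orient="index", columns=venue_cols)

# Flatten every top-ten slot into one Series and count each category.
venue_counts = cluster_df.stack().value_counts()
print(venue_counts.head(6))
```

Running the same count over the full cluster (for example on `cresults_sorted` filtered by label) would show whether the leisure-oriented categories really dominate, rather than relying on a visual scan of the table.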


##### The Fourth Cluster: The Asian Competition
Korean, Japanese, Shanghai, Chinese. With so many Asian food options, it's hard to stand out as a "Japanese-twist" on a dish.

In [65]:
# Select the fourth cluster once, dropping the leading 'Cluster Labels' column
cluster_3 = cresults_sorted.loc[cresults_sorted['Cluster Labels'] == 3, cresults_sorted.columns[1:]]
print(cluster_3.shape)
cluster_3

(4, 11)


Unnamed: 0,Name,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
17,Murray Hill,Korean Restaurant,Coffee Shop,Hotel,Bar,Japanese Restaurant,Bank,Deli / Bodega,Burger Joint,Shanghai Restaurant,Chinese Restaurant
30,Murray Hill,Korean Restaurant,Coffee Shop,Hotel,Bar,Japanese Restaurant,Bank,Deli / Bodega,Burger Joint,Shanghai Restaurant,Chinese Restaurant
35,Midtown South,Korean Restaurant,Cosmetics Shop,Coffee Shop,Italian Restaurant,American Restaurant,Lingerie Store,Fried Chicken Joint,Hotel Bar,Plaza,Building
50,Ikinari Steak Park Ave,Hotel,Korean Restaurant,Gym / Fitness Center,Spa,Café,Italian Restaurant,Japanese Restaurant,Gym,Pizza Place,Sandwich Place


### Part II: What does the Decision Tree reveal?
Where should Ikinari Steak open a new location? Or at least, which neighborhoods show potential?

In [70]:
# Once again, just in case:
neighborhood_results = pd.read_csv('neighborhood_results.csv')

In [71]:
good_prediction = neighborhood_results.loc[neighborhood_results['Status'] == 'Open']
print(good_prediction.shape)
good_prediction

(2, 6)


Unnamed: 0,Borough,Name,Latitude,Longitude,Cluster Labels,Status
7,Brooklyn,North Side,40.714823,-73.958809,1.0,Open
16,Manhattan,Midtown,40.754691,-73.981669,1.0,Open


Taking the prediction tree at face value, it would seem Ikinari Steak should open locations in:
1. __North Side, Brooklyn__
2. Another one in __Midtown, Manhattan__

In [72]:
possible_prediction = neighborhood_results.loc[neighborhood_results['Status'] == 'Rebrand']
print(possible_prediction.shape)
possible_prediction

(1, 6)


Unnamed: 0,Borough,Name,Latitude,Longitude,Cluster Labels,Status
21,Manhattan,Tribeca,40.721522,-74.010683,2.0,Rebrand


The tree makes one more prediction: a rebranded location may have a chance of success in __Tribeca, Manhattan__.
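
The two filters above can also be combined into a single view of every candidate neighborhood. This is a minimal sketch under stated assumptions: `neighborhood_sample` is a hypothetical five-row stand-in built from values shown in this section, not the real `neighborhood_results.csv`.

```python
import pandas as pd

# Hypothetical stand-in for neighborhood_results, using values shown in this section.
neighborhood_sample = pd.DataFrame({
    "Name": ["Fordham", "Carroll Gardens", "North Side", "Midtown", "Tribeca"],
    "Cluster Labels": [1, 0, 1, 1, 2],
    "Status": ["Closed", "Closed", "Open", "Open", "Rebrand"],
})

# Keep every neighborhood the tree did not predict as 'Closed'.
candidates = neighborhood_sample[neighborhood_sample["Status"].isin(["Open", "Rebrand"])]

# Group the candidate names by predicted status for a compact summary.
by_status = candidates.groupby("Status")["Name"].apply(list)
print(by_status)
```

Keeping the `Cluster Labels` column in the same frame makes it easy to see which cluster each candidate belongs to.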

### Part III: The Relationship between Clustering and the Prediction Tree
I will begin by running the Kruskal-Wallis H test on different sections of the data.<br>
I chose this test because it has been __[recommended](https://statistics.laerd.com/stata-tutorials/kruskal-wallis-h-test-using-stata.php)__ for comparing groups defined by a categorical variable. Strictly speaking, it is a rank-based test of whether groups differ, not a correlation coefficient.

I will also explore the top ten venues of the three locations predicted by the Decision Tree.

In [87]:
from scipy import stats

In [81]:
# To run the Kruskal-Wallis test on the Cluster Labels and Status,
# I need to strip the results dataframe of everything except those two columns
corr_columns = ['Name', 'Cluster Labels', 'Status']
corr_df = all_results[corr_columns]
corr_df.set_index('Name', inplace=True)
corr_df

Unnamed: 0_level_0,Cluster Labels,Status
Name,Unnamed: 1_level_1,Unnamed: 2_level_1
Fordham,1,Closed
Greenpoint,1,Closed
Brooklyn Heights,1,Closed
Carroll Gardens,0,Closed
Clinton Hill,1,Closed
Downtown,1,Closed
East Williamsburg,1,Closed
North Side,1,Open
South Side,1,Closed
Chinatown,1,Closed


In [84]:
# I also need to preprocess the status to numerical values.
# Work on an explicit copy so the assignment below does not raise SettingWithCopyWarning.
corr_df = corr_df.copy()
le_status = preprocessing.LabelEncoder()
le_status.fit(['Open', 'Closed', 'Rebrand'])
corr_df['Status'] = le_status.transform(corr_df['Status'])
corr_df


Unnamed: 0_level_0,Cluster Labels,Status
Name,Unnamed: 1_level_1,Unnamed: 2_level_1
Fordham,1,0
Greenpoint,1,0
Brooklyn Heights,1,0
Carroll Gardens,0,0
Clinton Hill,1,0
Downtown,1,0
East Williamsburg,1,0
North Side,1,1
South Side,1,0
Chinatown,1,0


In [88]:
stats.kruskal(corr_df['Cluster Labels'], corr_df['Status'])

KruskalResult(statistic=43.64705882352936, pvalue=3.9326771993576644e-11)

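This result should not be read as a correlation. Called this way, `stats.kruskal` treats the two columns as two independent samples and tests whether they share the same distribution, so the very small p-value only says that the cluster labels and the encoded statuses are distributed differently. It does not show whether a location's cluster is related to its status.
The biggest issue is that there aren't enough samples to get a meaningful Kruskal-Wallis result.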
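
To ask whether status differs between clusters, the status codes can be grouped by cluster label before calling `stats.kruskal`. The following is a minimal sketch, not the analysis run above: it uses only the ten rows displayed in the `corr_df` output, so the numbers are illustrative, and the names `sample`, `contingency`, and `groups` are introduced here for the example.

```python
import pandas as pd
from scipy import stats

# The ten rows shown in the corr_df output above (illustrative subset only).
sample = pd.DataFrame({
    "Name": ["Fordham", "Greenpoint", "Brooklyn Heights", "Carroll Gardens",
             "Clinton Hill", "Downtown", "East Williamsburg", "North Side",
             "South Side", "Chinatown"],
    "Cluster Labels": [1, 1, 1, 0, 1, 1, 1, 1, 1, 1],
    "Status": ["Closed"] * 7 + ["Open"] + ["Closed"] * 2,
})

# A contingency table makes the small, unbalanced sample visible at a glance.
contingency = pd.crosstab(sample["Cluster Labels"], sample["Status"])
print(contingency)

# Encode status as in the notebook (Closed -> 0, Open -> 1), then split by cluster.
status_code = sample["Status"].map({"Closed": 0, "Open": 1})
groups = [g.to_numpy() for _, g in status_code.groupby(sample["Cluster Labels"])]

# Kruskal-Wallis: do the status codes differ between the cluster groups?
result = stats.kruskal(*groups)
print(result)
```

With one cluster holding a single row and only one open location in this subset, the test has almost no power, which is consistent with the sample-size concern noted above.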