# IBM Data Science Capstone Project: Clustering New York City Neighborhoods

## 1. Introduction

A resident of a Manhattan neighborhood enjoys his local amenities, but he feels that apartment rents in the area are getting too high. He decides to relocate to a more affordable neighborhood in New York City. However, he wants to make sure that his new neighborhood has a similar mix of amenities. How can he identify neighborhoods in other boroughs of NYC which have similar amenities to his neighborhood in Manhattan?

## 2. Data

In order to solve this problem, I need a dataset which contains information about the various venues and amenities in each neighborhood of New York City.

The first step is to use a dataset from NYU which contains information about each neighborhood in New York City. I can use this data to identify the names and locations (latitude and longitude coordinates) of each neighborhood. Here is a link to the data: https://geo.nyu.edu/catalog/nyu_2451_34572

The second step is to use Foursquare location data to access venue information for each neighborhood. I can use the Foursquare API functions to explore the venues around the latitude/longitude coordinates of each NYC neighborhood in the NYU dataset.

### Part A: NYU Neighborhood Data

I will load the NYU neighborhood JSON file, and open it as a Python dictionary:

In [2]:
import json
!wget -q -O 'newyork_data.json' https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs/newyork_data.json

with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)

The neighborhood data that I want to use is in the 'features' key of the dictionary. So I will redefine the data variable to only include this information:

In [4]:
neighborhoods_data = newyork_data['features']
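For orientation, each entry of `neighborhoods_data` is a GeoJSON-style feature. The sketch below uses a hand-written record (values borrowed from the Wakefield row of the results; the real records carry more fields) to show which keys the next step reads. The helper name `extract_row` is an assumption introduced here for illustration. Note that the coordinates are stored as [longitude, latitude].

```python
# A hand-written sample feature (an assumption for illustration; the real
# records contain more fields than shown here).
sample_feature = {
    "properties": {"borough": "Bronx", "name": "Wakefield"},
    "geometry": {"coordinates": [-73.847201, 40.894705]},  # [longitude, latitude]
}

def extract_row(feature):
    """Return one dataframe-ready row from a neighborhood feature."""
    coords = feature["geometry"]["coordinates"]
    return {
        "Borough": feature["properties"]["borough"],
        "Neighborhood": feature["properties"]["name"],
        "Latitude": coords[1],   # latitude is the second element
        "Longitude": coords[0],  # longitude is the first element
    }

row = extract_row(sample_feature)
print(row)
```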

Next, I would like to convert the `neighborhoods_data` list into a Pandas dataframe. To do so, I will extract one row per neighborhood (borough, name, latitude, and longitude) and assemble the rows into a dataframe.

In [5]:
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# Define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude']

# Collect one row per neighborhood
rows = []
for data in neighborhoods_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']

    # GeoJSON stores coordinates as [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]

    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

# Build the dataframe in one step (DataFrame.append was removed in pandas 2.0)
neighborhoods = pd.DataFrame(rows, columns=column_names)

Let's examine the first few rows of the resulting dataframe.

In [6]:
neighborhoods.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Bronx,Wakefield,40.894705,-73.847201
1,Bronx,Co-op City,40.874294,-73.829939
2,Bronx,Eastchester,40.887556,-73.827806
3,Bronx,Fieldston,40.895437,-73.905643
4,Bronx,Riverdale,40.890834,-73.912585


Everything appears to be in order. I now have a Pandas dataframe which contains the name and location coordinates of each neighborhood in New York City.

### Part B: Foursquare Venue Data

Now that I have the NYC neighborhoods dataframe, I want to identify which venues are located within each neighborhood. I can use the Foursquare API functions to access this data:

In [7]:
# Foursquare API credentials (replace the placeholders with your own; avoid publishing real keys)
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # Foursquare ID
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

# Define a function that returns top 100 venues within 500 meters of a neighborhood's latitude/longitude coordinates
import requests # library to handle requests
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues

# Use the above function to create a new dataframe with the venue data for each neighborhood
ny_venues = getNearbyVenues(names=neighborhoods['Neighborhood'],
                                   latitudes=neighborhoods['Latitude'],
                                   longitudes=neighborhoods['Longitude']
                                  )

ny_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Wakefield,40.894705,-73.847201,Lollipops Gelato,40.894123,-73.845892,Dessert Shop
1,Wakefield,40.894705,-73.847201,Rite Aid,40.896649,-73.844846,Pharmacy
2,Wakefield,40.894705,-73.847201,Walgreens,40.896528,-73.8447,Pharmacy
3,Wakefield,40.894705,-73.847201,Carvel Ice Cream,40.890487,-73.848568,Ice Cream Shop
4,Wakefield,40.894705,-73.847201,Dunkin',40.890459,-73.849089,Donut Shop
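As a rough sketch of what the function above does per neighborhood, the snippet below builds the same kind of explore URL and flattens a response without sending any request. The payload is hand-written to mimic the nesting that the function indexes into (`response` → `groups[0]` → `items` → `venue`); the helper names `build_explore_url` and `parse_items`, the placeholder credentials, and the "Unnamed Spot" venue are assumptions for illustration. It also handles two cases defensively: a venue without categories and a response without results.

```python
from urllib.parse import urlencode

def build_explore_url(lat, lng, client_id, client_secret,
                      version="20180605", radius=500, limit=100):
    """Build a v2 'venues/explore' URL like the one above (no request is sent)."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",
        "radius": radius,
        "limit": limit,
    }
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)

def parse_items(payload, neighborhood):
    """Flatten an explore-style payload into (neighborhood, venue, category) tuples."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0]["items"] if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories", [])
        # Fall back to None when a venue carries no category
        category = categories[0]["name"] if categories else None
        rows.append((neighborhood, venue["name"], category))
    return rows

# Hand-written payload (hypothetical) with the same nesting as the real response.
sample_payload = {
    "response": {"groups": [{"items": [
        {"venue": {"name": "Rite Aid",
                   "location": {"lat": 40.896649, "lng": -73.844846},
                   "categories": [{"name": "Pharmacy"}]}},
        {"venue": {"name": "Unnamed Spot",
                   "location": {"lat": 40.0, "lng": -73.0},
                   "categories": []}},
    ]}]}
}

url = build_explore_url(40.894705, -73.847201, "YOUR_ID", "YOUR_SECRET")
rows = parse_items(sample_payload, "Wakefield")
empty = parse_items({"response": {}}, "Nowhere")
print(url)
print(rows)
```

A neighborhood whose search returns no items contributes no rows at all, so it will be absent from every dataframe derived from the venue list.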


After processing all of the neighborhood venue data, the result is a new Pandas dataframe with information about the top venues in New York City. However, I need to reformat this data for the clustering analysis. I am interested in knowing the categories of venues near each neighborhood. Therefore, I will use indicator (dummy) variables to encode the 'venue category' feature as a number.

In [15]:
# Indicator variable encoding (one hot encoding)
ny_dummy = pd.get_dummies(ny_venues[['Venue Category']], prefix="", prefix_sep="")

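# add neighborhood column back to dataframe
# (if a venue category is itself named 'Neighborhood', this assignment overwrites that indicator column)
ny_dummy['Neighborhood'] = ny_venues['Neighborhood'] 

# move neighborhood column to the first column
# (select it by name rather than by position, so the result does not depend on how many categories were returned)
fixed_columns = ['Neighborhood'] + [col for col in ny_dummy.columns if col != 'Neighborhood']
ny_dummy = ny_dummy[fixed_columns]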

ny_dummy.head()

Unnamed: 0,Neighborhood,Accessories Store,Adult Boutique,Afghan Restaurant,African Restaurant,American Restaurant,Animal Shelter,Antique Shop,Arcade,Arepa Restaurant,Argentinian Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Arts & Entertainment,Asian Restaurant,Athletics & Sports,Auditorium,Australian Restaurant,Austrian Restaurant,Auto Workshop,Automotive Shop,BBQ Joint,Baby Store,Bagel Shop,Bakery,Bank,Bar,Baseball Field,Baseball Stadium,Basketball Court,Bath House,Beach,Beach Bar,Bed & Breakfast,Beer Bar,Beer Garden,Beer Store,Big Box Store,Bike Rental / Bike Share,Bike Shop,Bike Trail,Bistro,Board Shop,Boat or Ferry,Bookstore,Boutique,Bowling Alley,Boxing Gym,Brazilian Restaurant,Breakfast Spot,Brewery,Bridal Shop,Bridge,Bubble Tea Shop,Buffet,Building,Burger Joint,Burmese Restaurant,Burrito Place,Bus Line,Bus Station,Bus Stop,Business Service,Butcher,Cafeteria,Café,Cajun / Creole Restaurant,Camera Store,Campground,Candy Store,Cantonese Restaurant,Caribbean Restaurant,Caucasian Restaurant,Cha Chaan Teng,Check Cashing Service,Cheese Shop,Child Care Service,Chinese Restaurant,Chocolate Shop,Christmas Market,Church,Circus,Climbing Gym,Clothing Store,Club House,Cocktail Bar,Coffee Shop,College Academic Building,College Arts Building,College Basketball Court,College Bookstore,College Cafeteria,Colombian Restaurant,Comedy Club,Comfort Food Restaurant,Comic Shop,Community Center,Concert Hall,Construction & Landscaping,Convenience Store,Cooking School,Cosmetics Shop,Coworking Space,Creperie,Cuban Restaurant,Cupcake Shop,Cycle Studio,Czech Restaurant,Dance Studio,Daycare,Deli / Bodega,Department Store,Design Studio,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Distillery,Dive Bar,Doctor's Office,Dog Run,Donut Shop,Dosa Place,Drugstore,Dry Cleaner,Dumpling Restaurant,Duty-free Shop,Eastern European Restaurant,Electronics Store,Empanada Restaurant,Entertainment Service,Ethiopian Restaurant,Event Service,Event Space,Exhibit,Factory,Falafel Restaurant,Farm,Farmers Market,Fast Food Restaurant,Field,Filipino Restaurant,Film Studio,Fish & Chips Shop,Fish Market,Flea Market,Flower Shop,Food,Food & Drink Shop,Food Court,Food Stand,Food Truck,Fountain,French Restaurant,Fried Chicken Joint,Frozen Yogurt Shop,Fruit & Vegetable Store,Furniture / Home Store,Gaming Cafe,Garden,Garden Center,Gas Station,Gastropub,Gay Bar,General Entertainment,German Restaurant,Gift Shop,Gluten-free Restaurant,Go Kart Track,Golf Course,Gourmet Shop,Greek Restaurant,Grocery Store,Gym,Gym / Fitness Center,Gym Pool,Gymnastics Gym,Halal Restaurant,Harbor / Marina,Hardware Store,Hawaiian Restaurant,Health & Beauty Service,Health Food Store,Heliport,Herbs & Spices Store,High School,Himalayan Restaurant,Historic Site,History Museum,Hobby Shop,Home Service,Hookah Bar,Hostel,Hot Dog Joint,Hotel,Hotel Bar,Hotel Pool,Hotpot Restaurant,Hunan Restaurant,IT Services,Ice Cream Shop,Indian Restaurant,Indie Movie Theater,Indie Theater,Indonesian Restaurant,Insurance Office,Intersection,Irish Pub,Israeli Restaurant,Italian Restaurant,Japanese Curry Restaurant,Japanese Restaurant,Jazz Club,Jewelry Store,Jewish Restaurant,Juice Bar,Karaoke Bar,Kebab Restaurant,Kids Store,Kitchen Supply Store,Korean Restaurant,Kosher Restaurant,Lake,Latin American Restaurant,Laundromat,Laundry Service,Lawyer,Leather Goods Store,Lebanese Restaurant,Library,Lingerie Store,Liquor Store,Locksmith,Lounge,Malay Restaurant,Market,Martial Arts School,Massage Studio,Mattress Store,Medical Center,Medical Supply Store,Mediterranean Restaurant,Memorial Site,Men's Store,Metro Station,Mexican Restaurant,Middle Eastern Restaurant,Mini Golf,Miscellaneous Shop,Mobile Phone Shop,Modern European Restaurant,Molecular Gastronomy Restaurant,Monument / Landmark,Moroccan Restaurant,Motel,Motorcycle Shop,Movie Theater,Moving Target,Multiplex,Museum,Music School,Music Store,Music Venue,Nail Salon,New American Restaurant,Newsstand,Nightclub,Nightlife Spot,Non-Profit,Noodle House,North Indian Restaurant,Office,Opera House,Optical Shop,Organic Grocery,Other Great Outdoors,Other Nightlife,Other Repair Shop,Outdoor Gym,Outdoor Sculpture,Outdoors & Recreation,Outlet Mall,Outlet Store,Paella Restaurant,Pakistani Restaurant,Paper / Office Supplies Store,Park,Pastry Shop,Pedestrian Plaza,Performing Arts Venue,Persian Restaurant,Peruvian Restaurant,Pet Café,Pet Service,Pet Store,Pharmacy,Photography Studio,Physical Therapist,Piano Bar,Pie Shop,Pier,Piercing Parlor,Pilates Studio,Pizza Place,Platform,Playground,Plaza,Poke Place,Polish Restaurant,Pool,Pool Hall,Post Office,Print Shop,Professional & Other Places,Pub,Public Art,Puerto Rican Restaurant,Racetrack,Ramen Restaurant,Record Shop,Recording Studio,Recreation Center,Rental Car Location,Rental Service,Residential Building (Apartment / Condo),Resort,Rest Area,Restaurant,River,Rock Climbing Spot,Rock Club,Roller Rink,Roof Deck,Rooftop Bar,Russian Restaurant,Sake Bar,Salad Place,Salon / Barbershop,Sandwich Place,Scandinavian Restaurant,Scenic Lookout,School,Sculpture Garden,Seafood Restaurant,Shabu-Shabu Restaurant,Shanghai Restaurant,Shipping Store,Shoe Store,Shop & Service,Shopping Mall,Skate Park,Skating Rink,Ski Area,Smoke Shop,Smoothie Shop,Snack Place,Soba Restaurant,Soccer Field,Social Club,Soup Place,South American Restaurant,South Indian Restaurant,Southern / Soul Food Restaurant,Souvlaki Shop,Spa,Spanish Restaurant,Speakeasy,Sporting Goods Shop,Sports Bar,Sports Club,Sri Lankan Restaurant,Stables,Stadium,State / Provincial Park,Stationery Store,Steakhouse,Storage Facility,Street Art,Strip Club,Supermarket,Supplement Shop,Surf Spot,Sushi Restaurant,Swiss Restaurant,Szechuan Restaurant,Taco Place,Tailor Shop,Taiwanese Restaurant,Tanning Salon,Tapas Restaurant,Tattoo Parlor,Tea Room,Tech Startup,Tennis Court,Tennis Stadium,Tex-Mex Restaurant,Thai Restaurant,Theater,Theme Park,Theme Park Ride / Attraction,Thrift / Vintage Store,Tibetan Restaurant,Tiki Bar,Toll Plaza,Tourist Information Center,Toy / Game Store,Track,Trail,Train Station,Turkish Restaurant,Udon Restaurant,Used Bookstore,Vape Store,Varenyky restaurant,Vegetarian / Vegan Restaurant,Veterinarian,Video Game Store,Video Store,Vietnamese Restaurant,Volleyball Court,Warehouse Store,Waste Facility,Waterfront,Weight Loss Center,Well,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Wakefield,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,Wakefield,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,Wakefield,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,Wakefield,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,Wakefield,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
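To make the encoding easier to follow, here is a small sketch on made-up data (the neighborhoods "A" and "B" and their venues are hypothetical). It shows how one indicator row per venue turns into per-neighborhood category frequencies once the rows are averaged by neighborhood, which is the form the clustering step works with.

```python
import pandas as pd

# Toy venue list (hypothetical values, not taken from the Foursquare results).
toy_venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Pharmacy", "Pharmacy", "Donut Shop", "Park", "Park", "Park"],
})

# One indicator column per category, one row per venue.
toy_dummy = pd.get_dummies(toy_venues["Venue Category"], dtype=int)
toy_dummy.insert(0, "Neighborhood", toy_venues["Neighborhood"])

# Averaging the indicators per neighborhood gives the share of venues in each category.
toy_grouped = toy_dummy.groupby("Neighborhood").mean().reset_index()
print(toy_grouped)
```

In this toy case half of the venues in "A" are pharmacies, so its Pharmacy frequency is 0.5, while every venue in "B" is a park.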


Next, I will define a function to find the most common venue categories in each neighborhood. I will use this function to identify the top ten venue categories for each neighborhood, then put this data into a new dataframe.

In [16]:
# Function to determine the top venue categories of a neighborhood
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
import numpy as np
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # ranks beyond the 3rd use the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
ny_grouped = ny_dummy.groupby('Neighborhood').mean().reset_index()
neighborhood_venues_sorted = pd.DataFrame(columns=columns)
neighborhood_venues_sorted['Neighborhood'] = ny_grouped['Neighborhood']

for ind in np.arange(ny_grouped.shape[0]):
    neighborhood_venues_sorted.iloc[ind, 1:] = return_most_common_venues(ny_grouped.iloc[ind, :], num_top_venues)

neighborhood_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Allerton,Pizza Place,Deli / Bodega,Supermarket,Spa,Chinese Restaurant,Bus Station,Gas Station,Spanish Restaurant,Breakfast Spot,Grocery Store
1,Annadale,Pizza Place,Food,Cosmetics Shop,Train Station,Diner,Restaurant,Liquor Store,Pub,Farmers Market,Field
2,Arden Heights,Pharmacy,Deli / Bodega,Coffee Shop,Pizza Place,Flea Market,Falafel Restaurant,Farm,Farmers Market,Fast Food Restaurant,Field
3,Arlington,Grocery Store,Deli / Bodega,Bus Stop,Coffee Shop,Boat or Ferry,Food,Farmers Market,Fast Food Restaurant,Field,Filipino Restaurant
4,Arrochar,Italian Restaurant,Pizza Place,Bus Stop,Deli / Bodega,Bagel Shop,Mediterranean Restaurant,Middle Eastern Restaurant,Sandwich Place,Outdoors & Recreation,Liquor Store
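One caveat about the ranking function above: it always returns ten names, even for a neighborhood that has fewer than ten distinct venue categories. In that case the remaining slots are filled by categories whose frequency is zero, in whatever order the sort leaves the ties. The sketch below shows a possible variant that reports only categories actually present; the name `top_nonzero_venues` and the toy row are assumptions for illustration, not part of the analysis above.

```python
import pandas as pd

def top_nonzero_venues(row, num_top_venues):
    """Return up to num_top_venues category names whose frequency is above zero."""
    freqs = row.iloc[1:].astype(float)  # skip the 'Neighborhood' entry
    freqs = freqs[freqs > 0].sort_values(ascending=False, kind="mergesort")
    return list(freqs.index[:num_top_venues])

# Toy frequency row (hypothetical values): only two categories are present.
toy_row = pd.Series({"Neighborhood": "X", "Farm": 0.0, "Field": 0.0,
                     "Food": 0.5, "Pool": 0.5, "Yoga Studio": 0.0})
top = top_nonzero_venues(toy_row, 10)
print(top)
```

With this variant a sparse neighborhood yields a shorter list instead of ten entries padded with absent categories.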


## 3. k-means Cluster Analysis

Now that I have the dataframes set up, I can apply a well-known machine learning algorithm called k-means. The algorithm measures similarity with a distance function between data points. In scikit-learn's implementation this is the Euclidean distance, applied here to each neighborhood's vector of venue category frequencies (the per-neighborhood means of the indicator variables). First, I select the number of clusters (k) that I want to segment the neighborhoods into, and initial points (centroids) are chosen for the k clusters (scikit-learn uses the randomized k-means++ scheme by default). The algorithm then calculates which centroid each data point is closest to, and assigns the point to that centroid's cluster. Each centroid is then moved to the mean of the points in its cluster. These two steps repeat until the cluster assignments stop changing, which yields a locally optimal solution. At that stage, the within-cluster distances (the sum of squared distances from each point to its centroid) are minimized, at least locally.
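The assign-and-update loop described above can be sketched in a few lines of NumPy. This is a minimal illustration on made-up two-dimensional points, not scikit-learn's implementation; the function name `kmeans_sketch` and the toy array `X` are assumptions introduced here. It uses squared Euclidean distance and plain random initialization from the data points.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Minimal Lloyd's algorithm: returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assignment step: squared Euclidean distance to every centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its points.
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break  # centroids (and therefore assignments) have stabilized
        centroids = new_centroids
    return labels, centroids

# Two well-separated toy groups (hypothetical data).
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
labels, centroids = kmeans_sketch(X, k=2)
print(labels)
```

Which group receives label 0 depends on the random start, so only the grouping itself is meaningful, not the label numbers.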

The Python library scikit-learn has an implementation of the k-means algorithm. I will use it to cluster all New York City neighborhoods according to the similarity of their venue category frequencies. Then I will create an updated Pandas dataframe that includes the cluster assignments (along with the top ten venue categories).

In [17]:
from sklearn.cluster import KMeans
# set number of clusters
kclusters = 5

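# keep only the numeric frequency columns (the positional axis argument was removed in pandas 2.0)
ny_grouped_clustering = ny_grouped.drop(columns='Neighborhood')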

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(ny_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([0, 0, 0, 3, 0, 3, 3, 3, 3, 0], dtype=int32)

In [18]:
# add clustering labels
neighborhood_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

ny_merged = neighborhoods

# join the cluster labels and top venues onto the neighborhoods dataframe, which supplies the borough and latitude/longitude for each neighborhood
ny_merged = ny_merged.join(neighborhood_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

ny_merged.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Bronx,Wakefield,40.894705,-73.847201,0.0,Pharmacy,Deli / Bodega,Gas Station,Donut Shop,Dessert Shop,Ice Cream Shop,Sandwich Place,Laundromat,Yoga Studio,Film Studio
1,Bronx,Co-op City,40.874294,-73.829939,0.0,Fast Food Restaurant,Accessories Store,Grocery Store,Basketball Court,Park,Baseball Field,Pharmacy,Bagel Shop,Pizza Place,Deli / Bodega
2,Bronx,Eastchester,40.887556,-73.827806,3.0,Bus Station,Caribbean Restaurant,Deli / Bodega,Diner,Juice Bar,Donut Shop,Chinese Restaurant,Seafood Restaurant,Food & Drink Shop,Pizza Place
3,Bronx,Fieldston,40.895437,-73.905643,3.0,Medical Supply Store,Plaza,Bus Station,River,Falafel Restaurant,Farm,Farmers Market,Fast Food Restaurant,Field,Filipino Restaurant
4,Bronx,Riverdale,40.890834,-73.912585,3.0,Park,Bus Station,Baseball Field,Plaza,Bank,Gym,Medical Supply Store,French Restaurant,Falafel Restaurant,Farm
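The cluster labels in this table are shown as 0.0 and 3.0 rather than 0 and 3. A likely reason is that the join leaves a missing value for any neighborhood that has no row in the venue table, which forces the column to a floating-point type. The sketch below reproduces this on made-up data (all names and values are hypothetical) and shows one way to count neighborhoods per cluster, including unlabeled ones.

```python
import pandas as pd

# Toy stand-ins (hypothetical) for `neighborhoods` and the labeled venue table.
toy_neighborhoods = pd.DataFrame({
    "Neighborhood": ["A", "B", "C"],
    "Latitude": [40.1, 40.2, 40.3],
})
toy_labeled = pd.DataFrame({
    "Cluster Labels": [0, 3],
    "Neighborhood": ["A", "B"],  # "C" had no venues, so it has no label
})

toy_merged = toy_neighborhoods.join(
    toy_labeled.set_index("Neighborhood"), on="Neighborhood"
)
print(toy_merged["Cluster Labels"].tolist())

# Count neighborhoods per cluster, including those left without a label.
print(toy_merged["Cluster Labels"].value_counts(dropna=False))

# Optional: a nullable integer dtype keeps the labels as integers and marks gaps as <NA>.
toy_merged["Cluster Labels"] = toy_merged["Cluster Labels"].astype("Int64")
```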


## 4. Results and Data Visualization

Now that I have sorted each New York City neighborhood into the five clusters, I would like to visualize them on a map. I will use the Python Folium library. Each neighborhood will be labeled with a colored dot which indicates its cluster assignment. 

In [19]:
!pip install folium
import folium # map rendering library



In [20]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

In [21]:
# NYC latitude and longitude
latitude = 40.7127281
longitude = -74.0060152

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(ny_merged['Latitude'], ny_merged['Longitude'], ny_merged['Neighborhood'], ny_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    if 0 <= cluster <= 4:
        # the -1 offset makes cluster 0 wrap around to the last color in the list (red)
        cluster_color = rainbow[int(cluster) - 1]
    else:
        # neighborhoods without a cluster label (NaN) are drawn in black
        cluster_color = 'black'
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=cluster_color,
        fill=True,
        fill_color=cluster_color,
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
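To see which color each cluster label receives, the sketch below repeats the color indexing from the map code on its own. It assumes five clusters and the same `rainbow` colormap; the dictionary name `cluster_to_color` is introduced here for illustration. Because the index is `label - 1`, every label is shifted by one position, and Python's negative indexing sends label 0 to the last color in the list.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

kclusters = 5
# Five evenly spaced colors from the 'rainbow' colormap, as hex strings.
rainbow = [colors.rgb2hex(c) for c in cm.rainbow(np.linspace(0, 1, kclusters))]

# Same indexing as the map code: label - 1, so label 0 wraps to the last color.
cluster_to_color = {label: rainbow[label - 1] for label in range(kclusters)}
for label, hex_color in cluster_to_color.items():
    print(label, hex_color)
```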

Evidently, all of the neighborhoods in Manhattan fall into Cluster 3 (cyan). The Bronx appears to have a greater proportion of Cluster 0 (red) neighborhoods, while Staten Island, Brooklyn, and Queens have a mix of Cluster 0 (red) and Cluster 3 (cyan) neighborhoods. The other three clusters have only two or three neighborhoods each, suggesting that these are outliers in some way. Also, two of the neighborhoods on Staten Island were not assigned to clusters (they are colored black). A likely explanation is that the venue search returned no results for them, so they have no row in the clustering input and receive no label in the join.

It would be interesting to know what these clusters are telling us about the feature of interest: the top venues in a neighborhood. What do the Cluster 3 (cyan) neighborhoods have in common, for example? And what sets the less common clusters apart? Let's take a closer look at the data.

In [22]:
ny_merged.loc[ny_merged['Cluster Labels'] == 0, ny_merged.columns[[1] + list(range(5, ny_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Wakefield,Pharmacy,Deli / Bodega,Gas Station,Donut Shop,Dessert Shop,Ice Cream Shop,Sandwich Place,Laundromat,Yoga Studio,Film Studio
1,Co-op City,Fast Food Restaurant,Accessories Store,Grocery Store,Basketball Court,Park,Baseball Field,Pharmacy,Bagel Shop,Pizza Place,Deli / Bodega
5,Kingsbridge,Pizza Place,Bakery,Sandwich Place,Bar,Donut Shop,Latin American Restaurant,Mexican Restaurant,Supermarket,Fast Food Restaurant,Deli / Bodega
7,Woodlawn,Deli / Bodega,Bar,Playground,Food & Drink Shop,Pub,Pizza Place,Food Truck,Grocery Store,Park,Pharmacy
8,Norwood,Pizza Place,Bank,Park,Pharmacy,Burger Joint,Food Truck,Mexican Restaurant,Mobile Phone Shop,Grocery Store,Bus Station
11,Pelham Parkway,Italian Restaurant,Pizza Place,Chinese Restaurant,Frozen Yogurt Shop,Bakery,Bank,Metro Station,Mexican Restaurant,Sushi Restaurant,Gas Station
13,Bedford Park,Diner,Deli / Bodega,Mexican Restaurant,Chinese Restaurant,Pizza Place,Sandwich Place,Bus Station,Food Truck,Thrift / Vintage Store,Grocery Store
14,University Heights,Pizza Place,Fast Food Restaurant,Bakery,Donut Shop,Bank,Supermarket,Latin American Restaurant,Sandwich Place,History Museum,Cosmetics Shop
15,Morris Heights,Bank,Pharmacy,Spanish Restaurant,Grocery Store,Pizza Place,Yoga Studio,Fish Market,Falafel Restaurant,Farm,Farmers Market
17,East Tremont,Pizza Place,Cosmetics Shop,Restaurant,Mobile Phone Shop,Fast Food Restaurant,Bank,Fish & Chips Shop,Lounge,Donut Shop,Breakfast Spot


It appears that Cluster 0 (red) neighborhoods primarily have restaurants and food stores (e.g. pizza places, bakeries, and cafes). 

In [23]:
ny_merged.loc[ny_merged['Cluster Labels'] == 1, ny_merged.columns[[1] + list(range(5, ny_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
192,Somerville,Park,Yoga Studio,Flower Shop,Falafel Restaurant,Farm,Farmers Market,Fast Food Restaurant,Field,Filipino Restaurant,Film Studio
203,Todt Hill,Park,Yoga Studio,Flower Shop,Falafel Restaurant,Farm,Farmers Market,Fast Food Restaurant,Field,Filipino Restaurant,Film Studio


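The two Cluster 1 (purple) neighborhoods have identical top venue profiles, led by parks, yoga studios, and flower shops. The alphabetical run that follows (Falafel Restaurant, Farm, Farmers Market, and so on) also appears in many other neighborhoods, so it likely consists of zero-frequency ties that pad the list to ten entries rather than venues actually found there.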

In [24]:
ny_merged.loc[ny_merged['Cluster Labels'] == 2, ny_merged.columns[[1] + list(range(5, ny_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
76,Mill Island,Pool,Food,Falafel Restaurant,Farm,Farmers Market,Fast Food Restaurant,Field,Filipino Restaurant,Film Studio,Fish & Chips Shop
238,Butler Manor,Pool,Baseball Field,Gas Station,Flower Shop,Falafel Restaurant,Farm,Farmers Market,Fast Food Restaurant,Field,Filipino Restaurant


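The Cluster 2 (blue) neighborhoods both have pools as their most common venue. The farms and fields further down the lists are probably zero-frequency filler rather than venues actually present, since the same alphabetical run (Falafel Restaurant, Farm, Farmers Market, ...) recurs across many neighborhoods.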

In [25]:
ny_merged.loc[ny_merged['Cluster Labels'] == 3, ny_merged.columns[[1] + list(range(5, ny_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
2,Eastchester,Bus Station,Caribbean Restaurant,Deli / Bodega,Diner,Juice Bar,Donut Shop,Chinese Restaurant,Seafood Restaurant,Food & Drink Shop,Pizza Place
3,Fieldston,Medical Supply Store,Plaza,Bus Station,River,Falafel Restaurant,Farm,Farmers Market,Fast Food Restaurant,Field,Filipino Restaurant
4,Riverdale,Park,Bus Station,Baseball Field,Plaza,Bank,Gym,Medical Supply Store,French Restaurant,Falafel Restaurant,Farm
6,Marble Hill,Gym,Sandwich Place,Coffee Shop,Discount Store,Yoga Studio,Video Game Store,Steakhouse,Supplement Shop,Shopping Mall,Tennis Stadium
9,Williamsbridge,Bar,Soup Place,Nightclub,Metro Station,Caribbean Restaurant,Farm,Farmers Market,Fast Food Restaurant,Field,Filipino Restaurant
10,Baychester,Donut Shop,Bus Station,Men's Store,Fast Food Restaurant,Gym / Fitness Center,Mattress Store,Pet Store,Bank,Boat or Ferry,Sandwich Place
12,City Island,Seafood Restaurant,Thrift / Vintage Store,Tapas Restaurant,Bank,Bar,Baseball Field,Park,Harbor / Marina,Deli / Bodega,Pharmacy
16,Fordham,Spanish Restaurant,Mobile Phone Shop,Shoe Store,Bank,Gym / Fitness Center,Donut Shop,Video Game Store,Supplement Shop,Pharmacy,Clothing Store
18,West Farms,Bus Station,Park,Donut Shop,Chinese Restaurant,Bank,Lounge,Sandwich Place,Supermarket,Basketball Court,Bus Stop
22,Port Morris,Furniture / Home Store,Donut Shop,Restaurant,Metro Station,Distillery,Latin American Restaurant,Spanish Restaurant,Grocery Store,Brewery,Storage Facility


The Cluster 3 (cyan) neighborhoods tend to have restaurants and food stores among the top venues. However, neighborhoods in this cluster appear to have a more diverse set of venues, including parks, public transit options, and other non-food-related amenities.

In [26]:
ny_merged.loc[ny_merged['Cluster Labels'] == 4, ny_merged.columns[[1] + list(range(5, ny_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
183,Jamaica Estates,Dog Run,Intersection,Food,Falafel Restaurant,Farm,Farmers Market,Fast Food Restaurant,Field,Filipino Restaurant,Film Studio
193,Brookville,Deli / Bodega,Yoga Studio,Food,Falafel Restaurant,Farm,Farmers Market,Fast Food Restaurant,Field,Filipino Restaurant,Film Studio
202,Grymes Hill,Dog Run,Deli / Bodega,Food & Drink Shop,Farm,Farmers Market,Fast Food Restaurant,Field,Filipino Restaurant,Film Studio,Fish & Chips Shop


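Cluster 4 (orange) neighborhoods primarily feature dog runs, delis, and food venues. The farms and fields lower in the lists are probably zero-frequency filler rather than venues actually present, since the same alphabetical run recurs across many neighborhoods.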

## 5. Conclusion

A resident of Manhattan who wishes to relocate to another New York City neighborhood with a similar mix of venues should consider moving to one of the Cluster 3 neighborhoods listed above. Such neighborhoods are abundant, especially in Staten Island, Brooklyn, and Queens. The differences between Cluster 3 and Cluster 0 neighborhoods are not immediately clear; however, the Cluster 3 neighborhoods do appear to be more diverse. In particular, they appear to have better access to amenities such as parks and gyms.
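As a sketch of how such a shortlist could be produced, the snippet below filters a merged dataframe for neighborhoods that share a cluster label with a given home neighborhood but lie in a different borough. The toy rows and the helper name `same_cluster_elsewhere` are assumptions for illustration; with the real data, `ny_merged` would take the place of `toy_merged`.

```python
import pandas as pd

# Toy stand-in (hypothetical rows) for the merged dataframe built earlier.
toy_merged = pd.DataFrame({
    "Borough": ["Manhattan", "Brooklyn", "Queens", "Bronx", "Staten Island"],
    "Neighborhood": ["M1", "B1", "Q1", "X1", "S1"],
    "Cluster Labels": [3.0, 3.0, 0.0, 3.0, float("nan")],
})

def same_cluster_elsewhere(df, home_neighborhood, home_borough):
    """List neighborhoods outside home_borough that share the home cluster label."""
    wanted = ["Borough", "Neighborhood"]
    home = df[(df["Neighborhood"] == home_neighborhood) & (df["Borough"] == home_borough)]
    if home.empty or home["Cluster Labels"].isna().all():
        return df.iloc[0:0][wanted]  # nothing to compare against
    label = home["Cluster Labels"].iloc[0]
    mask = (df["Cluster Labels"] == label) & (df["Borough"] != home_borough)
    return df.loc[mask, wanted]

candidates = same_cluster_elsewhere(toy_merged, "M1", "Manhattan")
print(candidates)
```

Matching on both name and borough is deliberate, since a neighborhood name is not guaranteed to be unique across boroughs.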

For further discussion, please read the report published on my GitHub.