## Comparing Neighborhoods from Coast to Coast

### Yiwei Wang

You can go to [Jupyter notebook viewer](https://nbviewer.jupyter.org/github/wangyw80/My-Projects/blob/master/Comparing%20Neighborhoods%20From%20Coast%20to%20Coast_codes.ipynb) if maps are not showing.

### This notebook contains the code used in this report.

__Part 1. Collecting Neighborhood Geographic Information__

In [None]:
# Importing libraries
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import json
from bs4 import BeautifulSoup
! pip install geocoder
import geocoder
from geopy.geocoders import Nominatim
import requests 
from pandas import json_normalize # moved to the top-level pandas namespace in pandas 1.0
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
! pip install folium
import folium 

print('All Libraries imported.')

I scrape neighborhood names of New York City, Chicago, and Seattle from Wikipedia. The corresponding web page links that provide tables of neighborhood names are here: 

New York, https://en.wikipedia.org/wiki/Neighborhoods_in_New_York_City;  
Chicago, https://en.wikipedia.org/wiki/List_of_neighborhoods_in_Chicago;  
Seattle, https://en.wikipedia.org/wiki/List_of_neighborhoods_in_Seattle. 

I use the BeautifulSoup library to scrape the desired information from these web pages. The HTML source is loaded and the tables within the pages are pulled out.

In [3]:
# Download the page sources
wiki_NYC = requests.get('https://en.wikipedia.org/wiki/Neighborhoods_in_New_York_City').text
wiki_Chicago = requests.get('https://en.wikipedia.org/wiki/List_of_neighborhoods_in_Chicago').text
wiki_Seattle = requests.get('https://en.wikipedia.org/wiki/List_of_neighborhoods_in_Seattle').text

# Loading html
soup_NYC = BeautifulSoup(wiki_NYC, 'lxml')
soup_Chicago = BeautifulSoup(wiki_Chicago, 'lxml')
soup_Seattle = BeautifulSoup(wiki_Seattle, 'lxml')

# Check if source codes are successfully loaded
print(soup_NYC.title) 
print(soup_Chicago.title)
print(soup_Seattle.title)

# Pull out the tables
table_NYC = soup_NYC.find('table', class_='wikitable sortable')
table_Chicago = soup_Chicago.find('table', class_='wikitable sortable')
table_Seattle = soup_Seattle.find('table', class_='wikitable sortable')

<title>Neighborhoods in New York City - Wikipedia</title>
<title>List of neighborhoods in Chicago - Wikipedia</title>
<title>List of neighborhoods in Seattle - Wikipedia</title>
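
The table-pulling step can be tried without a network connection. The following is a minimal sketch, assuming a hypothetical two-row HTML snippet (`SAMPLE_HTML`) in place of the real Wikipedia pages and the built-in `html.parser` instead of `lxml`; the helper name `first_column` is illustrative and not part of the notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of a Wikipedia-style table (not the real page).
SAMPLE_HTML = """
<table class="wikitable sortable">
  <tr><th>Neighborhood</th><th>Community area</th></tr>
  <tr><td>Albany Park</td><td>Albany Park</td></tr>
  <tr><td>Andersonville</td><td>Edgewater</td></tr>
</table>
"""

def first_column(html):
    """Return the text of the first <td> in each body row of the first wikitable."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="wikitable")
    if table is None:  # the page layout changed or the download failed
        return []
    names = []
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        if cells:  # header rows contain only <th> cells, so they are skipped
            names.append(cells[0].get_text(strip=True))
    return names

print(first_column(SAMPLE_HTML))
```

Checking for `None` before using the table is a defensive choice: `find` returns `None` when no matching table exists, which would otherwise surface later as an `AttributeError`.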


Next, I pull neighborhood names from the tables. The Chicago and Seattle neighborhood tables are quite straightforward, with one neighborhood name in each row. The New York table is slightly trickier, as there are multiple neighborhood names in each cell of interest. Therefore, I further split the cells to make sure each row has one neighborhood name.

In [4]:
# Generate an empty data frame containing city, state, neighborhood name, latitude, and longitude.
column = ['City','State','Neighborhood','Latitude','Longitude']
neighborhood = pd.DataFrame(columns=column)

# Check number of rows in each table scraped.
row_count=0
for x in table_NYC.find_all('tr'):
    row_count+=1
print("The wiki New York neighborhood table has a total of", row_count, "rows")

row_count=0
for x in table_Chicago.find_all('tr'):
    row_count+=1
print("The wiki Chicago neighborhood table has a total of", row_count, "rows")

row_count=0
for x in table_Seattle.find_all('tr'):
    row_count+=1
print("The wiki Seattle neighborhood table has a total of", row_count, "rows")

# Scrape neighborhood names from the New York table.
nyc_rows = []
for row in table_NYC.find_all('tr')[1:]: # Skipping header
    cells = row.find_all('td')
    # Skipping the first 4 columns in each row as the neighborhood names are in the 5th column.
    for cell in cells[4:]:
        nyc_rows.append({'City':'New York','State':'NY','Neighborhood':cell.get_text(),'Latitude':'','Longitude':''})
# DataFrame.append was removed in pandas 2.0, so the rows are collected first and concatenated once.
neighborhood = pd.concat([neighborhood, pd.DataFrame(nyc_rows, columns=neighborhood.columns)], ignore_index=True)

# The final row of the table does not contain neighborhood information, so it is dropped.
neighborhood.drop(neighborhood[neighborhood.index == 59].index, inplace=True)
# Neighborhood names in each cell are separated by commas. I split the cells and stack the new rows in the table.
neighborhood = (
    neighborhood.set_index(neighborhood.columns.drop('Neighborhood').tolist())
    .Neighborhood.str.split(',', expand=True)
    .stack()
    .reset_index()
    .rename(columns={0: 'Neighborhood'})
    .loc[:, neighborhood.columns]
)

# Scrape neighborhood names of Chicago
chicago_rows = []
for row in table_Chicago.find_all('tr')[1:]: # Skipping header
    cells = row.find_all('td')
    if cells: # The neighborhood name is in the first column
        chicago_rows.append({'City':'Chicago','State':'IL','Neighborhood':cells[0].get_text(),'Latitude':'','Longitude':''})
# DataFrame.append was removed in pandas 2.0, so the rows are collected first and concatenated once.
neighborhood = pd.concat([neighborhood, pd.DataFrame(chicago_rows, columns=neighborhood.columns)], ignore_index=True)

# Scrape neighborhood names of Seattle
seattle_rows = []
for row in table_Seattle.find_all('tr')[1:]: # Skipping header
    cells = row.find_all('td')
    if cells: # The neighborhood name is in the first column
        seattle_rows.append({'City':'Seattle','State':'WA','Neighborhood':cells[0].get_text(),'Latitude':'','Longitude':''})
# DataFrame.append was removed in pandas 2.0, so the rows are collected first and concatenated once.
neighborhood = pd.concat([neighborhood, pd.DataFrame(seattle_rows, columns=neighborhood.columns)], ignore_index=True)

# Clear out unwanted strings in scraped cells.
# regex=True is passed explicitly because pandas 2.0 treats the pattern as a literal string by default.
neighborhood['Neighborhood'] = neighborhood['Neighborhood'].str.replace(r'\n', '', regex=True)
neighborhood['Neighborhood'] = neighborhood['Neighborhood'].str.replace(r'\[.*\]', '', regex=True)
neighborhood['Neighborhood'] = neighborhood['Neighborhood'].str.replace(r'\(.*\)', '', regex=True)
neighborhood['Neighborhood'] = neighborhood['Neighborhood'].str.replace(r'\/.*', '', regex=True)
neighborhood['Neighborhood'] = neighborhood['Neighborhood'].str.replace(r'&.*', '', regex=True)
neighborhood['Neighborhood'] = neighborhood['Neighborhood'].str.replace(r',.*', '', regex=True)

# Check size and sample rows of the scraped table
print(neighborhood.shape)
neighborhood.tail()

The wiki New York neighborhood table has a total of 61 rows
The wiki Chicago neighborhood table has a total of 247 rows
The wiki Seattle neighborhood table has a total of 128 rows
(700, 5)


Unnamed: 0,City,State,Neighborhood,Latitude,Longitude
695,Seattle,WA,Riverview,,
696,Seattle,WA,Highland Park,,
697,Seattle,WA,South Delridge,,
698,Seattle,WA,Roxhill,,
699,Seattle,WA,High Point,,
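
The split-and-clean step for multi-name cells can also be written with `str.split` followed by `DataFrame.explode`. This is a sketch under stated assumptions, not the notebook's own method: `raw` is a hypothetical two-row stand-in for the scraped New York table, `one_name_per_row` is an illustrative helper, and the patterns are non-greedy and followed by `str.strip()`, which differs slightly from the cleaning rules above.

```python
import pandas as pd

# Hypothetical two-row stand-in for the scraped New York table.
raw = pd.DataFrame({
    "City": ["New York", "New York"],
    "State": ["NY", "NY"],
    "Neighborhood": ["Melrose, Mott Haven, Port Morris\n", "Hunts Point, Longwood[1]\n"],
})

def one_name_per_row(df):
    """Split comma-separated cells into rows, then strip footnotes and whitespace."""
    out = (
        df.assign(Neighborhood=df["Neighborhood"].str.split(","))
        .explode("Neighborhood")      # one list element per row
        .reset_index(drop=True)
    )
    out["Neighborhood"] = (
        out["Neighborhood"]
        .str.replace(r"\[.*?\]", "", regex=True)  # footnote markers such as [1]
        .str.replace(r"\(.*?\)", "", regex=True)  # parenthetical notes
        .str.strip()                              # newlines and padding spaces
    )
    return out

tidy = one_name_per_row(raw)
print(tidy)
```

Stripping whitespace matters for the later geocoding and merge steps, because a name with a leading space would be treated as a different neighborhood from the same name without it.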


With neighborhood names in place, I can now get the location of each neighborhood. I use the Google Geocoding API to obtain the latitude and longitude of each neighborhood from its name, city, and state.

In [6]:
# Google API key (removed when sharing codes)
key='My KEY'

In [7]:
# Make one API call per row
for index, row in neighborhood.iterrows():
    response = requests.get('https://maps.googleapis.com/maps/api/geocode/json?address={},+{},+{}&key={}'.format(row['Neighborhood'], row['City'], row['State'], key))
    resp_json = response.json()
    location = resp_json['results'][0]['geometry']['location']
    # Write through .at: iterrows() may yield copies, so assigning to `row` is not guaranteed to update the data frame.
    neighborhood.at[index, 'Latitude'] = location['lat']
    neighborhood.at[index, 'Longitude'] = location['lng']


Now each neighborhood has its corresponding location for the next step.
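
A geocoding request can come back without a match, in which case indexing `results[0]` raises an `IndexError`. The sketch below shows one way to guard against that. It assumes the response body has the `status` and `results` fields used by the Geocoding API; `extract_lat_lng` and the two sample bodies are hypothetical, trimmed to the fields the notebook reads.

```python
def extract_lat_lng(resp_json):
    """Return (lat, lng) from a geocoding JSON body, or (None, None) when nothing matched."""
    if resp_json.get("status") != "OK" or not resp_json.get("results"):
        return None, None
    location = resp_json["results"][0]["geometry"]["location"]
    return location["lat"], location["lng"]

# Hypothetical response bodies, trimmed to the fields used above.
found = {
    "status": "OK",
    "results": [{"geometry": {"location": {"lat": 40.8245, "lng": -73.9104}}}],
}
missing = {"status": "ZERO_RESULTS", "results": []}

print(extract_lat_lng(found))
print(extract_lat_lng(missing))
```

Returning `None` for unmatched names lets the caller drop or retry those rows later instead of stopping the whole loop on the first failure.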

__Part 2. Cluster neighborhoods based on popular venues in each neighborhood__

I use Foursquare as the source of top venues in each neighborhood. I then compare the 15 most popular venue categories in each neighborhood to determine whether neighborhoods can be grouped together, using the k-means clustering algorithm to assign neighborhoods to clusters.

I make API calls using the latitude and longitude of each neighborhood to get the venues within a defined radius.

In [9]:
# Foursquare credentials (removed when sharing code)
CLIENT_ID = 'MY ID' # Foursquare ID
CLIENT_SECRET = 'MY SECRET' # Foursquare Secret
VERSION = '20180605' # Foursquare API version

In [8]:
# Define a function to make API calls to Foursquare and then pull out venue information

def getNearbyVenues(names, latitudes, longitudes, radius=500): # default search radius is 500 meters
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
                    
        # Generate the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # Make the request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # Return only relevant information for each nearby venue in the json file
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    # Store all venue information for each neighborhood in the dataframe
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [11]:
LIMIT = 100 # Request up to 100 venues for each neighborhood.

venues = getNearbyVenues(names=neighborhood['Neighborhood'],
                                   latitudes=neighborhood['Latitude'],
                                   longitudes=neighborhood['Longitude']
                                  )

print(venues.shape)
venues.head()

(20990, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Melrose,40.824545,-73.910414,Porto Salvo,40.823887,-73.91291,Italian Restaurant
1,Melrose,40.824545,-73.910414,Perry Coffee Shop.,40.823433,-73.91094,Diner
2,Melrose,40.824545,-73.910414,McDonald's,40.825183,-73.908625,Fast Food Restaurant
3,Melrose,40.824545,-73.910414,Cinco de Mayo,40.8226,-73.911586,Mexican Restaurant
4,Melrose,40.824545,-73.910414,Old Bronx Courthouse,40.822894,-73.909565,Art Gallery
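
The list comprehension inside `getNearbyVenues` assumes every venue has at least one category. The following sketch flattens the same `items` structure while skipping venues that lack one. The helper `venues_to_rows` and the `sample_items` list are hypothetical and only mirror the fields the notebook reads (`venue.name`, `venue.location`, `venue.categories`).

```python
import pandas as pd

def venues_to_rows(name, lat, lng, items):
    """Flatten explore-style items into one dict per venue, skipping venues without a category."""
    rows = []
    for item in items:
        venue = item.get("venue", {})
        categories = venue.get("categories") or []
        if not categories:  # categories[0] would raise IndexError here
            continue
        rows.append({
            "Neighborhood": name,
            "Neighborhood Latitude": lat,
            "Neighborhood Longitude": lng,
            "Venue": venue["name"],
            "Venue Latitude": venue["location"]["lat"],
            "Venue Longitude": venue["location"]["lng"],
            "Venue Category": categories[0]["name"],
        })
    return rows

# Hypothetical items shaped like the fields the notebook reads.
sample_items = [
    {"venue": {"name": "Porto Salvo", "location": {"lat": 40.823887, "lng": -73.91291},
               "categories": [{"name": "Italian Restaurant"}]}},
    {"venue": {"name": "Uncategorized place", "location": {"lat": 40.82, "lng": -73.91},
               "categories": []}},
]

sample_venues = pd.DataFrame(venues_to_rows("Melrose", 40.824545, -73.910414, sample_items))
print(sample_venues)
```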


To perform k-means clustering, there has to be a way to measure the distance from each observation to a cluster center. I turn all venue categories into dummy variables: 1 if a venue belongs to the category and 0 if not. I then calculate the weight of each venue category in each neighborhood as the share of that neighborhood's venues falling in the category. The more venues of the same category a neighborhood has, the higher the weight of that category there.

In [57]:
print('There are {} unique categories.'.format(len(venues['Venue Category'].unique())))
# Create dummies for venue categories
onehot = pd.get_dummies(venues[['Venue Category']], prefix="", prefix_sep="")

# Add neighborhood column back to dataframe
onehot['Neighborhood'] = venues['Neighborhood'] 

grouped = onehot.groupby('Neighborhood').mean().reset_index()
grouped.head()

There are 496 unique categories.


Unnamed: 0,Neighborhood,ATM,Accessories Store,Adult Boutique,Afghan Restaurant,African Restaurant,...,Yoga Studio,Zoo,Zoo Exhibit
0,Arden Heights,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0
1,Bay Terrace,0.0,0.022727,0.0,0.0,0.0,...,0.0,0.0,0.0
2,Bayside,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0
3,Bayswater,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0
4,Beechhurst,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0
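
The weighting is easier to see on a small example. The sketch below uses a hypothetical six-venue frame (`toy`) with two made-up neighborhoods, `A` and `B`. Taking the mean of the dummy columns within a neighborhood gives the number of venues in a category divided by the total number of venues in that neighborhood, so each row sums to 1.

```python
import pandas as pd

# Hypothetical venues: neighborhood A has 4 venues, neighborhood B has 2.
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Park", "Pizza Place", "Pizza Place", "Pizza Place", "Park", "Gym"],
})

# One dummy column per category: 1.0 if the venue is in that category, else 0.0.
onehot_toy = pd.get_dummies(toy["Venue Category"], dtype=float)
onehot_toy["Neighborhood"] = toy["Neighborhood"]

# The mean of the dummies per neighborhood is the share of its venues in each category.
weights_toy = onehot_toy.groupby("Neighborhood").mean()
print(weights_toy)
```

In neighborhood `A`, three of four venues are pizza places, so `Pizza Place` gets weight 0.75 and `Park` gets 0.25. In `B`, `Gym` and `Park` each get 0.5.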


To simplify computation, I only use the venue categories that are among the top 15 in terms of weight for each neighborhood in this analysis.

In [58]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

num_top_venues = 15

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError: # ranks past 3rd take the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
venues_sorted = pd.DataFrame(columns=columns)
venues_sorted['Neighborhood'] = grouped['Neighborhood']

for ind in np.arange(grouped.shape[0]):
    venues_sorted.iloc[ind, 1:] = return_most_common_venues(grouped.iloc[ind, :], num_top_venues)

venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,11th Most Common Venue,12th Most Common Venue,13th Most Common Venue,14th Most Common Venue,15th Most Common Venue
0,Arden Heights,Bus Station,Pizza Place,Bowling Alley,Nightclub,Trail,Diner,Zoo Exhibit,Farmers Market,Farm,Falafel Restaurant,Factory,Fair,Field,Fabric Shop,Eye Doctor
1,Bay Terrace,Clothing Store,Lingerie Store,Women's Store,Cosmetics Shop,Donut Shop,Kids Store,Mobile Phone Shop,American Restaurant,Steakhouse,Bakery,Bank,Gluten-free Restaurant,Gift Shop,Men's Store,Furniture / Home Store
2,Bayside,Sandwich Place,Korean Restaurant,Bakery,Pharmacy,Chinese Restaurant,Bank,BBQ Joint,Coffee Shop,Tea Room,Mobile Phone Shop,Asian Restaurant,Donut Shop,Fast Food Restaurant,Taiwanese Restaurant,Greek Restaurant
3,Bayswater,Playground,Speakeasy,Zoo Exhibit,Fast Food Restaurant,Ethiopian Restaurant,Event Service,Event Space,Exhibit,Eye Doctor,Fabric Shop,Factory,Fair,Falafel Restaurant,Farm,Farmers Market
4,Beechhurst,Health & Beauty Service,Gym / Fitness Center,Gym,Bus Station,Park,Dog Run,Zoo Exhibit,Farmers Market,Event Space,Exhibit,Eye Doctor,Fabric Shop,Factory,Fair,Falafel Restaurant
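
In the table above, a neighborhood with only a few venues (Bayswater, for example) appears to have its remaining slots filled by categories whose weight is zero. The sketch below is one possible variant, not the notebook's method: it ranks only categories with a positive weight and pads the rest with `None`. The frame `weights_small`, the helper `top_categories`, and the `Rank N` column names are all hypothetical.

```python
import pandas as pd

# Hypothetical weight table: rows are neighborhoods, columns are categories.
weights_small = pd.DataFrame(
    {"Bakery": [0.5, 0.0], "Gym": [0.2, 0.7], "Park": [0.3, 0.3]},
    index=pd.Index(["A", "B"], name="Neighborhood"),
)

def top_categories(weights, n):
    """Return the n highest-weight category names per row, ignoring zero weights."""
    def top_n(row):
        names = row[row > 0].nlargest(n).index.tolist()
        return names + [None] * (n - len(names))  # pad rather than list zero-weight categories
    ranked = weights.apply(top_n, axis=1, result_type="expand")
    ranked.columns = ["Rank {}".format(i + 1) for i in range(n)]
    return ranked

top = top_categories(weights_small, 3)
print(top)
```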


K-means clustering is done with the scikit-learn library. Different numbers of clusters were each tested multiple times to find the number that provides the most meaningful decomposition of the neighborhoods.

In [59]:
# Import library
from sklearn.cluster import KMeans
# set number of clusters
kclusters = 6

grouped_clustering = grouped.drop(columns='Neighborhood') # positional axis arguments were removed in pandas 2.0

# Run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(grouped_clustering)
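
The report does not say how the candidate cluster counts were compared. One common option is the silhouette score, sketched below on synthetic data. Everything here is an assumption for illustration: the three well-separated blobs in `X`, the range of `k` values, and the use of `silhouette_score` as the criterion. None of it is taken from the notebook's data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the neighborhood weight matrix: three tight, well-separated blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

# Fit k-means for several cluster counts and record the mean silhouette score of each.
scores = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    scores[k] = silhouette_score(X, model.labels_)

best_k = max(scores, key=scores.get)
print(scores)
print("Best k by silhouette:", best_k)
```

On real venue data the scores are usually much closer together than on these blobs, so a criterion like this is a guide for inspection rather than a final answer.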

In [60]:
# Add clustering labels
venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
merged = neighborhood

# Merge the data frames to add latitude/longitude for each neighborhood
merged = merged.join(venues_sorted.set_index('Neighborhood'), on='Neighborhood')

merged.head() # check the last columns.

Unnamed: 0,City,State,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,11th Most Common Venue,12th Most Common Venue,13th Most Common Venue,14th Most Common Venue,15th Most Common Venue
0,New York,NY,Melrose,40.8245,-73.9104,0.0,Bus Station,Supermarket,Deli / Bodega,Convenience Store,Pizza Place,Laundromat,Fried Chicken Joint,Bus Stop,Fast Food Restaurant,Mexican Restaurant,Check Cashing Service,Gym / Fitness Center,Pharmacy,School,Donut Shop
1,New York,NY,Mott Haven,40.8091,-73.9229,0.0,Pizza Place,Fast Food Restaurant,Café,Discount Store,Lounge,Coffee Shop,Pharmacy,Gas Station,Asian Restaurant,Donut Shop,Art Gallery,Steakhouse,Bank,Bar,Mobile Phone Shop
2,New York,NY,Port Morris,40.8022,-73.9166,5.0,Peruvian Restaurant,Storage Facility,Distillery,Discount Store,Latin American Restaurant,Department Store,Spanish Restaurant,Burger Joint,Hardware Store,Food Truck,Chinese Restaurant,Exhibit,Event Space,Fast Food Restaurant,Fabric Shop
3,New York,NY,Hunts Point,40.8094,-73.8803,0.0,Grocery Store,BBQ Joint,Farmers Market,Spanish Restaurant,Café,Gourmet Shop,Bus Stop,Construction & Landscaping,Restaurant,Bank,Pizza Place,Fabric Shop,Eye Doctor,Exhibit,Event Space
4,New York,NY,Longwood,40.8248,-73.8916,0.0,Fast Food Restaurant,Mobile Phone Shop,Pharmacy,Metro Station,Park,Pizza Place,Donut Shop,Check Cashing Service,Spanish Restaurant,Rental Service,Supermarket,Train,Wings Joint,Sandwich Place,Gym / Fitness Center


Due to limitations of the venue data source, not all neighborhoods have valid venue data. These neighborhoods have to be dropped before the following steps.

In [93]:
merged.isnull().sum() # Check number of neighborhoods that failed to get enough venue information.

City                       0
State                      0
Neighborhood               0
Latitude                   0
Longitude                  0
Cluster Labels            10
1st Most Common Venue     10
2nd Most Common Venue     10
3rd Most Common Venue     10
4th Most Common Venue     10
5th Most Common Venue     10
6th Most Common Venue     10
7th Most Common Venue     10
8th Most Common Venue     10
9th Most Common Venue     10
10th Most Common Venue    10
11th Most Common Venue    10
12th Most Common Venue    10
13th Most Common Venue    10
14th Most Common Venue    10
15th Most Common Venue    10
dtype: int64

In [61]:
clean=merged.dropna(subset=['Cluster Labels']) # Drop neighborhoods that were not assigned any common venues.
clean.shape

(690, 21)
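
The `0.0` and `5.0` values in `Cluster Labels` are a side effect of the join: neighborhoods without any venue rows receive `NaN`, which forces the column to float. A minimal sketch on made-up data (the two small frames below are hypothetical stand-ins for `neighborhood` and `venues_sorted`) shows the effect and one possible way to restore integer labels after dropping the missing rows.

```python
import pandas as pd

# Hypothetical stand-ins for `neighborhood` and `venues_sorted`.
neighborhood_demo = pd.DataFrame({
    'Neighborhood': ['A', 'B', 'C'],
    'Latitude': [40.1, 40.2, 40.3],
})
venues_demo = pd.DataFrame({
    'Cluster Labels': [0, 2],
    'Neighborhood': ['A', 'C'],   # 'B' returned no venues
})

# Left join on the neighborhood name, as in the merge cell above.
merged_demo = neighborhood_demo.join(
    venues_demo.set_index('Neighborhood'), on='Neighborhood')

# 'B' has no match, so its label is NaN and the column becomes float.
n_missing = int(merged_demo['Cluster Labels'].isnull().sum())

# Drop unmatched rows, then cast the labels back to integers.
clean_demo = merged_demo.dropna(subset=['Cluster Labels']).copy()
clean_demo['Cluster Labels'] = clean_demo['Cluster Labels'].astype(int)

print(n_missing)
print(clean_demo)
```

The cast is optional for this notebook, since the later cells either wrap the label in `int(...)` or compare it with a number.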

Given the nature of this study, it is hard to tell how many clusters I should set. Unlike analyses with a limited set of outcome categories, the definition of "similar neighborhoods" is vague and cannot be quantified objectively. Therefore, I compare the share of each cluster in each city across different numbers of clusters (3 to 8). The goal is to have enough clusters to separate relatively similar neighborhoods into groups without creating too many clusters that contain only outliers.

I start with 3 clusters. For each city, I calculate the share of neighborhoods in each cluster.

In [110]:
crosstab = pd.crosstab(clean['City'],clean['Cluster Labels']).apply(lambda r: r/r.sum(), axis=1)
crosstab.columns=['Type I', 'Type II', 'Type III']
crosstab = crosstab.rename_axis(None)
crosstab.style.format("{:.2%}")

Unnamed: 0,Type I,Type II,Type III
Chicago,86.07%,0.00%,13.93%
New York,95.31%,1.25%,3.44%
Seattle,84.13%,2.38%,13.49%


In [50]:
crosstab = pd.crosstab(clean['City'],clean['Cluster Labels']).apply(lambda r: r/r.sum(), axis=1)
crosstab.columns=['Type I', 'Type II', 'Type III', 'Type IV']
crosstab = crosstab.rename_axis(None)
crosstab.style.format("{:.2%}")

Unnamed: 0,Type I,Type II,Type III,Type IV
Chicago,15.16%,81.97%,0.00%,2.87%
New York,4.69%,93.75%,1.25%,0.31%
Seattle,11.90%,81.75%,1.59%,4.76%


In [56]:
crosstab = pd.crosstab(clean['City'],clean['Cluster Labels']).apply(lambda r: r/r.sum(), axis=1)
crosstab.columns=['Type I', 'Type II', 'Type III', 'Type IV', 'Type V']
crosstab = crosstab.rename_axis(None)
crosstab.style.format("{:.2%}")

Unnamed: 0,Type I,Type II,Type III,Type IV,Type V
Chicago,0.00%,63.93%,13.52%,20.90%,1.64%
New York,1.25%,53.44%,3.12%,41.88%,0.31%
Seattle,1.59%,76.19%,12.70%,7.94%,1.59%


In [62]:
crosstab = pd.crosstab(clean['City'],clean['Cluster Labels']).apply(lambda r: r/r.sum(), axis=1)
crosstab.columns=['Type I', 'Type II', 'Type III', 'Type IV', 'Type V', 'Type VI']
crosstab = crosstab.rename_axis(None)
crosstab.style.format("{:.2%}")

Unnamed: 0,Type I,Type II,Type III,Type IV,Type V,Type VI
Chicago,22.54%,2.87%,11.07%,13.93%,0.00%,49.59%
New York,51.88%,0.31%,1.56%,4.06%,0.31%,41.88%
Seattle,5.56%,4.76%,1.59%,11.90%,0.79%,75.40%


In [23]:
crosstab = pd.crosstab(clean['City'],clean['Cluster Labels']).apply(lambda r: r/r.sum(), axis=1)
crosstab.columns=['Type I', 'Type II', 'Type III', 'Type IV', 'Type V', 'Type VI', 'Type VII']
crosstab = crosstab.rename_axis(None)
crosstab.style.format("{:.2%}")

Unnamed: 0,Type I,Type II,Type III,Type IV,Type V,Type VI,Type VII
Chicago,2.87%,13.93%,48.36%,21.31%,0.00%,13.52%,0.00%
New York,0.31%,3.75%,45.94%,38.44%,1.25%,9.06%,1.25%
Seattle,4.76%,11.90%,70.63%,10.32%,1.59%,0.79%,0.00%


In [29]:
crosstab = pd.crosstab(clean['City'],clean['Cluster Labels']).apply(lambda r: r/r.sum(), axis=1)
crosstab.columns=['Type I', 'Type II', 'Type III', 'Type IV', 'Type V', 'Type VI', 'Type VII', 'Type VIII']
crosstab = crosstab.rename_axis(None)
crosstab.style.format("{:.2%}")

Unnamed: 0,Type I,Type II,Type III,Type IV,Type V,Type VI,Type VII,Type VIII
Chicago,1.23%,8.20%,12.70%,14.34%,59.84%,2.87%,0.82%,0.00%
New York,2.19%,0.31%,40.00%,3.75%,50.31%,0.31%,2.81%,0.31%
Seattle,2.38%,1.59%,4.76%,11.90%,72.22%,4.76%,2.38%,0.00%


Comparing across the different numbers of clusters, I proceed with 6 clusters, as this provides a balance between similar groups and outliers.
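
The comparison above was run by changing `kclusters` and re-running the cells by hand. As a rough sketch, the same share tables could be produced in one loop. The data below is synthetic, the helper name `cluster_shares` is hypothetical, and `n_init=10` is set explicitly as an assumption; `normalize='index'` in `pd.crosstab` is used here in place of the `apply(lambda r: r/r.sum(), axis=1)` step.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for the notebook's data: 60 "neighborhoods" in
# three cities, each described by 5 venue-frequency features.
rng = np.random.default_rng(0)
features = rng.random((60, 5))
cities = np.repeat(['Chicago', 'New York', 'Seattle'], 20)

def cluster_shares(features, cities, k, seed=0):
    """Fit k-means with k clusters and return each city's share per cluster."""
    labels = KMeans(n_clusters=k, random_state=seed, n_init=10).fit_predict(features)
    # normalize='index' makes every row (city) sum to 1.
    return pd.crosstab(pd.Series(cities, name='City'),
                       pd.Series(labels, name='Cluster Labels'),
                       normalize='index')

# One share table per candidate number of clusters (3 to 8).
shares_by_k = {k: cluster_shares(features, cities, k) for k in range(3, 9)}
for k, shares in shares_by_k.items():
    print(f'k = {k}')
    print(shares.round(4))
```

Because k-means numbers its clusters arbitrarily, the column that holds the dominant group moves from run to run, which is also visible in the tables above.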

__Part 3. Results__

After clustering all neighborhoods, it is time to show the results. Next, I plot all neighborhoods on a map, with different clusters in different colors. The maps are essentially the same, but for visual purposes I created 3 maps with different starting locations in order to show the neighborhoods in all 3 cities.

In [63]:
# Get latitude and longitude of each city.
response = requests.get('https://maps.googleapis.com/maps/api/geocode/json?address=New+York,+NY&key={}'.format(key))
resp_json = response.json()
NYC_Latitude = resp_json['results'][0]['geometry']['location']['lat']
NYC_Longitude = resp_json['results'][0]['geometry']['location']['lng']

response = requests.get('https://maps.googleapis.com/maps/api/geocode/json?address=Chicago,+IL&key={}'.format(key))
resp_json = response.json()
Chicago_Latitude = resp_json['results'][0]['geometry']['location']['lat']
Chicago_Longitude = resp_json['results'][0]['geometry']['location']['lng']

response = requests.get('https://maps.googleapis.com/maps/api/geocode/json?address=Seattle,+WA&key={}'.format(key))
resp_json = response.json()
Seattle_Latitude = resp_json['results'][0]['geometry']['location']['lat']
Seattle_Longitude = resp_json['results'][0]['geometry']['location']['lng']
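
The three requests above differ only in the address, so they could be folded into one helper. The sketch below is an assumption-laden illustration, not part of the original notebook: the function names `extract_lat_lng` and `geocode_city` are hypothetical, and the only response fields it relies on are the ones already read above (`results[0]['geometry']['location']`). The parsing step is checked offline against a hand-written payload.

```python
import requests

GEOCODE_URL = 'https://maps.googleapis.com/maps/api/geocode/json'

def extract_lat_lng(resp_json):
    """Return (lat, lng) from the first result of a geocoding response."""
    results = resp_json.get('results', [])
    if not results:
        raise ValueError('geocoding response contains no results')
    location = results[0]['geometry']['location']
    return location['lat'], location['lng']

def geocode_city(address, key):
    """Query the geocoding endpoint used above for one address (hypothetical helper)."""
    response = requests.get(GEOCODE_URL,
                            params={'address': address, 'key': key},
                            timeout=10)
    response.raise_for_status()
    return extract_lat_lng(response.json())

# Offline check with a hand-written payload that mirrors the fields read above.
sample = {'results': [{'geometry': {'location': {'lat': 1.5, 'lng': -2.5}}}]}
print(extract_lat_lng(sample))
```

With a valid API key stored in `key`, a call such as `NYC_Latitude, NYC_Longitude = geocode_city('New York, NY', key)` would replace each block of four lines.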

New York City neighborhoods by clusters.

In [64]:
# Load folium library for mapping
import folium
NYC_clusters = folium.Map(location=[NYC_Latitude, NYC_Longitude], zoom_start=10)

# set color scheme for the clusters
import matplotlib.cm as cm
import matplotlib.colors as colors
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(clean['Latitude'], clean['Longitude'], clean['Neighborhood'], clean['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)-1],
        fill=True,
        fill_color=rainbow[int(cluster)-1],
        fill_opacity=0.7).add_to(NYC_clusters)
       
NYC_clusters

Chicago neighborhoods by clusters.

In [65]:
Chicago_clusters = folium.Map(location=[Chicago_Latitude, Chicago_Longitude], zoom_start=10)

for lat, lon, poi, cluster in zip(clean['Latitude'], clean['Longitude'], clean['Neighborhood'], clean['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)-1],
        fill=True,
        fill_color=rainbow[int(cluster)-1],
        fill_opacity=0.7).add_to(Chicago_clusters)
       
Chicago_clusters

Seattle neighborhoods by clusters.

In [66]:
Seattle_clusters = folium.Map(location=[Seattle_Latitude, Seattle_Longitude], zoom_start=11)

markers_colors = []
for lat, lon, poi, cluster in zip(clean['Latitude'], clean['Longitude'], clean['Neighborhood'], clean['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)-1],
        fill=True,
        fill_color=rainbow[int(cluster)-1],
        fill_opacity=0.7).add_to(Seattle_clusters)
       
Seattle_clusters
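
The three map cells repeat the same color lookup and marker loop. A minimal sketch of how that shared part might be factored out is given below; `cluster_palette` and `marker_specs` are hypothetical names, and the sketch indexes the palette by the label itself rather than by `int(cluster)-1` as in the cells above (both give a consistent one-color-per-cluster mapping). It stops short of drawing so that it runs without a map object.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

def cluster_palette(n_clusters):
    """Return one hex color per cluster, evenly spaced on the rainbow colormap."""
    rgba = cm.rainbow(np.linspace(0, 1, n_clusters))
    return [colors.to_hex(c) for c in rgba]

def marker_specs(rows, palette):
    """Yield (location, popup_text, color) for each (lat, lon, name, cluster) row."""
    for lat, lon, name, cluster in rows:
        label = int(cluster)          # labels arrive as floats such as 5.0
        yield [lat, lon], f'{name} Cluster {label}', palette[label]

palette = cluster_palette(6)
rows = [(40.8245, -73.9104, 'Melrose', 0.0),
        (40.8022, -73.9166, 'Port Morris', 5.0)]
for spec in marker_specs(rows, palette):
    print(spec)
```

Each yielded tuple carries what a `folium.CircleMarker` call in the cells above needs: the location, the popup text, and the color used for both `color` and `fill_color`.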

Now that all neighborhoods are assigned to clusters, I explore what these clusters look like. It is good to know that several neighborhoods are similar to each other, but it is more informative to know which features differ among clusters.

To perform this task, I melt the top venues table to allow counting of venue categories by cluster number.

In [86]:
types=clean.drop(['City', 'State', 'Neighborhood', 'Latitude','Longitude'], axis=1)
types=types.melt('Cluster Labels', value_name='venues').drop('variable', axis=1)
types.head(5)

Unnamed: 0,Cluster Labels,venues
0,0.0,Bus Station
1,0.0,Pizza Place
2,5.0,Peruvian Restaurant
3,0.0,Grocery Store
4,0.0,Fast Food Restaurant


Show the top 10 most popular venue categories in cluster 1.

In [120]:
type1=types.drop(types[types['Cluster Labels']!=0].index)
crosstab_type1 = pd.crosstab(type1['Cluster Labels'],type1['venues']).apply(lambda r: r/r.sum(), axis=1)
crosstab_type1 = crosstab_type1.rename_axis(None)
crosstab_type1 = crosstab_type1.reset_index(drop=True)
crosstab_type1.index=['Type I']
crosstab_type1 = crosstab_type1.sort_values(by = 'Type I', ascending=False, axis=1)
crosstab_type1.columns.name = None
crosstab_type1=crosstab_type1.iloc[:,0:10]
crosstab_type1.style.format("{:.2%}")


Unnamed: 0,Pizza Place,Fast Food Restaurant,Pharmacy,Sandwich Place,Chinese Restaurant,Donut Shop,Deli / Bodega,Factory,Fabric Shop,Bank
Type I,4.71%,3.04%,2.75%,2.63%,2.43%,2.40%,2.19%,2.05%,2.02%,2.02%


Top 10 most popular venue categories in cluster 2

In [122]:
type2=types.drop(types[types['Cluster Labels']!=1].index)
crosstab_type2 = pd.crosstab(type2['Cluster Labels'],type2['venues']).apply(lambda r: r/r.sum(), axis=1)
crosstab_type2 = crosstab_type2.rename_axis(None)
crosstab_type2 = crosstab_type2.reset_index(drop=True)
crosstab_type2.index=['Type II']
crosstab_type2 = crosstab_type2.sort_values(by = 'Type II', ascending=False, axis=1)
crosstab_type2.columns.name = None
crosstab_type2=crosstab_type2.iloc[:,0:10]
crosstab_type2.style.format("{:.2%}")

Unnamed: 0,Farm,Fabric Shop,Park,Falafel Restaurant,Fair,Factory,Event Service,Event Space,Exhibit,Eye Doctor
Type II,6.67%,6.67%,6.67%,6.67%,6.67%,6.67%,6.67%,6.67%,6.67%,6.67%


Top 10 most popular venue categories in cluster 3

In [125]:
type3=types.drop(types[types['Cluster Labels']!=2].index)
crosstab_type3 = pd.crosstab(type3['Cluster Labels'],type3['venues']).apply(lambda r: r/r.sum(), axis=1)
crosstab_type3 = crosstab_type3.rename_axis(None)
crosstab_type3 = crosstab_type3.reset_index(drop=True)
crosstab_type3.index=['Type III']
crosstab_type3 = crosstab_type3.sort_values(by = 'Type III', ascending=False, axis=1)
crosstab_type3.columns.name = None
crosstab_type3=crosstab_type3.iloc[:,0:10]
crosstab_type3.style.format("{:.2%}")

Unnamed: 0,Mexican Restaurant,Pizza Place,Fast Food Restaurant,Fabric Shop,Fair,Grocery Store,Eye Doctor,Factory,Park,Exhibit
Type III,5.88%,4.51%,3.33%,3.33%,2.75%,2.75%,2.55%,2.55%,2.35%,2.35%


Top 10 most popular venue categories in cluster 4

In [126]:
type4=types.drop(types[types['Cluster Labels']!=3].index)
crosstab_type4 = pd.crosstab(type4['Cluster Labels'],type4['venues']).apply(lambda r: r/r.sum(), axis=1)
crosstab_type4 = crosstab_type4.rename_axis(None)
crosstab_type4 = crosstab_type4.reset_index(drop=True)
crosstab_type4.index=['Type IV']
crosstab_type4 = crosstab_type4.sort_values(by = 'Type IV', ascending=False, axis=1)
crosstab_type4.columns.name = None
crosstab_type4=crosstab_type4.iloc[:,0:10]
crosstab_type4.style.format("{:.2%}")

Unnamed: 0,Park,Eye Doctor,Exhibit,Factory,Fair,Fabric Shop,Event Space,Farm,Falafel Restaurant,Fast Food Restaurant
Type IV,6.67%,6.13%,6.02%,6.02%,5.91%,5.70%,5.59%,5.05%,4.95%,3.44%


Top 10 most popular venue categories in cluster 5

In [127]:
type5=types.drop(types[types['Cluster Labels']!=4].index)
crosstab_type5 = pd.crosstab(type5['Cluster Labels'],type5['venues']).apply(lambda r: r/r.sum(), axis=1)
crosstab_type5 = crosstab_type5.rename_axis(None)
crosstab_type5 = crosstab_type5.reset_index(drop=True)
crosstab_type5.index=['Type V']
crosstab_type5 = crosstab_type5.sort_values(by = 'Type V', ascending=False, axis=1)
crosstab_type5.columns.name = None
crosstab_type5=crosstab_type5.iloc[:,0:10]
crosstab_type5.style.format("{:.2%}")

Unnamed: 0,Fair,Factory,Scenic Lookout,Farmers Market,Farm,Falafel Restaurant,Ethiopian Restaurant,Zoo Exhibit,Fabric Shop,Eye Doctor
Type V,6.67%,6.67%,6.67%,6.67%,6.67%,6.67%,6.67%,6.67%,6.67%,6.67%


Top 10 most popular venue categories in cluster 6

In [128]:
type6=types.drop(types[types['Cluster Labels']!=5].index)
crosstab_type6 = pd.crosstab(type6['Cluster Labels'],type6['venues']).apply(lambda r: r/r.sum(), axis=1)
crosstab_type6 = crosstab_type6.rename_axis(None)
crosstab_type6 = crosstab_type6.reset_index(drop=True)
crosstab_type6.index=['Type VI']
crosstab_type6 = crosstab_type6.sort_values(by = 'Type VI', ascending=False, axis=1)
crosstab_type6.columns.name = None
crosstab_type6=crosstab_type6.iloc[:,0:10]
crosstab_type6.style.format("{:.2%}")

Unnamed: 0,Coffee Shop,Bar,Pizza Place,Bakery,Italian Restaurant,Mexican Restaurant,Sandwich Place,American Restaurant,Park,Gym
Type VI,4.10%,2.78%,2.69%,2.04%,2.04%,1.92%,1.85%,1.77%,1.77%,1.56%
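
The six cells above apply the same steps to one cluster label at a time. As a sketch of an alternative, a single helper could return the top categories for every cluster at once. The name `top_venue_shares` is hypothetical, and it uses `value_counts(normalize=True)` instead of the crosstab approach; the demo table is made up but has the same two columns as `types`.

```python
import pandas as pd

def top_venue_shares(types, n=10):
    """Map each cluster label to its n most frequent venue categories and their shares."""
    result = {}
    for label, group in types.groupby('Cluster Labels'):
        # normalize=True turns counts into shares that sum to 1 per cluster.
        result[label] = group['venues'].value_counts(normalize=True).head(n)
    return result

# Small made-up table in the same long format as `types`.
types_demo = pd.DataFrame({
    'Cluster Labels': [0, 0, 0, 0, 1, 1],
    'venues': ['Pizza Place', 'Pizza Place', 'Pizza Place', 'Pharmacy', 'Park', 'Park'],
})
for label, shares in top_venue_shares(types_demo, n=2).items():
    print(label, shares.to_dict())
```

Applied to the notebook's `types` table, `top_venue_shares(types)[0]` would correspond to the Type I row shown earlier, since cluster label 0 is reported as Type I.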


End of code. Please see the report for a detailed introduction and discussion.