<h1 align=center><font size = 5>Segmenting and Clustering Queens Neighborhoods in New York City</font></h1>

## Introduction

In this project, we convert addresses into their equivalent latitude and longitude values and use the Foursquare API to explore neighborhoods in Queens, New York City. We use the **explore** function to find Chinese venues in each Queens neighborhood, and then use the number of venues per neighborhood as the feature for grouping the neighborhoods into clusters. We use the *k*-means clustering algorithm to complete this task. Finally, we use the Folium library to visualize the neighborhoods in Queens and their emerging clusters. This study is meant to support the decision of where to open a Chinese restaurant in Queens.

## Table of Contents

<div style="margin-top: 20px">

<font size = 3>

1. <a href="#item1">Download and Explore Dataset</a>

2. <a href="#item2">Explore Neighborhoods in Queens, NY</a>

3. <a href="#item3">Cluster Neighborhoods</a>

4. <a href="#item4">Examine Clusters</a>    
</font>
</div>

Before we get the data and start exploring it, let's install and import all the dependencies that we will need.

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes # installs geopy; comment this line out if geopy is already installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.

Libraries imported.


<a id='item1'></a>

## 1. Download and Explore Dataset

New York City has a total of 5 boroughs and 306 neighborhoods. In order to segment the neighborhoods and explore them, we need a dataset that contains the 5 boroughs and the neighborhoods in each borough, as well as the latitude and longitude coordinates of each neighborhood.

Luckily, this dataset exists for free on the web. Feel free to try to find this dataset on your own, but here is the link to the dataset: https://geo.nyu.edu/catalog/nyu_2451_34572

For your convenience, I downloaded the file and placed it on the server, so you can simply run a `wget` command to access the data. Let's go ahead and do that.

In [2]:
!wget -q -O 'newyork_data.json' https://cocl.us/new_york_dataset
print('Data downloaded!')

Data downloaded!


#### Load and explore the data

Next, let's load the data.

In [3]:
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)

In [4]:
newyork_data.keys()

dict_keys(['type', 'totalFeatures', 'features', 'crs', 'bbox'])

Let's take a quick look at the data.

Notice how all the relevant data is in the *features* key, which is basically a list of the neighborhoods. So, let's define a new variable that includes this data.

In [5]:
neighborhoods_data = newyork_data['features']
#neighborhoods_data

In [6]:
len(neighborhoods_data)

306

Let's take a look at the first item in this list.

In [7]:
neighborhoods_data[0]

{'type': 'Feature',
 'id': 'nyu_2451_34572.1',
 'geometry': {'type': 'Point',
  'coordinates': [-73.84720052054902, 40.89470517661]},
 'geometry_name': 'geom',
 'properties': {'name': 'Wakefield',
  'stacked': 1,
  'annoline1': 'Wakefield',
  'annoline2': None,
  'annoline3': None,
  'annoangle': 0.0,
  'borough': 'Bronx',
  'bbox': [-73.84720052054902,
   40.89470517661,
   -73.84720052054902,
   40.89470517661]}}

#### Transform the data into a *pandas* dataframe

The next task is essentially transforming this data of nested Python dictionaries into a *pandas* dataframe. So let's start by creating an empty dataframe.

In [8]:
# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# instantiate the dataframe
neighborhoods = pd.DataFrame(columns=column_names)

Take a look at the empty dataframe to confirm that the columns are as intended.

In [9]:
neighborhoods

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude


Then let's loop through the data and fill the dataframe one row at a time.

In [10]:
for data in neighborhoods_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']
        
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]
    
    neighborhoods = neighborhoods.append({'Borough': borough,
                                          'Neighborhood': neighborhood_name,
                                          'Latitude': neighborhood_lat,
                                          'Longitude': neighborhood_lon}, ignore_index=True)
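
`DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, so the loop above fails on current pandas versions. A minimal sketch of an alternative, assuming the same GeoJSON feature layout shown earlier: flatten each feature into a plain dictionary first and build the dataframe once at the end. The helper name `feature_to_row` and the two-item `sample_features` list are introduced here for illustration only.

```python
# Hypothetical helper: flatten one GeoJSON feature into a flat row.
def feature_to_row(feature):
    # GeoJSON stores point coordinates as [longitude, latitude]
    lon, lat = feature['geometry']['coordinates']
    props = feature['properties']
    return {
        'Borough': props['borough'],
        'Neighborhood': props['name'],
        'Latitude': lat,
        'Longitude': lon,
    }

# Two features trimmed to the fields used here (values copied from this notebook's outputs)
sample_features = [
    {'geometry': {'type': 'Point', 'coordinates': [-73.84720052054902, 40.89470517661]},
     'properties': {'name': 'Wakefield', 'borough': 'Bronx'}},
    {'geometry': {'type': 'Point', 'coordinates': [-73.915654, 40.768509]},
     'properties': {'name': 'Astoria', 'borough': 'Queens'}},
]

rows = [feature_to_row(f) for f in sample_features]
print(rows[0]['Neighborhood'], rows[0]['Latitude'])
```

With the real data, `pd.DataFrame([feature_to_row(f) for f in neighborhoods_data], columns=column_names)` should produce the same four columns in a single step.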

Quickly examine the resulting dataframe.

In [11]:
neighborhoods.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Bronx,Wakefield,40.894705,-73.847201
1,Bronx,Co-op City,40.874294,-73.829939
2,Bronx,Eastchester,40.887556,-73.827806
3,Bronx,Fieldston,40.895437,-73.905643
4,Bronx,Riverdale,40.890834,-73.912585


In [12]:
neighborhoods.tail()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
301,Manhattan,Hudson Yards,40.756658,-74.000111
302,Queens,Hammels,40.587338,-73.80553
303,Queens,Bayswater,40.611322,-73.765968
304,Queens,Queensbridge,40.756091,-73.945631
305,Staten Island,Fox Hills,40.617311,-74.08174


And make sure that the dataset has all 5 boroughs and 306 neighborhoods.

In [13]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(neighborhoods['Borough'].unique()),
        neighborhoods.shape[0]
    )
)

The dataframe has 5 boroughs and 306 neighborhoods.


## 2. Explore Neighborhoods in Queens, NY

First, get a dataframe of only the Queens neighborhoods.

For the purposes of this project, we focus only on the neighborhoods in Queens. So let's slice the original dataframe and create a new dataframe of the Queens data.

In [14]:
queens_data = neighborhoods[neighborhoods['Borough'] == 'Queens'].reset_index(drop=True)
queens_data.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Queens,Astoria,40.768509,-73.915654
1,Queens,Woodside,40.746349,-73.901842
2,Queens,Jackson Heights,40.751981,-73.882821
3,Queens,Elmhurst,40.744049,-73.881656
4,Queens,Howard Beach,40.654225,-73.838138


In [15]:
queens_data.tail()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
76,Queens,Middle Village,40.716415,-73.881143
77,Queens,Malba,40.790602,-73.826678
78,Queens,Hammels,40.587338,-73.80553
79,Queens,Bayswater,40.611322,-73.765968
80,Queens,Queensbridge,40.756091,-73.945631


In [16]:
queens_data.shape

(81, 4)

Let's get the geographical coordinates of Queens.

In [17]:
address = 'Queens, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Queens are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Queens are 40.7498243, -73.7976337.


Let's visualize Queens and the neighborhoods in it.

In [18]:
# create map of Queens using latitude and longitude values
map_queens = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(queens_data['Latitude'], queens_data['Longitude'], queens_data['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_queens)  
    
map_queens

### Using the Foursquare API to explore Chinese venues in each Queens neighborhood

#### Define Foursquare Credentials and Version

In [19]:
CLIENT_ID = 'YXTPL4ARZNIEZNQ1T1ZA0B0PCWBHDMK0MNBGIQNZGAZY1FDL' # your Foursquare ID
CLIENT_SECRET = 'ORFZYQ5WKRNZH3GHXM44X44URXZEGTAQ0DZQXXD00YSLNK2H' # your Foursquare Secret
VERSION = '20200605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: YXTPL4ARZNIEZNQ1T1ZA0B0PCWBHDMK0MNBGIQNZGAZY1FDL
CLIENT_SECRET:ORFZYQ5WKRNZH3GHXM44X44URXZEGTAQ0DZQXXD00YSLNK2H
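
Hard-coding and printing credentials in a shared notebook exposes them to every reader. A minimal sketch of one alternative is to read them from environment variables and mask them when printing. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and both helper functions are assumptions made for this example; nothing in the Foursquare API requires these names.

```python
import os

def load_foursquare_credentials(env=None):
    """Return (client_id, client_secret) read from environment variables.

    The variable names are an assumption made for this sketch.
    """
    env = os.environ if env is None else env
    client_id = env.get('FOURSQUARE_CLIENT_ID')
    client_secret = env.get('FOURSQUARE_CLIENT_SECRET')
    if not client_id or not client_secret:
        raise RuntimeError('Set FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET first.')
    return client_id, client_secret

def mask(secret, visible=4):
    """Show only the first few characters when printing a secret."""
    return secret[:visible] + '*' * max(len(secret) - visible, 0)

# Demo with placeholder values instead of real credentials
demo_env = {'FOURSQUARE_CLIENT_ID': 'demo-id-1234',
            'FOURSQUARE_CLIENT_SECRET': 'demo-secret-5678'}
demo_id, demo_secret = load_foursquare_credentials(demo_env)
print('CLIENT_ID: ' + mask(demo_id))
```

Calling `load_foursquare_credentials()` with no argument reads from the real process environment, so the secret never has to appear in the notebook itself.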


#### First create a function to find Chinese venues by neighborhood.

In [20]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    query_term = 'Chinese'
    LIMIT = 100
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}&query={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT,
            query_term)
            
        # make the GET request
        results = requests.get(url).json()['response']['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
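
The function above builds the URL by string formatting and assumes that every venue has at least one category. A hedged sketch of two small helpers that relax both assumptions follows. The names `build_explore_url` and `venue_rows` are hypothetical, and `sample_payload` is hand-written to mimic the nesting that the function reads (`response`, then `groups[0]`, then `items`, then `venue`); it is not a real API response, and its second venue is invented to exercise the missing-category case.

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = 'https://api.foursquare.com/v2/venues/explore'

def build_explore_url(client_id, client_secret, version, lat, lng,
                      radius=500, limit=100, query='Chinese'):
    """Build the request URL with properly encoded query parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
        'query': query,
    }
    return EXPLORE_ENDPOINT + '?' + urlencode(params)

def venue_rows(name, lat, lng, payload):
    """Turn one decoded JSON payload into row tuples, tolerating missing pieces."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        # fall back to a placeholder when a venue has no category
        categories = venue.get('categories') or [{'name': 'Unknown'}]
        rows.append((name, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'],
                     categories[0]['name']))
    return rows

# Hand-written payload; the second venue is made up for the example
sample_payload = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Sampan',
               'location': {'lat': 40.764319, 'lng': -73.914666},
               'categories': [{'name': 'Chinese Restaurant'}]}},
    {'venue': {'name': 'Example Venue Without Category',
               'location': {'lat': 40.76, 'lng': -73.91},
               'categories': []}},
]}]}}

url = build_explore_url('ID', 'SECRET', '20200605', 40.768509, -73.915654)
parsed = venue_rows('Astoria', 40.768509, -73.915654, sample_payload)
print(parsed[0][3], '|', parsed[1][6])
```

Inside `getNearbyVenues`, the list comprehension over `results` could be swapped for a call to a helper like `venue_rows`, so that one venue without a category does not abort the whole run.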

#### Next write the code to run the above function on each neighborhood and create a new dataframe called *queens_chinese_venues*.

In [21]:
# query Foursquare for Chinese venues around every Queens neighborhood

queens_chinese_venues = getNearbyVenues(names=queens_data['Neighborhood'],
                                   latitudes=queens_data['Latitude'],
                                   longitudes=queens_data['Longitude']
                                  )



Astoria
Woodside
Jackson Heights
Elmhurst
Howard Beach
Corona
Forest Hills
Kew Gardens
Richmond Hill
Flushing
Long Island City
Sunnyside
East Elmhurst
Maspeth
Ridgewood
Glendale
Rego Park
Woodhaven
Ozone Park
South Ozone Park
College Point
Whitestone
Bayside
Auburndale
Little Neck
Douglaston
Glen Oaks
Bellerose
Kew Gardens Hills
Fresh Meadows
Briarwood
Jamaica Center
Oakland Gardens
Queens Village
Hollis
South Jamaica
St. Albans
Rochdale
Springfield Gardens
Cambria Heights
Rosedale
Far Rockaway
Broad Channel
Breezy Point
Steinway
Beechhurst
Bay Terrace
Edgemere
Arverne
Rockaway Beach
Neponsit
Murray Hill
Floral Park
Holliswood
Jamaica Estates
Queensboro Hill
Hillcrest
Ravenswood
Lindenwood
Laurelton
Lefrak City
Belle Harbor
Rockaway Park
Somerville
Brookville
Bellaire
North Corona
Forest Hills Gardens
Jamaica Hills
Utopia
Pomonok
Astoria Heights
Hunters Point
Sunnyside Gardens
Blissville
Roxbury
Middle Village
Malba
Hammels
Bayswater
Queensbridge


#### Let's check the size of the resulting dataframe

In [22]:
print(queens_chinese_venues.shape)
queens_chinese_venues.head()

(352, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Astoria,40.768509,-73.915654,Sampan,40.764319,-73.914666,Chinese Restaurant
1,Astoria,40.768509,-73.915654,Golden Kitchen,40.770366,-73.919031,Chinese Restaurant
2,Astoria,40.768509,-73.915654,Golden House Chinese Restaurant,40.765193,-73.91776,Chinese Restaurant
3,Astoria,40.768509,-73.915654,Golden Hands Bodywork - Chinese Qi Gong Tui Na,40.765279,-73.917666,Spa
4,Woodside,40.746349,-73.901842,SriPraPhai,40.746342,-73.899248,Thai Restaurant


#### Drop the non-restaurant rows.

In [23]:
# categories that are clearly not restaurants
non_restaurant_categories = ['Spa', 'Church', 'Grocery Store']
queens_chinese_res = queens_chinese_venues[~queens_chinese_venues['Venue Category'].isin(non_restaurant_categories)]

In [24]:
print(queens_chinese_res.shape)
queens_chinese_res.head()

(349, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Astoria,40.768509,-73.915654,Sampan,40.764319,-73.914666,Chinese Restaurant
1,Astoria,40.768509,-73.915654,Golden Kitchen,40.770366,-73.919031,Chinese Restaurant
2,Astoria,40.768509,-73.915654,Golden House Chinese Restaurant,40.765193,-73.91776,Chinese Restaurant
4,Woodside,40.746349,-73.901842,SriPraPhai,40.746342,-73.899248,Thai Restaurant
5,Woodside,40.746349,-73.901842,Peking BBQ Chicken,40.745488,-73.906053,Chinese Restaurant


#### Let's find out how many unique neighborhoods appear among the returned Chinese restaurants.

In [25]:
print('There are {} unique neighborhoods in Queens, NY with Chinese restaurants.'.format(len(queens_chinese_res['Neighborhood'].unique())))

There are 61 unique neighborhoods in Queens, NY with Chinese restaurants.


#### Let's check how many Chinese restaurants were returned for each neighborhood.

In [26]:
df1 = queens_chinese_res.groupby('Neighborhood').count()
print(df1.shape)
df1.head()

(61, 6)


Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Astoria,3,3,3,3,3,3
Astoria Heights,2,2,2,2,2,2
Bayside,5,5,5,5,5,5
Beechhurst,2,2,2,2,2,2
Bellaire,3,3,3,3,3,3


#### Make a new dataframe with only the Chinese restaurant counts.

In [27]:
df2 = df1[['Venue']]
print(df2.shape)
df2.head()

(61, 1)


Unnamed: 0_level_0,Venue
Neighborhood,Unnamed: 1_level_1
Astoria,3
Astoria Heights,2
Bayside,5
Beechhurst,2
Bellaire,3


In [28]:
df3 = df2.reset_index()
print(df3.shape)
df3.head()

(61, 2)


Unnamed: 0,Neighborhood,Venue
0,Astoria,3
1,Astoria Heights,2
2,Bayside,5
3,Beechhurst,2
4,Bellaire,3


#### Merge df3 into queens_data to make a dataframe that includes each neighborhood and its Chinese restaurant count.

In [29]:
queens_data_modified = queens_data.merge(df3, how='outer')
queens_data_modified.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude,Venue
0,Queens,Astoria,40.768509,-73.915654,3.0
1,Queens,Woodside,40.746349,-73.901842,8.0
2,Queens,Jackson Heights,40.751981,-73.882821,9.0
3,Queens,Elmhurst,40.744049,-73.881656,47.0
4,Queens,Howard Beach,40.654225,-73.838138,2.0


In [30]:
queens_data_mod1 = queens_data_modified.fillna(0)
queens_data_mod1.head() 

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude,Venue
0,Queens,Astoria,40.768509,-73.915654,3.0
1,Queens,Woodside,40.746349,-73.901842,8.0
2,Queens,Jackson Heights,40.751981,-73.882821,9.0
3,Queens,Elmhurst,40.744049,-73.881656,47.0
4,Queens,Howard Beach,40.654225,-73.838138,2.0
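
The `groupby`, `merge`, and `fillna` steps above amount to counting venues per neighborhood and assigning zero to neighborhoods that returned none. A plain-Python sketch of that logic with `collections.Counter` follows. The two short lists are a truncated sample based on rows shown in this notebook, so the Woodside count here (2) is smaller than its count in the full data.

```python
from collections import Counter

# Every neighborhood we want in the final table
all_neighborhoods = ['Astoria', 'Woodside', 'South Ozone Park']

# One entry per returned venue (sample only)
venue_neighborhoods = ['Astoria', 'Astoria', 'Astoria', 'Woodside', 'Woodside']

counts = Counter(venue_neighborhoods)

# A Counter returns 0 for missing keys, which plays the role of fillna(0)
venue_counts = {name: counts[name] for name in all_neighborhoods}
print(venue_counts)
```

In pandas, a shorter route to the same table should be `queens_chinese_res['Neighborhood'].value_counts().reindex(queens_data['Neighborhood'], fill_value=0)`, which also keeps the counts as integers instead of floats.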


In [31]:
queens_data_mod = queens_data_mod1.drop('Borough', axis=1)
queens_data_mod.head()

Unnamed: 0,Neighborhood,Latitude,Longitude,Venue
0,Astoria,40.768509,-73.915654,3.0
1,Woodside,40.746349,-73.901842,8.0
2,Jackson Heights,40.751981,-73.882821,9.0
3,Elmhurst,40.744049,-73.881656,47.0
4,Howard Beach,40.654225,-73.838138,2.0


## 3. Cluster Neighborhoods

#### Run *k*-means to cluster the neighborhoods into 3 clusters.

In [32]:
# set number of clusters
kclusters = 3

# Neighborhood is a categorical variable. The k-means algorithm isn't directly applicable to
# categorical variables because Euclidean distance isn't meaningful for discrete labels.
# So, let's drop this feature, along with Latitude and Longitude, and cluster on the venue count alone.
queens_data_clustering = queens_data_mod.drop(['Neighborhood','Latitude','Longitude'], axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(queens_data_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_

array([2, 0, 0, 1, 2, 2, 2, 0, 0, 1, 2, 0, 2, 0, 0, 2, 0, 0, 0, 2, 0, 2,
       0, 2, 0, 0, 2, 2, 2, 2, 2, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 0, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2,
       2, 2, 0, 2, 2, 2, 0, 0, 2, 2, 0, 2, 2, 2, 2], dtype=int32)
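
Because the clustering uses a single numeric feature, the idea behind *k*-means can be sketched in a few lines of plain Python. The function below is a simplified version of Lloyd's algorithm with a deterministic, evenly spaced start. It is not scikit-learn's implementation, which initializes its centers differently, and it runs on only the venue counts of the first ten Queens neighborhoods, so its grouping need not match the labels above, which come from all 81 rows. The name `kmeans_1d` is made up for this example.

```python
def kmeans_1d(values, k, max_iter=100):
    """Simplified Lloyd's algorithm for a list of numbers."""
    lo, hi = min(values), max(values)
    # Start with k centers evenly spaced between the smallest and largest value
    if k > 1:
        centers = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    else:
        centers = [sum(values) / len(values)]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each value goes to its nearest center
        new_labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each center to the mean of its members
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = sum(members) / len(members)
    return labels, centers

# Venue counts of the first ten Queens neighborhoods in the dataframe
sample_counts = [3, 8, 9, 47, 2, 3, 2, 5, 4, 77]
sample_labels, sample_centers = kmeans_1d(sample_counts, k=3)
print(sample_labels)
print(sample_centers)
```

On this small sample the two largest counts (47 and 77) each end up in their own cluster, whereas the full run groups Elmhurst and Flushing together. The difference shows how much the result depends on the data and on the starting centers.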

#### Let's create a new dataframe that includes the cluster labels.

In [33]:
# add clustering labels
queens_data_mod.insert(0, 'Cluster Labels', kmeans.labels_)
queens_data_mod

Unnamed: 0,Cluster Labels,Neighborhood,Latitude,Longitude,Venue
0,2,Astoria,40.768509,-73.915654,3.0
1,0,Woodside,40.746349,-73.901842,8.0
2,0,Jackson Heights,40.751981,-73.882821,9.0
3,1,Elmhurst,40.744049,-73.881656,47.0
4,2,Howard Beach,40.654225,-73.838138,2.0
5,2,Corona,40.742382,-73.856825,3.0
6,2,Forest Hills,40.725264,-73.844475,2.0
7,0,Kew Gardens,40.705179,-73.829819,5.0
8,0,Richmond Hill,40.697947,-73.831833,4.0
9,1,Flushing,40.764454,-73.831773,77.0


#### Finally, let's visualize the resulting clusters.

In [34]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

markers_colors={}
markers_colors[0] = 'red'
markers_colors[1] = 'blue'
markers_colors[2] = 'green'
markers_colors[3] = 'yellow'
markers_colors[4] = 'cyan'
markers_colors[5] = 'black'

# add markers to the map
#markers_colors = []
for lat, lon, poi, cluster in zip(queens_data_mod['Latitude'], queens_data_mod['Longitude'], queens_data_mod['Neighborhood'], queens_data_mod['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
     #   fill_color=rainbow[cluster-1],
        fill_color=markers_colors[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
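
The cell above keeps two color schemes, a matplotlib rainbow for the marker outline and a hand-written dictionary for the fill. As a sketch of a simpler alternative, a single palette keyed by cluster label can be generated with the standard-library `colorsys` module and used for both. The function name `cluster_palette` and the saturation and brightness values are arbitrary choices for this example.

```python
import colorsys

def cluster_palette(k, saturation=0.8, value=0.9):
    """Return a dict mapping each cluster label 0..k-1 to a hex color string."""
    palette = {}
    for label in range(k):
        # spread the hues evenly around the color wheel
        r, g, b = colorsys.hsv_to_rgb(label / k, saturation, value)
        palette[label] = '#{:02x}{:02x}{:02x}'.format(
            int(r * 255), int(g * 255), int(b * 255))
    return palette

palette = cluster_palette(3)
print(palette)
```

In the marker loop, `color=palette[cluster]` and `fill_color=palette[cluster]` would then give each cluster one consistent color.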

<a id='item4'></a>

## 4. Examine Clusters

Now, you can examine each cluster. 

#### Cluster 0

In [35]:
queens_data_mod.loc[queens_data_mod['Cluster Labels'] == 0]

Unnamed: 0,Cluster Labels,Neighborhood,Latitude,Longitude,Venue
1,0,Woodside,40.746349,-73.901842,8.0
2,0,Jackson Heights,40.751981,-73.882821,9.0
7,0,Kew Gardens,40.705179,-73.829819,5.0
8,0,Richmond Hill,40.697947,-73.831833,4.0
11,0,Sunnyside,40.740176,-73.926916,5.0
13,0,Maspeth,40.725427,-73.896217,8.0
14,0,Ridgewood,40.708323,-73.901435,4.0
16,0,Rego Park,40.728974,-73.857827,7.0
17,0,Woodhaven,40.689887,-73.85811,6.0
18,0,Ozone Park,40.680708,-73.843203,4.0


#### Cluster 1

In [36]:
queens_data_mod.loc[queens_data_mod['Cluster Labels'] == 1]

Unnamed: 0,Cluster Labels,Neighborhood,Latitude,Longitude,Venue
3,1,Elmhurst,40.744049,-73.881656,47.0
9,1,Flushing,40.764454,-73.831773,77.0


#### Cluster 2

In [37]:
queens_data_mod.loc[queens_data_mod['Cluster Labels'] == 2]

Unnamed: 0,Cluster Labels,Neighborhood,Latitude,Longitude,Venue
0,2,Astoria,40.768509,-73.915654,3.0
4,2,Howard Beach,40.654225,-73.838138,2.0
5,2,Corona,40.742382,-73.856825,3.0
6,2,Forest Hills,40.725264,-73.844475,2.0
10,2,Long Island City,40.750217,-73.939202,3.0
12,2,East Elmhurst,40.764073,-73.867041,1.0
15,2,Glendale,40.702762,-73.870742,2.0
19,2,South Ozone Park,40.66855,-73.809865,0.0
21,2,Whitestone,40.781291,-73.814202,0.0
23,2,Auburndale,40.76173,-73.791762,0.0
