<a href="https://cognitiveclass.ai"><img src = "https://ibm.box.com/shared/static/9gegpsmnsoo25ikkbl4qzlvlyjbgxs5x.png" width = 400> </a>

<h1 align=center><font size = 5>Segmenting and Clustering Neighborhoods in New York City</font></h1>

## Introduction

In this lab, you will learn how to convert addresses into their equivalent latitude and longitude values. Also, you will use the Foursquare API to explore neighborhoods in New York City. You will use the **explore** function to get the most common venue categories in each neighborhood, and then use this feature to group the neighborhoods into clusters. You will use the *k*-means clustering algorithm to complete this task. Finally, you will use the Folium library to visualize the neighborhoods in New York City and their emerging clusters.

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

1. <a href="#item1">Download and Explore Dataset</a>

2. <a href="#item2">Explore Neighborhoods in New York City</a>

3. <a href="#item3">Analyze Each Neighborhood</a>

4. <a href="#item4">Cluster Neighborhoods</a>

5. <a href="#item5">Examine Clusters</a>    
</font>
</div>

Before we get the data and start exploring it, let's download all the dependencies that we will need.

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (pandas >= 1.0; older versions expose it under pandas.io.json)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


<a id='item1'></a>

## 1. Download and Explore Dataset

New York City has a total of 5 boroughs and 306 neighborhoods. To segment the neighborhoods and explore them, we need a dataset that contains the 5 boroughs and the neighborhoods in each borough, as well as the latitude and longitude coordinates of each neighborhood.

Luckily, this dataset exists for free on the web. Feel free to try to find this dataset on your own, but here is the link to the dataset: https://geo.nyu.edu/catalog/nyu_2451_34572

For your convenience, I downloaded the files and placed them on the server, so you can simply run a `wget` command to access the data. So let's go ahead and do that.

#### Load and explore the data

Next, let's load the data.

In [2]:
with open('data/newyork_polygon.json') as json_data:
    newyork_data = json.load(json_data)

Then let's loop through the data and fill the dataframe one row at a time.

In [3]:
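# one row per GeoJSON feature: name, borough, and the center of the feature's bounding box
ny_df = pd.DataFrame(columns=['Neighborhood','Borough','Longitude','Latitude'])
# running bounding box over all features, stored as [longitude, latitude]
GLOB_MIN = np.array([100000.0,10000])
GLOB_MAX = np.array([-100000.0,-10000])
for ng in newyork_data['features']:
    
    # [longitude, latitude] extremes of the feature's first coordinate ring
    max_ = np.max(np.array(ng['geometry']['coordinates'][0]), axis = 0)
    min_ = np.min(np.array(ng['geometry']['coordinates'][0]), axis = 0)
    GLOB_MIN[0] = min(GLOB_MIN[0],min_[0])
    GLOB_MIN[1] = min(GLOB_MIN[1],min_[1])
    GLOB_MAX[0] = max(GLOB_MAX[0],max_[0])
    GLOB_MAX[1] = max(GLOB_MAX[1],max_[1])
    
    # midpoint of the bounding box, used as the neighborhood's representative coordinate
    coords = min_ + (max_ - min_)/2
    # add the row under a temporary label (-1), then renumber the index from 0
    ny_df.loc[-1] = [ng['properties']['neighborhood'],ng['properties']['borough']] + coords.tolist()
        
    ny_df = ny_df.reset_index(drop=True)

# store the borough as a categorical and keep its integer code in a separate column
ny_df['Borough']  = ny_df.Borough.astype('category')

ny_df['Borough_code'] = ny_df.Borough.cat.codes
# neighborhoods = ny_df.set_index('Neighborhood')
neighborhoods = ny_df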
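
The loop above reduces each neighborhood polygon to the midpoint of its bounding box. A minimal, dependency-free sketch of that single step on a toy ring follows; the helper name `bbox_center` and the sample coordinates are hypothetical and exist only for illustration.

```python
def bbox_center(ring):
    """Return the (longitude, latitude) midpoint of a ring's bounding box.

    `ring` is a list of [longitude, latitude] pairs, as in GeoJSON.
    """
    lons = [point[0] for point in ring]
    lats = [point[1] for point in ring]
    return ((min(lons) + max(lons)) / 2, (min(lats) + max(lats)) / 2)


# Toy rectangle with made-up coordinates (closed ring: first point repeated at the end)
toy_ring = [[-74.0, 40.0], [-73.0, 40.0], [-73.0, 41.0], [-74.0, 41.0], [-74.0, 40.0]]
center = bbox_center(toy_ring)
print(center)
```

The bounding-box midpoint is cheap to compute, but it is not the polygon's centroid. For a strongly concave shape it can fall outside the neighborhood.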

Quickly examine the resulting dataframe.

And make sure that the dataset has all 5 boroughs and 306 neighborhoods.
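
One way to run this check is sketched below on a tiny stand-in dataframe; on the real data the same expressions would be applied to `neighborhoods`. The helper name `summarize` and the sample rows are assumptions made for the example.

```python
import pandas as pd


def summarize(df):
    """Return (distinct boroughs, distinct neighborhood names, number of rows)."""
    return df['Borough'].nunique(), df['Neighborhood'].nunique(), len(df)


# Stand-in data; 'Astoria' appears twice to mimic a name shared by two rows
toy = pd.DataFrame({
    'Neighborhood': ['Chelsea', 'Chinatown', 'Astoria', 'Astoria'],
    'Borough': ['Manhattan', 'Manhattan', 'Queens', 'Queens'],
})
n_boroughs, n_names, n_rows = summarize(toy)
print('{} boroughs, {} neighborhood names, {} rows'.format(n_boroughs, n_names, n_rows))
```

Counting distinct names separately from rows is useful when the same name can appear on more than one row, in which case the row count alone would overstate the number of neighborhoods.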

#### Use geopy library to get the latitude and longitude values of New York City.

In order to define an instance of the geocoder, we need to define a user_agent. We will name our agent <em>ny_explorer</em>, as shown below.

In [4]:
address = 'New York City, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of New York City are {}, {}.'.format(latitude, longitude))

The geographical coordinates of New York City are 40.7127281, -74.0060152.


#### Create a map of New York with neighborhoods superimposed on top.

**Folium** is a great visualization library. Feel free to zoom into the above map, and click on each circle mark to reveal the name of the neighborhood and its respective borough.

However, for illustration purposes, let's simplify the above map and segment and cluster only the neighborhoods in Manhattan. So let's slice the original dataframe and create a new dataframe of the Manhattan data.
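
The slicing step can be sketched as follows, assuming the borough is stored in a `Borough` column as in the dataframe built earlier. The toy rows are made up; the name `manhattan_data` matches the one used in the lab's solution cell.

```python
import pandas as pd

toy = pd.DataFrame({
    'Neighborhood': ['Chelsea', 'Astoria', 'Chinatown'],
    'Borough': ['Manhattan', 'Queens', 'Manhattan'],
})

# The boolean mask keeps only Manhattan rows; reset_index renumbers them from 0
manhattan_data = toy[toy['Borough'] == 'Manhattan'].reset_index(drop=True)
print(manhattan_data)
```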

Let's get the geographical coordinates of Manhattan.

As we did with all of New York City, let's visualize Manhattan and the neighborhoods in it.

Next, we are going to start utilizing the Foursquare API to explore the neighborhoods and segment them.

#### Define Foursquare Credentials and Version

In [5]:
CLIENT_ID = 'MK4JTCJZQ3ETHRUPHUKTBAA4AI0RXFIROY5IWFM5WLFV2IEL' # your Foursquare ID
CLIENT_SECRET = 'CP1DPIPSFQ1UIYPNIAY010VM3EVJS15C4WBVHTAKHORHFG21' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: MK4JTCJZQ3ETHRUPHUKTBAA4AI0RXFIROY5IWFM5WLFV2IEL
CLIENT_SECRET:CP1DPIPSFQ1UIYPNIAY010VM3EVJS15C4WBVHTAKHORHFG21
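
Hard-coding credentials in a notebook, and printing them, exposes them to anyone who sees the file. A minimal sketch of one alternative, reading them from environment variables, is shown below. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helpers `load_credentials` and `mask` are assumptions chosen for this example, not something Foursquare prescribes.

```python
import os


def load_credentials(env=os.environ):
    """Read the Foursquare credentials from a mapping of environment variables.

    Raises KeyError naming the first missing variable.
    """
    return env['FOURSQUARE_CLIENT_ID'], env['FOURSQUARE_CLIENT_SECRET']


def mask(secret, visible=4):
    """Keep the first few characters and replace the rest with asterisks."""
    return secret[:visible] + '*' * (len(secret) - visible)


# Demonstration with placeholder values instead of the real environment
client_id, client_secret = load_credentials({
    'FOURSQUARE_CLIENT_ID': 'EXAMPLE-ID-0000',
    'FOURSQUARE_CLIENT_SECRET': 'EXAMPLE-SECRET-0000',
})
print('CLIENT_ID: ' + mask(client_id))
print('CLIENT_SECRET: ' + mask(client_secret))
```

In a real session, `load_credentials()` would be called without arguments so that it reads `os.environ`.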


#### Let's explore the first neighborhood in our dataframe.

Get the neighborhood's name.

From the Foursquare lab in the previous module, we know that all the information is in the *items* key. Before we proceed, let's borrow the **get_category_type** function from the Foursquare lab.

In [6]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Now we are ready to clean the json and structure it into a *pandas* dataframe.
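
A minimal sketch of that cleaning step is given below, assuming pandas 1.0 or later for `pd.json_normalize`. The payload is a made-up stand-in shaped like the *items* list, and `first_category` is a simplified, hypothetical counterpart of **get_category_type** that works on the category list directly.

```python
import pandas as pd

# Mock payload shaped like response -> groups[0] -> items (all values are made up)
mock_items = [
    {'venue': {'name': 'Cafe A',
               'categories': [{'name': 'Coffee Shop'}],
               'location': {'lat': 40.71, 'lng': -74.00}}},
    {'venue': {'name': 'Park B',
               'categories': [],
               'location': {'lat': 40.72, 'lng': -74.01}}},
]


def first_category(categories):
    """Return the name of the first category, or None when the list is empty."""
    return categories[0]['name'] if categories else None


# Flatten nested dicts into dotted column names such as 'venue.location.lat'
flat = pd.json_normalize(mock_items)
venues = flat[['venue.name', 'venue.categories',
               'venue.location.lat', 'venue.location.lng']].copy()
venues['venue.categories'] = venues['venue.categories'].apply(first_category)
# Keep only the part after the last dot as the column name
venues.columns = [col.split('.')[-1] for col in venues.columns]
print(venues)
```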

<a id='item2'></a>

## 2. Explore Neighborhoods in Manhattan

#### Let's create a function to repeat the same process for all the neighborhoods in Manhattan

In [7]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
        
        repeat = True
        cnt = 0
        # create the API request URL
        while(repeat and cnt < 50):
            url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
                CLIENT_ID, 
                CLIENT_SECRET, 
                VERSION, 
                lat, 
                lng, 
                radius, 
                LIMIT)

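            # make the GET request; a failed or malformed response counts as one attempt
            try:
                results = requests.get(url).json()["response"]['groups'][0]['items']
                repeat = False
            except (requests.RequestException, ValueError, KeyError, IndexError):
                cnt += 1
        # skip this neighborhood if all 50 attempts failed
        if cnt > 49: continue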
        # return only relevant information for each nearby venue
        print(len(results))
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
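
The retry logic inside the function (repeat the request until it yields a usable payload, up to 50 attempts) can be isolated into a small helper. The sketch below uses only the standard library and a fake fetch function in place of the network call; `fetch_with_retries` and `flaky_fetch` are hypothetical names introduced for illustration.

```python
def fetch_with_retries(fetch, max_attempts=50):
    """Call `fetch()` until it succeeds or `max_attempts` is reached.

    Returns the first successful result, or None if every attempt raised.
    """
    for _ in range(max_attempts):
        try:
            return fetch()
        except (KeyError, IndexError, ValueError):
            # Malformed or incomplete payload: try again
            continue
    return None


# Stand-in for the API call: fails twice, then returns a payload
calls = {'count': 0}


def flaky_fetch():
    calls['count'] += 1
    if calls['count'] < 3:
        raise KeyError('response')
    return ['venue-1', 'venue-2']


results = fetch_with_retries(flaky_fetch)
print(results, calls['count'])
```

With a real HTTP call, it would also be reasonable to catch `requests.RequestException` and to pause briefly between attempts (for example with `time.sleep`) so that a rate-limited API is not hit 50 times in a row.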

#### Now write the code to run the above function on each neighborhood and create a new dataframe called *manhattan_venues*.

In [8]:
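# run the function on every neighborhood in the dataframe (all boroughs, as the log below shows);
# the result keeps the lab's original variable name, manhattan_venues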
LIMIT=200
manhattan_venues = getNearbyVenues(names=neighborhoods['Neighborhood'],
                                   latitudes=neighborhoods['Latitude'],
                                   longitudes=neighborhoods['Longitude']
                                  )



Allerton
22
Alley Pond Park
8
Arden Heights
6
Arlington
0
Arrochar
8
Arverne
17
Astoria
100
Bath Beach
23
Battery Park City
100
Bay Ridge
47
Bay Terrace
19
Bay Terrace, Staten Island
8
Baychester
18
Bayside
39
Bayswater
3
Bayswater
0
Bedford-Stuyvesant
18
Belle Harbor
19
Bellerose
4
Belmont
55
Bensonhurst
23
Bergen Beach
7
Bloomfield
1
Boerum Hill
100
Borough Park
14
Breezy Point
1
Briarwood
12
Brighton Beach
27
Broad Channel
0
Broad Channel
0
Broad Channel
0
Broad Channel
0
Broad Channel
0
Broad Channel
0
Broad Channel
0
Broad Channel
0
Broad Channel
1
Broad Channel
1
Bronx Park
14
Bronxdale
33
Brooklyn Heights
78
Brownsville
16
Bull's Head
22
Bushwick
18
Cambria Heights
12
Canarsie
13
Carroll Gardens
100
Castle Hill
6
Castleton Corners
4
Central Park
64
Charleston
2
Chelsea
100
Chelsea, Staten Island
18
Chinatown
100
City Island
1
City Island
27
Civic Center
100
Claremont Village
6
Clason Point
10
Clifton
8
Clinton Hill
34
Co-op City
28
Cobble Hill
100
College Point
44
Columbia St
30

Double-click __here__ for the solution.
<!-- The correct answer is:
manhattan_venues = getNearbyVenues(names=manhattan_data['Neighborhood'],
                                   latitudes=manhattan_data['Latitude'],
                                   longitudes=manhattan_data['Longitude']
                                  )
--> 

#### Let's check the size of the resulting dataframe

In [9]:
print(manhattan_venues.shape)
manhattan_venues.head()

(7748, 7)


|   | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | Allerton | 40.865256 | -73.858424 | Sal & Doms Bakery | 40.865377 | -73.855236 | Dessert Shop |
| 1 | Allerton | 40.865256 | -73.858424 | Domenick's Pizzeria | 40.865576 | -73.858124 | Pizza Place |
| 2 | Allerton | 40.865256 | -73.858424 | Bronx Martial Arts Academy | 40.865721 | -73.857529 | Martial Arts Dojo |
| 3 | Allerton | 40.865256 | -73.858424 | Dunkin' | 40.865204 | -73.859007 | Donut Shop |
| 4 | Allerton | 40.865256 | -73.858424 | White Castle | 40.866065 | -73.862307 | Fast Food Restaurant |


Let's check how many venues were returned for each neighborhood

In [10]:
neighborhoods = neighborhoods.set_index('Neighborhood')

In [11]:
neighborhoods['venue_count'] = manhattan_venues.groupby('Neighborhood').count().max(axis=1)
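
The per-neighborhood venue count can also be written with `groupby(...).size()`, which counts rows per group directly. The sketch below uses a made-up venue table; it should give the same numbers as `count().max(axis=1)` as long as at least one column has no missing values.

```python
import pandas as pd

toy_venues = pd.DataFrame({
    'Neighborhood': ['Allerton', 'Allerton', 'Astoria'],
    'Venue': ['Bakery X', 'Pizzeria Y', 'Cafe Z'],
})

# size() counts rows per group, whether or not they contain missing values
venue_count = toy_venues.groupby('Neighborhood').size()
print(venue_count)
```

When the result is assigned to a dataframe indexed by neighborhood, rows whose name never appears in the venue table receive `NaN` rather than 0.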

In [12]:
manhattan_venues.head()

|   | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | Allerton | 40.865256 | -73.858424 | Sal & Doms Bakery | 40.865377 | -73.855236 | Dessert Shop |
| 1 | Allerton | 40.865256 | -73.858424 | Domenick's Pizzeria | 40.865576 | -73.858124 | Pizza Place |
| 2 | Allerton | 40.865256 | -73.858424 | Bronx Martial Arts Academy | 40.865721 | -73.857529 | Martial Arts Dojo |
| 3 | Allerton | 40.865256 | -73.858424 | Dunkin' | 40.865204 | -73.859007 | Donut Shop |
| 4 | Allerton | 40.865256 | -73.858424 | White Castle | 40.866065 | -73.862307 | Fast Food Restaurant |


In [13]:
neighborhoods['category_count'] = manhattan_venues.groupby(['Neighborhood','Venue Category']).count().reset_index().groupby('Neighborhood').count()['Venue']

In [14]:
neighborhoods['category_count'] = neighborhoods['category_count']/(neighborhoods.venue_count + 20)
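
The two cells above first count the distinct venue categories in each neighborhood and then divide that count by the number of venues plus 20. The source does not say why 20 was chosen; one plausible reading is that it damps the ratio for neighborhoods with very few venues. A small sketch of the same computation on made-up data, using `nunique()` for the distinct-category count:

```python
import pandas as pd

toy_venues = pd.DataFrame({
    'Neighborhood': ['Allerton', 'Allerton', 'Allerton', 'Astoria'],
    'Venue Category': ['Pizza Place', 'Pizza Place', 'Donut Shop', 'Cafe'],
})

grouped = toy_venues.groupby('Neighborhood')['Venue Category']
stats = pd.DataFrame({
    'venue_count': grouped.size(),             # venues per neighborhood
    'distinct_categories': grouped.nunique(),  # different categories per neighborhood
})
# Same formula as the cell above: distinct categories / (venues + 20)
stats['category_count'] = stats['distinct_categories'] / (stats['venue_count'] + 20)
print(stats)
```

Without the added constant, a neighborhood with a single venue would score 1.0, the maximum possible; with it, the same neighborhood scores 1/21.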

#### Let's find out how many unique categories can be curated from all the returned venues
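
A minimal sketch of that count, assuming the categories live in a `Venue Category` column as in the venue dataframe above; the toy rows are made up.

```python
import pandas as pd

toy_venues = pd.DataFrame({
    'Venue Category': ['Pizza Place', 'Donut Shop', 'Pizza Place', 'Cafe'],
})

# nunique() counts distinct non-missing values in the column
unique_count = toy_venues['Venue Category'].nunique()
print('There are {} unique categories.'.format(unique_count))
```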

<a id='item3'></a>

In [15]:
world_map = folium.Map(location=[40.7896239, -73.9598939], zoom_start=11,width='100%',height='100%')



choropleth = folium.Choropleth(geo_data='data/newyork_polygon.json', 
                    data=neighborhoods.reset_index(), columns=['Neighborhood','venue_count'],key_on='feature.properties.neighborhood',
                    fill_color='YlOrRd', 
                    fill_opacity=0.7, 
                    line_opacity=1,
                    line_color='black').add_to(world_map)

choropleth.geojson.add_child(
    folium.features.GeoJsonTooltip(['neighborhood'], labels=False)
)

world_map

In [16]:
world_map.save('fig/venues.html')

In [67]:
world_map = folium.Map(location=[40.7896239, -73.9598939], zoom_start=11,width='100%',height='100%')



choropleth = folium.Choropleth(geo_data='data/newyork_polygon.json', 
                    data=neighborhoods.reset_index(), columns=['Neighborhood','category_count'],key_on='feature.properties.neighborhood',
                    fill_color='YlOrRd', 
                    fill_opacity=0.7, 
                    line_opacity=1,
                    line_color='black').add_to(world_map)

choropleth.geojson.add_child(
    folium.features.GeoJsonTooltip(['neighborhood'], labels=False)
)

world_map

In [69]:
neighborhoods.to_csv('venues.csv')

<hr>

Copyright &copy; 2018 [Cognitive Class](https://cognitiveclass.ai/?utm_source=bducopyrightlink&utm_medium=dswb&utm_campaign=bdu). This notebook and its source code are released under the terms of the [MIT License](https://bigdatauniversity.com/mit-license/).