# Applied Data Science Capstone &ndash; Final Project

## Preface

This notebook is provided to fulfill the requirements of the final project for the Coursera **Applied Data Science Capstone** course.

## Data

The dataset that we will be using comes from Foursquare.
Foursquare data contains information on numerous points of interest (POIs) in the city under investigation (Stamford, CT),
with categories such as restaurants, parks, and schools.

Many of the previous projects in this course have used neighborhoods as they are defined
by external datasets. These external datasets define neighborhoods by somewhat arbitrary
measures, such as postal codes and historical or political districts. Such boundaries do not
always seem to be a good measure of what constitutes a neighborhood.

Instead, I will use the Foursquare data itself to define neighborhoods based on
where points of interest are located geographically. This will be done with clustering algorithms
applied to the latitude and longitude
fields of the Foursquare data.

Once the neighborhoods have been defined, I will dig deeper into the characteristics
of each point of interest, and attempt to understand what major amenities define each
neighborhood.
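To make the clustering idea concrete, here is a minimal, dependency-free sketch of k-means (Lloyd's algorithm) applied to latitude/longitude pairs. The coordinates, the starting centers, and the helper name `kmeans_latlng` are illustrative assumptions, not project data. Treating degrees as flat Euclidean coordinates is only a reasonable approximation over an area as small as a single city, and the analysis itself would more likely rely on a library implementation such as scikit-learn's `KMeans`.

```python
import math

def kmeans_latlng(points, centers, iterations=10):
    """Plain Lloyd's algorithm on (lat, lng) tuples; returns (labels, centers)."""
    centers = list(centers)
    labels = []
    for _ in range(iterations):
        # Assignment step: each point gets the index of its nearest center
        labels = [min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
                  for p in points]
        # Update step: move each center to the mean of its assigned points
        new_centers = []
        for k, old_center in enumerate(centers):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(old_center)  # keep a center that lost all points
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return labels, centers

# Made-up POI coordinates forming two loose groups (illustration only)
pois = [(41.054, -73.540), (41.055, -73.541), (41.100, -73.500), (41.101, -73.501)]
labels, centers = kmeans_latlng(pois, centers=[pois[0], pois[2]])
print(labels)
print(centers)
```

Each resulting label would play the role of a data-driven "neighborhood" in the rest of the analysis.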


## Libraries

To implement the investigation of the neighborhoods in Stamford, CT, the following Python
libraries and web services will be used:
- NumPy and pandas (for data analysis)
- GeoPy (for location services)
- Requests (to request data from web services, such as Foursquare)
- Matplotlib (for plotting)
- scikit-learn (for machine learning, such as clustering algorithms)
- Folium (to create maps)
- Shapely (to deal with mapping geometry)
- Foursquare (to provide data on points of interest)

### Setup

First, we import all the dependencies that we will need.

In [1]:
import numpy as np

import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import math
import json

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (was pandas.io.json)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means for the clustering stage
from sklearn.cluster import KMeans

import folium # map rendering library

from shapely.geometry import shape, GeometryCollection, Point, Polygon
import shapely

print('Libraries imported.')

Libraries imported.


#### Use geopy library to get the latitude and longitude values of Stamford, CT.

In order to define an instance of the geocoder, we need to define a user_agent. We will name our agent <em>geo_explorer</em>, as shown below.

In [2]:
address = 'Stamford, CT'

geolocator = Nominatim(user_agent="geo_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Stamford, CT are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Stamford, CT are 41.0534302, -73.5387341.
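`geolocator.geocode` returns `None` when the service finds no match, in which case `location.latitude` would fail with an `AttributeError`. A small guard makes that failure explicit. The helper name `resolve_coordinates` and the stand-in geocoder below are hypothetical; the stand-in is there only so the sketch runs without a network call.

```python
from types import SimpleNamespace

def resolve_coordinates(geocode, address):
    """Call a geocoding function and fail loudly when nothing is found."""
    location = geocode(address)
    if location is None:
        raise ValueError("No geocoding result for {!r}".format(address))
    return location.latitude, location.longitude

def fake_geocode(address):
    """Stand-in for geolocator.geocode (assumption: knows a single address)."""
    known = {"Stamford, CT": SimpleNamespace(latitude=41.0534302, longitude=-73.5387341)}
    return known.get(address)

lat, lng = resolve_coordinates(fake_geocode, "Stamford, CT")
print(lat, lng)
```

In the notebook this would be called as `resolve_coordinates(geolocator.geocode, address)`.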


#### Create a map of Stamford with the state and city boundaries superimposed on top.

In [5]:
map_stamford_empty = folium.Map(location=[latitude, longitude], zoom_start=7)

state_geo_file = 'Connecticut.geojson'
city_geo_file = 'StamfordCT.geojson'

def geo_style_state(feature):
    return {
        # Border (weight is pixels); Also: stroke (bool to disable), lineCap, lineJoin, dashArray, dashOffset
        'color': '#000000', 'weight': 1.2, 'opacity': 0.2,
        # Fill; Also: fill (bool to disable), fillRule (determines inside of shape), bubblingMouseEvents
        'fillColor': '#0000ff', 'fillOpacity': 0.05
    }
def geo_highlight_state(feature):
    return {
        'fillOpacity': 0.1
    }

def geo_style_stamford(feature):
    return {
        'fillOpacity': 0.2, 'weight': 1.6, 'fillColor': '#00cc00'
    }

def geo_highlight_stamford(feature):
    return {
        'fillOpacity': 0.08
    }

def geo_style_rect(feature):
    return {
        'fillOpacity': 0, 'weight': 1.2, 'fillColor': '#00cccc'
    }

# Add overlay to show where city is
geo_ct = folium.GeoJson(state_geo_file, name='Connecticut', style_function=geo_style_state,
              highlight_function=geo_highlight_state, # tooltip="Connecticut",
              overlay=True, control=True, show=True).add_to(map_stamford_empty)
geo = folium.GeoJson(city_geo_file, name='Stamford, CT', style_function=geo_style_stamford,
              highlight_function=geo_highlight_stamford, tooltip="Stamford, CT",
              overlay=True, control=True, show=True).add_to(map_stamford_empty)

# CircleMarker does NOT have a style_function
folium.vector_layers.CircleMarker((latitude, longitude), radius=4, weight=1,
                                  fill=True, fill_opacity=1.0, fill_color="#000000",
                                  color="#000000", tooltip="City").add_to(map_stamford_empty)

lc = folium.map.LayerControl(position='topright').add_to(map_stamford_empty) #, collapsed=True, autoZIndex=True, **kwargs)

map_stamford_empty

#### Define Foursquare Credentials and Version

In [6]:
CLIENT_ID = 'BOGTGAFH5WKN13HQNJU0K5U3ZEUS1DLFI5JU0P2CX1ZTQGBE'
CLIENT_SECRET = '5MIOG3NT1VQ22FLEJPQJ1RND2QAFVF5QEMBXATCLXNHOOBRZ'
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: BOGTGAFH5WKN13HQNJU0K5U3ZEUS1DLFI5JU0P2CX1ZTQGBE
CLIENT_SECRET:5MIOG3NT1VQ22FLEJPQJ1RND2QAFVF5QEMBXATCLXNHOOBRZ
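Hard-coding and printing API keys makes them visible to anyone who reads the notebook. One common alternative, sketched here under assumptions, is to read them from environment variables and print only a masked form. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helpers `load_credentials` and `mask` are hypothetical, and the demo values are placeholders rather than real keys.

```python
import os

def load_credentials(env=os.environ):
    """Read the Foursquare credentials from environment variables (hypothetical names)."""
    try:
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError("Set {} before running the notebook".format(missing)) from None

def mask(secret, visible=4):
    """Keep the first few characters so the value can be checked but not copied."""
    return secret[:visible] + "*" * max(len(secret) - visible, 0)

# Demonstration with placeholder values instead of the real environment
demo_env = {"FOURSQUARE_CLIENT_ID": "EXAMPLEID123",
            "FOURSQUARE_CLIENT_SECRET": "EXAMPLESECRET456"}
client_id, client_secret = load_credentials(demo_env)
print("CLIENT_ID:", mask(client_id))
print("CLIENT_SECRET:", mask(client_secret))
```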


In [7]:
LIMIT = 100
radius = 1000

# To get a JSON struct of categories:
# GET https://api.foursquare.com/v2/venues/categories
# Docs: https://developer.foursquare.com/docs/api-reference/venues/categories

# Extra params
# section = food, drinks, coffee, shops, arts, outdoors, sights, trending, nextVenues
# categoryId = <>
# https://developer.foursquare.com/docs/api/venues/categories = get a JSON representation of all categories (in tree)
# These are the top-level category IDs
#  4d4b7104d754a06370d81259 = Arts & Entertainment
#  4d4b7105d754a06372d81259 = College & University
#  4d4b7105d754a06373d81259 = Event
#  4d4b7105d754a06374d81259 = Food
#  4d4b7105d754a06376d81259 = Nightlife Spot (subs = Bar, Brewery, Lounge, Night Market, Nightclub, Other Nightlife, Strip Club)
#  4d4b7105d754a06377d81259 = Outdoors & Recreation
#  4d4b7105d754a06375d81259 = Professional & Other Places (incl Medical Center, School, Spiritual Center)
#  4e67e38e036454776db1fb3a = Residence (subs = Assisted Living, Home (private), Housing Development, Residential Building (Apartment / Condo), Trailer Park)
#  4d4b7105d754a06378d81259 = Shop & Service
#  4d4b7105d754a06379d81259 = Travel & Transport

url_params = "&time=any&day=any"

#### Now, let's get some venues that are in the search radius.

First, create the GET request URL.

In [8]:
def make_url(latitude, longitude, radius, offset=0, category_id=None):
    url = "https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}" \
          "&v={}&ll={},{}&radius={}&offset={}&limit={}".format(
        CLIENT_ID, CLIENT_SECRET, VERSION,
        latitude, longitude,
        radius, offset, LIMIT)
    if category_id:
        url += "&categoryId={}".format(category_id)
    # sortByDistance=1  # distance instead of 'relevance'
    url += url_params
    return url
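As an alternative to concatenating the query string by hand, the parameters can be collected in a dictionary and encoded with the standard library, which also escapes any special characters. This is a sketch only: `build_explore_url` is a hypothetical name, the credentials are placeholders, and the parameter set simply mirrors `make_url` above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"

def build_explore_url(lat, lng, radius, *, client_id, client_secret, version,
                      limit=100, offset=0, category_id=None):
    """Build the explore URL from a parameter dictionary instead of string pieces."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "offset": offset,
        "limit": limit,
        "time": "any",
        "day": "any",
    }
    if category_id:
        params["categoryId"] = category_id  # optional category filter
    return EXPLORE_ENDPOINT + "?" + urlencode(params)

example_url = build_explore_url(41.0534302, -73.5387341, 1000,
                                client_id="YOUR_ID", client_secret="YOUR_SECRET",
                                version="20180605")
# Decode the query string again to confirm what was encoded
query = parse_qs(urlsplit(example_url).query)
print(query["ll"], query["radius"], query["limit"])
```

`requests.get` also accepts a `params` dictionary and performs the same encoding itself, so the dictionary could be passed to it directly.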


In [9]:
offset = 0
url = make_url(latitude, longitude, radius, offset)
url

'https://api.foursquare.com/v2/venues/explore?client_id=BOGTGAFH5WKN13HQNJU0K5U3ZEUS1DLFI5JU0P2CX1ZTQGBE&client_secret=5MIOG3NT1VQ22FLEJPQJ1RND2QAFVF5QEMBXATCLXNHOOBRZ&v=20180605&ll=41.0534302,-73.5387341&radius=1000&offset=0&limit=100&time=any&day=any'

Send the GET request and examine the results.

In [10]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5f398b10c84c6d0398b2b3ad'},
 'response': {'suggestedFilters': {'header': 'Tap to show:',
   'filters': [{'name': 'Open now', 'key': 'openNow'},
    {'name': '$-$$$$', 'key': 'price'}]},
  'headerLocation': 'Downtown Stamford',
  'headerFullLocation': 'Downtown Stamford, Stamford',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 120,
  'suggestedBounds': {'ne': {'lat': 41.06243020900001,
    'lng': -73.52682157478513},
   'sw': {'lat': 41.04443019099999, 'lng': -73.55064662521487}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '56ac70b2498ebfb7f369e6af',
       'name': 'Kashi Stamford',
       'location': {'address': '131 Summer St',
        'lat': 41.05431965539036,
        'lng': -73.54061792432401,
        'labele

In [12]:
len(results['response']['groups'][0]['items'])

100

In [13]:
results['response']['groups'][0]['items'][0]

{'reasons': {'count': 0,
  'items': [{'summary': 'This spot is popular',
    'type': 'general',
    'reasonName': 'globalInteractionReason'}]},
 'venue': {'id': '56ac70b2498ebfb7f369e6af',
  'name': 'Kashi Stamford',
  'location': {'address': '131 Summer St',
   'lat': 41.05431965539036,
   'lng': -73.54061792432401,
   'labeledLatLngs': [{'label': 'display',
     'lat': 41.05431965539036,
     'lng': -73.54061792432401}],
   'distance': 186,
   'postalCode': '06901',
   'cc': 'US',
   'city': 'Stamford',
   'state': 'CT',
   'country': 'United States',
   'formattedAddress': ['131 Summer St',
    'Stamford, CT 06901',
    'United States']},
  'categories': [{'id': '4bf58dd8d48988d1c4941735',
    'name': 'Restaurant',
    'pluralName': 'Restaurants',
    'shortName': 'Restaurant',
    'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/food/default_',
     'suffix': '.png'},
    'primary': True}],
  'delivery': {'id': '322861',
   'url': 'https://www.grubhub.com/restaurant/kashi

In [14]:
# See what keys each array entry has
results['response']['groups'][0]['items'][0].keys()

dict_keys(['reasons', 'venue', 'referralId'])

In [15]:
poi = results['response']['groups'][0]['items'][0]['venue']
poi.keys()

dict_keys(['id', 'name', 'location', 'categories', 'delivery', 'photos'])

We want to keep the following keys:
- location.(lat, lng)  # for clustering
- location.city  # to verify that POI is in fact in the city under search
- location.postalCode  # for potential future analysis
- location.address  # for posterity
- categories.id  # for analysis
- categories.name  # for labelling
- categories.primary  # for checking hierarchy
- venue.id  # for comparing and avoiding duplicates if doing multiple searches
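The selection above can be sketched on a single response item using plain dictionaries. The helper `flatten_venue` is a hypothetical name, and the sample item is a trimmed copy of the first venue in the response shown earlier. Using `.get()` means a venue that lacks an address or has no categories yields `None` instead of raising an error.

```python
def flatten_venue(item):
    """Pull the fields of interest out of one 'items' entry of the explore response."""
    venue = item.get("venue", {})
    location = venue.get("location", {})
    categories = venue.get("categories", [])
    # Prefer the category flagged as primary; otherwise fall back to the first one
    primary = next((c for c in categories if c.get("primary")),
                   categories[0] if categories else {})
    return {
        "venue_id": venue.get("id"),
        "name": venue.get("name"),
        "address": location.get("address"),
        "city": location.get("city"),
        "zipcode": location.get("postalCode"),
        "latitude": location.get("lat"),
        "longitude": location.get("lng"),
        "category_name": primary.get("name"),
        "category_id": primary.get("id"),
    }

sample_item = {
    "venue": {
        "id": "56ac70b2498ebfb7f369e6af",
        "name": "Kashi Stamford",
        "location": {"address": "131 Summer St", "lat": 41.05431965539036,
                     "lng": -73.54061792432401, "postalCode": "06901",
                     "city": "Stamford"},
        "categories": [{"id": "4bf58dd8d48988d1c4941735", "name": "Restaurant",
                        "primary": True}],
    }
}
row = flatten_venue(sample_item)
print(row)
```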

In [16]:
def get_dataframe_from_results(results):
    venues = results['response']['groups'][0]['items']
    nearby_venues = json_normalize(venues) # flatten JSON
    nv = nearby_venues

    # Keep only the columns that we need (they are renamed at the end):
    # name, address, city, lat, lng, postal code, and venue id.
    # The remaining 'venue.location.*' entries are dropped.
    cols_to_keep = ['venue.name', 'venue.location.address', 'venue.location.city',
                    'venue.location.lat', 'venue.location.lng', 'venue.location.postalCode', 'venue.id']
                    #'venue.categories'] #'venue.categories.name', 'venue.categories.id', 'venue.categories.primary']
    df = nv[cols_to_keep]

    # df.loc[0]['venue.categories'][0] #['name']
    categories = [[i[0]['name'], i[0]['id']] for i in nv['venue.categories']]
    # Create a DataFrame
    df_cat = pd.DataFrame(categories)
    df_cat.columns = ['category_name', 'category_id']
    df = df.join(df_cat)

    df.rename(columns={"venue.name": "name", "venue.location.address": "address", "venue.location.city": "city",
                      "venue.location.lat": "latitude", "venue.location.lng": "longitude",
                      "venue.location.postalCode": "zipcode", "venue.id": "venue_id"}, inplace=True)
    return df


In [18]:
offset = 0
total_returned = 0
df = pd.DataFrame()

while True:
    url = make_url(latitude, longitude, radius, offset)
    print('Calling:', url)
    results = requests.get(url).json()
    num_available = results['response']['totalResults']
    print('num_available =', num_available)
    num_returned = len(results['response']['groups'][0]['items'])
    df_temp = get_dataframe_from_results(results)
    # print(df_temp.loc[0])
    df = pd.concat([df, df_temp], ignore_index=True)  # DataFrame.append was removed in pandas 2.0

    # Break out of loop if we have retrieved all of the entries
    offset += num_returned
    total_returned += num_returned
    print('total_returned =', total_returned)
    if total_returned >= num_available:
        break

df

Calling: https://api.foursquare.com/v2/venues/explore?client_id=BOGTGAFH5WKN13HQNJU0K5U3ZEUS1DLFI5JU0P2CX1ZTQGBE&client_secret=5MIOG3NT1VQ22FLEJPQJ1RND2QAFVF5QEMBXATCLXNHOOBRZ&v=20180605&ll=41.0534302,-73.5387341&radius=1000&offset=0&limit=100&time=any&day=any
num_available = 120
total_returned = 100
Calling: https://api.foursquare.com/v2/venues/explore?client_id=BOGTGAFH5WKN13HQNJU0K5U3ZEUS1DLFI5JU0P2CX1ZTQGBE&client_secret=5MIOG3NT1VQ22FLEJPQJ1RND2QAFVF5QEMBXATCLXNHOOBRZ&v=20180605&ll=41.0534302,-73.5387341&radius=1000&offset=100&limit=100&time=any&day=any
num_available = 120
total_returned = 120


| | name | address | city | latitude | longitude | zipcode | venue_id | category_name | category_id |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Kashi Stamford | 131 Summer St | Stamford | 41.05432 | -73.540618 | 6901.0 | 56ac70b2498ebfb7f369e6af | Restaurant | 4bf58dd8d48988d1c4941735 |
| 1 | bartaco | 222 Summer St | Stamford | 41.054875 | -73.540731 | 6901.0 | 4d921fd980d3370413acac06 | Mexican Restaurant | 4bf58dd8d48988d1c1941735 |
| 2 | Barcelona Wine Bar Restaurant | 222 Summer St | Stamford | 41.054876 | -73.540932 | 6901.0 | 4aa3f4d8f964a5208c4420e3 | Spanish Restaurant | 4bf58dd8d48988d150941735 |
| 3 | Lorca | 125 Bedford St | Stamford | 41.056168 | -73.538517 | 6901.0 | 50de12e8e4b01ebc50f9b951 | Coffee Shop | 4bf58dd8d48988d1e0931735 |
| 4 | Remo's Brick Oven Pizza Company | 35 Bedford St | Stamford | 41.055555 | -73.538733 | 6901.0 | 4b4a586df964a520ab8426e3 | Pizza Place | 4bf58dd8d48988d1ca941735 |
| 5 | Connecticut Cigar Company | 43 Bank St | Stamford | 41.052658 | -73.540334 | 6901.0 | 4af63292f964a520270222e3 | Lounge | 4bf58dd8d48988d121941735 |
| 6 | Garden Catering | 235 Main St | Stamford | 41.053191 | -73.54141 | 6901.0 | 4ac6952bf964a52037b520e3 | Deli / Bodega | 4bf58dd8d48988d146941735 |
| 7 | Flinders Lane | 184 Summer St | Stamford | 41.054636 | -73.540739 | 6901.0 | 59a9f912d3cce8610502ec6a | Australian Restaurant | 4bf58dd8d48988d169941735 |
| 8 | Cask Republic | 191 Summer St | Stamford | 41.054654 | -73.540669 | 6901.0 | 52b3904d498e1a41f5b1376e | Gastropub | 4bf58dd8d48988d155941735 |
| 9 | Columbus Park Trattoria | 205 Main St | Stamford | 41.053004 | -73.541891 | 6901.0 | 4aa412c9f964a5202d4520e3 | Italian Restaurant | 4bf58dd8d48988d110941735 |
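The paging logic of the loop above can be separated from the HTTP call, which makes it easy to test and lets us stop safely if the service ever returns an empty page before `totalResults` is reached. In this sketch `fetch_all`, `fake_page`, and the `max_pages` cap are hypothetical; a real `fetch_page` would wrap the `requests.get(make_url(...))` call and return the page's items together with `totalResults`.

```python
def fetch_all(fetch_page, max_pages=50):
    """Collect every item from a paged source.

    fetch_page(offset) must return (items, total_available).
    """
    collected = []
    for _ in range(max_pages):  # hard cap so a misbehaving service cannot loop forever
        items, total = fetch_page(len(collected))
        if not items:
            break  # empty page: nothing more to fetch, whatever the reported total
        collected.extend(items)
        if len(collected) >= total:
            break  # all advertised results retrieved
    return collected

def fake_page(offset, page_size=100, total=120):
    """Stand-in for the API: 120 results served at most 100 at a time."""
    items = list(range(offset, min(offset + page_size, total)))
    return items, total

venues = fetch_all(fake_page)
print(len(venues))
```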


#### Create a map of Stamford with the POIs superimposed on top.

In [19]:
# create map of Stamford using latitude and longitude values
map_stamford_2 = folium.Map(location=[latitude, longitude], zoom_start=12)

# add markers to map
for lat, lng, name in zip(df['latitude'], df['longitude'], df['name']):
    label = '{}'.format(name)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_stamford_2)  
    
map_stamford_2