<img src="Tesla.PNG">

# <font color='red'>  Visualizing the Future Prospect of Electric Vehicles: a K-means Clustering-based Spatial Analysis of Electric Vehicle Charging Stations in Boston, MA </font>

# Capstone Project--The Battle of Neighborhoods 

### *-->`Wang Xi__Applied Data Science Capstone__Final Assignment`*

# Table of Contents

* ### [1. Introduction](#introduction)
* ### [2. Data Acquisition](#data)
* ### [3. Methodology](#methodology)
* ### [4. Exploratory Data Analysis](#analysis)
##### [(1) Analyze Each Neighborhood](#analyze)
##### [(2) Cluster Neighborhoods](#cluster)
##### [(3) Examine Clusters](#examine)

* ### [5. Discussions](#discussion)

##  1. Introduction <a name="introduction"></a> 

**Electric vehicles (EVs)** are economical and ecological vehicles that draw their power from rechargeable batteries inside the car. They produce nearly no carbon emissions or pollution, are cost-effective, and are less noisy; their main disadvantage is recharging-related problems.

One approach to this problem is to build **electric vehicle charging stations (EVCS)**.
Each EVCS must also be sited carefully to maximize EV usage.

In this project, a **K-means clustering-based spatial analysis** is therefore applied to show how the built environment clusters around each EVCS site, and to explore what EV drivers are likely to consume in the surrounding built environment while charging their vehicles.

##  2. Data Acquisition <a name="data"></a> 

### A Gentle Introduction to the Dataset

This part gives a top-down introduction to the data. The analysis works with the following 11 objects:

> 1. `elec_car_station_data` - the raw JSON data with the geolocation of electric car charging stations (the whole United States).
> 2. `evcs` - the pandas dataframe of electric vehicle charging stations (the whole United States).
> 3. `evcs_boston` - the locations of the electric vehicle charging stations in Boston, MA.
> 4. `evcs_boston_map` - the map visualizing the electric vehicle charging station locations in Boston, MA.
> 5. `evcs_venues` - the characteristics of the venues near each EVCS in Boston, MA.
> 6. `evcs_onehot` - the one-hot (dummy) encoded dataframe of the venue categories near each EVCS in Boston, MA.
> 7. `evcs_grouped` - the one-hot rows grouped by neighborhood (station), holding the mean frequency of occurrence of each venue category in Boston, MA.
> 8. `num_top_venues` - the number of most common venue categories to report near each EVCS in Boston, MA.
> 9. `evcs_venues_sorted` - the venues near each EVCS, sorted from the 1st to the 10th most common venue in Boston, MA.
> 10. `evcs_merged` - the EVCS dataframe merged with the sorted nearby venues (1st to 10th most common venue) in Boston, MA.
> 11. `map_clusters` - the map visualizing the EVCS clusters produced by K-means clustering in Boston, MA.

Based on the definition of our problem, the factors that will influence our decision are:
* the number of existing leisure facilities in the neighborhood (categorized by venue type);
* the clustering of the existing places in the neighborhood.

The following data sources are needed to extract or generate the required information:
* the charging station data was obtained from the official website of the Alternative Fuels Data Center, Office of Energy Efficiency & Renewable Energy, U.S. Department of Energy;
* the number of restaurants and their type and location in every neighborhood will be obtained using the **Foursquare API**.

In [1]:
import numpy as np # library to handle data in a vectorized manner
import pandas as pd # library for data analysis
import json # library to handle JSON files
!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
# import k-means from clustering stage
from sklearn.cluster import KMeans
!conda install -c conda-forge folium=0.5.0 --yes 
import folium # map rendering library. Folium is a powerful Python library that helps people create several types of Leaflet maps. 


In [2]:
with open('nrel-station_elec_equipments_mv.json') as json_data:
    elec_car_station_data = json.load(json_data)
elec_car_station_data

{'type': 'FeatureCollection',
 'totalFeatures': 26537,
 'features': [{'type': 'Feature',
   'id': 'station_elec_equipments_mv.fid-12dd6121_16dfb743c61_-6d13',
   'geometry': {'type': 'Point', 'coordinates': [-121.4926, 38.5783802]},
   'geometry_name': 'the_geom',
   'properties': {'station_name': 'City of Sacramento - Capitol Parking Garage',
    'fuel_type_code': 'ELEC',
    'status_code': 'AVBL',
    'street_address': '1015 L St',
    'city': 'Sacramento',
    'st_prv_code': 'CA',
    'zip': '95814',
    'station_phone': '888-758-4389  916-264-5011',
    'access_days_time': 'Garage business hours; pay lot',
    'groups_with_access_code': 'Public',
    'cards_accepted': None,
    'longitude': -121.4926,
    'latitude': 38.5783802}},
  {'type': 'Feature',
   'id': 'station_elec_equipments_mv.fid-12dd6121_16dfb743c61_-6d12',
   'geometry': {'type': 'Point', 'coordinates': [-118.387136, 34.249396]},
   'geometry_name': 'the_geom',
   'properties': {'station_name': 'LADWP - Truesdale Cen

In [3]:
evcs_data=elec_car_station_data['features']

In [4]:
evcs_data[0]

{'type': 'Feature',
 'id': 'station_elec_equipments_mv.fid-12dd6121_16dfb743c61_-6d13',
 'geometry': {'type': 'Point', 'coordinates': [-121.4926, 38.5783802]},
 'geometry_name': 'the_geom',
 'properties': {'station_name': 'City of Sacramento - Capitol Parking Garage',
  'fuel_type_code': 'ELEC',
  'status_code': 'AVBL',
  'street_address': '1015 L St',
  'city': 'Sacramento',
  'st_prv_code': 'CA',
  'zip': '95814',
  'station_phone': '888-758-4389  916-264-5011',
  'access_days_time': 'Garage business hours; pay lot',
  'groups_with_access_code': 'Public',
  'cards_accepted': None,
  'longitude': -121.4926,
  'latitude': 38.5783802}}

In [5]:
# define the dataframe columns
column_names = ['State', 'City', 'Status','Station Name','Street Address','Access Time','Group','Longitude','Latitude'] 

# instantiate the dataframe
evcs = pd.DataFrame(columns=column_names)

In [6]:
evcs

Unnamed: 0,State,City,Status,Station Name,Street Address,Access Time,Group,Longitude,Latitude


In [7]:
# collect one record per station, then build the dataframe in a single step
# (DataFrame.append was removed in pandas 2.0 and is slow inside a loop)
rows = []
for data in evcs_data:
    props = data['properties']
    rows.append({'State': props['st_prv_code'],
                 'City': props['city'],
                 'Status': props['status_code'],
                 'Station Name': props['station_name'],
                 'Street Address': props['street_address'],
                 'Access Time': props['access_days_time'],
                 'Group': props['groups_with_access_code'],
                 'Longitude': props['longitude'],
                 'Latitude': props['latitude']})

evcs = pd.DataFrame(rows, columns=column_names)

In [8]:
evcs.head()

| | State | City | Status | Station Name | Street Address | Access Time | Group | Longitude | Latitude |
|---|---|---|---|---|---|---|---|---|---|
| 0 | CA | Sacramento | AVBL | City of Sacramento - Capitol Parking Garage | 1015 L St | Garage business hours; pay lot | Public | -121.4926 | 38.57838 |
| 1 | CA | Sun Valley | AVBL | LADWP - Truesdale Center | 11791 Truesdale St | | Private - Government only | -118.387136 | 34.249396 |
| 2 | CA | Rosemead | AVBL | Southern California Edison - Rosemead Office B... | 2244 Walnut Grove Ave | Employee use only | Private | -118.081014 | 34.050745 |
| 3 | CA | Los Angeles | AVBL | Los Angeles Convention Center | 1201 S Figueroa St | 24 hours daily; pay lot | Public | -118.268762 | 34.04057 |
| 4 | CA | Los Angeles | AVBL | LADWP - John Ferraro Building | 111 N Hope St | 24 hours daily | Public | -118.2498 | 34.057922 |


In [9]:
evcs_boston = evcs[evcs['City'] == 'Boston'].reset_index(drop=True)
evcs_boston.head()

| | State | City | Status | Station Name | Street Address | Access Time | Group | Longitude | Latitude |
|---|---|---|---|---|---|---|---|---|---|
| 0 | MA | Boston | AVBL | SEAPORT GARAGE | 1 Seaport Ln | 24 hours daily | Public | -71.041742 | 42.349316 |
| 1 | MA | Boston | AVBL | CLARENDON GROUP | 265 Franklin St | 24 hours daily | Public | -71.053321 | 42.356527 |
| 2 | MA | Boston | AVBL | PRUDENTIAL CTR | 800 Boylston St | 24 hours daily | Public | -71.082497 | 42.347275 |
| 3 | MA | Boston | AVBL | State Street Garage | 75 State St | 24 hours daily; pay lot | Public | -71.055089 | 42.358449 |
| 4 | MA | Boston | AVBL | State Street Financial Center Parking | 1 Lincoln St | 24 hours daily | Public | -71.057957 | 42.352954 |


In [10]:
evcs_boston.shape

(107, 9)


Let's visualize the City of Boston, MA.

##  3. Methodology <a name="methodology"></a> 

In this project I focus on detecting areas of Boston with a high density of electric vehicle charging stations and, in particular, on identifying the categories of the surrounding leisure facilities. The analysis is limited to a radius of about 500 m and to at most 100 venues around each electric vehicle charging station.

> In the first step, I visualize the geolocation (latitude and longitude) of the electric vehicle charging stations in the City of Boston, MA, by creating a folium map.

> In the second step, I collect the required **data: the location and type (category) of every leisure place within 500 m (10-min walking distance) of each EVCS** (according to Foursquare categorization), **limited to 100 venues around each EVCS**. The analysis therefore not only calculates and explores the '**leisure facility density**' across different areas of Boston using the **folium map package**, but also attaches pop-up text to each marker, which displays the name of the electric vehicle charging station when the marker is clicked.

> In the third and final step, I create clusters (using **k-means clustering**) of those locations to identify general zones / neighborhoods / addresses that should serve as a starting point for exploring each electric vehicle charging station and searching for its nearby venues.
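To make the 500 m radius concrete, here is a minimal sketch that computes the great-circle distance between a station and a venue with the haversine formula. The function name `haversine_m` and the Earth-radius constant are assumptions introduced for illustration; the coordinates are those of SEAPORT GARAGE and the first venue listed for it in the venue table of this notebook. This is only a way to sanity-check the radius, not the method Foursquare itself uses to filter venues.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points."""
    r = 6371000.0  # mean Earth radius in metres (approximation)
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

# Coordinates copied from the notebook output: SEAPORT GARAGE and a nearby hotel.
station = (42.349316, -71.041742)
venue = (42.349241, -71.041362)

distance = haversine_m(*station, *venue)
within_radius = distance <= 500  # 500 m is the search radius used in this project
print(round(distance, 1), within_radius)
```

The same check could be applied to every row of `evcs_venues` to confirm that the returned venues respect the requested radius.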

1. `Customer segmentation` is the practice of partitioning a customer base into groups of individuals that have similar characteristics. It is a significant strategy because it allows a business to target specific groups of customers and so allocate marketing resources more effectively.

2. `Clustering` groups data in an unsupervised way, based only on how similar the objects are to each other. It partitions them into mutually exclusive groups (in this project, five clusters), so that a profile can be created for each group from the characteristics its members share. A cluster is a group of data points in a dataset that are similar to the other points in the same group and dissimilar to the points in other clusters.

3. In my analysis, I use the **k-means clustering algorithm** to group similar charging stations (the "customers" of this segmentation) and assign each one to a cluster according to the attributes it shares with the others, such as the status of its surrounding neighborhood. K-means is a partitioning method: it divides the unlabeled data into K non-overlapping subsets, or clusters, without any internal cluster structure or labels. Objects within a cluster are very similar, and objects in different clusters are dissimilar. `K-Means tries to minimize the intra-cluster distances and maximize the inter-cluster distances.`
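The assignment and update steps behind k-means can be sketched in plain Python as follows. This is an illustrative toy version only: the notebook itself relies on scikit-learn's `KMeans`, and the function name `kmeans`, the naive initialization (the first `k` points), and the 2-D sample points are all assumptions made for this sketch.

```python
def kmeans(points, k, iterations=100):
    """Toy Lloyd's algorithm on a list of equal-length tuples."""
    centroids = [points[i] for i in range(k)]  # naive init: first k points (assumption)
    labels = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid
        # (squared Euclidean distance).
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the centroids are stable
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids

# Hypothetical 2-D points forming two obvious groups.
points = [(0.0, 0.0), (5.0, 5.0), (0.2, 0.1), (5.1, 4.9), (0.1, 0.3), (4.8, 5.2)]
labels, centroids = kmeans(points, k=2)
print(labels, centroids)
```

In the notebook, each "point" is one row of `evcs_grouped`, that is, a vector of venue-category frequencies for a station, rather than a 2-D coordinate.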

In [11]:
address = 'Boston, MA'
geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Boston are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Boston are 42.3602534, -71.0582912.


In [12]:
# create map of Boston using latitude and longitude values
evcs_boston_map = folium.Map(location=[latitude, longitude], zoom_start=12)

# add markers to map
for lat, lng, label in zip(evcs_boston['Latitude'], evcs_boston['Longitude'], evcs_boston['Station Name']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(evcs_boston_map)  
    
evcs_boston_map

### Foursquare
Now that we have our location candidates, let's use the Foursquare API to get information on the leisure facilities in each neighborhood.

Foursquare credentials are defined in the hidden cell below.

In [13]:
# @hidden_cell
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID (redacted)
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret (redacted)
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: YOUR_FOURSQUARE_CLIENT_ID
CLIENT_SECRET: YOUR_FOURSQUARE_CLIENT_SECRET


In [14]:
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius

def getNearbyVenues(names, latitudes, longitudes):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Station Name', 
                  'EVCS Latitude', 
                  'EVCS Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
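The list comprehension in `getNearbyVenues` assumes that every item has a venue name, a location, and at least one category. As a hedged sketch, the flattening step could be made more defensive as shown here. The helper name `extract_venue_rows` and the `sample` payload are hypothetical: the payload is hand-written to mirror only the keys that the function above accesses, and is not a real Foursquare response.

```python
def extract_venue_rows(payload, station_name, station_lat, station_lng):
    """Flatten an explore-style response into rows, skipping malformed items."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        categories = venue.get("categories") or []
        if "name" not in venue or not categories:
            continue  # skip venues without a name or a category
        rows.append((station_name, station_lat, station_lng,
                     venue["name"], location.get("lat"), location.get("lng"),
                     categories[0]["name"]))
    return rows

# Hypothetical payload mirroring the keys used in getNearbyVenues.
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Tatte Bakery & Cafe",
               "location": {"lat": 42.351966, "lng": -71.043246},
               "categories": [{"name": "Bakery"}]}},
    {"venue": {"name": "Uncategorised place",
               "location": {"lat": 42.35, "lng": -71.04},
               "categories": []}},
]}]}}

rows = extract_venue_rows(sample, "SEAPORT GARAGE", 42.349316, -71.041742)
empty = extract_venue_rows({}, "SEAPORT GARAGE", 42.349316, -71.041742)
print(rows, empty)
```

With this shape, a station whose request returns no groups contributes no rows instead of raising an `IndexError` or `KeyError`.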

In [15]:
evcs_venues = getNearbyVenues(names=evcs_boston['Station Name'],
                                   latitudes=evcs_boston['Latitude'],
                                   longitudes=evcs_boston['Longitude']
                                  )

SEAPORT GARAGE
CLARENDON GROUP
PRUDENTIAL CTR
State Street Garage
State Street Financial Center Parking
GARAGE AT PO SQ
RES INN FENWAY
TRILOGY
TRILOGY
LONGWOOD GARAGE
PRUDENTIAL CTR
601 Congress Street
Standard Parking
WHOLE FOODS MKT
CHARLES RIVER
CHARLES RIVER
ATLANTIC WHARF
MASSPORT
MASSPORT
TRANSCOMM
TRANSCOMM
UDR
FEDERAL RESERVE
BRIGHAM CIRCLE
PILGRIM PARKING
50 POST OFFICE
CHARGE STATION
XL Hybrids Inc
125 HIGH ST
100 High Street
Midtown Hotel
The Lenox Hotel Watt Station
101 SEAPORT
SEAPORT GARAGE
Medical, Academic, and Scientific Community Organization
WMK
19-23 DRYDOCK
19-23 DRYDOCK
CAMELOT COURT
VAN NESS
VAN NESS
VAN NESS
JACKSON UE
BOSTON COLLEGE
BWH-5FRGARAGE
GTI PROPERTIES
TRANSCOMM
PILGRIM PARKING
LEVEL P3
100 CLARENDON
Channel Center Garage
North Station Garage
Longfellow Garage
WHOLE FOODS MKT
HARVARD MEDICAL
JOHN HANCOCK
ONE SEAPORT
THE HARLO
100 CLARENDON
FID KENNEDY
WMSP
Copley Place
RCC CAR CHARGE
Millennium Tower Boston
MFA
Environmental Protection Agency
345 HARRI

## 4. Exploratory Data Analysis  <a name="analysis"></a> 

Let's perform some basic exploratory data analysis and derive some additional information from our raw data. First, let's count the number of leisure facilities and venue categories in every area candidate:

In [16]:
print(evcs_venues.shape)
evcs_venues.head()

(6870, 7)


| | Station Name | EVCS Latitude | EVCS Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | SEAPORT GARAGE | 42.349316 | -71.041742 | Seaport Hotel & World Trade Center | 42.349241 | -71.041362 | Hotel |
| 1 | SEAPORT GARAGE | 42.349316 | -71.041742 | Mortons Steakhouse Seaport Boston | 42.348951 | -71.040781 | Steakhouse |
| 2 | SEAPORT GARAGE | 42.349316 | -71.041742 | Ocean Prime | 42.351345 | -71.043478 | Steakhouse |
| 3 | SEAPORT GARAGE | 42.349316 | -71.041742 | Del Frisco's Double Eagle Steak House | 42.348909 | -71.038495 | Steakhouse |
| 4 | SEAPORT GARAGE | 42.349316 | -71.041742 | Tatte Bakery & Cafe | 42.351966 | -71.043246 | Bakery |


In [17]:
evcs_venues.groupby('Station Name').count()

| Station Name | EVCS Latitude | EVCS Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|
| 10 St. James Ave. / 75 Arlington ST | 100 | 100 | 100 | 100 | 100 | 100 |
| 100 CLARENDON | 200 | 200 | 200 | 200 | 200 | 200 |
| 100 High Street | 100 | 100 | 100 | 100 | 100 | 100 |
| 100 Northern | 50 | 50 | 50 | 50 | 50 | 50 |
| 101 SEAPORT | 76 | 76 | 76 | 76 | 76 | 76 |
| ... | ... | ... | ... | ... | ... | ... |
| W Boston Hotel Hotel & Residences - Tesla Destination | 100 | 100 | 100 | 100 | 100 | 100 |
| WHOLE FOODS MKT | 70 | 70 | 70 | 70 | 70 | 70 |
| WMK | 91 | 91 | 91 | 91 | 91 | 91 |
| WMSP | 194 | 194 | 194 | 194 | 194 | 194 |


In [18]:
print('There are {} unique categories.'.format(len(evcs_venues['Venue Category'].unique())))

There are 259 unique categories.


### (1) Analyze Each Neighborhood <a name="analyze"></a> 

In [19]:
# one hot encoding
evcs_onehot = pd.get_dummies(evcs_venues[['Venue Category']], prefix="", prefix_sep="")

# add the station name column back to the dataframe
evcs_onehot['Station Name'] = evcs_venues['Station Name'] 

# move the station name column to the first position
fixed_columns = [evcs_onehot.columns[-1]] + list(evcs_onehot.columns[:-1])
evcs_onehot = evcs_onehot[fixed_columns]

evcs_onehot.head()

Unnamed: 0,Station Name,ATM,Accessories Store,Afghan Restaurant,Airport,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio,Zoo Exhibit
0,SEAPORT GARAGE,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,SEAPORT GARAGE,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,SEAPORT GARAGE,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,SEAPORT GARAGE,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,SEAPORT GARAGE,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


And let's examine the new dataframe size.

In [20]:
evcs_onehot.shape

(6870, 260)

#### Next, let's group rows by neighborhood and take the mean of the frequency of occurrence of each category

In [21]:
evcs_grouped = evcs_onehot.groupby('Station Name').mean().reset_index()
evcs_grouped

Unnamed: 0,Station Name,ATM,Accessories Store,Afghan Restaurant,Airport,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio,Zoo Exhibit
0,10 St. James Ave. / 75 Arlington ST,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.050000,0.000000,...,0.010000,0.0,0.000000,0.00,0.000,0.010000,0.0,0.01,0.010000,0.0
1,100 CLARENDON,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.055000,0.000000,...,0.000000,0.0,0.000000,0.01,0.005,0.020000,0.0,0.01,0.010000,0.0
2,100 High Street,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.030000,0.000000,...,0.010000,0.0,0.000000,0.00,0.020,0.010000,0.0,0.00,0.000000,0.0
3,100 Northern,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.020000,0.000000,...,0.020000,0.0,0.000000,0.00,0.000,0.000000,0.0,0.00,0.000000,0.0
4,101 SEAPORT,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,...,0.013158,0.0,0.000000,0.00,0.000,0.013158,0.0,0.00,0.000000,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
84,W Boston Hotel Hotel & Residences - Tesla Dest...,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.010000,0.000000,...,0.010000,0.0,0.000000,0.02,0.000,0.010000,0.0,0.00,0.010000,0.0
85,WHOLE FOODS MKT,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.014286,0.014286,...,0.000000,0.0,0.014286,0.00,0.000,0.014286,0.0,0.00,0.014286,0.0
86,WMK,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.010989,0.000000,...,0.010989,0.0,0.000000,0.00,0.000,0.010989,0.0,0.00,0.000000,0.0
87,WMSP,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.020619,0.000000,...,0.005155,0.0,0.000000,0.00,0.000,0.000000,0.0,0.00,0.000000,0.0


#### Let's confirm the new size

In [22]:
evcs_grouped.shape

(89, 260)
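Taking the mean of one-hot columns per station is the same as computing, for each station, the share of its venues that fall in each category. The arithmetic can be made explicit without pandas, as in this minimal sketch; the station names and the `pairs` list are hypothetical stand-ins for the rows of `evcs_venues`.

```python
from collections import Counter, defaultdict

# Hypothetical (station, category) pairs standing in for evcs_venues rows.
pairs = [
    ("Station A", "Coffee Shop"), ("Station A", "Coffee Shop"),
    ("Station A", "Hotel"), ("Station A", "Bakery"),
    ("Station B", "Hotel"), ("Station B", "Steakhouse"),
]

# Count how often each category occurs around each station.
counts = defaultdict(Counter)
for station, category in pairs:
    counts[station][category] += 1

categories = sorted({category for _, category in pairs})

# One row per station: share of that station's venues in each category.
frequencies = {
    station: {c: counter[c] / sum(counter.values()) for c in categories}
    for station, counter in counts.items()
}
print(frequencies)
```

Each row sums to 1, which is why the values in `evcs_grouped` are small fractions such as 0.05 rather than raw counts.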

#### Let's print each neighborhood along with the top 5 most common venues

First, let's write a function to sort the venues in descending order.

In [23]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [24]:
num_top_venues = 5

for hood in evcs_grouped['Station Name']:
    print("----"+hood+"----")
    temp = evcs_grouped[evcs_grouped['Station Name'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----10 St. James Ave. / 75 Arlington ST----
                 venue  freq
0                  Spa  0.08
1                Hotel  0.06
2       Sandwich Place  0.05
3  American Restaurant  0.05
4   Seafood Restaurant  0.05


----100 CLARENDON----
                  venue  freq
0   American Restaurant  0.06
1    Seafood Restaurant  0.04
2                   Gym  0.04
3                   Spa  0.04
4  Gym / Fitness Center  0.03


----100 High Street----
                venue  freq
0         Coffee Shop  0.10
1      Sandwich Place  0.08
2  Italian Restaurant  0.04
3        Burger Joint  0.03
4         Salad Place  0.03


----100 Northern----
                      venue  freq
0        Italian Restaurant  0.06
1                Steakhouse  0.06
2                     Hotel  0.06
3  Mediterranean Restaurant  0.04
4                Taco Place  0.04


----101 SEAPORT----
                venue  freq
0  Italian Restaurant  0.07
1    Asian Restaurant  0.05
2               Hotel  0.05
3         Coffee Shop  

#### Let's put that into a *pandas* dataframe


Now let's create the new dataframe and display the top 10 venues for each neighborhood.

In [26]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Station Name']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:  # ranks beyond 3rd take the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
evcs_venues_sorted = pd.DataFrame(columns=columns)
evcs_venues_sorted['Station Name'] = evcs_grouped['Station Name']

for ind in np.arange(evcs_grouped.shape[0]):
    evcs_venues_sorted.iloc[ind, 1:] = return_most_common_venues(evcs_grouped.iloc[ind, :], num_top_venues)

evcs_venues_sorted.head()

Unnamed: 0,Station Name,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,10 St. James Ave. / 75 Arlington ST,Spa,Hotel,Sandwich Place,American Restaurant,Seafood Restaurant,Theater,Gym,Gym / Fitness Center,Italian Restaurant,Jewelry Store
1,100 CLARENDON,American Restaurant,Spa,Gym,Seafood Restaurant,Gym / Fitness Center,Hotel,Sandwich Place,Italian Restaurant,Department Store,Cosmetics Shop
2,100 High Street,Coffee Shop,Sandwich Place,Italian Restaurant,Café,Falafel Restaurant,American Restaurant,Park,New American Restaurant,Hotel,Burger Joint
3,100 Northern,Hotel,Italian Restaurant,Steakhouse,Seafood Restaurant,Park,Mediterranean Restaurant,Taco Place,Gym,Coffee Shop,Salad Place
4,101 SEAPORT,Italian Restaurant,Coffee Shop,Asian Restaurant,Hotel,Steakhouse,Seafood Restaurant,Gym,Bakery,Mediterranean Restaurant,Salad Place


In [27]:
evcs_venues_sorted.shape

(89, 11)
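The top-5 printout and the top-10 table do not always list categories in the same order (compare the entries for 100 CLARENDON). A plausible reason is that the printout sorts rounded frequencies while the table sorts the raw ones, and neither defines an order among ties. One way to make the ranking deterministic is to break ties alphabetically, as in this sketch; the helper name `top_categories` and the sample frequencies are hypothetical.

```python
def top_categories(freqs, n):
    """Return the n most frequent categories; ties are broken alphabetically."""
    ranked = sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:n]]

# Hypothetical frequencies for one station, with a three-way tie.
freqs = {
    "Spa": 0.04,
    "Gym": 0.04,
    "American Restaurant": 0.055,
    "Seafood Restaurant": 0.04,
}
top3 = top_categories(freqs, 3)
print(top3)
```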

### (2) Cluster Neighborhoods <a name="cluster"></a> 

Run *k*-means to cluster the neighborhoods into 5 clusters.

In [28]:
# set number of clusters
kclusters = 5

evcs_grouped_clustering = evcs_grouped.drop(columns='Station Name')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(evcs_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([3, 3, 1, 1, 1, 1, 1, 1, 3, 1], dtype=int32)

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [29]:
# add clustering labels
evcs_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

evcs_merged = evcs_boston

# join the cluster labels and top venues onto evcs_boston, which holds latitude/longitude for each station
evcs_merged = evcs_merged.join(evcs_venues_sorted.set_index('Station Name'), on='Station Name')

evcs_merged.head() # check the last columns!

Unnamed: 0,State,City,Status,Station Name,Street Address,Access Time,Group,Longitude,Latitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,MA,Boston,AVBL,SEAPORT GARAGE,1 Seaport Ln,24 hours daily,Public,-71.041742,42.349316,1,Seafood Restaurant,Coffee Shop,Bar,Hotel,Italian Restaurant,Steakhouse,Donut Shop,American Restaurant,Park,Taco Place
1,MA,Boston,AVBL,CLARENDON GROUP,265 Franklin St,24 hours daily,Public,-71.053321,42.356527,1,Coffee Shop,Hotel,Seafood Restaurant,Bakery,American Restaurant,Historic Site,Sandwich Place,Salad Place,Park,Italian Restaurant
2,MA,Boston,AVBL,PRUDENTIAL CTR,800 Boylston St,24 hours daily,Public,-71.082497,42.347275,3,Hotel,Italian Restaurant,Seafood Restaurant,Coffee Shop,American Restaurant,Bar,Ice Cream Shop,Dessert Shop,Gym / Fitness Center,Spa
3,MA,Boston,AVBL,State Street Garage,75 State St,24 hours daily; pay lot,Public,-71.055089,42.358449,1,Seafood Restaurant,Historic Site,Coffee Shop,Park,American Restaurant,Bakery,Hotel,Salad Place,Italian Restaurant,Sandwich Place
4,MA,Boston,AVBL,State Street Financial Center Parking,1 Lincoln St,24 hours daily,Public,-71.057957,42.352954,1,Coffee Shop,Sandwich Place,Asian Restaurant,Chinese Restaurant,Bakery,Café,Gym / Fitness Center,Sushi Restaurant,Gym,Italian Restaurant


Finally, let's visualize the resulting clusters.

In [30]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=12)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(evcs_merged['Latitude'], evcs_merged['Longitude'], evcs_merged['Station Name'], evcs_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
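Because the marker colour is looked up with `rainbow[cluster-1]`, cluster label 0 wraps around to the last colour in the list, and label 1 takes the first. The sketch here mirrors that lookup with colour names instead of hex codes. The names in `palette` are approximate, illustrative labels for five evenly spaced colours of a purple-to-red rainbow colormap and are an assumption, not the exact values stored in `rainbow`.

```python
# Approximate names for five evenly spaced rainbow colours (illustrative only).
palette = ["purple", "blue", "green", "orange", "red"]

def marker_colour(label):
    """Mirror the notebook's `rainbow[cluster-1]` lookup."""
    return palette[label - 1]  # label 0 becomes index -1, the last colour

colour_by_label = {label: marker_colour(label) for label in range(len(palette))}
print(colour_by_label)
```

This is why the station group with label 0 appears in red on the map and the group with label 1 appears in purple.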

### (3) Examine Clusters <a name="examine"></a> 

Now, let's examine each cluster and determine the discriminating venue categories that distinguish it.

#### Cluster 1 (red point on the map)

In [31]:
evcs_merged.loc[evcs_merged['Cluster Labels'] == 0, evcs_merged.columns[[3] + list(range(5, evcs_merged.shape[1]))]]

Unnamed: 0,Station Name,Access Time,Group,Longitude,Latitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
9,LONGWOOD GARAGE,24 hours daily,Public,-71.105662,42.338179,0,Café,American Restaurant,Coffee Shop,Falafel Restaurant,Sandwich Place,Gastropub,Pharmacy,Park,Bus Stop,Burrito Place
23,BRIGHAM CIRCLE,24 hours daily,Public,-71.103849,42.332801,0,Café,Sandwich Place,Pizza Place,Sushi Restaurant,Gastropub,Caribbean Restaurant,New American Restaurant,Greek Restaurant,Grocery Store,Pub
34,"Medical, Academic, and Scientific Community Or...",24 hours daily,Public,-71.108312,42.340042,0,Café,Coffee Shop,Donut Shop,Fast Food Restaurant,Pharmacy,American Restaurant,Sandwich Place,Metro Station,Falafel Restaurant,Shipping Store
43,BOSTON COLLEGE,24 hours daily,Public,-71.168664,42.337087,0,ATM,Café,Mexican Restaurant,Donut Shop,Shipping Store,Bus Stop,Bus Station,Ice Cream Shop,Costume Shop,Concert Hall
44,BWH-5FRGARAGE,24 hours daily,Public,-71.108562,42.335386,0,Café,American Restaurant,Sandwich Place,Falafel Restaurant,Bus Stop,Sushi Restaurant,Coffee Shop,Ice Cream Shop,Bar,Pharmacy
54,HARVARD MEDICAL,24 hours daily,Public,-71.103142,42.338445,0,Café,Coffee Shop,Sandwich Place,Falafel Restaurant,Fast Food Restaurant,Italian Restaurant,Park,Burrito Place,Shipping Store,Donut Shop
76,DoubleTree Club by Hilton Boston-Bayside - Tes...,24 hours daily; for Tesla use only; for custom...,Public,-71.045934,42.319148,0,Coffee Shop,Bank,Harbor / Marina,Grocery Store,Liquor Store,Café,Breakfast Spot,Bike Rental / Bike Share,Hotel,Antique Shop
85,Harvard Medical School,,Private,-71.102475,42.336489,0,Café,Italian Restaurant,American Restaurant,Gym,Sushi Restaurant,Sandwich Place,Coffee Shop,Falafel Restaurant,Ice Cream Shop,Bar
89,MASCO E.V.E.,24 hours daily,Public,-71.108014,42.339954,0,Café,Coffee Shop,Donut Shop,Falafel Restaurant,Sandwich Place,Pharmacy,American Restaurant,Fast Food Restaurant,Italian Restaurant,Bank
104,SCHRAFFTS CENTE,24 hours daily,Public,-71.072489,42.384195,0,Gym / Fitness Center,Café,Bus Stop,Electronics Store,Gym,Bus Station,Art Gallery,Baseball Field,Theater,Auto Workshop


#### Cluster 2 (purple point on the map)

In [32]:
evcs_merged.loc[evcs_merged['Cluster Labels'] == 1, evcs_merged.columns[[3] + list(range(5, evcs_merged.shape[1]))]]

Unnamed: 0,Station Name,Access Time,Group,Longitude,Latitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,SEAPORT GARAGE,24 hours daily,Public,-71.041742,42.349316,1,Seafood Restaurant,Coffee Shop,Bar,Hotel,Italian Restaurant,Steakhouse,Donut Shop,American Restaurant,Park,Taco Place
1,CLARENDON GROUP,24 hours daily,Public,-71.053321,42.356527,1,Coffee Shop,Hotel,Seafood Restaurant,Bakery,American Restaurant,Historic Site,Sandwich Place,Salad Place,Park,Italian Restaurant
3,State Street Garage,24 hours daily; pay lot,Public,-71.055089,42.358449,1,Seafood Restaurant,Historic Site,Coffee Shop,Park,American Restaurant,Bakery,Hotel,Salad Place,Italian Restaurant,Sandwich Place
4,State Street Financial Center Parking,24 hours daily,Public,-71.057957,42.352954,1,Coffee Shop,Sandwich Place,Asian Restaurant,Chinese Restaurant,Bakery,Café,Gym / Fitness Center,Sushi Restaurant,Gym,Italian Restaurant
5,GARAGE AT PO SQ,24 hours daily,Public,-71.055596,42.356276,1,Sandwich Place,Coffee Shop,Historic Site,Hotel,Italian Restaurant,New American Restaurant,American Restaurant,Salad Place,Seafood Restaurant,Park
12,Standard Parking,Garage business hours,Public,-71.062137,42.359957,1,Coffee Shop,Historic Site,American Restaurant,Seafood Restaurant,Hotel,Sandwich Place,New American Restaurant,Cocktail Bar,Salad Place,Restaurant
16,ATLANTIC WHARF,24 hours daily,Public,-71.052761,42.353325,1,Coffee Shop,Italian Restaurant,Hotel,Sandwich Place,French Restaurant,Bar,Salad Place,Cocktail Bar,Asian Restaurant,Park
21,UDR,24 hours daily,Public,-71.043399,42.351138,1,Italian Restaurant,Steakhouse,Coffee Shop,Hotel,Seafood Restaurant,Park,Bar,Asian Restaurant,Taco Place,Bakery
22,FEDERAL RESERVE,24 hours daily,Public,-71.053416,42.352676,1,Sandwich Place,Coffee Shop,Italian Restaurant,Hotel,Asian Restaurant,Bar,French Restaurant,Seafood Restaurant,Pizza Place,Dive Bar
24,PILGRIM PARKING,24 hours daily,Public,-71.060194,42.350088,1,Asian Restaurant,Chinese Restaurant,Bakery,Coffee Shop,Sushi Restaurant,Sandwich Place,Theater,Gym,Performing Arts Venue,Café


#### Cluster 3 (blue point on the map)

In [33]:
evcs_merged.loc[evcs_merged['Cluster Labels'] == 2, evcs_merged.columns[[3] + list(range(5, evcs_merged.shape[1]))]]

Unnamed: 0,Station Name,Access Time,Group,Longitude,Latitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
62,RCC CAR CHARGE,24 hours daily,Public,-71.093021,42.278609,2,Ice Cream Shop,Moving Target,Liquor Store,Park,Zoo Exhibit,Donut Shop,Dive Bar,Doctor's Office,Dog Run,Duty-free Shop


#### Cluster 4 (brilliant green point on the map)

In [34]:
evcs_merged.loc[evcs_merged['Cluster Labels'] == 3, evcs_merged.columns[[3] + list(range(5, evcs_merged.shape[1]))]]

Unnamed: 0,Station Name,Access Time,Group,Longitude,Latitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
2,PRUDENTIAL CTR,24 hours daily,Public,-71.082497,42.347275,3,Hotel,Italian Restaurant,Seafood Restaurant,Coffee Shop,American Restaurant,Bar,Ice Cream Shop,Dessert Shop,Gym / Fitness Center,Spa
6,RES INN FENWAY,24 hours daily,Public,-71.100562,42.345877,3,American Restaurant,Lounge,Sports Bar,Mexican Restaurant,Thai Restaurant,Café,Donut Shop,Pizza Place,Coffee Shop,Hotel
7,TRILOGY,24 hours daily,Public,-71.101413,42.344161,3,Sports Bar,American Restaurant,Thai Restaurant,Lounge,Pizza Place,Mexican Restaurant,Café,Greek Restaurant,Baseball Field,Liquor Store
8,TRILOGY,24 hours daily,Public,-71.098469,42.34407,3,Sports Bar,American Restaurant,Thai Restaurant,Lounge,Pizza Place,Mexican Restaurant,Café,Greek Restaurant,Baseball Field,Liquor Store
10,PRUDENTIAL CTR,24 hours daily,Public,-71.082532,42.347152,3,Hotel,Italian Restaurant,Seafood Restaurant,Coffee Shop,American Restaurant,Bar,Ice Cream Shop,Dessert Shop,Gym / Fitness Center,Spa
11,601 Congress Street,MO: Not Specified; TU: Not Specified; WE: Not ...,Public,-71.039867,42.347527,3,Seafood Restaurant,Donut Shop,American Restaurant,Music Venue,Harbor / Marina,Steakhouse,Deli / Bodega,Café,Sandwich Place,Hotel
13,WHOLE FOODS MKT,24 hours daily,Public,-71.062833,42.345169,3,Grocery Store,Art Gallery,Pizza Place,Dog Run,Chinese Restaurant,Italian Restaurant,Japanese Restaurant,Hotel,Thrift / Vintage Store,Pharmacy
19,TRANSCOMM,24 hours daily,Public,-71.066885,42.336015,3,Sandwich Place,Donut Shop,Mediterranean Restaurant,Flower Shop,Department Store,Tapas Restaurant,Bakery,Café,Southern / Soul Food Restaurant,Miscellaneous Shop
20,TRANSCOMM,24 hours daily,Public,-71.068898,42.336521,3,Sandwich Place,Donut Shop,Mediterranean Restaurant,Flower Shop,Department Store,Tapas Restaurant,Bakery,Café,Southern / Soul Food Restaurant,Miscellaneous Shop
30,Midtown Hotel,MO: Not Specified; TU: Not Specified; WE: Not ...,Public,-71.083958,42.343261,3,Coffee Shop,Sushi Restaurant,American Restaurant,Pizza Place,Bakery,Bar,Concert Hall,Middle Eastern Restaurant,Bookstore,Grocery Store


#### Cluster 5 (brown point on the map)

In [35]:
evcs_merged.loc[evcs_merged['Cluster Labels'] == 4, evcs_merged.columns[[3] + list(range(4, evcs_merged.shape[1]))]]

Unnamed: 0,Station Name,Street Address,Access Time,Group,Longitude,Latitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
14,CHARLES RIVER,899-925 Commonwealth Avenue,24 hours daily,Public,-71.117506,42.352459,4,Pizza Place,Coffee Shop,Burger Joint,Chinese Restaurant,Thai Restaurant,Gym / Fitness Center,Yoga Studio,Donut Shop,College Cafeteria,Nightclub
15,CHARLES RIVER,161-179 Ashford St,24 hours daily,Public,-71.122325,42.353708,4,Pizza Place,Coffee Shop,Burger Joint,Chinese Restaurant,Thai Restaurant,Gym / Fitness Center,Yoga Studio,Donut Shop,College Cafeteria,Nightclub
17,MASSPORT,Airport Road - Arrival Level,24 hours daily,Public,-71.020897,42.367584,4,Donut Shop,Coffee Shop,Rental Car Location,Airport Lounge,Harbor / Marina,Seafood Restaurant,Hotel,American Restaurant,Park,Airport Service
18,MASSPORT,Logan Airport Terminal B,24 hours daily,Public,-71.018645,42.363092,4,Donut Shop,Coffee Shop,Rental Car Location,Airport Lounge,Harbor / Marina,Seafood Restaurant,Hotel,American Restaurant,Park,Airport Service
27,XL Hybrids Inc,145 Newton St,Employee use only,Private,-71.169102,42.356504,4,Park,Hotpot Restaurant,Baseball Field,Skating Rink,Harbor / Marina,Toll Plaza,Convenience Store,Discount Store,Event Space,Ethiopian Restaurant
38,CAMELOT COURT,10 Camelot Ct,24 hours daily,Public,-71.140868,42.350007,4,Pizza Place,Donut Shop,Gastropub,Liquor Store,Bar,Burmese Restaurant,Dog Run,Chinese Restaurant,Japanese Restaurant,Bubble Tea Shop
42,JACKSON UE,1542R Columbus Ave,24 hours daily,Public,-71.098236,42.323381,4,Park,Pizza Place,Metro Station,Mobile Phone Shop,Donut Shop,Plaza,Snack Place,Paella Restaurant,Chinese Restaurant,Sandwich Place
48,LEVEL P3,1 Nashua St,24 hours daily,Public,-71.063294,42.365955,4,Donut Shop,Pizza Place,Italian Restaurant,Coffee Shop,Bar,Hotel,Sports Bar,Brewery,Sandwich Place,Park
51,North Station Garage,121 Nashua St,Garage business hours; pay lot,Public,-71.065477,42.36758,4,Donut Shop,Science Museum,Bar,Park,Pizza Place,Sports Bar,Coffee Shop,Sporting Goods Shop,Gym,Liquor Store
52,Longfellow Garage,60 Staniford St,Garage business hours; pay lot,Public,-71.064251,42.362557,4,Pizza Place,Sandwich Place,American Restaurant,Hotel,Bar,Coffee Shop,Café,Mexican Restaurant,Donut Shop,Museum


##  5. Discussions <a name="discussion"></a> 

Determining the number of clusters in a data set, the k in the k-means algorithm, is a recurring problem in data clustering. The right choice of k is often ambiguous because it depends heavily on the shape and scale of the distribution of points in the data set. One commonly used approach is to run the clustering for a range of values of k and track an error metric for each run. This metric can be the mean distance between the data points and their cluster's centroid, which indicates how dense the clusters are, that is, how far the clustering error has been minimized. The catch is that adding clusters always brings the centroids closer to the data points, so the error keeps shrinking as k grows and the smallest error alone cannot identify the best k. Instead, we look at how the metric changes with k and pick the elbow point, where the rate of decrease shifts sharply, as a reasonable k for the clustering. (**This method is called the `elbow method`**)
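The procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the code used in this project: the toy `points` (three small, well-separated groups), the helper names `sq_dist`, `farthest_first`, `kmeans_sse` and `elbow`, and the deterministic farthest-first initialization are all made up for the example. The error metric here is the sum of squared distances to the nearest centroid rather than the mean distance, and the elbow is picked with a simple heuristic: the k where the drop in error slows down the most.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def farthest_first(points, k):
    """Deterministic initialization (an assumption of this sketch):
    start at the first point, then repeatedly add the point that is
    farthest from all centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    return centroids

def kmeans_sse(points, k, max_iter=100):
    """Run basic k-means and return the sum of squared distances
    from each point to its nearest centroid."""
    centroids = farthest_first(points, k)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        updated = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if updated == centroids:  # converged
            break
        centroids = updated
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)

def elbow(sse):
    """Heuristic: the k where the decrease in error slows down the most."""
    ks = sorted(sse)
    return max(
        ks[1:-1],
        key=lambda k: (sse[k - 1] - sse[k]) - (sse[k] - sse[k + 1]),
    )

# Hypothetical toy data: three tight groups of four points each.
blob_centers = [(0, 0), (10, 0), (0, 10)]
offsets = [(-0.5, -0.5), (-0.5, 0.5), (0.5, -0.5), (0.5, 0.5)]
points = [(cx + dx, cy + dy) for cx, cy in blob_centers for dx, dy in offsets]

sse = {k: kmeans_sse(points, k) for k in range(1, 7)}
best_k = elbow(sse)

for k in sorted(sse):
    print(k, round(sse[k], 2))
print("elbow at k =", best_k)
```

On this toy data the error falls steeply up to k = 3 and then flattens, so the heuristic selects k = 3. Real venue data rarely shows such a clean bend, and in practice the error curve is usually plotted and inspected by eye.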

However, pre-specifying the number of clusters is not an easy task.