# Segmenting and Clustering Sao Paulo Subway Stations 

In this notebook, the Foursquare API is used to cluster the neighborhoods around subway (Metro) stations in Sao Paulo according to their nearby venues. The number of venues per category around each station was retrieved through HTTP requests to the Foursquare API and then used to group the stations into clusters. The *k*-means clustering algorithm was selected for this task. Finally, the Folium library was used to visualize the stations in Sao Paulo and their emerging clusters.

[Medium Post by Felipe Testa](https://medium.com/@felipe.testaa/unsupervised-machine-learning-kmeans-clustering-of-s%C3%A3o-paulo-subway-stations-using-foursquare-c5101727dd85)

---
## Introduction
---

The São Paulo Subway, commonly called the Metro, is one of the urban railways that serve the city of São Paulo, alongside the São Paulo Metropolitan Trains Company (CPTM), forming the largest metropolitan rail transport network in Latin America. The six lines in the metro system operate on 101.1 kilometres (62.8 mi) of route, serving 72 stations. The metro system carries about 5,300,000 passengers a day.

The Metro itself is far from covering the entire urban area of São Paulo and runs only within the city limits. However, it is complemented by a network of metropolitan trains operated by CPTM, which serves the city of São Paulo and the São Paulo Metropolitan Region. The two systems combined form a 374 km (232 mi) long network.

Undoubtedly, the subway is an important part of any metropolis around the world. The idea of this project is to segment the neighborhoods of Sao Paulo into major clusters and examine their surroundings. A further aim is to examine the profile of each neighborhood cluster: some neighborhoods are mostly residential, while others have more business or commercial spaces.

This project helps to understand the diversity of the areas around subway stations by combining venue data from Foursquare’s ‘Places API’ with the ‘k-means clustering’ unsupervised machine learning algorithm. By analyzing this data, we can classify stations by their primary usage. This can help city planners determine where people are most likely to travel from and to for work and leisure, plan further extensions of the network and find places for new development. The analysis is also expected to serve as an initial project to help politicians make better public policy for the population.

---
## 1. Setup
---

This first part sets up the libraries and settings used throughout the notebook.

### 1.1 Import Libraries

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import folium
import requests
import matplotlib.pyplot as plt

from pathlib import Path
from configparser import ConfigParser
from sklearn.cluster import KMeans
from geopy.geocoders import Nominatim
from sklearn.preprocessing import MinMaxScaler
from IPython.display import display, HTML, Image

### 1.2 Notebook Settings

In [2]:
# Show all Columns and Rows
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

# Adjust the width of the notebook container
display(HTML('<style>.container{width: 60% !important;}</style>' ))

---
## 2. Data
---

All data for the subway (Metro) stations was downloaded from [Kaggle](https://www.kaggle.com/thiagodsd/sao-paulo-metro/kernels). The station data was then combined with data from Foursquare, a location data provider. This notebook uses RESTful API calls to retrieve data about venues in different areas. See the [Foursquare API](https://developer.foursquare.com/docs) documentation for more details.

### 2.1 Subway Station Data

To segment the metro stations of Sao Paulo, a dataset with the coordinates (latitude and longitude) of all subway stations was downloaded from [Kaggle](https://www.kaggle.com/thiagodsd/sao-paulo-metro/kernels). The .csv file is loaded into a pandas dataframe, which is basically a table of metro stations and their coordinates. A basic cleanup is then performed (renaming columns and dropping the unused ones).

The dataframe contains 79 metro stations. The ‘geopy’ library is then used to get the latitude and longitude of Sao Paulo, which were returned as **Latitude: -23.5506507, Longitude: -46.6333824**. The curated dataframe is visualized on a map of Sao Paulo with the metro stations superimposed on top, generated with the Python ‘folium’ library.

#### Reading

In [3]:
metro_data = pd.read_csv('../data/metrosp_stations.csv', low_memory=False)
metro_data.head()

| | Unnamed: 0 | name | station | lat | lon | line | neigh |
|---|---|---|---|---|---|---|---|
| 0 | aacd-servidor | Aacd Servidor | aacd-servidor | -23.597825 | -46.652374 | ['lilas'] | ['moema', 'hospital-sao-paulo'] |
| 1 | adolfo-pinheiro | Adolfo Pinheiro | adolfo-pinheiro | -23.650073 | -46.704206 | ['lilas'] | ['largo-treze', 'alto-da-boa-vista'] |
| 2 | alto-da-boa-vista | Alto Da Boa Vista | alto-da-boa-vista | -23.641625 | -46.699434 | ['lilas'] | ['adolfo-pinheiro', 'borba-gato'] |
| 3 | alto-do-ipiranga | Alto Do Ipiranga | alto-do-ipiranga | -23.602237 | -46.612486 | ['verde'] | ['santos-imigrantes', 'sacoma'] |
| 4 | ana-rosa | Ana Rosa | ana-rosa | -23.581871 | -46.638104 | ['azul', 'verde'] | ['paraiso', 'vila-mariana', 'paraiso', 'chacar... |


#### Cleaning

In [4]:
# Renaming columns
new_columns = ['unnamed',
               'station_name_spaced',
               'metro_station',
               'latitude',
               'longitude',
               'metro_line',
               'neigh_stations']

metro_data.columns = new_columns

# Selecting columns
metro_data = metro_data.drop(columns=['unnamed', 'neigh_stations', 'station_name_spaced'])

# Data Head
metro_data.head()

| | metro_station | latitude | longitude | metro_line |
|---|---|---|---|---|
| 0 | aacd-servidor | -23.597825 | -46.652374 | ['lilas'] |
| 1 | adolfo-pinheiro | -23.650073 | -46.704206 | ['lilas'] |
| 2 | alto-da-boa-vista | -23.641625 | -46.699434 | ['lilas'] |
| 3 | alto-do-ipiranga | -23.602237 | -46.612486 | ['verde'] |
| 4 | ana-rosa | -23.581871 | -46.638104 | ['azul', 'verde'] |


#### Visualization

In [5]:
metro_data.shape

(79, 4)

In [6]:
address = 'Sao Paulo, Sao Paulo Brazil'
geolocator = Nominatim(user_agent='get_location')

# Geocode the address once and reuse the result
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

city_map = folium.Map(location=[latitude, longitude], zoom_start=12)

# add markers to map
for lat, lng, metro_station, metro_line in zip(metro_data['latitude']
                                               , metro_data['longitude']
                                               , metro_data['metro_station']
                                               , metro_data['metro_line']):
    
    label = '{}, {}'.format(metro_station, metro_line)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
                            [lat, lng],
                            radius=5,
                            popup=label,
                            color='blue',
                            fill=True,
                            fill_color='#3186cc',
                            fill_opacity=0.7,
                            parse_html=False
                        ).add_to(city_map)  
    
city_map

### 2.2 Venues and Categories Data from Foursquare API

To find and explore the venues and categories surrounding each station, the Foursquare API was used. Venues can be categorized as residential, professional, shopping or leisure. Let's see which venue categories Foursquare identifies.

The following is an example of a response from the Foursquare API:

```
{'categories': [{'id': '4d4b7104d754a06370d81259',
'name': 'Arts & Entertainment',
'pluralName': 'Arts & Entertainment',
'shortName': 'Arts & Entertainment',
'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/arts_entertainment/default_',
'suffix': '.png'},
'categories': [{'id': '56aa371be4b08b9a8d5734db',
'name': 'Amphitheater',
'pluralName': 'Amphitheaters',
'shortName': 'Amphitheater',
'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/arts_entertainment/default_',
'suffix': '.png'},
'categories': []},
{'id': '4fceea171983d5d06c3e9823',
'name': 'Aquarium',
'pluralName': 'Aquariums',
'shortName': 'Aquarium',
'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/arts_entertainment/aquarium_',
'suffix': '.png'},
'categories': []}]
```

As mentioned, the Foursquare API is used to explore the surroundings of each metro station and segment them. To access the API, ‘CLIENT_ID’, ‘CLIENT_SECRET’ and ‘VERSION’ are defined in a credentials file. To obtain credentials, sign up on the [Foursquare developer site](https://developer.foursquare.com/).

Foursquare offers many endpoints for various GET requests. To explore the surroundings of each station, what is needed is the number of venues per category defined in the Foursquare Venue Category Hierarchy, which is retrieved using the code below.

#### Getting Credentials for Foursquare API

In [7]:
def get_credentials():
    ''' Return the Foursquare credentials stored in the .ini file '''
    
    credentials = {}
    parser = ConfigParser()
    
    parser.read(str(Path.home()) + "/.config/credentials.ini")

    if parser.has_section('foursquare'):
        items = parser.items('foursquare')
        
        for item in items:
            credentials[item[0]] = item[1]
            
        print('Credentials were loaded successfully')
      
    else:
        print('Credentials were not found')

    return credentials

credentials = get_credentials()

Credentials were not found
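
For reference, `get_credentials` expects an INI file with a `foursquare` section. The sketch below parses an in-memory example with the same `ConfigParser` calls. The key names match the ones read later in the notebook (`client_id`, `client_secret`, `version`), while the values are placeholders chosen for illustration, not real credentials.

```python
from configparser import ConfigParser

# Hypothetical contents of ~/.config/credentials.ini (placeholder values)
EXAMPLE_INI = """
[foursquare]
client_id = YOUR_CLIENT_ID
client_secret = YOUR_CLIENT_SECRET
version = 20200101
"""

parser = ConfigParser()
parser.read_string(EXAMPLE_INI)

# Same extraction as get_credentials: copy every key of the section into a dict
example_credentials = {}
if parser.has_section('foursquare'):
    example_credentials = dict(parser.items('foursquare'))

print(sorted(example_credentials))
```

If the section is missing, the dictionary stays empty, which is why a later lookup such as `credentials['client_id']` raises a `KeyError`.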


#### Get Response from Foursquare API

In [8]:
# GET Response from Foursquare API
categories_url = 'https://api.foursquare.com/v2/venues/categories?client_id={}&client_secret={}&v={}'.format(
            credentials['client_id'], 
            credentials['client_secret'], 
            credentials['version'])
            
# make the GET request
category_results = requests.get(categories_url).json()
category_results = category_results['response']['categories']

KeyError: 'client_id'

#### Categories

There are 10 major (parent) categories of venues, under which all the other sub-categories are grouped. The cell below shows the ‘Category ID’ and ‘Category Name’ retrieved from the API.

In [None]:
# Transforming Json to DataFrame
categories = pd.json_normalize(data=category_results)

# Cleaning DataFrame
categories = categories.drop(columns=['pluralName'
                             ,'shortName'
                             ,'categories'
                             ,'icon.prefix'
                             ,'icon.suffix'])

categories = categories.rename(columns={'id':'category_id',
                                         'name':'category_name'})
categories

#### Number of venues per category

As mentioned earlier, the quantity of interest is the number of venues per category around each subway station. The goal is a dataframe with ‘Category Name’ and ‘Venues Quantity’ per ‘Metro Station’.

The function ‘get_venues_total’ builds an API request URL for one station and one category with a given radius, makes the GET request to the Foursquare API and returns the total number of venues found. A loop then goes through all the subway stations of Sao Paulo and all categories with radius = 1000, and the results are collected into a pandas dataframe.

In [None]:
def get_venues_total(credentials, lat, long, radius, category_id): 
    ''' Return the number of venues for a Foursquare category within a radius '''
    
    explore_url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&categoryId={}'.format(
                      credentials['client_id'], 
                      credentials['client_secret'], 
                      credentials['version'], 
                      lat,
                      long,
                      radius,
                      category_id)
  
    try:
        response = requests.get(explore_url).json()['response']['totalResults']
  
    except Exception:
        # Fall back to 0 when the request fails or the response lacks 'totalResults'
        response = 0
        
    return response
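
As an alternative way to assemble the request, the query string can be built from a dictionary so that each parameter is URL-encoded automatically. The sketch below uses only the standard library and makes no network call. `build_explore_url` is a hypothetical helper, not part of the notebook, and the credential values are placeholders.

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = 'https://api.foursquare.com/v2/venues/explore'

def build_explore_url(credentials, lat, long, radius, category_id):
    """Hypothetical helper: build the explore URL from a parameter dict."""
    params = {
        'client_id': credentials['client_id'],
        'client_secret': credentials['client_secret'],
        'v': credentials['version'],
        'll': '{},{}'.format(lat, long),
        'radius': radius,
        'categoryId': category_id,
    }
    # safe=',' keeps the comma in the 'll' value unescaped, as in the original URL
    return EXPLORE_ENDPOINT + '?' + urlencode(params, safe=',')

# Placeholder credentials; coordinates and category id taken from the examples above
url = build_explore_url(
    {'client_id': 'ID', 'client_secret': 'SECRET', 'version': '20200101'},
    lat=-23.597825, long=-46.652374, radius=1000,
    category_id='4d4b7104d754a06370d81259')
print(url)
```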

In [None]:
rows = []

for _, metro in metro_data.iterrows():
    for _, category in categories.iterrows():
        # 'metro_station' is the station column name after the renaming step
        rows.append((metro['metro_station']
                     , metro['latitude']
                     , metro['longitude']
                     , metro['metro_line']
                     , category['category_name']
                     , get_venues_total(credentials = credentials
                                        ,lat = metro['latitude']
                                        ,long = metro['longitude']
                                        ,radius = 1000
                                        ,category_id = category['category_id'])))

# One row per (station, category) pair
metro_venues = pd.DataFrame(rows, columns=['metro_station', 'latitude', 'longitude', 'metro_line', 'category_name', 'venues'])

# Exporting to .CSV
metro_venues.to_csv('../data/results_from_api.csv', index=False)
metro_venues.tail()

---
## 3. Methodology
---

In [None]:
metro_venues = pd.read_csv('../data/results_from_api.csv')


### 3.1 Descriptive Analysis

In [None]:
metro_venues.groupby('category_name')['venues'].describe().T.round()

In [None]:
plt.figure(figsize=(16, 6))

ax = sns.boxplot(x="category_name", y="venues", data=metro_venues)
ax.set_ylabel('Count of venues', fontsize=18)
ax.set_xlabel('Venue category', fontsize=18)
ax.tick_params(labelsize=18)
plt.xticks(rotation=45, ha='right')

plt.show()

As we can see, the three venue categories with the highest frequency around the Metro stations in São Paulo are:

* Food
* Professional & Other Places
* Shop & Service

This means that the surroundings of any subway station in São Paulo are more likely to have a higher number of venues in these categories than in the others.

Another important fact is that the category **Event** has very few venues; therefore, it was excluded from the clustering step.

In [None]:
df = pd.pivot_table(metro_venues
                    , index=['metro_station', 'metro_line','latitude', 'longitude']
                    , columns='category_name'
                    , values='venues', ).reset_index()

# Excluding Event Category from the analysis
df = df.drop(columns='Event')

df = df.rename_axis(None, axis=1)
df.head()
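
The pivot step turns the long table (one row per station and category) into a wide table (one row per station, one column per category). A minimal pure-Python sketch of the same reshaping follows. The counts are made up for illustration and are not results from the API.

```python
# Made-up (station, category, venues) rows in long format, for illustration only
long_rows = [
    ('ana-rosa', 'Food', 120),
    ('ana-rosa', 'Residence', 15),
    ('sacoma', 'Food', 40),
    ('sacoma', 'Residence', 22),
]

# Wide format: one dict per station, keyed by category name
wide = {}
for station, category, venues in long_rows:
    wide.setdefault(station, {})[category] = venues

for station, counts in wide.items():
    print(station, counts)
```

`pd.pivot_table` aggregates duplicate index/column pairs with the mean by default; in the sketch each station and category pair appears only once, so no aggregation is needed.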

### 3.3 Feature Engineering

Let’s normalize the data using [min-max scaling](https://en.wikipedia.org/wiki/Feature_scaling#Rescaling_(min-max_normalization)) (scaling the venue counts from 0 to 1, where 0 is the lowest value in a set and 1 is the highest). This both normalizes the data and provides an easy-to-interpret score. The scaled box plot is plotted below:

In [None]:
cluster_dataset = df.iloc[:,4:]
category_list = cluster_dataset.columns

X = cluster_dataset.values
cluster_dataset = pd.DataFrame(MinMaxScaler().fit_transform(X))
cluster_dataset.columns = category_list

cluster_dataset = cluster_dataset.reset_index(drop=True)
cluster_dataset.head()
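
To make the scaling concrete: for each category column, min-max scaling maps a value x to (x - min) / (max - min). The sketch below applies that formula to a made-up column of venue counts. `min_max_scale` is a hypothetical helper written only for illustration, whereas the notebook itself relies on `MinMaxScaler`.

```python
def min_max_scale(values):
    """Scale a list of numbers to the range [0, 1] (illustrative helper)."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # A constant column has no spread; map every value to 0.0
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]

venue_counts = [10, 55, 100]  # made-up counts for one category
scaled = min_max_scale(venue_counts)
print(scaled)
```

Because each column is scaled independently, a value of 1 means "the station with the most venues in that category", not the same absolute count across categories.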

In [9]:
plt.figure(figsize=(16, 6))

ax = sns.boxplot(data=cluster_dataset)
ax.set_ylabel('Number of venues (Relative)', fontsize=18)
ax.set_xlabel('Venue category', fontsize=18)
ax.tick_params(labelsize=18)

plt.xticks(rotation=45, ha='right')

NameError: name 'cluster_dataset' is not defined


### 3.4 K-Means Clustering

'K-Means' is an unsupervised machine learning algorithm that groups data points into clusters based on their similarity. Here it is used to group the stations according to their scaled venue counts. To apply the algorithm, it is important to determine a suitable number of clusters (i.e. k). The two most popular methods for this are [‘The Elbow Method’ and ‘The Silhouette Method’](https://medium.com/analytics-vidhya/how-to-determine-the-optimal-k-for-k-means-708505d204eb); this project uses 'The Elbow Method'.

#### The Elbow Method

The Elbow Method calculates the sum of squared distances of samples to their closest cluster center for different values of 'k'. The optimal number of clusters is the value after which there is no significant decrease in the sum of squared distances. The implementation below varies the number of clusters from 1 to 19.

Sometimes the Elbow Method does not give a clear result, although that did not happen in this case. If the sum of squared distances decreases only gradually, an optimal number of clusters cannot be determined; another method, such as the Silhouette Method, can then be used instead.
The code and the resulting plot follow:

In [None]:
sum_of_squared_distances = []
K = range(1,20)

for k in K:
    print(k, end=' ')
    kmeans = KMeans(n_clusters=k, random_state=0, n_init=20).fit(cluster_dataset)
    sum_of_squared_distances.append(kmeans.inertia_)

plt.figure(figsize=(10, 6))
plt.plot(K, sum_of_squared_distances, 'bx-')
plt.xlabel('k')
plt.ylabel('sum_of_squared_distances')
plt.title('Elbow Method For Optimal k');

The Elbow Method indicates an optimal number of clusters of **4**.
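
The quantity tracked by the elbow curve (`inertia_` in scikit-learn) is the sum of squared distances from each sample to its nearest cluster center. The sketch below computes it by hand for a few made-up 2-D points and two assumed sets of centers. `inertia` is a hypothetical helper for illustration, not the notebook's code.

```python
def inertia(points, centers):
    """Sum of squared distances from each point to its closest center."""
    total = 0.0
    for point in points:
        total += min(
            sum((p - c) ** 2 for p, c in zip(point, center))
            for center in centers
        )
    return total

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]  # made-up samples

# One center at the overall mean versus two centers splitting the points
one_center = inertia(points, [(0.5, 0.5)])
two_centers = inertia(points, [(0.0, 0.5), (1.0, 0.5)])
print(one_center, two_centers)
```

Adding a center lowers the total from 2.0 to 1.0 in this toy case; the elbow is the value of k after which such reductions become small.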

#### K-Means

The following code block runs the k-means algorithm with **clusters = 4** and attaches the resulting cluster label to each metro station:

In [None]:
# set number of clusters
kclusters = 4

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0, n_init=20).fit(cluster_dataset)

kmeans_labels = kmeans.labels_

df_clusters = df.copy()
df_clusters['cluster'] = kmeans_labels
df_clusters['latitude'] = df['latitude']
df_clusters['longitude'] = df['longitude']

df_clusters_minmax = cluster_dataset.copy()
df_clusters_minmax['cluster'] = kmeans_labels
df_clusters_minmax['metro_station'] = df['metro_station']

df_clusters_minmax.head()
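
To see how many stations fall into each cluster, the labels can be tallied. A minimal sketch with a made-up label list follows; with the notebook's data, the same count could be obtained from `df_clusters['cluster'].value_counts()`, assuming the cell above has been run.

```python
from collections import Counter

example_labels = [0, 3, 1, 0, 2, 0, 1]  # made-up cluster labels, not real results

# Count how many stations were assigned to each cluster label
cluster_sizes = Counter(example_labels)
for label in sorted(cluster_sizes):
    print('Cluster {}: {} stations'.format(label, cluster_sizes[label]))
```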

---
## 4. Results
---

### 4.1 K-Means Results

The box plots below are built from the resulting DataFrame. The main difference between the clusters is how densely the surroundings of the stations are populated with venues. For example, Cluster 3 has higher medians of the relative number of venues than the other clusters. This suggests that the stations in Cluster 3 have a higher venue density within a 1000 m radius than the stations in the other clusters.

In [None]:
fig, axes = plt.subplots(1, kclusters, figsize=(18, 6), sharey=True )

axes[0].set_ylabel('Number of venues (relative)', fontsize=18)

for k in range(kclusters):
    #Set same y axis limits
    axes[k].set_ylim(0,1.1)
    axes[k].xaxis.set_label_position('top')
    axes[k].set_xlabel('Cluster ' + str(k), fontsize=25)
    axes[k].tick_params(labelsize=16)
    plt.sca(axes[k])
    plt.xticks(rotation='vertical')
    sns.boxplot(data = df_clusters_minmax[df_clusters_minmax['cluster'] == k].drop(columns='cluster'), ax=axes[k])

plt.show()

In [None]:
address = 'Sao Paulo, Sao Paulo Brazil'
geolocator = Nominatim(user_agent='get_location')

# Geocode the address once and reuse the result
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

categories_list = cluster_dataset.columns

# set color scheme for the clusters
rainbow = ['#4c6ca5'
           , '#3c8447'
           , '#b40049'
           , '#864ca5']

city_map = folium.Map(location=[latitude, longitude], zoom_start=12)

# add markers to map
markers_colors = []
for i, lat, lng, metro_station, cluster in zip(df_clusters.index
                                              , df_clusters['latitude']
                                              , df_clusters['longitude']
                                              , df_clusters['metro_station']
                                              , df_clusters['cluster']):
  
    #Calculate top 3 categories for each station
    station_series = df_clusters.iloc[i]
    top_categories_dict = {}

    for cat in categories_list:
        top_categories_dict[cat] = station_series[cat]
        
    top_categories = sorted(top_categories_dict.items(), key=lambda x: x[1], reverse=True)
  
    popup='<b>{}</b><br>Cluster {}<br>1. {} {}<br>2. {} {}<br>3. {} {}'.format(
          metro_station,
          cluster,
          top_categories[0][0],
          "{0:.2f}".format(top_categories[0][1]),
          top_categories[1][0],
          "{0:.2f}".format(top_categories[1][1]),
          top_categories[2][0],
          "{0:.2f}".format(top_categories[2][1]))
  
    label = folium.Popup(popup, parse_html=True)

    folium.CircleMarker(
                          [lat, lng],
                          radius=5,
                          popup=label,
                          color=rainbow[cluster-1],
                          fill=True,
                          fill_color=rainbow[cluster-1],
                          fill_opacity=0.7,
                          parse_html=False).add_to(city_map)  
    
city_map
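
Note that the marker color is looked up with `rainbow[cluster-1]`, so cluster 0 maps to index -1, which is the last entry of the list. The sketch below spells out the resulting mapping. The color names in the comments are informal descriptions of the hex codes, matching the names used in the summary that follows.

```python
rainbow = ['#4c6ca5',  # blue
           '#3c8447',  # green
           '#b40049',  # red
           '#864ca5']  # purple

# Python's negative indexing makes cluster 0 wrap around to the last color
cluster_colors = {cluster: rainbow[cluster - 1] for cluster in range(4)}
print(cluster_colors)
```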

On the map, the clusters appear as roughly ring-shaped zones; four such zones can be identified in Sao Paulo, each with its own characteristics. The clusters are described in more detail below:


**Summary:**

+ **Cluster 3 (Red)** - 13 metro stations

    Metro stations in cluster 3 have a higher frequency of venues. The cluster covers the downtown neighborhoods of Sao Paulo (Praca da Se, Republica and Anhangabau) and important streets (Av. Paulista, Faria Lima, Reboucas and Oscar Freire). These streets host the headquarters of many financial and cultural institutions, and the area is known as the financial capital of Brazil. These areas usually have a higher frequency of Professional, Food, and Shop & Service venues.


+ **Cluster 1 (Blue)** - 18 metro stations

    Metro stations in cluster 1 do not have the highest frequency of venues in Sao Paulo, but they are close to the downtown and the financial center. The neighborhoods around these stations are also very important: many companies have their headquarters there, and many notable places are located in these areas. The cluster also includes well-known restaurant areas such as Moema and Itaim.


+ **Cluster 0 (Purple)** - 31 metro stations 

    This is the biggest cluster in the analysis, with 31 metro stations that are further from the downtown and the financial center of São Paulo. It also has the lowest frequency of venues in every category.


+ **Cluster 2 (Green)** - 17 metro stations

    Subway stations in this cluster are in the outermost areas of Sao Paulo, and their venues are mostly of the Residence category.
    

As explained before, k-means was able to cluster the metro stations using their surrounding venues, producing four different clusters as shown below. These areas differ from each other mainly in venue concentration: stations closer to downtown have more venues within a 1000 m radius than stations further from the center.

In [None]:
Image(filename='../image/SaoPauloClusteringZones.jpg', width=1200, height=200)

### 4.2 Cluster Results Details

#### Cluster 0 (Purple)

In [None]:
# Cluster Number
k = 0

# Create the figure and format the ticks
plt.figure(figsize=(10, 6))
plt.tick_params(labelsize=16)
plt.xticks(rotation='vertical')

# Box plot of the scaled venue counts for this cluster
sns.boxplot(data = df_clusters_minmax[df_clusters_minmax['cluster'] == k].drop(columns='cluster'))

plt.show()

In [None]:
df_cluster_k = df_clusters_minmax[df_clusters_minmax['cluster'] == k]

# Reordering Columns
df_cluster_k = df_cluster_k[['metro_station'
                             , 'Arts & Entertainment'
                             , 'College & University'
                             , 'Food'
                             , 'Nightlife Spot'
                             ,'Outdoors & Recreation'
                             ,'Professional & Other Places'
                             , 'Residence'
                             ,'Shop & Service'
                             , 'Travel & Transport']].sort_values(by='metro_station').reset_index(drop=True)

df_cluster_k

#### Cluster 1 (Blue)

In [None]:
# Cluster Number
k = 1

# Create the figure and format the ticks
plt.figure(figsize=(10, 6))
plt.tick_params(labelsize=16)
plt.xticks(rotation='vertical')

# Box plot of the scaled venue counts for this cluster
sns.boxplot(data = df_clusters_minmax[df_clusters_minmax['cluster'] == k].drop(columns='cluster'))

plt.show()


In [None]:
df_cluster_k = df_clusters_minmax[df_clusters_minmax['cluster'] == k]

# Reordering Columns
df_cluster_k = df_cluster_k[['metro_station'
                             , 'Arts & Entertainment'
                             , 'College & University'
                             , 'Food'
                             , 'Nightlife Spot'
                             ,'Outdoors & Recreation'
                             ,'Professional & Other Places'
                             , 'Residence'
                             ,'Shop & Service'
                             , 'Travel & Transport']].sort_values(by='metro_station').reset_index(drop=True)

df_cluster_k

#### Cluster 2 (Green)

In [None]:
# Cluster Number
k = 2

# Create the figure and format the ticks
plt.figure(figsize=(10, 6))
plt.tick_params(labelsize=16)
plt.xticks(rotation='vertical')

# Box plot of the scaled venue counts for this cluster
sns.boxplot(data = df_clusters_minmax[df_clusters_minmax['cluster'] == k].drop(columns='cluster'))

plt.show()

In [None]:
df_cluster_k = df_clusters_minmax[df_clusters_minmax['cluster'] == k]

# Reordering Columns
df_cluster_k = df_cluster_k[['metro_station'
                             , 'Arts & Entertainment'
                             , 'College & University'
                             , 'Food'
                             , 'Nightlife Spot'
                             ,'Outdoors & Recreation'
                             ,'Professional & Other Places'
                             , 'Residence'
                             ,'Shop & Service'
                             , 'Travel & Transport']].sort_values(by='metro_station').reset_index(drop=True)

df_cluster_k

#### Cluster 3 (Red)

In [None]:
# Cluster Number
k = 3

# Create the figure and format the ticks
plt.figure(figsize=(10, 6))
plt.tick_params(labelsize=16)
plt.xticks(rotation='vertical')

# Box plot of the scaled venue counts for this cluster
sns.boxplot(data = df_clusters_minmax[df_clusters_minmax['cluster'] == k].drop(columns='cluster'))

plt.show()

In [None]:
df_cluster_k = df_clusters_minmax[df_clusters_minmax['cluster'] == k]

# Reordering Columns
df_cluster_k = df_cluster_k[['metro_station'
                             , 'Arts & Entertainment'
                             , 'College & University'
                             , 'Food'
                             , 'Nightlife Spot'
                             ,'Outdoors & Recreation'
                             ,'Professional & Other Places'
                             , 'Residence'
                             ,'Shop & Service'
                             , 'Travel & Transport']].sort_values(by='metro_station').reset_index(drop=True)

df_cluster_k

---
## 5. Discussion
---

The purpose of this project was to cluster the metro stations in Sao Paulo based on the areas surrounding each station, using venue data from the Foursquare API. Foursquare data is not all-encompassing, since it does not take a venue's size into account (e.g. a big restaurant attracts far more people than a hot dog stand, yet each of them is still one Foursquare "venue").
A possible further development is to include more data, e.g. housing prices, crime rates and passengers per station. This could yield more detailed clusters and a profile of each metro station, helping politicians and the Metro company make better decisions.

---
## 6. Conclusion
---

Four clusters were identified. The main difference between the clusters is the average number of venues per metro station, and the most common venue categories around the stations are Food, Professional & Other Places and Shop & Service. The k-means clustering method was able to separate stations by the number of venues within a 1000 m radius and showed that Sao Paulo has a higher concentration of venues close to downtown than at the city border.