# Capstone Project - The Battle of the Neighborhoods (Week 2)

### Analysis of Brazil's major cities, differences and similarities 

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)


## Introduction: Business Problem <a name="introduction"></a>

Brazil is the fifth-largest country in the world and, with a population of about 210 million people, the sixth most populous. Its size and population are reflected in numerous large cities that differ in culture, natural environment and people. This diversity makes it hard for companies to decide which city to expand into. A company that succeeds in one city is likely to have a better chance of profiting in a similar city than in a very different one. This project therefore analyzes Brazil's capital cities to find which ones are similar and groups them into city clusters.

## Data <a name="data"></a>

For this analysis we need the following data:

Geographic and basic information about the target cities:
1.   List of the names of Brazil's capital cities
2.   List of geospatial coordinates (latitude, longitude) of Brazil's capital cities

Data that describes each city and makes the analysis and clustering possible:
3.   The most popular venues within a 10 km radius of each city, requested from the Foursquare API (up to 1,000 per city)




Install Requirements

In [1]:
# We will need folium to visualize maps
!pip install folium==0.5.0


Import relevant packages

In [0]:
import folium
import pandas as pd
import numpy as np
import requests
from ipywidgets import IntProgress
from IPython.display import display, HTML
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

## Gather geospatial information for each of Brazil's capital cities

Below are the list of capital cities, `capitals`, and their geospatial coordinates, `capitals_coord`.

In [0]:
capitals = ['Porto Velho', 'Manaus', 'Rio Branco', 'Campo Grande', 'Macapa', 'Brasilia', 'Boa Vista', 'Cuiaba', 'Palmas', 'Teresina', 'Sao Paulo', 'Rio de Janeiro', 'Belem', 'Sao Luis', 'Goiania', 'Salvador', 'Maceio', 'Porto Alegre', 'Curitiba', 'Florianopolis', 'Belo Horizonte', 'Fortaleza', 'Recife', 'Joao Pessoa', 'Acaraju', 'Natal', 'Vitoria']

In [0]:
capitals_coord = [[-8.765641, -63.875509], [-3.120778, -60.021362], [-9.971815, -67.825691], [-20.467168, -54.619796], [0.033906, -51.066463], [-15.807910, -47.959944], [2.823485, -60.679053], [-15.593696, -56.098124], [-10.246682, -48.324712], [-5.086539, -42.800808], [-23.552193, -46.639079], [-22.908964, -43.179471], [-1.455427, -48.483116], [-2.529956, -44.252316], [-16.690268, -49.265598], [-12.976777, -38.507956], [-9.650390, -35.710339], [-30.031338, -51.207043], [-25.427976, -49.260580], [-27.593404, -48.548334], [-19.915741, -43.941747], [-3.734547, -38.528010], [-8.052249, -34.889198], [-7.119057, -34.838982], [-10.934840, -37.063968], [-5.789384, -35.204150], [-20.294332, -40.296230]]

Now we have to transform that data into a DataFrame with the columns city, latitude and longitude.

In [0]:
df_cap = pd.DataFrame({'city': capitals, 'latitude': [x[0] for x in capitals_coord], 'longitude': [x[1] for x in capitals_coord]})

In [6]:
df_cap.head()

Unnamed: 0,city,latitude,longitude
0,Porto Velho,-8.765641,-63.875509
1,Manaus,-3.120778,-60.021362
2,Rio Branco,-9.971815,-67.825691
3,Campo Grande,-20.467168,-54.619796
4,Macapa,0.033906,-51.066463


Create a function to view our analyzed cities

In [0]:
def create_map(coordinates_center, zoom, df_cities):
  """Return a folium map with one circle marker per city in df_cities."""
  map_cities = folium.Map(location=coordinates_center, zoom_start=zoom)
  for lat, lng, city in zip(df_cities['latitude'], df_cities['longitude'], df_cities['city']):
      # Show the city name in a popup when the marker is clicked
      label = folium.Popup(city, parse_html=True)
      folium.CircleMarker(
          [lat, lng],
          radius=5,
          popup=label,
          color='blue',
          fill=True,
          fill_color='#3186cc',
          fill_opacity=0.7).add_to(map_cities)

  return map_cities

In [8]:
create_map([-15.724072, -49.437987], 4, df_cap)

In [0]:
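import os

# Read the Foursquare credentials from environment variables rather than
# hard-coding them (the variable names below are an assumption).
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID']
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET']
VERSION = '20180605'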

## Gather city popular places data

The function below requests up to 1,000 of the most popular venues within a radius of 10 km (10,000 meters) of each city, using the `explore` endpoint of the Foursquare API. The API may return fewer venues than requested.
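As a rough illustration of the response shape that the function below relies on, the following sketch parses a hand-written payload instead of calling the API. The payload, the venue names in it, and the helper `parse_explore_items` are hypothetical; only the nesting (`response`, `groups`, `items`, `venue`) is taken from the code in this notebook. The guard for an empty `categories` list is an added assumption about what the API might return.

```python
import pandas as pd

# Hypothetical payload: hand-written to mimic only the fields the notebook reads.
sample_payload = {
    "response": {
        "groups": [{
            "items": [
                {"venue": {"name": "Example Cafe",
                           "location": {"lat": -8.76, "lng": -63.87},
                           "categories": [{"name": "Café"}]}},
                {"venue": {"name": "Example Venue Without Category",
                           "location": {"lat": -8.77, "lng": -63.88},
                           "categories": []}},
            ]
        }]
    }
}


def parse_explore_items(city, payload):
    """Flatten the items of one explore-style payload into a DataFrame."""
    rows = []
    for item in payload["response"]["groups"][0]["items"]:
        venue = item["venue"]
        categories = venue.get("categories") or []
        rows.append({
            "city": city,
            "venue": venue["name"],
            "venue latitude": venue["location"]["lat"],
            "venue longitude": venue["location"]["lng"],
            # Assumption: a venue may come back without any category.
            "venue category": categories[0]["name"] if categories else None,
        })
    return pd.DataFrame(rows)


parsed = parse_explore_items("Porto Velho", sample_payload)
print(parsed)
```

Indexing `categories[0]` directly, as the notebook's function does, would raise an `IndexError` for a venue with an empty category list, which is why the sketch checks for that case.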

In [0]:
def getPopularVenues(names, latitudes, longitudes, radius=10000, LIMIT=1000):
    """Query the Foursquare explore endpoint for each city and return all venues as one DataFrame."""

    # Progress bar with one step per city
    f = IntProgress(min=0, max=len(names))
    display(f)

    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(f'Getting data from: {name}')

        url = f'https://api.foursquare.com/v2/venues/explore?&client_id={CLIENT_ID}&client_secret={CLIENT_SECRET}&v={VERSION}&ll={lat},{lng}&radius={radius}&limit={LIMIT}'

        # The venues are nested under response -> groups -> items
        results = requests.get(url).json()["response"]['groups'][0]['items']

        # One tuple per venue: city info, venue name and position, first listed category
        venues_list.append([(
            name,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])

        f.value += 1 # signal to increment the progress bar

    # Flatten the per-city lists into a single table
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['city',
                  'city latitude',
                  'city longitude',
                  'venue',
                  'venue latitude',
                  'venue longitude',
                  'venue category']

    return nearby_venues

In [11]:
df_cities = getPopularVenues(df_cap['city'], df_cap['latitude'], df_cap['longitude'])

IntProgress(value=0, max=27)

Getting data from: Porto Velho
Getting data from: Manaus
Getting data from: Rio Branco
Getting data from: Campo Grande
Getting data from: Macapa
Getting data from: Brasilia
Getting data from: Boa Vista
Getting data from: Cuiaba
Getting data from: Palmas
Getting data from: Teresina
Getting data from: Sao Paulo
Getting data from: Rio de Janeiro
Getting data from: Belem
Getting data from: Sao Luis
Getting data from: Goiania
Getting data from: Salvador
Getting data from: Maceio
Getting data from: Porto Alegre
Getting data from: Curitiba
Getting data from: Florianopolis
Getting data from: Belo Horizonte
Getting data from: Fortaleza
Getting data from: Recife
Getting data from: Joao Pessoa
Getting data from: Acaraju
Getting data from: Natal
Getting data from: Vitoria


Now we can look at the overall distribution of venue categories across all cities.

In [12]:
df_cities['venue category'].value_counts()

Brazilian Restaurant    132
Ice Cream Shop          108
Restaurant               97
Gym / Fitness Center     97
Bar                      93
                       ... 
Smoke Shop                1
Organic Grocery           1
Motel                     1
Theme Restaurant          1
Cheese Shop               1
Name: venue category, Length: 254, dtype: int64

## Methodology <a name="methodology"></a>

With the popular-venue data for each city, we can find similar cities using clustering. First, we need to convert the categorical `venue category` information into numeric features. We do this with a count-based form of one-hot encoding: we create one column per venue category and record, for each city, the number of times that category was found.
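A minimal sketch of this encoding on a toy table (the cities `A` and `B` and their venues are made up for illustration). It uses `pd.crosstab`, which is a different route to the same per-city counts than the explicit loop in the Analysis section.

```python
import pandas as pd

# Toy data: one row per venue, with the same column names as the venue table.
toy = pd.DataFrame({
    'city': ['A', 'A', 'A', 'B', 'B'],
    'venue category': ['Bar', 'Bar', 'Gym', 'Gym', 'Park'],
})

# One row per city, one column per venue category, cells hold the counts.
counts = pd.crosstab(toy['city'], toy['venue category'])
print(counts)
```

Categories that never occur in a city simply get a count of 0, so every city ends up with a vector of the same length.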

Clustering algorithms can perform worse on non-normalized data: because they rely on distance metrics, features with large values can dominate the result. To avoid that, we normalize each city's data (each column in a row) by the total number of venues returned for that city, so that every value lies between 0 and 1 and each row sums to 1.
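To illustrate why this matters, the sketch below compares two made-up cities that have the same mix of venue categories but a different number of venues. The counts are invented for the example. Before normalization the two cities look far apart; after dividing each row by its total they coincide.

```python
import numpy as np

# Made-up counts for two categories: same 50/50 mix, different totals.
counts = np.array([
    [2.0, 2.0],    # small city, 4 venues
    [50.0, 50.0],  # large city, 100 venues
])

# Euclidean distance on the raw counts is dominated by city size.
raw_distance = np.linalg.norm(counts[0] - counts[1])

# Divide each row by its total so every row becomes a set of proportions.
proportions = counts / counts.sum(axis=1, keepdims=True)
normalized_distance = np.linalg.norm(proportions[0] - proportions[1])

print(raw_distance, normalized_distance)
```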

Finally, we feed that data into the K-Means algorithm to find similar cities. K-Means requires the number of clusters to be chosen in advance. We can proceed in two ways: pick an arbitrary number of clusters (which gives no indication of whether the clustering is any good), or take a more analytical approach that uses a clustering quality metric, the silhouette score, to decide which number of clusters gives the best result. Here we use the analytical approach.

## Analysis <a name="analysis"></a>

As the dataset is organized in categories, we must one-hot encode the categorical values. Additionally, we normalize each city result to reflect the proportion of each popular venue per city.

In [0]:
categories = list(df_cities['venue category'].unique())
row_list = []
for city in df_cap['city'].unique():
  # Venue categories returned for this city
  venue_category_data = df_cities.loc[df_cities['city'] == city, 'venue category']
  # Count how many venues fall into each category
  row_numeric_data = [(venue_category_data == category).sum() for category in categories]
  # Normalize the counts by the city's total so that each row sums to 1
  total = sum(row_numeric_data)
  if total > 0:
    row_normalized_data = [numberCategory / total for numberCategory in row_numeric_data]
  else:
    row_normalized_data = row_numeric_data
  row_list.append(dict(zip(['city'] + categories, [city] + row_normalized_data)))

Convert the raw one-hot encoded (and normalized) data to a DataFrame

In [0]:
df_norm_city = pd.DataFrame(row_list)

In [15]:
df_norm_city.head()

Unnamed: 0,city,Fish & Chips Shop,Gym,Bar,Gift Shop,Snack Place,Pub,Candy Store,Martial Arts Dojo,Brazilian Restaurant,BBQ Joint,Coffee Shop,Beer Store,Tea Room,Hotel,Convenience Store,Shoe Store,Bookstore,Northern Brazilian Restaurant,Restaurant,Middle Eastern Restaurant,Creperie,Sandwich Place,Japanese Restaurant,Gym / Fitness Center,Nightclub,Beer Garden,Shopping Mall,Bistro,Chinese Restaurant,Massage Studio,Ice Cream Shop,Seafood Restaurant,Soccer Field,Burger Joint,Italian Restaurant,Deli / Bodega,Gourmet Shop,Food Truck,Pharmacy,...,Lake,Jazz Club,Baiano Restaurant,Bay,Arcade,Convention Center,Lighthouse,Resort,Surf Spot,Track Stadium,Garden Center,Thai Restaurant,Street Fair,College Bookstore,Photography Lab,Poke Place,Supplement Shop,Taiwanese Restaurant,Southern Brazilian Restaurant,Empada House,German Restaurant,Library,Smoke Shop,Dog Run,Botanical Garden,Antique Shop,Coworking Space,Playground,Women's Store,Bed & Breakfast,Paper / Office Supplies Store,Sports Bar,Tennis Court,Airport,Drugstore,Public Art,IT Services,Motorcycle Shop,Aquarium,Beach Bar
0,Porto Velho,0.01,0.06,0.08,0.01,0.03,0.01,0.01,0.01,0.11,0.01,0.03,0.01,0.01,0.04,0.03,0.01,0.02,0.01,0.05,0.03,0.01,0.02,0.02,0.01,0.01,0.03,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.03,0.02,0.01,0.01,0.02,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Manaus,0.01,0.01,0.03,0.0,0.01,0.0,0.01,0.0,0.06,0.05,0.01,0.01,0.0,0.02,0.0,0.01,0.02,0.02,0.05,0.0,0.01,0.0,0.03,0.05,0.0,0.0,0.01,0.02,0.0,0.0,0.02,0.02,0.0,0.01,0.0,0.0,0.0,0.02,0.01,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Rio Branco,0.0,0.04,0.01,0.0,0.04,0.01,0.0,0.0,0.1,0.07,0.0,0.0,0.01,0.04,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.03,0.0,0.04,0.02,0.0,0.01,0.0,0.0,0.0,0.08,0.01,0.0,0.02,0.0,0.0,0.0,0.0,0.01,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Campo Grande,0.0,0.04,0.09,0.0,0.01,0.01,0.01,0.0,0.03,0.0,0.0,0.0,0.0,0.02,0.01,0.0,0.02,0.0,0.01,0.03,0.0,0.01,0.0,0.03,0.0,0.0,0.0,0.0,0.01,0.0,0.04,0.01,0.0,0.02,0.03,0.0,0.01,0.01,0.03,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Macapa,0.0,0.03,0.0,0.01,0.01,0.0,0.0,0.0,0.1,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.01,0.05,0.0,0.0,0.01,0.02,0.02,0.0,0.0,0.02,0.01,0.01,0.0,0.08,0.01,0.0,0.02,0.02,0.0,0.0,0.01,0.01,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Now we have to find the best number of clusters. This can be done using the [silhouette score](https://en.wikipedia.org/wiki/Silhouette_(clustering)), a metric that measures how well the clusters are formed. The score ranges from -1 to 1, where 1 is the best possible clustering result, so higher silhouette values mean better cluster formation.
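To make the metric concrete, here is a small sketch on four made-up one-dimensional points. For each point, `a` is the mean distance to the other points in its own cluster, `b` is the mean distance to the points of the nearest other cluster, and the point's silhouette is `(b - a) / max(a, b)`. The data and the helper `manual_silhouette` are illustrative only (the helper assumes every cluster has at least two points); its result is compared with scikit-learn's `silhouette_score`.

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Made-up data: two tight groups that are far apart.
X = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])


def manual_silhouette(X, labels):
    """Mean silhouette, assuming every cluster has at least two points."""
    scores = []
    for i in range(len(X)):
        dist = np.linalg.norm(X - X[i], axis=1)
        same = labels == labels[i]
        # a: mean distance to the other members of the point's own cluster
        a = dist[same].sum() / (same.sum() - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(dist[labels == other].mean()
                for other in set(labels.tolist()) if other != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))


manual = manual_silhouette(X, labels)
reference = silhouette_score(X, labels)
print(manual, reference)
```

Well-separated groups like these give a score close to 1, while heavily overlapping clusters give a score close to 0.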

In the code below, we evaluate a range of cluster counts and select the one with the best score.

In [16]:
number_test_clusters = [*range(4,15)]
best_s_score = -10
best_n_cluster = 0
for n_cluster in number_test_clusters:
  kmeans = KMeans(n_clusters=n_cluster, random_state=0).fit(df_norm_city.drop(columns=['city']))
  s_score = silhouette_score(df_norm_city.drop(columns=['city']), kmeans.labels_)

  if s_score > best_s_score:
    best_s_score = s_score
    best_n_cluster = n_cluster

print(f'The best number of clusters is {best_n_cluster} with a silhouette score of {best_s_score}')

The best number of clusters is 11 with a silhouette score of 0.05864430928808369


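As shown above, in the range of 4 to 14 clusters the best number of clusters was 11, so we use that number in our analysis. Note that a silhouette score of about 0.06 is close to 0, which indicates that the clusters overlap considerably, so the grouping should be read as indicative rather than clear-cut.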

In [0]:
kmeans = KMeans(n_clusters=11, random_state=0).fit(df_norm_city.drop(columns=['city']))

In [18]:
kmeans.labels_

array([ 7,  5,  2,  6,  5,  6,  5,  7,  2,  7,  3,  8,  5,  0,  7,  1,  4,
       10, 10,  0,  3,  7,  1,  4,  9,  7, 10], dtype=int32)

Now, let's create a new DataFrame of cities to store the cluster number of each city.

In [0]:
df_analysis = df_cap.copy()

In [0]:
df_analysis['cluster'] = kmeans.labels_

## Results and Discussion <a name="results"></a>

Now, let's see each individual cluster

In [0]:
region_information = {'North': ['Manaus', 'Rio Branco', 'Boa Vista', 'Porto Velho', 'Macapa', 'Belem', 'Palmas'], 
                      'Northeast': ['Sao Luis', 'Recife', 'Salvador', 'Fortaleza', 'Natal', 'Maceio', 'Joao Pessoa', 'Teresina', 'Acaraju'],
                      'Central-West': ['Campo Grande', 'Cuiaba', 'Goiania', 'Brasilia'],
                      'Southeast': ['Belo Horizonte', 'Sao Paulo', 'Vitoria', 'Rio de Janeiro'],
                      'South': ['Curitiba', 'Florianopolis', 'Porto Alegre']}

In [0]:
dict_city_region = {}
for key, valueList in region_information.items():
  for value in valueList:
    dict_city_region[value] = key

In [0]:
df_analysis['region'] = df_analysis['city'].map(dict_city_region)

In [51]:
for cluster_n in sorted(set(kmeans.labels_)):
  print(f'Cluster {cluster_n} cities:\n')
  # display() renders the table and returns None, so it is not wrapped in print()
  display(HTML(df_analysis[df_analysis['cluster'] == cluster_n].to_html()))
  print('\n')

Cluster 0 cities:



Unnamed: 0,city,latitude,longitude,cluster,region
13,Sao Luis,-2.529956,-44.252316,0,Northeast
19,Florianopolis,-27.593404,-48.548334,0,South


Cluster 1 cities:



Unnamed: 0,city,latitude,longitude,cluster,region
15,Salvador,-12.976777,-38.507956,1,Northeast
22,Recife,-8.052249,-34.889198,1,Northeast


Cluster 2 cities:



Unnamed: 0,city,latitude,longitude,cluster,region
2,Rio Branco,-9.971815,-67.825691,2,North
8,Palmas,-10.246682,-48.324712,2,North


Cluster 3 cities:



Unnamed: 0,city,latitude,longitude,cluster,region
10,Sao Paulo,-23.552193,-46.639079,3,Southeast
20,Belo Horizonte,-19.915741,-43.941747,3,Southeast


Cluster 4 cities:



Unnamed: 0,city,latitude,longitude,cluster,region
16,Maceio,-9.65039,-35.710339,4,Northeast
23,Joao Pessoa,-7.119057,-34.838982,4,Northeast


Cluster 5 cities:



Unnamed: 0,city,latitude,longitude,cluster,region
1,Manaus,-3.120778,-60.021362,5,North
4,Macapa,0.033906,-51.066463,5,North
6,Boa Vista,2.823485,-60.679053,5,North
12,Belem,-1.455427,-48.483116,5,North


Cluster 6 cities:



Unnamed: 0,city,latitude,longitude,cluster,region
3,Campo Grande,-20.467168,-54.619796,6,Central-West
5,Brasilia,-15.80791,-47.959944,6,Central-West


Cluster 7 cities:



Unnamed: 0,city,latitude,longitude,cluster,region
0,Porto Velho,-8.765641,-63.875509,7,North
7,Cuiaba,-15.593696,-56.098124,7,Central-West
9,Teresina,-5.086539,-42.800808,7,Northeast
14,Goiania,-16.690268,-49.265598,7,Central-West
21,Fortaleza,-3.734547,-38.52801,7,Northeast
25,Natal,-5.789384,-35.20415,7,Northeast


Cluster 8 cities:



Unnamed: 0,city,latitude,longitude,cluster,region
11,Rio de Janeiro,-22.908964,-43.179471,8,Southeast


Cluster 9 cities:



Unnamed: 0,city,latitude,longitude,cluster,region
24,Acaraju,-10.93484,-37.063968,9,Northeast


Cluster 10 cities:



Unnamed: 0,city,latitude,longitude,cluster,region
17,Porto Alegre,-30.031338,-51.207043,10,South
18,Curitiba,-25.427976,-49.26058,10,South
26,Vitoria,-20.294332,-40.29623,10,Southeast




The result above shows an interesting trend: cities in the same region tend to cluster together. There are exceptions in clusters 0, 7 and 10, but most cities share a cluster with cities from their own region.
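One way to put a number on this observation is to count, for each cluster, how many cities belong to its most common region and divide the sum by the total number of cities (sometimes called cluster purity). The sketch below does this with the cluster labels and regions shown above, typed in by hand in the order of the `capitals` list. The metric is an addition for illustration and is not part of the original analysis.

```python
import pandas as pd

# Cluster labels copied from the kmeans.labels_ output above.
labels = [7, 5, 2, 6, 5, 6, 5, 7, 2, 7, 3, 8, 5, 0, 7, 1, 4,
          10, 10, 0, 3, 7, 1, 4, 9, 7, 10]

# Region of each city, in the same order as the capitals list
# (N = North, NE = Northeast, CW = Central-West, SE = Southeast, S = South).
regions = ['N', 'N', 'N', 'CW', 'N', 'CW', 'N', 'CW', 'N', 'NE', 'SE', 'SE',
           'N', 'NE', 'CW', 'NE', 'NE', 'S', 'S', 'S', 'SE', 'NE', 'NE', 'NE',
           'NE', 'NE', 'SE']

df = pd.DataFrame({'cluster': labels, 'region': regions})

# Size of the largest single-region group inside each cluster.
majority = df.groupby('cluster')['region'].agg(lambda s: s.value_counts().iloc[0])

purity = majority.sum() / len(df)
print(f'{majority.sum()} of {len(df)} cities share the dominant region of their cluster')
print(f'purity = {purity:.2f}')
```

With these labels, 22 of the 27 cities (about 0.81) sit in the dominant region of their cluster. Single-city clusters count as fully pure, so the figure is somewhat flattering.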

Now, for each cluster, let's see which venue categories are the most common:

In [52]:
for cluster_n in sorted(set(kmeans.labels_)):
  # Cities that belong to this cluster
  cluster_cities = df_analysis[df_analysis['cluster'] == cluster_n]['city'].tolist()
  print(f'Most popular venues of cities: {", ".join(cluster_cities)} \n')
  # Ten most frequent venue categories among those cities
  top_venues = df_cities[df_cities['city'].isin(cluster_cities)]['venue category'].value_counts().to_frame().head(10)
  display(HTML(top_venues.to_html()))
  print('\n')



Most popular venues of cities: Sao Luis, Florianopolis 



Unnamed: 0,venue category
Gym / Fitness Center,16
Bakery,10
Restaurant,9
Café,7
Burger Joint,7
Sushi Restaurant,6
Bar,6
Ice Cream Shop,5
Plaza,5
Pizza Place,5


Most popular venues of cities: Salvador, Recife 



Unnamed: 0,venue category
Plaza,9
Bakery,8
Restaurant,8
Brazilian Restaurant,8
Art Museum,7
Coffee Shop,6
Ice Cream Shop,6
Café,6
Theater,5
Beach,5


Most popular venues of cities: Rio Branco, Palmas 



Unnamed: 0,venue category
Brazilian Restaurant,16
Ice Cream Shop,15
Gym / Fitness Center,12
BBQ Joint,9
Snack Place,8
Hotel,7
Sandwich Place,6
Gym,6
Park,5
Pizza Place,5


Most popular venues of cities: Sao Paulo, Belo Horizonte 



Unnamed: 0,venue category
Ice Cream Shop,13
Theater,12
Restaurant,11
Pizza Place,9
Brazilian Restaurant,8
Bookstore,6
Bar,6
Park,5
Italian Restaurant,5
Hotel,5


Most popular venues of cities: Maceio, Joao Pessoa 



Unnamed: 0,venue category
Hotel,13
Bar,12
Restaurant,11
Seafood Restaurant,10
Beach,8
Ice Cream Shop,7
Dessert Shop,6
Gym / Fitness Center,6
Northeastern Brazilian Restaurant,5
Pizza Place,5


Most popular venues of cities: Manaus, Macapa, Boa Vista, Belem 



Unnamed: 0,venue category
Brazilian Restaurant,26
Restaurant,23
Ice Cream Shop,21
Plaza,19
Gym / Fitness Center,16
BBQ Joint,12
Café,11
Bakery,11
Pizza Place,9
Japanese Restaurant,8


Most popular venues of cities: Campo Grande, Brasilia 



Unnamed: 0,venue category
Bar,14
Gym / Fitness Center,11
Ice Cream Shop,9
Café,7
Burger Joint,6
Gym,6
Bakery,6
Pastelaria,6
Middle Eastern Restaurant,5
Italian Restaurant,5


Most popular venues of cities: Porto Velho, Cuiaba, Teresina, Goiania, Fortaleza, Natal 



Unnamed: 0,venue category
Brazilian Restaurant,41
Bar,32
Restaurant,28
Ice Cream Shop,21
Gym,21
Gym / Fitness Center,19
Hotel,15
Bakery,15
Coffee Shop,14
Pizza Place,14


Most popular venues of cities: Rio de Janeiro 



Unnamed: 0,venue category
Coffee Shop,9
Bookstore,6
Historic Site,4
Church,4
Park,4
Music Venue,4
Art Museum,3
Scenic Lookout,3
Bar,3
Plaza,3


Most popular venues of cities: Acaraju 



Unnamed: 0,venue category
Brazilian Restaurant,14
Café,6
Japanese Restaurant,4
Northeastern Brazilian Restaurant,4
Coffee Shop,4
Ice Cream Shop,4
Bakery,4
Gym / Fitness Center,3
Pharmacy,3
Park,3


Most popular venues of cities: Porto Alegre, Curitiba, Vitoria 



Unnamed: 0,venue category
Hotel,15
Coffee Shop,14
Pizza Place,13
Italian Restaurant,13
Gym / Fitness Center,10
Café,9
Burger Joint,8
Brazilian Restaurant,7
Ice Cream Shop,6
Vegetarian / Vegan Restaurant,6




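These results also show interesting patterns. For example, `Brazilian Restaurant` is the top category in four clusters and appears in the top ten of seven of the eleven clusters, but it is absent from the top ten of the Rio de Janeiro cluster. Rio de Janeiro is a well-known tourist destination, and its most common venues are related to cultural activities (bookstores, historic sites, churches, music venues and art museums).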

Finally, we will create a map that will show us each city and cluster (by color):

In [0]:
import matplotlib.cm as cm
import matplotlib.colors as colors

In [94]:
# create map
map_clusters = folium.Map(location=[-15.724072, -49.437987], zoom_start=4)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
n_clusters = len(set(kmeans.labels_))
colors_array = cm.rainbow(np.linspace(0, 1, n_clusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(df_analysis['latitude'], 
                                  df_analysis['longitude'], 
                                  df_analysis['city'],
                                  df_analysis['cluster']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

The map shows, again, that cities that are geographically closer tend to belong to the same cluster. This is plausible, since cultural exchange between cities is likely to be stronger when the distance between them is shorter.
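Geographic closeness can be made measurable with the haversine formula, which gives the great-circle distance between two coordinate pairs. The sketch below is an illustrative addition, not part of the original analysis; the function name `haversine_km` and the mean Earth radius of 6371 km are assumptions of the example. The coordinates are the ones listed in `capitals_coord` for Sao Paulo and Rio de Janeiro.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    h = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(h))


sao_paulo = (-23.552193, -46.639079)
rio_de_janeiro = (-22.908964, -43.179471)
distance = haversine_km(*sao_paulo, *rio_de_janeiro)
print(f'{distance:.0f} km')
```

Pairwise distances computed this way could then be compared within and across clusters to check how strongly the clusters follow geography.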

## Conclusion <a name="conclusion"></a>

This work successfully identified clusters of cities that can be considered similar. One argument in support of the clustering is that cities in the same region generally belong to the same cluster, even though the analysis used no geographic information, only venue types. The work also identified which types of venues are most popular in each cluster and revealed interesting patterns. For example, cities that received a large influx of Italian immigrants (cities in Brazil's South, such as Curitiba and Porto Alegre) have a high number of Italian restaurants, while `Seafood Restaurant` is more popular in cities with famous beaches (Maceio and Joao Pessoa). Rio de Janeiro appears to be a unique city, with its own cluster, as it is known for its many tourist attractions. This work could therefore help businesses better understand the Brazilian market in the capital cities and potentially improve their expansion and growth strategies.