# The Battle of Cities
Analyzing similarities between neighborhoods in two cities.

---
# Introduction/Business Problem

## Problem

In the coming weeks, I am changing jobs and moving from Germany to Chile. Since I will need to relocate, choosing a new apartment can be overwhelming given all the different districts in the new city, let alone my lack of knowledge of what it is like to live there.  

Naturally, I would like to choose an area where I feel comfortable. An approximate solution is to find a neighborhood similar to the one I live in, or to any other I know in my current city, based on the services they offer.  

By clustering neighborhoods in both cities, using their services as the features of the algorithm, we can find neighborhoods in the new city that are similar to the one where I currently live. In this way, I reduce the uncertainty about what I will find in my new location.  

## Audience
This situation is one that employees all over the world frequently face when they are relocated. If generalized to multiple cities, any human resources department could use this algorithm to ease the onboarding of employees moving to different cities.  
In addition, I intend to use this project myself to look for an apartment in Santiago.  

---
# Data

## Data processing
After an initial attempt to find location data on these two cities, there does not seem to be an official source of structured data. Therefore, the approach will be:  
1. Extract the postal codes of each city.  
2. Find the coordinates of each postal code using Nominatim or any other service offering such information.  
3. Structure the extracted coordinate data properly.  

## Data sources
The following sources of information for this phase have been identified:  

*Postal codes and city neighborhoods data*
  * https://www.geonames.org/postal-codes/  
  For Santiago
  * https://www.geonames.org/postalcode-search.html?q=santiago&country=CL  
  For Darmstadt
  * https://www.geonames.org/postalcode-search.html?q=Darmstadt&country=DE

*Venues information*  
  * https://developer.foursquare.com/


With this approach, I expect the solution to generalize to any tuple of cities. 


In order to execute a successful query, we need:  
  * The names of the cities to search.   
    These are entered manually; no automatic extraction is needed.
  * The correct country codes as used on the GeoNames website, such as CL for Chile or DE for Germany.  
    As we can see on the website, all available country codes can be extracted by inspecting the dropdown menu.
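
Once a city name and a country code are known, the query URL can be assembled programmatically. The following is a minimal sketch using only the standard library; `BASE_URL` and `build_postal_query` are hypothetical names chosen for illustration, and the `q` and `country` parameters are taken from the example URLs listed under the data sources.

```python
from urllib.parse import urlencode

# Endpoint taken from the example URLs in this document.
BASE_URL = "https://www.geonames.org/postalcode-search.html"


def build_postal_query(city, country_code):
    """Return the postal-code search URL for one city (illustrative helper)."""
    params = {"q": city, "country": country_code.upper()}
    return BASE_URL + "?" + urlencode(params)


print(build_postal_query("santiago", "CL"))
```

Encoding the parameters this way should also cope with city names that contain spaces or accented characters.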




*Double click __HERE__ to check the HTML code*

<!--
<select name="country">
<option value="" selected=""> all countries</option>
<option value="DZ"> Algeria</option>
<option value="AS"> American Samoa</option>
<option value="AD"> Andorra</option>
<option value="AR"> Argentina</option>
<option value="AU"> Australia</option>
<option value="AT"> Austria</option>
<option value="BD"> Bangladesh</option>
<option value="BY"> Belarus</option>
<option value="BE"> Belgium</option>
<option value="BM"> Bermuda</option>
<option value="BR"> Brazil</option>
<option value="BG"> Bulgaria</option>
<option value="CA"> Canada</option>
...
<option value="SJ"> Svalbard and Jan Mayen</option>
<option value="SE"> Sweden</option>
<option value="CH"> Switzerland</option>
<option value="TH"> Thailand</option>
<option value="TR"> Turkey</option>
<option value="VI"> U.S. Virgin Islands</option>
<option value="UA"> Ukraine</option>
<option value="GB"> United Kingdom</option>
<option value="US"> United States</option>
<option value="UY"> Uruguay</option>
<option value="VA"> Vatican City</option>
<option value="WF"> Wallis and Futuna</option>
<option value="AX"> Åland</option>
</select>
-->

To obtain the postal codes, a query to GeoNames including city and country returns an HTML table with each postal code and its coordinates in the rows, among other data. Our work then consists of requesting that HTML and extracting the relevant items from the table.
For example, the URL https://www.geonames.org/postalcode-search.html?q=santiago&country=CL produces the following HTML (some rows were cut to make it easier to read):

<table class="restable">
<tbody><tr><th></th><th>Place</th><th>Code</th><th>Country</th><th>Admin1</th><th>Admin2</th><th>Admin3</th></tr>
<tr><td><small>1</small></td><td>Santiago</td><td>8320000</td><td>Chile</td><td>Región Metropolitana</td><td>Provincia de Santiago</td><td>Santiago</td></tr><tr><td></td><td colspan="6">&nbsp;&nbsp;&nbsp;<a href="/maps/browse_-33.454_-70.656.html" rel="nofollow"><small>-33.454/-70.656</small></a></td></tr>
<tr class="odd"><td><small>2</small></td><td>Providencia</td><td>7500000</td><td>Chile</td><td>Región Metropolitana</td><td>Provincia de Santiago</td><td>Providencia</td></tr><tr class="odd"><td></td><td colspan="6">&nbsp;&nbsp;&nbsp;<a href="/maps/browse_-33.436_-70.609.html" rel="nofollow"><small>-33.436/-70.609</small></a></td></tr>
<tr><td><small>3</small></td><td>Las Condes</td><td>7550000</td><td>Chile</td><td>Región Metropolitana</td><td>Provincia de Santiago</td><td>Las Condes</td></tr><tr><td></td><td colspan="6">&nbsp;&nbsp;&nbsp;<a href="/maps/browse_-33.421_-70.502.html" rel="nofollow"><small>-33.421/-70.502</small></a></td></tr>
<tr class="odd"><td><small>4</small></td><td>Vitacura</td><td>7630000</td><td>Chile</td><td>Región Metropolitana</td><td>Provincia de Santiago</td><td>Vitacura</td></tr>
<tr class="tfooter"><td colspan="7"></td></tr>
</tbody></table>
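
The table interleaves two kinds of rows: a place row (running number, place name, postal code, and so on) and a coordinate row holding `lat/lon`. The sketch below shows one way to pair them using only the standard library. It is an illustration rather than the parsing used later in the notebook, and `RowCollector` and `extract_places` are hypothetical names. It attaches each coordinate row to the place row just before it instead of relying on strict alternation, so a place without a coordinate row (as Vitacura appears in the trimmed sample) simply keeps empty coordinates.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        # Text nested inside a cell (e.g. in <small> or <a>) is appended to it.
        if self._in_cell and self._row:
            self._row[-1] += data


def extract_places(html):
    """Return one dict per place; lat/lon stay None if no coordinate row follows."""
    collector = RowCollector()
    collector.feed(html)
    collector.close()
    places = []
    for cells in collector.rows:
        cells = [cell.strip() for cell in cells]
        if len(cells) >= 3 and cells[0].isdigit():
            # Place row: running number, place name, postal code, ...
            places.append({"place": cells[1], "code": cells[2], "lat": None, "lon": None})
        elif len(cells) == 2 and "/" in cells[1] and places:
            # Coordinate row: belongs to the place row just before it.
            lat, lon = cells[1].split("/")
            places[-1]["lat"], places[-1]["lon"] = float(lat), float(lon)
    return places
```

Header rows (which only contain `<th>` cells) and the footer row match neither pattern and are skipped.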

---
# Methodology

__Input__
  * *cities*: List of dictionaries, one per city, with the city name and country code where the clustering will be done. For example:  
  cities = [  
  {'city' : 'Cologne', 'country_code' : 'DE'},  
  {'city' : 'Santiago', 'country_code' : 'CL'},  
  ]  
  
  The algorithm accepts as many cities as desired, but in this exercise we focus on only two. 
  
  
__Output__
  * A list of clusters grouping areas in both cities, so that one can visually inspect the similarity between them

---
# Results
...

---
# Discussion
...

---
# Conclusion
...

---
# Code

In [None]:
# we install the necessary packages
!pip install geocoder
!conda install -c conda-forge geopy --yes
!conda install -c conda-forge folium=0.5.0 --yes

In [1]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests
from IPython.display import display_html
from pandas import json_normalize # transform JSON file into a pandas dataframe (top-level import, pandas >= 1.0)

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

import folium # map rendering library

print('Libraries imported.')


Libraries imported.


## Obtain the country codes
The purpose of this step is to get the country codes as they are used in the website itself, so we can build a query that will contain the right code.  
This step will be necessary to prepare the input variables.

In [17]:
url = 'https://www.geonames.org/postal-codes/'
html = requests.get(url).content
soup = BeautifulSoup(html, 'lxml')
country_codes = soup.find('select', {"name" : "country"})

list_countries = country_codes.find_all('option')
for option in list_countries:
    print("Country: " + option.text + " | Code: " + option['value'])

Country:  all countries | Code: 
Country:  Algeria | Code: DZ
Country:  American Samoa | Code: AS
Country:  Andorra | Code: AD
Country:  Argentina | Code: AR
Country:  Australia | Code: AU
Country:  Austria | Code: AT
Country:  Bangladesh | Code: BD
Country:  Belarus | Code: BY
Country:  Belgium | Code: BE
Country:  Bermuda | Code: BM
Country:  Brazil | Code: BR
Country:  Bulgaria | Code: BG
Country:  Canada | Code: CA
Country:  Chile | Code: CL
Country:  Colombia | Code: CO
Country:  Costa Rica | Code: CR
Country:  Croatia | Code: HR
Country:  Czechia | Code: CZ
Country:  Denmark | Code: DK
Country:  Dominican Republic | Code: DO
Country:  Faroe Islands | Code: FO
Country:  Finland | Code: FI
Country:  France | Code: FR
Country:  French Guiana | Code: GF
Country:  Germany | Code: DE
Country:  Greenland | Code: GL
Country:  Guadeloupe | Code: GP
Country:  Guam | Code: GU
Country:  Guatemala | Code: GT
Country:  Guernsey | Code: GG
Country:  Hungary | Code: HU
Country:  Iceland | Code: 
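
The printed list can also be turned into a lookup table, so that a country name resolves to its code when preparing the input. This is a small sketch with a hypothetical helper, `build_country_lookup`; it works on plain `(label, code)` pairs, which in this notebook could be built as `[(o.text, o['value']) for o in list_countries]`.

```python
def build_country_lookup(options):
    """Map lower-cased country names to the codes used by the site.

    `options` is an iterable of (label, code) pairs.
    """
    lookup = {}
    for label, code in options:
        if code:  # skip the " all countries" placeholder, whose value is empty
            lookup[label.strip().lower()] = code
    return lookup


# A few pairs copied from the output above.
sample = [(" all countries", ""), (" Chile", "CL"), (" Germany", "DE")]
codes = build_country_lookup(sample)
print(codes["chile"], codes["germany"])  # CL DE
```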

## Prepare the input variables
The latitude and longitude will be filled automatically later on in the code. The country name will be used later to get the coordinates.

In [44]:
cities = [
#    {'city' : 'Darmstadt', 'country_code' : 'DE', 'country' : 'Germany', 'latitude' : '', 'longitude' : ''},
    {'city' : 'Cologne', 'country_code' : 'DE', 'country' : 'Germany', 'latitude' : '', 'longitude' : ''},
#    {'city' : 'Marbella', 'country_code' : 'ES', 'country' : 'Spain', 'latitude' : '', 'longitude' : ''},
    {'city' : 'Santiago', 'country_code' : 'CL', 'country' : 'Chile', 'latitude' : '', 'longitude' : ''},
]

## Create a dataframe with postal codes, coordinates and neighborhoods
The dataframe merges the data from all the input cities.

In [45]:
dataframes = []
for city in cities:
    url = 'https://www.geonames.org/postalcode-search.html?q=' + city['city'] + '&country=' + city['country_code']
    html = requests.get(url).content
    soup = BeautifulSoup(html, 'lxml')
    #tables.append(soup.find('table', class_='restable'))
    table = soup.find('table', class_='restable')
    table_df = pd.read_html(str(table),header=0)[0]
    # add city name to the table
    table_df['City'] = city['city']
    dataframes.append(table_df)

df = pd.concat(dataframes)
df = df.reset_index(drop=True)
df.head()


Unnamed: 0.1,Admin1,Admin2,Admin3,Admin4,City,Code,Country,Place,Unnamed: 0
0,Nordrhein-Westfalen,Regierungsbezirk Köln,"Köln, Stadt","Köln, Stadt",Cologne,50823.0,Germany,Köln,1.0
1,,,,,Cologne,,,50.951/6.926,
2,Nordrhein-Westfalen,Regierungsbezirk Köln,"Köln, Stadt","Köln, Stadt",Cologne,50667.0,Germany,Köln,2.0
3,,,,,Cologne,,,50.939/6.955,
4,Nordrhein-Westfalen,Regierungsbezirk Köln,"Köln, Stadt","Köln, Stadt",Cologne,50668.0,Germany,Köln,3.0


The columns we want in our dataframe are  
PostalCode, Borough, Neighborhood, Latitude, Longitude, City, Country

Therefore, we have to clean up the data towards that goal.  
The mapping between the columns we get from the HTML and the desired dataframe is as follows:  
  * __PostalCode__ <-- Code
  * __Borough__ <-- Place
  * __Neighborhood__ <-- Place
  * __Latitude__ <-- First part of the cell including the coordinates, before the slash sign
  * __Longitude__ <-- Second part of the cell including the coordinates, after the slash sign
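
The last two mappings depend on splitting a cell such as `50.951/6.926` at the slash. A minimal sketch of that step follows, with a hypothetical helper `parse_lat_lon`; the range check is an added assumption intended to catch cells that are not coordinates at all.

```python
def parse_lat_lon(cell):
    """Split a 'lat/lon' cell into a (latitude, longitude) pair of floats."""
    lat_text, sep, lon_text = cell.strip().partition("/")
    if not sep:
        raise ValueError("not a coordinate cell: {!r}".format(cell))
    lat, lon = float(lat_text), float(lon_text)
    # Assumed sanity check: valid latitude and longitude ranges in degrees.
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        raise ValueError("coordinates out of range: {!r}".format(cell))
    return lat, lon


print(parse_lat_lon("50.951/6.926"))  # (50.951, 6.926)
```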

In [46]:
# rename 
df.rename(columns={'Code':'PostalCode','Place':'Borough'}, inplace=True)

# create a column neighborhood
df['Neighborhood'] = df['Borough']

# add columns "Latitude" and "Longitude" , initialized at 0
df['Latitude'] = 0.0 
df['Longitude'] = 0.0

df['PostalCode'] = df['PostalCode'].fillna(0)  # coordinate rows have no postal code
convert_dict = {'PostalCode' : int, 'Latitude' : float, 'Longitude' : float, 'Borough' : str, 'Neighborhood' : str, 'City' : str, 'Country' : str}
df = df.astype(convert_dict) 

# change the order of the columns and retain only the necessary ones
df = df[['PostalCode', 'Borough', 'Neighborhood', 'Latitude', 'Longitude', 'City', 'Country']]
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,City,Country
0,50823,Köln,Köln,0.0,0.0,Cologne,Germany
1,0,50.951/6.926,50.951/6.926,0.0,0.0,Cologne,
2,50667,Köln,Köln,0.0,0.0,Cologne,Germany
3,0,50.939/6.955,50.939/6.955,0.0,0.0,Cologne,
4,50668,Köln,Köln,0.0,0.0,Cologne,Germany


In [47]:
df.shape

(94, 7)

In [48]:
# more data wrangling: the lat/long is in the row following each postal code
for i in range(0, len(df.index), 2):
    lat, lon = str(df.loc[i+1, 'Borough']).split("/")
    df.at[i, 'Latitude'] = float(lat)
    df.at[i, 'Longitude'] = float(lon)

df = df.iloc[::2].reset_index(drop=True)  # remove every second row
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,City,Country
0,50823,Köln,Köln,50.951,6.926,Cologne,Germany
1,50667,Köln,Köln,50.939,6.955,Cologne,Germany
2,50668,Köln,Köln,50.95,6.963,Cologne,Germany
3,50670,Köln,Köln,50.95,6.95,Cologne,Germany
4,50672,Köln,Köln,50.944,6.936,Cologne,Germany


## Use geopy library to get the latitude and longitude values of the cities
In order to define an instance of the geocoder, we need to define a user_agent. We will name our agent <em>cities_explorer</em>, as shown below.

In [49]:
geolocator = Nominatim(user_agent="cities_explorer")
for city in cities:
    address = city['city'] + ', ' + city['country']
    location = geolocator.geocode(address)
    city['latitude'] = location.latitude
    city['longitude'] = location.longitude
    print('The geographical coordinates of {} are {}, {}.'.format(city['city'], city['latitude'], city['longitude']))

The geographical coordinates of Cologne are 50.938361, 6.959974.
The geographical coordinates of Santiago are -33.4377968, -70.6504451.
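
geopy's `geocode` returns `None` when an address cannot be resolved, in which case reading `location.latitude` would fail. The sketch below is one way to guard against that. `fill_city_coordinates` is a hypothetical helper that accepts any geocoding callable, so it can be tried without network access; in this notebook the callable would be `geolocator.geocode`. The stand-in geocoder only knows the Cologne coordinates printed above.

```python
from types import SimpleNamespace


def fill_city_coordinates(cities, geocode):
    """Fill latitude/longitude in place and return the cities that failed."""
    unresolved = []
    for city in cities:
        location = geocode("{}, {}".format(city["city"], city["country"]))
        if location is None:  # nothing found for this address
            unresolved.append(city["city"])
            continue
        city["latitude"] = location.latitude
        city["longitude"] = location.longitude
    return unresolved


def fake_geocode(address):
    """Stand-in for geolocator.geocode, used only for this illustration."""
    known = {"Cologne, Germany": SimpleNamespace(latitude=50.938361, longitude=6.959974)}
    return known.get(address)


demo = [{"city": "Cologne", "country": "Germany", "latitude": "", "longitude": ""}]
print(fill_city_coordinates(demo, fake_geocode), demo[0]["latitude"])
```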


In [50]:
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,City,Country
0,50823,Köln,Köln,50.951,6.926,Cologne,Germany
1,50667,Köln,Köln,50.939,6.955,Cologne,Germany
2,50668,Köln,Köln,50.95,6.963,Cologne,Germany
3,50670,Köln,Köln,50.95,6.95,Cologne,Germany
4,50672,Köln,Köln,50.944,6.936,Cologne,Germany


### Visualize each city's neighborhoods on a map

In [51]:
# rename the dataframe
neighborhoods = df
# neighborhoods = df.iloc[:3] # limit the amount of rows to test the code

# create maps using latitude and longitude values
maps = []
for city in cities:
    # named city_map to avoid shadowing the built-in map()
    city_map = folium.Map(location=[city['latitude'], city['longitude']], zoom_start=15)
    # add markers to map
    for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighborhood']):
        label = '{}, {}'.format(neighborhood, borough)
        label = folium.Popup(label, parse_html=True)
        folium.CircleMarker(
            [lat, lng],
            radius=5,
            popup=label,
            color='blue',
            fill=True,
            fill_color='#3186cc',
            fill_opacity=0.7,
            parse_html=False).add_to(city_map)
    maps.append(city_map)

In [53]:
maps[0]

In [54]:
maps[1]

## Foursquare
Next, we are going to start utilizing the Foursquare API to explore the neighborhoods and segment them.

In [55]:
# The code was removed by Watson Studio for sharing.

Your credentials:
CLIENT_ID: [redacted]
CLIENT_SECRET: [redacted]
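
Since the credentials cell is hidden, the names `CLIENT_ID`, `CLIENT_SECRET`, `VERSION` and `LIMIT` used in the functions below are not defined anywhere visible. One possible way to provide them without writing secrets into the notebook is to read them from environment variables. This is a sketch under assumptions: the environment variable names and `load_foursquare_settings` are hypothetical, and the defaults for the version date (the API expects a `YYYYMMDD` string) and the result limit are placeholders, not the values used in the hidden cell.

```python
import os


def load_foursquare_settings(env=os.environ):
    """Read API settings from a mapping of environment variables."""
    required = ("FOURSQUARE_CLIENT_ID", "FOURSQUARE_CLIENT_SECRET")
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise RuntimeError("missing settings: " + ", ".join(missing))
    return {
        "CLIENT_ID": env["FOURSQUARE_CLIENT_ID"],
        "CLIENT_SECRET": env["FOURSQUARE_CLIENT_SECRET"],
        # Placeholder defaults; adjust to the values actually intended.
        "VERSION": env.get("FOURSQUARE_VERSION", "20180605"),
        "LIMIT": int(env.get("FOURSQUARE_LIMIT", "100")),
    }
```

The returned values can then be assigned to the module-level names that `getNearbyVenues` expects.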


In [56]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Let's create a function to repeat the same process for all the neighborhoods passed as arguments

In [57]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
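
In `getNearbyVenues`, the expression `v['venue']['categories'][0]['name']` would raise an `IndexError` for a venue whose category list is empty, a case that `get_category_type` guards against. The sketch below isolates the row-building step so it tolerates that case. `venue_rows` is a hypothetical helper, and the second sample item is made up to exercise the empty-category branch.

```python
def venue_rows(name, lat, lng, items):
    """Turn the 'items' of one explore response into flat tuples."""
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Use None instead of failing when a venue has no category.
        category = categories[0]["name"] if categories else None
        rows.append((
            name,
            lat,
            lng,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            category,
        ))
    return rows


items = [
    {"venue": {"name": "Café Sehnsucht",
               "location": {"lat": 50.94997, "lng": 6.923213},
               "categories": [{"name": "Café"}]}},
    # Hypothetical venue without any category.
    {"venue": {"name": "Unlabelled venue",
               "location": {"lat": 50.95, "lng": 6.92},
               "categories": []}},
]
for row in venue_rows("Köln", 50.951, 6.926, items):
    print(row)
```

Rows with a `None` category could then be dropped or kept before the one-hot encoding, depending on the analysis.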

We run the function to get the venues in all neighborhoods in our dataframe

In [58]:
venues = getNearbyVenues(names=neighborhoods['Neighborhood'],
                                   latitudes=neighborhoods['Latitude'],
                                   longitudes=neighborhoods['Longitude']
                                  )

Köln
Köln
Köln
Köln
Köln
Köln
Köln
Köln
Köln
Köln
Köln
Köln
Köln
Köln
Köln
Santiago
Providencia
Las Condes
Vitacura
Lo Barnechea
Ñuñoa
Macul
La Reina
Penalolén
La Cisterna
El Bosque
La Florida
Independencia
Recoleta
Quinta Normal
Conchalí
Huechuraba
Renca
Quilicura
La Granja
La Pintana
San Miguel
Lo Prado
Pudahuel
Cerro Navia
Lo Espejo
Cerrillos
Maipu
Pedro Aguirre Cerda
San Ramón
San Joaquín
Estación Central


Let's check the venues dataframe

In [59]:
print(venues.shape)
venues.head(10)

(941, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Köln,50.951,6.926,Café Sehnsucht,50.94997,6.923213,Café
1,Köln,50.951,6.926,Kaffeebud Ehrenfeld,50.951651,6.92266,Café
2,Köln,50.951,6.926,Van Dyck Rösterei,50.949294,6.922661,Coffee Shop
3,Köln,50.951,6.926,Pfeiler Grill,50.951589,6.925579,BBQ Joint
4,Köln,50.951,6.926,Café Lia,50.950585,6.927693,Coffee Shop
5,Köln,50.951,6.926,Haus Scholzen,50.947137,6.923044,German Restaurant
6,Köln,50.951,6.926,Taverne Alekos,50.947734,6.921719,Taverna
7,Köln,50.951,6.926,Café Schwesterherz,50.946854,6.923819,Café
8,Köln,50.951,6.926,Pizza e Caffé,50.951718,6.925359,Pizza Place
9,Köln,50.951,6.926,Maison Baguette,50.94736,6.922932,Sandwich Place


How many venues were returned per Neighborhood?

In [60]:
venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Cerrillos,4,4,4,4,4,4
Cerro Navia,3,3,3,3,3,3
Conchalí,6,6,6,6,6,6
El Bosque,3,3,3,3,3,3
Estación Central,7,7,7,7,7,7
Independencia,8,8,8,8,8,8
Köln,675,675,675,675,675,675
La Cisterna,13,13,13,13,13,13
La Florida,1,1,1,1,1,1
La Granja,8,8,8,8,8,8


Let's find out how many unique categories can be curated from all the returned venues

In [61]:
print('There are {} unique categories.'.format(len(venues['Venue Category'].unique())))

There are 191 unique categories.


## Analyze Each Neighborhood

In [62]:
# one hot encoding
onehot = pd.get_dummies(venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
onehot['Neighborhood'] = venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [onehot.columns[-1]]  + list(onehot.columns[:-1])
onehot = onehot[fixed_columns]

onehot.head()

Unnamed: 0,Neighborhood,ATM,African Restaurant,Art Gallery,Art Museum,Asian Restaurant,Athletics & Sports,Austrian Restaurant,Auto Workshop,BBQ Joint,...,Trattoria/Osteria,Travel Agency,Turkish Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Water Park,Wine Bar,Wine Shop,Wings Joint,Yoga Studio
0,Köln,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Köln,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Köln,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Köln,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,Köln,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


And let's examine the new dataframe size.

In [63]:
onehot.shape

(941, 192)

Next, let's group rows by neighborhood, taking the mean of the frequency of occurrence of each category

In [64]:
grouped = onehot.groupby('Neighborhood').mean().reset_index()
grouped

Unnamed: 0,Neighborhood,ATM,African Restaurant,Art Gallery,Art Museum,Asian Restaurant,Athletics & Sports,Austrian Restaurant,Auto Workshop,BBQ Joint,...,Trattoria/Osteria,Travel Agency,Turkish Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Water Park,Wine Bar,Wine Shop,Wings Joint,Yoga Studio
0,Cerrillos,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Cerro Navia,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Conchalí,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,El Bosque,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Estación Central,0.0,0.0,0.0,0.0,0.142857,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Independencia,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Köln,0.001481,0.001481,0.001481,0.007407,0.005926,0.0,0.002963,0.0,0.004444,...,0.005926,0.001481,0.014815,0.005926,0.004444,0.001481,0.005926,0.0,0.0,0.001481
7,La Cisterna,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,La Florida,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,La Granja,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Let's confirm the new size

In [65]:
grouped.shape

(29, 192)
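
Taking the mean of the one-hot columns per neighborhood is equivalent to computing, for each neighborhood, the share of its venues that fall into each category. The dependency-free sketch below illustrates that arithmetic with a hypothetical helper, `category_frequencies`. The sample mirrors Cerrillos, which has four venues in four different categories.

```python
from collections import Counter, defaultdict


def category_frequencies(venues):
    """Return {neighborhood: {category: share of that neighborhood's venues}}.

    `venues` is an iterable of (neighborhood, category) pairs.
    """
    counts = defaultdict(Counter)
    for neighborhood, category in venues:
        counts[neighborhood][category] += 1
    return {
        hood: {cat: n / sum(cats.values()) for cat, n in cats.items()}
        for hood, cats in counts.items()
    }


sample = [
    ("Cerrillos", "Plaza"),
    ("Cerrillos", "Restaurant"),
    ("Cerrillos", "Fast Food Restaurant"),
    ("Cerrillos", "Grocery Store"),
]
print(category_frequencies(sample)["Cerrillos"]["Plaza"])  # 0.25
```

Unlike the dataframe, this sketch omits categories with zero venues instead of storing an explicit 0.0 for them.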

Let's print each neighborhood along with the top 5 most common venues

In [66]:
num_top_venues = 5

for hood in grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = grouped[grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Cerrillos----
                  venue  freq
0                 Plaza  0.25
1            Restaurant  0.25
2  Fast Food Restaurant  0.25
3         Grocery Store  0.25
4  Outdoor Supply Store  0.00


----Cerro Navia----
                  venue  freq
0                 Plaza  0.33
1              Pharmacy  0.33
2         Grocery Store  0.33
3  Outdoor Supply Store  0.00
4             Multiplex  0.00


----Conchalí----
                venue  freq
0                 Gym  0.17
1            Pharmacy  0.17
2  Athletics & Sports  0.17
3        Liquor Store  0.17
4  Chinese Restaurant  0.17


----El Bosque----
                           venue  freq
0               Department Store  0.33
1                    Pizza Place  0.33
2                   Liquor Store  0.33
3                            ATM  0.00
4  Paper / Office Supplies Store  0.00


----Estación Central----
               venue  freq
0  Martial Arts Dojo  0.14
1     Farmers Market  0.14
2   Asian Restaurant  0.14
3       Soccer Field  0.

Let's put that into a *pandas* dataframe.  
First, let's write a function to sort the venues in descending order.

In [67]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Now let's create the new dataframe and display the top 10 venues for each neighborhood.

In [68]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = grouped['Neighborhood']

for ind in np.arange(grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(grouped.iloc[ind, :], num_top_venues)
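
The column labels above get their suffix from a three-item list with a fallback to `th`, which is sufficient for the ten columns used here. If `num_top_venues` were raised beyond 20, labels such as `21th` would appear. A general helper could look like the following sketch (`ordinal` is a hypothetical name):

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1st, 2nd, 3rd, 4th, 11th."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 11th, 12th, 13th are irregular
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)


print([ordinal(n) for n in (1, 2, 3, 4, 11, 21)])
```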



In [69]:
neighborhoods_venues_sorted.head(20)

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Cerrillos,Plaza,Fast Food Restaurant,Restaurant,Grocery Store,Yoga Studio,Eastern European Restaurant,Food & Drink Shop,Food,Flea Market,Farmers Market
1,Cerro Navia,Plaza,Pharmacy,Grocery Store,Eastern European Restaurant,Food & Drink Shop,Food,Flea Market,Fast Food Restaurant,Farmers Market,Falafel Restaurant
2,Conchalí,Chinese Restaurant,Pharmacy,Athletics & Sports,Gym,Grocery Store,Liquor Store,Yoga Studio,Food & Drink Shop,Food,Flea Market
3,El Bosque,Liquor Store,Pizza Place,Department Store,Yoga Studio,Food Truck,Food & Drink Shop,Food,Flea Market,Fast Food Restaurant,Farmers Market
4,Estación Central,Soccer Field,Asian Restaurant,Fast Food Restaurant,Park,Martial Arts Dojo,Farmers Market,Grocery Store,Yoga Studio,Event Space,Food Truck
5,Independencia,Asian Restaurant,Fried Chicken Joint,Chinese Restaurant,Football Stadium,Food,Big Box Store,Restaurant,Food Truck,Food & Drink Shop,Flea Market
6,Köln,Café,Italian Restaurant,German Restaurant,Bar,Bakery,Hotel,Plaza,Supermarket,Coffee Shop,Restaurant
7,La Cisterna,Pizza Place,Gym,Sushi Restaurant,Pharmacy,Fast Food Restaurant,Moving Target,Other Great Outdoors,Bakery,Basketball Court,Peruvian Restaurant
8,La Florida,Gym / Fitness Center,Yoga Studio,Elementary School,Football Stadium,Food Truck,Food & Drink Shop,Food,Flea Market,Fast Food Restaurant,Farmers Market
9,La Granja,Grocery Store,Theater,Soccer Field,Farmers Market,Gym,Salad Place,Candy Store,Yoga Studio,Food,Flea Market


## Cluster Neighborhoods
Run *k*-means to cluster the neighborhoods into 5 clusters.

In [70]:
# set number of clusters
kclusters = 5

grouped_clustering = grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([1, 1, 1, 1, 0, 1, 0, 0, 2, 0], dtype=int32)
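
To make the clustering step more concrete, the sketch below implements the basic *k*-means iteration (Lloyd's algorithm) in plain Python: assign every point to its nearest centroid, recompute each centroid as the mean of its members, and repeat until nothing changes. It is an illustration only, not what scikit-learn runs internally; `KMeans` uses a more careful initialization (k-means++ by default), while here the starting centroids are passed in by hand. All names are hypothetical.

```python
def assign_labels(points, centroids):
    """Label each point with the index of its nearest centroid."""
    labels = []
    for point in points:
        distances = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centroids]
        labels.append(distances.index(min(distances)))
    return labels


def update_centroids(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    updated = []
    for k, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == k]
        if not members:
            updated.append(old)  # keep the centroid of an empty cluster
        else:
            updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
    return updated


def lloyd_kmeans(points, centroids, max_iter=100):
    """Alternate assignment and update steps until the centroids stop moving."""
    for _ in range(max_iter):
        labels = assign_labels(points, centroids)
        new_centroids = update_centroids(points, labels, centroids)
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels, centroids = lloyd_kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(labels)  # [0, 0, 1, 1]
```

In the notebook, each point would be one row of `grouped_clustering`, that is, one neighborhood described by its category frequencies.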

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [71]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

merged = neighborhoods

# join the sorted-venues table onto the neighborhoods table to add cluster labels and top venues to each row
merged = merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')
merged = merged.dropna() # remove the rows where we got NaN after the merge

merged.head() # check the last columns!

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,City,Country,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,50823,Köln,Köln,50.951,6.926,Cologne,Germany,0.0,Café,Italian Restaurant,German Restaurant,Bar,Bakery,Hotel,Plaza,Supermarket,Coffee Shop,Restaurant
1,50667,Köln,Köln,50.939,6.955,Cologne,Germany,0.0,Café,Italian Restaurant,German Restaurant,Bar,Bakery,Hotel,Plaza,Supermarket,Coffee Shop,Restaurant
2,50668,Köln,Köln,50.95,6.963,Cologne,Germany,0.0,Café,Italian Restaurant,German Restaurant,Bar,Bakery,Hotel,Plaza,Supermarket,Coffee Shop,Restaurant
3,50670,Köln,Köln,50.95,6.95,Cologne,Germany,0.0,Café,Italian Restaurant,German Restaurant,Bar,Bakery,Hotel,Plaza,Supermarket,Coffee Shop,Restaurant
4,50672,Köln,Köln,50.944,6.936,Cologne,Germany,0.0,Café,Italian Restaurant,German Restaurant,Bar,Bakery,Hotel,Plaza,Supermarket,Coffee Shop,Restaurant


Finally, let's visualize the resulting clusters

In [72]:
# create maps with clusters
all_maps_clusters = []

for city in cities:
    map_clusters = folium.Map(location=[city['latitude'], city['longitude']], zoom_start=11)

    # set color scheme for the clusters
    x = np.arange(kclusters)
    ys = [i + x + (i*x)**2 for i in range(kclusters)]
    colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
    rainbow = [colors.rgb2hex(i) for i in colors_array]

    # add markers to the map
    markers_colors = []
    for lat, lon, poi, cluster in zip(merged['Latitude'], merged['Longitude'], merged['Neighborhood'], merged['Cluster Labels']):
        label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
        folium.CircleMarker(
            [lat, lon],
            radius=5,
            popup=label,
            color=rainbow[int(cluster)-1],
            fill=True,
            fill_color=rainbow[int(cluster)-1],
            fill_opacity=0.7).add_to(map_clusters)
    
    all_maps_clusters.append(map_clusters)

We examine the map of clusters for the first city

In [73]:
all_maps_clusters[0]

Now, the map of clusters for the second city

In [74]:
all_maps_clusters[1]

## Examine Clusters


### Cluster 1: Dining out in German restaurants

In [76]:
merged.loc[merged['Cluster Labels'] == 0, merged.columns[[1] + list(range(5, merged.shape[1]))]]

Unnamed: 0,Borough,City,Country,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Köln,Cologne,Germany,0.0,Café,Italian Restaurant,German Restaurant,Bar,Bakery,Hotel,Plaza,Supermarket,Coffee Shop,Restaurant
1,Köln,Cologne,Germany,0.0,Café,Italian Restaurant,German Restaurant,Bar,Bakery,Hotel,Plaza,Supermarket,Coffee Shop,Restaurant
2,Köln,Cologne,Germany,0.0,Café,Italian Restaurant,German Restaurant,Bar,Bakery,Hotel,Plaza,Supermarket,Coffee Shop,Restaurant
3,Köln,Cologne,Germany,0.0,Café,Italian Restaurant,German Restaurant,Bar,Bakery,Hotel,Plaza,Supermarket,Coffee Shop,Restaurant
4,Köln,Cologne,Germany,0.0,Café,Italian Restaurant,German Restaurant,Bar,Bakery,Hotel,Plaza,Supermarket,Coffee Shop,Restaurant
5,Köln,Cologne,Germany,0.0,Café,Italian Restaurant,German Restaurant,Bar,Bakery,Hotel,Plaza,Supermarket,Coffee Shop,Restaurant
6,Köln,Cologne,Germany,0.0,Café,Italian Restaurant,German Restaurant,Bar,Bakery,Hotel,Plaza,Supermarket,Coffee Shop,Restaurant
7,Köln,Cologne,Germany,0.0,Café,Italian Restaurant,German Restaurant,Bar,Bakery,Hotel,Plaza,Supermarket,Coffee Shop,Restaurant
8,Köln,Cologne,Germany,0.0,Café,Italian Restaurant,German Restaurant,Bar,Bakery,Hotel,Plaza,Supermarket,Coffee Shop,Restaurant
9,Köln,Cologne,Germany,0.0,Café,Italian Restaurant,German Restaurant,Bar,Bakery,Hotel,Plaza,Supermarket,Coffee Shop,Restaurant


### Cluster 2: Varied services

In [77]:
merged.loc[merged['Cluster Labels'] == 1, merged.columns[[1] + list(range(5, merged.shape[1]))]]

Unnamed: 0,Borough,City,Country,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
16,Providencia,Santiago,Chile,1.0,Restaurant,Café,Gym,Breakfast Spot,Plaza,Theater,Sandwich Place,Shopping Mall,Mountain,Chinese Restaurant
21,Macul,Santiago,Chile,1.0,Bar,Pharmacy,Gymnastics Gym,Restaurant,Gym / Fitness Center,Eastern European Restaurant,Food & Drink Shop,Food,Flea Market,Fast Food Restaurant
25,El Bosque,Santiago,Chile,1.0,Liquor Store,Pizza Place,Department Store,Yoga Studio,Food Truck,Food & Drink Shop,Food,Flea Market,Fast Food Restaurant,Farmers Market
27,Independencia,Santiago,Chile,1.0,Asian Restaurant,Fried Chicken Joint,Chinese Restaurant,Football Stadium,Food,Big Box Store,Restaurant,Food Truck,Food & Drink Shop,Flea Market
28,Recoleta,Santiago,Chile,1.0,Chinese Restaurant,Food Truck,Plaza,Bus Station,Dive Bar,Fast Food Restaurant,Bar,Yoga Studio,Food & Drink Shop,Food
30,Conchalí,Santiago,Chile,1.0,Chinese Restaurant,Pharmacy,Athletics & Sports,Gym,Grocery Store,Liquor Store,Yoga Studio,Food & Drink Shop,Food,Flea Market
32,Renca,Santiago,Chile,1.0,Plaza,Convenience Store,Soccer Field,Bus Station,Flea Market,Grocery Store,Market,Yoga Studio,Event Space,Food & Drink Shop
33,Quilicura,Santiago,Chile,1.0,Food Truck,Plaza,Convenience Store,Park,Electronics Store,Food & Drink Shop,Food,Flea Market,Fast Food Restaurant,Farmers Market
39,Cerro Navia,Santiago,Chile,1.0,Plaza,Pharmacy,Grocery Store,Eastern European Restaurant,Food & Drink Shop,Food,Flea Market,Fast Food Restaurant,Farmers Market,Falafel Restaurant
40,Lo Espejo,Santiago,Chile,1.0,Grocery Store,Liquor Store,Fast Food Restaurant,Elementary School,Football Stadium,Food Truck,Food & Drink Shop,Food,Flea Market,Farmers Market


### Cluster 3: Groceries and markets

In [78]:
merged.loc[merged['Cluster Labels'] == 2, merged.columns[[2] + list(range(5, merged.shape[1]))]]

Unnamed: 0,Neighborhood,City,Country,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
26,La Florida,Santiago,Chile,2.0,Gym / Fitness Center,Yoga Studio,Elementary School,Football Stadium,Food Truck,Food & Drink Shop,Food,Flea Market,Fast Food Restaurant,Farmers Market


### Cluster 4

In [79]:
merged.loc[merged['Cluster Labels'] == 3, merged.columns[[3] + list(range(5, merged.shape[1]))]]

Unnamed: 0,Latitude,City,Country,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
23,-33.486,Santiago,Chile,3.0,Garden Center,Spa,Garden,Event Space,Football Stadium,Food Truck,Food & Drink Shop,Food,Flea Market,Fast Food Restaurant


### Cluster 5

In [80]:
merged.loc[merged['Cluster Labels'] == 4, merged.columns[[1] + list(range(4, merged.shape[1]))]]

Unnamed: 0,Borough,Longitude,City,Country,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
17,Las Condes,-70.502,Santiago,Chile,4.0,Mountain,Ice Cream Shop,Fountain,Football Stadium,Food Truck,Food & Drink Shop,Food,Flea Market,Fast Food Restaurant,Farmers Market
