# Exploring Caracas Metropolitan District venue information available in Foursquare


## Introduction

This study collects and explores the venue information that Foursquare holds for the Caracas Metropolitan District, with the objective of gaining insight into the interests of the people living in this geographical region. Moreover, this data can be used to identify points of interest for the deployment of new 

In [1]:
import pandas as pd
import numpy as np
import re
import math

import json # library to handle JSON files

import requests # library to handle requests
from pandas import json_normalize # transform JSON into a pandas dataframe (pandas >= 1.0; older versions expose it under pandas.io.json)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

# !conda install -c conda-forge lxml --yes # uncomment this line if you don't have the lxml library
# !conda install -c anaconda beautifulsoup4 --yes # uncomment this line if you don't have the beautifulsoup4 library installed

print('Libraries imported.')

Libraries imported.


The Caracas Metropolitan Area is composed of 5 municipalities, and each municipality is composed of one or more parishes.
We are interested in studying the parishes of the Caracas Metropolitan Area.

## Data collection

Let's build a dataframe with 5 columns: "Municipality", "Parish", "Latitude", "Longitude", "Area(km2)". Each row holds the geographical location and area of one parish. 

In [2]:
caracas_df = pd.DataFrame(columns = ["Municipality", "Parish", "Latitude", "Longitude", "Area(km2)"])
caracas_df.head()

Unnamed: 0,Municipality,Parish,Latitude,Longitude,Area(km2)


### Municipality Libertador

Getting the parishes from Wikipedia

In [3]:
tables_libertador_website = pd.read_html("https://es.wikipedia.org/wiki/Municipio_Libertador_de_Caracas")

In [4]:
tables_libertador_website[2].head()

Unnamed: 0,#,Parroquia,Superficie(km²),Población(hab.),Densidad(hab./km²)
0,1,Santa Rosalía,61,190.282,19.5248
1,2,El Valle,311,231.117,4.9588
2,3,Coche,130,79.83,4.5106
3,4,Caricuao,248,248.2,6.9021
4,5,Macarao,1314,70.217,3885.0


Let's build an aux dataframe with this information

In [5]:
#Remove last value 'Total' which is not a valid Parish
libertador_data = tables_libertador_website[2][["Parroquia", "Superficie(km²)"]][:-1]
libertador_data.rename(columns={'Parroquia': 'Parish', 'Superficie(km²)': 'Area(km2)'}, inplace =True)
libertador_data["Municipality"] = 'Libertador'
#Change made to be consistent with wikipedia webpage 
libertador_data.loc[libertador_data.Parish == "23 de enero", "Parish"] = "23 de Enero"
#Correct column area format
libertador_data = libertador_data.astype({"Area(km2)": float})
libertador_data['Area(km2)'] = libertador_data['Area(km2)'] / 10
libertador_data.head()

Unnamed: 0,Parish,Area(km2),Municipality
0,Santa Rosalía,6.1,Libertador
1,El Valle,31.1,Libertador
2,Coche,13.0,Libertador
3,Caricuao,24.8,Libertador
4,Macarao,131.4,Libertador
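
The division by 10 above compensates for how the table was parsed: the raw areas (61, 311, 130, ...) look like numbers whose decimal comma was dropped, and the population column shows the opposite problem, with a thousands separator read as a decimal point (190.282). As a minimal sketch, assuming the page writes such values as `6,1` and `190.282` (an assumption about the page, not something visible in the output), a hypothetical helper `parse_spanish_number` could convert them explicitly:

```python
def parse_spanish_number(text):
    """Convert a number written with Spanish separators to a float.

    Hypothetical helper: assumes '.' separates thousands and ',' marks decimals.
    """
    cleaned = text.strip().replace(".", "").replace(",", ".")
    return float(cleaned)


print(parse_spanish_number("6,1"))      # an area in km2
print(parse_spanish_number("190.282"))  # a population, read as a whole number
```

`pd.read_html` also accepts `thousands` and `decimal` arguments, which may make the manual correction unnecessary if the page really uses this format.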


Let's now find the coordinates of each parish. No table on Wikipedia contains this information directly;
however, we can construct links to each parish's Wikipedia page and get it from there.

In [6]:
libertador_parish_name = libertador_data['Parish'].tolist()
for parish in libertador_parish_name:
    tables_website = pd.read_html('https://es.wikipedia.org/wiki/Parroquia_{}_(Caracas)'.format(parish.replace(" ","_")
                                                                                                    .replace("í", "i")
                                                                                                    .replace("é","e")))
    table_website = tables_website[0].set_index(tables_website[0].columns[0]).transpose()
    try:
        c = table_website["Coordenadas"][0]
    except Exception:
        try: # Sometimes the info is not stored in the first table, so we check the second
            table_website = tables_website[1].set_index(tables_website[1].columns[0]).transpose()
            c = table_website["Coordenadas"][0]
        except Exception:
            print("Not found", parish)
            continue # skip this parish instead of reusing the previous parish's coordinates
    latitude_longitude = re.findall(r"[-+]?\d+\.\d+", c)
    latitude, longitude = latitude_longitude[0], latitude_longitude[1]
    libertador_data.loc[libertador_data.Parish == parish, "Latitude"] = latitude
    libertador_data.loc[libertador_data.Parish == parish, "Longitude"] = longitude
libertador_data.head()

Unnamed: 0,Parish,Area(km2),Municipality,Latitude,Longitude
0,Santa Rosalía,6.1,Libertador,10.48355,-66.91442
1,El Valle,31.1,Libertador,10.467206,-66.907329
2,Coche,13.0,Libertador,10.451853,-66.925286
3,Caricuao,24.8,Libertador,10.43333333,-66.98333333
4,Macarao,131.4,Libertador,10.425423,-67.034833
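
The regular expression used in the loop keeps every signed decimal number found in the "Coordenadas" cell and takes the first two as latitude and longitude. The following is a minimal, self-contained sketch of that step; the sample string is a made-up illustration of what such a cell might contain, and `parse_coordinates` is a hypothetical helper name.

```python
import re

# Signed decimal numbers such as 10.48355 or -66.91442
DECIMAL_PATTERN = re.compile(r"[-+]?\d+\.\d+")


def parse_coordinates(cell_text):
    """Return (latitude, longitude) as floats, or None if fewer than two decimals are found."""
    matches = DECIMAL_PATTERN.findall(cell_text)
    if len(matches) < 2:
        return None
    return float(matches[0]), float(matches[1])


# Hypothetical cell content; real pages may format this differently.
sample = "10°29′01″N 66°54′52″O / 10.48355, -66.91442"
print(parse_coordinates(sample))
print(parse_coordinates("sin coordenadas"))
```

Returning `None` makes the "not found" case explicit, and returning floats would avoid the later cast of the Latitude and Longitude columns.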


### Municipality Baruta

Getting the parishes from Wikipedia

In [7]:
tables_baruta_website = pd.read_html("https://es.wikipedia.org/wiki/Municipio_Baruta")

In [8]:
tables_baruta_website[3]

Unnamed: 0,Parroquia,Superficie,Población (2010),Densidad,Ubicación
0,Nuestra Señora del Rosario,73 km²,211.841 hab.[4]​,2.902 hab./km²,Sur
1,El Cafetal,9 km²,55.130 hab.[4]​,6.126 hab./km²,Noreste
2,Las Minas,4 km²,51.441 hab.[4]​,12.860 hab./km²,Norte
3,Municipio Baruta,86 km²,318.412 hab.[4]​,3.702 hab./km²,


In [9]:
#Remove last value 'Municipio Baruta' which is not a valid Parish
baruta_data = tables_baruta_website[3][["Parroquia", 'Superficie']][:-1]
baruta_data.rename(columns={'Parroquia': 'Parish', 'Superficie': 'Area(km2)'}, inplace =True)
baruta_data["Municipality"] = 'Baruta'
#Correct column area format: keep only the number from strings such as '73 km²'
baruta_data['Area(km2)'] = baruta_data['Area(km2)'].str.extract(r'(\d+)', expand=False)
baruta_data = baruta_data.astype({"Area(km2)": float})
baruta_data.head()

Unnamed: 0,Parish,Area(km2),Municipality
0,Nuestra Señora del Rosario,73.0,Baruta
1,El Cafetal,9.0,Baruta
2,Las Minas,4.0,Baruta


In [10]:
baruta_parish_name = baruta_data['Parish'].tolist()
for parish in baruta_parish_name:
    if parish == 'Nuestra Señora del Rosario': #Handle particular exception
        tables_website = pd.read_html('https://es.wikipedia.org/wiki/Nuestra_Senora_del_Rosario_de_Baruta')
    else:
        tables_website = pd.read_html('https://es.wikipedia.org/wiki/Parroquia_{}'.format(parish.replace(" ","_")))
    table_website = tables_website[0].set_index(tables_website[0].columns[0]).transpose()
    try:
        c = table_website["Coordenadas"][0]
    except Exception:
        try: # Sometimes the info is not stored in the first table, so we check the second
            table_website = tables_website[1].set_index(tables_website[1].columns[0]).transpose()
            c = table_website["Coordenadas"][0]
        except Exception:
            print("Not found", parish)
            continue # skip this parish instead of reusing the previous parish's coordinates
    latitude_longitude = re.findall(r"[-+]?\d+\.\d+", c)
    latitude, longitude = latitude_longitude[0], latitude_longitude[1]
    baruta_data.loc[baruta_data.Parish == parish, "Latitude"] = latitude
    baruta_data.loc[baruta_data.Parish == parish, "Longitude"] = longitude

Not found Las Minas


The Wikipedia page for the parish 'Las Minas' does not contain coordinate information, so we looked it up by hand.

In [11]:
baruta_data.loc[baruta_data.Parish == 'Las Minas', "Latitude"] = 10.4597222
baruta_data.loc[baruta_data.Parish == 'Las Minas', "Longitude"] = -66.85888
baruta_data.head()

Unnamed: 0,Parish,Area(km2),Municipality,Latitude,Longitude
0,Nuestra Señora del Rosario,73.0,Baruta,10.432222222222,-66.873888888889
1,El Cafetal,9.0,Baruta,10.467126,-66.830409
2,Las Minas,4.0,Baruta,10.4597,-66.8589


### Municipality El Hatillo

This municipality is composed of a single parish. Let's get its information from Wikipedia.

In [12]:
tables_website = pd.read_html('https://es.wikipedia.org/wiki/Municipio_El_Hatillo')
table_website = tables_website[0].set_index(tables_website[0].columns[0]).transpose()
table_website.head()
c = table_website["Coordenadas"][0]
latitude_longitude = re.findall(r"[-+]?\d+\.\d+", c)
hatillo_data_dic = {"Municipality":["El Hatillo"], 
                "Parish": ["Parroquia Santa Rosalía de Palermo"],
                "Latitude": latitude_longitude[0],
                "Longitude": latitude_longitude[1],
                "Area(km2)": 81.0}
hatillo_data = pd.DataFrame(hatillo_data_dic)
hatillo_data.head()

Unnamed: 0,Municipality,Parish,Latitude,Longitude,Area(km2)
0,El Hatillo,Parroquia Santa Rosalía de Palermo,10.439166666667,-66.83,81.0


### Municipality Chacao

This municipality is composed of a single parish. Let's get its information from Wikipedia.

In [13]:
tables_website = pd.read_html('https://es.wikipedia.org/wiki/Municipio_Chacao')
table_website = tables_website[1].set_index(tables_website[1].columns[0]).transpose()
c = table_website["Coordenadas"][0]
latitude_longitude = re.findall(r"[-+]?\d+\.\d+", c)
chacao_data_dic = {"Municipality":["Chacao"], 
                "Parish": ["San José de Chacao"],
                "Latitude": latitude_longitude[0],
                "Longitude": latitude_longitude[1],
                "Area(km2)": 13}
chacao_data = pd.DataFrame(chacao_data_dic)
chacao_data.head()

Unnamed: 0,Municipality,Parish,Latitude,Longitude,Area(km2)
0,Chacao,San José de Chacao,10.483333333333,-66.833333333333,13


### Municipality Sucre

Getting the parishes from Wikipedia

In [14]:
tables_sucre_website = pd.read_html("https://es.wikipedia.org/wiki/Municipio_Sucre_(Miranda)")

In [15]:
tables_sucre_website[2].head()

Unnamed: 0,Parroquia,Superficie,Población (2010),Densidad
0,Caucagüita,54 km²,64.048 hab.[9]​,1.186 hab./km²
1,Fila de Mariches,36 km²,34.274 hab.[9]​,952 hab./km²
2,La Dolorita,11 km²,84.041 hab.[9]​,7640 hab./km²
3,Leoncio Martínez,23 km²,64.457 hab.[9]​,2.802 hab./km²
4,Petare,40 km²,409.706 hab.[9]​,10.243 hab./km²


In [16]:
#Remove last value 'Municipio Sucre' which is not a valid Parish
sucre_data = tables_sucre_website[2][["Parroquia", "Superficie"]][:-1]
sucre_data.rename(columns={'Parroquia': 'Parish', 'Superficie': 'Area(km2)'}, inplace =True)
sucre_data["Municipality"] = 'Sucre'
#Correct column area format: keep only the number from strings such as '54 km²'
sucre_data['Area(km2)'] = sucre_data['Area(km2)'].str.extract(r'(\d+)', expand=False)
sucre_data = sucre_data.astype({"Area(km2)": float})
sucre_data.head()

Unnamed: 0,Parish,Area(km2),Municipality
0,Caucagüita,54.0,Sucre
1,Fila de Mariches,36.0,Sucre
2,La Dolorita,11.0,Sucre
3,Leoncio Martínez,23.0,Sucre
4,Petare,40.0,Sucre


In [17]:
sucre_parish_name = sucre_data['Parish'].tolist()
for parish in sucre_parish_name:
    if parish == 'Petare': #Handle particular exception
        tables_website = pd.read_html('https://es.wikipedia.org/wiki/Petare')
    elif parish == 'Caucagüita': #Handle special cases data not in wikipedia
        sucre_data.loc[sucre_data.Parish == parish, "Latitude"] = 10.4792
        sucre_data.loc[sucre_data.Parish == parish, "Longitude"] = -66.7427
        continue
    elif parish == 'Fila de Mariches':
        sucre_data.loc[sucre_data.Parish == parish, "Latitude"] = 10.4472
        sucre_data.loc[sucre_data.Parish == parish, "Longitude"] = -66.7531
        continue
    elif parish == 'La Dolorita':
        sucre_data.loc[sucre_data.Parish == parish, "Latitude"] = 10.4472
        sucre_data.loc[sucre_data.Parish == parish, "Longitude"] = -66.7695
        continue
    elif parish == 'Leoncio Martínez':
        sucre_data.loc[sucre_data.Parish == parish, "Latitude"] = 10.5171
        sucre_data.loc[sucre_data.Parish == parish, "Longitude"] = -66.8261
        continue
    else:
        tables_website = pd.read_html('https://es.wikipedia.org/wiki/Parroquia_{}'.format(parish.replace(" ","_").replace("í","i")), encoding='utf-8')
    table_website = tables_website[0].set_index(tables_website[0].columns[0]).transpose()
    try:
        c = table_website["Coordenadas"][0]
    except Exception:
        try: # Sometimes the info is not stored in the first table, so we check the second
            table_website = tables_website[1].set_index(tables_website[1].columns[0]).transpose()
            c = table_website["Coordenadas"][0]
        except Exception:
            print("Not found", parish)
            continue # skip this parish instead of reusing the previous parish's coordinates
    latitude_longitude = re.findall(r"[-+]?\d+\.\d+", c)
    latitude, longitude = latitude_longitude[0], latitude_longitude[1]
    sucre_data.loc[sucre_data.Parish == parish, "Latitude"] = latitude
    sucre_data.loc[sucre_data.Parish == parish, "Longitude"] = longitude
sucre_data.head()

Unnamed: 0,Parish,Area(km2),Municipality,Latitude,Longitude
0,Caucagüita,54.0,Sucre,10.4792,-66.7427
1,Fila de Mariches,36.0,Sucre,10.4472,-66.7531
2,La Dolorita,11.0,Sucre,10.4472,-66.7695
3,Leoncio Martínez,23.0,Sucre,10.5171,-66.8261
4,Petare,40.0,Sucre,10.48333333,-66.81666667


Now we merge all the data into a single dataframe

In [18]:
caracas_df = pd.concat([libertador_data, baruta_data, hatillo_data, chacao_data, sucre_data], axis=0)
caracas_df.head()

Unnamed: 0,Parish,Area(km2),Municipality,Latitude,Longitude
0,Santa Rosalía,6.1,Libertador,10.48355,-66.91442
1,El Valle,31.1,Libertador,10.467206,-66.907329
2,Coche,13.0,Libertador,10.451853,-66.925286
3,Caricuao,24.8,Libertador,10.43333333,-66.98333333
4,Macarao,131.4,Libertador,10.425423,-67.034833


In [19]:
caracas_df = caracas_df[["Municipality", "Parish", "Latitude", "Longitude", "Area(km2)"]]

In [20]:
caracas_df.head()

Unnamed: 0,Municipality,Parish,Latitude,Longitude,Area(km2)
0,Libertador,Santa Rosalía,10.48355,-66.91442,6.1
1,Libertador,El Valle,10.467206,-66.907329,31.1
2,Libertador,Coche,10.451853,-66.925286,13.0
3,Libertador,Caricuao,10.43333333,-66.98333333,24.8
4,Libertador,Macarao,10.425423,-67.034833,131.4


In [21]:
caracas_df.dtypes #types for each column

Municipality     object
Parish           object
Latitude         object
Longitude        object
Area(km2)       float64
dtype: object

In [22]:
caracas_df = caracas_df.astype({"Latitude": float, "Longitude": float}) #cast Latitude and Longitude columns into the correct type

In [23]:
caracas_df.reset_index(inplace = True, drop= True)
caracas_df

Unnamed: 0,Municipality,Parish,Latitude,Longitude,Area(km2)
0,Libertador,Santa Rosalía,10.48355,-66.91442,6.1
1,Libertador,El Valle,10.467206,-66.907329,31.1
2,Libertador,Coche,10.451853,-66.925286,13.0
3,Libertador,Caricuao,10.433333,-66.983333,24.8
4,Libertador,Macarao,10.425423,-67.034833,131.4
5,Libertador,Antímano,10.4667,-66.9667,29.5
6,Libertador,La Vega,10.460316,-66.943113,12.9
7,Libertador,El Paraíso,10.492164,-66.926004,10.4
8,Libertador,El Junquito,10.478807,-67.083218,56.0
9,Libertador,Sucre,10.515081,-66.944409,58.3


### Experiments

First, let's plot this data on a map

In [24]:
# create map of Caracas using latitude and longitude values
caracas_latitude = 10.4806
caracas_longitude = - 66.9036
# Each municipality is drawn with a border of a different color
municipality_colors = {'Libertador':'#DD4124', 'Baruta':'#D65076', 'El Hatillo':'#45B8AC', 'Chacao':'#EFC050', 'Sucre':'#5B5EA6'}
map_caracas = folium.Map(location=[caracas_latitude, caracas_longitude], zoom_start=10)

# add markers to map
for lat, lng, municipality, parish in zip(caracas_df['Latitude'], caracas_df['Longitude'], caracas_df['Municipality'], caracas_df['Parish']):
    label = '{}, {}'.format(parish, municipality)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
#         color='blue',
        color= municipality_colors[municipality],
        fill=True,
        fill_color='#ffffff',
        fill_opacity=1,
        parse_html=False).add_to(map_caracas)  
    
map_caracas

In [25]:
CLIENT_ID = 'PZFAJ2VCBKHWSHV3DLRKIS44CF52HNZM3NRG1C53YWXFQPXC' # your Foursquare ID
CLIENT_SECRET = 'ZWZP0JWVEJHHJW1H0X24X1VKV04MJ0WMEVQAVFRFAXSFZ0XT' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: PZFAJ2VCBKHWSHV3DLRKIS44CF52HNZM3NRG1C53YWXFQPXC
CLIENT_SECRET:ZWZP0JWVEJHHJW1H0X24X1VKV04MJ0WMEVQAVFRFAXSFZ0XT


#### Let's explore the first parish in our dataset to discover the data Foursquare has available for Caracas, Venezuela

In [26]:
parish_latitude, parish_longitude = caracas_df.loc[0, 'Latitude'], caracas_df.loc[0, 'Longitude']
# the radius depends on the size of the parish's area
radius = (math.sqrt(caracas_df.loc[0, 'Area(km2)']) / 2) * 1000
url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, VERSION, parish_latitude, parish_longitude, radius, LIMIT)
url


'https://api.foursquare.com/v2/venues/explore?client_id=PZFAJ2VCBKHWSHV3DLRKIS44CF52HNZM3NRG1C53YWXFQPXC&client_secret=ZWZP0JWVEJHHJW1H0X24X1VKV04MJ0WMEVQAVFRFAXSFZ0XT&v=20180605&ll=10.48355,-66.91442&radius=1234.908903522847&limit=100'
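
The same request URL can also be assembled from a dictionary of parameters, which keeps each value labeled and lets the standard library handle escaping. This is only a sketch: `build_explore_url` is a hypothetical helper, and the credentials below are placeholders rather than working keys.

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Build the venues/explore URL with the same parameters as the cell above."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    # safe="," keeps the comma in "ll" unescaped, matching the hand-built URL
    return EXPLORE_ENDPOINT + "?" + urlencode(params, safe=",")


# Placeholder credentials, for illustration only
example_url = build_explore_url("YOUR_ID", "YOUR_SECRET", "20180605",
                                10.48355, -66.91442, 1234.9, 100)
print(example_url)
```

Passing the same dictionary as `params=` to `requests.get` is another option, since `requests` encodes query parameters itself.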

In [27]:
results = requests.get(url).json()

In [28]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError: # the flattened frame uses the 'venue.' prefix
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [29]:
venues = results['response']['groups'][0]['items']
# print(venues)
    
nearby_venues = json_normalize(venues) # flatten JSON
# print(nearby_venues)

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()



Unnamed: 0,name,categories,lat,lng
0,El Pescado,Pizza Place,10.487861,-66.908594
1,Miga's,Sandwich Place,10.492222,-66.915694
2,Gabriel's Pizza,Pizza Place,10.487851,-66.908651
3,Il Sapore della Nonna,Italian Restaurant,10.487858,-66.908645
4,Panadería y Pastelería Opera,Bakery,10.48786,-66.908645


This was an example. Now we are going to repeat this process for every Parish in Caracas

In [30]:
def getNearbyVenues(parishes, latitudes, longitudes, radii):
    
    venues_list=[]
    for parish, lat, lng, radius in zip(parishes, latitudes, longitudes, radii):
        print(parish)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius,
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']

        # return only relevant information for each nearby venue
        venues_list.append([(
            parish, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Parish', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
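
`getNearbyVenues` indexes `categories[0]` directly, so a venue with an empty category list would raise an `IndexError`. Below is a hedged sketch of a more tolerant extraction step, using a hypothetical helper `venue_row` and made-up sample items shaped like the fields accessed above:

```python
def venue_row(parish, item):
    """Flatten one item into (parish, name, lat, lng, category).

    The category is None when the venue lists no categories.
    """
    venue = item["venue"]
    categories = venue.get("categories", [])
    category = categories[0]["name"] if categories else None
    return (parish,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            category)


# Made-up items for illustration only; they mimic the keys used in getNearbyVenues.
items = [
    {"venue": {"name": "Venue A", "location": {"lat": 10.48, "lng": -66.91},
               "categories": [{"name": "Bakery"}]}},
    {"venue": {"name": "Venue B", "location": {"lat": 10.49, "lng": -66.92},
               "categories": []}},
]
rows = [venue_row("Santa Rosalía", item) for item in items]
print(rows)
```

Rows with a `None` category could then be dropped or kept under a placeholder label before the one-hot encoding step.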

Each parish has a different area, so we set the exploring radius according to the area of the parish.  
The following map displays each parish with its selected exploring radius.

In [31]:
def plot_map(center_latitude, center_longitude, latitudes, longitudes, municipalities, parishes, radii, border_color, fill_color, fill_opacity = 0.7, zoom_start = 11):
    map_caracas = folium.Map(location=[center_latitude, center_longitude], zoom_start=zoom_start)

    # add markers to map
    for lat, lng, municipality, parish, radius, b_c, f_c in zip(latitudes, longitudes, municipalities, parishes, radii, border_color, fill_color):
        label = '{}, {}'.format(parish, municipality)
        label = folium.Popup(label, parse_html=True)
        folium.Circle(
            [lat, lng],
            radius=radius,
            popup=label,
            color=b_c,
            fill=True,
            fill_color=f_c,
            fill_opacity=fill_opacity,
            parse_html=False).add_to(map_caracas)  

    return map_caracas

In [32]:
plot_map(caracas_latitude, caracas_longitude, caracas_df['Latitude'], caracas_df['Longitude'], caracas_df['Municipality'], caracas_df['Parish'], caracas_df['Area(km2)'].map(lambda area: math.sqrt(area) / 2 * 1000), [municipality_colors[municipality] for municipality in caracas_df['Municipality']], ['#ffffff' for x in range(caracas_df.shape[0])], fill_opacity = 0.5)

In our first attempt, the radius was calculated as the square root of the area divided by 2.  
As the map shows, the circles of several parishes overlap, so we decided to use a different, smaller radius for this experiment.

In [33]:
plot_map(caracas_latitude, caracas_longitude, caracas_df['Latitude'], caracas_df['Longitude'], caracas_df['Municipality'], caracas_df['Parish'], caracas_df['Area(km2)'].map(lambda area: math.sqrt(area) / 4 * 1000), [municipality_colors[municipality] for municipality in caracas_df['Municipality']], ['#ffffff' for x in range(caracas_df.shape[0])], fill_opacity = 0.5)

After testing with different values, we have decided to work with a radius calculated by dividing the square root of the area by 4.
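
One way to read the radius rule is that each parish is treated as a square of the given area: the side is the square root of the area, and dividing by 2 or by 4 gives half or a quarter of that side, converted from kilometres to metres. A small sketch with a hypothetical helper `exploring_radius_m`:

```python
import math


def exploring_radius_m(area_km2, divisor=4):
    """Radius in metres: sqrt(area) / divisor, converted from km to m."""
    return math.sqrt(area_km2) / divisor * 1000


# Santa Rosalía has an area of 6.1 km2 in the table above.
print(exploring_radius_m(6.1, divisor=2))  # first attempt
print(exploring_radius_m(6.1, divisor=4))  # radius finally used
```

With a divisor of 2 this gives roughly 1235 m for Santa Rosalía, the value that appears in the first request URL; a divisor of 4 halves it to roughly 617 m.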

In [34]:
caracas_venues = getNearbyVenues(parishes=caracas_df['Parish'],
                                   latitudes=caracas_df['Latitude'],
                                   longitudes=caracas_df['Longitude'],
                                   radii=caracas_df['Area(km2)'].map(lambda area: math.sqrt(area) / 4 * 1000)
                                  )

Santa Rosalía
El Valle
Coche
Caricuao
Macarao
Antímano
La Vega
El Paraíso
El Junquito
Sucre
San Juan
Santa Teresa
23 de Enero
La Pastora
Altagracia
San José
San Bernardino
Catedral
La Candelaria
San Agustín
El Recreo
San Pedro
Nuestra Señora del Rosario
El Cafetal
Las Minas
Parroquia Santa Rosalía de Palermo
San José de Chacao
Caucagüita
Fila de Mariches
La Dolorita
Leoncio Martínez
Petare


In [35]:
print(caracas_venues.shape)
caracas_venues.head()

(559, 5)


Unnamed: 0,Parish,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Santa Rosalía,Terminal Expresos Occidente,10.484,-66.909727,Bus Station
1,Santa Rosalía,San Jorge,10.478878,-66.915979,Shopping Mall
2,Santa Rosalía,Distribuidora Merposur,10.48061,-66.918408,Clothing Store
3,Santa Rosalía,Mercado Del Cementerio,10.480491,-66.917445,Clothing Store
4,El Valle,Salón Venezuela,10.468543,-66.898608,Historic Site


#### Now that we have the venue data for the parishes of the Caracas Metropolitan Area, let's explore it

Let's check how many venues were returned for each parish

In [36]:
caracas_venues.groupby('Parish').count().head()

Unnamed: 0_level_0,Venue,Venue Latitude,Venue Longitude,Venue Category
Parish,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
23 de Enero,1,1,1,1
Altagracia,2,2,2,2
Antímano,18,18,18,18
Caricuao,9,9,9,9
Catedral,23,23,23,23


Some parishes have very little data: only between 1 and 20 registered venues. 
Let's add this count to the dataframe as a new column

In [37]:
caracas_venues_count = caracas_venues.groupby(['Parish'])['Venue'].count().to_frame()
caracas_venues_count.rename(columns={'Venue': 'Venues_count'}, inplace =True)
# pd.concat([test_df, caracas_df], keys=['Parish'])
caracas_df = pd.merge(caracas_venues_count, caracas_df, left_on=caracas_venues_count.index.name, right_on='Parish')

In [38]:
caracas_df

Unnamed: 0,Parish,Venues_count,Municipality,Latitude,Longitude,Area(km2)
0,23 de Enero,1,Libertador,10.505728,-66.934889,2.5
1,Altagracia,2,Libertador,10.515983,-66.913264,3.7
2,Antímano,18,Libertador,10.4667,-66.9667,29.5
3,Caricuao,9,Libertador,10.433333,-66.983333,24.8
4,Catedral,23,Libertador,10.50642,-66.91403,1.0
5,Caucagüita,4,Sucre,10.4792,-66.7427,54.0
6,Coche,6,Libertador,10.451853,-66.925286,13.0
7,El Cafetal,12,Baruta,10.467126,-66.830409,9.0
8,El Junquito,3,Libertador,10.478807,-67.083218,56.0
9,El Paraíso,23,Libertador,10.492164,-66.926004,10.4


#### Let's find out how many unique categories can be curated from all the returned venues

In [39]:
print('There are {} unique categories.'.format(len(caracas_venues['Venue Category'].unique())))

There are 140 unique categories.


In [40]:
caracas_venues['Venue Category'].unique()

array(['Bus Station', 'Shopping Mall', 'Clothing Store', 'Historic Site',
       'Brewery', 'Chinese Restaurant', 'Bowling Alley', 'Pharmacy',
       'Brazilian Restaurant', 'Pizza Place', 'Gym', 'Metro Station',
       'Burger Joint', 'Lake', 'Baseball Field', 'Tennis Court',
       'Baseball Stadium', 'Plaza', 'Arepa Restaurant', 'Bakery',
       'Grocery Store', 'Hot Dog Joint', 'Stadium', 'Sculpture Garden',
       'Skate Park', 'Playground', 'Hobby Shop', 'Garden',
       'Sandwich Place', 'Fast Food Restaurant', 'Park',
       'South American Restaurant', 'Latin American Restaurant',
       'Soccer Field', 'Tea Room', 'Diner', 'Italian Restaurant',
       'Zoo Exhibit', 'Deli / Bodega', 'Pool', 'Restaurant', 'BBQ Joint',
       'Basketball Court', 'Gymnastics Gym', 'Electronics Store',
       'Golf Course', 'African Restaurant', 'Theater', 'Market',
       'Art Museum', 'Convenience Store', 'History Museum',
       'Spanish Restaurant', 'Indie Theater', 'Tennis Stadium', 'Hotel',

We have decided to exclude parishes with 10 or fewer venues from this experiment.

In [42]:
caracas_more_than_10_venues = caracas_df[caracas_df.Venues_count > 10]
caracas_more_than_10_venues

Unnamed: 0,Parish,Venues_count,Municipality,Latitude,Longitude,Area(km2)
2,Antímano,18,Libertador,10.4667,-66.9667,29.5
4,Catedral,23,Libertador,10.50642,-66.91403,1.0
7,El Cafetal,12,Baruta,10.467126,-66.830409,9.0
9,El Paraíso,23,Libertador,10.492164,-66.926004,10.4
10,El Recreo,36,Libertador,10.50503,-66.88773,18.1
11,El Valle,21,Libertador,10.467206,-66.907329,31.1
13,La Candelaria,15,Libertador,10.503634,-66.904806,1.2
18,Nuestra Señora del Rosario,67,Baruta,10.432222,-66.873889,73.0
19,Parroquia Santa Rosalía de Palermo,100,El Hatillo,10.439167,-66.83,81.0
20,Petare,82,Sucre,10.483333,-66.816667,40.0


In [54]:
caracas_venues = caracas_venues[caracas_venues.Parish.isin(caracas_more_than_10_venues['Parish'].tolist())]

In [55]:
# one hot encoding
caracas_onehot = pd.get_dummies(caracas_venues[['Venue Category']], prefix="", prefix_sep="")

# add Parish column back to dataframe
caracas_onehot['Parish'] = caracas_venues['Parish'] 

# move Parish column to the first column
fixed_columns = [caracas_onehot.columns[-1]] + list(caracas_onehot.columns[:-1])
caracas_onehot = caracas_onehot[fixed_columns]

caracas_onehot.head()

Unnamed: 0,Parish,African Restaurant,American Restaurant,Arepa Restaurant,Art Museum,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Auto Workshop,BBQ Joint,...,Tea Room,Tennis Court,Tennis Stadium,Theater,Trail,Video Store,Wine Bar,Women's Store,Yoga Studio,Zoo Exhibit
4,El Valle,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,El Valle,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,El Valle,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,El Valle,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,El Valle,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [56]:
caracas_onehot.shape

(500, 134)

In [57]:
caracas_grouped = caracas_onehot.groupby('Parish').mean().reset_index()
caracas_grouped

Unnamed: 0,Parish,African Restaurant,American Restaurant,Arepa Restaurant,Art Museum,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Auto Workshop,BBQ Joint,...,Tea Room,Tennis Court,Tennis Stadium,Theater,Trail,Video Store,Wine Bar,Women's Store,Yoga Studio,Zoo Exhibit
0,Antímano,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.055556,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Catedral,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.086957,0.0,0.0,0.0,0.0,0.0,0.0
2,El Cafetal,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,El Paraíso,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.043478,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.043478
4,El Recreo,0.0,0.0,0.0,0.027778,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.055556,0.0,0.0,0.0,0.0,0.0,0.0
5,El Valle,0.0,0.0,0.047619,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.047619,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,La Candelaria,0.0,0.0,0.0,0.0,0.0,0.066667,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,Nuestra Señora del Rosario,0.0,0.0,0.0,0.0,0.0,0.014925,0.014925,0.0,0.0,...,0.029851,0.014925,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,Parroquia Santa Rosalía de Palermo,0.0,0.0,0.02,0.0,0.01,0.0,0.01,0.0,0.01,...,0.01,0.0,0.0,0.01,0.02,0.0,0.01,0.0,0.01,0.0
9,Petare,0.0,0.0,0.012195,0.0,0.0,0.0,0.0,0.012195,0.036585,...,0.0,0.0,0.0,0.012195,0.0,0.0,0.0,0.0,0.0,0.0


In [58]:
caracas_grouped.shape

(14, 134)
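
Grouping the one-hot rows by parish and taking the mean yields, for each parish, the share of its venues that fall in each category. The same quantity can be sketched without pandas; the venue list below is made up, chosen so that 2 of 23 venues are theaters, which is consistent with the 0.086957 shown for Catedral under Theater. `category_frequencies` is a hypothetical helper name.

```python
from collections import Counter, defaultdict


def category_frequencies(venues):
    """Map each parish to {category: share of that parish's venues}.

    `venues` is an iterable of (parish, category) pairs.
    """
    counts = defaultdict(Counter)
    for parish, category in venues:
        counts[parish][category] += 1
    return {
        parish: {cat: n / sum(counter.values()) for cat, n in counter.items()}
        for parish, counter in counts.items()
    }


# Illustrative data only: 2 theaters and 21 other venues in one parish.
sample = [("Catedral", "Theater")] * 2 + [("Catedral", "Coffee Shop")] * 21
freqs = category_frequencies(sample)
print(round(freqs["Catedral"]["Theater"], 6))
```

Because each row of shares sums to 1, parishes with very different venue counts become comparable before clustering.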

In [59]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [66]:
num_top_venues = 5

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Parish']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
parishes_venues_sorted = pd.DataFrame(columns=columns)
parishes_venues_sorted['Parish'] = caracas_grouped['Parish']

for ind in np.arange(caracas_grouped.shape[0]):
    parishes_venues_sorted.iloc[ind, 1:] = return_most_common_venues(caracas_grouped.iloc[ind, :], num_top_venues)

parishes_venues_sorted.head()

Unnamed: 0,Parish,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,Antímano,Bakery,Burger Joint,Fast Food Restaurant,Park,Latin American Restaurant
1,Catedral,Coffee Shop,Breakfast Spot,Historic Site,Theater,Café
2,El Cafetal,Pharmacy,Gym,Bakery,Coffee Shop,Playground
3,El Paraíso,Bakery,Chinese Restaurant,Italian Restaurant,Pharmacy,Pool
4,El Recreo,Plaza,Bakery,Dessert Shop,Restaurant,Theater


In [67]:
# set number of clusters
kclusters = 5

caracas_grouped_clustering = caracas_grouped.drop(columns='Parish')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(caracas_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([4, 2, 3, 0, 2, 2, 1, 2, 2, 2], dtype=int32)

In [68]:
# add clustering labels
parishes_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

# start from the parishes that have more than 10 venues
caracas_merged = caracas_more_than_10_venues

# merge the cluster labels and top venues into the parish data, which holds latitude/longitude for each parish
caracas_merged = caracas_merged.join(parishes_venues_sorted.set_index('Parish'), on='Parish')

caracas_merged.head() # check the last columns!

Unnamed: 0,Parish,Venues_count,Municipality,Latitude,Longitude,Area(km2),Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
2,Antímano,18,Libertador,10.4667,-66.9667,29.5,4,Bakery,Burger Joint,Fast Food Restaurant,Park,Latin American Restaurant
4,Catedral,23,Libertador,10.50642,-66.91403,1.0,2,Coffee Shop,Breakfast Spot,Historic Site,Theater,Café
7,El Cafetal,12,Baruta,10.467126,-66.830409,9.0,3,Pharmacy,Gym,Bakery,Coffee Shop,Playground
9,El Paraíso,23,Libertador,10.492164,-66.926004,10.4,0,Bakery,Chinese Restaurant,Italian Restaurant,Pharmacy,Pool
10,El Recreo,36,Libertador,10.50503,-66.88773,18.1,2,Plaza,Bakery,Dessert Shop,Restaurant,Theater
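
Because `join` keeps every row of the left dataframe, a parish with no matching row in the sorted-venues table would end up with a missing cluster label. The sketch below illustrates one way to check for that. The dataframes `toy_locations` and `toy_sorted` are hypothetical and only mimic the shape of `caracas_more_than_10_venues` and `parishes_venues_sorted`.

```python
import pandas as pd

# Hypothetical location table (left side of the join)
toy_locations = pd.DataFrame({
    'Parish': ['A', 'B', 'C'],
    'Latitude': [10.1, 10.2, 10.3],
})

# Hypothetical sorted-venues table with cluster labels; parish 'C' is absent on purpose
toy_sorted = pd.DataFrame({
    'Cluster Labels': [0, 1],
    'Parish': ['A', 'B'],
    '1st Most Common Venue': ['Bakery', 'Plaza'],
})

# same pattern as in the cell above: left join on the 'Parish' column
toy_merged = toy_locations.join(toy_sorted.set_index('Parish'), on='Parish')

# rows that received no cluster label
unmatched = toy_merged[toy_merged['Cluster Labels'].isna()]
print(unmatched['Parish'].tolist())
```

An empty list from the equivalent check on `caracas_merged` would indicate that every parish received a label.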


In [69]:
# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

caracas_map = plot_map(caracas_latitude, 
             caracas_longitude, 
             caracas_merged['Latitude'], 
             caracas_merged['Longitude'], 
             caracas_merged['Municipality'], 
             caracas_merged['Parish'], 
             caracas_merged['Area(km2)'].map(lambda area: math.sqrt(area) / 4 * 1000), 
             [municipality_colors[municipality] for municipality in caracas_merged['Municipality']], 
             [rainbow[cluster-1] for cluster in caracas_merged['Cluster Labels']], fill_opacity = 0.5)
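
In the cell above, `ys` is used only for its length, which equals `kclusters`. A minimal sketch of the same palette construction, with one hex color per cluster label, is shown below. The list `toy_labels` is hypothetical. Note that the sketch indexes the palette by the label itself, whereas the cell above uses `cluster-1`, which sends label 0 to the last color.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

kclusters = 5

# one evenly spaced color from the rainbow colormap per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(c) for c in colors_array]

# hypothetical cluster labels, for illustration only
toy_labels = [4, 2, 3, 0, 2]

# label i maps to the i-th palette entry
marker_colors = [rainbow[label] for label in toy_labels]
print(marker_colors)
```

Either indexing scheme gives each cluster a distinct color; what matters is that the same scheme is used wherever the colors are interpreted.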

In [70]:
def add_inf_to_map(map_to, latitudes, longitudes, messages):
    # add markers to map
    for lat, lng, message in zip(latitudes, longitudes, messages):
        folium.Marker([lat,lng], icon=folium.features.DivIcon(
                icon_size=(10,10),
                icon_anchor=(1,1),
                html='<div style="font-size: 14pt; color : black">'+str(message)+'</div>',
                )).add_to(map_to)

In [71]:
caracas_map

In [72]:
add_inf_to_map(caracas_map, caracas_merged['Latitude'],caracas_merged['Longitude'], caracas_merged['Venues_count'])

In [73]:
caracas_map

### Examine Clusters

#### Cluster 1 (label 0)

In [74]:
caracas_merged.loc[caracas_merged['Cluster Labels'] == 0, :]

Unnamed: 0,Parish,Venues_count,Municipality,Latitude,Longitude,Area(km2),Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
9,El Paraíso,23,Libertador,10.492164,-66.926004,10.4,0,Bakery,Chinese Restaurant,Italian Restaurant,Pharmacy,Pool
22,San Bernardino,38,Libertador,10.512957,-66.902191,9.9,0,Pharmacy,Bakery,Pizza Place,Platform,Deli / Bodega
26,San Pedro,26,Libertador,10.48899,-66.88897,6.7,0,Bakery,College Quad,Pizza Place,Burger Joint,Convenience Store


#### Cluster 2 (label 1)

In [75]:
caracas_merged.loc[caracas_merged['Cluster Labels'] == 1, :]

Unnamed: 0,Parish,Venues_count,Municipality,Latitude,Longitude,Area(km2),Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
13,La Candelaria,15,Libertador,10.503634,-66.904806,1.2,1,Spanish Restaurant,Plaza,Restaurant,Pizza Place,Bed & Breakfast


#### Cluster 3 (label 2)

In [76]:
caracas_merged.loc[caracas_merged['Cluster Labels'] == 2, :]

Unnamed: 0,Parish,Venues_count,Municipality,Latitude,Longitude,Area(km2),Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
4,Catedral,23,Libertador,10.50642,-66.91403,1.0,2,Coffee Shop,Breakfast Spot,Historic Site,Theater,Café
10,El Recreo,36,Libertador,10.50503,-66.88773,18.1,2,Plaza,Bakery,Dessert Shop,Restaurant,Theater
11,El Valle,21,Libertador,10.467206,-66.907329,31.1,2,Shopping Mall,Chinese Restaurant,Metro Station,Baseball Field,Pizza Place
18,Nuestra Señora del Rosario,67,Baruta,10.432222,-66.873889,73.0,2,Bakery,Ice Cream Shop,Burger Joint,Coffee Shop,Pharmacy
19,Parroquia Santa Rosalía de Palermo,100,El Hatillo,10.439167,-66.83,81.0,2,Shopping Mall,Pizza Place,Italian Restaurant,Pharmacy,Restaurant
20,Petare,82,Sucre,10.483333,-66.816667,40.0,2,Bakery,Ice Cream Shop,Italian Restaurant,Pizza Place,Steakhouse
24,San José de Chacao,26,Chacao,10.483333,-66.833333,13.0,2,Park,Bakery,Pharmacy,Brewery,Restaurant
29,Sucre,13,Libertador,10.515081,-66.944409,58.3,2,Metro Station,African Restaurant,Bakery,Market,Park


#### Cluster 4 (label 3)

In [77]:
caracas_merged.loc[caracas_merged['Cluster Labels'] == 3, :]

Unnamed: 0,Parish,Venues_count,Municipality,Latitude,Longitude,Area(km2),Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
7,El Cafetal,12,Baruta,10.467126,-66.830409,9.0,3,Pharmacy,Gym,Bakery,Coffee Shop,Playground


#### Cluster 5 (label 4)

In [78]:
caracas_merged.loc[caracas_merged['Cluster Labels'] == 4, :]

Unnamed: 0,Parish,Venues_count,Municipality,Latitude,Longitude,Area(km2),Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
2,Antímano,18,Libertador,10.4667,-66.9667,29.5,4,Bakery,Burger Joint,Fast Food Restaurant,Park,Latin American Restaurant


In [79]:
caracas_merged

Unnamed: 0,Parish,Venues_count,Municipality,Latitude,Longitude,Area(km2),Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
2,Antímano,18,Libertador,10.4667,-66.9667,29.5,4,Bakery,Burger Joint,Fast Food Restaurant,Park,Latin American Restaurant
4,Catedral,23,Libertador,10.50642,-66.91403,1.0,2,Coffee Shop,Breakfast Spot,Historic Site,Theater,Café
7,El Cafetal,12,Baruta,10.467126,-66.830409,9.0,3,Pharmacy,Gym,Bakery,Coffee Shop,Playground
9,El Paraíso,23,Libertador,10.492164,-66.926004,10.4,0,Bakery,Chinese Restaurant,Italian Restaurant,Pharmacy,Pool
10,El Recreo,36,Libertador,10.50503,-66.88773,18.1,2,Plaza,Bakery,Dessert Shop,Restaurant,Theater
11,El Valle,21,Libertador,10.467206,-66.907329,31.1,2,Shopping Mall,Chinese Restaurant,Metro Station,Baseball Field,Pizza Place
13,La Candelaria,15,Libertador,10.503634,-66.904806,1.2,1,Spanish Restaurant,Plaza,Restaurant,Pizza Place,Bed & Breakfast
18,Nuestra Señora del Rosario,67,Baruta,10.432222,-66.873889,73.0,2,Bakery,Ice Cream Shop,Burger Joint,Coffee Shop,Pharmacy
19,Parroquia Santa Rosalía de Palermo,100,El Hatillo,10.439167,-66.83,81.0,2,Shopping Mall,Pizza Place,Italian Restaurant,Pharmacy,Restaurant
20,Petare,82,Sucre,10.483333,-66.816667,40.0,2,Bakery,Ice Cream Shop,Italian Restaurant,Pizza Place,Steakhouse
