<H1>Introduction/Business Problem:</H1>


Technology is (and will remain as) arguably our greatest influence in this century, revolutionazing human capabilities and desires beyond what was previously only dreamed of. The massive adoption of smartphones and social media changed forever the way people interact and engage with each other, with an increasingly accesible, diversified and novel content, and of course with businesses that cater the recently developed needs of this new age.

The entertainment industry, for example, has seen major shifts this last two decades. In this regard, one area of special interest for it’s upside potential is the videogames industry. For the digital-native population, videogames have been an ubiquitous and reliable companion from the most tender age, providing opportunities to challenge themselves, to compete and improve, to bond and connect with friends and other fellow gamers or simply enjoy their spare time in a casual manner. As these people grow, so do this passion: millennials and centennials all around the globe continue to find pleasure in gaming as their mature into college or their first jobs, with the intention to relax and have fun with friends after a long, hard day of work, from the comfort of their own homes. The COVID-19 pandemic exacerbated this trend and made it more visible to everyone, something that was reflected in all sort of sectorial metrics.

Even though the ability to seamlessly play a game from an extremely large pool of options with one’s acquaintance and from one’s own place proved to be very appealing, maybe there are other venues where people would enjoy the gaming experience. In the Internet era, communities of all kinds have flooded this wonderful space: would an in-person gathering be something they would be interested in? Would this people like to meet their online partners in real life to share a meal or a drink and, of course, play casually or competitively with others in-site? Could this setup attract a broader audience into the gaming market, thus expanding it’s dimension and resources, ultimately improving the consumer experience through innovative products and services?

In this instance, focus will be placed into finding the best possible locations for such a place in Buenos Aires, Argentina, using a diverse set of tools from the data science and machine learning fields. As the target audience is mainly mid-young adults with a middle-to-high income (relative to Argentina’s economy, which won’t be the case for developed economies), proximity to universities, tech-related companies and stores, and other entertainment venues that are popular among the target audience, like traditional bars and cinemas, would probably be convenient, improving chances to maximize potential turnover and revenue.

<H1>Data:</H1>

The coordinates of all neighborhoods will be gather from the website of Buenos Aires' municipal government. Information regarding location of universities is also contained in a dataset in this website. The Foursquare API will be used to query venues and allow meaningful comparison between neighborhoods.

<H1>Methodology:</H1>
    
Both datasets had to be manipulated in order to concatenate them, just keeping the Neighborhood's names and venues's categories for the final analysis. The dataset containing the universities' information had a plethora of useless labels, and the dataset cointaining neighborhoods locations had a plethora of coordinates to account for their shape. Using these coordinates a rectangular area was calculated for each neighborhood, using it's center as the point from which to query venues' information from the Foursquare API, and using a different radius for each one, according to it's size. Not all venues were important for the purpose of the analysis, so some of them were deleted. Finally, venues were grouped into macro-categories by function to summarizes their information in a more simple presentation.

In [29]:
import requests 
import pandas as pd 
import numpy as np 
import json
import math

from IPython.display import Image 
from IPython.core.display import HTML 
    
from pandas.io.json import json_normalize

#!conda install -c conda-forge folium=0.5.0 --yes
#import folium # plotting library

#print('Folium installed')
print('Libraries imported.')

Libraries imported.


In [2]:
#Get the dataset cointaining coordinates for Buenos Aires's Neighborhoods.

!wget --quiet http://cdn.buenosaires.gob.ar/datosabiertos/datasets/barrios/barrios.geojson -O ba_neighborhoods.json

In [3]:
#Open the JSON file and assign it's content to a variable.

with open('ba_neighborhoods.json') as json_data:
    ba_data = json.load(json_data)

   This dataset has multiple coordinates for each neighborhood, because it intends to account for their geometry. For the purpose of this analysis, the gaps between extreme values of longitude and latitude are used to approximate a rectangular area for each neighborhood. With these values the centre of each rectangle is easily calculated. The radius appropriate for every rectangle is calculated using a geometric concept: the diagonal of any quadrilateral can be used as the diameter of a circle so that the rectangle will fit exactly into it. An estimation of 111 km per degree of any coordinate value allows for the calculation of the gaps in terms of kilometers.

   This methodlogy implies many venues will not be uniquely associated with a specific neighborhood, because the area covered by the circle estimated for each neighborhood exceeds the area of it's corresponding rectangle. However, for the purpose of finding the best possible location this is not problematic per se. The issue that may arise is linked to the limit in the calls made to the Foursquare APIs, something important to consider when repeating an refining this analysis.

In [4]:
#Extract a list of neighborhoods from the dataset and create a DataFrame.

list_for_df = []
for i in range(0, len(ba_data['features'])):
    list_for_df.append(ba_data['features'][i]['properties']['barrio'])
    
df_neighborhoods = pd.DataFrame({'Neighborhood': list_for_df, 'Center Latitude':'', 'Center Longitude':'', 'Latitude Gap (in km)': '', 'Longitude Gap (in km)': '', 'Radius for query': ''})
latitude_to_kms = 111
longitude_to_kms = 111
df_neighborhoods

Unnamed: 0,Neighborhood,Center Latitude,Center Longitude,Latitude Gap (in km),Longitude Gap (in km),Radius for query
0,CHACARITA,,,,,
1,PATERNAL,,,,,
2,VILLA CRESPO,,,,,
3,VILLA DEL PARQUE,,,,,
4,ALMAGRO,,,,,
5,CABALLITO,,,,,
6,VILLA SANTA RITA,,,,,
7,MONTE CASTRO,,,,,
8,VILLA REAL,,,,,
9,FLORES,,,,,


In [5]:
#A function is defined to populate the DataFrame with the relevant information.

def populate_neighborhoods(df):

#For every neighborhood, every coordinate is appended to a list to make it easier to calculate the extreme values and the distance between them.    

    for j in range(0, df['Neighborhood'].count()):
        longitude_list = []
        latitude_list = []
        for i in range(0, len(ba_data['features'][j]['geometry']['coordinates'][0])):
            longitude_list.append(ba_data['features'][j]['geometry']['coordinates'][0][i][0])
            latitude_list.append(ba_data['features'][j]['geometry']['coordinates'][0][i][1])

#Populate every neighborhood's center, using minimum values as coordinates for a corner and adding the half of each gap to define the center of the area.  

        long_gap = (max(longitude_list) - min(longitude_list))
        lat_gap = (max(latitude_list) - min(latitude_list))
        df.iloc[j,1] = (min(latitude_list)+0.5*long_gap)
        df.iloc[j,2] =(min(longitude_list)+0.5*lat_gap)
        df.iloc[j,3] = lat_gap*latitude_to_kms
        df.iloc[j,4] = long_gap*longitude_to_kms

#Finally, calculate the radius appropriate to query every neighborhood.        
                
        x = df.iloc[j, 3]*0.5
        y = df.iloc[j, 4]*0.5
        radius = math.sqrt(x**2 + y**2)/2
        df['Radius for query'][j] = radius*1000
        
        df

In [6]:
#Call the function, using the DataFrame previously created as input.

populate_neighborhoods(df_neighborhoods)
df_neighborhoods

Unnamed: 0,Neighborhood,Center Latitude,Center Longitude,Latitude Gap (in km),Longitude Gap (in km),Radius for query
0,CHACARITA,-34.5837,-58.4571,2.16895,3.14046,954.163
1,PATERNAL,-34.594,-58.4699,1.98311,2.50803,799.333
2,VILLA CRESPO,-34.5898,-58.4495,2.10322,3.948,1118.32
3,VILLA DEL PARQUE,-34.5989,-58.4971,2.02325,3.56876,1025.6
4,ALMAGRO,-34.6114,-58.4212,2.70415,2.3771,900.105
5,CABALLITO,-34.6127,-58.4487,3.10419,3.99032,1263.89
6,VILLA SANTA RITA,-34.613,-58.4865,1.83185,2.57126,789.265
7,MONTE CASTRO,-34.614,-58.512,2.14503,3.21903,967.061
8,VILLA REAL,-34.6202,-58.5232,1.84561,1.61712,613.461
9,FLORES,-34.6331,-58.4563,4.76421,5.25889,1774.01


   Proceed to query venues for every neighborhood using it's coordinates and it's corresponding radius, creating a new DataFrame in the process to store all this new information.

In [7]:
# @hidden_cell

CLIENT_ID = 'BIKJD0KWOGCZVJCDTLO4ROBIC5CMNI2M2OIMPRHDI5UDZDYM' 
CLIENT_SECRET = 'PB4Y0DTWKU5NGYTUS30OUCDLQ2LSC55Y4C4CU1C4LUB0JNOJ'
VERSION = '20180604'
LIMIT = 30
print('Client ID and Client Secret for the Foursquare API stored')

Client ID and Client Secret for the Foursquare API stored


In [8]:
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [9]:
#Function to populate the DataFrame with the venues' information.

def getNearbyVenues(names, latitudes, longitudes, radius):
    
    venues_list=[]
    for name, lat, lng, radius in zip(names, latitudes, longitudes, radius):
            
#Create the API request URL.

        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, CLIENT_SECRET, VERSION, lat, lng, radius, LIMIT)
            
#Make the GET request.

        results = requests.get(url).json()["response"]['groups'][0]['items']
        
#Return only relevant information for each nearby venue.

        venues_list.append([(name, lat, lng, v['venue']['name'], v['venue']['location']['distance'], v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 'Neighborhood Latitude', 'Neighborhood Longitude', 'Venue', 'Distance to Centroid in m', 'Category']
    
    return(nearby_venues)

In [10]:
ba_venues = getNearbyVenues(names=df_neighborhoods['Neighborhood'],
                                   latitudes=df_neighborhoods['Center Latitude'],
                                   longitudes=df_neighborhoods['Center Longitude'],
                                   radius=df_neighborhoods['Radius for query']
                                  )

ba_venues

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Distance to Centroid in m,Category
0,CHACARITA,-34.583689,-58.457058,Bar Palacio (Museo Fotográfico Simik),336,Café
1,CHACARITA,-34.583689,-58.457058,Club Chacarita,536,Pool
2,CHACARITA,-34.583689,-58.457058,Espacio Cultural Carlos Gardel,580,Cultural Center
3,CHACARITA,-34.583689,-58.457058,Roll'in Luí Alimentos,700,Vegetarian / Vegan Restaurant
4,CHACARITA,-34.583689,-58.457058,La Mezzetta,695,Pizza Place
5,CHACARITA,-34.583689,-58.457058,Les Croquants,709,Pastry Shop
6,CHACARITA,-34.583689,-58.457058,El Imperio de la Pizza,404,Pizza Place
7,CHACARITA,-34.583689,-58.457058,Sede Central de Whisky,683,Whisky Bar
8,CHACARITA,-34.583689,-58.457058,Strange Brewing,699,Beer Bar
9,CHACARITA,-34.583689,-58.457058,Fábrica de Churros Olleros,482,Bakery


It's time to incorporate the locations of the universities found in the dataset provided by the Buenos Aires' government.

In [11]:
#Get the dataset cointaining coordinates for Buenos Aires's colleges.

!wget --quiet http://cdn.buenosaires.gob.ar/datosabiertos/datasets/universidades/universidades.csv -O ba_colleges.csv

In [12]:
with open('ba_colleges.csv') as colleges_data:
    ba_colleges = pd.read_csv(colleges_data)
ba_colleges  

Unnamed: 0,regimen,universida,univ_c,unidad_aca,unac_c,anexo_c,unicue,cui,telef,fax,...,direccion_norm,calle,altura,WKT_gkba,barrio,comuna,codigo_postal,codigo_postal_argentino,long,lat
0,Privado,FLACSO (Facultad Latinoamericana de Ciencias S...,1,Rectorado,1,0,100100,2546,5238-9339 (C)5238-9300,4375-1373,...,Ayacucho 555,Ayacucho,555,POINT (-58.3953412190814 -34.6026584750106),Balvanera,Comuna 3,1026.0,C1026AAC,-58.395341,-34.602658
1,Privado,Instituto Tecnológico de Buenos Aires,2,Rectorado,1,0,200100,2456,63934813 [C]6393-4810,6393-4813,...,Av. Eduardo Madero 399,Av. Eduardo Madero,399,POINT (-58.3678488464698 -34.6031162314776),Puerto Madero,Comuna 1,1106.0,C1106ACD,-58.367849,-34.603116
2,Privado,Instituto Tecnológico de Buenos Aires,2,Departamento de Investigación,7,0,200700,2456,6393-4869,6393-4869,...,Av. Eduardo Madero 399,Av. Eduardo Madero,399,POINT (-58.3678488464698 -34.6031162314776),Puerto Madero,Comuna 1,1106.0,C1106ACD,-58.367849,-34.603116
3,Privado,Instituto Tecnológico de Buenos Aires,2,Área de Energía,40,0,204000,2456,6393-4870,6393-4870,...,Av. Eduardo Madero 399,Av. Eduardo Madero,399,POINT (-58.3678488464698 -34.6031162314776),Puerto Madero,Comuna 1,1106.0,C1106ACD,-58.367849,-34.603116
4,Privado,Instituto Tecnológico de Buenos Aires,2,Departamento de Bioingeniería,87,0,208700,2456,6393-4893,6393-4893,...,Av. Eduardo Madero 399,Av. Eduardo Madero,399,POINT (-58.3678488464698 -34.6031162314776),Puerto Madero,Comuna 1,1106.0,C1106ACD,-58.367849,-34.603116
5,Privado,Instituto Tecnológico de Buenos Aires,2,Departamento de Desarrollo Humano,103,0,210300,2456,6393-4886,6393-4886,...,Av. Eduardo Madero 399,Av. Eduardo Madero,399,POINT (-58.3678488464698 -34.6031162314776),Puerto Madero,Comuna 1,1106.0,C1106ACD,-58.367849,-34.603116
6,Privado,Instituto Tecnológico de Buenos Aires,2,Departamento de Economía y Desarrollo Profesional,109,0,210900,2456,6393-4877,6393-4877,...,Av. Eduardo Madero 399,Av. Eduardo Madero,399,POINT (-58.3678488464698 -34.6031162314776),Puerto Madero,Comuna 1,1106.0,C1106ACD,-58.367849,-34.603116
7,Privado,Instituto Tecnológico de Buenos Aires,2,Departamento de Extensión Universitaria y Vicu...,112,0,211200,2456,6393-5984,6393-5984,...,Av. Eduardo Madero 399,Av. Eduardo Madero,399,POINT (-58.3678488464698 -34.6031162314776),Puerto Madero,Comuna 1,1106.0,C1106ACD,-58.367849,-34.603116
8,Privado,Instituto Tecnológico de Buenos Aires,2,Departamento de Física,114,0,211400,2456,6393-5867,6393-5867,...,Av. Eduardo Madero 399,Av. Eduardo Madero,399,POINT (-58.3678488464698 -34.6031162314776),Puerto Madero,Comuna 1,1106.0,C1106ACD,-58.367849,-34.603116
9,Privado,Instituto Tecnológico de Buenos Aires,2,Departamento de Ingeniería Eléctrica,120,0,212000,2456,6393-4829,6393-4829,...,Av. Eduardo Madero 399,Av. Eduardo Madero,399,POINT (-58.3678488464698 -34.6031162314776),Puerto Madero,Comuna 1,1106.0,C1106ACD,-58.367849,-34.603116


In [13]:
#Remove columns with irrelevant information.

columns_names = ['University', 'Neighborhood', 'Longitude', 'Latitude']
columns_to_drop = ['regimen', 'univ_c', 'unidad_aca', 'unac_c', 'anexo_c', 'unicue', 'cui', 'telef', 'fax', 'web', 'direccion_norm', 'calle', 'altura', 'WKT_gkba', 'comuna', 'codigo_postal', 'codigo_postal_argentino']
ba_colleges.drop(columns=columns_to_drop, axis=1, inplace=True)
ba_colleges.columns = columns_names
ba_colleges

Unnamed: 0,University,Neighborhood,Longitude,Latitude
0,FLACSO (Facultad Latinoamericana de Ciencias S...,Balvanera,-58.395341,-34.602658
1,Instituto Tecnológico de Buenos Aires,Puerto Madero,-58.367849,-34.603116
2,Instituto Tecnológico de Buenos Aires,Puerto Madero,-58.367849,-34.603116
3,Instituto Tecnológico de Buenos Aires,Puerto Madero,-58.367849,-34.603116
4,Instituto Tecnológico de Buenos Aires,Puerto Madero,-58.367849,-34.603116
5,Instituto Tecnológico de Buenos Aires,Puerto Madero,-58.367849,-34.603116
6,Instituto Tecnológico de Buenos Aires,Puerto Madero,-58.367849,-34.603116
7,Instituto Tecnológico de Buenos Aires,Puerto Madero,-58.367849,-34.603116
8,Instituto Tecnológico de Buenos Aires,Puerto Madero,-58.367849,-34.603116
9,Instituto Tecnológico de Buenos Aires,Puerto Madero,-58.367849,-34.603116


The dataset lists all departments for every university, but for a sizeble portion of them, most of this deparments are distributed among a couple or even a single building. It would be simpler to just take one row per building, although some information get lost in this process, as bigger universities tend to have more departments per building. This is something that can be incorporated as well.

In [14]:
#Remove the indexes of repetead coordinates.

ba_colleges_sorted = ba_colleges.sort_values('Longitude')
ba_colleges_sorted.reset_index(drop=True, inplace=True)
ba_colleges_sorted
index_to_drop = []
for i in range(0, len(ba_colleges_sorted)-1):
    if ba_colleges_sorted['Longitude'][i] == ba_colleges_sorted['Longitude'][i+1] and ba_colleges_sorted['Latitude'][i] == ba_colleges_sorted['Latitude'][i+1]:
        index_to_drop.append(i+1)

ba_colleges_sorted.drop(index=(index_to_drop), axis=0, inplace=True)
ba_colleges_sorted

Unnamed: 0,University,Neighborhood,Longitude,Latitude
0,Universidad de Buenos Aires,Villa Devoto,-58.512286,-34.597359
1,Pontificia Universidad Católica Argentina Sant...,Villa Devoto,-58.508397,-34.593351
2,Instituto Universitario CEMIC,Saavedra,-58.495122,-34.556884
3,Instituto Universitario CEMIC,Saavedra,-58.494884,-34.554937
5,Universidad de Morón,Villa Urquiza,-58.493619,-34.576934
6,Universidad de Buenos Aires,Agronomia,-58.482395,-34.597846
8,Universidad de Buenos Aires,Villa Urquiza,-58.478376,-34.567911
9,Universidad de Buenos Aires,Agronomia,-58.477754,-34.595813
10,Universidad de Flores,Flores,-58.470794,-34.628707
11,Universidad Tecnológica Nacional,Villa Lugano,-58.469309,-34.659459


In [15]:
#Prepare both DataFrames for future concatenation.

for i, item in enumerate(ba_colleges_sorted['Neighborhood']):
    ba_colleges_sorted.iloc[i, 1] = item.upper()
ba_colleges_sorted.drop(columns=['Latitude', 'Longitude'], inplace=True)
ba_colleges_sorted = ba_colleges_sorted[['Neighborhood', 'University']]
ba_colleges_sorted.columns = ['Neighborhood', 'Category']
ba_venues.drop(columns=['Neighborhood Latitude', 'Neighborhood Longitude', 'Venue', 'Distance to Centroid in m'], inplace=True)

In [16]:
#Concatenate both DataFrame.

venues_final = pd.concat([ba_venues, ba_colleges_sorted], sort=False)
venues_final.reset_index(drop=True, inplace=True)
venues_final

Unnamed: 0,Neighborhood,Category
0,CHACARITA,Café
1,CHACARITA,Pool
2,CHACARITA,Cultural Center
3,CHACARITA,Vegetarian / Vegan Restaurant
4,CHACARITA,Pizza Place
5,CHACARITA,Pastry Shop
6,CHACARITA,Pizza Place
7,CHACARITA,Whisky Bar
8,CHACARITA,Beer Bar
9,CHACARITA,Bakery


At this point it's time to perform an exploratory analysis of this new data to wrangle it.

In [17]:
venues_grouped = venues_final.groupby(by='Category').count().sort_values(by='Neighborhood', ascending=False)
venues_grouped[venues_grouped['Neighborhood'] > 10]

Unnamed: 0_level_0,Neighborhood
Category,Unnamed: 1_level_1
Argentinian Restaurant,101
Pizza Place,81
Ice Cream Shop,74
Café,66
Bakery,51
Plaza,45
Coffee Shop,44
BBQ Joint,43
Bus Stop,31
Restaurant,26


In [18]:
#Create a list of keyword to filter venues by relevant categories.

list_for_filtering = ('Restaurant', 'Shop', 'Store', 'Mall', 'Club', 'Bar', 'Center', 'Brewery', 'Theater', 'Place', 'Joint', 'Diner', 'Station', 'Spot', 'Steakhouse', 'Instituto', 
                      'Arcade', 'Beer', 'Stop', 'Hostel', 'Hotel', 'Play', 'Motel', 'Pub', 'Multiplex', 'Cafeteria', 'Venue', 'Entertainment', 'Café', 'FLACSO', 'Universidad')

filtered_index = set({})
for item in list_for_filtering:
    for i, venue in enumerate(venues_final['Category']):
        if item in venue:
            filtered_index.add(i)

In [19]:
#Keep only the venues related to the keywords.

venues_filtered = venues_final.iloc[list(filtered_index), :]
venues_filtered_grouped = venues_filtered.groupby(by='Category').count().sort_values(by='Neighborhood', ascending=False)
venues_filtered_grouped[venues_filtered_grouped['Neighborhood'] > 10]

Unnamed: 0_level_0,Neighborhood
Category,Unnamed: 1_level_1
Argentinian Restaurant,101
Pizza Place,81
Ice Cream Shop,74
Café,66
Coffee Shop,44
BBQ Joint,43
Bus Stop,31
Restaurant,26
Italian Restaurant,24
Gym / Fitness Center,20


In [20]:
#Get all the categories to refine and simplify the information. For example, for the purpose of this analysis, all places that only provide a meal can be grouped together.

filtered_categories = set(venues_filtered['Category'])
filtered_categories

{'American Restaurant',
 'Arcade',
 'Argentinian Restaurant',
 'Arts & Crafts Store',
 'Arts & Entertainment',
 'Automotive Shop',
 'BBQ Joint',
 'Bagel Shop',
 'Bar',
 'Beer Bar',
 'Big Box Store',
 'Bike Shop',
 'Breakfast Spot',
 'Brewery',
 'Burger Joint',
 'Bus Station',
 'Bus Stop',
 'Cafeteria',
 'Café',
 'Cheese Shop',
 'Chinese Restaurant',
 'Chocolate Shop',
 'Clothing Store',
 'Cocktail Bar',
 'Coffee Shop',
 'Comfort Food Restaurant',
 'Comic Shop',
 'Convenience Store',
 'Convention Center',
 'Cosmetics Shop',
 'Cultural Center',
 'Department Store',
 'Dessert Shop',
 'Diner',
 'Eastern European Restaurant',
 'Electronics Store',
 'Empanada Restaurant',
 'English Restaurant',
 'FLACSO (Facultad Latinoamericana de Ciencias Sociales)',
 'Fast Food Restaurant',
 'Flower Shop',
 'Food & Drink Shop',
 'French Restaurant',
 'Furniture / Home Store',
 'Gas Station',
 'German Restaurant',
 'Gift Shop',
 'Gourmet Shop',
 'Grocery Store',
 'Gym / Fitness Center',
 'Hardware Store',


In [21]:
#Create a function to make macro-categories.

def categorize_venues (df, category_set, category_list):
    for category in df['Category']:
        for item in category_list:
            if item in category:
                category_set.add(category)

In [22]:
#Create the sets and lists required to apply the function.

restaurants = set({})
restaurants_list = ['Restaurant', 'Joint', 'Steakhouse','Diner', 'Place', 'Spot']
stores = set({})
stores_list = ['Store', 'Mall', 'Comic Shop']
bars = set({})
bars_list = ['Bar', 'Brewery', 'Pub', 'Club']
dessert_shops = set({})
dessert_list = ['Ice Cream', 'Dessert']
coffee_shops = set({})
coffee_list = ['Coffee', 'Cafeteria', 'Café']
stops_stations = set({})
stations_list = ['Station', 'Stop']
entertainment = set({})
entertainment_list = ['Entertainment', 'Venue', 'Multiplex', 'Arcade', 'Play', 'Center', 'Theater']
accommodation = set({})
accommodation_list = ['Hotel', 'Hostel', 'Motel']
universities = set({})
universities_list = ['Universidad', 'FLACSO', 'Instituto']
macro_categories = [restaurants, bars, dessert_shops, coffee_shops, stops_stations, entertainment, accommodation, stores, universities]
categories_lists = [restaurants_list, bars_list, dessert_list, coffee_list, stations_list, entertainment_list, accommodation_list, stores_list, universities_list]

In [23]:
#Call the function to construct each macro-category.

for i in range(0, len(macro_categories)):
    category = macro_categories[i]
    category_list = categories_lists[i]
    categorize_venues(venues_filtered, category, category_list)

In [24]:
#After analyzing each set it's clear some categories should be deleted for consistency.

stops_stations.remove('Gas Station')
entertainment.remove('Gym / Fitness Center')
bars.remove('Hotel Bar')
bars.remove('Instituto Universitario de Ciencias de la Salud de la Fundación Barceló')
universities.add('Instituto Universitario de Ciencias de la Salud de la Fundación Barceló')
list_to_drop = ['Shopping Mall', 'Electronics Store', 'Hardware Store', 'Comic Shop']
venues_to_drop = set({})
for item in stores:
    if item not in list_to_drop:
        venues_to_drop.add(item)
    
for item in bars:
    if item == 'Sports Club' or item == 'Salon / Barbershop':
        venues_to_drop.add(item)
        
venues_filtered.reset_index(drop=True, inplace=True)
index_to_drop = []
        
for i, item in enumerate(venues_filtered['Category']):
    if item in venues_to_drop:
        index_to_drop.append(i)
        
venues_filtered.drop(index=(index_to_drop), axis=0, inplace=True)
venues_filtered.reset_index(drop=True, inplace=True)

gym_index = list(venues_filtered[venues_filtered['Category'] =='Gym / Fitness Center'].index)
venues_filtered.drop(index=(gym_index), axis=0, inplace=True)
venues_filtered

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  errors=errors)


Unnamed: 0,Neighborhood,Category
0,CHACARITA,Café
1,CHACARITA,Cultural Center
2,CHACARITA,Vegetarian / Vegan Restaurant
3,CHACARITA,Pizza Place
4,CHACARITA,Pastry Shop
5,CHACARITA,Pizza Place
6,CHACARITA,Whisky Bar
7,CHACARITA,Beer Bar
8,CHACARITA,German Restaurant
9,CHACARITA,Venezuelan Restaurant


Replacing each category with it's corresponding macro-category would make it easier to summarize all this information in a single and easy to understand DataFrame.

In [25]:
#Recategorize venues with representative keywords.

def apply_macro_category(df, category, name):
    df['Category'].replace(to_replace=list(category), value=name, inplace=True)
            
categories_names = ['Just Meals', 'Bars', 'Dessert Shops', 'Coffee Shops', 'Transportation', 'Entertainment', 'Accommodation', 'Stores', 'Universities']    

for i in range(0, len(categories_names)):
    category = macro_categories[i]
    name = categories_names[i]
    apply_macro_category(venues_filtered, category, name)
    
venues_filtered.head()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._update_inplace(new_data)


Unnamed: 0,Neighborhood,Category
0,CHACARITA,Coffee Shops
1,CHACARITA,Entertainment
2,CHACARITA,Just Meals
3,CHACARITA,Just Meals
4,CHACARITA,Pastry Shop


In [30]:
#Group by macro-category to have a final look at the table.

venues_grouped = venues_filtered.groupby(by='Category').count().sort_values(by='Neighborhood', ascending=False)
venues_grouped.head(25)

Unnamed: 0_level_0,Neighborhood
Category,Unnamed: 1_level_1
Just Meals,412
Universities,134
Coffee Shops,111
Dessert Shops,82
Transportation,45
Bars,43
Entertainment,35
Accommodation,28


In [27]:
#Finally remove miscellaneous venues from the DataFrame.

index = list(venues_grouped[venues_grouped['Neighborhood'] < 22].index)
index_to_drop = []
venues_filtered.reset_index(drop=True, inplace=True)
for item in index:
    for i, venue in enumerate(venues_filtered['Category']):
        if item == venue:
            index_to_drop.append(i)
            
venues_filtered.drop(index=index_to_drop, axis=0, inplace=True)
venues_grouped = venues_filtered.groupby(by='Category').count().sort_values(by='Neighborhood', ascending=False)
venues_grouped

Unnamed: 0_level_0,Neighborhood
Category,Unnamed: 1_level_1
Just Meals,412
Universities,134
Coffee Shops,111
Dessert Shops,82
Transportation,45
Bars,43
Entertainment,35
Accommodation,28


The final step is to group by neighborhood and have a glance of how many venues of each macro-category there are in each neighborhood. A ranking could be useful to separate the top five neighborhoods by total relevant venues. In addition, it's possible to divide the sum of the venues of each neighborhood by it's area, as to have both and absolute and relative indicator. It is also possible to figure out which macro-category is more common in each neighborhood, in the case some of them could be considered of higher priority.

In [28]:
venues_grouped_absolute = venues_filtered.groupby(by='Neighborhood').count().sort_values(by='Category', ascending=False)
venues_grouped_absolute.head(10)

Unnamed: 0_level_0,Category
Neighborhood,Unnamed: 1_level_1
BALVANERA,46
MONSERRAT,35
ALMAGRO,32
SAN NICOLAS,31
SAN TELMO,27
RETIRO,27
RECOLETA,27
PUERTO MADERO,26
MONTE CASTRO,25
AGRONOMIA,23


In [31]:
ba_onehot = pd.get_dummies(venues_filtered[['Category']], prefix="", prefix_sep="")
ba_onehot['Neighborhood'] = venues_filtered['Neighborhood'] 

fixed_columns = [ba_onehot.columns[-1]] + list(ba_onehot.columns[:-1])
ba_onehot = ba_onehot[fixed_columns]

ba_onehot.head(10)

Unnamed: 0,Neighborhood,Accommodation,Bars,Coffee Shops,Dessert Shops,Entertainment,Just Meals,Transportation,Universities
0,CHACARITA,0,0,1,0,0,0,0,0
1,CHACARITA,0,0,0,0,1,0,0,0
2,CHACARITA,0,0,0,0,0,1,0,0
3,CHACARITA,0,0,0,0,0,1,0,0
5,CHACARITA,0,0,0,0,0,1,0,0
6,CHACARITA,0,1,0,0,0,0,0,0
7,CHACARITA,0,1,0,0,0,0,0,0
8,CHACARITA,0,0,0,0,0,1,0,0
9,CHACARITA,0,0,0,0,0,1,0,0
10,CHACARITA,0,1,0,0,0,0,0,0


In [34]:
ba_grouped = ba_onehot.groupby('Neighborhood').sum().reset_index()
ba_grouped

Unnamed: 0,Neighborhood,Accommodation,Bars,Coffee Shops,Dessert Shops,Entertainment,Just Meals,Transportation,Universities
0,AGRONOMIA,0,0,4,6,0,9,2,2
1,ALMAGRO,2,2,2,3,2,12,0,9
2,BALVANERA,1,3,5,1,1,11,0,24
3,BARRACAS,0,0,5,1,1,8,0,4
4,BELGRANO,0,1,2,2,0,11,0,5
5,BOCA,0,0,1,0,1,17,0,0
6,BOEDO,0,0,2,2,0,14,2,0
7,CABALLITO,0,1,3,1,0,7,0,7
8,CHACARITA,0,4,1,1,3,10,0,0
9,COGHLAN,0,1,2,1,0,5,2,0


In [77]:
categories = list(ba_grouped.columns)
categories.remove('Neighborhood')
for item in (categories):
    max_of_category = max(ba_grouped[item])
    neighborhood_max = ba_grouped[ba_grouped[item]==max_of_category][item]
    print('The maximum amount of', item, 'is', max_of_category, 'in', list(ba_grouped[ba_grouped[item]==max_of_category]['Neighborhood']))      

The maximum amount of Accommodation is 9 in ['PUERTO MADERO']
The maximum amount of Bars is 4 in ['CHACARITA']
The maximum amount of Coffee Shops is 8 in ['MONTE CASTRO']
The maximum amount of Dessert Shops is 6 in ['AGRONOMIA']
The maximum amount of Entertainment is 8 in ['MONSERRAT']
The maximum amount of Just Meals is 17 in ['BOCA']
The maximum amount of Transportation is 6 in ['CONSTITUCION']
The maximum amount of Universities is 24 in ['BALVANERA']


<H1>Discussion:</H1>

According to this analysis, the most promising location is probably Balvanera, because there is a considerable concentration of the universities of the city, hosting around 18% of them (for comparison, the area with the maximum amount of restaurants and coffee shops only contained the 4% and 7% of them, respectively). This may have acted like a driver to open bars and restaurants in the area, trying to take advantage of the flow of people this univerisities generated. In this regard, the quantity of young people coming to this neighborhood could bolster the turnover of people into a gaming bar in this area.

<H1>Conclussion:</H1>

As shown by this analysis, there are some areas in Buenos Aires City with interesting conditions for this kind of initiative, considering the amount of young people there are and how concentrated their point of interest are.