## Week 4. The Battle of the Neighborhoods.

### Part 1. Background.

Central America (**CAm**) is the natural land bridge between North and South America. After a century of dictatorships, civil wars, and political unrest, the outlook for this part of the world has improved (see note): over the last decade, **CAm** has shown higher *average* economic growth than its Latin American neighbors to the south and the one to the north ([1](https://www.mckinsey.com/featured-insights/americas/unlocking-the-economic-potential-of-central-america-and-the-caribbean)). Although industry is still a big part of these economies (especially in Guatemala and Panama), a significant share of each **CAm** economy depends on internal markets, either regional or country-specific. This includes restaurants, malls, and tourism, which raises two questions. Do all the countries offer the same kinds of venues, or do some have venues that others lack? And where the venues are the same, do customer reviews differ by country?

This information will be useful for preliminary market research, because it shows where there is demand for a given service and where that service is rated well (which lets us investigate further what drives the better scores).


Note: Signs of authoritarianism are now blatant in most of CAm (Guatemala, Honduras, El Salvador, Nicaragua), which might affect these economies in the near future.

### Part 2. Problem.

* Do Central American countries offer the same types of venues, or do some countries have distinctive ones?

* If they offer the same venues, do the reviews differ by country?

### Part 3. Data Description.

In [1]:
import pandas as pd

I will be using the Foursquare data on the Central American capitals. This includes:

* Belmopán (**Belize**), 
* Guatemala City (**Guatemala**), 
* San Salvador (**El Salvador**), 
* Tegucigalpa (**Honduras**), 
* Managua (**Nicaragua**), 
* San José (**Costa Rica**), 
* and Panamá City (**Panamá**). 

In [5]:
geo_capitals = pd.read_csv("./capitals-geolocation/concap.csv")

# Select only those that are marked as being in Central America
geo_capitals[geo_capitals['ContinentName'] == 'Central America'].head() # Only first 5

Unnamed: 0,CountryName,CapitalName,CapitalLatitude,CapitalLongitude,CountryCode,ContinentName
29,Belize,Belmopan,17.25,-88.766667,BZ,Central America
45,Canada,Ottawa,45.416667,-75.7,CA,Central America
59,Costa Rica,San Jose,9.933333,-84.083333,CR,Central America
72,El Salvador,San Salvador,13.7,-89.2,SV,Central America
90,Greenland,Nuuk,64.183333,-51.75,GL,Central America


Thanks to the Kaggle user [*Grecnik*](https://www.kaggle.com/nikitagrec) for the geolocation data on the capitals of the world.

As we can see, although the countries listed above are present, the data also labels countries such as Canada and Greenland as Central American, which they are not, so we'll have to clean that up. It is also worth checking that the geolocation data is correct, since we will use it to draw some Folium maps.

In [8]:
central_geo = geo_capitals[geo_capitals['ContinentName'] == 'Central America'].copy(deep=True)
central_geo

Unnamed: 0,CountryName,CapitalName,CapitalLatitude,CapitalLongitude,CountryCode,ContinentName
29,Belize,Belmopan,17.25,-88.766667,BZ,Central America
45,Canada,Ottawa,45.416667,-75.7,CA,Central America
59,Costa Rica,San Jose,9.933333,-84.083333,CR,Central America
72,El Salvador,San Salvador,13.7,-89.2,SV,Central America
90,Greenland,Nuuk,64.183333,-51.75,GL,Central America
93,Guatemala,Guatemala City,14.616667,-90.516667,GT,Central America
100,Honduras,Tegucigalpa,14.1,-87.216667,HN,Central America
142,Mexico,Mexico City,19.433333,-99.133333,MX,Central America
156,Nicaragua,Managua,12.133333,-86.25,NI,Central America
166,Panama,Panama City,8.966667,-79.533333,PA,Central America


In [9]:
central_geo.drop(index=[45, 90, 142, 183, 184, 227], inplace=True)
central_geo

Unnamed: 0,CountryName,CapitalName,CapitalLatitude,CapitalLongitude,CountryCode,ContinentName
29,Belize,Belmopan,17.25,-88.766667,BZ,Central America
59,Costa Rica,San Jose,9.933333,-84.083333,CR,Central America
72,El Salvador,San Salvador,13.7,-89.2,SV,Central America
93,Guatemala,Guatemala City,14.616667,-90.516667,GT,Central America
100,Honduras,Tegucigalpa,14.1,-87.216667,HN,Central America
156,Nicaragua,Managua,12.133333,-86.25,NI,Central America
166,Panama,Panama City,8.966667,-79.533333,PA,Central America
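
Dropping rows by index label works, but it depends on the row labels of this particular CSV. A possibly more robust alternative, sketched below, is to keep only the rows whose `CountryCode` belongs to the seven countries listed earlier. The `CAM_CODES` name and the small inline `sample` frame are assumptions for illustration; the column names and values are copied from the output above.

```python
import pandas as pd

# ISO codes of the seven countries listed earlier (hypothetical constant name).
CAM_CODES = {"BZ", "GT", "SV", "HN", "NI", "CR", "PA"}

# Small stand-in for geo_capitals, using a few rows from the output above.
sample = pd.DataFrame(
    {
        "CountryName": ["Belize", "Canada", "Costa Rica", "Greenland", "Panama"],
        "CapitalName": ["Belmopan", "Ottawa", "San Jose", "Nuuk", "Panama City"],
        "CountryCode": ["BZ", "CA", "CR", "GL", "PA"],
        "ContinentName": ["Central America"] * 5,
    },
    index=[29, 45, 59, 90, 166],
)

# Keep only the Central American countries, regardless of their row labels.
central_sample = sample[sample["CountryCode"].isin(CAM_CODES)].copy()
print(central_sample["CapitalName"].tolist())
```

Filtering by code means the cleanup still works if the CSV is reordered or gains rows, which a hard-coded list of index labels would not.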


With this information we'll use the ***Foursquare API*** to get the venues in each city, together with **the rating of each venue**. Since a free account only allows 50 premium calls per day, which may not be enough to collect everything in one day, the data acquired so far will be stored in a CSV file with the help of the *Pandas* library.
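
The caching idea can be sketched as follows. This is a minimal illustration, not the notebook's actual code: `load_or_fetch`, `fake_fetch`, and the file name `ca_venues_cache.csv` are hypothetical, and `fake_fetch` stands in for the real API calls so that the example runs offline.

```python
import os
import tempfile

import pandas as pd

calls = []  # records how many times the (fake) API was hit


def fake_fetch():
    # Placeholder for the real API requests; returns two illustrative rows.
    calls.append(1)
    return pd.DataFrame(
        {
            "City": ["Belmopan", "Belmopan"],
            "Venue": ["Moon Clusters", "Bull Frog Inn"],
            "Venue Category": ["Café", "Hotel"],
        }
    )


def load_or_fetch(csv_path, fetch_fn):
    """Return the cached venues if csv_path exists; otherwise fetch and cache them."""
    if os.path.exists(csv_path):
        return pd.read_csv(csv_path)
    venues = fetch_fn()
    # index=False avoids writing the row index as an extra "Unnamed: 0" column.
    venues.to_csv(csv_path, index=False)
    return venues


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "ca_venues_cache.csv")
    first = load_or_fetch(path, fake_fetch)   # calls fake_fetch and writes the file
    second = load_or_fetch(path, fake_fetch)  # reads the file, no new fetch

print(len(calls), second.columns.tolist())
```

With this pattern, rerunning the notebook on a later day reuses the stored rows instead of spending premium calls again.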

In [10]:
# The final DataFrame should look something like this, plus an average score for each venue
ca_venues = pd.read_csv("ca_venues.csv")
ca_venues.head()

Unnamed: 0.1,Unnamed: 0,City,City Latitude,City Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,0,Belmopan,17.25,-88.766667,Moon Clusters,17.25041,-88.764992,Café
1,1,Belmopan,17.25,-88.766667,Bull Frog Inn,17.251791,-88.764494,Hotel
2,2,Belmopan,17.25,-88.766667,KenMar's Bed & Breakfast,17.252101,-88.764903,Bed & Breakfast
3,3,Belmopan,17.25,-88.766667,Isidoro Beaton Football Stadium,17.251446,-88.763353,Soccer Stadium
4,4,Belmopan,17.25,-88.766667,Betty’s Fast Food,17.251902,-88.763106,Fast Food Restaurant


In [17]:
# Here we can see the different venue categories found in the capitals
pd.pivot_table(ca_venues, 
               columns='Venue Category', 
               index='City', 
               aggfunc='count')

Unnamed: 0_level_0,City Latitude,City Latitude,City Latitude,City Latitude,City Latitude,City Latitude,City Latitude,City Latitude,City Latitude,City Latitude,...,Venue Longitude,Venue Longitude,Venue Longitude,Venue Longitude,Venue Longitude,Venue Longitude,Venue Longitude,Venue Longitude,Venue Longitude,Venue Longitude
Venue Category,Art Gallery,Asian Restaurant,Bakery,Bar,Bed & Breakfast,Big Box Store,Boutique,Breakfast Spot,Brewery,Burger Joint,...,Restaurant,Sandwich Place,Scenic Lookout,Seafood Restaurant,Snack Place,Soccer Stadium,Sports Bar,Steakhouse,Theater,Vegetarian / Vegan Restaurant
City,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
Belmopan,,,,,1.0,,,,,,...,,,,,,1.0,,,,
Guatemala City,2.0,1.0,,1.0,,,,1.0,1.0,2.0,...,2.0,1.0,,1.0,,,,1.0,,1.0
Managua,,,,,,1.0,,2.0,,,...,,,,,,,1.0,,,
Panama City,,,,,,,,,,,...,,,1.0,,,,,,1.0,
San Jose,,,1.0,1.0,,,1.0,,,1.0,...,3.0,4.0,,,2.0,,,,1.0,
San Salvador,,,,,,,,,,,...,,,,,1.0,,,,,
Tegucigalpa,,,,,,,,,,,...,1.0,,,,,,,,,
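
Because no `values` column is passed, the pivot table above repeats the same counts under every remaining column (`City Latitude` through `Venue Longitude`). A more compact view, sketched here as one possible alternative, is `pd.crosstab`, which returns a single count per city and category and fills missing combinations with 0. The `sample` frame is a small illustrative subset, not the full `ca_venues` data.

```python
import pandas as pd

# Illustrative subset with only the two columns the table needs.
sample = pd.DataFrame(
    {
        "City": [
            "Belmopan", "Belmopan",
            "Guatemala City", "Guatemala City", "Guatemala City",
            "San Jose", "San Jose",
        ],
        "Venue Category": [
            "Bed & Breakfast", "Soccer Stadium",
            "Art Gallery", "Art Gallery", "Bar",
            "Bakery", "Bar",
        ],
    }
)

# One row per city, one column per category, integer counts.
counts = pd.crosstab(sample["City"], sample["Venue Category"])
print(counts)
```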


In [23]:
# And here we can see which city has the most venues (this counts rows, not distinct categories)
pd.pivot_table(ca_venues, 
               values='Venue Category', 
               index='City', 
               aggfunc='count').sort_values(by='Venue Category',
                                           ascending=False)

Unnamed: 0_level_0,Venue Category
City,Unnamed: 1_level_1
San Jose,39
Guatemala City,26
San Salvador,11
Panama City,9
Managua,8
Belmopan,6
Tegucigalpa,3
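
Note that `aggfunc='count'` on `Venue Category` counts venues (rows) per city, so a city with many venues of one category would still rank high. To measure variety, the number of distinct categories can be computed with `nunique`. The sketch below shows both side by side on a small illustrative `sample` frame (an assumption, not the real data); the output column names `venues` and `distinct_categories` are also chosen here for illustration.

```python
import pandas as pd

# Illustrative subset: Guatemala City has the most venues,
# but San Jose has the most distinct categories.
sample = pd.DataFrame(
    {
        "City": [
            "Belmopan", "Belmopan",
            "Guatemala City", "Guatemala City", "Guatemala City", "Guatemala City",
            "San Jose", "San Jose", "San Jose",
        ],
        "Venue Category": [
            "Bed & Breakfast", "Soccer Stadium",
            "Art Gallery", "Art Gallery", "Burger Joint", "Burger Joint",
            "Bakery", "Bar", "Boutique",
        ],
    }
)

# Named aggregation: total venues versus distinct categories per city.
summary = (
    sample.groupby("City")["Venue Category"]
    .agg(venues="count", distinct_categories="nunique")
    .sort_values(by="distinct_categories", ascending=False)
)
print(summary)
```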


Once we have the entire dataset, we'll use a clustering algorithm to find which countries resemble each other in the venues they offer. After that, we'll use a classification algorithm to see whether the average venue score per country is indicative of anything.
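
The text does not say which clustering algorithm will be used, so the following only sketches a common preparatory step under that uncertainty: turning the venue list into one numeric profile per city, which most clustering algorithms could then take as input. The `sample` frame and the `profiles` name are assumptions for illustration.

```python
import pandas as pd

# Illustrative subset with only the two columns needed for the profiles.
sample = pd.DataFrame(
    {
        "City": [
            "Belmopan", "Belmopan",
            "Guatemala City", "Guatemala City", "Guatemala City",
            "San Jose", "San Jose",
        ],
        "Venue Category": [
            "Bed & Breakfast", "Soccer Stadium",
            "Art Gallery", "Art Gallery", "Bar",
            "Bakery", "Bar",
        ],
    }
)

# One-hot encode the category: one column per category, 1.0 where it applies.
onehot = pd.get_dummies(sample["Venue Category"], dtype=float)
onehot.insert(0, "City", sample["City"])

# Averaging per city gives the share of each category among that city's venues.
profiles = onehot.groupby("City").mean()
print(profiles.round(2))
```

Each row of `profiles` sums to 1, so cities with very different venue counts remain comparable.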