# Battle of Neighbourhoods
### By Pedro Rocha

In this project I'll analyze the principal venues in each neighbourhood in CABA (Ciudad Autónoma de Buenos Aires),the capital and largest city of Argentina, my country.

The propouse of this analysis is to cluster the neighbourhoods in order to choose the best one to start a new business. For example, a new restaurant.

For it, I'll use a list of the Neighbourhoods names, with geopy I'll get the coordinates for each one, and then with the Foursquare API I'll get the principal venues in each neighbour.

Once I've all that information, I'll try a K-means model to cluster the neighbourhoods.

In [3]:
#Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from geopy.geocoders import Nominatim
import json 
import requests
from pandas.io.json import json_normalize
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
import folium

**I'll get the Neighbour's names from Wikipedia, you can check the info in [THIS LINK](https://es.wikipedia.org/wiki/Anexo:Barrios_de_la_ciudad_de_Buenos_Aires)**

In [4]:
#Get the list of Wikipedia
url = 'https://es.wikipedia.org/wiki/Anexo:Barrios_de_la_ciudad_de_Buenos_Aires'
df_list = pd.read_html(url)
neigh_df = df_list[0]

#Change the column name
neigh_df.rename(columns={'Nombre del barrio':'Neighbourhood'}, inplace=True)

#Select the info that we need
caba_df = neigh_df.loc[:,'Neighbourhood']
#I'll fix the name of Villa General Mitre with plotting and searching propouse
caba_df.replace(to_replace='Villa Gral. Mitre', value='Villa General Mitre',inplace=True)
caba_df = pd.DataFrame(caba_df)

#Let's look a few names
caba_df.head()

Unnamed: 0,Neighbourhood
0,Agronomía
1,Almagro
2,Balvanera
3,Barracas
4,Belgrano


**Ok! Now I need the Neighbourhood's coordinates. For that I'll use geopy. You can see the documentation [HERE](https://pypi.org/project/geopy/)**

In [5]:
#Now I've the Neighbourhoods, I'll use geopy to get the coordinates.
lats=[]
longs=[]
for neigh in caba_df['Neighbourhood']:
    try:
        address = "{},CABA, AR".format(neigh)
        geolocator = Nominatim(user_agent='caba-agent')
        location = geolocator.geocode(address)
        latitude = location.latitude
        longitude = location.longitude
        lats.append(latitude)
        longs.append(longitude)
    except:
        lats.append(0)
        longs.append(0)

caba_df['lat']=lats
caba_df['long']=longs

In [6]:
#Ready! Take a look
caba_df.head()

Unnamed: 0,Neighbourhood,lat,long
0,Agronomía,-34.591516,-58.485385
1,Almagro,-34.609988,-58.422233
2,Balvanera,-34.609215,-58.40314
3,Barracas,-34.645285,-58.387562
4,Belgrano,-34.561308,-58.456545


### Now we can see the data in a map! 

In [7]:
#Lets see the Neighbourhoods in a map!
address = "CABA, AR"
geolocator = Nominatim(user_agent='caba-agent')
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

map_caba = folium.Map(location=[latitude, longitude], zoom_start=11)
for lat, long, neighbourhood in zip(caba_df['lat'],caba_df['long'], caba_df['Neighbourhood']):
    folium.CircleMarker(
        [lat, long],
        radius=5,
        popup=folium.Popup(str(neighbourhood),parse_html=True),
        color='blue',
        fill=True,
        fill_collor='blue',
        fill_opacity=0.7).add_to(map_caba)
map_caba

**Set the credentails to use the Foursquare API**

In [8]:
#Set up Credentails
CLIENT_ID = 'XXXXX'
CLIENT_SECRET = 'XXXX'
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

I'll set a function to process each Neighbourhood and get the venues.

In [9]:
#define a function to process neighbourhoods in CABA:
def getNearbyVenues(names, latitudes, longitudes, radius=1000):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET,
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

**Now we can get the venues!**

In [10]:
#Get the CABA Venues
caba_venues = getNearbyVenues(names=caba_df['Neighbourhood'],
                                   latitudes=caba_df['lat'],
                                   longitudes=caba_df['long'])

Agronomía
Almagro
Balvanera
Barracas
Belgrano
Boedo
Caballito
Chacarita
Coghlan
Colegiales
Constitución
Flores
Floresta
La Boca
La Paternal
Liniers
Mataderos
Montserrat
Monte Castro
Nueva Pompeya
Núñez
Palermo
Parque Avellaneda
Parque Chacabuco
Parque Chas
Parque Patricios
Puerto Madero
Recoleta
Retiro
Saavedra
San Cristóbal
San Nicolás
San Telmo
Vélez Sarsfield
Versalles
Villa Crespo
Villa del Parque
Villa Devoto
Villa General Mitre
Villa Lugano
Villa Luro
Villa Ortúzar
Villa Pueyrredón
Villa Real
Villa Riachuelo
Villa Santa Rita
Villa Soldati
Villa Urquiza


Check how the DataFrame looks.

In [11]:
caba_venues = pd.DataFrame(caba_venues)
caba_venues.head()

Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Agronomía,-34.591516,-58.485385,Feria del Productor al Consumidor,-34.593981,-58.483098,Farmers Market
1,Agronomía,-34.591516,-58.485385,Club Comunicaciones,-34.596538,-58.490417,Sports Club
2,Agronomía,-34.591516,-58.485385,Social Parrilla,-34.588955,-58.484677,BBQ Joint
3,Agronomía,-34.591516,-58.485385,Rayuela,-34.596635,-58.486246,Snack Place
4,Agronomía,-34.591516,-58.485385,Vivero Agronomía,-34.5917,-58.488838,Garden Center


In order to segment the neighbourhoods we need to know what kind of venues are in each one. Let's see.

I'll use pandas function: *get_dummies* to conver the categorical variable in an indicator variable.

In [12]:
#Analyze each neighbourhood. What kind of venues does the neighbour has?
caba_venues_cat = pd.get_dummies(caba_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
caba_venues_cat['Neighbourhood'] = caba_df['Neighbourhood'] 

# move neighborhood column to the first column
fixed_columns = [caba_venues_cat.columns[-1]] + list(caba_venues_cat.columns[:-1])
caba_venues_cat = caba_venues_cat[fixed_columns]

caba_venues_cat.head()

Unnamed: 0,Neighbourhood,Accessories Store,Adult Boutique,American Restaurant,Amphitheater,Arcade,Argentinian Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,...,Vegetarian / Vegan Restaurant,Venezuelan Restaurant,Veterinarian,Video Game Store,Vietnamese Restaurant,Whisky Bar,Wine Bar,Wine Shop,Women's Store,Yoga Studio
0,Agronomía,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Almagro,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Balvanera,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Barracas,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Belgrano,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


**Now I've processed the information, let group the venues for each Neighbourhood.**

In [13]:
caba_grouped = caba_venues_cat.groupby('Neighbourhood').mean().reset_index()
caba_grouped.head()

Unnamed: 0,Neighbourhood,Accessories Store,Adult Boutique,American Restaurant,Amphitheater,Arcade,Argentinian Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,...,Vegetarian / Vegan Restaurant,Venezuelan Restaurant,Veterinarian,Video Game Store,Vietnamese Restaurant,Whisky Bar,Wine Bar,Wine Shop,Women's Store,Yoga Studio
0,Agronomía,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Almagro,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Balvanera,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Barracas,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Belgrano,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


**It would be interesting to see the most common venues in each Neighbourhood. Let's find out..**

In [14]:
#Function to sort the venues in descending order
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
#Create a dataframe with the top10 venue categories for each neighbourhood
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighbourhoods_venues_sorted = pd.DataFrame(columns=columns)
neighbourhoods_venues_sorted['Neighbourhood'] = caba_grouped['Neighbourhood']

for ind in np.arange(caba_grouped.shape[0]):
    neighbourhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(caba_grouped.iloc[ind, :], num_top_venues)

neighbourhoods_venues_sorted.head()

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Agronomía,Farmers Market,Fast Food Restaurant,Garden,Gaming Cafe,Furniture / Home Store,Frozen Yogurt Shop,French Restaurant,Fountain,Food Truck,Food Service
1,Almagro,Sports Club,Yoga Studio,Fast Food Restaurant,Garden,Gaming Cafe,Furniture / Home Store,Frozen Yogurt Shop,French Restaurant,Fountain,Food Truck
2,Balvanera,BBQ Joint,Yoga Studio,Fish Market,Garden Center,Garden,Gaming Cafe,Furniture / Home Store,Frozen Yogurt Shop,French Restaurant,Fountain
3,Barracas,Snack Place,Yoga Studio,Fast Food Restaurant,Gaming Cafe,Furniture / Home Store,Frozen Yogurt Shop,French Restaurant,Fountain,Food Truck,Food Service
4,Belgrano,Garden Center,Fast Food Restaurant,Garden,Gaming Cafe,Furniture / Home Store,Frozen Yogurt Shop,French Restaurant,Fountain,Food Truck,Food Service


If we see the DataFrame shape, we'll see that CABA has 48 neighbourhoods and we can analyze 252 venues categories

In [15]:
caba_grouped_clustering = caba_grouped.drop('Neighbourhood', 1)
caba_grouped_clustering.shape

(48, 252)