# The Battle of Neighborhoods
Sergio Koller

The city of Buenos Aires needs to categorize its neighborhoods in order to distinguish commercial areas from residential areas. The main purpose is to inform how the town hall allocates the funds destined for commercial growth and for living subsidies.

The objective of this project is to segment the neighborhoods into five categories:

1. Very High Commercial Area
2. High Commercial Area
3. Mid level Commercial Area
4. Low Commercial Area
5. Very Low Commercial Area

This will allow the City of Buenos Aires to allocate the 2020 funds to maintain "Very High Commercial Areas", grow "High Commercial Areas", and invest in living areas in the other categories.

The project will also provide visual evidence of the information gathered, so that the allocation decisions can be analyzed taking into consideration the location of the neighborhoods.

For this project the following data will be used:

- Buenos Aires Neighborhoods data set - https://data.buenosaires.gob.ar/dataset/barrios
- Foursquare nearby venues - https://es.foursquare.com/

This data will be used to analyze the density of commercial venues across all the neighborhoods of Buenos Aires and to segment them into the previously mentioned categories.

## Solution to the problem

Needed libraries

In [127]:
import pandas as pd
import numpy as np
# For web scraping
# pip install lxml html5lib beautifulsoup4
# pd.options.mode.chained_assignment = None  # default='warn'

import requests # library to handle requests

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

## Get the data and prepare it

Read the neighborhood information for Buenos Aires

In [128]:
neigh_bs = pd.read_csv('Barrios.csv')
neigh_bs.head()

Unnamed: 0,WKT,barrio,comuna,perimetro,area
0,"POLYGON ((-58.4528200492791 -34.5959886570639,...",CHACARITA,15,7724.852955,3115707.0
1,"POLYGON ((-58.4655768128541 -34.5965577078058,...",PATERNAL,15,7087.513295,2229829.0
2,"POLYGON ((-58.4237529813037 -34.5978273383243,...",VILLA CRESPO,15,8131.857075,3615978.0
3,"POLYGON ((-58.4946097568899 -34.6148652395239,...",VILLA DEL PARQUE,11,7705.389797,3399596.0
4,"POLYGON ((-58.4128700313089 -34.6141162515854,...",ALMAGRO,5,8537.901368,4050752.0


Get the latitude and longitude of each neighborhood, to six decimal places, taken from the first vertex of its polygon

In [129]:
neigh_bs['Coordinates'] = neigh_bs['WKT'].apply(lambda x:[x[10:20],x[28:38]])
neigh_bs['Latitude'] = neigh_bs['Coordinates'].apply(lambda x:x[1])
neigh_bs['Longitude'] = neigh_bs['Coordinates'].apply(lambda x:x[0])

neigh_bs.head()

Unnamed: 0,WKT,barrio,comuna,perimetro,area,Coordinates,Latitude,Longitude
0,"POLYGON ((-58.4528200492791 -34.5959886570639,...",CHACARITA,15,7724.852955,3115707.0,"[-58.452820, -34.595988]",-34.595988,-58.45282
1,"POLYGON ((-58.4655768128541 -34.5965577078058,...",PATERNAL,15,7087.513295,2229829.0,"[-58.465576, -34.596557]",-34.596557,-58.465576
2,"POLYGON ((-58.4237529813037 -34.5978273383243,...",VILLA CRESPO,15,8131.857075,3615978.0,"[-58.423752, -34.597827]",-34.597827,-58.423752
3,"POLYGON ((-58.4946097568899 -34.6148652395239,...",VILLA DEL PARQUE,11,7705.389797,3399596.0,"[-58.494609, -34.614865]",-34.614865,-58.494609
4,"POLYGON ((-58.4128700313089 -34.6141162515854,...",ALMAGRO,5,8537.901368,4050752.0,"[-58.412870, -34.614116]",-34.614116,-58.41287
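The fixed-position slicing above relies on every WKT string having exactly the same layout, and it uses the first vertex of the polygon rather than a central point. A minimal sketch of a more tolerant alternative follows, assuming single-ring `POLYGON` strings like the sample rows. The helper names `parse_wkt_vertices` and `vertex_mean` are hypothetical, the sample polygon is made up, and the vertex average is only a rough stand-in for a true centroid.

```python
import re

# Matches "lon lat" pairs such as "-58.4528200492791 -34.5959886570639".
_PAIR = re.compile(r"(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)")


def parse_wkt_vertices(wkt):
    """Return every (lon, lat) pair found in a WKT string as floats."""
    return [(float(lon), float(lat)) for lon, lat in _PAIR.findall(wkt)]


def vertex_mean(wkt):
    """Return (lat, lon) as the plain average of the polygon's vertices."""
    vertices = parse_wkt_vertices(wkt)
    if not vertices:
        raise ValueError("no coordinates found in WKT string")
    # A closed ring repeats its first vertex at the end; drop the repeat
    # so that it is not counted twice in the average.
    if len(vertices) > 1 and vertices[0] == vertices[-1]:
        vertices = vertices[:-1]
    lon = sum(v[0] for v in vertices) / len(vertices)
    lat = sum(v[1] for v in vertices) / len(vertices)
    return round(lat, 6), round(lon, 6)


# Made-up rectangle used only to exercise the helpers.
sample = ("POLYGON ((-58.45 -34.59, -58.47 -34.59, -58.47 -34.61, "
          "-58.45 -34.61, -58.45 -34.59))")
first_lon, first_lat = parse_wkt_vertices(sample)[0]
center_lat, center_lon = vertex_mean(sample)
```

Under these assumptions, something like `neigh_bs['WKT'].apply(vertex_mean)` would return a `(lat, lon)` tuple per row that could then be split into two columns.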


Remove unnecessary columns and use English column names

In [130]:
neigh_bs.drop(['WKT','comuna','perimetro','area','Coordinates'], axis=1, inplace = True)
neigh_bs.columns = ['Neighborhood','Latitude','Longitude']
neigh_bs.head()

Unnamed: 0,Neighborhood,Latitude,Longitude
0,CHACARITA,-34.595988,-58.45282
1,PATERNAL,-34.596557,-58.465576
2,VILLA CRESPO,-34.597827,-58.423752
3,VILLA DEL PARQUE,-34.614865,-58.494609
4,ALMAGRO,-34.614116,-58.41287


Cast Latitude and Longitude to float

In [131]:
neigh_bs['Latitude'] = neigh_bs['Latitude'].astype(float)
neigh_bs['Longitude'] = neigh_bs['Longitude'].astype(float)
neigh_bs.dtypes

Neighborhood     object
Latitude        float64
Longitude       float64
dtype: object

Plot the points on a map of Buenos Aires

In [132]:
map_bs = folium.Map(location=[-34.6131516, -58.3772316], zoom_start=10)

for lat, lng, neighborhood in zip(neigh_bs['Latitude'], neigh_bs['Longitude'], neigh_bs['Neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_bs)
    
map_bs

## Get the number of venues for each neighborhood

In [136]:
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID (do not publish real credentials)
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret (do not publish real credentials)
VERSION = '20180605' # Foursquare API version
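Credentials are easier to keep out of a shared notebook when they are read from the environment instead of being typed into a cell. A minimal sketch of that approach follows; the function `load_foursquare_credentials` and the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` are hypothetical choices made for this example.

```python
import os


def load_foursquare_credentials(env=os.environ):
    """Read the client id and secret from environment variables.

    FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET are hypothetical
    variable names chosen for this sketch.
    """
    try:
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        # Fail early with a readable message instead of sending an empty key.
        raise RuntimeError(f"environment variable {missing} is not set") from None
```

With both variables exported in the shell that starts the notebook, `CLIENT_ID, CLIENT_SECRET = load_foursquare_credentials()` would fill in the two names used in the rest of this section.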

Function to get the venues for each neighborhood

In [137]:
LIMIT = 100 # limit of number of venues returned by Foursquare API

radius = 500 # define radius

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
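The function above indexes straight into the JSON response, so a response without groups, or a venue without categories, would raise an exception partway through the loop. A minimal sketch of a more defensive parsing step follows, assuming the same response shape as above (`response`, `groups`, `items`, `venue`). The helper `extract_venues`, the `"Unknown"` fallback category, and the fake payload are hypothetical and only illustrate the idea.

```python
def extract_venues(payload, name, lat, lng):
    """Flatten one explore response into rows, tolerating missing pieces."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        # Fall back to a placeholder when a venue has no category listed.
        categories = venue.get("categories") or [{}]
        rows.append((
            name,
            lat,
            lng,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            categories[0].get("name", "Unknown"),
        ))
    return rows


# Fake payload that mimics the structure read by getNearbyVenues.
fake_payload = {
    "response": {
        "groups": [{
            "items": [
                {"venue": {"name": "Cafe A",
                           "location": {"lat": -34.59, "lng": -58.45},
                           "categories": [{"name": "Café"}]}},
                {"venue": {"name": "Shop B",
                           "location": {"lat": -34.60, "lng": -58.46},
                           "categories": []}},
            ]
        }]
    }
}
rows = extract_venues(fake_payload, "CHACARITA", -34.595988, -58.45282)
```

The request itself could be hardened in the same spirit by passing the query values through the `params` argument of `requests.get` and setting its `timeout` argument.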

In [138]:
bs_venues = getNearbyVenues(names=neigh_bs['Neighborhood'],
                            latitudes=neigh_bs['Latitude'],
                            longitudes=neigh_bs['Longitude'])
bs_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,CHACARITA,-34.595988,-58.45282,Yeite,-34.596012,-58.44928,Deli / Bodega
1,CHACARITA,-34.595988,-58.45282,Movistar Arena,-34.594348,-58.448033,Stadium
2,CHACARITA,-34.595988,-58.45282,Margen del Mundo,-34.596987,-58.456835,Museum
3,CHACARITA,-34.595988,-58.45282,Tiro Loco,-34.598935,-58.452126,Café
4,CHACARITA,-34.595988,-58.45282,Alumni Fútbol 5,-34.597889,-58.451936,Soccer Field


Count venues per neighborhood and attach the count to the neigh_bs dataframe

In [139]:
venues_for_neighborhood = bs_venues.groupby(['Neighborhood']).count()['Venue']
# some neighborhoods have no venues (Foursquare coverage is sparse in South America),
# so this helper checks whether the key exists in venues_for_neighborhood
def get_venues_count(x):
    if x in venues_for_neighborhood:
        return venues_for_neighborhood[x]
    else:
        return 0

neigh_bs['Number of Venues'] = neigh_bs['Neighborhood'].apply(lambda x:get_venues_count(x))
neigh_bs.head()

Unnamed: 0,Neighborhood,Latitude,Longitude,Number of Venues
0,CHACARITA,-34.595988,-58.45282,9
1,PATERNAL,-34.596557,-58.465576,5
2,VILLA CRESPO,-34.597827,-58.423752,26
3,VILLA DEL PARQUE,-34.614865,-58.494609,7
4,ALMAGRO,-34.614116,-58.41287,16
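The same count can also be written without a helper function. The sketch below uses toy data (the neighborhood and venue names are made up) and relies on `value_counts`, `map` and `fillna`, so neighborhoods with no venues end up with 0.

```python
import pandas as pd

# Toy stand-ins for bs_venues and neigh_bs; the values are made up.
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "B"],
    "Venue": ["Cafe 1", "Cafe 2", "Bar 1"],
})
neighborhoods = pd.DataFrame({"Neighborhood": ["A", "B", "C"]})

# Number of venue rows per neighborhood, indexed by neighborhood name.
counts = venues["Neighborhood"].value_counts()

# Neighborhoods missing from `counts` map to NaN, which becomes 0.
neighborhoods["Number of Venues"] = (
    neighborhoods["Neighborhood"].map(counts).fillna(0).astype(int)
)
```

Here neighborhood `C` has no rows in `venues`, so it receives a count of 0 instead of raising a lookup error.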


## Standardize the data and cluster the neighborhoods

Apply the standard scaler to the number of venues attribute

In [140]:
sc_X = StandardScaler()
neigh_bs['Number of Venues'] = sc_X.fit_transform(neigh_bs[['Number of Venues']])
neigh_bs.head()

Unnamed: 0,Neighborhood,Latitude,Longitude,Number of Venues
0,CHACARITA,-34.595988,-58.45282,-0.314691
1,PATERNAL,-34.596557,-58.465576,-0.661936
2,VILLA CRESPO,-34.597827,-58.423752,1.161102
3,VILLA DEL PARQUE,-34.614865,-58.494609,-0.488314
4,ALMAGRO,-34.614116,-58.41287,0.292988
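`StandardScaler` replaces each value with its z-score: it subtracts the column mean and divides by the population standard deviation. A minimal pure-Python sketch of that arithmetic follows. The function name `standardize` is hypothetical, and the sample holds only the five counts displayed earlier, so its z-scores differ from the notebook output, which was computed over all neighborhoods.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Return z-scores using the population standard deviation (ddof=0)."""
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        # A constant column carries no spread; map every value to 0.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# The five counts shown in the earlier output, not the full data set.
sample_counts = [9, 5, 26, 7, 16]
z_scores = standardize(sample_counts)
```

After this transformation the values have mean 0 and standard deviation 1, and their ordering is unchanged.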


Perform K-means

In [141]:
# set number of clusters
kclusters = 5

neigh_bs_clustering = neigh_bs.drop(['Neighborhood','Latitude','Longitude'], axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(neigh_bs_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([3, 0, 4, 3, 2, 2, 3, 0, 0, 0], dtype=int32)

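Add clusters to the neigh_bs dataframe and give them descriptive names. K-means numbers its clusters arbitrarily, so the id-to-name mapping used here does not by itself guarantee that cluster 0 holds the neighborhoods with the most venues; it should be checked against the cluster centers.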

In [142]:
neigh_bs['Cluster'] = kmeans.labels_

def assign_cluster_label(cluster):
    if cluster == 0:
        return "Very High shopping Area"
    if cluster == 1:
        return "High Shopping Area"
    if cluster == 2:
        return "Mid Level Shopping Area"
    if cluster == 3:
        return "Low Shopping Area"
    if cluster == 4:
        return "Very Low Shopping Area"

    
neigh_bs['Cluster Label'] = neigh_bs['Cluster'].apply(lambda x:assign_cluster_label(x))
neigh_bs.head()

Unnamed: 0,Neighborhood,Latitude,Longitude,Number of Venues,Cluster,Cluster Label
0,CHACARITA,-34.595988,-58.45282,-0.314691,3,Low Shopping Area
1,PATERNAL,-34.596557,-58.465576,-0.661936,0,Very High shopping Area
2,VILLA CRESPO,-34.597827,-58.423752,1.161102,4,Very Low Shopping Area
3,VILLA DEL PARQUE,-34.614865,-58.494609,-0.488314,3,Low Shopping Area
4,ALMAGRO,-34.614116,-58.41287,0.292988,2,Mid Level Shopping Area
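In the rows above, PATERNAL has a below-average scaled venue count yet is labeled "Very High", while VILLA CRESPO has the highest count shown and is labeled "Very Low". This is a consequence of k-means cluster ids carrying no order. A minimal sketch of one way to tie the names to the data follows: rank the clusters by their center and assign names from highest to lowest. The function `label_clusters_by_center` and the example centers are hypothetical, and the names follow the five categories listed in the introduction.

```python
LABELS_HIGH_TO_LOW = [
    "Very High Commercial Area",
    "High Commercial Area",
    "Mid level Commercial Area",
    "Low Commercial Area",
    "Very Low Commercial Area",
]


def label_clusters_by_center(centers, labels=LABELS_HIGH_TO_LOW):
    """Map each cluster id to a label, giving the first label to the highest center."""
    if len(centers) != len(labels):
        raise ValueError("need exactly one label per cluster")
    # Cluster ids sorted from the largest center to the smallest.
    ranked = sorted(range(len(centers)), key=lambda c: centers[c], reverse=True)
    return {cluster_id: labels[rank] for rank, cluster_id in enumerate(ranked)}


# Hypothetical one-dimensional centers, indexed by cluster id.
example_centers = [-0.7, 2.5, 0.3, -0.4, 1.2]
mapping = label_clusters_by_center(example_centers)
```

With the fitted model from this notebook, the centers could be taken from `kmeans.cluster_centers_.ravel().tolist()`, and the names applied with `neigh_bs['Cluster'].map(mapping)`.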


Display results on a map

In [143]:
map_clusters = folium.Map(location=[-34.6131516, -58.3772316], zoom_start=10)

x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster, cluster_label in zip(neigh_bs['Latitude'], neigh_bs['Longitude'], neigh_bs['Neighborhood'], neigh_bs['Cluster'], neigh_bs['Cluster Label']):
    label = folium.Popup(str(poi) + ' ' + cluster_label, parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters