# Introduction / Business Problem

### The goal is to help a new investment company find the best location and the best type of restaurant to open in the city of Sao Paulo, Brazil.

I will use Foursquare to check how existing restaurants are rated and how they are distributed among Sao Paulo's neighborhoods, along with other data from the city's official site.
I will look for opportunities by identifying the restaurant types and the most promising neighborhoods in which to open.


# Data

### Foursquare Data

I will use Foursquare API data. Foursquare is a social media website that collects information about places around the world.
The documentation on how to use this API is available at https://developer.foursquare.com/docs/places-api/
To use it, you need to create an account on that website. Some API calls are free, while others require the premium category.
This API will be used to explore data about venues in the city of Sao Paulo.

### Geopy

Geopy makes it easy for Python developers to locate the coordinates of addresses, cities, countries, and landmarks across the globe using third-party geocoders and other data sources.
I will use the geopy library to get the latitude and longitude values for Sao Paulo city.
Geopy documentation is available at https://geopy.readthedocs.io/en/stable/

### How I will use the data

I will use Foursquare to check how existing restaurants are rated and how they are distributed among Sao Paulo's neighborhoods.
This will give a clue about the best type of restaurant/cuisine to open, by finding the restaurant types with the lowest ratings and the neighborhoods where a given cuisine is not yet available.
In addition, I will include other statistical data in the dataset from the Sao Paulo City official site https://www.prefeitura.sp.gov.br/.
I will take the educational data by neighborhood (level of schooling) and the financial data (number of people in the highest income brackets) to decide which neighborhood would be best for the new restaurant.
Another data source that might be used is the number of new houses/apartments built per neighborhood, included in the dataset to get a more accurate model, because more new residents mean more potential customers.
I will use clustering and choropleth maps to visualize the data and support the study.

### Loading proper libraries

In [133]:
import requests # library to handle requests
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner
import random # library for random number generation


pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)


!pip install geopy
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values

# libraries for displaying images
from IPython.display import Image 
from IPython.core.display import HTML 
    
# transforming JSON into a pandas DataFrame (top-level import available since pandas 1.0)
from pandas import json_normalize

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

! pip install folium
import folium # plotting library

!pip install lxml
import lxml

print('Folium installed')
print('Libraries imported.')

Folium installed
Libraries imported.


## Additional data from Sao Paulo government site

## Number of residents by income in Sao Paulo city grouped by neighborhood

In [134]:
link = "https://www.prefeitura.sp.gov.br/cidade/secretarias/upload/Domicilios_faixa_rendimento_sal_minimos_2010.xls"
df_inc = pd.read_excel(link, skiprows=6, thousands=".")
df_inc.head()

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Até 1/2,Mais de 1/2 a 1,Mais de 1 a 2,Mais de 2 a 5,Mais de 5 a 10,Mais de 10 a 20,Mais de 20,Sem rendimento (3)
0,São Paulo,3574286,20129,225166,588778,1212485,714900,380801,224798,202016
1,Aricanduva/Formosa/Carrão,85188,197,4788,11237,28095,21081,10898,4228,4622
2,Aricanduva,27661,90,1996,4457,10327,6550,2402,475,1341
3,Carrão,27115,42,1266,2908,8239,7254,4400,1585,1418
4,Vila Formosa,30412,65,1526,3872,9529,7277,4096,2168,1863


## This is the educational level by person and neighborhood in Sao Paulo city

In [135]:
link = "https://www.prefeitura.sp.gov.br/cidade/secretarias/upload/Grau%20de%20instru%C3%A7%C3%A3o_Pesquisa%20OD_2017.xls"
df_edu = pd.read_excel(link, skiprows=4, thousands=".", converters={'Total':float,'Não alfabetizado / Fundamental I incompleto':float, 'Fundamental I completo / Fundamental II incompleto': float, 'Fundamental II completo / Médio incompleto':float, 'Médio completo / Superior incompleto': float, 'Superior completo':float})
df_edu.head()

Unnamed: 0,Unidades territoriais,Total,Não alfabetizado / Fundamental I incompleto,Fundamental I completo / Fundamental II incompleto,Fundamental II completo / Médio incompleto,Médio completo / Superior incompleto,Superior completo
0,Município de São Paulo,11739200.0,2392470.0,1673060.0,1689830.0,3916720.0,2067160.0
1,Aricanduva/Formosa/Carrão,265623.0,53186.4,51161.3,38522.2,79950.8,42802.4
2,Aricanduva,86580.0,22871.1,13768.9,12226.0,26249.7,11464.3
3,Carrão,84711.0,15984.0,20130.6,14174.3,18426.8,15995.3
4,Vila Formosa,94332.0,14331.3,17261.8,12121.9,35274.3,15342.7


### Do some cleaning

In [136]:
# Drop aggregated rows whose name contains /
indexnames=df_edu[df_edu['Unidades territoriais'].str.contains("/")].index
df_edu.drop(indexnames , inplace=True)
#Drop duplicates
df_edu.drop_duplicates(inplace=True)
df_edu = df_edu.iloc[:-1]
df_edu.tail()

Unnamed: 0,Unidades territoriais,Total,Não alfabetizado / Fundamental I incompleto,Fundamental I completo / Fundamental II incompleto,Fundamental II completo / Médio incompleto,Médio completo / Superior incompleto,Superior completo
123,Moema,88407,9876.45,3319.43,3405.05,19054.2,52751.9
124,Saúde,133683,18195.5,10186.0,10751.0,29433.3,65117.2
125,Vila Mariana,131989,13007.6,5645.54,7640.71,29479.3,76215.8
126,Vila Prudente,247597,46168.5,28693.1,34469.1,95864.3,42402.0
127,São Lucas,142954,28043.3,14246.7,23628.4,53409.9,23625.6
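
The spreadsheet mixes single-district rows with aggregated rows whose names join several districts with "/". As a minimal sketch on made-up numbers (the toy values are not taken from the source file), the same filter can be written as a boolean mask, which avoids the in-place `drop`:

```python
import pandas as pd

# Toy data (made-up numbers): one aggregated row followed by its districts
toy = pd.DataFrame({
    "Unidades territoriais": ["Aricanduva/Formosa/Carrão", "Aricanduva", "Carrão", "Vila Formosa"],
    "Total": [300.0, 100.0, 90.0, 110.0],
})

# A "/" in the name marks an aggregate of several districts
is_aggregate = toy["Unidades territoriais"].str.contains("/", regex=False)

# Keep only the single-district rows
districts = toy[~is_aggregate].reset_index(drop=True)
print(districts)
```

Passing `regex=False` makes it explicit that "/" is matched as a literal character.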


#### Now, let's group by Fundamental, College and University level only

In [137]:
df_edu.drop('Total',axis=1,inplace=True)
df_edu['Fundamental']=df_edu['Não alfabetizado / Fundamental I incompleto']+df_edu['Fundamental I completo / Fundamental II incompleto']+df_edu['Fundamental II completo / Médio incompleto']

In [138]:
df_edu=df_edu.filter(['Unidades territoriais','Fundamental','Médio completo / Superior incompleto','Superior completo'])
df_edu.columns=(['Neighborhood','Fundamental','College','University'])
df_edu.head()

Unnamed: 0,Neighborhood,Fundamental,College,University
0,Município de São Paulo,5755360.0,3916720.0,2067160.0
2,Aricanduva,48866.0,26249.7,11464.3
3,Carrão,50288.9,18426.8,15995.3
4,Vila Formosa,43714.9,35274.3,15342.7
5,Butantã,204958.0,141800.0,106605.0


#### Removing Município de São Paulo as well as the names that are duplicated in the provided xls. The duplicated rows are subtotals and need to be removed.

In [139]:
df_edu=df_edu[df_edu['Neighborhood']!="Município de São Paulo"]
df_edu = df_edu.groupby('Neighborhood').agg({'Fundamental': ['min'],'College':['min'],'University':['min']})
df_edu.reset_index(inplace=True)
df_edu.columns=(['Neighborhood','Fundamental','College','University'])
df_edu.head()

Unnamed: 0,Neighborhood,Fundamental,College,University
0,Alto de Pinheiros,10077.686005,8376.646471,23023.663492
1,Anhanguera,44641.502808,27990.750558,8214.733694
2,Aricanduva,48865.985529,26249.69685,11464.306756
3,Artur Alvim,46509.056355,39514.172049,15391.771105
4,Barra Funda,5106.950583,4222.710931,6369.337611
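
Why `min` removes the subtotals: a name such as Butantã appears twice, once as a subtotal for a group of districts and once as the district itself. Assuming every subtotal is at least as large as the district it contains (true for non-negative counts), the minimum per name keeps the district row. A small sketch with made-up numbers:

```python
import pandas as pd

# Toy data (made-up numbers): "Butantã" appears as a subtotal and as a district
toy = pd.DataFrame({
    "Neighborhood": ["Butantã", "Butantã", "Morumbi"],
    "Fundamental": [200.0, 50.0, 30.0],
    "College": [140.0, 40.0, 20.0],
})

# The minimum is taken per column, so the smaller (district) values survive
deduplicated = toy.groupby("Neighborhood", as_index=False).min()
print(deduplicated)
```

Note that the minimum is computed column by column, so this only works when the subtotal is the larger value in every column.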


#### We can see the education data grouped, with the correct values for each level

In [140]:
df_edu[0:20]

Unnamed: 0,Neighborhood,Fundamental,College,University
0,Alto de Pinheiros,10077.686005,8376.646471,23023.663492
1,Anhanguera,44641.502808,27990.750558,8214.733694
2,Aricanduva,48865.985529,26249.69685,11464.306756
3,Artur Alvim,46509.056355,39514.172049,15391.771105
4,Barra Funda,5106.950583,4222.710931,6369.337611
5,Bela Vista,22381.984499,17390.484601,32764.534509
6,Belém,21612.519085,14745.720573,12041.753336
7,Bom Retiro,17220.748445,14292.074117,6405.171068
8,Brasilândia,170485.416981,88996.286363,18676.28324
9,Brás,15057.458941,13265.085803,3973.456761


## This is the number of houses built in each neighborhood in Sao Paulo city

In [141]:
link = "https://www.prefeitura.sp.gov.br/cidade/secretarias/upload/15_numero_de_unidades_residenciais_verticai_1992_2018.xls"
df_homes = pd.read_excel(link, skiprows=4, thousands=".")

In [142]:
df_homes.head()

Unnamed: 0,Unidades Territoriais,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018
0,MSP,10266,21308,24510,25759,30207,38518,20910,25881,28676,21714,20243,24442,20020,23541,24736,37107,32577,30558,37174,37107,27087,32008,32830,20218,18839,36169,34743
1,Aricanduva/Formosa/Carrão,158,628,812,1120,782,1242,534,1173,740,244,507,501,477,394,931,1153,1855,1314,2240,2086,731,722,756,821,294,142,199
2,Aricanduva,-,-,104,-,-,-,-,400,160,-,48,227,112,-,64,208,346,378,708,572,483,-,50,399,-,-,141
3,Carrão,72,336,212,272,218,679,322,581,378,72,131,-,182,394,709,832,1117,826,588,348,60,220,370,138,242,-,58
4,Vila Formosa,86,292,496,848,564,563,212,192,202,172,328,274,183,-,158,113,392,110,944,1166,188,502,336,284,52,142,-


#### Fixing column issues and replacing empty values with zeroes

In [143]:
print(df_homes.columns.values)

['Unidades Territoriais' 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001
 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015
 2016 2017 2018]


In [144]:
# Keep the neighborhood name and the three most recent years,
# dropping the last three rows of the sheet
home_grouped = df_homes[['Unidades Territoriais', 2016, 2017, 2018]].iloc[:-3].copy()
home_grouped.columns = ['Neighborhood', '2016', '2017', '2018']

# '-' marks a year without new units: turn it (and any missing value) into 0.
# The result has to be assigned back, because fillna/replace/astype return
# new objects instead of modifying the column in place.
for year in ['2016', '2017', '2018']:
    home_grouped[year] = pd.to_numeric(home_grouped[year], errors='coerce').fillna(0)

home_grouped.head()

Unnamed: 0,Neighborhood,2016,2017,2018
0,MSP,18839.0,36169.0,34743.0
1,Aricanduva/Formosa/Carrão,294.0,142.0,199.0
2,Aricanduva,0.0,0.0,141.0
3,Carrão,242.0,0.0,58.0
4,Vila Formosa,52.0,142.0,0.0
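
As a minimal sketch of the placeholder handling, here are the Aricanduva and Carrão values from the table above, with "-" standing for years without new units. `pd.to_numeric(..., errors="coerce")` turns the placeholder into `NaN`, which is then filled with 0:

```python
import pandas as pd

# Values for two neighborhoods as they appear in the sheet ('-' = no new units)
toy = pd.DataFrame({
    "Neighborhood": ["Aricanduva", "Carrão"],
    "2016": ["-", 242],
    "2017": ["-", "-"],
    "2018": [141, 58],
})
year_cols = ["2016", "2017", "2018"]

# Coerce non-numeric placeholders to NaN, fill with 0, and store as integers
toy[year_cols] = (
    toy[year_cols].apply(pd.to_numeric, errors="coerce").fillna(0).astype(int)
)

# Row-wise total over the three years
toy["Total"] = toy[year_cols].sum(axis=1)
print(toy)
```

The conversion is assigned back to the columns; calling `fillna` or `astype` without using the return value leaves the dataframe unchanged.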


### For this dataframe I will consider only the last three years (2016, 2017 and 2018) for each neighborhood. So let's do some cleaning and grouping to see where we have the most new houses.

In [145]:
home_grouped.tail()

Unnamed: 0,Neighborhood,2016,2017,2018
125,São Lucas,0.0,588.0,738.0
126,Sapopemba,0.0,0.0,0.0
127,Vila Prudente,708.0,764.0,242.0
128,Sapopemba*,84.0,0.0,0.0
129,Sapopemba,84.0,0.0,0.0


#### Let's sum the three years to get the total number of new houses

In [146]:
home_grouped['Total']=home_grouped['2016']+home_grouped['2017']+home_grouped['2018']
home_grouped=home_grouped.filter(['Neighborhood','Total'])
home_grouped.head()

Unnamed: 0,Neighborhood,Total
0,MSP,89751.0
1,Aricanduva/Formosa/Carrão,635.0
2,Aricanduva,141.0
3,Carrão,300.0
4,Vila Formosa,194.0


In [147]:
home_grouped[0:30]

Unnamed: 0,Neighborhood,Total
0,MSP,89751.0
1,Aricanduva/Formosa/Carrão,635.0
2,Aricanduva,141.0
3,Carrão,300.0
4,Vila Formosa,194.0
5,Butantã,5453.0
6,Butantã,1136.0
7,Morumbi,304.0
8,Raposo Tavares,2608.0
9,Rio Pequeno,136.0


In [148]:
# Drop aggregated rows whose name contains /
indexnames=home_grouped[home_grouped['Neighborhood'].str.contains("/")].index
home_grouped.drop(indexnames , inplace=True)
#Drop duplicates
home_grouped.drop_duplicates(inplace=True)


In [149]:
home_grouped=home_grouped[home_grouped['Neighborhood']!="MSP"]
home_grouped = home_grouped.groupby('Neighborhood').agg({'Total': ['min']})
home_grouped.reset_index(inplace=True)
home_grouped.columns=(['Neighborhood','Total'])
home_grouped.head()

Unnamed: 0,Neighborhood,Total
0,Alto de Pinheiros,76.0
1,Anhanguera,0.0
2,Aricanduva,141.0
3,Artur Alvim,393.0
4,Barra Funda,406.0


## Methodology section

After some data cleaning, we will process the income dataframe and the neighborhoods in order to get the geographical coordinates using Geopy.
We will also use Foursquare to get restaurant locations. We will look mainly at the restaurant types and group them.
Once we have the Foursquare data, it will be possible to cluster the neighborhoods and plot a map of the categories we have found.
Later we will work with the education and new-housing data to do some further clustering and compare the results, in order to understand which neighborhoods have the highest education and income levels.
These will probably be our preferred locations.

For clustering we use K-means, an unsupervised algorithm. It is a suitable choice here because we are working with unlabeled datasets.
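
To illustrate what K-means does with unlabeled data, here is a minimal sketch on a made-up feature matrix. The values and the choice of two clusters are assumptions for the toy example only, not the settings used in this study:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy feature matrix (made-up values): two clearly separated groups of points
features = np.array([
    [0.9, 0.1], [0.8, 0.2], [1.0, 0.0],
    [0.1, 0.9], [0.2, 0.8], [0.0, 1.0],
])

# No labels are given: the algorithm assigns each row to the nearest centroid
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)
labels = kmeans.labels_
print(labels)
```

The cluster numbers themselves are arbitrary; only the grouping matters. Fixing `random_state` makes the result reproducible between runs.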

In [150]:
df_inc = df_inc.rename(columns={'Unnamed: 0': 'Neighborhood','Unnamed: 1':'Total'})
df_inc.head()


Unnamed: 0,Neighborhood,Total,Até 1/2,Mais de 1/2 a 1,Mais de 1 a 2,Mais de 2 a 5,Mais de 5 a 10,Mais de 10 a 20,Mais de 20,Sem rendimento (3)
0,São Paulo,3574286,20129,225166,588778,1212485,714900,380801,224798,202016
1,Aricanduva/Formosa/Carrão,85188,197,4788,11237,28095,21081,10898,4228,4622
2,Aricanduva,27661,90,1996,4457,10327,6550,2402,475,1341
3,Carrão,27115,42,1266,2908,8239,7254,4400,1585,1418
4,Vila Formosa,30412,65,1526,3872,9529,7277,4096,2168,1863


In [151]:
df_inc.tail()

Unnamed: 0,Neighborhood,Total,Até 1/2,Mais de 1/2 a 1,Mais de 1 a 2,Mais de 2 a 5,Mais de 5 a 10,Mais de 10 a 20,Mais de 20,Sem rendimento (3)
128,,,,,,,,,,
129,"Fonte: IBGE, Censo Demográfico 2010.",,,,,,,,,
130,Nota: Os dados de rendimento são preliminares.,,,,,,,,,
131,(1) Inclusive os domicílios sem declaração de ...,,,,,,,,,
132,Elaboração: SMDU/Dipro,,,,,,,,,


#### Removing aggregated neighborhoods, notes at the end of the dataframe and empty data

In [152]:
neighborhoods = []
df_inc['Neighborhood'] = df_inc['Neighborhood'].astype(str)

# Substrings that mark rows to skip: aggregated names ('A/B'), footnotes and empty cells
excluded = ("/", "\\", "Fonte", "Elab", "nan", "rendimento")

for name in df_inc['Neighborhood']:
    if not any(token in name for token in excluded):
        # strip() also removes the trailing non-breaking space in 'São Miguel\xa0'
        neighborhoods.append(name.strip())

neighborhoods.sort()
neighborhoods


['Alto de Pinheiros',
 'Anhanguera',
 'Aricanduva',
 'Artur Alvim',
 'Barra Funda',
 'Bela Vista',
 'Belém',
 'Bom Retiro',
 'Brasilândia',
 'Brás',
 'Butantã',
 'Butantã',
 'Cachoeirinha',
 'Cambuci',
 'Campo Belo',
 'Campo Grande',
 'Campo Limpo',
 'Campo Limpo',
 'Cangaíba',
 'Capela do Socorro',
 'Capão Redondo',
 'Carrão',
 'Casa Verde',
 'Cidade Ademar',
 'Cidade Ademar',
 'Cidade Dutra',
 'Cidade Líder',
 'Cidade Tiradentes',
 'Cidade Tiradentes',
 'Consolação',
 'Cursino',
 'Ermelino Matarazzo',
 'Ermelino Matarazzo',
 'Freguesia do Ó',
 'Grajaú',
 'Guaianases',
 'Guaianases',
 'Iguatemi',
 'Ipiranga',
 'Ipiranga',
 'Itaim Bibi',
 'Itaim Paulista',
 'Itaim Paulista',
 'Itaquera',
 'Itaquera',
 'Jabaquara',
 'Jabaquara',
 'Jaguara',
 'Jaguaré',
 'Jaraguá',
 'Jardim Helena',
 'Jardim Paulista',
 'Jardim São Luís',
 'Jardim Ângela',
 'Jaçanã',
 'José Bonifácio',
 'Lajeado',
 'Lapa',
 'Lapa',
 'Liberdade',
 'Limão',
 "M'Boi Mirim",
 'Mandaqui',
 'Marsilac',
 'Moema',
 'Mooca',
 'Mo

#### Drop duplicates

In [153]:
df_neigh = pd.DataFrame (neighborhoods,columns=['Neighborhood'])
df_neigh.drop_duplicates(inplace=True)
df_neigh.head()

Unnamed: 0,Neighborhood
0,Alto de Pinheiros
1,Anhanguera
2,Aricanduva
3,Artur Alvim
4,Barra Funda


### First we look for Foursquare data about venues in Sao Paulo City

#### Use geopy library to get the latitude and longitude values of Sao Paulo City. 

In [154]:
address = 'Sé, Sao Paulo,Brazil'

geolocator = Nominatim(user_agent="sp_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Sao Paulo City are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Sao Paulo City are -23.5506507, -46.6333824.


In [155]:
geolocator = Nominatim(user_agent="sp_explorer")
df_neigh["Latitude"]=0.00
df_neigh["Longitude"]=0.00

for index, neig in df_neigh.iterrows():
   address = neig["Neighborhood"] + ' ' + ',Sao Paulo, Brazil'
   location = geolocator.geocode(address)
   if location:
      latitude = location.latitude
      longitude = location.longitude
   elif neig["Neighborhood"] == "Capela do Socorro":
      # fallback coordinates for Capela do Socorro when the geocoder returns nothing
      latitude = -23.69057
      longitude = -46.70358
   else:
      # not found: store NaN instead of reusing the previous neighborhood's values
      latitude = np.nan
      longitude = np.nan
   # .loc writes into df_neigh directly and avoids the chained-assignment warning
   df_neigh.loc[index, "Latitude"] = latitude
   df_neigh.loc[index, "Longitude"] = longitude
   print('Neig {}, Lat {}, Long {}'.format(neig["Neighborhood"], latitude, longitude))
df_neigh.head()


Neig Alto de Pinheiros, Lat -23.5494609, Long -46.71229274842102
Neig Anhanguera, Lat -23.4329085, Long -46.788533962659386
Neig Aricanduva, Lat -23.5780239, Long -46.511454
Neig Artur Alvim, Lat -23.539220999999998, Long -46.48526468200837
Neig Barra Funda, Lat -23.5254616, Long -46.6675134
Neig Bela Vista, Lat -23.5622095, Long -46.64776648788944
Neig Belém, Lat -23.5348833, Long -46.5949387
Neig Bom Retiro, Lat -23.5271385, Long -46.636834846501365
Neig Brasilândia, Lat -23.4482715, Long -46.69026927092207
Neig Brás, Lat -23.5453263, Long -46.6164435
Neig Butantã, Lat -23.5690555, Long -46.7218833626702
Neig Cachoeirinha, Lat -23.449511450000003, Long -46.66366119497354
Neig Cambuci, Lat -23.566128499999998, Long -46.61365030871091
Neig Campo Belo, Lat -23.626730549999998, Long -46.66942867841393
Neig Campo Grande, Lat -23.67554775, Long -46.687234400083085
Neig Campo Limpo, Lat -23.632557650000003, Long -46.759666126372395
Neig Cangaíba, Lat -23.5058996, Long -46.5314253
Neig Capel

Unnamed: 0,Neighborhood,Latitude,Longitude
0,Alto de Pinheiros,-23.549461,-46.712293
1,Anhanguera,-23.432908,-46.788534
2,Aricanduva,-23.578024,-46.511454
3,Artur Alvim,-23.539221,-46.485265
4,Barra Funda,-23.525462,-46.667513
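
The lookup loop can also be written as a small function that receives the geocoding call as an argument, which makes it testable without network access. This is a sketch under assumptions: `geocode_neighborhoods`, `FakeLocation` and `fake_geocode` are hypothetical names, and the coordinates in the stub are made up.

```python
import math
from collections import namedtuple


def geocode_neighborhoods(names, geocode, city="Sao Paulo, Brazil"):
    """Return {name: (lat, lon)}; names that cannot be geocoded map to (nan, nan)."""
    coords = {}
    for name in names:
        location = geocode(f"{name}, {city}")
        if location is None:
            # Unknown place: record NaN rather than reusing an earlier result
            coords[name] = (math.nan, math.nan)
        else:
            coords[name] = (location.latitude, location.longitude)
    return coords


# Stand-in for a geocoder (hypothetical, made-up coordinates) so the sketch runs offline
FakeLocation = namedtuple("FakeLocation", "latitude longitude")


def fake_geocode(query):
    known = {"Moema, Sao Paulo, Brazil": FakeLocation(-23.6, -46.66)}
    return known.get(query)


result = geocode_neighborhoods(["Moema", "Nowhere"], fake_geocode)
print(result)
```

With geopy, the callable would be `geolocator.geocode`, optionally wrapped in `geopy.extra.rate_limiter.RateLimiter(geolocator.geocode, min_delay_seconds=1)` to space out the requests sent to Nominatim.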


In [156]:
address = 'Centro, Sao Paulo,Brazil'

geolocator = Nominatim(user_agent="sp_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Sao Paulo City are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Sao Paulo City are -23.550389799999998, -46.633080956332904.


#### Create a map of Sao Paulo neighborhoods

In [157]:
# create map of Sao Paulo using latitude and longitude values
map_sp = folium.Map(location=[latitude, longitude], zoom_start=12)

# add markers to map
for lat, lng, neighborhood in zip(df_neigh['Latitude'], df_neigh['Longitude'], df_neigh['Neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_sp)
    
map_sp

## Now let's explore the neighborhoods using Foursquare

In [158]:
import os

# Credentials are read from environment variables (the variable names are arbitrary)
# so that they are neither stored in the notebook nor printed in its output
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret
ACCESS_TOKEN = os.environ['FOURSQUARE_ACCESS_TOKEN'] # your Foursquare Access Token
VERSION = '20180604'
LIMIT = 30
print('Credentials loaded.')

Credentials loaded.


#### This function explores the venues in each neighborhood

In [159]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
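
The function above indexes straight into the response, so a venue without categories or a response without groups would raise an error. The following is a sketch of a more tolerant parsing step, assuming the same response layout the function relies on (`response` → `groups[0]` → `items` → `venue`). `extract_venues` and the sample payload are hypothetical and only for illustration:

```python
def extract_venues(payload, name, lat, lng):
    """Flatten one explore-style response into rows, tolerating missing pieces."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        # A venue may come without categories: fall back to an empty one
        categories = venue.get("categories") or [{}]
        rows.append((
            name, lat, lng,
            venue.get("name"),
            location.get("lat"),
            location.get("lng"),
            categories[0].get("name"),
        ))
    return rows


# Made-up payload with one complete venue and one venue without categories
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "A", "location": {"lat": 1.0, "lng": 2.0},
               "categories": [{"name": "Italian Restaurant"}]}},
    {"venue": {"name": "B", "location": {"lat": 3.0, "lng": 4.0},
               "categories": []}},
]}]}}

rows = extract_venues(sample, "Toy", 0.0, 0.0)
empty = extract_venues({"response": {}}, "Toy", 0.0, 0.0)
print(rows)
```

Rows with a missing category come back with `None` in the last position and can be dropped or kept later, instead of stopping the whole loop over neighborhoods.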

In [160]:
sp_venues = getNearbyVenues(names=df_neigh['Neighborhood'],
                                   latitudes=df_neigh['Latitude'],
                                   longitudes=df_neigh['Longitude']
                                  )

Alto de Pinheiros
Anhanguera
Aricanduva
Artur Alvim
Barra Funda
Bela Vista
Belém
Bom Retiro
Brasilândia
Brás
Butantã
Cachoeirinha
Cambuci
Campo Belo
Campo Grande
Campo Limpo
Cangaíba
Capela do Socorro
Capão Redondo
Carrão
Casa Verde
Cidade Ademar
Cidade Dutra
Cidade Líder
Cidade Tiradentes
Consolação
Cursino
Ermelino Matarazzo
Freguesia do Ó
Grajaú
Guaianases
Iguatemi
Ipiranga
Itaim Bibi
Itaim Paulista
Itaquera
Jabaquara
Jaguara
Jaguaré
Jaraguá
Jardim Helena
Jardim Paulista
Jardim São Luís
Jardim Ângela
Jaçanã
José Bonifácio
Lajeado
Lapa
Liberdade
Limão
M'Boi Mirim
Mandaqui
Marsilac
Moema
Mooca
Morumbi
Moóca
Parelheiros
Pari
Parque do Carmo
Pedreira
Penha
Perdizes
Perus
Pinheiros
Pirituba
Ponte Rasa
Raposo Tavares
República
Rio Pequeno
Sacomã
Santa Cecília
Santana
Santo Amaro
Sapopemba
Saúde
Socorro
São Domingos
São Lucas
São Mateus
São Miguel
São Paulo
São Rafael
Sé
Tatuapé
Tremembé
Tucuruvi
Vila Andrade
Vila Curuçá
Vila Formosa
Vila Guilherme
Vila Jacuí
Vila Leopoldina
Vila Maria
Vil

In [161]:
sp_venues.shape

(1819, 7)

In [162]:
sp_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Alto de Pinheiros,-23.549461,-46.712293,Praça Conde de Barcelos,-23.552146,-46.713696,Dog Run
1,Alto de Pinheiros,-23.549461,-46.712293,Pista de Caminhada,-23.550455,-46.712494,Trail
2,Alto de Pinheiros,-23.549461,-46.712293,Casa do BemStar,-23.547963,-46.710297,Gym / Fitness Center
3,Alto de Pinheiros,-23.549461,-46.712293,Praça Pero Vaz de Caminha,-23.550362,-46.711968,Plaza
4,Alto de Pinheiros,-23.549461,-46.712293,Shiatsu Luiza Sato,-23.546862,-46.710579,Spa


Let's check how many venues were returned for each neighborhood

In [163]:
sp_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Alto de Pinheiros,11,11,11,11,11,11
Anhanguera,10,10,10,10,10,10
Aricanduva,7,7,7,7,7,7
Artur Alvim,20,20,20,20,20,20
Barra Funda,30,30,30,30,30,30
Bela Vista,30,30,30,30,30,30
Belém,27,27,27,27,27,27
Bom Retiro,30,30,30,30,30,30
Brasilândia,3,3,3,3,3,3
Brás,21,21,21,21,21,21


#### Let's find out how many unique categories can be curated from all the returned venues

In [164]:
print('There are {} unique categories.'.format(len(sp_venues['Venue Category'].unique())))

There are 250 unique categories.


### Let's keep only restaurants and bistros from the dataset to take a deeper look at restaurants

In [165]:
sp_venues=sp_venues[sp_venues['Venue Category'].str.contains("estaurant") | sp_venues['Venue Category'].str.contains("istro") ]


In [166]:
sp_venues.shape

(321, 7)
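
The two `str.contains` calls match "estaurant" and "istro" so that both upper- and lower-case initials are caught. A sketch of an equivalent single filter on made-up categories, using one case-insensitive pattern:

```python
import pandas as pd

# Toy categories (illustrative only)
toy = pd.DataFrame({"Venue Category": [
    "Italian Restaurant", "Bistro", "Plaza", "Gym / Fitness Center", "Sushi Restaurant",
]})

# One case-insensitive pattern covers both words
is_food = toy["Venue Category"].str.contains(r"restaurant|bistro", case=False, regex=True)

# .copy() gives an independent dataframe that can be modified later without warnings
restaurants = toy[is_food].copy()
print(restaurants)
```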

In [167]:
sp_venues[:20]

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
43,Artur Alvim,-23.539221,-46.485265,I Love Yakissoba,-23.540614,-46.484996,Asian Restaurant
46,Artur Alvim,-23.539221,-46.485265,kayama sushi,-23.537377,-46.487119,Asian Restaurant
50,Barra Funda,-23.525462,-46.667513,Moça Prendada,-23.52336,-46.664607,Restaurant
57,Barra Funda,-23.525462,-46.667513,Tanta Felicità Ristorante,-23.524836,-46.66344,Italian Restaurant
63,Barra Funda,-23.525462,-46.667513,Seleto Nutri Service Restaurante,-23.522045,-46.664659,Brazilian Restaurant
64,Barra Funda,-23.525462,-46.667513,Lá Em Minas,-23.528232,-46.671354,Brazilian Restaurant
74,Barra Funda,-23.525462,-46.667513,Ydalah Lounge Sushi Bar,-23.522818,-46.664351,Japanese Restaurant
77,Barra Funda,-23.525462,-46.667513,Bar e Lanchonete Matarazzo,-23.528023,-46.67088,Brazilian Restaurant
85,Bela Vista,-23.56221,-46.647766,Bánh Mì Vietnam,-23.563338,-46.649773,Vietnamese Restaurant
98,Bela Vista,-23.56221,-46.647766,Osteria Generale,-23.565734,-46.646792,Italian Restaurant


## Analyze each neighborhood

#### Let's build a new table with one column per restaurant category

In [168]:
# one hot encoding
sp_onehot = pd.get_dummies(sp_venues[['Venue Category']], prefix="", prefix_sep="")
sp_onehot.head()

Unnamed: 0,American Restaurant,Argentinian Restaurant,Asian Restaurant,Baiano Restaurant,Bistro,Brazilian Restaurant,Cajun / Creole Restaurant,Chinese Restaurant,Comfort Food Restaurant,Doner Restaurant,Dumpling Restaurant,Empanada Restaurant,Falafel Restaurant,Fast Food Restaurant,German Restaurant,Greek Restaurant,Italian Restaurant,Japanese Restaurant,Jewish Restaurant,Kebab Restaurant,Korean Restaurant,Latin American Restaurant,Mexican Restaurant,Middle Eastern Restaurant,Mineiro Restaurant,Northeastern Brazilian Restaurant,Northern Brazilian Restaurant,Paella Restaurant,Persian Restaurant,Peruvian Restaurant,Portuguese Restaurant,Restaurant,Seafood Restaurant,Southeastern Brazilian Restaurant,Spanish Restaurant,Sushi Restaurant,Swiss Restaurant,Tapas Restaurant,Thai Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant
43,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
46,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
50,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
57,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
63,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [169]:
# add neighborhood column back to dataframe
sp_onehot['Neighborhood'] = sp_venues['Neighborhood'] 
sp_onehot.head()

Unnamed: 0,American Restaurant,Argentinian Restaurant,Asian Restaurant,Baiano Restaurant,Bistro,Brazilian Restaurant,Cajun / Creole Restaurant,Chinese Restaurant,Comfort Food Restaurant,Doner Restaurant,Dumpling Restaurant,Empanada Restaurant,Falafel Restaurant,Fast Food Restaurant,German Restaurant,Greek Restaurant,Italian Restaurant,Japanese Restaurant,Jewish Restaurant,Kebab Restaurant,Korean Restaurant,Latin American Restaurant,Mexican Restaurant,Middle Eastern Restaurant,Mineiro Restaurant,Northeastern Brazilian Restaurant,Northern Brazilian Restaurant,Paella Restaurant,Persian Restaurant,Peruvian Restaurant,Portuguese Restaurant,Restaurant,Seafood Restaurant,Southeastern Brazilian Restaurant,Spanish Restaurant,Sushi Restaurant,Swiss Restaurant,Tapas Restaurant,Thai Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Neighborhood
43,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Artur Alvim
46,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Artur Alvim
50,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,Barra Funda
57,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Barra Funda
63,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Barra Funda


In [170]:
# move neighborhood column to the first column
fixed_columns = [sp_onehot.columns[-1]] + list(sp_onehot.columns[:-1])
sp_onehot = sp_onehot[fixed_columns]

sp_onehot.head()

Unnamed: 0,Neighborhood,American Restaurant,Argentinian Restaurant,Asian Restaurant,Baiano Restaurant,Bistro,Brazilian Restaurant,Cajun / Creole Restaurant,Chinese Restaurant,Comfort Food Restaurant,Doner Restaurant,Dumpling Restaurant,Empanada Restaurant,Falafel Restaurant,Fast Food Restaurant,German Restaurant,Greek Restaurant,Italian Restaurant,Japanese Restaurant,Jewish Restaurant,Kebab Restaurant,Korean Restaurant,Latin American Restaurant,Mexican Restaurant,Middle Eastern Restaurant,Mineiro Restaurant,Northeastern Brazilian Restaurant,Northern Brazilian Restaurant,Paella Restaurant,Persian Restaurant,Peruvian Restaurant,Portuguese Restaurant,Restaurant,Seafood Restaurant,Southeastern Brazilian Restaurant,Spanish Restaurant,Sushi Restaurant,Swiss Restaurant,Tapas Restaurant,Thai Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant
43,Artur Alvim,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
46,Artur Alvim,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
50,Barra Funda,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
57,Barra Funda,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
63,Barra Funda,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


#### Next, let's group rows by neighborhood, taking the mean of the frequency of occurrence of each category

In [171]:
sp_grouped = sp_onehot.groupby('Neighborhood').mean().reset_index()
sp_grouped.head()

Unnamed: 0,Neighborhood,American Restaurant,Argentinian Restaurant,Asian Restaurant,Baiano Restaurant,Bistro,Brazilian Restaurant,Cajun / Creole Restaurant,Chinese Restaurant,Comfort Food Restaurant,Doner Restaurant,Dumpling Restaurant,Empanada Restaurant,Falafel Restaurant,Fast Food Restaurant,German Restaurant,Greek Restaurant,Italian Restaurant,Japanese Restaurant,Jewish Restaurant,Kebab Restaurant,Korean Restaurant,Latin American Restaurant,Mexican Restaurant,Middle Eastern Restaurant,Mineiro Restaurant,Northeastern Brazilian Restaurant,Northern Brazilian Restaurant,Paella Restaurant,Persian Restaurant,Peruvian Restaurant,Portuguese Restaurant,Restaurant,Seafood Restaurant,Southeastern Brazilian Restaurant,Spanish Restaurant,Sushi Restaurant,Swiss Restaurant,Tapas Restaurant,Thai Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant
0,Artur Alvim,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Barra Funda,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Bela Vista,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.75,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25
3,Belém,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Bom Retiro,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.083333,0.0,0.0,0.083333,0.0,0.416667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.083333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.083333
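To make the averaging step concrete, here is a minimal sketch on toy data. The rows and the `Category` column name are made up for illustration, chosen so that Barra Funda reproduces the 0.5 / 0.166667 pattern shown above. Because each venue row holds a single 1, the per-neighborhood mean is the share of venues in each category, and every row of the result sums to 1.

```python
import pandas as pd

# Toy venue list (hypothetical rows, not the real Foursquare response).
venues = pd.DataFrame({
    "Neighborhood": ["Barra Funda"] * 6 + ["Artur Alvim"] * 2,
    "Category": ["Brazilian Restaurant"] * 3
                + ["Italian Restaurant", "Japanese Restaurant", "Restaurant"]
                + ["Asian Restaurant"] * 2,
})

# One-hot encode the category, then put the neighborhood back as the first column.
onehot = pd.get_dummies(venues["Category"], dtype=int)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# Mean of 0/1 indicators per neighborhood = share of venues in each category.
freq = onehot.groupby("Neighborhood").mean()
print(freq.round(6))
```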


### Let's group similar cuisines into a few broader groups
This will make the clustering easier to interpret later

In [172]:
sp_grouped['Brazilian Food']=sp_grouped['Restaurant']+sp_grouped['Baiano Restaurant']+sp_grouped['Brazilian Restaurant']+sp_grouped['Mineiro Restaurant']+sp_grouped['Northeastern Brazilian Restaurant']+sp_grouped['Northern Brazilian Restaurant']+sp_grouped['Southeastern Brazilian Restaurant']+sp_grouped['Seafood Restaurant']+sp_grouped['Fast Food Restaurant']
sp_grouped['Argentinian Food']=sp_grouped['Argentinian Restaurant']+sp_grouped['Empanada Restaurant']
sp_grouped['Jewish/Arabian Food']=sp_grouped['Jewish Restaurant']+sp_grouped['Kebab Restaurant']+sp_grouped['Falafel Restaurant']+sp_grouped['Middle Eastern Restaurant']+sp_grouped['Persian Restaurant']+sp_grouped['Doner Restaurant']
sp_grouped['American Food']=sp_grouped['American Restaurant']+sp_grouped['Cajun / Creole Restaurant']+sp_grouped['Comfort Food Restaurant']
sp_grouped['Vegan Food']=sp_grouped['Vegetarian / Vegan Restaurant']
sp_grouped['Portuguese Food']=sp_grouped['Portuguese Restaurant'] 
sp_grouped['Spanish Food']=sp_grouped['Paella Restaurant']+sp_grouped['Spanish Restaurant']+sp_grouped['Tapas Restaurant']
sp_grouped['German Food']=sp_grouped['German Restaurant']
#sp_grouped['French Food']=sp_grouped['French Restaurant']
sp_grouped['Italian Food']=sp_grouped['Italian Restaurant']
sp_grouped['Mexican Food']=sp_grouped['Mexican Restaurant']
sp_grouped['Asian Food']=sp_grouped['Asian Restaurant']+sp_grouped['Chinese Restaurant']+sp_grouped['Dumpling Restaurant']+sp_grouped['Japanese Restaurant']+sp_grouped['Korean Restaurant']+sp_grouped['Sushi Restaurant']+sp_grouped['Thai Restaurant']+sp_grouped['Vietnamese Restaurant']

sp_grouped=sp_grouped.filter(['Neighborhood','Argentinian Food','American Food', 'Asian Food', 'Brazilian Food','French Food','German Food','Italian Food','Mexican Food','Jewish/Arabian Food','Portuguese Food','Spanish Food','Vegan Food'])
sp_grouped.head()

Unnamed: 0,Neighborhood,Argentinian Food,American Food,Asian Food,Brazilian Food,German Food,Italian Food,Mexican Food,Jewish/Arabian Food,Portuguese Food,Spanish Food,Vegan Food
0,Artur Alvim,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Barra Funda,0.0,0.0,0.166667,0.666667,0.0,0.166667,0.0,0.0,0.0,0.0,0.0
2,Bela Vista,0.0,0.0,0.25,0.0,0.0,0.75,0.0,0.0,0.0,0.0,0.0
3,Belém,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Bom Retiro,0.0,0.0,0.5,0.333333,0.0,0.0,0.0,0.083333,0.0,0.0,0.0
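The chained column additions above can also be driven by a mapping from group name to source columns. The sketch below is one possible way to do that, not the notebook's code: `FOOD_GROUPS` and `collapse_categories` are hypothetical names, the mapping is deliberately abbreviated, and the sample row reuses the Bela Vista values shown earlier.

```python
import pandas as pd

# Hypothetical, abbreviated mapping: new group name -> original category columns.
FOOD_GROUPS = {
    "Asian Food": ["Asian Restaurant", "Vietnamese Restaurant"],
    "Italian Food": ["Italian Restaurant"],
    "French Food": ["French Restaurant"],  # not present in the sample below
}

def collapse_categories(df, groups):
    """Sum the listed category columns into one column per group."""
    out = df[["Neighborhood"]].copy()
    for name, cols in groups.items():
        # reindex supplies a zero-filled column for any category missing from df
        out[name] = df.reindex(columns=cols, fill_value=0.0).sum(axis=1)
    return out

sample = pd.DataFrame({
    "Neighborhood": ["Bela Vista"],
    "Italian Restaurant": [0.75],
    "Vietnamese Restaurant": [0.25],
})
collapsed = collapse_categories(sample, FOOD_GROUPS)
print(collapsed)
```

One design difference: a group whose source columns are all missing becomes a column of zeros here, whereas `filter` in the cell above silently leaves out `'French Food'` because that column was never created.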


#### Let's print each neighborhood along with the top 5 restaurants

In [173]:
num_top_venues = 5

for hood in sp_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = sp_grouped[sp_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Artur Alvim----
              venue  freq
0        Asian Food   1.0
1  Argentinian Food   0.0
2     American Food   0.0
3    Brazilian Food   0.0
4       German Food   0.0


----Barra Funda----
              venue  freq
0    Brazilian Food  0.67
1        Asian Food  0.17
2      Italian Food  0.17
3  Argentinian Food  0.00
4     American Food  0.00


----Bela Vista----
              venue  freq
0      Italian Food  0.75
1        Asian Food  0.25
2  Argentinian Food  0.00
3     American Food  0.00
4    Brazilian Food  0.00


----Belém----
              venue  freq
0    Brazilian Food   1.0
1  Argentinian Food   0.0
2     American Food   0.0
3        Asian Food   0.0
4       German Food   0.0


----Bom Retiro----
                 venue  freq
0           Asian Food  0.50
1       Brazilian Food  0.33
2  Jewish/Arabian Food  0.08
3     Argentinian Food  0.00
4        American Food  0.00


----Brás----
              venue  freq
0    Brazilian Food   1.0
1  Argentinian Food   0.0
2     Ame

#### Now let's create the new dataframe and display the top 10 venues for each neighborhood.

In [174]:
#function to order the results
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [175]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = sp_grouped['Neighborhood']

for ind in np.arange(sp_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(sp_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Artur Alvim,Asian Food,Vegan Food,Spanish Food,Portuguese Food,Jewish/Arabian Food,Mexican Food,Italian Food,German Food,Brazilian Food,American Food
1,Barra Funda,Brazilian Food,Italian Food,Asian Food,Vegan Food,Spanish Food,Portuguese Food,Jewish/Arabian Food,Mexican Food,German Food,American Food
2,Bela Vista,Italian Food,Asian Food,Vegan Food,Spanish Food,Portuguese Food,Jewish/Arabian Food,Mexican Food,German Food,Brazilian Food,American Food
3,Belém,Brazilian Food,Vegan Food,Spanish Food,Portuguese Food,Jewish/Arabian Food,Mexican Food,Italian Food,German Food,Asian Food,American Food
4,Bom Retiro,Asian Food,Brazilian Food,Jewish/Arabian Food,Vegan Food,Spanish Food,Portuguese Food,Mexican Food,Italian Food,German Food,American Food
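In the rows shown above, each neighborhood has at most three categories with a non-zero frequency, so the lower ranks in the table are ties at 0.0 and their order carries no information. A minimal sketch of a variant that keeps only the categories actually present, using the Bom Retiro row from the grouped table (the function name `top_present_categories` is hypothetical):

```python
import pandas as pd

def top_present_categories(row, num_top_venues):
    """Return up to num_top_venues category names with frequency > 0, highest first."""
    freqs = row.drop("Neighborhood").astype(float)
    present = freqs[freqs > 0].sort_values(ascending=False)
    return list(present.index[:num_top_venues])

# Bom Retiro row, copied from the grouped table above.
bom_retiro = pd.Series({
    "Neighborhood": "Bom Retiro",
    "Argentinian Food": 0.0, "American Food": 0.0, "Asian Food": 0.5,
    "Brazilian Food": 0.333333, "German Food": 0.0, "Italian Food": 0.0,
    "Mexican Food": 0.0, "Jewish/Arabian Food": 0.083333,
    "Portuguese Food": 0.0, "Spanish Food": 0.0, "Vegan Food": 0.0,
})

top = top_present_categories(bom_retiro, 10)
print(top)
```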


## Cluster Neighborhoods

Run _k_-means to cluster the neighborhoods into 6 clusters.

In [176]:
# set number of clusters
kclusters = 6

sp_grouped_clustering = sp_grouped.drop('Neighborhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(sp_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([2, 1, 4, 3, 1, 3, 1, 1, 1, 3], dtype=int32)
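The number of clusters is a free parameter. One common way to compare candidate values is the silhouette score. The sketch below applies it to synthetic points, not the restaurant data, so the resulting `best_k` says nothing about São Paulo; on the real data one would pass `sp_grouped_clustering` in place of `X`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three tight, well-separated groups of four points each.
offsets = np.array([[0.5, 0.5], [0.5, -0.5], [-0.5, 0.5], [-0.5, -0.5]])
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + offsets for c in centers])

# Fit k-means for several candidate k and record the mean silhouette score.
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Higher silhouette means tighter, better separated clusters.
best_k = max(scores, key=scores.get)
print(scores, best_k)
```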

In [177]:
df_neigh.head()

Unnamed: 0,Neighborhood,Latitude,Longitude
0,Alto de Pinheiros,-23.549461,-46.712293
1,Anhanguera,-23.432908,-46.788534
2,Aricanduva,-23.578024,-46.511454
3,Artur Alvim,-23.539221,-46.485265
4,Barra Funda,-23.525462,-46.667513


#### Adding coordinates to the dataframe 

In [178]:
# add clustering labels and geo coordinates
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

sp_merged = df_neigh

# join df_neigh with neighborhoods_venues_sorted to add latitude/longitude for each neighborhood
sp_merged = sp_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')
# drop neighborhoods that received no cluster label (no venues returned by Foursquare)
sp_merged=sp_merged.dropna(subset=['Cluster Labels'])
sp_merged["Cluster Labels"]=sp_merged["Cluster Labels"].astype(int)

sp_merged

Unnamed: 0,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
3,Artur Alvim,-23.539221,-46.485265,2,Asian Food,Vegan Food,Spanish Food,Portuguese Food,Jewish/Arabian Food,Mexican Food,Italian Food,German Food,Brazilian Food,American Food
4,Barra Funda,-23.525462,-46.667513,1,Brazilian Food,Italian Food,Asian Food,Vegan Food,Spanish Food,Portuguese Food,Jewish/Arabian Food,Mexican Food,German Food,American Food
5,Bela Vista,-23.56221,-46.647766,4,Italian Food,Asian Food,Vegan Food,Spanish Food,Portuguese Food,Jewish/Arabian Food,Mexican Food,German Food,Brazilian Food,American Food
6,Belém,-23.534883,-46.594939,3,Brazilian Food,Vegan Food,Spanish Food,Portuguese Food,Jewish/Arabian Food,Mexican Food,Italian Food,German Food,Asian Food,American Food
7,Bom Retiro,-23.527138,-46.636835,1,Asian Food,Brazilian Food,Jewish/Arabian Food,Vegan Food,Spanish Food,Portuguese Food,Mexican Food,Italian Food,German Food,American Food
9,Brás,-23.545326,-46.616444,3,Brazilian Food,Vegan Food,Spanish Food,Portuguese Food,Jewish/Arabian Food,Mexican Food,Italian Food,German Food,Asian Food,American Food
10,Butantã,-23.569056,-46.721883,1,Brazilian Food,Vegan Food,Spanish Food,Portuguese Food,Jewish/Arabian Food,Mexican Food,Italian Food,German Food,Asian Food,American Food
13,Cambuci,-23.566128,-46.61365,1,Brazilian Food,Vegan Food,Spanish Food,Portuguese Food,Jewish/Arabian Food,Mexican Food,Italian Food,German Food,Asian Food,American Food
14,Campo Belo,-23.626731,-46.669429,1,Brazilian Food,Italian Food,Vegan Food,Jewish/Arabian Food,German Food,Argentinian Food,Spanish Food,Portuguese Food,Mexican Food,Asian Food
15,Campo Grande,-23.675548,-46.687234,3,Brazilian Food,Vegan Food,Spanish Food,Portuguese Food,Jewish/Arabian Food,Mexican Food,Italian Food,German Food,Asian Food,American Food
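The left join followed by `dropna` keeps only neighborhoods present in both frames. A minimal sketch of the same result with an inner merge, on toy frames that reuse a few rows shown above (Anhanguera stands in for a neighborhood without a cluster label, since it does not appear in the merged output):

```python
import pandas as pd

coords = pd.DataFrame({
    "Neighborhood": ["Anhanguera", "Artur Alvim", "Barra Funda"],
    "Latitude": [-23.432908, -23.539221, -23.525462],
    "Longitude": [-46.788534, -46.485265, -46.667513],
})
labels = pd.DataFrame({
    "Neighborhood": ["Artur Alvim", "Barra Funda"],
    "Cluster Labels": [2, 1],
})

# Inner merge: unmatched neighborhoods are dropped and labels stay integers,
# so no dropna / astype(int) round trip is needed.
merged = coords.merge(labels, on="Neighborhood", how="inner")
print(merged)
```

One difference worth noting: `merge` renumbers the index, whereas `join` keeps the index of the left frame.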


In [179]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(sp_merged['Latitude'], sp_merged['Longitude'], sp_merged['Neighborhood'], sp_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
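One detail of the marker loop is worth spelling out: `rainbow[cluster-1]` shifts every label by one, and for cluster 0 the index becomes -1, which Python resolves to the last entry of the list. The sketch below uses placeholder color names rather than real hex codes to show the resulting assignment; `color_as_written` and `color_direct` are hypothetical helper names.

```python
kclusters = 6
palette = ["color_{}".format(i) for i in range(kclusters)]  # placeholders, not hex codes

def color_as_written(cluster):
    # Same indexing as the map cell: cluster 0 wraps around to the last color.
    return palette[cluster - 1]

def color_direct(cluster):
    # Alternative: label i uses the i-th color.
    return palette[cluster]

for c in range(kclusters):
    print(c, color_as_written(c), color_direct(c))
```

Both variants give each cluster a distinct color; they differ only in which color goes with which label, so any legend has to follow the indexing the map actually used. Since matplotlib's `rainbow` colormap runs from purple to red, the wrap-around is consistent with cluster 0 being drawn in red.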

## Results section 


#### From the clustering of São Paulo restaurants, we can read the groups as follows:

- Cluster group 0 - Red - Mostly Brazilian Food
- Cluster group 1 - Purple - Mostly Asian and Vegan Food
- Cluster group 2 - Blue - Mostly American and Jewish/Arabian Food
- Cluster group 3 - Light Blue - Mostly Brazilian, Vegan and Spanish Food
- Cluster group 4 - Green - Mostly Asian and Jewish/Arabian Food
- Cluster group 5 - Orange - Mostly Italian Food

#### To decide which type of restaurant to open, we need to analyze these clusters together with other data, such as financial and educational data. This will help us discover the best place to open the restaurant.

### Let's now select the place based on economic and educational data

In [180]:
df_inc

Unnamed: 0,Neighborhood,Total,Até 1/2,Mais de 1/2 a 1,Mais de 1 a 2,Mais de 2 a 5,Mais de 5 a 10,Mais de 10 a 20,Mais de 20,Sem rendimento (3)
0,São Paulo,3574286.0,20129.0,225166.0,588778.0,1212485.0,714900.0,380801.0,224798.0,202016.0
1,Aricanduva/Formosa/Carrão,85188.0,197.0,4788.0,11237.0,28095.0,21081.0,10898.0,4228.0,4622.0
2,Aricanduva,27661.0,90.0,1996.0,4457.0,10327.0,6550.0,2402.0,475.0,1341.0
3,Carrão,27115.0,42.0,1266.0,2908.0,8239.0,7254.0,4400.0,1585.0,1418.0
4,Vila Formosa,30412.0,65.0,1526.0,3872.0,9529.0,7277.0,4096.0,2168.0,1863.0
5,Butantã,135821.0,482.0,5860.0,16371.0,37882.0,28879.0,22223.0,17181.0,6571.0
6,Butantã,18542.0,42.0,561.0,1129.0,3684.0,4815.0,4640.0,3042.0,628.0
7,Morumbi,15448.0,37.0,425.0,1137.0,2450.0,2261.0,2654.0,5284.0,1152.0
8,Raposo Tavares,29865.0,125.0,1682.0,5427.0,11584.0,6233.0,2437.0,689.0,1678.0
9,Rio Pequeno,37308.0,135.0,1932.0,4983.0,11348.0,7968.0,5754.0,3181.0,1717.0


In [181]:
indexNames = df_inc[(df_inc['Neighborhood'] == 'Aricanduva/Formosa/Carrão') | \
                    (df_inc['Neighborhood'] == 'Casa Verde/Cachoeirinha') | \
                    (df_inc['Neighborhood'] == 'Freguesia/Brasilândia') | \
                    (df_inc['Neighborhood'] == 'Jaçanã/Tremembé') | \
                    (df_inc['Neighborhood'] == 'Santana/Tucuruvi') | \
                    (df_inc['Neighborhood'] == 'Vila Maria/Vila Guilherme') | \
                    (df_inc['Neighborhood'] == 'São Paulo') | \
                    (df_inc['Neighborhood'] == 'Vila Prudente/Sapopemba')].index

In [182]:
df_inc.drop(indexNames , inplace=True)
df_inc = df_inc.iloc[:-5]
df_inc.tail()

Unnamed: 0,Neighborhood,Total,Até 1/2,Mais de 1/2 a 1,Mais de 1 a 2,Mais de 2 a 5,Mais de 5 a 10,Mais de 10 a 20,Mais de 20,Sem rendimento (3)
122,Saúde,49278,141,956,2463,8720,11752,13202,9549,2382
123,Vila Mariana,51822,77,549,1678,7032,11350,13626,14123,2275
125,São Lucas,45770,135,2593,6607,16480,12162,4856,752,2176
126,Sapopemba,84686,721,7606,18943,35860,14426,2824,294,4003
127,Vila Prudente,34707,369,1759,4374,11294,8308,4933,1888,1780
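The chain of `==` comparisons used to drop the aggregate rows can be written more compactly with `isin`. The sketch below uses a few rows from the income table above; `df_inc_demo` and `AGGREGATE_ROWS` are hypothetical names, and the list holds only two of the aggregates. Taking an explicit `.copy()` after filtering or slicing is also one common way to avoid the `SettingWithCopyWarning` that appears in later cells when columns are assigned.

```python
import pandas as pd

df_inc_demo = pd.DataFrame({
    "Neighborhood": ["São Paulo", "Aricanduva/Formosa/Carrão", "Aricanduva", "Carrão"],
    "Total": [3574286.0, 85188.0, 27661.0, 27115.0],
})

# Rows that aggregate several districts (abbreviated list).
AGGREGATE_ROWS = ["São Paulo", "Aricanduva/Formosa/Carrão"]

# Keep only district-level rows; .copy() makes the result an independent frame.
districts = df_inc_demo[~df_inc_demo["Neighborhood"].isin(AGGREGATE_ROWS)].copy()

# Assigning a column on the copy does not trigger the warning.
districts["Total"] = districts["Total"].astype(int)
print(districts)
```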


In [183]:
df_inc['Neighborhood'] = df_inc['Neighborhood'].replace(['São Miguel\xa0'],'São Miguel')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


In [184]:
df_inc.columns=['Neighborhood', 'Total', 'Até 1/2', 'Mais de 1/2 a 1', 'Mais de 1 a 2', 'Mais de 2 a 5', 'Mais de 5 a 10', 'Mais de 10 a 20', 'Mais de 20', 'Sem rendimento']

In [185]:
df_inc.head()

Unnamed: 0,Neighborhood,Total,Até 1/2,Mais de 1/2 a 1,Mais de 1 a 2,Mais de 2 a 5,Mais de 5 a 10,Mais de 10 a 20,Mais de 20,Sem rendimento
2,Aricanduva,27661,90,1996,4457,10327,6550,2402,475,1341
3,Carrão,27115,42,1266,2908,8239,7254,4400,1585,1418
4,Vila Formosa,30412,65,1526,3872,9529,7277,4096,2168,1863
5,Butantã,135821,482,5860,16371,37882,28879,22223,17181,6571
6,Butantã,18542,42,561,1129,3684,4815,4640,3042,628


In [186]:
df_inc['Total']=df_inc['Total'].astype(int)
df_inc['Até 1/2']=df_inc['Até 1/2'].astype(int)
df_inc['Mais de 1/2 a 1']=df_inc['Total'].astype(int)
df_inc['Mais de 1 a 2']=df_inc['Mais de 1 a 2'].astype(int)
df_inc['Mais de 2 a 5']=df_inc['Mais de 2 a 5'].astype(int)
df_inc['Mais de 5 a 10']=df_inc['Mais de 5 a 10'].astype(int)
df_inc['Mais de 10 a 20']=df_inc['Mais de 10 a 20'].astype(int)
df_inc['Mais de 20']=df_inc['Mais de 20'].astype(int)
df_inc['Sem rendimento']=df_inc['Sem rendimento'].astype(int)
df_inc['Até 10']=df_inc['Até 1/2']+df_inc['Mais de 1/2 a 1']+df_inc['Mais de 2 a 5']+df_inc['Mais de 5 a 10']
df_inc['Acima de 10']=df_inc['Mais de 10 a 20']+df_inc['Mais de 20']
df_inc = df_inc.filter(items=['Neighborhood','Sem rendimento','Até 10','Acima de 10'])
inc_grouped = df_inc.groupby('Neighborhood').mean().reset_index()
inc_grouped.head()


Unnamed: 0,Neighborhood,Sem rendimento,Até 10,Acima de 10
0,Alto de Pinheiros,919.0,19984.0,9211.0
1,Anhanguera,707.0,31416.0,609.0
2,Aricanduva,1341.0,44628.0,2877.0
3,Artur Alvim,1074.0,56243.0,2285.0
4,Barra Funda,186.0,8166.0,2405.0
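A quick check on the Aricanduva row (values from the income table above) shows what this cell computes. As written, `'Mais de 1/2 a 1'` is overwritten with `Total` and `'Mais de 1 a 2'` is left out of the sum, which reproduces the 44628 shown for `Até 10`. If `Até 10` is meant to cover every bracket up to 10 (an assumption about the intent), the sum would instead be 23420. The sketch computes both values; it does not rerun the clustering.

```python
import pandas as pd

# Aricanduva row from the income table.
row = pd.DataFrame({
    "Neighborhood": ["Aricanduva"],
    "Total": [27661],
    "Até 1/2": [90],
    "Mais de 1/2 a 1": [1996],
    "Mais de 1 a 2": [4457],
    "Mais de 2 a 5": [10327],
    "Mais de 5 a 10": [6550],
})

# What the notebook cell effectively sums for 'Até 10'.
as_written = row[["Até 1/2", "Total", "Mais de 2 a 5", "Mais de 5 a 10"]].sum(axis=1)

# Assumed intent: every bracket up to 10.
up_to_10 = ["Até 1/2", "Mais de 1/2 a 1", "Mais de 1 a 2",
            "Mais de 2 a 5", "Mais de 5 a 10"]
intended = row[up_to_10].sum(axis=1)

print(int(as_written.iloc[0]), int(intended.iloc[0]))
```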


#### Normalizing data

In [187]:
inc_neig=inc_grouped['Neighborhood']
inc_grouped.drop('Neighborhood',axis=1,inplace=True)
inc_grouped=((inc_grouped-inc_grouped.min())/(inc_grouped.max()-inc_grouped.min()))*20
inc_grouped.head()

Unnamed: 0,Sem rendimento,Até 10,Acima de 10
0,1.282702,1.228296,3.656858
1,0.911716,2.074754,0.23062
2,2.021174,3.053007,1.133981
3,1.553942,3.913015,0.898183
4,0.0,0.353258,0.94598


In [190]:
inc_grouped['Neighborhood']=inc_neig

### Clustering Income

In [191]:
# set number of clusters
kclusters = 5

# run k-means clustering
inc_kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(inc_grouped.drop(['Neighborhood'], axis=1))

# check cluster labels generated for each row in the dataframe
inc_kmeans.labels_[0:10] 

array([1, 1, 1, 1, 1, 1, 1, 1, 2, 1], dtype=int32)

In [192]:
inc_kmeans.labels_.shape

(99,)

In [193]:
inc_grouped.shape

(99, 4)

In [194]:
df_neigh.head()

Unnamed: 0,Neighborhood,Latitude,Longitude
0,Alto de Pinheiros,-23.549461,-46.712293
1,Anhanguera,-23.432908,-46.788534
2,Aricanduva,-23.578024,-46.511454
3,Artur Alvim,-23.539221,-46.485265
4,Barra Funda,-23.525462,-46.667513


In [195]:
inc_grouped.insert(0, 'Cluster Labels', inc_kmeans.labels_)

In [196]:
inc_grouped.head()

Unnamed: 0,Cluster Labels,Sem rendimento,Até 10,Acima de 10,Neighborhood
0,1,1.282702,1.228296,3.656858,Alto de Pinheiros
1,1,0.911716,2.074754,0.23062,Anhanguera
2,1,2.021174,3.053007,1.133981,Aricanduva
3,1,1.553942,3.913015,0.898183,Artur Alvim
4,1,0.0,0.353258,0.94598,Barra Funda


In [197]:
# add clustering labels
#inc_grouped.insert(0, 'Cluster Labels', kmeans.labels_)

inc_merged = inc_grouped
inc_merged = inc_merged.join(df_neigh.set_index('Neighborhood'), on='Neighborhood')

inc_merged=inc_merged.dropna(subset=['Cluster Labels'])
inc_merged["Cluster Labels"]=inc_merged["Cluster Labels"].astype(int)

inc_merged.head()

Unnamed: 0,Cluster Labels,Sem rendimento,Até 10,Acima de 10,Neighborhood,Latitude,Longitude
0,1,1.282702,1.228296,3.656858,Alto de Pinheiros,-23.549461,-46.712293
1,1,0.911716,2.074754,0.23062,Anhanguera,-23.432908,-46.788534
2,1,2.021174,3.053007,1.133981,Aricanduva,-23.578024,-46.511454
3,1,1.553942,3.913015,0.898183,Artur Alvim,-23.539221,-46.485265
4,1,0.0,0.353258,0.94598,Barra Funda,-23.525462,-46.667513


In [198]:
address = 'Centro, Sao Paulo,Brazil'

geolocator = Nominatim(user_agent="sp_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Sao Paulo City are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Sao Paulo City are -23.550389799999998, -46.633080956332904.
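geopy's `geocode` returns `None` when the service finds no match, in which case `location.latitude` would raise an `AttributeError`. A small guard is sketched here with a stand-in geocoder so that it runs offline; `lookup_coordinates` and `fake_geocode` are hypothetical names, and the coordinates are rounded from the output above. In the notebook one would pass `geolocator.geocode` instead of the stand-in.

```python
from types import SimpleNamespace

def lookup_coordinates(geocode, address):
    """Return (latitude, longitude) or raise if the geocoder finds nothing."""
    location = geocode(address)
    if location is None:
        raise ValueError("No geocoding result for {!r}".format(address))
    return location.latitude, location.longitude

def fake_geocode(address):
    # Stand-in for a real geocoder: it knows a single address.
    known = {
        "Centro, Sao Paulo,Brazil": SimpleNamespace(latitude=-23.5503898,
                                                    longitude=-46.633081),
    }
    return known.get(address)

latitude, longitude = lookup_coordinates(fake_geocode, "Centro, Sao Paulo,Brazil")
print(latitude, longitude)
```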


In [199]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(inc_merged['Latitude'], inc_merged['Longitude'], inc_merged['Neighborhood'], inc_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## Clustering Education

#### Review Education data

In [200]:
df_edu.head()

Unnamed: 0,Neighborhood,Fundamental,College,University
0,Alto de Pinheiros,10077.686005,8376.646471,23023.663492
1,Anhanguera,44641.502808,27990.750558,8214.733694
2,Aricanduva,48865.985529,26249.69685,11464.306756
3,Artur Alvim,46509.056355,39514.172049,15391.771105
4,Barra Funda,5106.950583,4222.710931,6369.337611


#### Normalizing data

In [201]:
edu_neig=df_edu['Neighborhood']
df_edu.drop('Neighborhood',axis=1,inplace=True)
edu_grouped=((df_edu-df_edu.min())/(df_edu.max()-df_edu.min()))*20
edu_grouped.head()

Unnamed: 0,Fundamental,College,University
0,0.27971,0.565673,4.906251
1,2.224664,2.423477,1.694159
2,2.462381,2.258568,2.398999
3,2.329754,3.51495,3.250875
4,0.0,0.172222,1.293888


#### Restore the Neighborhood column in the dataset

In [202]:
edu_grouped['Neighborhood']=edu_neig

In [203]:
# set number of clusters
kclusters = 5

# run k-means clustering
edu_kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(edu_grouped.drop(['Neighborhood'], axis=1))

# check cluster labels generated for each row in the dataframe
edu_kmeans.labels_[0:10] 

array([3, 0, 0, 0, 0, 3, 0, 0, 4, 0], dtype=int32)

In [204]:
# add clustering labels
#edu_grouped.drop('Cluster Labels',inplace=True,axis=1)
edu_grouped.insert(0, 'Cluster Labels', edu_kmeans.labels_)

edu_merged = edu_grouped
edu_merged = edu_merged.join(df_neigh.set_index('Neighborhood'), on='Neighborhood')

edu_merged=edu_merged.dropna(subset=['Cluster Labels'])
edu_merged=edu_merged.dropna(subset=['Latitude'])

edu_merged[0:90]

Unnamed: 0,Cluster Labels,Fundamental,College,University,Neighborhood,Latitude,Longitude
0,3,0.27971,0.565673,4.906251,Alto de Pinheiros,-23.549461,-46.712293
1,0,2.224664,2.423477,1.694159,Anhanguera,-23.432908,-46.788534
2,0,2.462381,2.258568,2.398999,Aricanduva,-23.578024,-46.511454
3,0,2.329754,3.51495,3.250875,Artur Alvim,-23.539221,-46.485265
4,0,0.0,0.172222,1.293888,Barra Funda,-23.525462,-46.667513
5,3,0.97209,1.419444,7.019069,Bela Vista,-23.56221,-46.647766
6,0,0.928791,1.168937,2.524248,Belém,-23.534883,-46.594939
7,0,0.68166,1.125969,1.30166,Bom Retiro,-23.527138,-46.636835
8,4,9.306075,8.201784,3.963293,Brasilândia,-23.448272,-46.690269
9,0,0.559929,1.028695,0.774215,Brás,-23.545326,-46.616444


In [205]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(edu_merged['Latitude'], edu_merged['Longitude'], edu_merged['Neighborhood'], edu_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters


### Clustering locations by new houses built

#### Normalizing data

In [206]:
home_neig=home_grouped['Neighborhood']
home_grouped.drop('Neighborhood',axis=1,inplace=True)
home_grouped=((home_grouped-home_grouped.min())/(home_grouped.max()-home_grouped.min()))*20
home_grouped.head()

Unnamed: 0,Total
0,0.198252
1,0.0
2,0.36781
3,1.025173
4,1.059084


Restore the Neighborhood column

In [207]:
home_grouped['Neighborhood']=home_neig

In [208]:
# set number of clusters
kclusters = 5

# run k-means clustering
home_kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(home_grouped.drop(['Neighborhood'], axis=1))

# check cluster labels generated for each row in the dataframe
home_kmeans.labels_[0:10] 

array([0, 0, 0, 0, 0, 2, 2, 2, 0, 1], dtype=int32)

In [209]:
# add clustering labels
#home_grouped.drop('Cluster Labels',inplace=True,axis=1)
home_grouped.insert(0, 'Cluster Labels', home_kmeans.labels_)

home_merged = home_grouped
home_merged = home_merged.join(df_neigh.set_index('Neighborhood'), on='Neighborhood')

home_merged=home_merged.dropna(subset=['Cluster Labels'])
home_merged=home_merged.dropna(subset=['Latitude'])

home_merged.head()

Unnamed: 0,Cluster Labels,Total,Neighborhood,Latitude,Longitude
0,0,0.198252,Alto de Pinheiros,-23.549461,-46.712293
1,0,0.0,Anhanguera,-23.432908,-46.788534
2,0,0.36781,Aricanduva,-23.578024,-46.511454
3,0,1.025173,Artur Alvim,-23.539221,-46.485265
4,0,1.059084,Barra Funda,-23.525462,-46.667513


In [210]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(home_merged['Latitude'], home_merged['Longitude'], home_merged['Neighborhood'], home_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

### Results for Education

- Red: more people with a low education level (incomplete high school or college)

- Blue: low education level, with few people holding a University degree

- Purple: average education level, with College and University degrees present in this group

- Orange: mostly people with a University degree

Based on this analysis, we should consider the Purple and Orange neighborhoods.

In [104]:
result_edu=edu_merged[edu_merged['Cluster Labels']==2]
result_edu

Unnamed: 0,Cluster Labels,Fundamental,College,University,Neighborhood,Latitude,Longitude
33,2,0.744586,2.131289,11.458527,Itaim Bibi,-23.584381,-46.678444
36,2,5.533108,6.828075,10.804535,Jabaquara,-23.652066,-46.650037
41,2,0.641773,1.364664,12.297041,Jardim Paulista,-23.567435,-46.663692
53,2,0.646782,1.577023,11.354375,Moema,-23.597085,-46.662888
54,2,7.406249,11.908252,20.0,Mooca,-23.560681,-46.597192
62,2,1.058519,1.871802,14.698541,Perdizes,-23.537929,-46.680671
75,2,1.914662,2.560113,14.036449,Saúde,-23.615178,-46.643393
94,2,1.192217,2.564469,16.443758,Vila Mariana,-23.5837,-46.632741


### Results for House Building

- Red: few houses built

- Blue: up to 1,500 houses built

- Orange: up to 2,000 houses built

- Purple: more than 2,000 houses built

Based on this analysis, the Purple region shows the most development and therefore the most potential new customers.

In [130]:
result_homes=home_merged[home_merged['Cluster Labels']==4]
result_homes

Unnamed: 0,Cluster Labels,Total,Neighborhood,Latitude,Longitude
15,4,5.196296,Campo Limpo,-23.632558,-46.759666
33,4,5.021521,Itaim Bibi,-23.584381,-46.678444
53,4,5.751924,Moema,-23.597085,-46.662888
68,4,5.785835,República,-23.545335,-46.642257
70,4,4.288509,Sacomã,-23.601282,-46.602555
71,4,4.606756,Santa Cecília,-23.52966,-46.651894
76,4,4.356332,Saúde,-23.615178,-46.643393
84,4,5.480631,Sé,-23.550651,-46.633382
85,4,5.879744,Tatuapé,-23.540252,-46.576642
98,4,4.47111,Vila Prudente,-23.592335,-46.574961


## Discussion section 


According to the above results, Moema, Saúde and Itaim Bibi are the recommended neighborhoods for opening a new restaurant, as they appear both in the selected education cluster (more people with higher education) and in the selected house-building cluster (more investment in new homes). We are not taking into account the total population of each neighborhood, but these indicators are enough to suggest promising neighborhoods/districts.
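The shortlist is simply the intersection of the two selected clusters. A minimal check using the neighborhood names from the `result_edu` and `result_homes` tables above (the variable names are illustrative):

```python
# Neighborhoods in the selected education cluster (result_edu).
edu_selected = {
    "Itaim Bibi", "Jabaquara", "Jardim Paulista", "Moema",
    "Mooca", "Perdizes", "Saúde", "Vila Mariana",
}

# Neighborhoods in the selected house-building cluster (result_homes).
homes_selected = {
    "Campo Limpo", "Itaim Bibi", "Moema", "República", "Sacomã",
    "Santa Cecília", "Saúde", "Sé", "Tatuapé", "Vila Prudente",
}

# Neighborhoods that satisfy both criteria.
shortlist = sorted(edu_selected & homes_selected)
print(shortlist)
```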

Let's see the top 3 types of restaurant in each of these neighborhoods:

#### - Moema
  - Jewish/Arabian Food
  - Asian Food
  - Brazilian Food

#### - Saúde
  - Brazilian Food
  - Asian Food
  - Vegan Food

#### - Itaim Bibi
  - Brazilian Food
  - Asian Food
  - Italian Food



## Conclusion section 

Regarding the type of restaurant, we could consider, for instance, the 3rd most common type in each neighborhood. The reasoning is that types outside the top 3 are likely to have less demand, while the top 3 are already established and successful types of restaurant in each location.

#### Using this methodology, we can therefore conclude that we could open a new Italian restaurant in Itaim Bibi, a new Vegan restaurant in Saúde, or a new Brazilian restaurant in the Moema neighborhood.
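The selection rule can be written down in a few lines, using the top-3 lists from the Discussion section (the variable names are illustrative):

```python
# Top 3 restaurant types per shortlisted neighborhood, most common first.
top3 = {
    "Moema": ["Jewish/Arabian Food", "Asian Food", "Brazilian Food"],
    "Saúde": ["Brazilian Food", "Asian Food", "Vegan Food"],
    "Itaim Bibi": ["Brazilian Food", "Asian Food", "Italian Food"],
}

# Rule used in the conclusion: pick the 3rd most common type in each neighborhood.
suggestion = {hood: ranked[2] for hood, ranked in top3.items()}
for hood, food in suggestion.items():
    print(hood, "->", food)
```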
