# Capstone Project - The Battle of the Neighborhoods
### Applied Data Science Capstone by IBM/Coursera
#### Author: Vinicius Lago

## Table of contents
* [1. Introduction: Business Problem](#introduction)
* [2. Data](#data)
* [3. Methodology](#methodology)
* [4. Analysis](#analysis)
* [5. Results and Discussion](#results)
* [6. Conclusion](#conclusion)



## 1. Introduction: Business Problem <a name="introduction"></a>

We work at a real estate development company in São Caetano do Sul, a small city near São Paulo, Brazil. Our team prospects for new land, and our main goal is to find the most promising neighborhoods in São Caetano do Sul for the launch of our new product. The product is designed for younger people, so we want the neighborhood where it will be launched to match the profile of this audience. In our previous surveys, we identified the following as the most important characteristics for our potential customers:

* Restaurants;
* Gym.

So, we will use the geolocation of the neighborhoods of São Caetano do Sul together with data from the Foursquare API to identify the 3 neighborhoods with the greatest number of these kinds of establishments.

## 2. Data <a name="data"></a>

To achieve our goal, we will use 2 different datasets:

* Geolocation of the neighborhoods of São Caetano do Sul - This dataset contains the name, latitude and longitude of each of the 15 neighborhoods of São Caetano do Sul;

* Foursquare API - This API lets us search for establishments around a specific geographic position. We will query it for establishments within a radius of 500 meters from the center of each neighborhood, and the API responds with the names and categories of the establishments found. We will then keep the categories of interest, those containing “Restaurant” or “Gym”. Lastly, we will count the occurrences of each category and compare the neighborhoods to identify the 3 best options.
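
To make the 500 meter radius concrete, the sketch below computes the great-circle distance between the Barcelona neighborhood center and one of its venues, using coordinates that appear in the tables of this notebook. The haversine formula and the helper name `haversine_m` are assumptions for illustration only; how Foursquare measures distance internally is not covered here.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2, radius_m=6371000.0):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_m * math.asin(math.sqrt(a))

# Barcelona neighborhood center and the "Feira Livre" venue from the tables
center = (-23.622787, -46.552142)
venue = (-23.623861, -46.553367)

distance = haversine_m(*center, *venue)
print(f"{distance:.0f} m from the center; inside 500 m radius: {distance <= 500}")
```

A check like this can help when reasoning about why neighboring centers that are less than 1 km apart may return overlapping venues.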

### 2.1. Neighborhoods of São Caetano do Sul

In this step, we load the geolocation dataset of all 15 neighborhoods of São Caetano do Sul and display it. The dataset is stored in a GitHub repository.

In [1]:
import pandas as pd

url = 'https://github.com/viniciusyl/Coursera_Capstone/blob/e0f343c74d63c6565d3c15dd92132d66818124fb/cep_bruto.csv?raw=true'
df = pd.read_csv(url, sep = ';')

df

Unnamed: 0,city,neighborhood,latitude,longitude
0,São Caetano do Sul,Barcelona,-2362278742,-4655214212
1,São Caetano do Sul,Jardim São Caetano,-2363819233,-4657999794
2,São Caetano do Sul,Osvaldo Cruz,-2362875921,-4656692561
3,São Caetano do Sul,Prosperidade,-2360981206,-4654987626
4,São Caetano do Sul,Santa Maria,-236336414,-4655289476
5,São Caetano do Sul,Boa Vista,-2364087908,-465592047
6,São Caetano do Sul,Centro,-2361177457,-4657477105
7,São Caetano do Sul,Cerâmica,-2362530149,-4657609657
8,São Caetano do Sul,Fundação,-2360628524,-4657025651
9,São Caetano do Sul,Mauá,-2364536958,-4657239209


As we can see, the dataset has 4 columns (city, neighborhood, latitude and longitude). Now, we need to format the latitude and longitude columns, replacing "," with "." and converting them to numeric.  
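
The same conversion can also be done at load time. The sketch below is a possible alternative, not the approach used in this notebook: it passes `decimal=","` to `pd.read_csv` so that comma decimal marks are parsed as floats directly. The inline two-row sample and its exact layout are assumptions made for illustration.

```python
import io
import pandas as pd

# Hypothetical sample in the assumed layout: ';' separator, ',' decimal mark
raw = io.StringIO(
    "city;neighborhood;latitude;longitude\n"
    "São Caetano do Sul;Barcelona;-23,62278742;-46,55214212\n"
    "São Caetano do Sul;Centro;-23,61177457;-46,57477105\n"
)

# decimal="," tells the parser to treat the comma as the decimal separator
sample = pd.read_csv(raw, sep=";", decimal=",")
print(sample.dtypes)
```

With this option the separate replace and `pd.to_numeric` steps would not be needed, at the cost of hiding the raw text format from the reader.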

In [2]:
# Replace "," with "."
df['latitude'] = df['latitude'].str.replace(",", ".")
df['longitude'] = df['longitude'].str.replace(",", ".")

# Convert to numeric
df['latitude'] = pd.to_numeric(df['latitude'])
df['longitude'] = pd.to_numeric(df['longitude'])

df.head()

Unnamed: 0,city,neighborhood,latitude,longitude
0,São Caetano do Sul,Barcelona,-23.622787,-46.552142
1,São Caetano do Sul,Jardim São Caetano,-23.638192,-46.579998
2,São Caetano do Sul,Osvaldo Cruz,-23.628759,-46.566926
3,São Caetano do Sul,Prosperidade,-23.609812,-46.549876
4,São Caetano do Sul,Santa Maria,-23.633641,-46.552895


In [3]:
pip install folium

Collecting folium
  Downloading folium-0.12.1-py2.py3-none-any.whl (94 kB)
Collecting branca>=0.3.0
  Downloading branca-0.4.2-py3-none-any.whl (24 kB)
Installing collected packages: branca, folium
Successfully installed branca-0.4.2 folium-0.12.1
Note: you may need to restart the kernel to use updated packages.


Lastly, let's see the neighborhoods of São Caetano do Sul on the map.

In [4]:
import folium

# Define Latitude and Longitude of São Caetano do Sul
latitude_scs = -23.625501760856505
longitude_scs = -46.56624893089427

# create map of São Caetano do Sul using latitude and longitude values
map_scs = folium.Map(location=[latitude_scs, longitude_scs], zoom_start=11)

# add markers to map
for lat, lng, label in zip(df['latitude'], df['longitude'], df['neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_scs)  
    
map_scs

### 2.2. Foursquare API

Now that we have the geolocation, let's use the Foursquare API to get information about the establishments around each neighborhood. But first, we have to define our Foursquare credentials and version (hidden cell).
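
Since the credentials cell is hidden, here is a minimal sketch of what such a cell might define. The environment variable names, the `VERSION` date and the `LIMIT` value are all placeholders chosen for illustration; the actual values used in this notebook are not shown.

```python
import os

# Hypothetical environment variable names; the original cell is hidden
CLIENT_ID = os.environ.get("FOURSQUARE_CLIENT_ID", "")
CLIENT_SECRET = os.environ.get("FOURSQUARE_CLIENT_SECRET", "")

VERSION = "20210101"  # placeholder; the v2 API expects a date in YYYYMMDD format
LIMIT = 100           # assumed maximum number of venues returned per request

if not CLIENT_ID or not CLIENT_SECRET:
    print("Foursquare credentials are not set")
```

Reading the secrets from the environment keeps them out of the notebook itself, which is useful when the notebook is shared.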

In [5]:
# The code was removed by Watson Studio for sharing.

Let's define the function to get the information from all neighborhoods in our dataset.

In [6]:
import requests

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
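
The function above builds the URL with `str.format` and indexes directly into the response. As a hedged sketch of a more defensive variant, the code below builds the query string with `urllib.parse.urlencode` and tolerates venues that come back without a category. The helper names `build_explore_url` and `extract_venues`, the default `limit`, the credential placeholders and the sample payload are all hypothetical.

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"

def build_explore_url(lat, lng, client_id, client_secret, version, radius=500, limit=100):
    """Build the explore URL; urlencode escapes every parameter value."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",
        "radius": radius,
        "limit": limit,
    }
    return f"{EXPLORE_ENDPOINT}?{urlencode(params)}"

def extract_venues(payload):
    """Return (name, lat, lng, category) tuples from a decoded explore response."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # A venue without categories gets None instead of raising IndexError
        category = categories[0]["name"] if categories else None
        rows.append((venue["name"], venue["location"]["lat"], venue["location"]["lng"], category))
    return rows

# Hypothetical payload containing only the fields read above
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Meet Café", "location": {"lat": -23.620706, "lng": -46.553122},
               "categories": [{"name": "Coffee Shop"}]}},
    {"venue": {"name": "Uncategorized place", "location": {"lat": -23.62, "lng": -46.55},
               "categories": []}},
]}]}}

url = build_explore_url(-23.622787, -46.552142, "ID", "SECRET", "20210101")
rows = extract_venues(sample)
print(url)
print(rows)
```

Separating URL construction from response parsing also makes both pieces testable without calling the API.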

Let's get the neighborhood information.

In [7]:
scs_venues = getNearbyVenues(names=df['neighborhood'],
                                   latitudes=df['latitude'],
                                   longitudes=df['longitude']
                                  )

Barcelona
Jardim São Caetano
Osvaldo Cruz
Prosperidade
Santa Maria
Boa Vista
Centro
Cerâmica
Fundação
Mauá
Nova Gerty
Olímpico
Santa Paula
Santo Antônio
São José


Let's check the size and the top rows of the resulting dataframe.

In [8]:
print(scs_venues.shape)
scs_venues.head()

(436, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Barcelona,-23.622787,-46.552142,Feira Livre,-23.623861,-46.553367,Farmers Market
1,Barcelona,-23.622787,-46.552142,Droga Raia,-23.623749,-46.550823,Pharmacy
2,Barcelona,-23.622787,-46.552142,Old Man Sandwich Shop,-23.624427,-46.554718,Burger Joint
3,Barcelona,-23.622787,-46.552142,Sodiê Doces,-23.626227,-46.553213,Dessert Shop
4,Barcelona,-23.622787,-46.552142,Meet Café,-23.620706,-46.553122,Coffee Shop


## 3. Methodology <a name="methodology"></a>

To achieve our goal, we are going to analyze the number of establishments in the Restaurant and Gym categories within 500 meters of the center of each neighborhood of São Caetano do Sul.

In the first step, we will filter the venues dataset, keeping only the establishments whose category contains "Restaurant" or "Gym".

In the second step, we will create a list for each category and order the neighborhoods according to the number of establishments in each (rank 1 = greatest number of establishments / rank 15 = smallest number of establishments).

In the third step, we will calculate the average of each neighborhood's positions in the two lists and order the neighborhoods from lowest to highest average (rank 1 = lowest average position / rank 15 = highest average position).

The three neighborhoods with the best ranking will be chosen as our potential neighborhoods for the product. In case of a tie within a category, we will use the name of the neighborhood as the tie-breaking key. In case of a tie in the final rank, the neighborhood with the highest number of restaurants will be chosen.
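
The three steps can be sketched on a small example. The counts below are a five-neighborhood subset of the figures reported in the analysis, so the ranks computed here cover only those five and are not the ranks over all 15 neighborhoods. The helper name `add_rank` is hypothetical.

```python
import pandas as pd

# Subset of the counts reported in the analysis (5 of the 15 neighborhoods)
counts = pd.DataFrame({
    "Neighborhood": ["Santa Paula", "Barcelona", "Santo Antônio", "São José", "Osvaldo Cruz"],
    "Number of restaurants": [26, 7, 4, 3, 2],
    "Number of gyms": [6, 4, 2, 1, 1],
})

def add_rank(frame, count_col, rank_col):
    # Sort by count, descending; equal counts are ordered by name, also
    # descending, which mirrors the sort used in the analysis cells
    ordered = frame.sort_values(by=[count_col, "Neighborhood"], ascending=False)
    ordered = ordered.reset_index(drop=True)
    ordered[rank_col] = list(range(1, len(ordered) + 1))
    return ordered

ranked = add_rank(counts, "Number of restaurants", "Rank Restaurant")
ranked = add_rank(ranked, "Number of gyms", "Rank Gym")

# Average of the two positions; ties fall back to the restaurant rank
ranked["Final Rank"] = (ranked["Rank Restaurant"] + ranked["Rank Gym"]) / 2
ranked = ranked.sort_values(by=["Final Rank", "Rank Restaurant"]).reset_index(drop=True)

top3 = ranked["Neighborhood"].head(3).tolist()
print(ranked)
print(top3)
```

In this subset, São José and Osvaldo Cruz both have one gym, so their gym ranks are decided by the name key.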

## 4. Analysis <a name="analysis"></a>

First, let's filter the dataset using the categories of interest.

In [9]:
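# .copy() makes scs_est an independent DataFrame, so later assignments do not act on a slice of scs_venues
scs_est = scs_venues[(scs_venues["Venue Category"].str.contains("Restaurant")) | (scs_venues["Venue Category"].str.contains("Gym"))].copy()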
scs_est.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
6,Barcelona,-23.622787,-46.552142,O Pirata,-23.62254,-46.556518,Seafood Restaurant
7,Barcelona,-23.622787,-46.552142,Flor de Romã Restaurante,-23.619769,-46.552782,Brazilian Restaurant
8,Barcelona,-23.622787,-46.552142,Cantinho da Síria,-23.625611,-46.55361,Middle Eastern Restaurant
10,Barcelona,-23.622787,-46.552142,Academia,-23.620523,-46.548989,Gym / Fitness Center
11,Barcelona,-23.622787,-46.552142,Fit&Co Coaching Results,-23.622729,-46.552751,Gym


Now, let's normalize the names of categories as "Restaurant" and "Gym".
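
As a possible one-pass alternative to filtering first and normalizing afterwards, the sketch below uses `Series.str.extract` to pull the label "Restaurant" or "Gym" out of each category. The small `venues` frame reuses a few rows shown in this notebook; the variable names are hypothetical.

```python
import pandas as pd

venues = pd.DataFrame({
    "Venue": ["O Pirata", "Academia", "Droga Raia", "Fit&Co Coaching Results"],
    "Venue Category": ["Seafood Restaurant", "Gym / Fitness Center", "Pharmacy", "Gym"],
})

# str.extract returns the first regex match per row (NaN when nothing matches)
label = venues["Venue Category"].str.extract(r"(Restaurant|Gym)", expand=False)

# Keep only the matching rows; .copy() yields an independent frame, so the
# assignment below does not raise SettingWithCopyWarning
est = venues[label.notna()].copy()
est["Venue Category"] = label[label.notna()]
print(est)
```

The pharmacy row is dropped, and the remaining categories collapse to the two labels used in the ranking.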

In [10]:
scs_est.loc[scs_est['Venue Category'].str.contains("Restaurant"), "Venue Category"] = "Restaurant"
scs_est.loc[scs_est['Venue Category'].str.contains("Gym"), "Venue Category"] = "Gym"
scs_est.head()


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
6,Barcelona,-23.622787,-46.552142,O Pirata,-23.62254,-46.556518,Restaurant
7,Barcelona,-23.622787,-46.552142,Flor de Romã Restaurante,-23.619769,-46.552782,Restaurant
8,Barcelona,-23.622787,-46.552142,Cantinho da Síria,-23.625611,-46.55361,Restaurant
10,Barcelona,-23.622787,-46.552142,Academia,-23.620523,-46.548989,Gym
11,Barcelona,-23.622787,-46.552142,Fit&Co Coaching Results,-23.622729,-46.552751,Gym


The next step is to group the venues by neighborhood, count the occurrences of each category and order the lists according to the rules.
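
A compact way to get both counts at once, including neighborhoods with no matching venue, is `pd.crosstab` followed by `reindex`. This is a sketch of an alternative to the groupby and merge cells in this section, not a replacement for them; the toy rows and the list `all_neighborhoods` are made up for illustration.

```python
import pandas as pd

# Toy, already-normalized data (hypothetical rows)
est = pd.DataFrame({
    "Neighborhood": ["Barcelona", "Barcelona", "Barcelona", "Centro"],
    "Venue Category": ["Restaurant", "Gym", "Restaurant", "Restaurant"],
})
all_neighborhoods = ["Barcelona", "Centro", "Jardim São Caetano"]

# crosstab counts each (neighborhood, category) pair; reindex adds the
# neighborhoods and categories that have no venues, filled with 0
counts = (
    pd.crosstab(est["Neighborhood"], est["Venue Category"])
    .reindex(index=all_neighborhoods, columns=["Restaurant", "Gym"], fill_value=0)
)
print(counts)
```

Because missing combinations are filled with 0 rather than NaN, the counts stay integers.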

In [11]:
import numpy as np

# Create a list to check neighborhoods in groupby output
scs_neig = scs_venues['Neighborhood'].drop_duplicates()

# Restaurant categories
scs_restaurant = scs_est.loc[scs_est['Venue Category'].str.contains("Restaurant")].groupby('Neighborhood', as_index=False).agg({"Venue Category": "count"})
scs_restaurant = pd.merge(scs_restaurant, scs_neig, on = 'Neighborhood', how = "right")
scs_restaurant['Venue Category'] = scs_restaurant['Venue Category'].replace(np.nan, 0)
scs_restaurant.rename(columns = {'Venue Category' : "Number of restaurants"}, inplace = True)
scs_restaurant.sort_values(by = ['Number of restaurants', 'Neighborhood'], inplace = True, ascending = False)
scs_restaurant.insert(2, 'Rank Restaurant', list(range(1,16,1)))
scs_restaurant.reset_index(drop=True, inplace=True)
scs_restaurant

Unnamed: 0,Neighborhood,Number of restaurants,Rank Restaurant
0,Santa Paula,26.0,1
1,Centro,14.0,2
2,Barcelona,7.0,3
3,Cerâmica,6.0,4
4,Santo Antônio,4.0,5
5,São José,3.0,6
6,Mauá,3.0,7
7,Boa Vista,3.0,8
8,Osvaldo Cruz,2.0,9
9,Fundação,2.0,10


In [13]:
# Gym categories
scs_gym = scs_est.loc[scs_est['Venue Category'].str.contains("Gym")].groupby('Neighborhood', as_index=False).agg({"Venue Category": "count"})
scs_gym = pd.merge(scs_gym, scs_neig, on = 'Neighborhood', how = "right")
scs_gym['Venue Category'] = scs_gym['Venue Category'].replace(np.nan, 0)
scs_gym.rename(columns = {'Venue Category' : "Number of gyms"}, inplace = True)
scs_gym.sort_values(by = ['Number of gyms', 'Neighborhood'], inplace = True, ascending = False)
scs_gym.insert(2, 'Rank Gym', list(range(1,16,1)))
scs_gym.reset_index(drop=True, inplace=True)
scs_gym

Unnamed: 0,Neighborhood,Number of gyms,Rank Gym
0,Santa Paula,6.0,1
1,Barcelona,4.0,2
2,Santo Antônio,2.0,3
3,Santa Maria,2.0,4
4,Olímpico,2.0,5
5,Nova Gerty,2.0,6
6,Fundação,2.0,7
7,São José,1.0,8
8,Osvaldo Cruz,1.0,9
9,Cerâmica,1.0,10


In the next step, we will join the two datasets, **scs_restaurant** and **scs_gym**, calculate the average rank and create our Final Rank.

In [14]:
scs_rank = pd.merge(scs_restaurant, scs_gym, on = 'Neighborhood', how = "left")
scs_rank['Final Rank'] = (scs_rank['Rank Restaurant'] + scs_rank['Rank Gym']) / 2
scs_rank.sort_values(by = ['Final Rank', 'Rank Restaurant'], inplace = True)
scs_rank.reset_index(drop=True, inplace=True)
scs_rank

Unnamed: 0,Neighborhood,Number of restaurants,Rank Restaurant,Number of gyms,Rank Gym,Final Rank
0,Santa Paula,26.0,1,6.0,1,1.0
1,Barcelona,7.0,3,4.0,2,2.5
2,Santo Antônio,4.0,5,2.0,3,4.0
3,Centro,14.0,2,1.0,11,6.5
4,Cerâmica,6.0,4,1.0,10,7.0
5,São José,3.0,6,1.0,8,7.0
6,Fundação,2.0,10,2.0,7,8.5
7,Olímpico,1.0,12,2.0,5,8.5
8,Osvaldo Cruz,2.0,9,1.0,9,9.0
9,Nova Gerty,1.0,13,2.0,6,9.5


In [18]:
print('The average number of restaurants: ', scs_rank['Number of restaurants'].mean())
print('The average number of gyms: ', scs_rank['Number of gyms'].mean())

The average number of restaurants:  4.933333333333334
The average number of gyms:  1.6


In [25]:
# Define Latitude and Longitude of São Caetano do Sul
latitude_scs = -23.625501760856505
longitude_scs = -46.56624893089427

# create map of São Caetano do Sul using latitude and longitude values
map_scs = folium.Map(location=[latitude_scs, longitude_scs], zoom_start=11)

# add markers to map
for lat, lng, label in zip(df['latitude'], df['longitude'], df['neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_scs)  
    
# add markers to map best 3 neighborhoods
df_top3 = df[(df['neighborhood'] == 'Santa Paula') | (df['neighborhood'] == 'Barcelona') | (df['neighborhood'] == 'Santo Antônio')]
for lat, lng, label in zip(df_top3['latitude'], df_top3['longitude'], df_top3['neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='red',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_scs)  
    
map_scs

Lastly, the table above contains the information for each neighborhood and the **Final Rank**. As we can see, the three top neighborhoods in our study are **Santa Paula**, **Barcelona** and **Santo Antônio**.
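
The map cell selects the top 3 neighborhoods by hard-coded name. A possible variant, sketched below under the assumption that the ranking and the geolocation data are in two separate frames, derives the names from the ranking instead. The frames `rank` and `geo` are small stand-ins that reuse values from the final table.

```python
import pandas as pd

# Stand-in for the ranking table (first four rows of the final rank)
rank = pd.DataFrame({
    "Neighborhood": ["Santa Paula", "Barcelona", "Santo Antônio", "Centro"],
    "Final Rank": [1.0, 2.5, 4.0, 6.5],
})
# Stand-in for the geolocation frame (coordinates omitted for brevity)
geo = pd.DataFrame({
    "neighborhood": ["Barcelona", "Centro", "Santa Paula", "Santo Antônio"],
})

# Take the three lowest Final Rank values, then select the matching rows
top3_names = rank.nsmallest(3, "Final Rank")["Neighborhood"]
df_top3 = geo[geo["neighborhood"].isin(top3_names)]
print(df_top3)
```

This keeps the highlighted markers consistent with the table if the ranking changes after a new API query.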

## 5. Results and Discussion <a name="results"></a>

In the previous section, following the rules defined in the methodology, we arrived at the ranked list of potential neighborhoods.

The neighborhood that stood out the most was Santa Paula, which placed first in both rankings. Santa Paula has 26 restaurants, against an average of 4.9, and 6 gyms, against an average of 1.6. As the map above shows (red dots), the other two neighborhoods in the top 3 are in roughly the same region of the city (center-north). This may reflect a concentration of points of interest in the vicinity of the city's most famous avenue, Goiás Avenue, which is close to the three neighborhoods.

On the other hand, Jardim São Caetano was the worst-performing neighborhood in our rank. This result was predictable for those who know the city: Jardim São Caetano is a neighborhood of big, expensive houses and is therefore almost entirely residential, so it was expected to have the fewest establishments.

## 6. Conclusion <a name="conclusion"></a>

The main objective of this project was to use geographic and establishment information to identify which neighborhoods have the desired characteristics for our new product.

Using the dataset with the geographic position of the neighborhoods and the Foursquare API, we were able to analyze the neighborhoods of São Caetano do Sul and rank them according to our rules.

According to the results shown in the previous section, we can select Santa Paula, Barcelona and Santo Antônio as the best neighborhoods to launch our new product.