<h1>FINAL ASSIGNMENT - WEEK 1</h1>
<br>
<hr>

<h2>Table of contents</h2>
    <li><a href='#introduction'>Introduction</a></li>
    <li><a href='#data'>Data</a></li>

<h2>Introduction<a id='introduction'></a></h2>
<hr>

<p>During the COVID-19 pandemic, someone choosing a neighborhood to live in within the city of Rio de Janeiro may wonder which areas are safer than others. In this final assignment, we visualize each neighborhood of Rio de Janeiro by its COVID-19 death rate and by the distribution of the city's medical infrastructure (hospitals, urgent care centers and emergency rooms).</p>
<p>This is just one criterion among many that relate to whether infected people die or recover.</p>
<p>As this study progresses, we will see that good medical infrastructure in certain areas can help, as expected, in the recovery of patients. Conversely, a lack of proper medical care can contribute directly to a higher death count.</p>

<h2>Data<a id='data'></a></h2>
<hr>

<p>The geospatial data, as well as the COVID-19 case data, come from official institutions of the local government. The datasets can be obtained at the following URLs:</p>

<li><a href=http://dadosabertos.rio.rj.gov.br/apiUrbanismo/apresentacao/csv/bairros_.csv>http://dadosabertos.rio.rj.gov.br/apiUrbanismo/apresentacao/csv/bairros_.csv</a></li>
<li><a href=https://www.data.rio/datasets/cep-dos-casos-confirmados-de-covid-19-no-munic%C3%ADpio-do-rio-de-janeiro>https://www.data.rio/datasets/cep-dos-casos-confirmados-de-covid-19-no-munic%C3%ADpio-do-rio-de-janeiro</a></li>
<li><a href=https://www.data.rio/datasets/limite-de-bairros?geometry=-44.899%2C-23.138%2C-41.992%2C-22.695>https://www.data.rio/datasets/limite-de-bairros?geometry=-44.899%2C-23.138%2C-41.992%2C-22.695 (the GeoJSON file with the neighborhood limits)</a></li>

<h3>Importing libraries, initializing variables and preparing the input data</h3>

In [1]:
# Importing the required libraries
from geopy.geocoders import Nominatim
import folium
# library to build a legend on the map
from branca.element import Template, MacroElement
import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
import json

# Foursquare credentials
CLIENT_ID = 'CRN2QP54XJ4SSKPST0LYZTLSISLNWRJVAMKSBNP5ULMO5Q0C' # your Foursquare ID
CLIENT_SECRET = 'FGZOZCDZ5LUQVECCPKW3BI2RSWRTOAWRWQW0IVFH2ZZT4T1D' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

# Rio de Janeiro Map initial information
address = 'Rio de Janeiro, BR'

geolocator = Nominatim(user_agent="rio_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

# Rio de Janeiro geospatial data of each neighborhood
# from: http://dadosabertos.rio.rj.gov.br/apiUrbanismo/apresentacao/csv/bairros_.csv
# The file was converted from Windows ANSI to UTF-8 CSV and the first row was removed in Excel. The new file was saved as bairros_final.csv
dfRio = pd.read_csv('bairros_final.csv')
# Converting the latitude and longitude data to string
dfRio['Latitude'] = dfRio['Latitude'].apply(str)
dfRio['Longitude'] = dfRio['Longitude'].apply(str)
# Removing all the special chars and accents and converting everything to uppercase
dfRio['Bairro'] = dfRio['Bairro'].str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('ascii').str.upper()
# Renaming the first column to Neighborhood (Bairro means Neighborhood in Portuguese - BR)
dfRio.rename(columns={"Bairro":"Neighborhood"}, inplace=True)

# Data from Coronavirus incidence in each neighborhood of Rio de Janeiro
# from: https://www.data.rio/datasets/cep-dos-casos-confirmados-de-covid-19-no-munic%C3%ADpio-do-rio-de-janeiro
rioCovid = pd.read_csv('Dados_CEP_MRJ_covid_19.csv', sep=';')

# Read the GeoJSON file with the neighborhood limits and import the neighborhood IDs into a new dataframe.
# This dataframe will be merged with the main dataframe (dfRioData) to make a choropleth map later
with open('Limite_de_Bairros.geojson', 'r') as geoson_file:
    jsonFile = geoson_file.read()

# Parse file
jsonObj = json.loads(jsonFile)

# Create the new DataFrame directly from the feature properties
# (avoids chained assignment through .loc[i].ID, which may silently write to a copy)
dfNeighborhoodID = pd.DataFrame(
    [(feature['properties']['OBJECTID'], feature['properties']['NOME']) for feature in jsonObj['features']],
    columns=['ID', 'Neighborhood'])

# Removing all the special chars and accents and converting everything to uppercase
dfNeighborhoodID['Neighborhood'] = dfNeighborhoodID['Neighborhood'].str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('ascii').str.upper()

# Verifying which neighborhoods do not match because of misspellings
#rioNeighborhood = pd.merge(dfNeighborhoodID, dfRio, on='Neighborhood', how='outer')
# Bras de Pina, Oswaldo Cruz and Vila Cosmos are the neighborhoods that need correction, so let's do it
dfRio.replace('BRAZ DE PINA', 'BRAS DE PINA', inplace=True)
dfRio.replace('OSWALDO CRUZ', 'OSVALDO CRUZ', inplace=True)
dfRio.replace('VILA COSMOS', 'VILA KOSMOS', inplace=True)

# Merge the dataframes
rioNeighborhood = pd.merge(dfNeighborhoodID, dfRio, on='Neighborhood')

# Checking whether there are any NaN values
#print(rioNeighborhood[rioNeighborhood['ID'].isna()])
#rioNeighborhood[rioNeighborhood['Latitude'].isna()]   
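
<p>The commented-out check above can be sketched as follows. This is a minimal illustration on made-up rows, not the notebook's data: <code>normalize_names</code> is a hypothetical helper that repeats the accent-stripping step, and the outer merge uses the <code>indicator=True</code> option of <code>pd.merge</code> to flag names that appear in only one source.</p>

```python
import pandas as pd

def normalize_names(names: pd.Series) -> pd.Series:
    """Strip accents and uppercase, mirroring the cleaning step used in this notebook."""
    return (names.str.normalize('NFKD')
                 .str.encode('ascii', errors='ignore')
                 .str.decode('ascii')
                 .str.upper())

# Toy stand-ins for dfNeighborhoodID and dfRio (illustrative values only)
geo_df = pd.DataFrame({'ID': [1, 2], 'Neighborhood': ['Paquetá', 'Brás de Pina']})
csv_df = pd.DataFrame({'Neighborhood': ['PAQUETA', 'BRAZ DE PINA'],
                       'Latitude': [-22.76, -22.83]})

geo_df['Neighborhood'] = normalize_names(geo_df['Neighborhood'])
csv_df['Neighborhood'] = normalize_names(csv_df['Neighborhood'])

# indicator=True adds a _merge column: 'both', 'left_only' or 'right_only'
check = pd.merge(geo_df, csv_df, on='Neighborhood', how='outer', indicator=True)
unmatched = check.loc[check['_merge'] != 'both', ['Neighborhood', '_merge']]
print(unmatched)
```

<p>Rows marked <code>left_only</code> or <code>right_only</code> are the spellings that need to be reconciled before the inner merge.</p>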

<h3>Data Cleansing and Creating the final dataframe</h3>
<p>This dataframe, dfRioData, will be used in the clustering process and in the Foursquare exploratory analysis.</p>
<p>Regarding the COVID status: all active cases are removed from the dataframe, since we cannot know how many of them will recover. The total cases for a neighborhood are therefore the sum of deaths and recoveries, from which we calculate the death rate and the recovery rate of each neighborhood. With this methodology the neighborhood population is not needed, because both rates are expressed per resolved case rather than per inhabitant.</p>
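
<p>As a small worked example of this calculation, the sketch below uses invented case records rather than the real dataset. It counts outcomes with <code>pd.crosstab</code>, which is a shortcut chosen for the illustration; the notebook itself uses one-hot encoding followed by a group-by.</p>

```python
import pandas as pd

# Invented case records, one row per case (not real data)
cases = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'A', 'B', 'B', 'B'],
    'Status': ['Death', 'Recovered', 'Recovered', 'Active',
               'Death', 'Death', 'Recovered'],
})

# Active cases have no known outcome yet, so they are excluded
resolved = cases[cases['Status'] != 'Active']

# Count deaths and recoveries per neighborhood
counts = pd.crosstab(resolved['Neighborhood'], resolved['Status'])

# Total Cases = Deaths + Recoveries; each rate is a share of that total
counts['Total Cases'] = counts['Death'] + counts['Recovered']
counts['Death Rate'] = counts['Death'] / counts['Total Cases']
counts['Recovery Rate'] = counts['Recovered'] / counts['Total Cases']
print(counts)
```

<p>Because the two rates share the same denominator, they always add up to 1 for each neighborhood.</p>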

In [2]:
# Dropping some columns and renaming the others
rioCovid.drop(['dt_notific', 'dt_inicio_sintomas', 'ap_residencia_estadia', 'dt_evolucao', 'cep', 'data_atualizacao'], axis=1, inplace=True)
# 'dt_notific' is dropped above, so only the two remaining columns are renamed
rioCovid.rename(columns={"bairro_resid__estadia":"Neighborhood", "evolucao":"Status"}, inplace=True)

# Translating the status values from Portuguese to English
# First, let's list all the status values
status = rioCovid['Status'].unique()
# The values are: 'OBITO', 'RECUPERADO', 'ATIVO'
statusDict = {"OBITO":"Death", "RECUPERADO":"Recovered", "ATIVO":"Active"}
rioCovid.replace({"Status":statusDict}, inplace=True)

# One-hot encoding the Status column
covid_onehot = pd.get_dummies(rioCovid[['Status']], prefix="", prefix_sep="")
dfRioCovid = pd.merge(rioCovid, covid_onehot, left_index=True, right_index=True)

# Dropping the Status and Active columns, which are not considered in the final study
dfRioCovid.drop(['Status', 'Active'], axis=1, inplace = True)

# Grouping the data by Neighborhood
dfRioCovid = dfRioCovid.groupby('Neighborhood').sum().reset_index()

# Calculating the rate of recoveries and deaths (total cases = (Deaths + Recoveries), Death Rate = Deaths/Total Cases, Recovery Rate = Recoveries/Total Cases)
dfRioCovid['Total Cases'] = dfRioCovid['Death'] + dfRioCovid['Recovered']
dfRioCovid['Death Rate'] = dfRioCovid['Death']/dfRioCovid['Total Cases']
dfRioCovid['Recovery Rate'] = dfRioCovid['Recovered']/dfRioCovid['Total Cases']

# Merging the Covid dataframe with the Neighborhood dataframe
# First, let's verify again which neighborhoods do not match because of misspellings
#dfRioData = pd.merge(rioNeighborhood, dfRioCovid, on='Neighborhood', how='outer')

# Checking whether there are any NaN values
#print(dfRioData[dfRioData['ID'].isna()])
#dfRioData[dfRioData['Death'].isna()]

# From the test above, two neighborhoods from the COVID source do not exist in the Neighborhood dataframe ('Fora do Municipio' and 'Vila Kennedy'). Five others need
# corrections in the COVID dataframe due to misspellings: 'Cavalcanti', 'Freguesia (Ilha)', 'Freguesia (Jacarepagua)', 'Osvaldo Cruz' and 'Ricardo de Albuquerque'. We need to correct them.
neighDict = {"CAVALCANTE":"CAVALCANTI", "RICARDO ALBUQUERQUE":"RICARDO DE ALBUQUERQUE", "OSWALDO CRUZ":"OSVALDO CRUZ", "FREGUESIA-ILHA":"FREGUESIA (ILHA)", "FREGUESIA-JPA":"FREGUESIA (JACAREPAGUA)"}
dfRioCovid.replace({"Neighborhood":neighDict}, inplace=True)

# This is the final dataframe that will be used throughout the study
dfRioData = pd.merge(rioNeighborhood, dfRioCovid, on='Neighborhood')

dfRioData.head()

Unnamed: 0,ID,Neighborhood,Latitude,Longitude,Death,Recovered,Total Cases,Death Rate,Recovery Rate
0,325,PAQUETA,-22.7597222,-43.1088889,8.0,160.0,168.0,0.047619,0.952381
1,326,FREGUESIA (ILHA),-22.7863894,-43.1722945,40.0,253.0,293.0,0.136519,0.863481
2,327,BANCARIOS,-22.7959138,-43.175927,24.0,254.0,278.0,0.086331,0.913669
3,328,GALEAO,-22.8091667,-43.2380556,36.0,683.0,719.0,0.05007,0.94993
4,330,PORTUGUESA,-22.7988923,-43.20618839999999,50.0,460.0,510.0,0.098039,0.901961


<h3>Using Foursquare to analyze the venues and cluster the neighborhoods by the medical category</h3>
<p>These are the IDs of the three categories that will be used primarily:</p>
<li>Hospital ID on Foursquare: 4bf58dd8d48988d196941735</li>
<li>Emergency Room ID on Foursquare: 4bf58dd8d48988d194941735</li>
<li>Urgent Care Center ID on Foursquare: 56aa371be4b08b9a8d573526</li>
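
<p>One way to keep these IDs manageable, sketched here with a hypothetical <code>MEDICAL_CATEGORIES</code> dictionary, is to store them by name and join them into the comma-separated string that is passed as the <code>categoryId</code> query parameter. Building the string this way also avoids stray whitespace between or after the IDs.</p>

```python
# Category IDs listed above, keyed by a readable name
# (the dictionary name is illustrative, not part of the original notebook)
MEDICAL_CATEGORIES = {
    'Hospital': '4bf58dd8d48988d196941735',
    'Emergency Room': '4bf58dd8d48988d194941735',
    'Urgent Care Center': '56aa371be4b08b9a8d573526',
}

# Comma-separated list, as passed to the categoryId query parameter
category_ids = ','.join(MEDICAL_CATEGORIES.values())
print(category_ids)
```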
<hr>
<h5>The function below explores venues in a given category, specified as a parameter (a slightly modified version of the function used in the course)</h5>

In [3]:
def getNearbyCategoryVenues(names, latitudes, longitudes, category, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/search?categoryId={}&intent=browse&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            category,
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
    
        print(url)
            
        # make the GET request
        results = requests.get(url).json()["response"]['venues']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['name'], 
            v['location']['lat'], 
            v['location']['lng'],  
            v['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
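
<p>A possible variant of the request step, shown only as a sketch: it passes the query through the <code>params</code> argument of <code>requests.get</code>, so the credentials do not need to be formatted into (or printed with) the URL, and it skips venues whose category list is empty. The helper names <code>build_params</code>, <code>venue_rows</code> and <code>search_venues</code> are hypothetical, and the sample response is invented to match the fields that the function above reads.</p>

```python
import requests

SEARCH_URL = 'https://api.foursquare.com/v2/venues/search'

def build_params(lat, lng, category, client_id, client_secret,
                 version='20180605', radius=500, limit=100):
    """Collect the same query parameters used above into a dict."""
    return {
        'categoryId': category,
        'intent': 'browse',
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }

def venue_rows(name, lat, lng, venues):
    """Flatten venue dicts into rows, skipping venues without a category."""
    return [(name, lat, lng, v['name'], v['location']['lat'],
             v['location']['lng'], v['categories'][0]['name'])
            for v in venues if v.get('categories')]

def search_venues(name, lat, lng, category, client_id, client_secret):
    """Query the API for one neighborhood (needs network access and valid credentials)."""
    response = requests.get(
        SEARCH_URL,
        params=build_params(lat, lng, category, client_id, client_secret),
        timeout=10)
    response.raise_for_status()  # fail loudly on HTTP errors
    return venue_rows(name, lat, lng, response.json()['response']['venues'])

# Offline check with an invented response fragment
sample = [
    {'name': 'Hospital X', 'location': {'lat': -22.76, 'lng': -43.10},
     'categories': [{'name': 'Hospital'}]},
    {'name': 'Unlabelled venue', 'location': {'lat': -22.77, 'lng': -43.11},
     'categories': []},
]
rows = venue_rows('PAQUETA', '-22.7597222', '-43.1088889', sample)
print(rows)
```

<p>Keeping the row extraction separate from the network call makes it possible to test the parsing on a saved response without contacting the API.</p>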

In [None]:
# These are the categories of hospitals, emergency rooms and urgent care centers
# (no trailing space after the last ID, otherwise it ends up inside the categoryId parameter)
rio_venues = getNearbyCategoryVenues(dfRioData.Neighborhood, dfRioData.Latitude, dfRioData.Longitude, '4bf58dd8d48988d196941735,4bf58dd8d48988d194941735,56aa371be4b08b9a8d573526')

# one hot encoding
rio_onehot = pd.get_dummies(rio_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
rio_onehot['Neighborhood'] = rio_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [rio_onehot.columns[-1]] + list(rio_onehot.columns[:-1])
rio_onehot = rio_onehot[fixed_columns]

rio_onehot.head()

PAQUETA
https://api.foursquare.com/v2/venues/search?categoryId=4bf58dd8d48988d196941735,4bf58dd8d48988d194941735,56aa371be4b08b9a8d573526 &intent=browse&client_id=CRN2QP54XJ4SSKPST0LYZTLSISLNWRJVAMKSBNP5ULMO5Q0C&client_secret=FGZOZCDZ5LUQVECCPKW3BI2RSWRTOAWRWQW0IVFH2ZZT4T1D&v=20180605&ll=-22.7597222,-43.1088889&radius=500&limit=100
FREGUESIA (ILHA)
https://api.foursquare.com/v2/venues/search?categoryId=4bf58dd8d48988d196941735,4bf58dd8d48988d194941735,56aa371be4b08b9a8d573526 &intent=browse&client_id=CRN2QP54XJ4SSKPST0LYZTLSISLNWRJVAMKSBNP5ULMO5Q0C&client_secret=FGZOZCDZ5LUQVECCPKW3BI2RSWRTOAWRWQW0IVFH2ZZT4T1D&v=20180605&ll=-22.786389399999997,-43.1722945&radius=500&limit=100
BANCARIOS
https://api.foursquare.com/v2/venues/search?categoryId=4bf58dd8d48988d196941735,4bf58dd8d48988d194941735,56aa371be4b08b9a8d573526 &intent=browse&client_id=CRN2QP54XJ4SSKPST0LYZTLSISLNWRJVAMKSBNP5ULMO5Q0C&client_secret=FGZOZCDZ5LUQVECCPKW3BI2RSWRTOAWRWQW0IVFH2ZZT4T1D&v=20180605&ll=-22.7959138,-43.175927&r