## Correlating COVID-19 Cases with Neighborhood Venues in San Francisco

### IBM Applied Data Science Capstone

#### Introduction
San Francisco was one of the earliest responders to the COVID-19 pandemic in the United States, issuing a stay-at-home order on March 14, 2020. The state of California followed with a statewide stay-at-home order on March 20, 2020. Owing to this early response and strict guidelines, San Francisco is one of the large US metropolitan areas that has kept COVID-19 largely under control, with relatively few cases and deaths for its population (7,000 cases and 64 deaths out of over 800,000 residents).

This project aims to understand the relationship between confirmed COVID-19 cases and San Francisco neighborhoods. As the city continues to re-open, it is imperative to understand how the number of confirmed cases relates to neighborhood composition, particularly the venues in each neighborhood. Under the assumption that most individuals are infected outside their homes, we can treat each venue as a potential site of infection. We can then analyze the relationship between the types and numbers of venues in a neighborhood and its case count.

The results of this analysis would be valuable to local policymakers looking to understand the impact of re-opened venues on COVID-19 cases. They can inform re-opening policy that maintains public safety while still stimulating the local economy.

#### Data Source
To correlate San Francisco COVID-19 cases with venues, we will use two data sources: Foursquare and DataSF.

Foursquare is a location technology platform that uses crowdsourced data to provide information on venues around a point of interest. The venue information consists of:

 - Venue name
 - Venue address
 - Venue type
 - Venue tips
 - Venue photos
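
As a rough illustration of how these fields might be pulled out of a Foursquare response, the sketch below flattens a venue-search payload into one row per venue. The payload layout (`response`, then `groups`, then `items`, then `venue`) is assumed from the legacy Foursquare v2 `venues/explore` endpoint; `parse_venues` and the sample payload are hypothetical and not part of the original notebook.

```python
def parse_venues(payload):
    """Flatten an explore-style payload into a list of venue dicts."""
    rows = []
    groups = payload.get('response', {}).get('groups', [])
    for group in groups:
        for item in group.get('items', []):
            venue = item.get('venue', {})
            location = venue.get('location', {})
            categories = venue.get('categories', [])
            rows.append({
                'name': venue.get('name'),
                'address': location.get('address'),
                'lat': location.get('lat'),
                'lng': location.get('lng'),
                # Use the first listed category as the venue type, if any
                'type': categories[0]['name'] if categories else None,
            })
    return rows


# Hand-written placeholder payload (assumed layout, not real Foursquare data)
sample_payload = {
    'response': {
        'groups': [{
            'items': [
                {'venue': {
                    'name': 'Example Cafe',
                    'location': {'address': '1 Example St', 'lat': 1.0, 'lng': 2.0},
                    'categories': [{'name': 'Coffee Shop'}],
                }},
                {'venue': {
                    'name': 'Example Plaza',
                    'location': {'lat': 3.0, 'lng': 4.0},
                    'categories': [],
                }},
            ]
        }]
    }
}

venues = parse_venues(sample_payload)
for v in venues:
    print(v['name'], '|', v['type'])
```

Using `.get` with defaults keeps the parser from failing on venues that lack an address or a category, which is common in crowdsourced data.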

DataSF publishes public datasets from the city departments of San Francisco. The dataset we will be using details:

 - Medical provider confirmed COVID-19 cases
 - Medical provider confirmed COVID-19 related deaths
 - Neighborhood population

In [46]:
# Import the necessary packages
import pandas as pd
from sodapy import Socrata
import numpy as np
import itertools
import requests
from pandas import json_normalize  # pandas >= 1.0; pandas.io.json.json_normalize is deprecated
from geopy.geocoders import Nominatim
import googlemaps
import folium
from sklearn.cluster import KMeans
import matplotlib.cm as cm
import matplotlib.colors as colors

#### Pulling San Francisco COVID-19 data

For this project, we will use the "COVID-19 Cases and Deaths Summarized by Geography" dataset, provided by the San Francisco Department of Public Health. DataSF provides an API endpoint that lets users download this dataset directly. The data is segmented by zip code, neighborhood, and census district. For this analysis, we will focus on the subset detailing cases by neighborhood.

In [19]:
client = Socrata("data.sfgov.org", None)

# First 2000 results, returned as JSON from API / converted to Python list of
# dictionaries by sodapy.
results = client.get("tpyr-dvnc", limit=2000)

# Convert to pandas DataFrame
results_df = pd.DataFrame.from_records(results)



In [30]:
# Isolate the rows segmented by neighborhood
covid_df = results_df[results_df['area_type'] == 'Analysis Neighborhood']

# Drop the last three columns (same effect as the original positional loop)
covid_df = covid_df.iloc[:, :-3]

# Replace missing values with 0
covid_df = covid_df.fillna(0)
covid_df.head()

| | acs_population | area_type | count | deaths | id |
|---|---|---|---|---|---|
| 1 | 19458 | Analysis Neighborhood | 122 | 0 | Financial District/South Beach |
| 12 | 45891 | Analysis Neighborhood | 155 | 0 | Outer Richmond |
| 20 | 8641 | Analysis Neighborhood | 47 | 0 | Glen Park |
| 22 | 59639 | Analysis Neighborhood | 1216 | 0 | Mission |
| 24 | 26579 | Analysis Neighborhood | 144 | 0 | Nob Hill |
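
Raw counts are hard to compare across neighborhoods of very different sizes, so one possible next step is to normalize cases by population. The sketch below does this for the five rows shown above. It assumes the API returns every field as text (hence the `pd.to_numeric` conversion); `add_case_rate` and the `cases_per_1000` column are illustrative names, not part of the original notebook.

```python
import pandas as pd

# The five rows displayed above, kept as strings on the assumption that
# the JSON API delivers numeric fields as text.
sample = pd.DataFrame({
    'id': ['Financial District/South Beach', 'Outer Richmond', 'Glen Park',
           'Mission', 'Nob Hill'],
    'acs_population': ['19458', '45891', '8641', '59639', '26579'],
    'count': ['122', '155', '47', '1216', '144'],
})


def add_case_rate(df):
    """Return a copy of df with numeric columns and cases per 1,000 residents."""
    out = df.copy()
    out['acs_population'] = pd.to_numeric(out['acs_population'], errors='coerce')
    out['count'] = pd.to_numeric(out['count'], errors='coerce')
    out['cases_per_1000'] = out['count'] / out['acs_population'] * 1000
    return out


rates = add_case_rate(sample).sort_values('cases_per_1000', ascending=False)
print(rates[['id', 'cases_per_1000']].round(2))
```

In the notebook itself, the same function could be applied to `covid_df` instead of `sample`.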


#### Identifying SF Neighborhoods

From the COVID-19 data, we obtain a list of San Francisco neighborhoods. The next step is to find the coordinates of each neighborhood.

In [34]:
neigh = covid_df['id'].unique()
for n in neigh:
    print(n)
print("There are {} neighborhoods in San Francisco!".format(len(neigh)))

Financial District/South Beach
Outer Richmond
Glen Park
Mission
Nob Hill
Noe Valley
Outer Mission
Bernal Heights
Castro/Upper Market
Golden Gate Park
Inner Sunset
Twin Peaks
Visitacion Valley
Portola
West of Twin Peaks
Lincoln Park
Excelsior
Chinatown
South of Market
Inner Richmond
Hayes Valley
Oceanview/Merced/Ingleside
McLaren Park
Mission Bay
Sunset/Parkside
Western Addition
Potrero Hill
Haight Ashbury
Pacific Heights
Lone Mountain/USF
Seacliff
Presidio
Tenderloin
Presidio Heights
Russian Hill
Bayview Hunters Point
North Beach
Marina
Japantown
Treasure Island
Lakeshore
There are 41 neighborhoods in San Francisco!


In [67]:
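# @hidden_cell
# Read the Google Maps API key from the environment rather than hard-coding it.
# GOOGLE_MAPS_API_KEY is an assumed variable name; set it before running.
import os

api_key = os.environ['GOOGLE_MAPS_API_KEY']
gmaps = googlemaps.Client(key=api_key)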

In [59]:
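# Geocode each neighborhood with the Google Maps client
lat = []
lng = []

for n in neigh:
    geocode_result = gmaps.geocode(n + ' San Francisco')
    if not geocode_result:
        # No match: append placeholders so the lists stay aligned with neigh
        lat.append(None)
        lng.append(None)
        continue
    coord = geocode_result[0]['geometry']['location']
    lat.append(coord['lat'])
    lng.append(coord['lng'])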
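
Once the loop has filled `lat` and `lng`, the coordinates can be paired with the neighborhood names. The following is a minimal sketch under stated assumptions: `build_coord_table` is a hypothetical helper, the column names `latitude` and `longitude` are illustrative, and the demo values are placeholders rather than real geocoding output.

```python
import pandas as pd


def build_coord_table(names, lat, lng):
    """Pair each neighborhood name with its coordinates in a DataFrame."""
    if not (len(names) == len(lat) == len(lng)):
        raise ValueError('names, lat and lng must have the same length')
    return pd.DataFrame({'id': list(names), 'latitude': lat, 'longitude': lng})


# Placeholder values for illustration only (not real geocoding output)
demo_names = ['Neighborhood A', 'Neighborhood B']
demo_lat = [1.0, None]
demo_lng = [2.0, None]

coords_df = build_coord_table(demo_names, demo_lat, demo_lng)

# Rows that failed to geocode can be inspected before any merge
missing = coords_df[coords_df['latitude'].isna()]
print(coords_df)
print('Neighborhoods without coordinates:', list(missing['id']))
```

In the notebook, `build_coord_table(neigh, lat, lng)` could then be merged with `covid_df` on the shared `id` column.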
    
    

In [None]:
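def get_coord(api_key, address, verbose=False):
    """Return [lat, lon] for an address via the Google Geocoding API, or [None, None] on failure."""
    try:
        url = 'https://maps.googleapis.com/maps/api/geocode/json'
        # Pass parameters separately so that requests URL-encodes the address
        response = requests.get(url, params={'key': api_key, 'address': address}).json()
        if verbose:
            print('Google Maps API JSON result =>', response)
        results = response['results']
        geographical_data = results[0]['geometry']['location'] # get geographical coordinates
        lat = geographical_data['lat']
        lon = geographical_data['lng']
        return [lat, lon]
    except (requests.RequestException, KeyError, IndexError, ValueError):
        # Network error, malformed JSON, or no results for this address
        return [None, None]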
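
Geocoding responses from the Google API also carry a top-level `status` field (for example `OK` or `ZERO_RESULTS`), which can be checked before indexing into `results`. The sketch below shows one way to separate that parsing step from the network call so it can be exercised offline. `extract_location` is a hypothetical helper, and both sample responses are hand-written placeholders.

```python
def extract_location(response):
    """Return (lat, lng) from a geocoding response dict, or None if unavailable."""
    if response.get('status') != 'OK':
        return None
    results = response.get('results') or []
    if not results:
        return None
    location = results[0].get('geometry', {}).get('location', {})
    if 'lat' not in location or 'lng' not in location:
        return None
    return (location['lat'], location['lng'])


# Hand-written placeholder responses (not real API output)
ok_response = {
    'status': 'OK',
    'results': [{'geometry': {'location': {'lat': 1.5, 'lng': -2.5}}}],
}
empty_response = {'status': 'ZERO_RESULTS', 'results': []}

print(extract_location(ok_response))
print(extract_location(empty_response))
```

Keeping the parser free of network access makes failures easier to tell apart: a `None` here means the response had no usable coordinates, not that the request itself failed.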