# Milestone 2: EDA
## Predicting COVID-19 Cases
- Austin Rochon
- Emily Xie
- Mark Lock

<hr style="height:2pt">

## Table of Contents

0. [Introduction](#introduction)
1. [Global Data](#global)
2. [U.S. Data](#us)

<a id='introduction'></a>
## Introduction
Key Questions: 
1. Given everything you have learned, if you faced this data set in the wild, how would you proceed? 
2. What are the important measures? 
3. What are the right questions to ask, and how can the data answer them?

<a id='global'></a>
## Global Data

In [117]:
import pandas as pd
import folium
import geopandas as gpd
import numpy as np

In [16]:
# fetch global data from the Johns Hopkins GitHub repository
covid_global = pd.read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv")

In [18]:
# clean global data. first, melt (unpivot) the date columns so that we have a single
# date column, with one row per location and date
non_dates = ["Province/State", "Country/Region", "Lat", "Long"]
dates = [col for col in covid_global.columns if col not in non_dates]

covid_global = pd.melt(covid_global, id_vars=non_dates, value_vars=dates,
                       var_name="date", value_name="confirmed")

# next, simplify the column names to make analysis later easier
covid_global.rename(columns={"Province/State": "province", 
                             "Country/Region": "country",
                             "Lat": "lat",
                             "Long": "long"}, inplace=True)
covid_global.head()

Unnamed: 0,province,country,lat,long,date,confirmed
0,,Afghanistan,33.0,65.0,1/22/20,0
1,,Albania,41.1533,20.1683,1/22/20,0
2,,Algeria,28.0339,1.6596,1/22/20,0
3,,Andorra,42.5063,1.5218,1/22/20,0
4,,Angola,-11.2027,17.8739,1/22/20,0
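
To make the reshaping step above concrete, here is a minimal sketch of the same wide-to-long melt on a toy table. The country labels and counts are made up for illustration; only the column layout mirrors the Johns Hopkins file.

```python
import pandas as pd

# Toy wide table shaped like the Johns Hopkins file (values are made up).
wide = pd.DataFrame({
    "Country/Region": ["A", "B"],
    "1/22/20": [0, 1],
    "1/23/20": [2, 3],
})

# Every column that is not an identifier is treated as a date column.
long_df = wide.melt(id_vars=["Country/Region"], var_name="date", value_name="confirmed")
print(long_df)
```

Each (country, date) pair becomes its own row, so two countries and two dates give four rows.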


In [19]:
covid_global.loc[covid_global["country"] == "US"]

Unnamed: 0,province,country,lat,long,date,confirmed
225,,US,37.0902,-95.7129,1/22/20,1
487,,US,37.0902,-95.7129,1/23/20,1
749,,US,37.0902,-95.7129,1/24/20,2
1011,,US,37.0902,-95.7129,1/25/20,2
1273,,US,37.0902,-95.7129,1/26/20,5
...,...,...,...,...,...,...
18565,,US,37.0902,-95.7129,4/1/20,213372
18827,,US,37.0902,-95.7129,4/2/20,243453
19089,,US,37.0902,-95.7129,4/3/20,275586
19351,,US,37.0902,-95.7129,4/4/20,308850
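
The `confirmed` column is a cumulative count and `date` is still a string such as `1/22/20`. One possible next step, sketched here on made-up numbers rather than the real data, is to parse the dates, sum provinces into one row per country and date, and take the first difference to get daily new cases. The `new_cases` column name is our own choice.

```python
import pandas as pd

# Toy frame in the same long format as covid_global (values are made up).
toy = pd.DataFrame({
    "province": ["P1", "P1", "P2", "P2", None, None],
    "country": ["A", "A", "A", "A", "B", "B"],
    "date": ["1/22/20", "1/23/20", "1/22/20", "1/23/20", "1/22/20", "1/23/20"],
    "confirmed": [1, 3, 0, 2, 5, 9],
})

# Parse the m/d/yy strings so that sorting is chronological, not lexical.
toy["date"] = pd.to_datetime(toy["date"], format="%m/%d/%y")

# Sum provinces to get one row per country and date.
by_country = (toy.groupby(["country", "date"], as_index=False)["confirmed"].sum()
                 .sort_values(["country", "date"]))

# Daily new cases: first difference of the cumulative count within each country.
by_country["new_cases"] = by_country.groupby("country")["confirmed"].diff()
print(by_country)
```

The first date for each country has no previous value, so its `new_cases` entry is `NaN`.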


<a id='us'></a>
## U.S. Data
For the U.S. data, we will use the [COVID Tracking Project's](https://covidtracking.com/api) dataset instead of the Johns Hopkins dataset, because it reports testing and hospitalization figures (negative and pending results, hospitalized counts) alongside confirmed positives. We'll start by loading and cleaning that data.

In [51]:
# load daily covid data, per state
covid_us_states = pd.read_csv("https://covidtracking.com/api/v1/states/daily.csv")

In [52]:
covid_us_states.head()

Unnamed: 0,date,state,positive,negative,pending,hospitalizedCurrently,hospitalizedCumulative,inIcuCurrently,inIcuCumulative,onVentilatorCurrently,...,hospitalized,total,totalTestResults,posNeg,fips,deathIncrease,hospitalizedIncrease,negativeIncrease,positiveIncrease,totalTestResultsIncrease
0,20200405,AK,185.0,6099.0,,,20.0,,,,...,20.0,6284,6284,6284,2,1.0,4.0,230.0,14.0,244.0
1,20200405,AL,1796.0,11282.0,,,231.0,,,,...,231.0,13078,13078,13078,1,2.0,19.0,2009.0,216.0,2225.0
2,20200405,AR,830.0,10412.0,,67.0,130.0,,43.0,27.0,...,130.0,11242,11242,11242,5,2.0,130.0,785.0,87.0,872.0
3,20200405,AS,0.0,20.0,6.0,,,,,,...,,26,20,20,60,0.0,0.0,0.0,0.0,0.0
4,20200405,AZ,2269.0,25141.0,,,310.0,,108.0,,...,310.0,27410,27410,27410,4,12.0,13.0,0.0,250.0,250.0
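
The `date` column in the output above is an integer such as `20200405`, which will not behave like a date when plotting or computing differences. A minimal sketch of one way to parse it, shown on two sample values and not yet applied to `covid_us_states`:

```python
import pandas as pd

# Integer dates in YYYYMMDD form, as in the daily states file.
dates = pd.Series([20200405, 20200404], name="date")

# Convert to strings first so the format string can be applied.
parsed = pd.to_datetime(dates.astype(str), format="%Y%m%d")
print(parsed)
```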


In [53]:
# add full state name as a column
states_dict = {
        'AK': 'Alaska',
        'AL': 'Alabama',
        'AR': 'Arkansas',
        'AS': 'American Samoa',
        'AZ': 'Arizona',
        'CA': 'California',
        'CO': 'Colorado',
        'CT': 'Connecticut',
        'DC': 'District of Columbia',
        'DE': 'Delaware',
        'FL': 'Florida',
        'GA': 'Georgia',
        'GU': 'Guam',
        'HI': 'Hawaii',
        'IA': 'Iowa',
        'ID': 'Idaho',
        'IL': 'Illinois',
        'IN': 'Indiana',
        'KS': 'Kansas',
        'KY': 'Kentucky',
        'LA': 'Louisiana',
        'MA': 'Massachusetts',
        'MD': 'Maryland',
        'ME': 'Maine',
        'MI': 'Michigan',
        'MN': 'Minnesota',
        'MO': 'Missouri',
        'MP': 'Northern Mariana Islands',
        'MS': 'Mississippi',
        'MT': 'Montana',
        'NA': 'National',
        'NC': 'North Carolina',
        'ND': 'North Dakota',
        'NE': 'Nebraska',
        'NH': 'New Hampshire',
        'NJ': 'New Jersey',
        'NM': 'New Mexico',
        'NV': 'Nevada',
        'NY': 'New York',
        'OH': 'Ohio',
        'OK': 'Oklahoma',
        'OR': 'Oregon',
        'PA': 'Pennsylvania',
        'PR': 'Puerto Rico',
        'RI': 'Rhode Island',
        'SC': 'South Carolina',
        'SD': 'South Dakota',
        'TN': 'Tennessee',
        'TX': 'Texas',
        'UT': 'Utah',
        'VA': 'Virginia',
        'VI': 'Virgin Islands',
        'VT': 'Vermont',
        'WA': 'Washington',
        'WI': 'Wisconsin',
        'WV': 'West Virginia',
        'WY': 'Wyoming'
}

covid_us_states["name"] = covid_us_states["state"].map(states_dict)
covid_us_states.head()

Unnamed: 0,date,state,positive,negative,pending,hospitalizedCurrently,hospitalizedCumulative,inIcuCurrently,inIcuCumulative,onVentilatorCurrently,...,total,totalTestResults,posNeg,fips,deathIncrease,hospitalizedIncrease,negativeIncrease,positiveIncrease,totalTestResultsIncrease,name
0,20200405,AK,185.0,6099.0,,,20.0,,,,...,6284,6284,6284,2,1.0,4.0,230.0,14.0,244.0,Alaska
1,20200405,AL,1796.0,11282.0,,,231.0,,,,...,13078,13078,13078,1,2.0,19.0,2009.0,216.0,2225.0,Alabama
2,20200405,AR,830.0,10412.0,,67.0,130.0,,43.0,27.0,...,11242,11242,11242,5,2.0,130.0,785.0,87.0,872.0,Arkansas
3,20200405,AS,0.0,20.0,6.0,,,,,,...,26,20,20,60,0.0,0.0,0.0,0.0,0.0,American Samoa
4,20200405,AZ,2269.0,25141.0,,,310.0,,108.0,,...,27410,27410,27410,4,12.0,13.0,0.0,250.0,250.0,Arizona
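
`Series.map` returns `NaN` for any code that is missing from the dictionary, so it may be worth checking for unmapped codes after this step. A small sketch, using a shortened lookup and a made-up code `ZZ` that stands in for a missing entry:

```python
import pandas as pd

# Shortened lookup and sample codes; "ZZ" is a made-up code absent from the lookup.
lookup = {"AK": "Alaska", "AL": "Alabama"}
codes = pd.Series(["AK", "AL", "ZZ", "AK"], name="state")

names = codes.map(lookup)

# Codes whose mapped name is missing.
unmapped = sorted(codes[names.isna()].unique())
print(unmapped)
```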


In [102]:
# next, load population data (2019 state estimates from the Census Bureau)
us_states_population = pd.read_csv("https://www2.census.gov/programs-surveys/popest/datasets/2010-2019/state/detail/SCPRC-EST2019-18+POP-RES.csv")
# select and rename in one chained step to avoid modifying a slice in place
us_states_population = us_states_population[["NAME", "POPESTIMATE2019"]].rename(
    columns={"NAME": "name", "POPESTIMATE2019": "population"})

In [103]:
# join to cases (the inner join drops rows whose name has no match in the population table)
covid_us_states = covid_us_states.merge(us_states_population, on="name", how="inner")

# create cases per capita column
covid_us_states["positive_percap"] = covid_us_states["positive"] / covid_us_states["population"]

# join to lat/long
state_latlong = pd.read_csv("./data/statelatlong.csv")[["State", "Latitude", "Longitude"]]
state_latlong.rename(columns={"State": "state", 
                              "Latitude": "lat",
                              "Longitude": "long"}, inplace=True)
covid_us_states = covid_us_states.merge(state_latlong, on="state", how="inner")
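
Both merges above use `how="inner"`, which silently drops any row without a match on the other side. A hedged sketch of how one could audit that with pandas' `indicator` and `validate` options, using toy tables with made-up names and values:

```python
import pandas as pd

# Toy tables (names and values are made up); "Z" has no population entry.
cases = pd.DataFrame({"name": ["X", "Y", "Z"], "positive": [10, 20, 30]})
population = pd.DataFrame({"name": ["X", "Y"], "population": [1000, 4000]})

# A left join with indicator=True records where each row came from;
# validate raises if the population table has duplicate names.
checked = cases.merge(population, on="name", how="left",
                      indicator=True, validate="many_to_one")

# Rows that an inner join would have dropped.
dropped = checked.loc[checked["_merge"] == "left_only", "name"].tolist()
print(dropped)
```

Running the same check on the real tables would show which states or territories disappear before the map is drawn.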

Next, we draw a bubble map of the most recent data: one circle per state, sized by confirmed cases per capita.

In [106]:
us_states_geo = "https://raw.githubusercontent.com/PublicaMundi/MappingAPI/master/data/geojson/us-states.json"
date = covid_us_states["date"].max()
covid_us_states_mostrecent = covid_us_states.loc[covid_us_states["date"] == date]

In [136]:
# https://python-graph-gallery.com/313-bubble-map-with-folium/

# make an empty map of the US
m = folium.Map(location=[30, -90], zoom_start=4)
 
# add one bubble per state, sized by cases per capita
for _, row in covid_us_states_mostrecent.iterrows():
    folium.Circle(
        location=[row["lat"], row["long"]],
        popup=f"{row['state']}\n{int(row['positive'])}",
        radius=row["positive_percap"] * 50000000,  # folium.Circle radius is in meters
        color="crimson",
        fill=True,
        fill_color="crimson"
    ).add_to(m)

# plot
m
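
The map above scales the circle radius linearly with cases per capita, so a state with twice the rate gets a circle with four times the area. An alternative we have not adopted here is to scale the radius by the square root of the value so that area is proportional to it. The helper below is a sketch; `radius_for_area` and the `max_radius_m` default are our own hypothetical names and values.

```python
import math

def radius_for_area(value, max_value, max_radius_m=300_000):
    """Return a radius in meters such that circle area is proportional to value."""
    if max_value <= 0:
        return 0.0
    # Area grows with radius squared, so take the square root of the share.
    return max_radius_m * math.sqrt(max(value, 0) / max_value)

# A value one quarter of the maximum gets half the maximum radius.
print(radius_for_area(1, 4))
```

In the loop above, this could replace `row["positive_percap"] * 50000000`, with `max_value` set to the largest `positive_percap` in the frame.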


## Future ideas

#### Additional data
- Whether (and when) each state went on lockdown
- COVID-19 Community Mobility Reports from [Google](https://www.google.com/covid19/mobility/)

#### Data engineering
- Number of days since each state's lockdown began
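
As a rough sketch of the days-since-lockdown feature, assuming we can obtain a table of lockdown start dates per state (we have not sourced one yet), the calculation could look like this. The state codes `AA` and `BB` and all dates below are placeholders, not real data.

```python
import pandas as pd

# Placeholder lockdown dates for made-up state codes; "BB" has no lockdown recorded.
lockdowns = pd.DataFrame({
    "state": ["AA", "BB"],
    "lockdown_date": pd.to_datetime(["2020-03-20", None]),
})

# Placeholder daily rows, one per state and date.
daily = pd.DataFrame({
    "state": ["AA", "AA", "BB"],
    "date": pd.to_datetime(["2020-03-18", "2020-03-25", "2020-03-25"]),
})

daily = daily.merge(lockdowns, on="state", how="left")

# Negative values mean the date is before the lockdown; NaN means no lockdown recorded.
daily["days_since_lockdown"] = (daily["date"] - daily["lockdown_date"]).dt.days
print(daily)
```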