# see19 Guide

**A dataset and interface for visualizing and analyzing the epidemiology of Coronavirus Disease 2019 (aka COVID19, aka C19), the disease caused by SARS-CoV-2**

# Analysis

Please read my various deep dives with see19 exploring different aspects of COVID19.

* [How Effective Is Social Distancing?](https://ryanskene.github.io/see19/analysis/How%20Effective%20Is%20Social%20Distancing%3F.html)

# Contents

0. [Purpose](#section0)
1. [Getting Started](#section1)
2. [the Data](#section2)  
    2.1 [Data Sources](#section2.1)  
    2.2 [Dataset Characteristics](#section2.2)  
    2.3 [Disclaimer](#section2.3)
3. [the CaseStudy Interface](https://ryanskene.github.io/see19/guide/3.%20See19%20-%20the%20CaseStudy%20Interface.ipynb)
4. [Visualizing Regional Impacts](https://ryanskene.github.io/see19/guide/4.%20See19%20-%20Visualizing%20Regional%20Impacts.ipynb)
5. [Visualizing Factors in 4D](https://ryanskene.github.io/see19/guide/5.%20See19%20-%20Visualizing%20Factors%20in%204D.ipynb)
6. [Visualizing via Heat Map](https://ryanskene.github.io/see19/guide/6.%20See19%20-%20Visualizing%20via%20Heat%20Map.ipynb)

<h1><a id='section0'>0. Purpose</a></h1>


##### _"It is better to be vaguely right than exactly wrong."_   

_- Carveth Read, Logic, Chapter 22_

<br/>

**See19** is an early-stage attempt to aggregate various data and analyze their impact (in aggregate and in isolation) on the virulence of SARS-CoV-2.

There is no single all-encompassing dataset from an undoubted source that will serve the needs of every user for every use case. Thus, the dataset as it stands is an ad hoc aggregation from multiple sources with _eyeball_-style approximations used in some instances (see the [Disclaimer](#section2.3) section). But while the dataset's imperfections are numerous, they cannot blunt the power of the insights that can be gleaned from an early exploratory analysis.

Visualization tools have been developed to help users in that early exploration.

Statistical analysis is also a goal of the project and I expect to add such analysis tools as time progresses. Until then, the data is available for all.

Ease of use is paramount, so all data from all sources have been compiled into a single structure, readily consumed and manipulated in the ubiquitous `csv` format.

**I AM AN AMATEUR ENTHUSIAST. THIS IS A SOLO PROJECT UNTIL NOW. I AM SURE THERE ARE MISTAKES. PLEASE FLAG ANY ISSUES YOU SEE!**

<h1><a id='section1'>1. Getting Started</a></h1>

**See19** is a dataset ***and*** a python package.

The dataset can be accessed directly **[here](https://github.com/ryanskene/see19/tree/master/dataset)**. Files are timestamped with their creation date.

The package can be installed via pip.

`pip install see19`
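For readers who would rather work with a downloaded file than with the package, here is a minimal sketch of reading a see19-style CSV with pandas. The helper name `load_see19_csv` is hypothetical and the sample rows are invented for illustration; only the `date` column and the unnamed leading index column are taken from the table shown later in this guide.

```python
import io

import pandas as pd


def load_see19_csv(source):
    """Read a see19-style CSV from a path, URL, or file-like object."""
    # index_col=0 absorbs the unnamed leading index column, which would
    # otherwise appear as a column called "Unnamed: 0".
    return pd.read_csv(source, index_col=0, parse_dates=["date"])


# Toy stand-in for a downloaded file (values invented for illustration).
sample = io.StringIO(
    ",region_id,region_name,date,cases,deaths\n"
    "0,282,Abruzzo,2020-03-01 00:00:00+00:00,5,0\n"
    "1,282,Abruzzo,2020-03-02 00:00:00+00:00,6,1\n"
)
frame = load_see19_csv(sample)
print(frame.dtypes)
```

With a real file, pass the path of the timestamped CSV you downloaded in place of `sample`.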

<h1><a id='section2'> 2. the Data</a></h1>
The See19 dataset aggregates global data on COVID19 fatalities in various regions, as available data allows, and marries that data with available datasets on various exogenous factors that might impact the epidemiology of the virus.

The dataset was compiled using `Selenium`, `Django`, `SQLite`, and `Pandas`.


#### COVID19 Data Characteristics:
* Cumulative Cases for each region on each date
* Cumulative Fatalities for each region on each date
* State / Provincial-level data available for:
    * Australia
    * Brazil
    * Canada
    * China
    * Italy
    * United States
* Country-level data available for all other regions

**Factor Data Characteristics** available for most regions:
* Longitude / Latitude
    * I just wrote a script that searched the region name on [this website](https://www.openstreetmap.org/) and pulled the coordinates from the resulting URL
* Population
* Population demographic segmentation
* Land Density
* City Density (typically the density of the largest city in the region)
* Climate Characteristics including:
    * Average daily temperature
    * Average daily dewpoint temperature
    * Average daily relative humidity (derived from temperature and dewpoint temperature)
    * Total daily UV-B Radiation
* Air quality measures      
* Historical Health Outcomes
* Travel Popularity
* Social Distancing Implementation
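The list above notes that relative humidity is derived from temperature and dewpoint temperature. The guide does not say which approximation is used, so the sketch below assumes the widely used Magnus formula with the Alduchov and Eskridge coefficients; the function name and the Celsius inputs are also assumptions.

```python
import math


def relative_humidity(temp_c, dewpoint_c):
    """Approximate relative humidity (%) from air and dewpoint temperature in Celsius."""
    a, b = 17.625, 243.04  # Magnus coefficients (assumed, not confirmed by see19)
    # Both terms are proportional to vapour pressure; the shared constant cancels.
    saturation = math.exp(a * temp_c / (b + temp_c))
    actual = math.exp(a * dewpoint_c / (b + dewpoint_c))
    return 100.0 * actual / saturation


print(round(relative_humidity(20.0, 10.0), 1))
```

When the dewpoint equals the air temperature the function returns 100, which is a quick sanity check on any implementation of this kind.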
    
I aim to update the dataset each morning.

<h2><a id='section2.1'>2.1 Data Sources</a></h2>

#### COVID Case and Fatality Data:
* [Brazil Regional Data from the government supported site](https://covid.saude.gov.br/)
* [Italy Regional Data from the government github repo](https://github.com/pcm-dpc/COVID-19/blob/master/dati-regioni/dpc-covid19-ita-regioni-20200224.csv)
* [US Regional Data from the University of Virginia](https://nssac.bii.virginia.edu/covid-19/dashboard/)
* [Other Regions from Johns Hopkins via humdata.org](https://data.humdata.org/dataset/novel-coronavirus-2019-ncov-cases)

#### Other Data:
* Longitude & Latitude
    * I just wrote a script that searched each region name on this [site](https://www.openstreetmap.org/)
    * Any errors were fixed manually
* [Population, Demographics, and Density from SEDAC](https://sedac.ciesin.columbia.edu/data/set/gpw-v4-admin-unit-center-points-population-estimates-rev11)
    * Matched to regional case data by name, often manually
* [Climate Data from European Centre for Medium-Range Weather Forecasts](https://cds.climate.copernicus.eu/cdsapp#!/dataset/reanalysis-era5-single-levels?tab=overview)
    * Climate data pulled from nearest matching longitude & latitude coordinate in the dataset
* [Air Quality Data from the World Air Quality Project](https://aqicn.org/data-platform/covid19/verify/1c09b43b-09f2-4244-a86f-24647e1fa3d9)
    * Air quality data recorded at city-level, with limited number of cities available
    * City data is aggregated to the regional or country-level
    * So, where a region has multiple cities reporting AQ data, the region value is an aggregate of the cities
    * Where a region has only a single city, that city represents the whole region
    * Where a region has no reporting cities, there is no AQ data
* Social Distancing Stringency Index via [Oxford Covid Government Response Tracker](https://github.com/OxCGRT/covid-policy-tracker)
* [Global Mobility Data via Google](https://www.google.com/covid19/mobility/)
* GDP Per Capita via the [OECD](https://stats.oecd.org/Index.aspx?DataSetCode=REGION_ECONOM) and [WorldBank](https://data.worldbank.org/indicator/NY.GDP.MKTP.PP.CD?most_recent_year_desc=false)
    * utilizing real 2016 Purchasing Power Parity figures indexed to 2015 US dollars
* Causes of Death
    * A fairly messy hodgepodge of data for [global](https://ourworldindata.org/causes-of-death), [US](https://wonder.cdc.gov/controller/datarequest/D76;jsessionid=7D21B11E6FF1F1059C184EE313E58875), and [Italy](http://dati.istat.it/Index.aspx?QueryId=26435&lang=en#)
* Travel Popularity
    * An even messier hodgepodge of data pulled from the World Tourism Organization via [indexmundi](https://www.indexmundi.com/facts/indicators/ST.INT.ARVL/rankings)
    * State/Provincial data were derived from the country-level and other various sources in an ad-hoc fashion
    * Good travel data is surprisingly difficult to come by. There are a number of services that offer data on flight statistics, but they are prohibitively expensive
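Returning to the air quality bullets above, the three cases (several cities, one city, no cities) can be sketched with a pandas `groupby`. This is an illustration only: the guide says city values are aggregated but not how, so the mean used here is an assumption, and the column names and readings are invented.

```python
import pandas as pd

# Toy city-level readings (values invented for illustration).
city_aq = pd.DataFrame({
    "region": ["A", "A", "B"],
    "city": ["a1", "a2", "b1"],
    "pm25": [10.0, 20.0, 30.0],
})
all_regions = ["A", "B", "C"]  # region C has no reporting city

# Mean across the cities of each region; a single-city region
# simply keeps that city's value.
region_aq = city_aq.groupby("region")["pm25"].mean()

# Regions with no reporting city become NaN after reindexing.
region_aq = region_aq.reindex(all_regions)
print(region_aq)
```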

<h2><a id='section2.2'>2.2 Dataset Characteristics</a></h2>

With `see19` installed, we can download the dataset via `get_baseframe`:

```python
from see19 import get_baseframe
baseframe = get_baseframe()
```

The dataset is arranged such that each row is a unique entry for each `region_id` on each `date`

All other columns are the value of that particular factor in that particular region on that particular date

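```python
baseframe.head(3)
```

| Unnamed: 0 | region_id | country_id | region_name | country_code | country | date | cases | deaths | population | land_KM2 | ... | genito | childbirth | perinatal | congenital | other | external | visitors | travel_year | gdp | gdp_year |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 282 | 110 | Abruzzo | ITA | Italy | 2020-01-01 00:00:00+00:00 | NaN | NaN | 1302305.0 | 5836.611979 | ... | 442.0 | 1.0 | 16.0 | 19.0 | 384.0 | 2059 | 181458.0 | 2017.0 | 45608600000.0 | 2016.0 |
| 1 | 282 | 110 | Abruzzo | ITA | Italy | 2020-01-02 00:00:00+00:00 | NaN | NaN | 1302305.0 | 5836.611979 | ... | 442.0 | 1.0 | 16.0 | 19.0 | 384.0 | 2059 | 181458.0 | 2017.0 | 45608600000.0 | 2016.0 |
| 2 | 282 | 110 | Abruzzo | ITA | Italy | 2020-01-03 00:00:00+00:00 | NaN | NaN | 1302305.0 | 5836.611979 | ... | 442.0 | 1.0 | 16.0 | 19.0 | 384.0 | 2059 | 181458.0 | 2017.0 | 45608600000.0 | 2016.0 |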


_This could perhaps be more appropriately structured as a multi-index frame, however, I find such indexes cumbersome to work with._
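For anyone who does prefer a multi-index, the conversion is a one-liner. The sketch below uses a toy frame that mimics the `region_id` / `date` layout (the `deaths` values are invented), first checks that each pair appears only once, and then builds the index.

```python
import pandas as pd

# Toy frame mimicking the region_id / date layout (values invented).
toy = pd.DataFrame({
    "region_id": [1, 1, 2, 2],
    "date": pd.to_datetime(["2020-03-01", "2020-03-02"] * 2),
    "deaths": [0, 1, 3, 5],
})

# Each (region_id, date) pair should appear exactly once.
has_duplicates = toy.duplicated(subset=["region_id", "date"]).any()
print("duplicate rows:", has_duplicates)

# Move the two key columns into a sorted MultiIndex.
indexed = toy.set_index(["region_id", "date"]).sort_index()

# Look up a single region on a single date.
value = indexed.loc[(2, pd.Timestamp("2020-03-02")), "deaths"]
print(value)
```

Calling `indexed.reset_index()` returns to the flat layout used throughout this guide.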

```python
num_regs = baseframe.region_id.unique().size
```

There are **{{num_regs}}** unique regions in the dataset

Australia, Brazil, Canada, China, Italy, and the US have state/provincial level data.

For example, regions within Italy and Brazil are as follows:

```python
baseframe[baseframe.country.isin(['Italy', 'Brazil'])].region_name.unique()
```

```
array(['Abruzzo', 'Acre', 'Alagoas', 'Amapa', 'Amazonas', 'Bahia',
       'Basilicata', 'Calabria', 'Campania', 'Ceara', 'Distrito Federal',
       'Emilia-Romagna', 'Espirito Santo', 'Friuli Venezia Giulia',
       'Goias', 'Lazio', 'Liguria', 'Lombardia', 'Maranhao', 'Marche',
       'Mato Grosso', 'Mato Grosso Do Sul', 'Minas Gerais', 'Molise',
       'P.A. Bolzano', 'P.A. Trento', 'Para', 'Paraiba', 'Parana',
       'Pernambuco', 'Piaui', 'Piemonte', 'Puglia', 'Rio De Janeiro',
       'Rio Grande Do Norte', 'Rio Grande Do Sul', 'Rondonia', 'Roraima',
       'Santa Catarina', 'Sao Paulo', 'Sardegna', 'Sergipe', 'Sicilia',
       'Tocantins', 'Toscana', 'Umbria', "Valle d'Aosta", 'Veneto'],
      dtype=object)
```

```python
num_dates = baseframe.date.unique().size
rows = baseframe.date.shape[0]
```

Each region has up to **{{num_dates}}** dates in the dataset.

Thus there are **{{rows}}** rows in the dataset, with one row for each unique `region_id`-`date` combination.

```python
num_cols = baseframe.columns.size
```

There are currently **{{num_cols}}** columns in the dataset, most of which are observable factors.

The factors can be seen as split between two types:
* **Time-static** factors, i.e. those that do not change with the date.
    * population, density, population demographic ranges, cause of death outcomes, travel popularity

* **Time-dynamic** factors, i.e. those that change with each date.
    * fatalities, climate, pollution, mobility, and the Oxford stringency index

They can be found as follows:

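```python
# Use a single region so that differences between regions don't count as change over time
ny = baseframe[baseframe.region_name == 'New York']

dynamic = []
static = []
for col in ny.columns:
    # More than one distinct value across dates means the factor varies with time
    if ny[col].unique().size > 1:
        dynamic.append(col)
    else:
        static.append(col)
```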

```python
size = baseframe.size
```

The entire set has **{{size}}** different data points

**Aggregate to Country Level**

There may be times the user wishes to focus on just country level analysis.

Thus, there is an `agg_to_country_level` function that aggregates any state/provincial-level regions to their respective country level.

**NOTE:** Climate measures such as temperature and UV-B radiation are averaged, which can be inaccurate across large countries (like the United States).

**NOTE:** Several factors that can't easily be aggregated are removed.

```python
from see19 import agg_to_country_level
country_base = agg_to_country_level(baseframe)
```

```python
country_base[country_base.region_name == 'USA'].head(2)
```

| Unnamed: 0 | region_id | country_id | region_name | country_code | country | date | cases | deaths | population | land_KM2 | ... | genito | childbirth | perinatal | congenital | other | external | visitors | travel_year | gdp | gdp_year |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | reg_for_USA | 236 | USA | USA | United States of America (the) | 2020-01-01 00:00:00+00:00 | NaN | 0.0 | 6053834.0 | 9090182.0 | ... | 68647.0 | 181.0 | 6481.0 | 3985.0 | 29419.0 | 14062 | 102108383.0 | 2017.0 | 18602100000000.0 | 2016.0 |
| 1 | reg_for_USA | 236 | USA | USA | United States of America (the) | 2020-01-02 00:00:00+00:00 | NaN | 0.0 | 6053834.0 | 9090182.0 | ... | 68647.0 | 181.0 | 6481.0 | 3985.0 | 29419.0 | 14062 | 102108383.0 | 2017.0 | 18602100000000.0 | 2016.0 |
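This guide does not show how `agg_to_country_level` works internally. As a rough illustration of the behaviour described in the notes above (climate measures averaged across a country's regions), here is a sketch on a toy frame. Summing `deaths`, the column name `temperature`, and every value in the frame are assumptions made for the example, not the package's implementation.

```python
import pandas as pd

# Toy state-level frame for one country (values invented).
toy = pd.DataFrame({
    "country": ["X", "X", "X", "X"],
    "region_name": ["North", "South", "North", "South"],
    "date": pd.to_datetime(
        ["2020-03-01", "2020-03-01", "2020-03-02", "2020-03-02"]
    ),
    "deaths": [1.0, 2.0, 3.0, 5.0],
    "temperature": [10.0, 20.0, 12.0, 22.0],
})

# One row per country and date: counts are summed (assumption),
# climate measures are averaged (as the notes describe).
country_level = toy.groupby(["country", "date"], as_index=False).agg(
    deaths=("deaths", "sum"),
    temperature=("temperature", "mean"),
)
print(country_level)
```

The averaged temperature makes the caveat concrete: a 10-degree gap between the two toy regions disappears in the country-level row.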


<h2><a id='section2.3'>2.3 Disclaimer</a></h2>

I have said it before and it bears repeating: **This is an imperfect dataset.** Specific problems are highlighted here.

**GENERAL ISSUES**
* Not all factors have available measurements for each region or each date.
    * These are typically expressed as `NaN`

* Some factors are available at regional levels while others are not
    * Measurements for a region are often compared to other measurements at the country level. This isn't necessarily problematic ... for geographically large and populous countries like the US, it is likely better that state-level data is used to compare to other smaller countries.
    * State-level measurements are often estimated by mixing separate data sources. For instance, visitor data for the states of Brazil was estimated by taking the country-level data from the World Tourism Organization and weighting it by each state's proportionate share of visitor travel, taken from separate Brazilian government data.
* Some data is outdated.
    * GDP data lags significantly, particularly for large groups of countries, so 2016 figures have been used, presuming that the relative mix among countries has remained constant
    
**DENSITY**

Population density is oft-cited as a potential explanatory factor in COVID19 infection rates. And I couldn't agree more that it is important to consider. However, the study of density suffers from many issues.


* Density is highly variable within regions. And case and fatality rates have been highly variable within regions and across densities. In New York City, for example, some of the least dense areas have had the highest infection rates.

* With only regional data available, the most rigorous option is simply to use the density of the region as a whole. However, this is often a poor reflection of reality. New York State actually has significant land mass despite most of its population being concentrated in and around New York City at its southeastern edge.

* To account for this, See19 includes a factor `city_dens`. `city_dens` is the density of the largest city in the region, so:
    * for New York State, `city_dens` is the density of New York City,
    * for Taiwan, `city_dens` is the density of Taipei, 
    * for Japan, `city_dens` is the density of Tokyo, and so on.

    This approach results in its own issues. For instance, at present, for all of Russia, `city_dens` reflects the density of Moscow.
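To make the regional figure concrete, the sketch below computes a land density from the `population` and `land_KM2` values that appear in the Abruzzo sample rows earlier in this guide. It assumes land density is simply population divided by land area, which the guide does not state explicitly; the function name is hypothetical.

```python
def land_density(population, land_km2):
    """People per square kilometre of land."""
    if land_km2 <= 0:
        raise ValueError("land area must be positive")
    return population / land_km2


# Abruzzo figures taken from the sample rows shown earlier in this guide.
abruzzo = land_density(1302305.0, 5836.611979)
print(round(abruzzo, 1))
```

A single number like this says nothing about how people are distributed inside the region, which is the gap `city_dens` tries to fill.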

Other geographic measurements, such as `temperature` and `uvb radiation` suffer from similar issues.


The only true way to address these shortcomings is for ***daily*** case and fatality statistics to be released at the county-level (or equivalent) in every country around the globe.

**CASE DATA**

Aside from just the difficulties of aggregating data, there are well-documented issues with the underlying case and fatality counts as well.


* Confirmed cases are likely well below actual cases, given that up to 50% of all COVID19 cases may be asymptomatic and that limited testing in the early stages led to many symptomatic cases going unreported.


* The rapid improvement in testing likely exaggerated the growth of infections over time


* Fatalities went unreported at peak periods due to a lack of health care capacity


* Fatalities have been retroactively added to the data without being assigned back to the days on which they actually occurred, so for regions like Hubei and New York State, there are massive spikes in fatalities that don't reflect the actual experience.


* China has been heavily criticized for under-reporting, for late reporting, and for recently adding a ~20% increase in cumulative fatalities on a random day in March. For these reasons, throughout this tutorial, you will see that China is often excluded from the dataset.
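The retroactive spikes described above are easiest to see once cumulative counts are converted to daily increments. The sketch below does that and flags suspicious days with a deliberately crude rule. The function names, the `factor` threshold, and the sample numbers are all invented for illustration; this is not a see19 feature.

```python
from statistics import median


def daily_from_cumulative(cumulative):
    """Turn a cumulative series into day-over-day increments."""
    return [later - earlier for earlier, later in zip(cumulative, cumulative[1:])]


def flag_spikes(daily, factor=5.0):
    """Return positions whose count exceeds `factor` times the median non-zero day."""
    nonzero = [d for d in daily if d > 0]
    if not nonzero:
        return []
    threshold = factor * median(nonzero)
    return [i for i, d in enumerate(daily) if d > threshold]


# Invented cumulative fatalities containing one retroactive dump.
cumulative = [0, 2, 5, 9, 14, 114, 120, 125]
daily = daily_from_cumulative(cumulative)
spikes = flag_spikes(daily)
print(daily)
print(spikes)
```

A flagged day is only a prompt to look closer: a genuine surge and a reporting correction look the same to a rule this simple.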

# Next Section

Click on this link to go to the next notebook: [3. CaseStudy Interface](3.%20See19%20-%20the%20CaseStudy%20Interface.ipynb)