# see19 Guide

**A dataset and interface for visualizing and analyzing the epidemiology of Coronavirus Disease 2019 aka SARS-CoV-2 aka COVID19 aka C19**

Find it on [GitHub](https://github.com/ryanskene/see19)

# 3. The CaseStudy Interface

Visualization and data analysis in see19 are carried out through the `CaseStudy` class.

`CaseStudy` can be accessed directly from the `see19` module:

In [1]:
from see19 import CaseStudy, get_baseframe
baseframe = get_baseframe()
casestudy = CaseStudy(baseframe)

The original baseframe can be accessed via the `baseframe` attribute:

In [2]:
casestudy.baseframe.head(2)

Unnamed: 0,region_id,country_id,region_name,country_code,country,date,cases,deaths,population,land_KM2,...,genito,childbirth,perinatal,congenital,other,external,visitors,travel_year,gdp,gdp_year
0,282,110,Abruzzo,ITA,Italy,2020-01-01 00:00:00+00:00,,,1302305.0,5836.611979,...,442.0,1.0,16.0,19.0,384.0,2059,181458.0,2017.0,45608600000.0,2016.0
1,282,110,Abruzzo,ITA,Italy,2020-01-02 00:00:00+00:00,,,1302305.0,5836.611979,...,442.0,1.0,16.0,19.0,384.0,2059,181458.0,2017.0,45608600000.0,2016.0


`CaseStudy` automatically computes a number of adjustments, including:

1. Daily new cases and fatalities
2. Daily Moving Average (DMA) for new and cumulative cases and fatalities
3. Population and density adjustments for new and cumulative cases and fatalities
4. Daily growth or change in items 1 through 3 above

These adjustments are referred to as `count_categories`.
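The arithmetic behind these adjustments can be sketched in plain Python. This is an illustration only, not the library's implementation: the helper names are hypothetical, the 3-day window and the formulas (growth as the ratio to the prior day, density adjustment as the count divided by people per square kilometre) are inferred from the sample outputs in this guide, and the sample figures are taken from the Florida and P.A. Trento rows shown elsewhere on this page.

```python
def daily_new(cumulative):
    """Day-over-day difference of a cumulative series (no value for the first day)."""
    return [None] + [curr - prev for prev, curr in zip(cumulative, cumulative[1:])]


def moving_average(values, window=3):
    """Trailing mean over `window` days; None until a full window is available."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        if len(chunk) < window or any(v is None for v in chunk):
            out.append(None)
        else:
            out.append(sum(chunk) / window)
    return out


def per_million(values, population):
    """Scale counts to a rate per one million residents."""
    return [None if v is None else v / population * 1_000_000 for v in values]


def per_person_per_km2(values, population, area_km2):
    """Divide counts by population density (people per square kilometre)."""
    density = population / area_km2
    return [None if v is None else v / density for v in values]


def growth(values):
    """Ratio of each value to the previous day's value."""
    out = [None]
    for prev, curr in zip(values, values[1:]):
        out.append(None if prev in (None, 0) or curr is None else curr / prev)
    return out


# Cumulative Florida cases; 1007 is implied by 1227 cumulative minus 220 new cases.
cases = [1007, 1227, 1467]
population = 18801310

cases_new = daily_new(cases)                    # [None, 220, 240]
cases_dma = moving_average(cases)               # third value is about 1233.67
cases_per_1M = per_million(cases, population)
growth_cases = growth(cases)

# One new death in P.A. Trento (population 515201, land area 2938.79544 km2).
trento_density_adj = per_person_per_km2([1.0], 515201.0, 2938.79544)
```

Chaining the helpers mirrors how the category names are composed: for example, `per_million(moving_average(daily_new(cases)), population)` corresponds to the idea behind `cases_new_dma_per_1M`.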

The amended dataframe can be accessed via the `df` attribute:

In [3]:
casestudy.df.head(2)

Unnamed: 0,region_id,country_id,region_name,country_code,country,date,cases,deaths,population,land_KM2,...,growth_deaths_new_dma_per_1M,growth_deaths_new_dma_per_person_per_land_KM2,growth_deaths_new_dma_per_person_per_city_KM2,growth_cases_per_1M,growth_cases_per_person_per_land_KM2,growth_cases_per_person_per_city_KM2,growth_deaths_per_1M,growth_deaths_per_person_per_land_KM2,growth_deaths_per_person_per_city_KM2,days
25119,32,110,P.A. Trento,ITA,Italy,2020-03-12 00:00:00+00:00,107.0,1.0,515201.0,2938.79544,...,,,,1.38961,1.38961,1.38961,,,,0 days
25120,32,110,P.A. Trento,ITA,Italy,2020-03-13 00:00:00+00:00,163.0,2.0,515201.0,2938.79544,...,2.0,2.0,2.0,1.523364,1.523364,1.523364,2.0,2.0,2.0,1 days


For ease of selection, `CaseStudy` has a number of class attributes with different groupings of count categories: `BASECOUNT_CATS`, `PER_CATS`, `LOGNAT_CATS`, `ALL_CATS`, `DMA_COUNT_CATS`, `PER_COUNT_CATS`


In [4]:
CaseStudy.DMA_COUNT_CATS

['cases_dma',
 'cases_new_dma',
 'deaths_dma',
 'deaths_new_dma',
 'cases_dma_per_1M',
 'cases_dma_per_person_per_land_KM2',
 'cases_dma_per_person_per_city_KM2',
 'cases_new_dma_per_1M',
 'cases_new_dma_per_person_per_land_KM2',
 'cases_new_dma_per_person_per_city_KM2',
 'deaths_dma_per_1M',
 'deaths_dma_per_person_per_land_KM2',
 'deaths_dma_per_person_per_city_KM2',
 'deaths_new_dma_per_1M',
 'deaths_new_dma_per_person_per_land_KM2',
 'deaths_new_dma_per_person_per_city_KM2',
 'cases_dma_lognat',
 'cases_new_dma_lognat',
 'deaths_dma_lognat',
 'deaths_new_dma_lognat',
 'cases_dma_per_1M_lognat',
 'cases_dma_per_person_per_land_KM2_lognat',
 'cases_dma_per_person_per_city_KM2_lognat',
 'cases_new_dma_per_1M_lognat',
 'cases_new_dma_per_person_per_land_KM2_lognat',
 'cases_new_dma_per_person_per_city_KM2_lognat',
 'deaths_dma_per_1M_lognat',
 'deaths_dma_per_person_per_land_KM2_lognat',
 'deaths_dma_per_person_per_city_KM2_lognat',
 'deaths_new_dma_per_1M_lognat',
 'deaths_new_dma_per
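The names in this list follow a regular pattern: a base count, an optional `_new`, `_dma`, an optional population or density suffix, and an optional `_lognat`. The sketch below rebuilds the visible names from that pattern. It is a reading of the output above, not the library's own code, and `dma_category_names` is a hypothetical helper.

```python
BASES = ["cases", "deaths"]
PER_SUFFIXES = ["_per_1M", "_per_person_per_land_KM2", "_per_person_per_city_KM2"]


def dma_category_names():
    """Compose DMA category names in the same order as the printed list."""
    # cases_dma, cases_new_dma, deaths_dma, deaths_new_dma
    base_dma = [f"{base}{new}_dma" for base in BASES for new in ("", "_new")]
    # each DMA name with every population/density suffix
    per = [f"{name}{suffix}" for name in base_dma for suffix in PER_SUFFIXES]
    plain = base_dma + per
    # natural-log variants repeat the same order with a `_lognat` suffix
    return plain + [f"{name}_lognat" for name in plain]


names = dma_category_names()
print(names[:4])
print(len(names))
```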

When `lognat=True` is provided, `CaseStudy` also takes the natural log of each of items 1 through 3 above:

In [None]:
casestudy = CaseStudy(baseframe, lognat=True)

In [None]:
casestudy.LOGNAT_CATS

In [7]:
num_cats = len(CaseStudy.ALL_COUNT_CATS)

In total, there are **{{num_cats}}** different `count_categories` to choose from.

`casestudy.df` can be limited to specific count categories via the `count_categories` attribute, which accepts either a single category name or a list of names:

In [8]:
casestudy = CaseStudy(baseframe, count_categories='deaths_new_dma_per_person_per_land_KM2')
casestudy.df.head(2)

Unnamed: 0,region_id,country_id,region_name,country_code,country,date,cases,deaths,population,land_KM2,land_dens,city_KM2,city_dens,deaths_new_dma_per_person_per_land_KM2,days
25119,32,110,P.A. Trento,ITA,Italy,2020-03-12 00:00:00+00:00,107.0,1.0,515201.0,2938.79544,175.310262,2938.79544,175.310262,0.001901,0 days
25120,32,110,P.A. Trento,ITA,Italy,2020-03-13 00:00:00+00:00,163.0,2.0,515201.0,2938.79544,175.310262,2938.79544,175.310262,0.003803,1 days


In [9]:
casestudy = CaseStudy(baseframe, count_categories=['deaths_new_dma_per_person_per_land_KM2', 'growth_deaths_new_per_1M'])
casestudy.df.head(2)

Unnamed: 0,region_id,country_id,region_name,country_code,country,date,cases,deaths,population,land_KM2,land_dens,city_KM2,city_dens,deaths_new_dma_per_person_per_land_KM2,growth_deaths_new_per_1M,days
25119,32,110,P.A. Trento,ITA,Italy,2020-03-12 00:00:00+00:00,107.0,1.0,515201.0,2938.79544,175.310262,2938.79544,175.310262,0.001901,,0 days
25120,32,110,P.A. Trento,ITA,Italy,2020-03-13 00:00:00+00:00,163.0,2.0,515201.0,2938.79544,175.310262,2938.79544,175.310262,0.003803,1.0,1 days
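Both calls above work because `count_categories` takes either one name or several. A minimal sketch of how such an argument might be normalized and applied to a row is shown here. The helper names are hypothetical and this is not the library's code; the row uses the second P.A. Trento record, with `cases_new` derived as 163 minus 107.

```python
def normalize_categories(count_categories):
    """Return a list of names from a single name or an iterable of names."""
    if isinstance(count_categories, str):
        return [count_categories]
    return list(count_categories)


def select_columns(row, base_columns, count_categories):
    """Keep the always-present base columns plus the requested categories."""
    wanted = list(base_columns) + normalize_categories(count_categories)
    return {col: row[col] for col in wanted if col in row}


row = {
    "region_name": "P.A. Trento",
    "date": "2020-03-13",
    "cases_new": 56.0,
    "deaths_new_dma_per_person_per_land_KM2": 0.003803,
    "growth_deaths_new_per_1M": 1.0,
}
base = ["region_name", "date"]

single = select_columns(row, base, "deaths_new_dma_per_person_per_land_KM2")
several = select_columns(
    row, base, ["deaths_new_dma_per_person_per_land_KM2", "growth_deaths_new_per_1M"]
)
```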


`CaseStudy` can further filter `baseframe` as follows:
    
* Limiting data via different start and tail hurdles
* Limiting the frame to certain regions
* Limiting the frame to certain countries
* Excluding certain regions
* Excluding certain countries

The default parameters for `start_factor` and `start_hurdle` limit the data to regions with at least one cumulative fatality.

**NOTE**: a `days` column is added to `casestudy.df`. It counts the number of days between the current date and the first date in the frame for the current region. When a `start_factor` is provided, the first date is the date on which the `start_hurdle` is cleared. When `start_factor` is not provided, it is the first date in the dataset.

The start hurdle can be customized for various factors:

In [10]:
casestudy = CaseStudy(
    baseframe, regions=['New York', 'Florida'], count_categories=CaseStudy.BASECOUNT_CATS, 
    start_factor='deaths_new', start_hurdle=3
)
casestudy.df.head(2)

Unnamed: 0,region_id,country_id,region_name,country_code,country,date,cases,deaths,population,land_KM2,...,city_dens,cases_dma,cases_new,cases_new_dma,deaths_dma,deaths_new,deaths_new_dma,cases.1,deaths.1,days
10002,64,236,Florida,USA,United States of America (the),2020-03-23 00:00:00+00:00,1227.0,18.0,18801310.0,139073.534072,...,505.760953,999.0,220.0,221.333333,14.666667,5.0,2.333333,1227.0,18.0,0 days
10003,64,236,Florida,USA,United States of America (the),2020-03-24 00:00:00+00:00,1467.0,23.0,18801310.0,139073.534072,...,505.760953,1233.666667,240.0,234.666667,18.0,5.0,3.333333,1467.0,23.0,1 days
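The `days` column and the start hurdle can be illustrated with a short sketch for a single region. Several things here are assumptions: the function name `add_days` is hypothetical, the defaults `start_factor='deaths'` and `start_hurdle=1` are inferred from the "at least one cumulative fatality" description, the hurdle is treated as cleared when the value is greater than or equal to it, and the 2020-03-11 row is invented for illustration.

```python
from datetime import date


def add_days(rows, start_factor="deaths", start_hurdle=1):
    """Trim one region's date-sorted rows to the start date and add a `days` offset.

    With an empty `start_factor`, counting starts at the first row instead.
    """
    if not rows:
        return []
    if start_factor:
        # assumption: the hurdle is cleared when value >= start_hurdle
        cleared = [r for r in rows if (r.get(start_factor) or 0) >= start_hurdle]
        if not cleared:
            return []
        start = cleared[0]["date"]
    else:
        start = rows[0]["date"]
    return [{**r, "days": r["date"] - start} for r in rows if r["date"] >= start]


trento = [
    {"date": date(2020, 3, 11), "deaths": None},  # hypothetical earlier row
    {"date": date(2020, 3, 12), "deaths": 1.0},
    {"date": date(2020, 3, 13), "deaths": 2.0},
]

with_hurdle = add_days(trento)                      # starts on 2020-03-12
without_hurdle = add_days(trento, start_factor="")  # starts on 2020-03-11
```

Each `days` value is a `datetime.timedelta`, which matches the `0 days`, `1 days` display in the outputs on this page.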


In [11]:
casestudy = CaseStudy(
    baseframe, excluded_countries=['United States of America (the)'], 
    count_categories='deaths_new', start_factor='deaths_new', start_hurdle=3
)
casestudy.df[casestudy.df.country == 'United States of America (the)']

Unnamed: 0,region_id,country_id,region_name,country_code,country,date,cases,deaths,population,land_KM2,land_dens,city_KM2,city_dens,deaths_new,days


To see the earliest dates in the dataframe, prior to any deaths being recorded, set `start_factor` to `''`.

In [12]:
casestudy = CaseStudy(
    baseframe, regions=['New York', 'Florida'], count_categories='deaths_new_dma', 
    factors=['temp', 'strindex'], start_factor=''
)
casestudy.df.head(2)

Unnamed: 0,region_id,country_id,region_name,country_code,country,date,cases,deaths,population,land_KM2,land_dens,city_KM2,city_dens,deaths_new_dma,temp,strindex,days
9920,64,236,Florida,USA,United States of America (the),2020-01-01 00:00:00+00:00,,,18801310.0,139073.534072,135.189705,37174.301192,505.760953,,9.323175,0.0,0 days
9921,64,236,Florida,USA,United States of America (the),2020-01-02 00:00:00+00:00,,,18801310.0,139073.534072,135.189705,37174.301192,505.760953,,10.66485,0.0,1 days


The remaining columns in the `baseframe` can be included in a `CaseStudy` instance on an ***opt-in*** basis via the `factors` attribute:

In [13]:
casestudy = CaseStudy(baseframe, count_categories='deaths_new_per_person_per_land_KM2', factors=['no2', 'strindex'])
casestudy.df.head(2)

Unnamed: 0,region_id,country_id,region_name,country_code,country,date,cases,deaths,population,land_KM2,land_dens,city_KM2,city_dens,deaths_new_per_person_per_land_KM2,no2,strindex,days
25119,32,110,P.A. Trento,ITA,Italy,2020-03-12 00:00:00+00:00,107.0,1.0,515201.0,2938.79544,175.310262,2938.79544,175.310262,0.005704,,87.43,0 days
25120,32,110,P.A. Trento,ITA,Italy,2020-03-13 00:00:00+00:00,163.0,2.0,515201.0,2938.79544,175.310262,2938.79544,175.310262,0.005704,,87.43,1 days


For convenience, a number of factor groupings can be accessed via `CaseStudy` attributes:

* `MOBIS`, `CAUSES`, `MAJOR_CAUSES`, `POLLUTS`, `TEMP_MSMTS`, `MSMTS`
    * various groupings for factor data
* `STRINDEX_CATS`, `CONTAIN_CATS`, `ECON_CATS`, `HEALTH_CATS`
    * groupings for the Oxford Stringency Index


Demographic population age groupings can be accessed via the `see19` module:
* `ALL_RANGES` - all the possible demographic age ranges
* `RANGES` - a dictionary of various groupings of age ranges

In [14]:
print(CaseStudy.MSMTS)
print(CaseStudy.MAJOR_CAUSES)

['uvb', 'rhum', 'temp', 'dewpoint']
['circul', 'infectious', 'respir', 'endo']


In [15]:
from see19 import RANGES
RANGES.keys()

dict_keys(['UNDERS', 'OVERS', 'SCHOOL_GOERS', 'Y_MILLS', 'MILLS', 'MID', 'MID_PLUS'])

In [16]:
overs = RANGES['OVERS']['ranges']

casestudy = CaseStudy(baseframe, count_categories='deaths_new_per_person_per_land_KM2', factors=overs)
casestudy.df.head(2)

Unnamed: 0,region_id,country_id,region_name,country_code,country,date,cases,deaths,population,land_KM2,...,A70PLUSB,A75PLUSB,A80PLUSB,A85PLUSB,A65PLUSB_%,A70PLUSB_%,A75PLUSB_%,A80PLUSB_%,A85PLUSB_%,days
25119,32,110,P.A. Trento,ITA,Italy,2020-03-12 00:00:00+00:00,107.0,1.0,515201.0,2938.79544,...,77125.0,51969.0,229.0,0.0,0.203018,0.149699,0.100871,0.000444,0.0,0 days
25120,32,110,P.A. Trento,ITA,Italy,2020-03-13 00:00:00+00:00,163.0,2.0,515201.0,2938.79544,...,77125.0,51969.0,229.0,0.0,0.203018,0.149699,0.100871,0.000444,0.0,1 days


In [17]:
casestudy = CaseStudy(baseframe, count_categories='deaths_new_per_person_per_land_KM2', factors=CaseStudy.MAJOR_CAUSES)
casestudy.df.head(2)

Unnamed: 0,region_id,country_id,region_name,country_code,country,date,cases,deaths,population,land_KM2,...,deaths_new_per_person_per_land_KM2,circul,infectious,respir,endo,circul_%,infectious_%,respir_%,endo_%,days
25119,32,110,P.A. Trento,ITA,Italy,2020-03-12 00:00:00+00:00,107.0,1.0,515201.0,2938.79544,...,0.005704,4048,210,820,340.0,0.007857,0.000408,0.001592,0.00066,0 days
25120,32,110,P.A. Trento,ITA,Italy,2020-03-13 00:00:00+00:00,163.0,2.0,515201.0,2938.79544,...,0.005704,4048,210,820,340.0,0.007857,0.000408,0.001592,0.00066,1 days


With respect to the `STRINDEX_CATS` subgroups, if all the required categories are provided, `CaseStudy` will sum the individual category values. 

For example, if `CONTAIN_CATS` are provided, the aggregate of the eight categories will be included in the `c_sum` column.

In [18]:
casestudy = CaseStudy(baseframe, count_categories='deaths_new_per_person_per_land_KM2', factors=CaseStudy.CONTAIN_CATS)
casestudy.df.c_sum

25119    20.0
25120    20.0
25121    20.0
25122    20.0
25123    20.0
         ... 
28763     9.0
28764     9.0
28765     5.0
28766     5.0
28767     5.0
Name: c_sum, Length: 12595, dtype: float64
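The "sum only when every required category is present" rule can be sketched as follows. The category names `c1` through `c8` are an assumption based on the eight containment categories mentioned above and the `c1` factor used elsewhere in this guide. The row values are made up so that they total 20, and `add_subgroup_sum` is a hypothetical helper.

```python
# Assumed names for the eight containment categories.
CONTAIN_CATS = [f"c{i}" for i in range(1, 9)]


def add_subgroup_sum(row, required, sum_name):
    """Add `sum_name` to a copy of `row` only if every required category is present."""
    if all(cat in row for cat in required):
        return {**row, sum_name: sum(row[cat] for cat in required)}
    return dict(row)


# Hypothetical containment scores for one region on one date.
full_row = {"c1": 3.0, "c2": 3.0, "c3": 2.0, "c4": 4.0,
            "c5": 2.0, "c6": 3.0, "c7": 2.0, "c8": 1.0}
partial_row = {"c1": 3.0, "c2": 3.0}

with_sum = add_subgroup_sum(full_row, CONTAIN_CATS, "c_sum")
without_sum = add_subgroup_sum(partial_row, CONTAIN_CATS, "c_sum")
```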

Additional computations can be added for each factor via the `factor_dmas` attribute.

The attribute is a dictionary of the form `str(factor_name): int(dma)`, where `dma` is the moving-average window in days.

When provided, `CaseStudy` will automatically add `_dma`, `_growth`, and `_growth_dma` computations for each listed factor:

In [19]:
casestudy = CaseStudy(baseframe, count_categories='deaths_new_dma_per_1M', 
    factors=['temp', 'c1', 'strindex'], factor_dmas={'temp': 7, 'c1': 14}
)
casestudy.df.head(2)

Unnamed: 0,region_id,country_id,region_name,country_code,country,date,cases,deaths,population,land_KM2,...,temp,c1,strindex,temp_dma,temp_growth,temp_growth_dma,c1_dma,c1_growth,c1_growth_dma,days
25119,32,110,P.A. Trento,ITA,Italy,2020-03-12 00:00:00+00:00,107.0,1.0,515201.0,2938.79544,...,8.68194,3.0,87.43,5.893195,0.959184,1.452707,3.0,1.0,1.0,0 days
25120,32,110,P.A. Trento,ITA,Italy,2020-03-13 00:00:00+00:00,163.0,2.0,515201.0,2938.79544,...,9.148065,3.0,87.43,6.718051,1.053689,1.170931,3.0,1.0,1.0,1 days
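A rough sketch of the three derived columns follows. It assumes that `_growth` is the ratio to the previous day's value (consistent with the `temp` and `temp_growth` figures above) and that `_growth_dma` is a trailing mean of that ratio. The helper names are hypothetical, the first two temperatures are invented, and a 3-day window is used only to keep the example short.

```python
def trailing_mean(values, window):
    """Trailing mean over `window` points; None until a full window of values exists."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        complete = len(chunk) == window and all(v is not None for v in chunk)
        out.append(sum(chunk) / window if complete else None)
    return out


def ratio_to_previous(values):
    """Ratio of each value to the one before it; None where it is undefined."""
    out = [None]
    for prev, curr in zip(values, values[1:]):
        out.append(curr / prev if prev else None)
    return out


def factor_columns(name, values, window):
    """Build the `_dma`, `_growth` and `_growth_dma` series for one factor."""
    grow = ratio_to_previous(values)
    return {
        f"{name}_dma": trailing_mean(values, window),
        f"{name}_growth": grow,
        f"{name}_growth_dma": trailing_mean(grow, window),
    }


# The last two temperatures come from the P.A. Trento rows; the first two are made up.
series = {"temp": [8.0, 9.0, 8.68194, 9.148065]}
factor_dmas = {"temp": 3}

derived = {}
for factor, window in factor_dmas.items():
    derived.update(factor_columns(factor, series[factor], window))
```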


To provide a single dma for all the factors submitted, build the dictionary ahead of time:

In [20]:
factor_dmas = {msmt: 14 for msmt in CaseStudy.MSMTS}
casestudy = CaseStudy(
    baseframe, count_categories='deaths_new_dma_per_1M', 
    factors=CaseStudy.MSMTS, factor_dmas=factor_dmas
)
casestudy.df.head(2)

Unnamed: 0,region_id,country_id,region_name,country_code,country,date,cases,deaths,population,land_KM2,...,rhum_dma,rhum_growth,rhum_growth_dma,temp_dma,temp_growth,temp_growth_dma,dewpoint_dma,dewpoint_growth,dewpoint_growth_dma,days
25119,32,110,P.A. Trento,ITA,Italy,2020-03-12 00:00:00+00:00,107.0,1.0,515201.0,2938.79544,...,90.887667,1.050915,1.014481,4.082184,0.959184,1.238369,-1.975261,1.896068,-0.823534,0 days
25120,32,110,P.A. Trento,ITA,Italy,2020-03-13 00:00:00+00:00,163.0,2.0,515201.0,2938.79544,...,91.989446,0.995192,1.014527,4.513664,1.053689,1.218875,-0.780131,1.026207,-0.81909,1 days


Some factors are also adjusted for population. The adjusted columns carry a `_%` suffix, and the factors that receive this adjustment are listed in the `pop_cats` attribute.

These are typically time-static factors.

In [21]:
casestudy = CaseStudy(baseframe, count_categories='deaths_new_dma_per_1M', factors=['visitors', 'gdp', 'A65PLUSB' ])
casestudy.pop_cats

['A65PLUSB', 'visitors', 'gdp']

In [22]:
casestudy.df[['region_name', 'date', 'visitors_%', 'gdp_%', 'A65PLUSB_%']].head(2)

Unnamed: 0,region_name,date,visitors_%,gdp_%,A65PLUSB_%
25119,P.A. Trento,2020-03-12 00:00:00+00:00,19.864474,54504.746691,0.203018
25120,P.A. Trento,2020-03-13 00:00:00+00:00,19.864474,54504.746691,0.203018
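The `_%` columns appear to be the factor divided by the region's population, which can be checked against the P.A. Trento age-group figures shown earlier (77125 residents aged 70 and over, 51969 aged 75 and over, population 515201). The helper below is a hypothetical illustration of that division, not the library's code.

```python
def add_population_shares(row, factors):
    """Add a `<factor>_%` entry holding factor / population for each factor."""
    out = dict(row)
    for factor in factors:
        out[f"{factor}_%"] = row[factor] / row["population"]
    return out


trento = {"population": 515201.0, "A70PLUSB": 77125.0, "A75PLUSB": 51969.0}
adjusted = add_population_shares(trento, ["A70PLUSB", "A75PLUSB"])

print(round(adjusted["A70PLUSB_%"], 6))
print(round(adjusted["A75PLUSB_%"], 6))
```

Despite the `_%` suffix, the values are plain ratios: an `A70PLUSB_%` of roughly 0.15 means about 15% of the population, while `gdp_%` reads as GDP per person.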


# Next Section

Click on this link to go to the next notebook: [4. Visualizing Regional Impacts](https://ryanskene.github.io/see19/guide/4.%20See19%20-%20Visualizing%20Regional%20Impacts.html)