# see19 Guide

**A dataset and interface for visualizing and analyzing the epidemiology of Coronavirus Disease 2019 aka SARS-CoV-2 aka COVID19 aka C19**

Find it on [GitHub](https://github.com/ryanskene/see19)

Current with version 0.3.0.

# 3. The CaseStudy Interface

3.1 [Basics](#section3.1)  
3.2 [Filtering](#section3.2)  
3.3 [Available Factors](#section3.3)  
3.4 [Additional Flags](#section3.4)

see19 visualization and data analysis are performed via the `CaseStudy` class.

`CaseStudy` can be imported directly from the `see19` module:

In [1]:
import pandas as pd

In [2]:
from see19 import CaseStudy, get_baseframe
baseframe = get_baseframe()
casestudy = CaseStudy(baseframe)

[*********************100%*************************] Downloading ... COMPLETE

<h2><a id='section3.1'>3.1 Basics</a></h2>

The original baseframe can be accessed via the `baseframe` attribute

In [3]:
casestudy.baseframe.head(2)


`CaseStudy` automatically computes different adjustments including:

1. Daily new cases, fatalities, and tests
2. Daily Moving Average (DMA) for new and cumulative cases, fatalities, and tests
3. Population and density adjustments for new and cumulative cases, fatalities, and tests
4. Daily growth or change in items 1 through 3 above

These adjustments are referred to as `count_categories`.
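
To make the adjustments listed above concrete, here is a minimal stand-alone sketch of how such columns could be derived from a cumulative series. It does not use see19: the helper names, the 3-day window, the growth definition, and the sample numbers are all assumptions for illustration, and the library's own formulas may differ.

```python
def daily_new(cumulative):
    """Convert a cumulative series into daily new counts."""
    return [cumulative[0]] + [b - a for a, b in zip(cumulative, cumulative[1:])]


def moving_average(values, window):
    """Trailing moving average; the first window - 1 entries are None."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(values[i + 1 - window:i + 1]) / window)
    return out


def per_million(values, population):
    """Scale counts to a per-1M-people basis, passing None through."""
    return [None if v is None else v * 1_000_000 / population for v in values]


def growth(values):
    """Day-over-day fractional change; None where it cannot be computed."""
    out = [None]
    for prev, cur in zip(values, values[1:]):
        if prev in (None, 0) or cur is None:
            out.append(None)
        else:
            out.append((cur - prev) / prev)
    return out


# Hypothetical cumulative case counts for one region over six days.
cases = [2, 5, 9, 15, 24, 36]
population = 2_000_000

cases_new = daily_new(cases)                  # [2, 3, 4, 6, 9, 12]
cases_new_dma = moving_average(cases_new, 3)  # first two entries are None
cases_new_dma_per_1M = per_million(cases_new_dma, population)
growth_cases_new = growth(cases_new)
```

Each step feeds the next, which is why the category names in see19 read as chains such as `cases_new_dma_per_1M`.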

The amended dataframe can be accessed via the `df` attribute:

In [None]:
casestudy.df.head(2)

For ease of selection, `CaseStudy` has a number of class attributes with different groupings of count categories: `BASECOUNT_CATS`, `PER_CATS`, `LOGNAT_CATS`, `ALL_CATS`, `DMA_COUNT_CATS`, `PER_COUNT_CATS`.

`DMA_COUNT_CATS` is shown as an example:

In [None]:
CaseStudy.DMA_COUNT_CATS[:10]

When `lognat=True` is provided, `CaseStudy` also takes the natural log of each of items 1 through 3 above:

In [None]:
casestudy = CaseStudy(baseframe, lognat=True)

In [None]:
casestudy.LOGNAT_CATS[10:20]

In [None]:
'In total, there are {} different `count_categories` to choose from.'.format(len(CaseStudy.ALL_COUNT_CATS))

<h2><a id='section3.2'>3.2 Filtering</a></h2>

`casestudy.df` can be limited to specific count categories via the `count_categories` parameter:

In [None]:
casestudy = CaseStudy(baseframe, count_categories='tests_new_dma_per_person_per_land_KM2')
casestudy.df.head(2)

In [None]:
casestudy = CaseStudy(baseframe, count_categories=['deaths_new_dma_per_person_per_land_KM2', 'growth_cases_new_per_1M'])
casestudy.df.head(2)

`CaseStudy` can further filter `baseframe` as follows:
    
* `regions` to limit the frame to certain regions
* `countries` to limit the frame to certain countries
* `excluded_regions` to exclude certain regions
* `excluded_countries` to exclude certain countries

Specific regions can be included or excluded by providing the `region_name`, `region_code`, or `region_id`.
Specific countries can be included or excluded by providing the `country`, `country_code`, or `country_id`.

Each of the four parameters accepts either a single region or country as a `str` or several at once via a common iterable such as a `list`.

Below we select three regions:

In [None]:
regions = ['New York', 'FL', 32]
casestudy = CaseStudy(
    baseframe, regions=regions, count_categories=CaseStudy.BASECOUNT_CATS, 
)

In [None]:
casestudy.df.head(3)

We can see that all three regions are indeed in the object by grouping:

In [None]:
pd.concat([df_group.iloc[:1] for region_id, df_group in casestudy.df.groupby('region_id')]).head(3)
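
The matching described above, where a region can be identified by `region_name`, `region_code`, or `region_id` and a filter accepts one value or several, can be pictured with a small stand-alone sketch. It does not use see19: `as_list`, `matches`, the sample rows, and the ids assigned to them are hypothetical, and the library's own implementation may differ.

```python
from collections.abc import Iterable


def as_list(value):
    """Return value as a list, treating str and int as single items."""
    if value is None:
        return []
    if isinstance(value, (str, int)):
        return [value]
    if isinstance(value, Iterable):
        return list(value)
    raise TypeError(f"unsupported filter value: {value!r}")


def matches(row, wanted):
    """True if the row's name, code, or id appears in wanted."""
    return any(row[key] in wanted for key in ("region_name", "region_code", "region_id"))


# Hypothetical rows; the ids here are invented for illustration.
rows = [
    {"region_name": "New York", "region_code": "NY", "region_id": 10},
    {"region_name": "Florida", "region_code": "FL", "region_id": 11},
    {"region_name": "Texas", "region_code": "TX", "region_id": 32},
    {"region_name": "Ohio", "region_code": "OH", "region_id": 40},
]

# One filter list can mix names, codes, and ids.
wanted = as_list(("New York", "FL", 32))
selected = [row["region_name"] for row in rows if matches(row, wanted)]
```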

The region and country filters are important mechanisms for isolating data.

Here, we focus on US regions only, but exclude some of the most impacted ones:

In [None]:
countries = ['USA']
excluded_regions = ['NY', 'NJ']
casestudy = CaseStudy(
    baseframe, countries=countries, excluded_regions=excluded_regions, count_categories=CaseStudy.BASECOUNT_CATS, 
)

Below we can see that the dataset contains various US states and that neither New York nor New Jersey is included.

In [None]:
casestudy.df.head(2)

In [None]:
pd.concat([df_group.iloc[:1] for region_id, df_group in casestudy.df.groupby('region_id')]).head(3)

In [None]:
# excluded_regions holds region codes, so check against the matching region names
casestudy.df[casestudy.df.region_name.isin(['New York', 'New Jersey'])]

### Limiting data via different start and tail hurdles

Parameters exist that allow you to filter the dataset such that regions and days appear only if they meet certain criteria.

`start_factor` and `start_hurdle` provide the ability to effectively *crop* the beginning of a region's period of data.

`tail_factor` and `tail_hurdle` do the same for the end of a region's period.

`start_factor` and `tail_factor` accept almost any factor in the dataset, from the count categories to dates.

The `hurdle` is the level the region must reach to be included. For instance, if a `start_factor` of `cases_new_per_1M` is selected and a `start_hurdle` of `1.0`, then each region's first row in `casestudy.df` will be the day that the region met or exceeded **1.0 new cases per 1M people**.

These options are a convenient way to compare regions that have been impacted in similar ways or, perhaps, to fairly compare regions that were impacted at different times.

The default parameters for `start_factor` and `start_hurdle` limit the data to regions with at least one cumulative fatality.

**NOTE**: a `days` column is added to `casestudy.df`. It counts the number of days from each row's date back to the first date in the frame. When a `start_factor` is provided, that first date is the day the `start_hurdle` is met. When `start_factor` is not provided, it is the first date in the dataset.

Examples are shown below.

In [None]:
casestudy = CaseStudy(
    baseframe, regions=['Spain'], count_categories=CaseStudy.BASECOUNT_CATS, 
    start_factor='cases', start_hurdle=3
)
casestudy.df.head(2)

In [None]:
casestudy = CaseStudy(
    baseframe, countries=['Sweden'], 
    count_categories='deaths_new', start_factor='deaths_new', start_hurdle=3
)
casestudy.df.head(2)

To see the earliest dates in the dataframe, prior to any deaths being recorded, set `start_factor` to `''`.

In [None]:
casestudy = CaseStudy(
    baseframe, regions='RJ', count_categories='tests_new_dma', 
    factors=['temp', 'strindex'], start_factor=''
)
casestudy.df.head(2)
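
The start-hurdle cropping and the `days` column described in this section can be sketched without see19 as follows. The function name `crop_start`, the sample rows, and the choice to start `days` at 0 are assumptions made for illustration and may not match the library's behavior.

```python
from datetime import date, timedelta


def crop_start(rows, start_factor, start_hurdle):
    """Drop rows before the first day start_factor reaches start_hurdle,
    then add a 'days' counter measured from that first day."""
    for i, row in enumerate(rows):
        if row[start_factor] >= start_hurdle:
            first = row["date"]
            return [dict(r, days=(r["date"] - first).days) for r in rows[i:]]
    return []  # the region never met the hurdle, so it drops out entirely


# Hypothetical cumulative case counts for one region, in date order.
first_day = date(2020, 3, 1)
rows = [
    {"date": first_day + timedelta(days=i), "cases": c}
    for i, c in enumerate([0, 1, 2, 3, 5])
]

cropped = crop_start(rows, "cases", 3)
```

Here the first surviving row is the day `cases` reached 3, which mirrors how each region's first row in `casestudy.df` is the day the hurdle is met.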

<h2><a id='section3.3'>3.3 Available Factors</a></h2>

The remaining columns in the `baseframe` can be included in a `CaseStudy` instance on an ***opt-in*** basis via the `factors` parameter:

In [None]:
casestudy = CaseStudy(baseframe, count_categories='cases_new_per_person_per_land_KM2', factors=['no2', 'strindex'])
casestudy.df.head(2)

For convenience, a number of factor groupings can be accessed via `CaseStudy` attributes:

* `GMOBIS`, `AMOBIS`, `CAUSES`, `MAJOR_CAUSES`, `POLLUTS`, `TEMP_MSMTS`, `MSMTS`
    * various groupings for factor data
    * `GMOBIS` refer to Google Mobility data.
    * `AMOBIS` refer to Apple Mobility data.
* `STRINDEX_CATS`, `CONTAIN_CATS`, `ECON_CATS`, `HEALTH_CATS`
    * groupings for the Oxford Stringency Index

In [None]:
print(CaseStudy.MSMTS)
print(CaseStudy.MAJOR_CAUSES)

Demographic population age groupings can be accessed via the `see19` module:
* `ALL_RANGES` - all the possible demographic age ranges
* `RANGES` - a dictionary of various groupings of age ranges

In [None]:
from see19 import RANGES
RANGES.keys()

In [None]:
overs = RANGES['OVERS']['ranges']
casestudy = CaseStudy(baseframe, regions='Lombardia', count_categories='deaths_new_per_person_per_land_KM2', factors=overs)
casestudy.df.head(2)

In [None]:
casestudy = CaseStudy(baseframe, regions='LOM', count_categories='deaths_new_per_person_per_land_KM2', factors=CaseStudy.MAJOR_CAUSES)
casestudy.df.head(2)

Some factors are only available at the country level, regardless of the subregions available for some countries.

By setting `country_level=True`, `CaseStudy` will aggregate most data among the subregions up to the country level to allow for proper comparison across the broad range of countries.

The **Oxford Stringency Index** and its derivatives are one such data group only available at the country level.

In [None]:
casestudy = CaseStudy(baseframe, 
    count_categories='deaths_new_per_person_per_land_KM2', 
    factors='strindex',
    country_level=True,
)
casestudy.df.tail(2)

Above you can see that all US states have been aggregated into a single region with a single `region_id`.
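
As a rough picture of what country-level aggregation involves, the sketch below sums count columns over all subregions that share a country and date. It does not use see19: `aggregate_to_country` and the sample rows are hypothetical, and the library may treat non-count factors differently.

```python
from collections import defaultdict


def aggregate_to_country(rows, count_columns):
    """Sum the given count columns over all regions sharing a country and date."""
    totals = defaultdict(lambda: dict.fromkeys(count_columns, 0))
    for row in rows:
        key = (row["country"], row["date"])
        for col in count_columns:
            totals[key][col] += row[col]
    return [
        {"country": country, "date": day, **sums}
        for (country, day), sums in sorted(totals.items())
    ]


# Hypothetical subregion rows with invented counts.
rows = [
    {"country": "USA", "region_name": "New York", "date": "2020-04-01", "cases": 100, "deaths": 10},
    {"country": "USA", "region_name": "New Jersey", "date": "2020-04-01", "cases": 50, "deaths": 4},
    {"country": "USA", "region_name": "New York", "date": "2020-04-02", "cases": 130, "deaths": 15},
    {"country": "Sweden", "region_name": "Stockholm", "date": "2020-04-01", "cases": 20, "deaths": 1},
]

by_country = aggregate_to_country(rows, ["cases", "deaths"])
```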

With respect to the `STRINDEX_CATS` subgroups, if all the required categories are provided, `CaseStudy` will sum the individual category values. 

For example, if `CONTAIN_CATS` are provided, the aggregate of the eight categories will be included in the `c_sum` column.

Note that if all five `h` indicators are provided, `CaseStudy` will also tabulate a `key3_sum`, which aggregates the scores on the `h1`, `h2`, and `h3` indicators.

In [None]:
casestudy = CaseStudy(baseframe, 
    count_categories='deaths_new_per_person_per_land_KM2', 
    factors=CaseStudy.CONTAIN_CATS,
    country_level=True,
)
casestudy.df.tail(2)
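
The rule that a sum column appears only when the whole subgroup is requested can be sketched as below. This is an illustration rather than see19's code: the column names `c1` to `c8` and `h1` to `h5` are assumed from the category counts given above, and `add_sums` and the sample scores are hypothetical.

```python
CONTAIN = [f"c{i}" for i in range(1, 9)]  # assumed names for the eight categories
HEALTH = [f"h{i}" for i in range(1, 6)]   # assumed names for the five h indicators
KEY3 = ["h1", "h2", "h3"]


def add_sums(row, factors):
    """Return a copy of row with c_sum / key3_sum added when the full group is requested."""
    out = dict(row)
    if set(CONTAIN) <= set(factors):
        out["c_sum"] = sum(row[c] for c in CONTAIN)
    if set(HEALTH) <= set(factors):
        out["key3_sum"] = sum(row[h] for h in KEY3)
    return out


# Hypothetical indicator scores for one country on one date.
row = dict(zip(CONTAIN, [2, 3, 2, 4, 1, 2, 1, 3]))

full = add_sums(row, CONTAIN)            # all eight requested, so c_sum is added
partial = add_sums(row, ["c1", "c2"])    # incomplete group, so no sum column
```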

Additional computations can be added for each factor via the `factor_dmas` parameter.

It takes a dictionary of the form `str(factor_name): int(dma)`.

When provided, `CaseStudy` will automatically add `_dma`, `_growth`, and `_growth_dma` computations for each listed factor:

In [None]:
casestudy = CaseStudy(baseframe, count_categories='deaths_new_dma_per_1M', 
    factors=['temp', 'c1', 'strindex'], 
    factor_dmas={'temp': 7, 'c1': 14},
    country_level=True,
)
casestudy.df.head(2)

To provide a single dma for all the factors submitted, build the dictionary ahead of time:

In [None]:
factor_dmas = {msmt: 14 for msmt in CaseStudy.MSMTS}
casestudy = CaseStudy(
    baseframe, count_categories='tests_new_per_1M', 
    factors=CaseStudy.MSMTS, factor_dmas=factor_dmas
)
casestudy.df.head(2)

Other factors are adjusted for population. Their column names are suffixed with `_%`, and they are listed in the `pop_cats` attribute.

These are typically time-static factors.

In [None]:
casestudy = CaseStudy(baseframe, count_categories='deaths_new_dma_per_1M', factors=['visitors', 'gdp', 'A65PLUSB' ])
casestudy.pop_cats

In [None]:
casestudy.df[['region_name', 'date', 'visitors_%', 'gdp_%', 'A65PLUSB_%']].head(2)

<h2><a id='section3.4'>3.4 Additional Flags</a></h2>

There are several additional flags and methods that are touched on only briefly here. You are encouraged to read the analysis pages to see them in action.

* `world_averages`: when set to `True`, averages each date in the dataset across all the regions, to provide a ***per_region*** statistic for each factor.

* `favor_earlier`: when set to `True`, scales any selected rows such that the row values favor earlier dates over later ones. A new column is added with the `_earlier` suffix. This is helpful when attempting to study the impacts of early moves to, say, social distance. Factors are selected by passing a list to the `factors_to_favor_earlier` parameter.
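
As a rough illustration of the `world_averages` idea, the sketch below takes the mean of one column across all regions, separately for each date. It does not use see19: `world_average` and the sample rows are hypothetical, and the library may weight or filter regions differently. No sketch is given for `favor_earlier` because its scaling is not described here.

```python
from collections import defaultdict
from statistics import mean


def world_average(rows, column):
    """Mean of column across all regions, computed separately for each date."""
    by_date = defaultdict(list)
    for row in rows:
        by_date[row["date"]].append(row[column])
    return {day: mean(values) for day, values in sorted(by_date.items())}


# Hypothetical rows for two regions over two dates.
rows = [
    {"region_name": "A", "date": "2020-04-01", "deaths_new": 4},
    {"region_name": "B", "date": "2020-04-01", "deaths_new": 8},
    {"region_name": "A", "date": "2020-04-02", "deaths_new": 6},
    {"region_name": "B", "date": "2020-04-02", "deaths_new": 12},
]

averages = world_average(rows, "deaths_new")
```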

# Next Section

Click on this link to go to the next notebook: [4. Visualizing Regional Impacts](https://ryanskene.github.io/see19/guide/4.%20See19%20-%20Visualizing%20Regional%20Impacts.html)