Add documentation

solveforj · May 20, 2021 · 14af27b
1 parent 551240f
commit 14af27b
Showing 12 changed files with 96 additions and 13 deletions.
diff --git a/README.md b/README.md
@@ -17,7 +17,7 @@ See our latest predictions at our [website](https://itsonit.com) and check us ou
 ## Disclaimer
 We hope to serve as a valuable  resource for understanding trends in the ongoing pandemic and raise awareness about COVID-19 at the community level.  However, we strongly advise against **over-interpreting our predictions**. Machine learning models are only as good as the data that trains them.  We use the best quality data that is available to us, but we acknowledge that error in our predictions is unavoidable.
 
-## Generating projections
+## Setup
 
 * First, clone this repository:
   ```
@@ -28,18 +28,31 @@ We hope to serve as a valuable  resource for understanding trends in the ongoing
   pip install -r requirements.txt
   ```
 * For map rendering, install [Orca](https://github.com/plotly/orca) and ensure that it is added to your PATH
+* Get a COVIDActNow.org API access key if you intend to generate projections (register [here](https://apidocs.covidactnow.org/#register))
 
-* From the project directory, run the entire module with the command below, specifying the date for which you want to generate projections, optional arguments to compute feature importance (`--importance`) or produce projections for the prior Sunday to the input reference date (`--sunday`), and your COVIDActNow.org API Access key (register [here](https://apidocs.covidactnow.org/#register)) as follows:
-  ```
-  python auto.py -d 2021-04-11 -r API_KEY --importance --sunday
-  ```
+## Generating projections
+
+There are two ways to generate projections:
+
+1. **Standard Methodology**: Generates the latest projections for the present date from the most recent datasets, and reduces runtime by skipping feature importance scoring. You need the *reference date* for which you want to generate projections and your *COVIDActNow.org API access key* (see above for registration instructions). Use the command below:
+```
+python auto.py -d 2021-05-18 -r API_KEY
+```
+If your reference date is not a Sunday (i.e., the start of the week), you may add the optional `--sunday` flag to create projections for the Sunday prior to your reference date:
+```
+python auto.py -d 2021-05-18 -r API_KEY --sunday
+```
+
+2. **Publication Methodology (ONLY for projections before 2021-01-15)**: Generates projections using the specific methodology of our publication in progress (i.e., including feature importance scoring and the input datasets from 2021-05-18), with a command similar to that of the Standard Methodology:
+```
+python auto.py -d 2021-05-18 -r API_KEY --publication_method --sunday
+```
+Statistics and figures from the publication manuscript may be regenerated from the preloaded datasets that the manuscript references, using the command below:
+```
+python auto.py --figures
+```
 
-* The output data files will be in the `/output` directory:
-  * The `/feature_ranking` directory contains feature rankings for each projection model (using mobility data) trained for each reference date
-  * The `/model_stats` directory contains model performance statistics (mean absolute error and R<sup>2</sup> on forecasts vs. actual cases in the training dataset) for each date projections were generated for
-  * The `/raw_predictions` directory contains model projections for the date projections were generated for and the preceding 9 weeks; all training dataset features are included for each of these weeks, as well
-  * The `/ReichLabFormat` directory contains predictions formatted as necessary for submission to the COVID-19 Forecast Hub
-  * The `/website` directory contains various files, many of which are published in some form on our [website](https://www.itsonit.com)
+Projections and related files will be in the `output/` subdirectory, and publication figures will be in the `publication/output/` subdirectory.
 
 ## Credits
 * [Joseph Galasso](https://github.com/solveforj/)
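
Note on the `--sunday` flag documented above: the README says it creates projections for the Sunday prior to the reference date. A minimal sketch of that date arithmetic follows, assuming the reference date is given as `YYYY-MM-DD` and that a date already falling on a Sunday is left unchanged. The function name `prior_sunday` is hypothetical and is not taken from `auto.py`.

```python
from datetime import date, timedelta


def prior_sunday(reference_date: str) -> str:
    """Return the most recent Sunday on or before reference_date (YYYY-MM-DD).

    Hypothetical helper for illustration; not the repository's implementation.
    """
    day = date.fromisoformat(reference_date)
    # date.weekday() is 0 for Monday ... 6 for Sunday, so this is the number
    # of days elapsed since the last Sunday (0 if the date is a Sunday).
    days_since_sunday = (day.weekday() + 1) % 7
    return (day - timedelta(days=days_since_sunday)).isoformat()


print(prior_sunday("2021-05-18"))  # reference date from the README example (a Tuesday)
print(prior_sunday("2021-04-11"))  # a date that is already a Sunday
```

For the README's example date, `prior_sunday("2021-05-18")` gives `2021-05-16`.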

diff --git a/data/COVIDTracking/README.md b/data/COVIDTracking/README.md
@@ -9,3 +9,5 @@ Processes all COVID-19 testing-related datasets to generate testing features for
     * [Johns Hopkins Centers for Civic Impact for the Coronavirus Resource Center](https://raw.githubusercontent.com/govex/COVID-19/master/data_tables/testing_data/time_series_covid19_US.csv), with documentation [here](https://github.com/govex/COVID-19)
     * See `/census` and `/geodata` directories for information on any census and geographical datasets used here, respectively
   * **testing_data.csv.gz** is the output of `preprocess.py`
+  * **covidtracking_2021_03_07.csv** contains the [COVID Tracking Project API](https://covidtracking.com/api/v1/states/daily.csv) dataset at its last update on 2021-03-07
+  * **testing_data_test.csv.gz** contains the output of `preprocess.py` as of 2021-05-18
diff --git a/data/JHU/README.md b/data/JHU/README.md
@@ -8,3 +8,4 @@ COVID-19 county-level case data from [John Hopkins University](https://coronavir
   - [JHU CSSE Github](https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv)
   - See `/census` and `/geodata` directories for information on any census and geographical datasets used here, respectively
 - **jhu_data.csv** is the output of `preprocess.py` and contains all case features
+- **jhu_data_test.csv** is the output of `preprocess.py` as of 2021-05-18
diff --git a/data/Rt/README.md b/data/Rt/README.md
@@ -13,4 +13,4 @@ COVID-19 effective reproduction number (R<sub>t</sub>) time-series is obtained f
 - **aligned_rt_14.csv** is one output of `preprocess.py` and  contains 14-week forecast R<sub>t</sub> features
 - **aligned_rt_21.csv** is one output of `preprocess.py` and contains 21-week forecast R<sub>t</sub> features
 - **aligned_rt_28.csv** is one output of `preprocess.py` and contains 28-week forecast R<sub>t</sub> features
-- **higher_corrs.csv** is one output of `preprocess.py` and describes if the county or state R<sub>t</sub> was used for generating case predictions for each county of interest
+- **higher_corrs.csv** is one output of `preprocess.py` and describes, for each county that projections were made for, which of the two R<sub>t</sub> time-series (county or state) had the higher maximum correlation with the case time-series when aligned
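
The selection described for **higher_corrs.csv** can be pictured roughly as below. This is only a sketch under stated assumptions: "aligned" is taken to mean shifting the R<sub>t</sub> series forward by a whole number of time steps, the correlation is Pearson's, and ties go to the county series. The function names, the `max_lag` parameter, and the sample numbers are all hypothetical; the actual logic lives in `preprocess.py` and may differ.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def max_lagged_correlation(rt, cases, max_lag):
    """Largest correlation between rt[t] and cases[t + lag] over lag = 0..max_lag."""
    best = -1.0
    for lag in range(max_lag + 1):
        shifted_cases = cases[lag:]
        n = min(len(rt), len(shifted_cases))
        if n < 3:  # too few overlapping points to be meaningful
            break
        best = max(best, pearson(rt[:n], shifted_cases[:n]))
    return best


def pick_rt_source(county_rt, state_rt, cases, max_lag=4):
    """Return "county" or "state", whichever Rt series correlates better with cases."""
    county_corr = max_lagged_correlation(county_rt, cases, max_lag)
    state_corr = max_lagged_correlation(state_rt, cases, max_lag)
    return "county" if county_corr >= state_corr else "state"


# Made-up numbers: the county series leads the case curve by two steps,
# while the state series is unrelated to it.
cases = [1, 3, 2, 5, 4, 7, 6, 9, 8, 11, 10, 13]
county_rt = cases[2:] + [12, 15]
state_rt = [5, 1, 4, 2, 6, 1, 3, 2, 5, 1, 4, 2]
print(pick_rt_source(county_rt, state_rt, cases))
```

Taking the maximum over lags, rather than the correlation at a single fixed lag, matches the README's wording ("max correlation ... when aligned"), but the lag range and units used by the project are not stated in this commit.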
diff --git a/data/facebook/README.md b/data/facebook/README.md
@@ -4,5 +4,6 @@
 Processes county-level population mobility data from [facebook.com](https://dataforgood.fb.com/docs/covid19/) to generate population mobility features
 
 ### Files
-  * **preprocess.py** downloads and processes Facebook Movement Range Maps available from [here] (https://data.humdata.org/dataset/movement-range-maps), generating population mobility features
+  * **preprocess.py** downloads and processes Facebook Movement Range Maps available from [here](https://data.humdata.org/dataset/movement-range-maps), generating population mobility features
   * **mobility.csv.gz** is the output of `preprocess.py` and contains all population mobility features per county and date
+  * **mobility_test.csv.gz** is the output of `preprocess.py` as of 2021-05-18
diff --git a/model/README.md b/model/README.md
@@ -0,0 +1,12 @@
+# Model Training and Forecasting
+
+### Description
+The files in this directory run in sequence to update datasets, merge them into a training dataset, train Random Forest models, and generate projections and visualizations of those projections for [itsonit.com](https://www.itsonit.com) and the [COVID-19 Forecast Hub](https://covid19forecasthub.org/). Output is written to the `output/` directory.
+
+### Files
+- **merge.py** updates time-series datasets in the `data/` directory and merges all datasets in this directory into training datasets
+- **train.py** trains mobility and non-mobility Random Forest models for forecasting COVID-19 cases and then generates projections and feature importance scores for the models
+- **predict.py** reformats and condenses projections
+- **web.py** reformats projections for the website
+- **map.py** generates visualizations of the projections
+- **reichlab.py** reformats projections for the COVID-19 Forecast Hub
diff --git a/output/README.md b/output/README.md
@@ -0,0 +1,11 @@
+# Model Output
+
+### Description
+The subdirectories (as explained below) contain the output files of the code in the `model/` directory.
+
+### Subdirectories
+- `feature_ranking/` contains feature importance scores.  Those used in the publication manuscript are isolated in the `feature_ranking/publication/` subdirectory.
+- `model_stats/` contains statistics about model performance on training and validation datasets.  Those used in the publication manuscript are isolated in the `model_stats/publication/` subdirectory.
+- `raw_predictions/` contains partially condensed projections, along with the training dataset features for the corresponding counties.  Those used in the publication manuscript are isolated in the `raw_predictions/publication/` subdirectory.
+- `ReichLabFormat/` contains condensed and reformatted projections for the [COVID-19 Forecast Hub](https://covid19forecasthub.org/).  Those used in the publication manuscript are isolated in the `ReichLabFormat/publication/` subdirectory.
+- `website/` contains any visualizations or data files for [itsonit.com](https://www.itsonit.com) or social media communications.
diff --git a/publication/README.md b/publication/README.md
@@ -0,0 +1,15 @@
+# Publication Data and Figures
+
+### Description
+The files in this directory generate all figures and statistics presented in the publication manuscript.  The subdirectories in this directory contain the figures/statistics or other input data required to generate them.
+
+### Files
+- **performance_comparison.py** loads our projections from the `pandemic-central/output/ReichLabFormat/publication/` directory and projections from other models from the `data/comparison_models/` subdirectory to generate statistics on their relative performance.  These statistics are stored in the `output/model_performance/` subdirectory.
+- **performance_graph.py** produces graphs of the statistics from `performance_comparison.py`.  These graphs are stored in the `output/performance_figures/` subdirectory.
+- **feature_importance.py** produces graphs of computed feature importances over time from datasets in the `pandemic-central/output/feature_ranking/publication/` directory. The output graphs are in the `output/feature_importance_figures/` subdirectory.
+- **rt_alignment.py** graphs an example of the R<sub>t</sub> alignment process with the case curves, producing the figure in the `output/rt_alignment_figures/` subdirectory.
+- **misc_stats.py** prints statistics referenced in the publication manuscript Results and Discussion section.
+
+### Subdirectories
+- `data/` contains projections from other models in the `data/comparison_models/` subdirectory and the **higher_corrs.csv** file originally found in the `pandemic-central/data/Rt/` directory, but computed for 2021-01-10 projections.
+- `output/` contains the output figures and datasets from the code files in this directory, as described in the Files section above.
diff --git a/publication/feature_importance.py b/publication/feature_importance.py
@@ -1,6 +1,13 @@
 import pandas as pd
 import matplotlib.pyplot as plt
 
+__author__ = 'Duy Cao, Joseph Galasso'
+__copyright__ = '© Pandemic Central, 2021'
+__license__ = 'MIT'
+__status__ = 'release'
+__url__ = 'https://github.com/solveforj/pandemic-central'
+__version__ = '3.0.0'
+
 def feature_importance():
     print("GRAPHING FEATURE IMPORTANCES\n")
     weeks = [1, 2, 3, 4]

diff --git a/publication/misc_stats.py b/publication/misc_stats.py
@@ -2,6 +2,13 @@
 import numpy as np
 import glob
 
+__author__ = 'Duy Cao, Joseph Galasso'
+__copyright__ = '© Pandemic Central, 2021'
+__license__ = 'MIT'
+__status__ = 'release'
+__url__ = 'https://github.com/solveforj/pandemic-central'
+__version__ = '3.0.0'
+
 weeks = [1, 2, 3, 4]
 dates = ["2020-11-01", "2020-11-08", "2020-11-15", "2020-11-22", "2020-11-29","2020-12-06", "2020-12-13", "2020-12-20", "2020-12-27", "2021-01-03", "2021-01-10"]
 

diff --git a/publication/performance_graph.py b/publication/performance_graph.py
@@ -4,6 +4,13 @@
 from epiweeks import Week
 from datetime import date
 
+__author__ = 'Duy Cao, Joseph Galasso'
+__copyright__ = '© Pandemic Central, 2021'
+__license__ = 'MIT'
+__status__ = 'release'
+__url__ = 'https://github.com/solveforj/pandemic-central'
+__version__ = '3.0.0'
+
 MODELS = ['JHU_IDD',\
             'OneQuietNight',\
             'RF',\

diff --git a/publication/rt_alignment.py b/publication/rt_alignment.py
@@ -6,6 +6,13 @@
 import matplotlib.ticker as ticker
 from sklearn.linear_model import LinearRegression
 
+__author__ = 'Duy Cao, Joseph Galasso'
+__copyright__ = '© Pandemic Central, 2021'
+__license__ = 'MIT'
+__status__ = 'release'
+__url__ = 'https://github.com/solveforj/pandemic-central'
+__version__ = '3.0.0'
+
 def rt_alignment():
     print("GENERATING GRAPH OF Rt ALIGNMENT PROCESS\n")
     county_rt=pd.read_csv("data/Rt/rt_data.csv", dtype={"FIPS":str})