Move data_pipeline to script #29

gailin-p · 2022-05-31T19:46:50Z

1. data_pipeline script. Options:
   - `--year`
   - `--gtn_years`: number of years used to calculate the GTN ratio
   - `--small`: filters out 95% of plants so the pipeline runs faster (testing only)

   Note: `--small` makes the run faster, but because the filter is currently applied after `data_cleaning.clean_cems(year)`, a run still takes 10+ minutes. If it were faster, we could run it in a commit hook to guarantee that data_cleaning is always functional. To make this possible, we would need to either enable filtering inside `data_cleaning.clean_cems(year)` or create a smaller testing version of all the data sources in data/downloads.
2. Add column checks for files created in data_pipeline.

   Note: files created outside data_pipeline, including residual profiles, GTN ratios, and subplant mappings, are not currently checked. We can add checks for these if they become outputs whose contents we want to guarantee.
3. Move residual-related calculations to `residual.py`
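
For readers unfamiliar with the script, here is a minimal sketch of how the three options and a column check could be wired up. It is an illustration only, not the code in this PR: the default values, the help strings, and the `check_columns` helper name are assumptions.

```python
import argparse


def parse_args(argv=None):
    """Parse the options listed above. Defaults here are assumptions."""
    parser = argparse.ArgumentParser(description="Run the data pipeline.")
    parser.add_argument("--year", type=int, default=2020,
                        help="Year of data to process.")
    parser.add_argument("--gtn_years", type=int, default=5,
                        help="Number of years used to calculate the GTN ratio.")
    parser.add_argument("--small", action="store_true",
                        help="Keep only a small share of plants (testing only).")
    return parser.parse_args(argv)


def check_columns(columns, expected, name):
    """Hypothetical helper: raise if an output's columns differ from `expected`.

    `columns` can be any iterable of names, e.g. `df.columns` for a DataFrame.
    """
    actual = set(columns)
    missing = set(expected) - actual
    extra = actual - set(expected)
    if missing or extra:
        raise ValueError(
            f"{name}: missing columns {sorted(missing)}, "
            f"unexpected columns {sorted(extra)}"
        )
```

A check like this would be called just before each output file is written, so a renamed or dropped column fails the run instead of silently changing the output.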

Note on code structure: Currently all data paths are hardcoded to assume a working directory within src (e.g. `../data/downloads/x`). Eventually, we should move this into a variable that can be set. That would let us treat the code in src as a package and move data_pipeline.py outside of src, so that it is a script calling a package.
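
One possible shape for that settable variable, as a rough sketch: resolve every data path through a single helper that reads an override from the environment. The `PIPELINE_DATA_DIR` variable and the `data_folder` function are hypothetical names, not something that exists in the code today.

```python
import os
from pathlib import Path

# Hypothetical environment variable; the current code does not read it.
DATA_DIR_ENV = "PIPELINE_DATA_DIR"


def data_folder(*parts):
    """Return a path under the data root.

    Falls back to "../data", which matches the current assumption that the
    working directory is inside src.
    """
    root = Path(os.environ.get(DATA_DIR_ENV, "../data"))
    return root.joinpath(*parts)
```

With this in place, a call such as `data_folder("downloads", "x")` would replace a hardcoded `../data/downloads/x`, and the script could be run from any directory by setting the variable.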

gailin-p · 2022-05-31T19:48:03Z

Two small notes:

The following call was commented out in the GTN section of data_pipeline, so I’ve left it out of data_pipeline.py:

# for generators where there is heat input but no gross generation reported, impute hourly net generation based on reported EIA values
# TODO: Need to match data on unit level rather than plant level
# cems = data_cleaning.impute_missing_hourly_net_generation(cems, eia923_allocated)

data_pipeline.py does not run the GTN regressions if the output folder already exists, so the user must delete that folder for any changes to the GTN calculation to take effect. We could also make this a command line argument if it turns out to be a common use case.
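
If we do go the command line route, the decision could look roughly like this. It is a sketch under assumptions: `should_run_gtn` and the `force` flag (which a hypothetical option such as `--rerun_gtn` would set) are not part of this PR.

```python
from pathlib import Path


def should_run_gtn(gtn_folder, force=False):
    """Return True if the GTN regressions should be (re)run.

    Mirrors the current behavior (skip when the folder exists) and adds an
    explicit override so the folder does not have to be deleted by hand.
    """
    return force or not Path(gtn_folder).exists()
```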

gailin-p requested a review from grgmiller May 31, 2022 19:46

Update comment on how to run file

1f8b65b

Specify that working directory is `src` (needed because data paths are hardcoded elsewhere in code)

grgmiller mentioned this pull request May 31, 2022

Improve functionality of --small argument for testing data pipeline #30

Closed

2 tasks

grgmiller merged commit b543dbd into master May 31, 2022

gailin-p deleted the refactor_data_pipeline branch May 31, 2022 20:39

gailin-p mentioned this pull request May 31, 2022

Refactor data pipeline #17

Closed
