Move data_pipeline to script #29

Merged
merged 2 commits into from
May 31, 2022

gailin-p
Collaborator

@gailin-p gailin-p commented May 31, 2022

  1. data_pipeline script. Options:
    --year
    --gtn_years number of years used to calculate the GTN ratio
    --small filters out 95% of plants so the pipeline runs faster (testing only)
    Note: --small speeds up the run, but because the filter is currently applied after data_cleaning.clean_cems(year), the pipeline still takes 10+ minutes. If it were faster, we could run it in a commit hook to guarantee that data_cleaning is always functional. That would require either filtering inside data_cleaning.clean_cems(year) or creating smaller test versions of all the data sources in data/downloads.

  2. Add column checks for files created in data_pipeline.
    Note: files created outside data_pipeline, including residual profiles, GTN ratios, and subplant mappings, are not currently checked. We can add checks for these if they become outputs whose contents we want to guarantee.

  3. Move residual-related calculations to residual.py
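
For reference, the three options in item 1 could be parsed roughly as follows. This is a minimal sketch and not necessarily how data_pipeline.py does it: the `build_parser` name and both default values are placeholders assumed for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the three options listed in item 1."""
    parser = argparse.ArgumentParser(description="Run the data pipeline.")
    # NOTE: both defaults are illustrative placeholders, not values from this PR.
    parser.add_argument(
        "--year", type=int, default=2020, help="Year of data to process."
    )
    parser.add_argument(
        "--gtn_years",
        type=int,
        default=5,
        help="Number of years used to calculate the GTN ratio.",
    )
    # argparse %-formats help strings, so a literal percent sign is written %%.
    parser.add_argument(
        "--small",
        action="store_true",
        help="Filter out 95%% of plants so the pipeline runs faster (testing only).",
    )
    return parser
```

With this sketch, `build_parser().parse_args(["--year", "2019", "--small"])` yields `year=2019`, `small=True`, and the placeholder default for `gtn_years`.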

Note on code structure: Currently all data paths are hardcoded to assume a working directory within src (e.g., ../data/downloads/x). Eventually, we should move this into a variable that can be set. That would let us treat the code in src as a package and move data_pipeline.py outside of src, so that it becomes a script calling a package.
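
One possible shape for that settable variable, as a sketch only: the environment variable name `PIPELINE_DATA_DIR` and the helper `data_folder` are hypothetical and do not exist in this PR. The fallback keeps the current assumption that the code runs from within src.

```python
import os
from pathlib import Path

# Hypothetical environment variable name; nothing in this PR defines it.
DATA_DIR_ENV = "PIPELINE_DATA_DIR"


def data_folder(rel: str = "") -> Path:
    """Return a path under the data directory.

    Falls back to ../data, which matches the current hardcoded assumption
    that the working directory is src.
    """
    base = Path(os.environ.get(DATA_DIR_ENV, "../data"))
    return base / rel
```

Call sites would then use something like `data_folder("downloads")` instead of a literal `../data/downloads` string, so the script could live outside src once the variable is set.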

1) data_pipeline script. Options:
 `--year`
 `--gtn_years` how many years to calc GTN ratio
 `--small` filters out 95% of plants so pipeline runs faster (testing only)
2) Add column checks for files created in data_pipeline.
Note: files created outside data_pipeline,
including residual profiles, GTN ratios, and subplant mappings,
are not checked currently.
We can add checks for these if they become outputs whose contents we
want to guarantee
3) Move residual-related calculations to `residual.py`
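
To illustrate the kind of column check described in item 2, here is a minimal sketch. The `EXPECTED_COLUMNS` mapping, its file name, its column names, and the `check_columns` helper are all assumptions made for this example; the actual checks in this PR may be structured differently.

```python
from typing import Iterable

# Hypothetical expected columns keyed by output file name (illustrative only).
EXPECTED_COLUMNS = {
    "example_output": {"plant_id", "datetime", "net_generation"},
}


def check_columns(columns: Iterable[str], file_name: str) -> None:
    """Raise ValueError if the columns differ from those expected for file_name."""
    expected = EXPECTED_COLUMNS[file_name]
    actual = set(columns)
    missing = expected - actual
    extra = actual - expected
    if missing or extra:
        raise ValueError(
            f"{file_name}: missing columns {sorted(missing)}, "
            f"unexpected columns {sorted(extra)}"
        )
```

With a pandas DataFrame `df`, this would be called as `check_columns(df.columns, "example_output")` just before the file is written.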
@gailin-p gailin-p requested a review from grgmiller May 31, 2022 19:46
@gailin-p
Collaborator Author

Two small notes:

  • The following call was commented out in the GTN item of data_pipeline, so I’ve left it out of data_pipeline.py:
# for generators where there is heat input but no gross generation reported, impute hourly net generation based on reported EIA values
# TODO: Need to match data on unit level rather than plant level
# cems = data_cleaning.impute_missing_hourly_net_generation(cems, eia923_allocated)
  • data_pipeline.py does not run the GTN regressions if the folder already exists, so the user must delete that folder for any changes to the GTN calculation to take effect. We could also make this a command line argument if it’s a common use case.
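
If the re-run case does turn out to be common, the skip-if-exists behavior from the second note could take a force option along these lines. This is a sketch under assumptions: `ensure_gtn_outputs`, its `force` parameter, and the `run_regressions` callback are hypothetical names, not functions in this repo.

```python
import shutil
from pathlib import Path
from typing import Callable


def ensure_gtn_outputs(
    folder: Path,
    run_regressions: Callable[[Path], None],
    force: bool = False,
) -> bool:
    """Run the regressions unless the folder already exists.

    Returns True if the regressions ran and False if they were skipped.
    With force=True, an existing folder is deleted and rebuilt.
    """
    if folder.exists() and not force:
        return False
    if folder.exists():
        # Same effect as the user deleting the folder by hand.
        shutil.rmtree(folder)
    folder.mkdir(parents=True)
    run_regressions(folder)
    return True
```

The `force` value could be wired to a command line flag, which would replace the manual step of deleting the folder.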

Specify that working directory is `src` (needed because data paths are hardcoded elsewhere in code)
@grgmiller grgmiller merged commit b543dbd into master May 31, 2022
@gailin-p gailin-p deleted the refactor_data_pipeline branch May 31, 2022 20:39
@gailin-p gailin-p mentioned this pull request May 31, 2022