Add BA-level data quality metrics #233

Merged
merged 7 commits into from
Oct 25, 2022

Conversation

@grgmiller (Collaborator) commented Sep 16, 2022

Previously, all of our data quality metrics described data quality only for the entire country. However, data quality can vary widely between individual balancing authorities (BAs).

This PR is a work in progress, but will add BA-level metrics (in addition to the national metrics) to:

  • input_data_source
  • hourly_profile_method
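
As a rough illustration of what moving from a national metric to a BA-level metric involves, here is a minimal pandas sketch. The function `share_by_category`, the column names `ba_code` and `net_generation_mwh`, and the sample values are all hypothetical and are not taken from this PR; only `input_data_source` is a metric named above.

```python
import pandas as pd


def share_by_category(df, category_col, value_col, group_col=None):
    """Return each category's share of value_col, nationally or per group.

    Hypothetical helper for illustration; not the pipeline's actual code.
    """
    keys = [category_col] if group_col is None else [group_col, category_col]
    totals = df.groupby(keys)[value_col].sum()
    if group_col is None:
        # National metric: divide by the grand total.
        return totals / totals.sum()
    # BA-level metric: divide each (group, category) total by its group's total.
    return totals / totals.groupby(level=group_col).transform("sum")


# Hypothetical sample data (assumed columns and values).
data = pd.DataFrame(
    {
        "ba_code": ["CISO", "CISO", "PJM", "PJM"],
        "input_data_source": ["cems", "eia", "cems", "eia"],
        "net_generation_mwh": [90.0, 10.0, 50.0, 50.0],
    }
)

national = share_by_category(data, "input_data_source", "net_generation_mwh")
by_ba = share_by_category(
    data, "input_data_source", "net_generation_mwh", group_col="ba_code"
)
```

In this made-up data the national shares hide the difference between the two BAs, which is the motivation described above: the same metric computed per BA can look quite different from the national figure.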

While working on this, I also discovered that some plants report data at multiple frequencies (#232), which caused some of the metrics to not sum to 100% because there were missing frequency codes. This PR patches that by filling missing frequency codes with "multiple" (as opposed to "annual" or "monthly").
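
The following sketch shows one way missing codes could make a percentage metric fall short of 100%, and how filling them with "multiple" restores the total. It assumes the shares are computed with a pandas `groupby`, which drops rows with missing keys by default; the column names `reporting_frequency` and `net_generation_mwh` and the data are hypothetical, not the pipeline's actual schema.

```python
import pandas as pd

# Hypothetical plant-level table; None marks a missing frequency code.
plants = pd.DataFrame(
    {
        "plant_id": [1, 2, 3, 4],
        "reporting_frequency": ["annual", "monthly", None, "annual"],
        "net_generation_mwh": [40.0, 30.0, 20.0, 10.0],
    }
)


def frequency_shares(df):
    """Share of total generation attributed to each reporting frequency."""
    # groupby drops rows whose key is missing (dropna=True by default),
    # so those rows vanish from the numerator but not the denominator.
    totals = df.groupby("reporting_frequency")["net_generation_mwh"].sum()
    return totals / df["net_generation_mwh"].sum()


before = frequency_shares(plants)

# Fill missing codes with "multiple" so every row lands in some category.
patched = plants.assign(
    reporting_frequency=plants["reporting_frequency"].fillna("multiple")
)
after = frequency_shares(patched)
```

With this sample data, the shares in `before` cover only the rows that have a code, while `after` includes a "multiple" category and accounts for all generation.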

@grgmiller grgmiller changed the base branch from development to v0.1.2 October 21, 2022 23:29
@grgmiller grgmiller marked this pull request as ready for review October 21, 2022 23:30
@grgmiller (Collaborator, Author) commented:

In addition to adding BA-level data quality metrics, I also added a notebook for exploring statistics on generators that do not report to CEMS and how they might differ from those that do. Much of this work is based on feedback I received during peer review of my dissertation chapter.

@grgmiller (Collaborator, Author) commented:

This PR also fixes several other bugs that were introduced in other PRs that have been merged into the v0.1.2 branch:

  • At the beginning of the data pipeline, we delete the results folder for the year being run, but this raised an error if the directory didn't already exist (i.e., the first time a user runs the pipeline). This PR fixes that by checking whether the directory exists before deleting it.
  • In #245 (Validate that data does not overlap when combining), I noticed that the validation check for overlapping data was not being run when combining hourly data, so I "fixed" this by setting the validate parameter to True. However, I discovered that there was an undocumented reason it had been set to False: the overlap validation checks data at the subplant level, and when we combine hourly data, the shaped_eia_data df contains only fleet-level data, not subplant data, so the check won't work. I set the validate parameter back to False when combining the hourly data and added a note explaining why. (Closes #231, Hourly data not validated for overlapping data)
  • Fixes some SettingWithCopy warnings in output_data.write_plant_metadata()
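
For the first fix in the list above, a minimal sketch of the "check before deleting" pattern using only the standard library might look like the following. The function name `clear_results_dir` and the folder layout are hypothetical; the actual pipeline code may be structured differently.

```python
import shutil
import tempfile
from pathlib import Path


def clear_results_dir(results_dir):
    """Delete results_dir if it exists; return True if something was removed.

    Hypothetical helper: calling shutil.rmtree on a missing directory raises
    FileNotFoundError, so the existence check makes a first run safe.
    """
    results_dir = Path(results_dir)
    if results_dir.exists():
        shutil.rmtree(results_dir)
        return True
    return False


# Usage with a throwaway directory (paths are illustrative only).
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "results" / "2020"
    first_run = clear_results_dir(target)   # directory does not exist yet
    target.mkdir(parents=True)
    (target / "output.csv").write_text("placeholder")
    second_run = clear_results_dir(target)  # directory and contents removed
    still_there = target.exists()
```

An alternative is `shutil.rmtree(path, ignore_errors=True)`, but that also suppresses unrelated failures such as permission errors, so an explicit existence check is the narrower choice.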

@gailin-p gailin-p merged commit a761999 into v0.1.2 Oct 25, 2022
@gailin-p gailin-p deleted the quality_metrics_by_ba branch October 25, 2022 21:43