Update partial CEMS shaping for mixed fuel plants #238

Merged Oct 21, 2022 (13 commits)

gailin-p
Collaborator

Summary of changes

Close #230 by changing hourly data source identification methods:

  • A renewable generator (hydro, wind, solar, nuclear, geothermal) won't use the partial_cems_plant hourly data source methodology
  • A subplant that contains generators of mixed fuel types will choose the hourly data source of the generator with the largest generation (this is relevant for subplants with subplant_id=NaN, which can have generators of mixed fuel types; see comments on Ensure complete subplant_id mapping #49)
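The two rules above can be sketched in pandas. This is only an illustration under stated assumptions, not the code in `src/data_cleaning.py`: the column names (`plant_id_eia`, `subplant_id`, `fuel_category`, `net_generation_mwh`, `hourly_data_source`) and the function name are hypothetical, and the fallback to `"eia"` for clean generators is an assumption. A real pipeline would presumably also group by month.

```python
import pandas as pd

# Fuel categories that the PR description lists for rule 1.
CLEAN_FUELS = {"hydro", "wind", "solar", "nuclear", "geothermal"}


def assign_hourly_data_source(gens: pd.DataFrame) -> pd.DataFrame:
    """Apply both rules to a generator-level table (hypothetical schema)."""
    gens = gens.reset_index(drop=True).copy()

    # Rule 1: clean generators never use the partial_cems_plant method.
    # Assumption: they fall back to the "eia" source instead.
    clean_partial = gens["fuel_category"].isin(CLEAN_FUELS) & gens[
        "hourly_data_source"
    ].eq("partial_cems_plant")
    gens.loc[clean_partial, "hourly_data_source"] = "eia"

    # Rule 2: every generator in a subplant takes the source of the
    # generator with the largest generation. A string key is used so that
    # subplant_id=NaN still forms a group.
    group_key = gens["plant_id_eia"].astype(str) + "_" + gens["subplant_id"].astype(str)
    dominant_row = gens.groupby(group_key)["net_generation_mwh"].idxmax()
    source_by_row = gens["hourly_data_source"]
    gens["hourly_data_source"] = (
        group_key.map(dominant_row).map(source_by_row).to_numpy()
    )
    return gens


# Made-up example: plant 1 has a NaN subplant mixing nuclear and petroleum.
example = pd.DataFrame(
    {
        "plant_id_eia": [1, 1, 2, 2],
        "subplant_id": [float("nan"), float("nan"), 1.0, 1.0],
        "fuel_category": ["nuclear", "petroleum", "natural_gas", "petroleum"],
        "net_generation_mwh": [1000.0, 10.0, 500.0, 5.0],
        "hourly_data_source": ["partial_cems_plant", "cems", "cems", "eia"],
    }
)
result = assign_hourly_data_source(example)
```

In this made-up example, the NaN subplant of plant 1 ends up entirely on the source of its nuclear generator, since that generator dominates generation.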

The post-fix hourly emission rate for PJM is below. The abrupt drops to zero (seen in #230) are gone.
[Screenshot: post-fix hourly emission rate for PJM]

The PJM nuclear emission rate, below, shows the three hours where the generator is on as higher-than-zero emission rates:
[Screenshot: PJM nuclear hourly emission rate]

Open question: subplants in hourly plant_data results

Plant 2410, the nuclear plant originally responsible for the PJM data issues, has one subplant made up of a CEMS-reporting diesel generator and one subplant with its nuclear generation. Generation from the CEMS-reporting subplant is in plant_data/hourly/individual_plant_data.csv, while generation from the nuclear subplant is in shaped_fleet_data.csv. This is confusing, since a user looking only at individual_plant_data.csv would think that plant 2410 had only ~10 MWh of generation in 2020, but if they looked at the annual data, they'd see annual generation of 16,000,000 MWh.

Proposed solutions:

  1. Exclude plants from individual_plant_data.csv if one or more of their subplants has hourly_data_source=eia
  2. Make a note in the documentation that some plants in individual_plant_data.csv may not contain all subplants, and indicate how users can identify whether a plant has complete hourly data

I think 1 is the better solution, but it does mean removing some hourly data that we're actually pretty confident in (e.g., for 2410, we do know that the diesel generator ran for those 3 hours).
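As a rough illustration of what proposed solution 1 would involve, the sketch below drops every plant that has at least one subplant with `hourly_data_source == "eia"`. The table layouts and function names are hypothetical and only meant to show the filtering logic, not the actual pipeline code.

```python
import pandas as pd


def plants_with_eia_subplants(subplant_sources: pd.DataFrame) -> set:
    """Return plant ids that have at least one subplant sourced from EIA."""
    is_eia = subplant_sources["hourly_data_source"].eq("eia")
    any_eia = is_eia.groupby(subplant_sources["plant_id_eia"]).any()
    return set(any_eia[any_eia].index.tolist())


def exclude_incomplete_plants(
    individual: pd.DataFrame, subplant_sources: pd.DataFrame
) -> pd.DataFrame:
    """Keep only plants whose subplants all have hourly (non-EIA) data."""
    incomplete = plants_with_eia_subplants(subplant_sources)
    return individual[~individual["plant_id_eia"].isin(incomplete)]


# Made-up example: plant 2410 has one CEMS subplant and one EIA subplant.
sources = pd.DataFrame(
    {
        "plant_id_eia": [2410, 2410, 3],
        "subplant_id": [1, 2, 1],
        "hourly_data_source": ["cems", "eia", "cems"],
    }
)
individual = pd.DataFrame(
    {"plant_id_eia": [2410, 3], "net_generation_mwh": [10.0, 50.0]}
)
complete_only = exclude_incomplete_plants(individual, sources)
```

The same `plants_with_eia_subplants` helper could also back solution 2, since it gives users a way to check whether a plant's hourly data is complete.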

(hourly_validation and map_visualization)
* update 930 time lag notebook for new dir structure
notebook was used during issue 230 investigation

notebook is from gailin/clean_cems branch, which can now be deleted
* A renewable generator (hydro, wind, solar, nuclear, geothermal) won't use the `partial_cems_plant` hourly shaping methodology
* A subplant that contains generators of mixed fuel types will choose the hourly shaping method of the generator with the largest generation
@gailin-p
Collaborator Author

Other changes:

There are some notebook changes included here that are not directly related to #230:

  • hourly_validation.ipynb and map_visualization.ipynb are updated to reflect final changes I made before publishing the visualizations in the two blog posts (announcement and real-time validation posts)
  • work_in_progress/clean_cems_outliers.ipynb is a notebook I pulled out of the old/outdated gailin/clean_cems branch. This notebook is background research to advance Identify outlier values in reported CEMS data #50. Moving the notebook into work_in_progress will allow us to delete the outdated gailin/clean_cems branch.

@grgmiller
Collaborator

grgmiller commented Sep 28, 2022

> Proposed solutions:
>
>   1. Exclude plants from individual_plant_data.csv if one or more of their subplants has hourly_data_source=eia
>   2. Make a note in the documentation that some plants in individual_plant_data.csv may not contain all subplants, and indicate how users can identify whether a plant has complete hourly data

So the original partial plant methodology was designed so that different parts of a plant couldn't end up in separate files. If all subplants had hourly_data_source=="eia", all of the plant data would end up in shaped_fleet_data, but if only some subplants had eia as the data source, we would use one of the partial CEMS methods, and all of the data for that plant would end up in individual_plant_data. (At least that was the intent. If you've found that's not how it was working, then we would want to fix that bug.)

However, with this issue we've discovered that we don't always want to shape an entire plant with partial CEMS data, because of the spike issue. This PR now splits up the data so that (in the case of 2410, for example) the nuclear portion ends up in the shaped fleet data and the diesel portion ends up in the individual data.

If we went with option 1, do you know how much CEMS data we'd be ignoring (ie is it a handful of backup generators with negligible generation, or would this affect a wider set of generation)?

I'm kind of leaning toward 2 as the simpler fix for now.

There are also a couple of other options that we could consider:

  3. Instead of using "NA" subplant ids, use the unit_id_pudl and/or generator id to assign a non-NA value to these subplants
  4. Publish all of the data as individual plant data

@gailin-p
Collaborator Author

> If we went with option 1, do you know how much CEMS data we'd be ignoring (ie is it a handful of backup generators with negligible generation, or would this affect a wider set of generation)?

The new methodology proposed in this PR results in 106 plant-months (over 12 unique plants) with split EIA and CEMS/partial_CEMS hourly data sources. The generation from these plants is split pretty evenly over CEMS and EIA hourly data sources, with about 21,000,000 MWh generation in CEMS subplants and 29,000,000 in EIA subplants.

The affected plants are: [557, 621, 645, 1355, 2410, 2707, 2953, 6074, 8223, 10029, 10823, 58236]. (clarification note: these are plants with split methodologies between subplants, which is a different set of plants than the plants with NaN-id subplants with split methodologies within a subplant.)
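A count like the one above could be produced along these lines. This is a hedged sketch on a hypothetical subplant-month table (the columns `plant_id_eia`, `report_date`, `hourly_data_source`, and `net_generation_mwh` are assumptions), not the query that was actually run.

```python
import pandas as pd


def summarize_split_plant_months(subplant_months: pd.DataFrame) -> dict:
    """Summarize plant-months that mix EIA and CEMS/partial-CEMS subplants."""
    df = subplant_months.assign(
        is_eia=subplant_months["hourly_data_source"].eq("eia")
    )
    keys = ["plant_id_eia", "report_date"]
    # A plant-month is "split" when it has both EIA and non-EIA subplants.
    kinds = df.groupby(keys)["is_eia"].transform("nunique")
    split = df[kinds == 2]
    return {
        "plant_months": len(split[keys].drop_duplicates()),
        "plants": sorted(split["plant_id_eia"].unique().tolist()),
        "eia_mwh": float(split.loc[split["is_eia"], "net_generation_mwh"].sum()),
        "cems_mwh": float(split.loc[~split["is_eia"], "net_generation_mwh"].sum()),
    }


# Made-up numbers, for illustration only.
subplant_months = pd.DataFrame(
    {
        "plant_id_eia": [2410, 2410, 2410, 3, 3],
        "report_date": ["2020-01", "2020-01", "2020-02", "2020-01", "2020-01"],
        "hourly_data_source": ["cems", "eia", "eia", "cems", "partial_cems_subplant"],
        "net_generation_mwh": [4.0, 1000.0, 900.0, 50.0, 20.0],
    }
)
summary = summarize_split_plant_months(subplant_months)
```

Only the January plant-month for plant 2410 counts as split here; plant 3 has two non-EIA sources, which this sketch treats as a single category.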

@gailin-p
Collaborator Author

gailin-p commented Oct 1, 2022

725e39a added a new hourly plant-level output file, partial_plant_data.csv, which contains the CEMS and partial CEMS plant data for plants with one or more subplants without hourly data.

@grgmiller
Collaborator

The more I think about it, I'm thinking that perhaps we should revert 725e39a and just keep the outputs split among the two files with a disclaimer in the documentation (and maybe even on the download page) that data might be split between two files. Here's my thought process:

  • I think that maybe in the next release, we will actually want to move to publishing all of the hourly plant-level data for individual plants, instead of splitting it into two files. In talking to some of the early adopters of the OGE dataset, it seems that there is interest in the individual plant data at the hourly resolution. In the past, we haven't done this partially due to memory issues, but I think there may be a way around this: if we shape the individual plant data in chunks only during export, and do not actually store a full version of the shaped plant-level data in memory, I think this could work. We'd want to export the data in chunks (eg for each BA, or each state) so that each file isn't huge. Once we export the hourly plant-level data, we could apply the existing aggregated shaping method, since that's all we need for the subsequent power sector and consumed outputs.
  • For a patch release, I'm not sure that we want to be adding a whole new output to the dataset, especially if it is a temporary fix that we will probably not keep in the next major release.
  • It seems like the fix for outputting data into a separate file involves re-organizing the data pipeline a bit, and I'd want to review this a bit more carefully before implementing.
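The chunked-export idea in the first bullet could look roughly like the sketch below. It is an assumption-heavy illustration: the shaping step is reduced to multiplying a monthly total by an hourly profile fraction (a stand-in, not the project's shaping method), and the table layouts, column names, and `export_hourly_by_ba` function are hypothetical. The point is only that each BA's hourly frame is built, written, and discarded before the next one is created.

```python
import tempfile
from pathlib import Path

import pandas as pd


def export_hourly_by_ba(
    monthly: pd.DataFrame, profiles: pd.DataFrame, out_dir: Path
) -> list:
    """Shape and write one BA at a time so the full hourly table never exists."""
    written = []
    for ba_code, plants in monthly.groupby("ba_code"):
        ba_profile = profiles[profiles["ba_code"] == ba_code]
        # Expand each plant-month to hourly rows for this BA only.
        hourly = plants.merge(ba_profile, on=["ba_code", "report_date"])
        hourly["net_generation_mwh"] = (
            hourly["monthly_generation_mwh"] * hourly["profile_fraction"]
        )
        path = out_dir / f"{ba_code}.csv"
        hourly[["plant_id_eia", "datetime_utc", "net_generation_mwh"]].to_csv(
            path, index=False
        )
        written.append(path)
    return written


# Made-up inputs: two plants in two BAs, two "hours" per month.
monthly = pd.DataFrame(
    {
        "plant_id_eia": [1, 2],
        "ba_code": ["PJM", "CISO"],
        "report_date": ["2020-01", "2020-01"],
        "monthly_generation_mwh": [100.0, 50.0],
    }
)
profiles = pd.DataFrame(
    {
        "ba_code": ["PJM", "PJM", "CISO", "CISO"],
        "report_date": ["2020-01"] * 4,
        "datetime_utc": ["2020-01-01 00:00", "2020-01-01 01:00"] * 2,
        "profile_fraction": [0.25, 0.75, 0.5, 0.5],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    paths = export_hourly_by_ba(monthly, profiles, Path(tmp))
    file_names = sorted(p.name for p in paths)
    pjm = pd.read_csv(Path(tmp) / "PJM.csv")
```

Peak memory then scales with the largest single BA rather than with the whole dataset, which is the property the bullet is after.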

@gailin-p
Collaborator Author

gailin-p commented Oct 4, 2022

> It seems like the fix for outputting data into a separate file involves re-organizing the data pipeline a bit, and I'd want to review this a bit more carefully before implementing.

Most of the reorganization is actually unrelated -- I just moved the writing of eia923 to just after the data is finished because I found it confusing for it to be exported later in the pipeline when it hasn't actually been modified since line 172. The only other pipeline change is to not delete it so it can be passed to the plant writing function.

> For a patch release, I'm not sure that we want to be adding a whole new output to the dataset, especially if it is a temporary fix that we will probably not keep in the next major release.

I see your point here, though it doesn't feel too problematic to me (it'll still get zipped up with the rest of the hourly plant data).

I think if we want to bundle the partial hourly plant data with the regular plant data, we should write a new data_quality_metrics file that reports the plant-months for which hourly plant data is partial. There's no other way to get that info without the output files. (We should maybe also copy that file to the plant_data/hourly folders so users get that information without having to search out and download another zip.)

@grgmiller
Collaborator

> I found it confusing for it to be exported later in the pipeline when it hasn't actually been modified since line 172

Oh sorry about that - I realize now that in a previous iteration of the code the eia923_allocated df was modified by the partial CEMS shaping process, but that is no longer the case, so your move makes total sense!

> we should write a new data_quality_metrics file that reports the plant-months for which hourly plant data is partial

Doesn't results/2020/plant_data/plant_metadata.csv already do this? For each subplant-month, it identifies the source of the data and the source of the hourly profile. Without looking at a specific plant example, I'm not sure if this output needs to be further modified to meet these needs, but we should be able to use this output file.

@gailin-p
Collaborator Author

gailin-p commented Oct 5, 2022

Ok, I like the idea of re-combining the hourly plant-level output files and directing users to plant_metadata to identify which plants have hourly data split between the plant file and the synthetic plant file.

> Doesn't results/2020/plant_data/plant_metadata.csv already do this? For each subplant-month, it identifies the source of the data and the source of the hourly profile. Without looking at a specific plant example, I'm not sure if this output needs to be further modified to meet these needs, but we should be able to use this output file.

Great point! This seems like a good spot for this data. However, plant_metadata.csv will need to be modified, because it doesn't currently contain rows for EIA-only plants (only rows for the synthetic plants aggregated from many EIA plants). For example, using the classic example plant for this issue, 2410 (the PJM nuclear plant with the diesel backup): we see metadata rows for the three months when the diesel subplant reported to CEMS, but no plant_metadata.csv row for the nuclear subplant -- it's just grouped in the synthetic plant row.

Two options for fixing:

  1. Add rows only for shaped subplants of plants with some individual plant data (ie, the plant-months currently getting written to partial_plant_data.csv)
  2. Add rows for all subplant-months. This would make for a much larger file, but it guarantees that a user will be able to look up where the hourly data is for any plant they're interested in

@grgmiller
Collaborator

I think either option 1 or 2 would work, and we should do whatever is easiest to implement at this point (based on my educated guess, 2 might be easier since it involves less segmenting of the data, but I could be mistaken). The plant_static_attributes file also contains the mapping of plant id to shaped plant id, so in the worst case, a user would have to cross reference the two tables to figure out the metadata for a plant that had been aggregated.
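The worst-case cross-reference described above might look like the following from a user's side. This is a sketch only: the column names (`plant_id_eia`, `shaped_plant_id`, `hourly_data_source`) and the way shaped plants appear in the metadata table are assumptions about the two files, and the frames below are made-up stand-ins rather than real output.

```python
import pandas as pd


def metadata_for_plant(
    plant_id: int, plant_metadata: pd.DataFrame, static_attrs: pd.DataFrame
) -> pd.DataFrame:
    """Collect metadata rows for a plant and for the shaped plant it maps to."""
    direct = plant_metadata[plant_metadata["plant_id_eia"] == plant_id]
    # Look up which shaped (aggregated) plant this plant was folded into.
    shaped_ids = static_attrs.loc[
        static_attrs["plant_id_eia"] == plant_id, "shaped_plant_id"
    ].dropna()
    shaped = plant_metadata[plant_metadata["plant_id_eia"].isin(shaped_ids)]
    return pd.concat([direct, shaped], ignore_index=True)


# Hypothetical stand-ins for plant_metadata.csv and plant_static_attributes.
plant_metadata = pd.DataFrame(
    {
        "plant_id_eia": [2410, 2410, 2410, 900001],
        "report_date": ["2020-03", "2020-07", "2020-09", "2020-01"],
        "hourly_data_source": ["cems", "cems", "cems", "eia"],
    }
)
static_attrs = pd.DataFrame(
    {"plant_id_eia": [2410, 3], "shaped_plant_id": [900001, 900002]}
)
rows_2410 = metadata_for_plant(2410, plant_metadata, static_attrs)
```

With option 2 (rows for all subplant-months), the second lookup would become unnecessary for most users, since every plant would appear in the metadata table directly.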

Rename col to in plant_metadata
@gailin-p
Collaborator Author

07853b7
Removed separated full and partial plant-level hourly outputs
Renamed metadata id to "<See shaped plant ID>"

Collaborator

@grgmiller grgmiller left a comment

Looks good! I think we're ready to merge

@grgmiller grgmiller merged commit bcb91d5 into v0.1.2 Oct 21, 2022
@grgmiller grgmiller deleted the gailin/issue230 branch October 21, 2022 23:02
@grgmiller grgmiller changed the title Gailin/issue230 Update partial CEMS shaping for mixed fuel plants Oct 22, 2022