Update partial CEMS shaping for mixed fuel plants #238

gailin-p · 2022-09-28T13:53:51Z

Summary of changes

Close #230 by changing hourly data source identification methods:

* A renewable generator (hydro, wind, solar, nuclear, geothermal) won't use the partial_cems_plant hourly data source methodology
* A subplant that contains generators of mixed fuel types will choose the hourly data source of the generator with the largest generation (this is relevant for subplants with subplant_id=NaN, which can have generators of mixed fuel types; see comments on Ensure complete subplant_id mapping #49)
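
As a rough illustration of the two rules above (not the actual code in src/data_cleaning.py), here is a minimal pandas sketch. The column names (`fuel_category`, `net_generation_mwh`, `hourly_data_source`), the `"eia"` fallback value, and the function name are assumptions made for this example.

```python
import pandas as pd

# Fuel categories that the rule above bars from the partial_cems_plant method.
# Column names and the "eia" fallback are assumptions made for this sketch.
EXCLUDED_FUELS = {"hydro", "wind", "solar", "nuclear", "geothermal"}
KEYS = ["plant_id_eia", "subplant_id", "report_date"]


def choose_subplant_source(gens: pd.DataFrame) -> pd.DataFrame:
    """Pick one hourly_data_source per subplant-month from generator-level rows."""
    gens = gens.copy()
    # Rule 1: listed fuel types never keep the partial_cems_plant method.
    barred = gens["fuel_category"].isin(EXCLUDED_FUELS) & (
        gens["hourly_data_source"] == "partial_cems_plant"
    )
    gens.loc[barred, "hourly_data_source"] = "eia"
    # Rule 2: the generator with the largest generation sets the subplant's source.
    # drop_duplicates treats NaN subplant_id values as equal, so NaN subplants
    # are kept as a group (a default groupby would silently drop them).
    largest_first = gens.sort_values("net_generation_mwh", ascending=False)
    chosen = largest_first.drop_duplicates(subset=KEYS, keep="first")
    return chosen[KEYS + ["hourly_data_source"]].sort_values(KEYS).reset_index(drop=True)


# Made-up rows: a NaN subplant mixing a large nuclear unit with a small gas unit.
example = pd.DataFrame(
    {
        "plant_id_eia": [1, 1, 2],
        "subplant_id": [float("nan"), float("nan"), 1.0],
        "report_date": ["2020-01-01"] * 3,
        "fuel_category": ["nuclear", "natural_gas", "natural_gas"],
        "net_generation_mwh": [1000.0, 10.0, 500.0],
        "hourly_data_source": ["partial_cems_plant"] * 3,
    }
)
result = choose_subplant_source(example)
```

In this toy input, the NaN subplant of plant 1 follows its nuclear generator (the largest), while plant 2's gas-only subplant keeps the partial CEMS method.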

The post-fix hourly emission rate for PJM is below. The abrupt drops to zero (seen in #230) are gone.

The PJM nuclear emission rate, below, shows the three hours where the generator is on as higher-than-zero emission rates:

Open question: subplants in hourly `plant_data` results

Plant 2410, the nuclear plant originally responsible for the PJM data issues, has one subplant made up of a CEMS-reporting diesel generator and one subplant with its nuclear generation. Generation from the CEMS-reporting subplant is in plant_data/hourly/individual_plant_data.csv, while generation from the nuclear subplant is in shaped_fleet_data.csv. This is confusing, since a user looking only at individual_plant_data.csv would think that plant 2410 had only ~10 MWh of generation in 2020, but if they looked at the annual data, they'd see annual generation of 16,000,000 MWh.

Proposed solutions:

1. Exclude plants from individual_plant_data.csv if one or more of their subplants has hourly_data_source=eia
2. Make a note in the documentation that some plants in individual_plant_data.csv may not contain all subplants, and indicate how users can identify whether a plant has complete hourly data

I think 1 is the better solution, but it does mean removing some hourly data that we're actually pretty confident in (e.g., for 2410, we do know that the diesel generator ran for those 3 hours).

gailin-p · 2022-09-28T15:57:31Z

Other changes:

There are some notebook changes included here that are not directly related to #230:

hourly_validation.ipynb and map_visualization.ipynb are updated to reflect final changes I made before publishing the visualizations in the two blog posts (announcement and real-time validation posts)
work_in_progress/clean_cems_outliers.ipynb is a notebook I pulled out of the old/outdated gailin/clean_cems branch. This notebook is background research to advance Identify outlier values in reported CEMS data #50. Moving the notebook into work_in_progress will allow us to delete the outdated gailin/clean_cems branch.

grgmiller · 2022-09-28T17:55:30Z

Proposed solutions:

1. Exclude plants from individual_plant_data.csv if one or more of their subplants has hourly_data_source=eia
2. Make a note in the documentation that some plants in individual_plant_data.csv may not contain all subplants, and indicate how users can identify whether a plant has complete hourly data

So the original partial plant methodology was designed such that different parts of a plant couldn't end up in separate files. If all subplants had hourly_data_source=="eia", all of the plant data would end up in shaped_fleet_data, but if only some subplants had eia as the data source, we would use one of the partial cems methods, and then all of the data for that plant would end up in individual_plant_data (or at least that was the intent; if you've found that's not how it was working, then we would want to fix that bug).

However, with this current issue, we've discovered that we don't always want to shape an entire plant with partial cems data because of the spike issue. This PR now splits up the data so that (in the case of 2410, for example) the nuclear portion ends up in the shaped fleet data and the diesel portion ends up in the individual data.

If we went with option 1, do you know how much CEMS data we'd be ignoring (ie is it a handful of backup generators with negligible generation, or would this affect a wider set of generation)?

I'm kind of leaning toward 2 as the simpler fix for now.

There are also a couple of other options that we could consider:
3. Instead of using "NA" subplant ids, use the unit_id_pudl and/or generator id to assign a non-NA value to these subplants
4. Publish all of the data as individual plant data

gailin-p · 2022-09-29T14:48:50Z

If we went with option 1, do you know how much CEMS data we'd be ignoring (ie is it a handful of backup generators with negligible generation, or would this affect a wider set of generation)?

The new methodology proposed in this PR results in 106 plant-months (across 12 unique plants) with split EIA and CEMS/partial_CEMS hourly data sources. The generation from these plants is split fairly evenly between CEMS and EIA hourly data sources, with about 21,000,000 MWh of generation in CEMS subplants and 29,000,000 MWh in EIA subplants.

The affected plants are: [557, 621, 645, 1355, 2410, 2707, 2953, 6074, 8223, 10029, 10823, 58236]. (clarification note: these are plants with split methodologies between subplants, which is a different set of plants than the plants with NaN-id subplants with split methodologies within a subplant.)
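
For anyone wanting to reproduce this kind of count, a minimal sketch of how split plant-months could be identified from a subplant-month table is below. This is not the query used for the numbers above; the column names, the `"cems"` label, and the generation values are assumptions for illustration only.

```python
import pandas as pd

KEYS = ["plant_id_eia", "report_date"]


def find_split_plant_months(subplants: pd.DataFrame) -> pd.DataFrame:
    """Return plant-months whose subplants mix "eia" with any other hourly source."""
    is_eia = subplants["hourly_data_source"] == "eia"
    flags = (
        subplants.assign(has_eia=is_eia, has_other=~is_eia)
        .groupby(KEYS)[["has_eia", "has_other"]]
        .any()
    )
    return flags[flags["has_eia"] & flags["has_other"]].reset_index()[KEYS]


# Made-up subplant-month rows; the generation values are illustrative only.
example = pd.DataFrame(
    {
        "plant_id_eia": [2410, 2410, 3, 3],
        "subplant_id": [1, 2, 1, 2],
        "report_date": ["2020-01-01"] * 4,
        "hourly_data_source": ["cems", "eia", "cems", "partial_cems_plant"],
        "net_generation_mwh": [4.0, 1_300_000.0, 200.0, 50.0],
    }
)
split = find_split_plant_months(example)
# Total generation by hourly source, restricted to the split plant-months.
gen_by_source = (
    example.merge(split, on=KEYS)
    .groupby("hourly_data_source")["net_generation_mwh"]
    .sum()
)
```

Here only plant 2410 is flagged, because plant 3 has no EIA-sourced subplant in that month.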

gailin-p · 2022-10-01T18:41:29Z

725e39a added a new hourly plant-level output file, partial_plant_data.csv, which contains the CEMS and partial CEMS plant data for plants with one or more subplants without hourly data.

grgmiller · 2022-10-02T00:07:35Z

The more I think about it, I'm thinking that perhaps we should revert 725e39a and just keep the outputs split among the two files with a disclaimer in the documentation (and maybe even on the download page) that data might be split between two files. Here's my thought process:

I think that maybe in the next release, we will actually want to move to publishing all of the hourly plant-level data for individual plants, instead of being split into two files. In talking to some of the early adopters of the OGE dataset, it seems that there is interest in the individual plant data at the hourly resolution. In the past, we haven't done this partially due to memory issues, but I think there may be a way around this: if we shape the individual plant data in chunks only during export, and do not actually store a full version of the shaped plant-level data in memory, I think this could work. We'd want to export the data in chunks (eg for each BA, or each state) so that each file isn't huge. Once we export the hourly plant level data, we could apply the existing aggregated shaping method since that's all we need for the subsequent power sector and consumed outputs.
For a patch release, I'm not sure that we want to be adding a whole new output to the dataset, especially if it is a temporary fix that we will probably not keep in the next major release.
It seems like the fix for outputting data into a separate file involves re-organizing the data pipeline a bit, and I'd want to review this a bit more carefully before implementing.
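
To make the chunked-export idea above concrete, here is a hedged sketch of shaping and writing one BA at a time. It is not existing pipeline code: the function name, the column names, and the assumption that each profile row carries an hourly fraction of the monthly total (`profile_fraction`) are all hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd


def export_hourly_by_ba(monthly: pd.DataFrame, profiles: pd.DataFrame, out_dir: Path) -> list:
    """Shape and write one BA at a time so the full hourly table never sits in memory."""
    written = []
    for ba_code, ba_monthly in monthly.groupby("ba_code"):
        ba_profile = profiles[profiles["ba_code"] == ba_code]
        # Expand each plant-month into hourly rows using the BA/fuel profile.
        hourly = ba_monthly.merge(ba_profile, on=["ba_code", "fuel_category", "report_date"])
        hourly["net_generation_mwh"] = hourly["net_generation_mwh"] * hourly["profile_fraction"]
        path = out_dir / f"{ba_code}.csv"
        hourly[["plant_id_eia", "datetime_utc", "net_generation_mwh"]].to_csv(path, index=False)
        written.append(path)
    return written


# Made-up monthly totals and two-hour profiles, for illustration only.
monthly = pd.DataFrame(
    {
        "plant_id_eia": [10, 20],
        "ba_code": ["PJM", "MISO"],
        "fuel_category": ["nuclear", "wind"],
        "report_date": ["2020-01-01", "2020-01-01"],
        "net_generation_mwh": [100.0, 40.0],
    }
)
profiles = pd.DataFrame(
    {
        "ba_code": ["PJM", "PJM", "MISO", "MISO"],
        "fuel_category": ["nuclear", "nuclear", "wind", "wind"],
        "report_date": ["2020-01-01"] * 4,
        "datetime_utc": ["2020-01-01T00:00", "2020-01-01T01:00"] * 2,
        "profile_fraction": [0.25, 0.75, 0.5, 0.5],
    }
)
with tempfile.TemporaryDirectory() as tmp:
    paths = export_hourly_by_ba(monthly, profiles, Path(tmp))
    file_names = sorted(p.name for p in paths)
    pjm_total = pd.read_csv(Path(tmp) / "PJM.csv")["net_generation_mwh"].sum()
```

Only one BA's hourly rows exist at any moment, which is the memory property the proposal relies on; the aggregated shaping for the power sector outputs would still run separately.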

gailin-p · 2022-10-04T21:07:17Z

It seems like the fix for outputting data into a separate file involves re-organizing the data pipeline a bit, and I'd want to review this a bit more carefully before implementing.

Most of the reorganization is actually unrelated -- I just moved the writing of eia923 to just after the data is finished because I found it confusing for it to be exported later in the pipeline when it hasn't actually been modified since line 172. The only other pipeline change is to not delete it so it can be passed to the plant writing function.

For a patch release, I'm not sure that we want to be adding a whole new output to the dataset, especially if it is a temporary fix that we will probably not keep in the next major release.

I see your point here, though it doesn't feel too problematic to me (it'll still get zipped up with the rest of the hourly plant data).

I think if we want to bundle the partial hourly plant data with the regular plant data, we should write a new data_quality_metrics file that reports the plant-months for which hourly plant data is partial. There's no other way to get that info without the outputs files. (We should maybe also copy that file to the plant_data/hourly folders so users get that information without having to search out and download another zip)

grgmiller · 2022-10-04T23:16:56Z

I found it confusing for it to be exported later in the pipeline when it hasn't actually been modified since line 172

Oh sorry about that - I realize now that in a previous iteration of the code the eia923_allocated df was modified by the partial CEMS shaping process, but that is no longer the case, so your move makes total sense!

we should write a new data_quality_metrics file that reports the plant-months for which hourly plant data is partial

Doesn't results/2020/plant_data/plant_metadata.csv already do this? For each subplant-month, it identifies the source of the data and the source of the hourly profile. Without looking at a specific plant example, I'm not sure if this output needs to be further modified to meet these needs, but we should be able to use this output file.

gailin-p · 2022-10-05T20:52:55Z

Ok, I like the idea of re-combining the hourly plant-level output files and directing users to plant_metadata to identify which plants have hourly data split between the plant file and the synthetic plant file.

Doesn't results/2020/plant_data/plant_metadata.csv already do this? For each subplant-month, it identifies the source of the data and the source of the hourly profile. Without looking at a specific plant example, I'm not sure if this output needs to be further modified to meet these needs, but we should be able to use this output file.

Great point! This seems like a good spot for this data. However, plant_metadata.csv will need to be modified, because it doesn't currently contain rows for EIA-only plants (only rows for the synthetic plants aggregated from many EIA plants). For example, using the classic example plant for this issue, 2410 (the PJM nuclear plant with the diesel backup): we see metadata rows for the three months when the diesel subplant reported to CEMS, but no plant_metadata.csv row for the nuclear subplant -- it's just grouped in the synthetic plant row.

Two options for fixing:

1. Add rows only for shaped subplants of plants with some individual plant data (ie, the plant-months currently getting written to partial_plant_data.csv)
2. Add rows for all subplant-months. This would make for a much larger file, but it guarantees that a user will be able to look up where the hourly data is for any plant they're interested in

grgmiller · 2022-10-05T21:43:10Z

I think either option 1 or 2 would work, and we should do whatever is easiest to implement at this point (based on my educated guess, 2 might be easier since it involves less segmenting of the data, but I could be mistaken). The plant_static_attributes file also contains the mapping of plant id to shaped plant id, so in the worst case, a user would have to cross reference the two tables to figure out the metadata for a plant that had been aggregated.
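
A sketch of what that cross-reference might look like for a user, assuming hypothetical column names (`shaped_plant_id`, `hourly_data_source`) and a made-up shaped plant ID; the real file schemas may differ.

```python
import pandas as pd


def lookup_plant_metadata(
    plant_id: int, plant_metadata: pd.DataFrame, plant_static_attributes: pd.DataFrame
) -> pd.DataFrame:
    """Return metadata rows for a plant and for the shaped plant it was aggregated into."""
    attrs = plant_static_attributes.set_index("plant_id_eia")
    ids = [plant_id]
    shaped_id = attrs.loc[plant_id, "shaped_plant_id"]
    if pd.notna(shaped_id):
        # Part of this plant was aggregated, so also pull the shaped plant's rows.
        ids.append(int(shaped_id))
    return plant_metadata[plant_metadata["plant_id_eia"].isin(ids)].reset_index(drop=True)


# Made-up tables: 900001 is a hypothetical shaped (aggregated) plant ID.
plant_static_attributes = pd.DataFrame(
    {"plant_id_eia": [2410, 3], "shaped_plant_id": [900001, float("nan")]}
)
plant_metadata = pd.DataFrame(
    {
        "plant_id_eia": [2410, 900001, 3],
        "report_date": ["2020-01-01"] * 3,
        "hourly_data_source": ["cems", "eia", "cems"],
    }
)
rows_2410 = lookup_plant_metadata(2410, plant_metadata, plant_static_attributes)
```

For plant 2410 this returns both its own CEMS row and the row of the shaped plant that holds the rest of its generation.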

gailin-p · 2022-10-21T22:25:36Z

07853b7
Removed separated full and partial plant-level hourly outputs
Renamed metadata id to "<See shaped plant ID>"

grgmiller

Looks good! I think we're ready to merge

gailin-p added 3 commits September 26, 2022 11:32

* final versions of blog visualization notebooks

2ed5427

(hourly_validation and map_visualization) * update 930 time lag notebook for new dir structure

Add cems outlier cleaning notebook

e693f28

notebook was used during issue 230 investigation notebook is from gailin/clean_cems branch, which can now be deleted

gailin-p requested a review from grgmiller September 28, 2022 13:54

grgmiller requested changes Sep 28, 2022

add mwh check for mixed fuel type subplants

7ea2474

Output partial plants to different file

725e39a

grgmiller requested changes Oct 1, 2022

grgmiller mentioned this pull request Oct 2, 2022

Export hourly individual plant data for shaped plants #241

Closed

gailin-p added 3 commits October 4, 2022 14:32

Merge branch 'v0.1.2' of github.com:singularity-energy/open-grid-emis…

714981f

…sions into v0.1.2

Merge branch 'v0.1.2' into gailin/issue230

82b6808

Remove code to unify hourly methods

50d0b0b

update documentation and variable names

9d66440

grgmiller requested changes Oct 4, 2022

gailin-p added 2 commits October 5, 2022 16:59

clearer comment and var name for mixed subplant method check

84e4a9e

Merge branch 'gailin/issue230' of github.com:singularity-energy/open-…

32b5753

…grid-emissions into gailin/issue230

Add aggregated subplants to plant_metadata.csv

e59b3a2

grgmiller requested changes Oct 20, 2022

Remove separated plant level outputs

07853b7

Rename col to in plant_metadata

grgmiller approved these changes Oct 21, 2022

grgmiller merged commit bcb91d5 into v0.1.2 Oct 21, 2022

grgmiller deleted the gailin/issue230 branch October 21, 2022 23:02

grgmiller changed the title ~~Gailin/issue230~~ Update partial CEMS shaping for mixed fuel plants Oct 22, 2022

This was referenced Oct 22, 2022

Validate Partial CEMS methodology for subplants #247

Open

v0.1.2 #251

Merged
