Ensure complete subplant_id mapping #49

Open
1 of 3 tasks
grgmiller opened this issue Jun 7, 2022 · 9 comments
Assignees
Labels
crosswalk (improve crosswalking between data sources), data cleaning (Cleaning and standardizing data), methodology (Improve methodology)

Comments

@grgmiller
Collaborator

grgmiller commented Jun 7, 2022

Currently, subplant IDs are only created for units that exist both in CEMS and EIA-923, meaning that there are certain generators/units that have a subplant ID of NaN.

  • Ensure that all merge and groupby functions that use subplant_id as one of the keys are not dropping observations with missing subplant values.
  • Although the primary purpose of the subplant ID is to group CEMS units with EIA generators and boilers, it could also be useful for grouping EIA boilers and generators that do not exist in CEMS. We should update the pudl.analysis.epa_crosswalk code to generate subplant IDs for all boilers/generators that exist in the EIA data, regardless of whether data exists in CEMS.
  • If there are any remaining missing subplant values, we should perhaps fill these missing values with a code of 99 so that there is a non-missing code that would not overlap with any subplant ids already assigned during the crosswalk process.
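As a rough illustration of the first and third tasks, here is a minimal pandas sketch. The DataFrame, its values, and the `net_generation_mwh` column are made up for the example; only `subplant_id`, `plant_id_eia`, and the placeholder code 99 come from this issue. It shows that a default `groupby` drops rows whose key is NaN, and two possible ways to avoid that.

```python
import numpy as np
import pandas as pd

# Hypothetical generator-level records; plant 2 has no subplant_id assigned.
gens = pd.DataFrame(
    {
        "plant_id_eia": [1, 1, 2, 2],
        "subplant_id": [0.0, 0.0, np.nan, np.nan],
        "net_generation_mwh": [10.0, 20.0, 5.0, 7.0],
    }
)
keys = ["plant_id_eia", "subplant_id"]

# Default groupby (dropna=True) silently discards rows whose key is NaN.
dropped = gens.groupby(keys)["net_generation_mwh"].sum()

# dropna=False keeps the NaN group, so no generation is lost.
kept = gens.groupby(keys, dropna=False)["net_generation_mwh"].sum()

# Alternative: replace NaN with a placeholder code before grouping.
MISSING_SUBPLANT_CODE = 99  # placeholder value suggested in this issue
filled = gens.assign(subplant_id=gens["subplant_id"].fillna(MISSING_SUBPLANT_CODE))
filled_totals = filled.groupby(keys)["net_generation_mwh"].sum()
```

Comparing `dropped.sum()` against `gens["net_generation_mwh"].sum()` is one cheap way to detect whether a given groupby is losing observations.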
@grgmiller grgmiller added methodology Improve methodology data cleaning Cleaning and standardizing data labels Jun 7, 2022
@grgmiller grgmiller added this to the Initial Public Release milestone Jun 7, 2022
@grgmiller grgmiller self-assigned this Jun 7, 2022
@grgmiller
Collaborator Author

I should check whether subplant ids are used at all in the clean_eia923() function, but adding subplant ids for EIA-only data might become irrelevant (at least for the initial public release) if we are grouping this data by BA-fuel anyway.

@grgmiller
Collaborator Author

It appears that certain plants/generators that exist in both CEMS and EIA-860 are missing from the crosswalk.

One reason for this might be that we currently inner join the CEMS ids with EIA ids from EIA-923 and not EIA-860, but it is possible that EIA-860 is more complete.

@gailin-p
Collaborator

gailin-p commented Jul 7, 2022

One example of a missing plant is plant_id_eia=2379, which has two generators according to EIA-860 (CA1 and CA2).
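A check along these lines could list every generator that EIA-860 reports but the crosswalk lacks. This is only a sketch with toy data: the two frames and the plant 100 row are invented for illustration, and the column layout of `subplant_crosswalk` is an assumption. Only plant 2379 with generators CA1 and CA2 comes from the example above.

```python
import pandas as pd

# Toy stand-in for the EIA-860 generator list (plant 100 is hypothetical).
eia860_gens = pd.DataFrame(
    {
        "plant_id_eia": [2379, 2379, 100],
        "generator_id": ["CA1", "CA2", "1"],
    }
)

# Toy stand-in for the subplant crosswalk (assumed column layout).
subplant_crosswalk = pd.DataFrame(
    {"plant_id_eia": [100], "generator_id": ["1"], "subplant_id": [0]}
)

# A left merge with indicator=True flags rows that found no match.
check = eia860_gens.merge(
    subplant_crosswalk,
    on=["plant_id_eia", "generator_id"],
    how="left",
    indicator=True,
)
missing = check.loc[
    check["_merge"] == "left_only", ["plant_id_eia", "generator_id"]
]
print(missing)
```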

@grgmiller
Collaborator Author

So at least part of the issue was that when we filtered the CEMS data using the EPA crosswalk, certain units were being dropped because of a mismatch in unitid. In the CEMS data we had stripped leading zeros from the id, but in the crosswalk we had not, so the ids did not match and those plants were dropped. I've now fixed that issue.
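For reference, the kind of mismatch described here can be reproduced with a small sketch. This is not the actual fix; `normalize_unit_id`, the `plant_id_epa`, `gross_load_mw`, and `generator_id` columns, and all values are hypothetical. The idea is simply to apply the same normalization to both tables before merging.

```python
import pandas as pd


def normalize_unit_id(ids: pd.Series) -> pd.Series:
    """Strip leading zeros from unit IDs, keeping an all-zero ID as "0"."""
    as_text = ids.astype(str).str.strip()
    stripped = as_text.str.lstrip("0")
    # An ID made up only of zeros would become empty; map it back to "0".
    return stripped.where(stripped != "", "0")


# Hypothetical CEMS records, where leading zeros were already stripped.
cems = pd.DataFrame(
    {"plant_id_epa": [3, 3], "unitid": ["1", "2A"], "gross_load_mw": [50.0, 75.0]}
)
# Hypothetical crosswalk rows that still carry the leading zeros.
crosswalk = pd.DataFrame(
    {"plant_id_epa": [3, 3], "unitid": ["01", "002A"], "generator_id": ["G1", "G2"]}
)

# Merging on the raw IDs finds no matches, so every unit would be dropped.
raw_match = cems.merge(crosswalk, on=["plant_id_epa", "unitid"], how="inner")

# Normalizing both sides the same way restores the matches.
cems["unitid"] = normalize_unit_id(cems["unitid"])
crosswalk["unitid"] = normalize_unit_id(crosswalk["unitid"])
clean_match = cems.merge(crosswalk, on=["plant_id_epa", "unitid"], how="inner")
```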

@grgmiller
Collaborator Author

Maybe we can get this fixed in PUDL: catalyst-cooperative/pudl#1769
It also looks like EPA is getting ready to release a new version of the crosswalk, which may improve the coverage for subplant mapping: USEPA/camd-eia-crosswalk#25 (comment)

@gailin-p
Collaborator

gailin-p commented Sep 27, 2022

Fuel category differences within subplants with subplant_id=NaN

In some cases, generators in a single plant missing from subplant_crosswalk have a mix of renewable and fossil fuel types. This occurs in 74 subplant-months in plants 141, 621, 1943, 2240, 10025, 10823, and 58236. In these cases, all generators in the plant which are not in subplant_crosswalk are assigned the same subplant, subplant_id=NaN.

In #230, we proposed that subplants within a plant should not share the same CEMS profile (hourly shaping method partial_cems_plant) when they have different primary fuel types. Sharing a profile resulted in one case where all nuclear generation from a large nuclear power plant (plant_id_eia=2410) was being assigned to the 3 hours where a backup diesel generator was on and reporting to CEMS. However, because renewable and fossil generators are combined in each of the subplants listed above, the renewable and fossil generators cannot be assigned different profiles.

If the renewable and fossil generators were assigned different subplants, we could safely use partial_cems_plant to shape the subplant with the fossil generators and a residual profile method to shape the subplant with the renewable generators. This would be conceptually more correct than choosing one method to apply to a subplant with mixed fossil and renewable generation.

To fix this, we would need to update the subplant crosswalk to assign different subplant IDs to generators within a plant whose fuel types differ (see @grgmiller's comments above; we could potentially do this in PUDL).
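One possible shape for that update, sketched in pandas under stated assumptions: `fill_subplants_by_fuel` is a hypothetical helper, the `fuel_category` column and all rows are made up, and new IDs are simply numbered after the highest ID already assigned in each plant. It only touches generators whose `subplant_id` is missing.

```python
import numpy as np
import pandas as pd


def fill_subplants_by_fuel(gens: pd.DataFrame) -> pd.DataFrame:
    """Give unmapped generators one new subplant_id per fuel category."""
    out = gens.copy()
    missing = out["subplant_id"].isna()

    # First unused ID per plant: one past the current maximum, or 0 if the
    # plant has no assigned IDs at all.
    next_id = (
        out.groupby("plant_id_eia")["subplant_id"].transform("max").add(1).fillna(0)
    )

    # Number each (plant, fuel) combination, then rebase so that numbering
    # restarts at 0 within every plant.
    unmapped = out.loc[missing]
    group_no = unmapped.groupby(["plant_id_eia", "fuel_category"]).ngroup()
    fuel_rank = group_no - group_no.groupby(unmapped["plant_id_eia"]).transform("min")

    out.loc[missing, "subplant_id"] = next_id[missing] + fuel_rank
    return out


# Hypothetical generators: plant 1 has one mapped unit, plant 2 has none.
gens = pd.DataFrame(
    {
        "plant_id_eia": [1, 1, 1, 1, 2, 2],
        "generator_id": ["A", "B", "C", "D", "E", "F"],
        "fuel_category": ["natural_gas", "natural_gas", "solar", "solar", "wind", "coal"],
        "subplant_id": [0.0, np.nan, np.nan, np.nan, np.nan, np.nan],
    }
)
result = fill_subplants_by_fuel(gens)
```

With this approach the fossil and renewable generators in a plant end up in separate subplants, so each could be given its own shaping method.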

@gailin-p
Collaborator

> adding subplant ids for EIA-only data might become irrelevant (at least for the initial public release) if we are grouping this data by BA-fuel anyway.

Since hourly data is shaped at the subplant level, I think this does end up affecting currently released data.

@grgmiller
Collaborator Author

I think that one way to fix this issue would be to take advantage of the existing unit_id_pudl identifiers created by the pudl data pipeline (see the "Unit mapping through network analysis" section of this blog post for more information). These unit_id_pudl values are created using the same network analysis that is used for the subplant_id mapping, but based only on EIA data. However, in order to use unit_id_pudl alongside subplant_id, the two would likely need to be harmonized (or potentially just used as two separate keys). See catalyst-cooperative/pudl#1769 for more background on this harmonization issue.
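To make the network-analysis idea concrete, here is a minimal sketch in plain Python. It is not the PUDL implementation; `connected_components`, the node labels, and the edge lists are all hypothetical. Each unit, generator, and boiler is a node, each known association is an edge, and every connected component becomes one grouping.

```python
from collections import defaultdict


def connected_components(edges):
    """Group nodes linked by any chain of edges, using a simple union-find."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return list(groups.values())


# Hypothetical associations within a single plant.
unit_generator_edges = [
    (("unit", "1"), ("gen", "G1")),
    (("unit", "2"), ("gen", "G2")),
]
boiler_generator_edges = [
    (("boiler", "B1"), ("gen", "G1")),
    (("boiler", "B1"), ("gen", "G2")),
    (("boiler", "B3"), ("gen", "G3")),
]

# Using only unit-generator links keeps G1 and G2 in separate groups.
units_only = connected_components(unit_generator_edges)

# Adding boiler-generator links merges them through the shared boiler B1
# and also picks up the EIA-only generator G3.
combined = connected_components(unit_generator_edges + boiler_generator_edges)
```

The comparison between `units_only` and `combined` shows why leaving out one type of association can change which units and generators end up grouped together.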

@grgmiller
Collaborator Author

As noted in catalyst-cooperative/pudl#1769 (comment), I've actually noticed that the current subplant id mapping is not behaving as expected (mapping units to generators and boilers) because it ignores all of the boiler-generator associations.

@grgmiller grgmiller added the crosswalk improve crosswalking between data sources label Jan 7, 2023
2 participants