Add operating and retirement dates to plant static attributes #367

Merged
grgmiller merged 6 commits into historical_coverage_feature from ben/dates on May 24, 2024

Conversation

Collaborator

@rouille rouille commented May 22, 2024

Purpose

Add operating and retirement dates to plant static attributes.

A bug in the calculation of the plant-level nameplate capacity is also fixed.

This PR also fixes an issue for years < 2013 where missing BA codes were being assigned, resulting in inaccurate BA-level results.

What the code is doing

Create a new function that adds the operating and retirement dates of a plant to the plant static attributes data frame. The operating date of a plant is taken as the earliest of all its generators' operating dates over all report dates. Likewise, the retirement date of a plant is taken as the latest of all its generators' retirement dates over all report dates.

Testing

Successfully ran the 2013 pipeline.

Where to look

  • the new add_plant_operating_and_retirement_dates function in the oge.helpers module.
  • the oge.column_checks module, where the new fields were added. Note that I also added some missing datetime columns to the list of columns defined in the apply_dtypes function. These columns won't be converted.

Usage Example/Visuals

Screenshot 2024-05-22 at 7 28 14 PM

Review estimate

10min

Future work

N/A

Checklist

  • Update the documentation to reflect changes made in this PR
  • Format all updated python files using black
  • Clear outputs from all notebooks modified
  • Add docstrings and type hints to any new functions created

@rouille rouille requested a review from grgmiller May 22, 2024 22:01
@rouille rouille self-assigned this May 22, 2024
@rouille rouille marked this pull request as ready for review May 23, 2024 02:30
Collaborator

@grgmiller grgmiller left a comment

See requested changes.

Before we merge this, it may be helpful to do a couple quick validations:

  • Are there any missing operating dates? If so, we will just need to understand how to deal with those in MISO
  • Load up some EIA-923 data for this year and just quickly check if there are any plants that we marked as retired prior to 2022 that reported 923 data in 2022... sometimes plants do continue to report, so this may be okay, but we should at least manually double check that our algorithm didn't mistakenly mark a plant as retired if at least one generator is still going
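Both validations can be sketched with pandas along these lines. This is only an illustration: the two data frames below are made-up stand-ins for the plant static attributes and for the EIA-923 data of the 2022 data year, and net_generation_mwh is an assumed column name.

```python
import pandas as pd

# Hypothetical plant static attributes (made-up values).
plant_attributes = pd.DataFrame(
    {
        "plant_id_eia": [1, 2, 3],
        "plant_operating_date": pd.to_datetime(["1990-01-01", None, "2005-06-01"]),
        "plant_retirement_date": pd.to_datetime([None, "2015-12-01", "2020-03-01"]),
    }
)

# Hypothetical EIA-923 records reported for the 2022 data year.
eia923_2022 = pd.DataFrame(
    {"plant_id_eia": [1, 3, 3], "net_generation_mwh": [100.0, 5.0, 7.0]}
)

# Check 1: plants without an operating date.
missing_operating = plant_attributes.loc[
    plant_attributes["plant_operating_date"].isna(), "plant_id_eia"
].tolist()

# Check 2: plants marked as retired before 2022 that still report data in 2022.
retired_before_2022 = plant_attributes[
    plant_attributes["plant_retirement_date"] < pd.Timestamp("2022-01-01")
]
still_reporting = sorted(
    set(retired_before_2022["plant_id_eia"]) & set(eia923_2022["plant_id_eia"])
)
```

Any plant that ends up in still_reporting would be a candidate for a manual check rather than an automatic error, since some retired plants do keep reporting.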

"generator_operating_date",
"generator_retirement_date",
"current_planned_generator_operating_date",
"operating_date",
Collaborator

to follow the pudl naming conventions, and for clarity, let's call these "plant_operating_date" and "plant_retirement_date"

Collaborator Author

Done

generators_dates.groupby("plant_id_eia")[
["generator_operating_date", "generator_retirement_date"]
]
.agg({"generator_operating_date": "min", "generator_retirement_date": "max"})
Collaborator

While this will work for the operating date, this will not work for the retirement date. For example, what if a plant has 10 generators and only one retires? This would currently say the entire plant is retired.

For the retirement date, one way to do this would be to check for plants where there are no NA retirement dates across all generators, and then take the max of that.

Looking at the sample outputs you posted, it currently shows that plant 3 "Barry" retired in 2015, but this plant is still operational
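A minimal sketch of that idea, assuming one row per generator after duplicates are dropped. The data below are invented for illustration (they are not the real dates for plant 3), and the code is not necessarily what was merged.

```python
import pandas as pd

# Hypothetical generator-level dates, one row per generator.
generator_dates = pd.DataFrame(
    {
        "plant_id_eia": [3, 3, 10, 10],
        "generator_id": ["1", "2", "A", "B"],
        "generator_operating_date": pd.to_datetime(
            ["1954-02-01", "1969-12-01", "1980-05-01", "1985-07-01"]
        ),
        "generator_retirement_date": pd.to_datetime(
            ["2015-08-01", None, "2010-01-01", "2012-06-01"]
        ),
    }
)

grouped = generator_dates.groupby("plant_id_eia")

# Earliest operating date and latest retirement date per plant (NaT is skipped).
plant_dates = pd.DataFrame(
    {
        "plant_operating_date": grouped["generator_operating_date"].min(),
        "plant_retirement_date": grouped["generator_retirement_date"].max(),
    }
)

# A plant counts as retired only if every generator has a retirement date.
all_retired = (
    generator_dates["generator_retirement_date"]
    .notna()
    .groupby(generator_dates["plant_id_eia"])
    .all()
)
plant_dates.loc[~all_retired, "plant_retirement_date"] = pd.NaT
plant_dates = plant_dates.reset_index()
```

With these inputs, plant 3 keeps an empty retirement date because generator 2 has none, while plant 10 gets 2012-06-01.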

Collaborator

One easy way to do this would be to load the "operational_status" column and just identify where all generators are retired as of the latest_validated_year
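As a rough sketch of the operational_status approach: the example assumes that "retired" is one of the status values, uses a made-up latest_validated_year of 2022, and runs on invented generator records.

```python
import pandas as pd

latest_validated_year = 2022  # hypothetical value for illustration

# Hypothetical generator records, one row per generator and report year.
generators = pd.DataFrame(
    {
        "report_date": pd.to_datetime(
            ["2021-01-01", "2022-01-01", "2022-01-01",
             "2022-01-01", "2022-01-01", "2023-01-01"]
        ),
        "plant_id_eia": [3, 3, 3, 10, 10, 10],
        "generator_id": ["1", "1", "2", "A", "B", "A"],
        "operational_status": [
            "existing", "retired", "existing", "retired", "retired", "retired"
        ],
    }
)

# Keep only the records reported for the latest validated year.
latest = generators[generators["report_date"].dt.year == latest_validated_year]

# A plant is retired only if all of its generators are retired in that year.
plant_is_retired = (
    latest["operational_status"].eq("retired").groupby(latest["plant_id_eia"]).all()
)
retired_plants = plant_is_retired[plant_is_retired].index.tolist()
```

Here only plant 10 is flagged, since plant 3 still has one generator with a status other than "retired".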

Collaborator Author

Indeed

Collaborator Author

Done

pd.DataFrame: original data frame with additional 'operating_date' and
'retirement_date' column.
"""
generators_dates = load_data.load_pudl_table(
Collaborator

"generator_dates"

Collaborator Author

Changed

'retirement_date' column.
"""
generators_dates = load_data.load_pudl_table(
"denorm_generators_eia",
Collaborator

this table will contain values for each year reported, so before we run our min and max operations, we need to drop duplicates. Before we drop duplicates though, we may need to do a groupby([plant_id, generator_id]).ffill() and .bfill() to make sure that we have complete values for all years

Collaborator

We may also want to filter to only include data up to the latest_validated_year

Collaborator

This is something I was working on in GRETA, but a similar pattern may work here:

min_operating = oge.load_data.load_pudl_table(
    "generators_eia860",
    year=earliest_data_year,
    end_year=latest_validated_year,
    columns=[
        "report_date",
        "plant_id_eia",
        "generator_id",
        "minimum_load_mw",
        "capacity_mw",
        "summer_capacity_mw",
        "winter_capacity_mw",
    ],
).sort_values(by=["plant_id_eia","generator_id","report_date"], ascending=True)

# fill missing capacity values
capacity_columns = ["minimum_load_mw", "capacity_mw", "summer_capacity_mw", "winter_capacity_mw"]

for col in capacity_columns:
    min_operating[col] = min_operating.groupby(["plant_id_eia","generator_id"])[col].bfill()
    min_operating[col] = min_operating.groupby(["plant_id_eia","generator_id"])[col].ffill()

# keep only the most recent year of data
min_operating = min_operating.drop_duplicates(subset=["plant_id_eia","generator_id"], keep="last")

Collaborator Author

Thanks. Implemented.

Collaborator Author

rouille commented May 23, 2024

Screen shot with new implementation
Screenshot 2024-05-23 at 9 30 11 AM

Collaborator

@grgmiller grgmiller left a comment

Looks good, thanks.
One small request would be to change the order of the columns so that related data are grouped together and the table is easier to read. My suggestion for column order would be:

    "plant_id_eia",  # identification columns
    "plant_name_eia",
    "capacity_mw",  # what type of plant is this
    "plant_primary_fuel",
    "fuel_category",
    "fuel_category_eia930",
    "state",  # where is it located
    "county",
    "city",
    "ba_code",
    "ba_code_physical",
    "latitude",
    "longitude",
    "plant_operating_date",  # operational status columns
    "plant_retirement_date",
    "distribution_flag",  # other random metadata
    "timezone",
    "data_availability",
    "shaped_plant_id",
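For reference, a reordering like this can be applied by indexing the data frame with the ordered list. The sketch below uses a shortened, hypothetical subset of the columns and a made-up one-row data frame; extra_column is invented to show that columns not in the list are kept at the end.

```python
import pandas as pd

# Shortened, hypothetical subset of the suggested column order.
column_order = [
    "plant_id_eia",
    "plant_name_eia",
    "capacity_mw",
    "plant_operating_date",
    "plant_retirement_date",
    "timezone",
]

# Made-up single-row data frame with the columns in arbitrary order.
plant_attributes = pd.DataFrame(
    {
        "timezone": ["US/Central"],
        "capacity_mw": [100.0],
        "plant_id_eia": [3],
        "extra_column": [1],
        "plant_name_eia": ["Barry"],
        "plant_retirement_date": [pd.NaT],
        "plant_operating_date": [pd.Timestamp("1954-02-01")],
    }
)

# Put the known columns first, then append anything not listed.
ordered = [col for col in column_order if col in plant_attributes.columns]
remaining = [col for col in plant_attributes.columns if col not in ordered]
plant_attributes = plant_attributes[ordered + remaining]
```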
      
        
        

Collaborator Author

rouille commented May 23, 2024

The nameplate capacity calculation had a bug, which is fixed in the latest commit.

Collaborator Author

rouille commented May 23, 2024

Looks good, thanks. One small request would be to change the order of the columns so that related data are grouped together and the table is easier to read. My suggestion for column order would be:

    "plant_id_eia",  # identification columns
    "plant_name_eia",
    "capacity_mw",  # what type of plant is this
    "plant_primary_fuel",
    "fuel_category",
    "fuel_category_eia930",
    "state",  # where is it located
    "county",
    "city",
    "ba_code",
    "ba_code_physical",
    "latitude",
    "longitude",
    "plant_operating_date",  # operational status columns
    "plant_retirement_date",
    "distribution_flag",  # other random metadata
    "timezone",
    "data_availability",
    "shaped_plant_id",
      
        
        

Done

Collaborator

@grgmiller grgmiller left a comment

See comment about the nameplate capacity fix.

)["capacity_mw"].ffill()

# keep only the most recent year of data
generator_capacity = generator_capacity.drop_duplicates(
Collaborator

In the case of nameplate capacity, I think that we only want to keep the specific data year, not the latest validated year. Nameplate capacity can change over time if the generator is repowered, so this value might vary annually.

Collaborator

It's still good that we load all years and do the fill in case there is missing capacity data in a specific year.
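Put together, the suggestion could look roughly like the sketch below: fill gaps across all loaded years, then keep only the data year before summing to the plant level. The records and the year value are hypothetical, and the fill order (bfill, then ffill) is an assumption borrowed from the snippet earlier in this review.

```python
import pandas as pd

year = 2013  # hypothetical data year being processed

# Hypothetical generator capacities over several report years.
generator_capacity = pd.DataFrame(
    {
        "report_date": pd.to_datetime(
            ["2012-01-01", "2013-01-01", "2014-01-01",
             "2012-01-01", "2013-01-01", "2014-01-01"]
        ),
        "plant_id_eia": [3, 3, 3, 3, 3, 3],
        "generator_id": ["1", "1", "1", "2", "2", "2"],
        "capacity_mw": [100.0, None, 120.0, 50.0, 50.0, 60.0],
    }
).sort_values(by=["plant_id_eia", "generator_id", "report_date"])

keys = ["plant_id_eia", "generator_id"]

# Fill missing capacity values within each generator using the other years.
generator_capacity["capacity_mw"] = generator_capacity.groupby(keys)[
    "capacity_mw"
].bfill()
generator_capacity["capacity_mw"] = generator_capacity.groupby(keys)[
    "capacity_mw"
].ffill()

# Keep only the specific data year, then sum generators to the plant level.
data_year = generator_capacity[generator_capacity["report_date"].dt.year == year]
plant_capacity = data_year.groupby("plant_id_eia", as_index=False)[
    "capacity_mw"
].sum()
```

In this example the missing 2013 value for generator 1 is back-filled from 2014 (120 MW), so the 2013 plant total is 170 MW rather than the 180 MW that the latest year would give.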

Collaborator Author

Sounds good. Implemented.

Collaborator

@grgmiller grgmiller left a comment

Capacity changes look good

@grgmiller grgmiller merged commit 6f4c9e3 into historical_coverage_feature May 24, 2024
1 check passed
@grgmiller grgmiller deleted the ben/dates branch May 24, 2024 23:59