Add operating and retirement dates to plant static attributes #367
See requested changes.
Before we merge this, it may be helpful to do a couple quick validations:
- Are there any missing operating dates? If so, we will just need to understand how to deal with those in MISO
- Load up some EIA-923 data for this year and just quickly check if there are any plants that we marked as retired prior to 2022 that reported 923 data in 2022... sometimes plants do continue to report, so this may be okay, but we should at least manually double check that our algorithm didn't mistakenly mark a plant as retired if at least one generator is still going
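Both checks could be sketched roughly as below. This is only an illustration on toy data: the function names and the two data frames are hypothetical, and the real EIA-923 data would come from whatever loader the project already uses. The column names `plant_operating_date` and `plant_retirement_date` follow the naming requested later in this review.

```python
import pandas as pd


def find_missing_operating_dates(plant_attributes: pd.DataFrame) -> pd.DataFrame:
    """Return the plants that have no operating date (hypothetical helper)."""
    return plant_attributes[plant_attributes["plant_operating_date"].isna()]


def find_retired_plants_still_reporting(
    plant_attributes: pd.DataFrame, eia923: pd.DataFrame, year: int
) -> pd.DataFrame:
    """Return plants marked as retired before `year` that still report in `year`.

    Hypothetical helper: these plants should be reviewed manually, since some
    plants legitimately keep reporting after retirement.
    """
    cutoff = pd.Timestamp(year=year, month=1, day=1)
    # Comparisons against NaT are False, so plants without a retirement date drop out.
    retired = plant_attributes[plant_attributes["plant_retirement_date"] < cutoff]
    reporting_ids = eia923.loc[
        eia923["report_date"].dt.year == year, "plant_id_eia"
    ].unique()
    return retired[retired["plant_id_eia"].isin(reporting_ids)]


# Toy data, not real EIA records.
plant_attributes = pd.DataFrame(
    {
        "plant_id_eia": [1, 2, 3],
        "plant_operating_date": pd.to_datetime(["1990-01-01", None, "1975-06-01"]),
        "plant_retirement_date": pd.to_datetime(["2015-12-31", None, "2020-05-01"]),
    }
)
eia923 = pd.DataFrame(
    {
        "plant_id_eia": [1, 2],
        "report_date": pd.to_datetime(["2022-01-01", "2022-01-01"]),
    }
)

missing = find_missing_operating_dates(plant_attributes)
suspect = find_retired_plants_still_reporting(plant_attributes, eia923, 2022)
```

In this toy case, `missing` contains plant 2 and `suspect` contains plant 1, which is marked as retired in 2015 but still reports in 2022.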
src/oge/column_checks.py
Outdated
    "generator_operating_date",
    "generator_retirement_date",
    "current_planned_generator_operating_date",
    "operating_date",
To follow the PUDL naming conventions, and for clarity, let's call these "plant_operating_date" and "plant_retirement_date".
Done
src/oge/helpers.py
Outdated
    generators_dates.groupby("plant_id_eia")[
        ["generator_operating_date", "generator_retirement_date"]
    ]
    .agg({"generator_operating_date": "min", "generator_retirement_date": "max"})
While this will work for the operating date, this will not work for the retirement date. For example, what if a plant has 10 generators and only one retires? This would currently say the entire plant is retired.
For the retirement date, one way to do this would be to check for plants where there are no NA retirement dates across all generators, and then take the max of that.
Looking at the sample outputs you posted, it currently shows that plant 3 "Barry" retired in 2015, but this plant is still operational
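A minimal sketch of that suggestion, assuming a de-duplicated table with one row per generator and the column names used in the diff above. The function name `plant_dates` and the toy plant ids are hypothetical; this is not the code that was merged.

```python
import pandas as pd


def plant_dates(generator_dates: pd.DataFrame) -> pd.DataFrame:
    """Aggregate generator dates to the plant level (hypothetical helper).

    The plant retirement date is only set when every generator at the plant
    has a retirement date; otherwise it is left missing.
    """
    grouped = generator_dates.groupby("plant_id_eia")
    operating = grouped["generator_operating_date"].min()
    latest_retirement = grouped["generator_retirement_date"].max()
    # True only for plants where no generator has a missing retirement date.
    all_retired = (
        generator_dates["generator_retirement_date"]
        .notna()
        .groupby(generator_dates["plant_id_eia"])
        .all()
    )
    retirement = latest_retirement.where(all_retired)
    return pd.DataFrame(
        {"plant_operating_date": operating, "plant_retirement_date": retirement}
    ).reset_index()


# Toy data: plant 100 still has one active generator, plant 200 is fully retired.
generator_dates = pd.DataFrame(
    {
        "plant_id_eia": [100, 100, 200, 200],
        "generator_id": ["1", "2", "1", "2"],
        "generator_operating_date": pd.to_datetime(
            ["1980-01-01", "1995-03-01", "1970-01-01", "1972-01-01"]
        ),
        "generator_retirement_date": pd.to_datetime(
            ["2015-01-01", None, "2010-12-31", "2012-06-30"]
        ),
    }
)

result = plant_dates(generator_dates).set_index("plant_id_eia")
```

With a plain `max`, plant 100 would be reported as retired in 2015 even though one of its generators is still operating; with the guard, its retirement date stays missing.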
One easy way to do this would be to load the "operational_status" column and just identify where all generators are retired as of the latest_validated_year.
Indeed
Done
src/oge/helpers.py
Outdated
        pd.DataFrame: original data frame with additional 'operating_date' and
            'retirement_date' column.
    """
    generators_dates = load_data.load_pudl_table(
"generator_dates"
Changed
            'retirement_date' column.
    """
    generators_dates = load_data.load_pudl_table(
        "denorm_generators_eia",
This table will contain values for each year reported, so before we run our min and max operations, we need to drop duplicates. Before we drop duplicates, though, we may need to do a groupby([plant_id, generator_id]).ffill() and .bfill() to make sure that we have complete values for all years.
We may also want to filter to only include data up to the latest_validated_year
This is something I was working on in GRETA, but a similar pattern may work here:
min_operating = oge.load_data.load_pudl_table(
    "generators_eia860",
    year=earliest_data_year,
    end_year=latest_validated_year,
    columns=[
        "report_date",
        "plant_id_eia",
        "generator_id",
        "minimum_load_mw",
        "capacity_mw",
        "summer_capacity_mw",
        "winter_capacity_mw",
    ],
).sort_values(by=["plant_id_eia", "generator_id", "report_date"], ascending=True)

# fill missing capacity values
capacity_columns = [
    "minimum_load_mw",
    "capacity_mw",
    "summer_capacity_mw",
    "winter_capacity_mw",
]
for col in capacity_columns:
    min_operating[col] = min_operating.groupby(["plant_id_eia", "generator_id"])[col].bfill()
    min_operating[col] = min_operating.groupby(["plant_id_eia", "generator_id"])[col].ffill()

# keep only the most recent year of data
min_operating = min_operating.drop_duplicates(
    subset=["plant_id_eia", "generator_id"], keep="last"
)
Thanks. Implemented.
Looks good, thanks.
One small request would be to change the order of the columns so that related data is grouped together and easier to read. My suggestion for column order would be:
    "plant_id_eia",  # identification columns
    "plant_name_eia",
    "capacity_mw",  # what type of plant is this
    "plant_primary_fuel",
    "fuel_category",
    "fuel_category_eia930",
    "state",  # where is it located
    "county",
    "city",
    "ba_code",
    "ba_code_physical",
    "latitude",
    "longitude",
    "plant_operating_date",  # operational status columns
    "plant_retirement_date",
    "distribution_flag",  # other random metadata
    "timezone",
    "data_availability",
    "shaped_plant_id",
The calculation of the nameplate capacity was bugged and is fixed in the latest commit.
Done
See comment about the nameplate capacity fix.
src/oge/helpers.py
Outdated
    )["capacity_mw"].ffill()

    # keep only the most recent year of data
    generator_capacity = generator_capacity.drop_duplicates(
In the case of nameplate capacity, I think that we only want to keep the specific data year, not the latest validated year. Nameplate capacity can change over time if the generator is repowered, so this value might be annually varying.
It's still good that we load all years and do the fill in case there is missing capacity data in a specific year.
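The combination of these two points (fill across all loaded years, then keep only the data year) might look roughly like the sketch below. It assumes a generator-level table with `report_date` and `capacity_mw` columns, as in the snippet quoted earlier in this thread; the function name `capacity_for_year` and the toy values are hypothetical, and the fill order (back-fill, then forward-fill) simply mirrors that snippet.

```python
import pandas as pd


def capacity_for_year(generator_capacity: pd.DataFrame, year: int) -> pd.DataFrame:
    """Fill capacity gaps using all years, then keep only `year` (hypothetical helper)."""
    keys = ["plant_id_eia", "generator_id"]
    df = generator_capacity.sort_values(by=keys + ["report_date"]).copy()
    # Fill missing values within each generator, using neighboring report years.
    df["capacity_mw"] = df.groupby(keys)["capacity_mw"].bfill()
    df["capacity_mw"] = df.groupby(keys)["capacity_mw"].ffill()
    # Keep the specific data year rather than the most recent record.
    return df[df["report_date"].dt.year == year]


# Toy data: generator A has no capacity reported for 2020 and is larger in 2021.
generator_capacity = pd.DataFrame(
    {
        "plant_id_eia": [1, 1, 1, 1, 1, 1],
        "generator_id": ["A", "A", "A", "B", "B", "B"],
        "report_date": pd.to_datetime(
            ["2019-01-01", "2020-01-01", "2021-01-01"] * 2
        ),
        "capacity_mw": [100.0, None, 150.0, 50.0, 50.0, None],
    }
)

capacity_2020 = capacity_for_year(generator_capacity, 2020)
plant_capacity_2020 = capacity_2020.groupby("plant_id_eia")["capacity_mw"].sum()
```

Here the 2020 gap for generator A is back-filled from 2021, while the 2019 rows are still filtered out, so the plant-level value reflects the requested data year only.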
Sounds good. Implemented.
Capacity changes look good
Purpose
Add operating and retirement dates to plant static attributes.
A bug in the calculation of the plant-level nameplate capacity is also fixed.
This PR also fixes an issue for years < 2013 where missing BA codes were being assigned, resulting in inaccurate BA-level results.
What the code is doing
Create a new function that adds the operating and retirement dates of a plant to the plant static attributes data frame. The operating date of a plant is taken as the earliest of all generators' operating dates over all report dates. Likewise, the retirement date of a plant is taken as the latest of all generators' retirement dates over all report dates.
Testing
Successfully ran the 2013 pipeline.
Where to look
- add_plant_operating_and_retirement_dates in the oge.helpers module.
- oge.column_checks module where the new fields were added. Note that I added some missing datetime columns to the list of columns defined in the apply_dtypes function. These columns won't be converted.
Usage Example/Visuals
Review estimate
10min
Future work
N/A
Checklist
black