Implement data anomaly screening described in Ruggles et al. #349
Thanks Ben - I think adding all of these checks to the codebase will be helpful for some of our future work that requires anomaly screening, but I think many of these checks are mostly relevant to electricity demand data (as in the Ruggles paper) and probably not to the CEMS/generation data.
The short request is that:
- I think we should only flag data that fails the GLOBAL_EXTREME test for now
- We want to run this test for the generation column, the fuel column, and the emissions columns (co2, nox, so2)
The longer explanation why not to use the other tests for flagging data:
- IDENTICAL_RUN: unlike demand data, which we'd expect to be constantly varying, it is often the case that generation may remain constant for several hours in a row, especially in the case of baseload generation.
- MISSING: We expect some of the generation data to be missing, because some plants do not report all months to CEMS. We also have a separate process to check for missing values elsewhere in the code.
- OKAY: I don't think we need to flag data that is ok
- ZERO: generation data can be zero, and we already check for negative values elsewhere.
- GLOBAL_EXTREME +/- 1H: While this will be useful for identifying values to try imputing if/when we get there, I don't think that we need to use this to flag data, since this check is dependent on GLOBAL_EXTREME and doesn't necessarily tell us anything new, i.e. if GLOBAL_EXTREME is 0, this check will also be zero.
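To make the dependency in the last bullet concrete, here is a minimal sketch of a global-extreme check and its neighbouring-hour check. It is not the code from this PR or from the authors' notebook: the function names, the use of `None` for missing hours, and the default `multiplier=10.0` are all assumptions for illustration, and the actual threshold should be taken from the paper.

```python
from statistics import median


def flag_global_extreme(values, multiplier=10.0):
    """Return one boolean per hour, True where a value exceeds multiplier * median.

    `None` stands for a missing hour and is never flagged here.
    `multiplier` is an assumed, illustrative threshold.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        return [False] * len(values)
    threshold = multiplier * median(observed)
    return [v is not None and v > threshold for v in values]


def flag_neighbors(extreme, hours=1):
    """Flag hours within `hours` of a global extreme (but not the extreme itself)."""
    n = len(extreme)
    flags = [False] * n
    for i, is_extreme in enumerate(extreme):
        if not is_extreme:
            continue
        for j in range(max(0, i - hours), min(n, i + hours + 1)):
            if not extreme[j]:
                flags[j] = True
    return flags


# Hypothetical hourly generation for one unit, with one spike and one gap.
generation = [100.0, 102.0, 98.0, 5000.0, 101.0, None, 99.0]
extreme = flag_global_extreme(generation)
neighbors = flag_neighbors(extreme)
```

Because `flag_neighbors` only looks at the output of `flag_global_extreme`, a series with no global extremes can never produce a neighbour flag, which is why counting it separately adds no new information for flagging purposes.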
Maybe for the printout from this test, we print a dataframe that flags the number of global extreme values for each subplant (rows) for each data column (eg generation, fuel, emissions). (We'd also want to have the BA code of course).
Maybe instead of / in addition to the count of extreme values, we could also print the average magnitude (multiplier above the median) of these extreme values. For example, we might care more about a small number of extremely high outliers than a large number of smaller outliers.
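A rough sketch of the printout suggested above: one row per BA code and subplant, and for each data column the number of global-extreme values plus their average multiple of the median. Everything here is hypothetical (the function name, the record layout, the column names, and the `multiplier` default), and the real implementation would presumably use a pandas dataframe rather than plain dictionaries.

```python
from collections import defaultdict
from statistics import mean, median


def summarize_extremes(records, columns, multiplier=10.0):
    """Count global extremes per (ba_code, subplant_id) and data column.

    Returns {(ba_code, subplant_id): {column: (count, mean_multiple_of_median)}}.
    `mean_multiple_of_median` is None when no value is flagged.
    """
    # Collect the non-missing values of each column for each subplant.
    series = defaultdict(lambda: defaultdict(list))
    for row in records:
        key = (row["ba_code"], row["subplant_id"])
        for col in columns:
            if row.get(col) is not None:
                series[key][col].append(row[col])

    summary = {}
    for key, by_column in series.items():
        summary[key] = {}
        for col in columns:
            values = by_column.get(col, [])
            med = median(values) if values else 0.0
            if med <= 0:
                # A multiple of a zero median is undefined, so report nothing.
                summary[key][col] = (0, None)
                continue
            multiples = [v / med for v in values if v > multiplier * med]
            summary[key][col] = (
                len(multiples),
                mean(multiples) if multiples else None,
            )
    return summary


# Hypothetical hourly records for a single subplant.
records = [
    {"ba_code": "CISO", "subplant_id": 1, "net_generation_mwh": g, "co2_mass_lb": 5.0}
    for g in [10.0, 10.0, 10.0, 10.0, 500.0]
]
summary = summarize_extremes(records, ["net_generation_mwh", "co2_mass_lb"])
```

Reporting the mean multiple next to the count lets a reviewer distinguish a subplant with one value at 50 times the median from one with many values just over the threshold.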
I have implemented the warning in the
Looks pretty good, just a few small requests
…EMS generation, fuel consumption and CO2 emission timeseries
Thanks, looks good!
Purpose
Screen timeseries for anomalous values following the algorithm steps described in Tyler H. Ruggles et al., "Developing reliable hourly electricity demand data through screening and imputation" (2020). Closes CAR-1882.
Note that the screening algorithms have been developed for demand time series and some of these algorithms might not be tailored for generation/emission time series.
The screening is conducted in 2 steps. Step 1 removes the most egregious anomalies, where few or no calculations are needed. By Step 2, the most extreme values have already been removed, which makes calculations of local characteristics of the data more reasonable. Through this screening process, hourly values can be re-categorized from okay to other classifications based on the algorithms.
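The two-step flow can be sketched as follows. This is a simplified illustration under stated assumptions, not the algorithms from the paper or the classes in this PR: the `LOCAL_EXTREME` label, the `window` and `tolerance` parameters, and the plain local-median comparison are stand-ins chosen to show how Step 2 benefits from Step 1 having already excluded missing and globally extreme hours.

```python
from statistics import median

OKAY = "OKAY"
MISSING = "MISSING"
GLOBAL_EXTREME = "GLOBAL_EXTREME"
LOCAL_EXTREME = "LOCAL_EXTREME"  # hypothetical label for this sketch


def first_step(values, multiplier=10.0):
    """Step 1: cheap checks that need at most a global median."""
    observed = [v for v in values if v is not None]
    med = median(observed) if observed else None
    labels = []
    for v in values:
        if v is None:
            labels.append(MISSING)
        elif med is not None and med > 0 and v > multiplier * med:
            labels.append(GLOBAL_EXTREME)
        else:
            labels.append(OKAY)
    return labels


def second_step(values, labels, window=5, tolerance=0.5):
    """Step 2: re-categorize OKAY hours that stray from their local median.

    The local median uses only hours that Step 1 left as OKAY, so a spike
    already flagged in Step 1 cannot distort its neighbours' statistics.
    """
    half = window // 2
    updated = list(labels)
    for i, (v, label) in enumerate(zip(values, labels)):
        if label != OKAY:
            continue
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        nearby = [values[j] for j in range(lo, hi) if labels[j] == OKAY]
        local = median(nearby)  # never empty: hour i itself is OKAY
        if local > 0 and abs(v - local) > tolerance * local:
            updated[i] = LOCAL_EXTREME
    return updated


values = [100.0, 101.0, 99.0, 5000.0, 100.0, 180.0, 102.0, 98.0, None, 100.0]
step1 = first_step(values)
step2 = second_step(values, step1)
```

In this toy series the 5000 spike is caught in Step 1, while the milder 180 value passes the global check and is only re-categorized in Step 2 once the local median is computed without the spike.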
What the code is doing
Implement the screening algorithms using a notebook provided by the authors here. Algorithms from the first step are enclosed in the `AnomalyScreeningFirstStep` class. A second class, `AnomalyScreeningSecondStep`, inherits from `AnomalyScreeningFirstStep` and performs 2 of the 4 algorithms of the second step on top of the first one.
Testing
Manually. See example below.
Where to look
Everything is in the `oge.data_cleaning` module.
Usage Example/Visuals
Looking at a specific unit:
Review estimate
30min
Future work
Implement the single-sided delta and anomalous region filters (see filters 3 and 4 of the second step)
Checklist
black