Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement data anomaly screening described in Ruggles et al. #349

Merged
merged 6 commits into from
Mar 29, 2024

Conversation

rouille
Copy link
Collaborator

@rouille rouille commented Mar 7, 2024

Purpose

Screen timeseries for anomalous value following the algorithm steps described in Tyler H. Ruggles et al. Developing reliable hourly electricity demand data through screening and imputation (2020). Closes CAR-1882.

Note that the screening algorithms have been developed for demand time series and some of these algorithms might not be tailored for generation/emission time series.

The screening is conducted in 2 steps. Step 1 removes the most egregious anomalies where few or no calculations are needed. Afterward, in Step 2, the most extreme values have been removed making calculations of local characteristics of the data more reasonable. Through this screening process hourly values can be re-categorized from okay to other classifications based on the algorithms.

What the code is doing

Implement the screening algorithms using a notebook provided by the authors here. Algorithms from the first step are enclosed in the AnomalyScreeningFirstStep class. A second class, AnomalyScreeningSecondStep, inherits from AnomalyScreeningFirstStep and perform 2/4 algorithms of the second step on top of the first one.

Testing

Manually. See example below.

Where to look

Everything is in the oge.data_cleaning module.

Usage Example/Visuals

>>> import pandas as pd
>>> 
>>> from oge.load_data import load_cems_data
>>> from oge.data_cleaning import AnomalyScreeningFirstStep, AnomalyScreeningSecondStep
>>> 
>>> 
>>> cems = load_cems_data(2022)
>>> 
>>> unit_screening = []
>>> for i in cems.groupby(["plant_id_eia", "emissions_unit_id_epa"]).groups.keys():
...     AS = AnomalyScreeningFirstStep(cems.query("plant_id_eia == @i[0] and emissions_unit_id_epa == @i[1]")[["plant_id_eia", "emissions_unit_id_epa", "co2_mass_lb"]], "co2_mass_lb")
...     AS.run_all_first_step()
...     unit_screening.append(pd.DataFrame(AS.get_filtered_df().groupby("category").size().to_dict(), index=[i]))
... 
>>> summary = pd.concat(unit_screening).fillna(0).astype("int")
>>> 
>>> summary.head(20)
            IDENTICAL_RUN  MISSING  OKAY  ZERO  GLOBAL_EXTREME  GLOBAL_EXTREME_+/-_1H
(3, 1)                360     7226  1174     0               0                      0
(3, 2)                294     7126  1339     1               0                      0
(3, 4)                 23     7327  1408     2               0                      0
(3, 5)                  5     2128  6618     9               0                      0
(3, 6A)               205      547  7997    11               0                      0
(3, 6B)                44      489  8227     0               0                      0
(3, 7A)               126      677  7952     5               0                      0
(3, 7B)                79      800  7881     0               0                      0
(7, 1)                  1     8659   100     0               0                      0
(7, 2)                  0     8760     0     0               0                      0
(9, CTG-1)              0     8760     0     0               0                      0
(10, 1)               919     2346  5493     2               0                      0
(10, 2)              1175     1598  5987     0               0                      0
(10, CT10)            183     8253   324     0               0                      0
(10, CT2)              14     8712    34     0               0                      0
(10, CT3)              56     8632    72     0               0                      0
(10, CT4)             103     8527   130     0               0                      0
(10, CT5)              36     8652    72     0               0                      0
(10, CT6)             100     8467   193     0               0                      0
(10, CT7)             140     8422   198     0               0                      0
>>> summary.query("GLOBAL_EXTREME != 0")
                 IDENTICAL_RUN  MISSING  OKAY  ZERO  GLOBAL_EXTREME  GLOBAL_EXTREME_+/-_1H
(563, 14B)                   0     8741     8     4               4                      3
(673, S-3)                   7     8725    11    13               4                      0
(874, 5)                   338     7869   508     0              43                      2
(1507, 3)                  303     8302   143     0              12                      0
(1507, 4)                  711     7774   257     0              17                      1
(1554, 1)                  321     7459   947     0              27                      6
(1571, SMECO)                1     8736    14     4               4                      1
(1702, A)                   86     7572   855    46             154                     47
(1702, B)                  100     7240   934    64             333                     89
(2499, CT01-1)              81     8553   121     2               2                      1
(2499, CT01-2)              72     8576   106     4               1                      1
(2499, CT01-3)              64     8593   100     1               1                      1
(2499, CT01-4)              64     8592    99     3               1                      1
(2499, CT01-8)              57     8604    93     1               3                      2
(2499, CT02-1)              92     8518   147     1               1                      1
(2499, CT02-2)              40     8620    97     1               1                      1
(2499, CT02-3)              66     8574   116     2               1                      1
(2499, CT02-4)             107     8495   154     2               1                      1
(2499, CT02-5)             129     8453   173     1               3                      1
(2499, CT02-6)              53     8614    87     4               1                      1
(2499, CT02-7)              78     8556   121     3               1                      1
(2499, CT02-8)              52     8613    86     4               3                      2
(2632, 1)                    0     8711    34     0               5                     10
(3403, GCT2)                 4     8740    14     0               1                      1
(3576, BW2)                 78     6255  2201     0             139                     87
(3576, BW3)                144     6768  1727     0              69                     52
(3809, 3)                  108     7971   359   230              84                      8
(3992, 9)                    6     8486   257     0               5                      6
(4195, 2)                  104     7601  1052     0               1                      2
(4266, 4)                   48     7512  1123     0              43                     34
(4266, 5)                   54     7110  1372     0             140                     84
(6042, PMT1)                 3     8626   126     0               3                      2
(10350, CTGB)               18     8647    70    20               3                      2
(55381, CT-005)             24     8418   180     0             107                     31
(55381, CT-006)             34     8445   161     0              85                     35
(55381, CT-007)             30     8462   147     0              84                     37
(55419, 600)                46      797  7878     0              22                     17
(55419, 700)                86      463  8166     0              29                     16
(55419, 800)               108       41  8569     0              28                     14
>>> AS = AnomalyScreeningSecondStep(cems.query("plant_id_eia == 55419 and emissions_unit_id_epa == '600'")[["plant_id_eia", "emissions_unit_id_epa", "co2_mass_lb"]], "co2_mass_lb")
>>> AS.summary()
2024-03-07 01:16:12 [INFO] oge.oge.data_cleaning:2265                        count  fraction
category                              
GLOBAL_EXTREME            22  0.002511
GLOBAL_EXTREME_+/-_1H     17  0.001941
IDENTICAL_RUN             46  0.005251
MISSING                  797  0.090982
OKAY                    7878  0.899315
>>> AS.flag_local_deviation_from_expected_value()
>>> AS.flag_deltas()
>>> 
>>> AS.summary()
2024-03-07 01:16:38 [INFO] oge.oge.data_cleaning:2265                        count  fraction
category                              
DOUBLE_DELTA             232  0.026484
GLOBAL_EXTREME            22  0.002511
GLOBAL_EXTREME_+/-_1H     17  0.001941
IDENTICAL_RUN             46  0.005251
LOCAL_DOWN               535  0.061073
LOCAL_UP                 147  0.016781
MISSING                  797  0.090982
OKAY                    6964  0.794977

Looking at a specific unit:

>>> check = cems.query("plant_id_eia == 55419 and emissions_unit_id_epa == '600'").set_index("datetime_utc")["co2_mass_lb"]
>>> check.min()
4736.000061035156
>>> check.max()
33493601.5625
>>> ax = check.plot(ylim=(1000, 500000))
>>> ax.set_xlabel("")
Text(0.5, 14.644444444444442, '')
>>> ax.set_ylabel("CO2 Emission (lb)")
Text(22.944444444444432, 0.5, 'CO2 Emission (lb)')
>>> plt.show()

example

Review estimate

30min

Future work

Implement the single sided delta and anomalous region filters (see filter 3 and 4 of second step)

Checklist

  • Update the documentation to reflect changes made in this PR
  • Format all updated python files using black
  • Clear outputs from all notebooks modified
  • Add docstrings and type hints to any new functions created

@rouille rouille added new feature New feature or request data cleaning Cleaning and standardizing data hourly profiles Accuracy of hourly profile imputation labels Mar 7, 2024
@rouille rouille self-assigned this Mar 7, 2024
Copy link
Collaborator

@grgmiller grgmiller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Ben - I think adding all of these checks to the codebase will be helpful for some of our future work that requires anomaly screening, but I think many of these checks are likely mostly relevant to electricity demand data (like in the ruggles paper, but probably not the CEMS/generation data).

The short request is that:

  1. I think we should only flag data that fails the GLOBAL_EXTREME test for now
  2. We want to run this test for the generation column, the fuel column, and the emissions columns (co2, nox, so2)

The longer explanation why not to use the other tests for flagging data:

  • IDENTICAL_RUN: unlike demand data, which we'd expect to be constantly varying, it is often the case that generation may remain constant for several hours in a row, especially in the case of baseload generation.
  • MISSING: We expect some of the generation data to be missing, because some plants do not report all months to CEMS. We also have a separate process to check for missing values elsewhere in the code.
  • OKAY: I don't think we need to flag data that is ok
  • ZERO: generation data can be zero, and we already check for negative values elsewhere.
  • GLOBAL_EXTREME +/1 1H: While this will be useful for identifying values to try imputing if/when we get there, I don't think that we need to use this to flag data, since this check is dependent on GLOBAL_EXTREME and doesn't neccessarily tell us anything new - ie if GLOBAL_EXTREME is 0, this check will also be zero.

Maybe for the printout from this test, we print a dataframe that flags the number of global extreme values for each subplant (rows) for each data column (eg generation, fuel, emissions). (We'd also want to have the BA code of course).

Maybe instead of / in addition to the count of extreme values, we could also print the average magnitude (multiplier above the median) of these extreme values. For example, we might care more about a small number of extremely high outliers than a large number of smaller outliers.

src/oge/data_cleaning.py Outdated Show resolved Hide resolved
src/oge/data_cleaning.py Outdated Show resolved Hide resolved
src/oge/data_cleaning.py Outdated Show resolved Hide resolved
@rouille rouille force-pushed the ben/ruggles branch 7 times, most recently from 9f8ed7c to 0402ca3 Compare March 27, 2024 18:11
@rouille rouille force-pushed the ben/ruggles branch 4 times, most recently from 5424786 to 8347523 Compare March 28, 2024 23:42
@rouille
Copy link
Collaborator Author

rouille commented Mar 28, 2024

I have implemented the warning in the clean_cems function of the data_cleaning module. This is what we get for 2022:

2024-03-28 16:38:12 [WARNING] oge.oge.data_cleaning:756 Global extreme detected in CENS time series
2024-03-28 16:38:12 [WARNING] oge.oge.data_cleaning:757 
                                   gross_generation_mwh                fuel_consumed_mmbtu                   co2_mass_lb                ba_code
                                         GLOBAL_EXTREME MEAN_DEVIATION      GLOBAL_EXTREME MEAN_DEVIATION GLOBAL_EXTREME MEAN_DEVIATION        
plant_id_eia emissions_unit_id_epa                                                                                                             
315          3                                    350.0      11.993077                 NaN            NaN            NaN            NaN    CISO
             4                                    297.0      12.050640                 NaN            NaN            NaN            NaN    CISO
356          5                                     55.0      11.488485                 NaN            NaN            NaN            NaN    CISO
             6                                    122.0      11.484192                 NaN            NaN            NaN            NaN    CISO
377          5                                      NaN            NaN               113.0      10.634884          157.0      10.653220    LDWP
563          14B                                    NaN            NaN                 4.0      15.812850            4.0      16.375000    ISNE
673          S-3                                    NaN            NaN                 6.0      62.849096            6.0      66.123334    FMPP
874          5                                      NaN            NaN                44.0      37.872499           44.0      37.716477     PJM
1554         1                                      NaN            NaN                14.0      12.123729           28.0      13.579799     PJM
             4                                      NaN            NaN                12.0      14.239448           12.0      19.221903     PJM
1571         SMECO                                  NaN            NaN                 5.0      19.709896            5.0      14.359615     PJM
1702         A                                      NaN            NaN               156.0      30.646180          154.0      30.158810    MISO
             B                                      NaN            NaN               343.0      31.068923          342.0      31.068300    MISO
2049         CTA                                    NaN            NaN                 3.0      10.609770            NaN            NaN    SOCO
             CTB                                    NaN            NaN                 3.0      10.620860            NaN            NaN    SOCO
2081         11                                    17.0      13.450980                 8.0      13.671393            NaN            NaN    SWPP
             12                                     7.0      12.000000                 1.0      14.995356            NaN            NaN    SWPP
             13                                     6.0      14.722222                 NaN            NaN            NaN            NaN    SWPP
             14                                    18.0      12.185185                 6.0      12.332316            NaN            NaN    SWPP
             15                                    15.0      13.177778                 4.0      11.454545            NaN            NaN    SWPP
             16                                    11.0      13.787879                 5.0      12.125126            NaN            NaN    SWPP
             17                                    25.0      14.773333                10.0      11.752003            NaN            NaN    SWPP
             18                                    25.0      14.333333                10.0      12.957722            NaN            NaN    SWPP
2499         CT01-1                                 NaN            NaN                 2.0      17.588460            2.0      17.595678    NYIS
             CT01-2                                 NaN            NaN                 1.0      10.157670            1.0      10.151316    NYIS
             CT01-3                                 NaN            NaN                 1.0      10.157670            1.0      10.151316    NYIS
             CT01-8                                 NaN            NaN                 3.0      16.067152            3.0      15.833333    NYIS
             CT02-1                                 NaN            NaN                 1.0      10.579541            1.0      10.586419    NYIS
             CT02-2                                 NaN            NaN                 1.0      12.369623            1.0      12.340909    NYIS
             CT02-3                                 NaN            NaN                 1.0      10.721941            1.0      10.717105    NYIS
             CT02-4                                 NaN            NaN                 1.0      10.457506            1.0      10.465116    NYIS
             CT02-5                                 NaN            NaN                 3.0      15.354239            3.0      15.368216    NYIS
             CT02-6                                 NaN            NaN                 1.0      11.991748            1.0      11.942623    NYIS
             CT02-7                                 NaN            NaN                 1.0      10.579541            1.0      10.586419    NYIS
             CT02-8                                 NaN            NaN                 3.0      16.818250            3.0      16.777778    NYIS
2632         1                                      NaN            NaN                 5.0      10.900407            5.0      10.881579    NYIS
3161         3                                      NaN            NaN                 NaN            NaN            2.0      10.308270     PJM
3403         GCT2                                   2.0      10.500000                 1.0      10.998761            1.0      10.916667     TVA
3576         BW2                                    NaN            NaN               139.0      14.145552          139.0      14.047807    ERCO
             BW3                                    NaN            NaN                88.0      16.065703           90.0      16.132975    ERCO
3809         3                                      NaN            NaN                94.0      22.985355           87.0      14.391267     PJM
3992         9                                      NaN            NaN                 5.0      10.434699            5.0      10.437838    MISO
4266         4                                    507.0      24.520710                41.0      10.580426           43.0      10.614618    ERCO
             5                                    582.0      25.776632               140.0      12.498052          140.0      12.451361    ERCO
6042         PMT1                                   NaN            NaN                 3.0      11.472898            3.0      11.566540     FPL
6074         1                                      8.0      10.100000                 2.0      10.297569            NaN            NaN    SWPP
             2                                      1.0      10.500000                 1.0      12.668601            NaN            NaN    SWPP
             3                                      1.0      10.000000                 1.0      10.001303            NaN            NaN    SWPP
             4                                      1.0      10.833333                 1.0      10.099935            NaN            NaN    SWPP
6085         16A                                    2.0      17.166667                 NaN            NaN            NaN            NaN    MISO
6651         CT01                                   NaN            NaN                 6.0      26.919475            NaN            NaN    MISO
6824         1                                     11.0      17.363636                17.0      25.641679            NaN            NaN    MISO
             2                                     12.0      16.583333                17.0      25.231016            NaN            NaN    MISO
7425         1                                      NaN            NaN                 9.0      17.058380            NaN            NaN    MISO
10350        CTGB                                   NaN            NaN                 7.0      16.728111            7.0      16.894603    CISO
50732        ETBLR1                                 NaN            NaN                33.0      25.215167            NaN            NaN     PJM
             ETBLR2                                 NaN            NaN                24.0      33.637866            NaN            NaN     PJM
             ETBLR3                                 NaN            NaN                25.0      33.581149            NaN            NaN     PJM
55096        BLR2                                   NaN            NaN                 4.0      14.014076            NaN            NaN    MISO
55381        CT-005                                 NaN            NaN               119.0      13.069302          119.0      13.042396     PJM
55419        600                                    NaN            NaN                 NaN            NaN           22.0      28.393516    MISO
             700                                    NaN            NaN                 NaN            NaN           29.0      33.363435    MISO
             800                                    NaN            NaN                 NaN            NaN           28.0      33.987326    MISO

Copy link
Collaborator

@grgmiller grgmiller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good, just a few small requests

src/oge/data_cleaning.py Outdated Show resolved Hide resolved
src/oge/data_cleaning.py Outdated Show resolved Hide resolved
src/oge/anomaly_screening.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@grgmiller grgmiller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, looks good!

@rouille rouille merged commit ae44283 into development Mar 29, 2024
2 checks passed
@rouille rouille deleted the ben/ruggles branch March 29, 2024 18:14
@rouille rouille mentioned this pull request Apr 4, 2024
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
data cleaning Cleaning and standardizing data hourly profiles Accuracy of hourly profile imputation new feature New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants