
refactor preprocessing resampling, validation, and alignment #580

Merged: 42 commits, Oct 14, 2020

Conversation

@wholmgren (Member) commented Sep 30, 2020

There is a problem with the current implementation. Flags that represent bad data should be excluded from the time series before resampling (e.g. limits exceeded) and that's not done here. We're trading one problem for another. I don't know how to solve this without a richer flag datamodel or awful hacks. So maybe it's worth addressing that now. Or we can say this is a net win, move on, and address that later as originally planned. I guess I would say address it now but I don't have a good idea of what the datamodel changes would actually look like.

@alorenzo175 can you review and let me know what you think?

@wholmgren added the enhancement (New feature or request), validation (Issue pertains to data validation), and metrics (Issue pertains to metrics calculation) labels Sep 30, 2020
@wholmgren wholmgren added this to the 1.0 rc4 milestone Sep 30, 2020
@alorenzo175 (Contributor) left a comment


Looks good so far. I think the datamodel changes basically involve adding keys like threshold_percentage to QualityFlagFilter. To handle this new issue, I think a key like discard_before_resample would help.

We should also keep in mind that one can specify multiple quality flag filters in the report. Right now they are merged into one (https://github.com/SolarArbiter/solarforecastarbiter-core/pull/580/files#diff-715b43f6fac14c48d8729da1c197d286R430) but if we go the route of adding these keys, maybe the filters should be applied successively.
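
If the filters were applied successively rather than merged, the loop might look like the sketch below; the function name, the mask layout, and the plain-tuple filter representation are all illustrative, not the package's actual API:

```python
import pandas as pd

# Illustrative sketch only: apply each quality flag filter's flags to
# the series in turn, instead of merging all filters into one mask.
def apply_filters_successively(values, flag_masks, filters):
    # values: pd.Series of observation values
    # flag_masks: dict of flag name -> boolean pd.Series (True = flagged)
    # filters: list of tuples of flag names, one tuple per filter
    out = values
    for flag_names in filters:
        # union only the flags belonging to this filter
        combined = pd.Series(False, index=out.index)
        for name in flag_names:
            combined |= flag_masks[name].reindex(out.index)
        out = out[~combined]  # drop the points this filter flags
    return out

idx = pd.date_range('2020-01-01', periods=4, freq='15min')
values = pd.Series([1.0, 2.0, 3.0, 4.0], index=idx)
masks = {
    'NIGHTTIME': pd.Series([True, False, False, False], index=idx),
    'LIMITS EXCEEDED': pd.Series([False, False, True, False], index=idx),
}
kept = apply_filters_successively(
    values, masks, [('NIGHTTIME',), ('LIMITS EXCEEDED',)])
```

Here each filter drops its own flagged points before the next filter is considered, which is the "applied successively" behavior.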

# what's the point of copying the fill map, potentially adding a new key,
# and then only accessing one key?
# this would make more sense to me (wholmgren):
# forecast_fill_str = FORECAST_FILL_STRING_MAP.get(forecast_fill_method,
Contributor:

from what I can see, I agree

@wholmgren (Member, Author):

> I think the datamodel changes basically involve adding keys like threshold_percentage to QualityFlagFilter. To handle this new issue, I think a key like discard_before_resample would help.

Would these be exposed in the report json or would they be preset for predefined QualityFlagFilters for each kind of quality flag?

@alorenzo175 (Contributor):

> I think the datamodel changes basically involve adding keys like threshold_percentage to QualityFlagFilter. To handle this new issue, I think a key like discard_before_resample would help.
>
> Would these be exposed in the report json or would they be preset for predefined QualityFlagFilters for each kind of quality flag?

I was thinking they would be in the report json. If we wanted some kind of presets, I would think the dashboard could handle that like it might already do for other fields.

'UNEVEN FREQUENCY', 'LIMITS EXCEEDED', 'CLEARSKY EXCEEDED',
'DAYTIME STALE VALUES', 'INCONSISTENT IRRADIANCE COMPONENTS'
)
type: str
Contributor:

It isn't ideal, but we should probably keep the quality_flags key (and the other parameters would apply to all flags in that tuple). Otherwise all current reports would be broken since I think the report json is loaded through the report datamodel before displaying/downloading.

Member (Author):

So keep the old quality_flags key in addition to the new keys? I agree that could work. A couple of things to consider as alternatives:

  1. script to migrate current reports from old json to new json
  2. create new objects and json keys, leave the old ones untouched until some later time when we implement 1

I'm not sure what would be less work because the testing implications for keeping the old keys sound miserable.

Contributor:

Well, keep the old key and add the new keys except type. So all flags in quality_flags are processed with the other options. You would probably have the same options for many of the flags anyway, and you can still have multiple QualityFlagFilters with different sets of flags.

We've done the migration before and it was a real pain (and motivation to eventually get the report schema versioned). One problem is that you also have to modify the json core posts to the API.

I don't think testing would be bad. All the keys are still used (if type is removed); we just have a couple of new keys with reasonable defaults if empty.
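
Under that approach the datamodel might look roughly like the sketch below. The field names come from this thread; the defaults shown (10. from the docstring under review, True for discard_before_resample) are assumptions, not the final implementation:

```python
from dataclasses import dataclass
from typing import Tuple

# Illustrative sketch: keep the existing quality_flags key and add the
# new keys with defaults, so report json that predates them still loads.
@dataclass(frozen=True)
class QualityFlagFilter:
    quality_flags: Tuple[str, ...] = ('USER FLAGGED',)
    discard_before_resample: bool = True       # assumed default
    resample_threshold_percentage: float = 10.

# an "old style" filter that only specified quality_flags still works
old = QualityFlagFilter(quality_flags=('NIGHTTIME',))
```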

@wholmgren (Member, Author):

Should these have the same behavior?

# separate flags:
qflag_1 = QualityFlagFilter(('NIGHTTIME', ), discard_before_resample=False, resample_threshold_percentage=30.)
qflag_2 = QualityFlagFilter(('CLEARSKY', ), discard_before_resample=False, resample_threshold_percentage=30.)

# combined flags
qflag_combined = QualityFlagFilter(('NIGHTTIME', 'CLEARSKY'), discard_before_resample=False, resample_threshold_percentage=30.)

Imagine a 15 minute validation time series like

nighttime = [True, False, False, False]
clearsky = [False, False, True, False]

and we want to resample into 1 hour. If we apply the filters individually, we keep the hourly interval. But if we combine the filters we have [True, False, True, False] and we discard the interval.

I think the "separate flags" case should keep the interval and the "combined flags" case should discard the interval, but I'm open to the idea that they should have the same behavior.
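
Numerically, with resample_threshold_percentage=30 and discard_before_resample=False, the example works out as below; interval_discarded is a stand-in for whatever the resampler actually computes:

```python
import numpy as np

# four 15 minute points making up one hourly interval
nighttime = np.array([True, False, False, False])
clearsky = np.array([False, False, True, False])
threshold_pct = 30.

def interval_discarded(flags, threshold_pct):
    # discard the resampled interval if more than threshold_pct
    # of its points are flagged
    return 100. * flags.mean() > threshold_pct

# separate filters: each flag covers 25% of the interval, under the
# threshold, so the hourly value is kept
separate = (interval_discarded(nighttime, threshold_pct)
            or interval_discarded(clearsky, threshold_pct))

# combined filter: the union [True, False, True, False] covers 50%,
# over the threshold, so the hourly value is discarded
combined = interval_discarded(nighttime | clearsky, threshold_pct)
```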

@alorenzo175 (Contributor):

Interesting. I guess I always thought they would have the same behaviour and be treated separately. I can see how the combined flags case could be useful. Just need to make sure it's documented well.

@wholmgren (Member, Author):

More scope creep.... TimeOfDayFilter and ValueFilter are not currently implemented, but they'll need to fit into this pattern eventually. It's not obvious to me how that would work. I feel like they might need to be eliminated and instead set as keys on QualityFlagFilter (perhaps renamed). It's worth considering that before committing to this refactor.

@wholmgren (Member, Author):

I'd like to think it makes more sense now.

Although, I guess it wouldn't be too bad if there were just a Filter object that had keys quality_flags, time_of_day_range, value_range. Either way, I think we can lay the groundwork in this PR and make a follow on soon after if necessary

I added a few comments on where we could easily add support for this. Best left to another PR, though.
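
For reference, the single Filter object floated above might look like the sketch below; nothing here exists in the package, and the field types are guesses:

```python
import datetime
from dataclasses import dataclass
from typing import Optional, Tuple

# Purely illustrative: one Filter object standing in for
# QualityFlagFilter, TimeOfDayFilter, and ValueFilter.
@dataclass(frozen=True)
class Filter:
    quality_flags: Tuple[str, ...] = ()
    time_of_day_range: Optional[Tuple[datetime.time, datetime.time]] = None
    value_range: Optional[Tuple[float, float]] = None

f = Filter(
    quality_flags=('NIGHTTIME',),
    time_of_day_range=(datetime.time(8), datetime.time(17)),
    value_range=(0., 1100.),
)
```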

docs/source/whatsnew/1.0.0rc4.rst
resample_threshold_percentage : float, default 10.
The percentage of points in a resampled interval that must be
flagged for the resampled interval to be flagged.
Ignored if ``discard_before_resample`` is True.
Contributor:

I think this needs updating?

wholmgren and others added 3 commits October 13, 2020 14:19
Co-authored-by: Tony Lorenzo <atlorenzo@email.arizona.edu>
@wholmgren (Member, Author):

@lboeman we'll need some dashboard updates to go along with this. see #580 (comment)

The easiest thing might be to make separate QualityFlagFilters for each dashboard option (instead of putting them all together like we do now), with each discard_before_resample set by validation.quality_mapping.DISCARD_BEFORE_RESAMPLE.

Let me know if I should make any changes here to support that.
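
The per-flag suggestion could look something like this sketch; the mapping contents and the plain-dict filter representation are stand-ins (the real lookup would use validation.quality_mapping.DISCARD_BEFORE_RESAMPLE):

```python
# Illustrative only: build one filter per dashboard-selected flag,
# looking up discard_before_resample per flag. The mapping contents
# here are assumed for the example.
DISCARD_BEFORE_RESAMPLE = {
    'NIGHTTIME': False,
    'LIMITS EXCEEDED': True,
}

def make_filters(selected_flags):
    return [
        {'quality_flags': (flag,),
         'discard_before_resample': DISCARD_BEFORE_RESAMPLE.get(flag, True)}
        for flag in selected_flags
    ]

filters = make_filters(['NIGHTTIME', 'LIMITS EXCEEDED'])
```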

Successfully merging this pull request may close these issues.

accounting for marginal nighttime and clearsky conditions when resampling observations