refactor preprocessing resampling, validation, and alignment #580
Conversation
Looks good so far. I think the datamodel changes basically involve adding keys like threshold_percentage to QualityFlagFilter. To handle this new issue, I think a key like discard_before_resample would help.
We should also keep in mind that one can specify multiple quality flag filters in the report. Right now they are merged into one (https://github.com/SolarArbiter/solarforecastarbiter-core/pull/580/files#diff-715b43f6fac14c48d8729da1c197d286R430) but if we go the route of adding these keys, maybe the filters should be applied successively.
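For illustration, a minimal sketch of what those datamodel keys might look like. The field names follow the discussion (quality_flags, discard_before_resample, resample_threshold_percentage), but the defaults and the successive-application loop are assumptions, not the merged implementation:

```python
# Hypothetical sketch of QualityFlagFilter with the proposed keys; field
# names follow the PR discussion, defaults are assumptions.
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class QualityFlagFilter:
    quality_flags: Tuple[str, ...]
    discard_before_resample: bool = False
    resample_threshold_percentage: float = 10.


# Multiple filters could then be applied successively rather than merged
# into one, each with its own discard/threshold behavior:
filters = (
    QualityFlagFilter(('USER FLAGGED',), discard_before_resample=True),
    QualityFlagFilter(('NIGHTTIME',), resample_threshold_percentage=30.),
)
```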
# what's the point of copying the fill map, potentially adding a new key,
# and then only accessing one key?
# this would make more sense to me (wholmgren):
# forecast_fill_str = FORECAST_FILL_STRING_MAP.get(forecast_fill_method,
from what I can see, I agree
Would these be exposed in the report json, or would they be preset for predefined QualityFlagFilters for each kind of quality flag?
I was thinking they would be in the report json. If we wanted some kind of presets, I would think the dashboard could handle that like it might already do for other fields.
solarforecastarbiter/datamodel.py
Outdated
'UNEVEN FREQUENCY', 'LIMITS EXCEEDED', 'CLEARSKY EXCEEDED',
'DAYTIME STALE VALUES', 'INCONSISTENT IRRADIANCE COMPONENTS'
)
type: str
It isn't ideal, but we should probably keep the quality_flags key (and the other parameters would apply to all flags in that tuple). Otherwise all current reports would be broken, since I think the report json is loaded through the report datamodel before displaying/downloading.
So keep the old quality_flags key in addition to the new keys? I agree that could work. A couple of things to consider as alternatives:
- a script to migrate current reports from old json to new json
- create new objects and json keys, and leave the old ones untouched until some later time when we implement option 1
I'm not sure which would be less work, because the testing implications of keeping the old keys sound miserable.
Well, keep the old key and add the new keys except type. So all flags in quality_flags are processed with the other options. You would probably have the same options for many of the flags anyway, and you can still have multiple QualityFlagFilters with different sets of flags.
We've done the migration before and it was a real pain (and motivation to eventually get the report schema versioned). One problem is you also have to modify the json core posts to the API.
I don't think testing would be bad. All the keys are still used (if type is removed); we just have a couple of new keys with reasonable defaults if empty.
Should these have the same behavior?

# separate flags
qflag_1 = QualityFlagFilter(('NIGHTTIME', ), discard_before_resample=False, resample_threshold_percentage=30.)
qflag_2 = QualityFlagFilter(('CLEARSKY', ), discard_before_resample=False, resample_threshold_percentage=30.)
# combined flags
qflag_combined = QualityFlagFilter(('NIGHTTIME', 'CLEARSKY'), discard_before_resample=False, resample_threshold_percentage=30.)

Imagine a 15 minute validation time series that we want to resample to 1 hour, where some points in an interval are flagged NIGHTTIME and others CLEARSKY. If we apply the filters individually, each flag alone stays below the threshold and we keep the hourly interval. But if we combine the filters, the union of flagged points exceeds the threshold. I think the "separate flags" case should keep the interval and the "combined flags" case should discard the interval, but I'm open to the idea that they should have the same behavior.
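As a sketch of that threshold arithmetic, assuming four 15 minute points per hourly interval, a 30% threshold, and a hypothetical interval_flagged helper (the flag arrays are invented for illustration):

```python
import numpy as np


def interval_flagged(flags, threshold_percentage):
    # An interval is flagged when the percentage of flagged points
    # meets or exceeds the threshold.
    return 100. * np.mean(flags) >= threshold_percentage


# four 15-minute points in one hour: one flagged NIGHTTIME, one CLEARSKY
nighttime = np.array([True, False, False, False])
clearsky = np.array([False, True, False, False])
threshold = 30.

# separate filters: each flag covers only 25% of the interval -> keep it
separate = interval_flagged(nighttime, threshold) or interval_flagged(clearsky, threshold)

# combined filter: the union covers 50% of the interval -> discard it
combined = interval_flagged(nighttime | clearsky, threshold)

print(separate, combined)  # False True
```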
Interesting. I guess I always thought they would have the same behaviour and be treated separately. I can see how the combined flags case could be useful. Just need to make sure it's documented well.
More scope creep....
I'd like to think it makes more sense now.
I added a few comments on where we could easily add support for this. Best left to another PR, though.
solarforecastarbiter/datamodel.py
Outdated
resample_threshold_percentage : float, default 10.
    The percentage of points in a resampled interval that must be
    flagged for the resampled interval to be flagged.
    Ignored if ``discard_before_resample`` is True.
I think this needs updating?
Co-authored-by: Tony Lorenzo <atlorenzo@email.arizona.edu>
@lboeman we'll need some dashboard updates to go along with this. See #580 (comment). The easiest thing might be to make separate … Let me know if I should make any changes here to support that.
- docs/source/api.rst for API changes.
- docs/source/whatsnew for all changes. Includes link to the GitHub Issue with :issue:`num` or this Pull Request with :pull:`num`. Includes contributor name and/or GitHub username (link with :ghuser:`user`).

There is a problem with the current implementation. Flags that represent bad data should be excluded from the time series before resampling (e.g. limits exceeded), and that's not done here. We're trading one problem for another. I don't know how to solve this without a richer flag datamodel or awful hacks. So maybe it's worth addressing that now. Or we can say this is a net win, move on, and address that later as originally planned. I guess I would say address it now, but I don't have a good idea of what the datamodel changes would actually look like.
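To make the bad-data concern concrete, here is a hedged pandas sketch (the sample values and the 30% threshold are invented) contrasting discarding flagged points before resampling with only flagging the resampled interval afterward:

```python
import pandas as pd

idx = pd.date_range('2020-01-01 00:00', periods=4, freq='15min')
# one spike that would be flagged LIMITS EXCEEDED
values = pd.Series([100., 5000., 110., 105.], index=idx)
flagged = pd.Series([False, True, False, False], index=idx)

# discard before resample: the bad point never contaminates the hourly mean
discard_mean = values[~flagged].resample('1h').mean()

# flag after resample: the mean includes the bad point, and the interval is
# only flagged if enough points were bad (here a 30% threshold)
naive_mean = values.resample('1h').mean()
frac_flagged = flagged.resample('1h').mean()
interval_flagged = frac_flagged >= 0.30

print(discard_mean.iloc[0])   # 105.0
print(naive_mean.iloc[0])     # 1328.75
```

With only one of four points flagged (25%), the interval is not even flagged under the threshold approach, yet its mean is badly skewed by the spike — which is the problem described above.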
@alorenzo175 can you review and let me know what you think?