
add pvdaq reference sites, fetch, config #438

Merged
merged 25 commits into SolarArbiter:master from the pvdaq branch May 12, 2020

Conversation

@wholmgren (Member) commented May 7, 2020

  • Closes PVDAQ sites for reference data #397.
  • I am familiar with the contributing guidelines.
  • Tests added.
  • Updates entries to docs/source/api.rst for API changes.
  • Adds descriptions to appropriate "what's new" file in docs/source/whatsnew for all changes. Includes link to the GitHub Issue with :issue:`num` or this Pull Request with :pull:`num`. Includes contributor name and/or GitHub username (link with :ghuser:`user`).
  • New code is fully documented. Includes numpydoc compliant docstrings, examples, and comments where necessary.
  • Maintainer: Appropriate GitHub Labels and Milestone are assigned to the Pull Request and linked Issue.

Additional To do:

  • make fetch wrapper functions. Need to reindex to a fixed frequency (most sites report at only quasi-regular intervals). Also need to localize data in the fetch wrapper because PVDAQ apparently returns local time (no DST) but without the tz offset; I pulled tz offset data from Google's timezone API (see the localization sketch after this list).
  • add pv modeling parameters somewhere
  • add sites to the reference forecast configuration too - not sure how that's done.
  • set up SFA API key
  • add sites to google map accessible at https://solarforecastarbiter.org/referencedata/
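
A minimal sketch of that localization step, assuming the offset from Google's timezone API is expressed in hours (the helper name is hypothetical):

    from datetime import timezone, timedelta

    def localize_pvdaq_index(data, utc_offset_hours):
        # PVDAQ reports local time with no DST shifts, so a fixed-offset
        # timezone matches the data and avoids any DST transitions.
        fixed = timezone(timedelta(hours=utc_offset_hours))
        return data.tz_localize(fixed)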

The get_pvdaq_data function is based on pvlib/pvlib-python#664.

It takes about 3 minutes to pull all of the 2020 data.

@wholmgren added the enhancement label May 7, 2020
@wholmgren added this to the 1.0 release candidate milestone May 7, 2020
@wholmgren added the IO label May 7, 2020
@wholmgren (Member Author) commented May 7, 2020

@alorenzo175 @lboeman I found this existing code for conforming to an index:

    # we don't extend the new index to start, end, since reference
    # data has some lag time from the requested end
    # and it isn't necessary to keep the nans between uploads in db
    new_index = pd.date_range(start=data.index[0], end=data.index[-1],
                              freq=observation.interval_length)
    data = data.reindex(new_index)
    # set quality flags

Can we change that to:

    data_resampled = data.resample(observation.interval_length).first()
    new_index = pd.date_range(start=data.index[0], end=data.index[-1],
                              freq=observation.interval_length)
    data = data_resampled.reindex(new_index)

@alorenzo175 (Member) commented

Hmm, and if the interval label is ending? Seems like .last() is more appropriate. And if interval_label == 'mean', seems like a mean would be appropriate?
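
A sketch of that mapping, assuming interval_label takes the values 'beginning', 'ending', or 'mean' (the helper name is hypothetical):

    def resample_for_label(data, freq, interval_label):
        # pick the aggregation that matches the interval label convention
        resampler = data.resample(freq)
        if interval_label == 'ending':
            return resampler.last()
        if interval_label == 'mean':
            return resampler.mean()
        return resampler.first()  # 'beginning'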

@wholmgren (Member Author) commented

I was originally just trying to fix a problem with timestamps that need to be rounded to be on the right interval. Here are a couple of examples:

                         ac_power  dc_power
    Date-Time
    2020-03-12 13:30:00   61000.0   64600.0
    2020-03-12 13:35:00  101700.0  107700.0
    2020-03-12 13:40:00   60300.0   63800.0
    2020-03-12 13:45:00   57300.0   60800.0
    2020-03-12 13:49:58   58400.0   61800.0
    2020-03-12 13:50:00   58400.0   62100.0
    2020-03-12 13:55:00   67200.0   71100.0

and

                         ac_power
    Date-Time
    2020-01-04 12:00:00      1057
    2020-01-04 12:15:01      1685
    2020-01-04 12:30:00      1982
    2020-01-04 12:45:00      2891
    2020-01-04 13:00:00      2428
    2020-01-04 13:15:00      2285
    2020-01-04 13:30:00      2169
    2020-01-04 13:45:00      2075
    2020-01-04 14:00:00      1418
    2020-01-04 14:15:00      1829
    2020-01-04 14:30:01      1181
    2020-01-04 14:45:00      2127

In the first case we could average or we could just discard. In the second case we want to round. If discarding is ok, then we can use resample().first() regardless of the label.

But there are also two sites that report 15s data that we'll need to resample to 1 minute intervals (unless we want to support non-integer interval_lengths in the API).
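
For concreteness, a runnable sketch of the discard-via-resample behavior on a reconstruction of the second example (values copied from above):

    import pandas as pd

    idx = pd.to_datetime(['2020-01-04 12:00:00', '2020-01-04 12:15:01',
                          '2020-01-04 12:30:00'])
    data = pd.DataFrame({'ac_power': [1057, 1685, 1982]}, index=idx)

    # 12:15:01 snaps into the 12:15:00 bin, so its value is kept at a
    # rounded label; a row that lands in an already-occupied bin (like
    # 13:49:58 in the first example) is simply discarded by .first().
    rounded = data.resample('15min').first()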

@alorenzo175 (Member) commented

Dumping data is fine with me.

@wholmgren (Member Author) commented

I'd like some feedback on the proposed pattern before implementing/fixing tests.

pvdaq_reference_sites.json is a dict with keys 'sites' and 'observations'. The sites are all of the pvdaq sites, and the observations are a flat list of all of the pvdaq observations. I don't have a strong opinion about this vs. a nested structure. The code parses a handful of ugly Python dicts and dataframes into Site and Observation objects, so from there it's simple to change the organization of the json file if people prefer something else. I know @lboeman was thinking of something similar for SRML PV sites.

common._prepare_data_to_post gets a new resample_how argument that can be None or a pandas resampler method. For PVDAQ, this lets us drop data or resample sub-minute data to 1 minute data. common.post_observation_data pulls this from observation.extra_parameters, with default None. I could put this kind of logic in reference_observations.pvdaq.fetch if I assign the resample rules to the Site instead of the Observation. While that might be consistent with the existing expectations of all reference site data, I think we need to be more flexible in the future. I also just think it makes more sense to put it in the observation metadata.
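
A sketch of that pattern (the real _prepare_data_to_post takes other arguments; only the resample_how handling is shown, and passing the method as a name string is an assumption):

    def _prepare_data_to_post(data, observation, resample_how=None):
        # resample_how names a pandas resampler method ('first', 'last',
        # 'mean', ...); post_observation_data pulls it from
        # observation.extra_parameters and defaults to None (no resampling).
        if resample_how is not None:
            resampler = data.resample(observation.interval_length)
            data = getattr(resampler, resample_how)()
        ...  # existing reindex / quality-flag logic continues here
        return data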

@wholmgren (Member Author) commented

It's now working with the solararbiter referencedata init and update commands.

See this site for an example: https://dev-dashboard.solarforecastarbiter.org/observations/bfdb2b98-923f-11ea-bfa1-0a580a82013d?start=2020-04-01T22%3A00Z&end=2020-05-09T22%3A00Z

Observations won't update at the other sites because I first created them with not quite the correct extra_parameters (it was a dict instead of a string-ified dict). I deleted the site/observation/forecasts at just the linked site above (major pain; separate issue) and then recreated it with extra_parameters that are strings, and the values post was successful.

@wholmgren requested a review from lboeman May 11, 2020 17:13
@wholmgren (Member Author) commented

Looks like everything is working. There are a couple of sites on the dev dashboard that have the wrong AC/DC capacity since I hadn't yet fixed the W-->MW conversion when they were created, but they would be recreated correctly following a database wipe. It takes about 20 minutes to pull and post all of the 2020 data. About 75% of the time is in the post.

@lboeman (Member) left a comment

This looks good to me. I do like storing the sites as valid API schema. Maybe eventually we convert the full list to a json file with the format:

    {
      "NETWORK": {
        "sites": [...],
        "observations": [...]  // optional
      },
      "NETWORK 2": {...}
    }

Then we can have a set of required extra_parameters, and special cases can be added without needing to update the whole list. Not suggesting this should happen here, but just thinking ahead a little.

    start : datetime
        The beginning of the period to request data for.
    end : datetime
        The end of the period to request data for.
Member: nrel_pvdaq_api_key here

@alorenzo175 (Member) left a comment
I'm good with the approach. Needs tests at least for common's resample_how handling and for fetch/pvdaq.

    # Each year must be queried separately, so iterate over the years and
    # generate a list of dataframes.
    # Consider putting this loop in its own private function with a
    # try / except / try again pattern for network issues and the NREL API
Member: will we see any issues from this?

@wholmgren (Member Author):
The pvlib CI struggled with a different NREL API, but I haven't run into any issues with the pvdaq API. They have a 1000 requests per hour limit but we are nowhere close to that. Let's see if it's a problem in the rc cycle.
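
One way the suggested private retry function could look (hypothetical names; the get_pvdaq_data signature and retry count are assumptions):

    import time

    def _fetch_pvdaq_years(api_key, system_id, years, retries=2):
        # query each year separately, retrying transient network/API errors
        frames = []
        for year in years:
            for attempt in range(retries + 1):
                try:
                    frames.append(get_pvdaq_data(api_key, system_id, year))
                    break
                except Exception:
                    if attempt == retries:
                        raise
                    time.sleep(5)  # brief backoff before trying again
        return frames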

solarforecastarbiter/io/reference_observations/common.py (outdated review thread, resolved)
@wholmgren (Member Author) commented

@alorenzo175 tests pass, coverage is almost 100%, so I think we're close. The mocks grew more aggressive as the hour grew later, but I think they're consistent with other modules.

@wholmgren merged commit 03a2c29 into SolarArbiter:master May 12, 2020
@wholmgren deleted the pvdaq branch May 12, 2020 16:43