Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

skyfield calculations may be overly accurate for requirements at the cost of computation #4

Open
thesofakillers opened this issue Aug 20, 2021 · 5 comments

Comments

@thesofakillers
Copy link
Owner

Have not analyzed big O performance but it is slow enough to be a nuisance for larger datasets, especially since this calculation needs to be repeated across train/test sets and perhaps even across folds.

The following lines need addressing:

sunset_idxs = np.zeros(len(datetime_series), dtype=int)
for sunset in sunsets[1:]:
change = np.where(datetime_series > sunset)[0][0]
sunset_idxs[change:] += 1

@thesofakillers thesofakillers changed the title Time since sunset calculation is slow Skyfield appear to be slow Aug 26, 2021
@thesofakillers thesofakillers changed the title Skyfield appear to be slow Skyfield calculations appear to be slow Aug 26, 2021
@thesofakillers
Copy link
Owner Author

thesofakillers commented Aug 26, 2021

Upon closer look, while the lines pointed out above indeed need addressing, the skyfield library seems generally to be quite slow for what we are asking it to do. Both the _t_since_sunset and _sun_elevation functions are quite slow for large datasets, with the latter taking the crown, with these lines also causing speed issues:

astrometric_pos = location.at(
skyfield_ts.from_datetimes(
datetime_series.dt.tz_localize("UTC").dt.to_pydatetime()
)
).observe(eph["sun"])

Skyfield was selected because the popular library pyephem suggests it.

Further attention should be dedicated into determining whether

  • the skyfield API is being used correctly
  • a faster library should be used
  • if this is actually the speed limit for what we are computing

@brandon-rhodes
Copy link

I'll be interested to hear which component in that call has the higher cost. Is there a way you can produce that series of times, I wonder, without creating n separate Python datetime objects? If the times are evenly spaced on the clock or calendar there are other approaches in Skyfield for creating the time array.

In the meantime, if you split your call into two lines with a temporary variable:

dtseries = datetime_series.dt.tz_localize("UTC").dt.to_pydatetime()
result = skyfield_ts.from_datetimes(dtseries)

— then could you could separately measure the cost of the Pandas operation from the Skyfield operation.

@thesofakillers
Copy link
Owner Author

thesofakillers commented Aug 27, 2021

@brandon-rhodes Hi thanks for commenting! Wasn't expecting the library creator to come and help 😅

The times are indeed evenly spaced so this could be done, as you say, with other approaches from skyfield. That was just a quick and dirty solution without having to dig too deeply into the documentation.

I've updated my second comment with the complete section of code causing problems. Evaluating line by line as you suggested sheds some light on what operations are more costly. You can see my experimentation in this notebook, where i use %lprun to profile the same function with different input time series.

While for small time series the pandas operation is more costly than skyfield's from_datetimes, this flips for larger time series. More importantly however, the greatest slow down is caused by the location.at() call, which we can see takes between 60% and 80% of the execution time depending on the input size.

In the notebook the time series I try on are of length 1000, 10000 and 100000. I wasn't able to try it on 1000000 since it would not complete execution. For this particular project, time series of this size (millions of timestamps) are what we are concerned with.


Thanks again for the interest! I hope this satisfies your interest :).


We would really appreciate if you have any suggestions on how to do this in a smarter way! We're trying to get the sun's altitude at each timestamp for millions of timestamps, with the timestamps being evenly spaced (regular sample rate basically).

We are also interested in determining the most recent sunset for each timestamp (code here) although this has been less problematic

@brandon-rhodes
Copy link

My guess is that most of the time is invested in computing precession and nutation — the precise angle at which the Earth's poles point at any given moment. The standard nutation series, in particular, has 687 terms of 5 coefficients each which need to be evaluated at every point on your timeline, so if you have one million times then that’s something like 5×687×10⁶ multiplies and then all the necessary adds to put them together.

What kind of accuracy are you after? The full series is only useful if you're going for observatory-grade angles that are useful for pointing a radio telescope (which is the accuracy Skyfield aims for by default).

Since your times are so close together, my guess is that we could evaluate nutation and precession daily, or only a few times a day, and linearly interpolate between them, to get better speed. But that will introduce error, so we need first to know how exact an answer you need on the Sun's position in the sky.

What will you be doing with the angles? That can help establish tolerances.

@thesofakillers
Copy link
Owner Author

What will you be doing with the angles? That can help establish tolerances.

We are using the angles as additional input data to provide to machine learning models to be used for turbulence forecasting.

What kind of accuracy are you after?

I have asked my colleagues what sort of accuracy we are after and will update this post

Since your times are so close together, my guess is that we could evaluate nutation and precession daily, or only a few times a day, and linearly interpolate between them, to get better speed.

This seems like a good idea and depending on our tolerance it may be sufficient to go this route. I imagine this will be the case.

Thank you again for all the help, let us know if you have any more comments 😊

@thesofakillers thesofakillers changed the title Skyfield calculations appear to be slow skyfield calculations may be overly accurate for requirements at the cost of computation Sep 6, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants