ENH: Use ddof=0 and nanstd, remove skipped test #36

analicia · 2016-11-23T16:26:00Z

Empyrical is not generally used on a sample drawn from a larger population. Change the delta degrees of freedom (ddof) to 0, because that is how the standard deviation of a full population is calculated. Regression tests have been updated to reflect the change in method.

Use nanstd instead of np.std for consistency.
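For illustration, a minimal sketch of what the two changes amount to in NumPy. The returns array is made up for this example and is not taken from empyrical's tests.

```python
import numpy as np

# Hypothetical daily returns with one missing observation.
returns = np.array([0.01, -0.02, 0.015, np.nan, 0.005])

# np.std propagates NaN, so a single missing value poisons the result.
plain = np.std(returns)

# np.nanstd ignores NaNs; ddof=0 divides by N and ddof=1 divides by N - 1,
# where N counts only the non-NaN observations (4 here).
population_std = np.nanstd(returns, ddof=0)
sample_std = np.nanstd(returns, ddof=1)

print(plain, population_std, sample_std)
```

With ddof=0 the result is the smaller of the two; the difference is a constant factor that depends only on the number of non-NaN observations.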

There was a flapping test without sound logic that is being removed as well.

twiecki · 2016-11-23T19:03:05Z

Why would we have the whole population here? There are only ever samples.

ssanderson · 2016-11-23T20:53:59Z

@twiecki this came up in the context of adding a Volatility built-in Factor in Zipline.

The question I'm unsure of here is what we're considering as "the population" when we're calculating the volatility of a returns timeseries. It's straightforward that we should use ddof=1 if, for example, we were estimating the volatility of returns of all assets by taking a random sample of 1000 assets.

In this case, however, one could argue that we have "the whole population", in the sense that we have a data point for every day that we're interested in. In particular, for returns, I think the question is whether the mean of the daily returns for some asset can ever deviate from the mean of the returns measured at some finer granularity. The bias that (n - 1) in the denominator corrects for arises because we have to estimate the population variance using an estimate of the population mean.
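To spell out where that bias comes from, here is the standard result behind Bessel's correction. It assumes n independent, identically distributed observations with variance \sigma^2, which is an assumption added for this sketch and one that daily returns may not satisfy.

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,
\qquad
\mathbb{E}\!\left[\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2\right]
  = \frac{n-1}{n}\,\sigma^2,
\qquad
\mathbb{E}\!\left[\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2\right]
  = \sigma^2 .
```

If the true mean \mu were known and used in place of \bar{x}, dividing by n would already be unbiased, which is why the argument hinges on whether the mean is estimated.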

@CaptainKanuk @mmargenot your thoughts here would also be helpful.

ssanderson · 2016-11-23T20:59:14Z

Another way of putting the above is to observe that daily returns aren't samples from an underlying continuous returns series, they're aggregations of an underlying continuous series. This is in contrast, for example, with estimating the standard deviation of a stock's price by looking at end-of-day data, which amounts to taking regular samples from a much larger population. In the latter case, I think ddof=1 would be clearly justified.

CaptainKanuk · 2016-11-23T21:27:54Z

Before we start, this is a good wiki page. https://en.wikipedia.org/wiki/Bessel's_correction

My take is that it actually depends on how the user interprets the result of the computation, which is tricky. If we are trying to measure/estimate the mean daily return over a fixed time period and have data for that entire time period (no NaNs), then we have all the data for the statistic we are trying to measure and it should be ddof=0 (not applying Bessel's correction).

If we are trying to measure the inherent volatility of an asset, then as Max pointed out in Slack we don't actually know the overall mu of the data generating process (DGP) which drives the returns. As such we don't know the population mu and are just sampling. This is in my experience the more common interpretation of vol, and is an interpretation that people often use without realizing it. Another case is that if our data has some missing elements, then we don't have the full population and should use ddof=1.

I say ddof=1 for the following reasons, but agree it isn't clear cut.

1. People will more often assume that the resulting std is indicative of some underlying asset property, which requires ddof=1.
2. Between the two, ddof=1 always gives the larger estimate of volatility and risk, and is therefore the safer choice in my opinion.
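As a rough sketch of how much is at stake between the two options: the ddof=1 and ddof=0 results differ by a factor of sqrt(n / (n - 1)), which depends only on the number of observations. The returns and window lengths below are made up for illustration.

```python
import math
import statistics

# Hypothetical daily returns, purely illustrative.
returns = [0.01, -0.02, 0.015, 0.003, 0.005, -0.007]
n = len(returns)

std_ddof0 = statistics.pstdev(returns)  # divides by n
std_ddof1 = statistics.stdev(returns)   # divides by n - 1

# The two differ by a factor that depends only on n.
ratio = std_ddof1 / std_ddof0
assert math.isclose(ratio, math.sqrt(n / (n - 1)))

# The gap shrinks quickly as the window grows.
for window in (5, 21, 252):
    print(window, math.sqrt(window / (window - 1)))
```

For a year of daily data the factor is very close to 1, so the choice matters mostly for short windows.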

mmargenot · 2016-11-23T21:35:08Z

Seconding Delaney, mainly on the premise that any more granular return series will still be a sample of the DGP. We can't know the true movement of the underlying without divine inspiration.

ENH: Use ddof=0 and nanstd, remove skipped test

dc45a5e

analicia assigned twiecki Nov 23, 2016

analicia closed this Nov 28, 2016

analicia deleted the ddof-0 branch November 28, 2016 15:22
