New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Add winsorize factor #1696

Merged
merged 2 commits into from Mar 7, 2017

Conversation

Projects
None yet
3 participants
@ahgnaw
Contributor

ahgnaw commented Mar 3, 2017

This factor is useful when the effects of spurious outliers to be mimized.

@coveralls

This comment has been minimized.

coveralls commented Mar 3, 2017

Coverage Status

Coverage increased (+0.004%) to 87.408% when pulling fe6dbbb on add-winsorize into f90cd1c on master.

@@ -833,6 +834,84 @@ def linear_regression(self, target, regression_length, mask=NotSpecified):
mask=mask,
)
def winsorize(self,

This comment has been minimized.

@ssanderson

ssanderson Mar 3, 2017

Member

I think this should have the same @expect_types and float64_only guards as zscore and rank.

This comment has been minimized.

@ssanderson

ssanderson Mar 3, 2017

Member

How well known do you think the name winsorize is in the broader community. Wondering if a name like clamp_outliers might be clearer for a general audience.

This comment has been minimized.

@ahgnaw

ahgnaw Mar 3, 2017

Contributor

Clamp is slightly different than winsorize. In clamping, the upper and lower bounds are provided, in winsorizing they are derived from the data set. For example the lower limit in inclusive winsorizing in generated by arr.sorted[int(limit*len(arr))], assuming arr is a sorted array of the input values.

dtype=self.dtype,
missing_value=self.missing_value,
mask=mask,
window_safe=True,

This comment has been minimized.

@ssanderson

ssanderson Mar 3, 2017

Member

I don't think that winsorizing makes an input automatically window_safe. See note on the tests above.

Returns
-------
winsorized : zipline.pipeline.Factor
A Factor producing a winsorized version of the self.

This comment has been minimized.

@ssanderson

ssanderson Mar 3, 2017

Member

No need for "the" in "the self".

when limiting the impact of extreme values.
If ``mask`` is supplied, ignore values where ``mask`` returns False
when computing row means and standard deviations, and output NaN

This comment has been minimized.

@ssanderson

ssanderson Mar 3, 2017

Member

Looks like the mask section here is copy/pasted from zscore?

@@ -833,6 +834,84 @@ def linear_regression(self, target, regression_length, mask=NotSpecified):
mask=mask,
)
def winsorize(self,
limits=0.05,

This comment has been minimized.

@ssanderson

ssanderson Mar 3, 2017

Member

I don't think we should be providing a default limit here.

This comment has been minimized.

@ahgnaw

ahgnaw Mar 3, 2017

Contributor

Ok

Example
-------
price = USEquityPricing

This comment has been minimized.

@ssanderson

ssanderson Mar 3, 2017

Member

I don't think this example works:

  • price needs to be USEquityPricing.close
  • The calls to winsorize need to be on price.latest, not on price.
'price': price.latest,
'winsor_inc': price.winsorize(limits=0.25, inclusive=(True, True)),
'winsor_exc': price.winsorize(
limits=0.25, inclusive=(False, False)

This comment has been minimized.

@ssanderson

ssanderson Mar 3, 2017

Member

The example here is mostly focused on changing the inclusive/exclusive flags, but I think people will most often want to change limits. The limits API for tuples is also kind of odd for winsorize (if you want the 5th and 90th percentile, you pass (0.05, 0.10) instead of (0.05, 0.90)), , so we probably want to show an example passing a tuple. (Alternatively, we might just make our API be that you have to pass lower and upper limits, which would match the current api for percentile_between).

This comment has been minimized.

@ahgnaw

ahgnaw Mar 3, 2017

Contributor

Yeah I think changing the behavior of limits makes sense.

@@ -1062,6 +1068,9 @@ def test_demean_is_window_safe_if_input_is_window_safe(self):
self.assertFalse(F(window_safe=False).demean().window_safe)
self.assertTrue(F(window_safe=True).demean().window_safe)
def test_winsorize_is_window_safe(self):
self.assertTrue(F().winsorize().window_safe)

This comment has been minimized.

@ssanderson

ssanderson Mar 3, 2017

Member

Winsorizing something doesn't make it automatically window safe. (In particular, winsorizing with no limits is just an identity function.)

This comment has been minimized.

@ssanderson

ssanderson Mar 3, 2017

Member

I think the right behavior here is that a winsorized term is window safe iff the unwinsorized term was also window safe.

@@ -714,6 +715,7 @@ def test_normalizations_hand_computed(self):
normalizer_name_and_func=[
('demean', lambda row: row - nanmean(row)),
('zscore', lambda row: (row - nanmean(row)) / nanstd(row)),
('winsorize', lambda row: scipy_winsorize(row, limits=0.05)),

This comment has been minimized.

@ssanderson

ssanderson Mar 3, 2017

Member

We should probably add another test with different limits to this suite.

given day could look like:
'price' 'winsor_inc' 'winsor_exc' 'winsor_alt'
Asset_1 1 2 3 2

This comment has been minimized.

@ssanderson

ssanderson Mar 3, 2017

Member

This would be a good example to turn into a test. We do something similar for demean in test_normalizations_hand_computed (unfortunately, it looks like we're missing the analogous test for zscore :( ).

@ahgnaw ahgnaw force-pushed the add-winsorize branch 2 times, most recently from d3f19fa to 0718935 Mar 6, 2017

@coveralls

This comment has been minimized.

coveralls commented Mar 6, 2017

Coverage Status

Coverage decreased (-0.4%) to 87.037% when pulling 0718935 on add-winsorize into f90cd1c on master.

@ahgnaw ahgnaw force-pushed the add-winsorize branch from 0718935 to 72ef80f Mar 6, 2017

@coveralls

This comment has been minimized.

coveralls commented Mar 6, 2017

Coverage Status

Coverage increased (+0.0001%) to 87.403% when pulling 72ef80f on add-winsorize into f90cd1c on master.

@@ -1062,6 +1168,11 @@ def test_demean_is_window_safe_if_input_is_window_safe(self):
self.assertFalse(F(window_safe=False).demean().window_safe)
self.assertTrue(F(window_safe=True).demean().window_safe)
def test_winsorize_is_window_safe(self):

This comment has been minimized.

@ssanderson

ssanderson Mar 7, 2017

Member

Seems like this is no longer the right name for this test.

def winsorize(row, limits, inclusive):
if isinstance(limits, int) or isinstance(limits, float):

This comment has been minimized.

@ssanderson

ssanderson Mar 7, 2017

Member

Rather than doing this logic on every row we process, we should just convert our args into the format expected by scipy_winsorize at term construction time in Factor.winsorize.

This comment has been minimized.

@ahgnaw

ahgnaw Mar 7, 2017

Contributor

GP

inclusive : a tuple of bool, optional
A bool indicating whether the data on each side should be
rounded(True) or truncated(False). A value of None can be used if
one side is not being winsorized. Default is (False, False).

This comment has been minimized.

@ssanderson

ssanderson Mar 7, 2017

Member

The default listed here doesn't match the one in the function signature.

Parameters
----------
limits : None, tuple of float, optional

This comment has been minimized.

@ssanderson

ssanderson Mar 7, 2017

Member

Reading through the examples from the tests, I'm still not thrilled with this API for limits. Reading f.winsorize(limits=33) in the tests, for example, feels awkward to me. It's not clear in that usage that limits is a percentile and not an absolute value (an issue that would persist even for more reasonable limits like 5).

I spent a while thinking about this without coming up with any suggestions that I was 100% sold on. I think my current preferred proposal is to just take required parameters for the lower and upper bounds (as absolute percentiles, i.e. (5, 95) rather than (5, 5)). This would match the api for percentile_between.

Let's talk about this first thing tomorrow and figure out a plan so that we can get this merged.

This comment has been minimized.

@ssanderson

ssanderson Mar 7, 2017

Member

One concrete note: I don't think there's any case where it's useful for us to accept None for the boundary parameter.

@ssanderson

This comment has been minimized.

Member

ssanderson commented Mar 7, 2017

@ahgnaw changes here generally look good. I'm still not sold on our current API for describing the boundaries for winsorization. Let's talk early tomorrow and flesh those out.

@coveralls

This comment has been minimized.

coveralls commented Mar 7, 2017

Coverage Status

Coverage increased (+0.0001%) to 87.403% when pulling 1cd8de1 on add-winsorize into f90cd1c on master.

@ahgnaw ahgnaw force-pushed the add-winsorize branch 3 times, most recently from c18f141 to 2118820 Mar 7, 2017

@coveralls

This comment has been minimized.

coveralls commented Mar 7, 2017

Coverage Status

Coverage increased (+0.01%) to 87.415% when pulling 2118820 on add-winsorize into f90cd1c on master.

@ahgnaw ahgnaw force-pushed the add-winsorize branch 2 times, most recently from aa58ac8 to 0d9ca7b Mar 7, 2017

@coveralls

This comment has been minimized.

coveralls commented Mar 7, 2017

Coverage Status

Coverage increased (+0.01%) to 87.414% when pulling 0d9ca7b on add-winsorize into f90cd1c on master.

@coveralls

This comment has been minimized.

coveralls commented Mar 7, 2017

Coverage Status

Coverage decreased (-0.04%) to 87.36% when pulling 0d9ca7b on add-winsorize into f90cd1c on master.

(-.1, 1),
(0, 95),
(5, 95),
(5, 5),

This comment has been minimized.

@ssanderson

ssanderson Mar 7, 2017

Member

One more good case here would be 0.6, 0.5 for the case where the min and max are both valid, but min is greater than max.

@ahgnaw ahgnaw force-pushed the add-winsorize branch from 0d9ca7b to 1e13b9f Mar 7, 2017

Ana Ruelas

@ahgnaw ahgnaw force-pushed the add-winsorize branch from 1e13b9f to b4e97bc Mar 7, 2017

@coveralls

This comment has been minimized.

coveralls commented Mar 7, 2017

Coverage Status

Coverage increased (+0.02%) to 87.423% when pulling b4e97bc on add-winsorize into f90cd1c on master.

@coveralls

This comment has been minimized.

coveralls commented Mar 7, 2017

Coverage Status

Coverage increased (+0.02%) to 87.419% when pulling b4e97bc on add-winsorize into f90cd1c on master.

@coveralls

This comment has been minimized.

coveralls commented Mar 7, 2017

Coverage Status

Coverage increased (+0.02%) to 87.419% when pulling b4e97bc on add-winsorize into f90cd1c on master.

@ahgnaw ahgnaw merged commit 0bc29e1 into master Mar 7, 2017

2 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details

@ahgnaw ahgnaw deleted the add-winsorize branch Mar 7, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment