New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Add support for strings in Pipeline. #1174

Merged
merged 27 commits into from May 5, 2016

Conversation

Projects
None yet
4 participants
@ssanderson
Member

ssanderson commented Apr 29, 2016

  • Adds a new class, LabelArray, which is a subclass of np.ndarray.
    LabelArray is conceptually similar to pandas.Categorical, in that it
    stores data with many duplicate values as indices into an array of
    unique values. For string data with many duplicates (e.g. time-series
    of tickers or or industry classifications), this provides multiple
    orders of magnitude of improvement when doing string operations,
    especially string comparison/matching operations.
  • Adds a new generic object "specialization" for AdjustedArrayWindow,
    and a corresponding ObjectOverwrite adjustment.
  • Adds a new postprocess method to zipline.pipeline.term.Term.
    This method is called on the final result of any pipeline expression
    after screen filtering has occurred. The default implementation of
    postprocess is identity, but Classifier overrides it to coerce
    string columns into pandas.Categoricals before presenting them to the
    user.

@ssanderson ssanderson force-pushed the string-classifiers branch 3 times, most recently from 9c39bca to b51102a Apr 29, 2016

elif data_dtype.name.startswith('datetime'):
return data.astype(int64), {'dtype': dtype(int64)}
elif data_dtype in CATEGORICAL_DTYPES:
if not isinstance(missing_value, (bytes, unicode)):

This comment has been minimized.

@dmichalowicz

dmichalowicz May 2, 2016

Contributor

This will need to use six.string_types

This comment has been minimized.

@ssanderson

ssanderson May 2, 2016

Member

I think we actually want this to continue to support bytes in py3. (string_types is just unicode in py3).

This comment has been minimized.

@ssanderson

ssanderson May 2, 2016

Member

Note that unicode here isn't the builtin name (since that's only defined in py2), it's a value imported from zipline.utils.compat.

Py_ssize_t first_col,
Py_ssize_t last_col,
object value):
super(Adjustment, self).__init__(

This comment has been minimized.

@dmichalowicz

dmichalowicz May 2, 2016

Contributor

Did you mean for this to be super(_ObjectAdjustment, ...)?

This comment has been minimized.

@ssanderson

ssanderson May 2, 2016

Member

Yes. I think we're being saved here by the fact that Cython implicitly calls all your parent's __new__s before yours. Will update.

This comment has been minimized.

@ssanderson

ssanderson May 2, 2016

Member

Updated.

categories=optional(list),
)
@expect_kinds(values=("O", "S", "U"))
def __new__(cls,

This comment has been minimized.

@mtydykov

mtydykov May 2, 2016

Contributor

Should there be a docstring here describing the params?

This comment has been minimized.

@ssanderson

ssanderson May 2, 2016

Member

It probably belongs in the class docstring, but yeah, I can add that.

This comment has been minimized.

@ssanderson

ssanderson May 2, 2016

Member

Updated.

In this case, `self` will be the new LabelArray instance, and
``obj` will be the array on which ``view`` is being called.
The caller of ``obj.view`` is responsible for copying setting

This comment has been minimized.

@mtydykov

mtydykov May 2, 2016

Contributor

wording

This comment has been minimized.

@ssanderson
-------
matches : np.ndarray[bool]
An array with the same shape as self indicating whether each
element of self ended with ``suffix``.w

This comment has been minimized.

@mtydykov

mtydykov May 2, 2016

Contributor

cruft

This comment has been minimized.

@ssanderson
repr_lines[-1] = repr_lines[-1].rsplit(',', 1)[0] + ')'
# The extra spaces here account for the difference in length between
# 'array(' and 'LabelArray('.
return '\n '.join(repr_lines)

This comment has been minimized.

@mtydykov

mtydykov May 2, 2016

Contributor

Should these extra spaces be parameterized/moved into a variable somewhere so that it's clearer how we expect this to look?

This comment has been minimized.

@ssanderson

ssanderson May 2, 2016

Member

This would look something like:

separator = '\n' + ' ' * (len(type(self).__name__) - len('array'))
return separator.join(repr_lines)

I'm not sure that's actually much clearer (I would want the comment explaining what's happening either way), so I'd be inclined to just leave this as-is with the comment.

@@ -250,7 +251,7 @@ def _ensure_sliding_windows(self, assets, dts, field):
adjs = {}
window = Float64Window(
array[:, i].reshape(prefetch_len, 1),
dtype_,
view_kwargs,

This comment has been minimized.

@mtydykov

mtydykov May 2, 2016

Contributor

What does view_kwargs do differently from dtype_?

This comment has been minimized.

@ssanderson

ssanderson May 2, 2016

Member

The relevant corresponding change is https://github.com/quantopian/zipline/pull/1174/files#diff-9c48be544e3a72c4ea00f841fe7f9b7bL96.

Previously, the low-level array iterator would do:

self._data.view(self.viewtype)

it now instead does:

self._date.view(**self.view_kwargs)

The difference is that we used to always pass a dtype to view, but we now pass an empty dict (which is a no-op) when we have a LabelArray. The alternative would have been to pass None or another sentinel and have AdjustedArrayWindow check its viewtype on every iteration, but it makes the downstream logic cleaner to just always receive a dictionary.

# Each term that computed an output has its postprocess method
# called on the filtered result.
#
# We use this convert LabelArrays into categoricals, among other

This comment has been minimized.

@mtydykov

mtydykov May 2, 2016

Contributor

to convert?

This comment has been minimized.

@ssanderson

ssanderson May 2, 2016

Member

Updated.

"""
return LabelArray(
array.astype(initial_dtype),
missing_value=initial_dtype.type(''),

This comment has been minimized.

@mtydykov

mtydykov May 2, 2016

Contributor

Why is this an empty string?

This comment has been minimized.

@ssanderson

ssanderson May 2, 2016

Member

this should be missing_value

"""
Generate test cases for overwrite adjustments.
The algorithm used here is the same as the one used above for
multiplicative adjustments. The only difference is the semantics of how
the adjustments are expected to modify the arrays.
This is parameterized on `make_input` and make_expected_output functions,

This comment has been minimized.

@mtydykov

mtydykov May 2, 2016

Contributor

make_expected_output -> make_expected_output

return data.astype(int64), dtype(int64)
elif data_dtype.name.startswith('datetime'):
return data.astype(int64), {'dtype': dtype(int64)}
elif data_dtype in CATEGORICAL_DTYPES:

This comment has been minimized.

@dmichalowicz

dmichalowicz May 3, 2016

Contributor

Why doesn't CATEGORICAL_DTYPES include 'S' and 'U'? I was under the impression that LabelArray can accept data of strings/unicode, not just dtype object.

This comment has been minimized.

@ssanderson

ssanderson May 3, 2016

Member

S and U aren't actually dtypes; they're families of dtypes, one for each possible fixed length. If you make an array with dtype('S'), it's actually equivalent to dtype('S1'), which has some unexpected behavior like silently truncating any string longer than one character. I actually had 'S' and 'U' here in an earlier draft and removed them because I ended up having to guard defensively against the dtype being 'S' or 'U' all over the place.

@mtydykov

This comment has been minimized.

Contributor

mtydykov commented May 3, 2016

Looks good to me (whenever the build passes).

@ssanderson ssanderson force-pushed the string-classifiers branch 2 times, most recently from 5983091 to 221e7e2 May 3, 2016

If we have an outputs tuple, the default is an empty recarray with
``self.outputs`` as field names. Each field will have dtype
``self.dtype``, the default shape is ``self.shape``.

This comment has been minimized.

@dmichalowicz

dmichalowicz May 4, 2016

Contributor

self.shape --> shape

from zipline.pipeline.classifiers import (
Classifier,
Latest as LatestClassifier,
)

This comment has been minimized.

@dmichalowicz

dmichalowicz May 4, 2016

Contributor

Could you maybe add a brief comment as to why these imports need to happen here?

This comment has been minimized.

@ssanderson

ssanderson May 4, 2016

Member

I think these actually don't need to be lazy imports anymore (I don't remember why they were originally...). Will move them out.

ssanderson added some commits Apr 28, 2016

ENH: Add support for strings in Pipeline.
- Adds a new class, ``LabelArray``, which is a subclass of np.ndarray.
  LabelArray is conceptually similar to pandas.Categorical, in that it
  stores data with many duplicate values as indices into an array of
  unique values.  For string data with many duplicates (e.g. time-series
  of tickers or or industry classifications), this provides multiple
  orders of magnitude of improvement when doing string operations,
  especially string comparison/matching operations.

- Adds a new generic object "specialization" for `AdjustedArrayWindow`,
  and a corresponding ObjectOverwrite adjustment.

- Adds a new ``postprocess`` method to ``zipline.pipeline.term.Term``.
  This method is called on the final result of any pipeline expression
  after screen filtering has occurred. The default implementation of
  ``postprocess`` is identity, but Classifier overrides it to coerce
  string columns into pandas.Categoricals before presenting them to the
  user.
ENH: Use np.void for labelarray storage.
This disables most broken ufuncs

ssanderson added some commits May 3, 2016

BUG: Tests/bugfixes for LabelArray slicing.
- Fixes a bug where __setitem__ was not called when setting with a slice
  on Python 2 (__setslice__ was called instead), which caused strange
  behavior when setting an empty string.  This is fixed by overriding
  __setslice__ and forwarding to __setitem__.

- Fixes a bug where __getitem__ returned an instance of np.void when
  returning a scalar.  We now correctly return an entry from our
  categoricals.
ENH: Fail fast on outputs in CustomClassifier.
We don't support multiple outputs for CustomClassifier because we use
LabelArrays for string classifiers.
MAINT: Remove lazy imports of Latest.
They're no longer needed to break import cycles.

@ssanderson ssanderson force-pushed the string-classifiers branch from edac46f to 7091727 May 4, 2016

@ssanderson ssanderson force-pushed the string-classifiers branch from 7091727 to 317ecc8 May 4, 2016

@coveralls

This comment has been minimized.

coveralls commented May 4, 2016

Coverage Status

Coverage increased (+0.3%) to 81.431% when pulling 317ecc8 on string-classifiers into 8756bf2 on master.

ssanderson added some commits May 4, 2016

BUG: Fix broken isnull() on string classifiers.
Adds a special case in NullFilter to handle LabelArrays correctly.
TEST: Don't assert particular numpy error.
They change from version to version.
@coveralls

This comment has been minimized.

coveralls commented May 5, 2016

Coverage Status

Coverage increased (+0.3%) to 81.47% when pulling 2ceeac1 on string-classifiers into 8756bf2 on master.

@ssanderson ssanderson force-pushed the string-classifiers branch from 84ad454 to e0aeda4 May 5, 2016

@coveralls

This comment has been minimized.

coveralls commented May 5, 2016

Coverage Status

Coverage increased (+0.3%) to 81.47% when pulling e0aeda4 on string-classifiers into 8756bf2 on master.

BUG: View with specific int dtype.
Just viewing as int is broken on win32.
@coveralls

This comment has been minimized.

coveralls commented May 5, 2016

Coverage Status

Coverage increased (+0.3%) to 81.471% when pulling 9fd8ec1 on string-classifiers into 8756bf2 on master.

@ssanderson ssanderson merged commit 402fb2a into master May 5, 2016

2 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details

@ssanderson ssanderson deleted the string-classifiers branch May 5, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment