Skip to content

Conversation

yupbank
Copy link
Contributor

@yupbank yupbank commented Apr 16, 2017

Reference Issue

fixes #6045

What does this implement/fix? Explain your changes.

add iterable support for value types in dictvectorizer

Any other comments?

It would be helpful if we add a feature_dimension_ property for DictVectorizer to make the dictvectorizer support aggregation over list/array of numbers or vectors

data = [{'a': [1, 2, 3],  'b': {1: 'fail', 2: 'fail'}}] 

^ this type of input would fail

@yupbank yupbank force-pushed the multi-value-dict-vec branch 3 times, most recently from f1b51f6 to 488562d Compare April 17, 2017 14:51
@yupbank yupbank changed the title add support iterable values for dictvectorizer add iterable values support for dictvectorizer Apr 17, 2017
@yupbank yupbank changed the title add iterable values support for dictvectorizer [MRG] add iterable values support for dictvectorizer Apr 17, 2017
Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we get a benchmark for how much slower this makes the incumbent behaviour, if at all?

def add_element(self, f, v, feature_names, vocab, fitting=True,
transforming=False, indices=None, values=None):
if not isinstance(v, six.string_types) and isinstance(v, Iterable):
for vv in v:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we want to have iterables of anything but strings. I think it would be worth our while raising an error here, even if that slows things down.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm.. it's not just a refactor but also address this #6045

I think it is convenient to accept list as a value of dict in DictVectorizer.fit()

and yeah. I'll add some benchmark data for this

@yupbank
Copy link
Contributor Author

yupbank commented May 25, 2017

before this pr

In [2]: a = DictVectorizer()

In [3]: data = [dict(a=i) for i in xrange(100)]

In [5]: %timeit a.fit_transform(data)
   ...:
   ...:
1000 loops, best of 3: 468 µs per loop

after this pr

In [5]: data = [dict(a=i) for i in xrange(100)]

In [6]:  %timeit a.fit_transform(data)
1000 loops, best of 3: 878 µs per loop

In [2]: data_list = [dict(a=[i]) for i in xrange(100)]
In [4]:  %timeit a.fit_transform(data_list)

1000 loops, best of 3: 1.04 ms per loop

@jnothman
Copy link
Member

jnothman commented May 25, 2017 via email

elif isinstance(v, Number) or (v is True) or\
(v is False) or (v is None):
feature_name = f
else:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you should put the iterable check here

pass

return Xa
return self._transform(X, fitting=False)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's happened to the dense case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if isinstance(v, six.string_types):
feature_name = "%s%s%s" % (f, self.separator, v)
v = 1
elif isinstance(v, Number) or (v is True) or\
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

>>> isinstance(True, Number)
True

But more generally, I don't understand why we need this check. We've never done it before and it only helps in a case where creating the matrix will ultimately fail. And there might be numpy types that we should be allowing here.

Copy link
Contributor Author

@yupbank yupbank May 24, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In [11]: isinstance(np.int32(10), numbers.Number)
Out[11]: True

@jnothman
Copy link
Member

@yupbank are you intending to complete this?

@amueller
Copy link
Member

Should we mark this for someone to take over then?

@yupbank
Copy link
Contributor Author

yupbank commented May 23, 2018

Oh... sorry, I’ll pick this up! Total forget about this for no reason

@amueller
Copy link
Member

sweet, thanks. No worries, things happen. This looks like a great contribution :)

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update the docstring and user guide

v = 1
elif isinstance(v, Number) or (v is None):
feature_name = f
elif isinstance(v, Iterable):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am uncomfortable including Mappings here.

feature_name = f
elif isinstance(v, Iterable):
for vv in v:
self.add_element(f, vv, feature_names, vocab,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will flatten a nested list. Is there a reason we would want that? Is there a rain we would not want that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The nested list case should, at least, be tested, for either support or failure

@@ -77,6 +77,23 @@ def test_one_of_k():
assert_false("version" in names)


def test_iterable_value():
D_in = [{"version": ["1", "2"], "ham": 2},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also test fit_transform and transform where this is a finite generator rather than a list

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

imap and xrange don't exist in Python 3


names = v.get_feature_names()
assert_true("version=2" in names)
assert_true("version=1" in names)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We no longer use nosetests. You should use bare assert here

{"version=3": True, "spam": -1}]
v = DictVectorizer(sparse=False)
X = v.fit_transform(D_in)
assert_equal(X.shape, (3, 5))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use assert X.shape == (3, 5) rather than this function

{"version": "2", "spam": .3},
{"version=3": True, "spam": -1}]
v = DictVectorizer()
try:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will pass if the error is not raised. Use pytest.raises or assert_raise_message

@yupbank yupbank force-pushed the multi-value-dict-vec branch from 81a38de to 060c3f5 Compare June 1, 2018 02:08
@yupbank yupbank force-pushed the multi-value-dict-vec branch 3 times, most recently from fb57cf0 to dbb4852 Compare June 1, 2018 15:15
@yupbank yupbank force-pushed the multi-value-dict-vec branch from dbb4852 to 261271d Compare June 1, 2018 15:24
Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I didn't spot this sooner!

@@ -85,6 +89,17 @@ class DictVectorizer(BaseEstimator, TransformerMixin):
True
>>> v.transform({'foo': 4, 'unseen_feature': 3})
array([[0., 0., 4.]])
>>> D2 = [{'foo': '1', 'bar': '2'}, {'foo': '3', 'baz': '1'}, {'foo': ['1', '3']}]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The purpose of each additional example needs to be clear. Put a line of explanation before each

elif isinstance(v, Iterable) and not isinstance(v, Mapping):
for vv in v:
self.add_element(f, vv, feature_names, vocab,
fitting, transforming, indices, values)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm now thinking that this solution is too general. Lists of numeric values don't make sense, and I don't think we benefit from expanding nested lists. So rather than recursion, please just handle this case specially as a variant of the string value case, and raise an error if any elements of the iterable are not string-like

assert_equal(X.shape, (3, 5))
assert_equal(X.sum(), 14)
assert_equal(v.inverse_transform(X), D)
assert sp.issparse(X) == sparse
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although we have a different convention for new code, updating existing code makes this PR hard to review

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then don't..

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case it's not terribly bad, cv which is why I didn't suggest reverting, but for future reference it's better to keep a PR focused on one issue

@yupbank yupbank closed this Jun 2, 2018
@jnothman
Copy link
Member

jnothman commented Jun 2, 2018

Huh? Why closed?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Accept multiple values for one categorical feature
3 participants