[MRG] add iterable values support for dictvectorizer #8750

yupbank · 2017-04-16T13:42:38Z

Reference Issue

What does this implement/fix? Explain your changes.

add iterable support for value types in dictvectorizer

Any other comments?

It would be helpful to add a feature_dimension_ property to DictVectorizer so that it can support aggregation over lists/arrays of numbers or vectors.

data = [{'a': [1, 2, 3],  'b': {1: 'fail', 2: 'fail'}}]

^ this type of input would fail

jnothman

Could we get a benchmark for how much slower this makes the incumbent behaviour, if at all?

jnothman · 2017-05-25T12:11:20Z

sklearn/feature_extraction/dict_vectorizer.py

+    def add_element(self, f, v, feature_names, vocab, fitting=True,
+                    transforming=False, indices=None, values=None):
+        if not isinstance(v, six.string_types) and isinstance(v, Iterable):
+            for vv in v:


I don't think we want to have iterables of anything but strings. I think it would be worth our while raising an error here, even if that slows things down.

Hmm, it's not just a refactor; it also addresses #6045.

I think it is convenient to accept a list as a value of a dict in DictVectorizer.fit().

And yeah, I'll add some benchmark data for this.

yupbank · 2017-05-25T15:41:56Z

before this pr

In [2]: a = DictVectorizer()

In [3]: data = [dict(a=i) for i in xrange(100)]

In [5]: %timeit a.fit_transform(data)
   ...:
   ...:
1000 loops, best of 3: 468 µs per loop

after this pr

In [5]: data = [dict(a=i) for i in xrange(100)]

In [6]:  %timeit a.fit_transform(data)
1000 loops, best of 3: 878 µs per loop

In [2]: data_list = [dict(a=[i]) for i in xrange(100)]
In [4]:  %timeit a.fit_transform(data_list)

1000 loops, best of 3: 1.04 ms per loop

jnothman · 2017-05-25T23:05:00Z

So that's double the time for something the user could already express by pre-transforming their data. Could you also benchmark the case of few samples with lots of entries in each dict? I.e. is the slowdown due to the function call, or due to the isinstance check?
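
For reference, the pre-transformation mentioned above can be sketched in plain Python. It relies on DictVectorizer's existing convention of encoding a string value `v` under key `f` as the feature `f=v`. The helper name `expand_multivalued` and its `separator` argument are assumptions made for illustration and are not part of scikit-learn.

```python
def expand_multivalued(sample, separator="="):
    """Rewrite list/tuple/set values as one-hot style entries.

    Hypothetical helper (not scikit-learn API): {'version': ['1', '2']}
    becomes {'version=1': 1, 'version=2': 1}, i.e. ordinary numeric
    features whose names match the 'f=v' naming of string values.
    """
    expanded = {}
    for key, value in sample.items():
        if isinstance(value, (list, tuple, set, frozenset)):
            for item in value:
                expanded["%s%s%s" % (key, separator, item)] = 1
        else:
            # Plain strings and numbers are left untouched.
            expanded[key] = value
    return expanded


data = [{"version": ["1", "2"], "ham": 2}, {"version": "2", "spam": 0.3}]
pre_transformed = [expand_multivalued(d) for d in data]
```

The second sample keeps its plain string value, which the vectorizer would name `version=2`, the same feature name the first sample produces explicitly.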

jnothman · 2017-05-25T23:07:27Z

sklearn/feature_extraction/dict_vectorizer.py

+            elif isinstance(v, Number) or (v is True) or\
+                    (v is False) or (v is None):
+                feature_name = f
+            else:


I think you should put the iterable check here

jnothman · 2017-05-25T23:10:04Z

sklearn/feature_extraction/dict_vectorizer.py

-                        pass
-
-            return Xa
+        return self._transform(X, fitting=False)


What's happened to the dense case?

it's taken care of here https://github.com/scikit-learn/scikit-learn/pull/8750/files/488562d95c92dd03a2b283eb0580e15a199ee2aa#diff-db0cc7cd94253605c2325da001c2983aR219

jnothman · 2017-05-25T23:10:33Z

sklearn/feature_extraction/dict_vectorizer.py

+            if isinstance(v, six.string_types):
+                feature_name = "%s%s%s" % (f, self.separator, v)
+                v = 1
+            elif isinstance(v, Number) or (v is True) or\


>>> isinstance(True, Number)
True

But more generally, I don't understand why we need this check. We've never done it before and it only helps in a case where creating the matrix will ultimately fail. And there might be numpy types that we should be allowing here.

In [11]: isinstance(np.int32(10), numbers.Number)
Out[11]: True
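
The observation about `bool` can be checked with the standard library alone. This is a small sketch rather than code from the PR: it shows that the explicit `v is True` / `v is False` checks add nothing to the `Number` test, while `None` is not a `Number`.

```python
import numbers

# bool is a subclass of int, so True and False already satisfy the
# Number check; None and strings do not.
checks = {
    "True": isinstance(True, numbers.Number),
    "False": isinstance(False, numbers.Number),
    "None": isinstance(None, numbers.Number),
    "1.5": isinstance(1.5, numbers.Number),
    "'1'": isinstance("1", numbers.Number),
}

for label, result in checks.items():
    print(label, result)
```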

jnothman · 2018-01-29T12:38:22Z

@yupbank are you intending to complete this?

amueller · 2018-05-23T21:17:54Z

Should we mark this for someone to take over then?

yupbank · 2018-05-23T23:03:34Z

Oh... sorry, I’ll pick this up! I totally forgot about this for no reason.

amueller · 2018-05-24T01:01:11Z

sweet, thanks. No worries, things happen. This looks like a great contribution :)

jnothman

Please update the docstring and user guide

jnothman · 2018-05-24T08:14:02Z

sklearn/feature_extraction/dict_vectorizer.py

+            v = 1
+        elif isinstance(v, Number) or (v is None):
+            feature_name = f
+        elif isinstance(v, Iterable):


I am uncomfortable including Mappings here.

jnothman · 2018-05-24T08:15:11Z

sklearn/feature_extraction/dict_vectorizer.py

+            feature_name = f
+        elif isinstance(v, Iterable):
+            for vv in v:
+                self.add_element(f, vv, feature_names, vocab,


This will flatten a nested list. Is there a reason we would want that? Is there a reason we would not want that?

The nested list case should, at least, be tested, for either support or failure
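
The flattening described above can be reproduced with a standalone imitation of the recursive branch. The function `collect_features` is hypothetical and much simpler than the PR's `add_element`; it only tracks feature names.

```python
from collections.abc import Iterable


def collect_features(f, v, separator="=", out=None):
    """Simplified, hypothetical imitation of the recursive branch."""
    if out is None:
        out = []
    if isinstance(v, str):
        out.append("%s%s%s" % (f, separator, v))
    elif isinstance(v, Iterable):
        for vv in v:
            # Recursing on every element is what erases the nesting.
            collect_features(f, vv, separator, out)
    else:
        out.append(f)
    return out


flat = collect_features("version", ["1", "2", "3"])
nested = collect_features("version", [["1", "2"], ["3"]])
```

Both calls yield the same feature names, so a nested list and a flat list become indistinguishable after vectorization.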

jnothman · 2018-05-24T08:17:11Z

sklearn/feature_extraction/tests/test_dict_vectorizer.py

@@ -77,6 +77,23 @@ def test_one_of_k():
    assert_false("version" in names)


+def test_iterable_value():
+    D_in = [{"version": ["1", "2"], "ham": 2},


Please also test fit_transform and transform where this is a finite generator rather than a list
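
One possible reason the generator case deserves its own test is that a generator can only be consumed once, so a value that is read during one pass is empty on the next pass over the same dict. A minimal sketch of that pitfall, using a hypothetical `feature_names` helper as a stand-in for the vectorizer:

```python
def feature_names(sample, separator="="):
    """Hypothetical stand-in for one pass over a sample's iterable values."""
    names = []
    for key, value in sample.items():
        for item in value:
            names.append("%s%s%s" % (key, separator, item))
    return names


# A finite generator used as the dict value.
sample = {"version": (v for v in ["1", "2"])}

first_pass = feature_names(sample)   # consumes the generator
second_pass = feature_names(sample)  # the generator is now exhausted
```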

jnothman

imap and xrange don't exist in Python 3

jnothman · 2018-05-26T21:20:08Z

sklearn/feature_extraction/tests/test_dict_vectorizer.py

+
+    names = v.get_feature_names()
+    assert_true("version=2" in names)
+    assert_true("version=1" in names)


We no longer use nosetests. You should use bare assert here

jnothman · 2018-05-26T21:21:05Z

sklearn/feature_extraction/tests/test_dict_vectorizer.py

+            {"version=3": True, "spam": -1}]
+    v = DictVectorizer(sparse=False)
+    X = v.fit_transform(D_in)
+    assert_equal(X.shape, (3, 5))


Use assert X.shape == (3, 5) rather than this function

jnothman · 2018-05-26T21:22:50Z

sklearn/feature_extraction/tests/test_dict_vectorizer.py

+            {"version": "2", "spam": .3},
+            {"version=3": True, "spam": -1}]
+    v = DictVectorizer()
+    try:


This will pass if the error is not raised. Use pytest.raises or assert_raise_message
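
The problem with the try/except pattern can be shown without any test framework. In this sketch the function names and the choice of `TypeError` are assumptions; the point is only that the first pattern cannot fail when no exception is raised, whereas the second behaves the way `pytest.raises` does.

```python
def always_succeeds():
    """Stand-in for a call that was expected to raise but does not."""
    return "no error"


def flawed_check(func):
    # Mirrors the try/except pattern: nothing fails if func() never raises.
    try:
        func()
    except TypeError:
        pass
    return "passed"


def strict_check(func):
    # The else branch turns a missing exception into a failure, which is
    # the guarantee that pytest.raises provides.
    try:
        func()
    except TypeError:
        return "passed"
    else:
        raise AssertionError("TypeError was not raised")


flawed_result = flawed_check(always_succeeds)

try:
    strict_check(always_succeeds)
    strict_result = "passed"
except AssertionError:
    strict_result = "failed"
```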

jnothman

Sorry I didn't spot this sooner!

jnothman · 2018-06-02T11:24:03Z

sklearn/feature_extraction/dict_vectorizer.py

@@ -85,6 +89,17 @@ class DictVectorizer(BaseEstimator, TransformerMixin):
    True
    >>> v.transform({'foo': 4, 'unseen_feature': 3})
    array([[0., 0., 4.]])
+    >>> D2 = [{'foo': '1', 'bar': '2'}, {'foo': '3', 'baz': '1'}, {'foo': ['1', '3']}]


The purpose of each additional example needs to be clear. Put a line of explanation before each

jnothman · 2018-06-02T11:27:25Z

sklearn/feature_extraction/dict_vectorizer.py

+        elif isinstance(v, Iterable) and not isinstance(v, Mapping):
+            for vv in v:
+                self.add_element(f, vv, feature_names, vocab,
+                                 fitting, transforming, indices, values)


I'm now thinking that this solution is too general. Lists of numeric values don't make sense, and I don't think we benefit from expanding nested lists. So rather than recursion, please just handle this case specially as a variant of the string value case, and raise an error if any elements of the iterable are not string-like
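
A minimal sketch of the non-recursive behaviour requested here, written as a standalone function. The name `expand_value`, its signature, and the error messages are assumptions for illustration and do not reflect the code that was eventually merged.

```python
from collections.abc import Iterable, Mapping
from numbers import Number


def expand_value(f, v, separator="="):
    """Return (feature_name, value) pairs for one dict entry.

    Hypothetical helper: iterables are treated as a variant of the
    string case, and any non-string element raises an error.
    """
    if isinstance(v, str):
        return [("%s%s%s" % (f, separator, v), 1)]
    if isinstance(v, Number) or v is None:
        return [(f, v)]
    if isinstance(v, Iterable) and not isinstance(v, Mapping):
        pairs = []
        for vv in v:
            if not isinstance(vv, str):
                raise TypeError(
                    "Unsupported type %s in iterable value for feature %r; "
                    "only strings are allowed" % (type(vv).__name__, f)
                )
            pairs.append(("%s%s%s" % (f, separator, vv), 1))
        return pairs
    raise TypeError(
        "Unsupported value type %s for feature %r" % (type(v).__name__, f)
    )
```

With this shape, `expand_value("version", ["1", "2"])` yields one entry per string, while lists of numbers, nested lists, and mappings are all rejected instead of being silently flattened.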

jnothman · 2018-06-02T11:29:31Z

sklearn/feature_extraction/tests/test_dict_vectorizer.py

-                    assert_equal(X.shape, (3, 5))
-                    assert_equal(X.sum(), 14)
-                    assert_equal(v.inverse_transform(X), D)
+                    assert sp.issparse(X) == sparse


Although we have a different convention for new code, updating existing code makes this PR hard to review

then don't..

In this case it's not terribly bad, which is why I didn't suggest reverting, but for future reference it's better to keep a PR focused on one issue.

jnothman · 2018-06-02T12:42:57Z

Huh? Why closed?

yupbank force-pushed the multi-value-dict-vec branch 3 times, most recently from f1b51f6 to 488562d Compare April 17, 2017 14:51

yupbank changed the title ~~add support iterable values for dictvectorizer~~ add iterable values support for dictvectorizer Apr 17, 2017

yupbank changed the title ~~add iterable values support for dictvectorizer~~ [MRG] add iterable values support for dictvectorizer Apr 17, 2017

jnothman approved these changes May 25, 2017

jnothman reviewed May 25, 2017

jnothman reviewed May 24, 2018

jnothman reviewed May 26, 2018

jnothman mentioned this pull request May 28, 2018

Accept multiple values for one categorical feature #6045

Closed

yupbank force-pushed the multi-value-dict-vec branch from fc7cd16 to 81a38de Compare June 1, 2018 02:04

yupbank added 7 commits May 31, 2018 22:06

add element support iterable values

7010a09

adjust py3 compatiable

54898f0

pep8

da42406

address comments

0286d12

address comments

d1fba7c

address comment

6763b1b

address comments

060c3f5

yupbank force-pushed the multi-value-dict-vec branch from 81a38de to 060c3f5 Compare June 1, 2018 02:08

fix the rebase conflict

7735eca

yupbank force-pushed the multi-value-dict-vec branch 3 times, most recently from fb57cf0 to dbb4852 Compare June 1, 2018 15:15

fix doc test

261271d

yupbank force-pushed the multi-value-dict-vec branch from dbb4852 to 261271d Compare June 1, 2018 15:24

jnothman reviewed Jun 2, 2018

yupbank closed this Jun 2, 2018

cmarmo mentioned this pull request May 27, 2020

[MRG] Add iterable values support for dictvectorizer #17367

Merged

cmarmo added a commit to cmarmo/scikit-learn that referenced this pull request May 27, 2020

Address comment from scikit-learn#8750. Remove unuseful tests (I think).

60c12e7
