-
-
Notifications
You must be signed in to change notification settings - Fork 26.2k
[MRG] add iterable values support for dictvectorizer #8750
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
f1b51f6
to
488562d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we get a benchmark for how much slower this makes the incumbent behaviour, if at all?
def add_element(self, f, v, feature_names, vocab, fitting=True, | ||
transforming=False, indices=None, values=None): | ||
if not isinstance(v, six.string_types) and isinstance(v, Iterable): | ||
for vv in v: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we want to have iterables of anything but strings. I think it would be worth our while raising an error here, even if that slows things down.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm.. it's not just a refactor but also address this #6045
I think it is convenient to accept list as a value of dict in DictVectorizer.fit()
and yeah. I'll add some benchmark data for this
before this pr
after this pr
|
So that's double the time for something the user could already express by
pre-transforming their data. Could you benchmark comparison to few samples
with lots of entries in the dict? I.e. is this due to the function call, or
due to the isinstance check?
…On 26 May 2017 at 01:41, Peng Yu ***@***.***> wrote:
before this pr
In [2]: a = DictVectorizer()
In [3]: data = [dict(a=i) for i in xrange(100)]
In [5]: %timeit a.fit_transform(data)
...:
...:
1000 loops, best of 3: 468 µs per loop
after this pr
In [5]: data = [dict(a=i) for i in xrange(100)]
In [6]: %timeit a.fit_transform(data)
1000 loops, best of 3: 878 µs per loop
In [2]: data_list = [dict(a=[i]) for i in xrange(100)]
In [4]: %timeit a.fit_transform(data_list)
1000 loops, best of 3: 1.04 ms per loop
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#8750 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAEz6__7-4wT3jHVNSsrsi-yUZMPzxoYks5r9aFGgaJpZM4M-oWM>
.
|
elif isinstance(v, Number) or (v is True) or\ | ||
(v is False) or (v is None): | ||
feature_name = f | ||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you should put the iterable check here
pass | ||
|
||
return Xa | ||
return self._transform(X, fitting=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's happened to the dense case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if isinstance(v, six.string_types): | ||
feature_name = "%s%s%s" % (f, self.separator, v) | ||
v = 1 | ||
elif isinstance(v, Number) or (v is True) or\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
>>> isinstance(True, Number)
True
But more generally, I don't understand why we need this check. We've never done it before and it only helps in a case where creating the matrix will ultimately fail. And there might be numpy types that we should be allowing here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In [11]: isinstance(np.int32(10), numbers.Number)
Out[11]: True
@yupbank are you intending to complete this? |
Should we mark this for someone to take over then? |
Oh... sorry, I’ll pick this up! Total forget about this for no reason |
sweet, thanks. No worries, things happen. This looks like a great contribution :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please update the docstring and user guide
v = 1 | ||
elif isinstance(v, Number) or (v is None): | ||
feature_name = f | ||
elif isinstance(v, Iterable): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am uncomfortable including Mappings here.
feature_name = f | ||
elif isinstance(v, Iterable): | ||
for vv in v: | ||
self.add_element(f, vv, feature_names, vocab, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will flatten a nested list. Is there a reason we would want that? Is there a rain we would not want that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The nested list case should, at least, be tested, for either support or failure
@@ -77,6 +77,23 @@ def test_one_of_k(): | |||
assert_false("version" in names) | |||
|
|||
|
|||
def test_iterable_value(): | |||
D_in = [{"version": ["1", "2"], "ham": 2}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please also test fit_transform
and transform
where this is a finite generator rather than a list
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
imap and xrange don't exist in Python 3
|
||
names = v.get_feature_names() | ||
assert_true("version=2" in names) | ||
assert_true("version=1" in names) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We no longer use nosetests. You should use bare assert here
{"version=3": True, "spam": -1}] | ||
v = DictVectorizer(sparse=False) | ||
X = v.fit_transform(D_in) | ||
assert_equal(X.shape, (3, 5)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use assert X.shape == (3, 5)
rather than this function
{"version": "2", "spam": .3}, | ||
{"version=3": True, "spam": -1}] | ||
v = DictVectorizer() | ||
try: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will pass if the error is not raised. Use pytest.raises
or assert_raise_message
fc7cd16
to
81a38de
Compare
81a38de
to
060c3f5
Compare
fb57cf0
to
dbb4852
Compare
dbb4852
to
261271d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I didn't spot this sooner!
@@ -85,6 +89,17 @@ class DictVectorizer(BaseEstimator, TransformerMixin): | |||
True | |||
>>> v.transform({'foo': 4, 'unseen_feature': 3}) | |||
array([[0., 0., 4.]]) | |||
>>> D2 = [{'foo': '1', 'bar': '2'}, {'foo': '3', 'baz': '1'}, {'foo': ['1', '3']}] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The purpose of each additional example needs to be clear. Put a line of explanation before each
elif isinstance(v, Iterable) and not isinstance(v, Mapping): | ||
for vv in v: | ||
self.add_element(f, vv, feature_names, vocab, | ||
fitting, transforming, indices, values) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm now thinking that this solution is too general. Lists of numeric values don't make sense, and I don't think we benefit from expanding nested lists. So rather than recursion, please just handle this case specially as a variant of the string value case, and raise an error if any elements of the iterable are not string-like
assert_equal(X.shape, (3, 5)) | ||
assert_equal(X.sum(), 14) | ||
assert_equal(v.inverse_transform(X), D) | ||
assert sp.issparse(X) == sparse |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although we have a different convention for new code, updating existing code makes this PR hard to review
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
then don't..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case it's not terribly bad, cv which is why I didn't suggest reverting, but for future reference it's better to keep a PR focused on one issue
Huh? Why closed? |
Reference Issue
fixes #6045
What does this implement/fix? Explain your changes.
add iterable support for value types in dictvectorizer
Any other comments?
It would be helpful if we add a
feature_dimension_
property forDictVectorizer
to make the dictvectorizer support aggregation over list/array of numbers or vectors^ this type of input would fail