[MRG] Add iterable values support for dictvectorizer #17367

Merged
merged 68 commits into from Aug 1, 2020
Changes from 22 commits
Commits
7010a09
add element support iterable values
yupbank Apr 16, 2017
54898f0
adjust py3 compatiable
yupbank Apr 16, 2017
da42406
pep8
yupbank Apr 16, 2017
0286d12
address comments
yupbank May 24, 2018
d1fba7c
address comments
yupbank May 24, 2018
6763b1b
address comment
yupbank May 25, 2018
060c3f5
address comments
yupbank Jun 1, 2018
7735eca
fix the rebase conflict
yupbank Jun 1, 2018
261271d
fix doc test
yupbank Jun 1, 2018
2b2a6a8
Sync with upstream master.
cmarmo May 27, 2020
2a0129c
Linting.
cmarmo May 27, 2020
aa6160a
Tentative fix for mypy linting.
cmarmo May 27, 2020
60c12e7
Address comment from #8750. Remove unuseful tests (I think).
cmarmo May 27, 2020
dce4e3b
Fix error in docstring.
cmarmo May 27, 2020
452cdf7
Fix error in docstring.
cmarmo May 28, 2020
74ba92c
Fix error in docstring.
cmarmo May 28, 2020
187fe10
Fix docstring tests.
cmarmo May 28, 2020
d397707
Fix error in docstring.
cmarmo May 28, 2020
770c445
Fix docstring tests.
cmarmo May 28, 2020
347a5c8
Make add_element private. Update what's new.
cmarmo May 28, 2020
32dfb00
Fix docstring tests (again).
cmarmo May 28, 2020
b5205fd
Resolve conflicts.
cmarmo Jun 4, 2020
fda0dc8
Merge branch 'master' into multi-value-dict-vec
cmarmo Jun 22, 2020
60ab819
Support only one level of nesting.
cmarmo Jun 22, 2020
7b2130c
Simplify add iterable element function.
cmarmo Jun 22, 2020
ace8ccf
Address comment in tests.
cmarmo Jun 22, 2020
3095140
Fix lint error.
cmarmo Jun 22, 2020
01525ae
Merge branch 'master' into multi-value-dict-vec
cmarmo Jun 23, 2020
9211a66
Fix feature_name undefined in fit.
cmarmo Jun 23, 2020
375e1a9
Sync with upstream.
cmarmo Jun 27, 2020
12e0e1c
Raise type error for non string iterables. Add test.
cmarmo Jun 27, 2020
108e820
Fix never used variable.
cmarmo Jun 27, 2020
288591e
Fix linting.
cmarmo Jun 27, 2020
afadd21
Change docstring.
cmarmo Jun 27, 2020
21aafc9
Trig CI.
cmarmo Jun 27, 2020
e07309c
Trig CI.
cmarmo Jun 27, 2020
1478de7
Fix docstring.
cmarmo Jun 27, 2020
543bb4f
Fix docstring.. again.
cmarmo Jun 27, 2020
42c8750
Merge remote-tracking branch 'upstream/master' into pr/17367
thomasjpfan Jun 28, 2020
b410610
Address comments.
cmarmo Jun 28, 2020
f5bcddf
Linting.
cmarmo Jun 28, 2020
9fa12a5
Merge branch 'master' into multi-value-dict-vec
cmarmo Jun 29, 2020
633b8ea
Address comments.
cmarmo Jun 29, 2020
acc613f
Address some comments.
cmarmo Jun 29, 2020
4ad6418
Merge branch 'master' into multi-value-dict-vec
cmarmo Jun 30, 2020
e0beb0e
Force unique values for string in iterables.
cmarmo Jun 30, 2020
d7bc480
Update sklearn/feature_extraction/_dict_vectorizer.py
cmarmo Jun 30, 2020
6c2f060
Update test_dict_vectorizer.py
cmarmo Jun 30, 2020
77ca0c2
Update documentation.
cmarmo Jun 30, 2020
1efa784
Update _dict_vectorizer.py
cmarmo Jun 30, 2020
6f1b821
Accept multiple entries.
cmarmo Jun 30, 2020
a8eabc3
Merge branch 'master' into multi-value-dict-vec
cmarmo Jun 30, 2020
cbe2d4b
Trig CI.
cmarmo Jun 30, 2020
6354013
Merge branch 'master' into multi-value-dict-vec
cmarmo Jul 5, 2020
a585b46
Synchronize with upstream.
cmarmo Jul 6, 2020
3ded55f
Merge branch 'master' into multi-value-dict-vec
cmarmo Jul 7, 2020
6eea923
Make keywords only optional parameters in _add_iterable_element.
cmarmo Jul 7, 2020
00911e6
Test counting behaviour.
cmarmo Jul 7, 2020
0f832fb
Apply comments.
cmarmo Jul 7, 2020
1395424
MOve multiple values example in the documentation.
cmarmo Jul 7, 2020
3ccf0aa
Fix doc tests.
cmarmo Jul 7, 2020
05bb6b0
Fix doc tests.
cmarmo Jul 7, 2020
28d57ac
Merge branch 'master' into multi-value-dict-vec
cmarmo Jul 9, 2020
66271f8
Merge branch 'master' into multi-value-dict-vec
cmarmo Jul 16, 2020
2c183e7
Merge branch 'master' into multi-value-dict-vec
cmarmo Jul 29, 2020
578ffab
Merge branch 'master' into multi-value-dict-vec
cmarmo Jul 31, 2020
b77e9e6
Address comments.
cmarmo Jul 31, 2020
c68a160
Fix linting errors.
cmarmo Jul 31, 2020
7 changes: 7 additions & 0 deletions doc/whats_new/v0.24.rst
@@ -52,6 +52,13 @@ Changelog
which allows monitoring of each stage.
:pr:`16985` by :user:`Hao Chun Chang <haochunchang>`.

:mod:`sklearn.feature_extraction`
.................................

- |Enhancement| :class:`feature_extraction.DictVectorizer` accepts multiple
values for one categorical feature. :pr:`17367` by :user:`Peng Yu <yupbank>`
and :user:`Chiara Marmo <cmarmo>`

:mod:`sklearn.feature_selection`
................................

77 changes: 57 additions & 20 deletions sklearn/feature_extraction/_dict_vectorizer.py
@@ -3,8 +3,9 @@
# License: BSD 3 clause

from array import array
from collections.abc import Mapping
from collections.abc import Mapping, Iterable
from operator import itemgetter
from numbers import Number

import numpy as np
import scipy.sparse as sp
@@ -35,6 +36,10 @@ class DictVectorizer(TransformerMixin, BaseEstimator):
a feature "f" that can take on the values "ham" and "spam" will become two
features in the output, one signifying "f=ham", the other "f=spam".

When feature values are iterables but not mapping, this transformer will
iterate over the values to perform the same transformation to every
element.

However, note that this transformer will only do a binary one-hot encoding
when feature values are of type string. If categorical features are
represented as numeric values such as int, the DictVectorizer can be
@@ -78,12 +83,30 @@ class DictVectorizer(TransformerMixin, BaseEstimator):
>>> X
array([[2., 0., 1.],
[0., 1., 3.]])
>>> v.inverse_transform(X) == \
[{'bar': 2.0, 'foo': 1.0}, {'baz': 1.0, 'foo': 3.0}]
>>> v.inverse_transform(X) == [{'bar': 2.0, 'foo': 1.0},
... {'baz': 1.0, 'foo': 3.0}]
True
>>> v.transform({'foo': 4, 'unseen_feature': 3})
array([[0., 0., 4.]])


Example with multiple values for one categorical values:


>>> D2 = [{'foo': '1', 'bar': '2'}, {'foo': '3', 'baz': '1'},
... {'foo': ['1', '3']}]
>>> X = v.fit_transform(D2)
>>> X
array([[1., 0., 1., 0.],
[0., 1., 0., 1.],
[0., 0., 1., 1.]])
>>> v.inverse_transform(X) == [{'foo=1': 1.0, 'bar=2': 1.0},
... {'foo=3': 1.0, 'baz=1': 1.0},
... {'foo=3': 1.0, 'foo=1': 1.0}]
True
>>> v.transform({'foo': '1', 'unseen_feature': [3]})
array([[0., 0., 1., 0.]])

See also
--------
FeatureHasher : performs vectorization using only a hash function.
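
The doctest above returns keys such as `'foo=1'` from `inverse_transform` rather than the original `{'foo': '1'}` entries. A minimal sketch of why, assuming the inverse step only maps non-zero columns back to the stored feature names (the helper `inverse_rows` is hypothetical and is not part of scikit-learn):

```python
def inverse_rows(dense_rows, feature_names):
    """Map each non-zero column of a dense row back to its feature name."""
    return [
        {feature_names[j]: v for j, v in enumerate(row) if v != 0}
        for row in dense_rows
    ]


# Sorted feature names and matrix taken from the doctest in the hunk above.
feature_names = ["bar=2", "baz=1", "foo=1", "foo=3"]
X = [
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
]

decoded = inverse_rows(X, feature_names)
print(decoded[2])
```

Under that assumption the list `['1', '3']` cannot be recovered as a list: the third sample comes back as two separate one-hot features.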
@@ -98,6 +121,32 @@ def __init__(self, *, dtype=np.float64, separator="=", sparse=True,
self.sparse = sparse
self.sort = sort

def _add_element(self, f, v, feature_names, vocab, fitting=True,
transforming=False, indices=None, values=None):
if isinstance(v, str):
feature_name = "%s%s%s" % (f, self.separator, v)
v = 1
elif isinstance(v, Number) or (v is None):
feature_name = f
elif isinstance(v, Iterable) and not isinstance(v, str):
for vv in v:
self._add_element(f, vv, feature_names, vocab,
fitting, transforming, indices, values)
return
else:
raise ValueError(
'Unsupported Value Type %s for {%s: %s}' % (type(v), f, v))

if fitting:
if feature_name not in vocab:
vocab[feature_name] = len(feature_names)
feature_names.append(feature_name)

if transforming:
if feature_name in vocab:
indices.append(vocab[feature_name])
values.append(self.dtype(v))
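
The branch order in `_add_element` can be read in isolation from the vocabulary bookkeeping. The following is a rough standalone sketch of that dispatch, not the code under review: `expand_feature` is a hypothetical generator that yields `(feature_name, value)` pairs and leaves the `vocab`, `indices` and `values` updates to the caller.

```python
from collections.abc import Iterable
from numbers import Number


def expand_feature(name, value, separator="="):
    """Yield (feature_name, numeric_value) pairs for one dict entry."""
    if isinstance(value, str):
        # Strings are one-hot encoded: the value becomes part of the name.
        yield f"{name}{separator}{value}", 1
    elif isinstance(value, Number) or value is None:
        # Numbers (and None) keep the bare feature name.
        yield name, value
    elif isinstance(value, Iterable):
        # Any other iterable is expanded element by element.
        for item in value:
            yield from expand_feature(name, item, separator)
    else:
        raise ValueError(
            f"Unsupported value type {type(value)} for {{{name}: {value}}}"
        )


pairs = list(expand_feature("foo", ["1", "3"]))
print(pairs)
```

Because strings are tested first, the later `Iterable` branch never sees a `str`, so the sketch omits the extra `not isinstance(v, str)` check that appears in the diff.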

def fit(self, X, y=None):
"""Learn a list of feature name -> indices mappings.

@@ -117,11 +166,7 @@ def fit(self, X, y=None):

for x in X:
for f, v in x.items():
if isinstance(v, str):
f = "%s%s%s" % (f, self.separator, v)
if f not in vocab:
feature_names.append(f)
vocab[f] = len(vocab)
self._add_element(f, v, feature_names, vocab)

if self.sort:
feature_names.sort()
@@ -150,6 +195,8 @@ def _transform(self, X, fitting):
feature_names = self.feature_names_
vocab = self.vocabulary_

transforming = True

# Process everything as sparse regardless of setting
X = [X] if isinstance(X, Mapping) else X

@@ -163,18 +210,8 @@ def _transform(self, X, fitting):
# same time
for x in X:
for f, v in x.items():
if isinstance(v, str):
f = "%s%s%s" % (f, self.separator, v)
v = 1
if f in vocab:
indices.append(vocab[f])
values.append(dtype(v))
else:
if fitting:
feature_names.append(f)
vocab[f] = len(vocab)
indices.append(vocab[f])
values.append(dtype(v))
self._add_element(f, v, feature_names, vocab,
fitting, transforming, indices, values)

indptr.append(len(indices))
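
For readers less familiar with the `indices` / `values` / `indptr` triple that `_transform` fills in, here is a small pure-Python sketch of how those three lists describe a matrix. It assumes the usual CSR convention that row `i` owns the slice `indptr[i]:indptr[i + 1]`, and that repeated column indices within a row are summed when the matrix is densified, which is how SciPy's CSR format handles duplicates as far as I know. `csr_to_dense` is a hypothetical helper written for illustration only.

```python
def csr_to_dense(indices, values, indptr, n_features):
    """Expand CSR-style lists into a list of dense rows."""
    rows = []
    for start, end in zip(indptr[:-1], indptr[1:]):
        row = [0.0] * n_features
        for col, val in zip(indices[start:end], values[start:end]):
            row[col] += val  # duplicate column indices accumulate
        rows.append(row)
    return rows


# Row 0 touches columns 0 and 2; row 1 touches column 1 twice,
# as would happen for an iterable value such as ['a', 'a'].
indices = [0, 2, 1, 1]
values = [1.0, 1.0, 1.0, 1.0]
indptr = [0, 2, 4]

dense = csr_to_dense(indices, values, indptr, n_features=3)
print(dense)
```

Under that assumption, appending the same vocabulary index twice for one sample gives a count of 2 in that column rather than a binary indicator.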

17 changes: 17 additions & 0 deletions sklearn/feature_extraction/tests/test_dict_vectorizer.py
@@ -76,6 +76,23 @@ def test_one_of_k():
assert "version" not in names


def test_iterable_value():
D_in = [{"version": ["1", "2"], "ham": 2},
{"version": "2", "spam": .3},
{"version=3": True, "spam": -1}]
v = DictVectorizer()
X = v.fit_transform(D_in)
assert X.shape == (3, 5)

D_out = v.inverse_transform(X)
assert D_out[0] == {"version=1": 1, "version=2": 1, "ham": 2}

names = v.get_feature_names()
assert "version=2" in names
assert "version=1" in names
assert "version" not in names
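
The expected shape `(3, 5)` in `test_iterable_value` can be checked by hand. The sketch below recomputes the feature names in plain Python, assuming string values (including those inside a list) become `key=value` features and everything else keeps the bare key. `feature_names_for` is a hypothetical helper, not part of the test suite.

```python
D_in = [
    {"version": ["1", "2"], "ham": 2},
    {"version": "2", "spam": 0.3},
    {"version=3": True, "spam": -1},
]


def feature_names_for(samples, separator="="):
    """Collect the sorted set of feature names the samples would produce."""
    names = set()
    for sample in samples:
        for key, value in sample.items():
            # Treat a list/tuple/set as several values for the same key.
            items = value if isinstance(value, (list, tuple, set)) else [value]
            for item in items:
                if isinstance(item, str):
                    names.add(f"{key}{separator}{item}")
                else:
                    names.add(key)
    return sorted(names)


names = feature_names_for(D_in)
print(names, len(names))
```

The literal key `"version=3"` with a boolean value contributes a fifth column alongside `ham`, `spam`, `version=1` and `version=2`, which is where the 5 in the asserted shape comes from.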


def test_unseen_or_no_features():
D = [{"camelot": 0, "spamalot": 1}]
for sparse in [True, False]: