[MRG] Add iterable values support for dictvectorizer #17367

Merged
merged 68 commits into from
Aug 1, 2020
7010a09
add element support iterable values
yupbank Apr 16, 2017
54898f0
adjust py3 compatiable
yupbank Apr 16, 2017
da42406
pep8
yupbank Apr 16, 2017
0286d12
address comments
yupbank May 24, 2018
d1fba7c
address comments
yupbank May 24, 2018
6763b1b
address comment
yupbank May 25, 2018
060c3f5
address comments
yupbank Jun 1, 2018
7735eca
fix the rebase conflict
yupbank Jun 1, 2018
261271d
fix doc test
yupbank Jun 1, 2018
2b2a6a8
Sync with upstream master.
cmarmo May 27, 2020
2a0129c
Linting.
cmarmo May 27, 2020
aa6160a
Tentative fix for mypy linting.
cmarmo May 27, 2020
60c12e7
Address comment from #8750. Remove unuseful tests (I think).
cmarmo May 27, 2020
dce4e3b
Fix error in docstring.
cmarmo May 27, 2020
452cdf7
Fix error in docstring.
cmarmo May 28, 2020
74ba92c
Fix error in docstring.
cmarmo May 28, 2020
187fe10
Fix docstring tests.
cmarmo May 28, 2020
d397707
Fix error in docstring.
cmarmo May 28, 2020
770c445
Fix docstring tests.
cmarmo May 28, 2020
347a5c8
Make add_element private. Update what's new.
cmarmo May 28, 2020
32dfb00
Fix docstring tests (again).
cmarmo May 28, 2020
b5205fd
Resolve conflicts.
cmarmo Jun 4, 2020
fda0dc8
Merge branch 'master' into multi-value-dict-vec
cmarmo Jun 22, 2020
60ab819
Support only one level of nesting.
cmarmo Jun 22, 2020
7b2130c
Simplify add iterable element function.
cmarmo Jun 22, 2020
ace8ccf
Address comment in tests.
cmarmo Jun 22, 2020
3095140
Fix lint error.
cmarmo Jun 22, 2020
01525ae
Merge branch 'master' into multi-value-dict-vec
cmarmo Jun 23, 2020
9211a66
Fix feature_name undefined in fit.
cmarmo Jun 23, 2020
375e1a9
Sync with upstream.
cmarmo Jun 27, 2020
12e0e1c
Raise type error for non string iterables. Add test.
cmarmo Jun 27, 2020
108e820
Fix never used variable.
cmarmo Jun 27, 2020
288591e
Fix linting.
cmarmo Jun 27, 2020
afadd21
Change docstring.
cmarmo Jun 27, 2020
21aafc9
Trig CI.
cmarmo Jun 27, 2020
e07309c
Trig CI.
cmarmo Jun 27, 2020
1478de7
Fix docstring.
cmarmo Jun 27, 2020
543bb4f
Fix docstring.. again.
cmarmo Jun 27, 2020
42c8750
Merge remote-tracking branch 'upstream/master' into pr/17367
thomasjpfan Jun 28, 2020
b410610
Address comments.
cmarmo Jun 28, 2020
f5bcddf
Linting.
cmarmo Jun 28, 2020
9fa12a5
Merge branch 'master' into multi-value-dict-vec
cmarmo Jun 29, 2020
633b8ea
Address comments.
cmarmo Jun 29, 2020
acc613f
Address some comments.
cmarmo Jun 29, 2020
4ad6418
Merge branch 'master' into multi-value-dict-vec
cmarmo Jun 30, 2020
e0beb0e
Force unique values for string in iterables.
cmarmo Jun 30, 2020
d7bc480
Update sklearn/feature_extraction/_dict_vectorizer.py
cmarmo Jun 30, 2020
6c2f060
Update test_dict_vectorizer.py
cmarmo Jun 30, 2020
77ca0c2
Update documentation.
cmarmo Jun 30, 2020
1efa784
Update _dict_vectorizer.py
cmarmo Jun 30, 2020
6f1b821
Accept multiple entries.
cmarmo Jun 30, 2020
a8eabc3
Merge branch 'master' into multi-value-dict-vec
cmarmo Jun 30, 2020
cbe2d4b
Trig CI.
cmarmo Jun 30, 2020
6354013
Merge branch 'master' into multi-value-dict-vec
cmarmo Jul 5, 2020
a585b46
Synchronize with upstream.
cmarmo Jul 6, 2020
3ded55f
Merge branch 'master' into multi-value-dict-vec
cmarmo Jul 7, 2020
6eea923
Make keywords only optional parameters in _add_iterable_element.
cmarmo Jul 7, 2020
00911e6
Test counting behaviour.
cmarmo Jul 7, 2020
0f832fb
Apply comments.
cmarmo Jul 7, 2020
1395424
MOve multiple values example in the documentation.
cmarmo Jul 7, 2020
3ccf0aa
Fix doc tests.
cmarmo Jul 7, 2020
05bb6b0
Fix doc tests.
cmarmo Jul 7, 2020
28d57ac
Merge branch 'master' into multi-value-dict-vec
cmarmo Jul 9, 2020
66271f8
Merge branch 'master' into multi-value-dict-vec
cmarmo Jul 16, 2020
2c183e7
Merge branch 'master' into multi-value-dict-vec
cmarmo Jul 29, 2020
578ffab
Merge branch 'master' into multi-value-dict-vec
cmarmo Jul 31, 2020
b77e9e6
Address comments.
cmarmo Jul 31, 2020
c68a160
Fix linting errors.
cmarmo Jul 31, 2020
21 changes: 21 additions & 0 deletions doc/modules/feature_extraction.rst
Expand Up @@ -56,6 +56,27 @@ is a traditional numerical feature::
>>> vec.get_feature_names()
['city=Dubai', 'city=London', 'city=San Francisco', 'temperature']

:class:`DictVectorizer` accepts multiple string values for one
feature, e.g. multiple categories for a movie.

Assume a database classifies each movie using some categories (not mandatory)
and its year of release.

>>> movie_entry = [{'category': ['thriller', 'drama'], 'year': 2003},
... {'category': ['animation', 'family'], 'year': 2011},
... {'year': 1974}]
>>> vec.fit_transform(movie_entry).toarray()
array([[0.000e+00, 1.000e+00, 0.000e+00, 1.000e+00, 2.003e+03],
[1.000e+00, 0.000e+00, 1.000e+00, 0.000e+00, 2.011e+03],
[0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 1.974e+03]])
>>> vec.get_feature_names() == ['category=animation', 'category=drama',
... 'category=family', 'category=thriller',
... 'year']
True
>>> vec.transform({'category': ['thriller'],
... 'unseen_feature': '3'}).toarray()
array([[0., 0., 0., 1., 0.]])

:class:`DictVectorizer` is also a useful representation transformation
for training sequence classifiers in Natural Language Processing models
that typically work by extracting feature windows around a particular
Expand Down
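To make the documented behaviour concrete, here is a minimal pure-Python sketch of the encoding shown in the new doctest. It is not the scikit-learn implementation: `feature_pairs`, `fit_vocabulary` and `transform` are hypothetical helper names, the output is dense lists rather than a SciPy sparse matrix, and any value that is neither a string nor a number is assumed to be an iterable of strings.

```python
from numbers import Number


def feature_pairs(sample, separator="="):
    """Yield (feature_name, value) pairs for one dict sample."""
    for key, value in sample.items():
        if isinstance(value, str):
            # A single string becomes one "key=value" indicator.
            yield f"{key}{separator}{value}", 1
        elif isinstance(value, Number):
            # Numbers keep the plain key as feature name.
            yield key, value
        else:
            # Assumption: everything else is an iterable of strings.
            for item in value:
                yield f"{key}{separator}{item}", 1


def fit_vocabulary(samples):
    """Map each feature name to a column index, sorted by name."""
    names = sorted({name for s in samples for name, _ in feature_pairs(s)})
    return {name: index for index, name in enumerate(names)}


def transform(samples, vocab):
    """Build dense rows; features absent from vocab are silently dropped."""
    rows = []
    for sample in samples:
        row = [0.0] * len(vocab)
        for name, value in feature_pairs(sample):
            if name in vocab:
                row[vocab[name]] += float(value)
        rows.append(row)
    return rows


movies = [
    {"category": ["thriller", "drama"], "year": 2003},
    {"category": ["animation", "family"], "year": 2011},
    {"year": 1974},
]
vocab = fit_vocabulary(movies)
rows = transform(movies, vocab)
unseen = transform([{"category": ["thriller"], "unseen_feature": "3"}], vocab)
```

With the movie entries from the doctest, `rows` reproduces the documented array, and `unseen` keeps only the `category=thriller` column because `unseen_feature=3` was never fitted.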
7 changes: 7 additions & 0 deletions doc/whats_new/v0.24.rst
Expand Up @@ -162,6 +162,13 @@ Changelog
:class:`exceptions.NonBLASDotWarning` are deprecated and will be removed in
v0.26, :pr:`17804` by `Adrin Jalali`_.

:mod:`sklearn.feature_extraction`
.................................

- |Enhancement| :class:`feature_extraction.DictVectorizer` accepts multiple
values for one categorical feature. :pr:`17367` by :user:`Peng Yu <yupbank>`
and :user:`Chiara Marmo <cmarmo>`.

:mod:`sklearn.feature_selection`
................................

Expand Down
100 changes: 81 additions & 19 deletions sklearn/feature_extraction/_dict_vectorizer.py
Expand Up @@ -3,8 +3,9 @@
# License: BSD 3 clause

from array import array
from collections.abc import Mapping
from collections.abc import Mapping, Iterable
from operator import itemgetter
from numbers import Number

import numpy as np
import scipy.sparse as sp
Expand Down Expand Up @@ -35,10 +36,15 @@ class DictVectorizer(TransformerMixin, BaseEstimator):
a feature "f" that can take on the values "ham" and "spam" will become two
features in the output, one signifying "f=ham", the other "f=spam".

If a feature value is a sequence or set of strings, this transformer
will iterate over the values and will count the occurrences of each string
value.

However, note that this transformer will only do a binary one-hot encoding
when feature values are of type string. If categorical features are
represented as numeric values such as int, the DictVectorizer can be
followed by :class:`~sklearn.preprocessing.OneHotEncoder` to complete
represented as numeric values such as int or iterables of strings, the
DictVectorizer can be followed by
:class:`~sklearn.preprocessing.OneHotEncoder` to complete
binary one-hot encoding.

Features that do not occur in a sample (mapping) will have a zero value
Expand Down Expand Up @@ -78,8 +84,8 @@ class DictVectorizer(TransformerMixin, BaseEstimator):
>>> X
array([[2., 0., 1.],
[0., 1., 3.]])
>>> v.inverse_transform(X) == \
[{'bar': 2.0, 'foo': 1.0}, {'baz': 1.0, 'foo': 3.0}]
>>> v.inverse_transform(X) == [{'bar': 2.0, 'foo': 1.0},
... {'baz': 1.0, 'foo': 3.0}]
True
>>> v.transform({'foo': 4, 'unseen_feature': 3})
array([[0., 0., 4.]])
Expand All @@ -98,6 +104,28 @@ def __init__(self, *, dtype=np.float64, separator="=", sparse=True,
self.sparse = sparse
self.sort = sort

def _add_iterable_element(self, f, v, feature_names, vocab, *,
fitting=True, transforming=False,
indices=None, values=None):
"""Add feature names for iterable of strings"""
for vv in v:
if isinstance(vv, str):
feature_name = "%s%s%s" % (f, self.separator, vv)
vv = 1
else:
raise TypeError(f'Unsupported type {type(vv)} in iterable '
'value. Only iterables of string are '
'supported.')
if fitting and feature_name not in vocab:
vocab[feature_name] = len(feature_names)
feature_names.append(feature_name)

if transforming and feature_name in vocab:
indices.append(vocab[feature_name])
values.append(self.dtype(vv))

return

def fit(self, X, y=None):
"""Learn a list of feature name -> indices mappings.

Expand All @@ -106,6 +134,10 @@ def fit(self, X, y=None):
X : Mapping or iterable over Mappings
Dict(s) or Mapping(s) from feature names (arbitrary Python
objects) to feature values (strings or convertible to dtype).

.. versionchanged:: 0.24
Accepts multiple string values for one categorical feature.

y : (ignored)

Returns
Expand All @@ -118,10 +150,22 @@ def fit(self, X, y=None):
for x in X:
for f, v in x.items():
if isinstance(v, str):
f = "%s%s%s" % (f, self.separator, v)
if f not in vocab:
feature_names.append(f)
vocab[f] = len(vocab)
feature_name = "%s%s%s" % (f, self.separator, v)
v = 1
elif isinstance(v, Number) or (v is None):
feature_name = f
elif isinstance(v, Mapping):
raise TypeError(f'Unsupported value type {type(v)} '
f'for {f}: {v}.\n'
'Mapping objects are not supported.')
elif isinstance(v, Iterable):
feature_name = None
self._add_iterable_element(f, v, feature_names, vocab)

if feature_name is not None:
if feature_name not in vocab:
vocab[feature_name] = len(feature_names)
feature_names.append(feature_name)

if self.sort:
feature_names.sort()
Expand Down Expand Up @@ -150,6 +194,8 @@ def _transform(self, X, fitting):
feature_names = self.feature_names_
vocab = self.vocabulary_

transforming = True

# Process everything as sparse regardless of setting
X = [X] if isinstance(X, Mapping) else X

Expand All @@ -164,17 +210,29 @@ def _transform(self, X, fitting):
for x in X:
for f, v in x.items():
if isinstance(v, str):
f = "%s%s%s" % (f, self.separator, v)
feature_name = "%s%s%s" % (f, self.separator, v)
v = 1
if f in vocab:
indices.append(vocab[f])
values.append(dtype(v))
else:
if fitting:
feature_names.append(f)
vocab[f] = len(vocab)
indices.append(vocab[f])
values.append(dtype(v))
elif isinstance(v, Number) or (v is None):
feature_name = f
elif isinstance(v, Mapping):
raise TypeError(f'Unsupported value type {type(v)} '
f'for {f}: {v}.\n'
'Mapping objects are not supported.')
elif isinstance(v, Iterable):
feature_name = None
self._add_iterable_element(f, v, feature_names, vocab,
fitting=fitting,
transforming=transforming,
indices=indices, values=values)

if feature_name is not None:
if fitting and feature_name not in vocab:
vocab[feature_name] = len(feature_names)
feature_names.append(feature_name)

if feature_name in vocab:
indices.append(vocab[feature_name])
values.append(self.dtype(v))

indptr.append(len(indices))

Expand Down Expand Up @@ -218,6 +276,10 @@ def fit_transform(self, X, y=None):
X : Mapping or iterable over Mappings
Dict(s) or Mapping(s) from feature names (arbitrary Python
objects) to feature values (strings or convertible to dtype).

.. versionchanged:: 0.24
Accepts multiple string values for one categorical feature.

y : (ignored)

Returns
Expand Down
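The order of the `isinstance` checks in `fit` and `_transform` matters, because `str` and `Mapping` instances are themselves `Iterable`. The stand-alone sketch below mirrors that dispatch; `classify_value` and `raises_type_error` are hypothetical names used only for illustration, and the final fallback branch is an assumption that does not appear in the diff.

```python
from collections.abc import Iterable, Mapping
from numbers import Number


def classify_value(value):
    """Return a label for the branch a feature value would take."""
    # str must be tested first: a string is also an Iterable.
    if isinstance(value, str):
        return "one-hot string"
    # bool, int and float all count as Number; None is accepted too.
    if isinstance(value, Number) or value is None:
        return "numeric"
    # Mapping must be rejected before the generic Iterable branch,
    # otherwise a dict would be iterated over its keys.
    if isinstance(value, Mapping):
        raise TypeError(f"Unsupported value type {type(value)}. "
                        "Mapping objects are not supported.")
    if isinstance(value, Iterable):
        for item in value:
            if not isinstance(item, str):
                raise TypeError(f"Unsupported type {type(item)} in iterable "
                                "value. Only iterables of string are "
                                "supported.")
        return "iterable of strings"
    # Assumption: reject anything else explicitly (not in the diff).
    raise TypeError(f"Unsupported value type {type(value)}.")


def raises_type_error(value):
    """Report whether classify_value rejects the given value."""
    try:
        classify_value(value)
    except TypeError:
        return True
    return False
```

For instance, `classify_value(True)` takes the numeric branch, while `raises_type_error({'one': 1})` and `raises_type_error([1, 'three'])` both return `True`, matching the two error tests added in this PR.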
46 changes: 46 additions & 0 deletions sklearn/feature_extraction/tests/test_dict_vectorizer.py
Expand Up @@ -76,6 +76,52 @@ def test_one_of_k():
assert "version" not in names


def test_iterable_value():
D_names = ['ham', 'spam', 'version=1', 'version=2', 'version=3']
X_expected = [[2.0, 0.0, 2.0, 1.0, 0.0],
[0.0, 0.3, 0.0, 1.0, 0.0],
[0.0, -1.0, 0.0, 0.0, 1.0]]
D_in = [{"version": ["1", "2", "1"], "ham": 2},
{"version": "2", "spam": .3},
{"version=3": True, "spam": -1}]
v = DictVectorizer()
X = v.fit_transform(D_in)
X = X.toarray()
assert_array_equal(X, X_expected)

D_out = v.inverse_transform(X)
assert D_out[0] == {"version=1": 2, "version=2": 1, "ham": 2}

names = v.get_feature_names()

assert names == D_names


def test_iterable_not_string_error():
error_value = ("Unsupported type <class 'int'> in iterable value. "
"Only iterables of string are supported.")
D2 = [{'foo': '1', 'bar': '2'},
{'foo': '3', 'baz': '1'},
{'foo': [1, 'three']}]
v = DictVectorizer(sparse=False)
with pytest.raises(TypeError) as error:
v.fit(D2)
assert str(error.value) == error_value


def test_mapping_error():
error_value = ("Unsupported value type <class 'dict'> "
"for foo: {'one': 1, 'three': 3}.\n"
"Mapping objects are not supported.")
D2 = [{'foo': '1', 'bar': '2'},
{'foo': '3', 'baz': '1'},
{'foo': {'one': 1, 'three': 3}}]
v = DictVectorizer(sparse=False)
with pytest.raises(TypeError) as error:
v.fit(D2)
assert str(error.value) == error_value


def test_unseen_or_no_features():
D = [{"camelot": 0, "spamalot": 1}]
for sparse in [True, False]:
Expand Down
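The counting behaviour checked by `test_iterable_value` can be illustrated without scikit-learn. This sketch uses a `Counter` in place of the sparse-matrix accumulation; `count_features` is a hypothetical helper, and values that are neither strings nor numbers are assumed to be iterables of strings.

```python
from collections import Counter
from numbers import Number


def count_features(sample, separator="="):
    """Accumulate feature values for one sample, counting repeated strings."""
    counts = Counter()
    for key, value in sample.items():
        if isinstance(value, str):
            counts[f"{key}{separator}{value}"] += 1
        elif isinstance(value, Number):
            # bool is a Number, so True contributes 1 under the plain key.
            counts[key] += value
        else:
            # Assumption: an iterable of strings; each occurrence adds 1.
            for item in value:
                counts[f"{key}{separator}{item}"] += 1
    return dict(counts)


first = count_features({"version": ["1", "2", "1"], "ham": 2})
third = count_features({"version=3": True, "spam": -1})
```

Here `first` equals the dictionary the test expects from `inverse_transform` for the first sample, with `"version=1"` counted twice, and `third` shows that a key literally named `"version=3"` with a boolean value lands in the same column name as the string-valued case.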