[MRG] ENH: add support for dropping first level of categorical feature #9361
Conversation
Thanks.
True by default.
drop_first_categorical_level: boolean, optional
drop_one_category perhaps?
I think Boolean, default=False would be clearer here
I agree. I'm going to update the rest of the docstring to comply with your suggestion. I've noticed this is not consistent across the entire project, unfortunately, but the suggested style is more common.
True by default.
drop_first_categorical_level: boolean, optional
    Whether the first level of a categorical feature should be excluded
Level is unfamiliar jargon. First category?
Specify "each" categorical feature "(string valued)"
self.dtype = dtype
self.separator = separator
self.sparse = sparse
self.sort = sort
self.drop_first = drop_first_categorical_level
Initialisation parameter and stored attribute must have same name in our API
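For context, scikit-learn's convention is that every `__init__` parameter is stored on an attribute of the same name, because `get_params` and `clone` introspect the constructor signature to find the attributes. A minimal sketch of the convention (the class name and the hand-written `get_params` are illustrative only, not the PR's code):

```python
class DummyVectorizer(object):
    """Illustrative only: __init__ args must map to same-named attributes."""

    def __init__(self, drop_first_category=False):
        # correct: attribute name matches the parameter name
        self.drop_first_category = drop_first_category

    def get_params(self):
        # clone()/get_params() look up attributes by the __init__ arg names,
        # so storing the value on self.drop_first (a different name) would
        # raise AttributeError here
        return {'drop_first_category': self.drop_first_category}
```

With `self.drop_first = drop_first_categorical_level`, cloning the estimator would silently lose the setting.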
for x in X:
    for f, v in six.iteritems(x):
        if isinstance(v, six.string_types):
            if self.drop_first and f not in to_drop:
Would it be substantially more efficient to just drop these columns from the sparse matrix after vectorizing?
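The suggestion above can be done with fancy column indexing on the vectorized CSR matrix. A hedged sketch of that post-hoc approach (function and variable names are illustrative, not the PR's code):

```python
import numpy as np
import scipy.sparse as sp

def drop_columns(X, to_drop):
    """Return a copy of sparse matrix X without the columns in to_drop."""
    to_drop = set(to_drop)
    keep = np.array([j for j in range(X.shape[1]) if j not in to_drop])
    # CSR matrices support fancy indexing on the column axis
    return X[:, keep]

X = sp.csr_matrix(np.array([[1., 0., 2.],
                            [0., 3., 0.]]))
X2 = drop_columns(X, to_drop={0})  # drop the first one-hot column
```

This keeps the per-sample transform loop untouched and pays the cost once, after vectorizing.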
Yes, I agree. It also results in more readable code. I'm going to update the PR today, I'm making the final checks.
@jnothman are you happy with the changes I made? Feel free to leave additional comments if you find something that can be improved. I'm hoping to start working on some new bug fix as soon as this weekend.
for x in X:
    for f, v in six.iteritems(x):
        if isinstance(v, six.string_types):
            if self.drop_first_category and f not in to_drop:
It's a pity we still add this effort for every sample, for every feature. I also fear this is too sensitive to the ordering of the data for the user to find it explicable.
We could drop the alphanumerically first entry by using a find-min. Something like:
to_drop = {}
for k, v in vocab.items():
    prefix = k.rpartition(sep)[0]
    if prefix not in to_drop or k < to_drop[prefix]:
        to_drop[prefix] = k
We do encounter problems of potential ambiguity in sep.
You need to update the vocab indices below too.
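To make the find-min idea concrete, here is a quick check on a toy vocabulary, assuming '=' as the separator (the feature names are made up for illustration):

```python
sep = '='
vocab = {'device=Desktop': 0, 'device=Mobile': 1, 'device=Tablet': 2,
         'country=UK': 3, 'country=FR': 4}

# one pass over the vocabulary, not over every sample:
# keep the alphanumerically first key per feature prefix
to_drop = {}
for k in vocab:
    prefix = k.rpartition(sep)[0]
    if prefix not in to_drop or k < to_drop[prefix]:
        to_drop[prefix] = k
# to_drop is {'device': 'device=Desktop', 'country': 'country=FR'}
```

The choice of dropped category is now deterministic regardless of sample order, at the cost of the separator-ambiguity caveat noted above.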
Hi @jnothman ,
I like your solution! I've intentionally avoided splitting the string using a separator to overcome issues of ambiguity ― I hope people don't ever use the = character in column names, but you never know :-) Let me know if you want me to implement your suggestion, and I'll update my PR.
I also fear this is too sensitive to the ordering of the data for the user to find it explicable.
That's a valid point. To overcome this problem in my own project, I've stored the dictionaries that I want to pass to DictVectorizer inside a "master" dictionary. This master dictionary has keys that can be sorted in a way that guarantees the first category is deterministic.
x = vectorizer.fit_transform([v for k, v in sorted(master.items())])
In this example a key in master could be something like the following tuple:
(582498109, 'Desktop')
... where Desktop is the level I want to drop, and each id (the first element in the tuple) is associated with multiple devices, such as Tablet, Mobile, etc. I appreciate this is specific to my use case and not always true.
To be entirely fair, as a Data Scientist, 99% of the time you don't really care about which category is being dropped, since that is simply your baseline. I guess in those situations when you need a specific category to be dropped, you can always build your own sorting function to pass to the key argument in sorted.
What do you think?
vocab = {k: v for k, v in vocab.items()
         if k not in to_drop.values()}
feature_names = [feature for feature in feature_names
                 if feature in vocab.keys()]
You just want in vocab here, else you'll needlessly convert vocab.keys() to a list in each iteration.
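The two membership tests give the same result, but `feature in vocab` is a constant-time hash lookup, whereas on Python 2 `vocab.keys()` materialises a fresh list (and then scans it linearly) on every comparison. A tiny illustration:

```python
vocab = {'a=x': 0, 'a=y': 1}

# same answer either way...
same = ('a=x' in vocab) == ('a=x' in vocab.keys())

# ...but `in vocab` is O(1) per lookup; on Python 2, `in vocab.keys()`
# rebuilds a list of all keys for every feature tested
missing = 'a=z' in vocab
```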
Commits:
- ENH: add support for dropping first level of categorical feature
- DOC: improve docstring
- pep8 compliance
- fix py2 doctest
- remove spaces on empty line
- TST: test second occurrence of dropped categorical level
- MAINT: fix failing test
- adress comments from jnothman
- do not construct list
I think you'd best adopt something like my approach. Imagine someone analysing the most stable important features under cross-validation. If the feature dropped differs for each CV split, the results are uninterpretable.
Reference Issues
Fixes #6053
Fixes #9073
What does this implement/fix? Explain your changes.
This Pull Request adds an extra argument to DictVectorizer that, if set to True, drops the first level of each categorical variable. This is extremely useful in regression models that do not use regularisation, as it avoids multicollinearity.
Any other comments
Even though multicollinearity doesn't affect the predictions, it hugely affects the regression coefficients, which makes both model inspection and further use of those coefficients troublesome.
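To illustrate why dropping one level matters: with an intercept, a full set of dummy columns for a feature sums to the constant column, so the design matrix is rank-deficient; dropping one dummy restores full column rank. A minimal numpy sketch (the data is made up):

```python
import numpy as np

# intercept + full one-hot encoding of a 3-level feature: 4 columns
X_full = np.array([[1, 1, 0, 0],
                   [1, 0, 1, 0],
                   [1, 0, 0, 1],
                   [1, 1, 0, 0]], dtype=float)
# columns 1+2+3 sum to the intercept column, so rank is only 3
rank_full = np.linalg.matrix_rank(X_full)

# dropping the first dummy leaves 3 linearly independent columns
X_drop = X_full[:, [0, 2, 3]]
rank_drop = np.linalg.matrix_rank(X_drop)
```

With the rank-deficient design, OLS coefficients are not uniquely determined, which is exactly why the coefficients become uninterpretable even though predictions are unaffected.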