
[MRG] ENH: add support for dropping first level of categorical feature #9361

Closed
wants to merge 1 commit into from

Conversation

@IamGianluca commented Jul 14, 2017

Reference Issues

Fixes #6053
Fixes #9073

What does this implement/fix? Explain your changes.

This Pull Request adds an extra argument to DictVectorizer that, if set to True, drops the first level of each categorical variable. This is extremely useful in a regression model that does not use regularisation, as it avoids multicollinearity.

Any other comments

Even though multicollinearity doesn't affect the predictions, it hugely affects the regression coefficients, which makes both model inspection and further use of those coefficients troublesome.
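As a hypothetical illustration (plain NumPy, not the PR's code), the collinearity problem is easy to see: with k categories, the k one-hot columns always sum to one, so they are linearly dependent with an intercept column, and dropping one dummy removes that dependence:

```python
# Hypothetical illustration (not the PR's code): with k categories, the k
# one-hot columns sum to 1 in every row, i.e. they are perfectly collinear
# with an intercept column; dropping one column removes that dependence.
import numpy as np

categories = ['red', 'green', 'blue']
values = ['red', 'blue', 'green', 'red']

full = np.array([[float(v == c) for c in categories] for v in values])
intercept = np.ones((len(values), 1))

X_full = np.hstack([intercept, full])          # intercept + all 3 dummies
X_drop = np.hstack([intercept, full[:, 1:]])   # drop the first category

print(np.linalg.matrix_rank(X_full))  # 3 -> rank-deficient (4 columns)
print(np.linalg.matrix_rank(X_drop))  # 3 -> full column rank (3 columns)
```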

@jnothman (Member) left a comment

Thanks.

True by default.
drop_first_categorical_level: boolean, optional
Member:

drop_one_category perhaps?

I think Boolean, default=False would be clearer here

Author:

I agree. I'm going to update the rest of the docstring to comply with your suggestion. I've noticed this is not consistent across the entire project, unfortunately, but the suggested style is more common.

True by default.
drop_first_categorical_level: boolean, optional
Whether the first level of a categorical feature should be excluded
Member:

Level is unfamiliar jargon. First category?

Member:

Specify "each" categorical feature "(string valued)"

self.dtype = dtype
self.separator = separator
self.sparse = sparse
self.sort = sort
self.drop_first = drop_first_categorical_level
Member:

Initialisation parameter and stored attribute must have same name in our API

for x in X:
    for f, v in six.iteritems(x):
        if isinstance(v, six.string_types):
            if self.drop_first and f not in to_drop:
Member:

Would it be substantially more efficient to just drop these columns from the sparse matrix after vectorizing?

Author:

Yes, I agree. It also results in more readable code. I'm going to update the PR today; I'm just making the final checks.
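A minimal sketch of the idea being agreed to here (not the PR's actual code; `drop_columns` is an illustrative helper): vectorize everything first, then slice the unwanted columns out of the resulting sparse matrix in one step, rather than filtering inside the per-sample loop:

```python
# Hypothetical sketch: drop columns from an already-vectorized sparse
# matrix instead of skipping features sample by sample.
import numpy as np
from scipy import sparse

def drop_columns(X, drop_indices):
    """Return sparse matrix X without the given column indices."""
    keep = np.setdiff1d(np.arange(X.shape[1]), drop_indices)
    return X[:, keep]

X = sparse.csr_matrix(np.array([[1, 0, 2],
                                [0, 3, 0]]))
X2 = drop_columns(X, [0])  # drop the first column
print(X2.toarray())        # [[0 2], [3 0]]
```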

@IamGianluca (Author)

@jnothman are you happy with the changes I made? Feel free to leave additional comments if you find something that can be improved. I'm hoping to start working on some new bug fixes as soon as this weekend.

for x in X:
    for f, v in six.iteritems(x):
        if isinstance(v, six.string_types):
            if self.drop_first_category and f not in to_drop:
Member:

It's a pity we still expend this effort for every sample and every feature. I also fear this is too sensitive to the ordering of the data for the user to find it explicable.

We could drop the alphanumerically first entry by using a find-min. Something like:

to_drop = {}
for k, v in vocab.items():
    prefix = k.rpartition(sep)[0]
    if prefix not in to_drop or k < to_drop[prefix]:
        to_drop[prefix] = k

We do encounter problems of potential ambiguity in sep.

You need to update the vocab indices below too.
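An end-to-end sketch of the find-min idea above, assuming sample data and the `=` separator (the vocabulary here is illustrative, not the PR's code): pick the alphanumerically first category per feature, drop it, and re-number the surviving indices:

```python
# Hypothetical demo: find the alphanumerically first category for each
# string-valued feature, drop it, and rebuild contiguous vocab indices.
sep = '='
vocab = {'city=London': 0, 'city=Paris': 1, 'device=Desktop': 2,
         'device=Mobile': 3, 'temperature': 4}

to_drop = {}
for k in vocab:
    prefix, found_sep, _ = k.rpartition(sep)
    if not found_sep:   # numeric feature, no category suffix
        continue
    if prefix not in to_drop or k < to_drop[prefix]:
        to_drop[prefix] = k

dropped = set(to_drop.values())
# Re-number the remaining features, preserving their original order.
vocab = {k: i for i, k in enumerate(sorted(set(vocab) - dropped,
                                           key=vocab.get))}
print(vocab)  # {'city=Paris': 0, 'device=Mobile': 1, 'temperature': 2}
```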

Author:

Hi @jnothman ,

I like your solution! I've intentionally avoided splitting the string using a separator to overcome issues of ambiguity; I hope people don't ever use the = character in column names, but you never know :-) Let me know if you want me to implement your suggestion, and I'll update my PR.

I also fear this is too sensitive to the ordering of the data for the user to find it explicable.

That's a valid point. In my own project, to overcome this problem, I've stored the dictionaries that I want to pass to DictVectorizer inside a "master" dictionary. The master dictionary's keys can be sorted in a way that guarantees the first category is deterministic.

x = vectorizer.fit_transform([v for k, v in sorted(master.items())])

In this example a key in master could be something like the following tuple:

(582498109, 'Desktop')

... where Desktop is the level I want to drop, and each id (the first element in the tuple) is associated with multiple devices, such as Tablet, Mobile, etc. I appreciate this is specific to my use case and not always true.

To be entirely fair, as a Data Scientist, 99% of the time you don't really care which category is being dropped, since it is simply your baseline. I guess in those situations when you do need a specific category dropped, you can always build your own sorting function to pass to the key argument of sorted.

What do you think?
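A small sketch of the workaround described above (sample data and keys are illustrative, not from the actual project): sorting the "master" dictionary's tuple keys makes the sample order, and hence the first-seen category, deterministic before vectorizing:

```python
# Hypothetical sketch: a "master" dict whose (id, category) tuple keys
# sort so that the baseline category ('Desktop') is always seen first.
master = {
    (582498109, 'Tablet'): {'device': 'Tablet'},
    (582498109, 'Desktop'): {'device': 'Desktop'},
    (582498109, 'Mobile'): {'device': 'Mobile'},
}
ordered = [v for k, v in sorted(master.items())]
print(ordered[0])  # {'device': 'Desktop'} -> deterministic first category
```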

vocab = {k: v for k, v in vocab.items()
         if k not in to_drop.values()}
feature_names = [feature for feature in feature_names
                 if feature in vocab.keys()]
Member:

You just want `in vocab` here; otherwise `vocab.keys()` needlessly builds a list on each iteration.

ENH: add support for dropping first level of categorical feature

DOC: improve docstring

pep8 compliance

fix py2 doctest

remove spaces on empty line

TST: test second occurrence of dropped categorical level

MAINT: fix failing test

address comments from jnothman

do not construct list
@jnothman (Member)

jnothman commented Jul 20, 2017 via email
