[MRG] ENH: add support for dropping first level of categorical feature #9361
Conversation
Thanks.
True by default.
drop_first_categorical_level: boolean, optional
drop_one_category perhaps?
I think Boolean, default=False would be clearer here
I agree. I'm going to update the rest of the docstring to comply with your suggestion. I've noticed this is not consistent across the entire project, unfortunately, but the suggested style is more common.
True by default.
drop_first_categorical_level: boolean, optional
    Whether the first level of a categorical feature should be excluded
Level is unfamiliar jargon. First category?
Specify "each" categorical feature "(string valued)"
self.dtype = dtype
self.separator = separator
self.sparse = sparse
self.sort = sort
self.drop_first = drop_first_categorical_level
Initialisation parameter and stored attribute must have same name in our API
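For context, scikit-learn's convention is that every `__init__` parameter is stored on an attribute of the same name, because `get_params` and `clone` introspect the constructor signature to find the attributes. A minimal sketch of the convention (the class name and the hand-written `get_params` are illustrative only, not the PR's code):

```python
class DummyVectorizer(object):
    """Illustrative only: __init__ args must map to same-named attributes."""

    def __init__(self, drop_first_category=False):
        # correct: attribute name matches the parameter name
        self.drop_first_category = drop_first_category

    def get_params(self):
        # clone()/get_params() look up attributes by the __init__ arg names,
        # so storing the value on self.drop_first (a different name) would
        # raise AttributeError here
        return {'drop_first_category': self.drop_first_category}
```

With `self.drop_first = drop_first_categorical_level`, cloning the estimator would silently lose the setting.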
for x in X:
    for f, v in six.iteritems(x):
        if isinstance(v, six.string_types):
            if self.drop_first and f not in to_drop:
Would it be substantially more efficient to just drop these columns from the sparse matrix after vectorizing?
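The suggestion above can be done with fancy column indexing on the vectorized CSR matrix. A hedged sketch of that post-hoc approach (function and variable names are illustrative, not the PR's code):

```python
import numpy as np
import scipy.sparse as sp

def drop_columns(X, to_drop):
    """Return a copy of sparse matrix X without the columns in to_drop."""
    to_drop = set(to_drop)
    keep = np.array([j for j in range(X.shape[1]) if j not in to_drop])
    # CSR matrices support fancy indexing on the column axis
    return X[:, keep]

X = sp.csr_matrix(np.array([[1., 0., 2.],
                            [0., 3., 0.]]))
X2 = drop_columns(X, to_drop={0})  # drop the first one-hot column
```

This keeps the per-sample transform loop untouched and pays the cost once, after vectorizing.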
Yes, I agree. It also results in more readable code. I'm going to update the PR today, I'm making the final checks.
@jnothman are you happy with the changes I made? Feel free to leave additional comments if you find something that can be improved. I'm hoping to start working on some new bug fix as soon as this weekend.
for x in X:
    for f, v in six.iteritems(x):
        if isinstance(v, six.string_types):
            if self.drop_first_category and f not in to_drop:
It's a pity we still add this effort for every sample, for every feature. I also fear this is too sensitive to the ordering of the data for the user to find it explicable.
We could drop the alphanumerically first entry by using a find-min. Something like:
to_drop = {}
for k, v in vocab.items():
    prefix = k.rpartition(sep)[0]
    if prefix not in to_drop or k < to_drop[prefix]:
        to_drop[prefix] = k
We do encounter problems of potential ambiguity in sep.
You need to update the vocab indices below too.
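To make the find-min idea concrete, here is a quick check on a toy vocabulary, assuming '=' as the separator (the feature names are made up for illustration):

```python
sep = '='
vocab = {'device=Desktop': 0, 'device=Mobile': 1, 'device=Tablet': 2,
         'country=UK': 3, 'country=FR': 4}

# one pass over the vocabulary, not over every sample:
# keep the alphanumerically first key per feature prefix
to_drop = {}
for k in vocab:
    prefix = k.rpartition(sep)[0]
    if prefix not in to_drop or k < to_drop[prefix]:
        to_drop[prefix] = k
# to_drop is {'device': 'device=Desktop', 'country': 'country=FR'}
```

The choice of dropped category is now deterministic regardless of sample order, at the cost of the separator-ambiguity caveat noted above.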
Hi @jnothman ,
I like your solution! I've intentionally avoided splitting the string using a separator to overcome issues of ambiguity ― I hope people don't ever use the = character in column names, but you never know :-) Let me know if you want me to implement your suggestion, and I'll update my PR.
I also fear this is too sensitive to the ordering of the data for the user to find it explicable.
That's a valid point. To overcome this problem in my own project, I've stored the dictionaries that I want to pass to DictVectorizer inside a "master" dictionary. This master dictionary has keys that can be sorted in a way that guarantees the first category is deterministic.
x = vectorizer.fit_transform([v for k, v in sorted(master.items())])
In this example a key in master could be something like the following tuple:
(582498109, 'Desktop')
... where Desktop is the level I want to drop, and each id (the first element in the tuple) is associated with multiple devices, such as Tablet, Mobile, etc. I appreciate this is specific to my use case and not always true.
To be entirely fair, as a Data Scientist, 99% of the time you don't really care about which category is being dropped, since that is simply your baseline. I guess in those situations when you need a specific category to be dropped, you can always build your own sorting function to pass to the key argument in sorted.
What do you think?
vocab = {k: v for k, v in vocab.items()
         if k not in to_drop.values()}
feature_names = [feature for feature in feature_names
                 if feature in vocab.keys()]
You just want in vocab here, else you'll needlessly convert vocab.keys() to a list in each iteration.
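The two membership tests give the same result, but `feature in vocab` is a constant-time hash lookup, whereas on Python 2 `vocab.keys()` materialises a fresh list (and then scans it linearly) on every comparison. A tiny illustration:

```python
vocab = {'a=x': 0, 'a=y': 1}

# same answer either way...
same = ('a=x' in vocab) == ('a=x' in vocab.keys())

# ...but `in vocab` is O(1) per lookup; on Python 2, `in vocab.keys()`
# rebuilds a list of all keys for every feature tested
missing = 'a=z' in vocab
```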
Commits:
- ENH: add support for dropping first level of categorical feature
- DOC: improve docstring
- pep8 compliance
- fix py2 doctest
- remove spaces on empty line
- TST: test second occurrence of dropped categorical level
- MAINT: fix failing test
- adress comments from jnothman
- do not construct list
I think you'd best adopt something like my approach. Imagine someone analysing the most stable important features under cross-validation. If the feature dropped differs for each CV split, the results are uninterpretable.
Reference Issues
Fixes #6053
Fixes #9073
What does this implement/fix? Explain your changes.
This Pull Request adds an extra argument to DictVectorizer that, if set to True, drops the first level of each categorical variable. This is extremely useful in regression models that do not use regularisation, as it avoids multicollinearity.
Any other comments
Even though multicollinearity doesn't affect the predictions, it hugely affects the regression coefficients, which makes both model inspection and further use of those coefficients troublesome.
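To illustrate why dropping one level matters: with an intercept, a full set of dummy columns for a feature sums to the constant column, so the design matrix is rank-deficient; dropping one dummy restores full column rank. A minimal numpy sketch (the data is made up):

```python
import numpy as np

# intercept + full one-hot encoding of a 3-level feature: 4 columns
X_full = np.array([[1, 1, 0, 0],
                   [1, 0, 1, 0],
                   [1, 0, 0, 1],
                   [1, 1, 0, 0]], dtype=float)
# columns 1+2+3 sum to the intercept column, so rank is only 3
rank_full = np.linalg.matrix_rank(X_full)

# dropping the first dummy leaves 3 linearly independent columns
X_drop = X_full[:, [0, 2, 3]]
rank_drop = np.linalg.matrix_rank(X_drop)
```

With the rank-deficient design, OLS coefficients are not uniquely determined, which is exactly why the coefficients become uninterpretable even though predictions are unaffected.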