
[MRG] CountFeaturizer for Categorical Data #9614

Closed
wants to merge 6 commits into from

Conversation

chenhe95
Contributor

Reference Issue

#5853

What does this implement/fix? Explain your changes.

It adds the CountFeaturizer transformer class, which can help improve accuracy by using how often a particular data row occurs as an additional feature.
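
A rough usage sketch of the proposed transformer (the import path and behaviour follow this PR's code and tests; the exact output layout is illustrative, not definitive):

import numpy as np
from sklearn.preprocessing import CountFeaturizer  # class added by this PR

X = np.array([[0, 1],
              [0, 1],
              [1, 2],
              [1, 2]])
y = np.array([0, 1, 1, 1])

# For each row, columns are appended that count how often the included
# combination of feature values occurs in the training data; when y is
# given, the counts are kept separately per target value.
cf = CountFeaturizer()
X_counted = cf.fit(X, y).transform(X)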

Any other comments?

Currently a work in progress; please let me know if there is something I should add, or anything I can do in a better or faster way!

Also a continuation of
#7803
#8144

@amueller
Member

tests are failing ;)

@chenhe95
Contributor Author

Have not had internet access the past few days due to moving and driving all my stuff back to university. Will take a look and see what's going on.

@chenhe95
Contributor Author

Hmm..

E           AssertionError: 
E           sklearn.preprocessing.data.CountFeaturizer.fit arg mismatch: ['X']
E           sklearn.preprocessing.data.CountFeaturizer.transform arg mismatch: ['X']

@chenhe95 chenhe95 closed this Aug 30, 2017
@chenhe95 chenhe95 reopened this Aug 30, 2017
@amueller
Member

Those are undocumented arguments of methods.
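
For context, this check compares the parameters documented in each method's docstring against its signature, so the fix is to document X (and y) in fit and transform. A minimal sketch of what that can look like (the wording is illustrative, not the PR's final docstring):

def fit(self, X, y=None):
    """Fit the CountFeaturizer to X.

    Parameters
    ----------
    X : array-like, shape (n_samples, n_features)
        The data whose feature-value combinations are counted.

    y : array-like, shape (n_samples,) or (n_samples, n_outputs), optional
        If given, counts are computed separately per target value.

    Returns
    -------
    self : object
    """
    ...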

@chenhe95
Contributor Author

chenhe95 commented Aug 30, 2017

(Closed it by accident.)

E           AssertionError: 
E           sklearn.preprocessing.data.CountFeaturizer.fit arg mismatch: ['X']
E           sklearn.preprocessing.data.CountFeaturizer.transform arg mismatch: ['X']

This seems to be the error, and it only occurs in a specific Python 3.6.1 configuration.
I'll look more into it.

edit: Ahh, I see, thanks!

@amueller
Member

See comment above.

@amueller
Member

Yeah, I just created a PR for a more useful message (#9651).

@chenhe95
Contributor Author

The test cases are passing now. Somehow, during the copy-pasting to fix the merge conflict, the newline characters got deleted from the docstrings.

@amueller
Member

@chenhe95 what's the status on this? Is it good to go from your end? If so, please rename the title to MRG instead of WIP.

@chenhe95 chenhe95 changed the title WIP CountFeaturizer for Categorical Data [MRG] CountFeaturizer for Categorical Data Oct 20, 2017
@chenhe95
Contributor Author

It is indeed good to go, I have just changed it to MRG.

@amueller amueller added this to PR phase in Andy's pets Dec 5, 2017
@amueller
Member

Can you resolve the conflicts, please?

@chenhe95
Contributor Author

chenhe95 commented Jan 27, 2018

Merge conflicts should be fixed!
Currently, when I run nosetests --with-coverage preprocessing, I am getting

======================================================================
ERROR: sklearn.preprocessing.tests.test_data.test_polynomial_features_sparse_X
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
TypeError: test_polynomial_features_sparse_X() takes exactly 4 arguments (0 given)

======================================================================
ERROR: sklearn.preprocessing.tests.test_target.test_transform_target_regressor_1d_transformer
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
TypeError: test_transform_target_regressor_1d_transformer() takes exactly 2 arguments (0 given)

======================================================================
ERROR: sklearn.preprocessing.tests.test_target.test_transform_target_regressor_2d_transformer
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
TypeError: test_transform_target_regressor_2d_transformer() takes exactly 2 arguments (0 given)

I'll try to see what this is about

@jnothman jnothman left a comment
Member

I need to give this another good look. I'd like to be sure that what we've got is reasonably close to standard and useful...

@staticmethod
def _check_inclusion(inclusion, n_input_features=1):
    if inclusion is None:
        raise ValueError("Inclusion cannot be none")
Member

not tested

        return np.array([[i] for i in range(n_input_features)])
    elif CountFeaturizer._valid_data_type(inclusion):
        if len(inclusion) == 0:
            raise ValueError("Inclusion size must not be 0")
Member

not tested

        else:
            return [inclusion]
    else:
        raise ValueError("Illegal data type in inclusion")
Member

not tested

@amueller
Member

amueller commented Feb 6, 2018

The User Guide is missing, right?

@amueller
Member

amueller commented Feb 6, 2018

can we maybe make the example run a bit quicker?

max_estimators = (175 * n_datapoints // 500)
time_start = time.time()

for i in range(min_estimators, max_estimators + 1):
Member

does going in steps of 10s make this faster?

@janvanrijn janvanrijn left a comment
Contributor

@amueller requested me to do a preliminary review. I will post it here.

The CountFeaturizer considers two scenarios (see the sketch below):

a) There is no y value. In this case it counts how often each combination of X values appears in the training set.
b) There is a y list or matrix. In this case separate counts are kept per target value.
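
A rough illustration of the two scenarios using plain Python counters (not code from the PR):

from collections import Counter

X = [(0, 1), (0, 1), (1, 2), (1, 2), (1, 2)]
y = [0, 1, 1, 1, 0]

# a) no y: count how often each combination of X values occurs
counts_x = Counter(X)            # {(1, 2): 3, (0, 1): 2}

# b) with y: keep a separate count per (combination, class) pair
counts_xy = Counter(zip(X, y))   # {((1, 2), 1): 2, ((0, 1), 0): 1, ...}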



time_start = time.time()
pipeline_cf = make_pipeline(
Contributor

  • It is unclear why the pipeline does not contain the classifier. Now the preprocessing steps and the classifier are separated.
    Major consequence: Code is hard to understand.
    Minor consequence: The time measurements only involve the classification time.

Contributor Author

One of the premises CountFeaturizer was built on (based on this Microsoft article on count featurization: https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/data-transformation-learning-with-counts) is that it may reduce training and classification time compared to something like one-hot encoding, because fewer features are generated. The pipeline was split here because it is necessary to separate the time it takes to preprocess the data from the time it takes to train and use the classifier.



clf = get_classifier()
labels = ["CountFeaturizer + RandomForestClassifier",
Contributor

Can this experiment (running the three pipelines on an increasing number of trees) be put into a for loop? That would remove the duplicated code.

plt.plot(xs, ys, label=label)

plt.xlim(min_estimators, max_estimators)
plt.xlabel("n_estimators")
Contributor

(Personally, I'm a big fan of declaring the plotting labels and other constants (l135, 136, 137) high up in the code, so the script is reusable. For this it would be even better to put everything in a function, but maybe that is me taking it too far.)

@@ -2866,6 +2869,229 @@ def power_transform(X, method='box-cox', standardize=True, copy=True):
return pt.fit_transform(X)


def _get_nested_counter(remaining, y_dim, inclusion_size):
Contributor

  • _get_nested_counter: I know it's not a requirement to document private functions, but it would make the review
    so much easier to know what the components are supposed to do. E.g., what is a 'nested dictionary', what do you
    mean by layers, and a 2d array of what in the end?

Contributor Author

There is something in the CI tests (something to do with the pickle module) that does not allow you to do something like
a = {}
a[1] = {}
a[1][1] = 4

So this is a workaround

Should I add "This is a workaround due to some pickle issues in CI testing regarding creating nested dicts dynamically" to this as a comment?
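
For what it's worth, a frequent cause of this kind of pickling failure is building nested counters with defaultdict(lambda: ...), since lambdas cannot be pickled. A minimal sketch of the kind of workaround _get_nested_counter appears to implement, guessed from the surrounding diff (the base case and array shape are assumptions, not the PR's actual code):

from collections import defaultdict
from functools import partial

import numpy as np


def _get_nested_counter(remaining, y_dim, inclusion_size):
    # Recursively build a nested defaultdict whose leaves are 2d count arrays.
    # Using a module-level function with functools.partial instead of a lambda
    # keeps the resulting structure picklable.
    if remaining == 1:
        return np.zeros((y_dim, inclusion_size))
    return defaultdict(
        partial(_get_nested_counter, remaining - 1, y_dim, inclusion_size))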

_get_nested_counter, remaining - 1, y_dim, inclusion_size))


class CountFeaturizer(BaseEstimator, TransformerMixin):
Contributor

  • The code seems small and elegant for something with many functionalities.

[1, 2, 1, 1], [1, 2, 1, 1]]))


def test_count_featurizer_inclusion():
Contributor

  • test_count_featurizer_inclusion() why not run all examples on both arrays?

[1, 0, 2, 1, 1], [1, 0, 2, 1, 1]]))


def test_count_featurizer_multi_y():
Contributor

  • test_count_featurizer_multi_y: same question as (1); the three additional values don't make sense to me.
    Please explain in comments.

Contributor Author

Does my comment on line 2527 clarify this? The first column in y contributes 1 of the added columns, since it only takes 1 unique value. The second column in y contributes 2 of the added columns, since it takes two values, 0 and 1.

np.array([[0, 0, 2, 0, 3, 1], [0, 0, 2, 0, 3, 1],
[1, 0, 1, 1, 3, 1], [1, 0, 1, 1, 3, 1]]))

cf = CountFeaturizer(inclusion=[[0]])
Contributor

  • Line 2521, cf = CountFeaturizer(inclusion=[[0]]): this one can be checked independently (by cross-checking against
    an instance with the normally invoked featurizer).
    ** Same for line 2514.

Contributor Author

By this, do you mean comparing the output against cf = CountFeaturizer() and checking if they are equal? If you meant it that way, I don't think it will work because CountFeaturizer() is equivalent to CountFeaturizer(inclusion=[0, 1]) in this specific case.

[1, 2, 1, 1], [1, 2, 1, 1]]))


def test_count_featurizer_multi_inclusion():
Contributor

  • test_count_featurizer_multi_inclusion(): the y value is encoded differently (2d array), which influences how to invoke
    the inclusion param. What happens if I invoke the inclusion in the vanilla way?

@chenhe95 chenhe95 May 19, 2018
Contributor Author

If by the comment you mean y = np.array([[0], [0], [0], [1]]) vs y = np.array([0, 0, 0, 1]) then it should be the same because of y = np.reshape(y, (-1, 1)). I can make a test case for that.

TODO: Test case for this scenario

np.array([[0, 0, 2, 0], [0, 0, 2, 0],
[1, 0, 1, 1], [1, 0, 1, 1]]))

cf = CountFeaturizer(inclusion="each")
Contributor

  • test_count_featurizer_multi_inclusion line 2527: Please explain how this results in 12 new outputs

Contributor Author

Each column of y gets its own output columns for each unique value it can take. For example, here the first column of y has 2 unique values (0 and 1), so 2 columns are output for each inclusion list (since 'each' is used, there are 2 inclusion lists, [0] and [1]). The second column of y has 4 unique values (0, 1, 2, 3), so 4 columns are output for each inclusion list. As a result, (4 + 2, the total number of distinct y values) x (2, the number of inclusion lists) = 12 columns are output.
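
The same arithmetic, spelled out (numbers taken from the explanation above, not code from the PR):

# first y column has 2 unique values, the second has 4; inclusion="each" on two
# input columns gives 2 inclusion lists, [0] and [1]
n_unique_per_y_column = [2, 4]
n_inclusion_lists = 2
added_columns = sum(n_unique_per_y_column) * n_inclusion_lists  # (2 + 4) * 2 == 12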

@chenhe95
Contributor Author

Thank you for the feedback. I will look into your comments and try to address them.


Shows how to use CountFeaturizer to transform some categorical variables
into a frequency feature. CountFeaturizer can often be used to reduce
training time, classification time, and classification error.
Member

I would replace the last sentence with:

CountFeaturizer can be used as an alternative to one-hot encoding for non-linear estimators that do not work efficiently on high-cardinality categorical variables, such as ensembles of trees (random forests and gradient boosted trees).

@amueller
Member

I'm confused. The docs say there's one column per value of X, I would expect one column per value of y...

@chenhe95
Contributor Author

@amueller Could you elaborate on which line?
The total number of added columns can be computed as follows, assuming y is a multi-dimensional label matrix:

import numpy as np

total_added_columns = 0
for column_y_i in y.T:  # iterate over the columns of the label matrix
    total_added_columns += len(np.unique(column_y_i)) * len(inclusion)
# total_added_columns now contains the total number of added columns

@GaelVaroquaux
Member

2 comments here, which echo concerns raised on the original issue (#5853):

  • Using only E[y|X_i] as a feature is not something that should be done: if there are many categories it will lead to overfitting very quickly. Overfitting in the encoding is a problem because it then gets picked up by the supervised model, which places too much confidence in the corresponding feature. Typically this is addressed by prior smoothing (or shrinkage); see https://hccencoding-project.readthedocs.io/en/latest and the references therein (a minimal sketch follows this list).

  • I find the name surprising. I would at least use the term "Encoder" rather than "Featurizer". I would also rather have "Target" than "Count".
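
To make the first point concrete, a minimal sketch of the kind of prior smoothing (shrinkage) referred to above; m is a smoothing-strength hyperparameter and nothing here is taken from this PR:

import numpy as np
import pandas as pd


def smoothed_target_encode(x, y, m=10.0):
    # Replace each category by a per-category mean of y that is shrunk towards
    # the global mean; rare categories are pulled strongly to the prior, which
    # limits the overfitting described above.
    df = pd.DataFrame({"x": x, "y": y})
    prior = df["y"].mean()
    stats = df.groupby("x")["y"].agg(["mean", "count"])
    smoothed = (stats["count"] * stats["mean"] + m * prior) / (stats["count"] + m)
    return df["x"].map(smoothed).to_numpy()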

@amueller amueller added the Needs Decision (Requires decision) label Aug 6, 2019
@amueller
Member

amueller commented Dec 2, 2019

@amueller
Member

Found an interesting reference for the GLM model here:
http://contrib.scikit-learn.org/category_encoders/glmm.html

the GLM seems to be the preferred choice given the benchmark I posted above.
@GaelVaroquaux have you used the GLM approach? Basically seems like a specific smoothing approach.

@GaelVaroquaux
Member

GaelVaroquaux commented Apr 30, 2020 via email

@amueller
Member

amueller commented May 1, 2020

@GaelVaroquaux indeed, but the GLM is a very particular form of determining the shrinkage.

Base automatically changed from master to main January 22, 2021 10:49
@thomasjpfan thomasjpfan added the Superseded (PR has been replaced by a newer PR) label Feb 16, 2023
@thomasjpfan
Member

I am closing this PR since it was superseded by #25334. Thank you for working on this issue.
