
[MRG] CountFeaturizer for Categorical Data #9614

Closed
wants to merge 6 commits into from

Conversation

chenhe95
Contributor

Reference Issue

#5853

What does this implement/fix? Explain your changes.

It adds the CountFeaturizer transformer class, which can help improve accuracy by using how often a particular data row occurs as an additional feature.
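
A rough usage sketch of the proposed transformer (the import path and behaviour follow this PR's code and tests; the exact output layout is illustrative, not definitive):

import numpy as np
from sklearn.preprocessing import CountFeaturizer  # class added by this PR

X = np.array([[0, 1],
              [0, 1],
              [1, 2],
              [1, 2]])
y = np.array([0, 1, 1, 1])

# For each row, columns are appended that count how often the included
# combination of feature values occurs in the training data; when y is
# given, the counts are kept separately per target value.
cf = CountFeaturizer()
X_counted = cf.fit(X, y).transform(X)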

Any other comments?

Currently a work in progress; please let me know if there is something I should add, or anything I can do in a better or faster way!

Also a continuation of
#7803
#8144

@amueller
Member

tests are failing ;)

@chenhe95
Contributor Author

Have not had internet access the past few days due to moving and driving all my stuff back to university. Will take a look and see what's going on.

@chenhe95
Contributor Author

Hmm..

E           AssertionError: 
E           sklearn.preprocessing.data.CountFeaturizer.fit arg mismatch: ['X']
E           sklearn.preprocessing.data.CountFeaturizer.transform arg mismatch: ['X']

@chenhe95 chenhe95 closed this Aug 30, 2017
@chenhe95 chenhe95 reopened this Aug 30, 2017
@amueller
Member

Those are undocumented arguments of methods.
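
For context, this check compares the parameters documented in each method's docstring against its signature, so the fix is to document X (and y) in fit and transform. A minimal sketch of what that can look like (the wording is illustrative, not the PR's final docstring):

def fit(self, X, y=None):
    """Fit the CountFeaturizer to X.

    Parameters
    ----------
    X : array-like, shape (n_samples, n_features)
        The data whose feature-value combinations are counted.

    y : array-like, shape (n_samples,) or (n_samples, n_outputs), optional
        If given, counts are computed separately per target value.

    Returns
    -------
    self : object
    """
    ...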

@chenhe95
Contributor Author

chenhe95 commented Aug 30, 2017

(Closed it by accident.)

E           AssertionError: 
E           sklearn.preprocessing.data.CountFeaturizer.fit arg mismatch: ['X']
E           sklearn.preprocessing.data.CountFeaturizer.transform arg mismatch: ['X']

This seems to be the error, and it only occurs in a specific Python 3.6.1 configuration.
I'll look more into it.

edit: Ahh, I see, thanks!

@amueller
Member

See comment above.

@amueller
Member

Yeah, I just created a PR for a more useful message (#9651).

@chenhe95
Contributor Author

The test cases are passing now. Somehow, during the copy-pasting to fix the merge conflict, the newline characters got deleted from the docstrings.

@amueller
Member

@chenhe95 what's the status on this? Is it good to go from your end? If so, please rename the title to MRG instead of WIP.

@chenhe95 chenhe95 changed the title WIP CountFeaturizer for Categorical Data [MRG] CountFeaturizer for Categorical Data Oct 20, 2017
@chenhe95
Contributor Author

It is indeed good to go, I have just changed it to MRG.

@amueller amueller added this to PR phase in Andy's pets Dec 5, 2017
@amueller
Member

Can you resolve the conflicts, please?

@chenhe95
Contributor Author

chenhe95 commented Jan 27, 2018

Merge conflicts should be fixed!
Currently, when I run nosetests --with-coverage preprocessing, I am getting

======================================================================
ERROR: sklearn.preprocessing.tests.test_data.test_polynomial_features_sparse_X
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
TypeError: test_polynomial_features_sparse_X() takes exactly 4 arguments (0 given)

======================================================================
ERROR: sklearn.preprocessing.tests.test_target.test_transform_target_regressor_1d_transformer
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
TypeError: test_transform_target_regressor_1d_transformer() takes exactly 2 arguments (0 given)

======================================================================
ERROR: sklearn.preprocessing.tests.test_target.test_transform_target_regressor_2d_transformer
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
TypeError: test_transform_target_regressor_2d_transformer() takes exactly 2 arguments (0 given)

I'll try to see what this is about

@jnothman jnothman left a comment
Member

I need to give this another good look. I'd like to be sure that what we've got is reasonably close to standard and useful...

@staticmethod
def _check_inclusion(inclusion, n_input_features=1):
    if inclusion is None:
        raise ValueError("Inclusion cannot be none")
Member

not tested

        return np.array([[i] for i in range(n_input_features)])
    elif CountFeaturizer._valid_data_type(inclusion):
        if len(inclusion) == 0:
            raise ValueError("Inclusion size must not be 0")
Member

not tested

        else:
            return [inclusion]
    else:
        raise ValueError("Illegal data type in inclusion")
Member

not tested

@amueller
Member

amueller commented Feb 6, 2018

The User Guide is missing, right?

@amueller
Member

amueller commented Feb 6, 2018

can we maybe make the example run a bit quicker?

max_estimators = (175 * n_datapoints // 500)
time_start = time.time()

for i in range(min_estimators, max_estimators + 1):
Member

does going in steps of 10s make this faster?

@janvanrijn janvanrijn left a comment
Contributor

@amueller requested me to do a preliminary review. I will post it here.

The CountFeaturizer considers two scenarios (see the sketch below):

a) There is no y value. In this case it counts how often each combination of X values appears in the training set.
b) There is a y list or matrix. In this case separate counts are kept per target value.
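
A rough illustration of the two scenarios using plain Python counters (not code from the PR):

from collections import Counter

X = [(0, 1), (0, 1), (1, 2), (1, 2), (1, 2)]
y = [0, 1, 1, 1, 0]

# a) no y: count how often each combination of X values occurs
counts_x = Counter(X)            # {(1, 2): 3, (0, 1): 2}

# b) with y: keep a separate count per (combination, class) pair
counts_xy = Counter(zip(X, y))   # {((1, 2), 1): 2, ((0, 1), 0): 1, ...}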



time_start = time.time()
pipeline_cf = make_pipeline(
Contributor

  • It is unclear why the pipeline does not contain the classifier. Now the preprocessing steps and the classifier are separated.
    Major consequence: Code is hard to understand.
    Minor consequence: The time measurements only involve the classification time.

Contributor Author

One of the premises CountFeaturizer was built on (based on this Microsoft article on count featurization: https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/data-transformation-learning-with-counts) is that it may reduce training and classification time compared to something like one-hot encoding, because fewer features are generated. The pipeline was split here because it is necessary to separate the time it takes to preprocess the data from the time it takes to train and use the classifier.



clf = get_classifier()
labels = ["CountFeaturizer + RandomForestClassifier",
Contributor

Can this experiment (running the three pipelines on an increasing number of trees) be put into a for loop? That would remove the duplicated code.

plt.plot(xs, ys, label=label)

plt.xlim(min_estimators, max_estimators)
plt.xlabel("n_estimators")
Contributor

(Personally, I'm a big fan of declaring the plotting labels and other constants (l135, 136, 137) high up in the code, so the script is reusable. For this it would be even better to put everything in a function, but maybe that is me taking it too far.)

@@ -2866,6 +2869,229 @@ def power_transform(X, method='box-cox', standardize=True, copy=True):
return pt.fit_transform(X)


def _get_nested_counter(remaining, y_dim, inclusion_size):
Contributor

  • _get_nested_counter: I know it's not a requirement to document private functions, but it would make the review
    so much easier to know what the components are supposed to do. E.g., what is a 'nested dictionary', what do you
    mean by layers, and a 2d array of what in the end?

Contributor Author

There is something in the CI tests (something to do with the pickle module) that does not allow you to do something like
a = {}
a[1] = {}
a[1][1] = 4

So this is a workaround

Should I add "This is a workaround due to some pickle issues in CI testing regarding creating nested dicts dynamically" to this as a comment?
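
For what it's worth, a frequent cause of this kind of pickling failure is building nested counters with defaultdict(lambda: ...), since lambdas cannot be pickled. A minimal sketch of the kind of workaround _get_nested_counter appears to implement, guessed from the surrounding diff (the base case and array shape are assumptions, not the PR's actual code):

from collections import defaultdict
from functools import partial

import numpy as np


def _get_nested_counter(remaining, y_dim, inclusion_size):
    # Recursively build a nested defaultdict whose leaves are 2d count arrays.
    # Using a module-level function with functools.partial instead of a lambda
    # keeps the resulting structure picklable.
    if remaining == 1:
        return np.zeros((y_dim, inclusion_size))
    return defaultdict(
        partial(_get_nested_counter, remaining - 1, y_dim, inclusion_size))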

_get_nested_counter, remaining - 1, y_dim, inclusion_size))


class CountFeaturizer(BaseEstimator, TransformerMixin):
Contributor

  • The code seems small and elegant for something with many functionalities.

[1, 2, 1, 1], [1, 2, 1, 1]]))


def test_count_featurizer_inclusion():
Contributor

  • test_count_featurizer_inclusion() why not run all examples on both arrays?

[1, 0, 2, 1, 1], [1, 0, 2, 1, 1]]))


def test_count_featurizer_multi_y():
Contributor

  • test_count_featurizer_multi_y: same question as (1); the three additional values don't make sense to me.
    Please explain in comments.

Contributor Author

Does my comment on line 2527 clarify this? The first column in y contributes 1 of the added columns, since it only takes 1 unique value. The second column in y contributes 2 of the added columns, since it takes two values, 0 and 1.

np.array([[0, 0, 2, 0, 3, 1], [0, 0, 2, 0, 3, 1],
[1, 0, 1, 1, 3, 1], [1, 0, 1, 1, 3, 1]]))

cf = CountFeaturizer(inclusion=[[0]])
Contributor

  • Line 2521, cf = CountFeaturizer(inclusion=[[0]]): this one can be checked independently (by cross-checking against
    an instance with the normally invoked featurizer).
    ** Same for line 2514.

Contributor Author

By this, do you mean comparing the output against cf = CountFeaturizer() and checking if they are equal? If you meant it that way, I don't think it will work because CountFeaturizer() is equivalent to CountFeaturizer(inclusion=[0, 1]) in this specific case.

[1, 2, 1, 1], [1, 2, 1, 1]]))


def test_count_featurizer_multi_inclusion():
Contributor

  • test_count_featurizer_multi_inclusion(): the y value is encoded differently (2d array), which influences how to invoke
    the inclusion param. What happens if I invoke the inclusion in the vanilla way?

@chenhe95 chenhe95 May 19, 2018
Contributor Author

If by the comment you mean y = np.array([[0], [0], [0], [1]]) vs y = np.array([0, 0, 0, 1]) then it should be the same because of y = np.reshape(y, (-1, 1)). I can make a test case for that.

TODO: Test case for this scenario

np.array([[0, 0, 2, 0], [0, 0, 2, 0],
[1, 0, 1, 1], [1, 0, 1, 1]]))

cf = CountFeaturizer(inclusion="each")
Contributor

  • test_count_featurizer_multi_inclusion line 2527: Please explain how this results in 12 new outputs

Contributor Author

Each column of y gets its own output columns for each unique value it can take. For example, here the first column of y has 2 unique values (0 and 1), so 2 columns are output for each inclusion list (since 'each' is used, there are 2 inclusion lists, [0] and [1]). The second column of y has 4 unique values (0, 1, 2, 3), so 4 columns are output for each inclusion list. As a result, (4 + 2, the total number of distinct y values) x (2, the number of inclusion lists) = 12 columns are output.
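
The same arithmetic, spelled out (numbers taken from the explanation above, not code from the PR):

# first y column has 2 unique values, the second has 4; inclusion="each" on two
# input columns gives 2 inclusion lists, [0] and [1]
n_unique_per_y_column = [2, 4]
n_inclusion_lists = 2
added_columns = sum(n_unique_per_y_column) * n_inclusion_lists  # (2 + 4) * 2 == 12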

@chenhe95
Contributor Author

Thank you for the feedback. I will look into your comments and try to address them.


Shows how to use CountFeaturizer to transform some categorical variables
into a frequency feature. CountFeaturizer can often be used to reduce
training time, classification time, and classification error.
Member

I would replace the last sentence with:

CountFeaturizer can be used as an alternative to one-hot encoding for non-linear estimators that do not work efficiently on high-cardinality categorical variables, such as ensembles of trees (random forests and gradient boosted trees).

@amueller
Member

I'm confused. The docs say there's one column per value of X, I would expect one column per value of y...

@chenhe95
Contributor Author

@amueller Could you elaborate on which line?
The total number of added columns can be computed as follows, assuming y is a multi-dimensional label matrix:

import numpy as np

total_added_columns = 0
for column_y_i in y.T:  # iterate over the columns of the label matrix
    total_added_columns += len(np.unique(column_y_i)) * len(inclusion)
# total_added_columns now contains the total number of added columns

@GaelVaroquaux
Member

2 comments here, which echo concerns raised on the original issue (#5853):

  • Using only E[y|X_i] as a feature is not something that should be done: if there are many categories it will lead to overfitting very quickly. Overfitting in the encoding is a problem because it then gets picked up by the supervised model, which places too much confidence in the corresponding feature. Typically this is addressed by prior smoothing (or shrinkage); see https://hccencoding-project.readthedocs.io/en/latest and the references therein (a minimal sketch follows this list).

  • I find the name surprising. I would at least use the term "Encoder" rather than "Featurizer". I would also rather have "Target" than "Count".
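
To make the first point concrete, a minimal sketch of the kind of prior smoothing (shrinkage) referred to above; m is a smoothing-strength hyperparameter and nothing here is taken from this PR:

import numpy as np
import pandas as pd


def smoothed_target_encode(x, y, m=10.0):
    # Replace each category by a per-category mean of y that is shrunk towards
    # the global mean; rare categories are pulled strongly to the prior, which
    # limits the overfitting described above.
    df = pd.DataFrame({"x": x, "y": y})
    prior = df["y"].mean()
    stats = df.groupby("x")["y"].agg(["mean", "count"])
    smoothed = (stats["count"] * stats["mean"] + m * prior) / (stats["count"] + m)
    return df["x"].map(smoothed).to_numpy()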

@amueller amueller added the Needs Decision (Requires decision) label Aug 6, 2019
@amueller
Member

amueller commented Dec 2, 2019

@amueller
Member

Found an interesting reference for the GLM model here:
http://contrib.scikit-learn.org/category_encoders/glmm.html

the GLM seems to be the preferred choice given the benchmark I posted above.
@GaelVaroquaux have you used the GLM approach? Basically seems like a specific smoothing approach.

@GaelVaroquaux
Member

GaelVaroquaux commented Apr 30, 2020 via email

@amueller
Member

amueller commented May 1, 2020

@GaelVaroquaux indeed, but the GLM is a very particular form of determining the shrinkage.

Base automatically changed from master to main January 22, 2021 10:49
@thomasjpfan thomasjpfan added the Superseded (PR has been replaced by a newer PR) label Feb 16, 2023
@thomasjpfan
Member

I am closing this PR since it was superseded by #25334. Thank you for working on this issue.
