[WIP] FrequencyEncoder #11805

Closed
wants to merge 2 commits into from

Conversation

@sshleifer
Contributor

sshleifer commented Aug 13, 2018

What does this implement/fix? Explain your changes.

This is an alternative to LabelEncoder and OneHotEncoder that encodes categoricals based on the number of times they occur in the training data. It usually provides more information about the encoded value than LabelEncoder, at the cost of potential collisions: two categories that occur equally often receive the same code. I would love to add test coverage and examples if people think this is a good idea.

@jnothman

Member

jnothman commented Aug 13, 2018

This is not suitable for encoding classification labels, but for features.

@jnothman

Member

jnothman commented Aug 13, 2018

See also #9614

@sshleifer

Contributor Author

sshleifer commented Aug 13, 2018

Agreed that this is for features and should be in preprocessing/.
#9614 is very related, but it creates three arrays for every array you send it. This is completely different from LabelEncoder.

@sshleifer

Contributor Author

sshleifer commented Aug 21, 2018

Would it make sense to try to steer that PR toward a simpler API?

@amueller

Member

amueller commented Aug 21, 2018

> #9614 is very related, but it creates three arrays for every array you send it.

Can you elaborate on that? I don't think I understand.

@sshleifer

Contributor Author

sshleifer commented Aug 22, 2018

Sure! From line 2935 of this file it appears that when CountFeaturizer.fit_transform is sent an array of shape (N, 1) it returns an array of shape (N, 3).

Another difference is that CountFeaturizer takes an inclusion argument: 'all', 'each', list, or numpy.ndarray, whereas in my experience most preprocessors preprocess all the data they are sent.

@amueller

Member

amueller commented Aug 22, 2018

Hm, I think the interface of the CountVectorizer is a bit odd right now, and I'd change some of that.
My expectation would be that in a multi-class setting you have n_classes many columns. How would you reduce it to a single column?
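
One possible reading of that expectation, as a rough sketch only: for each sample, count how often its category value co-occurred with each class in the training data, which gives one column per class. The helper name `per_class_counts` and the toy data are hypothetical and are not taken from #9614 or from scikit-learn.

```python
import numpy as np


def per_class_counts(x, y):
    """Encode each element of x by its co-occurrence count with every class in y.

    Hypothetical helper for illustration; returns an (n_samples, n_classes) array.
    """
    x = np.asarray(x)
    y = np.asarray(y)
    values, x_idx = np.unique(x, return_inverse=True)
    classes, y_idx = np.unique(y, return_inverse=True)
    x_idx, y_idx = x_idx.ravel(), y_idx.ravel()
    # Contingency table: rows are category values, columns are classes.
    table = np.zeros((len(values), len(classes)), dtype=int)
    np.add.at(table, (x_idx, y_idx), 1)
    # Look up the row of the table that belongs to each sample's category.
    return table[x_idx]


x = np.array(["a", "a", "b", "a"])
y = np.array([0, 1, 1, 1])
features = per_class_counts(x, y)
print(features.shape)  # one column per class
```

A frequency encoding that ignores `y` corresponds to summing these columns, which is how a single column can come out of the same table.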

@sshleifer

Contributor Author

sshleifer commented Aug 22, 2018

Something like

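from collections import Counter

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class FrequencyEncoder(BaseEstimator, TransformerMixin):
    """If a value appears n times in a column sent to fit, it is encoded as n."""

    def fit(self, X, y=None):
        X = np.asarray(X)
        # One Counter per column, mapping each value to its training count.
        self._counts = [Counter(X[:, i]) for i in range(X.shape[1])]
        return self

    def transform(self, X):
        X = np.asarray(X)
        # try to broadcast more
        # Values not seen during fit are encoded as 0 (Counter's default).
        # Transpose so the result has the same (n_samples, n_features) shape as X.
        return np.array([[self._counts[i][elem] for elem in X[:, i]]
                         for i in range(X.shape[1])]).T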

CountVectorizer changes could also be interesting. I am trying to transform features, not labels, though, so I don't think it matters whether the setting is multi-class?
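
The "try to broadcast more" note in `transform` could be handled in several ways. The following is a minimal sketch for a single column, assuming the values are sortable so that `np.unique` and `np.searchsorted` apply. The function names `fit_counts` and `encode_counts` are hypothetical, and unseen values are mapped to 0 as an assumption rather than a decision made in this PR.

```python
import numpy as np


def fit_counts(column):
    """Return the sorted unique values of a 1-D array and their counts."""
    values, counts = np.unique(np.asarray(column), return_counts=True)
    return values, counts


def encode_counts(column, values, counts):
    """Map each element of `column` to its training count (0 if unseen)."""
    column = np.asarray(column)
    # Position where each element would sit among the sorted training values.
    idx = np.searchsorted(values, column)
    # Clip so elements larger than every training value stay in range.
    idx = np.clip(idx, 0, len(values) - 1)
    # An element was seen during fit only if the value at that position matches.
    seen = values[idx] == column
    return np.where(seen, counts[idx], 0)


train = np.array(["red", "red", "green", "yellow", "green", "red"])
values, counts = fit_counts(train)
encoded = encode_counts(np.array(["red", "green", "blue"]), values, counts)
print(encoded)
```

This replaces the per-element Python loop with array operations, at the cost of requiring values that can be ordered.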

@jnothman

Member

jnothman commented Aug 23, 2018

@sshleifer

Contributor Author

sshleifer commented Aug 23, 2018

My theory is that, for a tree model, this would be like using OrdinalEncoder to transform categoricals, but would provide more information than just the order of appearance.

If our data is [red, red, red, red, yellow, red, green, green, green], replacing red with 1, yellow with 2, and green with 3 tells the model less than replacing them with their counts: 5, 1, and 3. CountFeaturizer provides even more information at the cost of more features.
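
A small sketch of the comparison, using only the standard library. It assigns ordinal codes by order of first appearance (an assumption made to match the description here, not necessarily how any scikit-learn encoder assigns codes) and counts occurrences in the list exactly as written.

```python
from collections import Counter

colors = ["red", "red", "red", "red", "yellow", "red", "green", "green", "green"]

# Ordinal codes by order of first appearance: red -> 1, yellow -> 2, green -> 3.
first_seen = {}
for color in colors:
    first_seen.setdefault(color, len(first_seen) + 1)
ordinal = [first_seen[color] for color in colors]

# Frequency codes: each value is replaced by how often it occurs in the data.
counts = Counter(colors)
frequency = [counts[color] for color in colors]

print(ordinal)
print(frequency)
```

The frequency codes carry the relative popularity of each category, but any two categories with equal counts would become indistinguishable.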

I can work on providing empirical proof, if that would help, but also happy to give up and look for other stuff to do!

@sshleifer sshleifer closed this Mar 27, 2019
@adrinjalali adrinjalali added this to To do in Categorical Oct 21, 2019
3 participants