
Added UnaryEncoder. #3336

Open
wants to merge 1 commit


@adarob
adarob commented Jul 2, 2014

No description provided.

@jnothman
scikit-learn member
jnothman commented Jul 2, 2014

I have seen work that discretises features into buckets with this sort of scheme. Still, it's hard to explain without an example; could you put together an example comparing this and other categorical encodings?

I'm also not sure about the clarity of the name "unary", or is it used in the literature?
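For readers skimming this thread, here is a minimal sketch of the scheme under discussion, contrasted with a one-hot encoding of the same values. It uses plain NumPy; `unary_encode` and `one_hot_encode` are illustrative stand-in helpers, not the API added in this PR.

```python
import numpy as np

def unary_encode(x, n_values):
    """Unary ("thermometer") encoding: value k -> k leading ones."""
    x = np.asarray(x)
    # Column j is 1 whenever the value is strictly greater than j.
    return (x[:, np.newaxis] > np.arange(n_values - 1)).astype(int)

def one_hot_encode(x, n_values):
    """One-hot encoding for comparison: value k -> a single 1 in column k."""
    x = np.asarray(x)
    return (x[:, np.newaxis] == np.arange(n_values)).astype(int)

x = [0, 1, 3]
print(unary_encode(x, 4))
# [[0 0 0]
#  [1 0 0]
#  [1 1 1]]
print(one_hot_encode(x, 4))
# [[1 0 0 0]
#  [0 1 0 0]
#  [0 0 0 1]]
```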

@larsmans larsmans commented on the diff Jul 3, 2014
sklearn/preprocessing/data.py
@@ -1093,3 +1094,178 @@ def transform(self, X):
"""
return _transform_selected(X, self._transform,
self.categorical_features, copy=True)
+
+class UnaryEncoder(BaseEstimator, TransformerMixin):
+ """Encode natural number features using a unary scheme.
+
+ The input to this transformer should be a matrix of integers, denoting
+ the values taken on by natural number features, with a meaningful ordering.
@larsmans
scikit-learn member
larsmans added a note Jul 3, 2014
"the values taken on by natural number features" -- do you mean levels of categorical features?

Never mind, those are of course not ordered.

@adarob
adarob commented Jul 3, 2014
@vene
scikit-learn member
vene commented Jul 15, 2014

This seems related to what Introduction to Statistical Learning calls "step functions" (page 268). I see the connection to the kernel space that motivates such an encoding, but I also wonder whether binning would accomplish the same thing while leading to a more interpretable model.
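As a rough illustration of the distinction (the data and bin edges below are made up, and this is plain NumPy rather than anything in this PR): plain binning gives one indicator per bin, while the cumulative/unary variant fires an indicator for every threshold a value has passed.

```python
import numpy as np

x = np.array([0.7, 3.1, 6.4, 9.2])      # a continuous feature (made-up values)
edges = np.array([2.5, 5.0, 7.5])       # illustrative bin edges

# Plain binning ("step functions"): one indicator per bin, i.e. one-hot.
bin_idx = np.digitize(x, edges)          # bin index 0..3 for each value
step = (bin_idx[:, np.newaxis] == np.arange(len(edges) + 1)).astype(int)

# Cumulative (unary-style) variant: one indicator per threshold passed.
cumulative = (x[:, np.newaxis] >= edges).astype(int)

print(step)
# [[1 0 0 0]
#  [0 1 0 0]
#  [0 0 1 0]
#  [0 0 0 1]]
print(cumulative)
# [[0 0 0]
#  [1 0 0]
#  [1 1 0]
#  [1 1 1]]
```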

@jnothman
scikit-learn member
@GaelVaroquaux
scikit-learn member

An example is needed to merge this (as well as documentation). It will probably clarify many of the questions about the use case.

@jnothman
scikit-learn member
jnothman commented Aug 6, 2014

I have seen this sort of encoding used in semi-supervised approaches, where some statistics are collected on feature occurrences in an unlabelled dataset, and then a supervised system is trained with features along the lines of "in the unlabelled corpus, feature x present in this instance was seen at least k times", which are effectively quantized into bins and then unary-encoded.

I think finding a plurality of citations for this technique will be hard. But one example is in Bansal and Klein (ACL 2011), which states "For all features used, we add cumulative variants where indicators are fired for all count bins b' up to query count bin b."

I have seen somewhere (I wish I had a reference) a fairly general self-training-like approach which:

- trains a model on labelled data;
- runs it over unlabelled data;
- collects frequencies of feature activation (perhaps for a larger set of features) on the unlabelled data, conditioned on the predicted label;
- stacks these binned, cumulatively-binarized (i.e. UnaryEncoded) conditional frequencies back onto the original feature space.
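A small sketch of the cumulative count-bin idea described above (the counts and thresholds are invented for illustration):

```python
import numpy as np

# Invented feature counts from an unlabelled corpus, and invented bin thresholds.
counts = np.array([0, 3, 17, 250])
thresholds = np.array([1, 5, 10, 100])   # ">= 1", ">= 5", ">= 10", ">= 100"

# "Cumulative variants where indicators are fired for all count bins up to b":
# every threshold at or below the observed count fires, which is exactly a
# unary encoding of the binned count.
cumulative_bins = (counts[:, np.newaxis] >= thresholds).astype(int)
print(cumulative_bins)
# [[0 0 0 0]
#  [1 0 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```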

@jnothman
scikit-learn member

I just thought of this PR. If @adarob has no intention of completing it, it's open for someone to take over and add an example and testing. I note that we currently have no examples using OneHotEncoder.

@GaelVaroquaux
scikit-learn member
@jnothman
scikit-learn member

ah, naming... rename Binarizer to Binner or Thresholder, then merge this with OneHotEncoder as Binarizer with a mode parameter? Not being serious... but naming is hard.
