LabelEncoder-like functionality #123

Closed
Hisham-Hussein opened this issue Feb 13, 2018 · 19 comments

@Hisham-Hussein commented Feb 13, 2018

Hi Max, and thanks for the great package.

I watched your interesting presentation "I don't want to be a dummy", and in my experience, label encoding of categorical variables tends to work better than one-hot (dummy variable) encoding, both in training speed and in robustness to overfitting.

So my question is: is there a function that transforms a categorical variable to a numeric representation (like the LabelEncoder class in scikit-learn) and also handles novel levels in the test data?

That would be very nice to have!

Thank you in advance

@topepo (Collaborator) commented Feb 14, 2018

Something like this:

> LabelEncoder <- function(x) {
+   as.numeric(x) - 1
+ }
> 
> test <- iris$Species[c(1:2, 51:52, 100:101)]
> test
[1] setosa     setosa     versicolor versicolor versicolor virginica 
Levels: setosa versicolor virginica
> LabelEncoder(test)
[1] 0 0 1 1 1 2

We generally eschew this way of representing qualitative data. It wouldn't be much trouble to add this though.

I think that the tensorflow and/or keras packages might have built-in functions for doing this.

@Hisham-Hussein (Author) commented Feb 14, 2018

Thanks for your reply, Max.

Yes, this works in a basic way, but I was hoping for a more sophisticated implementation:

  • LabelEncoder is a kind of estimator class that is fitted on the training data and remembers the levels it saw
  • Then it transforms the training data
  • Then it transforms the test data; since the test data is not known in advance, it should have built-in functionality to deal with novel levels in the test data (which are encountered frequently in the real world)
  • For example, once it sees novel levels in the test data, it replaces them with an arbitrary number the user chose when fitting the transformer (and if no such number was specified, it uses NA for the novel levels); see the sketch after this list

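A minimal sketch of this fit/transform pattern in base R (hypothetical helper names, not the CatEncoders or scikit-learn implementation):

label_encoder_fit <- function(x) {
  list(levels = levels(factor(x)))            # remember the levels seen in training
}

label_encoder_transform <- function(fit, x, unseen = NA_real_) {
  idx <- match(as.character(x), fit$levels)   # known levels -> 1..K, novel levels -> NA
  out <- as.numeric(idx) - 1                  # zero-based codes, as in scikit-learn
  out[is.na(idx)] <- unseen                   # novel levels get the user's value (or NA)
  out
}

enc <- label_encoder_fit(iris$Species)
label_encoder_transform(enc, c("setosa", "virginica", "a new level"), unseen = -1)
# [1]  0  2 -1
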
Something like what I'm describing is implemented in the CatEncoders package, and the author said it was inspired by the scikit-learn LabelEncoder class.

However, I was hoping to see it in recipes because I really like the package, and I would prefer never to need any pre-processing package aside from recipes and vtreat.

Also:
I'm curious to know why this way of representing qualitative data is not preferred?
In your talk "I don't want to be a dummy" I saw that it led to the same or better results than one-hot encoding, and in terms of training speed it was almost always better.

I've also tried both representations (label encoding and one-hot) many times in my work and in Kaggle competitions, and I have always found one-hot encoding slower as well as more prone to overfitting, especially with tree-based algorithms such as xgboost.

Could you please elaborate on this point? I'd like to learn from your experience as well

Thanks again Max,

Kind Regards,
Hisham

@topepo (Collaborator) commented Feb 14, 2018

Then it transforms the test data; since the test data is not known in advance, it should have built-in functionality to deal with novel levels in the test data (which are encountered frequently in the real world)

So does it increase the number of integers for the new level or drop that level into an existing slot? We are adding a feature hashing step (although I find that approach enormously unsatisfying from a statistical perspective).
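
Feature hashing sidesteps the novel-level problem because any level, seen or unseen, maps deterministically to one of a fixed number of buckets. A rough sketch of the general idea (an illustration only, not the recipes implementation):

hash_level <- function(x, n_buckets = 8L) {
  # any string, whether or not it appeared in training, lands in one of n_buckets slots
  vapply(as.character(x),
         function(lvl) sum(utf8ToInt(lvl)) %% n_buckets + 1L,
         integer(1))
}

hash_level(c("setosa", "virginica", "a brand new level"))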

I'm curious to know why this way of representing qualitative data is not preferred?

(the following rant is largely about unordered qualitative data)

I want to separate how we store qualitative data from how we represent it for our purposes. I think conflating the two (as integer values would) is bad.

Since R has a wonderful method for storing and summarizing qualitative data (factors), we don't have to take a low-level approach of encoding them as some type of number. For example, using table(factor_var) is much better than table(integer_var) (see the short example after this list) because

  • The factor has memory of the levels. An integer vector would need to have a definition for what the maximum number of integers would be so that it can put zeros in the table for levels that are possible but not observed.

  • For both aesthetic and content reasons, showing the actual levels is far better for communicating values.

and so on.
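
A small illustration of the first point:

x <- factor(c("a", "a", "b"), levels = c("a", "b", "c"))
table(x)              # the factor remembers level "c" even though it was never observed
# x
# a b c
# 2 1 0
table(as.integer(x))  # the integer version silently forgets that a third level exists
# 1 2
# 2 1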

Also, there is the not-so-far-fetched issue of people doing computations on the numbers that treat an integer value of 2 as twice the value of the first level. For neophytes, this can happen easily and they might not be aware of it.

I could ramble on (and have in the past) but basically I feel that you should keep the best representation of the data until you absolutely need to encode it numerically. I have a hard time figuring out a case where some indicator variable encoding is inferior to an integer representation. An integer representation is probably a construct of how the implementation is written and that would be the "tail wagging the dog."

In recipes, step_ordinalscore does something similar to what you are interested in but it is only for ordered factors (where it makes sense to do this) and wouldn't know what to do with new values.
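
For reference, a minimal sketch of how step_ordinalscore can be used on an ordered factor (argument names follow the current recipes API and may differ across versions):

library(recipes)

df <- iris
df$Species <- factor(df$Species, ordered = TRUE)   # pretend Species is ordinal for the example

rec <- recipe(Sepal.Length ~ Species, data = df) %>%
  step_ordinalscore(Species) %>%
  prep(training = df)

bake(rec, df)   # Species is replaced by a numeric score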

I'll take a look at CatEncoders too.

@lorenzwalthert (Contributor) commented Jul 4, 2018

On a related note: bfgray3/cattonum#20. The reason I am interested in this functionality is that I fit a neural net with keras using embeddings, and the input vector has to be numeric (for some reason), so I can't use the factor class to represent my data.

topepo added a commit that referenced this issue Sep 4, 2018
@kevinykuo commented Sep 12, 2018

The Keras embedding layer requires that categorical variables be encoded as 0-based integer arrays. It doesn't seem like cf2e5e6 can do that, since 0 is reserved for unseen labels?

@topepo (Collaborator) commented Sep 12, 2018

I thought that was the convention for layer_embed; zero values corresponded to potentially new values.

@kevinykuo commented Sep 12, 2018

Here's the documentation for layers.Embedding: https://keras.io/layers/embeddings/. It doesn't seem like it has support for handling unseen data built in. mask_zero can be used if you have recurrent layers down the line, but it's not turned on by default anyway.

@topepo (Collaborator) commented Sep 12, 2018

I was following this guide which uses

The first layer is the embedding layer with the size of 7 weekdays plus 1 (for the unknowns).

and then

embeddings$name <- c("none", levels(wday(df$date, label = T)) )

to add an extra level ("none") in front of the actual factor levels.
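
In keras for R that pattern looks roughly like this (a sketch with an arbitrary output_dim; slot 0 is the extra level reserved for unknowns):

library(keras)

n_levels <- 7   # the seven weekdays

model <- keras_model_sequential() %>%
  layer_embedding(input_dim = n_levels + 1,   # levels plus one slot for unseen values
                  output_dim = 2,
                  input_length = 1) %>%
  layer_flatten() %>%
  layer_dense(units = 1)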

Maybe @skeydan would have more info...

@skeydan commented Sep 12, 2018

I just commented on the other issue (#192 (comment))

I seem to come across the following approach most often (just making use of a current example here and adding comments in-line):

library(keras)      # text_tokenizer(), texts_to_sequences(), pad_sequences()
library(tidyverse)  # %>%, map(), filter(), select(), pull()

###################    tokenize  ####################

# we will use the 5000 most frequent words only
# all others will be <unk>, here and in a later test set
# last I saw (though this could change across releases), <unk> corresponded to 1
# so 0 is empty, and later I will use it to add <pad>
top_k <- 5000
tokenizer <- text_tokenizer(
  num_words = top_k,
  oov_token = "<unk>",
  filters = '!"#$%&()*+.,-/:;=?@[\\]^_`{|}~ ')
tokenizer$fit_on_texts(sample_captions)

# pad_sequences will use 0 to pad all captions to the same length
tokenizer$word_index["<pad>"] <- 0

########### convert text to integers #############

train_captions_tokenized <-
  tokenizer %>% texts_to_sequences(train_captions)


#######  create a lookup dataframe that allows us to go in both directions #######

word_index_df <- data.frame(
  word = tokenizer$word_index %>% names(),
  index = tokenizer$word_index %>% unlist(use.names = FALSE),
  stringsAsFactors = FALSE
)
word_index_df <- word_index_df %>% arrange(index)

#######  and some way to use that lookup table #######
decode_caption <- function(text) {
  paste(map(text, function(number)
    word_index_df %>%
      filter(index == number) %>%
      select(word) %>%
      pull()),
    collapse = " ")
}

###########  pad all sequences to the same length  ###########
train_captions_padded <-  pad_sequences(
  train_captions_tokenized,
  maxlen = max_length,
  padding = "post",
  truncating = "post"
)

So apart from creating the lookup table, the recurring steps for me seem to be

  • train ("fit") the tokenizer
  • use it to convert text to integers (not to 1-hot, normally, as far as I see)
  • then pad to a common length

As an illustration of the latter part, here's an official TF tutorial that does it like this (only it does not have to do the tokenizing):

https://tensorflow.rstudio.com/keras/articles/tutorial_basic_text_classification.html

Does this answer the question somehow or wasn't this what it's about? :-)

@kevinykuo commented Sep 12, 2018

The use case I'm mostly concerned about is using the embedding layer with a categorical variable in a structured data setting. What we're discussing is whether a categorical variable should be processed into a 0- or 1-based integer vector. E.g., should iris$Species be encoded as 0, 0, ..., 1, 1, ..., 2, 2 or as 1, 1, ..., 2, 2, ..., 3, 3?

@topepo (Collaborator) commented Sep 12, 2018

I can modify it to be zero-based (as an option), but it would throw errors if a new value is encountered. 😩

@skeydan commented Sep 12, 2018

My best guess is that it does not matter so much... the way I understand the docs, by default 0 can be a value like any other (if you leave mask_zero at its default of FALSE)...

However, if you just don't use it, it would be like a mysterious value that never appears, but it should not change the results much...

So my guess is that starting from 1 could work just fine... but one could also do a quick test with simulated data to find out.

@kevinykuo commented Sep 12, 2018

Throwing an exception may be desirable in some cases, e.g. when you know you shouldn't have unseen labels. We can have a parameter that controls the behavior. FWIW, Spark has a handleInvalid parameter that defaults to "error" for unseen labels or missing values.

It feels weird to me to have to set input_dim to the number of classes + 1 when you know that you won't have new levels.

@topepo (Collaborator) commented Sep 12, 2018

It feels weird to me to have to set input_dim to the number of classes + 1 when you know that you won't have new levels.

But we don't know that and accommodating novel levels has been a consistent request.

@kevinykuo commented Sep 12, 2018

Agreed, we should support accommodating novel levels (by default), but would you be OK with adding a parameter to allow 0-based indexing, and we can document that it'll puke on unseen labels? Or maybe lump novel levels into the code num_classes in that case, since the known levels go from 0 to num_classes - 1? This way, input_dim can still correspond to the number of classes, and the user won't notice the difference. If the user does encounter novel levels, Keras will throw an index-out-of-bounds exception, which will be informative.
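
A sketch of that convention (hypothetical helper, not the code in recipes):

encode_zero_based <- function(x, levels_seen) {
  idx <- match(as.character(x), levels_seen)   # 1..K for known levels, NA for novel ones
  idx[is.na(idx)] <- length(levels_seen) + 1L  # lump novel levels into slot K + 1
  idx - 1L                                     # known levels -> 0..(K - 1), novel -> K
}

lvls <- levels(iris$Species)                               # K = 3
encode_zero_based(c("setosa", "virginica", "azalea"), lvls)
# [1] 0 2 3   (3 is out of bounds when input_dim = 3, so Keras errors on the novel level)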

@topepo (Collaborator) commented Sep 12, 2018

That's a good idea. I'll get to it in a few days

@skeydan commented Sep 14, 2018

Okay, from the code: they always say an embedding is nothing but a lookup, and it really is:

def call(self, inputs):
    if K.dtype(inputs) != 'int32':
        inputs = K.cast(inputs, 'int32')
    out = K.gather(self.embeddings, inputs)
    return out

All K.gather does is look up indices in a tensor; from the k_gather docs:

Description
Retrieves the elements of indices indices in the tensor reference.

Usage
k_gather(reference, indices)

But K.gather (being Python) is 0-based. So to match this, R factors should probably be decremented by 1.

@topepo added this to In progress in Package: recipes on Oct 5, 2018
@lorenzwalthert (Contributor) commented Oct 9, 2018

Sorry, just to double check (and spoil the thread): can I use one-based indexing in keras for embeddings, or am I supposed to use zero-based indexing? If one-based is fine, maybe the doc / help file should be adapted there? @skeydan

topepo added a commit that referenced this issue Oct 12, 2018
topepo added a commit that referenced this issue Oct 12, 2018: Label encoder changes for #123
@topepo (Collaborator) commented Oct 12, 2018

Try it now to see if that helps with zero-based integers

@topepo moved this from In progress to Done in Package: recipes on Oct 13, 2018
@topepo closed this on Jun 28, 2019