LabelEncoder-like functionality #123
Something like this:

```r
> LabelEncoder <- function(x) {
+   as.numeric(x) - 1
+ }
>
> test <- iris$Species[c(1:2, 51:52, 100:101)]
> test
[1] setosa     setosa     versicolor versicolor versicolor virginica
Levels: setosa versicolor virginica
> LabelEncoder(test)
[1] 0 0 1 1 1 2
```

We generally eschew this way of representing qualitative data. It wouldn't be much trouble to add this though. I think that the …
Thanks for your reply Max. Yes, this works in a basic way, but I was hoping for a more sophisticated implementation.

Something like what I'm describing is implemented in the CatEncoders package, and the author said it was inspired by the scikit-learn LabelEncoder class. However, I was hoping to see it in recipes because I really like the package and I wish we would never be forced to use any package for pre-processing aside from recipes and vtreat.

Also, I've tried both representations (LabelEncoder and one-hot) many times in my work and in kaggle competitions and always found the one-hot encoding slower as well as more prone to overfitting, especially when using tree-based algorithms such as xgboost. Could you please elaborate on this point? I'd like to learn from your experience as well.

Thanks again Max,
Kind Regards,
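For illustration, here is a minimal base-R sketch of the fit/transform behaviour being asked for; the function names, and the choice of lumping unseen levels into one extra integer, are my own assumptions rather than an existing API:

```r
# Sketch of a scikit-learn-style label encoder in base R (illustrative names only).
label_encoder <- function(x) {
  # "fit": learn the level -> integer mapping from the training data
  list(levels = levels(factor(x)))
}

encode <- function(enc, new_x, zero_based = TRUE) {
  # "transform": apply the learned mapping to new data
  idx <- match(as.character(new_x), enc$levels)   # NA for levels never seen in training
  idx[is.na(idx)] <- length(enc$levels) + 1       # lump novel levels into one extra code
  if (zero_based) idx - 1 else idx
}

train <- iris$Species[c(1:2, 51:52, 101:102)]
enc   <- label_encoder(train)
encode(enc, c("setosa", "virginica", "a_brand_new_level"))
#> 0 2 3   (3 is the reserved code for unseen levels)
```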
So does it increase the number of integers for the new level or drop that level into an existing slot? We are adding a feature hashing step (although I find that approach enormously unsatisfying from a statistical perspective).
(The following rant is largely about unordered qualitative data.) I want to separate how we store qualitative data from how we represent it for our purposes; I think conflating the two (as integer values would) is bad. Since R has a wonderful method for storing and summarizing qualitative data (factors), we don't have to take a low-level approach of encoding them as some type of number(s). For example, using … and so on.

Also, there is the not-so-far-fetched issue of people doing computations on the numbers that treat an integer value of 2 as twice the integer value of the first level. For neophytes, this can happen easily and they might not be aware of it. I could ramble on (and have in the past), but basically I feel that you should keep the best representation of the data until you absolutely need to encode it numerically. I have a hard time figuring out a case where some indicator-variable encoding is inferior to an integer representation. An integer representation is probably a construct of how the implementation is written, and that would be the "tail wagging the dog."

I'll take a look at …
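As a small illustration of that pitfall (my own example, not from the thread): once levels are stored as integers, nothing prevents nonsensical arithmetic on them, whereas an indicator (dummy) representation keeps the levels as separate columns:

```r
# Integer codes invite arithmetic that has no meaning for unordered categories.
x <- factor(c("setosa", "versicolor", "virginica"))
codes <- as.integer(x) - 1   # 0, 1, 2
mean(codes)                  # 1 -- looks like a number, means nothing here
codes[3] / codes[2]          # 2 -- "virginica is twice versicolor"?

# Indicator variables keep the levels separate instead of ordering them.
model.matrix(~ x - 1)
#>   xsetosa xversicolor xvirginica
#> 1       1           0          0
#> 2       0           1          0
#> 3       0           0          1
```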
On a related note: bfgray3/cattonum#20. The reason I am interested in this functionality is that I fit a neural net with keras with embeddings, and the input vector has to be numerical (for some reason). So I can't use a factor to represent my data.
The Keras embedding layer requires that categorical variables be encoded as 0-based integer arrays. It doesn't seem like cf2e5e6 can do that, since …
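For concreteness, the usual conversion relies on R factors being 1-based integers underneath (a generic sketch, not tied to any particular recipes step):

```r
x <- iris$Species[c(1, 51, 101)]
as.integer(x)        # 1 2 3 -- R factors are 1-based under the hood
as.integer(x) - 1    # 0 1 2 -- the 0-based codes an embedding layer would index with
```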
I thought that was the convention for …
Here's the documentation for …
I was following this guide, which uses … and then

`embeddings$name <- c("none", levels(wday(df$date, label = T)))`

to add a first level ("none") apart from the actual factor levels. Maybe @skeydan would have more info...
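That pattern of reserving an extra first slot for a catch-all value can be sketched generically like this (my own illustration; it assumes lubridate's wday() as in the guide's example):

```r
library(lubridate)

dates  <- as.Date(c("2019-01-07", "2019-01-08"))
days   <- wday(dates, label = TRUE)
lookup <- c("none", levels(days))        # slot 1 is the catch-all "none"

idx <- match(as.character(days), lookup) # position in the lookup table
idx[is.na(idx)] <- 1                     # anything unmatched falls back to "none"
idx - 1                                  # 0-based indices, 0 reserved for "none"
```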
I just commented on the other issue (#192 (comment)). I seem to come across the following way most often (just making use of a current example here and adding comments in-line): …

So apart from creating the lookup table, the recurring steps for me seem to be …

As an illustration of the latter part, here's an official TF tutorial that does it like this (only it does not have to do the tokenizing): https://tensorflow.rstudio.com/keras/articles/tutorial_basic_text_classification.html Does this answer the question somehow, or wasn't this what it's about? :-)
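A condensed sketch of that lookup-then-embed pattern with the R keras interface (layer sizes and data are made up for illustration):

```r
library(keras)

x <- factor(c("setosa", "versicolor", "virginica", "setosa"))

# 1. lookup table / integer codes (1-based here, so index 0 simply stays unused)
ids      <- as.integer(x)
n_levels <- nlevels(x)

# 2. an embedding layer sized to cover every index that can occur
model <- keras_model_sequential() %>%
  layer_embedding(input_dim = n_levels + 1, output_dim = 4, input_length = 1) %>%
  layer_flatten() %>%
  layer_dense(units = 1)
```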
The use case I'm mostly concerned about is using the embedding layer with a categorical variable in a structured-data setting. What we're discussing is whether a categorical variable should be processed into a 0-based or 1-based vector, e.g. should …
I can modify it to be zero-based (as an option), but it would throw errors if a new value is encountered.
My best guess ... I think it does not matter so much ... the way I understand the docs, by default 0 can be a value like any other (if you leave mask_zero at its default of FALSE) ... However, if you just don't use it, it would be like a mysterious value that never appears, but it should not change the results much ... So my guess would be that starting from 1 could work just fine ... but one could also do a quick test with simulated data to find out ...
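In code, the point is just that index 0 is an ordinary row of the embedding matrix unless masking is switched on (sizes here are arbitrary; this is an illustration, not from the thread):

```r
library(keras)

# mask_zero = FALSE is the default: 0 is a normal index. If the codes start at 1,
# row 0 is simply never looked up, i.e. one wasted row and nothing more.
emb <- layer_embedding(input_dim = 3 + 1,   # 3 levels plus the unused 0 slot
                       output_dim = 2,
                       mask_zero  = FALSE)
```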
Throwing an exception may be desirable in some cases, e.g. when you know you shouldn't have unseen labels. We can have a parameter that controls the behavior; FWIW Spark has a parameter for this … It feels weird to me to have to set your input dim to the number of classes + 1 when you know that you won't have new levels.
But we don't know that, and accommodating novel levels has been a consistent request.
Agreed, we should support accommodating novel levels (by default), but would you be OK with adding a parameter to allow 0-based indexing, with documentation that it'll puke on unseen labels? Or maybe lump novel levels into code …
That's a good idea. I'll get to it in a few days.
Okay, from the code: they always say an embedding is nothing but a lookup, and really it is.

All …

But `K.gather` (being Python) is 0-based. So to match this, R factors should probably be decremented by 1.
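A toy illustration of "an embedding is nothing but a lookup" in plain R (my own example): the layer holds a weight matrix, and the integer input just selects a row, 0-based on the Python side:

```r
# a made-up 4 x 2 embedding matrix: one row for each index 0..3
emb_weights <- matrix(rnorm(8), nrow = 4, ncol = 2)

zero_based_code <- 2                 # what Python's K.gather would receive
emb_weights[zero_based_code + 1, ]   # + 1 because R matrices are 1-based
```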
Sorry, just to double check (and spoil the thread): can I use one-based indexing in keras for embeddings, or am I supposed to use zero-based indexing? If one-based is fine, maybe the doc / help file should be adapted there? @skeydan
Try it now to see if that helps with zero-based integers.
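For anyone trying this out, usage might look roughly like the sketch below; I believe the step referred to here is step_integer(), but the exact interface, including the zero_based argument shown, is my assumption rather than something stated in the thread:

```r
library(recipes)

iris_train <- iris[c(1:40, 51:90, 101:140), ]
iris_test  <- iris[c(41:50, 91:100, 141:150), ]

rec <- recipe(Sepal.Length ~ Species, data = iris_train) %>%
  step_integer(Species, zero_based = TRUE) %>%   # zero_based: assumed option
  prep()

bake(rec, new_data = iris_test)
```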
Hi Max, and thanks for the great package.
I watched your interesting presentation titled "I don't want to be a dummy", and, based on my experience, I tend to find that label encoding of categorical variables is generally better than one-hot (dummy variable) encoding, in terms of training speed as well as robustness to overfitting.
So my question is: is there a function that transforms a categorical variable to a numeric representation (like the LabelEncoder class in scikit-learn) and also handles novel levels in the test data?
That would be very nice to have!
Thank you in advance