EHN: implementation of SMOTE-NC for continuous and categorical mixed types #412
What does this implement/fix? Explain your changes.
Implements SMOTE-NC as described in Section 6.1 of the original SMOTE paper by N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer.
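For context, Section 6.1 of the paper modifies the Euclidean distance used for the nearest-neighbor search: continuous features contribute as usual, while each categorical feature whose values differ adds Med, the median of the standard deviations of the continuous features of the minority class. A minimal sketch of that distance (hypothetical helper, not the PR's code):

```python
import numpy as np

def smote_nc_distance(a_cont, a_cat, b_cont, b_cat, med):
    """Modified Euclidean distance from Section 6.1 of the SMOTE paper:
    continuous features contribute their squared differences; every
    categorical feature whose values differ adds med**2."""
    sq = np.sum((np.asarray(a_cont) - np.asarray(b_cont)) ** 2)
    sq += med ** 2 * sum(x != y for x, y in zip(a_cat, b_cat))
    return float(np.sqrt(sq))

# med is the median of the std. devs. of the continuous features
# of the minority class (illustrative value here).
med = 0.5
d = smote_nc_distance([1.0, 2.0], ["red", "S"], [1.0, 2.0], ["blue", "S"], med)
# exactly one categorical mismatch, identical continuous features
```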
Any other comments?
Some parts are missing to make it ready to merge, but I would like to get an opinion on implementation first, especially on the part which deals with sparse matrices as I do not have much experience with them.
Points to pay attention to:
Hello @ddudnik! Thanks for updating the PR.
Comment last updated on October 12, 2018 at 10:07 UTC
```
@@            Coverage Diff             @@
##           master     #412      +/-   ##
==========================================
+ Coverage   98.66%   98.69%   +0.03%
==========================================
  Files          79       80       +1
  Lines        4803     4999     +196
==========================================
+ Hits         4739     4934     +195
- Misses         64       65       +1
```
I am starting to look at this PR.
Is the method strictly following the implementation of SMOTE-NC, or is there a bit of a trick there? I don't recall that they use one-hot encoding, for instance.
@jnothman similarly to
Considering the following case: a ColumnTransformer with a OneHotEncoder for categorical data and a StandardScaler for the numerical, SMOTE-NC would benefit if the OneHotEncoder is outputting an array with the column names which could be passed at
Do we have something in mind regarding this case and is it a "generic" enough case such that other estimators in scikit-learn can benefit from it?
That conversation is not happening in any straightforward way, though I have noted that they are sides of the same coin in the Draft Roadmap (see "Passing around information that is not (X, y)"). I do think we have substantial related problems in the limitations of
We have not yet considered combined/holistic solutions for sample, feature and target properties, but perhaps we should...
This is only a partial review
I will continue tomorrow.
I gave it some more thought and looked through the code one more time (it has been a long time since I initially implemented it), and it seems that 'auto' mode is not possible here.
The thing is that we need to distinguish between two different categorical features from the original dataset. I'll try to explain with an example.
Let's say, we have a dataset with 4 columns with features A, B, C and D. Features C and D are two separate categorical ones. Let's say, feature C has 2 possible values and feature D has 3 possible values. After one-hot encoding we'll end up with a new dataset with 7 columns: features A and B are left intact, but instead of feature C we have two columns (C1, C2) and instead of feature D we have 3 columns (D1, D2, D3). These new columns are of course binary.
For this code to work properly, it has to know where the feature C columns end and where the feature D columns start, i.e. we need to know that columns 2-3 represent feature C and columns 4-6 represent feature D.
This is exactly what we say with
I hope this explains why 'auto' is not really an option here, unless I am missing something.
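The column bookkeeping described above can be sketched with scikit-learn's OneHotEncoder on the toy dataset from the example (the span-recovery logic is illustrative, not the PR's actual code):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy data: A, B continuous; C has 2 categories, D has 3.
X_cont = np.array([[0.1, 1.0], [0.2, 2.0], [0.3, 3.0]])       # features A, B
X_cat = np.array([["c1", "d1"], ["c2", "d2"], ["c1", "d3"]])  # features C, D

enc = OneHotEncoder().fit(X_cat)
ohe = enc.transform(X_cat)
ohe = ohe.toarray() if hasattr(ohe, "toarray") else np.asarray(ohe)

# 7 columns total: A, B, C1, C2, D1, D2, D3
X_new = np.hstack([X_cont, ohe])

# The per-feature column spans can be recovered from categories_:
# C occupies columns 2-3, D occupies columns 4-6.
sizes = [len(c) for c in enc.categories_]           # number of columns per feature
starts = np.cumsum([X_cont.shape[1]] + sizes[:-1])  # start column of C and of D
```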
We need to add an additional test to check the "auto" mode.
We should check that we get the same results by using "auto" or by explicitly specifying which columns are categorical.
So categorical_feature_indices is actually the starting index of a given feature.
I would prefer then to include the OneHotEncoder in SMOTE-NC.
It also ensures that the columns are actually encoded (a careless user could otherwise forget the OneHotEncoder).
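What folding the OneHotEncoder into the estimator buys can be sketched as follows: the user passes the raw mixed-type X plus the categorical column indices, the estimator encodes internally, and synthetic rows can be decoded back to the original representation. This is a hypothetical helper with the resampling step elided, not the PR's code:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

def encode_then_decode(X, categorical_indices):
    """Encode the categorical columns internally, leave room for the
    resampling step, then decode back to the original representation."""
    cat_idx = np.asarray(categorical_indices)
    cont_idx = np.setdiff1d(np.arange(X.shape[1]), cat_idx)
    enc = OneHotEncoder()
    ohe = enc.fit_transform(X[:, cat_idx])
    # ... synthetic samples would be generated in the encoded space here ...
    decoded = enc.inverse_transform(ohe)
    X_back = np.empty_like(X)
    X_back[:, cont_idx] = X[:, cont_idx]
    X_back[:, cat_idx] = decoded
    return X_back

# Mixed-type input: the user never touches the encoder directly.
X = np.array([[0.1, "red"], [0.2, "blue"], [0.3, "red"]], dtype=object)
X_round_trip = encode_then_decode(X, [1])
```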
I will try to finish up this PR. To simplify the implementation without changing as much code:
Ideally, we could have used the ColumnTransformer, but it does not have the
Oct 12, 2018
@glemaitre I've looked through the code briefly and I have one remark regarding replacing the '1' entries of the categorical features with the median of the standard deviations. Shouldn't we replace the '1' entries with the median of the standard deviations divided by two? Otherwise the distance between two different categorical values is 2 * med_std.
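The doubling can be checked numerically: two samples that differ in one categorical feature differ in two one-hot columns, so entries of value v contribute 2 * v**2 to the squared Euclidean distance instead of the med**2 penalty the paper intends. A small check under that assumption (the 1/sqrt(2) below follows from the squared-distance arithmetic; the exact scaling factor adopted in the fix is for the maintainers to settle):

```python
import numpy as np

med = 0.8  # illustrative median of the continuous features' std. devs.

def mismatch_sq_contribution(v):
    """Squared-distance contribution of ONE categorical mismatch when
    the one-hot entries hold the value v: the two samples differ in
    two columns, so the contribution is 2 * v**2, not v**2."""
    a = np.array([v, 0.0])  # sample with category 1
    b = np.array([0.0, v])  # sample with category 2
    return float(np.sum((a - b) ** 2))

doubled = mismatch_sq_contribution(med)                 # 2 * med**2
intended = mismatch_sq_contribution(med / np.sqrt(2))   # med**2
```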
Did I forget to divide it? Oops... It seems we will need a bug fix. Thanks for noticing it!