
EHN: implementation of SMOTE-NC for continuous and categorical mixed types #412


Merged: 14 commits, Oct 12, 2018

Conversation

@ddudnik (Contributor) commented Mar 5, 2018

Reference Issue

#401

What does this implement/fix? Explain your changes.

Implements SMOTE-NC as described in Section 6.1 of the original SMOTE paper by N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer.

Any other comments?

Some parts are still missing before this is ready to merge, but I would like to get an opinion on the implementation first, especially the part which deals with sparse matrices, as I do not have much experience with them.

Points to pay attention to:

  • working with sparse matrices
  • 2 FIXME points in code
  • The 'fit' method expects a 'feature_indices' keyword argument and, if it is not provided, issues a warning and falls back to normal SMOTE. Raising an error would probably be better, but that would break the common estimator tests from sklearn (via imblearn/tests/test_common)

@pep8speaks commented Mar 5, 2018

Hello @ddudnik! Thanks for updating the PR.

file_to_check.py:159:-69: W605 invalid escape sequence '\d'


Comment last updated on October 12, 2018 at 10:07 UTC

@codecov bot commented Mar 6, 2018

Codecov Report

Merging #412 into master will increase coverage by 0.03%.
The diff coverage is 99.51%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #412      +/-   ##
==========================================
+ Coverage   98.66%   98.69%   +0.03%     
==========================================
  Files          79       80       +1     
  Lines        4803     4999     +196     
==========================================
+ Hits         4739     4934     +195     
- Misses         64       65       +1
Impacted Files Coverage Δ
imblearn/utils/testing.py 100% <100%> (ø) ⬆️
imblearn/over_sampling/__init__.py 100% <100%> (ø) ⬆️
imblearn/over_sampling/tests/test_smote_nc.py 100% <100%> (ø)
imblearn/over_sampling/_smote.py 95.51% <98.82%> (+1.15%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ae97633...a6ec6a9.

@glemaitre (Member)

I'll check it when I get some time. Thanks for the contribution.

@jijo7 commented May 4, 2018

Hi
Does SMOTE support non-continuous features?
Regards,

@glemaitre (Member)

Not yet. We need to make the implementation accept both continuous and categorical features.

@jijo7 commented Jul 5, 2018

Hi
Sorry, will it take a long time to implement?
Best regards,

@glemaitre (Member)

@jijo7 Feel free to take over the PR and finish the implementation.
Working on it full time, it should not take long.

@glemaitre (Member) left a comment

I am starting to look at this PR.
Is the method strictly following the implementation of SMOTE-NC, or is there a bit of a trick in there? I don't recall that they use one-hot encoding, for instance.

imblearn/base.py Outdated
@@ -63,7 +63,7 @@ def sample(self, X, y):

return self._sample(X, y)

def fit_sample(self, X, y):
def fit_sample(self, X, y, **fit_params):
Member

I think I would avoid passing a keyword here. We need to pass the keyword when creating the instance.

In addition, I would propose an argument `categorical_features='auto'` or `categorical_features=[1, ...]` to give the indices. With `'auto'`, I think we can check which columns are binary {0, 1}, previously encoded by a `OneHotEncoder`. It will be slower, so we can directly pass the index of the column instead (or even the column name if we want to support pandas DataFrames).
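A minimal sketch of the `'auto'` detection idea, assuming the input was previously one-hot encoded; the helper name is illustrative, not part of any API:

import numpy as np


def binary_column_indices(X):
    """Indices of the columns whose values are a subset of {0, 1}."""
    return [i for i in range(X.shape[1])
            if np.isin(np.unique(X[:, i]), [0, 1]).all()]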

Contributor Author

Hi there, thanks for looking into this PR!

  1. fit_params keyword (or, more specifically, the feature_indices keyword parameter in the SMOTENC class): I also considered making it an init parameter, set when the SMOTENC instance is created. However, in the end I opted for a keyword argument. Here is my motivation: feature_indices indicates which columns of the input represent the nominal features after they were one-hot encoded. Thus, it is directly tied to the provided input X. So if you want to call the fit method on a different X, you might have to provide different feature_indices. More generally, I assumed that init should accept parameters which are not input-dependent, while fit can take parameters which depend on the given input X.

  2. categorical_features='auto': I like this idea! (I guess you mean the aforementioned feature_indices parameter?) Just one question: would it maybe make sense to implement it as an improvement/follow-up ticket after this PR is merged? I am asking since I am not sure I will have enough time in the near future to work on it.

As for your question regarding one-hot encoding, you are right, it is not mentioned in the paper. Also, strictly speaking, it is not part of the SMOTENC implementation either. It just expects the data to be one-hot encoded because: a) SMOTENC reuses a lot of code from SMOTE (nearest neighbors, initial sample generation, etc.) which would not work if the input contained non-numerical data; b) one-hot encoding is the go-to tool when dealing with nominal features in sklearn; c) when applied, it conveniently returns a feature_indices array to be used by SMOTENC.

Regards,
Denis

Member

  1. My concern is that we don't follow the current scikit-learn API. The ColumnTransformer, for instance, gets the same kind of parameters at initialization. Even if I agree that it would make sense to have it at fit time, I am not aware of a mechanism to forward the information from the OneHotEncoder to SMOTE within a Pipeline. In the long term this should be done with sample_props, which is indeed something passed in fit, similarly to what you are proposing. In the meantime, however, I would be more conservative.

  2. I prefer the term categorical_features to features_indices since it makes explicit what we are targeting. My proposal regarding 'auto' is actually meant to partially solve the problem spotted in 1. Regarding the extra work, I could implement it and push it to your PR, if you are OK with that.

  3. I agree with your last point. However, since we are not actually implementing the original SMOTE-NC, I would propose to rename the class to something different so we do not clash with other implementations which might differ. I don't have a good name yet, CatSMOTE maybe. Any proposal would be nice.

Contributor Author

  1. Agreed (I'll address it during the rebase)
  2. Agreed
  3. I still think we could use SMOTE-NC as the name. Even though one-hot encoding is not mentioned in the paper, using it does not contradict the paper: all of steps 1-3 (median computation, nearest neighbors, sample generation) behave as described there. I would call the usage of OneHotEncoder just an "implementation detail". But maybe I'm missing something, so please let me know if you have a different idea.
    (for reference: https://arxiv.org/pdf/1106.1813.pdf, Section 6.1)


@here I just wanted to ask if a method has been implemented to inverse_transform the OneHotEncoded values after applying the SMOTE-NC.

Also, I did test it with some sample datasets having categorical and nominal values, but the one-hot-encoded (binary) values also get incremented by the provided 'step' parameter, resulting in values such as 0.1, 0.12, 0.6, etc.

How to convert them back to the original categorical dataset with these type of values?

Member

I just wanted to ask if a method has been implemented to inverse_transform the OneHotEncoded values after applying the SMOTE-NC.

Since I integrated the OneHotEncoder within SMOTENC, the features will be reverted.

How to convert them back to the original categorical dataset with these type of values?

Now you will get the original categories back, so it should not be an issue. I still have to implement more tests to be sure that it behaves as expected in all situations.


@glemaitre There is some issue with OHE (when used normally).
The recent version provides the functionality of converting without label-encoding first. I tried the process with a dataset of shape (284807, 30) and it gave me a memory error.

I hope it might be of some use.

Member

Recent version provides functionality of converting it without LabelEncoding it.

Yes, you should not use LabelEncoder for such usage. OneHotEncoder has been fixed in 0.20.

I tried the process with a dataset of shape ((284807, 30)) and it was giving me memory error.

It depends on how many categories you have per column. You can end up with a really huge matrix and, depending on your RAM, you will get this error. One possibility is to convert the array to a sparse matrix.

We could consider always using the OHE internally with sparse matrices to avoid such issues as much as possible.
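A minimal illustration of that point, assuming scikit-learn >= 0.20; the shapes mirror the dataset mentioned above:

import numpy as np
from sklearn.preprocessing import OneHotEncoder

rng = np.random.RandomState(0)
X_cat = rng.randint(0, 1000, size=(284807, 5))  # high-cardinality columns

# sparse=True keeps the encoded output as a CSR matrix instead of
# materializing a dense array with ~5000 columns.
X_encoded = OneHotEncoder(categories='auto', sparse=True).fit_transform(X_cat)
print(X_encoded.shape, X_encoded.format)  # (284807, ~5000) csr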

@glemaitre (Member)

We recently refactored the SMOTE class, which will also require a rebase.
If you could do that, and if we can come to an agreement regarding the parameters and naming, I can take over the automatic detection of categorical features.

WDYT?

@glemaitre (Member)

@jnothman similarly to sample_props, is there a similar proposal for features?

Consider the following case: a ColumnTransformer with a OneHotEncoder for the categorical data and a StandardScaler for the numerical data. SMOTE-NC would benefit if the OneHotEncoder output an array with the column names, which could then be passed at fit time to SMOTE-NC.

Do we have something in mind regarding this case and is it a "generic" enough case such that other estimators in scikit-learn can benefit from it?

@jnothman (Member) commented Sep 3, 2018

That conversation is not happening in any straightforward way, though I have noted that they are sides of the same coin in the Draft Roadmap (see "Passing around information that is not (X, y)"). I do think we have substantial related problems in the limitations of get_feature_names, the identification of columns for transformation in ColumnTransformer, sample props, and the issue of passing the full set of classes to sub-estimators and scorers (rather than inferring from y).

We have not yet considered combined/holistic solutions for sample, feature and target properties, but perhaps we should...

@ddudnik (Contributor) commented Sep 3, 2018

PR is rebased

@glemaitre (Member) left a comment

This is only a partial review; I will continue tomorrow.

"""
sample = super(SMOTENC, self)._generate_sample(X, nn_data, nn_num,
                                               row, col, step)
if not hasattr(self, "categorical_feature_indices_"):
Member

I would remove this part. We will have raised an error much earlier.

@ddudnik (Contributor) commented Sep 4, 2018

Regarding categorical_features="auto".

I gave it another thought and looked through the code one more time (it has been a long time since I initially implemented it), and it seems that an 'auto' mode is not possible here.

The thing is that we need to distinguish between two different categorical features from the original dataset. I'll try to explain via an example (see the sketch after this comment).

Let's say we have a dataset with 4 columns holding features A, B, C and D. Features C and D are two separate categorical ones: feature C has 2 possible values and feature D has 3 possible values. After one-hot encoding we'll end up with a new dataset with 7 columns: features A and B are left intact, but instead of feature C we have two columns (C1, C2), and instead of feature D we have three columns (D1, D2, D3). These new columns are of course binary.

For this code to work properly, it has to know where the feature C columns end and where the feature D columns start, i.e. we need to know that columns 2-3 represent feature C and columns 4-6 represent feature D.

This is exactly what we convey with categorical_feature_indices. In this example it would be [2, 4, 7].

I hope this explains why 'auto' is not really an option here. Unless I'm missing something.
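A runnable version of this example (my sketch, assuming scikit-learn >= 0.20; the variable names are illustrative):

import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Columns: A, B continuous; C has 2 categories; D has 3 categories.
X = np.array([[0.5, 1.2, 0, 0],
              [0.1, 3.4, 1, 2],
              [0.9, 2.2, 0, 1]])

ohe = OneHotEncoder(categories='auto', sparse=False)
X_cat = ohe.fit_transform(X[:, 2:])       # 5 binary columns: C1 C2 D1 D2 D3
X_encoded = np.hstack([X[:, :2], X_cat])  # shape (3, 7)

# Block boundaries of each categorical feature in X_encoded, offset by
# the two continuous columns: C spans [2, 4), D spans [4, 7).
n_continuous = 2
boundaries = n_continuous + np.cumsum(
    [0] + [len(cats) for cats in ohe.categories_])
print(boundaries.tolist())  # [2, 4, 7]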

@glemaitre (Member) left a comment

We need to add an additional test to check the "auto" mode.

We should check that we get the same results by using "auto" as by explicitly specifying which columns are categorical.
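A hedged sketch of such a test, assuming the categorical_features='auto' API discussed in this thread (which may differ from what was finally merged):

import numpy as np
from imblearn.over_sampling import SMOTENC


def test_smotenc_auto_matches_explicit():
    rng = np.random.RandomState(0)
    # Two continuous columns plus one binary categorical column.
    X = np.hstack([rng.randn(30, 2), rng.randint(0, 2, size=(30, 1))])
    y = np.array([0] * 20 + [1] * 10)

    X_auto, y_auto = SMOTENC(categorical_features='auto',
                             random_state=0).fit_resample(X, y)
    X_expl, y_expl = SMOTENC(categorical_features=[2],
                             random_state=0).fit_resample(X, y)

    np.testing.assert_allclose(X_auto, X_expl)
    np.testing.assert_array_equal(y_auto, y_expl)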

See also
--------
SMOTE : Over-sample using SMOTE.

Member

We should also add BorderlineSMOTE here.

from . import SMOTE


class SMOTENC(SMOTE):
Member

We still need to move this class into smote.py.


{random_state}

{k_neighbors}
Member

We should keep the full documentation in the docstring for those parameters. We use the substitution only for sampling_strategy and random_state; the other parameters ease development if they are written out in the documentation.

Long story short, people tried to do that in scikit-learn and it failed, so we might want to avoid this :)

sampling_strategy='auto',
random_state=None,
k_neighbors=5,
m_neighbors='deprecated',
Member

Let's remove all parameters which are deprecated; we don't need them here.
We will not support the old-style version using kind.

self.categorical_feature_indices = categorical_feature_indices

def _fit_resample(self, X, y):
if self.categorical_feature_indices is None:
Member

I would prefer:

if self.categorical_feature_indices == 'auto':
    # treat previously one-hot-encoded (binary) columns as categorical
    self.categorical_feature_indices_ = [
        i for i in range(n_features) if len(_encode(X[:, i])) <= 2]
else:
    categorical_feature_indices = check_array(...)
    if not np.all(np.isin(categorical_feature_indices,
                          np.arange(n_features))):
        raise ValueError(...)
    self.categorical_feature_indices_ = categorical_feature_indices

else:
sample[start:end] = new_nominal_value

return sample.tocsr() if is_sparse else sample
Member

Only return dense here. The magic should occur in fit_resample, in the sparse branch, which should concatenate the matrices and eliminate zeros, along the lines of the sketch below.
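A minimal sketch of that sparse branch, under my reading of the suggestion; X_new here stands for the matrix of synthetic samples:

from scipy import sparse

# Stack the original and synthetic samples, then drop the explicit
# zeros introduced while editing the categorical columns.
X_resampled = sparse.vstack([X, X_new]).tocsr()
X_resampled.eliminate_zeros()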

between two feature vectors, the difference between two different categorical
features will always be equal to the median standard deviation.
"""
if not hasattr(self, "categorical_feature_indices_"):
Member

Remove this part. We will not accept it for the moment.

'to normal SMOTE', RuntimeWarning)
return X

X_copy = X.copy().tolil() if sparse.issparse(X) else X.copy()
Member

Actually, it would be better to create a temporary CSC matrix: go over the feature columns, change the values in place, and return X.tocsr().

Member

X = X.tocsc()  # column-wise edits are efficient on CSC
for feature_idx in self.categorical_feature_indices:
    nnz_idx = X[:, feature_idx].nonzero()[0]
    X[nnz_idx, feature_idx] = self.std_median_ / 2
return X.tocsr()

X_indices = [2, 5, 7]


def test_sample_regular():
Member

We will remove those tests with numbers. There is a common test that already exercises the over-sampling; we should instead test the corner cases and the things which we can properly control.

smote.fit_resample(X, Y)


def test_fit_feature_indices_out_of_bounds():
Member

rename feature_indices to categorical_...

Member

I would concatenate all the error cases into a single test using pytest.mark.parametrize:

@pytest.mark.parametrize(
    "categorical_feature_indices, msg_err",
    [([1], 'minimum of 2 is required'),
     ([3, 5, 8], 'Indices ...'),
     (...)]
)
def test_smote_nc_errors(categorical_feature_indices, msg_err):
    smote = SMOTENC(random_state=RND_SEED,
                    categorical_feature_indices=categorical_feature_indices)
    with pytest.raises(ValueError, match=msg_err):
        smote.fit_resample(X, y)

@glemaitre (Member)

For the common test you need to add SMOTENC to the DONT_SUPPORT_RATIO list.

@glemaitre (Member)

Thinking of which, we also need to provide an X which is only categorical, or only numerical, to check that the sampler raises an error in those cases.

@glemaitre (Member)

This is exactly what we say with categorical_feature_indices. In this example it will be: [2, 4, 7]. I hope this explains why 'auto' is not really an option here. Unless I miss something.

So categorical_feature_indices is actually the list of starting indices of the encoded features.
This is not good.

I would prefer then to include the OneHotEncoder in SMOTE-NC.
The user will provide the categorical columns, and we can apply a OneHotEncoder on those and the StandardScaler on the others (using the ColumnTransformer); then we will know the start and end ourselves, and it will be easier from a user perspective.

We also ensure that the columns are definitely encoded (a user could forget the OneHotEncoder if they use it without care).

'For this method to work X should have '
'at least 1 continuous feature.')

if sparse.issparse(X):
Member

Actually, use the following function, mean_variance_axis, to compute the std on the sparse matrix:

https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/sparsefuncs.py#L64

Member

Basically you can factorize:

X_continuous = X[:, self.continuous_feature_indices_]

stds = (np.sqrt(mean_variance_axis(X_continuous, axis=0)[1])
        if sparse.issparse(X) else np.std(X_continuous, axis=0))
self.std_median_ = np.median(stds)

@glemaitre force-pushed the master branch 2 times, most recently from bbf2b12 to 513203c, on September 7, 2018 at 13:26.
@glemaitre (Member)

I will try to finish up this PR. To simplify the implementation without changing too much code, we only need to change the _sample function:

  • One-hot encode the input X, knowing the categorical features.
  • Use the trick mentioned above, replacing the 1 entries by the median.
  • Create the samples by calling _sample from the SMOTE implementation.
  • Revert the one-hot encoding (see the sketch after this comment).

Ideally, we could have used the ColumnTransformer, but it does not have an inverse_transform yet, which would allow reversing the one-hot encoding easily. Until scikit-learn 0.21, we will have to do this part by hand.
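A hedged, self-contained sketch of those four steps, assuming plain SMOTE from imbalanced-learn and scikit-learn >= 0.20. Every name is illustrative rather than the merged implementation, and the argmax-style decoding done by OneHotEncoder.inverse_transform only approximates the paper's majority vote for the categorical features:

import numpy as np
from sklearn.preprocessing import OneHotEncoder
from imblearn.over_sampling import SMOTE


def smote_nc_sketch(X, y, categorical, random_state=None):
    continuous = [i for i in range(X.shape[1]) if i not in categorical]

    # 1. One-hot encode the categorical columns.
    ohe = OneHotEncoder(categories='auto', sparse=True)
    X_ohe = ohe.fit_transform(X[:, categorical])

    # 2. Replace the 1 entries by the median std of the continuous
    #    features so the distance computation matches the paper.
    med = np.median(np.std(X[:, continuous], axis=0))
    X_ohe.data[:] = med

    # 3. Generate samples by running plain SMOTE on the encoded matrix.
    X_enc = np.hstack([X[:, continuous], X_ohe.toarray()])
    X_res, y_res = SMOTE(random_state=random_state).fit_resample(X_enc, y)

    # 4. Revert the one-hot encoding: inverse_transform picks the
    #    largest value within each categorical block.
    n_cont = len(continuous)
    X_cat = ohe.inverse_transform(X_res[:, n_cont:])
    return np.hstack([X_res[:, :n_cont], X_cat]), y_res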

@glemaitre changed the title from "[WIP] SMOTE-NC implementation #401" to "EHN: implementation of SMOTE-NC for continuous and categorical mixed types" on Oct 11, 2018.
@glemaitre (Member)

@ddudnik I think that we are almost good to go. I want to check whether I need to add some additional tests.
Could you have a look at this PR and share any thoughts?

@chkoar Do you have time for a quick review of this PR?

@glemaitre merged commit 2d4902e into scikit-learn-contrib:master on Oct 12, 2018.
@glemaitre (Member)

Merging since it seems ready. @ddudnik Thanks for the implementation.

@ddudnik (Contributor) commented Oct 19, 2018

Glad to see it merged! Thank you for taking over and finishing this. I wish I had had more time in the past month to be of more help.

@ddudnik (Contributor) commented Oct 19, 2018

@glemaitre I've looked through the code briefly, and I have one remark regarding replacing the '1' entries of the categorical features with the median of the standard deviations. Shouldn't we replace the '1' entries with the median of the standard deviations divided by two? Otherwise the distance between two different categorical values is 2 * med_std.
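(A quick check of this arithmetic, assuming Euclidean distance on the one-hot-encoded columns: two samples that disagree on one categorical feature differ in exactly two one-hot positions, each by the substituted value v, so the squared-distance contribution is 2 * v^2. With v = med_std the distance contribution is sqrt(2) * med_std; with v = med_std / 2 it is med_std / sqrt(2); it equals med_std exactly, as the paper intends, for v = med_std / sqrt(2).)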

@glemaitre (Member)

glemaitre commented Oct 19, 2018 via email

@ddudnik deleted the smote_nc branch on October 22, 2018.