
EHN: implementation of SMOTE-NC for continuous and categorical mixed types #412


Merged: 14 commits, Oct 12, 2018

Conversation

@ddudnik (Contributor) commented Mar 5, 2018

Reference Issue

#401

What does this implement/fix? Explain your changes.

Implements SMOTE-NC as described in Section 6.1 of the original SMOTE paper by N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer.

Any other comments?

Some parts are still missing before this is ready to merge, but I would like to get an opinion on the implementation first, especially the part which deals with sparse matrices, as I do not have much experience with them.

Points to pay attention to:

  • working with sparse matrices
  • 2 FIXME points in code
  • The 'fit' method expects a 'feature_indices' keyword argument and, if it is not provided, issues a warning and falls back to normal SMOTE. Raising an error would probably be better, but that would break the common estimator tests from sklearn (via imblearn/tests/test_common)

@pep8speaks commented Mar 5, 2018

Hello @ddudnik! Thanks for updating the PR.

file_to_check.py:159:-69: W605 invalid escape sequence '\d'


Comment last updated on October 12, 2018 at 10:07 UTC

@codecov bot commented Mar 6, 2018

Codecov Report

Merging #412 into master will increase coverage by 0.03%.
The diff coverage is 99.51%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #412      +/-   ##
==========================================
+ Coverage   98.66%   98.69%   +0.03%     
==========================================
  Files          79       80       +1     
  Lines        4803     4999     +196     
==========================================
+ Hits         4739     4934     +195     
- Misses         64       65       +1
Impacted Files Coverage Δ
imblearn/utils/testing.py 100% <100%> (ø) ⬆️
imblearn/over_sampling/__init__.py 100% <100%> (ø) ⬆️
imblearn/over_sampling/tests/test_smote_nc.py 100% <100%> (ø)
imblearn/over_sampling/_smote.py 95.51% <98.82%> (+1.15%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ae97633...a6ec6a9.

@glemaitre (Member)

I'll check it when I get some time. Thanks for the contribution.

@jijo7 commented May 4, 2018

Hi
Does SMOTE support non-continuous features?
Regards,

@glemaitre (Member)

Not yet. We need to make the implementation accept both continuous and categorical features.

@jijo7 commented Jul 5, 2018

Hi
Sorry, will it take a long time to implement?
Best regards,

@glemaitre (Member)

@jijo7 Feel free to take over the PR and finish the implementation.
Working on it full time, it should not take long.

@glemaitre (Member) left a comment

I am starting to look at this PR.
Is the method strictly following the implementation of SMOTE-NC, or is there a bit of a trick in there? I don't recall that they use one-hot encoding, for instance.

imblearn/base.py Outdated
@@ -63,7 +63,7 @@ def sample(self, X, y):

return self._sample(X, y)

def fit_sample(self, X, y):
def fit_sample(self, X, y, **fit_params):
Member

I think I would avoid passing a keyword here. We need to pass the keyword when creating the instance.

In addition, I would propose an argument `categorical_features='auto'` or `categorical_features=[1, ...]` to give the indices. With `'auto'`, I think we can check which columns are binary {0, 1}, previously encoded by a `OneHotEncoder`. It will be slower, so we can directly pass the index of the column instead (or even the column name if we want to support pandas DataFrames).
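A minimal sketch of the `'auto'` detection idea, assuming the input was previously one-hot encoded; the helper name is illustrative, not part of any API:

import numpy as np


def binary_column_indices(X):
    """Indices of the columns whose values are a subset of {0, 1}."""
    return [i for i in range(X.shape[1])
            if np.isin(np.unique(X[:, i]), [0, 1]).all()]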

Contributor Author

Hi there, thanks for looking into this PR!

  1. fit_params keyword (or, more specifically, the feature_indices keyword parameter in the SMOTENC class): I also considered making it an init parameter, set when the SMOTENC instance is created. However, in the end I opted for a keyword argument. Here is my motivation: feature_indices indicates which columns of the input represent the nominal features after they were one-hot encoded. Thus, it is directly tied to the provided input X. So if you want to call the fit method on a different X, you might have to provide different feature_indices. More generally, I assumed that init should accept parameters which are not input-dependent, while fit can take parameters which depend on the given input X.

  2. categorical_features='auto': I like this idea! (I guess you mean the aforementioned feature_indices parameter?) Just one question: would it maybe make sense to implement it as an improvement/follow-up ticket after this PR is merged? I am asking since I am not sure I will have enough time in the near future to work on it.

As for your question regarding one-hot encoding, you are right, it is not mentioned in the paper. Also, strictly speaking, it is not part of the SMOTENC implementation either. It just expects the data to be one-hot encoded because: a) SMOTENC reuses a lot of code from SMOTE (nearest neighbors, initial sample generation, etc.) which would not work if the input contained non-numerical data; b) one-hot encoding is the go-to tool when dealing with nominal features in sklearn; c) when applied, it conveniently returns a feature_indices array to be used by SMOTENC.

Regards,
Denis

Member

  1. My concern is that we don't follow the current scikit-learn API. The ColumnTransformer, for instance, gets the same kind of parameters at initialization. Even if I agree that it would make sense to have it at fit time, I am not aware of a mechanism to forward the information from the OneHotEncoder to SMOTE within a Pipeline. In the long term this should be done with sample_props, which is indeed something passed in fit, similarly to what you are proposing. In the meantime, however, I would be more conservative.

  2. I prefer the term categorical_features to features_indices since it makes explicit what we are targeting. My proposal regarding 'auto' is actually meant to partially solve the problem spotted in 1. Regarding the extra work, I could implement it and push it to your PR, if you are OK with that.

  3. I agree with your last point. However, since we are not actually implementing the original SMOTE-NC, I would propose to rename the class to something different so we do not clash with other implementations which might differ. I don't have a good name yet, CatSMOTE maybe. Any proposal would be nice.

Contributor Author

  1. Agreed (I'll address it during the rebase)
  2. Agreed
  3. I still think we could use SMOTE-NC as the name. Even though one-hot encoding is not mentioned in the paper, using it does not contradict the paper: all of steps 1-3 (median computation, nearest neighbors, sample generation) behave as described there. I would call the usage of OneHotEncoder just an "implementation detail". But maybe I'm missing something, so please let me know if you have a different idea.
    (for reference: https://arxiv.org/pdf/1106.1813.pdf, Section 6.1)


@here I just wanted to ask if a method has been implemented to inverse_transform the OneHotEncoded values after applying the SMOTE-NC.

Also, I did test it with some sample datasets having categorical and nominal values, but the one-hot-encoded (binary) values also get incremented by the provided 'step' parameter, resulting in values such as 0.1, 0.12, 0.6, etc.

How to convert them back to the original categorical dataset with these type of values?

Member

I just wanted to ask if a method has been implemented to inverse_transform the OneHotEncoded values after applying the SMOTE-NC.

Since I integrated the OneHotEncoder within SMOTENC, the features will be reverted.

How to convert them back to the original categorical dataset with these type of values?

Now you will get the original categories back, so it should not be an issue. I still have to implement more tests to be sure that it behaves as expected in all situations.


@glemaitre There is some issue with OHE (when used normally).
The recent version provides the functionality of converting without label-encoding first. I tried the process with a dataset of shape (284807, 30) and it gave me a memory error.

I hope it might be of some use.

Member

Recent version provides functionality of converting it without LabelEncoding it.

Yes, you should not use LabelEncoder for such usage. OneHotEncoder has been fixed in 0.20.

I tried the process with a dataset of shape ((284807, 30)) and it was giving me memory error.

It depends on how many categories you have per column. You can end up with a really huge matrix and, depending on your RAM, you will get this error. One possibility is to convert the array to a sparse matrix.

We could consider always using the OHE internally with sparse matrices to avoid such issues as much as possible.
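A minimal illustration of that point, assuming scikit-learn >= 0.20; the shapes mirror the dataset mentioned above:

import numpy as np
from sklearn.preprocessing import OneHotEncoder

rng = np.random.RandomState(0)
X_cat = rng.randint(0, 1000, size=(284807, 5))  # high-cardinality columns

# sparse=True keeps the encoded output as a CSR matrix instead of
# materializing a dense array with ~5000 columns.
X_encoded = OneHotEncoder(categories='auto', sparse=True).fit_transform(X_cat)
print(X_encoded.shape, X_encoded.format)  # (284807, ~5000) csr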

@glemaitre (Member)

We recently refactored the SMOTE class, which will also require a rebase.
If you could do that, and if we can come to an agreement regarding the parameters and naming, I can take over the automatic detection of categorical features.

WDYT?

@glemaitre (Member)

@jnothman similarly to sample_props, is there a similar proposal for features?

Consider the following case: a ColumnTransformer with a OneHotEncoder for the categorical data and a StandardScaler for the numerical data. SMOTE-NC would benefit if the OneHotEncoder output an array with the column names, which could then be passed at fit time to SMOTE-NC.

Do we have something in mind regarding this case and is it a "generic" enough case such that other estimators in scikit-learn can benefit from it?

@jnothman (Member) commented Sep 3, 2018

That conversation is not happening in any straightforward way, though I have noted that they are sides of the same coin in the Draft Roadmap (see "Passing around information that is not (X, y)"). I do think we have substantial related problems in the limitations of get_feature_names, the identification of columns for transformation in ColumnTransformer, sample props, and the issue of passing the full set of classes to sub-estimators and scorers (rather than inferring from y).

We have not yet considered combined/holistic solutions for sample, feature and target properties, but perhaps we should...

@ddudnik (Contributor) commented Sep 3, 2018

PR is rebased

@glemaitre (Member) left a comment

This is only a partial review; I will continue tomorrow.

"""
sample = super(SMOTENC, self)._generate_sample(X, nn_data, nn_num,
                                               row, col, step)
if not hasattr(self, "categorical_feature_indices_"):
Member

I would remove this part. We will have raised an error much earlier.

@ddudnik (Contributor) commented Sep 4, 2018

Regarding categorical_features="auto".

I gave it another thought and looked through the code one more time (it has been a long time since I initially implemented it), and it seems that an 'auto' mode is not possible here.

The thing is that we need to distinguish between two different categorical features from the original dataset. I'll try to explain via an example (see the sketch after this comment).

Let's say we have a dataset with 4 columns holding features A, B, C and D. Features C and D are two separate categorical ones: feature C has 2 possible values and feature D has 3 possible values. After one-hot encoding we'll end up with a new dataset with 7 columns: features A and B are left intact, but instead of feature C we have two columns (C1, C2), and instead of feature D we have three columns (D1, D2, D3). These new columns are of course binary.

For this code to work properly, it has to know where the feature C columns end and where the feature D columns start, i.e. we need to know that columns 2-3 represent feature C and columns 4-6 represent feature D.

This is exactly what we convey with categorical_feature_indices. In this example it would be [2, 4, 7].

I hope this explains why 'auto' is not really an option here. Unless I'm missing something.
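A runnable version of this example (my sketch, assuming scikit-learn >= 0.20; the variable names are illustrative):

import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Columns: A, B continuous; C has 2 categories; D has 3 categories.
X = np.array([[0.5, 1.2, 0, 0],
              [0.1, 3.4, 1, 2],
              [0.9, 2.2, 0, 1]])

ohe = OneHotEncoder(categories='auto', sparse=False)
X_cat = ohe.fit_transform(X[:, 2:])       # 5 binary columns: C1 C2 D1 D2 D3
X_encoded = np.hstack([X[:, :2], X_cat])  # shape (3, 7)

# Block boundaries of each categorical feature in X_encoded, offset by
# the two continuous columns: C spans [2, 4), D spans [4, 7).
n_continuous = 2
boundaries = n_continuous + np.cumsum(
    [0] + [len(cats) for cats in ohe.categories_])
print(boundaries.tolist())  # [2, 4, 7]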

@glemaitre (Member) left a comment

We need to add an additional test to check the "auto" mode.

We should check that we get the same results by using "auto" as by explicitly specifying which columns are categorical.
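A hedged sketch of such a test, assuming the categorical_features='auto' API discussed in this thread (which may differ from what was finally merged):

import numpy as np
from imblearn.over_sampling import SMOTENC


def test_smotenc_auto_matches_explicit():
    rng = np.random.RandomState(0)
    # Two continuous columns plus one binary categorical column.
    X = np.hstack([rng.randn(30, 2), rng.randint(0, 2, size=(30, 1))])
    y = np.array([0] * 20 + [1] * 10)

    X_auto, y_auto = SMOTENC(categorical_features='auto',
                             random_state=0).fit_resample(X, y)
    X_expl, y_expl = SMOTENC(categorical_features=[2],
                             random_state=0).fit_resample(X, y)

    np.testing.assert_allclose(X_auto, X_expl)
    np.testing.assert_array_equal(y_auto, y_expl)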

See also
--------
SMOTE : Over-sample using SMOTE.

Member

We should also add BorderlineSMOTE here.

from . import SMOTE


class SMOTENC(SMOTE):
Member

We still need to move this class into smote.py.


{random_state}

{k_neighbors}
Member

We should keep the full documentation in the docstring for those parameters. We use the substitution only for sampling_strategy and random_state; the other parameters ease development if they are written out in the documentation.

Long story short, people tried to do that in scikit-learn and it failed, so we might want to avoid this :)

sampling_strategy='auto',
random_state=None,
k_neighbors=5,
m_neighbors='deprecated',
Member

Let's remove all parameters which are deprecated; we don't need them here.
We will not support the old-style version using kind.

self.categorical_feature_indices = categorical_feature_indices

def _fit_resample(self, X, y):
if self.categorical_feature_indices is None:
Member

I would prefer:

if self.categorical_feature_indices == 'auto':
    # treat previously one-hot-encoded (binary) columns as categorical
    self.categorical_feature_indices_ = [
        i for i in range(n_features) if len(_encode(X[:, i])) <= 2]
else:
    categorical_feature_indices = check_array(...)
    if not np.all(np.isin(categorical_feature_indices,
                          np.arange(n_features))):
        raise ValueError(...)
    self.categorical_feature_indices_ = categorical_feature_indices

else:
sample[start:end] = new_nominal_value

return sample.tocsr() if is_sparse else sample
Member

Only return dense here. The magic should occur in fit_resample, in the sparse branch, which should concatenate the matrices and eliminate zeros, along the lines of the sketch below.
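A minimal sketch of that sparse branch, under my reading of the suggestion; X_new here stands for the matrix of synthetic samples:

from scipy import sparse

# Stack the original and synthetic samples, then drop the explicit
# zeros introduced while editing the categorical columns.
X_resampled = sparse.vstack([X, X_new]).tocsr()
X_resampled.eliminate_zeros()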

between two feature vectors, the difference between two different categorical
features will always be equal to the median standard deviation.
"""
if not hasattr(self, "categorical_feature_indices_"):
Member

Remove this part. We will not accept it for the moment.

'to normal SMOTE', RuntimeWarning)
return X

X_copy = X.copy().tolil() if sparse.issparse(X) else X.copy()
Member

Actually, it would be better to create a temporary CSC matrix: go over the feature columns, change the values in place, and return X.tocsr().

Member

X = X.tocsc()  # column-wise edits are efficient on CSC
for feature_idx in self.categorical_feature_indices:
    nnz_idx = X[:, feature_idx].nonzero()[0]
    X[nnz_idx, feature_idx] = self.std_median_ / 2
return X.tocsr()

X_indices = [2, 5, 7]


def test_sample_regular():
Member

We will remove those tests with numbers. There is a common test that already exercises the over-sampling; we should instead test the corner cases and the things which we can properly control.

smote.fit_resample(X, Y)


def test_fit_feature_indices_out_of_bounds():
Member

rename feature_indices to categorical_...

Member

I would concatenate all the error cases into a single test using pytest.mark.parametrize:

@pytest.mark.parametrize(
    "categorical_feature_indices, msg_err",
    [([1], 'minimum of 2 is required'),
     ([3, 5, 8], 'Indices ...'),
     (...)]
)
def test_smote_nc_errors(categorical_feature_indices, msg_err):
    smote = SMOTENC(random_state=RND_SEED,
                    categorical_feature_indices=categorical_feature_indices)
    with pytest.raises(ValueError, match=msg_err):
        smote.fit_resample(X, y)

@glemaitre (Member)

For the common test you need to add SMOTENC to the DONT_SUPPORT_RATIO list.

@glemaitre (Member)

Thinking of which, we also need to provide an X which is only categorical, or only numerical, to check that the sampler raises an error in those cases.

@glemaitre (Member)

This is exactly what we say with categorical_feature_indices. In this example it will be: [2, 4, 7]. I hope this explains why 'auto' is not really an option here. Unless I miss something.

So categorical_feature_indices is actually the list of starting indices of the encoded features.
This is not good.

I would prefer then to include the OneHotEncoder in SMOTE-NC.
The user will provide the categorical columns, and we can apply a OneHotEncoder on those and the StandardScaler on the others (using the ColumnTransformer); then we will know the start and end ourselves, and it will be easier from a user perspective.

We also ensure that the columns are definitely encoded (a user could forget the OneHotEncoder if they use it without care).

'For this method to work X should have '
'at least 1 continuous feature.')

if sparse.issparse(X):
Member

Actually, use the following function, mean_variance_axis, to compute the std on the sparse matrix:

https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/sparsefuncs.py#L64

Member

Basically you can factorize:

X_continuous = X[:, self.continuous_feature_indices_]

stds = (np.sqrt(mean_variance_axis(X_continuous, axis=0)[1])
        if sparse.issparse(X) else np.std(X_continuous, axis=0))
self.std_median_ = np.median(stds)

@glemaitre force-pushed the master branch 2 times, most recently from bbf2b12 to 513203c, on September 7, 2018 at 13:26.
@glemaitre (Member)

I will try to finish up this PR. To simplify the implementation without changing too much code, we only need to change the _sample function:

  • One-hot encode the input X, knowing the categorical features.
  • Use the trick mentioned above, replacing the 1 entries by the median.
  • Create the samples by calling _sample from the SMOTE implementation.
  • Revert the one-hot encoding (see the sketch after this comment).

Ideally, we could have used the ColumnTransformer, but it does not have an inverse_transform yet, which would allow reversing the one-hot encoding easily. Until scikit-learn 0.21, we will have to do this part by hand.
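A hedged, self-contained sketch of those four steps, assuming plain SMOTE from imbalanced-learn and scikit-learn >= 0.20. Every name is illustrative rather than the merged implementation, and the argmax-style decoding done by OneHotEncoder.inverse_transform only approximates the paper's majority vote for the categorical features:

import numpy as np
from sklearn.preprocessing import OneHotEncoder
from imblearn.over_sampling import SMOTE


def smote_nc_sketch(X, y, categorical, random_state=None):
    continuous = [i for i in range(X.shape[1]) if i not in categorical]

    # 1. One-hot encode the categorical columns.
    ohe = OneHotEncoder(categories='auto', sparse=True)
    X_ohe = ohe.fit_transform(X[:, categorical])

    # 2. Replace the 1 entries by the median std of the continuous
    #    features so the distance computation matches the paper.
    med = np.median(np.std(X[:, continuous], axis=0))
    X_ohe.data[:] = med

    # 3. Generate samples by running plain SMOTE on the encoded matrix.
    X_enc = np.hstack([X[:, continuous], X_ohe.toarray()])
    X_res, y_res = SMOTE(random_state=random_state).fit_resample(X_enc, y)

    # 4. Revert the one-hot encoding: inverse_transform picks the
    #    largest value within each categorical block.
    n_cont = len(continuous)
    X_cat = ohe.inverse_transform(X_res[:, n_cont:])
    return np.hstack([X_res[:, :n_cont], X_cat]), y_res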

@glemaitre changed the title from "[WIP] SMOTE-NC implementation #401" to "EHN: implementation of SMOTE-NC for continuous and categorical mixed types" on Oct 11, 2018.
@glemaitre (Member)

@ddudnik I think that we are almost good to go. I want to check whether I need to add some additional tests.
Could you have a look at this PR and share any thoughts?

@chkoar Do you have time for a quick review of this PR?

@glemaitre merged commit 2d4902e into scikit-learn-contrib:master on Oct 12, 2018.
@glemaitre (Member)

Merging since it seems ready. @ddudnik Thanks for the implementation.

@ddudnik (Contributor) commented Oct 19, 2018

Glad to see it merged! Thank you for taking over and finishing this. I wish I had had more time in the past month to be of more help.

@ddudnik (Contributor) commented Oct 19, 2018

@glemaitre I've looked through the code briefly, and I have one remark regarding replacing the '1' entries of the categorical features with the median of the standard deviations. Shouldn't we replace the '1' entries with the median of the standard deviations divided by two? Otherwise the distance between two different categorical values is 2 * med_std.
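(A quick check of this arithmetic, assuming Euclidean distance on the one-hot-encoded columns: two samples that disagree on one categorical feature differ in exactly two one-hot positions, each by the substituted value v, so the squared-distance contribution is 2 * v^2. With v = med_std the distance contribution is sqrt(2) * med_std; with v = med_std / 2 it is med_std / sqrt(2); it equals med_std exactly, as the paper intends, for v = med_std / sqrt(2).)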

@glemaitre (Member)

glemaitre commented Oct 19, 2018 via email

@ddudnik deleted the smote_nc branch on October 22, 2018.