EHN: implementation of SMOTE-NC for continuous and categorical mixed types #412

Merged
merged 14 commits into from Oct 12, 2018

Conversation

6 participants
@ddudnik
Contributor

ddudnik commented Mar 5, 2018

Reference Issue

#401

What does this implement/fix? Explain your changes.

Implements SMOTE-NC as per section 6.1 of the original SMOTE paper by N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer

Any other comments?

Some parts are missing to make it ready to merge, but I would like to get an opinion on implementation first, especially on the part which deals with sparse matrices as I do not have much experience with them.

Points to pay attention to:

  • working with sparse matrices
  • 2 FIXME points in code
  • the 'fit' method expects a 'feature_indices' keyword argument and, if it is not provided, issues a warning and falls back to normal SMOTE. Raising an error would probably be better, but that would break the common estimator tests from sklearn (via imblearn/tests/test_common)
@pep8speaks

pep8speaks commented Mar 5, 2018

Hello @ddudnik! Thanks for updating the PR.

file_to_check.py:159:-69: W605 invalid escape sequence '\d'


Comment last updated on October 12, 2018 at 10:07 Hours UTC
@codecov

codecov bot commented Mar 6, 2018

Codecov Report

Merging #412 into master will increase coverage by 0.03%.
The diff coverage is 99.51%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #412      +/-   ##
==========================================
+ Coverage   98.66%   98.69%   +0.03%     
==========================================
  Files          79       80       +1     
  Lines        4803     4999     +196     
==========================================
+ Hits         4739     4934     +195     
- Misses         64       65       +1
Impacted Files Coverage Δ
imblearn/utils/testing.py 100% <100%> (ø) ⬆️
imblearn/over_sampling/__init__.py 100% <100%> (ø) ⬆️
imblearn/over_sampling/tests/test_smote_nc.py 100% <100%> (ø)
imblearn/over_sampling/_smote.py 95.51% <98.82%> (+1.15%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ae97633...a6ec6a9. Read the comment docs.

@glemaitre

Member

glemaitre commented Mar 8, 2018

I'll check it when I get some time. Thanks for the contribution.

@jijo7

jijo7 commented May 4, 2018

Hi
Does SMOTE support non-continuous features?
Regards,

@glemaitre

Member

glemaitre commented May 5, 2018

Not yet. We need to make the implementation accept both continuous and categorical features.

@jijo7

jijo7 commented Jul 5, 2018

Hi
Sorry, will it take a long time to be implemented?
Best regards,

@glemaitre

Member

glemaitre commented Jul 20, 2018

@jijo7 Feel free to take over the PR and finish the implementation.
Working on it full time, it should not take long to implement.

@glemaitre

I am starting to look at this PR.
Is the method strictly following the implementation of SMOTE-NC, or is there a bit of a trick there? I don't recall that they use one-hot encoding, for instance.

@@ -63,7 +63,7 @@ def sample(self, X, y):

return self._sample(X, y)

def fit_sample(self, X, y):
def fit_sample(self, X, y, **fit_params):

@glemaitre

glemaitre Aug 27, 2018

Member

I think I would avoid passing a keyword here.
We need to pass the keyword when creating the instance.

In addition, I would propose an argument categorical_features='auto' or categorical_features=[1, ...] to give the indices. With 'auto', I think we can check which columns are binary {0, 1}, as previously encoded by a OneHotEncoder.
That will be slower, so we could instead directly pass the indices of the columns (or even the column names if we want to support pandas DataFrames).
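One way the proposed 'auto' detection could look, as a rough sketch (detect_binary_columns is a hypothetical helper, not code from this PR):

```python
import numpy as np

# Hypothetical sketch of the proposed 'auto' mode: flag the columns whose
# values are a subset of {0, 1} as (one-hot encoded) categorical columns.
def detect_binary_columns(X):
    return [j for j in range(X.shape[1])
            if np.isin(np.unique(X[:, j]), [0, 1]).all()]

X = np.array([[0.5, 0, 1],
              [1.2, 1, 0],
              [0.1, 0, 1]])
binary_cols = detect_binary_columns(X)  # columns 1 and 2 look one-hot encoded
```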

@ddudnik

ddudnik Aug 30, 2018

Contributor

Hi there, thanks for looking into this PR!

  1. fit_params keyword (or, more specifically, the feature_indices keyword parameter in the SMOTENC class): I also considered making it an init parameter to be set when the SMOTENC instance is created. However, in the end I opted for a keyword argument. Here's my motivation: feature_indices indicates which columns of the input represent the nominal features after they were one-hot encoded. Thus, it is directly tied to the provided input X. So if you want to call the fit method on a different X, you might have to provide different feature_indices. More generally, I assumed that init should accept parameters which are not input-dependent, while fit can take parameters which depend on the given input X.

  2. categorical_features='auto': I like this idea! (I guess you mean the aforementioned feature_indices parameter?) Just one question: would it maybe make sense to implement it as a follow-up improvement after this PR is merged? Asking since I'm not sure I will have enough time in the near future to work on it.

As for your question regarding OneHotEncoding: you're right, it is not mentioned in the paper. Also, strictly speaking, it is not part of the SMOTENC implementation either. The code just expects the data to be one-hot encoded because: a) SMOTENC re-uses a lot of code from SMOTE (nearest neighbors, initial sample generation, etc.) which would not work if the input contained non-numerical data; b) OneHotEncoding is the go-to tool when dealing with nominal features in sklearn; c) when applied, it conveniently returns the feature_indices array to be used by SMOTENC.

Regards,
Denis

@glemaitre

glemaitre Aug 30, 2018

Member
  1. My concern is that we don't follow the current scikit-learn API. The ColumnTransformer, for instance, will get the same kind of parameters at initialization. Even if I agree that it would make sense to have it at fit time, I am not aware of a mechanism to forward the information from the OneHotEncoder to SMOTE within a Pipeline. In the long term it should be done with sample_props, which is indeed something passed at fit similarly to what you are proposing. In the meantime, however, I would be more conservative.

  2. I prefer the term categorical_features over feature_indices since it is explicit about what we are targeting. My proposal regarding 'auto' is actually meant to partially solve the problem spotted in 1. Regarding the extra work, I could implement it and push it to your PR, if you are OK with that.

  3. I agree with your last point. However, since we are not actually implementing the original SMOTE-NC, I would propose renaming the class to something different, to avoid confusion with other implementations which might differ. I don't have a good name yet; CatSMOTE maybe. Any proposal would be nice.

@ddudnik

ddudnik Sep 3, 2018

Contributor
  1. Agreed (I'll address it during the rebase)
  2. Agreed
  3. I still think we could use SMOTE-NC as the name. Even though OneHotEncoding is not mentioned in the paper, I think using it does not contradict the paper. All the steps 1-3 (median computation, nearest neighbors, sample generation) behave as described there. I would call the usage of OneHotEncoder just an "implementation detail". But maybe I'm missing something, so please let me know if you have a different idea.
    (for reference https://arxiv.org/pdf/1106.1813.pdf, section 6.1)

@abhishekzgithub

abhishekzgithub Oct 11, 2018

@here I just wanted to ask if a method has been implemented to inverse_transform the OneHotEncoded values after applying the SMOTE-NC.

Also, I did test it with some sample datasets having categorical and nominal values, but the one-hot encoded (binary) values are also getting incremented by the provided 'step' parameter. This results in values such as 0.1, 0.12, 0.6, etc.

How to convert them back to the original categorical dataset with these type of values?

@glemaitre

glemaitre Oct 11, 2018

Member

I just wanted to ask if a method has been implemented to inverse_transform the OneHotEncoded values after applying the SMOTE-NC.

Since I integrated the OneHotEncoder within SMOTENC, the feature encoding will be reverted.

How to convert them back to the original categorical dataset with these type of values?

Now you will get back the original categories, so it should not be an issue. I still have to implement more tests to be sure that it behaves as expected in all situations.

@abhishekzgithub

abhishekzgithub Oct 11, 2018

@glemaitre There is some issue with the OHE (when using it normally).
The recent version provides functionality for converting without label-encoding first. I tried the process with a dataset of shape (284807, 30) and it was giving me a memory error.

I hope it might be of some use.

@glemaitre

glemaitre Oct 11, 2018

Member

Recent version provides functionality of converting it without LabelEncoding it.

Yes, you should not use LabelEncoding for such usage. OneHotEncoding has been fixed in 0.20.

I tried the process with a dataset of shape ((284807, 30)) and it was giving me memory error.

It depends on how many categories you have per column. You can end up with a really huge matrix and, depending on your RAM, you will get this error. One possibility is to convert the array to a sparse matrix.

We could consider always using the OHE internally with sparse matrices, to avoid such issues as much as possible.
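As a reference point for the memory argument: OneHotEncoder returns a sparse CSR matrix by default, whose storage grows with the number of non-zeros rather than with the full encoded width (the sizes below are illustrative):

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# Three categorical columns with up to 1000 categories each: the dense
# encoding would have roughly 3 * 1000 columns, but the sparse CSR output
# only stores one non-zero per original feature per row.
rng = np.random.RandomState(0)
X = rng.randint(0, 1000, size=(10000, 3))

ohe = OneHotEncoder()
X_encoded = ohe.fit_transform(X)  # sparse CSR matrix by default
```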

@glemaitre

Member

glemaitre commented Aug 30, 2018

We recently refactored the SMOTE class, which will also require a rebase.
If you could do that, and if we can come to an agreement regarding the parameters and naming, I can take over the automatic detection of categorical features.

WDYT?

@glemaitre

Member

glemaitre commented Sep 3, 2018

@jnothman similarly to sample_props, is there a comparable proposal for features?

Considering the following case: a ColumnTransformer with a OneHotEncoder for categorical data and a StandardScaler for the numerical, SMOTE-NC would benefit if the OneHotEncoder is outputting an array with the column names which could be passed at fit of the SMOTE-NC.

Do we have something in mind regarding this case and is it a "generic" enough case such that other estimators in scikit-learn can benefit from it?

@jnothman

jnothman commented Sep 3, 2018

That conversation is not happening in any straightforward way, though I have noted that they are two sides of the same coin in the Draft Roadmap (see "Passing around information that is not (X, y)"). I do think we have substantial related problems in the limitations of get_feature_names, the identification of columns for transformation in ColumnTransformer, sample props, and the issue of passing the full set of classes to sub-estimators and scorers (rather than inferring from y).

We have not yet considered combined/holistic solutions for sample, feature and target properties, but perhaps we should...

@ddudnik ddudnik force-pushed the ddudnik:smote_nc branch from 4db42bf to d7ed91e Sep 3, 2018

@ddudnik

Contributor

ddudnik commented Sep 3, 2018

PR is rebased

@glemaitre

This is only a partial review
I will continue tomorrow.

"""
sample = super(SMOTENC, self)._generate_sample(X, nn_data, nn_num,
                                               row, col, step)
if not hasattr(self, "categorical_feature_indices_"):

@glemaitre

glemaitre Sep 3, 2018

Member

I would remove this part. We will have raised an error much earlier.

@ddudnik

Contributor

ddudnik commented Sep 4, 2018

Regarding categorical_features="auto".

I gave it another thought and looked through the code one more time (it has been a long time since I initially implemented it), and it seems that the 'auto' mode is not possible here.

The thing is that we need to distinguish between two different categorical features from the original dataset. I'll try to explain via an example.

Let's say we have a dataset with 4 columns holding features A, B, C and D. Features C and D are two separate categorical ones: feature C has 2 possible values and feature D has 3. After one-hot encoding we'll end up with a new dataset with 7 columns: features A and B are left intact, but instead of feature C we have two columns (C1, C2) and instead of feature D we have three columns (D1, D2, D3). These new columns are of course binary.

For this code to work properly, it has to know where the feature C columns end and where the D ones start. I.e. we need to know that columns 2-3 represent feature C and columns 4-6 represent feature D.

This is exactly what we specify with categorical_feature_indices. In this example it would be [2, 4, 7].

I hope this explains why 'auto' is not really an option here, unless I'm missing something.
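The example above can be made concrete with a small sketch (variable names are illustrative): the boundaries [2, 4, 7] can be recovered from the encoder's categories_.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Columns: A and B are continuous; C has 2 categories, D has 3 categories.
X = np.array([[0.5, 1.2, 0, 0],
              [0.3, 0.7, 1, 1],
              [0.9, 0.1, 0, 2]])

ohe = OneHotEncoder()
X_cat = ohe.fit_transform(X[:, 2:]).toarray()

# Encoded layout: A, B, C1, C2, D1, D2, D3 -> 7 columns in total.
X_encoded = np.hstack([X[:, :2], X_cat])

# Boundaries of the encoded categorical blocks: columns 2-3 are C,
# columns 4-6 are D, which gives [2, 4, 7].
n_categories = [len(c) for c in ohe.categories_]
feature_indices = 2 + np.cumsum([0] + n_categories)
```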

@glemaitre

We need to add an additional test to check the "auto" mode.

We should check that we get the same results by using "auto" or by explicitly specifying the columns which are categorical.

See also
--------
SMOTE : Over-sample using SMOTE.

@glemaitre

glemaitre Sep 4, 2018

Member

We should add the BorderlineSMOTE

from . import SMOTE


class SMOTENC(SMOTE):

@glemaitre

glemaitre Sep 4, 2018

Member

We still need to move this class in smote.py

{random_state}
{k_neighbors}

@glemaitre

glemaitre Sep 4, 2018

Member

We should keep the full documentation in the docstring for those parameters.
We use the substitution for sampling_strategy and random_state only; writing the other parameters out in the documentation eases development.

Long story short, people tried to do that in scikit-learn and it failed, so we might want to avoid this :)

sampling_strategy='auto',
random_state=None,
k_neighbors=5,
m_neighbors='deprecated',

@glemaitre

glemaitre Sep 4, 2018

Member

Let's remove all the parameters which are deprecated; we don't need them here.
We will not support the old-style version using kind.

self.categorical_feature_indices = categorical_feature_indices

def _fit_resample(self, X, y):
if self.categorical_feature_indices is None:

@glemaitre

glemaitre Sep 4, 2018

Member

I would prefer:

if self.categorical_feature_indices == 'auto':
    self.categorical_feature_indices_ = [i for i in range(n_features)
                                         if len(_encode(X[:, i])) <= 2]
else:
    categorical_feature_indices = check_array(...)
    if categorical_feature_indices not in np.arange(n_features):
        raise ValueError(...)
    self.categorical_feature_indices_ = categorical_feature_indices
else:
    sample[start:end] = new_nominal_value

return sample.tocsr() if is_sparse else sample

@glemaitre

glemaitre Sep 4, 2018

Member

Only return dense. The magic should occur in fit_resample, in the sparse branch, which should concatenate the matrices and eliminate the zeros.
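That suggestion can be sketched as follows (toy data, not the PR's code):

```python
import numpy as np
from scipy import sparse

# Sparse branch sketch: stack the original sparse X with the newly
# generated dense samples, then drop any explicit zeros.
X = sparse.csr_matrix(np.array([[1.0, 0.0],
                                [0.0, 2.0]]))
new_samples = np.array([[0.0, 1.5]])

X_resampled = sparse.vstack([X, sparse.csr_matrix(new_samples)],
                            format='csr')
X_resampled.eliminate_zeros()
```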

between two feature vectors, the difference of two different categorical
features will always be equal to the median standard deviation.
"""
if not hasattr(self, "categorical_feature_indices_"):

@glemaitre

glemaitre Sep 4, 2018

Member

Remove this part. We will not accept it for the moment.

'to normal SMOTE', RuntimeWarning)
return X

X_copy = X.copy().tolil() if sparse.issparse(X) else X.copy()

@glemaitre

glemaitre Sep 4, 2018

Member

Actually it would be better to create a temporary CSC matrix:
go over each categorical feature column, change the values in place, and return X.tocsr()

@glemaitre

glemaitre Sep 4, 2018

Member
X = X.tocsc()
for feature_idx in self.categorical_feature_indices:
    nnz_idx = X[:, feature_idx].nonzero()[0]
    X[nnz_idx, feature_idx] = self.std_median_ / 2
return X.tocsr()
X_indices = [2, 5, 7]


def test_sample_regular():

@glemaitre

glemaitre Sep 4, 2018

Member

We will remove those tests with step numbers. There is a common test that already exercises the over-sampling. We should test the corner cases and the things which we can properly control.

smote.fit_resample(X, Y)


def test_fit_feature_indices_out_of_bounds():

@glemaitre

glemaitre Sep 4, 2018

Member

rename feature_indices to categorical_...

@glemaitre

glemaitre Sep 4, 2018

Member

I would concatenate all the errors into a single test using pytest parametrize:

@pytest.mark.parametrize(
    "categorical_feature_indices, msg_err",
    [([1], 'minimum of 2 is required'),
     ([3, 5, 8], 'Indices ...'),
     (...)]
)
def test_smote_nc_errors(categorical_feature_indices, msg_err):
    smote = SMOTENC(random_state=RND_SEED,
                    categorical_feature_indices=categorical_feature_indices)
    with pytest.raises(ValueError, match=msg_err):
        smote.fit_resample(X, y)
@glemaitre

Member

glemaitre commented Sep 4, 2018

For the common test you need to add SMOTENC to the DONT_SUPPORT_RATIO list.

@glemaitre

Member

glemaitre commented Sep 4, 2018

Thinking of which, we need to provide an X which is only categorical or only numerical, to check that the sampler raises an error in those cases.

@glemaitre

Member

glemaitre commented Sep 4, 2018

This is exactly what we say with categorical_feature_indices. In this example it will be: [2, 4, 7]. I hope this explains why 'auto' is not really an option here. Unless I miss something.

So categorical_feature_indices is actually the starting index of each encoded feature.
This is not good.

I would prefer, then, to include the OneHotEncoder in SMOTE-NC.
The user will provide the categorical columns; we can apply a OneHotEncoder on those and the StandardScaler on the others (using the ColumnTransformer), and then we will know the start/end ourselves. It will be easier from a user perspective.

We also ensure that the columns are definitely encoded (a careless user could forget the OneHotEncoder).
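A rough sketch of that user-facing idea, with illustrative column indices (this is not the merged API):

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# The user only names the categorical columns; the one-hot encoding (and
# hence the start/end of every encoded block) is handled internally.
categorical_columns = [2, 3]
continuous_columns = [0, 1]

preprocessor = ColumnTransformer(
    [('categorical', OneHotEncoder(), categorical_columns),
     ('continuous', StandardScaler(), continuous_columns)],
    sparse_threshold=0)

X = np.array([[0.5, 1.0, 0, 2],
              [1.5, 0.0, 1, 0],
              [0.7, 2.0, 0, 1]], dtype=float)

# Column 2 has 2 categories and column 3 has 3, so the output
# has 2 + 3 + 2 = 7 columns in a known order.
X_encoded = preprocessor.fit_transform(X)
```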

'For this method to work X should have '
'at least 1 continuous feature.')

if sparse.issparse(X):

@glemaitre

glemaitre Sep 4, 2018

Member

Actually, use the function mean_variance_axis to compute the std on the sparse matrix.

https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/sparsefuncs.py#L64

@glemaitre

glemaitre Sep 4, 2018

Member

Basically you can factorize:

X_continuous = X[:, self.continuous_feature_indices_]

stds = (np.sqrt(mean_variance_axis(X_continuous, axis=0)[1])
        if sparse.issparse(X) else np.std(X_continuous, axis=0))
self.std_median_ = np.median(stds)

@glemaitre glemaitre force-pushed the scikit-learn-contrib:master branch 2 times, most recently from bbf2b12 to 513203c Sep 7, 2018

@glemaitre

Member

glemaitre commented Oct 8, 2018

I will try to finish up this PR. To simplify the implementation without changing too much code, we only need to change the _sample function:

  • One-hot encode the input X, knowing the categorical features.
  • Use the trick mentioned above, replacing the 1s by the median.
  • Create the samples by calling _sample from the SMOTE implementation.
  • Revert the one-hot encoding.

Ideally we could have used the ColumnTransformer, but it does not have an inverse_transform yet, which would make it easy to reverse the one-hot encoding. Until scikit-learn 0.21, we will have to do this part by hand.
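The four steps above can be sketched roughly as follows; encode_for_smote and decode_samples are hypothetical helpers, not the merged code, and the actual sample-generation step is elided:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

def encode_for_smote(X, categorical_mask):
    """One-hot encode the categorical columns and scale the 1 entries."""
    X_cont = X[:, ~categorical_mask].astype(float)
    ohe = OneHotEncoder(handle_unknown='ignore')
    X_cat = ohe.fit_transform(X[:, categorical_mask]).toarray()
    # Median of the continuous features' standard deviations; the
    # substituted value controls how much a categorical mismatch
    # contributes to the nearest-neighbour distances.
    med_std = np.median(np.std(X_cont, axis=0))
    return np.hstack([X_cont, X_cat * med_std]), ohe

def decode_samples(X_new, n_cont, ohe):
    """Revert the one-hot encoding by picking the largest entry per block."""
    X_cat = X_new[:, n_cont:]
    hard = np.zeros_like(X_cat)
    start = 0
    for cats in ohe.categories_:
        block = slice(start, start + len(cats))
        hard[np.arange(len(X_cat)), start + X_cat[:, block].argmax(axis=1)] = 1
        start += len(cats)
    return X_new[:, :n_cont], ohe.inverse_transform(hard)

X = np.array([[0.5, 'a'], [1.5, 'b'], [0.7, 'a'], [2.0, 'b']], dtype=object)
mask = np.array([False, True])
X_enc, ohe = encode_for_smote(X, mask)
# ... generate synthetic rows in the encoded space with the SMOTE machinery ...
X_cont_out, X_cat_out = decode_samples(X_enc, 1, ohe)
```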

@glemaitre glemaitre force-pushed the ddudnik:smote_nc branch from e6310aa to b3f1619 Oct 11, 2018

@glemaitre glemaitre changed the title from [WIP] SMOTE-NC implementation #401 to EHN: implementation of SMOTE-NC for continuous and categorical mixed types Oct 11, 2018

@glemaitre

Member

glemaitre commented Oct 11, 2018

@ddudnik I think that we are almost good to go. I want to check if I need to add some additional tests.
Could you have a look at this PR and share any thoughts?

@chkoar Do you have time for a quick review of this PR?

@glemaitre glemaitre merged commit 2d4902e into scikit-learn-contrib:master Oct 12, 2018

1 of 3 checks passed

LGTM analysis: Python: Running analyses for revisions
continuous-integration/appveyor/pr: Waiting for AppVeyor build to complete
continuous-integration/travis-ci/pr: The Travis CI build passed
@glemaitre

Member

glemaitre commented Oct 12, 2018

Merging since it seems ready. @ddudnik Thanks for the implementation.

@ddudnik

Contributor

ddudnik commented Oct 19, 2018

Glad to see it merged! Thank you for taking over and finishing this thing. I wish I had had more time in the past month to be of more help.

@ddudnik

Contributor

ddudnik commented Oct 19, 2018

@glemaitre I've looked through the code briefly, and I have one remark regarding replacing the '1' entries of the categorical features with the median of the standard deviations. Shouldn't we replace the '1' entries with the median standard deviation divided by two? Otherwise the distance between two different categorical values is 2 * med_std.
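For concreteness, the arithmetic behind the remark (v stands in for the median standard deviation; whether to use v or v / 2 is exactly the question raised):

```python
import numpy as np

# Two different categories, one-hot encoded with the 1 entries replaced
# by a value v, differ in exactly two positions.
v = 1.0
cat_a = np.array([v, 0.0])
cat_b = np.array([0.0, v])

squared = float(((cat_a - cat_b) ** 2).sum())  # 2 * v**2: both positions differ
distance = float(np.sqrt(squared))             # sqrt(2) * v; halving v quarters
                                               # the squared contribution
```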

@glemaitre

Member

glemaitre commented Oct 19, 2018

@ddudnik ddudnik deleted the ddudnik:smote_nc branch Oct 22, 2018
