
FEA implements SMOTEN to handle nominal categorical features #802

Merged
14 commits merged into master on Feb 15, 2021

Conversation

glemaitre
Member

closes #565

Implements SMOTE for nominal categorical features only. It uses the Value Difference Metric (VDM) for the nearest-neighbour search and a per-feature majority vote among the k nearest neighbours to generate new samples.
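The generation step described above can be sketched roughly as follows. This is a hypothetical standalone illustration, not the actual imbalanced-learn implementation: given the k nearest neighbours of a minority sample (found with a categorical metric such as VDM), the synthetic sample takes, for each feature, the most common category among those neighbours.

```python
from collections import Counter

def majority_vote_sample(neighbors):
    """Generate one synthetic sample from the k nearest neighbours of a
    minority sample by taking, per feature, the most common category."""
    n_features = len(neighbors[0])
    return [
        Counter(row[j] for row in neighbors).most_common(1)[0][0]
        for j in range(n_features)
    ]

# Three nominal neighbours of a minority sample (made-up data)
neighbors = [
    ["red", "S", "cat"],
    ["red", "M", "cat"],
    ["blue", "S", "cat"],
]
print(majority_vote_sample(neighbors))  # ['red', 'S', 'cat']
```

Note that, unlike SMOTE's interpolation for continuous features, this vote can only ever emit categories already present in the data, which is exactly what nominal features require.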

@pep8speaks

pep8speaks commented Feb 15, 2021

Hello @glemaitre! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 798:89: E501 line too long (91 > 88 characters)

Line 43:9: W503 line break before binary operator
Line 44:9: W503 line break before binary operator
Line 45:9: W503 line break before binary operator
Line 46:9: W503 line break before binary operator

Comment last updated at 2021-02-15 22:55:02 UTC

@lgtm-com

lgtm-com bot commented Feb 15, 2021

This pull request introduces 1 alert when merging 049dde9 into 6155658 - view on LGTM.com

new alerts:

  • 1 for Signature mismatch in overriding method

@lgtm-com

lgtm-com bot commented Feb 15, 2021

This pull request introduces 1 alert when merging fee46f7 into 6155658 - view on LGTM.com

new alerts:

  • 1 for Signature mismatch in overriding method

@lgtm-com

lgtm-com bot commented Feb 15, 2021

This pull request introduces 1 alert when merging 8626040 into d9ba4af - view on LGTM.com

new alerts:

  • 1 for Signature mismatch in overriding method

@codecov

codecov bot commented Feb 15, 2021

Codecov Report

Merging #802 (a66bfaa) into master (b6621f9) will decrease coverage by 0.05%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #802      +/-   ##
==========================================
- Coverage   98.63%   98.58%   -0.05%     
==========================================
  Files          89       90       +1     
  Lines        5934     6007      +73     
  Branches      499      503       +4     
==========================================
+ Hits         5853     5922      +69     
- Misses         80       84       +4     
  Partials        1        1              
Impacted Files Coverage Δ
imblearn/over_sampling/_random_over_sampler.py 100.00% <ø> (ø)
imblearn/over_sampling/__init__.py 100.00% <100.00%> (ø)
imblearn/over_sampling/_adasyn.py 92.30% <100.00%> (-6.11%) ⬇️
imblearn/over_sampling/_smote.py 97.60% <100.00%> (+0.29%) ⬆️
imblearn/over_sampling/tests/test_smoten.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b6621f9...a66bfaa. Read the comment docs.

@lgtm-com

lgtm-com bot commented Feb 15, 2021

This pull request introduces 1 alert when merging 516b33c into 326119b - view on LGTM.com

new alerts:

  • 1 for Signature mismatch in overriding method

@lgtm-com

lgtm-com bot commented Feb 15, 2021

This pull request introduces 1 alert when merging a66bfaa into b6621f9 - view on LGTM.com

new alerts:

  • 1 for Signature mismatch in overriding method

@glemaitre glemaitre merged commit e3df215 into scikit-learn-contrib:master Feb 15, 2021
1 check passed
6. ADASYN - Adaptive synthetic sampling approach for imbalanced learning [15]_
7. KMeans-SMOTE [17]_
8. ROSE - Random OverSampling Examples [19]_
4. SMOTEN - SMMOTE for Nominal only [8]_

SMMOTE->SMOTE

@@ -211,6 +211,44 @@ Therefore, it can be seen that the samples generated in the first and last
columns are belonging to the same categories originally presented without any
other extra interpolation.

However, :class:`SMOTENC` is working with data composed of categorical data
only. WHen data are made of only nominal categorical data, one can use the

When

@@ -211,6 +211,44 @@ Therefore, it can be seen that the samples generated in the first and last
columns are belonging to the same categories originally presented without any
other extra interpolation.

However, :class:`SMOTENC` is working with data composed of categorical data

However, :class:`SMOTENC` is working with datasets composed of continuous and categorical features.

two ways:

* the nearest neighbors search does not rely on the Euclidean distance. Indeed,
the value difference metric (VDM) also implemented in the class

Value Difference Metric

* the nearest neighbors search does not rely on the Euclidean distance. Indeed,
the value difference metric (VDM) also implemented in the class
:class:`~imblearn.metrics.ValueDifferenceMetric` is used.
* the new sample generation is based on majority vote per feature to generate

on the majority?

@@ -766,6 +774,9 @@ class SMOTENC(SMOTE):
--------
SMOTE : Over-sample using SMOTE.

SMOTEN : Over-sample using the SMOTE variable specifically for categorical

SMOTEN : Over-sample using the SMOTE variant specifically for nominal features only.

@@ -1055,6 +1066,11 @@ class KMeansSMOTE(BaseSMOTE):
--------
SMOTE : Over-sample using SMOTE.

SMOTENC : Over-sample using SMOTE for continuous and categorical features.

SMOTEN : Over-sample using the SMOTE variable specifically for categorical

SMOTEN : Over-sample using the SMOTE variant specifically for nominal features only.

SMOTEN : Over-sample using the SMOTE variable specifically for categorical
features only.

SVMSMOTE : Over-sample using SVM-SMOTE variant.

SVMSMOTE : Over-sample using the SVM-SMOTE variant.


SVMSMOTE : Over-sample using SVM-SMOTE variant.

BorderlineSMOTE : Over-sample using Borderline-SMOTE variant.

BorderlineSMOTE : Over-sample using the Borderline-SMOTE variant.


ADASYN : Over-sample using ADASYN.

KMeansSMOTE : Over-sample applying a clustering before to oversample using

KMeansSMOTE : Over-sample by applying a clustering before to oversample using
SMOTE.

@@ -168,11 +168,12 @@ Below is a list of the methods currently implemented in this module.
1. Random minority over-sampling with replacement
2. SMOTE - Synthetic Minority Over-sampling Technique [8]_
3. SMOTENC - SMOTE for Nominal Continuous [8]_

SMOTENC - SMOTE for Nominal and Continuous
or
SMOTENC - SMOTE Nominal Continuous

@glemaitre
Member Author

Thanks. I will make the changes in a new PR.

Regarding "nominal", I am not convinced what the best term is. I don't think we do a good job in the documentation of explaining what nominal values are, and in scikit-learn we usually talk about categorical features. When it comes to nominal features, we usually use the term "nominal categorical features" to oppose them to "ordinal categorical features".

@chkoar do you have any thoughts on that?

@chkoar
Member

chkoar commented Feb 16, 2021

Keeping aside the numerical features and the representation of the categorical data.

In general, nominal and ordinal variables both represent distinct categories on which arithmetic does not make sense; the difference is that for ordinal variables the order does matter. For instance, color is nominal while priority is ordinal: L, M, H. So you need domain knowledge to encode accordingly (if you don't already have the encoding); you should not encode blindly based on the type of a variable. Depending on the context, priority could even be encoded as if it were nominal for feature encoding. On the other hand, if priority is the target, you should probably use an ordinal encoding when you are interested in ordinal classification.

To sum up, IMHO, I think that Chawla et al. use the word "nominal" to refer to categorical features. Since nominal and ordinal types are both categorical, I would say that we could use the word "categorical" everywhere, aside from the names SMOTEN and SMOTENC themselves.

Does this make sense?
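The nominal/ordinal distinction discussed above can be illustrated with a short sketch (hypothetical example data, not part of this PR): an ordinal feature like priority preserves its order through integer codes, while a nominal feature like color has no order and is one-hot encoded instead.

```python
# Ordinal: priority has a meaningful order L < M < H, so map it to integers.
priority_order = {"L": 0, "M": 1, "H": 2}
priorities = ["H", "L", "M"]
encoded_priority = [priority_order[p] for p in priorities]
print(encoded_priority)  # [2, 0, 1]

# Nominal: color has no order, so one-hot encode it instead.
colors = ["red", "blue", "red"]
categories = sorted(set(colors))  # ['blue', 'red']
one_hot = [[int(c == cat) for cat in categories] for c in colors]
print(one_hot)  # [[0, 1], [1, 0], [0, 1]]
```

An integer encoding of color would invent an order ("blue" < "red") that a distance-based method would then exploit, which is exactly why SMOTEN avoids interpolating over nominal values.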

@glemaitre
Member Author

Yep, I think it would be best to be consistent. I will make a pass on the doc.
