FEA implements SMOTEN to handle nominal categorical features #802
Hello @glemaitre! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
Comment last updated at 2021-02-15 22:55:02 UTC |
This pull request introduces 1 alert when merging 049dde9 into 6155658 - view on LGTM.com new alerts:
This pull request introduces 1 alert when merging fee46f7 into 6155658 - view on LGTM.com new alerts:
This pull request introduces 1 alert when merging 8626040 into d9ba4af - view on LGTM.com new alerts:
Codecov Report
@@            Coverage Diff             @@
##           master     #802      +/-   ##
==========================================
- Coverage   98.63%   98.58%   -0.05%
==========================================
  Files          89       90       +1
  Lines        5934     6007      +73
  Branches      499      503       +4
==========================================
+ Hits         5853     5922      +69
- Misses         80       84       +4
  Partials        1        1
This pull request introduces 1 alert when merging 516b33c into 326119b - view on LGTM.com new alerts:
This pull request introduces 1 alert when merging a66bfaa into b6621f9 - view on LGTM.com new alerts:
6. ADASYN - Adaptive synthetic sampling approach for imbalanced learning [15]_
7. KMeans-SMOTE [17]_
8. ROSE - Random OverSampling Examples [19]_
4. SMOTEN - SMMOTE for Nominal only [8]_
SMMOTE->SMOTE
@@ -211,6 +211,44 @@ Therefore, it can be seen that the samples generated in the first and last
columns are belonging to the same categories originally presented without any
other extra interpolation.

However, :class:`SMOTENC` is working with data composed of categorical data
only. WHen data are made of only nominal categorical data, one can use the
When
@@ -211,6 +211,44 @@ Therefore, it can be seen that the samples generated in the first and last
columns are belonging to the same categories originally presented without any
other extra interpolation.

However, :class:`SMOTENC` is working with data composed of categorical data
However, :class:`SMOTENC` is working with datasets composed of continuous and categorical features.
two ways:

* the nearest neighbors search does not rely on the Euclidean distance. Indeed,
the value difference metric (VDM) also implemented in the class
Value Difference Metric
* the nearest neighbors search does not rely on the Euclidean distance. Indeed,
the value difference metric (VDM) also implemented in the class
:class:`~imblearn.metrics.ValueDifferenceMetric` is used.
* the new sample generation is based on majority vote per feature to generate
on the majority?
@@ -766,6 +774,9 @@ class SMOTENC(SMOTE):
--------
SMOTE : Over-sample using SMOTE.

SMOTEN : Over-sample using the SMOTE variable specifically for categorical
SMOTEN : Over-sample using the SMOTE variant specifically for nominal features only.
@@ -1055,6 +1066,11 @@ class KMeansSMOTE(BaseSMOTE):
--------
SMOTE : Over-sample using SMOTE.

SMOTENC : Over-sample using SMOTE for continuous and categorical features.

SMOTEN : Over-sample using the SMOTE variable specifically for categorical
SMOTEN : Over-sample using the SMOTE variant specifically for nominal features only.
SMOTEN : Over-sample using the SMOTE variable specifically for categorical
features only.

SVMSMOTE : Over-sample using SVM-SMOTE variant.
SVMSMOTE : Over-sample using the SVM-SMOTE variant.
SVMSMOTE : Over-sample using SVM-SMOTE variant.

BorderlineSMOTE : Over-sample using Borderline-SMOTE variant.
BorderlineSMOTE : Over-sample using the Borderline-SMOTE variant.
ADASYN : Over-sample using ADASYN.

KMeansSMOTE : Over-sample applying a clustering before to oversample using
KMeansSMOTE : Over-sample by applying a clustering before to oversample using
SMOTE.
@@ -168,11 +168,12 @@ Below is a list of the methods currently implemented in this module.
1. Random minority over-sampling with replacement
2. SMOTE - Synthetic Minority Over-sampling Technique [8]_
3. SMOTENC - SMOTE for Nominal Continuous [8]_
SMOTENC - SMOTE for Nominal and Continuous
or
SMOTENC - SMOTE Nominal Continuous
Thanks. I will make the changes in a new PR. Regarding "nominal", I am not sure what the best approach is. I think that we don't do a good job in the doc of explaining what nominal values are, and in scikit-learn we usually talk about categorical features. When it comes to nominal features, we usually use the term "nominal categorical features" as opposed to "ordinal categorical features". @chkoar do you have any thoughts on that?
Setting aside numerical features and the representation of categorical data: in general, nominal and ordinal variables both represent distinct categories on which arithmetic does not make sense. However, for ordinal variables the order does matter. For instance, color is nominal while priority is ordinal: L, M, H. So you need domain knowledge in order to encode accordingly (if you don't already have the encoding). You should not encode blindly based on the type of a variable; the encoding depends on the context. That said, I suppose that priority could be encoded as if it were nominal in the context of feature encoding. On the other hand, if priority is the class, you should probably use ordinal encoding, since you are then interested in ordinal classification. To sum up, IMHO, I think that Chawla et al. use the word "nominal" to refer to categorical features. Since nominal and ordinal types are both categorical, I would say that we could use the word "categorical" everywhere, instead of the names SMOTEN and SMOTENC. Does this make sense?
Yep, I think it would be best to be consistent. I will make a pass on the doc.
closes #565
Implements SMOTE for nominal categorical features only. It uses the VDM distance to find the k nearest neighbors and a per-feature majority vote among those neighbors to choose the categories of each new sample.
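
For readers unfamiliar with the two ingredients named above, here is a minimal pure-Python sketch of the idea. It is not the code from this PR: the helper names (`vdm_tables`, `vdm_distance`, `generate_sample`) and the toy data are hypothetical, the VDM exponent is assumed to be 1, and ties in the vote are broken arbitrarily. In the library itself, these roles belong to `SMOTEN` and `imblearn.metrics.ValueDifferenceMetric`.

```python
from collections import Counter, defaultdict


def vdm_tables(X, y):
    """For each feature, estimate P(class | value) from the training data."""
    classes = sorted(set(y))
    tables = []
    for f in range(len(X[0])):
        counts = defaultdict(Counter)
        for row, label in zip(X, y):
            counts[row[f]][label] += 1
        tables.append({
            value: [c[k] / sum(c.values()) for k in classes]
            for value, c in counts.items()
        })
    return tables


def vdm_distance(a, b, tables, k=1):
    """Sum over features of sum over classes of |P(c|a_f) - P(c|b_f)|**k."""
    return sum(
        sum(abs(pa - pb) ** k for pa, pb in zip(t[va], t[vb]))
        for va, vb, t in zip(a, b, tables)
    )


def generate_sample(seed, minority, tables, n_neighbors=3):
    """Build one synthetic row: per-feature majority vote among the
    n_neighbors minority rows closest to `seed` under the VDM."""
    neighbors = sorted(
        (row for row in minority if row is not seed),
        key=lambda row: vdm_distance(seed, row, tables),
    )[:n_neighbors]
    return tuple(
        Counter(column).most_common(1)[0][0] for column in zip(*neighbors)
    )


# Toy data (hypothetical): two nominal features, class 1 is the minority.
X = [
    ("red", "small"), ("red", "large"), ("blue", "small"), ("red", "small"),
    ("green", "large"), ("green", "small"), ("blue", "large"),
]
y = [0, 0, 0, 0, 1, 1, 1]

tables = vdm_tables(X, y)
minority = [row for row, label in zip(X, y) if label == 1]
new_sample = generate_sample(minority[0], minority, tables, n_neighbors=2)
print(new_sample)
```

Because every feature value of the synthetic row is copied from a neighbor, the new sample only ever contains categories already present in the minority class; nothing is interpolated.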