In [48]:
%matplotlib inline


# Fitting a model on imbalanced datasets and how to fight bias

This example illustrates the problems induced by learning on datasets with
imbalanced classes. We then compare different approaches to alleviate
these negative effects.


In [49]:
# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>
# License: MIT
# Modified by yxsong@ecut.edu.cn

In [50]:
print(__doc__)

Automatically created module for IPython interactive environment


## Problem definition

- Imbalanced learning for landslide susceptibility



The "wanzhou" dataset has a class ratio of about 19:1.



In [51]:
from sklearn.model_selection import train_test_split
import pandas as pd
data = pd.read_csv('./data/wanzhou_island.csv')
target = 'value'
IDCol = 'ID'
GeoID = data[IDCol]
print(data[target].value_counts())
# x_columns = [x for x in data.columns if x not in [target,IDCol,'GRID_CODE']]


0    553172
1     29313
Name: value, dtype: int64
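
The class ratio of about 19:1 follows directly from the counts printed above. A minimal check, with the two counts copied by hand from that output (the variable names are illustrative only):

```python
# Class counts copied from the value_counts() output above.
counts = {0: 553172, 1: 29313}

total = sum(counts.values())
ratio = counts[0] / counts[1]        # majority samples per minority sample
minority_share = counts[1] / total   # fraction of samples in class 1

print(f"majority:minority = {ratio:.2f}:1")
print(f"minority share    = {minority_share:.2%}")
```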


Data preparation



In [52]:
x_columns = ['Elevation', 'Slope', 'Aspect', 'TRI', 'Curvature', 'Lithology', 'River', 'NDVI', 'NDWI', 'Rainfall', 'Earthquake', 'Land_use']

X = data[x_columns]
y = data[target]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0
)

We will perform a cross-validation evaluation to get an estimate of the test
score.

As a baseline, we could use a classifier which will always predict the
majority class independently of the features provided.



In [53]:
from sklearn.model_selection import cross_validate
from sklearn.dummy import DummyClassifier

dummy_clf = DummyClassifier(strategy="most_frequent")
scoring = ["accuracy", "balanced_accuracy"]
cv_result = cross_validate(dummy_clf, X, y, scoring=scoring)
print(f"Accuracy score of a dummy classifier: {cv_result['test_accuracy'].mean():.3f}")

Accuracy score of a dummy classifier: 0.950


Instead of accuracy, we can use balanced accuracy, the mean of the per-class
recalls, which takes the class imbalance into account.



In [54]:
print(
    f"Balanced accuracy score of a dummy classifier: "
    f"{cv_result['test_balanced_accuracy'].mean():.3f}"
)

Balanced accuracy score of a dummy classifier: 0.500
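
Both dummy scores can be reproduced by hand from the class counts, which may help to see why accuracy is misleading here. The sketch below assumes a predictor that always outputs class 0 and uses the counts from the `value_counts()` output earlier in the notebook. It works on the full dataset rather than on cross-validation folds, so it is only an approximation of what `cross_validate` reports.

```python
# Class counts from the value_counts() output earlier in the notebook.
n_neg, n_pos = 553172, 29313

# A majority-class predictor labels every sample as class 0.
tn, fp = n_neg, 0   # every negative is predicted correctly
fn, tp = n_pos, 0   # every positive is missed

accuracy = (tp + tn) / (n_neg + n_pos)
recall_neg = tn / (tn + fp)   # recall on class 0
recall_pos = tp / (tp + fn)   # recall on class 1
balanced_accuracy = (recall_neg + recall_pos) / 2

print(f"accuracy          = {accuracy:.3f}")
print(f"balanced accuracy = {balanced_accuracy:.3f}")
```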


## Strategies to learn from an imbalanced dataset
We will use a dictionary and a list to continuously store the results of
our experiments and show them as a pandas dataframe.



In [55]:
index = []
scores = {"Accuracy": [], "Balanced accuracy": []}

### Dummy baseline

Before training a real machine learning model, we store the results
obtained with our :class:`~sklearn.dummy.DummyClassifier`.



In [56]:
import pandas as pd

index += ["Dummy classifier"]
cv_result = cross_validate(dummy_clf, X, y, scoring=scoring)
scores["Accuracy"].append(cv_result["test_accuracy"].mean())
scores["Balanced accuracy"].append(cv_result["test_balanced_accuracy"].mean())

df_scores = pd.DataFrame(scores, index=index)
df_scores

Unnamed: 0,Accuracy,Balanced accuracy
Dummy classifier,0.949676,0.5


### Linear classifier baseline

We will create a machine learning pipeline using a
:class:`~sklearn.linear_model.LogisticRegression` classifier. To do so,
we need to one-hot encode the categorical columns and standardize the
numerical columns before passing the data to the
:class:`~sklearn.linear_model.LogisticRegression` classifier.
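
To make the two preprocessing steps concrete, here is a plain-Python sketch of what standardization and one-hot encoding compute. The helper names and the toy values are made up for illustration; the notebook itself relies on scikit-learn's `StandardScaler` and `OneHotEncoder`, which handle many more cases.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def one_hot(values):
    """Map each category to a 0/1 indicator vector, one column per category."""
    categories = sorted(set(values))
    encoded = [[int(v == c) for c in categories] for v in values]
    return encoded, categories


slopes = [5.0, 15.0, 25.0]                 # made-up numeric values
land_use = ["forest", "urban", "forest"]   # made-up category labels

print(standardize(slopes))
print(one_hot(land_use))
```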

First, we define our numerical and categorical pipelines.



In [58]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import make_pipeline

num_pipe = make_pipeline(
    StandardScaler(), SimpleImputer(strategy="mean", add_indicator=True)
)
cat_pipe = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="missing"),
    OneHotEncoder(handle_unknown="ignore"),
)

Then, we can create a preprocessor which will dispatch the categorical
columns to the categorical pipeline and the numerical columns to the
numerical pipeline.



In [59]:
from sklearn.compose import make_column_transformer
from sklearn.compose import make_column_selector as selector

preprocessor_linear = make_column_transformer(
    (num_pipe, selector(dtype_include="number")),
    (cat_pipe, selector(dtype_include="category")),
    n_jobs=2,
)

Finally, we connect our preprocessor with our
:class:`~sklearn.linear_model.LogisticRegression`. We can then evaluate our
model.



In [60]:
from sklearn.linear_model import LogisticRegression

lr_clf = make_pipeline(preprocessor_linear, LogisticRegression(max_iter=1000))

In [61]:
index += ["Logistic regression"]
cv_result = cross_validate(lr_clf, X, y, scoring=scoring)
scores["Accuracy"].append(cv_result["test_accuracy"].mean())
scores["Balanced accuracy"].append(cv_result["test_balanced_accuracy"].mean())

df_scores = pd.DataFrame(scores, index=index)
df_scores

Unnamed: 0,Accuracy,Balanced accuracy
Dummy classifier,0.949676,0.5
Logistic regression,0.938623,0.497331


On this dataset, our linear model does not improve on the dummy baseline:
its accuracy (0.939) and balanced accuracy (0.497) are both slightly lower.
It is clearly impacted by the class imbalance.

We can check whether something similar happens with a tree-based model
such as :class:`~sklearn.ensemble.RandomForestClassifier`. With this type of
classifier, we do not need to scale the numerical data, and categorical data
would only need to be ordinal encoded (the `OrdinalEncoder` step is commented
out in the cell below).



In [62]:
from sklearn.preprocessing import OrdinalEncoder
from sklearn.ensemble import RandomForestClassifier

num_pipe = SimpleImputer(strategy="mean", add_indicator=True)
cat_pipe = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="missing"),
    # OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
)

preprocessor_tree = make_column_transformer(
    (num_pipe, selector(dtype_include="number")),
    (cat_pipe, selector(dtype_include="category")),
    n_jobs=2,
)

rf_clf = make_pipeline(
    preprocessor_tree, RandomForestClassifier(random_state=42, n_jobs=2)
)

In [63]:
index += ["Random forest"]
cv_result = cross_validate(rf_clf, X, y, scoring=scoring)
scores["Accuracy"].append(cv_result["test_accuracy"].mean())
scores["Balanced accuracy"].append(cv_result["test_balanced_accuracy"].mean())

df_scores = pd.DataFrame(scores, index=index)
# df_scores = pd.DataFrame(scores)

df_scores

Unnamed: 0,Accuracy,Balanced accuracy
Dummy classifier,0.949676,0.5
Logistic regression,0.938623,0.497331
Random forest,0.530513,0.305125


The :class:`~sklearn.ensemble.RandomForestClassifier` is affected by the
class imbalance as well, and on this dataset more severely than the linear
model: its balanced accuracy (0.305) falls below the 0.5 of the dummy
baseline. Now, we will present different approaches to improve the
performance of these two models.

### Use `class_weight`

Many classifiers in `scikit-learn` have a `class_weight` parameter. This
parameter affects the computation of the loss in linear models, or of the
split criterion in tree-based models, so that misclassifying a minority-class
sample is penalized differently from misclassifying a majority-class sample.
We can set `class_weight="balanced"` so that the weight applied is inversely
proportional to the class frequency. We test this parametrization with both
the linear model and the tree-based model.
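
As a rough illustration of what "inversely proportional to the class frequency" means here, the weights can be computed by hand with the formula given in the scikit-learn documentation for `class_weight="balanced"`, `n_samples / (n_classes * n_samples_in_class)`. The counts below are those of the full dataset; during cross-validation the weights are derived from each training fold, so the exact values may differ slightly.

```python
# Class counts from the value_counts() output earlier in the notebook.
counts = {0: 553172, 1: 29313}

n_samples = sum(counts.values())
n_classes = len(counts)

# "balanced" heuristic: weight_c = n_samples / (n_classes * n_samples_in_class_c)
weights = {c: n_samples / (n_classes * n) for c, n in counts.items()}

for c, w in weights.items():
    print(f"class {c}: weight = {w:.4f}")
print(f"minority/majority weight ratio = {weights[1] / weights[0]:.2f}")
```

With these weights, each class contributes the same total weight to the loss, which is why the ratio of the two weights equals the class ratio.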



In [64]:
lr_clf.set_params(logisticregression__class_weight="balanced")

index += ["Logistic regression with balanced class weights"]
cv_result = cross_validate(lr_clf, X, y, scoring=scoring)
scores["Accuracy"].append(cv_result["test_accuracy"].mean())
scores["Balanced accuracy"].append(cv_result["test_balanced_accuracy"].mean())

df_scores = pd.DataFrame(scores, index=index)
df_scores

Unnamed: 0,Accuracy,Balanced accuracy
Dummy classifier,0.949676,0.5
Logistic regression,0.938623,0.497331
Random forest,0.530513,0.305125
Logistic regression with balanced class weights,0.574121,0.611144


In [65]:
rf_clf.set_params(randomforestclassifier__class_weight="balanced")

index += ["Random forest with balanced class weights"]
cv_result = cross_validate(rf_clf, X, y, scoring=scoring)
scores["Accuracy"].append(cv_result["test_accuracy"].mean())
scores["Balanced accuracy"].append(cv_result["test_balanced_accuracy"].mean())

df_scores = pd.DataFrame(scores, index=index)
df_scores

Unnamed: 0,Accuracy,Balanced accuracy
Dummy classifier,0.949676,0.5
Logistic regression,0.938623,0.497331
Random forest,0.530513,0.305125
Logistic regression with balanced class weights,0.574121,0.611144
Random forest with balanced class weights,0.680929,0.378342


We can see that using `class_weight` helped the linear model: its balanced
accuracy rises from 0.497 to 0.611, at the cost of a much lower accuracy
(0.574). However, the :class:`~sklearn.ensemble.RandomForestClassifier`
only reaches a balanced accuracy of 0.378, still below the dummy baseline,
possibly because its split criterion is not well suited to fighting the
class imbalance.

### Resample the training set during learning

Another way is to resample the training set by under-sampling or
over-sampling some of the samples. `imbalanced-learn` provides some samplers
to do such processing.



In [66]:
from imblearn.pipeline import make_pipeline as make_pipeline_with_sampler
from imblearn.under_sampling import RandomUnderSampler

lr_clf = make_pipeline_with_sampler(
    preprocessor_linear,
    RandomUnderSampler(random_state=42),
    LogisticRegression(max_iter=1000),
)

In [67]:
index += ["Under-sampling + Logistic regression"]
cv_result = cross_validate(lr_clf, X, y, scoring=scoring)
scores["Accuracy"].append(cv_result["test_accuracy"].mean())
scores["Balanced accuracy"].append(cv_result["test_balanced_accuracy"].mean())

df_scores = pd.DataFrame(scores, index=index)
df_scores

Unnamed: 0,Accuracy,Balanced accuracy
Dummy classifier,0.949676,0.5
Logistic regression,0.938623,0.497331
Random forest,0.530513,0.305125
Logistic regression with balanced class weights,0.574121,0.611144
Random forest with balanced class weights,0.680929,0.378342
Under-sampling + Logistic regression,0.572073,0.609646


In [68]:
rf_clf = make_pipeline_with_sampler(
    preprocessor_tree,
    RandomUnderSampler(random_state=42),
    RandomForestClassifier(random_state=42, n_jobs=2),
)

In [69]:
index += ["Under-sampling + Random forest"]
cv_result = cross_validate(rf_clf, X, y, scoring=scoring)
scores["Accuracy"].append(cv_result["test_accuracy"].mean())
scores["Balanced accuracy"].append(cv_result["test_balanced_accuracy"].mean())

df_scores = pd.DataFrame(scores, index=index)
df_scores

Unnamed: 0,Accuracy,Balanced accuracy
Dummy classifier,0.949676,0.5
Logistic regression,0.938623,0.497331
Random forest,0.530513,0.305125
Logistic regression with balanced class weights,0.574121,0.611144
Random forest with balanced class weights,0.680929,0.378342
Under-sampling + Logistic regression,0.572073,0.609646
Under-sampling + Random forest,0.512825,0.384932


Applying a random under-sampler before training the linear model or the
random forest keeps the model from focusing on the majority class, at the
cost of making more mistakes on samples of the majority class (i.e. decreased
accuracy).
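
The idea behind random under-sampling can be sketched in a few lines of plain Python. This is a simplified stand-in, not the `RandomUnderSampler` implementation: the function name, the seed, and the toy labels are assumptions made for illustration.

```python
import random
from collections import Counter


def random_under_sample(y, seed=0):
    """Return sorted indices that keep every class at the minority-class size."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(y):
        by_class.setdefault(label, []).append(i)
    n_min = min(len(idx) for idx in by_class.values())
    kept = []
    for idx in by_class.values():
        # Sampling without replacement; the minority class is kept whole.
        kept.extend(rng.sample(idx, n_min))
    return sorted(kept)


y_toy = [0] * 19 + [1]   # toy labels with the same 19:1 ratio as the dataset
kept = random_under_sample(y_toy)
print(Counter(y_toy[i] for i in kept))
```

Most majority-class samples are discarded, which is the information loss that shows up as lower accuracy in the table above.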

We could try other types of samplers and find which one works best
on the current dataset.

Instead, we will present another way by using classifiers which will apply
sampling internally.

### Use of specific balanced algorithms from imbalanced-learn

We saw that random under-sampling improved the balanced accuracy of the
random forest (from 0.305 to 0.385). However, instead of under-sampling the
dataset once, one could under-sample the original dataset before taking each
bootstrap sample. This is the basis of the
:class:`imblearn.ensemble.BalancedRandomForestClassifier` and
:class:`~imblearn.ensemble.BalancedBaggingClassifier`.
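
The sketch below illustrates the idea of giving each estimator its own under-sampled bootstrap. It is a simplified illustration under stated assumptions, not imbalanced-learn's implementation: the function name, the toy labels, and the choice to under-sample without replacement before bootstrapping are all hypothetical.

```python
import random
from collections import Counter


def balanced_bootstrap(y, rng):
    """Under-sample every class to the minority size, then bootstrap."""
    by_class = {}
    for i, label in enumerate(y):
        by_class.setdefault(label, []).append(i)
    n_min = min(len(idx) for idx in by_class.values())
    balanced = []
    for idx in by_class.values():
        balanced.extend(rng.sample(idx, n_min))      # step 1: under-sample
    return rng.choices(balanced, k=len(balanced))    # step 2: bootstrap


y_toy = [0] * 95 + [1] * 5   # toy labels, 19:1 ratio
rng = random.Random(42)
draws = [balanced_bootstrap(y_toy, rng) for _ in range(3)]   # one per estimator

for d in draws:
    majority_used = sorted({i for i in d if y_toy[i] == 0})
    print(Counter(y_toy[i] for i in d), "majority indices:", majority_used)
```

Because each estimator draws a different subset of the majority class, the ensemble as a whole sees more majority samples than a single under-sampled training set would.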



In [70]:
from imblearn.ensemble import BalancedRandomForestClassifier

rf_clf = make_pipeline(
    preprocessor_tree,
    BalancedRandomForestClassifier(random_state=42, n_jobs=2),
)

In [71]:
index += ["Balanced random forest"]
cv_result = cross_validate(rf_clf, X, y, scoring=scoring)
scores["Accuracy"].append(cv_result["test_accuracy"].mean())
scores["Balanced accuracy"].append(cv_result["test_balanced_accuracy"].mean())

df_scores = pd.DataFrame(scores, index=index)
df_scores

Unnamed: 0,Accuracy,Balanced accuracy
Dummy classifier,0.949676,0.5
Logistic regression,0.938623,0.497331
Random forest,0.530513,0.305125
Logistic regression with balanced class weights,0.574121,0.611144
Random forest with balanced class weights,0.680929,0.378342
Under-sampling + Logistic regression,0.572073,0.609646
Under-sampling + Random forest,0.512825,0.384932
Balanced random forest,0.535537,0.399958


The balanced accuracy of the
:class:`~imblearn.ensemble.BalancedRandomForestClassifier` (0.400) is slightly
better than with a single random under-sampling (0.385). Next, we use a
gradient-boosting classifier within a
:class:`~imblearn.ensemble.BalancedBaggingClassifier`.



In [72]:
# The experimental import below is only required for scikit-learn < 1.0.
from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import HistGradientBoostingClassifier
from imblearn.ensemble import BalancedBaggingClassifier

bag_clf = make_pipeline(
    preprocessor_tree,
    BalancedBaggingClassifier(
        # Newer imbalanced-learn releases name this parameter `estimator`.
        base_estimator=HistGradientBoostingClassifier(random_state=42),
        n_estimators=10,
        random_state=42,
        n_jobs=2,
    ),
)

index += ["Balanced bag of histogram gradient boosting"]
cv_result = cross_validate(bag_clf, X, y, scoring=scoring)
scores["Accuracy"].append(cv_result["test_accuracy"].mean())
scores["Balanced accuracy"].append(cv_result["test_balanced_accuracy"].mean())

df_scores = pd.DataFrame(scores, index=index)
df_scores

Unnamed: 0,Accuracy,Balanced accuracy
Dummy classifier,0.949676,0.5
Logistic regression,0.938623,0.497331
Random forest,0.530513,0.305125
Logistic regression with balanced class weights,0.574121,0.611144
Random forest with balanced class weights,0.680929,0.378342
Under-sampling + Logistic regression,0.572073,0.609646
Under-sampling + Random forest,0.512825,0.384932
Balanced random forest,0.535537,0.399958
Balanced bag of histogram gradient boosting,0.404184,0.38847


In [74]:
cv_result

{'fit_time': array([6.12350273, 6.15721726, 7.08900547, 6.98826957, 6.76534963]),
 'score_time': array([1.2804234 , 1.42841387, 1.73170495, 1.51417255, 1.52870631]),
 'test_accuracy': array([0.05814742, 0.40272282, 0.51078569, 0.51926659, 0.52999648]),
 'test_balanced_accuracy': array([0.0968503 , 0.49862293, 0.59988654, 0.45583192, 0.29115588])}
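
The per-fold arrays above are worth summarizing, because the mean alone hides how much the folds differ. The following sketch copies the printed scores by hand and reports mean, spread, and range. By default `cross_validate` uses stratified folds without shuffling for classifiers, so a large spread may reflect differences between contiguous blocks of rows, although that is only one possible explanation.

```python
from statistics import mean, pstdev

# Per-fold scores copied from the cv_result output above.
test_accuracy = [0.05814742, 0.40272282, 0.51078569, 0.51926659, 0.52999648]
test_balanced_accuracy = [0.0968503, 0.49862293, 0.59988654, 0.45583192, 0.29115588]

for name, folds in [
    ("accuracy", test_accuracy),
    ("balanced accuracy", test_balanced_accuracy),
]:
    print(
        f"{name}: mean={mean(folds):.3f} std={pstdev(folds):.3f} "
        f"min={min(folds):.3f} max={max(folds):.3f}"
    )
```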

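On this dataset, this last approach is not the most effective: its mean
balanced accuracy (0.388) stays below the 0.611 obtained by the logistic
regression with balanced class weights, and its per-fold balanced accuracy
ranges from 0.097 to 0.600. The idea behind the approach is that the
different under-samplings bring some diversity to the GBDTs, so that they do
not all focus on the same portion of the majority class.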

