# PREPROCESSING (Part2)

# Imbalanced data

Imbalanced-Learn is a powerful Python package developed by some of the developers of Scikit-Learn.

It is developed on GitHub (see https://github.com/scikit-learn-contrib/imbalanced-learn), and there is also an official website (see http://imbalanced-learn.org/en/stable/) where you can find all the information you might need.

I strongly recommend reading the user guide (see http://imbalanced-learn.org/en/stable/user_guide.html) and, as a complement to it, the general examples (see http://imbalanced-learn.org/en/stable/auto_examples/index.html).

The package is not available through Anaconda Navigator, but you can install it from the prompt by entering

conda install -c conda-forge imbalanced-learn

## Undersampling

We will try the NearMiss undersampling technique on the Iris dataset. Since Iris is perfectly balanced, we first imbalance it artificially.

In [1]:
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

from imblearn.datasets import make_imbalance
from imblearn.under_sampling import NearMiss
from imblearn.pipeline import make_pipeline
from imblearn.metrics import classification_report_imbalanced

RANDOM_STATE = 0

# Load dataset and create an artificial imbalance
iris = load_iris()
X, y = make_imbalance(iris.data, iris.target,
                      sampling_strategy={0: 25, 1: 50, 2: 50},
                      random_state=RANDOM_STATE)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=RANDOM_STATE)

print('Training statistics: {}'.format(Counter(y_train)))
print('Testing statistics: {}'.format(Counter(y_test)))

# Creation of a pipeline, i.e. concatenation of steps in a composed process (see documentation for further details)
pipeline = make_pipeline(NearMiss(version=2),
                         LinearSVC(random_state=RANDOM_STATE, max_iter=10000))
pipeline.fit(X_train, y_train)

# Classification and results presentation
print(classification_report_imbalanced(y_test, pipeline.predict(X_test)))

Training statistics: Counter({1: 38, 2: 38, 0: 17})
Testing statistics: Counter({1: 12, 2: 12, 0: 8})
                   pre       rec       spe        f1       geo       iba       sup

          0       1.00      1.00      1.00      1.00      1.00      1.00         8
          1       1.00      0.83      1.00      0.91      0.91      0.82        12
          2       0.86      1.00      0.90      0.92      0.95      0.91        12

avg / total       0.95      0.94      0.96      0.94      0.95      0.90        32
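
To make the `version=2` argument used above less opaque, here is a minimal standard-library sketch of the NearMiss-2 heuristic for the two-class case: keep the majority samples whose mean distance to their `k` farthest minority samples is smallest. The function name `near_miss_2`, the toy points and the value of `k` are assumptions made for illustration; the imbalanced-learn implementation differs in its details (multi-class handling, neighbor search).

```python
from math import dist


def near_miss_2(majority, minority, n_keep, k=3):
    """Keep the n_keep majority points whose mean distance to their
    k farthest minority points is smallest (NearMiss-2 heuristic)."""
    def score(point):
        # Distances to every minority point, largest first, truncated to k.
        farthest = sorted((dist(point, m) for m in minority), reverse=True)[:k]
        return sum(farthest) / len(farthest)

    return sorted(majority, key=score)[:n_keep]


# Hypothetical toy data: the minority points cluster near the origin.
minority = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5)]
majority = [(1.0, 1.0), (5.0, 5.0), (0.6, 0.6), (9.0, 9.0)]

# Undersample the majority class down to the minority class size.
kept = near_miss_2(majority, minority, n_keep=len(minority))
print(kept)
```

The majority point farthest from the minority cluster, `(9.0, 9.0)`, is the one that gets dropped.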



## Oversampling

We now try the SMOTE oversampling technique on a dataset about thyroid disease. It has 3772 samples and 52 independent variables, and it is imbalanced at a ratio of roughly 15 to 1.

In [2]:
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier

# Load dataset
tiroides = pd.read_csv('data/Thyroids.csv')
tiroides = tiroides.astype(float)  # keep the cast; the original call discarded its result

# Separate inputs and target
X = tiroides.values[:,:-1]
y = tiroides.values[:,-1].astype(int)

# Split train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=RANDOM_STATE)

print('Training statistics: {}'.format(Counter(y_train)))
print('Testing statistics: {}'.format(Counter(y_test)))

# Pipeline creation
pipeline = make_pipeline(SMOTE(random_state=RANDOM_STATE),
                         RandomForestClassifier(n_estimators=10, random_state=RANDOM_STATE))
pipeline.fit(X_train, y_train)

# Classification and results presentation
print(classification_report_imbalanced(y_test, pipeline.predict(X_test)))

Training statistics: Counter({-1: 2662, 1: 167})
Testing statistics: Counter({-1: 879, 1: 64})
                   pre       rec       spe        f1       geo       iba       sup

         -1       0.99      0.99      0.86      0.99      0.92      0.86       879
          1       0.90      0.86      0.99      0.88      0.92      0.84        64

avg / total       0.98      0.98      0.87      0.98      0.92      0.86       943
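
The core idea behind SMOTE can be sketched in a few lines: pick a minority sample, pick one of its nearest minority neighbors, and create a synthetic sample somewhere on the segment between them. The code below is only an illustrative standard-library sketch. The function name `smote_sample`, the toy points and `k=2` are assumptions, and the `SMOTE` class in imbalanced-learn handles many more details.

```python
import random
from math import dist


def smote_sample(minority, rng, k=2):
    """Create one synthetic point on the segment between a random minority
    point and one of its k nearest minority neighbors."""
    i = rng.randrange(len(minority))
    base = minority[i]
    others = minority[:i] + minority[i + 1:]
    neighbors = sorted(others, key=lambda p: dist(base, p))[:k]
    neighbor = rng.choice(neighbors)
    gap = rng.random()  # interpolation factor in [0, 1)
    return tuple(b + gap * (n - b) for b, n in zip(base, neighbor))


# Hypothetical minority-class points in two dimensions.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (5.0, 5.0)]
rng = random.Random(0)
synthetic = [smote_sample(minority, rng) for _ in range(5)]
print(synthetic)
```

Because each synthetic point is an interpolation between two existing minority points, it always stays inside the region already covered by the minority class.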



#### Exercise 4:

(i) Try a different NearMiss version from the one in the example on the thyroids dataset with a random forest classifier. Does it get better if we increase the number of trees in the forest (n_estimators) to 100? And from 100 to 1000?

(ii) Plan a mixed strategy for the thyroids dataset and check its performance with random forests. Play with the n_estimators parameter to increase the average f1 score. Is the order of the mixed sampling strategies relevant?

(iii) Combine PCA with the mixed strategy. Quantify the percentage of data compression when capturing 95% of the total cumulative variance. Compare the performance with the one in (ii). In case of big differences, what could be one reason?

(iv) Compare the results of all strategies with the case of not correcting the imbalance.

(v) Use the ADASYN oversampling technique combined with an undersampling technique different from NearMiss. Explain the reason for your choice. See the imbalanced-learn documentation to find out which functions to use and how to use them.

#### Solution:

In [3]:
# Your solution here

# In general, and for your future revisions of the material, it is better to provide complete code here.
# Define the imports and functions in this cell, so that it can be executed on its own.

#### (i) Try a different NearMiss version from the one in the example for the thyroids dataset with random forests classifier. Does it get better if we increase the number of trees to 100 in the forest (n_estimators)? And from 100 to 1000?

In [4]:
import pandas as pd
from imblearn.under_sampling import NearMiss
from imblearn.pipeline import make_pipeline
from imblearn.metrics import classification_report_imbalanced
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

seed = 0

# Load dataset
df = pd.read_csv('data/Thyroids.csv')
df = df.astype(float)  # keep the cast; the original call discarded its result

# Separate inputs and target
X = df.values[:,:-1]
y = df.values[:,-1].astype(int)

# Split train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=seed)

n_trees = [10, 100, 1000]

for n in n_trees:
    print(f'Pipeline with {n} trees and NearMiss version 1')
    pipeline = make_pipeline(NearMiss(version=1), RandomForestClassifier(n_estimators=n, random_state=seed))
    pipeline.fit(X_train, y_train)
    print(classification_report_imbalanced(y_test, pipeline.predict(X_test)))

Pipeline with 10 trees and NearMiss version 1
                   pre       rec       spe        f1       geo       iba       sup

         -1       0.99      0.79      0.92      0.88      0.85      0.72       879
          1       0.24      0.92      0.79      0.39      0.85      0.74        64

avg / total       0.94      0.80      0.91      0.85      0.85      0.72       943

Pipeline with 100 trees and NearMiss version 1
                   pre       rec       spe        f1       geo       iba       sup

         -1       1.00      0.78      0.97      0.88      0.87      0.75       879
          1       0.25      0.97      0.78      0.39      0.87      0.77        64

avg / total       0.95      0.80      0.96      0.84      0.87      0.75       943

Pipeline with 1000 trees and NearMiss version 1
                   pre       rec       spe        f1       geo       iba       sup

         -1       1.00      0.79      0.97      0.88      0.88      0.75       879
          1       0.25

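Increasing the number of trees does not improve the result: the average f1 score is essentially the same with 10 and with 1000 trees, and the minority-class f1 stays at 0.39 with both 10 and 100 trees.
- n_estimators = 10        average f1 = 0.85
- n_estimators = 100       average f1 = 0.84
- n_estimators = 1000      average f1 = 0.85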


#### (ii) Plan a mixed strategy for thyroids dataset and check its performance with random forests. Play with n_estimators parameter to increase f1 average score. Is the order of the mixed sampling strategies relevant?

In [5]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from imblearn.metrics import classification_report_imbalanced
from imblearn.pipeline import make_pipeline

# SMOTE + Tomek links: over-sampling with SMOTE followed by cleaning with Tomek links.
# https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.combine.SMOTETomek.html
from imblearn.combine import SMOTETomek

# Load dataset
df = pd.read_csv('data/Thyroids.csv')
df = df.astype(float)  # keep the cast; the original call discarded its result

X = df.values[:,:-1]
y = df.values[:,-1].astype(int)

seed = 0

# Split train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=seed)

n_trees = [10, 100, 1000]

for n in n_trees:
    print(f'n_estimator = {n}')
    pipeline = make_pipeline(SMOTETomek(random_state=seed), RandomForestClassifier(n_estimators=n, random_state=seed))
    pipeline.fit(X_train, y_train)
    print(classification_report_imbalanced(y_test, pipeline.predict(X_test)))

n_estimator = 10
                   pre       rec       spe        f1       geo       iba       sup

         -1       0.99      0.99      0.92      0.99      0.96      0.92       879
          1       0.89      0.92      0.99      0.91      0.96      0.91        64

avg / total       0.99      0.99      0.93      0.99      0.96      0.92       943

n_estimator = 100
                   pre       rec       spe        f1       geo       iba       sup

         -1       0.99      0.99      0.91      0.99      0.95      0.91       879
          1       0.91      0.91      0.99      0.91      0.95      0.89        64

avg / total       0.99      0.99      0.91      0.99      0.95      0.91       943

n_estimator = 1000
                   pre       rec       spe        f1       geo       iba       sup

         -1       0.99      0.99      0.91      0.99      0.95      0.91       879
          1       0.89      0.91      0.99      0.90      0.95      0.89        64

avg / total       0.99   

- The scores are similar for all values of n_estimators and very good: the minority-class f1 stays around 0.90 to 0.91, far above the 0.39 obtained with NearMiss in (i).

#### (iii) Combine PCA with the mixed strategy. Quantify the percentage of data compression when capturing 95% of the total cumulative variance. Compare the performance with the one in (ii). In case of big differences, what could be one reason?

In [6]:
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from imblearn.combine import SMOTETomek
from imblearn.metrics import classification_report_imbalanced
from imblearn.pipeline import make_pipeline

seed = 0

pca = PCA(n_components=0.95)

# Calculates dataframe PCA
def get_df_pca(df):    
    X = df.iloc[:,:-1]
    y = df.iloc[:,-1]
    pca.fit(X)
    X_reduced = pca.transform(X)
    # print("There have been selected " + str(X_reduced.shape[1]) + " principal components.")    
    columns = []
    for n in range(X_reduced.shape[1]):
        columns.append("PCA" + str(n))    
    df = pd.DataFrame(X_reduced, columns=columns)
    df['species'] = y
    return df       

# Load dataset
df = pd.read_csv('data/Thyroids.csv')
df = df.astype(float)  # keep the cast; the original call discarded its result

df_pca = get_df_pca(df)
X = df_pca.values[:,:-1]
y = df_pca.values[:,-1].astype(int)

# Split train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=seed)

pipeline = make_pipeline(SMOTETomek(random_state=0), RandomForestClassifier(n_estimators=100, random_state=0))
pipeline.fit(X_train, y_train)
print(classification_report_imbalanced(y_test, pipeline.predict(X_test)))

                   pre       rec       spe        f1       geo       iba       sup

         -1       0.96      0.89      0.55      0.92      0.70      0.50       879
          1       0.26      0.55      0.89      0.35      0.70      0.47        64

avg / total       0.92      0.86      0.57      0.88      0.70      0.50       943



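- We get an average f1-score of 0.88, clearly lower than the 0.99 obtained in (ii) without PCA; the minority-class f1 falls from about 0.91 to 0.35.
- PCA reduces the dimension of the data, and SMOTETomek (SMOTE + Tomek links) then resamples the compressed data. The components that capture the most variance are not necessarily the ones that separate the classes, so information useful to the classifier may be lost. That could be one reason for the lower f1-score.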
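
The compression percentage asked for in the exercise can be derived from the explained variance ratios. The sketch below uses only the standard library and hypothetical ratios (not the thyroid data). With the fitted `pca` object from the cell above, the equivalent inputs would be `pca.explained_variance_ratio_` for the ratios and `pca.n_components_` for the number of components kept. The helper names are assumptions made for illustration.

```python
from itertools import accumulate


def components_for_variance(explained_variance_ratio, threshold=0.95):
    """Smallest number of leading components whose cumulative explained
    variance ratio reaches the threshold."""
    for k, cumulative in enumerate(accumulate(explained_variance_ratio), start=1):
        if cumulative >= threshold:
            return k
    return len(explained_variance_ratio)


def compression_percentage(n_original, n_kept):
    """Percentage of the original feature dimensions that is dropped."""
    return 100.0 * (1 - n_kept / n_original)


# Hypothetical ratios for a 6-feature dataset, sorted in decreasing order.
ratios = [0.60, 0.20, 0.10, 0.06, 0.03, 0.01]
k = components_for_variance(ratios, threshold=0.95)
compression = compression_percentage(len(ratios), k)
print(f'{k} of {len(ratios)} components kept, compression = {compression:.1f}%')
```

In this toy case four of six components are needed, so a third of the dimensions is dropped.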

#### (iv) Compare the results of all strategies with the case of not correcting the imbalance.

We are going to apply PCA and calculate the score without balancing the data.

In [7]:
# Score of the model without correcting the class imbalance (baseline for the strategies above)

import pandas as pd
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from imblearn.metrics import classification_report_imbalanced
from imblearn.pipeline import make_pipeline

seed = 0

pca = PCA(n_components=0.95)

# Calculates dataframe PCA
def get_df_pca(df):    
    X = df.iloc[:,:-1]
    y = df.iloc[:,-1]
    pca.fit(X)
    X_reduced = pca.transform(X)
    # print("There have been selected " + str(X_reduced.shape[1]) + " principal components.")    
    columns = []
    for n in range(X_reduced.shape[1]):
        columns.append("PCA" + str(n))    
    df = pd.DataFrame(X_reduced, columns=columns)
    df['species'] = y
    return df       

# Load dataset
df = pd.read_csv('data/Thyroids.csv')
df = df.astype(float)  # keep the cast; the original call discarded its result

df_pca = get_df_pca(df)
X = df_pca.values[:,:-1]
y = df_pca.values[:,-1].astype(int)

# Split train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=seed)

pipeline = make_pipeline(RandomForestClassifier(n_estimators=100, random_state=0))
pipeline.fit(X_train, y_train)
print(classification_report_imbalanced(y_test, pipeline.predict(X_test)))

                   pre       rec       spe        f1       geo       iba       sup

         -1       0.94      0.99      0.11      0.97      0.33      0.12       879
          1       0.54      0.11      0.99      0.18      0.33      0.10        64

avg / total       0.91      0.93      0.17      0.91      0.33      0.12       943



- The average f1-score (0.91) looks acceptable, but it is dominated by the majority class.
- The f1-score for the majority class (-1) is good (0.97), but for the minority class (1) it is very bad (0.18), with a recall of only 0.11.
- We get a better score if we only balance the data, without applying PCA.
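
The gap between the average and the minority-class f1 can be checked by hand. The sketch below takes the per-class f1 values and supports from the report above and computes a support-weighted mean and an unweighted (macro) mean. It assumes that the `avg / total` row weights each class by its support, which is consistent with the numbers in the reports of this notebook up to rounding.

```python
# Per-class f1 and support copied from the report above (PCA, no resampling).
f1 = {-1: 0.97, 1: 0.18}
support = {-1: 879, 1: 64}

total = sum(support.values())

# Support-weighted mean: the majority class contributes 879 of 943 samples.
weighted_f1 = sum(f1[c] * support[c] for c in f1) / total

# Macro mean: every class counts the same, whatever its size.
macro_f1 = sum(f1.values()) / len(f1)

print(f'weighted f1: {weighted_f1:.3f}')
print(f'macro f1:    {macro_f1:.3f}')
```

The weighted value stays above 0.9 while the macro value is below 0.6, which is why a macro average (or the minority-class row itself) is a more honest summary for imbalanced data.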

#### (v) Use the ADASYN oversampling technique combined with an undersampling technique different from NearMiss. Explain the reason for your choice. See the imbalanced-learn documentation to find out which functions to use and how to use them.

In [8]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from imblearn.over_sampling import ADASYN
from imblearn.under_sampling import TomekLinks

def automatic_scoring(df):
    algorithm = RandomForestClassifier(n_estimators=10, random_state=0)
    score = cross_val_score(estimator=algorithm, X=df.values[:, :-1], y=df.values[:, -1].astype('int'), cv=5, scoring='f1_macro')
    summary_score = score.mean()
    return summary_score

df = pd.read_csv('data/Thyroids.csv')
df = df.astype(float)  # keep the cast; the original call discarded its result

X = df.values[:,:-1]
y = df.values[:,-1].astype(int)

ada = ADASYN(random_state=0)
X_res1, y_res1 = ada.fit_resample(X, y)

tomek = TomekLinks()  # renamed from "cnn", which suggested CondensedNearestNeighbour
X_res2, y_res2 = tomek.fit_resample(X_res1, y_res1)

df = pd.DataFrame(X_res2, columns=df.columns[:-1])
df['target'] = y_res2

print(automatic_scoring(df))

0.9875472634432899


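With the TomekLinks undersampling method we get a high cross-validated score (macro f1 of about 0.99). A Tomek link is a pair of samples from different classes that are each other's nearest neighbors; removing the majority-class sample of each link cleans the class boundary, which tends to give the classifier a better decision boundary. Note that the resampling was applied to the whole dataset before cross-validation, so synthetic samples also end up in the validation folds and the score is probably optimistic.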
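
To illustrate which samples this kind of cleaning removes, here is a small standard-library sketch that finds Tomek links by brute force: pairs of samples with different labels that are each other's nearest neighbor. The function name `tomek_links` and the one-dimensional toy data are assumptions made for illustration; the `TomekLinks` class in imbalanced-learn is the implementation to use in practice.

```python
from math import dist


def tomek_links(points, labels):
    """Return index pairs (i, j), i < j, that are mutual nearest neighbors
    and carry different labels."""
    def nearest(i):
        return min((j for j in range(len(points)) if j != i),
                   key=lambda j: dist(points[i], points[j]))

    nn = [nearest(i) for i in range(len(points))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and labels[i] != labels[j]]


# Hypothetical 1-D toy data; label 0 is the majority class.
points = [(0.0,), (1.0,), (2.0,), (2.2,), (4.0,)]
labels = [0, 0, 0, 1, 1]

links = tomek_links(points, labels)
# For each link, the majority-class member is the one to remove.
majority_to_drop = [i if labels[i] == 0 else j for i, j in links]
print(links, majority_to_drop)
```

Only the pair at 2.0 and 2.2 forms a link, so the majority sample at 2.0, which sits right next to a minority sample, is the one removed.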