# PREPROCESSING (Part 2)

# Imbalanced data

Imbalanced-Learn is a powerful Python package for dealing with imbalanced datasets, developed by some of the Scikit-Learn developers.

It is developed on GitHub (see https://github.com/scikit-learn-contrib/imbalanced-learn), and there is also an official website (see http://imbalanced-learn.org/en/stable/) where you can find all the information you might need.

I strongly recommend reading the user guide (see http://imbalanced-learn.org/en/stable/user_guide.html), with the general examples as a complement to it (see http://imbalanced-learn.org/en/stable/auto_examples/index.html).

The package is not available through Anaconda Navigator, but you can install it from the prompt by entering:

conda install -c conda-forge imbalanced-learn

## Undersampling

We will try the NearMiss undersampling technique on the Iris dataset. Since Iris is perfectly balanced, we first imbalance it artificially.
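
Before running the library version, it may help to see the selection rule behind NearMiss. The sketch below is a minimal pure-Python illustration, assuming the usual description of the method: version 1 keeps the majority samples with the smallest average distance to their k nearest minority samples, while version 2 uses the k farthest minority samples instead. The function name `near_miss_sketch` and the toy points are made up for illustration; this is not how imbalanced-learn implements it.

```python
import math


def near_miss_sketch(majority, minority, n_keep, k=3, farthest=False):
    """Keep the n_keep majority points with the smallest average distance
    to k minority points (the nearest ones by default, the farthest if asked)."""

    def average_distance(point):
        distances = sorted(math.dist(point, m) for m in minority)
        chosen = distances[-k:] if farthest else distances[:k]
        return sum(chosen) / len(chosen)

    # Sort majority points by their score and keep the first n_keep
    return sorted(majority, key=average_distance)[:n_keep]


# Toy data (hypothetical): four majority points and two minority points
majority = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (10.0, 0.0)]
minority = [(0.0, 1.0), (1.0, 1.0)]

kept = near_miss_sketch(majority, minority, n_keep=2, k=1)
print(kept)
```

With these toy points, the two majority samples lying next to the minority class are the ones retained, which is the behaviour that gives the method its name.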

In [None]:
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

from imblearn.datasets import make_imbalance
from imblearn.under_sampling import NearMiss
from imblearn.pipeline import make_pipeline
from imblearn.metrics import classification_report_imbalanced

RANDOM_STATE = 0

# Load dataset and create an artificial imbalance
iris = load_iris()
X, y = make_imbalance(iris.data, iris.target,
                      sampling_strategy={0: 25, 1: 50, 2: 50},
                      random_state=RANDOM_STATE)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=RANDOM_STATE)

print('Training statistics: {}'.format(Counter(y_train)))
print('Testing statistics: {}'.format(Counter(y_test)))

# Creation of a pipeline, i.e. concatenation of steps in a composed process (see documentation for further details)
pipeline = make_pipeline(NearMiss(version=2),
                         LinearSVC(random_state=RANDOM_STATE, max_iter=10000))
pipeline.fit(X_train, y_train)

# Classification and results presentation
print(classification_report_imbalanced(y_test, pipeline.predict(X_test)))
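
The report printed by the cell above lists, for each class, precision (`pre`), recall (`rec`), specificity (`spe`), F1 score (`f1`), geometric mean (`geo`) and index balanced accuracy (`iba`). As a rough guide to reading it, the sketch below computes the first five from one-vs-rest confusion counts using the textbook formulas. The helper name `one_vs_rest_metrics` and the counts are hypothetical, and the sketch does not guard against empty denominators.

```python
import math


def one_vs_rest_metrics(tp, fp, fn, tn):
    """Textbook metrics for one class treated as 'positive' against the rest."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # also called sensitivity
    specificity = tn / (tn + fp)     # recall of the 'rest' group
    f1 = 2 * precision * recall / (precision + recall)
    geometric_mean = math.sqrt(recall * specificity)
    return {"pre": precision, "rec": recall, "spe": specificity,
            "f1": f1, "geo": geometric_mean}


# Hypothetical counts for a minority class: 8 hits, 2 false alarms,
# 4 misses and 36 correct rejections
metrics = one_vs_rest_metrics(tp=8, fp=2, fn=4, tn=36)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

On imbalanced data, recall and specificity of the minority class are usually more informative than overall accuracy, which is why the report shows them side by side.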

## Oversampling

We now try the SMOTE oversampling technique on a dataset about thyroid disease. It has 3772 samples and 52 independent variables, and it is imbalanced at a ratio of 15 to 1.

In [None]:
import pandas as pd

from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier

# Load dataset and convert every column to float
tiroides = pd.read_csv('Thyroids.csv')
tiroides = tiroides.astype(float)

# Separate inputs and target
X = tiroides.values[:,:-1]
y = tiroides.values[:,-1].astype(int)

# Split train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=RANDOM_STATE)

print('Training statistics: {}'.format(Counter(y_train)))
print('Testing statistics: {}'.format(Counter(y_test)))

# Pipeline creation
pipeline = make_pipeline(SMOTE(random_state=RANDOM_STATE),
                         RandomForestClassifier(n_estimators=10, random_state=RANDOM_STATE))
pipeline.fit(X_train, y_train)

# Classification and results presentation
print(classification_report_imbalanced(y_test, pipeline.predict(X_test)))
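
To see what SMOTE does to the minority class, here is a minimal pure-Python sketch of its core idea: each synthetic sample is placed at a random position on the segment between a minority sample and one of its k nearest minority neighbours. The function name `smote_sketch`, its parameters and the toy points are assumptions made for illustration; the library version handles several classes and uses efficient neighbour searches.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Candidate neighbours: every other minority point, nearest first
        others = minority[:i] + minority[i + 1:]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b)
                               for b, n in zip(base, neighbour)))
    return synthetic


# Toy minority class (hypothetical points)
minority = [(1.0, 1.0), (1.5, 1.2), (2.0, 1.0), (1.2, 1.8)]
new_points = smote_sketch(minority, n_new=3)
print(new_points)
```

Because every new point is a convex combination of two existing minority samples, the synthetic samples stay inside the region already occupied by the minority class.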

#### Exercise 4:

(i) On the thyroid dataset, try a NearMiss version different from the one in the example, together with a random forest classifier. Does performance improve if we increase the number of trees in the forest (n_estimators) to 100? And from 100 to 1000?

(ii) Plan a mixed strategy for the thyroid dataset and check its performance with random forests. Play with the n_estimators parameter to increase the average f1 score. Is the order of the mixed sampling strategies relevant?

(iii) Combine PCA with the mixed strategy. Quantify the percentage of data compression when capturing 95% of the total cumulative variance. Compare the performance with that in (ii). If the differences are large, what could be one reason?

(iv) Compare the results of all strategies with the case of not correcting the imbalance.

(v) Use the ADASYN oversampling technique combined with an undersampling technique other than NearMiss. Explain the reason for your choice. See the imbalanced-learn documentation to find which functions to use and how to use them.
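
As a hint for (iii), the arithmetic behind "percentage of data compression" can be sketched as follows, assuming it is read as the fraction of dimensions dropped once enough components are kept to reach the target cumulative variance. The ratios below are hypothetical; with scikit-learn they would come from the `explained_variance_ratio_` attribute of a fitted PCA. The helper name `components_for_variance` is made up for this sketch.

```python
def components_for_variance(explained_variance_ratio, target=0.95):
    """Smallest number of leading components whose cumulative explained
    variance ratio reaches the target."""
    cumulative = 0.0
    for n_components, ratio in enumerate(explained_variance_ratio, start=1):
        cumulative += ratio
        if cumulative >= target:
            return n_components
    # Target never reached: all components are needed
    return len(explained_variance_ratio)


# Hypothetical ratios for a 5-feature dataset, sorted in decreasing order
ratios = [0.60, 0.20, 0.10, 0.06, 0.04]
n_kept = components_for_variance(ratios)
compression = 1 - n_kept / len(ratios)
print(f"{n_kept} of {len(ratios)} components kept, "
      f"{compression:.0%} compression")
```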

#### Solution:

In [None]:
# Your solution here

# In general, and for your future revisions of the material, it is better to provide complete code here.
# Define all imports and functions in this cell, so that it can be executed on its own.