### Balancing data between e, mu, and noise

We will follow this [tutorial on random oversampling and undersampling for imbalanced classification](https://machinelearningmastery.com/random-oversampling-and-undersampling-for-imbalanced-classification/).

In [1]:
#pip install imbalanced-learn

Collecting imbalanced-learn
  Downloading https://files.pythonhosted.org/packages/c8/73/36a13185c2acff44d601dc6107b5347e075561a49e15ddd4e69988414c3e/imbalanced_learn-0.6.2-py3-none-any.whl (163kB)
Collecting scikit-learn>=0.22 (from imbalanced-learn)
  Downloading https://files.pythonhosted.org/packages/41/b6/126263db075fbcc79107749f906ec1c7639f69d2d017807c6574792e517e/scikit_learn-0.22.2.post1-cp37-cp37m-manylinux1_x86_64.whl (7.1MB)
Installing collected packages: scikit-learn, imbalanced-learn
  Found existing installation: scikit-learn 0.21.3
    Uninstalling scikit-learn-0.21.3:
      Successfully uninstalled scikit-learn-0.21.3
Successfully installed imbalanced-learn-0.6.2 scikit-learn-0.22.2.post1
Note: you may need to restart the kernel to use updated packages.


In [1]:
# check version number
import imblearn
print(imblearn.__version__)

Using TensorFlow backend.


0.6.2


**Random oversampling** can be implemented using the `RandomOverSampler` class.

The class takes a `sampling_strategy` argument. Setting it to `'minority'` resamples only the minority class, so that it ends up with as many samples as the majority class.

In [2]:
# example of random oversampling to balance the class distribution
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler

# define dataset
X, y = make_classification(n_samples=10000, weights=[0.99], flip_y=0)
# summarize class distribution
print(Counter(y))

# define oversampling strategy
oversample = RandomOverSampler(sampling_strategy='minority')
# This means that if the majority class had 1,000 examples and the minority class
# had 100, this strategy would oversample the minority class up to 1,000 examples.

# fit and apply the transform
X_over, y_over = oversample.fit_resample(X, y)
# summarize class distribution
print(Counter(y_over))

Counter({0: 9900, 1: 100})
Counter({0: 9900, 1: 9900})


Before resampling there were 9900 samples of class 0 and only 100 of class 1; the oversampler added 9800 samples to class 1. Let's try it with our dataset:

In [3]:
import numpy as np

In [4]:
loaded = np.load('/gpfs/projects/damic/eVSmuVSn_1.npz')
X = loaded['data']
y = loaded['labels']

In [5]:
from sklearn.model_selection import train_test_split

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

Oversampling must be applied only to the training set, so that the test set keeps its original class distribution:

In [7]:
print(Counter(y_train))

Counter({0: 2560, 1: 1981, 2: 602})
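A minimal sketch of why resampling is restricted to the training split. It uses toy integer ids instead of the real images, and a plain shuffled 75/25 split as a stand-in for `train_test_split`; the helper name `split_75_25` is hypothetical. If samples are duplicated before splitting, copies of the same sample can land in both splits, which would make the test score look better than it should.

```python
import numpy as np

rng = np.random.default_rng(0)

def split_75_25(values, rng):
    """Shuffle and split into 75% train / 25% test (stand-in for train_test_split)."""
    shuffled = rng.permutation(values)
    cut = int(0.75 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

# Toy stand-in: 40 unique samples, identified by an integer id.
ids = np.arange(40)

# Wrong order: duplicate first (every sample copied 5 times), then split.
duplicated = np.repeat(ids, 5)
train_bad, test_bad = split_75_25(duplicated, rng)
leaked = np.intersect1d(train_bad, test_bad)
print("ids in both splits (resample, then split):", leaked.size)

# Right order: split first, then duplicate only the training part.
train_ok, test_ok = split_75_25(ids, rng)
train_ok = np.repeat(train_ok, 5)
clean = np.intersect1d(train_ok, test_ok)
print("ids in both splits (split, then resample):", clean.size)
```

With the second ordering the test ids never appear in the training data, no matter how often the training samples are repeated.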


In [8]:
aux_X = X_train.reshape((X_train.shape[0], X_train.shape[1]* X_train.shape[2]))

In [17]:
oversample = RandomOverSampler(sampling_strategy='minority')

# fit and apply the transform
X_over, y_over = oversample.fit_resample(aux_X, y_train)
# summarize class distribution
print(Counter(y_over))

Counter({0: 2560, 2: 2560, 1: 1981})


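With `sampling_strategy='minority'`, each call resamples only the smallest class, so class 1 is still behind the others. We apply the same oversampler once more: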

In [18]:
# fit and apply the transform
X_over_f, y_over_f = oversample.fit_resample(X_over, y_over)
# summarize class distribution
print(Counter(y_over_f))

Counter({1: 2560, 0: 2560, 2: 2560})


In [20]:
X_over_f.shape

(7680, 84656)

In [19]:
X_train.shape

(5143, 296, 286)

We reshape the oversampled data back to the original image dimensions and reassign `X_train`:

In [21]:
X_train = X_over_f.reshape((X_over_f.shape[0], X_train.shape[1], X_train.shape[2]))

In [22]:
print(X_train.shape)

(7680, 296, 286)
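The flatten, resample, reshape round trip used above can be checked on a small example. The samplers work on 2D input of shape `(n_samples, n_features)`, which is why each image is flattened into one row first. This is only a sketch: the array sizes and the repeated indices in `picked` are arbitrary stand-ins, not the real data or the sampler's actual choices.

```python
import numpy as np

# Small stand-in for the image stack: 6 images of 4 x 3 pixels.
images = np.arange(6 * 4 * 3, dtype=np.float32).reshape(6, 4, 3)

# Flatten each image into one row: shape (n_samples, height * width).
flat = images.reshape(images.shape[0], -1)

# Mimic oversampling by repeating some rows (indices chosen arbitrarily).
picked = np.array([0, 1, 2, 3, 4, 5, 5, 2])
flat_over = flat[picked]

# Restore the image shape; -1 lets NumPy infer the new number of samples.
restored = flat_over.reshape(-1, images.shape[1], images.shape[2])
print(restored.shape)
```

Because `reshape` keeps the element order within each row, every restored image is identical to the original it was copied from.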


As we can see, the training set has grown from approximately 5000 samples to more than 7500. The muon class in particular has grown from about 600 samples to 2560, and since the added samples are copies of existing ones, this can lead to overfitting. So we will take a different approach:
* Downsample the noise data
* Oversample the muon data
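The overfitting concern comes from the fact that random oversampling only repeats existing rows and creates no new information. The sketch below mimics this with plain NumPy on random stand-in data (it is not imbalanced-learn's implementation): 602 distinct rows are padded to 2560 by drawing with replacement, and the number of distinct rows stays at 602.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the 602 muon samples: random rows, so all of them are distinct.
muons = rng.normal(size=(602, 8))

# Pad to 2560 rows: keep the originals and draw the rest with replacement.
extra = rng.integers(0, len(muons), size=2560 - len(muons))
muons_over = np.vstack([muons, muons[extra]])

n_unique = np.unique(muons_over, axis=0).shape[0]
print(muons_over.shape[0], "rows,", n_unique, "distinct")
```

On average each muon sample appears more than four times after padding, so a model can memorise those few examples. Reducing the target count, as in the approach listed above, lowers the number of repeats per sample.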

In [9]:
print(aux_X.shape)

(5143, 84656)


In [10]:
print(Counter(y_train))

Counter({0: 2560, 1: 1981, 2: 602})


In [11]:
from imblearn.under_sampling import RandomUnderSampler

In [21]:
# undersample the noise class (0) to match the number of electron samples (1981)

under = RandomUnderSampler(sampling_strategy={0: 1981} )

X_under, y_under = under.fit_resample(aux_X, y_train)

print(Counter(y_under))

Counter({0: 1981, 1: 1981, 2: 602})


In [22]:
oversample = RandomOverSampler(sampling_strategy='minority')

X_over, y_over = oversample.fit_resample(X_under, y_under)

print(Counter(y_over))

Counter({0: 1981, 1: 1981, 2: 1981})


In [24]:
X_over.shape, y_over.shape

((5943, 84656), (5943,))

In [25]:
X_train.shape, y_train.shape

((5143, 296, 286), (5143,))

In [27]:
X_train = X_over.reshape((X_over.shape[0], X_train.shape[1], X_train.shape[2]))

In [28]:
X_train.shape

(5943, 296, 286)

In [29]:
y_train = y_over

We now have the same number of samples in each class! Finally, we save the balanced training arrays and the untouched test arrays:

In [30]:
np.savez_compressed('/gpfs/projects/damic/eVSmuVSn_tr1', data=X_train, labels=y_train)

In [31]:
np.savez_compressed('/gpfs/projects/damic/eVSmuVSn_te1', data_test=X_test, labels_test=y_test)
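A quick way to check the saved files is to load them back and compare. The sketch below does this with tiny stand-in arrays in a temporary directory rather than the real `/gpfs` paths; `demo_tr`, `X_demo` and `y_demo` are hypothetical names. Note that `np.savez_compressed` appends `.npz` when the file name has no extension, and that the arrays are stored under the keyword names used when saving (`data` and `labels` for the training file, `data_test` and `labels_test` for the test file).

```python
import os
import tempfile

import numpy as np

# Tiny stand-ins for the balanced training arrays.
X_demo = np.zeros((3, 4, 5), dtype=np.float32)
y_demo = np.array([0, 1, 2])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_tr")  # no extension given
    np.savez_compressed(path, data=X_demo, labels=y_demo)

    # NumPy appended ".npz" to the file name when saving.
    with np.load(path + ".npz") as loaded:
        X_back = loaded["data"]
        y_back = loaded["labels"]

print(X_back.shape, y_back.shape)
```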