### Balancing input data for a model to differentiate between muons and electrons

In this notebook, we prepare the input data that will be fed to a convolutional neural network to classify a signal as an electron or a muon. First, we get all the electron files from the directory set in `e_dir` and extract the energy array from each one; the noise component is not used in the cells below. Then we get the muon files from the `cropped_muons` directory and proceed similarly.

In [1]:
import glob
import os, shutil
import numpy as np

In [2]:
e_dir = '/gpfs/projects/damic/electrons_padded'
mu_dir = '/gpfs/projects/damic/cropped_muons'

Selecting energy for muons:

In [3]:
muons = glob.glob1(mu_dir,"*13.npz") #all the muons

len_mu = len(muons)

all_muon_energy = [np.load(os.path.join(mu_dir, name))['energy'] for name in muons]

all_muon_energy_ = np.dstack(all_muon_energy)
all_muon_energy_ = np.rollaxis(all_muon_energy_,-1)

Selecting energy for electrons:

In [4]:
electrons = glob.glob1(e_dir,"*11.npz") #all the electrons

len_e = len(electrons)

all_e_energy = [np.load(os.path.join(e_dir, name))['energy'] for name in electrons]

all_e_energy_ = np.dstack(all_e_energy)
all_e_energy_ = np.rollaxis(all_e_energy_,-1)

In [5]:
all_e_energy_.shape

(2628, 296, 286)

In [6]:
all_muon_energy_.shape

(801, 296, 286)
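The `np.dstack` plus `np.rollaxis` combination used for both classes turns a list of 2D images into one array with the sample index first. As a small check on synthetic data (the array sizes here are made up for illustration), the same result can be obtained with a single `np.stack` call:

```python
import numpy as np

# Five synthetic 4x3 "images" standing in for the energy arrays
rng = np.random.default_rng(0)
images = [rng.random((4, 3)) for _ in range(5)]

# Approach used in the notebook: stack along a new last axis, then move it to the front
via_dstack = np.rollaxis(np.dstack(images), -1)

# Equivalent single call: stack along a new first axis
via_stack = np.stack(images, axis=0)

print(via_dstack.shape, via_stack.shape)
print(np.array_equal(via_dstack, via_stack))
```

Both arrays have shape `(n_samples, height, width)`, which is the layout the later cells rely on.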

The two classes have different numbers of samples (2628 electrons versus 801 muons), so under- and oversampling must be performed. But first we create the labels:
* 0 for electrons
* 1 for muons

In [7]:
labels_electron = np.repeat(0, all_e_energy_.shape[0])
labels_muon = np.repeat(1, all_muon_energy_.shape[0])

In [8]:
X = np.concatenate((all_e_energy_, all_muon_energy_), axis=0) #electrons followed by muons
y = np.concatenate((labels_electron, labels_muon), axis=0) #electron labels followed by muon labels

In [9]:
np.savez_compressed('/gpfs/projects/damic/eVSmu_1', data=X, labels=y)

Now we split the data into training and test sets (with the default `test_size` of 0.25) so that the under/oversampling is applied only to the training set:

In [10]:
from sklearn.model_selection import train_test_split

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
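The split above does not pass `stratify`, so the class proportions in each subset are left to chance. A possible variation, not used in this notebook, is to pass `stratify=y` so that both subsets keep roughly the same electron-to-muon ratio. The sketch below uses synthetic labels with the same class counts as the data above; `X_demo` and `y_demo` are placeholder names.

```python
import numpy as np
from collections import Counter
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the notebook's class counts: 2628 electrons (0), 801 muons (1)
y_demo = np.concatenate((np.zeros(2628, dtype=int), np.ones(801, dtype=int)))
X_demo = np.arange(y_demo.size).reshape(-1, 1)  # one dummy feature per sample

# stratify=y_demo keeps the class ratio (approximately) equal in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=0, stratify=y_demo
)

train_counts = Counter(y_tr.tolist())
test_counts = Counter(y_te.tolist())
print(train_counts, test_counts)
```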

In [12]:
import imblearn
from collections import Counter
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler

Using TensorFlow backend.


In [13]:
print(Counter(y_train))

Counter({0: 1966, 1: 605})


The classes are strongly imbalanced: there are more than three electrons for every muon.

In [14]:
aux_X = X_train.reshape((X_train.shape[0], X_train.shape[1]* X_train.shape[2]))

In [15]:
#undersample the electrons

under = RandomUnderSampler(sampling_strategy={0: 800} )

X_under, y_under = under.fit_resample(aux_X, y_train)

print(Counter(y_under))

Counter({0: 800, 1: 605})


In [16]:
oversample = RandomOverSampler(sampling_strategy='minority')

X_over, y_over = oversample.fit_resample(X_under, y_under)

print(Counter(y_over))

Counter({0: 800, 1: 800})
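To make the two resampling steps concrete, here is a plain NumPy sketch of the idea behind them: drop random majority samples without replacement, then add randomly repeated minority samples until the classes match. This illustrates the concept only and is not how `imblearn` is implemented; `undersample` and `oversample_minority` are hypothetical helper names.

```python
import numpy as np

def undersample(X, y, label, n_keep, rng):
    """Keep n_keep random samples of `label` and every sample of the other classes."""
    keep = rng.choice(np.flatnonzero(y == label), size=n_keep, replace=False)
    idx = np.concatenate((np.flatnonzero(y != label), keep))
    return X[idx], y[idx]

def oversample_minority(X, y, rng):
    """Append randomly repeated minority samples until both classes have equal size."""
    labels, counts = np.unique(y, return_counts=True)
    minority = labels[np.argmin(counts)]
    n_extra = counts.max() - counts.min()
    extra = rng.choice(np.flatnonzero(y == minority), size=n_extra, replace=True)
    idx = np.concatenate((np.arange(y.size), extra))
    return X[idx], y[idx]

# Synthetic training labels with the counts printed above: 1966 electrons, 605 muons
rng = np.random.default_rng(0)
y_demo = np.concatenate((np.zeros(1966, dtype=int), np.ones(605, dtype=int)))
X_demo = np.arange(y_demo.size).reshape(-1, 1)

X_u, y_u = undersample(X_demo, y_demo, label=0, n_keep=800, rng=rng)
X_o, y_o = oversample_minority(X_u, y_u, rng)
print(np.bincount(y_u), np.bincount(y_o))
```

Because the oversampled muons are exact copies, the balanced set contains repeated images; this is why the resampling is applied after the split and only to the training data.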


In [17]:
X_train = X_over.reshape((X_over.shape[0], X_train.shape[1], X_train.shape[2]))
print(X_train.shape)

(1600, 296, 286)
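The samplers work on 2D input of shape `(n_samples, n_features)`, which is why each image is flattened before resampling and reshaped afterwards. A minimal check on a small synthetic array (sizes chosen only for illustration) shows that this round trip preserves the pixel values:

```python
import numpy as np

# Four synthetic 3x5 "images"
images = np.arange(4 * 3 * 5, dtype=float).reshape(4, 3, 5)

# Flatten each image into one row: (n_samples, height * width)
flat = images.reshape(images.shape[0], -1)

# Restore the image layout; -1 lets NumPy infer the (possibly changed) sample count
restored = flat.reshape(-1, images.shape[1], images.shape[2])

print(flat.shape, restored.shape)
print(np.array_equal(images, restored))
```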


In [18]:
y_train = y_over

We save the training and test data separately:

In [19]:
np.savez_compressed('/gpfs/projects/damic/eVSmu_tr1', data=X_train, labels=y_train)

In [20]:
np.savez_compressed('/gpfs/projects/damic/eVSmu_te1', data_test=X_test, labels_test=y_test)

In a second version, we split the data into training, validation and test sets so that the under/oversampling is applied only to the training set:
* 0.2 for the validation dataset
* approx. 0.16 for the testing dataset (0.2 of the remaining 0.8)
* approx. 0.64 for the training dataset
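The resulting proportions follow from applying `test_size = 0.2` twice: once on the full dataset and once on what is left. The arithmetic can be sketched as below, assuming the held-out count is rounded up, which is consistent with the class counts printed in this notebook; `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_kept, n_held_out), assuming the held-out count is rounded up."""
    n_held_out = math.ceil(test_size * n_samples)
    return n_samples - n_held_out, n_held_out

n_total = 2628 + 801                         # electrons + muons
n_rest, n_val = split_sizes(n_total, 0.2)    # first split: validation set
n_train, n_test = split_sizes(n_rest, 0.2)   # second split: test set

fractions = {
    "train": round(n_train / n_total, 3),
    "val": round(n_val / n_total, 3),
    "test": round(n_test / n_total, 3),
}
print(n_train, n_val, n_test, fractions)
```

The training count obtained this way matches the sum of the two class counts printed for `y_train` before resampling.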

In [10]:
from sklearn.model_selection import train_test_split

In [11]:
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0, test_size = 0.2)

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, random_state=0, test_size = 0.2)

In [13]:
import imblearn
from collections import Counter
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler

Using TensorFlow backend.


In [14]:
print(Counter(y_train))

Counter({0: 1680, 1: 514})


In [15]:
aux_X = X_train.reshape((X_train.shape[0], X_train.shape[1]* X_train.shape[2]))

In [16]:
#undersample the electrons

under = RandomUnderSampler(sampling_strategy={0: 800} )

X_under, y_under = under.fit_resample(aux_X, y_train)

print(Counter(y_under))

Counter({0: 800, 1: 514})


In [17]:
oversample = RandomOverSampler(sampling_strategy='minority')

X_over, y_over = oversample.fit_resample(X_under, y_under)

print(Counter(y_over))

Counter({0: 800, 1: 800})


In [18]:
X_train = X_over.reshape((X_over.shape[0], X_train.shape[1], X_train.shape[2]))
print(X_train.shape)

(1600, 296, 286)


In [19]:
y_train = y_over

We save train, validation and test data separately:

In [24]:
np.savez_compressed('/gpfs/projects/damic/eVSmu_tr2', data=X_train, labels=y_train)

In [25]:
np.savez_compressed('/gpfs/projects/damic/eVSmu_te2', data_test=X_test, labels_test=y_test)

In [26]:
np.savez_compressed('/gpfs/projects/damic/eVSmu_va2', data_v=X_val, labels_v=y_val)
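Each archive stores its arrays under the keyword names passed to `np.savez_compressed`, and those names differ between the files (`data`/`labels`, `data_test`/`labels_test`, `data_v`/`labels_v`). A small round trip in a temporary directory, with synthetic arrays and a made-up file name, shows how such a file can be read back:

```python
import os
import tempfile
import numpy as np

# Small synthetic arrays standing in for X_test and y_test
X_small = np.zeros((6, 4, 3))
y_small = np.array([0, 0, 0, 1, 1, 1])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_te")  # savez_compressed appends ".npz"
    np.savez_compressed(path, data_test=X_small, labels_test=y_small)

    # The archive is read lazily, so extract the arrays before it is closed
    with np.load(path + ".npz") as archive:
        keys = sorted(archive.files)
        X_loaded = archive["data_test"]
        y_loaded = archive["labels_test"]

print(keys, X_loaded.shape, y_loaded.shape)
```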