### Preparing input data for a model to differentiate between noise and electrons

In this notebook, we prepare the input data that will be fed to a convolutional neural network to classify a signal as electron or noise. First, we collect all the electron files from the `cropped_images` directory and split their contents into two arrays: one for the energy and one for the noise.

In [1]:
import glob
import os, shutil
import numpy as np

In [2]:
original_dataset_dir = '/gpfs/projects/damic/cropped_images/'

# All electron files (names ending in "11.npz"); list the directory only once
electrons_energy = glob.glob1(original_dataset_dir, "*11.npz")
elements = len(electrons_energy)

# Load the 'energy' array from every electron file
all_electron_energy = [np.load(os.path.join(original_dataset_dir, electrons_energy[_]))['energy'] for _ in range(elements)]

In [3]:
type(all_electron_energy)

list

Since the loaded arrays are collected in a Python list, we convert the list into a single NumPy array:

In [4]:
all_electron_energy_ = np.dstack(all_electron_energy)

all_electron_energy_ = np.rollaxis(all_electron_energy_,-1)

In [6]:
all_electron_energy_.shape, type(all_electron_energy_)

((2628, 201, 147), numpy.ndarray)
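The `np.dstack` plus `np.rollaxis` combination stacks the images along a new last axis and then moves that axis to the front. As a minimal sketch on synthetic data (the small 4x3 shapes and the variable names are assumptions chosen for illustration), `np.stack(..., axis=0)` should give the same result in one call:

```python
import numpy as np

# Synthetic stand-ins for the loaded images (shapes chosen only for illustration).
rng = np.random.default_rng(0)
images = [rng.normal(size=(4, 3)) for _ in range(5)]

# Approach used in the notebook: stack along a new last axis, then roll it to the front.
via_dstack = np.rollaxis(np.dstack(images), -1)

# Alternative: stack directly along a new first axis.
via_stack = np.stack(images, axis=0)

print(via_dstack.shape, via_stack.shape)
print(np.array_equal(via_dstack, via_stack))
```

Either form yields an array indexed as (instance, row, column); the single-call version simply avoids the intermediate array.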

Now we have a NumPy array with 2628 instances of 201x147 pixels. We proceed similarly for the noise:

In [7]:
all_electron_noise = [np.load(os.path.join(original_dataset_dir, electrons_energy[_]))['noise'] for _ in range(elements)]

In [8]:
#list into numpy array
all_electron_noise_ = np.dstack(all_electron_noise)
all_electron_noise_ = np.rollaxis(all_electron_noise_, -1)
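The cells above open every file twice, once for `'energy'` and once for `'noise'`. A possible single-pass variant is sketched below. The helper name `load_energy_and_noise` and the synthetic files are hypothetical and only serve the demonstration; the `'energy'` and `'noise'` keys and the `*11.npz` pattern come from the notebook.

```python
import glob
import os
import tempfile

import numpy as np


def load_energy_and_noise(directory, pattern="*11.npz"):
    """Read the 'energy' and 'noise' arrays from each matching file, opening it once."""
    paths = sorted(glob.glob(os.path.join(directory, pattern)))
    energy, noise = [], []
    for path in paths:
        # The context manager closes the archive after both arrays are read.
        with np.load(path) as archive:
            energy.append(archive["energy"])
            noise.append(archive["noise"])
    # Stack along a new first axis: (n_files, rows, cols).
    return np.stack(energy, axis=0), np.stack(noise, axis=0)


# Demonstration on synthetic files written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for i in range(3):
        np.savez(
            os.path.join(tmp, f"img{i}_11.npz"),
            energy=np.full((4, 3), float(i)),
            noise=np.zeros((4, 3)),
        )
    demo_energy, demo_noise = load_energy_and_noise(tmp)

print(demo_energy.shape, demo_noise.shape)
```

Sorting the paths keeps the instance order reproducible between runs, which is not guaranteed by a plain directory listing.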

Once we have all the images as NumPy arrays, we label them: 1 for electron signal and 0 for noise.

In [9]:
labels_electron = np.repeat(1, all_electron_energy_.shape[0])

labels_noise = np.repeat(0, all_electron_noise_.shape[0])

In [10]:
X = np.concatenate((all_electron_energy_, all_electron_noise_), axis=0) #energy followed by noise
y = np.concatenate((labels_electron, labels_noise), axis=0) #energy labels followed by noise labels
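Because `X` and `y` are concatenated in the same order, the label at index `i` should describe the image at index `i`. The sketch below checks that alignment on small synthetic arrays; the `*_demo` names, the shapes and the unequal class counts are assumptions made only to keep the example readable.

```python
import numpy as np

# Synthetic stand-ins: 3 "electron" images (all ones) and 2 "noise" images (all zeros).
energy_demo = np.ones((3, 4, 3))
noise_demo = np.zeros((2, 4, 3))

labels_energy_demo = np.ones(energy_demo.shape[0], dtype=np.int64)
labels_noise_demo = np.zeros(noise_demo.shape[0], dtype=np.int64)

# Same order for data and labels: energy first, then noise.
X_demo = np.concatenate((energy_demo, noise_demo), axis=0)
y_demo = np.concatenate((labels_energy_demo, labels_noise_demo), axis=0)

# One label per image, and each label matches the content of its image.
assert X_demo.shape[0] == y_demo.shape[0]
aligned = all(X_demo[i].mean() == y_demo[i] for i in range(len(y_demo)))
print(X_demo.shape, y_demo.shape, aligned)
```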

In [11]:
print("%f gb of data+labels" % ((X.size * X.itemsize + y.size * y.itemsize) *10**(-9)))

1.242434 gb of data+labels
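The printed size can be reproduced by hand. Assuming 8-byte elements for both the images and the labels (`float64` and `int64`; the notebook does not state the dtypes, but this assumption matches the printed total), the arithmetic can be sketched as:

```python
import numpy as np

# 2628 energy images plus 2628 noise images, each 201x147 pixels.
n_images, height, width = 2 * 2628, 201, 147

# Assumed dtypes: float64 for the images, int64 for the labels.
data_bytes = n_images * height * width * np.dtype(np.float64).itemsize
label_bytes = n_images * np.dtype(np.int64).itemsize
total_bytes = data_bytes + label_bytes

print(f"{total_bytes / 1e9:.6f} GB")

# For any array, size * itemsize is the same number as the nbytes attribute.
small = np.zeros((2, 3))
print(small.size * small.itemsize == small.nbytes)
```

Under the same assumption, casting the images to `float32` would roughly halve the footprint, at the cost of precision.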


As we cannot run the model on Altamira, we save `X` and `y` to a compressed `.npz` file:

In [21]:
np.savez_compressed('/gpfs/projects/damic/eVSn', data=X, labels=y)

Let's see if it has been saved correctly:

In [22]:
loaded = np.load('/gpfs/projects/damic/eVSn.npz')

In [24]:
loaded['data'][0] == X[0]

array([[ True,  True,  True, ...,  True,  True,  True],
       [ True,  True,  True, ...,  True,  True,  True],
       [ True,  True,  True, ...,  True,  True,  True],
       ...,
       [ True,  True,  True, ...,  True,  True,  True],
       [ True,  True,  True, ...,  True,  True,  True],
       [ True,  True,  True, ...,  True,  True,  True]])

In [25]:
loaded['labels'][0] == y[0]
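The cells above inspect only the first image and the first label. To compare the complete arrays in one step, `np.array_equal` can be used. The following is a minimal round-trip sketch on synthetic data written to a temporary directory; the file name `demo` and the `*_demo` variables are placeholders, not the notebook's real paths.

```python
import os
import tempfile

import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(6, 4, 3))
y_demo = np.array([1, 1, 1, 0, 0, 0])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo")  # np.savez_compressed appends ".npz"
    np.savez_compressed(path, data=X_demo, labels=y_demo)

    # Reload and compare every element, not just the first entry.
    with np.load(path + ".npz") as loaded_demo:
        data_ok = np.array_equal(loaded_demo["data"], X_demo)
        labels_ok = np.array_equal(loaded_demo["labels"], y_demo)

print(data_ok, labels_ok)
```

A single boolean per array is easier to read than a full matrix of `True` values and does not hide a mismatch behind the `...` of a truncated printout.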

Once the `npz` file is saved, we download it to a local machine and run this code:

In [None]:
from sklearn.model_selection import train_test_split
# X and y are the arrays stored under 'data' and 'labels' in the saved npz file
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

print(X_train.shape)

# Add a trailing channel axis of size 1
X_train = X_train.reshape((X_train.shape[0], X_train.shape[1], X_train.shape[2], 1))

print(X_train.shape)
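The reshape above adds a trailing axis of length 1 to `X_train`, so each image becomes (rows, columns, 1). The test set would presumably need the same treatment before being passed to the network. A minimal sketch on synthetic arrays (the `*_demo` names and shapes are assumptions) showing two equivalent ways to add that axis without spelling out every dimension:

```python
import numpy as np

# Synthetic stand-ins for the train and test splits.
X_train_demo = np.zeros((6, 4, 3))
X_test_demo = np.zeros((2, 4, 3))

# Indexing with np.newaxis appends a new axis of length 1 ...
X_train_demo = X_train_demo[..., np.newaxis]

# ... and np.expand_dims does the same with an explicit axis argument.
X_test_demo = np.expand_dims(X_test_demo, axis=-1)

print(X_train_demo.shape, X_test_demo.shape)
```

Both forms leave the pixel values untouched; only the shape changes.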