# LoLa

Written by Vince Ling

Paper link: https://arxiv.org/pdf/1707.08966.pdf. Read the paper before going forward. It is important that you understand the math and the structure of the LoLa and CoLa layers before starting the training process.

Note that this tutorial covers a two-class classification task: top (signal) versus QCD (background).

## I. Data Preprocessing

### 1. Understanding data

Dataset link: https://zenodo.org/record/2603256#.X7VrlGhKiUm There are 3 files, containing in total 1.2M training events, 400k validation events and 400k test events. We will work with the training dataset, which is the 1 GB file.

In [None]:
import pandas as pd
input_filename = "data/train.h5"
store = pd.read_hdf(input_filename, 'table')

Now check the data structure.

In [None]:
store.info

We can see there are 1211000 rows x 806 columns. Each row is one jet, so there are about 1.2 million jets. The 806 columns are `E0, PX0, PY0, PZ0` through `E199, PX199, PY199, PZ199` (800 columns), plus `truthE, truthPX, truthPY, truthPZ, ttv, is_signal_new`.

This means each jet has at most 200 constituent four-momenta and 1 truth four-momentum. The `ttv` column tells us which kind of dataset the jet belongs to, and `is_signal_new` tells us whether it is signal (top) or background (QCD): 1 for signal, 0 for background.
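
As a quick sanity check on the column count, the expected layout can be rebuilt from the description above. This is a minimal sketch: the column names and their order are assumptions based on the text, so compare the result against `store.columns` rather than relying on it.

```python
# Rebuild the expected column names (naming assumed from the text above).
n_slots = 200
constituent_cols = [
    f"{name}{i}" for i in range(n_slots) for name in ("E", "PX", "PY", "PZ")
]
truth_cols = ["truthE", "truthPX", "truthPY", "truthPZ"]
extra_cols = ["ttv", "is_signal_new"]
expected_cols = constituent_cols + truth_cols + extra_cols

print(len(constituent_cols))  # 800
print(len(expected_cols))     # 806
```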

### 2. Data Processing

Since the input to LoLa is an array of four-momenta, we need to reshape the flat columns of each jet into that form.

In [None]:
import keras
import numpy as np

In [None]:
# First, split the training dataset into signal and background.
signal = store[store['is_signal_new']==1]
background = store[store['is_signal_new']==0]
print(signal.shape)
print(background.shape)
# We can see there are about 600k of each.

In [None]:
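def loadmomenta(dataset, nConstituents=40):
    """Return shuffled four-momenta of shape (n_jets, 4, nConstituents) and their labels."""
    # Keep only the first nConstituents constituents (4 columns each).
    momenta = dataset.values[:, :nConstituents*4]
    # Group the flat columns into (n_jets, nConstituents, 4) ...
    momenta = np.reshape(momenta, (len(momenta), nConstituents, 4))
    # ... then move the four components ahead of the constituents.
    momenta = np.transpose(momenta, [0, 2, 1])
    # The last column is is_signal_new.
    labels = dataset.values[:, -1]
    # Shuffle jets and labels with the same permutation.
    indices = np.random.permutation(len(labels))
    return momenta[indices], labels[indices]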
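
To see what the reshape and transpose in `loadmomenta` do, the same two steps can be run on a tiny made-up array. This is only an illustrative sketch with 2 fake jets of 3 constituents each; the values are arbitrary and not taken from the dataset.

```python
import numpy as np

n_jets, n_constituents = 2, 3
# Each row is one jet, flattened as E0, PX0, PY0, PZ0, E1, PX1, ... (toy values).
flat = np.arange(n_jets * n_constituents * 4, dtype=float).reshape(n_jets, -1)

# Group every 4 consecutive values into one four-momentum: (jets, constituents, 4).
per_constituent = flat.reshape(n_jets, n_constituents, 4)
# Swap the last two axes so the components come first: (jets, 4, constituents).
momenta_toy = np.transpose(per_constituent, [0, 2, 1])

print(momenta_toy.shape)   # (2, 4, 3)
print(momenta_toy[0, 0])   # energies of the first jet's constituents: [0. 4. 8.]
```

With `nConstituents=40` the same steps give an array of shape `(n_jets, 4, 40)`.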

In [None]:
signal_momenta, signal_labels = loadmomenta(signal)
background_momenta, background_labels = loadmomenta(background)

In [None]:
momenta = np.append(background_momenta, signal_momenta, axis=0)
labels = keras.utils.to_categorical(np.append(background_labels, signal_labels), 2)

### Exercise

Load the test and validation datasets for later use.

In [None]:
# TODO


## II. Model Construction

### 1. Model

The CoLa and LoLa layer classes are provided by the `classes` module in the `lib` directory.

In [None]:
import sys
sys.path.insert(0,'lib')
import classes

Now construct the model.

In [None]:
model = classes.LoLaClassifier(nConstituents=40, nAdded=10).model

In [None]:
model.compile(
            # Recent Keras releases use `learning_rate`; `lr` is deprecated.
            optimizer=keras.optimizers.Adam(learning_rate=0.0001),
            loss='binary_crossentropy',
            metrics=['acc'])
model.summary()

## III. Training

In [None]:
history = model.fit(momenta, labels,
        batch_size=1024,
        validation_split=0.25,
        epochs=10,
        shuffle=True,
        callbacks=None)

## IV. Evaluation

In [None]:
import matplotlib.pyplot as plt
def learningCurveLoss(history):
    plt.figure()
    plt.plot(history.history['loss'], linewidth=1)
    plt.plot(history.history['val_loss'], linewidth=1)
    plt.title('Model Loss over Epochs')
    plt.legend(['training sample loss','validation sample loss'])
    plt.xlabel('epoch')
    plt.ylabel('loss')
    plt.show()

In [None]:
learningCurveLoss(history)

Now it is your turn: produce the ROC curve and other evaluations for the LoLa network on the two-class task.

In [None]:
# TODO
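
As a starting point for the ROC curve, the curve can be computed from predicted signal scores with plain NumPy. The sketch below uses synthetic scores and labels rather than the model's output, and the helper name `roc_points` is hypothetical. For the real evaluation you would presumably pass the true labels and the signal score predicted by the model on the test set.

```python
import numpy as np

def roc_points(y_true, scores):
    """Return (false positive rate, true positive rate) arrays over all thresholds."""
    order = np.argsort(-scores)        # sort events by decreasing score
    y_sorted = y_true[order]
    tps = np.cumsum(y_sorted)          # true positives as the threshold is lowered
    fps = np.cumsum(1 - y_sorted)      # false positives as the threshold is lowered
    tpr = np.concatenate([[0.0], tps / tps[-1]])
    fpr = np.concatenate([[0.0], fps / fps[-1]])
    return fpr, tpr

# Synthetic example (made-up data): signal scores are shifted towards 1.
rng = np.random.default_rng(0)
y_true = np.concatenate([np.zeros(500), np.ones(500)])
scores = np.concatenate([rng.normal(0.3, 0.15, 500), rng.normal(0.7, 0.15, 500)])

fpr, tpr = roc_points(y_true, scores)
# Trapezoidal area under the ROC curve.
auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)
print(f"AUC = {auc:.3f}")
```

The `fpr` and `tpr` arrays can be passed directly to `plt.plot` in the same way as the loss curves above.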


In this experiment, we used `train.py` for training, validation and testing. It is up to you whether to download the full dataset from the website to run this code.

## Exercise

This is a two-class tagging task. One of the most important goals of our research team is to convert this kind of problem into a 5-class task using our data, and to compare the performance. Since you have already finished a couple of 5-tagger problems, it is now time for you to modify this code.

Hint: Convert the training dataset to shape (98769, 4, 40) and the labels to shape (98769, 5).
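
The label shape in the hint corresponds to one-hot encoding with five classes. A minimal sketch with made-up integer labels follows; the class numbering 0 to 4 is an assumption, and your data may encode the classes differently.

```python
import numpy as np

n_classes = 5
# Toy integer class labels, one per jet (made-up values).
int_labels = np.array([0, 3, 1, 4, 2, 3])

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(n_classes)[int_labels]

print(one_hot.shape)  # (6, 5)
print(one_hot[1])     # [0. 0. 0. 1. 0.]
```

`keras.utils.to_categorical(int_labels, 5)` produces the same encoding. For five classes, the loss passed to `model.compile` would typically be `'categorical_crossentropy'` rather than `'binary_crossentropy'`.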


In [None]:
# TODO
