# I. Data Preprocessing
## 1. Loading data
There are several common file formats: .h5, .pkl, .json, .txt, and compressed formats such as .z, .gz, .zip, etc. <br>
Generally, we tend to use .h5, because it takes up relatively little space.

Note that here we are using the .z file with five labels.

In [None]:
import pandas as pd
import numpy as np
import h5py
import json

In [None]:
# TODO
path = " "
f = h5py.File(path, 'r')
f.keys()                    # check keys in .h5 file, we need to read it by the key.

In [None]:
# TODO
# extract the data array
darray = f['whatever_key_you_find'][()] 

# we would like to use pandas to manipulate the data.
features = ['f1','f2','f3']         
labels = ['l1','l2']
data_feature = pd.DataFrame(darray, columns=features)
data_label = pd.DataFrame(darray, columns=labels)

f.close()

**OR**

In [None]:
# Once you are familiar with your data, there is a short cut.
path = " "
with h5py.File(path, 'r') as f:
    darray = f['whatever_key_you_find'][()]
    data = pd.DataFrame(darray, columns= columns_you_want)              

For DGCNN, we are going to use 7 features and 5 labels (the labels depend on the task you are doing) as our input.

| Features (7) | Labels (5) |
|:---|:---|
| "j1_etarel" -- delta eta | 'j_t' |
| "j1_phirel" -- delta phi | 'j_q' |
| "log(j1_pt)" -- log pt | 'j_g' |
| "log(j1_e)" -- log E | 'j_w' |
| "log(j1_ptrel)" -- log(pt / ptjet) | 'j_z' |
| "log(j1_erel)" -- log(E / Ejet) | |
| "j1_deltaR" -- delta R | |

j1_etarel: ratio of the eta of each constituent to the eta of the jet<br>
j1_phirel: ratio of the phi of each constituent to the phi of the jet<br>
j1_pt: constituent pt (transverse momentum)<br>
j1_e: constituent energy<br>
j1_ptrel: ratio of the pT of each constituent to the pT of the jet<br>
j1_erel: ratio of the energy of each constituent to the energy of the jet<br>
j1_deltaR: $\sqrt{(\Delta\eta)^2 + (\Delta\phi)^2}$ <br><br>
j_g: gluon jet<br>
j_q: quark jet <br>
j_w: W boson jet <br>
j_z: Z boson jet<br>
j_t: Top jet<br>
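
As a quick sanity check of the j1_deltaR definition, the sketch below recomputes it from the two angular offsets. The numbers are made up, and it assumes that `j1_etarel` and `j1_phirel` hold the delta eta and delta phi offsets, as the table above suggests; the column name `deltaR_check` is only for illustration.

```python
import numpy as np
import pandas as pd

# Toy constituents (made-up numbers, not taken from the real dataset).
toy = pd.DataFrame({
    "j1_etarel": [0.3, -0.1, 0.0],
    "j1_phirel": [0.4, 0.2, -0.5],
})

# deltaR = sqrt(delta_eta^2 + delta_phi^2); np.hypot computes exactly this.
toy["deltaR_check"] = np.hypot(toy["j1_etarel"], toy["j1_phirel"])
print(toy)
```

On the real data you could compare such a column against the stored `j1_deltaR` to confirm that you read the definitions correctly.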


### Exercise:
Read out all the columns and try to understand what they are.


## 2. Feature construction
We cannot get the log values directly from the original file, therefore a little feature construction is needed.

In [None]:
data_feature["log(j1_pt)"] = np.log(data_feature['j1_pt'])
data_feature["log(j1_e)"] = np.log(data_feature["j1_e"])
data_feature["log(j1_ptrel)"] = np.log(data_feature['j1_ptrel'])
data_feature["log(j1_erel)"] = np.log(data_feature['j1_erel'])

data_feature.drop(['j1_pt','j1_e','j1_ptrel','j1_erel'],axis=1,inplace=True)
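
One thing to watch for in the cell above: `np.log` returns `-inf` for zeros and `nan` for negative values, and either one can quietly break training. Below is a minimal sketch of one way to guard against this, run on made-up values. The helper name `safe_log` and the choice to map non-positive entries to 0 are assumptions for illustration, not part of the original workflow.

```python
import numpy as np
import pandas as pd

def safe_log(series):
    """Hypothetical helper: log of the positive entries, 0.0 everywhere else."""
    values = series.to_numpy(dtype=float)
    out = np.zeros_like(values)
    positive = values > 0
    out[positive] = np.log(values[positive])
    return pd.Series(out, index=series.index)

toy_pt = pd.Series([1.0, np.e, 0.0])   # made-up pt values, one of them zero
logged = safe_log(toy_pt)

print(logged.tolist())
print(np.isfinite(logged).all())       # check that no inf/nan slipped through
```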

Now let's combine the features and labels so that we can send them to the model.

In [None]:
data_all = pd.concat([data_feature,data_label],axis=1)

## 3. Downsizing jets
In the data we got, the number of constituents in each jet varies, ranging from 20 to 200. The machine learning model, however, needs a fixed input size, so we have to specify the number of constituents per jet manually. If we set nConstituents = 40, all jets with fewer than 40 constituents are zero-padded, and jets with more than 40 are truncated to their first 40 constituents.

### 1) How do we identify jets
In the data I have worked with, there are two forms: particle-based and jet-based. <br>

For the particle-based data, there should be a feature that helps identify the jet. For example, "j_index" gives the unique index of the jet each particle belongs to. Get it <a href="https://drive.google.com/file/d/1DCpxWbWtqU4sQwmGbZTg-4cdGAWonDKy/view?usp=sharing">here</a>.<br>

For the jet-based data, each row represents a jet, and you can get a specific number of constituents by conditional slicing. Get it <a href="https://zenodo.org/record/2603256#.X62WkFqSmbh">here</a>.<br>




### 2) N-Constituents


In [None]:
labels = labels+['j_index']
data_label = pd.DataFrame(darray, columns=labels)
data_all = pd.concat([data_feature, data_label],axis=1)

In [None]:
from tqdm import tqdm
def data_transform (nConstituents, data_all):
    kColumns = data_all.columns.shape[0]

    # we expect the output shape (mJets, nConstituents, kColumns)
    jet_list = list(set(data_all['j_index']))
    data_expected = []

    for jet in tqdm(jet_list):
        # Zero padding for jets with fewer than nConstituents constituents:
        # we create an all-zero array and copy the real constituents in.
        jet_frame = np.zeros((nConstituents, kColumns))
        jet_temp = data_all[data_all['j_index']==jet].values
        if (jet_temp.shape[0]<nConstituents):
            for i, constituent in enumerate(jet_temp):
                jet_frame[i] = constituent
        else:
            jet_frame += jet_temp[:nConstituents]
        data_expected.append(jet_frame)

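    # "j_index" (the last column) is useless for the machine learning part. Drop it!
    # Convert the list to an array first: a plain list cannot be sliced with [:,:,:-1].
    return np.array(data_expected)[:,:,:-1]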

In [None]:
data = data_transform(40, data_all)

This is neither the only solution nor the fastest function to accomplish the goal, so feel free to develop a better one. If you find a better method, please share it with your colleagues, because we are going to use this step for almost all the models.
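
As one possible alternative, here is a sketch that uses `groupby` instead of filtering the full frame once per jet. The function name `data_transform_groupby` and the toy frame are assumptions for illustration. It expects `j_index` to be a column of `data_all`, drops that column by name rather than by position, and keeps the jets in order of first appearance.

```python
import numpy as np
import pandas as pd

def data_transform_groupby(nConstituents, data_all):
    # Columns kept in the output: everything except the jet identifier.
    value_cols = [c for c in data_all.columns if c != "j_index"]
    groups = data_all.groupby("j_index", sort=False)
    out = np.zeros((groups.ngroups, nConstituents, len(value_cols)))
    for m, (_, jet) in enumerate(groups):
        rows = jet[value_cols].to_numpy()[:nConstituents]  # truncate long jets
        out[m, :rows.shape[0]] = rows                      # short jets stay zero-padded
    return out

# Toy data: jet 0 has 3 constituents, jet 1 has only 1.
toy = pd.DataFrame({
    "f1": [1.0, 2.0, 3.0, 4.0],
    "j_g": [1, 1, 1, 0],
    "j_index": [0, 0, 0, 1],
})
toy_out = data_transform_groupby(2, toy)
print(toy_out.shape)
```

With `nConstituents = 2`, the first toy jet is truncated to two rows and the second one is padded with a row of zeros.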

### 3) Exercise
Think about how you can get the same data shape from jet-based data.

## 4. Train Test Split
We rely on the sklearn package to accomplish this; it has a built-in function.<br>
Choose a random seed and use it for all your research. Why? To keep the input consistent every time you run.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
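
The call above assumes that `X` and `y` already exist. Here is a sketch of one way to build them from the `(mJets, nConstituents, kColumns)` array returned by `data_transform`. It assumes that the 7 feature columns come first and the 5 label columns last, and that every real constituent of a jet carries the same label. A random stand-in array (`toy_data`) is used so that the snippet runs on its own.

```python
import numpy as np

n_features, n_labels = 7, 5
rng = np.random.default_rng(42)

# Stand-in for the output of data_transform: 10 jets, 40 constituents each.
toy_data = rng.normal(size=(10, 40, n_features + n_labels))

X = toy_data[:, :, :n_features]   # (mJets, nConstituents, 7): the model input
y = toy_data[:, 0, n_features:]   # (mJets, 5): label read from the first constituent

print(X.shape, y.shape)
```

The label is read from the first constituent because that row is never zero-padding: every jet has at least one constituent.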

### Exercise
Apply the code above to your data. For further explanation of the parameters, look them up in the scikit-learn documentation!

In [None]:
# You can check the array shape in this cell


# II. Creating a Model

In [None]:
import sys
sys.path.insert(0,'lib')
import classes
import tensorflow.keras as keras

Load the model from the classes.py file, and change the parameters according to our reshaped data structure.

In [None]:
model = classes.EdgeConvClassifier((40, 7)).model

Now compile the model and set the learning rate. You can also change other settings such as the optimizer and the loss function. Print the model structure to check that each layer has the right parameters.

In [None]:
model.compile(
            optimizer=keras.optimizers.Adam(learning_rate = 0.0001),  # "lr" is the deprecated name of this argument
            loss='categorical_crossentropy', 
            metrics=['acc'])

In [None]:
model.summary()   # summary() prints the table itself and returns None

## Training the model

Train the model using the model.fit() function, and set the validation_split value, the number of epochs, and the batch_size. batch_size is the number of samples processed per gradient update; large datasets are fed in batches because they cannot all fit into memory at once. An epoch is one pass through the entire training set; with shuffle=True, the training data is reshuffled before each epoch. validation_split is the fraction of the training data held out as a validation set during training; Keras takes it from the end of the arrays, before any shuffling.
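
To make these numbers concrete, here is a small arithmetic sketch of how many samples are used for fitting and validation, and how many batches one epoch contains. The sample count is made up; the split rule (hold out the last fraction of the samples) follows the description above.

```python
import math

n_samples = 100_000        # made-up size of X_train
validation_split = 0.25
batch_size = 1024

# The last 25% of the samples are held out for validation.
n_fit = math.floor(n_samples * (1 - validation_split))
n_val = n_samples - n_fit

# Each epoch walks through the fitting samples once, batch by batch.
steps_per_epoch = math.ceil(n_fit / batch_size)

print(n_fit, n_val, steps_per_epoch)   # 75000 25000 74
```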

In [None]:
history = model.fit(X_train, y_train,
        batch_size=1024,
        validation_split=0.25,
        epochs=10, 
        shuffle = True, 
        # use_multiprocessing and workers are omitted: Keras only uses them for generator / Sequence input
        callbacks = None)

After training the model, you can save the result to your local directory using the model.save() function.

In [None]:
model.save('DGCNN.h5')

## Evaluation

Now to validate the result, plot the learning curve: loss on the training set versus the loss on the validation set.

In [None]:
import matplotlib.pyplot as plt   # needed for all the plots below

def learningCurveLoss(history):
    plt.figure()
    plt.plot(history.history['loss'], linewidth=1)
    plt.plot(history.history['val_loss'], linewidth=1)
    plt.title('Model Loss over Epochs')
    plt.legend(['training sample loss','validation sample loss'])
    plt.xlabel('epoch')
    plt.ylabel('loss')
    plt.show()

In [None]:
learningCurveLoss(history)

Then plot the ROC curve.

In [None]:
if 'j_index' in labels:
    labels = labels[:-1]
from sklearn.metrics import roc_curve, auc
def makeRoc(features_val, labels_val, labels, model, outputSuffix=''):
    labels_pred = model.predict(features_val)
    df = pd.DataFrame()
    fpr = {}
    tpr = {}
    auc1 = {}
    plt.figure()       
    for i, label in enumerate(labels):
        df[label] = labels_val[:,i]
        df[label + '_pred'] = labels_pred[:,i]
        fpr[label], tpr[label], threshold = roc_curve(df[label],df[label+'_pred'])
        auc1[label] = auc(fpr[label], tpr[label])
        plt.plot(fpr[label],tpr[label],label='%s tagger, AUC = %.1f%%'%(label.replace('j_',''),auc1[label]*100.))
    plt.xlabel("Background Efficiency")
    plt.ylabel("Signal Efficiency")
    plt.xlim([-0.05, 1.05])
    plt.ylim(0.001,1.05)
    plt.grid(True)
    plt.legend(loc='lower right')
    plt.title('%s ROC Curve'%(outputSuffix))
    #plt.savefig('%s_ROC_Curve.png'%(outputSuffix))
    return labels_pred

In [None]:
y_pred = makeRoc(X_test, y_test, labels, model, outputSuffix='DGCNN')
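
`makeRoc` returns the predicted class probabilities, so they can be reused for other checks. Here is a sketch of the overall accuracy and a confusion matrix, computed with NumPy only. The arrays `toy_true` and `toy_pred` are made-up stand-ins for `y_test` and `y_pred`, and `confusion` is a hypothetical name.

```python
import numpy as np

# Stand-ins: 4 jets, 3 classes (one-hot truth and predicted probabilities).
toy_true = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]])
toy_pred = np.array([[0.7, 0.2, 0.1],
                     [0.1, 0.8, 0.1],
                     [0.3, 0.4, 0.3],
                     [0.2, 0.5, 0.3]])

true_cls = toy_true.argmax(axis=1)   # index of the true class per jet
pred_cls = toy_pred.argmax(axis=1)   # index of the most probable class per jet
accuracy = (true_cls == pred_cls).mean()

# confusion[i, j] counts jets of true class i that were predicted as class j.
n_classes = toy_true.shape[1]
confusion = np.zeros((n_classes, n_classes), dtype=int)
np.add.at(confusion, (true_cls, pred_cls), 1)

print(accuracy)
print(confusion)
```

Off-diagonal entries show which pairs of jet classes the model confuses, which is a useful starting point for the anomaly hunt in the exercise below.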

## Exercise

After getting a reasonable learning curve and ROC curve, you can start to modify parameters like the learning rate and the number of epochs, or even the hyperparameters inside the model. Find the best result from this DGCNN model with our data, and try to identify any anomalies and explain why they happen.