# Data Preprocessing
## 1. Loading data
Common file formats include .h5, .pkl, .json and .txt, as well as compressed formats such as .z, .gz and .zip. <br>
We generally use .h5 because it takes up relatively little space.

In [None]:
import pandas as pd
import numpy as np
import h5py
import json

In [None]:
path = " "
f = h5py.File(path, 'r')
f.keys()                    # list the keys in the .h5 file; we need one of them to read the data

In [None]:
features = ['f1','f2','f3']         
labels = ['l1','l2']
data = pd.read_hdf(path,key="The key you saw in the last step",columns=features+labels)

For DGCNN, we are going to use 7 features and 5 labels as input (the labels depend on the task you are doing).

| Features (7) | Labels (5) |
|:---|:---|
| "j1_etarel" -- delta eta | 'J_t' |
| "j1_phirel" -- delta phi | 'J_q' |
| "log(j1_pt)" -- log pt | 'J_g' |
| "log(j1_e)" -- log E | 'J_w' |
| "log(j1_ptrel)" -- log(pt / ptjet) | 'J_z' |
| "log(j1_erel)" -- log(E / Ejet) | |
| "j1_deltaR" -- delta R | |

j1_etarel: eta of each constituent relative to the eta of the jet (delta eta in the table above)<br>
j1_phirel: phi of each constituent relative to the phi of the jet (delta phi in the table above)<br>
j1_pt: constituent pT (transverse momentum)<br>
j1_e: constituent energy<br>
j1_ptrel: ratio of the pT of each constituent to the pT of the jet<br>
j1_erel: ratio of the energy of each constituent to the energy of the jet<br>
j1_deltaR: $\sqrt{(\Delta\eta)^2 + (\Delta\phi)^2}$ <br><br>
j_g: gluon jet<br>
j_q: quark jet <br>
j_w: W boson jet <br>
j_z: Z boson jet<br>
j_t: Top jet<br>
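
As a rough illustration of the delta R definition above, here is a minimal sketch on made-up numbers. It assumes that delta eta and delta phi are plain differences between each constituent and the jet axis, and that the phi difference is wrapped into [-pi, pi); the source file may already store these quantities, so the values and variable names below are purely hypothetical.

```python
import numpy as np

# Hypothetical toy values: one jet axis and three constituents.
jet_eta, jet_phi = 0.50, 3.00
const_eta = np.array([0.55, 0.40, 0.50])
const_phi = np.array([3.10, 2.90, -3.10])

# Assumption: "delta" means constituent minus jet axis.
delta_eta = const_eta - jet_eta

# Wrap the phi difference into [-pi, pi) so that angles on either
# side of the +pi / -pi boundary are still treated as close.
delta_phi = (const_phi - jet_phi + np.pi) % (2 * np.pi) - np.pi

# delta R = sqrt(delta_eta^2 + delta_phi^2), one value per constituent.
delta_r = np.sqrt(delta_eta**2 + delta_phi**2)
print(delta_r)
```

The third constituent shows why the wrap matters: its raw phi difference is -6.1, but the two directions are actually only about 0.18 rad apart.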


### Exercise:
Read out all the columns and try to understand what they are.


## 2. Feature construction
The log values are not stored in the original file, so a little feature construction is needed.

In [None]:
data_feature["log(j1_pt)"] = np.log(data_feature['j1_pt'])
data_feature["log(j1_e)"] = np.log(data_feature["j1_e"])
data_feature["log(j1_ptrel)"] = np.log(data_feature['j1_ptrel'])
data_feature["log(j1_erel)"] = np.log(data_feature['j1_erel'])

data_feature.drop(['j1_pt','j1_e','j1_ptrel','j1_erel'],axis=1,inplace=True)

Now let's combine the features and labels so that we can send them to the model.

In [None]:
data_all = pd.concat([data_feature,data_label],axis=1)

## 3. Downsizing jets
In the data we got, the number of constituents in each jet varies, ranging from 20 to 200. The machine learning model, however, needs an input of fixed size, so we have to specify the number of constituents per jet ourselves. If we set nConstituents = 40, every jet with fewer than 40 constituents is zero-padded, and every jet with more than 40 keeps only its first 40 constituents.
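
To make the padding rule concrete, here is a minimal sketch for a single jet stored as a `(constituents, features)` array. The helper name `pad_or_truncate` and the toy arrays are hypothetical, and the sketch assumes that longer jets are cut to their first rows.

```python
import numpy as np

def pad_or_truncate(jet, n_constituents):
    """Return an (n_constituents, n_features) array built from a (k, n_features) jet."""
    out = np.zeros((n_constituents, jet.shape[1]), dtype=jet.dtype)
    k = min(jet.shape[0], n_constituents)  # number of rows we can actually copy
    out[:k] = jet[:k]                      # the remaining rows stay zero
    return out

short_jet = np.ones((3, 2))    # hypothetical jet: 3 constituents, 2 features
long_jet = np.ones((50, 2))    # hypothetical jet: 50 constituents, 2 features

padded = pad_or_truncate(short_jet, 40)     # 3 real rows followed by 37 zero rows
truncated = pad_or_truncate(long_jet, 40)   # only the first 40 rows are kept
print(padded.shape, truncated.shape)
```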

### 1) How do we identify jets
In the data I have worked with, there are two layouts: particle-based and jet-based. <br>

In particle-based data, each row is a constituent, so there should be a column that tells you which jet the row belongs to. For example, "j_index" gives the unique index of a jet. Get it <a href="https://drive.google.com/file/d/1DCpxWbWtqU4sQwmGbZTg-4cdGAWonDKy/view?usp=sharing">here</a>.<br>

In jet-based data, each row represents a jet, and you can get a specific number of constituents by conditional slicing. Get it <a href="https://zenodo.org/record/2603256#.X62WkFqSmbh">here</a>.<br>




### 2) N-Constituents


In [None]:
labels = labels+['j_index']
data_label = pd.DataFrame(darray, columns=labels)
data_all = pd.concat([data_feature, data_label],axis=1)

In [None]:
from tqdm import tqdm
def data_transform(nConstituents, data_all):
    kColumns = data_all.columns.shape[0]

    # We expect the output shape (mJets, nConstituents, kColumns - 1).
    jet_list = list(set(data_all['j_index']))
    data_expected = []

    for jet in tqdm(jet_list):
        # Start from an all-zero frame so that jets with fewer than
        # nConstituents constituents are zero-padded.
        jet_frame = np.zeros((nConstituents, kColumns))
        jet_temp = data_all[data_all['j_index']==jet].values
        # Copy at most nConstituents rows; longer jets are truncated.
        nKeep = min(jet_temp.shape[0], nConstituents)
        jet_frame[:nKeep] = jet_temp[:nKeep]
        data_expected.append(jet_frame)

    # "j_index" (the last column) is useless for the machine learning part. Drop it!
    # Convert the list to an array first: a Python list cannot be sliced with [:,:,:-1].
    return np.array(data_expected)[:,:,:-1]


In [None]:
data = data_transform(40, data_all)

This is neither the only solution nor the fastest way to accomplish the goal, so feel free to develop a better one. If you find a better method, please share it with your colleagues, because we are going to use this step for almost all the models.
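
One possible alternative, sketched here under assumptions rather than offered as the recommended method, avoids the Python loop over jets by letting pandas number the constituents inside each jet. The function name `data_transform_grouped`, its `index_col` parameter and the toy DataFrame are hypothetical. Unlike the loop version, jets come out in order of first appearance, and the index column is dropped by name rather than by position.

```python
import numpy as np
import pandas as pd

def data_transform_grouped(n_constituents, data_all, index_col="j_index"):
    """Pad/truncate every jet to n_constituents rows without looping over jets."""
    value_cols = [c for c in data_all.columns if c != index_col]
    # Integer code (0..m-1) for each jet, in order of first appearance.
    jet_codes, jet_ids = pd.factorize(data_all[index_col])
    # Position of each constituent inside its own jet (0, 1, 2, ...).
    position = data_all.groupby(index_col).cumcount().to_numpy()
    keep = position < n_constituents  # rows beyond the limit are dropped

    # Zero-initialised output, so jets that are too short stay zero-padded.
    out = np.zeros((len(jet_ids), n_constituents, len(value_cols)))
    out[jet_codes[keep], position[keep]] = data_all.loc[keep, value_cols].to_numpy()
    return out

# Hypothetical toy data: jet 7 has three constituents, jet 9 has two.
toy = pd.DataFrame({
    "f1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "j_g": [1, 1, 1, 0, 0],
    "j_index": [7, 7, 7, 9, 9],
})
jets_2 = data_transform_grouped(2, toy)   # jet 7 is truncated to 2 rows
jets_4 = data_transform_grouped(4, toy)   # both jets are zero-padded to 4 rows
print(jets_2.shape, jets_4.shape)
```

It is worth checking on a small sample that the result matches the loop-based function before switching.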

### 3) Exercise
Think about how you can get the same data shape from jet-based data.

## 4. Encoding
Sometimes we need to encode some of the columns. For example, the label column may come with the values {0: Light Jet, 4: Charm Jet, 5: Bottom Jet}. We need to map it to three columns, each of which represents one kind of jet. This is called One-Hot Encoding. <br>
You should not run the following code, because the data we are using does not require encoding.

In [None]:
label_df = pd.get_dummies(df['The_Label_Column'],prefix='label')
# You may want to rename the columns. The keys must match the prefix used above.
label_df.rename(columns={'label_0':"Light_Jet",'label_4':'Charm_Jet','label_5':'Bottom_Jet'},inplace=True)

In case you want to transform it back, multiply each column by the original value it stands for and sum across the columns.

In [None]:
label_prev = np.sum(label_df.values*[0,4,5],axis=1)
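
The round trip can be checked on a few made-up labels. This is a minimal sketch: the Series `flavors` and its values are hypothetical, and the weights must be listed in the same order as the encoded columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy labels using the values from the example above.
flavors = pd.Series([0, 4, 5, 4, 0], name="flavor")

# One column per distinct value; columns come out sorted by value:
# flavor_0, flavor_4, flavor_5.
one_hot = pd.get_dummies(flavors, prefix="flavor").astype(int)

# Original value behind each column, in column order.
weights = np.array([0, 4, 5])

# Each row contains a single 1, so the dot product picks out one weight.
decoded = one_hot.to_numpy() @ weights
print(decoded)
```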

### Exercise
Create a length-200 array of random integers in the range \[0,5\], encode it, and then decode it back.

## 5. Train Test Split
We rely on the sklearn package to do this; it has a built-in function.<br>
Choose a random seed and use it for all your research. Why? To keep the input consistent every time you run.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
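
To see what the fixed seed buys you, here is a minimal sketch on made-up arrays shaped like the output of the downsizing step. The array contents and sizes are hypothetical. Calling `train_test_split` twice with the same `random_state` yields the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 jets, 4 constituents, 3 features,
# with one-hot labels for 2 classes.
X = np.arange(10 * 4 * 3, dtype=float).reshape(10, 4, 3)
y = np.eye(2)[np.arange(10) % 2]

# Same seed twice: both calls pick the same jets for the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(X_train.shape, X_test.shape)    # 8 jets for training, 2 for testing
print(np.array_equal(X_test, X_test2))
```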

### Exercise
Apply the code above to your data. For further explanation of the parameters, Google it!