## CopperHead V2 tutorial

This framework builds on the columnar analysis platform coffea (the coffea 202x Python package), using awkward arrays and Dask distributed for parallelization.

First, we set up our config by specifying the era/year on which we will run our analysis.

# Pre-stage
Before we "run" our analysis, we prepare the list of samples that the analysis will run on. This is done by executing the ```run_prestage.py``` script, specifying the chunk size with the ```--chunksize``` flag and listing the samples to analyze with the ```--input_string``` flag.

The chunksize value is simple: it is an integer giving the number of rows of data in each "chunk" that a worker processes during the parallelized workflow.
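
To make the idea of a chunk concrete, the sketch below splits a number of rows into consecutive ranges of at most `chunksize` rows. It is only an illustration: the helper name `chunk_ranges` and the event count are hypothetical, and the framework's actual chunking is handled internally by coffea and Dask.

```python
def chunk_ranges(n_rows: int, chunksize: int) -> list[tuple[int, int]]:
    """Split n_rows into consecutive [start, stop) ranges of at most chunksize rows."""
    if chunksize <= 0:
        raise ValueError("chunksize must be a positive integer")
    return [
        (start, min(start + chunksize, n_rows))
        for start in range(0, n_rows, chunksize)
    ]


# A hypothetical file with 250,000 events, processed with --chunksize 100000
ranges = chunk_ranges(250_000, 100_000)
print(ranges)  # [(0, 100000), (100000, 200000), (200000, 250000)]
```

Each range would correspond to one unit of work handed to a worker, so a smaller chunksize means more, lighter tasks and a larger one means fewer, heavier tasks.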

Moreover, one can specify the list of data runs, MC background samples and MC signal samples for the analysis to run on by using the --data, --background and --signal flags respectively. If a flag is left empty or given an incompatible value (e.g. data 'A' in year 2017), the script will just skip it and move on.


In [None]:
data_l = ['A', 'B', 'C', 'D']
bkg_l = ['DY', 'TT',]
sig_l = ['ggH', 'VBF']
! python run_prestage.py --chunksize 100000 --year 2018 --cluster True --data {' '.join(data_l)} --background {' '.join(bkg_l)} --signal {' '.join(sig_l)}
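
Outside a notebook, the same call can be assembled as an argument list, for example to pass to `subprocess.run`. The sketch below uses only the flags shown in the cell above; the helper name `build_prestage_command` is hypothetical, and omitting a flag whose list is empty is a choice made for this sketch, not necessarily how the script expects to be called.

```python
import shlex


def build_prestage_command(year, chunksize, data=(), background=(), signal=(), cluster=True):
    """Assemble the run_prestage.py call shown above as an argument list."""
    cmd = [
        "python", "run_prestage.py",
        "--chunksize", str(chunksize),
        "--year", str(year),
        "--cluster", str(cluster),
    ]
    # Only add a sample flag when its list is non-empty (a choice made in this sketch)
    for flag, names in (("--data", data), ("--background", background), ("--signal", signal)):
        if names:
            cmd += [flag, *names]
    return cmd


cmd = build_prestage_command(
    2018, 100000,
    data=["A", "B", "C", "D"],
    background=["DY", "TT"],
    signal=["ggH", "VBF"],
)
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` from the directory that contains `run_prestage.py`.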

If we wish to run our analysis on only a subset of our samples, for example to save time, we can do so by specifying the fraction of the samples to process with the ```--change_fraction``` flag, followed by a floating-point value representing that fraction.

For example, running the cell below would trim the sample list saved in ```./config/fraction_processor_samples.json``` to approximately ten percent of the full list.

In [None]:
! python run_prestage.py --change_fraction 0.1

The code above will take less than a second. It saves a new config file, ```./config/fraction_processor_samples.json```. Please note that the original full config file is not overwritten. This is so that if you would like to change your fraction value, you can do so quickly, instead of waiting a full minute to redo the whole prestage step.
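
As a rough picture of what taking a fraction means, the sketch below keeps approximately the requested fraction of files for each sample. The config layout used here (a dictionary mapping sample names to lists of file paths), the file names, and the helper `trim_samples` are all assumptions made for illustration; the real JSON written by `run_prestage.py` may be organized differently.

```python
import math


def trim_samples(samples: dict[str, list[str]], fraction: float) -> dict[str, list[str]]:
    """Keep roughly `fraction` of the files of each sample (at least one file if any exist)."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    trimmed = {}
    for name, files in samples.items():
        n_keep = min(len(files), max(1, math.ceil(len(files) * fraction)))
        trimmed[name] = files[:n_keep]
    return trimmed


# Hypothetical full sample list; the original dictionary is left untouched
full = {
    "data_A": [f"data_A_{i}.root" for i in range(25)],
    "dy": [f"dy_{i}.root" for i in range(4)],
}
trimmed = trim_samples(full, 0.1)
print({name: len(files) for name, files in trimmed.items()})  # {'data_A': 3, 'dy': 1}
```

Because the file count is rounded up, the kept fraction is only approximately the requested one, which is consistent with the "approximately ten percent" wording above.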

# Running Stage 1

Now we're ready to execute stage 1 of the analysis, which refers to the baseline selections we apply just before categorization into Higgs decay categories. We do this by simply running ```run_stage1.py```, though we recommend also adding the ```-W ignore``` option to suppress warnings. This operation takes the most time, ranging from about 30 minutes for a fraction of around 0.25 to hours for a full sample run. The outputs of ```run_stage1.py``` are saved as a collection of ```.parquet``` files in the directory defined by the ```--save_path``` flag, under subdirectories named after the fraction and the sample name.

For instance, data_A samples with fraction 0.25 and a save_path of ```/depot/cms/users/yun79/results/stage1/test/``` would be saved at ```/depot/cms/users/yun79/results/stage1/test/f0_25/data_A```.
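
The naming convention in this example can be sketched as a small helper. The function name `stage1_output_dir` is hypothetical and the layout simply follows the example path above; the directory tree actually written by `run_stage1.py` may contain further levels, such as the year.

```python
import posixpath


def stage1_output_dir(save_path: str, fraction: float, sample: str) -> str:
    """Build <save_path>/f<fraction with '.' replaced by '_'>/<sample>."""
    fraction_str = str(fraction).replace(".", "_")  # 0.25 -> "0_25"
    return posixpath.join(save_path, f"f{fraction_str}", sample)


out_dir = stage1_output_dir("/depot/cms/users/yun79/results/stage1/test/", 0.25, "data_A")
print(out_dir)  # /depot/cms/users/yun79/results/stage1/test/f0_25/data_A
```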

In [None]:
year = 2018
save_path = "/depot/cms/users/yun79/results/stage1/test/"
! python -W ignore run_stage1.py -y {year} --save_path {save_path}

# Stage 1 Validation
Now we validate our stage 1 outputs by plotting validation histograms. As with the ```run_prestage.py``` script, we specify the plot options via the ```--input_string``` flag, but with a different format, this time consisting mostly of boolean values:


Ratio_{Y or N}/LogY_{Y or N}/ShowLumi_{Y or N}/Status_{work or prelim}

Here, "Y" means yes and "N" means no. The value after ```Ratio_``` sets whether a Data/MC ratio plot is drawn in the bottom panel, the value after ```LogY_``` sets whether the y axis uses a log scale, and the value after ```ShowLumi_``` sets whether the integrated luminosity of the run is shown. The value after ```Status_``` sets the plot status: "work" for "Work in Progress", "prelim" for "Preliminary", and an empty string ("") for no status label at all.

For example, ```Ratio_Y/LogY_Y/ShowLumi_N/Status_work``` draws the Data/MC ratio plot in the bottom panel, plots the y axis in logarithmic scale, hides the integrated luminosity value, and adds the "Work in Progress" label.
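
To illustrate how such an option string decomposes, here is a minimal parsing sketch. The function `parse_plot_options` and the dictionary `STATUS_LABELS` are hypothetical names introduced for this example; the validation script's own parsing code may differ.

```python
# Status keys and labels as described in the text above
STATUS_LABELS = {"work": "Work in Progress", "prelim": "Preliminary", "": ""}


def parse_plot_options(input_string: str) -> dict:
    """Turn 'Ratio_Y/LogY_Y/ShowLumi_N/Status_work' into a dictionary of options."""
    # Split into fields on "/", then split each field into name and value at the first "_"
    raw = dict(field.split("_", 1) for field in input_string.split("/"))
    options = {key: raw[key] == "Y" for key in ("Ratio", "LogY", "ShowLumi")}
    options["Status"] = STATUS_LABELS[raw["Status"]]
    return options


opts = parse_plot_options("Ratio_Y/LogY_Y/ShowLumi_N/Status_work")
print(opts)
# {'Ratio': True, 'LogY': True, 'ShowLumi': False, 'Status': 'Work in Progress'}
```

In this sketch, an empty status (`Status_`) maps to an empty label, matching the "no mention of the status" case.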

Next is the ```--load_path``` flag, which should be identical to the path given to the ```--save_path``` flag when running the ```run_stage1.py``` script.

One can also specify where the validation plots are saved by adding the ```--save_path``` flag to the ```run_stage1_validation.py``` script, or just use the default path ```./validation/figs```.

In [None]:
! python run_stage1_validation.py --fraction 0.001 --input_string "Ratio_Y/LogY_Y/ShowLumi_N/Status_work" --load_path "/depot/cms/users/yun79/results/stage1/test/"

In [1]:
data_l = ['A', 'B', 'C', 'D']
bkg_l = ['DY','TT','ST','VV','EWK']
sig_l = ['ggH', 'VBF']
vars2plot = ['jet', 'mu', 'dimuon', 'dijet'] 
lumi = 137.9
status = "Private_Work"
year = 2018

In [None]:
fraction = 1.0
fraction_str = str(fraction).replace('.', '_')
load_path = f"/depot/cms/users/yun79/results/stage1/test_full3/{year}/f{fraction_str}"
! python validation_plotter_unified.py -y {year} --load_path {load_path} -var {' '.join(vars2plot)} --data {' '.join(data_l)} --background {' '.join(bkg_l)} --signal {' '.join(sig_l)} --lumi {lumi} --status {status} --ROOT_style

# Stage 2
Now we feed the stage1 output into stage2: the categorization of skimmed and selected data into production mode categories. Currently, only the ggH production mode is supported.

Each category processes the stage1 output through its own MVAs.


In [1]:
load_path = "/depot/cms/users/yun79/results/stage1/test_VBF-filter_JECon_07June2024" # path where stage1 output is saved 
save_path = "/work/users/yun79/stage2_output/ggH/test" # path where stage2 output is saved 
category = "ggH"
processes = ["data", "signal"] # signal here is MC signal sample (ie ggh_powheg)
! python run_stage2.py -load {load_path} -save {save_path}

load_path: /depot/cms/users/yun79/results/stage1/test_VBF-filter_JECon_07June2024/2018/f1_0
len(training_features): 20
sum df.h_peak: 5809371.0
scalers: (2, 20)
df_i: [{dimuon_cos_theta_cs: -0.327, dimuon_eta: 2.38, dimuon_phi_cs: ..., ...}, ...]
df_i_feat[:,0]: [-0.327, -0.608, 0.288, 0.0652, 0.706, ..., 0.22, -0.039, 0.739, 0.795, -0.0614]
df_i.dimuon_cos_theta_cs: [-0.327, -0.608, 0.288, 0.0652, 0.706, ..., 0.22, -0.039, 0.739, 0.795, -0.0614]
model: phifixedBDT_2018
prediction: [0.3871868  0.28236988 0.43517634 ... 0.1707716  0.31343392 0.47809392]
scalers: (2, 20)
df_i: [{dimuon_cos_theta_cs: 0.511, dimuon_eta: -1.55, dimuon_phi_cs: 0.48, ...}, ...]
df_i_feat[:,0]: [0.511, -0.241, -0.118, 0.201, 0.847, ..., -0.0385, 0.815, 0.844, -0.0747]
df_i.dimuon_cos_theta_cs: [0.511, -0.241, -0.118, 0.201, 0.847, ..., -0.0385, 0.815, 0.844, -0.0747]
model: phifixedBDT_2018
prediction: [0.4106679  0.4083793  0.55877185 ... 0.3553411  0.37208927 0.2874001 ]
scalers: (2, 20)
df_i: [{dimuon_cos_t