# Introduction

In this exercise we will learn the basic usage of two widely used machine learning frameworks:
 * TMVA
 * sklearn

We use a dataset from the Belle experiment. The Belle experiment was located in Tsukuba, Japan, at the KEKB asymmetric electron-positron collider, which operated at a center-of-mass energy of 10.58 GeV.
The decay D0 -> K- pi+ (pi0 -> gamma gamma) was reconstructed, and simple cuts on the particle identification information and the kinematics were applied to reduce the combinatorial background.
A mass-constrained vertex fit was performed for the pi0, and an unconstrained vertex fit for the D0.

Two datasets are provided:
 * csc_mc.root contains Monte Carlo simulated events
 * csc_data.root contains detector data

In [1]:
! rm -f csc_mc.root csc_data.root
! wget http://ekpwww.ekp.kit.edu/~tkeck/csc_mc.root http://ekpwww.ekp.kit.edu/~tkeck/csc_data.root

/eos/user/c/csc01
--2017-07-27 18:13:10--  http://ekpwww.ekp.kit.edu/~tkeck/csc_mc.root
Resolving ekpwww.ekp.kit.edu... 129.13.101.178
Connecting to ekpwww.ekp.kit.edu|129.13.101.178|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 32880821 (31M)
Saving to: “csc_mc.root”


2017-07-27 18:13:12 (23.9 MB/s) - “csc_mc.root” saved [32880821/32880821]

--2017-07-27 18:13:12--  http://ekpwww.ekp.kit.edu/~tkeck/csc_data.root
Reusing existing connection to ekpwww.ekp.kit.edu:80.
HTTP request sent, awaiting response... 200 OK
Length: 28276441 (27M)
Saving to: “csc_data.root”


2017-07-27 18:13:13 (27.5 MB/s) - “csc_data.root” saved [28276441/28276441]

FINISHED --2017-07-27 18:13:13--
Downloaded: 2 files, 58M in 2.3s (25.4 MB/s)


# TMVA

TMVA is the multivariate analysis framework of the ROOT library.
It includes many algorithms used in high-energy physics (HEP), such as boosted decision trees and neural networks.

In this exercise we only cover the basics.
For more advanced examples see:
https://swan.web.cern.ch/content/machine-learning

TMVA is built around two main classes:
 * TMVA::Factory books, trains, tests, and evaluates the algorithms
 * TMVA::DataLoader declares the input variables and holds the training and test data

In [1]:
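// Load the TMVA library.
TMVA::Tools::Instance();

// Input: simulated events. Output: file that receives the training and evaluation results.
auto inputFile = TFile::Open("csc_mc.root");
auto outputFile = TFile::Open("output.root", "RECREATE");

// The factory books, trains, tests, and evaluates the methods.
TMVA::Factory factory("TMVAClassification", outputFile,
                      "!V:ROC:!Correlations:!Silent:Color:!DrawProgressBar:AnalysisType=Classification" );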

In [2]:
TMVA::DataLoader loader("dataset");
loader.AddVariable("p");
loader.AddVariable("pt");
loader.AddVariable("pz");
loader.AddVariable("phi");
loader.AddVariable("chiProb");
loader.AddVariable("dr");
loader.AddVariable("dz");
loader.AddVariable("dphi");
loader.AddVariable("Kid0");
loader.AddVariable("Kid1");
loader.AddVariable("chiProb0");
loader.AddVariable("chiProb1");
loader.AddVariable("dr0");
loader.AddVariable("dr1");
loader.AddVariable("dz0");
loader.AddVariable("dz1");
loader.AddVariable("E0");
loader.AddVariable("E1");
loader.AddVariable("width0");
loader.AddVariable("width1");
loader.AddVariable("highestE0");
loader.AddVariable("highestE1");
loader.AddVariable("hits0");
loader.AddVariable("hits1");
loader.AddVariable("ratio0");
loader.AddVariable("ratio1");
loader.AddVariable("distance0");
loader.AddVariable("distance1");
loader.AddVariable("chiProb2");

In [3]:
TTree *tree = nullptr;
inputFile->GetObject("tree", tree);
std::cout << "Signal " << tree->GetEntries("isSignal == 1") << std::endl;
std::cout << "Background " << tree->GetEntries("isSignal != 1") << std::endl;
std::cout << "Total " << tree->GetEntries() << std::endl;

Signal 6267
Background 210436
Total 216703


In [4]:
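// Selection cuts that define the signal and background classes.
TCut cuts = "isSignal == 1";
TCut cutb = "isSignal != 1";
loader.SetInputTrees(tree, cuts, cutb);

// Empty preselection cut: about half of each class is used for training, the rest for testing.
TCut cut;
loader.PrepareTrainingAndTestTree(cut, "nTrain_Signal=3130:nTrain_Background=105218:SplitMode=Random:NormMode=NumEvents:!V" );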

DataSetInfo              : [dataset] : Added class "Signal"
                         : Add Tree tree of type Signal with 216703 events
DataSetInfo              : [dataset] : Added class "Background"
                         : Add Tree tree of type Background with 216703 events
                         : Dataset[dataset] : Class index : 0  name : Signal
                         : Dataset[dataset] : Class index : 1  name : Background


In [5]:
factory.BookMethod(&loader,TMVA::Types::kBDT, "BDT",
                   "!V:NTrees=100:MinNodeSize=2.5%:MaxDepth=3:BoostType=AdaBoost:UseBaggedBoost:BaggedSampleFraction=0.5:SeparationType=GiniIndex:nCuts=128" );

Factory                  : Booking method: BDT
                         : 
DataSetFactory           : [dataset] : Number of events in input trees
                         : 
                         : 
                         : Number of training and testing events
                         : ---------------------------------------------------------------------------
                         : Signal     -- training events            : 3130
                         : Signal     -- testing events             : 3137
                         : Signal     -- training and testing events: 6267
                         : Background -- training events            : 105218
                         : Background -- testing events             : 105218
                         : Background -- training and testing events: 210436
                         : 
DataSetInfo              : Correlation matrix (Signal):
                         : -------------------------------------------------------

In [6]:
factory.TrainAllMethods();

Factory                  : Train all methods
Factory                  : [dataset] : Create Transformation "I" with events from all classes.
                         : 
                         : Transformation, Variable selection : 
                         : Input : variable 'p' <---> Output : variable 'p'
                         : Input : variable 'pt' <---> Output : variable 'pt'
                         : Input : variable 'pz' <---> Output : variable 'pz'
                         : Input : variable 'phi' <---> Output : variable 'phi'
                         : Input : variable 'chiProb' <---> Output : variable 'chiProb'
                         : Input : variable 'dr' <---> Output : variable 'dr'
                         : Input : variable 'dz' <---> Output : variable 'dz'
                         : Input : variable 'dphi' <---> Output : variable 'dphi'
                         : Input : variable 'Kid0' <---> Output : variable 'Kid0'
                         : Input : vari

In [7]:
factory.TestAllMethods();
factory.EvaluateAllMethods();

Factory                  : Test all methods
Factory                  : Test method: BDT for Classification performance
                         : 
BDT                      : [dataset] : Evaluation of BDT on testing sample (108355 events)
                         : Elapsed time for evaluation of 108355 events: 0.933 sec
Factory                  : Evaluate all methods
Factory                  : Evaluate classifier: BDT
                         : 
BDT                      : [dataset] : Loop over test events and fill histograms with classifier response...
                         : 
TFHandler_BDT            :  Variable         Mean         RMS   [        Min         Max ]
                         : ----------------------------------------------------------------
                         :         p:     1.5787    0.84332   [   0.027463     6.8006 ]
                         :        pt:     1.0092    0.66296   [  0.0035651     4.6610 ]
                     

In [8]:
%jsroot on
auto c1 = factory.GetROCCurve(&loader);
c1->Draw();

## Exercise 1

### 1.1

 * What is the area under the receiver operating characteristic (ROC) curve of the BDT?
 * Is the BDT overtrained?
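
TMVA reports the ROC integral in its evaluation output, but it can help to see what the number means. The following is a minimal sketch, not TMVA's implementation: it computes the area under the ROC curve from classifier scores and truth labels using the rank-sum (Mann-Whitney) identity. The function name `rocAuc` and the label convention (1 for signal, 0 for background) are assumptions made for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper (not part of TMVA): area under the ROC curve.
// AUC = P(score_signal > score_background) + 0.5 * P(tie),
// evaluated with the rank-sum identity. labels[i] is 1 for signal and
// 0 for background. Returns -1.0 if one of the classes is empty.
double rocAuc(const std::vector<double>& scores, const std::vector<int>& labels) {
    assert(scores.size() == labels.size());
    const std::size_t n = scores.size();

    // Sort event indices by ascending classifier score.
    std::vector<std::size_t> order(n);
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&](std::size_t a, std::size_t b) { return scores[a] < scores[b]; });

    double signalRankSum = 0.0;
    std::size_t nSignal = 0;
    std::size_t i = 0;
    while (i < n) {
        // Events [i, j) share the same score and get the same average rank.
        std::size_t j = i;
        while (j < n && scores[order[j]] == scores[order[i]]) ++j;
        const double averageRank = 0.5 * (i + 1 + j);  // 1-based ranks i+1 .. j
        for (std::size_t k = i; k < j; ++k) {
            if (labels[order[k]] == 1) {
                signalRankSum += averageRank;
                ++nSignal;
            }
        }
        i = j;
    }

    const std::size_t nBackground = n - nSignal;
    if (nSignal == 0 || nBackground == 0) return -1.0;
    return (signalRankSum - 0.5 * nSignal * (nSignal + 1.0)) /
           (static_cast<double>(nSignal) * nBackground);
}
```

One common way to check for overtraining is to evaluate such a figure of merit separately on the training and the test sample: a clearly higher value on the training sample suggests that the BDT has learned statistical fluctuations of the training events.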
 
### 1.2

Optimize the hyper-parameters of the BDT.
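
One possible starting point is a simple grid scan. The sketch below only builds the option strings; it reuses the BDT options from the booking cell above and varies `NTrees` and `MaxDepth`. The struct `BdtConfig`, the function `makeBdtGrid`, and the method titles are hypothetical names introduced for this illustration, and the choice of which two options to scan is an assumption.

```cpp
#include <string>
#include <vector>

// Hypothetical container: one method title and one TMVA option string.
struct BdtConfig {
    std::string title;
    std::string options;
};

// Build one BDT configuration per (NTrees, MaxDepth) combination.
// All other options are copied from the BDT booked earlier in this exercise.
std::vector<BdtConfig> makeBdtGrid(const std::vector<int>& nTreesValues,
                                   const std::vector<int>& maxDepthValues) {
    std::vector<BdtConfig> grid;
    for (int nTrees : nTreesValues) {
        for (int maxDepth : maxDepthValues) {
            BdtConfig config;
            // Each booked method needs its own distinct title.
            config.title = "BDT_NTrees" + std::to_string(nTrees) +
                           "_MaxDepth" + std::to_string(maxDepth);
            config.options = "!V:NTrees=" + std::to_string(nTrees) +
                             ":MinNodeSize=2.5%:MaxDepth=" + std::to_string(maxDepth) +
                             ":BoostType=AdaBoost:UseBaggedBoost:BaggedSampleFraction=0.5"
                             ":SeparationType=GiniIndex:nCuts=128";
            grid.push_back(config);
        }
    }
    return grid;
}
```

In the notebook, each entry could then be booked before training, for example with `factory.BookMethod(&loader, TMVA::Types::kBDT, config.title.c_str(), config.options.c_str());`, and the resulting ROC curves compared. Keep in mind that a configuration selected on the test sample gives a slightly optimistic estimate of the performance on that same sample.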