
vhbb#561 #564

Merged
merged 8 commits into vhbb:vhbbHeppy80X on Jan 10, 2017

Conversation

@veelken commented Nov 7, 2016

Dear all,

we would like to propose three modifications to vhbb.py and vhbbobj.py for the ttH, H->tautau analysis. We believe the modifications will not introduce any problems for anyone, except for an increase in file size for some of the samples. Please let us know what you think.

Cheers,

Karl and Christian
for ttH, H->tautau

vhbb.py

In the VHbbAnalyzer config we would like to set the flag passall=True to disable the numJets >= 2 cut.
The motivation for this modification is that in the ttH multilepton and ttH, H->tautau analyses we measure the electron charge misidentification rate in Z->ee events. Because the electron charge misidentification rate is very small, we need the full event statistics.
We would like to add
  JetAna.lepSelCut = lambda lep : (abs(lep.pdgId()) == 11 and lep.relIso03 < 0.4) or (abs(lep.pdgId()) == 13 and lep.relIso04 < 0.4)
so that jets don't get cleaned with respect to leptons that pass only the miniIso but not the standard isolation cuts. If the jet collection is cleaned with respect to leptons passing the miniIso, about 1% of b-jets get cleaned, which we would like to recover.
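To illustrate what this selector feeds into (a schematic sketch only, not the actual Heppy JetAnalyzer code; clean_jets and delta_r are illustrative names): only leptons passing lepSelCut take part in the jet-lepton cross-cleaning, roughly like

  import math

  def delta_r(a, b):
      # deltaR between two objects exposing eta()/phi() accessors
      dphi = abs(a.phi() - b.phi())
      if dphi > math.pi:
          dphi = 2.0 * math.pi - dphi
      return math.hypot(a.eta() - b.eta(), dphi)

  def clean_jets(jets, leptons, lepSelCut, drMax=0.4):
      # drop jets that overlap with any lepton passing lepSelCut
      cleaningLeptons = [lep for lep in leptons if lepSelCut(lep)]
      return [jet for jet in jets
              if all(delta_r(jet, lep) > drMax for lep in cleaningLeptons)]

With the selector above, leptons that pass only the miniIso never enter cleaningLeptons, so the nearby jets are kept.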
vhbbobj.py

We would like to replace the line
  NTupleVariable("eleooEmooP", lambda x : abs(1.0/x.ecalEnergy() - x.eSuperClusterOverP()/x.ecalEnergy()) if abs(x.pdgId())==11 and x.ecalEnergy()>0.0 else 9e9 , help="Electron 1/E - 1/P"),
by
  NTupleVariable("eleooEmooP", lambda x : (1.0/x.ecalEnergy() - x.eSuperClusterOverP()/x.ecalEnergy()) if abs(x.pdgId())==11 and x.ecalEnergy()>0.0 else 9e9 , help="Electron 1/E - 1/P"),
i.e. remove the abs function. In the ttH multilepton and ttH, H->tautau analyses, events with negative 1/E - 1/P values are cut; the abs in the computation of eleooEmooP means we cannot apply that cut, which causes a synchronization problem with other groups. In our opinion it is safe to remove the abs function, as it can always be applied later at analysis level.
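To illustrate that last point (a sketch only, not code in this PR): the old unsigned value is trivially recovered downstream with a single abs, e.g.

  def eleooEmooP_unsigned(signedValue):
      # recover the pre-change |1/E - 1/P| from the signed value stored after this PR
      return abs(signedValue)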

Christian Veelken added 5 commits September 6, 2016 16:54
  JetAna.lepSelCut = lambda lep : (abs(lep.pdgId()) == 11 and lep.relIso03 < 0.4) or (abs(lep.pdgId()) == 13 and lep.relIso04 < 0.4)
for jet cleaning, i.e. don't clean the jet collection with respect to leptons that pass only mini-isolation and not "standard" isolation
  "eleooEmooP", lambda x : abs(1.0/x.ecalEnergy() - x.eSuperClusterOverP()/x.ecalEnergy())
by
  "eleooEmooP", lambda x : 1.0/x.ecalEnergy() - x.eSuperClusterOverP()/x.ecalEnergy()
i.e. keep sign of (E-P)/E for electron ID variable
…ts are kept in Ntuples

(needed for measurement of electron charge misidentification rate and of jet->lepton fake-rate in ttH, H->tautau analysis)
Christian Veelken added 2 commits November 28, 2016 13:46
- resolved merge conflicts
- added triggers for ttH, H->tautau analysis for full 2016 dataset

Conflicts:
	VHbbAnalysis/Heppy/python/TriggerTable.py
	VHbbAnalysis/Heppy/python/TriggerTableData.py
@veelken commented Dec 21, 2016

Hi Andrea,
I merged the trigger changes from Michele with mine. Please merge this PR now.
Thank you,
Christian

@arizzi commented Dec 21, 2016

this PR sets passall=true, which is not ok. If there is a specific class of events you want to save, we have to let those pass explicitly; we cannot set passall=true for space reasons, especially when running on fully hadronic stuff.

@veelken commented Dec 21, 2016

Hi Andrea,

the effect of the passall=true flag is that events with fewer than 2 jets no longer get cut. The nJets >= 2 cut is very loose for fully hadronic events; I expect it mainly removes Z->ll and W->lnu events. Unfortunately, we do need an inclusive sample of Z->ee events for estimating the backgrounds arising from electron charge misidentification in the ttH, H->tautau analysis. I would prefer to keep the event processing simple and not run different configs (with and without the nJets >= 2 cut) on different samples. If disk space is a problem, we can store the VHbb Ntuples in Tallinn if you like (we have enough disk space).

@arizzi commented Dec 21, 2016

you only need Z->ee? Then why have passall=true? Can't we just whitelist Vtype=1?

@veelken commented Dec 21, 2016 via email

@degrutto

Hi Andrea, Christian,

sorry to chime in:
having passall=true is actually a recurrent question/request that I have heard from many people using the vhbb ntuples (and many who never said so on GitHub).

May I ask you, @veelken, whether you know, for example for the QCD backgrounds, how many additional events we would put on tape?

I think this is the only sample that we are afraid of exploding, @arizzi, true? Maybe also the data (MET dataset? BTagCSV?)

 Michele

@arizzi commented Dec 21, 2016

Michele, asking with no motivation is not going to go anywhere. What is the reason others want passall=true?

@jpata commented Dec 21, 2016

Ciao, I tend to agree with @arizzi that we have to be conservative about space, for the following reasons:

  1. At T2_CH_CSCS, where I monitor the space, vhbb ntuples are among the largest individual user datasets, on the scale of tens of TB (and that is a subset of the full vhbb datasets).
  2. Inflating the file sizes with rarely-used events directly affects analyses downstream, where the mostly-useless events have to be filtered out every time, with costs in I/O, CPU, and analysis job reliability.

Possibly for the special cases one can consider making a separate crab run, but we have to keep throwing events away at every stage we can.

@degrutto

the most common comment is that it makes life easier for signal and background cut flow/efficiency studies
(so I guess it is more relevant for signal samples)

btw, it would be interesting to quantify this for QCD

@veelken commented Dec 21, 2016 via email

@degrutto

Hi,
actually it is very easy to check, just by looking at the samples we have for V24.

QCD_HT200to300:

  root -l root://stormgf1.pi.infn.it:1094///store/user/arizzi/VHBBHeppyV24/QCD_HT200to300_TuneCUETP8M1_13TeV-madgraphMLM-pythia8/VHBB_HEPPY_V24_QCD_HT200to300_TuneCUETP8M1_13TeV-madgraphMLM-Py8__spr16MAv2-puspr16_80r2as_2016_MAv2_v0-v1/160909_064004/0000/tree_4.root
  tree->GetEntries()/Count->Integral()
  (const double)8.96610860519146069e-01

but for the lower HT bin, QCD_HT100to200:

  root -l root://stormgf1.pi.infn.it:1094///store/user/arizzi/VHBBHeppyV24/QCD_HT100to200_TuneCUETP8M1_13TeV-madgraphMLM-pythia8/VHBB_HEPPY_V24_QCD_HT100to200_TuneCUETP8M1_13TeV-madgraphMLM-Py8__spr16MAv2-puspr16_80r2as_2016_MAv2_v0-v1/160909_063817/0000/tree_10.root
  root [1] tree->GetEntries()/Count->Integral()
  (const double)8.72890038750593900e-02

so this one will be problematic: only about 9% of the processed events are currently stored, so passall=true would inflate it by roughly a factor of 11.

@veelken commented Dec 21, 2016

Hi Michele,

thank you for the numbers. I didn't know it was that easy to get them!
How shall we proceed now?
Andrea, would it be OK with you to merge PR #561, and we then submit the crab jobs for the QCD MC samples with passall set to false?

Cheers,

Christian

@arizzi commented Dec 21, 2016

no, we are not going to do it.
The DY sample (i.e. a big one) has a reduction factor of 0.13.
As Joosep explained, we do not waste resources just because people don't like to get the ratios out of the histograms. We cannot pay a factor of 5-10x on the ntuple sizes. We currently distribute 2-3 copies of the ntuples, so having space at a given T2 for one version of the ntuples is not the point (we need several sites, plus we need one site storing the whole history of VHbb ntuples for reproducibility, reanalysis for combinations, etc.).
Let me add that our event size is already too big and that, considering many of the variables we compute are jet-based, I see no point in storing events without any jets.

PS: QCD events fail for other reasons too (not just the 2-jet requirement), and passall would let those through as well.

@veelken commented Dec 22, 2016

Hi,

I have set passall=False so that PR #561 can be merged.
Would it be an option to build two versions of vhbbHeppy for the ReReco data and MC, one with passall=True and one with passall=False, so that a few VHbb Ntuples could be produced centrally with passall set to True?
The alternative is of course that the people working on ttH, H->tautau organize the production of samples with passall=True themselves.
What do you think?

Cheers,

Christian

@arizzi commented Dec 22, 2016 via email

@veelken commented Jan 5, 2017

Hi Andrea,

we don't apply a cut on vtype in the ttH, H->tautau analysis. We do apply a cut of 60 < mll < 120 GeV in some of our control regions/auxiliary measurements, but not in all of them. The best option in my opinion would still be to set passall=true based on the sample name.
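Schematically, I mean something like this in vhbb.py (illustrative only; the analyzer handle and the sample-name list below are placeholders, not actual names in the repository):

  # hypothetical: enable passall only for the samples where we need inclusive events
  samplesNeedingPassall = ["DYJetsToLL", "DoubleEG"]   # placeholder sample names
  VHbbAna.passall = any(name in sample.name for name in samplesNeedingPassall)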

I noticed that PR #561 is not merged yet. Can you please merge it now?
(passall is set to false by default now)
What is the status/plan/timescale for the VHbb Ntuple production for the ReReco data and MC?

Cheers,

Christian

@arizzi commented Jan 6, 2017

what does "mll" means for VTypes where there are not two leptons selected?!?

@veelken commented Jan 6, 2017

Hi Andrea,

we compute mll by looping over the selLeptons branch, applying some lepton selection criteria, and then adding the four-vectors of the pairs of leptons that pass the selection.
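Schematically (illustrative names only; passesSelection stands for our lepton selection criteria, and p4() is the usual four-vector accessor):

  import itertools

  def mll_values(selLeptons, passesSelection):
      # invariant mass of every pair of leptons passing the selection
      goodLeptons = [lep for lep in selLeptons if passesSelection(lep)]
      return [(lep1.p4() + lep2.p4()).M()
              for lep1, lep2 in itertools.combinations(goodLeptons, 2)]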

As I mentioned before, I think it is best not to use vtype and mll, but to set passall=true based on the sample name.

Cheers,

Christian

@arizzi commented Jan 6, 2017

well, "it is better" is a relative concept, for sure it is not better for who has to babysit tens of thousands of jobs and add the complication of different settings.
I'm not sure why you do not use vtype to classify Zee events of a control region. I mean VHbb ntuples are based on the vtype to setup cuts and fill variables, so just asking "remove any selection you do because we do not like the vtype" is not helping here. At some point we can decide to have different production campaign if different analysis have different needs. The ttH bb guys are already running their own campaign with additional MEM stuff, so it could be better to prepare a ttHtt.py and ttHtt-data.py config that you run for ttHtt with passall=true and we clean up vhbb.py from what we do not need in H->bb related analysis. We should understand what is the cost of different choices (cpu,diskspace,people time)

@veelken commented Jan 6, 2017

Hi Andrea,

sorry for not being clearer about it: Z->ee is only one of the control regions we need for ttH, H->tautau. We need other control regions to measure tight/loose lepton ratios, and these control regions use events with single leptons (they are dominated by QCD; we don't need QCD MC for these measurements though, only data).
The selection of the control regions used to measure the tight/loose lepton ratios is work in progress, and I would very much prefer to avoid hardcoding the cuts at Ntuple production time, at least for this round of the VHbb Ntuple production.

Cheers,

Christian

@arizzi commented Jan 6, 2017

this doesn't help. On data, passall=false is even more important, because that's where we get most of the reduction.
For the VH channels the ntuple production always assumes that the lepton selection has already been defined/optimized in dedicated sample productions (e.g. with passall=true) or in previous studies. This is needed because we have to avoid sharing events between analyses in order to keep them statistically independent.

@@ -263,6 +263,7 @@
from PhysicsTools.Heppy.analyzers.objects.JetAnalyzer import JetAnalyzer
JetAna = JetAnalyzer.defaultConfig
JetAna.calculateSeparateCorrections = True # CV: needed for ttH prompt lepton MVA
JetAna.lepSelCut = lambda lep : (abs(lep.pdgId()) == 11 and lep.relIso03 < 0.4) or (abs(lep.pdgId()) == 13 and lep.relIso04 < 0.4)

what is this meant for? We already have a selection for the selectedLeptons used here.
The only difference seems to be that you do not OR with the miniIsolation; is that the purpose, to remove the miniIsolation?

@veelken commented Jan 10, 2017

Hi Andrea,

yes, the motivation for this change is to not clean the jets with respect to leptons that pass the miniIsolation but fail the standard isolation. As we studied with Lorenzo, the effect of cleaning the jets with respect to leptons passing miniIsolation vs standard isolation is small, at the level of 1%. The main motivation for restoring the "old" jet cleaning behavior is to avoid differences in synchronization with other groups.

@arizzi arizzi merged commit c731df8 into vhbb:vhbbHeppy80X Jan 10, 2017
@arizzi arizzi added this to the V25 milestone Jan 10, 2017