Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error when calculating TTC and pilot running time in ra.session #25

Open
mturilli opened this issue Dec 8, 2016 · 3 comments
Open

Error when calculating TTC and pilot running time in ra.session #25

mturilli opened this issue Dec 8, 2016 · 3 comments

Comments

@mturilli
Copy link
Contributor

mturilli commented Dec 8, 2016

$ radical-stack 
python            : 2.7.10
virtualenv        : /Users/mturilli/Virtualenvs/RADICAL-ANALYTICS
radical.utils     : v0.44.RC1-11-ge108e90@experiment-aimes
saga-python       : v0.41-72-gb0cf900@experiment-aimes
radical.pilot     : v0.40.1-237-gdefffeb@experiment-aimes
radical.analytics : v0.1-168-ge8da063@experiment-aimes

Data for replication at: https://github.com/radical-experiments/AIMES-Experience/tree/master/OSG/analysis

Errors:

sdir = 'data/exp8/rp.session.radical.mturilli.017128.0000'
sid = glob.glob('%s/*.json' % sdir)[0].split('/')[-1:][0][:-5]
sra_session = ra.Session(sid, 'radical.pilot', src=sdir)
print sra_session.ttc
1479910311.1441

and

sdir = 'data/exp11/rp.session.radical.mturilli.017139.0003'
sid = glob.glob('%s/*.json' % sdir)[0].split('/')[-1:][0][:-5]
sra_session = ra.Session(sid, 'radical.pilot', src=sdir)
sra_session.ttc
1480874035.2191

It does feel like 100 years but I doubt that corresponds to actual time ;)

@mturilli
Copy link
Contributor Author

mturilli commented Dec 9, 2016

As a follow up, same stack:

sdir = 'data/exp9/rp.session.radical.mturilli.017131.0005'
sid = glob.glob('%s/*.json' % sdir)[0].split('/')[-1:][0][:-5]
sra_session = ra.Session(sid, 'radical.pilot', src=sdir)
sra_pilots = sra_session.filter(etype='pilot', inplace=False)
sra_pilots.duration(['PMGR_ACTIVE_PENDING','PMGR_ACTIVE'])
63879.87249994278

This is not outrageous as 100 hours but it is still very much an outlier at 17 hours. Digging a bit deeper, I see that TTC is consistent:

sra_session.ttc
94050.86249995232

And so it is also the total time spent executing units:

sra_units = sra_session.filter(etype='unit', inplace=False)
sra_units.duration(['AGENT_EXECUTING','AGENT_STAGING_OUTPUT_PENDING'])
86431.01699995995

I check the Tq for each pilot:

for pid in sorted(sra_pilots.list('uid')):
    pentity = sra_pilots.get(uid=pid)[0]
    tq = pentity.duration(['PMGR_ACTIVE_PENDING','PMGR_ACTIVE'])
    print '%s; Tq = %s' % (pid, tq)

pilot.0000; Tq = 173.723799944
pilot.0001; Tq = 205.427600145
pilot.0002; Tq = 202.304100037
pilot.0003; Tq = 196.887200117
pilot.0004; Tq = 179.128400087
pilot.0005; Tq = 63641.8777001
pilot.0006; Tq = 183.492399931
pilot.0007; Tq = 267.250999928
pilot.0008; Tq = 197.989400148
pilot.0009; Tq = 185.617599964
pilot.0010; Tq = 213.860100031
pilot.0011; Tq = 169.612400055
pilot.0012; Tq = 171.625499964
pilot.0013; Tq = None
pilot.0014; Tq = 299.252599955
pilot.0015; Tq = 299.252300024
pilot.0016; Tq = 299.252799988
pilot.0017; Tq = 300.16320014
pilot.0018; Tq = 299.25180006
pilot.0019; Tq = 299.25150013
pilot.0020; Tq = 63753.4737
pilot.0021; Tq = 158.164200068
pilot.0022; Tq = 63751.2569001
pilot.0023; Tq = None
pilot.0024; Tq = 689.254899979
pilot.0025; Tq = 247.015500069
pilot.0026; Tq = 246.21449995
pilot.0027; Tq = 63612.6008
pilot.0028; Tq = 63585.7997999
pilot.0029; Tq = 63623.4006
pilot.0030; Tq = 63879.8724999
pilot.0031; Tq = 63664.4848001
pilot.0032; Tq = None
pilot.0033; Tq = None
pilot.0034; Tq = 63639.7081001
pilot.0035; Tq = 63595.4674001
pilot.0036; Tq = 192.760900021
pilot.0037; Tq = 63610.5464001
pilot.0038; Tq = 196.886899948
pilot.0039; Tq = 184.292400122
pilot.0040; Tq = 193.773200035
pilot.0041; Tq = 183.49180007
pilot.0042; Tq = 231.111599922
pilot.0043; Tq = 355.441900015
pilot.0044; Tq = 402.271000147
pilot.0045; Tq = 260.084800005
pilot.0046; Tq = 63613.7738001
pilot.0047; Tq = 63598.7416999
pilot.0048; Tq = 238.745899916
pilot.0049; Tq = 240.898499966
pilot.0050; Tq = 63669.8938
pilot.0051; Tq = 165.616600037
pilot.0052; Tq = 251.485199928
pilot.0053; Tq = 230.129400015
pilot.0054; Tq = 193.772599936
pilot.0055; Tq = 192.76060009
pilot.0056; Tq = 185.617200136
pilot.0057; Tq = 214.893199921
pilot.0058; Tq = 183.493200064
pilot.0059; Tq = 173.723400116
pilot.0060; Tq = 184.292900085
pilot.0061; Tq = 235.41869998
pilot.0062; Tq = 199.007800102
pilot.0063; Tq = 212.860100031

I see several offenders, among which pilot.0030 seems to be the responsible for the 17 hours spike. I check whether any unit has been executing on that pilot:

pu_rels = sra_session.describe('relations', ['pilot', 'unit'])
pu_rels['pilot.0030']
[u'unit.000053', u'unit.000139', u'unit.000231', u'unit.000353', u'unit.000515', u'unit.000589', u'unit.000671', u'unit.000692', u'unit.000740', u'unit.000766', u'unit.000773', u'unit.000776', u'unit.000882', u'unit.001027', u'unit.001138', u'unit.001172', u'unit.001173', u'unit.001328', u'unit.001413', u'unit.001436', u'unit.001509', u'unit.001532', u'unit.001543', u'unit.001581', u'unit.001640', u'unit.001733', u'unit.001843', u'unit.001906', u'unit.002053', u'unit.002079', u'unit.002097']

And I check whether the execution time (Tx) of those units is smaller than Tq, just in case something went wrong with calculating the time overlap among pilots' Tq and units' Tx:

for uid in pu_rels['pilot.0030']:
    uentity = sra_units.get(uid=uid)[0]
    uentity.duration(['AGENT_EXECUTING','AGENT_STAGING_OUTPUT_PENDING'])

218.28030014038086
485.08429980278015
486.24169993400574
825.2764999866486
479.0343999862671
311.17659997940063
550.7825999259949
482.5248999595642
833.3187000751495
481.9230999946594
479.4444999694824
221.03939986228943
865.2293000221252
481.3836998939514
242.4404001235962
263.08929991722107
255.80170011520386
223.24640011787415
484.35509991645813
301.80769991874695
466.7702000141144
481.6726999282837
383.92440009117126
892.2154998779297
233.33660006523132
858.2283000946045
272.4568998813629
249.12859988212585
212.96090006828308
965.8349001407623
237.24930000305176

Finally I check whether these Tx are analogous to those of a pilot with a much lower Tq:

for uid in pu_rels['pilot.0002']:
    uentity = sra_units.get(uid=uid)[0]
    uentity.duration(['AGENT_EXECUTING','AGENT_STAGING_OUTPUT_PENDING'])

377.2000000476837
327.29150009155273
1030.0767998695374
320.85550022125244
825.8889999389648
475.9823999404907
2620.778800010681
315.408499956131
481.24129986763
315.8494999408722
284.41589999198914
293.9662001132965
230.65429997444153
308.4658000469208
283.88759994506836
472.96600008010864
421.6988000869751
222.18110013008118
318.94749999046326
446.9809000492096
216.9238998889923
2142.395099878311
267.1692998409271
372.41160011291504
846.716500043869
315.00679993629456
301.22839999198914
288.9149000644684
309.5456998348236
285.80310010910034
219.23109984397888
248.8872001171112
297.48529982566833
693.7283000946045
479.7750999927521
295.4909999370575
203.59450006484985
484.9703001976013
901.3083000183105
236.97300004959106
493.2297999858856
308.8712000846863
231.74309992790222
477.51900005340576
407.03110003471375
395.3278000354767

So, in summary, it really seems we have hit Tq on OSG after all :) Seeing that it is a outlier by a factor 100, I would not mind for you to have a look too in the context of this ticket. Maybe this kind of digging can also give us some ideas about how to expand our consistency checking.

@mturilli mturilli changed the title Error when calculating TTC in ra.session Error when calculating TTC and pilot running time in ra.session Dec 9, 2016
@mturilli
Copy link
Contributor Author

mturilli commented Dec 9, 2016

Same stack as above, same location of the experiment data.

Confirmed that for the same two sessions described in the first message above, the execution time (duration(['PMGR_ACTIVE', ['DONE', 'CANCELED', 'FAILED']])) of the pilots are off. They are all set at the same value of ~75 years.

@ibethune ibethune added this to the 0.46 milestone Mar 2, 2017
@andre-merzky
Copy link
Member

I think this is related to radical-cybertools/radical.pilot#1117

@andre-merzky andre-merzky self-assigned this Aug 24, 2017
@andre-merzky andre-merzky modified the milestones: v0.47, 0.46 Aug 24, 2017
@andre-merzky andre-merzky removed their assignment Apr 3, 2019
@mturilli mturilli removed the bug label Jun 5, 2019
@mturilli mturilli removed this from the v0.47 milestone Feb 11, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants