# Inspection


RADICAL-Analytics enables deriving information about RCT sessions, pilots and tasks. For example, session ID, number of tasks, number of pilots, final state of the tasks and pilots, CPU/GPU processes for each task, etc. That information allows to derive task requirements and resource capabilities, alongside the RCT configuration parameters used for a session.

## Prologue

Load the Python modules needed to profile and plot a RADICAL-Cybertool (RCT) session.

In [1]:
import os
import tarfile

import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker

import radical.utils as ru
import radical.pilot as rp
import radical.entk as re
import radical.analytics as ra

1664961028.067 : radical.analytics    : 29075 : 140003482892096 : INFO     : radical.analytics    version: 1.16.0-v1.16.0-24-g1ea17a5@fix-rtd_build


Load the RADICAL Matplotlib style to obtain viasually consistent and publishable-qality plots.

In [2]:
plt.style.use(ra.get_mplstyle('radical_mpl'))

Usually, it is useful to record the stack used for the analysis. 

<div class="alert alert-info">

__Note:__ The analysis stack might be different from the stack used to create the session to analyze. Usually, the two stacks must have the same minor release number (Major.Minor.Patch) in order to be compatible.

</div>

In [3]:
! radical-stack

[32m1664961028.564 : radical.analytics    : 29108 : 139897237518144 : INFO     : radical.analytics    version: 1.16.0-v1.16.0-24-g1ea17a5@fix-rtd_build[0m

  python               : /mnt/home/merzky/radical/radical.analytics.devel/ve3/bin/python3
  pythonpath           : 
  version              : 3.8.0
  virtualenv           : /mnt/home/merzky/radical/radical.analytics.devel/ve3

  radical.analytics    : 1.16.0-v1.16.0-24-g1ea17a5@fix-rtd_build
  radical.entk         : 1.16.0-v1.16.0@master
  radical.gtod         : 1.13.0
  radical.pilot        : 1.17.0-v1.17.0-148-g5a63a7f8e@devel
  radical.saga         : 1.17.0-v1.17.0-2-g5b9803bb@devel
  radical.utils        : 1.17.0-v1.17.0-4-ge3c8acb@fix-docs



## Single Session

Name and location of the session we profile.

In [4]:
sid = 'rp.session.rivendell.merzky.019270.0003'
sdir = 'sessions/'

Unbzip and untar the session.

In [5]:
sp = sdir + sid + '.tar.bz2'
tar = tarfile.open(sp, mode='r:bz2')
tar.extractall(path=sdir)
tar.close()

Create a ``ra.Session`` object for the session. We do not need EnTK-specific traces so load only the RP traces contained in the EnTK session. Thus, we pass the ``'radical.pilot'`` session type to ``ra.Session``.

<div class="alert alert-warning">
    
__Warning:__ We already know we need information about pilots and tasks. Thus, we save in memory two session objects filtered for pilots and tasks. This might be too expensive with large sessions, depending on the amount of memory available.

</div>
    
<div class="alert alert-info">
    
__Note:__ We save the ouput of ``ra.Session`` in ``capt`` to avoid polluting the notebook with warning messages. 

</div>

In [6]:
%%capture capt

sp = sdir + sid

session = ra.Session(sp, 'radical.pilot')
pilots  = session.filter(etype='pilot', inplace=False)
tasks   = session.filter(etype='task' , inplace=False)

Information about __session__ that is commonly used when analyzing and plotting one or more RCT sessions.

In [7]:
# Session info
sinfo = {
    'sid'       : session.uid,
    'hostid'    : session.get(etype='pilot')[0].cfg['hostid'],
    'cores_node': session.get(etype='pilot')[0].cfg['resource_details']['rm_info']['cores_per_node'],
    'gpus_node' : session.get(etype='pilot')[0].cfg['resource_details']['rm_info']['gpus_per_node'],
    'smt'       : session.get(etype='pilot')[0].cfg['resource_details']['rm_info']['threads_per_core']
}

# Pilot info (assumes 1 pilot)
sinfo.update({
    'pid'       : pilots.list('uid'),
    'npilot'    : len(pilots.get()),
    'npact'     : len(pilots.timestamps(state='PMGR_ACTIVE')),
})

# Task info
sinfo.update({
    'ntask'     : len(tasks.get()),
    'ntdone'    : len(tasks.timestamps(state='DONE')),
    'ntcanceled': len(tasks.timestamps(state='CANCELED')),
    'ntfailed'  : len(tasks.timestamps(state='FAILED')),
})

# Derive info (assume a single pilot)
sinfo.update({
    'pres'      : pilots.get(uid=sinfo['pid'])[0].description['resource'],
    'ncores'    : pilots.get(uid=sinfo['pid'])[0].description['cores'],
    'ngpus'     : pilots.get(uid=sinfo['pid'])[0].description['gpus']
})
sinfo.update({
    'nnodes'    : int(sinfo['ncores']/sinfo['cores_node'])
})

sinfo

{'sid': 'rp.session.rivendell.merzky.019270.0003',
 'hostid': 'rivendell',
 'cores_node': 16,
 'gpus_node': 1,
 'smt': 1,
 'pid': ['pilot.0000'],
 'npilot': 1,
 'npact': 1,
 'ntask': 128,
 'ntdone': 128,
 'ntcanceled': 0,
 'ntfailed': 0,
 'pres': 'local.localhost',
 'ncores': 16,
 'ngpus': 0,
 'nnodes': 1}

Information about __tasks__ that is commonly used when analyzing and plotting one or more RCT sessions.

<div class="alert alert-info">
    
__Note:__ we use `ra.entity.description` to get each task description as a dictionary. We then select the keys of that dictionary that contain the task requirements. More keys are available, especially those about staged input/output files.

</div>

In [8]:
tinfo = []
for task in tasks.get():

    treq = {
        'executable'       : task.description['executable'],
        'cpu_process_type' : task.description['cpu_process_type'],
        'cpu_processes'    : task.description['cpu_processes'],
        'cpu_thread_type'  : task.description['cpu_thread_type'],
        'cpu_threads'      : task.description['cpu_threads'],
        'gpu_process_type' : task.description['gpu_process_type'],
        'gpu_processes'    : task.description['gpu_processes'],
        'gpu_thread_type'  : task.description['gpu_thread_type'],
        'gpu_threads'      : task.description['gpu_threads']
    }
    
    if not tinfo:
        treq['n_of_tasks'] = 1
        tinfo.append(treq)
        continue
    
    for i, ti in enumerate(tinfo):
        counter = ti['n_of_tasks']
        ti.pop('n_of_tasks')
        
        if ti == treq:
            counter += 1
            tinfo[i]['n_of_tasks'] = counter
        else:
            treq['n_of_tasks'] = 1
            tinfo.append(treq)
tinfo

[{'executable': '/bin/sleep',
  'cpu_process_type': '',
  'cpu_processes': 0,
  'cpu_thread_type': '',
  'cpu_threads': 0,
  'gpu_process_type': '',
  'gpu_processes': 0,
  'gpu_thread_type': '',
  'gpu_threads': 0,
  'n_of_tasks': 128}]

## Multiple Sessions

Name and location of the sessions we profile.

In [9]:
sids = ['rp.session.rivendell.merzky.019270.0000',
        'rp.session.rivendell.merzky.019270.0000',
        'rp.session.rivendell.merzky.019270.0000',
        'rp.session.rivendell.merzky.019270.0000']
sdir = 'sessions/'
sessions = [sdir + s for s in sids]

Unbzip and untar those sessions.

In [10]:
for sid in sids:
    sp = sdir + sid + '.tar.bz2'
    tar = tarfile.open(sp, mode='r:bz2')
    tar.extractall(path=sdir)
    tar.close()

Create the session, tasks and pilots objects for each session.

In [11]:
%%capture capt

ss = {}
for sid in sids:
    sp = sdir + sid
    ss[sid] = {'s': ra.Session(sp, 'radical.pilot')}
    ss[sid].update({'p': ss[sid]['s'].filter(etype='pilot', inplace=False),
                    't': ss[sid]['s'].filter(etype='task' , inplace=False)})

In [12]:
for sid in sids:
    ss[sid].update({'sid'       : ss[sid]['s'].uid,
                    'hostid'    : ss[sid]['s'].get(etype='pilot')[0].cfg['hostid'],
                    'cores_node': ss[sid]['s'].get(etype='pilot')[0].cfg['resource_details']['rm_info']['cores_per_node'],
                    'gpus_node' : ss[sid]['s'].get(etype='pilot')[0].cfg['resource_details']['rm_info']['gpus_per_node'],
                    'smt'       : ss[sid]['s'].get(etype='pilot')[0].cfg['resource_details']['rm_info']['threads_per_core']
    })

    ss[sid].update({
                    'pid'       : ss[sid]['p'].list('uid'),
                    'npilot'    : len(ss[sid]['p'].get()),
                    'npact'     : len(ss[sid]['p'].timestamps(state='PMGR_ACTIVE'))
    })

    ss[sid].update({
                    'ntask'     : len(ss[sid]['t'].get()),
                    'ntdone'    : len(ss[sid]['t'].timestamps(state='DONE')),
                    'ntfailed'  : len(ss[sid]['t'].timestamps(state='FAILED')),
                    'ntcanceled': len(ss[sid]['t'].timestamps(state='CANCLED'))
    })


    ss[sid].update({'pres'      : ss[sid]['p'].get(uid=ss[sid]['pid'])[0].description['resource'],
                    'ncores'    : ss[sid]['p'].get(uid=ss[sid]['pid'])[0].description['cores'],
                    'ngpus'     : ss[sid]['p'].get(uid=ss[sid]['pid'])[0].description['gpus']
    })

    ss[sid].update({'nnodes'    : int(ss[sid]['ncores']/ss[sid]['cores_node'])})

For presentation purposes, we can convert the session information into a DataFrame and rename some of the columns to improve readability.

In [13]:
ssinfo = []
for sid in sids:
    ssinfo.append({'session'   : sid,
                   'resource'  : ss[sid]['pres'],
                   'cores_node': ss[sid]['cores_node'],
                   'gpus_node' : ss[sid]['gpus_node'],
                   'pilots'    : ss[sid]['npilot'],
                   'ps_active' : ss[sid]['npact'],
                   'cores'     : int(ss[sid]['ncores']/ss[sid]['smt']), 
                   'gpus'      : ss[sid]['ngpus'], 
                   'nodes'     : ss[sid]['nnodes'], 
                   'tasks'     : ss[sid]['ntask'], 
                   't_done'    : ss[sid]['ntdone'],  
                   't_failed'  : ss[sid]['ntfailed']})

df_info = pd.DataFrame(ssinfo) 
df_info

Unnamed: 0,session,resource,cores_node,gpus_node,pilots,ps_active,cores,gpus,nodes,tasks,t_done,t_failed
0,rp.session.rivendell.merzky.019270.0000,local.localhost,16,1,1,1,16,0,1,16,16,0
1,rp.session.rivendell.merzky.019270.0000,local.localhost,16,1,1,1,16,0,1,16,16,0
2,rp.session.rivendell.merzky.019270.0000,local.localhost,16,1,1,1,16,0,1,16,16,0
3,rp.session.rivendell.merzky.019270.0000,local.localhost,16,1,1,1,16,0,1,16,16,0


We can then derive task information for each session.

In [14]:
tsinfo = {}
for sid in sids:

    tsinfo[sid] = []
    for task in tasks.get():

        treq = {
            'executable'       : task.description['executable'],
            'cpu_process_type' : task.description['cpu_process_type'],
            'cpu_processes'    : task.description['cpu_processes'],
            'cpu_thread_type'  : task.description['cpu_thread_type'],
            'cpu_threads'      : task.description['cpu_threads'],
            'gpu_process_type' : task.description['gpu_process_type'],
            'gpu_processes'    : task.description['gpu_processes'],
            'gpu_thread_type'  : task.description['gpu_thread_type'],
            'gpu_threads'      : task.description['gpu_threads']
        }

        if not tsinfo[sid]:
            treq['n_of_tasks'] = 1
            tsinfo[sid].append(treq)
            continue

        for i, ti in enumerate(tsinfo[sid]):
            counter = ti['n_of_tasks']
            ti.pop('n_of_tasks')

            if ti == treq:
                counter += 1
                tsinfo[sid][i]['n_of_tasks'] = counter
            else:
                treq['n_of_tasks'] = 1
                tsinfo[sid].append(treq)
tsinfo

{'rp.session.rivendell.merzky.019270.0000': [{'executable': '/bin/sleep',
   'cpu_process_type': '',
   'cpu_processes': 0,
   'cpu_thread_type': '',
   'cpu_threads': 0,
   'gpu_process_type': '',
   'gpu_processes': 0,
   'gpu_thread_type': '',
   'gpu_threads': 0,
   'n_of_tasks': 128}]}