
Supermuc


How to deploy and use RP on LRZ's Supermuc

NOTE: This page is somewhat convoluted and needs cleanup: some of the steps can probably be combined into fewer commands. Specifically, we spend some time setting up the SOCKS-5 tunnel but do not really use it yet. That is expected to change: all the individual tunnel setups described below are expected to be automated over the SOCKS tunnel.

Remaining TODO items:

  • hotfix release for saga-python to fix loadleveler handling of CANDIDATE_HOSTS
  • deploy ORTE
  • use SSH for non-MPI units (ssh key setup?)
  • automate deployment, including tunnel creation
  • use Supermuc as remote resource

Supermuc Challenges

Supermuc is nigh impossible to use from a remote site at this point, and is also difficult to use from its login nodes, due to stringent firewall restrictions and a complex system setup:

  • gsissh connections are only allowed to gridmuc.lrz.de, and the connection gets placed on a random (?) login node;
  • outgoing connections are filtered by target port numbers and are generally not allowed;
  • outgoing ssh connections are only allowed to pre-registered IP addresses, and only from the login08 login node;
  • LD_PRELOAD is not allowed, which makes it difficult to use pip over a SOCKS-5 tunnel;
  • job submission is only allowed from certain login nodes, as listed below:
    Login Node       Compute Nodes
    login0[34567]    phase 1 thin nodes
    login0[1,2]      phase 1 fat nodes
    login2[123]      phase 2 Haswell nodes

This setup makes it (a) difficult to deploy RP, as one currently needs to manually set up several ssh tunnels for RP to deploy and function properly, and (b) difficult to run RP, as the submission host is different from the host used to tunnel to the external MongoDB. We discuss below how to address those two issues.

Installing RP

We recommend using the release versions of the radical stack. The following instructions assume, for simplicity, the following entries in ~/.ssh/config on Supermuc -- please adapt them to your own endpoint.

host 144.76.72.175 radical
  user      = merzky
  hostname  = 144.76.72.175

The host radical is, in this case, used for both installation tunneling and MongoDB hosting. We expect public-key authentication to be configured for the connection from supermuc to that trusted host.
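If key-based authentication is not yet in place, it can be set up along the usual lines (a minimal sketch; it assumes a standard OpenSSH setup and that the outgoing connection is made from login08, as described below):

# run on login08: create a key pair if none exists yet, then register it
# with the trusted host (uses the ~/.ssh/config entry above)
test -f ~/.ssh/id_rsa || ssh-keygen -t rsa -f ~/.ssh/id_rsa -N ""
ssh-copy-id radical
ssh radical true          # should now succeed without a password prompt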

  • grid-proxy-info: make sure you have a valid X.509 proxy
  • get onto Supermuc: gsissh gridmuc.lrz.de. You end up on some login node
  • if you did not land on login08, hop over to it, as it is the login node that allows outgoing ssh tunnels: ssh login08
  • use ssh to create a SOCKS-5 tunnel to a trusted (i.e. registered, no other vetting needed) remote host: ssh -NfD 1080 radical
  • from that trusted host, fetch a copy of virtualenv (curl could be used, but the source is behind fastly, and that does not play well with SOCKS-5 proxies): scp radical:virtualenv-1.9.tar.gz .
  • unpack: tar zxvf virtualenv-1.9.tar.gz
  • load the python module: module load python/2.7_intel
  • make sure the python module actually works: export LD_LIBRARY_PATH=/lrz/sys/tools/python/intelpython27/lib/
  • create a virtualenv: python virtualenv-1.9/virtualenv.py ve_rp
  • activate the VE: source ve_rp/bin/activate
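
The installation steps above, collected into a single block (a sketch using the same file names and paths as above; adapt as needed):

# run on login08, after the gsissh hop described above
ssh -NfD 1080 radical                       # SOCKS-5 tunnel to the trusted host
scp radical:virtualenv-1.9.tar.gz .         # fetch virtualenv via the trusted host
tar zxvf virtualenv-1.9.tar.gz
module load python/2.7_intel
export LD_LIBRARY_PATH=/lrz/sys/tools/python/intelpython27/lib/
python virtualenv-1.9/virtualenv.py ve_rp   # create the virtualenv ...
source ve_rp/bin/activate                   # ... and activate it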

At this point we have:

  • a SOCKS-5 tunnel
  • a working Python
  • a basic virtualenv (activated)

The next step is to install the radical stack. This is complicated by the fact that pip does not function well over either ssh tunnels or SOCKS-5 proxies, and the usual workaround (using LD_PRELOAD to apply a TCP SOCKS wrapper to connect calls) is forbidden on Supermuc. We thus use easy_install to install the stack, but will have to fix the installation manually thereafter: easy_install handles tunnels, but is otherwise somewhat buggy:

  • create an explicit http tunnel: ssh -NfL 1081:localhost:8888 radical
  • easy_install radical.pilot
  • confirm the install is complete:
$ radical-stack
python            : 2.7.12
virtualenv        : /home/hpc/pr92ge/di29suh2/ve_rp
radical.utils     : 0.45
saga-python       : 0.45
radical.pilot     : 0.45.1
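
For reference, easy_install picks up that tunnel via the proxy environment variables; a minimal sketch, assuming an HTTP proxy is listening on port 8888 of the trusted host (the steps above do not state what serves that port):

ssh -NfL 1081:localhost:8888 radical        # local port 1081 -> proxy port on 'radical'
export http_proxy=http://localhost:1081     # make easy_install use the tunneled proxy
export https_proxy=http://localhost:1081
easy_install radical.pilot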

Fix SAGA-Python

The current release of saga-python is missing a feature needed to support radical.pilot on LoadLeveler. To make it work correctly on Supermuc, apply the following change around line 134 of ~/ve_rp/lib/python2.7/site-packages/saga_python-0.45-py2.7.egg/saga/adaptors/loadl/loadljob.py:

                          saga.job.PROCESSES_PER_HOST,
+                         saga.job.CANDIDATE_HOSTS,
                          saga.job.TOTAL_CPU_COUNT],
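
To check that the edit took effect (the egg path matches the installation above):

# should now list the added CANDIDATE_HOSTS entry
grep -n 'CANDIDATE_HOSTS' \
    ~/ve_rp/lib/python2.7/site-packages/saga_python-0.45-py2.7.egg/saga/adaptors/loadl/loadljob.py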

Fix Installation

We have an installation - but easy_install will not have installed the examples correctly. Fix this with:

  • mkdir -p ve_rp/share/radical.pilot
  • cp -R ve_rp/lib/python2.7/site-packages/radical.pilot-0.45.1-py2.7.egg/share/radical.pilot/examples/ ve_rp/share/radical.pilot/
  • chmod 0755 ve_rp/share/radical.pilot/examples/*.py
  • cd ve_rp/share/radical.pilot/examples
  • also (trust me on this), do: cd ~/ve_rp; ln -s . rp_install; cd -
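
A quick check that the examples landed in the right place and that the symlink is set up (a sketch; 00_getting_started.py is the example used further below):

ls -l  ~/ve_rp/share/radical.pilot/examples/00_getting_started.py   # should exist and be executable
ls -ld ~/ve_rp/rp_install                                           # should be a symlink back to the VE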

Setup database tunnel and environment

We now need to set up a separate tunnel for MongoDB. That tunnel must also be reachable from the compute nodes, so we let it listen on all interfaces and use login08 instead of localhost in the DB URL:

  • ssh -NfL \*:1082:localhost:27017 radical
  • export RADICAL_PILOT_DBURL=mongodb://login08:1082/rp
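
Before starting RP, the tunnel can be sanity-checked from the login node (a minimal sketch, assuming pymongo -- a radical.pilot dependency -- is importable in the active virtualenv):

# should print the MongoDB server version if the tunnel is up
python -c "import pymongo; print(pymongo.MongoClient('mongodb://login08:1082/').server_info()['version'])"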

With this setup, we are ready to run the first example code, at this point only targeting the login node.

Test RP on the login node

  • mkdir -p $HOME/.radical/pilot/configs
  • create resource_lrz.json in that directory, with the following content (please replace your username where appropriate):
{
    "test_local": {
        "description"                 : "local test",
        "notes"                       : "",
        "schemas"                     : ["fork"],
        "fork"                        : {
            "job_manager_endpoint"    : "fork://localhost/",
            "filesystem_endpoint"     : "file://localhost/"
        },
        "lrms"                        : "FORK",
        "agent_type"                  : "multicore",
        "agent_scheduler"             : "CONTINUOUS",
        "agent_spawner"               : "POPEN",
        "agent_launch_method"         : "FORK",
        "task_launch_method"          : "FORK",
        "mpi_launch_method"           : "MPIEXEC",
        "forward_tunnel_endpoint"     : "login08",
        "pre_bootstrap_1"             : ["source /etc/profile",
                                         "source /etc/profile.d/modules.sh",
                                         "module load python/2.7_intel",
                                         "export LD_LIBRARY_PATH=/lrz/sys/tools/python/intelpython27/lib/",
                                         "module unload mpi.ibm", "module load mpi.intel"
                                        ],
        "valid_roots"                 : ["/home", "/gpfs/work", "/gpfs/scratch"],
        "rp_version"                  : "installed",
        "virtenv"                     : "/home/hpc/pr92ge/di29suh2/ve_rp/",
        "virtenv_mode"                : "use",
        "python_dist"                 : "default"
    }
}
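
A syntax error in this file can fail in non-obvious ways, so it may be worth validating the JSON before the first run (using the json module that ships with Python):

python -m json.tool ~/.radical/pilot/configs/resource_lrz.json > /dev/null \
    && echo "resource_lrz.json is valid JSON"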

For the examples to work out of the box, add this section to ./config.json:

        "lrz.test_local" : {
            "project"  : null,
            "queue"    : null,
            "schema"   : null,
            "cores"    : 2
        },

We are now able to run the first example - not yet toward the compute nodes, but locally on the login node. This will confirm that (i) the installation is viable, and (ii) the DB tunnel setup is correct and usable:

(ve_rp)di29suh2@login08:~/ve_rp/share/radical.pilot/examples> ./00_getting_started.py  lrz.test_local

================================================================================
 Getting Started (RP version 0.45.1)                                            
================================================================================

new session: [rp.session.login08.di29suh2.017235.0010]                         \
database   : [mongodb://login08:1082/rp]                                      ok
read config                                                                   ok

--------------------------------------------------------------------------------
submit pilots                                                                   

create pilot manager                                                          ok
create pilot description [lrz.test_local:2]                                   ok
submit 1 pilot(s) .                                                           ok

--------------------------------------------------------------------------------
submit units                                                                    

create unit manager                                                           ok
add 1 pilot(s)                                                                ok
create 128 unit description(s)
        ........................................................................
        ........................................................              ok
submit 128 unit(s)
        ........................................................................
        ........................................................              ok

--------------------------------------------------------------------------------
gather results                                                                  

wait for 128 unit(s)
        +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++|
        +++++++++++++++++++++++++++++++++++++++++++++++++++++++++             ok

--------------------------------------------------------------------------------
finalize                                                                        

closing session rp.session.login08.di29suh2.017235.0010                        \
close pilot manager                                                            \
wait for 1 pilot(s) *                                                         ok
                                                                              ok
close unit manager                                                            ok
session lifetime: 40.6s                                                        ok

--------------------------------------------------------------------------------

Using Compute Nodes

We can now go one step further and run toward the Supermuc compute nodes. For that we add another set of config sections: to ./config.json (for the examples to work), and to ~/.radical/pilot/configs/resource_lrz.json (for the Supermuc resource configuration).

For ./config.json:

        "lrz.local" : {
            "project"  : null,
            "queue"    : null,
            "schema"   : null,
            "cores"    : 32
        },

For ~/.radical/pilot/configs/resource_lrz.json:

    "local": {
        "description"                 : "use SM compute nodes",
        "notes"                       : "", 
        "schemas"                     : ["ssh"],
        "ssh"                         :
        {
            "job_manager_endpoint"    : "loadl+ssh://login01/?energy_policy_tag=radical_pilot&island_count=1&node_usage=not_shared&network_mpi=sn_all,not_shared,us",
            "filesystem_endpoint"     : "file://localhost/"
        },
        "default_queue"               : "test",
        "lrms"                        : "LOADLEVELER",
        "agent_type"                  : "multicore",
        "agent_scheduler"             : "CONTINUOUS",
        "agent_spawner"               : "POPEN",
        "agent_launch_method"         : "MPIEXEC",
        "task_launch_method"          : "MPIEXEC",
        "mpi_launch_method"           : "MPIEXEC",
        "forward_tunnel_endpoint"     : "login08",
        "pre_bootstrap_1"             : ["source /etc/profile",
                                         "source /etc/profile.d/modules.sh",
                                         "module load python/2.7_intel",
                                         "export LD_LIBRARY_PATH=/lrz/sys/tools/python/intelpython27/lib/",
                                         "module unload mpi.ibm", "module load mpi.intel"
                                        ],
        "valid_roots"                 : ["/home", "/gpfs/work", "/gpfs/scratch"],
        "rp_version"                  : "installed",
        "virtenv"                     : "/home/hpc/pr92ge/di29suh2/ve_rp/",
        "virtenv_mode"                : "use",
        "python_dist"                 : "default"
    }
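
Since the loadl+ssh endpoint above submits via login01, it may be worth confirming that password-less ssh between login nodes works and that the LoadLeveler tools are available there (a sketch; llsubmit is the standard LoadLeveler submission command):

# run from login08: should print the path of llsubmit on login01
# without asking for a password
ssh login01 'which llsubmit'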

With that setup, we also have working submission to compute nodes:

(ve_rp)di29suh2@login08:~/ve_rp/share/radical.pilot/examples> ./00_getting_started.py  lrz.local

================================================================================
 Getting Started (RP version 0.45.1)                                            
================================================================================

new session: [rp.session.login08.di29suh2.017235.0014]                         \
database   : [mongodb://login08:1082/rp]                                      ok
read config                                                                   ok

--------------------------------------------------------------------------------
submit pilots                                                                   

create pilot manager                                                          ok
create pilot description [lrz.local:32]                                       ok
submit 1 pilot(s) .                                                           ok

--------------------------------------------------------------------------------
submit units                                                                    

create unit manager                                                           ok
add 1 pilot(s)                                                                ok
create 128 unit description(s)
        ........................................................................
        ........................................................              ok
submit 128 unit(s)
        ........................................................................
        ........................................................              ok

--------------------------------------------------------------------------------
gather results                                                                  

wait for 128 unit(s)
        +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++|
        +++++++++++++++++++++++++++++++++++++++++++++++++++++++++             ok

--------------------------------------------------------------------------------
finalize                                                                        

closing session rp.session.login08.di29suh2.017235.0014                        \
close pilot manager                                                            \
wait for 1 pilot(s) *                                                         ok
                                                                              ok
close unit manager                                                            ok
session lifetime: 76.3s                                                       ok

--------------------------------------------------------------------------------