HOWTO: Installing OFED 1.5.x
A basic HOWTO on installing OpenFabrics on a Rocks 5.3 cluster. Developed by Tim Carlson (tim@pnl.gov). If you find errors, or want to improve the page, just let me know!
Your list of rolls should look something like this:
# rocks list roll
NAME                                 VERSION ARCH   ENABLED
Red_Hat_Enterprise_Linux_Client_5.4: 5.2     x86_64 yes
base:                                5.3     x86_64 yes
ganglia:                             5.3     x86_64 yes
hpc:                                 5.3     x86_64 yes
kernel:                              5.3     x86_64 yes
web-server:                          5.3     x86_64 yes
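If you want to check the roll list from a script, something like the following works. The awk pattern and the /tmp scratch files are illustrative only; on a live head node you would pipe `rocks list roll` straight into awk.

```shell
# /tmp/rolls.txt stands in for live "rocks list roll" output here
cat > /tmp/rolls.txt <<'EOF'
NAME                                 VERSION ARCH   ENABLED
base:                                5.3     x86_64 yes
ganglia:                             5.3     x86_64 yes
hpc:                                 5.3     x86_64 yes
kernel:                              5.3     x86_64 yes
EOF

# Print the name of every enabled roll, with the trailing colon stripped
awk 'NR > 1 && $4 == "yes" { sub(/:$/, "", $1); print $1 }' \
    /tmp/rolls.txt > /tmp/enabled-rolls.txt
cat /tmp/enabled-rolls.txt
```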
Remove the conflicting RPMs by hand. This is required because the uninstall script that comes with OFED can be unhappy at times.
# rpm -e --allmatches libibverbs librdmacm compat-dapl openmpi-libs openmpi \
    rocks-openmpi-1.3.3-1 iscsi-initiator-utils openmpi-devel mpi-tests
This assumes you are just using the GNU compilers. If your root environment has access to the Intel and Portland Group compilers, you should fix up your PATH so that just the GNU bits get compiled.
# mkdir /root/ofed
# cd /root/ofed
# wget http://www.openfabrics.org/downloads/OFED/ofed-1.5.1/OFED-1.5.1.tgz
# tar zxf OFED-1.5.1.tgz
# cd OFED-1.5.1
# ./install.pl --all --print-available
# grep -v debuginfo ofed-all.conf > ofed.conf
# ./install.pl -c ofed.conf --build32

I am doing this installation on a box that has a Mellanox Infinihost DDR card. At the end of the install process I get this information. If you are building the packages on a head node that does not have an IB card, you won't see any information after the iscsi RPM has been installed.
...
Running rpm -iv /root/ofed/OFED-1.5.1/RPMS/redhat-release-5Client-5.4.0.3/x86_64/iscsi-initiator-utils-2.0-869.2.x86_64.rpm
Device (15b3:6278):
0c:00.0 InfiniBand: Mellanox Technologies MT25208 InfiniHost III Ex (Tavor compatibility mode) (rev a0)
Link Width: 8x
Link Speed: 2.5Gb/s
Installation finished successfully.
On the head node, it's a good idea to edit the yum.conf file to add a list of packages that should be excluded from updates. If this is not done, a 'yum update' of the head node can result in a non-working Infiniband interface.
# vi /etc/yum.conf
[main]
cachedir=/var/cache/yum
debuglevel=2
logfile=/var/log/yum.log
pkgpolicy=newest
distroverpkg=redhat-release
tolerant=1
exactarch=1
obsoletes=1
gpgcheck=1
plugins=1
exclude=kernel* wordpress tentakel compat-dapl compat-dapl-devel compat-dapl-devel-static compat-dapl-utils dapl dapl-devel dapl-devel-static dapl-utils ib-bonding ibsim ibutils infiniband-diags infinipath-psm infinipath-psm-devel kernel-ib kernel-ib-devel libcxgb3 libcxgb3-devel libibcm libibcm-devel libibmad libibmad-devel libibmad-static libibumad libibumad-devel libibumad-static libibverbs libibverbs-devel libibverbs-devel-static libibverbs-utils libipathverbs libipathverbs-devel libmlx4 libmlx4-devel libmthca libmthca-devel-static libnes libnes-devel-static librdmacm librdmacm-devel librdmacm-utils libsdp libsdp-devel mpi-selector mpitests_mvapich2_gcc mpitests_mvapich_gcc mpitests_openmpi_gcc mstflint mvapich2_gcc mvapich_gcc ofed-docs ofed-scripts openmpi_gcc opensm opensm-devel opensm-libs opensm-static perftest qperf rds-tools scsi-target-utils sdpnetstat srptools tgt compat-dapl-static dapl-static

[Rocks-5.3]
name=Rocks 5.3
baseurl=http://localhost/install/rocks-dist/x86_64
priority=1
All of the RPMs are now in the directory /root/ofed/OFED-1.5.1/RPMS/redhat-release-5Client-5.4.0.3/x86_64. If you are running CentOS instead of RHEL, you'll have to change the above path to suit your environment. The RPMs need to be copied to /export/rocks/install/contrib/5.3/x86_64/RPMS/
# cd /root/ofed/OFED-1.5.1/RPMS/redhat-release-5Client-5.4.0.3/x86_64
# cp * /export/rocks/install/contrib/5.3/x86_64/RPMS/

Create an extend-compute.xml file that lists all of these RPMs.
# cd /export/rocks/install/site-profiles/5.3/nodes/
# cp skeleton.xml extend-compute.xml
# vi extend-compute.xml

Before the <post></post> section, you will want to add the following packages.
<package>kernel-ib</package>
<package>kernel-ib-devel</package>
<package>ib-bonding</package>
<package>ofed-scripts</package>
<package>libibverbs</package>
<package>libibverbs-devel</package>
<package>libibverbs-devel-static</package>
<package>libibverbs-utils</package>
<package>libmthca</package>
<package>libmthca-devel-static</package>
<package>libmlx4</package>
<package>libmlx4-devel</package>
<package>libcxgb3</package>
<package>libcxgb3-devel</package>
<package>libnes</package>
<package>libnes-devel-static</package>
<package>libipathverbs</package>
<package>libibcm</package>
<package>libibcm-devel</package>
<package>libibumad</package>
<package>libibumad-devel</package>
<package>libibumad-static</package>
<package>libibmad</package>
<package>libibmad-devel</package>
<package>libibmad-static</package>
<package>ibsim</package>
<package>librdmacm</package>
<package>librdmacm-utils</package>
<package>librdmacm-devel</package>
<package>libsdp</package>
<package>libsdp-devel</package>
<package>opensm-libs</package>
<package>opensm</package>
<package>opensm-devel</package>
<package>opensm-static</package>
<package>compat-dapl</package>
<package>compat-dapl-devel</package>
<package>dapl</package>
<package>dapl-devel</package>
<package>dapl-devel-static</package>
<package>dapl-utils</package>
<package>perftest</package>
<package>mstflint</package>
<package>sdpnetstat</package>
<package>srptools</package>
<package>rds-tools</package>
<package>ibutils</package>
<package>infiniband-diags</package>
<package>qperf</package>
<package>ofed-docs</package>
<package>tgt</package>
<package>mpi-selector</package>
<package>mvapich_gcc</package>
<package>mvapich2_gcc</package>
<package>openmpi_gcc</package>
<package>mpitests_mvapich_gcc</package>
<package>mpitests_mvapich2_gcc</package>
<package>mpitests_openmpi_gcc</package>
<package>open-iscsi</package>
Once you have saved extend-compute.xml, you should check to make sure there are no XML errors by running this command:
# xmllint --noout extend-compute.xml
# cd /export/rocks/install
# rocks create distro
This assumes you have already installed your nodes. If you have not yet installed your compute nodes, you can skip this step and just start adding your compute nodes with insert-ethers.
# rocks set host boot compute action=install
# rocks run host compute reboot
Go to one of your compute nodes and see if basic IB connectivity is working. Here is the output from my 8-node cluster:
# ssh compute-0-0
# ibhosts
Ca : 0x0005ad0000047000 ports 2 "MT25208 InfiniHostEx Mellanox Technologies"
Ca : 0x0005ad00000552a0 ports 2 "compute-0-1 HCA-1"
Ca : 0x0005ad0000055268 ports 2 "compute-0-5 HCA-1"
Ca : 0x0005ad0000055288 ports 2 "compute-0-7 HCA-1"
Ca : 0x0005ad00000552ac ports 2 "compute-0-6 HCA-1"
Ca : 0x0005ad00000552a4 ports 2 "compute-0-3 HCA-1"
Ca : 0x0005ad0000055298 ports 2 "compute-0-2 HCA-1"
Ca : 0x0005ad00000552bc ports 2 "compute-0-4 HCA-1"
Ca : 0x0005ad000005529c ports 2 "compute-0-0 HCA-1"

In this case, my IB switch has a subnet manager (opensm) running. If your switch does not have subnet manager software, you will want to pick one of the nodes to run the subnet manager. Without a subnet manager of some type your IB fabric will not work. Ideally, you would use the head node for the subnet manager.
# chkconfig opensmd on
# service opensmd start
Starting IB Subnet Manager.                                [  OK  ]
As a regular user, let's see if we can really run an MPI program. I am going to run this test outside of any queue system. First we'll create a hostfile for mpirun to use. I'll put all of my compute nodes in this file so it looks like this
$ cat hostfile
compute-0-0
compute-0-1
compute-0-2
compute-0-3
compute-0-4
compute-0-5
compute-0-6
compute-0-7

The mvapich that is supplied with OFED-1.5 includes some tests you can run. Let's run those:
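If you would rather generate the hostfile than type it, a minimal sketch follows. It assumes the compute-0-0 through compute-0-7 names used in this example, and writes to /tmp so nothing real is touched.

```shell
# Emit one hostname per line for an 8-node rack 0
for rank in $(seq 0 7); do
    echo "compute-0-${rank}"
done > /tmp/hostfile
cat /tmp/hostfile
```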
$ /usr/mpi/gcc/mvapich-1.2.0/bin/mpirun -np 8 -hostfile hostfile /usr/mpi/gcc/mvapich-1.2.0/tests/osu_benchmarks-3.1.1/osu_alltoall
libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
This will severely limit memory registrations.
Oops, that doesn't look good! What you have run into here is the fact that, by default, users are only allowed to lock a very small amount of memory. To fix this we need to do two things. Edit (or create) the file /etc/sysconfig/sshd on the head node and add these lines somewhere near the top of the file.
# Fix for RLIMIT_MEMLOCK problem
ulimit -l unlimited

You then have to restart the sshd process and log back into your head node. You also need to push this to all the compute nodes. In the <post></post> section of extend-compute.xml you could do it like this.
<file name="/etc/sysconfig/sshd" mode="append">
# Fix for RLIMIT_MEMLOCK problem
ulimit -l unlimited
</file>

Add the above lines and recreate the distro and reinstall the nodes.
# cd /export/rocks/install
# rocks create distro
# rocks set host boot compute action=install
# rocks run host compute reboot

After the nodes come up, let's try that run again!
$ /usr/mpi/gcc/mvapich-1.2.0/bin/mpirun -np 8 -hostfile hostfile /usr/mpi/gcc/mvapich-1.2.0/tests/osu_benchmarks-3.1.1/osu_alltoall
# OSU MPI All-to-All Personalized Exchange Latency Test v3.1.1
# Size         Latency (us)
1              22.63
2              22.52
4              22.89
8              23.16
16             23.54
32             26.34
64             27.65
128            31.07
256            35.51
512            41.28
1024           56.06
2048           89.68
4096           220.10
8192           341.34
16384          430.56
32768          656.67
65536          1106.93
131072         2023.19
262144         3863.54
524288         7483.98
1048576        14571.77

Those latency numbers are OK; you are doing an all-to-all test, not a single-node-to-single-node test. Let's try those single-node tests. Same hostfile, but this time we'll just run on two of the nodes.
$ /usr/mpi/gcc/mvapich-1.2.0/bin/mpirun -np 2 -hostfile hostfile /usr/mpi/gcc/mvapich-1.2.0/tests/osu_benchmarks-3.1.1/osu_bw
# OSU MPI Bandwidth Test v3.1.1
# Size         Bandwidth (MB/s)
1              2.33
2              4.66
4              9.63
8              18.44
16             35.53
32             66.93
64             126.42
128            233.19
256            368.43
512            507.09
1024           632.22
2048           749.37
4096           820.82
8192           874.60
16384          833.79
32768          895.17
65536          928.47
131072         946.37
262144         954.75
524288         959.31
1048576        961.69
2097152        962.69
4194304        963.26
[tim@underlord ~]$ /usr/mpi/gcc/mvapich-1.2.0/bin/mpirun -np 2 -hostfile hostfile /usr/mpi/gcc/mvapich-1.2.0/tests/osu_benchmarks-3.1.1/osu_latency
# OSU MPI Latency Test v3.1.1
# Size         Latency (us)
0              3.59
1              3.62
2              3.66
4              3.62
8              3.64
16             3.65
32             3.80
64             3.90
128            4.87
256            5.31
512            6.15
1024           7.51
2048           8.95
4096           12.17
8192           19.10
16384          36.82
32768          54.30
65536          87.86
131072         156.34
262144         292.11
524288         564.71
1048576        1108.91
2097152        2198.01
4194304        4372.93

Those are the numbers you can expect for DDR Infinihost III cards.
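If a node still prints the RLIMIT_MEMLOCK warning at this point, you can inspect its limit directly. A minimal check; the wording of the messages is mine, not from any tool.

```shell
# ulimit -l reports the locked-memory limit in KB for the current shell;
# the restrictive 32768-byte default shows up here as "32"
limit=$(ulimit -l)
if [ "$limit" = "unlimited" ]; then
    echo "memlock: unlimited (good for IB)"
else
    echo "memlock: ${limit} KB (verbs registrations will be limited)"
fi
```

Run it over ssh on each compute node to confirm the sshd fix was pushed out.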
Do you need IP over Infiniband (IPoIB) to run MPI? The short answer is "no". None of the standard MPI versions that you would use on an Infiniband network (mvapich, mvapich2, openmpi) require the Infiniband interface to be running IP.
You first need to create the Infiniband network in Rocks. In this example I've called the network "ibnet".
rocks add network ibnet subnet=192.168.2.0 netmask=255.255.255.0
These next steps add the new interface and connect it to the ibnet network.
rocks add host interface compute-RACK-RANK ib0
rocks set host interface ip compute-RACK-RANK ib0 192.168.2.X
rocks set host interface name compute-RACK-RANK ib0 icompute-rack-rank
rocks set host interface module compute-RACK-RANK ib0 ib_ipoib
rocks set host interface subnet compute-RACK-RANK ib0 ibnet
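Running those five commands per node gets tedious on anything bigger than a few machines. The loop below emits them for an 8-node rack 0 so you can review the output and then pipe it to sh; the 192.168.2.(rank+10) addressing scheme is an assumption, not part of the original recipe.

```shell
# Generate the per-node rocks commands; review /tmp/ibnet-commands.sh,
# then run it with sh on the head node once it looks right
for rank in 0 1 2 3 4 5 6 7; do
    host="compute-0-${rank}"
    ip="192.168.2.$((rank + 10))"   # assumed addressing scheme
    echo "rocks add host interface ${host} ib0"
    echo "rocks set host interface ip ${host} ib0 ${ip}"
    echo "rocks set host interface name ${host} ib0 i${host}"
    echo "rocks set host interface module ${host} ib0 ib_ipoib"
    echo "rocks set host interface subnet ${host} ib0 ibnet"
done > /tmp/ibnet-commands.sh
cat /tmp/ibnet-commands.sh
```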
There is also a more crude way of doing it, with some sed magic in extend-compute.xml. The line below assumes that I have configured the cluster to have a private IP network of 192.168.1.0/255.255.255.0 and that I will be using 192.168.2.0/255.255.255.0 for the IPoIB network. Stick this in the <post></post> section.
cat /etc/sysconfig/network-scripts/ifcfg-eth0 | grep -v HWADDR | grep -v MTU | \
    sed -e s/168.1/168.2/ | sed -e s/eth0/ib0/ > /etc/sysconfig/network-scripts/ifcfg-ib0
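To see what the pipeline actually produces, here it is run against a made-up ifcfg-eth0 in /tmp (the real files live in /etc/sysconfig/network-scripts).

```shell
# A hypothetical ifcfg-eth0 as a compute node might have it
cat > /tmp/ifcfg-eth0 <<'EOF'
DEVICE=eth0
HWADDR=00:11:22:33:44:55
IPADDR=192.168.1.254
NETMASK=255.255.255.0
BOOTPROTO=static
ONBOOT=yes
MTU=1500
EOF

# Same pipeline as in extend-compute.xml: drop HWADDR and MTU,
# shift 192.168.1.x to 192.168.2.x, and rename eth0 to ib0
cat /tmp/ifcfg-eth0 | grep -v HWADDR | grep -v MTU | \
    sed -e s/168.1/168.2/ | sed -e s/eth0/ib0/ > /tmp/ifcfg-ib0
cat /tmp/ifcfg-ib0
```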
You could serve data from the head node via NFS. Let's say you have a large partition called /data and the underlying file system can give you better than 100MB/s performance. A gigabit network is going to become a bottleneck, so let's share this partition out with IPoIB. Assuming you have configured your network interface on the head node using the above sed command, you have three more steps.
- Add the filesystem to /etc/exports with the line
/data 192.168.2.0/255.255.255.0(rw,async)
- Add the IPoIB interface to /etc/sysconfig/iptables. After the line
-A INPUT -i eth0 -j ACCEPT

add the line
-A INPUT -i ib0 -j ACCEPT

and restart iptables
# /sbin/service iptables restart
- I'm lazy and add an entry like this to /etc/auto.home
data 192.168.2.1:/data

and push it out to the nodes with
# rocks sync users
Upgrading the kernel is a problem if you are using Qlogic cards, because there is a broken-header issue between the RHEL 2.6.18_194 kernel and the OFED 1.5.1 distribution. If you don't have Qlogic cards, you can upgrade the kernel using the following procedure.
- Download the packages into the contrib directory on the head node, update the head node, and reboot
# yum update --downloadonly --downloaddir=/export/rocks/install/contrib/5.3/x86_64/RPMS \
    kernel kernel-devel kernel-headers
# yum update kernel kernel-devel kernel-headers
# reboot
- Remove the previous kernel-ib, kernel-ib-devel, and ib-bonding RPMs
# rpm -e kernel-ib-devel-1.5.1-2.6.18_164.el5 kernel-ib-1.5.1-2.6.18_164.el5 \
    ib-bonding-0.9.0-2.6.18_164.el5
- Rebuild the source RPMs against the new kernel
# rpmbuild --rebuild --define '_topdir /var/tmp/OFED_topdir' \
    --define 'KVERSION 2.6.18-194.el5' --define '_release 2.6.18_194.el5' \
    --define 'force_all_os 0' --define '_prefix /usr' \
    --define '__arch_install_post %{nil}' \
    /root/ofed/OFED-1.5.1/SRPMS/ib-bonding-0.9.0-42.src.rpm
- Install the RPMs you just built and reboot
# cd /var/tmp/OFED_topdir/RPMS/x86_64
# rpm -ivh ib-bonding-0.9.0-2.6.18_194.el5.x86_64.rpm \
    kernel-ib-1.5.1-2.6.18_194.el5.x86_64.rpm \
    kernel-ib-devel-1.5.1-2.6.18_194.el5.x86_64.rpm
# reboot
- Copy the new RPMs to the contrib directory, rebuild the distro, and reinstall the nodes
# cd /var/tmp/OFED_topdir/RPMS/x86_64
# cp * /export/rocks/install/contrib/5.3/x86_64/RPMS
# cd /export/rocks/install
# rocks create distro
# rocks set host boot compute action=install
# rocks run host compute command=reboot
© 2014 www.rocksclusters.org. All Rights Reserved.