Issues on reproducing your artifact #1

Open · ReviewerPC opened this issue Dec 6, 2016 · 6 comments


ReviewerPC commented Dec 6, 2016

Hello,

I will give a brief review of this artifact here, so that you can fix it in the future.

A cluster of 8 nodes is deployed (per node: 2 Intel Xeon L5420 CPUs, 4 cores/CPU, 16 GB RAM, 298 GB HDD).

  • Intel MPI (4.1.1.036) and MKL are used as described in the tutorial file.

  • HPL is built successfully, as is SKT.

  • HPL runs well on these 8 nodes, as you can see in the attached file (running hpl.gpg).
    The job size was adjusted as explained in the artifact (based on the aggregated memory;
    here we have 128 GB of memory). A rough sizing sketch is given at the end of this comment.

  • The modified version of the HPL benchmark (SKT) does not run on our cluster, for several reasons:

1 - It is not sufficiently explained how to run this modified benchmark (the run command is not given). This also concerns the parent benchmark, but I found how to run HPL on the Internet.

2 - There is a dependency mismatch inside the added scripts, and nothing is said about it in the tutorial or the paper.

So, the error is:
./hpl-daemon.sh: line 41: srun: command not found

I hope you fix your artifact and its description ASAP, so that it can be reproduced.

(attachment: running hpl)
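
As a rough check of the sizing: this follows the common HPL rule of thumb of using about 80% of aggregate memory with 8 bytes per double, which is an assumption on my side rather than something taken from the artifact (the 80% fraction and NB=192 block size are assumed values):

# Estimate HPL problem size N from 128 GB of aggregate memory:
# ~80% of memory, 8 bytes per double, rounded down to a multiple of NB.
awk 'BEGIN { mem = 128 * 2^30; n = sqrt(0.8 * mem / 8); nb = 192; print int(n / nb) * nb }'

On this cluster that prints 117120, which is the order of magnitude I used for the job size.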

tomxice commented Dec 6, 2016

Thank you for your efforts, and sorry for the unclear description.
I am going to fix these issues in the new tutorial; the GitHub version will be updated soon.
The quick fix is as follows.

Line 41 of hpl-daemon.sh contains the submission command:

srun -p work -n 128 --nodelist=./worklist --ntasks-per-node=16 ../skt/xhpl

If you use mpirun from Intel MPI, change it to the following command (leave the leading environment variables unchanged):

mpirun --hostfile=./worklist -n 128 -ppn 16 ../skt/xhpl

To run SKT-HPL, run ./hpl-daemon.sh in skt-hpl/bin/script/
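
If it helps, a one-liner along these lines can apply the substitution in place (just a sketch, not part of the released artifact; it assumes the srun line in your copy of hpl-daemon.sh matches exactly what is quoted above):

# Back up the script, then swap the srun launcher for the Intel MPI one.
cp hpl-daemon.sh hpl-daemon.sh.bak
sed -i 's|srun -p work -n 128 --nodelist=./worklist --ntasks-per-node=16|mpirun --hostfile=./worklist -n 128 -ppn 16|' hpl-daemon.sh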

ReviewerPC commented Dec 6, 2016

For HPL, I actually ran a command similar to this one:

mpirun --hostfile=./worklist -n 128 -ppn 16 ../skt/xhpl

and it runs, as you can see in the attached figure. But the problem appeared when trying to run SKT-HPL. I submitted hpl-daemon.sh with:

mpirun --hostfile=./worklist -n 128 -ppn 16 ./hpl-daemon.sh

but an error was raised. The output is: ./hpl-daemon.sh: line 41: srun: command not found

So, could you please explain in your tutorial what srun is and how to get it, for those who are not familiar with it? It is very helpful to explain these basic things, especially when you know nothing about the expected users or what is already installed on their clusters.

Thank you very much in advance.

tomxice commented Dec 7, 2016

Your method of running HPL is right. However, we provide a wrapper script for SKT-HPL, so the command is different.

DO NOT use mpirun to submit hpl-daemon.sh directly. Instead, put mpirun inside hpl-daemon.sh and then run that script in bash.
In other words, it is NOT
mpirun --hostfile=./worklist -n 128 -ppn 16 ./hpl-daemon.sh
It is
./hpl-daemon.sh

The error is raised because line 41 of hpl-daemon.sh invokes srun, but SLURM is not installed on your cluster (srun is the job-submission command when SLURM is used as the job management system). To fix this error, change the submission command in line 41 of hpl-daemon.sh to use mpirun.

Change the line from
RPN=16 TSIZE=8 SNAPSHOT=30 REUSEMAT=$REUSE_MAT_N ONLY=0 RECOVER=1 RUN_TIME_LIMIT=100000 srun -p work -n 128 --nodelist=./worklist --ntasks-per-node=16 ../skt/xhpl
to
RPN=16 TSIZE=8 SNAPSHOT=30 REUSEMAT=$REUSE_MAT_N ONLY=0 RECOVER=1 RUN_TIME_LIMIT=100000 mpirun --hostfile=./worklist -n 128 -ppn 16 ../skt/xhpl
And then run ./hpl-daemon.sh in bash directly.

Also, line 50 of hpl-daemon.sh uses srun again and should be changed to use mpirun:
mpirun -n $ncnt --hostfile=./worklist -ppn 1 ./clr.sh
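
If you want the script to work on both kinds of clusters, a small guard like the following is one option (a sketch, not part of the released scripts; it reuses the environment variables and launcher arguments shown above):

# Pick the launcher automatically: use srun when SLURM is present, otherwise Intel MPI's mpirun.
if command -v srun >/dev/null 2>&1; then
    LAUNCH="srun -p work -n 128 --nodelist=./worklist --ntasks-per-node=16"
else
    LAUNCH="mpirun --hostfile=./worklist -n 128 -ppn 16"
fi
RPN=16 TSIZE=8 SNAPSHOT=30 REUSEMAT=$REUSE_MAT_N ONLY=0 RECOVER=1 RUN_TIME_LIMIT=100000 $LAUNCH ../skt/xhpl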

tomxice commented Dec 7, 2016

GOOD NEWS: we have prepared a 'ready to run' experimental platform for reviewers to evaluate SKT-HPL. So if you find it too difficult to run SKT-HPL on your own machine, try our experimental platform. Please see 'Run_SKT-HPL_on_gorgon_cluster.pdf' in our GitHub repo.

ReviewerPC commented Dec 8, 2016

Thank you very much for updating the tutorial and for offering us access to your own cluster. Actually, I did not enter your cluster to evaluate the artifact there, since I want to run it on different hardware and in a different environment. I only accessed your cluster to copy the HPL.dat file to the cluster that I'm working on.

I also used mpirun instead of srun, and everything is going OK. The evaluation is running now: I can successfully inject a failure, and SKT can resume the work. For other reviewers, the following screenshot shows that several snapshots are taken and that SKT restarts after a single-node failure is injected.

(screenshot attached: snapshots being taken, and SKT restarting after the injected node failure)

I'm still waiting for the final result (the evaluation is running now) to know whether the test will pass or not.
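
In case it helps other reviewers: one generic way to emulate a single-node failure while the run is in progress is to kill the xhpl processes on one compute node. This is an assumption about the method on my side, and the node name node07 is hypothetical:

# Hypothetical failure injection: kill all xhpl ranks on one compute node.
ssh node07 'pkill -9 xhpl'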

gfursin commented Dec 8, 2016

Hi guys,
I am from the CGO-PPoPP AE steering committee, and I just got confirmation from the above evaluator that he managed to validate your artifact. See the final screen attached. Congrats!
Grigori
(screenshot attached)
