Error only when using SLURM (-S) with ASHS 2.0.0: Validity check at end of stage 2 detected missing files #9
Comments
Hi Natasha,
When you run on Slurm, there should be a bunch of output files generated in
the dump folder of the work directory. Please check those files for errors,
perhaps there is some library missing on one of the slurm nodes, or there
is an error invoking slurm in the first place.
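For example, something along these lines run from the work directory should surface anything obvious (the exact dump file names may differ on your system):
grep -il error sub07_output_slurm_parallel/dump/*.out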
Paul
On Tue, Aug 8, 2023 at 2:08 PM Natasha Pavlovikj wrote:
Hi,
I recently downloaded ASHS 2.0.0 (the release from March 2, 2022).
I ran ashs_main.sh with the test data you provide
(https://github.com/pyushkevich/ashs/tree/master/testing/atlas_system_test/images)
and the UPENN PMC Atlas (ashs_atlas_upennpmc_20170810) on our HPC
Center that supports Slurm.
When I run ashs_main.sh both with the serial option and the parallel
option (-P):
ashs_main.sh -I sub07 -a $ASHS_ATLAS_UPENN_PMC -g sub07_mprage.nii.gz -f sub07_tse.nii.gz -w sub07_output_slurm
or
ashs_main.sh -I sub07 -P -a $ASHS_ATLAS_UPENN_PMC -g sub07_mprage.nii.gz -f sub07_tse.nii.gz -w sub07_output_slurm
the test run is successful (please see the generated log here:
https://gist.github.com/npavlovikj/9b089f11283ed98dbe1cfddfa6d6a6b2).
However, when I run ashs_main.sh with the Slurm option (-S) on the same
dataset with:
ashs_main.sh -I sub07 -S -q "--mem=60gb --time=168:00:00" -a $ASHS_ATLAS_UPENN_PMC -g sub07_mprage.nii.gz -f sub07_tse.nii.gz -w sub07_output_slurm_parallel
the run fails (please see the generated log here:
https://gist.github.com/npavlovikj/4d22d26e713d7406961ee42540927515) with:
**************** !! ERROR !! *******************
Validity check at end of stage 2 detected
missing files.
(multiatlas/tseg_left_train000/greedy_atlas_to_s
ubj_affine.mat and 231 other files).
************************************************
When I submit the job, I can see many ashs_mul* jobs being submitted
and running, but ultimately the run fails with the error from above.
Do you know why this error happens only when I use the Slurm option?
Also, do you have any suggestions on how to fix it?
Please let me know if you need any additional information.
Thank you,
Natasha
Hi Paul,
Thank you so much for your prompt reply! There are indeed many ashs_stg2*.out files in the dump directory. However, I haven't been able to find any errors in them, and they all look similar (one example can be seen here: https://gist.github.com/npavlovikj/3ab46c75146e539c248b642aeb58797b).
Do you have any other suggestions?
Thank you,
Hi Natasha,
The nodes seem to be successfully generating the .mat files that the parent
script complains about not finding. Can you confirm that they are present
in the filesystem?
I've seen something like this on one of our clusters, where files written
to NFS from the nodes did not immediately show up on the submission host.
Can you try running ASHS by stages (using -s 2 option, etc.) with a sleep
command after each stage, to let NFS refresh? Hopefully this will do
the trick.
Paul
Hi Paul,
Thank you for the suggestion! Yes, I can verify that the .mat files do exist on the cluster in the listed directories and have content in them. We have a shared file system.
I also ran ASHS in stages with a sleep command between stages, and I continue getting the Validity check error for Stage 2 and the subsequent stages (the full commands and logs are quoted in the reply below).
The files for Stage 2 and Stage 3 that trigger the Validity check error do exist in the output directory.
Do you have any other suggestions for me to try?
Thank you,
It's really strange... Can you try waiting longer after stage 2, maybe
10 minutes? Otherwise, could you add an echo command to the function in
bin/ashs_lib.sh that does the validity check, to have it
print out the full path of each of the missing files? Maybe there is some
other kind of mismatch with the filesystem...
On Tue, Aug 8, 2023 at 5:08 PM Natasha Pavlovikj wrote:
Hi Paul,
Thank you for the suggestion!
Yes, I can verify that the .mat files do exist on the cluster in the
listed directories and have content in them.
We have a shared file system.
I also applied your suggestion and ran ASHS in stages with a sleep command
in between, e.g.:
echo "Stage 1"
ashs_main.sh -I sub07 -s 1 -S -q "--mem=60gb --time=168:00:00" -a $ASHS_ATLAS_UPENN_PMC -g sub07_mprage.nii.gz -f sub07_tse.nii.gz -w sub07_output_slurm_parallel
sleep 130
echo "Stage 2"
ashs_main.sh -I sub07 -s 2 -S -q "--mem=60gb --time=168:00:00" -a $ASHS_ATLAS_UPENN_PMC -g sub07_mprage.nii.gz -f sub07_tse.nii.gz -w sub07_output_slurm_parallel
sleep 130
echo "Stage 3"
ashs_main.sh -I sub07 -s 3 -S -q "--mem=60gb --time=168:00:00" -a $ASHS_ATLAS_UPENN_PMC -g sub07_mprage.nii.gz -f sub07_tse.nii.gz -w sub07_output_slurm_parallel
sleep 130
echo "Stage 4"
ashs_main.sh -I sub07 -s 4 -S -q "--mem=60gb --time=168:00:00" -a $ASHS_ATLAS_UPENN_PMC -g sub07_mprage.nii.gz -f sub07_tse.nii.gz -w sub07_output_slurm_parallel
sleep 130
echo "Stage 5"
ashs_main.sh -I sub07 -s 5 -S -q "--mem=60gb --time=168:00:00" -a $ASHS_ATLAS_UPENN_PMC -g sub07_mprage.nii.gz -f sub07_tse.nii.gz -w sub07_output_slurm_parallel
sleep 130
echo "Stage 6"
ashs_main.sh -I sub07 -s 6 -S -q "--mem=60gb --time=168:00:00" -a $ASHS_ATLAS_UPENN_PMC -g sub07_mprage.nii.gz -f sub07_tse.nii.gz -w sub07_output_slurm_parallel
sleep 130
echo "Stage 7"
ashs_main.sh -I sub07 -s 7 -S -q "--mem=60gb --time=168:00:00" -a $ASHS_ATLAS_UPENN_PMC -g sub07_mprage.nii.gz -f sub07_tse.nii.gz -w sub07_output_slurm_parallel
and I continue getting the Validity check error for Stage 2 and the
subsequent stages:
Stage 1
ashs_main execution log
timestamp: Tue Aug 8 15:33:31 CDT 2023
invocation: /util/opt/anaconda/deployed-conda-envs/packages/ashs/envs/ashs-2.0.0/bin/ashs_main.sh -I sub07 -s 1 -S -q --mem=60gb --time=168:00:00 --partition=devel -a /work/HCC/DATA/mridata-1.0/ashs/ashs_upennpmc -g /work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_mprage.nii.gz -f /work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_tse.nii.gz -w sub07_output_slurm_parallel
directory: /work/project/npavlovikj/ashs/2.0.0
environment:
ASHS_ATLAS_UPENN_PMC=/work/HCC/DATA/mridata-1.0/ashs/ashs_upennpmc
ASHS_ATLAS_UPENN_PMC_20170810=/work/HCC/DATA/mridata-1.0/ashs/ashs_atlas_upennpmc_20170810
ASHS_BIN=/util/opt/anaconda/deployed-conda-envs/packages/ashs/envs/ashs-2.0.0/ext/Linux/bin
ASHS_MPRAGE=/lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_mprage.nii.gz
ASHS_QSUB_OPTS='--mem=60gb --time=168:00:00 --partition=devel'
ASHS_ROOT=/util/opt/anaconda/deployed-conda-envs/packages/ashs/envs/ashs-2.0.0
ASHS_SUBJID=sub07
ASHS_TSE=/lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_tse.nii.gz
ASHS_USE_SLURM=1
ASHS_USE_SOME_BATCHENV=1
ASHS_WORK=/lustre/work/project/npavlovikj/ashs/2.0.0/sub07_output_slurm_parallel
Atlas : /lustre/work/HCC/DATA/mridata-1.0/ashs/ashs_atlas_upennpmc_20170810
T1 Image : /lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_mprage.nii.gz
T2 Image : /lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_tse.nii.gz
WorkDir : /lustre/work/project/npavlovikj/ashs/2.0.0/sub07_output_slurm_parallel
Using SLURM with options "--mem=60gb --time=168:00:00 --partition=devel"
****************************************
Starting stage 1: Normalization to T1 population template
****************************************
SLURM options for this stage: --mem=60gb --time=168:00:00 --partition=devel
------------------- INFO ---------------------
Started stage 1: Normalization to T1 population
template
------------------------------------------------
Submitted batch job 3388168
------------------- INFO ---------------------
Validity check at end of stage 1 successful
------------------------------------------------
Stage 2
ashs_main execution log
timestamp: Tue Aug 8 15:38:36 CDT 2023
invocation: /util/opt/anaconda/deployed-conda-envs/packages/ashs/envs/ashs-2.0.0/bin/ashs_main.sh -I sub07 -s 2 -S -q --mem=60gb --time=168:00:00 --partition=devel -a /work/HCC/DATA/mridata-1.0/ashs/ashs_upennpmc -g /work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_mprage.nii.gz -f /work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_tse.nii.gz -w sub07_output_slurm_parallel
directory: /work/project/npavlovikj/ashs/2.0.0
environment:
ASHS_ATLAS_UPENN_PMC=/work/HCC/DATA/mridata-1.0/ashs/ashs_upennpmc
ASHS_ATLAS_UPENN_PMC_20170810=/work/HCC/DATA/mridata-1.0/ashs/ashs_atlas_upennpmc_20170810
ASHS_BIN=/util/opt/anaconda/deployed-conda-envs/packages/ashs/envs/ashs-2.0.0/ext/Linux/bin
ASHS_MPRAGE=/lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_mprage.nii.gz
ASHS_QSUB_OPTS='--mem=60gb --time=168:00:00 --partition=devel'
ASHS_ROOT=/util/opt/anaconda/deployed-conda-envs/packages/ashs/envs/ashs-2.0.0
ASHS_SUBJID=sub07
ASHS_TSE=/lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_tse.nii.gz
ASHS_USE_SLURM=1
ASHS_USE_SOME_BATCHENV=1
ASHS_WORK=/lustre/work/project/npavlovikj/ashs/2.0.0/sub07_output_slurm_parallel
Atlas : /lustre/work/HCC/DATA/mridata-1.0/ashs/ashs_atlas_upennpmc_20170810
T1 Image : /lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_mprage.nii.gz
T2 Image : /lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_tse.nii.gz
WorkDir : /lustre/work/project/npavlovikj/ashs/2.0.0/sub07_output_slurm_parallel
Using SLURM with options "--mem=60gb --time=168:00:00 --partition=devel"
------------------- INFO ---------------------
Validity check at end of stage 1 successful
------------------------------------------------
****************************************
Starting stage 2: Initial ROI registration to all T2 atlases
****************************************
SLURM options for this stage: --mem=60gb --time=168:00:00 --partition=devel
------------------- INFO ---------------------
Started stage 2: Initial ROI registration to
all T2 atlases
------------------------------------------------
**************** !! ERROR !! *******************
Validity check at end of stage 2 detected
missing files.
(multiatlas/tseg_left_train000/greedy_atlas_to_s
ubj_affine.mat and 231 other files).
************************************************
Stage 3
ashs_main execution log
timestamp: Tue Aug 8 15:40:54 CDT 2023
invocation: /util/opt/anaconda/deployed-conda-envs/packages/ashs/envs/ashs-2.0.0/bin/ashs_main.sh -I sub07 -s 3 -S -q --mem=60gb --time=168:00:00 --partition=devel -a /work/HCC/DATA/mridata-1.0/ashs/ashs_upennpmc -g /work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_mprage.nii.gz -f /work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_tse.nii.gz -w sub07_output_slurm_parallel
directory: /work/project/npavlovikj/ashs/2.0.0
environment:
ASHS_ATLAS_UPENN_PMC=/work/HCC/DATA/mridata-1.0/ashs/ashs_upennpmc
ASHS_ATLAS_UPENN_PMC_20170810=/work/HCC/DATA/mridata-1.0/ashs/ashs_atlas_upennpmc_20170810
ASHS_BIN=/util/opt/anaconda/deployed-conda-envs/packages/ashs/envs/ashs-2.0.0/ext/Linux/bin
ASHS_MPRAGE=/lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_mprage.nii.gz
ASHS_QSUB_OPTS='--mem=60gb --time=168:00:00 --partition=devel'
ASHS_ROOT=/util/opt/anaconda/deployed-conda-envs/packages/ashs/envs/ashs-2.0.0
ASHS_SUBJID=sub07
ASHS_TSE=/lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_tse.nii.gz
ASHS_USE_SLURM=1
ASHS_USE_SOME_BATCHENV=1
ASHS_WORK=/lustre/work/project/npavlovikj/ashs/2.0.0/sub07_output_slurm_parallel
Atlas : /lustre/work/HCC/DATA/mridata-1.0/ashs/ashs_atlas_upennpmc_20170810
T1 Image : /lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_mprage.nii.gz
T2 Image : /lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_tse.nii.gz
WorkDir : /lustre/work/project/npavlovikj/ashs/2.0.0/sub07_output_slurm_parallel
Using SLURM with options "--mem=60gb --time=168:00:00 --partition=devel"
**************** !! ERROR !! *******************
Validity check at end of stage 2 detected
missing files.
(multiatlas/tseg_left_train022/greedy_atlas_to_s
ubj_warp.nii.gz and 129 other files).
************************************************
Stage 4
ashs_main execution log
timestamp: Tue Aug 8 15:43:08 CDT 2023
invocation: /util/opt/anaconda/deployed-conda-envs/packages/ashs/envs/ashs-2.0.0/bin/ashs_main.sh -I sub07 -s 4 -S -q --mem=60gb --time=168:00:00 --partition=devel -a /work/HCC/DATA/mridata-1.0/ashs/ashs_upennpmc -g /work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_mprage.nii.gz -f /work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_tse.nii.gz -w sub07_output_slurm_parallel
directory: /work/project/npavlovikj/ashs/2.0.0
environment:
ASHS_ATLAS_UPENN_PMC=/work/HCC/DATA/mridata-1.0/ashs/ashs_upennpmc
ASHS_ATLAS_UPENN_PMC_20170810=/work/HCC/DATA/mridata-1.0/ashs/ashs_atlas_upennpmc_20170810
ASHS_BIN=/util/opt/anaconda/deployed-conda-envs/packages/ashs/envs/ashs-2.0.0/ext/Linux/bin
ASHS_MPRAGE=/lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_mprage.nii.gz
ASHS_QSUB_OPTS='--mem=60gb --time=168:00:00 --partition=devel'
ASHS_ROOT=/util/opt/anaconda/deployed-conda-envs/packages/ashs/envs/ashs-2.0.0
ASHS_SUBJID=sub07
ASHS_TSE=/lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_tse.nii.gz
ASHS_USE_SLURM=1
ASHS_USE_SOME_BATCHENV=1
ASHS_WORK=/lustre/work/project/npavlovikj/ashs/2.0.0/sub07_output_slurm_parallel
Atlas : /lustre/work/HCC/DATA/mridata-1.0/ashs/ashs_atlas_upennpmc_20170810
T1 Image : /lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_mprage.nii.gz
T2 Image : /lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_tse.nii.gz
WorkDir : /lustre/work/project/npavlovikj/ashs/2.0.0/sub07_output_slurm_parallel
Using SLURM with options "--mem=60gb --time=168:00:00 --partition=devel"
**************** !! ERROR !! *******************
Validity check at end of stage 3 detected
missing files.
(multiatlas/tseg_right_train027/greedy_atlas_to_
subj_warp.nii.gz and 11 other files).
************************************************
Stage 5
ashs_main execution log
timestamp: Tue Aug 8 15:45:21 CDT 2023
invocation: /util/opt/anaconda/deployed-conda-envs/packages/ashs/envs/ashs-2.0.0/bin/ashs_main.sh -I sub07 -s 5 -S -q --mem=60gb --time=168:00:00 --partition=devel -a /work/HCC/DATA/mridata-1.0/ashs/ashs_upennpmc -g /work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_mprage.nii.gz -f /work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_tse.nii.gz -w sub07_output_slurm_parallel
directory: /work/project/npavlovikj/ashs/2.0.0
environment:
ASHS_ATLAS_UPENN_PMC=/work/HCC/DATA/mridata-1.0/ashs/ashs_upennpmc
ASHS_ATLAS_UPENN_PMC_20170810=/work/HCC/DATA/mridata-1.0/ashs/ashs_atlas_upennpmc_20170810
ASHS_BIN=/util/opt/anaconda/deployed-conda-envs/packages/ashs/envs/ashs-2.0.0/ext/Linux/bin
ASHS_MPRAGE=/lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_mprage.nii.gz
ASHS_QSUB_OPTS='--mem=60gb --time=168:00:00 --partition=devel'
ASHS_ROOT=/util/opt/anaconda/deployed-conda-envs/packages/ashs/envs/ashs-2.0.0
ASHS_SUBJID=sub07
ASHS_TSE=/lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_tse.nii.gz
ASHS_USE_SLURM=1
ASHS_USE_SOME_BATCHENV=1
ASHS_WORK=/lustre/work/project/npavlovikj/ashs/2.0.0/sub07_output_slurm_parallel
Atlas : /lustre/work/HCC/DATA/mridata-1.0/ashs/ashs_atlas_upennpmc_20170810
T1 Image : /lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_mprage.nii.gz
T2 Image : /lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_tse.nii.gz
WorkDir : /lustre/work/project/npavlovikj/ashs/2.0.0/sub07_output_slurm_parallel
Using SLURM with options "--mem=60gb --time=168:00:00 --partition=devel"
**************** !! ERROR !! *******************
Validity check at end of stage 4 detected
missing files.
(multiatlas/fusion/lfseg_corr_nogray_left.nii.gz
and 295 other files).
************************************************
Stage 6
ashs_main execution log
timestamp: Tue Aug 8 15:47:34 CDT 2023
invocation: /util/opt/anaconda/deployed-conda-envs/packages/ashs/envs/ashs-2.0.0/bin/ashs_main.sh -I sub07 -s 6 -S -q --mem=60gb --time=168:00:00 --partition=devel -a /work/HCC/DATA/mridata-1.0/ashs/ashs_upennpmc -g /work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_mprage.nii.gz -f /work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_tse.nii.gz -w sub07_output_slurm_parallel
directory: /work/project/npavlovikj/ashs/2.0.0
environment:
ASHS_ATLAS_UPENN_PMC=/work/HCC/DATA/mridata-1.0/ashs/ashs_upennpmc
ASHS_ATLAS_UPENN_PMC_20170810=/work/HCC/DATA/mridata-1.0/ashs/ashs_atlas_upennpmc_20170810
ASHS_BIN=/util/opt/anaconda/deployed-conda-envs/packages/ashs/envs/ashs-2.0.0/ext/Linux/bin
ASHS_MPRAGE=/lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_mprage.nii.gz
ASHS_QSUB_OPTS='--mem=60gb --time=168:00:00 --partition=devel'
ASHS_ROOT=/util/opt/anaconda/deployed-conda-envs/packages/ashs/envs/ashs-2.0.0
ASHS_SUBJID=sub07
ASHS_TSE=/lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_tse.nii.gz
ASHS_USE_SLURM=1
ASHS_USE_SOME_BATCHENV=1
ASHS_WORK=/lustre/work/project/npavlovikj/ashs/2.0.0/sub07_output_slurm_parallel
Atlas : /lustre/work/HCC/DATA/mridata-1.0/ashs/ashs_atlas_upennpmc_20170810
T1 Image : /lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_mprage.nii.gz
T2 Image : /lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_tse.nii.gz
WorkDir : /lustre/work/project/npavlovikj/ashs/2.0.0/sub07_output_slurm_parallel
Using SLURM with options "--mem=60gb --time=168:00:00 --partition=devel"
**************** !! ERROR !! *******************
Validity check at end of stage 5 detected
missing files.
(multiatlas/fusion/lfseg_corr_nogray_left.nii.gz
and 301 other files).
************************************************
Stage 7
ashs_main execution log
timestamp: Tue Aug 8 15:49:48 CDT 2023
invocation: /util/opt/anaconda/deployed-conda-envs/packages/ashs/envs/ashs-2.0.0/bin/ashs_main.sh -I sub07 -s 7 -S -q --mem=60gb --time=168:00:00 --partition=devel -a /work/HCC/DATA/mridata-1.0/ashs/ashs_upennpmc -g /work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_mprage.nii.gz -f /work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_tse.nii.gz -w sub07_output_slurm_parallel
directory: /work/project/npavlovikj/ashs/2.0.0
environment:
ASHS_ATLAS_UPENN_PMC=/work/HCC/DATA/mridata-1.0/ashs/ashs_upennpmc
ASHS_ATLAS_UPENN_PMC_20170810=/work/HCC/DATA/mridata-1.0/ashs/ashs_atlas_upennpmc_20170810
ASHS_BIN=/util/opt/anaconda/deployed-conda-envs/packages/ashs/envs/ashs-2.0.0/ext/Linux/bin
ASHS_MPRAGE=/lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_mprage.nii.gz
ASHS_QSUB_OPTS='--mem=60gb --time=168:00:00 --partition=devel'
ASHS_ROOT=/util/opt/anaconda/deployed-conda-envs/packages/ashs/envs/ashs-2.0.0
ASHS_SUBJID=sub07
ASHS_TSE=/lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_tse.nii.gz
ASHS_USE_SLURM=1
ASHS_USE_SOME_BATCHENV=1
ASHS_WORK=/lustre/work/project/npavlovikj/ashs/2.0.0/sub07_output_slurm_parallel
Atlas : /lustre/work/HCC/DATA/mridata-1.0/ashs/ashs_atlas_upennpmc_20170810
T1 Image : /lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_mprage.nii.gz
T2 Image : /lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_tse.nii.gz
WorkDir : /lustre/work/project/npavlovikj/ashs/2.0.0/sub07_output_slurm_parallel
Using SLURM with options "--mem=60gb --time=168:00:00 --partition=devel"
**************** !! ERROR !! *******************
Validity check at end of stage 6 detected
missing files.
(multiatlas/fusion/lfseg_corr_nogray_left.nii.gz
and 307 other files).
************************************************
The files for Stage 2 and Stage 3 that give the Validity check error do
exist in the output directory:
***@***.*** sub07_output_slurm_parallel]$ ls -la multiatlas/tseg_left_train022/greedy_atlas_to_subj_warp.nii.gz
-rw-r--r-- 1 npavlovikj project 4211389 Aug 8 15:41 multiatlas/tseg_left_train022/greedy_atlas_to_subj_warp.nii.gz
***@***.*** sub07_output_slurm_parallel]$ ls -ls multiatlas/tseg_right_train027/greedy_atlas_to_subj_warp.nii.gz
4465 -rw-r--r-- 1 npavlovikj project 4511258 Aug 8 15:43 multiatlas/tseg_right_train027/greedy_atlas_to_subj_warp.nii.gz
***@***.*** sub07_output_slurm_parallel]$ ls -la multiatlas/tseg_left_train000/greedy_atlas_to_subj_affine.mat
-rw-r--r-- 1 npavlovikj project 127 Aug 8 15:38 multiatlas/tseg_left_train000/greedy_atlas_to_subj_affine.mat
Do you have any other suggestions for me to try?
Thank you,
Natasha
Hi Paul,
I added sleep 900 after Stage 2, but I got the same error soon after the first job for Stage 2 started.
As for adding an echo command to ashs_lib.sh, do you know which variable holds the missing files? I tried echoing MISSFILE, CHK_FILE, MADIR and TDIR (a rough sketch of what I added is shown below), and all the printed paths and files are valid.
Thank you,
Hi Paul,
I have been playing a bit with the sleep command, and I figured out that the problem is the timing of when the missing files are written: there is some delay before those files appear (maybe they come from different SLURM jobs that are not finished yet). I tried adding sleep 900 before Line 2088 in 8d0098a.
While the number of reported missing files was reduced, some were still missing for Stage 3 and the job failed. I then increased the sleep to 1800, and with that I was able to get a successful run when using SLURM with ASHS 2.0.0. We do have Lustre on our /work shared file system. Waiting 30 minutes for all files to be written seems like a long time. Is it possible to change the code to address this, maybe by waiting for all SLURM jobs to finish before checking for the missing files (see the rough sketch below)?
Thank you,
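For illustration, one option might be to poll for the expected outputs with a timeout instead of sleeping for a fixed time. This is just a sketch on my side, not tested against the ASHS code; the function name, timeout and poll interval are placeholders:
# Sketch only: retry until every expected file exists and is non-empty,
# or give up after a timeout (values are placeholders).
wait_for_outputs() {
  local timeout=1800 interval=30 waited=0 missing=0
  while [ "$waited" -lt "$timeout" ]; do
    missing=0
    for f in "$@"; do
      [ -s "$f" ] || missing=$((missing + 1))
    done
    [ "$missing" -eq 0 ] && return 0
    sleep "$interval"
    waited=$((waited + interval))
  done
  echo "still missing $missing file(s) after ${timeout}s" >&2
  return 1
}
# e.g. wait_for_outputs $ASHS_WORK/multiatlas/tseg_left_train000/greedy_atlas_to_subj_affine.mat ...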