
Error only when using SLURM (-S) with ASHS 2.0.0: Validity check at end of stage 2 detected missing files #9

Open · npavlovikj opened this issue Aug 8, 2023 · 7 comments

@npavlovikj

Hi,

I recently downloaded ASHS 2.0.0 (the release from March 2, 2022).
I ran ashs_main.sh with the test data you provide and the UPENN PMC Atlas (ashs_atlas_upennpmc_20170810) at our HPC center, which supports Slurm.

When I run ashs_main.sh either serially (the default) or with the parallel option (-P):
ashs_main.sh -I sub07 -a $ASHS_ATLAS_UPENN_PMC -g sub07_mprage.nii.gz -f sub07_tse.nii.gz -w sub07_output_slurm
or
ashs_main.sh -I sub07 -P -a $ASHS_ATLAS_UPENN_PMC -g sub07_mprage.nii.gz -f sub07_tse.nii.gz -w sub07_output_slurm
the test run succeeds (see the generated log here: https://gist.github.com/npavlovikj/9b089f11283ed98dbe1cfddfa6d6a6b2).

However, when I run ashs_main.sh on the same dataset with the Slurm option (-S):
ashs_main.sh -I sub07 -S -q "--mem=60gb --time=168:00:00" -a $ASHS_ATLAS_UPENN_PMC -g sub07_mprage.nii.gz -f sub07_tse.nii.gz -w sub07_output_slurm_parallel
the run fails (see the generated log here: https://gist.github.com/npavlovikj/4d22d26e713d7406961ee42540927515) with:

**************** !! ERROR !! *******************
Validity check at end of stage 2 detected 
missing files. 
(multiatlas/tseg_left_train000/greedy_atlas_to_s
ubj_affine.mat and 231 other files).
************************************************

When I submit the job, I can see many ashs_mul* jobs being submitted and running, but ultimately the run fails with the error above.

Do you know why this error happens only when I use the Slurm option?
Also, do you have any suggestions on how to fix it?

Please let me know if you need any additional information.

Thank you,
Natasha

@pyushkevich (Owner) commented Aug 8, 2023 via email

@npavlovikj (Author)

Hi Paul,

Thank you so much for your prompt reply!

There are indeed many ashs_stg2*.out files in the dump directory. However, I haven't been able to find any errors in them, and they all look similar (one example can be seen here: https://gist.github.com/npavlovikj/3ab46c75146e539c248b642aeb58797b).
The Slurm status and exit code for all of those jobs are Completed and 0, respectively.
I also checked the computational resources used by those jobs, and their usage is well below what I requested.
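
For reference, the per-job state, exit code, and resource usage can be queried with standard Slurm accounting (<jobid> is a placeholder for one of the ashs job IDs):

sacct -j <jobid> --format=JobID,JobName%20,State,ExitCode,Elapsed,MaxRSS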

Do you have any other suggestions?

Thank you,
Natasha

@pyushkevich (Owner) commented Aug 8, 2023 via email

@npavlovikj (Author)

Hi Paul,

Thank you for the suggestion!

Yes, I can verify that the .mat files do exist on the cluster in the listed directories and that they are non-empty.
We have a shared file system.
I also applied your suggestion and ran ASHS in stages with a sleep command in between, e.g.:

echo "Stage 1"
ashs_main.sh -I sub07 -s 1 -S -q "--mem=60gb --time=168:00:00" -a $ASHS_ATLAS_UPENN_PMC -g sub07_mprage.nii.gz -f sub07_tse.nii.gz -w sub07_output_slurm_parallel
sleep 130

echo "Stage 2"
ashs_main.sh -I sub07 -s 2 -S -q "--mem=60gb --time=168:00:00" -a $ASHS_ATLAS_UPENN_PMC -g sub07_mprage.nii.gz -f sub07_tse.nii.gz -w sub07_output_slurm_parallel
sleep 130

echo "Stage 3"
ashs_main.sh -I sub07 -s 3 -S -q "--mem=60gb --time=168:00:00" -a $ASHS_ATLAS_UPENN_PMC -g sub07_mprage.nii.gz -f sub07_tse.nii.gz -w sub07_output_slurm_parallel
sleep 130

echo "Stage 4"
ashs_main.sh -I sub07 -s 4 -S -q "--mem=60gb --time=168:00:00" -a $ASHS_ATLAS_UPENN_PMC -g sub07_mprage.nii.gz -f sub07_tse.nii.gz -w sub07_output_slurm_parallel
sleep 130

echo "Stage 5"
ashs_main.sh -I sub07 -s 5 -S -q "--mem=60gb --time=168:00:00" -a $ASHS_ATLAS_UPENN_PMC -g sub07_mprage.nii.gz -f sub07_tse.nii.gz -w sub07_output_slurm_parallel
sleep 130

echo "Stage 6"
ashs_main.sh -I sub07 -s 6 -S -q "--mem=60gb --time=168:00:00" -a $ASHS_ATLAS_UPENN_PMC -g sub07_mprage.nii.gz -f sub07_tse.nii.gz -w sub07_output_slurm_parallel
sleep 130

echo "Stage 7"
ashs_main.sh -I sub07 -s 7 -S -q "--mem=60gb --time=168:00:00" -a $ASHS_ATLAS_UPENN_PMC -g sub07_mprage.nii.gz -f sub07_tse.nii.gz -w sub07_output_slurm_parallel
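
(For brevity, the same staged run can also be written as a loop; the options are identical to the commands above:)

# Equivalent loop form of the staged script above
for stage in 1 2 3 4 5 6 7; do
  echo "Stage ${stage}"
  ashs_main.sh -I sub07 -s ${stage} -S -q "--mem=60gb --time=168:00:00" \
    -a $ASHS_ATLAS_UPENN_PMC -g sub07_mprage.nii.gz -f sub07_tse.nii.gz \
    -w sub07_output_slurm_parallel
  [ "${stage}" -lt 7 ] && sleep 130
done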

After running all stages this way, I continue getting the validity check error at Stage 2 and in the subsequent stages:

Stage 1
ashs_main execution log
  timestamp:   Tue Aug  8 15:33:31 CDT 2023
  invocation:  /util/opt/anaconda/deployed-conda-envs/packages/ashs/envs/ashs-2.0.0/bin/ashs_main.sh -I sub07 -s 1 -S -q --mem=60gb --time=168:00:00 --partition=devel -a /work/HCC/DATA/mridata-1.0/ashs
/ashs_upennpmc -g /work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_mprage.nii.gz -f /work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_tse.nii.gz -w sub0
7_output_slurm_parallel
  directory:   /work/project/npavlovikj/ashs/2.0.0
  environment:
    ASHS_ATLAS_UPENN_PMC=/work/HCC/DATA/mridata-1.0/ashs/ashs_upennpmc
    ASHS_ATLAS_UPENN_PMC_20170810=/work/HCC/DATA/mridata-1.0/ashs/ashs_atlas_upennpmc_20170810
    ASHS_BIN=/util/opt/anaconda/deployed-conda-envs/packages/ashs/envs/ashs-2.0.0/ext/Linux/bin
    ASHS_MPRAGE=/lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_mprage.nii.gz
    ASHS_QSUB_OPTS='--mem=60gb --time=168:00:00 --partition=devel'
    ASHS_ROOT=/util/opt/anaconda/deployed-conda-envs/packages/ashs/envs/ashs-2.0.0
    ASHS_SUBJID=sub07
    ASHS_TSE=/lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_tse.nii.gz
    ASHS_USE_SLURM=1
    ASHS_USE_SOME_BATCHENV=1
    ASHS_WORK=/lustre/work/project/npavlovikj/ashs/2.0.0/sub07_output_slurm_parallel
Atlas    : /lustre/work/HCC/DATA/mridata-1.0/ashs/ashs_atlas_upennpmc_20170810
T1 Image : /lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_mprage.nii.gz
T2 Image : /lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_tse.nii.gz
WorkDir  : /lustre/work/project/npavlovikj/ashs/2.0.0/sub07_output_slurm_parallel
Using SLURM with options "--mem=60gb --time=168:00:00 --partition=devel"
****************************************
Starting stage 1: Normalization to T1 population template
****************************************
SLURM options for this stage: --mem=60gb --time=168:00:00 --partition=devel

-------------------  INFO  ---------------------
Started stage 1: Normalization to T1 population 
template
------------------------------------------------

Submitted batch job 3388168

-------------------  INFO  ---------------------
Validity check at end of stage 1 successful
------------------------------------------------

Stage 2
ashs_main execution log
  timestamp:   Tue Aug  8 15:38:36 CDT 2023
  invocation:  /util/opt/anaconda/deployed-conda-envs/packages/ashs/envs/ashs-2.0.0/bin/ashs_main.sh -I sub07 -s 2 -S -q --mem=60gb --time=168:00:00 --partition=devel -a /work/HCC/DATA/mridata-1.0/ashs
/ashs_upennpmc -g /work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_mprage.nii.gz -f /work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_tse.nii.gz -w sub0
7_output_slurm_parallel
  directory:   /work/project/npavlovikj/ashs/2.0.0
  environment:
    ASHS_ATLAS_UPENN_PMC=/work/HCC/DATA/mridata-1.0/ashs/ashs_upennpmc
    ASHS_ATLAS_UPENN_PMC_20170810=/work/HCC/DATA/mridata-1.0/ashs/ashs_atlas_upennpmc_20170810
    ASHS_BIN=/util/opt/anaconda/deployed-conda-envs/packages/ashs/envs/ashs-2.0.0/ext/Linux/bin
    ASHS_MPRAGE=/lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_mprage.nii.gz
    ASHS_QSUB_OPTS='--mem=60gb --time=168:00:00 --partition=devel'
    ASHS_ROOT=/util/opt/anaconda/deployed-conda-envs/packages/ashs/envs/ashs-2.0.0
    ASHS_SUBJID=sub07
    ASHS_TSE=/lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_tse.nii.gz
    ASHS_USE_SLURM=1
    ASHS_USE_SOME_BATCHENV=1
    ASHS_WORK=/lustre/work/project/npavlovikj/ashs/2.0.0/sub07_output_slurm_parallel
Atlas    : /lustre/work/HCC/DATA/mridata-1.0/ashs/ashs_atlas_upennpmc_20170810
T1 Image : /lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_mprage.nii.gz
T2 Image : /lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_tse.nii.gz
WorkDir  : /lustre/work/project/npavlovikj/ashs/2.0.0/sub07_output_slurm_parallel
Using SLURM with options "--mem=60gb --time=168:00:00 --partition=devel"

-------------------  INFO  ---------------------
Validity check at end of stage 1 successful
------------------------------------------------

****************************************
Starting stage 2: Initial ROI registration to all T2 atlases
****************************************
SLURM options for this stage: --mem=60gb --time=168:00:00 --partition=devel

-------------------  INFO  ---------------------
Started stage 2: Initial ROI registration to 
all T2 atlases
------------------------------------------------


**************** !! ERROR !! *******************
Validity check at end of stage 2 detected 
missing files. 
(multiatlas/tseg_left_train000/greedy_atlas_to_s
ubj_affine.mat and 231 other files).
************************************************

Stage 3
ashs_main execution log
  timestamp:   Tue Aug  8 15:40:54 CDT 2023
  invocation:  /util/opt/anaconda/deployed-conda-envs/packages/ashs/envs/ashs-2.0.0/bin/ashs_main.sh -I sub07 -s 3 -S -q --mem=60gb --time=168:00:00 --partition=devel -a /work/HCC/DATA/mridata-1.0/ashs
/ashs_upennpmc -g /work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_mprage.nii.gz -f /work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_tse.nii.gz -w sub0
7_output_slurm_parallel
  directory:   /work/project/npavlovikj/ashs/2.0.0
  environment:
    ASHS_ATLAS_UPENN_PMC=/work/HCC/DATA/mridata-1.0/ashs/ashs_upennpmc
    ASHS_ATLAS_UPENN_PMC_20170810=/work/HCC/DATA/mridata-1.0/ashs/ashs_atlas_upennpmc_20170810
    ASHS_BIN=/util/opt/anaconda/deployed-conda-envs/packages/ashs/envs/ashs-2.0.0/ext/Linux/bin
    ASHS_MPRAGE=/lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_mprage.nii.gz
    ASHS_QSUB_OPTS='--mem=60gb --time=168:00:00 --partition=devel'
    ASHS_ROOT=/util/opt/anaconda/deployed-conda-envs/packages/ashs/envs/ashs-2.0.0
    ASHS_SUBJID=sub07
    ASHS_TSE=/lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_tse.nii.gz
    ASHS_USE_SLURM=1
    ASHS_USE_SOME_BATCHENV=1
    ASHS_WORK=/lustre/work/project/npavlovikj/ashs/2.0.0/sub07_output_slurm_parallel
Atlas    : /lustre/work/HCC/DATA/mridata-1.0/ashs/ashs_atlas_upennpmc_20170810
T1 Image : /lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_mprage.nii.gz
T2 Image : /lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_tse.nii.gz
WorkDir  : /lustre/work/project/npavlovikj/ashs/2.0.0/sub07_output_slurm_parallel
Using SLURM with options "--mem=60gb --time=168:00:00 --partition=devel"

**************** !! ERROR !! *******************
Validity check at end of stage 2 detected 
missing files. 
(multiatlas/tseg_left_train022/greedy_atlas_to_s
ubj_warp.nii.gz and 129 other files).
************************************************

Stage 4
ashs_main execution log
  timestamp:   Tue Aug  8 15:43:08 CDT 2023
  invocation:  /util/opt/anaconda/deployed-conda-envs/packages/ashs/envs/ashs-2.0.0/bin/ashs_main.sh -I sub07 -s 4 -S -q --mem=60gb --time=168:00:00 --partition=devel -a /work/HCC/DATA/mridata-1.0/ashs
/ashs_upennpmc -g /work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_mprage.nii.gz -f /work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_tse.nii.gz -w sub0
7_output_slurm_parallel
  directory:   /work/project/npavlovikj/ashs/2.0.0
  environment:
    ASHS_ATLAS_UPENN_PMC=/work/HCC/DATA/mridata-1.0/ashs/ashs_upennpmc
    ASHS_ATLAS_UPENN_PMC_20170810=/work/HCC/DATA/mridata-1.0/ashs/ashs_atlas_upennpmc_20170810
    ASHS_BIN=/util/opt/anaconda/deployed-conda-envs/packages/ashs/envs/ashs-2.0.0/ext/Linux/bin
    ASHS_MPRAGE=/lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_mprage.nii.gz
    ASHS_QSUB_OPTS='--mem=60gb --time=168:00:00 --partition=devel'
    ASHS_ROOT=/util/opt/anaconda/deployed-conda-envs/packages/ashs/envs/ashs-2.0.0
    ASHS_SUBJID=sub07
    ASHS_TSE=/lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_tse.nii.gz
    ASHS_USE_SLURM=1
    ASHS_USE_SOME_BATCHENV=1
    ASHS_WORK=/lustre/work/project/npavlovikj/ashs/2.0.0/sub07_output_slurm_parallel
Atlas    : /lustre/work/HCC/DATA/mridata-1.0/ashs/ashs_atlas_upennpmc_20170810
T1 Image : /lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_mprage.nii.gz
T2 Image : /lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_tse.nii.gz
WorkDir  : /lustre/work/project/npavlovikj/ashs/2.0.0/sub07_output_slurm_parallel
Using SLURM with options "--mem=60gb --time=168:00:00 --partition=devel"

**************** !! ERROR !! *******************
Validity check at end of stage 3 detected 
missing files. 
(multiatlas/tseg_right_train027/greedy_atlas_to_
subj_warp.nii.gz and 11 other files).
************************************************

Stage 5
ashs_main execution log
  timestamp:   Tue Aug  8 15:45:21 CDT 2023
  invocation:  /util/opt/anaconda/deployed-conda-envs/packages/ashs/envs/ashs-2.0.0/bin/ashs_main.sh -I sub07 -s 5 -S -q --mem=60gb --time=168:00:00 --partition=devel -a /work/HCC/DATA/mridata-1.0/ashs
/ashs_upennpmc -g /work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_mprage.nii.gz -f /work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_tse.nii.gz -w sub0
7_output_slurm_parallel
  directory:   /work/project/npavlovikj/ashs/2.0.0
  environment:
    ASHS_ATLAS_UPENN_PMC=/work/HCC/DATA/mridata-1.0/ashs/ashs_upennpmc
    ASHS_ATLAS_UPENN_PMC_20170810=/work/HCC/DATA/mridata-1.0/ashs/ashs_atlas_upennpmc_20170810
    ASHS_BIN=/util/opt/anaconda/deployed-conda-envs/packages/ashs/envs/ashs-2.0.0/ext/Linux/bin
    ASHS_MPRAGE=/lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_mprage.nii.gz
    ASHS_QSUB_OPTS='--mem=60gb --time=168:00:00 --partition=devel'
    ASHS_ROOT=/util/opt/anaconda/deployed-conda-envs/packages/ashs/envs/ashs-2.0.0
    ASHS_SUBJID=sub07
    ASHS_TSE=/lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_tse.nii.gz
    ASHS_USE_SLURM=1
    ASHS_USE_SOME_BATCHENV=1
    ASHS_WORK=/lustre/work/project/npavlovikj/ashs/2.0.0/sub07_output_slurm_parallel
Atlas    : /lustre/work/HCC/DATA/mridata-1.0/ashs/ashs_atlas_upennpmc_20170810
T1 Image : /lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_mprage.nii.gz
T2 Image : /lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_tse.nii.gz
WorkDir  : /lustre/work/project/npavlovikj/ashs/2.0.0/sub07_output_slurm_parallel
Using SLURM with options "--mem=60gb --time=168:00:00 --partition=devel"

**************** !! ERROR !! *******************
Validity check at end of stage 4 detected 
missing files. 
(multiatlas/fusion/lfseg_corr_nogray_left.nii.gz
 and 295 other files).
************************************************

Stage 6
ashs_main execution log
  timestamp:   Tue Aug  8 15:47:34 CDT 2023
  invocation:  /util/opt/anaconda/deployed-conda-envs/packages/ashs/envs/ashs-2.0.0/bin/ashs_main.sh -I sub07 -s 6 -S -q --mem=60gb --time=168:00:00 --partition=devel -a /work/HCC/DATA/mridata-1.0/ashs
/ashs_upennpmc -g /work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_mprage.nii.gz -f /work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_tse.nii.gz -w sub0
7_output_slurm_parallel
  directory:   /work/project/npavlovikj/ashs/2.0.0
  environment:
    ASHS_ATLAS_UPENN_PMC=/work/HCC/DATA/mridata-1.0/ashs/ashs_upennpmc
    ASHS_ATLAS_UPENN_PMC_20170810=/work/HCC/DATA/mridata-1.0/ashs/ashs_atlas_upennpmc_20170810
    ASHS_BIN=/util/opt/anaconda/deployed-conda-envs/packages/ashs/envs/ashs-2.0.0/ext/Linux/bin
    ASHS_MPRAGE=/lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_mprage.nii.gz
    ASHS_QSUB_OPTS='--mem=60gb --time=168:00:00 --partition=devel'
    ASHS_ROOT=/util/opt/anaconda/deployed-conda-envs/packages/ashs/envs/ashs-2.0.0
    ASHS_SUBJID=sub07
    ASHS_TSE=/lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_tse.nii.gz
    ASHS_USE_SLURM=1
    ASHS_USE_SOME_BATCHENV=1
    ASHS_WORK=/lustre/work/project/npavlovikj/ashs/2.0.0/sub07_output_slurm_parallel
Atlas    : /lustre/work/HCC/DATA/mridata-1.0/ashs/ashs_atlas_upennpmc_20170810
T1 Image : /lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_mprage.nii.gz
T2 Image : /lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_tse.nii.gz
WorkDir  : /lustre/work/project/npavlovikj/ashs/2.0.0/sub07_output_slurm_parallel
Using SLURM with options "--mem=60gb --time=168:00:00 --partition=devel"

**************** !! ERROR !! *******************
Validity check at end of stage 5 detected 
missing files. 
(multiatlas/fusion/lfseg_corr_nogray_left.nii.gz
 and 301 other files).
************************************************

Stage 7
ashs_main execution log
  timestamp:   Tue Aug  8 15:49:48 CDT 2023
  invocation:  /util/opt/anaconda/deployed-conda-envs/packages/ashs/envs/ashs-2.0.0/bin/ashs_main.sh -I sub07 -s 7 -S -q --mem=60gb --time=168:00:00 --partition=devel -a /work/HCC/DATA/mridata-1.0/ashs
/ashs_upennpmc -g /work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_mprage.nii.gz -f /work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_tse.nii.gz -w sub0
7_output_slurm_parallel
  directory:   /work/project/npavlovikj/ashs/2.0.0
  environment:
    ASHS_ATLAS_UPENN_PMC=/work/HCC/DATA/mridata-1.0/ashs/ashs_upennpmc
    ASHS_ATLAS_UPENN_PMC_20170810=/work/HCC/DATA/mridata-1.0/ashs/ashs_atlas_upennpmc_20170810
    ASHS_BIN=/util/opt/anaconda/deployed-conda-envs/packages/ashs/envs/ashs-2.0.0/ext/Linux/bin
    ASHS_MPRAGE=/lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_mprage.nii.gz
    ASHS_QSUB_OPTS='--mem=60gb --time=168:00:00 --partition=devel'
    ASHS_ROOT=/util/opt/anaconda/deployed-conda-envs/packages/ashs/envs/ashs-2.0.0
    ASHS_SUBJID=sub07
    ASHS_TSE=/lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_tse.nii.gz
    ASHS_USE_SLURM=1
    ASHS_USE_SOME_BATCHENV=1
    ASHS_WORK=/lustre/work/project/npavlovikj/ashs/2.0.0/sub07_output_slurm_parallel
Atlas    : /lustre/work/HCC/DATA/mridata-1.0/ashs/ashs_atlas_upennpmc_20170810
T1 Image : /lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_mprage.nii.gz
T2 Image : /lustre/work/project/npavlovikj/ashs/ashs/testing/atlas_system_test/images/sub07_tse.nii.gz
WorkDir  : /lustre/work/project/npavlovikj/ashs/2.0.0/sub07_output_slurm_parallel
Using SLURM with options "--mem=60gb --time=168:00:00 --partition=devel"

**************** !! ERROR !! *******************
Validity check at end of stage 6 detected 
missing files. 
(multiatlas/fusion/lfseg_corr_nogray_left.nii.gz
 and 307 other files).
************************************************

The Stage 2 and Stage 3 files flagged by the validity check do exist in the output directory:

[npavlovikj@login2 sub07_output_slurm_parallel]$ ls -la multiatlas/tseg_left_train022/greedy_atlas_to_subj_warp.nii.gz 
-rw-r--r-- 1 npavlovikj project 4211389 Aug  8 15:41 multiatlas/tseg_left_train022/greedy_atlas_to_subj_warp.nii.gz
[npavlovikj@login2 sub07_output_slurm_parallel]$ ls -ls multiatlas/tseg_right_train027/greedy_atlas_to_subj_warp.nii.gz
4465 -rw-r--r-- 1 npavlovikj project 4511258 Aug  8 15:43 multiatlas/tseg_right_train027/greedy_atlas_to_subj_warp.nii.gz
[npavlovikj@login2 sub07_output_slurm_parallel]$ ls -la multiatlas/tseg_left_train000/greedy_atlas_to_subj_affine.mat 
-rw-r--r-- 1 npavlovikj project 127 Aug  8 15:38 multiatlas/tseg_left_train000/greedy_atlas_to_subj_affine.mat

Do you have any other suggestions for me to try?

Thank you,
Natasha

@pyushkevich (Owner) commented Aug 8, 2023 via email

@npavlovikj (Author)

Hi Paul,

I added sleep 900 after Stage 2, but I got the same error soon after the first Stage 2 job started:

**************** !! ERROR !! *******************
Validity check at end of stage 2 detected 
missing files. 
(multiatlas/tseg_left_train000/greedy_atlas_to_s
ubj_affine.mat and 231 other files).
************************************************

As for echoing the commands in ashs_lib.sh, do you know which variable that is?

I tried echoing MISSFILE, CHK_FILE, MADIR and TDIR, and these are the results I got:

$ echo $MISSFILE
/lustre/work/project/npavlovikj/ashs/2.0.0/sub07_output_slurm_parallel/.missing
$ echo $CHK_FILE

$ echo $MADIR
/lustre/work/project/npavlovikj/ashs/2.0.0/sub07_output_slurm_parallel/multiatlas

$ echo $TDIR
tseg_right_train028

All the printed paths and files are valid.
The .missing file lists 232 files; I checked a few of them, and they all exist on the file system:

[npavlovikj@login2 sub07_output_slurm_parallel]$ head -n3 .missing 
/lustre/work/project/npavlovikj/ashs/2.0.0/sub07_output_slurm_parallel/multiatlas/tseg_left_train000/greedy_atlas_to_subj_affine.mat
/lustre/work/project/npavlovikj/ashs/2.0.0/sub07_output_slurm_parallel/multiatlas/tseg_left_train000/greedy_atlas_to_subj_warp.nii.gz
/lustre/work/project/npavlovikj/ashs/2.0.0/sub07_output_slurm_parallel/multiatlas/tseg_left_train001/greedy_atlas_to_subj_affine.mat
[npavlovikj@login2 sub07_output_slurm_parallel]$ ls -la /lustre/work/project/npavlovikj/ashs/2.0.0/sub07_output_slurm_parallel/multiatlas/tseg_left_train000/greedy_atlas_to_subj_affine.mat
-rw-r--r-- 1 npavlovikj project 126 Aug  8 17:02 /lustre/work/project/npavlovikj/ashs/2.0.0/sub07_output_slurm_parallel/multiatlas/tseg_left_train000/greedy_atlas_to_subj_affine.mat
[npavlovikj@login2 sub07_output_slurm_parallel]$ ls -la /lustre/work/project/npavlovikj/ashs/2.0.0/sub07_output_slurm_parallel/multiatlas/tseg_left_train000/greedy_atlas_to_subj_warp.nii.gz
-rw-r--r-- 1 npavlovikj project 4339419 Aug  8 17:03 /lustre/work/project/npavlovikj/ashs/2.0.0/sub07_output_slurm_parallel/multiatlas/tseg_left_train000/greedy_atlas_to_subj_warp.nii.gz
[npavlovikj@login2 sub07_output_slurm_parallel]$ ls -la /lustre/work/project/npavlovikj/ashs/2.0.0/sub07_output_slurm_parallel/multiatlas/tseg_left_train001/greedy_atlas_to_subj_affine.mat
-rw-r--r-- 1 npavlovikj project 133 Aug  8 17:02 /lustre/work/project/npavlovikj/ashs/2.0.0/sub07_output_slurm_parallel/multiatlas/tseg_left_train001/greedy_atlas_to_subj_affine.mat
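
For completeness, the whole list could be checked in one pass with something like this (a small sketch, run from the work directory):

# Report any entry in .missing that is not visible on the file system
while read -r f; do
  [ -e "$f" ] || echo "still missing: $f"
done < .missing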

Thank you,
Natasha

@npavlovikj (Author)

Hi Paul,

I have been experimenting with the sleep command, and I figured out that the timing of when the files are written is what causes the issue: there is some delay before the files become visible (perhaps because they come from SLURM jobs that have not finished yet).

I tried adding sleep 900 before the line

cat $LISTFILE | sort -n -t ',' | while read LINE; do

in ashs_lib.sh. While the number of reported missing files was reduced, some were still reported missing at Stage 3 and the run failed.

Then I increased the sleep to 1800 seconds, and the run using SLURM with ASHS 2.0.0 completed successfully.

Our /work shared file system runs Lustre. Waiting 30 minutes for all files to be written seems like a long time. Would it be possible to change the code to address this, perhaps by waiting for all SLURM jobs to finish before checking for missing files?
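
A rough sketch of that idea (this is not ASHS code; the helper name and timings are made up): before the validity check, poll until every expected output is visible or a timeout expires. On Lustre, listing the parent directory can also nudge a client's cached metadata:

# Hypothetical helper: wait up to $1 seconds for all given paths to appear
wait_for_files() {
  local timeout=$1; shift
  local deadline=$(( $(date +%s) + timeout ))
  local f
  for f in "$@"; do
    until [ -s "$f" ]; do
      ls "$(dirname "$f")" > /dev/null 2>&1   # refresh cached metadata
      [ "$(date +%s)" -ge "$deadline" ] && return 1
      sleep 10
    done
  done
}

# e.g.: wait_for_files 1800 $(cat .missing)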

Thank you,
Natasha
