Ray on HiSeq-2500-NA12878-demo-2x150 using titan.ccs.ornl.gov #197

Closed
sebhtml opened this Issue Sep 23, 2013 · 36 comments

sebhtml commented Sep 23, 2013


# Batch script to carry out computation on retrieved data
# PBS directives
#PBS -N HiSeq-2500-NA12878-demo-2x150-3
#PBS -l walltime=12:00:00 
#PBS -l nodes=313
#PBS -A LSC005
#PBS -l gres=widow1

# Launch executable

cd $PBS_O_WORKDIR

#module load PrgEnv-pgi/4.1.40
#pgi/12.10.0
#module load cray-mpich2/5.6.3

#module load lsc005/Ray/2.2.0-1

#/tmp/proj/lsc005/software/lsc005/Ray/2.2.0-1/bin/Ray \

aprun -n 5008 \
./software/lsc005/Ray/2.3.0-devel-3dd4ef5304c-1/bin/Ray \
-k 31 \
-detect-sequence-files HiSeq-2500-NA12878-demo-2x150 \
-o HiSeq-2500-NA12878-demo-2x150-3 \

sebhtml commented Sep 23, 2013

Sequences

sebhtml@titan-ext1:~/lsc005/projects/human-1-hour> cat HiSeq-2500-NA12878-demo-2x150-3/FilePartition.txt 
#File   Name    FirstSequence   LastSequence    NumberOfSequences
0   HiSeq-2500-NA12878-demo-2x150/sorted_S1_L001_R1_001.fastq.gz    0   143818692   143818693
1   HiSeq-2500-NA12878-demo-2x150/sorted_S1_L001_R2_001.fastq.gz    143818693   287637385   143818693
2   HiSeq-2500-NA12878-demo-2x150/sorted_S1_L001_R1_002.fastq.gz    287637386   437610805   149973420
3   HiSeq-2500-NA12878-demo-2x150/sorted_S1_L001_R2_002.fastq.gz    437610806   587584225   149973420
4   HiSeq-2500-NA12878-demo-2x150/sorted_S1_L002_R1_001.fastq.gz    587584226   731879531   144295306
5   HiSeq-2500-NA12878-demo-2x150/sorted_S1_L002_R2_001.fastq.gz    731879532   876174837   144295306
6   HiSeq-2500-NA12878-demo-2x150/sorted_S1_L002_R1_002.fastq.gz    876174838   1023766068  147591231
7   HiSeq-2500-NA12878-demo-2x150/sorted_S1_L002_R2_002.fastq.gz    1023766069  1171357299  147591231
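
The partition table above can be sanity-checked with a few lines of Python. This is only a sketch: the rows are copied from the listing, and the helper name `check_partition` is made up for illustration. It verifies that each file starts right after the previous one ends and that LastSequence - FirstSequence + 1 equals NumberOfSequences.

```python
# (FirstSequence, LastSequence, NumberOfSequences) copied from FilePartition.txt above.
ROWS = [
    (0, 143818692, 143818693),
    (143818693, 287637385, 143818693),
    (287637386, 437610805, 149973420),
    (437610806, 587584225, 149973420),
    (587584226, 731879531, 144295306),
    (731879532, 876174837, 144295306),
    (876174838, 1023766068, 147591231),
    (1023766069, 1171357299, 147591231),
]


def check_partition(rows):
    """Return the total sequence count; raise ValueError on a gap or a bad count."""
    expected_first = 0
    for first, last, count in rows:
        if first != expected_first:
            raise ValueError(f"gap or overlap at sequence {first}")
        if last - first + 1 != count:
            raise ValueError(f"count mismatch for range {first}-{last}")
        expected_first = last + 1
    return expected_first


total = check_partition(ROWS)
print(total, "sequences in", len(ROWS), "files")
```

The ranges are contiguous, so the last index plus one gives the total number of reads.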

sebhtml commented Sep 23, 2013

Each titan node has:
( https://www.olcf.ornl.gov/support/system-user-guides/titan-user-guide/ )

16 cores
32 GiB ram

NVIDIA KEPLER

313 * 16 = 5008 MPI ranks

32 GiB / 16 cores = 2 GiB per core.
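
The same arithmetic as a small sketch, so other layouts can be compared quickly. The node figures come from the list above; the 8-ranks-per-node layouts are the ones tried later in this thread, and the function name `layout` is only illustrative.

```python
CORES_PER_NODE = 16    # from the node description above
RAM_PER_NODE_GIB = 32  # from the node description above


def layout(nodes, ranks_per_node):
    """Return (total MPI ranks, GiB of node RAM available per rank)."""
    if ranks_per_node > CORES_PER_NODE:
        raise ValueError("more ranks than cores on a node")
    return nodes * ranks_per_node, RAM_PER_NODE_GIB / ranks_per_node


for nodes, ranks_per_node in [(313, 16), (313, 8), (626, 8)]:
    ranks, gib = layout(nodes, ranks_per_node)
    print(f"{nodes} nodes x {ranks_per_node} ranks/node = {ranks} ranks, {gib:g} GiB per rank")
```

Halving the ranks per node doubles the memory share of each rank, at the price of twice as many nodes for the same rank count.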

sebhtml commented Sep 23, 2013

Latency is very high:

sebhtml@titan-ext1:~/lsc005/projects/human-1-hour> head HiSeq-2500-NA12878-demo-2x150-3/NetworkTest.txt 
# average and mode round trip latency in microseconds (10^-6 seconds) when requesting a reply for a message of 4000 bytes
# MessagePassingInterfaceRank   Name    ModeLatencyInMicroseconds   AverageLatencyInMicroseconds    NumberOfExchanges
# AverageForAllRanks: 299.679
# StandardDeviation: 31.685
0   nid12147    30  116 1000
1   nid12147    34  313 1000
2   nid12147    148 279 1000
3   nid12147    184 298 1000
4   nid12147    28  287 1000
5   nid12147    30  301 1000
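
To see how far the average sits from the mode for each rank, the file can be parsed along these lines. This is a sketch that assumes the whitespace-separated layout shown above; the sample is the posted head of the file and `parse_network_test` is a made-up helper name.

```python
SAMPLE = """\
# AverageForAllRanks: 299.679
# StandardDeviation: 31.685
0   nid12147    30  116 1000
1   nid12147    34  313 1000
2   nid12147    148 279 1000
3   nid12147    184 298 1000
4   nid12147    28  287 1000
5   nid12147    30  301 1000
"""


def parse_network_test(text):
    """Yield (rank, node, mode_us, average_us, exchanges) for each data line."""
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        rank, node, mode_us, average_us, exchanges = line.split()
        yield int(rank), node, int(mode_us), float(average_us), int(exchanges)


rows = list(parse_network_test(SAMPLE))
for rank, node, mode_us, average_us, _ in rows:
    ratio = average_us / mode_us
    print(f"rank {rank} on {node}: mode {mode_us} us, average {average_us:g} us, ratio {ratio:.1f}")
```

For several of these ranks the mode is around 30 microseconds while the average is close to 300, which suggests that a minority of slow exchanges pulls the average up.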

sebhtml commented Sep 23, 2013

Memory usage is already above 3 GiB per rank when Ray starts (?)

sebhtml@titan-ext1:~/lsc005/projects/human-1-hour> grep memory  HiSeq-2500-NA12878-demo-2x150-3.o1732882|head
Rank 77: assembler memory usage: 3251836 KiB
Rank 78: assembler memory usage: 3251836 KiB
Rank 77: assembler memory usage: 3317568 KiB
Rank 78: assembler memory usage: 3317568 KiB
Rank 63: assembler memory usage: 3251836 KiB
Rank 51: assembler memory usage: 3251836 KiB
Rank 3861: assembler memory usage: 3251836 KiB
Rank 1645: assembler memory usage: 3251836 KiB
Rank 1639: assembler memory usage: 3251836 KiB
Rank 51: assembler memory usage: 3317568 KiB
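
A sketch for turning these log lines into per-rank peaks in GiB and comparing them with the 2 GiB share that each rank gets with 16 ranks on a 32 GiB node. The sample lines are taken from the output above; the log does not say whether the figure is virtual or resident memory, so the comparison is only indicative. The helper name `peak_usage_gib` is hypothetical.

```python
import re

LOG = """\
Rank 77: assembler memory usage: 3251836 KiB
Rank 78: assembler memory usage: 3251836 KiB
Rank 77: assembler memory usage: 3317568 KiB
Rank 51: assembler memory usage: 3251836 KiB
Rank 51: assembler memory usage: 3317568 KiB
"""

PATTERN = re.compile(r"Rank (\d+): assembler memory usage: (\d+) KiB")
SHARE_GIB = 32 / 16  # per-rank share with 16 ranks on a 32 GiB node


def peak_usage_gib(log_text):
    """Map each rank to its largest reported usage, converted from KiB to GiB."""
    peaks = {}
    for rank, kib in PATTERN.findall(log_text):
        gib = int(kib) / 1024**2
        peaks[int(rank)] = max(gib, peaks.get(int(rank), 0.0))
    return peaks


peaks = peak_usage_gib(LOG)
for rank, gib in sorted(peaks.items()):
    print(f"rank {rank}: {gib:.2f} GiB reported, {gib / SHARE_GIB:.2f}x the per-rank share")
```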

sebhtml commented Sep 23, 2013

The 5008 ranks run on 313 distinct nodes, so every machine has 16 MPI ranks:

sebhtml@titan-ext1:~/lsc005/projects/human-1-hour> grep -v ^# HiSeq-2500-NA12878-demo-2x150-3/NetworkTest.txt | awk '{print $2}'|sort|uniq -c|wc -l
313
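
The pipeline above counts distinct node names. To confirm that each node really holds the same number of ranks, the per-node counts can be inspected as well. A minimal sketch, assuming the NetworkTest.txt layout posted earlier in this thread (node name in the second column); `ranks_per_node` is an illustrative name.

```python
from collections import Counter


def ranks_per_node(network_test_text):
    """Count MPI ranks per node name (column 2), skipping '#' comment lines."""
    counts = Counter()
    for line in network_test_text.splitlines():
        if line.strip() and not line.startswith("#"):
            counts[line.split()[1]] += 1
    return counts


# First data lines of NetworkTest.txt as posted earlier in this thread.
HEAD = """\
# AverageForAllRanks: 299.679
0   nid12147    30  116 1000
1   nid12147    34  313 1000
2   nid12147    148 279 1000
"""

counts = ranks_per_node(HEAD)
print(len(counts), "node(s):", dict(counts))
```

On the full file, `len(counts)` should give 313 and `set(counts.values())` should be `{16}` if the placement is uniform.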

sebhtml commented Sep 23, 2013

error messages:


MPICH2 ERROR [Rank 1227] [job id 3577704] [Mon Sep 16 20:34:24 2013] [c19-4c0s2n1] [nid12091] - MPIU_nem_gni_get_hugepages(): Unable to mmap 12582912 bytes for file /var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.2.27853.kvs_3577704, err Cannot allocate memory
MPICH2 ERROR [Rank 1227] [job id 3577704] [Mon Sep 16 20:34:24 2013] [c19-4c0s2n1] [nid12091] - MPIU_nem_gni_get_hugepages(): large page stats: free 0 nr 158 nr_overcommit 16154 resv 0 surplus 158
MPICH2 ERROR [Rank 1230] [job id 3577704] [Mon Sep 16 20:34:24 2013] [c19-4c0s2n1] [nid12091] - MPIU_nem_gni_get_hugepages(): Unable to mmap 12582912 bytes for file /var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.2.27856.kvs_3577704, err Cannot allocate memory
MPICH2 ERROR [Rank 1230] [job id 3577704] [Mon Sep 16 20:34:24 2013] [c19-4c0s2n1] [nid12091] - MPIU_nem_gni_get_hugepages(): large page stats: free 0 nr 165 nr_overcommit 16154 resv 0 surplus 165
MPICH2 ERROR [Rank 4378] [job id 3577704] [Mon Sep 16 20:34:24 2013] [c0-2c1s6n0] [nid00114] - MPIU_nem_gni_get_hugepages(): Unable to mmap 12582912 bytes for file /var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.2.24160.kvs_3577704, err Cannot allocate memory
MPICH2 ERROR [Rank 4378] [job id 3577704] [Mon Sep 16 20:34:24 2013] [c0-2c1s6n0] [nid00114] - MPIU_nem_gni_get_hugepages(): large page stats: free 0 nr 173 nr_overcommit 16154 resv 0 surplus 173
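
The numbers in these messages can be decoded as follows. This is a sketch: the two strings are copied from the log above, the helper names are made up, and reading the "large page stats" fields as the kernel's huge page counters is an assumption.

```python
import re

ERROR = (
    "MPIU_nem_gni_get_hugepages(): Unable to mmap 12582912 bytes for file "
    "/var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.2.27853.kvs_3577704, "
    "err Cannot allocate memory"
)
STATS = "large page stats: free 0 nr 158 nr_overcommit 16154 resv 0 surplus 158"


def pages_requested(error_line):
    """Return (bytes requested, huge page size in bytes, huge pages needed)."""
    size = int(re.search(r"Unable to mmap (\d+) bytes", error_line).group(1))
    page = int(re.search(r"pagesize-(\d+)", error_line).group(1))
    return size, page, -(-size // page)  # ceiling division


def parse_stats(stats_line):
    """Turn 'free 0 nr 158 ...' into a dict of integer counters."""
    tokens = stats_line.split("stats:")[1].split()
    return {name: int(value) for name, value in zip(tokens[0::2], tokens[1::2])}


size, page, pages = pages_requested(ERROR)
stats = parse_stats(STATS)
print(f"{size} bytes = {pages} huge pages of {page // 2**20} MiB; free huge pages: {stats['free']}")
```

Each failing call asks for 12 MiB, that is 6 pages of 2 MiB, while the node reports no free huge pages.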

sebhtml commented Sep 26, 2013

Report info at tick = 0 and add VmRSS to the -debug output.

sebhtml commented Sep 26, 2013

Add all of these fields (from /proc/<pid>/status) on Linux:

VmPeak: 108964 kB
VmSize: 108960 kB
VmLck: 0 kB
VmPin: 0 kB
VmHWM: 872 kB
VmRSS: 872 kB
VmData: 196 kB
VmStk: 140 kB
VmExe: 132 kB
VmLib: 1992 kB
VmPTE: 60 kB
VmSwap: 0 kB
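
For illustration, here is how these fields could be collected. Ray itself is C++, so this Python sketch only shows the parsing logic, not how Ray does it; `vm_fields` is a hypothetical helper and the sample is a subset of the values listed above.

```python
def vm_fields(status_text):
    """Extract the Vm* entries of a /proc/<pid>/status dump as {name: kB}."""
    fields = {}
    for line in status_text.splitlines():
        name, _, rest = line.partition(":")
        if name.startswith("Vm") and rest.strip():
            fields[name] = int(rest.split()[0])
    return fields


SAMPLE = """\
VmPeak: 108964 kB
VmSize: 108960 kB
VmHWM: 872 kB
VmRSS: 872 kB
"""

fields = vm_fields(SAMPLE)
print(fields["VmRSS"], "kB resident,", fields["VmSize"], "kB virtual")

# On Linux, the live values for the current process could be read with:
# with open("/proc/self/status") as handle:
#     print(vm_fields(handle.read()))
```

VmRSS is the resident set and VmSize the virtual size, so reporting both helps tell real memory pressure from address space that is merely reserved.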

sebhtml commented Sep 26, 2013

To build it:

module purge
module load PrgEnv-intel/4.1.40
module load cray-mpich2/5.6.3
make MPICXX=CC CXXFLAGS="-xHOST -O3 -static" -j 4 HAVE_LIBZ=y clean
make MPICXX=CC CXXFLAGS="-xHOST -O3 -static" -j 4 HAVE_LIBZ=y

sebhtml commented Sep 26, 2013

iteration 4:

sebhtml@titan-ext3:/tmp/proj/lsc005/projects/human-1-hour> cat HiSeq-2500-NA12878-demo-2x150-4.sh
#PBS -N HiSeq-2500-NA12878-demo-2x150-4
#PBS -l walltime=3:00:00 
#PBS -l nodes=313
#PBS -A LSC005
#PBS -l gres=widow1

cd $PBS_O_WORKDIR

# 313 * 8 * 2 = 5008

aprun -n 5008 -S 8 \
./software/lsc005/Ray/c610ae8670e1627bc41a64bbde18ac8f658b131f-1/Ray \
-k 31 \
-detect-sequence-files HiSeq-2500-NA12878-demo-2x150 \
-o HiSeq-2500-NA12878-demo-2x150-4 \


sebhtml@titan-ext3:/tmp/proj/lsc005/projects/human-1-hour> qsub HiSeq-2500-NA12878-demo-2x150-4.sh
1742436

sebhtml@titan-ext3:/tmp/proj/lsc005/projects/human-1-hour> showq | grep 1742436
1742436             sebhtml       Idle  5008     3:00:00  Thu Sep 26 15:53:42

sebhtml commented Sep 26, 2013

Needs -debug:

sebhtml@titan-ext3:/tmp/proj/lsc005/projects/human-1-hour> cat HiSeq-2500-NA12878-demo-2x150-4.sh
#PBS -N HiSeq-2500-NA12878-demo-2x150-4
#PBS -l walltime=3:00:00 
#PBS -l nodes=313
#PBS -A LSC005
#PBS -l gres=widow1

cd $PBS_O_WORKDIR

# 313 * 8 * 2 = 5008

aprun -n 5008 -S 8 \
./software/lsc005/Ray/c610ae8670e1627bc41a64bbde18ac8f658b131f-1/Ray \
-debug \
-k 31 \
-detect-sequence-files HiSeq-2500-NA12878-demo-2x150 \
-o HiSeq-2500-NA12878-demo-2x150-4 \

sebhtml@titan-ext3:/tmp/proj/lsc005/projects/human-1-hour> qsub HiSeq-2500-NA12878-demo-2x150-4.sh
1742711

sebhtml commented Oct 1, 2013

iteration 5:

Carlos P. Sosa told me to use aprun -n 5008 -N 16

titan> cat HiSeq-2500-NA12878-demo-2x150-5.sh
#PBS -N HiSeq-2500-NA12878-demo-2x150-5
#PBS -l walltime=12:00:00 
#PBS -l nodes=313
#PBS -A LSC005
#PBS -l gres=widow1

cd $PBS_O_WORKDIR

# 313 * 8 * 2 = 5008

aprun -n 5008 -N 16 \
./software/lsc005/Ray/c610ae8670e1627bc41a64bbde18ac8f658b131f-1/Ray \
-debug \
-k 31 \
-detect-sequence-files HiSeq-2500-NA12878-demo-2x150 \
-o HiSeq-2500-NA12878-demo-2x150-5 \

titan> qsub HiSeq-2500-NA12878-demo-2x150-5.sh
1747928

sebhtml commented Oct 21, 2013

With -debug:

titan> cat HiSeq-2500-NA12878-demo-2x150-7.sh
#PBS -N HiSeq-2500-NA12878-demo-2x150-7
#PBS -l walltime=12:00:00 
#PBS -l nodes=313
#PBS -A LSC005
#PBS -l gres=widow1

cd $PBS_O_WORKDIR

# 313 * 8 * 2 = 5008
#-debug \

aprun -n 5008 -N 16 \
./software/lsc005/Ray/616d2a26cc1e39f59325a0e632af46262edaa12c-1/Ray \
-debug \
-k 31 \
-detect-sequence-files HiSeq-2500-NA12878-demo-2x150 \
-o HiSeq-2500-NA12878-demo-2x150-7 \

titan> qsub HiSeq-2500-NA12878-demo-2x150-7.sh
1763643

sebhtml commented Oct 28, 2013

In 4 hours, Ray loads the data, builds the graph, computes the libraries, and traverses the graph.

titan> cat HiSeq-2500-NA12878-demo-2x150-8/ElapsedTime.txt 
#Step                                           Date                 Elapsed time                     Since Beginning
Network testing                                 2013-10-26T22:03:57  12 seconds                       12 seconds
Counting sequences to assemble                  2013-10-26T22:20:12  16 minutes, 15 seconds           16 minutes, 27 seconds
Sequence loading                                2013-10-26T23:28:44  1 hours, 8 minutes, 32 seconds   1 hours, 24 minutes, 59 seconds
K-mer counting                                  2013-10-26T23:34:33  5 minutes, 49 seconds            1 hours, 30 minutes, 48 seconds
Coverage distribution analysis                  2013-10-26T23:34:40  7 seconds                        1 hours, 30 minutes, 55 seconds
Graph construction                              2013-10-26T23:44:11  9 minutes, 31 seconds            1 hours, 40 minutes, 26 seconds
Null edge purging                               2013-10-26T23:46:01  1 minutes, 50 seconds            1 hours, 42 minutes, 16 seconds
Selection of optimal read markers               2013-10-27T00:03:12  17 minutes, 11 seconds           1 hours, 59 minutes, 27 seconds
Detection of assembly seeds                     2013-10-27T00:09:52  6 minutes, 40 seconds            2 hours, 6 minutes, 7 seconds
Estimation of outer distances for paired reads  2013-10-27T00:11:58  2 minutes, 6 seconds             2 hours, 8 minutes, 13 seconds
Bidirectional extension of seeds                2013-10-27T02:07:27  1 hours, 55 minutes, 29 seconds  4 hours, 3 minutes, 42 seconds
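
To see where the time goes, the "Elapsed time" column can be converted to seconds and ranked. A sketch with the durations copied from the table above; `to_seconds` is an illustrative helper, not part of Ray.

```python
import re

STEPS = {
    "Network testing": "12 seconds",
    "Counting sequences to assemble": "16 minutes, 15 seconds",
    "Sequence loading": "1 hours, 8 minutes, 32 seconds",
    "K-mer counting": "5 minutes, 49 seconds",
    "Coverage distribution analysis": "7 seconds",
    "Graph construction": "9 minutes, 31 seconds",
    "Null edge purging": "1 minutes, 50 seconds",
    "Selection of optimal read markers": "17 minutes, 11 seconds",
    "Detection of assembly seeds": "6 minutes, 40 seconds",
    "Estimation of outer distances for paired reads": "2 minutes, 6 seconds",
    "Bidirectional extension of seeds": "1 hours, 55 minutes, 29 seconds",
}
UNIT_SECONDS = {"hours": 3600, "minutes": 60, "seconds": 1}


def to_seconds(duration):
    """Convert a string such as '1 hours, 8 minutes, 32 seconds' to seconds."""
    parts = re.findall(r"(\d+) (hours|minutes|seconds)", duration)
    return sum(int(number) * UNIT_SECONDS[unit] for number, unit in parts)


seconds = {step: to_seconds(text) for step, text in STEPS.items()}
total = sum(seconds.values())
for step, value in sorted(seconds.items(), key=lambda item: -item[1]):
    print(f"{value / total:6.1%}  {step}")
```

The per-step durations add up to the 4 hours, 3 minutes, 42 seconds shown in the last row; seed extension and sequence loading together account for roughly three quarters of it.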

As expected, the merging must be improved.

titan> tail HiSeq-2500-NA12878-demo-2x150-8/NumberOfSequences.txt 
    FilePath: HiSeq-2500-NA12878-demo-2x150/sorted_S1_L002_R2_002.fastq.gz
    NumberOfSequences: 147591231
    FirstSequence: 1023766069
    LastSequence: 1171357299


Summary
    NumberOfSequences: 1171357300
    FirstSequence: 0
    LastSequence: 1171357299

Let's do a bigger job now !!!

sebhtml commented Oct 28, 2013

Script for the 4-hour incomplete run:

titan> cat HiSeq-2500-NA12878-demo-2x150-8.sh
#PBS -N HiSeq-2500-NA12878-demo-2x150-8
#PBS -l walltime=12:00:00 
#PBS -l nodes=313
#PBS -A LSC005
#PBS -l gres=widow1

cd $PBS_O_WORKDIR

# 313 * 8 * 2 = 5008
# 313 * 8 * 1 = 2504
#-debug \

aprun -n 2504 \
./software/lsc005/Ray/616d2a26cc1e39f59325a0e632af46262edaa12c-1/Ray \
-k 31 \
-detect-sequence-files HiSeq-2500-NA12878-demo-2x150 \
-o HiSeq-2500-NA12878-demo-2x150-8 \


sebhtml commented Oct 28, 2013

With 3750 nodes, I can run for 24 hours!

https://www.olcf.ornl.gov/kb_articles/titan-scheduling-policy/

Accounting: 3750 * 30 * 24 = 2700000 (we don't have enough for this).

We have 250000 for this fall.

Let's try with 3750 nodes, 8 ranks per node, for 30000 ranks.
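
The budget arithmetic can be sketched as below, reading the accounting line as nodes x 30 x hours. Treating 30 as a per-node-hour charge factor is an assumption taken from that formula, and the function names are illustrative.

```python
CHARGE_PER_NODE_HOUR = 30  # the factor 30 used in the accounting line above (assumed per node-hour)
BUDGET = 250000            # allocation quoted above for this fall


def cost(nodes, hours):
    """Allocation units charged for a job, using the same formula as above."""
    return nodes * CHARGE_PER_NODE_HOUR * hours


def max_hours(nodes, budget=BUDGET):
    """Longest run, in hours, that the budget would cover at this node count."""
    return budget / (nodes * CHARGE_PER_NODE_HOUR)


print(cost(3750, 24))                       # the 24-hour run considered above
print(f"{max_hours(3750):.2f} hours fit in the budget on 3750 nodes")
print(cost(626, 12), "for a 626-node, 12-hour job")
```

Under this reading, the whole fall allocation would last a little over two hours on 3750 nodes, while a 626-node, 12-hour job such as the ones submitted later in this thread stays below the budget.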

sebhtml commented Oct 28, 2013

titan> cat HiSeq-2500-NA12878-demo-2x150-9.sh
#PBS -N HiSeq-2500-NA12878-demo-2x150-9
#PBS -l walltime=00:12:00:00 
#PBS -l nodes=626
#PBS -A LSC005
#PBS -l gres=widow1

cd $PBS_O_WORKDIR

# 626 * 8 = 5008

aprun -n 5008 \
./software/lsc005/Ray/616d2a26cc1e39f59325a0e632af46262edaa12c-1/Ray \
-k 31 \
-detect-sequence-files HiSeq-2500-NA12878-demo-2x150 \
-o HiSeq-2500-NA12878-demo-2x150-9 \

titan> qsub HiSeq-2500-NA12878-demo-2x150-9.sh
1769459

sebhtml commented Nov 7, 2013

job -9 vanished, that's strange:

titan> showq | grep boisv
titan> ls|grep HiSeq-2500-NA12878-demo-2x150-9
HiSeq-2500-NA12878-demo-2x150-9.sh

Let's resubmit as -10:

titan> cat HiSeq-2500-NA12878-demo-2x150-10.sh
#PBS -N HiSeq-2500-NA12878-demo-2x150-10
#PBS -l walltime=00:12:00:00
#PBS -l nodes=626
#PBS -A LSC005
#PBS -l gres=widow1

cd $PBS_O_WORKDIR

# 626 * 8 = 5008

aprun -n 5008 \
./software/lsc005/Ray/616d2a26cc1e39f59325a0e632af46262edaa12c-1/Ray \
-k 31 \
-detect-sequence-files HiSeq-2500-NA12878-demo-2x150 \
-o HiSeq-2500-NA12878-demo-2x150-10 \

titan> qsub HiSeq-2500-NA12878-demo-2x150-10.sh
1778289

sebhtml commented Nov 15, 2013

Waiting time:

Hi Jacques,

My jobs have been waiting in the queue for 18 and 8 days, respectively.

titan> showq | grep sebht
1769459 sebhtml Idle 10016 12:00:00 Mon Oct 28 14:19:19
1778289 sebhtml Idle 10016 12:00:00 Thu Nov 7 11:17:21

titan> checkjob 1769459|head
job 1769459

AName: HiSeq-2500-NA12878-demo-2x150-9
State: Idle
Creds: user:sebhtml group:sebhtml account:LSC005 class:batch qos:bin0
WallTime: 00:00:00 of 12:00:00
BecameEligible: Fri Nov 15 12:27:27
SubmitTime: Mon Oct 28 14:19:19
(Time Queued Total: 18:00:01:21 Eligible: 17:21:30:02)

titan> checkjob 1778289|head

job 1778289

AName: HiSeq-2500-NA12878-demo-2x150-10
State: Idle
Creds: user:sebhtml group:sebhtml account:LSC005 class:batch qos:bin0
WallTime: 00:00:00 of 12:00:00
BecameEligible: Fri Nov 15 12:27:27
SubmitTime: Thu Nov 7 11:17:21
(Time Queued Total: 8:02:03:34 Eligible: 7:23:39:24)

macmanes commented Dec 3, 2013

Trillian (UNH Cray XE6, http://trillian-use.sr.unh.edu/index.php/Main_Page) does not like this make command:

make MPICXX=CC CXXFLAGS="-xHOST -O3 -static" -j 4 HAVE_LIBZ=y

It complains that -xHOST is an invalid command line flag.

Did you ever solve the latency issue? I see high latency here, too.

sebhtml commented Dec 6, 2013

Which compiler are you using? I think -xHOST is an Intel compiler flag.

sebhtml commented Dec 9, 2013

Update for jobs -9 and -10:

Hi Jacques,

Regarding titan:

My best shot so far:

"In 4 hours, Ray loads the data, builds the graph, computes the libraries, and traverses the graph."
(2013-10-28)

Last 2 jobs

However, my last 2 jobs both failed (I increased the number of cores, and this exposed the same problem
in the caching subsystem of the nodes).

Job: HiSeq-2500-NA12878-demo-2x150-9 # 1769459

titan> cat HiSeq-2500-NA12878-demo-2x150-9.sh
#PBS -N HiSeq-2500-NA12878-demo-2x150-9
#PBS -l walltime=00:12:00:00
#PBS -l nodes=626
#PBS -A LSC005
#PBS -l gres=widow1

cd $PBS_O_WORKDIR

# 626 * 8 = 5008

aprun -n 5008 \
./software/lsc005/Ray/616d2a26cc1e39f59325a0e632af46262edaa12c-1/Ray \
-k 31 \
-detect-sequence-files HiSeq-2500-NA12878-demo-2x150 \
-o HiSeq-2500-NA12878-demo-2x150-9 \

Like last time, this is a problem with cached content in the VFS layer of Lustre.

MPIU_nem_gni_get_hugepages(): Unable to mmap 12582912 bytes for file /var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.2.16799.kvs_3928360, err Cannot allocate memory

Job: HiSeq-2500-NA12878-demo-2x150-10 # 1778289

titan> cat HiSeq-2500-NA12878-demo-2x150-10.sh
#PBS -N HiSeq-2500-NA12878-demo-2x150-10
#PBS -l walltime=00:12:00:00
#PBS -l nodes=626
#PBS -A LSC005
#PBS -l gres=widow1

cd $PBS_O_WORKDIR

# 626 * 8 = 5008

aprun -n 5008 \
./software/lsc005/Ray/616d2a26cc1e39f59325a0e632af46262edaa12c-1/Ray \
-k 31 \
-detect-sequence-files HiSeq-2500-NA12878-demo-2x150 \
-o HiSeq-2500-NA12878-demo-2x150-10 \

Same here:

MPIU_nem_gni_get_hugepages(): Unable to mmap 12582912 bytes for file /var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.2.23067.kvs_3999472, err Cannot allocate memory

What support says about this

The ticket with ORNL people is "Re: [CCS #177295] MPICH on titan uses a lot of memory (?)".

The last response I got was from 2013-10-21:

Thanks Sebastien,

My gut tells me you're running out of memory per core. Hugepage is busting and
the max size is 2GB.
MPIU_nem_gni_get_hugepages(): large page stats: free 0 nr 211 nr_overcommit
16154 resv 0 surplus 211

The network is just the one to complain about it, but not necessarily the
cause.

Have you tried lowering the number of MPI processes to 8/node?

FF

(I am already at 8 processes per node. I think the problem is buggy caching, not memory usage by Ray.)

The issue is that cached pages in the VFS waste memory.

See below the /proc/meminfo:

[Rank 3758] Cat of /proc/meminfo
[Rank 3755]: MemTotal: 33084652 kB
[Rank 3755]: MemFree: 3984520 kB
[Rank 3755]: Buffers: 0 kB
[Rank 3755]: Cached: 22332700 kB ***************************
[Rank 3755]: SwapCached: 0 kB
[Rank 3755]: Active: 12556068 kB
[Rank 3755]: Inactive: 12527116 kB *************************
[Rank 3758]: MemTotal: 33084652 kB
[Rank 3755]: Active(anon): 2637848 kB
[Rank 3758]: MemFree: 3984892 kB
[Rank 3755]: Inactive(anon): 168920 kB
[Rank 3758]: Buffers: 0 kB
[Rank 3755]: Active(file): 9918220 kB
[Rank 3758]: Cached: 22332700 kB
[Rank 3755]: Inactive(file): 12358196 kB

That's about 22 gigabytes out of the 33 gigabytes on the node wasted on cache
by the operating system (Cached, which is close to Active(file) + Inactive(file)).
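
To pull these numbers out of a job's error file without reading the interleaved output by eye, something like the following could work. It is only a sketch: it assumes lines shaped like the ones above ("[Rank N]: Field: value kB"), and the function name and the choice of fields are mine.

```shell
#!/bin/sh
# Hypothetical helper: summarize /proc/meminfo lines for one rank.
# Reads lines such as "[Rank 3755]: Cached: 22332700 kB" on stdin and
# prints the selected fields in GB (1 GB = 1000000 kB), one per line.
meminfo_for_rank() {
    rank=$1
    awk -v tag="[Rank ${rank}]:" '
        # $1 and $2 form the "[Rank N]:" prefix, $3 is the field name,
        # $4 is the value in kB. Anything after that is ignored.
        ($1 " " $2) == tag && ($3 == "MemTotal:" || $3 == "MemFree:" || $3 == "Cached:") {
            printf "%s %.1f GB\n", $3, $4 / 1000000
        }
    '
}

# Example with three of the lines from the dump.
printf '%s\n' \
    '[Rank 3755]: MemTotal: 33084652 kB' \
    '[Rank 3755]: MemFree: 3984520 kB' \
    '[Rank 3755]: Cached: 22332700 kB ***************************' \
    | meminfo_for_rank 3755
```

Against a real log this would be used as `meminfo_for_rank 3755 < HiSeq-2500-NA12878-demo-2x150-9.e1769459`, where the file name is a guess based on the naming of the other error files in this thread.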

Ticket: #197

Séb

Owner

sebhtml commented Dec 16, 2013

Support said to try out the new storage:

https://www.olcf.ornl.gov/kb_articles/atlas-transition/

Owner

sebhtml commented Dec 23, 2013

Moving files to Atlas.

titan> mv /tmp/proj/lsc005/* /lustre/atlas/proj-shared/lsc005/

Owner

sebhtml commented Jan 6, 2014

Job with Atlas on Titan:

titan> pwd
/lustre/atlas/proj-shared/lsc005/projects/human-1-hour
titan> cat HiSeq-2500-NA12878-demo-2x150-11.sh
#PBS -N HiSeq-2500-NA12878-demo-2x150-11
#PBS -l walltime=00:12:00:00
#PBS -l nodes=626
#PBS -A LSC005

cd $PBS_O_WORKDIR

# 626 * 8 = 5008

aprun -n 5008 \
./software/lsc005/Ray/616d2a26cc1e39f59325a0e632af46262edaa12c-1/Ray \
-k 31 \
-detect-sequence-files HiSeq-2500-NA12878-demo-2x150 \
-o HiSeq-2500-NA12878-demo-2x150-11 \

titan> qsub HiSeq-2500-NA12878-demo-2x150-11.sh
1833464

titan> showq | grep 1833464
1833464 sebhtml Idle 10016 12:00:00 Mon Jan 6 11:43:18

I think this will start in about 1 month.
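
The 10016 in the showq line looks like the core count of the reservation, not the number of MPI ranks. Here is a small sketch of the arithmetic, assuming 16 cores per Titan compute node (that figure and the function name are assumptions, not something stated in this thread).

```shell
#!/bin/sh
# Hypothetical helper: relate a node count to MPI ranks and reserved cores.
# cores_per_node=16 is an assumption about Titan's compute nodes.
job_geometry() {
    nodes=$1
    ranks_per_node=$2
    cores_per_node=16
    echo "ranks=$(( nodes * ranks_per_node )) cores=$(( nodes * cores_per_node ))"
}

# 626 nodes with 8 ranks each, as in the job script.
job_geometry 626 8
```

This prints `ranks=5008 cores=10016`, which matches both the `aprun -n 5008` in the script and the 10016 reported by showq.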

Owner

sebhtml commented Feb 5, 2014

on titan: #228

Owner

sebhtml commented Feb 6, 2014

Job -11 failed because of a faulty symlink.

Owner

sebhtml commented Feb 6, 2014

Hi Jacques,

My Titan job seems to have started after the decommissioning of Spider, which I think was on 27 Jan 2014.

There was a faulty symbolic link, although my data was on Atlas.

titan> pwd
/ccs/home/sebhtml/lsc005-atlas/projects/human-1-hour
titan> cat HiSeq-2500-NA12878-demo-2x150-11.e1833464
aprun: file ./software/lsc005/Ray/616d2a26cc1e39f59325a0e632af46262edaa12c-1/Ray not found
aprun: Exiting due to errors. Application aborted

titan> readlink software
lsc005/software/
titan> readlink lsc005
/tmp/proj/lsc005
titan> file /tmp/proj/lsc005
/tmp/proj/lsc005: cannot open `/tmp/proj/lsc005' (No such file or directory)

titan> file ./software/lsc005/Ray/616d2a26cc1e39f59325a0e632af46262edaa12c-1/Ray
./software/lsc005/Ray/616d2a26cc1e39f59325a0e632af46262edaa12c-1/Ray: ELF 64-bit LSB executable, x86-64, version 1 (GNU/Linux), for GNU/Linux 2.6.4, statically linked, not stripped
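
A check like the one below, run before qsub, might have caught the dangling link. It is a minimal sketch: the function name is made up, and it relies on the usual shell behavior that `[ -L path ]` is true for a symlink while `[ -e path ]` follows the link and fails when the target is gone.

```shell
#!/bin/sh
# Hypothetical helper: report dangling symlinks among the given paths.
# A dangling symlink is a link whose target no longer exists.
# Returns 1 if at least one dangling link was found, 0 otherwise.
check_links() {
    rc=0
    for p in "$@"; do
        if [ -L "$p" ] && [ ! -e "$p" ]; then
            echo "dangling symlink: $p -> $(readlink "$p")"
            rc=1
        fi
    done
    return $rc
}
```

For the layout above it would be called as `check_links software lsc005` from the project directory. Both links resolve through /tmp/proj/lsc005, so both would be expected to show up once that path is gone.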

Owner

sebhtml commented Feb 6, 2014

-12

titan> vim HiSeq-2500-NA12878-demo-2x150-12.sh
titan> qsub HiSeq-2500-NA12878-demo-2x150-12.sh
1863329
titan> pwd
/ccs/home/sebhtml/lsc005/projects/human-1-hour

Owner

sebhtml commented Feb 7, 2014

Corrupted files on Titan (Atlas FS):

7 out of 8 fastq files vanished (strange), and the one that is left is only 684M instead of 18G.

Files on Titan (there was a problem with the FS):

titan> ls -lh HiSeq-2500-NA12878-demo-2x150/*gz
-rw------- 1 sebhtml lsc005 684M 2013-12-23 12:30 HiSeq-2500-NA12878-demo-2x150/sorted_S1_L001_R1_001.fastq.gz

What it is supposed to be:

[boisver1@ip03-mp2 data]$ ls -lh HiSeq-2500-NA12878-demo-2x150/*gz
-rw-rwxr-- 1 boisver1 corbeil 18G Nov 21 2012 HiSeq-2500-NA12878-demo-2x150/sorted_S1_L001_R1_001.fastq.gz
-rw-rwxr-- 1 boisver1 corbeil 19G Nov 21 2012 HiSeq-2500-NA12878-demo-2x150/sorted_S1_L001_R1_002.fastq.gz
-rw-rwxr-- 1 boisver1 corbeil 19G Nov 21 2012 HiSeq-2500-NA12878-demo-2x150/sorted_S1_L001_R2_001.fastq.gz
-rw-rwxr-- 1 boisver1 corbeil 19G Nov 22 2012 HiSeq-2500-NA12878-demo-2x150/sorted_S1_L001_R2_002.fastq.gz
-rw-rwxr-- 1 boisver1 corbeil 18G Nov 21 2012 HiSeq-2500-NA12878-demo-2x150/sorted_S1_L002_R1_001.fastq.gz
-rw-rwxr-- 1 boisver1 corbeil 18G Nov 21 2012 HiSeq-2500-NA12878-demo-2x150/sorted_S1_L002_R1_002.fastq.gz
-rw-rwxr-- 1 boisver1 corbeil 19G Nov 21 2012 HiSeq-2500-NA12878-demo-2x150/sorted_S1_L002_R2_001.fastq.gz
-rw-rwxr-- 1 boisver1 corbeil 19G Nov 21 2012 HiSeq-2500-NA12878-demo-2x150/sorted_S1_L002_R2_002.fastq.gz

oh well...

Owner

sebhtml commented Feb 7, 2014

Pulling from Sherbrooke to get the data again.

rsync -avzPL

/mnt/scratch_mp2/corbeil/corbeil_group/nne-790-ab/data/HiSeq-2500-NA12878-demo-2x150

Owner

sebhtml commented Feb 12, 2014

Data on Titan:

titan> ls /lustre/atlas/proj-shared/lsc005/projects/human-1-hour/HiSeq-2500-NA12878-demo-2x150/ -lh
total 145G
-rw-rwxr-- 1 sebhtml sebhtml 946 2012-11-21 15:16 11
-rw-rwxr-- 1 sebhtml sebhtml 1009 2012-11-21 15:16 12
-rw-rwxr-- 1 sebhtml sebhtml 328 2012-11-22 18:44 Counts
-rw-rwxr-- 1 sebhtml sebhtml 291 2012-11-22 13:52 Get.sh
-rw-rwxr-- 1 sebhtml sebhtml 889 2012-11-20 00:38 RawFiles.txt
-rw-r--r-- 1 sebhtml sebhtml 14 2012-11-21 13:10 README
-rw-rwxr-- 1 sebhtml sebhtml 584 2012-11-22 14:06 sha1sum.txt
-rw-rwxr-- 1 sebhtml sebhtml 18G 2012-11-21 15:17 sorted_S1_L001_R1_001.fastq.gz
-rw-rwxr-- 1 sebhtml sebhtml 523 2012-11-22 13:52 sorted_S1_L001_R1_001.fastq.gz.log
-rw-rwxr-- 1 sebhtml sebhtml 19G 2012-11-21 16:44 sorted_S1_L001_R1_002.fastq.gz
-rw-rwxr-- 1 sebhtml sebhtml 523 2012-11-22 13:52 sorted_S1_L001_R1_002.fastq.gz.log
-rw-rwxr-- 1 sebhtml sebhtml 19G 2012-11-21 15:17 sorted_S1_L001_R2_001.fastq.gz
-rw-rwxr-- 1 sebhtml sebhtml 602 2012-11-22 13:52 sorted_S1_L001_R2_001.fastq.gz.log
-rw-rwxr-- 1 sebhtml sebhtml 19G 2012-11-22 11:22 sorted_S1_L001_R2_002.fastq.gz
-rw-rwxr-- 1 sebhtml sebhtml 523 2012-11-22 13:52 sorted_S1_L001_R2_002.fastq.gz.log
-rw-rwxr-- 1 sebhtml sebhtml 18G 2012-11-21 16:38 sorted_S1_L002_R1_001.fastq.gz
-rw-rwxr-- 1 sebhtml sebhtml 602 2012-11-22 13:52 sorted_S1_L002_R1_001.fastq.gz.log
-rw-rwxr-- 1 sebhtml sebhtml 18G 2012-11-21 16:07 sorted_S1_L002_R1_002.fastq.gz
-rw-rwxr-- 1 sebhtml sebhtml 523 2012-11-22 13:52 sorted_S1_L002_R1_002.fastq.gz.log
-rw-rwxr-- 1 sebhtml sebhtml 19G 2012-11-21 19:16 sorted_S1_L002_R2_001.fastq.gz
-rw-rwxr-- 1 sebhtml sebhtml 523 2012-11-22 13:52 sorted_S1_L002_R2_001.fastq.gz.log
-rw-rwxr-- 1 sebhtml sebhtml 19G 2012-11-21 15:29 sorted_S1_L002_R2_002.fastq.gz
-rw-rwxr-- 1 sebhtml sebhtml 523 2012-11-22 13:52 sorted_S1_L002_R2_002.fastq.gz.log
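
Since the directory carries a sha1sum.txt, the copy could be checked before submitting another job. The sketch below is hedged on two assumptions: that sha1sum.txt is in the format written by the `sha1sum` tool, and that the file names in it are relative to the dataset directory. The function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: sanity-check a transferred dataset directory.
# Assumes sha1sum.txt was written by the sha1sum tool and lists paths
# relative to the dataset directory.
verify_dataset() {
    dir=$1
    expected_count=$2
    # Count only the compressed reads; the .fastq.gz.log files do not match.
    count=$(find "$dir" -maxdepth 1 -name '*.fastq.gz' | wc -l | tr -d ' ')
    if [ "$count" -ne "$expected_count" ]; then
        echo "expected $expected_count fastq.gz files, found $count"
        return 1
    fi
    # --quiet prints only the files that fail; the exit status is the result.
    (cd "$dir" && sha1sum --quiet -c sha1sum.txt)
}
```

Here that would be `verify_dataset HiSeq-2500-NA12878-demo-2x150 8`. It would flag both failure modes seen earlier: missing files and a truncated file.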

Owner

sebhtml commented Feb 12, 2014

New executable:
/lustre/atlas/proj-shared/lsc005/software/lsc005/Ray/53a80be6905565c7f791d069f9a1bf2e82ea8132-1/Ray

Owner

sebhtml commented Feb 12, 2014

-13

titan> pwd
/ccs/home/sebhtml/lsc005/projects/human-1-hour
titan> cat HiSeq-2500-NA12878-demo-2x150-13.sh
#PBS -N HiSeq-2500-NA12878-demo-2x150-13
#PBS -l walltime=00:12:00:00
#PBS -l nodes=626
#PBS -A LSC005

cd $PBS_O_WORKDIR

# 626 * 8 = 5008

#./software/lsc005/Ray/616d2a26cc1e39f59325a0e632af46262edaa12c-1/Ray \

aprun -n 5008 \
./software/lsc005/Ray/53a80be6905565c7f791d069f9a1bf2e82ea8132-1/Ray \
-k 31 \
-detect-sequence-files HiSeq-2500-NA12878-demo-2x150 \
-o HiSeq-2500-NA12878-demo-2x150-13 \

titan> qsub HiSeq-2500-NA12878-demo-2x150-13.sh
1867708

titan> showq | grep sebhtml
1867708 sebhtml Idle 10016 12:00:00 Wed Feb 12 16:39:47

Owner

sebhtml commented Feb 24, 2014

For job HiSeq-2500-NA12878-demo-2x150-13 (Atlas)

MPICH2 ERROR [Rank 4] [job id 4468454] [Wed Feb 12 19:16:49 2014] [c6-4c2s3n3] [nid02823] - MPIU_nem_gni_get_hugepages(): Unable to mmap 12582912 bytes for file /var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.2.2794.kvs_4468454, err Cannot allocate memory

titan> grep Cached HiSeq-2500-NA12878-demo-2x150-13.e1867708|head -n1
[Rank 4]: Cached: 14777104 kB

Owner

sebhtml commented Aug 11, 2014

This project is finished.

@sebhtml sebhtml closed this Aug 11, 2014

@sebhtml sebhtml self-assigned this Aug 11, 2014

@sebhtml sebhtml added this to the 2.3.2 milestone Aug 11, 2014
