
Aggregate step segmentation faults #19

Closed
jakewendt opened this issue Sep 30, 2021 · 6 comments

Comments

@jakewendt

I'm running a test on some TCGA data: 4 small groups, 5 members each. I've rerun the aggregate step several times, always with the same result: a segmentation fault.

Starting the analysis with the following arguments: 
	- Input file= /francislab/data1/working/20200603-TCGA-GBMLGG-WGS/20210923-iMOKA-tumor-normal-test/15/reduced.matrix
	- Output file= /francislab/data1/working/20200603-TCGA-GBMLGG-WGS/20210923-iMOKA-tumor-normal-test/15/aggregated
	- Count matrix file= /francislab/data1/working/20200603-TCGA-GBMLGG-WGS/20210923-iMOKA-tumor-normal-test/15/matrix.json
	- Configuration file= /francislab/data1/working/20200603-TCGA-GBMLGG-WGS/20210923-iMOKA-tumor-normal-test/15/config.json
	- Shift= 1
	- General threshold= 70
	- Source threshold= 80
	- Coverage limit= 50
	- Consistency= 2

Step 0 : Reading /francislab/data1/working/20200603-TCGA-GBMLGG-WGS/20210923-iMOKA-tumor-normal-test/15/reduced.matrix...
	Total lines: 538537325
done.             
	Read 450319809 kmers.
	Space occupied: 211.375Gb
	Max values: 
	 - nMutant_x_nWT:100
	 - nMutant_x_tMutant:100
	 - nMutant_x_tWT:100
	 - nWT_x_tMutant:100
	 - nWT_x_tWT:100
	 - tMutant_x_tWT:100

Step 1 : Computing edges... done.
Step 2 : Building the groups.../var/spool/slurm/d/job263422/slurm_script: line 4: 251271 Segmentation fault      singularity exec /francislab/data2/refs/singularity/iMOKA_extended-1.1.6.img iMOKA_core aggregate --input /francislab/data1/working/20200603-TCGA-GBMLGG-WGS/20210923-iMOKA-tumor-normal-test/15/reduced.matrix --count-matrix /francislab/data1/working/20200603-TCGA-GBMLGG-WGS/20210923-iMOKA-tumor-normal-test/15/matrix.json --mapper-config /francislab/data1/working/20200603-TCGA-GBMLGG-WGS/20210923-iMOKA-tumor-normal-test/15/config.json --output /francislab/data1/working/20200603-TCGA-GBMLGG-WGS/20210923-iMOKA-tumor-normal-test/15/aggregated

Unfortunately, it leaves no indication of what the problem is.

Any questions or suggestions?

@CloXD
Collaborator

CloXD commented Sep 30, 2021

Hello Jake,
I think it's a memory problem caused by the large number of results (I have never tried iMOKA with WGS, but I imagined it would produce a lot of results).
Try increasing the general threshold (-T) to 90 and the source threshold (-t) to 95 (or even 95 and 99) to keep only the best results.
With larger cohorts, the accuracy values should be more reliable: if you kept the default values in the reduction step, you used 1/4 of the samples as the test set, which means 1 per group. Take a look at the reduced matrix: if the accuracies are all 100, it would be better to increase the number of samples in each group to 10, or to increase the test-set fraction (-t) to 0.4 (so with 5 samples, it would use 2 for testing and 3 for training).
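
For reference, a sketch of the adjusted aggregate call, reusing the paths from the log above; the only change is the two threshold options (-T and -t are the short forms, so double-check the exact option names against iMOKA_core aggregate --help for your version):

    singularity exec /francislab/data2/refs/singularity/iMOKA_extended-1.1.6.img \
        iMOKA_core aggregate \
        --input /francislab/data1/working/20200603-TCGA-GBMLGG-WGS/20210923-iMOKA-tumor-normal-test/15/reduced.matrix \
        --count-matrix /francislab/data1/working/20200603-TCGA-GBMLGG-WGS/20210923-iMOKA-tumor-normal-test/15/matrix.json \
        --mapper-config /francislab/data1/working/20200603-TCGA-GBMLGG-WGS/20210923-iMOKA-tumor-normal-test/15/config.json \
        --output /francislab/data1/working/20200603-TCGA-GBMLGG-WGS/20210923-iMOKA-tumor-normal-test/15/aggregated \
        -T 90 -t 95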
I hope this will help.
Cheers,
Claudio

@jakewendt
Author

jakewendt commented Sep 30, 2021

Thanks again Claudio.

Initially, this was just a proof of principle, so the accuracy of the results wasn't really that important. Once it's working, I plan to run all available samples.

I'm not sure where to check for 100 as you suggested.

The reduced matrix did keep half a billion k-mers, which is quite a lot.

head 15/reduced.matrix
#{"adjustments":[0.25,0.05],"cross_validation":100,"file_in":"/francislab/data1/working/20200603-TCGA-GBMLGG-WGS/20210923-iMOKA-tumor-normal-test/15/matrix.json","file_out":"/francislab/data1/working/20200603-TCGA-GBMLGG-WGS/20210923-iMOKA-tumor-normal-test/15/reduced.matrix","kept":538537323,"min_acc":65.0,"minimum_count":5,"perc_test":0.25,"processed":864984338,"standard_error":0.5}
kmer	nMutant_x_nWT	nMutant_x_tMutant	nMutant_x_tWT	nWT_x_tMutant	nWT_x_tWT	tMutant_x_tWT	nMutant	nWT	tMutant	tWT
AAAAAAAAAAAAAAA	79.500	81.500	93.500	66.000	62.000	22.000	462462.289	504177.301	527753.348	534412.340
AAAAAAAAAAAAAAC	77.000	49.000	83.500	68.000	35.000	58.500	18289.031	23344.986	20507.831	22575.272

I also just noticed a new quirk with WGS, or at least with paired data. The k-mer counts aren't canonical, as reduced.matrix includes reverse complements. I'm guessing they probably should be, given that half the reads are forward and half are reverse complements. That would mean going back to the preprocessing step, I think, and changing the library type. I'm assuming the default library type is effectively ff; I'm going to try fr. Any suggestions there?

I'll make the mods to the aggregate that you suggested and rerun.

Thanks again,
Jake

@CloXD
Collaborator

CloXD commented Oct 1, 2021

No problem.
The reduced matrix has accuracies other than just 100 (from the second column to the seventh), so that's fine.
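If you want to check that quickly, a rough one-liner along these lines should do it (not an iMOKA command; it assumes the layout shown in the head output above, i.e. two header lines and the six pairwise accuracies in columns 2-7):

    # count k-mers whose six pairwise accuracies are all exactly 100
    awk -F'\t' 'NR > 2 && $2 == 100 && $3 == 100 && $4 == 100 && $5 == 100 && $6 == 100 && $7 == 100 { n++ } END { print n+0 }' 15/reduced.matrix
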
The k-mers are deliberately not canonical, in order to handle stranded RNA-seq.
An optimization of iMOKA for WGS would include the use of canonical k-mers, the adaptation of the aggregation step to canonical k-mers (all the steps that consider the k-mer sequence, such as the generation of the graphs, the mapping, etc.) and possibly a discretization of the k-mer counts.
Those changes require a lot of work (and a test dataset), but unfortunately my contract has just ended and I don't yet know whether I'll continue to develop iMOKA in the future or whether someone else will.
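
For concreteness, a minimal bash sketch of what "canonical" means here, i.e. the lexicographically smaller of a k-mer and its reverse complement (just an illustration of the idea, not something iMOKA currently does):

    # canonical form = min(kmer, reverse complement of kmer), compared as strings
    canonical() {
        local kmer=$1
        local rc
        rc=$(printf '%s\n' "$kmer" | rev | tr 'ACGT' 'TGCA')
        if [[ "$kmer" < "$rc" ]]; then printf '%s\n' "$kmer"; else printf '%s\n' "$rc"; fi
    }
    canonical AAAAAAAAAAAAAAC   # prints AAAAAAAAAAAAAAC (its reverse complement is GTTTTTTTTTTTTTT)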
Cheers,
Claudio

@jakewendt
Author

Will passing --library-type fr to preprocess correctly orient the extracted k-mers for paired sequences, when they are passed in the source file like this?

sample	group	FILE_R1.fastq.gz;FILE_R2.fastq.gz

@CloXD
Collaborator

CloXD commented Oct 3, 2021

Yes, it will identify the file matching the RE /[]?[R]2[.]/ and convert it to its reverse complement (file 1 is associated with []?[R_]1[._]).
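
A rough illustration of the idea (the pattern below is only a stand-in for iMOKA's actual RE; it just shows which of the two files in a pair would be reverse complemented with --library-type fr):

    for f in FILE_R1.fastq.gz FILE_R2.fastq.gz; do
        if [[ "$f" =~ _R2\. ]]; then
            echo "$f -> treated as mate 2, reverse complemented before k-mer counting"
        else
            echo "$f -> treated as mate 1, kept as-is"
        fi
    done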

@jakewendt
Author

Just to close this off: I reran preprocessing with --library-type fr, the reduce step with --test-percentage 0.5, and the aggregate step with --global-threshold 95 --origin-threshold 99, and the problem went away. The change in the aggregate parameters is likely what stopped the seg fault.

Thanks again Claudio
