
Aggregate step segmentation faults #19

Closed
jakewendt opened this issue Sep 30, 2021 · 6 comments

Comments

@jakewendt

I'm running a test on some TCGA data: 4 small groups, 5 members each. I've rerun the aggregate step several times, always with the same result: a segmentation fault.

Starting the analysis with the following arguments: 
	- Input file= /francislab/data1/working/20200603-TCGA-GBMLGG-WGS/20210923-iMOKA-tumor-normal-test/15/reduced.matrix
	- Output file= /francislab/data1/working/20200603-TCGA-GBMLGG-WGS/20210923-iMOKA-tumor-normal-test/15/aggregated
	- Count matrix file= /francislab/data1/working/20200603-TCGA-GBMLGG-WGS/20210923-iMOKA-tumor-normal-test/15/matrix.json
	- Configuration file= /francislab/data1/working/20200603-TCGA-GBMLGG-WGS/20210923-iMOKA-tumor-normal-test/15/config.json
	- Shift= 1
	- General threshold= 70
	- Source threshold= 80
	- Coverage limit= 50
	- Consistency= 2

Step 0 : Reading /francislab/data1/working/20200603-TCGA-GBMLGG-WGS/20210923-iMOKA-tumor-normal-test/15/reduced.matrix...
	Total lines: 538537325
done.             
	Read 450319809 kmers.
	Space occupied: 211.375Gb
	Max values: 
	 - nMutant_x_nWT:100
	 - nMutant_x_tMutant:100
	 - nMutant_x_tWT:100
	 - nWT_x_tMutant:100
	 - nWT_x_tWT:100
	 - tMutant_x_tWT:100

Step 1 : Computing edges... done.
Step 2 : Building the groups.../var/spool/slurm/d/job263422/slurm_script: line 4: 251271 Segmentation fault      singularity exec /francislab/data2/refs/singularity/iMOKA_extended-1.1.6.img iMOKA_core aggregate --input /francislab/data1/working/20200603-TCGA-GBMLGG-WGS/20210923-iMOKA-tumor-normal-test/15/reduced.matrix --count-matrix /francislab/data1/working/20200603-TCGA-GBMLGG-WGS/20210923-iMOKA-tumor-normal-test/15/matrix.json --mapper-config /francislab/data1/working/20200603-TCGA-GBMLGG-WGS/20210923-iMOKA-tumor-normal-test/15/config.json --output /francislab/data1/working/20200603-TCGA-GBMLGG-WGS/20210923-iMOKA-tumor-normal-test/15/aggregated

Unfortunately, it leaves no indication of what the problem is.

Any questions or suggestions?

@CloXD
Collaborator

CloXD commented Sep 30, 2021

Hello Jake,
I think it's a memory problem caused by the large number of results (I have never tried iMOKA with WGS, but I imagined it would produce a lot of results).
Try increasing the general threshold (-T) to 90 and the source threshold (-t) to 95 (or even 95 and 99) to keep only the best results.
With larger cohorts, the accuracy values should be more reliable: if you kept the default values in the reduction step, you used 1/4 of the samples as the test set, which means 1 per group. Take a look at the reduced matrix: if the accuracies are all 100, it would be better to increase the number of samples in each group to 10, or to increase the test-set fraction (-t) to 0.4 (so with 5 samples, it would use 2 for testing and 3 for training).
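
For reference, a sketch of the adjusted aggregate call, reusing the paths from the log above; the only change is the two threshold options (-T and -t are the short forms, so double-check the exact option names against iMOKA_core aggregate --help for your version):

    singularity exec /francislab/data2/refs/singularity/iMOKA_extended-1.1.6.img \
        iMOKA_core aggregate \
        --input /francislab/data1/working/20200603-TCGA-GBMLGG-WGS/20210923-iMOKA-tumor-normal-test/15/reduced.matrix \
        --count-matrix /francislab/data1/working/20200603-TCGA-GBMLGG-WGS/20210923-iMOKA-tumor-normal-test/15/matrix.json \
        --mapper-config /francislab/data1/working/20200603-TCGA-GBMLGG-WGS/20210923-iMOKA-tumor-normal-test/15/config.json \
        --output /francislab/data1/working/20200603-TCGA-GBMLGG-WGS/20210923-iMOKA-tumor-normal-test/15/aggregated \
        -T 90 -t 95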
I hope this will help.
Cheers,
Claudio

@jakewendt
Author

jakewendt commented Sep 30, 2021

Thanks again Claudio.

Initially, this was just a proof of principle, so the accuracy of the results wasn't really that important. Once it's working, I plan to run all available samples.

I'm not sure where to check for 100 as you suggested.

The reduced matrix did keep half a billion k-mers, which is quite a lot.

head 15/reduced.matrix
#{"adjustments":[0.25,0.05],"cross_validation":100,"file_in":"/francislab/data1/working/20200603-TCGA-GBMLGG-WGS/20210923-iMOKA-tumor-normal-test/15/matrix.json","file_out":"/francislab/data1/working/20200603-TCGA-GBMLGG-WGS/20210923-iMOKA-tumor-normal-test/15/reduced.matrix","kept":538537323,"min_acc":65.0,"minimum_count":5,"perc_test":0.25,"processed":864984338,"standard_error":0.5}
kmer	nMutant_x_nWT	nMutant_x_tMutant	nMutant_x_tWT	nWT_x_tMutant	nWT_x_tWT	tMutant_x_tWT	nMutant	nWT	tMutant	tWT
AAAAAAAAAAAAAAA	79.500	81.500	93.500	66.000	62.000	22.000	462462.289	504177.301	527753.348	534412.340
AAAAAAAAAAAAAAC	77.000	49.000	83.500	68.000	35.000	58.500	18289.031	23344.986	20507.831	22575.272

I also just noticed a new quirk with WGS, or at least with paired data. The k-mer counts aren't canonical, as reduced.matrix includes reverse complements. I'm guessing they probably should be, given that half the reads are forward and half are reverse complements. That would mean going back to the preprocessing step, I think, and changing the library type. I'm assuming the default library type is effectively ff; I'm going to try fr. Any suggestions there?

I'll make the mods to the aggregate that you suggested and rerun.

Thanks again,
Jake

@CloXD
Collaborator

CloXD commented Oct 1, 2021

No problem.
The reduced matrix has accuracies other than just 100 (from the second column to the seventh), so that's fine.
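If you want to check that quickly, a rough one-liner along these lines should do it (not an iMOKA command; it assumes the layout shown in the head output above, i.e. two header lines and the six pairwise accuracies in columns 2-7):

    # count k-mers whose six pairwise accuracies are all exactly 100
    awk -F'\t' 'NR > 2 && $2 == 100 && $3 == 100 && $4 == 100 && $5 == 100 && $6 == 100 && $7 == 100 { n++ } END { print n+0 }' 15/reduced.matrix
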
The k-mers are deliberately not canonical, in order to handle stranded RNA-seq.
An optimization of iMOKA for WGS would include the use of canonical k-mers, the adaptation of the aggregation step to canonical k-mers (all the steps that consider the k-mer sequence, such as the generation of the graphs, the mapping, etc.) and possibly a discretization of the k-mer counts.
Those changes require a lot of work (and a test dataset), but unfortunately my contract has just ended and I don't yet know whether I'll continue to develop iMOKA in the future or whether someone else will.
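
For concreteness, a minimal bash sketch of what "canonical" means here, i.e. the lexicographically smaller of a k-mer and its reverse complement (just an illustration of the idea, not something iMOKA currently does):

    # canonical form = min(kmer, reverse complement of kmer), compared as strings
    canonical() {
        local kmer=$1
        local rc
        rc=$(printf '%s\n' "$kmer" | rev | tr 'ACGT' 'TGCA')
        if [[ "$kmer" < "$rc" ]]; then printf '%s\n' "$kmer"; else printf '%s\n' "$rc"; fi
    }
    canonical AAAAAAAAAAAAAAC   # prints AAAAAAAAAAAAAAC (its reverse complement is GTTTTTTTTTTTTTT)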
Cheers,
Claudio

@jakewendt
Author

Will passing --library-type fr to preprocess correctly orient the extracted k-mers for paired sequences, when they are passed in the source file like this?

sample	group	FILE_R1.fastq.gz;FILE_R2.fastq.gz

@CloXD
Collaborator

CloXD commented Oct 3, 2021

Yes, it will identify the file matching the RE /[]?[R]2[.]/ and convert it to its reverse complement (file 1 is associated with []?[R_]1[._]).
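
A rough illustration of the idea (the pattern below is only a stand-in for iMOKA's actual RE; it just shows which of the two files in a pair would be reverse complemented with --library-type fr):

    for f in FILE_R1.fastq.gz FILE_R2.fastq.gz; do
        if [[ "$f" =~ _R2\. ]]; then
            echo "$f -> treated as mate 2, reverse complemented before k-mer counting"
        else
            echo "$f -> treated as mate 1, kept as-is"
        fi
    done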

@jakewendt
Author

Just to close this off: I reran preprocessing with --library-type fr, the reduce step with --test-percentage 0.5, and the aggregate step with --global-threshold 95 --origin-threshold 99, and the problem went away. The change in the aggregate parameters is likely what stopped the seg fault.

Thanks again Claudio
