Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimisation of parameters #2

Closed
zkstewart opened this issue Oct 4, 2017 · 13 comments
Closed

Optimisation of parameters #2

zkstewart opened this issue Oct 4, 2017 · 13 comments

Comments

@zkstewart
Copy link

Hi,

I am wondering if you could give me a few tips for optimising the assembly. I have previously used your SMARTdenovo assembler, and received a quite good assembly. Parameters were default, except I reduced minimum length cut-off to 2500. Stats are:

Genome size: 274,508,993
Estimated genome size [by SMARTdenovo]: 311,266,064
Number of contigs: 1,096
Shortest contig: 8,918
Longest contig: 3,655,283

N50: 621,057
Median: 89,349.5
Mean: 250,464.40967153283

I have tried three different parameter combinations with wtdbg, but haven't been able to get an assembly as contiguous. The default parameters received the following stats:

Genome size: 308,848,834
Number of contigs: 7,249
Shortest contig: 3,148
Longest contig: 758,777

N50: 118,094
Median: 15,921
Mean: 42,605.715822872124

I next tried two variants of a "maximum sensitivity" combination (at least, as I understand it). The first variant included the arguments "-k 0 -p 17 -S 2 --edge-min 2 --rescue-low-cov-edges". The second was the same, but --tidy-reads was set to 2500. The statistics for these two in their respective order is below:

Genome size: 271,989,307
Number of contigs: 4,539
Shortest contig: 2,417
Longest contig: 2,258,713

N50: 458,530
Median: 11,060
Mean: 59,922.737827715355
Genome size: 269,423,173
Number of contigs: 5,036
Shortest contig: 2,038
Longest contig: 1,872,034

N50: 346,207
Median: 11,191.0
Mean: 53,499.43864177919

I was wondering if you had any ideas for how I might be able to improve the program's performance, or if SMARTdenovo might just be better suited for my particular genome?

Thanks,
Zac.

@ruanjue
Copy link
Owner

ruanjue commented Oct 5, 2017

Hi Zac,
A good way to choose the best parameters is look at the message on stderr, there was a value of median node coverage. If too low, say less than 20, you might need to try --edge-min 2. kmer-size or pmer-size is more about sequencing error rate. I prefer to use 5000 in --tidy-reads, though no evidence.
As I know little about your sequencing type and coverage, the belowed is suggested but ...
1, -k 15 -p 0 -S 1 --rescue-low-cov-edges --tidy-reads 5000
will take a long time and big RAM
2, -k 0 -p 21 -S 1 --edge-min 2 --tidy-reads 5000
3, -k 0 -p 21 -S 1 --edge-min 2 --tidy-reads 5000 --ttr-cutoff-depth 20
ttr-cutoff might be useful in complicated genomes to filter Tandom Tiny Repeats.

In fact, wtdbg is degined to be able to assemble a huge genome within one day, SMARTdenovo might get better assemblies in small genomes.

Best,
Jue

@zkstewart
Copy link
Author

Thanks for the response Jue,

I'll try out your suggestions as soon as I get the chance and I'll let you know how it goes.

Zac.

@zkstewart
Copy link
Author

Hi again,

I've managed to finish running a few variations based on your suggestions. The first thing I noticed was that I made a mistake regarding what the final output file was. I was incorrectly using the .ctg.lay.fa file as the final output rather than the .map.fa file. I might suggest changing the readme file to make this clearer, since the .map.fa file isn't mentioned on there.

With respect to the outputs, I'll list the commands and statistics here for anyone else interested in using this program.

Default
--tidy-reads 5000 -k 0 -p 21 -S 4 --rescue-low-cov-edges

Genome size: 350,014,325
Number of contigs: 7,249
Shortest contig: 3,932
Longest contig: 886,772

N50: 135,662
Median: 17,860
Mean: 48,284

Suggestion 1
--tidy-reads 5000 -k 15 -p 0 -S 1 --rescue-low-cov-edges

Genome size: 283,918,398
Number of contigs: 2,696
Shortest contig: 2,988
Longest contig: 3,321,251

N50: 672,118
Median: 14,897
Mean: 105,310

Suggestion 1.1
--tidy-reads 5000 -k 15 -p 0 -S 1 --edge-min 2 --rescue-low-cov-edges

Genome size: 303,282,604
Number of contigs: 4,554
Shortest contig: 2,988
Longest contig: 3,198,861

N50: 494,793
Median: 12,167
Mean: 66,596

Suggestion 2
--tidy-reads 5000 -k 0 -p 21 -S 1 --edge-min 2 --ttr-cutoff-depth 20

Genome size: 319,221,742
Number of contigs: 5,483
Shortest contig: 2,986
Longest contig: 1,730,020

N50: 297,377
Median: 12,970
Mean: 58,220

Suggestion 2.1
--tidy-reads 5000 -k 0 -p 21 -S 1 --edge-min 2

Genome size: 305,845,722
Number of contigs: 4,996
Shortest contig: 2,988
Longest contig: 1,852,489

N50: 376,278
Median: 12,407
Mean: 61,218

At least for my relatively small genome, the settings of suggestion 1/1.1 seem to be giving the most contiguous assembly which is comparable to SMARTdenovo but not quite as good. The run time is still much faster than SMARTdenovo although the memory usage is much higher when not running at default settings. I'll consider playing around with the program further, but for now I think gene annotation is the next thing I need to try to see how the two programs differ from each other and compare to other assemblers.

Zac.

@YiweiNiu
Copy link

YiweiNiu commented Mar 10, 2018

Hi,

Thank you for your discussions about the parameter optimisation. It's very helpful.

I've tried to use smartdenovo, wtdbg1.1.006 and wtdbg1.2.8 to assemble an insect genome (~850M, ~70X PacBio data).

I want to share my commands and the basic stats I got.

smartdenovo

with -c 1

Size_includeN	756816708
Size_withoutN	756816708
Seq_Num	6135
Mean_Size	123360
Median_Size	55901
Longest_Seq	5704487
Shortest_Seq	10769
GC_Content	31.72
N50	240010
N90	44546
Gap	0.0

wtdbg 1.1.006

with -H -k 21 -S 1.02 -e 3

total base: 607971510
%GC: 32.21
num: 4655
min: 2700
max: 6594103
avg: 130606
N50: 573208
N90: 46228

wtdbg 1.2.8

run1, with defalult -k 0 -p 21 -S 4

total base: 757804309
%GC: 32.37
num: 20960
min: 2247
max: 3846453
avg: 36154
N50: 103681
N90: 12128

run2, with --edge-min 2 --rescue-low-cov-edges --tidy-reads 5000 (Because median node depth = 6, less than 20)

total base: 845834770
%GC: 32.51
num: 19555
min: 2030
max: 2025061
avg: 43254
N50: 158013
N90: 14248

run3, with -k 15 -p 0 -S 1 --rescue-low-cov-edges --tidy-reads 5000

Size_includeN	795503989
Size_withoutN	795503989
Seq_Num	12557
Mean_Size	63351
Median_Size	15690
Longest_Seq	7257493
Shortest_Seq	2277
GC_Content	32.44
N50	308340
N90	21383

run4, with -k 0 -p 19 -S 2 --rescue-low-cov-edges --tidy-reads 5000

Size_includeN	780618272
Size_withoutN	780618272
Seq_Num	11722
Mean_Size	66594
Median_Size	16335
Longest_Seq	8184393
Shortest_Seq	2547
GC_Content	32.4
N50	294217
N90	23008

run5, with --tidy-reads 5000 -k 21 -p 0 -S 2 --rescue-low-cov-edges --tidy-reads 5000

Size_includeN	843085698
Size_withoutN	843085698
Seq_Num	26341
Mean_Size	32006
Median_Size	18982
Longest_Seq	491063
Shortest_Seq	2992
GC_Content	32.51
N50	54544
N90	13737

In my case wtdbg 1.1.006 generated the best results (at least the N50 size). But I don't know how to tweak the parameters to make it better, do you have any ideas?

The DNA of my data was from dozens of bugs, becuse they are so tiny, and I don't how this affects the assembly (together with the errors in the data).

Any suggestions or thoughts would be appreciated.

Bests,
Yiwei Niu

@ruanjue
Copy link
Owner

ruanjue commented Mar 10, 2018

Thanks, @YiweiNiu
Also try --aln-noskip

@ruanjue ruanjue reopened this Mar 10, 2018
@YiweiNiu
Copy link

Hi Jue,

Sorry for the late reply. The test was just finished.

with wtdbg-1.2.8 -t 20 -i third_all.fasta -o run7 --aln-noskip

Size_includeN	726925732
Size_withoutN	726925732
Seq_Num	15983
Mean_Size	45481
Median_Size	12714
Longest_Seq	2523944
Shortest_Seq	2290
GC_Content	32.21
N50	164635
N90	14464
Gap	0.0

with wtdbg-1.2.8 -t 20 -i third_all.fasta -o run8 -k 15 -p 0 -S 1 --rescue-low-cov-edges --tidy-reads 5000 --aln-noskip

Size_includeN	762713695
Size_withoutN	762713695
Seq_Num	9803
Mean_Size	77804
Median_Size	15366
Longest_Seq	11163143
Shortest_Seq	2449
GC_Content	32.22
N50	488952
N90	25913
Gap	0.0

Thank you for your help!

Bests,
Yiwei Niu

@ruanjue
Copy link
Owner

ruanjue commented Mar 12, 2018

In personal feeling, I like wtdbg-1.2.8 more than SMARTdenovo and wtdbg-1.1.006.

Anyway, thanks for sharing experiments!

@ruanjue ruanjue closed this as completed Mar 12, 2018
@YiweiNiu
Copy link

Do you have any ideas about improving the assembly (though I know this largely depends on the data/species)? If necessary, I will upload the log files.

Now I'm not sure what to do next... I've tried several assembly pipelines (miniasm, MECAT, Canu etc.). wtdbg-1.1.006 gave me the best N50 size so far. I'm wondering whether there is room for improvement. Or I should move on.

Thank you!

@ruanjue
Copy link
Owner

ruanjue commented Mar 12, 2018

N50 contig of ~500kb is enough for following genomic analysis, also enough for further scaffolding. Now, I suggest you to check mis-assemblies, and make a final dicision.
I care more about how to get an good enough assembly efficiently, not a best assembly outstand, which may take too much time but little of improvement.

@YiweiNiu
Copy link

Thank you very much for your help and for your great tools! I'll try and paste something useful here.

@evanrrees
Copy link

Can you please clarify how the -k and -p parameters work together? I see from the example commands in this thread that either -k 0 or -p 0 is always specified, but the usage information gives the impression that they can be used together. It would be helpful if you could provide an example sequence and how it would be interpreted with different arguments.

Thanks!
Evan

@ruanjue
Copy link
Owner

ruanjue commented Aug 29, 2018

a seed = +
k-mer is normal DNA fragment of length k
p-mer is homopolymer-compressed DNA fragment, has a length of p
the specificity of k-mer can be 4^k
while the specificity of p-mer is about 3^p
the specificity of seed can be estimated by 4^k * 3 ^ p

@ruanjue ruanjue reopened this Aug 29, 2018
@ruanjue
Copy link
Owner

ruanjue commented Aug 29, 2018

a seed = <k-mer>+<p-mer>

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants