No mean distance calculated between samples #37

AndreaAguadoM · 2023-12-05T07:47:57Z

Hello!
My name is Andrea and I am a bioinformatician from Spain. I have been using tn93 for a while, including it in some of the pipelines I am developing to analyze HIV sequences more effectively. I am trying to generate a distance matrix for a large number of HIV samples and, while analyzing my results, I found that some sequence pairs do not seem to have a distance assigned (the mean distance is reported as -nan). How is this possible if the -t parameter in my pipeline is set to 1?

Thanks in advance!

spond · 2023-12-05T13:14:16Z

Dear @AndreaAguadoM,

Can you please provide an example? nan will only arise if no comparisons were performed, i.e. something like this occurs (Actual comparisons performed = 0).

{
	"Actual comparisons performed" :0,
	"Comparisons accounting for copy numbers " :0,
	"Total comparisons possible" : 10,
	"Links found" : 0,
	"Maximum distance" : 0,
	"Sequences" : 5,
	"Mean distance" : nan
...

Make sure you specify the -l argument to compare sequences that overlap by fewer than the default 100 nucleotides as well (which is the case for the example above).
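To illustrate why the mean comes out as nan, here is a minimal Python sketch (not tn93's actual code). It assumes that "overlap" means the number of aligned positions where both sequences carry a resolved base, and it uses a plain mismatch proportion as a stand-in for the TN93 distance; the function names and the `min_overlap` parameter are hypothetical.

```python
import math
from itertools import combinations

RESOLVED = set("ACGT")


def overlap(a, b):
    """Aligned positions where both sequences have a resolved base (assumed definition)."""
    return sum(1 for x, y in zip(a, b) if x in RESOLVED and y in RESOLVED)


def p_distance(a, b):
    """Stand-in distance: proportion of mismatches over overlapping sites (not TN93)."""
    sites = [(x, y) for x, y in zip(a, b) if x in RESOLVED and y in RESOLVED]
    return sum(x != y for x, y in sites) / len(sites)


def summarize(seqs, min_overlap=100):
    """Compare all pairs that pass the overlap filter and report a small summary."""
    if min_overlap < 1:
        raise ValueError("min_overlap must be at least 1")
    possible = performed = 0
    total = 0.0
    for a, b in combinations(seqs, 2):
        possible += 1
        if overlap(a, b) >= min_overlap:
            performed += 1
            total += p_distance(a, b)
    # With zero comparisons there is nothing to average, hence NaN.
    mean = total / performed if performed else math.nan
    return {
        "Total comparisons possible": possible,
        "Actual comparisons performed": performed,
        "Mean distance": mean,
    }


if __name__ == "__main__":
    short = ["ACGTACGT", "ACGTACGA"]
    print(summarize(short, min_overlap=100))  # no pair passes the filter
    print(summarize(short, min_overlap=1))    # the pair is compared
```

With the default threshold the two 8-nucleotide sequences are never compared, so the mean is undefined; lowering the threshold lets the comparison go through.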

Best,
Sergei

AndreaAguadoM · 2023-12-11T11:54:00Z

Dear Sergei,

I apologize for the delayed response; your previous message got lost among my incoming emails, and I mistakenly thought you hadn't replied. As requested, here is an example:

{
	"Actual comparisons performed" :0,
	"Comparisons accounting for copy numbers " :0,
	"Total comparisons possible" : 1,
	"Links found" : 0,
	"Maximum distance" : 0,
	"Sequences" : 2,
	"Mean distance" : -nan,

The primary issue I've encountered relates to the -l threshold. I've set it to the minimum value of 1. For the two sequences compared here, there appears to be no overlap at any position, so I still obtain this "nan" result.

On the other hand, I also used tn93 to compare two sequences that differ for the most part (except for one common nucleotide):

sample1

NNNNNNNTGGCGAVATGTCTAGTAGCCAGCTGTGATAAATGTCAGCAAAAAGGAGAAGCCATGCATGGACAAGTAGACTGTAGTCCAGGAATATGGCAACTAGATTGTACACACTTAGAAGACAAAATTATCCTGGTAGCAGTTCATGTAGCCAGTGGATATATAGAAGCAGAAGTTATTCCAGCAGAAACAGGGCAGGAAACAGCATACTTCATCCTAAAGTTAGCAGGAAGATGGCCAGTAAAAACAATACATACAGACAATGGTAGAAATTTTACCAGTAGTGCTGTGAAGGCAGCCTGTTGGTGGGCAGGGATCCAGCAGGAATTTGGAATTCCCTACAATCCCCAAAGTCAAGGAGTAGTAGAATCTATGAATAAAGAATTAAAGAAAATCATAGGACAAGTAAGAGATCAAGCTGAACATCTTAAGACAGCAGTACAAATGGCGGTGTTCATTCACAATTTTAAAAGAAAAGGGGGGATTGGGGAGTACAGTGCAGGGGAAAGAATAATAGACATAATAGCAACAGACATACAAACTAAAGAATTACAAAAACAAATTATAAAAATTCAAAATTTCCGGGTTTATTACAGGGACAGCAGAGACCCAATTTGGAAAGGACCAGCAAAGCTGCTCTGGAAAGGTGAAGGGGCAGTAGTCATACAAGATAATAGTGAAATAAAAGTAGTGCCAAGAAGAAAAGCAAAGATCATTAGGGATTATGGAAAACAGATGGCAGGTGATGATTGTGTGGCAAGTAGACAGGATGAGGATTAGAACATGGAAGGCAAGTAGACNNNNNN

sample2

AAAAAAANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNAAAAAA Nevertheless, I obtain the following distance as a result: $tn93 -t 1 -l 1 -o tn93_${sample1}_${sample2}.txt ${sample1}_${sample2}_alignment.fasta "Actual comparisons performed" :1, "Comparisons accounting for copy numbers " :1, "Total comparisons possible" : 1, "Links found" : 1, "Maximum distance" : 0.0025491, "Sequences" : 2, "Mean distance" : 0.0025491, This 0.0025 value as a result is remarkably low. I am puzzled by this outcome given the substantial dissimilarity in almost every nucleotide of the sequences. Your insights on this matter would be really appreciated. Thank you so much in advance. Best regards! El mar, 5 dic 2023 a las 14:14, Sergei Pond ***@***.***>) escribió:


spond · 2023-12-11T18:41:33Z

Dear @AndreaAguadoM,

In default run mode, N means "match everything". Sequences that comprise N will match any character at that position (distance 0).

If you want to treat N differently, you should adjust the -a command line argument. For example -a average.
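As a rough illustration of the difference, the Python sketch below scores each aligned site in two ways: a "resolve"-style rule, where an ambiguity code counts as a match whenever it can stand for the other base, and an "average"-style rule, where the mismatch is averaged over all possible resolutions. This is only a sketch of the idea on a simple mismatch fraction; it is not tn93's implementation and does not compute the TN93 distance, and the function names are hypothetical.

```python
import math

# IUPAC nucleotide codes and the bases each one can stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def mismatch_resolve(x, y):
    """0 if the two codes share at least one possible base, else 1."""
    return 0.0 if set(IUPAC[x]) & set(IUPAC[y]) else 1.0


def mismatch_average(x, y):
    """Expected mismatch when every resolution of each code is equally likely."""
    xs, ys = IUPAC[x], IUPAC[y]
    mismatches = sum(a != b for a in xs for b in ys)
    return mismatches / (len(xs) * len(ys))


def mismatch_fraction(a, b, per_site):
    """Average per-site score over aligned positions with known codes (gaps skipped)."""
    sites = [(x, y) for x, y in zip(a.upper(), b.upper()) if x in IUPAC and y in IUPAC]
    if not sites:
        return math.nan
    return sum(per_site(x, y) for x, y in sites) / len(sites)


if __name__ == "__main__":
    print(mismatch_fraction("AAAA", "NNNN", mismatch_resolve))  # N matches everything
    print(mismatch_fraction("AAAA", "NNNN", mismatch_average))  # N is penalized on average
```

Under the first rule a sequence made mostly of N looks nearly identical to anything; under the second, each N contributes a partial mismatch.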

Best,
Sergei

AndreaAguadoM · 2023-12-14T13:23:39Z

Thank you so much! I've noticed that with this setting (-a average), I obtain 1000 as the resulting mean distance in some calculations. As far as I know, the Tamura-Nei distance ranges between 0 and 2. Why am I obtaining these results? Thanks in advance again!

spond · 2023-12-14T14:57:28Z

Dear @AndreaAguadoM,

1000 is the upper bound that tn93 reports for all distances. Most genetic distances, including the TN93 distance, can range from 0 to ∞.

It requires some serious data pathology, but it could occur. In fact, tn93 will "downgrade" to a K2P distance if the input data do not contain one of the four characters. That's because TN93 may become undefined in this case.

tn93/src/tn93_shared.cc

Line 756 in 728bb98

if (useK2P) {
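For reference, the standard K2P (Kimura two-parameter) distance is d = -1/2 ln(1 - 2P - Q) - 1/4 ln(1 - 2Q), where P and Q are the proportions of transitions and transversions. The Python sketch below shows how such a distance grows without bound as sequences saturate and how a reporting cap could be applied. It is not the code in tn93_shared.cc: returning the cap when the logarithm is undefined is an assumption made for illustration, and `k2p_distance` and `MAX_DISTANCE` are hypothetical names.

```python
import math

MAX_DISTANCE = 1000.0  # upper bound mentioned above; how tn93 applies it is assumed here

PURINES = set("AG")
BASES = set("ACGT")


def k2p_distance(a, b, cap=MAX_DISTANCE):
    """K2P distance over sites where both sequences have a resolved base."""
    sites = [(x, y) for x, y in zip(a.upper(), b.upper()) if x in BASES and y in BASES]
    if not sites:
        return math.nan  # nothing to compare
    n = len(sites)
    # Transition: purine<->purine or pyrimidine<->pyrimidine change.
    transitions = sum(1 for x, y in sites if x != y and (x in PURINES) == (y in PURINES))
    # Transversion: purine<->pyrimidine change.
    transversions = sum(1 for x, y in sites if x != y and (x in PURINES) != (y in PURINES))
    p = transitions / n
    q = transversions / n
    t1 = 1.0 - 2.0 * p - q
    t2 = 1.0 - 2.0 * q
    if t1 <= 0.0 or t2 <= 0.0:
        # The logarithm is undefined (saturated data); report the cap (assumption).
        return cap
    d = -0.5 * math.log(t1) - 0.25 * math.log(t2)
    return min(d, cap)


if __name__ == "__main__":
    print(k2p_distance("AAAAAAAAAA", "GAAAAAAAAA"))  # one transition in ten sites
    print(k2p_distance("AAAA", "CCCC"))              # saturated: capped value
```

The second call shows the pathological case: when every site is a transversion the formula has no finite value, so only the cap can be reported.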

Best,
Sergei

AndreaAguadoM · 2023-12-15T09:25:49Z

Okay, thank you very much! Your response has been very helpful.
Best,
Andrea.
