New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
core_accessory_graph weights are greater than 1 #405
Comments
I don't know the answer, but perhaps paralog genes might be the problem or bug? Not sure! |
How would the split-paralogs flag affect that? I would expect paralogs to be in separate clusters if the split paralogs is flagged. I also have cases where the weight is less than 1 but when I invert that to see what the number of isolates was, I get a number that isn't right. For example, one weight was 0.032258. I then did 1 divided by that number to get the number of isolates and I get 31. I checked how often those two gene pairs are next to each other in our population set and it's much higher than 31. I went into the OrderGenes.pm code and printed the denominator of the weight calculation and they are floats, not integers. Is there any other criteria your team used to calculate the denominator of the weight equation other than frequency? |
Hi Sarah, Higher numbers = longer path = 2 genes less commonly found beside each other on contigs. To reduce the impact of over represented genotypes in a sample set, a clustering is performed with cd-hit, and each cluster/genotype gets a weighting (in AccessoryClustering.pm). For example if you have 100 samples, but 90 come from a single outbreak this would skew your results. So any samples from the single outbreak are down weighted and only contribute 1/90. These weightings carry over to the graph, so that the contribution of theses over represented samples doesnt bias the graph. If you think the method could be improved, please feel free to submit a pull request. |
Hi Andrew, Thanks and I agree, this graph is a great resource! |
Yes thats correct. |
Great, thank you for your help Andrew! |
Andrew, I do have another question. If we are inverting the frequency, I'm confused how we can get weights larger than 1. I printed the denominator in the OrderGenes.pm and noticed that they are floats. I'm confused how a frequency can be a float in this case when calculating the weight? |
Hello,
I'm attempting to use the DOT file created by Roary for some downstream analyses (great addition and thank you for writing it to a file!). I was looking through the weights that were calculated and I'm getting very confused so please tell me if I'm misunderstanding something!
I read through the Wiki and the Supplemental information and was under the impression that the weight of each edge is calculated by 1 divided by the number of isolates that have those two nodes (genes) next to each other in their respective genomes. I see weights greater than 1 very often and I'm confused on how any weight can be greater than 1 if your dividing 1 by an integer.
I would appreciate any guidance on how to interpret the weight calculation!
The text was updated successfully, but these errors were encountered: