Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

core_accessory_graph weights are greater than 1 #405

Open
sarahbuzz opened this issue Jun 29, 2018 · 7 comments
Open

core_accessory_graph weights are greater than 1 #405

sarahbuzz opened this issue Jun 29, 2018 · 7 comments

Comments

@sarahbuzz
Copy link

Hello,

I'm attempting to use the DOT file created by Roary for some downstream analyses (great addition and thank you for writing it to a file!). I was looking through the weights that were calculated and I'm getting very confused so please tell me if I'm misunderstanding something!

I read through the Wiki and the Supplemental information and was under the impression that the weight of each edge is calculated by 1 divided by the number of isolates that have those two nodes (genes) next to each other in their respective genomes. I see weights greater than 1 very often and I'm confused on how any weight can be greater than 1 if your dividing 1 by an integer.

I would appreciate any guidance on how to interpret the weight calculation!

@tseemann
Copy link
Contributor

I don't know the answer, but perhaps paralog genes might be the problem or bug? Not sure!

@sarahbuzz
Copy link
Author

How would the split-paralogs flag affect that? I would expect paralogs to be in separate clusters if the split paralogs is flagged.

I also have cases where the weight is less than 1 but when I invert that to see what the number of isolates was, I get a number that isn't right. For example, one weight was 0.032258. I then did 1 divided by that number to get the number of isolates and I get 31. I checked how often those two gene pairs are next to each other in our population set and it's much higher than 31.

I went into the OrderGenes.pm code and printed the denominator of the weight calculation and they are floats, not integers. Is there any other criteria your team used to calculate the denominator of the weight equation other than frequency?

@andrewjpage
Copy link
Member

Hi Sarah,
Sorry that the code isnt the clearest, and looking back now after 3 years I realise I should have documented it a bit clearer. Its great to hear your looking at the graphs, because I think this is the most interesting (untapped) output of roary!

Higher numbers = longer path = 2 genes less commonly found beside each other on contigs.
Lower numbers = shorter path = 2 gene more commonly found beside each other on contigs.

To reduce the impact of over represented genotypes in a sample set, a clustering is performed with cd-hit, and each cluster/genotype gets a weighting (in AccessoryClustering.pm). For example if you have 100 samples, but 90 come from a single outbreak this would skew your results. So any samples from the single outbreak are down weighted and only contribute 1/90. These weightings carry over to the graph, so that the contribution of theses over represented samples doesnt bias the graph.

If you think the method could be improved, please feel free to submit a pull request.
Regards,
Andrew

@sarahbuzz
Copy link
Author

Hi Andrew,
Thank you for the explanation! Ok I think I understand now. You defined outbreaks as those that map to the same cluster in CD-HIT? Then you are using the representative sequences to create the graph nodes and edges then using the frequency across the representatives to calculate the weight?

Thanks and I agree, this graph is a great resource!
Sarah

@andrewjpage
Copy link
Member

Yes thats correct.

@sarahbuzz
Copy link
Author

Great, thank you for your help Andrew!

@sarahbuzz
Copy link
Author

Andrew, I do have another question. If we are inverting the frequency, I'm confused how we can get weights larger than 1. I printed the denominator in the OrderGenes.pm and noticed that they are floats.

I'm confused how a frequency can be a float in this case when calculating the weight?

@sarahbuzz sarahbuzz reopened this Jul 2, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants