Strains not added to the database #6
Comments
The problem, as you guessed, is a 'bad' cluster, or more specifically a 'bad' strain. It is this one:

Outlier Z-Score lm11_1 18.7142857143 -1.75952223932

This means that the median pairwise distance of strain lm11_1 is 1.759 standard deviations (of the pairwise distance) below the cluster median pairwise distance. For example, if the median pairwise distance of strains within this cluster is 10 and the standard deviation is 2, then this strain has a median pairwise distance of about 6.5 SNPs from the other strains.

This is a bad thing because if a strain is too close to all the other strains in your cluster, it might mean that you have sequenced a mixed culture (one of the strains in the mix is in your cluster, the other is not). This will lead to 'N's instead of SNPs at variant positions, which will make this strain appear closer than expected to all the other strains in the cluster. This has the potential to 'collapse' clusters and create big headaches.

If you want to troubleshoot, then 'get_the_snps' for the strain that is causing trouble, lm11_1, and 5 or 6 other strains at varying distances from lm11_1. Then visually inspect the alignment. If there are loads of singleton Ns scattered through the lm11_1 sequence, then the strain could be a mix and maybe should be ignored. If there are just a few Ns, or they are clustered (which may mean that part of the ref genome is not well covered/present in lm11_1), then you can allow the strain into your clusters. You do this by updating the 'zscore_check' field in the strain_stats table to be 'Y', using e.g. pgAdmin. This means you have checked a strain which tripped the zscore, and it's ok.

The cutoff is -1.75, so your strain has only just tripped it, and is probably fine.

This is based on my slightly rusty memory, @timdallman should probably confirm.
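For anyone who wants to redo the arithmetic from the worked example above, here is a minimal Python sketch. It is not SnapperDB code: the function names are made up for illustration, the median of 10 and standard deviation of 2 are the illustrative values from the comment (not the real cluster statistics), and the strict "less than" comparison against the -1.75 cutoff is an assumption.

```python
def implied_distance(cluster_median, cluster_sd, z_score):
    """Median pairwise distance implied by a z-score.

    Hypothetical helper: simply inverts z = (d - median) / sd.
    """
    return cluster_median + z_score * cluster_sd


def trips_cutoff(z_score, cutoff=-1.75):
    """True if the z-score is below the cutoff quoted in this thread.

    Assumption: the check is a strict 'less than' comparison.
    """
    return z_score < cutoff


# Illustrative cluster statistics from the comment above.
distance = implied_distance(10, 2, -1.75952223932)
print(round(distance, 2))             # 6.48, i.e. roughly 6.5 SNPs
print(trips_cutoff(-1.75952223932))   # True, but only just
```

The reported z-score is only about 0.01 below the cutoff, which is why the strain is described as having just tripped it.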
Alright, thanks a lot for this clear explanation! This makes a lot of sense.
Thanks for the perfect explanation, Phil - I appreciate this is not explained in the docs yet and will rectify as soon as possible. Tim
Oh, and the final thing to say is that if you decide to ignore it, then you should update the 'ignore' column in strain_stats for this strain. Then the clustering will run to completion in future if you add more strains that don't have problems (I think).
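When deciding between ignoring the strain and clearing it, the visual inspection of Ns described earlier in the thread can be backed up with a quick count. The Python sketch below is only a rough aid and not part of SnapperDB: `summarise_n_runs` is a hypothetical helper, it expects the strain's aligned sequence as a plain string, and it treats adjacent alignment columns as "clustered", which is an assumption about how the alignment is laid out.

```python
import re


def summarise_n_runs(sequence):
    """Count isolated Ns versus runs of consecutive Ns in one sequence.

    Hypothetical helper for eyeballing an alignment; the thread gives no
    threshold for what counts as 'loads' of singleton Ns.
    """
    # Length of every maximal run of N (case-insensitive).
    runs = [m.end() - m.start() for m in re.finditer(r"[Nn]+", sequence)]
    singletons = sum(1 for length in runs if length == 1)
    return {
        "total_n": sum(runs),
        "singleton_n": singletons,
        "clustered_runs": len(runs) - singletons,
        "longest_run": max(runs, default=0),
    }


# Scattered single Ns: the pattern that may suggest a mixed culture.
print(summarise_n_runs("ACGNTANGCNA"))
# One block of Ns: the pattern that may suggest a poorly covered region.
print(summarise_n_runs("ACGNNNNTAGC"))
```

Many singleton Ns relative to clustered runs would point towards the mixed-culture explanation, while a few long runs would point towards missing coverage. The final call still needs a look at the alignment itself.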
Hey,
I'm experimenting a bit with SnapperDB and I have a problem adding additional strains to the database.
I've successfully created a database with 14 strains.
Now when I try to add 6 more, they are not added to the 'strain_clusters' table in the database (they are added to the 'strain_stats' table however).
I used the example commands from the tutorial:
fastq_to_db (for all FASTQ files)
update_distance_matrix
update_clusters
I get the following output:
I think the clusters are somehow flagged as 'bad' clusters, but the previous clusters were really similar and they were successfully added to the db. Do you know what might be the cause of this?