fixed bomb out when user tries to update distance matrix before adding any strains
timdallman committed Jan 4, 2018
1 parent 863e61d commit 8ee28a7
Showing 5 changed files with 112 additions and 45 deletions.
59 changes: 57 additions & 2 deletions README.md
@@ -174,12 +174,67 @@ Using hierarchical single linkage clustering of the pairwise SNP distances we ar
snapperdb.py update_clusters -c $myconfigname
```

Isolates that have artificially low genetic distances to others may lead to cluster merges under single linkage clustering. Artificially low genetic distances can be caused by low quality samples or by samples that are mixed with another isolate. To combat this problem, a Z-Score is calculated for each isolate in each cluster to ascertain how far that isolate's average SNP distance deviates from the mean average SNP distance in that cluster. If an isolate's Z-Score is less than -1.75 it is deemed an outlier and will have to be manually accepted or ignored.
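The check can be sketched as follows. This is an illustrative reimplementation, not SnapperDB's own code: the function names (`mean_distance`, `find_outliers`) and the distance-matrix representation (a dict keyed by unordered isolate pairs) are assumptions made for the example.

```python
# Sketch of the outlier check described above: each isolate's mean pairwise
# SNP distance to the rest of its cluster is converted to a Z-Score against
# the cluster-wide mean of those means; anything below -1.75 is flagged.

Z_THRESHOLD = -1.75

def mean_distance(isolate, isolates, dist):
    """Mean SNP distance from `isolate` to every other member of the cluster."""
    others = [dist[frozenset((isolate, o))] for o in isolates if o != isolate]
    return sum(others) / float(len(others))

def find_outliers(isolates, dist):
    """Return (isolate, mean_distance, z_score) tuples below the threshold."""
    means = {iso: mean_distance(iso, isolates, dist) for iso in isolates}
    mu = sum(means.values()) / len(means)
    sd = (sum((m - mu) ** 2 for m in means.values()) / len(means)) ** 0.5
    if sd == 0:  # all isolates equidistant, e.g. a cluster of identical strains
        return []
    return [(iso, m, (m - mu) / sd)
            for iso, m in sorted(means.items())
            if (m - mu) / sd < Z_THRESHOLD]
```

A flagged isolate is not removed automatically; as described below, it must be manually accepted or ignored.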

Here is an example of the output of update_clusters where an outlier has been identified. Strain 13816_H14212028301-1 has tripped the Z-Score threshold in the 50 SNP cluster '1': its average distance to the other isolates in this cluster was 42.6, and against the cluster's average distance of 67.2 (standard deviation 14.1) this gives a Z-Score of (42.6 - 67.2) / 14.1 ≈ -1.750, just below the -1.75 threshold.


```
### Cluster level 250 :23:41:32.847220
### Cluster level 100 :23:53:40.710755
### Cluster level 50 :00:00:35.531624
### Cluster level 25 :00:03:01.664428
### Cluster level 10 :00:03:51.182371
### Cluster level 5 :00:04:37.412988
### Cluster level 0 :00:05:26.974864
### Getting previously checked outliers:00:06:26.419278
### Checking Clusters:00:06:26.540735
33-19-8101311 [1, 1, 1, 124, 3121, 4294, 5529]
### Investigating 250 1
### Average Distance 417.962973573
### Standard Deviation 216.131996093
### Investigating 100 1
### Average Distance 69.2875007924
### Standard Deviation 22.0576314
### Investigating 50 1
### Average Distance 67.2382283514
### Standard Deviation 14.066770828
### Outlier Z-Score 13816_H14212028301-1 42.6178690009 -1.75024955276
### Investigating 25 124
### Average Distance 60.1811010179
### Standard Deviation 12.4097892225
### Investigating 10 3121
### Average Distance 10.041015625
### Standard Deviation 3.85288497424
### Investigating 5 4294
### Average Distance 0.0
### Investigating 0 5529
### Average Distance 0.0
```


At this stage it is best to manually inspect the SNP alignment for the outlier (see Generating Alignments for Phylogenetic Analysis below): does it have more 'N' bases than expected?

If you are happy to accept the outlier, you can run the following:

```sh
snapperdb.py accept_outlier -c $myconfigname -n $nameofstrain
```

If you think the outlier is best excluded from this and future analyses, then run:

```sh
snapperdb.py ignore_isolate -c $myconfigname -n $nameofstrain
```

Now you can run update_clusters again to generate the SNP addresses.

### Generating Alignments for Phylogenetic Analysis

The command **get_the_snps** is used to produce alignments for phylogenetic analysis. In its simplest form, **get_the_snps** takes the name of the config file and a list of strains in your database that you want to generate the alignment from.

```sh
-SnapperDB_main.py get_the_snps -c $myconfigname -l $mylistofstrains
+snapperdb.py get_the_snps -c $myconfigname -l $mylistofstrains
```

There are lots of other options in **get_the_snps**; to see them, supply **-h**. We think we have provided fairly sensible defaults. One key option is **-m**, which sets the maximum number of SNPs away from the reference that is allowed (default 5000); strains with more SNPs than this will be ignored. Another important option is **-a**, the alignment type. The three options available are **C** (Core), which only allows positions that are present in all strains; **A** (Soft Core), which produces alignments where at least a specified percentage of the samples are A/C/T/G at each position (default A:80); and **W**, which produces whole genome alignments. If you wish to mask regions of the genome out of the alignment, for example parts of the reference known to have recombined, you can either supply a file of coordinates to ignore with the **-n** flag or supply a GFF file (produced by gubbins, for example) with the **-ng** option. The **-b** flag allows the list of strains supplied with the **-l** flag to be complemented with other background strains from the database; to include a representative from each 100 SNP cluster you would provide the option **t100**. Finally, if you would like to output a list of annotated variants or a distance matrix of pairwise SNP distances, provide a 'Y' option to the **-v** and **-x** flags respectively.
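As an illustration of combining these flags (the specific values here are hypothetical examples, not recommendations), a run producing a soft-core alignment might look like:

```sh
# Illustrative only: soft-core alignment keeping positions where at least 90%
# of samples are A/C/T/G, ignoring strains more than 3000 SNPs from the
# reference, adding one background representative per 100 SNP cluster, and
# also writing the annotated variant list and the pairwise distance matrix.
snapperdb.py get_the_snps -c $myconfigname -l $mylistofstrains \
    -m 3000 -a A:90 -b t100 -v Y -x Y
```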
@@ -189,5 +244,5 @@ There are lots of other options in **get_the_snps**.
This will create a set of JSON files containing all the information needed to share with another database:

```sh
-SnapperDB_main.py export_json -c $myconfigname -l $mylistofstrains
+snapperdb.py export_json -c $myconfigname -l $mylistofstrains
```
7 changes: 5 additions & 2 deletions snapperdb/__init__.py
@@ -5,8 +5,11 @@
 import sys
 
 
-__config_dir__ = os.path.join(os.path.dirname(os.path.dirname(os.path.realpath(__file__))), 'user_configs')
-
+try:
+    __config_dir__ = os.environ['GASTROSNAPPER_CONFPATH']
+except:
+    __config_dir__ = os.path.join(os.path.dirname(os.path.dirname(os.path.realpath(__file__))), 'user_configs')
+
 
 # in the brave new world we no longer keep stuff in relative paths to the code
 # the location of the reference genomes is set in the module for phe/gastro_fastq_to_vcf_config
 # keep the except branch for compatibility - ulf 1Nov2016
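With the try/except above, the config directory can be overridden via the GASTROSNAPPER_CONFPATH environment variable, falling back to the repository's user_configs directory when it is unset. A usage sketch (the path shown is hypothetical):

```sh
# Point SnapperDB at a custom config directory before running any command.
export GASTROSNAPPER_CONFPATH=/path/to/my/configs
snapperdb.py update_clusters -c $myconfigname
```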
32 changes: 18 additions & 14 deletions snapperdb/snpdb/__init__.py
@@ -401,23 +401,27 @@ def update_distance_matrix(config_dict, args):
     snpdb.parse_config_dict(config_dict)
     snpdb._connect_to_snpdb()
     logger.info('Getting strains')
-    strain_list, update_strain = snpdb.get_strains()
+    strain_list, update_strain, all_strains = snpdb.get_strains()
 
 
     # # get_all_good_ids from snpdb2 takes a snp cutoff as well, here, we don't have a SNP cutoff so we set it arbitrarily high.
     snp_co = '1000000'
-    if update_strain:
-        print "### Populating distance matrix: " + str(datetime.datetime.now())
-        snpdb.parse_args_for_update_matrix(snp_co, strain_list)
-        if args.hpc == 'N':
-            print '### Launching serial update_distance_matrix ' + str(datetime.datetime.now())
-            snpdb.check_matrix(strain_list, update_strain)
-            snpdb.update_clusters()
-        else:
-            print '### Launching parallel update_distance_matrix ' +str(datetime.datetime.now())
-            present_stains = list(set(strain_list) - set(update_strain))
-            for idx, one_strain in enumerate(chunks(list(update_strain), int(args.hpc))):
-                snpdb.write_qsubs_to_check_matrix(args, idx, one_strain, present_stains, config_dict['snpdb_name'])
-            snpdb.check_matrix(update_strain, update_strain)
+    if all_strains or len(update_strain) > 1:
+        if update_strain:
+            print "### Populating distance matrix: " + str(datetime.datetime.now())
+            snpdb.parse_args_for_update_matrix(snp_co, strain_list)
+            if args.hpc == 'N':
+                print '### Launching serial update_distance_matrix ' + str(datetime.datetime.now())
+                snpdb.check_matrix(strain_list, update_strain)
+                snpdb.update_clusters()
+            else:
+                print '### Launching parallel update_distance_matrix ' +str(datetime.datetime.now())
+                present_stains = list(set(strain_list) - set(update_strain))
+                for idx, one_strain in enumerate(chunks(list(update_strain), int(args.hpc))):
+                    snpdb.write_qsubs_to_check_matrix(args, idx, one_strain, present_stains, config_dict['snpdb_name'])
+                snpdb.check_matrix(update_strain, update_strain)
+        else:
+            print '### Nothing to update ' + str(datetime.datetime.now())
     else:
         print '### Nothing to update ' + str(datetime.datetime.now())
 # -------------------------------------------------------------------------------------------------
53 changes: 29 additions & 24 deletions snapperdb/snpdb/snpdb.py
@@ -5,6 +5,7 @@
 import glob
 import os
 import math
+import subprocess
 import json
 import sys
 import re
@@ -953,7 +954,7 @@ def get_strains(self):
 
         update_strain = set(strain_list) - set(all_strains)
 
-        return strain_list, update_strain
+        return strain_list, update_strain, all_strains
 
     def parse_args_for_update_matrix(self, snp_co, strain_list):
 
@@ -1408,32 +1409,36 @@ def check_clusters(self,clusters, dist_mat,levels, cluster_strain_list,outliers)
     def update_clusters(self):
         #get distance matrix
         dist_mat = self.get_input()
-        #get clusters
-        co, cluster_strain_list, cluster_dict = self.get_clusters()
-        #print cluster_dict
-        clean_clusters = {}
-        for i, cuts in enumerate(sorted(co,reverse=True,key=int)):
-            print "### Cluster level "+str(cuts)+" :"+ str(datetime.time(datetime.now()))
-            #print "making links"
-            links = self.make_links(dist_mat, cuts, cluster_dict[i])
-            #print "defining_clusters"
-            clusters = self.define_clusters(links)
-            #print "removing duplicates"
-            clean_clusters[cuts] = self.remove_duplicate_clusters(clusters)
-
-        print "### Getting previously checked outliers:"+ str(datetime.time(datetime.now()))
-        outliers = self.get_outliers()
-
-        print "### Checking Clusters:"+ str(datetime.time(datetime.now()))
-        bad_list = self.check_clusters(clean_clusters, dist_mat, co,cluster_strain_list,outliers)
-
-        if not bad_list:
-            strain_list = self.add_clusters_to_existing_table(clean_clusters, dist_mat, co, cluster_strain_list)
-            self.merged_clusters(cluster_strain_list, strain_list)
-
+        if dist_mat:
+
+            #get clusters
+            co, cluster_strain_list, cluster_dict = self.get_clusters()
+            #print cluster_dict
+            clean_clusters = {}
+            for i, cuts in enumerate(sorted(co,reverse=True,key=int)):
+                print "### Cluster level "+str(cuts)+" :"+ str(datetime.time(datetime.now()))
+                #print "making links"
+                links = self.make_links(dist_mat, cuts, cluster_dict[i])
+                #print "defining_clusters"
+                clusters = self.define_clusters(links)
+                #print "removing duplicates"
+                clean_clusters[cuts] = self.remove_duplicate_clusters(clusters)
+
+            print "### Getting previously checked outliers:"+ str(datetime.time(datetime.now()))
+            outliers = self.get_outliers()
+
+            print "### Checking Clusters:"+ str(datetime.time(datetime.now()))
+            bad_list = self.check_clusters(clean_clusters, dist_mat, co,cluster_strain_list,outliers)
+
+            if not bad_list:
+                strain_list = self.add_clusters_to_existing_table(clean_clusters, dist_mat, co, cluster_strain_list)
+                self.merged_clusters(cluster_strain_list, strain_list)
+            else:
+                print "### Not Updating Clusters:"+ str(datetime.time(datetime.now()))
         else:
             print "### Not Updating Clusters:"+ str(datetime.time(datetime.now()))
 
 
     # -------------------------------------------------------------------------------------------------
 
     def sql_single_extrac(self,data, data_type, table):
6 changes: 3 additions & 3 deletions user_configs/ebg4_config.txt
@@ -1,7 +1,7 @@
 snpdb_name ebg_4_snps
 reference_genome AM933172
-pg_uname timdallman
-pg_pword
+pg_uname postgres
+pg_pword postgres
 pg_host localhost
 depth_cutoff 10
 mq_cutoff 30
@@ -10,4 +10,4 @@ average_depth_cutoff 30
 mapper bwa
 mapper_threads 4
 variant_caller gatk
-variant_caller_threads 4
\ No newline at end of file
+variant_caller_threads 4
