fixed bomb out when user tries to update distance matrix before adding any strains
timdallman committed Jan 4, 2018
1 parent 863e61d commit 8ee28a7
Showing 5 changed files with 112 additions and 45 deletions.
59 changes: 57 additions & 2 deletions README.md
@@ -174,12 +174,67 @@ Using hierarchical single linkage clustering of the pairwise SNP distances we ar
snapperdb.py update_clusters -c $myconfigname
```

Isolates that have artificially low genetic distances to others may lead to cluster merges under single linkage clustering. Artificially low genetic distances can be caused by low quality samples or by samples that are mixed with another isolate. To combat this problem, a Z-Score is calculated for each isolate in each cluster to ascertain how far that isolate's average SNP distance deviates from the mean average SNP distance in that cluster. If an isolate's Z-Score is less than -1.75 it is deemed an outlier and will have to be manually accepted or ignored.
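The check can be sketched as follows. This is an illustrative reimplementation, not SnapperDB's own code: the function names (`mean_distance`, `find_outliers`) and the distance-matrix representation (a dict keyed by unordered isolate pairs) are assumptions made for the example.

```python
# Sketch of the outlier check described above: each isolate's mean pairwise
# SNP distance to the rest of its cluster is converted to a Z-Score against
# the cluster-wide mean of those means; anything below -1.75 is flagged.

Z_THRESHOLD = -1.75

def mean_distance(isolate, isolates, dist):
    """Mean SNP distance from `isolate` to every other member of the cluster."""
    others = [dist[frozenset((isolate, o))] for o in isolates if o != isolate]
    return sum(others) / float(len(others))

def find_outliers(isolates, dist):
    """Return (isolate, mean_distance, z_score) tuples below the threshold."""
    means = {iso: mean_distance(iso, isolates, dist) for iso in isolates}
    mu = sum(means.values()) / len(means)
    sd = (sum((m - mu) ** 2 for m in means.values()) / len(means)) ** 0.5
    if sd == 0:  # all isolates equidistant, e.g. a cluster of identical strains
        return []
    return [(iso, m, (m - mu) / sd)
            for iso, m in sorted(means.items())
            if (m - mu) / sd < Z_THRESHOLD]
```

A flagged isolate is not removed automatically; as described below, it must be manually accepted or ignored.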

Here is an example of the output of update_clusters where an outlier has been identified. Strain 13816_H14212028301-1 has tripped the Z-Score threshold in the 50 SNP cluster '1': its average distance to the other isolates in this cluster was 42.6, and against the cluster's average distance of 67.2 (standard deviation 14.1) this gives a Z-Score of (42.6 - 67.2) / 14.1 ≈ -1.750, just below the -1.75 threshold.


```
### Cluster level 250 :23:41:32.847220
### Cluster level 100 :23:53:40.710755
### Cluster level 50 :00:00:35.531624
### Cluster level 25 :00:03:01.664428
### Cluster level 10 :00:03:51.182371
### Cluster level 5 :00:04:37.412988
### Cluster level 0 :00:05:26.974864
### Getting previously checked outliers:00:06:26.419278
### Checking Clusters:00:06:26.540735
33-19-8101311 [1, 1, 1, 124, 3121, 4294, 5529]
### Investigating 250 1
### Average Distance 417.962973573
### Standard Deviation 216.131996093
### Investigating 100 1
### Average Distance 69.2875007924
### Standard Deviation 22.0576314
### Investigating 50 1
### Average Distance 67.2382283514
### Standard Deviation 14.066770828
### Outlier Z-Score 13816_H14212028301-1 42.6178690009 -1.75024955276
### Investigating 25 124
### Average Distance 60.1811010179
### Standard Deviation 12.4097892225
### Investigating 10 3121
### Average Distance 10.041015625
### Standard Deviation 3.85288497424
### Investigating 5 4294
### Average Distance 0.0
### Investigating 0 5529
### Average Distance 0.0
```


At this stage it is best to manually inspect the SNP alignment for the outlier (see Generating Alignments for Phylogenetic Analysis below): does it have more 'N' bases than expected?

If you are happy to accept the outlier, you can run the following:

```sh
snapperdb.py accept_outlier -c $myconfigname -n $nameofstrain
```

If you think the outlier is best excluded from this and future analyses, then run:

```sh
snapperdb.py ignore_isolate -c $myconfigname -n $nameofstrain
```

Now you can run update_clusters again to generate the SNP addresses.

### Generating Alignments for Phylogenetic Analysis

The command **get_the_snps** is used to produce alignments for phylogenetic analysis. In its simplest form, **get_the_snps** takes the name of the config file and a list of strains in your database that you want to generate the alignment from.

```sh
-SnapperDB_main.py get_the_snps -c $myconfigname -l $mylistofstrains
+snapperdb.py get_the_snps -c $myconfigname -l $mylistofstrains
```

There are lots of other options in **get_the_snps**; to see them, supply **-h**. We think we have provided fairly sensible defaults. One key option is **-m**, which sets the maximum number of SNPs away from the reference that is allowed (default 5000); strains with more SNPs than this will be ignored. Another important option is **-a**, the alignment type. The three options available are **C** (Core), which only allows positions that are present in all strains; **A** (Soft Core), which produces alignments where at least a specified percentage of the samples are A/C/T/G at each position (default A:80); and **W**, which produces whole genome alignments. If you wish to mask regions of the genome out of the alignment, for example parts of the reference known to have recombined, you can either supply a file of coordinates to ignore with the **-n** flag or supply a GFF file (produced by gubbins, for example) with the **-ng** option. The **-b** flag allows the list of strains supplied with the **-l** flag to be complemented with other background strains from the database; to include a representative from each 100 SNP cluster you would provide the option **t100**. Finally, if you would like to output a list of annotated variants or a distance matrix of pairwise SNP distances, provide a 'Y' option to the **-v** and **-x** flags respectively.
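As an illustration of combining these flags (the specific values here are hypothetical examples, not recommendations), a run producing a soft-core alignment might look like:

```sh
# Illustrative only: soft-core alignment keeping positions where at least 90%
# of samples are A/C/T/G, ignoring strains more than 3000 SNPs from the
# reference, adding one background representative per 100 SNP cluster, and
# also writing the annotated variant list and the pairwise distance matrix.
snapperdb.py get_the_snps -c $myconfigname -l $mylistofstrains \
    -m 3000 -a A:90 -b t100 -v Y -x Y
```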
@@ -189,5 +244,5 @@ There are lots of other options in **get_the_snps**.
This will create a set of JSON files containing all the information needed to share with another database:

```sh
-SnapperDB_main.py export_json -c $myconfigname -l $mylistofstrains
+snapperdb.py export_json -c $myconfigname -l $mylistofstrains
```
7 changes: 5 additions & 2 deletions snapperdb/__init__.py
@@ -5,8 +5,11 @@
 import sys
 
 
-__config_dir__ = os.path.join(os.path.dirname(os.path.dirname(os.path.realpath(__file__))), 'user_configs')
-
+try:
+    __config_dir__ = os.environ['GASTROSNAPPER_CONFPATH']
+except:
+    __config_dir__ = os.path.join(os.path.dirname(os.path.dirname(os.path.realpath(__file__))), 'user_configs')
+
 
 # in the brave new world we no longer keep stuff in relative paths to the code
 # the location of the reference genomes is set in the module for phe/gastro_fastq_to_vcf_config
 # keep the except branch for compatibility - ulf 1Nov2016
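With the try/except above, the config directory can be overridden via the GASTROSNAPPER_CONFPATH environment variable, falling back to the repository's user_configs directory when it is unset. A usage sketch (the path shown is hypothetical):

```sh
# Point SnapperDB at a custom config directory before running any command.
export GASTROSNAPPER_CONFPATH=/path/to/my/configs
snapperdb.py update_clusters -c $myconfigname
```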
32 changes: 18 additions & 14 deletions snapperdb/snpdb/__init__.py
@@ -401,23 +401,27 @@ def update_distance_matrix(config_dict, args):
     snpdb.parse_config_dict(config_dict)
     snpdb._connect_to_snpdb()
     logger.info('Getting strains')
-    strain_list, update_strain = snpdb.get_strains()
+    strain_list, update_strain, all_strains = snpdb.get_strains()
 
 
     # # get_all_good_ids from snpdb2 takes a snp cutoff as well, here, we don't have a SNP cutoff so we set it arbitrarily high.
     snp_co = '1000000'
-    if update_strain:
-        print "### Populating distance matrix: " + str(datetime.datetime.now())
-        snpdb.parse_args_for_update_matrix(snp_co, strain_list)
-        if args.hpc == 'N':
-            print '### Launching serial update_distance_matrix ' + str(datetime.datetime.now())
-            snpdb.check_matrix(strain_list, update_strain)
-            snpdb.update_clusters()
-        else:
-            print '### Launching parallel update_distance_matrix ' +str(datetime.datetime.now())
-            present_stains = list(set(strain_list) - set(update_strain))
-            for idx, one_strain in enumerate(chunks(list(update_strain), int(args.hpc))):
-                snpdb.write_qsubs_to_check_matrix(args, idx, one_strain, present_stains, config_dict['snpdb_name'])
-            snpdb.check_matrix(update_strain, update_strain)
+    if all_strains or len(update_strain) > 1:
+        if update_strain:
+            print "### Populating distance matrix: " + str(datetime.datetime.now())
+            snpdb.parse_args_for_update_matrix(snp_co, strain_list)
+            if args.hpc == 'N':
+                print '### Launching serial update_distance_matrix ' + str(datetime.datetime.now())
+                snpdb.check_matrix(strain_list, update_strain)
+                snpdb.update_clusters()
+            else:
+                print '### Launching parallel update_distance_matrix ' +str(datetime.datetime.now())
+                present_stains = list(set(strain_list) - set(update_strain))
+                for idx, one_strain in enumerate(chunks(list(update_strain), int(args.hpc))):
+                    snpdb.write_qsubs_to_check_matrix(args, idx, one_strain, present_stains, config_dict['snpdb_name'])
+                snpdb.check_matrix(update_strain, update_strain)
+        else:
+            print '### Nothing to update ' + str(datetime.datetime.now())
     else:
         print '### Nothing to update ' + str(datetime.datetime.now())
 # -------------------------------------------------------------------------------------------------
53 changes: 29 additions & 24 deletions snapperdb/snpdb/snpdb.py
@@ -5,6 +5,7 @@
 import glob
 import os
 import math
+import subprocess
 import json
 import sys
 import re
@@ -953,7 +954,7 @@ def get_strains(self):
 
         update_strain = set(strain_list) - set(all_strains)
 
-        return strain_list, update_strain
+        return strain_list, update_strain, all_strains
 
     def parse_args_for_update_matrix(self, snp_co, strain_list):
 
@@ -1408,32 +1409,36 @@ def check_clusters(self,clusters, dist_mat,levels, cluster_strain_list,outliers)
     def update_clusters(self):
         #get distance matrix
         dist_mat = self.get_input()
-        #get clusters
-        co, cluster_strain_list, cluster_dict = self.get_clusters()
-        #print cluster_dict
-        clean_clusters = {}
-        for i, cuts in enumerate(sorted(co,reverse=True,key=int)):
-            print "### Cluster level "+str(cuts)+" :"+ str(datetime.time(datetime.now()))
-            #print "making links"
-            links = self.make_links(dist_mat, cuts, cluster_dict[i])
-            #print "defining_clusters"
-            clusters = self.define_clusters(links)
-            #print "removing duplicates"
-            clean_clusters[cuts] = self.remove_duplicate_clusters(clusters)
-
-        print "### Getting previously checked outliers:"+ str(datetime.time(datetime.now()))
-        outliers = self.get_outliers()
-
-        print "### Checking Clusters:"+ str(datetime.time(datetime.now()))
-        bad_list = self.check_clusters(clean_clusters, dist_mat, co,cluster_strain_list,outliers)
-
-        if not bad_list:
-            strain_list = self.add_clusters_to_existing_table(clean_clusters, dist_mat, co, cluster_strain_list)
-            self.merged_clusters(cluster_strain_list, strain_list)
-
+        if dist_mat:
+
+            #get clusters
+            co, cluster_strain_list, cluster_dict = self.get_clusters()
+            #print cluster_dict
+            clean_clusters = {}
+            for i, cuts in enumerate(sorted(co,reverse=True,key=int)):
+                print "### Cluster level "+str(cuts)+" :"+ str(datetime.time(datetime.now()))
+                #print "making links"
+                links = self.make_links(dist_mat, cuts, cluster_dict[i])
+                #print "defining_clusters"
+                clusters = self.define_clusters(links)
+                #print "removing duplicates"
+                clean_clusters[cuts] = self.remove_duplicate_clusters(clusters)
+
+            print "### Getting previously checked outliers:"+ str(datetime.time(datetime.now()))
+            outliers = self.get_outliers()
+
+            print "### Checking Clusters:"+ str(datetime.time(datetime.now()))
+            bad_list = self.check_clusters(clean_clusters, dist_mat, co,cluster_strain_list,outliers)
+
+            if not bad_list:
+                strain_list = self.add_clusters_to_existing_table(clean_clusters, dist_mat, co, cluster_strain_list)
+                self.merged_clusters(cluster_strain_list, strain_list)
+            else:
+                print "### Not Updating Clusters:"+ str(datetime.time(datetime.now()))
         else:
             print "### Not Updating Clusters:"+ str(datetime.time(datetime.now()))
 
 
     # -------------------------------------------------------------------------------------------------
 
     def sql_single_extrac(self,data, data_type, table):
6 changes: 3 additions & 3 deletions user_configs/ebg4_config.txt
@@ -1,7 +1,7 @@
 snpdb_name ebg_4_snps
 reference_genome AM933172
-pg_uname timdallman
-pg_pword
+pg_uname postgres
+pg_pword postgres
 pg_host localhost
 depth_cutoff 10
 mq_cutoff 30
@@ -10,4 +10,4 @@ average_depth_cutoff 30
 mapper bwa
 mapper_threads 4
 variant_caller gatk
-variant_caller_threads 4
\ No newline at end of file
+variant_caller_threads 4
