# Documentation

## Installation

PathOGIST can be easily installed through conda. First install conda at https://www.anaconda.com/download/ and select the corresponding system. Once you have installed conda, follow the code below to install PathOGIST.

In [None]:
conda create --name pathogist python=3.5
source activate pathogist
conda install -c seanla pathogist

To use PATHOGIST, we need to activate the pathogist environment

In [102]:
source activate pathogist

(pathogist) 

: 1

In [103]:
PATHOGIST=/home/klw17/dev/PathOGiST/PATHOGIST
$PATHOGIST -h

                 {run,correlation,consensus,distance,visualize} ...

PathOGiST Version 1.0
Copyright (C) 2018 Leonid Chindelevitch, Cedric Chauve, William Hsiao

positional arguments:
  {run,correlation,consensus,distance,visualize}
    run                 run entire PathOGiST pipeline, from genotyping to clustering
    correlation         perform correlation clustering
    consensus           perform consensus clustering on multiple clusterings
    distance            construct distance matrix from genotyping data
    visualize           visualize distance matrix or clustering

optional arguments:
  -h, --help            show this help message and exit
                        Set the logging level
(pathogist) 

: 1

## Modules of PATHOGIST

## Run Subcommand

This subcommand runs the PathOGiST pipeline from start to finish (i.e. genotyping -> distance matrix creation -> correlation clustering -> consensus clustering). We recommend users to use the run subcommand as PathOGIST will automatically created the modified version of genotype input for its distance matrix calculations.

In [112]:
$PATHOGIST run -h

usage: PATHOGIST run [-h] [-n] CONFIG

positional arguments:
  CONFIG            path to input configuration file, or path to write a new
                    configuration file

optional arguments:
  -h, --help        show this help message and exit
  -n, --new_config  write a blank configuration file at path given by CONFIG
(pathogist) 

: 1

To use the run subcommand, we first have to create a config file and modify it.

In [113]:
$PATHOGIST run -n ./config.yaml

New configuration file written at ./config.yaml
(pathogist) 

: 1

Now let's have a look a look at the config file

In [114]:
cat config.yaml

---
# PathOGiST configuration file.
# This configuration file is in YAML file format, release 1.2 (Third Edition).
# Google yaml for the specification.
# Un-comment example files to use
# Directory to save temporary files.
temp: tests/integration_tests/temp_dir
# Number of threads on your computer
threads: 1
# Select what tools to run: 1 to run and 0 to not run.
run:
  snippy: 1
  kwip: 1
  prince: 1
  spotyping: 1
  mentalist: 1
# Command line options for tools to genotype your raw reads
genotyping:
  input_reads:
    forward_reads: #/home/usr/forwards.txt
    reverse_reads: #/home/usr/reverse.txt
  mentalist:
    # Choose 1 of the following option for mentalist to obtain a mlst database by selecting 1 and 0 for the others
    db_loc:
      local_file: 1
      build_db: 0
      download_pubmlst: 0
      download_cgmlst: 0
      download_enterobase: 0
    local_file:
      database: #/home/usr/mlst.db
    build_db:
      options:
        ## kmer size
        k:
        ## FASTA files w

: 1

There are lots of options to customize in the config file. Fortunately, the majority of the options have already been filled in. You only need to fill in the options using the template examples provided.

For more information about the various genotyping options, please visit:<br>
https://github.com/WGS-TB/MentaLiST, for mentalist<br>
https://kwip.readthedocs.io/en/latest/, for kwip<br>
https://github.com/WGS-TB/PythonPRINCE/, for prince<br>
https://github.com/xiaeryu/SpoTyping, for spotyping<br>
https://github.com/tseemann/snippy, for snippy<br>

Let's look at a filled out config file

In [115]:
cat /home/klw17/dev/PathOGiST/tb_test_config.yaml

---
# PathOGiST configuration file.
# This configuration file is in YAML file format, release 1.2 (Third Edition).
# Google yaml for the specification.

# Directory to save temporary files.
temp: /home/klw17/temp_files/
# Number of threads on your computer
threads: 64
# Select what tools to run: 1 to run and 0 to not run.
run:
  snippy: 1
  kwip: 1
  prince: 1
  spotyping: 1
  mentalist: 1
# Command line options for tools to genotype your raw reads
genotyping:
  input_reads:
    forward_reads: /home/klw17/tb_5_forward.txt
    reverse_reads: /home/klw17/tb_5_reverse.txt
  mentalist:
    # Choose 1 of the following option for mentalist to obtain a mlst database by selecting 1 and 0 for the others
    db_loc:
      local_file: 1
      build_db: 0
      download_pubmlst: 0
      download_cgmlst: 0
      download_enterobase: 0
    local_file:
      database: /projects/pathogist/cgMLST/MTB/Jan2019/mtb_31_Jan2019.db
    build_db:
      options:
        ## kmer size
        k:
        ## FASTA

: 1

To run with a completed config file, simply do the following:

In [None]:
$PATHOGIST run /home/klw17/dev/PathOGiST/tb_test_config.yaml

The final correlation file will be saved to what you have inputted for the output_prefix file

In [131]:
cat /home/klw17/pathogist_all/final_test.tsv

	Consensus	MLST	spoligotyping	SNP	CNV	kWIP
SRR6152639	1	1	1	1	1	1
SRR6152640	2	2	2	2	1	2
SRR6152641	1	3	1	1	1	1
SRR6152642	1	4	1	1	1	1
SRR6152643	1	4	1	1	1	3
(pathogist) 

: 1

The above config file generates the genotyping calls. If you already have the genotyping calls or the distance matrix created, you can use those instead. Check out the config and path files at https://github.com/WGS-TB/PathOGiST/blob/genotype_of_john/tests/integration_tests/test3_data/ and https://github.com/WGS-TB/PathOGiST/blob/genotype_of_john/tests/integration_tests/test4_data/ to see examples on how to format your files.

## Distance Subcommand

This subcommand allows the user to run only the distance matrix creation of the PathOGIST pipeline. If the user already has the genotyping step done, they can use this module to construct their distance matrix. PathOGIST uses the hamming distance metrics to create the distance matrix.

In [104]:
$PATHOGIST distance -h

usage: PATHOGIST distance [-h] [--bed BED]
                          calls_path {MLST,CNV,SNP,Spoligotype} output_path

positional arguments:
  calls_path            path to file containing paths to signal calls (e.g.
                        MLST calls, CNV calls, etc)
  {MLST,CNV,SNP,Spoligotype}
                        genotyping data
  output_path           path to output tsv file

optional arguments:
  -h, --help            show this help message and exit
  --bed BED             bed file of unwanted SNP positions in the genome
(pathogist) 

: 1

### Example 1 with SNP genotyping data:
An example calls_path file will look like this:

In [44]:
snp_calls_path=/projects/pathogist/klw17/genotyping/yersinia_pseudotuberculosis/snippy_dec_02/dist/primitive_snippy_calls.txt
head -n 3 $calls_path

(pathogist2) /projects/pathogist/klw17/PathOGiSTPrivate/genotyping/yersinia_Williamson/mentalist/unfiltered/original_calls/ERR024893.class
/projects/pathogist/klw17/PathOGiSTPrivate/genotyping/yersinia_Williamson/mentalist/unfiltered/original_calls/ERR024894.class
/projects/pathogist/klw17/PathOGiSTPrivate/genotyping/yersinia_Williamson/mentalist/unfiltered/original_calls/ERR024895.class
(pathogist2) 

: 1

Running distance matrix generation with SNP data:

In [75]:
snp_dist_output=snp_distance_matrix.tsv
$PATHOGIST distance $snp_calls_path SNP $snp_dist_output

(pathogist2) 07:11:54 AM (5183 ms) -> INFO:Creating distance matrix ...
07:12:37 AM (48501 ms) -> INFO:Writing distance matrix ...
07:12:37 AM (48569 ms) -> INFO:Distance matrix creation complete!
(pathogist2) 

: 1

Below is a small portion of the resulting distance matrix

In [76]:
head $snp_dist_output | cut -f 1,2,3,4

	ERR1413989	ERR024905	ERR1414006
ERR1413989	0	16103	16454
ERR024905	16103	0	449
ERR1414006	16454	449	0
SRR1922794	18117	17754	17957
ERR024903	22463	22004	22427
ERR1414057	16454	449	0
ERR024922	19108	9906	9935
ERR1413986	1807	15678	16027
ERR024921	25691	25346	25721
(pathogist2) 

: 1

*Attention* PathOGIST requires modified SNP genotyping input. The SNP input should only contain SNP substitution. An example SNP genotyping input file will be formatted as such with the name of the sample as the first line

In [26]:
head -n5 /projects/pathogist/klw17/genotyping/yersinia_pseudotuberculosis/snippy_dec_02/snps/ERR024892/snps.primitive.tab

ERR024892
CHROM	POS	TYPE	REF	ALT	EVIDENCE	FTYPE	STRAND	NT_POS	AA_POS	EFFECT	LOCUS_TAG	GENE	PRODUCT
NZ_LT596221.1	309	snp	T	C	C:10 T:0								
NZ_LT596221.1	663	snp	C	T	T:10 C:0								
NZ_LT596221.1	1432	snp	A	G	G:10 A:1								
(pathogist2) 

: 1

### Example 2 with MLST genotyping data:

An example calls_path file will look like this:

In [27]:
calls_path=/home/klw17/yersinia_mlst_calls.txt
head -n 3 $calls_path

(pathogist2) /projects/pathogist/klw17/PathOGiSTPrivate/genotyping/yersinia_Williamson/mentalist/unfiltered/original_calls/ERR024893.class
/projects/pathogist/klw17/PathOGiSTPrivate/genotyping/yersinia_Williamson/mentalist/unfiltered/original_calls/ERR024894.class
/projects/pathogist/klw17/PathOGiSTPrivate/genotyping/yersinia_Williamson/mentalist/unfiltered/original_calls/ERR024895.class
(pathogist2) 

: 1

Running distance matrix generation with MLST data:

In [96]:
mlst_dist_output=mlst_distance_matrix.tsv
$PATHOGIST distance $calls_path MLST $mlst_dist_output

(pathogist2) 04:51:31 PM (3496 ms) -> INFO:Creating distance matrix ...
04:51:34 PM (6061 ms) -> INFO:Writing distance matrix ...
04:51:34 PM (6122 ms) -> INFO:Distance matrix creation complete!
(pathogist2) 

: 1

Below is a small portion of the resulting distance matrix

In [33]:
head $dist_output | cut -f 1,2,3,4

	SRR1922796	ERR1414057	ERR1414049
SRR1922796	0	986	987
ERR1414057	986	0	1
ERR1414049	987	1	0
ERR024915	1527	1531	1531
ERR1413984	1063	1107	1106
ERR1414067	986	1	2
ERR1414030	986	2	3
ERR1414048	987	3	4
ERR1414080	987	4	3
(pathogist2) 

: 1

## Correlation Subcommand

This subcommand allows the user to cluster their samples based on the distance matrix of different genotypes using correlation clustering.

In [105]:
$PATHOGIST correlation -h

usage: PATHOGIST correlation [-h] [-a] [-m {C4,ILP}] [-p]
                             distance_matrix threshold output_path

positional arguments:
  distance_matrix       path to the distance matrix file
  threshold             threshold value for correlation
  output_path           path to write cluster output tsv file

optional arguments:
  -h, --help            show this help message and exit
  -a, --all_constraints
                        add all constraints to the optimization problem, not
                        just those with mixed signs.
  -m {C4,ILP}, --method {C4,ILP}
                        Method for correlation clustering
  -p, --presolve        presolve the ILP
(pathogist) 

: 1

The input is a distance matrix that is either generated through PathOGIST or through kWIP and a threshold value determined by the user. The threshold should be selected based on the user's general idea of his collection of samples. PathOGIST can use two methods to calculate correlation clustering: C4 and ILP. C4 is very but is non-deterministic where is ILP is deterministic but requires CPLEX.

### Example 1 using SNP distance matrix:

In [116]:
snp_corr_output=snp_corr.output
$PATHOGIST correlation -a -m C4 $snp_dist_output 1000 $snp_corr_output

(pathogist) 05:08:20 AM (5163 ms) -> INFO: Opening distance matrix...
(pathogist) 

: 1

Clustering Results:

In [117]:
head $snp_corr_output

Sample	Cluster
ERR1413989	1
ERR1413990	1
ERR1413996	1
ERR1413997	1
ERR1413998	1
ERR1413999	1
ERR024905	2
ERR1414006	2
ERR1414057	2
(pathogist) 

: 1

### Example 2 using MLST distance matrix:

In [128]:
mlst_corr_output=mlst_corr_output
$PATHOGIST correlation -a -m C4 $mlst_dist_output 100 $mlst_corr_output

(pathogist) 05:33:01 AM (3332 ms) -> INFO: Opening distance matrix...
(pathogist) 

: 1

In [130]:
head $mlst_corr_output

Sample	Cluster
ERR1414039	1
ERR1414072	1
ERR1414077	1
ERR1414082	1
ERR1414086	1
ERR1414101	1
ERR1414080	1
ERR1414078	1
ERR1414055	1
(pathogist) 

: 1

### Example 3 using kWIP distance matrix:

In [119]:
kwip_dist=/projects/pathogist/klw17/PathOGiSTPrivate/genotyping/yersinia_Williamson/kwip21/pseudoTB21.dist
kwip_corr_output=kwip_corr_output
$PATHOGIST correlation -a -m C4 $kwip_dist 1.2 $kwip_corr_output

(pathogist) (pathogist) 05:09:11 AM (3737 ms) -> INFO: Opening distance matrix...
(pathogist) 

: 1

In [129]:
head $kwip_corr_output

Sample	Cluster
ERR024892	1
ERR024893	1
ERR024894	1
ERR024896	1
ERR024901	1
ERR024902	1
ERR024903	1
ERR024905	1
ERR024907	1
(pathogist) 

: 1

## Consensus Subcommand

This subcommand allows the user to obtain the consensus from the different clustering results from different genotypes

In [111]:
$PATHOGIST consensus -h

usage: PATHOGIST consensus [-h] [-a] [-m {C4,ILP}]
                           distance_matrices clusterings fine_clusterings
                           output_path

positional arguments:
  distance_matrices     path to file containing paths to distance matrices for
                        different clusterings
  clusterings           path to file containing paths to clusterings,
                        represented as either matrices or lists of clustering
                        assignments
  fine_clusterings      path to file containing the names of the clusterings
                        which are the finest
  output_path           path to output tsv file

optional arguments:
  -h, --help            show this help message and exit
  -a, --all_constraints
                        add all constraints to the optimization problem, not
                        just those with mixed signs.
  -m {C4,ILP}, --method {C4,ILP}
                        Method for consensus clustering
(pathogist) 

: 1

This subcommand uses three files as input. This first file is a files that contains the path to the different distance matrices used for the different correlation clusterings. An example looks like this:

In [121]:
distance_matrices=/home/klw17/pathogist_doc/yersinia_163_distances.txt
cat $distance_matrices

(pathogist) MLST=/projects/pathogist/klw17/genotyping/yersinia_pseudotuberculosis/pathogist/matched_yersinia_163_mlst_dist.tsv
SNP=/projects/pathogist/klw17/genotyping/yersinia_pseudotuberculosis/pathogist/matched_yersinia_163_snv_dist.tsv
KWIP=/projects/pathogist/klw17/genotyping/yersinia_pseudotuberculosis/pathogist/matched_yersinia_163_kwip_dist.tsv(pathogist) 

: 1

The second file is a file that contains the path to the different clusterings. An example looks like this:

In [122]:
clusterings=/home/klw17/pathogist_doc/yersinia_163_clusters.txt
cat $clusterings

(pathogist) MLST=/projects/pathogist/klw17/genotyping/yersinia_pseudotuberculosis/pathogist/yersinia_163_mlst_clustering.tsv
SNP=/projects/pathogist/klw17/genotyping/yersinia_pseudotuberculosis/pathogist/yersinia_163_snv_clustering.tsv
KWIP=/projects/pathogist/klw17/genotyping/yersinia_pseudotuberculosis/pathogist/yersinia_163_kwip_clustering.tsv(pathogist) 

: 1

The third file contains which clustering is considered the finest. In this case SNP would be considered the finest. The file would look like this:

In [123]:
fine_clusterings=/home/klw17/pathogist_doc/example_fine_clusters.txt
cat $fine_clusterings

(pathogist) SNP
(pathogist) 

: 1

### Example

In [125]:
consensus=consensus.tsv
$PATHOGIST consensus -a -m C4 $distance_matrices $clusterings $fine_clusterings $consensus

(pathogist) 05:22:54 AM (5033 ms) -> INFO: Reading distance matrices ...
05:22:54 AM (5115 ms) -> INFO: Getting clusterings ...
05:22:54 AM (5132 ms) -> INFO: Getting other metadata ...
05:22:54 AM (5134 ms) -> INFO:Creating and solving consensus clustering problem ...
05:23:07 AM (18485 ms) -> INFO: Writing clusterings to file ...
(pathogist) 

: 1

In [126]:
head $consensus

	Consensus	MLST	KWIP	SNP
ERR024893	1	19	106	20
ERR024894	2	10	77	11
ERR024895	3	16	66	17
ERR024899	3	16	90	17
ERR024900	4	2	2	2
ERR024901	5	11	36	12
ERR024902	2	10	41	11
ERR024903	6	15	75	16
ERR024904	7	21	113	22
(pathogist) 

: 1

## Visualize Subcommand

This subcommand allows the user to visualize the clustering results

In [88]:
$PATHOGIST visualize -h

usage: PATHOGIST visualize [-h] input {clustering,distances}

positional arguments:
  input                 path to distance matrix or clustering, all in tsv
                        format
  {clustering,distances}
                        type of data for the input

optional arguments:
  -h, --help            show this help message and exit
(pathogist2) 

: 1