## Beijing Lineage Phylogeny Construction - Run 4 ##

### 3 July 2016 ###

The input data for this run are VCF files containing the SNPs of 110 unique strains drawn from ~150 Beijing Lineage samples. SNPs in repetitive regions, SNPs associated with antibiotic resistance, and SNPs whose frequency was less than 2 were removed. The full list of strains used in this run is in the file `unique_strains.txt` in the `convertvcf` folder.

#### VCF Conversion ####

The collection of VCF files was converted to FASTA format with the tool `convertvcf.py` (in the `convertvcf` folder) using the following commands:

```bash
run=run4

inputDir=/home/zhf615/TB_test/MALAWI/test/shuffled/results
outputDir=/global/scratch/seanla/Data/MTBC/$run
snpDir=$outputDir/snps
snpPrefix=$snpDir/$run

convertVcfDir=/home/seanla/Projects/beijing_ancestor_mtbc/convertvcf
strains=$convertVcfDir/unique_strains.txt
convertVcf=$convertVcfDir/convertvcf.py

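mkdir -p $snpDir

# NOTE: phymlDir is never defined above, so the original `mkdir -p $phymlDir`
# expands to a bare `mkdir -p` and fails; define phymlDir before re-enabling it.
# mkdir -p $phymlDir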

python $convertVcf -i $inputDir -o $snpPrefix -s $strains -r "(\d|\D)*_final.vcf$" -f -p
```

Here `(\d|\D)*_final.vcf$` is the regular expression that selects the VCF files to convert. Processing yielded 7806 sites (i.e. each strain sequence is 7806 bases long).
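As a quick check of what that pattern accepts, here is a minimal sketch using Python's `re` module. The file names are hypothetical, and it is an assumption that `convertvcf.py` applies the pattern to bare file names; because `(\d|\D)*` matches any prefix, anchoring at the start or searching anywhere gives the same result here. Note that the `.` before `vcf` is unescaped, so it matches any single character rather than only a literal dot.

```python
import re

# Pattern copied verbatim from the convertvcf.py command above.
PATTERN = re.compile(r"(\d|\D)*_final.vcf$")

# Hypothetical file names, for illustration only.
candidates = [
    "strain01_final.vcf",     # intended match
    "strain01_raw.vcf",       # no "_final" suffix
    "strain01_final.vcf.gz",  # does not end in "vcf"
    "strain01_finalXvcf",     # matches because "." is unescaped
]

matches = [name for name in candidates if PATTERN.match(name)]
print(matches)
```

If only a literal dot should be accepted, the stricter form would be `_final\.vcf$`; whether that matters depends on what else lives in the input directory.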

#### Best Fit Model of DNA Evolution - jModelTest ####

jModelTest was used to infer the best-fit model of nucleotide evolution using the Akaike information criterion with the following commands:

```bash
run=run4

prefix=/global/scratch/seanla/Data/MTBC/$run
phylip=$prefix/snps/$run.phy
jmodelDir=$prefix/jmodeltest
output=$jmodelDir/jmodeltest-results.out
jModelTest=/home/seanla/Software/jmodeltest-2.1.10/jModelTest.jar

mkdir -p $jmodelDir

java -jar $jModelTest -tr ${PBS_NUM_PPN} -d $phylip -o $output -AIC -a -f -g 4 -i -H AIC
```

An explanation of the parameters used is as follows:

* `-AIC` indicates we used the Akaike information criterion to infer the best-fit model.
* `-a` indicates we estimated the model-averaged phylogeny for each active criterion.
* `-f` indicates we included models with unequal base frequencies.
* `-g 4` indicates we included models with rate variation among sites and set the number of rate categories to 4.
* `-i` indicates we included models with a proportion of invariable sites.
* `-H AIC` indicates we used the AIC for the clustering search. (jModelTest reported that this option has no effect for our dataset.)

jModelTest selected GTR+G (the General Time Reversible model with gamma-distributed rate variation among sites) as the best-fit model for our data.

```
::Best Models::

        Model           f(a)    f(c)    f(g)    f(t)    kappa   titv    Ra      Rb      Rc      Rd      Re      Rf      pInv    gamma
----------------------------------------------------------------------------------------------------------------------------------------
AIC     GTR+G           0.16    0.35    0.32    0.17    0.00    0.00      0.901   3.626   0.431   0.517   2.731   1.000 N/A        0.59
```
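For context on how an AIC ranking such as the one above comes about, the sketch below scores a few candidate models with AIC = 2k - 2 ln L and picks the smallest value. The log-likelihoods are hypothetical placeholders, not values from this run, and the parameter counts cover only substitution-model parameters; branch-length parameters are left out because they add the same constant to every model and do not change the ranking.

```python
def aic(log_likelihood: float, num_params: int) -> float:
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * num_params - 2 * log_likelihood

# Hypothetical (log-likelihood, free model parameters) pairs, for illustration only.
candidates = {
    "JC": (-52000.0, 0),     # equal rates, equal base frequencies
    "HKY+G": (-51200.0, 5),  # kappa + 3 base frequencies + gamma shape
    "GTR+G": (-51100.0, 9),  # 5 rates + 3 base frequencies + gamma shape
}

scores = {name: aic(lnl, k) for name, (lnl, k) in candidates.items()}
best = min(scores, key=scores.get)
print(best, scores[best])
```

With these placeholder numbers the extra parameters of GTR+G are outweighed by its higher likelihood, which is the kind of trade-off the criterion is meant to capture.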

SNP alignments in the PHYLIP format were fed to jModelTest.

#### Neighbor-joining Tree with MEGA ####

The neighbor-joining tree was constructed with MEGA with the following parameters:

* Test of phylogeny was set to bootstrap with 1000 pseudoreplicates.
* Substitution model/method was set to no. of differences.
* Substitutions to include were transitions and transversions.
* Rates among sites was set to Gamma distributed with a Gamma parameter of 0.59 (as per the output of jModelTest).

SNP alignments were fed to MEGA in the FASTA format.
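To illustrate what the "no. of differences" method measures, the following sketch counts the positions at which each pair of aligned sequences differs. The three toy sequences are hypothetical, and this is not MEGA's implementation: it ignores gaps, ambiguous bases, and the rate-variation setting.

```python
from itertools import combinations

def num_differences(a: str, b: str) -> int:
    """Count aligned positions where two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(a, b) if x != y)

# Hypothetical toy alignment, for illustration only.
alignment = {
    "strainA": "ACGTAC",
    "strainB": "ACGTTC",
    "strainC": "TCGTTA",
}

# Pairwise distance for every unordered pair of strains.
distances = {
    (s, t): num_differences(alignment[s], alignment[t])
    for s, t in combinations(sorted(alignment), 2)
}
print(distances)
```

A neighbor-joining method then builds the tree from a pairwise matrix of this kind.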

#### Maximum-likelihood Tree with MEGA ####

The maximum-likelihood tree was constructed using MEGA with the following parameters:

* Test of phylogeny was set to bootstrap with 1000 pseudoreplicates.
* The model of substitution was set to the General Time Reversible (GTR) model.
* Rates among sites was set to Gamma distributed.
* No. of discrete gamma categories was set to 4 (as per the default number of categories in jModelTest).
* The ML heuristic used was Nearest-Neighbor-Interchange (NNI).
* The initial tree for ML was created automatically, using the default neighbor-joining/BioNJ option.
* Branch swap filter was set to none.
* Number of threads used was 12.

SNP alignments were fed to MEGA in the FASTA format.

#### Bayesian Inference with BEAST ####

The posterior distribution of trees was estimated by Bayesian inference with the program BEAST using the following parameters:

##### Site Model #####
* The Gamma category count was set to 4.
* The Gamma shape parameter was set to 0.59.
* The proportion invariant (proportion of invariant sites) was set to 0.0.
* The substitution model was set to GTR, with each substitution parameter (AC, AG, etc) set to 1.0. (Default value)
* All substitution parameters, with the exception of CT, were estimated. (Default value)
* The base frequencies were estimated. (Default value)

##### Clock Model #####
* A strict clock was used with a clock rate of 1.0.

##### Priors #####
* The tree prior was set to the Yule Model.
* Birth rate was set to uniform, with an initial value of 1.0 and constrained within the real numbers ([-inf, inf]).
* The GTR substitution parameters were each given a Gamma distribution prior.
* The GTR substitution parameter for each substitution type was initially set to 1.0 and constrained to the non-negative real numbers ([0, inf]).

##### MCMC #####
* The chain length was set to 10 000 000.
* The pre-burnin was set to 0.
* The number of initialization attempts was set to 10.
* The sample-from-prior option was not enabled.

SNP alignments were given to BEAST in FASTA file format.

### 5 July 2016 ###

The Neighbor-Joining tree was visualized by passing the NEWICK file `run4mega-nj.nwk` to Dendroscope (the extension was changed to `newick` so that Dendroscope could accept it), where the labels of the internal nodes were interpreted as bootstrap values.

![Neighbor-joining Tree](run4mega-nj.png)

The Bayesian consensus tree was visualized using the following steps:
* TreeAnnotator v2.4.2 was first used to find the consensus topology, with the burn-in percentage set to 10%, the node heights set to mean heights, and the target tree set to the maximum clade credibility tree.
* The output file `run4bayes-consensus.nexus` produced by TreeAnnotator was then loaded into FigTree with the 'node labels' option enabled.

![Bayesian Consensus Tree](run4bayes-consensus.nexus.png)

### 7 July 2016 ###

RAxML was used to construct the maximum likelihood tree and perform 1000 rapid bootstraps using the following command:

```bash
run=run4
dir=/global/scratch/seanla/Data/MTBC/$run
input=$dir/snps/$run.phy
outputdir=$dir/raxml
output=${run}raxml

mkdir -p $outputdir

raxml -T ${PBS_NUM_PPN} -f a -x 61938 -p 889579 -# 1000 -k -m GTRGAMMA -s $input -w $outputdir -n $output
```

An explanation of the parameters follows:
* `-f a` indicates we performed a rapid bootstrap analysis and searched for the best-scoring ML tree in one program run.
* `-x 61938` indicates we gave the integer 61938 as the random seed to initialize rapid bootstrapping.
* `-p 889579` indicates we passed the integer 889579 as the random seed to initialize parsimony inferences. This is necessary when using any algorithm/option that requires randomization.
* `-# 1000` indicates we performed 1000 bootstrap pseudoreplicates (following the [hands-on guide](http://sco.h-its.org/exelixis/web/software/raxml/hands_on.html) on the developers' website).
* `-k` indicates that bootstrapped trees are printed with branch lengths.
* `-m GTRGAMMA` indicates that the model of nucleotide substitution is GTR with optimization of substitution rates and a gamma model of rate heterogeneity with an estimated alpha parameter.
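
As a rough illustration of what a single bootstrap pseudoreplicate is, the sketch below resamples alignment columns with replacement using a seeded generator. It shows only the general resampling idea, not RAxML's rapid bootstrap algorithm; the toy alignment and the function name `bootstrap_replicate` are hypothetical, and reusing the `-x` seed value here will not reproduce RAxML's replicates.

```python
import random

def bootstrap_replicate(alignment: dict, rng: random.Random) -> dict:
    """Resample alignment columns with replacement, keeping rows aligned."""
    n_sites = len(next(iter(alignment.values())))
    # Draw n_sites column indices with replacement.
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {
        name: "".join(seq[i] for i in columns)
        for name, seq in alignment.items()
    }

# Hypothetical toy alignment, for illustration only.
alignment = {
    "strainA": "ACGTAC",
    "strainB": "ACGTTC",
    "strainC": "TCGTTA",
}

rng = random.Random(61938)  # fixed seed makes the resampling reproducible
replicate = bootstrap_replicate(alignment, rng)
print(replicate)
```

A tree is then inferred from each pseudoreplicate, and the support for a branch is the fraction of those trees that contain it.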