## Beijing Lineage Phylogeny Construction - Run 5 ##

### 6 July 2016 ###

The input data for this run are VCF files containing the SNPs of 76 unique strains, drawn from ~150 samples of the Beijing lineage. SNPs from repetitive regions and SNPs associated with antibiotic resistance were removed, along with SNPs whose frequency was less than 2. The full list of strains used in this run is in the file `snps/76strains.txt` in the convertvcf folder. This run only involves constructing the maximum likelihood (ML) tree, using PhyML and RAxML.

#### VCF Conversion ####

The collection of VCF files was converted to FASTA format with the tool `convertvcf.py`, found in the convertvcf folder, using the following commands:

```bash
run=run5

inputDir=/home/zhf615/TB_test/MALAWI/test/shuffled/results
outputDir=/global/scratch/seanla/Data/MTBC/$run
snpDir=$outputDir/snps
snpPrefix=$snpDir/$run

convertVcfDir=/home/seanla/Projects/beijing_ancestor_mtbc/convertvcf
strains=$convertVcfDir/76strains.txt
convertVcf=$convertVcfDir/convertvcf.py

mkdir -p $snpDir


python $convertVcf -i $inputDir -o $snpPrefix -s $strains -r "(\d|\D)*_final.vcf$" -f -p
```

Here, `(\d|\D)*_final.vcf$` is the regular expression that selects the appropriate VCF files. The file `76strains.txt` contains the accession codes of all 76 unique strains. They were taken from the Excel spreadsheet `Beijing_lineage_snp.xlsx`, in which the accession codes of the included strains are colored black. Accession codes colored red were excluded.
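As a minimal sketch of what this pattern accepts, the snippet below applies it to a few file names with Python's `re` module. The file names are hypothetical, and the use of `re.search` is an assumption, since how `convertvcf.py` applies the pattern is not shown here. Note that the unescaped `.` matches any character, not only a literal dot.

```python
import re

# Pattern copied from the command above. Whether convertvcf.py uses
# re.search or re.match is an assumption made for this illustration.
VCF_PATTERN = re.compile(r"(\d|\D)*_final.vcf$")


def is_final_vcf(filename):
    """Return True if the name ends in '_final', any one character, then 'vcf'."""
    return VCF_PATTERN.search(filename) is not None


# Hypothetical file names, for illustration only.
for name in ["ERR000001_final.vcf", "ERR000001_raw.vcf", "ERR000001_final.vcf.gz"]:
    print(name, is_final_vcf(name))
```

Escaping the dot (`_final\.vcf$`) would restrict the match to a literal `.vcf` extension, if that stricter behavior is wanted.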

#### Best Fit Model of DNA Evolution - jModelTest ####

jModelTest was used to infer the best-fit model of nucleotide evolution using the Akaike information criterion with the following commands:

```bash
run=run5

prefix=/global/scratch/seanla/Data/MTBC/$run
phylip=$prefix/snps/$run.phy
jmodelDir=$prefix/jmodeltest
output=$jmodelDir/jmodeltest-results.out
jModelTest=/home/seanla/Software/jmodeltest-2.1.10/jModelTest.jar

mkdir -p $jmodelDir

java -jar $jModelTest -tr ${PBS_NUM_PPN} -d $phylip -o $output -AIC -a -f -g 4 -i -H AIC
```

An explanation of the parameters used is as follows:
* `-AIC` indicates we used the Akaike information criterion to infer the best-fit model.
* `-a` indicates we estimated the model-averaged phylogeny for each active criterion.
* `-f` indicates we included models with unequal base frequencies.
* `-g 4` indicates we included models with rate variation among sites and set the number of rate categories to 4.
* `-i` indicates we included models with a proportion of invariable sites.
* `-H AIC` indicates we used AIC as the information criterion for the clustering search. (jModelTest indicated that this option has no effect for our dataset.)
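For reference, AIC is defined as AIC = 2k - 2 ln L, where k is the number of free parameters and ln L is the maximized log-likelihood. The model with the smallest AIC is preferred. The sketch below illustrates the comparison. The log-likelihoods are invented for illustration and are not output from this run, and the parameter counts assume that the 149 branch lengths of an unrooted 76-taxon tree are counted alongside the substitution-model parameters.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def best_by_aic(candidates):
    """Return the model name with the smallest AIC.

    candidates maps a model name to a (log_likelihood, n_params) pair.
    """
    return min(candidates, key=lambda model: aic(*candidates[model]))


# Hypothetical values, NOT results from this run.
# Parameter counts: 149 branch lengths plus the model's own parameters.
candidates = {
    "JC": (-10500.0, 149),
    "HKY+G": (-10320.0, 154),
    "GTR+G": (-10300.0, 158),
}

for model, (lnl, k) in candidates.items():
    print(model, aic(lnl, k))
print("best:", best_by_aic(candidates))
```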

The test revealed that the General Time Reversible model with gamma-distributed rate variation among sites (GTR+G) is the best-fit model.

```
::Best Models::

        Model           f(a)    f(c)    f(g)    f(t)    kappa   titv    Ra      Rb      Rc      Rd      Re      Rf      pInv    gamma
----------------------------------------------------------------------------------------------------------------------------------------
AIC     GTR+G           0.17    0.33    0.32    0.18    0.00    0.00      0.902   3.420   0.396   0.567   2.788   1.000 N/A        0.57
```

Notably, the gamma shape parameter in this case is estimated to be 0.57.
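To give a feel for what a shape parameter below 1 implies, the sketch below draws site rates from a gamma distribution with shape 0.57 and mean 1 (scale 1/0.57), the usual parameterization for among-site rate variation. This is only an illustration using Python's standard library. The sample size and seed are arbitrary choices, and the programs used here work with a discrete approximation (4 categories) rather than random sampling.

```python
import random


def sample_site_rates(alpha, n_sites, seed=0):
    """Draw relative site rates from Gamma(shape=alpha, scale=1/alpha), mean 1."""
    rng = random.Random(seed)
    return [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_sites)]


rates = sample_site_rates(0.57, 100_000)
mean_rate = sum(rates) / len(rates)
# Fraction of sites evolving at less than half the average rate.
slow_fraction = sum(r < 0.5 for r in rates) / len(rates)

print(f"mean rate: {mean_rate:.2f}")
print(f"fraction of sites with rate < 0.5: {slow_fraction:.2f}")
```

With a shape below 1 the distribution is strongly skewed: a large share of sites evolve slowly while a few evolve much faster than average.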

#### Maximum Likelihood ####
The maximum likelihood tree was constructed using PhyML with the following command:

```bash
input=/global/scratch/seanla/Data/MTBC/run5/ml/run5.phy
phyml=/home/seanla/Software/PhyML-3.1/PhyML-3.1_linux64

$phyml -i $input -q -b 1000 -m GTR -a 0.57 --no_memory_check
```

An explanation of the parameters is as follows:
* `-q` indicates the PHYLIP input data is in sequential format.
* `-b 1000` indicates we performed 1000 bootstrap replicates.
* `-m GTR` indicates the model of DNA substitution is GTR.
* `-a 0.57` indicates the gamma shape parameter alpha is fixed at 0.57, as per the output of jModelTest.
* `--no_memory_check` indicates the program does not check for sufficient memory before running the phylogenetic construction.
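For readers unfamiliar with the sequential layout that `-q` refers to: the file starts with a header giving the number of taxa and sites, and each taxon's complete sequence is then given before the next taxon begins (as opposed to the interleaved layout, where sequences are split into blocks). The sketch below writes such a file body from a dictionary of sequences. The function name and the toy sequences are hypothetical, and this is not how `convertvcf.py` is known to produce its output.

```python
def to_sequential_phylip(seqs):
    """Format aligned sequences as relaxed sequential PHYLIP text.

    seqs maps a taxon name (no whitespace) to its aligned sequence.
    """
    lengths = {len(seq) for seq in seqs.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must be non-empty in number and equal in length")
    n_sites = lengths.pop()

    lines = [f"{len(seqs)} {n_sites}"]  # header: taxa count, site count
    for name, seq in seqs.items():
        if any(ch.isspace() for ch in name):
            raise ValueError(f"taxon name contains whitespace: {name!r}")
        lines.append(f"{name} {seq}")  # one full sequence per taxon
    return "\n".join(lines) + "\n"


# Toy alignment with made-up names and bases.
toy = {"strainA": "ACGTAC", "strainB": "ACGTTC", "strainC": "ACCTAC"}
print(to_sequential_phylip(toy), end="")
```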

RAxML was also used to construct the maximum likelihood tree using the following command:

```bash
run=run5
dir=/global/scratch/seanla/Data/MTBC/$run
input=$dir/snps/$run.phy
outputdir=$dir/raxml
output=${run}raxml

mkdir -p $outputdir

raxml -T ${PBS_NUM_PPN} -f a -x 314159 -p 271828 -# 100 -k -m GTRGAMMA -s $input -w $outputdir -n $output
```

An explanation of the parameters is as follows:
* `-f a` indicates we performed a rapid bootstrap analysis and searched for the best-scoring ML tree in one program run.
* `-x 314159` indicates we gave the integer 314159 as a random seed to initialize rapid bootstrapping.
* `-p 271828` indicates we passed the integer 271828 as a random seed to initialize parsimony inferences. This is necessary when using any algorithm/option that requires randomization.
* `-# 100` sets the number of runs to 100. Combined with `-f a` and `-x`, this means we performed 100 rapid bootstrap replicates.
* `-k` indicates that bootstrapped trees are printed with branch lengths.
* `-m GTRGAMMA` indicates that the model of nucleotide substitution is GTR with optimization of substitution rates, using a gamma model of rate heterogeneity with an estimated alpha parameter.
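The idea behind a bootstrap replicate can be sketched as follows: alignment columns are resampled with replacement to build a pseudo-alignment of the same length, and a tree is inferred from each replicate. The snippet below shows only the standard resampling step with a fixed seed. It is not RAxML's rapid bootstrap algorithm, and the function name and toy alignment are hypothetical.

```python
import random


def bootstrap_alignment(seqs, seed):
    """Resample alignment columns with replacement (one bootstrap replicate).

    seqs maps taxon names to aligned sequences of equal length.
    The same column indices are used for every taxon, so columns stay intact.
    """
    rng = random.Random(seed)
    n_sites = len(next(iter(seqs.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[c] for c in columns) for name, seq in seqs.items()}


# Toy alignment with made-up names and bases.
toy_alignment = {"strainA": "ACGTAC", "strainB": "ACGTTC", "strainC": "ACCTAC"}
replicate = bootstrap_alignment(toy_alignment, seed=314159)
for name, seq in replicate.items():
    print(name, seq)
```

Fixing the seed, as `-x` and `-p` do in the command above, makes the resampling reproducible between runs.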

Next, the best tree with bootstrap node values was created using the command:

```bash
run=run5
dir=/global/scratch/seanla/Data/MTBC/$run/raxml
bootstrap=$dir/RAxML_bootstrap.run5raxml
tree=$dir/RAxML_bestTree.run5raxml
output=bs_tree

raxml -T ${PBS_NUM_PPN} -f b -m GTRGAMMA -z $bootstrap -t $tree -w $dir -n $output
```

The most important parameter here is `-f b`, which draws bipartition (bootstrap support) information from the bootstrap trees given with `-z` onto the best tree given with `-t`.
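Conceptually, the support value of a branch is the percentage of bootstrap trees that contain the same bipartition (split of the taxa into two groups). The sketch below computes this for trees represented simply as lists of clades. The representation, function names, and toy trees are hypothetical and chosen for brevity. RAxML itself works on Newick trees and is not reproduced here.

```python
def normalize(clade, all_taxa):
    """Represent a bipartition by the side that excludes a fixed reference taxon."""
    clade = frozenset(clade)
    reference = min(all_taxa)
    return frozenset(all_taxa) - clade if reference in clade else clade


def bipartition_support(best_tree_clades, bootstrap_trees, all_taxa):
    """Return {bipartition: percent of bootstrap trees containing it}."""
    boot_splits = [
        {normalize(clade, all_taxa) for clade in tree} for tree in bootstrap_trees
    ]
    support = {}
    for clade in best_tree_clades:
        split = normalize(clade, all_taxa)
        hits = sum(split in splits for splits in boot_splits)
        support[split] = round(100 * hits / len(boot_splits))
    return support


# Toy example: five taxa, a best tree with two internal branches,
# and four bootstrap trees (all made up).
taxa = {"A", "B", "C", "D", "E"}
best_tree = [{"A", "B"}, {"D", "E"}]
boot_trees = [
    [{"A", "B"}, {"D", "E"}],
    [{"A", "B"}, {"C", "D"}],
    [{"A", "B"}, {"D", "E"}],
    [{"A", "C"}, {"B", "D"}],
]
support = bipartition_support(best_tree, boot_trees, taxa)
for split, percent in support.items():
    print(sorted(split), percent)
```

Normalizing each clade to the side without the reference taxon ensures that a clade and its complement are counted as the same bipartition.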

![Maximum likelihood tree](raxml/RAxML_bipartitionsBranchLabels.100fastbs_tree.png)