# Run 7 - 90 Samples #

## 25 July 2016 ##

This experiment used the VCF files located at `/home/zhf615/TB_test/MALAWI/test/Beijing/results` on the Breezy cluster at WestGrid. The 90 samples used are listed in the file `snps/sequences_90.txt`.

### Conversion of VCF to FASTA/PHYLIP ###

The VCF files were converted to FASTA and PHYLIP format using the following commands:

```bash
$ run=run7
$ vcfDir=/home/zhf615/TB_test/MALAWI/test/Beijing/results
$ projectDir=/home/seanla/Projects/beijing_ancestor_mtbc
$ convertVcf=$projectDir/convertvcf/convertvcf.py
$ runDir=$projectDir/$run
$ snpDir=$runDir/snps
$ snpPrefix=$snpDir/$run
$ strains=$runDir/snps/sequences_90.txt
$
$ mkdir -p $snpDir
$ python $convertVcf -i $vcfDir -o $snpPrefix -s $strains -r  "(\d|\D)*_unresi_unrepeted_depthfilerted.vcf$" -f -p
```
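
The pattern passed to `-r` begins with `(\d|\D)*`, which matches any run of characters, so it presumably selects every file whose name ends in `_unresi_unrepeted_depthfilerted.vcf`. The sketch below expresses roughly the same filter as a shell glob, which can be handy for previewing the selection before running the conversion. `is_filtered_vcf` is a hypothetical helper, not part of `convertvcf.py`, and how that script applies the regular expression internally is an assumption.

```bash
# Hypothetical helper: succeed if a file name ends in the suffix used by the -r pattern.
# Assumption: the regex is only used to select files by this suffix.
is_filtered_vcf() {
    case "$1" in
        *_unresi_unrepeted_depthfilerted.vcf) return 0 ;;
        *) return 1 ;;
    esac
}

# Preview of the selection (vcfDir as defined above):
#   for f in "$vcfDir"/*; do is_filtered_vcf "$f" && echo "$f"; done
```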

The alignments in the FASTA and PHYLIP files featured 6403 sites and 1579 patterns.
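
As a quick sanity check on the conversion, the first line of a PHYLIP file lists the number of taxa followed by the number of sites. The sketch below reads that header. `phylip_dims` is a hypothetical helper, and it assumes the converter writes a standard PHYLIP header on the first line.

```bash
# Hypothetical helper: print "<taxa> <sites>" from the first line of a PHYLIP file.
# Assumes a standard PHYLIP header (two whitespace-separated integers).
phylip_dims() {
    awk 'NR == 1 { print $1, $2; exit }' "$1"
}

# Usage (snpPrefix as defined above):
#   phylip_dims "$snpPrefix.phy"
```

For this run, the header would be expected to agree with the 90 samples and 6403 sites reported above.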

### Finding the best fit model of nucleotide substitution ###

The best fit model of nucleotide substitution was found using jModelTest 2 with the following commands:

```bash
run=run7
prefix=/home/seanla/Projects/beijing_ancestor_mtbc/run7
phylip=$prefix/snps/$run.phy
jmodelDir=$prefix/jmodeltest
output=$jmodelDir/jmodeltest-results.out
jmodeltest=/home/seanla/Software/jmodeltest-2.1.10/jModelTest.jar

mkdir -p $jmodelDir
java -jar $jmodeltest -tr ${PBS_NUM_PPN} -d $phylip -o $output -AIC -a -f -g 4 -i -H AIC
```

An explanation of the parameters used is as follows:
* `-AIC` indicates we used the Akaike information criterion (AIC) to select the best fit model.
* `-a` indicates we estimated the model-averaged phylogeny for each active criterion.
* `-f` indicates we included models with unequal base frequencies.
* `-g 4` indicates we included models with rate variation among sites and set the number of rate categories to 4.
* `-i` indicates we included models with a proportion of invariable sites.
* `-H AIC` indicates we used the AIC for the clustering search. (jModelTest indicated that this option has no effect for our dataset.)

Results from jModelTest indicated that the best fit model of DNA evolution for our data was the general time reversible model with rate variation among sites and invariant sites (GTR+I+G), with a gamma shape parameter of 4.00 and a proportion of invariant sites of 0.00.

```
::Best Models::

        Model           f(a)    f(c)    f(g)    f(t)    kappa   titv    Ra      Rb      Rc      Rd      Re      Rf      pInv    gamma
----------------------------------------------------------------------------------------------------------------------------------------
AIC     GTR+I+G         0.19    0.31    0.30    0.19    0.00    0.00    1.199   3.675   0.472   0.719   3.352   1.000   0.00    4.00
```
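
The selected model can also be pulled out of the results file programmatically, for example to pass it on to later steps. This is a minimal sketch that assumes the output file contains the `::Best Models::` table in the layout shown above; `best_model` is a hypothetical helper.

```bash
# Hypothetical helper: print the model chosen under a given criterion.
# Assumes the "::Best Models::" table layout shown above.
best_model() {
    # $1 = criterion label (e.g. AIC), $2 = jModelTest output file
    awk -v crit="$1" '
        /::Best Models::/ { in_table = 1; next }
        in_table && $1 == crit { print $2; exit }
    ' "$2"
}

# Usage (output as defined in the jModelTest commands):
#   best_model AIC "$output"
```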

## 26 July 2016 ##

### Construction of neighbor-joining tree ###

The neighbor-joining tree was constructed using MEGA. The MAO configuration file was created using `megaproto` with the following settings:
* Test of phylogeny was set to bootstrap method using 1000 pseudoreplicates.
* The substitution model was set to No. of differences, with both transitions and transversions included.
* Rates among sites were set to gamma distributed with invariant sites, with a gamma parameter of 4.00.
* All other settings were left at their defaults.

MEGA was run using the following commands:

```bash
run=run7
prefix=/home/seanla/Projects/beijing_ancestor_mtbc/run7
outputdir=$prefix/nj
output=$outputdir/${run}mega-nj
config=$outputdir/infer_NJ_nucleotide.mao
input=$prefix/snps/${run}.fasta
mega=/home/seanla/Software/mega/megacc

mkdir -p $outputdir
$mega -a $config -d $input -o $output
```

The tree output by MEGA can be found at `nj/run7mega-nj.newick`.

![Neighbor-joining tree](nj/run7mega-nj_consensus.png)
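
One way to confirm that a tree file contains every sample is to count its leaves. A Newick tree with n leaves contains exactly n - 1 commas, as long as no label contains a comma. The sketch below relies on that property; `newick_leaf_count` is a hypothetical helper and it assumes the file holds a single tree.

```bash
# Hypothetical helper: count the leaves of a single Newick tree.
# Relies on a tree with n leaves having n - 1 commas (no commas inside labels).
newick_leaf_count() {
    echo $(( $(tr -cd ',' < "$1" | wc -c) + 1 ))
}

# Usage (prefix as defined above):
#   newick_leaf_count "$prefix/nj/run7mega-nj.newick"
```

For this run the count would be expected to equal the 90 samples.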

### Construction of Maximum Likelihood tree ###

The maximum likelihood tree was constructed with RAxML using the following commands:

```bash
run=run7
prefix=/home/seanla/Projects/beijing_ancestor_mtbc/run7
input=$prefix/snps/${run}.phy
outputdir=$prefix/raxml
output=${run}raxml
mkdir -p $outputdir
raxml -T ${PBS_NUM_PPN} -f a -x 61938 -p 889579 -# 1000 -k -m GTRGAMMA -s $input -w $outputdir -n $output
```

An explanation of the parameters used is as follows:
* `-f a` indicates we performed a rapid bootstrap analysis and searched for the best-scoring ML tree in one program run.
* `-x 61938` indicates we turned on rapid bootstrapping with a random seed of 61938.
* `-p 889579` indicates we specified a random seed of 889579 for the parsimony inferences. This is required for algorithms using some sort of randomization.
* `-# 1000` indicates we performed 1000 pseudoreplicates for bootstrapping.
* `-k` specifies the bootstrapped trees are printed with branch lengths.
* `-m GTRGAMMA` indicates we used GTR as our model of DNA substitution, with gamma-distributed rate heterogeneity among sites and an estimated shape parameter.
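
The `-T ${PBS_NUM_PPN}` argument in the command above depends on a variable that the PBS scheduler sets inside a job, so it is empty in an interactive shell. A minimal sketch of a fallback, assuming a single thread is an acceptable default outside the scheduler; `thread_count` is a hypothetical helper.

```bash
# Hypothetical helper: use the scheduler's cores-per-node value when available,
# otherwise fall back to 1 thread (an assumed default, not part of the original run).
thread_count() {
    echo "${PBS_NUM_PPN:-1}"
}

# Usage: raxml -T "$(thread_count)" ... (remaining options unchanged)
```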

The tree output by RAxML can be found at `raxml/RAxML_bipartitionsBranchLabels.run7raxml.newick`.

![Maximum likelihood tree](raxml/RAxML_bipartitionsBranchLabels.run7raxml.png)

### Finding the optimal tree distribution using Bayesian Inference ###

The BEAST XML configuration file was first constructed using `beauti` with the following settings:

***Site Model***
* The model of DNA substitution was set to GTR.
* The gamma category count was set to 4.
* The shape parameter was set to 4.00.
* The proportion of invariant sites was set to 0.00.
* The substitution rate parameters were all left at their defaults: each rate was 1.0 and estimated, with the exception of CT, which was also left at its default option.
* Frequencies were estimated.

***Clock Model***
* The clock model used was strict, with a clock rate of 1.0.

***Priors***
* The tree prior used was the Yule model.

The following parameters were left at their defaults:
* Birth rate was set to uniform.
* Rates were set to gamma distributed.

***MCMC***
* Chain length used was 10 000 000.
* Pre burnin was set to 0.
* Initialization attempts was set to 10.
* Sample from prior was not set.

BEAST was run using the following commands:

```bash
$ run=run7
$ prefix=/home/seanla/Projects/beijing_ancestor_mtbc/$run/bayes
$ input=$prefix/${run}beast.xml
$ output=$prefix
$ 
$ mkdir -p $output
$ beast -threads ${PBS_NUM_PPN} -prefix $output -overwrite -working $input
```

The consensus tree was then constructed using TreeAnnotator, where the following parameters were set:
* 10% of generations were discarded as burn-in.
* The posterior probability limit was set to 0.5.
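
To make the burn-in concrete: with the chain length of 10 000 000 used above, discarding 10% corresponds to the first 1 000 000 states, assuming the percentage applies uniformly across the chain. The arithmetic can be sketched as follows; `burnin_states` is a hypothetical helper.

```bash
# Hypothetical helper: number of MCMC states discarded for a burn-in percentage.
burnin_states() {
    # $1 = chain length, $2 = burn-in percentage (integer)
    echo $(( $1 * $2 / 100 ))
}

burnin_states 10000000 10   # prints 1000000
```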

![Bayesian consensus tree](bayes/run7-bayes_consensus.nexus.png)