**22 June 2016**

The 154 samples from the Beijing lineage were filtered to remove redundant samples from the same lineage, arriving at a filtered list of 110 unique strains. These unique strains were fed into `convertvcf.py` to construct FASTA and PHYLIP files. This was done with the following commands:

```
$ run=run2
$ inputDir=/home/zhf615/TB_test/MALAWI/test/shuffled/results
$ outputPrefix=/global/scratch/seanla/Data/MTBC/$run/snps/$run
$ convertVcfDir=/home/seanla/Projects/beijing_ancestor_mtbc/convertvcf
$ strains=$convertVcfDir/unique_strains.txt
$ convertVcf=$convertVcfDir/convertvcf.py
$ phymlDir=$outputPrefix/../phyml
$ mkdir -p $outputPrefix/..
$ mkdir -p $phymlDir
$ python $convertVcf -i $inputDir -o $outputPrefix -s $strains -r "(\d|\D)*_noresi.vcf$" -f -p
$ cp $outputPrefix.phy $phymlDir
```
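The `-r` pattern decides which VCF files in `$inputDir` are read. The sketch below shows what that pattern accepts, assuming it is applied to each file name as a search with Python's `re` module (how `convertvcf.py` actually applies it is not shown in these notes). The file names are made up for illustration.

```python
import re

# The pattern passed to convertvcf.py via -r.
PATTERN = re.compile(r"(\d|\D)*_noresi.vcf$")

def select_vcfs(filenames):
    """Return the file names the pattern accepts (assumes re.search semantics)."""
    return [name for name in filenames if PATTERN.search(name)]

# Hypothetical file names, for illustration only.
candidates = [
    "ERR000001_noresi.vcf",
    "ERR000001_resi.vcf",
    "ERR000001_noresi.vcf.gz",
    "ERR000001_noresiXvcf",
]
selected = select_vcfs(candidates)
print(selected)
```

The last name is accepted because the `.` before `vcf` is unescaped and so matches any character; `_noresi\.vcf$` would be stricter.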

The list of the 110 strains used is contained in the file `snps/unique_samples.txt`. The PHYLIP file at `snps/run2.phy` indicates that this set of SNPs contains 8831 sites.
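The number of strains and sites can be read off the first line of a PHYLIP file, which lists the number of sequences followed by the number of sites. A minimal sketch of that check follows; the function name and the toy alignment are illustrative and not part of the pipeline.

```python
import os
import tempfile

def read_phylip_dimensions(path):
    """Return (number of sequences, number of sites) from a PHYLIP header line."""
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if fields:  # skip any leading blank lines
                return int(fields[0]), int(fields[1])
    raise ValueError("no PHYLIP header found in " + path)

# Tiny made-up alignment, written to a temporary file for demonstration.
example = " 2 4\nstrainA   ACGT\nstrainB   ACGA\n"
with tempfile.NamedTemporaryFile("w", suffix=".phy", delete=False) as tmp:
    tmp.write(example)
dims = read_phylip_dimensions(tmp.name)
os.remove(tmp.name)
print(dims)  # (2, 4)
```

Run against `snps/run2.phy`, the same function should report 110 sequences and 8831 sites if the file matches the description above.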

jModelTest was used to infer the best fit model of nucleotide substitution for our data set. This was done with the following commands:

```
$ run=run2
$ prefix=/global/scratch/seanla/Data/MTBC/$run
$ phylip=$prefix/snps/$run.phy
$ jmodelDir=$prefix/jmodeltest
$ output=$jmodelDir/jmodeltest-results.out
$ jModelTest=/home/seanla/Software/jmodeltest-2.1.10/jModelTest.jar
$ mkdir -p $jmodelDir
$ java -jar $jModelTest -tr ${PBS_NUM_PPN} -d $phylip -o $output -AIC -a -f -g 4 -i -H AIC
```

An explanation of the parameters used:
* `-AIC` indicates we used the Akaike information criterion to infer the best fit model.
* `-a` indicates we estimated the model-averaged phylogeny for each active criterion.
* `-f` indicates we included models with unequal base frequencies.
* `-g 4` indicates we included models with rate variation among sites and sets the number of categories to 4.
* `-i` indicates we included models with a proportion of invariable sites.
* `-H AIC` indicates we used the AIC for the clustering search.

jModelTest selected the General Time Reversible model with rate variation among sites and a proportion of invariable sites (GTR+I+G) as the best-fit model for the data. The results can be found at `jmodeltest/jmodeltest-results.out`.

```
::Best Models::

        Model           f(a)    f(c)    f(g)    f(t)    kappa   titv    Ra      Rb      Rc      Rd      Re      Rf      pInv    gamma
----------------------------------------------------------------------------------------------------------------------------------------
AIC     GTR+I+G         0.16    0.34    0.33    0.17    0.00    0.00    0.932   3.448   0.594   0.610   2.804   1.000   0.00    0.73
```
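For reference, the AIC trades off fit against model size as AIC = 2k - 2 ln L, and the model with the lowest score is preferred. The sketch below illustrates that ranking. The log-likelihoods are invented for illustration and are not jModelTest output. The parameter counts assume that branch lengths (2n - 3 = 217 for an unrooted tree of 110 strains) are counted along with the substitution-model parameters.

```python
def aic(log_likelihood, num_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * num_params - 2 * log_likelihood

BRANCHES = 2 * 110 - 3  # branch lengths of an unrooted binary tree with 110 tips

# (log-likelihood, free parameters). Log-likelihoods are HYPOTHETICAL.
candidates = {
    "JC":      (-52000.0, BRANCHES + 0),
    "HKY+G":   (-51200.0, BRANCHES + 5),   # kappa, 3 frequencies, gamma shape
    "GTR+I+G": (-51000.0, BRANCHES + 10),  # 5 rates, 3 frequencies, pInv, gamma shape
}

scores = {name: aic(lnl, k) for name, (lnl, k) in candidates.items()}
best = min(scores, key=scores.get)
print(best, scores[best])
```

With these made-up likelihoods the richer model wins because its gain in fit outweighs the penalty for ten extra parameters.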

## Construction of the neighbor-joining tree ##
MEGA was used to construct a neighbor-joining phylogenetic tree with the following commands:

```
$ run=run2
$ prefix=/global/scratch/seanla/Data/MTBC/$run
$ outputDir=$prefix/mega
$ input=$prefix/snps/$run.fasta
$ config=$prefix/../mao/infer_NJ_nucleotide.mao
$ output=$outputDir/${run}mega
$ mega=/home/seanla/Software/mega/megacc
$ mkdir -p $outputDir
$ mega -a $config -d $input -o $output
```

The following parameters were set:
* Nucleotide alignment (non-coding)
* Test of phylogeny - bootstrap with 1000 pseudoreplicates
* Substitution model - no. of differences, transitions and transversions
* Rates among sites - uniform rates (this was set incorrectly; this should have been set to rate variation among sites as per the output of jModelTest)
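To make these settings concrete, the sketch below computes the "no. of differences" distance between aligned sequences and draws one bootstrap pseudoreplicate by resampling alignment columns with replacement. It is an illustration on a toy alignment, not MEGA's implementation: gap and ambiguity handling is ignored, and all names are hypothetical.

```python
import random

def count_differences(seq_a, seq_b):
    """Number of aligned positions where the two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)

def distance_matrix(alignment):
    """Pairwise difference counts, keyed by (name, name)."""
    names = list(alignment)
    return {
        (x, y): count_differences(alignment[x], alignment[y])
        for x in names
        for y in names
    }

def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement (one pseudoreplicate)."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in columns) for name, seq in alignment.items()}

# Toy alignment, for illustration only.
toy = {"s1": "ACGTAC", "s2": "ACGTTC", "s3": "TCGTTA"}
dist = distance_matrix(toy)
print(dist[("s1", "s2")], dist[("s1", "s3")], dist[("s2", "s3")])  # 1 3 2

replicate = bootstrap_replicate(toy, random.Random(0))
```

A bootstrap analysis repeats the resampling step (1000 times here), builds a tree from each replicate's distance matrix, and reports how often each clade recurs.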

The consensus tree in Newick format can be found at `mega/run2mega_consensus.newick` or `mega/run2mega_consensus.nwk` (same file; different extensions).

![Neighbor-joining tree](mega/run2mega_consensus.png)
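As a sanity check on a Newick file such as the consensus tree above, the tip labels can be pulled out and counted. The sketch below assumes unquoted labels without spaces or parentheses. The function name is hypothetical and the tree string is a toy example.

```python
import re

def newick_leaf_names(newick):
    """Extract tip labels from a Newick string (assumes unquoted labels)."""
    # A tip label directly follows "(" or ","; internal node labels follow ")".
    return re.findall(r"[(,]\s*([^(),:;\s]+)", newick)

toy_tree = "((A:0.1,B:0.2)95:0.3,C:0.4);"
leaves = newick_leaf_names(toy_tree)
print(leaves)  # ['A', 'B', 'C']
```

Applied to the contents of `mega/run2mega_consensus.nwk`, the list would be expected to contain 110 names, one per strain.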

## Construction of the Bayesian consensus tree ##
BEAST 2 was used to sample the posterior distribution of trees. The XML file was created in BEAUti with the following parameters:

**Site model**
* The substitution model was set to GTR, with each substitution rate given an initial value of 1.0 and estimated, with the exception of the CT rate, which was left at the BEAUti default (not estimated).
* The gamma category count was set to 4, as per the settings we used in jModelTest to find the best fit model of DNA evolution.
* The shape parameter was set to 1.0 (this was set incorrectly; the proper value should have been 0.73 as per the output of jModelTest)
* The proportion of invariant sites was set to 0.0

**Clock model**
* A strict clock model was used, with rate 1.0.

**Priors**
* The tree prior was set to the Yule Model.
* Birth rate was set to uniform.
* Substitution parameters were set to Gamma distributed.

**MCMC**
* The chain length used was 1500000, which is the chain length used in the paper "T cell epitopes are evolutionarily hyperconserved" when constructing their phylogenies.
* Pre-burnin was set to 150000
* Every 1000th tree was stored.
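The MCMC settings above imply the following rough bookkeeping. This is a back-of-the-envelope sketch: it assumes the pre-burnin steps are run in addition to the chain length and does not count a possible extra tree logged for the initial state.

```python
chain_length = 1_500_000  # MCMC steps, from the settings above
log_every = 1_000         # a tree is stored every 1000th step
pre_burnin = 150_000      # steps run before logging starts

stored_trees = chain_length // log_every
total_steps = pre_burnin + chain_length  # ASSUMES pre-burnin is extra to the chain

print(stored_trees, total_steps)  # 1500 1650000
```

About 1500 trees are therefore expected in `beast/run2.trees`, which is the sample that TreeAnnotator summarizes.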

The distribution of trees output by BEAST 2 (found at `beast/run2.trees`) was then passed to TreeAnnotator to produce the consensus tree, which can be found at `beast/run2beast.nexus`. The tree was then visualized using FigTree.

![Bayesian consensus tree](beast/run2beast.nexus.png)