**Sunday, 19 June 2016**

VCF files containing the SNPs of the Beijing lineage, located in `/home/zhf615/TB_test/MALAWI/test/shuffled/results`, were converted to FASTA, NEXUS, and PHYLIP formats with the tool `convertvcf.py`. In each folder, only the VCF files whose names match `*_noresi.vcf` were selected for the tests.

The following commands were used:

```
$ input=/home/zhf615/TB_test/MALAWI/test/shuffled/results
$ prefix=/global/scratch/seanla/Data/MTBC/run1/snps
$ output=$prefix/run1snps
$ cd /home/seanla/Projects/beijing_ancestor_mtbc/convertvcf/
$ python convertvcf.py -i $input -o $output -r "(\d|\D)*_noresi.vcf$" -f -n -p
$ cd $prefix
$ cp run1snps.phy ../phyml/
```
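
How `convertvcf.py` applies the `-r` pattern is not shown in this log, so the following is only a sketch that assumes the pattern is applied to each file name with Python's `re.search`. The file names and the stricter pattern are hypothetical. It illustrates that the unescaped `.` in the pattern matches any character, which escaping the dot avoids.

```python
import re

# Pattern passed to convertvcf.py via -r in the commands above.
USED_PATTERN = r"(\d|\D)*_noresi.vcf$"
# Stricter alternative (an assumption, not what was run): the dot is escaped.
STRICT_PATTERN = r".*_noresi\.vcf$"

def select(names, pattern):
    """Return the names that the pattern matches, in input order."""
    regex = re.compile(pattern)
    return [name for name in names if regex.search(name)]

# Hypothetical file names for illustration only.
names = [
    "sampleA_noresi.vcf",
    "sampleA.vcf",
    "sampleB_noresi.vcf.idx",
    "sampleC_noresi_vcf",
]

print(select(names, USED_PATTERN))
print(select(names, STRICT_PATTERN))
```

With these names, the pattern as used also accepts `sampleC_noresi_vcf`, while the stricter pattern keeps only `sampleA_noresi.vcf`.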

The header of the PHYLIP file indicated that each sequence is 9771 sites long.
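
The sequence length comes from the first line of a PHYLIP file, which holds the number of taxa followed by the number of sites. Below is a minimal sketch of that check, using a small made-up alignment rather than the real `run1snps.phy`; the function name is hypothetical.

```python
def read_phylip_header(text):
    """Return (number_of_taxa, sequence_length) from PHYLIP-formatted text."""
    first_line = text.lstrip().splitlines()[0]
    n_taxa, n_sites = (int(field) for field in first_line.split()[:2])
    return n_taxa, n_sites

# Made-up three-taxon alignment for illustration only.
example = """ 3 8
taxon_1   ACGTACGT
taxon_2   ACGTTCGT
taxon_3   ACGAACGT
"""

n_taxa, n_sites = read_phylip_header(example)
print(n_taxa, n_sites)
```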

## Finding the best fit model ##
jModelTest 2.1.10 was run on the PHYLIP file to infer the best-fit model of DNA substitution with the following commands:

```
$ prefix=/global/scratch/seanla/Data/MTBC/run1
$ phylip=$prefix/snps/run1snps.phy
$ output=$prefix/jmodeltest/run1jmt.out
$ cd /home/seanla/Software/jmodeltest-2.1.10
$ java -jar jModelTest.jar -d $phylip -o $output -AIC -a -f -g 4 -i -H AIC
```

Here is an explanation of the parameters used:
* `-AIC` indicates we used the Akaike information criterion to infer the best fit model.
* `-a` indicates we estimated the model-averaged phylogeny for each active criterion.
* `-f` indicates we included models with unequal base frequencies.
* `-g 4` indicates we included models with rate variation among sites and sets the number of categories to 4.
* `-i` indicates we included models with a proportion of invariable sites.
* `-H AIC` indicates we used the AIC as the information criterion for the clustering search.
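
For reference, the AIC of a model with k free parameters and maximized log-likelihood ln L is 2k - 2 ln L, and the model with the lowest score is preferred. The sketch below ranks candidate models this way and derives Akaike weights. The log-likelihoods and parameter counts are invented for illustration and are not taken from the jModelTest run.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood

def rank_models(candidates):
    """Return (name, AIC, delta, weight) tuples sorted by ascending AIC."""
    scores = {name: aic(lnl, k) for name, (lnl, k) in candidates.items()}
    best = min(scores.values())
    # Relative likelihood of each model given its AIC difference to the best.
    rel = {name: math.exp(-0.5 * (s - best)) for name, s in scores.items()}
    total = sum(rel.values())
    rows = [(name, scores[name], scores[name] - best, rel[name] / total)
            for name in scores]
    return sorted(rows, key=lambda row: row[1])

# Invented values: name -> (log-likelihood, number of free parameters).
candidates = {
    "JC": (-1050.0, 1),
    "HKY+G": (-1010.0, 6),
    "GTR+G": (-1000.0, 10),
}

ranked = rank_models(candidates)
for name, score, delta, weight in ranked:
    print(f"{name:6s} AIC={score:8.1f} delta={delta:6.1f} weight={weight:.4f}")
```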

The output of jModelTest indicated that the best-fit model for the Beijing lineage SNP data was GTR+G, i.e. the General Time Reversible model with gamma-distributed rate variation among sites.

```
::Best Models::

        Model           f(a)    f(c)    f(g)    f(t)    kappa   titv    Ra      Rb      Rc      Rd      Re      Rf      pInv    gamma
----------------------------------------------------------------------------------------------------------------------------------------
AIC     GTR+G           0.16    0.34    0.33    0.17    0.00    0.00      0.920   3.380   0.634   0.618   2.859   1.000 N/A        0.60

```
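
To make the table concrete, the sketch below assembles a GTR rate matrix from the reported base frequencies and relative rates. It assumes the usual jModelTest labelling (Ra = A-C, Rb = A-G, Rc = A-T, Rd = C-G, Re = C-T, Rf = G-T) and the common convention of scaling the matrix to one expected substitution per unit time. Gamma rate variation is left out, and the function name is hypothetical.

```python
BASES = "ACGT"

def gtr_rate_matrix(freqs, rates):
    """Build a normalized GTR rate matrix Q as a nested list.

    freqs: base frequencies in the order A, C, G, T.
    rates: exchangeabilities keyed by unordered base pair, e.g. "AC".
    """
    n = len(BASES)
    q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                pair = "".join(sorted(BASES[i] + BASES[j]))
                q[i][j] = rates[pair] * freqs[j]
        # The diagonal is still 0.0 here, so this is minus the off-diagonal sum.
        q[i][i] = -sum(q[i])
    # Scale so that the expected substitution rate at equilibrium is 1.
    mean_rate = -sum(freqs[i] * q[i][i] for i in range(n))
    return [[value / mean_rate for value in row] for row in q]

# Values from the jModelTest table (rounded as printed there).
freqs = [0.16, 0.34, 0.33, 0.17]
rates = {"AC": 0.920, "AG": 3.380, "AT": 0.634,
         "CG": 0.618, "CT": 2.859, "GT": 1.000}

Q = gtr_rate_matrix(freqs, rates)
for base, row in zip(BASES, Q):
    print(base, " ".join(f"{value:8.4f}" for value in row))
```

Each row of the resulting matrix sums to zero, and the matrix satisfies time reversibility: the frequency of base i times the rate from i to j equals the frequency of base j times the rate from j to i.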

Next, PhyML 3.1 was used to infer the best phylogenetic tree using a maximum likelihood approach with the following commands:

```
$ input=/global/scratch/seanla/Data/MTBC/run1/phyml/run1snps.phy
$ cd /home/seanla/Software/PhyML-3.1
$ ./PhyML-3.1_linux64 -i $input -q -b 1000 -m GTR --no_memory_check
```

Here is an explanation of the parameters used:
* `-q` indicates the input data is in sequential form.
* `-b 1000` indicates bootstrap was performed with 1000 pseudoreplicates.
* `-m GTR` indicates the model used was GTR.
* `--no_memory_check` indicates that the program was run without its check for sufficient memory.
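
The nonparametric bootstrap behind `-b 1000` resamples alignment columns with replacement to build each pseudoreplicate, and a tree is then inferred from every pseudoreplicate. The sketch below shows only the resampling step on a made-up alignment. It is not PhyML's implementation, and the function name is hypothetical.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement.

    alignment: dict mapping taxon name to sequence (all of equal length).
    rng: a random.Random instance, so that runs are reproducible.
    """
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[c] for c in columns)
            for name, seq in alignment.items()}

# Made-up alignment for illustration only.
alignment = {
    "taxon_1": "ACGTACGT",
    "taxon_2": "ACGTTCGT",
    "taxon_3": "ACGAACGT",
}

rng = random.Random(2016)
replicates = [bootstrap_alignment(alignment, rng) for _ in range(5)]
for replicate in replicates:
    print(replicate)
```

Because whole columns are drawn, every pseudoreplicate has the same length as the original alignment and keeps each site pattern intact across taxa.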

jModelTest indicated that the best-fit model includes rate variation among sites. PhyML handles this by default by estimating the gamma shape parameter from the data. However, there is also a `--free_rates` option, an alternative to the discrete gamma model that may fit the data better. This is a point of discussion for the future.

**Monday, 20 June 2016**

## Bayesian Inference ##
BEAST v1.8.3 was used to estimate the posterior distribution of phylogenetic trees by Bayesian inference. The FASTA file `run1snps.fasta` was used as input. The BEAST XML file was created using the program BEAUti. The following parameters were changed:
* Substitution model - changed to GTR, with site heterogeneity model Gamma with 4 categories.
* Ancestral State Reconstructions - "Reconstruct states at ancestor" setting was activated and set to the tree root.
* Length of chain - set to 1500000

The remainder of the settings were left at their default values.
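
As a rough guide to what a chain of this length yields, the sketch below counts the states written to the log. The sampling interval, the burn-in fraction, and the inclusion of the initial state are assumptions chosen for illustration. The values actually used were the BEAUti defaults, which are not recorded here.

```python
def logged_samples(chain_length, log_every):
    """Number of states written to the log, counting the initial state 0."""
    return chain_length // log_every + 1

def samples_after_burnin(n_samples, burnin_fraction):
    """Number of samples kept after discarding the leading burn-in fraction."""
    return n_samples - int(n_samples * burnin_fraction)

CHAIN_LENGTH = 1_500_000   # from the BEAUti settings above
LOG_EVERY = 1_000          # assumed sampling interval
BURNIN_FRACTION = 0.10     # assumed burn-in fraction

total = logged_samples(CHAIN_LENGTH, LOG_EVERY)
kept = samples_after_burnin(total, BURNIN_FRACTION)
print(total, kept)
```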

![Bayesian inference consensus tree](beast/run1bayes.nexus.png)

MEGA was used to infer the optimal phylogenetic tree using neighbor-joining. The following parameters were set:
* Nucleotide alignment (non-coding)
* Test of phylogeny - bootstrap with 1000 pseudoreplicates
* Substitution model - number of differences, including both transitions and transversions
* Rates among sites - uniform rates

MEGA was run using the following commands:

```
$ prefix=/global/scratch/seanla/Data/MTBC/run1
$ input=$prefix/snps/run1snps.fasta
$ config=$prefix/mega/infer_NJ_nucleotide.mao
$ output=$prefix/mega/run1mega
$ cd /home/seanla/Software/mega
$ ./megacc -a $config -d $input -o $output
```
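
The "number of differences" distance chosen for the neighbor-joining analysis is the count of sites at which two sequences differ, with transitions and transversions both counted. Below is a minimal sketch on a made-up alignment. It ignores gaps and ambiguous bases, which MEGA treats according to its own settings, and the function names are hypothetical.

```python
from itertools import combinations

def count_differences(seq_a, seq_b):
    """Number of aligned sites at which the two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(seq_a, seq_b))

def distance_matrix(alignment):
    """Pairwise difference counts keyed by (name_1, name_2)."""
    return {
        (n1, n2): count_differences(alignment[n1], alignment[n2])
        for n1, n2 in combinations(alignment, 2)
    }

# Made-up alignment for illustration only.
alignment = {
    "taxon_1": "ACGTACGT",
    "taxon_2": "ACGTTCGT",
    "taxon_3": "ACGAACGT",
}

distances = distance_matrix(alignment)
for pair, distance in distances.items():
    print(pair, distance)
```

Neighbor-joining then builds the tree from a matrix of this kind by repeatedly joining the pair of taxa that minimizes the total branch length.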

![Neighbor joining tree with bootstrap values](mega/run1mega_consensus.png)