## Beijing Lineage MTBC - Phylogenetic Run 3 ##

### 30 June 2016 ###
#### Input Data ####
The input data for this run are VCF files containing the SNPs from 110 unique strains (drawn from ~150 samples) of the Beijing lineage. SNPs in repetitive regions and SNPs associated with antibiotic resistance have been removed. The full list of strains used in this run is in the file `unique_strains.txt` in the `convertvcf` folder.

#### VCF Conversion ####
The collection of VCF files was converted to FASTA format with the tool `convertvcf.py`, found in the `convertvcf` folder, using the following commands:

```bash
run=run3
inputDir=/home/zhf615/TB_test/MALAWI/test/shuffled/results
outputDir=/global/scratch/seanla/Data/MTBC/$run
snpDir=$outputDir/snps
snpPrefix=$snpDir/$run
convertVcfDir=/home/seanla/Projects/beijing_ancestor_mtbc/convertvcf
strains=$convertVcfDir/unique_strains.txt
convertVcf=$convertVcfDir/convertvcf.py

mkdir -p $snpDir

python $convertVcf -i $inputDir -o $snpPrefix -s $strains -r "(\d|\D)*_unresi_unrepeted.vcf$" -f -p
```

Here, `(\d|\D)*_unresi_unrepeted.vcf$` is the regular expression that selects the appropriate VCF files. Processing yielded 7806 sites, i.e. each strain sequence is 7806 bases long.
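
To make the file selection concrete, here is a minimal sketch of how the pattern behaves on a few file names. It assumes `convertvcf.py` applies the pattern with Python's `re` module against bare file names, which is not confirmed by these notes. The file names and the helper `select_vcfs` are hypothetical.

```python
import re

# Pattern copied verbatim from the -r argument above.
VCF_PATTERN = re.compile(r"(\d|\D)*_unresi_unrepeted.vcf$")


def select_vcfs(filenames):
    """Return the file names accepted by the pattern, in input order."""
    return [name for name in filenames if VCF_PATTERN.match(name)]


# Hypothetical file names, for illustration only.
candidates = [
    "ERR000001_unresi_unrepeted.vcf",      # accepted
    "ERR000002_unresi.vcf",                # rejected: suffix is incomplete
    "ERR000003_unresi_unrepeted.vcf.idx",  # rejected: `$` anchors the end
    "ERR000004_unresi_unrepetedXvcf",      # accepted: the `.` is unescaped
]

selected = select_vcfs(candidates)
print(selected)
```

The last candidate shows that the unescaped `.` matches any character. Writing `\.vcf$` would be stricter, although this is unlikely to matter for a directory that only holds pipeline output.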

#### jModelTest ####
jModelTest was used to infer the best-fit model of nucleotide evolution using the Akaike information criterion with the following commands:

```bash
run=run3
prefix=/global/scratch/seanla/Data/MTBC/$run
phylip=$prefix/snps/$run.phy
jmodelDir=$prefix/jmodeltest
output=$jmodelDir/jmodeltest-results.out
jModelTest=/home/seanla/Software/jmodeltest-2.1.10/jModelTest.jar

mkdir -p $jmodelDir

java -jar $jModelTest -tr ${PBS_NUM_PPN} -d $phylip -o $output -AIC -a -f -g 4 -i -H AIC
```

An explanation of the parameters used:
* `-AIC` indicates we used the Akaike information criterion to infer the best-fit model.
* `-a` indicates we estimated the model-averaged phylogeny for each active criterion.
* `-f` indicates we included models with unequal base frequencies.
* `-g 4` indicates we included models with rate variation among sites and sets the number of rate categories to 4.
* `-i` indicates we included models with a proportion of invariable sites.
* `-H AIC` indicates we used the Akaike information criterion for the clustering search. (jModelTest reported that this option has no effect for our dataset.)
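
As a reminder of what the AIC comparison does, the following minimal sketch computes AIC = 2k - 2 ln L for a few candidate models and derives Akaike weights from the scores. The log-likelihoods are hypothetical and are not taken from this run. The parameter counts assume that branch lengths are counted as free parameters (2 x 110 - 3 = 217 for 110 taxa) in addition to the substitution-model parameters.

```python
import math


def aic(log_likelihood, num_params):
    """Akaike information criterion: AIC = 2k - 2 ln L."""
    return 2 * num_params - 2 * log_likelihood


def akaike_weights(aic_scores):
    """Convert a {model: AIC} mapping into {model: Akaike weight}."""
    best = min(aic_scores.values())
    relative = {m: math.exp(-0.5 * (s - best)) for m, s in aic_scores.items()}
    total = sum(relative.values())
    return {m: r / total for m, r in relative.items()}


# Hypothetical (log-likelihood, free-parameter count) pairs, NOT from this run.
fits = {
    "JC": (-41250.0, 217),       # branch lengths only
    "HKY+G": (-40890.0, 222),    # + kappa, 3 frequencies, gamma shape
    "GTR+I+G": (-40870.0, 227),  # + 5 rates, 3 frequencies, pInv, gamma shape
}

scores = {model: aic(lnl, k) for model, (lnl, k) in fits.items()}
best_model = min(scores, key=scores.get)  # the lowest AIC wins
weights = akaike_weights(scores)

print(best_model, scores[best_model])
```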

jModelTest selected GTR+I+G (the General Time Reversible model with a proportion of invariable sites and gamma-distributed rate variation among sites) as the best-fit model for our data.

```
::Best Models::

        Model           f(a)    f(c)    f(g)    f(t)    kappa   titv    Ra      Rb      Rc      Rd      Re      Rf      pInv    gamma
----------------------------------------------------------------------------------------------------------------------------------------
AIC     GTR+I+G         0.18    0.32    0.31    0.18    0.00    0.00      0.915   3.079   0.432   0.566   2.729   1.000    0.00    2.16
```
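
To show how the reported parameters define the substitution model, the sketch below assembles a normalised GTR rate matrix from the table row. It assumes the usual jModelTest convention that `Ra` to `Rf` correspond to the AC, AG, AT, CG, CT and GT exchangeabilities, with GT fixed at 1. The printed frequencies are rounded and sum to 0.99, so they are renormalised first. The function name `gtr_rate_matrix` is hypothetical.

```python
BASES = "ACGT"

# Values copied from the jModelTest row above.
freqs = {"A": 0.18, "C": 0.32, "G": 0.31, "T": 0.18}
rates = {
    ("A", "C"): 0.915, ("A", "G"): 3.079, ("A", "T"): 0.432,
    ("C", "G"): 0.566, ("C", "T"): 2.729, ("G", "T"): 1.000,
}


def gtr_rate_matrix(freqs, rates):
    """Build a GTR rate matrix Q scaled to one expected substitution per unit time."""
    total = sum(freqs.values())
    pi = {b: freqs[b] / total for b in BASES}  # renormalise rounded frequencies
    q = {i: {j: 0.0 for j in BASES} for i in BASES}
    for i in BASES:
        for j in BASES:
            if i != j:
                # Exchangeabilities are symmetric, so look up either orientation.
                r = rates[(i, j)] if (i, j) in rates else rates[(j, i)]
                q[i][j] = r * pi[j]
        q[i][i] = -sum(q[i][j] for j in BASES if j != i)  # rows sum to zero
    mean_rate = -sum(pi[b] * q[b][b] for b in BASES)
    for i in BASES:
        for j in BASES:
            q[i][j] /= mean_rate
    return pi, q


pi, q = gtr_rate_matrix(freqs, rates)
for i in BASES:
    print(i, [round(q[i][j], 3) for j in BASES])
```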

#### Neighbor Joining ####
To construct the neighbor-joining tree, MEGA was used with the following parameters:
* The data to analyze was set to nucleotide alignment (non-coding).
* Test of phylogeny was set to bootstrap method with 1000 pseudoreplicates.
* The substitution model was set to number of differences, including both transitions and transversions.
* Rates among sites was set to gamma distributed with invariant sites, with a gamma parameter of 0.
* Gaps/Missing Data were treated by deletion.
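
For illustration, the "number of differences" distance can be sketched as follows. This is not MEGA's implementation. It assumes complete deletion (any column containing a gap or missing symbol in any sequence is dropped before counting) and assumes `-`, `?` and `N` are the gap/missing symbols. The toy alignment and function names are hypothetical.

```python
from itertools import combinations

# Assumed gap / missing-data symbols.
GAP_CHARS = set("-?Nn")


def complete_deletion(alignment):
    """Drop every column in which at least one sequence has a gap or missing base."""
    names = list(alignment)
    length = len(alignment[names[0]])
    keep = [
        col for col in range(length)
        if all(alignment[name][col] not in GAP_CHARS for name in names)
    ]
    return {name: "".join(alignment[name][col] for col in keep) for name in names}


def count_differences(seq_a, seq_b):
    """Number of sites at which two equal-length sequences differ."""
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)


def distance_matrix(alignment):
    """Pairwise number-of-differences distances after complete deletion."""
    cleaned = complete_deletion(alignment)
    return {
        (a, b): count_differences(cleaned[a], cleaned[b])
        for a, b in combinations(cleaned, 2)
    }


# Toy alignment, for illustration only.
toy = {
    "strainA": "ACGTACGT",
    "strainB": "ACGTTCGT",
    "strainC": "AC-TTCGA",
}
dist = distance_matrix(toy)
print(dist)
```

Neighbor joining then clusters the strains from a matrix of this kind, and each bootstrap pseudoreplicate resamples the alignment columns before recomputing the distances.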

#### Maximum Likelihood ####
To construct the Maximum Likelihood tree, MEGA was used with the following parameters:
* The data to analyze was set to nucleotide alignment (non-coding).
* Test of phylogeny was set to bootstrap method with 1000 replications.
* The substitution model was set to GTR.
* Rates among sites was set to Gamma Distributed with Invariant Sites (G+I).
* The number of discrete gamma categories was set to 4.
* Gaps/Missing Data were treated as complete deletion.
* The ML heuristic method used was Nearest-Neighbor-Interchange (NNI).
* The initial tree for ML was created using NJ/BioNJ (neighbor joining or its BioNJ variant).
* No branch swap filters were used.

#### Bayesian Inference ####
To construct the Bayesian Inference tree, BEAST 2 was used with the following parameters:

##### Site Model #####
* The site model was set to gamma with a gamma category count of 4.
* The shape of the gamma site model was set to 1.0.
* The proportion invariant of the gamma site model was set to 0.0.
* The model of nucleotide substitution was set to GTR.
* Rate for each nucleotide substitution was set to 1.0.
* Each nucleotide substitution rate was estimated with the exception of CT.
* Frequencies were estimated as well.

##### Clock Model #####
* The clock model was set to strict with a rate of 1.0.

##### Priors #####
* The tree model was set to Yule.
* The birth rate prior was set to uniform.
* Each nucleotide rate prior was set to gamma.

##### MCMC #####
* A chain length of 10 000 000 states was used with a pre-burnin of 0 (not the same as burn-in).
* We logged the tree distribution every 1000 states.
* We did not sample from the prior.
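
As a quick sanity check on the size of the resulting tree log, the sketch below counts the logged states. It assumes that a sample is written at state 0 and at every multiple of the logging interval up to the final state. The 10% post-hoc burn-in is a hypothetical value, not a setting recorded for this run.

```python
def logged_states(chain_length, log_every):
    """States at which a sample is written, assuming state 0 is logged too."""
    return range(0, chain_length + 1, log_every)


states = logged_states(10_000_000, 1_000)
n_trees = len(states)

# Hypothetical post-hoc burn-in of 10% of the logged trees.
n_burn_in = n_trees // 10
n_kept = n_trees - n_burn_in

print(n_trees, n_burn_in, n_kept)
```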