## Beijing Lineage Phylogeny Construction - Run 4 ##

### 6 July 2016 ###

The input data for this run are VCF files containing SNPs from 76 unique strains, drawn from ~150 samples of the Beijing lineage. SNPs from repetitive regions and SNPs associated with antibiotic resistance were removed, along with SNPs whose frequency was less than 2. The full list of strains used in this run is in the file `76strains.txt` in the `convertvcf` folder.
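
Since the run depends on the strain list holding exactly 76 unique entries, a quick sanity check of that file can be useful before conversion. The following is only a sketch: it assumes the list has one code per line, and the helper name `check_strains` is hypothetical and not part of `convertvcf.py`.

```bash
# Sanity-check a strain list: count non-blank entries and report duplicates.
# Assumption: one code per line; blank lines are ignored.
check_strains() {
    list=$1
    # Number of non-blank lines in the list.
    total=$(grep -c -v '^[[:space:]]*$' "$list")
    # Number of distinct non-blank lines.
    unique=$(grep -v '^[[:space:]]*$' "$list" | sort -u | wc -l | tr -d ' ')
    echo "total=$total unique=$unique"
    # Print every entry that appears more than once (nothing if all are unique).
    grep -v '^[[:space:]]*$' "$list" | sort | uniq -d
}

# Possible usage with the path from this run:
# check_strains /home/seanla/Projects/beijing_ancestor_mtbc/convertvcf/76strains.txt
```

For this run, both counts would be expected to equal 76, with no duplicate entries listed.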

#### VCF Conversion ####

The VCF files were converted to FASTA format with the tool `convertvcf.py`, located in the `convertvcf` folder, using the following commands:

```bash
run=run5

inputDir=/home/zhf615/TB_test/MALAWI/test/shuffled/results
outputDir=/global/scratch/seanla/Data/MTBC/$run
snpDir=$outputDir/snps
snpPrefix=$snpDir/$run

convertVcfDir=/home/seanla/Projects/beijing_ancestor_mtbc/convertvcf
strains=$convertVcfDir/76strains.txt
convertVcf=$convertVcfDir/convertvcf.py

mkdir -p $snpDir


python $convertVcf -i $inputDir -o $snpPrefix -s $strains -r "(\d|\D)*_final.vcf$" -f -p
```

Here, `(\d|\D)*_final.vcf$` is the regular expression that selects the appropriate VCF files. The file `76strains.txt` contains the accession codes of all 76 unique strains. They were taken from the Excel spreadsheet `Beijing_lineage_snp.xlsx`, in which the included accession codes are colored black; accession codes colored red were excluded.
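
To illustrate what this pattern accepts, here is a small sketch using `grep -E`. It rests on two assumptions: POSIX extended regular expressions have no `\d` or `\D`, so `.*` stands in for `(\d|\D)*` (both accept any run of characters), and the file names are hypothetical. The `strict` pattern, which escapes the dot, is shown only for comparison and is not what this run used.

```bash
# Stand-in for the pattern used in this run: ".*" replaces "(\d|\D)*".
loose='.*_final.vcf$'
# Stricter variant (an assumption, not used in this run): the dot is escaped.
strict='.*_final\.vcf$'

matches() {
    # usage: matches PATTERN NAME  -> prints "yes" or "no"
    if printf '%s\n' "$2" | grep -Eq -- "$1"; then echo yes; else echo no; fi
}

# Hypothetical file names, for illustration only.
for name in sampleA_final.vcf sampleA_raw.vcf sampleA_finalXvcf; do
    printf '%-20s loose=%s strict=%s\n' "$name" \
        "$(matches "$loose" "$name")" "$(matches "$strict" "$name")"
done
```

Because the dot before `vcf` is unescaped, the loose pattern also accepts a name such as `sampleA_finalXvcf`. This is harmless as long as no such files exist in the input directory.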

#### Best Fit Model of DNA Evolution - jModelTest ####

jModelTest was used to infer the best-fit model of nucleotide evolution using the Akaike information criterion with the following commands:

```bash
run=run5

prefix=/global/scratch/seanla/Data/MTBC/$run
phylip=$prefix/snps/$run.phy
jmodelDir=$prefix/jmodeltest
output=$jmodelDir/jmodeltest-results.out
jModelTest=/home/seanla/Software/jmodeltest-2.1.10/jModelTest.jar

mkdir -p $jmodelDir

java -jar $jModelTest -tr ${PBS_NUM_PPN} -d $phylip -o $output -AIC -a -f -g 4 -i -H AIC
```

An explanation of the parameters used is as follows:
* `-tr` sets the number of threads; here it is taken from `PBS_NUM_PPN`.
* `-d` gives the input alignment file and `-o` the output file.
* `-AIC` indicates we used the Akaike information criterion to infer the best-fit model.
* `-a` indicates we estimated the model-averaged phylogeny for each active criterion.
* `-f` indicates we included models with unequal base frequencies.
* `-g 4` indicates we included models with rate variation among sites and set the number of rate categories to 4.
* `-i` indicates we included models with a proportion of invariable sites.
* `-H AIC` indicates we used the AIC for the clustering search. (jModelTest indicated that this option has no effect for our dataset.)
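
The command above passes `${PBS_NUM_PPN}` to `-tr`. That variable is typically set by the Torque scheduler inside a batch job, so outside a job it expands to nothing and `-tr` is left without a value. A minimal sketch of a fallback follows; the helper name `thread_count` is hypothetical.

```bash
# Print the thread count to pass to -tr.
# Uses PBS_NUM_PPN when it is a positive integer; otherwise falls back to 1.
thread_count() {
    case ${PBS_NUM_PPN:-} in
        ''|*[!0-9]*|0) echo 1 ;;        # unset, empty, non-numeric, or zero
        *) echo "$PBS_NUM_PPN" ;;       # value provided by the scheduler
    esac
}

# Possible usage in place of ${PBS_NUM_PPN}:
# java -jar "$jModelTest" -tr "$(thread_count)" -d "$phylip" -o "$output" -AIC -a -f -g 4 -i -H AIC
```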

