Scripts to work with conserved loci data from squamates.

To do:
- matching of contigs to targets has some weird behavior
- check cases with high matches where matches aren't retained
- use read error correction prior to assembly
http://www.genome.umd.edu/quorum.html
- check with M. Grundler about this
- use cap3 to make PRG rather than random picking
- better trusted SNP set for BQSR
- maybe do Cortex intersection with raw GATK call
- check SNP calling filters?
https://www.broadinstitute.org/gatk/guide/article?id=3225
https://www.broadinstitute.org/gatk/guide/article?id=6925
- check mito DNA first both for misidentification & for contamination
- use k-mer alignment-free methods to calculate distances to look for possible contaminants or misidentification (MASH or kWIP)
- add more plotting to summarize SNP quality
- heterozygosity vs. AF
- coverage vs. heterozygosity
- add more plotting to identify collapsed paralogs or chimeric sequences
- sliding window plots of p-distance across loci
- sliding window plots of heterozygosity across loci
- sliding window plots of depth across loci
- from these, identify peaky distributions
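A sliding-window summary like those above can be sketched in a few lines; `window_stats` and the toy per-site data are hypothetical, not part of the pipeline:

```python
# Hypothetical sketch of a sliding-window summary across a locus.
# `calls` is a list of (is_het, depth) tuples, one per site.
def window_stats(calls, size=50, step=25):
    """Return (window start, heterozygosity, mean depth) per window."""
    out = []
    for start in range(0, max(len(calls) - size + 1, 1), step):
        win = calls[start:start + size]
        het = sum(1 for is_het, _ in win if is_het) / len(win)
        depth = sum(d for _, d in win) / len(win)
        out.append((start, het, depth))
    return out

# A collapsed paralog would show up as a "peaky" window with both high
# depth and high heterozygosity relative to the rest of the locus.
calls = [(False, 20)] * 100 + [(True, 60)] * 50 + [(False, 20)] * 100
peaks = [w for w in window_stats(calls) if w[1] > 0.5 and w[2] > 40]
```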
- When running any script, you can see the script arguments using `python script.py --help`.
- This pipeline is best run with these scripts all in a row, but each script can be run on its own. However, this piecemeal functionality hasn't been tested.

1. Start with demultiplexed data.
2. Create a CSV file summarizing the data
    - note, column order does not matter
    - sample: sample name
    - read1: complete path to read 1 file
    - read2: complete path to read 2 file
    - adaptor1: adaptor 1 to scrub; put an asterisk where the barcode goes
    - adaptor2: adaptor 2 to scrub; put an asterisk where the barcode goes
    - barcode1: barcode 1 to scrub
    - barcode2: barcode 2 to scrub (if it exists)
    - lineage: the group to which the sample belongs; used for assembly and SNP calling
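For illustration, here is a minimal made-up `samples.csv` and the kind of lineage grouping the later steps rely on; the sample names, paths, and barcodes below are invented:

```python
import csv
import io

# A made-up two-sample CSV; column order does not matter.
TEXT = """sample,read1,read2,adaptor1,adaptor2,barcode1,barcode2,lineage
Anolis_carolinensis,/path/r1.fq.gz,/path/r2.fq.gz,AGATCG*ATC,AGATCG*ATC,ACGT,TGCA,l1
Anolis_sagrei,/path/s1.fq.gz,/path/s2.fq.gz,AGATCG*ATC,AGATCG*ATC,CCGG,,l1
"""

# Group samples by lineage, the unit used for assembly and SNP calling.
lineages = {}
for row in csv.DictReader(io.StringIO(TEXT)):
    lineages.setdefault(row["lineage"], []).append(row["sample"])
```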
3. Clean reads
    - Remove adaptors using `Trimmomatic`
    - Merge reads using `PEAR`
    - Lightly remove low-quality sequence using `Trimmomatic`
    - Ends up creating three files: 2 paired files with ordering retained, and 1 unpaired read file
    - Assumes `Trimmomatic v0.36` and `PEAR 0.9.10`
```
python ~/squamateUCE/clean_reads.py --trimjar ~/bin/Trimmomatic-0.36/trimmomatic-0.36.jar \
    --PEAR ~/bin/pear-0.9.10/pear-0.9.10-bin-64 --dir /scratch/drabosky_flux/sosi/uce_test/ \
    --sample Anolis_carolinensis --file /scratch/drabosky_flux/sosi/uce_test/samples.csv
```
    - If you want, this is a good place to check data quality by running `quality_1_reads.py`.
4. Assemble reads
    - `Trinity` is meant for transcriptomes, but the authors of `Phyluce` have found it works well with non-transcriptome data sets
    - This script combines read 1 and unpaired reads into one file to use with `Trinity`
    - Then assembles
    - This is very memory intensive - recommended to give 1 GB of RAM per 1M reads
    - Can use the `--normal` flag to normalize big data sets
        - found this to be necessary because otherwise there was a heap error during the `Butterfly` step
    - Requires `Trinity v2.2.0`
```
python /home/sosi/squamateUCE/trinity_assembly.py --trinity ~/bin/trinityrnaseq-2.2.0/Trinity \
    --sample Anolis_carolinensis --dir /scratch/drabosky_flux/sosi/uce_test/ \
    --mem 55 --CPU 16
```
5. Match assemblies to original targets
    - Uses `blat` and a soft reciprocal match to identify which probes match to which contigs in the assembly
        - Multiple probes belong to the same locus, and while the probes don't necessarily overlap, they are likely to match to the same contig in the assembly
    - Script uses a regex that is dataset-specific to synonymize probes to targeted loci
    - This script only considers matches that are above the e-value provided by the user and that are no more than 10 orders of magnitude worse than the best match
    - There are a few kinds of outcomes:
        - easy_recip_match: 1-to-1 unique match between contig and targeted locus
        - complicated_recip_match: 1-to-1 non-unique match, in which one targeted locus matches to multiple contigs
        - ditched_no_recip_match: a case in which the contig matches to the targeted locus, but it isn't the best match
        - ditched_no_match: a case in which the contig matches to the targeted locus, but the locus doesn't match to the contig
        - ditched_too_many_matches: a case in which one contig has multiple good matches to multiple targeted loci
            - removed because possible paralogs?
    - This requires `blat v36`
```
python ~/squamateUCE/match_contigs_to_probes.py --blat ~/bin/blat --sample Anolis_carolinensis \
    --dir /scratch/drabosky_flux/sosi/uce_test/ --evalue 1e-30 \
    --db squamate_AHE_UCE_genes_loci.fasta
```
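The match-retention rule above (the user's e-value cutoff, plus "no more than 10 orders of magnitude worse than the best hit") could look roughly like this; the function and data are illustrative, not the script's actual code:

```python
# Illustrative sketch, not the actual script: for one contig, keep only
# hits that pass the user's e-value cutoff and are within 10 orders of
# magnitude of that contig's best hit.
def retain_matches(hits, evalue=1e-30, orders=10):
    """hits: dict mapping targeted locus -> e-value for one contig."""
    passing = {loc: e for loc, e in hits.items() if e <= evalue}
    if not passing:
        return {}
    best = min(passing.values())
    return {loc: e for loc, e in passing.items() if e <= best * 10 ** orders}

hits = {"uce-1": 1e-80, "uce-2": 1e-75, "uce-3": 1e-40}
kept = retain_matches(hits)  # uce-3 is 40 orders worse than best; dropped
```

If more than one targeted locus survives this filter for a single contig, that corresponds to the `ditched_too_many_matches` outcome described above.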
6. Generate pseudo-reference genomes
    - Uses the lineage designation in the sample file created in step 2 to create a pseudo-reference genome (PRG) for each lineage
    - The script identifies all contig matches to all targeted loci across all individuals in that lineage
        - if there are multiple contigs across individuals that match to a given targeted locus, it either picks the top contig if it is much better than the other matches (>1e3), or, if not, picks the longest one
    - This script can be run with only the 'easy' and/or 'complicated' reciprocal matches
    - Also reverse complements contigs so that they are all oriented the same way as the target (important later for phylogeny making!)
    - Requires no additional programs.
```
python ~/squamateUCE/make_PRG.py --lineage l1 --file /scratch/drabosky_flux/sosi/uce_test/samples.csv \
    --dir /scratch/drabosky_flux/sosi/uce_test/ --keep easy_recip_match
```
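The contig-choice rule in this step can be sketched as follows; the data layout is hypothetical and `choose_contig` is not the script's own function:

```python
# Hypothetical sketch of the PRG contig choice: among contigs matching
# one targeted locus, take the best-scoring contig if its e-value beats
# the runner-up by more than three orders of magnitude (>1e3);
# otherwise fall back to the longest contig.
def choose_contig(matches):
    """matches: list of (sequence, e-value) tuples for one locus."""
    if len(matches) == 1:
        return matches[0][0]
    ranked = sorted(matches, key=lambda m: m[1])
    best, runner_up = ranked[0], ranked[1]
    if runner_up[1] / best[1] > 1e3:
        return best[0]
    return max(matches, key=lambda m: len(m[0]))[0]

# Clear winner: 1e-90 is far better than 1e-62, so length is ignored.
chosen = choose_contig([("ACGTACGT", 1e-60), ("ACGT", 1e-90), ("ACGTAC", 1e-62)])
```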
    - If you want, this is a good place to check data quality by running `quality_2_assembly.py`.
7. Align reads
    - `align_reads1.py`: run by individual
        - Align paired reads & unpaired reads in two separate runs using `bwa`
            - the paired-read run performs mate rescue
        - Then run fixmate to fix any broken paired-end information
        - Then sort the BAM files
        - Then mark duplicates
    - `align_reads2.py`: run by lineage first, then by individual
        - Generates a raw set of variants for the lineage using `bcftools`
        - Filters the raw set lightly (`--dp` and `--qual`)
        - Uses the filtered set to perform base quality score recalibration (BQSR)
        - Recalibrates the individual BAM files
    - These scripts assume `BWA 0.7.12`, `samtools 1.3.1`, `GATK 4`, `Picard 2.23`, and `bcftools 1.7`
```
python ~/squamateUCE/align_reads1.py --sample Mus_musculus \
    --file /scratch/drabosky_flux/sosi/uce_test/samples.csv \
    --dir /scratch/drabosky_flux/sosi/uce_test/ --bwa ~/bin/bwa-0.7.12/bwa \
    --samtools ~/bin/samtools-1.3.1/samtools \
    --picard ~/bin/picard-tools-2.23.1/picard.jar --CPU 1 --mem 1

python ~/squamateUCE/align_reads2.py --lineage l1 --file /scratch/drabosky_flux/sosi/uce_test/samples.csv \
    --dir /scratch/drabosky_flux/sosi/uce_test/ --samtools ~/bin/samtools-1.3.1/samtools \
    --bcftools ~/bin/bcftools-1.7/bcftools \
    --gatk ~/bin/GenomeAnalysisTK.jar --mem 3 --dp 10 --qual 20 --CPU 4
```
    - If you want, this is a good place to check data quality by running `quality_3_alignment.py`.
8. Call and filter variants
    - Calls variant (and invariant!) sites based on the "final" BAM files per lineage
        - Note the importance of calling invariant sites - they are the denominator in all pop gen analyses
    - Variants are filtered based on quality (`--qual`)
    - And then filtered on depth, on a per-individual basis (`--dp`)
        - If an individual has too low coverage, 'ALLELE/ALLELE' becomes missing ('./.')
        - If every individual in the lineage is missing, then the site is dropped
    - This requires `bcftools`
```
python ~/squamateUCE/call_variants.py --lineage l1 --file /scratch/drabosky_flux/sosi/uce_test/samples.csv \
    --dir /scratch/drabosky_flux/sosi/uce_test/ --bcftools ~/bin/bcftools-1.7/bcftools --mem 4 \
    --CPU 4 --dp 10 --qual 20
```
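The per-individual depth filter described above amounts to something like this; the genotype-string layout is a simplification, not bcftools' own data model:

```python
# Simplified sketch of the depth filter: genotypes from individuals
# with coverage below --dp become missing ('./.'); if every genotype
# at a site is missing, the site itself is dropped (returns None).
def filter_site(genotypes, depths, min_dp=10):
    masked = [gt if dp >= min_dp else "./."
              for gt, dp in zip(genotypes, depths)]
    if all(gt == "./." for gt in masked):
        return None
    return masked

site = filter_site(["0/1", "1/1", "0/0"], [4, 25, 12])
# the first individual is masked, but the site itself is retained
```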
9. Make alignments
    - Takes in all the data in the sample file to identify all unique pseudo-reference genomes
    - Creates unaligned fasta files for any locus that is sampled for four or more individuals
        - because most gene tree / species tree programs take in unrooted gene trees, there is no information in three-taxon trees
    - Also creates a summary file that has information about missingness
```
python ~/squamateUCE/phylogeny_make_alignments.py --file /scratch/drabosky_flux/sosi/uce_test/samples.csv \
    --dir /scratch/drabosky_flux/sosi/uce_test/
```
10. Actually do the alignments and make the gene trees
    - Does the alignments and makes the gene trees for all files
    - Uses the `multiprocessing` module in Python to do it more quickly
    - Assumes the user has `mafft 7.294` and `RAxML 8.2.4`
```
python ~/squamateUCE/phylogeny_align_genetrees.py --dir /scratch/drabosky_flux/sosi/uce_test/ \
    --CPU 4 --mafft ~/bin/mafft --raxml ~/bin/standard-RAxML/raxmlHPC
```
11. Make the concatenated alignment
    - Designed to be used with `RAxML`, but could work with any phylogeny program
```
python ~/squamateUCE/phylogeny_make_concatenated.py --file /scratch/drabosky_flux/sosi/uce_test/samples.csv \
    --dir /scratch/drabosky_flux/sosi/uce_test/ --miss 0.8
```
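Concatenation with a missingness cutoff can be sketched like this; treating `--miss` as the minimum fraction of samples a locus must contain is an interpretation, not something the script itself documents:

```python
# Sketch of building a concatenated alignment. Assumes --miss is the
# minimum fraction of samples that must have a locus for it to be
# included (an assumption); absent samples are padded with gaps.
def concatenate(loci, samples, miss=0.8):
    """loci: dict of locus name -> {sample: aligned sequence}."""
    parts = {s: [] for s in samples}
    for name in sorted(loci):
        seqs = loci[name]
        if len(seqs) / len(samples) < miss:
            continue  # too much missing data at this locus
        length = len(next(iter(seqs.values())))
        for s in samples:
            parts[s].append(seqs.get(s, "-" * length))
    return {s: "".join(p) for s, p in parts.items()}

aln = concatenate({"uce-1": {"a": "ACGT", "b": "ACGA"},
                   "uce-2": {"a": "GG"}}, ["a", "b"])
```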
12. Make the files to be used for `ASTRID` and `ASTRAL`
    - Creates a file with all the best trees for each gene
    - And another file that has the file paths to the bootstrap trees
```
python ~/squamateUCE/phylogeny_prep_astrid_astral.py --file /scratch/drabosky_flux/sosi/uce_test/samples.csv \
    --dir /scratch/drabosky_flux/sosi/uce_test/ --miss 0.8
```
    - Can then run with commands like these:

```
java -jar astral.4.10.6.jar -i best_trees_0.9.trees -b bootstrap_files_0.9.txt -r 100 -o astral_0.9.tre
ASTRID -b bootstrap_files_0.9.txt -i best_trees_0.9.trees -o astrid_0.9.tre
```