This can be used if you have a multiple alignment of one or more sets of reference sequences and SNP information that you want to call using ARIBA. The variant grouping option of ARIBA can be used to track the "same" SNPs across all the sequences.
The procedure is explained using the following example on toy data.
Make the input files
We will use the following toy sequences. They are supposed to represent different alleles of the same (very short!) gene.
>seq1
ATGGCTAATTAG
>seq2
ATGTTTAATTAG
>seq3
ATGTTTTGTAATTAG
>seq4
ATGTTTGATAATTAG
They translate to the following amino acid sequences.
>seq1
MAN*
>seq2
MFN*
>seq3
MFCN*
>seq4
MFDN*
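As a quick check of the translations above, here is a minimal Python sketch. The codon table is deliberately partial and covers only the codons that occur in the toy sequences. The names CODONS and translate are illustrative and are not part of ARIBA.

```python
# Partial codon table: only the codons present in the toy sequences.
# Assumption: a real script would use a complete genetic code table.
CODONS = {
    "ATG": "M", "GCT": "A", "TTT": "F", "TGT": "C",
    "GAT": "D", "AAT": "N", "TAG": "*",
}


def translate(nucleotides):
    """Translate an ungapped coding sequence, codon by codon."""
    if len(nucleotides) % 3 != 0:
        raise ValueError("sequence length is not a multiple of 3")
    return "".join(
        CODONS[nucleotides[i:i + 3]] for i in range(0, len(nucleotides), 3)
    )


seqs = {
    "seq1": "ATGGCTAATTAG",
    "seq2": "ATGTTTAATTAG",
    "seq3": "ATGTTTTGTAATTAG",
    "seq4": "ATGTTTGATAATTAG",
}
proteins = {name: translate(seq) for name, seq in seqs.items()}
for name, protein in proteins.items():
    print(name, protein)
```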
Here is a multiple alignment of the amino acid sequences:
>seq1
M-AN*
>seq2
MF-N*
>seq3
MFCN*
>seq4
MFDN*
and the corresponding nucleotide sequences:
>seq1
ATG---GCTAATTAG
>seq2
ATGTTT---AATTAG
>seq3
ATGTTTTGTAATTAG
>seq4
ATGTTTGATAATTAG
This final file is the one that must be used as input to aln2meta.
Every sequence must have the same length in this file (length includes the "-" gap characters).
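One way to get from the amino acid alignment to the nucleotide alignment is to thread each coding sequence onto its aligned protein, writing three gap characters for every amino acid gap. The Python sketch below does this for the toy data and then checks the equal-length requirement. The function name codon_align is hypothetical, and this is not how ARIBA itself builds or checks alignments.

```python
def codon_align(aligned_protein, nucleotides):
    """Thread an ungapped coding sequence onto its aligned protein sequence."""
    pieces = []
    pos = 0  # index of the next unused nucleotide
    for residue in aligned_protein:
        if residue == "-":
            pieces.append("---")  # one amino acid gap becomes three nucleotide gaps
        else:
            pieces.append(nucleotides[pos:pos + 3])
            pos += 3
    if pos != len(nucleotides):
        raise ValueError("protein and nucleotide sequences do not match in length")
    return "".join(pieces)


protein_aln = {"seq1": "M-AN*", "seq2": "MF-N*", "seq3": "MFCN*", "seq4": "MFDN*"}
nucleotides = {
    "seq1": "ATGGCTAATTAG",
    "seq2": "ATGTTTAATTAG",
    "seq3": "ATGTTTTGTAATTAG",
    "seq4": "ATGTTTGATAATTAG",
}
nucleotide_aln = {
    name: codon_align(protein_aln[name], nucleotides[name]) for name in protein_aln
}

# Every aligned sequence must have the same length, gaps included.
lengths = {len(seq) for seq in nucleotide_aln.values()}
same_length = len(lengths) == 1
for name, seq in nucleotide_aln.items():
    print(name, seq)
print("all sequences have the same length:", same_length)
```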
In addition, a file of SNP information is needed. Suppose we know the following two SNPs confer antibiotic resistance:
- A2D in sequence seq1
- F2E in sequence seq4
ARIBA can be used to identify the corresponding SNPs in any of the sequences. The second required file is a TSV file containing information on these SNPs. It must have four columns:
- Sequence name. Must exactly match the name of a sequence in the multiple alignment FASTA file.
- The SNP, for example A2D.
- Group name. If you do not want to put the SNP into a group, use ".".
- A description of the SNP, for example "Causes resistance to antibiotic x".
In this example, we will use the file:
seq1	A2D	group1	Description of A2D.group1
seq4	F2E	group2	Description of F2E.group2
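Before running ARIBA, it can help to check the SNP file by hand. The Python sketch below is a hypothetical pre-check, not ARIBA's own validation: it confirms that each row has four tab-separated columns, that the sequence name is known, and that the wild-type amino acid in the SNP matches the ungapped protein sequence at that position. The names check_snps and SNP_PATTERN are illustrative.

```python
import csv
import io
import re

# Ungapped amino acid sequences from the toy example.
proteins = {"seq1": "MAN*", "seq2": "MFN*", "seq3": "MFCN*", "seq4": "MFDN*"}

# Contents of the example TSV file (tab-separated, four columns).
snps_tsv = (
    "seq1\tA2D\tgroup1\tDescription of A2D.group1\n"
    "seq4\tF2E\tgroup2\tDescription of F2E.group2\n"
)

# Assumed SNP format: wild-type letter, 1-based position, variant letter.
SNP_PATTERN = re.compile(r"^([A-Z])([0-9]+)([A-Z])$")


def check_snps(tsv_text, proteins):
    """Return a list of problems found in the SNP table (empty if none)."""
    problems = []
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if len(row) != 4:
            problems.append(f"expected 4 columns, got {len(row)}: {row}")
            continue
        name, snp, group, description = row
        match = SNP_PATTERN.match(snp)
        if name not in proteins:
            problems.append(f"unknown sequence name: {name}")
        elif match is None:
            problems.append(f"cannot parse SNP: {snp}")
        else:
            wild_type, position = match.group(1), int(match.group(2))
            sequence = proteins[name]
            # Positions are 1-based, so A2D means residue A at index 1.
            if (
                position < 1
                or position > len(sequence)
                or sequence[position - 1] != wild_type
            ):
                problems.append(f"{snp} does not match {name}")
    return problems


problems = check_snps(snps_tsv, proteins)
print(problems)
```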
Run aln2meta like this:
ariba aln2meta seqs.aln.fa snps.tsv coding out
- seqs.aln.fa is the multi-FASTA alignment file of nucleotide sequences.
- snps.tsv is the TSV file of SNP information.
- coding, because these are coding sequences. For non-coding sequences, use noncoding instead, and the SNPs should be nucleotide SNPs, as opposed to amino acid SNPs.
- out is the prefix of the names of the output files.
Note that ARIBA sanity checks the SNPs against the sequences. It outputs these two warnings:
Warning: position has a gap in sequence seq2 corresponding to variant A2D (group1) in sequence seq1 ... Ignoring for seq2
Warning: position has a gap in sequence seq1 corresponding to variant F2E (group2) in sequence seq4 ... Ignoring for seq1
which makes sense looking at the sequences. For example, the A2D variant in seq1 aligns to a gap in seq2, so it gets ignored for seq2 (but included for the other sequences).
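The reasoning behind these warnings can be reproduced with a short Python sketch: convert the variant's position in its own (ungapped) sequence to an alignment column, then look at what every other sequence has in that column. This is only an illustration on the toy amino acid alignment, with hypothetical function names; it is not ARIBA's implementation.

```python
protein_aln = {"seq1": "M-AN*", "seq2": "MF-N*", "seq3": "MFCN*", "seq4": "MFDN*"}


def alignment_column(aligned, position):
    """Return the 0-based alignment column of a 1-based ungapped position."""
    seen = 0  # number of non-gap residues seen so far
    for column, residue in enumerate(aligned):
        if residue != "-":
            seen += 1
            if seen == position:
                return column
    raise ValueError("position is beyond the end of the sequence")


def sequences_with_gap(alignment, name, position):
    """List the sequences that have a gap where the variant aligns."""
    column = alignment_column(alignment[name], position)
    return [other for other, aligned in alignment.items() if aligned[column] == "-"]


print(sequences_with_gap(protein_aln, "seq1", 2))  # A2D in seq1 -> ['seq2']
print(sequences_with_gap(protein_aln, "seq4", 2))  # F2E in seq4 -> ['seq1']
```

Position 2 of seq1 sits in the third alignment column, because the second column of seq1 is a gap. That column is a gap in seq2, which matches the first warning.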
The aln2meta command above outputs three files, out.fa, out.tsv and out.cluster, which can be used as input to ariba prepareref like this:
ariba prepareref -f out.fa -m out.tsv --cdhit_clusters out.cluster out.prepareref
After this, ariba run can be run as normal.
More than one set of multiple alignments
It is possible to use more than one set of multiple alignments, for example when you have several genes, each of which has multiple alleles and SNPs of interest. Run aln2meta once for each gene/set of alleles. For example:
ariba aln2meta seqs.aln.1.fa snps.1.tsv coding out1
ariba aln2meta seqs.aln.2.fa snps.2.tsv coding out2
ariba aln2meta seqs.aln.3.fa snps.3.tsv coding out3
Then cat the relevant files together and run prepareref:
cat out*fa > all.fa
cat out*tsv > all.tsv
cat out*cluster > all.cluster
ariba prepareref -f all.fa -m all.tsv --cdhit_clusters all.cluster out.prepareref
(Alternatively, instead of concatenating the files, you could use -m once for each file.) After that, ariba run can be run as normal.
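If you prefer to script the concatenation step rather than use cat, a minimal Python sketch follows. The helper name concatenate is hypothetical; the file patterns are the same ones used in the cat commands above, and files are joined in sorted name order.

```python
import glob
import shutil


def concatenate(pattern, out_path):
    """Concatenate all files matching pattern into out_path, in sorted order."""
    paths = sorted(glob.glob(pattern))
    with open(out_path, "wb") as out_handle:
        for path in paths:
            with open(path, "rb") as in_handle:
                shutil.copyfileobj(in_handle, out_handle)
    return paths


if __name__ == "__main__":
    # Same patterns and output names as the cat commands in the text.
    for pattern, out_path in [
        ("out*fa", "all.fa"),
        ("out*tsv", "all.tsv"),
        ("out*cluster", "all.cluster"),
    ]:
        used = concatenate(pattern, out_path)
        print(out_path, "built from", used)
```

The resulting all.fa, all.tsv and all.cluster files are then passed to ariba prepareref exactly as in the command shown earlier.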