Join GitHub today
GitHub is home to over 28 million developers working together to host and review code, manage projects, and build software together.Sign up
This runs the main ARIBA local assembly pipeline.
ariba prepareref has been run, with the output directory called
ref, run the pipeline with
ariba run ref reads_1.fq reads_2.fq output_dir
reads_2.fq are the names of the forwards and
reverse paired reads files. The reads files can be in any format
that is compatible with
(in particular, gzipped).
Important: ARIBA assumes that read N in the file
reads_1.fq is the
mate of read N in the file
All output files will be put in a new directory called
To see all the options, use
ariba run --help
The most important file is
report.tsv, an example of which
can be found here.
The meaning of the columns in
report.tsv is as follows.
|1. ariba_ref_name||ariba name of reference sequence chosen from cluster (needs to rename to stop some tools breaking)|
|2. ref_name||original name of reference sequence chosen from cluster, before renaming|
|3. gene||1=gene, 0=non-coding (same as metadata column 2)|
|4. var_only||1=variant only, 0=presence/absence (same as metadata column 3)|
|5. flag||cluster flag|
|6. reads||number of reads in this cluster|
|7. cluster||name of cluster|
|8. ref_len||length of reference sequence|
|9. ref_base_assembled||number of reference nucleotides assembled by this contig|
|10. pc_ident||%identity between reference sequence and contig|
|11. ctg||name of contig matching reference|
|12. ctg_len||length of contig|
|13. ctg_cov||mean mapped read depth of this contig|
|14. known_var||is this a known SNP from reference metadata? 1 or 0|
|15. var_type||The type of variant. Currently only SNP supported|
|16. var_seq_type||Variant sequence type. if known_var=1, n or p for nucleotide or protein|
|17. known_var_change||if known_var=1, the wild/variant change, eg I42L|
|18. has_known_var||if known_var=1, 1 or 0 for whether or not the assembly has the variant|
|19. ref_ctg_change||amino acid or nucleotide change between reference and contig, eg I42L|
|20. ref_ctg_effect||effect of change between reference and contig, eg SYS, NONSYN (amino acid changes only)|
|21. ref_start||start position of variant in reference|
|22. ref_end||end position of variant in reference|
|23. ref_nt||nucleotide(s) in reference at variant position|
|24. ctg_start||start position of variant in contig|
|25. ctg_end||end position of variant in contig|
|26. ctg_nt||nucleotide(s) in contig at variant position|
|27. smtls_total_depth||total read depth at variant start position in contig, reported by mpileup|
|28. smtls_nts||nucleotides on contig, as reported by mpileup. The first is the contig nucleotide|
|29. smtls_nts_depth||depths on contig, as reported by mpileup. One number per nucleotide in the previous column|
|30. var_description||description of variant from reference metdata|
|31. free_text||other free text about reference sequence, from reference metadata|
If a gene is assembled with no variants then there will be one row for that gene, with information only in columns 1-12 (and possibly 30) and the remaining columns are dots. Otherwise, there is one row per variant. If you want a short summary of genes present and the corresponding flags, run:
cut -f1,4 report.tsv | uniq
The other files written to the output directory are as follows.
assembled_genes.fa.gz. This is a gzipped FASTA file of assembled gene sequences. It does not contain non-coding sequences (those are in
assembled_seqs.fa.gz), only genes. When comparing a local assembly to a gene, mismatches near the end of the gene can cause the alignment to be too short. ARIBA tries to extend the match by looking for start and stop codons. The extended sequences are in this file. The not extended sequences are in
assembled_seqs.fa.gz. This is a gzipped FASTA of the assembled sequences. During assembly, the sequence flanking each reference sequence is assembled, but in this file only the parts of the contigs that match the reference sequences are kept.
assemblies.fa.gz. This is a gzipped FASTA file of the assemblies. It contains the complete, unedited, contigs.
log.clusters.gz. Detailed logging is kept for the progress of each cluster. This is a gzipped file containing all the logging information.
version_info.txt. This contains detailed information on the versions of ARIBA and its dependencies. It is the output of running the task version.
--nucmer_min_id INT. Minimum alignment identity (the delta-filter -i option). Default: 90.
--nucmer_min_len INT. Minimum alignment length (the delta-filter -l option). Default: 20.
--nucmer_breaklen INT. Value to use for
-breaklenwhen running nucmer. Default: 200.
--assembly_cov INT. Target read coverage when sampling reads for assembly. The reads are randomly downsampled to the given coverage for assembly. If this option is more than the actual read coverage, then all reads are used for assembly. Default: 50.
--min_scaff_depth INT. Minimum number of read pairs needed as evidence for scaffold link between two contigs. Default: 10
--threads INT. Experimental. Number of threads. Will run clusters in parallel, but not minimap. Default: 1.
--assembled_threshold FLOAT (between 0 and 1). If proportion of gene assembled (regardless of into how many contigs) is at least this value then the flag
gene_assembled is set. Default: 0.95.
--gene_nt_extend INT. Max number of nucleotides to extend ends of gene matches to look for start/stop codons. Default: 30.
--unique_threshold FLOAT (between 0 and 1). If proportion of bases in gene assembled more than once is <= this value, then the flag
unique_contigis set. Default: 0.03.
--force. Overwrite output directory, if it already exists
--noclean. Do not clean up intermediate files
--tmp_dir TMP_DIR. Existing directory in which to create a temporary directory used for local assemblies.
--verbose. Be verbose.