Structured metadata for your FASTA sequences.
The FASTJ format is a convention for including structured sequence metadata inside a FASTA file while remaining interchangeable with most software that consumes FASTA.
In FASTJ, metadata is stored as a single-line JSON object in the description field of the typical FASTA >id description definition line. FASTA parsers treat everything after the first whitespace in the definition line as the description. (Everything before, minus the > prefix, is the id.)
A simple example of FASTJ:
>specimenA {"date":"2017-05-04", "virus":"flu"}
ATCG…
>specimenB {"date":"2017-05-13", "virus":"flu"}
CGAT…
Missing descriptions should be treated as an empty JSON object. That is, the following two FASTJ sequence records are equivalent:
>seqA
ATCG
>seqA {}
ATCG
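The rules above (id before the first whitespace, JSON object after it, missing description equals {}) can be sketched in a few lines of Python. This is only an illustration of the convention, not the fastj program's own code; the function name parse_defline is hypothetical.

```python
import json


def parse_defline(line):
    """Split a FASTJ definition line into (id, metadata dict).

    Hypothetical helper for illustration; not part of the fastj tool.
    """
    if not line.startswith(">"):
        raise ValueError("definition line must start with '>'")
    # Everything before the first whitespace (minus '>') is the id;
    # everything after it is the description.
    parts = line[1:].rstrip("\n").split(None, 1)
    seq_id = parts[0] if parts else ""
    description = parts[1].strip() if len(parts) > 1 else ""
    # A missing description is treated as an empty JSON object.
    metadata = json.loads(description) if description else {}
    if not isinstance(metadata, dict):
        raise ValueError("FASTJ description must be a JSON object")
    return seq_id, metadata
```

With this sketch, parse_defline(">seqA") and parse_defline(">seqA {}") return the same value, which is the equivalence described above.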
That's all!
The command-line program fastj provides tools to encode and decode FASTJ files and to otherwise work with them.
Converts other formats, such as FASTA with delimited fields in the sequence id or FASTA + TSV/CSV, to FASTJ.
Input is read from the files named with the input flag, if any; otherwise from stdin.
Output is always FASTJ written to stdout.
Examples:
file.fasta
>flu|2017-05-04|specimenA
ATCG…
>flu|2017-05-13|specimenB
CGAT…
output
>specimenA {"date":"2017-05-04", "virus":"flu"}
ATCG…
>specimenB {"date":"2017-05-13", "virus":"flu"}
CGAT…
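To make the delimited-id conversion above concrete, here is a minimal Python sketch of the transformation. It assumes the caller supplies the field names in order and the delimiter; the function name encode_defline and its parameters are hypothetical and do not describe the fastj command's actual options or implementation.

```python
import json


def encode_defline(defline, fields, delimiter="|", id_field="id"):
    """Turn a delimited FASTA definition line into a FASTJ one.

    `fields` names each delimited value in order (an assumption of this
    sketch); the value named by `id_field` becomes the sequence id.
    """
    if not defline.startswith(">"):
        raise ValueError("definition line must start with '>'")
    values = defline[1:].strip().split(delimiter)
    if len(values) != len(fields):
        raise ValueError("number of values does not match number of fields")
    record = dict(zip(fields, values))
    seq_id = record.pop(id_field)
    # Sorted keys and these separators reproduce the layout used in the
    # examples; any single-line JSON object would be valid FASTJ.
    description = json.dumps(record, sort_keys=True, separators=(", ", ":"))
    return ">{} {}".format(seq_id, description)
```

For the first record above, encode_defline(">flu|2017-05-04|specimenA", ["virus", "date", "id"]) yields the definition line shown in the output.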
file.fasta
>specimenA
ATCG…
>specimenB
CGAT…
file.csv
id,virus,date
specimenA,flu,2017-05-04
specimenB,flu,2017-05-13
output
>specimenA {"date":"2017-05-04", "virus":"flu"}
ATCG…
>specimenB {"date":"2017-05-13", "virus":"flu"}
CGAT…
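The FASTA-plus-table case can be sketched the same way: read the table into a lookup keyed by id, then rewrite each definition line. This is an illustrative sketch, assuming the table has an id column and that records without a matching row get an empty object; load_metadata and annotate are hypothetical names, not part of fastj.

```python
import csv
import io
import json


def load_metadata(text, delimiter=","):
    """Read a delimited table into {id: {column: value}} (sketch)."""
    table = {}
    for row in csv.DictReader(io.StringIO(text), delimiter=delimiter):
        seq_id = row.pop("id")  # assumes a column named "id"
        table[seq_id] = row
    return table


def annotate(fasta_lines, metadata):
    """Yield FASTJ lines, attaching metadata to each definition line."""
    for raw in fasta_lines:
        line = raw.rstrip("\n")
        if line.startswith(">"):
            seq_id = line[1:].split(None, 1)[0]
            # Assumption: ids missing from the table get an empty object.
            fields = metadata.get(seq_id, {})
            description = json.dumps(fields, sort_keys=True, separators=(", ", ":"))
            yield ">{} {}".format(seq_id, description)
        else:
            yield line
```

Pass delimiter="\t" to load_metadata for a tab-separated table.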
file.json (output from fastj decode)
[{ "id": "specimenA", "sequence": "ATCG…", "date": "2017-05-04", "virus": "flu" }
,{ "id": "specimenB", "sequence": "CGAT…", "date": "2017-05-13", "virus": "flu" }
]
output
>specimenA {"date":"2017-05-04", "virus":"flu"}
ATCG…
>specimenB {"date":"2017-05-13", "virus":"flu"}
CGAT…
Converts FASTJ sequences to another format.
Input is from the listed files, if any, otherwise stdin.
Output defaults to JSON. Supported output formats are:
- json: The top-level value will always be an array, even if there is only one sequence record.
- fasta: Plain FASTA with delimited sequence ids constructed from the FASTJ fields.
Examples:
file.fastj for all examples
>specimenA {"date":"2017-05-04", "virus":"flu"}
ATCG…
>specimenB {"date":"2017-05-13", "virus":"flu"}
CGAT…
[{ "id": "specimenA", "sequence": "ATCG…", "date": "2017-05-04", "virus": "flu" }
,{ "id": "specimenB", "sequence": "CGAT…", "date": "2017-05-13", "virus": "flu" }
]
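As a rough illustration of the JSON output above, the following Python sketch reads FASTJ lines into a list of records with id and sequence keys alongside the metadata fields. It is not the fastj implementation; decode_fastj is a hypothetical name, and letting id and sequence override same-named metadata keys is an assumption of this sketch.

```python
import json


def decode_fastj(lines):
    """Parse FASTJ lines into a list of dicts (illustrative sketch)."""
    records = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split(None, 1)
            if not parts:
                raise ValueError("definition line has no id")
            # A missing description is treated as an empty JSON object.
            metadata = json.loads(parts[1]) if len(parts) > 1 else {}
            record = dict(metadata)
            # Assumption: id and sequence win over same-named metadata keys.
            record["id"] = parts[0]
            record["sequence"] = ""
            records.append(record)
        elif records:
            # Sequence data may span several lines; concatenate them.
            records[-1]["sequence"] += line
    return records
```

Because the function always returns a list, json.dumps(decode_fastj(lines)) produces a top-level array even for a single record.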
>flu|2017-05-04|specimenA
ATCG…
>flu|2017-05-13|specimenB
CGAT…
>flu/2017-05-04/specimenA
ATCG…
>flu/2017-05-13/specimenB
CGAT…
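The delimited FASTA outputs above differ only in the delimiter. A minimal Python sketch of building such an id from FASTJ fields follows. It assumes the caller chooses the field order and that the sequence id goes last, as in the examples; delimited_defline is a hypothetical name and says nothing about fastj's real options.

```python
def delimited_defline(seq_id, metadata, fields, delimiter="|"):
    """Build a plain FASTA definition line from FASTJ fields (sketch).

    `fields` lists the metadata keys to include, in order; the sequence
    id is appended last, mirroring the examples above.
    """
    values = [str(metadata[name]) for name in fields]
    values.append(seq_id)
    return ">" + delimiter.join(values)
```

Calling it with fields ["virus", "date"] and delimiter "|" or "/" reproduces the two definition-line styles shown.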
Tentative.
Tentative.