tsibley/fastj

FASTJ

Structured metadata for your FASTA sequences.

Format

The FASTJ format is a convention for embedding structured sequence metadata in a FASTA file while remaining compatible with most software that consumes FASTA.

In FASTJ, metadata is stored as a single-line JSON object in the description field of the typical FASTA >id description definition line. FASTA parsers treat everything after the first whitespace in the definition line as the description. (Everything before, minus the > prefix, is the id.)

A simple example of FASTJ:

>specimenA {"date":"2017-05-04", "virus":"flu"}
ATCG…
>specimenB {"date":"2017-05-13", "virus":"flu"}
CGAT…

A missing description should be treated as an empty JSON object. That is, the following two FASTJ sequence records are equivalent:

>seqA
ATCG
>seqA {}
ATCG

That's all!
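
The rules above can be sketched in a few lines of Python. This is only an illustration of the format, not the fastj program's own code; the function name parse_defline is hypothetical, and rejecting non-object JSON is an assumption of the sketch.

```python
import json

def parse_defline(line):
    """Split a FASTJ definition line into (id, metadata dict)."""
    if not line.startswith(">"):
        raise ValueError("definition lines start with '>'")
    # Everything before the first whitespace is the id; the rest is the description.
    parts = line[1:].rstrip("\n").split(None, 1)
    seq_id = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    # A missing description is equivalent to an empty JSON object.
    metadata = json.loads(description) if description.strip() else {}
    if not isinstance(metadata, dict):
        # Assumption: anything other than a JSON object is rejected.
        raise ValueError("FASTJ description must be a JSON object")
    return seq_id, metadata
```

With this sketch, parse_defline(">seqA") and parse_defline(">seqA {}") return the same value, matching the equivalence described above.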

Command

The command-line program fastj provides tools to encode and decode FASTJ files and to otherwise work with them.

fastj encode

Converts other formats, such as FASTA with delimited fields in the sequence id or FASTA + TSV/CSV, to FASTJ.

Input is read from the files named with the input flags (--fasta, --metadata, --json), if given; otherwise from stdin.

Output is always FASTJ written to stdout.

Examples:

fastj encode --fasta file.fasta --delimiter="|" --fields virus date id

file.fasta

>flu|2017-05-04|specimenA
ATCG…
>flu|2017-05-13|specimenB
CGAT…

output

>specimenA {"date":"2017-05-04", "virus":"flu"}
ATCG…
>specimenB {"date":"2017-05-13", "virus":"flu"}
CGAT…
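
For readers who want to see the transformation spelled out, here is a rough Python sketch of what this example does. It is not the fastj implementation: the function name encode_delimited is hypothetical, the default delimiter is an assumption, and the sketch assumes the whole definition line is the delimited id.

```python
import json

def encode_delimited(lines, fields, delimiter="|"):
    """Yield FASTJ lines from FASTA lines whose ids hold delimited fields."""
    for line in lines:
        line = line.rstrip("\n")
        if not line.startswith(">"):
            yield line  # sequence data passes through unchanged
            continue
        values = line[1:].split(delimiter)
        if len(values) != len(fields):
            raise ValueError(f"expected {len(fields)} fields in {line!r}")
        record = dict(zip(fields, values))
        seq_id = record.pop("id")  # the field named "id" becomes the FASTJ id
        yield f">{seq_id} {json.dumps(record, sort_keys=True)}"
```

The JSON written by json.dumps differs in whitespace from the output shown above but describes the same object.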

fastj encode --fasta file.fasta --metadata file.tsv

file.fasta

>specimenA
ATCG…
>specimenB
CGAT…

file.tsv

id	virus	date
specimenA	flu	2017-05-04
specimenB	flu	2017-05-13

output

>specimenA {"date":"2017-05-04", "virus":"flu"}
ATCG…
>specimenB {"date":"2017-05-13", "virus":"flu"}
CGAT…
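
The join between sequence ids and table rows could look roughly like the following Python sketch. It is illustrative only: encode_with_metadata is a hypothetical name, the column delimiter is passed as a parameter, and giving an empty object to ids that have no table row is an assumption.

```python
import csv
import json

def encode_with_metadata(fasta_lines, table_lines, delimiter="\t"):
    """Yield FASTJ lines by joining FASTA ids to rows of a delimited table."""
    metadata = {}
    for row in csv.DictReader(table_lines, delimiter=delimiter):
        seq_id = row.pop("id")  # assumes a column named "id", as in the example
        metadata[seq_id] = row
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            seq_id = line[1:].split(None, 1)[0]
            # Assumption: ids without a table row get an empty object.
            fields = metadata.get(seq_id, {})
            line = f">{seq_id} {json.dumps(fields, sort_keys=True)}"
        yield line
```

Both arguments can be open file objects or lists of lines; pass delimiter="," for CSV input.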

fastj encode --json file.json

file.json (output from fastj decode)

[{ "id": "specimenA", "sequence": "ATCG…", "date": "2017-05-04", "virus": "flu" }
,{ "id": "specimenB", "sequence": "CGAT…", "date": "2017-05-13", "virus": "flu" }
]

output

>specimenA {"date":"2017-05-04", "virus":"flu"}
ATCG…
>specimenB {"date":"2017-05-13", "virus":"flu"}
CGAT…
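
As a sketch of the same round trip in Python (not the fastj implementation; encode_json is a hypothetical name), each array element gives up its "id" and "sequence" keys and the remaining keys become the FASTJ metadata:

```python
import json

def encode_json(text):
    """Return FASTJ text for a JSON array of sequence records."""
    lines = []
    for record in json.loads(text):
        seq_id = record.pop("id")
        sequence = record.pop("sequence")
        # Whatever is left after removing id and sequence is the metadata.
        lines.append(f">{seq_id} {json.dumps(record, sort_keys=True)}")
        lines.append(sequence)
    return "".join(line + "\n" for line in lines)
```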

fastj decode

Converts FASTJ sequences to another format.

Input is from the listed files, if any, otherwise stdin.

Output defaults to JSON. Supported output formats are:

  • json: The top-level value will always be an array, even if there is only one sequence record.

  • fasta: Plain FASTA with delimited sequence ids constructed from the FASTJ fields.

Examples:

file.fastj for all examples

>specimenA {"date":"2017-05-04", "virus":"flu"}
ATCG…
>specimenB {"date":"2017-05-13", "virus":"flu"}
CGAT…

fastj decode [file.fastj [file2.fastj […]]]

[{ "id": "specimenA", "sequence": "ATCG…", "date": "2017-05-04", "virus": "flu" }
,{ "id": "specimenB", "sequence": "CGAT…", "date": "2017-05-13", "virus": "flu" }
]
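
A minimal Python sketch of this default JSON decoding follows. It is not the fastj implementation: decode_to_records is a hypothetical name, and concatenating wrapped sequence lines is an assumption based on common FASTA practice.

```python
import json

def decode_to_records(lines):
    """Return a list of dicts, one per FASTJ record."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split(None, 1)
            # A missing description counts as an empty JSON object.
            metadata = json.loads(parts[1]) if len(parts) > 1 else {}
            records.append({"id": parts[0], "sequence": "", **metadata})
        elif records:
            records[-1]["sequence"] += line  # join wrapped sequence lines
    return records
```

Passing the result to json.dumps always produces a top-level array, even for a single record, because the function returns a list.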

fastj decode --output=fasta --fields virus date id -- [file.fastj [file2.fastj […]]]

>flu|2017-05-04|specimenA
ATCG…
>flu|2017-05-13|specimenB
CGAT…

fastj decode --output=fasta --delimiter=/ --fields virus date id -- [file.fastj [file2.fastj […]]]

>flu/2017-05-04/specimenA
ATCG…
>flu/2017-05-13/specimenB
CGAT…
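
The two FASTA examples above differ only in the delimiter. A rough Python sketch of the id construction is shown below; it is not the fastj implementation, decode_to_fasta is a hypothetical name, and rendering missing fields as empty strings is an assumption.

```python
import json

def decode_to_fasta(lines, fields, delimiter="|"):
    """Yield plain FASTA lines with ids built from the named FASTJ fields."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            parts = line[1:].split(None, 1)
            record = json.loads(parts[1]) if len(parts) > 1 else {}
            record["id"] = parts[0]  # make the sequence id available as a field
            # Assumption: fields missing from the metadata become empty strings.
            line = ">" + delimiter.join(str(record.get(name, "")) for name in fields)
        yield line
```

The default of "|" mirrors the first example, where no --delimiter is given.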

fastj index

Tentative.

fastj search

Tentative.
