deepnog uses standard file formats, as detailed below for eggNOG 5 (1239, Firmicutes) data.
Protein sequences are expected in FASTA format.
Each entry must contain a unique record ID.
For example, a user_data.faa file should look like this:
    >1000569.HMPREF1040_0002
    MMKHDDHVHQIRTEPIYAILGETFSRGRTNRQVAKALLGAGVRIIQYREKEKSWQEKYEE
    ARDICQWCNEYGATFIMNDSIDLAIACEAPAIHVGQDDAPVAWVRRLAQRDIVVGVSTHT
    IAEMKKAVRDGADYVGLGPMYQTTSKMDVHDIVADVDKAYALTLPIPVVTIGGIDLIHIR
    QLYTEGFRSFAMISALVGATDIVEQIGAFRQVLQEKIDEC
    >1000569.HMPREF1040_0003
    MATTVGDIVTYLQGIAPLYLKEEWDNPGLLLGNQGDPVSSVLVTLDVMEGTVDYAIAEGI
    SFIFSHHPLIMKGIKAIRTDSYDGRMYQKLLSHHIAVYAAHTNLDSATGGVNDVLAEHLQ
    LQHVRPFIPGVSESLYKIAIYVPKGYGDAIREVLGKHDAGHLGAYSYCSFSVAGQGRFKP
    LAGTHPFIGKRDVLETVEEERIETIVEGSRLGEVITAMLAVHPYEEPAYDIYPLYQQRTA
    LGLGRLGELATPLSSMAAVQWVKEALHLTHVSYAGPMDRQIQTIAVLGGSGAEFIATAKA
    AGATLYVTGDMKYHAAQEAIKQGILVVDAGHFGTEFPVIDRMKQNIEAENEKQGWHIQCV
    VDPTAMDMIQRL
Compression is allowed (user_data.faa.gz or user_data.faa.xz).
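The uniqueness requirement can be checked before running deepnog. The following is a minimal sketch using only the Python standard library, not part of deepnog itself. The helper names `open_fasta` and `duplicate_record_ids` are hypothetical, and the sketch assumes that the record ID is the first whitespace-delimited token after `>`, which is the common FASTA convention.

```python
import gzip
import lzma
from collections import Counter
from pathlib import Path


def open_fasta(path):
    """Open a plain, gzip-compressed, or xz-compressed FASTA file as text."""
    suffix = Path(path).suffix
    if suffix == ".gz":
        return gzip.open(path, "rt")
    if suffix == ".xz":
        return lzma.open(path, "rt")
    return open(path, "rt")


def duplicate_record_ids(path):
    """Return the sorted record IDs that occur more than once in the file."""
    counts = Counter()
    with open_fasta(path) as handle:
        for line in handle:
            if line.startswith(">"):
                # Assumption: the ID is the first token of the header line.
                fields = line[1:].split()
                counts[fields[0] if fields else ""] += 1
    return sorted(record_id for record_id, n in counts.items() if n > 1)
```

An empty list returned by `duplicate_record_ids("user_data.faa.gz")` would indicate that every record ID in the file is unique.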
For typical usage of deepnog infer for protein orthologous group assignments, this is already sufficient.
Training new models with deepnog train, or assessing model quality with deepnog infer --test_labels, requires providing the orthologous group labels.
The label file is in CSV (comma-separated values) format, with a header line followed by rows of three columns (index, sequence record ID, orthologous group ID).
    ,string_id,eggnog_id
    1543720,1121929.KB898683_gene1916,1V3NB
    351865,536232.CLM_3459,1TPCN
    [...]
    1570381,1000569.HMPREF1040_0002,1V3ZR
    744166,1000569.HMPREF1040_0003,1TQ27
    [...]
    426023,1423743.JCM14108_56,1TPGE
To construct a user_data.csv:
- Copy (do not modify) the header line.
- Provide an index in the first column (e.g. 1..N; currently unused, but required).
- Provide the sequence ID (e.g. eggNOG/STRING ID) in column 2.
- Provide its corresponding group label in column 3.
- Sequence IDs in column 2 must match the IDs used in user_data.faa.
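The steps above can be sketched with Python's `csv` module. This is an illustration under stated assumptions, not deepnog's own tooling: the helper names `write_labels` and `unmatched_ids` are hypothetical, and the index is simply numbered 1..N as suggested in the list.

```python
import csv
import sys


def write_labels(handle, labels):
    """Write (sequence ID, group ID) pairs in the three-column label format."""
    writer = csv.writer(handle, lineterminator="\n")
    # Header copied unchanged from the example: the index column has no name.
    writer.writerow(["", "string_id", "eggnog_id"])
    for index, (sequence_id, group_id) in enumerate(labels, start=1):
        writer.writerow([index, sequence_id, group_id])


def unmatched_ids(labels, fasta_ids):
    """Return labelled sequence IDs that do not appear among the FASTA IDs."""
    known = set(fasta_ids)
    return [sequence_id for sequence_id, _ in labels if sequence_id not in known]


labels = [
    ("1000569.HMPREF1040_0002", "1V3ZR"),
    ("1000569.HMPREF1040_0003", "1TQ27"),
]
# Print to stdout here; pass an open file handle to create user_data.csv.
write_labels(sys.stdout, labels)
```

To write a file instead, open it with `open("user_data.csv", "w", newline="")` and pass the handle to `write_labels`. A non-empty result from `unmatched_ids` points at label rows whose IDs are missing from the FASTA file.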
Orthologous group assignments are output in tabular format (comma-separated).
- Column 1: Sequence ID
- Column 2: Assignment/Orthologous group
- Column 3: Assignment confidence in 0..1 (higher=better).
Example:
    sequence_id,prediction,confidence
    1000565.METUNv1_00038,COG0466,1.0
    1000565.METUNv1_00060,COG0500,0.20852506
    1000565.METUNv1_00091,COG0810,0.9999591
    1000565.METUNv1_00093,COG0659,1.0
    1000565.METUNv1_00103,COG5000,0.70716757
    1000565.METUNv1_00105,COG0346,0.9999982
    1000565.METUNv1_00106,COG3791,1.0
    1000565.METUNv1_00114,COG0239,1.0
    1000565.METUNv1_00115,COG1643,1.0
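Because the output is plain CSV with a header line, it can be post-processed with standard tools. The sketch below keeps only assignments at or above a confidence cutoff. The function name `confident_assignments` is hypothetical, and the default threshold of 0.9 is an arbitrary value chosen for illustration, not a recommendation from deepnog.

```python
import csv
import io


def confident_assignments(handle, min_confidence=0.9):
    """Return (sequence ID, group, confidence) rows at or above the cutoff."""
    reader = csv.DictReader(handle)
    return [
        (row["sequence_id"], row["prediction"], float(row["confidence"]))
        for row in reader
        if float(row["confidence"]) >= min_confidence
    ]


# A few rows from the example output, held in memory instead of a file.
example = io.StringIO(
    "sequence_id,prediction,confidence\n"
    "1000565.METUNv1_00038,COG0466,1.0\n"
    "1000565.METUNv1_00060,COG0500,0.20852506\n"
    "1000565.METUNv1_00103,COG5000,0.70716757\n"
)
for sequence_id, group, confidence in confident_assignments(example):
    print(sequence_id, group, confidence)
```

For a real output file, pass an open file handle, for example one opened with `open(path, newline="")`, in place of the in-memory example.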