[WIP] Annotation File Support #67

AbhinavChede · 2022-01-18T18:02:44Z

I have added functionality to BinaRena for reading annotation files. This is only a preliminary version. I think it gets the job done, but it may not be the most efficient approach in terms of run time. I am also not sure whether this is how you imagined the code should check if the current file is an annotation file. See here. I also need to find a better regex for the Greengenes file to account for exceptions in the order of the taxa. For now, it recognizes most of the taxa in the test cases.

Also, @pavia27, is this how you imagined the KEGG support should work? Right now, the code reads the KEGG annotation file and outputs which genes each contig has. It does not indicate the position of each gene, nor does it show the missing genes. I have only tested the KEGG support with a very small dataset because I lack proper test samples. Let me know what you think and whether there is something I should add.

qiyunzhu · 2022-01-23T18:48:01Z

Hello @AbhinavChede, this is wonderful work! I can see that you have carefully researched the formats of the example files and crafted the regexes deliberately.

I can only comment on the Greengenes file:

The first part will not always match /G\d{9}/g. It is just a contig identifier, which can be any arbitrary string (e.g., "ctg_01").
The second part is the lineage. It should always include p (phylum), c (class), o (order), f (family), g (genus), s (species). What's tricky is the first position. It may be k (kingdom) or d (domain), or both (d before k). Sometimes there is an additional last position: t (strain).
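The two points above can be sketched as a small parser instead of a single regex. This is only an illustration, not the code in this PR: the function name `parseLineageLine` and the constant `RANK_ORDER` are hypothetical, and it assumes each line is a contig ID, a tab, and a lineage whose entries are written Greengenes-style as `<code>__<name>` separated by semicolons. It also assumes at least one of d or k must be present.

```javascript
// Hypothetical helper for illustration; not BinaRena's actual code.
// Assumed line layout: "<contig id><TAB><lineage>", where the lineage is a
// semicolon-separated list of "<code>__<name>" entries.
const RANK_ORDER = ['d', 'k', 'p', 'c', 'o', 'f', 'g', 's', 't'];

function parseLineageLine(line) {
  const tab = line.indexOf('\t');
  if (tab === -1) return null;            // not a two-column line
  const contig = line.slice(0, tab).trim();
  if (!contig) return null;               // the ID may be any non-empty string

  const ranks = {};
  let last = -1;                          // position of the previous rank code
  for (const part of line.slice(tab + 1).split(';')) {
    const item = part.trim();
    if (!item) continue;                  // tolerate a trailing semicolon
    const m = /^([a-z])__(.*)$/.exec(item);
    if (!m) return null;                  // not a "<code>__<name>" entry
    const idx = RANK_ORDER.indexOf(m[1]);
    if (idx <= last) return null;         // unknown code, duplicate, or out of order
    last = idx;
    ranks[m[1]] = m[2];                   // the name may be empty (unassigned)
  }

  // p through s must all be present; d and k are interchangeable at the top
  // (assumption: at least one of them is required); t is optional.
  const required = ['p', 'c', 'o', 'f', 'g', 's'];
  const hasTop = 'd' in ranks || 'k' in ranks;
  if (!hasTop || !required.every((code) => code in ranks)) return null;
  return { contig, ranks };
}
```

A `null` return could double as the "this is not a lineage file" signal when sniffing the file type, for example by testing the first few non-empty lines. Relaxing the order check would be a one-line change if real files turn out to shuffle the ranks.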

Hello @pavia27, please work with Abhinav to sort out the KEGG file format. It may be helpful if we have a small sample set of files placed under examples/. I envision that there can be a subdirectory cami2/ (for example), containing the following files:

README.md: Instruction of operation.
assembly.fasta: SPAdes assembly file.
taxonomy.txt: Greengenes taxonomic lineage file.
annotation.tsv: Annotation file (BED/GFF? Prokka? Koala?) (I don't know the format though; I searched my inbox and didn't find the original file you shared. Please make sure Abhinav has this.)
checkm.tsv: CheckM output file (can be added after other functions are implemented).
etc.

qiyunzhu · 2022-01-27T23:45:04Z

@pavia27 any thoughts? Thanks!

AbhinavChede and others added 12 commits January 2, 2022 12:05

annotations

3be46f0

greengenes improved

60dabb8

in progress

b000de9

greengenes

2af83f2

kegg

70036af

Merge https://github.com/AbhinavChede/binarena into append

1b5548b

Merge branch 'qiyunlab:master' into append

97713e5

restored

3b3f69a

kegg

afe8021

kegg

095d3d7

small changes

33ad9d3

syntax

6ca5fdd

qiyunzhu assigned pavia27 Jan 23, 2022
