[WIP] Annotation File Support #67

Open
wants to merge 12 commits into
base: master

Conversation

AbhinavChede
Collaborator

Hi @qiyunzhu,

I have added functionality to BinaRena that reads annotation files. This is only a preliminary version. I think it gets the job done, but it may not be the most efficient approach in terms of run time. I am also not sure whether this is how you imagined the code should check if the current file is an annotation file. See here. I also need to find a better regex for the Greengenes file to account for exceptions in the order of the taxa. For now, it recognizes most of the taxa in the test cases.

Also, @pavia27, is this how you imagined the KEGG support should work? Right now, the code reads the KEGG annotation file and outputs which genes each contig has. It does not indicate where on the contig each gene is located, nor does it show which genes are missing. I have only tested the KEGG support with a very small dataset because I lack proper test samples. Let me know what you think and whether there is anything I should add.
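
To make the question concrete, here is a minimal sketch of the "which genes does each contig have" step. It is not the code in this PR, and the file format is an assumption: a two-column TSV (gene ID, then KO number, which may be empty), with gene IDs of the form `<contig>_<index>`. The function name `groupKosByContig` is hypothetical.

```javascript
/**
 * Group KO assignments by contig.
 * ASSUMED input: tab-separated lines "<gene ID>\t<KO>", where the KO may be
 * missing and the gene ID is "<contig>_<index>". The real KEGG annotation
 * format used by the project may differ.
 */
function groupKosByContig(text) {
  const byContig = new Map();
  for (const line of text.split(/\r?\n/)) {
    if (!line || line.startsWith('#')) continue; // skip blanks and comments
    const fields = line.split('\t');
    const gene = fields[0].trim();
    const ko = (fields[1] || '').trim();
    if (!ko) continue; // gene without a KO assignment
    const cut = gene.lastIndexOf('_');
    if (cut <= 0) continue; // gene ID does not follow the assumed pattern
    const contig = gene.slice(0, cut);
    if (!byContig.has(contig)) byContig.set(contig, new Set());
    byContig.get(contig).add(ko);
  }
  return byContig;
}
```

Reporting gene positions or missing genes would need more than this: positions require coordinates in the input, and "missing" requires a reference list of expected KOs.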

@qiyunzhu
Collaborator

Hello @AbhinavChede, this is wonderful work! I can see that you have carefully researched the formats of the example files and written the regexes deliberately.

I can only comment on the Greengenes file:

  • The first part, /G\d{9}/g, does not always hold. That field is just a contig identifier, which can be an arbitrary string (e.g., "ctg_01").
  • The second part is the lineage. It should always include p (phylum), c (class), o (order), f (family), g (genus), and s (species). The tricky part is the first position: it may be k (kingdom), d (domain), or both (d before k). Sometimes there is an additional last position, t (strain).
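
One way to handle both points without a single large regex is to split the lineage and key each taxon by its rank code, so that neither the identifier format nor the order of ranks matters. The sketch below is only an illustration, not the PR's implementation. It assumes a tab between the contig identifier and the lineage, semicolon-separated taxa, and the usual `x__Name` rank prefix; the function name `parseLineageLine` is hypothetical.

```javascript
// Rank codes from the description above.
const KNOWN_RANKS = ['d', 'k', 'p', 'c', 'o', 'f', 'g', 's', 't'];
const REQUIRED_RANKS = ['p', 'c', 'o', 'f', 'g', 's'];

/**
 * Parse one line "<contig ID>\t<lineage>" (ASSUMED layout).
 * Returns { id, taxa } or null if the line is not a Greengenes-style lineage.
 */
function parseLineageLine(line) {
  const tab = line.indexOf('\t');
  if (tab === -1) return null;
  const id = line.slice(0, tab).trim(); // any arbitrary string, e.g. "ctg_01"
  if (!id) return null;

  const parts = line.slice(tab + 1).split(';')
    .map((s) => s.trim())
    .filter(Boolean); // tolerate a trailing semicolon

  const taxa = {};
  for (const part of parts) {
    const m = part.match(/^([a-z])__(.*)$/);
    if (!m || !KNOWN_RANKS.includes(m[1])) return null; // not a rank token
    if (m[1] in taxa) return null; // same rank given twice
    taxa[m[1]] = m[2].trim(); // may be empty, e.g. "s__"
  }
  // p through s must all be present; d, k and t are optional.
  if (!REQUIRED_RANKS.every((r) => r in taxa)) return null;
  return { id, taxa };
}
```

Because taxa are looked up by rank code, a file that lists ranks in an unusual order is still accepted. If strict ordering is wanted, a check against the index in `KNOWN_RANKS` could be added inside the loop.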

Hello @pavia27, please work with Abhinav to sort out the KEGG file format. It may be helpful if we have a small sample set of files placed under examples/. I envision a subdirectory such as cami2/ containing the following files:

  • README.md: Instructions for use.
  • assembly.fasta: SPAdes assembly file.
  • taxonomy.txt: Greengenes taxonomic lineage file.
  • annotation.tsv: Annotation file (BED/GFF? Prokka? Koala?). I don't know the format; I searched my inbox and couldn't find the original file you shared. Please make sure Abhinav has this.
  • checkm.tsv: CheckM output file (can be added after other functions are implemented).
  • etc.

@qiyunzhu
Collaborator

@pavia27 any thoughts? Thanks!
