Header parsing using blast format 6 with salltitles #60

Open
jjkoehorst opened this issue Mar 24, 2021 · 1 comment
Comments

@jjkoehorst
Contributor

jjkoehorst commented Mar 24, 2021

When using the standardised diamond analysis counter I noticed that a FASTA file of the input database was required.
Since this was a rather large database I was wondering why this was needed.

I realised that the Python script parses the FASTA headers to look up the function and species for each accession number.

DIAMOND can include the FASTA headers as an additional column in its tabular output. Only a few modifications are needed to use the DIAMOND result file instead of parsing the database FASTA file (unless the FASTA file is also used for other statistics?).

  • When executing DIAMOND, make sure to include --salltitles
  • When converting the DIAMOND DAA file to TSV, use --outfmt 6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore stitle to get the default columns plus the subject title
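
As a rough illustration of what such a file looks like to the script, the sketch below splits one line of a 13-column result and pulls out the subject ID and title. The column order is taken from the --outfmt string above; the function name parse_hit and the sample line are made up for illustration and do not come from the script or from real DIAMOND output.

```python
# Column names in the order given to --outfmt 6 above.
FIELDS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore", "stitle",
]


def parse_hit(line):
    """Split one tab-separated result line into a dict keyed by column name."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(FIELDS):
        raise ValueError(
            "expected %d columns, got %d" % (len(FIELDS), len(values))
        )
    return dict(zip(FIELDS, values))


# Hypothetical example line (not real DIAMOND output).
sample = "\t".join([
    "read_1", "WP_000000001.1", "98.0", "50", "1", "0",
    "1", "150", "10", "59", "1e-20", "100.0",
    "WP_000000001.1 hypothetical protein [Escherichia coli]",
])

hit = parse_hit(sample)
# The accession is assumed to be the first whitespace-separated token of the title.
db_id = hit["stitle"].split()[0]
```

Because the title is the last column, taking split("\t")[-1] on a line gives the same string that the script previously read from the FASTA header.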

Some minor code modifications are then needed; see https://pastebin.com/5eq6GPaB for the whole script.

First reuse the input tsv file

# Reusing db code for parsing diamond file with additional salltitles column
db = open (infile_name, "r")

Change the split
line = line.strip().split("\t")[-1]

Drop the trailing [1:] slice
db_id = str(splitline[0].split()[0]) # [1:] # Not needed anymore

In addition I added a check for multispecies as it is currently counted as an error:

			if "MULTISPECIES" not in line:
				db_error_counter += 1

I have not yet removed the database argument and the related code, as I am not sure whether this method is preferred.

I can submit a cleaned up version later if needed.

@transcript
Owner

Hi Jasper, I was not aware that DIAMOND is able to pull the headers out and include those in the outfile! I definitely want to explore this a bit more (and apologies for the delayed response to this).

If you could submit a cleaned up version, that would be great - I'll probably have to do some testing on my own but a cleaned up version could really be useful for making sure that I can make this update (and credit you, of course!).

Best,
Sam
