error standardized_DIAMOND_analysis_counter.py #77

arianlundberg · 2022-12-18T04:41:44Z

Hello, I am having issues with the standardized_DIAMOND_analysis_counter.py script.
I am getting an IndexError at line 138.

I've modified your master_script_for_sample_files.bash file and the error comes after STEP4 is DONE

command from your script:

STEP 5: AGGREGATING WITH ANALYSIS_COUNTER

for file in $starting_files_location/step_4_output/*RefSeq_annotated*
do
	python $python_programs/standardized_DIAMOND_analysis_counter.py -I $file -D $RefSeq_db -O
	python $python_programs/standardized_DIAMOND_analysis_counter.py -I $file -D $RefSeq_db -F
done

error:

Now reading through the m8 results infile.

Analysis of /projects/bact.fun.unmapped.RefSeq_annotated complete.
Number of total lines: 574668
Number of unique sequences: 574668
Time elapsed: 1.8101940155 seconds.

Then the "Starting database analysis now." message appears, and processing continues until:

198M lines processed so far in 2025.08801007 seconds.

Then I get this error:

Traceback (most recent call last):
File "/projects/tools/samsa2/python_scripts/standardized_DIAMOND_analysis_counter.py", line 138, in <module>
db_entry = db_entry[1][:-1]
IndexError: list index out of range

Here is a snapshot of your script from lines 127 to 138:

for line in db:
	if line.startswith(">") == True:
		db_line_counter += 1
		splitline = line.split("[",1)

		# ID, the hit returned in DIAMOND results
		db_id = str(splitline[0].split()[0])[1:]

		# name and functional description
		db_entry = line.split("[", 1)
		db_entry = db_entry[0].split(" ", 1)
		db_entry = db_entry[1][:-1]

Thanks in advance


transcript · 2022-12-19T21:56:41Z

Hi Arian,

Sorry about that, let's see if we can figure out an answer!

A couple things to check:

First, are you using the standard bacterial RefSeq database that is available for distribution with SAMSA2, or is this a custom-built database?

Second, it looks from this error like there's at least one entry where there's an ID but no description. Could you try re-running with the following code added to replace the existing line 138 of the standardized_DIAMOND_analysis_counter.py script?

try:
	db_entry = db_entry[1][:-1]
except IndexError:
	print(line)
	print(db_entry)
	print("this occurs at line: " + str(db_line_counter))

The idea here is to see the offending line, as well as where in your reference database this line occurs (for easy fixing, if there's an issue with it). This would require you to re-run the script, although you don't need to re-run the whole master_script; you could just run the command:

python $python_programs/standardized_DIAMOND_analysis_counter.py -I $file -D $RefSeq_db -O

(You'll need to correct the paths and replace the variables with your output file and the RefSeq_db path.)

Essentially, this looks to be an issue with the reference database; happy to help identify where it's occurring and what, specifically, is causing it.
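As a complement to the diagnostic snippet above, here is a minimal sketch of a standalone check that lists every header with an ID but a missing or empty description, so the database can be inspected without waiting for the full analysis to fail. The function name and the sample headers are made up for illustration; the splitting mirrors what the snapshot of the script does (cut at the first "[", then at the first space).

```python
def find_headers_without_description(lines):
    """Yield (header_number, header) for FASTA headers lacking a description."""
    header_number = 0
    for line in lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        header_number += 1
        # Drop anything from the first "[" onward, then separate the ID
        # from the description at the first space.
        before_bracket = line.split("[", 1)[0]
        parts = before_bracket.split(" ", 1)
        if len(parts) < 2 or not parts[1].strip():
            yield header_number, line.rstrip("\n")


# Hypothetical sample data; in practice, pass an open file handle instead.
sample = [
    ">seq1 hypothetical protein [Escherichia coli]\n",
    "MKT\n",
    ">seq2\n",
]
print(list(find_headers_without_description(sample)))
# prints [(2, '>seq2')]
```

To run it on a real database, something like `with open(path_to_db) as db:` followed by a loop over `find_headers_without_description(db)` should work, where `path_to_db` is whatever FASTA file was used to build the DIAMOND database.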

arianlundberg · 2022-12-20T17:54:18Z

Hi Sam!
Thanks for your quick response. I appreciate it.
I actually use a custom-built database containing information on three different organisms. While working with the data, I noticed that the Fungi dataset (which is included in the combined database) only had IDs and no descriptions. I've taken care of this by downloading new data from Ensembl and creating a new database for DIAMOND. It's currently running, but I'll let you know if I run into any issues.

By the way, I mentioned earlier that I downloaded Fungi protein sequences from Ensembl. Do you have any suggestions for other good sources or databases to use? I used NCBI RefSeq as well, but I believe the sequence files were corrupted or malformed, because the sequences had numbers or "_" in them, which resulted in an error with DIAMOND as well.

Thank you again for your help.
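For the sequences with digits or "_" mentioned above, a quick scan can show which lines are affected before rebuilding the database. This is only a sketch: the allowed alphabet below (the IUPAC amino-acid letters, which cover A to Z, plus "*" for a stop) is an assumption and not taken from the DIAMOND documentation, and the function name and sample records are hypothetical.

```python
# Assumption: IUPAC amino-acid codes use every letter A-Z; "*" marks a stop.
ALLOWED = set("ABCDEFGHIJKLMNOPQRSTUVWXYZ*")


def find_bad_sequence_lines(lines):
    """Return (line_number, unexpected_characters) for each sequence line
    that contains characters outside ALLOWED."""
    problems = []
    for line_number, line in enumerate(lines, start=1):
        if line.startswith(">"):
            continue  # headers may contain any characters
        unexpected = set(line.strip().upper()) - ALLOWED
        if unexpected:
            problems.append((line_number, "".join(sorted(unexpected))))
    return problems


# Hypothetical sample data; in practice, pass an open file handle instead.
sample = [
    ">seq1 example protein\n",
    "MKTAYIAKQR\n",
    ">seq2 example protein\n",
    "MKT_AY12IAK\n",
]
print(find_bad_sequence_lines(sample))
# prints [(4, '12_')]
```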

arianlundberg · 2022-12-22T21:49:07Z

Hello again Sam,

After replacing my old Fungi database with the one I downloaded from Ensembl, I now get a "list index out of range" error at line 159. :/

Traceback (most recent call last):
File "/projects/tools/samsa2/python_scripts/standardized_DIAMOND_analysis_counter.py", line 159, in <module>
    db_org = split_db_org[1] + " " + split_db_org[2]
IndexError: list index out of range

Here is the Python script from lines 144 to 159:

			db_org = splitline[line.count("[")].strip()[:-1]
			if db_org[0].isdigit():
				split_db_org = db_org.split()
				try:
					if split_db_org[1] == "sp.":
						db_org = split_db_org[0] + " " + split_db_org[1] + " " + split_db_org[2]
					else:
						db_org = split_db_org[1] + " " + split_db_org[2]
				except IndexError:
					try:
						db_org = split_db_org[1]
					except IndexError:
						db_org = splitline[line.count("[")-1]
						if db_org[0].isdigit():
							split_db_org = db_org.split()
							db_org = split_db_org[1] + " " + split_db_org[2]

I think the Fungi dataset causes this issue; perhaps the annotations in the FASTA files are not compatible with your script. Do you know where I could find a Fungi dataset compatible with DIAMOND and your script? Thanks.

BTW, here is an example of a header/description from the Fungi data that causes the error (I think?):

>KGQ13519 pep supercontig:BBA1.0:contig00047:97669:98940:-1 gene:BBAD15_g702 transcript:KGQ13519 gene_biotype:protein_coding transcript_biotype:protein_coding description:tRNA-(ms[2]io[6]A)-hydroxylase
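The header above has no trailing "[organism]" tag, and its description itself contains brackets ("ms[2]io[6]A"), so splitting on "[" likely treats fragments of the description as an organism name. Below is a minimal sketch of a more tolerant header parser, not the script's actual code: it only accepts an organism tag at the very end of the header and returns empty strings for missing parts instead of raising IndexError. The function name, the regular expression, and the "seq1" header are assumptions made for illustration.

```python
import re

# A RefSeq-style organism tag at the very end of the header, e.g. "[Escherichia coli]".
TRAILING_ORGANISM = re.compile(r"\[([^\[\]]+)\]\s*$")


def parse_header(line):
    """Split a FASTA header into (id, description, organism).

    Missing parts are returned as empty strings.
    """
    header = line.strip()
    if header.startswith(">"):
        header = header[1:]
    organism = ""
    match = TRAILING_ORGANISM.search(header)
    if match:
        organism = match.group(1).strip()
        header = header[:match.start()]
    parts = header.split(None, 1)  # ID, then everything after the first whitespace
    db_id = parts[0] if parts else ""
    description = parts[1].strip() if len(parts) > 1 else ""
    return db_id, description, organism


print(parse_header(">seq1 hypothetical protein [Escherichia coli]"))
# prints ('seq1', 'hypothetical protein', 'Escherichia coli')

ensembl = (">KGQ13519 pep supercontig:BBA1.0:contig00047:97669:98940:-1 "
           "gene:BBAD15_g702 transcript:KGQ13519 gene_biotype:protein_coding "
           "transcript_biotype:protein_coding description:tRNA-(ms[2]io[6]A)-hydroxylase")
db_id, description, organism = parse_header(ensembl)
print(db_id, repr(organism))
# prints KGQ13519 ''
```

With this approach, Ensembl-style headers end up with an empty organism field; whether that is acceptable depends on how the downstream organism counts are used.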

JSSaini · 2023-04-11T19:08:49Z

Any solution for this?

transcript added bug database labels Dec 19, 2022

transcript self-assigned this Dec 19, 2022
