WSOD for large XML file #12

Open
laceysanderson opened this issue Oct 31, 2016 · 8 comments
Labels:
bug - confirmed (for issues where a maintainer has confirmed a bug exists)
Maintainer - Previous Version (any issues that were filed on other versions of this module that we are not sure apply anymore)

Comments

@laceysanderson
Member

There is a known problem: if a user has a large BLAST result XML file (the exact size depends on your web server), they might end up with a memory-exhausted WSOD (white screen of death). This happens because the XML reader we use reads the entire file into memory.

We have mostly mitigated this issue by checking the number of hits using grep and then only reading the XML if it is below a static threshold (currently 500). However, there appear to be some edge cases in which this still happens and, of course, if your particular server has issues with even 500 hits, then this will still be a problem.
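The check described above can be sketched as follows. This is only an illustration in Python, not the module's PHP code: `count_tag` and `safe_to_parse` are hypothetical names, and the sketch assumes the standard BLAST XML layout in which every hit opens with a `<Hit>` tag and every HSP with an `<Hsp>` tag. It reads the file one line at a time, so the count itself never loads the whole file.

```python
MAX_HITS = 500  # the static threshold mentioned above


def count_tag(xml_path, tag):
    """Count opening tags such as <Hit>, reading one line at a time."""
    needle = f"<{tag}>"  # the closing '>' keeps <Hit> from matching <Hit_num>
    count = 0
    with open(xml_path, "r", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            count += line.count(needle)
    return count


def safe_to_parse(xml_path, max_hits=MAX_HITS):
    """Return True only when the hit count is below the threshold."""
    return count_tag(xml_path, "Hit") < max_hits
```

Calling `count_tag(path, "Hsp")` with the same helper gives the HSP total, which may be the more telling number when a few hits carry very many HSPs.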

Furthermore, since we simply don't read the XML, we can't even show you a subset of hits or a summary, which would be more useful for large result sets than simply forcing the user to download the TSV and go from there.

In the future I would like to move to a stream-based XML reader (http://php.net/manual/en/class.xmlreader.php) to completely remove this issue.
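To make the streaming idea concrete, here is a rough sketch in Python rather than PHP. It uses `xml.etree.ElementTree.iterparse` as a stand-in for PHP's XMLReader, and the function name `iter_hits` and its two limits are hypothetical. It assumes the usual BLAST XML element names (`Hit`, `Hit_id`, `Hit_def`, `Hsp`, `Hsp_evalue`). Each HSP is summarized and cleared as soon as its closing tag is seen, so a hit with a great many HSPs should not build up in memory.

```python
import xml.etree.ElementTree as ET


def iter_hits(source, max_hits=None, max_hsps_per_hit=None):
    """Yield one small summary dict per <Hit> while streaming the file."""
    hsp_total = 0   # HSPs seen so far in the current hit
    evalues = []    # e-values kept for the current hit (may be capped)
    yielded = 0
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "Hsp":
            hsp_total += 1
            if max_hsps_per_hit is None or len(evalues) < max_hsps_per_hit:
                evalues.append(elem.findtext("Hsp_evalue"))
            elem.clear()  # drop the HSP's children and text right away
        elif elem.tag == "Hit":
            yield {
                "id": elem.findtext("Hit_id"),
                "def": elem.findtext("Hit_def"),
                "hsp_count": hsp_total,
                "evalues": evalues,
            }
            hsp_total, evalues = 0, []
            elem.clear()  # only an empty <Hit> shell stays in the tree
            yielded += 1
            if max_hits is not None and yielded >= max_hits:
                return  # stop reading; the rest of the file is never parsed
```

With something like this, a page could show the first N hits plus a note that more are available in the downloadable files, instead of showing nothing at all.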

laceysanderson added the "bug - confirmed" label Oct 31, 2016
@bradfordcondon
Member

bradfordcondon commented Feb 27, 2018

Would this result in a (HTTP error code) 500 error?

EDIT (googles white screen of death.... yep, OK, that's our problem)

The user I'm troubleshooting with will get an HTTP 500 error and not be able to display the job page if he sets the max hit limit to 250. (Set it to 500 and your failsafe kicks in.)

Setting to 100, and the page loads.

We can check the logs and see that the blast job ran successfully in all cases.

So it sounds like for HWG implementing your module, 250 is somewhere between your fix and what our server can manage. I can change the code on our local branch to cut off at 101 instead of at 500.

@ekcannon
Collaborator

Possible solutions:

  1. As suggested, use a streaming XML parser like XMLReader. But this requires a PHP package that isn't typically included in a standard PHP installation. The number of results displayed will still need to be limited, and it isn't simply a matter of limiting the number of hits, as a single hit can contain many HSPs.

  2. Limit the number of hits when the XML is created from the ASN result file but save the full results in the other file formats (HTML, tab, GFF). This can be done with a parameter to blast_formatter. But this only limits the hits; there can still be lots and lots of HSPs for each hit.

  3. Since HSPs can be limited by the BLAST command, add an HSP limit to the configuration options. But this prevents people from searching with repetitive sequence, for example, to look at the distribution of a particular repeat sequence.

  4. Detect if there are too many results after running BLAST and converting ASN into HTML, tab, and GFF outputs. If too many, run BLAST again with HSPs limited to: (# hits)/(total # HSPs) and generate the XML from those results. But this greatly slows down the job.

Reactions? Any other ideas?

@laceysanderson
Member Author

Hmmmmm.... my gut reaction is that #4 is the safest but I worry since BLAST is a heuristic we would end up with results shown on the page that are not in the downloadable files :-(

@ekcannon
Collaborator

I have an operational proof-of-concept for option 4, but the concern about getting different results from the second BLAST execution is very valid ... and quite likely to happen. Unfortunately, blast_formatter won't permit limiting HSPs like the BLAST programs do.

@laceysanderson
Member Author

What about if we do a combination of #2 & #3?
A. create an advanced option for the blast form that limits HSPs and set its default to something reasonable. That way if they need to do repeat analysis they can change it but most people should not.
B. When creating the XML from the ASN we limit the number of hits.

A takes care of the HSPs and B takes care of the hits, keeping us in a safe zone and ensuring we have the same results on the page as in the files. There is still a little bit of rope for people to hang themselves, but it is MUCH better than where we are currently at. Thoughts?
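A rough sketch of how A and B might fit together, written in Python purely for illustration. The helper names and file paths are hypothetical, and the flags (`-max_hsps` on the search program, `-max_target_seqs` on blast_formatter, `-outfmt 11` for the ASN archive, `-outfmt 5` for XML) are my understanding of the BLAST+ command line, so please check them against your installed version. The functions only build argument lists; nothing is executed here.

```python
def blast_command(program, query, db, archive_out, max_hsps=None):
    """A: search once, saving a BLAST archive, optionally capping HSPs per hit."""
    cmd = [program, "-query", query, "-db", db,
           "-outfmt", "11", "-out", archive_out]
    if max_hsps is not None:
        cmd += ["-max_hsps", str(max_hsps)]
    return cmd


def formatter_command(archive, outfmt, out_path, max_hits=None):
    """B: convert the archive to another format, optionally capping hits."""
    cmd = ["blast_formatter", "-archive", archive,
           "-outfmt", str(outfmt), "-out", out_path]
    if max_hits is not None:
        cmd += ["-max_target_seqs", str(max_hits)]
    return cmd


# Hypothetical job: the HSP cap is applied in the single search (A), and the
# hit cap is applied only when the XML for the page is generated (B).
search = blast_command("blastn", "query.fasta", "genome_db", "job.asn", max_hsps=10)
page_xml = formatter_command("job.asn", 5, "job.xml", max_hits=500)
full_tsv = formatter_command("job.asn", 7, "job.tsv")
```

Because every output is formatted from the same archive, the hits shown on the page should be a subset of what is in the downloadable files, which avoids the second-search concern raised for option 4.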

ekcannon self-assigned this Mar 16, 2018
@ekcannon
Collaborator

Hmm. I'll play with this idea for a bit.

@laceysanderson
Member Author

laceysanderson commented Mar 16, 2018

Ok, thanks for taking this one on and Yay for coffee and a fresh morning ;-)

laceysanderson added the "Maintainer - Previous Version" label Mar 30, 2024
@smriti-135

smriti-135 commented Apr 23, 2024

Hello. I tried BLAST recently with a huge file and am getting the same problem. A sample is given below.
Query:

CM030740.1 Citrus sinensis isolate HZAU_DHSO_2021 chromosome 1, whole genome shotgun sequence
AATTTTACCATTAAACATTTAGTTAATTGGAATATGAAGTTTAGGACCGCCAACTTCATATTGTAATGCT
CTAGTTAATAATTAAAATTTTATTAAGTTAGTGATTGTAGATTTCATGTTAACATTAGAAGGGTGGAATT
ATTTTAGGTTCAATAATAATGGTTTCATGTTATAGTCAATGATTTTAGGTTAAACATTTAGAATATTTTT
AGTTAATGGTCTGTAAATATACGGATGTGATGTTTCGGGTTCAGTGTCGTACTCAAAGTAAGACTGTAAT
ATACTTTGTTCTCTAGTATATAGTGATTATACATATATCAATGCATACTAAGTTAAACTCTATTAAAAAA
TGCATGGAACAAGCTCGTGGATGGTGCAAGGTTGTATTGCATTAAATTACTTTTCATGTTACATAAATAT
AATATTGTAGTCGACGTTATTATCCTATATTATAATAACTAATAAAAATGATTATTCAGTTTCTTATAAT
ATAATAAGCTTAAACTATATAACAACATATTGTAATATTTAATTATTTATTAATGGTTTAGTATTACGTT
AAAACATTTCGCAAGCCAATAGCCTAATATAACTATTTGTAGGTTTAATTAAAAAATGAATAACAAACCC
CGCAAATATTAATGCATCTAAAATAGTATGAATTTAATTTGAACGGTATTATATAAACTATAAATATAGC
ATAAAATTTAATTTGTTAAATAATACAGTTCATAATTATTCGAAAACAATGATATATCATAATACAATTC
AAGATAATGCAATATAGCCAAAAAAATTATGTTAATAGTCCACAGTGCTTACAATTGAATATGTTGTAAT
GTTAATTGTAACTGTCTCATTTTTATTTTTTTAATGATTAAATACAATAACATAAAATTAATATATTTAA
AAATGGATATTATATTTCAATAATGATTTCAAATCTATTTGTTACTATGAAAATAATTAATGGACTATGA
ATTGTAGGAATTGAATTACATAAAATATAGTTACTCTAATTACTATCTTATTCTAAAAAACAAAATTAGA
ATTATTTTTCTTTAAAAAATGCTATAACTTTAAAATCCAGCTTATTTGCCAAAATCTTCCCATTTTTTGG
TTCGCCCTCCACAAAAGGGCTTCAAGCCTGTTGATTAGGTTGCCCAACATTCTAAACAAAACAATTCAGC
CGTTGTTCCATCCATTATCACTTCTGGACATATTAAATATCATCCATTAGATAAGATTTTAACATATTCA
GTGGTTGAGATTTTGTTGACTGTAAGTTACATTTAACGTAACTTAGGGATTGCCTACAAAAATGAAGCTA

Target:

scaffold00001 length=5927163
TTTTGTATTCTATGTCCTCTGATCTTTATACTTCTTCATTTTGTCTTTGCAAGAACCGGA
ATTATGGGTACATCACAAATTCTCTAGGTGTGACTTGTGTTGTGGGGCCTTTTTTTtACA
TTTCCATATTGCAAGTATTTTTTTGCTACCATTGGTATATTTGTCTGTTAAAATCAATCT
GCTTTCACTTATGTTCGTGCGTTCTTGTTCCCTCGCCTTGCAATTGCATATCTCAAATTA
TCTTTCTTACTTTGATTTAGATGGCCAAGGTTTTAAGCTAACTTTTTACAATGCCAATTT
TTAAATGGTTTTCTAATGCTGTTCAAAGTTGCAGCCTTTACTTCGTATATTTGTCAGGTT
CTGACGGGTGCGGTCGGCGGCGGGGGCTATAGCATGCGGTCTCGAGAGCCGCAAAGAAAA
ATGGGTGGTTTTCCCGGTTTCGGCCATAACTCGTGATCGGGGCCTCCGATTCTGGTTCCG
TTTCGTCCCACGGGACCAGCCGGGCGGGGGCATCGGATTGCAAAAGTCTTTAAATTTGAA
TTTGATTTAAGTTTATATAGTTTGAACACAAAAACTAGCCATTACGGACAAAAACAACAA
ATAGTCGGCTAGCCTATTAATTAGCCAGATCGCCTCTTAATACAGTGCAAGTTACCGTTG
CAATTTGAATTTTGCTGCAGTGATGCTATAGTAACACTATTTTTtAAAATTTCATTGTTA
CCTAAAACTTTTTTATAATTTGACTATGACCCAAAATGTCATAAAATTTTGCAAATATAT
CAAATTTCAGAATTTCTAAATAATGCGCGTTATTCTTAAAACTTTTTGAAATTATGCTAT
GGCCTAAAACTTTATAATATTTTTCAAAGAGATTCTTCTCAGAATTTTAACATAATGCTT
ATTTATTTCAAAGTTCCCAAAATCTTTTTCAGTTTAATCCAAACTTTGAAAAACACTCAA
ATCCTCAAAATACTCGTCTTATAAATATAAAATCTTTTTGTTTATAAAAAGTAATGATTT
ATTAAATAAAATCTTGAGCTTTTTCAATGCTAAACTATACATATATCAAATCATACTGGC
TTTATAAGAATTTGTTGCAATAATGACTCCGCAGAGCTAAACTTTGCTCTTGATCAAGCC

I am either getting a memory error such as

Fatal error: Allowed memory size of 2147483648 bytes exhausted (tried to allocate 324671885 bytes) in /var/www/html/teak-wood-genes-drupal7-test/includes/database/database.inc on line 2284
Fatal error: Allowed memory size of 2147483648 bytes exhausted (tried to allocate 23072768 bytes) in Unknown on line 0
Warning: Unknown: Cannot call session save handler in a recursive manner in Unknown on line 0
Fatal error: Allowed memory size of 2147483648 bytes exhausted (tried to allocate 98570240 bytes) in /var/www/html/teak-wood-genes-drupal7-test/includes/bootstrap.inc on line 3876

or a "gatekeeping timeout" error. But the files are getting generated and stored...
I was asked to contact the developers/maintainers to see what can be done. I am using Tripal v3.100.

Edit 1: I tried BLAST on the BLAST website using the same files and got this error:
"Length limit exceeded. Please reduce your query/subject sequence length to 10,000,000 letters or less."
So is my file too large? I still need to use that file, so how do I configure things to handle it?

Edit 2: When I tried BLAST in the web UI with a very small query, I got some results but also many more errors (hundreds, which are basically repetitions of the same few):

Deprecated function: imagefilledpolygon(): Using the $num_points parameter is deprecated in generate_blast_hit_image() (line 654 of /var/www/html/teak-wood-genes-drupal7-test/sites/all/modules/tripal_blast/api/blast_ui.api.inc).
Deprecated function: imagefilledpolygon(): Using the $num_points parameter is deprecated in generate_blast_hit_image() (line 654 of /var/www/html/teak-wood-genes-drupal7-test/sites/all/modules/tripal_blast/api/blast_ui.api.inc).
Deprecated function: Implicit conversion from float 299.4699930400503 to int loses precision in generate_blast_hit_image() (line 671 of /var/www/html/teak-wood-genes-drupal7-test/sites/all/modules/tripal_blast/api/blast_ui.api.inc).
Deprecated function: Implicit conversion from float 50.219298245614034 to int loses precision in generate_blast_hit_image() (line 672 of /var/www/html/teak-wood-genes-drupal7-test/sites/all/modules/tripal_blast/api/blast_ui.api.inc).
Deprecated function: Implicit conversion from float 50.219298245614034 to int loses precision in generate_blast_hit_image() (line 696 of /var/www/html/teak-wood-genes-drupal7-test/sites/all/modules/tripal_blast/api/blast_ui.api.inc).
Warning: A non-numeric value encountered in include() (line 92 of /var/www/html/teak-wood-genes-drupal7-test/sites/all/modules/tripal_blast/theme/blast_report_alignment_row.tpl.php).
Warning: A non-numeric value encountered in include() (line 106 of /var/www/html/teak-wood-genes-drupal7-test/sites/all/modules/tripal_blast/theme/blast_report_alignment_row.tpl.php).
Warning: A non-numeric value encountered in include() (line 111 of /var/www/html/teak-wood-genes-drupal7-test/sites/all/modules/tripal_blast/theme/blast_report_alignment_row.tpl.php).


4 participants