
Analyses fail randomly due to IndexError #2993

Open
hackdna opened this issue Sep 10, 2018 · 9 comments · May be fixed by #3497


hackdna commented Sep 10, 2018

  • Specific code commit: 137a461
  • Version of the web browser and OS: any
  • Environment where the error occurred: Vagrant

Steps to reproduce

Run an analysis using a test workflow

Observed behavior

Analysis completes successfully in Galaxy but fails randomly in Refinery with the following error message in the Celery log:

2018-09-07 11:26:27,619 ERROR    celery.worker.job:282 log() - Task analysis_manager.tasks.run_analysis[75d2a63a-c8b0-4bd3-a710-e332bacba1fd] raised unexpected: IndexError('list index out of range',)
Traceback (most recent call last):
  File "/home/vagrant/.virtualenvs/refinery-platform/local/lib/python2.7/site-packages/celery/app/trace.py", line 240, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/home/vagrant/.virtualenvs/refinery-platform/local/lib/python2.7/site-packages/celery/app/trace.py", line 438, in __protected_call__
    return self.run(*args, **kwargs)
  File "/vagrant/refinery/analysis_manager/tasks.py", line 302, in run_analysis
    _attach_workflow_outputs(analysis_uuid)
  File "/vagrant/refinery/analysis_manager/tasks.py", line 149, in _attach_workflow_outputs
    analysis.attach_derived_nodes_to_dataset()
  File "/vagrant/refinery/core/models.py", line 1414, in attach_derived_nodes_to_dataset
    self._create_derived_data_file_nodes(graph_with_input_nodes_linked)
  File "/vagrant/refinery/core/models.py", line 1589, in _create_derived_data_file_nodes
    self._get_output_connection_to_analysis_result_mapping()
  File "/vagrant/refinery/core/models.py", line 1508, in _get_output_connection_to_analysis_result_mapping
    )[index]
  File "/home/vagrant/.virtualenvs/refinery-platform/local/lib/python2.7/site-packages/django/db/models/query.py", line 201, in __getitem__
    return list(qs)[0]
IndexError: list index out of range

Expected behavior

No error

Notes

There is no default order for the items returned by the filter() call.

[screenshot: screen shot 2018-09-11 at 11 51 29]
[screenshot: screen shot 2018-09-11 at 11 52 54]

Also, this may cause outputs to be added incorrectly to the provenance graph in the case of analyses with more than one input.
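
For reference, here is a minimal sketch of the failing pattern (the model and field names below are illustrative assumptions, not the actual Refinery code):

# Hypothetical reconstruction of the pattern behind the traceback above;
# the AnalysisNodeConnection/AnalysisResult field names are assumptions.
for index, connection in enumerate(
        AnalysisNodeConnection.objects.filter(analysis=self, direction='out')):
    # The index computed from one unordered queryset is used to index into a
    # different unordered queryset. QuerySet.__getitem__ evaluates the query
    # and indexes the resulting list (django/db/models/query.py:
    # "return list(qs)[0]"), so without an explicit order_by() the row order
    # is database-dependent and the index can fall past the end of the
    # shorter list, raising IndexError.
    result = AnalysisResult.objects.filter(
        analysis_uuid=self.uuid,
        file_name=connection.filename,
    )[index]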

@scottx611x

This error occurs 100% of the time with my local setup, ~50% of the time with Ilya's local setup, and 0% of the time with a CloudMan Cluster and AWS Refinery instance.

I'm hesitant to write a fix for something that we haven't observed in our production environment, especially since I am able to reproduce this error reliably with my local setup.


hackdna commented Sep 11, 2018

The code in question is broken by design. The different error rates are simply due to the random order of items returned by the filter() call. It's just a matter of time before this error occurs in production. Also, it may have already caused some analysis outputs to be attached to the wrong inputs.


scottx611x commented Sep 12, 2018

After taking the time to test this with a CloudMan cluster, I found that the error is 100% reproducible and isolated to the Test workflow: 5 steps without branching workflow.

@hackdna Your 50% success rate was because you were also running the Test workflow: SPP analog workflow locally, which doesn't exhibit this issue.
My 100% failure rate was due to the fact that I was only running Test workflow: 5 steps without branching locally.

The error stems from the fact that we hadn't properly updated the Rename DataSet field for the Test workflow: 5 steps without branching workflow to give a more descriptive/unique name to the outputs we bring back into Refinery, as we have with other workflows. We saw the IndexError because every output file was named Output file.txt.

When this is done properly, the order of that filter() shouldn't matter (see the sketch below, after the attached workflow).

You can try this out locally with the following updated version of Test_workflow__5_steps_without_branching.ga
(Please note I had to add the .txt extension to upload it to GitHub; that extension should be removed before adding the workflow to Galaxy.)
Test_workflow__5_steps_without_branching.ga.txt
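
To make the argument concrete (an illustrative sketch only, using the same assumed names as above, not the actual Refinery query): when every output carries a unique name, the per-connection filter matches exactly one row, so the positional index is irrelevant.

# Illustrative only; field names are assumptions.
# With unique output names (e.g. a per-step name set via Rename DataSet),
# this filter returns a single row for each connection and no ordering is
# involved:
result = AnalysisResult.objects.filter(
    analysis_uuid=self.uuid,
    file_name=connection.filename,
).first()
# With every output named "Output file.txt", the same filter matches several
# rows (or fewer rows than the index expects), and picking one by position
# from an unordered queryset is where the IndexError and the wrong-attachment
# risk come from.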


hackdna commented Sep 12, 2018

I did not use Test workflow: SPP analog. The IndexError occurs because the value of the index variable is derived from one randomly ordered list and then used to index another randomly ordered list of a different size. Renaming inputs is only going to mask this problem.
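
One possible direction for a structural fix (a sketch only, with hypothetical field names; not necessarily what the linked PR does): pair connections and results by an explicit key in a deterministic order instead of by position.

from collections import defaultdict, deque

# Sketch of an order-independent pairing; model/field names are assumptions.
# Group the analysis results by file name once, in a deterministic order.
results_by_name = defaultdict(deque)
for result in AnalysisResult.objects.filter(
        analysis_uuid=self.uuid).order_by('id'):
    results_by_name[result.file_name].append(result)

# Pair each output connection with the next result of the same name, or None
# if there is no match, instead of indexing into an unordered queryset.
mapping = []
for connection in AnalysisNodeConnection.objects.filter(
        analysis=self, direction='out').order_by('id'):
    matches = results_by_name[connection.filename]
    mapping.append((connection, matches.popleft() if matches else None))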

@scottx611x

@hackdna are you able to reproduce the failure condition with the Workflow that I've provided?

This error will not occur (nor do we care about the ordering of the mentioned filter) with properly annotated workflows. Specifically, Test_workflow__5_steps_without_branching.ga is the only one of our workflows that will exhibit this issue. Could you please test with your local setup and confirm?


hackdna commented Sep 12, 2018

Updating workflow annotations is useful, but it does not fix the fundamental problems with the code in _get_output_connection_to_analysis_result_mapping().

@scottx611x

@hackdna That didn't answer my question.

When we properly annotate our workflows (as we should be doing to provide more meaningful names for the derived results of our workflow runs; ref #2373), the ordering of the filter that you have mentioned does not matter.

I believe that I have done my part to dive deeper and investigate the root cause of the issue that you're experiencing. It would be helpful if you could confirm that this behavior does not exist with the workflow I've provided you.

If you still feel that this issue is critical please feel free to present your findings in our next team meeting.

jkmarx modified the milestones: Release 1.6.7, Next (Sep 20, 2018)

hackdna commented Oct 2, 2018

The function in question, _get_output_connection_to_analysis_result_mapping(), is also causing random test failures:

FAIL: test___get_output_connection_to_analysis_result_mapping (core.tests.AnalysisTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/refinery-platform/refinery-platform/refinery/core/tests.py", line 718, in test___get_output_connection_to_analysis_result_mapping
    (self.analysis_node_connection_a, analysis_result_1)
AssertionError: Lists differ: [(<AnalysisNodeConnection: out... != [(<AnalysisNodeConnection: out...
First differing element 0:
(<AnalysisNodeConnection: out: 1_ (True) None>, <AnalysisResult: test_node.txt <-> 66baa46c-04bf-407e-b010-054299674cbd>)
(<AnalysisNodeConnection: out: 3_ (True) None>, <AnalysisResult: test_node.txt <-> 66baa46c-04bf-407e-b010-054299674cbd>)
- [(<AnalysisNodeConnection: out: 1_ (True) None>,
?                                 ^
+ [(<AnalysisNodeConnection: out: 3_ (True) None>,
?                                 ^
    <AnalysisResult: test_node.txt <-> 66baa46c-04bf-407e-b010-054299674cbd>),
   (<AnalysisNodeConnection: out: 2_ (False) None>, None),
-  (<AnalysisNodeConnection: out: 3_ (True) None>,
?                                 ^
+  (<AnalysisNodeConnection: out: 1_ (True) None>,
?                                 ^
    <AnalysisResult: test_node.txt <-> 66baa46c-04bf-407e-b010-054299674cbd>)]
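
As a stopgap for the flaky test (a sketch only; making _get_output_connection_to_analysis_result_mapping() return a deterministically ordered mapping would address the root cause), the assertion could compare the pairs without relying on their order:

# assertItemsEqual is the Python 2.7 name (assertCountEqual on Python 3).
# `expected_mapping` stands in for the list of (connection, result) pairs the
# test currently asserts in a fixed order, and `self.analysis` is an
# assumption about the test fixture.
self.assertItemsEqual(
    self.analysis._get_output_connection_to_analysis_result_mapping(),
    expected_mapping,
)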

scottx611x modified the milestones: Next, Release 1.6.8 (Nov 5, 2018)
jkmarx removed this from the Release 1.6.8 milestone (Feb 27, 2019)

hackdna commented Jul 26, 2019

A Travis build just failed again due to this issue:
https://travis-ci.org/refinery-platform/refinery-platform/builds/563759322#L6078

FAIL: test___get_output_connection_to_analysis_result_mapping (core.test_models.AnalysisTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/refinery-platform/refinery-platform/refinery/core/test_models.py", line 368, in test___get_output_connection_to_analysis_result_mapping
    (self.analysis_node_connection_a, analysis_result_1)
AssertionError: Lists differ: [(<AnalysisNodeConnection: out... != [(<AnalysisNodeConnection: out...
First differing element 0:
(<AnalysisNodeConnection: out: 1_ (True) None>, <AnalysisResult: test_node.txt <-> c7647b1d-8050-4c6e-a4bb-08c58783e01b>)
(<AnalysisNodeConnection: out: 3_ (True) None>, <AnalysisResult: test_node.txt <-> c7647b1d-8050-4c6e-a4bb-08c58783e01b>)
- [(<AnalysisNodeConnection: out: 1_ (True) None>,
?                                 ^
+ [(<AnalysisNodeConnection: out: 3_ (True) None>,
?                                 ^
    <AnalysisResult: test_node.txt <-> c7647b1d-8050-4c6e-a4bb-08c58783e01b>),
   (<AnalysisNodeConnection: out: 2_ (False) None>, None),
-  (<AnalysisNodeConnection: out: 3_ (True) None>,
?                                 ^
+  (<AnalysisNodeConnection: out: 1_ (True) None>,
?                                 ^
    <AnalysisResult: test_node.txt <-> c7647b1d-8050-4c6e-a4bb-08c58783e01b>)]

ilan-gold linked a pull request Nov 20, 2019 that will close this issue