
Analyses fail randomly due to IndexError #2993

Open
hackdna opened this issue Sep 10, 2018 · 9 comments · May be fixed by #3497


hackdna commented Sep 10, 2018

  • Specific code commit: 137a461
  • Version of the web browser and OS: any
  • Environment where the error occurred: Vagrant

Steps to reproduce

Run an analysis using a test workflow

Observed behavior

Analysis completes successfully in Galaxy but fails randomly in Refinery with the following error message in the Celery log:

2018-09-07 11:26:27,619 ERROR    celery.worker.job:282 log() - Task analysis_manager.tasks.run_analysis[75d2a63a-c8b0-4bd3-a710-e332bacba1fd] raised unexpected: IndexError('list index out of range',)
Traceback (most recent call last):
  File "/home/vagrant/.virtualenvs/refinery-platform/local/lib/python2.7/site-packages/celery/app/trace.py", line 240, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/home/vagrant/.virtualenvs/refinery-platform/local/lib/python2.7/site-packages/celery/app/trace.py", line 438, in __protected_call__
    return self.run(*args, **kwargs)
  File "/vagrant/refinery/analysis_manager/tasks.py", line 302, in run_analysis
    _attach_workflow_outputs(analysis_uuid)
  File "/vagrant/refinery/analysis_manager/tasks.py", line 149, in _attach_workflow_outputs
    analysis.attach_derived_nodes_to_dataset()
  File "/vagrant/refinery/core/models.py", line 1414, in attach_derived_nodes_to_dataset
    self._create_derived_data_file_nodes(graph_with_input_nodes_linked)
  File "/vagrant/refinery/core/models.py", line 1589, in _create_derived_data_file_nodes
    self._get_output_connection_to_analysis_result_mapping()
  File "/vagrant/refinery/core/models.py", line 1508, in _get_output_connection_to_analysis_result_mapping
    )[index]
  File "/home/vagrant/.virtualenvs/refinery-platform/local/lib/python2.7/site-packages/django/db/models/query.py", line 201, in __getitem__
    return list(qs)[0]
IndexError: list index out of range

Expected behavior

No error

Notes

There is no default order for the items returned by the filter() call.

[screenshot: screen shot 2018-09-11 at 11 51 29]
[screenshot: screen shot 2018-09-11 at 11 52 54]

Also, this may cause outputs to be added incorrectly to the provenance graph in the case of analyses with more than one input.
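
For reference, here is a minimal sketch of the failing pattern (the model and field names below are illustrative assumptions, not the actual Refinery code):

# Hypothetical reconstruction of the pattern behind the traceback above;
# the AnalysisNodeConnection/AnalysisResult field names are assumptions.
for index, connection in enumerate(
        AnalysisNodeConnection.objects.filter(analysis=self, direction='out')):
    # The index computed from one unordered queryset is used to index into a
    # different unordered queryset. QuerySet.__getitem__ evaluates the query
    # and indexes the resulting list (django/db/models/query.py:
    # "return list(qs)[0]"), so without an explicit order_by() the row order
    # is database-dependent and the index can fall past the end of the
    # shorter list, raising IndexError.
    result = AnalysisResult.objects.filter(
        analysis_uuid=self.uuid,
        file_name=connection.filename,
    )[index]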

@scottx611x

This error occurs 100% of the time with my local setup, ~50% of the time with Ilya's local setup, and 0% of the time with a CloudMan Cluster and AWS Refinery instance.

I'm hesitant to write a fix for something that we haven't observed in our production environment, especially since I am able to reproduce this error reliably with my local setup.


hackdna commented Sep 11, 2018

The code in question is broken by design. The different error rates are simply due to the random order of items returned by the filter() call. It's just a matter of time before this error occurs in production. Also, it may have already caused some analysis outputs to be attached to the wrong inputs.


scottx611x commented Sep 12, 2018

After taking the time to test this with a CloudMan cluster, I found that the error is 100% reproducible and isolated to the Test workflow: 5 steps without branching workflow.

@hackdna Your 50% success rate was because you were also running the Test workflow: SPP analog workflow locally, which doesn't exhibit this issue.
My 100% failure rate was due to the fact that I was only running Test workflow: 5 steps without branching locally.

The error stems from the fact that we hadn't properly updated the Rename DataSet field for the Test workflow: 5 steps without branching workflow to give a more descriptive/unique name to the outputs we bring back into Refinery, as we have with other workflows. We saw the IndexError because every output file was named Output file.txt.

When this is done properly, the order of that filter() shouldn't matter (see the sketch below, after the attached workflow).

You can try this out locally with the following updated version of Test_workflow__5_steps_without_branching.ga
(Please note I had to add the .txt extension to upload it to GitHub; that extension should be removed before adding the workflow to Galaxy.)
Test_workflow__5_steps_without_branching.ga.txt
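
To make the argument concrete (an illustrative sketch only, using the same assumed names as above, not the actual Refinery query): when every output carries a unique name, the per-connection filter matches exactly one row, so the positional index is irrelevant.

# Illustrative only; field names are assumptions.
# With unique output names (e.g. a per-step name set via Rename DataSet),
# this filter returns a single row for each connection and no ordering is
# involved:
result = AnalysisResult.objects.filter(
    analysis_uuid=self.uuid,
    file_name=connection.filename,
).first()
# With every output named "Output file.txt", the same filter matches several
# rows (or fewer rows than the index expects), and picking one by position
# from an unordered queryset is where the IndexError and the wrong-attachment
# risk come from.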


hackdna commented Sep 12, 2018

I did not use Test workflow: SPP analog. The IndexError occurs because the value of the index variable is derived from one randomly ordered list and then used to index another randomly ordered list of a different size. Renaming inputs is only going to mask this problem.
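
One possible direction for a structural fix (a sketch only, with hypothetical field names; not necessarily what the linked PR does): pair connections and results by an explicit key in a deterministic order instead of by position.

from collections import defaultdict, deque

# Sketch of an order-independent pairing; model/field names are assumptions.
# Group the analysis results by file name once, in a deterministic order.
results_by_name = defaultdict(deque)
for result in AnalysisResult.objects.filter(
        analysis_uuid=self.uuid).order_by('id'):
    results_by_name[result.file_name].append(result)

# Pair each output connection with the next result of the same name, or None
# if there is no match, instead of indexing into an unordered queryset.
mapping = []
for connection in AnalysisNodeConnection.objects.filter(
        analysis=self, direction='out').order_by('id'):
    matches = results_by_name[connection.filename]
    mapping.append((connection, matches.popleft() if matches else None))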

@scottx611x

@hackdna are you able to reproduce the failure condition with the Workflow that I've provided?

This error will not occur (nor do we care about the ordering of the mentioned filter) with properly annotated workflows. Specifically, Test_workflow__5_steps_without_branching.ga is the only one of our workflows that will exhibit this issue. Could you please test with your local setup and confirm?


hackdna commented Sep 12, 2018

Updating workflow annotations is useful, but it does not fix the fundamental problems with the code in _get_output_connection_to_analysis_result_mapping().

@scottx611x

@hackdna That didn't answer my question.

When we properly annotate our workflows (as we should be doing to provide more meaningful names for the derived results of our workflow runs; ref #2373), the ordering of the filter that you have mentioned does not matter.

I believe that I have done my part to dive deeper and investigate the root cause of the issue that you're experiencing. It would be helpful if you could confirm that this behavior does not exist with the workflow I've provided you.

If you still feel that this issue is critical please feel free to present your findings in our next team meeting.

jkmarx modified the milestones: Release 1.6.7, Next (Sep 20, 2018)

hackdna commented Oct 2, 2018

The function in question, _get_output_connection_to_analysis_result_mapping(), is also causing random test failures:

FAIL: test___get_output_connection_to_analysis_result_mapping (core.tests.AnalysisTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/refinery-platform/refinery-platform/refinery/core/tests.py", line 718, in test___get_output_connection_to_analysis_result_mapping
    (self.analysis_node_connection_a, analysis_result_1)
AssertionError: Lists differ: [(<AnalysisNodeConnection: out... != [(<AnalysisNodeConnection: out...
First differing element 0:
(<AnalysisNodeConnection: out: 1_ (True) None>, <AnalysisResult: test_node.txt <-> 66baa46c-04bf-407e-b010-054299674cbd>)
(<AnalysisNodeConnection: out: 3_ (True) None>, <AnalysisResult: test_node.txt <-> 66baa46c-04bf-407e-b010-054299674cbd>)
- [(<AnalysisNodeConnection: out: 1_ (True) None>,
?                                 ^
+ [(<AnalysisNodeConnection: out: 3_ (True) None>,
?                                 ^
    <AnalysisResult: test_node.txt <-> 66baa46c-04bf-407e-b010-054299674cbd>),
   (<AnalysisNodeConnection: out: 2_ (False) None>, None),
-  (<AnalysisNodeConnection: out: 3_ (True) None>,
?                                 ^
+  (<AnalysisNodeConnection: out: 1_ (True) None>,
?                                 ^
    <AnalysisResult: test_node.txt <-> 66baa46c-04bf-407e-b010-054299674cbd>)]
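
As a stopgap for the flaky test (a sketch only; making _get_output_connection_to_analysis_result_mapping() return a deterministically ordered mapping would address the root cause), the assertion could compare the pairs without relying on their order:

# assertItemsEqual is the Python 2.7 name (assertCountEqual on Python 3).
# `expected_mapping` stands in for the list of (connection, result) pairs the
# test currently asserts in a fixed order, and `self.analysis` is an
# assumption about the test fixture.
self.assertItemsEqual(
    self.analysis._get_output_connection_to_analysis_result_mapping(),
    expected_mapping,
)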

scottx611x modified the milestones: Next, Release 1.6.8 (Nov 5, 2018)
jkmarx removed this from the Release 1.6.8 milestone (Feb 27, 2019)

hackdna commented Jul 26, 2019

A Travis build just failed again due to this issue:
https://travis-ci.org/refinery-platform/refinery-platform/builds/563759322#L6078

FAIL: test___get_output_connection_to_analysis_result_mapping (core.test_models.AnalysisTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/refinery-platform/refinery-platform/refinery/core/test_models.py", line 368, in test___get_output_connection_to_analysis_result_mapping
    (self.analysis_node_connection_a, analysis_result_1)
AssertionError: Lists differ: [(<AnalysisNodeConnection: out... != [(<AnalysisNodeConnection: out...
First differing element 0:
(<AnalysisNodeConnection: out: 1_ (True) None>, <AnalysisResult: test_node.txt <-> c7647b1d-8050-4c6e-a4bb-08c58783e01b>)
(<AnalysisNodeConnection: out: 3_ (True) None>, <AnalysisResult: test_node.txt <-> c7647b1d-8050-4c6e-a4bb-08c58783e01b>)
- [(<AnalysisNodeConnection: out: 1_ (True) None>,
?                                 ^
+ [(<AnalysisNodeConnection: out: 3_ (True) None>,
?                                 ^
    <AnalysisResult: test_node.txt <-> c7647b1d-8050-4c6e-a4bb-08c58783e01b>),
   (<AnalysisNodeConnection: out: 2_ (False) None>, None),
-  (<AnalysisNodeConnection: out: 3_ (True) None>,
?                                 ^
+  (<AnalysisNodeConnection: out: 1_ (True) None>,
?                                 ^
    <AnalysisResult: test_node.txt <-> c7647b1d-8050-4c6e-a4bb-08c58783e01b>)]

ilan-gold linked a pull request Nov 20, 2019 that will close this issue