Analysis results are not added to source data set #2294

Closed
4 tasks done
hackdna opened this issue Oct 30, 2017 · 8 comments


@hackdna
Member

hackdna commented Oct 30, 2017

  • Specific code commit: v1.6.0
  • Version of the web browser and OS: Chrome on OS X
  • Environment where the error occurred (Vagrant VM and site conf mode or AWS instance): AWS

Steps to reproduce

  1. Set up a new stack (DATA_SNAPSHOT: snap-0b69229e9b0290af1 and RDS_SNAPSHOT: betastemcellcommons-snapshot-rdsinstance-1b7kwlbwn0ocz)
  2. Launch a new CloudMan cluster
  3. Delete old versions of workflows from Refinery
  4. Import new versions of workflows from Galaxy
  5. Update Solr index
  6. Run a ChIP-seq or RNA-seq workflow

Observed behavior

Analysis results are not added to the source data set

Expected behavior

Analysis results are added to the source data set

Notes

  • FastQC workflow results are added to the data set
  • This seems to be due to a bug in Galaxy where one has to manually deselect asterisked Workflow outputs, save the Workflow, reselect said outputs, and save the workflow again when importing workflows from files.

TO-DO:

@scottx611x
Member

scottx611x commented Oct 31, 2017

So this seems to be a side effect of utilizing the asterisking feature that Galaxy provides to select workflow outputs.

When a Workflow that specifies workflow_outputs (asterisks) is exported and reimported on a different Galaxy instance, the UI renders the asterisked outputs but API responses don't include them, so Refinery doesn't import any derived results. (This doesn't seem to happen when one exports a Workflow and reimports it on the same Galaxy instance.)

If one takes the same Workflow, deselects and reselects the asterisked outputs, and then saves the entire workflow, Refinery will detect that there are workflow_outputs and will import derived results properly.

I initially chose to use this feature of Galaxy because I thought that the old way of annotating outputs wouldn't scale very well, i.e. a user may be much more willing to click 10 things than annotate 10 things.

Moving forward I see a couple of paths:

  • Checking if this bug even exists in newer versions of Galaxy, and potentially submitting a solution
  • Having logic that fails early in ToolDefinition generation if no workflow_outputs are detected, alerting the user that they may need to re-select and save the WF as described above.
  • Reverting back to the Step level annotations.
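
A minimal sketch of the fail-early option, assuming the workflow is available as a dict shaped like a Galaxy .ga export (a "steps" mapping whose entries may carry a "workflow_outputs" list). The function and exception names are hypothetical and are not Refinery's actual ToolDefinition code:

```python
class MissingWorkflowOutputsError(ValueError):
    """Raised when no step of a workflow declares any workflow_outputs."""


def collect_workflow_outputs(workflow):
    """Return (step_id, output_name) pairs for every declared workflow output.

    Assumes `workflow` looks like a parsed .ga file: a dict with a "steps"
    mapping, where each step may have a "workflow_outputs" list of dicts.
    """
    declared = []
    for step_id, step in workflow.get("steps", {}).items():
        # A missing key and an empty/None value are treated the same way.
        for output in step.get("workflow_outputs") or []:
            declared.append((step_id, output.get("output_name")))
    return declared


def require_workflow_outputs(workflow):
    """Fail early if the workflow declares no outputs at all."""
    declared = collect_workflow_outputs(workflow)
    if not declared:
        raise MissingWorkflowOutputsError(
            "Workflow '{}' declares no workflow_outputs; the asterisked "
            "outputs may need to be deselected, saved, reselected and "
            "saved again in Galaxy.".format(workflow.get("name", "<unnamed>"))
        )
    return declared
```

Raising at import time, rather than after an analysis has run, would surface the problem before any result files are downloaded.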

Looking for input @hackdna

DS w/ Derived results from FastQC, RNA-SEQ SE/PE, and ChIP-Seq (all hg19)

Pic in case the instance goes away:
[screenshot: screen shot 2017-10-31 at 4 58 31 pm]

@hackdna
Member Author

hackdna commented Nov 1, 2017

Thanks. First, just a bit of context: workflow annotation is meant to be done by site admins, not by end users. Also, it is something done infrequently, so there is no reason to worry about scaling.

All proposed solutions sound OK. There should definitely be a full suite of checks for workflows imported into Refinery regardless of the annotation format chosen (it should be impossible to import a workflow that doesn't declare outputs). Also, there should probably be some error handling at the end of analysis when files are downloaded but not associated with the data set (they are inaccessible yet occupy storage space).
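
A rough sketch of the end-of-analysis check, assuming identifiers are available both for the files that were downloaded and for the files that ended up attached to the data set. All names here are hypothetical and are not Refinery's API:

```python
class OrphanedOutputsError(RuntimeError):
    """Raised when downloaded files are not attached to the data set."""


def find_orphaned_files(downloaded_file_ids, dataset_file_ids):
    """Return the downloaded ids that are missing from the data set, sorted."""
    return sorted(set(downloaded_file_ids) - set(dataset_file_ids))


def check_outputs_attached(downloaded_file_ids, dataset_file_ids):
    """Fail the analysis if any downloaded file would be left inaccessible."""
    orphans = find_orphaned_files(downloaded_file_ids, dataset_file_ids)
    if orphans:
        raise OrphanedOutputsError(
            "{} downloaded file(s) not associated with the data set: {}".format(
                len(orphans), ", ".join(str(o) for o in orphans)
            )
        )
```

The list returned by find_orphaned_files could also drive a cleanup step instead of an error, so that stray files do not keep occupying storage.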

It sounds like there is a workaround (re-saving workflows) that can be applied to the existing CloudMan clusters. However, it is a manual process, so a proper long term solution is needed. It is worth checking if this behavior exists in the latest version of Galaxy.

If it doesn't, it would still probably take months before we can use it, since that version of Galaxy would need to make it into CloudMan, then we would need to test everything with Refinery, make changes if necessary, create a new shared cluster, etc. If this behavior does exist, then you'd need to submit a patch for Galaxy, and that would take even longer.

Also, it is unclear if using the Galaxy mechanism for hiding intermediate workflow outputs is even sufficient for use with Refinery (if no outputs are hidden then all are returned, and it is impossible to tell whether that was by design). So, all this basically means that we will most likely need to revert to using step level annotations, at least for the medium term.

Finally: what would be the process to clean up analysis output files on beta.stemcellcommons.org that were already downloaded but not associated with a data set?

@scottx611x
Member

scottx611x commented Nov 1, 2017

Okay thanks for the input:
I'm going to go ahead and update this issue with some action items.

Also, the workaround is very specific: I've found that just saving again is not enough. I have a gist here illustrating the odd behavior but, in short, one would need to: upload a .ga file, deselect the asterisked outputs, save the workflow, reselect the asterisked outputs, and save the workflow again.

Regarding "it is unclear if using the Galaxy mechanism for hiding intermediate workflow outputs is even sufficient for use with Refinery (if no outputs are hidden then all are returned and it is impossible to tell whether that was by design)": I'm not following what you're saying. I know that if no outputs are selected then none of the resulting derived data is returned to Refinery. I don't know if we've seen the scenario you're describing? If you're talking about #2293, that was a bug I introduced in our application code, and I know how to address it.

Regarding cleanup: simply deleting these recent Analyses will remove their Nodes & FileStoreItems and trigger an update of the Solr index.

@hackdna
Member Author

hackdna commented Nov 1, 2017

Sounds good. Yes, I assumed that all outputs are downloaded if none are selected because of the behavior described in #2293. So, if Galaxy can properly recognize and report the outputs marked by asterisk and if workflow outputs are checked during import then I guess it won't be a problem.

Thanks for the gist. Have you already updated the workflows on the current prod cluster and re-imported them into beta or should I?

@scottx611x
Member

👍

I've only updated the Human-based Workflows so far.
None are imported on beta, but I can do that today along with deleting the wonky recent Analyses.

@hackdna
Member Author

hackdna commented Nov 1, 2017

OK, thanks, that would be great.

@hackdna
Member Author

hackdna commented Nov 3, 2017

Just to clarify: we should revert to using step level annotations for workflow outputs, at least for the medium term.

@scottx611x
Member

So I can reproduce the odd asterisking behavior in the newer Galaxy 17.05.

I'm going to close this in favor of #2381, where we are reverting to using workflow step annotations for desired output files.
