
Fix .bam file handling for visualizations #3484

Merged 49 commits into develop from ilan-gold/bam_file_fix on Nov 12, 2019

Conversation

@ilan-gold (Member) commented Oct 30, 2019

Using a Celery chain, we now have two steps to fix #2713, #2718, and #3465:

  1. Generate the .bai index file from the .bam file (downloaded from S3 if necessary; note the 3600-second soft time limit on the generate task).
  2. Import the new file into the Refinery system.

Then we need to make this file visible to the Docker container running IGV, which we do by updating the API as well as the container: refinery-platform/docker_igv_js#32

This also helps solve #2501, because we are not generating unnecessary auxiliary nodes.
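
For a rough picture, the chain wires the two steps together along these lines (a simplified sketch, not the exact code; generate_auxiliary_file and FileImportTask are the task names discussed in the review below, and auxiliary_node stands in for the real node object):

    from celery import chain

    # Step 1: build the .bai index; its return value (the new file's
    # UUID, after the refactoring discussed below) feeds the next task.
    generate = generate_auxiliary_file.subtask((auxiliary_node,))
    # Step 2: import the generated file into the Refinery file store.
    file_import = FileImportTask().subtask()
    chain(generate, file_import).delay()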

@ngehlenborg added this to the Release 1.7.2 milestone Nov 4, 2019

@hackdna (Member) left a comment:

First pass

Comment on lines 153 to 154
    logger.info(
        "Starting auxiliary file generation and import for analysis "

hackdna (Member):

I'd just keep it on one line.

"Starting auxiliary file generation and import for analysis "
"'%s'", analysis)
auxiliary_file_tasks = TaskSet(
tasks=auxiliary_file_tasks_signatures

hackdna (Member) commented Nov 4, 2019:

    tasks=analysis.attach_derived_nodes_to_dataset()

I'd also break up this function, because it sounds like it is doing more than one thing.

Comment on lines 160 to 162
    analysis_status.auxiliary_file_task_group_id = (
        auxiliary_file_tasks.taskset_id
    )

hackdna (Member):

    analysis_status.auxiliary_file_task_group_id = \
        auxiliary_file_tasks.taskset_id

Comment on lines 1361 to 1362
    auxiliary_file_import_tasks = self._create_annotated_nodes()
    return auxiliary_file_import_tasks

hackdna (Member):

    return self._create_annotated_nodes()

Comment on lines 1430 to 1431
    if subtask is not None:
        auxiliary_file_tasks += [subtask]

hackdna (Member):

Maybe auxiliary_file_tasks += [subtask] if subtask else []?

Comment on lines 1635 to 1636
    auxiliary_file_tasks = self._prepare_annotated_nodes(node_uuids)
    return auxiliary_file_tasks

hackdna (Member):

    return self._prepare_annotated_nodes(node_uuids)

Comment on lines 609 to 616
    if auxiliary_filter is None:
        return [child.uuid for child in self.children.all()]
    else:
        return [
            child.uuid for child in self.children.filter(
                is_auxiliary_node=auxiliary_filter
            )
        ]

hackdna (Member):

This can be a lot more readable as a separate function (e.g., get_auxiliary_nodes()).
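
For instance (a sketch only; get_auxiliary_nodes() is the reviewer's suggested name, not necessarily what was merged):

    def get_auxiliary_nodes(self, auxiliary_filter=None):
        # No filter: return all child UUIDs; otherwise filter on the
        # is_auxiliary_node flag, mirroring the branches quoted above.
        children = (self.children.all() if auxiliary_filter is None
                    else self.children.filter(
                        is_auxiliary_node=auxiliary_filter))
        return [child.uuid for child in children]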

Comment on lines 657 to 662
     self.file_item.filetype.used_for_visualization and
     self.file_item.datafile and
     settings.REFINERY_AUXILIARY_FILE_GENERATION ==
-    'on_file_import'):
+    'on_file_import' and
+    self.file_item.get_extension().lower() in
+    self.AUXILIARY_FILES_NEEDED_FOR_VISUALIZATION):

hackdna (Member):

This list of conditions is a good candidate for factoring out into a helper function (at least a part of it). Also, ideally, this should be checked before calling this function, so you would not have to return None at the end.
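
(The merged code does add such a predicate, is_auxiliary_node_needed(), used in the post_process_file_import hunk below.) A sketch of the factored-out check, using the conditions quoted above:

    def is_auxiliary_node_needed(self):
        # All of the conditions from the if-statement above, in one place.
        return (
            self.file_item.filetype.used_for_visualization and
            self.file_item.datafile and
            settings.REFINERY_AUXILIARY_FILE_GENERATION ==
            'on_file_import' and
            self.file_item.get_extension().lower() in
            self.AUXILIARY_FILES_NEEDED_FOR_VISUALIZATION
        )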

    auxiliary_file_store_item.save()
    file_import = FileImportTask().subtask(
        (auxiliary_node.file_item.uuid, None,),
        immutable=True

hackdna (Member):

If generate_auxiliary_file() returns a file UUID then this won't be needed?

ilan-gold (Member Author):

What is "this", and why is it unnecessary?

hackdna (Member):

Line 675 (if a comment doesn't specify a line range, it refers to the line directly above it). But now that you mention it, the args on line 674 would not be needed either, since "the first task executes passing its return value to the next task in the chain" (http://docs.celeryproject.org/en/3.1/userguide/canvas.html#the-primitives).
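
A toy example in the spirit of the linked Celery docs (illustration only, not code from this PR):

    from celery import Celery, chain

    app = Celery('demo')

    @app.task
    def add(x, y):
        return x + y

    # add.s(2, 2) runs first; its return value (4) becomes the first
    # argument of the next signature, so the chain computes (2 + 2) + 4.
    result = chain(add.s(2, 2), add.s(4)).apply_async()
    result.get()  # 8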

ilan-gold (Member Author):

Ah, I see. generate_auxiliary_file does not currently do that. Were you saying you wanted that? It's not as if the FileStoreItem is generated there.

hackdna (Member):

Yes, the name implies that it should generate the FileStoreItem and return its UUID, but there may be more refactoring required.

ilan-gold (Member Author):

OK, I have done some refactoring.

Comment on lines 677 to 678
    generate_and_import = chain(generate, file_import)
    return generate_and_import

hackdna (Member):

    return chain(generate, file_import)

@@ -273,22 +276,26 @@ def parse_isatab(username, public, path, identity_id=None,
     return data_set_uuid


-@task()
-def generate_auxiliary_file(auxiliary_node, parent_node_file_store_item):
+@task(soft_time_limit=3600)

hackdna (Member):

Is default 60 seconds not sufficient?

ilan-gold (Member Author):

Since this is a file-based operation (we have to both move a file and do an operation on it), the timeout should match the FileImportTask's timeout.

hackdna (Member):

The FileImportTask timeout was chosen to accommodate downloads from sites on the public Internet, which can take a really long time (e.g., from ftp://ftp.sra.ebi.ac.uk).
Here we are dealing with transfers to/from S3 within the AWS network. It would be great to benchmark how long this operation takes for a typical BAM file (download from S3 + indexing + upload to S3) and set the timeout accordingly (perhaps with a 30% margin?).

ilan-gold (Member Author):

OK, I'll do that now.
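
For reference, a rough shape for such a benchmark (a sketch; download_from_s3 and upload_to_s3 are hypothetical stand-ins for Refinery's actual S3 transfer code):

    import time

    import pysam

    start = time.time()
    local_bam = download_from_s3(bam_key)  # hypothetical helper
    pysam.index(local_bam)                 # writes local_bam + '.bai'
    upload_to_s3(local_bam + '.bai')       # hypothetical helper
    # Set soft_time_limit from the measured time plus ~30% margin.
    print('BAM round trip took %.1f s' % (time.time() - start))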

@@ -368,7 +389,8 @@ def post_process_file_import(**kwargs):
     logger.info("Updated Solr index with file import state for Node '%s'",
                 node.uuid)
     if kwargs['state'] == celery.states.SUCCESS:
-        node.run_generate_auxiliary_node_task()
+        if node.is_auxiliary_node_needed():

hackdna (Member):

    kwargs['state'] == celery.states.SUCCESS and node.is_auxiliary_node_needed()
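
Folded into a single check, that might read (a sketch; run_generate_auxiliary_node_task is the method name from the removed line in the hunk above):

    if (kwargs['state'] == celery.states.SUCCESS and
            node.is_auxiliary_node_needed()):
        node.run_generate_auxiliary_node_task()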

        datafile_path = temp_file
        os.remove(temp_file)
    else:
        pysam.index(bytes(datafile_path))

hackdna (Member):

The index file should not be written directly to the file store dir - it is part of the reason this code is broken (see also line 348).

ilan-gold (Member Author) commented Nov 5, 2019:

Maybe I am confused by what you mean by "file store dir": this BAM file is downloaded into the temporary directory returned by tempfile.gettempdir(), which, as far as I can tell, is not the "file store dir". pysam then dumps the index file in there. pysam.index can take a string argument to tell it where to write the index, but I'm not sure what would be better than a temporary directory:

https://pysam.readthedocs.io/en/latest/usage.html#using-samtools-commands-within-python

http://www.htslib.org/doc/samtools.html

ilan-gold (Member Author):

I realized this morning, coming in, that you weren't referring to the S3 bit but the local bit. Will update.

hackdna (Member):

Also, I could not find any documentation about pysam.index(). Has it been deprecated?

ilan-gold (Member Author) commented Nov 5, 2019:

No, I should have been clearer about that, sorry! pysam.index is a wrapper around samtools' command-line index function, which has not been deprecated.
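
For reference, since pysam.index forwards its arguments to samtools index, an optional second argument names the output file, so the .bai can be written wherever needed, e.g. outside the file store directory (a sketch; the path is a stand-in):

    import os
    import tempfile

    import pysam

    bam_path = '/path/to/input.bam'  # stand-in path
    # Write the index into the system temp dir instead of next to the BAM.
    bai_path = os.path.join(tempfile.gettempdir(),
                            os.path.basename(bam_path) + '.bai')
    pysam.index(bam_path, bai_path)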

hackdna (Member):

OK, can we also update pysam to the latest version 0.15.3?
https://pypi.org/project/pysam/

ilan-gold (Member Author):

I don't see why not. Will add it in.

        if not settings.REFINERY_S3_USER_DATA:
            datafile_path = datafile.path
        else:
            datafile_path = datafile.name
    except (NotImplementedError, ValueError):
        datafile_path = None
    try:

hackdna (Member):

This try block is huge. I think the only function that can raise exceptions is generate_bam_index().
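
I.e., narrowed to something like (a sketch; generate_bam_index() is the function named above, and the failure handling mirrors the update_state/Ignore hunk quoted further down):

    try:
        generate_bam_index(datafile_path)
    except Exception as exc:
        # Broad catch for the sketch; the real code would catch whatever
        # generate_bam_index() actually raises.
        logger.error("BAM index generation failed: %s", exc)
        generate_auxiliary_file.update_state(state=celery.states.FAILURE)
        raise celery.exceptions.Ignore()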

ilan-gold (Member Author):

@hackdna I just tested this functionality via both direct file upload and running an analysis. We needed a few small changes, but otherwise I think this is good to go.

        generate_auxiliary_file.update_state(state=celery.states.FAILURE)
        raise celery.exceptions.Ignore()
    else:  # this should never occur
        auxiliary_file_path = ''

hackdna (Member):

I think the task should just fail at this point instead.

ilan-gold (Member Author):

OK.

Comment on lines 370 to 371
    auxiliary_task = node.generate_auxiliary_node_task()
    auxiliary_task.delay()

hackdna (Member):

How about node.generate_auxiliary_node_task().delay()?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

requirements.txt Outdated
@@ -34,7 +34,7 @@ psycopg2-binary==2.7.4
 pycparser==2.13
 Pygments==1.6rc1
 pyOpenSSL==17.5.0
-pysam==0.9.1.4
+pysam==0.15.3

hackdna (Member):

Have you tested this? I got an error during installation:

(refinery-platform) vagrant@refinery:/vagrant/refinery$ pip install pysam==0.15.3
DEPRECATION: Python 2.7 will reach the end of its life on January 1st, 2020. Please upgrade your Python as Python 2.7 won't be maintained after that date. A future version of pip will drop support for Python 2.7. More details about Python 2 support in pip, can be found at https://pip.pypa.io/en/latest/development/release-process/#python-2-support
Collecting pysam==0.15.3
  Using cached https://files.pythonhosted.org/packages/15/e7/2dab8bb0ac739555e69586f1492f0ff6bc4a1f8312992a83001d3deb77ac/pysam-0.15.3.tar.gz
    ERROR: Command errored out with exit status 1:
     command: /home/vagrant/.virtualenvs/refinery-platform/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-jQZFUB/pysam/setup.py'"'"'; __file__='"'"'/tmp/pip-install-jQZFUB/pysam/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-install-jQZFUB/pysam/pip-egg-info
         cwd: /tmp/pip-install-jQZFUB/pysam/
    Complete output (131 lines):
    # pysam: no cython available - using pre-compiled C
    # pysam: htslib mode is shared
    # pysam: HTSLIB_CONFIGURE_OPTIONS=None
    # pysam: (sysconfig) CC=x86_64-linux-gnu-gcc -pthread
    # pysam: (sysconfig) CFLAGS=-fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -Wdate-time -D_FORTIFY_SOURCE=2 -g -fstack-protector-strong -Wformat -Werror=format-security
    # pysam: (sysconfig) LDFLAGS=-Wl,-Bsymbolic-functions -Wl,-z,relro
    checking for gcc... x86_64-linux-gnu-gcc -pthread
    checking whether the C compiler works... yes
    checking for C compiler default output file name... a.out
    checking for suffix of executables...
    checking whether we are cross compiling... no
    checking for suffix of object files... o
    checking whether we are using the GNU C compiler... yes
    checking whether x86_64-linux-gnu-gcc -pthread accepts -g... yes
    checking for x86_64-linux-gnu-gcc -pthread option to accept ISO C89... none needed
    checking for ranlib... ranlib
    checking for grep that handles long lines and -e... /bin/grep
    checking for C compiler warning flags... unknown
    checking for special C compiler options needed for large files... no
    checking for _FILE_OFFSET_BITS value needed for large files... no
    checking for _LARGEFILE_SOURCE value needed for large files... no
    checking shared library type for unknown-Linux... plain .so
    checking how to run the C preprocessor... x86_64-linux-gnu-gcc -pthread -E
    checking for egrep... /bin/grep -E
    checking for ANSI C header files... yes
    checking for sys/types.h... yes
    checking for sys/stat.h... yes
    checking for stdlib.h... yes
    checking for string.h... yes
    checking for memory.h... yes
    checking for strings.h... yes
    checking for inttypes.h... yes
    checking for stdint.h... yes
    checking for unistd.h... yes
    checking for stdlib.h... (cached) yes
    checking for unistd.h... (cached) yes
    checking for sys/param.h... yes
    checking for getpagesize... yes
    checking for working mmap... yes
    checking for gmtime_r... yes
    checking for fsync... yes
    checking for drand48... yes
    checking whether fdatasync is declared... yes
    checking for fdatasync... yes
    checking for library containing log... -lm
    checking for zlib.h... yes
    checking for inflate in -lz... yes
    checking for library containing recv... none required
    checking for bzlib.h... no
    checking for BZ2_bzBuffToBuffCompress in -lbz2... no
    configure: error: libbzip2 development files not found

    The CRAM format may use bzip2 compression, which is implemented in HTSlib
    by using compression routines from libbzip2 <http://www.bzip.org/>.

    Building HTSlib requires libbzip2 development files to be installed on the
    build machine; you may need to ensure a package such as libbz2-dev (on Debian
    or Ubuntu Linux) or bzip2-devel (on RPM-based Linux distributions or Cygwin)
    is installed.

    Either configure with --disable-bz2 (which will make some CRAM files
    produced elsewhere unreadable) or resolve this error to build HTSlib.
    [the configure run and libbzip2 error above repeat verbatim a second time]
    make: ./version.sh: Command not found
    make: ./version.sh: Command not found
    config.mk:2: *** Resolve configure error first.  Stop.
    # pysam: htslib configure options: None
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-install-jQZFUB/pysam/setup.py", line 241, in <module>
        htslib_make_options = run_make_print_config()
      File "/tmp/pip-install-jQZFUB/pysam/setup.py", line 68, in run_make_print_config
        stdout = subprocess.check_output(["make", "-s", "print-config"])
      File "/usr/lib/python2.7/subprocess.py", line 574, in check_output
        raise CalledProcessError(retcode, cmd, output=output)
    subprocess.CalledProcessError: Command '['make', '-s', 'print-config']' returned non-zero exit status 2
    ----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

ilan-gold (Member Author):

I did try this out and it worked for me. I have seen this before, though (going through the Python 3 migration), and I think it's better to just wait for that, since I have something there to ameliorate this.

        datafile_path = parent_node.file_item.datafile.name
    except ValueError:
        logger.error("No datafile for %s", parent_node.file_item)
    return auxiliary_file_store_item.uuid

hackdna (Member):

Should the task fail here also?

ilan-gold (Member Author):

OK.

@ilan-gold merged commit 2a7b804 into develop on Nov 12, 2019
@ilan-gold deleted the ilan-gold/bam_file_fix branch on November 12, 2019 16:53
Successfully merging this pull request may close these issues.

Avoid generating BAM index file under FILE_STORE_BASE_DIR tree