
Fix .bam file handling for visualizations #3484

Merged 49 commits into develop from ilan-gold/bam_file_fix on Nov 12, 2019

Conversation

@ilan-gold (Member) commented Oct 30, 2019

Using a Celery chain, we now have two steps to fix #2713, #2718, and #3465:

  1. Generate the .bai index file from the .bam file (downloaded from S3 if necessary; note the 3600-second soft time limit on the generate task).
  2. Import the new file into the Refinery system.

Then we need to make this file visible to the Docker container running IGV, which we do by updating the API as well as the container: refinery-platform/docker_igv_js#32

This also helps solve #2501, because we are not generating unnecessary auxiliary nodes.
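
For a rough picture, the chain wires the two steps together along these lines (a simplified sketch, not the exact code; generate_auxiliary_file and FileImportTask are the task names discussed in the review below, and auxiliary_node stands in for the real node object):

    from celery import chain

    # Step 1: build the .bai index; its return value (the new file's
    # UUID, after the refactoring discussed below) feeds the next task.
    generate = generate_auxiliary_file.subtask((auxiliary_node,))
    # Step 2: import the generated file into the Refinery file store.
    file_import = FileImportTask().subtask()
    chain(generate, file_import).delay()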

@ngehlenborg added this to the Release 1.7.2 milestone Nov 4, 2019

@hackdna (Member) left a comment:

First pass

Comment on lines 153 to 154
    logger.info(
        "Starting auxiliary file generation and import for analysis "

hackdna (Member):

I'd just keep it on one line.

"Starting auxiliary file generation and import for analysis "
"'%s'", analysis)
auxiliary_file_tasks = TaskSet(
tasks=auxiliary_file_tasks_signatures

hackdna (Member) commented Nov 4, 2019:

    tasks=analysis.attach_derived_nodes_to_dataset()

I'd also break up this function, because it sounds like it is doing more than one thing.

Comment on lines 160 to 162
    analysis_status.auxiliary_file_task_group_id = (
        auxiliary_file_tasks.taskset_id
    )

hackdna (Member):

    analysis_status.auxiliary_file_task_group_id = \
        auxiliary_file_tasks.taskset_id

Comment on lines 1361 to 1362
    auxiliary_file_import_tasks = self._create_annotated_nodes()
    return auxiliary_file_import_tasks

hackdna (Member):

    return self._create_annotated_nodes()

Comment on lines 1430 to 1431
    if subtask is not None:
        auxiliary_file_tasks += [subtask]

hackdna (Member):

Maybe auxiliary_file_tasks += [subtask] if subtask else []?

Comment on lines 1635 to 1636
    auxiliary_file_tasks = self._prepare_annotated_nodes(node_uuids)
    return auxiliary_file_tasks

hackdna (Member):

    return self._prepare_annotated_nodes(node_uuids)

Comment on lines 609 to 616
    if auxiliary_filter is None:
        return [child.uuid for child in self.children.all()]
    else:
        return [
            child.uuid for child in self.children.filter(
                is_auxiliary_node=auxiliary_filter
            )
        ]

hackdna (Member):

This can be a lot more readable as a separate function (e.g., get_auxiliary_nodes()).
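
For instance (a sketch only; get_auxiliary_nodes() is the reviewer's suggested name, not necessarily what was merged):

    def get_auxiliary_nodes(self, auxiliary_filter=None):
        # No filter: return all child UUIDs; otherwise filter on the
        # is_auxiliary_node flag, mirroring the branches quoted above.
        children = (self.children.all() if auxiliary_filter is None
                    else self.children.filter(
                        is_auxiliary_node=auxiliary_filter))
        return [child.uuid for child in children]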

Comment on lines 657 to 662
     self.file_item.filetype.used_for_visualization and
     self.file_item.datafile and
     settings.REFINERY_AUXILIARY_FILE_GENERATION ==
-    'on_file_import'):
+    'on_file_import' and
+    self.file_item.get_extension().lower() in
+    self.AUXILIARY_FILES_NEEDED_FOR_VISUALIZATION):

hackdna (Member):

This list of conditions is a good candidate for factoring out into a helper function (at least a part of it). Also, ideally, this should be checked before calling this function, so you would not have to return None at the end.
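
(The merged code does add such a predicate, is_auxiliary_node_needed(), used in the post_process_file_import hunk below.) A sketch of the factored-out check, using the conditions quoted above:

    def is_auxiliary_node_needed(self):
        # All of the conditions from the if-statement above, in one place.
        return (
            self.file_item.filetype.used_for_visualization and
            self.file_item.datafile and
            settings.REFINERY_AUXILIARY_FILE_GENERATION ==
            'on_file_import' and
            self.file_item.get_extension().lower() in
            self.AUXILIARY_FILES_NEEDED_FOR_VISUALIZATION
        )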

    auxiliary_file_store_item.save()
    file_import = FileImportTask().subtask(
        (auxiliary_node.file_item.uuid, None,),
        immutable=True

hackdna (Member):

If generate_auxiliary_file() returns a file UUID then this won't be needed?

ilan-gold (Member Author):

What is "this", and why is it unnecessary?

hackdna (Member):

Line 675 (if a comment doesn't specify a line range, it refers to the line directly above it). But now that you mention it, the args on line 674 would not be needed either, since "the first task executes passing its return value to the next task in the chain" (http://docs.celeryproject.org/en/3.1/userguide/canvas.html#the-primitives).
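
A toy example in the spirit of the linked Celery docs (illustration only, not code from this PR):

    from celery import Celery, chain

    app = Celery('demo')

    @app.task
    def add(x, y):
        return x + y

    # add.s(2, 2) runs first; its return value (4) becomes the first
    # argument of the next signature, so the chain computes (2 + 2) + 4.
    result = chain(add.s(2, 2), add.s(4)).apply_async()
    result.get()  # 8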

ilan-gold (Member Author):

Ah, I see. generate_auxiliary_file does not currently do that. Were you saying you wanted that? It's not as if the FileStoreItem is generated there.

hackdna (Member):

Yes, the name implies that it should generate the FileStoreItem and return its UUID, but there may be more refactoring required.

ilan-gold (Member Author):

OK, I have done some refactoring.

Comment on lines 677 to 678
    generate_and_import = chain(generate, file_import)
    return generate_and_import

hackdna (Member):

    return chain(generate, file_import)

@@ -273,22 +276,26 @@ def parse_isatab(username, public, path, identity_id=None,
     return data_set_uuid


-@task()
-def generate_auxiliary_file(auxiliary_node, parent_node_file_store_item):
+@task(soft_time_limit=3600)

hackdna (Member):

Is default 60 seconds not sufficient?

ilan-gold (Member Author):

Since this is a file-based operation (we have to both move a file and do an operation on it), the timeout should match the FileImportTask's timeout.

hackdna (Member):

The FileImportTask timeout was chosen to accommodate downloads from sites on the public Internet, which can take a really long time (e.g., from ftp://ftp.sra.ebi.ac.uk).
Here we are dealing with transfers to/from S3 within the AWS network. It would be great to benchmark how long this operation takes for a typical BAM file (download from S3 + indexing + upload to S3) and set the timeout accordingly (perhaps with a 30% margin?).

ilan-gold (Member Author):

OK, I'll do that now.
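
For reference, a rough shape for such a benchmark (a sketch; download_from_s3 and upload_to_s3 are hypothetical stand-ins for Refinery's actual S3 transfer code):

    import time

    import pysam

    start = time.time()
    local_bam = download_from_s3(bam_key)  # hypothetical helper
    pysam.index(local_bam)                 # writes local_bam + '.bai'
    upload_to_s3(local_bam + '.bai')       # hypothetical helper
    # Set soft_time_limit from the measured time plus ~30% margin.
    print('BAM round trip took %.1f s' % (time.time() - start))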

@@ -368,7 +389,8 @@ def post_process_file_import(**kwargs):
     logger.info("Updated Solr index with file import state for Node '%s'",
                 node.uuid)
     if kwargs['state'] == celery.states.SUCCESS:
-        node.run_generate_auxiliary_node_task()
+        if node.is_auxiliary_node_needed():

hackdna (Member):

    kwargs['state'] == celery.states.SUCCESS and node.is_auxiliary_node_needed()
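
Folded into a single check, that might read (a sketch; run_generate_auxiliary_node_task is the method name from the removed line in the hunk above):

    if (kwargs['state'] == celery.states.SUCCESS and
            node.is_auxiliary_node_needed()):
        node.run_generate_auxiliary_node_task()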

        datafile_path = temp_file
        os.remove(temp_file)
    else:
        pysam.index(bytes(datafile_path))

hackdna (Member):

The index file should not be written directly to the file store dir - it is part of the reason this code is broken (see also line 348).

ilan-gold (Member Author) commented Nov 5, 2019:

Maybe I am confused by what you mean by "file store dir": this BAM file is downloaded into the temporary directory returned by tempfile.gettempdir(), which, as far as I can tell, is not the "file store dir". pysam then dumps the index file in there. pysam.index can take a string argument to tell it where to write the index, but I'm not sure what would be better than a temporary directory:

https://pysam.readthedocs.io/en/latest/usage.html#using-samtools-commands-within-python

http://www.htslib.org/doc/samtools.html

ilan-gold (Member Author):

I realized this morning, coming in, that you weren't referring to the S3 bit but the local bit. Will update.

hackdna (Member):

Also, I could not find any documentation about pysam.index(). Has it been deprecated?

ilan-gold (Member Author) commented Nov 5, 2019:

No, I should have been clearer about that, sorry! pysam.index is a wrapper around samtools' command-line index function, which has not been deprecated.
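
For reference, since pysam.index forwards its arguments to samtools index, an optional second argument names the output file, so the .bai can be written wherever needed, e.g. outside the file store directory (a sketch; the path is a stand-in):

    import os
    import tempfile

    import pysam

    bam_path = '/path/to/input.bam'  # stand-in path
    # Write the index into the system temp dir instead of next to the BAM.
    bai_path = os.path.join(tempfile.gettempdir(),
                            os.path.basename(bam_path) + '.bai')
    pysam.index(bam_path, bai_path)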

hackdna (Member):

OK, can we also update pysam to the latest version 0.15.3?
https://pypi.org/project/pysam/

ilan-gold (Member Author):

I don't see why not. Will add it in.

        if not settings.REFINERY_S3_USER_DATA:
            datafile_path = datafile.path
        else:
            datafile_path = datafile.name
    except (NotImplementedError, ValueError):
        datafile_path = None
    try:

hackdna (Member):

This try block is huge. I think the only function that can raise exceptions is generate_bam_index().
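
I.e., narrowed to something like (a sketch; generate_bam_index() is the function named above, and the failure handling mirrors the update_state/Ignore hunk quoted further down):

    try:
        generate_bam_index(datafile_path)
    except Exception as exc:
        # Broad catch for the sketch; the real code would catch whatever
        # generate_bam_index() actually raises.
        logger.error("BAM index generation failed: %s", exc)
        generate_auxiliary_file.update_state(state=celery.states.FAILURE)
        raise celery.exceptions.Ignore()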

ilan-gold (Member Author):

@hackdna I just tested this functionality via both direct file upload and running an analysis. We needed a few small changes, but otherwise I think this is good to go.

        generate_auxiliary_file.update_state(state=celery.states.FAILURE)
        raise celery.exceptions.Ignore()
    else:  # this should never occur
        auxiliary_file_path = ''

hackdna (Member):

I think the task should just fail at this point instead.

ilan-gold (Member Author):

OK.

Comment on lines 370 to 371
    auxiliary_task = node.generate_auxiliary_node_task()
    auxiliary_task.delay()

hackdna (Member):

How about node.generate_auxiliary_node_task().delay()?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

requirements.txt Outdated
@@ -34,7 +34,7 @@ psycopg2-binary==2.7.4
 pycparser==2.13
 Pygments==1.6rc1
 pyOpenSSL==17.5.0
-pysam==0.9.1.4
+pysam==0.15.3

hackdna (Member):

Have you tested this? I got an error during installation:

(refinery-platform) vagrant@refinery:/vagrant/refinery$ pip install pysam==0.15.3
DEPRECATION: Python 2.7 will reach the end of its life on January 1st, 2020. Please upgrade your Python as Python 2.7 won't be maintained after that date. A future version of pip will drop support for Python 2.7. More details about Python 2 support in pip, can be found at https://pip.pypa.io/en/latest/development/release-process/#python-2-support
Collecting pysam==0.15.3
  Using cached https://files.pythonhosted.org/packages/15/e7/2dab8bb0ac739555e69586f1492f0ff6bc4a1f8312992a83001d3deb77ac/pysam-0.15.3.tar.gz
    ERROR: Command errored out with exit status 1:
     command: /home/vagrant/.virtualenvs/refinery-platform/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-jQZFUB/pysam/setup.py'"'"'; __file__='"'"'/tmp/pip-install-jQZFUB/pysam/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-install-jQZFUB/pysam/pip-egg-info
         cwd: /tmp/pip-install-jQZFUB/pysam/
    Complete output (131 lines):
    # pysam: no cython available - using pre-compiled C
    # pysam: htslib mode is shared
    # pysam: HTSLIB_CONFIGURE_OPTIONS=None
    # pysam: (sysconfig) CC=x86_64-linux-gnu-gcc -pthread
    # pysam: (sysconfig) CFLAGS=-fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -Wdate-time -D_FORTIFY_SOURCE=2 -g -fstack-protector-strong -Wformat -Werror=format-security
    # pysam: (sysconfig) LDFLAGS=-Wl,-Bsymbolic-functions -Wl,-z,relro
    checking for gcc... x86_64-linux-gnu-gcc -pthread
    checking whether the C compiler works... yes
    checking for C compiler default output file name... a.out
    checking for suffix of executables...
    checking whether we are cross compiling... no
    checking for suffix of object files... o
    checking whether we are using the GNU C compiler... yes
    checking whether x86_64-linux-gnu-gcc -pthread accepts -g... yes
    checking for x86_64-linux-gnu-gcc -pthread option to accept ISO C89... none needed
    checking for ranlib... ranlib
    checking for grep that handles long lines and -e... /bin/grep
    checking for C compiler warning flags... unknown
    checking for special C compiler options needed for large files... no
    checking for _FILE_OFFSET_BITS value needed for large files... no
    checking for _LARGEFILE_SOURCE value needed for large files... no
    checking shared library type for unknown-Linux... plain .so
    checking how to run the C preprocessor... x86_64-linux-gnu-gcc -pthread -E
    checking for egrep... /bin/grep -E
    checking for ANSI C header files... yes
    checking for sys/types.h... yes
    checking for sys/stat.h... yes
    checking for stdlib.h... yes
    checking for string.h... yes
    checking for memory.h... yes
    checking for strings.h... yes
    checking for inttypes.h... yes
    checking for stdint.h... yes
    checking for unistd.h... yes
    checking for stdlib.h... (cached) yes
    checking for unistd.h... (cached) yes
    checking for sys/param.h... yes
    checking for getpagesize... yes
    checking for working mmap... yes
    checking for gmtime_r... yes
    checking for fsync... yes
    checking for drand48... yes
    checking whether fdatasync is declared... yes
    checking for fdatasync... yes
    checking for library containing log... -lm
    checking for zlib.h... yes
    checking for inflate in -lz... yes
    checking for library containing recv... none required
    checking for bzlib.h... no
    checking for BZ2_bzBuffToBuffCompress in -lbz2... no
    configure: error: libbzip2 development files not found

    The CRAM format may use bzip2 compression, which is implemented in HTSlib
    by using compression routines from libbzip2 <http://www.bzip.org/>.

    Building HTSlib requires libbzip2 development files to be installed on the
    build machine; you may need to ensure a package such as libbz2-dev (on Debian
    or Ubuntu Linux) or bzip2-devel (on RPM-based Linux distributions or Cygwin)
    is installed.

    Either configure with --disable-bz2 (which will make some CRAM files
    produced elsewhere unreadable) or resolve this error to build HTSlib.
    [the configure run and libbzip2 error above repeat verbatim a second time]
    make: ./version.sh: Command not found
    make: ./version.sh: Command not found
    config.mk:2: *** Resolve configure error first.  Stop.
    # pysam: htslib configure options: None
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-install-jQZFUB/pysam/setup.py", line 241, in <module>
        htslib_make_options = run_make_print_config()
      File "/tmp/pip-install-jQZFUB/pysam/setup.py", line 68, in run_make_print_config
        stdout = subprocess.check_output(["make", "-s", "print-config"])
      File "/usr/lib/python2.7/subprocess.py", line 574, in check_output
        raise CalledProcessError(retcode, cmd, output=output)
    subprocess.CalledProcessError: Command '['make', '-s', 'print-config']' returned non-zero exit status 2
    ----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

ilan-gold (Member Author):

I did try this out and it worked for me. I have seen this before, though (going through the Python 3 migration), and I think it's better to just wait for that, since I have something there to ameliorate this.

        datafile_path = parent_node.file_item.datafile.name
    except ValueError:
        logger.error("No datafile for %s", parent_node.file_item)
    return auxiliary_file_store_item.uuid

hackdna (Member):

Should the task fail here also?

ilan-gold (Member Author):

OK.

@ilan-gold merged commit 2a7b804 into develop on Nov 12, 2019
@ilan-gold deleted the ilan-gold/bam_file_fix branch on November 12, 2019 16:53
Successfully merging this pull request may close these issues.

Avoid generating BAM index file under FILE_STORE_BASE_DIR tree