Fix .bam file handling for visualizations #3484
Conversation
information passed to viz containers
…inery-platform/refinery-platform into ilan-gold/bam_file_fix
… of https://github.com/refinery-platform/refinery-platform into ilan-gold/bam_file_fix
First pass
refinery/analysis_manager/tasks.py
Outdated
logger.info(
    "Starting auxiliary file generation and import for analysis "
I'd just keep it on one line.
refinery/analysis_manager/tasks.py
Outdated
    "Starting auxiliary file generation and import for analysis "
    "'%s'", analysis)
auxiliary_file_tasks = TaskSet(
    tasks=auxiliary_file_tasks_signatures
tasks=analysis.attach_derived_nodes_to_dataset()
I'd also break up this function because it sounds like it is doing more than one thing
refinery/analysis_manager/tasks.py
Outdated
analysis_status.auxiliary_file_task_group_id = (
    auxiliary_file_tasks.taskset_id
)
analysis_status.auxiliary_file_task_group_id = \
auxiliary_file_tasks.taskset_id
refinery/core/models.py
Outdated
auxiliary_file_import_tasks = self._create_annotated_nodes()
return auxiliary_file_import_tasks
return self._create_annotated_nodes()
refinery/core/models.py
Outdated
if subtask is not None:
    auxiliary_file_tasks += [subtask]
Maybe auxiliary_file_tasks += [subtask] if subtask else []?
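The suggested one-liner folds the None check into the append itself. A minimal stand-alone sketch of the idiom (the collect_tasks wrapper is invented here for illustration):

```python
def collect_tasks(subtasks):
    """Accumulate only the subtasks that are truthy."""
    auxiliary_file_tasks = []
    for subtask in subtasks:
        # Appending an empty list is a no-op, so None entries are skipped.
        auxiliary_file_tasks += [subtask] if subtask else []
    return auxiliary_file_tasks
```

Note that `if subtask` also drops falsy non-None values, so it is slightly looser than the original `if subtask is not None` check.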
refinery/core/models.py
Outdated
auxiliary_file_tasks = self._prepare_annotated_nodes(node_uuids)
return auxiliary_file_tasks
return self._prepare_annotated_nodes(node_uuids)
refinery/data_set_manager/models.py
Outdated
if auxiliary_filter is None:
    return [child.uuid for child in self.children.all()]
else:
    return [
        child.uuid for child in self.children.filter(
            is_auxiliary_node=auxiliary_filter
        )
    ]
This can be a lot more readable as a separate function (e.g., get_auxiliary_nodes())
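A sketch of that refactoring, using plain-Python stand-ins for the Django relations (the FakeChild/Node classes here are hypothetical, not the project's actual models):

```python
class FakeChild:
    """Stand-in for a child Node row with the two attributes used below."""
    def __init__(self, uuid, is_auxiliary_node):
        self.uuid = uuid
        self.is_auxiliary_node = is_auxiliary_node

class Node:
    def __init__(self, children):
        self._children = children

    def get_children(self, auxiliary_filter=None):
        """Return child UUIDs, optionally filtered by auxiliary status."""
        if auxiliary_filter is None:
            return [child.uuid for child in self._children]
        return [child.uuid for child in self._children
                if child.is_auxiliary_node == auxiliary_filter]

    def get_auxiliary_nodes(self):
        """The named helper the review suggests: just the auxiliary children."""
        return self.get_children(auxiliary_filter=True)
```

In the real model the list comprehensions would stay as `self.children.all()` / `self.children.filter(...)` queryset calls; only the naming changes.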
refinery/data_set_manager/models.py
Outdated
     self.file_item.filetype.used_for_visualization and
     self.file_item.datafile and
     settings.REFINERY_AUXILIARY_FILE_GENERATION ==
-    'on_file_import'):
+    'on_file_import' and
+    self.file_item.get_extension().lower() in
+    self.AUXILIARY_FILES_NEEDED_FOR_VISUALIZATION):
This list of conditions is a good candidate for factoring out into a helper function (at least a part of it). Also, ideally, this should be checked before calling this function, so you would not have to return None at the end.
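One way to factor the condition list out into a named predicate, sketched with hypothetical stand-in objects (the helper name and the SimpleNamespace fixtures are invented; attribute names follow the snippet above):

```python
from types import SimpleNamespace

AUXILIARY_FILES_NEEDED_FOR_VISUALIZATION = ('bam',)

def is_auxiliary_file_needed(file_item, generation_setting):
    """Collapse the chained conditions into a single named check."""
    return bool(
        file_item.filetype.used_for_visualization and
        file_item.datafile and
        generation_setting == 'on_file_import' and
        file_item.get_extension().lower() in
        AUXILIARY_FILES_NEEDED_FOR_VISUALIZATION
    )

# Hypothetical stand-in mirroring the attributes the real model exposes:
bam_item = SimpleNamespace(
    filetype=SimpleNamespace(used_for_visualization=True),
    datafile=object(),
    get_extension=lambda: 'BAM',
)
```

Callers can then guard with `if is_auxiliary_file_needed(...)` before invoking the generation function, which removes the need to return None at the end.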
refinery/data_set_manager/models.py
Outdated
auxiliary_file_store_item.save()
file_import = FileImportTask().subtask(
    (auxiliary_node.file_item.uuid, None,),
    immutable=True
If generate_auxiliary_file() returns a file UUID then this won't be needed?
what is "this" and why is it unnecessary?
Line 675 (if a comment doesn't specify a line range then it is referring to the line directly above it). But now that you mention it, the args on line 674 would also not be needed, since "the first task executes passing its return value to the next task in the chain" (http://docs.celeryproject.org/en/3.1/userguide/canvas.html#the-primitives).
ah i see - generate_auxiliary_file does not currently do that. were you saying you wanted that? it's not like the FileStoreItem is generated there
Yes, the name implies that it should generate the FileStoreItem and return its UUID but there may be more refactoring required.
ok i have done some refactoring
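The Celery behavior quoted above (each task's return value becoming the next task's first argument) can be mimicked without a broker. This is only a conceptual stand-in with invented task bodies, not the project's actual task code:

```python
from functools import reduce

def generate_auxiliary_file(parent_uuid):
    # Pretend to create a FileStoreItem for the index and return its UUID.
    return parent_uuid + '-aux'

def file_import(file_store_item_uuid):
    # Receives the previous task's return value as its first argument.
    return 'imported:' + file_store_item_uuid

def run_chain(initial, *tasks):
    """Feed each return value into the next callable, like celery.chain."""
    return reduce(lambda value, task: task(value), tasks, initial)
```

With real Celery this would be `chain(generate_auxiliary_file.s(parent_uuid), file_import.s())`, and file_import would no longer need explicit arguments.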
refinery/data_set_manager/models.py
Outdated
generate_and_import = chain(generate, file_import)
return generate_and_import
return chain(generate, file_import)
refinery/data_set_manager/tasks.py
Outdated
@@ -273,22 +276,26 @@ def parse_isatab(username, public, path, identity_id=None,
     return data_set_uuid


-@task()
+@task(soft_time_limit=3600)
 def generate_auxiliary_file(auxiliary_node, parent_node_file_store_item):
Is the default of 60 seconds not sufficient?
Since this is a file-based operation (we have to both move a file and do an operation on it), the timeout should match the FileImportTask's timeout
The FileImportTask timeout was chosen to accommodate downloads from sites on the public Internet, which can take a really long time (e.g., from ftp://ftp.sra.ebi.ac.uk).
Here we are dealing with transfers to/from S3 within the AWS network. It would be great to benchmark how long this operation takes for a typical BAM file (download from S3 + indexing + upload to S3) and set the timeout accordingly (perhaps with a 30% margin?).
ok i'll do that now
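A rough way to collect the benchmark numbers discussed above; the operation is stubbed out here, whereas in the real task it would be the S3 download, pysam indexing, and upload:

```python
import time

def benchmark(operation, margin=0.3):
    """Time one run of `operation` and suggest a soft_time_limit with headroom."""
    start = time.monotonic()
    operation()
    elapsed = time.monotonic() - start
    # Apply the margin discussed in the review (30% by default).
    return elapsed, elapsed * (1.0 + margin)

# Stub operation standing in for download + index + upload of a typical BAM:
elapsed, suggested_limit = benchmark(lambda: time.sleep(0.01))
```

Running this against a few representative BAM files would give a defensible value for the task's soft_time_limit.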
refinery/data_set_manager/tasks.py
Outdated
@@ -368,7 +389,8 @@ def post_process_file_import(**kwargs):
     logger.info("Updated Solr index with file import state for Node '%s'",
                 node.uuid)
     if kwargs['state'] == celery.states.SUCCESS:
-        node.run_generate_auxiliary_node_task()
+        if node.is_auxiliary_node_needed():
kwargs['state'] == celery.states.SUCCESS and node.is_auxiliary_node_needed()
refinery/data_set_manager/tasks.py
Outdated
    datafile_path = temp_file
    os.remove(temp_file)
else:
    pysam.index(bytes(datafile_path))
The index file should not be written directly to the file store dir - it is part of the reason this code is broken (see also line 348).
Maybe I am confused by what you mean by "file store dir" - this BAM file is downloaded into a temporary directory created by tempfile.gettempdir(), which, as far as I can tell, is not the "file store dir". pysam then dumps the index file into there. pysam.index can take a string argument to tell it where to dump, but I'm not sure what is better than a temporary directory:
https://pysam.readthedocs.io/en/latest/usage.html#using-samtools-commands-within-python
I realized coming in this morning that you weren't referring to the S3 bit but the local bit - will update.
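For reference, pysam.index forwards its arguments to samtools index, which accepts an optional explicit output path (samtools index in.bam out.bai), so the index can be written into a scratch directory instead of next to the data file. A sketch of building such a path (build_index_path is a hypothetical helper, and the pysam call itself is left as a comment since it needs a real BAM file):

```python
import os
import tempfile

def build_index_path(bam_path, scratch_dir=None):
    """Return a .bai path inside a scratch dir rather than the file store."""
    scratch_dir = scratch_dir or tempfile.mkdtemp()
    return os.path.join(scratch_dir, os.path.basename(bam_path) + '.bai')

# With a real BAM file on disk, the index would then be generated with:
#   pysam.index(bam_path, build_index_path(bam_path))
```

This keeps the generated index out of the file store directory until it is explicitly imported.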
Also, I could not find any documentation about pysam.index(). Has it been deprecated?
No - I should have been clearer about that, sorry! pysam.index is a wrapper around samtools' command-line index function, which has not been deprecated.
OK, can we also update pysam to the latest version 0.15.3?
https://pypi.org/project/pysam/
I don't see why not - will add it in.
refinery/data_set_manager/tasks.py
Outdated
    if not settings.REFINERY_S3_USER_DATA:
        datafile_path = datafile.path
    else:
        datafile_path = datafile.name
except (NotImplementedError, ValueError):
    datafile_path = None
try:
This try block is huge. I think the only function that can raise exceptions is generate_bam_index().
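A sketch of narrowing the try block to just the call that can actually raise; generate_bam_index and IndexGenerationError here are stubs standing in for the real helper and its error type:

```python
class IndexGenerationError(Exception):
    """Hypothetical error type for the indexing step."""

def generate_bam_index(datafile_path):
    # Stub: the real helper would run pysam over the BAM file.
    if not datafile_path:
        raise IndexGenerationError("no datafile path")
    return datafile_path + '.bai'

def make_index(datafile_path):
    # Only the fallible call lives inside the try block;
    # the surrounding bookkeeping runs outside it.
    try:
        index_path = generate_bam_index(datafile_path)
    except IndexGenerationError:
        return None
    return index_path
```

Keeping the try block this small makes it obvious which failure the except clause is actually handling.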
…platform/refinery-platform into ilan-gold/bam_file_fix
@hackdna I just tested this functionality via both direct file upload and running an analysis - we needed a few small changes but otherwise I think this is good to go.
refinery/data_set_manager/tasks.py
Outdated
    generate_auxiliary_file.update_state(state=celery.states.FAILURE)
    raise celery.exceptions.Ignore()
else:  # this should never occur
    auxiliary_file_path = ''
I think the task should just fail at this point instead.
ok
refinery/data_set_manager/tasks.py
Outdated
auxiliary_task = node.generate_auxiliary_node_task()
auxiliary_task.delay()
node.generate_auxiliary_node_task().delay()?
ok
requirements.txt
Outdated
@@ -34,7 +34,7 @@ psycopg2-binary==2.7.4
 pycparser==2.13
 Pygments==1.6rc1
 pyOpenSSL==17.5.0
-pysam==0.9.1.4
+pysam==0.15.3
Have you tested this? I got an error during installation:
(refinery-platform) vagrant@refinery:/vagrant/refinery$ pip install pysam==0.15.3
DEPRECATION: Python 2.7 will reach the end of its life on January 1st, 2020. Please upgrade your Python as Python 2.7 won't be maintained after that date. A future version of pip will drop support for Python 2.7. More details about Python 2 support in pip, can be found at https://pip.pypa.io/en/latest/development/release-process/#python-2-support
Collecting pysam==0.15.3
Using cached https://files.pythonhosted.org/packages/15/e7/2dab8bb0ac739555e69586f1492f0ff6bc4a1f8312992a83001d3deb77ac/pysam-0.15.3.tar.gz
ERROR: Command errored out with exit status 1:
command: /home/vagrant/.virtualenvs/refinery-platform/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-jQZFUB/pysam/setup.py'"'"'; __file__='"'"'/tmp/pip-install-jQZFUB/pysam/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-install-jQZFUB/pysam/pip-egg-info
cwd: /tmp/pip-install-jQZFUB/pysam/
Complete output (131 lines):
# pysam: no cython available - using pre-compiled C
# pysam: htslib mode is shared
# pysam: HTSLIB_CONFIGURE_OPTIONS=None
# pysam: (sysconfig) CC=x86_64-linux-gnu-gcc -pthread
# pysam: (sysconfig) CFLAGS=-fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -Wdate-time -D_FORTIFY_SOURCE=2 -g -fstack-protector-strong -Wformat -Werror=format-security
# pysam: (sysconfig) LDFLAGS=-Wl,-Bsymbolic-functions -Wl,-z,relro
checking for gcc... x86_64-linux-gnu-gcc -pthread
checking whether the C compiler works... yes
checking for C compiler default output file name... a.out
checking for suffix of executables...
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether x86_64-linux-gnu-gcc -pthread accepts -g... yes
checking for x86_64-linux-gnu-gcc -pthread option to accept ISO C89... none needed
checking for ranlib... ranlib
checking for grep that handles long lines and -e... /bin/grep
checking for C compiler warning flags... unknown
checking for special C compiler options needed for large files... no
checking for _FILE_OFFSET_BITS value needed for large files... no
checking for _LARGEFILE_SOURCE value needed for large files... no
checking shared library type for unknown-Linux... plain .so
checking how to run the C preprocessor... x86_64-linux-gnu-gcc -pthread -E
checking for egrep... /bin/grep -E
checking for ANSI C header files... yes
checking for sys/types.h... yes
checking for sys/stat.h... yes
checking for stdlib.h... yes
checking for string.h... yes
checking for memory.h... yes
checking for strings.h... yes
checking for inttypes.h... yes
checking for stdint.h... yes
checking for unistd.h... yes
checking for stdlib.h... (cached) yes
checking for unistd.h... (cached) yes
checking for sys/param.h... yes
checking for getpagesize... yes
checking for working mmap... yes
checking for gmtime_r... yes
checking for fsync... yes
checking for drand48... yes
checking whether fdatasync is declared... yes
checking for fdatasync... yes
checking for library containing log... -lm
checking for zlib.h... yes
checking for inflate in -lz... yes
checking for library containing recv... none required
checking for bzlib.h... no
checking for BZ2_bzBuffToBuffCompress in -lbz2... no
configure: error: libbzip2 development files not found
The CRAM format may use bzip2 compression, which is implemented in HTSlib
by using compression routines from libbzip2 <http://www.bzip.org/>.
Building HTSlib requires libbzip2 development files to be installed on the
build machine; you may need to ensure a package such as libbz2-dev (on Debian
or Ubuntu Linux) or bzip2-devel (on RPM-based Linux distributions or Cygwin)
is installed.
Either configure with --disable-bz2 (which will make some CRAM files
produced elsewhere unreadable) or resolve this error to build HTSlib.
checking for gcc... x86_64-linux-gnu-gcc -pthread
checking whether the C compiler works... yes
checking for C compiler default output file name... a.out
checking for suffix of executables...
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether x86_64-linux-gnu-gcc -pthread accepts -g... yes
checking for x86_64-linux-gnu-gcc -pthread option to accept ISO C89... none needed
checking for ranlib... ranlib
checking for grep that handles long lines and -e... /bin/grep
checking for C compiler warning flags... unknown
checking for special C compiler options needed for large files... no
checking for _FILE_OFFSET_BITS value needed for large files... no
checking for _LARGEFILE_SOURCE value needed for large files... no
checking shared library type for unknown-Linux... plain .so
checking how to run the C preprocessor... x86_64-linux-gnu-gcc -pthread -E
checking for egrep... /bin/grep -E
checking for ANSI C header files... yes
checking for sys/types.h... yes
checking for sys/stat.h... yes
checking for stdlib.h... yes
checking for string.h... yes
checking for memory.h... yes
checking for strings.h... yes
checking for inttypes.h... yes
checking for stdint.h... yes
checking for unistd.h... yes
checking for stdlib.h... (cached) yes
checking for unistd.h... (cached) yes
checking for sys/param.h... yes
checking for getpagesize... yes
checking for working mmap... yes
checking for gmtime_r... yes
checking for fsync... yes
checking for drand48... yes
checking whether fdatasync is declared... yes
checking for fdatasync... yes
checking for library containing log... -lm
checking for zlib.h... yes
checking for inflate in -lz... yes
checking for library containing recv... none required
checking for bzlib.h... no
checking for BZ2_bzBuffToBuffCompress in -lbz2... no
configure: error: libbzip2 development files not found
The CRAM format may use bzip2 compression, which is implemented in HTSlib
by using compression routines from libbzip2 <http://www.bzip.org/>.
Building HTSlib requires libbzip2 development files to be installed on the
build machine; you may need to ensure a package such as libbz2-dev (on Debian
or Ubuntu Linux) or bzip2-devel (on RPM-based Linux distributions or Cygwin)
is installed.
Either configure with --disable-bz2 (which will make some CRAM files
produced elsewhere unreadable) or resolve this error to build HTSlib.
make: ./version.sh: Command not found
make: ./version.sh: Command not found
config.mk:2: *** Resolve configure error first. Stop.
# pysam: htslib configure options: None
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/tmp/pip-install-jQZFUB/pysam/setup.py", line 241, in <module>
htslib_make_options = run_make_print_config()
File "/tmp/pip-install-jQZFUB/pysam/setup.py", line 68, in run_make_print_config
stdout = subprocess.check_output(["make", "-s", "print-config"])
File "/usr/lib/python2.7/subprocess.py", line 574, in check_output
raise CalledProcessError(retcode, cmd, output=output)
subprocess.CalledProcessError: Command '['make', '-s', 'print-config']' returned non-zero exit status 2
----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
I did try this out and it worked for me. I have seen this before though (going through the Python 3 business), and I think it is better to just wait for that since I have something to ameliorate this.
refinery/data_set_manager/tasks.py
Outdated
    datafile_path = parent_node.file_item.datafile.name
except ValueError:
    logger.error("No datafile for %s", parent_node.file_item)
    return auxiliary_file_store_item.uuid
Should the task fail here also?
ok
Force-pushed from 3a820a9 to 74a3821
Using a chain, we now have two steps to fix #2713, #2718, and #3465.
Then we need to make this file visible to the Docker container running IGV, which we do by updating the API as well as the container: refinery-platform/docker_igv_js#32
This also helps solve #2501 because we are no longer generating unnecessary auxiliary nodes.