Update taxonomy so as to retrieve DBs from zenodo (toydb and viruses_masking)
cokelaer committed Mar 18, 2024
1 parent 3a91102 commit e041d4f
Showing 10 changed files with 134 additions and 78 deletions.
22 changes: 10 additions & 12 deletions README.rst
@@ -40,14 +40,11 @@ SEQUANA
Cokelaer et al, (2017), 'Sequana': a Set of Snakemake NGS pipelines, Journal of
Open Source Software, 2(16), 352, `JOSS DOI doi:10.21105/joss.00352 <https://joss.theoj.org/papers/10.21105/joss.00352>`_

For the **genome coverage** tool (sequana_coverage): Dimitri Desvillechabrol, Christiane Bouchier,
Sean Kennedy, Thomas Cokelaer. Sequana coverage: detection and characterization of genomic
variations using running median and mixture models. GigaScience, 7(12), 2018.
https://doi.org/10.1093/gigascience/giy110
Also available on bioRxiv (DOI: http://biorxiv.org/content/early/2016/12/08/092478)

For **Sequanix**: Dimitri Desvillechabrol, Rachel Legendre, Claire Rioualen,
Christiane Bouchier, Jacques van Helden, Sean Kennedy, Thomas Cokelaer.
For the **genome coverage** tool (sequana_coverage): Desvillechabrol et al, 2018:
detection and characterization of genomic variations using running median and
mixture models. GigaScience, 7(12), 2018. https://doi.org/10.1093/gigascience/giy110

For **Sequanix**: Desvillechabrol et al.
Sequanix: A Dynamic Graphical Interface for Snakemake Workflows
Bioinformatics, bty034, https://doi.org/10.1093/bioinformatics/bty034
Also available on bioRxiv (DOI: https://doi.org/10.1101/162701)
@@ -251,6 +248,8 @@ Version Description
implementation. Remove useless VCF code.
* Fixes rnadiff HTML report
* speedup kegg enrichment using multiprocess
* Allow sequana_taxonomy to download toydb and viruses_masking DBs from
zenodo
0.16.9 * Major fix on PCA and add batch effect plots in RNAdiff analysis
* count matrix and DESeq2 output files' headers fixed with missing index
(no impact on analysis but only for those willing to use the CSV files
@@ -331,8 +330,7 @@ Version Description
* Move sequana_lane_merging into a subcommand (sequana lane_merging)
* General cleanup of documentation, test and links to pipelines
* add new ribodesigner subcommand
0.12.7 * Fix memory leak in len() of FastA class
0.12.6 * remove some rules now in https://github.com/sequana/sequana-wrappers
0.12.5 * refactorisation of VCF tools/modules to use vcfpy instead of pyVCF
0.12.4 * complete change log before 0.12.4 in the github /doc/Changelog.txt
0.12.X * remove some rules now in https://github.com/sequana/sequana-wrappers
* refactorisation of VCF tools/modules to use vcfpy instead of pyVCF
* complete change log before 0.12.4 in the github /doc/Changelog.txt
========= ==========================================================================
7 changes: 4 additions & 3 deletions doc/installation.rst
@@ -7,7 +7,7 @@ Here below are the instructions to install Sequana. There are different ways (so

If you want the latest version of Sequana, you should install it from source (see :ref:`github_method`). Otherwise, you can install a release of **Sequana** from the Pypi website (using **pip**). Note that for pipelines, which are now independent Python packages, we also use Pypi releases. However, third-party dependencies (not Python) should be installed manually. Most of them are provided through **Anaconda** channels. See the :ref:`installation_conda` Section for details on how to set up Conda.

For instance, if you want to use the sequana_fastqc pipelinem you must install **fastqc** yourself, which is not a
For instance, if you want to use the sequana_fastqc pipeline you must install **fastqc** yourself, which is not a
Python package.

If you just want to test **Sequana** or **Sequanix** (see note here below) or one of the Sequana
@@ -29,6 +29,7 @@ standalone, we also provide **Singularity** containers as explained in the
https://github.com/sequana/variant_calling. This way, you can install
Sequana quite easily using pip.


.. topic:: Sequanix

Sequanix has now its own repository here: https://github.com/sequana/sequanix and should
@@ -38,7 +39,7 @@ standalone, we also provide **Singularity** containers as explained in the
Latest recommended installation method
======================================

Sequana is maintained under Python 3.8 and above (Dev 2023).
Sequana is maintained under Python 3.8 and above (Dec 2023).

We strongly recommend to use a virtual environment so that (i)
you can install all requirements without root permissions and (ii) you do
@@ -218,7 +219,7 @@ instance).
Singularity/Apptainer
======================

We maintain a version of sequana within the Damona project.
We maintain a version of sequana within the https://damona.readthedocs.io project.

You can download e.g version 0.16.2 and use it as follows::

4 changes: 2 additions & 2 deletions doc/tutorial.rst
@@ -18,7 +18,7 @@ The standalone Sequana
New since version 0.9.0. We are a single entry point for a set of tools used in
pipelines or as standalone applications. You can type::

sequana --help
sequana --help


to get the list of applications. Would you need completion, this is possible
@@ -39,7 +39,7 @@ and then source the file in ~/.bashrc.::
The fastqc pipeline
--------------------

The following example will show how to run the fastqc pipeline
The following example will show how to run the fastqc pipeline
(https://github.com/sequana/fastqc) on a pair of
FastQ files. The data comes from a sequencing (using HiSeq technology) of a
Measles virus. For testing purposes, you can download :download:`R1
2 changes: 1 addition & 1 deletion sequana/__init__.py
@@ -60,10 +60,10 @@ def get_package_version(package_name):
from .homer import Homer
from .idr import IDR
from .itol import ITOL
from .kraken.downloads import KrakenDownload
from .kraken.analysis import (
KrakenAnalysis,
KrakenDB,
KrakenDownload,
KrakenPipeline,
KrakenResults,
KrakenSequential,
56 changes: 1 addition & 55 deletions sequana/kraken/analysis.py
@@ -33,7 +33,6 @@
"KrakenResults",
"KrakenPipeline",
"KrakenAnalysis",
"KrakenDownload",
"KrakenSequential",
"KrakenDB",
]
@@ -712,7 +711,7 @@ class KrakenPipeline(object):
either download a database from https://ccb.jhu.edu/software/kraken/
or use this class to download a toy example that will
be stored in e.g .config/sequana under Unix platforms.
See :class:`KrakenDownload`.
See :class:`~sequana.kraken.downloads.KrakenDownload`.
.. seealso:: We provide a standalone application of this class, which is
called sequana_taxonomy and can be used within a command shell.
@@ -1317,56 +1316,3 @@ def run(self, dbname="multiple", output_prefix="kraken_final"):
        return summary


class KrakenDownload(object):
    """Utility to download Kraken DB and place them in a local directory
    ::
        from sequana import KrakenDownload
        kd = KrakenDownload()
        kd.download('toydb')
    """

    def __init__(self, output_dir=None):
        if output_dir is None:
            self.output_dir = Path(f"{sequana_config_path}") / "kraken2_dbs"
        else:
            self.output_dir = Path(output_dir)

    def download(self, name, verbose=True):
        if name == "toydb":
            self._download_kraken2_toydb(verbose=verbose)
        else:
            raise ValueError("name must be 'toydb' for now")

    def _download_kraken2_toydb(self, verbose=True):
        """Download the kraken DB toy example from sequana_data into
        .config/sequana directory
        Checks the md5 checksums. About 32Mb of data
        """
        base = self.output_dir / "toydb"
        base.mkdir(exist_ok=True, parents=True)

        baseurl = "https://github.com/sequana/data/raw/main/"

        # download only if required
        logger.info("Downloading the database into %s" % base)

        md5sums = [
            "31f4b20f9e5c6beb9e1444805264a6e5",
            "733f7587f9c0c7339666d5906ec6fcd3",
            "7bb56a0f035b27839fb5c18590b79263",
        ]

        filenames = ["hash.k2d", "opts.k2d", "taxo.k2d"]

        for filename, md5sum in zip(filenames, md5sums):
            url = baseurl + f"kraken2_toydb/{filename}"
            filename = base / filename
            if os.path.exists(filename) and md5(filename) == md5sum:
                logger.warning(f"{filename} already present with good md5sum")
            else:
                logger.info(f"Downloading {url}")
                wget(url, filename)
89 changes: 89 additions & 0 deletions sequana/kraken/downloads.py
@@ -0,0 +1,89 @@
# This file is part of Sequana software
#
# Copyright (c) 2016-2021 - Sequana Development Team
#
#
# Distributed under the terms of the 3-clause BSD license.
# The full license is in the LICENSE file, distributed with this software.
#
# website: https://github.com/sequana/sequana
# documentation: http://sequana.readthedocs.io
#
##############################################################################
import os
from pathlib import Path

import colorlog
from easydev import md5

from sequana import sequana_config_path
from sequana.misc import wget

logger = colorlog.getLogger(__name__)


__all__ = [
"KrakenDownload",
]


class KrakenDownload(object):
    """Utility to download Kraken DB and place them in a local directory
    ::
        from sequana import KrakenDownload
        kd = KrakenDownload()
        kd.download('toydb')
    """

    def __init__(self, output_dir=None):
        if output_dir is None: #pragma: no cover
            self.output_dir = Path(f"{sequana_config_path}") / "kraken2_dbs"
        else:
            self.output_dir = Path(output_dir)

    def download(self, name):

        assert name in ["viruses_masking:v21.1.1", "toydb"]

        base = self.output_dir / f"{name}"
        base.mkdir(exist_ok=True, parents=True)

        if name == "viruses_masking:v21.1.1": #pragma: no cover
            links = [
                "https://zenodo.org/records/10826105/files/hash.k2d",
                "https://zenodo.org/records/10826105/files/opts.k2d",
                "https://zenodo.org/records/10826105/files/taxo.k2d",
            ]

            md5sums = [
                "a159efd713abd151d7dfc78327ae47f9",
                "dc786f571c76d1c0c568c6dd7a701160",
                "6fde7647f2cc02499035dfef5f615eab",
            ]
        elif name == "toydb":
            links = [
                "https://zenodo.org/records/10829308/files/hash.k2d",
                "https://zenodo.org/records/10829308/files/opts.k2d",
                "https://zenodo.org/records/10829308/files/taxo.k2d",
            ]
            md5sums = [
                "31f4b20f9e5c6beb9e1444805264a6e5",
                "733f7587f9c0c7339666d5906ec6fcd3",
                "7bb56a0f035b27839fb5c18590b79263",
            ]


        for link, md5sum in zip(links, md5sums):

            basename = link.split("/")[-1]

            filename = base / basename
            if os.path.exists(filename) and md5(filename) == md5sum:
                logger.warning(f"{filename} already present with expected md5sum")
            else:
                logger.info(f"Downloading {link}")
                wget(link, filename)
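
The new `download` method only fetches a file when it is missing or its MD5 differs from the expected value. The sketch below illustrates that pattern with the standard library only; it is not the Sequana implementation. The names `DATABASES`, `md5_of`, `fetch_database` and the `fetch` parameter are hypothetical, and `hashlib` and `urllib.request.urlretrieve` stand in for `easydev.md5` and `sequana.misc.wget`. The record IDs and checksums are copied from the diff above.

```python
import hashlib
import urllib.request
from pathlib import Path

# Zenodo record IDs and MD5 checksums copied from the diff above.
# The table layout itself is an assumption made for this sketch.
DATABASES = {
    "toydb": {
        "record": "10829308",
        "md5sums": {
            "hash.k2d": "31f4b20f9e5c6beb9e1444805264a6e5",
            "opts.k2d": "733f7587f9c0c7339666d5906ec6fcd3",
            "taxo.k2d": "7bb56a0f035b27839fb5c18590b79263",
        },
    },
    "viruses_masking:v21.1.1": {
        "record": "10826105",
        "md5sums": {
            "hash.k2d": "a159efd713abd151d7dfc78327ae47f9",
            "opts.k2d": "dc786f571c76d1c0c568c6dd7a701160",
            "taxo.k2d": "6fde7647f2cc02499035dfef5f615eab",
        },
    },
}


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fin:
        for chunk in iter(lambda: fin.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch_database(name, output_dir, fetch=urllib.request.urlretrieve):
    """Download the files of database `name`, skipping verified ones.

    `fetch` is any callable taking (url, destination); it is injectable
    so the logic can be exercised without network access.
    Returns the list of file names that were actually fetched.
    """
    if name not in DATABASES:
        raise ValueError(f"name must be one of {sorted(DATABASES)}")
    entry = DATABASES[name]
    base = Path(output_dir) / name
    base.mkdir(parents=True, exist_ok=True)

    fetched = []
    for basename, expected in entry["md5sums"].items():
        target = base / basename
        if target.exists() and md5_of(target) == expected:
            continue  # already present with the expected checksum
        url = f"https://zenodo.org/records/{entry['record']}/files/{basename}"
        fetch(url, str(target))
        fetched.append(basename)
    return fetched
```

Keeping names, records and checksums in one table means a further database only needs a new entry, and an unknown name raises `ValueError` even when Python runs with `-O`, which strips `assert` statements.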

12 changes: 9 additions & 3 deletions sequana/scripts/taxonomy.py
@@ -28,7 +28,6 @@
import rich_click as click

from sequana import KrakenDownload, KrakenPipeline, KrakenSequential
from sequana import sequana_config_path as cfg
from sequana import sequana_config_path as scfg
from sequana import version as sequana_version
from sequana.modules_report.kraken import KrakenModule
@@ -44,13 +43,17 @@
def update_taxonomy(ctx, param, value): # pragma: no cover
    if value:
        tax = Taxonomy()
        click.echo(f"Will overwrite the local database taxonomy.dat in {cfg}")
        click.echo(f"Will overwrite the local database taxonomy.dat in {scfg}")
        tax.download_taxonomic_file(overwrite=True)
        sys.exit(0)
    return value


def download_database(ctx, param, value):

    from sequana import logger
    logger.setLevel("INFO")

    if value:
        kd = KrakenDownload()
        kd.download(value)
@@ -164,7 +167,7 @@ def check_databases(ctx, param, value):
    "--download",
    "download",
    default=None,
    type=click.Choice(["toydb"]),
    type=click.Choice(["toydb", "viruses_masking:v21.1.1"]),
    callback=download_database,
    is_eager=True,
    help="A toydb example to be downloaded.",
@@ -229,8 +232,11 @@ def main(**kwargs):

    options = AttrDict(**kwargs)

    from sequana import logger as loggers
    loggers.setLevel(options.level)
    logger.setLevel(options.level)


    fastq = [options.file1]
    if options.file2:
        fastq.append(options.file2)
2 changes: 1 addition & 1 deletion sequana/taxonomy.py
@@ -131,7 +131,7 @@ def create_taxonomy_file(self, filename="taxonomy.csv.gz"):
        df_names = self.df_names.query("key == 'scientific name'").copy()

        # first we create the CSV file
        logger.info("Creating CSV fike")
        logger.info("Creating CSV file")
        with filename.with_suffix("").open(mode="w") as fout:
            fout.write("id,parent,rank,scientific_name\n")

2 changes: 1 addition & 1 deletion sequana/viz/pca.py
@@ -44,7 +44,7 @@ class PCA(clusterisation.Cluster):
p.plot(n_components=2)
From R, a PCA is selecting the first 500 features based on variance.
From R, a PCA is selecting the first 500 features based on variance.
"""
16 changes: 16 additions & 0 deletions test/kraken/test_downloads.py
@@ -0,0 +1,16 @@

import pytest

from sequana import KrakenDownload
from . import test_dir


def test_download(tmpdir):
    # save in specific path
    p = tmpdir.mkdir("kr")
    kd = KrakenDownload(output_dir=str(p))
    kd.download("toydb")

    # redownload on purpose
    kd.download("toydb")
