Update taxonomy so as to retrieve DBs from zenodo (toydb and viruses_masking)
sequana · Mar 18, 2024 · e041d4f
1 parent 3a91102
Showing 10 changed files with 134 additions and 78 deletions.
diff --git a/README.rst b/README.rst
@@ -40,14 +40,11 @@ SEQUANA
     Cokelaer et al, (2017), 'Sequana': a Set of Snakemake NGS pipelines, Journal of
     Open Source Software, 2(16), 352, `JOSS DOI doi:10.21105/joss.00352 <https://joss.theoj.org/papers/10.21105/joss.00352>`_
 
-    For the **genome coverage** tool (sequana_coverage):  Dimitri Desvillechabrol, Christiane Bouchier,
-    Sean Kennedy, Thomas Cokelaer. Sequana coverage: detection and characterization of genomic
-    variations using running median and mixture models. GigaScience, 7(12), 2018.
-    https://doi.org/10.1093/gigascience/giy110
-    Also available on bioRxiv (DOI: http://biorxiv.org/content/early/2016/12/08/092478)
-
-    For **Sequanix**: Dimitri Desvillechabrol, Rachel Legendre, Claire Rioualen,
-    Christiane Bouchier, Jacques van Helden, Sean Kennedy, Thomas Cokelaer.
+    For the **genome coverage** tool (sequana_coverage):  Desvillechabrol et al, 2018:
+    detection and characterization of genomic variations using running median and
+    mixture models. GigaScience, 7(12), 2018. https://doi.org/10.1093/gigascience/giy110
+
+    For **Sequanix**: Desvillechabrol et al.
     Sequanix: A Dynamic Graphical Interface for Snakemake Workflows
     Bioinformatics, bty034, https://doi.org/10.1093/bioinformatics/bty034
     Also available on bioRxiv (DOI: https://doi.org/10.1101/162701)
@@ -251,6 +248,8 @@ Version   Description
             implementation. Remove useless VCF code.
           * Fixes rnadiff HTML report
           * speedup kegg enrichment using multiprocess
+          * Allow sequana_taxonomy to download toydb and viruses_masking DBs from
+            zenodo
 0.16.9    * Major fix on PCA and add batch effect plots in RNAdiff analysis
           * count matrix and DESeq2 output files' headers fixed with missing index
             (no impact on analysis but only for those willing to use the CSV files
@@ -331,8 +330,7 @@ Version   Description
           * Move sequana_lane_merging into a subcommand (sequana lane_merging)
           * General cleanup of documentation, test and links to pipelines
           * add new ribodesigner subcommand
-0.12.7    * Fix memory leak in len() of FastA class
-0.12.6    * remove some rules now in https://github.com/sequana/sequana-wrappers
-0.12.5    * refactorisation of VCF tools/modules to use vcfpy instead of pyVCF
-0.12.4    * complete change log before 0.12.4 in the github /doc/Changelog.txt
+0.12.X    * remove some rules now in https://github.com/sequana/sequana-wrappers
+          * refactorisation of VCF tools/modules to use vcfpy instead of pyVCF
+          * complete change log before 0.12.4 in the github /doc/Changelog.txt
 ========= ==========================================================================
diff --git a/doc/installation.rst b/doc/installation.rst
@@ -7,7 +7,7 @@ Here below are the instructions to install Sequana. There are different ways (so
 
 If you want the latest version of Sequana, you should install it from source (see :ref:`github_method`). Otherwise, you can install a release of **Sequana** from the Pypi website (using **pip**). Note that for pipelines, which are now independent Python packages, we also use Pypi releases. However, third-party dependencies (not Python) should be installed manually. Most of them are provided through **Anaconda** channels.  See the :ref:`installation_conda` Section for details on how to set up Conda.
 
-For instance, if you want to use the sequana_fastqc pipelinem you must install **fastqc** yourself, which is not a
+For instance, if you want to use the sequana_fastqc pipeline you must install **fastqc** yourself, which is not a
 Python package.
 
 If you just want to test **Sequana** or **Sequanix** (see note here below) or one of the Sequana
@@ -29,6 +29,7 @@ standalone, we also provide **Singularity** containers as explained in the
     https://github.com/sequana/variant_calling. This way, you can install
     Sequana quite easily using pip.
 
+
 .. topic:: Sequanix
 
     Sequanix has now its own repository here: https://github.com/sequana/sequanix and should
@@ -38,7 +39,7 @@ standalone, we also provide **Singularity** containers as explained in the
 Latest recommended installation method
 ======================================
 
-Sequana is maintained under Python 3.8 and above  (Dev 2023).
+Sequana is maintained under Python 3.8 and above  (Dec 2023).
 
 We strongly recommend to use a virtual environment so that (i)
 you can install all requirements without root permissions and (ii) you do
@@ -218,7 +219,7 @@ instance).
 Singularity/Apptainer
 ======================
 
-We maintain a version of sequana within the Damona project.
+We maintain a version of sequana within the https://damona.readthedocs.io project.
 
 You can download e.g version 0.16.2 and use it as follows::
 

diff --git a/doc/tutorial.rst b/doc/tutorial.rst
@@ -18,7 +18,7 @@ The standalone Sequana
 New since version 0.9.0. We are a single entry point for a set of tools used in
 pipelines or as standalone applications. You can type::
 
-    sequana --help 
+    sequana --help
 
 
 to get the list of applications. Would you need completion, this is possible
@@ -39,7 +39,7 @@ and then source the file in ~/.bashrc.::
 The fastqc pipeline
 --------------------
 
-The following example will show how to run the fastqc pipeline 
+The following example will show how to run the fastqc pipeline
 (https://github.com/sequana/fastqc) on a pair of
 FastQ files. The data comes from a sequencing (using HiSeq technology) of a
 Measles virus. For testing purposes, you can download :download:`R1

diff --git a/sequana/__init__.py b/sequana/__init__.py
@@ -60,10 +60,10 @@ def get_package_version(package_name):
 from .homer import Homer
 from .idr import IDR
 from .itol import ITOL
+from .kraken.downloads import KrakenDownload
 from .kraken.analysis import (
     KrakenAnalysis,
     KrakenDB,
-    KrakenDownload,
     KrakenPipeline,
     KrakenResults,
     KrakenSequential,

diff --git a/sequana/kraken/analysis.py b/sequana/kraken/analysis.py
@@ -33,7 +33,6 @@
     "KrakenResults",
     "KrakenPipeline",
     "KrakenAnalysis",
-    "KrakenDownload",
     "KrakenSequential",
     "KrakenDB",
 ]
@@ -712,7 +711,7 @@ class KrakenPipeline(object):
         either download a database from https://ccb.jhu.edu/software/kraken/
         or use this class to download a toy example that will
         be stored in e.g .config/sequana under Unix platforms.
-        See :class:`KrakenDownload`.
+        See :class:`~sequana.kraken.downloads.KrakenDownload`.
 
     .. seealso:: We provide a standalone application of this class, which is
         called sequana_taxonomy and can be used within a command shell.
@@ -1317,56 +1316,3 @@ def run(self, dbname="multiple", output_prefix="kraken_final"):
         return summary
 
 
-class KrakenDownload(object):
-    """Utility to download Kraken DB and place them in a local directory
-
-    ::
-
-        from sequana import KrakenDownload
-        kd = KrakenDownload()
-        kd.download('toydb')
-
-    """
-
-    def __init__(self, output_dir=None):
-        if output_dir is None:
-            self.output_dir = Path(f"{sequana_config_path}") / "kraken2_dbs"
-        else:
-            self.output_dir = Path(output_dir)
-
-    def download(self, name, verbose=True):
-        if name == "toydb":
-            self._download_kraken2_toydb(verbose=verbose)
-        else:
-            raise ValueError("name must be 'toydb' for now")
-
-    def _download_kraken2_toydb(self, verbose=True):
-        """Download the kraken DB toy example from sequana_data into
-        .config/sequana directory
-
-        Checks the md5 checksums. About 32Mb of data
-        """
-        base = self.output_dir / "toydb"
-        base.mkdir(exist_ok=True, parents=True)
-
-        baseurl = "https://github.com/sequana/data/raw/main/"
-
-        # download only if required
-        logger.info("Downloading the database into %s" % base)
-
-        md5sums = [
-            "31f4b20f9e5c6beb9e1444805264a6e5",
-            "733f7587f9c0c7339666d5906ec6fcd3",
-            "7bb56a0f035b27839fb5c18590b79263",
-        ]
-
-        filenames = ["hash.k2d", "opts.k2d", "taxo.k2d"]
-
-        for filename, md5sum in zip(filenames, md5sums):
-            url = baseurl + f"kraken2_toydb/{filename}"
-            filename = base / filename
-            if os.path.exists(filename) and md5(filename) == md5sum:
-                logger.warning(f"{filename} already present with good md5sum")
-            else:
-                logger.info(f"Downloading {url}")
-                wget(url, filename)
diff --git a/sequana/kraken/downloads.py b/sequana/kraken/downloads.py
@@ -0,0 +1,89 @@
+#  This file is part of Sequana software
+#
+#  Copyright (c) 2016-2021 - Sequana Development Team
+#
+#
+#  Distributed under the terms of the 3-clause BSD license.
+#  The full license is in the LICENSE file, distributed with this software.
+#
+#  website: https://github.com/sequana/sequana
+#  documentation: http://sequana.readthedocs.io
+#
+##############################################################################
+import os
+from pathlib import Path
+
+import colorlog
+from easydev import md5
+
+from sequana import sequana_config_path
+from sequana.misc import wget
+
+logger = colorlog.getLogger(__name__)
+
+
+__all__ = [
+    "KrakenDownload",
+]
+
+
+class KrakenDownload(object):
+    """Utility to download Kraken DB and place them in a local directory
+
+    ::
+
+        from sequana import KrakenDownload
+        kd = KrakenDownload()
+        kd.download('toydb')
+
+    """
+
+    def __init__(self, output_dir=None):
+        if output_dir is None:  # pragma: no cover
+            self.output_dir = Path(f"{sequana_config_path}") / "kraken2_dbs"
+        else:
+            self.output_dir = Path(output_dir)
+
+    def download(self, name):
+
+        assert name in ["viruses_masking:v21.1.1", "toydb"]
+
+        base = self.output_dir / f"{name}"
+        base.mkdir(exist_ok=True, parents=True)
+
+        if name == "viruses_masking:v21.1.1":  # pragma: no cover
+            links = [
+                "https://zenodo.org/records/10826105/files/hash.k2d",
+                "https://zenodo.org/records/10826105/files/opts.k2d",
+                "https://zenodo.org/records/10826105/files/taxo.k2d",
+            ]
+
+            md5sums = [
+                "a159efd713abd151d7dfc78327ae47f9",
+                "dc786f571c76d1c0c568c6dd7a701160",
+                "6fde7647f2cc02499035dfef5f615eab",
+            ]
+        elif name == "toydb":
+            links = [
+                "https://zenodo.org/records/10829308/files/hash.k2d",
+                "https://zenodo.org/records/10829308/files/opts.k2d",
+                "https://zenodo.org/records/10829308/files/taxo.k2d",
+            ]
+            md5sums = [
+                "31f4b20f9e5c6beb9e1444805264a6e5",
+                "733f7587f9c0c7339666d5906ec6fcd3",
+                "7bb56a0f035b27839fb5c18590b79263",
+            ]
+
+
+        for link, md5sum in zip(links, md5sums):
+
+            basename = link.split("/")[-1]
+
+            filename = base / basename
+            if os.path.exists(filename) and md5(filename) == md5sum:
+                logger.warning(f"{filename} already present with expected md5sum")
+            else:
+                logger.info(f"Downloading {link}")
+                wget(link, filename)
+
diff --git a/sequana/scripts/taxonomy.py b/sequana/scripts/taxonomy.py
@@ -28,7 +28,6 @@
 import rich_click as click
 
 from sequana import KrakenDownload, KrakenPipeline, KrakenSequential
-from sequana import sequana_config_path as cfg
 from sequana import sequana_config_path as scfg
 from sequana import version as sequana_version
 from sequana.modules_report.kraken import KrakenModule
@@ -44,13 +43,17 @@
 def update_taxonomy(ctx, param, value):  # pragma: no cover
     if value:
         tax = Taxonomy()
-        click.echo(f"Will overwrite the local database taxonomy.dat in {cfg}")
+        click.echo(f"Will overwrite the local database taxonomy.dat in {scfg}")
         tax.download_taxonomic_file(overwrite=True)
         sys.exit(0)
     return value
 
 
 def download_database(ctx, param, value):
+
+    from sequana import logger
+    logger.setLevel("INFO")
+
     if value:
         kd = KrakenDownload()
         kd.download(value)
@@ -164,7 +167,7 @@ def check_databases(ctx, param, value):
     "--download",
     "download",
     default=None,
-    type=click.Choice(["toydb"]),
+    type=click.Choice(["toydb", "viruses_masking:v21.1.1"]),
     callback=download_database,
     is_eager=True,
     help="A toydb example to be downloaded.",
@@ -229,8 +232,11 @@ def main(**kwargs):
 
     options = AttrDict(**kwargs)
 
+    from sequana import logger as loggers
+    loggers.setLevel(options.level)
     logger.setLevel(options.level)
 
+
     fastq = [options.file1]
     if options.file2:
         fastq.append(options.file2)

diff --git a/sequana/taxonomy.py b/sequana/taxonomy.py
@@ -131,7 +131,7 @@ def create_taxonomy_file(self, filename="taxonomy.csv.gz"):
         df_names = self.df_names.query("key == 'scientific name'").copy()
 
         # first we create the CSV file
-        logger.info("Creating CSV fike")
+        logger.info("Creating CSV file")
         with filename.with_suffix("").open(mode="w") as fout:
             fout.write("id,parent,rank,scientific_name\n")
 

diff --git a/sequana/viz/pca.py b/sequana/viz/pca.py
@@ -44,7 +44,7 @@ class PCA(clusterisation.Cluster):
         p.plot(n_components=2)
 
 
-        From R, a PCA is selecting the first 500 features based on variance.
+    From R, a PCA is selecting the first 500 features based on variance.
 
 
     """

diff --git a/test/kraken/test_downloads.py b/test/kraken/test_downloads.py
@@ -0,0 +1,16 @@
+
+import pytest
+
+from sequana import KrakenDownload
+from . import test_dir
+
+
+def test_download(tmpdir):
+    # save in specific path
+    p = tmpdir.mkdir("kr")
+    kd = KrakenDownload(output_dir=str(p))
+    kd.download("toydb")
+
+    # redownload on purpose
+    kd.download("toydb")
+
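
The download logic added in sequana/kraken/downloads.py skips any file that is already present with the expected md5sum, which is what the "redownload on purpose" step in the test exercises. Below is a minimal, standalone sketch of that pattern, not the Sequana implementation. The names `file_md5` and `fetch_verified` are hypothetical, the `fetch` callable stands in for `sequana.misc.wget`, and the check after the download is an addition that the commit itself does not perform.

```python
import hashlib
from pathlib import Path
from typing import Callable, Iterable, List, Tuple


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch_verified(
    items: Iterable[Tuple[str, str]],
    output_dir: Path,
    fetch: Callable[[str, Path], None],
) -> List[str]:
    """Fetch each (url, md5sum) pair unless an intact local copy exists.

    Returns the URLs that were actually fetched.
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    fetched = []
    for url, expected in items:
        # Same naming rule as in the commit: last component of the URL.
        target = output_dir / url.rsplit("/", 1)[-1]

        if target.exists() and file_md5(target) == expected:
            continue  # local copy already matches the expected checksum

        fetch(url, target)

        # Assumption: verifying after the download is not part of the commit.
        if file_md5(target) != expected:
            raise ValueError(f"md5 mismatch for {target}")
        fetched.append(url)
    return fetched
```

Passing the downloader as a callable keeps the sketch testable without network access: a test can supply a function that writes known bytes, call `fetch_verified` twice, and check that the second call fetches nothing.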