Support draft assemblies #97

muffato · 2024-05-09T09:16:26Z

On this branch, there is no input Yaml file. The only mandatory parameters are:

- Species name / taxon_id (--taxon)
- Assembly (--fasta)
- Sample sheet (--input) to list the read files

--accession is optional and is used to pull assembly information from ENA into the blobDir's meta.json.

I haven't restructured the pipeline much. All the blobtools commands at the end still require a yaml file. My solution is to add a script at the beginning of the pipeline that generates the minimal yaml file required (as per #77 (comment)). This still lets the input-check sub-workflow collect some of the parameters cleanly, and keeps the busco sub-workflow focused on running busco + blastp.
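
For illustration only, here is a rough sketch of what generating a minimal yaml file from the command-line parameters could look like. It is not the script added in this PR: the key names (`assembly`, `file`, `accession`, `taxon`, `name`, `taxid`) and both function names are hypothetical, and the serialiser only handles nested dictionaries of scalars.

```python
import json


def minimal_config(taxon_name, taxon_id, fasta, accession=None):
    """Collect the mandatory parameters (and the optional accession) in a nested dict.

    The key names are hypothetical; they are not taken from the pipeline.
    """
    assembly = {"file": fasta}
    if accession is not None:
        assembly["accession"] = accession
    return {"assembly": assembly, "taxon": {"name": taxon_name, "taxid": taxon_id}}


def to_yaml(data, indent=0):
    """Serialise a nested dict of scalars as block-style YAML.

    Scalars are written with json.dumps, which yields valid YAML scalars
    (strings come out double-quoted). Lists are not handled.
    """
    pad = " " * indent
    lines = []
    for key, value in data.items():
        if isinstance(value, dict):
            lines.append(f"{pad}{key}:")
            lines.append(to_yaml(value, indent + 2))
        else:
            lines.append(f"{pad}{key}: {json.dumps(value)}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_yaml(minimal_config("Homo sapiens", 9606, "genome.fa")))
```

Leaving the accession out of the dictionary when it is not given mirrors the fact that --accession is optional.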

Busco lineages are inferred from the taxonomy directly here. Like in the genome-note pipeline, I've moved away from using GoaT as GoaT is just a proxy to the NCBI taxonomy. This way, I can keep control of both the version of Busco and the list of lineages in the same place.
I've also introduced the --busco_lineages parameter to allow precisely selecting the lineages that are used, rather than the taxonomy-based defaults.
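
As a hedged sketch of the idea (not the pipeline's actual code), lineage selection from the taxonomy could look like the following. The mapping is a small illustrative subset, the function name is hypothetical, and treating --busco_lineages as a comma-separated string is an assumption.

```python
# Illustrative subset only: NCBI taxon_id -> BUSCO lineage name (odb10 naming).
LINEAGE_BY_TAXON = {
    40674: "mammalia_odb10",
    7742: "vertebrata_odb10",
    33208: "metazoa_odb10",
    2759: "eukaryota_odb10",
    2: "bacteria_odb10",
    2157: "archaea_odb10",
}


def select_lineages(ancestry, override=None):
    """Return the BUSCO lineages to run, most specific first.

    ancestry: taxon_ids from the species up to the root, most specific first.
    override: optional comma-separated lineage names (assumed format); when
              given, it replaces the taxonomy-based defaults entirely.
    """
    if override:
        return [name.strip() for name in override.split(",") if name.strip()]
    return [LINEAGE_BY_TAXON[t] for t in ancestry if t in LINEAGE_BY_TAXON]
```

Keeping the mapping next to the code that uses it is what allows the Busco version and the list of lineages to be controlled in the same place.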

Still a draft for now as I want to review /nfs/team135/yy5/btk_config/taxonomiser_v2.py and maybe incorporate some elements of it.

github-actions · 2024-05-09T09:18:06Z

`nf-core lint` overall result: Passed ✅

Posted for pipeline commit 8c70c77

✅ 134 tests passed
❔ 24 tests were ignored

❔ Tests ignored:

files_exist - File is ignored: CODE_OF_CONDUCT.md
files_exist - File is ignored: assets/nf-core-blobtoolkit_logo_light.png
files_exist - File is ignored: docs/images/nf-core-blobtoolkit_logo_light.png
files_exist - File is ignored: docs/images/nf-core-blobtoolkit_logo_dark.png
files_exist - File is ignored: .github/ISSUE_TEMPLATE/config.yml
files_exist - File is ignored: .github/workflows/awstest.yml
files_exist - File is ignored: .github/workflows/awsfulltest.yml
files_exist - File is ignored: conf/igenomes.config
nextflow_config - Config variable ignored: manifest.name
nextflow_config - Config variable ignored: manifest.homePage
files_unchanged - File ignored due to lint config: CODE_OF_CONDUCT.md
files_unchanged - File ignored due to lint config: LICENSE or LICENSE.md or LICENCE or LICENCE.md
files_unchanged - File ignored due to lint config: .github/ISSUE_TEMPLATE/bug_report.yml
files_unchanged - File does not exist: .github/ISSUE_TEMPLATE/config.yml
files_unchanged - File ignored due to lint config: .github/PULL_REQUEST_TEMPLATE.md
files_unchanged - File ignored due to lint config: .github/workflows/branch.yml
files_unchanged - File ignored due to lint config: .github/workflows/linting.yml
files_unchanged - File ignored due to lint config: assets/nf-core-blobtoolkit_logo_light.png
files_unchanged - File ignored due to lint config: docs/images/nf-core-blobtoolkit_logo_light.png
files_unchanged - File ignored due to lint config: docs/images/nf-core-blobtoolkit_logo_dark.png
files_unchanged - File ignored due to lint config: lib/NfcoreTemplate.groovy
actions_awstest - 'awstest.yml' workflow not found: /home/runner/work/blobtoolkit/blobtoolkit/.github/workflows/awstest.yml
template_strings - template_strings
merge_markers - merge_markers

✅ Tests passed:

files_exist - File found: .gitattributes
files_exist - File found: .gitignore
files_exist - File found: .nf-core.yml
files_exist - File found: .editorconfig
files_exist - File found: .prettierignore
files_exist - File found: .prettierrc.yml
files_exist - File found: CHANGELOG.md
files_exist - File found: CITATIONS.md
files_exist - File found: LICENSE or LICENSE.md or LICENCE or LICENCE.md
files_exist - File found: nextflow_schema.json
files_exist - File found: nextflow.config
files_exist - File found: README.md
files_exist - File found: .github/.dockstore.yml
files_exist - File found: .github/CONTRIBUTING.md
files_exist - File found: .github/ISSUE_TEMPLATE/bug_report.yml
files_exist - File found: .github/ISSUE_TEMPLATE/feature_request.yml
files_exist - File found: .github/PULL_REQUEST_TEMPLATE.md
files_exist - File found: .github/workflows/branch.yml
files_exist - File found: .github/workflows/ci.yml
files_exist - File found: .github/workflows/linting_comment.yml
files_exist - File found: .github/workflows/linting.yml
files_exist - File found: assets/email_template.html
files_exist - File found: assets/email_template.txt
files_exist - File found: assets/sendmail_template.txt
files_exist - File found: conf/modules.config
files_exist - File found: conf/test.config
files_exist - File found: conf/test_full.config
files_exist - File found: docs/output.md
files_exist - File found: docs/README.md
files_exist - File found: docs/README.md
files_exist - File found: docs/usage.md
files_exist - File found: lib/nfcore_external_java_deps.jar
files_exist - File found: lib/NfcoreTemplate.groovy
files_exist - File found: lib/Utils.groovy
files_exist - File found: lib/WorkflowMain.groovy
files_exist - File found: main.nf
files_exist - File found: assets/multiqc_config.yml
files_exist - File found: conf/base.config
files_exist - File found: lib/WorkflowBlobtoolkit.groovy
files_exist - File found: modules.json
files_exist - File found: pyproject.toml
files_exist - File not found check: Singularity
files_exist - File not found check: parameters.settings.json
files_exist - File not found check: pipeline_template.yml
files_exist - File not found check: .nf-core.yaml
files_exist - File not found check: bin/markdown_to_html.r
files_exist - File not found check: conf/aws.config
files_exist - File not found check: .github/workflows/push_dockerhub.yml
files_exist - File not found check: .github/ISSUE_TEMPLATE/bug_report.md
files_exist - File not found check: .github/ISSUE_TEMPLATE/feature_request.md
files_exist - File not found check: docs/images/nf-core-blobtoolkit_logo.png
files_exist - File not found check: .markdownlint.yml
files_exist - File not found check: .yamllint.yml
files_exist - File not found check: lib/Checks.groovy
files_exist - File not found check: lib/Completion.groovy
files_exist - File not found check: lib/Workflow.groovy
files_exist - File not found check: .travis.yml
nextflow_config - Config variable found: manifest.nextflowVersion
nextflow_config - Config variable found: manifest.description
nextflow_config - Config variable found: manifest.version
nextflow_config - Config variable found: timeline.enabled
nextflow_config - Config variable found: trace.enabled
nextflow_config - Config variable found: report.enabled
nextflow_config - Config variable found: dag.enabled
nextflow_config - Config variable found: process.cpus
nextflow_config - Config variable found: process.memory
nextflow_config - Config variable found: process.time
nextflow_config - Config variable found: params.outdir
nextflow_config - Config variable found: params.input
nextflow_config - Config variable found: params.validationShowHiddenParams
nextflow_config - Config variable found: params.validationSchemaIgnoreParams
nextflow_config - Config variable found: manifest.mainScript
nextflow_config - Config variable found: timeline.file
nextflow_config - Config variable found: trace.file
nextflow_config - Config variable found: report.file
nextflow_config - Config variable found: dag.file
nextflow_config - Config variable (correctly) not found: params.nf_required_version
nextflow_config - Config variable (correctly) not found: params.container
nextflow_config - Config variable (correctly) not found: params.singleEnd
nextflow_config - Config variable (correctly) not found: params.igenomesIgnore
nextflow_config - Config variable (correctly) not found: params.name
nextflow_config - Config variable (correctly) not found: params.enable_conda
nextflow_config - Config timeline.enabled had correct value: true
nextflow_config - Config report.enabled had correct value: true
nextflow_config - Config trace.enabled had correct value: true
nextflow_config - Config dag.enabled had correct value: true
nextflow_config - Config dag.file ended with .html
nextflow_config - Config variable manifest.nextflowVersion started with >= or !>=
nextflow_config - Config manifest.version ends in dev: 0.6.0-dev
nextflow_config - Config params.custom_config_version is set to master
nextflow_config - Config params.custom_config_base is set to https://raw.githubusercontent.com/nf-core/configs/master
nextflow_config - Lines for loading custom profiles found
nextflow_config - nextflow.config contains configuration profile test
files_unchanged - .gitattributes matches the template
files_unchanged - .prettierrc.yml matches the template
files_unchanged - .github/.dockstore.yml matches the template
files_unchanged - .github/CONTRIBUTING.md matches the template
files_unchanged - .github/ISSUE_TEMPLATE/feature_request.yml matches the template
files_unchanged - .github/workflows/linting_comment.yml matches the template
files_unchanged - assets/email_template.html matches the template
files_unchanged - assets/email_template.txt matches the template
files_unchanged - assets/sendmail_template.txt matches the template
files_unchanged - docs/README.md matches the template
files_unchanged - lib/nfcore_external_java_deps.jar matches the template
files_unchanged - .gitignore matches the template
files_unchanged - .prettierignore matches the template
files_unchanged - pyproject.toml matches the template
actions_ci - '.github/workflows/ci.yml' is triggered on expected events
actions_ci - '.github/workflows/ci.yml' checks minimum NF version
readme - README Nextflow minimum version badge matched config. Badge: 23.04.0, Config: 23.04.0
readme - README Zenodo placeholder was replaced with DOI.
pipeline_todos - No TODO strings found
pipeline_name_conventions - Name adheres to nf-core convention
schema_lint - Schema lint passed
schema_lint - Schema title + description lint passed
schema_lint - Input mimetype lint passed: 'text/csv'
schema_params - Schema matched params returned from nextflow config
system_exit - No System.exit calls found
actions_schema_validation - Workflow validation passed: branch.yml
actions_schema_validation - Workflow validation passed: fix-linting.yml
actions_schema_validation - Workflow validation passed: linting.yml
actions_schema_validation - Workflow validation passed: clean-up.yml
actions_schema_validation - Workflow validation passed: ci.yml
actions_schema_validation - Workflow validation passed: linting_comment.yml
actions_schema_validation - Workflow validation passed: release-announcements.yml
actions_schema_validation - Workflow validation passed: sanger_test_full.yml
actions_schema_validation - Workflow validation passed: sanger_test.yml
modules_json - Only installed modules found in modules.json
multiqc_config - 'assets/multiqc_config.yml' contains report_section_order
multiqc_config - 'assets/multiqc_config.yml' contains export_plots
multiqc_config - 'assets/multiqc_config.yml' contains report_comment
multiqc_config - 'assets/multiqc_config.yml' follows the ordering scheme of the minimally required plugins.
multiqc_config - 'assets/multiqc_config.yml' contains 'export_plots: true'.
modules_structure - modules directory structure is correct 'modules/nf-core/TOOL/SUBTOOL'

Run details

nf-core/tools version 2.11
Run at 2024-08-24 10:19:22

github-actions · 2024-05-20T09:42:14Z

Python linting (`black`) is failing

To keep the code consistent with lots of contributors, we run automated code consistency checks.
To fix this CI test, please run:

Install black: pip install black
Fix formatting errors in your pipeline: black .

Once you push these changes the test should pass, and you can hide this comment 👍

We highly recommend setting up Black in your code editor so that this formatting is done automatically on save. Ask about it on Slack for help!

Thanks again for your contribution!

muffato · 2024-05-24T10:48:25Z

I've added some code to achieve the goal of taxonomiser_v2.py, which is: find a taxon_id that is recognised by the NT database and the closest to the species of interest.
It's implemented very differently from the script. I leverage the taxonomy4blast.sqlite3 database that is shipped with NT and essentially lists the taxon_ids it knows about. If the species' taxon_id is not recognised, then it looks for the parent, etc.
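
A minimal sketch of that lookup, under stated assumptions: the `TaxidInfo` table and its `taxid` column are my assumption about the layout of taxonomy4blast.sqlite3, the function names are hypothetical, and the species' lineage (species first, then its ancestors) is assumed to come from elsewhere, for example the NCBI taxonomy.

```python
import sqlite3


def known_taxon_ids(conn, candidates):
    """Return the subset of candidate taxon_ids present in the database.

    Assumes a table TaxidInfo with an integer column taxid.
    """
    placeholders = ",".join("?" * len(candidates))
    query = f"SELECT taxid FROM TaxidInfo WHERE taxid IN ({placeholders})"
    return {row[0] for row in conn.execute(query, list(candidates))}


def closest_known_taxon(conn, lineage):
    """Return the first taxon_id of the lineage that the database recognises.

    lineage: the species' taxon_id first, followed by its ancestors in order.
    Returns None if no taxon_id of the lineage is recognised.
    """
    if not lineage:
        return None
    known = known_taxon_ids(conn, lineage)
    for taxon_id in lineage:
        if taxon_id in known:
            return taxon_id
    return None
```

With a real database, the connection would be opened with something like `sqlite3.connect("taxonomy4blast.sqlite3")`, the path depending on where NT is installed.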

As far as I understand the requirements, this is the last bit that was missing to complete support for draft assemblies. I'll mark this pull-request as ready.

muffato · 2024-07-10T17:48:51Z

@eeaunin, I've rebased this branch. It now includes the fixes I've made for blast.

eeaunin · 2024-08-05T13:02:45Z

I had a closer look at how -negative_taxids has been implemented in the Snakemake pipeline and it appears quite confusing. The BlobToolKit paper (https://academic.oup.com/g3journal/article/10/4/1361/6026202) says:

An optional filter excludes a configurable list of NCBI taxIDs (default: excludes query genus).

So the exclusion of taxids is supposed to be optional and configurable by the user.
BlobToolKit pipeline v1 has the mask_ids setting for excluding taxids:

https://github.com/blobtoolkit/pipeline/blob/master/v1/example.yaml

However, I couldn't find a setting for the same thing in the Snakemake pipeline v2 code. Maybe the authors just forgot to include it?

In my runs with the Snakemake pipeline, negative taxids were not used, but there are suppressed error messages about this buried in the run logs. In a run with a Plasmodium yoelii yoelii assembly, the logs contain this error (/lustre/scratch123/tol/teams/tola/users/ea10/pipeline_testing/20230215_pyoelii_asg_cobiont_check_run/btk_busco/blastn/logs/pyoelii/run_blastn.log):

BLAST Database error: Taxonomy ID(s) not found.Taxonomy ID(s) not found. This could be because the ID(s) provided are not at or below the species level. Please use get_species_taxids.sh to get taxids for nodes higher than species (see https://www.ncbi.nlm.nih.gov/books/NBK546209/).
Restarting blastn without taxid filter

So it ran into the error but then just quietly continued running. It is unclear to me what caused this error, as the taxid used there (352914) is at strain level.

In another run it has skipped using the taxid filter due to another error: /lustre/scratch123/tol/teams/grit/contamination_screen/icMagCera1/20240712_icMagCera1.20240711.hap1.fa_asg_cobiont_check_run/btk_busco/blastn/logs/icMagCera1.20240711.hap1.fa/run_blastn.log

BLAST Database error: Taxonomy filtering is not supported in v4 BLAST dbs
Restarting blastn without taxid filter

So the filtering doesn't work if the supplied database is v4 instead of v5, but this doesn't crash the Snakemake pipeline either; it just produces an error message in the logs.

I guess it would be okay if the sanger-tol/blobtoolkit pipeline used -negative_taxids in all runs with draft assemblies as long as this doesn't produce frequent crashes. But I think it would be better if the use of -negative_taxids was optional for draft assemblies.

muffato · 2024-08-22T14:29:05Z

@eeaunin, I've added a --skip_taxon_filtering flag for you. It removes the taxon filtering from all Blast searches.
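
To illustrate the effect of the flag, here is a sketch of how a blastn command line could be assembled with and without the taxon filter. The function name and the output format are hypothetical and not taken from the pipeline; `-negative_taxids` is the blastn option discussed above.

```python
def blastn_args(query, db, taxon_id, skip_taxon_filtering=False):
    """Build a blastn argument list, excluding hits from taxon_id unless skipped."""
    args = ["blastn", "-query", query, "-db", db, "-outfmt", "6"]
    if not skip_taxon_filtering:
        # Exclude hits assigned to this taxon_id (requires a v5 BLAST database).
        args += ["-negative_taxids", str(taxon_id)]
    return args
```

The resulting list can be passed directly to `subprocess.run`.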

I've rebased the branch onto the latest stable release 0.5.1

eeaunin · 2024-08-22T14:50:35Z

That's good then! I think it's fine to merge the draft_assemblies branch to dev now

muffato self-assigned this May 9, 2024

muffato mentioned this pull request May 18, 2024

Overall clean up #98

Merged

10 tasks

muffato changed the base branch from dev to clean_params May 20, 2024 09:42

muffato force-pushed the draft_assemblies branch 2 times, most recently from a049ba9 to 0e6fa8e Compare May 20, 2024 09:57

muffato requested review from eeaunin and DLBPointon May 20, 2024 10:00

muffato force-pushed the draft_assemblies branch from 0e6fa8e to 80d86de Compare May 21, 2024 09:16

muffato force-pushed the clean_params branch from fae5ecb to 9b1ecc3 Compare May 23, 2024 14:18

Base automatically changed from clean_params to dev May 23, 2024 15:02

muffato force-pushed the draft_assemblies branch from 80d86de to 4b7b3b2 Compare May 23, 2024 15:03

muffato marked this pull request as ready for review May 24, 2024 10:48

muffato added the enhancement Improvement of the existing features label May 24, 2024

muffato linked an issue Jun 1, 2024 that may be closed by this pull request

Improved generation of the summary Yaml file #77

Closed

muffato added the user request Requests made by users and public label Jun 20, 2024

muffato mentioned this pull request Jul 1, 2024

Publish peripheral data as well, even if we don't use it ourselves #99

Merged

10 tasks

muffato force-pushed the draft_assemblies branch from 413a84d to 424cdf7 Compare July 10, 2024 17:47

muffato force-pushed the draft_assemblies branch from d6ff541 to ddf7b44 Compare July 10, 2024 18:56

muffato force-pushed the draft_assemblies branch from ddf7b44 to aa82abc Compare July 18, 2024 19:53

Merge pull request #104 from sanger-tol/dev

1c0bf53

Release 0.5

eeaunin requested changes Aug 1, 2024

Review comment on docs/usage.md (resolved)

muffato added 4 commits August 15, 2024 20:08

Skip blastn if there are no chunks

e1ef451

The filter in SEQTK_SUBSEQ is not sufficient because some BLOBTOOLKIT_CHUNK further excludes masked regions

Version bump

ed44da1

New date

3053bdb

Merge pull request #109 from sanger-tol/patch

0a5b28b

Skip blastn if there are no chunks

muffato added 17 commits August 22, 2024 14:28

Generate a more complete yaml to match the one we get from blobtoolkit

4c7cef6

Update the database paths in the final meta.json

eb87193

Fill in the reads too

f4c82fb

Fill in the assembly information too

a28690f

No need to generate the reference initial yaml file

a514f2f

Switched to the newer endpoint

df6d7d8

Introduced --parameters to have flexibility regarding their order

e190bdd

Adjust the taxon_id to make sure it exists in the NT database

c37044c

All these parameters are mandatory

9963aa5

bugfix: --busco is optional

e959c04

bugfix: this should be a "path" so that the file is made available to the container

ed94111

bugfix: accept older assembly versions

5962034

These fields can be missing

121e372

Some genomes don't have organelles

8085a74

Easier to read

30af129

Release name

9dcd338

https://ncbiinsights.ncbi.nlm.nih.gov/2024/06/04/changes-ncbi-taxonomy-classifications/

8886dba

muffato force-pushed the draft_assemblies branch from aa82abc to 635c6e0 Compare August 22, 2024 13:29

Use GoaT in addition to the NCBI because GoaT also has the freshest ENA taxon_ids

70f961c

NCBI is still the first database we query

muffato force-pushed the draft_assemblies branch from 635c6e0 to cd471de Compare August 22, 2024 14:10

Added an option to skip filtering hits from the same species

e9d3a64

muffato force-pushed the draft_assemblies branch from cd471de to e9d3a64 Compare August 22, 2024 14:23

eeaunin approved these changes Aug 22, 2024

Corrected the version as I want to be sure this is really ready for the big show

8c70c77

muffato force-pushed the draft_assemblies branch from 57b041f to 8c70c77 Compare August 24, 2024 10:18

muffato merged commit 18d2daf into dev Aug 24, 2024
6 checks passed

muffato deleted the draft_assemblies branch August 24, 2024 12:02

muffato mentioned this pull request Sep 11, 2024

Release 0.6 #112

Merged

10 tasks
