Busco dev #37

Merged
merged 17 commits into from
Feb 2, 2023

Conversation

alxndrdiaz
Collaborator

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
    • If you've added a new tool - have you followed the pipeline conventions in the contribution docs?
    • If necessary, also make a PR on the nf-core/blobtoolkit branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

@alxndrdiaz alxndrdiaz added the enhancement, help wanted and testing labels Jan 20, 2023
@alxndrdiaz alxndrdiaz mentioned this pull request Jan 20, 2023
10 tasks
@alxndrdiaz alxndrdiaz self-assigned this Jan 20, 2023
Member

@muffato muffato left a comment

Oh yeah, this looks so much better! Great!
I can confirm that the test profile works for me on the farm, with just this small change below.

In terms of functionality of the subworkflow, do you think it does everything it needs to do?

conf/test.config (outdated, resolved)
added missing single quote

Co-authored-by: Matthieu Muffato <mm49@sanger.ac.uk>
@alxndrdiaz
Collaborator Author

Oh yeah, this looks so much better! Great! I can confirm that the test profile works for me on the farm, with just this small change below.

In terms of functionality of the subworkflow, do you think it does everything it needs to do?

Results from the busco_diamond subworkflow can be found in the work/ directory but they are not exported to the results directory.

Adding the following lines to conf/modules.config:

    withName: BUSCO_DIAMOND {
        publishDir = [
            path: { "${params.outdir}/blobtoolkit" },
            mode: params.publish_dir_mode,
            saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
        ]
    }

doesn't solve the problem; the following Nextflow warnings are raised:

WARN: There's no process matching config selector: CUSTOM_DUMPSOFTWAREVERSIONS
WARN: There's no process matching config selector: BUSCO_DIAMOND
WARN: There's no process matching config selector: FASTQC

I need to take a closer look at this; I'm not sure which other files might be causing these warnings.

@muffato
Member

muffato commented Jan 24, 2023

withName only works with process names. BUSCO_DIAMOND is a sub-workflow name. Something like this may work, I think:

withName: '.*.*:BUSCO_DIAMOND:.*'
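
A minimal Python sketch of why the bare sub-workflow name matches nothing while the qualified pattern may work. It assumes the selector is applied as a full-string regular expression match against a fully qualified process name; the process names below are hypothetical, and Python's re module only stands in for Nextflow's own matcher.

```python
import re


def selector_matches(pattern: str, process_name: str) -> bool:
    """Return True when the whole process name matches the selector regex."""
    return re.fullmatch(pattern, process_name) is not None


# Hypothetical fully qualified names (workflow:subworkflow:process).
names = [
    "BLOBTOOLKIT:BUSCO_DIAMOND:BUSCO",
    "BLOBTOOLKIT:BUSCO_DIAMOND:DIAMOND_BLASTP",
    "BLOBTOOLKIT:FASTQC",
]

# The pattern suggested above selects every process inside the sub-workflow.
selector = ".*.*:BUSCO_DIAMOND:.*"
matched = [name for name in names if selector_matches(selector, name)]

# The bare sub-workflow name is not the full name of any process.
bare_name_matched = [name for name in names if selector_matches("BUSCO_DIAMOND", name)]

print(matched)
print(bare_name_matched)
```

Under these assumptions, only the two processes nested in BUSCO_DIAMOND are selected, and the bare name selects none.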

@alxndrdiaz
Collaborator Author

alxndrdiaz commented Jan 24, 2023

withName only works with process names. BUSCO_DIAMOND is a sub-workflow name. Something like this may work, I think:

withName: '.*.*:BUSCO_DIAMOND:.*'

It worked. I excluded only the results from the TAR module, which are just renamed and compressed BUSCO files (required only by the EXTRACT_BUSCO_GENES module and not used outside the subworkflow):

    withName: '.*.*:BUSCO_DIAMOND:GOAT_TAXONSEARCH|BUSCO|EXTRACT_BUSCO_GENES|DIAMOND_BLASTP' {
        publishDir = [
            path: { "${params.outdir}/blobtoolkit/busco_diamond" },
            mode: params.publish_dir_mode,
            saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
        ]
    }
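
One detail of this selector may be worth double-checking: in a regular expression, `|` binds more loosely than concatenation, so the `.*.*:BUSCO_DIAMOND:` prefix applies only to GOAT_TAXONSEARCH, and the other alternatives are bare process names. The Python sketch below compares the selector as written with a grouped variant. The grouped pattern and the process names are illustrative assumptions, and full-string matching is assumed; how Nextflow treats simple versus fully qualified names is not modelled here.

```python
import re

# Selector exactly as written in the config above.
ungrouped = ".*.*:BUSCO_DIAMOND:GOAT_TAXONSEARCH|BUSCO|EXTRACT_BUSCO_GENES|DIAMOND_BLASTP"

# Hypothetical grouped variant: the prefix now applies to every alternative.
grouped = ".*:BUSCO_DIAMOND:(GOAT_TAXONSEARCH|BUSCO|EXTRACT_BUSCO_GENES|DIAMOND_BLASTP)"


def full_match(pattern: str, name: str) -> bool:
    """Full-string regex match, used here as a stand-in for selector matching."""
    return re.fullmatch(pattern, name) is not None


qualified = "BLOBTOOLKIT:BUSCO_DIAMOND:BUSCO"  # hypothetical qualified name
simple = "BUSCO"                               # simple process name

results = {
    "ungrouped vs qualified": full_match(ungrouped, qualified),
    "ungrouped vs simple": full_match(ungrouped, simple),
    "grouped vs qualified": full_match(grouped, qualified),
    "grouped vs simple": full_match(grouped, simple),
    "grouped vs TAR": full_match(grouped, "BLOBTOOLKIT:BUSCO_DIAMOND:TAR"),
}

for label, outcome in results.items():
    print(f"{label}: {outcome}")
```

In this sketch the ungrouped pattern matches the simple name BUSCO wherever it occurs, while the grouped variant is restricted to processes under BUSCO_DIAMOND and still leaves TAR out.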

When running:

nextflow run main.nf -profile test,singularity

The results folder should look something like this using tree -L 3 results/blobtoolkit/:

results/blobtoolkit/
├── busco_diamond
│   ├── GCA_922984935.2.subset-archaea_odb10-busco
│   │   ├── GCA_922984935.2.subset.fasta
│   │   └── logs
│   ├── GCA_922984935.2.subset-archaea_odb10-busco.batch_summary.txt
│   ├── GCA_922984935.2.subset-bacteria_odb10-busco
│   │   ├── GCA_922984935.2.subset.fasta
│   │   └── logs
│   ├── GCA_922984935.2.subset-bacteria_odb10-busco.batch_summary.txt
│   ├── GCA_922984935.2.subset_busco_genes.fasta
│   ├── GCA_922984935.2.subset-carnivora_odb10-busco
│   │   ├── GCA_922984935.2.subset.fasta
│   │   └── logs
│   ├── GCA_922984935.2.subset-carnivora_odb10-busco.batch_summary.txt
│   ├── GCA_922984935.2.subset-eukaryota_odb10-busco
│   │   ├── GCA_922984935.2.subset.fasta
│   │   └── logs
│   ├── GCA_922984935.2.subset-eukaryota_odb10-busco.batch_summary.txt
│   ├── GCA_922984935.2.subset-eutheria_odb10-busco
│   │   ├── GCA_922984935.2.subset.fasta
│   │   └── logs
│   ├── GCA_922984935.2.subset-eutheria_odb10-busco.batch_summary.txt
│   ├── GCA_922984935.2.subset-laurasiatheria_odb10-busco
│   │   ├── GCA_922984935.2.subset.fasta
│   │   └── logs
│   ├── GCA_922984935.2.subset-laurasiatheria_odb10-busco.batch_summary.txt
│   ├── GCA_922984935.2.subset-mammalia_odb10-busco
│   │   ├── GCA_922984935.2.subset.fasta
│   │   └── logs
│   ├── GCA_922984935.2.subset-mammalia_odb10-busco.batch_summary.txt
│   ├── GCA_922984935.2.subset-metazoa_odb10-busco
│   │   ├── GCA_922984935.2.subset.fasta
│   │   └── logs
│   ├── GCA_922984935.2.subset-metazoa_odb10-busco.batch_summary.txt
│   ├── GCA_922984935.2.subset-tetrapoda_odb10-busco
│   │   ├── GCA_922984935.2.subset.fasta
│   │   └── logs
│   ├── GCA_922984935.2.subset-tetrapoda_odb10-busco.batch_summary.txt
│   ├── GCA_922984935.2.subset.tsv
│   ├── GCA_922984935.2.subset-vertebrata_odb10-busco
│   │   ├── GCA_922984935.2.subset.fasta
│   │   └── logs
│   ├── GCA_922984935.2.subset-vertebrata_odb10-busco.batch_summary.txt
│   ├── short_summary.specific.archaea_odb10.GCA_922984935.2.subset.fasta.json
│   ├── short_summary.specific.archaea_odb10.GCA_922984935.2.subset.fasta.txt
│   ├── short_summary.specific.bacteria_odb10.GCA_922984935.2.subset.fasta.json
│   ├── short_summary.specific.bacteria_odb10.GCA_922984935.2.subset.fasta.txt
│   ├── short_summary.specific.carnivora_odb10.GCA_922984935.2.subset.fasta.json
│   ├── short_summary.specific.carnivora_odb10.GCA_922984935.2.subset.fasta.txt
│   ├── short_summary.specific.eukaryota_odb10.GCA_922984935.2.subset.fasta.json
│   ├── short_summary.specific.eukaryota_odb10.GCA_922984935.2.subset.fasta.txt
│   ├── short_summary.specific.eutheria_odb10.GCA_922984935.2.subset.fasta.json
│   ├── short_summary.specific.eutheria_odb10.GCA_922984935.2.subset.fasta.txt
│   ├── short_summary.specific.laurasiatheria_odb10.GCA_922984935.2.subset.fasta.json
│   ├── short_summary.specific.laurasiatheria_odb10.GCA_922984935.2.subset.fasta.txt
│   ├── short_summary.specific.mammalia_odb10.GCA_922984935.2.subset.fasta.json
│   ├── short_summary.specific.mammalia_odb10.GCA_922984935.2.subset.fasta.txt
│   ├── short_summary.specific.metazoa_odb10.GCA_922984935.2.subset.fasta.json
│   ├── short_summary.specific.metazoa_odb10.GCA_922984935.2.subset.fasta.txt
│   ├── short_summary.specific.tetrapoda_odb10.GCA_922984935.2.subset.fasta.json
│   ├── short_summary.specific.tetrapoda_odb10.GCA_922984935.2.subset.fasta.txt
│   ├── short_summary.specific.vertebrata_odb10.GCA_922984935.2.subset.fasta.json
│   └── short_summary.specific.vertebrata_odb10.GCA_922984935.2.subset.fasta.txt
├── mMelMel1_T1.mosdepth.global.dist.txt
├── mMelMel1_T1.mosdepth.region.dist.txt
├── mMelMel1_T1.mosdepth.summary.txt
├── mMelMel1_T1.per-base.bed.gz
├── mMelMel1_T1.per-base.bed.gz.csi
├── mMelMel1_T1.regions.bed.gz
└── mMelMel1_T1.regions.bed.gz.csi

31 directories, 39 files

@priyanka-surana
Contributor

I would not worry much about publishing results to the results folder. Once the pipeline is completed, we will update this with the final structure. For now, as long as the code works and creates the correct output in the work folder, we can move forward.

@priyanka-surana
Contributor

Are there any issues with the current code besides linting? If not, let’s merge. A lot of downstream work depends on this.

@alxndrdiaz
Collaborator Author

Are there any issues with the current code besides linting? If not, let’s merge. A lot of downstream work depends on this.

Running nf-core lint reports the following failed linting tests:

╭─ [✗] 19 Pipeline Tests Failed ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│                                                                                                                                                         │
│ nextflow_config: Config variable (incorrectly) found: params.enable_conda                                                                               │
│ nextflow_config: Config manifest.name did not begin with nf-core/: sanger-tol/blobtoolkit                                                               │
│ nextflow_config: Config variable manifest.homePage did not begin with https://github.com/nf-core/: https://github.com/sanger-tol/blobtoolkit            │
│ files_unchanged: .gitattributes does not match the template                                                                                             │
│ files_unchanged: LICENSE does not match the template                                                                                                    │
│ files_unchanged: .github/CONTRIBUTING.md does not match the template                                                                                    │
│ files_unchanged: .github/ISSUE_TEMPLATE/bug_report.yml does not match the template                                                                      │
│ files_unchanged: .github/ISSUE_TEMPLATE/feature_request.yml does not match the template                                                                 │
│ files_unchanged: .github/PULL_REQUEST_TEMPLATE.md does not match the template                                                                           │
│ files_unchanged: .github/workflows/branch.yml does not match the template                                                                               │
│ files_unchanged: .github/workflows/linting_comment.yml does not match the template                                                                      │
│ files_unchanged: .github/workflows/linting.yml does not match the template                                                                              │
│ files_unchanged: assets/email_template.txt does not match the template                                                                                  │
│ files_unchanged: assets/sendmail_template.txt does not match the template                                                                               │
│ files_unchanged: docs/README.md does not match the template                                                                                             │
│ files_unchanged: lib/NfcoreSchema.groovy does not match the template                                                                                    │
│ files_unchanged: lib/NfcoreTemplate.groovy does not match the template                                                                                  │
│ files_unchanged: .prettierignore does not match the template                                                                                            │
│ multiqc_config: 'assets/multiqc_config.yml' does not contain a matching 'report_comment'.                                                                                                 

The test using conf/test.config runs as expected and the output files are exported to the results folder.

Contributor

@priyanka-surana priyanka-surana left a comment

LGTM 👍 as long as the tests pass

@zb32
Contributor

zb32 commented Jan 26, 2023

Hi, it looks like there's a problem with the EXTRACT_BUSCO_GENES module. I ran the pipeline with the full BUSCO lineage datasets and DIAMOND_BLASTP still isn't running, because the FASTA file from EXTRACT_BUSCO_GENES is empty. I've looked at the BUSCO results for the lineage eukaryota_odb10 and there are BUSCO hits, so this file shouldn't be empty. It looks like the original command searches for files ending in .faa, but these are nested within another TAR archive.

/lustre/scratch123/tol/teams/tolit/users/zb3/blobtoolkit/work/49/62b3e7b8e51136b3ffa55ac66661e8/eukaryota_odb10/busco_sequences$ ls
fragmented_busco_sequences.tar.gz  multi_copy_busco_sequences.tar.gz  single_copy_busco_sequences.tar.gz 

@alxndrdiaz
Collaborator Author

Hi, it looks like there's a problem with the EXTRACT_BUSCO_GENES module. I ran the pipeline with the full BUSCO lineage datasets and DIAMOND_BLASTP still isn't running, because the FASTA file from EXTRACT_BUSCO_GENES is empty. I've looked at the BUSCO results for the lineage eukaryota_odb10 and there are BUSCO hits, so this file shouldn't be empty. It looks like the original command searches for files ending in .faa, but these are nested within another TAR archive.

/lustre/scratch123/tol/teams/tolit/users/zb3/blobtoolkit/work/49/62b3e7b8e51136b3ffa55ac66661e8/eukaryota_odb10/busco_sequences$ ls
fragmented_busco_sequences.tar.gz  multi_copy_busco_sequences.tar.gz  single_copy_busco_sequences.tar.gz 

It seems only the single_copy_busco_sequences.tar.gz file contains .faa files:

tar -tf single_copy_busco_sequences.tar.gz

Output:

single_copy_busco_sequences/
single_copy_busco_sequences/939345at2759.faa
single_copy_busco_sequences/939345at2759.fna
single_copy_busco_sequences/939345at2759.gff

Also, the Python script you mentioned looks into each .tar.gz and searches for all ".faa" files inside (but it would be a good idea to confirm this). As you mentioned, there is at least one .faa file in this case, so the output FASTA file with the extracted genes should contain this sequence. The TAR module prepares the input for the EXTRACT_BUSCO_GENES module and includes a .tar.gz compression step, so the issue may be in that module instead. I also ran busco with the --tar flag, which compresses some of the folders in the BUSCO output. I need to check how these folders are being compressed and see if I can fix the issue.
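
As a way to check which archives actually hold protein sequences, here is a minimal Python sketch that lists the .faa members of a tar archive using the standard tarfile module. It is not the pipeline's extraction script; the helper names and the demo archive (built with dummy contents that mirror the listing above) are assumptions for illustration.

```python
import io
import tarfile
import tempfile
from pathlib import Path


def list_faa_members(archive_path):
    """Return the names of all regular files ending in .faa inside one archive."""
    with tarfile.open(archive_path, "r:*") as archive:
        return [
            member.name
            for member in archive.getmembers()
            if member.isfile() and member.name.endswith(".faa")
        ]


def build_demo_archive(archive_path):
    """Write a small .tar.gz with dummy files named like the listing above."""
    names = [
        "single_copy_busco_sequences/939345at2759.faa",
        "single_copy_busco_sequences/939345at2759.fna",
        "single_copy_busco_sequences/939345at2759.gff",
    ]
    payload = b"placeholder\n"  # dummy content, not real sequence data
    with tarfile.open(archive_path, "w:gz") as archive:
        for name in names:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))


with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "single_copy_busco_sequences.tar.gz"
    build_demo_archive(demo)
    faa_files = list_faa_members(demo)
    print(faa_files)
```

Running the same helper over each of the three busco_sequences archives would show which of them contain .faa files.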

@muffato
Member

muffato commented Jan 26, 2023

The archives single_copy_busco_sequences.tar.gz & co come from the --tar option we asked you to add to Busco (in conf/modules.config). I didn't realise it would cause trouble down the line. To get it to work, feel free to remove the --tar option, though this alone may not fix the issue.

@alxndrdiaz
Collaborator Author

@zb32 Hi. I fixed the issue you found. When running the test, there should now be a file containing the diamond blastp hits, results/blobtoolkit/busco_diamond/GCA_922984935.2.subset.txt. Its content looks like this:

OV277441.1:691847-695889=939345at2759=single	9838	979	OV277441.1:691847-695889=939345at2759=single	tr|A0A5N4CFV3|A0A5N4CFV3_CAMDR	64.6	867	124	8	1	815	1	736	0.0	979

This is the expected output (columns: "qseqid staxids bitscore qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore").
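
To make the column layout easier to read, the hit above can be paired with the listed column names. This is a minimal Python sketch, assuming the file is tab-separated with one hit per line; the function name is hypothetical. Because qseqid and bitscore each appear twice in the column list, the fields are kept as ordered (name, value) pairs rather than a dictionary.

```python
# Column names exactly as listed above (note the repeated qseqid and bitscore).
COLUMNS = (
    "qseqid staxids bitscore qseqid sseqid pident length mismatch "
    "gapopen qstart qend sstart send evalue bitscore"
).split()


def parse_hit(line):
    """Split one tab-separated hit and pair each field with its column name."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    return list(zip(COLUMNS, fields))


# The example hit from the comment above, rebuilt with explicit tabs.
example = "\t".join([
    "OV277441.1:691847-695889=939345at2759=single", "9838", "979",
    "OV277441.1:691847-695889=939345at2759=single",
    "tr|A0A5N4CFV3|A0A5N4CFV3_CAMDR",
    "64.6", "867", "124", "8", "1", "815", "1", "736", "0.0", "979",
])

hit = parse_hit(example)
for name, value in hit:
    print(f"{name}\t{value}")
```

A line with a different number of fields raises a ValueError, which is one quick way to spot a truncated or malformed output file.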

@alxndrdiaz
Collaborator Author

@muffato @priyanka-surana I was not sure about merging; if you have any comments or issues that should be fixed, please let me know.

@muffato
Member

muffato commented Feb 1, 2023

Thank you @alxndrdiaz! I can confirm that the Busco hit makes its way to Diamond on the unit test.

I've started a full test on gfLaeSulp1.1 (and had to make a few changes, which I have added to this branch). It's a small genome, so hopefully it shouldn't take too long. I'll talk to Zaynab tomorrow morning, but I think it will be OK to merge 🤞🏼

@muffato
Member

muffato commented Feb 1, 2023

Your subworkflow actually already completed on the full test. 326 Busco genes recovered across the three domains, and 280 Diamond hits. It looks fine by me 👍🏼

Contributor

@zb32 zb32 left a comment

I've run the pipeline and it looks good :D

@muffato muffato merged commit d796984 into dev Feb 2, 2023
@muffato muffato deleted the busco_dev branch February 2, 2023 17:54
@muffato muffato mentioned this pull request Feb 2, 2023
10 tasks
This was linked to issues Mar 17, 2023
Labels
enhancement Improvement of the existing features help wanted Open to anyone interested testing Code testing
Projects
Development

Successfully merging this pull request may close these issues.

subworkflow: diamond_blastp subworkflow: busco
4 participants