Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

convert rule spades_assembly to a python script #201

Merged
merged 2 commits into from
Apr 28, 2024

Conversation

YibiChen
Copy link

No description provided.

@AroneyS
Copy link
Collaborator

AroneyS commented Apr 23, 2024

Do the integration tests all succeed?

@AroneyS AroneyS requested a review from rhysnewell April 23, 2024 22:14
@YibiChen
Copy link
Author

Yes, the integration tests all succeed except one. This is relevant to test_batch_recovery, where long read file is given, but long read type is "none". In my script (aviary/modules/assembly/scripts/spades_assembly.py 63-65), I raised a ValueError for this scenario whereas in the earlier rule this would be defaulted to pacbio. Is there any reason to set the default type as pacbio? I believe nanopore is more common.

@YibiChen
Copy link
Author

I also have another concern about the tmpdir. In test_short_read_recovery_queue_submission, the tmpdir I got from qsub-logs is still /tmp, which is the $TMPDIR in the local env where I ran aviary. Is this correct? If I were to run bin chicken, would the aviary command be running on the remote node?

@rhysnewell
Copy link
Owner

Yes, the integration tests all succeed except one. This is relevant to test_batch_recovery, where long read file is given, but long read type is "none". In my script (aviary/modules/assembly/scripts/spades_assembly.py 63-65), I raised a ValueError for this scenario whereas in the earlier rule this would be defaulted to pacbio. Is there any reason to set the default type as pacbio? I believe nanopore is more common.

This doesn't seem to make a whole lot of sense. The default long read type in the CLI is ont, none is not a valid option. You shouldn't be supplying "none" as a long read type in the first place. If you are asking why pacbio was the final branch in the if statement originally, then it's because there are 4 different pacbio read types and only 2 ont types so it's programatically easier to catch the first 2 ont types and then safely assume pacbio after that.

If you don't have long reads, then this rule shouldn't even be getting run

tmp_dir_arg = ""

if os.path.exists("data/spades_assembly/tmp"):
subprocess.call("rm -rf data/spades_assembly/tmp",shell=True)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't be using subprocess.call with shell=True, it isn't safe. I know it is what snakemake does when shell commands are run, but we can do better. Refer to other python scripts on how to avoid using shell=True. run is a safe alternative, run_flye.py has a proper implementation

# run cmd
with open(log, 'a') as logf:
logf.write(f"Queueing command {command}\n")
subprocess.run(command.split(), stdout=logf, stderr=subprocess.STDOUT)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You've done it correctly here actually, so you should use this method elsewhere

if long_read_type in ["ont","ont_hq"]:
command = f"spades.py --checkpoints all --memory {max_memory} --meta --nanopore {input_long_reads} --12 {input_fastq} "\
f"-o data/spades_assembly -t {threads} -k {kmer_sizes} {tmp_dir_arg} "
elif long_read_type == "pacbio":
Copy link
Owner

@rhysnewell rhysnewell Apr 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are you looking for "pacbio", we should use "pacbio" mode when the user supplies "rs", "sq", "ccs", or "hifi". "pacbio" isn't even an option?

This should really just be an else statement as it was in the original script.

f"-o data/spades_assembly -t {threads} -k {kmer_sizes} {tmp_dir_arg} "
else:
# raise error
raise ValueError(f"Invalid long read type: {long_read_type}")
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't be raising a value error here, just fall back to pacbio.

@YibiChen
Copy link
Author

YibiChen commented Apr 24, 2024

Yes, the integration tests all succeed except one. This is relevant to test_batch_recovery, where long read file is given, but long read type is "none". In my script (aviary/modules/assembly/scripts/spades_assembly.py 63-65), I raised a ValueError for this scenario whereas in the earlier rule this would be defaulted to pacbio. Is there any reason to set the default type as pacbio? I believe nanopore is more common.

This doesn't seem to make a whole lot of sense. The default long read type in the CLI is ont, none is not a valid option. You shouldn't be supplying "none" as a long read type in the first place. If you are asking why pacbio was the final branch in the if statement originally, then it's because there are 4 different pacbio read types and only 2 ont types so it's programatically easier to catch the first 2 ont types and then safely assume pacbio after that.

If you don't have long reads, then this rule shouldn't even be getting run

I am sorry I didn't explain this well. I just thought it would be nice to have more sanity check. A good example is the test case for batch submission (see test/data/example_batch.tsv). Although long read type is "ont" for both samples here, it becomes "none" in the individual config.yaml files. This is likely causing by an index mistake somewhere. More importantly, the command would finish without any complains, even though spade were running in pacbio mode not nanopore. This could also happen if the user misspelled "ont" or used uppercase.

I can resubmit the pull request without the sanity check if you think it's not nessesary or I can correct the current version with the correct pacbio codes. And thanks for all the comments, they are very helpful.

@AroneyS
Copy link
Collaborator

AroneyS commented Apr 24, 2024

The choices for long-read-type are restricted by the cli (

choices=["ont","ont_hq", "rs", "sq", "ccs", "hifi"],
), so typos will error immediately.

@rhysnewell
Copy link
Owner

The choices for long-read-type are restricted by the cli (

choices=["ont","ont_hq", "rs", "sq", "ccs", "hifi"],

), so typos will error immediately.

I think I understand what Yibi is saying, but I don't exactly understand how the long_read_type became none. The batch input doesn't have the same safety check in place that the CLI does, users could potentially introduce a typo in their batch file if the user isn't writing the batch scripts to a file

@YibiChen
Copy link
Author

But batch mode doesn't seem to have this restriction.

@YibiChen
Copy link
Author

@rhysnewell maybe you can try to run the test case and figure it out?

@AroneyS
Copy link
Collaborator

AroneyS commented Apr 24, 2024

My vote would be to add checks for batch mode, rather than worry at each stage if the variable is garbage.
Though in a separate PR

@rhysnewell
Copy link
Owner

@YibiChen I've pinpointed the issue here and it's an easy fix, but it will go in a separate PR. Just remove your pacbio check from your script and respond to the other comments, thanks

@YibiChen
Copy link
Author

The integration tests all succeed. Let me know if you spot anything else.

@AroneyS AroneyS requested a review from rhysnewell April 25, 2024 21:48
@AroneyS AroneyS merged commit 8d3898c into rhysnewell:fix-tmpdir Apr 28, 2024
1 check passed
@YibiChen YibiChen deleted the fix-tmpdir branch April 28, 2024 22:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants