convert rule spades_assembly to a python script #201

YibiChen · 2024-04-19T11:49:28Z

No description provided.

AroneyS · 2024-04-23T22:14:11Z

Do the integration tests all succeed?

YibiChen · 2024-04-23T22:34:06Z

Yes, the integration tests all succeed except one. This is relevant to test_batch_recovery, where long read file is given, but long read type is "none". In my script (aviary/modules/assembly/scripts/spades_assembly.py 63-65), I raised a ValueError for this scenario whereas in the earlier rule this would be defaulted to pacbio. Is there any reason to set the default type as pacbio? I believe nanopore is more common.

YibiChen · 2024-04-23T22:44:49Z

I also have another concern about the tmpdir. In test_short_read_recovery_queue_submission, the tmpdir I got from qsub-logs is still /tmp, which is the $TMPDIR in the local env where I ran aviary. Is this correct? If I were to run bin chicken, would the aviary command be running on the remote node?

rhysnewell · 2024-04-23T23:08:24Z

Yes, the integration tests all succeed except one. This is relevant to test_batch_recovery, where long read file is given, but long read type is "none". In my script (aviary/modules/assembly/scripts/spades_assembly.py 63-65), I raised a ValueError for this scenario whereas in the earlier rule this would be defaulted to pacbio. Is there any reason to set the default type as pacbio? I believe nanopore is more common.

This doesn't seem to make a whole lot of sense. The default long read type in the CLI is ont, none is not a valid option. You shouldn't be supplying "none" as a long read type in the first place. If you are asking why pacbio was the final branch in the if statement originally, then it's because there are 4 different pacbio read types and only 2 ont types so it's programatically easier to catch the first 2 ont types and then safely assume pacbio after that.

If you don't have long reads, then this rule shouldn't even be getting run

rhysnewell · 2024-04-23T23:11:58Z

aviary/modules/assembly/scripts/spades_assembly.py

+            tmp_dir_arg = ""
+
+    if os.path.exists("data/spades_assembly/tmp"):
+        subprocess.call("rm -rf data/spades_assembly/tmp",shell=True)


We shouldn't be using subprocess.call with shell=True, it isn't safe. I know it is what snakemake does when shell commands are run, but we can do better. Refer to other python scripts on how to avoid using shell=True. run is a safe alternative, run_flye.py has a proper implementation

rhysnewell · 2024-04-23T23:13:05Z

aviary/modules/assembly/scripts/spades_assembly.py

+        # run cmd
+        with open(log, 'a') as logf:
+            logf.write(f"Queueing command {command}\n")
+            subprocess.run(command.split(), stdout=logf, stderr=subprocess.STDOUT)        


You've done it correctly here actually, so you should use this method elsewhere

rhysnewell · 2024-04-23T23:15:40Z

aviary/modules/assembly/scripts/spades_assembly.py

+        if long_read_type in ["ont","ont_hq"]:
+            command = f"spades.py --checkpoints all --memory {max_memory} --meta --nanopore {input_long_reads} --12 {input_fastq} "\
+                f"-o data/spades_assembly -t {threads}  -k {kmer_sizes} {tmp_dir_arg} "       
+        elif long_read_type == "pacbio":


Why are you looking for "pacbio", we should use "pacbio" mode when the user supplies "rs", "sq", "ccs", or "hifi". "pacbio" isn't even an option?

This should really just be an else statement as it was in the original script.

rhysnewell · 2024-04-23T23:17:20Z

aviary/modules/assembly/scripts/spades_assembly.py

+                f"-o data/spades_assembly -t {threads}  -k {kmer_sizes} {tmp_dir_arg} "
+        else:
+            # raise error
+            raise ValueError(f"Invalid long read type: {long_read_type}")


We shouldn't be raising a value error here, just fall back to pacbio.

YibiChen · 2024-04-24T05:08:29Z

Yes, the integration tests all succeed except one. This is relevant to test_batch_recovery, where long read file is given, but long read type is "none". In my script (aviary/modules/assembly/scripts/spades_assembly.py 63-65), I raised a ValueError for this scenario whereas in the earlier rule this would be defaulted to pacbio. Is there any reason to set the default type as pacbio? I believe nanopore is more common.

This doesn't seem to make a whole lot of sense. The default long read type in the CLI is ont, none is not a valid option. You shouldn't be supplying "none" as a long read type in the first place. If you are asking why pacbio was the final branch in the if statement originally, then it's because there are 4 different pacbio read types and only 2 ont types so it's programatically easier to catch the first 2 ont types and then safely assume pacbio after that.

If you don't have long reads, then this rule shouldn't even be getting run

I am sorry I didn't explain this well. I just thought it would be nice to have more sanity check. A good example is the test case for batch submission (see test/data/example_batch.tsv). Although long read type is "ont" for both samples here, it becomes "none" in the individual config.yaml files. This is likely causing by an index mistake somewhere. More importantly, the command would finish without any complains, even though spade were running in pacbio mode not nanopore. This could also happen if the user misspelled "ont" or used uppercase.

I can resubmit the pull request without the sanity check if you think it's not nessesary or I can correct the current version with the correct pacbio codes. And thanks for all the comments, they are very helpful.

AroneyS · 2024-04-24T05:12:02Z

The choices for long-read-type are restricted by the cli (

aviary/aviary/aviary.py

Line 481 in 7832126

choices=["ont","ont_hq", "rs", "sq", "ccs", "hifi"],

), so typos will error immediately.

rhysnewell · 2024-04-24T05:15:49Z

The choices for long-read-type are restricted by the cli (

aviary/aviary/aviary.py

Line 481 in 7832126

choices=["ont","ont_hq", "rs", "sq", "ccs", "hifi"],

), so typos will error immediately.

I think I understand what Yibi is saying, but I don't exactly understand how the long_read_type became none. The batch input doesn't have the same safety check in place that the CLI does, users could potentially introduce a typo in their batch file if the user isn't writing the batch scripts to a file

YibiChen · 2024-04-24T05:16:24Z

But batch mode doesn't seem to have this restriction.

YibiChen · 2024-04-24T05:17:54Z

@rhysnewell maybe you can try to run the test case and figure it out?

AroneyS · 2024-04-24T05:42:03Z

My vote would be to add checks for batch mode, rather than worry at each stage if the variable is garbage.
Though in a separate PR

rhysnewell · 2024-04-24T05:56:43Z

@YibiChen I've pinpointed the issue here and it's an easy fix, but it will go in a separate PR. Just remove your pacbio check from your script and respond to the other comments, thanks

subprocess.run

YibiChen · 2024-04-25T11:13:34Z

The integration tests all succeed. Let me know if you spot anything else.

convert rule spades_assembly to a python script

4ac30bc

AroneyS requested a review from rhysnewell April 23, 2024 22:14

rhysnewell requested changes Apr 23, 2024

View reviewed changes

removed string check for long read type; replaced subprocess.call with

adb54a2

subprocess.run

AroneyS requested a review from rhysnewell April 25, 2024 21:48

rhysnewell approved these changes Apr 28, 2024

View reviewed changes

AroneyS merged commit 8d3898c into rhysnewell:fix-tmpdir Apr 28, 2024
1 check passed

YibiChen deleted the fix-tmpdir branch April 28, 2024 22:57

AroneyS mentioned this pull request Apr 28, 2024

fix tmpdir being set from base command #200

Merged

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

convert rule spades_assembly to a python script #201

convert rule spades_assembly to a python script #201

YibiChen commented Apr 19, 2024

AroneyS commented Apr 23, 2024

YibiChen commented Apr 23, 2024

YibiChen commented Apr 23, 2024

rhysnewell commented Apr 23, 2024

rhysnewell Apr 23, 2024

rhysnewell Apr 23, 2024

rhysnewell Apr 23, 2024 •

edited

Loading

rhysnewell Apr 23, 2024

YibiChen commented Apr 24, 2024 •

edited

Loading

AroneyS commented Apr 24, 2024

rhysnewell commented Apr 24, 2024

YibiChen commented Apr 24, 2024

YibiChen commented Apr 24, 2024

AroneyS commented Apr 24, 2024

rhysnewell commented Apr 24, 2024

YibiChen commented Apr 25, 2024

convert rule spades_assembly to a python script #201

convert rule spades_assembly to a python script #201

Conversation

YibiChen commented Apr 19, 2024

AroneyS commented Apr 23, 2024

YibiChen commented Apr 23, 2024

YibiChen commented Apr 23, 2024

rhysnewell commented Apr 23, 2024

rhysnewell Apr 23, 2024

Choose a reason for hiding this comment

rhysnewell Apr 23, 2024

Choose a reason for hiding this comment

rhysnewell Apr 23, 2024 • edited Loading

Choose a reason for hiding this comment

rhysnewell Apr 23, 2024

Choose a reason for hiding this comment

YibiChen commented Apr 24, 2024 • edited Loading

AroneyS commented Apr 24, 2024

rhysnewell commented Apr 24, 2024

YibiChen commented Apr 24, 2024

YibiChen commented Apr 24, 2024

AroneyS commented Apr 24, 2024

rhysnewell commented Apr 24, 2024

YibiChen commented Apr 25, 2024

rhysnewell Apr 23, 2024 •

edited

Loading

YibiChen commented Apr 24, 2024 •

edited

Loading