Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add QC Stats #161

Merged
merged 27 commits into from
May 30, 2024
Merged

Add QC Stats #161

merged 27 commits into from
May 30, 2024

Conversation

LeeBergstrand
Copy link
Collaborator

Add QC stats to Rotary.

@LeeBergstrand LeeBergstrand changed the title start working on QC stats. QC Stats. Apr 23, 2024
@LeeBergstrand LeeBergstrand changed the title QC Stats. QC Stats Apr 23, 2024
@LeeBergstrand LeeBergstrand changed the title QC Stats Add QC Stats Apr 23, 2024
@jmtsuji
Copy link
Collaborator

jmtsuji commented May 1, 2024

@LeeBergstrand Like I mentioned in #91 , this is a nice start toward making QC summary reports. Here are a few observations of mine after testing this code:

  • It would be ideal to also run FastQC on the raw reads (before input to the QC module)
  • If we want to run FastQC on intermediate QC steps like you have done in this draft code, we will need to modify the FastQC rules somehow so that each intermediate QC step is optional. Depending on how the user sets up their config file, some of the QC steps might be skipped, and I think this would produce an error in the current code. I can think of two different ways to make FastQC for intermediate steps optional:
    • Keep the current fastqc rule but add more complex logic to each input variable
    • Change the current fastqc rule so that it runs on each FastQ output file during QC, one file at a time, based on wildcards. In this case, the DAG would just figure out when to run the fastqc rule dynamically.
  • Once ready, we should add another rule that runs MultiQC. Based on my quick testing, it would be ideal to generate separate MultiQC reports for short reads and long reads.

If you agree that this PR is a good way to address #91, then we can start to refine this PR including addressing the comments above. Thanks again!

@LeeBergstrand
Copy link
Collaborator Author

@LeeBergstrand Like I mentioned in #91 , this is a nice start toward making QC summary reports. Here are a few observations of mine after testing this code:

  • It would be ideal to also run FastQC on the raw reads (before input to the QC module)

  • If we want to run FastQC on intermediate QC steps like you have done in this draft code, we will need to modify the FastQC rules somehow so that each intermediate QC step is optional. Depending on how the user sets up their config file, some of the QC steps might be skipped, and I think this would produce an error in the current code. I can think of two different ways to make FastQC for intermediate steps optional:

    • Keep the current fastqc rule but add more complex logic to each input variable
    • Change the current fastqc rule so that it runs on each FastQ output file during QC, one file at a time, based on wildcards. In this case, the DAG would just figure out when to run the fastqc rule dynamically.
  • Once ready, we should add another rule that runs MultiQC. Based on my quick testing, it would be ideal to generate separate MultiQC reports for short reads and long reads.

If you agree that this PR is a good way to address #91, then we can start to refine this PR including addressing the comments above. Thanks again!

Sounds good. I'll pursue this.

@LeeBergstrand LeeBergstrand marked this pull request as ready for review May 22, 2024 22:51
@LeeBergstrand LeeBergstrand self-assigned this May 22, 2024
@LeeBergstrand LeeBergstrand added the enhancement New feature or request label May 22, 2024
@LeeBergstrand
Copy link
Collaborator Author

@jmtsuji This is essentially a preliminary pass of generating QC stats for Rotary.

  1. Run FASTQC on short and long reads.
  2. Run MultiQC on short and long FASTQC outputs.

Caveats

  • The names of the FASTQC files are currently not ordered in the MultiQC output.
    • We could order them but that would require a step adding organized numbers to the FASTQC output.
  • This code doesn't use checkpoints or input functions. I found it hard to approach using them without an entire rewrite.

Does this look good enough to merge in for now?

@jmtsuji Do you have any comments or suggestions for improvements?

@LeeBergstrand
Copy link
Collaborator Author

LeeBergstrand commented May 24, 2024

@jmtsuji Requesting merge of #173 (assembly stats using quast) into this branch.

@jmtsuji
Copy link
Collaborator

jmtsuji commented May 27, 2024

@LeeBergstrand Thanks so much for these updates. Things are a bit crazy on my end at the moment, but I'll take a look at #161 and #173 by the end of this week. Let me know if a code review is urgent, and I can take a look sooner. Thanks!

…empt to allow it to make short qc reports. This would occur even if there was no short reads to process.
@LeeBergstrand
Copy link
Collaborator Author

@LeeBergstrand Thanks so much for these updates. Things are a bit crazy on my end at the moment, but I'll take a look at #161 and #173 by the end of this week. Let me know if a code review is urgent, and I can take a look sooner. Thanks!

@jmtsuji End of the week should be fine.

Copy link
Collaborator

@jmtsuji jmtsuji left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@LeeBergstrand Thanks for working on these improvements to the QC reports! Although I haven't tested the code yet, I didn't notice any issues when looking through the updated rules. The new code looks like it should work well within the caveats you mentioned.

Regarding the caveats:

  • The names of the FASTQC files are currently not ordered in the MultiQC output.
    We could order them but that would require a step adding organized numbers to the FASTQC output.

I don't think it's a huge priority to have them ordered. We can leave them unordered for this PR.

  • This code doesn't use checkpoints or input functions. I found it hard to approach using them without an entire rewrite.

My only concern is that rule run_fastq_short might error out if the user turns off one of the short read QC steps (e.g., adapter trimming) in the config file, because QC_SHORT_FILE_TYPES is hard-coded to include '_reformat_', '_adapter_trim_', '_quality_trim_'. Do you agree that this will likely be an issue? (I haven't tested the code to confirm, but I can do so if needed.)

Assuming the above error occurs: if it looks like it's going to be difficult to address this issue using input functions etc., then I would be OK with just simplifying rule run_fastqc_short to be like rule run_fastqc_long, which only shows the raw reads and the final QC'ed reads. Although seeing the results of intermediate QC steps could be nice for some troubleshooting cases, I don't think it's a huge priority to include these in the report at this stage, especially if doing so will require some major code re-writes.

Let me know your thoughts, and we can proceed based on that. Thanks!

@jmtsuji
Copy link
Collaborator

jmtsuji commented May 30, 2024

@jmtsuji Requesting merge of #173 (assembly stats using quast) into this branch.

@LeeBergstrand Approved -- feel free to merge this in. I can run an end-to-end test of the merged code in this branch once we're satisfied with the FastQC reports.

@LeeBergstrand LeeBergstrand merged commit cc981cf into develop May 30, 2024
@LeeBergstrand LeeBergstrand deleted the improve_qc_stats branch May 30, 2024 19:32
@LeeBergstrand
Copy link
Collaborator Author

@jmtsuji Great!

@LeeBergstrand
Copy link
Collaborator Author

My only concern is that rule run_fastq_short might error out if the user turns off one of the short read QC steps (e.g., adapter trimming) in the config file, because QC_SHORT_FILE_TYPES is hard-coded to include '_reformat_', '_adapter_trim_', '_quality_trim_'. Do you agree that this will likely be an issue? (I haven't tested the code to confirm, but I can do so if needed.)

This can be addressed by changing perform_adapter_trimming into a variable flag and then using that to turn off adapter_trim as an input. I'm considering creating a function to address creating inputs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants