
RNA-Seq Workflow v2.0 #1

Merged
merged 38 commits into from Oct 2, 2019

Conversation


claymcleod commented Jun 24, 2019

@claymcleod claymcleod requested a review from lnason Jun 25, 2019
@claymcleod claymcleod self-assigned this Jun 25, 2019

snewmanStJude commented Jun 26, 2019

no idea what fq lint is

Previously, we were filtering out anything not matching "level 1" or "level 2" from the gene model:
The point of the gene model is to assist with the alignment. We shouldn't really be filtering out stuff at this stage.

Some of the GATK and Picard tools have stringency settings. Check whether one exists for ValidateSamFile. The risk is that GATK-based tools will crap out downstream if the BAM file isn't "just so".

--sjdbOverhang 125 check that this makes sense for our data

Split BAM file into multiple BAMs on the different read groups: Why are we doing this? We only want to give STAR a single pair of fastqs, don't we?

Run picard MarkDuplicates on the STAR-aligned BAM file. What is GDC doing? Last I heard there is no consensus on whether to mark/remove duplicates from RNASeq data. So long as we mark and do not remove, I think it should be fine.

Why not add splice junction RNA-Peg and HTSeq/Salmon quantification here, since you are already messing with the data anyway? Same for the coverage bw file. May as well get it all out of the way in one go. Or else, break these out into a downstream module for secondary analysis.

Using the full path in the header might be useful for forensic reasons in the future.


claymcleod commented Jun 27, 2019

no idea what fq lint is

Michael Macias and I developed this tool to catch common errors we see in FastQ files. We use this to ensure the BAM to FastQ step worked properly.
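For readers unfamiliar with the tool, the sketch below illustrates the kind of validation a FastQ linter performs. It is a hypothetical simplification, not `fq lint`'s actual implementation; the function name and rules are invented for illustration.

```python
# Illustrative sketch only: the real `fq lint` implements many more
# validators; this function and its rules are hypothetical.
def lint_fastq_record(lines):
    """Validate one 4-line FastQ record; return a list of error strings."""
    if len(lines) != 4:
        return ["record must be exactly 4 lines"]
    name, seq, plus, qual = (line.rstrip("\n") for line in lines)
    errors = []
    if not name.startswith("@"):
        errors.append("name line must start with '@'")
    if not plus.startswith("+"):
        errors.append("separator line must start with '+'")
    if len(seq) != len(qual):
        errors.append("sequence and quality strings differ in length")
    if any(base not in "ACGTNacgtn" for base in seq):
        errors.append("sequence contains characters outside ACGTN")
    return errors

ok = lint_fastq_record(["@read1\n", "ACGT\n", "+\n", "IIII\n"])   # no errors
bad = lint_fastq_record(["@read1\n", "ACGT\n", "+\n", "III\n"])   # one error
```

A truncated quality string, as in the second call, is exactly the kind of BAM-to-FastQ breakage this step is meant to catch.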

Previously, we were filtering out anything not matching "level 1" or "level 2" from the gene model:
The point of the gene model is to assist with the alignment. We shouldn't really be filtering out stuff at this stage.

I'm of the same mind.

Some of the GATK and Picard tools have stringency settings. Check whether one exists for ValidateSamFile. The risk is that GATK-based tools will crap out downstream if the BAM file isn't "just so"

Noted, I think we are covered with our current use of ValidateSamFile.

--sjdbOverhang 125 check that this makes sense for our data

Most places use 100 for this value. The consensus from our RNA-Seq workflow v1 meeting was that it was better to go over than under on this parameter, and some of our data may be more than 100bp.
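For reference, the STAR manual's recommendation for this parameter is max(mate length) - 1, with 100 working well in most cases:

```python
def recommended_sjdb_overhang(read_lengths):
    # STAR manual: the ideal --sjdbOverhang is max(mate length) - 1;
    # the generic default of 100 works well when lengths vary.
    return max(read_lengths) - 1
```

Under that rule, a dataset containing reads up to 126 bp would justify a value of 125.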

Split BAM file into multiple BAMs on the different read groups: Why are we doing this? We only want to give STAR a single pair of fastqs, don't we?

Primarily we do this because it's the easiest way to retain the read group information and feed it back into STAR for the remapped BAM: you split the original BAM by read group, store the read group information alongside each pair of FastQ files, then pass them back to STAR in comma-delimited order.
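A minimal sketch of that rejoin step, with hypothetical file names: STAR matches the comma-delimited R1/R2 lists positionally against the read groups given in --outSAMattrRGline (which separates groups with ` , `).

```python
# Hypothetical per-read-group FastQ pairs produced by the BAM split.
read_groups = [
    {"id": "rg1", "r1": "rg1_R1.fq.gz", "r2": "rg1_R2.fq.gz"},
    {"id": "rg2", "r1": "rg2_R1.fq.gz", "r2": "rg2_R2.fq.gz"},
]

# --readFilesIn takes all R1 files comma-delimited, a space, then all
# R2 files in the same order.
read_files_in = " ".join([
    ",".join(rg["r1"] for rg in read_groups),
    ",".join(rg["r2"] for rg in read_groups),
])

# --outSAMattrRGline lists one read group per FastQ pair, in the same
# order, separated by " , ".
rg_line = " , ".join(f"ID:{rg['id']}" for rg in read_groups)

args = f"--readFilesIn {read_files_in} --outSAMattrRGline {rg_line}"
```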

Run picard MarkDuplicates on the STAR-aligned BAM file. What is GDC doing? Last I heard there is no consensus on whether to mark/remove duplicates from RNASeq data. So long as we mark and do not remove, I think it should be fine.

Based on their mRNA documentation, they don't appear to be doing any duplicate marking/removing (I find that hard to believe). It may just not be documented. For what it's worth, they do say that they use MarkDuplicates in their DNA-Seq analysis docs.

Why not add splice junction RNA-Peg and HTSeq/Salmon quantification here, since you are already messing with the data anyway? Same for the coverage bw file. May as well get it all out of the way in one go. Or else, break these out into a downstream module for secondary analysis.

Agree with this, I think it's probably safer to go with HTSeq than Salmon at this point. I can give more details on my feelings why that is the case if people are interested. I hadn't considered computing the bigWig file, but it seems reasonable.

Using the full path in the header might be useful for forensic reasons in the future.

Begrudgingly agree.


# Outstanding questions

* Any parameters we want to change during the STAR alignment step? I don't expect any, but we should explicitly discuss.


zaeleus Jun 27, 2019

Considering outlier failures, the two parameters that I sometimes have to tweak are limitSjdbInsertNsj and limitOutSJcollapsed. They are few enough to treat as special cases, but I'm just noting it here for discussion.


claymcleod Jun 29, 2019


Yeah I agree with this. I'll put it in the "to investigate" list.


dfinkels commented Jun 28, 2019

"The inclusion of the ERCC genome in all of our RNA-Seq samples was violating that practice and causing problems for cross-target analyses"

Removing the ERCC spike in controls makes sense since they are causing incompatibility with other pipelines. They are a special case and possible candidate for a future tool

"Previously, we were filtering out anything not matching "level 1" or "level 2" from the gene model"
Agree that including level 3 sequence makes sense for mapping. Inclusion will affect iFDR if counted, but that's at a later stage.

--sjdbOverhang 125 is not the "ideal" value according to the manual given our current read lengths, but more is harmless, and some users might eventually have longer reads that could take advantage of this extension.

htseq-count -m union vs -m intersection-nonempty or intersection-strict

The union option is the standard/default; most people will use it. The intersection-nonempty option will count a read for the transcript it fully maps to, even when it partially overlaps others. I think it is reasonable, and such cases may happen more often if level 3 transcripts are included; those reads would otherwise be excluded. GDC uses it, so including it will move us one step closer to harmonized/federated data, although I don't like all their choices.

As for gene id or gene name I think gene name is more useful for end users in a count file. We need to decide how valuable it is to be harmonized with GDC. I would choose end users over GDC at this stage. "GDC like" results that use all GDC options could be an optional format at some later point.

--secondary-alignments ignore
--supplementary-alignments ignore
From what I've read, if the data is position-sorted (and it is: -r pos), ignore is the best option in these cases; this seems to have been a problem in older versions of HTSeq.
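A toy model of the three counting modes may help here; it follows the set semantics described in the HTSeq documentation, with hypothetical feature names.

```python
def resolve(per_base, mode):
    """per_base: one set of overlapping gene features per aligned base."""
    if mode == "union":
        # Union of the feature sets over all bases.
        result = set().union(*per_base)
    elif mode == "intersection-strict":
        # Intersection over all bases, including empty (intergenic) ones.
        result = set.intersection(*map(set, per_base))
    else:  # intersection-nonempty
        # Intersection over the non-empty sets only.
        nonempty = [s for s in per_base if s]
        result = set.intersection(*nonempty) if nonempty else set()
    if len(result) == 1:
        return result.pop()               # counted for this feature
    return "no_feature" if not result else "ambiguous"

# Every base overlaps geneA, but part of the read also touches geneB:
# ambiguous under union, counted for geneA under intersection-nonempty.
per_base = [{"geneA"}, {"geneA", "geneB"}, {"geneA"}]
```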


rawagiha commented Jun 28, 2019

Thanks a lot for preparing the proposal. I enjoyed reading this.

  1. May I ask at which step in the workflow adapter trimming is performed?

  2. Under 'Outstanding questions', I think the strandedness argument (-s) in htseq-count is critical.
    Rather than setting a default, is it possible to automatically supply the parameter from 'infer_experiment.py' at Step 8?
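A sketch of that automation, assuming the forward/reverse fractions have already been parsed out of infer_experiment.py's output; the 0.8 cutoff is an arbitrary illustrative threshold, not an RSeQC default.

```python
def htseq_strand_setting(fwd_fraction, rev_fraction, threshold=0.8):
    """Map infer_experiment.py strand fractions to an htseq-count -s value."""
    if fwd_fraction >= threshold:
        return "yes"      # forward-stranded library
    if rev_fraction >= threshold:
        return "reverse"  # reverse-stranded (e.g. dUTP-based protocols)
    return "no"           # unstranded
```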


zaeleus commented Jun 28, 2019

--secondary-alignments ignore
--supplementary-alignments ignore
From what I've read, if the data is position-sorted (and it is: -r pos), ignore is the best option in these cases; this seems to have been a problem in older versions of HTSeq.

Yes, the author of HTSeq noted these should typically be ignored. From the release notes for 0.11.0:

Turns out [ignoring secondary and supplementary alignments] are much more sensible in the majority of cases, because the mate of a secondary alignment is an ill-defined beast.


claymcleod commented Aug 24, 2019

@adthrasher For some reason, I can't reply directly on your comment above...
What do you think about sha256 vs BLAKE2b (see Michael M's comment above)? If coreutils suggests the latter, maybe we should use it instead... I don't know. I do think three would be overboard.

It depends.

For the end user, I think they're only likely to use md5 (if anything). Only a small set of users is likely to use sha256.

For internal usage, going with a more cryptographically secure method is preferable, but I don't think we should generate three checksums. What internal use cases do we have for the checksums? That may give some clarity on the best choice.

At this point, I think generating MD5s only would be sufficient, but we may also want to generate SHA256 internally because I think that is a better long-term solution.
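For what it's worth, all three candidates are a one-liner in Python's hashlib, so the cost of any choice is in storage and user expectations rather than implementation:

```python
import hashlib

data = b"example BAM bytes"
md5 = hashlib.md5(data).hexdigest()          # familiar to most end users
sha256 = hashlib.sha256(data).hexdigest()    # cryptographically stronger
blake2b = hashlib.blake2b(data).hexdigest()  # fast and strong; in coreutils as b2sum
```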

@claymcleod claymcleod force-pushed the master branch 2 times, most recently from 80196e8 to 7f85a7a Aug 25, 2019

agout commented Sep 12, 2019

Some further thoughts regarding the coverage bw file...

Scott:

Why not add splice junction RNA-Peg and HTSeq/Salmon quantification here, since you are already messing with the data anyway? Same for the coverage bw file. May as well get it all out of the way in one go. Or else, break these out into a downstream module for secondary analysis.

Clay:

Agree with this, I think it's probably safer to go with HTSeq than Salmon at this point. I can give more details on my feelings why that is the case if people are interested. I hadn't considered computing the bigWig file, but it seems reasonable.

Alex:
Would be very useful if we could create these bw files so we could use them to calculate coverage statistics for each RNAseq .bam file. Will come in handy for downstream QC + associated analysis.

Can we incorporate into the pipeline?
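As a toy illustration of the signal such a bw (bigWig) file stores: per-base coverage computed from read intervals with a difference array. A real pipeline would derive this from the BAM with a dedicated coverage tool rather than code like this.

```python
def coverage(intervals, length):
    """Per-base coverage over a contig from half-open [start, end) reads."""
    diff = [0] * (length + 1)
    for start, end in intervals:
        diff[start] += 1   # coverage rises where a read begins
        diff[end] -= 1     # and falls where it ends
    out, running = [], 0
    for d in diff[:length]:
        running += d
        out.append(running)
    return out

# Two overlapping reads on a 6 bp contig.
depths = coverage([(0, 4), (2, 6)], 6)
```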


claymcleod commented Sep 13, 2019

@adthrasher, thoughts? ^


adthrasher commented Sep 13, 2019

@adthrasher, thoughts? ^

I agree with Alex that this is a good idea to include. I'll take a look at what is involved.


adthrasher commented Oct 2, 2019

The two week public comment period has elapsed. So this is considered final. Any further changes can be made through a new PR.

@adthrasher adthrasher closed this Oct 2, 2019
@adthrasher adthrasher reopened this Oct 2, 2019
@adthrasher adthrasher merged commit c666d85 into master Oct 2, 2019

adthrasher commented Oct 2, 2019

Assigned RFC 0001


kmhernan commented Oct 17, 2019

Just saw this and wanted to say the GDC absolutely does not mark duplicates for RNA-Seq. That is generally not a good practice for RNA-Seq unless you have UMIs, where you can confidently remove duplicates. We also run all the read groups at once through STAR (you have to do SE and PE separately, though).


claymcleod commented Oct 20, 2019

Hi @kmhernan, thanks for your comment on this. Note that we are only marking the duplicates (not removing them). I'm curious if you are aware of any negative downstream effects from just marking duplicates for RNA-Seq data? Most tools that should account for true biological duplications (e.g. expression quantification using htseq) ignore the flag (at least to my knowledge).

If true, that would make the decision of whether to mark duplicates a matter of opinion. After giving it a few moments' thought, it may actually be better to assert nothing at all about the duplicate status than to mark them. That would also reduce the runtime, which is a plus.

For those interested, this is worth skimming: https://www.nature.com/articles/srep25533.

Currently, quantitative RNA-seq methods are pushed to work with increasingly small starting amounts of RNA that require amplification. However, it is unclear how much noise or bias amplification introduces and how this affects precision and accuracy of RNA quantification.... We find that a large fraction of computationally identified read duplicates are not PCR duplicates and can be explained by sampling and fragmentation bias. Consequently, the computational removal of duplicates does improve neither accuracy nor precision and can actually worsen the power and the False Discovery Rate (FDR) for differential gene expression. Even when duplicates are experimentally identified by unique molecular identifiers (UMIs), power and FDR are only mildly improved...

I'll discuss with our team. Thanks again for your contribution!

cc: @adthrasher


adthrasher commented Oct 21, 2019

After internal consideration, we have decided to no longer mark duplicates for RNA-Seq. The pipeline will be updated in the near future.


kmhernan commented Oct 21, 2019

Hey, sorry, I was traveling this weekend. Yeah, just marking seems like it wouldn't be an issue (only compute cost). HTSeq does some bizarro things that you can only really see in the source (at least in my experience). I actually prefer just getting the counts from STAR because kallisto, Salmon, etc. seem to perform better in quantification anyway. If you wanted to call RNA-Seq variants (there's no good way to do this yet), then your MarkDuplicates step would be needed.


hbeale commented Nov 27, 2019

Hi folks - Thanks for inviting feedback. As you know, we at Treehouse have many similar issues to those you address here (1). We agree that updating reference files and software is worthwhile. We do not perform gene model post-processing, for the reasons you've outlined. We include potential duplicate reads in our quantification steps, but subsequently mark duplicates and count non-duplicate reads as a QC step. We have found that counting mapped, exonic, non-duplicate reads has allowed us to identify and exclude some extremely low-quality samples that would not otherwise be detected (2,3).

1 Vaske et al, 2019, http://dx.doi.org/10.1001/jamanetworkopen.2019.13968
2 https://github.com/UCSC-Treehouse/mend_qc
3 https://www.biorxiv.org/content/10.1101/716829v1


claymcleod commented Nov 27, 2019

Thanks for reviewing @hbeale, this is really helpful and affirming feedback. We'll be sure to let you know when we release the new version of the data.

@claymcleod claymcleod deleted the cmcleod/rnaseq-workflow-v2 branch Nov 27, 2019