Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

info required on duplicate reads #11

Closed
tahashmi opened this issue Mar 2, 2021 · 6 comments
Closed

info required on duplicate reads #11

tahashmi opened this issue Mar 2, 2021 · 6 comments

Comments

@tahashmi
Copy link

tahashmi commented Mar 2, 2021

Hi Umair,

I hope you are doing well!
I am working on cluster scale acceleration of different neural networks based (like DeepVariant) variant calling complete workflows (BWA/minimap2 for alignment, sorting and duplicates removal stages). I am using PySpark to leverage the benefits of Apache Arrow in-memory columnar data format. I want to integrate NanoCaller in my workflow as well. I have couple of questions regarding this.

For PacBio CCS data:
1). You mentioned "PacBio CCS alignment files for HG001 are downloaded from the GIAB database [30, 34]" Do you mean this ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG002_NA24385_son/PacBio_MtSinai_NIST/ dataset? If yes, did you align this dataset yourself or just used BAM files. I want to know if duplicate removal step is essential for this dataset or this is PCR-free.

2). Does PCR-free FASTQ dataset means no need of running duplicates removal application on it (just for my info.)?

3). Does PacBio CCS reads available on precisionFDA challengev2 website needs duplicates removal stage?

Thanks!

@umahsn
Copy link
Collaborator

umahsn commented Mar 4, 2021

Hi Tanveer,

Thank you for your interest in NanoCaller.

For HG001 (NA12878), we used aligned CCS reads from here: (https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/NA12878/PacBio_SequelII_CCS_11kb/HG001_GRCh38/) so we didn't do any alignment ourselves. For HG002-HG004, we obtained PacBio CCS FASTQ files from precisionFDA (now available at https://data.nist.gov/pdr/od/id/mds2-2336). We aligned these reads with minimap2 without any duplicate removal. Generally, for long-reads sequencing, you do not need to do any duplicate removal, regardless of PCR-free or not.

You can consult the supplementary file of precisionFDA challenge preprint to see what different teams did for alignment and any other pre-processing steps. Since we did not sequence these genomes, we do not if anything else was done before the FASTQ files were released publicly, so it would be better to consult with the sequencing team about that.

Just FYI, we used the following minimap2 command for CCS reads, with settings used in several aligned files provided by GIAB (example):
minimap2 -a -x map-pb -k 19 -O 5,56 -E 4,1 -B 5 -z 400,50 -r 2k --eqx --secondary=no

Let me know if there are any further questions.

@tahashmi
Copy link
Author

tahashmi commented Mar 4, 2021

Thank you so much Umair for your quick and detailed reply. It's clear now.

Would you like to mention a short flow of tools in ensemble mode (like mentioned here) for your precisionFDA challenge submission? I want to know if I can use PySaprk dataframes for SAM data to integrate such ensemble mode variant calling workflow. Thanks.

@umahsn
Copy link
Collaborator

umahsn commented Mar 11, 2021

For our precisionFDA submission we had a somewhat elementary ensemble. We ran NanoCaller, Clair and Medaka separately on BAM files. After variant calling we took variant quality scores (QUAL field) of predictions made by each of the three variant callers and scaled them so that were all on a uniform scale. The reason for this is most of Clair's quality score were mostly in 500-1000, Medaka's score were mostly in 0-25, and NanoCaller's quality scores were mostly in 30-400 range.

We performed a weighted vote for each variant call, instead of majority voting. More precisely, we looked at each variant call in the union of three tools, issued the same call in final output with the following quality score:

QUAL=(NanoCaller QUAL score for the variant +Clair QUAL score for the variant +Medaka QUAL score for the variant)/3

If a tool did not make the variant call in union set we gave a quality score of 0 for that variant caller. So if there was a variant call made by NanoCaller with QUAL=240, but Medaka and Clair did not make this variant call, then the ensemble will issue the same variant call with QUAL=(214+0+0)/3=80.

So instead of removing calls with low number of votes, we just gave it a lower score. The reason for this is that there are several variant calls that are uniquely called by only one or two methods, so it would not seem prudent to discard such calls. This allows us maximum recall, but with quality score filtering we can make the callset more stringent to remove low vote supporting calls.

@tahashmi
Copy link
Author

Thanks Umair for this detailed answer. Is it possible to contact you on your email for a possible collaboration? Thanks.

@umahsn
Copy link
Collaborator

umahsn commented Mar 18, 2021

Hi Tanveer, yes that would be great. Could you give me your email so I can copy the rest of my team in the email as well?

@tahashmi

This comment has been minimized.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants