Dissimilar data points in re-analysis #13
We have not used all the SRR1910373 reads but used only 30x coverage reads from SRR1910373. Check Table S3 and carefully read our manuscript.
… On 2020/10/06 14:51, Mkddb ***@***.***> wrote:
Dear team,
I am checking a few tools with the raw data and scripts as given/mentioned in your analysis.
There are minor fractional differences with the simulated data, but I am getting significant differences with the real datasets.
I am using Real dataset 4 (NA12878) downloaded from ftp://ftp.ddbj.nig.ac.jp/ddbj_database/dra/fastq/SRA245/SRA245681/SRX950495
Files: SRR1910373_1.fastq.bz2 and SRR1910373_2.fastq.bz2, which I unzipped.
I followed these commands to generate the BAM and index file:
bwa mem hs37d5.fa SRR1910373_1.fastq SRR1910373_2.fastq > SRR1910373.sam
samtools view -S -b SRR1910373.sam > SRR1910373.bam
samtools sort SRR1910373.bam -o SRR1910373_sorted.bam
samtools index SRR1910373_sorted.bam.bai
bwa version - Version: 0.7.17-r1188
samtools version - Version: 1.7 (using htslib 1.7-2)
Reference data used : hs37d5.fa
Link : ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz
# Then installed Manta 0.29.5 using the default commands in installation.md
Executed it to call SVs on Real dataset 4:
configManta.py --bam SRR1910373_sorted.bam --referenceFasta hs37d5.fa --runDir /media/divakar/Thirddisk/RealData_Kosugianalysis/new1/
./runWorkflow.py -m local -j 6 -g 100
# Conversion and evaluation of the called VCFs
perl convert_Manta_vcf.pl diploidSV.vcf > diploidSVconverted.vcf
perl evaluate_SV_callers.pl -r N diploidSVconverted.vcf
Obtained this result
Ref-DEL: 9176 tinny (<= 100 bp): 1192 short (<=1.0 kp): 5558 middle (<= 100 kb): 2330 large (> 100 kb): 96
Ref-INS: 13669 short (<=1.0 kp): 12660 middle (<= 100 kb): 1008 large (> 100 kb): 1
Ref-DUP: 2604 short (<=1.0 kp): 873 middle (<= 100 kb): 1585 large (> 100 kb): 146
Ref-INV: 274 short (<=1.0 kp): 26 middle (<= 100 kb): 200 large (> 100 kb): 48
<< NA12878_DGV-2016_LR-assembly >>
DEL
<Number of supporting reads>
2 3 4 5 6 7 8 9 10 12
Call (A) 5991 5945 5890 5823 5732 5628 5498 5368 5216 4902
Recall (A) 41.1 41 41 40.9 40.6 40.3 39.9 39.5 38.8 37.3
Precis (A) 63 63.4 63.8 64.4 65.1 65.8 66.7 67.6 68.3 69.9
Call (SS) 2291 2268 2241 2210 2171 2118 2054 1983 1912 1765
Recall (SS) 60.4 60.4 60.4 60.2 59.9 59.4 59.1 58.1 57.1 55.1
Precis (SS) 40.3 40.7 41.2 41.6 42.1 42.8 43.7 44.5 45 46.6
Call (S) 2887 2866 2842 2806 2756 2707 2648 2598 2529 2401
Recall (S) 42.1 42 41.9 41.8 41.5 41.1 40.6 40.3 39.5 38
Precis (S) 74.1 74.5 75 75.8 76.7 77.4 78.3 79.2 79.9 81.3
Call (M) 771 769 766 766 765 763 756 748 736 697
Recall (M) 30.1 30.1 30 30 30 29.9 29.7 29.5 29.1 27.7
Precis (M) 90.6 90.8 90.8 90.8 90.8 90.8 91 91.1 91.5 91.9
Call (L) 42 42 41 41 40 40 40 39 39 39
Recall (L) 12.5 12.5 11.4 11.4 10.4 10.4 10.4 10.4 10.4 10.4
Precis (L) 26.1 26.1 24.3 24.3 22.5 22.5 22.5 23 23 23
DUP
<Number of supporting reads>
2 3 4 5 6 7 8 9 10 12
Call (A) 466 466 463 456 444 431 408 389 369 339
Recall (A) 8 8 7.9 7.7 7.6 7.4 6.9 6.4 6 5.3
Precis (A) 47.2 47.2 47.3 46.7 47.2 46.8 46 44.9 44.4 42.1
Call (S) 312 312 309 302 292 279 260 241 223 195
Recall (S) 19.6 19.6 19.5 19 18.7 18 16.8 15.7 14.7 12.9
Precis (S) 63.4 63.4 63.7 63.2 64.3 64.5 64.2 63.9 64.1 63.5
Call (M) 126 126 126 126 125 125 123 123 122 120
Recall (M) 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1
Precis (M) 16.6 16.6 16.6 16.6 16.8 16.8 16.2 16.2 16.3 15
Call (L) 28 28 28 28 27 27 25 25 24 24
Recall (L) 2 2 2 2 2 2 1.3 1.3 1.3 1.3
Precis (L) 3.5 3.5 3.5 3.5 3.7 3.7 4 4 4.1 4.1
INS
<Number of supporting reads>
2 3 4 5 6 7 8 9 10 12
Call (A) 2546 2544 2544 2542 2529 2517 2503 2489 2465 2416
Recall (A) 14.3 14.3 14.3 14.3 14.3 14.2 14.2 14.1 14 13.8
Precis (A) 77.2 77.2 77.2 77.3 77.4 77.5 77.7 77.9 78.1 78.4
INV
<Number of supporting reads>
2 3 4 5 6 7 8 9 10 12
Call (A) 259 259 258 257 255 254 253 251 247 238
Recall (A) 24.4 24.4 24.4 24 24 24 24 24 24 22.9
Precis (A) 25.8 25.8 25.9 25.6 25.8 25.9 26 26.2 26.7 26.4
Whereas the data in Table S9 of Additional file 3 (for the same numbers of supporting reads, 2 to 12) was as follows:
Manta DEL All Call (A) 5012 4966 4888 4788 4604 4405 4177 3908 3610 2953
Recall (A) 37.2 37.2 37 36.7 35.9 34.9 33.5 31.8 29.8 25.1
Precis (A) 68.2 68.7 69.6 70.3 71.6 72.7 73.7 74.6 75.9 78
MIER (A) 30.9 30.4 29.6 28.8 27.8 27.1 26 25.1 24.3 21.8
SS Call (SS) 1870 1849 1810 1753 1658 1579 1469 1363 1235 978
Recall (SS) 55.9 55.8 55.7 55.1 54.1 53.3 51 48.4 44.7 37
Precis (SS) 44.3 44.7 45.5 46.4 48 49.3 50.8 51.4 52.9 54.8
MIER (SS) 45.8 45.4 44.8 43.9 43.1 42.6 41.9 41.1 40.7 38.8
S Call (S) 2439 2417 2380 2342 2267 2172 2080 1954 1826 1520
Recall (S) 37.7 37.6 37.5 37.1 36.2 35.1 33.9 32 30.4 25.7
Precis (S) 79.6 80.2 81.2 81.6 82.4 83.6 84.2 85.1 86.2 88.3
MIER (S) 25.3 24.8 23.9 23.2 22 21 20 19 18.1 15.6
M Call (M) 676 674 672 667 653 630 604 570 528 436
Recall (M) 27.6 27.5 27.5 27.3 26.8 25.9 24.8 23.6 21.9 18.1
Precis (M) 94.6 94.8 94.7 94.9 95.2 95.2 95.1 95.7 95.8 96.3
MIER (M) 9.4 9.3 9.2 8.9 8.5 8.7 8.4 7.8 7.7 5.7
L Call (L) 27 26 26 26 26 24 24 21 21 19
Recall (L) 11.4 10.4 10.4 10.4 10.4 9.3 8.3 7.2 7.2 7.2
Precis (L) 37 34.6 34.6 34.6 34.6 33.3 33.3 33.3 33.3 36.8
MIER (L) 33.3 34.6 34.6 34.6 34.6 29.1 29.1 19 19 15.7
DUP All Call (A) 324 322 313 301 284 260 237 216 187 138
Recall (A) 5.9 5.9 5.8 5.5 5.3 4.7 4.3 3.9 3.4 2.4
Precis (A) 50 50.3 50.4 50.1 50.3 48.4 48.5 48.1 48.6 47.1
MIER (A) 45.6 45.3 45 44.1 42.6 41.5 40.5 41.6 42.2 43.4
S Call (S) 224 222 213 201 185 164 144 126 114 83
Recall (S) 14.6 14.6 14.3 13.6 12.9 11.4 10.3 9.2 8.5 6.1
Precis (S) 63.8 64.4 65.2 65.6 67 65.8 67.3 68.2 69.2 67.4
MIER (S) 56.2 55.8 55.8 55.2 53.5 52.4 51.3 53.9 51.7 50.6
M Call (M) 85 85 85 85 84 81 79 77 63 46
Recall (M) 1.1 1.1 1.1 1.1 1.1 1 1 1 0.7 0.5
Precis (M) 21.1 21.1 21.1 21.1 21.4 20.9 21.5 22 19 19.5
MIER (M) 23.5 23.5 23.5 23.5 23.8 24.6 25.3 25.9 28.5 34.7
L Call (L) 15 15 15 15 15 15 14 13 10 9
Recall (L) 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0 0
Precis (L) 6.6 6.6 6.6 6.6 6.6 6.6 7.1 7.6 0 0
MIER (L) 13.3 13.3 13.3 13.3 13.3 13.3 14.2 15.3 20 22.2
INS All Call (A) 1819 1816 1809 1803 1782 1754 1727 1680 1641 1538
Recall (A) 11 11 11 10.9 10.8 10.5 10.4 10.1 9.9 9.3
Precis (A) 77.7 77.7 77.8 77.9 78.1 78.5 78.8 79.4 79.7 80.8
MIER (A) 55.5 55.5 55.5 55.4 55.2 54.7 54.7 54.5 54.2 53.5
INV All Call (A) 180 180 178 175 173 166 158 148 137 120
Recall (A) 22.2 22.2 21.8 21.5 21.1 19.7 19.7 18.6 17.1 15.3
Precis (A) 33.8 33.8 33.7 33.7 33.5 32.5 34.1 34.4 34.3 35
MIER (A) 16.6 16.6 16.2 16 15.6 15 15.1 14.8 14.5 13.3
There are significant differences in the Recall and Precision values.
Do you see anything erroneous that would explain this difference?
I tried a few other tools and obtained similar differences for Real dataset 4, but not for SIM-A.
Can you help me understand and fix this?
Thanks for highlighting it. Yes, I read the manuscript carefully again. It mentions "Table S3. Whole genome sequencing reads used in this study" and "Genome coverage for NA12878 (data4) *4 is 30x"; I hope that is the same thing being highlighted here. But I am not able to understand "We have not used all the SRR1910373 reads but used only 30x coverage reads from SRR1910373". What about the 3rd FASTQ file? Can you please elaborate on how to use only 30x coverage reads from the above data? It would be really helpful if the script for FASTQ-to-BAM conversion or the final BAM files could be shared for NA12878 (data4). This gives me an additional doubt: how would one obtain RSS of 2 to 12 if only the 30x coverage reads were used for the alignment?
The SRR1910373 files contain 251 bp paired-end reads, so the number of read pairs corresponding to 30x coverage is about 3,000,000,000 x 30 / (251 x 2). That number of read pairs can be extracted from SRR1910373_1.fastq.bz2 and SRR1910373_2.fastq.bz2.
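For illustration, a minimal way to extract roughly that many read pairs is sketched below. It assumes the first read pairs in file order are taken, since the exact subsampling procedure is not stated here, and the output file names are only examples.
# 3,000,000,000 x 30 / (251 x 2) is roughly 179,282,868 read pairs for ~30x coverage
N=179282868
# each FASTQ record is 4 lines, so take 4*N lines from each mate file
bzcat SRR1910373_1.fastq.bz2 | head -n $((N * 4)) > SRR1910373_30x_1.fastq
bzcat SRR1910373_2.fastq.bz2 | head -n $((N * 4)) > SRR1910373_30x_2.fastq
A random subsample (for example, seqtk sample with the same seed on both files) would be an alternative and could yield slightly different SV calls.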
… On 2020/10/07 16:02, Mkddb ***@***.***> wrote:
Thanks for highlighting it.
Yes, I read the manuscript carefully again. It mentions "Table S3. Whole genome sequencing reads used in this study" and "Genome coverage for NA12878 (data4) *4 is 30x"; I hope that is the same thing being highlighted here.
But I am not able to understand "We have not used all the SRR1910373 reads but used only 30x coverage reads from SRR1910373".
Does it really mean only 30x coverage reads from the SRR1910373 FASTQ files, or that the whole-genome coverage (which is actually an overall average) is 30x? Also, are you using only these files?
ftp://ftp.ddbj.nig.ac.jp/ddbj_database/dra/fastq/SRA245/SRA245681/SRX950495/SRR1910373_1.fastq.bz2
ftp://ftp.ddbj.nig.ac.jp/ddbj_database/dra/fastq/SRA245/SRA245681/SRX950495/SRR1910373_2.fastq.bz2
What about the 3rd FASTQ file?
ftp://ftp.ddbj.nig.ac.jp/ddbj_database/dra/fastq/SRA245/SRA245681/SRX950495/SRR1910373.fastq.bz2
Can you please elaborate on how to use only 30x coverage reads from the above data? It would be really helpful if the script for FASTQ-to-BAM conversion or the final BAM files could be shared for NA12878 (data4).
This gives me an additional doubt: how would one obtain RSS of 2 to 12 if only the 30x coverage reads were used for the alignment?
Are these commands correct to obtain the same, or do I need to apply additional parameters?
bzip2 -d SRR1910373_1.fastq.bz2
bwa mem hs37d5.fa SRR1910373_1.fastq SRR1910373_2.fastq > SRR1910373.sam
samtools view -S -b SRR1910373.sam > SRR1910373.bam
samtools sort SRR1910373.bam -o SRR1910373_sorted.bam
samtools index SRR1910373_sorted.bam.bai
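One detail worth noting about the last command, as a general samtools usage point rather than anything specific to this study: samtools index takes the sorted BAM itself as its argument and creates the index file, so the indexing step would normally be
samtools index SRR1910373_sorted.bam
which writes SRR1910373_sorted.bam.bai alongside the BAM.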