Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dissimilar data points in re-analysis #13

Closed
Mkddb opened this issue Oct 6, 2020 · 4 comments
Closed

Dissimilar data points in re-analysis #13

Mkddb opened this issue Oct 6, 2020 · 4 comments

Comments

@Mkddb
Copy link

Mkddb commented Oct 6, 2020

Dear team,

I am checking a few tools with the raw data and scripts as given/mentioned in your analysis.

There are minor fractional diffrences with the Simulated data, but i am getting significant differences from Real datasets.

I am using Real dataset 4 (NA12878) downloaded from ftp://ftp.ddbj.nig.ac.jp/ddbj_database/dra/fastq/SRA245/SRA245681/SRX950495
Files 👍 SRR1910373_1.fastq.bz2 and SRR1910373_1.fastq.bz2 and unzipped them.

Followed these commands to generate bam and index file..

bwa mem hs37d5.fa SRR1910373_1.fastq SRR1910373_2.fastq > SRR1910373.sam
samtools view -S -b SRR1910373.sam > SRR1910373.bam
samtools sort SRR1910373.bam -o SRR1910373_sorted.bam
samtools index SRR1910373_sorted.bam.bai

bwa version - Version: 0.7.17-r1188
samtools version - Version: 1.7 (using htslib 1.7-2)

Reference data used : hs37d5.fa
Link : ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz

#Then, installed Manta 0.29.5 using the default commands in installation.md

Executed it to call SV on Real data 4 :

configManta.py --bam SRR1910373_sorted.bam --referenceFasta hs37d5.fa --runDir /media/divakar/Thirddisk/RealData_Kosugianalysis/new1/

./runWorkflow.py -m local -j 6 -g 100

#Conversion and evaluation of *VCFs called

perl convert_Manta_vcf.pl diploidSV.vcf > diploidSVconverted.vcf

perl evaluate_SV_callers.pl -r N diploidSVconverted.vcf

Obtained this result
Ref-DEL: 9176 tinny (<= 100 bp): 1192 short (<=1.0 kp): 5558 middle (<= 100 kb): 2330 large (> 100 kb): 96
Ref-INS: 13669 short (<=1.0 kp): 12660 middle (<= 100 kb): 1008 large (> 100 kb): 1
Ref-DUP: 2604 short (<=1.0 kp): 873 middle (<= 100 kb): 1585 large (> 100 kb): 146
Ref-INV: 274 short (<=1.0 kp): 26 middle (<= 100 kb): 200 large (> 100 kb): 48

<< NA12878_DGV-2016_LR-assembly >>

DEL

    	<Number of supporting reads>
    	2	3	4	5	6	7	8	9	10	12

Call (A) 5991 5945 5890 5823 5732 5628 5498 5368 5216 4902
Recall (A) 41.1 41 41 40.9 40.6 40.3 39.9 39.5 38.8 37.3
Precis (A) 63 63.4 63.8 64.4 65.1 65.8 66.7 67.6 68.3 69.9
Call (SS) 2291 2268 2241 2210 2171 2118 2054 1983 1912 1765
Recall (SS) 60.4 60.4 60.4 60.2 59.9 59.4 59.1 58.1 57.1 55.1
Precis (SS) 40.3 40.7 41.2 41.6 42.1 42.8 43.7 44.5 45 46.6
Call (S) 2887 2866 2842 2806 2756 2707 2648 2598 2529 2401
Recall (S) 42.1 42 41.9 41.8 41.5 41.1 40.6 40.3 39.5 38
Precis (S) 74.1 74.5 75 75.8 76.7 77.4 78.3 79.2 79.9 81.3
Call (M) 771 769 766 766 765 763 756 748 736 697
Recall (M) 30.1 30.1 30 30 30 29.9 29.7 29.5 29.1 27.7
Precis (M) 90.6 90.8 90.8 90.8 90.8 90.8 91 91.1 91.5 91.9
Call (L) 42 42 41 41 40 40 40 39 39 39
Recall (L) 12.5 12.5 11.4 11.4 10.4 10.4 10.4 10.4 10.4 10.4
Precis (L) 26.1 26.1 24.3 24.3 22.5 22.5 22.5 23 23 23

DUP

    	<Number of supporting reads>
    	2	3	4	5	6	7	8	9	10	12

Call (A) 466 466 463 456 444 431 408 389 369 339
Recall (A) 8 8 7.9 7.7 7.6 7.4 6.9 6.4 6 5.3
Precis (A) 47.2 47.2 47.3 46.7 47.2 46.8 46 44.9 44.4 42.1
Call (S) 312 312 309 302 292 279 260 241 223 195
Recall (S) 19.6 19.6 19.5 19 18.7 18 16.8 15.7 14.7 12.9
Precis (S) 63.4 63.4 63.7 63.2 64.3 64.5 64.2 63.9 64.1 63.5
Call (M) 126 126 126 126 125 125 123 123 122 120
Recall (M) 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1
Precis (M) 16.6 16.6 16.6 16.6 16.8 16.8 16.2 16.2 16.3 15
Call (L) 28 28 28 28 27 27 25 25 24 24
Recall (L) 2 2 2 2 2 2 1.3 1.3 1.3 1.3
Precis (L) 3.5 3.5 3.5 3.5 3.7 3.7 4 4 4.1 4.1

INS

    	<Number of supporting reads>
    	2	3	4	5	6	7	8	9	10	12

Call (A) 2546 2544 2544 2542 2529 2517 2503 2489 2465 2416
Recall (A) 14.3 14.3 14.3 14.3 14.3 14.2 14.2 14.1 14 13.8
Precis (A) 77.2 77.2 77.2 77.3 77.4 77.5 77.7 77.9 78.1 78.4

INV

    	<Number of supporting reads>
    	2	3	4	5	6	7	8	9	10	12

Call (A) 259 259 258 257 255 254 253 251 247 238
Recall (A) 24.4 24.4 24.4 24 24 24 24 24 24 22.9
Precis (A) 25.8 25.8 25.9 25.6 25.8 25.9 26 26.2 26.7 26.4

While, the data in Table S9 of additional file 3 was as following :

Manta DEL All Call (A) 5012 4966 4888 4788 4604 4405 4177 3908 3610 2953
Recall (A) 37.2 37.2 37 36.7 35.9 34.9 33.5 31.8 29.8 25.1
Precis (A) 68.2 68.7 69.6 70.3 71.6 72.7 73.7 74.6 75.9 78
MIER (A) 30.9 30.4 29.6 28.8 27.8 27.1 26 25.1 24.3 21.8
SS Call (SS) 1870 1849 1810 1753 1658 1579 1469 1363 1235 978
Recall (SS) 55.9 55.8 55.7 55.1 54.1 53.3 51 48.4 44.7 37
Precis (SS) 44.3 44.7 45.5 46.4 48 49.3 50.8 51.4 52.9 54.8
MIER (SS) 45.8 45.4 44.8 43.9 43.1 42.6 41.9 41.1 40.7 38.8
S Call (S) 2439 2417 2380 2342 2267 2172 2080 1954 1826 1520
Recall (S) 37.7 37.6 37.5 37.1 36.2 35.1 33.9 32 30.4 25.7
Precis (S) 79.6 80.2 81.2 81.6 82.4 83.6 84.2 85.1 86.2 88.3
MIER (S) 25.3 24.8 23.9 23.2 22 21 20 19 18.1 15.6
M Call (M) 676 674 672 667 653 630 604 570 528 436
Recall (M) 27.6 27.5 27.5 27.3 26.8 25.9 24.8 23.6 21.9 18.1
Precis (M) 94.6 94.8 94.7 94.9 95.2 95.2 95.1 95.7 95.8 96.3
MIER (M) 9.4 9.3 9.2 8.9 8.5 8.7 8.4 7.8 7.7 5.7
L Call (L) 27 26 26 26 26 24 24 21 21 19
Recall (L) 11.4 10.4 10.4 10.4 10.4 9.3 8.3 7.2 7.2 7.2
Precis (L) 37 34.6 34.6 34.6 34.6 33.3 33.3 33.3 33.3 36.8
MIER (L) 33.3 34.6 34.6 34.6 34.6 29.1 29.1 19 19 15.7
DUP All Call (A) 324 322 313 301 284 260 237 216 187 138
Recall (A) 5.9 5.9 5.8 5.5 5.3 4.7 4.3 3.9 3.4 2.4
Precis (A) 50 50.3 50.4 50.1 50.3 48.4 48.5 48.1 48.6 47.1
MIER (A) 45.6 45.3 45 44.1 42.6 41.5 40.5 41.6 42.2 43.4
S Call (S) 224 222 213 201 185 164 144 126 114 83
Recall (S) 14.6 14.6 14.3 13.6 12.9 11.4 10.3 9.2 8.5 6.1
Precis (S) 63.8 64.4 65.2 65.6 67 65.8 67.3 68.2 69.2 67.4
MIER (S) 56.2 55.8 55.8 55.2 53.5 52.4 51.3 53.9 51.7 50.6
M Call (M) 85 85 85 85 84 81 79 77 63 46
Recall (M) 1.1 1.1 1.1 1.1 1.1 1 1 1 0.7 0.5
Precis (M) 21.1 21.1 21.1 21.1 21.4 20.9 21.5 22 19 19.5
MIER (M) 23.5 23.5 23.5 23.5 23.8 24.6 25.3 25.9 28.5 34.7
L Call (L) 15 15 15 15 15 15 14 13 10 9
Recall (L) 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0 0
Precis (L) 6.6 6.6 6.6 6.6 6.6 6.6 7.1 7.6 0 0
MIER (L) 13.3 13.3 13.3 13.3 13.3 13.3 14.2 15.3 20 22.2
INS All Call (A) 1819 1816 1809 1803 1782 1754 1727 1680 1641 1538
Recall (A) 11 11 11 10.9 10.8 10.5 10.4 10.1 9.9 9.3
Precis (A) 77.7 77.7 77.8 77.9 78.1 78.5 78.8 79.4 79.7 80.8
MIER (A) 55.5 55.5 55.5 55.4 55.2 54.7 54.7 54.5 54.2 53.5
INV All Call (A) 180 180 178 175 173 166 158 148 137 120
Recall (A) 22.2 22.2 21.8 21.5 21.1 19.7 19.7 18.6 17.1 15.3
Precis (A) 33.8 33.8 33.7 33.7 33.5 32.5 34.1 34.4 34.3 35
MIER (A) 16.6 16.6 16.2 16 15.6 15 15.1 14.8 14.5 13.3

There are significant differences in Recall and Precision values.
do you see something erroneous for this difference ?

Tried on few other tools and obtained similar diffrences for Real dataset4 but not for SIM-A.

can you help me to understand and fix this ?

@stat-lab
Copy link
Owner

stat-lab commented Oct 6, 2020 via email

@Mkddb
Copy link
Author

Mkddb commented Oct 7, 2020

Thanks for highlighting it.

Yes, i read the manuscript carefully again. it's mentioned "Table S3. Whole genome sequencing reads used in this study" and "Genome coverage for NA12878 (data4) *4 is 30x". hope, it's the same thing being highlighted here.

But, i am not able to understand this "We have not used all the SRR1910373 reads but used only 30x coverage reads from SRR1910373"
does it really mean only 30x coverage reads from SRR1910373 Fastq files or the whole genome coverage(which is actually an overall average) is 30. also, Are you using these files only ?
ftp://ftp.ddbj.nig.ac.jp/ddbj_database/dra/fastq/SRA245/SRA245681/SRX950495/SRR1910373_1.fastq.bz2
ftp://ftp.ddbj.nig.ac.jp/ddbj_database/dra/fastq/SRA245/SRA245681/SRX950495/SRR1910373_2.fastq.bz2

what about 3rd Fastq file ?
ftp://ftp.ddbj.nig.ac.jp/ddbj_database/dra/fastq/SRA245/SRA245681/SRX950495/SRR1910373.fastq.bz2

Can you please ellaborate as how to use only 30x coverage reads from above data. It will be really helpful if these script for fastq to bam conversion or the final *bam files can be shared for NA12878 (data4).

This gives me additional doubt how would one obtain RSS of 2 to 12, if only the 30x coverage reads were used for the alignment ?

@stat-lab
Copy link
Owner

stat-lab commented Oct 7, 2020 via email

@Mkddb
Copy link
Author

Mkddb commented Oct 7, 2020

Are these commands correct to obtain the same or i need to apply additional parameters.

bzip2 -d SRR1910373_1.fastq.bz2
bzip2 -d SRR1910373_2.fastq.bz2

bwa mem hs37d5.fa SRR1910373_1.fastq SRR1910373_2.fastq > SRR1910373.sam

samtools view -S -b SRR1910373.sam > SRR1910373.bam

samtools sort SRR1910373.bam -o SRR1910373_sorted.bam

samtools index SRR1910373_sorted.bam.bai

@Mkddb Mkddb closed this as completed Nov 7, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants