Empirically determining optimal read length cutoff for re-sequencing projects
ASSUMPTION
The idea is that the shorter reads probably have more (mapping) errors and therefore contribute more *apparent* SNPs. We want to find, empirically for each data set, the read length at which this effect tapers off, and to use it as a read length cutoff.
APPROACH
I use a series of simple Perl scripts and Linux system utilities: for each entry of a SAM file I extract the NM tag (edit distance: SNPs and indels) and the read length. For each possible read length (between about 20 and 40 bp or so) I then collect the same total number of bases. Finally, I output how many variants there are for each read length. As a cutoff, we use the read length at which we start to see a relatively constant number of variants for the bases collected. These analyses are currently done in four steps and require two different Perl scripts; see below.
INPUT
Required input is a SAM file that is the result of these steps (a sketch of one possible pipeline follows the list):
- trim and merge reads with a minimum length filter of 10 bp
- collapse duplicates using samtools rmdup
- map the reads with bwa-aln & samtools, with read groups (as needed for GATK variant calling) and a -q 30 mapping filter
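For concreteness, here is a minimal sketch of one way such a pipeline might look. All file names and the read-group string are placeholders, the choice of trimming/merging tool is left open, the -q 30 filter is read here as a samtools mapping-quality filter (an assumption), and duplicates are collapsed after mapping because samtools rmdup operates on sorted alignments:

bwa index reference.fa
bwa aln reference.fa merged_reads.fq > merged_reads.sai
bwa samse -r '@RG\tID:lib1\tSM:sample1\tPL:ILLUMINA' reference.fa merged_reads.sai merged_reads.fq \
  | samtools view -bS -q 30 - \
  | samtools sort -o mapped_sorted.bam -
samtools rmdup -s mapped_sorted.bam mapped_rmdup.bam   # -s: single-end (merged) reads
samtools view -h mapped_rmdup.bam > input.sam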
ANALYSIS STEPS
1. make an inventory of how many SNPs/indels (the NM tag in the SAM file!) there are for each read, and also record the read length:
perl 01_computeStats.pl /path/to/input.sam output.txt
output: a. on the terminal: information about how many mapped reads don't have an NM tag. This number should be zero!
b. the file output.txt
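The actual 01_computeStats.pl is not reproduced here; the following is only a minimal sketch of what this step amounts to, assuming a two-column tab-separated output (read length, NM count) per mapped read:

use strict;
use warnings;

my ($in, $out) = @ARGV;
open(my $sam, '<', $in)  or die "cannot open $in: $!";
open(my $tab, '>', $out) or die "cannot open $out: $!";

my $missing_nm = 0;
while (my $line = <$sam>) {
    next if $line =~ /^@/;               # skip SAM header lines
    chomp $line;
    my @f = split /\t/, $line;
    next if $f[1] & 4;                   # skip unmapped reads (FLAG bit 0x4)
    my $len = length $f[9];              # read length, taken from the SEQ field
    if ($line =~ /\tNM:i:(\d+)/) {       # NM tag: edit distance to the reference
        print $tab "$len\t$1\n";
    } else {
        $missing_nm++;                   # mapped read without an NM tag
    }
}
print "mapped reads without NM tag: $missing_nm\n";   # should be zero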
2. shuffle the order of the lines in the file, because we don't know whether the original file is sorted by quality:
perl -MList::Util -e 'print List::Util::shuffle <>' output.txt > output_shuffled.txt
output: the file output_shuffled.txt
3. compare read length bins for the full data set:
perl 02_compareBinsLT30.pl output_shuffled.txt 20000000000
output: a table that lists, for each read length bin, the total amount of data (in nt) and the number of SNPs/indels for that data (a sketch of the binning logic follows the example output and discussion below)
EXAMPLE OUTPUT
key lengths variants
rl10 0 0
rl11 0 0
rl12 0 0
rl13 0 0
rl14 0 0
rl15 195 10
rl16 208 0
rl17 1156 2
rl18 2610 61
rl19 14573 1108
rl20 59340 5048
rl21 102606 8490
rl22 115610 8928
rl23 88734 5975
rl24 55752 2883
rl25 42050 1310
rl26 34996 606
rl27 35073 288
rl28 35980 210
rl29 42224 207
rl30 43590 190
rl31 48422 195
rl32 50272 213
rl33 53856 239
rl34 58378 269
rl35 59255 217
rl40 67800 331
rl45 71955 329
rl50 67650 269
rl55 57805 237
rl60 54000 218
---> so this tells me, for example, that all reads of length 23 (rl23) make up a total length of 88,734 bp and collectively contain 5,975 variants. Based on this information, I select a number of bp from the second column for the next step (see below) such that for the entire range of read lengths I'm interested in, I'm able to collect the same total length of sequence data. I could use 35,000 here, for example. Selecting 40,000 is a bad idea since I don't have that much data for some of the read lengths (e.g., rl26-28).
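The actual 02_compareBinsLT30.pl is likewise not reproduced here; this is a minimal sketch of the binning logic, assuming the two-column output of step 1 and the per-bin base cap as the second argument:

use strict;
use warnings;

my ($in, $cap) = @ARGV;            # cap = total bases to collect per length bin
open(my $tab, '<', $in) or die "cannot open $in: $!";

my (%bases, %variants);
while (my $line = <$tab>) {
    chomp $line;
    my ($len, $nm) = split /\t/, $line;
    next if ($bases{$len} // 0) >= $cap;   # this bin is already full
    $bases{$len}    += $len;       # total bases collected at this read length
    $variants{$len} += $nm;        # total SNPs/indels (NM) for those bases
}

print "key\tlengths\tvariants\n";
for my $len (sort { $a <=> $b } keys %bases) {
    print "rl$len\t$bases{$len}\t$variants{$len}\n";
}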
4. compare read length bins for a reasonable total length per read length bin. This requires looking at the data and setting the total length yourself; see the explanation above.
perl 02_compareBinsLT30.pl output_shuffled.txt xxxxx
output: the same table as above, but with each read length bin capped at a given total data volume of xxxxx nt
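For the example output below, with xxxxx = 25000, the call would be:
perl 02_compareBinsLT30.pl output_shuffled.txt 25000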
EXAMPLE OUTPUT with xxxxx = 25000
key lengths variants
rl10 0 0
rl11 0 0
rl12 0 0
rl13 0 0
rl14 0 0
rl15 195 10
rl16 208 0
rl17 1156 2
rl18 2610 61
rl19 14573 1108
rl20 25000 2165
rl21 25011 2090
rl22 25014 1901
rl23 25001 1677
rl24 25008 1277
rl25 25000 758
rl26 25012 411
rl27 25002 215
rl28 25004 142
rl29 25027 124
rl30 25020 109
rl31 25017 106
rl32 25024 103
rl33 25014 115
rl34 25024 109
rl35 25025 111
rl40 25000 118
rl45 25020 102
rl50 25000 110
rl55 25025 110
rl60 25020 91
---> now I can compare the total number of variants across read lengths, having collected xxxxx = 25000 bp for each. For reads of lengths 20-24 there are many more variants; the count drops at rl25 and rl26, and by rl28 things start to taper off. If pressed for data, I could use a cutoff of 28 bp; otherwise 29 or 30 might be safer.
5. Sincere apologies
These scripts are ugly! The whole thing started out as a quick-and-dirty way to look at some of our own people's data, and I never got around to prettifying the approach.