majority of running time for large samples is dominated by those bubbles #136

Closed
sebhtml opened this Issue Jan 19, 2013 · 35 comments

Comments

Projects
None yet
1 participant
@sebhtml
Owner

sebhtml commented Jan 19, 2013

Ray Cloud Browser shows that the weak part of bubbles are valid seeds.

With a k-mer length of , both part of the bubble have a length of 2k-1 for mismatch errors, and 2k-1-1 or 2k-1+1 for indels. But we see mostly 2k-1, 2k-2 and 2k-3 below:

The Seed Length Distribution also shows this:

179 517321
180 519227
181 526897
182 537475
183 542882
184 549478
185 565402
186 597626
187 8430639 ### 2k-3 Ray QoS should remove these
188 6957585 ### 2k-2 Ray QoS should remove these
189 3219146 ### 2k-1 Ray QoS should remove these
190 630013
191 566728
192 545189
193 541001
194 533416
195 525638
196 519764
197 513767
198 503885

@sebhtml

This comment has been minimized.

Show comment
Hide comment
@sebhtml

sebhtml Jan 30, 2013

Owner

OK I have a verbose log to better understand this.

[boisver1@ip03 white-spruce]$ pwd
/home/boisver1/corbeil_group/projects/white-spruce
[boisver1@ip03 white-spruce]$ ls -lh SRA056234-k95-Picea-glauca-mp2-4096-2013-01-29-18.std*
-rw------- 1 boisver1 corbeil 0 Jan 29 15:47 SRA056234-k95-Picea-glauca-mp2-4096-2013-01-29-18.stderr
-rw------- 1 boisver1 corbeil 12G Jan 29 15:47 SRA056234-k95-Picea-glauca-mp2-4096-2013-01-29-18.stdout

12 GiB of stuff !

Owner

sebhtml commented Jan 30, 2013

OK I have a verbose log to better understand this.

[boisver1@ip03 white-spruce]$ pwd
/home/boisver1/corbeil_group/projects/white-spruce
[boisver1@ip03 white-spruce]$ ls -lh SRA056234-k95-Picea-glauca-mp2-4096-2013-01-29-18.std*
-rw------- 1 boisver1 corbeil 0 Jan 29 15:47 SRA056234-k95-Picea-glauca-mp2-4096-2013-01-29-18.stderr
-rw------- 1 boisver1 corbeil 12G Jan 29 15:47 SRA056234-k95-Picea-glauca-mp2-4096-2013-01-29-18.stdout

12 GiB of stuff !

@sebhtml

This comment has been minimized.

Show comment
Hide comment
@sebhtml

sebhtml Jan 30, 2013

Owner

With a k-mer length of 95, every sequencing error will produce these funny parts in the graph.

Basically, the graph is hairy thing, and the hair must be preemptively eliminated, not trimmed from the ready graph.

The plugin actually discovered 15624944 seeds,
but progression was only [1600001/13379916] on average (11 % of the graph was explored).
7900671 / 15624944 -> coverage 2X
1556809 / 15624944 -> coverage 3X

If I extrapolate, this ill process will discover 130662692 seeds.

Clearly this must be fixed.

Using the coverage alone is not enough because
these sequencing artefacts occur also at higher coverage when the sample itself is sampled at higher coverage.

The solution will be topological and adaptive.

Owner

sebhtml commented Jan 30, 2013

With a k-mer length of 95, every sequencing error will produce these funny parts in the graph.

Basically, the graph is hairy thing, and the hair must be preemptively eliminated, not trimmed from the ready graph.

The plugin actually discovered 15624944 seeds,
but progression was only [1600001/13379916] on average (11 % of the graph was explored).
7900671 / 15624944 -> coverage 2X
1556809 / 15624944 -> coverage 3X

If I extrapolate, this ill process will discover 130662692 seeds.

Clearly this must be fixed.

Using the coverage alone is not enough because
these sequencing artefacts occur also at higher coverage when the sample itself is sampled at higher coverage.

The solution will be topological and adaptive.

@sebhtml

This comment has been minimized.

Show comment
Hide comment
@sebhtml

sebhtml Jan 30, 2013

Owner

Distribution of coverage for these seeds:

[boisver1@ip03 white-spruce]$ head counts
6367314 2
863117 3
516565 4
414316 5
400026 6
395455 7
390500 8
375874 9
352090 10
325445 11

Owner

sebhtml commented Jan 30, 2013

Distribution of coverage for these seeds:

[boisver1@ip03 white-spruce]$ head counts
6367314 2
863117 3
516565 4
414316 5
400026 6
395455 7
390500 8
375874 9
352090 10
325445 11

@sebhtml

This comment has been minimized.

Show comment
Hide comment
@sebhtml

sebhtml Jan 30, 2013

Owner

these all have this: Rank 2458 next vertex: Coverage= 2, ingoing coverages: outgoing coverages:

to they should be called dead-ends.

there seems to be a bug regarding that.

Owner

sebhtml commented Jan 30, 2013

these all have this: Rank 2458 next vertex: Coverage= 2, ingoing coverages: outgoing coverages:

to they should be called dead-ends.

there seems to be a bug regarding that.

@sebhtml

This comment has been minimized.

Show comment
Hide comment
@sebhtml

sebhtml Jan 30, 2013

Owner

before patchwork:

[boisver1@cp0333 Ray-Technology-Research]$ grep "adding seed" TestX.1.*|wc -l
525

after patchwork:

Check if this fix E.coli-DH10B-MiSeq-250x2 or at least reduce the number of seeds.

Owner

sebhtml commented Jan 30, 2013

before patchwork:

[boisver1@cp0333 Ray-Technology-Research]$ grep "adding seed" TestX.1.*|wc -l
525

after patchwork:

Check if this fix E.coli-DH10B-MiSeq-250x2 or at least reduce the number of seeds.

@sebhtml

This comment has been minimized.

Show comment
Hide comment
@sebhtml

sebhtml Jan 30, 2013

Owner

data path is /rap/nne-790-ab/projects/EdgeBio/invalid-seeds

Owner

sebhtml commented Jan 30, 2013

data path is /rap/nne-790-ab/projects/EdgeBio/invalid-seeds

@sebhtml

This comment has been minimized.

Show comment
Hide comment
Owner

sebhtml commented Jan 30, 2013

a-tip-forms-a-seed

@sebhtml

This comment has been minimized.

Show comment
Hide comment
@sebhtml

sebhtml Jan 30, 2013

Owner

actually, this is a picture of a eroded graph -- it's a bubble, not a tip

Owner

sebhtml commented Jan 30, 2013

actually, this is a picture of a eroded graph -- it's a bubble, not a tip

@sebhtml

This comment has been minimized.

Show comment
Hide comment
@sebhtml

sebhtml Jan 30, 2013

Owner

Algorithm:

We have a putative seed S.

Its last vertex has 1 child, and this child has 2 parents: the last vertex of our seed, and another one.

So we just try to backtrack on the other parent as long as the last symbol of vertices match the last symbol of the putative seed, if this holds true long enough, and if the coverage of the child (not in the seed) is more similar to the other parent, then the putative seed is a false seed.

Owner

sebhtml commented Jan 30, 2013

Algorithm:

We have a putative seed S.

Its last vertex has 1 child, and this child has 2 parents: the last vertex of our seed, and another one.

So we just try to backtrack on the other parent as long as the last symbol of vertices match the last symbol of the putative seed, if this holds true long enough, and if the coverage of the child (not in the seed) is more similar to the other parent, then the putative seed is a false seed.

@sebhtml

This comment has been minimized.

Show comment
Hide comment
@sebhtml

sebhtml Jan 30, 2013

Owner

Algorithm:

We have a putative seed S.

Its last vertex has 1 child, and this child has 2 parents: the last vertex of our seed, and another one.
It is also possible that the child has 2 children and 1 parent, but in that case, there are no bubbles anyway.

So we just try to backtrack on the other parent as long as the last symbol of vertices match the last symbol of the putative seed, if this holds true long enough, and if the coverage of the child (not in the seed) is more similar to the other parent, then the putative seed is a false seed.

Owner

sebhtml commented Jan 30, 2013

Algorithm:

We have a putative seed S.

Its last vertex has 1 child, and this child has 2 parents: the last vertex of our seed, and another one.
It is also possible that the child has 2 children and 1 parent, but in that case, there are no bubbles anyway.

So we just try to backtrack on the other parent as long as the last symbol of vertices match the last symbol of the putative seed, if this holds true long enough, and if the coverage of the child (not in the seed) is more similar to the other parent, then the putative seed is a false seed.

@sebhtml

This comment has been minimized.

Show comment
Hide comment
@sebhtml

sebhtml Jan 30, 2013

Owner

With bug:

$ head -n 30 EdgeBio-MiSeq-E.Coli-DH10B-250x2-Ray-61-2012-12-19-2/SeedLengthDistribution.txt

SeedLengthInNucleotides Frequency

100 8587
101 7565
102 7630
103 7257
104 6722
105 6442
106 6411
107 6129
108 6166
109 6701
110 6267
111 5308
112 4691
113 4325
114 4344
115 3983
116 3765
117 3721
118 3818
119 66410
120 139
121 77
122 56
123 57
124 56
125 42
126 53
127 44
128 40

Owner

sebhtml commented Jan 30, 2013

With bug:

$ head -n 30 EdgeBio-MiSeq-E.Coli-DH10B-250x2-Ray-61-2012-12-19-2/SeedLengthDistribution.txt

SeedLengthInNucleotides Frequency

100 8587
101 7565
102 7630
103 7257
104 6722
105 6442
106 6411
107 6129
108 6166
109 6701
110 6267
111 5308
112 4691
113 4325
114 4344
115 3983
116 3765
117 3721
118 3818
119 66410
120 139
121 77
122 56
123 57
124 56
125 42
126 53
127 44
128 40

@sebhtml

This comment has been minimized.

Show comment
Hide comment
@sebhtml

sebhtml Jan 30, 2013

Owner

k-mer length was 61. When 1 sequencing error occurs, the weak part of a bubble has:
k k-mers. In nucleotides, that's k+k-1 = 120, but results above shows that at 119 we have a peak of crap,

So we only need to run the bubble detection for seeds when we have a given number of vertices.

Owner

sebhtml commented Jan 30, 2013

k-mer length was 61. When 1 sequencing error occurs, the weak part of a bubble has:
k k-mers. In nucleotides, that's k+k-1 = 120, but results above shows that at 119 we have a peak of crap,

So we only need to run the bubble detection for seeds when we have a given number of vertices.

@sebhtml

This comment has been minimized.

Show comment
Hide comment
@sebhtml

sebhtml Jan 30, 2013

Owner

6 quality controls:

$ head -n 30 EdgeBio-MiSeq-E.Coli-DH10B-150x2-Ray-31-2012-12-19-1/SeedLengthDistribution.txt

SeedLengthInNucleotides Frequency

100 2
101 1
102 8
103 6
104 7
105 3
106 4
107 3
108 2
109 5
110 6
111 3
112 6
113 5
114 3
115 2
116 1
117 2
118 3
120 5
122 2
123 2
124 2
126 2
127 1
128 1
129 3
130 2
132 2

$ head -n 30 EdgeBio-MiSeq-E.Coli-DH10B-150x2-Ray-51-2012-12-19-1/SeedLengthDistribution.txt

SeedLengthInNucleotides Frequency

100 95
101 58
102 22
103 28
104 36
105 23
106 22
107 20
108 25
109 14
110 20
111 13
112 18
113 22
114 8
115 8
116 13
117 11
118 7
119 15
120 8
121 6
122 10
123 4
124 8
125 4
126 4
127 2
128 4

$ head -n 30 EdgeBio-MiSeq-E.Coli-DH10B-150x2-Ray-61-2012-12-19-1/SeedLengthDistribution.txt

SeedLengthInNucleotides Frequency

100 2655
101 2518
102 2531
103 2294
104 2120
105 2295
106 2072
107 1969
108 1888
109 1967
110 1809
111 1780
112 1667
113 1703
114 1706
115 1560
116 1350
117 1371
118 1191
119 17098 *******************************
120 61
121 50
122 15
123 9
124 19
125 12
126 18
127 10
128 13

$ head -n 30 EdgeBio-MiSeq-E.Coli-DH10B-250x2-Ray-31-2012-12-19-1/SeedLengthDistribution.txt

SeedLengthInNucleotides Frequency

100 2
101 1
102 9
103 4
104 6
105 5
106 4
107 4
108 5
109 2
110 8
111 4
112 4
113 3
114 3
115 5
116 1
117 2
118 2
119 1
120 3
121 2
122 2
123 3
124 1
126 5
128 2
129 3
130 3

$ head -n 30 EdgeBio-MiSeq-E.Coli-DH10B-250x2-Ray-51-2012-12-19-1/SeedLengthDistribution.txt

SeedLengthInNucleotides Frequency

100 234
101 119
102 82
103 73
104 93
105 62
106 55
107 43
108 61
109 53
110 50
111 49
112 33
113 44
114 46
115 29
116 41
117 36
118 19
119 22
120 28
121 21
122 23
123 15
124 25
125 19
126 16
127 14
128 12

$ head -n 30 EdgeBio-MiSeq-E.Coli-DH10B-250x2-Ray-61-2012-12-19-1/SeedLengthDistribution.txt

SeedLengthInNucleotides Frequency

100 8584
101 7578
102 7629
103 7254
104 6723
105 6441
106 6408
107 6128
108 6170
109 6701
110 6265
111 5312
112 4686
113 4338
114 4344
115 3982
116 3771
117 3719
118 3829
119 66387 *******************************
120 138
121 77
122 56
123 57
124 56
125 42
126 52
127 44
128 40

Owner

sebhtml commented Jan 30, 2013

6 quality controls:

$ head -n 30 EdgeBio-MiSeq-E.Coli-DH10B-150x2-Ray-31-2012-12-19-1/SeedLengthDistribution.txt

SeedLengthInNucleotides Frequency

100 2
101 1
102 8
103 6
104 7
105 3
106 4
107 3
108 2
109 5
110 6
111 3
112 6
113 5
114 3
115 2
116 1
117 2
118 3
120 5
122 2
123 2
124 2
126 2
127 1
128 1
129 3
130 2
132 2

$ head -n 30 EdgeBio-MiSeq-E.Coli-DH10B-150x2-Ray-51-2012-12-19-1/SeedLengthDistribution.txt

SeedLengthInNucleotides Frequency

100 95
101 58
102 22
103 28
104 36
105 23
106 22
107 20
108 25
109 14
110 20
111 13
112 18
113 22
114 8
115 8
116 13
117 11
118 7
119 15
120 8
121 6
122 10
123 4
124 8
125 4
126 4
127 2
128 4

$ head -n 30 EdgeBio-MiSeq-E.Coli-DH10B-150x2-Ray-61-2012-12-19-1/SeedLengthDistribution.txt

SeedLengthInNucleotides Frequency

100 2655
101 2518
102 2531
103 2294
104 2120
105 2295
106 2072
107 1969
108 1888
109 1967
110 1809
111 1780
112 1667
113 1703
114 1706
115 1560
116 1350
117 1371
118 1191
119 17098 *******************************
120 61
121 50
122 15
123 9
124 19
125 12
126 18
127 10
128 13

$ head -n 30 EdgeBio-MiSeq-E.Coli-DH10B-250x2-Ray-31-2012-12-19-1/SeedLengthDistribution.txt

SeedLengthInNucleotides Frequency

100 2
101 1
102 9
103 4
104 6
105 5
106 4
107 4
108 5
109 2
110 8
111 4
112 4
113 3
114 3
115 5
116 1
117 2
118 2
119 1
120 3
121 2
122 2
123 3
124 1
126 5
128 2
129 3
130 3

$ head -n 30 EdgeBio-MiSeq-E.Coli-DH10B-250x2-Ray-51-2012-12-19-1/SeedLengthDistribution.txt

SeedLengthInNucleotides Frequency

100 234
101 119
102 82
103 73
104 93
105 62
106 55
107 43
108 61
109 53
110 50
111 49
112 33
113 44
114 46
115 29
116 41
117 36
118 19
119 22
120 28
121 21
122 23
123 15
124 25
125 19
126 16
127 14
128 12

$ head -n 30 EdgeBio-MiSeq-E.Coli-DH10B-250x2-Ray-61-2012-12-19-1/SeedLengthDistribution.txt

SeedLengthInNucleotides Frequency

100 8584
101 7578
102 7629
103 7254
104 6723
105 6441
106 6408
107 6128
108 6170
109 6701
110 6265
111 5312
112 4686
113 4338
114 4344
115 3982
116 3771
117 3719
118 3829
119 66387 *******************************
120 138
121 77
122 56
123 57
124 56
125 42
126 52
127 44
128 40

@sebhtml

This comment has been minimized.

Show comment
Hide comment
@sebhtml

sebhtml Jan 30, 2013

Owner

Some quality controls jobs with fca1680

$ showq|grep boisv
10205257 sboisver Idle 32 2:30:00 Wed Jan 30 17:06:20
10205263 sboisver Idle 32 2:30:00 Wed Jan 30 17:06:24
10205265 sboisver Idle 32 2:30:00 Wed Jan 30 17:06:26
10205262 sboisver Idle 32 2:30:00 Wed Jan 30 17:06:24
10205255 sboisver Idle 32 2:30:00 Wed Jan 30 17:06:18
10205254 sboisver Idle 32 2:30:00 Wed Jan 30 17:06:18
10205256 sboisver Idle 32 2:30:00 Wed Jan 30 17:06:18
10205260 sboisver Idle 32 2:30:00 Wed Jan 30 17:06:22
10205253 sboisver Idle 32 2:30:00 Wed Jan 30 17:06:16
10205267 sboisver Idle 32 2:30:00 Wed Jan 30 17:06:27
10205266 sboisver Idle 32 2:30:00 Wed Jan 30 17:06:26
10205258 sboisver Idle 32 2:30:00 Wed Jan 30 17:06:20
10205268 sboisver Idle 32 2:30:00 Wed Jan 30 17:06:28
10205264 sboisver Idle 32 2:30:00 Wed Jan 30 17:06:25
10205259 sboisver Idle 32 2:30:00 Wed Jan 30 17:06:21
10205261 sboisver Idle 32 2:30:00 Wed Jan 30 17:06:22

Owner

sebhtml commented Jan 30, 2013

Some quality controls jobs with fca1680

$ showq|grep boisv
10205257 sboisver Idle 32 2:30:00 Wed Jan 30 17:06:20
10205263 sboisver Idle 32 2:30:00 Wed Jan 30 17:06:24
10205265 sboisver Idle 32 2:30:00 Wed Jan 30 17:06:26
10205262 sboisver Idle 32 2:30:00 Wed Jan 30 17:06:24
10205255 sboisver Idle 32 2:30:00 Wed Jan 30 17:06:18
10205254 sboisver Idle 32 2:30:00 Wed Jan 30 17:06:18
10205256 sboisver Idle 32 2:30:00 Wed Jan 30 17:06:18
10205260 sboisver Idle 32 2:30:00 Wed Jan 30 17:06:22
10205253 sboisver Idle 32 2:30:00 Wed Jan 30 17:06:16
10205267 sboisver Idle 32 2:30:00 Wed Jan 30 17:06:27
10205266 sboisver Idle 32 2:30:00 Wed Jan 30 17:06:26
10205258 sboisver Idle 32 2:30:00 Wed Jan 30 17:06:20
10205268 sboisver Idle 32 2:30:00 Wed Jan 30 17:06:28
10205264 sboisver Idle 32 2:30:00 Wed Jan 30 17:06:25
10205259 sboisver Idle 32 2:30:00 Wed Jan 30 17:06:21
10205261 sboisver Idle 32 2:30:00 Wed Jan 30 17:06:22

@sebhtml

This comment has been minimized.

Show comment
Hide comment
@sebhtml

sebhtml Jan 31, 2013

Owner

patch fca1680

Before patch @k=61

Contigs >= 100 nt
Number: 177174
Total length: 24678110

After patch @k=61

Contigs >= 100 nt
Number: 137353
Total length: 20595821

A lot of useless stumps were removed.

The remaining stumps are probably weak parts of bubbles.

Owner

sebhtml commented Jan 31, 2013

patch fca1680

Before patch @k=61

Contigs >= 100 nt
Number: 177174
Total length: 24678110

After patch @k=61

Contigs >= 100 nt
Number: 137353
Total length: 20595821

A lot of useless stumps were removed.

The remaining stumps are probably weak parts of bubbles.

@sebhtml

This comment has been minimized.

Show comment
Hide comment
@sebhtml

sebhtml Jan 31, 2013

Owner

5 elastic jobs left to completed, then I'll try to find a formula for these fancy weak bubble parts.

Owner

sebhtml commented Jan 31, 2013

5 elastic jobs left to completed, then I'll try to find a formula for these fancy weak bubble parts.

@sebhtml

This comment has been minimized.

Show comment
Hide comment
@sebhtml

sebhtml Jan 31, 2013

Owner

Some QC:

Rank 5 Skipped paths because of dead ends: 48599
Rank 5 Skipped paths because of short length: 2135892
Rank 5 Skipped paths because of bad ownership: 4298
Rank 5 Skipped paths because of low coverage: 0

Owner

sebhtml commented Jan 31, 2013

Some QC:

Rank 5 Skipped paths because of dead ends: 48599
Rank 5 Skipped paths because of short length: 2135892
Rank 5 Skipped paths because of bad ownership: 4298
Rank 5 Skipped paths because of low coverage: 0

@sebhtml

This comment has been minimized.

Show comment
Hide comment
@sebhtml

sebhtml Jan 31, 2013

Owner

Relationship between k-mer length and the length of faulty objects:

the minimum accepted length is 100 nucleotides

31 none
35 none
41 none
45 none

51
100 176

55
100 3953
101 3742
102 3756
103 3522
104 3423
105 3454
106 3649
107 92057

59
100 4239
101 4257
102 4120
103 3930
104 4609
105 4985
106 4828
107 3897
108 3168
109 3076
110 3169
111 2825
112 2806
113 2891
114 2931
115 76801
116 119

61
117 2540
118 2700
119 70180
120 105

63
121 2383
122 2415
123 64032
124 112

69
128 1942
129 1862
130 1872
131 1766
132 1818
133 1797
134 1867
135 48610
136 88
137 38

73
136 1652
137 1472
138 1460
139 1483
140 1435
141 1536
142 1581
143 40422

85
100 13666
101 12556
102 11589
103 12551
104 10555
105 9862
106 9408
163 847
164 817
165 918
166 916
167 22426
168 37

89
100 17235
101 16361
102 14696
103 13621
104 13454
105 12245
106 11317
107 12347
108 10312
109 9626
174 771
175 18177
176 36

91
100 19951
101 18414
102 16880
103 15952
104 14577
105 13593
106 13203
107 12031
108 11151
109 12162
110 10198
111 9530
112 9121
113 8369
114 7714
115 7502
116 6756
117 6652
178 711
179 16357
180 35

93
100 23093
101 21172
102 19960
103 18219
104 16665
105 15817
106 14472
107 13395
108 12959
109 11932
110 11010
111 12089
112 10153
113 9504
114 9140
115 8319
116 7746
117 7552
118 6814
119 6630
120 6500
121 5999
122 5679
123 5042
124 4933
125 4684
126 4612
127 4364
128 3959
129 3980
130 3602
131 3514
132 3536
182 674
183 14669
184 29
185 23

95
100 27259
101 25447
102 22906
103 20977
104 19991
105 18082
106 16606
107 15769
108 14354
109 13396
110 12937
111 11910
112 10979
113 12170
114 9991
186 554
187 13160
188 26

Owner

sebhtml commented Jan 31, 2013

Relationship between k-mer length and the length of faulty objects:

the minimum accepted length is 100 nucleotides

31 none
35 none
41 none
45 none

51
100 176

55
100 3953
101 3742
102 3756
103 3522
104 3423
105 3454
106 3649
107 92057

59
100 4239
101 4257
102 4120
103 3930
104 4609
105 4985
106 4828
107 3897
108 3168
109 3076
110 3169
111 2825
112 2806
113 2891
114 2931
115 76801
116 119

61
117 2540
118 2700
119 70180
120 105

63
121 2383
122 2415
123 64032
124 112

69
128 1942
129 1862
130 1872
131 1766
132 1818
133 1797
134 1867
135 48610
136 88
137 38

73
136 1652
137 1472
138 1460
139 1483
140 1435
141 1536
142 1581
143 40422

85
100 13666
101 12556
102 11589
103 12551
104 10555
105 9862
106 9408
163 847
164 817
165 918
166 916
167 22426
168 37

89
100 17235
101 16361
102 14696
103 13621
104 13454
105 12245
106 11317
107 12347
108 10312
109 9626
174 771
175 18177
176 36

91
100 19951
101 18414
102 16880
103 15952
104 14577
105 13593
106 13203
107 12031
108 11151
109 12162
110 10198
111 9530
112 9121
113 8369
114 7714
115 7502
116 6756
117 6652
178 711
179 16357
180 35

93
100 23093
101 21172
102 19960
103 18219
104 16665
105 15817
106 14472
107 13395
108 12959
109 11932
110 11010
111 12089
112 10153
113 9504
114 9140
115 8319
116 7746
117 7552
118 6814
119 6630
120 6500
121 5999
122 5679
123 5042
124 4933
125 4684
126 4612
127 4364
128 3959
129 3980
130 3602
131 3514
132 3536
182 674
183 14669
184 29
185 23

95
100 27259
101 25447
102 22906
103 20977
104 19991
105 18082
106 16606
107 15769
108 14354
109 13396
110 12937
111 11910
112 10979
113 12170
114 9991
186 554
187 13160
188 26

@sebhtml

This comment has been minimized.

Show comment
Hide comment
@sebhtml

sebhtml Jan 31, 2013

Owner

At k=91

Check one with length 100 and another with length 179 in Ray Cloud Browser.
job # 10205427 10205625 Ray-E.Coli-DH10B-250x2-91-2013-01-30-3

100 nt are tips

179 -> owing to the algorithm, first and last are not included, that's why

Basically, dump a seed if all its k-mers have coverage 2 with length < 2 * k

Owner

sebhtml commented Jan 31, 2013

At k=91

Check one with length 100 and another with length 179 in Ray Cloud Browser.
job # 10205427 10205625 Ray-E.Coli-DH10B-250x2-91-2013-01-30-3

100 nt are tips

179 -> owing to the algorithm, first and last are not included, that's why

Basically, dump a seed if all its k-mers have coverage 2 with length < 2 * k

@sebhtml

This comment has been minimized.

Show comment
Hide comment
@sebhtml

sebhtml Mar 14, 2013

Owner

Example of locations for seeds with large kmers:

http://genome.ulaval.ca:10111/client/?map=0&section=0&region=1&location=138853&zoom=0.7486437028696127

In code/plugin_SeedingData/SeedWorker.cpp, add some code that check if the seed is part of a bubble.

The algorithm has to be symmetric

Owner

sebhtml commented Mar 14, 2013

Example of locations for seeds with large kmers:

http://genome.ulaval.ca:10111/client/?map=0&section=0&region=1&location=138853&zoom=0.7486437028696127

In code/plugin_SeedingData/SeedWorker.cpp, add some code that check if the seed is part of a bubble.

The algorithm has to be symmetric

@sebhtml

This comment has been minimized.

Show comment
Hide comment
@sebhtml

sebhtml Mar 15, 2013

Owner

Implement a first draft plugin called SpuriousSeedAnnihilator

not sure if it should be a helper or a plugin though.

creating a helper is faster, creating a plugin is more fun however

Owner

sebhtml commented Mar 15, 2013

Implement a first draft plugin called SpuriousSeedAnnihilator

not sure if it should be a helper or a plugin though.

creating a helper is faster, creating a plugin is more fun however

@sebhtml

This comment has been minimized.

Show comment
Hide comment
@sebhtml

sebhtml Mar 18, 2013

Owner

To look at these funky seeds:

In the Hathor instance on ls30: /home/boiseb01/Hathor/data/Tron-14-test-1

http://genome.ulaval.ca:10111

To create a tunnel to bypass firewalls, if any.

ssh -lboiseb01 -L10111:ls30:10111 genome.ulaval.ca

data on mp2: /home/boisver1/MiSeq-big-kmer/Tron-14-test-1

There is a peak at 2k-3.

Modules:

[boiseb01@ls30 Tron-14-test-1]$ module load Ray-Cloud-Browser/1.0.0-1

http://genome.ulaval.ca:10111/client/?map=1&section=0&region=4&location=88 (this link does not work because the coverage is below)

Algorithm:

Owner

sebhtml commented Mar 18, 2013

To look at these funky seeds:

In the Hathor instance on ls30: /home/boiseb01/Hathor/data/Tron-14-test-1

http://genome.ulaval.ca:10111

To create a tunnel to bypass firewalls, if any.

ssh -lboiseb01 -L10111:ls30:10111 genome.ulaval.ca

data on mp2: /home/boisver1/MiSeq-big-kmer/Tron-14-test-1

There is a peak at 2k-3.

Modules:

[boiseb01@ls30 Tron-14-test-1]$ module load Ray-Cloud-Browser/1.0.0-1

http://genome.ulaval.ca:10111/client/?map=1&section=0&region=4&location=88 (this link does not work because the coverage is below)

Algorithm:

@sebhtml

This comment has been minimized.

Show comment
Hide comment
@sebhtml

sebhtml Mar 18, 2013

Owner

With ray version 408d6ff

http://genome.ulaval.ca:10111/client

with this map:

{ "name": "MG1655 @ 408d6ff",

Links are not provided here because the minimum depth is not fetched from the QUERY_STRING at the moment.

SeedLengthInNucleotides Frequency

100 746 ****************** See Below
101 597
102 511
103 407
104 332
105 311
106 312
107 273
108 250
109 237
110 195
111 179
112 178
113 113
114 123
115 118
116 88
117 105
118 93
119 86
120 85
121 78
122 77
123 79
124 65
125 70
126 44
127 48
128 45
129 48
130 31
131 38
132 33
133 26
134 27
135 34
136 18
137 19
138 20
139 18
140 21
141 19
142 15
143 15
144 8
145 21
146 20
147 13
148 14
149 15
150 10
151 13
152 14
153 14
154 7
155 6
156 15
157 10
158 14
159 14
160 9
161 5
162 11
163 10
164 6
165 8
166 14
167 13
168 10
169 8
170 6
171 7
172 4
173 5
174 9
175 12
176 11
177 6
178 9
179 144 ****************** See Below
180 15
181 15
182 1
183 2
184 4
185 2
186 3
187 4
188 1

Some views in Ray Cloud Browser:

An example of region with 179 nucleotides (2 * k - 3)
179nt-k91

An example of region with 179 nucleotides (2 * k - 3) (zoomed in)
179nt-k91-zooming

A example of region with 100 nucleotides (via -minimum-contig-length 100 )
100nt-k91

Owner

sebhtml commented Mar 18, 2013

With ray version 408d6ff

http://genome.ulaval.ca:10111/client

with this map:

{ "name": "MG1655 @ 408d6ff",

Links are not provided here because the minimum depth is not fetched from the QUERY_STRING at the moment.

SeedLengthInNucleotides Frequency

100 746 ****************** See Below
101 597
102 511
103 407
104 332
105 311
106 312
107 273
108 250
109 237
110 195
111 179
112 178
113 113
114 123
115 118
116 88
117 105
118 93
119 86
120 85
121 78
122 77
123 79
124 65
125 70
126 44
127 48
128 45
129 48
130 31
131 38
132 33
133 26
134 27
135 34
136 18
137 19
138 20
139 18
140 21
141 19
142 15
143 15
144 8
145 21
146 20
147 13
148 14
149 15
150 10
151 13
152 14
153 14
154 7
155 6
156 15
157 10
158 14
159 14
160 9
161 5
162 11
163 10
164 6
165 8
166 14
167 13
168 10
169 8
170 6
171 7
172 4
173 5
174 9
175 12
176 11
177 6
178 9
179 144 ****************** See Below
180 15
181 15
182 1
183 2
184 4
185 2
186 3
187 4
188 1

Some views in Ray Cloud Browser:

An example of region with 179 nucleotides (2 * k - 3)
179nt-k91

An example of region with 179 nucleotides (2 * k - 3) (zoomed in)
179nt-k91-zooming

A example of region with 100 nucleotides (via -minimum-contig-length 100 )
100nt-k91

@sebhtml

This comment has been minimized.

Show comment
Hide comment
@sebhtml

sebhtml Mar 19, 2013

Owner

Tasks to complete this ticket:

duplicate seed length code (20 min) 9e44f08

move SeedLengthDistribution in new plugin (15 min) e38e90a

move checkpointing write in new plugin (15 min) 77ee450

move Seeds.fasta in new plugin (15 min) fae6aa2

skip new plugin if checkpoint exists (10 min) dd36402

weak bubble removal with parent (1 h)

weak bubble removal with child (30 min)

Owner

sebhtml commented Mar 19, 2013

Tasks to complete this ticket:

duplicate seed length code (20 min) 9e44f08

move SeedLengthDistribution in new plugin (15 min) e38e90a

move checkpointing write in new plugin (15 min) 77ee450

move Seeds.fasta in new plugin (15 min) fae6aa2

skip new plugin if checkpoint exists (10 min) dd36402

weak bubble removal with parent (1 h)

weak bubble removal with child (30 min)

@sebhtml

This comment has been minimized.

Show comment
Hide comment
@sebhtml

sebhtml Mar 19, 2013

Owner

Some success indicators:

Before parent-directed removal

SeedLengthInNucleotides Frequency

100 746
101 597
102 511
103 407
104 332
105 311
106 312
107 273
108 250
109 237
110 195
111 179
112 178
113 113
114 123
115 118
116 88
117 105
118 93
119 86
120 85
121 78
122 77
123 79
124 65
125 70
126 44
127 48
128 45
129 48
130 31
131 38
132 33
133 26
134 27
135 34
136 18
137 19
138 20
139 18
140 21
141 19
142 15
143 15
144 8
145 21
146 20
147 13
148 14
149 15
150 10
151 13
152 14
153 14
154 7
155 6
156 15
157 10
158 14
159 14
160 9
161 5
162 11
163 10
164 6
165 8
166 14
167 13
168 10
169 8
170 6
171 7
172 4
173 5
174 9
175 12
176 11
177 6
178 9
179 144
180 15
181 15
182 1
183 2
184 4
185 2
186 3
187 4
188 1

After Before parent-directed removal:

Owner

sebhtml commented Mar 19, 2013

Some success indicators:

Before parent-directed removal

SeedLengthInNucleotides Frequency

100 746
101 597
102 511
103 407
104 332
105 311
106 312
107 273
108 250
109 237
110 195
111 179
112 178
113 113
114 123
115 118
116 88
117 105
118 93
119 86
120 85
121 78
122 77
123 79
124 65
125 70
126 44
127 48
128 45
129 48
130 31
131 38
132 33
133 26
134 27
135 34
136 18
137 19
138 20
139 18
140 21
141 19
142 15
143 15
144 8
145 21
146 20
147 13
148 14
149 15
150 10
151 13
152 14
153 14
154 7
155 6
156 15
157 10
158 14
159 14
160 9
161 5
162 11
163 10
164 6
165 8
166 14
167 13
168 10
169 8
170 6
171 7
172 4
173 5
174 9
175 12
176 11
177 6
178 9
179 144
180 15
181 15
182 1
183 2
184 4
185 2
186 3
187 4
188 1

After Before parent-directed removal:

@sebhtml

This comment has been minimized.

Show comment
Hide comment
@sebhtml

sebhtml Mar 25, 2013

Owner

Here you go, half the crap is removed automatically now with large kmers !

Protips: pop the mother from the stack before pushing her kids on the stack, not after

[boisver1@cp2569-mp2 MiSeq-big-kmer]$ head -n 20 Tron-14-test-6/SeedLengthDistribution.txt

SeedLengthInNucleotides Frequency

100 435
101 369
102 311
103 252
104 189
105 201
106 185
107 174
108 167
109 176
110 118
111 116
112 121
113 76
114 79
115 79
116 63
117 78
118 66

2bfe953

Owner

sebhtml commented Mar 25, 2013

Here you go, half the crap is removed automatically now with large kmers !

Protips: pop the mother from the stack before pushing her kids on the stack, not after

[boisver1@cp2569-mp2 MiSeq-big-kmer]$ head -n 20 Tron-14-test-6/SeedLengthDistribution.txt

SeedLengthInNucleotides Frequency

100 435
101 369
102 311
103 252
104 189
105 201
106 185
107 174
108 167
109 176
110 118
111 116
112 121
113 76
114 79
115 79
116 63
117 78
118 66

2bfe953

@sebhtml

This comment has been minimized.

Show comment
Hide comment
@sebhtml

sebhtml Mar 25, 2013

Owner

85-95% of the crap is automatically removed now

[boisver1@cp2569-mp2 MiSeq-big-kmer]$ head -n 20 Tron-14-test-6/SeedLengthDistribution.txt

SeedLengthInNucleotides Frequency

100 179
101 174
102 138
103 116
104 94
105 113
106 95
107 85
108 88
109 97
110 56
111 62
112 66
113 39
114 42
115 47
116 35
117 45
118 40

aa7bfb7

Owner

sebhtml commented Mar 25, 2013

85-95% of the crap is automatically removed now

[boisver1@cp2569-mp2 MiSeq-big-kmer]$ head -n 20 Tron-14-test-6/SeedLengthDistribution.txt

SeedLengthInNucleotides Frequency

100 179
101 174
102 138
103 116
104 94
105 113
106 95
107 85
108 88
109 97
110 56
111 62
112 66
113 39
114 42
115 47
116 35
117 45
118 40

aa7bfb7

@sebhtml

This comment has been minimized.

Show comment
Hide comment
@sebhtml

sebhtml Mar 25, 2013

Owner

Now, I need to check what's left on Hathor:

http://genome.ulaval.ca:10111/client/?map=3&section=0&region=68&location=0&zoom=1.3918897761638223

=> Increase the depth from 32 to 128 !

Owner

sebhtml commented Mar 25, 2013

Now, I need to check what's left on Hathor:

http://genome.ulaval.ca:10111/client/?map=3&section=0&region=68&location=0&zoom=1.3918897761638223

=> Increase the depth from 32 to 128 !

@sebhtml

This comment has been minimized.

Show comment
Hide comment
@sebhtml

sebhtml Mar 25, 2013

Owner

With depth=128:

SeedLengthInNucleotides Frequency

100 47
101 41
102 43
103 35
104 18
105 38
106 25
107 15
108 27
109 34
110 20
111 19
112 21
113 11
114 14
115 23
116 16
117 22
118 22

Owner

sebhtml commented Mar 25, 2013

With depth=128:

SeedLengthInNucleotides Frequency

100 47
101 41
102 43
103 35
104 18
105 38
106 25
107 15
108 27
109 34
110 20
111 19
112 21
113 11
114 14
115 23
116 16
117 22
118 22

@sebhtml

This comment has been minimized.

Show comment
Hide comment
@sebhtml

sebhtml Mar 26, 2013

Owner

Before

179 144

Owner

sebhtml commented Mar 26, 2013

Before

179 144

@sebhtml

This comment has been minimized.

Show comment
Hide comment
@sebhtml

sebhtml Mar 26, 2013

Owner

The algorithm must be modified: what was concluded is wrong because a component of the bubble usually is not attached directly to anything colored. It's necessary to do:

parent, parent, child, child

or

child, child, parent, parent

Otherwise, the code does not work.
1

2

3

Owner

sebhtml commented Mar 26, 2013

The algorithm must be modified: what was concluded is wrong because a component of the bubble usually is not attached directly to anything colored. It's necessary to do:

parent, parent, child, child

or

child, child, parent, parent

Otherwise, the code does not work.
1

2

3

@sebhtml

This comment has been minimized.

Show comment
Hide comment
@sebhtml

sebhtml Mar 26, 2013

Owner

There is a bug. Paths are probably not indexed.

BUBBLE_HIT first=TGAATGGCGTTACCGGCCTGGTTGAGTATCACGAGCATTTCAATCGCTTTTAATCTCCGGGGTTTGCAGACTGCTTAACAGCTCTGCAACC grandparent= GCTGAATGGCGTTACCGGCCTGGTTGAGTATCACGAGCATTTCAATCGCTTTTAATCTCCGGGGTTTGCAGACTGCTTAACAGCTCTGCAA
LeftPaths: 0 RightPaths: 0

http://genome.ulaval.ca:10111/client/?map=4&section=0&region=1525&location=91&zoom=1.399999058351862

Owner

sebhtml commented Mar 26, 2013

There is a bug. Paths are probably not indexed.

BUBBLE_HIT first=TGAATGGCGTTACCGGCCTGGTTGAGTATCACGAGCATTTCAATCGCTTTTAATCTCCGGGGTTTGCAGACTGCTTAACAGCTCTGCAACC grandparent= GCTGAATGGCGTTACCGGCCTGGTTGAGTATCACGAGCATTTCAATCGCTTTTAATCTCCGGGGTTTGCAGACTGCTTAACAGCTCTGCAA
LeftPaths: 0 RightPaths: 0

http://genome.ulaval.ca:10111/client/?map=4&section=0&region=1525&location=91&zoom=1.399999058351862

@sebhtml

This comment has been minimized.

Show comment
Hide comment
@sebhtml

sebhtml Mar 26, 2013

Owner

almost done with bubbles ! 179 -> 33 (or 34 I am not sure)

Owner

sebhtml commented Mar 26, 2013

almost done with bubbles ! 179 -> 33 (or 34 I am not sure)

@sebhtml

This comment has been minimized.

Show comment
Hide comment
@sebhtml

sebhtml Mar 27, 2013

Owner

Ţhings left are not systematic -- they are due to the biology.

Owner

sebhtml commented Mar 27, 2013

Ţhings left are not systematic -- they are due to the biology.

@sebhtml

This comment has been minimized.

Show comment
Hide comment
Owner

sebhtml commented Mar 27, 2013

@sebhtml sebhtml closed this Mar 27, 2013

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment