-
Notifications
You must be signed in to change notification settings - Fork 0
/
Readme.Rmd
276 lines (173 loc) · 9.51 KB
/
Readme.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
---
title: "Antismash 4 vs Antismash 5 Predictions"
output:
github_document:
toc: true
---
```{r echo=FALSE, warning=FALSE, message=FALSE}
#options(warn=-1)
library(dplyr)
library(data.table)
library(ggplot2)
## --------------------------------------------------------------------------------------------------------------
## Data Loading/Munging
columns <- c("mibig", "AD_domain_idx", "AD_domain_id", "code_type", "prediction" )
as4 <- data.table::fread("output/domains_as4.txt", col.names = columns)
as5 <- data.table::fread("output/domains_as5.txt", col.names = columns)
# fix domain IDs
as5$AD_domain_id <- gsub("AMP-binding\\.", "A", as5$AD_domain_id)
as4$code_type <- gsub("Stachelhaus code", "stachelhaus_predictions_4", as4$code_type)
as4$code_type <- gsub("NRPSpredictor3 SVM", "nrpspredictor3_svm_single", as4$code_type)
as5$code_type <- gsub("stachelhaus_predictions", "stachelhaus_predictions_5", as5$code_type)
as5$code_type <- gsub("single_amino_pred", "nrpspredictor2_single", as5$code_type)
as4 <- as4 %>% filter(
code_type %in% c("stachelhaus_predictions_4", "nrpspredictor3_svm_single",
"stachelhaus_predictions_5", "nrpspredictor2_single"))
as5 <- as5 %>% filter(
code_type %in% c("stachelhaus_predictions_4", "nrpspredictor3_svm_single",
"stachelhaus_predictions_5", "nrpspredictor2_single",
"physicochemical_class"))
as4_wide <- data.table::dcast(as4, mibig+AD_domain_idx+AD_domain_id~code_type)
as5_wide <- data.table::dcast(as5, mibig+AD_domain_idx+AD_domain_id~code_type)
#all_data <- dplyr::full_join(as4_wide, as5_wide, by=c("mibig", "AD_domain_idx", "AD_domain_id")) %>%
all_data <- dplyr::full_join(as4_wide, as5_wide, by=c("mibig", "AD_domain_id")) %>%
group_by(mibig) %>%
add_count() %>%
ungroup() %>%
arrange(mibig, AD_domain_id )
all_data$compare_stach1 <- as.logical(purrr::map2(all_data$stachelhaus_predictions_4, all_data$stachelhaus_predictions_5,
~.x == .y))
all_data$compare_stach2 <- as.logical(purrr::map2(all_data$stachelhaus_predictions_4, all_data$stachelhaus_predictions_5,
~grepl(.x, .y)))
all_data$compare_nrpspred <- as.logical(purrr::map2(all_data$nrpspredictor2_single, all_data$nrpspredictor3_svm_single,
~.x == .y))
```
## Executive Summary
There are differences in Adenylation domain substrate predictions between AS4 and AS5 due to the different programs used for substrate identification. Using the two common measures of substrate prediction, Stachelhaus and NRPSPredictor (2/3) we can see that most of the differences between the two programs are due to instances where one of the programs does not generate a call. This suggests that the differences may simply be due to different acceptance thresholds between AS4 and AS5. However, there is also a small percentage of sequences that are predicted to be different substrates altogether which is potentially a concern if you are working with these clusters contianing this type of domain..
## Intro
I've been watching the Antismash team develop version 5 [on github](https://github.com/antismash/antismash) and have been very
impressed with the refactoring process. There seem to be major upgrades across the board - in the front end HTML (more interactivity, new cluster rule
features, changed tabbed layout for clusterblast and substrate predictions); in the refactoring of the code itself (modularized, type hints, new `Record` handling),
as well as the Dockerfile (more data required outside of the application which will allow smaller images and sideloading/reuse of large datasets). Really its quite an update - big kudos the whole team and Kblin and SJShaw in particular.
Clearly I'm a big fan so I decided to kick the tires. After exploring for a bit I noticed that a few of the Adenylation domain substrates for clusters that I have worked on are not being called identical in Antismash 4 (AS4) and Antismash 5 (AS5). I wondered how prevalent this problem was so I took a reasonably large public dataset and ran AS4 and AS5 on them and compared the predictions between them. AS4 used to offer a larger number of prediction programs while AS5 has narrowed donw to 1 (or 2 depending on how you count). While the AS5 approach has its benefits in the form of speed/efficiency, I wonder if the more limited substrate predictions of AS5 might cause us to mis or mis-predict certain substrates.
## My Approach
1. Download Mibig GBKS and convert to fasta
2. Download AS4 and AS5 docker images
3. Download AS5 sample data
4. Run AS4 and AS5 on each of the gbks in Mibig
5. Parse the results from the output GBK (AS4) or JSON (AS5) files.
6. Explore the results here.
You can reproduce the data here although the domain files are availalbe in the `output/` directory.
```bash
# get Mibig and run against AS4 and AS5 using their docker image
#
# there are some software deps I use you may need
# parallel, docker, biopython
make download
make runsmash
output/domains_as4.txt
```
## The Data
![](images/as4_as5.png)
The substrate information for AS4 and AS5 differs. We can parse this information out of the AS4 gbk files and the AS5 json files. As you can see in the image above AS4 contains specificity predictions for Stachelhaus, NRPSpredictor3, and a few other programs. AS4 has NRPSPredictor2 outputs as well as the stachelhaus prediction. I retrieved data from the fields in red in order looked for AS4 Stachelhaus <---> AS5 Stachelhaus differences as well as NRPSPredictor2 <---> NRPSPredictor3 SVM.
My parser scripts are in `scripts` and after pulling out the data and renaming a few columns, I join the AS4 and AS5 data togther to create the final analysis. To compare the substrate predictions I performed the following checks of equality.
```{r ,eval=FALSE}
# compare stachelhuas calls directly
all_data$compare_stach1 <-
as.logical(purrr::map2(all_data$stachelhaus_predictions_4, all_data$stachelhaus_predictions_5,
~.x == .y))
# use grep to compare any of the substrates predicted in AS4 against AS5
# example: grepl("leu|d-leu", "leu) -> TRUE
all_data$compare_stach2 <-
as.logical(purrr::map2(all_data$stachelhaus_predictions_4, all_data$stachelhaus_predictions_5,
~grepl(.x, .y)))
# compare the nrpspredictor calls directly
all_data$compare_nrpspred <-
as.logical(purrr::map2(all_data$nrpspredictor2_single, all_data$nrpspredictor3_svm_single,
~.x == .y))
```
The data including the equality checks are now all in a single table with one row for each domain. There are `r nrow(all_data)` Adenylation domains in this dataset.
It looks like this:
```{r, echo=FALSE}
head(all_data)
```
## Stachelhaus Findings
**Are AS4 Stachelhaus values identical to AS5 Stachelhaus values?**
```{r echo=FALSE}
table(all_data$compare_stach1)
```
**Are AS4 Stachelhaus values identical to AS5 Stachelhaus values?** (use grep to match multiple AS4 values to a single AS5 value)
```{r echo=FALSE}
table(all_data$compare_stach2)
```
**What are the non-matching values?**
What are the twenty most common AS4 values when AS4 and AS5 do not match? Most are `no-call`s where AS4 didn't predict a value.
```{r, echo=FALSE}
all_data %>%
filter(compare_stach2 == FALSE) %>%
.$stachelhaus_predictions_4 %>%
table() %>%
sort(decreasing=TRUE) %>%
.[1:20]
```
What are the twenty most common AS5 values when AS4 and AS5 do not match? (Most are `no-call`s where AS4 didn't predict a value.
with `as4 no_calls`
```{r, echo=FALSE}
all_data %>%
filter(compare_stach2 == FALSE) %>%
.$stachelhaus_predictions_5 %>%
table() %>%
sort(decreasing=TRUE) %>%
.[1:20]
```
without `as4 no_calls`
```{r, echo=FALSE}
all_data %>%
filter(compare_stach2 == FALSE,
stachelhaus_predictions_4 != "no_call") %>%
.$stachelhaus_predictions_5 %>%
table() %>%
sort(decreasing=TRUE) %>%
.[1:20]
```
**What were the AS4 Values that change to Phe in AS5?**
There are 60 Phe differences. What are the AS4 calls when `AS5="phe"`? Many hydrophobic residues to Phe. A few charged residue predicitons where you might not expect a change.
```{r , echo=FALSE}
## Look at the Stachelhaus Calls first
##
## of the non no-calls, what are the other problems?
not_no_call <- all_data %>% filter(compare_stach2 == FALSE, stachelhaus_predictions_4 != "no_call")
#sort(table(not_no_call$stachelhaus_predictions_5))
#not_no_call %>% filter(stachelhaus_predictions_5 == "phe")
not_no_call %>%
filter(stachelhaus_predictions_5 == "phe") %>%
.$stachelhaus_predictions_4 %>%
table() %>%
sort(decreasing=TRUE)
```
## NRPS Predictor Findings
**Are AS4 NRPSPredictor3 values identical to AS5 NRPSPredictor2 values?**
There is 80% concordance between V2 and V3. Most of the differences are due to N/A values. This brings concordnace to >90% if we include only those cases where a call is made.
```{r , echo=FALSE}
#table(all_data$compare_stach1) # 897/1903 T/F
#table(all_data$compare_stach2) # 1598/1320 T/F
table(all_data$compare_nrpspred) # 2249/551 T/F
```
What are the NRPSPredictor2 values when there is a discrepancy?
```{r, echo=FALSE}
all_data %>%
filter(compare_nrpspred == FALSE) %>%
.$nrpspredictor2_single %>%
table() %>%
sort(decreasing=TRUE) %>%
.[1:20]
```
What are the NRPSPredictor3 values when there is a discrepancy?
```{r, echo=FALSE}
all_data %>%
filter(compare_nrpspred == FALSE) %>%
.$nrpspredictor3_svm_single %>%
table() %>%
sort(decreasing=TRUE) %>%
.[1:20]
```