Readme.Rmd

---
title: "Antismash 4 vs Antismash 5 Predictions"
output: 
  github_document:
    toc: true 
---


```{r echo=FALSE, warning=FALSE, message=FALSE}

#options(warn=-1)

library(dplyr)
library(data.table)
library(ggplot2)

## --------------------------------------------------------------------------------------------------------------
## Data Loading/Munging 

columns <- c("mibig", "AD_domain_idx", "AD_domain_id", "code_type", "prediction" )

as4 <- data.table::fread("output/domains_as4.txt", col.names = columns)
as5 <- data.table::fread("output/domains_as5.txt", col.names = columns)

# fix domain IDs
as5$AD_domain_id <-  gsub("AMP-binding\\.", "A", as5$AD_domain_id)


as4$code_type <-  gsub("Stachelhaus code", "stachelhaus_predictions_4", as4$code_type)
as4$code_type <-  gsub("NRPSpredictor3 SVM", "nrpspredictor3_svm_single", as4$code_type)

as5$code_type <-  gsub("stachelhaus_predictions", "stachelhaus_predictions_5", as5$code_type)
as5$code_type <-  gsub("single_amino_pred", "nrpspredictor2_single", as5$code_type)


as4 <- as4  %>% filter(
  code_type %in% c("stachelhaus_predictions_4", "nrpspredictor3_svm_single", 
                   "stachelhaus_predictions_5", "nrpspredictor2_single"))

as5 <- as5  %>% filter(
  code_type %in% c("stachelhaus_predictions_4", "nrpspredictor3_svm_single", 
                   "stachelhaus_predictions_5", "nrpspredictor2_single",
                   "physicochemical_class"))


as4_wide <-  data.table::dcast(as4, mibig+AD_domain_idx+AD_domain_id~code_type)
as5_wide <-  data.table::dcast(as5, mibig+AD_domain_idx+AD_domain_id~code_type)


#all_data <- dplyr::full_join(as4_wide, as5_wide, by=c("mibig", "AD_domain_idx", "AD_domain_id")) %>% 
all_data <- dplyr::full_join(as4_wide, as5_wide, by=c("mibig", "AD_domain_id")) %>% 
  group_by(mibig) %>% 
  add_count() %>%
  ungroup() %>%
  arrange(mibig, AD_domain_id )


all_data$compare_stach1 <- as.logical(purrr::map2(all_data$stachelhaus_predictions_4, all_data$stachelhaus_predictions_5,
             ~.x == .y))

all_data$compare_stach2 <- as.logical(purrr::map2(all_data$stachelhaus_predictions_4, all_data$stachelhaus_predictions_5,
                                    ~grepl(.x, .y)))

all_data$compare_nrpspred <- as.logical(purrr::map2(all_data$nrpspredictor2_single, all_data$nrpspredictor3_svm_single,
                                                  ~.x == .y))

```


## Executive Summary

There are differences in Adenylation domain substrate predictions between AS4 and AS5 due to the different programs used for substrate identification. Using the two common measures of substrate prediction, Stachelhaus and NRPSPredictor (2/3) we can see that most of the differences between the two programs are due to instances where one of the programs does not generate a call. This suggests that the differences may simply be due to different acceptance thresholds between AS4 and AS5. However, there is also a small percentage of sequences that are predicted to be different substrates altogether which is potentially a concern if you are working with these clusters contianing this type of domain.. 


## Intro

I've been watching the Antismash team develop version 5 [on github](https://github.com/antismash/antismash) and have been very
impressed with the refactoring process. There seem to be major upgrades across the board - in the front end HTML (more interactivity, new cluster rule
features, changed tabbed layout for clusterblast and substrate predictions); in the refactoring of the code itself (modularized, type hints, new `Record` handling),
as well as the Dockerfile (more data required outside of the application which will allow smaller images and sideloading/reuse of large datasets). Really its quite an update - big kudos the whole team and Kblin and SJShaw in particular. 

Clearly I'm a big fan so I decided to kick the tires.  After exploring for a bit I noticed that a few of the Adenylation domain substrates for clusters that I have worked on are not being called identical in Antismash 4 (AS4) and Antismash 5 (AS5). I wondered how prevalent this problem was so I took a reasonably large public dataset and ran AS4 and AS5 on them and compared the predictions between them.  AS4 used to offer a larger number of prediction programs while AS5 has narrowed donw to 1 (or 2 depending on how you count). While the AS5 approach has its benefits in the form of speed/efficiency, I wonder if the more limited substrate predictions of AS5 might cause us to mis or mis-predict certain substrates.


## My Approach

1. Download Mibig GBKS and convert to fasta
2. Download AS4 and AS5 docker images
3. Download AS5 sample data
4. Run AS4 and AS5 on each of the gbks in Mibig
5. Parse the results from the output GBK (AS4) or JSON (AS5) files.
6. Explore the results here.

You can reproduce the data here although the domain files are availalbe in the `output/` directory.

```bash
# get Mibig and run against AS4 and AS5 using their docker image
#
# there are some software deps I use you may need
# parallel, docker, biopython
make download
make runsmash
output/domains_as4.txt
```


## The Data

![](images/as4_as5.png)

The substrate information for AS4 and AS5 differs. We can parse this information out of the AS4 gbk files and the AS5 json files. As you can see in the image above AS4 contains specificity predictions for Stachelhaus, NRPSpredictor3,  and a few other programs. AS4 has NRPSPredictor2 outputs as well as the stachelhaus prediction.  I retrieved data from the fields in red in order looked for AS4 Stachelhaus <---> AS5 Stachelhaus differences as well as NRPSPredictor2 <---> NRPSPredictor3 SVM. 

My parser scripts are in `scripts` and after pulling out the data and renaming a few columns, I join the AS4 and AS5 data togther to create the final analysis. To compare the substrate predictions I performed the following checks of equality.


```{r ,eval=FALSE}
# compare stachelhuas calls directly
all_data$compare_stach1 <- 
  as.logical(purrr::map2(all_data$stachelhaus_predictions_4, all_data$stachelhaus_predictions_5,
             ~.x == .y))

# use grep to compare any of the substrates predicted in AS4 against AS5
# example: grepl("leu|d-leu", "leu) -> TRUE
all_data$compare_stach2 <- 
  as.logical(purrr::map2(all_data$stachelhaus_predictions_4, all_data$stachelhaus_predictions_5,
                                    ~grepl(.x, .y)))

# compare the nrpspredictor calls directly
all_data$compare_nrpspred <- 
  as.logical(purrr::map2(all_data$nrpspredictor2_single, all_data$nrpspredictor3_svm_single,
                                                  ~.x == .y))
```


The data including the equality checks are now all in a single table with one row for each domain. There are `r nrow(all_data)` Adenylation domains in this dataset. 
It looks like this:

```{r, echo=FALSE}
head(all_data)
```


## Stachelhaus Findings


**Are AS4 Stachelhaus values identical to AS5 Stachelhaus values?**


```{r echo=FALSE}
table(all_data$compare_stach1)
```


**Are AS4 Stachelhaus values identical to AS5 Stachelhaus values?** (use grep to match multiple AS4 values to a single AS5 value)

```{r echo=FALSE}
table(all_data$compare_stach2)
```


**What are the non-matching values?** 

What are the twenty most common AS4 values when AS4 and AS5 do not match? Most are `no-call`s where AS4 didn't predict a value.  

```{r, echo=FALSE}
all_data %>% 
  filter(compare_stach2 == FALSE) %>% 
  .$stachelhaus_predictions_4 %>% 
  table() %>%  
  sort(decreasing=TRUE) %>% 
  .[1:20]
```


What are the twenty most common AS5 values when AS4 and AS5 do not match? (Most are `no-call`s where AS4 didn't predict a value.  

with `as4 no_calls`

```{r, echo=FALSE}
all_data %>% 
  filter(compare_stach2 == FALSE) %>% 
  .$stachelhaus_predictions_5 %>% 
  table() %>%  
  sort(decreasing=TRUE) %>% 
  .[1:20]
```

without `as4 no_calls`

```{r, echo=FALSE}
all_data %>% 
  filter(compare_stach2 == FALSE, 
         stachelhaus_predictions_4 != "no_call") %>% 
  .$stachelhaus_predictions_5 %>% 
  table() %>%  
  sort(decreasing=TRUE) %>% 
  .[1:20]
```


**What were the AS4 Values that change to Phe in AS5?**


There are 60 Phe differences. What are the AS4 calls when `AS5="phe"`?  Many hydrophobic residues to Phe. A few charged residue predicitons where you might not expect a change.

```{r , echo=FALSE}

## Look at the Stachelhaus Calls first
## 
## of the non no-calls, what are the other problems?
not_no_call <- all_data %>% filter(compare_stach2 == FALSE, stachelhaus_predictions_4 != "no_call")
#sort(table(not_no_call$stachelhaus_predictions_5))
#not_no_call  %>% filter(stachelhaus_predictions_5  == "phe")


not_no_call  %>% 
  filter(stachelhaus_predictions_5  == "phe") %>% 
  .$stachelhaus_predictions_4 %>%
  table() %>%
  sort(decreasing=TRUE)

```


## NRPS Predictor Findings

**Are AS4 NRPSPredictor3 values identical to AS5 NRPSPredictor2 values?**


There is 80% concordance between V2 and V3. Most of the differences are due to N/A values. This brings concordnace to >90% if we include only those cases where a call is made.

```{r , echo=FALSE}

#table(all_data$compare_stach1) # 897/1903 T/F
#table(all_data$compare_stach2) # 1598/1320 T/F
table(all_data$compare_nrpspred) # 2249/551 T/F

```


What are the NRPSPredictor2 values when there is a discrepancy?

```{r, echo=FALSE}

all_data %>% 
  filter(compare_nrpspred == FALSE) %>% 
  .$nrpspredictor2_single %>% 
  table() %>%  
  sort(decreasing=TRUE) %>% 
  .[1:20]

```

What are the NRPSPredictor3 values when there is a discrepancy?

```{r, echo=FALSE}

all_data %>% 
  filter(compare_nrpspred == FALSE) %>% 
  .$nrpspredictor3_svm_single %>% 
  table() %>%  
  sort(decreasing=TRUE) %>% 
  .[1:20]

```