## `GO_MWU` functional enrichment analysis

Here we will test for enrichment of GO categories in different groups of transcripts. The code is run by simply editing the names of the inputs and uncommenting certain lines of code as needed. See Mikhail Matz's [GitHub page](https://github.com/z0on/GO_MWU) on `GO_MWU` for more info. The code below is identical to the original script with the exception of some minor tweaks for plotting. 

All functional enrichment analyses in the report were carried out using the `GO_MWU_annot_1e-50.tsv` annotations file, which contains just matching GO terms from the arthropod eggNOG database with e-values at or below 1e-50. Putative contaminants identified by `EnTAP` are also discarded. See the `EnTAP2GO.py` script for more details about how to generate these GO term tables. Run from the `scripts/` directory with this line of code:

```
python EnTAP2GO.py ../EnTAP/entap_outfiles/final_results/final_annotations_lvl0.tsv ../GO_MWU/GO_MWU_annot_1e-50.tsv 1e-50 y Arthropoda
```

Before running this, you need to have already run the differential expression analysis in the `DESeq2_RhithroLoxo.ipynb` Jupyter notebook. This saves the required input files for `GO_MWU` into the `outputs/` folder. Copy all of these files (prefix `GO_MWU`) into the `GO_MWU` directory before proceeding. Also edit the annotations file so that transcripts that don't appear in the DESeq2 results (because they were flagged as contamination within the `DESeq2_RhithroLoxoDE` notebook) are excluded.
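
The annotation filtering step could look roughly like the sketch below. It is not the original chunk: `filter_annotations` is a hypothetical helper, and it assumes the annotations file is a headerless two-column TSV and that the `GO_MWU` input CSV has a header row with transcript IDs in its first column. The toy IDs and GO terms are made up.

```r
# Hypothetical helper: keep only annotation rows whose transcript ID
# also appears in the first column of a GO_MWU input table.
filter_annotations <- function(annot_path, input_path, out_path) {
  # Annotations: two tab-delimited columns (transcript ID, GO terms), no header (assumption)
  annot <- read.delim(annot_path, header = FALSE, stringsAsFactors = FALSE,
                      col.names = c("id", "go"), quote = "")
  # GO_MWU input: comma-separated, transcript ID in the first column, header row (assumption)
  de <- read.csv(input_path, header = TRUE, stringsAsFactors = FALSE)
  kept <- annot[annot$id %in% de[[1]], ]
  write.table(kept, out_path, sep = "\t", quote = FALSE,
              row.names = FALSE, col.names = FALSE)
  invisible(kept)
}

# Toy demonstration with temporary files (made-up IDs and GO terms)
annot_file <- tempfile(fileext = ".tsv")
input_file <- tempfile(fileext = ".csv")
out_file   <- tempfile(fileext = ".tsv")
writeLines(c("tx1\tGO:0008150;GO:0003674",
             "tx2\tGO:0005575",
             "tx3\tGO:0008150"), annot_file)
writeLines(c("gene,value", "tx1,1.3", "tx3,-0.2"), input_file)

kept <- filter_annotations(annot_file, input_file, out_file)
print(kept$id)
```

In real use, the paths would point at `GO_MWU_annot_1e-50.tsv` and one of the `GO_MWU_*.csv` inputs; writing to a new file rather than overwriting the original keeps the unfiltered annotations available.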

Rather than repeating all of this code for each test and dataset, set `input` to one of the files listed below and re-run, changing the filepath for saving output images each time. Also switch among the three GO divisions (`MF`, `BP`, `CC`), updating the output filepath to match, so that each input yields three plots.


>Infected vs. control DESeq2 pvalues, Mann-Whitney U test:
>>`GO_MWU_PvC_pval.csv`

>Infected vs. control DESeq2 LFCs, Mann-Whitney U test:
>>`GO_MWU_PvC_LFC.csv`

>Infected vs. control DESeq2, Fisher's exact test:
>>`GO_MWU_PvC_fisher.csv`

>Range:condition interaction DESeq2 pvalues, Mann-Whitney U test:
>>`GO_MWU_interaction_pval.csv`

>Range:condition interaction DESeq2, Fisher's exact test:
>>`GO_MWU_interaction_fisher.csv` (No significant transcripts found except for CC)

>WGCNA_FP modules (infection and sex):

>>`GO_MWU_WGCNA_FP_kMEgreenyellow.csv`

>>`GO_MWU_WGCNA_FP_kMEmidnightblue.csv`

>>`GO_MWU_WGCNA_FP_kMEsalmon.csv`


>Within range infected vs. control:

>> `GO_MWU_native.PvC_pval.csv`

>> `GO_MWU_invasive.PvC_pval.csv`

>> `GO_MWU_absent.PvC_pval.csv`

>Range:condition interaction contrasts, pval

>> `GO_MWU_interaction_AvN_pval.csv`

>> `GO_MWU_interaction_IvN_pval.csv`

>> `GO_MWU_interaction_AvI_pval.csv`

>Range:condition interaction contrasts, fisher

>> `GO_MWU_interaction_AvN_fisher.csv`

>> `GO_MWU_interaction_IvN_fisher.csv`

>> `GO_MWU_interaction_AvI_fisher.csv`
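
To keep track of the input, GO division, and output image combinations described above, one option is to tabulate them before re-running the cells. This is only a sketch: the two inputs are a subset of the files listed above, and the output naming pattern is an assumption modeled on the `png()` filename used later in this notebook.

```r
# Subset of the input files listed above
inputs <- c("GO_MWU_PvC_pval.csv", "GO_MWU_PvC_LFC.csv")
divisions <- c("MF", "BP", "CC")

# One row per (input, GO division) combination
runs <- expand.grid(input = inputs, goDivision = divisions,
                    stringsAsFactors = FALSE)

# Assumed naming pattern: <input without .csv>_1e-50_<division>.png under ../vis/
runs$png <- file.path("../vis",
                      paste0(sub("\\.csv$", "", runs$input),
                             "_1e-50_", runs$goDivision, ".png"))
print(runs)
```

Each row gives the values to assign to `input` and `goDivision` and the path to pass to `png()` for that run.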


Note that `GO_MWU` returns a message about "terms without a defined level." Apparently `EnTAP` is returning some deprecated GO terms, as was also noted [here](https://github.com/fishercera/TreehopperSeq/blob/master/GoSeq_Walkthrough.md).

The WGCNA modules were at first analyzed using only Fisher's exact test, because there was a bug in the package. Outputs from WGCNA were converted to this format using `script.awk` in the `GO_MWU/` directory: non-zero kMEs were converted to ones. This tests solely for GO term inclusion in the module and skips the second part of the WGCNA `GO_MWU` analysis described on the GitHub page.

Matz fixed the bug and now it's working as expected. Now I have the full enrichment analysis done for the modules.
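
For reference, the binarization used in the initial Fisher-only analysis (non-zero kMEs converted to ones) can be sketched in R as below. This is an illustrative equivalent, not the contents of `script.awk`; the gene IDs and kME values are made up.

```r
# Toy module table (made-up IDs and kME values): 0 = gene not in module
module <- data.frame(gene = c("tx1", "tx2", "tx3", "tx4"),
                     kME  = c(0, 0.83, 0, 0.41),
                     stringsAsFactors = FALSE)

# Fisher-style input: 1 for any non-zero kME, 0 otherwise
module_fisher <- transform(module, kME = as.integer(kME != 0))
print(module_fisher)
```

The full module analysis instead keeps the kME values themselves and is run with `Module=TRUE`, as in the `gomwuStats` call below.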

In [1]:
input="GO_MWU_WGCNA_FP_kMEgreenyellow.csv" # two columns of comma-separated values: gene id, continuous measure of significance. To perform standard GO enrichment analysis based on Fisher's exact test, use binary measure (0 or 1, i.e., either significant or not).
goAnnotations="GO_MWU_annot_1e-50.tsv" # two-column, tab-delimited, one line per gene, multiple GO terms separated by semicolon. If you have multiple lines per gene, use nrify_GOtable.pl prior to running this script.
goDatabase="go.obo" # download from http://www.geneontology.org/GO.downloads.ontology.shtml
goDivision="CC" # either MF, or BP, or CC
source("gomwu.functions.R")

In [2]:
gomwuStats(input, goDatabase, goAnnotations, goDivision,
	perlPath="perl", # replace with full path to perl executable if it is not in your system's PATH already
	largest=0.1,  # a GO category will not be considered if it contains more than this fraction of the total number of genes
	smallest=5,   # a GO category should contain at least this many genes to be considered
	clusterCutHeight=0.25, # threshold for merging similar (gene-sharing) terms. See README for details.
#	Alternative="g" # by default the MWU test is two-tailed; specify "g" or "l" if you want to test for "greater" or "less" instead. 
	Module=TRUE,Alternative="g" # un-remark this if you are analyzing a SIGNED WGCNA module (values: 0 for not in module genes, kME for in-module genes). In the call to gomwuPlot below, specify absValue=0.001 (count number of "good genes" that fall into the module)
	#Module=TRUE # un-remark this if you are analyzing an UNSIGNED WGCNA module 
)

shuffling values to calculate FDR, 20 reps
replicate 1 through replicate 20, each followed by the warning “cannot compute exact p-value with ties”
4 GO terms at 10% FDR


In [4]:
#save plot
png(filename="../vis/GO_MWU_WGCNA_FP_kMEgreenyellow_1e-50_CC_test.png", pointsize=60, height=2200, width=2500)
results=gomwuPlot(input,goAnnotations,goDivision,
#	absValue=-log(0.05,10),  # genes with the measure value exceeding this will be counted as "good genes". This setting is for signed log-pvalues. Specify absValue=0.001 if you are doing Fisher's exact test for standard GO enrichment or analyzing a WGCNA module (all non-zero genes = "good genes").
	absValue=0.001, # Fisher's exact test or WGCNA module: all non-zero genes count as "good genes". Use absValue=1 instead if you are using log2-fold changes.
	level1=0.1, # FDR threshold for plotting. Specify level1=1 to plot all GO categories containing genes exceeding the absValue.
	level2=0.05, # FDR cutoff to print in regular (not italic) font.
	level3=0.01, # FDR cutoff to print in large bold font.
	txtsize=1.2,    # decrease to fit more on one page, or increase (after rescaling the plot so the tree fits the text) for better "word cloud" effect
	treeHeight=0.5, # height of the hierarchical clustering tree
	#colors=c("gray0","gray0","gray57","gray57") # these are default colors, un-remark and change if needed
)
dev.off()

#show plot
results=gomwuPlot(input,goAnnotations,goDivision,
#	absValue=-log(0.05,10),  # genes with the measure value exceeding this will be counted as "good genes". This setting is for signed log-pvalues. Specify absValue=0.001 if you are doing Fisher's exact test for standard GO enrichment or analyzing a WGCNA module (all non-zero genes = "good genes").
	absValue=0.001, # the input here is a WGCNA module, so all non-zero genes count as "good genes" (matches the saved plot above)
#	absValue=1, # un-remark this if you are using log2-fold changes
	level1=0.1, # FDR threshold for plotting. Specify level1=1 to plot all GO categories containing genes exceeding the absValue.
	level2=0.05, # FDR cutoff to print in regular (not italic) font.
	level3=0.01, # FDR cutoff to print in large bold font.
	txtsize=1.2,    # decrease to fit more on one page, or increase (after rescaling the plot so the tree fits the text) for better "word cloud" effect
	treeHeight=0.5, # height of the hierarchical clustering tree
	#colors=c("gray0","gray0","gray57","gray57") # these are default colors, un-remark and change if needed
)


Loading required package: ape


ERROR: Error in `.rowNamesDF<-`(x, value = value): invalid 'row.names' length


Here I am going to try to make a heatmap of up or downregulated biological processes. The cells will be colored by delta rank.

*I don't think that coloring by delta rank is appropriate when there is a really asymmetrical response, like in the absent range PvC comparison. It shows a lot of functions as going down, but I think that is just an effect of there being a lot more going up, and thus the delta ranks are skewed negative.*
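
The concern above can be illustrated with a toy calculation. The sketch below assumes that delta rank is the mean rank of the genes in a GO category minus the mean rank of all other genes; that definition, the `delta_rank` helper, and the values are assumptions for illustration, not taken from the `GO_MWU` source.

```r
# Hypothetical helper: mean rank inside the category minus mean rank outside it
delta_rank <- function(values, in_category) {
  r <- rank(values)
  mean(r[in_category]) - mean(r[!in_category])
}

# Made-up signed measures: seven clearly "up" genes, three genes near zero
values <- c(5, 4, 3.5, 3, 2.5, 2, 1, 0.1, 0, -0.1)
# The category contains only the three near-zero genes
in_category <- c(rep(FALSE, 7), rep(TRUE, 3))

dr <- delta_rank(values, in_category)
print(dr)
```

Under these assumptions the category receives a negative delta rank even though its genes barely change, simply because most other genes go up.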

In [None]:
native.PvC.MWU <-   read.table("MWU_BP_GO_MWU_native.PvC_pval.csv", header=TRUE, stringsAsFactors=FALSE)
invasive.PvC.MWU <- read.table("MWU_BP_GO_MWU_invasive.PvC_pval.csv", header=TRUE, stringsAsFactors=FALSE)
absent.PvC.MWU <-   read.table("MWU_BP_GO_MWU_absent.PvC_pval.csv", header=TRUE, stringsAsFactors=FALSE)

In [None]:
dim(native.PvC.MWU)
dim(invasive.PvC.MWU)
dim(absent.PvC.MWU)

In [None]:
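# keep only GO terms whose value in column 7 (the adjusted p-value in the GO_MWU output) is below 0.001
native.PvC.MWU   <- native.PvC.MWU[native.PvC.MWU[,7]<0.001,]
invasive.PvC.MWU <- invasive.PvC.MWU[invasive.PvC.MWU[,7]<0.001,]
absent.PvC.MWU   <- absent.PvC.MWU[absent.PvC.MWU[,7]<0.001,]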

In [None]:
dim(native.PvC.MWU)
dim(invasive.PvC.MWU)
dim(absent.PvC.MWU)

In [None]:
# keep only the delta rank (column 1) and the GO term name (column 6)
native.PvC.MWU   <- native.PvC.MWU[,c(1,6)]
invasive.PvC.MWU <- invasive.PvC.MWU[,c(1,6)]
absent.PvC.MWU   <- absent.PvC.MWU[,c(1,6)]

In [None]:
head(native.PvC.MWU)
head(invasive.PvC.MWU)
head(absent.PvC.MWU)

In [None]:
all <- union(native.PvC.MWU$name, invasive.PvC.MWU$name)
all <- union(all, absent.PvC.MWU$name)
merged <- data.frame(BP=all, nat=0, inv=0, abs=0)
head(merged)
dim(merged)

In [None]:
# Look up each GO term name (column 2) in a range's table and return its
# delta rank (column 1); terms absent from that range get 0.
fillCol <- function(tab, terms){
    vals <- tab[match(terms, tab[,2]), 1]
    vals[is.na(vals)] <- 0
    vals
}
merged$nat <- fillCol(native.PvC.MWU, merged$BP)
merged$inv <- fillCol(invasive.PvC.MWU, merged$BP)
merged$abs <- fillCol(absent.PvC.MWU, merged$BP)

In [None]:
options(repr.matrix.max.rows=120, repr.matrix.max.cols=10)
merged

In [None]:
require(pheatmap)
require(RColorBrewer)

In [None]:
rownames(merged) <- merged[,1]
merged <- merged[,-c(1)]
head(merged)

In [None]:
paletteLength <- 100
myColor <- colorRampPalette(c("blue", "white", "red"))(paletteLength)
# length(myBreaks) == paletteLength + 1
# use floor and ceiling to deal with even/odd palette lengths
myBreaks <- c(seq(min(na.omit(merged)), 0, length.out=ceiling(paletteLength/2) + 1), 
              seq(max(na.omit(merged))/paletteLength, max(na.omit(merged)), length.out=floor(paletteLength/2)))
#png(filename="../vis/test.png", pointsize=60, height=2200, width=2500)
pheatmap(
    mat = merged,
    cluster_rows = FALSE,
    cluster_cols = FALSE,
    breaks = myBreaks,
    color = myColor,
    filename = "test.png",
    cellheight = 10
)
#dev.off()

In [None]:
?pheatmap