## Differential Expression Analysis - RhithroLoxo

If you've already saved the workspace image from a previous session, Jupyter should reload it automatically, although you may still need to reload the packages. The .RData file is not on GitHub, so you will have to run the full analysis the first time through.

First, make sure you're actually running this from a compute node, not a login node. On Poseidon, the login nodes are 'l1' and 'l2', whereas all other nodes start with 'pn'.

In [None]:
Sys.info()
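
The check above can also be automated. This is a minimal sketch, assuming the login nodes report their hostname as `l1` or `l2` (optionally followed by a domain); `is_login_node()` is a hypothetical helper, not part of any package.

```r
# Hypothetical helper: TRUE when a hostname looks like a Poseidon login node.
# Assumption: login nodes are reported as "l1"/"l2", optionally with a domain suffix.
is_login_node <- function(node) grepl("^l[12](\\.|$)", node)

node <- Sys.info()[["nodename"]]
if (is_login_node(node)) {
    warning("Running on login node '", node, "'; switch to a compute node.")
}
```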

Now load in the packages.

In [None]:
require(DESeq2)
require(ggplot2)
require(apeglm)
require(ashr)
library("BiocParallel")
register(MulticoreParam(36))
require(VennDiagram)
require(RColorBrewer)
require(pheatmap)

Now import the count data, rounding decimals to integers.

In [None]:
path_to_main <- "/vortexfs1/scratch/ztobias/RhithroLoxo_DE/" #change accordingly based on parent file structure
path_to_counts <- "outputs/quant/salmon.isoform.counts.matrix"
path <- paste(path_to_main,path_to_counts,sep="")
all_counts <- read.table(path,header=TRUE)
all_counts <- round(all_counts)

Take a look.

In [None]:
head(all_counts)
dim(all_counts)

Read in the sample metadata and have a look.

In [None]:
path_to_meta <- paste(path_to_main,"metadata/DESeq2_coldata.txt",sep="")
coldata <- read.table(path_to_meta,header=FALSE,row.names=1)
colnames(coldata) <- c("site","condition","range","sex")
head(coldata)
dim(coldata)

Make sure the two tables contain the same samples, in the same order.

In [None]:
all(rownames(coldata) == colnames(all_counts))
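
If that check ever returns `FALSE` because the samples are the same but in a different order, the count columns can be reordered to follow the metadata. A minimal sketch on toy data (the sample and transcript names here are made up):

```r
# Toy stand-ins for coldata and all_counts (names are made up)
toy_meta <- data.frame(condition = c("C", "P", "C"),
                       row.names = c("S1", "S2", "S3"))
toy_counts <- matrix(1:6, nrow = 2,
                     dimnames = list(c("tx1", "tx2"), c("S3", "S1", "S2")))

# Both tables must describe the same set of samples before reordering
stopifnot(setequal(rownames(toy_meta), colnames(toy_counts)))

# Reorder the count columns to follow the metadata rows
toy_counts <- toy_counts[, rownames(toy_meta), drop = FALSE]
identical(rownames(toy_meta), colnames(toy_counts))
```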

Okay, here I am just going to calculate the normalized counts so I can try to identify transcripts whose highest representation occurs in the Loxo libraries. This is because there appears to be some latent contamination, either from index hopping or from basal levels of infection in otherwise "uninfected" crabs. This shows up downstream as some crazily overexpressed transcripts in the parasitized individuals, which appear not to be an actual response, but rather just contamination with parasite mRNA. This might not help completely because of tissue-specific expression in the parasite (the libraries were made from externae, while the contamination comes from internal, root-system tissues), but it is worth a shot, at least to strengthen the case for removal by thresholding later.

The infected crabs will be excluded from this comparison. Because of tissue-specific expression in the parasite, some contaminant transcripts may be expected to show their highest expression in the infected crab libraries rather than in those of the parasite itself. Thus, we are just looking for transcripts that are more highly represented in the parasite libraries than in the control crab libraries.

Create a DESeq dataset for calculating normalized counts

In [None]:
contam <- DESeqDataSetFromMatrix(countData = all_counts, colData = coldata, design = ~ condition )

Estimate size factors and save normalized counts to object 'norm_mat'

In [None]:
contam <- estimateSizeFactors(contam)
norm_mat <- counts(contam, normalized=TRUE)
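
For intuition, here is a simplified base R sketch of the median-of-ratios idea behind `estimateSizeFactors()`. It is an illustration on toy data, not the DESeq2 implementation; `median_ratio_size_factors()` is a hypothetical helper.

```r
# Simplified median-of-ratios size factors (illustration only)
median_ratio_size_factors <- function(counts) {
    log_geo_means <- rowMeans(log(counts))   # log geometric mean per transcript
    keep <- is.finite(log_geo_means)         # skip transcripts with a zero in any sample
    apply(counts, 2, function(sample_counts) {
        exp(median(log(sample_counts[keep]) - log_geo_means[keep]))
    })
}

# Toy data: library B was sequenced exactly twice as deeply as library A
toy <- cbind(A = c(10, 20, 30, 40), B = c(20, 40, 60, 80))
sf <- median_ratio_size_factors(toy)
toy_norm <- sweep(toy, 2, sf, "/")   # divide each column by its size factor
toy_norm
```

After normalization the two toy libraries have identical values, which is what `counts(..., normalized=TRUE)` aims for when libraries differ only in depth.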

First let's remove the parasitized crabs. This includes all samples whose names contain `_P_`. The function `grepl()` returns a logical vector that can be used for indexing. Note that `grepl()` expects a regular expression rather than a shell-style glob, so the pattern is given as the literal string `_P_`. I will also remove sample MD_C_12, which was found to have a latent infection in previous runs of this analysis.

In [None]:
norm_mat_sub <- norm_mat[,!grepl("_P_", colnames(norm_mat), fixed=TRUE)] #drop parasitized crabs
norm_mat_sub <- norm_mat_sub[,colnames(norm_mat_sub) != "MD_C_12"] #drop the latently infected control
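
Because `grepl()` interprets its pattern as a regular expression, glob-style wildcards do not behave as they would in a shell. A small sketch of three safer ways to match sample names, using made-up names that follow the site_condition_number pattern:

```r
samples <- c("MA_C_1", "MA_P_2", "MD_F_1", "AP_P_3")  # made-up sample names

is_p_fixed <- grepl("_P_", samples, fixed = TRUE)   # literal substring match
is_p_glob  <- grepl(glob2rx("*_P_*"), samples)      # translate the glob to a regex first
is_ap      <- grepl("^AP_", samples)                # anchor at the start to match a site prefix

data.frame(samples, is_p_fixed, is_p_glob, is_ap)
```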

In [None]:
colnames(norm_mat_sub)

Column numbers of the parasite samples in norm_mat_sub are 19 and 28. The command below finds the index of the column holding the maximum for each row, checks whether it is 19 or 28 (the parasite samples), and uses the resulting logical vector to subset the matrix. Let's take a look at a slice of the output to verify its behavior. Parasite samples contain `_F_` in their names.

In [None]:
contam_subset <- norm_mat_sub[max.col(norm_mat_sub) %in% c(19,28),]
contam_subset[1:20,18:29]
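
One caveat: by default `max.col()` breaks ties at random, so rows with tied maxima (for example all-zero rows) can be assigned to different columns from run to run. A possible deterministic alternative, sketched here on toy data with made-up sample names, is to select the parasite columns by name and require a strictly higher maximum than in the clean samples:

```r
toy <- rbind(tx1 = c(5, 0, 9, 1),
             tx2 = c(7, 3, 2, 0),
             tx3 = c(0, 0, 0, 0),   # all zeros: a tie, should not be flagged
             tx4 = c(1, 1, 4, 6))
colnames(toy) <- c("MA_C_1", "MD_C_2", "MA_F_1", "MD_F_1")

is_parasite <- grepl("_F_", colnames(toy), fixed = TRUE)
max_parasite <- apply(toy[, is_parasite, drop = FALSE], 1, max)
max_clean <- apply(toy[, !is_parasite, drop = FALSE], 1, max)

# Keep transcripts whose highest value is strictly in a parasite library
toy_contam <- toy[max_parasite > max_clean, , drop = FALSE]
rownames(toy_contam)
```

Applied to the real matrix, this could give a slightly different transcript count than the `max.col()` approach, because ties are never counted as parasite maxima.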

Scrolling through, it's clear that these transcripts have the highest expression in at least one of the parasite samples. This is indicative of Loxo contamination in the Rhithro assembly. Interestingly, oftentimes even when the highest value is in a parasite column, the other parasite column has a value of zero. I am not quite sure what this could mean. Different expression patterns between the two parasite externae, with both ending up in the Rhithro assembly? 

Okay now I'm going to save the rownames for use later:

In [None]:
contam_IDs <- rownames(contam_subset)
length(contam_IDs)

There are 5797 transcripts that have higher expression in a parasite sample than in any of the "clean" samples. That sounds like a lot, but it is only about 4% of the ~150K transcripts.

Back to regularly scheduled programming. I am now going to remove the Loxo samples from the raw expression matrix and sample metadata, and make another DESeq dataset object. 

In [None]:
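is_loxo <- grepl("_F_", colnames(all_counts), fixed=TRUE) #parasite (Loxo) libraries; columns 29 and 41 here
all_counts <- all_counts[,!is_loxo]
coldata <- coldata[!is_loxo,]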

Verify again that the counts and metadata match up:

In [None]:
all(rownames(coldata) == colnames(all_counts))

Okay let's get started. This part of the analysis is just going to do a base comparison between uninfected and parasitized, controlling for population-specific effects. 

In [None]:
dds <- DESeqDataSetFromMatrix(countData = all_counts, colData = coldata, design = ~site + condition)

In [None]:
dds <- DESeq(dds)

Let's visualize the data using some PCAs. This will be helpful for identifying sample outliers.

First we'll apply a variance stabilizing transformation to our normalized counts. 

In [None]:
vsd <- vst(dds, blind=FALSE)

In [None]:
PCA <- plotPCA(vsd, intgroup="condition")
PCA + geom_label(aes(label = name))
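
Under the hood, `plotPCA()` keeps the highest-variance rows of the transformed matrix and runs `prcomp()` on the transposed subset. A rough base R sketch of that idea on random toy data (the matrix, `ntop` value, and names are assumptions for illustration):

```r
set.seed(1)
toy_vst <- matrix(rnorm(200 * 6), nrow = 200,
                  dimnames = list(paste0("tx", 1:200), paste0("S", 1:6)))

ntop <- 50                                   # number of high-variance transcripts to keep
row_var <- apply(toy_vst, 1, var)
select <- order(row_var, decreasing = TRUE)[seq_len(min(ntop, length(row_var)))]

pca <- prcomp(t(toy_vst[select, ]))          # samples as rows, transcripts as columns
percent_var <- pca$sdev^2 / sum(pca$sdev^2)  # proportion of variance per component

round(100 * percent_var[1:2], 1)
head(pca$x[, 1:2])                           # sample coordinates on PC1 and PC2
```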

You can see that PC1 clearly separates the samples according to infection status. Along PC2, you can see four extreme outliers in the upper left corner. Let's investigate this a little more closely by looking at the expression of particular high-variance transcripts.

In [None]:
topVarGenes <- head(order(rowVars(assay(vsd)), decreasing = TRUE), 500)
df <- as.data.frame(colData(dds)[,c("condition","site")])
vsd_df <- assay(vsd)
heatmap <- pheatmap(vsd_df[topVarGenes,], cluster_rows=TRUE, show_rownames=FALSE,
         cluster_cols=TRUE, annotation_col=df)
heatmap

Looking at the top 500 highest variance transcripts, you can see that the transcript cluster second from the top separates samples in the same way as PCA axis 2 from above, with MA_C_1, MA_C_2, MA_C_4, and AP_C_6 having really high expression. Let's figure out what those are. 

In [None]:
idx <- sort(cutree(heatmap$tree_row, k=10)) #separate transcripts by cluster assignment
idx <- names(which(idx==6)) #after searching, cluster 6 is the one with the transcripts we want
heatmap2 <- pheatmap(vsd_df[idx,], cluster_rows=TRUE, show_rownames=FALSE, #plot
         cluster_cols=TRUE)
heatmap2

Okay now we found the cluster that has all of these transcripts. Let's get the names and search annotations. Load in annotations.

In [None]:
annot <- read.table("../EnTAP/entap_outfiles/similarity_search/DIAMOND/overall_results/best_hits_lvl0.tsv", sep="\t", fill=TRUE, header=TRUE, quote="")

In [None]:
idx2 <- idx[heatmap2$tree_row[["order"]]] #transcript IDs in this cluster, in heatmap row order (order indexes vsd_df[idx,], not vsd)
outlier_annot <- annot[annot[,1] %in% idx2,] #get matching annotations
outlier_annot[order(outlier_annot[,11]),c(1,2,3,11,12,13,14)] #show results
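
A note on indexing: the `order` element of a clustering tree refers to rows of the matrix that was clustered, so IDs have to be looked up in that same subset rather than in the full matrix. A small base R sketch with made-up transcript names shows the difference:

```r
set.seed(2)
full <- matrix(rnorm(40), nrow = 10, dimnames = list(paste0("tx", 1:10), NULL))
cluster_ids <- c("tx7", "tx3", "tx9")   # pretend these form the cluster of interest
sub <- full[cluster_ids, ]

tree <- hclust(dist(sub))               # tree$order is a permutation of 1:3

ids_in_tree_order <- rownames(sub)[tree$order]   # correct: index the clustered subset
ids_from_full <- rownames(full)[tree$order]      # wrong: picks tx1..tx3 from the full matrix

ids_in_tree_order
ids_from_full
```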

There doesn't seem to be anything special about these transcripts. I expected them to be contamination from some other parasite, but it seems that they actually represent crab genes (if we can trust the taxonomic assignment of what's in the reference databases). I will dig into this a bit more later. It is likely that they reflect some other process in the crab, such as molting, which both Carolyn and I observed in earlier rounds of this analysis.

Either way, I am going to remove these outlier samples. They contribute too much variation and would cause problems both for count normalization and for fitting the DESeq2 models.

In [None]:
all_counts <- all_counts[,!(colnames(all_counts) %in% c("MA_C_1","MA_C_2","MA_C_4","AP_C_6"))]
all_counts[1:6,1:6]
dim(all_counts)

Okay now you can see the four of them were removed. Now make sure the coldata matches.

In [None]:
coldata <- coldata[colnames(all_counts),]
all(rownames(coldata) == colnames(all_counts))

Okay good let's make another dds object.

In [None]:
dds <- DESeqDataSetFromMatrix(countData = all_counts, colData = coldata, design = ~ condition)

In [None]:
dds <- DESeq(dds)

In [None]:
res <- results(dds, alpha=0.05)
summary(res)

Here we see that there are 1328 transcripts deemed significantly upregulated in parasitized crabs, and 1132 significantly downregulated. 

As mentioned earlier, it is my suspicion that some of these upregulated transcripts are contamination from Loxo. Let's have a look.

In [None]:
resLFC <- lfcShrink(dds, coef="condition_P_vs_C", type="apeglm")

In [None]:
plotMA(resLFC, ylim=c(-13,13))

Here we see the pattern I have alluded to, where there is a cloud of extremely overexpressed transcripts (LFC > 10) in the infected crabs relative to the controls. It is my suspicion that these represent contamination from Loxo rather than an actual response by the crabs.

Let's see if these match the transcript IDs we pulled out earlier as possible Loxo contaminant transcripts. 

First let's make an ordered data.frame of the significant transcripts.

In [None]:
res_sig <- data.frame(subset(res, padj < 0.05))
res_sig <- res_sig[order(-res_sig$log2FoldChange),]
head(res_sig, 30)
tail(res_sig)

You can see that there are some extremely significant, extremely overexpressed transcripts at the top, many of which have high rates of expression overall. These are likely contamination from Loxo. I am going to compare the list of transcripts I made earlier to these to see how much of an overlap there is. 

There are 2377 significant transcripts, 1250 up and 1127 down. There were 5797 transcripts that had higher expression in one of the two parasite samples than any of the clean samples. Let's look at the intersect. 

In [None]:
contam_intersect <- intersect(contam_IDs, rownames(res_sig))
length(contam_intersect)

Okay so there are 452 transcripts that came up as significant that were also identified as potential contamination. Let's take a look at the significant results table with those removed:

In [None]:
head(res_sig[!(rownames(res_sig) %in% contam_intersect),],n=10)

A lot of those transcripts are now removed. There are just around 7 of them left, depending on where you draw the line.

I am going to repeat the LFC shrinkage and visualization with the contaminant transcripts removed. I am subsetting the dds object within the `lfcShrink()` call, using `subset()` to keep only transcripts that do not appear in the contam_IDs list.

In [None]:
resLFC_nocontam <- lfcShrink(subset(dds, !rownames(dds) %in% contam_IDs), coef="condition_P_vs_C", type="apeglm")
plotMA(resLFC_nocontam, ylim=c(-13,13))

You can see now that the big cloud of points at the top right has mostly disappeared. A few points remain. I am still going to consider these as contamination and remove them. To do this, I will flag any significant transcript whose LFC exceeds the absolute value of the LFC of the most underexpressed significant transcript. The shrunken LFCs are used here because they are better suited for ranking.

In [None]:
resLFC_nocontam_sig <- data.frame(subset(resLFC_nocontam, padj < 0.05))
resLFC_nocontam_sig <- resLFC_nocontam_sig[order(-resLFC_nocontam_sig$log2FoldChange),]
head(resLFC_nocontam_sig,n=20)

Okay there's the table. Now find rownames of rows with higher LFC values than absolute value of lowest LFC value. 

In [None]:
contam_add <- rownames(resLFC_nocontam_sig[resLFC_nocontam_sig$log2FoldChange > abs(min(resLFC_nocontam_sig$log2FoldChange)),])
contam_add
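
To make the thresholding rule concrete, here is the same logic on a toy table (the LFC values and transcript names are invented for illustration):

```r
toy_sig <- data.frame(log2FoldChange = c(12.1, 9.4, 3.2, 1.5, -2.0, -4.8),
                      row.names = paste0("tx", 1:6))

# Cutoff: magnitude of the most negative significant LFC
lfc_cutoff <- abs(min(toy_sig$log2FoldChange))

# Flag transcripts that are more overexpressed than anything is underexpressed
flagged <- rownames(toy_sig)[toy_sig$log2FoldChange > lfc_cutoff]
lfc_cutoff
flagged
```

Here the cutoff is 4.8, so only the two transcripts with LFCs of 12.1 and 9.4 are flagged.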

Okay, here are the seven remaining contigs that we will consider contamination. I am going to add them to the list of contam_IDs, which will be used for removal from the initial counts object. Then the analysis will be re-run. This is to account for renormalization after removal, since many of these contigs had high mean expression across samples. I also want to have all of the putative contaminant transcripts removed before I do the WGCNA analysis. Because it looks for co-expression patterns among transcripts, if I leave in transcripts that are actually just contaminants, it will likely assign transcripts to modules not based on any functional relevance to particular pathways, but rather just on infection status.

In [None]:
contam_IDs <- c(contam_IDs, contam_add)
length(contam_IDs)

Okay, successfully added. Now to rerun the analysis.

In [None]:
dim(all_counts)
counts_clean <- all_counts[!rownames(all_counts) %in% contam_IDs,]
dim(counts_clean)

You can see that the 5804 putative contaminant transcripts have been removed.

Check that the sample names match between metadata and counts matrices.

In [None]:
all(rownames(coldata) == colnames(counts_clean))

Good. Create new dds object without the contaminant transcripts.

In [None]:
dds_clean <- DESeqDataSetFromMatrix(countData = counts_clean, colData = coldata, design = ~ condition)

Fit the model.

In [None]:
dds_clean <- DESeq(dds_clean)

In [None]:
res_clean <- results(dds_clean, alpha=0.05)
summary(res_clean)

Now you can see that we have 883 significantly upregulated transcripts and 935 significantly downregulated transcripts. A lot of transcripts (41373) had counts too low to be included, though the creators of DESeq2 recommend not removing these beforehand because having all of the transcripts helps with the dispersion estimates. So, even though a lot got "thrown out", they are still helpful behind the scenes.

Now I'll just save all of the significant results into a table.

In [None]:
res_clean_sig <- data.frame(subset(res_clean, padj < 0.05))
res_clean_sig <- res_clean_sig[order(res_clean_sig$padj),]
head(res_clean_sig, 20)

Just out of curiosity, I'd like to plot the counts for a few top transcripts. 

In [None]:
count_plot <- plotCounts(dds_clean, gene="TRINITY_DN4558_c1_g1_i1", intgroup=c("condition","site"), returnData=TRUE)
ggplot(count_plot, aes(x=condition, y=count, color=site)) +
    geom_point(position=position_jitter(w=0.2,h=0)) + 
    scale_y_log10()


We'll do a variance stabilizing transformation, which is fast and useful for visualization of results.

*BLIND OR NOT?*

In [None]:

#vsd <- vst(dds_clean) #blind = TRUE or FALSE?

Now that we have looked for bulk differences as a result of infection status, we can move on to look at differences among populations with different levels of historical exposure to the parasite. We included the FP samples in our first comparison because it was agnostic to range. However, because it is unresolved whether Loxo is native, invasive, or absent from FP, we are going to remove it from subsequent analyses.

We have to make another dds object. I am going to make it from scratch by removing all FP samples from the coldata and counts_clean.

*WE PROBABLY WANT TO CHANGE THE MODEL FORMULA SO IT ACCOUNTS FOR POPULATION AS BATCH EFFECTS!*

<http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#group-specific-condition-effects-individuals-nested-within-groups>

In [None]:
counts_clean_noFP <- counts_clean[,!grepl("^FP_",colnames(counts_clean))] #drop samples whose names start with the site code FP
coldata_noFP <- coldata[colnames(counts_clean_noFP),]

Just a quick sanity check to make sure the samples are in the same order in the counts and metadata matrices.

In [None]:
all(rownames(coldata_noFP) == colnames(counts_clean_noFP))

Good. Now we can specify the model design and the reference levels.

In [None]:
dds_interaction <- DESeqDataSetFromMatrix(countData = counts_clean_noFP, colData = coldata_noFP, design = ~ range + condition + range:condition)
dds_interaction$condition <- relevel(dds_interaction$condition, ref = "C")
dds_interaction$range <- relevel(dds_interaction$range, ref = "Native")

Let's make sure we have the design and levels we want. 

In [None]:
design(dds_interaction)
dds_interaction$condition
dds_interaction$range

We have our correct model formula. Clicking on the "Levels" arrow, we can see that "C" appears first for condition and "Native" appears first for range, indicating they are the reference level for their respective factors.

Now we can move on to running the analysis.

In [None]:
dds_interaction <- DESeq(dds_interaction)

In [None]:
resultsNames(dds_interaction)

Model fitting is now complete. Looking at `resultsNames()` tells us that we have all of the expected coefficients in our model. 

We are interested in a number of comparisons. First, we are simply interested in understanding which transcripts are differentially expressed between infected and uninfected crabs *within* each range. Let's save these results to their own data.frames.

In [None]:
native.PvC <- results(dds_interaction, alpha=0.05, contrast=c("condition","P","C"))
invasive.PvC <- results(dds_interaction, alpha=0.05, contrast=list( c("condition_P_vs_C","rangeInvasive.conditionP")))
absent.PvC <- results(dds_interaction, alpha=0.05, contrast=list( c("condition_P_vs_C","rangeAbsent.conditionP")))
summary(native.PvC)
summary(invasive.PvC)
summary(absent.PvC)
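
The contrasts above rely on how coefficients combine in an interaction model: the condition effect in the reference range is the main effect alone, while the effect in any other range is the main effect plus that range's interaction term. A minimal sketch with `lm()` on invented log-scale values illustrates the arithmetic (note that `lm()` names the interaction `rangeInvasive:conditionP`, whereas DESeq2 uses `rangeInvasive.conditionP`):

```r
toy <- data.frame(
    range = factor(rep(c("Native", "Invasive"), each = 4), levels = c("Native", "Invasive")),
    condition = factor(rep(rep(c("C", "P"), each = 2), times = 2), levels = c("C", "P")),
    y = c(5.0, 5.2, 6.0, 6.2,   # Native: C, C, P, P
          5.1, 4.9, 8.0, 8.2)   # Invasive: C, C, P, P
)

fit <- lm(y ~ range + condition + range:condition, data = toy)
b <- coef(fit)

native_effect <- b[["conditionP"]]                                      # P vs C in the reference range
invasive_effect <- b[["conditionP"]] + b[["rangeInvasive:conditionP"]]  # P vs C in the invasive range

c(native = native_effect, invasive = invasive_effect)
```

Each value equals the difference between the P and C group means within that range, which is what the corresponding `results()` contrast estimates on the log2 scale.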

*THIS WILL HAVE TO BE REDONE!*

Just a brief description of the results.

For the native range populations (AP & LA), there are 757 significantly upregulated transcripts and 478 significantly downregulated transcripts. In contrast, for the invasive range there are a lot more, with 3195 significantly upregulated and 1630 significantly downregulated transcripts. There are even more for the absent range, with a large boost in the number of downregulated over the invasive range. For the absent range, there are 3889 significantly upregulated and 3350 significantly downregulated transcripts. 

But is this all due to differences in historical exposure to the parasite? Or could it be an effect of differing numbers of samples between ranges? When you exclude FP (range unresolved) and MA_C_3 (bad seq data), there are a total of 18 native samples, 24 invasive samples, and 26 absent samples. So the elevated recovery of DE transcripts could be due in part to deeper sampling/sequencing. One way to check would be to normalize by sequencing effort, i.e. to use the total bp in filtered reads from each of the three ranges as the denominator. DESeq2's size factors already correct for differences in sequencing depth between individual libraries, but they do not equalize statistical power between groups with different numbers of samples.

There is almost certainly another effect from which samples were used for the assembly. Only the uninfected samples were used, and the ranges almost certainly differ in the proportion of the assembly that is derived from their reads.

Now I'll save the significant results to data.frames:

In [None]:
native.PvC.df <- data.frame(subset(native.PvC, padj < 0.05))
invasive.PvC.df <- data.frame(subset(invasive.PvC, padj < 0.05))
absent.PvC.df <- data.frame(subset(absent.PvC, padj < 0.05))
native.PvC.df <- native.PvC.df[order(native.PvC.df$padj),]
invasive.PvC.df <- invasive.PvC.df[order(invasive.PvC.df$padj),]
absent.PvC.df <- absent.PvC.df[order(absent.PvC.df$padj),]

I am going to save the names of the up and downregulated transcripts in each range to objects for making Venn diagrams.

In [None]:
native_DE_up <- rownames(native.PvC.df[native.PvC.df$log2FoldChange > 0,])
invasive_DE_up <- rownames(invasive.PvC.df[invasive.PvC.df$log2FoldChange > 0,])
absent_DE_up <- rownames(absent.PvC.df[absent.PvC.df$log2FoldChange > 0,])
native_DE_down <- rownames(native.PvC.df[native.PvC.df$log2FoldChange < 0,])
invasive_DE_down <- rownames(invasive.PvC.df[invasive.PvC.df$log2FoldChange < 0,])
absent_DE_down <- rownames(absent.PvC.df[absent.PvC.df$log2FoldChange < 0,])

Now I'll make the Venn diagrams.

In [None]:
futile.logger::flog.threshold(futile.logger::ERROR, name = "VennDiagramLogger") #suppress VennDiagram log files
myCol <- brewer.pal(3, "Set1")
#helper so all three diagrams share identical styling; only the input sets and output file change
plot_range_venn <- function(sets, filename) {
    venn.diagram(
        x = sets,
        category.names = c("Native" , "Invasive" , "Absent"),
        filename = filename,
        output = TRUE ,
        imagetype="png" ,
        height = 480 , 
        width = 480 , 
        resolution = 300,
        compression = "lzw",
        lwd = 1,
        fill = myCol,
        cex = 0.5,
        fontfamily = "sans",
        cat.cex = 0.5,
        cat.fontface = "bold",
        cat.default.pos = "outer",
        cat.pos = c(-27, 27, 180),
        cat.dist = c(0.055, 0.055, 0.045),
        cat.fontfamily = "sans",
        cat.col = myCol,
        rotation = 1
    )
}
plot_range_venn(list(native_DE_up, invasive_DE_up, absent_DE_up), '../vis/venn_range_up.png')
plot_range_venn(list(native_DE_down, invasive_DE_down, absent_DE_down), '../vis/venn_range_down.png')
plot_range_venn(list(c(native_DE_down,native_DE_up), c(invasive_DE_down,invasive_DE_up), c(absent_DE_down,absent_DE_up)),
                '../vis/venn_range_both.png')
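
The numbers shown in each region of the Venn diagrams can also be pulled out directly with base R set operations, which is handy for listing the transcripts in a region. A sketch with made-up transcript IDs:

```r
native <- c("tx1", "tx2", "tx3")
invasive <- c("tx2", "tx3", "tx4", "tx5")
absent <- c("tx3", "tx5", "tx6")

shared_all <- Reduce(intersect, list(native, invasive, absent))        # centre of the diagram
native_only <- setdiff(native, union(invasive, absent))                # unique to the native range
invasive_absent_only <- setdiff(intersect(invasive, absent), native)   # shared by the two non-native ranges only

list(shared_all = shared_all,
     native_only = native_only,
     invasive_absent_only = invasive_absent_only)
```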

Now I am going to apply the LFC shrinkage for optimal ranking and visualization later on. We are using the `ashr` method. It tends to not overshrink the LFCs and is also compatible with contrasts. See the [DESeq2 vignette](http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html) for more details.

For determining significant transcripts above (in the summary and in making the Venn diagrams), the unshrunk LFC p-values were used, as recommended by the DESeq2 creator [here](https://support.bioconductor.org/p/98833/).

In [None]:
native.PvC.LFCshrink <- lfcShrink(dds_interaction, contrast=c("condition","P","C"), type="ashr")
invasive.PvC.LFCshrink <- lfcShrink(dds_interaction, contrast=list( c("condition_P_vs_C","rangeInvasive.conditionP")), type="ashr")
absent.PvC.LFCshrink <- lfcShrink(dds_interaction, contrast=list( c("condition_P_vs_C","rangeAbsent.conditionP")), type="ashr")

*IN HERE WE'LL MAKE SOME PLOTS FOR WITHIN/AMONG GROUP STUFF*

Now on to looking into interactions between the infection status and range.

In [None]:
IvN <- results(dds_interaction, alpha=0.05, name="rangeInvasive.conditionP")
AvN <- results(dds_interaction, alpha=0.05, name="rangeAbsent.conditionP")
AvI <- results(dds_interaction, alpha=0.05, contrast=list("rangeAbsent.conditionP", "rangeInvasive.conditionP"))
summary(IvN)
summary(AvN)
summary(AvI)

*WRITE A SUMMARY OF THE ABOVE HERE, BUT FIRST EXPLORE THE DATA*

In [None]:
IvN.df <- data.frame(subset(IvN, padj < 0.05))
AvN.df <- data.frame(subset(AvN, padj < 0.05))
AvI.df <- data.frame(subset(AvI, padj < 0.05))
IvN.df <- IvN.df[order(IvN.df$padj),]
AvN.df <- AvN.df[order(AvN.df$padj),]
AvI.df <- AvI.df[order(AvI.df$padj),]

In [None]:
head(IvN.df,60)

In [None]:
count_plot <- plotCounts(dds_interaction, gene="TRINITY_DN9438_c0_g1_i1", intgroup=c("condition","range", "site"), returnData=TRUE)
#count_plot
ggplot(count_plot, aes(x=condition, y=count, group=range, color=site)) +
    facet_grid(.~range) +
    geom_point(position=position_jitter(w=0.2,h=0)) + 
    #geom_label(aes(label = site), position=position_jitter(w=0.4,h=0))  + 
    #geom_line()+
    geom_smooth(method = "lm", se=F, aes(group=1)) +
    stat_summary(fun.data=mean_sdl, fun.args = list(mult=1), geom="pointrange", color="red") +
    scale_y_log10()#limits = c(1,1e6)) 

In [None]:
#res_int_abs.P = results(dds_interaction, name="rangeAbsent.conditionP")
#res_int_inv.P = results(dds_interaction, name="rangeInvasive.conditionP")

In [None]:
#summary(res_int_abs.P)
#summary(res_int_inv.P)

In [None]:
#res_int_abs.P_sig <- data.frame(subset(res_int_abs.P, padj < 0.05))
#res_int_abs.P_sig <- res_int_abs.P_sig[order(res_int_abs.P_sig$padj),]
#res_int_inv.P_sig <- data.frame(subset(res_int_inv.P, padj < 0.05))
#res_int_inv.P_sig <- res_int_inv.P_sig[order(res_int_inv.P_sig$padj),]
#head(res_int_inv.P_sig,20)

In [None]:
#count_plot <- plotCounts(dds_interaction, gene="TRINITY_DN7282_c0_g1_i1", intgroup=c("condition","range"), returnData=TRUE)
#ggplot(count_plot, aes(x=condition, y=count, color=range)) +
#    geom_point(position=position_jitter(w=0.2,h=0)) + 
#    #geom_label(aes(label = site), position=position_jitter(w=0.4,h=0))  + 
#    scale_y_log10() 

In [None]:
#count_plot <- plotCounts(dds_interaction, gene="TRINITY_DN7282_c0_g1_i1", intgroup=c("condition","range", "site"), returnData=TRUE)
##count_plot
#ggplot(count_plot, aes(x=condition, y=count, group=range, color=site)) +
#    facet_grid(.~range) +
#    geom_point(position=position_jitter(w=0.2,h=0)) + 
#    #geom_label(aes(label = site), position=position_jitter(w=0.4,h=0))  + 
#    #geom_line()+
#    geom_smooth(method = "lm", se=F, aes(group=1)) +
#    stat_summary(fun.data=mean_sdl, fun.args = list(mult=1), geom="pointrange", color="red") +
#    scale_y_log10()#limits = c(1,1e6)) 

In [None]:
#dim(assays(ddsMF_noFP)[[1]])
#print(data.frame(colData(dds_interaction)))

In [None]:
#save.image()

Before we finish, I need to export a matrix of counts to use downstream in WGCNA. The creators [suggest](https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/faq.html) removing transcripts with consistently low counts to avoid spurious correlations, and they also recommend performing a variance stabilizing transformation. I will do both below and then export the result as a tab-separated table.

I am removing all transcripts that have normalized counts of less than 10 in over 90% of the samples (more than 72 of 81).

In [None]:
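low_count <- counts(dds_clean, normalized=TRUE) < 10 #TRUE where a transcript has a normalized count below 10
filterGenes <- rowSums(low_count) > 0.9 * ncol(dds_clean) #low in over 90% of samples (more than 72 of 81)
for_export <- dds_clean[!filterGenes,]
vsd <- vst(for_export, blind=TRUE)
write.table(assay(vsd), file = "../outputs/WGCNA_in.tsv", sep="\t")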
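
To see how the low-count filter behaves, here is the same rule on a toy matrix of normalized counts (values and names invented; ten samples instead of 81):

```r
toy_norm <- rbind(tx1 = c(0, 2, 5, 1, 50, 3, 0, 4, 1, 20),      # low in 8 of 10 samples
                  tx2 = c(30, 12, 45, 8, 60, 22, 15, 9, 40, 33), # low in 2 of 10 samples
                  tx3 = rep(0, 10))                              # low in every sample

min_count <- 10          # counts below this are considered low
max_low_fraction <- 0.9  # drop a transcript if it is low in more than this share of samples

n_low <- rowSums(toy_norm < min_count)
drop <- n_low > max_low_fraction * ncol(toy_norm)
toy_kept <- toy_norm[!drop, , drop = FALSE]
rownames(toy_kept)
```

Only the transcript that is low in every sample is dropped; a transcript that is low in most samples but clearly expressed in a few is retained.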
