# Identify Prostate Cancer Biomarker using Machine Learning

*Please note: This notebook uses open access data*  


#### Fan Wang
#### July 5 2022

This notebook demonstrates the analysis of The Cancer Genome Atlas Prostate Adenocarcinoma (TCGA-PRAD) gene expression dataset to understand the steps in a differential expression analysis workflow in the context of DESeq2.

The goal of this analysis is to find differences in gene expression profiles between two sample populations: normal cells and tumor cells. The two groups are compared using the TCGA-PRAD gene expression dataset downloaded from the NCI's Genomic Data Commons (GDC). The dataset contains expression values (sequence counts) generated via the Illumina HiSeq platform.


## [DESeq2](https://bioconductor.org/packages/release/bioc/html/DESeq2.html) analysis workflow

Summary of the Differential Expression Analysis Workflow:

> - Install and import necessary packages and functions. Download data.
> - Preprocess the dataset by filtering out genes with zero values.
> - Split the expression dataset into a tumor samples dataframe and a normal samples dataframe.
> - Generate the metadata.
> - Call the `DESeqDataSetFromMatrix` function and the `DESeq` function to get results.
> - Identify genes whose expression levels change significantly between tumor and normal samples.
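
The filtering step in the list above is not spelled out later in the notebook. The following is a minimal sketch of one common reading of that step, dropping genes that have zero counts in every sample. The matrix `toy_counts` is a hypothetical stand-in for the real count data, not TCGA data.

```r
# Hypothetical toy count matrix: 4 genes x 3 samples (not TCGA data).
toy_counts <- matrix(
  c(10, 0, 5,
     0, 0, 0,
     3, 8, 0,
     0, 0, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("sample", 1:3))
)

# Keep genes with at least one non-zero count across all samples.
keep <- rowSums(toy_counts) > 0
filtered_counts <- toy_counts[keep, , drop = FALSE]

rownames(filtered_counts)  # gene1 and gene3 remain
```

The same `rowSums(...) > 0` mask could be applied to a real count matrix before it is passed on to downstream steps.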


---------------

## Load required packages

In [None]:
options(warn=-1)

In [None]:
library(TCGAbiolinks)
library(DESeq2)
library(SummarizedExperiment)
library(ggplot2)

## Load functions for identifying upregulated and downregulated genes

In [None]:
get_upregulated <- function(df) {
  key <- intersect(rownames(df)[which(df$log2FoldChange >= 1)],
                   rownames(df)[which(df$pvalue <= 0.05)])
  
  results <- as.data.frame((df)[which(rownames(df) %in% key), ])
  return(results)
}

In [None]:
get_downregulated <- function(df) {
  key <- intersect(rownames(df)[which(df$log2FoldChange <= -1)],
                   rownames(df)[which(df$pvalue <= 0.05)])
  
  results <- as.data.frame((df)[which(rownames(df) %in% key), ])
  return(results)
}
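
The two helpers above filter on the unadjusted `pvalue` column. As an illustration only, the sketch below applies the same thresholds to a small made-up results table and lets the caller choose which p-value column to use. The helper `filter_de`, its arguments, and the table `toy_res` are assumptions introduced for this example and are not part of the original notebook.

```r
# Hypothetical helper: select genes by fold-change direction and a p-value column.
filter_de <- function(df, direction = c("up", "down"),
                      lfc = 1, alpha = 0.05, p_col = "pvalue") {
  direction <- match.arg(direction)
  lfc_ok <- if (direction == "up") {
    df$log2FoldChange >= lfc
  } else {
    df$log2FoldChange <= -lfc
  }
  p_ok <- df[[p_col]] <= alpha
  # which() drops NA entries; DESeq2 results can contain NA p-values.
  as.data.frame(df[which(lfc_ok & p_ok), , drop = FALSE])
}

# Made-up results table with the column names used by DESeq2 results.
toy_res <- data.frame(
  log2FoldChange = c(2.5, -1.8, 0.3, 1.2),
  pvalue         = c(0.001, 0.020, 0.500, 0.040),
  padj           = c(0.004, 0.040, 0.500, 0.080),
  row.names      = c("geneA", "geneB", "geneC", "geneD")
)

up_raw   <- filter_de(toy_res, "up")                  # geneA and geneD
up_adj   <- filter_de(toy_res, "up", p_col = "padj")  # geneA only
down_raw <- filter_de(toy_res, "down")                # geneB
```

In this toy table, switching from `pvalue` to `padj` removes `geneD`, which shows how the choice of column can change the resulting gene list.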

## Download data from GDC

In [None]:
query <- GDCquery(project = "TCGA-PRAD", 
                  data.category = "Gene expression",
                  data.type = "Gene expression quantification",
                  experimental.strategy = "RNA-Seq",
                  platform = "Illumina HiSeq",
                  file.type = "results",
                  legacy = TRUE)

In [None]:
dir.create("./Data")
GDCdownload(
  query,
  method = "api",
  files.per.chunk = 100,
  directory = "./Data"
)

In [None]:
mrna_df <- GDCprepare(query, directory = "./Data")

## Preprocess data

In [None]:
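### Remove columns we don't need; keep counts
# Sample-level metadata: sample ID and sample type (definition)
mrna_meta <- mrna_df$sample
mrna_meta <- cbind(mrna_meta, mrna_df$definition)
# Extract the count matrix from the SummarizedExperiment object
mrna_df <- assay(mrna_df)
# Keep the dash-separated fields (n + 1) to i of each string in x
delim_fn <- function(x, n, i) {
  do.call(c, lapply(x, function(X)
    paste(unlist(strsplit(
      X, "-"
    ))[(n + 1):(i)], collapse = "-")))
}
# Shorten each barcode in the column names to its first four fields
colnames(mrna_df) <- delim_fn(x = colnames(mrna_df), n = 0, i = 4)
mrna_meta <- as.data.frame(mrna_meta)
mrna_df <- as.data.frame(mrna_df)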
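
To illustrate what the barcode shortening in the cell above does, here is a small sketch on two made-up barcodes in TCGA format. The function `trim_barcode` is a hypothetical equivalent of `delim_fn(x, n = 0, i = 4)` written for this example, and the barcodes are not real sample identifiers.

```r
# Hypothetical barcodes in TCGA format (made up for illustration).
barcodes <- c("TCGA-AB-1234-01A-11R-A123-07",
              "TCGA-CD-5678-11A-21R-B456-07")

# Keep the dash-separated fields first..last and re-join them,
# mirroring delim_fn(x, n = first - 1, i = last).
trim_barcode <- function(x, first = 1, last = 4) {
  vapply(strsplit(x, "-", fixed = TRUE),
         function(parts) paste(parts[first:last], collapse = "-"),
         character(1),
         USE.NAMES = FALSE)
}

sample_ids <- trim_barcode(barcodes)
sample_ids
# "TCGA-AB-1234-01A" "TCGA-CD-5678-11A"
```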

In [None]:
## Remove metastatic samples
# Rows of mrna_meta are in the same order as the columns of mrna_df,
# so a single logical mask can filter both objects consistently.
is_metastatic <- mrna_meta[, 2] %in% "Metastatic"
mrna_meta <- mrna_meta[!is_metastatic, ]
mrna_df <- mrna_df[, !is_metastatic]
mrna_meta[, 2] <- as.character(mrna_meta[, 2])
mrna_meta[, 2] <-
  gsub("Primary solid Tumor", "Tumor", mrna_meta[, 2])
mrna_meta[, 2] <-
  gsub("Solid Tissue Normal", "Normal", mrna_meta[, 2])
mrna_meta[, 2] <- as.factor(mrna_meta[, 2])
levels(mrna_meta[, 2])
colnames(mrna_meta) <- c("cases", "Condition")

## Execute differential expression analysis using [DESeq2](https://github.com/mikelove/DESeq2)

In [None]:
mrna_dds <-
  DESeqDataSetFromMatrix(round(mrna_df),
                         colData = mrna_meta,
                         design = ~ Condition)

#### Everything from normalization to linear modeling is carried out by a single function, `DESeq`. This function prints a message for each step it performs:
```
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
```
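
As background for the first message, "estimating size factors", the following is a minimal sketch of the median-of-ratios idea that DESeq2 uses by default for normalization. It is written in base R on a hypothetical toy matrix and is meant as an illustration, not as a replacement for `DESeq`.

```r
# Hypothetical toy counts: 3 genes x 2 samples; sample2 has twice the depth.
toy_counts <- matrix(c( 10,  20,
                       100, 200,
                        50, 100),
                     nrow = 3, byrow = TRUE,
                     dimnames = list(paste0("gene", 1:3),
                                     c("sample1", "sample2")))

# Per-gene geometric mean across samples, computed on the log scale.
log_geo_mean <- rowMeans(log(toy_counts))

# Size factor per sample: median over genes of count / geometric mean.
# Genes with a zero count are excluded from the median.
size_factors <- apply(toy_counts, 2, function(col) {
  usable <- is.finite(log_geo_mean) & col > 0
  exp(median(log(col[usable]) - log_geo_mean[usable]))
})

# Dividing each column by its size factor puts samples on a common scale.
normalized <- sweep(toy_counts, 2, size_factors, "/")
```

In this toy case the two size factors differ by a factor of two, and the normalized counts of both samples become identical, which is what one would expect when the only difference between samples is sequencing depth.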

In [None]:
mrna_dds$Condition <- relevel(mrna_dds$Condition, ref = "Normal")
mrna_dds <- DESeq(mrna_dds)
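
The `relevel` call above matters because DESeq2 treats the first factor level as the reference group, so fold changes are reported as Tumor relative to Normal. A small base R sketch on a hypothetical factor, assuming the levels were initially in the opposite order:

```r
# Hypothetical condition labels with "Tumor" deliberately listed first.
condition <- factor(c("Tumor", "Normal", "Tumor", "Normal"),
                    levels = c("Tumor", "Normal"))
levels(condition)   # "Tumor" "Normal": Tumor would be the reference

# Make "Normal" the reference (first) level.
releveled <- relevel(condition, ref = "Normal")
levels(releveled)   # "Normal" "Tumor"
```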

In [None]:
vsd <- varianceStabilizingTransformation(mrna_dds, blind = FALSE)
head(assay(vsd))
hist(assay(vsd))

In [None]:
resultsNames(mrna_dds)

## Dispersion Plot

Dispersion is a measure of spread or variability in the data. Variance, standard deviation, IQR, among other measures, can all be used to measure dispersion. However, DESeq2 uses a specific measure of dispersion (α) related to the mean (μ) and variance of the data:   

> Var = μ + α * μ^2 

For genes with moderate to high count values, the square root of the dispersion is approximately equal to the coefficient of variation (√Var / μ). So a dispersion of 0.01 means about 10% variation around the mean expected across biological replicates.
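
To make this relationship concrete, the sketch below evaluates Var = μ + α * μ^2 for hypothetical values of μ and α and compares the resulting coefficient of variation with the square root of the dispersion.

```r
# Hypothetical dispersion shared by both example genes.
alpha <- 0.01

# Highly expressed gene: the alpha * mu^2 term dominates the variance.
mu_high  <- 1000
var_high <- mu_high + alpha * mu_high^2   # 11000
cv_high  <- sqrt(var_high) / mu_high      # about 0.105, close to sqrt(alpha)

# Lowly expressed gene: the Poisson-like term mu dominates instead.
mu_low  <- 5
var_low <- mu_low + alpha * mu_low^2      # 5.25
cv_low  <- sqrt(var_low) / mu_low         # about 0.458, far from sqrt(alpha)

sqrt(alpha)                               # 0.1
```

The approximation holds for the highly expressed gene but not for the lowly expressed one, which is why the statement is restricted to moderate to high counts.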

In [None]:
# Plot Dispersions:
plotDispEsts(mrna_dds, main = "Dispersion plot")

##### The red line in the dispersion plot is the fitted curve, which gives the estimate for the ***expected dispersion value for genes of a given expression strength***.

##### Each black dot is a gene with an associated mean expression level and a maximum likelihood estimate (MLE) of its dispersion.

## MA Plot

An MA plot is a scatter plot of log2 fold changes (M) on the y-axis versus the mean of normalized expression counts on the x-axis.

Generally, genes with lower mean expression values have more variable log fold changes. Genes with similar expression values in normal and tumor samples cluster around the `M=0` line, i.e. genes with no substantial expression difference between the groups. Points away from the `M=0` line indicate genes with larger expression differences: a gene is upregulated if its point lies above the `M=0` line and downregulated if it lies below.
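
As a rough illustration of the two axes, the sketch below computes the mean expression and a log2 fold change by hand for a hypothetical matrix of normalized counts. The pseudocount of 1 and the simple ratio of group means are simplifying assumptions for this example; DESeq2 estimates the log2 fold change from its fitted model instead.

```r
# Hypothetical normalized counts: 3 genes, 2 normal and 2 tumor samples.
toy_norm <- matrix(c(100, 120, 400, 440,
                      50,  60,  52,  58,
                       8,  12,   2,   3),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(c("geneUp", "geneFlat", "geneDown"),
                                   c("N1", "N2", "T1", "T2")))
group <- c("Normal", "Normal", "Tumor", "Tumor")

mean_normal <- rowMeans(toy_norm[, group == "Normal"])
mean_tumor  <- rowMeans(toy_norm[, group == "Tumor"])

# x-axis: mean of normalized counts across all samples.
mean_expr <- rowMeans(toy_norm)
# y-axis: log2 fold change, tumor over normal, with a pseudocount of 1.
M <- log2((mean_tumor + 1) / (mean_normal + 1))

data.frame(mean_expr = mean_expr, M = M)
```

Here `geneUp` lands above the `M=0` line, `geneDown` below it, and `geneFlat` on it.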

In [None]:
mrna_res <- results(mrna_dds, name = "Condition_Tumor_vs_Normal")
plotMA(mrna_res)

### Statistics summary 

In [None]:
mrna_res_df <- as.data.frame(mrna_res)
mrnaTable <- mrna_res_df
mrnaTable$Gene_id <- rownames(mrnaTable)
summary(mrna_res)

### Volcano Plot

An MA plot does not show statistical measures (p-values or adjusted p-values) on its axes, so the statistical significance of differences between normal and tumor samples cannot be read from those axes. A volcano plot, which plots the log2 fold change against -log10 of the adjusted p-value, is therefore used to highlight genes with statistically significant differences.
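
Since the volcano plot in this notebook uses adjusted p-values, a brief sketch of how adjustment changes a significance call may help. It applies the Benjamini-Hochberg method, which is the default adjustment in DESeq2, to a few made-up p-values with base R's `p.adjust`. DESeq2 additionally applies independent filtering, so its `padj` values can differ from a plain adjustment like this one.

```r
# Hypothetical raw p-values for five genes.
pvals <- c(0.001, 0.01, 0.04, 0.045, 0.5)

# Benjamini-Hochberg adjustment controls the false discovery rate.
padj <- p.adjust(pvals, method = "BH")
padj
# 0.00500 0.02500 0.05625 0.05625 0.50000

n_raw <- sum(pvals < 0.05)  # 4 genes pass on raw p-values
n_adj <- sum(padj < 0.05)   # 2 genes pass after adjustment

# Volcano plot y-axis values: larger means more significant.
neg_log10_padj <- -log10(padj)
```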

In [None]:
mrna_upreg <- get_upregulated(mrna_res)
mrna_downreg <- get_downregulated(mrna_res)
mrna_counts <- counts(mrna_dds, normalized = TRUE)
mrna_upreg$Gene_id <- rownames(mrna_upreg)
mrna_downreg$Gene_id <- rownames(mrna_downreg)
mrna_res_df$Gene_id <- rownames(mrna_res_df)

In [None]:
par(
  mar = c(5, 5, 5, 5),
  cex = 1.0,
  cex.main = 1.4,
  cex.axis = 1.4,
  cex.lab = 1.4
)
with(
  mrna_res_df,
  plot(
    log2FoldChange,
    -log10(padj),
    pch = 20,
    main = "Volcano plot",
    cex = 1.0,
    xlab = bquote( ~ Log[2] ~ fold ~ change),
    ylab = bquote( ~ -log[10] ~ adjusted ~ P ~ value)
  )
)

with(
  subset(mrna_res_df, padj < 0.05 &
           abs(log2FoldChange) > 2),
  points(
    log2FoldChange,
    -log10(padj),
    pch = 20,
    col = "red",
    cex = 0.5
  )
)

# Add reference lines at |log2 fold change| = 2 and at the adjusted P-value (FDR) cut-off of 0.05
abline(v = 0,
       col = "black",
       lty = 3,
       lwd = 1.0)
abline(v = -2,
       col = "black",
       lty = 4,
       lwd = 2.0)
abline(v = 2,
       col = "black",
       lty = 4,
       lwd = 2.0)
abline(
  h = -log10(0.05),  # the y-axis shows -log10(padj), so draw the cut-off on the padj scale
  col = "black",
  lty = 4,
  lwd = 2.0
)

## Conclusion

The results above point to a set of differentially expressed genes: ANGPT1, CHRM2, HSPA6, KIRREL3, C2orf88, SMR3A, CIDEC, ZNF185, FAM167A, APOBEC3C, EPHA10, HOXC4, NETO2, GTSE1, NETO1, KISS1R, TEKT1, ACTL8, and ROPN1L. Of these, ANGPT1, APOBEC3C, ZNF185, EPHA10, and HOXC4 have potential as diagnostic biomarkers for the early detection of prostate cancer. These genes need further study to determine whether combining them with other genes can boost their selectivity and specificity. Because prostate cancer can be fatal, early detection and accurate prognosis are essential, and that is what this analysis aims to support.

### Reference 

In [None]:
citation("DESeq2")