analysis/human-hampe13-nash.Rmd

---
title: "NASH patient cohort by Hampe et al. 2013"
author: "Christian H. Holland"
date: "2020-12-20"
output: 
  workflowr::wflow_html:
    code_folding: hide
editor_options:
  chunk_output_type: console
---

```{r chunk-setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  autodep = TRUE,
  cache = TRUE
)
```

## Introduction
Here we analyze a cohort of patients with mild and advanced NAFLD, generated by [Hampe et al. 2013](https://doi.org/10.1016/j.cmet.2013.07.004).

## Libraries and sources
These libraries and sources are used for this analysis.
```{r libs-and-src, message=FALSE, warning=FALSE, cache=FALSE}
library(hugene11sttranscriptcluster.db)

library(tidyverse)
library(tidylog)
library(here)

library(oligo)
library(annotate)
library(GEOquery)
library(limma)
library(biobroom)

library(AachenColorPalette)
library(cowplot)
library(lemon)

options("tidylog.display" = list(print))
source(here("code/utils-microarray.R"))
source(here("code/utils-utils.R"))
source(here("code/utils-plots.R"))
```

Definition of global variables that are used throughout this analysis.
```{r analysis-specific-params, cache=FALSE}
# i/o
data_path <- "data/human-hampe13-nash"
output_path <- "output/human-hampe13-nash"

# graphical parameters
# fontsize
fz <- 9
```

## Data processing
### Load .CEL files and quality control
Array quality is assessed using relative log expression (RLE) values and normalized unscaled standard errors (NUSE).
```{r load-cel-files}
# load cel files and check quality
platforms <- readRDS(here("data/annotation/platforms.rds"))
raw_eset <- list.celfiles(here(data_path), listGzipped = TRUE, full.names = TRUE) %>%
  read.celfiles() %>%
  ma_qc()
```
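The internals of `ma_qc()` live in `code/utils-microarray.R` and are not shown here. As a rough illustration of the idea behind RLE only, the sketch below uses a simulated log-expression matrix (all values and names are made up): each probe's median across arrays is subtracted, and arrays whose RLE distribution is shifted away from zero or unusually wide are candidates for exclusion. With real data, `oligo` typically derives RLE and NUSE from a probe-level model (`fitProbeLevelModel()`), which this toy example does not reproduce.

```r
# Toy log2 expression matrix (probes x arrays); values are simulated
set.seed(1)
log_expr <- matrix(
  rnorm(100 * 6, mean = 8),
  nrow = 100,
  dimnames = list(paste0("probe", 1:100), paste0("array", 1:6))
)
# Simulate one problematic array with a global intensity shift
log_expr[, "array6"] <- log_expr[, "array6"] + 1.5

# RLE: deviation of each value from the per-probe median across arrays
probe_median <- apply(log_expr, 1, median)
rle <- sweep(log_expr, 1, probe_median, FUN = "-")

# Per-array summaries: a good array has median near 0 and a small IQR
rle_median <- apply(rle, 2, median)
rle_iqr <- apply(rle, 2, IQR)
round(rbind(median = rle_median, IQR = rle_iqr), 2)
```

In this simulation the shifted array stands out through its RLE median; NUSE follows the same logic but uses the standard errors of the probe-level model fit instead of the expression estimates.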

### Normalization and probe annotation
Probe intensities are normalized with the `rma()` function. Probes are annotated with HGNC symbols.
```{r normalization-and-annotation}
eset <- rma(raw_eset)

# annotate microarray probes with hgnc symbols
expr <- ma_annotate(eset, platforms)
# overwrite column names GSMxxx_xxx-xx.CEL.gz -> GSMxxx
colnames(expr) <- str_extract(colnames(expr), "GSM[0-9]*")

# save normalized expression
saveRDS(expr, here(output_path, "normalized_expression.rds"))
```
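The column renaming step can be tried in isolation. The following is a minimal base R sketch with made-up file names (the real CEL file names are not shown in this document); it uses `GSM[0-9]+` rather than `GSM[0-9]*` so that at least one digit is required.

```r
# Hypothetical CEL file names following the GSMxxx_xxx-xx.CEL.gz scheme
cel_names <- c("GSM1179927_A01-01.CEL.gz", "GSM1179928_B02-02.CEL.gz")

# Keep only the GEO sample accession (GSM followed by digits)
sample_ids <- regmatches(cel_names, regexpr("GSM[0-9]+", cel_names))
sample_ids
```

Note that `regmatches()` silently drops names without a match, whereas `str_extract()` returns `NA` for them, so the `stringr` version used above keeps the vector length stable.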

### Build meta data
Metadata is downloaded from GEO under the accession ID [GSE48452](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE48452).
```{r build-meta-data}
# extract metadata from GEO
df <- getGEO("GSE48452")
meta <- df$GSE48452_series_matrix.txt.gz %>%
  pData() %>%
  rownames_to_column("sample") %>%
  as_tibble() %>%
  select(
    sample,
    group = "group:ch1",
    gender = "Sex:ch1",
    inflammation = "inflammation:ch1",
    fibrosis = "fibrosis:ch1",
    age = "age:ch1",
    bmi = "bmi:ch1",
    nas = "nas:ch1"
  ) %>%
  mutate(
    group = str_to_lower(group),
    group = str_remove(group, "healthy "),
    group = factor(group, levels = c("control", "obese", "steatosis", "nash"))
  ) %>%
  mutate(
    fibrosis = fct_explicit_na(fibrosis),
    gender = as_factor(gender),
    inflammation = fct_explicit_na(inflammation),
    nas = fct_explicit_na(nas),
    age = as.numeric(age),
    bmi = as.numeric(bmi)
  )

# save meta data
saveRDS(meta, here(output_path, "meta_data.rds"))
```
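To make the recoding above easier to follow, here is a small base R sketch on illustrative values (the raw labels are inferred from the recoding steps, not copied from GEO). It mirrors the two ideas used in the chunk: harmonizing the group labels into an ordered set of levels, and turning missing values into an explicit `"(Missing)"` level as `fct_explicit_na()` does by default.

```r
# Illustrative raw labels; assumed, not taken from the GEO record
group_raw <- c("Control", "Healthy obese", "Steatosis", "Nash", "Control")

# Lower-case, drop the "healthy " prefix, and fix the level order
group <- tolower(group_raw)
group <- sub("healthy ", "", group, fixed = TRUE)
group <- factor(group, levels = c("control", "obese", "steatosis", "nash"))

# Make missing values an explicit factor level
fibrosis_raw <- c("0", NA, "1", "3", NA)
fibrosis <- factor(ifelse(is.na(fibrosis_raw), "(Missing)", fibrosis_raw))

table(group)
table(fibrosis)
```

Fixing the level order matters later, because the design matrix columns follow the order of `levels(meta$group)`.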

## Exploratory analysis
### PCA of normalized data
PCA plot of the normalized expression data, with samples labeled by etiology (`group`). Only the 1000 most variable genes are used as features.
```{r pca-norm-data}
expr <- readRDS(here(output_path, "normalized_expression.rds"))
meta <- readRDS(here(output_path, "meta_data.rds"))

pca_result <- do_pca(expr, meta, top_n_var_genes = 1000)

saveRDS(pca_result, here(output_path, "pca_result.rds"))

plot_pca(pca_result, feature = "group") +
  my_theme()
```
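`do_pca()` is a project helper whose implementation is not shown here. A plausible reading of "top variable genes as features" is sketched below on simulated data: rank genes by their variance across samples, keep the top `n`, and run a PCA on the samples. The centering and scaling choices are assumptions and may differ from the helper.

```r
# Simulated expression matrix (genes x samples)
set.seed(42)
expr_toy <- matrix(
  rnorm(200 * 8),
  nrow = 200,
  dimnames = list(paste0("gene", 1:200), paste0("sample", 1:8))
)

# Select the most variable genes
top_n <- 50
gene_var <- apply(expr_toy, 1, var)
top_genes <- names(sort(gene_var, decreasing = TRUE))[seq_len(top_n)]

# PCA with samples as observations and genes as features
pca <- prcomp(t(expr_toy[top_genes, ]), center = TRUE, scale. = FALSE)

# Fraction of variance explained by each principal component
explained <- pca$sdev^2 / sum(pca$sdev^2)
round(explained[1:2], 3)
```

The sample coordinates are in `pca$x`; joining them with the metadata by sample ID gives the data needed for a plot colored by `group`.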

## Differential gene expression analysis
### Running limma
Differential gene expression analysis with limma, aiming to identify the expression signature of each etiology relative to the control group.
```{r running-limma}
# load expression and meta data
expr <- readRDS(here(output_path, "normalized_expression.rds"))
meta <- readRDS(here(output_path, "meta_data.rds"))

# samples must appear in the same order in both objects
stopifnot(identical(colnames(expr), meta$sample))

# build design matrix
design <- model.matrix(~ 0 + group, data = meta)
rownames(design) <- meta$sample
colnames(design) <- levels(meta$group)


# define contrasts
contrasts <- makeContrasts(
  obese_vs_ctrl = obese - control,
  steatosis_vs_ctrl = steatosis - control,
  nash_vs_ctrl = nash - control,
  levels = design
)

limma_result <- run_limma(expr, design, contrasts) %>%
  assign_deg()

deg_df <- limma_result %>%
  mutate(
    contrast = fct_inorder(contrast),
    contrast_reference = "control"
  )

saveRDS(deg_df, here(output_path, "limma_result.rds"))
```
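To illustrate what the design and contrasts encode, the sketch below fits a single simulated gene with base R only. With a `~ 0 + group` design every coefficient is a group mean, so each contrast is a difference between a disease group mean and the control mean. This is not what `run_limma()` does internally (that helper is not shown, and limma additionally moderates the variances across genes); the data are made up.

```r
set.seed(1)
lv <- c("control", "obese", "steatosis", "nash")
group <- factor(rep(lv, each = 3), levels = lv)

# Means model: one column per group, no intercept
design <- model.matrix(~ 0 + group)
colnames(design) <- levels(group)

# Simulated log2 expression of one gene across the 12 samples
y <- rnorm(12, mean = rep(c(5, 5.5, 6, 7), each = 3))

# Least-squares fit; the coefficients equal the group means
coefs <- lm.fit(design, y)$coefficients

# Contrast matrix written out by hand (rows = groups, columns = contrasts)
contrast_matrix <- cbind(
  obese_vs_ctrl     = c(-1, 1, 0, 0),
  steatosis_vs_ctrl = c(-1, 0, 1, 0),
  nash_vs_ctrl      = c(-1, 0, 0, 1)
)
rownames(contrast_matrix) <- lv

# Log fold change per contrast = difference of group means
log_fc <- drop(t(contrast_matrix) %*% coefs)
round(log_fc, 2)
```

`makeContrasts()` builds the same kind of matrix from the symbolic expressions, which is why the column names of `design` must match the group names used in the contrasts.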

### Volcano plots
Volcano plots visualizing the signature of different etiologies.
```{r volcano-plots}
df <- readRDS(here(output_path, "limma_result.rds"))

df %>%
  plot_volcano() +
  my_theme(grid = "y", fsize = fz)
```
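A volcano plot places the log fold change on the x-axis and the negative log10 p-value on the y-axis, usually highlighting genes that pass an effect size and a significance cutoff. The sketch below shows this classification on simulated values; the cutoffs (`fc_cutoff`, `fdr_cutoff`) and column names are hypothetical and are not necessarily those used by `assign_deg()` or `plot_volcano()`.

```r
set.seed(7)
toy <- data.frame(
  gene = paste0("gene", 1:500),
  logFC = rnorm(500),
  pval = runif(500)
)
# Plant 10 clearly up- and 10 clearly down-regulated genes
toy$logFC[1:20] <- rep(c(2, -2), each = 10)
toy$pval[1:20] <- 1e-6

# Multiple testing correction (Benjamini-Hochberg)
toy$fdr <- p.adjust(toy$pval, method = "BH")

# Hypothetical cutoffs for calling a gene differentially expressed
fc_cutoff <- 1
fdr_cutoff <- 0.05
toy$regulation <- ifelse(
  toy$fdr <= fdr_cutoff & toy$logFC >= fc_cutoff, "up",
  ifelse(toy$fdr <= fdr_cutoff & toy$logFC <= -fc_cutoff, "down", "ns")
)

# y-axis of the volcano plot
toy$neg_log10_p <- -log10(toy$pval)
table(toy$regulation)
```

Plotting `neg_log10_p` against `logFC`, colored by `regulation`, gives the familiar volcano shape for a single contrast.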