-
Notifications
You must be signed in to change notification settings - Fork 3
/
human-ramnath-fibrosis.Rmd
189 lines (150 loc) · 5.4 KB
/
human-ramnath-fibrosis.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
---
title: "Fibrosis patient cohort by Ramnath et al."
output:
workflowr::wflow_html:
code_folding: hide
editor_options:
chunk_output_type: console
---
```{r chunk-setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
autodep = TRUE,
cache = TRUE
)
```
```{r wall-time-start, cache=FALSE, include=FALSE}
# Track time spent on performing this analysis
start_time <- Sys.time()
```
## Introduction
Here we analysis a patient cohort covering patients suffering from HCV and NAFLD generated by [Ramnath et al.](https://doi.org/10.1172/jci.insight.120274).
## Libraries and sources
These libraries and sources are used for this analysis.
```{r libs-and-src, message=FALSE, warning=FALSE, cache=FALSE}
library(tidyverse)
library(tidylog)
library(here)
library(edgeR)
library(biobroom)
library(AachenColorPalette)
library(cowplot)
library(lemon)
library(patchwork)
options("tidylog.display" = list(print))
source(here("code/utils-rnaseq.R"))
source(here("code/utils-utils.R"))
source(here("code/utils-plots.R"))
```
Definition of global variables that are used throughout this analysis.
```{r analysis-specific-params, cache=FALSE}
# i/o
data_path <- "data/human-ramnath-fibrosis"
output_path <- "output/human-ramnath-fibrosis"
# graphical parameters
# fontsize
fz <- 9
```
## Preliminary exploratory analysis
### Library size
Barplot of the library size (total counts) for each of the samples.
```{r lib-size}
count_matrix <- readRDS(here(data_path, "count_matrix.rds"))
plot_libsize(count_matrix) +
my_theme(fsize = fz)
```
### Count distribution
Violin plots of the raw read counts for each of the samples.
```{r "count-distribution"}
count_matrix <- readRDS(here(data_path, "count_matrix.rds"))
meta <- readRDS(here(data_path, "meta_data.rds"))
count_matrix %>%
tdy("gene", "sample", "count", meta) %>%
arrange(disease) %>%
ggplot(aes(
x = fct_reorder(sample, as.numeric(disease)), y = log10(count + 1),
group = sample, fill = disease
)) +
geom_violin() +
theme(
axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1),
legend.position = "top"
) +
labs(x = NULL) +
my_theme(grid = "no", fsize = fz)
```
### PCA of raw data
PCA plot of raw read counts contextualized based on etiology stages. Before gene with a constant expression across all samples are removed and count values are transformed to log2 scale. Only the top 1000 most variable genes are used as features.
```{r pca-raw-data}
count_matrix <- readRDS(here(data_path, "count_matrix.rds"))
meta <- readRDS(here(data_path, "meta_data.rds"))
stopifnot(colnames(count_matrix) == meta$sample)
# remove constant expressed genes and transform to log2 scale
preprocessed_count_matrix <- preprocess_count_matrix(count_matrix)
pca_result <- do_pca(preprocessed_count_matrix, meta, top_n_var_genes = 1000)
plot_pca(pca_result, feature = "disease") +
plot_pca(pca_result, feature = "stage") &
my_theme(fsize = fz)
```
## Data processing
### Normalization
Raw read counts are normalized by first filtering out lowly expressed genes, TMM normalization and finally logCPM transformation.
```{r normalization}
count_matrix <- readRDS(here(data_path, "count_matrix.rds"))
meta <- readRDS(here(data_path, "meta_data.rds"))
stopifnot(meta$sample == colnames(count_matrix))
dge_obj <- DGEList(count_matrix, group = meta$group)
# filter low read counts, TMM normalization and logCPM transformation
norm <- voom_normalization(dge_obj)
saveRDS(norm, here(output_path, "normalized_expression.rds"))
```
### PCA of normalized data
PCA plot of normalized expression data contextualized based on etiology stages. Only the top 1000 most variable genes are used as features.
```{r pca-norm-data}
expr <- readRDS(here(output_path, "normalized_expression.rds"))
meta <- readRDS(here(data_path, "meta_data.rds"))
pca_result <- do_pca(expr, meta, top_n_var_genes = 1000)
saveRDS(pca_result, here(output_path, "pca_result.rds"))
plot_pca(pca_result, feature = "disease") +
plot_pca(pca_result, feature = "stage") &
my_theme(fsize = fz)
```
## Differential gene expression analysis
### Running limma
Differential gene expression analysis via limma with the aim to identify the transcriptomic signatures of HCV and NAFLD.
```{r running-limma}
# load expression and meta data
expr <- readRDS(here(output_path, "normalized_expression.rds"))
meta <- readRDS(here(data_path, "meta_data.rds"))
stopifnot(colnames(expr) == meta$sample)
# build design matrix
design <- model.matrix(~ 0 + group, data = meta)
rownames(design) <- meta$sample
colnames(design) <- levels(meta$group)
# define contrasts
contrasts <- makeContrasts(
hcv_adv_vs_early = hcv_advanced - hcv_early,
nafld_adv_vs_early = nafld_advanced - nafld_early,
levels = design
)
limma_result <- run_limma(expr, design, contrasts) %>%
assign_deg()
deg_df <- limma_result %>%
mutate(contrast = factor(contrast)) %>%
mutate(contrast_reference = contrast)
saveRDS(deg_df, here(output_path, "limma_result.rds"))
```
### Volcano plots
Volcano plots visualizing the transcriptomic signatures of HCV and NAFLD.
```{r volcano-plots}
df <- readRDS(here(output_path, "limma_result.rds"))
df %>%
plot_volcano() +
my_theme(grid = "y", fsize = fz)
```
```{r wall-time-end, cache=FALSE, include=FALSE}
duration <- abs(as.numeric(difftime(Sys.time(), start_time, units = "secs")))
t = print(sprintf("%02d:%02d", duration %% 3600 %/% 60, duration %% 60 %/% 1))
```
Time spend to execute this analysis: `r t` minutes.