-
Notifications
You must be signed in to change notification settings - Fork 3
/
SCADIE_vignette.Rmd
153 lines (106 loc) · 7.55 KB
/
SCADIE_vignette.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
---
title: "SCADIE_vignette"
output: rmarkdown::html_vignette
author: "Daiwei Tang"
date: "`r Sys.Date()`"
vignette: >
%\VignetteIndexEntry{SCADIE_vignette}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
library("NMF")
devtools::load_all(".")
```
In this document we walk through the process of preparing data and analysis with SCADIE.
## Preparing Data
Since SCADIE takes many data inputs for both groups, it is HIGHLY RECOMMENDED to store them in a list. Depending on the analysis goal, the data fields in the list may vary. In this section we use the Alzheimer’s disease (AD) dataset as an example, to walk through the data preparation process.
Due to the complexity of raw data preprocessing, here we begin with pre-processed datasets.
```{R}
data("example_data")
ls()
```
The *example_data* contains five objects:
* *geneExpr_raw* is the combined gene expression dataframe for both groups
* *ad_index* is the column indexes for AD samples in *geneExpr_raw*
* *ctrl_index* is the column indexes for control samples in *geneExpr_raw*
* *initial_H_1* is the IHC-profiled proportions for each sample in control group, see the manuscript for more details.
* *initial_H_2* is the IHC-profiled proportions for each sample in AD group, see the manuscript for more details.
We first create an empty list:
```{R}
ihc_bulk <- list()
```
Before adding bulk data $Y$ into the list, we need to remove the rows with constant value (e.g. 0), as these rows might create singularity during regression. The raw $Y$s are usually data frames, need to convert them to matrix when including in the list. It is also recommended to make sure the row and column names are correctly retained.
```{R}
## Remove const rows across all columns
geneExpr_raw_sub <- geneExpr_raw[-which(apply(geneExpr_raw,1,sd)==0),]
## add the bulk matrices to the list
ihc_bulk$bulk_full_1 <- as.matrix(geneExpr_raw_sub[,ctrl_index])
ihc_bulk$bulk_full_2 <- as.matrix(geneExpr_raw_sub[,ad_index])
## remove the group-specific const rows
rm_bulk_full1 <- which(apply(ihc_bulk$bulk_full_1 , 1, sd)==0)
rm_bulk_full2 <- which(apply(ihc_bulk$bulk_full_2 , 1, sd)==0)
ihc_bulk$bulk_full_1 <- ihc_bulk$bulk_full_1[-c(rm_bulk_full2),]
ihc_bulk$bulk_full_2 <- ihc_bulk$bulk_full_2[-c(rm_bulk_full2),]
```
Next, we include the initial $H$s for each group, again, make sure they are matrices:
```{R}
ihc_bulk$initial_H_1 <- as.matrix(initial_H_1)
ihc_bulk$initial_H_2 <- as.matrix(initial_H_2)
class(ihc_bulk$initial_H_1 )
class(ihc_bulk$initial_H_2 )
```
Sometimes due to numeric error or other reaons, the column sum of $H$s are not 1s, in this case, we need to re-clibrate them.
```{R}
ihc_bulk$initial_H_1 <- as.matrix(ihc_bulk$initial_H_1) %*% diag(1/apply(ihc_bulk$initial_H_1,2,sum))
ihc_bulk$initial_H_2 <- as.matrix(ihc_bulk$initial_H_2)%*%diag(1/apply(ihc_bulk$initial_H_2,2,sum))
```
Initial $W$s can be obtain through $Y^{T}=H^{T}W^{T}$ for each group (you might need *NMF* package for *fcnnls* function). Noted that since the $W$s here are output from *fcnnls*, they do not have appropriate row/column names, we need to add them back.
```{R}
## calculate initial Ws
ihc_bulk$initial_W_1 <- t((fcnnls( x = t(ihc_bulk$initial_H_1),y= t(ihc_bulk$bulk_full_1) ) )$x)
ihc_bulk$initial_W_2 <- t((fcnnls( x = t(ihc_bulk$initial_H_2),y= t(ihc_bulk$bulk_full_2) ))$x)
## add colnames and row names to initial Ws
rownames(ihc_bulk$initial_W_1 ) <- rownames(ihc_bulk$bulk_full_1)
rownames(ihc_bulk$initial_W_2 ) <- rownames(ihc_bulk$bulk_full_2)
colnames(ihc_bulk$initial_W_1 ) <- rownames(ihc_bulk$initial_H1)
colnames(ihc_bulk$initial_W_2 ) <- rownames(ihc_bulk$initial_H2)
```
Normally the signature matrix and their corresponding row indexes are not needed, except when you choose to update $H$ using only signature genes' rows (i.e., H_update_gene = "signature" in the main function). We can normally set the fields corresponding to signature genes as NA.
```{R}
ihc_bulk$sig_matrix <- NA
ihc_bulk$bulk_sub_1 <- NA
ihc_bulk$bulk_sub_2 <- NA
ihc_bulk$signature_gene_row_index <- NA
```
With all these done, we are ready to run the main SCADIE function.
## Run SCADIE for Point Estiamtes
The code below shows the setting for a default SCADIE run. Some common tweaks you can make are:
* Set *update_W_method* to "NNLS" for updating $W$ with NNLS
* Change *cutoff* to control the difference stop criterion.
* When *H_update_gene* is set to "all", we need to input valid *input_bulk_sub1*, *input_bulk_sub2*, and *signature_gene_row_index*, instead of NAs. The *input_bulk_sub1* is the submatrix of *input_bulk_full1* with only signature genes rows, same for *input_bulk_sub1*; the *signature_gene_row_index* is vector containing the signature genes's row positions in the full matrix. Due to the iterative nature of SCADIE, we will use updated siganture $\underline{W}$ for future $H$-update, so the actual signature matrix is not needed.
```{R,cache=T}
ihc_bulk_groundtruth_scad_output <- Iterate_W_H_full_general(n_ct=5,input_initial_H1 = ihc_bulk$initial_H_1,input_initial_H2 = ihc_bulk$initial_H_2,input_initial_W1 = ihc_bulk$initial_W_1,input_initial_W2 = ihc_bulk$initial_W_2,input_bulk_sub1 = ihc_bulk$bulk_sub_1,input_bulk_sub2 = ihc_bulk$bulk_sub_2,input_bulk_full1 = ihc_bulk$bulk_full_1,input_bulk_full2 = ihc_bulk$bulk_full_2,update_W_method = "SCAD",H_update_method = "NNLS",H_update_gene = "all",signature_gene_row_index = ihc_bulk$signature_gene_row_index,max_itr = 100,cutoff = 10^-6 )
```
The output list contains two entries, for group1 and group2, respectively. Each entry contains initial and end $W$s, end $H$s, as well as *weight* matrix $E$. You might also find a variable called *updated_weight*, this variable is useful when you update the weight matrix during iteration (set *weight_update_number* smaller than *max_itr*), it will record the lastest weight matrix *E*.
```{R}
## end W1
head(ihc_bulk_groundtruth_scad_output$output1$W_end)
## end H1
head(ihc_bulk_groundtruth_scad_output$output1$H_end)
```
## Run SCADIE for Standard Error Estimation
The code below shows the setting for running Jackknife estimate standard errors, due to the large number of variables, the *Estimate_sd_general* function ONLY TAKES THE COMPILED LIST as input!
Other parameteres are largely the same as the *Iterate_W_H_full_general*. Since the Jackknife requires multiple runs of SCADIE, you can specify a higher *cores* number to speed up the process through parallele computing.
*jk_subsample* controls the number of leave-one-out jackknife runs conducted, the default setting for *jk_subsample* is the smaller sample size between group1 and group2, in this case, as we have 31 samples in group1 and 18 in group2, *jk_subsample* equals 18. Sometimes you might need to specify a *jk_subsample* number smaller than sample size for very large cohort.
```{R,cache=T}
ihc_bulk_groundtruth_sd <- Estimate_sd_general(input_list=ihc_bulk, update_W_method="SCAD" ,method="jackknife",cores=5,bs_num=NA,H_update_method="NNLS",H_update_gene="all",signature_gene_row_index=ihc_bulk$signature_gene_row_index,max_itr = 100,cutoff = 10^-6,jk_subsample = 18 )
```
The output list contains four fields:
* *W1_vec* and *W2_vec* stores all the intermediate $W$ output throughout jacknife.
* *w_diff_sd_jackknife* is the entry-wise standard error matrix, when performing differential expression analysis, this is the matrix to use.
* *w_diff_sd_jackknife_raw* is the empirical standard error matrix from jackknife, *w_diff_sd_jackknife* was obtained from this matrix with adjustments mentioned in the original paper.