-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Understanding the BWS results #5
Comments
(best bug report ever, Towards your first question, the test statistic depends only on the order of the data, and is invariant with respect to any monotone transform. (c.f. the help page for |
That makes sense. In my case, it seems like I set up the set.seed(10)
gene_num <- 100
test <- data.frame(sample1=c(rnorm(gene_num/2, mean=50, sd=10), rnorm(gene_num/2, mean=20, sd=10)),
sample2=c(rnorm(gene_num/2, mean=50, sd=10), rnorm(gene_num/2, mean=20, sd=10)),
sample3=c(rnorm(gene_num/2, mean=50, sd=10), rnorm(gene_num/2, mean=20, sd=10)),
sample_a4=c(rnorm(gene_num/2, mean=50, sd=10), rnorm(gene_num/2, mean=20, sd=10)),
reference_b1=c(rnorm(gene_num/2, mean=20, sd=10), rnorm(gene_num/2, mean=40, sd=10)),
reference_b2=c(rnorm(gene_num/2, mean=20, sd=10), rnorm(gene_num/2, mean=40, sd=10)),
reference_b3=c(rnorm(gene_num/2, mean=20, sd=10), rnorm(gene_num/2, mean=40, sd=10)))
genes_row <- character()
for(i in 1:gene_num){
add_gene <- paste(sample(LETTERS,4), collapse = "")
genes_row <- c(genes_row, add_gene)
}
rownames(test) <- genes_row
##############################################################################
############### Use bws package for ranking ##################################
##############################################################################
library(BWStest)
bws_ranked <- numeric()
for(i in rownames(test)){
samp <- as.numeric(test[i,1:4])
ref <- as.numeric(test[i,5:7])
bws_ranked[[i]] <- bws_stat(samp, ref)
}
##############################################################################
bws_ranked[1:5] # Show first few results to see which ones are tied.
#> KFYP SCRH BIYU SCBN KDIA
#> 3.233424 3.233424 1.371321 3.233424 3.233424
test["KFYP",] # example ties
#> sample1 sample2 sample3 sample_a4 reference_b1 reference_b2
#> KFYP 50.18746 42.38196 62.15514 65.02545 19.20543 28.69475
#> reference_b3
#> KFYP 10.9803
test["SCRH",] # example ties
#> sample1 sample2 sample3 sample_a4 reference_b1 reference_b2
#> SCRH 48.15747 54.19375 53.30876 55.90409 31.81752 13.1999
#> reference_b3
#> SCRH 19.4525 Created on 2018-11-19 by the reprex package (v0.2.1) In this example, genes My goal is a bit different though. I'd like to compare two groups and rank the genes ("high" expression in sample vs reference = high ranking). Would it be possible to use Best, |
Found some relevant information about the paper I referenced in their github page. Unfortunately for me, it is in
|
The following code computes the "7 choose 4" possible outcomes the BWS test statistic can take for the case of sample sizes 4 and 3: test <- setNames(data.frame(samples=t(rbind(combn(1:7,4),8 - combn(1:7,3)))),
c(paste0('sample',1:3), 'sample_4a', paste0('reference_b',1:3)))
require(BWStest)
bws_ranked <- sapply(seq_len(nrow(test)),function(i) {
samp <- as.numeric(test[i,1:4])
ref <- as.numeric(test[i,5:7])
bws_stat(samp, ref)
}) I get these values: c(2.9526703042328, 1.79592427248677, 1.00227347883598, 0.313781415343915,
0.921916335978836, 0.88322585978836, 0.512194113756614, 0.313781415343915,
0.0588210978835979, 0.387194113756614, 0.282035383597884, 0.207630621693122,
0.157035383597884, 0.231440145502646, 0.249297288359788, 0.532035383597884,
0.13322585978836, 0.228463955026455, 0.227471891534392, 0.46854332010582,
0.207630621693122, 0.217551256613757, 0.177868716931217, 0.24136078042328,
0.726479828042328, 0.619336970899471, 0.227471891534392, 0.38620205026455,
0.679852843915344, 1.0369957010582, 1.08560681216931, 1.26020998677249,
1.26020998677249, 2.04592427248677, 3.23342427248677) Some of these are duplicated as well. The unique values among them are c(2.9526703042328, 1.79592427248677, 1.00227347883598, 0.31378141534392,
0.92191633597884, 0.88322585978836, 0.51219411375661, 0.0588210978836,
0.38719411375661, 0.28203538359788, 0.20763062169312, 0.15703538359788,
0.23144014550265, 0.24929728835979, 0.53203538359788, 0.13322585978836,
0.22846395502646, 0.22747189153439, 0.46854332010582, 0.21755125661376,
0.17786871693122, 0.24136078042328, 0.72647982804233, 0.61933697089947,
0.38620205026455, 0.67985284391534, 1.0369957010582, 1.08560681216931,
1.26020998677249, 2.04592427248677, 3.23342427248677) |
The code you have found in Matlab is computed the BWS tests statistic. The computation of The Matlab code is computing this statistic row-by-row. The only odd Matlab syntax I see here is the use of 'dot operators' for Hadamard operations. That is To be sure, I took the Matlab code above, and adapted it to the first row of my example output above, where the samples take values 1 through 4 and the reference takes 5 through 7. The code is as follows: R1 = 1:4;
R2 = 5:7;
n0 = length(R1);
n1 = length(R2);
ind0 = 1:n0;
ind1 = 1:n1;
B0 = sum((R1 - ((n1+n0)/n0 * ind0)).^2./((ind0./(n0+1)) .* (1-(ind0/(n0+1))) * ((n1*(n1+n0))/n0)));
B1 = sum((R2 - ((n1+n0)/n1 * ind1)).^2./((ind1./(n1+1)) .* (1-(ind1/(n1+1))) * ((n0*(n1+n0))/n1)));
ranksa = (B0/n0 + B1/n1)/2;
ranksa I executed this with the online octave interpreter where it claimed that the value of the statistic was 2.9527, which matches the value computed above. |
Thanks for your comments. I distilled my question a bit more by using the following example. I simulated a data frame having 7 observations ( test <- data.frame(rbind(c(10.1, 10.2, 10.3, 10.4, 1.1, 1.2, 1.3), # Should be ranked first (highest difference sample vs ref)
c(10.1, 10.2, 10.3, 10.4, 3.1, 3.2, 3.3), # Should be ranked second
c(10.1, 10.2, 10.3, 10.4, 5.1, 5.2, 5.3), # Should be ranked third
c(10.1, 10.2, 10.3, 10.4, 7.1, 7.2, 7.3), # Should be ranked fourth
c(10.1, 10.2, 10.3, 10.4, 9.1, 9.2, 9.3))) # Should be ranked last (lowest difference sample vs ref)
# Give mock names to the data frame
genes_row <- character()
for(i in 1:5){
add_gene <- paste(sample(LETTERS,4), collapse = "")
genes_row <- c(genes_row, add_gene)
}
rownames(test) <- genes_row
colnames(test) <- c(paste0("sample",1:4), paste0("ref",1:3))
pheatmap::pheatmap(test, cluster_rows = F, cluster_cols = F) # Compute BWS statistics
require(BWStest)
#> Loading required package: BWStest
bws_ranked <- sapply(seq_len(nrow(test)),function(i) {
samp <- as.numeric(test[i,1:4])
ref <- as.numeric(test[i,5:7])
bws_stat(samp, ref)
})
names(bws_ranked)<-rownames(test)
# All the statistics are the same
bws_ranked
#> CRVX RUAO SNDE CEIV HUNZ
#> 3.233424 3.233424 3.233424 3.233424 3.233424 Created on 2018-11-20 by the reprex package (v0.2.1) |
The BWS test is a test of equal probability distributions, not a test of location. Moreover, as noted in the documentation, and in the original paper, the test is invariant with respect to monotonic transforms of the data, so for example require(BWStest)
set.seed(1234)
x <- 10 + runif(5)
y <- 5 + runif(5)
s1 <- bws_stat(x,y)
s2 <- bws_stat(log(x),log(y))
s3 <- bws_stat(exp(x),exp(y))
dput(c(s1,s2,s3)) I get c(4.894, 4.894, 4.894) But now note that a monotonic transform could 'scrunch up' the data between their split: fnc <- function(x) { 10 - 1 / x }
fnc(x)
fnc(y)
bws_stat(fnc(x),fnc(y)) [1] 4.894 Perhaps you should be using a robust test for location? |
Hello,
Thanks for this helpful package. I'm trying to use this package to analyze for biological gene expression data. BWS method was suggested to be a powerful approach for non-parametric analysis of gene expression data this, this and this articles
My end goal is comparing two groups (
sample
vsreference
) with multiple observations in each case (cell_a1/2/3/4
,cell_b1/2/3
) and rank the gene expression for further analysis. I'm using different metrics to rank the gene expression and ranking the genes based on the statistics output of the relevant approach (i.e.t-value
fort.test
, andB
value forBWS
). Simulated data below helps demonstrate my question. The mockdata.frame
contains gene expression data in rows (each row is a gene) and samples in the columns (4 experimental samples, 3 references).My question has two facets:
1- Why do I have many duplicated BWS statistics value in this mock data frame even though the data were sampled randomly?
2- Any ideas why I have a low correlation between BWS method and the other the comparison methods? As a reference,
S2N
(signal to noise) metric is commonly used for the type of analysis that I'm trying perform here. The problem with using S2N in my actual case is the non-normal distribution of expression data. In reality, I have whole bunch of zero values and some non-zero values. Since the data aren't distributed normally, S2N metric might not be the most appropriate method (S2N is calculated as (mean1-mean2)/(stdev1+stdev2))Your thoughts are greatly appreciated.
Best,
Atakan
Created on 2018-11-19 by the reprex package (v0.2.1)
The text was updated successfully, but these errors were encountered: