
Any ideas about the runtime? #2

Closed · bastian-wur opened this issue Oct 5, 2017 · 9 comments

@bastian-wur commented Oct 5, 2017

Hi everyone,

Yes, this is not really an issue; I just wanted to ask whether you have an idea of what the runtime of a normal analysis could be.
I started a test run (the first time I'm trying this tool, so I'm not sure what I'm doing) with admittedly pretty big input data (2.5 million genes, 8 samples), and it is currently still running (22 h).
Is this normal/possible, or should I be worried?

The code and output are currently as follows, and according to top my R instance is still running:

> x <- read.delim("/exports/mm-hpc/bacteriologie/bastian/absolute_counts_per_gene.csv.only_save_patients_before_after.csv",row.names="contig",sep = "\t",header=TRUE)
> library(DAtest)
DAtest version 2.6.2
> mytest <- testDA(x, c(0,1,0,1,0,1,0,1),cores = 1)
Seed is set to 123
1722860 empty features removed
predictor is assumed to be a continuous/quantitative variable

  |                                                                      |   0%
Estimating sequencing depths...
Resampling to get new data matrices...
perm= 1
perm= 2
...
perm= 100
Number of thresholds chosen (all possible thresholds) = 944
Getting all the cutoffs for the thresholds...
Getting number of false positives in the permutation...

Unrelated: Greetings to Martin M., from whose poster I saw the link :).

@Russel88 (Owner) commented Oct 5, 2017

2.5 million is quite a lot more than what I tested it with (up to 10k). But it tells you that 1,722,860 genes are empty, so you're actually "just" testing about 800k genes.
I would recommend the following:

  1. To test the run time, set R=1 and subset to a few methods by setting the tests argument, e.g. just the simple t-test and Wilcoxon test: tests = c("ttt","wil") (see the sketch after this list).
  2. Use more cores if you can; using only 1 core significantly increases the run time.
  3. Note that your predictor is assumed to be quantitative, so you should write it as factor(c(0,1,0,1,0,1,0,1)) or c("A","B","A","B","A","B","A","B").
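
As a minimal sketch, a quick timing run combining the three points above could look like this (cores = 4 is just a placeholder; x is the data frame from your post):

# Quick timing run: one resampling iteration, only two fast tests, categorical predictor
mytest <- testDA(x, factor(c(0,1,0,1,0,1,0,1)),
                 R = 1, tests = c("ttt","wil"), cores = 4)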

Extrapolating, I would say it should take around 40 minutes to run the t-test and Wilcoxon test on 800k genes with R=1 and 1 core. Increase it to 2 cores and the run time is roughly halved.

Once you have confirmed that it runs within an acceptable time, you can add other tests. If it's RNA-seq data, try adding the edgeR methods ("ere","ere2","erq","erq2").

Cheers

@Russel88 (Owner) commented Oct 9, 2017

Hi again,

I have added a new function to estimate the runtime of the different methods: runtimeDA.

This will tell you whether some methods are simply too slow on your dataset. You can then use the tests argument to specify the methods that you want to include.
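
For later readers, a minimal sketch of a call, assuming the first two arguments mirror testDA (the cores value is a placeholder; check ?runtimeDA for the actual arguments):

# Estimate per-method runtimes before committing to a long testDA run
rt <- runtimeDA(x, factor(c(0,1,0,1,0,1,0,1)), cores = 10)
rt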

Cheers,
Jakob

@bastian-wur (Author)

Thanks :).
Right now I'm busy with other things, so I'm letting it run, but it seems that it's indeed a bit slow for this size of data.
It still seems to be running and producing output (it started on the 4th; the last output was written yesterday morning), and I need to test it on the whole dataset anyway (well, it's actually not the whole dataset, as you noted).

(If it doesn't finish within the next week, I'll try more cores; I just need to monitor the RAM usage to get an estimate.)

@bastian-wur (Author)

FYI: it ran for 28 days on 10 cores, with a maximum of 112 GB RAM.

And then I made an error in my R script and the whole thing aborted, stupid me.
I have the plot with AUC and FPR. I guess that is enough to make a conclusion, right?
(AUC is pretty low though, highest 0.6)

@Russel88 (Owner) commented Nov 7, 2017

Wow, well at least it finished. An AUC of 0.6 is a bit low. You might want to spike some more features when the dataset is that large. Also, you can subset to only the fast methods, for example:

mytest <- testDA(x, as.factor(c(0,1,0,1,0,1,0,1)), effectSize = 4, k = c(50,50,50),
                 tests = c("ttt","ltt","ltt2","wil","ere","ere2","lim","vli","lli","lli2"),
                 cores = 10)

This will spike 150 features with an effect size of 4 and test with 10 fast and relevant methods.

That being said, if you have the plot it should be fine, as long as it treated your predictor as categorical. If the code is as in your first post it would be wrong; you should wrap the predictor in as.factor().

Also, if you run it again, install DAtest again to get the latest version.

Cheers

@bastian-wur (Author)

Thanks :). I changed the predictor before I ran the test, because of what you said.
I have now also installed the latest version.
If I try to run the command you posted, I get an error though :/.

The output is:

mytest <- testDA(x, c("A","B","A","B","A","B","A","B"), effectSize = 4, k = c(50,50,50), tests = c("ttt","ltt","ltt2","wil","ere","ere2","lim","vli","lli","lli2"), cores = 20)
You are about to run testDA using 20 cores. Enter y to proceed
Error in testDA(x, c("A", "B", "A", "B", "A", "B", "A", "B"), effectSize = 4, :
Process aborted
Execution halted

My R is not brilliant, but I can at least tell that the syntax itself is right, and I also checked testDA.Rd; there don't seem to be any issues there.
But where is the error? It's probably a general R thing that I can't interpret right now.

@Russel88 (Owner) commented Nov 8, 2017

Ah, it's because you have to enter "y" (without quotation marks) after you run the testDA function:
You are about to run testDA using 20 cores. Enter y to proceed

When you're running with many cores (>10), you have to confirm the run. It's just an extra check to ensure that a server is not accidentally overloaded with parallel workers.

@bastian-wur (Author)

Aaaah, thanks.
Okay, I had it running with 10 cores, because I couldn't be bothered to figure out how to pass a "y" into the non-interactive session.

It has now finished. The AUC is still at most 0.6, and since I now also have the table, I can see that the spike.detect.rate is constantly 0, except in one case, where it's 0.03333.
So... I guess I'd need both more spike-ins and a bigger effect size, right?

@Russel88 (Owner)

You can just run it like this:
mytest <- testDA(...); y
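
As a minimal sketch of a complete non-interactive script following that trick (the input file, predictor, and test selection are placeholders; the bare y after the semicolon is what answers the confirmation prompt):

# run_datest.R -- run non-interactively, e.g. with: R --no-save < run_datest.R
library(DAtest)
x <- read.delim("counts.tsv", row.names = 1)  # placeholder input file
mytest <- testDA(x, c("A","B","A","B","A","B","A","B"),
                 tests = c("ttt","wil"), cores = 20); y
summary(mytest)  # print the comparison table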

You should not expect to get a very high AUC or spike.detect.rate for this size of data. You can try k=c(500,500,500) and effectSize = 5. As long as the spike.detect.rate is above zero for the method with FPR < 0.05 and the highest AUC, I would go with that.
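
For reference, a re-run with those settings would look something like this (the other arguments as in the call suggested above):

mytest <- testDA(x, c("A","B","A","B","A","B","A","B"),
                 effectSize = 5, k = c(500,500,500),
                 tests = c("ttt","ltt","ltt2","wil","ere","ere2","lim","vli","lli","lli2"),
                 cores = 10)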
