Parallel backend is broken for doParallel >1.0.6 #7

Closed
mintyplanet opened this issue Mar 3, 2014 · 8 comments

Comments

@mintyplanet

doParallel versions later than 1.0.6 fail to load as the parallel backend.

> library(NMF)
> data(esGolub)
> nmf(esGolub, 3, nrun=4, .opt="vP")
NMF algorithm: 'brunet'
Multiple runs: 4
Error: Foreach computation aborted: object 'info' not found

The error message refers to this line:
https://github.com/renozao/NMF/blob/master/R/parallel.R#L316

object$info <- doParallel:::info

The internal variable doParallel:::info was removed in version 1.0.7.
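
To confirm whether a given installation is affected, one can check whether the internal object is still present in the doParallel namespace. This is only a minimal sketch: `has_namespace_object` is a hypothetical helper written for illustration, not part of NMF or doParallel, and it relies only on base R functions.

```r
# Hypothetical helper (not part of NMF or doParallel): report whether a
# package namespace contains a given object, exported or internal.
has_namespace_object <- function(pkg, name) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    return(FALSE)  # package is not installed
  }
  exists(name, envir = asNamespace(pkg), inherits = FALSE)
}

# Print the installed doParallel version, if the package is available.
if (requireNamespace("doParallel", quietly = TRUE)) {
  message("doParallel version: ",
          as.character(utils::packageVersion("doParallel")))
}

# FALSE here means code that reads doParallel:::info will fail.
message("doParallel:::info available: ",
        has_namespace_object("doParallel", "info"))
```

If the last line reports FALSE, any NMF release that still reads doParallel:::info should be expected to fail in parallel mode.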

@renozao
Owner

renozao commented Mar 4, 2014

Thanks for reporting this.
Will look into it asap.

renozao pushed a commit that referenced this issue Mar 4, 2014
@renozao renozao closed this as completed Mar 20, 2014
@brdhungana

I recently encountered the same error while running NMF on Linux, but not on Windows. You indicated above that the problem has been fixed. Could you please show explicitly, with an example, how to run it?

Here is my code:

system.time(NMFFit <- nmf(Train2, rank=4, method="ns", theta=0.7, seed = 123456, nrun=8, .opt = "vP8"));
NMF algorithm: 'nsNMF'
Multiple runs: 8
Error: Parallel computation aborted: object 'info' not found
Timing stopped at: 0.26 0.037 0.295

Here is my system information:
$platform: "x86_64-unknown-linux-gnu"
$version.string: "R version 3.0.0 (2013-04-03)"

other attached packages:
[1] doParallel_1.0.8 iterators_1.0.7 foreach_1.4.2
[4] NMF_0.17 bigmemory_4.4.6 BH_1.54.0-4
[7] bigmemory.sri_0.1.3 digest_0.6.4 rngtools_1.2.4
[10] pkgmaker_0.22 registry_0.2

I would appreciate your help in fixing the parallel run issue on Linux. Thanks.

Regards,
BRD

@renozao
Owner

renozao commented Dec 6, 2014

Have you tried using the latest version on CRAN (0.20.5)?
On my Ubuntu box:

library(NMF)
x <- rmatrix(100, 20)
res <- nmf(x, rank=4, method="ns", theta=0.7, seed = 123456, nrun=8, .opt = "vP4")

Results:

> library(NMF)
Loading required package: pkgmaker
Loading required package: registry
Loading required package: rngtools
Loading required package: cluster
NMF - BioConductor layer [OK] | Shared memory capabilities [OK] | Cores 3/4
> x <- rmatrix(100, 20)
> res <- nmf(x, rank=4, method="ns", theta=0.7, seed = 123456, nrun=8, .opt = "vP4")
NMF algorithm: 'nsNMF'
Multiple runs: 8
Mode: parallel (4/4 core(s))
Runs: |==================================================| 100%
System time:
   user  system elapsed 
  8.993   0.291   3.648 
> res
<Object of class: NMFfitX1 >
  Method: nsNMF 
  Runs:  8 
  RNG:
   407L, -473780611L, -197192934L, -577462829L, -1713825544L, 1377146521L, 1787321734L 
  Total timing:
   user  system elapsed 
  8.993   0.291   3.648 

@brdhungana

I was finally able to run the model with the recent NMF version 0.20.5 using multiple cores on Linux, without the earlier error. Thank you for responding.

@brdhungana

My model ran successfully with nrun=4, but with the same data set it stopped just before finishing for 50 runs:

Here is the successful run case:

library(NMF);
system.time(ckmNMF4 <- nmf(Train.ckm_2, rank=4, method="ns", theta=0.7, seed = 123456, nrun=4, .opt = "vp4"));
NMF algorithm: 'nsNMF'
Multiple runs: 4
Mode: parallel (4/16 core(s))
Runs: |==================================================| 100%
System time:
user system elapsed
32823.259 621.439 15113.068
user system elapsed
32834.302 622.468 15125.747

Failed run: my two attempts yielded the following messages:
system.time(ckmNMF4 <- nmf(Train.ckm_2, rank=4, method="ns", theta=0.7, seed = 123456, nrun=100, .opt = "vp16"));
NMF algorithm: 'nsNMF'
Multiple runs: 100
Mode: parallel (16/16 core(s))
Runs: |==================================================| 100%
ERROR
Error: NMF::nmf - Unexpected error: no partial result seem to have been saved.
Timing stopped at: 788331.4 17865.29 94329.14
Timing stopped at: 788344.5 17866.6 94348.33

How do I debug the cause of this failure? Note that I ran this on 45 million rows and 15 variables. It seems to me that memory was not an issue. I would appreciate your suggestions.

Thanks,
Basanta
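
On the question of memory, a rough back-of-the-envelope estimate can help before ruling it out. The sketch below assumes the data are held as a dense double-precision matrix (8 bytes per value); `dense_gb` is a hypothetical helper, not an NMF function, and the numbers say nothing about how NMF or the parallel workers actually allocate memory.

```r
# Hypothetical helper: approximate size of a dense double-precision
# matrix in gigabytes (1 GB = 1e9 bytes). Assumes 8 bytes per value.
dense_gb <- function(n_rows, n_cols, bytes_per_value = 8) {
  n_rows * n_cols * bytes_per_value / 1e9
}

x_gb <- dense_gb(45e6, 15)  # input: 45 million rows, 15 variables
w_gb <- dense_gb(45e6, 4)   # basis matrix W for rank = 4

message(sprintf("Input matrix: about %.2f GB", x_gb))
message(sprintf("One rank-4 basis matrix: about %.2f GB", w_gb))
```

Under these assumptions the input alone is about 5.4 GB and each fitted basis matrix about 1.44 GB, so it may be worth comparing these figures, multiplied by the number of workers, against the memory available on the machine.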

@brdhungana

The debug option does not produce any information that helps identify the cause of the failure:

nmf.options(debug = TRUE)
system.time(ckmNMF4R <- nmf(Train.ckm_2, rank=4, method="ns", theta=0.7, seed = 123456, nrun=16, .opt = "vp16"));

NMF call: .local(x = x, rank = rank, method = method, seed = 123456, nrun = 16,

  .options = "vp16", theta = 0.7)

NMF algorithm: 'nsNMF'
Multiple runs: 16

OPTIONS:

verbose: TRUE | parallel: 16 | garbage.collect: 50 | RNGstream: TRUE

Setting up requested foreach environment: try-parallel [par]

Check available cores ... [16]

Check requested cores ... [16]

Loading backend for specification par ... OK

Check host compatibility ... OK

Registering backend doParallel ... OK

Check allocated cores ... OK [16/16]

Setting up RNG ...

** Original RNG settings:

RNG kind: Mersenne-Twister / Inversion

RNG state: 403L, 7L, ..., -1289165921L [4de1642ab154e963c6ea7ef488e195d8]

Generate RNGStream sequence using seed (403L, 624L, ..., 449848215L [ed7ba52c9c2666ca159b185949fd9d73]) ... OK

Using foreach backend: doParallelMC [version 1.0.8]

Mode: parallel (16/16 core(s))

Check shared memory capability ... NO [Package bigmemory required]

Setup temporary directory: '/home/XXXXXXX/XXXXXXXX/XXXXXX/NMF_1d015bc581a' ... OK

Running on 1 host(s): 'cma4-corp.XXXX.XX.XXX'

Using shared memory ... FALSE

Setting up libpath on workers for package(s) 'NMF' ... OK

libPaths:

/home/XXXXXXX/XXXXXXXX/R/x86_64-redhat-linux-gnu-library/3.1
/usr/lib64/R/library
/usr/share/R/library
numValues: 16, numResults: 0, stopped: TRUE
got results for task 1
numValues: 16, numResults: 1, stopped: TRUE
returning status FALSE
got results for task 2
numValues: 16, numResults: 2, stopped: TRUE
returning status FALSE
got results for task 3
numValues: 16, numResults: 3, stopped: TRUE
returning status FALSE
got results for task 4
numValues: 16, numResults: 4, stopped: TRUE
returning status FALSE
got results for task 5
numValues: 16, numResults: 5, stopped: TRUE
returning status FALSE
got results for task 6
numValues: 16, numResults: 6, stopped: TRUE
returning status FALSE
got results for task 7
numValues: 16, numResults: 7, stopped: TRUE
returning status FALSE
got results for task 8
numValues: 16, numResults: 8, stopped: TRUE
returning status FALSE
got results for task 9
numValues: 16, numResults: 9, stopped: TRUE
returning status FALSE
got results for task 10
numValues: 16, numResults: 10, stopped: TRUE
returning status FALSE
got results for task 11
numValues: 16, numResults: 11, stopped: TRUE
returning status FALSE
got results for task 12
numValues: 16, numResults: 12, stopped: TRUE
returning status FALSE
got results for task 13
numValues: 16, numResults: 13, stopped: TRUE
returning status FALSE
got results for task 14
numValues: 16, numResults: 14, stopped: TRUE
returning status FALSE
got results for task 15
numValues: 16, numResults: 15, stopped: TRUE
returning status FALSE

Processing partial results ... ERROR

Error: NMF::nmf - Unexpected error: no partial result seem to have been saved.
Timing stopped at: 2692.744 345.294 528.506

NMF computation exit status ... ERROR

Running rollback clean up ...

Restoring RNG settings ...

RNG kind: Mersenne-Twister / Inversion

RNG state: 403L, 7L, ..., -1289165921L [4de1642ab154e963c6ea7ef488e195d8]

OK

Restoring NMF options ... OK

Restoring previous foreach backend '' ... OK

Deleting temporary directory '/XXXX/XXXXX/XXXXX/XXXXX/NMF_1d015bc581a' ... OK

Timing stopped at: 2698.415 345.833 549.012

@brdhungana

I recently ran the same model with nrun=50 using 4 cores in batch mode, and it completed successfully! I am now experimenting with 16 cores and will update you as soon as I get results. At this point, I do not think there is a bug in the NMF source code.

Here are a few lines from the log of my 50th run:
numValues: 50, numResults: 50, stopped: TRUE
calling combine function
evaluating call object to combine results:
fun(accum, result.1, result.2, result.3, result.4, result.5,
result.6, result.7, result.8, result.9, result.10, result.11,
result.12, result.13, result.14, result.15, result.16, result.17,
result.18, result.19, result.20, result.21, result.22, result.23,
result.24, result.25, result.26, result.27, result.28, result.29,
result.30, result.31, result.32, result.33, result.34, result.35,
result.36, result.37, result.38, result.39, result.40, result.41,
result.42, result.43, result.44, result.45, result.46, result.47,
result.48, result.49, result.50)
returning status TRUE

Processing partial results ... OK

NMF computation exit status ... OK

Running normal exit clean up ...

Restoring NMF options ... OK

Restoring previous foreach backend '' ... OK

Updating RNG settings ... OK

RNG kind: Mersenne-Twister / Inversion

RNG state: 403L, 1L, ..., 425501564L [c7f400f3798e6384ca89b63934b32173]

Deleting temporary directory '/home/XXXXXX/XXXXXX/XXXXX/NMF_53456e0662dc' ... OK

      user     system    elapsed 
446964.134   8829.809 156572.488

Thanks,
BRD

@brdhungana

I ran a model with nrun=100, but it failed at the final step, after all the runs had finished. Please give me your email address so that I can send you my debug log for analysis. I experimented twice with different algorithms, and both yielded the same error at the end.
