Refactoring wish list #17

Open
clarkfitzg opened this issue Aug 18, 2016 · 9 comments

@clarkfitzg
Contributor

commented Aug 18, 2016

Following up on #15, here are some ideas for improvements to ddR.

  • Makefile to automate testing, doc builds, and examples. See #15
  • Overhaul init and useBackend methods. See #15
  • Internal documentation on ddR's programming model and how to write backends. See ddR wiki
  • Rewrite examples for clarity, reproducibility, and best practices
  • Examples of reading / processing / writing actual data
  • Benchmarks for various backends (from @dselivanov)
  • Simplifying do_dmapply
  • Set up continuous integration on Travis (or similar service) - Probably requires admin access to repo?
  • Making distributed objects act more like their local counterparts through more OO code

The changes below might require more conversation, since I don't know the reasons behind the design decisions:

  • Allow partitioning dataframes only on rows
  • Make ddR column major order like R
  • Change name from arrays to matrices
@dselivanov

commented Aug 29, 2016

I have 2 proposals:

Instructions on how to set up a small dev cluster

It would be nice to have instructions on how to set up a cluster (on AWS, for example), or links to machine images.
It is quite easy to use ddR with the "fork" parallel backend, but it would be useful to test and evaluate performance with:

  1. snow backend - socket and MPI clusters
  2. distributedR

Special case for a distributed RowMatrix

👍 for a special case for partitioning matrices by rows (the same idea as Spark's RowMatrix), something like "dRowMatrix_" and "sparse_dRowMatrix".

@clarkfitzg

Contributor Author

commented Aug 29, 2016

Do you mean an example cluster for distributedR? Otherwise we use parallel, which doesn't use a cluster, just all the cores on a local machine. That should only require starting up an AWS machine with R and installing the package.

+1 for benchmarks with the different systems. Then users can have some expectations of performance.

@dselivanov

commented Aug 29, 2016

I mean

  1. a cluster with several machines (at least 2) and the distributedR backend
  2. a cluster with several machines (at least 2) with the parallel backend, which uses parallel::makePSOCKcluster().

I thought ddR was supposed to work with all options from parallel, not only with parallel::makeForkCluster().

Edit: see chapter 4 of the parallel vignette

@dselivanov

commented Aug 29, 2016

From pdriver.R I see that ddR is not supposed to work with distributed PSOCK clusters. PSOCK is only used for Windows machines...

@etduwx, @lawremi, @fun-indra is there some limitation of PSOCKcluster that prevents it from working with ddR? Or was that just an initial implementation that can be extended to snow clusters?

@fun-indra

Contributor

commented Aug 29, 2016

@dselivanov The PSOCK implementation is bare bones. I wanted to support parallelism on Windows. Now would be a good time to extend it to snow clusters. @clarkfitzg what do you think of it as a goal? @dselivanov I would love it if you have time to contribute towards snow cluster support.

@dselivanov

commented Aug 29, 2016

@fun-indra, @clarkfitzg I will be happy to help, but I'm not sure I will have time in the next few weeks (at least for coding). I will definitely try to help with reviews and testing.

@clarkfitzg

Contributor Author

commented Aug 29, 2016

With the changes in #15 it just calls parallel::makeCluster, so in principle it should work with any cluster type supported by parallel. I'll test it out today.
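
For illustration, here is a minimal sketch of what "any cluster type supported by parallel" can look like. It uses only the base parallel package, not ddR itself, and assumes two workers on the local machine; how ddR passes these arguments through after #15 is not shown here.

```r
library(parallel)

# Start two local worker processes that communicate over sockets.
# "PSOCK" is available on every platform; type = "FORK" is Unix-only,
# and type = "MPI" additionally needs the snow and Rmpi packages.
cl <- makeCluster(2, type = "PSOCK")

# Run a trivial task on the workers and collect the results.
squares <- parLapply(cl, 1:4, function(x) x^2)
print(unlist(squares))

# Always shut the workers down when finished.
stopCluster(cl)
```

For a cluster that spans several machines, the first argument to makeCluster can instead be a character vector of host names, which requires R and passwordless SSH access on each host.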

@clarkfitzg

Contributor Author

commented Sep 1, 2016

The unit tests pass with "SOCK" and "MPI" snow clusters.

@clarkfitzg

Contributor Author

commented Sep 5, 2016

Looks like all the snow clusters are good. Moving this conversation to #15 since I added some minor changes there to support this.
