
merge spark dev branches #39

Closed
piccolbo opened this issue May 8, 2014 · 2 comments
piccolbo (Collaborator) commented May 8, 2014

This is a complex merge which requires careful planning.

First the rationale:

  • pros:
    • a single code base to maintain, with perhaps 80% of the code shared
    • a single syntax and semantics to document, teach or learn, though with backend-specific variations, hopefully only in the form of backend-specific features
    • avoid progressive drift between the two branches (both expected to change quickly)
    • maintaining multiple backend implementations has proved to be a good prod toward modularity and a good bug detector (thanks to the equivalence tests between backends)
  • cons:
    • a single code base is more complex than each of the separate ones
    • the merge is going to break things that currently work (hopefully temporarily, and confined to a dedicated branch)
    • the merge will take a lot of work
    • SparkR is still incomplete: a dev preview, not ready for prime time. The merged package won't be ready for release and will have to be maintained in parallel for an unpredictable amount of time

Backend-specific features

  • SparkR:
    • cache
    • indexed partitions
    • not available yet: accumulators
  • rmr2:
    • local backend
    • io formats (for now)
    • writing files (for now)

Merge issues

  • What do we do with dependencies unique to each branch? Importing both creates a burden on installers who need only one backend: installing rmr2 means installing Hadoop, and installing SparkR means installing Spark. If we go with the Suggests: field instead, we may end up with an installation that doesn't work. The Suggests: route is also the dplyr approach, but there the local mode more or less comes built in; here it is part of rmr2 and can't be unbundled.
  • Same for namespace issues: what do we do with functions and methods that are backend-specific? In this case there is no Suggests-like alternative, but we can probably keep them all without harm.
  • the big.data class and big.data.R file, albeit misnomers at this point, can be used to interface with rmr2. We could rename them to something more specific for clarity.
  • the pipe class and pipe.R file are the tough ones. Let's break it down by function or group thereof.
    • comp is a little functional-programming helper used for delayed evaluation. It can go in common.R or anywhere
    • the options environment is spark only, but making options configurable and shielding the user from accessing rmr2 options directly was already on the dev branch's todo list
    • drop.gather changes based on the different representations of keyvals in the two backends
    • make*fun functions are rmr2 specific
    • is.data is probably dead code; as.character.pipe certainly is
    • set.context should become part of the options subsystem, spark only
    • print.pipe is backend independent, but git diff didn't catch that (for short: SS = spark specific, RS = rmr2 specific, BI = backend independent)
    • make.f1 is BI
    • mergeable, vectorized and related funcs are BI
    • gapply and group.f and ungroup have been completely rewritten for spark
    • the keyval-related functions are SS and replace the keyval logic in rmr2. There is a little duplication, but unless we spin off keyvals into a separate package, what else can we do?
    • group is BI
    • run, mrexec and mr.options are RS
    • output can't be implemented in Spark right now
    • the various as. conversions are mostly backend specific (BS) in their implementation
    • the tutorial has been modified to avoid writing to a file in SparkR. The common parts could be factored out, but that would make the tutorial less intuitive; it's probably best to keep two versions and kick this down the road until the output feature is in SparkR
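The Suggests: dilemma above can be made concrete. Below is a minimal, hypothetical sketch of treating both backends as optional dependencies checked lazily at backend-selection time; the function name set.backend and the option name plyrmr.backend are illustrative assumptions, not the package's actual API:

```r
# Hypothetical sketch: rmr2 and SparkR would sit under Suggests: in
# DESCRIPTION, and the chosen backend's package is verified only when the
# user actually selects that backend.
set.backend <- function(backend = c("mr", "spark")) {
  backend <- match.arg(backend)
  pkg <- switch(backend, mr = "rmr2", spark = "SparkR")
  if (!requireNamespace(pkg, quietly = TRUE))
    stop("backend '", backend, "' requires the ", pkg, " package to be installed")
  options(plyrmr.backend = backend)  # assumed option name, for illustration
  invisible(backend)
}
```

With this pattern, installing the merged package would pull in neither Hadoop nor Spark; the user pays the installation cost only for the backend they pick.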

Merge preparations

If we just merged one branch into the other as-is, it could be a disaster. It's probably best to have BS code in separate files and very few selection points where control depends on the backend. This could be implemented the OO way, with two classes inheriting from pipe, say pipespark and pipemr, that differ in the methods gapply, group.f and ungroup. Once these preparations are completed in both branches, the merge should be a lot easier. Another possible preparation would be to somewhat normalize the order of functions, as git diff seems to be confounded by simple code reordering.
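The two-subclass idea could be sketched with S3 dispatch roughly as follows; all constructor, generic and method names here are illustrative assumptions, not actual plyrmr code:

```r
# Two backend-specific subclasses of a common "pipe" class, so that only
# the methods that genuinely differ (e.g. gapply) need per-backend code.
pipespark <- function(data) structure(list(data = data), class = c("pipespark", "pipe"))
pipemr    <- function(data) structure(list(data = data), class = c("pipemr", "pipe"))

gapply <- function(.data, f, ...) UseMethod("gapply")
gapply.pipespark <- function(.data, f, ...) "dispatched to the SparkR implementation"
gapply.pipemr    <- function(.data, f, ...) "dispatched to the rmr2 implementation"

# Backend-independent methods are written once, against "pipe", and both
# subclasses inherit them through the class vector.
print.pipe <- function(x, ...) cat("a pipe on the", class(x)[1], "backend\n")
```

Calling gapply(pipespark(1:10), identity) would select the spark method, while print falls through to the shared print.pipe; the backend choice then lives in a single constructor call rather than in scattered if/else branches.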

piccolbo (Collaborator, Author) commented

The merge is complete in the spark-merge branch. It requires patches in rmr and SparkR.

piccolbo (Collaborator, Author) commented

The rmr patch has been merged. It reloads objects twice, so I am not too happy about it, but it was the simplest thing that worked. A pull request has been sent for the SparkR patch. The spark-merge branch has been merged into the spark branch, which is now published, and I also did a routine merge from 0.3.0 through dev. Open problems are:

  • dependency on both rmr2 and SparkR. Can we use Suggests: to make them optional?
  • keep the spark branch going in parallel, or merge it with dev and hide the spark changes for a while? On one hand there is no way we can support every feature on the spark backend; on the other, a parallel branch is a maintenance burden, and it has some features, like plyrmr options, that would be nice to have in dev independent of spark.

I am reorganizing the tests in the spark branch so that they pass cleanly, with the tests that must be skipped on spark clearly marked. This will allow testing according to standard procedures and also tracking progress in feature support on spark (every time we exclude fewer tests, that's progress).
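One way the skip marking could look, assuming testthat as the test framework; the helper name skip_on_spark and the option name plyrmr.backend are inventions for illustration:

```r
library(testthat)

# Hypothetical helper: skip the enclosing test when the active backend
# (stored here in an assumed option, plyrmr.backend) is spark.
skip_on_spark <- function() {
  if (identical(getOption("plyrmr.backend"), "spark"))
    skip("not yet supported on the spark backend")
}

test_that("output writes a file", {
  skip_on_spark()
  # the real test body would exercise the output feature here
  expect_true(TRUE)
})
```

Skipped tests show up as such in the testthat summary, so the skip count doubles as a progress metric: fewer skips means better spark feature coverage.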
