This is a complex merge which requires careful planning.
First, the rationale:
Pros:
- a single code base to maintain, with perhaps 80% shared
- a single syntax and semantics to document, teach, or learn; there will be backend-specific variations, hopefully only in the form of backend-specific features
- avoids progressive drift between the two branches (both expected to change quickly)
- the multiple backend implementations have proved a good modularity probe and bug detector (thanks to equivalence tests between backends)
Cons:
- a single code base is more complex than either of the separate ones
- the merge is going to break things that currently work (hopefully this is temporary and will be confined to a dedicated branch)
- the merge will take a lot of work
- SparkR is still incomplete, a dev preview, not ready for prime time. The merged package won't be ready for release and will have to be maintained in parallel for an unpredictable amount of time.
Backend-specific features
SparkR:
- cache
- indexed partitions
- not available yet: accumulators
rmr2:
- local backend
- io formats (for now)
- writing files (for now)
Merge issues
What do we do with dependencies unique to each branch? Importing all of them creates a burden on an installer who needs only one backend: installing rmr2 means installing Hadoop, and installing SparkR means installing Spark. If we go with the Suggests: field, we may end up with an installation that doesn't work. This is also the dplyr approach, but there the local mode more or less comes built in; here it is part of rmr2 and can't be unbundled.
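One possible shape for the Suggests: approach, sketched below; the DESCRIPTION fragment and the `backend` helper are assumptions for illustration, not the package's actual code:

```r
# Hypothetical DESCRIPTION fragment: neither backend is a hard dependency.
#   Suggests: rmr2, SparkR

# At backend-selection time, fail with a clear message instead of a
# cryptic missing-namespace error if the chosen backend isn't installed.
backend <- function(name = c("rmr2", "spark")) {
  name <- match.arg(name)
  pkg <- if (name == "spark") "SparkR" else "rmr2"
  if (!requireNamespace(pkg, quietly = TRUE))
    stop("backend '", name, "' requires package '", pkg,
         "'; please install it first", call. = FALSE)
  name
}
```

This keeps the "installation that doesn't work" failure mode, but at least surfaces it at backend selection with an actionable message rather than at first use.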
Same for namespace issues: what do we do with functions and methods that are backend-specific? In this case there is no Suggests: alternative, but we can probably keep them all without harm.
The big.data class and big.data.R, albeit a misnomer at this point, can be used to interface with rmr2. We could rename them in a more specific way for clarity.
The pipe class and the pipe.R file are the tough ones. Let's break them down function by function (or group thereof):
- comp is a little functional-programming helper used for delayed evaluation; it can go to common.R or anywhere
- the options environment is Spark-only, but configurable options were already on the dev branch's todo list, to shield the user from accessing rmr2 options directly
- drop.gather changes based on the different representations of keyvals in the two backends
- the make*fun functions are rmr2-specific
- is.data is probably dead code; as.character.pipe certainly is
- set.context should become part of the options subsystem; Spark-only
- print.pipe is backend-independent (for short: SS is Spark-specific, RS is rmr2-specific, and BI is backend-independent), but git diff didn't catch that
- make.f1 is BI
- mergeable, vectorized, and related functions are BI
- gapply, group.f, and ungroup have been completely rewritten for Spark
- the keyval-related functions are SS and replace the keyval logic in rmr2. There is a little duplication, but unless we spin off keyvals into a separate package, what else can we do?
- group is BI
- run, mrexec, and mr.options are RS
- output can't be implemented in Spark right now
- the various as.* conversions are mostly BS (backend-specific) in their implementation
- the tutorial has been modified to avoid writing to a file in SparkR; this could be factored out, but it would make the tutorial less intuitive. Probably best to keep two versions and kick this down the road until the output feature is in SparkR
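For reference, comp in the minimal form its description suggests (a sketch of a delayed-evaluation composition helper; the actual signature in pipe.R may differ):

```r
# A minimal function-composition helper for delayed evaluation:
# comp(f, g) returns a function that applies g first, then f,
# and nothing runs until the composed function is actually called.
comp <- function(f, g) function(...) f(g(...))

inc <- function(x) x + 1
dbl <- function(x) 2 * x
inc.then.dbl <- comp(dbl, inc)   # nothing evaluated yet
inc.then.dbl(3)                  # dbl(inc(3)) = 8
```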
Merge preparations
If we just merged one branch into the other as-is, it could be a disaster. It's probably best to have BS code in separate files and very few selection points where control depends on the backend. This could be implemented the OO way with two classes inheriting from pipe, say pipespark and pipemr, that differ in the methods gapply, group.f, and ungroup. Once these preparations are completed in both branches, the merge should be a lot easier. Another possible preparation could be to somewhat normalize the order of functions, as git diff seems to be confounded by simple code reordering.
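A minimal S3 sketch of that class layout; the class names follow the text, but the constructors and method bodies are placeholders, not the real implementations:

```r
# Each backend's pipe carries its own subclass plus the shared "pipe"
# class, so BI methods live on "pipe" and only the divergent methods
# (gapply, group.f, ungroup) need per-backend versions.
pipemr    <- function(data) structure(list(data = data),
                                      class = c("pipemr", "pipe"))
pipespark <- function(data) structure(list(data = data),
                                      class = c("pipespark", "pipe"))

gapply <- function(x, f) UseMethod("gapply")
gapply.pipemr    <- function(x, f) stop("placeholder: rmr2 implementation")
gapply.pipespark <- function(x, f) stop("placeholder: SparkR implementation")

# BI methods dispatch on the shared parent class.
print.pipe <- function(x, ...) cat("a pipe on the", class(x)[1], "backend\n")
```

With this layout the single selection point is object construction; everything downstream is ordinary method dispatch, and a diff of the two backend files stays meaningful.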
The rmr patch has been merged. It reloads objects twice, so I am not too happy about it, but it was the simplest thing that worked. A pull request has been sent for the SparkR patch. The spark-merge branch has been merged into the spark branch and published, and I also did a routine merge from 0.3.0 through dev. Open problems are:
- dependency on both rmr2 and SparkR. Can we use Suggests: to make them optional?
- keep the spark branch going in parallel, or merge with dev and hide the spark changes for a while? On one hand there is no way we can support every feature on the spark backend; on the other, a parallel branch is a maintenance burden, and it has some features, like plyrmr options, that would be nice to have in dev independently of spark.
I am reorganizing the tests in the spark branch so that they can pass cleanly, with the ones that must be skipped on spark clearly marked. This will allow testing to follow standard procedures and also track progress in feature support on spark (every time we exclude fewer tests, that's progress).
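One lightweight way to mark the skipped tests, sketched below; the helper name and the backend strings are assumptions, and the project's actual test harness may differ:

```r
# Run a test unless the current backend is in its skip list, and report
# skips loudly. A shrinking skip list is a direct measure of progress
# in spark feature support.
run.test <- function(test.name, test.fun, skip.backends = character(),
                     current = getOption("plyrmr.backend", "rmr2")) {
  if (current %in% skip.backends) {
    message("SKIPPED on ", current, ": ", test.name)
    invisible(FALSE)
  } else {
    test.fun()
    invisible(TRUE)
  }
}

# Usage: this test is skipped on spark, runs on the rmr2 backend.
run.test("write-to-file", function() stopifnot(TRUE),
         skip.backends = "spark", current = "rmr2")
```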