Snapshot a pipe in the REPL #918

bholt · 2014-06-26T17:51:34Z

Fixes #879, could be in the territory of #889.

This pull request attempts to address the inability to run sub-flows in the REPL. Rather than resetting the implicit flowDef (which disallows re-running pipes -- #879), this adds the ability to run a "sub-flow" for a given TypedPipe that includes only the upstream pipes and sources needed to compute that tail pipe.

This is accessible in the REPL via a couple of new enrichments to TypedPipe:

- .snapshot: immediately runs and writes the pipe to a temporary SequenceFile, returning a TypedPipe with the SequenceFile as a source.
- .save(dest): same as snapshot but with a user-provided Sink (kinda like shorthand for .write(dest).run), also returning a TypedPipe with the saved file as a new Source.
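
As a rough illustration of the sub-flow idea described above (keep only what is upstream of one tail), here is a minimal sketch on a toy dependency graph. It does not use Scalding's or Cascading's FlowDef; the `Graph` type and the names `upstreamOf` and `subFlowFor` are assumptions made up for this example.

```scala
// Toy model, NOT Scalding's FlowDef: every node lists the nodes it reads from.
object UpstreamSketch {
  // Hypothetical graph type: node name -> names of its direct inputs.
  type Graph = Map[String, Set[String]]

  // Collect `tail` plus every node reachable by following input edges.
  def upstreamOf(graph: Graph, tail: String): Set[String] = {
    @annotation.tailrec
    def loop(frontier: List[String], seen: Set[String]): Set[String] =
      frontier match {
        case Nil => seen
        case n :: rest if seen(n) => loop(rest, seen)
        case n :: rest =>
          loop(graph.getOrElse(n, Set.empty[String]).toList ::: rest, seen + n)
      }
    loop(List(tail), Set.empty[String])
  }

  // Keep only the nodes (and their edges) needed to compute `tail`.
  def subFlowFor(graph: Graph, tail: String): Graph = {
    val keep = upstreamOf(graph, tail)
    graph.filter { case (name, _) => keep(name) }
  }
}
```

With a graph holding two unrelated chains, `subFlowFor` called on the tail of one chain drops the other chain entirely, which is the property that lets a single pipe be run without tripping over unconnected sources and sinks.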

In my current way of thinking, snapshots allow one to inspect intermediate steps of a job as they build it:

> val input = TypedPipe.from(TextLine("tutorial/data/hello.txt"))
> val wordsPipe: TypedPipe[String] = input.flatMap(_.split("\\s+"))
> val s1: TypedPipe[String] = wordsPipe.snapshot

One then has a choice, when continuing to work on a job, to use the snapshot pipe, if they wish to "memoize" the intermediate work, or use the original pipe, which will cause the flow to be re-evaluated.
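
The choice between the snapshot and the original pipe can be pictured with a small stand-in for a pipe. This is only a sketch under stated assumptions: `ToyPipe` is a hypothetical class that recomputes from its source on every read and has nothing to do with Scalding's TypedPipe implementation; the `reads` counter exists purely to make re-evaluation visible.

```scala
object SnapshotSketch {
  // Hypothetical stand-in for a pipe: a recipe that is recomputed on every run.
  final class ToyPipe[T](compute: () => List[T]) {
    def flatMap[U](f: T => Iterable[U]): ToyPipe[U] =
      new ToyPipe(() => compute().flatMap(f))

    def run(): List[T] = compute()

    // Evaluate once, now, and return a pipe backed by the stored result.
    def snapshot: ToyPipe[T] = {
      val stored = compute()
      new ToyPipe(() => stored)
    }
  }

  // Counts how many times the toy "source" has been read.
  var reads = 0

  val input: ToyPipe[String] =
    new ToyPipe(() => { reads += 1; List("hello world", "goodbye world") })

  val words: ToyPipe[String] = input.flatMap(_.split("\\s+").toList)
}
```

In this toy, calling `words.snapshot` reads the source once, and running the returned pipe any number of times does not read it again, while every `words.run()` goes back to the source.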

Before continuing to work on this, I wanted to elicit some feedback on the intended direction. Overall, I'm hoping to come up with behavior that is least surprising to users of Scalding, but that also caters to users coming from other ad-hoc analytics backgrounds (e.g. Pig, Spark).

- Part of using snapshots should be good ways to actually view the contents of the snapshot. I would like to allow a toList kind of behavior that would read the contents into memory. This could either be implicit (.toList calls .snapshot internally), or .snapshot could return a new kind of TypedPipe that allows .toList to be called on it. (#709, "Add a better temp Source/Sink to the repl", may be related -- I believe this "sub-flow" ability may fix one of the outstanding issues there.)
  - There could also be a Pig-like dump call added to snapshots that would just print the contents to stdout (though .toList pretty much subsumes this, as you can print a list to stdout).
- What should the semantics of run be in the context of this change? Currently it is broken because all the snapshots add to the implicit flowDef, causing run on the whole flow to balk at all the unconnected sources and sinks. Should run be fixed by just discarding the extraneous sources and sinks created in the REPL? Or should run just go away, in favor of always calling something like save to commit the contents of a pipe to a single Sink?
- Should snapshot ultimately be completely hidden from the user? This iteration allows an explicit choice to be made between building off of a snapshot of intermediate results or re-running the whole flow. An alternative would be to only create snapshots when people request to see the contents (via toList), and then swap out the memoized pipe for the original so the user gets consistent results. This leads to questions about whether memoized pipes should ever be invalidated when the original contents have changed (which sounds really daunting).

Thoughts?


jcoveney · 2014-06-26T18:00:07Z

Nice! Responding to your overall thoughts first:

- fix run to run the whole flow as the user has specified, ignoring extraneous sources etc.
- add dump
- add toList (the difference from dump is that dump does not have to materialize all of the data in memory; toList is really just reading the result of dump into a List instead of the console)
- exposing snapshots is ok; it will allow users to work more fluently with larger data sets

Great work. Was just going to bug you to put it on github :) Will comment on code as well.
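
The dump/toList distinction in the list above can be sketched over a plain iterator of records. This is an assumption-laden illustration, not the REPL's eventual API: `DumpSketch`, its two methods, and the optional `out` parameter are hypothetical names chosen for the example.

```scala
object DumpSketch {
  // Print records one at a time; only the current record needs to be in memory.
  def dump[T](records: Iterator[T], out: java.io.PrintStream = System.out): Unit =
    records.foreach(r => out.println(r.toString))

  // Read every record into an in-memory List.
  def toList[T](records: Iterator[T]): List[T] = records.toList
}
```

Both consume the same stream of records; the difference is only whether the records are held on to after they have been seen.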

johnynek · 2014-06-26T18:37:06Z

scalding-core/src/main/scala/com/twitter/scalding/Source.scala

   */
-  def writeFrom(pipe: Pipe)(implicit flowDef: FlowDef, mode: Mode) = {
+  def writeFromAndGetTail(pipe: Pipe)(implicit flowDef: FlowDef, mode: Mode) = {


add a return type to public methods (this old code didn't follow that rule).

This makes me pretty nervous, as this breaks old code. Anyone that overrides writeFrom will not have this behavior, right? Is there a way to be compatible?

could we just look through the tails in the flowDef for the sinkName to get the pipe back?

This will break code using writeFromAndGetTail directly when writeFrom is overridden. So far that just entails the REPL. Though we did talk yesterday about maybe looking through the sinks after the call for the new one and working from that. It would seem to be a bit more janky, but it would have far less impact on other code and be robust to overrides too.

Good points. I'll see if I can come up with the least-brittle way to get the right tail pipe.

johnynek · 2014-06-26T19:03:07Z

Also note that #915 is related here. I think if we get that merged, the REPL code can call that and not make a Job.

This looks really great.

bholt · 2014-06-26T19:13:47Z

Also note that #915 is related here. I think if we get that merged, the REPL code can call that and not make a Job.

Agreed. I'll give it a try once that gets merged.

johnynek · 2014-06-26T19:23:09Z

scalding-repl/src/main/scala/com/twitter/scalding/ShellPipe.scala

+
+    // come up with unique temporary filename
+    // TODO: refactor into TemporarySequenceFile class
+    val tmpSeq = "/tmp/scalding-repl/snapshot-" + UUID.randomUUID() + ".seq"


I think we need to add something like def newTempPath: String to Mode, and in Hdfs we need to do this:

https://github.com/cwensel/cascading/blob/f152df983a70b342dd77c02b84f82fd5026b5c0c/cascading-hadoop/src/main/shared/cascading/tap/hadoop/Hfs.java#L723

This might be of use when making a TempSource[T] that extends Source[T] with Sink[T] https://github.com/cwensel/cascading/blob/f152df983a70b342dd77c02b84f82fd5026b5c0c/cascading-hadoop/src/main/shared/cascading/tap/hadoop/util/TempHfs.java#L87
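
A minimal sketch of what a mode-owned temp-path helper could look like, assuming a hypothetical trait rather than Scalding's actual Mode. The names `HasTempPath`, `tempDir`, and `LocalToy` are invented for the example, and it does not reproduce what the linked Cascading code does; it only moves the UUID-based naming from the diff above behind one method.

```scala
import java.util.UUID

object TempPathSketch {
  // Hypothetical trait; this is NOT an existing method on Scalding's Mode.
  trait HasTempPath {
    // Directory under which temporary files for this mode are placed.
    def tempDir: String

    // Build a fresh path such as <tempDir>/<prefix>-<uuid>.<ext>.
    def newTempPath(prefix: String, ext: String): String =
      tempDir.stripSuffix("/") + "/" + prefix + "-" + UUID.randomUUID() + "." + ext
  }

  // Toy instance reusing the directory from the diff above.
  object LocalToy extends HasTempPath {
    val tempDir = "/tmp/scalding-repl"
  }
}
```

A call such as `TempPathSketch.LocalToy.newTempPath("snapshot", "seq")` yields a new path on every invocation, so the snapshot code would no longer need to know where temporary files live.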

jcoveney · 2014-06-27T21:35:14Z

scalding-core/src/main/scala/com/twitter/scalding/RichFlowDef.scala

+  /**
+   * New flow def with only sources upstream from tails.
+   */
+  def withoutUnusedSources = {


public APIs should always have explicit return types

Gets me every time.

jcoveney · 2014-06-27T21:49:18Z

scalding-repl/src/main/scala/com/twitter/scalding/ShellPipe.scala


-    override def mode = inmode
+  def toList[R](implicit ev: T <:< TypedPipe[R], manifest: Manifest[R]): List[R] = {


shouldn't this go in ShellTypedPipe?

That's from before, and should probably just be left for now. Another PR should fix it with some real support for .toIterator, .dump, etc.

I could move it for now, though.


jcoveney · 2014-06-27T23:00:50Z

This is some good stuff @bholt

johnynek · 2014-06-27T23:37:23Z

scalding-core/src/main/scala/com/twitter/scalding/RichFlowDef.scala

+    fd.addSources(o.getSources)
+    fd.addSinks(o.getSinks)
+    fd.addTails(o.getTails)
+    fd.mergeMisc(o)


don't like this. Just add all of it here. No need for two methods.

I see, you call it below. Never mind... Can you add a comment above to explain the strange choice?

This should return Unit to make it clear that there is mutation here.

also, .mergeFrom might make it clearer too.

johnynek · 2014-06-27T23:54:26Z

scalding-repl/src/main/scala/com/twitter/scalding/ShellPipe.scala

-  }
-
-  def toList[R](implicit ev: T <:< TypedPipe[R], manifest: Manifest[R]): List[R] = {
+  def toList(implicit manifest: Manifest[T]): List[T] = {


just delete this for now. It is pretty broken now. It should be save, followed by .toIterator on a typed-version of SequenceFile.

We can get this working in the next PR.

johnynek · 2014-06-27T23:55:37Z

This is getting close to merge. I just had a few comments around renaming, TODOs, and deleting some code.

sriramkrishnan · 2014-06-27T23:58:42Z

Looks like pretty good stuff. How about additions to the README or Wiki page?

bholt · 2014-06-30T17:39:03Z

@johnynek: I've started trying to figure out how to use the new Execution code for the REPL, but I'll do that as a separate PR.

@sriramkrishnan: I think it'll make more sense to update the documentation when we've figured out how we're gonna handle toList and dump, as those would be part of demonstrating how the REPL is supposed to work.

johnynek · 2014-06-30T17:45:02Z

scalding-core/src/main/scala/com/twitter/scalding/RichFlowDef.scala

+   * Mutate current flow def to add all sources/sinks/etc from given FlowDef
+   */
+  def mergeFrom(o: FlowDef): Unit = {
+    fd.addSources(o.getSources)


it looks like the add* methods on FlowDef in Cascading are not idempotent:

https://github.com/cwensel/cascading/blob/wip-2.6/cascading-core/src/main/java/cascading/flow/FlowDef.java#L153

Is that going to be a problem?

In this case, addSources does look idempotent: it's just adding what is already in one map into another.

As for addSource(Pipe), it shouldn't be any different -- my understanding is that head pipes have the same name as their source, so addSource(Pipe,...) should be equivalent to addSource(name: String), which is simply adding to a map.

In any case, all we're doing is re-adding references to the same pipes, so the IllegalArgumentException can't fire (unless something else broke the flow) because we know we already have a valid pipe with 1 head.

Sorry, bad link:

https://github.com/cwensel/cascading/blob/wip-2.6/cascading-core/src/main/java/cascading/flow/FlowDef.java#L128

addSource(String, Tap) is not adding to a map, it is making sure that the key is not already present. Am I missing something?

I am not sure your code ever does this. Does it? But, ultimately, we want Equiv.equiv(x.mergeFrom(x), x) as a law, where Equiv[FlowDef] is set/map equality of all the structure of the flow.

Actually:

val y = x.copy
x.mergeFrom(x)
Equiv.equiv(y, x) // damn mutable variables

As long as you're copying over all sources, as in mergeFrom, you should get map equality for getSources and the rest. The only place where you might have a problem is if you build up your own set of sources, in which case you have to ensure this property yourself -- as we do here in onlyUpstreamFlow.

So, we can merge with this now, but we need to add an issue to fix it.

mergeFrom should be idempotent, or it will be very difficult to use in interesting cases, specifically for the case I have in mind for making TypedPipe referentially transparent.

Okay. So you're proposing making a deterministic choice when keys collide?

Well, the keys can collide with the same value, in that case there is no choice to make. You can still error if there are keys that collide, but the values are not equal.
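
The rule discussed in this thread (a colliding key with an equal value is fine, a colliding key with a different value is an error) can be sketched on immutable maps. This is an illustration only: `mergeMaps` is a hypothetical helper, and it operates on plain Scala `Map`s rather than on the mutable maps inside a FlowDef.

```scala
object MergeSketch {
  // Merge `from` into `into`. Equal key/value pairs are accepted silently;
  // a key bound to two different values is reported as an error.
  def mergeMaps[K, V](into: Map[K, V], from: Map[K, V]): Either[String, Map[K, V]] =
    from.foldLeft[Either[String, Map[K, V]]](Right(into)) {
      case (Right(acc), (k, v)) =>
        acc.get(k) match {
          case Some(existing) if existing != v =>
            Left("conflicting values for key " + k)
          case _ =>
            Right(acc + (k -> v))
        }
      case (err, _) => err // keep the first error once one has occurred
    }
}
```

Under this rule, merging a map with itself returns the same map, which is the idempotence law stated above, while a genuine conflict still surfaces instead of being resolved silently.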

bholt · 2014-06-30T22:45:44Z

@johnynek, I can't think of anything to do to make RichFlowDef safer. Let me know if you think of anything I should fix/change/investigate.

johnynek · 2014-07-01T00:38:30Z

scalding-repl/src/test/scala/com/twitter/scalding/ReplTest.scala

+
+    joined.write(TypedTsv("final_out.tsv"))
+    // run the overall flow (with the 'final_out' sink), uses snapshot 's1'
+    run


Is this test run as part of the Travis build? It looks more like an example than a unit test.

Correct, it currently won't be run. My plan is to test using the new Execution stuff in the next PR. I'd be happy to track these plans in Issues.


johnynek · 2014-07-01T01:07:45Z

Okay. Merged. Can you make issues for the next items you have?

I'd like to have a unit test running to make sure we don't break things.

Brandon Holt and others added 17 commits June 26, 2014 10:05

playing around with things

3d737f8

add 'snapshot' method to repl; saves a TypedPipe to a SequenceFile

19ec513

remove commented-out lines

d837f7e

snapshot successfully creates new flowDef

5e347d1

note: also disabled `resetFlowDef`

much simpler code to find sources and add them to newFlow

b6b5ebf

factor out into 'localizedFlow()', make repl jobs take an optional Fl…

4924b3b

…owDef

implement writeAndRun

66f87ea

this automatically runs the localized job and returns a TypedPipe to a new Source of it

get rid of added code in TypedPipe and ReplImplicits

f13ff17

remove unnecessary write() override

af8b3ff

refactor writeFrom to not break compatibility

bd9c0f4

explain localizedFlow method

ea3b9a3

refactor snapshot enrichment into ShellTypedPipe for better type infe…

e262328

…rence, fix implicits

add ReplTest (currently just checks to see if things compile)

6ac5051

comments

c844f41

explain tests a bit

acbcb2b

rename reachable -> upstream, move pipe methods into ShellTypedPipe

2733a88

revert whitespace changes in TypedPipe

3b85bb2

johnynek reviewed Jun 26, 2014

jcoveney reviewed Jun 27, 2014

add return types

63ba01d

jcoveney reviewed Jun 27, 2014

Brandon Holt added 4 commits June 27, 2014 15:08

get rid of ShellObj, move toList as-is into ShellTypedPipe (with TODO…

c01a188

… to fix it)

fixes after getting rid of ShellObj

759bd86

move localizedFlow method into RichFlowDef as onlyUpstreamFrom(pipe)

a01ef97

accidentally checked in change to project/Build

48693dc

johnynek reviewed Jun 27, 2014

make merge return Unit, explain mergeMisc in comment

fa2bf25

johnynek reviewed Jun 27, 2014

Brandon Holt and others added 4 commits June 27, 2014 17:10

rename merge -> mergeFrom, refactor out 'heads'

70a46c7

delete broken toList

4b53fd3

add TODO's for toList and dump

875af09

add TODO for handling checkpoints

70fa0a1

johnynek reviewed Jun 30, 2014

johnynek reviewed Jul 1, 2014

johnynek added a commit that referenced this pull request Jul 1, 2014

Merge pull request #918 from bholt/repl

fe566cd

Snapshot a pipe in the REPL

johnynek merged commit fe566cd into twitter:develop Jul 1, 2014

bholt deleted the repl branch July 1, 2014 01:08

This was referenced Jul 1, 2014

Use Execution to run flows in REPL #928

Merged

REPL: Add toIterator (and related methods) #929

Merged

