Snapshot a pipe in the REPL #918
Changes from 17 commits
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -108,10 +108,9 @@ abstract class Source extends java.io.Serializable { | |
} | ||
|
||
/** | ||
* write the pipe and return the input so it can be chained into | ||
* the next operation | ||
* Write the given pipe and return the new pipe which was added as the tail | ||
*/ | ||
def writeFrom(pipe: Pipe)(implicit flowDef: FlowDef, mode: Mode) = { | ||
def writeFromAndGetTail(pipe: Pipe)(implicit flowDef: FlowDef, mode: Mode) = { | ||
checkFlowDefNotNull | ||
|
||
//insane workaround for scala compiler bug | ||
|
@@ -124,7 +123,17 @@ abstract class Source extends java.io.Serializable { | |
case (test: TestMode, false) => pipe | ||
case _ => transformForWrite(pipe) | ||
} | ||
flowDef.addTail(new Pipe(sinkName, newPipe)) | ||
val finalPipe = new Pipe(sinkName, newPipe) | ||
flowDef.addTail(finalPipe) | ||
finalPipe | ||
} | ||
|
||
/** | ||
* write the pipe but return the input so it can be chained into | ||
* the next operation | ||
*/ | ||
def writeFrom(pipe: Pipe)(implicit flowDef: FlowDef, mode: Mode) = { | ||
Review comment: add a return type please. |
||
writeFromAndGetTail(pipe) | ||
pipe | ||
} | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -16,7 +16,10 @@ | |
package com.twitter.scalding | ||
|
||
import cascading.flow.Flow | ||
import cascading.flow.FlowDef | ||
import cascading.pipe.Pipe | ||
import java.util.UUID | ||
import com.twitter.scalding.ReplImplicits._ | ||
|
||
/** | ||
* Adds ability to run a pipe in the REPL. | ||
|
@@ -30,12 +33,12 @@ class ShellObj[T](obj: T) { | |
* @param args that should be used to construct the job. | ||
* @return a job that can be used to run the data pipeline. | ||
*/ | ||
private[scalding] def getJob(args: Args, inmode: Mode): Job = new Job(args) { | ||
private[scalding] def getJob(args: Args, inmode: Mode, inFlowDef: FlowDef): Job = new Job(args) { | ||
/** | ||
* The flow definition used by this job, which should be the same as that used by the user | ||
* when creating their pipe. | ||
*/ | ||
override val flowDef = ReplImplicits.flowDef | ||
override val flowDef = inFlowDef | ||
|
||
override def mode = inmode | ||
|
||
|
@@ -89,24 +92,92 @@ class ShellObj[T](obj: T) { | |
*/ | ||
override def buildFlow: Flow[_] = { | ||
Review comment: this override is just calling super. Let's drop that. |
||
val flow = super.buildFlow | ||
ReplImplicits.resetFlowDef() | ||
flow | ||
} | ||
} | ||
|
||
/** | ||
* Runs this pipe as a Scalding job. | ||
*/ | ||
def run() { | ||
val args = new Args(Map()) | ||
getJob(args, ReplImplicits.mode).run | ||
} | ||
def run(inFlowDef: FlowDef = ReplImplicits.flowDef) = | ||
getJob(new Args(Map()), ReplImplicits.mode, inFlowDef).run | ||
|
||
def toList[R](implicit ev: T <:< TypedPipe[R], manifest: Manifest[R]): List[R] = { | ||
Review comment: this needs to be totally redone.
Review comment: in fact, it should be on ShellTypedPipe
Review comment: Actually I was thinking the opposite: this ensures that only the sources that are upstream from sinks are included. The snapshot and save methods are special cases where we drop all the other sinks and just do the new one. But I don't yet know how to make this clean.
Review comment: shouldn't this go in ShellTypedPipe?
Review comment: That's from before, and should probably just be left for now. Another PR should fix it with some real support for .toIterator, .dump, etc.
Review comment: I could move it for now, though. |
||
import ReplImplicits._ | ||
ev(obj).toPipe("el").write(Tsv("item")) | ||
run() | ||
TypedTsv[R]("item").toIterator.toList | ||
} | ||
|
||
} | ||
|
||
/** | ||
* Enrichment on TypedPipes allowing them to be run locally, independent of the overall flow. | ||
* @param pipe to wrap | ||
*/ | ||
class ShellTypedPipe[T](pipe: TypedPipe[T]) extends ShellObj[TypedPipe[T]](pipe) { | ||
Review comment: these functions are way too useful to be put here. Let's move them to
Review comment: You're referring to the two operations on pipes (
Review comment: Yes, I am. |
||
|
||
/** | ||
* Iterator for all pipes reachable from this pipe (recursively using 'Pipe.getPrevious') | ||
*/ | ||
def upstreamPipes(inpipe: Pipe): Iterator[Pipe] = | ||
Review comment: Iterators can be dangerous types (due to mutability) to return, and I generally only like them for performance critical code. Can we make this a List or a Set? By putting
Review comment: Isn't this the same as getHeads?
Review comment: I guess not, this is the transitive closure of this and the previous.
Review comment: Yep, I'll note that in the comments. |
||
Iterator | ||
.iterate(Seq(inpipe))(pipes => for (pipe <- pipes; prev <- pipe.getPrevious) yield prev) | ||
.takeWhile(_.length > 0) | ||
.flatten | ||
|
||
/** | ||
* Construct a new FlowDef for only the flow that ends with the given pipe. | ||
* That is, it copies over only the sources and sinks that contribute to the | ||
* flow, allowing repl users to build up flows incrementally. | ||
*/ | ||
def localizedFlow(tailPipe: Pipe): FlowDef = { | ||
val newFlow = getEmptyFlowDef | ||
Review comment: is this the same as
Review comment: In fact, I don't want the name-change that |
||
|
||
val sourceTaps = flowDef.getSources | ||
val newSrcs = newFlow.getSources | ||
|
||
upstreamPipes(tailPipe) | ||
.filter(_.getParent == null) | ||
.flatMap(_.getHeads) | ||
Review comment: This seems redundant. If
Review comment: Good catch, you're right.
Review comment: Actually, it turns out that all of these have "getParent == null", so really the non-redundant criteria should be (_.getPrevious.length == 0). |
||
.foreach(head => | ||
Review comment: prefer |
||
if (!newSrcs.containsKey(head.getName)) | ||
newFlow.addSource(head, sourceTaps.get(head.getName))) | ||
|
||
newFlow.addTailSink(tailPipe, flowDef.getSinks.get(tailPipe.getName)) | ||
|
||
newFlow | ||
} | ||
|
||
/** | ||
* Shorthand for .write(dest).run | ||
*/ | ||
def save(dest: TypedSink[T] with Mappable[T]): TypedPipe[T] = { | ||
|
||
val d = dest | ||
val thisPipe = pipe.toPipe(d.sinkFields)(d.setter) | ||
val outPipe = d.writeFromAndGetTail(thisPipe) | ||
|
||
run(localizedFlow(outPipe)) | ||
|
||
TypedPipe.from(d) | ||
} | ||
|
||
/** | ||
* Save snapshot of a typed pipe to a temporary sequence file. | ||
* @return A TypedPipe to a new Source, reading from the sequence file. | ||
*/ | ||
def snapshot: TypedPipe[T] = { | ||
import ReplImplicits._ | ||
Review comment: can you comment what this is getting? |
||
|
||
// come up with unique temporary filename | ||
// TODO: refactor into TemporarySequenceFile class | ||
Review comment: +1 |
||
val tmpSeq = "/tmp/scalding-repl/snapshot-" + UUID.randomUUID() + ".seq" | ||
Review comment: I think we need something like
Review comment: This might be of use when making a |
||
val outPipe = SequenceFile(tmpSeq, 'record).writeFromAndGetTail(pipe.toPipe('record)) | ||
|
||
run(localizedFlow(outPipe)) | ||
|
||
TypedPipe.fromSingleField[T](SequenceFile(tmpSeq)) | ||
} | ||
|
||
} |
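The review thread on `upstreamPipes` asks for a `List` or `Set` rather than an `Iterator`, and notes that the result is the transitive closure over `getPrevious`. The following is a minimal sketch of that idea, not the code from this PR. It uses a hypothetical stand-in class `Node` in place of Cascading's `Pipe` so that it runs without dependencies; `Node`, `previous`, `upstream`, and `heads` are illustrative names and are not part of Scalding or Cascading.

```scala
// Hypothetical stand-in for cascading.pipe.Pipe: a named node that knows
// its immediate predecessors (the analogue of Pipe.getPrevious).
final class Node(val name: String, val previous: List[Node]) {
  override def toString: String = name
}

object UpstreamSketch {
  // Transitive closure of `start` over `previous`, returned as an
  // immutable Set rather than an Iterator.
  def upstream(start: Node): Set[Node] = {
    @annotation.tailrec
    def loop(frontier: List[Node], seen: Set[Node]): Set[Node] =
      frontier match {
        case Nil                  => seen
        case n :: rest if seen(n) => loop(rest, seen) // already visited
        case n :: rest            => loop(n.previous ::: rest, seen + n)
      }
    loop(List(start), Set.empty)
  }

  // Heads are the upstream nodes with no predecessors, mirroring the
  // `_.getPrevious.length == 0` criterion suggested in the review.
  def heads(start: Node): Set[Node] =
    upstream(start).filter(_.previous.isEmpty)
}
```

Tracking the `seen` set means a node reachable along more than one path is visited once, whereas an `Iterator.iterate`-based traversal yields it once per path.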
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,44 @@ | ||
/* | ||
Copyright 2012 Twitter, Inc. | ||
|
||
Licensed under the Apache License, Version 2.0 (the "License"); | ||
you may not use this file except in compliance with the License. | ||
You may obtain a copy of the License at | ||
|
||
http://www.apache.org/licenses/LICENSE-2.0 | ||
|
||
Unless required by applicable law or agreed to in writing, software | ||
distributed under the License is distributed on an "AS IS" BASIS, | ||
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
See the License for the specific language governing permissions and | ||
limitations under the License. | ||
*/ | ||
|
||
package com.twitter.scalding | ||
|
||
object ReplTest { | ||
import ReplImplicits._ | ||
|
||
def test() { | ||
val hello = TypedPipe.from(TextLine("tutorial/data/hello.txt")) | ||
|
||
val wordScores = | ||
TypedPipe.from(OffsetTextLine("tutorial/data/words.txt")) | ||
.map{ case (offset, word) => (word, offset) } | ||
Review comment: is this just here to be explicit?
Review comment: Reordering so it will group by word. Is that not needed?
Review comment: oh whoops totally misread that haha. carry on...
Review comment: you could do _.swap but this is fine |
||
.group | ||
|
||
// snapshot intermediate results without wiring everything up | ||
val s1 = hello.snapshot | ||
|
||
val s2 = hello.save(TypedTsv("dump.tsv")) | ||
|
||
// use snapshot in further flows | ||
val linesByWord = s1.flatMap(_.split("\\s+")).groupBy(_.toLowerCase) | ||
val counts = linesByWord.size | ||
|
||
// ensure snapshot enrichment works on KeyedListLike (CoGrouped, UnsortedGrouped), too | ||
val s3 = counts.snapshot | ||
val s4 = linesByWord.join(wordScores).snapshot | ||
} | ||
|
||
} |
Review comment: add a return type to public methods (this old code didn't follow that rule).
Review comment: This makes me pretty nervous, as this breaks old code. Anyone that overrides writeFrom will not have this behavior, right? Is there a way to be compatible?
Review comment: could we just look through the tails in the flowDef for the sinkName to get the pipe back?
Review comment: This will break code using writeFromAndGetTail directly when writeFrom is overridden. So far that just entails the REPL. Though we did talk yesterday about maybe looking through the sinks after the call for the new one and working from that. It would seem to be a bit more janky, maybe, but it would have far less of an impact on other code and be robust to overrides too.
Review comment: Good points. I'll see if I can come up with the least-brittle way to get the right tail pipe.
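The alternative raised in this thread, recovering the tail by name after calling an unmodified `writeFrom`, can be sketched roughly as follows. This is only an illustration under stated assumptions: `TailPipe` and `FlowSketch` are hypothetical stand-ins for Cascading's `Pipe` and `FlowDef`, and `TailLookup` is not an existing Scalding object. Against the real API the lookup would presumably run over the tails registered on the `FlowDef`.

```scala
import scala.collection.mutable.ListBuffer

// Hypothetical stand-ins for Cascading's Pipe and FlowDef; only the
// pieces needed for the lookup are modelled.
final case class TailPipe(name: String)

final class FlowSketch {
  private val tailBuffer = ListBuffer.empty[TailPipe]
  def addTail(pipe: TailPipe): Unit = tailBuffer += pipe
  def tails: List[TailPipe] = tailBuffer.toList
}

object TailLookup {
  // Stand-in for writeFrom: registers a tail named after the sink but
  // does not hand it back, as an overriding implementation might.
  def writeFrom(flow: FlowSketch, sinkName: String): Unit =
    flow.addTail(TailPipe(sinkName))

  // Recover the tail afterwards by matching on the sink name.
  // Returns None if no registered tail has that name.
  def findTail(flow: FlowSketch, sinkName: String): Option[TailPipe] =
    flow.tails.find(_.name == sinkName)
}
```

Because the lookup only reads state that any `writeFrom` implementation has to leave behind, it should keep working when `writeFrom` is overridden, at the cost of relying on sink names being unique within the flow.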