-
Notifications
You must be signed in to change notification settings - Fork 703
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Typed Tutorial #897
Typed Tutorial #897
Conversation
**/ | ||
case "0" | "1" => { | ||
|
||
val input_raw = TextLine(args("input")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
JVM languages generally prefer camelCase
This is really great! I think it might be very useful to have some more which showcase some of the methods that don't exist at all in the fields API, like mapValues and whatnot. These are things people often don't realize exist. Mainly formatting stuff. |
// we'll end up with a new entry for each word. | ||
.flatMap{ _.split("\\s") } | ||
// output of flatMap is still a collection of String | ||
.write(TypedTsv[String](args("output"))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you didn't put the type above. Let's be consistent. I lean towards putting the type on output sinks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll leave the types on each of the examples. The comment in "case 2" said that you could leave out the types, but in general you should use them for safety as things change.
- avoid qualified names (add imports) - use camelCase, TypedPipe.from, parentheses for simple closures - use TypedTsv for word scores (and add 'word_scores.tsv') to demonstrate using typed sources - also improved comments a bit
…ield For simplicity, TextLine in the typed DSL discards the byte offset which Cascading provides. This source is an alternative that keeps around the offset.
Working on adding tests, and then I'll ask for a final once-over. |
Thanks for all the helpful feedback, @jcoveney & @johnynek. Based on @johnynek's suggestion, I added a new source, I also tried to address the rest of the feedback. Any other suggestions for this? |
TupleConverter.asSuperConverter[(Long,String), U](TupleConverter.of[(Long,String)]) | ||
|
||
//In TextLine, 0 is the byte position, the text string is in column 1 | ||
//override def sourceFields = Dsl.intFields((Seq(0),Seq(1))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no commented lines, please. Also, don't you need to give the sourceFields?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, meant to delete those lines. The default is to get all the fields by using converter
to get the arity: Dsl.intFields(0 until converter.arity)
(TypedSource:37), which does what we want (I think).
Okay, thanks. Added an additional section on interop, explaining about TypedPipe.from and toPipe. |
*/ | ||
class OffsetTextLine(filepath: String, | ||
override val sinkMode: SinkMode, | ||
override val textEncoding: String = CHTextLine.DEFAULT_CHARSET) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is repeated on like 396. There is a risk of them getting out of sync. Remove the default here, just have it in the object.
Added TypedTutorial to @sriramkrishnan's new tutorial_test script. Are we sure we don't want to run all but the ReplTutorial test in ScalaTest? Guessing it would run way faster than firing up |
IMO running it under tutorial_test.sh serves as an integration test for both the scald.rb and scald-repl.sh. And full disclosure, that script has existed long before I touched it :). |
Fair enough. The way things are right now, we don't verify the output, so they are pretty simplistic tests as it is. If it's not a concern to anyone else, no reason to rock the boat. |
// select the line offset and score fields | ||
.map{ case (word,(offset,score)) => (offset,score) } | ||
// group by line offset (groups all the words for a line together) | ||
.group |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we also mention that this is the same as sumByKey
?
As part of learning Scalding, I wrote a new version of the tutorials that uses the Typed API, and I thought I would share it here in case it could be helpful to people learning Scalding for the first time, so they have some more examples walking through how the Typed API works.
If you have feedback on style, idioms, or my descriptions are incorrect or unclear, I would love to hear it so I can fix it, and more importantly, learn how scalding works.
I also made this all one file, with the different tutorials selected using a command-line flag because I found it easier to see everything in one file. But I would be happy to break it out into separate files as the original tutorial did.