Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Typed Tutorial #897

Merged
merged 23 commits into from
Jun 13, 2014
Merged

Typed Tutorial #897

merged 23 commits into from
Jun 13, 2014

Conversation

bholt
Copy link
Contributor

@bholt bholt commented Jun 11, 2014

As part of learning Scalding, I wrote a new version of the tutorials that uses the Typed API, and I thought I would share it here in case it could be helpful to people learning Scalding for the first time, so they have some more examples walking through how the Typed API works.

If you have feedback on style, idioms, or my descriptions are incorrect or unclear, I would love to hear it so I can fix it, and more importantly, learn how scalding works.

I also made this all one file, with the different tutorials selected using a command-line flag because I found it easier to see everything in one file. But I would be happy to break it out into separate files as the original tutorial did.

**/
case "0" | "1" => {

val input_raw = TextLine(args("input"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JVM languages generally prefer camelCase

@jcoveney
Copy link
Contributor

This is really great! I think it might be very useful to have some more which showcase some of the methods that don't exist at all in the fields API, like mapValues and whatnot. These are things people often don't realize exist. Mainly formatting stuff.

// we'll end up with a new entry for each word.
.flatMap{ _.split("\\s") }
// output of flatMap is still a collection of String
.write(TypedTsv[String](args("output")))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you didn't put the type above. Let's be consistent. I lean towards putting the type on output sinks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll leave the types on each of the examples. The comment in "case 2" said that you could leave out the types, but in general you should use them for safety as things change.

Brandon Holt added 5 commits June 11, 2014 14:13
- avoid qualified names (add imports)
- use camelCase, TypedPipe.from, parentheses for simple closures
- use TypedTsv for word scores (and add 'word_scores.tsv') to
  demonstrate using typed sources
- also improved comments a bit
…ield

For simplicity, TextLine in the typed DSL discards the byte offset which Cascading provides. This source is an alternative that keeps around the offset.
@bholt
Copy link
Contributor Author

bholt commented Jun 11, 2014

Working on adding tests, and then I'll ask for a final once-over.

@bholt
Copy link
Contributor Author

bholt commented Jun 12, 2014

Thanks for all the helpful feedback, @jcoveney & @johnynek.

Based on @johnynek's suggestion, I added a new source, OffsetTextLine that keeps both the 'offset and 'line fields when used as a TypedPipe. Let me know if a different name would be better.

I also tried to address the rest of the feedback. Any other suggestions for this?

TupleConverter.asSuperConverter[(Long,String), U](TupleConverter.of[(Long,String)])

//In TextLine, 0 is the byte position, the text string is in column 1
//override def sourceFields = Dsl.intFields((Seq(0),Seq(1)))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no commented lines, please. Also, don't you need to give the sourceFields?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, meant to delete those lines. The default is to get all the fields by using converter to get the arity: Dsl.intFields(0 until converter.arity) (TypedSource:37), which does what we want (I think).

@bholt
Copy link
Contributor Author

bholt commented Jun 12, 2014

Okay, thanks. Added an additional section on interop, explaining about TypedPipe.from and toPipe.

*/
class OffsetTextLine(filepath: String,
override val sinkMode: SinkMode,
override val textEncoding: String = CHTextLine.DEFAULT_CHARSET)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is repeated on like 396. There is a risk of them getting out of sync. Remove the default here, just have it in the object.

@bholt
Copy link
Contributor Author

bholt commented Jun 13, 2014

Added TypedTutorial to @sriramkrishnan's new tutorial_test script. Are we sure we don't want to run all but the ReplTutorial test in ScalaTest? Guessing it would run way faster than firing up scald.rb for each one.

@sriramkrishnan
Copy link
Collaborator

IMO running it under tutorial_test.sh serves as an integration test for both the scald.rb and scald-repl.sh. And full disclosure, that script has existed long before I touched it :).

@bholt
Copy link
Contributor Author

bholt commented Jun 13, 2014

Fair enough. The way things are right now, we don't verify the output, so they are pretty simplistic tests as it is. If it's not a concern to anyone else, no reason to rock the boat.

// select the line offset and score fields
.map{ case (word,(offset,score)) => (offset,score) }
// group by line offset (groups all the words for a line together)
.group
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we also mention that this is the same as sumByKey?

johnynek added a commit that referenced this pull request Jun 13, 2014
@johnynek johnynek merged commit 84c13c2 into twitter:develop Jun 13, 2014
@bholt bholt deleted the typed-tutorial branch June 13, 2014 17:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants