Typed Tutorial #897

bholt · 2014-06-11T17:22:14Z

As part of learning Scalding, I wrote a new version of the tutorials that uses the Typed API. I thought I would share it here in case it helps people learning Scalding for the first time by giving them more examples that walk through how the Typed API works.

If you have feedback on style or idioms, or if my descriptions are incorrect or unclear, I would love to hear it so I can fix it and, more importantly, learn how Scalding works.

I also made this all one file, with the different tutorials selected using a command-line flag, because I found it easier to see everything in one place. I would be happy to break it out into separate files, as the original tutorial did.

jcoveney · 2014-06-11T17:52:22Z

tutorial/TypedTutorial.scala

+    **/
+    case "0" | "1" => {
+
+      val input_raw = TextLine(args("input"))


JVM languages generally prefer camelCase

jcoveney · 2014-06-11T18:03:05Z

This is really great! I think it might be very useful to have some more examples that showcase some of the methods that don't exist at all in the fields API, like mapValues and whatnot. These are things people often don't realize exist. Mainly formatting stuff.
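For readers who have not met mapValues, here is a minimal sketch of how it can be used on a grouped pipe in the Typed API. The job name `CharsByInitialJob` and the `input`/`output` argument names are illustrative assumptions and are not taken from this PR.

```scala
import com.twitter.scalding._

// Hypothetical job (not part of this PR): total characters per initial letter.
class CharsByInitialJob(args: Args) extends Job(args) {
  TypedPipe.from(TextLine(args("input")))             // TypedPipe[String], one entry per line
    .flatMap(_.split("\\s+"))                         // TypedPipe[String], one entry per word
    .filter(_.nonEmpty)                               // drop empty tokens left by leading whitespace
    .map(word => (word.take(1).toLowerCase, word))    // TypedPipe[(String, String)]
    .group                                            // Grouped[String, String], keyed by initial
    .mapValues(_.length.toLong)                       // rewrites the values only; keys are untouched
    .sum                                              // adds up the lengths for each key
    .toTypedPipe                                      // back to TypedPipe[(String, Long)]
    .write(TypedTsv[(String, Long)](args("output")))  // typed sink, type stated explicitly
}
```

Because mapValues never touches the key, it reads more directly than a `map` that has to rebuild the whole `(key, value)` pair.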

johnynek · 2014-06-11T18:05:56Z

tutorial/TypedTutorial.scala

+        // we'll end up with a new entry for each word.
+        .flatMap{ _.split("\\s") }
+        // output of flatMap is still a collection of String
+        .write(TypedTsv[String](args("output")))


you didn't put the type above. Let's be consistent. I lean towards putting the type on output sinks.

I'll leave the types on each of the examples. The comment in "case 2" said that you could leave out the types, but in general you should use them for safety as things change.

bholt · 2014-06-11T23:49:57Z

Working on adding tests, and then I'll ask for a final once-over.

bholt · 2014-06-12T18:19:58Z

Thanks for all the helpful feedback, @jcoveney & @johnynek.

Based on @johnynek's suggestion, I added a new source, OffsetTextLine, that keeps both the 'offset and 'line fields when used as a TypedPipe. Let me know if a different name would be better.
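A short sketch of how the new source might be used from the Typed API. It assumes the `OffsetTextLine` companion object can be applied to just a path (with the sink mode and encoding defaulted there); the job name `LineLengthsJob` and the argument names are hypothetical.

```scala
import com.twitter.scalding._

// Hypothetical job (not part of this PR): record the length of each non-empty line,
// keyed by the byte offset at which the line starts.
class LineLengthsJob(args: Args) extends Job(args) {
  // Unlike the typed TextLine, each element here keeps the offset next to the text.
  val lines: TypedPipe[(Long, String)] =
    TypedPipe.from(OffsetTextLine(args("input")))

  lines
    .filter { case (_, line) => line.nonEmpty }            // skip blank lines
    .map { case (offset, line) => (offset, line.length) }  // TypedPipe[(Long, Int)]
    .write(TypedTsv[(Long, Int)](args("output")))
}
```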

I also tried to address the rest of the feedback. Any other suggestions for this?

johnynek · 2014-06-12T18:47:30Z

scalding-core/src/main/scala/com/twitter/scalding/FileSource.scala

+    TupleConverter.asSuperConverter[(Long,String), U](TupleConverter.of[(Long,String)])
+
+  //In TextLine, 0 is the byte position, the text string is in column 1
+  //override def sourceFields = Dsl.intFields((Seq(0),Seq(1)))


no commented lines, please. Also, don't you need to give the sourceFields?

Oops, meant to delete those lines. The default is to get all the fields by using converter to get the arity: Dsl.intFields(0 until converter.arity) (TypedSource:37), which does what we want (I think).

bholt · 2014-06-12T20:45:26Z

Okay, thanks. Added an additional section on interop, explaining about TypedPipe.from and toPipe.
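For context, a minimal sketch of the kind of interop that section describes: moving from a fields-based Pipe to a TypedPipe with TypedPipe.from, and back with toPipe. This is not the code added in the PR; the job name `InteropSketchJob`, the field names 'word and 'count, and the argument names are assumptions made for illustration.

```scala
import cascading.pipe.Pipe
import com.twitter.scalding._

// Hypothetical job (not part of this PR): word count that crosses between both APIs.
class InteropSketchJob(args: Args) extends Job(args) {
  // Fields API: reading a TextLine yields a Pipe with the fields 'offset and 'line.
  val rawPipe: Pipe = TextLine(args("input")).read

  // Fields -> Typed: pick the field(s) to keep and state the element type.
  val lines: TypedPipe[String] = TypedPipe.from[String](rawPipe, 'line)

  val counts: TypedPipe[(String, Long)] =
    lines
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .map(word => (word, 1L))
      .sumByKey
      .toTypedPipe

  // Typed -> Fields: name the tuple positions so fields-based operations can refer to them.
  val countsPipe: Pipe = counts.toPipe[(String, Long)](('word, 'count))

  countsPipe
    .filter('count) { count: Long => count > 1L }  // a fields-API operation on the named field
    .write(Tsv(args("output")))
}
```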

johnynek · 2014-06-12T20:54:18Z

scalding-core/src/main/scala/com/twitter/scalding/FileSource.scala

+ */
+class OffsetTextLine(filepath: String,
+                       override val sinkMode: SinkMode,
+                       override val textEncoding: String = CHTextLine.DEFAULT_CHARSET)


this is repeated on line 396. There is a risk of them getting out of sync. Remove the default here, just have it in the object.


bholt · 2014-06-13T00:15:20Z

Added TypedTutorial to @sriramkrishnan's new tutorial_test script. Are we sure we don't want to run all but the ReplTutorial test in ScalaTest? Guessing it would run way faster than firing up scald.rb for each one.

sriramkrishnan · 2014-06-13T00:24:02Z

IMO running it under tutorial_test.sh serves as an integration test for both scald.rb and scald-repl.sh. And full disclosure, that script existed long before I touched it :).

bholt · 2014-06-13T00:33:18Z

Fair enough. The way things are right now, we don't verify the output, so they are pretty simplistic tests as it is. If it's not a concern to anyone else, no reason to rock the boat.

johnynek · 2014-06-13T01:26:25Z

tutorial/TypedTutorial.scala

+          // select the line offset and score fields
+          .map{ case (word,(offset,score)) => (offset,score) }
+          // group by line offset (groups all the words for a line together)
+          .group


should we also mention that this is the same as sumByKey?
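A sketch of the equivalence raised here, on the assumption that the truncated snippet above continues with `.sum` after `.group`. The job name `LineScoresJob` and the argument names `input`, `grouped`, and `summed` are hypothetical; both outputs should contain the same per-key totals.

```scala
import com.twitter.scalding._

// Hypothetical job (not part of this PR): sum scores per line offset in two equivalent ways.
class LineScoresJob(args: Args) extends Job(args) {
  // (line offset, score) pairs, read from a typed source.
  val scores: TypedPipe[(Long, Double)] =
    TypedPipe.from(TypedTsv[(Long, Double)](args("input")))

  // Explicit form: group by key, then sum the values in each group.
  scores.group.sum
    .toTypedPipe
    .write(TypedTsv[(Long, Double)](args("grouped")))

  // Shorthand: sumByKey groups by the first tuple element and sums the second.
  scores.sumByKey
    .toTypedPipe
    .write(TypedTsv[(Long, Double)](args("summed")))
}
```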


Brandon Holt and others added 9 commits June 11, 2014 10:10

Tutorial 0-4 implemented for typed tutorial.

31485eb

most of Tutorial5 (joins) implemented

7cc920c

working on getting Tutorial5 to do the right thing

82772d2

working Tutorial 5

540f244

refactor comments

3456b6e

clean up comment

87179e4

add comments, put 'TextLine()' into each case

68c2d6a

move out of the 'typed/' directory

62f5c65

fix tabs -> spaces

41d7f36

jcoveney reviewed Jun 11, 2014

johnynek reviewed Jun 11, 2014

Brandon Holt added 5 commits June 11, 2014 14:13

incorporate style feedback

70c49d0

- avoid qualified names (add imports)
- use camelCase, TypedPipe.from, parentheses for simple closures
- use TypedTsv for word scores (and add 'word_scores.tsv') to demonstrate using typed sources
- also improved comments a bit

get rid of explicit CoGrouped/UnsortedGrouped references, explain types in comments

5e2f687

Refactor TextLine and add 'OffsetTextSource' which keeps the offset field

6ad0ad4

For simplicity, TextLine in the typed DSL discards the byte offset which Cascading provides. This source is an alternative that keeps around the offset.

rename OffsetTextSource -> OffsetLineSource

3b0d514

use OffsetLineSource in typed tutorial

d11d569

Brandon Holt added 2 commits June 12, 2014 11:14

move OffsetLineSource up next to TextLine

bf64916

rename OffsetLineSource -> OffsetTextLine

2fe660a

johnynek reviewed Jun 12, 2014

Brandon Holt added 2 commits June 12, 2014 12:52

remove commented-out lines

bc00bc3

explicit TypedPipe.from, add section on interop

45cace7

johnynek reviewed Jun 12, 2014

Brandon Holt added 4 commits June 12, 2014 14:07

remove duplicate default

b994339

move TDsl import to where it's needed

a5dbc0b

Merge branch 'develop' into typed-tutorial (to get tutorial tests script)

ec63f93

add tests for TypedTutorial

2deff79

johnynek reviewed Jun 13, 2014

clean up comments

b3b64e3

johnynek added a commit that referenced this pull request Jun 13, 2014

Merge pull request #897 from bholt/typed-tutorial

84c13c2

Typed Tutorial

johnynek merged commit 84c13c2 into twitter:develop Jun 13, 2014

bholt deleted the typed-tutorial branch June 13, 2014 17:08
