Part 1 of the new scalding-db and scalding-db-macros subprojects. Addresses #1124

This adds @ianoc's original JDBC macros along with some refactoring and fixes done during internal use at Twitter.

Part 2, to follow, will add the new JDBC source and sink classes built on these macros.

The motivation is to provide a better way of defining typed JDBC sources and sinks, in two simple steps:

  1. Defining a case class that represents your table schema:
case class ExampleDBRecord(
  user_id: Long,
  tweet_id: Long,
  created_at: java.util.Date,
  deleted: Boolean)
  2. Defining a TypedJDBCSource based on the above case class:
case class ExampleDBRecordsTable(implicit dbsInEnv: AvailableDatabases)
    extends TypedJDBCSource[ExampleDBRecord](dbsInEnv) {
  override val tableName = TableName("example_table")
  override val database = Database("example_schema")
}

Under the hood, this case class is automatically mapped to the underlying DB schema using a macro-generated DBTypeDescriptor. Users do not need to specify SQL mappings or Injections for the translation in either direction.
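
To make the mapping concrete, here is a minimal hand-written sketch of the column information one would expect to be derived for ExampleDBRecord, following the type mapping listed in the README (Long to BIGINT, java.util.Date to DATETIME, Boolean to BOOLEAN). The names `ColumnMappingSketch`, `Column`, `render` and `createTable` are hypothetical and are not part of scalding-db; the macro produces its own descriptor type, which may look quite different.

```scala
// Hypothetical stand-in for macro-derived column metadata.
// None of these names come from scalding-db itself.
object ColumnMappingSketch {
  final case class Column(name: String, sqlType: String, nullable: Boolean)

  // Columns one would expect for ExampleDBRecord, written out by hand.
  val exampleColumns: List[Column] = List(
    Column("user_id", "BIGINT", nullable = false),
    Column("tweet_id", "BIGINT", nullable = false),
    Column("created_at", "DATETIME", nullable = false),
    Column("deleted", "BOOLEAN", nullable = false)
  )

  // Render one column definition in generic SQL syntax.
  def render(c: Column): String = {
    val nullability = if (c.nullable) "NULL" else "NOT NULL"
    s"${c.name} ${c.sqlType} $nullability"
  }

  // Assemble a CREATE TABLE statement from the column list.
  def createTable(table: String, cols: List[Column]): String =
    cols.map(render).mkString(s"CREATE TABLE $table (", ", ", ")")
}
```

Calling `ColumnMappingSketch.createTable("example_table", ColumnMappingSketch.exampleColumns)` yields a single CREATE TABLE statement with the four columns in declaration order.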

There are also optimizations for avoiding common pitfalls when talking to databases directly from Hadoop nodes (too many open connections, inefficient OFFSET-based queries, table lock contention). For smaller datasets, this includes performing a JDBC -> HDFS snapshot via the submitter, and likewise for writes. These optimizations will be in a separate PR along with the new JDBC source classes.
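
As an illustration of the OFFSET pitfall only: this description does not say which query strategy the new sources use, so the keyset-style query below is an assumption, shown as one common alternative. The object name `PagingSketch` and the `id` column are hypothetical.

```scala
// Illustration only; not the query logic used by scalding-db.
object PagingSketch {
  // OFFSET paging: the database typically has to walk past `offset` rows
  // before returning a page, so later pages get progressively slower.
  def offsetQuery(table: String, pageSize: Int, offset: Long): String =
    s"SELECT * FROM $table ORDER BY id LIMIT $pageSize OFFSET $offset"

  // Keyset paging: resume after the last key seen, which lets an index
  // on the (hypothetical) `id` column seek straight to the next page.
  def keysetQuery(table: String, pageSize: Int, lastId: Long): String =
    s"SELECT * FROM $table WHERE id > $lastId ORDER BY id LIMIT $pageSize"
}
```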

@ianoc ianoc commented on an outdated diff May 1, 2015
@@ -13,4 +13,6 @@ addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.10.2")
addSbtPlugin("com.typesafe.sbt" % "sbt-ghpages" % "0.5.1")
+addSbtPlugin("com.typesafe" % "sbt-mima-plugin" % "0.1.6")
ianoc May 1, 2015 Collaborator

mima is already in this file

ianoc commented May 1, 2015

@rubanm can you add in the description some examples of using the new jdbc stuff to show motivation for this stuff

ianoc commented May 1, 2015

Looks great though, thanks for splitting this out

rubanm commented May 1, 2015

@ianoc updated the description

johnynek commented May 6, 2015

can we add a README or something that explains how to get going with this? It's great, but reading the code to learn it is pretty hard.

rubanm commented May 6, 2015

@johnynek makes sense. Added a README. It probably needs a few more edits though.

@johnynek johnynek commented on an outdated diff May 22, 2015
+ res1: cascading.tuple.Fields = 'card_id', 'tweet_id', 'created_at', 'deleted' | long, long, Date, boolean
+### Supported Mappings
+Scala type | SQL type
+------------- | -------------
+`Int` | `INTEGER`
+`Long` | `BIGINT`
+`Short` | `SMALLINT`
+`Double` | `DOUBLE`
+`@varchar @size(20) String` | `VARCHAR(20)`
+`@text String` | `TEXT`
+`java.util.Date` | `DATETIME`
+`@date java.util.Date` | `DATE`
+`Boolean` | `BOOL`, `BOOLEAN`, `TINYINT`
johnynek May 22, 2015 Collaborator

what is up with the three values on the right in this column?

johnynek May 22, 2015 Collaborator

Seems like the SQL type is BOOLEAN, from reading the code below.

@johnynek johnynek and 1 other commented on an outdated diff May 22, 2015
+Scala type | SQL type
+------------- | -------------
+`Int` | `INTEGER`
+`Long` | `BIGINT`
+`Short` | `SMALLINT`
+`Double` | `DOUBLE`
+`@varchar @size(20) String` | `VARCHAR(20)`
+`@text String` | `TEXT`
+`java.util.Date` | `DATETIME`
+`@date java.util.Date` | `DATE`
+`Boolean` | `BOOL`, `BOOLEAN`, `TINYINT`
+* Annotations are used for String types to clearly distinguish between TEXT and VARCHAR column types
+* Scala `Option`s can be used to denote columns that are `NULLABLE` in the DB
+* Nested case classes can be used as a workaround for the 22-size limitation on Scala tuples, case classes
johnynek May 22, 2015 Collaborator

this seems to stop abruptly: case classes... ? what? I think you are going to explain that case classes are also flattened in left to right order. You might want to show an example of that.

rubanm May 27, 2015 Collaborator

Added a nested case classes example.
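
For readers of this thread, a minimal sketch of the left-to-right flattening discussed above. It uses plain runtime traversal rather than the macro, and the names `FlattenSketch`, `Engagement` and `NestedRecord` are hypothetical, so treat it as an illustration of the column order only, not of how scalding-db-macros implements it.

```scala
// Illustration of left-to-right flattening of nested case classes.
// Hypothetical names; the real macro works at compile time.
object FlattenSketch {
  final case class Engagement(tweet_id: Long, deleted: Boolean)
  final case class NestedRecord(user_id: Long, engagement: Engagement)

  // Recursively expand nested case classes into one flat row,
  // keeping fields in declaration order.
  def flatten(value: Any): List[Any] = value match {
    // Option is itself a Product, so keep it whole as a single nullable column.
    case o: Option[_] => List(o)
    case p: Product   => p.productIterator.toList.flatMap(flatten)
    case other        => List(other)
  }
}
```

Flattening `NestedRecord(1L, Engagement(2L, false))` gives the fields in the order user_id, tweet_id, deleted, which is the column order a flattened schema would use.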


Are the tests being run? Did we update the travis running script?

We really need a more reliable way to make sure we are testing our sub-modules.

rubanm commented Jun 17, 2015

Updated the travis script to include the new db modules.

@johnynek johnynek commented on the diff Jul 2, 2015
+(in the REPL)
+Necessary imports:
+ scalding> import com.twitter.scalding.db._
+ scalding> import com.twitter.scalding.db.macros._
+Case class representing your DB schema:
+ scalding> case class ExampleDBRecord(
+ | card_id: Long,
+ | tweet_id: Long,
+ | created_at: Option[java.util.Date],
+ | deleted: Boolean = false)
+ defined class ExampleDBRecord
johnynek Jul 2, 2015 Collaborator

what's the last step? How do I read from or write to a database? Seems like we are stopping short of clearly explaining the use of the whole package here.

rubanm Jul 2, 2015 Collaborator

Yes, the sources will be added in a follow-up PR. So this is a little incomplete in that sense.

