Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OrderedSerialization support for faster key sorting #1208

Merged
merged 6 commits into from
Mar 3, 2015
Merged

Conversation

johnynek
Copy link
Collaborator

this does not include the macros. Want to stabilize this first.

@johnynek
Copy link
Collaborator Author

@ianoc @isnotinvain please carefully review this code. I do want to merge this.


def getRequireOrderedSerialization: Boolean =
get(ScaldingRequireOrderedSerialization)
.map(java.lang.Boolean.parseBoolean)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

random: I think there's also .map(_.toBoolean), which throws for invalid boolean strings, compared to:

scala> java.lang.Boolean.parseBoolean("heyyy")
res2: Boolean = false

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good call.

Serialization.successUnit
} catch { case NonFatal(e) => Failure(e) }

override def compareBinary(lhs: InputStream, rhs: InputStream) = try {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does try with no catch do?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like a bug.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

odd that it compiles

@johnynek
Copy link
Collaborator Author

@avibryant @colinmarc @snoble if y'all have time to take a look, that would be great. The basic approach is the OrderedSerialization typeclass and see how we wire it into cascading (it is a bit painful, but that is life with cascading).

env: BUILD="base" TEST_TARGET="scalding-jdbc scalding-json"
script: "scripts/run_test.sh"

# not committed yet
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets just move this to one of the other PR's? and remove from here -- if its not in those PR's we might forget to uncomment, and if its in them doesn't need to be here

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I uncommented in the follow up (I think). I'm just as worried we won't remember to add it (since that happens to). I'll do what you prefer.

@ianoc
Copy link
Collaborator

ianoc commented Feb 26, 2015

I made some small comments on mostly really TODO's and comments. LGTM really. Though I was involved in enough of it to be suitably biased ;)

Given we are doing this smaller PR we should consider if doing a followup this week/next week that splits out the ordered serialization code that doesn't depend on scalding into its own package is a good idea or left for future work. (If it survives past a release together we need to change the classpath of the classes)

@johnynek
Copy link
Collaborator Author

@isnotinvain @ianoc addressed most of the comments. Want to take another look?

case f @ OrderedSerialization.CompareFailure(_) => f
case _ => cA // the first is not equal
}
}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, staticSize and dynamicSize need to be added. I wonder if they should not have default implementations. Its too easy forget about them.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems reasonable not to have them have any defaults, make someone think about the cost involved if they don't supply them if they write something manually.

@johnynek
Copy link
Collaborator Author

actually there were several bugs with combinators as a result of the default None in the size methods. I think making those required is a good call for performance critical code, such as this.

@johnynek
Copy link
Collaborator Author

@isnotinvain any more concerns on this base PR?

@isnotinvain
Copy link
Contributor

@johnynek Sorry, I'm still reading your comments. I will try to finish tonight.

@johnynek
Copy link
Collaborator Author

@isnotinvain changed the Iterable to a Map. Makes the code clearer and possibly faster. Anything else that needs addressing in this?

@isnotinvain
Copy link
Contributor

@johnynek

Do we have Laws for Orderings? I understand that we now have laws for total orderings applied to serializable orderings, but I think there's value in checking the other things that we expect to be true, like a < b => b > a -- it seems easy enough to check and could catch a buggy ordering.

The ignored exception in Serialization2.scala would be great to log or compose with the other exception, maybe like this:
new RuntimeException(s"There were 2 exceptions:\n${e1.stackTraceString}\n${e2.stackTraceString}

I have not had a chance to look up the contract around out.close() -- though I suppose it's a contract we find out empirically. If we close it early, that'll fail when someone tries to use it. If we don't close it, it might leak something, or we might not flush everything. Where does the stream come from? Hadoop or cascading? or cascading passed it through from hadoop?

@isnotinvain
Copy link
Contributor

Meant to say, +1 LGTM, the above questions aren't blockers.

@johnynek
Copy link
Collaborator Author

johnynek commented Mar 3, 2015

@isnotinvain two issues:

  1. So, the current laws in this code do assume that compare is the only method people override, and then we don't need to test the Ordering trait from scala that it correctly implements lt, lteq, etc... from compare. I think that is safe. What do you think? I mean, we could go on to verify that lt(a, b) => compare(a, b) < 0, lteq(a, b) => compare(a, b) <=0, equiv(a, b) => compare(a, b) == 0, gteq(a, b) => compare(a, b) >= 0, gt(a, b) => compare(a, b) > 0. But that is not so much about a correct implementation of Ordering but that they did not override some default methods with insane data.

  2. imagine if serialization were lazy, you would not see the second exception, so I don't think that hiding it is so bad. Also if you only have one exception, you will see that one, so I don't really feel like it is a case of losing data, it is more about ignoring some evaluated error in favor of showing the error that happened first. Also, there is no method (stackTraceString) so we have to use apache commons or write something here. I'd rather skip this since I think it is a corner case that is best solved with a designed dual exception that we can flatten into a List so we can see all failures, etc...

I'd rather merge this now and move on to the macro's branch. Is that okay?

@ianoc
Copy link
Collaborator

ianoc commented Mar 3, 2015

Given alex's + 1 above, merging

ianoc added a commit that referenced this pull request Mar 3, 2015
OrderedSerialization support for faster key sorting
@ianoc ianoc merged commit ff79f90 into develop Mar 3, 2015
@ianoc ianoc deleted the fbs-staging branch March 3, 2015 00:35
@isnotinvain
Copy link
Contributor

For 1), I guess I meant that for any Orderings we or our users have written by hand, it'd be nice to assert those properties. No need to assert them on ones that come built into the language. Similar to how we provide MonoidLaws etc.

For 2) I must be going crazy, I thought stackTraceString was an implicit enrichment on exception, but now I can't find it. Must have been local to some other project. I agree, if we surface the first exception that's fine since often we wouldn't even try the second stream.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants