Implement AFi summary stat using SparkSQL #210
Draft
The current implementation of the AFi stats processor, like most of the stats currently in use, is built on a fairly convoluted foundation of classes rooted in Spark's RDD API. While the RDD API still underpins modern Spark, it requires more rigorous hand optimization than the modern Spark SQL machinery (which includes the optimizing Catalyst engine).
This PR offers an alternative implementation of the AFi summary statistic command. It is a nearly wholesale rewrite of the statistic infrastructure: it refactors the layer infrastructure and tries to step away from the byzantine class hierarchy that currently defines the shape of the stats processors. It isn't letter-perfect, but it shows how we can move from the hard-to-decipher, hard-to-maintain code structure that exists today toward something that fits a more SQL-like work pattern, which may be more familiar to data scientists.
The patterns adopted in this example reimplementation have the goals of (1) making the data flow as explicit and understandable as possible, (2) threading error handling through the entire pipeline, (3) providing a viable template for other processing tasks, and (4) doing the above with modern Scala usage.
The focal point here is `AFiAnalysis.scala`. This module should be understood as a chain of processing stages. Many of these stages will find application in other stat processors, requiring only that the sections be extracted, lightly generalized, and organized into modules.
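To make the "chain of processing stages" shape concrete, here is a minimal sketch of the pattern. The stage names, columns, and bodies below are hypothetical placeholders, not the PR's actual pipeline; the point is only that each stage is a plain `DataFrame => DataFrame` function composed with `Dataset.transform`, so the data flow reads linearly:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Illustrative only: stage and column names are invented for this sketch.
object AFiPipelineSketch {
  // Each stage is an ordinary function, so the whole computation is an
  // explicit, inspectable chain rather than a class hierarchy.
  val joinBoundaries: DataFrame => DataFrame =
    df => df.withColumn("gadm_id", lit("stub")) // placeholder stage body

  val flagTreeCoverLoss: DataFrame => DataFrame =
    df => df.withColumn("has_loss", col("loss_year").isNotNull)

  val summarize: DataFrame => DataFrame =
    df =>
      df.groupBy("gadm_id")
        .agg(sum(when(col("has_loss"), 1).otherwise(0)).as("loss_count"))

  def run(locations: DataFrame): DataFrame =
    locations
      .transform(joinBoundaries)
      .transform(flagTreeCoverLoss)
      .transform(summarize)
}
```

Because each stage is a standalone function, extracting and generalizing a stage for use in another stat processor is just a matter of moving it to a shared module.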
The niceness of this implementation is somewhat marred by the requirement to keep good track of errors, but I'm introducing the `Maybe` abstraction to help manage the complexity. Any operation that can potentially fail can be wrapped in a `Maybe`, which provides a three-valued logic of sorts. We can have (1) failed computations which encapsulate an error message, (2) successful computations encapsulating a value, or (3) non-erroring results with no value. The third alternative should be used sparingly. Once a column is computed with `Maybe` values, it can be unpacked, its errors being merged with a pre-existing error column. The error column defines the validity of any row for use in further calculations, and we can use the `whenValid` operator to define derived column values only for non-erroring rows (to avoid null values). The existence of null values does lead to some trouble, but note that Spark will often interpret nullable columns as `Option`-valued as a convenience. The goal should be to confine `Maybe`-valued columns entirely within a given processing stage, to maximize the portability of these computations.

This is an initial foray into improving the clarity of the derivation of stats, and it should be seen as a model, not a final product. Further work on benchmarking will be needed to confirm that this is a viable implementation, and more effort beyond that will be needed to validate the results of these computations against what exists.
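For readers skimming the PR, the three-valued `Maybe` described above can be sketched as a small ADT. The names here (`Error`, `Result`, `NoValue`, `attempt`) are illustrative and may not match the PR's actual API; they just show the three alternatives and how a throwing computation gets captured:

```scala
// Hypothetical sketch of the three-valued Maybe; names are illustrative.
sealed trait Maybe[+A] {
  def map[B](f: A => B): Maybe[B] = this match {
    case Result(a)  => Result(f(a))
    case Error(msg) => Error(msg)
    case NoValue    => NoValue
  }
}
// (1) a failed computation encapsulating an error message
final case class Error(message: String) extends Maybe[Nothing]
// (2) a successful computation encapsulating a value
final case class Result[A](value: A) extends Maybe[A]
// (3) a non-erroring result with no value (use sparingly)
case object NoValue extends Maybe[Nothing]

object Maybe {
  // Wrap a computation that may throw, capturing the failure as an Error.
  def attempt[A](thunk: => A): Maybe[A] =
    try Result(thunk)
    catch { case e: Exception => Error(e.getMessage) }
}
```

In the DataFrame setting, such values would live in a struct column; unpacking it then means splitting the error message into the row's error column and the payload into an ordinary (nullable) data column, which is what keeps `Maybe` contained inside a single stage.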