
Implement AFi summary stat using SparkSQL #210

Draft · wants to merge 6 commits into master

Conversation

jpolchlo (Contributor):

The current implementation of the AFi stats processor, like most of the stats currently in use, is based on a fairly convoluted foundation of classes rooted in Spark's RDD API. While the RDD API still underpins modern Spark, it requires more rigorous hand optimization than the modern Spark SQL machinery (which includes the optimizing Catalyst engine).

This PR offers an alternative implementation of the AFi summary statistic command. It is a nearly wholesale rewrite of the statistic infrastructure: it refactors the layer infrastructure and steps away from the byzantine class hierarchy that currently defines the shape of the stats processors. It isn't letter-perfect, but it shows how we can move from the hard-to-decipher, hard-to-maintain code structure that exists today toward something that fits a more SQL-like work pattern, which may be more familiar to data scientists.

The patterns adopted in this example reimplementation have the goals of (1) making the data flow as explicit and understandable as possible, (2) threading error handling through the entire pipeline, (3) providing a viable template for other processing tasks, and (4) doing the above with modern Scala usage.

The focal point here is AFiAnalysis.scala. This module should be understood to be a chain of processing stages:

  • loading
  • input conversions
  • breaking into work units
  • analysis
  • aggregation
  • post-processing
  • export

Many of these stages will find application in other stats processors; they only need to be extracted, lightly generalized, and organized into modules (see the sketch below).
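To make the stage breakdown concrete, here is a minimal sketch of what that chain can look like when each stage is just a DataFrame-to-DataFrame function. The stage bodies, column names (`group_id`, `area_ha`), and I/O formats below are placeholders of mine, not the actual contents of AFiAnalysis.scala:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

// Illustrative stage names and columns only; the real stages live in AFiAnalysis.scala.
object AFiPipelineSketch {
  def loadFeatures(spark: SparkSession, uri: String): DataFrame =
    spark.read.parquet(uri)                             // loading

  def toWorkUnits(features: DataFrame): DataFrame =
    features.repartition(200)                           // input conversion + breaking into work units

  def analyze(units: DataFrame): DataFrame =
    units.withColumn("area_ha", lit(0.0))               // stand-in for the per-unit analysis

  def aggregate(results: DataFrame): DataFrame =
    results.groupBy(col("group_id"))                    // aggregation by a grouping key
      .agg(sum(col("area_ha")).as("total_area_ha"))

  def postProcess(summary: DataFrame): DataFrame =
    summary.orderBy(col("group_id"))                    // post-processing

  def run(spark: SparkSession, inputUri: String, outputUri: String): Unit = {
    val out = postProcess(aggregate(analyze(toWorkUnits(loadFeatures(spark, inputUri)))))
    out.write.mode("overwrite").csv(outputUri)          // export
  }
}
```

The point is that each stage is an ordinary function that can be composed, swapped out, and tested in isolation, rather than a slot in a class hierarchy.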

The niceness of this implementation is somewhat marred by the need to keep careful track of errors, so I'm introducing the Maybe abstraction to help manage the complexity. Any operation that can fail can be wrapped in a Maybe, which provides a three-valued logic of sorts: (1) a failed computation encapsulating an error message, (2) a successful computation encapsulating a value, or (3) a non-erroring result with no value. The third alternative should be used sparingly. Once a column is computed with Maybe values, it can be unpacked, with its errors merged into a pre-existing error column. That error column defines whether a row is valid for further calculations, and the whenValid operator helps define derived column values only for non-erring rows (to avoid null values). The null values that do arise lead to some trouble, but note that Spark will often interpret nullable columns as Option-valued as a convenience. The goal should be to confine Maybe-valued columns entirely within a given processing stage, to maximize the portability of these computations.
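As a sketch of the idea (the names and encodings here are mine and may not match what's in the PR), a Maybe is a small ADT with the three cases above, and the unpack/whenValid pattern can be expressed with ordinary Spark column functions:

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._

// Three-valued result of a fallible computation (illustrative encoding).
sealed trait Maybe[+A]
case class Err(message: String) extends Maybe[Nothing]  // failed: carries an error message
case class Found[A](value: A)   extends Maybe[A]        // succeeded: carries a value
case object Empty               extends Maybe[Nothing]  // succeeded, but nothing to report

object MaybeColumns {
  // After a Maybe-valued column is unpacked into separate value/error columns,
  // fold its error into the row's running error column (keeping the first
  // error encountered is just one possible policy).
  def mergeErrors(existing: Column, incoming: Column): Column =
    coalesce(existing, incoming)

  // Guard a derived column so it is only computed for rows that have not
  // errored; erring rows get null, which Spark reads back as None for
  // Option-typed fields in a Dataset.
  def whenValid(error: Column)(expr: Column): Column =
    when(error.isNull, expr)
}
```

For example, a derived column like `whenValid(col("error"))(col("loss_area") / col("total_area"))` only performs the division on rows whose error column is null.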

This is an initial foray into improving the clarity of the stats derivation, and it should be seen as a model rather than a final product. Further benchmarking work will be needed to confirm that this is a viable implementation, and beyond that, more effort will be needed to validate the results of these computations against what exists.

Spark SQL. Removes reliance on the existing RDD-based class infrastructure and provides a DataFrame/Dataset interface. Error handling is threaded more completely through the process using the Maybe abstraction.

This is a WIP commit. Most of the functionality is here, but it is not entirely correct.
@jpolchlo (Contributor, Author) commented:

This is still a work in progress, but I'm having trouble finding time off the clock to push it further. It compiles, but is still hitting runtime errors in the layer reader (and probably elsewhere that I haven't seen yet). The strategy is to keep forcing computation with either count or show on the various intermediate products and debug each stage as needed (sketched below). I just wanted to get something to you all to pick at.
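For reference, the kind of forcing I mean, written as a tiny helper (the helper and its name are illustrative, not part of the PR):

```scala
import org.apache.spark.sql.DataFrame

object StageDebug {
  // Force evaluation of an intermediate product so failures surface at the
  // stage that produced it, then pass it along unchanged.
  def debugStage(name: String, df: DataFrame): DataFrame = {
    println(s"[$name] rows: ${df.count()}")  // count() forces the full computation
    df.show(5, truncate = false)             // peek at a few rows for sanity
    df
  }
}
```

Wrapping each intermediate product in something like this should localize the layer-reader failure and whatever comes after it.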
