
Implement AFi summary stat using SparkSQL #210

Draft · wants to merge 6 commits into master

Conversation

jpolchlo (Contributor):

The current implementation of the AFi stats processor, like most of the stats currently in use, is based on a fairly convoluted foundation of classes rooted in Spark's RDD API. While the RDD API still underpins modern Spark, it requires more rigorous hand optimization than the modern Spark SQL machinery (which includes the optimizing Catalyst engine).

This PR offers an alternative implementation of the AFi summary statistic command. It is a nearly wholesale rewrite of the statistic infrastructure: it refactors the layer infrastructure and steps away from the byzantine class hierarchy that currently defines the shape of the stats processors. It isn't letter-perfect, but it shows how we can move from the hard-to-decipher, hard-to-maintain code structure that exists today toward something that fits a more SQL-like work pattern, which may be more familiar to data scientists.

The patterns adopted in this example reimplementation have the goals of (1) making the data flow as explicit and understandable as possible, (2) threading error handling through the entire pipeline, (3) providing a viable template for other processing tasks, and (4) doing the above with modern Scala usage.

The focal point here is AFiAnalysis.scala. This module should be understood to be a chain of processing stages:

  • loading
  • input conversions
  • breaking into work units
  • analysis
  • aggregation
  • post-processing
  • export

Many of these stages will find application in other stats processors; they only need to be extracted, lightly generalized, and organized into modules (see the sketch below).
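To make the stage breakdown concrete, here is a minimal sketch of what that chain can look like when each stage is just a DataFrame-to-DataFrame function. The stage bodies, column names (`group_id`, `area_ha`), and I/O formats below are placeholders of mine, not the actual contents of AFiAnalysis.scala:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

// Illustrative stage names and columns only; the real stages live in AFiAnalysis.scala.
object AFiPipelineSketch {
  def loadFeatures(spark: SparkSession, uri: String): DataFrame =
    spark.read.parquet(uri)                             // loading

  def toWorkUnits(features: DataFrame): DataFrame =
    features.repartition(200)                           // input conversion + breaking into work units

  def analyze(units: DataFrame): DataFrame =
    units.withColumn("area_ha", lit(0.0))               // stand-in for the per-unit analysis

  def aggregate(results: DataFrame): DataFrame =
    results.groupBy(col("group_id"))                    // aggregation by a grouping key
      .agg(sum(col("area_ha")).as("total_area_ha"))

  def postProcess(summary: DataFrame): DataFrame =
    summary.orderBy(col("group_id"))                    // post-processing

  def run(spark: SparkSession, inputUri: String, outputUri: String): Unit = {
    val out = postProcess(aggregate(analyze(toWorkUnits(loadFeatures(spark, inputUri)))))
    out.write.mode("overwrite").csv(outputUri)          // export
  }
}
```

The point is that each stage is an ordinary function that can be composed, swapped out, and tested in isolation, rather than a slot in a class hierarchy.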

The niceness of this implementation is somewhat marred by the need to keep careful track of errors, so I'm introducing the Maybe abstraction to help manage the complexity. Any operation that can fail can be wrapped in a Maybe, which provides a three-valued logic of sorts: (1) a failed computation encapsulating an error message, (2) a successful computation encapsulating a value, or (3) a non-erroring result with no value. The third alternative should be used sparingly. Once a column is computed with Maybe values, it can be unpacked, with its errors merged into a pre-existing error column. That error column defines whether a row is valid for further calculations, and the whenValid operator helps define derived column values only for non-erring rows (to avoid null values). The null values that do arise lead to some trouble, but note that Spark will often interpret nullable columns as Option-valued as a convenience. The goal should be to confine Maybe-valued columns entirely within a given processing stage, to maximize the portability of these computations.
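As a sketch of the idea (the names and encodings here are mine and may not match what's in the PR), a Maybe is a small ADT with the three cases above, and the unpack/whenValid pattern can be expressed with ordinary Spark column functions:

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._

// Three-valued result of a fallible computation (illustrative encoding).
sealed trait Maybe[+A]
case class Err(message: String) extends Maybe[Nothing]  // failed: carries an error message
case class Found[A](value: A)   extends Maybe[A]        // succeeded: carries a value
case object Empty               extends Maybe[Nothing]  // succeeded, but nothing to report

object MaybeColumns {
  // After a Maybe-valued column is unpacked into separate value/error columns,
  // fold its error into the row's running error column (keeping the first
  // error encountered is just one possible policy).
  def mergeErrors(existing: Column, incoming: Column): Column =
    coalesce(existing, incoming)

  // Guard a derived column so it is only computed for rows that have not
  // errored; erring rows get null, which Spark reads back as None for
  // Option-typed fields in a Dataset.
  def whenValid(error: Column)(expr: Column): Column =
    when(error.isNull, expr)
}
```

For example, a derived column like `whenValid(col("error"))(col("loss_area") / col("total_area"))` only performs the division on rows whose error column is null.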

This is an initial foray into improving the clarity of the stats derivation, and it should be seen as a model rather than a final product. Further benchmarking work will be needed to confirm that this is a viable implementation, and beyond that, more effort will be needed to validate the results of these computations against what exists.

Spark SQL. Removes reliance on the existing RDD-based class infrastructure and provides a DataFrame/Dataset interface. Error handling is threaded more completely through the process using the Maybe abstraction.

This is a WIP commit. Most of the functionality is here, but it is not entirely correct.
@jpolchlo (Contributor, Author) commented:

This is still a work in progress, but I'm having trouble finding time off the clock to push it further. It compiles, but is still hitting runtime errors in the layer reader (and probably elsewhere that I haven't seen yet). The strategy is to keep forcing computation with either count or show on the various intermediate products and debug each stage as needed (sketched below). I just wanted to get something to you all to pick at.
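For reference, the kind of forcing I mean, written as a tiny helper (the helper and its name are illustrative, not part of the PR):

```scala
import org.apache.spark.sql.DataFrame

object StageDebug {
  // Force evaluation of an intermediate product so failures surface at the
  // stage that produced it, then pass it along unchanged.
  def debugStage(name: String, df: DataFrame): DataFrame = {
    println(s"[$name] rows: ${df.count()}")  // count() forces the full computation
    df.show(5, truncate = false)             // peek at a few rows for sanity
    df
  }
}
```

Wrapping each intermediate product in something like this should localize the layer-reader failure and whatever comes after it.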
