JuliaCon 2023 Talk Draft

Time slot (according to the calendar invite) is half an hour.

Sketch: 2x 10-minute halves—first half is about how scientists would use this tool, and the second half is how we use Julia to do this in a convenient way.

Don’t forget to mention GPU FPX at some point!

So:

  • scientist story
  • case studies
  • how we did this in Julia

Notes on the slides: Julia has its own font, JuliaMono, and it will look extra-classy if we use that in all the examples. It’s not my favorite font for code, but it’s not bad either. Good character coverage.

Technical outline

  • Finding exceptions with the log files
    • What gets logged
      • Config: log everything → log a few things
      • Infs and NaNs
    • Tooling
      • Log splitter
    • Visualization
      • CSTG
      • diffing ← if we could get this to work…
  • Fuzzing
    • Control of where and when to inject
  • Wrapping inputs vs. type parametrization
  • How it works
    • Type-based dispatch

Technical points for case studies

  • ShallowWaters
    • Simulations use a CFL parameter → too high leads to instability
    • Tracking down where NaNs come from can help us stabilize simulations without sacrificing simulation speed
    • Also: NaNs in the output do not necessarily result in a crash—finding kills is important too (no kills in this one)
    • We tracked down Inf - Inf ⇒ NaN; then tracked down most Infs to $big_number ^ 2
    • We used a visualization (CSTG) to summarize the huge log files
    • Nice thing: type parameterization
  • NBody/OrdinaryDiffEq
    • Fuzzing
      • Configured fuzzer to focus on libraries that we were interested in
      • Found an instance where OrdinaryDiffEq said it would exit, then went into an infinite loop
      • Reproduced with a recording
      • Bug was due to a kill—example of a kill causing a Bad Thing in the wild!
  • Finch and advection
    • Finch is a DSL for PDEs—in a way this is a nice complement to the ODE library
    • Fuzzing found two places where Finch needed protection against user input:
      1. Inputting a mesh
      2. Setting bounds for the solver
    • CSTGs also useful for finding NaNs in unstable systems, just like with ShallowWaters
  • RxInfer
    • Bayesian inference library
    • ShallowWaters let us parameterize the type; this time the users just wrapped inputs
    • Proof that this is useful for more than just the creators :)
    • Got some good feedback from these folks on how to improve the ergonomics—you, dear audience member, can help us too

Talk

Introduction: the busy scientist

Suppose that you are a scientist doing some numerical work. Maybe you’re crunching data, maybe you’re running a simulation, maybe you’re training a model—whatever. You’ve got your workload spread across some CPUs and GPUs.

All is going well, until somewhere, somehow, a NaN creeps in and starts rendering the simulation unstable. Poof—all the time you spent setting up and running this simulation has been wasted, and now you’re looking at hours of manually tracing through the code, rerunning your simulation and praying that you’re logging the right things in the right places at the right times to catch the NaN.

Sounds like fun, no?

After hours and hours of tedious, boring grunt work, you finally figure out that a NaN is coming from a division where both the numerator and the denominator underflowed to 0 (and 0/0 is NaN). You fine-tune some of the arithmetic to avoid this problem and, eyes brimming full of hope and excitement, you re-run your program with the final fix in place.

And it… works. Sort of. Something seems off about the results you’ve gotten. Not wanting to fall into the trap of “the computer did it so it must be correct”, you embark once again into the bowels of your arithmetic routines.

After another week of toil, you notice something strange: a NaN pops into existence at one point, but then quietly disappears leaving no hole in the output. After some searching, you uncover the killer: a conditional has swallowed the NaN whole.

See the problem?

You add a guard here for safety, but wonder: where did that NaN come from? Another week of painful debugging goes by until you finally figure out where values got too big too fast, or where one of the inputs in the 10 GB input file was missing a decimal point, etc.

This time when you run the program, the output matches your intuition.

Hooray! Time to deploy this code to production!

TODO: show the short clip of the car running into the wall here https://www.youtube.com/watch?v=x4fdUx6d4QM

Well… not so fast.

You remember this mishap with the autonomous vehicle. See, what happened was that a faulty sensor sent a NaN over a sensor bus. The steering module didn’t know how to handle the NaN, so it went to its default position: locked to the right.

You begin to wonder if there are any other parts of your code that could be susceptible to such problematic behavior. It might be nice if you could somehow fuzz the code to find NaN-susceptible regions. But trying to figure that out would take a ton of effort, right?

Fortunately, we have tools to help!

If, instead of doing all that manual work, you had simply used our FloatTracker tool, you would have had logs that would have led you immediately to where the NaNs and Infs were being generated and where they were disappearing, as well as tools to help you fuzz your code and find cases where you could harden your routines against spurious NaNs that didn’t show up in testing.

Explanation of the FlowFPX toolkit

FloatTracker is one part of the FlowFPX toolkit. It’s written in Julia for analyzing Julia programs—we’ll spend most of our time today talking about this tool. (This is JuliaCon after all!)

CSTG is a stand-alone tool that is really useful for analyzing the data that FloatTracker emits.

We do have some tools for working with GPU-based programs as well: GPU FPX has been developed by my colleague Xinyi Lee for dynamically catching floating-point exceptions in the GPU. Please see her paper and talk if you’d like to know more about that.

We’ll now take a brief foray into where floating-point is liable to trip you up. After that we’ll talk about how to use our tool to make debugging your numerical programs a breeze. Then we’ll wrap up with a quick look at some of the neat aspects of Julia that made building such a tool possible for our small research team.

The dark world of floating-point arithmetic

The IEEE 754 spec is a useful and performant way of representing floating-point numbers, but there are many counterintuitive aspects of both floating-point’s intrinsic behavior as well as of the 754 spec that can invalidate our results—or worse—silently cause incorrect behavior. It can be difficult to find the root cause of these bugs or harden our programs against exceptional values. Unexpected floating-point behavior has led not only to race car crashes but also to rockets exploding and medical patients getting fatal doses of radiation.

Let’s take a quick look at why floating-point can be so tricky.

There’s necessarily some gap between the values that we are trying to represent and the values that we can represent. This means that there is always some kind of error. Moreover, that error accumulates throughout a computation. There are ways to work around this error, and for simple calculations it’s not that important, but sometimes it can push us just over the brink into exceptional values.
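To make that concrete, here’s a tiny sketch in plain Julia (nothing FloatTracker-specific yet): 0.1 has no exact binary representation, so error shows up even in trivial arithmetic, and it compounds as we accumulate.

0.1 + 0.2 == 0.3                # false
(0.1 + 0.2) - 0.3               # 5.551115123125783e-17

foldl(+, fill(0.1, 10)) == 1.0  # false; ten additions of 0.1 do not make 1.0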

Exceptional values

There are two main exceptional values that you’ve likely run into: Inf and NaN. Inf of course represents a value too large to fit into your representation. Once a value goes to Inf, there’s no coming back.

One implication of this is that we can take algebraically equivalent expressions and get different answers.

x::Float32 = 2f38
y::Float32 = 1f38
[(x + x) - y, x + (x - y)]  # Float32[Inf, 3.0f38]; same values, two different answers

This means that addition is not associative! We are not working with real numbers here, people!

Inf often begets NaN (though that’s not the only place it can come from), which denotes some nonsensical computation.

Inf - Inf  # NaN

Sometimes it also arises from bad sensor data, typos in data, etc.

NaN is a sticky value: almost all operations with NaN result in a NaN. This is good because if a NaN crops up in our computation, we want to see it in the result.
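A few quick examples of that stickiness:

NaN + 1.0  # NaN
sqrt(NaN)  # NaN
0.0 * NaN  # NaN; it survives all of these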

Now, I said that almost all operations involving NaN result in a NaN. There are cases where the NaN can disappear silently—we call this a “kill”. A kill is almost never what you want.
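Comparisons are the classic culprits, along with a handful of IEEE special cases:

NaN < 1.0   # false; the comparison quietly kills the NaN
NaN == NaN  # false as well
1.0 ^ NaN   # 1.0; IEEE 754 says 1 to any power is 1, even NaN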

Our tool—FloatTracker—can detect where these exceptional values get generated, how they propagate through a program, and where NaNs can get killed. Let’s take a look at how FloatTracker can help us out.

Introducing FloatTracker

Here’s an example of some code that computes the wrong result because of a NaN kill.

function maximum(lst)
  max_seen = 0.0
  for x in lst
    if ! (x <= max_seen)
      max_seen = x              # swap if new val greater
    end
  end
  max_seen
end

maximum([1.0, 5.0, 4.0, NaN, 4.0])  # returns 4.0, not 5.0, and not NaN

See that? Not only does the NaN in the input not get propagated, it gives us the wrong answer! We’d never know that there’s a 5 in the list!

Fortunately, FloatTracker makes it easy to find the problem. Let’s use our tool:

using FloatTracker

function maximum(lst)
  max_seen = 0.0
  for x in lst
    if ! (x <= max_seen)
      max_seen = x              # swap if new val greater
    end
  end
  max_seen
end

maximum([TrackedFloat32(x) for x in [1.0, 5.0, 4.0, NaN, 4.0]])
ft_flush_logs()

That gives us some logs that look like this:

[NaN] check_error at /Users/ashton/.julia/dev/FloatTracker/src/TrackedFloat.jl:11
<= at /Users/ashton/.julia/dev/FloatTracker/src/TrackedFloat.jl:214
maximum at /Users/ashton/Research/FloatTrackerExamples/examples/max_min_example.jl:0
top-level scope at /Users/ashton/Research/FloatTrackerExamples/examples/max_min_example.jl:15

The second line shows us that the culprit was <=; the next line shows that it came from a call to maximum; and the bottom line shows us where the top-level call originated: line 15 of our file.

Some of you eagle-eyed participants might have noticed that the third line looks a little suspect: the line number is 0. This seems to happen when Julia starts inlining things. We haven’t had too much of a problem with this—the other information in the stack trace is usually more than enough to trace the call back to the issue. We’d like to improve it, but we’re a little stuck with what kinds of stack traces we can get from Julia.

Now we’ll look at some real-world scenarios in which we applied our tools.

Case studies

ShallowWaters

Consider ShallowWaters: a Julia library that lets you take a mesh of a sea bed, run a time-series simulation, and get the speed of currents over that sea floor.

Like many (most?) simulations, ShallowWaters operates by modeling one time frame after another. There’s a parameter—called the CFL parameter—that controls how fast information propagates through the system. It’s roughly equivalent to how big of a time step you take.

Small CFL values give you accurate simulations, while big values give you faster runs. The downside is that instability can crop up if the parameter is set too high.

For example, ShallowWaters with a modest CFL parameter might produce something like this.

[Good picture]

But if we dial it up too high…

[Bad picture]

If we can figure out where the NaNs came from, maybe we can run our simulation faster without wrecking our results.

Moreover, it’s noteworthy that NaNs do not necessarily result in a crash. We’re fortunate that it’s clear that something is off with this simulation. But what if that instability hides in something like the bad maximum function that we saw before? Finding NaN kills is important too.

FloatTracker helps us figure out both where instability in the form of NaNs and Infs is coming from and where kills are happening.

All it takes to add FloatTracker is:

  1. Require the library
  2. Wrap inputs in TrackedFloat types
  3. Flush logs

Live demo

using ShallowWaters
using FloatTracker

P = run_model(cfl=TrackedFloat32(0.8), iterations=100,
              param1="nonperiodic", param2="double_gyre",
              param3="seamount")

savefig("output.fig")
ft_flush_logs()

It’s as simple as that. FloatTracker will log every operation that the TrackedFloat type touches. Moreover, all the results of any operation with a TrackedFloat will be a TrackedFloat too, so our tracking will spread like a virus. There are some limitations to this approach, and we’re working on more ergonomic ways of wrapping input.
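A quick sketch of that virality (illustrative; the exact REPL output may differ):

using FloatTracker

x = TrackedFloat32(1.0f0)
y = x + TrackedFloat32(2.0f0)
typeof(y)  # TrackedFloat32; tracking follows every downstream value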

But ShallowWaters has a feature that made our work even easier than this: ShallowWaters lets us parameterize the type of float used in the simulation. This seems to be the case with several libraries that we looked at. So we were able to enable FloatTracker with a simple modification to the code:

using ShallowWaters
using FloatTracker

P = run_model(T=TrackedFloat32,
              cfl=TrackedFloat32(0.8), iterations=100,
              param1="nonperiodic", param2="double_gyre",
              param3="seamount")

savefig("output.fig")
ft_flush_logs()

With either strategy, we get some nice logs about where those NaNs are coming from.

Let’s take a look at those logs now:

[NaN] check_error(Any[Inf32, -Inf32]) at FloatTracker/src/TrackedFloat.jl:11
+(::TrackedFloat32, ::TrackedFloat32) at FloatTracker/src/TrackedFloat.jl:103
momentum_v!(::ShallowWaters.DiagnosticVars{Float32, TrackedFloat32}, ::ShallowWaters.ModelSetup{Float32, TrackedFloat32}, ::Int64) at ShallowWaters/src/rhs.jl:275
rhs_nonlinear!(::Matrix{Float32}, ::Matrix{Float32}, ::Matrix{Float32}, ::ShallowWaters.DiagnosticVars{Float32, TrackedFloat32}, ::ShallowWaters.ModelSetup{Float32, TrackedFloat32}, ::Int64) at ShallowWaters/src/rhs.jl:51
rhs!() at ShallowWaters/src/rhs.jl:14
time_integration(::ShallowWaters.PrognosticVars{TrackedFloat32}, ::ShallowWaters.DiagnosticVars{Float32, TrackedFloat32}, ::ShallowWaters.ModelSetup{Float32, TrackedFloat32}) at ShallowWaters/src/time_integration.jl:77
run_model(::Type{Float32}, ::Parameter) at ShallowWaters/src/run_model.jl:37
#run_model#57() at ShallowWaters/src/run_model.jl:17
run_model##kw() at ShallowWaters/src/run_model.jl:12
run_model##kw(run_model) at ShallowWaters/src/run_model.jl:12
top-level scopeCore.tuple(:T, :cfl, :Ndays, :nx, :L_ratio, :bc, :wind_forcing_x, :topography) at /Users/ashton/Research/FloatTrackerExamples/examples/sw_nan_tf.jl:7
eval() at ./boot.jl:368
include_string(identity) at ./loading.jl:1428
_include(::Function, ::Module, ::String) at ./loading.jl:1488
include(::Module, ::String) at ./Base.jl:419
exec_options(::Base.JLOptions) at ./client.jl:303
_start() at ./client.jl:522

Let’s clean up one of these hunks.

[NaN] check_error(Any[Inf32, -Inf32])  at FloatTracker/src/TrackedFloat.jl:11
+(::TrackedFloat32, ::TrackedFloat32)  at FloatTracker/src/TrackedFloat.jl:103
momentum_v!(…)                         at ShallowWaters/src/rhs.jl:275
rhs_nonlinear!(…)                      at ShallowWaters/src/rhs.jl:51
rhs!()                                 at ShallowWaters/src/rhs.jl:14
time_integration(…)                    at ShallowWaters/src/time_integration.jl:77
run_model(…)                           at ShallowWaters/src/run_model.jl:37
#run_model#57()                        at ShallowWaters/src/run_model.jl:17
run_model##kw()                        at ShallowWaters/src/run_model.jl:12
run_model##kw(run_model)               at ShallowWaters/src/run_model.jl:12
top-level scopeCore.tuple(…)           at FloatTrackerExamples/examples/sw_nan_tf.jl:7
  • It’s a NaN event
  • Inf + (-Inf)
  • See where it’s coming from

There are a lot of logs—we’d like a summary.

To get a quick summary, we can coalesce the logs into a handy graph that lets us see where most of the flows are going to/or from.

Looks like most of the problems are coming from the + routine inside the continuity_itself! function.

So where are those Infs coming from? Well, fortunately, FloatTracker watches for that too.

Let’s go to the CSTG summary first.

Looks like there’s one case where an addition is giving us an Inf, but 640 instances of Inf are coming from exponentiation.

Here’s one of those stack traces:

[Inf] check_error(Any[-1.5150702f31, 2])  at FloatTracker/src/TrackedFloat.jl:11
^(::TrackedFloat32, ::Int64)              at FloatTracker/src/TrackedFloat.jl:139
literal_pow()                             at ./intfuncs.jl:327
_broadcast_getindex_evalf()               at ./broadcast.jl:670
_broadcast_getindex()                     at ./broadcast.jl:643
getindex()                                at ./broadcast.jl:597
macro expansion()                         at ./broadcast.jl:961
macro expansion()                         at ./simdloop.jl:77
copyto!()                                 at ./broadcast.jl:960
copyto!()                                 at ./broadcast.jl:913
copy()                                    at ./broadcast.jl:885
materialize(^)                            at ./broadcast.jl:860
top-level scopeBase.getproperty(P, :u)    at FTExamples/examples/sw_nan_tf.jl:14

Looks like the operation here is (-1.5150702f31)^2—no wonder a Float32 couldn’t handle that and went to Inf.
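We can sanity-check that at the REPL:

(-1.5150702f31)^2  # Inf32; the true value is about 2.3e62, far past floatmax(Float32) ≈ 3.4028235f38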

Now we leave it to a domain expert to figure out how to mitigate this. Some strategies:

  • use a bigger bit-width
  • use a tool like Herbie to rewrite floating-point expressions to reduce error
  • manually reorder operations to keep values from getting too big, like that example we saw earlier
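As a quick sketch of that last strategy: a naive 2-norm squares its inputs and overflows, but rescaling by the largest magnitude first keeps the intermediates small (this is essentially what the built-in hypot does).

x, y = 3f30, 4f30

sqrt(x^2 + y^2)              # Inf32; the squares overflow even though the answer fits

m = max(abs(x), abs(y))      # rescale by the largest magnitude before squaring
m * sqrt((x/m)^2 + (y/m)^2)  # 5.0f30

hypot(x, y)                  # 5.0f30; the library routine guards against this overflow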

Fuzzing: OrdinaryDiffEq

Next we took a look at the OrdinaryDiffEq library—a popular library for differential equations. I say we took a look at it—really we started with a library to do N-body simulation, and we ended up uncovering a bug with OrdinaryDiffEq.

Since this is such a widely used library, it’s important to ensure that there are no NaN kills.

FloatTracker has a utility akin to fuzz testing that lets us randomly inject NaNs during the run of a program. We can then watch the logs for any NaN kills and make corrections.

config_injector(odds=2,
                functions=[FunctionRef(:run_simulation, "nbody_simulation_result.jl")],
                libraries=["NBodySimulator", "OrdinaryDiffEq"])
record_injection("injection_recording.txt")

There are a few controls right now; here we’re setting the odds of an operation spontaneously turning into a NaN to 1:2, and we’re asking FloatTracker to only inject when we’re working inside the run_simulation function and within the libraries NBodySimulator or OrdinaryDiffEq. That way we don’t start injecting NaNs into Base functions that we trust are well-behaved.

On the bottom line there we record the injections so that we can replay them later. This helped us get to a reproducible issue.

With little effort we found a NaN kill that would cause OrdinaryDiffEq to go into an infinite loop. It wasn’t a case that you’d really think to test, and that’s kind of the point of fuzzing: catch edge cases before they ever crop up.

Going back to the example of the race car crashing into the wall, this kind of fuzzing might have helped the developers notice the odd behavior of the steering module if a NaN happened to be sent over the sensor bus. Who knows what it will help you catch?

Finch and advection

Finch is a DSL for PDEs—in a way this is a nice complement to the ODE library.

We spent some time fuzzing Finch and found two places where Finch needed protection against user input:

  1. Inputting a mesh
  2. Setting bounds for the solver

Beyond just finding points where Finch needs some guards, FloatTracker and CSTGs are useful for finding Infs and NaNs in unstable systems, just like with ShallowWaters.

RxInfer

Now, the previous three case studies are all things that our team was able to do with our tool. You might be thinking, “but I’m a busy scientist! I don’t have time to try out some cutting-edge research tool!”

I’m going to tell you about one instance where one busy scientist took FloatTracker out for a spin and found a bug in their code—with basically no help from us!

We were looking through issues on GitHub when we came across an issue with the RxInfer package, a library for Bayesian inference. The issue description said:

Now it is impossible to trace back the origin of the very first NaN without performing a lot of manual work. This limits the ability to debug the code and to prevent these NaNs in the first place.

RxInfer.jl#116

We saw that and thought, that’s exactly the scenario our imaginary busy scientist was facing. We chimed in and said, “hey! we’re building a tool that makes it easy to do what you want!”

We asked them if we could look at the code, but they were doing work with some proprietary information, so we were not able to help them out much beyond pointing them at our tool.

RxInfer doesn’t have type parameterization like ShallowWaters, but they were able to wrap all the inputs that they needed.

Without help from us, they were able to find and fix a NaN-gen location easily and quickly. They also gave us some great feedback on the ergonomics of working with our tool.

Intermezzo

I hope that inspires some confidence in you. Next time you find yourself wishing that there were a faster way to track down NaNs and Infs, just remember that there is. We hope you try our tool out and that it helps you solve your problems. We’d also love to hear your feedback!

How we made this work

  • Intercept all operations involving FP
    • Use Julia type dispatch to do this
    • Optionally inject—show decision flow chart
  • Table showing the different kind of events
  • Buffer & log what we’re interested in
  • CSTG to summarize logs (optional—sometimes manual inspection of the logs is fine)
    • CSTG works by taking each of the chunks of stack trace and making a path for each

For the last part of the talk I’m going to show you how we got FloatTracker to work. In principle we’re not doing anything that couldn’t be done in another language, but Julia makes it really easy to create the kind of tool that we did.

Intercept floating-point operations

We start by defining a new data type that wraps a regular float:

abstract type AbstractTrackedFloat <: AbstractFloat end

struct TrackedFloat32 <: AbstractTrackedFloat
  val::Float32
end

This means that a TrackedFloat32 is valid anywhere an AbstractFloat is allowed. We do this for Float16, Float32, and Float64 and make our own tracked versions.
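That subtyping is what lets tracked values flow through generic code unchanged:

TrackedFloat32(1.5f0) isa AbstractFloat  # true

Any function constrained to AbstractFloat, such as f(x::AbstractFloat), now accepts a TrackedFloat32 without modification.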

And then all we have to do is implement overloaded methods for this type:

function Base.+(x::TrackedFloat32, y::TrackedFloat32)
  result = x.val + y.val          # compute on the underlying floats
  check_error(+, result, x.val, y.val)
  TrackedFloat32(result)          # re-wrap so tracking continues downstream
end

Type dispatch makes this possible!

Julia selects which method to run based on the runtime types of all the arguments, so whenever TrackedFloat32s show up in an arithmetic operation, our overload runs instead of the plain Float32 method, with no changes to the caller’s code.

  • First we run the function
  • Then we check to see if we had a NaN/Inf gen/prop/kill
    • If we did, grab the stack trace and record it.
  • Finally return the result wrapped in a TrackedFloat type
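Here’s a minimal sketch of what such a check could look like. This is illustrative, not FloatTracker’s actual implementation, and log_event is a made-up helper:

# Hypothetical sketch; classifies an operation as a gen, prop, or kill event.
function check_error(op, result, args...)
  bad_in  = any(x -> isnan(x) || isinf(x), args)
  bad_out = isnan(result) || isinf(result)
  if !bad_in && bad_out
    log_event(:gen, op, stacktrace())   # exceptional value born here
  elseif bad_in && bad_out
    log_event(:prop, op, stacktrace())  # it flowed through this operation
  elseif bad_in && !bad_out
    log_event(:kill, op, stacktrace())  # it silently disappeared
  end
end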

That’s the basics of tracking exceptional values.

Inject NaNs

To make the injection work, we actually run a series of checks before running the operation; if the checks pass, we inject. The basics of the checks that we do are:

  • Are we replaying an injection recording? If yes, do what the replay says to do. Otherwise…
  • Is injection turned on?
  • Do we have NaNs left to inject?
  • Roll a die to see if this operation succeeds; did we crit fail?
  • Are we in an “injectable region”?
    • Check to see if any of the functions in the config are on the stack OR any of the frames on the stack originate from a library we’re interested in.
    • This is the most expensive check, so we do it very last.

Getting the stack trace is the most expensive part both of tracking exceptional values and of checking whether we’re good to go for injection, so we’ve taken care to trigger it as infrequently as possible.
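In sketch form, the decision flow might look something like this. The names (should_inject, the config fields, the helpers) are illustrative, not FloatTracker’s real internals:

# Illustrative sketch of the injection decision, cheapest checks first.
function should_inject(config)
  config.replaying && return next_replay_decision(config)  # a recording overrides everything
  config.enabled || return false
  config.nans_remaining > 0 || return false                # out of NaN budget?
  rand(1:config.odds) == 1 || return false                 # roll the die
  in_injectable_region(config)  # the stack walk goes last since it is the expensive check
end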

CSTG summarization

  • Take all of the stack trace hunks and tally the lines
  • Make a graph with thicker lines meaning more events followed that path
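The core of that tallying fits in a few lines. Here’s a sketch of the idea (not the actual CSTG implementation; CSTG is a stand-alone tool, and traces here is an assumed list of stack traces):

# Sketch: count how many traces traverse each caller/callee edge;
# the edge weight becomes line thickness in the rendered graph.
edge_counts = Dict{Tuple{String,String},Int}()
for trace in traces  # each trace is a Vector{String} of frames, innermost first
  for (callee, caller) in zip(trace, trace[2:end])
    edge_counts[(caller, callee)] = get(edge_counts, (caller, callee), 0) + 1
  end
end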

TODO: get that diff working!

Using metaprogramming

That wrapper function, as you might assume, would be tedious to write out for every function—not to mention impossible to maintain. Fortunately, Julia lets us use macros, so we can automate away an impressive amount of the work.

You can write two nested for loops to quickly generate the code needed for this:

for TrackedFloatN in (:TrackedFloat16, :TrackedFloat32, :TrackedFloat64)
  for Op in (:+, :-, :/, :^)
    @eval function Base.$Op(x::$TrackedFloatN, y::$TrackedFloatN)
      result = $Op(x.val, y.val)  # unwrap, then compute on the plain floats
      check_error($Op, result, x.val, y.val)
      $TrackedFloatN(result)
    end
  end
end

The outer loop runs through each of the different TrackedFloat types we generate, while the inner loop goes through all the operators.

There are a few edge cases, but Julia makes them pretty easy to handle.

We generate

  • 3 structs
  • 645 function variants
  • only 218 lines of code, about 23 of which are devoted to defining helper functions and boilerplate

I’ll briefly mention something that makes FloatTracker work for everyone: thanks to Julia’s type-dispatch mechanism, you can write your own override methods too! This is something that we encountered with some frequency. I know the RxInfer folks had to write at least one wrapper, but they didn’t have any difficulty with that.
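For example, if some package function is not covered by our generated methods, a user can forward it through the wrapper themselves. SomePkg.special here is a made-up stand-in for such a function:

# Hypothetical user-written override for an uncovered package function.
function SomePkg.special(x::TrackedFloat32)
  TrackedFloat32(SomePkg.special(x.val))  # unwrap, call through, re-wrap
end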

Conclusion

Despite its young age, FloatTracker has been useful not only to us as researchers, but also to developers like you in diagnosing floating-point exceptions. It can be a valuable tool for hardening floating-point code against inadvertent NaN kills, which can lead to baffling behavior or silently incorrect results.

We’ve been able to exercise some exciting metaprogramming abilities of Julia to make this possible.

Thank you for your attention. We hope you find FloatTracker useful to you as you write numerical code. I’ll be happy to answer your questions now.