Conversation

@coreyostrove coreyostrove commented Oct 6, 2025

While playing around with MPI parallelization across parameters (instead of circuits, which is the default for the map simulator as of 0.9.13), I ran into incorrect results that I traced to a shared-memory bug inadvertently introduced in PR #515. Attached is a script which, when run under MPI (I used 4 ranks), runs a single iteration of two-qubit GST on a single rank and on all four ranks and logs the results for comparison: [debug_dist_objective_fn.py](https://github.com/user-attachments/files/22715098/debug_dist_objective_fn.py). Running this script on 0.9.14.1 yields the following output:
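
(I launched the comparison with a standard MPI runner, something like `mpiexec -n 4 python debug_dist_objective_fn.py`, or whatever launcher your MPI installation provides.)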

```
Number of Ranks is 4
Fit Log Single Rank
-----------------------------
  Precomputing CircuitOutcomeProbabilityArray layouts for each iteration.
  Layout for iteration 0
  Num Param Processors (1,)
  MapLayout: 1 processors divided into 1 x 1 (= 1) grid along circuit and parameter directions.
     1 atoms, parameter block size limits (None,)
  *** Distributing 1 atoms to 1 atom-processing groups (1 cores) ***
      More atom-processors than hosts: each host gets ~1 atom-processors
      Atom-processors already occupy a single node, dividing atom-processor into 1 param-processors.
  *** Divided 1-host atom-processor (~1 procs) into 1 param-processing groups ***
--- Iterative GST: Iter 1 of 1  555 circuits ---:
  --- chi2 GST ---
    --- Outer Iter 0: norm_f = 4.5053e+09, mu=1, |x|=0, |J|=3.86694e+08
        - Inner Loop: mu=1.542e+12, norm_dx=4.69768e-06
            (cont): norm_new_f=1289.29, dL=4.42172e+09, dF=4.5053e+09, reldL=0.981449, reldF=1

WARNING: Treating result as *converged* after maximum iterations (1) were exceeded.
            Accepted! gain ratio=1.0189  mu * 0.333333 => 5.14e+11
    Least squares message = Maximum iterations (1) exceeded
  _objfn = 1289.29 (1665 data params - 1440 (approx) model params = expected mean of 225; p-value = 0)
  Completed in 1.4s
  Iteration 1 took 1.5s

  Last iteration:
  --- logl GST ---
    --- Outer Iter 0: norm_f = 644.825, mu=1, |x|=0.00216741, |J|=556063
        - Inner Loop: mu=3.08599e+06, norm_dx=8.62909e-06
            (cont): norm_new_f=549.175, dL=95.681, dF=95.65, reldL=0.148383, reldF=0.148335

WARNING: Treating result as *converged* after maximum iterations (1) were exceeded.
            Accepted! gain ratio=0.999676  mu * 0.333333 => 1.02866e+06
    Least squares message = Maximum iterations (1) exceeded
  _objfn = 1098.35 (1665 data params - 1440 (approx) model params = expected mean of 225; p-value = 0)
  Completed in 1.3s
  Final optimization took 1.3s

Fit Log Four Ranks
-----------------------------
  Precomputing CircuitOutcomeProbabilityArray layouts for each iteration.
  Layout for iteration 0
  Num Param Processors (4,)
  MapLayout: 4 processors divided into 1 x 4 (= 4) grid along circuit and parameter directions.
     1 atoms, parameter block size limits (None,)
  *** Distributing 1 atoms to 1 atom-processing groups (4 cores) ***
  *** Using shared memory to communicate between the 4 cores on each of 1 hosts ***
      More atom-processors than hosts: each host gets ~1 atom-processors
      Atom-processors already occupy a single node, dividing atom-processor into 4 param-processors.
  *** Divided 1-host atom-processor (~4 procs) into 4 param-processing groups ***
--- Iterative GST: Iter 1 of 1  555 circuits ---:
  --- chi2 GST ---
    --- Outer Iter 0: norm_f = 6381.58, mu=1, |x|=0, |J|=3.66935e+11
        - Inner Loop: mu=1.44141e+18, norm_dx=4.27565e-18
            (cont): norm_new_f=6177.24, dL=4871.93, dF=204.333, reldL=0.763437, reldF=0.0320192
            Accepted! gain ratio=0.0419408  mu * 0.3 => 4.32423e+17

WARNING: Treating result as *converged* after maximum iterations (1) were exceeded.
    Least squares message = Maximum iterations (1) exceeded
  _objfn = 6177.24 (1665 data params - 1440 (approx) model params = expected mean of 225; p-value = 0)
  Completed in 0.5s
  Iteration 1 took 0.6s

  Last iteration:
  --- logl GST ---
C:\Users\ciostro\Documents\pyGSTi_automated_model_selection\pygsti\objectivefns\objectivefns.py:4471: RuntimeWarning: divide by zero encountered in divide
  p5over_lsvec = 0.5/lsvec
C:\Users\ciostro\Documents\pyGSTi_automated_model_selection\pygsti\objectivefns\objectivefns.py:4471: RuntimeWarning: divide by zero encountered in divide
  p5over_lsvec = 0.5/lsvec
C:\Users\ciostro\Documents\pyGSTi_automated_model_selection\pygsti\objectivefns\objectivefns.py:4471: RuntimeWarning: divide by zero encountered in divide
  p5over_lsvec = 0.5/lsvec
C:\Users\ciostro\Documents\pyGSTi_automated_model_selection\pygsti\objectivefns\objectivefns.py:4471: RuntimeWarning: divide by zero encountered in divide
  p5over_lsvec = 0.5/lsvec
    --- Outer Iter 0: norm_f = 2617.65, mu=1, |x|=2.06777e-09, |J|=6.23454e+11
        - Inner Loop: mu=4.05554e+18, norm_dx=3.83592e-19
            (cont): norm_new_f=2605.54, dL=1399.13, dF=12.1112, reldL=0.534499, reldF=0.00462674

WARNING: Treating result as *converged* after maximum iterations (1) were exceeded.
            Accepted! gain ratio=0.00865623  mu * 0.3 => 1.21666e+18
    Least squares message = Maximum iterations (1) exceeded
  _objfn = 5211.07 (1665 data params - 1440 (approx) model params = expected mean of 225; p-value = 0)
  Completed in 0.5s
  Final optimization took 0.5s
```

Looking at the final objective function values, the results are obviously quite different, but what caught my eye is that they differ even at the very first step, before the optimizer has moved at all:

Single rank: norm_f = 4.5053e+09
Four ranks: norm_f = 6381.58
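
For intuition, here is a toy demonstration (my own numbers, not taken from the run) of how even one extra, unguarded elementwise square root on the least-squares vector collapses norm_f by orders of magnitude:

```python
import numpy as np

rng = np.random.default_rng(0)
lsvec = rng.uniform(0.0, 3000.0, size=1665)  # toy least-squares vector

correct = np.sum(lsvec**2)    # norm_f as intended: sum of squared terms
np.sqrt(lsvec, out=lsvec)     # one extra in-place sqrt, as if a second rank ran it
corrupted = np.sum(lsvec**2)  # collapses to the sum of the original entries

print(f"correct norm_f: {correct:.4e}, after one extra sqrt: {corrupted:.4e}")
```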

Shared memory is a real mind-bender, so I'll spare folks the details of how I tracked this down, but ultimately the culprit is [this line](https://github.com/sandialabs/pyGSTi/blob/339c8a3df6e84c9a154e04d42e090ab9d27f78b2/pygsti/objectivefns/objectivefns.py#L4415). The bug arises because in the multi-rank setting `lsvec` is an instance of `LocalNumpyArray`, which generally points to an array that lives in shared memory. The in-place square root on that line was not guarded, and since each rank was pointing at the same memory, the square root was applied as many as `num_ranks` times (in my testing the count was actually nondeterministic, usually 3 or 4, probably a race condition in play). I pattern-matched off of other parts of `objectivefns.py` where shared memory is used (mainly `terms` in `TimeIndependentMDCObjectiveFunction`) and think I got the correct guards added in (the patch worked in my testing).
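
To make the guard pattern concrete, here is a minimal sketch of the kind of fix involved. This is an illustration, not the actual pyGSTi code: `shared_mem_leader` and `host_comm` are hypothetical stand-ins for pyGSTi's resource-allocation machinery.

```python
import numpy as np

def lsvec_from_terms(terms, shared_mem_leader=True, host_comm=None):
    """Guarded in-place sqrt on a possibly-shared array (sketch).

    `terms` may be a view onto shared memory that several ranks on the
    same host alias. `shared_mem_leader` should be True on exactly one
    rank per host; `host_comm` is an optional per-host communicator
    (e.g. an mpi4py Comm) used to synchronize before anyone reads.
    """
    lsvec = terms  # aliases the same buffer on every rank of the host
    if shared_mem_leader:
        # Only the leader mutates the shared buffer, so the square root
        # is applied exactly once instead of once per rank.
        np.sqrt(lsvec, out=lsvec)
    if host_comm is not None:
        host_comm.Barrier()  # no rank reads until the update has landed
    return lsvec
```

Without the `if shared_mem_leader:` guard, every rank executes the in-place `np.sqrt` on the same buffer, and how many times it lands before the values are read depends on scheduling, which matches the nondeterminism described above.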

I also added back a few comments related to the use of shared memory in this part of the code that were removed in #515; I found them useful while debugging this.

P.S. As for why this wasn't caught sooner: the default forward simulator since 0.9.13 is `MapForwardSimulator`, and the default parallelization strategy for map is across circuits. It happens that when parallelizing only across circuits, the portions of shared memory indexed by each rank are non-overlapping, but that isn't true when parallelizing across parameters. So you would have had to manually change the parallelization strategy to see this behavior, which most users don't do unless they have a specific reason to.
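
A minimal sketch of that distinction, with hypothetical names chosen just to illustrate the two access patterns (this is not the actual pyGSTi distribution logic):

```python
import numpy as np

def circuit_parallel_update(shared_terms, rank, nranks):
    # Circuit-parallel: each rank owns a disjoint slice of the
    # circuit-indexed terms array, so unguarded in-place updates
    # happen to never collide.
    n = len(shared_terms)
    mine = slice(rank * n // nranks, (rank + 1) * n // nranks)
    np.sqrt(shared_terms[mine], out=shared_terms[mine])

def parameter_parallel_update(shared_terms, is_shared_mem_leader):
    # Parameter-parallel: the work is split along the parameter axis
    # (e.g. columns of the Jacobian), so every rank sees the *same*
    # full circuit-indexed terms array and a leader guard is required.
    if is_shared_mem_leader:
        np.sqrt(shared_terms, out=shared_terms)
```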

Fixes a shared-memory bug in `TimeIndependentMDCObjectiveFunction` that caused the square root of the objective function terms to be applied multiple times. These changes guard those in-place updates by ensuring that the shared-memory leader is the only rank making them.
@coreyostrove coreyostrove added this to the 0.9.14.2 milestone Oct 6, 2025
@coreyostrove coreyostrove requested review from a team and rileyjmurray as code owners October 6, 2025 05:14

@rileyjmurray rileyjmurray left a comment

Thanks for suffering through this debugging, and finding the fix!

@coreyostrove coreyostrove merged commit fd94318 into bugfix Oct 6, 2025
4 checks passed
@coreyostrove coreyostrove deleted the bugfix-objfn-sharedmem branch October 6, 2025 19:42
coreyostrove added a commit that referenced this pull request Oct 7, 2025
Note: For some reason the automatic merge from bugfix into develop
failed on PR #660, so this PR is just manually performing that merge.
