From fd94318be898a7e40da4a213a69126ddd0b2c1a5 Mon Sep 17 00:00:00 2001
From: coreyostrove <78768318+coreyostrove@users.noreply.github.com>
Date: Mon, 6 Oct 2025 15:41:58 -0400
Subject: [PATCH] Shared Memory Bug (#660)

While playing around with MPI parallelization across parameters (instead of circuits, which is the default for the map simulator as of 0.9.13) I ran into a bug, which I tracked down to a shared-memory error inadvertently introduced in PR #515. Attached is a script which, when run under MPI (I was using 4 ranks), runs a single iteration of two-qubit GST on a single rank and on all four ranks and logs the results for comparison: [debug_dist_objective_fn.py](https://github.com/user-attachments/files/22715098/debug_dist_objective_fn.py). Running this script on 0.9.14.1 yields the following output:

```
Number of Ranks is 4

Fit Log Single Rank
-----------------------------
Precomputing CircuitOutcomeProbabilityArray layouts for each iteration.
Layout for iteration 0
Num Param Processors (1,)
MapLayout: 1 processors divided into 1 x 1 (= 1) grid along circuit and parameter directions.
  1 atoms, parameter block size limits (None,)
*** Distributing 1 atoms to 1 atom-processing groups (1 cores) ***
  More atom-processors than hosts: each host gets ~1 atom-processors
  Atom-processors already occupy a single node, dividing atom-processor into 1 param-processors.
*** Divided 1-host atom-processor (~1 procs) into 1 param-processing groups ***
--- Iterative GST: Iter 1 of 1 555 circuits ---:
  --- chi2 GST ---
  --- Outer Iter 0: norm_f = 4.5053e+09, mu=1, |x|=0, |J|=3.86694e+08
    - Inner Loop: mu=1.542e+12, norm_dx=4.69768e-06
      (cont): norm_new_f=1289.29, dL=4.42172e+09, dF=4.5053e+09, reldL=0.981449, reldF=1
  WARNING: Treating result as *converged* after maximum iterations (1) were exceeded.
      Accepted! gain ratio=1.0189 mu * 0.333333 => 5.14e+11
  Least squares message = Maximum iterations (1) exceeded
  _objfn = 1289.29 (1665 data params - 1440 (approx) model params = expected mean of 225; p-value = 0)
  Completed in 1.4s
  Iteration 1 took 1.5s

  Last iteration:
  --- logl GST ---
  --- Outer Iter 0: norm_f = 644.825, mu=1, |x|=0.00216741, |J|=556063
    - Inner Loop: mu=3.08599e+06, norm_dx=8.62909e-06
      (cont): norm_new_f=549.175, dL=95.681, dF=95.65, reldL=0.148383, reldF=0.148335
  WARNING: Treating result as *converged* after maximum iterations (1) were exceeded.
      Accepted! gain ratio=0.999676 mu * 0.333333 => 1.02866e+06
  Least squares message = Maximum iterations (1) exceeded
  _objfn = 1098.35 (1665 data params - 1440 (approx) model params = expected mean of 225; p-value = 0)
  Completed in 1.3s
  Final optimization took 1.3s

Fit Log Four Ranks
-----------------------------
Precomputing CircuitOutcomeProbabilityArray layouts for each iteration.
Layout for iteration 0
Num Param Processors (4,)
MapLayout: 4 processors divided into 1 x 4 (= 4) grid along circuit and parameter directions.
  1 atoms, parameter block size limits (None,)
*** Distributing 1 atoms to 1 atom-processing groups (4 cores) ***
*** Using shared memory to communicate between the 4 cores on each of 1 hosts ***
  More atom-processors than hosts: each host gets ~1 atom-processors
  Atom-processors already occupy a single node, dividing atom-processor into 4 param-processors.
*** Divided 1-host atom-processor (~4 procs) into 4 param-processing groups ***
--- Iterative GST: Iter 1 of 1 555 circuits ---:
  --- chi2 GST ---
  --- Outer Iter 0: norm_f = 6381.58, mu=1, |x|=0, |J|=3.66935e+11
    - Inner Loop: mu=1.44141e+18, norm_dx=4.27565e-18
      (cont): norm_new_f=6177.24, dL=4871.93, dF=204.333, reldL=0.763437, reldF=0.0320192
      Accepted! gain ratio=0.0419408 mu * 0.3 => 4.32423e+17
  WARNING: Treating result as *converged* after maximum iterations (1) were exceeded.
  Least squares message = Maximum iterations (1) exceeded
  _objfn = 6177.24 (1665 data params - 1440 (approx) model params = expected mean of 225; p-value = 0)
  Completed in 0.5s
  Iteration 1 took 0.6s

  Last iteration:
  --- logl GST ---
C:\Users\ciostro\Documents\pyGSTi_automated_model_selection\pygsti\objectivefns\objectivefns.py:4471: RuntimeWarning: divide by zero encountered in divide
  p5over_lsvec = 0.5/lsvec
C:\Users\ciostro\Documents\pyGSTi_automated_model_selection\pygsti\objectivefns\objectivefns.py:4471: RuntimeWarning: divide by zero encountered in divide
  p5over_lsvec = 0.5/lsvec
C:\Users\ciostro\Documents\pyGSTi_automated_model_selection\pygsti\objectivefns\objectivefns.py:4471: RuntimeWarning: divide by zero encountered in divide
  p5over_lsvec = 0.5/lsvec
C:\Users\ciostro\Documents\pyGSTi_automated_model_selection\pygsti\objectivefns\objectivefns.py:4471: RuntimeWarning: divide by zero encountered in divide
  p5over_lsvec = 0.5/lsvec
  --- Outer Iter 0: norm_f = 2617.65, mu=1, |x|=2.06777e-09, |J|=6.23454e+11
    - Inner Loop: mu=4.05554e+18, norm_dx=3.83592e-19
      (cont): norm_new_f=2605.54, dL=1399.13, dF=12.1112, reldL=0.534499, reldF=0.00462674
  WARNING: Treating result as *converged* after maximum iterations (1) were exceeded.
      Accepted! gain ratio=0.00865623 mu * 0.3 => 1.21666e+18
  Least squares message = Maximum iterations (1) exceeded
  _objfn = 5211.07 (1665 data params - 1440 (approx) model params = expected mean of 225; p-value = 0)
  Completed in 0.5s
  Final optimization took 0.5s
```

Looking at the final objective function values the results are obviously quite different, but what caught my eye is that this is true even at the very first step, before the optimizer has moved at all:

- Single rank: norm_f = 4.5053e+09
- Four ranks: norm_f = 6381.58

Shared memory is a real mind-bender, so I'll spare folks the details of the hunt, but ultimately the problem comes down to [this line](https://github.com/sandialabs/pyGSTi/blob/339c8a3df6e84c9a154e04d42e090ab9d27f78b2/pygsti/objectivefns/objectivefns.py#L4415). The bug arises because in the multi-rank setting `lsvec` is an instance of `LocalNumpyArray`, which generally points to an array living in shared memory. The square root on that line was not guarded, and since every rank points to the same memory the square root gets applied as many as `num_ranks` times (in my testing the count was actually nondeterministic, usually 3 or 4, so there is probably also a race condition in play).

I pattern-matched off of other parts of `objectivefns.py` where shared memory is used (mainly `terms` in `TimeIndependentMDCObjectiveFunction`) and believe I've added the correct guards (the patch worked in my testing). I also added back a few comments about the use of shared memory in this part of the code that were removed in #515; I found them useful while debugging.
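To make the failure mode concrete without needing MPI or pyGSTi, below is a minimal sketch of the same pattern using `multiprocessing` in place of MPI ranks. The names `run_rank`, `demo`, and `guard` are made up for this illustration; the leader check plus barrier is meant to mirror the `is_host_leader` / `host_comm_barrier()` guard added in the patch, not pyGSTi's actual API.

```python
# Illustration only (not pyGSTi code): several "ranks" share one buffer, and an
# unguarded in-place `**= 0.5` gets applied once per rank instead of once total.
import numpy as np
from multiprocessing import Barrier, Process, RawArray


def run_rank(shared, is_leader, guard, barrier):
    """Stand-in for one rank holding a view of the same lsvec-like shared buffer."""
    lsvec = np.frombuffer(shared, dtype=np.float64)  # every rank sees the same memory
    if not guard or is_leader:   # unguarded: every rank writes; guarded: leader only
        lsvec **= 0.5            # in place, so repeats compound: 16 -> 4 -> 2 -> ...
    barrier.wait()               # analogue of host_comm_barrier(): don't read too early


def demo(guard, n_ranks=4):
    shared = RawArray('d', [16.0] * 4)   # expect sqrt -> [4, 4, 4, 4]
    barrier = Barrier(n_ranks)
    procs = [Process(target=run_rank, args=(shared, rank == 0, guard, barrier))
             for rank in range(n_ranks)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return np.frombuffer(shared, dtype=np.float64).copy()


if __name__ == "__main__":
    print("unguarded:", demo(guard=False))  # sqrt applied up to n_ranks times (race-dependent)
    print("guarded:  ", demo(guard=True))   # sqrt applied exactly once
```

With the guard in place the square root is applied exactly once, no matter how many workers share the buffer.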
P.S. As for why this wasn't caught sooner: the default forward simulator since 0.9.13 is `MapForwardSimulator`, and the default parallelization strategy for map is across circuits. It happens that when parallelizing only across circuits the portions of shared memory indexed into by each rank are non-overlapping, but that isn't true when parallelizing across parameters. So you would have had to manually change the parallelization strategy to see this behavior, which most users don't do unless they have a specific reason to.

Co-authored-by: Corey Ostrove
---
 pygsti/objectivefns/objectivefns.py | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/pygsti/objectivefns/objectivefns.py b/pygsti/objectivefns/objectivefns.py
index a88975f16..81e40b1ca 100644
--- a/pygsti/objectivefns/objectivefns.py
+++ b/pygsti/objectivefns/objectivefns.py
@@ -4374,12 +4374,15 @@ def terms(self, paramvec=None, oob_check=False, profiler_str="TERMS OBJECTIVE"):
             self.model.from_vector(paramvec)
         terms = self.obj.view()
 
+        # Whether this rank is the "leader" of all the processors accessing the same shared self.jac and self.probs mem.
+        # Only leader processors should modify the contents of the shared memory, so we only apply operations *once*
+        # `unit_ralloc` is the group of all the procs targeting same destination into self.obj
         unit_ralloc = self.layout.resource_alloc('atom-processing')
         shared_mem_leader = unit_ralloc.is_host_leader
 
         with self.resource_alloc.temporarily_track_memory(self.nelements):  # 'e' (terms)
-            self.model.sim.bulk_fill_probs(self.probs, self.layout)
-            self._clip_probs()
+            self.model.sim.bulk_fill_probs(self.probs, self.layout)  # syncs shared mem
+            self._clip_probs()  # clips self.probs in place w/shared mem sync
 
             if oob_check:  # Only used for termgap cases
                 if not self.model.sim.bulk_test_if_paths_are_sufficient(self.layout, self.probs, verbosity=1):
@@ -4412,11 +4415,15 @@ def lsvec(self, paramvec=None, oob_check=False, raw_objfn_lsvec_signs=True):
                 in {bad_locs}.
                 """
                 raise RuntimeError(msg)
-        lsvec **= 0.5
+        unit_ralloc = self.layout.resource_alloc('atom-processing')
+        shared_mem_leader = unit_ralloc.is_host_leader
+        if shared_mem_leader:
+            lsvec **= 0.5
         if raw_objfn_lsvec_signs:
-            if self.layout.resource_alloc('atom-processing').is_host_leader:
+            if shared_mem_leader:
                 raw_lsvec = self.raw_objfn.lsvec(self.probs, self.counts, self.total_counts, self.freqs)
                 lsvec[:self.nelements][raw_lsvec < 0] *= -1
+            unit_ralloc.host_comm_barrier()
         return lsvec
 
     def dterms(self, paramvec=None):