Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

qsim 0.16 requires more RAM for a simple Hadamards circuit compared to qsim 0.12 #612

Open
rht opened this issue May 19, 2023 · 14 comments
Open

Comments

@rht
Copy link

rht commented May 19, 2023

I was running the following code on a cuQuantum Appliance 23.03 Docker instance, on an a2-highgpu-1g, with a RAM of 85 GB:

import cirq

num_qubits = 32
qc_cirq = cirq.Circuit()
qubits = cirq.LineQubit.range(num_qubits)
for i in range(num_qubits):
    qc_cirq.append(cirq.H(qubits[i]))
import qsimcirq
sim = qsimcirq.QSimSimulator()
# sim = cirq.Simulator()
sim.simulate(qc_cirq)

The code ran just fine with cirq.Simulator(), taking 2 s in total. It ran fine with qsimcirq.QSimSimulator() for qsimcirq 0.12, taking ~19 s in total (probably could be optimized to ~2 s), and ran fine for qsimcirq 0.13, taking ~1 min 8 s. But for qsimcirq 0.14 and 0.16, I got

Traceback (most recent call last):
  File "oom.py", line 15, in <module>
    sim.simulate(qc_cirq)
  File "/home/cuquantum/conda/envs/cuquantum-23.03/lib/python3.8/site-packages/cirq/sim/simulator.py", line 495, in simulate
    return self.simulate_sweep(
  File "/home/cuquantum/conda/envs/cuquantum-23.03/lib/python3.8/site-packages/cirq/sim/simulator.py", line 510, in simulate_sweep
    return list(self.simulate_sweep_iter(program, params, qubit_order, initial_state))
  File "/home/cuquantum/conda/envs/cuquantum-23.03/lib/python3.8/site-packages/qsimcirq/qsim_simulator.py", line 529, in simulate_sweep_iter
    final_state = cirq.StateVectorSimulationState(
  File "/home/cuquantum/conda/envs/cuquantum-23.03/lib/python3.8/site-packages/cirq/sim/state_vector_simulation_state.py", line 350, in __init__
    state = _BufferedStateVector.create(
  File "/home/cuquantum/conda/envs/cuquantum-23.03/lib/python3.8/site-packages/cirq/sim/state_vector_simulation_state.py", line 85, in create
    return cls(state_vector, buffer)
  File "/home/cuquantum/conda/envs/cuquantum-23.03/lib/python3.8/site-packages/cirq/sim/state_vector_simulation_state.py", line 44, in __init__
    buffer = np.empty_like(state_vector)
  File "<__array_function__ internals>", line 180, in empty_like
numpy.core._exceptions.MemoryError: Unable to allocate 32.0 GiB for an array with shape (2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2) and data type complex64

The line where the error happens, is for construction of the final state vector after the simulation has finished.

The most I can bisect is between releases, and so it is due to change(s) between 0.13 and 0.14. Any idea what could be the cause?

@rht
Copy link
Author

rht commented May 19, 2023

It could also be due to older versions of Cirq for the qsimcirq 0.13 that I tested. I did it on an older version of cuQuantum Appliance, which has an older Cirq.

@rht
Copy link
Author

rht commented May 19, 2023

I tried qsimcirq 0.16.3 + Cirq 0.14, still OOM-ed. As such, the code that causes the increased memory cost is very likely in the qsim repo.

@95-martin-orion
Copy link
Collaborator

Thanks for raising this issue! The root cause has several parts:

  1. The line in question was introduced in Update for compatibility with cirq 1.0 #555 to comply with Cirq 1.0 return type requirements for simulators. The subsequent yield was also added in that PR.
  2. Cirq 1.0 expects simulate_sweep_iter on state-vector-based simulators to yield cirq.StateVectorTrialResult, which requires cirq.StateVectorSimulationState as input.
  3. cirq.StateVectorSimulationState is also used for ongoing simulations, and maintains a swap buffer for that purpose. The buffer = np.empty_like(state_vector) line in the stack trace above is the creation of that buffer.

The appropriate solution to this likely involves passing the buffer qsim uses in its C++ layer up to the python level so we can recycle it for cirq.StateVectorSimulationState. We could also simply free the C++ buffer to make space for Cirq to create a new one, but that would incur additional time cost.

@rht
Copy link
Author

rht commented May 22, 2023

I see, it's a relief knowing the cause.

The appropriate solution to this likely involves passing the buffer qsim uses in its C++ layer up to the python level so we can recycle it for cirq.StateVectorSimulationState.

Why is the return type in

static py::array_t<float> simulate_fullstate(
py::array_t<float> instead of complex?

(digression: I suppose a full solution to the issue requires changing pybind_main.cpp to expose the buffer, and thus requires a recompilation. Unfortunately, I can't recompile qsim from scratch because I'm using cuQuantum Appliance, which has custom modifications over qsim, especially the multi-GPU backend. Seems to require #601 to be solved)

@95-martin-orion
Copy link
Collaborator

Why is the return type in {...} py::array_t<float> instead of complex?

@sergeisakov would know best, but I believe this is because qsim stores the state as a float array for vector operations. Real and imaginary components of each complex value are stored as separate floats and recombined once the result surfaces in python.

@sergeisakov
Copy link
Collaborator

@sergeisakov would know best, but I believe this is because qsim stores the state as a float array for vector operations. Real and imaginary components of each complex value are stored as separate floats and recombined once the result surfaces in python.

Yes, this is correct.

@sergeisakov
Copy link
Collaborator

(digression: I suppose a full solution to the issue requires changing pybind_main.cpp to expose the buffer, and thus requires a recompilation. Unfortunately, I can't recompile qsim from scratch because I'm using cuQuantum Appliance, which has custom modifications over qsim, especially the multi-GPU backend. Seems to require #601 to be solved)

Why do you think this issue is related to #601?

@rht
Copy link
Author

rht commented May 22, 2023

NVM, that was a digression comment specific to my use case. Though I was mistaken that the issue in that comment is related to #601, because the cuQuantum Appliance would still contain (proprietary? [1]) modifications over qsim, even if there is a dynamic link to CUDA and cuQuantum.

[1] I can't seem to find the string max_fused_diagonal_gate_size in the qsim codebase.

@sergeisakov
Copy link
Collaborator

because the cuQuantum Appliance would still contain (proprietary? [1]) modifications over qsim

Yes, it seems the cuQuantum Appliance contains proprietary modifications over qsim.

@sergeisakov
Copy link
Collaborator

A note on performance. Just running

sim = cirq.Simulator()
sim.simulate(qc_cirq)

doesn't really perform any simulation. Running, say,

sim = cirq.Simulator()
result = sim.simulate(qc_cirq)
print(result.state_vector()[0])

performs simulation and it takes 130 seconds on an a2-highgpu-1g (not just 2 seconds). Running that with qsimcirq.QSimSimulator() takes 43 seconds (I guess this time may vary). However, the actual simulation time on an A100 GPU is just 0.8 seconds (0.4 seconds with gate fusion). It takes about 25 seconds to transfer the data from the device to host memory and it takes about 15 seconds to create the cirq.StateVectorTrialResult object. It's not really recommended to use simulate for shallow circuits on GPUs.

@rht
Copy link
Author

rht commented May 22, 2023

I see.

It's not really recommended to use simulate for shallow circuits on GPUs.

I believe the OOM issue on a2-highgpu-1g happens even on deep circuits, because of the post-simulation creation of cirq.StateVectorTrialResult.

@sergeisakov
Copy link
Collaborator

I believe the OOM issue on a2-highgpu-1g happens even on deep circuits

Yes, the OOM issue happens even on deep circuits. I think this issue should be fixed on the Cirq level.

Cirq has another line state_vector = state_vector.copy() that allocates memory in addition to buffer = np.empty_like(state_vector). So Cirq allocates two buffers and it does not recycle the buffer that it gets from the C++ layer.

@rht
Copy link
Author

rht commented May 23, 2023

I tried commenting out the state_vector = state_vector.copy() line, and no more OOM, yay! However, is it safe to do so? My assumption is that _BufferedStatevector new allocation of a copy of the state vector, make senses only if there are going to be further operations on the state vector, and that it needs to be immutable where the initial_state is not modified. If there are no further operations, then they are not needed.

I suppose, in addition to reusing the buffer from the C++ layer, there needs to be a flag in cirq.StateVectorSimulationState that allows modification of initial_state in place?

@sergeisakov
Copy link
Collaborator

However, is it safe to do so?

I think it should be safe to do so.

I suppose, in addition to reusing the buffer from the C++ layer, there needs to be a flag in cirq.StateVectorSimulationState that allows modification of initial_state in place?

Yes, something like that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants