Parallel circuit evaluation #304
Codecov Report
@@ Coverage Diff @@
## master #304 +/- ##
=========================================
Coverage 100.00% 100.00%
=========================================
Files 55 57 +2
Lines 10743 10808 +65
=========================================
+ Hits 10743 10808 +65
Thanks for implementing this, I believe it is very useful for many applications. I left some comments below. I also tried to do some benchmarks and I found the following issue:
Using the following common variational circuit for 10 layers, when I do
qibo.set_threads(1)
parameters = [np.random.uniform(0, 2*np.pi, size) for _ in range(10)]
result = parallel.parallel_reuploading_execution(c, parameters=parameters, processes=1)
it works for up to 15 qubits but freezes for 16 or more. The script executes but I never get a result. I tested this on three machines with different CPU/RAM, including the DGX, and I get the same behavior for 16 or more qubits in all cases. I thought that the number would depend on the CPU configuration but this does not seem to be the case. Have you observed something similar?
Note that standard sequential execution:
result = []
for params in parameters:
    c.set_parameters(params)
    result.append(c())
works for any number of qubits for the same example.
This kind of repeated execution is more useful for small circuits so the >15 case may end up being useless but it would be good to know why this issue happens. Once this is fixed I will run some benchmarks of this for different thread/processes configurations.
doc/source/qibo.rst
We provide CPU multi-processing methods for circuit evaluation for multiple
input states and multiple parameters (reuploading) for fixed input state. Make
sure to set an adeguate number of threads per process with ``qibo.set_threads``
Suggested change:
- sure to set an adeguate number of threads per process with ``qibo.set_threads``
+ sure to set an adequate number of threads per process with ``qibo.set_threads``
(by the way, does "adeguate" have the same meaning in Italian? I am not sure if it is used in English)
I also think it would be useful to provide a more detailed explanation that the processes variable in the functions provided by qibo.parallel controls the number of parallel processes (number of circuit copies), while qibo.set_threads controls the number of threads used for the execution of each circuit. Currently we are documenting this in a note in qibo.optimizers.newtonian, but since this PR generalizes the parallelization functionality for other applications, I think it is more useful to move this note here.
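The distinction between processes and threads can be illustrated with a small sketch. The helper `threads_per_process` below is hypothetical and not part of qibo; it only shows one way to choose a per-process thread count so that the number of processes times the number of threads does not exceed the available cores.

```python
import os


def threads_per_process(processes, total_cores=None):
    """Hypothetical helper (not a qibo function): split cores evenly across processes."""
    if processes < 1:
        raise ValueError("processes must be a positive integer")
    if total_cores is None:
        # os.cpu_count() may return None on some platforms, so fall back to 1.
        total_cores = os.cpu_count() or 1
    # Keep at least one thread per process, even when processes > total_cores.
    return max(1, total_cores // processes)


# Example: 4 parallel circuit copies on an assumed 16-core machine.
nthreads = threads_per_process(4, total_cores=16)
# In a qibo script this value would be passed to qibo.set_threads(nthreads)
# before calling a qibo.parallel function with processes=4.
```

This is only a starting point for a thread/process configuration; the best split for a given circuit size is something to measure rather than assume.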
_check_parallel_configuration(processes)

def operation(state, circuit):  # pragma: no cover
    return circuit(state)
Do you think it would be useful and simple to add an nshots parameter here when calling the circuit? This will allow the user to simulate measurements, because currently both parallelization methods can only be used to get the final state vector. I guess for this to work nshots should also be added as parameter in the parallel_execution function.
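A minimal sketch of how nshots could be forwarded to the worker, assuming the circuit call accepts an nshots keyword. `FakeCircuit` and `execute_all` are hypothetical stand-ins so the example runs without qibo; this is not the implementation in this PR, and the thread later reports that the nshots option was not included.

```python
from functools import partial


def operation(state, circuit, nshots=None):
    # Worker run once per input state; nshots is forwarded to the circuit call.
    return circuit(state, nshots=nshots)


class FakeCircuit:
    """Stand-in for a circuit object (assumption: callable as circuit(state, nshots=...))."""

    def __call__(self, state=None, nshots=None):
        return {"state": state, "nshots": nshots}


def execute_all(circuit, states, nshots=None, map_fn=map):
    # partial binds circuit and nshots, so only the state varies per task.
    # map_fn could be replaced by the map method of a process pool.
    worker = partial(operation, circuit=circuit, nshots=nshots)
    return list(map_fn(worker, states))


results = execute_all(FakeCircuit(), [None, None, None], nshots=100)
```

Binding the extra arguments with `partial` keeps the worker a module-level function, which matters when the tasks are sent to other processes.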
Returns:
    Circuit evaluation for input states.
"""
Another useful application of this method is that it can be used to parallelize noise simulation with repetition if None is passed as an initial state for all circuits. For example:
nreps = 10
c = models.QFT(20)
c = c.with_noise((0.2, 0.0, 0.2))
states = nreps * [None]
result = parallel_execution(c, states)
will simulate the circuit for nreps different instances of the random numbers in the Pauli noise gates. The only issue with this is that currently the parallel_execution does not support measurements and getting the final states is probably not very useful for noise simulation.
I would try to implement this feature (noise simulation with parallel repeated execution) in a different PR after this is merged, I am just mentioning it here to make sure that passing a list of None as the initial state is okay in terms of the parallelization mechanism.
@stavros11 thanks for the careful review, I will implement your suggestions, but let's discuss the crash for nqubits > 15. The problem seems to appear when calling the I can do more than 15 qubits if I set
Thanks for finding the issue. Indeed when I set the intra op threads it works for any number of qubits. I also agree that the problem is related to Note however that during these tests I discovered another bug: If the initial state passed to TypeError: Initial state type <class 'qibo.tensorflow.circuit.TensorflowCircuit'> is not recognized. On the other hand if I pass a initial_state = tf.cast(np.random.random(2 ** nqubits), dtype=tf.complex128) it works for any number of qubits.
I would prefer to avoid numpy because the 32 qubit limitation will break other features, although . I am fine with the custom operator solution but I guess that we would have to follow a slightly different approach compared to the custom operators we already have since we would have to create a new tensor. Our custom operators modify existing tensors but never create one from scratch I think. Another simple (but perhaps temporary) solution would be to use
@stavros11 thanks for spotting this bug with numpy, I have just fixed it.
custom initial_state GPU
custom initial_state
I believe this is good after the fixes. Two small/optional comments:
- It may be useful to add the option for nshots in parallel_execution.
- I would reconsider the name parallel_reuploading_execution because it refers to a very specific example, while in principle this function may be useful in other cases. I don't have any better proposals though, perhaps something like parallel_parametrized_execution or parallel_variational_execution?
Here are some benchmarks using the latest version:
Variational circuit for 20 qubits, 10 layers and 40 repetitions with parallel_reuploading_execution
nprocesses | 1 thread (sec) | 2 threads (sec) | 4 threads (sec) | 8 threads (sec) |
---|---|---|---|---|
40 | 12.805098533630371 | | | |
38 | 13.695438861846924 | | | |
36 | 13.342200756072998 | | | |
32 | 12.5958993434906 | | | |
30 | 12.373421430587769 | | | |
28 | 12.331300020217896 | | | |
26 | 12.263689756393433 | | | |
24 | 12.209647178649902 | | | |
22 | 12.227311134338379 | | | |
20 | 12.40722107887268 | 12.147357702255249 | | |
18 | 13.226883888244629 | 12.18604850769043 | | |
16 | 12.482277393341064 | 12.095301866531372 | | |
14 | 12.233519077301025 | 12.05974268913269 | | |
12 | 13.351976156234741 | 12.101345539093018 | | |
10 | 12.468310356140137 | 12.153654336929321 | 11.824805974960327 | |
8 | 16.405864000320435 | 12.301052808761597 | 11.402990102767944 | |
6 | 20.804744482040405 | 12.999128341674805 | 11.134164571762085 | |
4 | 30.375078678131104 | 15.838544845581055 | 9.726101875305176 | 8.817081928253174 |
2 | 49.171082735061646 | 25.453674793243408 | 13.69065237045288 | 7.852612495422363 |
1 | 97.5296790599823 | 50.157207012176514 | 26.84400725364685 | 14.984319925308228 |
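To read the scaling more easily, the timings can be converted to speedup and parallel efficiency relative to a single process. The sketch below uses a subset of the 1-thread column of the table above; the function name `speedup_and_efficiency` is illustrative only.

```python
# Timings in seconds, copied from the 1-thread column (nprocesses -> time).
one_thread = {
    1: 97.5296790599823,
    2: 49.171082735061646,
    4: 30.375078678131104,
    8: 16.405864000320435,
    10: 12.468310356140137,
    20: 12.40722107887268,
    40: 12.805098533630371,
}


def speedup_and_efficiency(timings):
    """Return {nprocesses: (speedup, efficiency)} relative to the 1-process time."""
    base = timings[1]
    return {n: (base / t, base / (t * n)) for n, t in timings.items()}


for nproc, (speedup, efficiency) in sorted(speedup_and_efficiency(one_thread).items()):
    print(f"{nproc:>3} processes: speedup {speedup:5.2f}x, efficiency {efficiency:4.2f}")
```

Efficiency here is speedup divided by the number of processes, so it drops once adding processes no longer reduces the wall time.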
Using a single process
nthreads | sequential execution (sec) | nprocesses = 1 (sec) |
---|---|---|
1 | 96.41183614730835 | 97.5296790599823 |
2 | 48.74507212638855 | 50.157207012176514 |
4 | 25.092600107192993 | 26.84400725364685 |
8 | 13.313493490219116 | 14.984319925308228 |
10 | 10.96130084991455 | 12.503426551818848 |
15 | 7.736945152282715 | 9.565504312515259 |
20 | 10.791192531585693 | 12.425609827041626 |
25 | 8.981133460998535 | 10.492358446121216 |
30 | 7.747124433517456 | 9.72097110748291 |
35 | 7.45811915397644 | 8.76174020767212 |
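The cost of going through the multi-processing path with a single process can be expressed as a relative overhead over sequential execution. A small sketch using the first four rows of the table above (the helper name `overhead_percent` is illustrative):

```python
# (nthreads, sequential seconds, nprocesses = 1 seconds), copied from the table.
rows = [
    (1, 96.41183614730835, 97.5296790599823),
    (2, 48.74507212638855, 50.157207012176514),
    (4, 25.092600107192993, 26.84400725364685),
    (8, 13.313493490219116, 14.984319925308228),
]


def overhead_percent(sequential, single_process):
    """Extra time of the single-process parallel path, in percent of sequential."""
    return 100.0 * (single_process - sequential) / sequential


for nthreads, sequential, single_process in rows:
    print(f"{nthreads} threads: {overhead_percent(sequential, single_process):.1f}% overhead")
```

The absolute difference stays between roughly one and two seconds in these rows, so the relative overhead grows as the total runtime shrinks.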
QFT for 20 qubits and 40 repetitions with parallel_execution
nprocesses | 1 thread (sec) | 2 threads (sec) | 4 threads (sec) | 8 threads (sec) |
---|---|---|---|---|
40 | 3.5337796211242676 | | | |
38 | 3.4507174491882324 | | | |
36 | 3.4214696884155273 | | | |
32 | 3.30759859085083 | | | |
30 | 3.226335048675537 | | | |
28 | 3.1414377689361572 | | | |
26 | 3.0664708614349365 | | | |
24 | 3.00372576713562 | | | |
22 | 2.9865293502807617 | | | |
20 | 3.030362844467163 | 2.9598896503448486 | | |
18 | 3.070789098739624 | 2.8873302936553955 | | |
16 | 3.17545485496521 | 2.89013671875 | | |
14 | 3.119136095046997 | 2.837908983230591 | | |
12 | 3.395042657852173 | 2.7712819576263428 | | |
10 | 3.6031875610351562 | 2.7387940883636475 | 2.528254747390747 | |
8 | 4.8466339111328125 | 3.141209840774536 | 2.7192327976226807 | |
6 | 5.985300064086914 | 3.478762149810791 | 2.783780574798584 | |
4 | 8.500499486923218 | 4.790364503860474 | 2.9625113010406494 | 2.8075079917907715 |
2 | 14.234696388244629 | 8.162927150726318 | 5.1821606159210205 | 3.489781618118286 |
1 | 27.89056634902954 | 15.519933462142944 | 9.260996341705322 | 6.132781505584717 |
@stavros11 thanks for the benchmark, the numbers look good. I will implement the suggestions, so we can then merge this PR.
@stavros11 I have implemented your comments, except the nshots option, because the code hangs at the
Thanks for the updates, I agree with merging. Regarding
This PR implements:
- parallel_execution: executes a given circuit for multiple input states.
- parallel_reuploading_execution: executes a given circuit for multiple parametrization values (reuploading).
Both approaches are based on multi-processing and share the parallel_L-BFGS-B procedure.
This PR also:
parallel.py
ParallelResources
singleton.