Parallel circuit evaluation #304

scarrazza · 2020-12-22T20:59:25Z

This PR implements:

parallel_execution: executes a given circuit for multiple input states.
parallel_reuploading_execution: executes a given circuit for multiple parametrization values (reuploading).
Both approaches are based on multi-processing and share the parallel_L-BFGS-B procedure.

This PR also:

moves the parallel-related code to parallel.py
cleans up the ParallelResources singleton.

codecov · 2020-12-22T21:32:40Z

Codecov Report

Merging #304 (6ca9a64) into master (f1a3490) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##            master      #304   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           55        57    +2     
  Lines        10743     10808   +65     
=========================================
+ Hits         10743     10808   +65

Flag	Coverage Δ
unittests	`100.00% <100.00%> (ø)`

Flags with carried forward coverage won't be shown.

Impacted Files	Coverage Δ
src/qibo/__init__.py	`100.00% <100.00%> (ø)`
src/qibo/optimizers.py	`100.00% <100.00%> (ø)`
src/qibo/parallel.py	`100.00% <100.00%> (ø)`
src/qibo/tensorflow/circuit.py	`100.00% <100.00%> (ø)`
src/qibo/tensorflow/distutils.py	`100.00% <100.00%> (ø)`
src/qibo/tests/test_custom_operators.py	`100.00% <100.00%> (ø)`
src/qibo/tests/test_parallel.py	`100.00% <100.00%> (ø)`

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update f1a3490...6ca9a64.

stavros11

Thanks for implementing this, I believe it is very useful for many applications. I left some comments below. I also tried to do some benchmarks and I found the following issue:

Using the following common variational circuit for 10 layers, when I do

qibo.set_threads(1)
parameters = [np.random.uniform(0, 2*np.pi, size) for _ in range(10)]
result = parallel.parallel_reuploading_execution(c, parameters=parameters, processes=1)

it works for up to 15 qubits but freezes for 16 or more. The script runs but never returns a result. I tested this on three machines with different CPU/RAM, including the DGX, and I get the same behavior for 16 or more qubits in all cases. I thought that the number would depend on the CPU configuration but this does not seem to be the case. Have you observed something similar?

Note that standard sequential execution:

result = []
for params in parameters:
    c.set_parameters(params)
    result.append(c())

works for any number of qubits for the same example.

This kind of repeated execution is more useful for small circuits so the >15 case may end up being useless but it would be good to know why this issue happens. Once this is fixed I will run some benchmarks of this for different thread/processes configurations.

stavros11 · 2020-12-25T10:18:28Z

doc/source/qibo.rst

+
+We provide CPU multi-processing methods for circuit evaluation for multiple
+input states and multiple parameters (reuploading) for fixed input state. Make
+sure to set an adeguate number of threads per process with ``qibo.set_threads``


Suggested change

sure to set an adeguate number of threads per process with ``qibo.set_threads``

sure to set an adequate number of threads per process with ``qibo.set_threads``

(by the way, does "adeguate" have the same meaning in Italian? I am not sure if it is used in English)

I also think it would be useful to provide a more detailed explanation that the processes variable in the functions provided by qibo.parallel controls the number of parallel processes (number of circuit copies), while qibo.set_threads controls the number of threads used for the execution of each circuit. Currently we are documenting this in a note in qibo.optimizers.newtonian, but since this PR generalizes the parallelization functionality for other applications, I think it is more useful to move this note here.
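
To make the distinction concrete, here is a minimal sketch (not part of this PR) of how a fixed number of cores could be split between parallel processes and threads per process. The helper name `threads_per_process` and the `total_cores` argument are hypothetical; the returned value is what one would pass to ``qibo.set_threads``, while `processes` corresponds to the processes argument of the qibo.parallel functions.

```python
def threads_per_process(total_cores: int, processes: int) -> int:
    """Hypothetical helper: threads each process can use without oversubscription.

    `processes` plays the role of the `processes` argument in qibo.parallel;
    the return value is what one would pass to qibo.set_threads.
    """
    if total_cores < 1 or processes < 1:
        raise ValueError("total_cores and processes must be positive")
    if processes > total_cores:
        raise ValueError("more processes than cores would oversubscribe the CPU")
    # Integer division: leftover cores stay idle instead of being oversubscribed.
    return total_cores // processes


if __name__ == "__main__":
    for nproc in (1, 4, 10, 40):
        print(nproc, "processes ->", threads_per_process(40, nproc), "threads each")
```

The product processes x threads is kept at or below the core count, which is the constraint the note is meant to explain.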

stavros11 · 2020-12-25T11:00:39Z

src/qibo/parallel.py

+    _check_parallel_configuration(processes)
+
+    def operation(state, circuit): # pragma: no cover
+        return circuit(state)


Do you think it would be useful and simple to add an nshots parameter here when calling the circuit? This will allow the user to simulate measurements, because currently both parallelization methods can only be used to get the final state vector. I guess for this to work nshots should also be added as a parameter in the parallel_execution function.

stavros11 · 2020-12-25T11:04:31Z

src/qibo/parallel.py

+
+    Returns:
+        Circuit evaluation for input states.
+    """


Another useful application of this method is that it can be used to parallelize noise simulation with repetition if None is passed as an initial state for all circuits. For example:

nreps = 10
c = models.QFT(20)
c = c.with_noise((0.2, 0.0, 0.2))
states = nreps * [None]
result = parallel_execution(c, states)

will simulate the circuit for nreps different instances of the random numbers in the Pauli noise gates. The only issue with this is that currently the parallel_execution does not support measurements and getting the final states is probably not very useful for noise simulation.

I would try to implement this feature (noise simulation with parallel repeated execution) in a different PR after this is merged, I am just mentioning it here to make sure that passing a list of None as the initial state is okay in terms of the parallelization mechanism.

scarrazza · 2020-12-25T15:33:23Z

@stavros11 thanks for the careful review, I will implement your suggestions, but let's discuss the crash for nqubits > 15. The problem seems to appear when calling the tf.zeros method in _default_initial_state and may be related to the tf thread pool environment.

I can do more than 15 qubits if I set tf.config.threading.set_intra_op_parallelism_threads(1), could you please check that from your side? If so, then we can propose 2 alternative fixes:

replace that line with numpy (which however I think cannot do more than 32 qubits)
try to rewrite the tf.zeros as a custom operator based on openmp.

stavros11 · 2020-12-25T18:33:07Z

I can do more than 15 qubits if I set tf.config.threading.set_intra_op_parallelism_threads(1), could you please check that from your side?

Thanks for finding the issue. Indeed when I set the intra op threads it works for any number of qubits. I also agree that the problem is related to tf.zeros, because if I pass a custom initial state (thus avoid tf.zeros) it works for any number of qubits without setting the intra op threads.

Note however that during these tests I discovered another bug: If the initial state passed to parallel_reuploading_execution is a numpy array I get the following error:

TypeError: Initial state type <class 'qibo.tensorflow.circuit.TensorflowCircuit'> is not recognized.

On the other hand, if I pass a tf.Tensor, e.g.:

initial_state = tf.cast(np.random.random(2 ** nqubits), dtype=tf.complex128)

it works for any number of qubits.

If so, then we can propose 2 alternative fixes:

replace that line with numpy (which however I think cannot do more than 32 qubits)

try to rewrite the tf.zeros as a custom operator based on openmp.

I would prefer to avoid numpy because the 32 qubit limitation will break other features. I am fine with the custom operator solution, but I guess that we would have to follow a slightly different approach compared to the custom operators we already have, since we would have to create a new tensor. Our custom operators modify existing tensors but never create one from scratch, I think.

Another simple (but perhaps temporary) solution would be to use np.zeros for up to 32 qubits and tf.zeros for more. This would not break any existing features as the initial_state op would automatically convert np.zeros to tf. It would also solve the parallelization problem for most cases, as it is highly unlikely one would use the parallel execution in a circuit with more than 32 qubits. This would be very CPU and memory intensive.
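
As a rough back-of-the-envelope check of the memory claim, a dense complex128 state vector stores 2**nqubits amplitudes at 16 bytes each. The sketch below is illustrative only: `state_vector_bytes` and `zeros_backend` are hypothetical names, and the threshold simply mirrors the 32-qubit switch proposed above rather than any code in this PR.

```python
BYTES_PER_AMPLITUDE = 16  # complex128 = two 64-bit floats


def state_vector_bytes(nqubits: int) -> int:
    """Memory needed for one dense complex128 state vector."""
    return BYTES_PER_AMPLITUDE * 2 ** nqubits


def zeros_backend(nqubits: int, numpy_max_qubits: int = 32) -> str:
    """Hypothetical dispatch: which library would allocate the zero state."""
    return "numpy" if nqubits <= numpy_max_qubits else "tensorflow"


if __name__ == "__main__":
    for n in (16, 20, 32):
        gib = state_vector_bytes(n) / 2 ** 30
        print(f"{n} qubits: {gib:.6f} GiB, allocated via {zeros_backend(n)}")
```

At 32 qubits a single state already needs 64 GiB, and every parallel process would hold its own copy.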

scarrazza · 2020-12-26T09:07:33Z

@stavros11 thanks for spotting this bug with numpy, I have just fixed it.
I will now try implementing the tensor initialization operator.

custom initial_state GPU

custom initial_state

stavros11

I believe this is good after the fixes. Two small/optional comments:

It may be useful to add the option for nshots in parallel_execution.
I would reconsider the name parallel_reuploading_execution because it refers to a very specific example, while in principle this function may be useful in other cases. I don't have any better proposals though, perhaps something like parallel_parametrized_execution or parallel_variational_execution?

Here are some benchmarks using the latest version:

Variational circuit for 20 qubits, 10 layers and 40 repetitions with parallel_reuploading_execution

nprocesses	1 thread (sec)	2 threads (sec)	4 threads (sec)	8 threads (sec)
40	12.805098533630371
38	13.695438861846924
36	13.342200756072998
32	12.5958993434906
30	12.373421430587769
28	12.331300020217896
26	12.263689756393433
24	12.209647178649902
22	12.227311134338379
20	12.40722107887268	12.147357702255249
18	13.226883888244629	12.18604850769043
16	12.482277393341064	12.095301866531372
14	12.233519077301025	12.05974268913269
12	13.351976156234741	12.101345539093018
10	12.468310356140137	12.153654336929321	11.824805974960327
8	16.405864000320435	12.301052808761597	11.402990102767944
6	20.804744482040405	12.999128341674805	11.134164571762085
4	30.375078678131104	15.838544845581055	9.726101875305176	8.817081928253174
2	49.171082735061646	25.453674793243408	13.69065237045288	7.852612495422363
1	97.5296790599823	50.157207012176514	26.84400725364685	14.984319925308228
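
To read the table above as speedups, one can divide the single-process, single-thread time by the time of each configuration. The snippet below does this for a few entries copied from the table and rounded to three decimals; the dictionary layout and the `speedup` helper are only illustrative.

```python
# (nprocesses, nthreads) -> wall time in seconds, rounded from the table above
timings = {
    (1, 1): 97.530,
    (1, 8): 14.984,
    (2, 8): 7.853,
    (4, 8): 8.817,
    (10, 4): 11.825,
    (20, 2): 12.147,
    (40, 1): 12.805,
}

baseline = timings[(1, 1)]


def speedup(config):
    """Speedup of a (nprocesses, nthreads) configuration over the baseline."""
    return baseline / timings[config]


# Configuration with the smallest wall time among the selected entries.
best = min(timings, key=timings.get)

if __name__ == "__main__":
    for config in sorted(timings):
        print(config, f"{speedup(config):.2f}x")
    print("fastest configuration:", best)
```

With these numbers the fastest configuration is 2 processes with 8 threads each, roughly 12 times faster than a single process with one thread.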

Using a single process

nthreads	sequential execution (sec)	nprocesses = 1 (sec)
1	96.41183614730835	97.5296790599823
2	48.74507212638855	50.157207012176514
4	25.092600107192993	26.84400725364685
8	13.313493490219116	14.984319925308228
10	10.96130084991455	12.503426551818848
15	7.736945152282715	9.565504312515259
20	10.791192531585693	12.425609827041626
25	8.981133460998535	10.492358446121216
30	7.747124433517456	9.72097110748291
35	7.45811915397644	8.76174020767212
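
The relative overhead of going through the multi-processing machinery with a single process can be estimated from the table above as (parallel - sequential) / sequential. A small sketch with values rounded from the table; the variable and function names are illustrative.

```python
# nthreads -> (sequential seconds, nprocesses=1 seconds), rounded from the table
single_process = {
    1: (96.412, 97.530),
    2: (48.745, 50.157),
    4: (25.093, 26.844),
    8: (13.313, 14.984),
}


def overhead_percent(nthreads):
    """Extra time of the nprocesses=1 run relative to the sequential loop."""
    sequential, parallel = single_process[nthreads]
    return 100.0 * (parallel - sequential) / sequential


if __name__ == "__main__":
    for nthreads in sorted(single_process):
        print(f"{nthreads} threads: {overhead_percent(nthreads):.1f}% overhead")
```

The absolute gap stays between roughly 1 and 2 seconds in every row of the table, so its relative weight grows as the run gets shorter.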

QFT for 20 qubits and 40 repetitions with parallel_execution

nprocesses	1 thread (sec)	2 threads (sec)	4 threads (sec)	8 threads (sec)
40	3.5337796211242676
38	3.4507174491882324
36	3.4214696884155273
32	3.30759859085083
30	3.226335048675537
28	3.1414377689361572
26	3.0664708614349365
24	3.00372576713562
22	2.9865293502807617
20	3.030362844467163	2.9598896503448486
18	3.070789098739624	2.8873302936553955
16	3.17545485496521	2.89013671875
14	3.119136095046997	2.837908983230591
12	3.395042657852173	2.7712819576263428
10	3.6031875610351562	2.7387940883636475	2.528254747390747
8	4.8466339111328125	3.141209840774536	2.7192327976226807
6	5.985300064086914	3.478762149810791	2.783780574798584
4	8.500499486923218	4.790364503860474	2.9625113010406494	2.8075079917907715
2	14.234696388244629	8.162927150726318	5.1821606159210205	3.489781618118286
1	27.89056634902954	15.519933462142944	9.260996341705322	6.132781505584717

scarrazza · 2020-12-31T12:55:27Z

@stavros11 thanks for the benchmark, the numbers look good. I will implement the suggestions, so we can then merge this PR.

scarrazza · 2021-01-03T13:49:37Z

@stavros11 I have implemented your comments, except for the nshots option: the code hangs at the sample call for the same reason as tf.zeros, and the only way to preserve the threadpool is with set_intra*(1) at the beginning of the script. So, if you agree, let's merge this PR and then open another one where we check whether it is possible to solve this issue without writing custom operators for tf.cast, tf.random.categorical, tf.transpose, etc.

stavros11 · 2021-01-03T14:03:20Z

Thanks for the updates, I agree with merging. Regarding nshots, we can indeed look at this in a different PR. I was also planning to have a look at using the parallel functionality for noise with repetition.

scarrazza added 3 commits December 22, 2020 21:23

implementing parallel circuit evaluation

5c53c3a

adjustments

a86ce83

fixing coverage

bdb014e

fixing documentation and typos

94f43d3

stavros11 reviewed Dec 25, 2020


scarrazza added 2 commits December 26, 2020 10:03

fixing crash with numpy states

60641aa

fixing docs typo

10fcd88

scarrazza added 3 commits December 26, 2020 16:06

reducing memory footprint

62cae3b

fixing syntax

d496c5a

adding custom operators and fixing tests

30fe445

scarrazza mentioned this pull request Dec 27, 2020

custom initial_state #305

Merged

scarrazza and others added 8 commits December 28, 2020 11:47

changing interface for gpu support

4453849

fixing gpu type

62903c5

cleanup, setting omp threads globally

034ca6a

fixing cuda typo

34144b3

implementing stavros suggestion, fixes GPU tests

8e27923

Merge pull request #306 from Quantum-TII/custominittopgpu

e749430

custom initial_state GPU

fixing typo

110dfdc

Merge pull request #305 from Quantum-TII/custominitop

08fcb8a

custom initial_state

stavros11 approved these changes Dec 31, 2020


scarrazza added 3 commits January 3, 2021 12:00

renaming parallel function

6756d9f

updating documentation

0d97840

updating test name

6ca9a64

scarrazza merged commit e6ef7c8 into master Jan 3, 2021

scarrazza deleted the paralleleval branch February 6, 2021 13:44
