
Parallel circuit evaluation #304

Merged — 20 commits merged into master on Jan 3, 2021
Conversation

scarrazza
Member

This PR implements:

  • parallel_execution: executes a given circuit for multiple input states.
  • parallel_reuploading_execution: executes a given circuit for multiple parametrization values (reuploading).

Both approaches are based on multi-processing and share the parallel L-BFGS-B machinery (a usage sketch follows below).

This PR also:

  • moves the parallel-related code to parallel.py
  • cleans up the ParallelResources singleton.
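
A minimal usage sketch of the two entry points, assuming the signatures used later in this thread (`parallel_execution(circuit, states, processes)` and `parallel_reuploading_execution(circuit, parameters, processes)`); the circuit choice and sizes are illustrative:

```python
import numpy as np
import qibo
from qibo import models
from qibo.parallel import parallel_execution, parallel_reuploading_execution

qibo.set_threads(2)  # threads used by each worker process

nqubits = 5
c = models.QFT(nqubits)

# evaluate the same circuit on several input states across worker processes
states = [np.random.random(2 ** nqubits) for _ in range(4)]
states = [s / np.linalg.norm(s) for s in states]  # normalize amplitudes
results = parallel_execution(c, states, processes=2)

# parallel_reuploading_execution instead takes a parametrized circuit and a
# list of flat parameter arrays compatible with circuit.set_parameters:
# results = parallel_reuploading_execution(c, parameters=params_list, processes=2)
```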

@codecov
codecov bot commented Dec 22, 2020

Codecov Report

Merging #304 (6ca9a64) into master (f1a3490) will not change coverage.
The diff coverage is 100.00%.


@@            Coverage Diff            @@
##            master      #304   +/-   ##
=========================================
  Coverage   100.00%   100.00%
=========================================
  Files           55        57    +2
  Lines        10743     10808   +65
=========================================
+ Hits         10743     10808   +65

| Flag | Coverage Δ |
|------|------------|
| unittests | 100.00% <100.00%> (ø) |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ |
|----------------|------------|
| src/qibo/__init__.py | 100.00% <100.00%> (ø) |
| src/qibo/optimizers.py | 100.00% <100.00%> (ø) |
| src/qibo/parallel.py | 100.00% <100.00%> (ø) |
| src/qibo/tensorflow/circuit.py | 100.00% <100.00%> (ø) |
| src/qibo/tensorflow/distutils.py | 100.00% <100.00%> (ø) |
| src/qibo/tests/test_custom_operators.py | 100.00% <100.00%> (ø) |
| src/qibo/tests/test_parallel.py | 100.00% <100.00%> (ø) |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@stavros11 (Member) left a comment:
Thanks for implementing this, I believe it will be very useful for many applications. I left some comments below. I also tried to run some benchmarks and found the following issue:

Using a common variational circuit with 10 layers (the snippet assumes it is stored in c, with size its total number of parameters), when I do

import numpy as np
import qibo
from qibo import parallel

qibo.set_threads(1)
parameters = [np.random.uniform(0, 2 * np.pi, size) for _ in range(10)]
result = parallel.parallel_reuploading_execution(c, parameters=parameters, processes=1)

it works for up to 15 qubits but freezes for 16 or more: the script keeps running but I never get a result. I tested this on three machines with different CPU/RAM, including the DGX, and I get the same behavior for 16 or more qubits in all cases. I thought the threshold would depend on the CPU configuration, but this does not seem to be the case. Have you observed something similar?

Note that standard sequential execution:

result = []
for params in parameters:
    c.set_parameters(params)
    result.append(c())

works for any number of qubits for the same example.

This kind of repeated execution is most useful for small circuits, so the >15-qubit case may end up being irrelevant in practice, but it would be good to know why this issue happens. Once it is fixed I will run some benchmarks for different thread/process configurations.


We provide CPU multi-processing methods for circuit evaluation for multiple
input states and multiple parameters (reuploading) for fixed input state. Make
sure to set an adeguate number of threads per process with ``qibo.set_threads``
@stavros11 (Member) commented:

Suggested change:
- sure to set an adeguate number of threads per process with ``qibo.set_threads``
+ sure to set an adequate number of threads per process with ``qibo.set_threads``

(by the way, does "adeguate" have the same meaning in Italian? I am not sure it is used in English)

I also think it would be useful to provide a more detailed explanation: the processes variable in the functions provided by qibo.parallel controls the number of parallel processes (the number of circuit copies), while qibo.set_threads controls the number of threads used to execute each circuit copy. Currently we document this in a note in qibo.optimizers.newtonian, but since this PR generalizes the parallelization functionality to other applications, I think it is more useful to move that note here; a short configuration example follows below.
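
For instance, the note could include a short configuration example like the following (illustrative numbers; `c` and `states` as in the other examples in this thread):

```python
import qibo
from qibo.parallel import parallel_execution

# processes controls how many circuit copies run concurrently, while
# set_threads controls the threads each copy uses for its own execution;
# keeping processes * threads <= physical cores avoids oversubscription.
qibo.set_threads(2)
results = parallel_execution(c, states, processes=4)
```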

    _check_parallel_configuration(processes)

    def operation(state, circuit):  # pragma: no cover
        return circuit(state)
@stavros11 (Member) commented:

Do you think it would be useful and simple to add an nshots parameter here when calling the circuit? This would allow the user to simulate measurements, because currently both parallelization methods can only be used to get the final state vector. I guess for this to work nshots would also have to be added as a parameter of the parallel_execution function, e.g. as sketched below.
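
A hedged sketch of what that could look like (not part of this PR; it assumes the circuit call accepts `nshots`, as the standard qibo simulation API does):

```python
def parallel_execution(circuit, states, nshots=None, processes=None):
    _check_parallel_configuration(processes)

    def operation(state, circuit):  # pragma: no cover
        # with nshots set, each worker would return measurement samples
        # instead of the final state vector
        return circuit(state, nshots=nshots)
    ...
```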


    Returns:
        Circuit evaluation for input states.
    """
@stavros11 (Member) commented:

Another useful application of this method is that it can be used to parallelize noise simulation with repetition, if None is passed as the initial state for all circuits. For example:

from qibo import models
from qibo.parallel import parallel_execution

nreps = 10

c = models.QFT(20)
c = c.with_noise((0.2, 0.0, 0.2))

states = nreps * [None]
result = parallel_execution(c, states)

will simulate the circuit for nreps different instances of the random numbers in the Pauli noise gates. The only issue is that parallel_execution currently does not support measurements, and getting the final states is probably not very useful for noise simulation.

I would try to implement this feature (noise simulation with parallel repeated execution) in a different PR after this one is merged; I am just mentioning it here to make sure that passing a list of None as the initial state is okay in terms of the parallelization mechanism.

@scarrazza (Member Author) commented:

@stavros11 thanks for the careful review, I will implement your suggestions, but let's first discuss the crash for nqubits > 15. The problem seems to appear when calling the tf.zeros method in _default_initial_state and may be related to the tf thread-pool environment.

I can run more than 15 qubits if I set tf.config.threading.set_intra_op_parallelism_threads(1); could you please check that on your side? If so, we can propose two alternative fixes (see the snippet after this list):

  • replace that line with numpy (which, however, I think cannot handle more than 32 qubits)
  • rewrite tf.zeros as a custom operator based on OpenMP.
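
For reference, the workaround mentioned above is a script-level setting that has to run before any other TF operation:

```python
import tensorflow as tf

# Pin TF's intra-op thread pool to a single thread so that worker
# processes do not inherit a multi-threaded pool that hangs on tf.zeros.
tf.config.threading.set_intra_op_parallelism_threads(1)
```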

@stavros11 (Member) commented:

> I can run more than 15 qubits if I set tf.config.threading.set_intra_op_parallelism_threads(1); could you please check that on your side?

Thanks for finding the issue. Indeed, when I set the intra-op threads it works for any number of qubits. I also agree that the problem is related to tf.zeros, because if I pass a custom initial state (thus avoiding tf.zeros) it works for any number of qubits without setting the intra-op threads.

Note, however, that during these tests I discovered another bug: if the initial state passed to parallel_reuploading_execution is a numpy array, I get the following error:

TypeError: Initial state type <class 'qibo.tensorflow.circuit.TensorflowCircuit'> is not recognized.

On the other hand, if I pass a tf.Tensor, e.g.:

initial_state = tf.cast(np.random.random(2 ** nqubits), dtype=tf.complex128)

it works for any number of qubits.

> If so, we can propose two alternative fixes:
>
>   • replace that line with numpy (which, however, I think cannot handle more than 32 qubits)
>   • rewrite tf.zeros as a custom operator based on OpenMP.

I would prefer to avoid numpy because the 32-qubit limitation would break other features. I am fine with the custom operator solution, but I guess we would have to follow a slightly different approach compared to the custom operators we already have, since we would need to create a new tensor: our existing custom operators modify tensors but never, I think, create one from scratch.

Another simple (but perhaps temporary) solution would be to use np.zeros for up to 32 qubits and tf.zeros for more, as sketched below. This would not break any existing features, as the initial_state op automatically converts the np.zeros output to a tf tensor. It would also solve the parallelization problem in practice: it is highly unlikely anyone would use parallel execution on a circuit with more than 32 qubits, since that would be very CPU- and memory-intensive.
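
A hedged sketch of that fallback (the helper name mirrors the `_default_initial_state` mentioned earlier; qibo's actual implementation details may differ):

```python
import numpy as np
import tensorflow as tf

def _default_initial_state(nqubits, dtype=tf.complex128):
    """Sketch: build |0...0> with numpy up to 32 qubits, TF ops beyond."""
    if nqubits <= 32:
        state = np.zeros(2 ** nqubits, dtype=np.complex128)
        state[0] = 1.0  # amplitude of |0...0>
        return tf.convert_to_tensor(state, dtype=dtype)
    zeros = tf.zeros(2 ** nqubits, dtype=dtype)
    return tf.tensor_scatter_nd_update(zeros, [[0]], [tf.cast(1, dtype)])
```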

@scarrazza (Member Author) commented:

@stavros11 thanks for spotting the numpy bug, I have just fixed it. I will now try implementing the tensor initialization operator.

@scarrazza mentioned this pull request on Dec 27, 2020
@stavros11 (Member) left a comment:

I believe this is good after the fixes. Two small/optional comments:

  • It may be useful to add an nshots option to parallel_execution.
  • I would reconsider the name parallel_reuploading_execution, because it refers to a very specific example while in principle the function may be useful in other cases. I don't have any better proposals though; perhaps something like parallel_parametrized_execution or parallel_variational_execution?

Here are some benchmarks using the latest version:

Variational circuit for 20 qubits, 10 layers and 40 repetitions with parallel_reuploading_execution:

| nprocesses | 1 thread (sec) | 2 threads (sec) | 4 threads (sec) | 8 threads (sec) |
|-----------:|---------------:|----------------:|----------------:|----------------:|
| 40 | 12.805098533630371 | | | |
| 38 | 13.695438861846924 | | | |
| 36 | 13.342200756072998 | | | |
| 32 | 12.5958993434906 | | | |
| 30 | 12.373421430587769 | | | |
| 28 | 12.331300020217896 | | | |
| 26 | 12.263689756393433 | | | |
| 24 | 12.209647178649902 | | | |
| 22 | 12.227311134338379 | | | |
| 20 | 12.40722107887268 | 12.147357702255249 | | |
| 18 | 13.226883888244629 | 12.18604850769043 | | |
| 16 | 12.482277393341064 | 12.095301866531372 | | |
| 14 | 12.233519077301025 | 12.05974268913269 | | |
| 12 | 13.351976156234741 | 12.101345539093018 | | |
| 10 | 12.468310356140137 | 12.153654336929321 | 11.824805974960327 | |
| 8 | 16.405864000320435 | 12.301052808761597 | 11.402990102767944 | |
| 6 | 20.804744482040405 | 12.999128341674805 | 11.134164571762085 | |
| 4 | 30.375078678131104 | 15.838544845581055 | 9.726101875305176 | 8.817081928253174 |
| 2 | 49.171082735061646 | 25.453674793243408 | 13.69065237045288 | 7.852612495422363 |
| 1 | 97.5296790599823 | 50.157207012176514 | 26.84400725364685 | 14.984319925308228 |

Using a single process:

| nthreads | sequential execution (sec) | nprocesses = 1 (sec) |
|---------:|---------------------------:|---------------------:|
| 1 | 96.41183614730835 | 97.5296790599823 |
| 2 | 48.74507212638855 | 50.157207012176514 |
| 4 | 25.092600107192993 | 26.84400725364685 |
| 8 | 13.313493490219116 | 14.984319925308228 |
| 10 | 10.96130084991455 | 12.503426551818848 |
| 15 | 7.736945152282715 | 9.565504312515259 |
| 20 | 10.791192531585693 | 12.425609827041626 |
| 25 | 8.981133460998535 | 10.492358446121216 |
| 30 | 7.747124433517456 | 9.72097110748291 |
| 35 | 7.45811915397644 | 8.76174020767212 |

QFT for 20 qubits and 40 repetitions with parallel_execution:

| nprocesses | 1 thread (sec) | 2 threads (sec) | 4 threads (sec) | 8 threads (sec) |
|-----------:|---------------:|----------------:|----------------:|----------------:|
| 40 | 3.5337796211242676 | | | |
| 38 | 3.4507174491882324 | | | |
| 36 | 3.4214696884155273 | | | |
| 32 | 3.30759859085083 | | | |
| 30 | 3.226335048675537 | | | |
| 28 | 3.1414377689361572 | | | |
| 26 | 3.0664708614349365 | | | |
| 24 | 3.00372576713562 | | | |
| 22 | 2.9865293502807617 | | | |
| 20 | 3.030362844467163 | 2.9598896503448486 | | |
| 18 | 3.070789098739624 | 2.8873302936553955 | | |
| 16 | 3.17545485496521 | 2.89013671875 | | |
| 14 | 3.119136095046997 | 2.837908983230591 | | |
| 12 | 3.395042657852173 | 2.7712819576263428 | | |
| 10 | 3.6031875610351562 | 2.7387940883636475 | 2.528254747390747 | |
| 8 | 4.8466339111328125 | 3.141209840774536 | 2.7192327976226807 | |
| 6 | 5.985300064086914 | 3.478762149810791 | 2.783780574798584 | |
| 4 | 8.500499486923218 | 4.790364503860474 | 2.9625113010406494 | 2.8075079917907715 |
| 2 | 14.234696388244629 | 8.162927150726318 | 5.1821606159210205 | 3.489781618118286 |
| 1 | 27.89056634902954 | 15.519933462142944 | 9.260996341705322 | 6.132781505584717 |

@scarrazza (Member Author) commented:

@stavros11 thanks for the benchmarks, the numbers look good. I will implement the suggestions so we can then merge this PR.

@scarrazza (Member Author) commented:

@stavros11 I have implemented your comments, except for the nshots option: the code hangs at the sample call for the same reason as tf.zeros, and the only way to preserve the thread pool is to call set_intra_op_parallelism_threads(1) at the beginning of the script. So, if you agree, let's merge this PR and then open another one where we check whether this issue can be solved without writing custom operators for tf.cast, tf.random.categorical, tf.transpose, etc.

@stavros11 (Member) commented:

Thanks for the updates, I agree with merging. Regarding nshots, we can indeed look at it in a different PR. I was also planning to have a look at using the parallel functionality for noise with repetition.

@scarrazza merged commit e6ef7c8 into master on Jan 3, 2021
@scarrazza deleted the paralleleval branch on February 6, 2021