
Metropolis algorithm for sampling shots #332

Merged: 44 commits into measurements on Mar 5, 2021

Conversation

stavros11
Member

Implements a custom operator for calculating measurement frequencies based on the Metropolis algorithm. Specifically, given a probability distribution p(s) over bitstrings (usually the squared modulus of the wavefunction), we start from s = argmax p(s) and perform single bit-flips (XOR-ing s with random powers of 2), accepting a move from s to s' if p(s') / p(s) > random[0, 1]. All bitstrings visited by this procedure are added to the frequencies; that is, we do not use any additional moves to decorrelate the samples.
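The procedure can be sketched in plain NumPy (the actual operator is implemented in C++; the function name and signature below are illustrative, not the real op):

```python
import numpy as np

def metropolis_frequencies(probs, nshots, seed=1234):
    """Metropolis sampling of measurement frequencies (illustrative sketch).

    probs: 1D array of length 2**nqubits that sums to one.
    Returns integer counts per bitstring index.
    """
    rng = np.random.default_rng(seed)
    nqubits = int(np.log2(len(probs)))
    frequencies = np.zeros(len(probs), dtype=np.int64)
    s = int(np.argmax(probs))  # start from the most probable bitstring
    for _ in range(nshots):
        # propose flipping one random bit of s
        proposal = s ^ (1 << rng.integers(nqubits))
        # accept if p(s') / p(s) exceeds a uniform random number
        if probs[proposal] / probs[s] > rng.random():
            s = proposal
        frequencies[s] += 1  # every step contributes a shot
    return frequencies
```

Note that the current state is counted on every step, whether or not the proposal is accepted, which is what makes consecutive samples correlated.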

Validity

I checked the statistics of this procedure by calculating the KL divergence between the target probability distribution p(s) and the one that corresponds to the measured frequencies q(s) = (number of times s appears) / nshots. Lower KL means better agreement between the samples and the target distribution. I compare the Metropolis approach implemented here with the rejection sampling implemented in #330.
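Concretely, the check can be written as follows (a sketch; `kl_divergence` is an illustrative name, and the mask assumes the comparison is restricted to states where both p(s) and q(s) are nonzero so the sum stays finite):

```python
import numpy as np

def kl_divergence(target, frequencies):
    """KL(p || q) where q(s) = frequencies[s] / nshots."""
    q = frequencies / frequencies.sum()
    # restrict to states with p(s) > 0 and q(s) > 0 to keep the sum finite
    mask = (target > 0) & (q > 0)
    return float(np.sum(target[mask] * np.log(target[mask] / q[mask])))
```

A KL of zero means the sampled frequencies reproduce the target distribution exactly; larger values mean worse agreement.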

[Figure kl_both: KL divergence vs nshots for rejection sampling and Metropolis]

  • Left: Using a uniform distribution for p(s), for example the one obtained with the following circuit:

from qibo import gates
from qibo.models import Circuit

c = Circuit(10)
c.add((gates.H(i) for i in range(10)))
c.add(gates.M(*range(10)))
  • Right: Averaging the KL over 30 random distributions p(s) (also for 10 qubits, that is of size 1024).

Rejection sampling indeed performs slightly better in this test; however, I believe Metropolis, even with correlated samples, is also acceptable.
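For reference, the rejection-sampling alternative can be sketched as follows (a generic sketch, not the actual #330 implementation; a bitstring is proposed uniformly and accepted with probability p(s) / max(p)):

```python
import numpy as np

def rejection_frequencies(probs, nshots, seed=1234):
    """Rejection sampling sketch: uniform proposals, accept with p(s) / max(p)."""
    rng = np.random.default_rng(seed)
    pmax = probs.max()
    frequencies = np.zeros(len(probs), dtype=np.int64)
    accepted = 0
    while accepted < nshots:
        s = rng.integers(len(probs))  # propose a bitstring uniformly
        if probs[s] / pmax > rng.random():  # accept with probability p(s)/pmax
            frequencies[s] += 1
            accepted += 1
    return frequencies
```

Unlike Metropolis, accepted samples here are independent, but many proposals can be rejected when the distribution is peaked, which is one reason for the timing gap in the benchmarks below.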

Performance

I performed the following benchmark using this branch and #330 (rejection sampling) on the DGX CPU:

import time

from qibo import K

nqubits = 10
nshots = int(1e8)  # varied as in the table below

probs = K.ones(2 ** nqubits)
probs = probs / K.sum(probs)

start_time = time.time()
frequencies = K.sample_frequencies(probs, nshots)
print(time.time() - start_time)
nshots | Rejection sampling (sec) | Metropolis (sec)
------ | ------------------------ | ----------------
1e8    | 3.215017                 | 0.333904
2e8    | 6.362053                 | 0.565742
4e8    | 12.60637                 | 1.085155
6e8    | 18.87215                 | 1.566486
8e8    | 25.160118                | 2.056643
1e9    | 31.419534                | 2.562747
2e9    | 62.782110                | 5.049047
4e9    | 124.32199                | 9.682504
6e9    | 180.82613                | 15.061134

@codecov

codecov bot commented Feb 28, 2021

Codecov Report

Merging #332 (dc2c88a) into measurements (6fd90b4) will not change coverage.
The diff coverage is 100.00%.


@@              Coverage Diff               @@
##           measurements      #332   +/-   ##
==============================================
  Coverage        100.00%   100.00%           
==============================================
  Files                75        75           
  Lines             12187     12214   +27     
==============================================
+ Hits              12187     12214   +27     
Flag      | Coverage Δ
unittests | 100.00% <100.00%> (ø)

Flags with carried forward coverage won't be shown.

Impacted Files | Coverage Δ
src/qibo/backends/tensorflow.py | 100.00% <100.00%> (ø)
src/qibo/config.py | 100.00% <100.00%> (ø)
src/qibo/tensorflow/custom_operators/__init__.py | 100.00% <100.00%> (ø)
...m_operators/python/ops/qibo_tf_custom_operators.py | 100.00% <100.00%> (ø)
...o/tests_new/test_measurement_gate_probabilistic.py | 100.00% <100.00%> (ø)
...qibo/tests_new/test_tensorflow_custom_operators.py | 100.00% <100.00%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 6fd90b4...dc2c88a.

@@ -47,7 +47,8 @@ def __init__(self):
self.op = None
if op._custom_operators_loaded:
self.op = op

self._seed = 1234 # seed to use in the measurement frequency custom op
Member

Could we set the current seed used by tf instead?

Member Author

That was my original plan; however, I could not find a "get_seed" method in TensorFlow. I will have a second look though. An alternative solution would be to set the seed randomly, for example via time, and set the same seed for TensorFlow, like:

self._seed = int(time.time())
self.backend.random.set_seed(self._seed)

{
int64 nstates = 1 << nqubits;
srand(user_seed);
unsigned thread_seed[omp_get_max_threads()];
Member

@scarrazza scarrazza Feb 28, 2021

I would avoid using C99 features (variable-length arrays) here; I believe these omp_* functions return int, not a compile-time constant, so we may prefer to allocate dynamically or use a std::vector.

@scarrazza
Member

@stavros11 thanks for this implementation, it looks quite promising. I have a couple of questions concerning the quality of this approach:

  • how does the KL scale with nqubits?
  • if I understand correctly, you are comparing the uniform approach vs rejection/Metropolis. What about using an analytic expression, e.g. the QFT, as a reference?

@stavros11
Member Author

stavros11 commented Feb 28, 2021

Thank you for the comments.

  • how does the KL scale with nqubits?

I have been doing some tests regarding this and here is a plot that does the same test for 20 qubits:

[Figure kl20qubits: the same KL test for 20 qubits]

It appears that more shots are required to get a low KL, but this is true both for the Metropolis and the rejection algorithm. This plot is again for a uniform p(s), but the same happens when using a random p(s). I will try to produce some KL vs nqubits plots.

  • if I understand correctly, you are comparing the uniform approach vs rejection/Metropolis. What about using an analytic expression, e.g. the QFT, as a reference?

No, the KL is calculated by comparing the distribution of the rejection/Metropolis samples (q(s) = frequency(s) / nshots) with the exact target distribution p(s). Specifically KL = sum over all s of p(s) log(p(s) / q(s)), and this we can calculate since we have p(s) for each s (just the wavefunction squared).

In the first post p(s) is indeed uniform (that is, p(s) = 1 / nstates) for the left plot, but it is a random distribution p(s) = tf.random.uniform(nstates) (normalized) for the right plot. I can do it for the QFT, but I would expect similar behavior. For example, the QFT with |000...0> as the initial state actually gives the uniform p(s), so it would be the same as the left plot.
What I haven't compared is the KL of the tf.random.categorical approach that is currently used in Qibo. This would be a third line in the above plots, but I would expect it to be close to the rejection one.

@scarrazza
Member

Great, thanks for the test and clarification.
I think we can consider this approach as the default for sampling.

@stavros11
Member Author

Great, thanks for the test and clarification.
I think we can consider this approach as the default for sampling.

In principle we could keep both sampling approaches, but I agree that this seems to be the better choice since it is much faster and not much different in terms of statistical correctness. I made the two changes you suggested in the comments above, and I will do a few more plots on the nqubits scaling of the KL. I am not sure what the issue with CI is; perhaps it is not related to this PR?

@scarrazza
Member

@stavros11, thanks I have fixed CI (was a temporary failure). I think this PR is ready to be merged if you are not planning other changes. Let me know.

@stavros11
Member Author

@stavros11, thanks I have fixed CI (was a temporary failure). I think this PR is ready to be merged if you are not planning other changes. Let me know.

I agree with merging. I have checked up to 30 qubits, and up to 10^10 shots with 10 qubits, and everything seems to work without any precision issues. My only concern is that this runs even on a machine with a GPU: the circuit execution is done on the GPU, but it automatically falls back to CPU for the measurement since we do not have the CUDA kernel for this yet. I am not sure if this is the expected behavior.

Here are some additional KL results including the categorical kernel:

  • Scaling with number of shots for 10 qubits.

[Figure kl_vs_shots: KL vs number of shots for 10 qubits]

  • Scaling with number of qubits for 10^6 shots.

[Figure kl_vs_qubits: KL vs number of qubits for 10^6 shots]

Execution times for these runs are the following:

  • For 10 qubits:
nshots | Metropolis (sec) | Rejection (sec) | tf.random.categorical (sec)
------ | ---------------- | --------------- | ---------------------------
100    | 0.001926         | 0.0018654       | 0.001610
1000   | 0.001919         | 0.001943        | 0.001789
1e4    | 0.002068         | 0.0028894       | 0.003031
1e5    | 0.002682         | 0.0101371       | 0.013999
1e6    | 0.006419         | 0.0565665       | 0.113498
1e7    | 0.029231         | 0.3882477       | 1.033740
1e8    | 0.217336         | 3.2174122       | 10.416167
3e8    | 0.578732         | 9.4760861       |
6e8    | 1.517557         | 18.883093       |
1e9    | 2.488320         | 31.408768       |
  • For 10^6 shots:
nqubits | Metropolis (sec) | Rejection (sec) | tf.random.categorical (sec)
------- | ---------------- | --------------- | ---------------------------
6       | 0.005672         | 0.009360        | 0.078742
8       | 0.004703         | 0.019943        | 0.094087
10      | 0.00725          | 0.046534        | 0.111727
12      | 0.00912          | 0.183703        | 0.137952
14      | 0.01830          | 0.535228        | 0.169207
16      | 0.040208         | 1.9285371       | 0.252327
18      | 0.046489         | 7.4501459       | 0.362962
20      | 0.062015         | 29.454006       | 0.466291
22      | 0.084872         | 118.180581      | 0.743794
24      | 0.144699         | 565.525054      | 1.596632
26      | 0.372883         | 2964.00242      | 4.021918

@scarrazza
Member

My only concern is that this runs even on a machine with a GPU. I think the execution is done on the GPU but it automatically falls back to CPU for the measurement since we do not have the CUDA kernel for this yet. I am not sure if this is the expected behaviour.

What do you mean exactly? By default, the custom operator is accessible for CPU and GPU, so it will try to run MeasureFrequenciesOp. If you remember, some time ago, we had checks in order to raise an error if the operator was not available for GPU.

// grab the input tensor
Tensor frequencies = context->input(0);
const Tensor& cumprobs = context->input(1);
const Tensor& nshots = context->input(2);
Member

@stavros11, for the GPU it would be great if we could consider nshots as an attribute; otherwise we cannot access its value for the GPU block/thread computation.

Member Author

The issue is that if we define nshots as an attribute:

REGISTER_OP("MeasureFrequencies")
    .Attr("Tfloat: {float32, float64}")
    .Attr("Tint: {int32, int64}")
    .Input("frequencies: Tint")
    .Input("probs: Tfloat")
    .Attr("nshots: int")
    ...

then its type is int32, and this creates problems for many shots (like 10^10). I am not sure if Tensorflow supports int64 for attributes, but I think not, according to their custom op guide.

Member

Right, so I think a potential solution is to store the nshots value as the shape of the input tensor, so we can access this information from CPU. What do you think?

Member Author

There is no tensor whose shape involves nshots; do you mean creating a new tensor of that shape as input? Wouldn't this cause memory problems when nshots is very large?

Member

Yes, this would cause memory issues. What about always using float/double and then casting to int64?

Member

In terms of range, float should work: its range is from 1.2E-38 to 3.4E+38.

Maybe a more elegant solution is to always copy nshots to the CPU by allocating a new variable using with tf.device('CPU'): and then calling the measurement operator. This should allow us to keep using int64 while also accessing kernels from custom operators.

Member Author

Maybe a more elegant solution is to always copy the nshots to CPU by allocating a new variable using with tf.device('CPU'): and then calling the measurement operator. This should allow us to keep using the int64 but also access kernels from custom operators.

Would this require any change in the custom operator, or do we leave nshots as it is (an input) and just cast it on CPU in the Python code? Would there be any performance issues on the GPU due to copying nshots from the CPU?

Member

@scarrazza scarrazza Mar 1, 2021

Yes, we keep the op unchanged but overload the Python method with the CPU cast. I don't think we will see memory issues; if I understand the code, nshots will always be a CPU tensor. Could you please give it a try?

Member Author

@stavros11 stavros11 Mar 2, 2021

I am not sure I understood correctly how this will help the GPU kernel, but can you please check whether what I pushed in the last commit is correct? The reason I did not overload the Python method in qibo_tf_custom_operators.py is that I don't want to import K there (it leads to a circular import, so we would have to import it from within the function).

The measure_frequencies op is only used by sample_frequencies in the Tensorflow backend, so what I did should be equivalent to overloading the method.

Member

Thanks. I have tested it; however, nshots is copied back to the GPU after the cast, so this approach doesn't work.

@stavros11
Member Author

What do you mean exactly? By default, the custom operator is accessible for CPU and GPU, so it will try to run MeasureFrequenciesOp. If you remember, some time ago, we had checks in order to raise an error if the operator was not available for GPU.

If I run a circuit with measurements on a GPU using this branch, the circuit is executed on the GPU but the measurement op runs on CPU. I am not sure if this is the expected default behavior or if it should raise an error, since this op does not exist for GPU.

@scarrazza
Member

scarrazza commented Mar 1, 2021

Yes, this behaviour is expected: given that we did not register GPU operators, all tensors are cast to CPU and executed by the custom operator.

@scarrazza scarrazza merged commit 549652a into measurements Mar 5, 2021
@scarrazza scarrazza deleted the metropolisop branch March 6, 2021 09:26