Metropolis algorithm for sampling shots #332
Conversation
Codecov Report
@@            Coverage Diff             @@
##         measurements      #332   +/- ##
==============================================
  Coverage      100.00%   100.00%
==============================================
  Files              75        75
  Lines           12187     12214    +27
==============================================
+ Hits            12187     12214    +27
src/qibo/backends/tensorflow.py (Outdated)
@@ -47,7 +47,8 @@ def __init__(self):
        self.op = None
        if op._custom_operators_loaded:
            self.op = op

        self._seed = 1234  # seed to use in the measurement frequency custom op
Could we set the current seed used by tf instead?
That was my original plan; however, I could not find a "get_seed" method for Tensorflow. I will have a second look though. An alternative solution would be to set the seed randomly, for example via time, and set the same seed for Tensorflow, like
self._seed = int(time.time())
self.backend.random.set_seed(self._seed)
{
    int64 nstates = 1 << nqubits;
    srand(user_seed);
    unsigned thread_seed[omp_get_max_threads()];
I would avoid using C99 specs here. I believe these `omp_*` functions return `int`, not `const int`, so we may prefer to allocate dynamically or use a `std::vector`.
@stavros11 thanks for this implementation, it looks quite promising. I have a couple of questions concerning the quality of this approach:
Thank you for the comments.
I have been doing some tests regarding this, and here is a plot that runs the same test for 20 qubits. It appears that more shots are required to reach low KL, but this is true for both the Metropolis and the rejection algorithm. This plot is again for uniform p(s), but the same happens when using a random p(s). I will try to produce some KL vs nqubits plots.
No, the KL is calculated by comparing the distribution of the rejection/Metropolis samples (q(s) = frequency(s) / nshots) with the exact target distribution p(s). Specifically, KL = sum over all s of p(s) log(p(s) / q(s)), which we can calculate since we have p(s) for each s (just the wavefunction squared). In the first post, p(s) is indeed uniform (p(s) = 1 / nstates) for the left plot, but it is a random distribution, p(s) = tf.random.uniform(nstates) (normalized), for the right plot. I can do it for the QFT, but I would expect similar behavior; for example, the QFT with |000...0> as the initial state actually gives the uniform p(s), so it would be the same as the left plot.
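As a sketch of the test described above: the snippet below computes KL(p || q) between a target distribution p(s) and the empirical distribution q(s) built from sampled frequencies. This is a NumPy stand-in, not the actual test code from the PR; the sampler here is an i.i.d. `choice` call used only as a placeholder for the rejection/Metropolis samples, and bins with zero counts are simply dropped from the sum.

```python
import numpy as np

def kl_divergence(p, freqs, nshots):
    """KL(p || q) with q(s) = frequency(s) / nshots.

    Bins where q(s) = 0 are skipped; strictly they contribute an
    infinite term, so this is only meaningful when nshots is large
    enough that every high-probability bin has been sampled.
    """
    q = freqs / nshots
    mask = (p > 0) & (q > 0)
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

rng = np.random.default_rng(1234)
nstates, nshots = 16, 100_000
p = rng.random(nstates)
p /= p.sum()  # random normalized target distribution, as in the right plot
samples = rng.choice(nstates, size=nshots, p=p)  # placeholder for the sampler
freqs = np.bincount(samples, minlength=nstates)
kl = kl_divergence(p, freqs, nshots)
```

With enough shots the empirical q(s) approaches p(s) and the KL value approaches zero, which is the "lower KL means better agreement" criterion used in the plots.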
Great, thanks for the test and clarification.
In principle we could keep both sampling approaches, but I agree that this seems to be the better choice, since it is much faster and not much different in terms of statistical correctness. I made the two changes you suggested in the comments above, and I will do a few more plots regarding the nqubits scaling of the KL. I am not sure what the issue with CI is; perhaps it is not related to this PR?
@stavros11, thanks, I have fixed CI (it was a temporary failure). I think this PR is ready to be merged if you are not planning other changes. Let me know.
I agree with merging. I have checked up to 30 qubits, and up to 10^10 shots with 10 qubits, and everything seems to work without any precision issues. My only concern is that this also runs when using a machine with a GPU: I think the execution is done on the GPU but it automatically falls back to CPU for the measurement, since we do not have the CUDA kernel for this yet. I am not sure if this is the expected behavior. Here are some additional KL results including the categorical kernel:
Execution times for these runs are the following:
What do you mean exactly? By default, the custom operator is accessible for CPU and GPU, so it will try to run
// grab the input tensors
Tensor frequencies = context->input(0);
const Tensor& cumprobs = context->input(1);
const Tensor& nshots = context->input(2);
@stavros11, for the GPU it would be great if we could consider `nshots` as an attribute; otherwise we cannot access its value for the GPU block/thread computation.
The issue is that if we define `nshots` as an attribute:
REGISTER_OP("MeasureFrequencies")
.Attr("Tfloat: {float32, float64}")
.Attr("Tint: {int32, int64}")
.Input("frequencies: Tint")
.Input("probs: Tfloat")
.Attr("nshots: int")
...
then its type is `int32`, and this creates problems for many shots (like 10^10). I am not sure if Tensorflow supports int64 attributes, but I think not, according to their custom op guide.
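A quick check of why an int32 attribute cannot hold a shot count like 10^10 (using NumPy's dtype limits as a stand-in for the op attribute's storage):

```python
import numpy as np

nshots = 10**10  # e.g. the 10^10-shot runs mentioned in this thread

# int32 tops out at 2**31 - 1 = 2,147,483,647, far below 10^10,
# so an int32 attribute would overflow for large shot counts
assert nshots > np.iinfo(np.int32).max

# int64 (the dtype of the current nshots input tensor) is large enough
assert nshots < np.iinfo(np.int64).max
```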
Right, so I think a potential solution is to store the `nshots` value as the shape of the input tensor, so that we can access this information from the CPU. What do you think?
There is no tensor whose shape involves `nshots`; do you mean creating a new tensor of that shape as input? Won't this cause memory problems when nshots is very large?
Yes, this will cause memory issues. What about always using `float`/`double` and then casting to int64?
In terms of range, float should work; its range is roughly 1.2E-38 to 3.4E+38.
Maybe a more elegant solution is to always copy `nshots` to the CPU by allocating a new variable inside `with tf.device('CPU'):` and then calling the measurement operator. This should allow us to keep using int64 but also access kernels from custom operators.
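A minimal sketch of the suggestion above, assuming the helper name `cast_nshots_on_cpu` (hypothetical; in the PR the real call site is `sample_frequencies` in the Tensorflow backend, which would pass the resulting tensor to the `measure_frequencies` op):

```python
import tensorflow as tf

def cast_nshots_on_cpu(nshots):
    """Hypothetical helper: allocate nshots as an int64 tensor pinned
    to the CPU, so the custom op can read its value on the host even
    when the rest of the computation lives on a GPU."""
    with tf.device("CPU:0"):
        return tf.cast(nshots, tf.int64)

nshots_cpu = cast_nshots_on_cpu(10**10)
# nshots_cpu would then be passed as the nshots input of the custom op
```

As discussed further down, the caveat is whether the runtime copies this tensor back to the device before the op sees it.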
> Maybe a more elegant solution is to always copy the `nshots` to CPU by allocating a new variable using `with tf.device('CPU'):` and then calling the measurement operator. This should allow us to keep using the int64 but also access kernels from custom operators.
Would this require any change in the custom operator, or do we leave nshots as it is (an input) and just cast it on CPU in the Python code? Would there be any performance issues on GPU due to copying nshots from CPU?
Yes, we keep the op unchanged but overload the python method with the CPU cast. I don't think we will see memory issues; if I understand the code, nshots will always be a CPU tensor. Could you please give it a try?
I am not sure I understood correctly how this will help the GPU kernel, but can you please check whether what I pushed in the last commit is correct? The reason I did not overload the python method in `qibo_tf_custom_operators.py` is that I don't want to import K there (it leads to a circular import, so we would have to import it from within the function). The `measure_frequencies` op is only used by `sample_frequencies` in the Tensorflow backend, so what I did should be equivalent to overloading the method.
Thanks. I have tested it; however, nshots is copied back to the GPU after the cast, so this approach doesn't work.
If I run a circuit with measurements on GPU using this branch, the circuit is executed on GPU but the measurement op runs on CPU. I am not sure if this is the expected default behavior or if it should raise an error, since this op does not exist for GPU.
Yes, this behaviour is expected: given that we did not register GPU operators, all tensors are cast to CPU and executed by the custom operator.
Add nshot switcher
Speeding up Metropolis for frequency
Implements a custom operator for calculating measurement frequencies based on the Metropolis algorithm. Specifically, given a probability distribution p(s) over bitstrings (usually the square of the wavefunction), we start from s = argmax p(s) and perform bit-flips (XOR with random powers of 2), which we accept if p(s') / p(s) > random[0, 1]. All bitstrings sampled with this procedure are added to the frequencies; that is, we do not use any additional moves to obtain uncorrelated samples.
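The procedure above can be sketched in NumPy (the actual implementation is a C++ custom op; this stand-in follows the same steps, with function and variable names chosen for illustration):

```python
import numpy as np

def metropolis_frequencies(probs, nshots, nqubits, seed=1234):
    """NumPy sketch of the Metropolis frequency sampler described above."""
    rng = np.random.default_rng(seed)
    freqs = np.zeros(len(probs), dtype=np.int64)
    s = int(np.argmax(probs))  # start from the most probable bitstring
    for _ in range(nshots):
        # propose a bit-flip: XOR with a random power of 2
        s_new = s ^ (1 << int(rng.integers(nqubits)))
        # accept the move if p(s') / p(s) > random[0, 1]
        if probs[s_new] / probs[s] > rng.random():
            s = s_new
        # every step contributes a sample (the chain is correlated)
        freqs[s] += 1
    return freqs

nqubits = 4
p = np.full(2**nqubits, 1 / 2**nqubits)  # uniform p(s), as in the first test
f = metropolis_frequencies(p, 10_000, nqubits)
```

Note that for uniform p(s) every proposal is accepted, so the chain reduces to a random walk on the hypercube of bitstrings.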
Validity
I checked the statistics of this procedure by calculating the KL divergence between the target probability distribution p(s) and the one that corresponds to the measured frequencies q(s) = (number of times s appears) / nshots. Lower KL means better agreement between the samples and the target distribution. I compare the Metropolis approach implemented here with the rejection sampling implemented in #330.
Rejection sampling indeed performs slightly better in this test, however I believe Metropolis, even with correlated samples, is also acceptable.
Performance
I performed the following benchmark using this branch and #330 (rejection sampling) on DGX CPU