
Implement the OrdinaryDiffEq interface for Kets #16

Draft · wants to merge 2 commits into base: master
Conversation

@Krastanov (Collaborator) commented Apr 4, 2021

TODO

  • enough broadcasting for StateVectors and Schroedinger equations to work naively with OrdinaryDiffEq
  • same for density matrices and Lindblad
  • same for simple Monte Carlo
  • semiclassical
  • stochastic
  • for all of them, the speed and allocations should be the same or better
  • sparse states?

Actually converting the current solvers to use this simpler "no-recast" interface is out of scope for this pull request.
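For context, "work naively" here means a `Ket` can be used directly as the ODE state in both SciML call forms, out-of-place `f(ψ,p,t)` and in-place `f!(dψ,ψ,p,t)`. A minimal sketch (mirroring the examples below; not code from this PR):

```julia
using QuantumOptics
using DifferentialEquations
using LinearAlgebra: mul!

b = SpinBasis(1//2)
ψ₀ = spindown(b)
H = sigmax(b)

# out-of-place form: returns a fresh Ket on every call
f(ψ, p, t) = im * H * ψ
# in-place form: writes the derivative into a preallocated Ket
f!(dψ, ψ, p, t) = mul!(dψ, im * H, ψ)

solve(ODEProblem(f, ψ₀, (0.0, pi)), Tsit5())
solve(ODEProblem(f!, deepcopy(ψ₀), (0.0, pi)), Tsit5())
```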

Old message:

The following now works

```julia
using QuantumOptics
using DifferentialEquations

ℋ = SpinBasis(1//2)

σx = sigmax(ℋ)

↓ = s = spindown(ℋ)

schrod(ψ,p,t) = im * σx * ψ

t₀, t₁ = (0.0, pi)
Δt = 0.1

prob = ODEProblem(schrod, ↓, (t₀, t₁))
sol = solve(prob,Tsit5())
```
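Since the solver states are `Ket`s, the solution can be consumed with the usual QuantumOptics tooling; for instance (illustrative, assuming the interface above):

```julia
ψT = sol.u[end]  # the final state is still a Ket
expect(σx, ψT)   # so QuantumOptics functions apply directly
```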

It works for Bras as well.
It works for in-place operations; however, there are spurious
allocations due to inefficient broadcasting that ruin the performance.

@Krastanov (Collaborator, Author) commented Apr 4, 2021

These changes are self-contained and let you define ODEProblems directly on the StateVector objects. However, the current in-place version actually makes more allocations than the naive version... Something is quite wrong.

Here is a quick example:

```julia
t₀, t₁ = (0.0, pi)
ℋ = SpinBasis(1//2)
σx = sigmax(ℋ)
↓ = spindown(ℋ)
iσx = im * σx
schrod!(dψ,ψ,p,t) = mul!(dψ, iσx, ψ)
prob = ODEProblem(schrod!, ↓, (t₀, t₁))
@benchmark sol = solve(prob,DP5(),save_everystep=false) # Allocates ~600KiB
schrod(ψ,p,t) = iσx*ψ
prob = ODEProblem(schrod, ↓, (t₀, t₁))
@benchmark sol = solve(prob,DP5(),save_everystep=false) # Allocates ~170KiB which is less than the schrod! !?!?
@benchmark timeevolution.schroedinger([t₀,t₁], ↓, σx) # Allocates ~12KiB
```

Help with this would be appreciated :) I am fairly sure it is not due to the current broadcasting implementation (which does have some overhead, but it is a constant overhead, not something that scales with size). Any idea how to find what allocates so much?
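Two standard Julia techniques for localizing allocations (general tooling, not from this PR):

```julia
# 1. The allocation profiler (Julia ≥ 1.8): record every allocation during a solve
using Profile
Profile.Allocs.@profile sample_rate=1 solve(prob, DP5(), save_everystep=false)
results = Profile.Allocs.fetch()
sort(results.allocs, by = a -> a.size, rev = true)[1:10]  # largest allocation sites

# 2. Line-level accounting: start Julia with `julia --track-allocation=user`,
#    run the solve once to compile, call Profile.clear_malloc_data(),
#    run it again, and exit; per-line byte counts land in *.mem files.
```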

@david-pl (Member) commented Apr 6, 2021

Thanks for digging into this!

I think it's the broadcasting that ruins the performance here though. Replacing the in-place broadcasting method here

```julia
@inline function Base.copyto!(dest::Ket{B}, bc::Broadcast.Broadcasted{Style,Axes,F,Args}) where {B<:Basis,Style<:KetStyle{B},Axes,F,Args}
```

with

```julia
@inline function Base.copyto!(dest::Ket{B}, bc::Broadcast.Broadcasted{Style,Axes,F,Args}) where {B<:Basis,Style<:KetStyle{B},Axes,F,Args}
    axes(dest) == axes(bc) || throwdm(axes(dest), axes(bc))
    bc′ = Base.Broadcast.preprocess(dest, bc)
    dest′ = dest.data
    @inbounds @simd for I in eachindex(bc′)
        dest′[I] = bc′[I]
    end
    return dest
end
Base.getindex(st::StateVector, idx) = getindex(st.data, idx)
```

which is essentially the example Chris Rackauckas posted in your discourse thread:
https://github.com/SciML/OrdinaryDiffEq.jl/blob/4a110ca8761845b3aa667a109583775c4c718132/test/interface/noindex_tests.jl#L23

Then I get

```julia
julia> @benchmark sol = solve($prob,$alg,save_everystep=false)
BenchmarkTools.Trial: 
  memory estimate:  9.92 KiB
  allocs estimate:  133
  --------------
  minimum time:     14.402 μs (0.00% GC)
  median time:      16.446 μs (0.00% GC)
  mean time:        17.666 μs (3.53% GC)
  maximum time:     6.282 ms (99.16% GC)
  --------------
  samples:          10000
  evals/sample:     1
```

for the in-place version of your code. This is much better, but still not as good as timeevolution.schroedinger and the difference scales with the problem size. So I guess we'd really need a cleaner and more performant implementation of broadcasting here.

FYI, you're also benchmarking in global scope and using some non-constant globals, which you should avoid to get proper results. Either put all the benchmarking inside a function, or make the globals constant (in this case you need `const iσx = im * σx` for `schrod!` to be type-stable) and interpolate global variables such as `prob` into `@benchmark` with a `$`.
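Concretely, the fixed-up benchmark looks something like this (a sketch with fresh names to avoid clobbering the session above):

```julia
using BenchmarkTools
using LinearAlgebra: mul!

const iσx_c = im * sigmax(SpinBasis(1//2))  # const global → schrod_c! is type-stable
schrod_c!(dψ, ψ, p, t) = mul!(dψ, iσx_c, ψ)
prob_c = ODEProblem(schrod_c!, spindown(SpinBasis(1//2)), (0.0, pi))
alg = DP5()
@benchmark solve($prob_c, $alg, save_everystep=false)  # $ interpolates the globals
```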

@david-pl (Member) commented Apr 6, 2021

Actually, the number of additional allocations of the in-place version is constant, in the sense that the "allocs estimate" stays the same regardless of the problem size. The memory estimate grows, though, and is always higher than with `timeevolution.schroedinger`. What's odd is that the in-place version is actually faster. Changing to `ℋ = SpinBasis(20//2)` I get

For the in-place solve:

```
BenchmarkTools.Trial: 
  memory estimate:  15.67 KiB
  allocs estimate:  145
  --------------
  minimum time:     101.143 μs (0.00% GC)
  median time:      109.095 μs (0.00% GC)
  mean time:        111.976 μs (0.94% GC)
  maximum time:     5.413 ms (97.68% GC)
  --------------
  samples:          10000
  evals/sample:     1
```

For `timeevolution.schroedinger`:

```
BenchmarkTools.Trial: 
  memory estimate:  13.62 KiB
  allocs estimate:  73
  --------------
  minimum time:     155.181 μs (0.00% GC)
  median time:      168.481 μs (0.00% GC)
  mean time:        170.973 μs (0.30% GC)
  maximum time:     5.366 ms (96.15% GC)
  --------------
  samples:          10000
  evals/sample:     1
```

@david-pl (Member) commented Apr 6, 2021

Another chunk of performance is lost in the out-of-place broadcasting method. Using the following:

```julia
@inline function Base.copy(bc::Broadcast.Broadcasted{Style,Axes,F,Args}) where {B<:Basis,Style<:KetStyle{B},Axes,F,Args<:Tuple}
    bcf = Broadcast.flatten(bc)
    b = find_basis(bcf)
    T = find_dType(bcf)
    data = zeros(T, length(b))
    @inbounds @simd for I in eachindex(bcf)
        data[I] = bcf[I]
    end
    return Ket{B}(b, data)
end

for f ∈ [:find_basis, :find_dType]
    @eval ($f)(bc::Broadcast.Broadcasted) = ($f)(bc.args)
    @eval ($f)(args::Tuple) = ($f)(($f)(args[1]), Base.tail(args))
    @eval ($f)(x) = x
    @eval ($f)(::Any, rest) = ($f)(rest)
end
find_basis(a::StateVector, rest) = a.basis
find_dType(a::StateVector, rest) = eltype(a)
```

puts me really close in speed and allocations for your original example. The allocations also seem to scale a bit better. For my spin-20//2 example, I get 13.95KiB vs the 13.62KiB from the timeevolution call.
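With those methods in place, out-of-place broadcasting should return a `Ket` directly, without unwrapping `.data` (illustrative usage, assuming the PR's broadcast style):

```julia
ψ = spindown(SpinBasis(1//2))
ϕ = ψ .+ (2.0 + 0im) .* ψ  # dispatches to the copy method above
ϕ isa Ket                  # true; basis and eltype recovered via find_basis/find_dType
```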

@Krastanov (Collaborator, Author) commented:

I am a bit lost about what is happening here. Could you explain what `Broadcast.preprocess` does? Also, what does indexing a `Broadcasted` object like `bcf[I]` do? I did not find documentation on these. Lastly, how come this still works with scalars? (I had to implement a `getdata` method that is just `x->x` for scalars.)

@Krastanov (Collaborator, Author) commented:

I updated the pull request with the changes you explained and also added some extra tests. In some cases, it is curiously faster than the timeevolution.schroedinger method.

```julia
ℋ = SpinBasis(20//1)
↓ = spindown(ℋ)
t₀, t₁ = (0.0, pi)
const σx = sigmax(ℋ)
const iσx = im * σx
schrod!(dψ,ψ,p,t) = mul!(dψ, iσx, ψ)
prob! = ODEProblem(schrod!, ↓, (t₀, t₁))

julia> @benchmark sol = solve($prob!,DP5(),save_everystep=false)
BenchmarkTools.Trial: 
  memory estimate:  22.67 KiB
  allocs estimate:  178
  --------------
  minimum time:     374.463 μs (0.00% GC)
  median time:      397.327 μs (0.00% GC)
  mean time:        406.738 μs (0.37% GC)
  maximum time:     4.386 ms (89.76% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark timeevolution.schroedinger([$t₀,$t₁], $↓, $σx)
BenchmarkTools.Trial: 
  memory estimate:  23.34 KiB
  allocs estimate:  161
  --------------
  minimum time:     748.106 μs (0.00% GC)
  median time:      774.601 μs (0.00% GC)
  mean time:        786.933 μs (0.14% GC)
  maximum time:     4.459 ms (80.46% GC)
  --------------
  samples:          6350
  evals/sample:     1
```

codecov bot commented Apr 6, 2021

Codecov Report

Merging #16 (4ca6500) into master (201b630) will increase coverage by 0.14%.
The diff coverage is 77.19%.


```
@@            Coverage Diff             @@
##           master      #16      +/-   ##
==========================================
+ Coverage   93.61%   93.76%   +0.14%     
==========================================
  Files          24       24              
  Lines        2475     2486      +11     
==========================================
+ Hits         2317     2331      +14     
+ Misses        158      155       -3     
```

| Impacted Files | Coverage Δ |
| --- | --- |
| src/superoperators.jl | 82.65% <ø> (ø) |
| src/operators_dense.jl | 89.15% <70.00%> (-2.64%) ⬇️ |
| src/states.jl | 76.92% <81.08%> (+7.58%) ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@Krastanov (Collaborator, Author) commented:

I also updated the issue description with a short TODO. Does the current list sound reasonable to you? Next weekend I might have some time to make the listed changes, but feel free to jump in in case you are impatient ;)

@david-pl (Member) commented Apr 7, 2021

> I am a bit lost about what is happening here. Could you explain what `Broadcast.preprocess` does? Also, what does indexing a `Broadcasted` object like `bcf[I]` do?

I got them from Chris Rackauckas' example, but had to look them up myself. If I understand correctly, the `preprocess` function does two things: it unaliases the destination from the source, and it precomputes the size of the output when broadcasting with arrays, which is sometimes beneficial for performance.
Indexing `bcf[I]` propagates the index through the arguments and calls the broadcasting function on them, so the element corresponding to `dest[I]` is computed.
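A plain-Julia illustration of that last point (nothing QO-specific):

```julia
bc = Base.Broadcast.instantiate(Base.Broadcast.broadcasted(+, [1, 2, 3], 10))
bc[2]  # == 12: + is applied to the 2nd elements; the scalar 10 ignores the index
```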

> Lastly, how come this still works with scalars? (I had to implement a `getdata` method that is just `x->x` for scalars.)

I'm not 100% sure what you're asking here, but I'll try to answer anyway. The `getdata` method was there because broadcasting was done on the `.data` fields, so the arguments were unwrapped before the broadcast was evaluated. This was quite bad for performance, so now we're just indexing a `StateVector` directly with the new `getindex` method. When indexing a broadcasted object like `bcf[I]`, the index simply gets ignored for scalar arguments.

> I updated the pull request with the changes you explained and also added some extra tests. In some cases, it is curiously faster than the `timeevolution.schroedinger` method.

Well, we're actually comparing two different things: `timeevolution.schroedinger` creates the `ODEProblem` and then calls `solve` (with different tolerances and a `SavingCallback`), whereas we only look at `solve`. The real benchmark we want to compare against is a `solve` with pure Julia types:

```julia
const ix = iσx.data
schrod_data!(dψ,ψ,p,t) = mul!(dψ, ix, ψ)
u0 = (↓).data
prob_data! = ODEProblem(schrod_data!, u0, (t₀, t₁))
alg = DP5()
@benchmark solve($prob_data!, $alg, save_everystep=false)
```

Matching the performance of this (up to a constant overhead) is the goal.

> I also updated the issue description with a short TODO. Does the current list sound reasonable to you?

Those sound good! You could also split the broadcasting for StateVectors and Operators into two separate PRs since they are self-contained, but you already started with the latter, so whichever you prefer is fine.

Finally, just for reference, we can in principle do the same for the semiclassical and stochastic modules, but that is definitely for another time.

The following works now:

```julia
ℋ = SpinBasis(20//1)

const σx = sigmax(ℋ)
const iσx = im * σx
const σ₋ = sigmam(ℋ)
const σ₊ = σ₋'
const mhalfσ₊σ₋ = -σ₊*σ₋/2

↓ = spindown(ℋ)
ρ = dm(↓)

lind(ρ,p,t) = - iσx * ρ + ρ * iσx + σ₋*ρ*σ₊ + mhalfσ₊σ₋ * ρ + ρ * mhalfσ₊σ₋

t₀, t₁ = (0.0, pi)
Δt = 0.1

prob = ODEProblem(lind, ρ, (t₀, t₁))
sol = solve(prob,Tsit5())
```

Works in-place as well.

It is slightly slower than `timeevolution.master`:

```julia
function makelind!()
    tmp = zero(ρ) # this is the global rho
    function lind!(dρ,ρ,p,t) # TODO this can be much better with a good Tullio kernel
        mul!(tmp, ρ, σ₊)
        mul!(dρ, σ₋, tmp) # σ₋*(ρ*σ₊); uses tmp from the line above, not ρ
        mul!(dρ,    ρ, mhalfσ₊σ₋, true, true)
        mul!(dρ, mhalfσ₊σ₋,    ρ, true, true)
        mul!(dρ,  iσx,    ρ, -ComplexF64(1),   ComplexF64(1))
        mul!(dρ,    ρ,  iσx,  true,   true)
        return dρ
    end
end
lind! = makelind!()
prob! = ODEProblem(lind!, ρ, (t₀, t₁))

julia> @benchmark sol = solve($prob!,DP5(),save_everystep=false)
BenchmarkTools.Trial:
  memory estimate:  408.94 KiB
  allocs estimate:  213
  --------------
  minimum time:     126.334 ms (0.00% GC)
  median time:      127.359 ms (0.00% GC)
  mean time:        127.876 ms (0.00% GC)
  maximum time:     138.660 ms (0.00% GC)
  --------------
  samples:          40
  evals/sample:     1

julia> @benchmark timeevolution.master([$t₀,$t₁], $ρ, $σx, [$σ₋])
BenchmarkTools.Trial:
  memory estimate:  497.91 KiB
  allocs estimate:  210
  --------------
  minimum time:     97.902 ms (0.00% GC)
  median time:      98.469 ms (0.00% GC)
  mean time:        98.655 ms (0.00% GC)
  maximum time:     104.850 ms (0.00% GC)
  --------------
  samples:          51
  evals/sample:     1
```
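For reference, the accumulating calls above rely on 5-argument `mul!`, which computes `C = A*B*α + C*β` in place:

```julia
mul!(dρ, ρ, mhalfσ₊σ₋, true, true)  # dρ = ρ*mhalfσ₊σ₋*true + dρ*true, i.e. dρ += ρ*mhalfσ₊σ₋
```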
@Krastanov (Collaborator, Author) commented:

@david-pl, I looked through the rest of the TODOs, but it seems most of that work is to be done in QuantumOptics, not in QOBase. Does it make sense to merge this PR now, and then I will make an issue listing the rest of the TODOs and start working on it in a QuantumOptics pull request?

An exception to this is figuring out what is necessary in order to work with sparse state vectors, but that might be out of scope.

@david-pl (Member) commented May 3, 2021

@Krastanov Before moving on to the other time-evolution functions, we should make sure the ones that already work are performant. This needs to be done in QOBase. Currently, there are allocations and slowdowns when using Kets and Operators in ODEProblems. I attempted a couple of different things to achieve the same performance as with pure Julia vectors but haven't succeeded so far. I still think it's due to the broadcasting, but I'm not sure.

> An exception to this is figuring out what is necessary in order to work with sparse state vectors, but that might be out of scope.

I'd say this is definitely out of scope for now.

@davidschlegel commented:
This has been stale for a long time. Is there any update on this?

@Krastanov marked this pull request as draft on May 13, 2024.