
hcl2rgb bottleneck in colorize methods #79

Closed
thorsteinssonh opened this issue Mar 17, 2021 · 10 comments · Fixed by #122

Comments

@thorsteinssonh

thorsteinssonh commented Mar 17, 2021

Hi there old pytroller friends, hope you are all well !

I've been playing with the excellent trollimage library again,
and I found that there is quite a significant bottleneck in the hcl2rgb routine on larger images.
On 1000x1000 images I'm measuring that the np.interp step takes about 20ms, while
the conversion of the result back to RGB space with hcl2rgb adds 350ms to the whole procedure.

def _interpolate_rgb_colors(arr, colors, values):
    hcl_colors = _convert_rgb_list_to_hcl(colors) # this is peanuts...
    channels = [np.interp(arr,
                          np.array(values),
                          np.array(hcl_colors)[:, i])
                for i in range(3)]                   # this is ca 17ms
    channels = list(hcl2rgb(*channels))  # this is 350ms
    return channels

At first I suspected the np.interp call was the issue,
but after profiling I saw that hcl2rgb was the bottleneck, at least on my machine.
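For reference, here is a minimal way to reproduce this kind of per-step timing (stand-in colormap data and a hypothetical `timed` helper; absolute numbers will of course differ per machine):

```python
import time

import numpy as np


def timed(label, func, *args):
    """Run func once and print its elapsed wall-clock time in ms."""
    start = time.perf_counter()
    result = func(*args)
    print(f"{label}: {(time.perf_counter() - start) * 1000:.1f} ms")
    return result


arr = np.random.random((1000, 1000))
values = np.array([0.0, 0.5, 1.0])          # colormap control points
hcl_points = np.random.random((3, 3))       # stand-in for the converted colors

# Time the interpolation of each channel in isolation.
channels = [timed(f"interp channel {i}", np.interp, arr, values, hcl_points[:, i])
            for i in range(3)]
```

The same `timed` wrapper can then be put around the hcl2rgb call to compare the two steps directly.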

Maybe give the user an option: an hcl=True/False flag could solve the problem for those who want more speed.
I'm plotting quite a few weather maps on the fly using the library.
It works fantastically, but I noticed there may be a chance to speed this up.
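As a sketch of what that opt-out could look like: a hypothetical helper that interpolates directly in RGB space when the user prefers speed over perceptual smoothness (names are illustrative, not the library's actual API):

```python
import numpy as np


def interpolate_rgb_direct(arr, colors, values):
    """Interpolate each RGB channel directly, skipping the HCL round trip.

    Faster, but gradients may look less perceptually smooth than the
    HCL-based interpolation the library uses by default.
    """
    colors = np.asarray(colors, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    return [np.interp(arr, values, colors[:, i]) for i in range(3)]


# Example: a red-to-blue ramp sampled at five points.
data = np.linspace(0.0, 1.0, 5)
colors = [(1.0, 0.0, 0.0), (0.0, 0.0, 1.0)]
r, g, b = interpolate_rgb_direct(data, colors, [0.0, 1.0])
```

An `hcl=True` branch would instead convert `colors` to HCL first and run hcl2rgb on the interpolated channels, as the existing `_interpolate_rgb_colors` does.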

I'll take a look into hcl2rgb and see if anything I could help with.

@mraspaud
Member

Hi @thorsteinssonh , really nice to hear from you again!
Yes, the conversion to hcl was needed in my opinion to get a nice smooth result. It would be great if you find ways to optimize it :)

@mraspaud
Member

You can of course check what the result of skipping this step is, but iirc I didn't like it in some cases.

@thorsteinssonh
Author

yes, the HCL results are beautiful... would be best to find a solution... I'll look into it.

@djhoese
Member

djhoese commented Jan 22, 2023

So I started working on #121 to work around some edge cases in the existing HCL conversion, saw this issue, and assumed the existing implementation wasn't vectorized with numpy, so my new algorithm would automatically be faster. It turns out the existing code is using numpy arrays. Even though there are still some optimizations I can make in my code, the new algorithm I'm writing is actually slower than the existing one for test images of random data at both 1000 x 1000 and 3000 x 3000. It would definitely be good if I could do this at the Cython/C level, but I'm not sure I'm ready to jump to that.

# new version:

In [1]: from trollimage.colormap import Colormap, rgb2hcl_numpy, hcl2rgb_numpy; import numpy as np

In [2]: cmap = Colormap(values=list(range(4)), colors=[(1.0, 1.0, 0.0), (0.0, 1.0, 1.0), (1.0, 1.0, 1.0), (0.0, 0.0, 0.0)])

In [3]: data = np.random.random((3000, 3000))

In [4]: %%timeit
   ...: colorized_image = cmap.colorize(data)
   ...:
   ...:
1.97 s ± 50 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# old version:

In [1]: from trollimage.colormap import Colormap, rgb2hcl_numpy, hcl2rgb_numpy; import numpy as np

In [2]: cmap = Colormap(values=list(range(4)), colors=[(1.0, 1.0, 0.0), (0.0, 1.0, 1.0), (1.0, 1.0, 1.0), (0.0, 0.0, 0.0)])

In [3]: data = np.random.random((3000, 3000))

In [4]: %%timeit
   ...: colorized_image = cmap.colorize(data)
   ...:
   ...:
/home/davidh/repos/git/trollimage/trollimage/colorspaces.py:128: RuntimeWarning: invalid value encountered in power
  1.055 * (arr ** (1.0 / 2.4)) - 0.055,
1.68 s ± 4.22 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
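As an aside, the RuntimeWarning in the old version's run comes from numpy returning NaN when a slightly negative intermediate value is raised to a fractional power in the sRGB companding step. A minimal reproduction, with clipping shown as one possible guard (not necessarily what the library should do):

```python
import numpy as np

# A slightly negative value can appear as a rounding artifact of the
# colorspace math; a fractional power of a negative float is NaN.
arr = np.array([-0.001, 0.5])
with np.errstate(invalid="ignore"):  # silence the RuntimeWarning for the demo
    companded = 1.055 * (arr ** (1.0 / 2.4)) - 0.055

# Clipping to the valid domain first is one way to avoid the NaN.
safe = 1.055 * (np.clip(arr, 0.0, None) ** (1.0 / 2.4)) - 0.055
```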

@djhoese
Member

djhoese commented Jan 31, 2023

My new PR #122 makes these operations much faster. For the same test as in my comment above, it now runs in <500ms. The PR is mostly ready for a first review. I could add more tests, but there are also some failing tests I need to verify are failing only because of small differences between the implementations and not because something is broken.

@djhoese
Member

djhoese commented Aug 19, 2023

FYI I rewrote all of the Cython code in #122 to use different colorspace code, and sadly it isn't nearly as fast as #122 originally was with the old Cython code. Main still takes ~1.68s on the above test, and #122 currently takes ~1.5s. So faster, but not by much. I'm also not sure what other kinds of optimizations can be done.

Edit: Thought about it. I could completely reorganize the conversion methods so they operate on a single pixel at a time instead of looping per conversion. I assumed/hoped that doing a for loop for each conversion would be faster than doing all the conversions on one pixel... but memory locality is probably a problem with the way I'm doing it now.
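The two loop organizations can be sketched in plain Python with placeholder stages (the real conversions are in Cython; in pure Python the per-pixel loop is of course slower, since the locality benefit only shows up at the C level):

```python
import numpy as np


def stage_a(x):
    """Placeholder for one colorspace conversion step."""
    return x * 0.5


def stage_b(x):
    """Placeholder for the next conversion step."""
    return x + 0.1


def convert_per_stage(pixels):
    """One loop per conversion stage: each stage sweeps the whole image,
    so intermediate arrays fall out of cache between stages."""
    return stage_b(stage_a(pixels))


def convert_per_pixel(pixels):
    """One loop over pixels: each value runs the whole pipeline while it
    is still hot in cache. Only pays off when compiled; in pure Python
    the per-element loop overhead dominates."""
    out = np.empty_like(pixels)
    flat_in, flat_out = pixels.ravel(), out.ravel()
    for i in range(flat_in.size):
        flat_out[i] = stage_b(stage_a(flat_in[i]))
    return out
```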

@djhoese
Member

djhoese commented Aug 24, 2023

FYI I just tried to optimize the code added in #122 by switching from doing the for loops inside each conversion to doing the for loop at the top level, and by changing how values are passed to the conversion functions. All of the variants were slower than what I'm doing now:

  1. Outer for loop, pass pointers around, inline everything: ~2.5s
  2. Same as 1, but pass inputs by value instead of by pointer: ~2.5s
  3. For loop inside the main conversion function, pass inputs by value: ~1.75s

I think I could still try adding structs for the output and see if that helps the compiler inline stuff better.

@djhoese
Member

djhoese commented Aug 24, 2023

Hmm, I can't figure out how to get fused types and structs to work together properly. I can make a fused type of the two structs (32-bit float and 64-bit float), but I can't find a way for Cython to let me use it.

@djhoese
Member

djhoese commented Sep 15, 2023

I was talking with someone at a programmer meetup about this Cython-ization of the colorspace code, and they made me realize that I don't think I ever tested it with dask, and that I never tried OpenMP (via Cython's prange). I just did some tests: prange brought it from ~1.75s to ~1.54s. Using a dask array with a chunk size of 1000 for the same overall size, the newest version (no prange) ran in 468ms, while the same dask array with the old python-only version ran in 578ms. So about the same speedup overall, I feel like. Not sure how or why that is... I would have expected the amount of nogil code in the new Cython version to outperform the old version by a lot more when dask is used.
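A generic illustration of why chunked, GIL-releasing work parallelizes under a thread pool (this is not the library's dask code path, just a sketch with a stand-in per-chunk function):

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def colorize_chunk(chunk):
    """Stand-in for the per-chunk colorize work; numpy releases the
    GIL inside vectorized calls like this one, just as nogil Cython
    code would, so threads can overlap."""
    return np.sqrt(chunk)


data = np.random.random((3000, 3000))

# Split into dask-like 1000-row chunks and process them in parallel.
chunks = np.array_split(data, 3, axis=0)
with ThreadPoolExecutor() as pool:
    result = np.concatenate(list(pool.map(colorize_chunk, chunks)), axis=0)
```

If the per-chunk function spends most of its time holding the GIL (pure-Python loops), the threads serialize and chunking buys almost nothing, which may be part of why the old and new versions land so close under dask.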

@mraspaud
Member

Interesting, thanks for the feedback!
