
[feature request] Majority voting filter in ndimage #9873

Open
skjerns opened this issue Feb 25, 2019 · 20 comments
Labels: enhancement (a new feature or improvement), scipy.ndimage

Comments

skjerns (Contributor) commented Feb 25, 2019

ndimage.filters has quite a few useful filters, but in my opinion one more could be implemented:
an occurrence-based filter, such as a majority filter.

The filter would slide over an array, count the occurrences of values within the given window, and output the n-th most occurring value, e.g.:

arr = [1,1,2,3,4,4,5,6,6,6,7,8,8,8]
res = occurrence_filter(arr, size=5, rank=-1, mode='valid')

# runnable sketch of the idea (ties broken toward the smaller value):
from collections import Counter
size, rank = 5, -1
res = []
for i in range(len(arr) - size + 1):
    w = arr[i:i+size]
    counts = Counter(w)                     # e.g. {1: 2, 2: 1, 3: 1, 4: 1}
    ranked = sorted(counts, key=lambda v: (counts[v], -v))
    res.append(ranked[rank])                # rank=-1 -> the most frequent value

# res = [1, 4, 4, 4, 4, 6, 6, 6, 6, 8]

Would it make sense to implement such a filter?
Or is there another way to express such an operation that I haven't seen yet?

ilayn added the enhancement and scipy.ndimage labels on Feb 25, 2019
rgommers (Member):

Sounds reasonable, and there's at least some demand for it: https://stackoverflow.com/questions/39829716/majority-filter-python

It may be a bit specialized, though, and fit better in scikit-image. @stefanv @jni, any opinion on this functionality?

jni (Contributor) commented Feb 27, 2019

Seems like a job for ndi.generic_filter and LowLevelCallables! =) We would certainly welcome this in scikit-image filters, but given that it is straightforward (cough) to implement with ndimage primitives, possibly it could sit there as a model for how to use these features.

@skjerns for references on how to use these:

https://ilovesymposia.com/2017/03/12/scipys-new-lowlevelcallable-is-a-game-changer/
https://ilovesymposia.com/2017/03/15/prettier-lowlevelcallables-with-numba-jit-and-decorators/

Those both use Numba, which neither SciPy nor scikit-image has as a dependency yet. This notebook from @kne42 shows how to do it with Cython:

https://gist.github.com/kne42/b487af595332db468654a5c5b1ce416c

And of course see also the SciPy documentation:
https://docs.scipy.org/doc/scipy/reference/tutorial/ndimage.html#ndimage-ccallbacks
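As a baseline before reaching for a LowLevelCallable, here is a minimal sketch (my own illustration, not code from the thread) of a majority filter built on ndimage.generic_filter with a plain Python callback; generic_filter hands the callback each footprint's values as a flat 1-D array:

```python
import numpy as np
from scipy import ndimage

def majority(window):
    # np.unique returns sorted values, so argmax breaks count
    # ties toward the smaller value
    values, counts = np.unique(window, return_counts=True)
    return values[np.argmax(counts)]

image = np.array([[1, 1, 2],
                  [1, 3, 2],
                  [4, 2, 2]])
out = ndimage.generic_filter(image, majority, size=3, mode='nearest')
```

This is the slow pure-Python path; the LowLevelCallable material linked above replaces the callback with compiled code while keeping the same generic_filter driver.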

stefanv (Member) commented Feb 27, 2019

This sounds like it may be implemented effectively with skimage.filters.rank. The generic_filter version is going to be pretty slow without some kind of acceleration (such as the options mentioned above).

jni (Contributor) commented Feb 27, 2019

@stefanv what do you mean exactly? I'm probably missing a step in my brain, but I can't map a rank filter to "bincount" which is what needs to happen here?

stefanv (Member) commented Feb 27, 2019

@jni Rank filters calculate the histogram of values under the window, and that histogram is updated each time the window moves. From that, you can find the n-th most occurring value, I suspect?
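To make that rolling-histogram idea concrete, here is a 1-D, stdlib-only sketch of my own (not code from the thread): the count map is updated in O(1) per step as the window slides, rather than rebuilt for every window. Extracting the winner is still a scan over the map; real rank filters avoid even that with extra bookkeeping.

```python
from collections import Counter

def rolling_majority_1d(arr, size):
    # seed the histogram with the first window
    counts = Counter(arr[:size])
    # max over sorted keys: count ties go to the smaller value
    out = [max(sorted(counts), key=counts.get)]
    for i in range(size, len(arr)):
        counts[arr[i]] += 1          # value entering the window
        counts[arr[i - size]] -= 1   # value leaving the window
        if counts[arr[i - size]] == 0:
            del counts[arr[i - size]]
        out.append(max(sorted(counts), key=counts.get))
    return out
```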

jni (Contributor) commented Feb 27, 2019

🤯

Having said that, the real solution is to fix SciPy's filters to use this technique, right? =) See #4878

stefanv (Member) commented Feb 27, 2019

The rank filters only operate on integers, though, and SciPy's API is quite complex (as discussed in #4878). That said, we need to find a champion for ndimage: it is in maintenance mode right now, and as we discovered during the GSoC that @jaimefrio mentored, there is a lot of room for improvement.

jni (Contributor) commented Feb 27, 2019

> The rank filters only operate on integers

and are 2D only, if I remember correctly.

skjerns (Contributor, author) commented Feb 27, 2019

Great that there seems to be some interest in this.

@jni I can have a look at the low-level API and see if I can come up with a solution. However, this would be new terrain for me and I always had a bit of respect for contributing to such important packages as scipy.

> The rank filters only operate on integers
> and are 2D only, if I remember correctly.

Indeed; however, couldn't the 1D operation just be reframed as a 2D problem via reshaping?

jni (Contributor) commented Feb 27, 2019

> However, this would be new terrain for me and I always had a bit of respect for contributing to such important packages as scipy.

Welcome, and never fear! Packages like SciPy have coalesced into something of extremely high quality, but they got there through the gradual contributions and improvements of scientists who were not trained as software engineers.

Having said this the ndimage code base is quite challenging, as hinted at by @stefanv. Where are you based, @skjerns? Any chance you'll be coming to the SciPy conference in July or EuroSciPy in August?

> however the 1D operation can just be reframed as a 2D problem via reshaping?

It is not 1D arrays that I am worried about, but 3D and higher. =)

stefanv (Member) commented Feb 27, 2019 via email

skjerns (Contributor, author) commented Mar 3, 2019

So the preliminary solution right now is to write some code based on generic_filter and (possibly) Numba, and include that as a tutorial or something like that? Or add it as a function to ndimage?

> Having said this the ndimage code base is quite challenging, as hinted at by @stefanv. Where are you based, @skjerns? Any chance you'll be coming to the SciPy conference in July or EuroSciPy in August?

I'm Europe-based, but I don't think I will come to the conference, sorry :-/

rgommers (Member) commented Mar 3, 2019

> So the preliminary solution right now is to write some code based on generic_filter and (possibly) Numba and include that as a tutorial or something like that? Or as a function to ndimage?

I'll let @stefanv and @jni comment on the generic_filter bit, but regarding Numba: we can't accept it as a dependency (yet; we'd like to get there at some point), but it may be the most user-friendly way to implement this functionality in a performant way. I haven't used it yet, but Numba's @stencil decorator may be very helpful here.
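As a sketch of what a Numba-friendly implementation might look like (my own illustration; the function name and the 0-255 value assumption are mine, and I have not benchmarked it), the whole filter can be written as plain nested loops over a flat 256-bin histogram. The code below runs as ordinary (slow) Python; decorating it with @numba.njit should let Numba compile the loops unchanged, since no Python objects are used inside:

```python
import numpy as np

def majority_uint8(image, w):
    # Majority over a (2*w+1) x (2*w+1) window, valid region only,
    # assuming integer values in 0..255 so a flat histogram suffices.
    # With Numba installed, @numba.njit on this function should compile
    # the nested loops directly (an assumption, not benchmarked here).
    out = np.zeros_like(image)
    hist = np.zeros(256, dtype=np.intp)
    for i in range(w, image.shape[0] - w):
        for j in range(w, image.shape[1] - w):
            hist[:] = 0
            for di in range(-w, w + 1):
                for dj in range(-w, w + 1):
                    hist[image[i + di, j + dj]] += 1
            # argmax breaks count ties toward the smaller value
            out[i, j] = np.argmax(hist)
    return out
```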

rgommers changed the title from "[feature request] Majority voting" to "[feature request] Majority voting filter in ndimage" on Mar 3, 2019
jni (Contributor) commented Mar 5, 2019

@skjerns some of the links I shared show how to use Cython instead of Numba. You could implement it with Cython and then add it as a PR either here or in scikit-image. Both places make sense to me.

skjerns (Contributor, author) commented Mar 7, 2019

I gave it a try and created a pure-Cython version using a C++ map.
For comparison, there is a regular Cython version using a defaultdict.
My biggest hurdle was finding a good implementation of a counter in Cython.
This version only supports returning the majority element. Some optimization might be possible with early stopping once a count exceeds len(array)//2.
However, a better version should implement a way of choosing the rank of the wanted occurrence (second-most occurring, etc.).

If we were constrained to ints 0-255, the problem would be much easier and faster to solve using an array; the second implementation does this and is orders of magnitude faster. But because we have doubles, it seems tricky to get the counts efficiently.

Maybe you can have a quick look at it and see if I'm going in the right direction?

majority_filter.pyx

cimport cython
import numpy as np
cimport numpy as np
from collections import defaultdict
from numpy cimport npy_intp
from libcpp.map cimport map as cpp_map # using ordered map seems faster than unordered map
cdef extern from "math.h": 
    float INFINITY


# version to compare against, using standard Python objects
@cython.boundscheck(False)
@cython.wraparound(False)
def vanilla(np.ndarray x not None):
    cdef:
        Py_ssize_t i, n
    n = len(x)
    dx = defaultdict(int)
    for i in range(n):
        dx[x[i]] += 1
    # ties go to the smallest key: max returns the first maximum over sorted keys
    max_v = max(sorted(dx), key=lambda k: dx[k])
    return max_v

# pure-Cython variant
cdef api int llc_cython(double *buffer, npy_intp filter_size,
             double *return_value, void *user_data):

    cdef cpp_map[double, int] counter
    cdef npy_intp i
    cdef int c
    cdef int c_largest = -1
    cdef double k
    cdef double k_smallest = INFINITY  # in case of a tie, we take the smaller key value
    for i in range(filter_size):
        k = buffer[i]
        c = counter[k] + 1
        counter[k] = c
        if c < c_largest:    # most of the time the count will be smaller
            continue
        elif c > c_largest:  # second most often it will be larger
            c_largest = c
            k_smallest = k
        elif k < k_smallest: # rarely the counts will tie; take the smaller key
            k_smallest = k
    return_value[0] = k_smallest
    return 1

# pure-Cython int variant (see comment below)
cdef api int llc_cython_int(double *buffer, npy_intp filter_size,
             double *return_value, void *user_data):

    cdef int[256] counter = [0]*256
    cdef npy_intp i
    cdef int c
    cdef int c_largest = -1
    cdef int k
    cdef int k_smallest = 255  # sentinel, overwritten on the first element (casting INFINITY to int is undefined)
    for i in range(filter_size):
        k = <int>buffer[i]
        c = counter[k] + 1
        counter[k] = c
        if c < c_largest:    # most of the time the count will be smaller
            continue
        elif c > c_largest:  # second most often it will be larger
            c_largest = c
            k_smallest = k
        elif k < k_smallest: # rarely the counts will tie; take the smaller key
            k_smallest = k
    return_value[0] = <double>k_smallest
    return 1

setup.py

from distutils.core import setup
from Cython.Build import cythonize
import numpy as np

setup(
    ext_modules=cythonize("majority_filter.pyx", language="c++"),
    include_dirs=[np.get_include()],
)

test.py

from scipy.ndimage.filters import generic_filter
from scipy import LowLevelCallable
import time
import numpy as np
import pyximport
pyximport.install(inplace=True, reload_support=True)
import majority_filter

#%%
np.random.seed(0)
# int is realistic with voting, you would not call this on random numbers
arr = np.random.randint(0,10,1024*1024).reshape([1024,1024]) 
footprint = np.ones(25).reshape([5,5])
llc_cython = LowLevelCallable.from_cython(majority_filter, "llc_cython")


start = time.time()
res1 = generic_filter(arr, majority_filter.vanilla, footprint=footprint, mode='constant')
print(time.time() - start)  # -> 10.4 s

start = time.time()
res2 = generic_filter(arr, llc_cython, footprint=footprint, mode='constant')
print(time.time() - start)  # -> 1.6 s

print(np.mean(np.isclose(res1, res2)))  # -> all close: True

1.6 seconds still seems rather slow for a 1024x1024 image, but I don't know how else to implement a counter in Cython. Is there any way to significantly speed this up?

jni (Contributor) commented Mar 8, 2019

Hey @skjerns! Thanks for the code samples.

  1. Broad picture, I think this gets back to @stefanv's comment that the correct way to implement this is with rolling histograms, or, if you want to support floats, a rolling map as you have. The point is that you are actually iterating over a length-25 array on every pixel of the image. So you should think about processing a (25, 1024, 1024) array to get a bit closer to reasonable expectations.

  2. More specifically for your code: have you run `cython -a` on it? It will tell you which parts are expected to be slow. If you are not touching Python code (all lines of your Cython function are white), then there is likely not much else for you to do. I don't think you'll go much faster than a C++ map for this problem. Having said that, I don't know why you didn't turn off bounds checking and wraparound in your "fast" version?

skjerns (Contributor, author) commented Mar 8, 2019

@jni Thanks!

  1. That's true. A rolling map will require a different solution than generic_filter, I guess.

  2. There's no dependency on Python; all lines are white. The bounds and wraparound checks were not intentional; I adapted some code from the internet which used them, and removing them makes no difference.

  3. I've created a version that assumes ints of 0-255, using an array lookup to store occurrences. It's (of course) much faster; however, generic_filter assumes double.

Using (25, 1024, 1024) I get:
338 seconds for the Python-Cython version using defaultdict
104 seconds for the pure-Cython double version using a C++ map
4.4 seconds for the pure-Cython int version (int 0-255, see below)

So to wrap up: we leave this solution for now?
Or should I still add it as an example of a Cython LowLevelCallable with generic_filter, make it explicit that it is only to be used with int, or select the best implementation for the dtype (int, float, string)?

Variant to work with `int` from 0-255:

# pure-Cython int variant
cdef api int llc_cython_int(double *buffer, npy_intp filter_size,
             double *return_value, void *user_data):

    cdef int[256] counter = [0]*256
    cdef npy_intp i
    cdef int c
    cdef int c_largest = -1
    cdef int k
    cdef int k_smallest = 255  # sentinel, overwritten on the first element (casting INFINITY to int is undefined)
    for i in range(filter_size):
        k = <int>buffer[i]
        c = counter[k] + 1
        counter[k] = c
        if c < c_largest:    # most of the time the count will be smaller
            continue
        elif c > c_largest:  # second most often it will be larger
            c_largest = c
            k_smallest = k
        elif k < k_smallest: # rarely the counts will tie; take the smaller key
            k_smallest = k
    return_value[0] = <double>k_smallest
    return 1
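A further option, not raised in the thread: when only a small set of integer labels is involved (as in the voting use case above), the majority filter can be vectorized as one box filter per label plus an argmax. This is a sketch of my own (the function name and parameters are illustrative, not from this thread); ties go to the smallest label:

```python
import numpy as np
from scipy import ndimage

def majority_by_label_counts(image, size, num_labels):
    # One uniform (box) filter per label gives, up to a constant factor,
    # the per-pixel count of that label in the window; argmax across
    # labels then picks the most frequent one.
    counts = np.stack([
        ndimage.uniform_filter((image == v).astype(np.float64), size=size)
        for v in range(num_labels)
    ])
    return np.argmax(counts, axis=0)
```

This scales linearly with the number of labels, so it suits label images such as segmentation maps rather than arbitrary doubles.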

jni (Contributor) commented Mar 8, 2019

Personally I think it would make a very cool documentation example. @rgommers what do you think? The thing is it's best put in a gallery kind of setting, which I don't think (?) SciPy has?

rgommers (Member) commented Mar 8, 2019

> So to wrap up: We leave this solution for now?

Yeah, I think the function is too slow to add it to ndimage.

> The thing is it's best put in a gallery kind of setting, which I don't think (?) SciPy has?

We've got a pretty big one: https://scipy-cookbook.readthedocs.io/. A lot of the content is quite old, but that is our user-contributed gallery. We accept PRs with notebooks for it (repo at https://github.com/scipy/scipy-cookbook).

We also have a doc section where this would fit, http://scipy.github.io/devdocs/tutorial/ndimage.html#extending-scipy-ndimage-in-c; however, I'm not sure how much it adds over the example that's already there. So I think adding it to scipy-cookbook and perhaps linking to it from that "extending ndimage" doc section would be quite nice.

rfezzani commented:
Hi all,
I made a PR related to this issue in scikit-image. The proposed implementation relies on the skimage.rank Cython framework. Its major drawback is that it only supports the uint8 and uint16 data types.
Maybe a generalization to more dtypes could be implemented in SciPy?

No branches or pull requests · 6 participants