Allow grouping noisy arrays using tolerances #32

Open
Illviljan opened this issue Aug 28, 2021 · 7 comments
Labels
documentation Improvements or additions to documentation

Comments

Illviljan (Contributor) commented Aug 28, 2021

I find that grouping data can be a little more intuitive when using tolerances.
This can be done, for example, like this:

def groupby_isclose_argsort(arr, atol=0, rtol=0.1):
    """
    Return group indices for a noisy array.

    TODO: argsort not available in dask.

    Examples
    --------
    reps = 2
    y = np.array([72, 72, 100, 100, 300, 300, 500, 500])
    y = np.stack(reps*[y], 0)
    noise = lambda y : 1 + 0.1 * (np.random.rand(*y.shape) - 0.5)
    y = y * noise(y)
    groupby_isclose_argsort(y)
    array([[0, 0, 1, 1, 2, 2, 3, 3],
           [0, 0, 1, 1, 2, 2, 3, 3]], dtype=int32)
    """
    # Sort the values so they are monotonically increasing, keeping the
    # inverse permutation so the group labels can be restored to the
    # original order at the end:
    a = arr.ravel()
    i = np.argsort(a)
    i_rev = np.empty(i.shape, dtype=np.intp, like=a)
    i_rev[i] = np.arange(len(a), like=a)
    a = a[i]

    # Compute a group index that increments whenever the current value is
    # not within tolerance of the previous (sorted) value:
    b = np.roll(a, 1)
    b[0] = a[0]
    by = np.cumsum(~np.isclose(a, b, atol=atol, rtol=rtol))
    # tolerance = atol + rtol * b
    # by = np.cumsum(np.abs(a - b) > tolerance)

    return by[i_rev].reshape(arr.shape)

So on a dataset, the API would look something like this: ds.groupby("arr", atol=0, rtol=0.1).
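For illustration, here's roughly what that would amount to today with the helper above plus flox's existing groupby_reduce (the data values here are made up):

import numpy as np
import flox

# Made-up noisy measurements; values within ~10% of each other should end up together.
y = np.array([72.3, 71.1, 98.9, 101.2, 297.0, 304.5, 498.2, 503.1])

# Factorize with the tolerance-based helper, then reduce with flox:
labels = groupby_isclose_argsort(y, atol=0, rtol=0.1)
result, groups = flox.groupby_reduce(y, labels, func="mean")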

Illviljan changed the title from "Alllow grouping noisy arrays using tolerances" to "Allow grouping noisy arrays using tolerances" Aug 28, 2021
dcherian (Collaborator) commented

Looks interesting. Usually I do something similar by specifying bins, which this package now supports. Would that work for you? You do have to know the bins though...
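For example, a minimal sketch of bin-based grouping with flox, assuming made-up noisy values and bin edges chosen to bracket each expected cluster:

import numpy as np
import pandas as pd
import flox

y = np.array([72.3, 71.1, 98.9, 101.2, 297.0, 304.5, 498.2, 503.1])  # noisy values

# Hypothetical bin edges; each interval should contain one cluster:
bins = pd.IntervalIndex.from_breaks([50, 90, 150, 400, 600])
result, groups = flox.groupby_reduce(y, y, func="mean", expected_groups=bins)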

You can always "factorize" arr yourself and then pass it to groupby directly...

PS: it's nice to see you here. Please file issues for any bugs you find.

Illviljan (Contributor, Author) commented

Yeah, it's having to know the bins beforehand that I don't like. :)
It gets difficult to figure out each group visually once you start getting values that are 100x-1000x larger than the smallest value, and if the values change between each test (with the noise staying the same), it gets tiresome to redo the bins every time.
This is also a dask-focused project, which somewhat implies that visually inspecting the data may be slow and cumbersome.
But maybe you have a way that I haven't thought of?

I will likely factorize myself if I don't manage to convince you, but I still have hope!

xarray-contrib deleted a comment from Illviljan Dec 23, 2021
Illviljan (Contributor, Author) commented

There's https://github.com/deepcharles/ruptures, which deals with this issue.
It uses a lot of Python for loops along the arrays, though I think that could be improved; there are C implementations for some of the algorithms.
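A rough sketch of how ruptures could be applied here, assuming the Pelt detector with an l2 cost; the data and the penalty value are made up and would need tuning:

import numpy as np
import ruptures as rpt

y = np.array([72.3, 71.1, 98.9, 101.2, 297.0, 304.5, 498.2, 503.1])  # noisy values

# Sort so that group changes show up as jumps, detect the change points,
# then build a label array and map it back to the original order:
order = np.argsort(y)
signal = y[order]
breakpoints = rpt.Pelt(model="l2", min_size=2, jump=1).fit(signal).predict(pen=100)

labels_sorted = np.zeros(signal.size, dtype=np.intp)
start = 0
for group, stop in enumerate(breakpoints):  # breakpoints ends with len(signal)
    labels_sorted[start:stop] = group
    start = stop

labels = np.empty_like(labels_sorted)
labels[order] = labels_sorted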

dcherian (Collaborator) commented Oct 11, 2022

It'd be good to add this to the documentation under "Labeling Groups".

dcherian (Collaborator) commented

Now I think https://flox.readthedocs.io/en/latest/user-stories.html would be a good place. "Overlapping Groups" also illustrates a trick.

dcherian added the documentation label Nov 29, 2022
dcherian (Collaborator) commented

Here's something interesting that seems relevant: https://www.cs.ucr.edu/~eamonn/MatrixProfile.html, https://pypi.org/project/matrixprofile-ts/

Illviljan (Contributor, Author) commented Dec 3, 2023

Here's a version using sklearn's MeanShift:

import numpy as np
import matplotlib.pyplot as plt
import flox

def _add_noise(
    arrs: tuple[np.ndarray, ...], mult: bool = True, shuffle: bool = False
) -> tuple[np.ndarray, ...]:
    # Add roughly +-2.5% multiplicative (or additive) noise to each array:
    noise = lambda y: 1 + 0.025 * (np.random.rand(*y.shape) - 0.5)

    out = ()
    for a in arrs:
        b = a * noise(a) if mult else a + noise(a)
        if shuffle:
            np.random.shuffle(b)  # shuffle the noisy copy, not the input array
        out += (b,)

    return out


def _cluster_meanshift(
    arrs: tuple[np.ndarray, ...], bandwidth: tuple[float, ...]
) -> tuple[np.ndarray, ...]:
    from sklearn.cluster import MeanShift

    # Cluster each 1D array separately, each with its own bandwidth:
    labels = ()
    for a, b in zip(arrs, bandwidth):
        ms = MeanShift(bandwidth=b, bin_seeding=True)
        ms.fit(a.reshape(-1, 1))
        labels += (ms.labels_,)

    return labels


def _combine_cluster(labels: tuple[np.ndarray, ...]) -> np.ndarray:
    # Factorize the unique combinations of the stacked label columns:
    _, out = np.unique(labels, return_inverse=True, axis=1)
    return out


def plot_clusters(x, ys: tuple[np.ndarray, ...], bys: tuple[np.ndarray, ...]):
    fig, axs = plt.subplots(len(ys), 1, sharex=True, layout="constrained")
    if not hasattr(axs, "__iter__"):
        axs = [axs]

    axs[-1].set_xlabel("x")
    for i, (ax, y, by) in enumerate(zip(axs, ys, bys)):
        for g in np.unique(by):
            idx = by == g

            ax.scatter(x[idx], y[idx])
            ax.set_title(f"ys[{i}]")
            ax.set_ylabel(f"ys[{i}]")

    return fig, axs


reps = 8
x = np.tile(np.array([3000, 3000, 3000, 3000, 5000, 5000, 5000, 5000]), reps)
y = np.tile(np.array([75, 75, 100, 100, 300, 300, 500, 500]), reps)
z = np.tile(np.array([1, 2, 1, 2, 1, 2, 1, 2]), reps)
w = np.tile(np.array([10, 10, 10, 10, 10, 10, 10, 10]), reps)

time = np.arange(y.size) * 0.1

x, y, z, w = _add_noise((x, y, z, w))

xl, yl, zl = _cluster_meanshift((x, y, z), bandwidth=(500, 15, 0.1))
plot_clusters(time, (x, y, z), (xl, yl, zl))
g = _combine_cluster((xl, zl))
# print(g)

plot_clusters(time, (w,), (g,))
results, groups = flox.groupby_reduce(w, g, func="max")
print(results)

[Figure: cluster scatter plots produced by the plot_clusters calls above]

Some other mean-shift alternatives:
https://github.com/jenniferjang/meanshiftpp - article
https://github.com/abhisheka456/GridShift - article
