Allow grouping noisy arrays using tolerances #32

Open
Illviljan opened this issue Aug 28, 2021 · 7 comments
Labels
documentation Improvements or additions to documentation

Comments

Illviljan (Contributor) commented Aug 28, 2021

I find that grouping data can be a little more intuitive when using tolerances.
This can be done, for example, like this:

def groupby_isclose_argsort(arr, atol=0, rtol=0.1):
    """
    Return group indices for a noisy array.

    TODO: argsort not available in dask.

    Examples
    --------
    reps = 2
    y = np.array([72, 72, 100, 100, 300, 300, 500, 500])
    y = np.stack(reps*[y], 0)
    noise = lambda y : 1 + 0.1 * (np.random.rand(*y.shape) - 0.5)
    y = y * noise(y)
    groupby_isclose_argsort(y)
    array([[0, 0, 1, 1, 2, 2, 3, 3],
           [0, 0, 1, 1, 2, 2, 3, 3]], dtype=int32)
    """
    # Sort the values so they are monotonically increasing, keeping the
    # inverse permutation so the group labels can be restored to the
    # original order at the end:
    a = arr.ravel()
    i = np.argsort(a)
    i_rev = np.empty(i.shape, dtype=np.intp, like=a)
    i_rev[i] = np.arange(len(a), like=a)
    a = a[i]

    # Compute a group index that increments whenever the current value is
    # not within tolerance of the previous (sorted) value:
    b = np.roll(a, 1)
    b[0] = a[0]
    by = np.cumsum(~np.isclose(a, b, atol=atol, rtol=rtol))
    # tolerance = atol + rtol * b
    # by = np.cumsum(np.abs(a - b) > tolerance)

    return by[i_rev].reshape(arr.shape)

So on a dataset, the API would look something like this: ds.groupby("arr", atol=0, rtol=0.1).
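For illustration, here's roughly what that would amount to today with the helper above plus flox's existing groupby_reduce (the data values here are made up):

import numpy as np
import flox

# Made-up noisy measurements; values within ~10% of each other should end up together.
y = np.array([72.3, 71.1, 98.9, 101.2, 297.0, 304.5, 498.2, 503.1])

# Factorize with the tolerance-based helper, then reduce with flox:
labels = groupby_isclose_argsort(y, atol=0, rtol=0.1)
result, groups = flox.groupby_reduce(y, labels, func="mean")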

Illviljan changed the title from "Alllow grouping noisy arrays using tolerances" to "Allow grouping noisy arrays using tolerances" Aug 28, 2021
dcherian (Collaborator) commented

Looks interesting. Usually I do something similar by specifying bins, which this package now supports. Would that work for you? You do have to know the bins though...
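For example, a minimal sketch of bin-based grouping with flox, assuming made-up noisy values and bin edges chosen to bracket each expected cluster:

import numpy as np
import pandas as pd
import flox

y = np.array([72.3, 71.1, 98.9, 101.2, 297.0, 304.5, 498.2, 503.1])  # noisy values

# Hypothetical bin edges; each interval should contain one cluster:
bins = pd.IntervalIndex.from_breaks([50, 90, 150, 400, 600])
result, groups = flox.groupby_reduce(y, y, func="mean", expected_groups=bins)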

You can always "factorize" arr yourself and then pass it to groupby directly...

PS: it's nice to see you here. Please file issues for any bugs you find.

Illviljan (Contributor, Author) commented

Yeah, it's having to know the bins beforehand that I don't like. :)
It gets difficult to figure out each group visually once you start getting values that are 100x-1000x larger than the smallest value, and if the values change between each test (with the noise staying the same), it gets tiresome to redo the bins every time.
This is also a dask-focused project, which somewhat implies that visually inspecting the data may be slow and cumbersome.
But maybe you have a way that I haven't thought of?

I will likely factorize myself if I don't manage to convince you, but I still have hope!

xarray-contrib deleted a comment from Illviljan Dec 23, 2021
Illviljan (Contributor, Author) commented

There's https://github.com/deepcharles/ruptures, which deals with this issue.
It uses a lot of Python for loops along the arrays, though I think that could be improved; there are C implementations for some of the algorithms.
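A rough sketch of how ruptures could be applied here, assuming the Pelt detector with an l2 cost; the data and the penalty value are made up and would need tuning:

import numpy as np
import ruptures as rpt

y = np.array([72.3, 71.1, 98.9, 101.2, 297.0, 304.5, 498.2, 503.1])  # noisy values

# Sort so that group changes show up as jumps, detect the change points,
# then build a label array and map it back to the original order:
order = np.argsort(y)
signal = y[order]
breakpoints = rpt.Pelt(model="l2", min_size=2, jump=1).fit(signal).predict(pen=100)

labels_sorted = np.zeros(signal.size, dtype=np.intp)
start = 0
for group, stop in enumerate(breakpoints):  # breakpoints ends with len(signal)
    labels_sorted[start:stop] = group
    start = stop

labels = np.empty_like(labels_sorted)
labels[order] = labels_sorted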

dcherian (Collaborator) commented Oct 11, 2022

It'd be good to add this to the documentation under "Labeling Groups".

dcherian (Collaborator) commented

Now I think https://flox.readthedocs.io/en/latest/user-stories.html would be a good place. "Overlapping Groups" also illustrates a trick.

dcherian added the documentation label Nov 29, 2022
dcherian (Collaborator) commented

Here's something interesting that seems relevant: https://www.cs.ucr.edu/~eamonn/MatrixProfile.html, https://pypi.org/project/matrixprofile-ts/

Illviljan (Contributor, Author) commented Dec 3, 2023

Here's a version using sklearn's MeanShift:

import numpy as np
import matplotlib.pyplot as plt
import flox

def _add_noise(
    arrs: tuple[np.ndarray, ...], mult: bool = True, shuffle: bool = False
) -> tuple[np.ndarray, ...]:
    # Add roughly +-2.5% multiplicative (or additive) noise to each array:
    noise = lambda y: 1 + 0.025 * (np.random.rand(*y.shape) - 0.5)

    out = ()
    for a in arrs:
        b = a * noise(a) if mult else a + noise(a)
        if shuffle:
            np.random.shuffle(b)  # shuffle the noisy copy, not the input array
        out += (b,)

    return out


def _cluster_meanshift(
    arrs: tuple[np.ndarray, ...], bandwidth: tuple[float, ...]
) -> tuple[np.ndarray, ...]:
    from sklearn.cluster import MeanShift

    # Cluster each 1D array separately, each with its own bandwidth:
    labels = ()
    for a, b in zip(arrs, bandwidth):
        ms = MeanShift(bandwidth=b, bin_seeding=True)
        ms.fit(a.reshape(-1, 1))
        labels += (ms.labels_,)

    return labels


def _combine_cluster(labels: tuple[np.ndarray, ...]) -> np.ndarray:
    # Factorize the unique combinations of the stacked label columns:
    _, out = np.unique(labels, return_inverse=True, axis=1)
    return out


def plot_clusters(x, ys: tuple[np.ndarray, ...], bys: tuple[np.ndarray, ...]):
    fig, axs = plt.subplots(len(ys), 1, sharex=True, layout="constrained")
    if not hasattr(axs, "__iter__"):
        axs = [axs]

    axs[-1].set_xlabel("x")
    for i, (ax, y, by) in enumerate(zip(axs, ys, bys)):
        for g in np.unique(by):
            idx = by == g

            ax.scatter(x[idx], y[idx])
            ax.set_title(f"ys[{i}]")
            ax.set_ylabel(f"ys[{i}]")

    return fig, axs


reps = 8
x = np.tile(np.array([3000, 3000, 3000, 3000, 5000, 5000, 5000, 5000]), reps)
y = np.tile(np.array([75, 75, 100, 100, 300, 300, 500, 500]), reps)
z = np.tile(np.array([1, 2, 1, 2, 1, 2, 1, 2]), reps)
w = np.tile(np.array([10, 10, 10, 10, 10, 10, 10, 10]), reps)

time = np.arange(y.size) * 0.1

x, y, z, w = _add_noise((x, y, z, w))

xl, yl, zl = _cluster_meanshift((x, y, z), bandwidth=(500, 15, 0.1))
plot_clusters(time, (x, y, z), (xl, yl, zl))
g = _combine_cluster((xl, zl))
# print(g)

plot_clusters(time, (w,), (g,))
results, groups = flox.groupby_reduce(w, g, func="max")
print(results)

[Figure: cluster scatter plots produced by the plot_clusters calls above]

Some other mean-shift alternatives:
https://github.com/jenniferjang/meanshiftpp - article
https://github.com/abhisheka456/GridShift - article
