Allow grouping noisy arrays using tolerances #32
Comments
Looks interesting. Usually I do something similar by specifying bins, which this package now supports. Would that work for you? You do have to know the bins though... You can always "factorize".
PS: It's nice to see you here. Please file issues for any bugs you find.
Yeah, it's having to know the bins beforehand that I don't like. :) I will likely factorize myself if I don't manage to convince you, but I still have hope!
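For reference, the bin-based approach discussed above can be sketched with plain NumPy. This is only an illustration of the idea, not flox's own binning API; the data, the bin edges, and the variable names are made up for the example. It also shows the drawback raised in the reply: the edges have to be picked by hand before any grouping can happen.

```python
import numpy as np

# Hypothetical noisy data: two nominal levels, around 3000 and 5000.
values = np.array([2990.0, 3010.0, 5020.0, 4985.0, 3005.0, 4990.0])
weights = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# Bin edges chosen by hand; this is the part that must be known beforehand.
edges = np.array([2500.0, 3500.0, 4500.0, 5500.0])

# np.digitize returns i such that edges[i-1] <= v < edges[i]; shift to 0-based.
labels = np.digitize(values, edges) - 1

# Per-bin maximum of `weights`; bins that received no values stay NaN.
group_max = np.full(len(edges) - 1, np.nan)
for k in np.unique(labels):
    group_max[k] = weights[labels == k].max()

print(labels)
print(group_max)
```

Values outside the outermost edges would get labels outside the valid bin range, so a real implementation would need to handle or mask them.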
There's https://github.com/deepcharles/ruptures that deals with this issue.
It'd be good to add this in the documentation under "Labeling Groups".
Now I think https://flox.readthedocs.io/en/latest/user-stories.html would be a good place. "Overlapping Groups" also illustrates a trick.
Here's something interesting that seems relevant: https://www.cs.ucr.edu/~eamonn/MatrixProfile.html, https://pypi.org/project/matrixprofile-ts/
Here's a version using sklearn's MeanShift:

import numpy as np
import matplotlib.pyplot as plt
import flox


def _add_noise(
    arrs: tuple[np.ndarray, ...], mult: bool = True, shuffle: bool = False
) -> tuple[np.ndarray, ...]:
    # Uniform noise factor in [0.9875, 1.0125), drawn per element.
    noise = lambda y: 1 + 0.025 * (np.random.rand(*y.shape) - 0.5)
    out = ()
    for a in arrs:
        b = a * noise(a) if mult else a + noise(a)
        if shuffle:
            # Shuffle the noisy copy, not the input array.
            np.random.shuffle(b)
        out += (b,)
    return out


def _cluster_meanshift(
    arrs: tuple[np.ndarray, ...], bandwidth: tuple[float, ...]
) -> tuple[np.ndarray, ...]:
    from sklearn.cluster import MeanShift

    labels = ()
    for a, b in zip(arrs, bandwidth):
        # Cluster each 1-D array on its own; MeanShift expects 2-D input.
        ms = MeanShift(bandwidth=b, bin_seeding=True)
        ms.fit(a.reshape(-1, 1))
        labels += (ms.labels_,)
    return labels


def _combine_cluster(labels: tuple[np.ndarray, ...]) -> np.ndarray:
    # Each unique column (one label per array) becomes one combined group.
    _, out = np.unique(labels, return_inverse=True, axis=1)
    return out


def plot_clusters(x, ys: tuple[np.ndarray, ...], bys: tuple[np.ndarray, ...]):
    fig, axs = plt.subplots(len(ys), 1, sharex=True, layout="constrained")
    # plt.subplots returns a bare Axes for a single panel; always iterate.
    axs = np.atleast_1d(axs)
    axs[-1].set_xlabel("x")
    for i, (ax, y, by) in enumerate(zip(axs, ys, bys)):
        # One scatter call per group so each group gets its own color.
        for g in np.unique(by):
            idx = by == g
            ax.scatter(x[idx], y[idx])
        ax.set_title(f"ys[{i}]")
        ax.set_ylabel(f"ys[{i}]")
    return fig, axs


reps = 8
x = np.tile(np.array([3000, 3000, 3000, 3000, 5000, 5000, 5000, 5000]), reps)
y = np.tile(np.array([75, 75, 100, 100, 300, 300, 500, 500]), reps)
z = np.tile(np.array([1, 2, 1, 2, 1, 2, 1, 2]), reps)
w = np.tile(np.array([10, 10, 10, 10, 10, 10, 10, 10]), reps)
time = np.arange(*y.shape) * 0.1

x, y, z, w = _add_noise((x, y, z, w))

# Label each noisy array separately, then plot the clusters found.
xl, yl, zl = _cluster_meanshift((x, y, z), bandwidth=(500, 15, 0.1))
plot_clusters(time, (x, y, z), (xl, yl, zl))

# Combine the x and z labels into one grouping variable and reduce w by it.
g = _combine_cluster((xl, zl))
# print(g)
plot_clusters(time, (w,), (g,))
results, groups = flox.groupby_reduce(w, g, func="max")
print(results)

Some other meanshift alternatives:
I find grouping data can be a little more intuitive when using tolerances.
This can be done, for example, like this:
So on a dataset the API would look something like this:
ds.groupby("arr", atol=0, rtol=0.1)
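
The `atol` and `rtol` keywords above are the proposal, not an existing `groupby` signature. As a rough sketch of what tolerance-based labeling could mean for a 1-D array, the helper below (hypothetical name `label_by_tolerance`, with an `np.isclose`-style threshold assumed) sorts the values and starts a new group wherever the gap to the previous value exceeds `atol + rtol * |previous|`. The resulting integer labels could then be used as the `by` argument, as in the MeanShift example.

```python
import numpy as np


def label_by_tolerance(arr, atol=0.0, rtol=0.1):
    """Hypothetical helper: integer group labels for a 1-D array.

    Sorted neighbours a <= b share a group when b - a <= atol + rtol * |a|.
    Groups are chained, so the two ends of a long run of close values can
    differ by more than the tolerance.
    """
    arr = np.asarray(arr, dtype=float)
    if arr.size == 0:
        return np.empty(0, dtype=np.intp)
    order = np.argsort(arr, kind="stable")
    sorted_vals = arr[order]
    gaps = np.diff(sorted_vals)
    limit = atol + rtol * np.abs(sorted_vals[:-1])
    # A new group starts wherever the gap to the previous value is too large.
    new_group = np.concatenate(([False], gaps > limit))
    sorted_labels = np.cumsum(new_group)
    # Scatter the labels back to the original element order.
    labels = np.empty(arr.shape, dtype=np.intp)
    labels[order] = sorted_labels
    return labels


arr = np.array([75.2, 99.1, 74.8, 301.0, 100.4, 298.5])
labels = label_by_tolerance(arr, atol=0.0, rtol=0.1)
print(labels)
```

Unlike fixed bins, nothing about the group locations has to be known in advance; only the tolerances are chosen. The trade-off is the chaining behaviour noted in the docstring.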