I am interested in maximizing the discrepancy between two large datasets where each sample is a weighted feature vector. More specifically, I would like to split a large set of feature vectors into 2 subsets and maximize the discrepancy across the two subsets (think of unsupervised 2-class clustering), with the weights acting as soft assignments between the 2 groups. The number of samples (N) could be around 1e6 and the dimension of each feature vector (D) could be around 1e2, so I am looking for efficient approaches.

I tried the sliced Wasserstein distance, and I also tried the MMD implementation in geomloss. I could work with an approximate algorithm and am not stuck on a specific notion of discrepancy at this stage (hence the test of sliced EMD and Gaussian MMD). Is there a recommended sample loss for such large-scale problems? I haven't looked at the gradient computation yet, but I will eventually need the gradient of the discrepancy measure with respect to the weights. I also considered Wasserstein Discriminant Analysis, but it doesn't seem to support soft labels, so I guess it can't be used to get a gradient with respect to the labels.
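For context, here is a minimal sketch of the kind of setup I tried (a toy subset only; the sigmoid parametrization of the soft assignments is just one way to do it):

```python
import torch
import ot                       # POT
from geomloss import SamplesLoss

N, D = 10_000, 100              # small subset for testing; the full problem is N ~ 1e6
X = torch.randn(N, D)

# Soft assignments: one scalar in (0, 1) per sample, parametrized by logits.
logits = torch.zeros(N, requires_grad=True)
w = torch.sigmoid(logits)
a = w / w.sum()                 # normalized weights of "group 1"
b = (1 - w) / (1 - w).sum()     # normalized weights of "group 2"

# Sliced Wasserstein between the two weighted views of the same cloud (POT).
swd = ot.sliced_wasserstein_distance(X, X, a, b, n_projections=64)

# Gaussian MMD between the two weighted clouds (geomloss).
mmd = SamplesLoss(loss="gaussian", blur=1.0)(a, X, b, X)

# To maximize the discrepancy, do gradient ascent on the loss w.r.t. logits,
# e.g. (-swd).backward() followed by an optimizer step.
```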
If you have a very large number of points, you should definitely consider minibatch OT. It consists in optimizing the expectation of the OT loss over minibatches with SGD. You can do that manually easily enough with sliced Wasserstein, the exact solver (ot.emd2 / ot.solve_sample), or the Sinkhorn divergence from POT, all of which are very efficient on small batches.
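For instance, a minimal sketch of the minibatch approach with the exact solver (the batch size, sampling with replacement, and sigmoid parametrization of the soft assignments are illustrative assumptions, not a prescription):

```python
import torch
import ot  # POT

def minibatch_discrepancy(X, logits, batch_size=512, n_batches=10):
    """Monte Carlo estimate of the expected OT cost between the two soft groups."""
    N = X.shape[0]
    total = 0.0
    for _ in range(n_batches):
        idx = torch.randint(0, N, (batch_size,))   # sample a minibatch with replacement
        Xb = X[idx]
        w = torch.sigmoid(logits[idx])
        a = w / w.sum()                            # weights of "group 1" in the batch
        b = (1 - w) / (1 - w).sum()                # weights of "group 2" in the batch
        M = ot.dist(Xb, Xb)                        # pairwise squared Euclidean cost
        total = total + ot.emd2(a, b, M)           # exact OT; differentiable w.r.t. a, b
    return total / n_batches

# Maximize the discrepancy with SGD on the soft assignments:
#   loss = -minibatch_discrepancy(X, logits)
#   loss.backward(); optimizer.step()
```

With the torch backend, ot.emd2 gives gradients with respect to the marginal weights, so the gradient flows back to the logits through a and b; you could swap in ot.sliced_wasserstein_distance or a Sinkhorn divergence on the same minibatches.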