
Add more customizable hyperparameters for SpectralClustering #995

Merged: 10 commits into pyannote:develop on Jun 9, 2022
Conversation

wq2012 (Contributor) commented May 26, 2022

Example config:

pipeline:
  name: pyannote.audio.pipelines.SpeakerDiarization
  params:
    segmentation: pyannote/segmentation
    embedding: speechbrain/spkrec-ecapa-voxceleb
    clustering: SpectralClustering

params:
  clustering:
    laplacian: GraphCut
    eigengap: Ratio
    gaussian_blur_sigma: 1
    p_percentile: 0.95
    refinement_sequence: ["GaussianBlur", "RowWiseThreshold", "Symmetrize"]
    symmetrize_type: Average
    thresholding_with_binarization: False
    thresholding_preserve_diagonal: False
    thresholding_type: RowMax
    use_autotune: True
  min_activity: 6.073193238899291
  min_duration_off: 0.09791355693027545
  min_duration_on: 0.05537587440407595
  offset: 0.4806866463041527
  onset: 0.8104268538848918
  stitch_threshold: 0.04033955907446252
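
For reference, a config like the one above should be loadable with the standard pyannote.audio 2.x API. A minimal sketch, assuming two hypothetical local files: config.yml containing the YAML above, and audio.wav to diarize.

from pyannote.audio import Pipeline

# Pipeline.from_pretrained also accepts a path to a local YAML config.
pipeline = Pipeline.from_pretrained("config.yml")

# Run diarization and print the speaker turns.
diarization = pipeline("audio.wav")
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.1f}s - {turn.end:.1f}s: {speaker}")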
  

hbredin (Member) commented May 31, 2022

Quick update on the performance (DER %) of this proposed spectral clustering (SC) pipeline compared to the default pyannote/speaker-diarization pipeline based on hierarchical agglomerative clustering (HAC):

Dataset   HAC    SC
AMI       21.5   21.4
DIHARD    22.2   27.3

Overall, it is significantly worse on DIHARD (currently evaluating on VoxConverse).

wq2012 (Contributor, Author) commented May 31, 2022

> Overall, it is significantly worse on DIHARD

That's interesting, but I wouldn't be too surprised, since the optimal spectral clustering hyperparameters usually depend heavily on the other modules in the diarization system, such as how the speaker embeddings are trained and how the speaker segmentation is implemented.

Some general suggestions that might be worth exploring:

  1. gaussian_blur_sigma: This really depends on how "dense" the embeddings are. If they are very dense, maybe a larger sigma would be better (but I never tried sigma>3). If the embeddings are extracted from speaker turns, then usually we don't use Gaussian blur at all.
  2. p_percentile: this was previously the most important and most sensitive hyper-param. But now we have auto-tune, so as long as use_autotune=true, this is no longer important.
  3. thresholding_type: this is very important and needs to be tuned.
  4. thresholding_with_binarization and thresholding_preserve_diagonal: worth tuning but not that critical.
  5. default_autotune: currently I hardcoded it in the CL instead of making it a hyperparameter. We could make p_percentile_max slightly larger and init_search_step slightly smaller to search more steps (while sacrificing some efficiency); see the sketch after this list.
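
To make item 5 concrete, here is a hedged sketch of what exposing those autotune options could look like; it assumes the spectralcluster package provides an AutoTune class with these keyword names.

from spectralcluster import AutoTune

# Assumed keyword names: widen the percentile range and shrink the initial
# search step to evaluate more candidate values of p_percentile.
autotune = AutoTune(
    p_percentile_min=0.60,
    p_percentile_max=0.95,  # slightly larger widens the search range
    init_search_step=0.01,  # slightly smaller means more search steps
    search_level=3,
)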

Also, spectral clustering is known to work very badly when the sequence of embeddings is very short. So I added fallback_options in wq2012/SpectralCluster@d431505

We could try to use something like FallbackOptions.spectral_min_embeddings=5.
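
A minimal sketch of what that could look like, assuming the FallbackOptions class and the fallback_options argument of SpectralClusterer introduced in that commit:

import numpy as np
from spectralcluster import FallbackOptions, SpectralClusterer

# Fall back to a simpler clusterer when fewer than 5 embeddings are given,
# since spectral clustering degrades on very short sequences.
fallback_options = FallbackOptions(spectral_min_embeddings=5)
clusterer = SpectralClusterer(
    min_clusters=1,
    max_clusters=20,
    fallback_options=fallback_options,
)

embeddings = np.random.rand(100, 192)  # (num_embeddings, embedding_dim)
labels = clusterer.predict(embeddings)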

hbredin (Member) commented Jun 1, 2022

Thanks for your feedback. Will try to optimize a bunch of those hyperparameters... but cannot promise any ETA.

wq2012 (Contributor, Author) commented Jun 1, 2022

Thanks. Just curious, do you have a script to handle the data download, extraction, and parsing (so that the audio and reference in evaluation_set: can be used)?

If so, I can give it a try as well.

hbredin (Member) commented Jun 1, 2022

I have something like this for the AMI dataset.
https://github.com/pyannote/AMI-diarization-setup/tree/main/pyannote

wq2012 (Contributor, Author) commented Jun 4, 2022

Shall we merge this PR regardless of the results?

We can always update default_parameters later when we have better results. Currently, spectral clustering is available but not fully configurable from the YAML file.

codecov bot commented Jun 7, 2022

Codecov Report

Merging #995 (0e021b5) into develop (aede20e) will decrease coverage by 0.28%.
The diff coverage is 0.00%.

@@             Coverage Diff             @@
##           develop     #995      +/-   ##
===========================================
- Coverage    35.61%   35.32%   -0.29%     
===========================================
  Files           58       58              
  Lines         3431     3459      +28     
===========================================
  Hits          1222     1222              
- Misses        2209     2237      +28     
Impacted Files                            Coverage Δ
pyannote/audio/pipelines/clustering.py    0.00% <0.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Comment on lines 173 to 200
elif (
    self.segmentation == "pyannote/segmentation"
    and self.embedding == "speechbrain/spkrec-ecapa-voxceleb"
    and self.klustering == "SpectralClustering"
    and not self.expects_num_speakers
):
    # SpectralClustering has not been optimized.
    return {
        "onset": 0.810,
        "offset": 0.481,
        "min_duration_on": 0.055,
        "min_duration_off": 0.098,
        "min_activity": 6.073,
        "stitch_threshold": 0.040,
        "clustering": {
            "laplacian": "GraphCut",
            "eigengap": "Ratio",
            "spectral_min_embeddings": 5,
            "gaussian_blur_sigma": 1,
            "p_percentile": 0.95,
            "refinement_sequence": ["GaussianBlur", "RowWiseThreshold", "Symmetrize"],
            "symmetrize_type": "Average",
            "thresholding_with_binarization": False,
            "thresholding_preserve_diagonal": False,
            "thresholding_type": "RowMax",
            "use_autotune": True,
        },
    }
hbredin (Member):

Until I/you/we/anyone actually optimize this, I'd rather not provide default parameters for this combination.

Suggested change: remove this elif block and its default parameters entirely.
wq2012 (Contributor, Author):

Sounds good. I have removed this.

self.spectral_min_embeddings = Uniform(1, 10000)

# Hyperparameters for refinement operations.
self.refinement_sequence = Parameter()
hbredin (Member):

This will most likely break at some point because pyannote.pipeline.Optimizer does not know how to handle this.

Two (imperfect) options I can think of are:

  • hardcode the refinement sequence
  • make it a Categorical hyper-parameter Categorical(["option1", "option2", ...]) and later use something like:
if self.refinement_sequence == "option1":
    refinement_sequence = ["GaussianBlur", "RowWiseThreshold", "Symmetrize"]
elif self.refinement_sequence == "option2":
    ...
elif ...

wq2012 (Contributor, Author):

I like the second option.

I made it something more readable:

Categorical(
    ["O", "G", "TS", "GTS", "TSD", "GTSD", "TSDN", "GTSDN",
     "TSN", "GTSN", "CTSDN", "CGTSDN"])


# Hyperparameters for refinement operations.
self.refinement_sequence = Parameter()
self.gaussian_blur_sigma = Uniform(0, 10000)
hbredin (Member):

Would it make sense to make it LogUniform?
(I have no intuition whatsoever about the effect of the scale of this hyper-parameter)

wq2012 (Contributor, Author):

I changed it to self.gaussian_blur_sigma = Categorical([0, 1, 2, 3])

At the same time, I changed spectral_min_embeddings to LogUniform:

self.spectral_min_embeddings = LogUniform(1, 100)

…l, remove default_parameters, and use LogUniform as appropriate
self.refinement_sequence = Categorical(
    ["O", "G", "TS", "GTS", "TSD", "GTSD", "TSDN", "GTSDN",
     "TSN", "GTSN", "CTSDN", "CGTSDN"])
self.gaussian_blur_sigma = Categorical([0, 1, 2, 3])
hbredin (Member):

Why did you choose to switch to integer values? Why not stick with Uniform(0, 3)? Does it make sense to make it even larger?

What does 0 mean in this case? No blur?

wq2012 (Contributor, Author):

> Why did you choose to switch to integer values?

In my previous experiments we only tried 0/1/2/3; we never tried float values or anything larger. But sigma can actually be a float, so I changed it back to Uniform.

> What does 0 mean in this case? No blur?

Right, it would be no blur. It's just calling scipy.ndimage.gaussian_filter.
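
A quick check confirms this: scipy.ndimage.gaussian_filter skips axes whose sigma is (near) zero, so sigma=0 leaves the affinity matrix untouched.

import numpy as np
from scipy.ndimage import gaussian_filter

affinity = np.random.rand(50, 50)

# sigma=0: the filter is skipped along every axis, so the input is unchanged.
assert np.allclose(gaussian_filter(affinity, sigma=0), affinity)

# sigma>0: the affinity matrix is smoothed before thresholding.
blurred = gaussian_filter(affinity, sigma=1)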

@@ -200,6 +232,19 @@ def __init__(self, metric: str = "cosine", expects_num_clusters: bool = False):
["Affinity", "Unnormalized", "RandomWalk", "GraphCut"]
)
self.eigengap = Categorical(["Ratio", "NormalizedDiff"])
self.spectral_min_embeddings = LogUniform(1, 100)
hbredin (Member):

My understanding is that spectral_min_embeddings is supposed to be an integer, right?
LogUniform makes it a float and artificially increases the dimension of the search space. How critical is this hyperparameter? Can we not just make it a constant, say 5?
I'd rather avoid adding almost-useless hyperparameters to the search space.
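
For illustration, a hedged sketch of the alternatives under discussion, assuming pyannote.pipeline.parameter provides Categorical, LogUniform, and Frozen (Frozen pinning a hyperparameter to a constant):

from pyannote.pipeline.parameter import Categorical, Frozen, LogUniform

# Continuous log-scale search: sampled values are floats, e.g. 3.7.
spectral_min_embeddings = LogUniform(1, 100)

# Pinning to a constant removes it from the search space entirely.
spectral_min_embeddings = Frozen(5)

# A small categorical set keeps the value an integer and the space tiny.
spectral_min_embeddings = Categorical([5, 10])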

wq2012 (Contributor, Author):

Makes sense. Changed to Categorical([5, 10]).

pyannote/audio/pipelines/clustering.py (resolved review thread)
hbredin merged commit 1c0f1a9 into pyannote:develop on Jun 9, 2022
hbredin (Member) commented Jun 9, 2022

Thanks 🎉
