Add summation #67

nryant · 2023-04-10T22:37:49Z

Overview

Adds following methods to BaseMetric to support summation:

__add__
__radd__

Motivation

The motivation is two-fold:

multiprocessing

Suppose we want to compute metrics for a large volume of data -- sufficiently large that we would like to parallelize the collection of sufficient statistics. Computing the sufficient statistics using multiprocessing or joblib is straightforward; e.g., using joblib:

def process_one(reference, hypothesis, uem):
   """Wrapper function for multiprocessing."""
    der = DiarizationErrorRate()
    der(reference, hypothesis, uem=uem)
    return der

def main():
    # Mapping from URIs to reference annotations.
    reference_anns = ...
    
    # Mapping from URIs to hypothesis annotations.
    hypothesis_anns = ...
    
    # Mapping from URIs to scoring regions.
    uems = ...
      
    # Number of parallel processes to use.
    n_jobs = 10
    
    # Actual joblib call.
    f = delayed(process_one)
    file_ders = Parallel(n_jobs)(f(reference_anns[uri], hypothesis_anns[uri], uems[uri])
                                 for uri in reference_anns)

However, the sufficient statistics for DER computation are now spread across the file-level metrics. Combining these into a single instance reduces to:

combined_der = sum(file_ders)

aggregation

Suppose we want to compute metrics not just overall and at a file-level, but by various logical subdivision; e.g., DIHARD III domains. This is now trivial using pandas dataframes. E.g. suppose data contains the following columns:

file_id -- file id
domain -- domain file is from
der -- instance of DiarizationErrorRate

then:

data.groupby('domain', as_index=False).sum()

will contain domain-level metrics, from which suitable reports may be generated.

Sync with pyannote/develop

hbredin · 2023-04-11T06:59:24Z

I have struggled in the past to make pyannote.metrics work well in a distributed way.

Your proposed solution is great! Thanks @nryant.

hbredin · 2023-04-11T07:05:47Z

pyannote/metrics/base.py

@@ -247,6 +247,21 @@ def __iter__(self):
        for uri, component in self.results_:
            yield uri, component

+    def __add__(self, other):
+        cls = self.__class__
+        result = cls()


Can we find a way to make sure result is initialized with the same options (e.g. collar and skip_overlap for DiarizationErrorRate instances) as self and other?

This probably means adding some kind of sklearn-like mechanism to clone metrics.

Good point! I cribbed this PR from some monkey patching I did a couple of years ago for an internal SAD scoring tool and in that context, default parameter values were being used, so the issue didn't come up. After looking at how sklearn handles this, maybe we could add a similar method to ensure the resulting instance is initialized with the same arguments as the first summand. If so, I should also document that sum([m1, m2, ...]) assumes all metrics were initialized identically (reasonable).

I've made an initial attempt at a clone function, which also required me to implement a get_params method for metrics. NOTE that get_params makes the assumption that all subclasses of BaseMetric include **kwargs in their signature and pass these keyword arguments to the constructor of the superclass (or if multiple superclasses, to one of them). Should this assumption be violated, weirdness could ensue.

An alternate approach would be that used within sklearn, which bans use of *args and **kwargs within constructors and forces each metric to be explicit about its parameters. This would require touching more lines of the codebase, but beyond being a bit of a chore, shouldn't be difficult to implement.

hbredin · 2023-04-11T07:07:15Z

pyannote/metrics/base.py

+        return result
+
+    def __radd__(self, other):
+        if other == 0:


Is this how the built-in sum function initializes its own accumulator?
Would be nice to add a quick comment mentionning this...

The built-in sum actually has a start parameter that controls the initial value of the summation. As you might gather, the default value is for start is 0, so I just hard coded that value as an additive identity for metrics. Probably would be good to add a one or two line comment to this effect to save someone having to read up on __radd__ and sum.

hbredin · 2023-04-11T07:13:29Z

Adding back a "parallel processing" entry in the documentation (mentioning your joblib example, for instance) would be nice to have as well!

4c1be01

pyannote/metrics/base.py

hbredin · 2023-04-19T13:02:49Z

pyannote/metrics/base.py

@@ -36,6 +37,51 @@
 from pyannote.metrics.types import Details, MetricComponents


+# TODO:
+# - fit/unfit = proper terminology?
+def clone(metric):


Can we make it a method of BaseMetric?

No objection. Preference for method or staticmethod?

It needs access to self, doesn't it? So regular method.
Unless there is something I don't understand about the distinction between method and statcimethod...

hbredin · 2023-04-19T13:05:43Z

pyannote/metrics/base.py

+        cls = self.__class__
+        result = cls()


Once clone becomes a method of BaseMetric, we would do:

Suggested change

cls = self.__class__

result = cls()

result = self.clone()

hbredin · 2023-04-19T13:06:29Z

tests/test_base.py

+def test_clone():
+    # Tests that clone creates deep copy of "unfit" metric.
+    metric = M1(a=10)
+    metric_new = clone(metric)


Once clone is a method of BaseMetric:

Suggested change

metric_new = clone(metric)

metric_new = self.clone(metric)

Co-authored-by: Hervé BREDIN <hbredin@users.noreply.github.com>

nryant added 3 commits April 11, 2020 17:59

Merge pull request #2 from pyannote/develop

ac019df

Sync with pyannote/develop

Merge branch 'pyannote:develop' into develop

05ef917

feat: allow summation of metrics

7c8675e

hbredin reviewed Apr 11, 2023

View reviewed changes

feat: add clone function

32108ce

hbredin requested changes Apr 19, 2023

View reviewed changes

nryant and others added 4 commits April 19, 2023 09:55

Update pyannote/metrics/base.py

bfeae7e

Co-authored-by: Hervé BREDIN <hbredin@users.noreply.github.com>

Update pyannote/metrics/base.py

edae71c

Co-authored-by: Hervé BREDIN <hbredin@users.noreply.github.com>

Update pyannote/metrics/base.py

6adfff0

Co-authored-by: Hervé BREDIN <hbredin@users.noreply.github.com>

Update pyannote/metrics/base.py

aa1827f

Co-authored-by: Hervé BREDIN <hbredin@users.noreply.github.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add summation #67

Add summation #67

nryant commented Apr 10, 2023

hbredin commented Apr 11, 2023

hbredin Apr 11, 2023

nryant Apr 11, 2023

nryant Apr 17, 2023

hbredin Apr 11, 2023

nryant Apr 11, 2023

hbredin commented Apr 11, 2023 •

edited

hbredin Apr 19, 2023

nryant Apr 19, 2023

hbredin Apr 19, 2023

hbredin Apr 19, 2023

hbredin Apr 19, 2023

Add summation #67

Are you sure you want to change the base?

Add summation #67

Conversation

nryant commented Apr 10, 2023

Overview

Motivation

hbredin commented Apr 11, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

hbredin commented Apr 11, 2023 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

hbredin commented Apr 11, 2023 •

edited