
Implement metric operators (classification) in SML #383

Closed
Candicepan opened this issue Nov 2, 2023 · 11 comments · Fixed by #405
Labels: enhancement (New feature or request), OSCP (SecretFlow Open Source Contribution Plan)

Comments

@Candicepan (Contributor) commented Nov 2, 2023

This issue is a task in the third round of the SecretFlow Open Source Contribution Plan (SF OSCP); community developers are welcome to participate.
If you are interested in claiming a task but have not yet registered, please complete registration first.

Task description

  • Task name: implement metric operators (classification) in SML
  • Technical direction: SPU/SML
  • Difficulty: advanced 🌟🌟

Detailed requirements

  • Add metric operators (classification) to SML, including:
    ⅰ. f1_score
    ⅱ. precision_score
    ⅲ. recall_score
    ⅳ. accuracy_score
    For the exact semantics, refer to sklearn.
  • Correctness: please make sure the submitted code runs as-is.
  • Code style: Python code must be formatted with black + isort (the CI pipeline enforces a style check); bazel files must be formatted with buildifier.
  • A single claim must implement all of the operators listed above.

If you would like to suggest other algorithms to implement, feel free to reply under this issue.

Capability requirements

  • Familiarity with classic machine learning algorithms
  • Familiarity with JAX or NumPy; able to implement the algorithms with NumPy
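For concreteness, the metrics to reproduce follow the standard confusion-matrix definitions that sklearn implements. A plain-Python sketch on a small example (the example data here is illustrative, not from the task description):

```python
# Definitions (as implemented by sklearn) that the SML operators
# should reproduce, computed in plain Python for a small example.
y_true = [0, 1, 1, 0, 1, 1]
y_pred = [0, 0, 1, 0, 1, 1]

tp = sum(t == p == 1 for t, p in zip(y_true, y_pred))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(precision, recall, f1, accuracy)
```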

Instructions

How to claim

  • After claiming a task, please comment under this issue with your concrete design.
  • Design notes: briefly describe the algorithm and the technical approach you plan to use.
@Candicepan added the enhancement (New feature or request) and OSCP (SecretFlow Open Source Contribution Plan) labels on Nov 2, 2023
@Candicepan changed the title from 在SML中实现metirc算子(分类) to 在 SML 中实现 metirc 算子(分类) on Nov 2, 2023
@tarantula-leo (Contributor)

I'd like to claim this task.

@tarantula-leo (Contributor)

For binary classification:

from sklearn import metrics
import jax.numpy as jnp

def f1_score(y_true, y_pred):
    """Calculate the F1 score."""
    tp = jnp.sum(y_true * y_pred)
    fp = jnp.sum((1 - y_true) * y_pred)
    fn = jnp.sum(y_true * (1 - y_pred))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return f1

def precision_score(y_true, y_pred):
    """Calculate the Precision score."""
    tp = jnp.sum(y_true * y_pred)
    fp = jnp.sum((1 - y_true) * y_pred)
    precision = tp / (tp + fp + 1e-10)
    return precision

def recall_score(y_true, y_pred):
    """Calculate the Recall score."""
    tp = jnp.sum(y_true * y_pred)
    fn = jnp.sum(y_true * (1 - y_pred))
    recall = tp / (tp + fn)
    return recall  

def accuracy_score(y_true, y_pred):
    """Calculate the Accuracy score."""
    correct = jnp.sum(y_true == y_pred)
    total = len(y_true)
    accuracy = correct / total
    return accuracy

y_true = jnp.array([0, 1, 1, 0, 1, 1])
y_pred = jnp.array([0, 0, 1, 0, 1, 1])

print("F1-score_jax: ", f1_score(y_true, y_pred))
print("Precision score_jax: ", precision_score(y_true, y_pred))
print("Recall score_jax: ", recall_score(y_true, y_pred))
print("Accuracy score_jax: ", accuracy_score(y_true, y_pred))

print("F1-score_sklearn: ", metrics.f1_score(y_true, y_pred))
print("Precision score_sklearn: ", metrics.precision_score(y_true, y_pred))
print("Recall score_sklearn: ", metrics.recall_score(y_true, y_pred))
print("Accuracy score_sklearn: ", metrics.accuracy_score(y_true, y_pred))
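One detail worth noting in the sketch above: `precision_score` guards its denominator with `1e-10`, but `f1_score` and `recall_score` do not, so they return `nan` whenever the positive class is absent from the labels or predictions. A minimal guard (the `safe_div` helper and the `eps` value are illustrative, not part of the proposal):

```python
def safe_div(a, b, eps=1e-10):
    """Division guarded against a zero denominator (eps is illustrative)."""
    return a / (b + eps)

# With no positive predictions, tp = fp = 0, so an unguarded
# precision tp / (tp + fp) would be nan (or raise in plain Python).
tp, fp, fn = 0, 0, 2
precision = safe_div(tp, tp + fp)
recall = safe_div(tp, tp + fn)
f1 = safe_div(2 * precision * recall, precision + recall)
print(precision, recall, f1)  # all 0.0 instead of nan
```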

@deadlywing (Contributor)


Hi, the implementation is basically OK; a few details are worth refining:

  1. Multiclass support can be added with a simple extra argument, e.g. a labels parameter listing all possible label values (or we can simply assume the labels are always 0, 1, 2, ..., or pass a label_nums count).
  2. The implementation can be tightened: once tp is computed, fp can be derived directly without another multiplication.

BTW, SML already has a metric directory; the code can go straight into classification.py.
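Point 2 above can be sketched as follows (plain Python for illustration; the same identity applies to `jnp.sum` in SML, where avoiding the extra elementwise product saves secure multiplications):

```python
# Once tp is known, fp and fn follow from plain sums:
#   fp = (# predicted positives) - tp
#   fn = (# actual positives)    - tp
# so the products (1 - y_true) * y_pred and y_true * (1 - y_pred)
# are unnecessary.
y_true = [0, 1, 1, 0, 1, 1]
y_pred = [0, 0, 1, 0, 1, 1]

tp = sum(t * p for t, p in zip(y_true, y_pred))
fp = sum(y_pred) - tp
fn = sum(y_true) - tp
print(tp, fp, fn)  # 3 0 1
```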

@tarantula-leo (Contributor)

For multiclass, do all three of the following need code changes, using average=None?

ⅰ. f1_score
ⅱ. precision_score
ⅲ. recall_score

@deadlywing (Contributor)


You can follow sklearn and keep an average parameter; implementing only None (or some other mode) is fine.
The main principle is to keep the usage as close as possible to the plaintext (sklearn) API.
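With average=None, each metric is computed per label in one-vs-rest fashion, matching sklearn's behavior of returning one score per entry of labels. A plain-Python sketch for recall (the `per_label_recall` helper name is illustrative):

```python
# One-vs-rest sketch for average=None: binarize against each label,
# then compute the binary metric per label.
def per_label_recall(y_true, y_pred, labels):
    out = []
    for lab in labels:
        t = [1 if y == lab else 0 for y in y_true]
        p = [1 if y == lab else 0 for y in y_pred]
        tp = sum(a * b for a, b in zip(t, p))
        fn = sum(t) - tp
        out.append(tp / (tp + fn) if tp + fn else 0.0)
    return out

print(per_label_recall([0, 1, 1, 0, 2, 1], [0, 0, 1, 0, 2, 1], [0, 1, 2]))
```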

@Candicepan changed the title from 在 SML 中实现 metirc 算子(分类) to 在 SML 中实现 metric 算子(分类) on Nov 14, 2023
@tarantula-leo (Contributor)

classification:

import jax.numpy as jnp

def _f1_score(y_true, y_pred):
    """Calculate the F1 score."""
    tp = jnp.sum(y_true * y_pred)
    fp = jnp.sum(y_pred) - tp
    fn = jnp.sum(y_true) - tp
    precision = tp / (tp + fp + 1e-10)
    recall = tp / (tp + fn + 1e-10)
    f1 = 2 * precision * recall / (precision + recall + 1e-10)
    return f1

def _precision_score(y_true, y_pred):
    """Calculate the Precision score."""
    tp = jnp.sum(y_true * y_pred)
    fp = jnp.sum(y_pred) - tp
    precision = tp / (tp + fp + 1e-10)
    return precision

def _recall_score(y_true, y_pred):
    """Calculate the Recall score."""
    tp = jnp.sum(y_true * y_pred)
    fn = jnp.sum(y_true) - tp
    recall = tp / (tp + fn + 1e-10)
    return recall  

def accuracy_score(y_true, y_pred):
    """Calculate the Accuracy score."""
    correct = jnp.sum(y_true == y_pred)
    total = len(y_true)
    accuracy = correct / total
    return accuracy

def transform(y_true, y_pred, label):
    y_true_transform = jnp.where(y_true == label, 1, 0)
    y_pred_transform = jnp.where(y_pred != label, 0, 1)
    return y_true_transform, y_pred_transform

def f1_score(y_true, y_pred, average='binary', labels=None, pos_label=1):
    if average is None:
        assert labels is not None, "labels cannot be None when average is None"
        f1_result = []
        for label in labels:
            y_true_transform, y_pred_transform = transform(y_true, y_pred, label)
            f1_result.append(_f1_score(y_true_transform, y_pred_transform))
    elif average == 'binary':
        y_true_transform, y_pred_transform = transform(y_true, y_pred, pos_label)
        f1_result = _f1_score(y_true_transform, y_pred_transform)
    else:
        raise ValueError("average should be None or 'binary'")
    return f1_result

def precision_score(y_true, y_pred, average='binary', labels=None, pos_label=1):
    if average is None:
        assert labels is not None, "labels cannot be None when average is None"
        precision_result = []
        for label in labels:
            y_true_transform, y_pred_transform = transform(y_true, y_pred, label)
            precision_result.append(_precision_score(y_true_transform, y_pred_transform))
    elif average == 'binary':
        y_true_transform, y_pred_transform = transform(y_true, y_pred, pos_label)
        precision_result = _precision_score(y_true_transform, y_pred_transform)
    else:
        raise ValueError("average should be None or 'binary'")
    return precision_result

def recall_score(y_true, y_pred, average='binary', labels=None, pos_label=1):
    if average is None:
        assert labels is not None, "labels cannot be None when average is None"
        recall_result = []
        for label in labels:
            y_true_transform, y_pred_transform = transform(y_true, y_pred, label)
            recall_result.append(_recall_score(y_true_transform, y_pred_transform))
    elif average == 'binary':
        y_true_transform, y_pred_transform = transform(y_true, y_pred, pos_label)
        recall_result = _recall_score(y_true_transform, y_pred_transform)
    else:
        raise ValueError("average should be None or 'binary'")
    return recall_result

test:

import os
import sys
import unittest

import jax.numpy as jnp
import numpy as np
from sklearn import metrics

import spu.spu_pb2 as spu_pb2
import spu.utils.simulation as spsim

# Add the library directory to the path
sys.path.append(os.path.join(os.path.dirname(__file__), "../../../"))
from classification import f1_score, precision_score, recall_score, accuracy_score

class UnitTests(unittest.TestCase):
    def test_classification(self):
        sim = spsim.Simulator.simple(
            3, spu_pb2.ProtocolKind.ABY3, spu_pb2.FieldType.FM128
        )

        def proc(y_true, y_pred, average='binary', labels=None, pos_label=1):
            f1 = f1_score(y_true, y_pred, average=average, labels=labels, pos_label=pos_label)
            precision = precision_score(y_true, y_pred, average=average, labels=labels, pos_label=pos_label)
            recall = recall_score(y_true, y_pred, average=average, labels=labels, pos_label=pos_label)
            accuracy = accuracy_score(y_true, y_pred)
            return f1, precision, recall, accuracy

        def sklearn_proc(y_true, y_pred, average='binary', labels=None, pos_label=1):
            f1 = metrics.f1_score(y_true, y_pred, average=average, labels=labels, pos_label=pos_label)
            precision = metrics.precision_score(y_true, y_pred, average=average, labels=labels, pos_label=pos_label)
            recall = metrics.recall_score(y_true, y_pred, average=average, labels=labels, pos_label=pos_label)
            accuracy = metrics.accuracy_score(y_true, y_pred)
            return f1, precision, recall, accuracy

        def check(spu_result, sk_result):
            for spu_val, sk_val in zip(spu_result, sk_result):
                np.testing.assert_allclose(spu_val, sk_val, rtol=1, atol=1e-5)

        # Test binary
        y_true = jnp.array([0, 1, 1, 0, 1, 1])
        y_pred = jnp.array([0, 0, 1, 0, 1, 1])
        spu_result = spsim.sim_jax(sim, proc)(y_true, y_pred)
        sk_result = sklearn_proc(y_true, y_pred)
        check(spu_result, sk_result)

        # Test multiclass
        y_true = jnp.array([0, 1, 1, 0, 2, 1])
        y_pred = jnp.array([0, 0, 1, 0, 2, 1])
        spu_result = spsim.sim_jax(sim, proc)(y_true, y_pred, average=None, labels=[0,1,2])
        sk_result = sklearn_proc(y_true, y_pred, average=None, labels=[0,1,2])
        check(spu_result, sk_result)

if __name__ == "__main__":
    unittest.main()

@deadlywing (Contributor)

@tarantula-leo

Hi, this is mostly OK. A few possible optimizations:

  1. For f1_score, the direct definition needs 3 divisions, but substituting the definitions of precision and recall into f1 and simplifying reduces this to a single division;
  2. The implementations of f1_score, precision_score, and recall_score are nearly identical, differing only in which _func is called; they can be merged into one shared helper.
  3. Consider adding a not_transform parameter so the transform call can be skipped (binary labels are very often already 0/1).
  4. Add comments to the shared helper explaining all of its parameters.

BTW, consider opening a PR directly so that I can comment on the relevant lines. Thanks!
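The simplification in point 1 matters because divisions are expensive to evaluate securely. Substituting precision = tp/(tp+fp) and recall = tp/(tp+fn) into f1 = 2PR/(P+R) gives f1 = 2tp/(2tp + fp + fn), a single division. A quick check (example counts are illustrative):

```python
# f1 = 2PR/(P+R) simplifies to 2*tp / (2*tp + fp + fn):
# the tp^2 numerator and the common tp factor in the denominator cancel.
tp, fp, fn = 3, 0, 1

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1_direct = 2 * precision * recall / (precision + recall)  # three divisions
f1_simplified = 2 * tp / (2 * tp + fp + fn)                # one division
print(f1_direct, f1_simplified)
```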

@tarantula-leo (Contributor)

On point 3: for binary classification, skip transform and use pos_label to indicate the positive-class value; the labels parameter only takes effect for multiclass.

@deadlywing (Contributor)

Generally pos_label is only used for binary classification and labels only for multiclass. But binary labels are not necessarily 0/1; they may be -1/1, in which case transform is still needed. That is why I suggested a not_transform parameter, which to some extent hands the decision of whether to transform over to the user.

If you have another design, feel free to raise it for discussion.
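The -1/1 case under discussion can be seen in a plain-Python sketch of the transform (mirroring the jnp.where-based transform in the code above): with pos_label=1 it maps -1/1 labels to 0/1, and on labels that are already 0/1 it is the identity, which is what motivates letting the user skip it.

```python
# Plain-Python sketch of the one-vs-rest binarization (the jnp.where
# version in the proposal behaves the same way).
def transform(y_true, y_pred, pos_label):
    t = [1 if y == pos_label else 0 for y in y_true]
    p = [1 if y == pos_label else 0 for y in y_pred]
    return t, p

print(transform([-1, 1, 1], [-1, -1, 1], 1))  # -1/1 mapped to 0/1
print(transform([0, 1, 1], [0, 0, 1], 1))     # 0/1 input: identity
```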

@tarantula-leo (Contributor)

My earlier comment was a bit off: binary classification skips labels, not transform. For labels like -1/1, as long as the positive label is specified as pos_label=1 and average is 'binary', they will be transformed to 0/1. In what scenario do you intend not_transform to be used?

@deadlywing (Contributor)

Binary classification scenarios:
labels are 0/1: no transform needed
labels are -1/1: transform needed

To distinguish these two cases, the user may need to specify not_transform.

deadlywing pushed a commit that referenced this issue Nov 19, 2023
# Pull Request

## What problem does this PR solve?

Issue Number: Fixed #383 

## Possible side effects?

- Performance: support multi-classification

- Backward compatibility: