
Implement metric operators (classification) in SML #383

Closed
Candicepan opened this issue Nov 2, 2023 · 11 comments · Fixed by #405
Labels: enhancement (New feature or request), OSCP (SecretFlow Open Source Contribution Plan)

Comments

@Candicepan (Contributor) commented Nov 2, 2023

This issue is a task in the third round of the SecretFlow Open Source Contribution Plan (SF OSCP); community developers are welcome to participate.
If you are interested in claiming a task but have not yet registered, please complete registration first.

Task description

  • Task name: implement metric operators (classification) in SML
  • Technical direction: SPU/SML
  • Difficulty: advanced 🌟🌟

Detailed requirements

  • Add metric operators (classification) to SML, including:
    ⅰ. f1_score
    ⅱ. precision_score
    ⅲ. recall_score
    ⅳ. accuracy_score
    For the exact semantics, refer to sklearn.
  • Correctness: please make sure the submitted code runs as-is.
  • Code style: Python code must be formatted with black + isort (the CI pipeline enforces a style check); bazel files must be formatted with buildifier.
  • A single claim must implement all of the operators listed above.

If you would like to suggest other algorithms to implement, feel free to reply under this issue.

Capability requirements

  • Familiarity with classic machine learning algorithms
  • Familiarity with JAX or NumPy; able to implement the algorithms with NumPy
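For concreteness, the metrics to reproduce follow the standard confusion-matrix definitions that sklearn implements. A plain-Python sketch on a small example (the example data here is illustrative, not from the task description):

```python
# Definitions (as implemented by sklearn) that the SML operators
# should reproduce, computed in plain Python for a small example.
y_true = [0, 1, 1, 0, 1, 1]
y_pred = [0, 0, 1, 0, 1, 1]

tp = sum(t == p == 1 for t, p in zip(y_true, y_pred))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(precision, recall, f1, accuracy)
```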

Instructions

How to claim

  • After claiming a task, please comment under this issue with your concrete design.
  • Design notes: briefly describe the algorithm and the technical approach you plan to use.
@Candicepan added the enhancement (New feature or request) and OSCP (SecretFlow Open Source Contribution Plan) labels on Nov 2, 2023
@Candicepan changed the title from 在SML中实现metirc算子(分类) to 在 SML 中实现 metirc 算子(分类) on Nov 2, 2023
@tarantula-leo (Contributor)

I'd like to claim this task.

@tarantula-leo (Contributor)

For binary classification:

from sklearn import metrics
import jax.numpy as jnp

def f1_score(y_true, y_pred):
    """Calculate the F1 score."""
    tp = jnp.sum(y_true * y_pred)
    fp = jnp.sum((1 - y_true) * y_pred)
    fn = jnp.sum(y_true * (1 - y_pred))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return f1

def precision_score(y_true, y_pred):
    """Calculate the Precision score."""
    tp = jnp.sum(y_true * y_pred)
    fp = jnp.sum((1 - y_true) * y_pred)
    precision = tp / (tp + fp + 1e-10)
    return precision

def recall_score(y_true, y_pred):
    """Calculate the Recall score."""
    tp = jnp.sum(y_true * y_pred)
    fn = jnp.sum(y_true * (1 - y_pred))
    recall = tp / (tp + fn)
    return recall  

def accuracy_score(y_true, y_pred):
    """Calculate the Accuracy score."""
    correct = jnp.sum(y_true == y_pred)
    total = len(y_true)
    accuracy = correct / total
    return accuracy

y_true = jnp.array([0, 1, 1, 0, 1, 1])
y_pred = jnp.array([0, 0, 1, 0, 1, 1])

print("F1-score_jax: ", f1_score(y_true, y_pred))
print("Precision score_jax: ", precision_score(y_true, y_pred))
print("Recall score_jax: ", recall_score(y_true, y_pred))
print("Accuracy score_jax: ", accuracy_score(y_true, y_pred))

print("F1-score_sklearn: ", metrics.f1_score(y_true, y_pred))
print("Precision score_sklearn: ", metrics.precision_score(y_true, y_pred))
print("Recall score_sklearn: ", metrics.recall_score(y_true, y_pred))
print("Accuracy score_sklearn: ", metrics.accuracy_score(y_true, y_pred))
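One detail worth noting in the sketch above: `precision_score` guards its denominator with `1e-10`, but `f1_score` and `recall_score` do not, so they return `nan` whenever the positive class is absent from the labels or predictions. A minimal guard (the `safe_div` helper and the `eps` value are illustrative, not part of the proposal):

```python
def safe_div(a, b, eps=1e-10):
    """Division guarded against a zero denominator (eps is illustrative)."""
    return a / (b + eps)

# With no positive predictions, tp = fp = 0, so an unguarded
# precision tp / (tp + fp) would be nan (or raise in plain Python).
tp, fp, fn = 0, 0, 2
precision = safe_div(tp, tp + fp)
recall = safe_div(tp, tp + fn)
f1 = safe_div(2 * precision * recall, precision + recall)
print(precision, recall, f1)  # all 0.0 instead of nan
```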

@deadlywing (Contributor)


Hi, the implementation is basically OK; a few details are worth refining:

  1. Multiclass support can be added with a simple extra argument, e.g. a labels parameter listing all possible label values (or we can simply assume the labels are always 0, 1, 2, ..., or pass a label_nums count).
  2. The implementation can be tightened: once tp is computed, fp can be derived directly without another multiplication.

BTW, SML already has a metric directory; the code can go straight into classification.py.
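Point 2 above can be sketched as follows (plain Python for illustration; the same identity applies to `jnp.sum` in SML, where avoiding the extra elementwise product saves secure multiplications):

```python
# Once tp is known, fp and fn follow from plain sums:
#   fp = (# predicted positives) - tp
#   fn = (# actual positives)    - tp
# so the products (1 - y_true) * y_pred and y_true * (1 - y_pred)
# are unnecessary.
y_true = [0, 1, 1, 0, 1, 1]
y_pred = [0, 0, 1, 0, 1, 1]

tp = sum(t * p for t, p in zip(y_true, y_pred))
fp = sum(y_pred) - tp
fn = sum(y_true) - tp
print(tp, fp, fn)  # 3 0 1
```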

@tarantula-leo (Contributor)

For multiclass, do all three of the following need code changes, using average=None?

ⅰ. f1_score
ⅱ. precision_score
ⅲ. recall_score

@deadlywing (Contributor)


You can follow sklearn and keep an average parameter; implementing only None (or some other mode) is fine.
The main principle is to keep the usage as close as possible to the plaintext (sklearn) API.
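With average=None, each metric is computed per label in one-vs-rest fashion, matching sklearn's behavior of returning one score per entry of labels. A plain-Python sketch for recall (the `per_label_recall` helper name is illustrative):

```python
# One-vs-rest sketch for average=None: binarize against each label,
# then compute the binary metric per label.
def per_label_recall(y_true, y_pred, labels):
    out = []
    for lab in labels:
        t = [1 if y == lab else 0 for y in y_true]
        p = [1 if y == lab else 0 for y in y_pred]
        tp = sum(a * b for a, b in zip(t, p))
        fn = sum(t) - tp
        out.append(tp / (tp + fn) if tp + fn else 0.0)
    return out

print(per_label_recall([0, 1, 1, 0, 2, 1], [0, 0, 1, 0, 2, 1], [0, 1, 2]))
```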

@Candicepan changed the title from 在 SML 中实现 metirc 算子(分类) to 在 SML 中实现 metric 算子(分类) on Nov 14, 2023
@tarantula-leo (Contributor)

classification:

import jax.numpy as jnp

def _f1_score(y_true, y_pred):
    """Calculate the F1 score."""
    tp = jnp.sum(y_true * y_pred)
    fp = jnp.sum(y_pred) - tp
    fn = jnp.sum(y_true) - tp
    precision = tp / (tp + fp + 1e-10)
    recall = tp / (tp + fn + 1e-10)
    f1 = 2 * precision * recall / (precision + recall + 1e-10)
    return f1

def _precision_score(y_true, y_pred):
    """Calculate the Precision score."""
    tp = jnp.sum(y_true * y_pred)
    fp = jnp.sum(y_pred) - tp
    precision = tp / (tp + fp + 1e-10)
    return precision

def _recall_score(y_true, y_pred):
    """Calculate the Recall score."""
    tp = jnp.sum(y_true * y_pred)
    fn = jnp.sum(y_true) - tp
    recall = tp / (tp + fn + 1e-10)
    return recall  

def accuracy_score(y_true, y_pred):
    """Calculate the Accuracy score."""
    correct = jnp.sum(y_true == y_pred)
    total = len(y_true)
    accuracy = correct / total
    return accuracy

def transform(y_true, y_pred, label):
    y_true_transform = jnp.where(y_true == label, 1, 0)
    y_pred_transform = jnp.where(y_pred != label, 0, 1)
    return y_true_transform, y_pred_transform

def f1_score(y_true, y_pred, average='binary', labels=None, pos_label=1):
    if average is None:
        assert labels is not None, "labels cannot be None when average is None"
        f1_result = []
        for label in labels:
            y_true_transform, y_pred_transform = transform(y_true, y_pred, label)
            f1_result.append(_f1_score(y_true_transform, y_pred_transform))
    elif average == 'binary':
        y_true_transform, y_pred_transform = transform(y_true, y_pred, pos_label)
        f1_result = _f1_score(y_true_transform, y_pred_transform)
    else:
        raise ValueError("average should be None or 'binary'")
    return f1_result

def precision_score(y_true, y_pred, average='binary', labels=None, pos_label=1):
    if average is None:
        assert labels is not None, "labels cannot be None when average is None"
        precision_result = []
        for label in labels:
            y_true_transform, y_pred_transform = transform(y_true, y_pred, label)
            precision_result.append(_precision_score(y_true_transform, y_pred_transform))
    elif average == 'binary':
        y_true_transform, y_pred_transform = transform(y_true, y_pred, pos_label)
        precision_result = _precision_score(y_true_transform, y_pred_transform)
    else:
        raise ValueError("average should be None or 'binary'")
    return precision_result

def recall_score(y_true, y_pred, average='binary', labels=None, pos_label=1):
    if average is None:
        assert labels is not None, "labels cannot be None when average is None"
        recall_result = []
        for label in labels:
            y_true_transform, y_pred_transform = transform(y_true, y_pred, label)
            recall_result.append(_recall_score(y_true_transform, y_pred_transform))
    elif average == 'binary':
        y_true_transform, y_pred_transform = transform(y_true, y_pred, pos_label)
        recall_result = _recall_score(y_true_transform, y_pred_transform)
    else:
        raise ValueError("average should be None or 'binary'")
    return recall_result

test:

import os
import sys
import unittest

import jax.numpy as jnp
import numpy as np
from sklearn import metrics

import spu.spu_pb2 as spu_pb2
import spu.utils.simulation as spsim

# Add the library directory to the path
sys.path.append(os.path.join(os.path.dirname(__file__), "../../../"))
from classification import f1_score, precision_score, recall_score, accuracy_score

class UnitTests(unittest.TestCase):
    def test_classification(self):
        sim = spsim.Simulator.simple(
            3, spu_pb2.ProtocolKind.ABY3, spu_pb2.FieldType.FM128
        )

        def proc(y_true, y_pred, average='binary', labels=None, pos_label=1):
            f1 = f1_score(y_true, y_pred, average=average, labels=labels, pos_label=pos_label)
            precision = precision_score(y_true, y_pred, average=average, labels=labels, pos_label=pos_label)
            recall = recall_score(y_true, y_pred, average=average, labels=labels, pos_label=pos_label)
            accuracy = accuracy_score(y_true, y_pred)
            return f1, precision, recall, accuracy

        def sklearn_proc(y_true, y_pred, average='binary', labels=None, pos_label=1):
            f1 = metrics.f1_score(y_true, y_pred, average=average, labels=labels, pos_label=pos_label)
            precision = metrics.precision_score(y_true, y_pred, average=average, labels=labels, pos_label=pos_label)
            recall = metrics.recall_score(y_true, y_pred, average=average, labels=labels, pos_label=pos_label)
            accuracy = metrics.accuracy_score(y_true, y_pred)
            return f1, precision, recall, accuracy

        def check(spu_result, sk_result):
            for spu_val, sk_val in zip(spu_result, sk_result):
                np.testing.assert_allclose(spu_val, sk_val, rtol=1, atol=1e-5)

        # Test binary
        y_true = jnp.array([0, 1, 1, 0, 1, 1])
        y_pred = jnp.array([0, 0, 1, 0, 1, 1])
        spu_result = spsim.sim_jax(sim, proc)(y_true, y_pred)
        sk_result = sklearn_proc(y_true, y_pred)
        check(spu_result, sk_result)

        # Test multiclass
        y_true = jnp.array([0, 1, 1, 0, 2, 1])
        y_pred = jnp.array([0, 0, 1, 0, 2, 1])
        spu_result = spsim.sim_jax(sim, proc)(y_true, y_pred, average=None, labels=[0,1,2])
        sk_result = sklearn_proc(y_true, y_pred, average=None, labels=[0,1,2])
        check(spu_result, sk_result)

if __name__ == "__main__":
    unittest.main()

@deadlywing (Contributor)

@tarantula-leo

Hi, this is mostly OK. A few possible optimizations:

  1. For f1_score, the direct definition needs 3 divisions, but substituting the definitions of precision and recall into f1 and simplifying reduces this to a single division;
  2. The implementations of f1_score, precision_score, and recall_score are nearly identical, differing only in which _func is called; they can be merged into one shared helper.
  3. Consider adding a not_transform parameter so the transform call can be skipped (binary labels are very often already 0/1).
  4. Add comments to the shared helper explaining all of its parameters.

BTW, consider opening a PR directly so that I can comment on the relevant lines. Thanks!
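The simplification in point 1 matters because divisions are expensive to evaluate securely. Substituting precision = tp/(tp+fp) and recall = tp/(tp+fn) into f1 = 2PR/(P+R) gives f1 = 2tp/(2tp + fp + fn), a single division. A quick check (example counts are illustrative):

```python
# f1 = 2PR/(P+R) simplifies to 2*tp / (2*tp + fp + fn):
# the tp^2 numerator and the common tp factor in the denominator cancel.
tp, fp, fn = 3, 0, 1

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1_direct = 2 * precision * recall / (precision + recall)  # three divisions
f1_simplified = 2 * tp / (2 * tp + fp + fn)                # one division
print(f1_direct, f1_simplified)
```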

@tarantula-leo (Contributor)

On point 3: for binary classification, skip transform and use pos_label to indicate the positive-class value; the labels parameter only takes effect for multiclass.

@deadlywing (Contributor)

Generally pos_label is only used for binary classification and labels only for multiclass. But binary labels are not necessarily 0/1; they may be -1/1, in which case transform is still needed. That is why I suggested a not_transform parameter, which to some extent hands the decision of whether to transform over to the user.

If you have another design, feel free to raise it for discussion.
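The -1/1 case under discussion can be seen in a plain-Python sketch of the transform (mirroring the jnp.where-based transform in the code above): with pos_label=1 it maps -1/1 labels to 0/1, and on labels that are already 0/1 it is the identity, which is what motivates letting the user skip it.

```python
# Plain-Python sketch of the one-vs-rest binarization (the jnp.where
# version in the proposal behaves the same way).
def transform(y_true, y_pred, pos_label):
    t = [1 if y == pos_label else 0 for y in y_true]
    p = [1 if y == pos_label else 0 for y in y_pred]
    return t, p

print(transform([-1, 1, 1], [-1, -1, 1], 1))  # -1/1 mapped to 0/1
print(transform([0, 1, 1], [0, 0, 1], 1))     # 0/1 input: identity
```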

@tarantula-leo (Contributor)

My earlier comment was a bit off: binary classification skips labels, not transform. For labels like -1/1, as long as the positive label is specified as pos_label=1 and average is 'binary', they will be transformed to 0/1. In what scenario do you intend not_transform to be used?

@deadlywing (Contributor)

Binary classification scenarios:
labels are 0/1: no transform needed
labels are -1/1: transform needed

To distinguish these two cases, the user may need to specify not_transform.

deadlywing pushed a commit that referenced this issue Nov 19, 2023
# Pull Request

## What problem does this PR solve?

Issue Number: Fixed #383 

## Possible side effects?

- Performance: support multi-classification

- Backward compatibility: