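# Using the Evaluation Metrics Tools
When training and testing a model, you usually need to compute evaluation metrics such as accuracy, precision, recall, and F1 score. Which metrics apply depends on the data being processed and the type of task.  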
Evaluation metrics for classification:  
- Accuracy
- Precision
- Recall
- F1 score
- ROC curve
- AUC (area under the ROC curve)  
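
To make the definitions behind these metrics concrete, here is a minimal pure-Python sketch for binary 0/1 labels, with 1 treated as the positive class. The function names, the toy data, and the choice to return 0.0 when a denominator is zero are illustrative assumptions, not part of any Hugging Face API. AUC is computed with the pairwise-ranking definition, which is only practical for small inputs.

```python
def binary_classification_metrics(references, predictions):
    """Accuracy, precision, recall and F1 for 0/1 labels (1 = positive)."""
    pairs = list(zip(references, predictions))
    tp = sum(1 for r, p in pairs if r == 1 and p == 1)  # true positives
    fp = sum(1 for r, p in pairs if r == 0 and p == 1)  # false positives
    fn = sum(1 for r, p in pairs if r == 1 and p == 0)  # false negatives
    tn = sum(1 for r, p in pairs if r == 0 and p == 0)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    # Assumption: report 0.0 instead of raising when a denominator is zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def auc(references, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for r, s in zip(references, scores) if r == 1]
    neg = [s for r, s in zip(references, scores) if r == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: labels, hard predictions, and predicted scores.
toy_refs = [1, 1, 1, 0, 0, 0]
toy_preds = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]

clf_metrics = binary_classification_metrics(toy_refs, toy_preds)
toy_auc = auc(toy_refs, toy_scores)
print(clf_metrics)
print(toy_auc)
```

On this toy data there are 2 true positives, 1 false positive, 1 false negative, and 2 true negatives, so precision, recall, and F1 all come out to 2/3, and 8 of the 9 positive/negative pairs are ranked correctly.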

Evaluation metrics for regression:
- MAE (mean absolute error)
- MSE (mean squared error)
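
As a rough illustration of the two regression metrics, the sketch below computes them directly from their definitions: MAE averages the absolute errors, and MSE averages the squared errors. The function names and the sample values are assumptions made for this example only.

```python
def mean_absolute_error(references, predictions):
    """Average of |reference - prediction| over all samples."""
    return sum(abs(r - p) for r, p in zip(references, predictions)) / len(references)


def mean_squared_error(references, predictions):
    """Average of (reference - prediction)^2 over all samples."""
    return sum((r - p) ** 2 for r, p in zip(references, predictions)) / len(references)


# Hypothetical targets and model outputs; the errors are 1, 0, -2, -1.
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 8.0]

mae = mean_absolute_error(y_true, y_pred)  # (1 + 0 + 2 + 1) / 4
mse = mean_squared_error(y_true, y_pred)   # (1 + 0 + 4 + 1) / 4
print(mae, mse)
```

Because MSE squares each error, the single error of 2 contributes 4 of the 6 units in the sum, which is why MSE is more sensitive to outliers than MAE.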

Hugging Face provides evaluation metric tools that hide the details of the computation.
## 1. Using the Evaluation Metrics Tools
### 1. Listing the available metrics
Use the list_metrics() function to list all available evaluation metrics.

In [1]:
# !pip install datasets

Collecting datasets
  Downloading datasets-2.14.2-py3-none-any.whl (518 kB)
Collecting dill<0.3.8,>=0.3.0 (from datasets)
  Downloading dill-0.3.7-py3-none-any.whl (115 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.3.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.15-py310-none-any.whl (134 kB)
Collecting huggingface-hub<1.0.0,>=0.14.0 (from datasets)
  Downloading huggingface_hub-0.16.4-py3-none-a

In [2]:
from datasets import list_metrics
metrics_list = list_metrics()
len(metrics_list), metrics_list[:5]


(121, ['accuracy', 'bertscore', 'bleu', 'bleurt', 'brier_score'])

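### 2. Loading a metric
Use the load_metric() function. A metric is often paired with a particular dataset; the example below uses the mrpc subset of the glue dataset:  
Note that not every dataset has an associated metric. In practice, choose a suitable metric for the task at hand.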

In [6]:
from datasets import load_metric
metric = load_metric('glue','mrpc')
print(metric)

Metric(name: "glue", features: {'predictions': Value(dtype='int64', id=None), 'references': Value(dtype='int64', id=None)}, usage: """
Compute GLUE evaluation metric associated to each GLUE dataset.
Args:
    predictions: list of predictions to score.
        Each translation should be tokenized into a list of tokens.
    references: list of lists of references for each translation.
        Each reference should be tokenized into a list of tokens.
Returns: depending on the GLUE subset, one or several of:
    "accuracy": Accuracy
    "f1": F1 score
    "pearson": Pearson Correlation
    "spearmanr": Spearman Correlation
    "matthews_correlation": Matthew Correlation
Examples:

    >>> glue_metric = datasets.load_metric('glue', 'sst2')  # 'sst2' or any of ["mnli", "mnli_mismatched", "mnli_matched", "qnli", "rte", "wnli", "hans"]
    >>> references = [0, 1]
    >>> predictions = [0, 1]
    >>> results = glue_metric.compute(predictions=predictions, references=references)
    >>> print(res

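### 3. Getting a metric's usage instructions
A metric's inputs_description attribute is a piece of text that describes how to use the metric. Different metrics often expect different inputs.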

In [8]:
print(metric.inputs_description)


Compute GLUE evaluation metric associated to each GLUE dataset.
Args:
    predictions: list of predictions to score.
        Each translation should be tokenized into a list of tokens.
    references: list of lists of references for each translation.
        Each reference should be tokenized into a list of tokens.
Returns: depending on the GLUE subset, one or several of:
    "accuracy": Accuracy
    "f1": F1 score
    "pearson": Pearson Correlation
    "spearmanr": Spearman Correlation
    "matthews_correlation": Matthew Correlation
Examples:

    >>> glue_metric = datasets.load_metric('glue', 'sst2')  # 'sst2' or any of ["mnli", "mnli_mismatched", "mnli_matched", "qnli", "rte", "wnli", "hans"]
    >>> references = [0, 1]
    >>> predictions = [0, 1]
    >>> results = glue_metric.compute(predictions=predictions, references=references)
    >>> print(results)
    {'accuracy': 1.0}

    >>> glue_metric = datasets.load_metric('glue', 'mrpc')  # 'mrpc' or 'qqp'
    >>> references = [0, 1]

### 4. Computing a metric


In [10]:
glue_metric = load_metric('glue', 'mrpc')  # 'mrpc' or 'qqp'
references = [0, 1, 0, 1]  # ground-truth labels
predictions = [0, 1, 1, 0]  # model predictions
results = glue_metric.compute(predictions=predictions, references=references)
print(results)

{'accuracy': 0.5, 'f1': 0.5}
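
The result above can be checked by hand. The following sketch does not call the Hugging Face metric; it simply tallies the confusion counts for the same four labels and applies the standard accuracy and F1 formulas, assuming label 1 is the positive class. The variable names are chosen for this example.

```python
from collections import Counter

references = [0, 1, 0, 1]   # ground-truth labels
predictions = [0, 1, 1, 0]  # model predictions

# Count each (reference, prediction) combination.
counts = Counter(zip(references, predictions))
tp = counts[(1, 1)]  # predicted 1, actually 1
fp = counts[(0, 1)]  # predicted 1, actually 0
fn = counts[(1, 0)]  # predicted 0, actually 1
tn = counts[(0, 0)]  # predicted 0, actually 0

accuracy = (tp + tn) / len(references)  # 2 correct out of 4
precision = tp / (tp + fp)              # 1 / 2
recall = tp / (tp + fn)                 # 1 / 2
f1 = 2 * precision * recall / (precision + recall)

manual = {"accuracy": accuracy, "f1": f1}
print(manual)  # {'accuracy': 0.5, 'f1': 0.5}
```

Each of the four confusion counts is 1 here, so accuracy, precision, recall, and F1 are all 0.5, matching the output of compute().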
