# Validation of MOZ

## HOW TO USE

1. set 2 variables via environment variable or standart input.
    1. `PRED_SS_ID` evaluation target google sheet id
    2. `GOOGLE_CLOUD_PROJECT` google cloud project id
    3. `TRUTH_DATA_ID` bigquery table id which holds ground truth
2. execute all the cells.

In [30]:
import os
import sys
module_path = os.path.abspath(os.path.join('../../'))
if module_path not in sys.path:
    sys.path.append(module_path)

PRED_SS_ID = os.environ.get('PRED_SS_ID') or input("Input Prediction Google Sheet ID")
PROJECT_ID = os.environ.get('GOOGLE_CLOUD_PROJECT') or input('Input Google Cloud Project ID')
TRUTH_DATA_ID = os.environ.get('TRUTH_BQ_TABLE_ID') or input("Input Truth Data BigQuery Table ID")

### 予測データ

In [31]:
from b_moz.libs.io.google import GoogleSpreadSheet, GoogleDriveAuth

from gspread_dataframe import get_as_dataframe # type: ignore

def load_pred_data(ss_id):
    GoogleDriveAuth.get_credentials(
        os.environ.get("CLIENT_ID"), os.environ.get("CLIENT_SECRET")  # type: ignore
    )
    client = GoogleSpreadSheet.get_client()

    # prediction data
    pred_gsheet = client.open_by_key(ss_id)
    return pred_gsheet


### 正解データ

In [32]:
import pandas_gbq as pbq
import polars as pl

def load_truth_data():
    sql = (
        f"""
        select * from `{TRUTH_DATA_ID}`
        """
    )

    ground_truth_df = pl.from_pandas(
        pbq.read_gbq(sql, project_id=PROJECT_ID, dialect="standard", progress_bar_type=None) # type: ignore
    )
    return ground_truth_df


## evaluation

マルチラベル分類の評価手法に則り、下記指標を用いて評価

- $T_i$ $i$ 番目のモデルの正解の色名の集合
- $Y_i$ $i$ 番目のモデルの予測した色名の集合


1. EM (Exact Match)
    - $
        \frac{1}{N} \sum_{i=1}^{N} I[Y_i = T_i]
      $
    - モデルごとに全ての色名を正確に予測した割合
2. accuracy 正解率
    - $
        \frac{1}{N} \sum_{i=1}^{N} \frac{|Y_i \cap T_i|}{|Y_i \cup T_i|}
      $
    - 予測した色名のうち、正しく予測できていたものの割合
3. recall 再現率
    - $
        \frac{1}{N} \sum_{i=1}^{N} \frac{|Y_i \cap T_i|}{|T_i|}
      $
    - 正解に含まれる色名のうち、実際に予測できたものの割合
4. precision 適合率
    - $
        \frac{1}{N} \sum_{i=1}^{N} \frac{|Y_i \cap T_i|}{|Y_i|}
      $
    - 予測した色名のうち、正解に含まれていたものの割合
    - 正解の色名に予測した色名が全て含まれている場合にも、余分な色名を予測しているとスコアが小さくなる
5. F1 score F値
     - $
        \frac{ 2 \times precision \times recall}{precision + recall}
       $
    - precision と recall の調和平均

### c.f.

- <https://zero2one.jp/ai-word/accuracy-precision-recall-f-measure/>
- <https://qiita.com/jyori112/items/110596b4f04e4e1a3c9b#%E5%A4%9A%E3%83%A9%E3%83%99%E3%83%AB%E5%88%86%E9%A1%9E%E3%82%BF%E3%82%B9%E3%82%AFmulti-label-classification>

In [36]:
def exact_match(truth: set, pred: set) -> bool:
    return truth == pred

def accuracy(truth: set, pred: set) -> float:
    if len(truth | pred) == 0:
        return 0
    return len(truth & pred) / len(truth | pred)

def recall(truth: set, pred: set) -> float:
    if len(truth) == 0:
        return 0
    return len(truth & pred) / len(truth)

def precision(truth: set, pred: set) -> float:
    if len(pred) == 0:
        return 0
    return len(truth & pred) / len(pred)

def calc_metrics(validation_df: pl.DataFrame):
    # calclate metrics per model
    eval_indicators = []
    for category, model_name, true_labels, pred_labels in validation_df.select("category", "model", "truth", "pred").iter_rows():
        truth: set = set(true_labels or [])
        pred: set = set(pred_labels or [])

        eval_indicators.append(dict(
            category=category,
            model=model_name,
            exact_match = exact_match(truth, pred),
            accuracy = accuracy(truth, pred),
            recall = recall(truth, pred),
            precision = precision(truth, pred),
        ))
    metric_df = pl.DataFrame(eval_indicators)

    # calculate metrics per category
    metric_per_category = metric_df.group_by("category").agg(
        pl.col("category").count().alias("count"),
        pl.col("exact_match").mean().alias("exact_match"),
        pl.col("accuracy").mean().alias("accuracy"),
        pl.col("recall").mean().alias("recall"),
        pl.col("precision").mean().alias("precision"),
    ).with_columns(
        (2 / (1 / pl.col("recall") +  1 / pl.col("precision"))).alias("f1"),
    ).sort("category")

    # calculate metrics
    metric_df = (
        metric_df
        .select("exact_match", "accuracy", "recall", "precision")
    )
    metric_all = (
        metric_df.select("exact_match", "accuracy", "recall", "precision").sum() / len(eval_indicators)
    ).with_columns(
        (2 / (1 / pl.col("recall") +  1 / pl.col("precision"))).alias("f1"),
    )

    return metric_all, metric_per_category


def evaluate(pred_sheet_id, target_category='color', display_result=True):
    # load_data
    truth_df = load_truth_data()
    pred_gsheet = load_pred_data(pred_sheet_id)
    pred_df: pl.DataFrame = pl.from_pandas(
        get_as_dataframe(pred_gsheet.worksheet(f"model_{target_category}"))
    ) # type: ignore

    # create validation data
    agg_pred_df = (
        pred_df
        .group_by('model')
        .agg(pl.col(target_category).alias("truth"))
    )
    agg_truth_df = (
        truth_df
        .group_by('category', 'model')
        .agg(pl.col(target_category).alias("pred"))
    )
    validation_df = (
        agg_pred_df
        .join(agg_truth_df, on='model', how='left')
        .select("category", "model", "truth", "pred")
    )
    # calculate metrics
    metric_all, metric_per_category = calc_metrics(validation_df)
    if display_result:
        # display(validation_df.head(5))
        display(metric_all)
        display(metric_per_category)
    return metric_all, metric_per_category


In [37]:
metric_all, metric_per_category = evaluate(PRED_SS_ID, 'color')

exact_match,accuracy,recall,precision,f1
f64,f64,f64,f64,f64
0.289855,0.465346,0.565942,0.48712,0.523581


category,count,exact_match,accuracy,recall,precision,f1
str,u32,f64,f64,f64,f64,f64
,0,0.0,0.0,0.0,0.0,0.0
"""IoT端末""",2,0.5,0.833333,0.833333,1.0,0.909091
"""ガラケー""",10,0.6,0.77,0.88,0.813333,0.845354
"""スマートフォン""",43,0.27907,0.505633,0.641473,0.522742,0.576053
"""タブレット""",4,0.25,0.25,0.25,0.25,0.25
