Description
Problem Description
There are many ways, and many reasons, to perform data augmentation with synthetic data when building ML models. While we have some ML Efficacy metrics in beta, we'd like to create a suite of metrics that covers this use case more effectively. The `BinaryClassifierPrecisionEfficacy` metric will specifically measure whether synthetic data improves the precision of a binary classifier.
Expected behavior
This metric should be defined in the `data_augmentation` sub-module inside `single_table`.
```python
from sdmetrics.single_table.data_augmentation import BinaryClassifierPrecisionEfficacy

BinaryClassifierPrecisionEfficacy.compute_breakdown(
    real_training_data=real_df,
    synthetic_data=synthetic_df,
    real_validation_data=real_holdout_df,
    metadata=single_table_metadata_dict,
    prediction_column_name='covid_status',
    minority_class_label=1,
    classifier='XGBoost',
    fixed_recall_value=0.9
)
```
`compute_breakdown` API
- Args
  - `real_training_data` (pd.DataFrame) - A dataframe containing the real data that was used to train the synthesizer. The metric will use this data to train a binary classification model.
  - `synthetic_data` (pd.DataFrame) - A dataframe containing the synthetic data sampled from the synthesizer. The metric will also use this data to train the binary classification model.
  - `real_validation_data` (pd.DataFrame) - A dataframe containing a holdout set of real data. This data should not have been used to train the synthesizer. It will be used to evaluate the binary classification model.
  - `metadata` (dict) - The metadata dictionary describing the table of data.
  - `prediction_column_name` (str) - The name of the column to predict. It should be a categorical or boolean column.
  - `minority_class_label` (str/int/float) - The value in the prediction column that should be considered a positive result, from the perspective of binary classification. All other values in the column will be considered negative results.
  - `classifier` (str, optional) - The ML algorithm to use when building the binary classifier. The only supported option is 'XGBoost', which is also the default.
    - Note: As an MVP, we will only support XGBoost. Future feature requests may add support for additional algorithms.
  - `fixed_recall_value` (float, optional) - A float in the range (0, 1.0) describing the recall value to fix when building the binary classification model. Defaults to 0.9.
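To make the `minority_class_label` behavior concrete, here is a minimal sketch of how a multi-class prediction column could be binarized; the helper name and the `covid_status` values are illustrative, not part of the proposed API:

```python
import pandas as pd


def binarize_prediction_column(df, prediction_column_name, minority_class_label):
    """Map the prediction column to 1 for the minority class, 0 for all others."""
    return (df[prediction_column_name] == minority_class_label).astype(int)


# Labels 0 and 2 are both treated as negatives when the minority class is 1.
df = pd.DataFrame({'covid_status': [1, 0, 2, 1]})
labels = binarize_prediction_column(df, 'covid_status', 1)
```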
- Returns
  - A dictionary breakdown of the score, with the following information:
    - The score for the metric. This is the improvement in precision (from baseline -> augmented data) in percentage points: `score = max(0, augmented_precision_score - baseline_precision_score)`
    - The parameters used to run the metric
    - For each of the augmented data and the real data baseline:
      - The recall score achieved during training. This should be at least the requested value supplied as a parameter, but may not be exactly equal.
      - The actual recall score achieved on the validation (holdout) set.
      - The precision score achieved on the validation set.
      - The prediction counts achieved on the validation set (true positive, false positive, true negative, and false negative).
  - Expected dictionary output:

```python
{
    'score': 0.86,
    'augmented_data': {
        'recall_score_training': 0.950,
        'recall_score_validation': 0.912,
        'precision_score_validation': 0.84,
        'prediction_counts_validation': {
            'true_positive': 21,
            'false_positive': 4,
            'true_negative': 73,
            'false_negative': 3
        },
    },
    'real_data_baseline': {
        # keys are the same as the 'augmented_data' dictionary
    },
    'parameters': {
        'prediction_column_name': 'covid_status',
        'minority_class_label': 1,
        'classifier': 'XGBoost',
        'fixed_recall_value': 0.9
    }
}
```
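The scoring rule above clamps at zero so the metric never reports a negative "improvement"; a minimal sketch (the function name is illustrative):

```python
def compute_score(augmented_precision_score, baseline_precision_score):
    """Precision improvement from baseline to augmented data, clamped at 0."""
    return max(0.0, augmented_precision_score - baseline_precision_score)
```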
Algorithm
1. Concatenate the `real_training_data` and `synthetic_data` together.
2. Train a binary classification model on the data, using the selected classifier algorithm (default: XGBoost).
   a) Pre-process the data to turn discrete columns into continuous columns (note that we cannot use RDT, and should use scikit-learn methods instead).
   b) Pre-process the prediction column (`prediction_column_name`) to convert it into a boolean column with the correct positive/negative values. If multi-class, consider only the `minority_class_label` as the positive value; all other values will be considered negative.
3. Based on the parameters, fix the recall for the minority class.
   a) This requires finding the threshold that achieves a recall as close as possible to the fixed value. The classifier returns a continuous score for each data point in the training data, and we must find the threshold that achieves the recall closest to the fixed rate. Note that we should always choose a threshold that is as close as possible to the requested recall value but never less than it; that is, ensure that the training set recall is >= the requested recall value.
   b) Save this threshold to use on the validation data. This threshold is now a learned parameter alongside the classifier.
4. Apply the classifier to the `real_validation_data` and compute the statistics that we want to return.
5. Calculate the baseline: repeat steps 1-4, but this time use only the `real_training_data` (do not concatenate `synthetic_data`).
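Step 3 can be sketched as a threshold search over the positive-class scores. This is one possible approach, assuming distinct classifier scores; the real implementation may differ (e.g. using `sklearn.metrics.precision_recall_curve`):

```python
import numpy as np


def find_recall_threshold(y_true, y_scores, fixed_recall_value):
    """Find the highest threshold whose training recall is >= fixed_recall_value.

    Candidate thresholds are the positive-class scores sorted from highest to
    lowest: lowering the threshold only increases recall, so we scan down
    until the target is first reached, keeping recall >= the requested value.
    """
    positive_scores = np.sort(y_scores[y_true == 1])[::-1]
    n_positives = len(positive_scores)
    for k, threshold in enumerate(positive_scores, start=1):
        recall = k / n_positives  # positives with score >= threshold
        if recall >= fixed_recall_value:
            return threshold
    return positive_scores[-1]


# Toy example: 4 positives; the threshold 0.7 yields training recall 0.75.
y_true = np.array([1, 1, 1, 1, 0, 0])
y_scores = np.array([0.9, 0.8, 0.7, 0.2, 0.6, 0.1])
threshold = find_recall_threshold(y_true, y_scores, fixed_recall_value=0.75)
```

The returned threshold is then stored and applied unchanged to the validation scores (`validation_scores >= threshold`), as described in step 3b.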
compute
The `compute` method should take the same arguments as the `compute_breakdown` method, and should return just the overall `score` value calculated by `compute_breakdown`.
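The relationship between the two methods can be sketched as a thin wrapper; the stubbed `compute_breakdown` body below is a placeholder, not the real metric:

```python
class BinaryClassifierPrecisionEfficacy:
    """Sketch of how compute can delegate to compute_breakdown."""

    @classmethod
    def compute_breakdown(cls, **kwargs):
        # Stub standing in for the full breakdown described above.
        return {'score': 0.86, 'parameters': kwargs}

    @classmethod
    def compute(cls, **kwargs):
        # Same arguments as compute_breakdown; returns only the overall score.
        return cls.compute_breakdown(**kwargs)['score']
```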
Additional context
See this doc
There will be significant overlap of required pre-processing/helper functions between data augmentation metrics. When possible, general functionality should be abstracted into utility functions that can be reused across many metrics.