# Applying data minimization to a trained regression ML model

In this tutorial we will show how to perform data minimization for regression ML models using the minimization module.

We will apply data minimization to two different trained regression models: a decision tree regressor and a linear regression model.

## Load data
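The QI list (the quasi-identifiers) determines which features will be minimized. It is passed to the minimizer later through the features_to_minimize parameter.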

In [None]:
!pip install ai-privacy-toolkit

Collecting ai-privacy-toolkit
  Downloading ai_privacy_toolkit-0.2.1-py3-none-any.whl.metadata (3.3 kB)
Downloading ai_privacy_toolkit-0.2.1-py3-none-any.whl (57 kB)
Installing collected packages: ai-privacy-toolkit
Successfully installed ai-privacy-toolkit-0.2.1


In [None]:
!pip install adversarial-robustness-toolbox

Collecting adversarial-robustness-toolbox
  Downloading adversarial_robustness_toolbox-1.19.1-py3-none-any.whl.metadata (11 kB)
Downloading adversarial_robustness_toolbox-1.19.1-py3-none-any.whl (1.7 MB)
Installing collected packages: adversarial-robustness-toolbox
Successfully installed adversarial-robustness-toolbox-1.19.1


In [None]:
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

dataset = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(dataset.data, dataset.target, test_size=0.5, random_state=14)

features = ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
QI = ['age', 'bmi', 's2', 's5', 's6']
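
The decision tree run later in this tutorial reports features by column index (for example "feature to remove: 2"), while QI is given by name. As a small convenience sketch, not part of the toolkit, the names can be mapped to column positions as follows. The variables qi_indices and non_qi are introduced here only for illustration.

```python
features = ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
QI = ['age', 'bmi', 's2', 's5', 's6']

# Column position of each quasi-identifier in the feature matrix
qi_indices = [features.index(name) for name in QI]
print(qi_indices)  # [0, 2, 5, 8, 9]

# Features that are not listed as quasi-identifiers
non_qi = [name for name in features if name not in QI]
print(non_qi)  # ['sex', 'bp', 's1', 's3', 's4']
```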

## Train DecisionTreeRegressor model

In [None]:
from apt.minimization import GeneralizeToRepresentative
from sklearn.tree import DecisionTreeRegressor

model1 = DecisionTreeRegressor(random_state=10, min_samples_split=2)
model1.fit(X_train, y_train)
print('Base model accuracy (R2 score): ', model1.score(X_test, y_test))

Base model accuracy (R2 score):  0.15014421352446072
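
For scikit-learn regressors, score returns the coefficient of determination (R2), so "accuracy" in this tutorial means R2 rather than a fraction of correct predictions. A value of 1 is a perfect fit, 0 is no better than always predicting the mean, and negative values are possible. The sketch below computes R2 by hand on made-up numbers; r2_score_manual is a hypothetical helper, not a library function.

```python
def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 9.0]   # made-up targets
y_pred = [2.5, 5.5, 7.0, 9.5]   # made-up predictions

r2_good = r2_score_manual(y_true, y_pred)            # close to 1
r2_mean = r2_score_manual(y_true, [6.0] * 4)         # always predicting the mean
print(round(r2_good, 4), r2_mean)
```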


## Run minimization
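We now run minimization on the decision tree. Note that the minimizer in this section is created without features_to_minimize and without is_regression=True, so every feature is a candidate for generalization. The missing is_regression flag also appears to be why the stratified-split warning is printed during fitting.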

In [None]:
# Inspect the minimizer's attributes. This requires minimizer1, which is created in a later cell.
print(dir(minimizer1))


['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__sklearn_clone__', '__sklearn_tags__', '__str__', '__subclasshook__', '__weakref__', '_are_inseparable', '_attach_cells_representatives', '_build_request_for_signature', '_calc_ncp_categorical', '_calc_ncp_for_generalization', '_calc_ncp_numeric', '_calculate_accuracy', '_calculate_categorical_features_values', '_calculate_categories', '_calculate_cell_generalizations', '_calculate_cell_label', '_calculate_cells', '_calculate_cells_recursive', '_calculate_generalizations', '_calculate_generalizations_for_cell', '_calculate_level_cell_label', '_calculate_level_cells', '_calculate_ncp_for_feature_from_cells', '_calculate_ranges', '_calculate_untouched', '_categ

In [None]:
from sklearn.model_selection import train_test_split

# Reload dataset if missing
from sklearn.datasets import load_diabetes
dataset = load_diabetes()
x, y = dataset.data, dataset.target

# Ensure x_train1 and y_train1 exist
x_train1, x_test1, y_train1, y_test1 = train_test_split(x, y, test_size=0.5, random_state=42)

print("✅ x_train1 and y_train1 redefined. Shapes:", x_train1.shape, y_train1.shape)


✅ x_train1 and y_train1 redefined. Shapes: (221, 10) (221,)
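
Note that this split uses random_state=42, whereas model1 was trained on a split made with random_state=14. Two independent 50% splits of the same 442 rows will almost always overlap, so x_test1 may contain rows that model1 saw during training, and scores on it are not directly comparable to the base score reported earlier. The sketch below illustrates the overlap using only the standard library. It does not reproduce scikit-learn's exact shuffling, and half_split is a hypothetical helper.

```python
import random

def half_split(n_rows, seed):
    """Shuffle row indices with a seed and split them 50/50 (illustrative only)."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    half = n_rows // 2
    return set(indices[:half]), set(indices[half:])

train_a, test_a = half_split(442, seed=14)  # stands in for the split used to train the model
train_b, test_b = half_split(442, seed=42)  # stands in for the later split

# Rows that are "test" rows in the second split but were training rows in the first
leaked = train_a & test_b
print(f"{len(leaked)} of {len(test_b)} test rows were also in the first training set")
```

Reusing X_test and y_test from the original split, as the linear regression section does, avoids this overlap.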


In [None]:
from apt.minimization import GeneralizeToRepresentative
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_diabetes

# Load dataset
dataset = load_diabetes()
x, y = dataset.data, dataset.target

# Split into train & test
x_train1, x_test1, y_train1, y_test1 = train_test_split(x, y, test_size=0.5, random_state=42)

# Initialize the minimizer; target_accuracy is the relative accuracy it tries to preserve on generalized data
minimizer1 = GeneralizeToRepresentative(model1, target_accuracy=0.95)

# Fit minimizer
minimizer1.fit(x_train1, y_train1)

# Apply transformation
transformed1 = minimizer1.transform(x_test1)

# Evaluate accuracy
print("✅ Accuracy on minimized data:", model1.score(transformed1, y_test1))

# Check generalizations
print("✅ Generalizations:", minimizer1.generalizations)


Could not stratify split due to uncommon class value, doing unstratified split instead
Initial accuracy of model on generalized data, relative to original model predictions (base generalization derived from tree, before improvements): 0.216960
Improving accuracy
feature to remove: 2
Removed feature: 2, new relative accuracy: 0.319438
feature to remove: 8
Removed feature: 8, new relative accuracy: 0.247505
feature to remove: 7
Removed feature: 7, new relative accuracy: 0.238493
feature to remove: 9
Removed feature: 9, new relative accuracy: 0.401236
feature to remove: 0
Removed feature: 0, new relative accuracy: 0.517824
feature to remove: 4
Removed feature: 4, new relative accuracy: 0.560658
feature to remove: 5
Removed feature: 5, new relative accuracy: 0.602825
feature to remove: 6
Removed feature: 6, new relative accuracy: 0.618031
feature to remove: 3
Removed feature: 3, new relative accuracy: 0.691260
feature to remove: 1
Removed feature: 1, new relative accuracy: 0.706485
feature to remove: none
✅ Accuracy on minimized data: 0.492791679982421
✅ Generalizations: {'ranges': {}, 'categories': {}, 'untouched': ['1', '8', '5', '3', '9', '7', '0', '6', '2', '4'], 'category_representatives': {}, 'range_representatives': {}}
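
In this run the target of 0.95 was never reached: the relative accuracy only got to 0.706485 after every feature had been removed from the generalization. That is why 'ranges' is empty and all ten features are listed under 'untouched', which suggests that no feature is actually generalized here. A small helper for reading such a result might look like the sketch below; summarize is a hypothetical name, and the dictionary is copied from the output above.

```python
# Output copied from the run above
generalizations = {
    'ranges': {},
    'categories': {},
    'untouched': ['1', '8', '5', '3', '9', '7', '0', '6', '2', '4'],
    'category_representatives': {},
    'range_representatives': {},
}

def summarize(gen):
    """Return (generalized feature names, untouched feature names), both sorted."""
    generalized = sorted(set(gen['ranges']) | set(gen['categories']))
    untouched = sorted(gen['untouched'])
    return generalized, untouched

generalized, untouched = summarize(generalizations)
print('generalized:', generalized)  # generalized: []
print('untouched count:', len(untouched))  # untouched count: 10
```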


## Train linear regression model

In [None]:
from sklearn.linear_model import LinearRegression
from apt.minimization import GeneralizeToRepresentative

model2 = LinearRegression()
model2.fit(X_train, y_train)
print('Base model accuracy (R2 score): ', model2.score(X_test, y_test))

Base model accuracy (R2 score):  0.5080563960651394


## Run minimization
We will run minimization on only a subset of the features (the QI list defined earlier).

In [None]:
# Note that the is_regression parameter is set to True

minimizer2 = GeneralizeToRepresentative(model2, target_accuracy=0.7, is_regression=True,
                                    features_to_minimize=QI)

# Fitting the minimizer can be done on either training or test data. Using test data is better, as the
# resulting accuracy on test data will be closer to the desired target accuracy (fitting on training
# data could result in a larger gap).
# Don't forget to leave a hold-out set for final validation!
X_generalizer_train2, x_test2, y_generalizer_train2, y_test2 = train_test_split(X_test, y_test,
                                                                test_size = 0.4, random_state = 38)

x_train_predictions2 = model2.predict(X_generalizer_train2)
minimizer2.fit(X_generalizer_train2, x_train_predictions2, features_names=features)
transformed2 = minimizer2.transform(x_test2, features_names=features)
print('Accuracy on minimized data: ', model2.score(transformed2, y_test2))
print('generalizations: ',minimizer2.generalizations)



Initial accuracy of model on generalized data, relative to original model predictions (base generalization derived from tree, before improvements): 0.305679
Improving accuracy
feature to remove: s5
Removed feature: s5, new relative accuracy: 0.461508
feature to remove: s6
Removed feature: s6, new relative accuracy: 0.455118
feature to remove: s2
Removed feature: s2, new relative accuracy: 0.955282
Accuracy on minimized data:  0.4562452536356322
generalizations:  {'ranges': {'age': [-0.06181889958679676, -0.036391131579875946, -0.027309785597026348, -0.0036982858437113464, 0.0017505218856967986, 0.0035667913034558296, 0.009015598334372044, 0.009015598800033331, 0.02717829099856317, 0.028994559310376644, 0.028994561173021793, 0.039892174769192934, 0.04534098319709301], 'bmi': [-0.0660245232284069, -0.06171327643096447, -0.048779530450701714, -0.036923596635460854, -0.022912041284143925, -0.015906263142824173, -0.009978296235203743, 0.007266696775332093, 0.022356065921485424, 0.028822937980294228, 0.04499012045562267, 0.04876246117055416, 0.053073709830641747, 0.10103634744882584]}, 'categories': {}, 'untouched': ['s6', 's1', 's5', 's3', 's2', 'bp', 'sex', 's4'], 'category_representatives': {}, 'range_representatives': {'age': [-0.06181889958679676, -0.09269547780327612, -0.045472477940023646, -0.027309785684926546, -0.020044708782887707, -0.0018820165277906047, 0.00175052192322881
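
One way to read this output is that the sorted numbers under 'ranges' split a feature's values into intervals, and each value is then replaced by a representative of its interval. The sketch below illustrates that idea on made-up numbers. It is not the toolkit's implementation, and generalize_value, cut_points and representatives are hypothetical names.

```python
from bisect import bisect_right

def generalize_value(value, cut_points, representatives):
    """Replace value by the representative of the interval it falls into.

    cut_points must be sorted; there is one representative per interval,
    that is, len(cut_points) + 1 of them.
    """
    if len(representatives) != len(cut_points) + 1:
        raise ValueError("need one representative per interval")
    return representatives[bisect_right(cut_points, value)]

cut_points = [-0.03, 0.0, 0.03]                # made-up interval boundaries
representatives = [-0.05, -0.01, 0.02, 0.06]   # made-up, one per interval

ages = [-0.07, -0.02, 0.01, 0.05]
generalized_ages = [generalize_value(a, cut_points, representatives) for a in ages]
print(generalized_ages)  # [-0.05, -0.01, 0.02, 0.06]
```

In this sketch a value that sits exactly on a boundary is assigned to the interval on its right; the toolkit may handle boundaries differently.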