# Applying data minimization with categorical data and only a subset of the features to a trained ML model

In this tutorial we will show how to perform data minimization for ML models using the minimization module of the ai-privacy-toolkit (`apt.minimization`).

This will be demonstrated using the German Credit dataset (the original dataset can be found here: https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data).

## Load data
The `QI` (quasi-identifier) parameter determines which features will be minimized.

In [None]:
!pip install ai-privacy-toolkit

Collecting ai-privacy-toolkit
  Downloading ai_privacy_toolkit-0.2.1-py3-none-any.whl.metadata (3.3 kB)
Downloading ai_privacy_toolkit-0.2.1-py3-none-any.whl (57 kB)
Installing collected packages: ai-privacy-toolkit
Successfully installed ai-privacy-toolkit-0.2.1


In [None]:
!pip install adversarial-robustness-toolbox

Collecting adversarial-robustness-toolbox
  Downloading adversarial_robustness_toolbox-1.19.1-py3-none-any.whl.metadata (11 kB)
Downloading adversarial_robustness_toolbox-1.19.1-py3-none-any.whl (1.7 MB)
Installing collected packages: adversarial-robustness-toolbox
Successfully installed adversarial-robustness-toolbox-1.19.1


In [None]:
import pandas as pd

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data"
columns = [
    "Existing_checking_account", "Duration_in_month", "Credit_history", "Purpose", "Credit_amount",
    "Savings_account", "Present_employment_since", "Installment_rate", "Personal_status_sex", "debtors",
    "Present_residence", "Property", "Age", "Other_installment_plans", "Housing",
    "Number_of_existing_credits", "Job", "N_people_being_liable_provide_maintenance", "Telephone",
    "Foreign_worker", "Target"
]

df = pd.read_csv(url, delimiter=" ", names=columns, header=None)

# Separate features and target
x_train = df.drop(columns=["Target"])
y_train = (df["Target"] == 1).astype(int)  # Convert to binary format

print("✅ Dataset loaded successfully.")


✅ Dataset loaded successfully.


In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Define categorical and numerical feature lists
categorical_features = [
    "Existing_checking_account", "Credit_history", "Purpose", "Savings_account",
    "Present_employment_since", "Personal_status_sex", "debtors", "Property",
    "Other_installment_plans", "Housing", "Job"
]

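# NOTE: "Telephone" and "Foreign_worker" hold string codes (e.g. "A192") in the raw file,
# so listing them as numeric leaves the encoded array with dtype object.
numeric_features = [
    "Duration_in_month", "Credit_amount", "Installment_rate",
    "Present_residence", "Age", "Number_of_existing_credits",
    "N_people_being_liable_provide_maintenance", "Telephone", "Foreign_worker"
]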

print("✅ Categorical and numeric features defined.")


✅ Categorical and numeric features defined.


In [None]:
# Define transformations
numeric_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="constant", fill_value=0))]
)

categorical_transformer = OneHotEncoder(handle_unknown="ignore", sparse_output=False)

# Apply ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)

print("✅ Preprocessing pipeline defined.")


✅ Preprocessing pipeline defined.


In [None]:
from sklearn.model_selection import train_test_split

# Split dataset into train & test sets (80% train, 20% test)
x_train, x_test, y_train, y_test = train_test_split(
    x_train, y_train, test_size=0.2, random_state=42, stratify=y_train
)

print("✅ Dataset split into training and testing sets.")
print("Training data shape:", x_train.shape)
print("Test data shape:", x_test.shape)


✅ Dataset split into training and testing sets.
Training data shape: (800, 20)
Test data shape: (200, 20)


In [None]:
import pandas as pd

# Apply transformations
encoded_train = preprocessor.fit_transform(x_train)
encoded_test = preprocessor.transform(x_test)

# Convert to DataFrame with correct feature names
encoded_feature_names = preprocessor.get_feature_names_out()
x_train_df = pd.DataFrame(encoded_train, columns=encoded_feature_names)
x_test_df = pd.DataFrame(encoded_test, columns=encoded_feature_names)

print("✅ Data successfully transformed!")
print("Shape of transformed training data:", x_train_df.shape)
print("Shape of transformed test data:", x_test_df.shape)


✅ Data successfully transformed!
Shape of transformed training data: (800, 59)
Shape of transformed test data: (200, 59)


In [None]:
print(x_train_df.dtypes.value_counts())  # Should all be 'float64' or 'int64'
print(x_train_df.head())  # Check for any string values


object    59
Name: count, dtype: int64
  num__Duration_in_month num__Credit_amount num__Installment_rate  \
0                     30               4530                     4   
1                     30               2503                     4   
2                     12               1567                     1   
3                     21               3976                     2   
4                      9               2301                     2   

  num__Present_residence num__Age num__Number_of_existing_credits  \
0                      4       26                               1   
1                      2       41                               2   
2                      1       22                               1   
3                      3       35                               1   
4                      4       22                               1   

  num__N_people_being_liable_provide_maintenance num__Telephone  \
0                                              1           A192 
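
The `object` dtype and the `A192` value under `num__Telephone` suggest that a column listed as numeric actually holds string codes. A minimal sketch for spotting such columns follows, using a small made-up frame; `toy` and `declared_numeric` are illustrative names and are not part of the notebook.

```python
import pandas as pd

# Hypothetical miniature of the raw data: "Telephone" holds string codes.
toy = pd.DataFrame({
    "Age": [26, 41, 22],
    "Credit_amount": [4530, 2503, 1567],
    "Telephone": ["A192", "A191", "A192"],
})
declared_numeric = ["Age", "Credit_amount", "Telephone"]

# A column is suspicious if numeric parsing creates more missing values
# than the column had to begin with.
non_numeric = [
    col for col in declared_numeric
    if pd.to_numeric(toy[col], errors="coerce").isna().sum() > toy[col].isna().sum()
]
print(non_numeric)
```

Columns flagged this way are candidates for the categorical list rather than the numeric one.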

In [None]:
# Re-create the encoder with dense output (settings unchanged from the earlier cell)
categorical_transformer = OneHotEncoder(handle_unknown="ignore", sparse_output=False)

# Reapply preprocessing
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)

# Re-transform dataset
encoded_train = preprocessor.fit_transform(x_train)
encoded_test = preprocessor.transform(x_test)

# Convert to DataFrame with correct feature names
encoded_feature_names = preprocessor.get_feature_names_out()
x_train_df = pd.DataFrame(encoded_train, columns=encoded_feature_names)
x_test_df = pd.DataFrame(encoded_test, columns=encoded_feature_names)

print("✅ Data transformation fixed!")
print(x_train_df.dtypes.value_counts())  # Should now be only float64/int64


✅ Data transformation fixed!
object    59
Name: count, dtype: int64


In [None]:
print("Columns in x_train_df:")
print(x_train_df.columns.tolist())  # Show the actual column names


Columns in x_train_df:
['num__Duration_in_month', 'num__Credit_amount', 'num__Installment_rate', 'num__Present_residence', 'num__Age', 'num__Number_of_existing_credits', 'num__N_people_being_liable_provide_maintenance', 'num__Telephone', 'num__Foreign_worker', 'cat__Existing_checking_account_A11', 'cat__Existing_checking_account_A12', 'cat__Existing_checking_account_A13', 'cat__Existing_checking_account_A14', 'cat__Credit_history_A30', 'cat__Credit_history_A31', 'cat__Credit_history_A32', 'cat__Credit_history_A33', 'cat__Credit_history_A34', 'cat__Purpose_A40', 'cat__Purpose_A41', 'cat__Purpose_A410', 'cat__Purpose_A42', 'cat__Purpose_A43', 'cat__Purpose_A44', 'cat__Purpose_A45', 'cat__Purpose_A46', 'cat__Purpose_A48', 'cat__Purpose_A49', 'cat__Savings_account_A61', 'cat__Savings_account_A62', 'cat__Savings_account_A63', 'cat__Savings_account_A64', 'cat__Savings_account_A65', 'cat__Present_employment_since_A71', 'cat__Present_employment_since_A72', 'cat__Present_employment_since_A73', 

In [None]:
# Convert only the numeric columns
for col in x_train_df.columns:
    if col.startswith("num__"):  # Select only prefixed numerical features
        x_train_df[col] = pd.to_numeric(x_train_df[col], errors="coerce")
        x_test_df[col] = pd.to_numeric(x_test_df[col], errors="coerce")

print("✅ Successfully converted numeric features!")
print(x_train_df.dtypes.value_counts())  # Should now be float64 or int64 only


✅ Successfully converted numeric features!
object     50
int64       7
float64     2
Name: count, dtype: int64
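
`pd.to_numeric(..., errors="coerce")` does not fail on values it cannot parse; it replaces them with `NaN`. That would explain why two of the `num__` columns end up as `float64` while the rest become `int64`. A small illustration on made-up values (the series `codes` is hypothetical):

```python
import pandas as pd

# Two string codes and one value that can be parsed as a number.
codes = pd.Series(["A192", "A191", "7"], dtype=object)
converted = pd.to_numeric(codes, errors="coerce")

print(converted.dtype)         # float dtype, because NaN cannot live in an int column
print(converted.isna().sum())  # number of values that could not be parsed
```

Counting the `NaN` values before and after the conversion is a cheap way to notice that a column was silently emptied.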


In [None]:
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(random_state=42)
model.fit(x_train_df, y_train)

print("✅ Model trained successfully!")


✅ Model trained successfully!


In [None]:
# Evaluate the model on the test data
accuracy = model.score(x_test_df, y_test)

print(f"✅ Model accuracy on test data: {accuracy:.4f}")


✅ Model accuracy on test data: 0.6450


In [None]:
from apt.minimization import GeneralizeToRepresentative
from sklearn.model_selection import train_test_split

# Define the features to be minimized (Quasi-Identifiers)
QI = [
    "Duration_in_month", "Credit_history", "Purpose", "debtors", "Property",
    "Other_installment_plans", "Housing", "Job"
]

# Ensure feature names are correctly prefixed after transformation
qi_features = [f"num__{qi}" if f"num__{qi}" in x_train_df.columns else f"cat__{qi}" for qi in QI]

print("✅ Features selected for minimization:", qi_features)


✅ Features selected for minimization: ['num__Duration_in_month', 'cat__Credit_history', 'cat__Purpose', 'cat__debtors', 'cat__Property', 'cat__Other_installment_plans', 'cat__Housing', 'cat__Job']


In [None]:
# Split test set into generalizer training and final test set
X_generalizer_train, x_final_test, y_generalizer_train, y_final_test = train_test_split(
    x_test_df, y_test, stratify=y_test, test_size=0.4, random_state=38
)

# Reset index for consistency
X_generalizer_train.reset_index(drop=True, inplace=True)
y_generalizer_train.reset_index(drop=True, inplace=True)
x_final_test.reset_index(drop=True, inplace=True)
y_final_test.reset_index(drop=True, inplace=True)

print("✅ Data split for minimization.")


✅ Data split for minimization.


In [None]:
# Check if all features exist in x_train_df
missing_features = [f for f in qi_features if f not in x_train_df.columns]

if missing_features:
    print("❌ The following features are missing:", missing_features)
else:
    print("✅ All features exist in x_train_df!")


❌ The following features are missing: ['cat__Credit_history', 'cat__Purpose', 'cat__debtors', 'cat__Property', 'cat__Other_installment_plans', 'cat__Housing', 'cat__Job']


In [None]:
# Filter only existing features
qi_features = [f for f in qi_features if f in x_train_df.columns]

print("✅ Updated features for minimization:", qi_features)


✅ Updated features for minimization: ['num__Duration_in_month']


In [None]:
# Initialize the minimizer
minimizer = GeneralizeToRepresentative(
    model, categorical_features=[col for col in x_train_df.columns if col.startswith("cat__")],
    features_to_minimize=qi_features
)

# Fit the minimizer using generalizer training data
x_train_predictions = model.predict(X_generalizer_train)
minimizer.fit(X_generalizer_train, x_train_predictions, features_names=x_train_df.columns.tolist())

print("✅ Feature minimization complete!")




Initial accuracy of model on generalized data, relative to original model predictions (base generalization derived from tree, before improvements): 1.000000
Improving generalizations
✅ Feature minimization complete!
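
The log line reports accuracy "relative to original model predictions", that is, how often the model's predictions on generalized data agree with its predictions on the original data. A minimal sketch of that metric on synthetic data follows. The median-based generalization is an assumption made for illustration and is not the toolkit's algorithm.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.integers(1, 73, size=(200, 2)).astype(float)  # two synthetic numeric features
y = (X[:, 0] > 24).astype(int)                        # label depends on feature 0 only

model = DecisionTreeClassifier(random_state=0).fit(X, y)
original_predictions = model.predict(X)

# Crude generalization (illustrative assumption): replace feature 1
# with a single representative value, its median.
X_generalized = X.copy()
X_generalized[:, 1] = np.median(X[:, 1])

# Agreement between predictions on generalized and original data.
relative_accuracy = accuracy_score(original_predictions, model.predict(X_generalized))
print(f"Relative accuracy: {relative_accuracy:.3f}")
```

A feature the model does not rely on can be generalized without changing any prediction, which is the situation a relative accuracy of 1.0 describes.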


In [None]:
# Collect all encoded variants for categorical features
updated_qi_features = []

for qi in QI:
    encoded_variants = [col for col in x_train_df.columns if qi in col]
    updated_qi_features.extend(encoded_variants)

print("✅ Updated QI Features for Minimization:", updated_qi_features)


✅ Updated QI Features for Minimization: ['num__Duration_in_month', 'cat__Credit_history_A30', 'cat__Credit_history_A31', 'cat__Credit_history_A32', 'cat__Credit_history_A33', 'cat__Credit_history_A34', 'cat__Purpose_A40', 'cat__Purpose_A41', 'cat__Purpose_A410', 'cat__Purpose_A42', 'cat__Purpose_A43', 'cat__Purpose_A44', 'cat__Purpose_A45', 'cat__Purpose_A46', 'cat__Purpose_A48', 'cat__Purpose_A49', 'cat__debtors_A101', 'cat__debtors_A102', 'cat__debtors_A103', 'cat__Property_A121', 'cat__Property_A122', 'cat__Property_A123', 'cat__Property_A124', 'cat__Other_installment_plans_A141', 'cat__Other_installment_plans_A142', 'cat__Other_installment_plans_A143', 'cat__Housing_A151', 'cat__Housing_A152', 'cat__Housing_A153', 'cat__Job_A171', 'cat__Job_A172', 'cat__Job_A173', 'cat__Job_A174']
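
The substring test `qi in col` works for this dataset, but it would also match any column whose name merely contains the feature name. A stricter sketch matches on the `num__` and `cat__` prefixes instead. The column list and the helper `columns_for_feature` below are hypothetical and only illustrate the idea.

```python
# Hypothetical encoded column names following the `cat__<feature>_<code>` pattern.
encoded_columns = [
    "num__Duration_in_month",
    "cat__Purpose_A40", "cat__Purpose_A41",
    "cat__Job_A171", "cat__Job_A172",
    "cat__Housing_A151",
]
raw_qi = ["Duration_in_month", "Purpose", "Job"]


def columns_for_feature(feature, columns):
    """Return the encoded columns that belong to one raw feature."""
    return [
        col for col in columns
        if col == f"num__{feature}" or col.startswith(f"cat__{feature}_")
    ]


qi_to_columns = {qi: columns_for_feature(qi, encoded_columns) for qi in raw_qi}
print(qi_to_columns)
```

Keeping the mapping as a dictionary also preserves which one-hot columns belong to the same original feature.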


In [None]:
from sklearn.model_selection import train_test_split

# Split BEFORE applying transformations
X_generalizer_train, x_final_test, y_generalizer_train, y_final_test = train_test_split(
    x_test, y_test, stratify=y_test, test_size=0.4, random_state=38
)

# Apply the same transformation as x_train_df
encoded_generalizer_train = preprocessor.transform(X_generalizer_train)
encoded_final_test = preprocessor.transform(x_final_test)

# Convert to DataFrame with proper column names
X_generalizer_train_df = pd.DataFrame(encoded_generalizer_train, columns=x_train_df.columns)
x_final_test_df = pd.DataFrame(encoded_final_test, columns=x_train_df.columns)

print("✅ Generalizer training data successfully re-transformed!")
print("Shape of X_generalizer_train_df:", X_generalizer_train_df.shape)
print("Shape of x_final_test_df:", x_final_test_df.shape)


✅ Generalizer training data successfully re-transformed!
Shape of X_generalizer_train_df: (120, 59)
Shape of x_final_test_df: (80, 59)


In [None]:
print("Checking data types in X_generalizer_train_df:")
print(X_generalizer_train_df.dtypes.value_counts())  # Should only show int64 and float64
print(X_generalizer_train_df.head())  # Look at the first few rows


Checking data types in X_generalizer_train_df:
object    59
Name: count, dtype: int64
  num__Duration_in_month num__Credit_amount num__Installment_rate  \
0                     24               1469                     4   
1                     12               1291                     4   
2                     21               2580                     4   
3                     24               3660                     2   
4                     11               1322                     4   

  num__Present_residence num__Age num__Number_of_existing_credits  \
0                      4       41                               1   
1                      2       35                               2   
2                      2       41                               1   
3                      4       28                               1   
4                      4       40                               2   

  num__N_people_being_liable_provide_maintenance num__Telephone  \
0                

In [None]:
# Convert all object columns to numeric values
for col in X_generalizer_train_df.columns:
    if X_generalizer_train_df[col].dtype == "object":
        X_generalizer_train_df[col] = pd.to_numeric(X_generalizer_train_df[col], errors="coerce")

print("✅ Successfully converted all categorical features!")
print(X_generalizer_train_df.dtypes.value_counts())  # Should now be only float64 or int64


✅ Successfully converted all categorical features!
float64    52
int64       7
Name: count, dtype: int64


In [None]:
x_train_predictions = model.predict(X_generalizer_train_df)

minimizer.fit(X_generalizer_train_df, x_train_predictions, features_names=x_train_df.columns.tolist())

print("✅ Feature minimization complete!")




Initial accuracy of model on generalized data, relative to original model predictions (base generalization derived from tree, before improvements): 0.750000
Improving accuracy




KeyError: 'cat__Credit_history_A33'

In [None]:
# Ensure X_generalizer_train_df has all columns that exist in x_train_df
missing_cols = set(x_train_df.columns) - set(X_generalizer_train_df.columns)

# Add missing columns with default value 0 (OneHotEncoding default)
for col in missing_cols:
    X_generalizer_train_df[col] = 0.0

# Ensure column order is the same as x_train_df
X_generalizer_train_df = X_generalizer_train_df[x_train_df.columns]

print("✅ Fixed missing columns! New shape:", X_generalizer_train_df.shape)


✅ Fixed missing columns! New shape: (120, 59)


In [None]:
# Step 1: Find missing columns
missing_cols = set(x_train_df.columns) - set(X_generalizer_train_df.columns)
extra_cols = set(X_generalizer_train_df.columns) - set(x_train_df.columns)

print("✅ Columns in x_train_df but missing in X_generalizer_train_df:", missing_cols)
print("⚠️ Extra columns in X_generalizer_train_df:", extra_cols)

# Step 2: Add missing columns to X_generalizer_train_df with default value 0.0
for col in missing_cols:
    X_generalizer_train_df[col] = 0.0

# Step 3: Remove any extra columns (if any)
X_generalizer_train_df = X_generalizer_train_df[x_train_df.columns]

print("✅ Fixed missing columns! New shape:", X_generalizer_train_df.shape)


✅ Columns in x_train_df but missing in X_generalizer_train_df: set()
⚠️ Extra columns in X_generalizer_train_df: set()
✅ Fixed missing columns! New shape: (120, 59)
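
The add-missing, drop-extra, and reorder steps can also be expressed in a single call with `DataFrame.reindex`. A short sketch on made-up data follows; `reference_columns` and `partial` are illustrative names.

```python
import pandas as pd

reference_columns = ["num__Age", "cat__Job_A171", "cat__Job_A172"]
partial = pd.DataFrame({"cat__Job_A172": [1.0, 0.0], "num__Age": [26.0, 41.0]})

# Adds missing columns filled with 0.0, drops extras, and fixes the order.
aligned = partial.reindex(columns=reference_columns, fill_value=0.0)
print(aligned.columns.tolist())
```

Because both sets printed above are empty, the alignment is a no-op for this data, so missing columns are unlikely to be the cause of the `KeyError`.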


In [None]:
# Ensure only features that exist in X_generalizer_train_df are passed to minimizer
valid_qi_features = [f for f in updated_qi_features if f in X_generalizer_train_df.columns and X_generalizer_train_df[f].sum() > 0]

print("✅ Final list of QI features for minimization:", valid_qi_features)


✅ Final list of QI features for minimization: ['num__Duration_in_month', 'cat__Credit_history_A30', 'cat__Credit_history_A31', 'cat__Credit_history_A32', 'cat__Credit_history_A33', 'cat__Credit_history_A34', 'cat__Purpose_A40', 'cat__Purpose_A41', 'cat__Purpose_A42', 'cat__Purpose_A43', 'cat__Purpose_A44', 'cat__Purpose_A45', 'cat__Purpose_A46', 'cat__Purpose_A48', 'cat__Purpose_A49', 'cat__debtors_A101', 'cat__debtors_A102', 'cat__debtors_A103', 'cat__Property_A121', 'cat__Property_A122', 'cat__Property_A123', 'cat__Property_A124', 'cat__Other_installment_plans_A141', 'cat__Other_installment_plans_A142', 'cat__Other_installment_plans_A143', 'cat__Housing_A151', 'cat__Housing_A152', 'cat__Housing_A153', 'cat__Job_A171', 'cat__Job_A172', 'cat__Job_A173', 'cat__Job_A174']


In [None]:
# Ensure only features that have at least one occurrence are passed to minimizer
final_qi_features = [f for f in valid_qi_features if f in X_generalizer_train_df.columns and X_generalizer_train_df[f].sum() > 0]

print("✅ Final list of QI features (with actual values):", final_qi_features)


✅ Final list of QI features (with actual values): ['num__Duration_in_month', 'cat__Credit_history_A30', 'cat__Credit_history_A31', 'cat__Credit_history_A32', 'cat__Credit_history_A33', 'cat__Credit_history_A34', 'cat__Purpose_A40', 'cat__Purpose_A41', 'cat__Purpose_A42', 'cat__Purpose_A43', 'cat__Purpose_A44', 'cat__Purpose_A45', 'cat__Purpose_A46', 'cat__Purpose_A48', 'cat__Purpose_A49', 'cat__debtors_A101', 'cat__debtors_A102', 'cat__debtors_A103', 'cat__Property_A121', 'cat__Property_A122', 'cat__Property_A123', 'cat__Property_A124', 'cat__Other_installment_plans_A141', 'cat__Other_installment_plans_A142', 'cat__Other_installment_plans_A143', 'cat__Housing_A151', 'cat__Housing_A152', 'cat__Housing_A153', 'cat__Job_A171', 'cat__Job_A172', 'cat__Job_A173', 'cat__Job_A174']


In [None]:
# Check if the problematic column exists and has nonzero values
if 'cat__Credit_history_A32' in X_generalizer_train_df.columns:
    print(f"✅ Found 'cat__Credit_history_A32' in dataset. Total count: {X_generalizer_train_df['cat__Credit_history_A32'].sum()}")
else:
    print("❌ 'cat__Credit_history_A32' is missing from X_generalizer_train_df!")


✅ Found 'cat__Credit_history_A32' in dataset. Total count: 55.0


In [None]:
# Remove unexpected spaces or hidden characters
X_generalizer_train_df.columns = X_generalizer_train_df.columns.str.strip()

# Ensure feature names exactly match x_train_df
X_generalizer_train_df = X_generalizer_train_df[x_train_df.columns]

print("✅ Standardized column names in X_generalizer_train_df!")


✅ Standardized column names in X_generalizer_train_df!


In [None]:
# Convert all columns to float
X_generalizer_train_df = X_generalizer_train_df.astype(float)

print("✅ Converted all features to numeric format!")
print(X_generalizer_train_df.dtypes.value_counts())  # Should now be only float64


✅ Converted all features to numeric format!
float64    59
Name: count, dtype: int64


In [None]:
# Step 1: Try minimization without 'cat__Credit_history_A32'
safe_qi_features = [f for f in final_qi_features if f != "cat__Credit_history_A32"]

print("✅ New QI features without problematic column:", safe_qi_features)

# Step 2: Initialize the minimizer again with the updated feature set
minimizer = GeneralizeToRepresentative(
    model,
    categorical_features=[col for col in x_train_df.columns if col.startswith("cat__")],
    features_to_minimize=safe_qi_features
)

# Step 3: Fit the minimizer without problematic columns
x_train_predictions = model.predict(X_generalizer_train_df)
minimizer.fit(X_generalizer_train_df, x_train_predictions, features_names=x_train_df.columns.tolist())

print("✅ Feature minimization complete without 'cat__Credit_history_A32'!")


✅ New QI features without problematic column: ['num__Duration_in_month', 'cat__Credit_history_A30', 'cat__Credit_history_A31', 'cat__Credit_history_A33', 'cat__Credit_history_A34', 'cat__Purpose_A40', 'cat__Purpose_A41', 'cat__Purpose_A42', 'cat__Purpose_A43', 'cat__Purpose_A44', 'cat__Purpose_A45', 'cat__Purpose_A46', 'cat__Purpose_A48', 'cat__Purpose_A49', 'cat__debtors_A101', 'cat__debtors_A102', 'cat__debtors_A103', 'cat__Property_A121', 'cat__Property_A122', 'cat__Property_A123', 'cat__Property_A124', 'cat__Other_installment_plans_A141', 'cat__Other_installment_plans_A142', 'cat__Other_installment_plans_A143', 'cat__Housing_A151', 'cat__Housing_A152', 'cat__Housing_A153', 'cat__Job_A171', 'cat__Job_A172', 'cat__Job_A173', 'cat__Job_A174']




Initial accuracy of model on generalized data, relative to original model predictions (base generalization derived from tree, before improvements): 0.750000
Improving accuracy




KeyError: 'cat__Credit_history_A33'

In [None]:
# Identify problematic features dynamically
working_qi_features = []

for feature in final_qi_features:
    try:
        # Test minimization with one feature at a time
        test_minimizer = GeneralizeToRepresentative(
            model,
            categorical_features=[col for col in x_train_df.columns if col.startswith("cat__")],
            features_to_minimize=[feature]
        )
        test_minimizer.fit(X_generalizer_train_df, x_train_predictions, features_names=x_train_df.columns.tolist())

        # If minimization succeeds, keep the feature
        working_qi_features.append(feature)
    except KeyError:
        print(f"❌ Removing problematic feature: {feature}")

print("✅ Final safe QI features after filtering:", working_qi_features)

# Re-initialize minimizer with only safe features
minimizer = GeneralizeToRepresentative(
    model,
    categorical_features=[col for col in x_train_df.columns if col.startswith("cat__")],
    features_to_minimize=working_qi_features
)

# Fit the minimizer with the corrected feature list
x_train_predictions = model.predict(X_generalizer_train_df)
minimizer.fit(X_generalizer_train_df, x_train_predictions, features_names=x_train_df.columns.tolist())

print("✅ Feature minimization complete without any problematic features!")




Initial accuracy of model on generalized data, relative to original model predictions (base generalization derived from tree, before improvements): 1.000000
Improving generalizations




Initial accuracy of model on generalized data, relative to original model predictions (base generalization derived from tree, before improvements): 1.000000
Improving generalizations
Pruned tree to level: 1, new relative accuracy: 1.000000
("Illegal level %d' % level", 2)




Initial accuracy of model on generalized data, relative to original model predictions (base generalization derived from tree, before improvements): 0.979167
Improving accuracy




feature to remove: cat__Purpose_A45
Removed feature: cat__Purpose_A45, new relative accuracy: 1.000000




Initial accuracy of model on generalized data, relative to original model predictions (base generalization derived from tree, before improvements): 1.000000
Improving generalizations
("Illegal level %d' % level", 1)





✅ Final safe QI features after filtering: ['num__Duration_in_month', 'cat__Credit_history_A30', 'cat__Credit_history_A31', 'cat__Credit_history_A32', 'cat__Credit_history_A33', 'cat__Credit_history_A34', 'cat__Purpose_A40', 'cat__Purpose_A41', 'cat__Purpose_A42', 'cat__Purpose_A43', 'cat__Purpose_A44', 'cat__Purpose_A45', 'cat__Purpose_A46', 'cat__Purpose_A48', 'cat__Purpose_A49', 'cat__debtors_A101', 'cat__debtors_A102', 'cat__debtors_A103', 'cat__Property_A121', 'cat__Property_A122', 'cat__Property_A123', 'cat__Property_A124', 'cat__Other_installment_plans_A141', 'cat__Other_installment_plans_A142', 'cat__Other_installment_plans_A143', 'cat__Housing_A151', 'cat__Housing_A152', 'cat__Housing_A153', 'cat__Job_A171', 'cat__Job_A172', 'cat__Job_A173', 'cat__Job_A174']




Initial accuracy of model on generalized data, relative to original model predictions (base generalization derived from tree, before improvements): 0.729167
Improving accuracy




KeyError: 'cat__Credit_history_A32'

In [None]:
# List all available feature names in X_generalizer_train_df
print("🔍 Features in X_generalizer_train_df:", set(X_generalizer_train_df.columns))

# List all expected feature names in x_train_df
print("📌 Features in x_train_df:", set(x_train_df.columns))

# Identify features that are expected but missing
missing_features = set(x_train_df.columns) - set(X_generalizer_train_df.columns)
if missing_features:
    print("❌ Features missing in X_generalizer_train_df:", missing_features)
else:
    print("✅ All features are correctly aligned!")


🔍 Features in X_generalizer_train_df: {'cat__Credit_history_A32', 'cat__debtors_A102', 'cat__Savings_account_A61', 'cat__Existing_checking_account_A13', 'cat__Savings_account_A62', 'num__Present_residence', 'cat__Present_employment_since_A72', 'num__Duration_in_month', 'cat__Personal_status_sex_A91', 'cat__Purpose_A44', 'cat__Personal_status_sex_A93', 'cat__Present_employment_since_A71', 'cat__Job_A173', 'num__Telephone', 'cat__Purpose_A48', 'cat__Housing_A152', 'num__Age', 'cat__Purpose_A41', 'num__Installment_rate', 'cat__Purpose_A45', 'cat__Present_employment_since_A73', 'cat__Property_A122', 'cat__Present_employment_since_A74', 'cat__Housing_A153', 'cat__Other_installment_plans_A143', 'cat__debtors_A101', 'cat__Savings_account_A64', 'cat__Personal_status_sex_A94', 'cat__debtors_A103', 'cat__Credit_history_A31', 'cat__Job_A174', 'cat__Existing_checking_account_A14', 'cat__Property_A124', 'cat__Savings_account_A65', 'cat__Present_employment_since_A75', 'cat__Savings_account_A63', 'ca

In [None]:
# Convert all features to float64 to avoid any type mismatches
X_generalizer_train_df = X_generalizer_train_df.astype(float)

print("✅ All features converted to numeric format!")
print(X_generalizer_train_df.dtypes.value_counts())  # Should now only show float64


✅ All features converted to numeric format!
float64    59
Name: count, dtype: int64


In [None]:
# Create a working list of QI features
safe_qi_features = final_qi_features.copy()

# Remove features that keep causing errors
problematic_features = ["cat__Credit_history_A32", "cat__Credit_history_A33"]
safe_qi_features = [f for f in safe_qi_features if f not in problematic_features]

print("✅ Updated QI features for minimization:", safe_qi_features)


✅ Updated QI features for minimization: ['num__Duration_in_month', 'cat__Credit_history_A30', 'cat__Credit_history_A31', 'cat__Credit_history_A34', 'cat__Purpose_A40', 'cat__Purpose_A41', 'cat__Purpose_A42', 'cat__Purpose_A43', 'cat__Purpose_A44', 'cat__Purpose_A45', 'cat__Purpose_A46', 'cat__Purpose_A48', 'cat__Purpose_A49', 'cat__debtors_A101', 'cat__debtors_A102', 'cat__debtors_A103', 'cat__Property_A121', 'cat__Property_A122', 'cat__Property_A123', 'cat__Property_A124', 'cat__Other_installment_plans_A141', 'cat__Other_installment_plans_A142', 'cat__Other_installment_plans_A143', 'cat__Housing_A151', 'cat__Housing_A152', 'cat__Housing_A153', 'cat__Job_A171', 'cat__Job_A172', 'cat__Job_A173', 'cat__Job_A174']


In [None]:
import numpy as np  # Ensure NumPy is imported

# Check for NaN values
print("🔍 Checking for NaN values...")
print(X_generalizer_train_df.isna().sum())

# Check for Infinite values
print("🔍 Checking for Inf values...")
print((X_generalizer_train_df == np.inf).sum())
print((X_generalizer_train_df == -np.inf).sum())


🔍 Checking for NaN values...
num__Duration_in_month                              0
num__Credit_amount                                  0
num__Installment_rate                               0
num__Present_residence                              0
num__Age                                            0
num__Number_of_existing_credits                     0
num__N_people_being_liable_provide_maintenance      0
num__Telephone                                    120
num__Foreign_worker                               120
cat__Existing_checking_account_A11                  0
cat__Existing_checking_account_A12                  0
cat__Existing_checking_account_A13                  0
cat__Existing_checking_account_A14                  0
cat__Credit_history_A30                             0
cat__Credit_history_A31                             0
cat__Credit_history_A32                             0
cat__Credit_history_A33                             0
cat__Credit_history_A34                             0

In [None]:
X_generalizer_train_df.drop(columns=['num__Telephone', 'num__Foreign_worker'], inplace=True, errors='ignore')
print("✅ Removed `num__Telephone` and `num__Foreign_worker` as they contained only NaNs.")


✅ Removed `num__Telephone` and `num__Foreign_worker` as they contained only NaNs.


In [None]:
# Drop the columns from training data
X_train_filtered = x_train_df.drop(columns=['num__Foreign_worker', 'num__Telephone'], errors='ignore')
X_test_filtered = x_test_df.drop(columns=['num__Foreign_worker', 'num__Telephone'], errors='ignore')

# Re-train the model with the filtered dataset
model.fit(X_train_filtered, y_train)

print("✅ Model re-trained without `num__Foreign_worker` and `num__Telephone`!")


✅ Model re-trained without `num__Foreign_worker` and `num__Telephone`!


In [None]:
# Get features used when the model was trained
trained_features = model.feature_names_in_  # Extracts the exact order used at fit time

print("🔍 Features the model was trained with:", trained_features)
print("🔍 Features in X_generalizer_train_df:", X_generalizer_train_df.columns.tolist())

# Find any mismatches
missing_in_generalizer = set(trained_features) - set(X_generalizer_train_df.columns)
extra_in_generalizer = set(X_generalizer_train_df.columns) - set(trained_features)

print("❌ Missing in generalizer data:", missing_in_generalizer)
print("⚠️ Extra in generalizer data:", extra_in_generalizer)


🔍 Features the model was trained with: ['num__Duration_in_month' 'num__Credit_amount' 'num__Installment_rate'
 'num__Present_residence' 'num__Age' 'num__Number_of_existing_credits'
 'num__N_people_being_liable_provide_maintenance'
 'cat__Existing_checking_account_A11' 'cat__Existing_checking_account_A12'
 'cat__Existing_checking_account_A13' 'cat__Existing_checking_account_A14'
 'cat__Credit_history_A30' 'cat__Credit_history_A31'
 'cat__Credit_history_A32' 'cat__Credit_history_A33'
 'cat__Credit_history_A34' 'cat__Purpose_A40' 'cat__Purpose_A41'
 'cat__Purpose_A410' 'cat__Purpose_A42' 'cat__Purpose_A43'
 'cat__Purpose_A44' 'cat__Purpose_A45' 'cat__Purpose_A46'
 'cat__Purpose_A48' 'cat__Purpose_A49' 'cat__Savings_account_A61'
 'cat__Savings_account_A62' 'cat__Savings_account_A63'
 'cat__Savings_account_A64' 'cat__Savings_account_A65'
 'cat__Present_employment_since_A71' 'cat__Present_employment_since_A72'
 'cat__Present_employment_since_A73' 'cat__Present_employment_since_A74'
 'cat__Pr

In [None]:
# Drop 'num__Foreign_worker' and 'num__Telephone' to match model training
X_generalizer_train_df = X_generalizer_train_df.drop(columns=['num__Foreign_worker', 'num__Telephone'], errors='ignore')

print("✅ Removed extra features to match training data.")


✅ Removed extra features to match training data.
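
The alignment steps in the cells above can also be done in one pass. The following is a minimal sketch on toy data, not the notebook's actual frames: `trained_features` and `df` are made-up stand-ins, and filling a missing one-hot column with `0.0` is an assumption that only makes sense when the category is simply absent from the sample.

```python
import pandas as pd

# Hypothetical stand-ins for model.feature_names_in_ and X_generalizer_train_df.
trained_features = ["num__Age", "cat__Job_A171", "cat__Job_A172"]
df = pd.DataFrame({
    "cat__Job_A172": [0.0, 1.0],
    "num__Telephone": [None, None],   # extra column the model never saw
    "num__Age": [35.0, 42.0],
})

# Report mismatches in both directions before changing anything.
missing = [c for c in trained_features if c not in df.columns]
extra = [c for c in df.columns if c not in trained_features]
print("Missing:", missing)
print("Extra:", extra)

# reindex drops the extra columns, adds the missing ones (filled with 0.0),
# and puts everything in the order the model expects.
aligned = df.reindex(columns=trained_features, fill_value=0.0)
print(aligned.columns.tolist())
```

Reordering matters as well as membership: scikit-learn checks that feature names match those seen at fit time, including their order.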


In [None]:
# Select only numeric features for minimization
numeric_qi_features = [col for col in x_train_df.columns if col.startswith("num__")]

print("✅ Minimizing only numeric features:", numeric_qi_features)

# Initialize the minimizer with only numeric features
minimizer = GeneralizeToRepresentative(
    model,
    categorical_features=[],  # No categorical features
    features_to_minimize=numeric_qi_features
)

# Fit the minimizer with numeric features only
x_train_predictions = model.predict(X_generalizer_train_df)
minimizer.fit(X_generalizer_train_df, x_train_predictions, features_names=x_train_df.columns.tolist())

print("✅ Feature minimization complete (only numeric features)!")


✅ Minimizing only numeric features: ['num__Duration_in_month', 'num__Credit_amount', 'num__Installment_rate', 'num__Present_residence', 'num__Age', 'num__Number_of_existing_credits', 'num__N_people_being_liable_provide_maintenance', 'num__Telephone', 'num__Foreign_worker']


ValueError: The feature names should match those that were passed during fit.
Feature names seen at fit time, yet now missing:
- num__Foreign_worker
- num__Telephone


In [None]:
import os
import sys
sys.path.insert(0, os.path.abspath('..'))

from apt.utils.dataset_utils import get_german_credit_dataset_pd

(x_train, y_train), (x_test, y_test) = get_german_credit_dataset_pd()
features = ["Existing_checking_account", "Duration_in_month", "Credit_history", "Purpose", "Credit_amount",
                "Savings_account", "Present_employment_since", "Installment_rate", "Personal_status_sex", "debtors",
                "Present_residence", "Property", "Age", "Other_installment_plans", "Housing",
                "Number_of_existing_credits", "Job", "N_people_being_liable_provide_maintenance", "Telephone",
                "Foreign_worker"]
categorical_features = ["Existing_checking_account", "Credit_history", "Purpose", "Savings_account",
                        "Present_employment_since", "Personal_status_sex", "debtors", "Property",
                        "Other_installment_plans", "Housing", "Job"]
QI = ["Duration_in_month", "Credit_history", "Purpose", "debtors", "Property", "Other_installment_plans",
      "Housing", "Job"]

print(x_train)

    Existing_checking_account  Duration_in_month Credit_history Purpose  \
0                         A14                 24            A32     A41   
1                         A14                 33            A33     A49   
2                         A11                  9            A32     A42   
3                         A14                 28            A34     A43   
4                         A11                 24            A33     A43   
..                        ...                ...            ...     ...   
695                       A14                 12            A32     A43   
696                       A14                 13            A32     A43   
697                       A11                 48            A30     A41   
698                       A12                 21            A34     A42   
699                       A13                 15            A32     A46   

     Credit_amount Savings_account Present_employment_since  Installment_rate  \
0             7814

## Train decision tree model
We use OneHotEncoder to handle categorical features.

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
numeric_features = [f for f in features if f not in categorical_features]
numeric_transformer = Pipeline(
    steps=[('imputer', SimpleImputer(strategy='constant', fill_value=0))]
)
# `sparse` was renamed to `sparse_output` in scikit-learn 1.2 and later removed
categorical_transformer = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)
encoded_train = preprocessor.fit_transform(x_train)
model = DecisionTreeClassifier()
model.fit(encoded_train, y_train)

encoded_test = preprocessor.transform(x_test)
print('Base model accuracy: ', model.score(encoded_test, y_test))

Base model accuracy:  0.6933333333333334


## Run minimization
We will run minimization with categorical features, on only a subset of the features, using different values of target accuracy. The target accuracy states how close to the original model's accuracy we want to stay; 1 means the same accuracy as on the original data.

In [None]:
import sys
import os
sys.path.insert(0, os.path.abspath('..'))

from apt.minimization import GeneralizeToRepresentative
from sklearn.model_selection import train_test_split

# default target_accuracy is 0.998
minimizer = GeneralizeToRepresentative(model,
                                     categorical_features=categorical_features, features_to_minimize=QI)

# Fitting the minimizer can be done on either training or test data. Using test data is better because the
# resulting accuracy on test data will be closer to the desired target accuracy (working with training
# data could result in a larger gap).
# Don't forget to leave a hold-out set for final validation!
X_generalizer_train, x_test, y_generalizer_train, y_test = train_test_split(x_test, y_test, stratify=y_test,
                                                                test_size = 0.4, random_state = 38)
X_generalizer_train.reset_index(drop=True, inplace=True)
y_generalizer_train.reset_index(drop=True, inplace=True)
x_test.reset_index(drop=True, inplace=True)
y_test.reset_index(drop=True, inplace=True)
encoded_generalizer_train = preprocessor.transform(X_generalizer_train)
x_train_predictions = model.predict(encoded_generalizer_train)
minimizer.fit(X_generalizer_train, x_train_predictions, features_names=features)
transformed = minimizer.transform(x_test, features_names=features)

encoded_transformed = preprocessor.transform(transformed)
print('Accuracy on minimized data: ', model.score(encoded_transformed, y_test))

Initial accuracy of model on generalized data, relative to original model predictions (base generalization derived from tree, before improvements): 0.805556
Improving accuracy
feature to remove: Credit_history
Removed feature: Credit_history, new relative accuracy: 0.819444
feature to remove: Other_installment_plans
Removed feature: Other_installment_plans, new relative accuracy: 0.847222
feature to remove: Duration_in_month
Removed feature: Duration_in_month, new relative accuracy: 0.847222
feature to remove: Property
Removed feature: Property, new relative accuracy: 0.847222
feature to remove: Housing
Removed feature: Housing, new relative accuracy: 0.847222
feature to remove: Purpose
Removed feature: Purpose, new relative accuracy: 0.986111
feature to remove: debtors
Removed feature: debtors, new relative accuracy: 0.986111
feature to remove: Job
Removed feature: Job, new relative accuracy: 1.000000
Accuracy on minimized data:  0.6666666666666666


#### Let's see what features were generalized

In [None]:
generalizations = minimizer.generalizations
print(generalizations)

{'ranges': {}, 'categories': {}, 'untouched': ['Foreign_worker', 'Other_installment_plans', 'Existing_checking_account', 'Purpose', 'debtors', 'Housing', 'N_people_being_liable_provide_maintenance', 'Present_employment_since', 'Installment_rate', 'Credit_history', 'Property', 'Present_residence', 'Age', 'Credit_amount', 'Duration_in_month', 'Job', 'Personal_status_sex', 'Number_of_existing_credits', 'Savings_account', 'Telephone']}


We can see that for the default target accuracy of 0.998 of the original accuracy, no generalizations are possible (all features are left untouched, i.e., not generalized).
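
One way to read the log above is to replay it against a target. The sketch below is a simplified illustration, not the library's implementation: it assumes the minimizer stops removing features from the generalization as soon as the relative accuracy reaches the target. The names `removal_log` and `still_generalized` are hypothetical, the numbers are copied from the log output, and the 0.95 target is only an example value.

```python
# Relative accuracies copied from the log output above.
initial_accuracy = 0.805556
removal_log = [
    ("Credit_history", 0.819444),
    ("Other_installment_plans", 0.847222),
    ("Duration_in_month", 0.847222),
    ("Property", 0.847222),
    ("Housing", 0.847222),
    ("Purpose", 0.986111),
    ("debtors", 0.986111),
    ("Job", 1.000000),
]

def still_generalized(target_accuracy):
    """Replay the log, stopping once the target accuracy is reached."""
    remaining = [name for name, _ in removal_log]
    accuracy = initial_accuracy
    for name, new_accuracy in removal_log:
        if accuracy >= target_accuracy:
            break
        remaining.remove(name)   # feature goes back to its original values
        accuracy = new_accuracy
    return remaining

print(still_generalized(0.998))  # []
print(still_generalized(0.95))   # ['debtors', 'Job']
```

Under this reading, the 0.998 target is only met after the last feature has been removed (relative accuracy 1.000000), which is why nothing remains generalized.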

Let's change to a slightly lower target accuracy.

In [None]:
# We allow an 8% deviation in accuracy from the original model accuracy (target_accuracy=0.92)
minimizer2 = GeneralizeToRepresentative(model, target_accuracy=0.92,
                                     categorical_features=categorical_features, features_to_minimize=QI)

minimizer2.fit(X_generalizer_train, x_train_predictions, features_names=features)
transformed2 = minimizer2.transform(x_test, features_names=features)

encoded_transformed2 = preprocessor.transform(transformed2)
print('Accuracy on minimized data: ', model.score(encoded_transformed2, y_test))
generalizations2 = minimizer2.generalizations
print(generalizations2)

Initial accuracy of model on generalized data, relative to original model predictions (base generalization derived from tree, before improvements): 0.805556
Improving accuracy
feature to remove: Credit_history
Removed feature: Credit_history, new relative accuracy: 0.819444
feature to remove: Other_installment_plans
Removed feature: Other_installment_plans, new relative accuracy: 0.847222
feature to remove: Duration_in_month
Removed feature: Duration_in_month, new relative accuracy: 0.847222
feature to remove: Property
Removed feature: Property, new relative accuracy: 0.847222
feature to remove: Housing
Removed feature: Housing, new relative accuracy: 0.847222
feature to remove: Purpose
Removed feature: Purpose, new relative accuracy: 0.986111
Accuracy on minimized data:  0.6666666666666666
{'ranges': {}, 'categories': {'debtors': [['A103', 'A102'], ['A101']], 'Job': [['A173', 'A174'], ['A171'], ['A172']]}, 'untouched': ['Credit_amount', 'Duration_in_month', 'Credit_history', 'Foreign_

This time we were able to generalize two features (debtors and Job).
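
To make the `categories` entry concrete, the sketch below shows which original values become indistinguishable once they fall into the same group. The `categories` dictionary is copied from the output above; the helper `build_group_lookup` and the sample `records` are hypothetical, and the sketch does not reproduce how the library picks the representative value for each group.

```python
# 'categories' part of the generalizations printed above.
categories = {
    "debtors": [["A103", "A102"], ["A101"]],
    "Job": [["A173", "A174"], ["A171"], ["A172"]],
}

def build_group_lookup(groups):
    """Map each original value to the (sorted) group it belongs to."""
    return {value: tuple(sorted(group)) for group in groups for value in group}

lookups = {feature: build_group_lookup(groups)
           for feature, groups in categories.items()}

# Two made-up records that differ in their original values.
records = [
    {"debtors": "A102", "Job": "A174"},
    {"debtors": "A103", "Job": "A173"},
]

# Replace each value by its group to see what the generalization hides.
generalized = [{f: lookups[f][v] for f, v in r.items()} for r in records]
for original, grouped in zip(records, generalized):
    print(original, "->", grouped)
```

Both sample records map to the same groups, so after generalization they can no longer be told apart on `debtors` and `Job`, while values such as `A101` or `A171` sit in single-element groups and are effectively unchanged.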