# Using ML anonymization to defend against attribute inference attacks

In this tutorial we will show how to anonymize models using the ML anonymization module.

We will demonstrate running inference attacks both on a vanilla model, and then on different anonymized versions of the model. We will run both black-box and white-box attribute inference attacks using ART's inference module (https://github.com/Trusted-AI/adversarial-robustness-toolbox/tree/main/art/attacks/inference).

This will be demonstrated using the Nursery dataset (original dataset can be found here: https://archive.ics.uci.edu/ml/datasets/nursery).

The sensitive feature we are trying to infer is the 'social' feature, after turning it into a binary feature (the original value 'problematic' receives the new value 1 and the rest 0). We also preprocess the data such that all categorical features are one-hot encoded.
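As a minimal sketch of the preprocessing described above, the following plain-Python snippet binarizes 'social' and one-hot encodes another categorical feature. The helper names `binarize_social` and `one_hot` are hypothetical and used only for illustration; the notebook itself relies on scikit-learn's `OneHotEncoder`.

```python
def binarize_social(value):
    """Map the original 'social' category to a binary flag ('problematic' -> 1, rest -> 0)."""
    return 1 if value == "problematic" else 0


def one_hot(value, categories):
    """Return a one-hot list with a 1 in the position of `value`."""
    return [1 if value == category else 0 for category in categories]


# The three 'social' values that appear in the Nursery dataset
social_values = ["nonprob", "slightly_prob", "problematic"]
social_binary = [binarize_social(v) for v in social_values]

# One-hot encode the two 'finance' categories
finance_categories = ["convenient", "inconv"]
finance_encoded = [one_hot(v, finance_categories) for v in finance_categories]

print(social_binary)
print(finance_encoded)
```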

## Load data

In [None]:
!mkdir -p ../datasets
!wget -P ../datasets https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data
!wget -P ../datasets https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data
!wget -P ../datasets https://archive.ics.uci.edu/ml/machine-learning-databases/nursery/nursery.data


In [None]:
!pip install ai-privacy-toolkit



In [None]:
!pip install adversarial-robustness-toolbox



In [None]:
import os
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeClassifier
from art.estimators.classification.scikitlearn import ScikitlearnDecisionTreeClassifier
from art.attacks.inference.attribute_inference import AttributeInferenceBlackBox, AttributeInferenceWhiteBoxDecisionTree

# Ensure dataset directory exists
os.makedirs("../datasets", exist_ok=True)

# Download dataset
nursery_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/nursery/nursery.data"
nursery_path = "../datasets/nursery.data"
if not os.path.exists(nursery_path):
    os.system(f"wget -P ../datasets {nursery_url}")

# Define column names based on dataset documentation
columns = ["parents", "has_nurs", "form", "children", "housing", "finance", "social", "health", "class"]

# Load dataset
nursery_data = pd.read_csv(nursery_path, names=columns, header=None)

# Encode categorical features
x = nursery_data.drop(columns=["class"])
y = nursery_data["class"]

categorical_features = x.columns.tolist()
categorical_transformer = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
preprocessor = ColumnTransformer(
    transformers=[("cat", categorical_transformer, categorical_features)]
)

# Transform dataset
x_encoded = preprocessor.fit_transform(x)

# Encode labels
label_mapping = {label: idx for idx, label in enumerate(y.unique())}
y_encoded = y.map(label_mapping).values

# Train-test split
x_train, x_test, y_train, y_test = train_test_split(x_encoded, y_encoded, test_size=0.2, stratify=y_encoded, random_state=42)

y_train = y_train.astype(int)
y_test = y_test.astype(int)

print(f'Final Train Set: {x_train.shape}, Test Set: {x_test.shape}')


Final Train Set: (10368, 27), Test Set: (2592, 27)


## Train decision tree model

In [None]:
# Train the Decision Tree Model
model = DecisionTreeClassifier()
model.fit(x_train, y_train)  # Use the already one-hot encoded x_train

# Wrap model with ART
art_classifier = ScikitlearnDecisionTreeClassifier(model)

# Evaluate model
accuracy = model.score(x_test, y_test)  # Use already one-hot encoded x_test
print(f'Base model accuracy: {accuracy:.4f}')


Base model accuracy: 0.9973


## Attack
### Black-box attack
The black-box attack basically trains an additional classifier (called the attack model) to predict the attacked feature's value from the remaining n-1 features as well as the original (attacked) model's predictions.
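The idea can be sketched without any library, under loud assumptions: `target_model` below is a made-up stand-in for the attacked model, and a majority-vote lookup table stands in for the attack model that ART actually trains. The sketch only shows how the attack input is assembled from the remaining features plus the model's prediction.

```python
from collections import Counter, defaultdict


def target_model(record):
    """Hypothetical attacked model: predicts 1 if the sensitive flag is set or income is high."""
    sensitive, income = record
    return 1 if sensitive == 1 or income == "high" else 0


def build_attack_input(record, attacked_index, prediction):
    """Attack input = the n-1 remaining features plus the target model's prediction."""
    rest = tuple(v for i, v in enumerate(record) if i != attacked_index)
    return rest + (prediction,)


def fit_attack(records, attacked_index):
    """'Train' the attack model: majority vote of the attacked value per attack input."""
    votes = defaultdict(Counter)
    for record in records:
        key = build_attack_input(record, attacked_index, target_model(record))
        votes[key][record[attacked_index]] += 1
    return {key: counts.most_common(1)[0][0] for key, counts in votes.items()}


def infer(attack_table, partial_record, prediction, default=0):
    """Guess the attacked value from the remaining features and the model's prediction."""
    return attack_table.get(tuple(partial_record) + (prediction,), default)


# Records are (sensitive_flag, income); the sensitive flag (index 0) is the attacked feature
train = [(1, "low"), (0, "low"), (1, "high"), (0, "high"), (1, "low"), (0, "low")]
table = fit_attack(train, attacked_index=0)

guess_a = infer(table, ["low"], prediction=1)  # low income but predicted 1
guess_b = infer(table, ["low"], prediction=0)  # low income and predicted 0
print(guess_a, guess_b)
```

In this toy setup the prediction leaks the sensitive flag whenever income is "low", which is the kind of dependence the attack model learns to exploit.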
#### Train attack model

In [None]:
import numpy as np
from art.attacks.inference.attribute_inference import AttributeInferenceBlackBox

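# Select the attacked feature by its column index in the one-hot encoded array.
# Note: with the encoding above, column 0 is 'parents_great_pret'; the binary
# 'social_problematic' column is at index 22 (see the feature names printed later).
# The outputs shown in this tutorial were produced with index 0.
attack_feature = 0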

# Create training data **without** the attacked feature
x_train_for_attack = np.delete(x_train, attack_feature, axis=1)  # Remove the targeted feature

# Isolate the attacked feature
x_train_feature = x_train[:, attack_feature].reshape(-1, 1)

# Initialize the Black-Box Attack
bb_attack = AttributeInferenceBlackBox(art_classifier, attack_feature=attack_feature)

# Get original model's predictions
x_train_predictions = np.array([np.argmax(arr) for arr in art_classifier.predict(x_train)]).reshape(-1, 1)

# Use 50% of the training set to train the attack model
attack_train_ratio = 0.5
attack_train_size = int(len(x_train) * attack_train_ratio)

# Train the attack model
bb_attack.fit(x_train[:attack_train_size])

print("Black-Box Attack Model Trained ✅")



Black-Box Attack Model Trained ✅


#### Infer sensitive feature and check accuracy

In [None]:
# Get inferred values for the remaining 50% of the dataset
values = [0, 1]  # Binary classification for 'social'

inferred_train_bb = bb_attack.infer(
    x_train_for_attack[attack_train_size:],  # Use the remaining dataset
    pred=x_train_predictions[attack_train_size:],  # Use original model's predictions
    values=values
)

# Calculate accuracy
train_acc = np.sum(
    inferred_train_bb == np.around(x_train_feature[attack_train_size:], decimals=8).reshape(1, -1)
) / len(inferred_train_bb)

print(f"Black-Box Attack Accuracy: {train_acc:.4f}")


Black-Box Attack Accuracy: 1.0000



This means that the attacked feature is inferred correctly for every record in the held-out half of the training set.
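The accuracy of 1.0 printed above is less surprising once you note that the attacked column (index 0) belongs to a one-hot group, and that `np.delete` removes only that single column. Its sibling columns stay in the attack input, and in a one-hot group they fully determine the removed one. The toy snippet below illustrates this observation about the encoding; `recover_removed_column` is a hypothetical helper, not part of ART.

```python
# Toy rows of one one-hot group, in the order [great_pret, pretentious, usual]
rows = [
    [0, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
]


def recover_removed_column(siblings):
    """Exactly one column of a one-hot group is 1, so the missing one equals 1 - sum(rest)."""
    return 1 - sum(siblings)


attacked = [row[0] for row in rows]    # the column the attack tries to infer
siblings = [row[1:] for row in rows]   # what remains after deleting column 0
recovered = [recover_removed_column(s) for s in siblings]

print(recovered == attacked)
```

To measure leakage through the model rather than through the encoding, one option is to remove all columns of the attacked feature's one-hot group from the attack input.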

### White-box attack
This attack does not train an additional model. It uses information encoded in the attacked decision tree to compute the probability of each value of the attacked feature, and outputs the value with the highest probability.
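A simplified sketch of this scoring idea follows. It is not ART's exact algorithm: `toy_tree` is a hand-written, hypothetical decision tree, and the score used here (prior times the fraction of training samples in the reached leaf, or zero if the tree's prediction disagrees with the observed one) is an assumption chosen for illustration.

```python
def toy_tree(sensitive, age_group):
    """Hypothetical tree: returns (predicted_class, training_samples_in_leaf)."""
    if sensitive == 1:
        return (1, 30) if age_group == "young" else (0, 20)
    return (0, 40) if age_group == "young" else (1, 10)


def whitebox_infer(age_group, observed_prediction, values, priors, n_total=100):
    """Score each candidate value of the attacked feature and return the best one."""
    scores = []
    for value, prior in zip(values, priors):
        predicted, leaf_samples = toy_tree(value, age_group)
        if predicted == observed_prediction:
            # Candidate is consistent with the observed prediction:
            # weight it by its prior and by how populated the reached leaf is
            scores.append(prior * leaf_samples / n_total)
        else:
            # Candidate would have produced a different prediction
            scores.append(0.0)
    best = max(range(len(values)), key=lambda i: scores[i])
    return values[best], scores


guess, scores = whitebox_infer("young", observed_prediction=1, values=[0, 1], priors=[0.5, 0.5])
print(guess, scores)
```

Because the scores are weighted by `priors`, the quality of the prior distribution of the attacked feature directly affects which value wins.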

In [None]:
from art.attacks.inference.attribute_inference import AttributeInferenceWhiteBoxDecisionTree

priors = [6925 / 10366, 3441 / 10366]

wb2_attack = AttributeInferenceWhiteBoxDecisionTree(art_classifier, attack_feature=attack_feature)

# get inferred values
inferred_train_wb2 = wb2_attack.infer(x_train_for_attack, x_train_predictions, values=values, priors=priors)

# check accuracy
train_acc = np.sum(inferred_train_wb2 == np.around(x_train_feature, decimals=8).reshape(1,-1)) / len(inferred_train_wb2)
print(train_acc)

0.7336998456790124


The white-box attack correctly infers the attacked feature value for about 73% of the training set.

# Anonymized data
## k=100

Now we will apply the same attacks to a model trained on an anonymized version of the same dataset (k=100). The data is anonymized on the quasi-identifiers: finance, social, health.

k=100 means that each record in the anonymized dataset is identical to at least 99 others on the quasi-identifier values (i.e., when looking only at those 3 features, the records are indistinguishable).
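The k-anonymity property described above can be checked directly by grouping records on their quasi-identifier values and looking at the group sizes. The helper `is_k_anonymous` and the toy records below are hypothetical, shown only to make the definition concrete.

```python
from collections import Counter


def is_k_anonymous(records, qi_indices, k):
    """True if every combination of quasi-identifier values occurs at least k times."""
    groups = Counter(tuple(record[i] for i in qi_indices) for record in records)
    return all(count >= k for count in groups.values())


# Toy records: (finance, social, health, parents); the first three are quasi-identifiers
records = (
    [("convenient", "nonprob", "priority", "usual")] * 3
    + [("inconv", "problematic", "priority", "great_pret")] * 2
)

two_anonymous = is_k_anonymous(records, qi_indices=[0, 1, 2], k=2)
three_anonymous = is_k_anonymous(records, qi_indices=[0, 1, 2], k=3)
print(two_anonymous, three_anonymous)
```

The smallest group here has 2 records, so the toy data is 2-anonymous but not 3-anonymous on the chosen quasi-identifiers.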

In [None]:
print("🔹 First 5 rows of the original dataset:")
print(nursery_data.head())

print("\n🔹 Column names in the original dataset:")
print(nursery_data.columns)


🔹 First 5 rows of the original dataset:
  parents has_nurs      form children     housing     finance         social  \
0   usual   proper  complete        1  convenient  convenient        nonprob   
1   usual   proper  complete        1  convenient  convenient        nonprob   
2   usual   proper  complete        1  convenient  convenient        nonprob   
3   usual   proper  complete        1  convenient  convenient  slightly_prob   
4   usual   proper  complete        1  convenient  convenient  slightly_prob   

        health      class  
0  recommended  recommend  
1     priority   priority  
2    not_recom  not_recom  
3  recommended  recommend  
4     priority   priority  

🔹 Column names in the original dataset:
Index(['parents', 'has_nurs', 'form', 'children', 'housing', 'finance',
       'social', 'health', 'class'],
      dtype='object')


In [None]:
for col in ["parents", "has_nurs", "form", "children", "housing", "finance", "social", "health"]:
    print(f"\nUnique values in {col}: {nursery_data[col].unique()}")



Unique values in parents: ['usual' 'pretentious' 'great_pret']

Unique values in has_nurs: ['proper' 'less_proper' 'improper' 'critical' 'very_crit']

Unique values in form: ['complete' 'completed' 'incomplete' 'foster']

Unique values in children: ['1' '2' '3' 'more']

Unique values in housing: ['convenient' 'less_conv' 'critical']

Unique values in finance: ['convenient' 'inconv']

Unique values in social: ['nonprob' 'slightly_prob' 'problematic']

Unique values in health: ['recommended' 'priority' 'not_recom']


In [None]:
# Get transformed feature names after One-Hot Encoding
encoded_feature_names = preprocessor.named_transformers_["cat"].get_feature_names_out(input_features=categorical_features).tolist()

print("\nTransformed Feature Names After One-Hot Encoding:")
print(encoded_feature_names[:10])  # Print first 10 feature names

# Check how "finance", "social", and "health" transformed
for col in ["finance", "social", "health"]:
    matches = [name for name in encoded_feature_names if col in name]
    print(f"{col} transformed into: {matches}")



Transformed Feature Names After One-Hot Encoding:
['parents_great_pret', 'parents_pretentious', 'parents_usual', 'has_nurs_critical', 'has_nurs_improper', 'has_nurs_less_proper', 'has_nurs_proper', 'has_nurs_very_crit', 'form_complete', 'form_completed']
finance transformed into: ['finance_convenient', 'finance_inconv']
social transformed into: ['social_nonprob', 'social_problematic', 'social_slightly_prob']
health transformed into: ['health_not_recom', 'health_priority', 'health_recommended']


In [None]:
# Convert transformed data into DataFrame
x_train_df = pd.DataFrame(x_train, columns=encoded_feature_names)

print("\n🔹 First 5 rows of One-Hot Encoded Data:")
print(x_train_df.head())



🔹 First 5 rows of One-Hot Encoded Data:
   parents_great_pret  parents_pretentious  parents_usual  has_nurs_critical  \
0                 0.0                  1.0            0.0                0.0   
1                 0.0                  1.0            0.0                0.0   
2                 0.0                  1.0            0.0                0.0   
3                 0.0                  1.0            0.0                1.0   
4                 1.0                  0.0            0.0                1.0   

   has_nurs_improper  has_nurs_less_proper  has_nurs_proper  \
0                0.0                   0.0              1.0   
1                0.0                   0.0              0.0   
2                1.0                   0.0              0.0   
3                0.0                   0.0              0.0   
4                0.0                   0.0              0.0   

   has_nurs_very_crit  form_complete  form_completed  ...  housing_critical  \
0                 0.

In [None]:
print("Transformed feature names:", x_train_df.columns.tolist())


Transformed feature names: ['parents_great_pret', 'parents_pretentious', 'parents_usual', 'has_nurs_critical', 'has_nurs_improper', 'has_nurs_less_proper', 'has_nurs_proper', 'has_nurs_very_crit', 'form_complete', 'form_completed', 'form_foster', 'form_incomplete', 'children_1', 'children_2', 'children_3', 'children_more', 'housing_convenient', 'housing_critical', 'housing_less_conv', 'finance_convenient', 'finance_inconv', 'social_nonprob', 'social_problematic', 'social_slightly_prob', 'health_not_recom', 'health_priority', 'health_recommended']


In [None]:
from apt.utils.datasets import ArrayDataset
from apt.anonymization import Anonymize

dataset = ArrayDataset(x_train, x_train_predictions)
print("Feature Names in ArrayDataset:", dataset.features_names)


Feature Names in ArrayDataset: None


In [None]:
x_train_named = pd.DataFrame(x_train, columns=x_train_df.columns)
dataset = ArrayDataset(x_train_named.to_numpy(), x_train_predictions)


In [None]:
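# Quasi-identifiers: every one-hot column derived from finance, social, and health
QI = [col for col in x_train_named.columns if col.startswith(("finance_", "social_", "health_"))]
QI_indices = [x_train_named.columns.get_loc(col) for col in QI]  # Get index positions of QI
print("Quasi-Identifier Indices:", QI_indices)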


Quasi-Identifier Indices: [19, 20, 21, 22, 23, 24, 25, 26]


In [None]:
# Extract transformed feature names after One-Hot Encoding
categorical_feature_names = preprocessor.named_transformers_["cat"].get_feature_names_out(categorical_features)
categorical_feature_names = list(categorical_feature_names)  # Convert to list for easier access
print("Updated Categorical Feature Names:", categorical_feature_names)


Updated Categorical Feature Names: ['parents_great_pret', 'parents_pretentious', 'parents_usual', 'has_nurs_critical', 'has_nurs_improper', 'has_nurs_less_proper', 'has_nurs_proper', 'has_nurs_very_crit', 'form_complete', 'form_completed', 'form_foster', 'form_incomplete', 'children_1', 'children_2', 'children_3', 'children_more', 'housing_convenient', 'housing_critical', 'housing_less_conv', 'finance_convenient', 'finance_inconv', 'social_nonprob', 'social_problematic', 'social_slightly_prob', 'health_not_recom', 'health_priority', 'health_recommended']


In [None]:
categorical_feature_indices = [x_train_named.columns.get_loc(col) for col in categorical_feature_names]
print("Categorical Feature Indices:", categorical_feature_indices)


Categorical Feature Indices: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26]


In [None]:
anonymizer = Anonymize(100, QI_indices, categorical_features=categorical_feature_indices)
anon = anonymizer.anonymize(dataset)


In [None]:
print("🔹 First 5 rows of Anonymized Data:")
print(pd.DataFrame(anon, columns=x_train_named.columns).head())


🔹 First 5 rows of Anonymized Data:
   parents_great_pret  parents_pretentious  parents_usual  has_nurs_critical  \
0                 0.0                  1.0            0.0                0.0   
1                 0.0                  1.0            0.0                0.0   
2                 0.0                  1.0            0.0                0.0   
3                 0.0                  1.0            0.0                1.0   
4                 1.0                  0.0            0.0                1.0   

   has_nurs_improper  has_nurs_less_proper  has_nurs_proper  \
0                0.0                   0.0              1.0   
1                0.0                   0.0              0.0   
2                1.0                   0.0              0.0   
3                0.0                   0.0              0.0   
4                0.0                   0.0              0.0   

   has_nurs_very_crit  form_complete  form_completed  ...  housing_critical  \
0                 0.0     

In [None]:
print("Distinct Rows in Original Data:", len(pd.DataFrame(x_train_named).drop_duplicates()))
print("Distinct Rows in Anonymized Data:", len(pd.DataFrame(anon, columns=x_train_named.columns).drop_duplicates()))


Distinct Rows in Original Data: 10368
Distinct Rows in Anonymized Data: 3160


In [None]:
# Convert anonymized data to DataFrame with correct column names
anon_df = pd.DataFrame(anon, columns=x_train_named.columns)
anon_encoded = anon_df.to_numpy()  # ✅ Use directly as NumPy array

# Train a new Decision Tree model
anon_model = DecisionTreeClassifier()
anon_model.fit(anon_encoded, y_train)

# Wrap with ART
anon_art_classifier = ScikitlearnDecisionTreeClassifier(anon_model)

# ✅ Ensure x_test has the same structure as anon_df
x_test_df = pd.DataFrame(x_test, columns=x_train_named.columns)
x_test_encoded = x_test_df.to_numpy()  # Convert test set to same format

# ✅ Evaluate on correctly formatted x_test
anon_accuracy = anon_model.score(x_test_encoded, y_test)
print("📌 Anonymized Model Accuracy:", round(anon_accuracy, 4))


📌 Anonymized Model Accuracy: 0.8326



## Train decision tree model

In [None]:
anon_encoded = anon_df.to_numpy()  # Convert back to NumPy for training
anon_model = DecisionTreeClassifier()
anon_model.fit(anon_encoded, y_train)

# Wrap with ART
anon_art_classifier = ScikitlearnDecisionTreeClassifier(anon_model)

# Evaluate Model
anon_accuracy = anon_model.score(x_test_encoded, y_test)
print("📌 Anonymized Model Accuracy:", round(anon_accuracy, 4))


📌 Anonymized Model Accuracy: 0.8345


## Attack
### Black-box attack

In [None]:
# Use the same attacked column index as for the vanilla model
# (column 0 is 'parents_great_pret' in the one-hot encoding; 'social_problematic' is index 22)
attack_feature = 0

# Remove the attacked feature from training data
x_train_for_attack = np.delete(anon_encoded, attack_feature, axis=1)

# Isolate the attacked feature
x_train_feature = anon_encoded[:, attack_feature].copy().reshape(-1, 1)

# Initialize the Black-Box Attack
anon_bb_attack = AttributeInferenceBlackBox(anon_art_classifier, attack_feature=attack_feature)

# Get original model's predictions
anon_x_train_predictions = np.array([np.argmax(arr) for arr in anon_art_classifier.predict(anon_encoded)]).reshape(-1, 1)

# Use 50% of the training set to train the attack model
attack_train_ratio = 0.5
attack_train_size = int(len(anon_encoded) * attack_train_ratio)

# Train the attack model
anon_bb_attack.fit(anon_encoded[:attack_train_size])

# Get inferred values
inferred_train_anon_bb = anon_bb_attack.infer(
    x_train_for_attack[attack_train_size:],
    pred=anon_x_train_predictions[attack_train_size:],
    values=[0, 1]
)

# Compute attack accuracy
anon_bb_attack_acc = np.sum(
    inferred_train_anon_bb == np.around(x_train_feature[attack_train_size:], decimals=8).reshape(1, -1)
) / len(inferred_train_anon_bb)

print("🔴 Black-Box Attack Accuracy After Anonymization:", round(anon_bb_attack_acc, 4))



🔴 Black-Box Attack Accuracy After Anonymization: 1.0



In [None]:
import pandas as pd
import numpy as np

# Ensure we are using the correct attack feature index
attack_feature = 0  # Change this if 'social' is not at index 0

# **Use raw categorical data before encoding**
x_train_raw = nursery_data.drop(columns=["class"]).copy()  # Keep original categorical features
x_test_raw = x_train_raw.iloc[x_train.shape[0]:].copy()  # Extract test set

# **Ensure categorical features are strings before encoding**
x_train_raw = x_train_raw.astype(str)
x_test_raw = x_test_raw.astype(str)

# Ensure preprocessor is properly fitted before transforming
preprocessor.fit(x_train_raw)  # Fit on raw categorical data

# **Ensure One-Hot Encoding is properly applied before conversion to float**
x_train_transformed = preprocessor.transform(x_train_raw)  # One-Hot Encoded dataset
x_test_transformed = preprocessor.transform(x_test_raw)

# **Convert transformed data into numeric values only if One-Hot Encoding worked**
if isinstance(x_train_transformed, np.ndarray):
    x_train_transformed = x_train_transformed.astype(float)  # Convert to float
    x_test_transformed = x_test_transformed.astype(float)  # Convert to float
else:
    raise ValueError("One-Hot Encoding did not apply properly. Check categorical feature processing.")

# Extract feature names after One-Hot Encoding. Every feature in this dataset is
# categorical, so there are no numeric feature names to prepend.
categorical_feature_names = preprocessor.named_transformers_["cat"].get_feature_names_out().tolist()
feature_names = categorical_feature_names

# **Convert transformed NumPy array to DataFrame**
x_train_transformed_df = pd.DataFrame(x_train_transformed, columns=feature_names)
x_test_transformed_df = pd.DataFrame(x_test_transformed, columns=feature_names)

# **Fix: Convert `attacked_feature_name` to a single string**
attacked_feature_name = feature_names[attack_feature]
if isinstance(attacked_feature_name, list):
    attacked_feature_name = attacked_feature_name[0]  # Extract first element if it's a list

# **Ensure x_train_for_attack is a DataFrame Before Dropping the Feature**
x_train_for_attack = x_train_transformed_df.drop(columns=[attacked_feature_name])

# **Ensure x_train_feature is a DataFrame Before Extracting Values**
x_train_feature = x_train_transformed_df[[attacked_feature_name]].values  # Keep as 2D array

# Get inferred values for the remaining 50% of the dataset
values = [0, 1]  # Binary classification for 'social'

# **Ensure only numerical values are passed to `art_classifier.predict()`**
x_train_for_attack_np = x_train_for_attack.iloc[attack_train_size:].to_numpy().astype(float)  # Convert to float

inferred_train_bb = bb_attack.infer(
    x_train_for_attack_np,  # Ensure numerical input
    pred=art_classifier.predict(x_train_transformed_df.iloc[attack_train_size:].to_numpy().astype(float)),  # Ensure numerical input
    values=values
)

# Calculate accuracy; flatten both arrays so the comparison is element-wise
# instead of being broadcast to an (n, n) matrix
train_acc = np.mean(np.ravel(inferred_train_bb) == np.ravel(x_train_feature[attack_train_size:]))

print(f"✅ Fixed Black-Box Attack Accuracy: {train_acc:.4f}")



### White-box attack

In [None]:
anon_wb2_attack = AttributeInferenceWhiteBoxDecisionTree(anon_art_classifier, attack_feature=attack_feature)

# get inferred values
inferred_train_anon_wb2 = anon_wb2_attack.infer(x_train_for_attack, anon_x_train_predictions, values=values, priors=priors)

# check accuracy
anon_train_acc = np.sum(inferred_train_anon_wb2 == np.around(x_train_feature, decimals=8).reshape(1,-1)) / len(inferred_train_anon_wb2)
print(anon_train_acc)

In [None]:
print("Unique values in y_train:", np.unique(y_train))


Unique values in y_train: [0 1 2 3 4]


In [None]:
y_train_binary = np.where(y_train > 0, 1, 0)  # Convert all non-zero classes to 1
print("Unique values in y_train after binarization:", np.unique(y_train_binary))


Unique values in y_train after binarization: [0 1]


In [None]:
social_counts = np.bincount(y_train_binary)
priors = social_counts / len(y_train_binary)  # Normalize counts to probabilities
print("Computed Priors:", priors)


Computed Priors: [1.92901235e-04 9.99807099e-01]


In [None]:
from art.attacks.inference.attribute_inference import AttributeInferenceWhiteBoxDecisionTree

# Binarize the class labels
y_train_binary = np.where(y_train > 0, 1, 0)

# Compute priors. Note: these are class-label frequencies taken from y_train, whereas
# the attack's `priors` argument is meant to describe the attacked feature's value distribution.
social_counts = np.bincount(y_train_binary)
priors = social_counts / len(y_train_binary)  # Normalize counts to probabilities
print("Computed Priors:", priors)

# Column index of the attacked feature
# (column 0 is 'parents_great_pret'; 'social_problematic' is index 22)
attack_feature = 0

# Prepare input features by removing the attacked feature
x_train_for_attack = np.delete(anon_encoded, attack_feature, axis=1)

# Initialize the White-Box Attack
anon_wb2_attack = AttributeInferenceWhiteBoxDecisionTree(anon_art_classifier, attack_feature=attack_feature)

# Get inferred values
inferred_train_anon_wb2 = anon_wb2_attack.infer(
    x_train_for_attack, anon_x_train_predictions, values=[0, 1], priors=priors
)

# Compute attack accuracy
anon_wb_attack_acc = np.sum(
    inferred_train_anon_wb2 == np.around(x_train_feature, decimals=8).reshape(1, -1)
) / len(inferred_train_anon_wb2)

print("🔵 White-Box Attack Accuracy After Anonymization:", round(anon_wb_attack_acc, 4))


Computed Priors: [1.92901235e-04 9.99807099e-01]
🔵 White-Box Attack Accuracy After Anonymization: 0.5106


The black-box attack accuracy is unchanged at 1.0, while the white-box attack accuracy drops from about 0.73 to about 0.51. Let's check the precision and recall for each case:

In [None]:
def calc_precision_recall(predicted, actual, positive_value=1):
    score = 0  # both predicted and actual are positive
    num_positive_predicted = 0  # predicted positive
    num_positive_actual = 0  # actual positive
    for i in range(len(predicted)):
        if predicted[i] == positive_value:
            num_positive_predicted += 1
        if actual[i] == positive_value:
            num_positive_actual += 1
        if predicted[i] == actual[i]:
            if predicted[i] == positive_value:
                score += 1

    if num_positive_predicted == 0:
        precision = 1
    else:
        precision = score / num_positive_predicted  # the fraction of predicted “Yes” responses that are correct
    if num_positive_actual == 0:
        recall = 1
    else:
        recall = score / num_positive_actual  # the fraction of “Yes” responses that are predicted correctly

    return precision, recall

# black-box regular
print(calc_precision_recall(inferred_train_bb, x_train_feature))
# black-box anonymized
print(calc_precision_recall(inferred_train_anon_bb, x_train_feature))

(0.345475910693302, 0.33696275071633236)
(0.345475910693302, 0.33696275071633236)


In [None]:
# white-box regular
print(calc_precision_recall(inferred_train_wb2, x_train_feature))
# white-box anonymized
print(calc_precision_recall(inferred_train_anon_wb2, x_train_feature))

(0.7355769230769231, 0.31070496083550914)
(0.4045299847435747, 1.0)


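For the black-box attack, precision and recall are identical for the vanilla and anonymized models. For the white-box attack, precision drops from about 0.74 to about 0.40, while recall rises from about 0.31 to 1.0.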

Now let's see what happens when we increase k to 1000.

## k=1000

Now we apply the attacks on an anonymized version of the same dataset (k=1000). The data has been anonymized on the quasi-identifiers: finance, social, health.
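To give an intuition for what anonymizing on quasi-identifiers means, here is a simplified sketch in the spirit of a tree-based k-anonymizer. It is an illustration under stated assumptions, not the implementation of the `Anonymize` class: it assumes already-encoded numeric data, groups records with a scikit-learn decision tree whose leaves hold at least k samples, and overwrites the quasi-identifier columns in each leaf with the most frequent value per column. The function name `toy_anonymize` and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def toy_anonymize(x, predictions, qi_indices, k):
    """Generalize the QI columns so every QI combination occurs at least k times."""
    x = np.asarray(x)
    qi = x[:, qi_indices]

    # Each leaf of the fitted tree contains at least k training records.
    tree = DecisionTreeClassifier(min_samples_leaf=k, random_state=0)
    tree.fit(qi, np.asarray(predictions).ravel())
    leaf_ids = tree.apply(qi)

    anonymized = x.copy()
    for leaf in np.unique(leaf_ids):
        rows = np.where(leaf_ids == leaf)[0]
        for col in qi_indices:
            values, counts = np.unique(x[rows, col], return_counts=True)
            # Replace the column with its most frequent value inside the leaf.
            anonymized[rows, col] = values[np.argmax(counts)]
    return anonymized


# Synthetic binary data: 300 records, 5 columns, the last three act as QIs.
rng = np.random.default_rng(0)
toy_x = rng.integers(0, 2, size=(300, 5))
toy_predictions = (toy_x[:, 2] ^ toy_x[:, 4]).reshape(-1, 1)
toy_qi = [2, 3, 4]

toy_anon = toy_anonymize(toy_x, toy_predictions, toy_qi, k=50)
_, group_sizes = np.unique(toy_anon[:, toy_qi], axis=0, return_counts=True)
print("smallest QI group:", group_sizes.min())
```

All records in a leaf end up with identical quasi-identifier values, so a larger k merges more records together and leaves fewer distinct rows.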

In [None]:
print("Available columns in x_train_named:", x_train_named.columns.tolist())


Available columns in x_train_named: ['parents_great_pret', 'parents_pretentious', 'parents_usual', 'has_nurs_critical', 'has_nurs_improper', 'has_nurs_less_proper', 'has_nurs_proper', 'has_nurs_very_crit', 'form_complete', 'form_completed', 'form_foster', 'form_incomplete', 'children_1', 'children_2', 'children_3', 'children_more', 'housing_convenient', 'housing_critical', 'housing_less_conv', 'finance_convenient', 'finance_inconv', 'social_nonprob', 'social_problematic', 'social_slightly_prob', 'health_not_recom', 'health_priority', 'health_recommended']


In [None]:
categorical_features = [
    "parents_great_pret", "parents_pretentious", "parents_usual",
    "has_nurs_critical", "has_nurs_improper", "has_nurs_less_proper",
    "has_nurs_proper", "has_nurs_very_crit",
    "form_complete", "form_completed", "form_foster", "form_incomplete",
    "children_1", "children_2", "children_3", "children_more",
    "housing_convenient", "housing_critical", "housing_less_conv"
]


In [None]:
categorical_feature_indices = [x_train_named.columns.get_loc(col) for col in categorical_features if col in x_train_named.columns]
print("Updated Categorical Feature Indices:", categorical_feature_indices)


Updated Categorical Feature Indices: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18]


In [None]:
anonymizer2 = Anonymize(1000, QI_indices, categorical_features=categorical_feature_indices)
anon2 = anonymizer2.anonymize(ArrayDataset(x_train, x_train_predictions))

print("✅ Anonymization with k=1000 completed!")


✅ Anonymization with k=1000 completed!


In [None]:
print("Distinct Rows in Original Data:", len(pd.DataFrame(x_train).drop_duplicates()))
print("Distinct Rows in Anonymized Data (k=1000):", len(pd.DataFrame(anon2).drop_duplicates()))


Distinct Rows in Original Data: 10368
Distinct Rows in Anonymized Data (k=1000): 1727
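The drop in distinct rows reflects records being merged into groups that share the same quasi-identifier values. A small sketch for checking this directly is shown below; `qi_group_sizes` is a hypothetical helper and the toy rows are made up for illustration. It counts how many records share each combination of quasi-identifier values, so the smallest count is the k actually achieved on those columns.

```python
from collections import Counter


def qi_group_sizes(rows, qi_indices):
    """Count records per combination of quasi-identifier values."""
    return Counter(tuple(row[i] for i in qi_indices) for row in rows)


# Toy rows: columns 0 and 1 are treated as quasi-identifiers, column 2 is not.
toy_rows = [
    [0, 1, 5],
    [0, 1, 6],
    [1, 0, 7],
    [1, 0, 8],
    [1, 0, 9],
]
sizes = qi_group_sizes(toy_rows, qi_indices=[0, 1])
print("QI groups:", len(sizes), "smallest group:", min(sizes.values()))
```

If the anonymized data is held as a 2-D array, converting it with `.tolist()` before calling the helper gives plain Python rows.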


In [None]:
# print("Original Categorical Features:", categorical_features)


Original Categorical Features: ['parents', 'has_nurs', 'form', 'children', 'housing', 'finance', 'social', 'health']


In [None]:
# QI = [
#    "finance_convenient", "finance_inconv",
#    "social_nonprob", "social_problematic", "social_slightly_prob",
#    "health_not_recom", "health_priority", "health_recommended"
#]


In [None]:
#QI_indices = [x_train_named.columns.get_loc(col) for col in QI if col in x_train_named.columns]
#print("Updated QI Indices:", QI_indices)


Updated QI Indices: [19, 20, 21, 22, 23, 24, 25, 26]



## Train decision tree model

In [None]:
# Convert anonymized data to DataFrame with correct column names
anon2_df = pd.DataFrame(anon2, columns=x_train_named.columns)
anon2_encoded = anon2_df.to_numpy()  # Convert back to NumPy array

# Train a new Decision Tree model
anon2_model = DecisionTreeClassifier()
anon2_model.fit(anon2_encoded, y_train)

# Wrap with ART
anon2_art_classifier = ScikitlearnDecisionTreeClassifier(anon2_model)

# Evaluate Model
anon2_accuracy = anon2_model.score(x_test_encoded, y_test)
print("📌 Anonymized Model Accuracy (k=1000):", round(anon2_accuracy, 4))


📌 Anonymized Model Accuracy (k=1000): 0.7477


In [None]:
anon2_bb_attack = AttributeInferenceBlackBox(anon2_art_classifier, attack_feature=0)

# Get predictions from the new model
anon2_x_train_predictions = np.array([np.argmax(arr) for arr in anon2_art_classifier.predict(anon2_encoded)]).reshape(-1, 1)

# Train the attack model
anon2_bb_attack.fit(anon2_encoded[:attack_train_size])

# Get inferred values
inferred_train_anon2_bb = anon2_bb_attack.infer(
    x_train_for_attack[attack_train_size:], pred=anon2_x_train_predictions[attack_train_size:], values=[0, 1]
)

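# Compute attack accuracy.
# Note: this compares the inferred feature values with the labels in y_train;
# comparing with x_train_feature[attack_train_size:] would instead measure how
# often the attacked feature itself is recovered.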
anon2_bb_attack_acc = np.mean(inferred_train_anon2_bb == y_train[attack_train_size:])
print("🔴 Black-Box Attack Accuracy After Anonymization (k=1000):", round(anon2_bb_attack_acc, 4))


🔴 Black-Box Attack Accuracy After Anonymization (k=1000): 0.0664


In [None]:
anon2_wb2_attack = AttributeInferenceWhiteBoxDecisionTree(anon2_art_classifier, attack_feature=0)

# Get inferred values
inferred_train_anon2_wb2 = anon2_wb2_attack.infer(
    x_train_for_attack, anon2_x_train_predictions, values=[0, 1], priors=priors
)

# Compute attack accuracy
anon2_wb_attack_acc = np.mean(
    inferred_train_anon2_wb2 == np.around(x_train_feature, decimals=8).reshape(1, -1)
)
print("🔵 White-Box Attack Accuracy After Anonymization (k=1000):", round(anon2_wb_attack_acc, 4))


🔵 White-Box Attack Accuracy After Anonymization (k=1000): 0.3325


In [None]:
# black-box regular
print(calc_precision_recall(inferred_train_bb, x_train_feature))
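# black-box anonymized
# (inferred_train_anon2_bb covers rows from attack_train_size onward, so the
# aligned ground truth is x_train_feature[attack_train_size:])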
print(calc_precision_recall(inferred_train_anon2_bb, x_train_feature))

# white-box regular
print(calc_precision_recall(inferred_train_wb2, x_train_feature))
# white-box anonymized
print(calc_precision_recall(inferred_train_anon2_wb2, x_train_feature))

(0.345475910693302, 0.33696275071633236)
(0.34683098591549294, 0.3386819484240688)
(0.7355769230769231, 0.31070496083550914)
(0.3324652777777778, 1.0)


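Black-box precision and recall are essentially unchanged at k=1000 (about 0.35 and 0.34). The white-box attack changes more: precision falls from about 0.74 for the regular model to 0.33, recall rises to 1.0, and accuracy drops to 0.3325 from the 0.5106 measured for the previous anonymized model. An accuracy equal to the precision together with a recall of 1.0 is consistent with an attack that predicts the positive value for every record.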
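When reading these numbers it helps to compare them with a trivial attacker that ignores the model. The sketch below uses made-up data and a hypothetical helper `attack_metrics`. It shows that an attacker who predicts the positive value for every record gets a recall of 1.0, with precision and accuracy both equal to the share of positive records.

```python
import numpy as np


def attack_metrics(inferred, actual, positive_value=1):
    """Return accuracy, precision and recall for one positive value."""
    inferred = np.asarray(inferred).ravel()
    actual = np.asarray(actual).ravel()
    if inferred.shape != actual.shape:
        raise ValueError("inferred and actual must contain the same number of records")

    predicted_positive = np.sum(inferred == positive_value)
    actual_positive = np.sum(actual == positive_value)
    true_positive = np.sum((inferred == positive_value) & (actual == positive_value))
    return {
        "accuracy": float(np.mean(inferred == actual)),
        # An empty denominator yields 1, as in calc_precision_recall.
        "precision": float(true_positive / predicted_positive) if predicted_positive else 1.0,
        "recall": float(true_positive / actual_positive) if actual_positive else 1.0,
    }


# Made-up ground truth in which one record in three is positive.
toy_actual = np.array([1, 0, 0] * 100)
always_positive = np.ones_like(toy_actual)
baseline = attack_metrics(always_positive, toy_actual)
print(baseline)
```

An attack whose metrics match this baseline has learned nothing about individual records beyond the overall share of positive values.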

## k=100, all QI
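Now let's see what happens with k=100 and a different set of quasi-identifiers. Note that the `QI_all` list below names eight one-hot columns, which cover only the `parents` and `has_nurs` features, not all eight original features of the Nursery dataset (parents, has_nurs, form, children, housing, finance, social, health).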

In [None]:
QI_all = [
    "parents_great_pret", "parents_pretentious", "parents_usual",
    "has_nurs_critical", "has_nurs_improper", "has_nurs_less_proper",
    "has_nurs_proper", "has_nurs_very_crit"
]


In [None]:
QI_all_indices = [x_train_named.columns.get_loc(col) for col in QI_all if col in x_train_named.columns]
print("Updated QI Indices (All Features):", QI_all_indices)


Updated QI Indices (All Features): [0, 1, 2, 3, 4, 5, 6, 7]
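Because the data is one-hot encoded, each original feature spans several columns. One way to select every column that belongs to a set of original features is to match on the column-name prefix. The sketch below is an illustration, not part of the original notebook; `one_hot_indices` is a hypothetical helper, and it assumes the columns are named `<feature>_<value>` as in the `x_train_named` listing printed earlier.

```python
def one_hot_indices(columns, features):
    """Return positions of the one-hot columns derived from the given features."""
    prefixes = tuple(feature + "_" for feature in features)
    return [i for i, column in enumerate(columns) if column.startswith(prefixes)]


# Column names copied from the x_train_named listing printed earlier.
columns = [
    "parents_great_pret", "parents_pretentious", "parents_usual",
    "has_nurs_critical", "has_nurs_improper", "has_nurs_less_proper",
    "has_nurs_proper", "has_nurs_very_crit",
    "form_complete", "form_completed", "form_foster", "form_incomplete",
    "children_1", "children_2", "children_3", "children_more",
    "housing_convenient", "housing_critical", "housing_less_conv",
    "finance_convenient", "finance_inconv",
    "social_nonprob", "social_problematic", "social_slightly_prob",
    "health_not_recom", "health_priority", "health_recommended",
]
all_features = ["parents", "has_nurs", "form", "children",
                "housing", "finance", "social", "health"]

print(one_hot_indices(columns, ["finance", "social", "health"]))
print(one_hot_indices(columns, ["parents", "has_nurs"]))
print(len(one_hot_indices(columns, all_features)))
```

Passing all eight feature names selects every one of the 27 encoded columns, which is what treating all features as quasi-identifiers would require.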


In [None]:
anonymizer_all_QI = Anonymize(100, QI_all_indices, categorical_features=categorical_feature_indices)
anon_all_QI = anonymizer_all_QI.anonymize(ArrayDataset(x_train, x_train_predictions))

print("✅ Anonymization with k=100 using all QI features completed!")


✅ Anonymization with k=100 using all QI features completed!


In [None]:
print("Distinct Rows in Original Data:", len(pd.DataFrame(x_train).drop_duplicates()))
print("Distinct Rows in Anonymized Data (k=100, All QI):", len(pd.DataFrame(anon_all_QI).drop_duplicates()))


Distinct Rows in Original Data: 10368
Distinct Rows in Anonymized Data (k=100, All QI): 4368


In [None]:
# Convert anonymized data to DataFrame with correct column names
anon_all_QI_df = pd.DataFrame(anon_all_QI, columns=x_train_named.columns)
anon_all_QI_encoded = anon_all_QI_df.to_numpy()  # Convert back to NumPy array

# Train a new Decision Tree model
anon_all_QI_model = DecisionTreeClassifier()
anon_all_QI_model.fit(anon_all_QI_encoded, y_train)

# Wrap with ART
anon_all_QI_art_classifier = ScikitlearnDecisionTreeClassifier(anon_all_QI_model)

# Evaluate Model
anon_all_QI_accuracy = anon_all_QI_model.score(x_test_encoded, y_test)
print("📌 Anonymized Model Accuracy (k=100, All QI):", round(anon_all_QI_accuracy, 4))


📌 Anonymized Model Accuracy (k=100, All QI): 0.8657


In [None]:
anon_all_QI_bb_attack = AttributeInferenceBlackBox(anon_all_QI_art_classifier, attack_feature=0)

# Get predictions from the new model
anon_all_QI_x_train_predictions = np.array([np.argmax(arr) for arr in anon_all_QI_art_classifier.predict(anon_all_QI_encoded)]).reshape(-1, 1)

# Train the attack model
anon_all_QI_bb_attack.fit(anon_all_QI_encoded[:attack_train_size])

# Get inferred values
inferred_train_anon_all_QI_bb = anon_all_QI_bb_attack.infer(
    x_train_for_attack[attack_train_size:], pred=anon_all_QI_x_train_predictions[attack_train_size:], values=[0, 1]
)

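# Compute attack accuracy.
# Note: this compares the inferred feature values with the labels in y_train;
# comparing with x_train_feature[attack_train_size:] would instead measure how
# often the attacked feature itself is recovered.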
anon_all_QI_bb_attack_acc = np.mean(inferred_train_anon_all_QI_bb == y_train[attack_train_size:])
print("🔴 Black-Box Attack Accuracy After Anonymization (k=100, All QI):", round(anon_all_QI_bb_attack_acc, 4))


🔴 Black-Box Attack Accuracy After Anonymization (k=100, All QI): 0.0627


In [None]:
anon_all_QI_wb2_attack = AttributeInferenceWhiteBoxDecisionTree(anon_all_QI_art_classifier, attack_feature=0)

# Get inferred values
inferred_train_anon_all_QI_wb2 = anon_all_QI_wb2_attack.infer(
    x_train_for_attack, anon_all_QI_x_train_predictions, values=[0, 1], priors=priors
)

# Compute attack accuracy
anon_all_QI_wb_attack_acc = np.mean(
    inferred_train_anon_all_QI_wb2 == np.around(x_train_feature, decimals=8).reshape(1, -1)
)
print("🔵 White-Box Attack Accuracy After Anonymization (k=100, All QI):", round(anon_all_QI_wb_attack_acc, 4))


🔵 White-Box Attack Accuracy After Anonymization (k=100, All QI): 0.3494


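With this configuration the black-box attack accuracy is 0.0627, slightly below the 0.0664 measured at k=1000. The white-box attack accuracy is 0.3494, close to the 0.3325 measured at k=1000 and below the 0.5106 of the first anonymized model. Precision and recall are not computed for this configuration.

Note that `calc_precision_recall` returns a precision of 1 when no records are predicted with the positive value for the attacked feature, so such a result means the attack made no positive predictions.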