# Malicious, multi-point scenario

In this scenario, we assume Bob has multiple data points to contribute to Alice's ML model. Alice wants to value the dataset as a whole, judging it by its diversity and uncertainty as well as the current model's performance on it. Moreover, both parties are assumed to be malicious, meaning they may deviate from the protocol to maximize their own utility.

# Part 0: Setup

We set up Alice's model and Bob's data points as in the other examples.

In [1]:
N = 1000 #Bob's dataset size
DIM = 50 # Dimension of the reduced dataset
K = 20 # Representative set size
M = 20 # Number of CP checks by Alice

In [2]:
#First, we define Alice's model (not to be confused with the constant M above). We assume a simple CNN model.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim
import matplotlib.pyplot as plt
import os

#Don't use GPU for now
os.environ["CUDA_VISIBLE_DEVICES"] = ""

class LeNet(nn.Module):
    """
    Adaptation of LeNet that uses ReLU activations
    """

    # network architecture:
    def __init__(self):
        super(LeNet, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)
        self.act = nn.Softmax(dim=1)
        

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        x = self.act(x)
        return x
    
model = LeNet()

os.makedirs('data', exist_ok=True)
torch.save(model.state_dict(), 'data/model.pth')
torch.save(model, 'data/alice_model.pth')

#Next, we define the data loader for CIFAR-10 dataset.
import torchvision
import random
import torchvision.transforms as transforms
import numpy as np

transform = transforms.Compose(
    [transforms.ToTensor()])

# Note: train=False loads the CIFAR-10 test split, despite the variable name
trainset = torchvision.datasets.CIFAR10(root='./data', train=False, transform=transform, download=True)


# Randomly select N images as Bob's dataset
indices = random.sample(range(len(trainset)), N)
selected_images = np.array([trainset[i][0].numpy() for i in indices])
selected_labels = np.array([trainset[i][1]  for i in indices])

# Save images and labels separately
torch.save(selected_images, 'data/selected_images.pth')
torch.save(selected_labels, 'data/selected_labels.pth')


Files already downloaded and verified


# Part 1: Dimension reduction

In [3]:
from sklearn.random_projection import SparseRandomProjection
from sklearn.preprocessing import MinMaxScaler

# Define the random projection transformer
n_components = DIM  # Number of dimensions to reduce to
random_projection = SparseRandomProjection(n_components=n_components, random_state=42)

# Flatten the images for dimension reduction
flattened_images = selected_images.reshape(selected_images.shape[0], -1)

# Apply random projection
reduced_images = random_projection.fit_transform(flattened_images)

scaler = MinMaxScaler(feature_range=(0, 1))
reduced_images = scaler.fit_transform(reduced_images)

# Verify the shape of the reduced images
print("Shape of reduced images:", reduced_images.shape)

# Save the reduced images and labels
torch.save(reduced_images, 'data/reduced_images.pth')
torch.save(selected_labels, 'data/reduced_labels.pth')

Shape of reduced images: (1000, 50)
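
The challenge protocol later compares distances in the reduced space, so it can be useful to check that a random projection roughly preserves pairwise distances. The sketch below is an illustration only: it uses a dense Gaussian random projection in plain NumPy as a stand-in for `SparseRandomProjection` (a swapped-in technique, chosen to keep the example dependency-free), runs on synthetic data rather than CIFAR-10, and omits the `MinMaxScaler` step. The helper name `pairwise_distances` is hypothetical.

```python
import numpy as np

def pairwise_distances(x):
    """Euclidean distances between all rows of x, returned for pairs i < j."""
    sq = np.sum(x * x, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)
    d2 = np.maximum(d2, 0.0)  # guard against tiny negative values from rounding
    i, j = np.triu_indices(x.shape[0], k=1)
    return np.sqrt(d2[i, j])

rng = np.random.default_rng(0)
n_points, original_dim, reduced_dim = 100, 3 * 32 * 32, 50

# Synthetic stand-in for flattened images with values in [0, 1]
data = rng.random((n_points, original_dim))

# Gaussian projection scaled so squared distances are preserved in expectation
projection = rng.normal(size=(original_dim, reduced_dim)) / np.sqrt(reduced_dim)
reduced = data @ projection

# Ratio of reduced-space distance to original distance for every pair
ratios = pairwise_distances(reduced) / pairwise_distances(data)
print(f"distance ratio: mean={ratios.mean():.3f}, std={ratios.std():.3f}")
```

A mean ratio near 1 with a small spread suggests that "close" in the reduced space is a reasonable proxy for "close" in the original space; the per-feature rescaling applied in the notebook changes the absolute scale of the distances.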


# Part 1 (continued): Clustering

Before submitting points to Alice for evaluation, Bob needs to select a subset of representative data points. We recommend K-means clustering for this, with K equal to the size of the representative set: Bob picks the data point closest to the centroid of each cluster, which yields a diverse set. Since it is ultimately up to Bob which points to submit, even if they are not ideal, this step does not need to be computed securely.

In [4]:
#First, we run the K-means clustering algorithm locally on Bob's device
from sklearn.cluster import KMeans

# Ensure a 2D array of shape (N, DIM); reduced_images is already flat, so this is a no-op
flattened_images = reduced_images.reshape(reduced_images.shape[0], -1)

# Perform K-means clustering
kmeans = KMeans(n_clusters=K, random_state=0).fit(flattened_images)

# Get the cluster labels
cluster_labels = kmeans.labels_

# Get the cluster centers
cluster_centers = kmeans.cluster_centers_

representative_set = []
for i in range(K):
    # Find indices of points assigned to the i-th cluster
    candidate_indices = np.where(cluster_labels == i)[0]
    # Compute the Euclidean distances of these points to the cluster center
    distances = np.linalg.norm(flattened_images[candidate_indices] - cluster_centers[i], axis=1)
    # Select the point with the smallest distance
    representative_set.append(candidate_indices[np.argmin(distances)])
representative_set = np.array(representative_set)

print("Cluster centers shape:", cluster_centers.shape)
print("Representative set:", representative_set)

Cluster centers shape: (20, 50)
Representative set: [297 118 531 785 133 425 126 590 410 299 341 284 674 959 315 242 273 321
  34 604]


# Part 2: Verifying the representative subset (Challenge Protocol)

Here the representative set is verified by what we call a challenge protocol. Alice selects a random set of points from the original dataset, and for each of them Bob proves to Alice that some point in the representative set is close to it (within a distance bound `d`, computed below).
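
In plaintext, the statement Bob proves for each challenge can be sketched as follows. This is a minimal sketch that assumes the circuit checks a Euclidean distance bound `d` against the representative at index `idx`; the actual constraints live in `helpers/cp_overall.circom` and may differ, for example in how values are encoded or committed. The function name `check_challenge` is hypothetical.

```python
import numpy as np

def check_challenge(point, representatives, idx, d):
    """Return True if representatives[idx] lies within Euclidean distance d of point.

    Plaintext stand-in for the statement proven in zero knowledge (assumed form).
    """
    point = np.asarray(point, dtype=float)
    representatives = np.asarray(representatives, dtype=float)
    return bool(np.linalg.norm(representatives[idx] - point) <= d)

# Toy example with made-up data: 3 representatives in 2 dimensions
reps = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
challenge = np.array([0.9, 0.8])

# Bob picks the closest representative as his witness index
idx = int(np.argmin(np.linalg.norm(reps - challenge, axis=1)))
print(idx, check_challenge(challenge, reps, idx, d=0.5))
```

If Bob's representative set leaves part of the dataset uncovered, a challenge drawn from that region has no valid witness index, so a random challenge catches him with probability proportional to the uncovered fraction.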

In [None]:
import os

#The circuit that proves the existence of close points is in helpers/cp_overall.circom
os.system("circom helpers/cp_overall.circom --r1cs --wasm --sym")
os.system("snarkjs powersoftau new bn128 18 pot18_0000.ptau -v")
os.system('echo "random_string" | snarkjs powersoftau contribute pot18_0000.ptau pot18_0001.ptau --name="First contribution" -v')
os.system("snarkjs powersoftau prepare phase2 pot18_0001.ptau pot18_final.ptau -v")
os.system("snarkjs groth16 setup cp_overall.r1cs pot18_final.ptau cp_0000.zkey")
os.system('echo "random" | snarkjs zkey contribute cp_0000.zkey cp_0001.zkey --name="1st Contributor Name" -v')
os.system("snarkjs zkey export verificationkey cp_0001.zkey verification_key.json")


template instances: 17
non-linear constraints: 24707
linear constraints: 4487
public inputs: 1003
private inputs: 1633
public outputs: 0
wires: 30145
labels: 66595
Written successfully: ./cp_overall.r1cs
Written successfully: ./cp_overall.sym
Written successfully: ./cp_overall_js/cp_overall.wasm
Everything went okay
[DEBUG] snarkJS: tauG1: 100000
[DEBUG] snarkJS: tauG1: 200000
[ERROR] snarkJS: Error: pot18_0000.ptau: File has no  contributions
    at readContributions (/home/thomas/.local/share/fnm/node-versions/v22.14.0/installation/lib/node_modules/snarkjs/build/cli.cjs:1039:30)
    at contribute (/home/thomas/.local/share/fnm/node-versions/v22.14.0/installation/lib/node_modules/snarkjs/build/cli.cjs:2446:33)
    at async Object.powersOfTauContribute [as action] (/home/thomas/.local/share/fnm/node-versions/v22.14.0/installation/lib/node_modules/snarkjs/build/cli.cjs:13276:5)
    at a

In [10]:
representative_points = reduced_images[representative_set]
dists = np.linalg.norm(reduced_images[:, None] - representative_points[None, :], axis=2)
min_dists = np.min(dists, axis=1)
max_min_distance = np.ceil(np.max(min_dists))
print("Maximum of the minimum distances:", max_min_distance)

Maximum of the minimum distances: 2.0


In [12]:
import json

#Alice randomly selects M points from the whole dataset
indices = random.sample(range(N), M)
print(indices)

#For each data point do the Challenge Protocol

for idx in indices:
    #Find the index from representative_points which has the min distance from the selected point
    selected_point = reduced_images[idx]
    dists = np.linalg.norm(representative_points - selected_point, axis=1)
    min_index = np.argmin(dists)
    print(np.min(dists), min_index)
    cp_data = {
        "messageArray": selected_point.tolist(),
        "idx": int(min_index),
        "allPoints": representative_points.tolist(),
        "d": int(max_min_distance),
        "r": 0x12345678
    }
    assert len(selected_point.tolist()) == 50
    assert len(representative_points.tolist()) == 20
    with open('data/cp.json', 'w') as f:
        json.dump(cp_data, f)
    
    exit_code = os.system("node helpers/commit.js")
    assert exit_code == 0, "Command 'node helpers/commit.js' failed"
    
    # Generate the witness
    exit_code = os.system("node cp_overall_js/generate_witness.js cp_overall_js/cp_overall.wasm input.json witness.wtns")
    assert exit_code == 0, "Command to generate witness failed"
    
    # Generate the proof
    exit_code = os.system("snarkjs groth16 prove cp_0001.zkey witness.wtns proof.json public.json")
    assert exit_code == 0, "Command to generate proof failed"
    
    # Verify the proof
    exit_code = os.system("snarkjs groth16 verify verification_key.json public.json proof.json")
    assert exit_code == 0, "Command to verify proof failed"


[351, 896, 214, 743, 622, 456, 633, 914, 312, 628, 843, 681, 226, 815, 59, 443, 413, 136, 90, 123]
0.7536813 19


commitX = 9306962408738014843631983652774330556268507108589616829005255075391422869740n
commitY = 4193145376304828915890360469546678367172693942664293715529312497109541798803n
[INFO]  snarkJS: OK!
0.76416826 15
commitX = 16158862751546388217602255173251378149796357042428839255756315937412207811007n
commitY = 2174067907008032809865251738623304556796404296173605492632400039229261449308n
[INFO]  snarkJS: OK!
0.81461835 6
commitX = 9538646144518982585317626360194714858154625090240596437847622117730448423692n
commitY = 9544835183731223085851282197041031469621576996699465508320734345813108667793n
[INFO]  snarkJS: OK!
0.5445896 10
commitX = 19323734875655470321565863674193999060758615907262441597097409907765913971449n
commitY = 498756211657252966222983869582200938849192642554128445416967442442660326916n
[INFO]  snarkJS: OK!
0.8000302 0
com

# Part 3: Valuation

After Bob has selected the points to submit, the two parties run an MPC protocol together to evaluate the dataset. Because the valuation requires another forward pass of Alice's model, and doing that inside MPC is still too expensive, Bob first sends the data points (without labels) to Alice, who runs model inference locally. Alice then sends Bob a ZKP of the inference, which Bob verifies. After that, they engage in the MPC protocol, where Alice supplies the inference results and Bob supplies the data points and labels.
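
To make the inputs to the valuation concrete, here is a plaintext sketch of two of the scores. It rests on assumptions not confirmed by this notebook: that the uncertainty score is the mean entropy of Alice's softmax outputs and that the loss score is the mean cross-entropy against Bob's one-hot labels. The authoritative definitions are in `multi_point_val.mpc`, and the function names `uncertainty_score` and `loss_score` are hypothetical.

```python
import numpy as np

def uncertainty_score(probs, eps=1e-12):
    """Mean Shannon entropy (in nats) of the predicted class distributions (assumed definition)."""
    probs = np.asarray(probs, dtype=float)
    return float(np.mean(-np.sum(probs * np.log(probs + eps), axis=1)))

def loss_score(probs, one_hot_labels, eps=1e-12):
    """Mean cross-entropy (in nats) between predictions and one-hot labels (assumed definition)."""
    probs = np.asarray(probs, dtype=float)
    one_hot_labels = np.asarray(one_hot_labels, dtype=float)
    return float(np.mean(-np.sum(one_hot_labels * np.log(probs + eps), axis=1)))

# Toy example: a 10-class model that predicts the uniform distribution for 20 points
probs = np.full((20, 10), 0.1)
labels = np.eye(10)[np.arange(20) % 10]
print(uncertainty_score(probs), loss_score(probs, labels))
```

For a uniform prediction over 10 classes both quantities equal ln 10 ≈ 2.303. The uncertainty and loss scores printed by the MPC run later in this notebook are close to that value, though this alone does not confirm the definitions assumed here.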

In [6]:
model_path = os.path.join('data','network.onnx')
compiled_model_path = os.path.join('data','network.compiled')
pk_path = os.path.join('data','test.pk')
vk_path = os.path.join('data','test.vk')
settings_path = os.path.join('data','settings.json')
cal_path = os.path.join('data',"calibration.json")
witness_path = os.path.join('data','witness.json')
data_path = os.path.join('data','input.json')
output_path = os.path.join('data','output.json')
label_path = os.path.join('data','label.json')
proof_path = os.path.join('data','test.pf')

In [7]:
#First the ZKP setup
import ezkl
import json

#Clear the data directory of previous ZKP
import os
import shutil
x = torch.randn(N, 3, 32, 32)
model = LeNet()
#Model export 
torch.onnx.export(model,               # model being run
    x,                   # model input (or a tuple for multiple inputs)
    model_path,            # where to save the model (can be a file or file-like object)
    export_params=True,        # store the trained parameter weights inside the model file
    opset_version=10,          # the ONNX version to export the model to
    do_constant_folding=True,  # whether to execute constant folding for optimization
    input_names = ['input'],   # the model's input names
    output_names = ['output'], # the model's output names
    dynamic_axes={'input' : {0 : 'batch_size'},    # variable length axes
                'output' : {0 : 'batch_size'}})

py_run_args = ezkl.PyRunArgs()
py_run_args.input_visibility = "public" #Bob can see this
py_run_args.output_visibility = "hashed" #This hash is given to Bob
py_run_args.param_visibility = "private" 

res = ezkl.gen_settings(model_path, settings_path, py_run_args=py_run_args)
assert res

indices = random.sample(range(len(trainset)), 20)
cal_images = np.array([trainset[i][0].numpy() for i in indices])

#Alice should calibrate the model with real data; here we use 20 images sampled at random from CIFAR-10
data_array = (cal_images).reshape([-1]).tolist()

data = dict(input_data = [data_array])

# Serialize data into file:
json.dump(data, open(cal_path, 'w'))

await ezkl.calibrate_settings(cal_path, model_path, settings_path, "resources")
res = ezkl.compile_circuit(model_path, compiled_model_path, settings_path)
assert res 

# srs path - This actually requires a trusted setup.
res = await ezkl.get_srs(settings_path)
res = ezkl.setup(
        compiled_model_path,
        vk_path,
        pk_path,
    )

assert res
assert os.path.isfile(vk_path)
assert os.path.isfile(pk_path)
assert os.path.isfile(settings_path)

[tensor] decomposition error: integer -1548413633 is too large to be represented by base 16384 and n 2
forward pass failed: "failed to forward: [halo2] General synthesis error"
[tensor] decomposition error: integer -6670366084 is too large to be represented by base 16384 and n 2
forward pass failed: "failed to forward: [halo2] General synthesis error"
[tensor] decomposition error: integer 284943404 is too large to be represented by base 16384 and n 2
forward pass failed: "failed to forward: [halo2] General synthesis error"
[tensor] decomposition error: integer -26732381714 is too large to be represented by base 16384 and n 2
forward pass failed: "failed to forward: [halo2] General synthesis error"
[tensor] decomposition error: integer -12705217741 is too large to be represented by base 16384 and n 2
forward pass failed: "failed to forward: [halo2] General synthesis error"
[tensor] decomposition error: integer 582333403 is too large to be represented by base 16384 and n 2
forward pass f

In [8]:
import time
#Prepare Bob's data
points_to_submit = selected_images[representative_set]
labels_to_submit = selected_labels[representative_set]
outputs = []
# for pts in points_to_submit:
data_array = [img.reshape(-1).tolist() for img in points_to_submit]
data = dict(input_data = data_array)
json.dump(data, open(data_path, 'w'))

#Alice running the model on the data points
model.eval()
inp = torch.tensor(points_to_submit)
output = model(inp)
output = output.detach().numpy()
outputs.append(output)
#Save the output
data = dict(output_data = output.tolist())
json.dump(data, open(output_path, 'w'))

#Generate witness
res = await ezkl.gen_witness(data_path, compiled_model_path, witness_path)
assert res
time.sleep(0.1)
#Prove
res = ezkl.prove(witness_path, compiled_model_path, pk_path, proof_path, "single")
assert res
time.sleep(0.1)
#Bob gets the proof then verifies it
res = ezkl.verify(proof_path, settings_path, vk_path)
assert res

In [10]:
#Prepare Alice and Bob's private input for MPC
#We use unreduced data for inference but reduced data for the diversity calculation here
points_to_submit = reduced_images[representative_set]

Bob_input = (points_to_submit, labels_to_submit)
Alice_input = outputs
os.makedirs('../MP-SPDZ/Player-Data', exist_ok=True)
p0_path = os.path.join('../MP-SPDZ/Player-Data','Input-P0-0')
p1_path = os.path.join('../MP-SPDZ/Player-Data','Input-P1-0')

#Turn points to submit into a 1D list
points_1d = np.array(points_to_submit).reshape(-1).tolist()
print(len(points_1d))
#Convert labels into one-hot encoding
one_hot_labels = np.eye(10)[labels_to_submit]
print(one_hot_labels.shape)
one_hot_labels = one_hot_labels.reshape(-1).tolist()
print(len(one_hot_labels))
with open(p0_path, 'w') as f:
    f.write(' '.join(map(lambda x : f"{x:.6f}", points_1d)))
    f.write(' ')
    f.write(' '.join(map(lambda x : f"{x:.6f}", one_hot_labels)))
    f.write('\n')
#Alice's input
outputs = np.array(outputs).reshape(-1).tolist()
print(len(outputs))
with open(p1_path, 'w') as f:
    f.write(' '.join(map(lambda x : f"{x:.6f}", outputs)))
    f.write('\n')


1000
(20, 10)
200
200
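
The input files written above are plain whitespace-separated numbers. The following sketch illustrates the layout of Bob's file under the assumption that the `.mpc` program reads the K×DIM points first and the K×10 one-hot labels second, which is the order the cell writes them in. The helper names `format_values` and `parse_bob_input` are hypothetical and not part of MP-SPDZ.

```python
import numpy as np

def format_values(values):
    """Format numbers as a space-separated string with 6 decimal places."""
    return ' '.join(f"{v:.6f}" for v in values)

def parse_bob_input(text, k, dim, n_classes):
    """Split a flat input line back into (points, one_hot_labels)."""
    flat = np.array(text.split(), dtype=float)
    assert flat.size == k * dim + k * n_classes, "unexpected number of values"
    points = flat[:k * dim].reshape(k, dim)
    labels = flat[k * dim:].reshape(k, n_classes)
    return points, labels

# Toy sizes for illustration: 2 points, 3 dimensions, 4 classes
points = np.array([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])
labels = np.eye(4)[[2, 0]]
line = format_values(points.reshape(-1)) + ' ' + format_values(labels.reshape(-1))

# Round-trip the line to confirm the layout is unambiguous
parsed_points, parsed_labels = parse_bob_input(line, k=2, dim=3, n_classes=4)
print(parsed_points.shape, parsed_labels.argmax(axis=1))
```

With the notebook's settings (K = 20, DIM = 50, 10 classes), Bob's file holds 1000 + 200 = 1200 values and Alice's file holds the 200 softmax outputs, matching the lengths printed above.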


In [11]:
#The code for valuation is prepared in ../MP-SPDZ/Programs/Source/multi_point_val.mpc
# Here we compile the MPC code
! cd ../MP-SPDZ && ./compile.py multi_point_val -R 64

Default bit length for compilation: 63
Default security parameter for compilation: 40
Compiling file Programs/Source/multi_point_val.mpc


Writing to Programs/Bytecode/multi_point_val-FPDiv(2)_31_16-1.bc
Writing to Programs/Bytecode/multi_point_val-TruncPr(20)_47_16-3.bc
Writing to Programs/Bytecode/multi_point_val-FPDiv(1)_31_16-5.bc
Writing to Programs/Bytecode/multi_point_val-TruncPr(9)_47_16-6.bc
Writing to Programs/Bytecode/multi_point_val-TruncPr(5)_47_16-7.bc
Writing to Programs/Bytecode/multi_point_val-sqrt(17)_31_16-8.bc
Writing to Programs/Bytecode/multi_point_val-sqrt(16)_31_16-10.bc
Writing to Programs/Bytecode/multi_point_val-log2_fx(6)_31_16-11.bc
Writing to Programs/Bytecode/multi_point_val-TruncPr(6)_47_16-13.bc
Writing to Programs/Bytecode/multi_point_val-FPDiv(6)_31_16-14.bc
Writing to Programs/Bytecode/multi_point_val-log2_fx(2)_31_16-15.bc
Writing to Programs/Bytecode/multi_point_val-TruncPr(2)_47_16-16.bc
Compiled 100000 lines at Tue May  6 19:18:26 2025
Writing to Programs/Bytecode/multi_point_val-log2_fx(10)_31_16-17.bc
Writing to Programs/Bytecode/multi_point_val-FPDiv(10)_31_16-18.bc
Writing to Pr

In [12]:
#MPC for squared loss
import time
import os

start = time.time()
os.system("cd ../MP-SPDZ/ && Scripts/spdz2k.sh multi_point_val")
end = time.time()
print(f"Time taken for squared loss computation: {end-start}")

Running /home/thomas/secure-data-valuation/MP-SPDZ/Scripts/../spdz2k-party.x 0 multi_point_val -pn 16948 -h localhost -N 2
Running /home/thomas/secure-data-valuation/MP-SPDZ/Scripts/../spdz2k-party.x 1 multi_point_val -pn 16948 -h localhost -N 2


Using SPDZ2k security parameter 64
Using statistical security parameter 40
Trying to run 64-bit computation


Diversity score: 0.113373
Uncertainty score: 2.30049
Loss score: 2.34714
Final Valuation: 1.66299
The following benchmarks are including preprocessing (offline phase).
Time = 93.9573 seconds 
Data sent = 16912.6 MB in ~58299 rounds (party 0 only; use '-v' for more details)
Global data sent = 33825.2 MB (all parties)
This program might benefit from some protocol options.
Consider adding the following at the beginning of your code:
	program.use_edabit(True)
Time taken for squared loss computation: 94.03466320037842


Using statistical security parameter 40
Diversity score: 0.446701
Uncertainty score: 2.30075
Loss score: 2.30194
Final Valuation: 1.74501
The following benchmarks are including preprocessing (offline phase).
Time = 1709.5 seconds 
Data sent = 46213.5 MB in ~606767 rounds (party 0 only; use '-v' for more details)
Global data sent = 92422.2 MB (all parties)
This program might benefit from some protocol options.
Consider adding the following at the beginning of your code:
        program.use_edabit(True)