# Malicious, multi-point scenario

In this scenario, we assume Bob has multiple data points to contribute to Alice's ML model. Alice wants to value the dataset as a whole, judging it on its diversity and uncertainty as well as on the current model's performance on it. Moreover, both parties are assumed to be malicious, meaning they may deviate from the protocol to maximize their own utility.

# Part 0: Setup

We set up Alice's model and Bob's data points as in the other examples.

In [16]:
#First, we define Alice's model M. We assume a simple CNN model.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim
import matplotlib.pyplot as plt
import os

#Don't use GPU for now
os.environ["CUDA_VISIBLE_DEVICES"] = ""

class LeNet(nn.Module):
    """
    Adaptation of LeNet that uses ReLU activations
    """

    # network architecture:
    def __init__(self):
        super(LeNet, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)
        self.act = nn.Softmax(dim=1)
        

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        x = self.act(x)
        return x
    
model = LeNet()

os.makedirs('data', exist_ok=True)
torch.save(model.state_dict(), 'data/model.pth')
torch.save(model, 'data/alice_model.pth')

#Next, we define the data loader for CIFAR-10 dataset.
import torchvision
import random
import torchvision.transforms as transforms
import numpy as np

transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

# Note: train=False loads the 10,000-image CIFAR-10 test split, despite the variable name.
trainset = torchvision.datasets.CIFAR10(root='./data', train=False, transform=transform, download=True)


# Randomly select 100 images as Bob's dataset
indices = random.sample(range(len(trainset)), 100)
selected_images = np.array([trainset[i][0].numpy() for i in indices])
selected_labels = np.array([trainset[i][1]  for i in indices])

# Save images and labels separately
# torch.save(selected_images, 'data/selected_images.pth')
# torch.save(selected_labels, 'data/selected_labels.pth')


Files already downloaded and verified


# Part 1: Clustering

Before submitting points to Alice for evaluation, Bob needs to select a subset of representative data points. To do this, we recommend K-means clustering, which yields a diverse set of points; K is the number of data points Alice wishes to check. Bob can then select the data point closest to the centroid of each cluster. It is ultimately up to Bob to decide which points to submit, even if they are not ideal, so this step does not need to be computed securely.

In [2]:
#First, we run the Kmeans clustering algorithm locally on Bob's device
from sklearn.cluster import KMeans

# Set the number of clusters
K = 10

# Reshape the images to be a 2D array (each image is flattened)
flattened_images = selected_images.reshape(selected_images.shape[0], -1)

# Perform K-means clustering
kmeans = KMeans(n_clusters=K, random_state=0).fit(flattened_images)

# Get the cluster labels
cluster_labels = kmeans.labels_

# Get the cluster centers
cluster_centers = kmeans.cluster_centers_

print("Cluster centers shape:", cluster_centers.shape)

Cluster centers shape: (10, 3072)
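
The closest-to-centroid selection described above can be illustrated on toy data. This is a minimal sketch, not part of the notebook's pipeline: the function name `closest_to_centroids` and the 2D points are hypothetical, and plain Python tuples stand in for the flattened images and the K-means centroids.

```python
import math

def closest_to_centroids(points, labels, centroids):
    """For each cluster, return the index of the member point nearest its centroid."""
    best = {}  # cluster id -> (distance, point index)
    for idx, (point, cluster) in enumerate(zip(points, labels)):
        dist = math.dist(point, centroids[cluster])
        if cluster not in best or dist < best[cluster][0]:
            best[cluster] = (dist, idx)
    return {cluster: idx for cluster, (_, idx) in best.items()}

# Toy stand-ins (assumed values) for flattened images, cluster labels and centroids
points = [(0.0, 0.0), (1.0, 1.0), (9.0, 9.0), (10.0, 10.0), (0.4, 0.4)]
labels = [0, 0, 1, 1, 0]
centroids = [(0.5, 0.5), (9.4, 9.4)]

representatives = closest_to_centroids(points, labels, centroids)
print(representatives)
```

In the notebook itself, the same idea would apply to `flattened_images`, `cluster_labels` and `cluster_centers`, with one representative per cluster.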


We further enhance pure K-means selection by picking the most uncertain point in each cluster. Since determining uncertainty requires model inference, we define a computing budget B: the number of points Bob and Alice can afford to evaluate.

In the malicious case, however, running model inference inside a secure multi-party computation (MPC) protocol is not practical because it is extremely expensive. Therefore, we allow Alice to see the data points (without labels) and run inference herself. Alice must then provide a zero-knowledge proof (ZKP) that the result is correct.
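
The modified model in the next cell scores uncertainty as the negated gap between the two largest softmax probabilities, so a larger score means a less confident prediction. The following is a small sketch of that score on hand-picked probability vectors; the helper name `negative_margin` and the example values are assumptions made for illustration.

```python
def negative_margin(probs):
    """Negated gap between the two largest class probabilities."""
    top1, top2 = sorted(probs, reverse=True)[:2]
    return -(top1 - top2)

# A confident prediction has a large gap, hence a very negative score.
confident = negative_margin([0.90, 0.05, 0.05])

# An uncertain prediction has a small gap, hence a score close to zero.
uncertain = negative_margin([0.40, 0.35, 0.25])

# Picking the maximum score therefore picks the most uncertain point.
most_uncertain = max([confident, uncertain])
```

Under this convention, selecting the point with the highest score within a cluster is the same as selecting the point whose top two classes are hardest to tell apart.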

In [3]:
#Set up

#First, a slightly modified version of the model
#that returns the negated difference between the top 2 probs of the output,
#so that a larger value means a more uncertain prediction.
class LeNetZKP(nn.Module):
    """
    LeNet variant whose output is the negated top-2 probability margin
    """

    # network architecture:
    def __init__(self):
        super(LeNetZKP, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)
        self.act = nn.Softmax(dim=1)
        

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        x = self.act(x)
        top2_values, _ = torch.topk(x, 2)
        diff = top2_values[:, 0] - top2_values[:, 1]
        return -diff

modelZK = LeNetZKP()
x = torch.randn(1, 3, 32, 32)
y = modelZK(x)
print(y)

tensor([-0.0016], grad_fn=<NegBackward0>)


In [4]:
#Create directory if not exists
os.makedirs('dataZKP', exist_ok=True)
#Specifying some path parameters
model_path = os.path.join('dataZKP','network.onnx')
compiled_model_path = os.path.join('dataZKP','network.compiled')
pk_path = os.path.join('dataZKP','test.pk')
vk_path = os.path.join('dataZKP','test.vk')
settings_path = os.path.join('dataZKP','settings.json')

witness_path = os.path.join('dataZKP','witness.json')
data_path = os.path.join('dataZKP','input.json')
output_path = os.path.join('dataZKP','output.json')
label_path = os.path.join('dataZKP','label.json')
proof_path = os.path.join('dataZKP','test.pf')
cal_path = os.path.join('dataZKP',"calibration.json")

#Model export 
torch.onnx.export(modelZK,               # model being run
    x,                   # model input (or a tuple for multiple inputs)
    model_path,            # where to save the model (can be a file or file-like object)
    export_params=True,        # store the trained parameter weights inside the model file
    opset_version=10,          # the ONNX version to export the model to
    do_constant_folding=True,  # whether to execute constant folding for optimization
    input_names = ['input'],   # the model's input names
    output_names = ['output'], # the model's output names
    dynamic_axes={'input' : {0 : 'batch_size'},    # variable length axes
                'output' : {0 : 'batch_size'}})

In [5]:
#Settings, calibration
import ezkl
import json 
py_run_args = ezkl.PyRunArgs()
py_run_args.input_visibility = "public" #Bob can see this
py_run_args.output_visibility = "public" #This is also public as Bob needs to know the uncertainty
py_run_args.param_visibility = "private" 

res = ezkl.gen_settings(model_path, settings_path, py_run_args=py_run_args)
assert res

indices = random.sample(range(len(trainset)), 10)
cal_images = np.array([trainset[i][0].numpy() for i in indices])

#Alice should use some real data to calibrate the model; here we use 10 randomly sampled CIFAR-10 images
data_array = (cal_images).reshape([-1]).tolist()

data = dict(input_data = [data_array])

# Serialize data into file:
with open(cal_path, 'w') as f:
    json.dump(data, f)

await ezkl.calibrate_settings(cal_path, model_path, settings_path, "resources")
res = ezkl.compile_circuit(model_path, compiled_model_path, settings_path)
assert res 

# Get the SRS (structured reference string); in a real deployment this requires a trusted setup.
res = await ezkl.get_srs(settings_path)
res = ezkl.setup(
        compiled_model_path,
        vk_path,
        pk_path,
    )

assert res
assert os.path.isfile(vk_path)
assert os.path.isfile(pk_path)
assert os.path.isfile(settings_path)

[tensor] decomposition error: integer -1347300901 is too large to be represented by base 16384 and n 2
forward pass failed: "failed to forward: [halo2] General synthesis error"
[tensor] decomposition error: integer -2384205769 is too large to be represented by base 16384 and n 2
forward pass failed: "failed to forward: [halo2] General synthesis error"
[tensor] decomposition error: integer 339451904 is too large to be represented by base 16384 and n 2
forward pass failed: "failed to forward: [halo2] General synthesis error"
[tensor] decomposition error: integer -16782712372 is too large to be represented by base 16384 and n 2
forward pass failed: "failed to forward: [halo2] General synthesis error"
[tensor] decomposition error: integer -8896388413 is too large to be represented by base 16384 and n 2
forward pass failed: "failed to forward: [halo2] General synthesis error"
[tensor] decomposition error: integer 280106447 is too large to be represented by base 16384 and n 2
forward pass fa

In [8]:
# Wrap the ZKP prove and verify process into one function for convenience
import time

def prove_and_verify(prover_input: torch.Tensor):
    print(prover_input) # I do not know why, but this line avoided errors in verifying the proof
    # Serialize the prover input into a file
    data_array = prover_input.detach().numpy().reshape([-1]).tolist()
    data = dict(input_data = [data_array])
    with open(data_path, 'w') as f:  # close the file so it is flushed before ezkl reads it
        json.dump(data, f)
    
    # This simulates Alice running the model inference
    modelZK.eval()
    output = modelZK(prover_input)
    print(output)
    time.sleep(0.1)
    
    #Generate witness
    res = ezkl.gen_witness(data_path, compiled_model_path, witness_path)
    assert res
    print("Done witness")
    time.sleep(0.1)

    # Prove
    res = ezkl.prove(witness_path, compiled_model_path, pk_path, proof_path, "single")
    assert res
    print("Done proving")
    time.sleep(0.1)

    # Verify
    res = ezkl.verify(proof_path, settings_path, vk_path)
    assert res
    print("Done verifying")
    return output.detach().item()

In [9]:
TOTAL_BUDGET = 20 #With K = 10 this means a maximum of 2 queries per cluster
budget = TOTAL_BUDGET // K
# Loop through each cluster
points_to_submit = []
labels_to_submit = []
for cluster_idx in range(K):
    # Get the indices of points in the current cluster
    cluster_points_indices = np.where(cluster_labels == cluster_idx)[0]
    
    # Get the points in the current cluster
    cluster_points = flattened_images[cluster_points_indices]
    
    # Calculate the distance of each point to the cluster center
    distances = np.linalg.norm(cluster_points - cluster_centers[cluster_idx], axis=1)
    
    # Sort the points by distance (from closest to furthest)
    sorted_indices = np.argsort(distances)
    
    used_budget = 0
    max_uncertainty = -999
    best_point = None
    best_label = None
    for idx in sorted_indices:
        if used_budget >= budget:
            if best_point is None:
                best_point = cluster_points[idx].reshape(3,32,32) #Simply choose the point closest if no point can be queried.
                best_label = selected_labels[cluster_points_indices[idx]]
            break
        print(f"Point index: {cluster_points_indices[idx]}, Distance: {distances[idx]}")
        point = cluster_points[idx].reshape(3,32,32)
        label = selected_labels[cluster_points_indices[idx]]
        #Reshape the point back to input shape
        point_tensor = torch.tensor(point).unsqueeze(0)
        answer = prove_and_verify(point_tensor)
        if answer > max_uncertainty:
            max_uncertainty = answer
            best_point = point
            best_label = label
        used_budget += 1
    points_to_submit.append(best_point)    
    labels_to_submit.append(best_label)
assert len(points_to_submit) == K

Point index: 95, Distance: 15.554551124572754
tensor([[[[-0.8588, -0.6000, -0.2000,  ..., -0.7412, -0.4667, -0.4196],
          [-0.8039, -0.4745, -0.0745,  ..., -0.7255, -0.6235, -0.4902],
          [-0.5922, -0.3647, -0.2000,  ..., -0.7490, -0.7647, -0.6392],
          ...,
          [ 0.6000,  0.3725,  0.6157,  ...,  0.4118,  0.5216,  0.6706],
          [ 0.5294,  0.4353,  0.6392,  ...,  0.4824,  0.6314,  0.6314],
          [ 0.5843,  0.3961,  0.3961,  ...,  0.4824,  0.6157,  0.4745]],

         [[-0.8902, -0.6078, -0.1529,  ..., -0.6706, -0.3961, -0.3020],
          [-0.8588, -0.5137, -0.1059,  ..., -0.6863, -0.5451, -0.3725],
          [-0.6627, -0.4431, -0.3176,  ..., -0.7176, -0.6706, -0.5216],
          ...,
          [ 0.3569,  0.0824,  0.4353,  ...,  0.2863,  0.2784,  0.5843],
          [ 0.0588, -0.0275,  0.3647,  ...,  0.1843,  0.2549,  0.4353],
          [ 0.3255,  0.2784,  0.3647,  ...,  0.1922,  0.3961,  0.3569]],

         [[-0.8902, -0.8039, -0.6941,  ..., -0.9059, -0.

Done proving
Done verifying
Point index: 15, Distance: 15.8023042678833
tensor([[[[-0.8980, -0.8980, -0.8980,  ..., -0.8902, -0.6314, -0.5294],
          [-0.8980, -0.8980, -0.8980,  ..., -0.8902, -0.6392, -0.5294],
          [-0.8980, -0.8980, -0.8980,  ..., -0.8902, -0.6314, -0.5216],
          ...,
          [ 0.2235,  0.2157,  0.2314,  ...,  0.3569,  0.2627,  0.3020],
          [ 0.2392,  0.2784,  0.3333,  ...,  0.3490,  0.3333,  0.3255],
          [ 0.2863,  0.2863,  0.3333,  ...,  0.4039,  0.3961,  0.3255]],

         [[-0.8510, -0.8510, -0.8510,  ..., -0.8275, -0.6314, -0.5686],
          [-0.8510, -0.8510, -0.8510,  ..., -0.8275, -0.6314, -0.5686],
          [-0.8510, -0.8510, -0.8510,  ..., -0.8275, -0.6314, -0.5608],
          ...,
          [ 0.3020,  0.2941,  0.3098,  ...,  0.4667,  0.3804,  0.4196],
          [ 0.3333,  0.3647,  0.4196,  ...,  0.4667,  0.4510,  0.4510],
          [ 0.3725,  0.3725,  0.3961,  ...,  0.5059,  0.4980,  0.4431]],

         [[-0.8824, -0.8824, -

# Part 2: Valuation

After Bob has selected the points to submit, the two parties run an MPC protocol together to evaluate the dataset. Alice needs to run another pass of her model for the valuation, and doing that inside MPC is still too expensive, so Bob first sends the data points to Alice (without labels) and lets Alice run model inference locally. Alice then sends a ZKP for Bob to verify. After that, they engage in the MPC protocol, where Alice supplies the inference results and Bob supplies the data points and labels.
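
The setup below marks the model output as hashed, so Bob receives a hash of Alice's inference result rather than the result itself. One way to read this is that the hash binds Alice to the value she later supplies to the MPC. The sketch below illustrates that binding only. It is an assumption about how the hash could be used, since this notebook does not show the check, and SHA-256 over a JSON encoding is a stand-in for whatever hash the proof system actually uses. The helper names `commit` and `check_input` are hypothetical.

```python
import hashlib
import json

def commit(output):
    """Hash a list of floats (SHA-256 here is an illustrative stand-in)."""
    encoded = json.dumps(output).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()

def check_input(supplied_output, expected_hash):
    """Check that a supplied output matches the previously published hash."""
    return commit(supplied_output) == expected_hash

# Alice runs inference locally and publishes only the hash alongside her proof.
alice_output = [0.1, 0.7, 0.2]
published_hash = commit(alice_output)

# Later, the value Alice feeds into the joint computation is checked against it.
honest = check_input(alice_output, published_hash)
tampered = check_input([0.1, 0.2, 0.7], published_hash)
```

Here `honest` passes and `tampered` fails, which is the property that would stop Alice from proving one output and supplying a different one.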

In [20]:
#First the ZKP setup
import ezkl
import json

#Clear the data directory of previous ZKP artifacts
import os
import shutil
shutil.rmtree('dataZKP', ignore_errors=True)
os.makedirs('dataZKP', exist_ok=True)
x = torch.randn(1, 3, 32, 32)
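model = LeNet()
#Reload the weights saved in Part 0 so that the valuation uses Alice's model M
model.load_state_dict(torch.load('data/model.pth'))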
#Model export 
torch.onnx.export(model,               # model being run
    x,                   # model input (or a tuple for multiple inputs)
    model_path,            # where to save the model (can be a file or file-like object)
    export_params=True,        # store the trained parameter weights inside the model file
    opset_version=10,          # the ONNX version to export the model to
    do_constant_folding=True,  # whether to execute constant folding for optimization
    input_names = ['input'],   # the model's input names
    output_names = ['output'], # the model's output names
    dynamic_axes={'input' : {0 : 'batch_size'},    # variable length axes
                'output' : {0 : 'batch_size'}})

py_run_args = ezkl.PyRunArgs()
py_run_args.input_visibility = "public" #Bob can see this
py_run_args.output_visibility = "hashed" #This hash is given to Bob
py_run_args.param_visibility = "private" 

res = ezkl.gen_settings(model_path, settings_path, py_run_args=py_run_args)
assert res

indices = random.sample(range(len(trainset)), 10)
cal_images = np.array([trainset[i][0].numpy() for i in indices])

#Alice should use some real data to calibrate the model; here we use 10 randomly sampled CIFAR-10 images
data_array = (cal_images).reshape([-1]).tolist()

data = dict(input_data = [data_array])

# Serialize data into file:
with open(cal_path, 'w') as f:
    json.dump(data, f)

await ezkl.calibrate_settings(cal_path, model_path, settings_path, "resources")
res = ezkl.compile_circuit(model_path, compiled_model_path, settings_path)
assert res 

# Get the SRS (structured reference string); in a real deployment this requires a trusted setup.
res = await ezkl.get_srs(settings_path)
res = ezkl.setup(
        compiled_model_path,
        vk_path,
        pk_path,
    )

assert res
assert os.path.isfile(vk_path)
assert os.path.isfile(pk_path)
assert os.path.isfile(settings_path)

[tensor] decomposition error: integer -591153822 is too large to be represented by base 16384 and n 2
forward pass failed: "failed to forward: [halo2] General synthesis error"
[tensor] decomposition error: integer 8338085585 is too large to be represented by base 16384 and n 2
forward pass failed: "failed to forward: [halo2] General synthesis error"
[tensor] decomposition error: integer 450477056 is too large to be represented by base 16384 and n 2
forward pass failed: "failed to forward: [halo2] General synthesis error"
[tensor] decomposition error: integer -1093363577 is too large to be represented by base 16384 and n 2
forward pass failed: "failed to forward: [halo2] General synthesis error"
[tensor] decomposition error: integer -4626530856 is too large to be represented by base 16384 and n 2
forward pass failed: "failed to forward: [halo2] General synthesis error"
[tensor] decomposition error: integer 404529974 is too large to be represented by base 16384 and n 2
forward pass faile

In [23]:
#Prepare Bob's data
outputs = []
for pts in points_to_submit:
    data_array = pts.reshape([-1]).tolist()
    data = dict(input_data = [data_array])
    with open(data_path, 'w') as f:  # close the file so it is flushed before ezkl reads it
        json.dump(data, f)

    #Alice running the model on the data points
    model.eval()
    inp = torch.tensor(pts).unsqueeze(0)
    output = model(inp)
    output = output.detach().numpy()
    outputs.append(output)
    #Save the output
    data = dict(output_data = output.tolist())
    with open(output_path, 'w') as f:
        json.dump(data, f)

    #Generate witness
    res = ezkl.gen_witness(data_path, compiled_model_path, witness_path)
    assert res

    #Prove
    res = ezkl.prove(witness_path, compiled_model_path, pk_path, proof_path, "single")
    assert res

    #Bob gets the proof then verifies it
    res = ezkl.verify(proof_path, settings_path, vk_path)
    assert res

In [24]:
#Prepare Alice and Bob's private input for MPC
Bob_input = (points_to_submit, labels_to_submit)
Alice_input = outputs