#### Step 1: Initialize Model Parameters and Gradients

We'll assume three clients, each having some local model parameters and local gradients for simplicity.

Client Model Parameters (simplified)

For each client, the model parameters are initialized as random matrices. The last-layer gradients are also drawn at random, standing in for gradients that would be computed on a local batch.

In [26]:
import numpy as np

# Initialize model parameters for each client
np.random.seed(42)
clients = 3
hidden_dim = 4  # Let's assume the hidden layer size is 4 for simplicity

# Local model weights for each client
W1 = np.random.rand(hidden_dim, hidden_dim)
W2 = np.random.rand(hidden_dim, hidden_dim)
W3 = np.random.rand(hidden_dim, hidden_dim)

# Assume each client has gradients from their last layer
grad1 = np.random.rand(hidden_dim, hidden_dim)
grad2 = np.random.rand(hidden_dim, hidden_dim)
grad3 = np.random.rand(hidden_dim, hidden_dim)

# Collecting gradients from clients
gradients = [grad1, grad2, grad3]


In [27]:
print("Client Gradients:")
for i, grad in enumerate(gradients, 1):
    print(f"Client {i} Gradient:\n{grad}\n")


Client Gradients:
Client 1 Gradient:
[[0.54671028 0.18485446 0.96958463 0.77513282]
 [0.93949894 0.89482735 0.59789998 0.92187424]
 [0.0884925  0.19598286 0.04522729 0.32533033]
 [0.38867729 0.27134903 0.82873751 0.35675333]]

Client 2 Gradient:
[[0.28093451 0.54269608 0.14092422 0.80219698]
 [0.07455064 0.98688694 0.77224477 0.19871568]
 [0.00552212 0.81546143 0.70685734 0.72900717]
 [0.77127035 0.07404465 0.35846573 0.11586906]]

Client 3 Gradient:
[[0.86310343 0.62329813 0.33089802 0.06355835]
 [0.31098232 0.32518332 0.72960618 0.63755747]
 [0.88721274 0.47221493 0.11959425 0.71324479]
 [0.76078505 0.5612772  0.77096718 0.4937956 ]]



#### Step 2: Gradient Sanitization

To mimic the Exponential Moving Average (EMA) and gradient sanitization steps:

- First, we apply an EMA to each client's gradient over m iterations to smooth it (m = 2 in the code below).

- Then, we add Gaussian noise for privacy. Gradient compression is also part of sanitization, but it is not implemented in this toy example.

In [28]:
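# Function for Exponential Moving Average (EMA)
# alpha is the weight on the new gradient, so alpha=0.9 keeps the average close to the latest value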
def ema(grad, ema_prev, alpha=0.9):
    return (1 - alpha) * ema_prev + alpha * grad

# Function for adding noise
def add_noise(grad, noise_scale=0.1):
    noise = np.random.normal(0, noise_scale, grad.shape)
    return grad + noise

# Initialize the EMA of gradients
ema_gradients = [np.zeros_like(grad1) for _ in range(clients)]

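# Simulating EMA updates for a few rounds (we'll do just 2 rounds here)
for round_idx in range(2):  # avoid shadowing the built-in round()
    for i in range(clients):
        ema_gradients[i] = ema(gradients[i], ema_gradients[i])
        # Adding noise for sanitization. gradients[i] is overwritten here,
        # so the second round smooths the already-noised gradient.
        gradients[i] = add_noise(ema_gradients[i])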

# After sanitization, we get the final gradients
sanitized_gradients = gradients


In [29]:
print(sanitized_gradients)

[array([[ 0.39654162,  0.10015513,  0.93914611,  0.88975444],
       [ 0.75475388,  0.56983364,  0.52830465,  0.81147938],
       [ 0.05758737,  0.13659655, -0.13829024,  0.29499841],
       [ 0.34025333,  0.42879517,  0.61986001,  0.33506591]]), array([[ 0.34292976,  0.45251809,  0.08639302,  0.82320196],
       [-0.12894248,  0.77494528,  0.7476594 ,  0.28052138],
       [-0.11826495,  0.58075678,  0.5609824 ,  0.64794684],
       [ 0.68834829,  0.02908005,  0.24010097,  0.1638721 ]]), array([[ 0.97585706,  0.50524518,  0.50756521,  0.09788567],
       [-0.01193567,  0.3559341 ,  0.56459811,  0.87420197],
       [ 0.89703854,  0.37006446,  0.20084837,  0.57801738],
       [ 0.86976661,  0.76250275,  0.74052453,  0.28719755]])]
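
The compression step mentioned above is not part of the notebook code. As a rough illustration only, the sketch below uses top-k magnitude sparsification, which is one common way to compress gradients; this choice of technique is an assumption, not something the original method specifies. The names `top_k_sparsify`, `example_grad`, and `compressed_grad` are hypothetical.

```python
import numpy as np

def top_k_sparsify(grad, k):
    """Keep the k largest-magnitude entries of grad and zero out the rest."""
    flat = grad.ravel()
    if k <= 0:
        return np.zeros_like(grad)
    if k >= flat.size:
        return grad.copy()
    # Indices of the k entries with the largest absolute value
    keep = np.argpartition(np.abs(flat), -k)[-k:]
    compressed = np.zeros_like(flat)
    compressed[keep] = flat[keep]
    return compressed.reshape(grad.shape)

# Hypothetical stand-in for one client's sanitized gradient
rng = np.random.default_rng(0)
example_grad = rng.normal(size=(4, 4))

# Keep only 4 of the 16 entries
compressed_grad = top_k_sparsify(example_grad, k=4)
print(compressed_grad)
```

In a pipeline like the one above, such a function could be applied to each entry of `sanitized_gradients` after noise is added, though the order of noise and compression is also an assumption here.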


#### Step 3: Task Relevance Calculation

To measure task relevance between clients, we compute cosine similarity between their sanitized gradients. Cosine similarity between two vectors $\mathbf{a}_i$ and $\mathbf{a}_j$ is given by:

$$
S_{ij} = \cos(\theta_{ij}) = \frac{\mathbf{a}_i \cdot \mathbf{a}_j}{\lVert \mathbf{a}_i \rVert \,\lVert \mathbf{a}_j \rVert}
$$


In [30]:
# Function to compute cosine similarity
def cosine_similarity(a, b):
    return np.dot(a.flatten(), b.flatten()) / (np.linalg.norm(a) * np.linalg.norm(b))

# Compute the relevance matrix S based on sanitized gradients
relevance_matrix = np.zeros((clients, clients))

for i in range(clients):
    for j in range(clients):
        relevance_matrix[i][j] = cosine_similarity(sanitized_gradients[i], sanitized_gradients[j])

print("Client Relevance Matrix (S):")
print(relevance_matrix)


Client Relevance Matrix (S):
[[1.         0.61306093 0.69216238]
 [0.61306093 1.         0.63654134]
 [0.69216238 0.63654134 1.        ]]
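
The double loop can also be written as a single matrix product. The sketch below is a vectorized variant, assuming each gradient is flattened into one row; `cosine_similarity_matrix` and `demo_grads` are hypothetical names, and the random inputs are stand-ins rather than the notebook's gradients.

```python
import numpy as np

def cosine_similarity_matrix(grads):
    """Pairwise cosine similarity between a list of same-shaped arrays."""
    # One flattened gradient per row: shape (n_clients, n_params)
    flat = np.stack([g.ravel() for g in grads])
    # Scale each row to unit length
    unit = flat / np.linalg.norm(flat, axis=1, keepdims=True)
    # Dot products of unit vectors are cosine similarities
    return unit @ unit.T

# Hypothetical stand-in gradients for three clients
rng = np.random.default_rng(0)
demo_grads = [rng.random((4, 4)) for _ in range(3)]

S_demo = cosine_similarity_matrix(demo_grads)
print(S_demo)
```

The result is symmetric with ones on the diagonal, which matches the structure of the matrix printed above.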


#### Step 4: Global Model Aggregation

Now that we have the client relevance matrix, we use it to aggregate the models. The matrix is converted to weights with a row-wise softmax, and the global model is computed as a weighted sum of the local models.

In [32]:
# Initialize the local models L (here we use W1, W2, W3 as local models for simplicity)
local_models = [W1, W2, W3]

# Turn the relevance matrix into aggregation weights with a row-wise softmax
weights = np.exp(relevance_matrix / 1.0)  # Softmax temperature tau = 1.0 for simplicity
weights /= np.sum(weights, axis=1, keepdims=True)  # Normalize each row

# Start from a zero matrix so the weighted sum can be accumulated in place
global_model = np.zeros_like(W1)

# Weighted sum of local models using relevance weights
for i in range(clients):
    # Only the diagonal weight (each client's relevance to itself) is used here,
    # so the weights applied across clients do not sum to 1
    global_model += weights[i, i] * local_models[i]


print("Aggregated Global Model:")
print(global_model)


Aggregated Global Model:
[[0.31004383 1.00525158 0.88245523 0.70326347]
 [0.44765353 0.16355468 0.42864121 0.6941956 ]
 [0.49128803 0.82773403 0.10677085 0.99249143]
 [0.70079699 0.38012746 0.45944904 0.36178103]]
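
The cell above uses only the diagonal of `weights`. An alternative reading of "weighted sum using the relevance matrix" is to use each full row, which gives every client its own aggregate as a convex combination of all local models. The sketch below shows that variant; whether the original method aggregates this way is an assumption, and `models`, `row_weights`, and `personalized` are hypothetical names with random stand-in models.

```python
import numpy as np

# Relevance matrix S as printed in Step 3
S = np.array([
    [1.0, 0.61306093, 0.69216238],
    [0.61306093, 1.0, 0.63654134],
    [0.69216238, 0.63654134, 1.0],
])

# Hypothetical stand-ins for the three local models, stacked as (clients, 4, 4)
rng = np.random.default_rng(0)
models = np.stack([rng.random((4, 4)) for _ in range(3)])

# Row-wise softmax with temperature tau, so each row sums to 1
tau = 1.0
row_weights = np.exp(S / tau)
row_weights /= row_weights.sum(axis=1, keepdims=True)

# personalized[i] = sum_j row_weights[i, j] * models[j]
personalized = np.einsum("ij,jkl->ikl", row_weights, models)
print(personalized.shape)  # (3, 4, 4)
```

Because each row of `row_weights` sums to 1, every entry of an aggregated model stays within the range spanned by the corresponding entries of the local models.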


#### Step 5: PQ-LoRA Update

For simplicity, assume the PQ-LoRA matrices $P$ and $Q$ are trained locally on each client and aggregated. The layer output then combines the pretrained, local, and global paths via:

$$
h_O = W_p h_I + (1 - \beta) h_L + \beta h_G
$$

Where:
- $h_I$ is the input hidden state
- $W_p$ is the pretrained weight
- $h_L$ is the local model output
- $h_G$ is the global model output
- $\beta$ is the learnable gating parameter
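
To make the role of $\beta$ concrete, the sketch below evaluates the formula at the two extremes of the gate. The function name `gated_output` and the random inputs are hypothetical stand-ins; in practice $\beta$ would be learned rather than fixed by hand.

```python
import numpy as np

def gated_output(W_p, h_I, h_L, h_G, beta):
    """h_O = W_p h_I + (1 - beta) h_L + beta h_G"""
    return W_p @ h_I + (1.0 - beta) * h_L + beta * h_G

# Hypothetical stand-in tensors, all 4x4 for simplicity
rng = np.random.default_rng(0)
W_p = rng.random((4, 4))
h_I = rng.random((4, 4))
h_L = rng.random((4, 4))
h_G = rng.random((4, 4))

# beta = 0 uses only the local branch, beta = 1 only the global branch
only_local = gated_output(W_p, h_I, h_L, h_G, beta=0.0)
only_global = gated_output(W_p, h_I, h_L, h_G, beta=1.0)
half_and_half = gated_output(W_p, h_I, h_L, h_G, beta=0.5)
```

Values of $\beta$ between 0 and 1 interpolate linearly between these two outputs, so `half_and_half` is their average.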




In [34]:
# Assume a simple PQ-LoRA structure where P and Q are trainable low-rank matrices
# (defined for illustration only; they are not used in the computation below)
P = np.random.rand(hidden_dim, hidden_dim)
Q = np.random.rand(hidden_dim, hidden_dim)

# Assume h_I as a random hidden input state
h_I = np.random.rand(hidden_dim, hidden_dim)
print("Input Hidden State (h_I):")
print(h_I)
# Compute local output (h_L) and global output (h_G)
h_L = np.dot(W1, h_I)
h_G = global_model @ h_I
print("Local Output (h_L):")
print(h_L)
print("Global Output (h_G):")
print(h_G)
# Learnable gating parameter (beta) for blending
beta = np.random.rand(1)
print("Gating Parameter (beta):")
print(beta)

# Compute output using PQ-LoRA update; W1 stands in for the pretrained weight W_p
h_O = np.dot(W1, h_I) + (1 - beta) * h_L + beta * h_G

print("Output of PQ-LoRA:")
print(h_O)


Input Hidden State (h_I):
[[0.97585208 0.51630035 0.32295647 0.79518619]
 [0.27083225 0.43897142 0.07845638 0.02535074]
 [0.96264841 0.83598012 0.69597421 0.40895294]
 [0.17329432 0.15643704 0.2502429  0.54922666]]
Local Output (h_L):
[[1.43137677 1.31629635 0.8548087  0.95008073]
 [0.40051697 0.33308839 0.31980512 0.62749895]
 [0.96626372 0.79011757 0.50672613 1.03706697]
 [1.07666564 0.70369454 0.45794326 0.84241835]]
Global Output (h_G):
[[1.54617716 1.44908394 0.96915183 1.01916016]
 [1.01407045 0.76985295 0.62944526 0.91667897]
 [0.97837744 0.8615245  0.54627936 1.00041614]
 [1.29180747 0.96937304 0.66644816 0.95349343]]
Gating Parameter (beta):
[0.71459592]
Output of PQ-LoRA:
[[2.94478943 2.72748217 1.79132653 1.94952533]
 [1.23947675 0.97828696 0.86087782 1.46164475]
 [1.94118386 1.63126224 1.04171683 2.04794341]
 [2.30707075 1.59724186 1.06488327 1.76421051]]


#### Step 6: Weight Alignment for Heterogeneous Models

The next step is to align the $A$ and $B$ matrices of heterogeneous models across clients. This can be done using an L2 loss for $A$ and canonical correlation analysis (CCA) for $B$.

However, in this toy example, we'll skip the alignment, as the concept primarily involves adjusting matrices $A$ and $B$ based on shared public data (or pretraining data) and using optimization techniques to minimize the misalignment. You would need a dataset and further implementation to actually perform this alignment.

#### Step 7: Final Aggregated Model for Clients

Finally, after updating the local models with PQ-LoRA and performing weight alignment, the server sends the aggregated model back to the clients for further fine-tuning.

In [35]:
# Assume that each client will now fine-tune the global model using their data
# For simplicity, we stand in for fine-tuning by adding a small fraction (0.1)
# of the global model to each local model
for i in range(clients):
    local_models[i] = local_models[i] + 0.1 * global_model  # Update local model

print("Final Fine-Tuned Models for Clients:")
for i in range(clients):
    print(f"Client {i+1} Model:")
    print(local_models[i])


Final Fine-Tuned Models for Clients:
Client 1 Model:
[[0.4055445  1.05123946 0.82023947 0.66898483]
 [0.20078399 0.17234999 0.10094773 0.93559571]
 [0.65024381 0.79084598 0.03126158 1.069159  ]
 [0.90252234 0.25035186 0.22776987 0.21958261]]
Client 2 Model:
[[0.33524663 0.62528159 0.52019054 0.36155549]
 [0.65661825 0.15584933 0.33500877 0.4357814 ]
 [0.50519879 0.86794936 0.21035087 0.61348358]
 [0.66249427 0.08446316 0.65348976 0.20670223]]
Client 3 Model:
[[0.09605598 1.0494107  1.05387756 0.87872369]
 [0.34937912 0.11402758 0.72709715 0.50957205]
 [0.17116704 0.57795031 0.04506561 1.00856954]
 [0.32885968 0.70053503 0.35765598 0.55624612]]
