## Gate Value Calculation 

$\alpha(E_i) = \frac{\exp(h \cdot e_i)}{\sum_{j=0}^{N} \exp(h \cdot e_j)}$

1. **h** is the hidden representation of the input token. It lies in $\mathbb{R}^d$, where **d** is the dimensionality of the hidden state.
2. $e_i$ is the trainable embedding for expert $E_i$, a learned vector that also lies in $\mathbb{R}^d$.
3. The term $h \cdot e_i$ is the dot product between the hidden representation **h** and the expert embedding $e_i$, which yields a scalar.
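
The gate values are a softmax over the dot products $h \cdot e_i$. A minimal sketch of that computation follows. The helper name `gate_values`, the toy dimension `d = 8`, and the max-subtraction step (a common stability trick that leaves the softmax result unchanged) are assumptions made for illustration, not part of the original notebook.

```python
import numpy as np

def gate_values(h, expert_embeddings):
    """Softmax over the dot products h . e_i, one gate value per expert."""
    logits = expert_embeddings @ h      # shape (num_experts,)
    logits = logits - logits.max()      # shift for numerical stability; result is unchanged
    weights = np.exp(logits)
    return weights / weights.sum()

rng = np.random.default_rng(0)
h = rng.standard_normal(8)              # toy hidden state, d = 8 (assumed)
E = rng.standard_normal((3, 8))         # one embedding row per expert
alpha = gate_values(h, E)
print(alpha, alpha.sum())               # the gate values are non-negative and sum to 1
```

Because the gate values form a probability distribution over experts, they can be read as how much each expert contributes for this particular token.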

### Output Calculation

$o = h + \sum_{i=0}^{N} \alpha(E_i) \cdot E_i(h)$
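
To make the residual structure concrete, here is a small worked sketch with two toy experts. The function name `mole_output` and the two lambda "experts" are hypothetical stand-ins chosen so the result can be checked by hand; real experts would be LoRA modules.

```python
import numpy as np

def mole_output(h, experts, expert_embeddings):
    """o = h + sum_i alpha(E_i) * E_i(h), with alpha given by a softmax gate."""
    logits = expert_embeddings @ h
    logits = logits - logits.max()                  # stability shift (assumed, not in the source)
    alpha = np.exp(logits) / np.exp(logits).sum()
    mixed = sum(a * expert(h) for a, expert in zip(alpha, experts))
    return h + mixed                                # residual connection around the mixture

h = np.array([1.0, 2.0])
experts = [lambda v: 2.0 * v, lambda v: -v]         # toy experts, for illustration only
embeddings = np.zeros((2, 2))                       # equal logits, so alpha = [0.5, 0.5]
o = mole_output(h, experts, embeddings)
print(o)  # o = h + 0.5 * (2h) + 0.5 * (-h) = 1.5 * h
```

With equal gate values the two experts are simply averaged, and the input `h` is added back unchanged.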

import numpy as np
np.random.seed(10)


def softmax(x):
    return np.exp(x)/ np.sum(np.exp(x), axis=0)

d = 64
R = 16
num_experts = 3

A_matrices = [np.random.rand(d,R) for _  in range(num_experts)]
B_matrices = [np.random.rand(R,d) for _ in range(num_experts)]

h = np.random.randn(d)  # hidden representation h of the input token

trainable_embeddings = [np.random.randn(d) for _ in range(num_experts)]  # e_i for each expert.

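# E_i(h): each LoRA expert projects h down to rank R and back up to d.
lora_outputs = []

for i in range(num_experts):
    A_i = A_matrices[i]  # shape (d, R)
    B_i = B_matrices[i]  # shape (R, d)

    delta_h_i = B_i.T @ (A_i.T @ h)  # shape (d,)
    lora_outputs.append(delta_h_i)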
    

    
dot_products = np.array([np.dot(h, e_i) for e_i in trainable_embeddings])
gate_values = softmax(dot_products)

# o = h + sum_i alpha(E_i) * E_i(h): start from h so the residual term is included.
final_output = h.copy()
for i in range(num_experts):
    final_output += gate_values[i] * lora_outputs[i]
    
    
print("Hidden representation h: ", h.shape)
print("Trainable embeddings e_i:",len(trainable_embeddings), trainable_embeddings[0].shape)
print("Dot products h.e_i: ", dot_products.shape)
print("Gate values (alpha(E_i)): ", gate_values.shape)
print("Final output after applying LoRA modules and gate values: ", final_output.shape)





Hidden representation h:  (64,)
Trainable embeddings e_i: 3 (64,)
Dot products h.e_i:  (3,)
Gate values (alpha(E_i)):  (3,)
Final output after applying LoRA modules and gate values:  (64,)


### Mixture of LoRA Experts (MoLE)
MoLE extends the transformer architecture by combining outputs from multiple LoRA modules via a gating function.
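
The gating function used in this section is a softmax with a `temperature` parameter. The sketch below shows how that parameter changes the gate distribution. The logits are hypothetical values picked for illustration, and the max-subtraction step is an added stability measure rather than something the notebook does.

```python
import numpy as np

def softmax(x, temperature=1.0):
    """Temperature-scaled softmax; lower temperature gives a sharper distribution."""
    scaled = np.asarray(x, dtype=float) / temperature
    scaled = scaled - scaled.max()      # stability shift; the result is unchanged
    w = np.exp(scaled)
    return w / w.sum()

logits = np.array([2.0, 1.0, 0.0])      # hypothetical gating logits for three experts
for t in (0.1, 1.0, 10.0):
    print(t, np.round(softmax(logits, t), 3))
```

A low temperature pushes nearly all the weight onto the highest-scoring expert, while a high temperature moves the gates toward a uniform mix.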

import numpy as np

def softmax(x, temperature=1.0):
    scaled_x = x / temperature
    return np.exp(scaled_x) / np.sum(np.exp(scaled_x), axis=0)


d = 64  # hidden dimension
R = 16  # LoRA rank
num_experts = 3
L = 10  # sequence length (number of tokens)

pretrained_params = np.random.randn(d, d)


A_matrices = [np.random.randn(d, R) for _ in range(num_experts)]  
B_matrices = [np.random.randn(R, d) for _ in range(num_experts)]

trainable_embeddings = [np.random.randn(d) for _ in range(num_experts)]


x = np.random.randn(L, d)

def transformer_block(x, pretrained_params):
    # Simplified stand-in for a transformer block: a residual connection
    # around a single linear map. Attention is not implemented here.
    x_prime = x + np.dot(pretrained_params, x.T).T
    return x_prime


F_theta_x = transformer_block(x, pretrained_params)

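lora_outputs = []
for i in range(num_experts):
    A_i = A_matrices[i]  # shape (d, R)
    B_i = B_matrices[i]  # shape (R, d)

    # Transposed layout: each expert output has shape (d, L), one column per token.
    delta_h_i = np.dot(B_i.T, np.dot(A_i.T, F_theta_x.T))
    lora_outputs.append(delta_h_i)

# Standardize each feature over the sequence (axis=0), then average over tokens
# to get one d-dimensional summary vector for gating.
# Note: averaging over the same axis that was just standardized gives a vector
# that is ~0 up to floating-point error, so the gates below come out nearly
# uniform; normalizing per token (axis=1) instead would avoid this.
E_omega_x_normalize = (F_theta_x - np.mean(F_theta_x, axis=0)) / np.std(F_theta_x, axis=0)
E_flatten = np.mean(E_omega_x_normalize, axis=0)
gating_logits = np.array([np.dot(E_flatten, e_i) for e_i in trainable_embeddings])

gating_values = softmax(gating_logits, temperature=1.0)
# Accumulate the gated expert outputs in the same (d, L) layout.
final_output = np.zeros_like(F_theta_x).T
for i in range(num_experts):
    final_output += gating_values[i] * lora_outputs[i]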

print("Final output after applying LoRA modules and gating values:", final_output.shape)
print(final_output)


Final output after applying LoRA modules and gating values: (64, 10)
[[ 6.63664001e+01  1.21169162e+02  1.19005032e+02 -5.48564765e+01
   1.52995656e+02  4.38109495e+00  9.80307783e+01 -3.41103151e+01
   1.48912680e+02 -1.00108733e+01]
 [-4.15256861e+00 -2.43828317e+02  1.44777236e+02  1.62484582e+00
  -2.42556224e+01  4.45106700e+01  1.14470025e+02 -1.85887045e+02
   1.81165093e+02  3.07562093e+01]
 [-4.04077237e+00 -5.36999342e+01 -5.36575322e+01 -4.06371245e+01
   3.84304986e+01  5.67823824e+01 -1.35863398e+02  1.21899585e+02
  -4.07118474e+01  3.80647807e+01]
 [-7.46804561e+01  8.60298787e+01  1.19382769e+02  1.96895284e+02
  -7.55310386e+01  5.77029444e+01 -1.93285081e+02  2.95792889e+01
  -9.89373177e+01 -2.45987734e+02]
 [ 3.56483224e+01 -9.20338770e+01 -1.14513615e+01 -4.01326046e+01
   1.46201390e+02  3.14412101e+01  2.92663914e+02 -7.63706103e+01
   1.21275866e+02  3.21886581e+01]
 [-1.14672494e+02  3.55407112e+02 -1.08681656e+02 -4.15055169e+01
   8.53496638e+01  1.65108665e