## Gate Value Calculation 

$\alpha(E_i) = \frac{\exp(h \cdot e_i)}{\sum_{j=0}^{N} \exp(h \cdot e_j)}$

1. **h** is the hidden representation of the input token. **h** belongs to the space $\mathbb{R}^d$, where **d** is the dimensionality of the hidden state.
2. $e_i$ is the trainable embedding for expert $E_i$, which is also in $\mathbb{R}^d$. Each expert has its own learned embedding vector.
3. The term $h \cdot e_i$ is the dot product between the hidden representation **h** and the expert embedding $e_i$, which yields a scalar score for expert $E_i$. The gate value $\alpha(E_i)$ is the softmax of these scores over all experts, so the gate values are positive and sum to 1.
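
The gate computation described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the helper name `gate_values`, the toy sizes, and the random inputs are all assumptions made for the example.

```python
import numpy as np

def gate_values(h, expert_embeddings):
    """Softmax over the dot products h . e_j, giving one gate value per expert."""
    scores = expert_embeddings @ h          # shape (num_experts,)
    scores = scores - scores.max()          # shift for numerical stability; softmax is unchanged
    weights = np.exp(scores)
    return weights / weights.sum()

rng = np.random.default_rng(0)
d, num_experts = 4, 3                       # toy sizes, chosen only for illustration
h = rng.standard_normal(d)                  # hidden representation of one token
E = rng.standard_normal((num_experts, d))   # row j is the expert embedding e_j

alpha = gate_values(h, E)
print("gate values:", alpha)
print("sum of gate values:", alpha.sum())
```

The expert whose embedding has the largest dot product with **h** receives the largest gate value, and the gate values always sum to 1.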

### Output Calculation

$o = h + \sum_{i=0}^{N} \alpha(E_i) \cdot E_i(h)$
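
As a compact way to read this formula, the sketch below computes the residual plus the gate-weighted sum in vectorized form. It assumes each expert is a low-rank map $E_i(h) = B_i A_i h$ and uses fixed stand-in gate values; the shapes, names, and numbers are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)
d, r, num_experts = 8, 2, 3                 # toy sizes, chosen only for illustration

h = rng.standard_normal(d)
alpha = np.array([0.2, 0.5, 0.3])           # stand-in gate values that sum to 1

# Hypothetical experts: E_i(h) = B_i @ (A_i @ h), a rank-r map from R^d to R^d.
A = rng.standard_normal((num_experts, r, d))
B = rng.standard_normal((num_experts, d, r))

# Row i of expert_outputs is E_i(h).
expert_outputs = np.einsum("ndr,nrk,k->nd", B, A, h)

# o = h + sum_i alpha(E_i) * E_i(h)
o = h + alpha @ expert_outputs
print(o.shape)
```

The `alpha @ expert_outputs` product is the weighted sum over experts, so the explicit loop over `i` is not needed once the expert outputs are stacked into one array.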

In [11]:
import numpy as np
np.random.seed(10)
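def softmax(x):
    # Subtract the max before exponentiating for numerical stability; the result is unchanged.
    exp_x = np.exp(x - np.max(x))
    return exp_x / np.sum(exp_x, axis=0)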

d = 64
R = 16
num_experts = 3

A_matrices = [np.random.rand(d,R) for _  in range(num_experts)]
B_matrices = [np.random.rand(R,d) for _ in range(num_experts)]

h = np.random.randn(d)

trainable_embeddings = [np.random.randn(d) for _ in range(num_experts)]

lora_outputs = []

for i in range(num_experts):
    A_i = A_matrices[i]
    B_i = B_matrices[i]

    # Low-rank update: project h down to rank R with A_i, then back up to d with B_i.
    delta_h_i = B_i.T @ (A_i.T @ h)
    lora_outputs.append(delta_h_i)

dot_products = np.array([np.dot(h, e_i) for e_i in trainable_embeddings])
gate_values = softmax(dot_products)

# Start from h so the result matches o = h + sum_i alpha(E_i) * E_i(h).
final_output = h.copy()
for i in range(num_experts):
    final_output += gate_values[i] * lora_outputs[i]

print("Hidden representation h: ", h.shape)
print("Trainable embeddings e_i:",len(trainable_embeddings), trainable_embeddings[0].shape)
print("Dot products h.e_i: ", dot_products.shape)
print("Gate values (alpha(E_i)): ", gate_values.shape)
print("Final output after applying LoRA modules and gate values: ", final_output.shape)





Hidden representation h:  (64,)
Trainable embeddings e_i: 3 (64,)
Dot products h.e_i:  (3,)
Gate values (alpha(E_i)):  (3,)
Final output after applying LoRA modules and gate values:  (64,)


### Mixture of LoRA Experts (MoLE)
MoLE extends the transformer architecture by combining outputs from multiple LoRA modules via a gating function.
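
One common form for such a gating function is a softmax with a temperature parameter. The sketch below assumes that form and uses made-up scores to show how the temperature `tau` changes the weight given to each LoRA module; it is an illustration under those assumptions, not a statement of how MoLE must be configured.

```python
import numpy as np

def softmax(x, tau=1.0):
    """Temperature-scaled softmax; tau is assumed to be positive."""
    z = np.asarray(x, dtype=float) / tau
    z = z - z.max()                      # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.5])       # illustrative gate scores, one per LoRA module
for tau in (0.1, 1.0, 10.0):
    print(f"tau={tau}: {softmax(scores, tau)}")
```

A small `tau` pushes almost all of the weight onto the top-scoring module, while a large `tau` moves the gate values toward a uniform mix.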

In [None]:
def softmax(x, tau=1.0):
    # tau is the temperature: smaller values sharpen the distribution, larger values flatten it.
    scaled_x = x / tau
    return np.exp(scaled_x) / np.sum(np.exp(scaled_x), axis=0)

d = 64
R = 16
num_experts = 3
L = 10

pretrained_params = np.random.randn(d,d)

A_matrices = [np.random.randn(d, R) for _ in range(num_experts)]
B_matrices = [np.random.randn(R, d) for _ in range(num_experts)]

