# Deployment Optimization: Pruning

Original Notebook: [6_3_pruning_structured_llama3.2-1b_OK.ipynb](https://github.com/peremartra/Large-Language-Model-Notebooks-Course/blob/main/6-PRUNING/6_3_pruning_structured_llama3.2-1b_OK.ipynb) by [Pere Martra](https://www.linkedin.com/in/pere-martra/)

**Model:** meta-llama/Llama-3.2-1B

 - Request access to the model:
    https://huggingface.co/meta-llama/Llama-3.2-1B

 - Check the status of the access request:
    https://huggingface.co/settings/gated-repos

# Install libraries & Configure variables

**Note:** `sys.executable` ensures that pip installs into the same Python environment that Jupyter is using.

`sentencepiece` is required for the LLaMA tokenizer.

In [None]:
import sys
!{sys.executable} -m pip install datasets transformers torch sentencepiece tqdm

In [4]:
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from torch import nn
from torch.utils.data import DataLoader
import os
from tqdm import tqdm

In [5]:
# Check if GPU is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

Using device: cuda


## Log in to Hugging Face

In [6]:
from getpass import getpass
hf_token = getpass("Hugging Face: ")

Hugging Face:  ········


In [7]:
!hf auth login --token $hf_token

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `hf`CLI if you want to set the git credential as well.
Token is valid (permission: write).
The token `llama_token` has been saved to /root/.cache/huggingface/stored_tokens
Your token has been saved to /root/.cache/huggingface/token
Login successful.
The current active token is: `llama_token`


# Download Model

In [8]:
model_name = 'meta-llama/Llama-3.2-1B'
model = AutoModelForCausalLM.from_pretrained(model_name, dtype=torch.float16).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name)
#tokenizer.pad_token = tokenizer.eos_token  # Set pad token

In [9]:
# Use the end-of-sequence token as the padding token
tokenizer.pad_token = tokenizer.eos_token

In [10]:
def get_output(prompt, model=model, tokenizer=tokenizer):
    inputs = tokenizer(prompt, return_tensors='pt').to(device)
    outputs = model.generate(
        inputs['input_ids'],
        attention_mask=inputs['attention_mask'],  # Marks real tokens vs. padding
        max_length=50,
        num_return_sequences=1,
        pad_token_id=tokenizer.pad_token_id,
        temperature=None,
        top_p=None,
        do_sample=False,          # Disable sampling
        num_beams=5,              # Use beam search
        early_stopping=True,      # Stop beam search once num_beams finished candidates exist
        no_repeat_ngram_size=2    # Prevent repetition of 2-grams
    )
    generated = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return generated

## Study the Model Structure


**Note:** to prune a model, it is necessary to understand the model architecture.

In [11]:
print(model)

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 2048)
    (layers): ModuleList(
      (0-15): 16 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (up_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (down_proj): Linear(in_features=8192, out_features=2048, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((2048,), eps=1e-05)
    (ro

**Pruning Llama**: in the author's experiments, pruning the **attention** or **output** blocks did not help. The best results came from pruning the **MLP** block.

An **MLP block** typically consists of layers that scale the data to larger dimensions and others that return it to its original size.
- It gives more capacity to the model (to solve more complex problems).

In the MLP block of the model, there are two projection layers, `gate_proj` and `up_proj`, both scaling from 2048 to 8192. Having two layers that project to the same intermediate size suggests a gating mechanism. A gating mechanism selectively controls information flow in a neural network by using learned weights to "gate" or filter inputs.
- However, to truly understand how these layers function, it is necessary to refer to the model's documentation or even its source code. Nevertheless, this structure usually indicates that the layers performing the upsizing work in pairs and cannot be treated as independent linear layers.
    - In other words, any operation applied to one layer must be replicated in the other. Most importantly, when identifying which neurons are more or less important, do not evaluate the neurons of a single layer in isolation; instead, treat them as pairs.
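
The pairing can be made concrete with a minimal sketch of a gated MLP forward pass. It assumes the SiLU-gated form `down_proj(act_fn(gate_proj(x)) * up_proj(x))`, which is consistent with the layer names and activation in the printed structure but should be confirmed against the model's source code. The tiny dimensions and the class name `TinyGatedMLP` are hypothetical.

```python
import torch
from torch import nn

class TinyGatedMLP(nn.Module):
    """Hypothetical, scaled-down stand-in for a gated MLP block."""
    def __init__(self, hidden_size=4, intermediate_size=8):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)
        self.act_fn = nn.SiLU()

    def forward(self, x):
        # Neuron j of gate_proj is multiplied elementwise with neuron j of up_proj,
        # so the two rows act as a single unit.
        return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))

torch.manual_seed(0)
mlp = TinyGatedMLP()
x = torch.randn(2, 4)
keep = torch.tensor([0, 2, 3, 5, 7])  # arbitrary neuron pairs to keep

with torch.no_grad():
    # Reference: zero the dropped neurons' intermediate activations in the full block.
    hidden = mlp.act_fn(mlp.gate_proj(x)) * mlp.up_proj(x)
    mask = torch.zeros(8)
    mask[keep] = 1.0
    reference = mlp.down_proj(hidden * mask)

    # Pruned: the same rows of gate_proj and up_proj, the matching columns of down_proj.
    gate_w = mlp.gate_proj.weight[keep, :]
    up_w = mlp.up_proj.weight[keep, :]
    down_w = mlp.down_proj.weight[:, keep]
    pruned = (nn.functional.silu(x @ gate_w.T) * (x @ up_w.T)) @ down_w.T

outputs_match = torch.allclose(reference, pruned, atol=1e-5)
print(outputs_match)
```

Under this assumption, removing row `j` from only one of the two projections would misalign the elementwise product, which is why the same indices must be applied to both layers (and to the columns of `down_proj`).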



In [12]:
# Test the original model
prompt = "Vienna is the capital of"
generated = get_output(prompt)
print(f"Generated text: {generated}")

Generated text: Vienna is the capital of Austria and one of the most beautiful cities in the world. The city is located on the Danube River and has a population of over 1.8 million people. Vienna is known for its rich history, beautiful


In [13]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters())

In [14]:
original_param_count = count_parameters(model)
print(f"Original model parameters: {original_param_count}")

Original model parameters: 1235814400
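
As a cross-check, this total can be reproduced by hand from the shapes in the printed structure. The sketch below rests on one assumption that the truncated printout does not show: the output head shares its weights with `embed_tokens`, so `model.parameters()` counts that matrix only once. The variable names are illustrative.

```python
hidden, intermediate, vocab, n_layers = 2048, 8192, 128256, 16
kv_out = 512  # out_features of k_proj and v_proj

embed = vocab * hidden
attention = 2 * hidden * hidden + 2 * hidden * kv_out  # q_proj, o_proj + k_proj, v_proj
mlp = 3 * hidden * intermediate                        # gate_proj, up_proj, down_proj
norms = 2 * hidden                                     # two RMSNorm weights per layer
per_layer = attention + mlp + norms

total = embed + n_layers * per_layer + hidden          # + final norm
mlp_share = n_layers * mlp / total

print(total)  # 1235814400
print(f"MLP share of parameters: {mlp_share:.1%}")
```

The MLP blocks account for roughly two thirds of the parameters, which helps explain why they are an attractive pruning target.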


# Pruning
## Support Pruning Functions
### Compute Neuron Importance Functions

Three functions are used to calculate neuron importance, in order to decide which ones to eliminate.
- All three functions take into account that the layers should be treated as pairs, considering both layers to calculate neuron importance.

The results obtained with each function have been quite different:
- **Product of Norms**
- **Variance of weights**
- **Maximum absolute weight**

**Observation:** the **Absolute Maximum** calculation has worked the best. I'd say the other methods for selecting neurons to remove have severely degraded the model, or at least eliminated a significant portion of the base model's capabilities.

*I’m leaving the others in the notebook purely as an exercise.*

The **Maximum Absolute Weight** method likely works better because it scores neurons by the magnitude of their strongest connections. Neurons with large-magnitude weights are likely responsible for key decisions, so keeping them preserves more of the model's behavior after pruning.
- The **Variance of Weights** method, while useful in some contexts, can retain neurons that may not contribute significantly to the task, leading to less coherent model outputs.

**Warning:** do not fall into the trap of assuming that this neuron selection method will work best across all model structures. It works well with *Llama* models, and this may be due to several factors:
- The relatively large projection from 2048 to 8192.
- The use of a GLU structure.
- The type of activation function used.

**Note:** when using a model from another family (e.g., Gemma or Mistral), the neuron-selection method might need to be entirely different.
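
To illustrate how the three criteria can disagree, here is a toy comparison (not a benchmark). It mirrors the formulas used in the notebook cells on a made-up 3-neuron weight matrix; the function names `product_of_norms`, `variance_of_weights`, and `max_abs_weight` are hypothetical, and the same matrix stands in for both `gate_proj` and `up_proj` for brevity.

```python
import torch

# Toy weights, one row per neuron.
weights = torch.tensor([
    [1.0,  1.0, 1.0, 1.0],   # neuron 0: many moderate weights
    [2.5,  0.0, 0.0, 0.0],   # neuron 1: a single large weight
    [1.5, -1.5, 0.0, 0.0],   # neuron 2: two opposite-sign weights
])

def product_of_norms(gate, up):
    # Product of the rows' L1 norms.
    return gate.abs().sum(dim=1) * up.abs().sum(dim=1)

def variance_of_weights(gate, up):
    return torch.var(gate, dim=1) + torch.var(up, dim=1)

def max_abs_weight(gate, up):
    # Same formula as the selected cell: largest weight plus |smallest weight|.
    def score(w):
        return w.max(dim=1).values + w.min(dim=1).values.abs()
    return score(gate) + score(up)

rankings = {
    fn.__name__: torch.argsort(fn(weights, weights), descending=True).tolist()
    for fn in (product_of_norms, variance_of_weights, max_abs_weight)
}
for name, order in rankings.items():
    print(f"{name}: most to least important = {order}")
```

On this toy input the three criteria produce three different rankings, so the choice of criterion directly changes which neurons survive pruning.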



In [15]:
#****DISCARDED****
"""
Product of Norms

Note: since the GLU multiplies the outputs of gate_proj and up_proj,
compute the product of their weight norms to better represent the importance of the neuron pair.

Sample output:
'Paris is the capital of of of of the of the the the the to to to from to from 
from from to to from to France France France France France France France France
France France France France France France France France All All All'
"""

def compute_neuron_pair_importance(gate_weight, up_weight):

    gate_norms = torch.norm(gate_weight, p=1, dim=1)
    up_norms = torch.norm(up_weight, p=1, dim=1)
    importance_scores = gate_norms * up_norms
    return importance_scores

In [16]:
#****DISCARDED****
"""
Variance of Weights

Note: neurons with higher weight variance may contribute more to the model's output.

Sample output:
'Paris is the capital of the French Republic. It is also a Paris is the
capital of the French Republic. It is also a Germany is the German Republic.
It is also a of the Austrian Republic. It is also a'
"""

def compute_neuron_pair_importance(gate_weight, up_weight):
    gate_variance = torch.var(gate_weight, dim=1)
    up_variance = torch.var(up_weight, dim=1)
    importance_scores = gate_variance + up_variance
    return importance_scores

In [17]:
#****SELECTED****
"""
Maximum Absolute Weight

Note: the maximum absolute weight in a neuron might indicate its significance.

Sample output:
'Paris is the capital of France. It is also one of the most beautiful cities in
the world. There is so much to see and do in Paris that it is impossible to cover
it all in one day. However, there are a few things you should not miss while you'
"""

def compute_neuron_pair_importance(gate_weight, up_weight):
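  """
  Compute neuron pair importance scores (Maximum Absolute Weight).

  Each row is scored as its largest weight plus the magnitude of its
  smallest weight; the gate_proj and up_proj scores are then summed.

  Args:
  - gate_weight: Weight matrix from the gate_proj layer.
  - up_weight: Weight matrix from the up_proj layer.

  Returns:
  - importance_scores: Importance scores for each neuron pair.
  """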

  gate_max_abs = torch.max(gate_weight, dim=1).values + torch.abs(torch.min(gate_weight, dim=1).values)
  up_max_abs = torch.max(up_weight, dim=1).values + torch.abs(torch.min(up_weight, dim=1).values)
  importance_scores = gate_max_abs + up_max_abs
  return importance_scores

In [18]:
# Prunes a specific percentage of neurons from the MLP (feed-forward layers)
def prune_neuron_pairs(mlp, prune_percent):
    """
    Reduces the dimensions of the gate_proj, up_proj, and down_proj
    layers by removing the least important neurons.

    Args:
    - mlp: MLP block to prune.
    - prune_percent: Fraction of neurons to prune (e.g., 0.2 for 20%).

    Returns:
    - new_gate_proj, new_up_proj, new_down_proj: New pruned layers.
    - keep: New intermediate size.

    """
    # Extract the weights from the MLP layers
    # these weights are used to calculate each neuron's importance score in the next step.
    gate_weight = mlp.gate_proj.weight.data.float()
    up_weight = mlp.up_proj.weight.data.float()

    # Compute importance scores
    # Neurons with higher importance scores are considered more important and less likely to be pruned.
    importance_scores = compute_neuron_pair_importance(gate_weight, up_weight)

    # Store the original number of neurons in the intermediate layer.
    original_intermediate_size = gate_weight.size(0)
    # Computes the number of neurons to prune.
    # num_neuron_pairs_to_prune = prune_percent * original_intermediate_size
    num_neuron_pairs_to_prune = min(int(prune_percent * original_intermediate_size), original_intermediate_size - 1)
    # Calculate the number of neurons to keep: the new intermediate size.
    keep = original_intermediate_size - num_neuron_pairs_to_prune

    # Check that there is no big error calculating keep, i.e., do not prune all the neurons!
    if keep <= 0:
        raise ValueError(f"Invalid number of neuron pairs to keep: {keep}. Adjust the prune_percent.")

    # Select the neurons to keep by obtaining their indices.
    # topk returns (values, indices); only the indices of the highest scores are needed.
    _, indices_to_keep = torch.topk(importance_scores, keep, largest=True, sorted=True)
    # topk orders the indices by importance; sort them to restore the neurons' original order.
    indices_to_keep = indices_to_keep.sort().values

    # Create the new layers
    new_gate_proj = nn.Linear(mlp.gate_proj.in_features, keep, bias=False).to(device)
    new_up_proj = nn.Linear(mlp.up_proj.in_features, keep, bias=False).to(device)
    new_down_proj = nn.Linear(keep, mlp.down_proj.out_features, bias=False).to(device)

    # Copy weights to the new layers
    new_gate_proj.weight.data = mlp.gate_proj.weight.data[indices_to_keep, :]
    new_up_proj.weight.data = mlp.up_proj.weight.data[indices_to_keep, :]
    new_down_proj.weight.data = mlp.down_proj.weight.data[:, indices_to_keep]

    # Return new layers and intermediate size
    return new_gate_proj, new_up_proj, new_down_proj, keep
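
The index selection in `prune_neuron_pairs` can be traced on a small example. This is a sketch with made-up importance scores and toy weight matrices, intended only to show what `torch.topk` followed by `sort()` returns and how the indices slice rows versus columns.

```python
import torch

importance_scores = torch.tensor([0.9, 0.1, 0.5, 0.7, 0.2, 0.3])  # made-up scores
prune_percent = 0.5

n = importance_scores.numel()
num_to_prune = min(int(prune_percent * n), n - 1)  # never prune every neuron
keep = n - num_to_prune

# topk returns indices ordered from most to least important.
_, by_importance = torch.topk(importance_scores, keep)
# Sorting the indices restores the neurons' original order.
indices_to_keep = by_importance.sort().values

gate_weight = torch.arange(12.0).reshape(6, 2)  # 6 neurons, hidden size 2
down_weight = torch.arange(12.0).reshape(2, 6)  # hidden size 2, 6 neurons
pruned_gate = gate_weight[indices_to_keep, :]   # keep rows
pruned_down = down_weight[:, indices_to_keep]   # keep the matching columns

print(by_importance.tolist())    # [0, 3, 2]
print(indices_to_keep.tolist())  # [0, 2, 3]
print(tuple(pruned_gate.shape), tuple(pruned_down.shape))  # (3, 2) (2, 3)
```

Because the same indices are applied to all three layers, the sort does not change the block's output; it only keeps the surviving neurons in their original layout.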


# Prune Loop

The `update_model` function iterates through the 16 blocks of the model's Transformer structure. Each `LlamaDecoderLayer` block contains an attention component (`LlamaSdpaAttention` in the listing below, `LlamaAttention` in the output printed earlier) and a `LlamaMLP` component. The latter contains the MLP layers that are the target of the pruning process.

```
(layers): ModuleList(
      (0-15): 16 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (up_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (down_proj): Linear(in_features=8192, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
      )
  )    
```

The layers that will undergo the removal of neurons identified as less useful are:

```
(gate_proj): Linear(in_features=2048, out_features=8192, bias=False)
(up_proj): Linear(in_features=2048, out_features=8192, bias=False)
(down_proj): Linear(in_features=8192, out_features=2048, bias=False)
```

The neurons are removed in the `prune_neuron_pairs` function based on the values returned by `compute_neuron_pair_importance`.

In [19]:
# Iterates through the model layers and applies pruning.
def update_model(model, prune_percent):
    """
    Modifies each MLP block in the model to retain only the most
    important neurons, creating smaller versions of each pruned layer.

    Args:
    - model: Model to prune.
    - prune_percent: Fraction of neurons to prune (e.g., 0.2 for 20%).

    Returns:
    - model: The pruned model (modified in place).
    """
    new_intermediate_size = None

    # Loop over each model layer.
    for idx, layer in enumerate(model.model.layers):
        # Each layer is a LlamaDecoderLayer containing several components: attention, MLP, and layer norms.
        # Target the MLP component by accessing layer.mlp.
        mlp = layer.mlp

        # Call prune_neuron_pairs with the MLP and receive the pruned layers.
        new_gate_proj, new_up_proj, new_down_proj, new_size = prune_neuron_pairs(mlp, prune_percent)

        # Replace the original layers with the pruned layers.
        mlp.gate_proj = new_gate_proj
        mlp.up_proj = new_up_proj
        mlp.down_proj = new_down_proj

        # new_intermediate_size only needs to be set once
        if new_intermediate_size is None:
            new_intermediate_size = new_size

    # Update the model config so it matches the new intermediate size.
    model.config.intermediate_size = new_intermediate_size

    return model
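
Since `update_model` writes a single `intermediate_size` into the config, it relies on every layer being pruned to the same size. A small sanity check along the following lines could confirm that after pruning. The helper `check_mlp_shapes` and the toy model built from `SimpleNamespace` are hypothetical stand-ins, not part of the notebook or of `transformers`.

```python
from types import SimpleNamespace
from torch import nn

def check_mlp_shapes(model):
    """Hypothetical helper: confirm every MLP agrees with config.intermediate_size."""
    expected = model.config.intermediate_size
    for idx, layer in enumerate(model.model.layers):
        mlp = layer.mlp
        sizes = (mlp.gate_proj.out_features,
                 mlp.up_proj.out_features,
                 mlp.down_proj.in_features)
        if any(size != expected for size in sizes):
            raise ValueError(f"Layer {idx}: sizes {sizes} do not match {expected}")
    return expected

def make_layer(hidden, inter):
    # Toy layer exposing only the attributes the check reads.
    return SimpleNamespace(mlp=SimpleNamespace(
        gate_proj=nn.Linear(hidden, inter, bias=False),
        up_proj=nn.Linear(hidden, inter, bias=False),
        down_proj=nn.Linear(inter, hidden, bias=False),
    ))

toy_model = SimpleNamespace(
    config=SimpleNamespace(intermediate_size=6),
    model=SimpleNamespace(layers=[make_layer(4, 6) for _ in range(2)]),
)
checked_size = check_mlp_shapes(toy_model)
print(f"All MLP blocks use intermediate size {checked_size}")

# A layer pruned to a different size is reported as an error.
toy_model.model.layers.append(make_layer(4, 5))
try:
    check_mlp_shapes(toy_model)
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
print(mismatch_detected)
```

With the real model, the same helper could be called as `check_mlp_shapes(model)` right after `update_model`, assuming the attribute layout shown in the printed structure.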


# Prune the Model

In [20]:
prune_percent = 0.2  # Prune 20% of neurons
model = update_model(model, prune_percent)

In [21]:
# Recalculate the number of parameters
pruned_param_count = count_parameters(model)
reduction_in_params = original_param_count - pruned_param_count
percentage_savings = (reduction_in_params / original_param_count) * 100

print(f"Pruned model parameters: {pruned_param_count}")
print(f"Reduction in parameters: {reduction_in_params}")
print(f"Percentage of weight savings: {percentage_savings:.2f}%")


Pruned model parameters: 1074792448
Reduction in parameters: 161021952
Percentage of weight savings: 13.03%
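
These figures follow directly from the layer shapes. The short calculation below shows why removing 20% of the MLP neurons yields about 13% fewer parameters overall: only the three MLP projections shrink, while attention, embeddings, and norms are untouched. The variable names are illustrative; the constants come from the printed structure and the parameter count reported earlier.

```python
hidden, intermediate, n_layers = 2048, 8192, 16
prune_percent = 0.2
original_total = 1235814400  # reported by count_parameters before pruning

pruned_per_layer = min(int(prune_percent * intermediate), intermediate - 1)
keep = intermediate - pruned_per_layer

# Each removed neuron drops one row of gate_proj, one row of up_proj,
# and one column of down_proj: 3 * hidden weights per neuron.
removed = n_layers * 3 * hidden * pruned_per_layer
savings = 100 * removed / original_total

print(keep)               # 6554
print(removed)            # 161021952
print(f"{savings:.2f}%")  # 13.03%
```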


## Test the Pruned Model

In [22]:
generated = get_output(prompt, model, tokenizer)
print(f"Generated text after pruning: {generated}")

Generated text after pruning: Vienna is the capital of Austria and has a population of over 1.5 million people. It is also known as the “City of Vienna” because it is located on the banks of the Danube River and is surrounded on all sides


**Observation:** the result is slightly different from what the original model produced.

Looking at the model’s new structure: the `gate_proj` and `up_proj` layers have had their `out_features` reduced to 6554 from 8192. Consequently, the `down_proj` layer has its `in_features` adjusted to match the new size.

In [23]:
print(model)

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 2048)
    (layers): ModuleList(
      (0-15): 16 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=2048, out_features=6554, bias=False)
          (up_proj): Linear(in_features=2048, out_features=6554, bias=False)
          (down_proj): Linear(in_features=6554, out_features=2048, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((2048,), eps=1e-05)
    (ro

## Upload the model to Hugging Face

### Save the Model Locally

```
# Assumes `new_model_name` was defined earlier (its value is not shown in this section).
output_dir = './' + new_model_name
os.makedirs(output_dir, exist_ok=True)

# model.save_pretrained(output_dir)
model.save_pretrained(output_dir, max_shard_size="200MB", safe_serialization=False)

tokenizer.save_pretrained(output_dir)
print(f"Pruned model saved to {output_dir}")
```