# Exercise: Strategic Pruning - Method vs. Target

**Welcome to the Exercise!**

In this exercise, we will explore the nuances of **unstructured pruning**. We'll move beyond simply removing weights and investigate how the *strategy* of pruning affects a model's performance. We'll use a powerful model, `meta-llama/Llama-3.2-1B`, to see these effects clearly.

**Our Goal:**
To answer two key strategic questions about pruning, even *before* any fine-tuning:
1.  **Part A (Method Comparison):** Is it better to prune the *least important* weights (Magnitude Pruning) or prune weights *at random*?
2.  **Part B (Target Comparison):** Is it more harmful to prune the **MLP layers** (often associated with stored knowledge) or the **Attention layers** (which relate each token to its context)?

We will compare the generated text (a qualitative check of output quality) and the average inference time to understand the impact of these strategic choices.
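
To make the two methods in Part A concrete before touching the real model, here is a minimal pure-Python sketch on a made-up list of ten weights. The helper names `magnitude_mask` and `random_mask` are illustrative and not part of any library; the notebook itself uses `torch.nn.utils.prune` for the actual pruning.

```python
import random

def magnitude_mask(weights, amount):
    """Return a 0/1 mask that zeroes the `amount` fraction of smallest-|w| entries."""
    n_prune = round(amount * len(weights))
    # Indices ordered by absolute value, smallest first
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    return [0 if i in pruned else 1 for i in range(len(weights))]

def random_mask(weights, amount, seed=0):
    """Return a 0/1 mask that zeroes a randomly chosen `amount` fraction of entries."""
    n_prune = round(amount * len(weights))
    rng = random.Random(seed)
    pruned = set(rng.sample(range(len(weights)), n_prune))
    return [0 if i in pruned else 1 for i in range(len(weights))]

def kept_l1(weights, mask):
    """Total absolute weight that survives the mask."""
    return sum(abs(w) for w, m in zip(weights, mask) if m)

# Made-up weights for illustration only
weights = [0.9, -0.05, 0.4, 0.01, -0.7, 0.2, -0.03, 0.6, 0.1, -0.5]
amount = 0.4

mag = magnitude_mask(weights, amount)
rnd = random_mask(weights, amount)

print("magnitude mask:", mag, "kept L1:", round(kept_l1(weights, mag), 2))
print("random mask:   ", rnd, "kept L1:", round(kept_l1(weights, rnd), 2))
```

Both masks remove the same number of weights, but the magnitude mask always retains at least as much total absolute weight as any other mask of the same size, which is the intuition Part A puts to the test.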

## 1. Environment Setup

First, let's install the necessary libraries and log in to Hugging Face to access the Llama model.

In [1]:
!pip install transformers torch accelerate huggingface_hub pandas



In [None]:
from huggingface_hub import login

# Paste your Hugging Face token here
HF_TOKEN = "<YOUR_HUGGING_FACE_TOKEN>"

login(token=HF_TOKEN)
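
If you would rather not paste a token into the notebook, one option is to read it from an environment variable. The sketch below assumes the token was exported as `HF_TOKEN` before the notebook was started; `read_hf_token` is a hypothetical helper, not part of `huggingface_hub`.

```python
import os

def read_hf_token(env=os.environ, var_name="HF_TOKEN"):
    """Return the token stored in `var_name`, or None if it is unset or blank."""
    token = env.get(var_name, "").strip()
    return token or None

token = read_hf_token()
if token is None:
    print("HF_TOKEN is not set; export it before starting the notebook.")
else:
    print("Token found; pass it to login(token=token).")
```

This keeps the secret out of the saved notebook, which matters if the file is shared or committed to version control.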

## 2. Imports and Experiment Configuration

Next, we'll import our libraries and define the parameters for our experiment. This includes the model name, the target sparsity level, and the specific layers we plan to target for Part A and Part B.

In [3]:
import torch
import torch.nn.utils.prune as prune
from transformers import AutoModelForCausalLM, AutoTokenizer
import time
import copy
import pandas as pd
import gc

# --- Configuration ---
MODEL_NAME = "meta-llama/Llama-3.2-1B"

PRUNING_AMOUNT = 0.4 # Target 40% sparsity
NUM_LAYERS_TO_TARGET = 4 # Target the first N layers for a noticeable but manageable effect

PROMPT_TEXT = "The future of artificial intelligence is"
MAX_NEW_TOKENS = 30
NUM_SPEED_RUNS = 3

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
if torch.cuda.is_available():
    model_dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
else:
    model_dtype = torch.float32  # float16 can be slow or unsupported for some CPU ops
print(f"Using device: {device}, Model dtype: {model_dtype}")

# --- Define Layer Targets for the Experiment ---
# For Part A (Method Comparison), we'll use a small, consistent set of layers
PART_A_TARGETS = [f"model.layers.{i}.mlp.gate_proj" for i in range(NUM_LAYERS_TO_TARGET)]

# For Part B (Target Comparison), we'll define the two groups of layers
MLP_LAYERS = []
for i in range(NUM_LAYERS_TO_TARGET):
    MLP_LAYERS.extend([
        f"model.layers.{i}.mlp.gate_proj",
        f"model.layers.{i}.mlp.up_proj",
        f"model.layers.{i}.mlp.down_proj"
    ])

ATTENTION_LAYERS = []
for i in range(NUM_LAYERS_TO_TARGET):
    ATTENTION_LAYERS.extend([
        f"model.layers.{i}.self_attn.q_proj",
        f"model.layers.{i}.self_attn.k_proj",
        f"model.layers.{i}.self_attn.v_proj",
        f"model.layers.{i}.self_attn.o_proj"
    ])


Using device: cuda, Model dtype: torch.float16


## 3. Helper Functions

We'll define a few helper functions to keep our main code clean and organized. These will handle tasks like finding a layer by its name, applying pruning, calculating sparsity, and running our performance tests.

In [4]:
def get_module_by_name(model, module_name):
    """Access a submodule in a model using its string name."""
    names = module_name.split('.')
    module = model
    for name in names:
        module = getattr(module, name)
    return module

def calculate_sparsity(module):
    """Calculate the percentage of zero-weights in a module."""
    if hasattr(module, 'weight') and module.weight is not None:
        return 100. * float(torch.sum(module.weight == 0)) / float(module.weight.nelement())
    return 0.0

def apply_pruning(model, layers_to_prune, amount, method):
    """Apply a specified pruning method to a list of layers."""
    parameters_to_prune = []
    for layer_name in layers_to_prune:
        try:
            module = get_module_by_name(model, layer_name)
            parameters_to_prune.append((module, 'weight'))
        except AttributeError:
            print(f"Warning: Layer {layer_name} not found. Skipping.")

    if not parameters_to_prune:
        print("No valid layers found to prune.")
        return

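    pruning_method_map = {
        'l1_unstructured': prune.L1Unstructured,
        'random_unstructured': prune.RandomUnstructured
    }
    if method not in pruning_method_map:
        raise ValueError(f"Unknown pruning method: {method!r}")

    # global_unstructured pools all listed tensors, so `amount` is the fraction
    # pruned across the pool; individual layers may end up above or below it.
    prune.global_unstructured(
        parameters_to_prune,
        pruning_method=pruning_method_map[method],
        amount=amount,
    )

    # Make the pruning permanent (bakes the zeros into `weight` and drops the mask)
    for module, param_name in parameters_to_prune:
        prune.remove(module, param_name)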
    print(f"Applied '{method}' pruning with {amount*100:.0f}% sparsity to {len(parameters_to_prune)} layers.")

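def run_performance_test(model, tokenizer, prompt, max_tokens, num_runs):
    """Measure average generation speed and get a sample output.

    Note: there is no warm-up run, so one-time costs (CUDA initialization,
    kernel selection, memory allocation) are included in the first timed call.
    Depending on the model's generation config, generate() may also sample,
    in which case the output text varies from run to run.
    """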
    total_time = 0
    sample_output = "Error during generation."
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        for i in range(num_runs):
            if device.type == 'cuda': torch.cuda.synchronize()
            start_time = time.perf_counter()
            
            outputs = model.generate(**inputs, max_new_tokens=max_tokens, pad_token_id=tokenizer.eos_token_id)
            
            if device.type == 'cuda': torch.cuda.synchronize()
            end_time = time.perf_counter()
            total_time += (end_time - start_time)
            
            if i == 0: # Get sample from the first run
                sample_output = tokenizer.decode(outputs[0], skip_special_tokens=True)

    avg_time = total_time / num_runs
    return avg_time, sample_output
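
One detail of `apply_pruning` is easy to miss: `prune.global_unstructured` ranks the weights of all listed layers together, so the requested amount applies to the pool rather than to each layer. The sketch below illustrates this with two toy `nn.Linear` layers whose weight scales are deliberately exaggerated; the layers and the local `sparsity` helper are stand-ins for illustration, not Llama modules.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

torch.manual_seed(0)

# Two toy layers with very different weight scales (illustrative only)
small = nn.Linear(8, 8, bias=False)
large = nn.Linear(8, 8, bias=False)
with torch.no_grad():
    small.weight.mul_(0.01)  # tiny weights
    large.weight.mul_(10.0)  # large weights

def sparsity(module):
    """Percentage of exactly-zero weights in a module."""
    return 100.0 * float((module.weight == 0).sum()) / module.weight.nelement()

# Prune 40% of the pooled 128 weights by magnitude
prune.global_unstructured(
    [(small, "weight"), (large, "weight")],
    pruning_method=prune.L1Unstructured,
    amount=0.4,
)
for module in (small, large):
    prune.remove(module, "weight")

total_zeros = int((small.weight == 0).sum() + (large.weight == 0).sum())
print(f"total zeros: {total_zeros} of 128")
print(f"small layer: {sparsity(small):.1f}%  large layer: {sparsity(large):.1f}%")
```

With magnitude pruning, the layer with smaller weights absorbs most of the pruning. Calling `calculate_sparsity` on each targeted module after `apply_pruning` is a quick way to check how the 40% was actually distributed in the real model.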

## 4. Load Baseline Model and Establish Performance

Before we start pruning, we need a baseline. Let's load the original `Llama-3.2-1B` model and measure its performance. This will be the reference against which all our pruned models are compared.

In [5]:
results_log = []

print(f"--- Loading Original Model: {MODEL_NAME} ---")
try:
    original_model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=model_dtype, device_map="auto")
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    if tokenizer.pad_token is None: tokenizer.pad_token = tokenizer.eos_token
    print("Original model loaded successfully.")

    # Evaluate original model performance
    original_avg_time, original_output = run_performance_test(
        original_model, tokenizer, PROMPT_TEXT, MAX_NEW_TOKENS, NUM_SPEED_RUNS
    )
    
    results_log.append({
        "Configuration": "Original (No Pruning)",
        "Avg Inference Time (s)": f"{original_avg_time:.4f}",
        "Generated Output": original_output
    })
    print(f"\nBaseline Performance:\n  - Avg Time: {original_avg_time:.4f}s\n  - Output: '{original_output}'")

except Exception as e:
    print(f"CRITICAL ERROR: Could not load the model. Please check model name, HF token, and available resources. Error: {e}")
    # Re-raise so the notebook stops here instead of continuing without a model
    raise

--- Loading Original Model: meta-llama/Llama-3.2-1B ---
Original model loaded successfully.

Baseline Performance:
  - Avg Time: 0.7392s
  - Output: 'The future of artificial intelligence is here and it is in the form of a robot that can replace human workers in the manufacturing industry. This robot is known as the ABBYY Flex'
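
Because the baseline is the first model to run `generate`, its average may include one-time startup costs that the later, pruned models do not pay. A common remedy is to run a few untimed warm-up calls first. The helper below is a minimal sketch of that idea; `time_with_warmup` and its parameters are hypothetical names, and the lambda is a stand-in for a call to `model.generate`.

```python
import time

def time_with_warmup(fn, num_runs=3, num_warmup=1, sync=None):
    """Average wall-clock time of fn() over num_runs, after untimed warm-up calls.

    `sync` is an optional callable (for example torch.cuda.synchronize) that is
    run before each clock read so queued GPU work is included in the timing.
    """
    for _ in range(num_warmup):
        fn()  # warm-up: result and duration are discarded

    total = 0.0
    for _ in range(num_runs):
        if sync:
            sync()
        start = time.perf_counter()
        fn()
        if sync:
            sync()
        total += time.perf_counter() - start
    return total / num_runs

# Stand-in workload so the sketch runs without a model
calls = []
avg = time_with_warmup(lambda: calls.append(sum(range(10_000))), num_runs=3, num_warmup=2)
print(f"{len(calls)} calls in total, average timed call: {avg:.6f}s")
```

Timing every configuration this way puts the baseline and the pruned models on equal footing.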


## 5. The Pruning Experiment

Now we'll define a function to run a single pruning experiment. This function will:
1.  Create a fresh copy of the original model to ensure experiments are isolated.
2.  Apply the specified pruning configuration (method and target layers).
3.  Measure the performance of the newly pruned model.
4.  Log the results.

In [6]:
def run_pruning_experiment(config_name, base_model, target_layers, amount, method):
    print(f"\n{'='*50}\nRunning Experiment: {config_name}\n{'='*50}")
    
    # Deepcopy the model to have a fresh start for each experiment.
    # For very large models, reloading might be more memory-efficient than deepcopy.
    pruned_model = copy.deepcopy(base_model)
    
    # Apply the pruning
    apply_pruning(pruned_model, target_layers, amount, method)
    
    # Test the pruned model
    avg_time, output = run_performance_test(pruned_model, tokenizer, PROMPT_TEXT, MAX_NEW_TOKENS, NUM_SPEED_RUNS)
    
    # Log results
    results_log.append({
        "Configuration": config_name,
        "Avg Inference Time (s)": f"{avg_time:.4f}",
        "Generated Output": output
    })
    
    print(f"Result:\n  - Avg Time: {avg_time:.4f}s\n  - Output: '{output}'")

    # Clean up to save memory
    del pruned_model
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
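
The comment about `deepcopy` hints at memory pressure. One alternative, sketched here on a toy `nn.Sequential` rather than the Llama model, is to keep a single CPU snapshot of the weights, prune the model in place, and restore the snapshot before the next experiment. Whether this fits a model loaded with `device_map="auto"` is an assumption you would need to verify on your setup.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 4))  # toy stand-in

# Snapshot the original weights once, on the CPU
snapshot = {k: v.detach().cpu().clone() for k, v in model.state_dict().items()}

# Experiment: prune in place, then make it permanent so parameter names are unchanged
prune.l1_unstructured(model[0], name="weight", amount=0.4)
prune.remove(model[0], "weight")
zeros_after_prune = int((model[0].weight == 0).sum())

# Restore the original weights before the next experiment
model.load_state_dict(snapshot)
zeros_after_restore = int((model[0].weight == 0).sum())

print(f"zeros after pruning: {zeros_after_prune}, after restore: {zeros_after_restore}")
```

The trade-off is that only one model lives on the accelerator at a time, at the cost of keeping a full copy of the weights in CPU memory.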

### Part A: Pruning Method Comparison

Here we compare **Magnitude** vs. **Random** pruning on the same set of layers: the MLP `gate_proj` weights of the first four decoder blocks.

In [7]:
print("--- Starting Part A: Method Comparison (Targeting MLP Layers) ---")
# Magnitude Pruning
run_pruning_experiment(
    "Part A: Magnitude Pruned", original_model, 
    PART_A_TARGETS, PRUNING_AMOUNT, 'l1_unstructured'
)

# Random Pruning
run_pruning_experiment(
    "Part A: Random Pruned", original_model, 
    PART_A_TARGETS, PRUNING_AMOUNT, 'random_unstructured'
)

--- Starting Part A: Method Comparison (Targeting MLP Layers) ---

Running Experiment: Part A: Magnitude Pruned
Applied 'l1_unstructured' pruning with 40% sparsity to 4 layers.
Result:
  - Avg Time: 0.5767s
  - Output: 'The future of artificial intelligence is already here. The technology is not new, but it is being applied to a variety of fields, including healthcare. Artificial intelligence is now being used to'

Running Experiment: Part A: Random Pruned
Applied 'random_unstructured' pruning with 40% sparsity to 4 layers.
Result:
  - Avg Time: 0.5772s
  - Output: 'The future of artificial intelligence is uncertain, but one thing is certain: it will be based on the ideas of the present. Artificial intelligence is a new science, a new science that'


### Part B: Pruning Target Comparison

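Now we fix the method to Magnitude Pruning and compare its effect on **MLP layers** vs. **Attention layers**. Note that the two groups are not the same size: in this model the MLP projections hold considerably more weights than the attention projections, so 40% of each group removes a different absolute number of parameters.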

In [8]:
print("\n--- Starting Part B: Target Comparison (Using Magnitude Pruning) ---")
# Pruning MLP Layers
run_pruning_experiment(
    "Part B: Pruned MLP Layers", original_model, 
    MLP_LAYERS, PRUNING_AMOUNT, 'l1_unstructured'
)

# Pruning Attention Layers
run_pruning_experiment(
    "Part B: Pruned Attention Layers", original_model, 
    ATTENTION_LAYERS, PRUNING_AMOUNT, 'l1_unstructured'
)


--- Starting Part B: Target Comparison (Using Magnitude Pruning) ---

Running Experiment: Part B: Pruned MLP Layers
Applied 'l1_unstructured' pruning with 40% sparsity to 12 layers.
Result:
  - Avg Time: 0.5835s
  - Output: 'The future of artificial intelligence is already here. It’s not a matter of if, but when. The only question is when. And that’s what I’m going to talk about'

Running Experiment: Part B: Pruned Attention Layers
Applied 'l1_unstructured' pruning with 40% sparsity to 16 layers.
Result:
  - Avg Time: 0.5764s
  - Output: 'The future of artificial intelligence is here, and it’s all about making it work for you.
This is the future of AI, and it’s all about making it work for you'


## 6. Final Results and Analysis

Let's display all our results in a clean table and discuss the findings.

In [17]:
df_results = pd.DataFrame(results_log)
print("--- Pruning Experiment Results Summary ---")

# Set display options for better readability
pd.set_option('display.max_colwidth', None)
pd.set_option('display.width', 200)

df_results

--- Pruning Experiment Results Summary ---


| | Configuration | Avg Inference Time (s) | Generated Output |
|---|---|---|---|
| 0 | Original (No Pruning) | 0.7392 | The future of artificial intelligence is here and it is in the form of a robot that can replace human workers in the manufacturing industry. This robot is known as the ABBYY Flex |
| 1 | Part A: Magnitude Pruned | 0.5767 | The future of artificial intelligence is already here. The technology is not new, but it is being applied to a variety of fields, including healthcare. Artificial intelligence is now being used to |
| 2 | Part A: Random Pruned | 0.5772 | The future of artificial intelligence is uncertain, but one thing is certain: it will be based on the ideas of the present. Artificial intelligence is a new science, a new science that |
| 3 | Part B: Pruned MLP Layers | 0.5835 | The future of artificial intelligence is already here. It’s not a matter of if, but when. The only question is when. And that’s what I’m going to talk about |
| 4 | Part B: Pruned Attention Layers | 0.5764 | The future of artificial intelligence is here, and it’s all about making it work for you.\nThis is the future of AI, and it’s all about making it work for you |
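
Reading a single generated sample per configuration gives only a rough impression of quality. A common quantitative complement is perplexity on a fixed text. The function below is a sketch of the calculation from next-token logits; `perplexity_from_logits` is a hypothetical helper, and the uniform logits are a synthetic sanity check rather than model output.

```python
import math
import torch
import torch.nn.functional as F

def perplexity_from_logits(logits, input_ids):
    """Perplexity of a token sequence given next-token logits.

    logits:    (seq_len, vocab_size); logits[t] scores the token at position t + 1
    input_ids: (seq_len,) token ids of the same sequence
    """
    # Predict token t + 1 from position t: drop the last logit row and the first token
    loss = F.cross_entropy(logits[:-1].float(), input_ids[1:])
    return math.exp(loss.item())

# Sanity check: a model that is uniform over the vocabulary has perplexity == vocab size
vocab_size = 50
ids = torch.tensor([3, 17, 42, 8, 25])
uniform_logits = torch.zeros(len(ids), vocab_size)
ppl = perplexity_from_logits(uniform_logits, ids)
print(f"perplexity under uniform predictions: {ppl:.2f}")
```

With the models in this notebook, the inputs could come from `model(**inputs).logits[0]` and `inputs["input_ids"][0]` for a tokenized evaluation text; lower perplexity after pruning indicates less damage.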


### Guiding Questions for Analysis

1.  **Part A (Method)**: Look at the outputs for "Part A: Magnitude Pruned" vs. "Part A: Random Pruned". Which one produced a more coherent or sensible output? Does this align with the idea that magnitude pruning is the better-informed method?

2.  **Part B (Target)**: Compare the outputs for "Part B: Pruned MLP Layers" vs. "Part B: Pruned Attention Layers". Did pruning one type of layer seem to harm the model's output quality more than the other? What does this suggest about the roles of these different layers?

3.  **Speed-up**: Did you observe a significant inference speed-up from unstructured pruning? Why might the speed-up be less than what you'd expect from zeroing 40% of the weights in the targeted layers? (Hint: Think about dense vs. sparse hardware operations, and about the order in which the models were timed).

4.  **Overall Conclusion**: Based on this experiment, what is a more effective pruning strategy for this model before fine-tuning? And what is the crucial next step that would be required to make these pruned models practically useful?
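
To explore the hint in question 3, the pure-Python sketch below counts the multiplications in a matrix-vector product, first treating the pruned matrix as dense and then storing only its non-zero entries. The matrix is made up for illustration, and the functions are not how GPU kernels are written; they only show where the work goes.

```python
def dense_matvec(matrix, vec):
    """Dense product: multiplies every entry, zero or not. Returns (result, multiply count)."""
    mults = 0
    out = []
    for row in matrix:
        acc = 0.0
        for w, x in zip(row, vec):
            acc += w * x
            mults += 1
        out.append(acc)
    return out, mults

def sparse_matvec(matrix, vec):
    """Sparse product: stores (column, value) pairs and multiplies only those."""
    rows = [[(j, w) for j, w in enumerate(row) if w != 0] for row in matrix]
    mults = 0
    out = []
    for row in rows:
        acc = 0.0
        for j, w in row:
            acc += w * vec[j]
            mults += 1
        out.append(acc)
    return out, mults

# A made-up weight matrix with half of its entries pruned to zero
pruned = [
    [0.9, 0.0, 0.4, 0.0, -0.7],
    [0.0, 0.6, 0.0, -0.5, 0.0],
]
x = [1.0, 2.0, 3.0, 4.0, 5.0]

dense_out, dense_mults = dense_matvec(pruned, x)
sparse_out, sparse_mults = sparse_matvec(pruned, x)
print(f"dense:  {dense_mults} multiplications")
print(f"sparse: {sparse_mults} multiplications")
```

Both versions return the same result, but only the sparse one does less work, and only because it uses a different storage format and a different loop.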

### Sample Analysis

1.  **Part A Analysis**: In this run, the magnitude-pruned model produced a coherent, on-topic continuation, while the random-pruned model stayed grammatical but drifted into repetition ("a new science, a new science that"). This is consistent with the hypothesis that low-magnitude weights contribute less to the model's performance, so removing them is less damaging than removing weights at random. A single prompt is weak evidence, though; a metric such as perplexity on held-out text would make the comparison more reliable.

2.  **Part B Analysis**: Both pruned models still produced fluent text, but each showed signs of degradation. The attention-pruned model repeated an entire sentence almost verbatim, while the repetition in the MLP-pruned output was milder. This suggests that for Llama-3.2-1B, the attention mechanisms might be more sensitive to pruning than the MLP blocks, as they are fundamental to processing context and relationships between tokens. Keep in mind that the two groups differ in size, so the comparison is not parameter-matched.

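3.  **Speed-up Analysis**: The pruned models ran about 20% faster than the baseline (roughly 0.58s vs. 0.74s), but this difference should not be attributed to pruning. All four pruned configurations took nearly the same time regardless of how many weights were zeroed, and the baseline was timed first without a warm-up run, so its average most likely includes one-time startup costs. After `prune.remove`, each weight is still a dense tensor of the same shape with the zeros stored explicitly, so neither the memory footprint nor the number of multiply-adds changes. Standard hardware (GPUs) and deep learning libraries are highly optimized for **dense** matrix multiplications; without sparse storage formats and specialized hardware or software (sparse kernels) that can skip the zero-multiplications, unstructured pruning does not reduce computation time. Note also that only the first 4 decoder layers were pruned, so far fewer than 40% of the model's weights were actually zeroed.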

4.  **Overall Conclusion**: The most effective pre-finetuning strategy observed is **magnitude-based pruning**, as it best preserved output quality in this run. Furthermore, it appears that **MLP layers are a safer target** for pruning than attention layers. The critical next step for any of these pruned models would be **fine-tuning**. Fine-tuning allows the remaining weights in the network to adjust and compensate for the information lost during pruning, which is essential for recovering performance. Turning the sparsity into an actually smaller or faster model additionally requires sparse storage and kernels that exploit the zeros.
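
One practical point about that fine-tuning step: `apply_pruning` calls `prune.remove`, which bakes the zeros into the weights but discards the mask, so ordinary gradient updates could make pruned weights non-zero again. The sketch below, on a toy `nn.Linear` with random data, shows one way to keep the sparsity pattern fixed during training by leaving the mask attached. It is an illustration under those assumptions, not a full fine-tuning recipe.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

torch.manual_seed(0)
layer = nn.Linear(8, 4)  # toy stand-in for a pruned projection

# Prune but do NOT call prune.remove: the mask stays attached as `weight_mask`
prune.l1_unstructured(layer, name="weight", amount=0.5)
mask = layer.weight_mask.clone()

# The trainable parameters are now `weight_orig` and `bias`
optimizer = torch.optim.SGD(layer.parameters(), lr=0.1)
x, target = torch.randn(16, 8), torch.randn(16, 4)  # random data for illustration

for _ in range(5):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(layer(x), target)
    loss.backward()
    optimizer.step()

# `layer.weight` is recomputed as weight_orig * weight_mask on every forward pass
layer(x)
still_zero = bool((layer.weight[mask == 0] == 0).all())
num_zeros = int((layer.weight == 0).sum())
print(f"pruned weights still zero after training: {still_zero} ({num_zeros} zeros)")
```

Once fine-tuning is finished, `prune.remove` can be called to make the result permanent.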