# Demo: Unstructured Pruning - The Mechanics of Zeroing Out Weights

**Welcome!**

In this demonstration, we'll dive into the mechanics of **unstructured model pruning**. Our goal isn't to create a perfectly optimized model, but to understand the fundamental process of how weights are identified and permanently removed (zeroed out) from a specific layer in a large language model.

**Our Plan:**
1.  Load a compact pretrained language model, `meta-llama/Llama-3.2-1B`.
2.  Establish a **baseline** by generating text with the full model.
3.  **Target** a single, specific layer within the model's vast architecture.
4.  Apply **L1 unstructured pruning** to zero out 50% of the weights in that layer.
5.  **Verify** that the layer has become sparse.
6.  **Assess the impact** by generating text again and observing the (expected) degradation in quality.

We'll be using `torch.nn.utils.prune` to see exactly how this works under the hood.

## 1. Environment Setup

First, let's install the necessary libraries.

In [1]:
!pip install transformers torch accelerate huggingface_hub



## 2. Authentication and Configuration

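Next, we need to log in to Hugging Face to download the Llama model. The `meta-llama` repositories are gated, so your account must also have accepted the model's license on its Hugging Face page before the download will succeed.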

**Action Required:** You need to get a Hugging Face User Access Token.
1. Go to `huggingface.co/settings/tokens`.
2. Create a new token with "read" permissions.
3. Paste your token into the cell below.

In [None]:
from huggingface_hub import login

# Paste your Hugging Face token here (avoid committing a real token to version control)
HF_TOKEN = "<YOUR_HUGGING_FACE_TOKEN>"

login(token=HF_TOKEN)

Now, let's configure our demo. We'll define the model we're using, the specific layer we want to prune, and the percentage of weights to remove.

**How do you find the `TARGET_LAYER_NAME_STR`?** You would typically print the entire model object (`print(model)`) and inspect its architecture to find the name of the layer you're interested in.
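
Reading a full `print(model)` dump can be tedious. One possible shortcut, sketched below, is to iterate over `named_modules()` and keep only the `nn.Linear` layers. The `TinyMLP`, `TinyLayer`, and `TinyModel` classes are hypothetical stand-ins so the snippet runs without downloading anything; the same `list_linear_layers` helper (also an illustrative name) should work on the loaded Llama model.

```python
import torch.nn as nn

# Hypothetical stand-ins; the attribute names mimic the Llama layout
# purely for illustration.
class TinyMLP(nn.Module):
    def __init__(self, hidden=8, inner=16):
        super().__init__()
        self.gate_proj = nn.Linear(hidden, inner, bias=False)
        self.up_proj = nn.Linear(hidden, inner, bias=False)
        self.down_proj = nn.Linear(inner, hidden, bias=False)

class TinyLayer(nn.Module):
    def __init__(self):
        super().__init__()
        self.mlp = TinyMLP()

class TinyModel(nn.Module):
    def __init__(self, num_layers=2):
        super().__init__()
        self.layers = nn.ModuleList([TinyLayer() for _ in range(num_layers)])

def list_linear_layers(model):
    """Return (dotted_name, weight_shape) for every nn.Linear in the model."""
    return [
        (name, tuple(module.weight.shape))
        for name, module in model.named_modules()
        if isinstance(module, nn.Linear)
    ]

toy = TinyModel()
linear_layers = list_linear_layers(toy)
for name, shape in linear_layers:
    print(f"{name:30s} {shape}")
```

Each printed name is a dotted path that can be used directly as a value for `TARGET_LAYER_NAME_STR`.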

In [3]:
import torch
import torch.nn.utils.prune as prune
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-3.2-1B"

# The specific layer we will target for pruning. 
# This is a linear layer inside the MLP block of the very first decoder layer.
TARGET_LAYER_NAME_STR = "model.layers.0.mlp.gate_proj"

# We will prune 50% of the weights in this layer
PRUNING_AMOUNT = 0.5 

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
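if torch.cuda.is_available():
    # Prefer bfloat16 on GPUs that support it, otherwise fall back to float16
    model_dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
else:
    # float16 is slow and poorly supported on many CPU builds, so use full precision there
    model_dtype = torch.float32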
print(f"Using device: {device}, Model dtype: {model_dtype}")

Using device: cuda, Model dtype: torch.float16


## 3. Helper Functions

We need two small helper functions:
1.  `get_module_by_name_str`: PyTorch models are nested objects. This function lets us access a deeply nested layer (like `model.layers.0...`) using its string name.
2.  `calculate_sparsity`: A simple function to calculate what percentage of weights in a layer are zero.

In [4]:
def get_module_by_name_str(model, module_name_str):
    """Gets a module from a model using its string name (e.g., 'model.layers.0.mlp.gate_proj')."""
    names = module_name_str.split('.')
    current_module = model
    for name_part in names:
        if hasattr(current_module, name_part):
            current_module = getattr(current_module, name_part)
        else:
            try: # Handle numeric indices in ModuleLists
                idx = int(name_part)
                current_module = current_module[idx]
            except (ValueError, TypeError, IndexError):
                raise AttributeError(f"Could not resolve name part '{name_part}' in '{module_name_str}'.")
    return current_module

def calculate_sparsity(module, param_name='weight'):
    """Calculates sparsity of a named parameter in a module."""
    if hasattr(module, param_name):
        param = getattr(module, param_name)
        if param is not None:
            return 100. * float(torch.sum(param == 0)) / float(param.nelement())
    return 0.0
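
As an aside, recent PyTorch versions ship a built-in lookup, `nn.Module.get_submodule`, that resolves the same dotted names, including numeric `ModuleList` indices. The sketch below checks this on a small hypothetical model (`ToyBlock` and `ToyModel` are illustrative names, not part of the demo).

```python
import torch.nn as nn

# Hypothetical toy model used only to illustrate dotted-name lookup.
class ToyBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(4, 4)

class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([ToyBlock(), ToyBlock()])

toy = ToyModel()

# Resolve a nested layer by its dotted name.
found = toy.get_submodule("layers.1.proj")
same_object = found is toy.layers[1].proj
print(type(found).__name__, same_object)

# A name that does not exist raises AttributeError.
lookup_failed = False
try:
    toy.get_submodule("layers.1.missing")
except AttributeError as err:
    lookup_failed = True
    print("Lookup failed:", err)
```

The hand-written helper above remains useful when you want a custom error message or need to support older PyTorch releases.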

## 4. Load the Llama-3.2-1B Model

Now, let's load our model. This can be resource-intensive, so ensure you have enough RAM or VRAM. We use `device_map="auto"` to let `accelerate` handle placing the model efficiently on our hardware.

In [5]:
print(f"--- Loading Model: {MODEL_NAME} ---")

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, 
    torch_dtype=model_dtype,
    device_map="auto" # Automatically handle device placement
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

print("\nLlama model and tokenizer loaded successfully.")

--- Loading Model: meta-llama/Llama-3.2-1B ---


Llama model and tokenizer loaded successfully.


## 5. The Pruning Process: A Step-by-Step Walkthrough

Let's begin the core of our demonstration.

### Step 5.1: Establish a Baseline

Before we change anything, let's see how the original, unpruned model performs on a simple prompt. This gives us a baseline for comparison.

In [6]:
PROMPT_TEXT_DEMO = "The capital of France is"
print(f"--- Quick Generation PRE-Pruning ---")

inputs = tokenizer(PROMPT_TEXT_DEMO, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=10, pad_token_id=tokenizer.eos_token_id)

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Prompt: {PROMPT_TEXT_DEMO}")
print(f"Generated by full model: {generated_text}")

--- Quick Generation PRE-Pruning ---
Prompt: The capital of France is
Generated by full model: The capital of France is Paris, which is a great city for tourists.


### Step 5.2: Target the Layer and Check Initial Sparsity

Using our helper function, we'll grab the specific layer we want to prune. Then, we'll check its sparsity. As expected, a normal, trained layer has virtually zero sparsity.

In [7]:
print(f"\n--- Accessing Target Layer: {TARGET_LAYER_NAME_STR} ---")
target_module = get_module_by_name_str(model, TARGET_LAYER_NAME_STR)
print(f"Successfully accessed target layer of type: {type(target_module)}")

sparsity_before = calculate_sparsity(target_module, 'weight')
print(f"Sparsity of '{TARGET_LAYER_NAME_STR}.weight' BEFORE pruning: {sparsity_before:.2f}%\n")


--- Accessing Target Layer: model.layers.0.mlp.gate_proj ---
Successfully accessed target layer of type: <class 'torch.nn.modules.linear.Linear'>
Sparsity of 'model.layers.0.mlp.gate_proj.weight' BEFORE pruning: 0.00%



### Step 5.3: Apply the Pruning "Mask"

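This is the first key step. We use `prune.l1_unstructured` to identify the 50% of weights with the smallest absolute value (for a single scalar weight, its L1 norm is just its magnitude).

**Crucially, this does *not* destroy the original values yet!** PyTorch moves them into a new parameter called `weight_orig`, stores the binary mask in a buffer called `weight_mask`, and turns `weight` into a plain attribute equal to `weight_orig * weight_mask`. A forward pre-hook recomputes that product before every forward pass, so the model already behaves as if the weights were pruned, while the original values remain recoverable from `weight_orig`.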

In [8]:
print(f"--- Applying L1 unstructured pruning (amount={PRUNING_AMOUNT}) ---")
prune.l1_unstructured(target_module, name="weight", amount=PRUNING_AMOUNT)

print("Pruning hook has been applied.")
print(f"The layer now has a 'weight_mask' and 'weight_orig' attribute.")

--- Applying L1 unstructured pruning (amount=0.5) ---
Pruning hook has been applied.
The layer now has a 'weight_mask' and 'weight_orig' attribute.
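
To see the reparameterization concretely, here is a minimal sketch on a small random `nn.Linear` layer (a stand-in for `gate_proj`, not the real Llama layer). It inspects the parameters and buffers after pruning and checks that the masked entries are the ones with the smallest magnitude.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

torch.manual_seed(0)
layer = nn.Linear(6, 4, bias=False)        # toy stand-in for the target layer
original = layer.weight.detach().clone()   # keep a copy for comparison

prune.l1_unstructured(layer, name="weight", amount=0.5)

# 'weight' is no longer a parameter; 'weight_orig' and 'weight_mask' appear.
param_names = [n for n, _ in layer.named_parameters()]
buffer_names = [n for n, _ in layer.named_buffers()]
print("parameters:", param_names)
print("buffers:   ", buffer_names)

# The original values are untouched inside weight_orig ...
orig_preserved = torch.equal(layer.weight_orig.detach(), original)
# ... and weight is the elementwise product of weight_orig and the mask.
is_masked_product = torch.equal(
    layer.weight.detach(), (layer.weight_orig * layer.weight_mask).detach()
)

# Magnitude criterion: no pruned entry is larger (in absolute value)
# than any entry that was kept.
mask = layer.weight_mask.bool()
largest_pruned = original[~mask].abs().max()
smallest_kept = original[mask].abs().min()
print(f"kept {int(mask.sum())} of {mask.numel()} weights")
print(f"largest pruned |w| = {largest_pruned:.4f}, smallest kept |w| = {smallest_kept:.4f}")
```

The same checks can be run on `target_module` in this notebook, as long as they happen before `prune.remove` is called.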


### Step 5.4: Make the Pruning Permanent

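To make our changes permanent, we call `prune.remove`. This function does two things:
1.  It deletes `weight_mask` and `weight_orig`, along with the forward pre-hook that combined them.
2.  It re-registers `weight` as an ordinary parameter holding the final, zeroed-out tensor.

After this step, the original values of the pruned weights are gone for good. Note that the tensor is still stored in dense format: the zeros occupy the same memory as before, so this step alone brings no size or speed benefit.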

In [9]:
print(f"\n--- Making pruning permanent for '{TARGET_LAYER_NAME_STR}.weight' ---")
prune.remove(target_module, "weight")
print("Pruning has been made permanent. The 'weight' attribute is now the sparse tensor.")


--- Making pruning permanent for 'model.layers.0.mlp.gate_proj.weight' ---
Pruning has been made permanent. The 'weight' attribute is now the sparse tensor.
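
The effect of `prune.remove` can be checked on a small random layer (again a toy stand-in, not the Llama layer). The sketch below confirms that only a plain `weight` parameter remains, that its values match the masked weights, and that the tensor is still dense.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

torch.manual_seed(0)
layer = nn.Linear(6, 4, bias=False)   # toy stand-in for the target layer
prune.l1_unstructured(layer, name="weight", amount=0.5)
masked_weight = layer.weight.detach().clone()

prune.remove(layer, "weight")

# Only the ordinary 'weight' parameter is left; mask and backup are gone.
param_names = [n for n, _ in layer.named_parameters()]
buffer_names = [n for n, _ in layer.named_buffers()]
print("parameters:", param_names, "| buffers:", buffer_names)

# The values are exactly the masked weights from before the call.
values_unchanged = torch.equal(layer.weight.detach(), masked_weight)

# The tensor uses the regular dense (strided) layout, so zeros still cost memory.
zero_count = int((layer.weight == 0).sum())
bytes_used = layer.weight.element_size() * layer.weight.nelement()
print(f"layout={layer.weight.layout}, zeros={zero_count}, bytes={bytes_used}")
```

Turning those zeros into real savings requires a sparse storage format or kernels that exploit sparsity, which is outside the scope of this demo.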


### Step 5.5: Verify the Final Sparsity

Now, let's recalculate the sparsity. It should be very close to the 50% we aimed for.

In [10]:
sparsity_after = calculate_sparsity(target_module, 'weight')
print(f"Sparsity of '{TARGET_LAYER_NAME_STR}.weight' AFTER pruning: {sparsity_after:.2f}%\n")

Sparsity of 'model.layers.0.mlp.gate_proj.weight' AFTER pruning: 50.00%



### Step 5.6: Assess the Impact on Performance

We've successfully removed 50% of the weights from one layer. What's the cost? Let's run the same prompt again.

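**Expectation:** The output may degrade in quality or coherence, because we've damaged the model and have **not** performed the essential final step of a real pruning workflow: **fine-tuning** the model to help it recover from the pruning. With only one layer pruned, though, the effect can be subtle: in the run below the continuation changes but stays coherent. Keep in mind that a single short prompt is a weak signal, and if the model's generation config enables sampling, two runs can differ even without pruning. Passing `do_sample=False` to `generate` in both cells gives a deterministic comparison.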

In [11]:
print(f"--- Quick Generation POST-Pruning (expect quality degradation) ---")

inputs = tokenizer(PROMPT_TEXT_DEMO, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=10, pad_token_id=tokenizer.eos_token_id)

generated_text_pruned = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Prompt: {PROMPT_TEXT_DEMO}")
print(f"Generated by pruned model: {generated_text_pruned}")

--- Quick Generation POST-Pruning (expect quality degradation) ---
Prompt: The capital of France is
Generated by pruned model: The capital of France is Paris, which has a population of 2.
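
A single generated sentence is a coarse signal. One more quantitative check, sketched here under simplifying assumptions, is to feed the same inputs through a dense layer and its pruned copy and measure the relative difference in their outputs. The layer and inputs below are random toys, so the resulting number says nothing about Llama itself; it only illustrates the measurement.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

torch.manual_seed(0)

# Random toy layer and an identical copy that will be pruned.
dense = nn.Linear(64, 128, bias=False)
pruned = copy.deepcopy(dense)
prune.l1_unstructured(pruned, name="weight", amount=0.5)
prune.remove(pruned, "weight")

# Same random inputs through both layers.
x = torch.randn(32, 64)
with torch.no_grad():
    y_dense = dense(x)
    y_pruned = pruned(x)

# Relative error of the pruned output with respect to the dense output.
rel_error = ((y_dense - y_pruned).norm() / y_dense.norm()).item()
print(f"relative output error after pruning: {rel_error:.3f}")
```

On a real model, a comparable check would compare logits or perplexity on a held-out text set before and after pruning.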


## 6. Conclusion & The Real-World Picture

This demo focused purely on the 'zeroing out' mechanics within one layer. We saw how to target a layer, apply a pruning hook, and make the change permanent. 

In a real-world application, the process is much more involved:
1.  **Global Pruning:** You would prune many layers across the entire model, not just one.
2.  **Iterative Process:** Often, pruning is done iteratively—prune a little, fine-tune, prune a little more, fine-tune again.
3.  **CRITICAL STEP: Fine-Tuning:** After pruning, the model **must** be fine-tuned on a relevant dataset. This allows the remaining weights to adjust and compensate for the ones that were removed, recovering much of the lost performance.
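
The first two points can be sketched on a toy network with the same `torch.nn.utils.prune` module. `prune.global_unstructured` ranks the weights of several layers together, so the overall sparsity hits the target while individual layers may end up above or below it. Calling `prune.l1_unstructured` repeatedly on one layer prunes a fraction of the weights that are *still unmasked* each time. The two-layer `toy_model` is a hypothetical stand-in, and the fine-tuning between rounds is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

torch.manual_seed(0)

# Hypothetical two-layer toy network standing in for a real model.
toy_model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 8))
linear_layers = [m for m in toy_model if isinstance(m, nn.Linear)]

# --- Global pruning: rank the weights of both layers together ---
prune.global_unstructured(
    [(m, "weight") for m in linear_layers],
    pruning_method=prune.L1Unstructured,
    amount=0.5,
)
total = sum(m.weight.nelement() for m in linear_layers)
zeros = sum(int((m.weight == 0).sum()) for m in linear_layers)
print(f"global sparsity: {zeros / total:.2%}")
for i, m in enumerate(linear_layers):
    frac = float((m.weight == 0).sum()) / m.weight.nelement()
    print(f"  linear layer {i}: {frac:.2%}")  # per-layer values need not be 50%

# --- Repeated pruning of one layer: each call prunes a share of what remains ---
single = nn.Linear(6, 4, bias=False)
prune.l1_unstructured(single, name="weight", amount=0.5)  # masks 12 of 24 weights
# (a real workflow would fine-tune here before pruning again)
prune.l1_unstructured(single, name="weight", amount=0.5)  # masks 6 of the remaining 12
remaining = int(single.weight_mask.sum())
print(f"weights kept after two rounds: {remaining} of {single.weight.nelement()}")
```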

This demo successfully highlights the first, mechanical step in that much larger journey.
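
One practical detail about the fine-tuning step: if training happens after `prune.remove`, the zeroed weights are ordinary parameters again and gradient updates can make them non-zero. Keeping the mask attached during training avoids this. The sketch below contrasts the two cases on a random toy regression problem; the layer, data, learning rate, and the `train_briefly` helper are all illustrative assumptions, not a recommended recipe.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

torch.manual_seed(0)
x, target = torch.randn(16, 8), torch.randn(16, 4)

def sparsity(layer):
    """Fraction of exactly-zero entries in layer.weight."""
    return float((layer.weight == 0).sum()) / layer.weight.nelement()

def train_briefly(layer, steps=5):
    """A few plain SGD steps on a toy regression loss."""
    optimizer = torch.optim.SGD(layer.parameters(), lr=0.1)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(layer(x), target)
        loss.backward()
        optimizer.step()

# Case 1: keep the mask while training, make it permanent afterwards.
masked = nn.Linear(8, 4, bias=False)
prune.l1_unstructured(masked, name="weight", amount=0.5)
train_briefly(masked)              # updates weight_orig; the mask is re-applied each forward
prune.remove(masked, "weight")
sparsity_with_mask = sparsity(masked)

# Case 2: make pruning permanent first, then train the dense tensor.
unmasked = nn.Linear(8, 4, bias=False)
prune.l1_unstructured(unmasked, name="weight", amount=0.5)
prune.remove(unmasked, "weight")
train_briefly(unmasked)            # zeros receive gradients like any other weight
sparsity_without_mask = sparsity(unmasked)

print(f"sparsity, mask kept during training: {sparsity_with_mask:.2%}")
print(f"sparsity, mask removed before training: {sparsity_without_mask:.2%}")
```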

In [12]:
# Clean up model from memory
del model
del tokenizer
if torch.cuda.is_available():
    torch.cuda.empty_cache()
    print("\nCleaned up models and emptied CUDA cache.")


Cleaned up models and emptied CUDA cache.
