### Step 2: Prune the model
In this step, we explore two methods to prune the model: depth pruning and width pruning. Refer to the [README.md](./README.md) to decide which pruning technique you would like to explore. For usage details, refer to the [pruning docs](https://docs.nvidia.com/nemo-framework/user-guide/latest/model-optimization/pruning/pruning.html).

Let's first define the parameters common to depth and width pruning.

In [None]:
NEMO_ROOT = "/opt/NeMo"
ROOT_DIR = "/workspace"
MODEL_PATH = f"{ROOT_DIR}/Qwen3-8B-nemo"

##### Set data paths
# NOTE: If you have multiple partitioned datasets, you can pass in a space-separated list of paths below.
DATA_PATH = f"{ROOT_DIR}/wikitext-data"
DATA_PATHS = f"{DATA_PATH}/wikitext-train_text_document"
INDEX_MAPPING_DIR = f"{DATA_PATH}/index_mappings"

##### Set sequence length for pruning and distillation
# NOTE: Use 4096 or 8192 depending on whether your dataset texts are short or long
SEQ_LENGTH = 4096

##### Change these to accommodate resources:
# NOTE: Pruning only supports Tensor Parallelism (TP) 1. Number of layers in your model should be divisible by
#   Pipeline Parallelism (PP) size, otherwise you can configure uneven PP using `--num_layers_in_first_pipeline_stage`
#   and `--num_layers_in_last_pipeline_stage` arguments below with the gpt_prune.py script.
DEVICES = 2
TENSOR_PARALLEL_SIZE = 1
PIPELINE_PARALLEL_SIZE = DEVICES
MICRO_BATCH_SIZE = 4

# Reducing this number speeds up pruning but may result in a slightly worse pruned model.
# Not used when directly dropping layers with the `--drop_layers` argument.
NUM_TRAIN_SAMPLES = 1024
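
The divisibility requirement noted in the comments above can be checked before launching the script. The following is a minimal sketch, assuming the 36-layer Qwen3-8B model used in this tutorial and a PP size of 2 (the value of `DEVICES` above). `layers_per_stage` is a hypothetical helper written for illustration; it is not part of NeMo.

```python
def layers_per_stage(num_layers: int, pp_size: int) -> int:
    """Return the number of layers per pipeline stage for an even split.

    Hypothetical helper for illustration only (not a NeMo API).
    Raises ValueError when the layers cannot be split evenly.
    """
    if pp_size < 1:
        raise ValueError("pp_size must be at least 1")
    if num_layers % pp_size != 0:
        raise ValueError(
            f"{num_layers} layers are not divisible by PP size {pp_size}; "
            "configure uneven PP or choose a different PP size"
        )
    return num_layers // pp_size


# Assumed values: 36 layers (Qwen3-8B) and PP size 2 (DEVICES above).
print(layers_per_stage(36, 2))
```

If the check raises, either change the PP size or use the uneven PP arguments mentioned in the comments above.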

#### Step 2a: Using depth-pruning 
To depth-prune, we reduce the Qwen3-8B model from 36 to 24 layers, resulting in a 6B model. The script automatically selects the best 24 layers to keep based on activation statistics collected from the training samples.

Alternatively, you can directly drop layers 24-35 (1-indexed) using the `--drop_layers 24 25 26 27 28 29 30 31 32 33 34 35` argument, which keeps layers 1-23 and 36. This generally works well too.
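
The layer list for the `--drop_layers` alternative can be generated rather than typed by hand. This is a small sketch under the assumptions stated in this section (36 layers, 1-indexed, dropping 24 through 35); the variable names are illustrative only.

```python
NUM_LAYERS = 36                      # Qwen3-8B layer count from this tutorial
FIRST_DROPPED, LAST_DROPPED = 24, 35  # 1-indexed, inclusive

# Layers to drop, and the layers that remain after dropping them.
drop_layers = list(range(FIRST_DROPPED, LAST_DROPPED + 1))
kept_layers = [i for i in range(1, NUM_LAYERS + 1) if i not in drop_layers]

# Space-separated string in the form expected after `--drop_layers`.
drop_layers_arg = " ".join(str(i) for i in drop_layers)

print(f"--drop_layers {drop_layers_arg}")
print(f"kept {len(kept_layers)} layers: 1-{kept_layers[-2]} and {kept_layers[-1]}")
```

Dropping these 12 layers leaves 24, the same layer count as the automatic depth-pruning run.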

In [None]:
SAVE_PATH = f"{ROOT_DIR}/Qwen3-8B-nemo-depth-pruned"

!torchrun --nproc_per_node "{DEVICES}" "{NEMO_ROOT}/scripts/llm/gpt_prune.py" \
    --devices "{DEVICES}" \
    --tp_size "{TENSOR_PARALLEL_SIZE}" \
    --pp_size "{PIPELINE_PARALLEL_SIZE}" \
    --restore_path "{MODEL_PATH}" \
    --legacy_ckpt \
    --save_path "{SAVE_PATH}" \
    --seq_length "{SEQ_LENGTH}" \
    --num_train_samples "{NUM_TRAIN_SAMPLES}" \
    --mbs "{MICRO_BATCH_SIZE}" \
    --data_paths "{DATA_PATHS}" \
    --index_mapping_dir "{INDEX_MAPPING_DIR}" \
    --target_num_layers 24

Running this script will save the depth-pruned model to your workspace at `<ROOT_DIR>/Qwen3-8B-nemo-depth-pruned`.

#### Step 2b: Using width-pruning 
To width-prune, we trim `ffn_hidden_size` from 12288 to 9216 and `hidden_size` from 4096 to 3584, which also results in a 6B model. You can also trim `num_attention_heads` and `num_query_groups` if needed. If the model is a Hybrid Mamba-Transformer model (e.g. [NVIDIA-Nemotron-Nano-12B-v2](https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2)), you can also trim the `mamba_num_heads` and `mamba_head_dim` dimensions.
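
To see why both pruning configurations land near 6B parameters, a rough count can be sketched as follows. This is a back-of-the-envelope estimate, not the script's own accounting. It assumes architecture values that are not stated in this notebook (32 attention heads, 8 query groups, head dimension 128, vocabulary size 151936, untied input and output embeddings) and ignores normalization weights. `approx_params` is a hypothetical helper.

```python
def approx_params(num_layers, hidden_size, ffn_hidden_size,
                  num_heads=32, num_query_groups=8, head_dim=128,
                  vocab_size=151936):
    """Rough parameter count for a gated-MLP, grouped-query-attention decoder.

    The default values are ASSUMED for Qwen3-8B and are not taken from this
    notebook. Normalization weights are ignored because they are negligible.
    """
    q_proj = hidden_size * num_heads * head_dim
    kv_proj = 2 * hidden_size * num_query_groups * head_dim
    o_proj = num_heads * head_dim * hidden_size
    mlp = 3 * hidden_size * ffn_hidden_size   # gate, up, and down projections
    embeddings = 2 * vocab_size * hidden_size  # input embedding + output head
    return num_layers * (q_proj + kv_proj + o_proj + mlp) + embeddings


original = approx_params(36, 4096, 12288)
depth_pruned = approx_params(24, 4096, 12288)  # --target_num_layers 24
width_pruned = approx_params(36, 3584, 9216)   # hidden 3584, ffn 9216

for name, count in [("original", original),
                    ("depth-pruned", depth_pruned),
                    ("width-pruned", width_pruned)]:
    print(f"{name}: {count / 1e9:.2f}B")
```

Under these assumptions, the estimate gives roughly 8.2B for the original model and just under 6B for each pruned variant.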

> **NOTE:** Pruning takes less than 10 minutes to run, depending on the GPU.

In [None]:
SAVE_PATH = f"{ROOT_DIR}/Qwen3-8B-nemo-width-pruned"

!torchrun --nproc_per_node "{DEVICES}" "{NEMO_ROOT}/scripts/llm/gpt_prune.py" \
    --devices "{DEVICES}" \
    --tp_size "{TENSOR_PARALLEL_SIZE}" \
    --pp_size "{PIPELINE_PARALLEL_SIZE}" \
    --restore_path "{MODEL_PATH}" \
    --legacy_ckpt \
    --save_path "{SAVE_PATH}" \
    --seq_length "{SEQ_LENGTH}" \
    --num_train_samples "{NUM_TRAIN_SAMPLES}" \
    --mbs "{MICRO_BATCH_SIZE}" \
    --data_paths "{DATA_PATHS}" \
    --index_mapping_dir "{INDEX_MAPPING_DIR}" \
    --target_ffn_hidden_size 9216 \
    --target_hidden_size 3584

Running this script will save the width-pruned model to your workspace at `<ROOT_DIR>/Qwen3-8B-nemo-width-pruned`.

Now that we have the depth-pruned and width-pruned models, we can distill them from the unpruned model in the next step.