# Efficient Model Reduction with Pruning and Distillation of Llama 3.1 Using NeMo Framework

This tutorial demonstrates how to perform teacher fine-tuning, pruning (depth and width), and distillation on **Llama 3.1-8B** with NeMo Framework, using the [WikiText-103-v1](https://huggingface.co/datasets/Salesforce/wikitext/viewer/wikitext-103-v1) dataset. WikiText-103-v1 is a language modeling dataset comprising over 100 million tokens extracted from verified Good and Featured articles on Wikipedia.

For this demonstration, we first perform teacher correction: a light fine-tuning of the `Meta Llama 3.1 8B` teacher model on the dataset, which produces the fine-tuned teacher `megatron_llama_ft.nemo` needed for optimal distillation. The fine-tuned teacher is then pruned. There are two methods to prune a model, depth-pruning and width-pruning, and we explore both, yielding `4b_depth_pruned_model.nemo` and `4b_width_pruned_model.nemo`, respectively. These pruned models serve as the starting points (students) for distillation, which produces the final distilled 4B models.

> The models in this demonstration use the `meta-llama/Llama-3.1-8B` tokenizer.

> `NOTE:` Ensure that you run this notebook inside the [NeMo Framework container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo), which has all the required dependencies.

**Instructions for downloading the model and the container are available in the [README](./README.rst).**

In [None]:
!pip install --upgrade ipywidgets notebook
!pip install datasets

---
## Prerequisites
Ensure you meet the prerequisites listed in this section.
1. **Get the teacher model**: Download the `Meta Llama 3.1 8B .nemo` model. You must follow the instructions in the associated README to download and mount the folder to the NeMo Framework container.

In [None]:
!ls /workspace/llama-3_1-8b-nemo_v1.0/llama3_1_8b.nemo

2. **Set the Hugging Face Access Token**: You can obtain this from your [Hugging Face account](https://huggingface.co/docs/hub/en/security-tokens). 

In [None]:
from huggingface_hub import login
login(token="<YOUR_HF_ACCESS_TOKEN>")
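
Pasting the token directly into a notebook cell makes it easy to leak when the notebook is shared or committed. As a minimal sketch of an alternative, the snippet below reads the token from an environment variable and only calls `login` when one is found. It assumes you exported the token as `HF_TOKEN` before starting the notebook; the helper name `resolve_hf_token` is hypothetical and not part of this tutorial.

```python
import os


def resolve_hf_token(env_var="HF_TOKEN"):
    """Return the token stored in `env_var`, or None if it is unset or blank."""
    token = os.environ.get(env_var, "").strip()
    return token or None


token = resolve_hf_token()  # assumption: the token was exported as HF_TOKEN
if token is not None:
    # Imported lazily so the helper above stays usable without the package.
    from huggingface_hub import login

    login(token=token)
else:
    print("HF_TOKEN is not set; export it or pass the token to login() directly.")
```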

3. **Obtain the dataset**: Generate the `wikitext-{train/val/test}.jsonl` splits after loading the [WikiText-103-v1](https://huggingface.co/datasets/Salesforce/wikitext/viewer/wikitext-103-v1) dataset.

In [None]:
# Split into train, test and val files

import json
import os
from datasets import load_dataset

# Load the WikiText-103 dataset
dataset = load_dataset("wikitext", "wikitext-103-v1")

# Define the destination folder
data_folder = 'wikitext-data'
os.makedirs(data_folder, exist_ok=True)

# Map each dataset split to its output JSONL file
file_paths = {
    'train': os.path.join(data_folder, 'wikitext-train.jsonl'),
    'validation': os.path.join(data_folder, 'wikitext-val.jsonl'),
    'test': os.path.join(data_folder, 'wikitext-test.jsonl')
}

# Function to save dataset split to a JSONL file
def save_to_jsonl(file_path, data):
    with open(file_path, 'w') as file:
        for item in data:
            file.write(json.dumps(item) + '\n')

# Define splits
splits = ["train", "validation", "test"]

# Save each available split to its JSONL file
for split in splits:
    if split in dataset:
        save_to_jsonl(file_paths[split], dataset[split])
    else:
        print(f"Split {split} not found in the dataset.")
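
Before moving on, it can help to confirm that the generated files parse as JSONL. The following is a minimal sanity-check sketch, not part of the original workflow: it assumes each record carries its content under a `text` key and that the files live in the `wikitext-data` folder created above. The helper name `summarize_jsonl` is hypothetical.

```python
import json
import os


def summarize_jsonl(path, text_key="text"):
    """Count the records in a JSONL file and how many have non-blank text."""
    total = 0
    non_blank = 0
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # skip stray blank lines between records
            record = json.loads(line)  # raises if a line is not valid JSON
            total += 1
            if str(record.get(text_key, "")).strip():
                non_blank += 1
    return {"records": total, "non_blank": non_blank}


data_folder = "wikitext-data"  # same folder as in the cell above
for name in ("train", "val", "test"):
    path = os.path.join(data_folder, f"wikitext-{name}.jsonl")
    if os.path.exists(path):
        print(name, summarize_jsonl(path))
    else:
        print(f"{path} not found; run the split cell first.")
```

A large gap between `records` and `non_blank` is not necessarily an error, since the raw dataset may contain blank lines as separate records.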


---
## Step-by-Step Instructions

This workflow is organized into five steps across seven notebooks:
1. [Prepare the dataset](./01_data_preparation.ipynb)
2. [Fine-tune the teacher on the dataset](./02_teacher_finetuning.ipynb)
3. Prune the fine-tuned teacher model to create a student 
   - 3.a. [Using depth-pruning](./03_a_depth_pruning.ipynb)
   - 3.b. [Using width-pruning](./03_b_width_pruning.ipynb)
4. Distill knowledge from teacher into student
   - 4.a. [Using depth-pruned student](./04_a_distilling_depth_pruned_student.ipynb)
   - 4.b. [Using width-pruned student](./04_b_distilling_width_pruned_student.ipynb)
5. [Display the validation loss](./05_display_results.ipynb)

> `NOTE:` This tutorial explores two methods for pruning the fine-tuned teacher model: [depth-pruning](./03_a_depth_pruning.ipynb) and [width-pruning](./03_b_width_pruning.ipynb). Per the [tech report](https://arxiv.org/pdf/2408.11796), width-pruning generally outperforms depth-pruning. You can run either method or both.
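
To make the difference between the two pruning methods concrete, the sketch below compares rough parameter counts for a toy decoder configuration. It is an illustration only and does not use the NeMo pruning scripts: depth-pruning is modeled as removing whole layers, width-pruning as shrinking the hidden and MLP dimensions while keeping every layer. The teacher dimensions follow the published Llama 3.1 8B configuration. The pruned targets (16 layers; hidden size 3072 with MLP size 9216) are illustrative assumptions, not values taken from this tutorial's notebooks, and the count ignores normalization weights.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class DecoderConfig:
    num_layers: int
    hidden_size: int
    ffn_hidden_size: int
    num_heads: int = 32
    num_kv_heads: int = 8  # grouped-query attention
    head_dim: int = 128
    vocab_size: int = 128256


def approx_params(cfg):
    """Rough weight count: attention + gated MLP per layer, plus embeddings."""
    q_and_o = 2 * cfg.hidden_size * cfg.num_heads * cfg.head_dim
    k_and_v = 2 * cfg.hidden_size * cfg.num_kv_heads * cfg.head_dim
    mlp = 3 * cfg.hidden_size * cfg.ffn_hidden_size  # gate, up, down projections
    embeddings = 2 * cfg.vocab_size * cfg.hidden_size  # input and output embeddings
    return cfg.num_layers * (q_and_o + k_and_v + mlp) + embeddings


teacher = DecoderConfig(num_layers=32, hidden_size=4096, ffn_hidden_size=14336)

# Depth-pruning: keep the layer shape, drop half of the layers (assumed target).
depth_pruned = replace(teacher, num_layers=16)

# Width-pruning: keep all layers, shrink hidden and MLP sizes (assumed targets).
width_pruned = replace(teacher, hidden_size=3072, ffn_hidden_size=9216)

for name, cfg in [
    ("teacher", teacher),
    ("depth-pruned", depth_pruned),
    ("width-pruned", width_pruned),
]:
    print(f"{name:>12}: {approx_params(cfg) / 1e9:.2f}B parameters (approx.)")
```

Both assumed targets land at a similar size, which shows that the two methods differ in where capacity is removed rather than in how much.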