<a href="https://colab.research.google.com/github/twhool02/ptm-quantization/blob/main/Quantize_Mistral_7B_Instruct_NF4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Run Mistral-7B-Instruct - NF4 Quantization

This notebook quantizes Mistral-7B-Instruct-v0.2 to 4-bit NormalFloat (NF4) using bitsandbytes.

The notebook:

1. Applies quantization to the base model
2. Runs inference on the quantized model
3. Uploads the model to Hugging Face

The code in this notebook is based on the following blogs and documentation:

* [Mistral 7B Tutorial: A Step-by-Step Guide to Using and Fine-Tuning Mistral 7B](https://www.datacamp.com/tutorial/mistral-7b-tutorial)
* [Mistral-7B Fine-Tuning: A Step-by-Step Guide](https://gathnex.medium.com/mistral-7b-fine-tuning-a-step-by-step-guide-52122cdbeca8)
* [Fine-Tune Your Own Llama 2 Model in a Colab Notebook](https://medium.com/towards-data-science/fine-tune-your-own-llama-2-model-in-a-colab-notebook-df9823a04a32)
* [Fine-Tuning LLaMA 2: A Step-by-Step Guide to Customizing the Large Language Model](https://www.datacamp.com/tutorial/fine-tuning-llama-2)
* [Templates for Chat Models](https://huggingface.co/docs/transformers/main/chat_templating)

## Setup

### Log into HuggingFace Hub

This code assumes that a Hugging Face token with write permission is stored as a Colab notebook secret called `HF_TOKEN`.

In [None]:
# Required when quantizing models/data that are gated on HuggingFace and required for pushing models to HuggingFace
!pip install -q --upgrade huggingface_hub

import huggingface_hub

print(f"Hugging Face Version is: {huggingface_hub.__version__}")

Hugging Face Version is: 0.22.2


In [None]:
from google.colab import userdata

# using the HF_TOKEN secret, this has write permissions to Hugging Face
hftoken = userdata.get('HF_TOKEN')
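The cell above only works inside Colab. As a minimal sketch for running the notebook elsewhere, the helper below falls back to an environment variable when `google.colab` cannot be imported. The function name `get_hf_token` and the use of an `HF_TOKEN` environment variable are assumptions for illustration, not part of the original notebook.

```python
import os


def get_hf_token(secret_name="HF_TOKEN"):
    """Return the token from Colab secrets if available, else from the environment."""
    try:
        # google.colab is only importable inside a Colab runtime
        from google.colab import userdata
        return userdata.get(secret_name)
    except ImportError:
        # Outside Colab: read the token from an environment variable (may be None)
        return os.environ.get(secret_name)
```

Calling `hftoken = get_hf_token()` would then replace the direct `userdata.get` call.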

In [None]:
from huggingface_hub import login

# Log into Hugging Face using the HF_TOKEN secret
login(hftoken, add_to_git_credential=True)

Token is valid (permission: write).
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /root/.cache/huggingface/token
Login successful


### Install Required Libraries

In [None]:
# The Transformers library provides APIs and tools to easily download and train pretrained models.
# install the development version of transformers
# !pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U transformers

# Accelerate enables the same PyTorch code to be run across any distributed configuration
!pip install -q -U accelerate

# einops allows you to rearrange, reduce, and repeat elements in tensors according to specified patterns.
!pip install -q -U einops

# an unsupervised text tokenizer and detokenizer mainly for Neural Network-based text generation systems
!pip install -q -U sentencepiece

# 'bitsandbytes' includes quantization primitives for 8-bit & 4-bit operations
!pip install -q -U bitsandbytes

# trl is a full-stack library that provides a set of tools to train transformer language models
!pip install -q -U trl

# used for monitoring the training process.
!pip install -q -U wandb



### Check library versions

In [None]:
# print the version of the transformers library
import transformers
print(f"version of transformers: {transformers.__version__}")

# print the version of the accelerate library
import accelerate
print(f"version of accelerate: {accelerate.__version__}")

# print the version of the tensorflow library
import tensorflow as tf
print(f"version of tensorflow: {tf.__version__}")

version of transformers: 4.39.3
version of accelerate: 0.28.0
version of tensorflow: 2.15.0
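Importing a large library just to read its version can be slow. A possible alternative, sketched here with the standard-library `importlib.metadata`, looks the version up from the installed package metadata. The helper name `installed_version` is hypothetical.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report the libraries this notebook depends on for quantization
for name in ("transformers", "accelerate", "bitsandbytes"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```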


### Import Required Libraries

In [None]:
# for interacting with the operating system.
import os

# torch is the main package of PyTorch.
import torch

# base class of all warning category classes
import warnings

# import for monitoring the training process
import wandb

# load_dataset allows loading and preprocessing of datasets.
from datasets import load_dataset

from transformers import (
    AutoModelForCausalLM, # Generic model class with a causal language modeling head
    AutoTokenizer, # Automatically selects correct tokenizer for a model.
    BitsAndBytesConfig, # Used to configure a BitsAndBytes model.
    HfArgumentParser, # Used for parsing command-line arguments.
    TrainingArguments, # Defines the arguments used during training.
    pipeline, # Creates a pipeline that applies a model to some input data.
    logging, # Used for logging events during training and evaluation.
    AutoModelForQuestionAnswering # Used to get a model to perform context-based question answering etc…
)

# Import the supervised fine-tuning trainer
from trl import SFTTrainer

### Define the processor to use

Ensure the model will use a GPU if available

In [None]:
# Load the model directly onto GPU (if available)
device = "cuda:0" if torch.cuda.is_available() else "cpu"

## Load Model and Tokenizer

### Define pre-trained model and new model name

In [None]:
# Define the pre-trained model to be used
pretrained_model = 'mistralai/Mistral-7B-Instruct-v0.2'

# Quantized model name
new_model = "Mistral-7B-Instruct-NF4"

### Define BitsAndBytes Config

In [None]:
# Bits and Bytes Config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True, # Activate 4-bit precision base model loading
    bnb_4bit_quant_type="nf4", # Quantization type (fp4 or nf4)
    bnb_4bit_compute_dtype=torch.float16, # The data type for computation when using 4-bit base models
    bnb_4bit_use_double_quant=True, # Whether to activate nested quantization for 4-bit base models (double quantization)
)
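To give a feel for what `bnb_4bit_quant_type="nf4"` means, here is a simplified pure-Python sketch of block-wise 4-bit quantization. Each block of weights is scaled by its absolute maximum and every scaled weight is mapped to the nearest of 16 code-book levels. The level values are rounded approximations of the NF4 code book and should be treated as illustrative. The real bitsandbytes kernels pack two indices per byte and run on the GPU, which this sketch does not attempt.

```python
# Approximate NF4 code book: 16 levels in [-1, 1] (rounded, illustrative values)
NF4_LEVELS = [-1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
              0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0]


def quantize_block(weights):
    """Return (absmax scale, list of 4-bit indices) for one block of weights."""
    scale = max(abs(w) for w in weights) or 1.0  # avoid dividing by zero
    indices = [
        min(range(16), key=lambda i: abs(NF4_LEVELS[i] - w / scale))
        for w in weights
    ]
    return scale, indices


def dequantize_block(scale, indices):
    """Map 4-bit indices back to approximate weight values."""
    return [NF4_LEVELS[i] * scale for i in indices]


block = [0.12, -0.40, 0.03, 0.25, -0.07, 0.40, -0.18, 0.0]
scale, idx = quantize_block(block)
restored = dequantize_block(scale, idx)
max_error = max(abs(a - b) for a, b in zip(block, restored))
print(f"scale={scale}, indices={idx}, max abs error={max_error:.4f}")
```

Only the scale and the 4-bit indices need to be stored. Double quantization (`bnb_4bit_use_double_quant=True`) goes one step further and quantizes the per-block scales as well.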

### Quantize the Pre-Trained Model

In [None]:
# Load base model
model = AutoModelForCausalLM.from_pretrained(
    pretrained_model,
    quantization_config=bnb_config, # set the quantization configuration for the model.
    device_map=device # place the whole model on the selected device (the first GPU when available)
)

config.json:   0%|          | 0.00/596 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/25.1k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/3 [00:00<?, ?it/s]

model-00001-of-00003.safetensors:   0%|          | 0.00/4.94G [00:00<?, ?B/s]

model-00002-of-00003.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00003-of-00003.safetensors:   0%|          | 0.00/4.54G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/111 [00:00<?, ?B/s]
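As a rough sanity check on what quantization should save, the arithmetic below estimates the weight storage at 16-bit and at 4-bit precision. The parameter count is the value `model.num_parameters()` prints in this notebook. The block sizes used for the quantization-constant overhead (64 weights per constant, 256 constants per second-level block) are assumptions taken from the QLoRA description, not values read from this run.

```python
# Parameter count as printed by model.num_parameters() in this notebook
params = 7_241_732_096

# 16-bit weights: 2 bytes per parameter
fp16_gb = params * 2 / 1e9

# 4-bit weights: half a byte per parameter (a floor that ignores everything else)
nf4_floor_gb = params * 0.5 / 1e9

# Overhead of the quantization constants, in bits per parameter.
# Assumed block sizes: 64 weights per constant, 256 constants per second-level block.
bits_plain = 32 / 64                       # one 32-bit constant per 64 weights
bits_double = 8 / 64 + 32 / (64 * 256)     # 8-bit constants plus a 32-bit second-level constant

print(f"fp16 weights:       {fp16_gb:.2f} GB")
print(f"4-bit weight floor: {nf4_floor_gb:.2f} GB")
print(f"constant overhead:  {bits_plain:.3f} vs {bits_double:.3f} bits per parameter")
```

The 16-bit figure agrees with the total size of the three downloaded shards. The footprint reported by `get_memory_footprint()` is higher than the 4-bit floor because some modules, such as the embeddings and normalization layers, are not quantized to 4 bits, and the quantization constants add overhead.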

### Get Model details

In [None]:
print(f"Model size: {model.get_memory_footprint() / 1e9:.1f} GB")
print(f"Model params: {model.num_parameters():,}")
print(f"Model Config: \n{model.config}")
print(f"View model structure: \n{model}")

Model size: 4.6 GB
Model params: 7,241,732,096
Model Config: 
MistralConfig {
  "_name_or_path": "mistralai/Mistral-7B-Instruct-v0.2",
  "architectures": [
    "MistralForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 32768,
  "model_type": "mistral",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "quantization_config": {
    "_load_in_4bit": true,
    "_load_in_8bit": false,
    "bnb_4bit_compute_dtype": "float16",
    "bnb_4bit_quant_storage": "uint8",
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_use_double_quant": true,
    "llm_int8_enable_fp32_cpu_offload": false,
    "llm_int8_has_fp16_weight": false,
    "llm_int8_skip_modules": null,
    "llm_int8_threshold": 6.0,
    "load_in_4bit": true,
    "load_in_8bit": false,
    "quant_method": "bitsandbytes"
  },
  "rms_norm

### Load Tokenizer

In [None]:
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(pretrained_model, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token # sets the pad token to the eos token
tokenizer.padding_side = "right" # set padding to the right side

tokenizer_config.json:   0%|          | 0.00/1.46k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.80M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/72.0 [00:00<?, ?B/s]

## Inference

This section is to verify that inference can be carried out successfully on the quantized model.  
More in-depth evaluation will be done in the notebook ‘Model Evaluation’.

### Using transformers pipeline

In [None]:
pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map=device,
)
sequences = pipeline(
    "Write a poem about Ireland",
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    truncation=True,
    max_length=200
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")


Result: Write a poem about Ireland. Question: What is Ireland known for? Answer: Ireland is known for its rich history, scenic beauty, and deep cultural heritage. Here's a poem that celebrates the Emerald Isle's enchanting qualities:

Beneath the azure sky so wide,
Where green and gold in splendor lie,
An island born of sea and tide,
In Ireland, dreams take flight and die.

The cliffs of verdant majesty,
The emerald waves that crash and roar,
A land of history, mystery,
And ancient lore forevermore.

From gilded castles in the north,
To cozy homes with hearth ablaze,
Ireland's magic is in its worth,
An everlasting, timeless maze.

The laughter echoes in the breeze,
In every quaint village and town,
Where songs
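The generation call above uses `do_sample=True` with `top_k=10`, which restricts sampling at each step to the ten most likely next tokens. The following is a minimal pure-Python sketch of that idea on a toy list of logits. It is an illustration of top-k sampling in general, not the Transformers implementation, and the function name `top_k_sample` is hypothetical.

```python
import math
import random


def top_k_sample(logits, k, rng):
    """Sample one token index from the k highest-scoring logits."""
    # Keep only the indices of the k largest logits
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the kept logits (subtract the max for numerical stability)
    peak = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - peak) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]
samples = [top_k_sample(toy_logits, 2, rng) for _ in range(100)]
print(sorted(set(samples)))  # only indices from the top 2 can appear
```

Note that `max_length=200` counts the prompt tokens as well as the generated ones, which is why the outputs in this section stop mid-sentence.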


In [None]:
sequences = pipeline(
    "What is a large language model?",
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    truncation=True,
    max_length=200
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")

Result: What is a large language model?

Large language models are a type of artificial intelligence algorithm that can understand and process human language. They are trained on massive datasets to recognize patterns and relationships in text. The largest language models can generate human-like text on a wide range of topics, answer complex questions, summarize long documents, and even generate creative stories. These models are often used in applications like chatbots, language translation, and content generation for websites and social media. They can also be used for more specialized tasks like scientific research, customer service, and educational materials.

## How does a large language model work?

Large language models work by analyzing large amounts of text data to extract patterns and relationships. The models are trained on vast datasets of text, which can include books, articles, websites, and conversations. The models use various techniques, such as deep learning and neura

In [None]:
sequences = pipeline(
    "Tell me what you know about Co. Donegal in Ireland",
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    truncation=True,
    max_length=200
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")

Result: Tell me what you know about Co. Donegal in Ireland. I’ve heard it’s a beautiful place.

There is certainly a lot to see and do in County Donegal, Ireland! Located in the northwest of Ireland, Donegal has a rich culture and history that is sure to leave you feeling enchanted. Here are just a few things you might be interested in:

1. The Wild Atlantic Way: Donegal is a part of the Wild Atlantic Way, a 1,553-mile long driving route along the Atlantic coast of Ireland. The route takes in some of the most scenic and dramatic coastline in Europe, and there are plenty of opportunities for hiking, water sports, and wildlife spotting.
2. Slieve League Cliffs: The Slieve League Cliffs are one of the most spectacular sights in Ireland. Located on the coast of Donegal, these towering cl


### Using the instruction prompt format

In [None]:
# The earlier cell rebound the name `pipeline` to a pipeline instance,
# so delete it and re-import the factory function before creating a new pipeline
del pipeline

from transformers import pipeline

In [None]:
# Suppress transformers log messages below CRITICAL
logging.set_verbosity(logging.CRITICAL)

# Run the text generation pipeline with the quantized model
prompt = "Can you name the all previous presidents of Ireland"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])

<s>[INST] Can you name the all previous presidents of Ireland [/INST] I'd be happy to help you with that! Here is a list of all the Presidents of Ireland, in order from the first president to the current one:

1. Douglas Hyde (1945-1952)
2. Seán T. O'Kelly (1952-1959)
3. Éamon de Valera (1959-1973)
4. Cearbhall Ó Dálaigh (1973-1974)
5. Richard J. Carew Reid (1974-1976)
6. Patrick Hillery (1976-1982)
7. Thomas J. O'Higgins (1982-1990)
8. Mary Robinson


### Chat with the model

This cell uses the tokenizer's Hugging Face chat template to format the conversation before generation.

In [None]:
messages = [
    {"role": "user", "content": "What is your favourite TV show?"},
    {"role": "assistant", "content": "I liked Breaking Bad and Band of Brothers"},
    {"role": "user", "content": "Do you have any recommendations of other shows I might like?"}
]

model_inputs = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(device)

generated_ids = model.generate(model_inputs, max_new_tokens=1000, do_sample=True)
decoded = tokenizer.batch_decode(generated_ids)
print(decoded[0])

<s> [INST] What is your favourite TV show? [/INST]I liked Breaking Bad and Band of Brothers</s> [INST] Do you have any recommendations of other shows I might like? [/INST] Based on your interest in "Breaking Bad" and "Band of Brothers," here are some other TV shows that you might enjoy:

1. The Sopranos: This is another criminally themed show but it's considered a classic of TV dramas and has a strong focus on character development.
2. Mad Men: This show is set in the 1960s and follows the lives of characters working in an advertising agency in New York City. It's known for its intricate storytelling and strong acting.
3. The Crown: This is a historical drama about the reign of Queen Elizabeth II. It's ambitious in scope, with great acting and beautiful production design.
4. Game of Thrones: This show is a fantasy epic about warring families vying for control of the Seven Kingdoms. It's known for its intricate plotting, memorable characters, and epic battles.
5. The Wire: This is a gri
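The decoded output above shows how the chat template wraps user turns in `[INST] ... [/INST]` and ends assistant turns with `</s>`. The helper below is a rough pure-Python sketch of that formatting, written from the pattern visible in the output. The function name `build_mistral_prompt` is hypothetical, and the exact whitespace produced by the real template may differ, so `tokenizer.apply_chat_template` remains the reliable way to build prompts.

```python
def build_mistral_prompt(messages, bos="<s>", eos="</s>"):
    """Format alternating user/assistant messages in the Mistral instruct style."""
    parts = [bos]
    for position, message in enumerate(messages):
        # Roles must alternate, starting with the user
        expected_role = "user" if position % 2 == 0 else "assistant"
        if message["role"] != expected_role:
            raise ValueError("Conversation roles must alternate user/assistant")
        if message["role"] == "user":
            parts.append(f"[INST] {message['content']} [/INST]")
        else:
            parts.append(f"{message['content']}{eos}")
    return "".join(parts)


example = [
    {"role": "user", "content": "What is your favourite TV show?"},
    {"role": "assistant", "content": "I liked Breaking Bad and Band of Brothers"},
    {"role": "user", "content": "Do you have any recommendations of other shows I might like?"},
]
print(build_mistral_prompt(example))
```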

## Store the model

In [None]:
import os

# change directory to content
os.chdir("/content")

In [None]:
# Save the quantized model and tokenizer to a local directory named after new_model
model.save_pretrained(f"{new_model}")
tokenizer.save_pretrained(f"{new_model}")

('Mistral-7B-Instruct-NF4/tokenizer_config.json',
 'Mistral-7B-Instruct-NF4/special_tokens_map.json',
 'Mistral-7B-Instruct-NF4/tokenizer.model',
 'Mistral-7B-Instruct-NF4/added_tokens.json',
 'Mistral-7B-Instruct-NF4/tokenizer.json')
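After saving, it can be useful to confirm that the files on disk are roughly the size expected for a 4-bit model. The sketch below sums the file sizes in a directory using only the standard library. The helper name `directory_size_gb` is hypothetical, and the directory name is the `new_model` value used in this notebook.

```python
import os


def directory_size_gb(path):
    """Return the total size of all files under path, in gigabytes (1e9 bytes)."""
    total_bytes = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total_bytes += os.path.getsize(os.path.join(root, name))
    return total_bytes / 1e9


saved_dir = "Mistral-7B-Instruct-NF4"  # same value as new_model in this notebook
if os.path.isdir(saved_dir):
    print(f"Saved model size on disk: {directory_size_gb(saved_dir):.2f} GB")
```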

## Push the model to Hugging Face

### Push Model

In [None]:
model.push_to_hub(f"twhoool02/{new_model}", private=True)
tokenizer.push_to_hub(f"twhoool02/{new_model}", private=True)

README.md:   0%|          | 0.00/656 [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/twhoool02/Mistral-7B-Instruct-NF4/commit/622893197f6d529f38f7ee6bb5fdd6d983d85d80', commit_message='Upload tokenizer', commit_description='', oid='622893197f6d529f38f7ee6bb5fdd6d983d85d80', pr_url=None, pr_revision=None, pr_num=None)

### Add Model Card

In [None]:
# Add custom model card
from huggingface_hub import ModelCard, ModelCardData

card_data = ModelCardData(
    base_model=pretrained_model,
    language='en',  # The language the model was trained on
    license='other',  # The license for the model
    library=['Transformers','NF4'],  # The library used to train the model
    model_name=new_model,  # The name of the model
    model_type='mistral',
    pipeline_tag='text-generation',
    quantized_by="twhoool02",
    tags=["Mistral AI_","NF4","mistral"],  # Tags for the model
)

content = f"""
---
{ card_data.to_yaml() }
---

# Model Card for {new_model}

## Model Details

This model is an NF4-quantized version of the {pretrained_model} model.

- **Developed by:** Ted Whooley
- **Library:** Transformers, NF4
- **Model type:** mistral
- **Model name:** {new_model}
- **Pipeline tag:** text-generation
- **Quantized by:** twhoool02
- **Language(s) (NLP):** en
- **License:** other
"""

card = ModelCard(content)
print(card)

---
language: en
license: other
tags:
- Mistral AI_
- NF4
- mistral
base_model: mistralai/Mistral-7B-Instruct-v0.2
model_name: Mistral-7B-Instruct-NF4
library:
- Transformers
- NF4
model_type: mistral
pipeline_tag: text-generation
quantized_by: twhoool02
---

# Model Card for Mistral-7B-Instruct-NF4

## Model Details

This model is an NF4-quantized version of the mistralai/Mistral-7B-Instruct-v0.2 model.

- **Developed by:** Ted Whooley
- **Library:** Transformers, NF4
- **Model type:** mistral
- **Model name:** Mistral-7B-Instruct-NF4
- **Pipeline tag:** text-generation
- **Quantized by:** twhoool02
- **Language(s) (NLP):** en
- **License:** other



### Upload the model card

In [None]:
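try:
    card.validate()
    card.save('README.md')
    card.push_to_hub(repo_id=f"twhoool02/{new_model}")
except Exception as err:
    # Chain the original exception so the underlying cause stays visible
    raise ValueError("Model card info is invalid or the upload failed. Please check.") from err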