<a href="https://colab.research.google.com/github/twhool02/ptm-quantization/blob/main/Quantize_Mistral_7B_Instruct_NF4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Run Mistral-7B-Instruct - NF4 Quantization

This notebook quantizes the Mistral-7B-Instruct-v0.2 model to 4-bit NF4 using bitsandbytes.

The notebook:

1. Loads the pre-trained model
2. Applies NF4 quantization to the base model
3. Runs inference on the quantized model
4. Saves the quantized model and uploads it to Hugging Face

The code in this notebook is based on the following blogs and documentation:

* [Mistral 7B Tutorial: A Step-by-Step Guide to Using and Fine-Tuning Mistral 7B](https://www.datacamp.com/tutorial/mistral-7b-tutorial)
* [Mistral-7B Fine-Tuning: A Step-by-Step Guide](https://gathnex.medium.com/mistral-7b-fine-tuning-a-step-by-step-guide-52122cdbeca8)
* [Fine-Tune Your Own Llama 2 Model in a Colab Notebook](https://medium.com/towards-data-science/fine-tune-your-own-llama-2-model-in-a-colab-notebook-df9823a04a32)
* [Fine-Tuning LLaMA 2: A Step-by-Step Guide to Customizing the Large Language Model](https://www.datacamp.com/tutorial/fine-tuning-llama-2)
* [Templates for Chat Models](https://huggingface.co/docs/transformers/main/chat_templating)

## Setup

### Log into HuggingFace Hub

This code assumes that the user has a Hugging Face token set up as a notebook secret called HF_TOKEN.

In [1]:
# Required when interacting with HuggingFace Hub
!pip install -q --upgrade huggingface_hub

import huggingface_hub

print(f"Hugging Face Version is: {huggingface_hub.__version__}")

Hugging Face Version is: 0.22.2


In [2]:
from google.colab import userdata

# using the HF_TOKEN secret, this has write permissions to Hugging Face
hftoken = userdata.get('HF_TOKEN')
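
The cell above only works inside Colab. As a minimal sketch for running the notebook elsewhere, the helper below falls back to an environment variable when `google.colab` cannot be imported. The function name `get_hf_token` and the environment-variable fallback are assumptions for illustration, not part of the original notebook.

```python
import os


def get_hf_token(secret_name: str = "HF_TOKEN"):
    """Return the token from Colab secrets if possible, else from the environment."""
    try:
        # google.colab is only importable inside a Colab runtime
        from google.colab import userdata
    except ImportError:
        return os.environ.get(secret_name)
    try:
        return userdata.get(secret_name)
    except Exception:
        # Secret missing or notebook access not granted: fall back to the environment
        return os.environ.get(secret_name)


hftoken = get_hf_token()
```

If neither source provides a token, `hftoken` is `None` and the login step would need the token supplied some other way.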

In [3]:
from huggingface_hub import login

# Log into Hugging Face using the HF_TOKEN secret
login(hftoken, add_to_git_credential=True)

Token is valid (permission: write).
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /root/.cache/huggingface/token
Login successful


### Install Required Libraries

In [4]:
# The Transformers library provides APIs and tools to easily download and train pretrained models.
!pip install -q -U transformers

# Accelerate enables the same PyTorch code to be run across any distributed configuration
!pip install -q -U accelerate

# 'bitsandbytes' includes quantization primitives for 8-bit & 4-bit operations
!pip install -q -U bitsandbytes


### Check library versions

In [5]:
#print the version of transformers
import transformers
print(f"version of transformers: {transformers.__version__}")

# print the version of the accelerate library
import accelerate
print(f"version of accelerate: {accelerate.__version__}")

# print the version of the tensorflow library
import tensorflow as tf
print(f"version of tensorflow: {tf.__version__}")

version of transformers: 4.40.0
version of accelerate: 0.29.3
version of tensorflow: 2.15.0
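
As an alternative sketch, installed versions can be read from package metadata without importing each library, which also covers `bitsandbytes` and `torch`. The helper name `installed_version` is hypothetical; `importlib.metadata` is part of the Python standard library.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str) -> str:
    """Return the installed version of a package, or a placeholder if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return "not installed"


for name in ("transformers", "accelerate", "bitsandbytes", "torch"):
    print(f"version of {name}: {installed_version(name)}")
```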


### Import Required Libraries

In [6]:
# for interacting with the operating system.
import os

# torch is the main package of PyTorch.
import torch

# base class of all warning category classes
import warnings

from transformers import (
    AutoModelForCausalLM, # Generic model class with a causal language modeling head
    AutoTokenizer, # Automatically selects correct tokenizer for a model.
    BitsAndBytesConfig, # Used to configure a BitsAndBytes model.
    pipeline, # Creates a pipeline that applies a model to some input data.
    logging, # Logs events during training and evaluation.
)

### Define the processor to use

Use a GPU if one is available; otherwise fall back to the CPU.

In [7]:
# Load the model directly onto GPU (if available)
device = "cuda:0" if torch.cuda.is_available() else "cpu"

## Load Model and Tokenizer

### Define pre-trained model and new model name

In [11]:
# Define the pre-trained model to be used
pretrained_model = 'mistralai/Mistral-7B-Instruct-v0.2'

# Quantized model name
new_model = "Mistral-7B-Instruct-NF4"

### Define BitsAndBytes Config

In [12]:
# Bits and Bytes Config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True, # Activate 4-bit precision base model loading
    bnb_4bit_quant_type="nf4", # Quantization type (fp4 or nf4)
    bnb_4bit_compute_dtype=torch.float16, # The data type for computation when using 4-bit base models
    bnb_4bit_use_double_quant=True, # Whether to activate nested quantization for 4-bit base models (double quantization)
)
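
To give a feel for what 4-bit block quantization does, here is a toy sketch in plain Python. It is a simplification and an assumption for illustration only: it uses 16 evenly spaced levels scaled by the block's absolute maximum, whereas NF4 in bitsandbytes uses a non-uniform 16-value codebook, and double quantization additionally compresses the per-block scaling constants. The function names are hypothetical.

```python
def quantize_block(values, levels=16):
    """Map each float to one of `levels` integer codes, scaled by the block's absmax."""
    # Use 1.0 as the scale for an all-zero block to avoid dividing by zero
    absmax = max(abs(v) for v in values) or 1.0
    step = 2 * absmax / (levels - 1)
    codes = [round((v + absmax) / step) for v in values]
    return codes, absmax


def dequantize_block(codes, absmax, levels=16):
    """Reconstruct approximate floats from integer codes and the block scale."""
    step = 2 * absmax / (levels - 1)
    return [code * step - absmax for code in codes]


weights = [0.5, -1.0, 0.25, 0.0]
codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)
print(codes, [round(w, 3) for w in restored])
```

Each weight is stored as a code between 0 and 15 (4 bits) plus one shared scale per block, so the reconstruction error per weight is at most half a step.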

### Quantize the Pre-Trained Model

In [14]:
# Load base model
model = AutoModelForCausalLM.from_pretrained(
    pretrained_model,
    quantization_config=bnb_config, # set the quantization configuration for the model.
    device_map=device # place the whole model on the selected device (first GPU if available, else CPU)
)


### Get Model details

In [15]:
print(f"Model size: {model.get_memory_footprint() / 1e9:.1f} GB")
print(f"Model params: {model.num_parameters():,}")
print(f"Model Config: \n{model.config}")
print(f"View model structure: \n{model}")

Model size: 4.6 GB
Model params: 7,241,732,096
Model Config: 
MistralConfig {
  "_name_or_path": "mistralai/Mistral-7B-Instruct-v0.2",
  "architectures": [
    "MistralForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 32768,
  "model_type": "mistral",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "quantization_config": {
    "_load_in_4bit": true,
    "_load_in_8bit": false,
    "bnb_4bit_compute_dtype": "float16",
    "bnb_4bit_quant_storage": "uint8",
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_use_double_quant": true,
    "llm_int8_enable_fp32_cpu_offload": false,
    "llm_int8_has_fp16_weight": false,
    "llm_int8_skip_modules": null,
    "llm_int8_threshold": 6.0,
    "load_in_4bit": true,
    "load_in_8bit": false,
    "quant_method": "bitsandbytes"
  },
  "rms_norm
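
As a rough sanity check on the reported footprint, the arithmetic below computes idealized sizes for the parameter count printed above. It is only a back-of-the-envelope sketch: it assumes every parameter is stored at the same bit width, which is not true of the loaded model. The reported 4.6 GB sits above the 4-bit figure, plausibly because some tensors are not stored in 4 bits and quantization adds scaling constants. The helper name `ideal_size_gb` is hypothetical.

```python
def ideal_size_gb(num_params: int, bits_per_param: float) -> float:
    """Size in GB if every parameter used exactly `bits_per_param` bits."""
    return num_params * bits_per_param / 8 / 1e9


params = 7_241_732_096  # value printed by model.num_parameters() above
print(f"fp16 : {ideal_size_gb(params, 16):.1f} GB")  # 14.5 GB
print(f"4-bit: {ideal_size_gb(params, 4):.1f} GB")   # 3.6 GB
```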

### Load Tokenizer

In [16]:
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(pretrained_model, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token # sets the pad token to the eos token
tokenizer.padding_side = "right" # set padding to the right side


## Inference

This section is to verify that inference can be carried out successfully on the quantized model.  
More in-depth evaluation will be done in the notebook ‘Model Evaluation’.

### Using transformers pipeline

In [17]:
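# NOTE: this rebinds the name `pipeline` from the imported factory function to the pipeline instance
pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map=device,
)
# Sample one completion; max_length counts the prompt tokens as well as the generated ones
sequences = pipeline(
    "Write a poem about Ireland",
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    truncation=True,
    max_length=200
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")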


Result: Write a poem about Ireland, Q1.2 What is the significance of the name of the country?
A: In the land of saints and scholars,
Where ancient whispers of a thousand years
Echo through the rolling, lush green hills,
A tale of history, of love, and of spills

Ireland, the land of eternal charm,
Of legends woven like a tapestry,
Where every stone tells a thousand stories,
And the sea cradles the shores, in its history.

The name of this emerald isle,
A mystical, enchanting word,
Bearing tales of brave warriors,
Of kings and heroes that never got old.

Eire, the land of the ever free,
Where the heart beats strong, like the Shannon's tide,
A land where the past intertwines with the present,
A beacon of
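
The result above begins with the prompt itself, because the text-generation pipeline returns the prompt followed by the continuation by default. A minimal sketch for separating the two is shown below; the helper name `strip_prompt` is hypothetical. The pipeline also accepts `return_full_text=False`, which should make this post-processing unnecessary.

```python
def strip_prompt(prompt: str, generated_text: str) -> str:
    """Return only the continuation when the generated text starts with the prompt."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].lstrip()
    return generated_text


example = "Write a poem about Ireland, Q1.2 What is the significance"
print(strip_prompt("Write a poem about Ireland", example))
```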


In [18]:
sequences = pipeline(
    "What is a large language model?",
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    truncation=True,
    max_length=200
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")

Result: What is a large language model?

A large language model is a type of artificial intelligence (AI) model that can generate human-like text based on given input. These models are trained on vast amounts of data and can understand the context and meaning of the text they are given, as well as generate new, coherent text based on that input.

Some of the most popular large language models include Bing's ChatGPT, Google's Bard, and Microsoft's Turing. These models can be used for a variety of applications, including generating chatbot responses, writing articles, summarizing text, translating languages, and even writing code.

One of the most impressive features of large language models is their ability to understand context and generate human-like text. For example, if you ask a question like "What's the weather like in New York City today?" a large language model can understand that you are asking for the current weather and generate


In [19]:
sequences = pipeline(
    "Tell me what you know about Co. Donegal in Ireland",
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    truncation=True,
    max_length=200
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")

Result: Tell me what you know about Co. Donegal in Ireland and I’ll try and help you. 

Co Donegal, located in the northwest of Ireland, is one of the most beautiful and rugged counties in Ireland, known for its stunning beaches, breathtaking cliffs, crystal clear waters, and rich cultural heritage. It is a popular tourist destination, especially for those seeking a more natural and unspoilt experience than can be found in more crowded areas. 

Some of the main attractions include:

Slieve League Cliffs: These iconic sea cliffs offer some of the most dramatic and spectacular views in Ireland, with sheer drops and awe-inspiring vistas.

Malin Head: The most northerly point of Ireland, Malin Head is a must-visit for any nature lover, with its beautiful beaches, rugged cliffs, and abundance of wildlife.

Gl
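
The outputs in this section stop mid-sentence because `max_length=200` limits the prompt and the generated tokens together. The sketch below only illustrates that budget; the helper name and the prompt length of 15 tokens are hypothetical. The chat cell later in this notebook uses `max_new_tokens` instead, which bounds only the continuation.

```python
def new_token_budget(max_length: int, prompt_tokens: int) -> int:
    """Number of tokens left for generation when max_length includes the prompt."""
    return max(max_length - prompt_tokens, 0)


# A hypothetical 15-token prompt leaves 185 tokens for the answer
print(new_token_budget(max_length=200, prompt_tokens=15))
```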


### Using the instruction prompt format

In [20]:
# clear down the existing pipeline so a new instance can be created
del pipeline

from transformers import pipeline

In [21]:
# Only show log messages at CRITICAL level, which hides warnings
logging.set_verbosity(logging.CRITICAL)

prompt = "Can you name the all previous presidents of Ireland"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])

<s>[INST] Can you name the all previous presidents of Ireland [/INST] I'd be happy to help you with that! Here is a list of all the Presidents of Ireland, in order from the first president to the current one:

1. Douglas Hyde (1945-1952)
2. Seán T. O'Kelly (1952-1959)
3. Éamon de Valera (1959-1973)
4. Cearbhall Ó Dálaigh (1973-1974)
5. Richard J. Carew Reid (1974-1976)
6. Patrick Hillery (1976-1982)
7. Thomas J. O'Higgins (1982-1990)
8. Mary Robinson


### Chat with the model

This cell uses the tokenizer's Hugging Face chat template to format a multi-turn conversation.

In [22]:
messages = [
    {"role": "user", "content": "What is your favourite TV show?"},
    {"role": "assistant", "content": "I liked Breaking Bad and Band of Brothers"},
    {"role": "user", "content": "Do you have any recommendations of other shows I might like?"}
]

model_inputs = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(device)

generated_ids = model.generate(model_inputs, max_new_tokens=1000, do_sample=True)
decoded = tokenizer.batch_decode(generated_ids)
print(decoded[0])

<s> [INST] What is your favourite TV show? [/INST]I liked Breaking Bad and Band of Brothers</s> [INST] Do you have any recommendations of other shows I might like? [/INST] Based on my previous answer, you might enjoy "Better Call Saul" if you liked "Breaking Bad". It's a prequel and follows the life of Saul Goodman, a criminal lawyer, before he meets Walter White and Jesse Pinkman. If you appreciated the historical and military themes in "Band of Brothers", I would recommend "The Pacific" which is another miniseries by Tom Hanks and Steven Spielberg, but focused on the Pacific War during World War II. Another great recommendation is "Peaky Blinders" which is a crime drama set in post-WWI England, and is known for its complex characters and excellent performances. Finally, for sci-fi fans, "Stranger Things" is a must-watch; it's a thrilling and nostalgic adventure set in the 80's and contains elements of science-fiction, horror, and coming-of-age.</s>
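
To make the structure of the decoded prompt above easier to see, here is a simplified re-implementation of the instruction format in plain Python. It is a sketch inferred from the output above, not the tokenizer's actual template: exact whitespace may differ, and `tokenizer.apply_chat_template` remains the authoritative way to build prompts. The function name `build_mistral_prompt` is hypothetical.

```python
def build_mistral_prompt(messages, bos="<s>", eos="</s>"):
    """Wrap user turns in [INST] tags and end assistant turns with the EOS token."""
    parts = [bos]
    for index, message in enumerate(messages):
        # Turns are expected to alternate, starting with the user
        expected_role = "user" if index % 2 == 0 else "assistant"
        if message["role"] != expected_role:
            raise ValueError("roles must alternate user/assistant, starting with user")
        if message["role"] == "user":
            parts.append(f"[INST] {message['content']} [/INST]")
        else:
            parts.append(f"{message['content']}{eos}")
    return "".join(parts)


chat = [
    {"role": "user", "content": "What is your favourite TV show?"},
    {"role": "assistant", "content": "I liked Breaking Bad and Band of Brothers"},
    {"role": "user", "content": "Do you have any recommendations of other shows I might like?"},
]
print(build_mistral_prompt(chat))
```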


## Store the model

In [23]:
import os

# change directory to content
os.chdir("/content")

In [24]:
# Save the quantized model and tokenizer to a local directory; new_model can be any name
model.save_pretrained(new_model)
tokenizer.save_pretrained(new_model)

('Mistral-7B-Instruct-NF4/tokenizer_config.json',
 'Mistral-7B-Instruct-NF4/special_tokens_map.json',
 'Mistral-7B-Instruct-NF4/tokenizer.model',
 'Mistral-7B-Instruct-NF4/added_tokens.json',
 'Mistral-7B-Instruct-NF4/tokenizer.json')
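
Before uploading, it can be useful to confirm what was written to disk. The sketch below lists every file in a directory with its size using only the standard library; the helper name `saved_files` is hypothetical, and the directory name is taken from `new_model` as defined earlier.

```python
from pathlib import Path


def saved_files(directory):
    """Return (relative path, size in bytes) for every file under `directory`."""
    root = Path(directory)
    return sorted(
        (path.relative_to(root).as_posix(), path.stat().st_size)
        for path in root.rglob("*")
        if path.is_file()
    )


for name, size in saved_files("Mistral-7B-Instruct-NF4"):
    print(f"{name}: {size / 1e6:.1f} MB")
```

If the directory does not exist, the loop simply prints nothing.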

## Push the model to Hugging Face

### Push Model

In [25]:
model.push_to_hub(f"twhoool02/{new_model}", private=True)
tokenizer.push_to_hub(f"twhoool02/{new_model}", private=True)


CommitInfo(commit_url='https://huggingface.co/twhoool02/Mistral-7B-Instruct-NF4/commit/62e9572c0bd6e8c3b422e6d42948b29e9c7ee44a', commit_message='Upload tokenizer', commit_description='', oid='62e9572c0bd6e8c3b422e6d42948b29e9c7ee44a', pr_url=None, pr_revision=None, pr_num=None)

### Add Model Card

In [26]:
# Add custom model card
from huggingface_hub import ModelCard, ModelCardData

card_data = ModelCardData(
    base_model=pretrained_model,
    language='en',  # The language the model was trained on
    license='other',  # The license for the model
    library=['Transformers','NF4'],  # The library used to train the model
    model_name=new_model,  # The name of the model
    model_type='mistral',
    pipeline_tag='text-generation',
    quantized_by="twhoool02",
    tags=["Mistral AI_","NF4","mistral"],  # Tags for the model
)

content = f"""
---
{ card_data.to_yaml() }
---

# Model Card for {new_model}

## Model Details

This model is an NF4 quantized version of the {pretrained_model} model.

- **Developed by:** Ted Whooley
- **Library:** Transformers, NF4
- **Model type:** mistral
- **Model name:** {new_model}
- **Pipeline tag:** text-generation
- **Quantized by:** twhoool02
- **Language(s) (NLP):** en
- **License:** other
"""

card = ModelCard(content)
print(card)

---
language: en
license: other
tags:
- Mistral AI_
- NF4
- mistral
base_model: mistralai/Mistral-7B-Instruct-v0.2
model_name: Mistral-7B-Instruct-NF4
library:
- Transformers
- NF4
model_type: mistral
pipeline_tag: text-generation
quantized_by: twhoool02
---

# Model Card for Mistral-7B-Instruct-NF4

## Model Details

This model is an NF4 quantized version of the mistralai/Mistral-7B-Instruct-v0.2 model.

- **Developed by:** Ted Whooley
- **Library:** Transformers, NF4
- **Model type:** mistral
- **Model name:** Mistral-7B-Instruct-NF4
- **Pipeline tag:** text-generation
- **Quantized by:** twhoool02
- **Language(s) (NLP):** en
- **License:** other



### Upload the model card

In [27]:
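# Validate first so that a validation failure is not confused with an upload failure
try:
    card.validate()
except ValueError as err:
    raise ValueError("Model card info is invalid. Please check.") from err

card.save('README.md')
card.push_to_hub(repo_id=f"twhoool02/{new_model}")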