<a href="https://colab.research.google.com/github/twhool02/ptm-quantization/blob/main/Quantize_falcon_7b_instruct_NF4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Run Falcon-7B-HF - NF4 Quantization

This notebook quantizes the Hugging Face version of the Falcon-7B-Instruct model (`tiiuae/falcon-7b-instruct`) to 4-bit NF4 using bitsandbytes.

The notebook:

1. Applies quantization to the base model
2. Runs inference on the quantized model
3. Saves the model and uploads it to Hugging Face

The code in this notebook is based on the following blogs and documentation:

* [The Falcon has landed in the Hugging Face ecosystem](https://huggingface.co/blog/falcon#inference)
* [Fine-Tune Your Own Llama 2 Model in a Colab Notebook](https://medium.com/towards-data-science/fine-tune-your-own-llama-2-model-in-a-colab-notebook-df9823a04a32)
* [Fine-Tuning LLaMA 2: A Step-by-Step Guide to Customizing the Large Language Model](https://www.datacamp.com/tutorial/fine-tuning-llama-2)
* [Mistral-7B Fine-Tuning: A Step-by-Step Guide](https://gathnex.medium.com/mistral-7b-fine-tuning-a-step-by-step-guide-52122cdbeca8)

## Setup

### Log into HuggingFace Hub

This code assumes that a Hugging Face token with write permission is stored as a notebook secret called `HF_TOKEN`.

In [None]:
# Required when quantizing models/data that are gated on HuggingFace and required for pushing models to HuggingFace
!pip install -q --upgrade huggingface_hub

import huggingface_hub

print(f"Hugging Face Version is: {huggingface_hub.__version__}")

Hugging Face Version is: 0.22.2


In [None]:
from google.colab import userdata

# using the HF_TOKEN secret, this has write permissions to Hugging Face
hftoken = userdata.get('HF_TOKEN')
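
Outside Colab, `google.colab.userdata` cannot be imported. A minimal sketch of a fallback follows, assuming the token is exported as an `HF_TOKEN` environment variable in that case; the helper name `get_hf_token` is hypothetical and not part of any library.

```python
import os


def get_hf_token(secret_name="HF_TOKEN"):
    """Return the Hugging Face token from Colab secrets, else from the environment."""
    try:
        # Only importable inside a Colab runtime
        from google.colab import userdata
    except ImportError:
        # Assumption: outside Colab the token is exported as an environment variable
        return os.environ.get(secret_name)
    return userdata.get(secret_name)
```

With this helper, `hftoken = get_hf_token()` could replace the direct `userdata.get` call; it returns `None` when no token is found outside Colab.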

In [None]:
from huggingface_hub import login

# Log into Hugging Face using the HF_TOKEN secret
login(hftoken, add_to_git_credential=True)

Token is valid (permission: write).
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /root/.cache/huggingface/token
Login successful


### Install Required Libraries

In [None]:
# The Transformers library provides APIs and tools to easily download and train pretrained models.
# install the development version of transformers
# !pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U transformers

# Accelerate enables the same PyTorch code to be run across any distributed configuration
!pip install -q -U accelerate

# einops allows you to rearrange, reduce, and repeat elements in tensors according to specified patterns.
!pip install -q -U einops

# an unsupervised text tokenizer and detokenizer mainly for Neural Network-based text generation systems
!pip install -q -U sentencepiece

# 'bitsandbytes' includes quantization primitives for 8-bit & 4-bit operations
!pip install -q -U bitsandbytes

# trl is a full-stack library that provides a set of tools to train transformer language models
!pip install -q -U trl

# used for monitoring the training process.
!pip install -q -U wandb

### Check library versions

In [None]:
# print the version of the transformers library
import transformers
print(f"version of transformers: {transformers.__version__}")

# print the version of the accelerate library
import accelerate
print(f"version of accelerate: {accelerate.__version__}")

# print the version of the tensorflow library
import tensorflow as tf
print(f"version of tensorflow: {tf.__version__}")

version of transformers: 4.39.3
version of accelerate: 0.28.0
version of tensorflow: 2.15.0
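
Importing a package just to read its version can be slow, particularly for TensorFlow, which the rest of this notebook does not appear to use. As an alternative sketch, the standard library's `importlib.metadata` reads the installed version without importing the package. The helper name `installed_version` and the package list are illustrative.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Illustrative list of the packages this notebook relies on
for name in ("transformers", "accelerate", "bitsandbytes", "torch"):
    print(f"version of {name}: {installed_version(name) or 'not installed'}")
```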


### Import Required Libraries

In [None]:
# for interacting with the operating system.
import os

# torch is the main package of PyTorch.
import torch

# warnings controls how warning messages are issued and filtered
import warnings

# import for monitoring the training process
import wandb

# load_dataset allows loading and preprocessing of datasets.
from datasets import load_dataset

from transformers import (
    AutoModelForCausalLM, # Generic model class with a causal language modeling head
    AutoTokenizer, # Automatically selects correct tokenizer for a model.
    BitsAndBytesConfig, # Used to configure a BitsAndBytes model.
    HfArgumentParser, # Used for parsing command-line arguments.
    TrainingArguments, # Defines the arguments used during training.
    pipeline, # Creates a pipeline that applies a model to some input data.
    logging, # Used for logging events during training and evaluation.
    AutoModelForQuestionAnswering # Used to get a model to perform context-based question answering etc…
)

# Import the supervised fine-tuning trainer
from trl import SFTTrainer

### Define the processor to use

Ensure the model will use a GPU if available

In [None]:
# Load the model directly onto GPU (if available)
device = "cuda:0" if torch.cuda.is_available() else "cpu"

## Load Model and Tokenizer

### Define pre-trained model and new model name

In [None]:
# Define the pre-trained model to be used
pretrained_model = 'tiiuae/falcon-7b-instruct'

# Quantized model name
new_model = "Falcon-7B-instruct-NF4"

### Define BitsAndBytes Config

In [None]:
# Bits and Bytes Config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True, # Activate 4-bit precision base model loading
    bnb_4bit_quant_type="nf4", # Quantization type (fp4 or nf4)
    bnb_4bit_compute_dtype=torch.float16, # The data type for computation when using 4-bit base models
    bnb_4bit_use_double_quant=True, # Whether to activate nested quantization for 4-bit base models (double quantization)
)
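
To make the `nf4` setting more concrete, here is a conceptual sketch of block-wise 4-bit quantization in pure Python. It is not the bitsandbytes implementation, which runs on the GPU, packs two codes per byte and typically works on blocks of 64 weights. The code values are the NF4 levels published with QLoRA, rounded to four decimals, so treat them as approximate. The function names are hypothetical.

```python
# Approximate NF4 code values (rounded to 4 decimals), as published with QLoRA / bitsandbytes
NF4_LEVELS = [
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
]


def quantize_block(weights):
    """Quantize one block of floats to 4-bit codes plus a single absmax scale."""
    absmax = max(abs(w) for w in weights) or 1.0  # avoid dividing by zero for an all-zero block
    codes = [
        # Pick the index of the level closest to the normalised weight
        min(range(16), key=lambda i: abs(NF4_LEVELS[i] - w / absmax))
        for w in weights
    ]
    return codes, absmax


def dequantize_block(codes, absmax):
    """Map 4-bit codes back to approximate float values."""
    return [NF4_LEVELS[c] * absmax for c in codes]


# Hypothetical weights, for illustration only
block = [0.12, -0.40, 0.03, 0.25, -0.07, 0.40, -0.18, 0.0]
codes, scale = quantize_block(block)
restored = dequantize_block(codes, scale)
print(codes)
print([round(r, 3) for r in restored])
```

Each weight is stored as an index into a fixed 16-entry table, and one scale is kept per block. Double quantization (`bnb_4bit_use_double_quant=True`) additionally compresses those per-block scales.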

### Quantize the Pre-Trained Model

In [None]:
# Load base model
model = AutoModelForCausalLM.from_pretrained(
    pretrained_model,
    quantization_config=bnb_config, # set the quantization configuration for the model.
    device_map=device # place the whole model on the selected device (the first GPU if available, otherwise the CPU)
)

config.json:   0%|          | 0.00/1.05k [00:00<?, ?B/s]

pytorch_model.bin.index.json:   0%|          | 0.00/16.9k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

pytorch_model-00001-of-00002.bin:   0%|          | 0.00/9.95G [00:00<?, ?B/s]

pytorch_model-00002-of-00002.bin:   0%|          | 0.00/4.48G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/117 [00:00<?, ?B/s]

### Get Model details
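
As a rough cross-check on the model size reported in this section, the expected footprint can be estimated from the parameter count alone. The sketch below assumes every parameter is stored in 4 bits, with a block size of 64 and 256 first-level constants per second-level block (the defaults described for QLoRA). These values are assumptions rather than figures read from the loaded model.

```python
num_params = 6_921_720_704  # parameter count reported by model.num_parameters() in this notebook

# Assumption: all parameters are quantized; in practice some layers stay in higher precision
block_size = 64      # weights per quantization block (assumed default)
second_level = 256   # first-level constants per second-level block (assumed default)

# One fp32 scale per block
bits_plain = 4 + 32 / block_size
# 8-bit scale per block plus one fp32 constant per group of 256 blocks
bits_double = 4 + 8 / block_size + 32 / (block_size * second_level)


def gigabytes(bits_per_param):
    """Convert a bits-per-parameter figure into a total size in GB (1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


print(f"fp16 baseline:            {gigabytes(16):.2f} GB")
print(f"NF4, single quantization: {gigabytes(bits_plain):.2f} GB")
print(f"NF4, double quantization: {gigabytes(bits_double):.2f} GB")
```

Any gap between this estimate and the value printed by `get_memory_footprint()` is plausibly explained by layers that are kept in higher precision, such as embeddings and layer norms.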

In [None]:
print(f"Model size: {model.get_memory_footprint() / 1e9:.1f} GB")
print(f"Model params: {model.num_parameters():,}")
print(f"Model Config: \n{model.config}")
print(f"View model structure: \n{model}")

Model size: 3.9 GB
Model params: 6,921,720,704
Model Config: 
FalconConfig {
  "_name_or_path": "tiiuae/falcon-7b-instruct",
  "alibi": false,
  "apply_residual_connection_post_layernorm": false,
  "architectures": [
    "FalconForCausalLM"
  ],
  "attention_dropout": 0.0,
  "auto_map": {
    "AutoConfig": "tiiuae/falcon-7b-instruct--configuration_falcon.FalconConfig",
    "AutoModel": "tiiuae/falcon-7b-instruct--modeling_falcon.FalconModel",
    "AutoModelForCausalLM": "tiiuae/falcon-7b-instruct--modeling_falcon.FalconForCausalLM",
    "AutoModelForQuestionAnswering": "tiiuae/falcon-7b-instruct--modeling_falcon.FalconForQuestionAnswering",
    "AutoModelForSequenceClassification": "tiiuae/falcon-7b-instruct--modeling_falcon.FalconForSequenceClassification",
    "AutoModelForTokenClassification": "tiiuae/falcon-7b-instruct--modeling_falcon.FalconForTokenClassification"
  },
  "bias": false,
  "bos_token_id": 11,
  "eos_token_id": 11,
  "hidden_dropout": 0.0,
  "hidden_size": 4544,
  "i

### Load Tokenizer

In [None]:
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(pretrained_model, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token # sets the pad token to the eos token
tokenizer.padding_side = "right" # set padding to the right side

tokenizer_config.json:   0%|          | 0.00/287 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.73M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/281 [00:00<?, ?B/s]
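
Because the pad token is set to the EOS token, the attention mask is the only thing that distinguishes padding from a genuine end-of-sequence token. The pure-Python sketch below illustrates what right and left padding produce for a small batch; `pad_batch` is a hypothetical helper, not a tokenizer method, and the token ids are made up.

```python
def pad_batch(sequences, pad_id, side="right"):
    """Pad token-id lists to equal length and build matching attention masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = [pad_id] * (longest - len(seq))
        mask_pad = [0] * len(padding)   # 0 marks padding positions
        real = [1] * len(seq)           # 1 marks real tokens
        if side == "right":
            input_ids.append(list(seq) + padding)
            attention_mask.append(real + mask_pad)
        else:
            input_ids.append(padding + list(seq))
            attention_mask.append(mask_pad + real)
    return input_ids, attention_mask


# Hypothetical token ids; 11 is the eos_token_id from the model config, reused as the pad id
ids, mask = pad_batch([[5, 6, 7], [8]], pad_id=11, side="right")
print(ids)
print(mask)
```

Right padding is the usual choice for training. For batched generation with decoder-only models, left padding is generally recommended so that new tokens directly follow the prompt.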

## Inference

This section is to verify that inference can be carried out successfully on the quantized model. More in-depth evaluation will be done in the notebook ‘Model Evaluation’.

### Using transformers pipeline

In [None]:
from transformers import pipeline

In [None]:
pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map=device,
)
sequences = pipeline(
    "Write a poem about Ireland",
    top_k=10,
    do_sample=True,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    truncation=True,
    max_length=200
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")


Result: Write a poem about Ireland's green hills
In gentle emerald hue, the hills arise,
A verdant sea, a rolling sky,
A land of dreams, where beauty lies.

Dappled with trees, a landscape grand,
Each hill its tale, its secret band,
A tale of green, a tale of grace,
E'en the wind sings here a gentle psalm.

And where the hills behold the sun,
The world beneath a spell is spun.
From glade to glade, the hills beckon,
A land of peace, a land of lore.
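
The `do_sample=True, top_k=10` settings mean each new token is drawn at random from the ten highest-scoring candidates, which is why repeated runs give different text. A minimal pure-Python sketch of top-k sampling follows. It illustrates the idea only and is not the transformers implementation; the logits are hypothetical.

```python
import math
import random


def sample_top_k(logits, k, rng=random):
    """Sample one token index from the k highest-scoring logits."""
    # Keep the indices of the k largest logits
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax weights over the kept logits only (subtract the max for numerical stability)
    peak = logits[top[0]]
    weights = [math.exp(logits[i] - peak) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


logits = [2.0, 0.5, -1.0, 3.1, 0.0]  # hypothetical scores for a 5-token vocabulary
rng = random.Random(0)               # seeded generator so runs are repeatable
print([sample_top_k(logits, k=2, rng=rng) for _ in range(5)])
```

With `k=1` this reduces to greedy decoding, since only the top-scoring token can be chosen.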


In [None]:
sequences = pipeline(
    "What is a large language model?",
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    truncation=True,
    max_length=200
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")

Result: What is a large language model?
A large language model is a pre-trained machine learning model that uses large datasets of language data to learn patterns and improve natural language processing capabilities. Examples of large language models include BERT from Google, GPT from OpenAI, and RoBERTa from Baidu.


In [None]:
sequences = pipeline(
    "Tell me what you know about Co. Donegal in Ireland",
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    truncation=True,
    max_length=200
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")

Result: Tell me what you know about Co. Donegal in Ireland.
Co. Donegal is a county located in the northwest of Ireland. It is part of the province of Donegal. It is known for its rugged coastline and scenic landscapes. It includes numerous lakes and mountains. The population of Donegal is around 26,100.


### Using the system prompt

In [None]:
# The name `pipeline` was rebound to the pipeline object above, so delete it and re-import the factory function
del pipeline

from transformers import pipeline

In [None]:
# Ignore warnings
logging.set_verbosity(logging.CRITICAL)

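# Run the text generation pipeline with the quantized model
# Note: the <s>[INST] ... [/INST] wrapper is the Llama 2 instruction format, reused here as-is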
prompt = "Can you name the all previous presidents of Ireland"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])

<s>[INST] Can you name the all previous presidents of Ireland [/INST] 
Sure! The previous presidents of Ireland include:

1. Eamon De Valera (1959-1961)
2. Sean T. O'Casey (1961-1963)
3. John F. Kennedy (1961-1963)
4. Charles Haughey (1963-1979)
5. Mary Robinson (1990-1997)
6. Bertie Ahern (1997-2007)
7. Brian Cowen (2007-2011)
8. Enda Kenny (2011-present)


### Chat with the model

In [None]:
messages = [
    {"role": "user", "content": "What is your favourite TV show?"},
    {"role": "assistant", "content": "I liked Breaking Bad and Band of Brothers"},
    {"role": "user", "content": "Do you have any recommendations of other shows I might like?"}
]

model_inputs = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(device)

generated_ids = model.generate(model_inputs, max_new_tokens=1000, do_sample=True)
decoded = tokenizer.batch_decode(generated_ids)
print(decoded[0])

<|im_start|>user
What is your favourite TV show?<|im_end|>
<|im_start|>assistant
I liked Breaking Bad and Band of Brothers<|im_end|>
<|im_start|>user
Do you have any recommendations of other shows I might like?<|im_end|>
<|im_start|>assistant
You may enjoy shows like Game of Thrones or The Last King of Scotland<|im_end|>@<|im_start|>
<|im_start|>@<|im_end|>
What is something you would like to learn more about? <|im_start|>
|@|im_end|>
I would like to learn more about artificial intelligence and what it's like to work with it in a professional environment.
<|im_start|>@<|im_end|>
- Are you interested in learning about the current state of artificial intelligence in the industry? Check out Google's updates and developments in AI, machine learning, and machine learning research.

What are some current trends or challenges happening in this field? Follow @GoogleAI to stay informed of new developments and the latest updates in the field.

Do you have any other questions or topics you'd like
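
The `<|im_start|>` and `<|im_end|>` markers in the output suggest that a generic ChatML-style template was applied, presumably because this tokenizer does not define its own chat template. The model may not have been trained on these markers, which could explain the rambling continuation. The sketch below reproduces the prompt layout seen above in plain Python; `render_chatml` is a hypothetical helper, not the tokenizer's implementation.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render chat messages in the ChatML-style layout seen in the output above."""
    text = ""
    for message in messages:
        text += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here
        text += "<|im_start|>assistant\n"
    return text


messages = [
    {"role": "user", "content": "What is your favourite TV show?"},
    {"role": "assistant", "content": "I liked Breaking Bad and Band of Brothers"},
    {"role": "user", "content": "Do you have any recommendations of other shows I might like?"},
]
print(render_chatml(messages))
```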

## Store the model

In [None]:
import os

# change directory to content
os.chdir("/content")

In [None]:
# save the quantized model and tokenizer locally; new_model is used as the output directory name
model.save_pretrained(f"{new_model}")
tokenizer.save_pretrained(f"{new_model}")

('Falcon-7B-instruct-NF4/tokenizer_config.json',
 'Falcon-7B-instruct-NF4/special_tokens_map.json',
 'Falcon-7B-instruct-NF4/tokenizer.json')
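
To confirm that the saved checkpoint really is the compact 4-bit version, the size of the output folder on disk can be compared with the in-memory footprint printed earlier. This is a small sketch using only the standard library; `directory_size_gb` is a hypothetical helper and the folder name assumes the `new_model` value used in this notebook.

```python
import os


def directory_size_gb(path):
    """Sum the sizes of all files under path, in gigabytes (1e9 bytes)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total / 1e9


# Assumption: run from /content after save_pretrained, so the folder exists
if os.path.isdir("Falcon-7B-instruct-NF4"):
    print(f"Saved size: {directory_size_gb('Falcon-7B-instruct-NF4'):.1f} GB")
```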

## Push the model to Hugging Face

### Push Model

In [None]:
model.push_to_hub(f"twhoool02/{new_model}", private=True)
tokenizer.push_to_hub(f"twhoool02/{new_model}", private=True)

README.md:   0%|          | 0.00/627 [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/twhoool02/Falcon-7B-instruct-NF4/commit/2a36cf3a38b43c1673ba26b6235894cc1f2f54e4', commit_message='Upload tokenizer', commit_description='', oid='2a36cf3a38b43c1673ba26b6235894cc1f2f54e4', pr_url=None, pr_revision=None, pr_num=None)

### Add Model Card

Add the model card using the model card template

In [None]:
# Add custom model card
from huggingface_hub import ModelCard, ModelCardData

card_data = ModelCardData(
    base_model=pretrained_model,
    language='en',  # The language the model was trained on
    license='other',  # The license for the model
    library=['Transformers','NF4'],  # The library used to train the model
    model_name=new_model,  # The name of the model
    model_type='falcon',
    pipeline_tag='text-generation',
    quantized_by="twhoool02",
    tags=["tiiuae", "NF4","falcon"],  # Tags for the model
)

# example_template_var = 'nateraw'
content = f"""
---
{ card_data.to_yaml() }
---

# Model Card for Falcon-7B-instruct-NF4

## Model Details

This model is an NF4 quantized version of the tiiuae/falcon-7b-instruct model.

- **Developed by:** Ted Whooley
- **Library:** Transformers, NF4
- **Model type:** falcon
- **Model name:** {new_model}
- **Pipeline tag:** text-generation
- **Quantized by:** twhoool02
- **Language(s) (NLP):** en
- **License:** other
"""

card = ModelCard(content)
print(card)

---
language: en
license: other
tags:
- tiiuae
- NF4
- falcon
base_model: tiiuae/falcon-7b-instruct
model_name: Falcon-7B-instruct-NF4
library:
- Transformers
- NF4
model_type: falcon
pipeline_tag: text-generation
quantized_by: twhoool02
---

# Model Card for Falcon-7B-instruct-NF4

## Model Details

This model is an NF4 quantized version of the tiiuae/falcon-7b-instruct model.

- **Developed by:** Ted Whooley
- **Library:** Transformers, NF4
- **Model type:** falcon
- **Model name:** Falcon-7B-instruct-NF4
- **Pipeline tag:** text-generation
- **Quantized by:** twhoool02
- **Language(s) (NLP):** en
- **License:** other



### Upload the model card

In [None]:
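# validate() raises ValueError when the card metadata fails the Hub's checks
try:
    card.validate()
except ValueError as err:
    raise ValueError("Model card info is invalid. Please check.") from err

# Save and upload only once the card has validated, so other errors are not mislabelled
card.save('README.md')
card.push_to_hub(repo_id=f"twhoool02/{new_model}")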