<a href="https://colab.research.google.com/github/vkrisvasan/llamaKV/blob/main/FineTuneLlamaKV2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [17]:
# This notebook demonstrates fine-tuning a large language model (LLM) for text-to-SQL generation using the Unsloth library.

# **Setup**
# Set the runtime to GPU for accelerated training.
# Store the Hugging Face token in Secrets and enable Notebook access.

# **Installation**
# Install Unsloth, Xformers, TRL, PEFT, accelerate, and bitsandbytes libraries.
# Install the datasets library.

# **Hardware Check**
# Display NVIDIA GPU information.

# **Model Loading**
# Load the pre-trained Meta-Llama-3.1-8B model using FastLanguageModel with specified parameters.

# **LoRA (Low-Rank Adaptation)**
# Apply LoRA to the model for efficient fine-tuning.
# Configure target modules, LoRA parameters, gradient checkpointing, and random state.

# **Prompt Template**
# Define a prompt template for text-to-SQL tasks, including placeholders for SQL prompt, context, and output.

# **Dataset Preparation**
# Load the "gretelai/synthetic_text_to_sql" dataset and filter for the "Retail" domain.
# Perform stratified sampling to ensure equal representation of different SQL prompts.
# Convert the dataset to a pandas DataFrame, apply sampling, and convert back to a Hugging Face Dataset.
# Select specific columns and apply a formatting function to prepare the data for training.

# **SFT Trainer**
# Initialize an SFTTrainer from the TRL library for supervised fine-tuning.
# Configure training arguments such as batch size, gradient accumulation steps, learning rate, and optimization settings.

# **Training**
# Train the model using the SFTTrainer.

# **Inference**
# Enable faster inference using FastLanguageModel.for_inference.
# Prepare input prompts with SQL prompt and context.
# Generate SQL outputs using the fine-tuned model and decode them using the tokenizer.

# **Model Saving**
# Save the fine-tuned model and tokenizer locally.
# Optionally push the model to the Hugging Face Hub.

# **Loading Fine-tuned Model**
# (Optional) Load the locally saved fine-tuned model for further inference.

# **Alternative Model Loading (Not Recommended)**
# (Optional) Load the fine-tuned model using AutoPeftModelForCausalLM (not recommended).

# **Model Merging and Quantization**
# (Optional) Merge the LoRA adapters into the base model and save in different formats (16-bit, 4-bit, LoRA only).
# (Optional) Save the model in GGUF (Google Universal Format) with different quantization methods.

In [1]:
#Setting up the Runtime: In Google Colab, go to the "Runtime" menu and select "Change runtime type."
#A dialog box will appear where you can choose the runtime type and hardware accelerator.
#Select "GPU" as the hardware accelerator and click "Save." This step ensures that your Colab notebook is configured to use the GPU.
#it will show as connect to T4 run time
#store in Secrets HF_TOKEN and toggle Notebook access.
#https://huggingface.co/datasets/gretelai/synthetic_text_to_sql/viewer
#https://gretel.ai/blog/synthetic-text-to-sql-dataset
%%capture
# Installs Unsloth, Xformers (Flash Attention) and all other packages!
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps "xformers<0.0.27" "trl<0.9.0" peft accelerate bitsandbytes
!pip install datasets -q

In [2]:
!nvidia-smi

Sat Aug 24 02:12:44 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   42C    P8               9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [3]:
from unsloth import FastLanguageModel

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


In [4]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/Meta-Llama-3.1-8B-bnb-4bit",      # Llama-3.1 15 trillion tokens model 2x faster!
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    "unsloth/Meta-Llama-3.1-70B-bnb-4bit",
    "unsloth/Meta-Llama-3.1-405B-bnb-4bit",    # We also uploaded 4bit for 405b!
    "unsloth/Mistral-Nemo-Base-2407-bnb-4bit", # New Mistral 12b 2x faster!
    "unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit",
    "unsloth/mistral-7b-v0.3-bnb-4bit",        # Mistral v3 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/Phi-3-mini-4k-instruct",          # Phi-3 2x faster!d
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/gemma-2-9b-bnb-4bit",
    "unsloth/gemma-2-27b-bnb-4bit",            # Gemma 2x faster!
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

==((====))==  Unsloth 2024.8: Fast Llama patching. Transformers = 4.44.2.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.3.1+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.26.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.70G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/230 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/50.6k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/345 [00:00<?, ?B/s]

In [5]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2024.8 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


In [6]:
my_prompt = """Below is an sql_prompt that describes a task, paired with an sql_context that provides further context. Write a sql that appropriately completes the request.

### sql_prompt:
{}

### sql_context:
{}

### sql:
{}"""

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
def formatting_prompts_func(examples):
    sql_prompt = examples["sql_prompt"]
    sql_context       = examples["sql_context"]
    sql      = examples["sql"]
    texts = []
    for instruction, input, output in zip(sql_prompt, sql_context, sql):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = my_prompt.format(sql_prompt, sql_context, sql) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }
pass

In [7]:
from datasets import load_dataset
from sklearn.model_selection import train_test_split
import pandas as pd
maxRowtobeselected = 40
doamintobeSelected = "retail"

# Load the dataset
dataset = load_dataset("gretelai/synthetic_text_to_sql", split="train")

# Filter for the "Retail" domain
retail_dataset = dataset.filter(lambda example: example['domain'] == doamintobeSelected)

# Convert to pandas DataFrame for easier manipulation
retail_df = retail_dataset.to_pandas()
print(len(retail_df))


# Assume we want equal distribution based on 'sql_prompt'
# Get unique values in 'sql_prompt' and calculate the number of samples per group
unique_domains = retail_df['sql_prompt'].unique()
print(len(unique_domains))
samples_per_group = maxRowtobeselected // len(unique_domains)

# Stratified sampling using 'sql_prompt' to ensure equal distribution
stratified_df = pd.concat([
    group.sample(n=min(samples_per_group, len(group)), random_state=42)
    for _, group in retail_df.groupby('sql_prompt')
])

# If fewer than maxRowtobeselected rows are collected due to small groups, sample additional rows randomly
if len(stratified_df) < maxRowtobeselected:
    additional_samples = retail_df.drop(stratified_df.index).sample(n=maxRowtobeselected-len(stratified_df), random_state=42)
    stratified_df = pd.concat([stratified_df, additional_samples])

# Convert back to a Hugging Face Dataset
final_dataset = dataset.from_pandas(stratified_df)

# Select specific columns
final_dataset = final_dataset.map(lambda example: {
    'sql_prompt': example['sql_prompt'],
    'sql_context': example['sql_context'],
    'sql': example['sql']
},remove_columns=dataset.column_names)

# Verify the result by printing it
print(final_dataset)
print(final_dataset[0])

dataset = final_dataset.map(formatting_prompts_func, batched = True,num_proc=1)

"""
  id: 28884
  domain: retail
  domain_description: Retail data on circular supply chains, ethical labor practices, product transparency, and consumer education.
  sql_complexity: basic SQL
  sql_complexity_description: basic SQL with a simple select statement
  sql_task_type: data manipulation
  sql_task_type_description: inserting, updating, or deleting records
  sql_prompt: Update the price of all products in the 'clothing' department to be 10% higher.
  sql_context: CREATE TABLE products (product_id INT, department VARCHAR(20), price DECIMAL(5,2)); INSERT INTO products (product_id, department, price) VALUES (1, 'clothing', 19.99), (2, 'electronics', 79.99), (3, 'clothing', 49.99), (4, 'electronics', 299.99);
  sql: UPDATE products SET price = price * 1.1 WHERE department = 'clothing';
  sql_explanation: We update the products table, adjusting the price of all products in the 'clothing' department to be 10% higher using the formula price * 1.1. We filter the rows using a WHERE clause, specifying the department as 'clothing'.
"""

Downloading readme:   0%|          | 0.00/8.18k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/32.4M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.90M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/100000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/5851 [00:00<?, ? examples/s]

Filter:   0%|          | 0/100000 [00:00<?, ? examples/s]

979
979


Map:   0%|          | 0/40 [00:00<?, ? examples/s]

Dataset({
    features: ['sql_prompt', 'sql_context', 'sql', '__index_level_0__'],
    num_rows: 40
})
{'sql_prompt': 'Determine the number of products manufactured using sustainable materials', 'sql_context': 'CREATE TABLE product_materials (product_id INT, material TEXT, is_sustainable BOOLEAN);', 'sql': 'SELECT COUNT(*) as num_products_with_sustainable_materials FROM product_materials WHERE is_sustainable = TRUE;', '__index_level_0__': 199}


Map:   0%|          | 0/40 [00:00<?, ? examples/s]

"\n  id: 28884\n  domain: retail\n  domain_description: Retail data on circular supply chains, ethical labor practices, product transparency, and consumer education.\n  sql_complexity: basic SQL\n  sql_complexity_description: basic SQL with a simple select statement\n  sql_task_type: data manipulation\n  sql_task_type_description: inserting, updating, or deleting records\n  sql_prompt: Update the price of all products in the 'clothing' department to be 10% higher.\n  sql_context: CREATE TABLE products (product_id INT, department VARCHAR(20), price DECIMAL(5,2)); INSERT INTO products (product_id, department, price) VALUES (1, 'clothing', 19.99), (2, 'electronics', 79.99), (3, 'clothing', 49.99), (4, 'electronics', 299.99);\n  sql: UPDATE products SET price = price * 1.1 WHERE department = 'clothing';\n  sql_explanation: We update the products table, adjusting the price of all products in the 'clothing' department to be 10% higher using the formula price * 1.1. We filter the rows using a

In [8]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

Map (num_proc=2):   0%|          | 0/40 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs


In [9]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 40 | Num Epochs = 12
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 41,943,040


Step,Training Loss
1,1.1017
2,1.1017
3,1.0855
4,1.043
5,0.9801
6,0.8934
7,0.7927
8,0.7092
9,0.6057
10,0.5044


Step,Training Loss
1,1.1017
2,1.1017
3,1.0855
4,1.043
5,0.9801
6,0.8934
7,0.7927
8,0.7092
9,0.6057
10,0.5044


In [10]:
# my_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    my_prompt.format(
        "Which suppliers have provided the most circular supply chain products to our retail business?", # sql_prompt
        "CREATE TABLE suppliers (id INT, supplier_name VARCHAR(50), total_supply INT); CREATE TABLE circular_supply_chain_products (id INT, product_id INT, supplier_id INT); INSERT INTO suppliers (id, supplier_name, total_supply) VALUES (1, 'GreenVest', 2000), (2, 'EcoFarms', 1500), (3, 'CircularWear', 1200); INSERT INTO circular_supply_chain_products (id, product_id, supplier_id) VALUES (1, 1, 1), (2, 2, 1), (3, 3, 2), (4, 4, 2), (5, 5, 3);", # sql_context
        "", # sql output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
tokenizer.batch_decode(outputs)

["<|begin_of_text|>Below is an sql_prompt that describes a task, paired with an sql_context that provides further context. Write a sql that appropriately completes the request.\n\n### sql_prompt:\nWhich suppliers have provided the most circular supply chain products to our retail business?\n\n### sql_context:\nCREATE TABLE suppliers (id INT, supplier_name VARCHAR(50), total_supply INT); CREATE TABLE circular_supply_chain_products (id INT, product_id INT, supplier_id INT); INSERT INTO suppliers (id, supplier_name, total_supply) VALUES (1, 'GreenVest', 2000), (2, 'EcoFarms', 1500), (3, 'CircularWear', 1200); INSERT INTO circular_supply_chain_products (id, product_id, supplier_id) VALUES (1, 1, 1), (2, 2, 1), (3, 3, 2), (4, 4, 2), (5, 5, 3);\n\n### sql:\nSELECT suppliers.supplier_name, SUM(circular_supply_chain_products.product_id) as circular_products_count FROM suppliers INNER JOIN circular_supply_chain_products ON suppliers.id = circular_supply_chain_products.supplier_id GROUP BY suppl

In [11]:
# my_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    my_prompt.format(
        "Identify brands that have increased their product count by more than 50% since last year.", # sql_prompt
        "CREATE TABLE brands(brand_id INT, brand_name TEXT, products_last_year INT, products_this_year INT); INSERT INTO brands(brand_id, brand_name, products_last_year, products_this_year) VALUES (1, 'BrandA', 200, 300), (2, 'BrandB', 500, 700), (3, 'BrandC', 1000, 1200), (4, 'BrandD', 800, 700), (5, 'BrandE', 900, 1000);", # sql_context
        "", # sql output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)

<|begin_of_text|>Below is an sql_prompt that describes a task, paired with an sql_context that provides further context. Write a sql that appropriately completes the request.

### sql_prompt:
Identify brands that have increased their product count by more than 50% since last year.

### sql_context:
CREATE TABLE brands(brand_id INT, brand_name TEXT, products_last_year INT, products_this_year INT); INSERT INTO brands(brand_id, brand_name, products_last_year, products_this_year) VALUES (1, 'BrandA', 200, 300), (2, 'BrandB', 500, 700), (3, 'BrandC', 1000, 1200), (4, 'BrandD', 800, 700), (5, 'BrandE', 900, 1000);

### sql:
SELECT brand_name, products_last_year, products_this_year FROM brands WHERE products_this_year > products_last_year * 1.5;

### sql_prompt:
Determine the number of products manufactured using sustainable materials in each product category.

### sql_context:
CREATE TABLE products_sustainable (product_id INT, product_category VARCHAR(50), product_name VARCHAR(50), material 

In [None]:
model.save_pretrained("lora_model") # Local saving
tokenizer.save_pretrained("lora_model")
# model.push_to_hub("your_name/lora_model", token = "...") # Online saving
# tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving

In [13]:
if False:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference

# my_prompt = You MUST copy from above!

inputs = tokenizer(
[
    my_prompt.format(
        "What is the total quantity of products manufactured in each country with sustainable practices?", # sql_prompt
        "", # sql_context
        "", # sql output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)

<|begin_of_text|>Below is an sql_prompt that describes a task, paired with an sql_context that provides further context. Write a sql that appropriately completes the request.

### sql_prompt:
What is the total quantity of products manufactured in each country with sustainable practices?

### sql_context:


### sql:
```sql
SELECT country, SUM(quantity) as total_quantity FROM product_materials pm INNER JOIN suppliers s ON pm.supplier_id = s.id INNER JOIN stores_stocks ss ON pm.product_id = ss.product_id INNER JOIN stores s2 ON ss.store_id = s2.id WHERE s.sustainable_practices = true GROUP BY country;
```
<|end_of_text|>


In [None]:
if False:
    # I highly do NOT suggest - use Unsloth if possible
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer
    model = AutoPeftModelForCausalLM.from_pretrained(
        "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        load_in_4bit = load_in_4bit,
    )
    tokenizer = AutoTokenizer.from_pretrained("lora_model")

In [15]:
# Merge to 16bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False: model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")

In [16]:
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)
# Remember to go to https://huggingface.co/settings/tokens for a token!
# And change hf to your username!
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

# Save to multiple GGUF options - much faster if you want multiple!
if False:
    model.push_to_hub_gguf(
        "hf/model", # Change hf to your username!
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = "", # Get a token at https://huggingface.co/settings/tokens
    )