## In this code the LoRA (Low-Rank Adaptation) configuration is applied to the model when you call get_peft_model(model, lora_config). This integrates LoRA's modifications directly into the model, which is necessary for fine-tuning it using low-rank adaptation.

Key Points on LoRA in Your Code:
LoRA Configuration (lora_config): You define the LoRA parameters such as r, lora_alpha, target layers (q_proj and v_proj), and other configurations. This configuration is then applied to the model using get_peft_model(model, lora_config). This modifies the model's weights based on LoRA during training.

Including LoRA in SFTTrainer: When you pass peft_config=lora_config to the SFTTrainer, it ensures that the LoRA settings are appropriately utilized during training. This means the model is trained with LoRA applied to the target layers, reducing memory usage and training time by optimizing only specific low-rank matrices.

Do We Need to Include lora_config in Both Model Loading and SFTTrainer?
Yes, you need to apply the LoRA configuration both when loading the model and during training using SFTTrainer:

When loading the model: You apply the LoRA modifications by calling get_peft_model(model, lora_config) to ensure that the model is modified to incorporate LoRA layers.
When using SFTTrainer: Including the peft_config=lora_config ensures that the trainer applies LoRA-specific training optimizations, such as freezing some layers and focusing on fine-tuning the LoRA-adapted parameters.

### Incorporating the LoRA (Low-Rank Adaptation) configuration during model loading does not directly reduce the size of the model in terms of storage, but it does reduce the number of trainable parameters, leading to a significant reduction in memory usage and computational resources during training and inference.


### But yes, the memory usage and computational cost during training and inference are reduced because LoRA only fine-tunes small, low-rank matrices instead of the entire model.

### If reducing the storage size of the model is your goal, you would need to explore techniques like model pruning, quantization, or distillation.


### During inference, the model uses both the pre-trained weights and the small LoRA parameters. While the size on disk isn't drastically reduced, the inference may still be more efficient due to the lower-rank modifications introduced by LoRA.

In [3]:
!pip install torch transformers datasets trl peft psutil

Collecting datasets
  Downloading datasets-3.0.1-py3-none-any.whl.metadata (20 kB)
Collecting trl
  Downloading trl-0.11.4-py3-none-any.whl.metadata (12 kB)
Collecting peft
  Downloading peft-0.13.2-py3-none-any.whl.metadata (13 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.17-py310-none-any.whl.metadata (7.2 kB)
Collecting tyro>=0.5.11 (from trl)
  Downloading tyro-0.8.13-py3-none-any.whl.metadata (8.4 kB)
Collecting shtab>=1.5.6 (from tyro>=0.5.11->trl)
  Downloading shtab-1.7.1-py3-none-any.whl.metadata (7.3 kB)
INFO: pip is looking at multiple versions of multiprocess to determine which version is compatible with other requirements. This could take a while.
Collecting multiprocess (from datasets)
  Downloading mult

In [4]:
!nvidia-smi

Sun Oct 20 18:35:26 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   45C    P8              10W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [5]:
!pip install --upgrade peft



In [6]:
!pip uninstall peft -y
!pip install peft

Found existing installation: peft 0.13.2
Uninstalling peft-0.13.2:
  Successfully uninstalled peft-0.13.2
Collecting peft
  Using cached peft-0.13.2-py3-none-any.whl.metadata (13 kB)
Using cached peft-0.13.2-py3-none-any.whl (320 kB)
Installing collected packages: peft
Successfully installed peft-0.13.2


In [9]:
from datasets import load_dataset
import torch
import os
import re
from trl import SFTTrainer
from transformers import TrainingArguments
from peft import LoraConfig, get_peft_model
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

In [14]:
from huggingface_hub import notebook_login
from google.colab import userdata

HF_TOKEN = userdata.get('hf_token')
os.environ["HF_TOKEN"] = HF_TOKEN

In [15]:
# Load the tokenizer and model for seq2seq tasks
model_name = "meta-llama/Llama-3.2-1B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

tokenizer_config.json:   0%|          | 0.00/54.5k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/296 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/877 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/2.47G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/189 [00:00<?, ?B/s]

In [16]:
# Calculating Model Parameters Size

def get_model_size_in_mb(model):
    total_params = sum(p.numel() for p in model.parameters())
    # Assume the parameters are in float32 format (4 bytes per parameter)
    # For quantized models (4-bit), it's 1/8th the size, i.e., 0.5 bytes per parameter
    total_size_in_bytes = total_params * 0.5  # 4-bit quantized model
    total_size_in_mb = total_size_in_bytes / (1024 * 1024)  # Convert bytes to MB
    return total_size_in_mb

# Get the model size after loading and applying QLoRA
model_size = get_model_size_in_mb(model)
print(f"Model size after applying QLoRA: {model_size:.2f} MB")


Model size after applying QLoRA: 589.28 MB


In [17]:
# Checking GPU Memory Usage

def get_gpu_memory_usage():
    if torch.cuda.is_available():
        memory_allocated = torch.cuda.memory_allocated()
        memory_allocated_mb = memory_allocated / (1024 * 1024)  # Convert bytes to MB
        return memory_allocated_mb
    else:
        return "CUDA is not available."



# If the model is loaded on the CPU

def get_cpu_memory_usage():
    process = psutil.Process(os.getpid())
    memory_info = process.memory_info()
    memory_usage_mb = memory_info.rss / (1024 * 1024)  # Convert bytes to MB
    return memory_usage_mb


if torch.cuda.is_available():
  gpu_memory_usage = get_gpu_memory_usage()
  print(f"GPU memory usage after loading the model: {gpu_memory_usage:.2f} MB")
else:
  cpu_memory_usage = get_cpu_memory_usage()
  print(f"CPU memory usage after loading the model: {cpu_memory_usage:.2f} MB")

GPU memory usage after loading the model: 0.00 MB


In [18]:
dataset = load_dataset("b-mc2/sql-create-context", split="train")

README.md:   0%|          | 0.00/4.43k [00:00<?, ?B/s]

sql_create_context_v4.json:   0%|          | 0.00/21.8M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/78577 [00:00<?, ? examples/s]

In [9]:
dataset

Dataset({
    features: ['answer', 'question', 'context'],
    num_rows: 78577
})

In [10]:
# Apply LoRA (Low-Rank Adaptation) using PEFT
lora_config = LoraConfig(
    r=8,  # Reduce LoRA rank to save resources
    lora_alpha=16,  # Alpha parameter for LoRA
    target_modules=["q_proj", "v_proj"],  # LoRA will modify these layers
    lora_dropout=0.1,  # Dropout rate
    bias="none",  # No bias adjustment
)
model = get_peft_model(model, lora_config)

### We need to include the LoRA configuration in the SFTTrainer initialization, but you do not need to include it in the TrainingArguments. The TrainingArguments is focused on general training configurations, while the LoRA configuration is specific to how the model adapts during training.

In [31]:
model

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 2048)
    (layers): ModuleList(
      (0-15): 16 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (up_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (down_proj): Linear(in_features=8192, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
      )
    )
    (norm):

In [36]:
# Preprocessing function to tokenize inputs (context + question) and outputs (SQL answer)
def preprocess_function(examples):
    inputs = [context + " " + question for context, question in zip(examples['context'], examples['question'])]
    targets = examples['answer']  # SQL query (answer)

    # Check if the tokenizer has a padding token
    if tokenizer.pad_token is None:
        # Set pad_token to the eos_token or add a new pad token
        tokenizer.pad_token = tokenizer.eos_token  # or tokenizer.add_special_tokens({'pad_token': '[PAD]'})


    # Tokenize the inputs
    model_inputs = tokenizer(inputs, max_length=512, truncation=True, padding="max_length")

    # Tokenize the targets (SQL queries)
    labels = tokenizer(targets, max_length=512, truncation=True, padding="max_length").input_ids

    # Replace padding token ID in the labels with -100, so they're ignored during loss computation
    labels = [[(label if label != tokenizer.pad_token_id else -100) for label in label_seq] for label_seq in labels]

    model_inputs["labels"] = labels
    return model_inputs

In [37]:
# Tokenize the dataset
tokenized_dataset = dataset.map(preprocess_function, batched=True)

Map:   0%|          | 0/78577 [00:00<?, ? examples/s]

In [38]:
# Training arguments
training_args = TrainingArguments(
    output_dir="./unsloth-text2sql-finetune",
    per_device_train_batch_size=2,  # Lower batch size for smaller resources
    num_train_epochs=1,
    max_steps=60,
    optim = "adamw_8bit",
    logging_steps=1,
    load_best_model_at_end=False,  # No need to load best model without evaluation
    evaluation_strategy="no",  # Disable evaluation
    report_to="none",  # Disable WandB logging
)



In [39]:
# Trainer
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    peft_config=lora_config,
    tokenizer=tokenizer
)

max_steps is given, it will override any value given in num_train_epochs


In [18]:
!pip install bitsandbytes

Collecting bitsandbytes
  Downloading bitsandbytes-0.44.1-py3-none-manylinux_2_24_x86_64.whl.metadata (3.5 kB)
Downloading bitsandbytes-0.44.1-py3-none-manylinux_2_24_x86_64.whl (122.4 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m122.4/122.4 MB[0m [31m18.3 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: bitsandbytes
Successfully installed bitsandbytes-0.44.1


In [40]:
# Fine-tune the model
trainer.train()

Step,Training Loss
1,4.9894
2,4.1916
3,4.6191
4,4.9247
5,4.8259
6,5.6867
7,5.0182
8,4.5035
9,4.8181
10,4.2644


TrainOutput(global_step=60, training_loss=4.359213093916575, metrics={'train_runtime': 82.596, 'train_samples_per_second': 1.453, 'train_steps_per_second': 0.726, 'total_flos': 359054652211200.0, 'train_loss': 4.359213093916575, 'epoch': 0.0015271450024179796})

In [41]:
def gen_sql(schema, nl_query):

  # Combine schema and query for the input
  test_input = schema + " " + nl_query

  # Tokenize the input query
  inputs = tokenizer([test_input], return_tensors="pt", padding=True, truncation=True, max_length=512)

  # Ensure the model is in evaluation mode
  model.eval()

  # Move the inputs to the same device as the model
  inputs = {key: value.to(model.device) for key, value in inputs.items()}

  with torch.no_grad():
      outputs = model.generate(**inputs, max_new_tokens=128, use_cache=True)

  # Get the predicted SQL query
  generated_sql = tokenizer.decode(outputs[0], skip_special_tokens=True)

  # Regex pattern to match SQL queries
  sql_pattern = r"(?i)(SELECT.*?;)"  # Match SELECT statements ending with a semicolon

  # Find all matches
  matches = re.findall(sql_pattern, generated_sql)

  # Extract the first match (if any)
  if matches:
      extracted_sql = matches[0]
      print("Extracted SQL Query:", extracted_sql)
  else:
      print("No SQL query found.")

  return extracted_sql

In [42]:
# Test the fine-tuned model on a custom natural language query with a schema
context_schema = "CREATE TABLE employee (emp_id INTEGER, dept_name varchar, teacher_name TEXT, no_of_student FLOAT, subject TEXT);"
user_query = "how many teachers belong to Math department?"

In [43]:
gen_sql(context_schema, user_query)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Extracted SQL Query: SELECT COUNT(*) FROM employee WHERE dept_name = 'Math' GROUP BY teacher_name;


"SELECT COUNT(*) FROM employee WHERE dept_name = 'Math' GROUP BY teacher_name;"

In [21]:
model.save_pretrained("lora_model")

In [22]:
!pip install huggingface_hub



In [26]:
from transformers import AutoModelForCausalLM
from bitsandbytes import quantization

ImportError: cannot import name 'quantization' from 'bitsandbytes' (/usr/local/lib/python3.10/dist-packages/bitsandbytes/__init__.py)

In [None]:
#!export "hf_XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"

In [None]:
model.push_to_hub("Anoop03031988/Llama-3.2 1B_text2sql"
#, token="hf_XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
)

adapter_model.safetensors:   0%|          | 0.00/27.3M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/Anoop03031988/unsloth_4bit_mistral_imdb_model/commit/1b7f5c7834a76124a14d15d8e3082a0829e97c74', commit_message='Upload model', commit_description='', oid='1b7f5c7834a76124a14d15d8e3082a0829e97c74', pr_url=None, pr_revision=None, pr_num=None)