# Tuning
[Unsloth](https://github.com/unslothai/unsloth)<br>
[Instructions using Unsloth](https://colab.research.google.com/drive/135ced7oHytdxu3N2DNe1Z0kqjyYIkDXp?usp=sharing#scrollTo=QmUBVEnvCDJv)<br>

In [2]:
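!pip install --upgrade pip
# Install only the one Unsloth variant that matches your CUDA toolkit, PyTorch version, and GPU generation.
# cu121-torch220 matches the CUDA Toolkit 12.1 / PyTorch 2.2.0 / Tesla V100 (pre-Ampere) setup reported in the Unsloth banner.
!pip install "unsloth[cu121-torch220] @ git+https://github.com/unslothai/unsloth.git" -q
# !pip install "unsloth[cu118-torch220] @ git+https://github.com/unslothai/unsloth.git" -q
# !pip install "unsloth[cu118-ampere-torch220] @ git+https://github.com/unslothai/unsloth.git" -q
# !pip install "unsloth[cu121-ampere-torch220] @ git+https://github.com/unslothai/unsloth.git" -q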

Looking in indexes: https://artifact.intuit.com/artifactory/api/pypi/pypi-intuit/simple

In [7]:
# Fine-tuning requirements
!pip install --no-deps "xformers<0.0.26" trl peft accelerate bitsandbytes -q

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
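The warning above suggests setting `TOKENIZERS_PARALLELISM` explicitly. A minimal sketch of doing that from Python, assuming it runs in a cell before any tokenizer is used (the value `"false"` is one of the two options the warning lists, chosen here only to silence the message):

```python
import os

# Set the variable before the tokenizer is first used, so the forked
# pip subprocess no longer triggers the parallelism warning.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```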



In [15]:
from unsloth import FastLanguageModel
import torch

max_seq_length = 2048 # Any length works; Unsloth handles RoPE scaling internally.
dtype = None # None for auto detection. Float16 for Tesla T4 and V100, Bfloat16 for Ampere and newer.
load_in_4bit = False # True enables 4-bit quantization to reduce memory usage; False loads the weights in 16-bit.

# 4-bit pre-quantized models that Unsloth supports, for 4x faster downloading and fewer OOMs.
fourbit_models = [
    "unsloth/mistral-7b-bnb-4bit",
    "unsloth/mistral-7b-instruct-v0.2-bnb-4bit",
    "unsloth/llama-2-7b-bnb-4bit",
    "unsloth/gemma-7b-bnb-4bit",
    "unsloth/gemma-7b-it-bnb-4bit", # Instruct version of Gemma 7b
    "unsloth/gemma-2b-bnb-4bit",
    "unsloth/gemma-2b-it-bnb-4bit", # Instruct version of Gemma 2b
    "unsloth/llama-3-8b-bnb-4bit", # [NEW] 15 Trillion token Llama-3
    "/root/llama/models/models--unsloth--llama-3-8b/snapshots/a299e1185075f847b3f9d5434713852093e23684/"
    # "/root/llama_/Meta-Llama-3-8B-Instruct"    
] # More models at https://huggingface.co/unsloth

model_name = fourbit_models[-1]
# model_name = './models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/e5e23bbe8e749ef0efcf16cad411a7d23bd23298'

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_name,
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

# device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# model = model.to(device)


==((====))==  Unsloth: Fast Llama patching release 2024.4
   \\   /|    GPU: Tesla V100-SXM2-16GB. Max memory: 15.773 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.2.0. CUDA = 7.0. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. Xformers = 0.0.24. FA = False.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


Loading checkpoint shards: 100%|██████████| 4/4 [00:02<00:00,  1.41it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


OutOfMemoryError: CUDA out of memory. Tried to allocate 1002.00 MiB. GPU 0 has a total capacity of 15.77 GiB of which 289.38 MiB is free. Process 11595 has 15.49 GiB memory in use. Of the allocated memory 15.09 GiB is allocated by PyTorch, and 9.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
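The out-of-memory error above is consistent with a back-of-the-envelope estimate of the weight footprint. The sketch below is only that estimate: the parameter count of roughly 8.03 billion for Llama-3-8B is an assumption, and the 4-bit figure ignores quantization metadata, activations, and the KV cache, so real usage is higher.

```python
# Rough estimate of the GPU memory needed just to hold the model weights.
GIB = 1024 ** 3

def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Memory for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / GIB

# Assumption: Llama-3-8B has roughly 8.03 billion parameters.
LLAMA3_8B_PARAMS = 8.03e9

fp16_gib = weight_memory_gib(LLAMA3_8B_PARAMS, 16)  # load_in_4bit = False
int4_gib = weight_memory_gib(LLAMA3_8B_PARAMS, 4)   # load_in_4bit = True

print(f"16-bit weights: {fp16_gib:.2f} GiB")
print(f"4-bit weights:  {int4_gib:.2f} GiB")
```

With 16-bit weights the estimate is close to the 15.09 GiB that PyTorch reports as allocated, which leaves almost nothing of the V100's 15.77 GiB for generation or training. Setting `load_in_4bit = True` is the usual way to fit an 8B model on a card of this size.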

## Inference Before Training

In [12]:
EOS_TOKEN = tokenizer.eos_token

inputs = tokenizer(["Where can one find sea turtles around the world? Answer as simply as possible."],
                   return_tensors = "pt").to("cuda")
input_length = inputs['input_ids'][0].shape[0]  # number of prompt tokens

outputs = model.generate(**inputs, max_new_tokens = 512, use_cache = False,
                         pad_token_id=tokenizer.eos_token_id)

# Decode the full sequences (prompt + completion).
results = tokenizer.batch_decode(outputs)
# To decode only the newly generated tokens, slice the token ids instead:
# results = tokenizer.batch_decode(outputs[:, input_length:])
for r_i in results:
    print(r_i)



OutOfMemoryError: CUDA out of memory. Tried to allocate 128.00 MiB. GPU 0 has a total capacity of 15.77 GiB of which 127.38 MiB is free. Process 11595 has 15.64 GiB memory in use. Of the allocated memory 15.22 GiB is allocated by PyTorch, and 40.34 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

## Creating Dataset 

In [6]:
from datasets import load_dataset

# EOS_TOKEN must be appended so the model learns where a sample ends.
EOS_TOKEN = tokenizer.eos_token

def formatting_prompts_func(row):
    # Called once per row (batched = False), so row['text'] is a single string.
    return { "text" : row['text'] + EOS_TOKEN }

dataset = load_dataset("parquet", data_files={'train': 'qa_data.parquet'})
dataset = dataset.map(formatting_prompts_func, batched = False,)
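The formatting function above runs with `batched = False`, so it receives one row at a time and `text` is a single string. With `batched = True`, `map` passes a dict of lists instead, and appending the token with `+=` would extend the list character by character rather than suffix each string. The sketch below illustrates both shapes with plain dicts; the `"<eos>"` string and the sample rows are placeholders, not values from the real tokenizer or dataset.

```python
# Placeholder token for illustration; in the notebook this is tokenizer.eos_token.
EOS_TOKEN = "<eos>"

def format_row(row):
    """Per-row variant (batched=False): row['text'] is one string."""
    return {"text": row["text"] + EOS_TOKEN}

def format_batch(rows):
    """Batched variant (batched=True): rows['text'] is a list of strings."""
    return {"text": [text + EOS_TOKEN for text in rows["text"]]}

# Hypothetical sample rows, shaped the way `map` would pass them.
single = format_row({"text": "Q: example question A: example answer"})
batch = format_batch({"text": ["first sample", "second sample"]})

print(single["text"])
print(batch["text"])
```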


In [7]:
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset['train'],
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, 
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_hf",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        # seed = 3407,
        output_dir = "outputs",
    ),
)
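To make the training arguments above concrete, the sketch below works out the effective batch size, the number of samples seen in 60 steps, and the learning rate at a given step. It assumes a single GPU and the usual linear schedule shape (linear warmup from zero, then linear decay to zero at `max_steps`); `linear_schedule_lr` is an illustrative helper, not a function from `transformers`.

```python
# Values copied from the TrainingArguments above.
PER_DEVICE_BATCH = 2
GRAD_ACCUM_STEPS = 4
WARMUP_STEPS = 5
MAX_STEPS = 60
BASE_LR = 2e-4

# Assumption: one GPU, so the effective batch is batch size x accumulation steps.
effective_batch_size = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS
samples_seen = effective_batch_size * MAX_STEPS

def linear_schedule_lr(step: int) -> float:
    """Learning rate at an optimizer step: linear warmup, then linear decay."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    remaining = max(0, MAX_STEPS - step)
    return BASE_LR * remaining / (MAX_STEPS - WARMUP_STEPS)

print("effective batch size:", effective_batch_size)
print("samples seen in training:", samples_seen)
for step in (0, 5, 30, 60):
    print(step, linear_schedule_lr(step))
```

With only 480 samples seen, `max_steps = 60` is a short smoke-test run; a full pass over a larger dataset would need more steps or `num_train_epochs` instead.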


Detected kernel version 4.14.336, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
max_steps is given, it will override any value given in num_train_epochs


In [11]:
import torch
torch.cuda.empty_cache()

# trainer_stats = trainer.train()
# del model, tokenizer

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.0  48244  6916 ?        Ss   09:31   0:00 /bin/sh /opt/.s
root         8  0.4  0.0 437340 83824 ?        Sl   09:31   0:02 /opt/.sagemaker
root        17  4.6  1.2 67161948 1657420 ?    Ssl  09:32   0:23 /opt/conda/bin/
root       214  0.0  0.2 8910952 313864 ?      Sl   09:34   0:00 /opt/conda/bin/
root       215  0.0  0.2 8910952 313812 ?      Sl   09:34   0:00 /opt/conda/bin/
root       217  0.0  0.2 8910952 313812 ?      Sl   09:34   0:00 /opt/conda/bin/
root       219  0.0  0.2 8910952 313812 ?      Sl   09:34   0:00 /opt/conda/bin/
root       221  0.0  0.2 8910952 313816 ?      Sl   09:34   0:00 /opt/conda/bin/
root       223  0.0  0.2 8910952 313820 ?      Sl   09:34   0:00 /opt/conda/bin/
root       224  0.0  0.2 8910952 313824 ?      Sl   09:34   0:00 /opt/conda/bin/
root       227  0.0  0.2 8910952 313824 ?      Sl   09:34   0:00 /opt/conda/bin/
root       229  0.0  0.2 8910952 313

In [None]:
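import os  # set this before the first CUDA allocation (ideally before importing torch), or it has no effect
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"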