<br>

<h1 style="text-align: center;">Llama Fine-Tuning</h1>

<br>

## Initial Setup

---

In [1]:
# # Installation
!pip -q install -U git+https://github.com/huggingface/transformers.git
!pip -q install -U git+https://github.com/huggingface/peft.git
!pip -q install -U git+https://github.com/huggingface/accelerate.git
!pip -q install trl xformers wandb datasets einops gradio sentencepiece bitsandbytes
!pip -q uninstall datasets -y
!pip -q install -U datasets==2.16

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
fastai 2.7.13 requires torch<2.2,>=1.10, but you have torch 2.2.0 which is incompatible.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cudf 23.8.0 requires cubinlinker, which is not installed.
cudf 23.8.0 requires cupy-cuda11x>=12.0.0, which is not installed.
cudf 23.8.0 requires ptxcompiler, which is not installed.
cuml 23.8.0 requires cupy-cuda11x>=12.0.0, which is not installed.
dask-cudf 23.8.0 requires cupy-cuda11x>=12.0.0, which is not installed.
cudf 23.8.0 requires cuda-python<12.0a0,>=11.7.1, but you have cuda-python 12.3.0 which is incompatible.
cudf 23.8.0 requires pandas<1.6.0dev0,>=1.3, but you have pandas 2.1.4 which is incompatible.
cudf 23.8.0 requi

In [2]:
# Log in to Hugging Face (the token is read from the HF_TOKEN environment variable instead of being hardcoded)
!huggingface-cli login --token "$HF_TOKEN"

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [3]:
# Import the libraries
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, HfArgumentParser, TrainingArguments, pipeline, logging, TextStreamer
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training, get_peft_model
import os, torch, wandb, platform, gradio, warnings
from datasets import load_dataset
from trl import SFTTrainer
from huggingface_hub import notebook_login
from transformers import LlamaTokenizer, LlamaForCausalLM

2024-02-25 16:55:11.731125: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-02-25 16:55:11.731247: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-02-25 16:55:11.859671: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


In [4]:
# Pre-trained base model
model_name = "meta-llama/Llama-2-7b-hf"

# Dataset name
dataset_name = "vicgalle/alpaca-gpt4"

# Hugging Face repository for the fine-tuned model (create a new repository on the Hub and paste its name here)
new_model = "soheill/Llama-2-7b-hf"

<br>

## Load Dataset

---

This cell loads the first 100 training examples of the vicgalle/alpaca-gpt4 dataset from the Hugging Face Hub and prints the 'instruction', 'input', 'output', and 'text' fields of the first entry. Inspecting one record is a quick way to confirm the dataset's structure before training.


In [7]:
# Load the first 100 training examples
dataset = load_dataset(dataset_name, split="train[:100]")

# Report
print("Dataset shape: ", dataset.shape, "\n", "-"*100)
index = 0
print("Instruction: \n", dataset["instruction"][index], "\n", "-"*100)
print("Input: \n", dataset["input"][index], "\n", "-"*100)
print("Output: \n", dataset["output"][index], "\n", "-"*100)
print("Text: \n", dataset["text"][index], "\n", "-"*100)

Dataset shape:  (100, 4) 
 ----------------------------------------------------------------------------------------------------
Instruction: 
 Give three tips for staying healthy. 
 ----------------------------------------------------------------------------------------------------
Input: 
  
 ----------------------------------------------------------------------------------------------------
Output: 
 1. Eat a balanced and nutritious diet: Make sure your meals are inclusive of a variety of fruits and vegetables, lean protein, whole grains, and healthy fats. This helps to provide your body with the essential nutrients to function at its best and can help prevent chronic diseases.

2. Engage in regular physical activity: Exercise is crucial for maintaining strong bones, muscles, and cardiovascular health. Aim for at least 150 minutes of moderate aerobic exercise or 75 minutes of vigorous exercise each week.

3. Get enough sleep: Getting enough quality sleep is crucial for physical and m

<br>

## Configure 4-Bit Quantization for Efficient Training

---

This cell builds a 4-bit quantization configuration with the BitsAndBytes integration, which reduces the memory needed to hold the model weights. The configuration (bnb_config) uses the following parameters:

- load_in_4bit=True: Loads the model weights in 4-bit precision, reducing the memory footprint.
- bnb_4bit_quant_type="nf4": Selects NF4 (4-bit NormalFloat), the 4-bit data type introduced with QLoRA for weights that are roughly normally distributed.
- bnb_4bit_compute_dtype=torch.float16: Sets the data type used for computation to half precision (float16); the 4-bit weights are dequantized to this type for matrix multiplications.
- bnb_4bit_use_double_quant=False: Disables double quantization, meaning the quantization constants themselves are not quantized a second time. Enabling it would save a little more memory.

The configuration is stored in bnb_config and passed to the model loader later on.

In [6]:
# Set up 4-bit quantization to reduce the memory footprint of the model weights
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=False,
)

bnb_config

BitsAndBytesConfig {
  "_load_in_4bit": true,
  "_load_in_8bit": false,
  "bnb_4bit_compute_dtype": "float16",
  "bnb_4bit_quant_type": "nf4",
  "bnb_4bit_use_double_quant": false,
  "llm_int8_enable_fp32_cpu_offload": false,
  "llm_int8_has_fp16_weight": false,
  "llm_int8_skip_modules": null,
  "llm_int8_threshold": 6.0,
  "load_in_4bit": true,
  "load_in_8bit": false,
  "quant_method": "bitsandbytes"
}
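
To make the memory savings concrete, the following back-of-the-envelope sketch estimates the storage needed for the weights alone. It is not part of the original notebook. The function name `weight_memory_gib`, the round figure of 7e9 parameters, the block size of 64, and the constant sizes used for (double) quantization are assumptions based on the scheme described for QLoRA; actual usage also includes activations, the LoRA adapters, and optimizer state.

```python
def weight_memory_gib(n_params, bits_per_weight, block_size=None, double_quant=False):
    """Rough estimate of weight storage in GiB (weights and quantization constants only)."""
    overhead_bits = 0.0
    if block_size is not None:
        if double_quant:
            # Assumed scheme: one 8-bit constant per block, plus one fp32 constant per 256 blocks
            overhead_bits = 8 / block_size + 32 / (block_size * 256)
        else:
            # Assumed scheme: one fp32 scaling constant per block
            overhead_bits = 32 / block_size
    total_bits = n_params * (bits_per_weight + overhead_bits)
    return total_bits / 8 / 2**30


N_PARAMS = 7e9  # assumption: approximate parameter count of a "7B" model

fp16 = weight_memory_gib(N_PARAMS, 16)
nf4 = weight_memory_gib(N_PARAMS, 4, block_size=64)
nf4_dq = weight_memory_gib(N_PARAMS, 4, block_size=64, double_quant=True)

print(f"float16 weights:             {fp16:.2f} GiB")
print(f"4-bit weights:               {nf4:.2f} GiB")
print(f"4-bit + double quantization: {nf4_dq:.2f} GiB")
```

Under these assumptions the 4-bit weights take roughly a quarter of the float16 footprint, and double quantization trims the overhead of the scaling constants a little further.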

<br>

## Tokenizer

---

This cell loads the tokenizer for the Llama model with LlamaTokenizer.from_pretrained. Llama 2 does not define a padding token, so the padding token is set to the end-of-sequence (EOS) token. The line that would make the tokenizer add both a beginning-of-sequence (BOS) and an EOS token to every sequence is left commented out, so the default behavior applies: a BOS token is prepended and no EOS token is appended, as the tokenizer test below shows.

In [8]:
# Load the LLaMA tokenizer
tokenizer = LlamaTokenizer.from_pretrained(model_name)

# Set the padding token of the tokenizer to be the same as the end-of-sequence token
tokenizer.pad_token = tokenizer.eos_token

# Optional (disabled here): make the tokenizer add both a BOS and an EOS token to each sequence
#tokenizer.add_bos_token, tokenizer.add_eos_token = True, True

tokenizer

LlamaTokenizer(name_or_path='meta-llama/Llama-2-7b-hf', vocab_size=32000, model_max_length=1000000000000000019884624838656, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '</s>'}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
	0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

In [9]:
# Test the tokenizer
tokens = tokenizer(["Name three primary colors."], return_tensors="pt")
tokens

Keyword arguments {'add_special_tokens': False} not recognized.


{'input_ids': tensor([[    1,  4408,  2211,  7601, 11955, 29889]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1]])}
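
Because the padding token is the EOS token, padded positions in a batch hold the EOS id and are distinguished from real tokens only by the attention mask. The sketch below illustrates this in plain Python; it is not how the tokenizer is implemented. The helper `pad_batch` is hypothetical, the first sequence is the tokenized prompt from the output above, and the second, shorter sequence is made up for illustration.

```python
def pad_batch(sequences, pad_id, padding_side="right"):
    """Pad token-id lists to a common length and build the matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        if padding_side == "right":
            input_ids.append(list(seq) + [pad_id] * n_pad)
            attention_mask.append([1] * len(seq) + [0] * n_pad)
        else:
            input_ids.append([pad_id] * n_pad + list(seq))
            attention_mask.append([0] * n_pad + [1] * len(seq))
    return {"input_ids": input_ids, "attention_mask": attention_mask}


EOS_ID = 2  # id of "</s>" in the tokenizer printout above

batch = pad_batch(
    [
        [1, 4408, 2211, 7601, 11955, 29889],  # "Name three primary colors." with BOS
        [1, 4408, 2211],                      # hypothetical shorter sequence
    ],
    pad_id=EOS_ID,
)
print(batch["input_ids"])
print(batch["attention_mask"])
```

The mask marks the trailing EOS ids of the shorter sequence with zeros, which is what tells the model to ignore them.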

<br>

## Llama Model

---

This section loads the base model ('Llama-2-7b-hf') with the quantization configuration defined earlier (bnb_config), so its linear layers are stored as 4-bit Linear4bit modules. The model is then passed through prepare_model_for_kbit_training, which readies a quantized model for training, for example by freezing the base weights and enabling gradient checkpointing. The key-value cache is disabled because it only helps during generation and is incompatible with gradient checkpointing. Finally, pretraining_tp is set to 1, which selects the standard computation of the linear layers rather than the sliced variant intended to reproduce tensor-parallel pretraining results.

In [10]:
# Load the Llama model
model = LlamaForCausalLM.from_pretrained(
    model_name, 
    quantization_config=bnb_config
)
model

`low_cpu_mem_usage` was None, now set to True since model is quantized.


LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=11008, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=11008, bias=False)
          (down_proj): Linear4bit(in_features=11008, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): Lla

In [11]:
# Prepare the quantized model for k-bit training (freezes the base weights and enables gradient checkpointing)
model = prepare_model_for_kbit_training(model)

# Disable the key-value cache during training; it is only needed for generation
model.config.use_cache = False

# Use the standard (non-sliced) computation for the linear layers
model.config.pretraining_tp = 1

In [13]:
# Test the model
def generate_text(prompt, max_length=50):
    # Move the tokenized inputs to the same device as the model
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_length=max_length)
    generated_text = tokenizer.decode(outputs[0])
    return generated_text

prompt = "The future of AI in healthcare is"
generated_text = generate_text(prompt)
print(generated_text)

Keyword arguments {'add_special_tokens': False} not recognized.


<s> The future of AI in healthcare is bright, but it will take a lot of work to get there.ℓ
In 2019, we saw the launch of the first FDA-approved AI-powered


<br>

## Weights & Biases (Monitoring)

---

This cell integrates Weights & Biases (wandb), a tool for tracking and monitoring machine learning experiments.

First, it logs in to Weights & Biases with an API key, which authenticates the session.
Then it starts a new run, a single tracked training or evaluation job, with wandb.init. The project is named 'Fine tuning llama-2-7B', the job type is 'training', and anonymous="allow" lets the run be logged anonymously when no account is logged in.

The run records the metrics, hyperparameters, and outputs of the training process so that experiments can be compared later.

In [14]:
# Log in to Weights & Biases (wandb); the API key is read from the WANDB_API_KEY environment variable instead of being hardcoded
wandb.login(key=os.environ["WANDB_API_KEY"])

# Initialize a Weights and Biases run for tracking and organizing the training process, specifying project name, job type, and anonymity settings
run = wandb.init(project='Fine tuning llama-2-7B', job_type="training", anonymous="allow")

[34m[1mwandb[0m: W&B API key is configured. Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33msoheil-mpg[0m. Use [1m`wandb login --relogin`[0m to force relogin


<br>

## Fine-Tuning

---

This section fine-tunes the causal language model with LoRA (Low-Rank Adaptation). It starts by creating a LoraConfig that attaches small trainable low-rank matrices to selected attention and MLP projection layers, while the quantized base weights stay frozen.



In [15]:
# Configure LoRA (Low-Rank Adaptation) for causal language modeling
peft_config = LoraConfig(
    lora_alpha=8,           # Scaling factor alpha; the LoRA update is scaled by lora_alpha / r
    lora_dropout=0.1,       # Dropout rate applied to the LoRA layers
    r=16,                   # Rank of the low-rank update matrices
    bias="none",            # Do not train any bias parameters
    task_type="CAUSAL_LM",  # Task type: causal language modeling
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj"]    # Modules that receive LoRA adapters
)

peft_config

LoraConfig(peft_type=<PeftType.LORA: 'LORA'>, auto_mapping=None, base_model_name_or_path=None, revision=None, task_type='CAUSAL_LM', inference_mode=False, r=16, target_modules={'q_proj', 'gate_proj', 'v_proj', 'k_proj', 'o_proj', 'up_proj'}, lora_alpha=8, lora_dropout=0.1, fan_in_fan_out=False, bias='none', use_rslora=False, modules_to_save=None, init_lora_weights=True, layers_to_transform=None, layers_pattern=None, rank_pattern={}, alpha_pattern={}, megatron_config=None, megatron_core='megatron.core', loftq_config={})
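
As a rough check on how small the trainable part is, the sketch below counts the LoRA parameters implied by this configuration. Each adapted linear layer with `in_features` inputs and `out_features` outputs gets two matrices of shapes (r, in_features) and (out_features, r). The layer shapes and the layer count are read off the model printout earlier in this notebook; the helper name `lora_param_count` is hypothetical.

```python
HIDDEN, INTERMEDIATE, N_LAYERS = 4096, 11008, 32  # from the model printout

# (in_features, out_features) of every module listed in target_modules
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
}


def lora_param_count(shapes, r, n_layers):
    """Trainable LoRA parameters: r * (in + out) per adapted layer, summed over all layers."""
    per_layer = sum(r * (fan_in + fan_out) for fan_in, fan_out in shapes.values())
    return per_layer * n_layers


R, LORA_ALPHA = 16, 8
trainable = lora_param_count(TARGET_SHAPES, r=R, n_layers=N_LAYERS)
scaling = LORA_ALPHA / R  # factor applied to the low-rank update

print(f"Trainable LoRA parameters: {trainable:,}")
print(f"LoRA scaling (lora_alpha / r): {scaling}")
```

With r=16 this comes to about 32 million trainable parameters, a small fraction of the base model, and the update is scaled by 0.5.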

Next, the training arguments are defined: batch size, learning rate, optimizer, logging and evaluation intervals, and related settings. Note that this run takes only 10 optimizer steps, so with logging_steps=100 and eval_steps=500 no intermediate training or validation loss is reported.

In [16]:
# Define training arguments for the model, setting various hyperparameters and training configurations
training_arguments = TrainingArguments(
    output_dir="./results",                 # Directory to save training outputs
    num_train_epochs=1,                     # Number of training epochs
    per_device_train_batch_size=4,          # Batch size per device during training
    gradient_accumulation_steps=2,          # Number of batches accumulated before each optimizer step
    optim="paged_adamw_8bit",               # Paged 8-bit AdamW optimizer
    save_steps=1000,                        # Save a checkpoint every 1000 steps
    logging_steps=100,                      # Log the training loss every 100 steps
    learning_rate=2e-4,                     # Peak learning rate
    weight_decay=0.001,                     # Weight decay for regularization
    fp16=True,                              # Train with 16-bit floating point mixed precision
    max_grad_norm=0.3,                      # Maximum gradient norm for gradient clipping
    warmup_ratio=0.1,                       # Fraction of training steps used for learning rate warmup
    group_by_length=True,                   # Group samples of similar lengths together
    lr_scheduler_type="linear",             # Linear learning rate decay after warmup
    report_to="wandb",                      # Report training progress to Weights & Biases
    load_best_model_at_end=True,            # Load the best checkpoint at the end of training
    evaluation_strategy="steps",            # Evaluate periodically (requires an eval_dataset passed to the trainer)
    eval_steps=500,                         # Evaluate every 500 steps
)

training_arguments

TrainingArguments(
_n_gpu=2,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=500,
evaluation_strategy=steps,
fp16=True,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_la
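
To see what these hyperparameters imply, the sketch below computes the effective batch size and a linear warmup/decay learning rate schedule in plain Python. It is an illustration, not the scheduler code used by the trainer. The function name `linear_schedule_lr` is hypothetical, the device count of 2 is taken from `_n_gpu=2` in the printout above and assumes each device processes its own per-device batch, and the total of 10 optimizer steps is an assumption chosen to match the short run in this notebook.

```python
import math


def linear_schedule_lr(step, base_lr, total_steps, warmup_steps):
    """Learning rate at an optimizer step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * (step / max(1, warmup_steps))
    remaining = max(0, total_steps - step)
    return base_lr * (remaining / max(1, total_steps - warmup_steps))


# Values taken from the training arguments above
PER_DEVICE_BATCH, GRAD_ACCUM, BASE_LR, WARMUP_RATIO = 4, 2, 2e-4, 0.1
N_DEVICES = 2     # assumption: the two GPUs reported as _n_gpu=2
TOTAL_STEPS = 10  # assumption: a short run of 10 optimizer steps

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM * N_DEVICES
warmup_steps = math.ceil(WARMUP_RATIO * TOTAL_STEPS)
schedule = [
    linear_schedule_lr(step, BASE_LR, TOTAL_STEPS, warmup_steps)
    for step in range(TOTAL_STEPS + 1)
]

print("Effective batch size:", effective_batch)
print("Warmup steps:", warmup_steps)
for step, lr in enumerate(schedule):
    print(f"step {step:2d}: lr = {lr:.2e}")
```

With so few steps, warmup lasts a single step and the learning rate then falls linearly from its peak to zero.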

The core of the code is the Supervised Fine-Tuning (SFT) trainer, SFTTrainer from TRL, which is configured with the model, dataset, LoRA configuration, tokenizer, and training arguments.

In [18]:
# Set up the Supervised Fine-Tuning (SFT) trainer with the model, dataset, and training configuration
trainer = SFTTrainer(
    model=model,                         # The model to be trained
    train_dataset=dataset,               # The dataset used for training
    peft_config=peft_config,             # Configuration for Low-Rank Adaptation (LoRA) fine-tuning
    max_seq_length=128,                  # Maximum sequence length in tokens
    dataset_text_field="text",           # The field in the dataset containing text data
    tokenizer=tokenizer,                 # The tokenizer for processing text data
    args=training_arguments,             # Training arguments including hyperparameters and settings
    packing=True,                        # Pack several examples into each fixed-length sequence
)

trainer

Generating train split: 0 examples [00:00, ? examples/s]

Keyword arguments {'add_special_tokens': False} not recognized.

<trl.trainer.sft_trainer.SFTTrainer at 0x7925e01e6f20>
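
With packing=True, examples are not padded individually. The general idea is to concatenate the tokenized examples, separated by an EOS token, and cut the resulting stream into blocks of max_seq_length tokens. The sketch below shows that idea on toy token ids; the function `pack_sequences` is hypothetical and the library's actual implementation may differ in details such as buffering and shuffling.

```python
def pack_sequences(tokenized_examples, seq_length, eos_id):
    """Concatenate examples with EOS separators and split into fixed-length blocks."""
    stream = []
    for ids in tokenized_examples:
        stream.extend(ids)
        stream.append(eos_id)  # mark the boundary between examples
    n_blocks = len(stream) // seq_length  # the incomplete tail is dropped
    return [stream[i * seq_length:(i + 1) * seq_length] for i in range(n_blocks)]


# Toy token ids, for illustration only
examples = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]
packed = pack_sequences(examples, seq_length=4, eos_id=2)

for block in packed:
    print(block)
```

Every block has the same length and contains no padding, which is why the number of training samples after packing differs from the number of rows in the dataset.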

The training process is then executed and the fine-tuned LoRA adapter is saved. After training, the wandb run is closed cleanly, caching is re-enabled for faster inference, and the model is switched to evaluation mode, which disables training-specific behavior such as dropout.


In [19]:
# Train model
trainer.train()



Step,Training Loss,Validation Loss


TrainOutput(global_step=10, training_loss=1.7943532943725586, metrics={'train_runtime': 284.8541, 'train_samples_per_second': 0.583, 'train_steps_per_second': 0.035, 'total_flos': 815872490864640.0, 'train_loss': 1.7943532943725586, 'epoch': 0.95})

In [20]:
# Save the fine-tuned LoRA adapter weights to the local directory named by 'new_model'
trainer.model.save_pretrained(new_model)

# Finish and close the current Weights and Biases (wandb) run, stopping all logging and tracking
wandb.finish()

# Re-enable caching for the model, which can speed up inference by storing and reusing certain computations
model.config.use_cache = True

# Set the model to evaluation mode, which disables training-specific behaviors like dropout
model.eval()

Run summary:
train/epoch: 0.95
train/global_step: 10.0
train/total_flos: 815872490864640.0
train/train_loss: 1.79435
train/train_runtime: 284.8541
train/train_samples_per_second: 0.583
train/train_steps_per_second: 0.035


LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): lora.Linear4bit(
            (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
            (lora_dropout): ModuleDict(
              (default): Dropout(p=0.1, inplace=False)
            )
            (lora_A): ModuleDict(
              (default): Linear(in_features=4096, out_features=16, bias=False)
            )
            (lora_B): ModuleDict(
              (default): Linear(in_features=16, out_features=4096, bias=False)
            )
            (lora_embedding_A): ParameterDict()
            (lora_embedding_B): ParameterDict()
          )
          (k_proj): lora.Linear4bit(
            (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
            (lora_dropout): ModuleDict(
              (default): Dropout(p=0.1, inplace=False)

<br>

## Generation

---

In [21]:
# Initial prompt
prompt = "Name three primary colors."

# Complete the prompt
prompt = f"### Instruction:\n{prompt.strip()}\n\n### Response:\n"

# Tokenize and move the inputs to the model's device
inputs = tokenizer([prompt], return_tensors="pt").to(model.device)

# Initialize the streamer
streamer = TextStreamer(tokenizer, skip_prompt=False, skip_special_tokens=False)

# Generate using the streamer
model.generate(**inputs, streamer=streamer, max_new_tokens=20)

Keyword arguments {'add_special_tokens': False} not recognized.


<s> ### Instruction:
Name three primary colors.

### Response:
Red, yellow, and blue

### Instruction:
Name three secondary colors.



tensor([[    1,   835,  2799,  4080, 29901,    13,  1170,  2211,  7601, 11955,
         29889,    13,    13,  2277, 29937, 13291, 29901,    13,  9039, 29892,
         13328, 29892,   322,  7254,    13,    13,  2277, 29937,  2799,  4080,
         29901,    13,  1170,  2211, 16723, 11955, 29889,    13]],
       device='cuda:0')
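
In the output above, the model answers the question and then keeps going with a new "### Instruction:" block until max_new_tokens is reached. One simple way to deal with this is to post-process the decoded text and cut it at the next instruction marker. The sketch below does that with plain string operations; the helpers `build_prompt` and `extract_response` are hypothetical and not part of the original notebook.

```python
def build_prompt(instruction):
    """Wrap an instruction in the prompt template used in this notebook."""
    return f"### Instruction:\n{instruction.strip()}\n\n### Response:\n"


def extract_response(generated_text, prompt, stop="### Instruction:"):
    """Return only the first response: drop the prompt and cut at the next stop marker."""
    start = generated_text.find(prompt)
    completion = generated_text[start + len(prompt):] if start != -1 else generated_text
    stop_at = completion.find(stop)
    if stop_at != -1:
        completion = completion[:stop_at]
    return completion.strip()


prompt = build_prompt("Name three primary colors.")

# Decoded text as printed by the streamer above
generated = (
    "<s> " + prompt
    + "Red, yellow, and blue\n\n### Instruction:\nName three secondary colors.\n"
)

print(extract_response(generated, prompt))
```

This only trims the text after generation; the model still spends tokens on the extra block.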

<br>

## Loading the Fine-Tuned Model

---

In [25]:
# Clear the memory footprint
del model, trainer
torch.cuda.empty_cache()

In [None]:
# Load the base causal language model in half precision
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,     # Reduce peak CPU memory while loading the checkpoint
    return_dict=True,           # Return ModelOutput objects instead of plain tuples
    torch_dtype=torch.float16,  # Load the weights in 16-bit floating point to reduce memory usage
    device_map={"": 0})         # Place the whole model on GPU 0

# Load the fine-tuned LoRA adapter (PEFT: Parameter-Efficient Fine-Tuning) on top of the base model
model = PeftModel.from_pretrained(base_model, new_model)

# Merge the LoRA weights into the base weights and remove the adapter wrappers, leaving a plain model
model = model.merge_and_unload()
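
Conceptually, merging a LoRA adapter folds the low-rank update into the base weight: W_merged = W + (lora_alpha / r) * B @ A, after which the adapter matrices are no longer needed. The sketch below checks this on tiny matrices in plain Python. The helpers `matmul` and `merge_lora` and all numeric values are made up for illustration, and the library's real implementation works on tensors and handles details not shown here.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(W, A, B, scaling):
    """Fold the low-rank update into the base weight: W + scaling * (B @ A)."""
    delta = matmul(B, A)  # (out x r) @ (r x in) -> (out x in)
    return [
        [w + scaling * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(W, delta)
    ]


# Toy example with rank 1 (illustrative values only)
W = [[1.0, 0.0], [0.0, 1.0]]  # base weight, out x in
A = [[1.0, 2.0]]              # LoRA A, r x in
B = [[3.0], [4.0]]            # LoRA B, out x r
scaling = 8 / 16              # lora_alpha / r from the LoRA configuration

merged = merge_lora(W, A, B, scaling)
x = [[1.0], [1.0]]            # input as a column vector

# The merged layer gives the same output as base path + scaled adapter path
y_merged = matmul(merged, x)
y_base = matmul(W, x)
y_adapter = matmul(B, matmul(A, x))
y_separate = [[b[0] + scaling * a[0]] for b, a in zip(y_base, y_adapter)]

print("merged weight:", merged)
print("outputs match:", y_merged == y_separate)
```

After merging, inference needs a single matrix multiplication per layer, so the merged model can be used and shared like an ordinary checkpoint.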

In [None]:
# Reload tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

<br>

## Push to Hugging Face

---

In [None]:
model.push_to_hub(new_model)
tokenizer.push_to_hub(new_model)