# Fine-tuning LLaMA 2 models - from tutorial

[Tutorial](https://blog.ovhcloud.com/fine-tuning-llama-2-models-using-a-single-gpu-qlora-and-ai-notebooks/)

# Fine-Tuning LLaMA 2 Models using a single GPU, QLoRA and AI Notebooks

*This tutorial walks through the process of fine-tuning [LLaMA 2](https://ai.meta.com/llama/) models, providing step-by-step instructions.*

*All the code related to this article is available in our dedicated [GitHub repository](https://github.com/ovh/ai-training-examples/blob/main/notebooks/natural-language-processing/llm/miniconda/llama2-fine-tuning/llama_2_finetuning.ipynb).*

## Introduction
On July 18, 2023, [Meta](https://about.meta.com/) released LLaMA 2, the latest version of their **Large Language Model** (LLM).

Trained between January 2023 and July 2023 on 2 trillion tokens, these new models outperform other LLMs on many benchmarks, including reasoning, coding, proficiency, and knowledge tests. This release comes in different flavors, with parameter sizes of **[7B](https://huggingface.co/meta-llama/Llama-2-7b-hf)**, **[13B](https://huggingface.co/meta-llama/Llama-2-13b-hf)** and a mind-blowing **[70B](https://huggingface.co/meta-llama/Llama-2-70b-hf)**. The models are free for both commercial and research use in English.

To suit every text generation need and fine-tune these models, we will use [QLoRA](https://arxiv.org/abs/2305.14314) (Efficient Finetuning of Quantized LLMs), a highly efficient fine-tuning technique that quantizes a pretrained LLM to just 4 bits and adds small “Low-Rank Adapters”. The quantized base weights stay frozen and only the adapters are trained, which allows for fine-tuning LLMs **using just a single GPU**! This technique is supported by the [PEFT](https://huggingface.co/docs/peft/) library.
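To get a feel for why low-rank adapters are cheap to train, the following sketch counts the parameters of a rank-`r` adapter attached to a single weight matrix. The 4096 x 4096 shape is only an illustrative assumption, and `lora_param_count` and `full_param_count` are hypothetical helpers, not part of PEFT.

```python
def lora_param_count(d_out: int, d_in: int, r: int) -> int:
    """Parameters in a rank-r adapter: B is (d_out x r), A is (r x d_in)."""
    return r * (d_out + d_in)


def full_param_count(d_out: int, d_in: int) -> int:
    """Parameters in the dense weight matrix the adapter is attached to."""
    return d_out * d_in


# Illustrative shape only (assumption): one square 4096 x 4096 projection.
d_out, d_in, r = 4096, 4096, 16
full = full_param_count(d_out, d_in)
lora = lora_param_count(d_out, d_in, r)
print(f"full: {full:,} | adapter: {lora:,} | ratio: {100 * lora / full:.2f}%")
```

Under these assumptions the adapter holds well under 1% of the parameters of the matrix it modifies, which is where most of the memory savings during training come from.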

## Set up Python environment
The following libraries are used for this method (`requirements.txt` file):

```
torch
accelerate @ git+https://github.com/huggingface/accelerate.git
bitsandbytes
datasets==2.13.1
transformers @ git+https://github.com/huggingface/transformers.git
peft @ git+https://github.com/huggingface/peft.git
trl @ git+https://github.com/lvwerra/trl.git
scipy
```

Then install the requirements and import the libraries:

In [1]:
import bitsandbytes as bnb
from datasets import load_dataset
from functools import partial
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, AutoPeftModelForCausalLM
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
    pipeline,
    set_seed,
)

## Download LLaMA 2 model
As mentioned before, LLaMA 2 models come in different flavors which are 7B, 13B, and 70B. Your choice can be influenced by your computational resources. Indeed, larger models require more resources, memory, processing power, and training time.

To download the model you have been granted access to, **make sure you are logged in to the Hugging Face model hub**, for example by running the `huggingface-cli login` command.

The following function will help us to download the model and its tokenizer. It requires a bitsandbytes configuration that we will define later.

In [2]:
import os

base_dir = '/mnt/d/Study/Thesis/thesis-implementations/'
model_name = 'meta-llama/llama-2-7b-hf'
model_dir = 'models_local'
model_path = os.path.join(base_dir, model_dir, model_name)

In [3]:
def load_model(model_name, bnb_config):
    n_gpus = torch.cuda.device_count()
    max_memory = f'{12288}MB'

    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map="auto",  # dispatch the model efficiently across the available resources
        max_memory = {i: max_memory for i in range(n_gpus)},
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name, use_auth_token=True)

    # Needed for LLaMA tokenizer
    tokenizer.pad_token = tokenizer.eos_token

    return model, tokenizer

## Quest Dataset

Load the quest dataset for training and create the prompts accordingly.

In [4]:
import json
import numpy as np
import os

dataset_path = os.path.join(base_dir, 'quest_generation/llama2/data')
train_file = 'train.jsonl'
val_file = 'val.jsonl'
data_files = {
	"train": train_file, 
	"val": val_file
}

In [5]:
# all_quests = ''
# with open(os.path.join(base_dir, 'data/VartinenFormatted/all_quests.json')) as json_file:
# 	all_quests = json.load(json_file)

In [6]:
# import random

# random.shuffle(all_quests)

# full_size = len(all_quests)
# train_size = int(full_size * 0.9)

# train_set = all_quests[:train_size]
# val_set = all_quests[train_size:]

In [7]:
# for out_file, qarr in zip([train_file, val_file], [train_set, val_set]):
# 	with open(out_file, 'w') as outfile:
# 		for entry in qarr:
# 			json.dump(entry, outfile)
# 			outfile.write('\n')

In [8]:
# %mv ./*.jsonl data/
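
The commented-out cells above describe how the two JSONL files were produced: shuffle the quests, keep 90% for training, and write one JSON object per line. A self-contained sketch of that procedure follows. The records are placeholders, `split_and_write` is a hypothetical helper, and the count of 769 is an assumption chosen because it is consistent with the 692/77 split sizes that appear in the preprocessing log.

```python
import json
import os
import random
import tempfile


def split_and_write(records, out_dir, train_ratio=0.9, seed=0):
    """Shuffle records, split them, and write one JSON object per line."""
    records = list(records)
    random.Random(seed).shuffle(records)
    train_size = int(len(records) * train_ratio)
    splits = {"train.jsonl": records[:train_size], "val.jsonl": records[train_size:]}
    for file_name, rows in splits.items():
        with open(os.path.join(out_dir, file_name), "w") as out_file:
            for row in rows:
                json.dump(row, out_file)
                out_file.write("\n")
    return {name: len(rows) for name, rows in splits.items()}


# Placeholder records (hypothetical); the real file holds the quest dictionaries.
placeholder_quests = [{"id": i} for i in range(769)]
with tempfile.TemporaryDirectory() as tmp_dir:
    sizes = split_and_write(placeholder_quests, tmp_dir)
    # Read one line back to confirm each line is a standalone JSON document.
    with open(os.path.join(tmp_dir, "val.jsonl")) as in_file:
        first_val_row = json.loads(in_file.readline())
print(sizes)
```

Seeding the shuffle is a design choice added here so that the split is reproducible between runs; the original cells shuffle without a seed.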

In [5]:
dataset = load_dataset(dataset_path, data_files=data_files)

Downloading and preparing dataset json/data to /home/manish/.cache/huggingface/datasets/json/data-1eab3bb49fa34974/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96...



Dataset json downloaded and prepared to /home/manish/.cache/huggingface/datasets/json/data-1eab3bb49fa34974/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96. Subsequent calls will reuse this data.



In [6]:
train_dataset = dataset['train']
val_dataset = dataset['val']

del dataset

## Create a bitsandbytes configuration and load the model and tokenizer
This will allow us to load our LLM in 4 bits. This way, we can divide the used memory by 4 and import the model on smaller devices. We choose to apply bfloat16 compute data type and nested quantization for memory-saving purposes.
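
As a rough sanity check of the "divide by 4" claim, the sketch below estimates the memory taken by the weights alone at 16 bits and at 4 bits per parameter. It assumes the nominal parameter counts from the model names and ignores quantization constants, activations, optimizer state and the adapters, so real usage will be higher.

```python
def weight_memory_gib(n_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3


# Nominal parameter counts (assumption: the rounded sizes from the model names).
for name, n_params in [("7B", 7e9), ("13B", 13e9), ("70B", 70e9)]:
    fp16 = weight_memory_gib(n_params, 16)
    nf4 = weight_memory_gib(n_params, 4)
    print(f"{name}: 16-bit ~ {fp16:.1f} GiB, 4-bit ~ {nf4:.1f} GiB")
```

The factor of 4 is relative to 16-bit weights; compared with 32-bit weights the reduction would be a factor of 8.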

In [7]:
def create_bnb_config():
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    return bnb_config

To leverage the LoRA method, we need to wrap the model as a `PeftModel`.

To do this, we need to define a [LoRA configuration](https://huggingface.co/docs/peft/conceptual_guides/lora):

In [8]:
def create_peft_config(modules):
    """
    Create Parameter-Efficient Fine-Tuning config for your model
    :param modules: Names of the modules to apply Lora to
    """
    config = LoraConfig(
        r=16,  # dimension of the updated matrices
        lora_alpha=64,  # parameter for scaling
        target_modules=modules,
        lora_dropout=0.1,  # dropout probability for layers
        bias="none",
        task_type="CAUSAL_LM",
    )

    return config
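
To make the roles of `r` and `lora_alpha` more concrete, here is a small pure-Python sketch of the LoRA update on a toy matrix. It assumes the standard formulation, in which the adapter output is scaled by `lora_alpha / r` and the "up" matrix starts at zero; the matrices and helper functions are made up for illustration and are not PEFT code.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def scale(a, s):
    """Multiply every entry of a matrix by the scalar s."""
    return [[s * x for x in row] for row in a]


r, lora_alpha = 2, 8        # toy values; r=16 and lora_alpha=64 give the same ratio
scaling = lora_alpha / r    # 4.0

W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]   # frozen weight, shape (2 x 3)
A = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0]]   # adapter "down" matrix, shape (r x 3)
B = [[0.0, 0.0], [0.0, 0.0]]             # adapter "up" matrix, shape (2 x r), starts at zero
x = [[1.0], [2.0], [3.0]]                # one input column vector


def lora_forward(W, A, B, x):
    """Base output plus the scaled low-rank update: W x + scaling * B (A x)."""
    return add(matmul(W, x), scale(matmul(B, matmul(A, x)), scaling))


y_initial = lora_forward(W, A, B, x)     # B is zero, so this equals W x

B = [[1.0, 0.0], [0.0, 1.0]]             # pretend training moved B away from zero
y_adapted = lora_forward(W, A, B, x)

# Merging folds the update into a single matrix: W' = W + scaling * B A
W_merged = add(W, scale(matmul(B, A), scaling))
y_merged = matmul(W_merged, x)
print(y_initial, y_adapted, y_merged)
```

The last two results coincide, which is the property that later allows the adapter to be merged into the base weights without changing the model's outputs.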

In [9]:
# Load model from HF with user's token and with bitsandbytes config
bnb_config = create_bnb_config()
model, tokenizer = load_model(model_path, bnb_config)




In [10]:
EOS_TOKEN = tokenizer.eos_token

In [11]:
def create_prompt_formats_with_kg(input):
    """
    Format various fields of the input quest data ('plots', 'kb', 'quest')
    Then concatenate them using two newline characters 
    :param input: input dictionary
    """

    BACKGROUND = "### Background:"
    PLOTS_KEY = "### Plots:"
    INTRO_BLURB = "The quest related to the above information is as follows."
    QUEST = "### Quest:"
    END_KEY = "### End"
    

    blurb = f"{INTRO_BLURB}"  # add intro blurb - model system instruction

    background = ''  # add background - knowledge graph as text
    for kb in input['kbs']:
        entity = kb['name']
        desc = kb['description']
        e_type = kb['type']
        relations = kb['relations']
        background += f'{entity} is a {e_type}. '
        if entity != desc:
            background+= f'{entity} is a {desc}. '
        for rel in relations:
            background += f' {entity} is {rel[0]} {rel[1]}.'
        background += '\n'
    background = f"{BACKGROUND}\n{background}"
    plots_str = '\n'.join(input['plots'])
    plots = f"{PLOTS_KEY}\n{plots_str}"  # add plots - key plot points
    
    quest_str = ''
    for k,v in input['quest'].items():
        if k == 'description':
            continue
        if k == 'tasks':
            value = '\n ' + '\n '.join(np.char.capitalize(v[:-1]))
        else:
            value = v.capitalize()
        quest_str += f'{k.capitalize()}: {value}\n' 
    quest = f"{QUEST}\n{quest_str}"  # add quest output
    
    end = f"{END_KEY}"  # add end key
    
    parts = [part for part in [background, plots, blurb, quest, end] if part]

    formatted_prompt = "\n\n".join(parts)
    input['text'] = formatted_prompt + f'\n{EOS_TOKEN}'

    return input

In [12]:
def create_prompt_formats_val_with_kg(input):
    """
    Format various fields of the input quest data ('plots', 'kb', 'quest')
    Then concatenate them using two newline characters 
    :param input: input dictionary
    """

    BACKGROUND = "### Background:"
    PLOTS_KEY = "### Plots:"
    INTRO_BLURB = "The quest related to the above information is as follows."
    QUEST = "### Quest:"
    END_KEY = "### End"
    

    blurb = f"{INTRO_BLURB}"  # add intro blurb - model system instruction

    background = ''  # add background - knowledge graph as text
    for kb in input['kbs']:
        entity = kb['name']
        desc = kb['description']
        e_type = kb['type']
        relations = kb['relations']
        background += f'{entity} is a {e_type}. '
        if entity != desc:
            background+= f'{entity} is a {desc}. '
        for rel in relations:
            background += f' {entity} is {rel[0]} {rel[1]}.'
        background += '\n'
    background = f"{BACKGROUND}\n{background}"
    plots_str = '\n'.join(input['plots'])
    plots = f"{PLOTS_KEY}\n{plots_str}"  # add plots - key plot points
    
    quest_str = ''
    for k,v in input['quest'].items():
        if k == 'description':
            continue
        if k == 'tasks':
            value = '\n ' + '\n '.join(np.char.capitalize(v[:-1]))
        else:
            value = v.capitalize()
        quest_str += f'{k.capitalize()}: {value}\n' 
    quest = f"{QUEST}\n{quest_str}"  # add quest output
    
    end = f"{END_KEY}"  # add end key
    
    parts_p = [part for part in [background, plots, blurb] if part]
    parts_o = [part for part in [quest, end] if part]
    
    formatted_prompt = "\n\n".join(parts_p)
    formatted_output = "\n\n".join(parts_o)
    input['text'] = formatted_prompt
    input['output'] = formatted_output + f'\n{EOS_TOKEN}'

    return input

In [13]:
def create_prompt_formats_without_kg(input):
    """
    Format various fields of the input quest data ('plots', 'quest')
    Then concatenate them using two newline characters 
    :param input: input dictionary
    """

    PLOTS_KEY = "### Plots:"
    INTRO_BLURB = "The quest related to the above information is as follows."
    QUEST = "### Quest:"
    END_KEY = "### End"
    
    plots_str = '\n'.join(input['plots'])
    quest_str = ''

    blurb = f"{INTRO_BLURB}"  # add intro blurb - model system instruction
    plots = f"{PLOTS_KEY}\n{plots_str}"  # add plots - key plot points
    for k,v in input['quest'].items():
        if k == 'description':
            continue
        if k == 'tasks':
            value = '\n ' + '\n '.join(np.char.capitalize(v[:-1]))
        else:
            value = v.capitalize()
        quest_str += f'{k.capitalize()}: {value}\n' 
    quest = f"{QUEST}\n{quest_str}"  # add quest output
    end = f"{END_KEY}"  # add end key
    
    parts = [part for part in [plots, blurb, quest, end] if part]

    formatted_prompt = "\n\n".join(parts)
    input['text'] = formatted_prompt + f'\n{EOS_TOKEN}'

    return input

In [14]:
def create_prompt_formats_val_without_kg(input):
    """
    Format various fields of the input quest data ('plots', 'quest')
    Then concatenate them using two newline characters 
    :param input: input dictionary
    """

    PLOTS_KEY = "### Plots:"
    INTRO_BLURB = "The quest related to the above information is as follows."
    QUEST = "### Quest:"
    END_KEY = "### End"
    
    plots_str = '\n'.join(input['plots'])
    quest_str = ''

    blurb = f"{INTRO_BLURB}"  # add intro blurb - model system instruction
    plots = f"{PLOTS_KEY}\n{plots_str}"  # add plots - key plot points
    for k,v in input['quest'].items():
        if k == 'description':
            continue
        if k == 'tasks':
            value = '\n ' + '\n '.join(np.char.capitalize(v[:-1]))
        else:
            value = v.capitalize()
        quest_str += f'{k.capitalize()}: {value}\n' 
    quest = f"{QUEST}\n{quest_str}"  # add quest output
    end = f"{END_KEY}"  # add end key
    
    
    parts_p = [part for part in [plots, blurb] if part]
    parts_o = [part for part in [quest, end] if part]

    formatted_prompt = "\n\n".join(parts_p)
    formatted_output = "\n\n".join(parts_o)
    input['text'] = formatted_prompt
    input['output'] = formatted_output + f'\n{EOS_TOKEN}'

    return input
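
The four prompt functions above share the same quest-rendering loop. As a sketch of how that shared part behaves, the helper below reproduces it in isolation on a made-up record. `format_quest` is a hypothetical name, the record is a placeholder, and plain `str.capitalize` is swapped in for `np.char.capitalize` to keep the example dependency-free.

```python
def format_quest(quest: dict) -> str:
    """Render a quest dictionary as 'Key: value' lines, skipping the description."""
    lines = []
    for key, value in quest.items():
        if key == "description":
            continue
        if key == "tasks":
            # Like the notebook code, the last list entry is dropped via [:-1].
            value = "\n " + "\n ".join(task.capitalize() for task in value[:-1])
        else:
            value = value.capitalize()
        lines.append(f"{key.capitalize()}: {value}")
    return "\n".join(lines) + "\n"


# Hypothetical record, shaped like the fields the functions above read.
example_quest = {
    "title": "a sample title",
    "objective": "do something useful",
    "tasks": ["first task", "second task", "trailing entry"],
    "description": "not included in the prompt",
}
quest_block = f"### Quest:\n{format_quest(example_quest)}"
prompt = "\n\n".join(["### Plots:\nsome plot point", quest_block, "### End"])
print(prompt)
```

Factoring the loop out this way would leave each of the four functions responsible only for choosing which sections go into `text` and which into `output`.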

In [15]:
# SOURCE https://github.com/databrickslabs/dolly/blob/master/training/trainer.py
def get_max_length(model):
    max_length = None
    for length_setting in ["n_positions", "max_position_embeddings", "seq_length"]:
        max_length = getattr(model.config, length_setting, None)
        if max_length:
            print(f"Found max lenth: {max_length}")
            break
    if not max_length:
        max_length = 1024
        print(f"Using default max length: {max_length}")
    return max_length


def preprocess_batch(batch, tokenizer, max_length):
    """
    Tokenizing a batch
    """
    return tokenizer(
        batch["text"],
        max_length=max_length,
        truncation=True,
    )


# SOURCE https://github.com/databrickslabs/dolly/blob/master/training/trainer.py
def preprocess_dataset(tokenizer: AutoTokenizer, max_length: int, dataset: str, include_kg: bool = True):
    """Format & tokenize it so it is ready for training
    :param tokenizer (AutoTokenizer): Model Tokenizer
    :param max_length (int): Maximum number of tokens to emit from tokenizer
    :param include_kg (bool): Whether to include knowledge graph in the prompt
    """
    
    # Add prompt to each sample
    print("Preprocessing dataset...")
    dataset = dataset.map(create_prompt_formats_with_kg if include_kg else create_prompt_formats_without_kg)#, batched=True)
    
    # Tokenize each batch of the dataset and remove the raw 'id', 'game', 'kbs', 'plots', 'quest' fields
    _preprocessing_function = partial(preprocess_batch, max_length=max_length, tokenizer=tokenizer)
    dataset = dataset.map(
        _preprocessing_function,
        batched=True,
        remove_columns=["id", "game", "kbs", "plots", "quest"],
    )

    # Drop samples that reached max_length: truncated sequences land exactly on the limit
    dataset = dataset.filter(lambda sample: len(sample["input_ids"]) < max_length)
    
    return dataset

In [16]:
# SOURCE https://github.com/databrickslabs/dolly/blob/master/training/trainer.py
def preprocess_val_dataset(tokenizer: AutoTokenizer, max_length: int, dataset: str, include_kg: bool = True):
    """Format & tokenize it so it is ready for training
    :param tokenizer (AutoTokenizer): Model Tokenizer
    :param max_length (int): Maximum number of tokens to emit from tokenizer
    :param include_kg (bool): Whether to include knowledge graph in the prompt
    """
    
    # Add prompt to each sample
    print("Preprocessing dataset...")
    dataset = dataset.map(create_prompt_formats_val_with_kg if include_kg else create_prompt_formats_val_without_kg)#, batched=True)
    
    # Tokenize each batch of the dataset and remove the raw 'id', 'game', 'kbs', 'plots', 'quest' fields
    _preprocessing_function = partial(preprocess_batch, max_length=max_length, tokenizer=tokenizer)
    dataset = dataset.map(
        _preprocessing_function,
        batched=True,
        remove_columns=["id", "game", "kbs", "plots", "quest"],
    )

    # Drop samples that reached max_length: truncated sequences land exactly on the limit
    dataset = dataset.filter(lambda sample: len(sample["input_ids"]) < max_length)
    
    return dataset

Now, we will use the **model tokenizer to process these prompts into tokenized ones**.

The goal is to create input sequences of uniform length (which suits fine-tuning because it maximizes efficiency and minimizes computational overhead) that do not exceed the model’s maximum token limit.
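
The interaction between truncation and the length filter in `preprocess_dataset` is easy to miss, so here is a toy sketch of it. A whitespace split stands in for the real tokenizer (an assumption made purely for illustration; the real tokenizer also adds special tokens), and `toy_tokenize` is a hypothetical helper.

```python
def toy_tokenize(text: str, max_length: int) -> list:
    """Whitespace 'tokenizer' with truncation, standing in for the real tokenizer."""
    return text.split()[:max_length]


max_length = 5
texts = ["a b c", "a b c d e", "a b c d e f g"]
encoded = [toy_tokenize(t, max_length) for t in texts]

# Same predicate as the notebook's filter: keep only strictly shorter sequences.
kept = [ids for ids in encoded if len(ids) < max_length]
print([len(ids) for ids in encoded], kept)
```

Because truncation caps every sequence at `max_length`, the strict `<` comparison removes anything that was truncated, and also a sample that happens to contain exactly `max_length` tokens.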

The `create_peft_config` function defined earlier needs the names of the target modules whose weight matrices LoRA will update. The following function collects them for our model:

In [17]:
# SOURCE https://github.com/artidoro/qlora/blob/main/qlora.py

def find_all_linear_names(model):
    cls = bnb.nn.Linear4bit #if args.bits == 4 else (bnb.nn.Linear8bitLt if args.bits == 8 else torch.nn.Linear)
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, cls):
            names = name.split('.')
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])

    if 'lm_head' in lora_module_names:  # needed for 16-bit
        lora_module_names.remove('lm_head')
    return list(lora_module_names)
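
To illustrate the name handling in `find_all_linear_names` without loading a model, the sketch below applies the same string logic to a handful of module paths. The paths are hypothetical, written in the style of a decoder-only transformer, and the `isinstance` check against `bnb.nn.Linear4bit` is left out, so every listed name is treated as a 4-bit linear layer.

```python
def last_components(module_names):
    """Collect the last dotted component of each module name."""
    found = set()
    for name in module_names:
        parts = name.split(".")
        found.add(parts[0] if len(parts) == 1 else parts[-1])
    found.discard("lm_head")  # the output head is excluded from the LoRA targets
    return sorted(found)


# Hypothetical module paths (assumption), not read from a real model.
names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.v_proj",
    "model.layers.1.self_attn.q_proj",
    "model.layers.1.mlp.up_proj",
    "lm_head",
]
targets = last_components(names)
print(targets)
```

Using a set collapses the per-layer repetitions, so the configuration receives each projection name once regardless of how many layers the model has.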

Once everything is set up and the base model is prepared, we can use the `print_trainable_parameters()` helper function to see how many trainable parameters are in the model. We expect the LoRA model to have far fewer trainable parameters than the original one, since only the adapter matrices are trained.

In [18]:
def print_trainable_parameters(model, use_4bit=False):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        num_params = param.numel()
        # if using DS Zero 3 and the weights are initialized empty
        if num_params == 0 and hasattr(param, "ds_numel"):
            num_params = param.ds_numel

        all_param += num_params
        if param.requires_grad:
            trainable_params += num_params
    if use_4bit:
        # Integer division keeps the count an int so the `:,d` format below works
        trainable_params //= 2
    print(
        f"all params: {all_param:,d} || trainable params: {trainable_params:,d} || trainable%: {100 * trainable_params / all_param}"
    )

## Train
Now that everything is ready, we can pre-process our dataset and load our model using the set configurations:

In [19]:
# ## Preprocess dataset
max_length = get_max_length(model)
train_dataset = preprocess_dataset(tokenizer, max_length, train_dataset)
val_dataset = preprocess_val_dataset(tokenizer, max_length, val_dataset)

Found max lenth: 4096
Preprocessing dataset...


Map:   0%|          | 0/692 [00:00<?, ? examples/s]

Map:   0%|          | 0/692 [00:00<?, ? examples/s]

Filter:   0%|          | 0/692 [00:00<?, ? examples/s]

Preprocessing dataset...


Map:   0%|          | 0/77 [00:00<?, ? examples/s]

Map:   0%|          | 0/77 [00:00<?, ? examples/s]

Filter:   0%|          | 0/77 [00:00<?, ? examples/s]

In [24]:
## Preprocess dataset for no KG inputs
# max_length = get_max_length(model)
# train_dataset = preprocess_dataset(tokenizer, max_length, train_dataset, include_kg=False)
# val_dataset = preprocess_val_dataset(tokenizer, max_length, val_dataset, include_kg=False)


In [20]:
def train(model, tokenizer, dataset, output_dir):
    # Apply preprocessing to the model to prepare it by
    # 1 - Enabling gradient checkpointing to reduce memory usage during fine-tuning
    model.gradient_checkpointing_enable()

    # 2 - Using the prepare_model_for_kbit_training method from PEFT
    model = prepare_model_for_kbit_training(model)

    # Get lora module names
    modules = find_all_linear_names(model)

    # Create PEFT config for these modules and wrap the model to PEFT
    peft_config = create_peft_config(modules)
    model = get_peft_model(model, peft_config)
    
    # Print information about the percentage of trainable parameters
    print_trainable_parameters(model)
    
    # Training parameters
    trainer = Trainer(
        model=model,
        train_dataset=dataset,
        args=TrainingArguments(
            per_device_train_batch_size=1,
            gradient_accumulation_steps=4,
            warmup_steps=10,
            max_steps=20,
            learning_rate=2e-4,
            fp16=True,
            logging_steps=1,
            output_dir="outputs",
            optim="paged_adamw_8bit",
        ),
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)
    )
    
    model.config.use_cache = False  # re-enable for inference to speed up predictions for similar inputs
    
    ### SOURCE https://github.com/artidoro/qlora/blob/main/qlora.py
    # Verifying the datatypes before training
    
    dtypes = {}
    for _, p in model.named_parameters():
        dtype = p.dtype
        if dtype not in dtypes: dtypes[dtype] = 0
        dtypes[dtype] += p.numel()
    total = 0
    for k, v in dtypes.items(): total+= v
    for k, v in dtypes.items():
        print(k, v, v/total)
     
    do_train = True
    
    # Launch training
    print("Training...")
    
    if do_train:
        train_result = trainer.train()
        metrics = train_result.metrics
        trainer.log_metrics("train", metrics)
        trainer.save_metrics("train", metrics)
        trainer.save_state()
        print(metrics)    
    
    ###
    
    # Saving model
    print("Saving last checkpoint of the model...")
    os.makedirs(output_dir, exist_ok=True)
    trainer.model.save_pretrained(output_dir)
    
    # Free memory for merging weights
    del model
    del trainer
    torch.cuda.empty_cache()
    
    
output_dir = os.path.join(base_dir, model_dir, 'results', model_name, 'final_checkpoint')
train(model, tokenizer, train_dataset, output_dir)

all params: 3,540,389,888 || trainable params: 39,976,960 || trainable%: 1.1291682911958425
torch.float32 302387200 0.08541070604255438
torch.uint8 3238002688 0.9145892939574456
Training...


You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss
1,2.1706
2,2.1023
3,2.3173
4,2.2382
5,1.9208
6,1.7836
7,1.8578
8,1.7175
9,1.5464
10,1.2287


***** train metrics *****
  epoch                    =       0.12
  total_flos               =   709948GF
  train_loss               =     1.5125
  train_runtime            = 0:33:48.45
  train_samples_per_second =      0.039
  train_steps_per_second   =       0.01
{'train_runtime': 2028.4507, 'train_samples_per_second': 0.039, 'train_steps_per_second': 0.01, 'total_flos': 762301429014528.0, 'train_loss': 1.5124629020690918, 'epoch': 0.12}
Saving last checkpoint of the model...
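
The "all params" figure of roughly 3.5 billion in the log above may look low for a 7B model. A likely explanation, stated here as an assumption about how the 4-bit weights are stored, is that two 4-bit values are packed into each `uint8` element, so `numel()` reports half of the underlying weights. The sketch below redoes the arithmetic with the numbers from the log.

```python
# Element counts copied from the dtype summary printed above.
uint8_elements = 3_238_002_688    # packed 4-bit weights
float32_elements = 302_387_200    # everything kept in float32, adapters included
trainable = 39_976_960            # LoRA parameters

reported_total = uint8_elements + float32_elements

# Assumption: each uint8 element packs two 4-bit weights.
unpacked_total = 2 * uint8_elements + float32_elements
trainable_pct = 100 * trainable / unpacked_total
print(f"reported: {reported_total:,} | unpacked: {unpacked_total:,} | trainable: {trainable_pct:.2f}%")
```

Under that assumption the trainable share is closer to 0.6% than to the 1.13% printed by the helper, which counts packed elements.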


*If you prefer to have a number of epochs (entire training dataset will be passed through the model) instead of a number of training steps (forward and backward passes through the model with one batch of data), you can replace the `max_steps` argument by `num_train_epochs`.*
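
The relation between steps and epochs can be checked with a few lines of arithmetic. The sketch below uses the values from the training cell and the train size of 692 shown in the preprocessing progress bar, assuming no samples were removed by the length filter; `epochs_covered` is a hypothetical helper.

```python
def epochs_covered(max_steps, n_samples, per_device_batch_size, grad_accum_steps, n_devices=1):
    """Fraction of the dataset seen after max_steps optimizer steps."""
    samples_per_step = per_device_batch_size * grad_accum_steps * n_devices
    return max_steps * samples_per_step / n_samples


# 20 steps, batch size 1, 4 accumulation steps, 692 training samples (assumed unfiltered).
epochs = epochs_covered(max_steps=20, n_samples=692, per_device_batch_size=1, grad_accum_steps=4)
print(f"{epochs:.2f} epochs")
```

The result agrees with the `epoch = 0.12` line in the training metrics, which shows that this run saw only a small part of the training set.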

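To later load and use the model for inference, we have used the `trainer.model.save_pretrained(output_dir)` function. Because the model is wrapped as a `PeftModel`, this saves the LoRA adapter weights and their configuration rather than the full model; the tokenizer has to be saved separately with `tokenizer.save_pretrained(...)` if it is needed alongside the checkpoint.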

Unfortunately, the latest weights are not necessarily the best. To address this, you can use the `EarlyStoppingCallback` from transformers during fine-tuning. Combined with regular evaluation on a validation set, if you have one, it stops training once the monitored metric stops improving, and with `load_best_model_at_end=True` the best checkpoint is restored at the end.
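
The idea behind early stopping can be sketched in plain Python. The function below is a simplified illustration of patience-based stopping, not the transformers implementation, and the loss values are made up.

```python
def best_step_with_patience(eval_losses, patience=2):
    """Return (best_index, stop_index) for a sequence of validation losses.

    Training stops once the loss has failed to improve `patience` times in a row.
    """
    best_index, best_loss, bad_evals = 0, float("inf"), 0
    for index, loss in enumerate(eval_losses):
        if loss < best_loss:
            best_index, best_loss, bad_evals = index, loss, 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return best_index, index
    return best_index, len(eval_losses) - 1


# Made-up validation losses, for illustration only.
losses = [1.9, 1.6, 1.5, 1.55, 1.52, 1.4]
best, stopped_at = best_step_with_patience(losses, patience=2)
print(best, stopped_at)
```

In this toy run training stops before the final, lower loss is ever reached, which is the usual trade-off when choosing a small patience value.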

## Inference

In [26]:
# from torch.utils.data import Dataset

# class ListDataset(Dataset):
#      def __init__(self, original_list):
#         self.original_list = original_list
#      def __len__(self):
#         return len(self.original_list)

#      def __getitem__(self, i):
#         return self.original_list[i]

In [27]:
# from tqdm.auto import tqdm
# # Run text generation pipeline with our new model
# prompts = ListDataset(val_dataset['text'])
# pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=500)
# results = []

# for out in tqdm(pipe(prompts)):
#     results.append(out)

In [28]:
# with open('results.jsonl', 'w') as outfile:
# 	for result in results:
# 		result[0]['generated_text'] = result[0]['generated_text'].split('### END')[0].strip()
# 		json.dump(result[0], outfile)
# 		outfile.write('\n')

In [22]:
from tqdm.auto import tqdm
results = []

# Pick the device once, outside the loop
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

for item in tqdm(val_dataset):
	# Build a batch of size 1 from the tokenized prompt and move it to the device
	inp = torch.tensor([item['input_ids']]).to(device)
	attn_mask = torch.tensor([item['attention_mask']]).to(device)

	# Get answer
	# (Adjust max_new_tokens as you wish: it is the maximum number of tokens the model can generate)
	# Note: every token id produced for '### End' (including any BOS token the tokenizer prepends) acts as a stop token
	outputs = model.generate(input_ids=inp, attention_mask=attn_mask, max_new_tokens=250, eos_token_id=tokenizer('### End')['input_ids'], pad_token_id=tokenizer.eos_token_id)

	# Decode output & append to output list
	results.append(tokenizer.decode(outputs[0], skip_special_tokens=True))
	break  # only the first validation sample is generated here




In [24]:
print(results[0])

### Background:
Elder Josimon is a character. Elder Josimon is a a man from the Enclave.  Elder Josimon is present in Slavers' Camp.
Slavers' Camp is a location. Slavers' Camp is a home of the slavers who have been harassing the Estherians, in the Frosted Hills.  Slavers' Camp is connected to Frosted Hills.
Frosted Hills is a location. Frosted Hills is a hills. 


### Plots:
the player freed Elder Josimon from the slavers: he had recovered the Passkey Ember from them, but they captured him

The quest related to the above information is as follows.

### Quest:
Title: The key to freedom
Objective: Bring the passkey ember to elder josimon
Tasks: 
 Find the passkey ember


### End


In [31]:
print(f'{val_dataset[0]["text"]}\n\n{val_dataset[0]["output"]}')

### Background:
Elder Josimon is a character. Elder Josimon is a a man from the Enclave.  Elder Josimon is present in Slavers' Camp.
Slavers' Camp is a location. Slavers' Camp is a home of the slavers who have been harassing the Estherians, in the Frosted Hills.  Slavers' Camp is connected to Frosted Hills.
Frosted Hills is a location. Frosted Hills is a hills. 


### Plots:
the player freed Elder Josimon from the slavers: he had recovered the Passkey Ember from them, but they captured him

The quest related to the above information is as follows.

### Quest:
Title: Up in smoke
Objective: Put the slavers out of the picture for good
Tasks: 
 Burn down the slavers' houses


### End
</s>


In [30]:
with open('outputs/results.txt', 'w') as outfile:
	for result in results:
		outfile.write(f'{result}\n{"-"*40}\n')
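
Each decoded result still contains the prompt followed by the generated quest and the end marker. A small post-processing helper, sketched below, could isolate the quest text. `extract_quest` is a hypothetical name and the sample string is a shortened stand-in for a real generation.

```python
def extract_quest(generated: str, quest_key: str = "### Quest:", end_key: str = "### End") -> str:
    """Return the text between the quest header and the end marker."""
    _, _, after_header = generated.partition(quest_key)
    quest, _, _ = after_header.partition(end_key)
    return quest.strip()


# Shortened stand-in for a generated string (hypothetical content).
sample = "### Plots:\nsome plot point\n\n### Quest:\nTitle: A sample title\n\n### End\nextra tokens"
quest_text = extract_quest(sample)
print(quest_text)
```

The marker is spelled `### End` in the prompt functions, so a case-sensitive split needs that exact spelling; the commented-out cell earlier splits on `### END`, which would not match.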

## Merge weights
Once we have our fine-tuned weights, we can build our fine-tuned model and save it to a new directory, with its associated tokenizer. By performing these steps, we can have a memory-efficient fine-tuned model and tokenizer ready for inference!

In [33]:
del model
# del pipe

import gc
gc.collect()
gc.collect()

0

In [32]:
print(output_dir)

models/results/llama-2-13b/final_checkpoint


In [34]:
output_dir = os.path.join(base_dir, model_dir, 'results', model_name, 'final_checkpoint')

In [37]:
# load_model returns a (model, tokenizer) tuple, so it must be unpacked; passing the
# tuple itself to PeftModel raises AttributeError: 'tuple' object has no attribute 'named_modules'
base_model, tokenizer = load_model(model_name, bnb_config)
# base_model = model

from peft import PeftModel

model = PeftModel.from_pretrained(base_model, output_dir)
model = model.merge_and_unload()

output_merged_dir = os.path.join(base_dir, model_dir, 'results', model_name, 'final_merged_checkpoint')
os.makedirs(output_merged_dir, exist_ok=True)
model.save_pretrained(output_merged_dir, safe_serialization=True)

# save tokenizer for easy inference
# model_name = "models/meta-llama/llama-2-13b-hf" 
# tokenizer = AutoTokenizer.from_pretrained(model_name)
# tokenizer.save_pretrained(output_merged_dir)

In [None]:
## Preprocess validation dataset
max_length = get_max_length(model)
val_dataset = preprocess_val_dataset(tokenizer, max_length, val_dataset)

In [None]:
import textwrap
from ctransformers import AutoModelForCausalLM  # note: shadows the transformers class of the same name
import os
import sys
from typing import List
 
from peft import (
    LoraConfig,
    get_peft_model,
    get_peft_model_state_dict,
    prepare_model_for_int8_training,
)
 
import fire
import torch
from datasets import load_dataset
import pandas as pd
 
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
from pylab import rcParams
 
%matplotlib inline
sns.set(rc={'figure.figsize':(10, 7)})
sns.set(rc={'figure.dpi':100})
sns.set(style='white', palette='muted', font_scale=1.2)
 
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
DEVICE

In [None]:
llm = AutoModelForCausalLM.from_pretrained("llama2/TheBloke/Llama-2-13B-Ensemble-v5-GGUF", 
										   model_file="llama-2-13b-ensemble-v5.Q5_0.gguf", 
										   model_type="llama", 
										   gpu_layers=1000000000)