# Parameter Efficient Fine-Tuning QLoRA for Legal Assistance

In the ever-evolving field of legal technology, advanced natural language processing models have become increasingly important. **To enhance the effectiveness and accuracy of automated legal assistants**, we fine-tune a state-of-the-art language model with QLoRA (Quantized Low-Rank Adapters), a parameter-efficient technique that trains small low-rank adapters on top of a quantized base model. This process adapts a pre-trained model to **better handle specific legal contexts and tasks**, ultimately improving its ability to assist users with nuanced legal inquiries.

- **Objective:** The primary objective of this fine-tuning exercise is to adapt the LLaMA-2-7B-Chat model, using QLoRA, to perform well in the domain of legal assistance. By **fine-tuning the model with a specialized dataset**, we aim to enhance its proficiency in understanding, generating, and advising on various legal issues.

- **Dataset Preparation:** The dataset for this fine-tuning exercise has been carefully curated to reflect a range of common and complex legal scenarios. It includes a variety of examples across different areas of law, including but not limited to:

  - Contract Law
  - Family Law
  - Corporate Law
  - Intellectual Property Law
  - Estate Planning
  - Real Estate Law
  - Immigration Law
  - Consumer Protection Law
  - Criminal Law
  - Environmental Law

Each example is structured in a specific format to provide the model with clear instructions and responses. The dataset includes:

- **System Prompt:** A brief description of the model's role and expertise.
- **User Prompt:** A detailed legal question or scenario requiring a specific response.
- **Model Answer:** A well-structured and accurate response tailored to the legal query.

- **Format:** The dataset entries follow this format:

```
<s>[INST] <<SYS>>
System prompt
<</SYS>>

User prompt [/INST] Model answer </s>

```
The dataset features a diverse set of legal queries, ranging from drafting legal clauses to explaining complex legal principles. Here are a few types of examples included:

- **Contract Clauses:** Drafting specific clauses such as limitation of liability.

- **Custody Arrangements:** Explaining factors considered in child custody decisions.

- **Business Formation:** Outlining steps to form a limited liability company (LLC).

- **Copyright Registration:** Detailing requirements for obtaining a copyright registration.

- **Estate Planning:** Comparing wills and living trusts.
- **Lease Agreements:** Identifying key components of a residential lease agreement.
- **Immigration Processes:** Describing the process for applying for a family-based green card.
- **Consumer Rights:** Outlining rights under the Fair Credit Reporting Act (FCRA).
- **Criminal Rights:** Explaining Miranda rights and their application.
- **Environmental Regulations:** Summarizing key provisions of the Clean Air Act.

- **Purpose and Benefits:** Fine-tuning the model on this dataset with QLoRA is intended to help it **provide more accurate, relevant, and contextually appropriate responses** to a wide range of legal inquiries. This tailored model will better support users in navigating legal issues, understanding their rights and obligations, and making informed decisions.

By investing in this fine-tuning process, we aim to enhance the model's capability to serve as a reliable and knowledgeable legal assistant, ultimately improving the quality and accessibility of legal support.


## Performance of LLaMA-2-7B-Chat Model Without Fine-Tuning

### Loading the Pretrained LLaMA 2 Model and Configuring the Tokenizer for Text Generation


In [1]:
!pip install -q accelerate==0.21.0 peft==0.4.0 bitsandbytes==0.40.2 transformers==4.31.0 trl==0.4.7 xformers


This cell configures and loads a model for **causal language modeling** using the transformers library from Hugging Face. The configuration is **aimed at resource-constrained environments**: the weights are loaded with 4-bit quantization. The following key components and settings are involved:

- **Importing Required Libraries:** The script begins by importing three names:

  - **AutoModelForCausalLM from the transformers module:** This class loads a pre-trained model for causal language modeling, a type of task where the model generates text token by token based on the preceding input.

  - **BitsAndBytesConfig from the same module:** This class configures the quantization of the model, which allows the weights to be **stored with reduced precision**, saving memory and computational resources.

  - **torch:** This is the PyTorch library, which provides tensor computations with **GPU support**. Here it also supplies the `torch.float16` dtype used in the quantization settings.

**Configuring the Bits and Bytes (bnb) Settings:** The bnb_config object is instantiated using the BitsAndBytesConfig class. The configuration includes several important parameters:

- **load_in_4bit=True:** This enables loading the model in 4-bit precision, which drastically reduces the memory footprint while maintaining an acceptable level of performance.
- **bnb_4bit_quant_type="nf4":** This parameter specifies the quantization type as "nf4" (Normal Float 4), a quantization scheme known for preserving the accuracy of the model while operating with lower precision.

- **bnb_4bit_compute_dtype=torch.float16:** This setting ensures that computations are performed using 16-bit floating-point precision (float16). This balances performance and resource usage during model inference.

- **bnb_4bit_use_double_quant=False:** This setting disables double quantization, a second quantization pass applied to the quantization constants themselves. Enabling it would save a little additional memory at the cost of some extra computation. In this case, it is set to False, so the constants are kept at their original precision.

Overall, this script is designed to optimize the model for efficient inference, particularly in environments where memory and computational resources are limited. The use of 4-bit quantization and float16 precision strikes a balance between performance and resource efficiency, making it well-suited for deployment in scenarios where hardware constraints are a concern.
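To make the memory savings concrete, the following back-of-the-envelope sketch compares weight storage at 16-bit and 4-bit precision. The parameter count (roughly 6.74 billion for a 7B Llama 2 model), the quantization block size of 64, and the helper name `weight_memory_gb` are assumptions for illustration. Layers that stay in 16-bit, activations, and the CUDA context are ignored, so real usage will be higher.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 6.74e9   # assumed approximate parameter count of a 7B Llama 2 model
BLOCK_SIZE = 64     # assumed number of weights sharing one quantization constant

# Without double quantization: one 32-bit constant per block of weights.
plain_overhead_bits = 32 / BLOCK_SIZE
# With double quantization: 8-bit constants, plus one 32-bit second-level
# constant for every 256 blocks.
double_quant_overhead_bits = 8 / BLOCK_SIZE + 32 / (BLOCK_SIZE * 256)

settings = [
    ("float16", 16),
    ("4-bit, no double quant", 4 + plain_overhead_bits),
    ("4-bit, double quant", 4 + double_quant_overhead_bits),
]

for label, bits in settings:
    print(f"{label:<24} {weight_memory_gb(N_PARAMS, bits):6.2f} GB")
```

Under these assumptions the 4-bit weights take roughly a quarter of the float16 footprint, and double quantization trims only the small per-block overhead.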

In [2]:
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,  # Load the weights with 4-bit quantization to reduce memory usage.
    bnb_4bit_quant_type="nf4",  # Use the 4-bit NormalFloat (NF4) quantization format.
    bnb_4bit_compute_dtype=torch.float16,  # Run computations in 16-bit floating-point precision.
    bnb_4bit_use_double_quant=False,  # Do not quantize the quantization constants a second time (enabling this would save a little more memory).
)

model = AutoModelForCausalLM.from_pretrained(
    "NousResearch/Llama-2-7b-chat-hf",  # Pre-trained model to load from the Hugging Face Hub.
    quantization_config=bnb_config,  # Passes the quantization configuration defined above.
    device_map={"": 0},  # Places the whole model on the first GPU (device 0). The empty-string key refers to the entire model.
)

model.config.use_cache = False  # Disables the key/value cache. It is not needed for training and is incompatible with the gradient checkpointing enabled later.
model.config.pretraining_tp = 1  # Tensor parallelism degree used in pre-training. A value of 1 means no tensor parallelism.


The following code configures a tokenizer for the "NousResearch/Llama-2-7b-chat-hf" model using the Hugging Face `transformers` library:

- **Tokenizer Initialization**:  
  The tokenizer is initialized using `AutoTokenizer.from_pretrained()` with the model identifier `"NousResearch/Llama-2-7b-chat-hf"`. The `trust_remote_code=True` parameter allows custom code shipped in the model's repository to run during tokenizer setup. It only matters for repositories that define their own tokenizer code, and it should be enabled only for repositories you trust.

- **Padding Token Configuration**:  
  The `pad_token` is set to the same value as the `eos_token` (end-of-sequence token), so padding positions are filled with the end-of-sequence token. This gives the tokenizer a padding token to use whenever sequences in a batch are padded to a common length.

- **Padding Side Configuration**:  
  The `padding_side` is set to `"right"`, meaning that padding will be added to the right side of sequences. This keeps the main content aligned to the left, with any extra padding added at the end. This configuration is useful in keeping the structure of input sequences consistent during batching.
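As a rough illustration of what right-side padding does to a batch, here is a minimal pure-Python sketch. The helper `pad_batch` and the token ids are hypothetical and stand in for what the tokenizer does internally; `pad_id=2` assumes the end-of-sequence id is 2.

```python
def pad_batch(sequences, pad_id, padding_side="right"):
    """Pad token-id lists to a common length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        if padding_side == "right":
            input_ids.append(list(seq) + [pad_id] * n_pad)
            attention_mask.append([1] * len(seq) + [0] * n_pad)
        elif padding_side == "left":
            input_ids.append([pad_id] * n_pad + list(seq))
            attention_mask.append([0] * n_pad + [1] * len(seq))
        else:
            raise ValueError(f"unknown padding_side: {padding_side!r}")
    return input_ids, attention_mask


# Illustrative token ids only; real ids come from the tokenizer.
batch = [[1, 15043, 3186], [1, 15043]]
ids, mask = pad_batch(batch, pad_id=2, padding_side="right")
print(ids)
print(mask)
```

The attention mask marks padded positions with 0, so the model can ignore them even though they reuse the end-of-sequence id.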

In [3]:
from transformers import AutoTokenizer

# Load the tokenizer for the "NousResearch/Llama-2-7b-chat-hf" model.
# The "trust_remote_code=True" flag allows the use of custom code from the model's repository,
# which may include specific tokenization logic.

tokenizer = AutoTokenizer.from_pretrained("NousResearch/Llama-2-7b-chat-hf", trust_remote_code=True)

# Set the padding token to be the same as the end-of-sequence (EOS) token.
# This ensures that when padding sequences to a fixed length, the padding tokens
# will be treated as end-of-sequence tokens.

tokenizer.pad_token = tokenizer.eos_token

# Configure the tokenizer to apply padding to the right side of sequences.
# This means that when padding is applied, it will be added to the end (right side) of the input,
# leaving the actual content aligned to the left.
tokenizer.padding_side = "right"


### Text Generation Using LLaMA-2-7B-Chat Model

The following code creates a text generation pipeline using the Hugging Face `transformers` library. At this stage the pipeline uses the pre-trained LLaMA 2 chat model, before any fine-tuning, to generate text.

- **Pipeline Initialization**:  
  The pipeline is initialized with the `task="text-generation"`, which specifies that the pipeline will perform text generation. The pipeline is constructed using the pre-loaded `model` and `tokenizer`, ensuring that it leverages the specific configurations and capabilities of the LLaMA 2 model.

- **Max Length Configuration**:  
  The `max_length=500` parameter caps the total sequence length at 500 tokens. This count includes the prompt tokens as well as the newly generated ones, so a longer prompt leaves less room for the answer, and a long answer is cut off once the limit is reached.

This setup allows for generating text based on a given input prompt using the LLaMA 2 model, with appropriate tokenization and length control.

In [4]:
from transformers import pipeline

llama2_pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=500)



In this example, we showcase how to leverage a language model to draft a specialized document.

We start by defining a **prompt** that instructs the model to act as a **legal assistant** with expertise in **employment law**. The task is to create an **employment contract** that includes essential clauses such as **job duties**, **compensation**, and **termination conditions**.

To guide the model, we format the prompt with specific instructions, ensuring clarity in the task. The prompt is then passed to the model for processing, and it generates a draft of the employment contract based on the given guidelines.

After generating the text, we insert a blank line after the `[/INST]` marker so that the model's answer is visually separated from the prompt when printed.

This approach illustrates the model’s capability to produce detailed and contextually accurate documents from structured prompts.



In [12]:
prompt = """<<SYS>>
You are a legal assistant specializing in employment law. Your task is to draft an employment contract that includes clauses for job duties, compensation, and termination conditions.
<</SYS>>

Draft an employment contract that covers job duties, compensation, and termination conditions."""

prompt_template = f"<s>[INST] {prompt} [/INST]"

output = llama2_pipe(prompt_template)
print(output[0]['generated_text'].replace("[/INST]", "[/INST]\n\n"))



<s>[INST] <<SYS>>
You are a legal assistant specializing in employment law. Your task is to draft an employment contract that includes clauses for job duties, compensation, and termination conditions.
<</SYS>>

Draft an employment contract that covers job duties, compensation, and termination conditions. [/INST]

  Sure, here is a sample employment contract that covers job duties, compensation, and termination conditions:

Employment Contract

This Employment Contract (the "Agreement") is made and entered into on [insert date] by and between [insert company name] (the "Company") and [insert employee name] (the "Employee").

Job Duties:

The Employee shall be employed as a [insert job title] and shall be responsible for the following job duties:

* [insert duty 1]
* [insert duty 2]
* [insert duty 3]

Compensation:

The Company shall pay the Employee the following compensation:

* [insert salary or hourly rate] per [insert time period (e.g. month, quarter, year)]
* [insert any bonuses or

In this example, we demonstrate how to utilize a language model to draft a **copyright license agreement**.

We start by setting up a **prompt** where the model is instructed to act as a **legal assistant** specializing in **intellectual property law**. The task is to draft a **copyright license agreement** that includes key clauses such as **scope of use**, **duration**, and **payment terms**.

The prompt is formatted to include clear instructions for the model. We then pass this formatted prompt to the model, which generates the text based on the given parameters.

After generating the text, we insert a blank line after the `[/INST]` marker so that the answer is easier to read. The formatting tags themselves are left in the printed output.

This process highlights how the model can assist in creating detailed legal documents from structured prompts, tailored to specific legal requirements.


In [13]:
prompt = """<<SYS>>
You are a legal assistant specializing in intellectual property law. Your task is to draft a copyright license agreement that includes clauses for scope of use, duration, and payment terms.
<</SYS>>

Draft a copyright license agreement that includes scope of use, duration, and payment terms."""

prompt_template = f"<s>[INST] {prompt} [/INST]"

output = llama2_pipe(prompt_template)
print(output[0]['generated_text'].replace("[/INST]", "[/INST]\n\n"))

<s>[INST] <<SYS>>
You are a legal assistant specializing in intellectual property law. Your task is to draft a copyright license agreement that includes clauses for scope of use, duration, and payment terms.
<</SYS>>

Draft a copyright license agreement that includes scope of use, duration, and payment terms. [/INST]

  Sure, here is a sample copyright license agreement that includes scope of use, duration, and payment terms:

License Agreement for Copyrighted Material

This License Agreement (the "Agreement") is entered into on [insert date] by and between [insert name of licensor] (the "Licensor") and [insert name of licensee] (the "Licensee").

Scope of Use:

The Licensor hereby grants to the Licensee a non-exclusive, non-transferable, and non-sublicensable license to use the copyrighted material (the "Material") for the following purposes:

[insert specific purposes, such as "for use in the Licensee's business" or "for use in the Licensee's advertising and marketing campaigns"]

Du

While the generated text provides a useful starting point, the base model has not been adapted to any specific **style** or **tone**. Consequently, the format or type of response may not fully align with the company's desired tone, culture, or operational practices. It is essential to review and adapt the output to ensure it meets the company's standards and requirements.
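The printed outputs above repeat the full prompt before the answer, because the pipeline's `generated_text` contains the prompt followed by the completion. A small helper like the following could isolate the answer for review. The function name `extract_answer` and the sample string are hypothetical.

```python
def extract_answer(generated_text: str, marker: str = "[/INST]") -> str:
    """Return the text after the first instruction-closing marker.

    If the marker is missing, the whole text is returned unchanged
    (apart from surrounding whitespace).
    """
    _, found, answer = generated_text.partition(marker)
    return answer.strip() if found else generated_text.strip()


# Shortened, made-up sample in the same shape as the pipeline output.
sample = (
    "<s>[INST] <<SYS>>\nYou are a legal assistant.\n<</SYS>>\n\n"
    "Draft an employment contract. [/INST]  Sure, here is a sample employment contract"
)
print(extract_answer(sample))
```

This handles single-turn prompts like the ones used here; a multi-turn conversation with several `[/INST]` markers would need a different split.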

## Dataset Selection and Preparation for Model Training and Evaluation

## Data Formatting for Llama 2 Models

When working with Large Language Models (LLMs) like Llama 2, one of the critical aspects of data preparation is ensuring that training examples are formatted correctly. Proper formatting is essential to train the model effectively and to achieve desired performance in tasks such as text generation and conversation.

For the Llama 2 chat models, the authors provided a specific template for structuring the training examples. This template is designed to guide the model in generating coherent and contextually appropriate responses.

### Template Structure

The Llama 2 template is structured as follows:

```
<s>[INST] <<SYS>> System prompt <</SYS>>

User prompt [/INST] Model answer </s>
```

This structure consists of several components:

- **System Instructions (Optional)**: These are used to set the context or provide additional guidance to the model. They are not always required but can help in shaping the model's behavior.

- **User Instructions (Mandatory)**: This part contains the primary instruction or query that the model needs to respond to. It is essential for guiding the model in generating appropriate responses.

- **Additional Inputs (Optional)**: These may include any extra context or information that should be considered by the model while generating the response. The template has no dedicated slot for them, so any such context goes inside the user prompt.

- **Model Response (Mandatory)**: This is the expected output from the model. It provides the answer or continuation based on the user prompt and any additional instructions.

By adhering to this template, you ensure that the training data is consistently formatted, which helps the model learn and generalize better during the fine-tuning process. Proper data formatting is crucial for leveraging the full potential of Llama 2 and achieving high-quality results in real-world applications.
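As a minimal sketch of how one training example could be assembled in this template, consider the helper below. The function name `format_llama2_example` and the sample strings are hypothetical. The layout follows the multi-line form of the template shown earlier in this notebook, with the system block omitted when no system prompt is given.

```python
def format_llama2_example(user_prompt, model_answer, system_prompt=None):
    """Build a single-turn training example in the Llama 2 chat template."""
    if system_prompt:
        instruction = f"<<SYS>>\n{system_prompt}\n<</SYS>>\n\n{user_prompt}"
    else:
        instruction = user_prompt
    return f"<s>[INST] {instruction} [/INST] {model_answer} </s>"


example = format_llama2_example(
    user_prompt="List the key components of a residential lease agreement.",
    model_answer="A residential lease typically covers rent, duration, and maintenance.",
    system_prompt="You are a legal assistant specializing in real estate law.",
)
print(example)
```

If examples are stored one per line in a text file, as they are later in this notebook, the newlines inside the template would need to be dropped or escaped before writing.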


### Loading and Reading the Dataset from a File

This code snippet demonstrates how to read a text file line by line and store each line as an element in a list.

### Explanation:

1. **Dataset Path**:  
   The `dataset_path` variable stores the path to the text file, which in this case is `"/content/dataset.txt"`.

2. **Opening the File**:  
   The `with open()` statement is used to open the file in read-only mode (`'r'`). The `encoding='utf-8'` ensures that the file is read with UTF-8 encoding, which is important for handling a wide range of characters.

3. **Reading the File**:  
   Inside the `with` block, the `read().splitlines()` method reads the entire content of the file and splits it into individual lines. Each line is stored as a separate element in the list `examples`.

4. **Storing the Examples**:  
   The `examples` list now contains each line of the file as a separate string, so each training example can be processed individually.

The `with` block ensures that the file is properly closed after reading. Note that `read()` loads the whole file into memory at once, which is fine for a small dataset like this one but is not memory-efficient for very large files.
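For larger files, a streaming variant that yields one line at a time may be preferable. The sketch below is an alternative to the cell that follows, not what the notebook uses; the helper `iter_examples` and the temporary demo file are hypothetical.

```python
import os
import tempfile


def iter_examples(path, encoding="utf-8"):
    """Yield non-empty lines one at a time instead of loading the whole file."""
    with open(path, "r", encoding=encoding) as f:
        for line in f:
            line = line.rstrip("\r\n")
            if line.strip():
                yield line


# Demo on a throwaway file with made-up content, not the real dataset.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("example one\n\nexample two\n")
    tmp_path = tmp.name

loaded = list(iter_examples(tmp_path))
os.remove(tmp_path)
print(loaded)
```

Unlike `read().splitlines()`, this version also skips blank lines, which avoids empty training examples if the file contains stray line breaks.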

In [6]:
dataset_path = "/content/dataset.txt"

with open(dataset_path, 'r', encoding='utf-8') as f:
  examples = f.read().splitlines()

In [7]:
examples[1]

'<s>[INST] <<SYS>>You are a legal assistant specializing in intellectual property law. Your task is to draft a copyright license agreement that includes clauses for scope of use, duration, and payment terms.<</SYS>>Draft a copyright license agreement that includes scope of use, duration, and payment terms.</INST>Sure! Here is a template for a Copyright License Agreement: COPYRIGHT LICENSE AGREEMENT This Copyright License Agreement ("Agreement") is made between [Licensor Name], with an address at [Address], and [Licensee Name], with an address at [Address]. 1. Scope of Use: The Licensor grants the Licensee a non-exclusive, non-transferable license to use the copyrighted material titled [Title of Work] for [specific use]. 2. Duration: This license shall commence on the Effective Date and continue for [X years/months]. 3. Payment Terms: The Licensee agrees to pay the Licensor [Amount] as a one-time fee or [Specify Payment Schedule]. IN WITNESS WHEREOF, the parties have executed this Agree

This code snippet demonstrates how to create a `Dataset` object using the `datasets` library.

### Explanation:

1. **Importing Modules**:  
   The code begins by importing `Dataset` and `DatasetDict` from the `datasets` library, which is used to work with datasets.

2. **Creating the Dataset**:  
   The `Dataset.from_dict()` method is used to create a `Dataset` object from a dictionary. In this case, the dictionary has a key `"text"` and the value is a list of text lines. This method converts the dictionary into a structured `Dataset` format.

3. **Resulting Dataset**:  
   The resulting `Dataset` object contains each line of text from the list as a separate entry under the `"text"` column. This structured format facilitates further data processing and analysis.

In [8]:
from datasets import Dataset, DatasetDict

ds = Dataset.from_dict({"text": examples})

In [9]:
ds

Dataset({
    features: ['text'],
    num_rows: 43
})

This code snippet demonstrates how to access a specific entry in a `Dataset` object.

### Explanation:

- **Accessing an Entry**:  
  The code `ds["text"][2]` is used to retrieve the third entry (index `2`, since indexing is zero-based) from the `"text"` column of the `Dataset` object `ds`.

  - `ds["text"]` refers to the `"text"` column of the dataset.
  - `[2]` accesses the element at index `2` within that column.

In this context, `ds["text"][2]` would return the text content of the third row from the `"text"` column of the dataset, allowing you to inspect or manipulate specific data entries.

In [10]:
ds["text"][2]

'<s>[INST] <<SYS>>You are a legal assistant specializing in real estate law. Your task is to draft a lease agreement that includes clauses for rent, duration, and maintenance responsibilities.<</SYS>>Draft a lease agreement that includes rent, duration, and maintenance responsibilities.</INST>Sure! Here is a template for a Lease Agreement: LEASE AGREEMENT This Lease Agreement ("Agreement") is entered into between [Landlord Name], with an address at [Address], and [Tenant Name], with an address at [Address]. 1. Rent: The Tenant agrees to pay the Landlord a monthly rent of [Amount], due on the [Day] of each month. 2. Duration: This lease shall commence on [Start Date] and end on [End Date], unless extended or terminated earlier in accordance with this Agreement. 3. Maintenance Responsibilities: The Tenant shall be responsible for routine maintenance and repair of the leased premises, including [specific responsibilities]. The Landlord shall be responsible for major repairs not caused by 
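The sample rows shown above close the instruction with `</INST>`, whereas the template described earlier uses `[/INST]`. A quick check along the following lines could flag rows that deviate from the template before training. The helper `missing_markers` and the shortened sample strings are hypothetical.

```python
REQUIRED_MARKERS = ("<s>", "[INST]", "[/INST]", "</s>")


def missing_markers(example):
    """Return the template markers that do not appear in an example."""
    return [marker for marker in REQUIRED_MARKERS if marker not in example]


# Shortened, made-up rows in the same shape as the dataset entries.
rows = [
    "<s>[INST] <<SYS>>You are a legal assistant.<</SYS>>Draft a lease.</INST>Sure! Here is a template. </s>",
    "<s>[INST] <<SYS>>You are a legal assistant.<</SYS>>Draft a lease. [/INST] Sure! Here is a template. </s>",
]

report = {index: missing_markers(row) for index, row in enumerate(rows)}
print(report)
```

Running the same check over the full `examples` list would show how many rows need to be corrected.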

In [14]:
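def print_trainable_parameters(model):
    """Print the number of trainable parameters, the total, and the trainable share."""
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        # numel() counts stored elements. 4-bit quantized weights are stored packed,
        # so the total can come out lower than the nominal parameter count.
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param:.2f}")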

In [15]:
print_trainable_parameters(model)

trainable params: 262410240 || all params: 3500412928 || trainable%: 7.50


In [16]:
from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training

# Define LoRA configuration
lora_config = LoraConfig(
    r=16,  # Rank of the low-rank adaptation matrices
    lora_alpha=16,  # Scaling factor for LoRA; the low-rank update is scaled by lora_alpha / r
    lora_dropout=0.05,  # Dropout probability applied inside the LoRA layers, helps prevent overfitting
    bias="none",  # Specifies which bias parameters are trained. Options are "none", "all", or "lora_only"
    task_type="CAUSAL_LM",  # Type of task/model to which LoRA is applied, here it is for causal language models
    # target_modules=["q", "k", "v"],  # (Optional) Names of the modules LoRA should be applied to; they must match the module names used by the model.
)

### LoRA Configuration Parameters

The `LoraConfig` class is used to set up parameters for Low-Rank Adaptation (LoRA) in the model training. **LoRA is a technique to efficiently adapt pre-trained models to new tasks by injecting trainable low-rank matrices into the model.** Here is a breakdown of each parameter:

- **`r` (rank of the matrices)**:
  - **Description**: This parameter defines the rank of the low-rank matrices used in LoRA. It determines the size of the additional parameters that will be learned during training. **A larger `r` may capture more complex patterns but increases the number of parameters and computational cost.**
  - **Example**: `r = 16`

- **`lora_alpha` (LoRA scaling factor)**:
  - **Description**: This scaling factor adjusts the magnitude of the low-rank update, which is scaled by `lora_alpha / r`. It controls how much influence the LoRA parameters have relative to the original model parameters. With `r = 16` and `lora_alpha = 16`, the scale is 1; higher values increase the impact of the LoRA modifications.
  - **Example**: `lora_alpha = 16`

- **`lora_dropout` (regularization)**:
  - **Description**: This parameter specifies the dropout probability used in the LoRA layers. Dropout is a regularization technique that helps prevent overfitting by randomly zeroing a fraction of values during training. Here it is applied to the input of the low-rank path, so it regularizes only the LoRA-specific parameters.
  - **Example**: `lora_dropout = 0.05`

- **`bias`**:
  - **Description**: This parameter controls which bias parameters are trained. The options are `"none"`, `"all"`, or `"lora_only"`. Setting `bias="none"` means no bias terms are trained.
  - **Example**: `bias="none"`

- **`task_type` (type of task/model to apply it to)**:
  - **Description**: Specifies the type of task or model architecture to which LoRA is applied. For example, `"CAUSAL_LM"` indicates that LoRA is configured for a causal language model (such as GPT). This ensures that LoRA is applied appropriately based on the model's task.
  - **Example**: `task_type="CAUSAL_LM"`

- **`target_modules`** (commented out in the code):
  - **Description**: This optional parameter defines which specific modules of the model (e.g., the attention projections) will have LoRA matrices applied. If it is not specified, PEFT falls back to a default chosen for the model architecture. The names must match the module names in the model; Llama-style models name their attention projections `q_proj`, `k_proj`, and `v_proj`.
  - **Example**: `target_modules=["q", "k", "v"]` (commented out in the code)

By configuring these parameters, you can tailor the LoRA adaptation process to fit your specific needs and optimize the performance of the model on new tasks.
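The low-rank update itself can be illustrated with a small NumPy sketch. This is a toy illustration under stated assumptions, not the PEFT implementation: the function `lora_forward`, the toy dimensions, and the variable names are hypothetical, and dropout is omitted. It assumes the common initialization in which one of the two low-rank matrices starts at zero, so the adapted layer initially behaves exactly like the frozen one.

```python
import numpy as np


def lora_forward(x, W, A, B, lora_alpha, r):
    """Frozen base projection plus the scaled low-rank update."""
    base = x @ W.T              # frozen pre-trained weights, shape (d_out, d_in)
    update = (x @ A.T) @ B.T    # rank-r path: d_in -> r -> d_out
    return base + (lora_alpha / r) * update


rng = np.random.default_rng(0)
d_in, d_out, r, lora_alpha = 8, 6, 2, 2   # toy sizes, not the notebook's values

W = rng.normal(size=(d_out, d_in))   # frozen weight matrix
A = rng.normal(size=(r, d_in))       # trainable, randomly initialized
B = np.zeros((d_out, r))             # trainable, initialized to zero
x = rng.normal(size=(3, d_in))       # a batch of three input vectors

out_at_init = lora_forward(x, W, A, B, lora_alpha, r)
unchanged_at_init = np.allclose(out_at_init, x @ W.T)
extra_params = A.size + B.size

print("matches frozen layer at init:", unchanged_at_init)
print("trainable LoRA params:", extra_params, "vs frozen params:", W.size)
```

Only `A` and `B` would receive gradients during training, which is why the number of trainable parameters grows with `r` rather than with the size of `W`.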

In [17]:
# Apply the LoRA configuration to the model
model_peft = get_peft_model(model, lora_config)
# This line applies the LoRA configuration (defined by `lora_config`) to the existing model.
# `get_peft_model` integrates LoRA into the model, adjusting its parameters according to the specified configuration.

# Display the number of trainable parameters
model_peft.print_trainable_parameters()
# This line prints out the number of parameters in the model that are trainable under the current LoRA configuration.
# It helps to verify which and how many parameters are being modified and trained.

trainable params: 8,388,608 || all params: 3,508,801,536 || trainable%: 0.23907331075678143
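The trainable-parameter count reported above can be reproduced with a short calculation. The sketch assumes the usual Llama 2 7B shapes (hidden size 4096, 32 decoder layers) and that LoRA is attached to two square attention projections per layer; these values are assumptions rather than something read from the model here, and `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(r, d_in, d_out):
    """Parameters added by one LoRA pair: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


HIDDEN_SIZE = 4096        # assumed hidden size of Llama 2 7B
NUM_LAYERS = 32           # assumed number of decoder layers
TARGETS_PER_LAYER = 2     # assumed: two adapted projections per layer
RANK = 16                 # r from the LoRA configuration

total_lora_params = (
    NUM_LAYERS * TARGETS_PER_LAYER * lora_param_count(RANK, HIDDEN_SIZE, HIDDEN_SIZE)
)
print(f"{total_lora_params:,}")
```

Under these assumptions the result agrees with the 8,388,608 trainable parameters printed by `print_trainable_parameters()`.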


In [18]:
# Directory where the model predictions and checkpoints will be stored
output_dir = "/content/llama2-7b-chat-peft"
# Specifies the path where the model's predictions and checkpoint files will be saved.

# Number of epochs for training
num_train_epochs = 5
# Defines the number of times the entire training dataset will be passed through the model.

# Enable fp16/bf16 training (set bf16 to True with an A100)
fp16 = False
bf16 = False
# Indicates whether to use mixed precision (fp16 or bf16) for training to potentially speed up training and reduce memory usage.

# Batch size per GPU for training
per_device_train_batch_size = 4
# Specifies the number of samples per batch to be processed on each GPU during training.

# Batch size per GPU for evaluation
per_device_eval_batch_size = 4
# Specifies the number of samples per batch to be processed on each GPU during evaluation.

# Number of steps to accumulate gradients
gradient_accumulation_steps = 1
# Determines how many steps to accumulate gradients before performing an optimization step. Helps to manage memory usage with large batch sizes.

# Enable gradient checkpointing
gradient_checkpointing = True
# When enabled, gradient checkpointing reduces memory usage by saving only some intermediate results during forward pass and recomputing them during backward pass.

# Maximum gradient norm (gradient clipping)
max_grad_norm = 0.3
# Specifies the maximum allowed norm of the gradients; when the norm exceeds this value, the gradients are scaled down to prevent exploding gradients.

# Initial learning rate (AdamW optimizer)
learning_rate = 2e-4
# Sets the starting learning rate for the AdamW optimizer, which controls how much to adjust the model weights in each step.

# Weight decay to apply to all layers except bias/LayerNorm weights
weight_decay = 0.001
# Defines the regularization parameter that penalizes large weights to prevent overfitting, applied to all layers except biases and LayerNorm parameters.

# Optimizer to use
optim = "paged_adamw_32bit"
# Specifies the optimizer. "paged_adamw_32bit" is the bitsandbytes paged AdamW, which can move optimizer state to CPU memory when GPU memory runs short.

# Learning rate scheduler type
lr_scheduler_type = "cosine"
# Sets the type of learning rate scheduler. "cosine" gradually decreases the learning rate following a cosine function during training.

# Number of training steps (overrides num_train_epochs)
max_steps = -1
# If set to a positive integer, this value overrides `num_train_epochs` to define the total number of training steps. A value of -1 means no override.

# Proportion of steps for linear warmup (from 0 to learning rate)
warmup_ratio = 0.03
# Specifies the proportion of steps during which the learning rate will increase linearly from 0 to the initial learning rate.

# Group sequences into batches of similar length
# Saves memory and speeds up training significantly
group_by_length = True
# When enabled, sequences of similar lengths are grouped together into batches, reducing padding and improving training efficiency.

# Save checkpoint every X update steps
save_steps = 0
# Determines how often (in update steps) checkpoints are saved when saving by steps. The trainer below uses save_strategy="epoch", so this value is not what triggers saves.

# Log every X update steps
logging_steps = 25
# Specifies how often (in terms of update steps) training progress will be logged. Helps to monitor training performance and progress.
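
To make the interaction of `learning_rate`, `warmup_ratio`, and `lr_scheduler_type = "cosine"` concrete, here is a minimal sketch of a cosine schedule with linear warmup. The function name `cosine_lr` is hypothetical, and the total of 55 steps is an assumption taken from the `global_step` reported later in this notebook. The formula is the usual half-cosine decay and is not copied from the `transformers` source.

```python
import math

def cosine_lr(step, total_steps, base_lr=2e-4, warmup_ratio=0.03):
    """Learning rate at `step` for linear warmup followed by half-cosine decay."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase that has elapsed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

total_steps = 55  # assumed total number of optimizer steps
for step in (0, 1, 2, 25, 50, 55):
    print(step, f"{cosine_lr(step, total_steps):.2e}")
```

With so few total steps, the warmup phase covers only the first couple of updates, after which the rate decays smoothly toward zero.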


## Explanation of Training Configuration

In this section, we configure the training run using the `transformers` and `trl` libraries. The key components and their roles are:

### `TrainingArguments` Class

The `TrainingArguments` class from the `transformers` library is used to specify various training parameters for fine-tuning a model. The parameters include:

- **`output_dir`**: Directory where the model checkpoints and other outputs will be saved during training.

- **`num_train_epochs`**: Number of times the entire training dataset will be passed through the model. This controls the number of epochs (iterations over the whole dataset) the model will train for.

- **`per_device_train_batch_size`**: Number of training examples processed in a single batch on each device (e.g., GPU) during training.

- **`gradient_accumulation_steps`**: Number of steps over which gradients are accumulated before performing an optimization step, allowing for effectively larger batch sizes.

- **`optim`**: Optimizer to use for training, such as "paged_adamw_32bit". This updates the model parameters based on computed gradients.

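- **`save_steps`**: Frequency (in update steps) at which to save a checkpoint when saving by steps. Because `save_strategy` is set to `"epoch"` in this run, checkpoints are written at epoch boundaries and the value `0` does not trigger any step-based saves.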

- **`logging_steps`**: Frequency (in terms of update steps) to log the training progress.

- **`learning_rate`**: Initial learning rate for the optimizer, controlling how much to adjust the model parameters in response to the estimated error.

- **`weight_decay`**: Weight decay regularization to prevent overfitting by penalizing large weights.

- **`fp16`** and **`bf16`**: Enable mixed precision training using `fp16` (half-precision floating-point) or `bf16` (bfloat16) to reduce memory usage and accelerate training.

- **`max_grad_norm`**: Maximum norm for gradient clipping to prevent exploding gradients. Gradients are scaled if their norm exceeds this value.

- **`max_steps`**: Total number of training steps, overriding `num_train_epochs` if set to a positive integer.

- **`warmup_ratio`**: Proportion of training steps used for a linear warmup of the learning rate, stabilizing training in the beginning.

- **`group_by_length`**: Group sequences of similar lengths into batches to save memory and improve efficiency.

- **`lr_scheduler_type`**: Type of learning rate scheduler, such as "cosine", which adjusts the learning rate during training.

- **`remove_unused_columns`**: Remove columns from the dataset that are not used for training.

- **`save_strategy`**: Strategy for saving model checkpoints, such as "epoch", meaning checkpoints are saved at the end of each epoch.

- **`save_total_limit`**: Maximum number of checkpoints to keep. Older checkpoints will be deleted when this limit is exceeded.
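
The `max_grad_norm` entry above describes clipping by global norm. A minimal pure-Python sketch of that rule follows. `clip_by_global_norm` is a hypothetical helper that works on a flat list of numbers, whereas the trainer applies the equivalent operation to tensors.

```python
import math

def clip_by_global_norm(grads, max_norm=0.3, eps=1e-6):
    """Scale all gradients by one factor so their combined L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # The factor is capped at 1.0, so small gradients pass through unchanged.
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm

clipped, norm_before = clip_by_global_norm([0.3, -0.4])  # norm 0.5 exceeds 0.3
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before, norm_after)
```

Because every gradient is multiplied by the same factor, clipping changes the size of the update but not its direction.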

### `SFTTrainer` Class

The `SFTTrainer` class from the `trl` (Transformer Reinforcement Learning) library handles supervised fine-tuning of language models, here a causal (decoder-only) model. Key parameters include:

- **`model`**: The model to be trained, typically a pre-trained model that will be fine-tuned on the dataset.

- **`train_dataset`**: Dataset used for training the model, containing the training examples the model will learn from.

- **`peft_config`**: Configuration for Parameter-Efficient Fine-Tuning (PEFT), such as LoRA (Low-Rank Adaptation), specifying how to adapt the model parameters during training.

- **`dataset_text_field`**: Field in the dataset that contains the text data for training.

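- **`max_seq_length`**: Maximum length of sequences considered during training; longer examples are truncated. If `None`, the trainer falls back to a library default (in the TRL releases that accept this argument, the smaller of the tokenizer's `model_max_length` and 1024).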

- **`tokenizer`**: Tokenizer used to preprocess the input data into the format required by the model.

- **`args`**: Training arguments specifying various training parameters.

- **`packing`**: Whether to pack multiple short examples into the same input sequence to increase training efficiency.

This setup ensures that the training process is well-defined and controlled, facilitating efficient and effective fine-tuning of the model.
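
As a rough illustration of what `packing` refers to, the sketch below concatenates tokenized examples, each followed by an end-of-sequence id, and slices the stream into fixed-length blocks. `pack_examples`, the token ids, and the block length are all made up for illustration. This notebook sets `packing=False`, so each example stays in its own sequence.

```python
def pack_examples(tokenized_examples, block_size, eos_id):
    """Concatenate examples (each followed by eos_id) and cut into full blocks."""
    stream = []
    for ids in tokenized_examples:
        stream.extend(ids)
        stream.append(eos_id)
    # Keep only complete blocks; the trailing remainder is dropped in this sketch.
    n_blocks = len(stream) // block_size
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]

examples = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]  # made-up token ids
blocks = pack_examples(examples, block_size=4, eos_id=2)
print(blocks)
```

Packing removes padding entirely, at the cost of examples sometimes being split across block boundaries.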


In [19]:
from transformers import TrainingArguments
from trl import SFTTrainer

# Set training parameters
training_arguments = TrainingArguments(
    output_dir=output_dir,  # Directory where the model checkpoints and outputs will be saved.
    num_train_epochs=num_train_epochs,  # Number of times to pass the entire training dataset through the model.
    per_device_train_batch_size=per_device_train_batch_size,  # Number of samples per batch processed on each GPU during training.
    gradient_accumulation_steps=gradient_accumulation_steps,  # Number of steps to accumulate gradients before performing an optimization step.
    optim=optim,  # Optimizer algorithm to use for training (e.g., "paged_adamw_32bit").
    save_steps=save_steps,  # Number of update steps before saving a checkpoint. A value of 0 means checkpoints are not saved based on steps.
    logging_steps=logging_steps,  # Number of update steps before logging training progress.
    learning_rate=learning_rate,  # Initial learning rate for the optimizer.
    weight_decay=weight_decay,  # Regularization parameter for weight decay to prevent overfitting.
    fp16=fp16,  # Enable mixed precision training with fp16 (False means not enabled).
    bf16=bf16,  # Enable mixed precision training with bf16 (False means not enabled).
    max_grad_norm=max_grad_norm,  # Maximum gradient norm for gradient clipping to prevent exploding gradients.
    max_steps=max_steps,  # Total number of training steps. Overrides num_train_epochs if set to a positive integer.
    warmup_ratio=warmup_ratio,  # Proportion of training steps for linear warmup of the learning rate.
    group_by_length=group_by_length,  # Group sequences of similar length into batches to save memory and improve training efficiency.
    lr_scheduler_type=lr_scheduler_type,  # Type of learning rate scheduler (e.g., "cosine").
    remove_unused_columns=False,  # Whether to remove unused columns from the dataset. False means columns are not removed.
    save_strategy="epoch",  # Strategy for saving checkpoints. "epoch" means save checkpoints at the end of each epoch.
    save_total_limit=2  # Maximum number of checkpoints to keep. Older checkpoints will be deleted when this limit is exceeded.
)

# Create the training instance
trainer = SFTTrainer(
    model=model,  # The model to be trained.
    train_dataset=ds,  # Dataset to be used for training.
    peft_config=lora_config,  # Configuration for LoRA (Low-Rank Adaptation).
    dataset_text_field="text",  # Field in the dataset containing the text data.
    max_seq_length=None,  # Maximum sequence length. None falls back to the library default cap (in TRL releases that accept this argument, min of tokenizer.model_max_length and 1024).
    tokenizer=tokenizer,  # Tokenizer used to process the input data.
    args=training_arguments,  # Training arguments specifying the training parameters.
    packing=False,  # Whether to pack multiple short examples into the same input sequence for efficiency.
)



Map:   0%|          | 0/43 [00:00<?, ? examples/s]

## Starting the Training Process

The training process is initiated by calling the `train()` method on the `SFTTrainer` instance.

In [20]:
# Start the training
trainer.train()

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss |
|------|---------------|
| 25   | 1.095         |
| 50   | 0.402         |




TrainOutput(global_step=55, training_loss=0.7104978127913042, metrics={'train_runtime': 214.7012, 'train_samples_per_second': 1.001, 'train_steps_per_second': 0.256, 'total_flos': 1178895162654720.0, 'train_loss': 0.7104978127913042, 'epoch': 5.0})

Training ran for 55 optimizer steps in total across all five epochs. With 43 examples and a batch size of 4, that is 11 steps per epoch. Each step processes one mini-batch and updates the trainable adapter parameters.

With `logging_steps = 25`, the loss was logged twice. At step 25, which falls in the third epoch, the logged training loss was 1.095. The trainer reports the loss averaged over the steps since the previous log, so this figure summarizes steps 1 to 25. A higher loss means the model's predicted tokens were further from the reference answers.

At step 50, which falls in the fifth epoch, the logged training loss had dropped to 0.402. The decrease from 1.095 to 0.402 shows that the model was fitting the training data progressively better. The `training_loss` of 0.710 in the `TrainOutput` is the average over all 55 steps, not the final value.

These figures are training loss only. No evaluation dataset was passed to the trainer, so they show that the model is learning the training examples but do not by themselves demonstrate that it generalizes to unseen legal queries.
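
The step counts in the `TrainOutput` above can be reproduced from the configuration. This sketch assumes 43 training examples (the total shown in the `Map` progress bar), a single GPU, and that the last partial batch of each epoch is kept.

```python
import math

num_examples = 43        # from the "Map: 0/43" progress bar
per_device_batch = 4
grad_accum = 1
num_epochs = 5
num_gpus = 1             # assumption

effective_batch = per_device_batch * grad_accum * num_gpus
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)

# Epoch (1-based) in which each logged step falls.
logged_epochs = {step: math.ceil(step / steps_per_epoch) for step in (25, 50)}
print(logged_epochs)
```

The total matches `global_step=55`, which confirms that the 55 steps span the whole run rather than a single epoch.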

### Explanation

- **Training Initiation**: This step starts the training loop for the model. It involves:
  - **Data Handling**: The `SFTTrainer` manages the loading and batching of the training data.
  - **Model Updates**: It updates the model's weights based on the gradients computed during each training step.
  - **Checkpoints and Logging**: The process includes saving model checkpoints and logging training progress according to the predefined settings.

This process is essential for fine-tuning the model on the specific task or dataset, allowing it to learn and adapt to improve its performance.

## Saving the Tokenizer

After training, it's important to save the tokenizer for future use. This step ensures that the tokenizer, which converts text into tokens for the model and vice versa, is preserved.

In [21]:
# We save the tokenizer on disk for later use
tokenizer.save_pretrained(f"{output_dir}/tokenizer")

('/content/llama2-7b-chat-peft/tokenizer/tokenizer_config.json',
 '/content/llama2-7b-chat-peft/tokenizer/special_tokens_map.json',
 '/content/llama2-7b-chat-peft/tokenizer/tokenizer.model',
 '/content/llama2-7b-chat-peft/tokenizer/added_tokens.json',
 '/content/llama2-7b-chat-peft/tokenizer/tokenizer.json')

### Explanation

- **Purpose**: Saving the tokenizer allows you to reuse it for making predictions or further fine-tuning in the future. This ensures consistency in how text is tokenized and detokenized.
- **Location**: The tokenizer is saved to the `tokenizer` subdirectory of `output_dir`, alongside the checkpoints and other training artifacts.
- **Method**: The `save_pretrained` method writes the tokenizer's configuration and vocabulary to disk, making it available for later use.

This step is crucial for maintaining the integrity of the model's text processing capabilities and for streamlining the deployment or further training processes.

In [22]:
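from transformers import pipeline

# Text-generation pipeline over the fine-tuned model; max_length caps prompt plus generated tokens at 500
llama2_ft_pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=500)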

In [23]:
prompt = """<<SYS>>
You are a legal assistant specializing in intellectual property law. Your task is to draft a copyright license agreement that includes clauses for scope of use, duration, and payment terms.
<</SYS>>

Draft a copyright license agreement that includes scope of use, duration, and payment terms."""

prompt_template = f"<s>[INST] {prompt} [/INST]"

output = llama2_ft_pipe(prompt_template)
print(output[0]['generated_text'].replace("[/INST]", "[/INST]\n\n"))



<s>[INST] <<SYS>>
You are a legal assistant specializing in intellectual property law. Your task is to draft a copyright license agreement that includes clauses for scope of use, duration, and payment terms.
<</SYS>>

Draft a copyright license agreement that includes scope of use, duration, and payment terms. [/INST]

  Sure! Here is a template for a Copyright License Agreement: 1. Scope of Use: The Licensee is granted a non-exclusive license to use the copyrighted material for [insert purpose, e.g., "commercial use"]. 2. Duration: The license is valid for [insert duration, e.g., "three years"]. 3. Payment Terms: The Licensee agrees to pay [insert amount] to the Licensor as a license fee. IN WITNESS WHEREOF, this Copyright License Agreement has been executed as of the Effective Date.</s> IN WITNESS WHEREOF, this Copyright License Agreement has been executed as of the Effective Date.</s> ENDINSTELLAW You are a legal assistant specializing in intellectual property law. Your task is to dr

In [24]:
prompt = """<<SYS>>
You are a legal assistant specializing in real estate law. Your task is to draft a lease agreement that includes clauses for rent, duration, and maintenance responsibilities.
<</SYS>>

Draft a lease agreement that includes rent, duration, and maintenance responsibilities.."""

prompt_template = f"<s>[INST] {prompt} [/INST]"

output = llama2_ft_pipe(prompt_template)
print(output[0]['generated_text'].replace("[/INST]", "[/INST]\n\n"))

<s>[INST] <<SYS>>
You are a legal assistant specializing in real estate law. Your task is to draft a lease agreement that includes clauses for rent, duration, and maintenance responsibilities.
<</SYS>>

Draft a lease agreement that includes rent, duration, and maintenance responsibilities.. [/INST]

  Sure! Here is a template for a Lease Agreement: LEASE AGREEMENT This Lease Agreement ("Agreement") is made between [Landlord Name], with an address at [Address], and [Tenant Name], with an address at [Address]. 1. Rent: The Tenant agrees to pay the Landlord a monthly rent of [Amount], due on the [Day] of each month. 2. Duration: This lease is for a term of [Number] months, starting from [Effective Date]. 3. Maintenance Responsibilities: The Tenant is responsible for maintaining the property, including [specific maintenance tasks]. The Landlord is responsible for [specific maintenance tasks]. IN WITNESS WHEREOF, the parties have executed this Agreement as of the Effective Date.</s> Sure! H
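
Both generations above run past the first `</s>` and begin repeating. A small helper can build the prompt in the training format and trim the decoded text at the first end-of-sequence marker. `build_prompt` and `extract_answer` are hypothetical names introduced here and are not part of `transformers`. This is only a sketch: it changes what is displayed, not what the model generates.

```python
def build_prompt(system_prompt, user_prompt):
    """Format a single-turn prompt in the Llama-2 chat layout used by the dataset."""
    return f"<s>[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n{user_prompt} [/INST]"

def extract_answer(generated_text):
    """Return the text after the first [/INST], cut at the first </s> if present."""
    answer = generated_text.split("[/INST]", 1)[-1]
    return answer.split("</s>", 1)[0].strip()

prompt = build_prompt(
    "You are a legal assistant specializing in real estate law.",
    "Draft a lease agreement that includes rent, duration, and maintenance responsibilities.",
)
# Stand-in for pipeline output: the prompt followed by a made-up completion.
fake_output = prompt + "  Sure! Here is a template.</s> Sure! H"
print(extract_answer(fake_output))
```

In the notebook, the same function could be applied to `output[0]['generated_text']` before printing.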

In [None]:
!pip install sentencepiece

## Loading and Merging the Model with PEFT Adapters

This section reloads the base model in `bfloat16`, attaches the LoRA adapters saved during training, and merges them into the base weights. The result is a standalone model that can be used for inference without the `peft` wrapper.


In [None]:
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Define the model name and the path to the PEFT adapters
model_name = "NousResearch/Llama-2-7b-chat-hf"
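# The run above ended at global_step=55 with per-epoch saves, so the final
# checkpoint is expected at checkpoint-55; adjust this if a different checkpoint is on disk.
adapters_name = "/content/llama2-7b-chat-peft/checkpoint-55"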

print(f"Loading model: '{model_name}' into memory...")

# Load the pre-trained model for causal language modeling
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    # Uncomment the following line to load the model in 4-bit precision (requires specific setup)
    # load_in_4bit=True,
    torch_dtype=torch.bfloat16,  # Set the data type to bfloat16 for reduced memory usage
    device_map={"": 0}  # Map the model to the first GPU (device 0)
)

# Load the PEFT adapters and integrate them with the pre-trained model
model = PeftModel.from_pretrained(model, adapters_name)
# Fold the adapter weights into the base weights and return the plain model without the PEFT wrapper
model = model.merge_and_unload()

print(f"Model: '{model_name}' has been loaded successfully")


### Explanation

1. **Importing Libraries**:
   - `torch`: The core PyTorch library used for tensor computations and model operations.
   - `PeftModel` from `peft`: A class used to manage models with Parameter Efficient Fine-Tuning adapters.
   - `AutoModelForCausalLM` from `transformers`: A class to load pre-trained language models for causal language modeling tasks.

2. **Model and Adapters Path**:
   - `model_name`: Specifies the name of the pre-trained model to load.
   - `adapters_name`: Path to the PEFT adapters that are applied to the pre-trained model.

3. **Loading the Model**:
   - The model is loaded from the specified `model_name` using `AutoModelForCausalLM`. The model is set to use `torch.bfloat16` for its data type and is mapped to the available GPU (device 0).

4. **Applying PEFT Adapters**:
   - `PeftModel.from_pretrained` loads the PEFT adapters from the given path and integrates them with the pre-trained model.
   - `model.merge_and_unload()` folds the adapter weights into the base model's weights and returns the underlying model without the PEFT wrapper, so inference no longer needs the separate adapter layers.

5. **Confirmation**:
   - After successfully loading and merging the model and adapters, a confirmation message is printed.

This process yields a single model that already contains the fine-tuned adapter weights, ready for inference on legal-assistance prompts or as a starting point for further fine-tuning.
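
What merging does numerically can be shown on a toy example. For a LoRA layer, the adapted weight is W + (alpha / r) * B * A, and merging computes that sum once so that inference needs a single matrix. The sketch below uses plain Python lists with made-up 2x2 values and hypothetical helper names. It illustrates the arithmetic only and is not the `peft` implementation.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def scale(a, s):
    """Multiply every entry of a matrix by the scalar s."""
    return [[x * s for x in row] for row in a]

# Made-up values: W is 2x2 and the rank is 1, so B is 2x1 and A is 1x2.
W = [[1.0, 2.0], [3.0, 4.0]]
B = [[0.5], [-1.0]]
A = [[2.0, 0.0]]
alpha, r = 2, 1

x = [[1.0], [1.0]]  # input column vector

# Unmerged: base output plus the scaled adapter output.
unmerged = add(matmul(W, x), scale(matmul(B, matmul(A, x)), alpha / r))

# Merged: fold the adapter into the weight once, then use a single matmul.
W_merged = add(W, scale(matmul(B, A), alpha / r))
merged = matmul(W_merged, x)

print(unmerged, merged)
```

Both paths give the same output, which is why the merged model can drop the adapter layers without changing its predictions.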