<h1>
Instruct Tune
</h1>


# Brief Recap

**Instruction tuning** is a technique used to enhance the performance of large language models (LLMs) by fine-tuning them on datasets containing instruction-output pairs. This process aims to improve the model's ability to understand and follow natural language instructions, making it more versatile and capable of performing a wide range of tasks without extensive prompt engineering.

Instruction tuning was introduced in 2021 by Google Research. Their influential paper, "Finetuned Language Models are Zero-Shot Learners," presented instruction tuning as a way to improve the zero-shot performance of large language models (LLMs) on unseen tasks described through natural language instructions.


# Architecture

<img src='assets/arch.png' width=500>

This flowchart illustrates the complete instruction tuning pipeline and the inference process. Its components are:

* **Pre-trained Language Model:**
The starting point is a base language model that has already been trained on large amounts of text data. This model has general language understanding capabilities but hasn't been specifically optimized for following instructions.

* **Fine-tuning on Instruction Dataset:**
This stage involves training the base model on carefully curated instruction-response pairs. The model learns to understand and respond to specific instructions through this process, adapting its weights to better handle directed tasks.

* **Instruction-tuned Model:**
The resulting model after the fine-tuning process is now specialized in understanding and responding to instructions. It maintains the base knowledge from pre-training but has enhanced capabilities for following specific directives.

* **User Prompt:**
This represents the actual input from users - questions, instructions, or tasks that they want the model to process. It's separate from the training process and represents the real-world usage of the model.

* **Inference:**
This is the stage where the instruction-tuned model processes the user prompt. The model applies what it learned during pre-training and fine-tuning to interpret the prompt and generate a response token by token.

* **Generated Response:**
The final output produced by the model in response to the user prompt. This represents the model's attempt to fulfill the given instruction or answer the question based on its training.

The flowchart effectively shows how the model evolves from a general-purpose language model to a specialized instruction-following system, and how it ultimately processes user inputs to generate appropriate responses.




# Use cases

Instruction tuning has numerous practical applications across various industries. Here are the key use cases:

* **Virtual Assistance**
  - Customer service chatbots with enhanced understanding of user queries
  - Personalized virtual assistants for task automation
  - Real-time support systems with improved response accuracy

* **Educational Technology**

  - Personalized tutoring systems
  - Interactive learning tools with real-time feedback
  - Adaptive educational content delivery

* **Sales and Business**

  - Real-time suggestions for sales representatives
  - Conversation guidance during customer calls
  - Objection handling assistance

* **Financial Services**

  - Personalized investment recommendations
  - Risk assessment analysis
  - Financial planning assistance

* **Healthcare**
  - Diagnostic assistance for healthcare professionals
  - Analysis of patient data and symptoms
  - Medical literature recommendations

* **Software Development**

  - Code review automation
  - Bug detection and quality assessment
  - Performance optimization suggestions
  - Best practices recommendations

* **Content Creation**
  - Text summarization
  - Document analysis
  - Content generation for specific domains

The versatility of instruction tuning makes it particularly valuable in scenarios requiring specialized knowledge and natural language understanding, while maintaining the ability to adapt to new contexts and requirements.


# Instruction Tune Implementation

### **Methodology**
* **Dataset Preparation**: Construct a dataset of diverse instruction-output pairs, either manually or through automated generation.
* **Input Format**: Each training sample typically consists of three elements:
  * **Instruction**: The task description
  * **Optional input**: Supplementary context information
  * **Anticipated output**: The desired response
* **Training Objective**: The model is trained to predict each token in the output sequence given the instruction and input.
* **Optimization**: The model's parameters are adjusted using techniques like gradient descent to minimize the difference between predicted and target outputs.
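
The training objective above can be illustrated with a minimal, framework-free sketch. It assumes a toy whitespace tokenizer and the common convention of labelling prompt positions with `-100` so that only output tokens contribute to the loss. The helper names (`build_example`, `masked_cross_entropy`) are hypothetical and the per-token probabilities are made up for illustration.

```python
import math

IGNORE_INDEX = -100  # assumed convention: positions with this label are skipped


def build_example(instruction, input_text, output):
    """Return (tokens, labels) where labels mask out the prompt tokens."""
    prompt = (
        f"### Instruction:\n{instruction}\n\n"
        f"### Input:\n{input_text}\n\n"
        "### Response:\n"
    )
    prompt_tokens = prompt.split()  # toy whitespace tokenizer
    output_tokens = output.split()
    tokens = prompt_tokens + output_tokens
    labels = [IGNORE_INDEX] * len(prompt_tokens) + output_tokens
    return tokens, labels


def masked_cross_entropy(token_probs, labels):
    """Average negative log-likelihood over unmasked positions only.

    token_probs[i] is the probability the model assigned to the correct
    token at position i (a stand-in for a real model's softmax output).
    """
    losses = [
        -math.log(p)
        for p, label in zip(token_probs, labels)
        if label != IGNORE_INDEX
    ]
    return sum(losses) / len(losses)


tokens, labels = build_example("Translate English to French", "Hello", "Bonjour")
# Pretend the model gives probability 0.5 to every correct token.
loss = masked_cross_entropy([0.5] * len(tokens), labels)
print(len(tokens), labels.count(IGNORE_INDEX), round(loss, 4))
```

Masking the prompt is a design choice rather than a requirement: training on the full sequence also works, but the loss then rewards the model for reproducing the instruction text as well as the answer.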


In [None]:
import tensorflow as tf

# Define a small dummy dataset of instruction / input / output triples
dummy_data = [
    {
        "instruction": "Translate English to French",
        "input": "Hello, how are you?",
        "output": "Bonjour, comment ça va?"
    },
    {
        "instruction": "Summarize the following text",
        "input": "Artificial intelligence is transforming technology.",
        "output": "AI is changing tech."
    },
    {
        "instruction": "Convert temperature from Celsius to Fahrenheit",
        "input": "25 C",
        "output": "77 F"
    }
]

# Implementation of dataset preparation
def prepare_instruction_data(data):
    formatted_examples = []
    for item in data:
        prompt = f"### Instruction:\n{item['instruction']}\n\n"
        prompt += f"### Input:\n{item['input']}\n\n"
        prompt += f"### Response:\n{item['output']}"
        
        formatted_examples.append({
            "text": prompt,
            "labels": item['output']
        })
    return formatted_examples

# Training setup using TensorFlow
def create_training_batch(examples, tokenizer):
    # Tokenize the examples
    encoded = tokenizer(
        [ex["text"] for ex in examples],
        padding=True,
        truncation=True,
        return_tensors="tf"
    )
    
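    # Use the input ids as labels, but set padding positions to -100
    # so they are ignored by the loss.
    labels = tf.where(encoded["attention_mask"] == 1, encoded["input_ids"], -100)

    return {
        "input_ids": encoded["input_ids"],
        "attention_mask": encoded["attention_mask"],
        "labels": labels
    }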

# Inference function for TensorFlow
def generate_response(instruction, input_text, model, tokenizer):
    prompt = f"### Instruction:\n{instruction}\n\n### Input:\n{input_text}\n\n### Response:\n"
    
    inputs = tokenizer(prompt, return_tensors="tf", padding=True)
    
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=128,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
        num_return_sequences=1,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
        no_repeat_ngram_size=3
    )
    
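    # Decode only the newly generated tokens, not the echoed prompt
    prompt_length = inputs["input_ids"].shape[1]
    return tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)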


In [None]:
import tensorflow as tf
from transformers import TFAutoModelForCausalLM, AutoTokenizer

# Initialize model and tokenizer
model_name = "gpt2"  # Using GPT-2 as an example
tokenizer = AutoTokenizer.from_pretrained(model_name)

# GPT-2 has no padding token, so reuse the end-of-sequence token
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"  # Left-pad so generation continues right after the prompt

model = TFAutoModelForCausalLM.from_pretrained(model_name)
model.resize_token_embeddings(len(tokenizer))

# Prepare the dataset
formatted_data = prepare_instruction_data(dummy_data)

# Create a training batch
batch = create_training_batch(formatted_data, tokenizer)

examples = [
    {
        "instruction": "Translate English to French",
        "input": "Good morning!"
    },
    {
        "instruction": "Summarize the following text",
        "input": "The weather today is sunny with clear skies and mild temperatures."
    },
    {
        "instruction": "Convert temperature from Celsius to Fahrenheit",
        "input": "30 C"
    }
]

# Example of inference
instruction = "Translate English to French"
input_text = "Good morning!"
response = generate_response(instruction, input_text, model, tokenizer)
print(f"Generated Response: {response}")

# Generate responses for each example
for example in examples:
    response = generate_response(
        example["instruction"],
        example["input"],
        model,
        tokenizer
    )
    print(f"\nInstruction: {example['instruction']}")
    print(f"Input: {example['input']}")
    print(f"Generated Response: {response}")


**Reflections**

* The model failed to generate a response in both test cases:
  * **Translation task**: No response generated
  * **Summarization task**: No response generated
* This suggests that, although the model's training loss improved significantly, the generation phase has issues that need to be addressed:
  * The model might need more training data
  * Generation parameters may need adjustment
  * The connection between training and inference might be broken
* To fix this, we should:
  * Increase the training dataset size
  * Add more diverse examples to the training data
  * Adjust generation parameters (temperature, top_p, etc.)
  * Verify the inference pipeline is properly connected to the trained model

**Release Memory**

In [None]:
# Delete models and variables
del model
del batch

# Clear Keras/TF session
tf.keras.backend.clear_session()

# Garbage collection
import gc
gc.collect()


# French to English Translation using LoRA

#### **Dataset**
We will use the MTNT (Machine Translation of Noisy Text) dataset, which is available from TensorFlow Datasets. MTNT is a collection of Reddit comments in English, French, and Japanese, translated to and from English. What sets this dataset apart is that the text is "noisy": it contains typos, grammar errors, code switching, and more.

In this example, we will use the **French-to-English** portion of the dataset.

#### **Initial Setup**
* Install and import all the libraries we need. We'll be using the KerasHub library.

* Set the precision to bfloat16. This helps reduce memory usage and training time.

* Ensure that `KAGGLE_USERNAME` and `KAGGLE_KEY` are correctly configured to access the Gemma model.






In [None]:
import gc
import os

os.environ["KERAS_BACKEND"] = "tensorflow"
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"  # Suppress verbose logging from TF

os.environ["KAGGLE_USERNAME"] = "YOUR USERNAME"
os.environ["KAGGLE_KEY"] = "YOUR API KEY"

import keras
import keras_hub
import tensorflow as tf
import tensorflow_datasets as tfds

keras.config.set_dtype_policy("bfloat16")

In [None]:
# Downloading dataset
train_ds = tfds.load("mtnt/fr-en", split="train")

We can print some samples. Each sample in the dataset contains two entries:

- src: the original French sentence.
- dst: the corresponding English translation.

In [None]:
examples = train_ds.take(3)
examples = examples.as_numpy_iterator()

for idx, example in enumerate(examples):
    print(f"Example {idx}:")
    for key, val in example.items():
        print(f"{key}: {val}")
    print()

Since we will fine-tune our model to perform a French-to-English translation
task, we should format the inputs for instruction tuning. For example, we can
format each translation pair like this:

```
<start_of_turn>user
Translate French into English:
{src}<end_of_turn>
<start_of_turn>model
Translation:
{dst}<end_of_turn>
```

The special tokens such as `<start_of_turn>user`, `<start_of_turn>model` and
`<end_of_turn>` are used for Gemma models. You can learn more from
https://ai.google.dev/gemma/docs/formatting
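
The training text and the inference prompt need to share exactly the same prefix, otherwise the model is prompted with a format it was never tuned on. As a small sketch, the hypothetical helper `format_example` below builds both from one function in plain Python. Its strings mirror the ones used in this notebook's mapping and generation cells, including the `Translation:` line; the sample sentences are made up.

```python
def format_example(src, dst=None):
    """Build a Gemma-style turn; omit `dst` to get the inference prompt."""
    prompt = (
        "<start_of_turn>user\n"
        "Translate French into English:\n"
        f"{src}<end_of_turn>\n"
        "<start_of_turn>model\n"
        "Translation:\n"
    )
    if dst is None:
        return prompt
    # Training text: the prompt followed by the target and the closing token.
    return prompt + f"{dst}<end_of_turn>"


training_text = format_example("Bonjour !", "Hello!")
inference_prompt = format_example("Bonjour !")
print(training_text)
```

Because the inference prompt is a strict prefix of the training text, the model only has to continue from where its training examples placed the translation.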

In [None]:
train_ds = train_ds.map(
    lambda x: tf.strings.join(
        [
            "<start_of_turn>user\n",
            "Translate French into English:\n",
            x["src"],
            "<end_of_turn>\n",
            "<start_of_turn>model\n",
            "Translation:\n",
            x["dst"],
            "<end_of_turn>",
        ]
    )
)

train_ds = train_ds.prefetch(buffer_size=tf.data.AUTOTUNE) # Add prefetch with AUTOTUNE

examples = train_ds.take(3)
examples = examples.as_numpy_iterator()

for idx, example in enumerate(examples):
    print(f"Example {idx}:")
    print(example)
    print()

We will take a subset of the dataset for the purpose of this example.

In [None]:
train_ds = train_ds.batch(1).take(100)



KerasHub provides implementations of many popular model architectures.
In this example, we will use `GemmaCausalLM`, an end-to-end Gemma model for
causal language modeling. A causal language model predicts the next token based
on previous tokens.

Note that `sequence_length` is set to `256` to speed up the fitting.

In [None]:
preprocessor = keras_hub.models.GemmaCausalLMPreprocessor.from_preset(
    "gemma_1.1_instruct_2b_en", sequence_length=256
)
gemma_lm = keras_hub.models.GemmaCausalLM.from_preset(
    "gemma_1.1_instruct_2b_en", preprocessor=preprocessor
)
gemma_lm.summary()

## LoRA Instruct-tuning

<img src='assets/lora.png' width=500>

### What exactly is LoRA?

Low-rank adaptation (LoRA) is a parameter-efficient fine-tuning technique for
LLMs. It freezes the weights of the LLM, and injects trainable
rank-decomposition matrices. Let's understand this more clearly.

Assume we have an `n x n` pre-trained dense layer (or weight matrix), `W0`. We
initialize two dense layers, `A` and `B`, with kernels of shapes `n x rank` and
`rank x n`, respectively, so that `A` projects the input down to `rank`
dimensions and `B` projects it back up to `n`. `rank` is much smaller than `n`.
In the paper, values between 1 and 4 are shown to work well.

### LoRA equation

The original equation is `output = W0x + b0`, where `x` is the input, `W0` and
`b0` are the weight matrix and bias terms of the original dense layer (frozen).
The LoRA equation is: `output = W0x + b0 + BAx`, where `A` and `B` are the
rank-decomposition matrices.
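
The equation can be checked numerically with a small NumPy sketch. It is written in the column-vector convention of the equation, so here `A` is stored as `rank x n` and `B` as `n x rank` (the transpose of the kernel shapes quoted earlier). The sizes, the random values, and the zero initialization of `B` (the scheme described in the LoRA paper) are assumptions for illustration, not what KerasHub does internally.

```python
import numpy as np

rng = np.random.default_rng(0)
n, rank = 8, 2

W0 = rng.normal(size=(n, n))    # frozen pre-trained weights
b0 = rng.normal(size=n)         # frozen bias
A = rng.normal(size=(rank, n))  # trainable, random init
B = np.zeros((n, rank))         # trainable, zero init so BA starts at 0
x = rng.normal(size=n)


def dense(x):
    """Original layer: W0 x + b0."""
    return W0 @ x + b0


def lora_dense(x):
    """LoRA layer: W0 x + b0 + B A x (reads the current global B)."""
    return W0 @ x + b0 + B @ (A @ x)


# With B = 0 the LoRA layer reproduces the original layer.
print(np.allclose(dense(x), lora_dense(x)))

# After an (illustrative) update to B, the output shifts by a low-rank term.
B = rng.normal(size=(n, rank))
delta = B @ A
print(np.linalg.matrix_rank(delta))
```

The product `B @ A` is a full `n x n` matrix, but its rank can never exceed `rank`, which is what "low-rank update" means in practice.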

LoRA is based on the idea that updates to the weights of the pre-trained
language model have a low "intrinsic rank" since pre-trained language models are
over-parametrized. Predictive performance of full fine-tuning can be replicated
even by constraining `W0`'s updates to low-rank decomposition matrices.

### Number of trainable parameters

Let's do some quick math. Suppose `n` is 768, and `rank` is 4. `W0` has
`768 x 768 = 589,824` parameters, whereas the LoRA layers, `A` and `B` together
have `768 x 4 + 4 x 768 = 6,144` parameters. So, for the dense layer, we go
from `589,824` trainable parameters to `6,144` trainable parameters!
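
The arithmetic above is easy to reproduce for other ranks. The helper names below are hypothetical, and biases are ignored, as in the text.

```python
def dense_params(n):
    """Parameters in an n x n weight matrix (bias ignored, as in the text)."""
    return n * n


def lora_params(n, rank):
    """Parameters in the two LoRA matrices A (n x rank) and B (rank x n)."""
    return n * rank + rank * n


n = 768
full = dense_params(n)
for rank in (1, 4, 16):
    lora = lora_params(n, rank)
    print(f"rank={rank:>2}: {lora:>6,} trainable vs {full:,} ({lora / full:.2%})")
```

The LoRA count grows linearly with `rank`, so even a rank several times larger than 4 stays a small fraction of the original layer.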

### Why does LoRA reduce memory footprint?

Even though the total number of parameters increases
(since we are adding LoRA layers), the memory footprint shrinks because the
number of trainable parameters drops. Let's dive deeper into this.

The memory usage of a model can be split into four parts:

- Model memory: This is the memory required to store the model weights. This
will be slightly higher for LoRA than the original model.
- Forward pass memory: This mostly depends on batch size, sequence length, etc.
We keep this constant for both models for a fair comparison.
- Backward pass memory: This is the memory required to store the gradients. Note
that the gradients are computed only for the trainable parameters.
- Optimizer memory: This is the memory required to store the optimizer state.
For example, the Adam optimizer stores the "1st moment vectors" and
"2nd moment vectors" for the trainable parameters.

Since LoRA drastically reduces the number of trainable parameters, the
optimizer memory and the memory required to store the gradients are much
smaller than for the original model. This is where most of the memory
savings happen.
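
A back-of-the-envelope estimate makes the gap concrete. The sketch below uses the trainable-parameter counts quoted in this notebook (about 2.5 billion for the full model and about 1.3 million with LoRA). It assumes 4 bytes per stored value and an Adam-style optimizer with two state values per parameter. Both are assumptions: the notebook itself runs in bfloat16 with SGD, so the real figures will differ.

```python
def training_state_gib(trainable_params, bytes_per_value=4, optimizer_slots=2):
    """Rough memory for gradients plus optimizer state, in GiB.

    Assumes one gradient value and `optimizer_slots` optimizer values per
    trainable parameter (Adam keeps two: first and second moments).
    """
    values_per_param = 1 + optimizer_slots
    return trainable_params * values_per_param * bytes_per_value / 2**30


full_ft = training_state_gib(2_500_000_000)  # all weights trainable
lora_ft = training_state_gib(1_300_000)      # only LoRA weights trainable
print(f"full fine-tuning: ~{full_ft:.1f} GiB")
print(f"LoRA:             ~{lora_ft:.3f} GiB")
```

Model weights and forward-pass activations are not included here, since they are roughly the same in both setups.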

### Why is LoRA so popular?

- Reduces GPU memory usage;
- Faster training; and
- No additional inference latency.
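
The last point follows from the fact that the low-rank update can be folded into the original weight matrix once training is done. Here is a minimal NumPy sketch with made-up sizes and random values, written in column-vector convention with `A` as `rank x n` and `B` as `n x rank`. It illustrates the idea only and does not show how KerasHub merges weights.

```python
import numpy as np

rng = np.random.default_rng(0)
n, rank = 8, 2

W0 = rng.normal(size=(n, n))    # frozen pre-trained weights
A = rng.normal(size=(rank, n))  # LoRA matrices after (pretend) training
B = rng.normal(size=(n, rank))
x = rng.normal(size=n)

# During training: base path plus low-rank path.
y_adapter = W0 @ x + B @ (A @ x)

# For deployment: fold the update into a single n x n matrix.
W_merged = W0 + B @ A
y_merged = W_merged @ x

print(np.allclose(y_adapter, y_merged))
print(W_merged.shape == W0.shape)
```

After merging, inference uses one matrix of the original shape, so the cost per token is the same as for the base model.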

When using KerasHub, we can enable LoRA with a one-line API:
`enable_lora(rank=4)`

From `gemma_lm.summary()`, we can see enabling LoRA reduces the number of
trainable parameters significantly (from 2.5 billion to 1.3 million).

In [None]:
gemma_lm.backbone.enable_lora(rank=4)
gemma_lm.summary()

Let's fine-tune the LoRA model.

In [None]:
# To save memory, use the SGD optimizer instead of the usual AdamW optimizer.
# For this specific example, SGD is more than enough.
optimizer = keras.optimizers.SGD(learning_rate=1e-4)
gemma_lm.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=optimizer,
    weighted_metrics=[keras.metrics.SparseCategoricalAccuracy()],
)
gemma_lm.fit(train_ds, epochs=1)

After fine-tuning, responses will follow the instructions provided in the
prompt.

In [None]:
template = (
    "<start_of_turn>user\n"
    "Translate French into English:\n"
    "{inputs}"
    "<end_of_turn>\n"
    "<start_of_turn>model\n"
    "Translation:\n"
)
prompt = template.format(inputs="Bonjour, je m'appelle Morgane.")
outputs = gemma_lm.generate(prompt, max_length=256)
print("Translation:\n", outputs.replace(prompt, ""))

------

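We have seen that **LoRA instruct tuning** focuses on efficiently fine-tuning a pre-trained LLM for instruction following by adding a small number of trainable parameters while freezing the original model weights. This approach reduces memory usage and training time compared to traditional fine-tuning methods while achieving comparable performance.

The remaining cells repeat the experiment with QLoRA: the base model is reloaded, its frozen weights are quantized to int8 with `quantize("int8")`, and LoRA is then enabled on top of the quantized backbone.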

In [None]:
preprocessor = keras_hub.models.GemmaCausalLMPreprocessor.from_preset(
    "gemma_1.1_instruct_2b_en", sequence_length=128
)
gemma_lm = keras_hub.models.GemmaCausalLM.from_preset(
    "gemma_1.1_instruct_2b_en", preprocessor=preprocessor
)
gemma_lm.quantize("int8")
gemma_lm.backbone.enable_lora(rank=4)
gemma_lm.summary()

Let's fine-tune the QLoRA model.

If you are using a device with int8 acceleration support, you should see an
improvement in the training speed.
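
For intuition about what storing weights in int8 involves, here is a minimal NumPy sketch of symmetric absmax quantization. The exact scheme used by `quantize("int8")` is not described in this notebook, so the per-tensor scaling below is an assumption chosen for simplicity, and the helper names are hypothetical.

```python
import numpy as np


def quantize_int8(weights):
    """Symmetric per-tensor absmax quantization to int8."""
    scale = np.max(np.abs(weights)) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q, scale):
    """Map int8 values back to approximate float32 weights."""
    return q.astype(np.float32) * scale


rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4)).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print(q.dtype, q.nbytes, "bytes vs", w.nbytes, "bytes in float32")
print("max abs error:", float(np.max(np.abs(w - w_hat))))
```

Each weight takes one byte instead of four, at the cost of a rounding error of at most half a quantization step. The LoRA matrices stay in higher precision and are the only parameters that are trained.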

In [None]:
optimizer = keras.optimizers.SGD(learning_rate=1e-4)
gemma_lm.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=optimizer,
    weighted_metrics=[keras.metrics.SparseCategoricalAccuracy()],
)
gemma_lm.fit(train_ds, epochs=1)

You should get a similar output with QLoRA fine-tuning.

In [None]:
prompt = template.format(inputs="Bonjour, je m'appelle Morgane.")
outputs = gemma_lm.generate(prompt, max_length=256)
print("Translation:\n", outputs.replace(prompt, ""))

And we're all done!

Note that for demonstration purposes, this example fine-tunes the model on a
small subset of the dataset for just one epoch and with a low LoRA rank value.
To get better responses from the fine-tuned model, you can experiment with:

- Increasing the size of the fine-tuning dataset.
- Training for more steps (epochs).
- Setting a higher LoRA rank.
- Modifying the hyperparameter values such as `learning_rate` and
`weight_decay`.