<a href="https://colab.research.google.com/github/sheldonkemper/portfolio/blob/main/CAM_DS_C401_Instruction_tuning_1_2_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**First things first** - please go to 'File' and select 'Save a copy in Drive' so that you have your own version of this activity set up and ready to use.
Remember to update the portfolio index link to your own work once completed!

# Demonstration 1.2.2 Use a genAI data set to create an instruction tuning template and train an LLM

In this demonstration, you will perform instruction tuning and learn how to:
- Load a genAI data set to provide instruction templates.
- Select a pre-trained base model for fine-tuning.
- Define a formatting function and perform instruction tuning.

You will use the `transformers`, `datasets`, and `trl` packages and a generative AI data set called [Alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca), which contains instruction and response pairs, some of them with an additional input that provides context.



#### Install the necessary packages

In [None]:
!pip install -U transformers
!pip install -U accelerate
!pip install -U trl

Collecting transformers
  Downloading transformers-4.45.2-py3-none-any.whl.metadata (44 kB)
Collecting tokenizers<0.21,>=0.20 (from transformers)
  Downloading tokenizers-0.20.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Downloading transformers-4.45.2-py3-none-any.whl (9.9 MB)
Downloading tokenizers-0.20.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.0 MB)
Installing collected packages: tokenizers, transformers
  Attempting uninstall: tokenizers
    Found existing installation: tokenizers 0.19.1
    Uninstalling tokenizers-0.19.1:
      Successfully uninstalled tokenizers-0.

#### Load the data set

In [None]:
# Load the necessary dataset from Hugging Face.
from datasets import load_dataset

# Keep only the first 5,000 examples of the train split.
train_dataset = load_dataset("tatsu-lab/alpaca", split="train[:5000]")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Generating train split:   0%|          | 0/52002 [00:00<?, ? examples/s]

In [None]:
# Check the structure of the data.
train_dataset

Dataset({
    features: ['instruction', 'input', 'output', 'text'],
    num_rows: 5000
})

In [None]:
# Display examples from the data set.
train_dataset['instruction'][:10]

['Give three tips for staying healthy.',
 'What are the three primary colors?',
 'Describe the structure of an atom.',
 'How can we reduce air pollution?',
 'Describe a time when you had to make a difficult decision.',
 'Identify the odd one out.',
 'Explain why the following fraction is equivalent to 1/4',
 'Write a short story in third person narration about a protagonist who has to make an important career decision.',
 'Render a 3D model of a house',
 'Evaluate this sentence for spelling and grammar mistakes']

In [None]:
# Select an example from the data set.
train_dataset['instruction'][5]

'Identify the odd one out.'

In [None]:
train_dataset['input'][5]

'Twitter, Instagram, Telegram'

In [None]:
# Show the response to the query above.
train_dataset['output'][5]

'Telegram'

In [None]:
print(train_dataset['text'][5])

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Identify the odd one out.

### Input:
Twitter, Instagram, Telegram

### Response:
Telegram


In [None]:
# Another example
print(train_dataset['text'][15])

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Explain the use of word embeddings in Natural Language Processing

### Response:
Word embeddings are one of the most powerful tools available for Natural Language Processing (NLP). They are mathematical representations of words or phrases in a vector space, allowing similarities between words and the context in which they are used to be measured. Word embeddings are useful for tasks such as sentiment analysis, text classification, predicting the next word in a sequence, and understanding synonyms and analogies. They allow for words to be processed as numerical values, giving machines an easier way to perform NLP tasks.
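
The two printed records use two variants of the same template: one with an `### Input:` section and one without. The sketch below rebuilds that layout from the separate fields. The helper name `build_alpaca_text` is hypothetical, and the exact whitespace is an assumption read off the printed output above, not taken from the data set's documentation.

```python
# Preambles copied from the two records printed above.
PREAMBLE_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request."
)
PREAMBLE_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request."
)


def build_alpaca_text(instruction: str, output: str, input_text: str = "") -> str:
    """Hypothetical helper: rebuild a 'text'-style record from its parts."""
    if input_text:
        # Variant with an extra "### Input:" section.
        return (
            f"{PREAMBLE_WITH_INPUT}\n\n"
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{input_text}\n\n"
            f"### Response:\n{output}"
        )
    # Variant for records whose 'input' field is empty.
    return (
        f"{PREAMBLE_NO_INPUT}\n\n"
        f"### Instruction:\n{instruction}\n\n"
        f"### Response:\n{output}"
    )


example = build_alpaca_text(
    "Identify the odd one out.", "Telegram", "Twitter, Instagram, Telegram"
)
print(example)
```

Choosing the variant from the `input` field keeps empty `### Input:` sections out of the prompt.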


In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM


In [None]:
# Specify the base model to fine-tune.
model_name = "facebook/opt-125m"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Reuse the end-of-sequence token for padding and pad on the right.
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
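
To illustrate what the padding settings mean, here is a minimal pure-Python sketch of right-padding a batch of token ID lists. The function `pad_right` and the ID values are hypothetical and only mimic the idea; the tokenizer performs the real padding.

```python
def pad_right(batch, pad_id):
    """Pad every sequence on the right to the longest length in the batch."""
    max_len = max(len(seq) for seq in batch)
    input_ids, attention_mask = [], []
    for seq in batch:
        n_pad = max_len - len(seq)
        # Real tokens first, padding tokens appended at the end.
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding that attention should skip.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


EOS_ID = 2  # illustrative value standing in for tokenizer.eos_token_id
ids, mask = pad_right([[5, 6, 7], [8, 9]], pad_id=EOS_ID)
print(ids)
print(mask)
```

With right padding, the real tokens of every sequence start at position 0 and the padding tokens come last.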


In [None]:
# Use the base model to perform inference on a sample text.
from transformers import pipeline

text = "Explain moon landing to a 6 year old?"
generator = pipeline("text-generation", model=model_name, max_length=100)
generator(text)

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


[{'generated_text': "Explain moon landing to a 6 year old?\nI'm not sure if you're being sarcastic or not, but I'm pretty sure it's a joke.\nI'm not sure if you're being sarcastic or not, but I'm pretty sure it's a joke.  I'm not sure if you're being sarcastic or not, but I'm pretty sure it's a joke.  I'm not sure if you're being sarcastic or not, but I'm pretty sure it's"}]

In [None]:
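# Function to prepare instruction tuning templates.
# It receives a batch (a dict of column lists) and returns one prompt string per example.
def formatting_prompts_func(example):
    output_texts = []
    for i in range(len(example['instruction'])):
        # Build the prompt from adjacent string literals so that the code
        # indentation does not end up inside the training text.
        text = (
            "Below is an instruction that describes a task. "
            "Write a response that appropriately completes the request.\n\n"
            f"### Human: {example['instruction'][i]}\n\n"
            f"### Assistant: {example['output'][i]}"
        )
        output_texts.append(text)
    return output_texts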
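
The formatting function is called with a batch, that is, a dictionary mapping each column name to a list of values, and it must return one string per row. The sketch below shows that contract on a small hand-made batch. `format_batch` is a hypothetical stand-in that uses the same `### Human:` and `### Assistant:` markers, and the second row is made-up toy data.

```python
def format_batch(batch):
    """Hypothetical stand-in for the formatting function, same markers."""
    texts = []
    for instruction, output in zip(batch["instruction"], batch["output"]):
        texts.append(
            "Below is an instruction that describes a task. "
            "Write a response that appropriately completes the request.\n\n"
            f"### Human: {instruction}\n\n"
            f"### Assistant: {output}"
        )
    return texts


# A batch is a dict of columns, not a list of rows.
toy_batch = {
    "instruction": ["Identify the odd one out.", "Say hello."],  # second row is toy data
    "output": ["Telegram", "Hello!"],
}

texts = format_batch(toy_batch)
print(len(texts))
print(texts[0])
```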

In [None]:
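from trl import SFTConfig, SFTTrainer

# Note: this collator is created but not passed to the trainer below, so the
# loss is computed on the whole formatted text, not only on the response.
# To train on responses only, the response template would have to match the
# marker used in formatting_prompts_func, and the collator would have to be
# passed to the trainer as data_collator.
response_template = "Answer: "
collator = DataCollatorForCompletionOnlyLM(response_template, tokenizer=tokenizer)

sft_config = SFTConfig(
    max_seq_length=512,   # Truncate formatted examples to 512 tokens.
    output_dir="/tmp",    # Where checkpoints are written.
    num_train_epochs=10,  # Number of passes over the 5,000 examples.
)

trainer = SFTTrainer(
    model,
    train_dataset=train_dataset,
    formatting_func=formatting_prompts_func,
    args=sft_config,
)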


Map:   0%|          | 0/5000 [00:00<?, ? examples/s]
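
A completion-only collator restricts the loss to the response by replacing the labels of all prompt tokens with an ignore index. The sketch below is a simplified pure-Python illustration of that idea, not TRL's implementation; `mask_prompt_tokens` and the token IDs are hypothetical. For the masking to work, the response template has to occur in the formatted text.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def mask_prompt_tokens(input_ids, response_template_ids):
    """Return labels in which everything up to and including the response
    template is replaced by IGNORE_INDEX (first occurrence, for simplicity)."""
    n = len(response_template_ids)
    for start in range(len(input_ids) - n + 1):
        if input_ids[start:start + n] == response_template_ids:
            end = start + n
            return [IGNORE_INDEX] * end + input_ids[end:]
    # Template not found: no token contributes to the loss.
    return [IGNORE_INDEX] * len(input_ids)


# Toy IDs: 11-13 are the prompt, 90-91 the response marker, 21-23 the answer.
tokens = [11, 12, 13, 90, 91, 21, 22, 23]
labels = mask_prompt_tokens(tokens, [90, 91])
print(labels)  # [-100, -100, -100, -100, -100, 21, 22, 23]

# A marker that never occurs in the text masks the whole example.
print(mask_prompt_tokens(tokens, [77, 78]))  # eight times -100
```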

In [None]:
# Train the model with the specified hyperparameters.
trainer.train()

| Step | Training Loss |
|------|---------------|
| 500  | 1.4446 |
| 1000 | 1.1739 |
| 1500 | 1.019  |
| 2000 | 0.8702 |
| 2500 | 0.7499 |
| 3000 | 0.615  |
| 3500 | 0.542  |
| 4000 | 0.4779 |
| 4500 | 0.4206 |
| 5000 | 0.3791 |


TrainOutput(global_step=6250, training_loss=0.6788679504394531, metrics={'train_runtime': 1548.2501, 'train_samples_per_second': 32.295, 'train_steps_per_second': 4.037, 'total_flos': 5322861637632000.0, 'train_loss': 0.6788679504394531, 'epoch': 10.0})
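
The reported `global_step` can be cross-checked against the data set size and the number of epochs. The following sketch derives the effective batch size from the two throughput metrics in the output above; it assumes a single device and no gradient accumulation, which the notebook does not state explicitly.

```python
import math

# Values taken from the notebook and the TrainOutput above.
num_examples = 5000
num_epochs = 10
samples_per_second = 32.295
steps_per_second = 4.037

# Samples processed per optimizer step, i.e. the effective batch size.
batch_size = round(samples_per_second / steps_per_second)
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

print(batch_size, steps_per_epoch, total_steps)  # 8 625 6250
```

The result matches `global_step=6250` in the training output.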

In [None]:
# Save the model and tokenizer locally.
model.save_pretrained("/content/model_instruction_tuned")
tokenizer.save_pretrained("/content/model_instruction_tuned")

('/content/model_instruction_tuned/tokenizer_config.json',
 '/content/model_instruction_tuned/special_tokens_map.json',
 '/content/model_instruction_tuned/vocab.json',
 '/content/model_instruction_tuned/merges.txt',
 '/content/model_instruction_tuned/added_tokens.json',
 '/content/model_instruction_tuned/tokenizer.json')

In [None]:
# Before instruction tuning.

text = "What is the capital of India?"

# Use the text directly, since the base model has not been trained on the prompt template.

# text_prompt = f"""Below is an instruction that describes a task. Write a response that appropriately completes the request.
# ### Instruction: {text}
# ### Response:  """

generator_base_model = pipeline("text-generation", model=model_name, max_length=100, return_full_text=False, repetition_penalty=1.1)
response_base_model_response = generator_base_model(text)
for seq in response_base_model_response:
    print(f"{seq['generated_text']}")




India is in the middle of a civil war.  The Indian government has been trying to get rid of the "Indian" name for decades.  It's not like they're going to change it.
I'm sure they'll do it eventually, but I doubt it will be as big as the one they've been trying to get rid of for years.
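
The generation calls in this notebook pass `repetition_penalty=1.1` to discourage loops like the one in the first base-model output. The sketch below shows the usual rule in pure Python, under the assumption that the score of every token already in the sequence is divided by the penalty when positive and multiplied by it when negative. `apply_repetition_penalty` and the toy logits are hypothetical; the library applies the rule to tensors.

```python
def apply_repetition_penalty(logits, seen_ids, penalty):
    """Lower the scores of tokens that already appear in the sequence."""
    adjusted = list(logits)
    for token_id in set(seen_ids):
        score = adjusted[token_id]
        # Positive scores shrink, negative scores move further from zero.
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted


toy_logits = [2.2, -1.1, 0.5, 3.3]  # one score per vocabulary entry
seen = [0, 1]                       # token IDs already in the sequence
penalised = apply_repetition_penalty(toy_logits, seen, penalty=1.1)
print(penalised)
```

A penalty of 1.0 leaves the scores unchanged, and values above 1.0 make repeated tokens less likely.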


In [None]:
# After instruction tuning.
text = "What is the capital of India?"

# Note: this prompt uses "### Instruction:" / "### Response:" markers, while
# formatting_prompts_func uses "### Human:" / "### Assistant:".
text_prompt = f"""Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction: {text}
### Response:  """
generator = pipeline("text-generation", model="/content/model_instruction_tuned", tokenizer="/content/model_instruction_tuned", max_length=50, return_full_text=False, repetition_penalty=1.1)
response = generator(text_prompt)
for seq in response:
    print(f"{seq['generated_text']}")




The capital of India is New Delhi.

    


In [None]:
text = "Give a solution for global warming"

# Base model: use the text directly.
generator_base_model = pipeline("text-generation", model=model_name, max_length=100, return_full_text=False, repetition_penalty=1.1)
response_base_model_response = generator_base_model(text)
print("\nBefore instruction tuning")
for seq in response_base_model_response:
    print(f"{seq['generated_text']}")

# Instruction-tuned model: wrap the text in the prompt template.
text_prompt = f"""Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction: {text}
### Response:  """
generator = pipeline("text-generation", model="/content/model_instruction_tuned", tokenizer="/content/model_instruction_tuned", max_length=100, return_full_text=False, repetition_penalty=1.1)
response = generator(text_prompt)
print("\nAfter instruction tuning")
for seq in response:
    print(f"{seq['generated_text']}")



Before instruction tuning
.
I'm not sure what you mean by "global warming".  I think it's just the fact that we have to deal with the effects of climate change on our planet.  We can't just ignore them and let them happen.
> Give a solution for global warming.  What do you mean?  > I'm not sure what you mean by "global warming".  I think it's just the fact that we have to deal with the

After instruction tuning

Answer: 
Global warming is a problem that is having a detrimental effect on the environment and human health. It is caused by the increase in the average temperature of the Earth's atmosphere and oceans due to human-caused emissions of greenhouse gases. These gases trap heat in the atmosphere and prevent it from radiating out


## Key information
Instruction tuning involves creating instruction templates and using the formatted examples to fine-tune an LLM, improving its ability to follow instructions across a variety of tasks.

## Reflect
How successful has the instruction tuning process been?

> Select the pen from the toolbar to add your entry.