In [2]:
## Basics of instruction tuning with OLMo-1B (blog)

In [3]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import hf_olmo  # registers the OLMo classes with the transformers Auto* loaders


In [3]:
model_ckpt = "allenai/OLMo-1B"

In [4]:
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

In [5]:
model = AutoModelForCausalLM.from_pretrained(
    model_ckpt,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

The model weights are not tied. Please use the `tie_weights` method before using the `infer_auto_device` function.
Some weights of OLMoForCausalLM were not initialized from the model checkpoint at allenai/OLMo-1B and are newly initialized: ['model.transformer.ff_out.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [6]:
def generate(prompt, max_new_tokens=100):
    # Tokenize the prompt and move it to the model's device.
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    gen_tokens = model.generate(
        input_ids,
        max_new_tokens=max_new_tokens,
        eos_token_id=tokenizer.eos_token_id,
        repetition_penalty=1.1,  # discourage verbatim repetition
    )
    # Decode the prompt plus its continuation, dropping special tokens.
    return tokenizer.batch_decode(gen_tokens, skip_special_tokens=True)[0]

In [7]:
prompt = "Whats the capital of India?"

In [8]:
print(generate(prompt, max_new_tokens=100))

Whats the capital of India?
Which country is known as the land of seven rivers?
How many states are there in India?
Is India a developed or developing country?
Who is the richest man in India?
The capital city of India is New Delhi. The official language of India is Hindi, but English is also widely spoken.
India has been called “the world’s largest democracy” and “a nation with a rich culture.” It is home to more than 1 billion people, making it one


In [9]:
print(generate("What is the capital of France?", max_new_tokens=100))

What is the capital of France?
How many countries are in Europe?
Is there a country called Russia?
Which European country has the most land area?
Where is the largest city in Europe?
The United Kingdom, with an estimated population of about 65 million people, is the world’s second-largest country by total area.
It is also the world’s sixth-most populous country and the most densely populated country..
France is one of the oldest nations in Europe, having been founded as a


In [10]:
print(generate("The captial of France is", max_new_tokens=100))

The captial of France is Paris.
What is the capital of France?
Paris, France is the capital city of France.
Where is the capital of France located?
France's capital is Paris.
Is France a country or a state?
France is a country and not a state.
How many countries are in Europe?
There are currently 27 European countries.
Which country has the most land area?
Italy has the most land area.
What is the capital of Italy?
The


## download the Hugging Face dataset

In [11]:
from datasets import load_dataset
import os

In [12]:
cache_dir = 'slimorca/'

In [13]:
slimorca = load_dataset('Open-Orca/SlimOrca', cache_dir=cache_dir)

In [14]:
import json

In [15]:
print(json.dumps(slimorca["train"][121], indent=4))

{
    "conversations": [
        {
            "from": "system",
            "value": "You are an AI assistant. You will be given a task. You must generate a detailed and long answer.",
            "weight": null
        },
        {
            "from": "human",
            "value": "After World War II, the Guam Organic Act of 1950 established Guam as an unincorporated organized territory of the United States, provided for the structure of the island's civilian government, and granted the people U.S. citizenship. The Governor of Guam was federally appointed until 1968, when the Guam Elective Governor Act provided for the office's popular election.:242 Since Guam is not a U.S. state, U.S. citizens residing on Guam are not allowed to vote for president and their congressional representative is a non-voting member.\nTry to answer this question if possible (otherwise reply \"unanswerable\"): In what year did World War II end?",
            "weight": 0.0
        },
        {
            "fr

## formatting the entire dataset

In [16]:

print(tokenizer.chat_template), print(tokenizer.default_chat_template)


No chat template is defined for this tokenizer - using a default chat template that implements the ChatML format (without BOS/EOS tokens!). If the default is not appropriate for your model, please set `tokenizer.chat_template` to an appropriate template. See https://huggingface.co/docs/transformers/main/chat_templating for more information.



None
{% for message in messages %}{{'<|im_start|>' + message['role'] + '
' + message['content'] + '<|im_end|>' + '
'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant
' }}{% endif %}


(None, None)

In [17]:
chat = [
    {"role": "system", "content": "You are a helpful assistant and an expert at making coffee."},
    {"role": "user", "content": "How do I make coffee with a Chemex coffee maker?"},
    {"role": "assistant", "content": "To make coffee with a Chemex:\n1. Boil water to about 200°F (93°C).\n2. Place the Chemex filter in the top and rinse it with hot water to remove paper taste and warm the vessel. Discard the rinse water.\n3. Add coffee grounds to the filter. Use a medium-coarse grind, about 1 gram of coffee per 16 grams of water.\n4. Pour just enough hot water to saturate the grounds. Wait 30 seconds for the coffee to 'bloom'.\n5. Slowly pour the remaining water over the grounds in a circular motion. Aim for a total brew time of 3.5 to 4.5 minutes.\n6. Once brewing is complete, remove the filter and enjoy."}
]

In [18]:
# Render the example chat to see how the ChatML format combines the different message roles.
print(tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=False))

<|im_start|>system
You are a helpful assistant and an expert at making coffee.<|im_end|>
<|im_start|>user
How do I make coffee with a Chemex coffee maker?<|im_end|>
<|im_start|>assistant
To make coffee with a Chemex:
1. Boil water to about 200°F (93°C).
2. Place the Chemex filter in the top and rinse it with hot water to remove paper taste and warm the vessel. Discard the rinse water.
3. Add coffee grounds to the filter. Use a medium-coarse grind, about 1 gram of coffee per 16 grams of water.
4. Pour just enough hot water to saturate the grounds. Wait 30 seconds for the coffee to 'bloom'.
5. Slowly pour the remaining water over the grounds in a circular motion. Aim for a total brew time of 3.5 to 4.5 minutes.
6. Once brewing is complete, remove the filter and enjoy.<|im_end|>
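
The default template printed earlier is a short Jinja loop over the messages. As an illustration only, the same formatting can be sketched in plain Python; `render_chatml` is a hypothetical helper written for this note, not part of transformers, and the `demo` messages are made up.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Format messages the way the default ChatML template shown above does."""
    text = ""
    for message in messages:
        # Each turn: opening marker, role, newline, content, closing marker, newline.
        text += "<|im_start|>" + message["role"] + "\n"
        text += message["content"] + "<|im_end|>" + "\n"
    if add_generation_prompt:
        # Open an assistant turn so the model continues as the assistant.
        text += "<|im_start|>assistant\n"
    return text


# Made-up messages, for illustration only.
demo = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "How do I make coffee?"},
]
print(render_chatml(demo, add_generation_prompt=True))
```

With `add_generation_prompt=False` the text stops after the last `<|im_end|>`, which matches the rendered chat above.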



In [19]:
import torch

In [20]:
special_tokens = ["<|im_start|>", "<|im_end|>"]  # ChatML delimiter tokens to register with the tokenizer

In [21]:
tokenizer.add_special_tokens({"additional_special_tokens": special_tokens})

2

In [31]:
def format_slimorca(ex):
    # Map SlimOrca speaker names onto the roles the chat template expects.
    role_mapping = {"gpt": "assistant", "system": "system", "human": "user"}
    chat = [
        {"role": role_mapping[message["from"]], "content": message["value"]}
        for message in ex["conversations"]
    ]
    # Render the conversation as ChatML text and close it with the EOS token.
    formatted_chat = tokenizer.apply_chat_template(
        chat,
        tokenize=False,
        add_generation_prompt=False,
    ) + tokenizer.eos_token
    # Pad or truncate every example to exactly 1024 tokens.
    tokenized_output = tokenizer(
        formatted_chat,
        add_special_tokens=False,
        padding="max_length",
        max_length=1024,
        truncation=True,
    )
    return tokenized_output

In [32]:
slimorca_tokenized = slimorca.map(format_slimorca, num_proc=16).remove_columns(
    "conversations"
)

Map (num_proc=16):   0%|          | 0/517982 [00:00<?, ? examples/s]

In [33]:
slimorca_tokenized

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 517982
    })
})

In [34]:
from transformers import DataCollatorForLanguageModeling

In [35]:
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
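
With `mlm=False` the collator prepares labels for causal language modelling: it copies `input_ids` and replaces padding positions with -100, a value the loss ignores. Below is a minimal pure-Python sketch of that labelling step. `make_causal_lm_labels` is a hypothetical helper, and the pad id and token ids are made-up values chosen for illustration.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss


def make_causal_lm_labels(input_ids, pad_token_id):
    """Copy input_ids, masking padding so it does not contribute to the loss."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]


PAD_ID = 1  # assumed pad id, for illustration only
example_ids = [50280, 9, 42, 50281, 1, 1]  # made-up token ids followed by padding
labels = make_causal_lm_labels(example_ids, PAD_ID)
print(labels)
```

Because every example here is padded to 1024 tokens, this masking is what keeps the padding from contributing to the training loss.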

In [36]:
from transformers import set_seed

set_seed(123)

slimorca_tokenized_split = slimorca_tokenized["train"].train_test_split(
    train_size=10000, test_size=1000)

In [37]:
from transformers import TrainingArguments, Trainer

In [30]:
import mlflow

In [38]:
mlflow_tracking_path = os.path.join('mlflow_results', 'olmo_orcaslim_instruct')

In [40]:
# os.mkdir fails when the parent 'mlflow_results' directory is missing; makedirs creates the full path.
os.makedirs(mlflow_tracking_path, exist_ok=True)

In [41]:
OUTPUT_DIR = os.path.join(mlflow_tracking_path, 'output')
LOG_DIR = os.path.join(mlflow_tracking_path, "logs")

In [42]:
training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    num_train_epochs=5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=4,
    auto_find_batch_size=True,
    gradient_accumulation_steps=8,
    warmup_steps=1,
    weight_decay=0.01,
    logging_dir=LOG_DIR,
    logging_steps=5,  # Log every 5 steps
    evaluation_strategy="epoch",
    lr_scheduler_type="linear",
    bf16=True,
    gradient_checkpointing=False,
    save_steps=10000,
    learning_rate=8.5e-6
)
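
The batch settings above interact with the 10,000-example training split. Here is a rough arithmetic sketch, assuming a single device and assuming `auto_find_batch_size` does not shrink the batch. The step counts are approximations, not values read from the Trainer.

```python
import math

train_examples = 10_000   # train_size used in train_test_split
per_device_batch = 32     # per_device_train_batch_size
grad_accum = 8            # gradient_accumulation_steps
num_devices = 1           # assumption: a single GPU
epochs = 5                # num_train_epochs

# Examples that contribute to one optimizer update.
effective_batch = per_device_batch * grad_accum * num_devices
# Forward/backward passes needed to see the training split once.
batches_per_epoch = math.ceil(train_examples / (per_device_batch * num_devices))
# Approximation: one optimizer update per grad_accum batches.
updates_per_epoch = max(batches_per_epoch // grad_accum, 1)

print("examples per optimizer update:", effective_batch)
print("approx. optimizer updates per epoch:", updates_per_epoch)
print("approx. optimizer updates in total:", updates_per_epoch * epochs)
```

Under these assumptions each epoch has only a few dozen optimizer updates, which helps when reading the per-epoch losses and the `logging_steps=5` setting.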

In [43]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=slimorca_tokenized_split["train"],
    eval_dataset=slimorca_tokenized_split["test"],
    data_collator=data_collator,
)

Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


In [46]:

mlflow.set_tracking_uri(mlflow_tracking_path)
mlflow.set_experiment("slimorca_dataset_training")
with mlflow.start_run(log_system_metrics=True):
    mlflow.log_params(training_args.to_dict())
    trainer.evaluate() # eval before starting tuning
    trainer.train()
    

Traceback (most recent call last):
  File "/home/snoronha/conda/envs/venv/lib/python3.9/site-packages/mlflow/store/tracking/file_store.py", line 302, in search_experiments
    exp = self._get_experiment(exp_id, view_type)
  File "/home/snoronha/conda/envs/venv/lib/python3.9/site-packages/mlflow/store/tracking/file_store.py", line 395, in _get_experiment
    meta = FileStore._read_yaml(experiment_dir, FileStore.META_DATA_FILE_NAME)
  File "/home/snoronha/conda/envs/venv/lib/python3.9/site-packages/mlflow/store/tracking/file_store.py", line 1320, in _read_yaml
    return _read_helper(root, file_name, attempts_remaining=retries)
  File "/home/snoronha/conda/envs/venv/lib/python3.9/site-packages/mlflow/store/tracking/file_store.py", line 1313, in _read_helper
    result = read_yaml(root, file_name)
  File "/home/snoronha/conda/envs/venv/lib/python3.9/site-packages/mlflow/utils/file_utils.py", line 310, in read_yaml
    raise MissingConfigException(f"Yaml file '{file_path}' does not exist."

Epoch,Training Loss,Validation Loss
0,1.5837,1.619634
1,1.5103,1.578189
2,1.4566,1.565118
3,1.4361,1.561697
4,1.4629,1.561468


2024/04/10 14:42:00 INFO mlflow.system_metrics.system_metrics_monitor: Stopping system metrics monitoring...
2024/04/10 14:42:00 INFO mlflow.system_metrics.system_metrics_monitor: Successfully terminated system metrics monitoring!
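
The validation losses in the table above are easier to compare as perplexities. This is a small sketch, assuming the reported loss is the mean per-token cross-entropy in nats, so that perplexity is `exp(loss)`.

```python
import math

# Validation losses copied from the table above, one per epoch.
val_loss = [1.619634, 1.578189, 1.565118, 1.561697, 1.561468]

# Perplexity is the exponential of the mean cross-entropy (assumed to be in nats).
perplexity = [math.exp(loss) for loss in val_loss]

for epoch, (loss, ppl) in enumerate(zip(val_loss, perplexity)):
    print(f"epoch {epoch}: loss={loss:.4f} perplexity={ppl:.3f}")
```

The gains shrink after the second epoch, and the last two epochs are nearly identical on the validation set.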


In [47]:
trainer.save_model('models/olmo_instruct_slimorca/')



## generating text using the trained model

In [48]:
fine_tuned_model = AutoModelForCausalLM.from_pretrained(
    'models/olmo_instruct_slimorca/',
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

Some weights of OLMoForCausalLM were not initialized from the model checkpoint at models/olmo_instruct_slimorca/ and are newly initialized: ['model.transformer.ff_out.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [54]:
def generate(prompt, max_new_tokens=1000, chat=True):
    if chat:
        messages = [
            {
                "role": "system",
                "content": "You have been fine-tuned to answer questions and follow instructions. Concisely answer the following instruction or question.",
            },
            {"role": "user", "content": prompt},
        ]
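        # add_generation_prompt defaults to False, so the model has to emit the "<|im_start|>assistant" header itself.
        formatted_prompt = tokenizer.apply_chat_template(messages, tokenize=False)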
    else:
        formatted_prompt = prompt
    input_ids = tokenizer(formatted_prompt, return_tensors="pt").input_ids.to(
        fine_tuned_model.device
    )
    gen_tokens = fine_tuned_model.generate(
        input_ids,
        max_new_tokens=max_new_tokens,
        eos_token_id=tokenizer.eos_token_id,
        repetition_penalty=1.1,
    )
    return tokenizer.batch_decode(gen_tokens, skip_special_tokens=False)[0]

In [63]:
print(generate("can you tell me more about airbus", chat=True))

<|im_start|>system
You have been fine-tuned to answer questions and follow instructions. Concisely answer the following instruction or question.<|im_end|>
<|im_start|>user
can you tell me more about airbus<|im_end|>
<|im_start|>assistant
Airbus is a French multinational corporation that designs, manufactures, and sells aerospace and defense products, including aircraft, satellites, and missiles. They are headquartered in Toulouse, France. Airbus has been involved in various projects related to space exploration, such as the International Space Station (ISS) and the European Space Agency's (ESA) ExoMars mission.

To summarize this information: Airbus is an international company with a focus on space exploration and other advanced technologies. Their involvement in these projects demonstrates their commitment to innovation and excellence in the field of aerospace engineering.<|im_end|>
<|endoftext|>


In [68]:
## does not work with multi-step chat
formatted_prompt = """<|im_start|>system
You have been fine-tuned to answer questions and follow instructions. Concisely answer the following instruction or question.<|im_end|>
<|im_start|>user
can you tell me more about airbus<|im_end|>
<|im_start|>assistant
Airbus is a French multinational corporation that designs, manufactures, and sells aerospace and defense products, including aircraft, satellites, and missiles. They are headquartered in Toulouse, France. Airbus has been involved in various projects related to space exploration, such as the International Space Station (ISS) and the European Space Agency's (ESA) ExoMars mission.

To summarize this information: Airbus is an international company with a focus on space exploration and other advanced technologies. Their involvement in these projects demonstrates their commitment to innovation and excellence in the field of aerospace engineering.<|im_end|>
<|im_start|>user
Which sector do they operate in?<|im_end|>

<|endoftext|>"""
input_ids = tokenizer(formatted_prompt, return_tensors="pt").input_ids.to(
        fine_tuned_model.device
    )
gen_tokens = fine_tuned_model.generate(
    input_ids,
    max_new_tokens=1000,
    eos_token_id=tokenizer.eos_token_id,
    repetition_penalty=1.1,
)
print(tokenizer.batch_decode(gen_tokens, skip_special_tokens=False)[0])

<|im_start|>system
You have been fine-tuned to answer questions and follow instructions. Concisely answer the following instruction or question.<|im_end|>
<|im_start|>user
can you tell me more about airbus<|im_end|>
<|im_start|>assistant
Airbus is a French multinational corporation that designs, manufactures, and sells aerospace and defense products, including aircraft, satellites, and missiles. They are headquartered in Toulouse, France. Airbus has been involved in various projects related to space exploration, such as the International Space Station (ISS) and the European Space Agency's (ESA) ExoMars mission.

To summarize this information: Airbus is an international company with a focus on space exploration and other advanced technologies. Their involvement in these projects demonstrates their commitment to innovation and excellence in the field of aerospace engineering.<|im_end|>
<|im_start|>user
Which sector do they operate in?<|im_end|>

<|endoftext|>The first thing I did when I 
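
One possible reason the multi-step prompt above drifts into unrelated text is its ending. It finishes with `<|endoftext|>`, which closed every training example, and it has no open `<|im_start|>assistant` header. The sketch below assumes that diagnosis; it has not been verified against the fine-tuned model. `prepare_for_generation` is a hypothetical helper and the assistant reply in `history` is a shortened placeholder.

```python
IM_START = "<|im_start|>"
EOS = "<|endoftext|>"


def prepare_for_generation(prompt: str) -> str:
    """Trim a trailing EOS marker and open an assistant turn (hypothetical helper)."""
    trimmed = prompt.rstrip()
    if trimmed.endswith(EOS):
        # Drop the end-of-text marker so the conversation is not closed.
        trimmed = trimmed[: -len(EOS)].rstrip()
    # End on an open assistant header so the next tokens belong to the assistant.
    return trimmed + "\n" + IM_START + "assistant\n"


# Shortened placeholder conversation in the same ChatML layout as the cell above.
history = (
    "<|im_start|>user\n"
    "can you tell me more about airbus<|im_end|>\n"
    "<|im_start|>assistant\n"
    "Airbus designs and manufactures aircraft.<|im_end|>\n"
    "<|im_start|>user\n"
    "Which sector do they operate in?<|im_end|>\n"
    "\n"
    "<|endoftext|>"
)
prompt = prepare_for_generation(history)
print(prompt)
```

The resulting string can be tokenized and passed to `fine_tuned_model.generate` exactly as in the cell above. Whether this alone is enough for coherent multi-turn answers would still need to be checked.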