resources:

* https://huggingface.co/docs/transformers/peft#add-a-new-adapter
* https://github.com/huggingface/trl/blob/2f726ce4e88a99b5d20eca3b5482954851d91ef6/trl/trainer/utils.py#L482
* https://github.com/huggingface/tokenizers/issues/247
* https://github.com/philschmid/llm-sagemaker-sample/blob/main/scripts/run_qlora.py
* https://huggingface.co/OpenAssistant/codellama-13b-oasst-sft-v10/blob/main/special_tokens_map.json
* https://github.com/huggingface/trl/blob/v0.7.4/trl/trainer/sft_trainer.py#L50
* https://huggingface.co/docs/trl/v0.7.4/en/sft_trainer#packing-dataset--constantlengthdataset-
* https://huggingface.co/docs/transformers/main/en/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.apply_chat_template
* https://github.com/huggingface/trl/blob/2f726ce4e88a99b5d20eca3b5482954851d91ef6/trl/trainer/utils.py#L133
* https://huggingface.co/docs/trl/sft_trainer
* https://huggingface.co/docs/transformers/peft

In [1]:
from transformers import AutoTokenizer, PreTrainedTokenizer
from datasets import load_dataset

path_or_dataset_id = "philschmid/guanaco-oai-style"
model_id = "mistralai/Mistral-7B-v0.1"

In [2]:
assistant_token = '<|im_start|>assistant'
user_token = '<|im_start|>user'
system_token = '<|im_start|>system'
    
def create_tokenizer(tokenizer_id: str) -> PreTrainedTokenizer:
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_id)

    # new special tokens (ChatML-style turn delimiters)
    bos_token = '<|im_start|>'
    eos_token = '<|im_end|>'

    tokenizer.chat_template = (
        "{% for message in messages %}"
        "{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}"
        "{% endfor %}"
        "{% if add_generation_prompt %}"
        "{{ '<|im_start|>assistant\n' }}"
        "{% endif %}"
    )
    # Special-token setup follows https://huggingface.co/OpenAssistant/codellama-13b-oasst-sft-v10/blob/main/special_tokens_map.json
    tokenizer.eos_token = eos_token
    tokenizer.pad_token = eos_token
    tokenizer.bos_token = bos_token
    tokenizer.add_special_tokens({"additional_special_tokens": [bos_token, eos_token]})

    # The vocabulary grew, so call model.resize_token_embeddings(len(tokenizer)) after loading the model.
    return tokenizer

tokenizer = create_tokenizer(model_id)
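
To make the chat template easier to read, here is a minimal pure-Python sketch of what it renders. The function name `render_chatml` and the two-message example are made up for illustration; the real rendering is done by Jinja inside `tokenizer.apply_chat_template`.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Illustrative pure-Python equivalent of the Jinja chat template above."""
    text = ""
    for message in messages:
        # one turn: <|im_start|>{role}\n{content}<|im_end|>\n
        text += "<|im_start|>" + message["role"] + "\n" + message["content"] + "<|im_end|>" + "\n"
    if add_generation_prompt:
        # open an assistant turn so the model continues from here
        text += "<|im_start|>assistant\n"
    return text


# hypothetical two-turn conversation
example = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
]
print(render_chatml(example))
```

With `add_generation_prompt=True` the string ends in an open assistant turn, which is the form you would typically use at inference time rather than for training.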

In [3]:
class InvalidMessagesError(Exception):
    pass

def validate_messages(data):
    if "messages" not in data:
        raise InvalidMessagesError("No 'messages' key in data")

    messages = data["messages"]
    if not isinstance(messages, list):
        raise InvalidMessagesError("messages is not a list")

    if not all(isinstance(m, dict) and "role" in m and "content" in m for m in messages):
        raise InvalidMessagesError("One or more invalid messages")
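
A quick way to see which inputs the validator accepts is to run it on a few hand-made samples. The sketch below repeats the validator so it runs on its own; the `check` helper and the three sample dicts are hypothetical and not part of the original notebook.

```python
class InvalidMessagesError(Exception):
    pass


def validate_messages(data):
    # Same checks as the cell above, repeated so this snippet is self-contained.
    if "messages" not in data:
        raise InvalidMessagesError("No 'messages' key in data")
    messages = data["messages"]
    if not isinstance(messages, list):
        raise InvalidMessagesError("messages is not a list")
    if not all(isinstance(m, dict) and "role" in m and "content" in m for m in messages):
        raise InvalidMessagesError("One or more invalid messages")


def check(sample):
    """Hypothetical helper: return None if the sample is valid, else the error text."""
    try:
        validate_messages(sample)
    except InvalidMessagesError as err:
        return str(err)
    return None


good = {"messages": [{"role": "user", "content": "Hi"}]}
missing_key = {"text": "Hi"}
missing_content = {"messages": [{"role": "user"}]}

for sample in (good, missing_key, missing_content):
    print(check(sample))
```

Note that the validator only checks that the `role` and `content` keys are present; it does not check which role names are used.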


In [4]:
import os 
from random import randint
from trl.trainer.utils import DataCollatorForCompletionOnlyLM, ConstantLengthDataset

# Load raw dataset from the hub or disk
raw_json = load_dataset(path_or_dataset_id, split="train")

# Check that the schema has a "messages" column whose elements are dicts with "role" and "content" keys
validate_messages(raw_json.features)

# apply the chat template; tokenize=False keeps each prompt as plain text
template_ds = raw_json.map(
    lambda s: {"prompt": tokenizer.apply_chat_template(s["messages"], tokenize=False)},
    remove_columns=raw_json.features,
    num_proc=os.cpu_count(),
)
# print a random example (randint is inclusive on both ends, hence the -1)
print(template_ds[randint(0, len(template_ds) - 1)]["prompt"])
ds = ConstantLengthDataset(tokenizer, dataset=template_ds, seq_length=2048,dataset_text_field="prompt")

collator = DataCollatorForCompletionOnlyLM(instruction_template=user_token, response_template=assistant_token, tokenizer=tokenizer, mlm=False)
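
The idea behind a completion-only collator is that the loss should only be computed on assistant tokens, so every other label is set to the ignore index `-100`. The sketch below shows that idea on toy integer IDs. It is a simplification: it assumes each template is a single token (`user_id`, `assistant_id`, both made up here), whereas the real collator matches multi-token templates.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def mask_non_assistant(input_ids, user_id, assistant_id, ignore_index=IGNORE_INDEX):
    """Keep labels only for tokens that follow an assistant marker (toy version)."""
    labels = []
    in_response = False
    for tok in input_ids:
        if tok == assistant_id:
            in_response = True
            labels.append(ignore_index)  # the marker itself is not a target
        elif tok == user_id:
            in_response = False
            labels.append(ignore_index)
        else:
            labels.append(tok if in_response else ignore_index)
    return labels


# toy conversation: 1 = user marker, 2 = assistant marker, other IDs = content tokens
ids = [1, 10, 11, 2, 20, 21, 1, 12, 2, 22]
print(mask_non_assistant(ids, user_id=1, assistant_id=2))
```

Only the tokens 20, 21 and 22 keep their labels here, so a model trained on these labels is never penalised for failing to predict the user's text.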

Found cached dataset json (/home/ubuntu/.cache/huggingface/datasets/philschmid___json/philschmid--guanaco-oai-style-fafc168945666df2/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4)
Loading cached processed dataset at /home/ubuntu/.cache/huggingface/datasets/philschmid___json/philschmid--guanaco-oai-style-fafc168945666df2/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4/cache-45809af9be7206da_*_of_00008.arrow


<|im_start|>user
How to write docker yaml file<|im_end|>
<|im_start|>assistant
To write a Docker YAML file, you will need to create a Docker Compose file which is a YAML file used to define and run multi-container Docker applications.

Here's a basic example of a Docker Compose file:

```yaml
version: '3.9'
services:
  web:
    build: .
    ports:
      - "5000:5000"
  redis:
    image: "redis:alpine"
```

In this example, there are two services: `web` and `redis`. The `web` service is built from the current directory (`.`) and exposes port 5000 on the host. The `redis` service uses the Redis Alpine image.

Here's an explanation of the key components in this file:

- `version`: specifies the version of the Docker Compose file format to use.
- `services`: a list of services that make up your application.
- `build`: specifies the path to the Dockerfile to use to build the container.
- `image`: specifies the name of the image to use for the container.
- `ports`: specifies the ports to exp

In [5]:
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch
import bitsandbytes as bnb
from peft import LoraConfig, TaskType
    
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    use_cache=False,  # the KV cache must be off when gradient checkpointing is used
    device_map="auto",
    # use_flash_attention_2=script_args.use_flash_attn,  # left disabled; script_args is not defined in this notebook
    quantization_config=bnb_config,
)

# The tokenizer gained new special tokens, so grow the embedding matrix to match
model.resize_token_embeddings(len(tokenizer))
model.gradient_checkpointing_enable()

# Copied from https://github.com/artidoro/qlora/blob/main/qlora.py
def find_all_linear_names(model):
    """Collect the leaf names of all 4-bit linear layers to use as LoRA target modules."""
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, bnb.nn.Linear4bit):
            # keep only the last path component of the dotted module name
            lora_module_names.add(name.split(".")[-1])

    # lm_head is excluded from the LoRA targets (needed for 16-bit)
    lora_module_names.discard("lm_head")
    return list(lora_module_names)

modules = find_all_linear_names(model)


peft_config = LoraConfig(
    r=64,
    lora_alpha=16,
    target_modules=modules,
    lora_dropout=0.1,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

model.add_adapter(peft_config, adapter_name="adapter_1")
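
To get a feel for what `r=64` and `lora_alpha=16` mean, the arithmetic can be sketched directly. LoRA adds two low-rank matrices per targeted linear layer, `A` with shape `r x d_in` and `B` with shape `d_out x r`, and scales their product by `lora_alpha / r`. The 4096 x 4096 layer size below is an assumption chosen for illustration, not a value read from the model.

```python
def lora_extra_params(d_in, d_out, r):
    """Trainable parameters LoRA adds to one linear layer: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)


r, lora_alpha = 64, 16  # values from the LoraConfig above
d_in = d_out = 4096     # assumed layer size, for illustration only

added = lora_extra_params(d_in, d_out, r)
full = d_in * d_out
scaling = lora_alpha / r  # factor applied to the low-rank update

print(f"LoRA params per layer: {added}")
print(f"Frozen params per layer: {full}")
print(f"Trainable fraction: {added / full:.4f}")
print(f"Update scaling (lora_alpha / r): {scaling}")
```

Under these assumptions each adapted layer trains about 3% as many parameters as the frozen weight it sits beside, and the low-rank update is scaled by 0.25.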



In [6]:
from transformers import Trainer, TrainingArguments


args = TrainingArguments(
    output_dir="test",
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_checkpointing=True,
    logging_steps=10,
)

trainer = Trainer(
    args=args,
    data_collator=collator,
    train_dataset=ds,
    model=model
)


In [7]:
trainer.train()

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss
10,1.5082
20,1.0516
30,1.2106
40,1.1185
50,0.8941
60,0.8284
70,1.1593
80,1.0248



By 1973 it had become possible to exchange messages not only within the American continent but also to transmit information to Europe. This was made possible by the transatlantic telephone cable, the first of which had been laid back in the 1950s. However, the interface problem we mentioned earlier had not gone away.

So in 1976 what was by then DARPANet began funding the creation of a single unified protocol called TCP/IP, which would long remain the standard for the operation of global networks. In 1983 the network switched over to it completely.

In 1984 development of the DNS (Domain Name System) began. Every device connected to DARPANet had its own numeric address. As the number of devices grew, it became increasingly difficult to map letter-based addresses to numeric ones, and this had to be done by hand. That same year the first domain name server was developed; it took its name from the university attended by the four students who devel

KeyboardInterrupt: 