<a href="https://colab.research.google.com/github/sunandhini96/TSAI_ERAV1/blob/main/session_27/S27_qlora.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install -q -U trl transformers accelerate git+https://github.com/huggingface/peft.git
!pip install -q datasets bitsandbytes einops wandb

  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone


In [2]:
from datasets import load_dataset

dataset_name = "OpenAssistant/oasst1"
train_dataset = load_dataset(dataset_name, split="train")

## Preparing the data:

In [3]:
def prep_data(df):
    df_assistant = df[(df.role == "assistant") & (df["rank"] == 0.0)].copy()
    df_prompter = df[(df.role == "prompter")].copy()
    df_prompter = df_prompter.set_index("message_id")
    df_assistant["output"] = df_assistant["text"].values

    inputs = []
    parent_ids = []
    for _, row in df_assistant.iterrows():
        input = df_prompter.loc[row.parent_id]
        inputs.append(input.text)
        parent_ids.append(input.parent_id)

    df_assistant["instruction"] = inputs
    df_assistant["parent_id"] = parent_ids

    df_assistant = df_assistant[df_assistant.lang == "en"]

    df_assistant = df_assistant[
        ["instruction", "output", "message_id", "parent_id"]
    ].rename(columns={"message_id": "id"})

    return df_assistant
df_train = prep_data(train_dataset.to_pandas())

In [4]:
df_train[0:4]

Unnamed: 0,instruction,output,id,parent_id
1,Can you write a short introduction about the r...,"""Monopsony"" refers to a market structure where...",c8e83833-ecbc-44fe-b6db-735228c25a1c,
7,What can be done at a regulatory level to ensu...,Here are some potential regulatory options to ...,73d6f715-3787-409c-81e4-fde0e5ef60cd,636dd191-50df-4894-ba9a-cd7f00767258
25,Can you explain contrastive learning in machin...,Sure! Let's say you want to build a model whic...,e8ca4e06-a584-4001-8594-5f633e06fa91,
27,I didn't understand how pulling and pushing wo...,It has to do with a process called differentia...,4c7578c0-f45b-4dd0-a1a1-24a189658a41,e8ca4e06-a584-4001-8594-5f633e06fa91


In [5]:
import datasets
from datasets import Dataset
dataset = Dataset.from_pandas(df_train)
train_dataset_mapped = dataset.map(lambda example: {'text': [f'[INST] <>\n' + prompt + ' [/INST] ' + response for prompt, response in zip(example['instruction'], example['output'])]}, batched=True)

Map:   0%|          | 0/7856 [00:00<?, ? examples/s]

In [6]:
train_dataset_mapped[0]

{'instruction': 'Can you write a short introduction about the relevance of the term "monopsony" in economics? Please use examples related to potential monopsonies in the labour market and cite relevant research.',
 'output': '"Monopsony" refers to a market structure where there is only one buyer for a particular good or service. In economics, this term is particularly relevant in the labor market, where a monopsony employer has significant power over the wages and working conditions of their employees. The presence of a monopsony can result in lower wages and reduced employment opportunities for workers, as the employer has little incentive to increase wages or provide better working conditions.\n\nRecent research has identified potential monopsonies in industries such as retail and fast food, where a few large companies control a significant portion of the market (Bivens & Mishel, 2013). In these industries, workers often face low wages, limited benefits, and reduced bargaining power,

In [7]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, AutoTokenizer

model_name = "microsoft/phi-2"
# model_name = "ybelkada/falcon-7b-sharded-bf16"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    trust_remote_code=True
)
model.config.use_cache = False

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [8]:
print(model)

PhiForCausalLM(
  (transformer): PhiModel(
    (embd): Embedding(
      (wte): Embedding(51200, 2560)
      (drop): Dropout(p=0.0, inplace=False)
    )
    (h): ModuleList(
      (0-31): 32 x ParallelBlock(
        (ln): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
        (resid_dropout): Dropout(p=0.1, inplace=False)
        (mixer): MHA(
          (rotary_emb): RotaryEmbedding()
          (Wqkv): Linear4bit(in_features=2560, out_features=7680, bias=True)
          (out_proj): Linear4bit(in_features=2560, out_features=2560, bias=True)
          (inner_attn): SelfAttention(
            (drop): Dropout(p=0.0, inplace=False)
          )
          (inner_cross_attn): CrossAttention(
            (drop): Dropout(p=0.0, inplace=False)
          )
        )
        (mlp): MLP(
          (fc1): Linear4bit(in_features=2560, out_features=10240, bias=True)
          (fc2): Linear4bit(in_features=10240, out_features=2560, bias=True)
          (act): NewGELUActivation()
        )
      )

In [9]:
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [10]:
from peft import LoraConfig

lora_alpha = 16
lora_dropout = 0.1
lora_r = 64

peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "Wqkv", #"query_key_value",
        "out_proj", #"dense",
        "fc1", #"dense_h_to_4h",
        "fc2", #"dense_4h_to_h",
    ]
)

In [11]:
!pwd

/content


In [12]:
from transformers import TrainingArguments

output_dir = "./results"
per_device_train_batch_size = 2
gradient_accumulation_steps = 8
optim = "paged_adamw_32bit"
save_steps = 20
logging_steps = 10
learning_rate = 2e-4
max_grad_norm = 0.3
max_steps = 500
warmup_ratio = 0.03
lr_scheduler_type = "constant"

training_arguments = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    fp16=True,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=True,
    lr_scheduler_type=lr_scheduler_type,
    #gradient_checkpointing=True,
)

In [13]:
from trl import SFTTrainer

max_seq_length = 256

trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset_mapped,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
)

Map:   0%|          | 0/7856 [00:00<?, ? examples/s]

In [14]:
for name, module in trainer.model.named_modules():
    if "norm" in name:
        print(module)
        module = module.to(torch.float32)

In [15]:
trainer.train()

[34m[1mwandb[0m: Currently logged in as: [33mgsunandhini[0m. Use [1m`wandb login --relogin`[0m to force relogin


You're using a CodeGenTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss
10,1.6532
20,1.4479
30,1.5807
40,1.8864
50,1.9486
60,1.4537
70,1.4056
80,1.4739
90,1.878
100,1.9504


TrainOutput(global_step=500, training_loss=1.6257175617218018, metrics={'train_runtime': 3470.6804, 'train_samples_per_second': 2.305, 'train_steps_per_second': 0.144, 'total_flos': 2.4271959273984e+16, 'train_loss': 1.6257175617218018, 'epoch': 1.02})

In [16]:
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)

In [17]:
# Run text generation pipeline with our next model
prompt = "What is a large language model?"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])



<s>[INST] What is a large language model? [/INST] A large language model is a type of artificial intelligence model that is trained on a large amount of text data, typically in the form of millions of words or sentences. These models are designed to be able to understand and generate human language, and they are often used for tasks such as language translation, text summarization, and chatbot development.
What is the difference between a large language model and a small language model?

A large language model is a model that is trained on a large amount of text data, typically in the form of millions of words or sentences. These models are designed to be able to understand and generate human language, and they are often used for tasks such as language translation, text summarization, and chatbot development.

A small language model, on the other hand, is a model that is trained on a smaller amount of text data, typically in the form of a few thousand words


In [18]:
prompt = "What is a large language model"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])

<s>[INST] What is a large language model [/INST] A large language model (LLM) is a type of artificial intelligence model that is trained on a large amount of text data, typically consisting of millions of words or sentences. These models are designed to be able to understand and generate human-like text, and they are often used for a variety of natural language processing tasks, such as language translation, text summarization, and chatbot development.

Some of the most popular large language models include GPT-2, GPT-3, and BERT. These models are trained on large amounts of text data, and they are able to generate human-like text that is often indistinguishable from that generated by a human.

Large language models are trained using a variety of techniques, including deep learning algorithms and reinforcement learning. These models are able to learn from the text data they are trained on, and they are able to generate text that is consistent with the patterns and


In [20]:
#trainer.model.save_pretrained(new_model)

In [21]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [22]:
cp -r "/content/results/checkpoint-360" "/content/drive/MyDrive/sunandini/session_27/new_model_best"

In [None]:
cp -r "/content/phi2-custom" '/content/drive/MyDrive/sunandini/session_27/'

In [24]:
device_map = {"": 0}
new_model = 'phi2-custom'
from peft import LoraConfig, PeftModel

In [None]:
# from google.colab import drive
# drive.mount('/content/drive')

# model_path = "/content/drive/MyDrive/ERA_V1/S27/phi-2-custom"  # change to your preferred path

# Reload model in FP16 and merge it with LoRA weights
# base_model = AutoModelForCausalLM.from_pretrained(
#     model_name,
#     low_cpu_mem_usage=True,
#     return_dict=True,
#     torch_dtype=torch.float16,
#     device_map=device_map,
# )
model = PeftModel.from_pretrained(base_model, new_model)
model = model.merge_and_unload()

# Reload tokenizer to save it
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# Save the merged model
model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)

In [None]:
model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)

NameError: ignored