<a href="https://colab.research.google.com/github/swathi-dinakaran/Advanced_DataGuided_Business_Intelligence_all_projects/blob/master/GHC_LevelUp_Lab_Natual_Language_Understanding_Fine_tuning_Demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Natural Language Understanding: Create your own language model



# Fine-tuning demo

Welcome to the Llama 2 fine-tuning tutorial. The tutorial has two parts: fine-tuning and inference.

The fine-tuning part is further divided into four milestones.

Let's get in there and train some 🦙 🦙 .....

Access the notebook:

https://tinyurl.com/ghc2023nlu

# Milestone 1

Preparing the environment, installing dependencies and importing required libraries.

In [None]:
!pip install -q accelerate==0.21.0 peft==0.4.0 bitsandbytes==0.40.2 transformers==4.31.0 trl==0.4.7
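
After the install finishes, it can be worth confirming that the pinned versions are the ones actually active in the runtime. The following is a minimal sketch using only the standard library; the helper name `find_version_mismatches` is hypothetical and not part of the original notebook.

```python
from importlib import metadata

# Pins copied from the pip command above.
PINNED = {
    "accelerate": "0.21.0",
    "peft": "0.4.0",
    "bitsandbytes": "0.40.2",
    "transformers": "4.31.0",
    "trl": "0.4.7",
}


def find_version_mismatches(pins):
    """Return {package: (expected, installed_or_None)} for every unsatisfied pin."""
    mismatches = {}
    for package, expected in pins.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            mismatches[package] = (expected, installed)
    return mismatches


print(find_version_mismatches(PINNED) or "All pinned versions are installed.")
```

If the printed dictionary is not empty, restarting the Colab runtime after the install is a reasonable first thing to try.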


In [None]:
import os
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer
from google.colab import drive

# Milestone 2

Load the data.

Process the data to make it ready for the model.

In [None]:
# Load the dataset hosted on Hugging Face
dataset_name = "coconutzhang/ghc_session_data_v2"
fine_tuning_data = load_dataset(dataset_name, split="train")

# Pre-process the data into the instruction/response format used for fine-tuning Llama 2
def generate_llama_prompt(example):
    # "User" holds the instruction; "Prompt" holds the target response
    example['text'] = f"""### Instruction: {example["User"]} ### Response: {example["Prompt"].strip()}""".strip()
    return example

fine_tuning_data = fine_tuning_data.map(generate_llama_prompt)
# Keep only the formatted "text" column
fine_tuning_data = fine_tuning_data.map(lambda example: {"text": example['text']}, remove_columns=['User', 'Prompt'])
# Inspect the first processed example
fine_tuning_data[0]

Loading the dataset downloads the data files and generates a train split of 1,215 examples. The first processed example:

{'text': '### Instruction: midjourney prompt for a female character in a futuristic setting ### Response: < yoshida akihiko art, pixiv art, patreon art, girl art, painting by Yoshida Akihiko, Nier Automata 2B, Nier Automata, r-18, Nier Automata concept art, Akihiko Yoshida concept art, painting by Akihiko Yoshida'}
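
To make the template easier to see in isolation, here is a small sketch that applies the same formatting to a single record without loading the dataset. The function name `format_example` and the sample record are hypothetical; only the template string mirrors `generate_llama_prompt`.

```python
def format_example(user_text, prompt_text):
    """Mirror the template used by generate_llama_prompt for one record."""
    return f"### Instruction: {user_text} ### Response: {prompt_text.strip()}".strip()


# Hypothetical record for illustration; it is not taken from the dataset.
record = {
    "User": "midjourney prompt for a lighthouse at dusk",
    "Prompt": "  lighthouse, dusk, oil painting  ",
}

text = format_example(record["User"], record["Prompt"])
print(text)
```

The `.strip()` calls remove stray whitespace around the response, so every training example follows the same shape.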

# Milestone 3

Prepare the configurations required for training.

In [None]:
# Define the 4-bit quantization config used for QLoRA
compute_dtype = torch.float16

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=False,
)
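
To see why 4-bit loading matters on a Colab GPU, the back-of-the-envelope sketch below compares the memory needed for the weights alone. It assumes a round 7 billion parameters and ignores quantization constants, layers that stay in higher precision, activations, and optimizer state, so treat the numbers as rough lower bounds. The helper name `weight_memory_gib` is hypothetical.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


NUM_PARAMS = 7e9  # assumption: a round 7B parameter count

for label, bits in [("float16", 16), ("4-bit (nf4)", 4)]:
    print(f"{label}: about {weight_memory_gib(NUM_PARAMS, bits):.1f} GiB")
```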

In [None]:
# Load the base model with the quantization config
model_name = "TinyPixel/Llama-2-7B-bf16-sharded"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map={"": 0}
)
model.config.use_cache = False
model.config.pretraining_tp = 1

In [None]:
# Load LLaMA tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right" # Fixing overflow issue with padding

# Define the LoRA configuration
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
)
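
The sketch below estimates how many trainable parameters this LoRA configuration adds. A rank-`r` adapter on a weight matrix of shape `d_out x d_in` adds two small matrices with `r * (d_in + d_out)` parameters in total, and its output is scaled by `lora_alpha / r`. The model dimensions (hidden size 4096, 32 layers) and the choice of two adapted projections per layer (`q_proj` and `v_proj`) are assumptions about the 7B model and the library defaults, since `target_modules` is not set in the config above.

```python
def lora_param_count(d_in, d_out, r):
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


HIDDEN_SIZE = 4096      # assumption: hidden size of the 7B model
NUM_LAYERS = 32         # assumption: number of transformer layers
TARGETS_PER_LAYER = 2   # assumption: q_proj and v_proj are adapted

r, lora_alpha = 64, 16  # values from the LoraConfig above

per_module = lora_param_count(HIDDEN_SIZE, HIDDEN_SIZE, r)
total_trainable = per_module * TARGETS_PER_LAYER * NUM_LAYERS
scaling = lora_alpha / r

print(f"per adapted matrix: {per_module:,}")
print(f"total trainable:    {total_trainable:,}")
print(f"LoRA scaling:       {scaling}")
```

Under these assumptions the adapters are well under 1% of the base model's parameters, which is what makes fine-tuning on a single GPU feasible.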

In [None]:
# Set training hyperparameters
training_arguments = TrainingArguments(
    output_dir="./results",
    num_train_epochs=4,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=1,
    optim="paged_adamw_32bit",
    logging_steps=50,
    learning_rate=2e-4,
    fp16=False,
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
)

# Set supervised fine-tuning parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=fine_tuning_data,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=None,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=False,
)
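
As a rough guide to what these hyperparameters imply, the sketch below estimates the number of optimizer steps and shows the shape of a cosine schedule with linear warmup. It assumes the 1,215 training examples reported when the dataset was loaded and a single GPU. The functions `schedule_summary` and `cosine_lr` are hypothetical illustrations written in plain Python, and the trainer's own step accounting may differ slightly.

```python
import math


def schedule_summary(num_examples, batch_size, grad_accum, epochs, warmup_ratio):
    """Estimate total optimizer steps and warmup steps for one device."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    steps_per_epoch = max(batches_per_epoch // grad_accum, 1)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return total_steps, warmup_steps


def cosine_lr(step, total_steps, warmup_steps, peak_lr):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))


total_steps, warmup_steps = schedule_summary(
    num_examples=1215, batch_size=2, grad_accum=1, epochs=4, warmup_ratio=0.03
)
print(f"total steps: {total_steps}, warmup steps: {warmup_steps}")

# Sample the learning rate at the start, end of warmup, midpoint, and final step.
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, f"{cosine_lr(step, total_steps, warmup_steps, 2e-4):.2e}")
```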

# Milestone 4

Train the model.

Save the trained model.

In [None]:
# Train model
trainer.train()

In [None]:
# Save the trained LoRA adapter weights
trainer.model.save_pretrained("llama-2-7b-midjourney-ghc2023")

# Mount Google Drive so the saved model can be copied there
drive.mount('/content/drive/')

In [None]:
# Copy the model to your Google Drive
!cp -r llama-2-7b-midjourney-ghc2023 /content/drive/MyDrive/
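
For the inference part, one possible way to reuse the saved adapter is sketched below. It assumes a fresh session in which the base model and tokenizer have been loaded again as in Milestone 3, and that the adapter directory is the one saved above. The function names `build_prompt` and `generate_response`, and the default of 100 new tokens, are hypothetical choices for illustration.

```python
def build_prompt(instruction):
    """Training template with the response left empty for the model to complete."""
    return f"### Instruction: {instruction} ### Response:"


def generate_response(base_model, tokenizer, instruction,
                      adapter_dir="llama-2-7b-midjourney-ghc2023",
                      max_new_tokens=100):
    """Attach the saved LoRA adapter to a freshly loaded base model and generate text."""
    # Imported lazily so build_prompt can be used without the GPU stack installed.
    from peft import PeftModel

    model = PeftModel.from_pretrained(base_model, adapter_dir)
    model.eval()

    inputs = tokenizer(build_prompt(instruction), return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


print(build_prompt("midjourney prompt for a lighthouse at dusk"))
```

Using the same `### Instruction: ... ### Response:` template at inference time keeps the input close to what the model saw during fine-tuning.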

**Other experiment ideas:**

- Load a larger model (from Hugging Face or any other platform)

- Change the number of epochs

- Try larger datasets