 # **Finetuning a large language model using [H2O LLM Studio](https://github.com/h2oai/h2o-llmstudio)**

In this notebook, we demonstrate how to finetune a large language model using the command-line interface (CLI) of H2O LLM Studio.



In [None]:
!git clone https://github.com/h2oai/h2o-llmstudio.git
!cd h2o-llmstudio && git checkout ce10af57ff118a2bbb81b5b3eae12273e290299a -q
!cp -r h2o-llmstudio/. ./
!rm -r h2o-llmstudio

Cloning into 'h2o-llmstudio'...
remote: Enumerating objects: 3393, done.
remote: Counting objects: 100% (2077/2077), done.
remote: Compressing objects: 100% (873/873), done.
remote: Total 3393 (delta 1569), reused 1536 (delta 1182), pack-reused 1316
Receiving objects: 100% (3393/3393), 19.65 MiB | 19.48 MiB/s, done.
Resolving deltas: 100% (2246/2246), done.


In [None]:
# Install Python 3.10, which will be used within pipenv
!sudo add-apt-repository ppa:deadsnakes/ppa -y > /dev/null
!sudo apt install python3.10 python3.10-distutils psmisc -y > /dev/null
!curl -sS https://bootstrap.pypa.io/get-pip.py | python3.10 > /dev/null

# install requirements
!make setup > /dev/null



[0m[1mCreating a virtualenv for this project...[0m
Pipfile: [33m[1m/content/Pipfile[0m
[1mUsing[0m [33m[1m/usr/local/bin/python[0m [32m(3.10.12)[0m [1mto create virtualenv...[0m
⠼[0m Creating virtual environment...[K[36mcreated virtual environment CPython3.10.12.final.0-64 in 1004ms
  creator Venv(dest=/root/.local/share/virtualenvs/content-cQIIIOO2, clear=False, no_vcs_ignore=False, global=False, describe=CPython3Posix)
  seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=/root/.local/share/virtualenv)
    added seed packages: pip==23.1.2, setuptools==67.8.0, wheel==0.40.0
  activators BashActivator,CShellActivator,FishActivator,NushellActivator,PowerShellActivator,PythonActivator
[0m
[K[?25h[32m[22m✔ Successfully created virtual environment![39m[22m[0m 
[32mVirtualenv location: /root/.local/share/virtualenvs/content-cQIIIOO2[0m


In [None]:
!python -m pip install datasets > /dev/null
# -p creates parent directories and does not fail if the cell is rerun
!mkdir -p data/oasst-data


### **Loading the Bhagwad Gita Data**

The Bhagwad Gita data contains several columns, including ID, Chapter, Verse, Shloka, Transliteration, HinMeaning, EngMeaning, and WordMeaning. Let's load this data and take a closer look at it.





In [None]:
import pandas as pd
import json

bhagavath_gita_data=pd.read_csv('/content/Bhagwad_Gita.csv')
bhagavath_gita_data.head()

Unnamed: 0,ID,Chapter,Verse,Shloka,Transliteration,HinMeaning,EngMeaning,WordMeaning
0,BG1.1,1,1,धृतराष्ट्र उवाच |\nधर्मक्षेत्रे कुरुक्षेत्रे स...,dhṛtarāṣṭra uvāca .\ndharmakṣetre kurukṣetre s...,।।1.1।।धृतराष्ट्र ने कहा -- हे संजय ! धर्मभूमि...,1.1 Dhritarashtra said What did my people and...,1.1 धर्मक्षेत्रे on the holy plain? कुरुक्षेत्...
1,BG1.2,1,2,सञ्जय उवाच |\nदृष्ट्वा तु पाण्डवानीकं व्यूढं द...,sañjaya uvāca .\ndṛṣṭvā tu pāṇḍavānīkaṃ vyūḍha...,।।1.2।।संजय ने कहा -- पाण्डव-सैन्य की व्यूह रच...,1.2. Sanjaya said Having seen the army of the...,1.2 दृष्ट्वा having seen? तु indeed? पाण्डवानी...
2,BG1.3,1,3,पश्यैतां पाण्डुपुत्राणामाचार्य महतीं चमूम् |\n...,paśyaitāṃ pāṇḍuputrāṇāmācārya mahatīṃ camūm .\...,।।1.3।।हे आचार्य ! आपके बुद्धिमान शिष्य द्रुपद...,"1.3. ""Behold, O Teacher! this mighty army of t...",1.3 पश्य behold? एताम् this? पाण्डुपुत्राणाम् ...
3,BG1.4,1,4,अत्र शूरा महेष्वासा भीमार्जुनसमा युधि |\nयुयुध...,atra śūrā maheṣvāsā bhīmārjunasamā yudhi .\nyu...,।।1.4।।इस सेना में महान् धनुर्धारी शूर योद्धा ...,"1.4. Here are heroes, mighty archers, eal in b...",1.4 अत्र here? शूराः heroes? महेष्वासाः mighty...
4,BG1.5,1,5,धृष्टकेतुश्चेकितानः काशिराजश्च वीर्यवान् |\nपु...,dhṛṣṭaketuścekitānaḥ kāśirājaśca vīryavān .\np...,"।।1.5।।धृष्टकेतु, चेकितान, बलवान काशिराज, पुर...","1.5. ""Dhrishtaketu, chekitana and the valiant ...",1.5 धृष्टकेतुः Dhrishtaketu? चेकितानः Chekitan...


### **Creating a Prompt and Response Dataframe**

In this step, we transform the Bhagwad Gita data into a prompt and response dataframe. We prepend a question to the 'HinMeaning' column (the Hindi meaning of each shloka), so that each row asks for the English commentary of that Hindi text. We then keep only this column and 'EngMeaning', renamed to `prompt` and `Response`. The resulting dataframe provides the prompt and response pairs used to train the language model.

In [None]:
bhagavath_gita_data['HinMeaning']= "What is English commentary of this Hindi Shloka in Bhagvath Gita:  "+bhagavath_gita_data['HinMeaning']
bhagavath_gita_data


Unnamed: 0,ID,Chapter,Verse,Shloka,Transliteration,HinMeaning,EngMeaning,WordMeaning
0,BG1.1,1,1,धृतराष्ट्र उवाच |\nधर्मक्षेत्रे कुरुक्षेत्रे स...,dhṛtarāṣṭra uvāca .\ndharmakṣetre kurukṣetre s...,What is English commentary of this Hindi Shlok...,1.1 Dhritarashtra said What did my people and...,1.1 धर्मक्षेत्रे on the holy plain? कुरुक्षेत्...
1,BG1.2,1,2,सञ्जय उवाच |\nदृष्ट्वा तु पाण्डवानीकं व्यूढं द...,sañjaya uvāca .\ndṛṣṭvā tu pāṇḍavānīkaṃ vyūḍha...,What is English commentary of this Hindi Shlok...,1.2. Sanjaya said Having seen the army of the...,1.2 दृष्ट्वा having seen? तु indeed? पाण्डवानी...
2,BG1.3,1,3,पश्यैतां पाण्डुपुत्राणामाचार्य महतीं चमूम् |\n...,paśyaitāṃ pāṇḍuputrāṇāmācārya mahatīṃ camūm .\...,What is English commentary of this Hindi Shlok...,"1.3. ""Behold, O Teacher! this mighty army of t...",1.3 पश्य behold? एताम् this? पाण्डुपुत्राणाम् ...
3,BG1.4,1,4,अत्र शूरा महेष्वासा भीमार्जुनसमा युधि |\nयुयुध...,atra śūrā maheṣvāsā bhīmārjunasamā yudhi .\nyu...,What is English commentary of this Hindi Shlok...,"1.4. Here are heroes, mighty archers, eal in b...",1.4 अत्र here? शूराः heroes? महेष्वासाः mighty...
4,BG1.5,1,5,धृष्टकेतुश्चेकितानः काशिराजश्च वीर्यवान् |\nपु...,dhṛṣṭaketuścekitānaḥ kāśirājaśca vīryavān .\np...,What is English commentary of this Hindi Shlok...,"1.5. ""Dhrishtaketu, chekitana and the valiant ...",1.5 धृष्टकेतुः Dhrishtaketu? चेकितानः Chekitan...
...,...,...,...,...,...,...,...,...
696,BG18.74,18,74,सञ्जय उवाच |\nइत्यहं वासुदेवस्य पार्थस्य च महा...,sañjaya uvāca .\nityahaṃ vāsudevasya pārthasya...,What is English commentary of this Hindi Shlok...,18.74 Sanjaya said Thus I have heard this won...,18.74 इति thus? अहम् I? वासुदेवस्य of Krishna?...
697,BG18.75,18,75,व्यासप्रसादाच्छ्रुतवानेतद्गुह्यमहं परम् |\nयोग...,vyāsaprasādācchrutavānetadguhyamahaṃ param .\n...,What is English commentary of this Hindi Shlok...,18.75 Through the grace of Vyasa I have heard ...,18.75 व्यासप्रसादात् through the grace of Vyas...
698,BG18.76,18,76,राजन्संस्मृत्य संस्मृत्य संवादमिममद्भुतम् |\nक...,rājansaṃsmṛtya saṃsmṛtya saṃvādamimamadbhutam ...,What is English commentary of this Hindi Shlok...,"18.76 O King, remembering this wonderful and h...",18.76 राजन् O King? संस्मृत्य having remembere...
699,BG18.77,18,77,तच्च संस्मृत्य संस्मृत्य रूपमत्यद्भुतं हरेः |\...,tacca saṃsmṛtya saṃsmṛtya rūpamatyadbhutaṃ har...,What is English commentary of this Hindi Shlok...,"18.77 And, remembering again and again, also t...",18.77 तत् that? च and? संस्मृत्य having rememb...


In [None]:
bhagavath_gita_data=bhagavath_gita_data[['HinMeaning','EngMeaning']].rename(columns={'HinMeaning':"prompt",'EngMeaning':'Response'})
bhagavath_gita_data

Unnamed: 0,prompt,Response
0,What is English commentary of this Hindi Shlok...,1.1 Dhritarashtra said What did my people and...
1,What is English commentary of this Hindi Shlok...,1.2. Sanjaya said Having seen the army of the...
2,What is English commentary of this Hindi Shlok...,"1.3. ""Behold, O Teacher! this mighty army of t..."
3,What is English commentary of this Hindi Shlok...,"1.4. Here are heroes, mighty archers, eal in b..."
4,What is English commentary of this Hindi Shlok...,"1.5. ""Dhrishtaketu, chekitana and the valiant ..."
...,...,...
696,What is English commentary of this Hindi Shlok...,18.74 Sanjaya said Thus I have heard this won...
697,What is English commentary of this Hindi Shlok...,18.75 Through the grace of Vyasa I have heard ...
698,What is English commentary of this Hindi Shlok...,"18.76 O King, remembering this wonderful and h..."
699,What is English commentary of this Hindi Shlok...,"18.77 And, remembering again and again, also t..."
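
The two cells above modify `bhagavath_gita_data` in place, so rerunning the first one would add the prefix twice. As a minimal sketch, the same transformation can be written as a function that returns a new frame and leaves its input untouched. The helper name `to_prompt_response` and the toy rows are illustrative only; they are not part of the original data or of H2O LLM Studio.

```python
import pandas as pd

PREFIX = "What is English commentary of this Hindi Shloka in Bhagvath Gita:  "


def to_prompt_response(frame: pd.DataFrame, prefix: str = PREFIX) -> pd.DataFrame:
    """Return a new two-column frame; the input frame is not modified."""
    return pd.DataFrame(
        {
            "prompt": prefix + frame["HinMeaning"].astype(str),
            "Response": frame["EngMeaning"].astype(str),
        }
    )


# Toy rows standing in for the real CSV (placeholder text, not actual verses).
toy = pd.DataFrame(
    {
        "HinMeaning": ["hindi text 1", "hindi text 2"],
        "EngMeaning": ["english text 1", "english text 2"],
    }
)
pairs = to_prompt_response(toy)
print(pairs.columns.tolist())
```

Because the function builds a fresh frame, it can be called repeatedly on the loaded CSV without stacking prefixes.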


### **Loading and Processing Commentary Data**



In [None]:


with open('/content/translation.json', 'r') as f:
    commentary = json.load(f)

english_commentary = {}
for comment in commentary:
    lang_id = comment['language_id']
    verse_id = comment['verse_id']
    if lang_id == 1 and verse_id not in english_commentary:
        english_commentary[verse_id] = comment

english_commentary = sorted(english_commentary.values(), key=lambda x: x['verseNumber'])

data = []
for i in range(len(english_commentary) - 1):
    # Extract information from the commentaries
    prompt_verse_number = english_commentary[i]['verseNumber']
    prompt_description = english_commentary[i]['description']
    response_verse_number = english_commentary[i+1]['verseNumber']
    response_description = english_commentary[i+1]['description']

    # Creating the prompt
    prompt = f'Verse {prompt_verse_number}: {prompt_description}'

    # Creating the response
    response = f'Verse {response_verse_number}: {response_description}'

    data.append([prompt, response])

df = pd.DataFrame(data, columns=['prompt', 'Response'])

df['prompt']= "Given the current verse, separated by a colon (:), Provide the subsequent next verse  :  "+df['prompt']


# DataFrame.append is deprecated (removed in pandas 2.0); pd.concat is the replacement
bhagavath_gita_data = pd.concat([bhagavath_gita_data, df])
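
The cell above pairs each commentary with the one that follows it after sorting by `verseNumber`. The sketch below reproduces that pairing on toy records to show how the choice of sort key affects the result. It assumes that `verseNumber` restarts at 1 in every chapter while `verse_id` runs across the whole text. That is an assumption about `translation.json` and should be checked against the actual file. The helper `next_verse_pairs` and the record values are hypothetical.

```python
# Toy records mimicking the fields used above; the values are placeholders.
records = [
    {"verse_id": 3, "verseNumber": 1, "description": "chapter 2, verse 1"},
    {"verse_id": 1, "verseNumber": 1, "description": "chapter 1, verse 1"},
    {"verse_id": 2, "verseNumber": 2, "description": "chapter 1, verse 2"},
]


def next_verse_pairs(items, sort_key):
    """Pair each record with its successor after sorting by `sort_key`."""
    ordered = sorted(items, key=lambda r: r[sort_key])
    return [
        (
            f"Verse {a['verseNumber']}: {a['description']}",
            f"Verse {b['verseNumber']}: {b['description']}",
        )
        for a, b in zip(ordered, ordered[1:])
    ]


by_number = next_verse_pairs(records, "verseNumber")
by_id = next_verse_pairs(records, "verse_id")

for prompt, response in by_id:
    print(prompt, "->", response)
```

Under that assumption, sorting by `verseNumber` alone places verses from different chapters next to each other, whereas sorting by `verse_id` keeps the reading order. If the real file behaves this way, `verse_id` may be the safer key.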


In [None]:
df = pd.read_csv('/content/Bhagwad_Gita.csv')
unique_chapters = list(df['Chapter'].unique())

data = []

for chapter_number in unique_chapters:
    # Getting the unique verses for this chapter
    unique_verses = list(df.loc[df['Chapter'] == chapter_number, 'Verse'].unique())

    for verse_number in unique_verses:
        # Getting the English meaning for this chapter and verse
        response_series = df.loc[(df['Verse'] == verse_number) & (df['Chapter'] == chapter_number), 'EngMeaning']
        if not response_series.empty:
            response = response_series.iloc[0]
            prompt = f'Please explain verse {verse_number} of chapter {chapter_number} from the Bhagavad Gita'
            data.append([prompt, response])

df = pd.DataFrame(data, columns=['prompt', 'Response'])
# DataFrame.append is deprecated (removed in pandas 2.0); pd.concat is the replacement
bhagavath_gita_data = pd.concat([bhagavath_gita_data, df])
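
Stacking frames keeps each frame's own row labels, which is why the combined frame displayed in the next cell still ends at index 699 although it holds the rows of all three prompt styles. A small sketch with toy data, using `pd.concat` (the replacement for `DataFrame.append`, which was deprecated in pandas 1.4 and removed in 2.0):

```python
import pandas as pd

first = pd.DataFrame({"prompt": ["a", "b"], "Response": ["x", "y"]})
second = pd.DataFrame({"prompt": ["c", "d"], "Response": ["z", "w"]})

# Default behaviour: each frame keeps its original row labels.
stacked = pd.concat([first, second])

# ignore_index=True relabels the rows 0..n-1 in the combined frame.
renumbered = pd.concat([first, second], ignore_index=True)

print(stacked.index.tolist())
print(renumbered.index.tolist())
```

The repeated labels are harmless here because the frame is written out with `index=False`, but `ignore_index=True` avoids surprises when selecting rows by label.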


In [None]:
bhagavath_gita_data

Unnamed: 0,prompt,Response
0,What is English commentary of this Hindi Shlok...,1.1 Dhritarashtra said What did my people and...
1,What is English commentary of this Hindi Shlok...,1.2. Sanjaya said Having seen the army of the...
2,What is English commentary of this Hindi Shlok...,"1.3. ""Behold, O Teacher! this mighty army of t..."
3,What is English commentary of this Hindi Shlok...,"1.4. Here are heroes, mighty archers, eal in b..."
4,What is English commentary of this Hindi Shlok...,"1.5. ""Dhrishtaketu, chekitana and the valiant ..."
...,...,...
696,Please explain verse 74 of chapter 18 of the B...,18.74 Sanjaya said Thus I have heard this won...
697,Please explain verse 75 of chapter 18 of the B...,18.75 Through the grace of Vyasa I have heard ...
698,Please explain verse 76 of chapter 18 of the B...,"18.76 O King, remembering this wonderful and h..."
699,Please explain verse 77 of chapter 18 of the B...,"18.77 And, remembering again and again, also t..."


In [None]:
bhagavath_gita_data.to_csv('/content/data/oasst-data/Bhagwad_Gita.csv',index=False)
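
Before training on the saved file, it can help to count rows that would make poor examples. The following is a minimal sketch; `check_pairs` is a hypothetical helper, not part of H2O LLM Studio, and it is shown on toy data rather than on the real frame.

```python
import pandas as pd


def check_pairs(frame: pd.DataFrame) -> dict:
    """Count rows that would make poor training examples."""
    prompt = frame["prompt"]
    response = frame["Response"]
    # "blank" also counts missing values, since they are filled with "" first.
    blank = prompt.fillna("").str.strip().eq("") | response.fillna("").str.strip().eq("")
    return {
        "rows": len(frame),
        "missing": int((prompt.isna() | response.isna()).sum()),
        "blank": int(blank.sum()),
        "duplicate_prompts": int(prompt.duplicated().sum()),
    }


toy = pd.DataFrame(
    {"prompt": ["p1", "p1", None, "p4"], "Response": ["r1", "r2", "r3", "  "]}
)
report = check_pairs(toy)
print(report)
```

Calling `check_pairs(bhagavath_gita_data)` on the combined frame would give the same counts for the real data.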

In [None]:
# !mv /content/Bhagwad_Gita.csv /content/data/oasst-data

## Preparing Configurations

In H2O LLM Studio, we use dataclasses to specify various [finetuning parameters](https://github.com/h2oai/h2o-llmstudio/blob/main/docs/parameters.md).



In [None]:
%%writefile cfg_notebook.py

import os
from dataclasses import dataclass

from llm_studio.python_configs.text_causal_language_modeling_config import ConfigProblemBase, ConfigNLPCausalLMDataset, \
    ConfigNLPCausalLMTokenizer, ConfigNLPAugmentation, ConfigNLPCausalLMArchitecture, ConfigNLPCausalLMTraining, \
    ConfigNLPCausalLMPrediction, ConfigNLPCausalLMEnvironment, ConfigNLPCausalLMLogging


ROOT_DIR = "./data/oasst-data/"


@dataclass
class Config(ConfigProblemBase):
    output_directory: str = "output/demo_oasst-data/"
    experiment_name: str = "demo_experiment"
    llm_backbone: str = "gpt2-xl"

    dataset: ConfigNLPCausalLMDataset = ConfigNLPCausalLMDataset(
        train_dataframe=os.path.join(ROOT_DIR, "Bhagwad_Gita.csv"),

        validation_strategy="automatic",
        validation_dataframe="",
        validation_size=0.01,

        prompt_column=("prompt",),
        answer_column="Response",
        text_prompt_start="<s>",
        text_answer_separator="<sep>",

        add_eos_token_to_prompt=True,
        add_eos_token_to_answer=True,
        mask_prompt_labels=True,

    )
    tokenizer: ConfigNLPCausalLMTokenizer = ConfigNLPCausalLMTokenizer(
        max_length_prompt=256,
        max_length_answer=256,
        max_length=256,
        padding_quantile=1.0
    )
    augmentation: ConfigNLPAugmentation = ConfigNLPAugmentation(token_mask_probability=0.0)
    architecture: ConfigNLPCausalLMArchitecture = ConfigNLPCausalLMArchitecture(
        backbone_dtype="float16",
        gradient_checkpointing=False,
        force_embedding_gradients=False,
        intermediate_dropout=0
    )
    training: ConfigNLPCausalLMTraining = ConfigNLPCausalLMTraining(
        loss_function="CrossEntropy",
        optimizer="AdamW",

        learning_rate=0.00015,

        batch_size=2,
        drop_last_batch=True,
        epochs=2,
        schedule="Cosine",
        warmup_epochs=0.0,

        weight_decay=0.0,
        gradient_clip=0.0,
        grad_accumulation=1,

        lora=True,
        lora_r=4,
        lora_alpha=16,
        lora_dropout=0.05,
        lora_target_modules="",

        save_best_checkpoint=False,
        evaluation_epochs=1.0,
        evaluate_before_training=False,
    )
    prediction: ConfigNLPCausalLMPrediction = ConfigNLPCausalLMPrediction(
        metric="BLEU",

        min_length_inference=2,
        max_length_inference=256,
        batch_size_inference=0,

        do_sample=False,
        num_beams=4,
        temperature=0.3,
        repetition_penalty=1.8,
    )
    environment: ConfigNLPCausalLMEnvironment = ConfigNLPCausalLMEnvironment(
        mixed_precision=True,
        number_of_workers=4,
        seed=1
    )

Overwriting cfg_notebook.py
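
With `lora=True` and `lora_r=4`, only small low-rank adapter matrices are trained. The model-loading log later in this notebook reports `trainable params: 1228800` out of `all params: 1558840000`. As a rough check, the arithmetic below reproduces that count under two assumptions: GPT-2 XL has 48 transformer blocks with hidden size 1600, and the adapters wrap only the fused attention projection (1600 to 4800) in each block. Which modules are wrapped when `lora_target_modules` is left empty is an assumption here, not something the config states.

```python
# Assumed GPT-2 XL shape: 48 transformer blocks, hidden size 1600.
N_LAYERS = 48
HIDDEN = 1600
LORA_R = 4  # lora_r from the config above

# Assumption: LoRA wraps only the fused attention projection (hidden -> 3 * hidden).
in_features, out_features = HIDDEN, 3 * HIDDEN

# Each adapted layer adds two matrices: A (r x in) and B (out x r).
per_layer = LORA_R * (in_features + out_features)
lora_params = N_LAYERS * per_layer

ALL_PARAMS = 1_558_840_000  # "all params" as reported in the log
trainable_percent = 100 * lora_params / ALL_PARAMS

print(lora_params, trainable_percent)
```

Doubling `lora_r` doubles the adapter size under the same assumptions, so the trainable share stays well below one percent for small ranks.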


In [None]:
%%writefile run.sh
echo "Training Started..."

pipenv run python train.py -C cfg_notebook.py &

wait
echo "Training Completed...."

Overwriting run.sh


In [None]:
!sh run.sh

Training Started...

Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
  warn(msg)
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /root/.local/share/virtualenvs/content-cQIIIOO2/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda118.so...
2023-07-01 07:50:09,111 - INFO: Global random seed: 1
2023-07-01 07:50:09,113 - INFO: Preparing the data...
2023-07-01 07:50:09,113 - INFO: Setting up automatic validation split...
2023-07-01 07:50:09,152 - INFO: Preparing train and validation data
2023-07-01 07:50:09,152 - INFO: Loading train dataset...
Downloading (…)lve/main/config.json: 100% 689/689 [00:00<00:00, 629kB/s]
Downloading (…)olve/main/vocab.json: 100% 1.04M/1.04M [00:00<00:00, 4.91MB/s]
D

In [None]:
val_outputs = pd.read_csv("output/demo_oasst-data/validation_predictions.csv")



## Inference and prompting

In [None]:
!pipenv run python prompt.py --e output/demo_oasst-data/


Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
  warn(msg)
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /root/.local/share/virtualenvs/content-cQIIIOO2/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda118.so...
Using pad_token, but it is not set yet.
Using cls_token, but it is not set yet.
Using sep_token, but it is not set yet.
Loading model weights...
trainable params: 1228800 || all params: 1558840000 || trainable%: 0.07882784634728388

You can change inference parameters on the fly by typing --param value, such as --num_beams 4. You can also chain them such as --num_beams 4 --top_k 30.

Please enter some prompt (type 'exit' to stop): Please explain verse 75 of chapter 18 of th