# LLM Fine-Tuning on chat data using 🔭 Galileo's auto

In this tutorial we will upload chat data to Galileo's console.

We use a small sample chat dataset with `jsonl` data, but users can provide data via a Pandas DataFrame, Hugging Face datasets, or a local path to the dataset stored as a `.csv`, `.json`, or `.jsonl` file.

**Make sure to select GPU in your Runtime! (Runtime -> Change Runtime type)**

# Install Dependencies [Including Setting up DQ] + Add Imports

In [None]:
#@title Install `dataquality`

# Upgrade pip
!pip install -U pip &> /dev/null

# Install all dependencies
!pip install 'dataquality[cuda]' --extra-index-url=https://pypi.nvidia.com/
print('👋 Installed necessary libraries.')

# 1. Initialize Galileo

In [None]:
# 🔭🌕 Galileo log-in
import os

# Update these so that you can log in to Galileo
# without having to enter your credentials every time
os.environ["GALILEO_CONSOLE_URL"] = ""
os.environ["GALILEO_USERNAME"] = ""
os.environ["GALILEO_PASSWORD"] = ""

import dataquality as dq
dq.configure()
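The comments above imply that any variable left empty has to be entered by hand at log-in. A minimal sketch of a pre-flight check that lists which of the three variables are still blank; `missing_settings` is a hypothetical helper written for this tutorial, not part of `dataquality`:

```python
import os

# The three variables this tutorial sets before calling dq.configure().
REQUIRED_VARS = ("GALILEO_CONSOLE_URL", "GALILEO_USERNAME", "GALILEO_PASSWORD")


def missing_settings(env=os.environ):
    """Return the names of required variables that are unset or blank."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


unset = missing_settings()
if unset:
    print("Not set yet:", ", ".join(unset))
```

Passing a plain dict as `env` makes the helper easy to try without touching the real environment.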

# 2. Set Data
We build a small mock chat dataset as a Pandas DataFrame for fine-tuning an Encoder-Decoder model. The data does not specify a val/test split, so auto randomly samples it into train/val with the ratios (0.8, 0.2). Use the auto docs to learn more about how to configure your own training/val/test sets.

NOTE: We are working with LLMs (emphasis on Large), so training times can be long on sizable datasets (Alpaca, for instance, has 52,000 data samples). The mock dataset here has only 10 samples, but for larger data consider setting the `max_train_size` parameter to fit your data needs.
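To picture the random train/val split with ratios (0.8, 0.2), here is a minimal sketch in plain Python. It illustrates the idea only and is not how `auto` implements it; `random_train_val_split` and its `seed` argument are hypothetical.

```python
import random


def random_train_val_split(rows, train_ratio=0.8, seed=0):
    """Shuffle row indices and cut them into train/val by `train_ratio`."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # seeded so the split is repeatable
    cut = int(round(len(rows) * train_ratio))
    train = [rows[i] for i in indices[:cut]]
    val = [rows[i] for i in indices[cut:]]
    return train, val


train_rows, val_rows = random_train_val_split(list(range(10)))
print(len(train_rows), len(val_rows))  # 8 2
```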

In [None]:
#@title Load mock data
sample = {
    "turns": [
        {
            "role": "User",
            "content": "What is the meaning of life?",
            "my_metric": 1.618,  # These fields will show up as metadata
            "other_metric": 5,  # These fields will show up as metadata
        },
        {
            "role": "Assistant",
            "content": "I cannot answer that with certainty, but I hear it is 42.",
            "my_metric": 2.718,
            "other_metric": 3,
        },
        {
            "role": "User",
            "content": "Hmm, what does that mean?",
            "my_metric": 0.001,
            "other_metric": 4,
        },
        {
            "role": "Assistant",
            "content": "To me it means that you should always be nice to others.",
            "my_metric": 1.234,
            "other_metric": 4,
        },
    ],
    "score": 3.14, # This field will also be logged as metadata
    "metadata": {
        "sample_id": "1234",
        "annotator": "Bugs Bunny",
    }
}
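Before logging real data, it can help to confirm that every sample follows the layout shown above: a list of turns, each with a role and a content string. A minimal sketch of such a check; `check_chat_sample` is a hypothetical helper, and the column names are assumed to match the mock sample.

```python
def check_chat_sample(sample, turns_col="turns", role_col="role", content_col="content"):
    """Return a list of problems found in one chat sample (empty if none)."""
    turns = sample.get(turns_col)
    if not isinstance(turns, list) or not turns:
        return [f"'{turns_col}' must be a non-empty list"]
    problems = []
    for i, turn in enumerate(turns):
        if not isinstance(turn, dict):
            problems.append(f"turn {i}: expected a dict")
            continue
        for key in (role_col, content_col):
            if not isinstance(turn.get(key), str):
                problems.append(f"turn {i}: missing or non-string '{key}'")
    return problems


example = {
    "turns": [
        {"role": "User", "content": "Hi"},
        {"role": "Assistant"},  # deliberately missing "content"
    ]
}
print(check_chat_sample(example))
```

Calling `check_chat_sample(sample)` on the mock sample above should return an empty list.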

In [None]:
import pandas as pd

n_samples = 10
dataset = pd.DataFrame([sample] * n_samples)
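Since the introduction mentions that data can also be supplied as a local `.jsonl` file, here is a minimal sketch of writing chat samples to JSONL (one JSON object per line) and reading them back with the standard library. The file name and the shortened rows are made up for illustration.

```python
import json
import os
import tempfile

# Three copies of a shortened, made-up chat sample.
rows = [
    {"turns": [{"role": "User", "content": "Hi"},
               {"role": "Assistant", "content": "Hello!"}]}
] * 3

path = os.path.join(tempfile.mkdtemp(), "chat_sample.jsonl")

# JSONL: one JSON object per line.
with open(path, "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# Read the file back, skipping any blank lines.
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f if line.strip()]

print(len(loaded), loaded[0]["turns"][1]["content"])
```

The same file can be loaded into a DataFrame with `pd.read_json(path, lines=True)`.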

# 3. Setup configuration with Galileo
Galileo auto uses 3 classes to set the configuration for the Dataset, Training parameters, and Generation config. While they all have defaults that work out of the box, we also allow granular control over these settings; see the [docs](https://docs.rungalileo.io/galileo/llm-studio/llm-debugger/getting-started) for more info.

For chat data we must include a data formatter, `ChatHistoryFormatter`. It assumes that each sample has a column (default name is `turns`) that contains a list of turn information. Update the fields below to match your dataset column names.

In [None]:
from dataquality.integrations.seq2seq.formatters.chat import ChatHistoryFormatter
from dataquality.integrations.seq2seq.schema import (
    Seq2SeqDatasetConfig,
    Seq2SeqGenerationConfig,
    Seq2SeqTrainingConfig
)

chat_history_formatter = ChatHistoryFormatter(
    hf_tokenizer="google/flan-t5-base",  # This is the default tokenizer
    turns_col="turns",
    metadata_col="metadata",
    content_col="content",
    role_col="role",
    user="User",
    assistant="Assistant",
)


# For in-memory data (Pandas DataFrame or Hugging Face dataset), use `train_data`
# For local files, use `train_path`
dataset_config = Seq2SeqDatasetConfig(
    train_data=dataset,
    formatter=chat_history_formatter,
)
# For chat data with pre-trained models we can skip generation
gen_config = Seq2SeqGenerationConfig(
    generation_splits=[]
)
# Since we are focused on data insights rather than model-specific insights,
# we can train for only 1 epoch to speed up the process
tr_config = Seq2SeqTrainingConfig(
    epochs=1,
)
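To make the `turns` layout and the `role_col`/`content_col`/`user`/`assistant` settings concrete, the sketch below pairs each user turn with the assistant reply that follows it. This only illustrates the data shape and is an assumption about the general idea, not the actual `ChatHistoryFormatter` implementation; `pair_turns` and the `input`/`target` keys are hypothetical.

```python
def pair_turns(turns, role_col="role", content_col="content",
               user="User", assistant="Assistant"):
    """Pair each user message with the assistant message that follows it."""
    pairs = []
    pending_user = None
    for turn in turns:
        if turn[role_col] == user:
            pending_user = turn[content_col]
        elif turn[role_col] == assistant and pending_user is not None:
            pairs.append({"input": pending_user, "target": turn[content_col]})
            pending_user = None
    return pairs


turns = [
    {"role": "User", "content": "What is the meaning of life?"},
    {"role": "Assistant", "content": "I hear it is 42."},
    {"role": "User", "content": "Hmm, what does that mean?"},
    {"role": "Assistant", "content": "Be nice to others."},
]
pairs = pair_turns(turns)
for pair in pairs:
    print(pair)
```

If your dataset labels roles differently (for example `human` and `bot`), those are the values to pass as `user` and `assistant`.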

# 4. Log input data with Galileo auto

Testing `auto` for Seq2Seq tasks is as simple as importing and calling `auto()`. However, we set a few basic parameters in this tutorial such as project/run name and config settings.

In [None]:
from dataquality.integrations.seq2seq.auto import auto

auto(
    project_name="galileo-llm-auto",  # TODO, update project name
    run_name="example_run_galileo-chat_with_auto",  # TODO, update with unique run name
    dataset_config=dataset_config,
    generation_config=gen_config,
    training_config=tr_config,
)
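The TODO above asks for a unique run name. One simple option is to append a timestamp, sketched below; `unique_run_name` is a hypothetical helper. Which characters Galileo accepts in run names is not stated here, so the sketch sticks to letters, digits, hyphens, and underscores.

```python
from datetime import datetime


def unique_run_name(prefix="example_run_galileo-chat_with_auto"):
    """Append a second-resolution timestamp so repeated runs get distinct names."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    return f"{prefix}_{stamp}"


run_name = unique_run_name()
print(run_name)
```

The result can be passed as `run_name` in the `auto()` call above.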