# LLM Fine-Tuning using 🔭 Galileo's auto

In this tutorial, we will fine-tune an Encoder-Decoder model from HuggingFace 🤗 for instruction completion and explore the results in Galileo.

We use the well-known Alpaca instruction-tuning dataset from the [Stanford Alpaca project](https://github.com/tatsu-lab/stanford_alpaca). Along the way, we highlight several known data errors and limitations of this dataset!

**Make sure to select GPU in your Runtime! (Runtime -> Change Runtime type)**

# Install Dependencies [Including Setting up DQ] + Add Imports

In [None]:
#@title Install `dataquality`

# Upgrade pip
!pip install -U pip &> /dev/null

# Install all dependencies
!pip install 'dataquality[cuda]' --extra-index-url=https://pypi.nvidia.com/
print('👋 Installed necessary libraries.')

# 1. Initialize Galileo

In [None]:
# 🔭🌕 Galileo log-in
import os

# Update these so that you can log in to Galileo
# without having to enter your credentials every time
os.environ['GALILEO_CONSOLE_URL']=""
os.environ["GALILEO_USERNAME"]=""
os.environ["GALILEO_PASSWORD"]=""

import dataquality as dq
dq.configure()
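
Before calling `dq.configure()`, it can help to check which of the three variables above are still empty. The sketch below is a minimal illustration; `missing_galileo_vars` is a hypothetical helper, not part of `dataquality`, and it treats an empty string the same as an unset variable.

```python
import os
from typing import List, Mapping, Optional

# The three environment variables set in the cell above
GALILEO_VARS = ("GALILEO_CONSOLE_URL", "GALILEO_USERNAME", "GALILEO_PASSWORD")


def missing_galileo_vars(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Hypothetical helper: list the Galileo variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in GALILEO_VARS if not env.get(name)]


missing = missing_galileo_vars()
if missing:
    print("Still to fill in:", ", ".join(missing))
else:
    print("All Galileo credentials are set.")
```

Passing a plain dictionary as `env` makes the check easy to try without touching the real environment.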

# 2. Set Data
We load the data from Hugging Face for fine-tuning an Encoder-Decoder model. The original Alpaca dataset does not specify a val/test split, so auto randomly splits the samples into train/val sets with the ratios (0.83, 0.17). See the auto docs to learn how to configure your own training/val/test sets.

NOTE: We are working with LLMs (emphasis on Large), and Alpaca is a decently sized dataset with 52,000 samples, so training can take a long time. To speed up training in this tutorial, we default the training set size to 1000 samples (and thus 200 for val). Consider changing the `max_train_size` parameter to fit your data needs.
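
To make the split arithmetic concrete, here is a minimal pure-Python sketch of a random train/val split over sample indices. It is not Galileo's implementation; `random_train_val_split` is a hypothetical helper, and the example assumes the ratios (0.83, 0.17) are 5/6 and 1/6 rounded to two decimals, which is what makes 1000 training samples pair with 200 validation samples.

```python
import random
from typing import List, Tuple


def random_train_val_split(
    n_samples: int, val_fraction: float, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Hypothetical helper: shuffle sample indices and carve off a validation share."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_val = round(n_samples * val_fraction)
    return indices[n_val:], indices[:n_val]


# Assumption: (0.83, 0.17) are 5/6 and 1/6 rounded to two decimals
train_idx, val_idx = random_train_val_split(1200, val_fraction=1 / 6)
print(len(train_idx), len(val_idx))
```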

In [None]:
#@title Load 🤗 HuggingFace Alpaca Dataset
max_train_size = 1000
dataset = "tatsu-lab/alpaca"

# 3. Setup configuration with Galileo
Galileo auto uses three classes to configure the dataset, the training parameters, and generation. All three have defaults that work out of the box, but each also allows granular control over its settings; see the [docs](https://docs.rungalileo.io/galileo/llm-studio/llm-debugger/getting-started) for more info.

In this tutorial, we use the Encoder-Decoder model [`google/flan-t5-small`](https://huggingface.co/google/flan-t5-small) and leverage a simple greedy decoding strategy.
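
Greedy decoding picks the single highest-scoring next token at every step and stops at an end-of-sequence token or a length limit. The toy sketch below illustrates the idea with a made-up bigram table standing in for a model; `greedy_decode`, `TOY_BIGRAMS`, and `toy_scores` are hypothetical names, and this is not how auto or HuggingFace implement generation internally.

```python
from typing import Callable, Dict, List

# Made-up next-token scores keyed by the previous token (illustration only)
TOY_BIGRAMS: Dict[str, Dict[str, float]] = {
    "<s>": {"hello": 0.6, "hi": 0.4},
    "hello": {"world": 0.7, "</s>": 0.3},
    "world": {"</s>": 0.9, "hello": 0.1},
    "hi": {"</s>": 1.0},
}


def toy_scores(tokens: List[str]) -> Dict[str, float]:
    """Stand-in for a model: score candidates from the last token only."""
    return TOY_BIGRAMS[tokens[-1]]


def greedy_decode(
    next_token_scores: Callable[[List[str]], Dict[str, float]],
    start_token: str,
    eos_token: str,
    max_new_tokens: int,
) -> List[str]:
    """Append the highest-scoring token each step until EOS or the length limit."""
    tokens = [start_token]
    for _ in range(max_new_tokens):
        scores = next_token_scores(tokens)
        best = max(scores, key=scores.get)  # argmax over candidate tokens
        tokens.append(best)
        if best == eos_token:
            break
    return tokens


print(greedy_decode(toy_scores, "<s>", "</s>", max_new_tokens=10))
# → ['<s>', 'hello', 'world', '</s>']
```

Because only the top candidate is kept at each step, the output is deterministic, which is part of why greedy decoding is a simple default.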

To speed up training and reduce memory use, we limit `max_target_tokens` (for the decoder block) to `128`, while leaving `max_input_tokens` (for the encoder block) at the default of 512. Feel free to raise these limits to reduce the number of truncated samples.
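
One way to pick a token limit is to estimate how many samples it would truncate. The sketch below is a minimal illustration with made-up token counts; `truncated_fraction` is a hypothetical helper, and in practice the lengths would come from tokenizing your own input or target column with the model's tokenizer.

```python
from typing import Sequence


def truncated_fraction(token_lengths: Sequence[int], max_tokens: int) -> float:
    """Hypothetical helper: share of samples whose token count exceeds max_tokens."""
    if not token_lengths:
        return 0.0
    n_truncated = sum(1 for n in token_lengths if n > max_tokens)
    return n_truncated / len(token_lengths)


# Made-up target lengths, for illustration only
example_target_lengths = [40, 90, 128, 200, 310]

print(truncated_fraction(example_target_lengths, max_tokens=128))  # → 0.4
print(truncated_fraction(example_target_lengths, max_tokens=256))  # → 0.2
```

A higher limit truncates fewer samples at the cost of more memory and longer training, so the fraction gives a rough way to weigh that trade-off.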

The dataset can be passed in as a path to a local file, a HuggingFace dataset, or the string name of a remote HF dataset.

In [None]:
from dataquality.integrations.seq2seq.schema import (
    Seq2SeqDatasetConfig, Seq2SeqGenerationConfig, Seq2SeqTrainingConfig
)

# `hf_data` accepts a HuggingFace DatasetDict or the name of a dataset on the hub
# For an in-memory dataset, use `train_data`; for local files, use `train_path`
dataset_config = Seq2SeqDatasetConfig(
    hf_data=dataset,
    input_col="prompt",
    target_col="completion",
)
# Generation takes about 1 second per sample on a V100 GPU,
# so we generate only for the validation and test splits.
# Update here to include "training" or omit "validation" / "test"
gen_config = Seq2SeqGenerationConfig(
    generation_splits=["validation", "test"]
)
tr_config = Seq2SeqTrainingConfig(
    epochs=3,
    max_target_tokens=128,
)

# 4. Log input data with Galileo auto

Running `auto` for Seq2Seq tasks is as simple as importing and calling `auto()`. In this tutorial, we also set a few basic parameters: the project and run names, the config settings, and a max training set size.

In [None]:
from dataquality.integrations.seq2seq.auto import auto

auto(
    project_name="galileo-finetune",  # TODO, update project name
    run_name="example_run_galileo-finetune_with_auto",  # TODO, update with unique run name
    dataset_config=dataset_config,
    generation_config=gen_config,
    training_config=tr_config,
    max_train_size=max_train_size,
)
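
Since each run needs a unique name, one simple option is to append a timestamp to a fixed prefix. The sketch below is a suggestion only; `unique_run_name` is a hypothetical helper, not part of `dataquality`, and it sticks to letters, digits, hyphens, and underscores to stay conservative about which characters are accepted.

```python
from datetime import datetime, timezone
from typing import Optional


def unique_run_name(prefix: str, now: Optional[datetime] = None) -> str:
    """Hypothetical helper: build a run name like '<prefix>_YYYYMMDD_HHMMSS'."""
    now = now or datetime.now(timezone.utc)  # UTC so names sort consistently
    return f"{prefix}_{now.strftime('%Y%m%d_%H%M%S')}"


print(unique_run_name("example_run_galileo-finetune"))
```

The result can then be passed as `run_name` in the `auto()` call above.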