# Part 2: Generate training files and train the model in SambaStudio

This notebook shows how to convert your JSONL files into HDF5 files, which are the files you need to upload to start a training job in SambaStudio.
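
Before converting anything, it can help to check that the input JSONL file is well formed. The sketch below is not part of the data preparation package: `validate_jsonl` is a hypothetical helper, and it assumes each line is a single JSON object with `prompt` and `completion` keys. Those key names are an assumption here, so check the package README for the format it actually expects and adjust `required_keys` accordingly.

```python
import json


def validate_jsonl(path, required_keys=("prompt", "completion")):
    """Return a list of (line_number, problem) tuples for a JSONL file.

    Hypothetical helper: the default required_keys are an assumption,
    not something confirmed by this notebook.
    """
    problems = []
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                problems.append((line_number, "empty line"))
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((line_number, f"invalid JSON: {exc.msg}"))
                continue
            if not isinstance(record, dict):
                problems.append((line_number, "not a JSON object"))
                continue
            missing = [key for key in required_keys if key not in record]
            if missing:
                problems.append((line_number, f"missing keys: {missing}"))
    return problems
```

An empty list returned from `validate_jsonl(INPUT_PATH)` means no problems were found under these assumptions.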

In [None]:
import os
import sys

current_dir = os.getcwd()
kit_dir = os.path.abspath(os.path.join(current_dir, ".."))
repo_dir = os.path.abspath(os.path.join(kit_dir, ".."))

sys.path.append(kit_dir)
sys.path.append(repo_dir)

## Generative data preparation package dependencies

The [generative data preparation package]() is developed by SambaNova Systems and prepares your data for use as a dataset in your SambaStudio environment. The package is included as a git submodule in the AI Starter Kit, so to install it you only need to run the following cell:

In [None]:
generative_data_prep_dir = os.path.join(repo_dir, "utils", "fine_tuning", "generative_data_prep")
! pip install $generative_data_prep_dir

## Generative data preparation usage

To generate a dataset, specify the tokenizer of the model you want to train by setting its model ID in the `TOKENIZER` variable:
- [Llama 2](https://huggingface.co/meta-llama/Llama-2-7b-hf): meta-llama/Llama-2-7b-hf
- [Llama 3](https://huggingface.co/meta-llama/Meta-Llama-3-8B): meta-llama/Meta-Llama-3-8B
- [Mistral](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2): mistralai/Mistral-7B-Instruct-v0.2

> You will need to request access to each model on its Hugging Face page before you can download its tokenizer: [Llama 2](https://huggingface.co/meta-llama/Llama-2-7b-hf), [Llama 3](https://huggingface.co/meta-llama/Meta-Llama-3-8B), [Mistral](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2)

### Pretraining

#### Pretrain with squad-smol-sql

In [None]:
INPUT_PATH = os.path.join(kit_dir, "data", "pre-training", "pretrain-squad-smol-sql.jsonl")
OUTPUT_PATH = os.path.join(kit_dir, "data", "output", "pretrain-squad-smol-sql")
TOKENIZER = "meta-llama/Llama-2-7b-hf"  # set with the model id
MAX_SEQ_LENGTH = 4096
# Create the output directory if it does not exist yet
os.makedirs(OUTPUT_PATH, exist_ok=True)

In [None]:
%run -m generative_data_prep pipeline \
--input_file_path=$INPUT_PATH \
--output_path=$OUTPUT_PATH \
--pretrained_tokenizer=$TOKENIZER \
--max_seq_length=$MAX_SEQ_LENGTH \
--shuffle=on_RAM \
--keep_split_jsonls
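
Once the pipeline finishes, you may want to confirm that HDF5 files were actually written before uploading anything. The sketch below uses a hypothetical helper, `list_hdf5_files`, that searches the output directory recursively. The exact layout the package produces is not described in this notebook, so the helper assumes only that the generated files use the `.hdf5` extension.

```python
import os


def list_hdf5_files(output_dir):
    """Return sorted paths (relative to output_dir) of every .hdf5 file found.

    Hypothetical helper: it assumes only the .hdf5 extension,
    not any particular directory layout.
    """
    found = []
    for root, _dirs, files in os.walk(output_dir):
        for name in files:
            if name.endswith(".hdf5"):
                full_path = os.path.join(root, name)
                found.append(os.path.relpath(full_path, output_dir))
    return sorted(found)
```

Calling `list_hdf5_files(OUTPUT_PATH)` after the cell above should return a non-empty list if the conversion succeeded.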

#### Pretrain with the stack dedup dataset

In [None]:
INPUT_PATH = os.path.join(kit_dir, "data", "pre-training", "pretrain-the-stack-dedup.jsonl")
OUTPUT_PATH = os.path.join(kit_dir, "data", "output", "pretrain-the-stack-dedup")
TOKENIZER = "meta-llama/Llama-2-7b-hf"  # set with the model id
MAX_SEQ_LENGTH = 4096
# Create the output directory if it does not exist yet
os.makedirs(OUTPUT_PATH, exist_ok=True)

In [None]:
%run -m generative_data_prep pipeline \
--input_file_path=$INPUT_PATH \
--output_path=$OUTPUT_PATH \
--pretrained_tokenizer=$TOKENIZER \
--max_seq_length=$MAX_SEQ_LENGTH \
--shuffle=on_RAM \
--keep_split_jsonls

### Fine-tuning

#### Fine-tune with the NSText2SQL dataset

In [None]:
INPUT_PATH = os.path.join(kit_dir, "data", "fine-tuning", "fine-tune-nstext2sql.jsonl")
OUTPUT_PATH = os.path.join(kit_dir, "data", "output", "fine-tune-nstext2sql")
TOKENIZER = "meta-llama/Llama-2-7b-hf"  # set with the model id
MAX_SEQ_LENGTH = 4096
# Create the output directory if it does not exist yet
os.makedirs(OUTPUT_PATH, exist_ok=True)

In [None]:
%run -m generative_data_prep pipeline \
--input_file_path=$INPUT_PATH \
--output_path=$OUTPUT_PATH \
--pretrained_tokenizer=$TOKENIZER \
--max_seq_length=$MAX_SEQ_LENGTH \
--shuffle=on_RAM \
--keep_split_jsonls
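
The pipeline cells in this notebook differ only in their input and output paths, so the command can also be built from plain Python, for example to process several datasets in a loop or from a script outside Jupyter. The sketch below reuses exactly the flags shown in the cells above. `build_pipeline_command` and `run_pipeline` are hypothetical helpers, and the sketch assumes the package is installed in the interpreter referenced by `sys.executable`.

```python
import subprocess
import sys


def build_pipeline_command(input_path, output_path, tokenizer, max_seq_length=4096):
    """Build the argument list for the data preparation pipeline.

    Hypothetical helper: it mirrors the flags used in the notebook cells.
    """
    return [
        sys.executable, "-m", "generative_data_prep", "pipeline",
        f"--input_file_path={input_path}",
        f"--output_path={output_path}",
        f"--pretrained_tokenizer={tokenizer}",
        f"--max_seq_length={max_seq_length}",
        "--shuffle=on_RAM",
        "--keep_split_jsonls",
    ]


def run_pipeline(input_path, output_path, tokenizer, max_seq_length=4096):
    """Run the pipeline as a subprocess and raise if it exits with an error."""
    command = build_pipeline_command(input_path, output_path, tokenizer, max_seq_length)
    subprocess.run(command, check=True)
```

For example, `run_pipeline(INPUT_PATH, OUTPUT_PATH, TOKENIZER, MAX_SEQ_LENGTH)` would be the scripted counterpart of the cell above.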

> Find more details about the data preparation process in the [Generative data preparation README](../../utils/fine_tuning/generative_data_prep/README.md).