# Part 2: Generate training files and train the model in SambaStudio

This notebook shows how to convert your JSONL files to HDF5 files, which are the files you need to upload to start a training job in SambaStudio.
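As a quick sanity check before conversion, the sketch below scans a JSONL file and reports lines that are not valid JSON objects or that lack the expected keys. It assumes each line holds `prompt` and `completion` fields, which is the input layout described in the generative data preparation README; the helper name `check_jsonl` is hypothetical and not part of the package.

```python
import json


def check_jsonl(path, required_keys=("prompt", "completion")):
    """Return (valid_count, problems) for a JSONL file.

    `required_keys` is an assumption about the expected input layout;
    adjust it if your data uses different field names.
    """
    valid = 0
    problems = []  # list of (line_number, reason)
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # skip blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as err:
                problems.append((line_number, f"invalid JSON: {err.msg}"))
                continue
            if not isinstance(record, dict):
                problems.append((line_number, "not a JSON object"))
                continue
            missing = [k for k in required_keys if k not in record]
            if missing:
                problems.append((line_number, f"missing keys: {missing}"))
            else:
                valid += 1
    return valid, problems
```

Calling `check_jsonl` on one of the input files used later in this notebook returns the number of usable records together with a list of problem lines, so malformed rows can be fixed before tokenization.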

In [12]:
import os
import sys

current_dir = os.getcwd()
kit_dir = os.path.abspath(os.path.join(current_dir, ".."))
repo_dir = os.path.abspath(os.path.join(kit_dir, ".."))

sys.path.append(kit_dir)
sys.path.append(repo_dir)

## Generative data preparation package dependencies

The [generative data preparation package]() is a package developed by SambaNova Systems that prepares your data for use as a dataset in your SambaStudio environment. It is included as a git submodule in the AI Starter Kit, so to install it you only need to run the following cell.

In [14]:
generative_data_prep_dir = os.path.join(repo_dir, "utils", "fine_tuning", "generative_data_prep")
! pip install $generative_data_prep_dir

Processing /Users/jorgep/Documents/ask_public_own/ai-starter-kit/utils/fine_tuning/generative_data_prep
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: sambanova-generative-data-prep
  Building wheel for sambanova-generative-data-prep (pyproject.toml) ... done
  Created wheel for sambanova-generative-data-prep: filename=sambanova_generative_data_prep-0.1.dev66+g3b4779c-py3-none-any.whl size=86027 sha256=68fdaca373d2bf47c26899b3ec35d75316f2c88ed1f4e0b61785938be5076565
  Stored in directory: /private/var/folders/p4/y0q2kh796nx_k_yzfhxs57f00000gp/T/pip-ephem-wheel-cache-1sggcu80/wheels/df/4f/fb/02896327c293054033c9dc14409c1f07ee004b71bd3d093709
Successfully built sambanova-generative-data-prep
Installing collected packages: sambanova-generative-data-prep
  Attempting uninstall: sambanova-generative-data-prep
    Found exis

## Generative data preparation usage

To generate a dataset, specify the tokenizer of the model you want to train by setting its Hugging Face model ID in the `TOKENIZER` variable:
- [Llama 2](https://huggingface.co/meta-llama/Llama-2-7b-hf): meta-llama/Llama-2-7b-hf
- [Llama 3](https://huggingface.co/meta-llama/Meta-Llama-3-8B): meta-llama/Meta-Llama-3-8B
- [Mistral](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2): mistralai/Mistral-7B-Instruct-v0.2

> You will need to request access to these models on their Hugging Face model pages to download the tokenizers: [Llama 2](https://huggingface.co/meta-llama/Llama-2-7b-hf), [Llama 3](https://huggingface.co/meta-llama/Meta-Llama-3-8B), [Mistral](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2)

### Pretraining

#### Pretrain with squad-smol-sql

In [17]:
INPUT_PATH = os.path.join(kit_dir, "data", "pre-training", "pretrain-squad-smol-sql.jsonl")
OUTPUT_PATH = os.path.join(kit_dir, "data", "output", "pretrain-squad-smol-sql")
TOKENIZER = "meta-llama/Llama-2-7b-hf"  # set with the model id
MAX_SEQ_LENGTH = 4096
# Create the output directory if it does not exist yet
os.makedirs(OUTPUT_PATH, exist_ok=True)

In [18]:
%run -m generative_data_prep pipeline \
--input_file_path=$INPUT_PATH \
--output_path=$OUTPUT_PATH \
--pretrained_tokenizer=$TOKENIZER \
--max_seq_length=$MAX_SEQ_LENGTH \
--shuffle=on_RAM \
--keep_split_jsonls

--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Size of input jsonl file is: 0.15 GB (148.6 MB)
--------------------------------------------------------------------------------
Running tokenization jobs locally, There are 8 processes working on it.


|██████████████████████████████████████▏⚠︎| (!) 9546/10016 [95%] in 40.1s (237.91/s) 


--------------------------------------------------------------------------------
Tokenization is complete, the output dataset is located at: /Users/jorgep/Documents/ask_public_own/ai-starter-kit/fine_tuning_sql/data/output/pretrain-squad-smol-sql
--------------------------------------------------------------------------------
Balancing hdf5 files to ensure they have the same number of sequences.
------------------------------------Metrics------------------------------------
╒════════════════════════════╤══════════╕
│ Sequences                  │ 17934    │
├────────────────────────────┼──────────┤
│ Articles                   │ 10000    │
├────────────────────────────┼──────────┤
│ Dataset Tokens             │ 73457664 │
├────────────────────────────┼──────────┤
│ Prompt Tokens              │ 0        │
├────────────────────────────┼──────────┤
│ Completion Tokens          │ 73398229 │
├────────────────────────────┼──────────┤
│ Padding Tokens             │ 59435    │
├────────────────

#### Pretrain with the stack dedup dataset

In [22]:
INPUT_PATH = os.path.join(kit_dir, "data", "pre-training", "pretrain-the-stack-dedup.jsonl")
OUTPUT_PATH = os.path.join(kit_dir, "data", "output", "pretrain-the-stack-dedup")
TOKENIZER = "meta-llama/Llama-2-7b-hf"  # set with the model id
MAX_SEQ_LENGTH = 4096
# Create the output directory if it does not exist yet
os.makedirs(OUTPUT_PATH, exist_ok=True)

In [None]:
%run -m generative_data_prep pipeline \
--input_file_path=$INPUT_PATH \
--output_path=$OUTPUT_PATH \
--pretrained_tokenizer=$TOKENIZER \
--max_seq_length=$MAX_SEQ_LENGTH \
--shuffle=on_RAM \
--keep_split_jsonls
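
The `%run -m` magic only works inside IPython. Outside a notebook, the same pipeline invocation could be sketched with `subprocess`, reusing the flags from the cell above. The helper names `build_pipeline_command` and `run_pipeline` are hypothetical; the flags are the ones used in this notebook.

```python
import subprocess
import sys


def build_pipeline_command(input_path, output_path, tokenizer, max_seq_length):
    """Build the argument list for the data preparation pipeline."""
    return [
        sys.executable, "-m", "generative_data_prep", "pipeline",
        f"--input_file_path={input_path}",
        f"--output_path={output_path}",
        f"--pretrained_tokenizer={tokenizer}",
        f"--max_seq_length={max_seq_length}",
        "--shuffle=on_RAM",
        "--keep_split_jsonls",
    ]


def run_pipeline(input_path, output_path, tokenizer, max_seq_length=4096):
    """Run the pipeline and raise CalledProcessError if it fails."""
    command = build_pipeline_command(input_path, output_path, tokenizer, max_seq_length)
    subprocess.run(command, check=True)
```

For example, `run_pipeline(INPUT_PATH, OUTPUT_PATH, TOKENIZER, MAX_SEQ_LENGTH)` would mirror the cell above, assuming the package is installed in the current Python environment.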

### Fine-tuning

#### Fine-tune with the NSText2SQL dataset

In [20]:
INPUT_PATH = os.path.join(kit_dir, "data", "fine-tuning", "fine-tune-nstext2sql.jsonl")
OUTPUT_PATH = os.path.join(kit_dir, "data", "output", "fine-tune-nstext2sq")
TOKENIZER = "meta-llama/Llama-2-7b-hf"  # set with the model id
MAX_SEQ_LENGTH = 4096
# Create the output directory if it does not exist yet
os.makedirs(OUTPUT_PATH, exist_ok=True)

In [21]:
%run -m generative_data_prep pipeline \
--input_file_path=$INPUT_PATH \
--output_path=$OUTPUT_PATH \
--pretrained_tokenizer=$TOKENIZER \
--max_seq_length=$MAX_SEQ_LENGTH \
--shuffle=on_RAM \
--keep_split_jsonls

--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Size of input jsonl file is: 0.4 GB (408.08 MB)
--------------------------------------------------------------------------------
Running tokenization jobs locally, There are 8 processes working on it.


|██████████████████████████████████████▍⚠︎| (!) 277156/289312 [96%] in 1:35.1 (2913.15/s) 


--------------------------------------------------------------------------------
Tokenization is complete, the output dataset is located at: /Users/jorgep/Documents/ask_public_own/ai-starter-kit/fine_tuning_sql/data/output/fine-tune-nstext2sq
--------------------------------------------------------------------------------
Balancing hdf5 files to ensure they have the same number of sequences.
------------------------------------Metrics------------------------------------
╒════════════════════════════╤═══════════╕
│ Sequences                  │ 35882     │
├────────────────────────────┼───────────┤
│ Articles                   │ 289288    │
├────────────────────────────┼───────────┤
│ Dataset Tokens             │ 146972672 │
├────────────────────────────┼───────────┤
│ Prompt Tokens              │ 125607979 │
├────────────────────────────┼───────────┤
│ Completion Tokens          │ 21300079  │
├────────────────────────────┼───────────┤
│ Padding Tokens             │ 64614     │
├────────

> Find more details about the data preparation process in the [Generative Data Preparation README](../../utils/fine_tuning/generative_data_prep/README.md).
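
The metrics tables above can be cross-checked with simple arithmetic. The sketch below, using only numbers reported in this notebook, verifies that the dataset token count equals the number of sequences times `MAX_SEQ_LENGTH`, and that prompt, completion, and padding tokens add up to the same total. The function name `check_metrics` is hypothetical, and the reading that every sequence is filled to exactly `MAX_SEQ_LENGTH` tokens is inferred from the numbers rather than stated by the tool.

```python
MAX_SEQ_LENGTH = 4096


def check_metrics(sequences, dataset_tokens, prompt, completion, padding,
                  max_seq_length=MAX_SEQ_LENGTH):
    """Return True if the reported token counts are mutually consistent."""
    # Every sequence is assumed to hold exactly max_seq_length tokens.
    fixed_length_ok = sequences * max_seq_length == dataset_tokens
    # Prompt, completion, and padding tokens should account for every token.
    breakdown_ok = prompt + completion + padding == dataset_tokens
    return fixed_length_ok and breakdown_ok


# Values copied from the pretraining (squad-smol-sql) metrics table
pretrain_ok = check_metrics(17934, 73457664, 0, 73398229, 59435)
# Values copied from the fine-tuning (NSText2SQL) metrics table
finetune_ok = check_metrics(35882, 146972672, 125607979, 21300079, 64614)

print(pretrain_ok, finetune_ok)
```

Both checks pass for the runs shown here. In the pretraining run all non-padding tokens are completion tokens, while in the fine-tuning run most tokens belong to prompts.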