<a href="https://colab.research.google.com/github/simplifine-llm/Simplifine/blob/main/examples/cloud_quickstart.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Simplifine-tuning your LLMs! 💫

This is a quick guide on getting started with Simplifine!

Below is an example of sending a supervised fine-tuning job to Simplifine's hosted servers.

First, we install the latest version of Simplifine from GitHub.

In [1]:
!pip install git+https://github.com/simplifine-llm/Simplifine.git -q


Supervised fine-tuning (SFT) is a useful method for teaching a model to generate formatted answers based on the provided input.

For example, you might want the model to answer a question using some provided context:

QUESTION: What is the capital of France?

CONTEXT: France has had its capital as Paris for some time now!

ANSWER: Paris is the capital of France.

Here, you would want the model to fill in the answer, given the question and the context.

The code below uses an arbitrary dataset with the following prompt template:

```
'''### TITLE: {title}\n ### ABSTRACT: {abstract}\n ###EXPLANATION: {explanation}'''
```

As mentioned, we want the model to fill in the text of the final field (here, the explanation), so we assign its marker to a response template:



```
response_template='\n ###EXPLANATION:'
```
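
To make the role of the two templates concrete, here is a minimal sketch in plain Python. It does not call Simplifine, and the `row` record is a made-up example. It fills the prompt template for one record and splits the result at the response template, which is presumably the boundary the trainer uses to tell the prompt apart from the text the model should learn to produce.

```python
template = '''### TITLE: {title}\n ### ABSTRACT: {abstract}\n ###EXPLANATION: {explanation}'''
response_template = '\n ###EXPLANATION:'

# Hypothetical record; its keys match the placeholders in the template.
row = {'title': 'title 1', 'abstract': 'abstract 1', 'explanation': 'explanation 1'}

text = template.format(**row)

# Everything up to and including the response template is the prompt;
# what follows is the part the model is expected to generate.
prompt, sep, completion = text.partition(response_template)
prompt += sep

print(repr(prompt))
print(repr(completion))
```

If `sep` comes back empty, the response template does not occur in the formatted text, which usually means the two strings differ in whitespace.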

In the example below, we use our own dataset. It should be a Python dictionary that includes the keys required to populate the template you provided. You can also use any dataset hosted on Hugging Face (some require authentication tokens).
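
Before sending a dictionary to the trainer, it can help to check that it really covers every placeholder in the template. The sketch below is one way to do that with the standard library only; `template_fields` and `check_dataset` are hypothetical helper names, not part of Simplifine.

```python
import string

def template_fields(template):
    """Return the placeholder names used in a str.format-style template."""
    return [name for _, name, _, _ in string.Formatter().parse(template) if name]

def check_dataset(data, template):
    """Return the number of rows, or raise ValueError if data cannot fill template."""
    missing = [k for k in template_fields(template) if k not in data]
    if missing:
        raise ValueError(f'missing keys: {missing}')
    lengths = {k: len(v) for k, v in data.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f'columns have different lengths: {lengths}')
    return next(iter(lengths.values()), 0)

template = '''### TITLE: {title}\n ### ABSTRACT: {abstract}\n ###EXPLANATION: {explanation}'''
data = {
    'title': ['title 1', 'title 2', 'title 3'] * 200,
    'abstract': ['abstract 1', 'abstract 2', 'abstract 3'] * 200,
    'explanation': ['explanation 1', 'explanation 2', 'explanation 3'] * 200,
}

print(template_fields(template))
print(check_dataset(data, template))
```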

In [2]:
from simplifine_alpha import train_engine
import wandb
import os

# Disable WandB logging; change this if you would like to log runs.
# Note that logging requires a WandB token.
wandb.init(mode='disabled')

# You can provide a HF dataset name instead.
# Be sure to change the keys, response template and template accordingly.
template = '''### TITLE: {title}\n ### ABSTRACT: {abstract}\n ###EXPLANATION: {explanation}'''
response_template='\n ###EXPLANATION:'
keys = ['title', 'abstract', 'explanation']
dataset_name=''

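# You can change the model. Bigger models might throw OOM errors.
model_name = 'EleutherAI/pythia-160m'

# Set to False to train on the Hugging Face dataset named in dataset_name.
use_local_data = True
from_hf = not use_local_data
data = None  # placeholder; assumed to be ignored when from_hf is True
if use_local_data: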
  data = {
      'title':['title 1', 'title 2', 'title 3']*200,
      'abstract':['abstract 1', 'abstract 2', 'abstract 3']*200,
      'explanation':['explanation 1', 'explanation 2', 'explanation 3']*200
  }

train_engine.hf_sft(model_name, from_hf=from_hf, dataset_name=dataset_name,
        keys = keys, data = data,
        template = template,
        response_template=response_template, zero=False, ddp=False, gradient_accumulation_steps=4, fp16=True, max_seq_length=2048)

[2024-07-28 18:09:39,647] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/396 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.11M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/99.0 [00:00<?, ?B/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Map:   0%|          | 0/480 [00:00<?, ? examples/s]

Map:   0%|          | 0/120 [00:00<?, ? examples/s]

config.json:   0%|          | 0.00/569 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/375M [00:00<?, ?B/s]

Using CUDA






Next, we test the model's generation after training.
The Simplifine trainer saves the final model in a folder called "final_model" inside output_dir.

In [3]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# This is the path that the model and other relevant files are saved to.
# This is the default folder name in the trainer.
# The final checkpoint is saved under final_model.
path = '/content/sft_output/final_model'
sf_model = AutoModelForCausalLM.from_pretrained(path)
sf_tokenizer = AutoTokenizer.from_pretrained(path)

# an example following the arbitrary training data
input_example = '''### TITLE: title 1\n ### ABSTRACT: abstract 1\n ###EXPLANATION: '''

input_example = sf_tokenizer(input_example, return_tensors='pt')

output = sf_model.generate(input_example['input_ids'],
                           attention_mask=input_example['attention_mask'],
                           max_length=30,eos_token_id=sf_tokenizer.eos_token_id,
                           early_stopping=True,
                           pad_token_id=sf_tokenizer.eos_token_id
)

print(sf_tokenizer.decode(output[0]))

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


### TITLE: title 1
 ### ABSTRACT: abstract 1
 ###EXPLANATION:  explanation 1 3 explanation 1 3


## Using Simplifine's GPU clusters

In the example above, we trained a small Pythia model (160M parameters) on an L4 GPU. Note that we did not use any adapters, e.g. LoRA.
In the next step, we show how Simplifine lets you do the same thing on GPU clusters, using the functions in train_utils.

With the cloud training call, you can manually choose the parallelization method.

If your model is small enough to fit on a single GPU, try using DDP. In this method, each processor (fancy word for GPU!) holds a replica of the model and processes different samples.
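
As an illustration of "each GPU processes different samples", the sketch below splits sample indices across GPUs with a simple stride. This is the same basic scheme PyTorch's `DistributedSampler` uses, minus shuffling and padding; it is not taken from Simplifine's code, and `shard_indices` is a hypothetical helper name.

```python
def shard_indices(num_samples, world_size, rank):
    """Indices of the samples that one GPU (rank) processes in an epoch."""
    return list(range(rank, num_samples, world_size))

num_samples, world_size = 10, 4
for rank in range(world_size):
    print(f'GPU {rank}:', shard_indices(num_samples, world_size, rank))
```

Every GPU runs the forward and backward pass on its own shard, and the gradients are averaged across GPUs before each optimizer step, so all replicas stay identical.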

You can also use ZeRO from DeepSpeed. With it, you can shard the optimizer states, gradients and model parameters across the GPUs. You also have the option to offload some of these to the CPU, at the expense of lower throughput.
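
To get a feel for what sharding buys you, here is a rough back-of-the-envelope estimate. It follows the accounting from the ZeRO paper for mixed-precision Adam (2 bytes per parameter for fp16 weights, 2 for fp16 gradients, 12 for fp32 optimizer states) and ignores activations, buffers and fragmentation. The function name and the round figure of 160M parameters are assumptions for illustration; which ZeRO stage Simplifine configures is not stated in this guide.

```python
def zero_model_state_bytes(num_params, num_gpus, stage):
    """Approximate per-GPU bytes for model states under mixed-precision Adam."""
    params = 2 * num_params   # fp16 weights
    grads = 2 * num_params    # fp16 gradients
    optim = 12 * num_params   # fp32 weights copy, momentum and variance
    if stage >= 1:            # stage 1 shards the optimizer states
        optim /= num_gpus
    if stage >= 2:            # stage 2 also shards the gradients
        grads /= num_gpus
    if stage >= 3:            # stage 3 also shards the parameters
        params /= num_gpus
    return params + grads + optim

for stage in range(4):
    gb = zero_model_state_bytes(160_000_000, 8, stage) / 1e9
    print(f'stage {stage}: {gb:.2f} GB per GPU')
```

Stage 0 corresponds to plain DDP, where every GPU keeps a full copy of everything.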

**NOTE**: We currently support L4 and A100 GPUs. When initialising the client, you can choose which GPU type to run your job on. Each server has up to 8 GPUs. The default is L4.

### Using DDP to train
The example below uses DDP to distribute the training process.

You will need a Simplifine API key. Contact us to get one for free! :)

See the contact details in our GitHub repo at https://github.com/simplifine-llm/Simplifine/tree/main

In [5]:
from simplifine_alpha.train_utils import Client

# Set up the client.
# Enter your Simplifine API key below.
api_key = ''
gpu_type = 'a100'  # l4 or a100
client = Client(api_key=api_key, gpu_type=gpu_type)

# Simply pass all the arguments you used above, and set use_ddp or use_zero to True if you want parallelization.
client.sft_train_cloud(model_name = model_name, from_hf=from_hf, dataset_name=dataset_name,
        keys = keys, data = data,
        template = template, job_name='ddp_job',
        response_template=response_template, use_zero=False, use_ddp=True)

After sending the query, you can check the status of your jobs. The status is one of three options:
```text
status = completed|in progress|pending
```

In [6]:
status = client.get_all_jobs()
for num,i in enumerate(status[-5:]):
  print(f'Job {num}: {i}')

Job 0: {'job_id': '544bb4f0-206f-43b7-850e-5e1e9f7b4d23', 'job_name': 'job-4', 'status': 'completed'}
Job 1: {'job_id': 'bde91132-9776-41ae-89f9-855dfb116a91', 'job_name': 'ddp_job', 'status': 'completed'}
Job 2: {'job_id': 'a1ff54dd-5ee2-4e35-9e78-6868f63dad37', 'job_name': 'zero_example_cloud', 'status': 'completed'}
Job 3: {'job_id': '543d3bc3-3ce4-4af6-9f9a-6c0823dcc9b0', 'job_name': 'ddp_job', 'status': 'in progress'}
Job 4: {'job_id': '5d55d46a-7793-4c06-9cef-279f03a0f953', 'job_name': 'job_1', 'status': 'pending'}
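
The job list printed above is a list of dictionaries with `job_id`, `job_name` and `status` keys. Since the last entry is not necessarily the job you just submitted, a small helper for looking up a job by name and waiting for it can be handy. The sketch below relies only on `client.get_all_jobs()` as used above; `latest_job` and `wait_until_completed` are hypothetical helpers, and the sketch assumes the list is ordered from oldest to newest.

```python
import time

def latest_job(jobs, job_name):
    """Return the most recent job dict with the given name, or None."""
    for job in reversed(jobs):
        if job['job_name'] == job_name:
            return job
    return None

def wait_until_completed(client, job_id, poll_seconds=30, max_polls=120):
    """Poll client.get_all_jobs() until the job reports 'completed'."""
    for _ in range(max_polls):
        jobs = client.get_all_jobs()
        job = next((j for j in jobs if j['job_id'] == job_id), None)
        if job is None:
            raise KeyError(f'unknown job_id: {job_id}')
        if job['status'] == 'completed':
            return job
        time.sleep(poll_seconds)
    raise TimeoutError(f'job {job_id} did not complete in time')
```

With a real client this could be used as `job = latest_job(client.get_all_jobs(), 'ddp_job')` followed by `wait_until_completed(client, job['job_id'])`.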


You can also stop an ongoing job by calling the function below.

In [None]:
stop_running_job = False
if stop_running_job:
  job_id = status[-1]['job_id']
  client.stop_job(job_id)

In [7]:
# getting the job_id of the last job
job_id = status[-1]['job_id']

logs = client.get_train_logs(job_id)
print(logs['response'])

W0728 18:13:03.377000 134787342856320 torch/distributed/run.py:779] 
W0728 18:13:03.377000 134787342856320 torch/distributed/run.py:779] *****************************************
W0728 18:13:03.377000 134787342856320 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0728 18:13:03.377000 134787342856320 torch/distributed/run.py:779] *****************************************
[2024-07-28 18:13:08,712] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-28 18:13:08,803] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
  @autocast_custom_fwd
  @autocast_custom_bwd
[2024-07-28 18:13:08,963] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-28 18:13:09,002] [

### Downloading
The trained model can be downloaded using the "download_model" function. It is downloaded as a zip file, which is then extracted into the target folder and deleted.

In [8]:
import os

# Create a folder to store the model (no error if it already exists).
os.makedirs('sf_trained_model', exist_ok=True)

# download and save the model to it.
# This might take some time, have a sip of that coffee! :)
client.download_model(job_id=job_id, extract_to='/content/sf_trained_model')

Downloading: 100%|██████████| 540M/540M [00:36<00:00, 14.9MiB/s]



Directory downloaded successfully and saved to /content/sf_trained_model/5d55d46a-7793-4c06-9cef-279f03a0f953.zip
Model unzipped successfully to /content/sf_trained_model
Deleted the zip file at /content/sf_trained_model/5d55d46a-7793-4c06-9cef-279f03a0f953.zip
Model downloaded, unzipped, and zip file deleted successfully!


Finally, we test loading the model!

In [9]:
from transformers import AutoModelForCausalLM, AutoTokenizer

path = '/content/sf_trained_model'
sf_model = AutoModelForCausalLM.from_pretrained(path)
sf_tokenizer = AutoTokenizer.from_pretrained(path)

input_example = '''### TITLE: title 1\n ### ABSTRACT: abstract 1\n ###EXPLANATION: '''

input_example = sf_tokenizer(input_example, return_tensors='pt')

output = sf_model.generate(input_example['input_ids'],
                           attention_mask=input_example['attention_mask'],
                           max_length=30,eos_token_id=sf_tokenizer.eos_token_id,
                           early_stopping=True,
                           pad_token_id=sf_tokenizer.eos_token_id
)

print(sf_tokenizer.decode(output[0]))

Some weights of the model checkpoint at /content/sf_trained_model were not used when initializing GPTNeoXForCausalLM: ['module.embed_out.weight', 'module.gpt_neox.embed_in.weight', 'module.gpt_neox.final_layer_norm.bias', 'module.gpt_neox.final_layer_norm.weight', 'module.gpt_neox.layers.0.attention.dense.bias', 'module.gpt_neox.layers.0.attention.dense.weight', 'module.gpt_neox.layers.0.attention.query_key_value.bias', 'module.gpt_neox.layers.0.attention.query_key_value.weight', 'module.gpt_neox.layers.0.input_layernorm.bias', 'module.gpt_neox.layers.0.input_layernorm.weight', 'module.gpt_neox.layers.0.mlp.dense_4h_to_h.bias', 'module.gpt_neox.layers.0.mlp.dense_4h_to_h.weight', 'module.gpt_neox.layers.0.mlp.dense_h_to_4h.bias', 'module.gpt_neox.layers.0.mlp.dense_h_to_4h.weight', 'module.gpt_neox.layers.0.post_attention_layernorm.bias', 'module.gpt_neox.layers.0.post_attention_layernorm.weight', 'module.gpt_neox.layers.1.attention.dense.bias', 'module.gpt_neox.layers.1.attention.dens

### TITLE: title 1
 ### ABSTRACT: abstract 1
 ###EXPLANATION: rugu stretmediate complains GermanServ
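
The warning above lists unused checkpoint keys that all start with `module.`, a prefix PyTorch adds when a model is wrapped in `DistributedDataParallel`. When checkpoint keys do not match the model's parameter names, the corresponding weights are likely left at their fresh initialization, which would explain the nonsensical generation. One possible workaround, sketched here under the assumption that the prefix is the only mismatch, is to rename the keys before loading. `strip_prefix` is a hypothetical helper that works on any mapping; how the weights are stored in the downloaded folder is not shown in this guide, so reading and re-saving them is left out.

```python
def strip_prefix(state_dict, prefix='module.'):
    """Return a copy of state_dict with a leading prefix removed from its keys."""
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }

# Key names copied from the warning above; the values are placeholders.
example = {
    'module.embed_out.weight': 0,
    'module.gpt_neox.embed_in.weight': 1,
}
print(list(strip_prefix(example)))
```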


### Using ZeRO
ZeRO is a strong tool when a model cannot fit in the memory of a single GPU: the training state (optimizer states, gradients and parameters) is sharded across the GPUs. Memory use can be reduced further by enabling fp16/bf16 and gradient_checkpointing.

In [10]:
# This time, we set use_zero to True and use_ddp to False.
client.sft_train_cloud(model_name = model_name, from_hf=from_hf, dataset_name=dataset_name,
        keys = keys, data = data,
        template = template, job_name='zero_example_cloud',
        response_template=response_template, use_zero=True, use_ddp=False)

In [11]:
# repeat the same step of extracting jobs and ids
status = client.get_all_jobs()

for num,i in enumerate(status[-5:]):
  print(f'Number {num} status: {i}\n')

Number 0 status: {'job_id': 'bde91132-9776-41ae-89f9-855dfb116a91', 'job_name': 'ddp_job', 'status': 'completed'}

Number 1 status: {'job_id': 'a1ff54dd-5ee2-4e35-9e78-6868f63dad37', 'job_name': 'zero_example_cloud', 'status': 'completed'}

Number 2 status: {'job_id': '543d3bc3-3ce4-4af6-9f9a-6c0823dcc9b0', 'job_name': 'ddp_job', 'status': 'completed'}

Number 3 status: {'job_id': '5d55d46a-7793-4c06-9cef-279f03a0f953', 'job_name': 'job_1', 'status': 'completed'}

Number 4 status: {'job_id': '42d965c0-773f-4b45-8dfb-a4f310e6606e', 'job_name': 'zero_example_cloud', 'status': 'in progress'}



In [12]:
# extracting logs again
job_id = status[-1]['job_id']

logs = client.get_train_logs(job_id)
print(logs['response'])

W0728 18:16:44.514000 133239404900480 torch/distributed/run.py:779] 
W0728 18:16:44.514000 133239404900480 torch/distributed/run.py:779] *****************************************
W0728 18:16:44.514000 133239404900480 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0728 18:16:44.514000 133239404900480 torch/distributed/run.py:779] *****************************************
[2024-07-28 18:16:49,912] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-28 18:16:49,967] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-28 18:16:50,049] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-28 18:16:50,075] [INFO] [real_accelerator.py:203:get_accelerator

As above, now we download the model.

In [13]:
# Create a folder to store the model (no error if it already exists).
os.makedirs('sf_trained_model_ZeRO', exist_ok=True)

# download and save the model to it.
# This might take some time, have a sip of that coffee! :)
client.download_model(job_id=job_id, extract_to='/content/sf_trained_model_ZeRO')

Downloading: 100%|██████████| 295M/295M [00:20<00:00, 14.1MiB/s]



Directory downloaded successfully and saved to /content/sf_trained_model_ZeRO/42d965c0-773f-4b45-8dfb-a4f310e6606e.zip
Model unzipped successfully to /content/sf_trained_model_ZeRO
Deleted the zip file at /content/sf_trained_model_ZeRO/42d965c0-773f-4b45-8dfb-a4f310e6606e.zip
Model downloaded, unzipped, and zip file deleted successfully!


Next we test this model trained with ZeRO for generation.

In [14]:
from transformers import AutoModelForCausalLM, AutoTokenizer

path = '/content/sf_trained_model_ZeRO'
sf_model = AutoModelForCausalLM.from_pretrained(path)
sf_tokenizer = AutoTokenizer.from_pretrained(path)

input_example = '''### TITLE: title 1\n ### ABSTRACT: abstract 1\n ###EXPLANATION: '''

input_example = sf_tokenizer(input_example, return_tensors='pt')

output = sf_model.generate(input_example['input_ids'],
                           attention_mask=input_example['attention_mask'],
                           max_length=30,eos_token_id=sf_tokenizer.eos_token_id,
                           early_stopping=True,
                           pad_token_id=sf_tokenizer.eos_token_id
)

print(sf_tokenizer.decode(output[0]))

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


### TITLE: title 1
 ### ABSTRACT: abstract 1
 ###EXPLANATION:  explanation 1
 ### QUE
