### Goal
We have a dataset named "example_LM_input" with three columns:
1. `id` column: the ID number of each example.
2. `input` column: the input text of each example.
3. `output_example` column: an example of what the output might look like.

We're going to write a script that
1. runs batched inference, feeding the inputs to the model;  
2. saves the (batched) model output to a column named `output`;  
3. adds a column `input_output`, which is simply the concatenation of the input and the output.  
 -> We'll save the new dataset as "example_LM_input_output".  
 -> We'll implement it for both CPU and GPU.
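
The column layout described above can be sketched without loading any model. In batched mode, `Dataset.map` hands the function a dict that maps each column name to a list of values, so a plain dict is enough to illustrate the contract. `fake_generate` and the sample rows are hypothetical stand-ins, not part of the real dataset or model.

```python
def fake_generate(prompts):
    """Hypothetical stand-in for tokenizer + model.generate: one string per prompt."""
    return [f"<reply to: {p}>" for p in prompts]


def add_output_columns(batch):
    # `batch` maps each column name to a list with one entry per example.
    batch["output"] = fake_generate(batch["input"])
    # `input_output` is the input and the output joined by a single space.
    batch["input_output"] = [
        inp + " " + out for inp, out in zip(batch["input"], batch["output"])
    ]
    return batch


# Made-up rows that mimic the three columns of "example_LM_input".
batch = {
    "id": [0, 1],
    "input": ["2 + 2 =", "The capital of France is"],
    "output_example": ["4", "Paris"],
}

result = add_output_columns(batch)
for key in result:
    print(key, result[key])
```

The function returns the same dict with two extra keys, which is how the new columns end up in the mapped dataset.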

### Settings 1

In [1]:
import os
import torch
from datasets import load_from_disk
from transformers import AutoTokenizer, AutoModelForCausalLM, \
    BitsAndBytesConfig, GenerationConfig
from accelerate.test_utils.testing import get_backend
from multiprocess import set_start_method


# Automatically detects the underlying device type (CUDA, CPU, XPU, MPS, etc.)
device, _, _ = get_backend()
# Define the base model directory and model ID
MODEL_DIR = '/home/zerothweek/llm/models'
model_id = 'Llama-3.2-1B'
# Construct the full path to the model
model_path = os.path.join(MODEL_DIR, model_id)

### Settings 2

In [2]:
# Loading and Setting model
# Optional 4-bit quantization config (unused unless the commented-out argument below is enabled)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
model = AutoModelForCausalLM.from_pretrained(
    model_path, 
    device_map="auto",
    torch_dtype=torch.float16,
    #quantization_config = quantization_config,
    trust_remote_code=False
)

# Loading and Setting tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    model_path, 
    padding_side="left")
tokenizer.pad_token = tokenizer.eos_token # Most LLMs don't have a pad token by default

# Loading the generation config
generation_config = GenerationConfig.from_pretrained(model_path)
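
The tokenizer is loaded with `padding_side="left"` because a decoder-only model continues from the last position of each row, so that position should hold a real token rather than padding. A minimal sketch of the idea using plain lists instead of tensors; the token ids, `PAD`, and the simulated new token `99` are made-up values for illustration only.

```python
PAD = 0  # hypothetical pad token id


def left_pad(sequences, pad_id=PAD):
    """Pad every sequence on the left to the length of the longest one."""
    width = max(len(seq) for seq in sequences)
    return [[pad_id] * (width - len(seq)) + seq for seq in sequences]


def strip_prompt(generated, prompt_width):
    """Drop the first `prompt_width` positions, keeping only newly generated ids."""
    return [row[prompt_width:] for row in generated]


prompts = [[11, 12, 13], [21]]
padded = left_pad(prompts)
print(padded)  # every row now ends in a real token

# Simulate one generation step: each row gets one new token appended.
generated = [row + [99] for row in padded]

# Because all prompts share the same padded width, one slice removes every prompt.
new_tokens = strip_prompt(generated, len(padded[0]))
print(new_tokens)
```

With right padding the short prompt would end in pad tokens, and the appended token would follow padding instead of the prompt text.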

### Settings 3

In [3]:
dataset = load_from_disk('./datasets/example_LM_input')

### CPU computation

In [17]:
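def inference_dataset_cpu(batch):
    # Tokenize the batch; padding=True pads every prompt to the longest one in the batch
    model_inputs = tokenizer(batch["input"], return_tensors="pt", padding=True).to(device)

    with torch.no_grad():
        outputs = model.generate(**model_inputs, max_new_tokens=1)

    # generate() returns prompt + new tokens for decoder-only models; keep only the new tokens
    new_tokens = outputs[:, model_inputs["input_ids"].shape[1]:]
    batch["output"] = tokenizer.batch_decode(new_tokens, skip_special_tokens=True)

    # Concatenate each input with its generated output
    batch["input_output"] = [inp + " " + out for inp, out in zip(batch["input"], batch["output"])]
    return batch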

if __name__ == "__main__":
    updated_dataset = dataset.map(
        inference_dataset_cpu,
        batched=True,
        batch_size=4
    )
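
With `batched=True`, `map` walks the dataset in consecutive slices of `batch_size` rows, and the last slice may be shorter. A rough sketch of that slicing, assuming the 28 examples reported by the progress bar for this run; `iter_batches` is a hypothetical helper, not a `datasets` function.

```python
def iter_batches(n_rows, batch_size):
    """Yield (start, end) row ranges the way a batched map would slice the data."""
    for start in range(0, n_rows, batch_size):
        yield start, min(start + batch_size, n_rows)


batches = list(iter_batches(28, 4))
print(len(batches))  # number of times the mapped function is called
print(batches[-1])   # row range of the final batch

# A size that does not divide evenly leaves a shorter final batch.
uneven = list(iter_batches(10, 4))
print(uneven)
```

Each slice triggers one call to the mapped function and therefore one `model.generate` call.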

Map:   0%|          | 0/28 [00:00<?, ? examples/s]

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.




### GPU computation

In [None]:
def inference_dataset_gpu(batch, rank):
    '''
    For this to work properly, remove `device_map="auto"` from the
    `AutoModelForCausalLM.from_pretrained(...)` call, because each process
    moves the model to its own GPU below.
    '''
    # Move the model to the right GPU if it's not there already.
    device = f"cuda:{(rank or 0) % torch.cuda.device_count()}"
    model.to(device)
    # Tokenize the batch; padding=True pads every prompt to the longest one in the batch
    model_inputs = tokenizer(batch["input"], return_tensors="pt", padding=True).to(device)

    with torch.no_grad():
        outputs = model.generate(**model_inputs, max_new_tokens=1)

    # generate() returns prompt + new tokens for decoder-only models; keep only the new tokens
    new_tokens = outputs[:, model_inputs["input_ids"].shape[1]:]
    batch["output"] = tokenizer.batch_decode(new_tokens, skip_special_tokens=True)

    # Concatenate each input with its generated output
    batch["input_output"] = [inp + " " + out for inp, out in zip(batch["input"], batch["output"])]
    return batch

if __name__ == "__main__":
    set_start_method("spawn")
    updated_dataset = dataset.map(
        inference_dataset_gpu,
        batched=True,
        batch_size=16,
        with_rank=True,
        num_proc=torch.cuda.device_count() # one process per GPU
    )
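
With `with_rank=True`, `map` passes the worker's process rank to the function as an extra argument, and the GPU function turns that rank into a device string. The mapping can be checked on its own without any GPU. `device_for_rank` is a hypothetical helper, and the `"cpu"` fallback for a machine with no GPUs is an added assumption that the cell above does not have.

```python
def device_for_rank(rank, n_gpus):
    """Map a worker rank to a device string, round-robin over the available GPUs."""
    if n_gpus == 0:
        # Assumption for illustration: fall back to CPU when no GPU is present.
        return "cpu"
    # `rank or 0` covers the case where rank is None.
    return f"cuda:{(rank or 0) % n_gpus}"


for rank in [None, 0, 1, 2, 3]:
    print(rank, device_for_rank(rank, n_gpus=2))
```

Using one process per GPU, as in `num_proc=torch.cuda.device_count()`, gives each rank its own device, so the modulo only matters when there are more processes than GPUs.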