First, we download the pretrained tokenizer and model `deepseek-coder-1.3b-base`.

In [4]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
# import torch_directml
# dml = torch_directml.device()

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-1.3b-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-1.3b-base", trust_remote_code=True).to(device)


Next, we load our code-translation dataset and preprocess it. Every sample is passed through the `tokenize_function` function. In our case, the model's input is the tokenized `example["problem"]`, while the label is the tokenized `example["solution"]`.

In [7]:
from datasets import load_dataset

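def tokenize_function(example):
    # Without an explicit max_length, padding="max_length" pads every sample to
    # tokenizer.model_max_length, which can be very large. 512 is an assumed cap.
    example['input_ids'] = tokenizer(example["problem"], padding="max_length", truncation=True, max_length=512, return_tensors="pt").input_ids
    example['labels'] = tokenizer(example["solution"], padding="max_length", truncation=True, max_length=512, return_tensors="pt").input_ids

    return example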

dataset = load_dataset("json", data_files="python_to_kotlin_data.jsonl")["train"]
tokenized_datasets = dataset.map(tokenize_function, batched=True)

Generating train split: 99 examples [00:00, 4148.67 examples/s]
Map: 100%|██████████| 99/99 [00:01<00:00, 82.61 examples/s]
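
The cell above tokenizes `problem` and `solution` separately. A common alternative for causal language models, shown here only as a hedged sketch and not as the approach used in this notebook, is to concatenate prompt and solution into one sequence and mask the prompt and padding positions in the labels with `-100`, the index that PyTorch's cross-entropy loss ignores by default. The helper name `build_causal_lm_example` and the toy token ids are hypothetical; in practice the ids would come from the tokenizer.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_causal_lm_example(prompt_ids, solution_ids, max_length, pad_id):
    """Concatenate prompt and solution ids; mask prompt and padding in the labels."""
    # Hypothetical helper: the model sees prompt + solution as one sequence.
    input_ids = (list(prompt_ids) + list(solution_ids))[:max_length]
    # Only the solution tokens contribute to the loss.
    labels = ([IGNORE_INDEX] * len(prompt_ids) + list(solution_ids))[:max_length]
    attention_mask = [1] * len(input_ids)

    # Right-pad everything to a fixed length.
    pad = max_length - len(input_ids)
    input_ids += [pad_id] * pad
    labels += [IGNORE_INDEX] * pad
    attention_mask += [0] * pad
    return {"input_ids": input_ids, "labels": labels, "attention_mask": attention_mask}


# Toy ids standing in for tokenizer output.
example = build_causal_lm_example([11, 12, 13], [21, 22], max_length=8, pad_id=0)
print(example["input_ids"])
print(example["labels"])
```

With this layout, `input_ids` and `labels` always have the same length and refer to the same positions, so the loss is computed only where the model is expected to produce the solution.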


Now we use the built-in Hugging Face `Trainer` class. We pass it the preprocessed dataset together with a reference to the original model. Training parameters are set to minimal values to keep computational requirements as low as possible.

In [None]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./codegen-kotlin-finetune",
    learning_rate=2e-5,
    num_train_epochs=1,
    per_device_train_batch_size=2,
    weight_decay=0.01,
    logging_steps=1,
    max_steps=1,  # overrides num_train_epochs: training stops after one optimizer step
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets  # already the "train" split selected when loading
)
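
To make the effect of the `TrainingArguments` above concrete, here is a small arithmetic sketch. It assumes a single device, no gradient accumulation (the `TrainingArguments` default of `gradient_accumulation_steps=1`), and that the last partial batch is kept. The helper `steps_per_epoch` is hypothetical; the dataset size of 99 comes from the loading output above.

```python
import math


def steps_per_epoch(num_examples, per_device_batch_size, num_devices=1):
    """Optimizer steps for one pass over the data, keeping the last partial batch."""
    return math.ceil(num_examples / (per_device_batch_size * num_devices))


num_examples = 99   # from the dataset loading output
batch_size = 2      # per_device_train_batch_size
max_steps = 1       # value passed to TrainingArguments

full_epoch = steps_per_epoch(num_examples, batch_size)
examples_seen = min(num_examples, max_steps * batch_size)
print(full_epoch, examples_seen)
```

Because `max_steps` takes precedence over `num_train_epochs`, a run with these settings performs a single step and sees only one batch of the data, far less than a full epoch.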

All that's left now is to run the training.

In [None]:
trainer.train()

However, we cannot show any results because my local machine crashed. I tried to run the same code on Google Colab, but I only received an error message: CUDA out of memory. Training was not possible on CPU either.
We could have used a smaller transformer model, but every model I tried other than `deepseek-coder-1.3b-base` could not handle the evaluation (which was the first step) and failed.
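
The out-of-memory error is plausible from a back-of-the-envelope estimate. The sketch below assumes full fine-tuning in 32-bit precision with an Adam-style optimizer, which keeps weights, gradients, and two optimizer moments per parameter, about 16 bytes in total. It takes the 1.3 billion parameter count from the model name as an approximation and ignores activations, so the real requirement would be higher. The function name `training_memory_gib` is hypothetical.

```python
def training_memory_gib(num_params, bytes_weights=4, bytes_grads=4, bytes_optimizer=8):
    """Rough memory for weights + gradients + Adam moments, excluding activations."""
    total_bytes = num_params * (bytes_weights + bytes_grads + bytes_optimizer)
    return total_bytes / 1024**3


# Assumption: roughly 1.3e9 parameters, as suggested by the model name.
estimate = training_memory_gib(1.3e9)
print(f"{estimate:.1f} GiB")
```

Under these assumptions the estimate is a little over 19 GiB before any activations are counted, which exceeds the memory of many single GPUs.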