We decided to use the `jinaai/code_exercises` dataset for generating synthetic data. First, we need to download it.

In [13]:
from datasets import load_dataset
ds = load_dataset("jinaai/code_exercises")

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
Generating train split: 100%|██████████| 1468146/1468146 [00:03<00:00, 442317.35 examples/s]


For now, we will work with only a small subset of the data, since the original dataset has over 1.4 million samples.

In [16]:
train_subset = ds['train'].select(range(200))
print(train_subset)

Dataset({
    features: ['problem', 'solution'],
    num_rows: 200
})


Now, we will save the subset to a JSON file so that it is easier to process later on.

In [17]:
# to_json writes JSON Lines (one record per line) by default
train_subset.to_json('data/train_subset.json')
print("Subset saved to data/train_subset.json")

Creating json from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 58.81ba/s]

Subset saved to data/train_subset.json
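
Because `to_json` writes one JSON object per line, the saved file can be read back with the standard library alone. The following is a minimal sketch, not part of the original notebook; `read_jsonl` is a hypothetical helper name.

```python
import json


def read_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records
```

For example, `rows = read_jsonl("data/train_subset.json")` should give a list of dicts with the `problem` and `solution` keys shown above.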





Now, let's try to use an LLM for code translation. One candidate for this task is `Salesforce/codet5-base`, an encoder-decoder model pretrained on source code.

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "Salesforce/codet5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

We define a function that builds the prompt and generates the translation.

In [None]:
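def translate_python_to_kotlin(python_code):
    """Prompt the model to translate a Python snippet and return the decoded output."""
    input_text = f"Translate the following Python code:\n\n{python_code} to Kotlin code:\n\n"
    # Truncate over-long prompts to the model's maximum input length
    inputs = tokenizer(input_text, return_tensors="pt", truncation=True)

    # Beam search; no_repeat_ngram_size=2 forbids any repeated bigram in the output
    outputs = model.generate(**inputs, max_length=150, num_beams=5, no_repeat_ngram_size=2)
    kotlin_code = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return kotlin_code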

Let's test code translation on some basic Python code.

In [12]:

python_code_example = """
    total = 0
    for num in nums:
        if num > 10:
            total += num
    return total
"""

kotlin_translation = translate_python_to_kotlin(python_code_example)
print("Kotlin Code:\n", kotlin_translation)


Kotlin Code:
  def( nums )()() )()())()()


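As we can see, the output is not valid Kotlin at all. I tried many other Python code examples, but the results were similar. I also experimented with some other models, and those attempts were equally unsuccessful. One likely reason is that `codet5-base` is a pretrained checkpoint that has not been fine-tuned for translation or for following natural-language prompts.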
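
When trying many examples, a quick automatic check helps to flag degenerate outputs like the one above. The sketch below is only a rough heuristic and was not used in the original experiments. The helper names `brackets_balanced` and `looks_like_kotlin` are hypothetical, the keyword list is an assumption, and brackets inside string literals or comments are not handled.

```python
def brackets_balanced(code: str) -> bool:
    """Return True if (), [] and {} are properly nested in `code`."""
    pairs = {")": "(", "]": "[", "}": "{"}
    stack = []
    for ch in code:
        if ch in "([{":
            stack.append(ch)
        elif ch in pairs:
            # A closing bracket must match the most recent opening one
            if not stack or stack.pop() != pairs[ch]:
                return False
    return not stack


def looks_like_kotlin(code: str) -> bool:
    """Rough heuristic: balanced brackets plus at least one common Kotlin keyword."""
    keywords = ("fun ", "val ", "var ")  # assumed minimal keyword list
    return brackets_balanced(code) and any(k in code for k in keywords)
```

Passing the generated string from the cell above to `looks_like_kotlin` returns `False`, because the parentheses are unbalanced. A check like this can only reject obvious garbage; it says nothing about whether a translation is correct.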

Since I couldn't find a suitable model or method for LLM-based code translation, I decided to do it manually: I copied the first 100 rows from the `train_subset.json` file and pasted them, together with a well-structured prompt, into ChatGPT 3.5 (web interface). I then stored the translated code in the `data/python_to_kotlin_data.jsonl` file. Now we can fine-tune our model.
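
The manual step can be partly scripted. The sketch below shows one way to assemble a batch prompt from the subset rows and to store the returned translations as JSON Lines. It is an illustration under assumptions, not the exact prompt or file layout used here: the prompt wording, the helper names `build_translation_prompt` and `write_pairs`, and the `python`/`kotlin` field names are all hypothetical, and each snippet is assumed to be the `problem` text followed by the `solution` text.

```python
import json


def build_translation_prompt(rows):
    """Build one prompt asking for Kotlin translations of several Python snippets."""
    header = (
        "Translate each Python snippet below to idiomatic Kotlin. "
        "Return one translation per snippet, in the same order.\n"
    )
    parts = [header]
    for i, row in enumerate(rows, start=1):
        # Assumption: a full snippet is the problem text followed by its solution
        parts.append(f"### Snippet {i}\n{row['problem']}{row['solution']}\n")
    return "\n".join(parts)


def write_pairs(path, python_snippets, kotlin_snippets):
    """Write (Python, Kotlin) pairs as JSON Lines; field names are hypothetical."""
    if len(python_snippets) != len(kotlin_snippets):
        raise ValueError("each Python snippet needs exactly one translation")
    with open(path, "w", encoding="utf-8") as f:
        for py, kt in zip(python_snippets, kotlin_snippets):
            f.write(json.dumps({"python": py, "kotlin": kt}) + "\n")
```

The prompt returned by `build_translation_prompt` can be pasted into the chat interface, and the translations copied back and passed to `write_pairs`. Checking that both lists have the same length guards against a translation being dropped or merged during copy and paste.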