## Dataset processing steps
1. Original seed dataset
2. Base dataset (for SFT model training and creation of KTO/DPO datasets)
3. KTO/DPO datasets

## Seed dataset

The seed dataset is `code_search_net` for Java and Python, and KExercises for Kotlin.
The seed dataset is filtered as follows:
- **Filter code based on its size**
- **Filter documentation based on its size**
- **Filter code by counting spaces**
- **Remove documentation with "TODO" mentions**
- **Filter code quality and content**: remove examples with long functions and long if/while bodies
- **Filter code references that don't compile**
- ..
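
A few of the filters above can be sketched as plain predicates. This is a minimal illustration, not the project's actual filtering code: the thresholds (`MAX_CODE_CHARS`, `MAX_DOC_CHARS`, `MAX_SPACE_RATIO`) and the function name `keep_example` are hypothetical, and the "counting spaces" rule is interpreted here as a cap on the share of space characters in the code.

```python
# Hypothetical thresholds, chosen only for illustration.
MAX_CODE_CHARS = 2000
MAX_DOC_CHARS = 1000
MAX_SPACE_RATIO = 0.5


def keep_example(code: str, documentation: str) -> bool:
    """Return True if an example passes the size, space, and TODO filters."""
    # Filter code and documentation based on their size.
    if not code or len(code) > MAX_CODE_CHARS:
        return False
    if not documentation or len(documentation) > MAX_DOC_CHARS:
        return False
    # Assumed reading of "counting spaces": reject code that is mostly spaces.
    if code.count(" ") / len(code) > MAX_SPACE_RATIO:
        return False
    # Remove documentation that mentions "TODO".
    if "TODO" in documentation:
        return False
    return True


examples = [
    {"code": "fun f(): Int {\n    return 1\n}", "documentation": "/** Returns one. */"},
    {"code": "fun g() {}", "documentation": "/** TODO: describe */"},
]
kept = [e for e in examples if keep_example(e["code"], e["documentation"])]
```

Here only the first example survives, because the second one mentions "TODO" in its documentation. The compile check and the code-quality filters are left out of this sketch because they need a compiler and a parser.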


KExercises keys:
```
{
    "problem",
    "solution"
}
```

KExercises example

In [1]:
from datasets import load_dataset
from pprint import pprint

kexercises_dataset = load_dataset("JetBrains/KExercises", split="train", trust_remote_code=True)
pprint(kexercises_dataset[0])

{'problem': '/**\n'
            ' * This exercise requires you to iterate over a specific range of '
            'numbers and perform XOR operation on them.\n'
            ' *\n'
            ' * @param numbers A list of numbers\n'
            ' * @param start The start index of the range\n'
            ' * @param end The end index of the range\n'
            ' * @return The XOR result of all numbers in the specified range\n'
            ' */\n'
            'fun xorOperation(numbers: List<Int>, start: Int, end: Int): Int {',
 'solution': '\n'
             '    var result = 0\n'
             '\n'
             '    for (i in start..end) {  // iterating over the specified '
             'range\n'
             '        result = result xor numbers[i]  // performing XOR '
             'operation\n'
             '    }\n'
             '\n'
             '    return result\n'
             '}\n'
             '\n'
             '// Example usage'}


## Base dataset
The base dataset is tied to a specific model: it contains that model's prediction for every problem, together with the necessary information extracted from the seed dataset.

Base dataset keys:

- `code_completion` is the code reference (the name comes from `code_search_net`)

- `prompt` is the problem as it is sent to the LLM

```
{
    "code_completion",
    "documentation",
    "func_name",
    "function_def",
    "language",
    "prediction",
    "prompt",
    "whole_func_string"
}
```

Base dataset example:

In [2]:
base_dataset = load_dataset("stojchet/deepseek_bs1_kotlin-empty", split="train", trust_remote_code=True)
pprint(base_dataset[0])

{'code_completion': '\n'
                    "    val vowels = listOf('a', 'e', 'i', 'o', 'u')\n"
                    '    var vowelsCount = 0\n'
                    '    var modifiedText = text\n'
                    '\n'
                    '    for (char in text) {\n'
                    '        if (char.toLowerCase() in vowels) {\n'
                    '            modifiedText = '
                    'modifiedText.replace(char.toString(), "*")\n'
                    '            vowelsCount++\n'
                    '        }\n'
                    '    }\n'
                    '\n'
                    '    return Pair(modifiedText, vowelsCount)\n'
                    '}',
 'documentation': '/**\n'
                  ' * This function counts the number of vowels in a given '
                  'text and replaces them with asterisks.\n'
                  ' */',
 'func_name': 'countVowels',
 'function_def': 'fun countVowels(text: String): Pair<String, Int> {',
 'language': 'kotlin',


## KTO dataset

- `label` is a binary True/False label based on whether `prompt + "\n" + completion` compiles.

- All dataset references are added with label True.
- All model completions that don't compile are added with label False.
- Optionally, model completions are corrupted by modifying keywords or symbols and added with label False.

```
{
    "prompt",
    "completion",
    "label",
}
```
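
The labeling rules above can be sketched as follows. This is an illustration, not the code inside `create_kto_dataset`: `build_kto_records`, the `compiles` callable, and the `corrupt` callable are hypothetical stand-ins (the real pipeline checks compilation with a Kotlin compiler), and the input field names follow the base dataset keys.

```python
from typing import Callable, Dict, List, Optional


def build_kto_records(
    base_rows: List[Dict[str, str]],
    compiles: Callable[[str], bool],
    corrupt: Optional[Callable[[str], str]] = None,
) -> List[Dict[str, object]]:
    """Turn base-dataset rows into KTO records (prompt, completion, label)."""
    records: List[Dict[str, object]] = []
    for row in base_rows:
        prompt = row["prompt"]
        # Every dataset reference is added as a positive example.
        records.append({"prompt": prompt, "completion": row["code_completion"], "label": True})
        prediction = row["prediction"]
        if not compiles(prompt + "\n" + prediction):
            # Predictions that fail to compile become negative examples.
            records.append({"prompt": prompt, "completion": prediction, "label": False})
        elif corrupt is not None:
            # Assumption: a compiling prediction may be corrupted to get a negative.
            records.append({"prompt": prompt, "completion": corrupt(prediction), "label": False})
    return records


def balanced_braces(source: str) -> bool:
    """Toy stand-in for a compiler: accept code whose braces are balanced."""
    return source.count("{") == source.count("}")


def misspell_return(completion: str) -> str:
    """Toy corruption: break the `return` keyword."""
    return completion.replace("return", "retrun")


rows = [
    {"prompt": "fun one(): Int {", "code_completion": "    return 1\n}", "prediction": "    return 1"},
    {"prompt": "fun two(): Int {", "code_completion": "    return 2\n}", "prediction": "    return 2\n}"},
]
kto_records = build_kto_records(rows, balanced_braces, corrupt=misspell_return)
```

Passing `corrupt=None` keeps only the first two rules, so compiling predictions are skipped and no corrupted negatives are produced.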


In [3]:
kto_dataset = load_dataset("stojchet/deepseek_bs1_kotlin-empty", split="train", trust_remote_code=True)
pprint(kto_dataset[0])

{'code_completion': '\n'
                    "    val vowels = listOf('a', 'e', 'i', 'o', 'u')\n"
                    '    var vowelsCount = 0\n'
                    '    var modifiedText = text\n'
                    '\n'
                    '    for (char in text) {\n'
                    '        if (char.toLowerCase() in vowels) {\n'
                    '            modifiedText = '
                    'modifiedText.replace(char.toString(), "*")\n'
                    '            vowelsCount++\n'
                    '        }\n'
                    '    }\n'
                    '\n'
                    '    return Pair(modifiedText, vowelsCount)\n'
                    '}',
 'documentation': '/**\n'
                  ' * This function counts the number of vowels in a given '
                  'text and replaces them with asterisks.\n'
                  ' */',
 'func_name': 'countVowels',
 'function_def': 'fun countVowels(text: String): Pair<String, Int> {',
 'language': 'kotlin',


## DPO dataset

DPO dataset keys:

- `rejected` is a completion that does not compile when concatenated with the prompt. It comes from the model prediction and, as in the KTO dataset, may have been corrupted.

    `prompt + "\n" + rejected`

- `chosen` is the dataset reference solution, which compiles.

```
{
    "prompt",
    "rejected",
    "chosen",
}
```
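
As a rough sketch of how such pairs could be assembled from base dataset rows: each prediction that fails to compile is paired with the reference solution for the same prompt. The names `build_dpo_records` and `compiles` are hypothetical, the brace-counting check is only a toy stand-in for a real Kotlin compilation, and the optional corruption step is omitted. It is not the code inside `create_dpo_dataset`.

```python
from typing import Callable, Dict, List


def build_dpo_records(
    base_rows: List[Dict[str, str]],
    compiles: Callable[[str], bool],
) -> List[Dict[str, str]]:
    """Pair each non-compiling prediction with the reference solution."""
    records: List[Dict[str, str]] = []
    for row in base_rows:
        prompt = row["prompt"]
        prediction = row["prediction"]
        # Only predictions that fail to compile can serve as "rejected".
        if compiles(prompt + "\n" + prediction):
            continue
        records.append(
            {"prompt": prompt, "rejected": prediction, "chosen": row["code_completion"]}
        )
    return records


def balanced_braces(source: str) -> bool:
    """Toy stand-in for a compiler: accept code whose braces are balanced."""
    return source.count("{") == source.count("}")


rows = [
    {"prompt": "fun one(): Int {", "code_completion": "    return 1\n}", "prediction": "    return 1"},
    {"prompt": "fun two(): Int {", "code_completion": "    return 2\n}", "prediction": "    return 2\n}"},
]
dpo_records = build_dpo_records(rows, balanced_braces)
```

Only the first row yields a pair here, since the second prediction passes the toy compile check and therefore cannot be used as `rejected`.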

DPO dataset example:

In [4]:
dpo_dataset = load_dataset("stojchet/deepseek_bs1_kotlin-empty", split="train", trust_remote_code=True)
pprint(dpo_dataset[0])

{'code_completion': '\n'
                    "    val vowels = listOf('a', 'e', 'i', 'o', 'u')\n"
                    '    var vowelsCount = 0\n'
                    '    var modifiedText = text\n'
                    '\n'
                    '    for (char in text) {\n'
                    '        if (char.toLowerCase() in vowels) {\n'
                    '            modifiedText = '
                    'modifiedText.replace(char.toString(), "*")\n'
                    '            vowelsCount++\n'
                    '        }\n'
                    '    }\n'
                    '\n'
                    '    return Pair(modifiedText, vowelsCount)\n'
                    '}',
 'documentation': '/**\n'
                  ' * This function counts the number of vowels in a given '
                  'text and replaces them with asterisks.\n'
                  ' */',
 'func_name': 'countVowels',
 'function_def': 'fun countVowels(text: String): Pair<String, Int> {',
 'language': 'kotlin',


# Dataset analysis

## Compilations

In [7]:
from src.cf_dataset.compiler import compile_function


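def count_compiled(dataset, prompt_str: str, completion_str: str, limit: int = 100):
    # Count how many of the first `limit` examples compile as Kotlin.
    compiled = 0
    checked = 0
    for example in dataset:
        if checked >= limit:
            break
        checked += 1
        completion = example[completion_str]
        # Check for an empty completion first so the compiler is not invoked needlessly.
        if len(completion) > 0 and compile_function["kotlin"](example[prompt_str] + "\n" + completion):
            compiled += 1

    print("Dataset size: " + str(len(dataset)))
    print("Examples checked: " + str(checked))
    print("Examples that compile: " + str(compiled))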

In [8]:
print("KExercises")
count_compiled(kexercises_dataset, "problem", "solution")

KExercises


FileNotFoundError: [Errno 2] No such file or directory: 'out.jar'

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-1.3b-base", trust_remote_code=True, use_fast=True)

Distribution of length of the problems sent to the LLM (docstring + function definition)

In [None]:
from notebooks.util import tokenize_and_bucketize

tokenize_and_bucketize(tokenizer, kexercises_dataset, "problem", 50)
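
The helper `tokenize_and_bucketize` lives in `notebooks.util` and is not shown here. As a rough idea of what such a length histogram involves, the sketch below counts token lengths into fixed-width buckets. Everything in it is an assumption: `bucketize_lengths` is a hypothetical name, whitespace splitting stands in for the real tokenizer, and the fourth argument in the call above is assumed to be a bucket width.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List


def bucketize_lengths(
    texts: Iterable[str],
    tokenize: Callable[[str], List[str]],
    bucket_size: int = 50,
) -> Dict[int, int]:
    """Map each bucket's lower bound to the number of texts in that bucket."""
    counts: Counter = Counter()
    for text in texts:
        n_tokens = len(tokenize(text))
        # Bucket k covers token counts in [k, k + bucket_size).
        lower_bound = (n_tokens // bucket_size) * bucket_size
        counts[lower_bound] += 1
    return dict(sorted(counts.items()))


# Whitespace splitting is used here only to keep the sketch self-contained.
histogram = bucketize_lengths(["a b c", "a " * 60, "x"], str.split, bucket_size=50)
```

With a Hugging Face tokenizer, the `tokenize` argument could be `tokenizer.tokenize`, which returns a list of token strings for a given text.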

Distribution of length of reference solutions of KExercises

In [None]:
tokenize_and_bucketize(tokenizer, kexercises_dataset, "solution")

Distribution of length of reference solutions of KExercises after filtering

In [None]:
tokenize_and_bucketize(tokenizer, base_dataset, "code_completion")

Distribution of length of predictions

In [None]:
tokenize_and_bucketize(tokenizer, base_dataset, "prediction")

KTO/DPO dataset sizes

In [None]:
print("KTO dataset size")
print(kto_dataset.num_rows)

print("DPO dataset size")
print(dpo_dataset.num_rows)

# Dataset creation
This section showcases the main functions where the processing happens. To change the processing, edit those functions.

Base dataset

In [None]:
import torch
import gc

torch.cuda.empty_cache()
gc.collect()

In [None]:
from src.util import get_small_dataset
from src.model_impl import Model
from src.cf_dataset.util import _collect_predictions

# load model and dataset
model = Model(name="deepseek-ai/deepseek-coder-1.3b-base", truncation=True)
# this is the filtered dataset from KExercises
full_seed_dataset = load_dataset("stojchet/base_prediction_dataset", "kotlin", split="train", trust_remote_code=True)
seed_dataset = get_small_dataset(full_seed_dataset.to_iterable_dataset(), 10)

# call model.predict on each "problem" and store it in "prediction"
base_dataset = _collect_predictions(seed_dataset, model, "", batch_size=1, split="train", language="kotlin")

# after that you can save the dataset
base_dataset

KTO dataset

In [None]:
from src.cf_dataset.kto_dataset import create_kto_dataset

kto_dataset = create_kto_dataset(base_dataset, "kotlin", 0.5)
kto_dataset

DPO dataset

In [None]:
from src.cf_dataset.dpo_dataset import create_dpo_dataset

dpo_dataset = create_dpo_dataset(base_dataset, "kotlin", 1.0)
dpo_dataset