# Curating a coding dataset with Lilac

This is the accompanying notebook for the [blog post](https://docs.lilacml.com/blog/curate-coding-dataset.html) on curating a coding dataset with Lilac.

Check out the [live demo](https://lilacai-lilac.hf.space/datasets#lilac/glaive&expandedStats=%7B%22answer_formatted.has_edit%22%3Atrue%7D&query=%7B%22filters%22%3A%5B%7B%22path%22%3A%5B%22answer_formatted%22%2C%22has_edit%22%5D%2C%22op%22%3A%22equals%22%2C%22value%22%3A1%7D%5D%7D&compareColumns=%5B%7B%22column%22%3A%5B%22answer%22%5D%2C%22compareToColumn%22%3A%5B%22answer_formatted%22%2C%22answer%22%5D%2C%22swapDirection%22%3Afalse%7D%5D&rowId=%22fffc265c-845e-4a2b-b3ce-2caa61fed0f4%22).


In [1]:
import lilac as ll

ll.set_project_dir('./demo_data')

try:
  ds = ll.get_dataset('lilac', 'glaive')
except Exception:
  # Create the dataset.
  config = ll.DatasetConfig(
    namespace='lilac',
    name='glaive',
    source=ll.HuggingFaceSource(dataset_name='glaiveai/glaive-code-assistant'),
  )
  ds = ll.create_dataset(config)



In [None]:
# Start the lilac webserver.
ll.start_server()

INFO:     Started server process [70330]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:5432 (Press CTRL+C to quit)


Perhaps you already have a cluster running?
Hosting the HTTP server on port 52328 instead
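
The formatting cell below shells out to the `ruff` binary, so it can help to confirm that `ruff` is on the `PATH` before mapping over the dataset. A minimal sketch, assuming a hypothetical helper named `format_snippet` (not part of Lilac or of this notebook) that returns `None` when `ruff` is missing or rejects the input:

```python
import shutil
import subprocess


def format_snippet(code):
  """Hypothetical helper: format `code` with ruff, or return None on failure."""
  if shutil.which('ruff') is None:
    # ruff is not installed or not on the PATH.
    return None
  try:
    return subprocess.check_output(
      ['ruff', 'format', '-'], input=code, encoding='utf-8', stderr=subprocess.DEVNULL
    )
  except subprocess.CalledProcessError:
    # ruff exited with a non-zero status, for example on code it cannot parse.
    return None


formatted = format_snippet('x=1\n')
print(repr(formatted))
```

A `None` result here suggests that the full map would silently skip every code block, so it is worth checking once up front.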


In [18]:
import re
import subprocess
import lilac as ll
from pprint import pprint

# Non-greedy capture so that each fenced block in an answer is matched separately.
code_block_re = re.compile(r'```(py|python)\n(.*?)\n```', re.MULTILINE | re.DOTALL)


# Format the code blocks of the "answer" column using the `ruff` formatter.
def format_code(item):
  text = item['answer']
  if not text:
    return None

  new_text = text
  has_edit = False
  for _, code_block in code_block_re.findall(text):
    if not code_block:
      continue
    try:
      # Call the ruff binary to format the current code block.
      formatted_code_block = subprocess.check_output(
        ['ruff', 'format', '-'], input=code_block, encoding='utf-8', stderr=subprocess.DEVNULL
      )
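      # ruff ends its output with a newline that the regex capture excludes; strip it before comparing.
      formatted_code_block = formatted_code_block.rstrip('\n')
      if formatted_code_block != code_block:
        new_text = new_text.replace(code_block, formatted_code_block)
        has_edit = True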
    except subprocess.CalledProcessError:
      continue
  return {'answer': new_text, 'has_edit': has_edit}


# Run over a small sample and print the output to check that the formatter behaves as expected. We omit
# `output_column` to avoid writing to the dataset.
sample_output = ds.map(format_code, limit=3)
pprint(list(sample_output))

[lilac/glaive][1 shards] map "format_code" to "format_code":   0%|          | 0/136109 [00:00<?, ?it/s]

[{'answer': "You can achieve this by using the numpy library in Python. Here's "
            'an example code snippet:\n'
            '\n'
            '```\n'
            'import numpy as np\n'
            '\n'
            'def generate_gaussian_noise(mean=0, std=0.1):\n'
            '    noise = np.random.normal(mean, std, 1000)\n'
            '    return noise\n'
            '```\n'
            '\n'
            'In this code, we first import the numpy library using `import '
            'numpy as np`. Then, we define a function called '
            '`generate_gaussian_noise` which takes two optional parameters: '
            '`mean` (default value is 0) and `std` (default value is 0.1). \n'
            '\n'
            'Inside the function, we use `np.random.normal(mean, std, 1000)` '
            'to generate an array of 1000 random numbers that follow a '
            'Gaussian distribution with the given mean and standard deviation. '
            'This is done using the `normal` fun
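
The fenced-block extraction can be checked in isolation from Lilac and `ruff`. A minimal sketch on a made-up answer string (not taken from the dataset) that compares a greedy capture with a non-greedy one; it also shows that blocks without a `py` or `python` tag, like the one in the sample output above, are not matched at all:

```python
import re

# Hypothetical answer text with two tagged blocks and one untagged block.
answer = (
  'First:\n```python\nx=1\n```\n'
  'Second:\n```py\ny=2\n```\n'
  'Untagged:\n```\nz=3\n```\n'
)

flags = re.MULTILINE | re.DOTALL
greedy = re.compile(r'```(py|python)\n(.*)\n```', flags)
lazy = re.compile(r'```(py|python)\n(.*?)\n```', flags)

greedy_blocks = [code for _, code in greedy.findall(answer)]
lazy_blocks = [code for _, code in lazy.findall(answer)]

print(len(greedy_blocks))  # 1: everything up to the last closing fence is one match
print(lazy_blocks)  # ['x=1', 'y=2']
```

With the greedy pattern, prose and fence markers between blocks end up inside the captured "code", which `ruff` would then fail to parse.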




In [19]:
# Run over the whole dataset.
ds.map(
  format_code,
  output_column='answer_formatted',
  num_jobs=-1,  # Use all available CPU cores.
  execution_type='processes',  # Run on multiple processes.
  overwrite=True,
)

[lilac/glaive][12 shards] map "format_code" to "answer_formatted": 100%|██████████| 136109/136109 [02:15<00:00, 1004.39it/s]


Wrote map output to ./demo_data/datasets/lilac/glaive/answer_formatted-00000-of-00001.parquet


<lilac.data.dataset_duckdb.DuckDBMapOutput at 0x2c748da90>
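
After the map finishes, a plain-text diff is a quick way to see what the formatter changed in a given answer, similar in spirit to the column comparison in the live demo. A minimal sketch using the standard library's `difflib`; the two strings are made-up stand-ins for a row's `answer` and `answer_formatted.answer` values, not data from the dataset:

```python
import difflib

# Hypothetical before/after values for a single code block.
original = 'def add(a,b):\n  return a+b\n'
formatted = 'def add(a, b):\n    return a + b\n'

diff_lines = list(
  difflib.unified_diff(
    original.splitlines(),
    formatted.splitlines(),
    fromfile='answer',
    tofile='answer_formatted.answer',
    lineterm='',
  )
)
print('\n'.join(diff_lines))
```

Lines prefixed with `-` come from the original answer and lines prefixed with `+` from the formatted one.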