
wyatt-avilla/sunbird


sunbird 🐦‍🔥


Overview

This project translates x86-64 assembly back into C using a machine learning model trained on a dataset of C code snippets. Each snippet is compiled at multiple optimization levels across different compilers, and the resulting assembly is tokenized for use in training.

Dataset

  • the model was trained on an augmented version of this dataset
  • each snippet of C code is compiled (by default) with the first four optimization levels (-O0 through -O3) of GCC and Clang, yielding 8 unique assembly snippets for each element in the initial dataset (totaling 2.5 million snippets)
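The 2-compiler × 4-level grid above can be sketched as follows. The function name and defaults here are hypothetical, not taken from the repository:

```python
from itertools import product


def build_compile_commands(
    src_path: str,
    compilers: tuple[str, ...] = ("gcc", "clang"),
    opt_levels: tuple[str, ...] = ("-O0", "-O1", "-O2", "-O3"),
) -> list[list[str]]:
    """Build one compile command per (compiler, optimization level) pair.

    Each command emits assembly (-S) to stdout (-o -) so the caller can
    capture it for tokenization.
    """
    return [
        [cc, "-S", opt, "-o", "-", src_path]
        for cc, opt in product(compilers, opt_levels)
    ]


# 2 compilers x 4 optimization levels = 8 assembly variants per snippet
commands = build_compile_commands("snippet.c")
assert len(commands) == 8
```

Running each command with `subprocess.run(..., capture_output=True)` would then yield the 8 assembly variants per source snippet.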

If the kaggle CLI is on your PATH, the original dataset can be downloaded with:

```sh
kaggle datasets download -d shirshaka/c-code-snippets-and-their-labels && \
unzip -d dataset c-code-snippets-and-their-labels.zip
```

Generation

  • compilation is performed lazily, as needed, when calling DatasetIterator.take(n)
    • compilation settings, including optimization levels and compiler choices, are specified in the arguments to this method call
  • the exact flags passed to the compilation subprocesses are specified in the .compile() methods in compilation.py
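A minimal sketch of this lazy, on-demand pattern; DatasetIterator here is a hypothetical stand-in that only enumerates the compile work instead of invoking real compiler subprocesses, so the example stays runnable:

```python
from collections.abc import Iterable, Iterator


class DatasetIterator:
    """Defers compilation until .take(n) is actually called."""

    def __init__(self, snippets: Iterable[str]) -> None:
        self._snippets: Iterator[str] = iter(snippets)

    def take(
        self,
        n: int,
        compilers: tuple[str, ...] = ("gcc", "clang"),
        opt_levels: tuple[str, ...] = ("-O0", "-O1", "-O2", "-O3"),
    ) -> list[tuple[str, str, str]]:
        """Return the (compiler, opt_level, source) work items for n snippets.

        A real implementation would run the compiler subprocess for each
        item here; this sketch just enumerates the pending work.
        """
        out: list[tuple[str, str, str]] = []
        for _ in range(n):
            src = next(self._snippets, None)
            if src is None:
                break
            for cc in compilers:
                for opt in opt_levels:
                    out.append((cc, opt, src))
        return out


it = DatasetIterator(["int main(void) { return 0; }"])
items = it.take(1)
assert len(items) == 8  # one snippet x 2 compilers x 4 levels
```

Because the iterator only advances when take(n) is called, nothing is compiled up front, matching the as-needed behavior described above.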

Tokenization

C and assembly code snippets are tokenized semantically using the tree-sitter library. Each token pairs the raw text with its symbolic identity, e.g., (variable, 42).
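To illustrate the (symbolic identity, raw text) token shape without requiring the tree-sitter grammar to be installed, here is a simplified, dependency-free lexer; it is an illustrative stand-in, not the project's actual tokenizer:

```python
import re

# Regex stand-in for the tree-sitter pass: each token category is a named
# group, and the matched group's name becomes the token's symbolic identity.
TOKEN_SPEC = [
    ("number", r"\d+"),
    ("identifier", r"[A-Za-z_]\w*"),
    ("punctuation", r"[{}();=+\-*/,]"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(code: str) -> list[tuple[str, str]]:
    """Return (kind, text) pairs, skipping whitespace between matches."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(code)]


tokens = tokenize("int x = 42;")
# Yields pairs such as ("identifier", "x") and ("number", "42"),
# mirroring the (variable, 42)-style output described above.
```

A tree-sitter-based version would instead walk the parse tree and emit each leaf node's type alongside its source text, giving richer identities (e.g., distinguishing a declaration's type name from a variable use).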