This project focuses on translating x86-64 assembly back into C code using a machine learning model trained on a dataset of C code snippets. Each snippet is compiled with multiple optimization levels across different compilers, and the resulting assembly code is tokenized for use in training.
- The model was trained on an augmented version of this dataset.
- Each snippet of C code is compiled (by default) at the first four optimization levels (`-O0` through `-O3`) of GCC and Clang, yielding 8 unique assembly snippets per element of the initial dataset (2.5 million snippets in total).
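As a rough illustration of where the 8 variants per snippet come from (the compiler names and level lists below are assumptions for the sketch, not read from the project code), the dataset is the cross product of two compilers and four optimization levels:

```python
from itertools import product

# Assumed compiler binaries and optimization levels; the project's actual
# defaults live in its compilation code, not here.
COMPILERS = ["gcc", "clang"]
OPT_LEVELS = ["-O0", "-O1", "-O2", "-O3"]

def variants():
    """Yield one (compiler, opt_level) pair per assembly variant of a snippet."""
    yield from product(COMPILERS, OPT_LEVELS)

print(len(list(variants())))  # 2 compilers x 4 levels = 8 variants per snippet
```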
If `kaggle` is in your path, the original dataset can be downloaded with:

```sh
kaggle datasets download -d shirshaka/c-code-snippets-and-their-labels && \
    unzip -d dataset c-code-snippets-and-their-labels.zip
```
- Compilation is performed on demand when calling `DatasetIterator.take(n)`.
- Compilation settings, including optimization levels and compiler choices, are specified in the arguments to this method call.
- The exact flags passed to the compilation subprocesses are specified in the `.compile()` methods in `compilation.py`.
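A minimal sketch of what such a `.compile()` method might assemble before launching the compiler subprocess; the function name `build_compile_cmd` and its parameters are hypothetical, not the project's actual API:

```python
import subprocess

def build_compile_cmd(compiler, opt_level, src_path, out_path):
    """Build a command that compiles a C file to assembly (-S) at the
    given optimization level. All names here are illustrative."""
    return [compiler, "-S", opt_level, "-o", out_path, src_path]

cmd = build_compile_cmd("gcc", "-O2", "snippet.c", "snippet.s")
# subprocess.run(cmd, check=True)  # run once per snippet, on demand
```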
C and assembly code snippets are tokenized semantically using the tree-sitter library. Each token pairs its raw text with its symbolic identity, e.g., `(variable, 42)`.
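To show the shape of that output only, here is a hand-rolled regex tokenizer, not tree-sitter itself; tree-sitter derives token identities from a real C grammar, and the category names below are illustrative:

```python
import re

# Illustrative token categories; a real grammar distinguishes many more.
TOKEN_SPEC = [
    ("keyword",    r"\b(?:int|return|if|else|while|for)\b"),
    ("identifier", r"[A-Za-z_]\w*"),
    ("number",     r"\d+"),
    ("punct",      r"[{}();=+\-*/<>,]"),
]
PATTERN = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC))

def tokenize(code):
    """Return (symbolic identity, raw text) pairs, mimicking the
    described token format."""
    return [(m.lastgroup, m.group()) for m in PATTERN.finditer(code)]

print(tokenize("int x = 42;"))
# -> [('keyword', 'int'), ('identifier', 'x'), ('punct', '='),
#     ('number', '42'), ('punct', ';')]
```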