Welcome to the repository for the research paper: "Between Lines of Code: Unraveling the Distinct Patterns of Machine and Human Programmers." Here, we present DetectCodeGPT, a novel approach to distinguish between machine- and human-generated code snippets. This README will guide you through setting up and using the DetectCodeGPT framework.
Experiments were conducted using Python 3.9.7 on an Ubuntu 22.01.1 server.
To install all required packages, navigate to the root directory of this project and run:
pip install -r requirements.txt
To prepare the datasets used in our study:
- Navigate to the code-generation directory.
- Obtain datasets from either:
- Update the data paths and model specifications in generate.py to reflect your local setup.
- Execute the data generation script with:
python generate.py
After data preparation, you can proceed to the empirical analysis:
- Navigate to the code-analysis directory.
- Analyze code length by running:
python analyze_length.py
- Verify Zipf's and Heaps' laws, and compute token frequencies with:
python analyze_law_and_frequency.py
- Analyze the proportion of different token categories by executing:
python analyze_proportion.py
- Study the naturalness of code snippets via:
python analyze_naturalness.py
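For orientation, the two laws named in the analysis steps above can be sketched in a few lines of standard-library Python. This is an illustration only, not the code in analyze_law_and_frequency.py: the regex tokenizer and the function names (tokenize, rank_frequency, vocabulary_growth, loglog_slope) are hypothetical, and the repository's scripts may tokenize and fit differently.

```python
import math
import re
from collections import Counter


def tokenize(code):
    # Hypothetical tokenizer (an assumption, not the repository's):
    # identifiers, integer literals, or any single non-space symbol.
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def rank_frequency(tokens):
    # Zipf's law: the frequency of the r-th most common token is
    # roughly proportional to 1 / r**alpha. Returns (rank, frequency) pairs.
    counts = sorted(Counter(tokens).values(), reverse=True)
    return list(enumerate(counts, start=1))


def vocabulary_growth(tokens):
    # Heaps' law: the number of distinct tokens grows roughly as
    # k * n**beta with the number of tokens n read so far.
    seen = set()
    growth = []
    for n, tok in enumerate(tokens, start=1):
        seen.add(tok)
        growth.append((n, len(seen)))
    return growth


def loglog_slope(points):
    # Least-squares slope of log(y) against log(x); on a log-log plot a
    # power law is a straight line and this slope is its exponent.
    xs = [math.log(x) for x, _ in points]
    ys = [math.log(y) for _, y in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


if __name__ == "__main__":
    sample = (
        "def add(a, b):\n    return a + b\n\n"
        "def sub(a, b):\n    return a - b\n"
    )
    tokens = tokenize(sample)
    print("Zipf slope:", loglog_slope(rank_frequency(tokens)))
    print("Heaps slope:", loglog_slope(vocabulary_growth(tokens)))
```

On a corpus that follows these laws, the rank-frequency slope is negative and the vocabulary-growth slope typically falls between 0 and 1. Comparing the fitted slopes for human-written and machine-generated snippets is one simple way to contrast the two sets.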
To evaluate our DetectCodeGPT model:
- Navigate to the code-detection directory.
- Configure main.py with the appropriate model and dataset paths.
- Run the model evaluation script with:
python main.py
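This README does not say which metric main.py reports. Assuming the detector assigns each snippet a score where higher means "more likely machine-generated", a common way to summarize such scores is AUROC. The sketch below is a hypothetical helper for checking scores by hand, not part of the repository.

```python
def auroc(human_scores, machine_scores):
    # Probability that a randomly chosen machine-generated sample scores
    # higher than a randomly chosen human-written one; ties count as half.
    # Assumes both lists are non-empty.
    wins = 0.0
    for m in machine_scores:
        for h in human_scores:
            if m > h:
                wins += 1.0
            elif m == h:
                wins += 0.5
    return wins / (len(machine_scores) * len(human_scores))


if __name__ == "__main__":
    # Made-up scores for illustration only.
    print(auroc([0.1, 0.2, 0.4], [0.3, 0.8, 0.9]))
```

A value of 1.0 means the scores separate the two classes perfectly, and 0.5 means they do no better than chance. The pairwise loop is quadratic in the number of samples, which is fine for spot checks but slow on large evaluation sets.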
If you encounter any issues or have questions, please feel free to contact us!
We hope that our work will aid in advancing the field of machine learning in code generation and detection. Thank you for your interest in DetectCodeGPT!