- Paper: Link to Paper
- Hugging Face Dataset: WildCode on Hugging Face
- GitHub Repository: [WildCode on GitHub](https://github.com/regularpooria/wildcode)
- Kobra Khanmohammadi (Sheridan College, Ontario, Canada) - kobra.khanmohammadi@sheridancollege.ca
- Pooria Roy (School of Computing, Queen's University, Kingston, Canada) - pooria.roy@queensu.ca
- Raphael Khoury (Université du Québec en Outaouais (UQO), Canada) - raphael.khoury@uqo.ca
- Wahab Hamou-Lhadj (Concordia University, Montreal, Canada) - wahab.hamou-lhadj@concordia.ca
- Wilfried Patrick Konan (Université du Québec en Outaouais (UQO), Canada) - konk14@uqo.ca
- Clone this repository
  ```bash
  git clone https://github.com/regularpooria/wildcode
  ```
- Clone the submodules with a depth of 1
  ```bash
  git submodule update --init --recursive --depth 1
  ```
- Set up a Python virtual environment with the `virtualenv` package (install it and create "venv")
  ```bash
  pip install virtualenv
  virtualenv venv
  source venv/bin/activate
  ```
- Install the packages from `requirements.txt`
  ```bash
  pip install -r requirements.txt
  ```
- Some experiments require additional linting and analysis tools; in those cases, the installation is explained and performed in the notebook itself.
- This repository was developed on a Linux environment (WSL). As a result, some commands in the notebooks are Linux-only and will not run on Windows.
Follow these files in sequence:
- `experiments/code_snippets/Extract_codesnippets_from_wildchat.ipynb`
  - This notebook loads the WildChat-1M dataset and extracts every code snippet that starts and ends with three backticks ("```").
  - It saves every code snippet it finds into a JSON file called `tmp/code_snippets.json`. Each entry contains a "conversation_hash" that corresponds to the conversation in the WildChat-1M dataset, the code itself, and the programming language name given after the opening backticks.
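  For illustration, a minimal sketch of what this extraction could look like (the field names `conversation_hash`, `conversation`, and `content` come from WildChat-1M; the notebook's exact logic and output format may differ):

  ```python
  # Sketch only: pull fenced code blocks out of WildChat-1M and store them with their conversation hash.
  import json
  import re

  from datasets import load_dataset

  # ```lang\n ... ``` -- the language tag after the opening backticks is optional.
  FENCE_RE = re.compile(r"```([A-Za-z0-9+#_.-]*)\n(.*?)```", re.DOTALL)

  snippets = []
  for row in load_dataset("allenai/WildChat-1M", split="train", streaming=True):
      for turn in row["conversation"]:
          for language, code in FENCE_RE.findall(turn["content"] or ""):
              snippets.append({
                  "conversation_hash": row["conversation_hash"],
                  "language": language.lower() or None,  # empty tag -> unlabeled snippet
                  "code": code,
              })

  with open("tmp/code_snippets.json", "w") as f:
      json.dump(snippets, f)
  ```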
- `experiments/code_snippets/Classify_programming_language.ipynb`
  - This notebook loads `tmp/code_snippets.json` and picks out every code snippet that does not have a language name assigned to it (~90k rows).
  - At the end, the predicted languages are written back into `tmp/code_snippets.json`.
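  For illustration, a sketch of the relabeling step; the `guesslang` package is used here only as a stand-in classifier and may not be what the notebook actually uses:

  ```python
  # Sketch only: fill in missing language tags for unlabeled snippets with a stand-in classifier.
  import json

  from guesslang import Guess  # stand-in; the notebook's classifier may differ

  guess = Guess()

  with open("tmp/code_snippets.json") as f:
      snippets = json.load(f)

  for snippet in snippets:
      if not snippet["language"]:  # only the unlabeled rows
          snippet["language"] = guess.language_name(snippet["code"]).lower()

  with open("tmp/code_snippets.json", "w") as f:
      json.dump(snippets, f)
  ```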
- `experiments/code_snippets/run_linting.ipynb`
  - To ensure that we have clean code in Python, C/C++, C#, Java, JavaScript, and PHP, we use each language's linting/compiler tools to weed out code snippets with invalid syntax.
  - This notebook also uses `tmp/code_snippets.json` to perform the syntax checks for the languages mentioned above.
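  As a rough illustration of the idea (only two of the languages are shown, and the exact commands and flags the notebook uses may differ):

  ```python
  # Sketch only: syntax-check snippets by invoking each language's own tooling.
  import json
  import subprocess
  import tempfile

  CHECKERS = {
      "python": (".py", ["python", "-m", "py_compile"]),
      "javascript": (".js", ["node", "--check"]),
  }

  with open("tmp/code_snippets.json") as f:
      snippets = json.load(f)

  bad_indices = []
  for i, snippet in enumerate(snippets):
      if snippet["language"] not in CHECKERS:
          continue
      suffix, command = CHECKERS[snippet["language"]]
      with tempfile.NamedTemporaryFile("w", suffix=suffix, delete=False) as tmp:
          tmp.write(snippet["code"])
      if subprocess.run(command + [tmp.name], capture_output=True).returncode != 0:
          bad_indices.append(i)  # snippet failed the syntax check

  print(f"{len(bad_indices)} snippets failed the syntax check")
  ```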
- `experiments/code_snippets/remove_bad_lints.ipynb`
  - This notebook iterates through the linting results and removes the flagged snippets from the temporary dataset.
  - At this point, dataset generation is done. For our paper, we split `tmp/code_snippets.json` into individual JSON files by language and uploaded the result to Hugging Face as WildCode.
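  A sketch of the filtering and per-language split (`tmp/bad_lints.json` is a placeholder name for the saved linting results, not necessarily the file the notebook uses):

  ```python
  # Sketch only: drop flagged snippets, then split the clean dataset into one JSON file per language.
  import json
  from collections import defaultdict

  with open("tmp/code_snippets.json") as f:
      snippets = json.load(f)
  with open("tmp/bad_lints.json") as f:  # placeholder path for the linting results
      bad_indices = set(json.load(f))

  by_language = defaultdict(list)
  for i, snippet in enumerate(snippets):
      if i not in bad_indices:
          by_language[snippet["language"]].append(snippet)

  for language, rows in by_language.items():
      with open(f"tmp/code_snippets_{language}.json", "w") as f:
          json.dump(rows, f)
  ```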
- `experiments/hallucinations/javascript_hallucinations.ipynb`
  - This notebook loads the WildCode dataset and extracts the libraries used in JavaScript/NodeJS code snippets.
  - The list of libraries is then checked against the "all-the-package-names" package from NPM, which contains a list of ALL npm packages and is updated regularly. Libraries that exist are excluded from the list.
  - The list of libraries is then checked against the built-in NodeJS modules, which are also removed from the list.
  - The results are written to `results/hallucinations_javascript.json` at the end. This file may still include libraries that actually exist; we tackle this problem in `experiments/hallucinations/verify_hallucinations.ipynb`.
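  A condensed sketch of the check (the import regex, the `names.json` path inside `all-the-package-names`, and the snippet field names are assumptions; the notebook's implementation may differ):

  ```python
  # Sketch only: flag JavaScript imports that are neither published on npm nor NodeJS built-ins.
  import json
  import re
  import subprocess

  IMPORT_RE = re.compile(r"""(?:require\(|from\s+|import\s+)['"]([^'"./][^'"]*)['"]""")

  # List of every npm package name, shipped with the all-the-package-names package.
  with open("node_modules/all-the-package-names/names.json") as f:
      npm_packages = set(json.load(f))

  # Built-in NodeJS modules, straight from Node itself.
  builtins = set(subprocess.run(
      ["node", "-p", "require('module').builtinModules.join('\\n')"],
      capture_output=True, text=True,
  ).stdout.split())

  with open("tmp/code_snippets.json") as f:
      snippets = json.load(f)

  suspects = {}
  for snippet in snippets:
      if snippet["language"] != "javascript":
          continue
      for name in IMPORT_RE.findall(snippet["code"]):
          package = name if name.startswith("@") else name.split("/")[0]
          if package not in npm_packages and package not in builtins:
              suspects.setdefault(package, []).append(snippet["conversation_hash"])

  with open("results/hallucinations_javascript.json", "w") as f:
      json.dump(suspects, f)
  ```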
- `experiments/hallucinations/python_hallucinations.ipynb`
  - This notebook performs the same library extraction as step 1.1 of the JavaScript notebook, but for Python code snippets.
  - Similar to step 1.2, but for Python: we use the `utils/simple` file, which contains a list of all PyPI packages at the time of writing this README. A newer version can be obtained by querying PyPI.
  - Similar to step 1.3, this step checks against the built-in Python modules and removes them from the list.
  - The results are written to `results/hallucinations_python.json`.
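  A sketch of the Python analogue (it assumes `utils/simple` holds one package name per line and uses `sys.stdlib_module_names`, which requires Python 3.10+; note that import names and PyPI distribution names do not always match, which is one reason the verification step below exists):

  ```python
  # Sketch only: flag Python imports that appear neither on PyPI nor in the standard library.
  import json
  import re
  import sys

  IMPORT_RE = re.compile(r"^\s*(?:import|from)\s+([A-Za-z_][\w.]*)", re.MULTILINE)

  with open("utils/simple") as f:  # assumed format: one PyPI package name per line
      pypi_packages = {line.strip().lower() for line in f if line.strip()}

  stdlib = {name.lower() for name in sys.stdlib_module_names}  # Python 3.10+

  with open("tmp/code_snippets.json") as f:
      snippets = json.load(f)

  suspects = {}
  for snippet in snippets:
      if snippet["language"] != "python":
          continue
      for name in IMPORT_RE.findall(snippet["code"]):
          top = name.split(".")[0].lower()
          if top not in pypi_packages and top not in stdlib:
              suspects.setdefault(top, []).append(snippet["conversation_hash"])

  with open("results/hallucinations_python.json", "w") as f:
      json.dump(suspects, f)
  ```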
- `experiments/hallucinations/verify_hallucinations.ipynb`
  - This notebook reads the list of hallucinations for either Python or JavaScript and asks an LLM to search the web and determine whether each library name is hallucinated or legitimate.
  - For the LLM, you will need to use OpenRouter and provide your own API key.
  - This creates a new file, `results/hallucinations_LANGUAGE_cleaned_with_ai.json` (where LANGUAGE is either `javascript` or `python`).
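  A bare-bones sketch of the verification call (the model name, prompt, and output format are placeholders rather than the notebook's actual settings; only the OpenRouter endpoint and request shape are standard):

  ```python
  # Sketch only: ask an LLM via OpenRouter whether each suspect package name really exists.
  import json
  import os

  import requests

  API_URL = "https://openrouter.ai/api/v1/chat/completions"
  HEADERS = {"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"}

  language = "python"  # or "javascript"
  with open(f"results/hallucinations_{language}.json") as f:
      suspects = json.load(f)

  verdicts = {}
  for package in suspects:
      response = requests.post(API_URL, headers=HEADERS, json={
          "model": "openai/gpt-4o-mini",  # placeholder model id
          "messages": [{
              "role": "user",
              "content": f"Does a published {language} package named '{package}' exist? Answer yes or no.",
          }],
      })
      verdicts[package] = response.json()["choices"][0]["message"]["content"]

  with open(f"results/hallucinations_{language}_cleaned_with_ai.json", "w") as f:
      json.dump(verdicts, f)
  ```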
- `experiments/security_analysis/run_statnt_analysis_tool.ipynb`
  - This notebook uses `opengrep` and `opengrep-rules` to perform vulnerability checks on the WildCode dataset.
  - The output is saved to `.sarif` files for each language and also into `.csv` files in the `results` folder. These are later used to target more specific issues such as weak hash algorithms, SQL injection, etc.
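  As an illustration of the SARIF-to-CSV part (the file names are placeholders; the field paths follow the standard SARIF schema):

  ```python
  # Sketch only: flatten a SARIF report into a CSV of findings for one language.
  import csv
  import json

  with open("results/python.sarif") as f:  # placeholder path
      sarif = json.load(f)

  with open("results/python_findings.csv", "w", newline="") as f:  # placeholder path
      writer = csv.writer(f)
      writer.writerow(["rule_id", "message", "file", "line"])
      for result in sarif["runs"][0]["results"]:
          location = result["locations"][0]["physicalLocation"]
          writer.writerow([
              result["ruleId"],
              result["message"]["text"],
              location["artifactLocation"]["uri"],
              location["region"]["startLine"],
          ])
  ```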
- `experiments/security_analysis/parse_results.ipynb` (for `deserialization`, `hash`, `unsafe_memory`, `weak_random`)
  - These notebooks look through the static analysis results and return the conversation hashes affected by each of these issues, along with some statistics that we used in the paper.
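  A sketch of the idea behind these notebooks (the CSV name, the rule-id keywords, and the assumption that scanned file names encode the conversation hash are all placeholders):

  ```python
  # Sketch only: collect the conversation hashes affected by one issue category (e.g. weak hashes).
  import csv
  from pathlib import Path

  KEYWORDS = ["md5", "sha1", "weak-hash"]  # placeholder keywords for the "hash" category

  affected = set()
  with open("results/python_findings.csv") as f:  # placeholder path from the previous step
      for row in csv.DictReader(f):
          if any(keyword in row["rule_id"].lower() for keyword in KEYWORDS):
              affected.add(Path(row["file"]).stem)  # file name assumed to encode the conversation hash

  print(f"{len(affected)} conversations affected")
  ```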
