- Paper: Link to Paper
- Hugging Face Dataset: WildCode on Hugging Face
- GitHub Repository: [WildCode on GitHub](https://github.com/regularpooria/wildcode)
- Kobra Khanmohammadi (Sheridan College, Ontario, Canada) - kobra.khanmohammadi@sheridancollege.ca
- Pooria Roy (School of Computing, Queen's University, Kingston, Canada) - pooria.roy@queensu.ca
- Raphael Khoury (Université du Québec en Outaouais (UQO), Canada) - raphael.khoury@uqo.ca
- Wahab Hamou-Lhadj (Concordia University, Montreal, Canada) - wahab.hamou-lhadj@concordia.ca
- Wilfried Patrick Konan (Université du Québec en Outaouais (UQO), Canada) - konk14@uqo.ca
- Clone this repository
  ```bash
  git clone https://github.com/regularpooria/wildcode
  ```
- Clone the submodules with a depth of 1
  ```bash
  git submodule update --init --recursive --depth 1
  ```
- Set up a Python virtual environment with the `virtualenv` package (install it and create "venv")
  ```bash
  pip install virtualenv
  virtualenv venv
  source venv/bin/activate
  ```
- Install the packages from `requirements.txt`
  ```bash
  pip install -r requirements.txt
  ```
- Some experiments require additional linting and analysis tools; in those cases, the installation is explained and performed in the notebook itself.
- This repository was developed on a Linux environment (WSL). As a result, some commands in the notebooks are Linux-only and will not run on Windows.
Follow these files in sequence:
- `experiments/code_snippets/Extract_codesnippets_from_wildchat.ipynb`
  - This notebook loads the WildChat-1M dataset and extracts every code snippet that starts and ends with three backticks ("```").
  - It saves every code snippet it finds into a JSON file called `tmp/code_snippets.json`. Each entry contains a "conversation_hash" that corresponds to the conversation in the WildChat-1M dataset, the code itself, and the programming language name given after the opening backticks.
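  For illustration, a minimal sketch of what this extraction could look like (the field names `conversation_hash`, `conversation`, and `content` come from WildChat-1M; the notebook's exact logic and output format may differ):

  ```python
  # Sketch only: pull fenced code blocks out of WildChat-1M and store them with their conversation hash.
  import json
  import re

  from datasets import load_dataset

  # ```lang\n ... ``` -- the language tag after the opening backticks is optional.
  FENCE_RE = re.compile(r"```([A-Za-z0-9+#_.-]*)\n(.*?)```", re.DOTALL)

  snippets = []
  for row in load_dataset("allenai/WildChat-1M", split="train", streaming=True):
      for turn in row["conversation"]:
          for language, code in FENCE_RE.findall(turn["content"] or ""):
              snippets.append({
                  "conversation_hash": row["conversation_hash"],
                  "language": language.lower() or None,  # empty tag -> unlabeled snippet
                  "code": code,
              })

  with open("tmp/code_snippets.json", "w") as f:
      json.dump(snippets, f)
  ```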
- `experiments/code_snippets/Classify_programming_language.ipynb`
  - This notebook loads `tmp/code_snippets.json` and picks out every code snippet that does not have a language name assigned to it (~90k rows).
  - At the end, the predicted languages are written back into `tmp/code_snippets.json`.
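  For illustration, a sketch of the relabeling step; the `guesslang` package is used here only as a stand-in classifier and may not be what the notebook actually uses:

  ```python
  # Sketch only: fill in missing language tags for unlabeled snippets with a stand-in classifier.
  import json

  from guesslang import Guess  # stand-in; the notebook's classifier may differ

  guess = Guess()

  with open("tmp/code_snippets.json") as f:
      snippets = json.load(f)

  for snippet in snippets:
      if not snippet["language"]:  # only the unlabeled rows
          snippet["language"] = guess.language_name(snippet["code"]).lower()

  with open("tmp/code_snippets.json", "w") as f:
      json.dump(snippets, f)
  ```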
- `experiments/code_snippets/run_linting.ipynb`
  - To ensure that we have clean code in Python, C/C++, C#, Java, JavaScript, and PHP, we use each language's linting/compiler tools to weed out code snippets with invalid syntax.
  - This notebook also uses `tmp/code_snippets.json` to perform the syntax checks for the languages mentioned above.
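  As a rough illustration of the idea (only two of the languages are shown, and the exact commands and flags the notebook uses may differ):

  ```python
  # Sketch only: syntax-check snippets by invoking each language's own tooling.
  import json
  import subprocess
  import tempfile

  CHECKERS = {
      "python": (".py", ["python", "-m", "py_compile"]),
      "javascript": (".js", ["node", "--check"]),
  }

  with open("tmp/code_snippets.json") as f:
      snippets = json.load(f)

  bad_indices = []
  for i, snippet in enumerate(snippets):
      if snippet["language"] not in CHECKERS:
          continue
      suffix, command = CHECKERS[snippet["language"]]
      with tempfile.NamedTemporaryFile("w", suffix=suffix, delete=False) as tmp:
          tmp.write(snippet["code"])
      if subprocess.run(command + [tmp.name], capture_output=True).returncode != 0:
          bad_indices.append(i)  # snippet failed the syntax check

  print(f"{len(bad_indices)} snippets failed the syntax check")
  ```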
- `experiments/code_snippets/remove_bad_lints.ipynb`
  - This notebook iterates through the linting results and removes the flagged snippets from the temporary dataset.
  - At this point, dataset generation is done. For our paper, we split `tmp/code_snippets.json` into individual JSON files by language and uploaded the result to Hugging Face as WildCode.
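  A sketch of the filtering and per-language split (`tmp/bad_lints.json` is a placeholder name for the saved linting results, not necessarily the file the notebook uses):

  ```python
  # Sketch only: drop flagged snippets, then split the clean dataset into one JSON file per language.
  import json
  from collections import defaultdict

  with open("tmp/code_snippets.json") as f:
      snippets = json.load(f)
  with open("tmp/bad_lints.json") as f:  # placeholder path for the linting results
      bad_indices = set(json.load(f))

  by_language = defaultdict(list)
  for i, snippet in enumerate(snippets):
      if i not in bad_indices:
          by_language[snippet["language"]].append(snippet)

  for language, rows in by_language.items():
      with open(f"tmp/code_snippets_{language}.json", "w") as f:
          json.dump(rows, f)
  ```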
- `experiments/hallucinations/javascript_hallucinations.ipynb`
  - This notebook loads the WildCode dataset and extracts the libraries used in JavaScript/NodeJS code snippets.
  - The list of libraries is then checked against the "all-the-package-names" package from NPM, which contains a list of ALL npm packages and is updated regularly. Libraries that exist are excluded from the list.
  - The list of libraries is then checked against the built-in NodeJS modules, which are also removed from the list.
  - The results are written to `results/hallucinations_javascript.json` at the end. This file may still include libraries that actually exist; we tackle this problem in `experiments/hallucinations/verify_hallucinations.ipynb`.
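  A condensed sketch of the check (the import regex, the `names.json` path inside `all-the-package-names`, and the snippet field names are assumptions; the notebook's implementation may differ):

  ```python
  # Sketch only: flag JavaScript imports that are neither published on npm nor NodeJS built-ins.
  import json
  import re
  import subprocess

  IMPORT_RE = re.compile(r"""(?:require\(|from\s+|import\s+)['"]([^'"./][^'"]*)['"]""")

  # List of every npm package name, shipped with the all-the-package-names package.
  with open("node_modules/all-the-package-names/names.json") as f:
      npm_packages = set(json.load(f))

  # Built-in NodeJS modules, straight from Node itself.
  builtins = set(subprocess.run(
      ["node", "-p", "require('module').builtinModules.join('\\n')"],
      capture_output=True, text=True,
  ).stdout.split())

  with open("tmp/code_snippets.json") as f:
      snippets = json.load(f)

  suspects = {}
  for snippet in snippets:
      if snippet["language"] != "javascript":
          continue
      for name in IMPORT_RE.findall(snippet["code"]):
          package = name if name.startswith("@") else name.split("/")[0]
          if package not in npm_packages and package not in builtins:
              suspects.setdefault(package, []).append(snippet["conversation_hash"])

  with open("results/hallucinations_javascript.json", "w") as f:
      json.dump(suspects, f)
  ```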
- `experiments/hallucinations/python_hallucinations.ipynb`
  - This notebook performs the same library extraction as step 1.1 of the JavaScript notebook, but for Python code snippets.
  - Similar to step 1.2, but for Python: we use the `utils/simple` file, which contains a list of all PyPI packages at the time of writing this README. A newer version can be obtained by querying PyPI.
  - Similar to step 1.3, this step checks against the built-in Python modules and removes them from the list.
  - The results are written to `results/hallucinations_python.json`.
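  A sketch of the Python analogue (it assumes `utils/simple` holds one package name per line and uses `sys.stdlib_module_names`, which requires Python 3.10+; note that import names and PyPI distribution names do not always match, which is one reason the verification step below exists):

  ```python
  # Sketch only: flag Python imports that appear neither on PyPI nor in the standard library.
  import json
  import re
  import sys

  IMPORT_RE = re.compile(r"^\s*(?:import|from)\s+([A-Za-z_][\w.]*)", re.MULTILINE)

  with open("utils/simple") as f:  # assumed format: one PyPI package name per line
      pypi_packages = {line.strip().lower() for line in f if line.strip()}

  stdlib = {name.lower() for name in sys.stdlib_module_names}  # Python 3.10+

  with open("tmp/code_snippets.json") as f:
      snippets = json.load(f)

  suspects = {}
  for snippet in snippets:
      if snippet["language"] != "python":
          continue
      for name in IMPORT_RE.findall(snippet["code"]):
          top = name.split(".")[0].lower()
          if top not in pypi_packages and top not in stdlib:
              suspects.setdefault(top, []).append(snippet["conversation_hash"])

  with open("results/hallucinations_python.json", "w") as f:
      json.dump(suspects, f)
  ```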
- `experiments/hallucinations/verify_hallucinations.ipynb`
  - This notebook reads the list of hallucinations for either Python or JavaScript and asks an LLM to search the web and determine whether each library name is hallucinated or legitimate.
  - For the LLM, you will need to use OpenRouter and provide your own API key.
  - This creates a new file, `results/hallucinations_LANGUAGE_cleaned_with_ai.json` (where LANGUAGE is either `javascript` or `python`).
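  A bare-bones sketch of the verification call (the model name, prompt, and output format are placeholders rather than the notebook's actual settings; only the OpenRouter endpoint and request shape are standard):

  ```python
  # Sketch only: ask an LLM via OpenRouter whether each suspect package name really exists.
  import json
  import os

  import requests

  API_URL = "https://openrouter.ai/api/v1/chat/completions"
  HEADERS = {"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"}

  language = "python"  # or "javascript"
  with open(f"results/hallucinations_{language}.json") as f:
      suspects = json.load(f)

  verdicts = {}
  for package in suspects:
      response = requests.post(API_URL, headers=HEADERS, json={
          "model": "openai/gpt-4o-mini",  # placeholder model id
          "messages": [{
              "role": "user",
              "content": f"Does a published {language} package named '{package}' exist? Answer yes or no.",
          }],
      })
      verdicts[package] = response.json()["choices"][0]["message"]["content"]

  with open(f"results/hallucinations_{language}_cleaned_with_ai.json", "w") as f:
      json.dump(verdicts, f)
  ```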
- `experiments/security_analysis/run_statnt_analysis_tool.ipynb`
  - This notebook uses `opengrep` and `opengrep-rules` to perform vulnerability checks on the WildCode dataset.
  - The output is saved to `.sarif` files for each language and also into `.csv` files in the `results` folder. These are later used to target more specific issues such as weak hash algorithms, SQL injection, etc.
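  As an illustration of the SARIF-to-CSV part (the file names are placeholders; the field paths follow the standard SARIF schema):

  ```python
  # Sketch only: flatten a SARIF report into a CSV of findings for one language.
  import csv
  import json

  with open("results/python.sarif") as f:  # placeholder path
      sarif = json.load(f)

  with open("results/python_findings.csv", "w", newline="") as f:  # placeholder path
      writer = csv.writer(f)
      writer.writerow(["rule_id", "message", "file", "line"])
      for result in sarif["runs"][0]["results"]:
          location = result["locations"][0]["physicalLocation"]
          writer.writerow([
              result["ruleId"],
              result["message"]["text"],
              location["artifactLocation"]["uri"],
              location["region"]["startLine"],
          ])
  ```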
- `experiments/security_analysis/parse_results.ipynb` (for `deserialization`, `hash`, `unsafe_memory`, `weak_random`)
  - These notebooks look through the static analysis results and return the conversation hashes affected by each of these issues, along with some statistics that we used in the paper.
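  A sketch of the idea behind these notebooks (the CSV name, the rule-id keywords, and the assumption that scanned file names encode the conversation hash are all placeholders):

  ```python
  # Sketch only: collect the conversation hashes affected by one issue category (e.g. weak hashes).
  import csv
  from pathlib import Path

  KEYWORDS = ["md5", "sha1", "weak-hash"]  # placeholder keywords for the "hash" category

  affected = set()
  with open("results/python_findings.csv") as f:  # placeholder path from the previous step
      for row in csv.DictReader(f):
          if any(keyword in row["rule_id"].lower() for keyword in KEYWORDS):
              affected.add(Path(row["file"]).stem)  # file name assumed to encode the conversation hash

  print(f"{len(affected)} conversations affected")
  ```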
