codenavi

This is the repo for codenavi, a framework for applying neural-network-based language models to software projects in order to help software developers identify bugs and ship better code.

I worked on this project as part of my UC Berkeley MIDS class, Natural Language Processing with Deep Learning.

Paper can be found here.

Underlying tech used: Python, Docker, Haskell, code2seq, TensorFlow

Overview

This project contains the files for codenavi, a framework that helps software developers review code by applying Natural Language Processing techniques to code.

Primer

In large, established software development projects, bugs can be costly. An IBM study found that a bug can be 5x more costly to identify and fix after it is deployed than when it is identified in the design phase. Software development can be fast moving, complicated, and confusing. Technical details may be siloed with certain developers, and information about them may be lost over time. Common industry practices involve structured processes such as CI/CD and testing frameworks - this project aims to expand on these existing frameworks to help reduce recurring bugs in software projects.

The purpose of this project is to create a framework that analyzes a software project's history of pull requests and trains a language model to assess the risk that a pull request introduces an issue or problematic code into the main/master branch. The score is then presented to developers to help them review individual pull requests.

Purpose

This project is unique in that it treats the modified code in each PR as the unit of record for a machine learning model, instead of generic code snippets. Most existing papers at the time of writing use prepackaged datasets to tackle this problem; this repo is an effort to tackle it at scale. The space is rapidly evolving (e.g., with OpenAI Codex), so some elements, such as the AST approach, may not prove as useful over the long term, but the data pipeline elements will likely remain useful to others for a long time.

It takes existing repos that use a seq2seq approach to code and structures them so that most, if not all, code on GitHub can be passed through a seq2seq abstract syntax tree encoder-decoder neural model.

Machine Learning Pipeline Overview

The process of converting the underlying history of code changes for a project into a format that is understandable by a language model is quite complex and involved. The overall process is outlined below:

1) Pick a project of interest - project requirements

An important aspect here is that the project has a long history of committers and contributors. Additionally, you will need to be able to label pull requests accordingly. For this project, I picked matplotlib; however, any project that meets these requirements will do.

Matplotlib has an extensive history of pull requests (10k+), as well as a long history of closed issues which link back to pull requests.

I went through and manually labeled a dataset; however, depending on how the discussion around a project is formatted, it may be worthwhile to create an additional framework to scrape and label this data automatically.

2) Scrape the data

Data was scraped using the GitHub Pull Request API. Several important fields were pulled from the API response:

-The commit hash associated with the pull request ID, required for git to pull up the information in question

-The corresponding .diff file, which shows which files were changed as part of the pull request

-The list of files that were changed as part of each merge request, so that they can be isolated

You can run PR_api_request.py to scrape the necessary files.

GitHub has a rate limiting mechanism, so it is recommended to perform the API call in conjunction with a user-generated personal access token (PAT).

python PR_api_request.py <your user generated PAT token>
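
As a rough illustration of the request described above, here is a minimal sketch that builds an authenticated call to the GitHub "list pull requests" endpoint and keeps only the fields mentioned earlier. It is not the contents of PR_api_request.py; the function names, the choice of `state=closed`, and the page size are assumptions made for the example.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"


def build_pulls_request(owner, repo, token, page=1, per_page=100):
    """Build a request for one page of closed pull requests (assumed filter)."""
    url = (
        f"{API_ROOT}/repos/{owner}/{repo}/pulls"
        f"?state=closed&per_page={per_page}&page={page}"
    )
    headers = {
        "Accept": "application/vnd.github+json",
        # The PAT raises the rate limit compared to anonymous calls.
        "Authorization": f"token {token}",
    }
    return urllib.request.Request(url, headers=headers)


def extract_fields(pull):
    """Keep only the fields the later pipeline steps need."""
    return {
        "number": pull["number"],
        "merge_commit_sha": pull.get("merge_commit_sha"),
        "diff_url": pull.get("diff_url"),
    }


def fetch_page(owner, repo, token, page=1):
    """Fetch one page of pull requests (performs a network call)."""
    request = build_pulls_request(owner, repo, token, page=page)
    with urllib.request.urlopen(request) as response:
        return [extract_fields(pull) for pull in json.load(response)]
```

Calling `fetch_page("matplotlib", "matplotlib", token, page=n)` for increasing `n` until an empty list comes back would walk the full history, one page at a time.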

Once the raw JSON data has been scraped, you will want to extract the files that were changed for each individual pull request. A flow chart of this process is provided below, for one hash iteration.

This process is repeated for each commit hash by the bash script git_scraper.sh:

./git_scraper.sh
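
For readers who want to see what one hash iteration might involve, the sketch below lists the files a commit changed relative to its first parent and builds the command that prints a file as of that commit. It is an assumption about the general approach, not a port of git_scraper.sh; the helper names and the `.py` filter are hypothetical.

```python
import subprocess


def changed_files_command(commit_hash):
    """Command listing files that differ between a commit and its first parent."""
    return ["git", "diff", "--name-only", f"{commit_hash}^1", commit_hash]


def file_at_commit_command(commit_hash, path):
    """Command printing one file's contents as of the given commit."""
    return ["git", "show", f"{commit_hash}:{path}"]


def python_files(name_only_output):
    """Filter `git diff --name-only` output down to Python source files."""
    lines = (line.strip() for line in name_only_output.splitlines())
    return [line for line in lines if line.endswith(".py")]


def changed_python_files(repo_dir, commit_hash):
    """Run git inside repo_dir and return the changed .py paths."""
    result = subprocess.run(
        changed_files_command(commit_hash),
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return python_files(result.stdout)
```

Comparing against the first parent means the same command works for both ordinary commits and merge commits.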

3) Label the data.

Labels are a key part of any Machine Learning project.

After scraping the files, you will want to split the scraped data into the individual files and generate a list of those files.

You can review labeled_bugs.csv to get an idea of the schema. Pull requests that have been mapped to an issue are considered "positives"; all other pull requests are considered "negatives".
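
The labeling rule above can be sketched in a few lines. The column name `pr_number` is a hypothetical stand-in, since the actual schema lives in labeled_bugs.csv; adjust it to match that file.

```python
import csv
import io


def load_bug_pr_numbers(csv_text, column="pr_number"):
    """Collect the PR numbers that a labeled issue points back to.

    `column` is an assumed header name, not the real schema.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    values = ((row.get(column) or "") for row in reader)
    return {int(value) for value in values if value.strip()}


def label_pull_requests(pr_numbers, bug_pr_numbers):
    """Mark a PR positive if an issue maps to it, negative otherwise."""
    return {
        number: ("positive" if number in bug_pr_numbers else "negative")
        for number in pr_numbers
    }
```

Rows with an empty PR column are skipped, so issues that were never linked to a pull request do not produce a label.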

4) Data preprocessing.

Once we have scraped the files associated with each individual pull request, we will want to process the data so that it can be understood by a language model. For the purpose of this project, I have decided to go with an abstract syntax tree approach.
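
To give a feel for what an abstract syntax tree looks like, here is a small sketch using Python's built-in `ast` module as a stand-in. The project itself generates ASTs inside a docker container (see the commands below), so this is only an illustration; the `max_depth` parameter is an assumption meant to mirror the idea of limiting tree depth.

```python
import ast


def node_types(source, max_depth=None):
    """Return AST node type names in pre-order, optionally cut off at max_depth."""
    names = []

    def visit(node, depth):
        names.append(type(node).__name__)
        if max_depth is not None and depth >= max_depth:
            return  # do not descend below the requested depth
        for child in ast.iter_child_nodes(node):
            visit(child, depth + 1)

    visit(ast.parse(source), 0)
    return names


print(node_types("x = 1", max_depth=1))
```

Each name in the returned list is a syntactic construct (module, assignment, and so on) rather than raw text, which is what makes the representation robust to formatting differences.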

We will want to map the unique Pull Request ID to the corresponding hash. Once the files have been mapped, we will want to separate the files into individual "positive" and "negative" folders.
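
A minimal sketch of the mapping and balancing idea follows. It assumes a dictionary from PR number to commit hash and a dictionary of labels; the function names and folder layout are hypothetical and are not taken from map_generator.py or labeled_map_mover.py.

```python
import random


def balance_one_to_one(positives, negatives, seed=0):
    """Downsample the larger class so both classes have the same size."""
    rng = random.Random(seed)
    size = min(len(positives), len(negatives))
    kept_positives = sorted(rng.sample(list(positives), size))
    kept_negatives = sorted(rng.sample(list(negatives), size))
    return kept_positives, kept_negatives


def plan_moves(pr_to_hash, labels, data_root):
    """Map each commit hash to the folder matching its PR's label."""
    return {
        pr_to_hash[pr]: f"{data_root}/{label}"
        for pr, label in labels.items()
        if pr in pr_to_hash  # skip PRs that were never scraped
    }
```

Fixing the seed keeps the sampled subset reproducible between runs, which helps when comparing models trained on the same data.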

The first step here is to parse the individual code files into abstract syntax trees.

# join manually labeled data with scraped PR data from github. Source file for manual labels must
# follow format of labeled_bugs.csv
python map_generator.py

# move data to separate directories. Sample required amounts to enable 1:1 balance
python labeled_map_mover.py

## Parse positive files

# Enter docker container
docker run -it --entrypoint /bin/bash  -v d:/mids/data/positive:/home 082ebf62f850

# Execute script to generate asts within the docker container
./ast_generator.sh

## Parse Negative files

# Enter docker container
docker run -it --entrypoint /bin/bash  -v d:/mids/data/negative:/home 082ebf62f850

# Execute script to generate asts within the docker container
./ast_generator.sh

# Preprocessing step for json_to_seq. Combine qualifying files that were parsed by semantic into one file
python clean_and_agg_seqs.py

# Convert the generated sequences, encoding the label at the beginning of each sequence.
python json_to_seq_v2.py
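
As an illustration of "label at the beginning of the sequence", the sketch below formats one training example the way the code2seq README describes its input, as I understand it: a target label followed by space-separated contexts, each context being three comma-separated components. The label strings and the example context are hypothetical, and this is not the logic of json_to_seq_v2.py.

```python
def format_example(label, contexts):
    """Join a label and (left, path, right) contexts into one input line."""
    fields = [label]
    for left, path, right in contexts:
        parts = (left, path, right)
        # Spaces and commas are the delimiters, so they cannot appear in a component.
        if any(" " in part or "," in part for part in parts):
            raise ValueError("context components may not contain spaces or commas")
        fields.append(",".join(parts))
    return " ".join(fields)


line = format_example("positive", [("x", "Name|Assign|Constant", "1")])
print(line)
```

Because the label is the first field, a model trained to predict the target for each line is effectively predicting whether the pull request was linked to a bug.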

Model training

I will be using an approach based on the one described in the code2seq paper. We have generated our labels and sequences, so the next step is to train a language model on them. Most of the code in this project covers the data generation portion (which can be applied across a wide array of projects on GitHub!) - the modeling itself is left to the code2seq library.

Setting up a VM

The model was trained on AWS. Due to the cost of a GPU-based instance on AWS, the data was preprocessed locally. Once processing finished, the data was tar'd and uploaded to an S3 bucket, from which it was downloaded to the AWS GPU-based instance. I used the Nvidia Deep Learning AMI on AWS, as well as the Nvidia TensorFlow docker container.

The VM instance used was a g4dn.xlarge with a T4 GPU. An additional 1TB of storage was also provisioned for the instance.
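
The packaging step mentioned above (tar the preprocessed data before uploading it) could look roughly like this. It is a sketch using Python's standard `tarfile` module with hypothetical function names; the upload to S3 itself is not shown.

```python
import tarfile
from pathlib import Path


def make_tarball(source_dir, tar_path):
    """Pack a directory into an uncompressed tar archive."""
    source = Path(source_dir)
    with tarfile.open(tar_path, "w") as archive:
        # arcname keeps only the directory name, not the full local path.
        archive.add(source, arcname=source.name)
    return Path(tar_path)


def list_members(tar_path):
    """Return the archive's member names, useful as a sanity check before upload."""
    with tarfile.open(tar_path, "r") as archive:
        return sorted(archive.getnames())
```

An archive written this way can be unpacked on the VM with the `tar -xvf` command shown in the steps below.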

Commands after provisioning the VM to train model.


# command to connect to the vm
ssh -i "w251-ec2.pem" ubuntu@<ec2 instance dns>.compute-1.amazonaws.com

# pull docker container
docker pull nvcr.io/nvidia/tensorflow:21.10-tf2-py3

# mount any additional storage provisioned by VM
sudo mount /dev/nvme1n1 /home/ubuntu/storage

# location of the tar'd data
# https://superli3w251.s3.amazonaws.com/limited.tar

# pull the tar'd files onto the VM
sudo wget <location of tar'd data>

### run docker container ###

## tf 2 - note, I tried Kolkir's TF2 implementation of code2seq, but ran into multiple bugs, which were not present with the original code2seq library.
docker run --gpus all -it --rm -v "/home/ubuntu/project":"/workspace/project" -p 8888:8888 nvcr.io/nvidia/tensorflow:21.10-tf2-py3

## tf 1
docker run --gpus all -it --rm -v "/home/ubuntu/project":"/workspace/project" -p 8888:8888 nvcr.io/nvidia/tensorflow:21.10-tf1-py3

# run docker container without mounted storage, if needed
docker run --gpus all -it --rm -p 8888:8888 nvcr.io/nvidia/tensorflow:21.10-tf1-py3

## run script
source /workspace/project/codenavi/vm_env.sh

chmod +x preprocess.sh
./preprocess.sh $DATA_DIR

# run jupyter lab
jupyter lab

# pull this repo onto the vm
git clone https://github.com/superli3/codenavi.git

# pull github repo for code2seq (note - only works with tf1 container)
git clone https://github.com/tech-srl/code2seq.git

# jupyter lab in the docker container is reachable in a browser at
# <instance ip>:8888

# pull files onto vm
wget <url to tar'd files processed by json_to_seq_v2.py>

# untar files
tar -xvf limited.tar

# set vm environment variables
source vm_env.sh

# From Python150KExtractor in the code2seq repo: preprocess data for training
./preprocess.sh $DATA_DIR

# Train (modify config.py in code2seq repo, as needed)
# set PATIENCE to 300
# set MAX_PATH_LENGTH to length of sequences generated by json_to_seq
# set MAX_TARGET_PARTS to 1
# set TARGET_VOCAB_MAX_SIZE to 2

# from code2seq repo
./train_python150k.sh $DATA_DIR $DESC $CUDA $SEED
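
The config changes listed in the comments above can be summarized as a small set of overrides. The sketch below applies them to a stand-in object; it assumes the settings are plain attributes on a config object, and the helper names, the stand-in object, and the example sequence length of 9 are illustrative only. In practice you would edit the values directly in config.py in the code2seq repo.

```python
from types import SimpleNamespace


def training_overrides(sequence_length):
    """Settings from the steps above; sequence_length comes from json_to_seq."""
    return {
        "PATIENCE": 300,
        "MAX_PATH_LENGTH": sequence_length,
        "MAX_TARGET_PARTS": 1,       # the target is a single label token
        "TARGET_VOCAB_MAX_SIZE": 2,  # two possible labels
    }


def apply_overrides(config, overrides):
    """Set each override as an attribute on the config object."""
    for name, value in overrides.items():
        setattr(config, name, value)
    return config


# Stand-in for the real config object, with a placeholder starting value.
config = apply_overrides(SimpleNamespace(PATIENCE=10), training_overrides(sequence_length=9))
print(config.PATIENCE, config.MAX_PATH_LENGTH)
```

Setting the target vocabulary to two entries and the target length to one part turns the sequence decoder into what is effectively a binary classifier.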

Results

Honestly, not terrible, but lots of room for improvement.

Tree depth	F1 Score
2	0.533
4	0.52
8	0.533
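
For reference, a standard binary F1 score can be computed as in the sketch below. This assumes the scores in the table are the usual harmonic mean of precision and recall on the positive class; the source does not state exactly how they were calculated, and the label strings are hypothetical.

```python
def f1_score(true_labels, predicted_labels, positive="positive"):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```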

Some potential room for improvement, to be left for future devs:

-Isolating changes

-Alternative models - one weakness of code2seq is longer ASTs, as noted by the authors

Papers

Code2Seq: Generating Sequences from Structured Representations of Code - Uri Alon, Shaked Brody, Omer Levy, Eran Yahav - https://arxiv.org/abs/1808.01400

DeepBugs: A Learning Approach to Name-based Bug Detection - Michael Pradel, Koushik Sen - https://arxiv.org/abs/1805.11683

Neural Software Analysis - Michael Pradel, Satish Chandra - https://arxiv.org/abs/2011.07986

Evaluating Large Language Models Trained on Code - Chen, et al. - https://arxiv.org/abs/2107.03374

Questions? Concerns? Feedback?

Happy to hear it - please feel free to drop a line in the issues section, or email me directly.
