# Classifying code with a pretrained model (CodeBERT)
This model (https://huggingface.co/mrm8488/codebert-base-finetuned-detect-insecure-code), given a piece of code, returns 0 for secure code and 1 for insecure code.

In [1]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import numpy as np

# test imports
import re
import pandas as pd



### Data preprocessing
csv_name is the CSV produced by running a NIST dataset through Kyle's parser<br>
folder_name is the folder where the data files are actually stored<br>
num_cases is the number of test cases you want to process (concretely, the first num_cases data points are analyzed)

file_formatting strips comments in successive regex passes. The data files seem to contain HTML-style (<!-- -->), Python-style (#), and PHP-style (/** */) comments. After that, newlines and spaces are removed.

files is a list of strings containing the formatted file contents<br>
labels (stored in state) is a list of 0s and 1s, where 0 = good and 1 = bad
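
To make the comment-stripping passes easier to check, here is a minimal sketch that applies the same three regex passes to an in-memory string instead of a file. The name `strip_comments` and the sample PHP snippet are hypothetical and only for illustration; they are not part of the dataset or the parser.

```python
import re

def strip_comments(source: str) -> str:
    """Hypothetical in-memory variant of file_formatting (no file I/O)."""
    # pass 1: HTML comments <!-- ... -->, possibly spanning several lines
    source = re.sub(r"<!--.*?-->", "", source, flags=re.DOTALL)
    # pass 2: '#' line comments, up to and including the newline
    source = re.sub(r"#.*?\n", "", source)
    # pass 3: /** ... */ docblock comments
    source = re.sub(r"/\*\*[^*]*\*+(?:[^/][^*]*\*+)*/", "", source)
    # finally drop newlines and spaces
    return source.replace("\n", "").replace(" ", "")

# Made-up sample, for illustration only
sample = (
    "<!-- header -->\n"
    "<?php\n"
    "/** docblock */\n"
    "# line comment\n"
    "$id = $_GET['id'];\n"
    "?>\n"
)
cleaned = strip_comments(sample)
print(cleaned)
```

One thing to keep in mind: removing every space also fuses keywords and identifiers (for example, `return $x;` becomes `return$x;`), which may change how the tokenizer splits the code.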

In [24]:
# php csv
csv_name = 'parsed_data.csv'
folder_name = 'data/2022-05-12-php-test-suite-sqli-v1-0-0/'

df = pd.read_csv(csv_name)
df_len = df.shape[0]
# num_cases = int(df_len / 4)
num_cases = 100
df = df.head(num_cases) # take top (num_cases) files for now
filenames = df['file_location']

def file_formatting(file_location):
    # read the raw file contents; the with-block closes the file afterwards
    with open(folder_name + file_location, "r") as fh:
        raw_contents = fh.read()
    # raw strings avoid invalid-escape warnings in the regex patterns
    remove = re.sub(r"(<!--.*?-->)", "", raw_contents, flags=re.DOTALL) # html
    remove = re.sub(r'#.*?\n', '', remove, flags=re.DOTALL) # python
    remove = re.sub(r'/\*\*[^*]*\*+([^/][^*]*\*+)*/', '', remove, flags=re.S) # php
    remove = remove.replace('\n', '').replace(' ','') # newlines and spaces
    return remove

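# files contains the formatted contents of every file
files = []
for f in filenames:
    try:
        fstring = file_formatting(f)
    except (OSError, UnicodeDecodeError) as err:
        # keep files aligned with the labels: use an empty string for unreadable files
        print(f"could not read {f}: {err}")
        fstring = ""
    files.append(fstring)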

# get labels
state = df['state']
def replace_good_bad(lst):
    mapping = {"good": 0, "bad": 1}
    return [mapping.get(item, item) for item in lst]
state = replace_good_bad(state)
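
Note that `mapping.get(item, item)` passes any unexpected value through unchanged, so a typo in the CSV would end up in the label list unnoticed. A stricter variant could look like the sketch below; `encode_labels`, `LABEL_MAP`, and the toy values are hypothetical names and data introduced here for illustration only.

```python
from collections import Counter

LABEL_MAP = {"good": 0, "bad": 1}

def encode_labels(states):
    """Map 'good'/'bad' to 0/1, failing loudly on anything else."""
    unknown = sorted({repr(s) for s in states if s not in LABEL_MAP})
    if unknown:
        raise ValueError(f"unexpected label values: {unknown}")
    return [LABEL_MAP[s] for s in states]

# Toy values for illustration only, not taken from parsed_data.csv
toy_states = ["good", "bad", "good", "good"]
toy_labels = encode_labels(toy_states)

# class balance: how many examples carry each label
balance = Counter(toy_labels)
print(toy_labels, dict(balance))
```

Checking the class balance this way is useful context for the accuracy number later on, since accuracy on an imbalanced set can be misleading.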

### Model setup
You need to run this before running the cell below it!

This initializes the tokenizer and the model, then prints the predicted class and the true label for a single test case.

In [18]:
tokenizer = AutoTokenizer.from_pretrained('mrm8488/codebert-base-finetuned-detect-insecure-code')
model = AutoModelForSequenceClassification.from_pretrained('mrm8488/codebert-base-finetuned-detect-insecure-code')

# single use test
inputs = tokenizer(files[0], return_tensors="pt", truncation=True, padding='max_length')
# target label, hardcoded to 1: only used to compute the loss, not the prediction
pt_labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1
outputs = model(**inputs, labels=pt_labels)
loss = outputs.loss
logits = outputs.logits

print(np.argmax(logits.detach().numpy()), state[0])

1 0
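
Passing `labels=pt_labels` makes the model return a cross-entropy loss alongside the logits. Because the label is hardcoded to 1, that loss is the negative log of the probability the model assigns to class 1, so it can be converted back into a confidence. The sketch below shows the arithmetic with the standard library only; the helper names and the toy logits are hypothetical, and the 0.0174 value is simply one of the loss values printed by the evaluation loop in this notebook.

```python
import math

def prob_of_target(loss: float) -> float:
    """Invert single-example cross-entropy: loss = -log(p_target)."""
    return math.exp(-loss)

def cross_entropy(logits, target: int) -> float:
    """Single-example cross-entropy from raw logits (log-sum-exp minus target logit)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]

# A loss of 0.0174 against label 1 implies this probability for class 1
p_insecure = prob_of_target(0.0174)
print(round(p_insecure, 4))

# Hypothetical logits, chosen only to illustrate the round trip
toy_logits = [-2.0, 2.0]
toy_loss = cross_entropy(toy_logits, target=1)
print(toy_loss, prob_of_target(toy_loss))
```

Read this way, a small loss against the hardcoded label 1 just means the model is very confident in class 1, whatever the true label is.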


### Model running
Super scuffed, but this iterates through num_cases test cases and reports the accuracy of the model. However, I think the pretrained model is heavily predisposed to returning 1. pt_labels is the target label passed to the model so that it can compute a loss; it is hardcoded to 1 here and does not affect the predicted class.

But hey, we have a baseline accuracy of ~0.15 for a Transformers model!
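
If the model really does predict 1 for every input, its accuracy is just the fraction of files labeled bad, so the accuracy alone says little about what the model learned. A small sketch of that check, in plain Python: the function names are hypothetical and the toy labels (3 bad out of 20) are made up to mirror a ~0.15 ratio, not taken from the notebook's data.

```python
def accuracy(preds, labels):
    """Fraction of positions where prediction and label agree."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def confusion_counts(preds, labels):
    """Count true/false positives and negatives, with 1 = insecure as the positive class."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for p, y in zip(preds, labels):
        if p == 1:
            counts["tp" if y == 1 else "fp"] += 1
        else:
            counts["tn" if y == 0 else "fn"] += 1
    return counts

# Toy labels for illustration only: 3 insecure files out of 20
toy_labels = [1] * 3 + [0] * 17
always_one = [1] * len(toy_labels)  # a "model" that always answers 1

acc = accuracy(always_one, toy_labels)
counts = confusion_counts(always_one, toy_labels)
print(acc, counts)
```

Collecting the real predictions into a list and passing them to `confusion_counts` would show directly whether the model ever predicts 0.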

In [25]:
acc = 0

for idx in range(num_cases):
    inputs = tokenizer(files[idx], return_tensors="pt", truncation=True, padding='max_length')
    # target label, hardcoded to 1: only affects the reported loss, not the prediction
    pt_labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1
    outputs = model(**inputs, labels=pt_labels)
    loss = outputs.loss
    logits = outputs.logits
    pred = np.argmax(logits.detach().numpy())  # predicted class: 0 = secure, 1 = insecure
    print(pred, loss, state[idx])
    if pred == state[idx]:
        acc = acc + 1

print(acc / num_cases)

1 tensor(0.0174, grad_fn=<NllLossBackward0>) 0
1 tensor(0.0137, grad_fn=<NllLossBackward0>) 0
1 tensor(0.0329, grad_fn=<NllLossBackward0>) 0
1 tensor(0.0291, grad_fn=<NllLossBackward0>) 1
1 tensor(0.0729, grad_fn=<NllLossBackward0>) 0
1 tensor(0.1860, grad_fn=<NllLossBackward0>) 0
1 tensor(0.0064, grad_fn=<NllLossBackward0>) 0
1 tensor(0.0822, grad_fn=<NllLossBackward0>) 0
1 tensor(0.0390, grad_fn=<NllLossBackward0>) 0
1 tensor(0.0245, grad_fn=<NllLossBackward0>) 1
1 tensor(0.0094, grad_fn=<NllLossBackward0>) 0
1 tensor(0.0279, grad_fn=<NllLossBackward0>) 1
1 tensor(0.0272, grad_fn=<NllLossBackward0>) 0
1 tensor(0.0594, grad_fn=<NllLossBackward0>) 0
1 tensor(0.0067, grad_fn=<NllLossBackward0>) 0
1 tensor(0.0041, grad_fn=<NllLossBackward0>) 0
1 tensor(0.0139, grad_fn=<NllLossBackward0>) 0
1 tensor(0.0198, grad_fn=<NllLossBackward0>) 0
1 tensor(0.0291, grad_fn=<NllLossBackward0>) 0
1 tensor(0.2048, grad_fn=<NllLossBackward0>) 1
1 tensor(0.0165, grad_fn=<NllLossBackward0>) 0
1 tensor(0.12