# Classifying code by fine-tuning a pre-trained model
This model (https://huggingface.co/mrm8488/codebert-base-finetuned-detect-insecure-code), given a piece of code, returns 0 for secure code and 1 for insecure code. We fine-tune the model to see if any improvements are made.

In [1]:
# test imports
import re
import numpy as np
import pandas as pd
from datasets import Dataset
import torch

from transformers import AutoTokenizer, TFAutoModelForSequenceClassification, TrainingArguments, Trainer, AutoModelForSequenceClassification
import tensorflow as tf
from tensorflow.keras.optimizers import Adam

### Data preprocessing
csv_name is the result of a NIST dataset run through Kyle's parser<br>
folder_name is where all the data files are actually stored<br>
num_cases is the number of test cases you want to process (concretely, the first num_cases data points are analyzed)

file_formatting removes comments one style at a time, each with its own regex. The data files seem to have HTML, Python, and PHP style comments. Then we remove newlines and spaces.

files is a list of strings containing the formatted file contents<br>
labels is a list of 0s and 1s, where 0 = good and 1 = bad

In [2]:
# php csv
csv_name = 'parsed_data.csv'
folder_name = 'data/2022-05-12-php-test-suite-sqli-v1-0-0/'

df = pd.read_csv(csv_name)
df_len = df.shape[0]
# num_cases = int(df_len / 4)
num_cases = 100
df = df.head(num_cases) # take top (num_cases) files for now
filenames = df['file_location']

def file_formatting(file_location):
    # read the whole file; the with-block closes the handle afterwards
    with open(folder_name + file_location, "r") as fh:
        raw_contents = fh.read()
    # raw strings keep the regex backslashes from being read as string escapes
    remove = re.sub(r"(<!--.*?-->)", "", raw_contents, flags=re.DOTALL) # html
    remove = re.sub(r'#.*?\n', '', remove, flags=re.DOTALL) # python
    remove = re.sub(r'\/\*\*[^*]*\*+([^/][^*]*\*+)*\/', '', remove, flags=re.S) # php
    remove = remove.replace('\n', '').replace(' ','') # newlines and spaces
    return remove

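# files contains the formatted contents of every file
files = []
for f in filenames:
    try:
        fstring = file_formatting(f)
    except (OSError, UnicodeDecodeError) as err:
        # keep one entry per row so files stays aligned with the labels
        print(f"could not read {f}: {err}")
        fstring = ""
    files.append(fstring)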

# get labels
state = df['state']
def replace_good_bad(lst):
    mapping = {"good": 0, "bad": 1}
    return [mapping.get(item, item) for item in lst]
state = replace_good_bad(state)

# create df of files and labels
data = pd.DataFrame({'label': state, 'file': files})
#print(data, files, labels)
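To see what the three regex passes in file_formatting actually keep and drop, here is a minimal sketch that applies the same patterns to an in-memory string instead of a file. The helper name strip_comments and the sample PHP snippet are made up for illustration; they are not part of the dataset or the parser.

```python
import re


def strip_comments(source):
    """Apply the same regex passes as file_formatting, but to a string."""
    text = re.sub(r"(<!--.*?-->)", "", source, flags=re.DOTALL)  # HTML comments
    text = re.sub(r"#.*?\n", "", text, flags=re.DOTALL)  # '#' line comments
    text = re.sub(r"\/\*\*[^*]*\*+([^/][^*]*\*+)*\/", "", text, flags=re.S)  # /** ... */
    return text.replace("\n", "").replace(" ", "")  # newlines and spaces


# Hypothetical sample, not taken from the NIST test suite
sample = (
    "<!-- header -->\n"
    "<?php\n"
    "/** docblock */\n"
    "# line comment\n"
    "$id = $_GET['id']; // kept\n"
    "echo $id;\n"
    "?>\n"
)
cleaned = strip_comments(sample)
print(cleaned)  # → <?php$id=$_GET['id'];//keptecho$id;?>
```

In this sample the `//` comment survives, and once newlines and spaces are gone it runs straight into the next statement (`//keptecho$id;`). If the real data files contain `//` or single-star `/* */` comments, they would presumably need their own patterns.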


### Model selection and setup
Uncomment one of the model_name lines to choose which pre-trained model and tokenizer to use.

In [4]:
#model_name = 'bert-base-cased' # natural language
model_name = 'microsoft/codebert-base' # code

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = TFAutoModelForSequenceClassification.from_pretrained(model_name)

2023-03-02 19:40:26.890837: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
All model checkpoint layers were used when initializing TFRobertaForSequenceClassification.

Some layers of TFRobertaForSequenceClassification were not initialized from the model checkpoint at microsoft/codebert-base and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


# do not run unless you are just testing
This cell is for testing purposes, to ensure all imports and dependencies above work. It does not run fast on large datasets. To fine-tune using a large dataset, run the cell below.

In [6]:
# tester model for smaller datasets

#data = np.array(data)
print(type(files[0]), type(files))
# truncation=True caps every input at the model's maximum sequence length
tokenized_data = tokenizer(files, return_tensors="np", padding=True, truncation=True)
# Tokenizer returns a BatchEncoding, but we convert that to a dict for Keras
tokenized_data = dict(tokenized_data)

model.compile(optimizer=Adam(3e-5))

labels = np.array(data["label"])
model.fit(tokenized_data, labels)


No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.


<class 'str'> <class 'list'>


<keras.callbacks.History at 0x7fbc9565c3d0>

# run this
This cell creates a TensorFlow dataset to better handle large inputs, then fits the model.

In [8]:
# tensorflow model for larger datasets

# creating tf dataset
tf_data = Dataset.from_pandas(data)

def tokenize_dataset(data):
    # Keys of the returned dictionary will be added to the dataset as columns
    return tokenizer(data["file"], padding="max_length", truncation=True)

tf_data = tf_data.map(tokenize_dataset, batched=True)
tf_data = model.prepare_tf_dataset(tf_data, batch_size=16, shuffle=True, tokenizer=tokenizer)

# Load and compile our model
# Lower learning rates are often better for fine-tuning transformers
model.compile(optimizer=Adam(3e-5))
model.fit(tf_data)


  0%|          | 0/1 [00:00<?, ?ba/s]

No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.




<keras.callbacks.History at 0x7f798e3a8550>

### One case tester cell
To run many test cases at once, run the next cell

In [34]:
inputs = tokenizer(files[0], return_tensors="tf", truncation=True, padding='max_length')
# NOTE: the loss is measured against a fixed label of 1, not against the true label state[0]
fixed_label = np.array([[1]])  # Batch size 1
outputs = model(**inputs, labels=fixed_label)  # one forward pass gives both loss and logits
loss = outputs.loss
logits = outputs.logits
print(np.argmax(logits.numpy()), loss, state[0])

predicted_class_id = int(tf.math.argmax(logits, axis=-1)[0])
model.config.id2label[predicted_class_id]

0 tf.Tensor([2.0462415], shape=(1,), dtype=float32) 0


'LABEL_0'

### Model accuracy
This cell computes the accuracy of the model a bit slowly, one file at a time. Note that it scores the same files the model was fitted on. Feel free to improve it and add other model evaluation methods as desired.

In [38]:
acc = 0

for idx in range(num_cases):
    inputs = tokenizer(files[idx], return_tensors="tf", truncation=True, padding='max_length')
    # NOTE: the loss is measured against a fixed label of 1, not against the true label state[idx]
    fixed_label = np.array([[1]])  # Batch size 1
    outputs = model(**inputs, labels=fixed_label)  # one forward pass gives both loss and logits
    loss = outputs.loss
    logits = outputs.logits
    predicted_class_id = int(tf.math.argmax(logits, axis=-1)[0])

    print(np.argmax(logits.numpy()), predicted_class_id, loss, state[idx])

    # count a hit when the predicted class matches the true label
    if predicted_class_id == state[idx]:
        acc = acc + 1

print(acc / num_cases)

0 0 tf.Tensor([2.0462415], shape=(1,), dtype=float32) 0
0 0 tf.Tensor([2.04916], shape=(1,), dtype=float32) 0
0 0 tf.Tensor([2.1357431], shape=(1,), dtype=float32) 0
0 0 tf.Tensor([2.1071458], shape=(1,), dtype=float32) 1
0 0 tf.Tensor([2.099639], shape=(1,), dtype=float32) 0
0 0 tf.Tensor([2.0569355], shape=(1,), dtype=float32) 0
0 0 tf.Tensor([2.167201], shape=(1,), dtype=float32) 0
0 0 tf.Tensor([2.0916145], shape=(1,), dtype=float32) 0
0 0 tf.Tensor([2.1226583], shape=(1,), dtype=float32) 0
0 0 tf.Tensor([2.1774137], shape=(1,), dtype=float32) 1
0 0 tf.Tensor([2.1934183], shape=(1,), dtype=float32) 0
0 0 tf.Tensor([2.1057508], shape=(1,), dtype=float32) 1
0 0 tf.Tensor([2.0568833], shape=(1,), dtype=float32) 0
0 0 tf.Tensor([2.0796993], shape=(1,), dtype=float32) 0
0 0 tf.Tensor([2.1138654], shape=(1,), dtype=float32) 0
0 0 tf.Tensor([2.1853392], shape=(1,), dtype=float32) 0
0 0 tf.Tensor([2.128327], shape=(1,), dtype=float32) 0
0 0 tf.Tensor([2.101945], shape=(1,), dtype=float32) 

KeyboardInterrupt: 
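In the output above every prediction is 0, so accuracy alone mostly reflects how many of the files are labeled good. As one possible extra evaluation method, the sketch below also reports precision and recall for class 1 (bad). The function name binary_report is made up for illustration, and the logits are placeholder values that always favour class 0, chosen to mimic the predictions printed above; only the labels are copied from that output.

```python
import numpy as np


def binary_report(logits, labels):
    """Accuracy, plus precision and recall for class 1 (bad / insecure)."""
    predictions = np.argmax(np.asarray(logits), axis=-1)
    labels = np.asarray(labels)
    tp = int(np.sum((predictions == 1) & (labels == 1)))  # bad files caught
    fp = int(np.sum((predictions == 1) & (labels == 0)))  # good files flagged
    fn = int(np.sum((predictions == 0) & (labels == 1)))  # bad files missed
    accuracy = float(np.mean(predictions == labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# True labels of the first 17 files, as printed in the last column above
labels = [0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0]
# Placeholder logits (not real model output) that always pick class 0
logits = [[1.0, -1.0]] * len(labels)

report = binary_report(logits, labels)
print(report)
```

On these 17 rows accuracy is 14/17, while recall for the bad class is 0 because none of the three bad files is flagged. With the real model, logits could be collected from outputs.logits inside the loop.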

# buggy code: do not run!
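This cell implements the 'Train with PyTorch Trainer' section of this guide (https://huggingface.co/docs/transformers/training)

It does not work on my computer :sad: The AttributeError at the end of the output comes from passing a TensorFlow model (TFRobertaForSequenceClassification) to Trainer, which expects a PyTorch model.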

In [7]:
# pytorch init
pt_model = TFAutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

training_args = TrainingArguments(output_dir="test_trainer")

%pip install scikit-learn  # the package is published as scikit-learn; "sklearn" is only the import name
import evaluate

metric = evaluate.load("accuracy")

# pytorch training
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

training_args = TrainingArguments(output_dir="test_trainer", evaluation_strategy="epoch")


# creating tf dataset
tf_data = Dataset.from_pandas(data)
#tf_data = tf_data.train_test_split(test_size=0.3)
#print(tf_data)

def tokenize_dataset(data):
    # Keys of the returned dictionary will be added to the dataset as columns
    return tokenizer(data["file"], padding="max_length", truncation=True)

tf_data = tf_data.map(tokenize_dataset, batched=True)
tf_data = model.prepare_tf_dataset(tf_data, batch_size=16, shuffle=True, tokenizer=tokenizer)

#small_train_dataset = tf_data["train"].shuffle(seed=42).select(range(50))
#small_eval_dataset = tf_data["test"].shuffle(seed=42).select(range(50))

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tf_data,
    eval_dataset=tf_data,
    compute_metrics=compute_metrics,
)

trainer.train()

loading configuration file config.json from cache at /Users/mercy/.cache/huggingface/hub/models--microsoft--codebert-base/snapshots/3b0952feddeffad0063f274080e3c23d75e7eb39/config.json
Model config RobertaConfig {
  "_name_or_path": "microsoft/codebert-base",
  "architectures": [
    "RobertaModel"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "output_past": true,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.26.1",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 50265
}

loading weights file tf_model.h5 from cache at /Users/mercy/.cache/huggingface/hub/models--microsoft--codebert-base

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Note: you may need to restart the kernel to use updated packages.


PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


  0%|          | 0/1 [00:00<?, ?ba/s]

AttributeError: 'TFRobertaForSequenceClassification' object has no attribute 'to'
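The traceback above suggests one possible fix: give Trainer the PyTorch class (AutoModelForSequenceClassification) and plain tokenized datasets rather than the output of prepare_tf_dataset. The following is an untested sketch under those assumptions, not a verified replacement for the cell. The function name build_trainer is made up, and accuracy is computed with NumPy instead of the evaluate package.

```python
import numpy as np


def compute_metrics(eval_pred):
    """Accuracy from the (logits, labels) pair that Trainer passes to this hook."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": float(np.mean(predictions == labels))}


def build_trainer(train_dataset, eval_dataset, model_name="microsoft/codebert-base"):
    """Build a Trainer around the PyTorch (not TF) variant of the model.

    Assumes both arguments are tokenized datasets.Dataset objects with
    input_ids, attention_mask, and label columns, i.e. the result of
    Dataset.map(tokenize_dataset, batched=True) without prepare_tf_dataset.
    """
    # Imported here so the PyTorch stack only loads when training is requested
    from transformers import (
        AutoModelForSequenceClassification,
        Trainer,
        TrainingArguments,
    )

    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
    # transformers 4.26 (the version in the log above) uses evaluation_strategy;
    # newer releases rename this argument to eval_strategy
    args = TrainingArguments(output_dir="test_trainer", evaluation_strategy="epoch")
    return Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        compute_metrics=compute_metrics,
    )
```

A possible way to call it, reusing the commented-out split from the cell above: `split = Dataset.from_pandas(data).map(tokenize_dataset, batched=True).train_test_split(test_size=0.3)`, then `build_trainer(split["train"], split["test"]).train()`.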