# Setup

In [3]:
from transformers import RobertaTokenizer, RobertaForSequenceClassification, TrainingArguments
from sklearn.metrics import precision_recall_fscore_support, accuracy_score
from sklearn.model_selection import train_test_split
from tqdm import tqdm

import torch
torch.cuda.empty_cache()

from sklearn.metrics import f1_score
import numpy as np
import pandas as pd
import evaluate

import ast
import astunparse

import random
import string

In [38]:
!nvidia-smi

Sun Apr  9 11:54:07 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.126.02   Driver Version: 418.126.02   CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla V100-SXM2...  On   | 00000000:8A:00.0 Off |                    0 |
| N/A   39C    P0    58W / 300W |   2428MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
+-------

Load and test epoch 4 model

In [5]:
model = RobertaForSequenceClassification.from_pretrained('checkpoingnts/R2-checkpoint-18000')
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")

In [10]:
def get_function_info(functionNode):
    functionName = functionNode.name
    functionArgs = [arg.arg for arg in functionNode.args.args]
    functionCode = astunparse.unparse(functionNode)
    return [functionName, functionArgs, functionCode]

def code_to_functions_df(code):
    node = ast.parse(code)
    functions = [n for n in node.body if isinstance(n, ast.FunctionDef)]
    classes = [n for n in node.body if isinstance(n, ast.ClassDef)]

    standalone_functions = [get_function_info(function) for function in functions]
    
    class_functions = []
        
    for class_ in classes:
        methods = [n for n in class_.body if isinstance(n, ast.FunctionDef)]
        cur_class_functions = [get_function_info(method) for method in methods]
        class_functions.extend(cur_class_functions)
    
    return pd.DataFrame(standalone_functions + class_functions,
                      columns =['functionName', 'functionArgs', 'functionCode'])

def file_to_processed_df(filename):
    functions = []
    with open(filename) as file:
        functions = code_to_functions_df(file.read())
    #preprocess - remove all before args definition
    functions['functionCode'] = [s[s.find('('):] for s in functions['functionCode']]
    return functions
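
To make the preprocessing step concrete, here is a minimal sketch of what a single function looks like after everything before the argument list is removed, and how it is then joined with its name. It uses the standard-library `ast.unparse` (Python 3.9+) as a stand-in for `astunparse.unparse`. It also assumes the separator token is `</s>`, as in RoBERTa-style tokenizers. The sample function `add` is purely illustrative.

```python
import ast

SEP_TOKEN = "</s>"  # assumption: separator token of a RoBERTa-style tokenizer

# illustrative sample source, not taken from the dataset
source = "def add(a, b):\n    return a + b\n"

node = ast.parse(source).body[0]
code = ast.unparse(node)        # stand-in for astunparse.unparse (Python 3.9+)

# drop "def <name>" so the name cannot simply be read off the body
body = code[code.find("("):]

# the classifier sees the (possibly wrong) name, the separator, then the body
model_input = node.name + SEP_TOKEN + body
print(model_input)
```

Because the `def <name>` prefix is stripped, the model has to judge the name from the arguments, docstring and body alone.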

# Test 1 - Looking at good code
We look at the test module from Philips' library for extracting function definitions from GitHub repositories. All function names in it are correct; will the model recognize them as such?

source: https://github.com/philips-software/functiondefextractor/blob/master/test/test_core_extractor.py

In [60]:
functions = file_to_processed_df("code_sample/code_sample.py")
functions

Unnamed: 0,functionName,functionArgs,functionCode
0,get_log_data,[line],(line):\n ' function to get the line reques...
1,test_filter_reg_files,[self],(self):\n 'Function to test filter_reg_file...
2,test_get_function_names,[self],(self):\n 'Function to test get_function_na...
3,test_get_func_body,[self],(self):\n 'Function to test get_function_bo...
4,test_process_ad,[self],(self):\n 'Function to test the complete en...
5,test_process_extract,[self],(self):\n 'Function to test the complete en...
6,test_process_annot,[self],(self):\n 'Function to test the complete en...
7,test_process_python_test_extract,[self],(self):\n 'Function to test the complete en...
8,test_invalid_path,[self],(self):\n 'Function to test valid input pat...
9,test_py_annot_method_names,[self],(self):\n 'Function to test python annoted ...


Since every name is correct, all labels should be 0:

In [61]:
labels = torch.LongTensor(14 * [0])
labels

tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [62]:
inputs = tokenizer([n + tokenizer.sep_token + c for n,c in functions[['functionName', 'functionCode']].values],
                         return_tensors='pt', max_length=512,
                         truncation=True, padding='max_length')
outputs = model(**inputs)

In [63]:

print(f"Prediction: {torch.argmax(outputs['logits'], dim=1)}")

Prediction: tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])


All predictions are correct!

# Test 2 - Shuffle on that name
Same code, but every second function name is incorrect: it was manually swapped with the name of another function.

In [64]:
functions = file_to_processed_df("code_sample/code_sample_halfmashed.py")
functions

Unnamed: 0,functionName,functionArgs,functionCode
0,get_log_data,[line],(line):\n ' function to get the line reques...
1,test_get_func_body,[self],(self):\n 'Function to test filter_reg_file...
2,test_get_function_names,[self],(self):\n 'Function to test get_function_na...
3,test_process_python_test_extract,[self],(self):\n 'Function to test get_function_bo...
4,test_process_ad,[self],(self):\n 'Function to test the complete en...
5,test_process_annot,[self],(self):\n 'Function to test the complete en...
6,test_process_extract,[self],(self):\n 'Function to test the complete en...
7,test_filter_reg_files,[self],(self):\n 'Function to test the complete en...
8,test_invalid_path,[self],(self):\n 'Function to test valid input pat...
9,test_extractor_cmd,[self],(self):\n 'Function to test python annoted ...


In [65]:
labels = torch.LongTensor(7 * [0, 1])
labels

tensor([0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1])
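
The names in this test were swapped by hand, but the same alternating pattern could also be produced programmatically. The sketch below is one hypothetical way to do it: names at odd positions are rotated among themselves, even positions are left untouched, and the matching 0/1 labels are built alongside. The helper `swap_every_second_name` and the sample names are illustrative and not part of the original notebook.

```python
def swap_every_second_name(names):
    """Rotate the names at odd positions; even positions stay untouched."""
    odd_positions = list(range(1, len(names), 2))
    shuffled = list(names)
    # each odd position receives the name of the next odd position (cyclically)
    for dst, src in zip(odd_positions, odd_positions[1:] + odd_positions[:1]):
        shuffled[dst] = names[src]
    # 0 = original name kept, 1 = name swapped
    labels = [i % 2 for i in range(len(names))]
    return shuffled, labels

# hypothetical function names, for illustration only
names = ["load", "save", "parse", "render", "connect", "close"]
shuffled, labels = swap_every_second_name(names)
print(shuffled)
print(labels)
```

The rotation only produces wrong names when there are at least two distinct names at odd positions.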

In [66]:
inputs = tokenizer([n + tokenizer.sep_token + c for n,c in functions[['functionName', 'functionCode']].values],
                         return_tensors='pt', max_length=512,
                         truncation=True, padding='max_length')
outputs = model(**inputs)

In [69]:
results = torch.argmax(outputs['logits'], dim=1)

print(f"Prediction: {results}")
print (f"The F1 score is: {f1_score(labels, results)}")

Prediction: tensor([0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1])
The F1 score is: 1.0


In [84]:
outputs['logits']

tensor([[ 3.8380, -3.9264],
        [-2.5123,  2.1762],
        [ 3.9297, -4.0873],
        [-3.7171,  3.7810],
        [ 2.9823, -2.6671],
        [-1.7128,  1.1522],
        [ 3.7639, -3.8097],
        [-3.7143,  3.7712],
        [ 3.7294, -3.7613],
        [-2.1930,  1.7692],
        [ 3.7019, -3.7025],
        [-3.7071,  3.7530],
        [ 3.8992, -4.0322],
        [-3.7218,  3.7915]], grad_fn=<AddmmBackward>)

The model was also quite confident in most cases: the margin between the two logits is large for almost every function.
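
To turn the raw logits into something easier to read, a softmax can be applied per row. The sketch below does this in plain Python for three rows copied from the output above, including the row with the smallest margin. The helper `softmax` is written out here only for illustration; in the notebook itself `torch.softmax(outputs['logits'], dim=1)` would do the same job.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# rows copied from the logits printed above
rows = [
    [3.8380, -3.9264],   # predicted class 0 (correct name)
    [-1.7128, 1.1522],   # predicted class 1, smallest margin of all rows
    [-3.7218, 3.7915],   # predicted class 1 (wrong name)
]

for row in rows:
    probs = softmax(row)
    pred = probs.index(max(probs))
    print(f"logits={row} -> class {pred}, p={max(probs):.4f}")
```

Even the least confident row still assigns roughly 0.95 probability to the predicted class.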

# Test 3 - Spelling errors, on big function corpus

For this test, we will use the later part of the Python code dataset (post-100k),
which was not used for training or validation. Let's imagine a __VERY__ nervous
programmer who makes spelling mistakes, usually _adding or replacing_ 1-3 random
characters within the function name, in around half of the cases. How many of these
will we be able to detect?

We will use around 15k functions for this test (14,854 remain after filtering the last 20k).
To keep things consistent, we will still remove functions like \_\_init\_\_ and
\_\_getitem\_\_. However, in experiment part __a__ we will __not__ remove the ones that
exceed 512 tokens, truncating them instead, and compare with part __b__, where we will
keep only the functions that fit within 512 tokens.

## Test 3 setup

In [11]:
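def add_typos(function_name):
    #apply 1-3 random single-character edits to the name
    rounds = random.choice(range(1, 4))
    for _ in range(rounds):

        #True - replace a character, False - insert a new one
        is_typo = random.choice([True, False])

        #a replacement overwrites the character just before the slot
        #(at slot 0 it degenerates to an insertion)
        shift = -1 if is_typo else 0

        slot = random.choice(range(len(function_name)))
        function_name = function_name[:max(0, slot + shift)] + random.choice(string.ascii_lowercase) + function_name[slot:]

    return function_name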

In [18]:
add_typos("add_typos_into_functionName")

'add_uypos_into_funcbiojName'

In [19]:
giga_df = pd.read_parquet("pyfunc_272k.parquet")

#take last 20k
giga_df = giga_df.tail(20000)

#keep only functions whose source starts directly with `def` (drops decorated ones)
filt_df = giga_df[giga_df['functionCode'].str.startswith("\n\ndef ")]
filt_df = filt_df[~filt_df['functionName'].str.startswith("__")].reset_index(drop=True)

#preprocess body
filt_df['functionCode'] = [s[s.find('('):] for s in filt_df['functionCode']]

filt_df['label'] = 0

filt_df

Unnamed: 0,functionName,functionArgs,functionCode,label
0,load_label,[label_file],(label_file):\n with open(label_file) as f:...,0
1,load_content,[file_name],(file_name):\n with open(file_name) as f:\n...,0
2,_spawn_aspell,"[self, aspell_executable, language]","(self, aspell_executable, language):\n args...",0
3,find_misspelled_words,"[self, list_of_words, max_words]","(self, list_of_words, max_words=None):\n ' ...",0
4,suggestions_for_word,"[self, word, max_suggestions]","(self, word, max_suggestions=None):\n ' Che...",0
...,...,...,...,...
14849,transform_anagramically,[L],(L):\n d = {}\n for e in L:\n se ...,0
14850,permutation_representation,"[s1, s2]","(s1, s2):\n assert (set(s1) == set(s2))\n ...",0
14851,find_largest_sq_ana,"[tass, squares]","(tass, squares):\n S = []\n for s in tas...",0
14852,find_ana_sq,"[tass, squares]","(tass, squares):\n all_sq_ana = find_larges...",0


Now let's add errors

In [20]:
rand_idx = filt_df.sample(frac = 0.5).index
print(filt_df.loc[rand_idx[:5], 'functionName'], '\n')
#assign through .loc to avoid chained assignment (SettingWithCopyWarning)
filt_df.loc[rand_idx, 'functionName'] = [add_typos(name) for name in filt_df.loc[rand_idx, 'functionName']]
filt_df.loc[rand_idx, 'label'] = 1

print(filt_df.loc[rand_idx[:5], 'functionName'])

8343        test___str___with_fragment
10422                       _switch_db
14839    parse_packet_with_fingerprint
9198                          freevars
3529                           LibName
Name: functionName, dtype: object 

8343         tenst___stn___with_fragment
10422                        _sswitch_db
14839    parse_upacket_with_fingverprint
9198                           frejfarps
3529                            LzibNpme
Name: functionName, dtype: object


In [21]:
filt_df['tokenizedStr'] = [tokenizer(n + tokenizer.sep_token + c, max_length = 512, truncation = True, return_tensors ='pt') for n,c in tqdm(filt_df[['functionName', 'functionCode']].values)]
filt_df['tokenizedStrLen'] = [len(x['input_ids'][0]) for x in filt_df['tokenizedStr']]
filt_df

100%|██████████| 14854/14854 [00:29<00:00, 511.11it/s]


Unnamed: 0,functionName,functionArgs,functionCode,label,tokenizedStr,tokenizedStrLen
0,load_label,[label_file],(label_file):\n with open(label_file) as f:...,0,"[input_ids, attention_mask]",42
1,load_content,[file_name],(file_name):\n with open(file_name) as f:\n...,0,"[input_ids, attention_mask]",39
2,_spawn_aspell,"[self, aspell_executable, language]","(self, aspell_executable, language):\n args...",0,"[input_ids, attention_mask]",153
3,find_misspelled_words,"[self, list_of_words, max_words]","(self, list_of_words, max_words=None):\n ' ...",0,"[input_ids, attention_mask]",146
4,suggestions_for_word,"[self, word, max_suggestions]","(self, word, max_suggestions=None):\n ' Che...",0,"[input_ids, attention_mask]",163
...,...,...,...,...,...,...
14849,transform_anagramically,[L],(L):\n d = {}\n for e in L:\n se ...,0,"[input_ids, attention_mask]",81
14850,permutation_representation,"[s1, s2]","(s1, s2):\n assert (set(s1) == set(s2))\n ...",0,"[input_ids, attention_mask]",134
14851,find_largwst_sq_ana,"[tass, squares]","(tass, squares):\n S = []\n for s in tas...",1,"[input_ids, attention_mask]",146
14852,findeonadsq,"[tass, squares]","(tass, squares):\n all_sq_ana = find_larges...",1,"[input_ids, attention_mask]",453


In [30]:
class FunctionsDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings
    def __getitem__(self, idx):
        return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
    def __len__(self):
        return len(self.encodings.input_ids)

In [31]:
filt_df_tr = filt_df

In [32]:
inputs_test = tokenizer([n + tokenizer.sep_token + c for n,c in filt_df_tr[['functionName', 'functionCode']].values],
                         return_tensors='pt', max_length=512,
                         truncation=True, padding='max_length')
inputs_test['labels'] = torch.LongTensor([filt_df_tr['label'].to_list()]).T
dataset_test = FunctionsDataset(inputs_test)

In [None]:
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
# and move our model over to the selected device
model.to(device)

## Test 3a - Any Length

In [34]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
                                  per_device_eval_batch_size = 16, output_dir = "test_3a"
                                 )

metric = evaluate.load("f1")

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds)
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }

# Create a Trainer instance
trainer_a = Trainer(
    model=model,
    args=training_args,
    eval_dataset=dataset_test,
    compute_metrics=compute_metrics,
)

trainer_a.evaluate()

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
***** Running Evaluation *****
  Num examples = 14854
  Batch size = 16
  return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}


Trainer is attempting to log a value of "[0.7259203  0.55287865]" of type <class 'numpy.ndarray'> for key "eval/f1" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "[0.60822491 0.8079213 ]" of type <class 'numpy.ndarray'> for key "eval/precision" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "[0.90009425 0.42022351]" of type <class 'numpy.ndarray'> for key "eval/recall" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.


{'eval_loss': 2.182441473007202,
 'eval_accuracy': 0.6601588797630268,
 'eval_f1': array([0.7259203 , 0.55287865]),
 'eval_precision': array([0.60822491, 0.8079213 ]),
 'eval_recall': array([0.90009425, 0.42022351]),
 'eval_runtime': 149.5009,
 'eval_samples_per_second': 99.357,
 'eval_steps_per_second': 6.214}

## Test 3b - Ones That Fit

Now let's evaluate only on the function definitions that fit within the 512-token limit.

In [35]:
filt_fit_df = filt_df_tr[filt_df_tr['tokenizedStrLen'] < 512].reset_index(drop=True)
filt_fit_df

Unnamed: 0,functionName,functionArgs,functionCode,label,tokenizedStr,tokenizedStrLen
0,load_label,[label_file],(label_file):\n with open(label_file) as f:...,0,"[input_ids, attention_mask]",42
1,load_content,[file_name],(file_name):\n with open(file_name) as f:\n...,0,"[input_ids, attention_mask]",39
2,_spawn_aspell,"[self, aspell_executable, language]","(self, aspell_executable, language):\n args...",0,"[input_ids, attention_mask]",153
3,find_misspelled_words,"[self, list_of_words, max_words]","(self, list_of_words, max_words=None):\n ' ...",0,"[input_ids, attention_mask]",146
4,suggestions_for_word,"[self, word, max_suggestions]","(self, word, max_suggestions=None):\n ' Che...",0,"[input_ids, attention_mask]",163
...,...,...,...,...,...,...
13447,transform_anagramically,[L],(L):\n d = {}\n for e in L:\n se ...,0,"[input_ids, attention_mask]",81
13448,permutation_representation,"[s1, s2]","(s1, s2):\n assert (set(s1) == set(s2))\n ...",0,"[input_ids, attention_mask]",134
13449,find_largwst_sq_ana,"[tass, squares]","(tass, squares):\n S = []\n for s in tas...",1,"[input_ids, attention_mask]",146
13450,findeonadsq,"[tass, squares]","(tass, squares):\n all_sq_ana = find_larges...",1,"[input_ids, attention_mask]",453


In [36]:
inputs_test_b = tokenizer([n + tokenizer.sep_token + c for n,c in filt_fit_df[['functionName', 'functionCode']].values],
                         return_tensors='pt', max_length=512,
                         truncation=True, padding='max_length')
inputs_test_b['labels'] = torch.LongTensor([filt_fit_df['label'].to_list()]).T
dataset_test_b = FunctionsDataset(inputs_test_b)

In [37]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
                                  per_device_eval_batch_size = 16, output_dir = "test_3b"
                                 )

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds)
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }

# Create a Trainer instance
trainer_b = Trainer(
    model=model,
    args=training_args,
    eval_dataset=dataset_test_b,
    compute_metrics=compute_metrics,
)

trainer_b.evaluate()

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
***** Running Evaluation *****
  Num examples = 13452
  Batch size = 16
  return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}


Trainer is attempting to log a value of "[0.72569236 0.55233809]" of type <class 'numpy.ndarray'> for key "eval/f1" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "[0.60754793 0.80911436]" of type <class 'numpy.ndarray'> for key "eval/precision" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "[0.90087811 0.41927818]" of type <class 'numpy.ndarray'> for key "eval/recall" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.


{'eval_loss': 2.1934709548950195,
 'eval_accuracy': 0.6598275349390426,
 'eval_f1': array([0.72569236, 0.55233809]),
 'eval_precision': array([0.60754793, 0.80911436]),
 'eval_recall': array([0.90087811, 0.41927818]),
 'eval_runtime': 135.162,
 'eval_samples_per_second': 99.525,
 'eval_steps_per_second': 6.222}

## Conclusion

After fine-tuning on roughly 120MB of data, CodeBERT-base achieved __0.96 F1__ on the validation set with __72k__ training function examples, of which roughly 50% had their names shuffled to be incorrect. A real-life example of the same scenario (Tests 1 and 2) showed that the model correctly identifies __all__ of the correctly and wrongly named functions.

With an achieved __0.66 accuracy__ (all metrics are practically the same for the any-length and the full-fit functions), the model was quite insensitive to misspelled names, with only 0.55 F1 and 0.42 recall on the error class. While it is certainly able to detect logical mismatches and scores better than random guessing, the chosen training method, which uses only correct and shuffled functions, __does not prove to be very effective against spelling errors__. To be nerve-proof, the training data needs to contain these (and maybe other) kinds of errors.
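
The per-class numbers reported for Test 3a are internally consistent, which can be checked with a few lines of arithmetic. The sketch below recomputes F1 from the printed precision and recall. Assuming the two classes are exactly balanced (typos were injected into `frac = 0.5` of the rows), it also recovers the accuracy as the mean of the per-class recalls. The helper name `f1_from` is illustrative.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# per-class values copied from the Test 3a output
# index 0 = correct names, index 1 = names with typos
precision = [0.60822491, 0.8079213]
recall = [0.90009425, 0.42022351]

f1 = [f1_from(p, r) for p, r in zip(precision, recall)]

# assumption: both classes have the same number of examples,
# so overall accuracy equals the mean of the per-class recalls
balanced_accuracy = sum(recall) / len(recall)

print(f"F1 per class: {f1[0]:.4f}, {f1[1]:.4f}")
print(f"accuracy from recalls: {balanced_accuracy:.4f}")
```

The decomposition makes the failure mode explicit: about 90% of correct names are kept as correct, but only about 42% of the misspelled names are caught.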

One possible explanation is that the byte-level BPE tokenizer is not sensitive enough here: even with a few extra or replaced characters, the whole name remains semi-logical.
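
As a rough, tokenizer-free illustration of how much of a name survives the injected typos, the sketch below compares the five original and corrupted names printed earlier using the standard-library `difflib.SequenceMatcher`. This is only a character-level proxy and an assumption of this example; it does not measure what the BPE tokenizer or the model actually sees.

```python
from difflib import SequenceMatcher

# (original, corrupted) pairs copied from the sample printed in Test 3 setup
pairs = [
    ("test___str___with_fragment", "tenst___stn___with_fragment"),
    ("_switch_db", "_sswitch_db"),
    ("parse_packet_with_fingerprint", "parse_upacket_with_fingverprint"),
    ("freevars", "frejfarps"),
    ("LibName", "LzibNpme"),
]

similarities = {}
for original, corrupted in pairs:
    # ratio() is 2 * matches / total length, between 0 and 1
    ratio = SequenceMatcher(None, original, corrupted).ratio()
    similarities[original] = ratio
    print(f"{original:32s} {corrupted:32s} {ratio:.2f}")
```

Longer names keep most of their characters in place after 1-3 edits, which fits the hypothesis that the corrupted names still look plausible to the model.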

What else to improve? 
- Embrace dunder (aka \_\_this\_\_) methods, which are currently filtered out
- Find a better way of handling longer functions (a bigger-context model, of course)
- Maybe a third hybrid class of error-functions that are logical but not that representative?
- Extend binary classification to 'not fitting' / 'syntax error' / etc?