<a href="https://colab.research.google.com/github/rjenez/W266-final-project/blob/main/notebooks/Plagiarism_with_CodeBert.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Plagiarism with CodeBERT
**Author:** Ricardo Jenez, heavily modified from HuggingFace examples

**Description:** An NLP model that detects plagiarism in source code.

## Introduction

This is a preliminary model for code plagiarism detection. The idea is to identify when students in a class have plagiarized a coding example.

### References

* [CodeBERT: A Pre-Trained Model for Programming and Natural Languages](https://arxiv.org/abs/2002.08155)
* [Plagiarism Detection in Computer Programming Using Feature Extraction From Ultra-Fine-Grained Repositories](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9097285)

## Setup

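Note: install HuggingFace `transformers` via `pip install transformers`. The `TrainingArguments` options used below (for example `save_strategy`) need a 4.x release, which is also what the logs in this notebook indicate; version 2.11.0 is too old.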

In [None]:
%%capture
!pip3 install transformers
!pip3 install sentence_transformers
!pip3 install imbalanced-learn
!pip3 install datasets
#!pip3 install wandb

In [None]:
import torch
import gc
import datasets
import transformers
import pandas as pd
import numpy as np
from transformers import (
    RobertaTokenizer, RobertaForSequenceClassification, Trainer,
    TrainingArguments, EvalPrediction, AutoTokenizer, RobertaTokenizerFast,
)
from torch.utils.data import Dataset, DataLoader
#import wandb
import random
from imblearn.over_sampling import RandomOverSampler
import pprint


In [None]:
#!gsutil cp gs://w266finalproject/plagA20162017.tar plag2.tar
!gsutil cp gs://w266finalproject/plag2.tar plag2.tar

Copying gs://w266finalproject/plag2.tar...
- [1 files][ 77.8 MiB/ 77.8 MiB]                                                
Operation completed over 1 objects/77.8 MiB.                                     


In [None]:
#!gcloud auth login

In [None]:

!tar xvf plag2.tar
!ls -l
# !mv trainA*.csv train.csv
# !mv testA*.csv test.csv
!mv train2.csv train.csv
!mv test2.csv test.csv

alldata2.csv
groundtruth2.csv
test2.csv
train2.csv
total 237864
-rw-r--r-- 1 jupyter jupyter  1114619 Mar 16 08:22 alldata2.csv
drwxr-xr-x 2 jupyter jupyter     4096 Mar 24 04:58 data
-rw-r--r-- 1 jupyter jupyter   203396 Mar 16 08:19 groundtruth2.csv
-rw-r--r-- 1 jupyter jupyter 81619968 Mar 24 05:25 plag2.tar
drwxr-xr-x 6 jupyter jupyter     4096 Mar 24 05:23 results
drwxr-xr-x 2 jupyter jupyter     4096 Mar 24 04:24 resultsBERT03242022_042423_512
drwxr-xr-x 6 jupyter jupyter     4096 Mar 24 04:34 resultsBERT03242022_042500_512
drwxr-xr-x 3 jupyter jupyter     4096 Mar 24 02:43 src
-rw-r--r-- 1 jupyter jupyter 15819857 Mar 16 08:22 test.csv
-rw-r--r-- 1 jupyter jupyter 15819857 Mar 16 08:22 test2.csv
-rw-r--r-- 1 jupyter jupyter 64478135 Mar 16 08:22 train.csv
-rw-r--r-- 1 jupyter jupyter 64478135 Mar 16 08:22 train2.csv
drwxr-xr-x 6 jupyter jupyter     4096 Mar 24 02:43 tutorials


In [None]:
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")
# Hold out the last 20% of train.csv for validation (sequential split, no shuffling)
split_idx = int(len(train_df) * 0.8)
valid_df = train_df[split_idx:]
train_df = train_df[:split_idx]

In [None]:
print("Train Target Distribution")
print(train_df.plagiarized.value_counts())

Train Target Distribution
0    10595
1      463
Name: plagiarized, dtype: int64


In [None]:
# y_train = tf.keras.utils.to_categorical(train_df.plagiarized, num_classes=2)
# y_val = tf.keras.utils.to_categorical(valid_df.plagiarized, num_classes=2)
# y_test = tf.keras.utils.to_categorical(test_df.plagiarized, num_classes=2)

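# Randomly oversample the minority (plagiarized) class until both classes are balanced.
# Note: the validation and test splits below are balanced the same way.
oversample = RandomOverSampler(sampling_strategy='minority', random_state=1234)
train_over, y_train_over = oversample.fit_resample(train_df, train_df.plagiarized)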
print("Train Target Distribution")
print(train_over.plagiarized.value_counts())

valid_over, y_valid_over = oversample.fit_resample(valid_df, valid_df.plagiarized)
print("Valid Target Distribution")
print(valid_over.plagiarized.value_counts())

test_over, y_test_over = oversample.fit_resample(test_df, test_df.plagiarized)
print("Test Target Distribution")
print(test_over.plagiarized.value_counts())

Train Target Distribution
0    10595
1    10595
Name: plagiarized, dtype: int64
Valid Target Distribution
0    2654
1    2654
Name: plagiarized, dtype: int64
Test Target Distribution
0    3294
1    3294
Name: plagiarized, dtype: int64
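
To make the effect of `RandomOverSampler` concrete, the sketch below reproduces the idea in plain Python: rows of the minority class are sampled with replacement until both classes have the same count. The helper `random_oversample` and the toy rows are hypothetical and only illustrate the technique; imbalanced-learn's own implementation differs in detail.

```python
import random
from collections import Counter

def random_oversample(rows, label_key, seed=1234):
    """Duplicate randomly chosen minority-class rows until classes are balanced."""
    rng = random.Random(seed)
    counts = Counter(row[label_key] for row in rows)
    majority_size = max(counts.values())
    minority_label = min(counts, key=counts.get)
    minority_rows = [row for row in rows if row[label_key] == minority_label]
    # Sample with replacement to fill the gap up to the majority class size
    extra = rng.choices(minority_rows, k=majority_size - len(minority_rows))
    return rows + extra

# Toy rows with the same class counts as the training split above
rows = [{"plagiarized": 0}] * 10595 + [{"plagiarized": 1}] * 463
balanced = random_oversample(rows, "plagiarized")
balanced_counts = Counter(row["plagiarized"] for row in balanced)
print(balanced_counts)
```

Because the added rows are exact duplicates, oversampling changes the class balance the model sees but adds no new plagiarism examples.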


In [None]:
# train_data, test_data = datasets.load_dataset('imdb', split =['train', 'test'], 
#                                              cache_dir='/media/data_files/github/website_tutorials/data')

# train_data, test_data = datasets.load_dataset('csv',  split =['train', 'test'], data_files={'train': 'train.csv',
#                                               'test': 'test.csv'},cache_dir='data')

train_data = datasets.Dataset.from_pandas(train_over)
valid_data = datasets.Dataset.from_pandas(valid_over)
test_data = datasets.Dataset.from_pandas(test_over)

In [None]:
print(len(train_data),type(train_data),train_data)

21190 <class 'datasets.arrow_dataset.Dataset'> Dataset({
    features: ['label', 'filename0', 'filename1', 'source0', 'source1', 'percent', 'percent0', 'percent1', 'lines', 'plagiarized'],
    num_rows: 21190
})


In [None]:
# Load the model and tokenizer; each pair of sources is encoded as one sequence of at most 512 tokens
device = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")
model = RobertaForSequenceClassification.from_pretrained(
    "microsoft/codebert-base",  # alternative: "huggingface/CodeBERTa-small-v1"
    num_labels=2,
    cache_dir='data',
    return_dict=True,
).to(device)

tokenizer = RobertaTokenizerFast.from_pretrained("microsoft/codebert-base")

def tokenization(batched_text):
    # Encode each (source0, source1) pair, padding or truncating to 512 tokens
    return tokenizer(batched_text['source0'], batched_text['source1'],
                     padding='max_length', truncation=True, max_length=512)

train_data = train_data.map(tokenization, batched=True, batch_size=256)
# valid_data and test_data are tokenized in the cells below

Some weights of the model checkpoint at microsoft/codebert-base were not used when initializing RobertaForSequenceClassification: ['pooler.dense.bias', 'pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at microsoft/codebert-base and are newly initialized: ['classifier.out_proj.bias', 'classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be 

  0%|          | 0/83 [00:00<?, ?ba/s]
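
Because `tokenization` passes `source0` and `source1` as a pair with `truncation=True`, both files share a single 512-token budget. The sketch below is a simplified simulation of the "longest first" strategy that `truncation=True` selects; it does not call the tokenizer. The function name, the tie-breaking rule, and the example lengths are assumptions for illustration, and it assumes four special tokens for a RoBERTa-style pair (`<s> A </s></s> B </s>`).

```python
def truncate_pair_longest_first(len_a, len_b, max_length=512, num_special=4):
    """Return the token counts kept for each sequence after truncation."""
    budget = max_length - num_special
    while len_a + len_b > budget:
        # Drop one token from whichever sequence is currently longer
        # (ties trim the second sequence; an arbitrary choice in this sketch)
        if len_a > len_b:
            len_a -= 1
        else:
            len_b -= 1
    return len_a, len_b

# Hypothetical token counts for three pairs of source files
long_vs_short = truncate_pair_longest_first(900, 100)
long_vs_medium = truncate_pair_longest_first(900, 300)
both_fit = truncate_pair_longest_first(200, 150)
print(long_vs_short, long_vs_medium, both_fit)
```

One consequence is that for long submissions the model only sees the beginning of each file, so plagiarism that appears late in a file may never reach the classifier.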

In [None]:
gc.collect()

49

In [None]:
valid_data = valid_data.map(tokenization, batched = True, batch_size = 256) #len(valid_data))

  0%|          | 0/21 [00:00<?, ?ba/s]

In [None]:
gc.collect()
test_data = test_data.map(tokenization, batched = True, batch_size = 256)#len(test_data))

  0%|          | 0/26 [00:00<?, ?ba/s]

In [None]:
train_data = train_data.map(lambda examples: {'label': examples['plagiarized']}, batched=True)
valid_data = valid_data.map(lambda examples: {'label': examples['plagiarized']}, batched=True)
test_data = test_data.map(lambda examples: {'label': examples['plagiarized']}, batched=True)
# train_data = train_data.map(lambda examples: {'labels': examples['plagiarized']}, batched=True)
# valid_data = valid_data.map(lambda examples: {'labels': examples['plagiarized']}, batched=True)
# test_data = test_data.map(lambda examples: {'labels': examples['plagiarized']}, batched=True)
train_data.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])
valid_data.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])
test_data.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])

  0%|          | 0/22 [00:00<?, ?ba/s]

  0%|          | 0/6 [00:00<?, ?ba/s]

  0%|          | 0/7 [00:00<?, ?ba/s]

In [None]:
# Define the evaluation metrics: accuracy plus precision, recall, and F1 for the plagiarized class
from sklearn.metrics import precision_recall_fscore_support, accuracy_score
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }

In [None]:
# define the training arguments
training_args = TrainingArguments(
    output_dir = 'results',
    num_train_epochs = 8,
    per_device_train_batch_size = 2,
    gradient_accumulation_steps = 32,  # 2 per device x 4 GPUs x 32 steps = 256 examples per update
    per_device_eval_batch_size= 16,
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    disable_tqdm = False, 
    load_best_model_at_end=True,
    metric_for_best_model='eval_f1',
    greater_is_better=True,
    warmup_steps=160,
    weight_decay=0.01,
    logging_steps = 4,
    learning_rate = 1e-5,
    fp16 = True,
    logging_dir='logs',
    dataloader_num_workers = 0,
#    run_name = 'bigbird_classification_1e5'
)
# instantiate the trainer class and check for available devices
trainer = Trainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_data,
    eval_dataset=valid_data
)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
device


PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
Using amp half precision backend


'cuda'

In [None]:
# Baseline: evaluate the model before fine-tuning (the classification head is still randomly initialized)
trainer.evaluate()

The following columns in the evaluation set  don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: filename1, lines, filename0, percent0, plagiarized, source1, percent, source0, percent1. If filename1, lines, filename0, percent0, plagiarized, source1, percent, source0, percent1 are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 5308
  Batch size = 64


  _warn_prf(average, modifier, msg_start, len(result))


{'eval_loss': 0.6966226100921631,
 'eval_accuracy': 0.5,
 'eval_f1': 0.0,
 'eval_precision': 0.0,
 'eval_recall': 0.0,
 'eval_runtime': 8.7135,
 'eval_samples_per_second': 609.173,
 'eval_steps_per_second': 9.525}
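
The baseline numbers above are what a classifier with no learned signal produces on a balanced set. As a hedged sanity check, the sketch below assumes the untrained head assigns every example to class 0 with roughly uniform probabilities, which the zero recall suggests but the notebook does not print directly.

```python
import math

# Balanced validation set sizes, taken from the oversampling output above
n_negative, n_positive = 2654, 2654

# A model that always predicts class 0 gets every negative right and no positives
true_positive, false_positive = 0, 0
accuracy = n_negative / (n_negative + n_positive)
recall = true_positive / n_positive
# Precision is 0/0 here; scikit-learn reports it as 0 and emits a warning
predicted_positive = true_positive + false_positive
precision = true_positive / predicted_positive if predicted_positive else 0.0

# Cross-entropy of a uniform two-class prediction is ln(2)
uniform_loss = -math.log(0.5)
print(accuracy, precision, recall, round(uniform_loss, 4))
```

The reported `eval_loss` of 0.6966 is close to ln 2 (about 0.6931), which is consistent with an essentially uninformed classifier, and the `_warn_prf` line in the output is the undefined-precision warning.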

In [None]:
!nvidia-smi -L 

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
GPU 0: Tesla V100-SXM2-16GB (UUID: GPU-894e29b4-90ba-075b-150b-df0473693298)
GPU 1: Tesla V100-SXM2-16GB (UUID: GPU-ef7b62fa-d172-d02c-f520-708e91870a2b)
GPU 2: Tesla V100-SXM2-16GB (UUID: GPU-df767309-f546-de10-6ecb-05c19463d6a8)
GPU 3: Tesla V100-SXM2-16GB (UUID: GPU-a5851701-d7bf-0c21-7512-2909cb00ea1e)


In [None]:
# Release cached GPU memory and collect garbage before training (gc is imported at the top)
torch.cuda.empty_cache()
gc.collect()

46

In [None]:
# train the model
trainer.train()

The following columns in the training set  don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: filename1, lines, filename0, percent0, plagiarized, source1, percent, source0, percent1. If filename1, lines, filename0, percent0, plagiarized, source1, percent, source0, percent1 are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 21190
  Num Epochs = 8
  Instantaneous batch size per device = 2
  Total train batch size (w. parallel, distributed & accumulation) = 256
  Gradient Accumulation steps = 32
  Total optimization steps = 656


| Epoch | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
|---|---|---|---|---|---|---|
| 0 | 0.6631 | 0.695451 | 0.510927 | 0.39963 | 0.517365 | 0.325546 |
| 1 | 0.4557 | 0.663369 | 0.697249 | 0.605451 | 0.868922 | 0.464582 |
| 2 | 0.3445 | 0.517141 | 0.813677 | 0.806723 | 0.838002 | 0.777694 |
| 3 | 0.2841 | 0.762075 | 0.784099 | 0.748132 | 0.897679 | 0.641296 |
| 4 | 0.2258 | 0.704528 | 0.814431 | 0.798197 | 0.874719 | 0.733986 |
| 5 | 0.1803 | 0.855262 | 0.81217 | 0.787193 | 0.907927 | 0.6948 |
| 6 | 0.1204 | 0.856625 | 0.809721 | 0.78474 | 0.903337 | 0.69367 |
| 7 | 0.1183 | 0.85497 | 0.81688 | 0.795455 | 0.900858 | 0.712133 |


The following columns in the evaluation set  don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: filename1, lines, filename0, percent0, plagiarized, source1, percent, source0, percent1. If filename1, lines, filename0, percent0, plagiarized, source1, percent, source0, percent1 are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 5308
  Batch size = 64


Saving model checkpoint to results/checkpoint-82
Configuration saved in results/checkpoint-82/config.json
Model weights saved in results/checkpoint-82/pytorch_model.bin
The following columns in the evaluation set  don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: filename1, lines, filename0, percent0, plagiarized, source1, percent, source0, percent1. If filename1, lines, filename0, percent0, plagiarized, source1, percent, source0, percent1 are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 5308
  Batch size = 64
Saving model checkpoint to results/checkpoint-164
Configuration saved in results/checkpoint-164/config.json
Model weights saved in results/checkpoint-164/pytorch_model.bin
The following columns in the evaluation set  don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: filenam

TrainOutput(global_step=656, training_loss=0.32164869439311145, metrics={'train_runtime': 2234.2045, 'train_samples_per_second': 75.875, 'train_steps_per_second': 0.294, 'total_flos': 4.455049011566592e+16, 'train_loss': 0.32164869439311145, 'epoch': 7.99})
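
The batch size and step counts in the training log follow from the training arguments. The arithmetic below is a sketch that assumes all four GPUs listed by `nvidia-smi` take part in training, and that leftover batches which do not fill an accumulation window are not counted as an optimization step.

```python
import math

num_examples = 21190       # "Num examples" in the training log
per_device_batch = 2       # per_device_train_batch_size
num_gpus = 4               # assumption: all GPUs listed by nvidia-smi are used
accumulation_steps = 32    # gradient_accumulation_steps
num_epochs = 8             # num_train_epochs

# Examples contributing to each optimizer update
total_batch = per_device_batch * num_gpus * accumulation_steps
# Forward/backward passes needed to see the whole training set once
batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_gpus))
# Optimizer updates per epoch, then over the whole run
updates_per_epoch = batches_per_epoch // accumulation_steps
total_steps = updates_per_epoch * num_epochs
print(total_batch, updates_per_epoch, total_steps)
```

These values agree with the logged total batch size of 256, the first checkpoint at step 82, and the 656 optimization steps.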

In [None]:
# Evaluate the best checkpoint (reloaded by load_best_model_at_end) on the validation set
trainer.evaluate()

The following columns in the evaluation set  don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: filename1, lines, filename0, percent0, plagiarized, source1, percent, source0, percent1. If filename1, lines, filename0, percent0, plagiarized, source1, percent, source0, percent1 are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 5308
  Batch size = 64


{'eval_loss': 0.5171408653259277,
 'eval_accuracy': 0.8136774679728711,
 'eval_f1': 0.8067226890756302,
 'eval_precision': 0.8380024360535931,
 'eval_recall': 0.7776940467219292,
 'eval_runtime': 8.6214,
 'eval_samples_per_second': 615.68,
 'eval_steps_per_second': 9.627,
 'epoch': 7.99}
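
To relate these metrics to actual decisions, the sketch below recomputes them from confusion-matrix counts. The counts are not printed by the notebook; they are inferred from the reported precision, recall, and accuracy, assuming 2654 validation examples per class as shown in the oversampling output.

```python
# Inferred counts for the balanced validation set (2654 examples per class)
tp, fn = 2064, 590    # plagiarized pairs: detected vs. missed
fp, tn = 399, 2255    # non-plagiarized pairs: flagged vs. correctly cleared

precision = tp / (tp + fp)
recall = tp / (tp + fn)
# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fn + fp + tn)
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f} accuracy={accuracy:.4f}")
```

Read this way, the best checkpoint misses roughly one in five plagiarized pairs in the oversampled validation set while flagging about 15% of the non-plagiarized ones.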

In [None]:
# Score the best checkpoint on the oversampled test set
results = trainer.predict(test_data)
pprint.pprint(results.metrics)

The following columns in the test set  don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: filename1, lines, filename0, percent0, plagiarized, source1, percent, source0, percent1. If filename1, lines, filename0, percent0, plagiarized, source1, percent, source0, percent1 are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 6588
  Batch size = 64


{'test_accuracy': 0.7844565877352763,
 'test_f1': 0.7736691106152375,
 'test_loss': 0.5832812786102295,
 'test_precision': 0.8144295302013422,
 'test_recall': 0.7367941712204007,
 'test_runtime': 10.6206,
 'test_samples_per_second': 620.303,
 'test_steps_per_second': 9.698}
