<a href="https://colab.research.google.com/github/rjenez/W266-final-project/blob/main/notebooks/Plagiarism_with_BigBird.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Plagiarism with BigBird
**Author:** Ricardo Jenez, heavily modified from HuggingFace examples
**Description:** NLP code to detect plagiarism in code.

## Introduction

This is a preliminary model for code plagiarism detection. The goal is to identify when students in a class have plagiarized a coding assignment.

### References

* [BigBird](https://arxiv.org/abs/2007.14062)
* [Plagiarism Detection in Computer Programming Using Feature Extraction From Ultra-Fine-Grained Repositories](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9097285)

## Setup

Note: install HuggingFace `transformers` via `pip install transformers`. The BigBird classes used below require version >= 4.5.0.

In [1]:
%%capture
!pip3 install transformers
!pip3 install sentence_transformers
!pip3 install imbalanced-learn
!pip3 install datasets
#!pip3 install wandb

In [2]:
import torch
import datasets
import transformers
import pandas as pd
import numpy as np
from transformers import (
    BigBirdTokenizer,
    BigBirdForSequenceClassification,
    Trainer,
    TrainingArguments,
    EvalPrediction,
    AutoTokenizer,
    BigBirdTokenizerFast,
)
from torch.utils.data import Dataset, DataLoader
#import wandb
import random
import datetime
from imblearn.over_sampling import RandomOverSampler
import pprint
import gc
import warnings
warnings.filterwarnings('ignore')


In [3]:
def print_cuda_memory():
  t = torch.cuda.get_device_properties(0).total_memory
  r = torch.cuda.memory_reserved(0)
  a = torch.cuda.memory_allocated(0)
  f = r-a  # free inside reserved
  print(f'Total = {t} reserved = {r} allocated = {a} free ={f}')
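
The raw byte counts printed by `print_cuda_memory` are hard to read at a glance. The sketch below is a hypothetical helper (`format_bytes` and `summarize_memory` are not part of the original notebook) that renders the same three numbers in GiB. The sample values are taken from the memory report printed after the 1024-token model is loaded.

```python
def format_bytes(num_bytes: int) -> str:
    """Render a byte count in GiB with two decimals."""
    return f"{num_bytes / 2**30:.2f} GiB"


def summarize_memory(total: int, reserved: int, allocated: int) -> dict:
    """Mirror print_cuda_memory, but with human-readable units."""
    return {
        "total": format_bytes(total),
        "reserved": format_bytes(reserved),
        "allocated": format_bytes(allocated),
        # Memory held by the caching allocator but not used by tensors.
        "free_in_reserved": format_bytes(reserved - allocated),
    }


# Values copied from the report printed after loading the model.
summary = summarize_memory(16945512448, 568328192, 512799232)
print(summary)
```

Note that "free" here only counts unused memory inside the reserved pool, not the unreserved remainder of the device.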

In [4]:
#!gsutil cp gs://w266finalproject/plagA20162017.tar plag2.tar
!gsutil cp gs://w266finalproject/plag2.tar plag2.tar

Copying gs://w266finalproject/plag2.tar...
\ [1 files][ 77.8 MiB/ 77.8 MiB]                                                
Operation completed over 1 objects/77.8 MiB.                                     


In [5]:
#from google.colab import auth
#auth.authenticate_user()

In [5]:

!tar xvf plag2.tar
!ls
!mv train2.csv train.csv
!mv test2.csv test.csv

alldata2.csv
groundtruth2.csv
test2.csv
train2.csv
alldata2.csv  groundtruth2.csv	plag2.tar  sample_data	test2.csv  train2.csv


In [6]:
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")
valid_df = train_df[int(len(train_df)*0.8):]
train_df = train_df[:int(len(train_df)*0.8)]

In [7]:
print("Train Target Distribution")
print(train_df.plagiarized.value_counts())

Train Target Distribution
0    10595
1      463
Name: plagiarized, dtype: int64


In [8]:
# define oversampling strategy
oversample = RandomOverSampler(sampling_strategy='minority',random_state=1234)
train_over, y_train_over = oversample.fit_resample(train_df, train_df.plagiarized)
print("Train Target Distribution")
print(train_over.plagiarized.value_counts())

valid_over, y_valid_over = oversample.fit_resample(valid_df, valid_df.plagiarized)
print("Valid Target Distribution")
print(valid_over.plagiarized.value_counts())

test_over, y_test_over = oversample.fit_resample(test_df, test_df.plagiarized)
print("Test Target Distribution")
print(test_over.plagiarized.value_counts())

Train Target Distribution
0    10595
1    10595
Name: plagiarized, dtype: int64
Valid Target Distribution
0    2654
1    2654
Name: plagiarized, dtype: int64
Test Target Distribution
0    3294
1    3294
Name: plagiarized, dtype: int64
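
To make the effect of random oversampling concrete, here is a minimal pure-Python sketch of the idea: minority-class rows are duplicated at random until both classes have the same count. This is an illustration only and not the `imbalanced-learn` implementation. The function `random_oversample` and the toy `pair_id` field are hypothetical.

```python
import random


def random_oversample(rows, label_key, seed=1234):
    """Duplicate rows of smaller classes at random until every class
    matches the size of the largest class."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Sample with replacement to fill the gap (k may be 0).
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


# Toy data: 8 non-plagiarized pairs and 2 plagiarized pairs.
rows = [{"pair_id": i, "plagiarized": 0} for i in range(8)]
rows += [{"pair_id": 100 + i, "plagiarized": 1} for i in range(2)]

balanced = random_oversample(rows, "plagiarized")
counts = {}
for row in balanced:
    counts[row["plagiarized"]] = counts.get(row["plagiarized"], 0) + 1
minority_ids = {r["pair_id"] for r in balanced if r["plagiarized"] == 1}
print(counts, sorted(minority_ids))
```

The class counts become equal, but the number of distinct minority examples does not change: the same few plagiarized pairs are simply repeated. That is worth keeping in mind when reading metrics computed on an oversampled validation or test set.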


In [9]:


train_data = datasets.Dataset.from_pandas(train_over)
valid_data = datasets.Dataset.from_pandas(valid_over)
test_data = datasets.Dataset.from_pandas(test_over)

In [11]:
print(len(train_data),type(train_data),train_data)
# Print only the first source pair; printing a whole column exceeds the notebook's IOPub data rate limit.
print(train_data[0]['source0'])
print(train_data[0]['source1'])

21190 <class 'datasets.arrow_dataset.Dataset'> Dataset({
    features: ['label', 'filename0', 'filename1', 'source0', 'source1', 'percent', 'percent0', 'percent1', 'lines', 'plagiarized'],
    num_rows: 21190
})



In [12]:
# Set parameters
today = datetime.datetime.now()
date_time = today.strftime("%m%d%Y_%H%M%S")
token_max_length = 1024
train_batch_size = 1 # 1 for 4096
cachedir = 'data' + date_time + '_' + str(token_max_length)
outputdir = 'resultsBigBIRD' + date_time + '_' + str(token_max_length)
logsdir = 'logs' + date_time + '_' + str(token_max_length)

In [13]:
# load model and tokenizer and define length of the text sequence
print_cuda_memory()
device = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")
model = BigBirdForSequenceClassification.from_pretrained('google/bigbird-roberta-base',
                gradient_checkpointing=False,
                attention_type = "original_full",
                num_labels = 2,
                cache_dir=cachedir,
                return_dict=True).to(device)
torch.cuda.empty_cache()
gc.collect()
print_cuda_memory()
tokenizer = AutoTokenizer.from_pretrained('google/bigbird-roberta-base', 
                                          max_length = token_max_length,
                                          cache_dir=cachedir,)
#tokenizer = BigBirdTokenizerFast.from_pretrained('google/bigbird-roberta-base')

def tokenization(batched_text):
    return tokenizer(batched_text['source0'],batched_text['source1'], padding = 'max_length', truncation=True, max_length = token_max_length)
train_data = train_data.map(tokenization, batched = True, batch_size = 64)
valid_data = valid_data.map(tokenization, batched = True, batch_size = 64)
test_data = test_data.map(tokenization, batched = True, batch_size = 64)

print_cuda_memory()

Total = 16945512448 reserved = 0 allocated = 0 free =0


Some weights of the model checkpoint at google/bigbird-roberta-base were not used when initializing BigBirdForSequenceClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.decoder.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BigBirdForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BigBirdForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BigBirdForSequenceClassifica

Total = 16945512448 reserved = 568328192 allocated = 512799232 free =55528960


Total = 16945512448 reserved = 568328192 allocated = 512799232 free =55528960
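
The tokenizer call above encodes the two source files as one sequence pair and truncates them to `token_max_length`. The sketch below illustrates, in plain Python, how a "longest first" truncation of a pair can work: tokens are dropped one at a time from whichever sequence is currently longer until the pair fits. It is a simplified model under stated assumptions, not the library code. The function `truncate_pair` is hypothetical, and the budget of 3 special tokens assumes a `[CLS] A [SEP] B [SEP]` layout.

```python
def truncate_pair(tokens_a, tokens_b, max_length, num_special_tokens=3):
    """Trim two token lists so that, together with the special tokens,
    they fit into max_length positions."""
    budget = max_length - num_special_tokens
    a, b = list(tokens_a), list(tokens_b)
    while len(a) + len(b) > budget:
        # Drop the last token of the longer sequence (ties trim b).
        if len(a) > len(b):
            a.pop()
        else:
            b.pop()
    return a, b


# A long file paired with a short one: only the long file is trimmed.
long_a, short_b = truncate_pair(range(1500), range(300), max_length=1024)
print(len(long_a), len(short_b))

# Two long files: both end up with roughly half of the budget.
even_a, even_b = truncate_pair(range(2000), range(2000), max_length=1024)
print(len(even_a), len(even_b))
```

Under this scheme, any code beyond the budget is invisible to the model, which is one motivation for comparing the 1024-token and 2048-token runs.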


In [14]:
train_data = train_data.map(lambda examples: {'label': examples['plagiarized']}, batched=True)
valid_data = valid_data.map(lambda examples: {'label': examples['plagiarized']}, batched=True)
test_data = test_data.map(lambda examples: {'label': examples['plagiarized']}, batched=True)
train_data.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])
valid_data.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])
test_data.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])
print_cuda_memory()


Total = 16945512448 reserved = 568328192 allocated = 512799232 free =55528960


In [15]:
# define accuracy metrics
from sklearn.metrics import precision_recall_fscore_support, accuracy_score
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }
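
For reference, the same four metrics can be computed by hand. The sketch below is a dependency-free illustration of what `compute_metrics` returns for the positive (plagiarized) class. The helper names `argmax_rows` and `binary_metrics` and the toy logits are made up for this example.

```python
def argmax_rows(logits):
    """Index of the largest logit in each row (the predicted class)."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


def binary_metrics(labels, preds):
    """Accuracy, precision, recall and F1 for the positive class (1)."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = sum(1 for y, p in zip(labels, preds) if y == p) / len(labels)
    return {"accuracy": accuracy, "f1": f1, "precision": precision, "recall": recall}


# Toy logits for five code pairs: column 0 = not plagiarized, column 1 = plagiarized.
logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4], [0.0, 2.0]]
labels = [0, 1, 1, 1, 0]

preds = argmax_rows(logits)
metrics = binary_metrics(labels, preds)
print(preds, metrics)
```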

In [16]:
# define the training arguments
training_args = TrainingArguments(
    output_dir = outputdir,
    num_train_epochs = 4,
    per_device_train_batch_size = 4,
    gradient_accumulation_steps = 32,  # effective batch size: 4 * 32 = 128
    per_device_eval_batch_size= 2,
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    disable_tqdm = False, 
    load_best_model_at_end=True,
    metric_for_best_model='eval_f1',
    greater_is_better=True,
    warmup_steps=160,
    weight_decay=0.01,
    logging_steps = 4,
    learning_rate = 1e-5,
    fp16 = True,
    logging_dir=logsdir,
    dataloader_num_workers = 0,
#    run_name = 'bigbird_classification_1e5'
)
# instantiate the trainer class and check for available devices
trainer = Trainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_data,
    eval_dataset=valid_data
)
print_cuda_memory()



Using amp half precision backend


Total = 16945512448 reserved = 568328192 allocated = 512799232 free =55528960


In [17]:
# baseline: evaluate on the validation set before fine-tuning (the classification head is still randomly initialized)
trainer.evaluate()

The following columns in the evaluation set  don't have a corresponding argument in `BigBirdForSequenceClassification.forward` and have been ignored: source0, filename0, lines, percent, percent1, plagiarized, source1, filename1, percent0. If source0, filename0, lines, percent, percent1, plagiarized, source1, filename1, percent0 are not expected by `BigBirdForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 5308
  Batch size = 2


{'eval_accuracy': 0.5124340617935192,
 'eval_f1': 0.18565135305223412,
 'eval_loss': 0.6929774284362793,
 'eval_precision': 0.5629770992366412,
 'eval_recall': 0.11115297663903542,
 'eval_runtime': 93.9331,
 'eval_samples_per_second': 56.508,
 'eval_steps_per_second': 28.254}

In [18]:
torch.cuda.empty_cache()
gc.collect()
print_cuda_memory()

Total = 16945512448 reserved = 568328192 allocated = 512799232 free =55528960


In [19]:
# train the model
trainer.train()

The following columns in the training set  don't have a corresponding argument in `BigBirdForSequenceClassification.forward` and have been ignored: source0, filename0, lines, percent, percent1, plagiarized, source1, filename1, percent0. If source0, filename0, lines, percent, percent1, plagiarized, source1, filename1, percent0 are not expected by `BigBirdForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 21190
  Num Epochs = 4
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 128
  Gradient Accumulation steps = 32
  Total optimization steps = 660


| Epoch | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
|---|---|---|---|---|---|---|
| 0 | 0.4466 | 0.501263 | 0.778071 | 0.798976 | 0.730193 | 0.882065 |
| 1 | 0.2444 | 0.794691 | 0.710814 | 0.614806 | 0.920361 | 0.461567 |
| 2 | 0.1958 | 0.805111 | 0.743595 | 0.66634 | 0.953684 | 0.512057 |
| 3 | 0.1529 | 0.780234 | 0.783346 | 0.733673 | 0.951923 | 0.596835 |


The following columns in the evaluation set  don't have a corresponding argument in `BigBirdForSequenceClassification.forward` and have been ignored: source0, filename0, lines, percent, percent1, plagiarized, source1, filename1, percent0. If source0, filename0, lines, percent, percent1, plagiarized, source1, filename1, percent0 are not expected by `BigBirdForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 5308
  Batch size = 2


Saving model checkpoint to resultsBigBIRD04012022_072835_1024/checkpoint-165
Configuration saved in resultsBigBIRD04012022_072835_1024/checkpoint-165/config.json
Model weights saved in resultsBigBIRD04012022_072835_1024/checkpoint-165/pytorch_model.bin
Saving model checkpoint to resultsBigBIRD04012022_072835_1024/checkpoint-330
Configuration saved in resultsBigBIRD04012022_072835_1024/checkpoint-330/config.json
Model weights saved in resultsBigBIRD04012022_072835_1024/checkpoint-330/pytorch_mod

TrainOutput(global_step=660, training_loss=0.33997134177973776, metrics={'train_runtime': 4174.1021, 'train_samples_per_second': 20.306, 'train_steps_per_second': 0.158, 'total_flos': 4.487305645780992e+16, 'train_loss': 0.33997134177973776, 'epoch': 4.0})
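
The step counts in the training log follow from the batch settings. The sketch below reproduces them with simple arithmetic. It assumes a single GPU and that update steps per epoch are the number of batches floor-divided by the accumulation steps, which matches the logged values here. The function `total_optimization_steps` is a hypothetical helper, not a `transformers` API.

```python
import math


def total_optimization_steps(num_examples, per_device_batch_size,
                             grad_accum_steps, num_epochs, num_devices=1):
    """Return (update steps per epoch, total update steps)."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
    updates_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return updates_per_epoch, updates_per_epoch * num_epochs


# Settings of the 1024-token run: 21190 examples, batch 4, accumulation 32, 4 epochs.
updates_per_epoch, total_steps = total_optimization_steps(21190, 4, 32, 4)
effective_batch_size = 4 * 32
print(effective_batch_size, updates_per_epoch, total_steps)

# The same arithmetic with a per-device batch size of 2.
updates_per_epoch_b2, total_steps_b2 = total_optimization_steps(21190, 2, 32, 4)
print(updates_per_epoch_b2, total_steps_b2)
```

The per-epoch value also explains the checkpoint names: with `save_strategy = "epoch"`, checkpoints are written at multiples of the per-epoch step count (`checkpoint-165`, `checkpoint-330`, and so on).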

In [20]:
# Evaluate on the validation set; load_best_model_at_end has restored the checkpoint with the highest eval_f1
trainer.evaluate()

The following columns in the evaluation set  don't have a corresponding argument in `BigBirdForSequenceClassification.forward` and have been ignored: source0, filename0, lines, percent, percent1, plagiarized, source1, filename1, percent0. If source0, filename0, lines, percent, percent1, plagiarized, source1, filename1, percent0 are not expected by `BigBirdForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 5308
  Batch size = 2


{'epoch': 4.0,
 'eval_accuracy': 0.7780708364732479,
 'eval_f1': 0.7989761092150172,
 'eval_loss': 0.5012627840042114,
 'eval_precision': 0.7301933873986276,
 'eval_recall': 0.8820648078372268,
 'eval_runtime': 94.0963,
 'eval_samples_per_second': 56.41,
 'eval_steps_per_second': 28.205}
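
These validation numbers match epoch 0 of the training table rather than the final epoch. That is consistent with `load_best_model_at_end=True` and `metric_for_best_model='eval_f1'`: the checkpoint with the highest validation F1 is restored after training. The sketch below illustrates that selection rule using the per-epoch values logged above; it is not the trainer's internal code.

```python
# Per-epoch validation metrics copied from the 1024-token training log.
epoch_metrics = [
    {"epoch": 0, "eval_f1": 0.798976, "eval_loss": 0.501263},
    {"epoch": 1, "eval_f1": 0.614806, "eval_loss": 0.794691},
    {"epoch": 2, "eval_f1": 0.666340, "eval_loss": 0.805111},
    {"epoch": 3, "eval_f1": 0.733673, "eval_loss": 0.780234},
]

# greater_is_better=True: pick the epoch with the largest eval_f1.
best = max(epoch_metrics, key=lambda m: m["eval_f1"])
print(best)
```

The best F1 comes from the first epoch. Validation loss rises in later epochs while training loss keeps falling, which suggests the model starts to overfit the oversampled training set early.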

In [21]:
results = trainer.predict(test_data)
pprint.pprint(results.metrics)

The following columns in the test set  don't have a corresponding argument in `BigBirdForSequenceClassification.forward` and have been ignored: source0, filename0, lines, percent, percent1, plagiarized, source1, filename1, percent0. If source0, filename0, lines, percent, percent1, plagiarized, source1, filename1, percent0 are not expected by `BigBirdForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 6588
  Batch size = 2


{'test_accuracy': 0.7369459623557985,
 'test_f1': 0.7597393594898101,
 'test_loss': 0.5458023548126221,
 'test_precision': 0.6991579484562388,
 'test_recall': 0.8318154219793564,
 'test_runtime': 122.2055,
 'test_samples_per_second': 53.909,
 'test_steps_per_second': 26.955}


In [22]:
#!gsutil cp -r $outputdir gs://w266finalproject/

In [19]:
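# reset and run 2K
# Note: `trainer` still holds a reference to the 1024-token model, so its weights stay
# allocated (see the memory report below); set `trainer = None` as well to release them.
model = None
tokenizer = None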
torch.cuda.empty_cache()
gc.collect()
print_cuda_memory()
train_data = datasets.Dataset.from_pandas(train_over)
valid_data = datasets.Dataset.from_pandas(valid_over)
test_data = datasets.Dataset.from_pandas(test_over)

Total = 16945512448 reserved = 15051259904 allocated = 512799232 free =14538460672


In [20]:
# Set parameters
today = datetime.datetime.now()
date_time = today.strftime("%m%d%Y_%H%M%S")
token_max_length = 2048
train_batch_size = 1 # 1 for 4096
cachedir = 'data' + date_time + '_' + str(token_max_length)
outputdir = 'resultsBigBIRD' + date_time + '_' + str(token_max_length)
logsdir = 'logs' + date_time + '_' + str(token_max_length)

In [21]:
# load model and tokenizer and define length of the text sequence
print_cuda_memory()
device = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")
model = BigBirdForSequenceClassification.from_pretrained('google/bigbird-roberta-base',
                gradient_checkpointing=False,
                # attention_type is left at the checkpoint default ("block_sparse") for the 2048-token run
                num_labels = 2,
                cache_dir=cachedir,
                return_dict=True).to(device)
torch.cuda.empty_cache()
gc.collect()
print_cuda_memory()
tokenizer = AutoTokenizer.from_pretrained('google/bigbird-roberta-base', 
                                          max_length = token_max_length,
                                          cache_dir=cachedir,)
#tokenizer = BigBirdTokenizerFast.from_pretrained('google/bigbird-roberta-base')

def tokenization(batched_text):
    return tokenizer(batched_text['source0'],batched_text['source1'], padding = 'max_length', truncation=True, max_length = token_max_length)
train_data = train_data.map(tokenization, batched = True, batch_size = 64)
valid_data = valid_data.map(tokenization, batched = True, batch_size = 64)
test_data = test_data.map(tokenization, batched = True, batch_size = 64)

print_cuda_memory()

Total = 16945512448 reserved = 15051259904 allocated = 512799232 free =14538460672


Model config BigBirdConfig {
  "architectures": [
    "BigBirdForPreTraining"
  ],
  "attention_probs_dropout_prob": 0.1,
  "attention_type": "block_sparse",
  "block_size": 64,
  "bos_token_id": 1,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_

Some weights of the model checkpoint at google/bigbird-roberta-base were not used when initializing BigBirdForSequenceClassification: ['cls.predictions.decoder.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transf

Total = 16945512448 reserved = 1159725056 allocated = 1025895424 free =133829632



Total = 16945512448 reserved = 1159725056 allocated = 1025895424 free =133829632


In [22]:
train_data = train_data.map(lambda examples: {'label': examples['plagiarized']}, batched=True)
valid_data = valid_data.map(lambda examples: {'label': examples['plagiarized']}, batched=True)
test_data = test_data.map(lambda examples: {'label': examples['plagiarized']}, batched=True)
train_data.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])
valid_data.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])
test_data.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])
print_cuda_memory()


Total = 16945512448 reserved = 1159725056 allocated = 1025895424 free =133829632


In [23]:
# define accuracy metrics
from sklearn.metrics import precision_recall_fscore_support, accuracy_score
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }

In [24]:
# define the training arguments
training_args = TrainingArguments(
    output_dir = outputdir,
    num_train_epochs = 4,
    per_device_train_batch_size = 2,
    gradient_accumulation_steps = 32,  # effective batch size: 2 * 32 = 64
    per_device_eval_batch_size= 2,
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    disable_tqdm = False, 
    load_best_model_at_end=True,
    metric_for_best_model='eval_f1',
    greater_is_better=True,
    warmup_steps=160,
    weight_decay=0.01,
    logging_steps = 4,
    learning_rate = 1e-5,
    fp16 = True,
    logging_dir=logsdir,
    dataloader_num_workers = 0,
#    run_name = 'bigbird_classification_1e5'
)
# instantiate the trainer class and check for available devices
trainer = Trainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_data,
    eval_dataset=valid_data
)
print_cuda_memory()



PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
Using amp half precision backend


Total = 16945512448 reserved = 1159725056 allocated = 1025895424 free =133829632


In [25]:
# baseline: evaluate on the validation set before fine-tuning (the classification head is still randomly initialized)
trainer.evaluate()

The following columns in the evaluation set  don't have a corresponding argument in `BigBirdForSequenceClassification.forward` and have been ignored: filename0, percent0, plagiarized, source0, percent, lines, percent1, source1, filename1. If filename0, percent0, plagiarized, source0, percent, lines, percent1, source1, filename1 are not expected by `BigBirdForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 5308
  Batch size = 2


{'eval_accuracy': 0.5009419743782969,
 'eval_f1': 0.6558399376380408,
 'eval_loss': 0.6930267214775085,
 'eval_precision': 0.5004957366646837,
 'eval_recall': 0.9510173323285607,
 'eval_runtime': 630.2536,
 'eval_samples_per_second': 8.422,
 'eval_steps_per_second': 4.211}

In [26]:
# train the model
torch.cuda.empty_cache()
gc.collect()
print_cuda_memory()
trainer.train()

The following columns in the training set  don't have a corresponding argument in `BigBirdForSequenceClassification.forward` and have been ignored: filename0, percent0, plagiarized, source0, percent, lines, percent1, source1, filename1. If filename0, percent0, plagiarized, source0, percent, lines, percent1, source1, filename1 are not expected by `BigBirdForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 21190
  Num Epochs = 4
  Instantaneous batch size per device = 2
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 32
  Total optimization steps = 1324


Total = 16945512448 reserved = 614465536 allocated = 513096192 free =101369344


| Epoch | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
|---|---|---|---|---|---|---|
| 0 | 0.3045 | 0.553045 | 0.778636 | 0.753927 | 0.848656 | 0.678222 |
| 1 | 0.2261 | 0.900293 | 0.7289 | 0.651152 | 0.912984 | 0.506029 |
| 2 | 0.1825 | 0.928028 | 0.769216 | 0.721401 | 0.909925 | 0.597589 |
| 3 | 0.0638 | 1.15633 | 0.749058 | 0.686145 | 0.915723 | 0.548606 |


The following columns in the evaluation set  don't have a corresponding argument in `BigBirdForSequenceClassification.forward` and have been ignored: filename0, percent0, plagiarized, source0, percent, lines, percent1, source1, filename1. If filename0, percent0, plagiarized, source0, percent, lines, percent1, source1, filename1 are not expected by `BigBirdForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 5308
  Batch size = 2


Saving model checkpoint to resultsBigBIRD04012022_162657_2048/checkpoint-331
Configuration saved in resultsBigBIRD04012022_162657_2048/checkpoint-331/config.json
Model weights saved in resultsBigBIRD04012022_162657_2048/checkpoint-331/pytorch_model.bin
Saving model checkpoint to resultsBigBIRD04012022_162657_2048/checkpoint-662
Configuration saved in resultsBigBIRD04012022_162657_2048/checkpoint-662/config.json
Model weights saved in resultsBigBIRD04012022_162657_2048/checkpoint-662/pytorch_mod

TrainOutput(global_step=1324, training_loss=0.2482875572500632, metrics={'train_runtime': 23619.2443, 'train_samples_per_second': 3.589, 'train_steps_per_second': 0.056, 'total_flos': 8.981393380623974e+16, 'train_loss': 0.2482875572500632, 'epoch': 4.0})

In [27]:
# Evaluate on the validation set; load_best_model_at_end has restored the checkpoint with the highest eval_f1
trainer.evaluate()

The following columns in the evaluation set  don't have a corresponding argument in `BigBirdForSequenceClassification.forward` and have been ignored: filename0, percent0, plagiarized, source0, percent, lines, percent1, source1, filename1. If filename0, percent0, plagiarized, source0, percent, lines, percent1, source1, filename1 are not expected by `BigBirdForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 5308
  Batch size = 2


{'epoch': 4.0,
 'eval_accuracy': 0.7786360211002261,
 'eval_f1': 0.7539267015706806,
 'eval_loss': 0.5530449151992798,
 'eval_precision': 0.8486562942008486,
 'eval_recall': 0.6782215523737755,
 'eval_runtime': 606.2404,
 'eval_samples_per_second': 8.756,
 'eval_steps_per_second': 4.378}

In [28]:
results = trainer.predict(test_data)
pprint.pprint(results.metrics)

The following columns in the test set  don't have a corresponding argument in `BigBirdForSequenceClassification.forward` and have been ignored: filename0, percent0, plagiarized, source0, percent, lines, percent1, source1, filename1. If filename0, percent0, plagiarized, source0, percent, lines, percent1, source1, filename1 are not expected by `BigBirdForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 6588
  Batch size = 2


{'test_accuracy': 0.7585003035822708,
 'test_f1': 0.7266792647311459,
 'test_loss': 0.6154723763465881,
 'test_precision': 0.8369608231104076,
 'test_recall': 0.6420765027322405,
 'test_runtime': 751.4322,
 'test_samples_per_second': 8.767,
 'test_steps_per_second': 4.384}
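
To put the two runs side by side, the sketch below tabulates the test metrics and training runtimes reported above and computes the differences. The numbers are copied from the outputs in this notebook; the `runs` dictionary is just a convenient container for this illustration. Keep in mind that the runs differ in more than sequence length: the 1024-token run used `original_full` attention with a per-device batch size of 4, while the 2048-token run used the default `block_sparse` attention with a batch size of 2.

```python
# Test metrics and training runtimes copied from the outputs above.
runs = {
    1024: {"test_f1": 0.7597393594898101, "test_precision": 0.6991579484562388,
           "test_recall": 0.8318154219793564, "train_runtime": 4174.1021},
    2048: {"test_f1": 0.7266792647311459, "test_precision": 0.8369608231104076,
           "test_recall": 0.6420765027322405, "train_runtime": 23619.2443},
}

f1_delta = runs[2048]["test_f1"] - runs[1024]["test_f1"]
precision_delta = runs[2048]["test_precision"] - runs[1024]["test_precision"]
recall_delta = runs[2048]["test_recall"] - runs[1024]["test_recall"]
runtime_ratio = runs[2048]["train_runtime"] / runs[1024]["train_runtime"]

print(f"F1 change:        {f1_delta:+.4f}")
print(f"Precision change: {precision_delta:+.4f}")
print(f"Recall change:    {recall_delta:+.4f}")
print(f"Training time:    {runtime_ratio:.2f}x")
```

In these two runs, the longer context trades recall for precision and takes several times longer to train, without improving test F1.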
