<a href="https://colab.research.google.com/github/rjenez/W266-final-project/blob/main/notebooks/Plagiarism_with_Trainer_Classifiers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Plagiarism with Classifiers From HuggingFace
**Author:** Ricardo Jenez, heavily modified from HuggingFace examples

**Description:** NLP code to detect plagiarism in code.

## Introduction

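This is a preliminary model for code plagiarism detection. The goal is to identify when students in a class have plagiarized a coding example. Each input is a pair of student source files, and a BERT sequence-pair classifier is fine-tuned to label the pair as plagiarized or not.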

### References

* [BERT](https://arxiv.org/pdf/1810.04805.pdf)
* [Plagiarism Detection in Computer Programming Using Feature Extraction From Ultra-Fine-Grained Repositories](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9097285)

## Setup

Note: install HuggingFace `transformers` via `pip install transformers` (version >= 2.11.0).

In [None]:
%%capture
!pip3 install transformers
!pip3 install sentence_transformers
!pip3 install imbalanced-learn
!pip3 install datasets
#!pip3 install wandb

In [None]:
import torch
import datasets
import transformers
import pandas as pd
import numpy as np
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    EvalPrediction,
    Trainer,
    TrainingArguments,
)
from torch.utils.data import Dataset, DataLoader
#import wandb
import random
import datetime
from imblearn.over_sampling import RandomOverSampler
from sklearn.metrics import precision_recall_fscore_support, accuracy_score, confusion_matrix
# Imports the Google Cloud client library
from google.cloud import storage
import pprint


In [None]:
# Reset GPU and Clear memory
class global_objects:
  def __init__(self,classifier):
    self.train = None
    self.valid = None
    self.test = None
    self.model = None
    self.classifier = classifier

#global_data = global_objects("bert-base-uncased")
global_data = global_objects("bert-large-uncased-whole-word-masking")

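import gc  # used by resetandclear()

def resetandclear():
  # Rebind the module-level holder so the old model and datasets can be collected
  global global_data
  global_data = None
  #global_data = global_objects("bert-base-uncased")
  global_data = global_objects("bert-large-uncased-whole-word-masking")
  gc.collect()
  # Release cached GPU memory after the old references are gone
  with torch.no_grad():
    torch.cuda.empty_cache()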



In [None]:
with torch.no_grad():
  torch.cuda.empty_cache()

In [None]:
#!gsutil cp gs://w266finalproject/plagA20162017.tar plag2.tar
!gsutil cp gs://w266finalproject/plag2.tar plag2.tar

Copying gs://w266finalproject/plag2.tar...
- [1 files][ 77.8 MiB/ 77.8 MiB]                                                
Operation completed over 1 objects/77.8 MiB.                                     


In [None]:
!nvidia-smi -L 

GPU 0: Tesla V100-SXM2-16GB (UUID: GPU-e074cc5d-6a67-875c-c74e-94781d81e946)


In [None]:
from google.colab import auth
auth.authenticate_user()

In [None]:
#!gcloud auth login

In [None]:

!tar xvf plag2.tar
!ls -l
# !mv trainA*.csv train.csv
# !mv testA*.csv test.csv
!mv train2.csv train.csv
!mv test2.csv test.csv

alldata2.csv
groundtruth2.csv
test2.csv
train2.csv
total 159428
-rw-r--r-- 1 root root       720 Mar 28 01:13 adc.json
-rw-r--r-- 1  501 staff  1114619 Mar 16 08:22 alldata2.csv
-rw-r--r-- 1  501 staff   203396 Mar 16 08:19 groundtruth2.csv
-rw-r--r-- 1 root root  81619968 Mar 28 01:13 plag2.tar
drwxr-xr-x 1 root root      4096 Mar 23 14:22 sample_data
-rw-r--r-- 1  501 staff 15819857 Mar 16 08:22 test2.csv
-rw-r--r-- 1  501 staff 64478135 Mar 16 08:22 train2.csv


In [None]:
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")
# Hold out the last 20% of train.csv for validation (positional split, no shuffling)
split = int(len(train_df) * 0.8)
valid_df = train_df[split:]
train_df = train_df[:split]

In [None]:
print("Train Target Distribution")
print(train_df.plagiarized.value_counts())

Train Target Distribution
0    10595
1      463
Name: plagiarized, dtype: int64


In [None]:


# define oversampling strategy
oversample = RandomOverSampler(sampling_strategy='minority',random_state=1234)
train_over, y_train_over = oversample.fit_resample(train_df, train_df.plagiarized)
print("Train Target Distribution")
print(train_over.plagiarized.value_counts())

valid_over, y_valid_over = oversample.fit_resample(valid_df, valid_df.plagiarized)
print("Valid Target Distribution")
print(valid_over.plagiarized.value_counts())

test_over, y_test_over = oversample.fit_resample(test_df, test_df.plagiarized)
print("Test Target Distribution")
print(test_over.plagiarized.value_counts())

Train Target Distribution
0    10595
1    10595
Name: plagiarized, dtype: int64
Valid Target Distribution
0    2654
1    2654
Name: plagiarized, dtype: int64
Test Target Distribution
0    3294
1    3294
Name: plagiarized, dtype: int64
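
The balancing step above can be illustrated with a minimal standard-library sketch: minority-class rows are drawn with replacement until both classes have the same count. The helper `oversample_minority` and the toy rows are hypothetical and are not part of imbalanced-learn, whose implementation may differ in details such as row ordering.

```python
import random
from collections import Counter

def oversample_minority(rows, label_key, seed=1234):
    """Append randomly chosen copies of minority-class rows until a binary label is balanced."""
    rng = random.Random(seed)
    counts = Counter(row[label_key] for row in rows)
    (_, n_major), (minority, n_minor) = counts.most_common(2)
    minority_rows = [row for row in rows if row[label_key] == minority]
    # Draw with replacement, so the same minority row can be copied more than once
    extra = [rng.choice(minority_rows) for _ in range(n_major - n_minor)]
    return rows + extra

# Toy data: 8 non-plagiarized pairs and 2 plagiarized pairs
toy = [{"pair": i, "plagiarized": 0} for i in range(8)]
toy += [{"pair": 8, "plagiarized": 1}, {"pair": 9, "plagiarized": 1}]

balanced = oversample_minority(toy, "plagiarized")
balanced_counts = Counter(row["plagiarized"] for row in balanced)
print(balanced_counts)
```

Because the added rows are exact copies, oversampling changes the class ratio without adding new information. Metrics computed on an oversampled validation or test set therefore describe a 50/50 class mix rather than the original distribution.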


In [None]:

train_data = datasets.Dataset.from_pandas(train_over)
valid_data = datasets.Dataset.from_pandas(valid_over)
test_data = datasets.Dataset.from_pandas(test_over)


In [None]:
print(len(train_data),type(train_data),train_data)

21190 <class 'datasets.arrow_dataset.Dataset'> Dataset({
    features: ['label', 'filename0', 'filename1', 'source0', 'source1', 'percent', 'percent0', 'percent1', 'lines', 'plagiarized'],
    num_rows: 21190
})


In [None]:
# load model and tokenizer and define length of the text sequence
device = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")
model = AutoModelForSequenceClassification.from_pretrained(global_data.classifier,
#
                num_labels = 2,
                cache_dir='data',
                return_dict=True).to(device)



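# model_max_length is the tokenizer's length setting; the per-call max_length is passed in tokenization()
tokenizer = AutoTokenizer.from_pretrained(global_data.classifier,
                                          model_max_length = 512,
                                          cache_dir='data',)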

global_data.model = model
global_data.tokenizer = tokenizer
model=None
tokenizer=None

def tokenization(batched_text):
    return global_data.tokenizer(batched_text['source0'],batched_text['source1'], padding = 'max_length', truncation=True, max_length = 512)
train_data = train_data.map(tokenization, batched = True, batch_size = 256) #len(train_data))
valid_data = valid_data.map(tokenization, batched = True, batch_size = 256) #len(valid_data))
test_data = test_data.map(tokenization, batched = True, batch_size = 256) #len(test_data))





Downloading:   0%|          | 0.00/434 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.25G [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-large-uncased-whole-word-masking were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the

Downloading:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/226k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/455k [00:00<?, ?B/s]

  0%|          | 0/83 [00:00<?, ?ba/s]

  0%|          | 0/21 [00:00<?, ?ba/s]

  0%|          | 0/26 [00:00<?, ?ba/s]
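
Each example is encoded as one sequence pair, so both source files must share the 512-token limit together with the special tokens `[CLS]`, `[SEP]`, and `[SEP]`. With `truncation=True` the tokenizer uses the longest-first strategy. The sketch below imitates that behavior on plain lists; `truncate_pair` is a hypothetical helper, not the library code, and the tie-breaking rule is an assumption.

```python
def truncate_pair(tokens_a, tokens_b, max_length=512, num_special=3):
    """Trim one token at a time from the longer sequence until the pair fits.

    num_special=3 accounts for [CLS] a [SEP] b [SEP] in a BERT pair encoding.
    """
    budget = max_length - num_special
    a, b = list(tokens_a), list(tokens_b)
    while len(a) + len(b) > budget:
        if len(a) > len(b):
            a.pop()   # drop the last token of the longer first sequence
        else:
            b.pop()   # on ties, trim the second sequence (assumed tie-break)
    return a, b

# Two long files: 900 and 700 tokens before truncation
long_a, long_b = truncate_pair(range(900), range(700))
print(len(long_a), len(long_b), len(long_a) + len(long_b))

# A short pair fits as-is and is left untouched
short_a, short_b = truncate_pair(range(100), range(50))
print(len(short_a), len(short_b))
```

For long submissions this leaves roughly the first 250 tokens of each file, so any copied code that appears later in a file is not visible to the model.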

In [None]:
train_data = train_data.map(lambda examples: {'label': examples['plagiarized']}, batched=True)
valid_data = valid_data.map(lambda examples: {'label': examples['plagiarized']}, batched=True)
test_data = test_data.map(lambda examples: {'label': examples['plagiarized']}, batched=True)
train_data.set_format(type='torch', columns=['input_ids', 'attention_mask', 'token_type_ids', 'label'])
valid_data.set_format(type='torch', columns=['input_ids', 'attention_mask', 'token_type_ids', 'label'])
test_data.set_format(type='torch', columns=['input_ids', 'attention_mask', 'token_type_ids', 'label'])

global_data.train = train_data
global_data.valid= valid_data
global_data.test = test_data
train_data = None
valid_data = None
test_data = None

  0%|          | 0/22 [00:00<?, ?ba/s]

  0%|          | 0/6 [00:00<?, ?ba/s]

  0%|          | 0/7 [00:00<?, ?ba/s]

In [None]:
# define accuracy metrics

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }
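
As a small illustration of the `argmax(-1)` step in `compute_metrics`, the sketch below turns per-class logits into predicted labels using only the standard library. The logits are made-up values, and the softmax is included only to show how a logit pair maps to a probability.

```python
import math

# Hypothetical logits for three code pairs: [score for class 0, score for class 1]
logits = [[2.1, -0.3], [0.2, 1.7], [-1.0, -0.5]]

def argmax(row):
    """Index of the largest value, i.e. the predicted class."""
    return max(range(len(row)), key=row.__getitem__)

def softmax(row):
    """Convert logits to probabilities (shifted by the max for numerical stability)."""
    shift = max(row)
    exps = [math.exp(x - shift) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

toy_preds = [argmax(row) for row in logits]
toy_probs = [softmax(row) for row in logits]
print(toy_preds)
print([round(p[1], 3) for p in toy_probs])
```

The predicted label depends only on which logit is larger, so the argmax can be taken without computing the softmax.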

In [None]:
# Set parameters
today = datetime.datetime.now()
date_time = today.strftime("%m%d%Y_%H%M%S")
token_max_length = 512
train_batch_size = 2 # 1 for 4096
cachedir = 'data' + date_time + '_' + str(token_max_length)
outputdir = global_data.classifier + date_time + '_' + str(token_max_length)
logsdir = 'logs' + date_time + '_' + str(token_max_length)

In [None]:
# define the training arguments
training_args = TrainingArguments(
    output_dir = outputdir,
    num_train_epochs = 4,
    per_device_train_batch_size = 2, #8,
    gradient_accumulation_steps = 32,    
    per_device_eval_batch_size= 2, #16,
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    disable_tqdm = False, 
    load_best_model_at_end=True,
    metric_for_best_model='eval_f1',
    greater_is_better=True,
    warmup_steps=160,
    weight_decay=0.01,
    logging_steps = 4,
    learning_rate = 2e-5, #1e-5,
    fp16 = True,
    logging_dir='logs',
    dataloader_num_workers = 0,
#    run_name = 'bigbird_classification_1e5'
)
# instantiate the trainer class and check for available devices
trainer = Trainer(
    model=global_data.model,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=global_data.train,
    eval_dataset=global_data.valid
)
global_data.trainer=trainer
trainer =None

Using amp half precision backend


In [None]:
# see how the basic model would perform
global_data.trainer.evaluate()

The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: lines, percent0, percent1, filename0, percent, source0, plagiarized, filename1, source1. If lines, percent0, percent1, filename0, percent, source0, plagiarized, filename1, source1 are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 5308
  Batch size = 2


  _warn_prf(average, modifier, msg_start, len(result))


{'eval_accuracy': 0.5,
 'eval_f1': 0.0,
 'eval_loss': 0.7117193937301636,
 'eval_precision': 0.0,
 'eval_recall': 0.0,
 'eval_runtime': 87.7346,
 'eval_samples_per_second': 60.501,
 'eval_steps_per_second': 30.25}
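
These baseline numbers are what a classifier that predicts class 0 for every pair would produce on the balanced validation set (2654 examples per class). That reading is inferred from the metrics, since the log does not state it. The sketch below reproduces the arithmetic and compares the loss with the cross-entropy of a 50/50 guess.

```python
import math

n_neg, n_pos = 2654, 2654   # balanced validation counts from the oversampling output
tp, fp = 0, 0               # nothing is ever predicted as plagiarized
tn, fn = n_neg, n_pos

accuracy = (tp + tn) / (n_neg + n_pos)
# With no positive predictions precision is undefined; sklearn reports 0.0 and warns
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn)

# Cross-entropy of a uniform two-class prediction
uniform_loss = math.log(2)

print(accuracy, precision, recall, round(uniform_loss, 4))
```

The reported `eval_loss` of about 0.712 is close to ln 2, which fits a freshly initialized classification head whose outputs carry almost no information yet.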

In [None]:
!nvidia-smi -L 

GPU 0: Tesla V100-SXM2-16GB (UUID: GPU-e074cc5d-6a67-875c-c74e-94781d81e946)


In [None]:
# train the model
global_data.trainer.train()

The following columns in the training set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: lines, percent0, percent1, filename0, percent, source0, plagiarized, filename1, source1. If lines, percent0, percent1, filename0, percent, source0, plagiarized, filename1, source1 are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 21190
  Num Epochs = 4
  Instantaneous batch size per device = 2
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 32
  Total optimization steps = 1324


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
0,0.1926,0.418166,0.84156,0.831834,0.886238,0.783723
1,0.0811,0.783638,0.819894,0.790901,0.942649,0.681236
2,0.0578,1.085943,0.827242,0.799475,0.952579,0.688772
3,0.0436,1.437532,0.803504,0.763653,0.957931,0.634891


The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: lines, percent0, percent1, filename0, percent, source0, plagiarized, filename1, source1. If lines, percent0, percent1, filename0, percent, source0, plagiarized, filename1, source1 are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 5308
  Batch size = 2


Saving model checkpoint to bert-large-uncased-whole-word-masking03282022_011556_512/checkpoint-331
Configuration saved in bert-large-uncased-whole-word-masking03282022_011556_512/checkpoint-331/config.json
Model weights saved in bert-large-uncased-whole-word-masking03282022_011556_512/checkpoint-331/pytorch_model.bin
The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: lines, percent0, percent1, filename0, percent, source0, plagiarized, filename1, source1. If lines, percent0, percent1, filename0, percent, source0, plagiarized, filename1, source1 are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 5308
  Batch size = 2
Saving model checkpoint to bert-large-uncased-whole-word-masking03282022_011556_512/checkpoint-662
Configuration saved in bert-large-uncased-whole-word-masking03282022_011556_512/checkpo

TrainOutput(global_step=1324, training_loss=0.16142554535865333, metrics={'train_runtime': 4537.1775, 'train_samples_per_second': 18.681, 'train_steps_per_second': 0.292, 'total_flos': 7.898491076750131e+16, 'train_loss': 0.16142554535865333, 'epoch': 4.0})
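
The step counts in the training log follow from the batch settings. The sketch below redoes the arithmetic, assuming a single GPU; it is checked against the logged values rather than taken from the `Trainer` source.

```python
import math

num_examples = 21190      # "Num examples" in the training log
per_device_batch = 2      # per_device_train_batch_size
grad_accum = 32           # gradient_accumulation_steps
epochs = 4
warmup_steps = 160

# One optimizer update consumes per_device_batch * grad_accum examples
effective_batch = per_device_batch * grad_accum

# Forward/backward passes per epoch, then optimizer updates per epoch
batches_per_epoch = math.ceil(num_examples / per_device_batch)
steps_per_epoch = batches_per_epoch // grad_accum
total_steps = steps_per_epoch * epochs

warmup_fraction = warmup_steps / total_steps
print(effective_batch, steps_per_epoch, total_steps, round(warmup_fraction, 3))
```

The result matches the log: a total batch size of 64, checkpoints every 331 steps, and 1324 optimization steps, of which roughly 12% are warmup.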

In [None]:
# Evaluate the results
global_data.trainer.evaluate()

The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: lines, percent0, percent1, filename0, percent, source0, plagiarized, filename1, source1. If lines, percent0, percent1, filename0, percent, source0, plagiarized, filename1, source1 are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 5308
  Batch size = 2


{'epoch': 4.0,
 'eval_accuracy': 0.8415599095704597,
 'eval_f1': 0.8318336332733453,
 'eval_loss': 0.4181661009788513,
 'eval_precision': 0.8862377503195569,
 'eval_recall': 0.7837226827430294,
 'eval_runtime': 87.6794,
 'eval_samples_per_second': 60.539,
 'eval_steps_per_second': 30.269}
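
The metrics returned here equal the first row of the training table, which is consistent with `load_best_model_at_end=True` restoring the checkpoint with the highest `eval_f1`. The sketch below repeats that selection on the logged values. Mapping a row index to a checkpoint name with 331 steps per epoch is an assumption based on the checkpoint paths in the log.

```python
# Per-epoch validation results copied from the training table above
history = [
    {"epoch": 0, "train_loss": 0.1926, "eval_loss": 0.418166, "eval_f1": 0.831834},
    {"epoch": 1, "train_loss": 0.0811, "eval_loss": 0.783638, "eval_f1": 0.790901},
    {"epoch": 2, "train_loss": 0.0578, "eval_loss": 1.085943, "eval_f1": 0.799475},
    {"epoch": 3, "train_loss": 0.0436, "eval_loss": 1.437532, "eval_f1": 0.763653},
]

# metric_for_best_model='eval_f1' with greater_is_better=True
best = max(history, key=lambda row: row["eval_f1"])

steps_per_epoch = 331  # assumed from the checkpoint-331 / checkpoint-662 paths
best_checkpoint = f"checkpoint-{(best['epoch'] + 1) * steps_per_epoch}"

# Validation loss rises every epoch while training loss keeps falling
loss_rising = all(a["eval_loss"] < b["eval_loss"] for a, b in zip(history, history[1:]))

print(best["epoch"], best_checkpoint, loss_rising)
```

Training loss falls while validation loss more than triples over the four epochs, a pattern that usually indicates overfitting, so fewer epochs or early stopping may be worth trying.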

In [None]:
results = global_data.trainer.predict(global_data.test)
pprint.pprint(results.metrics)

The following columns in the test set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: lines, percent0, percent1, filename0, percent, source0, plagiarized, filename1, source1. If lines, percent0, percent1, filename0, percent, source0, plagiarized, filename1, source1 are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 6588
  Batch size = 2


{'test_accuracy': 0.8193685488767456,
 'test_f1': 0.8035006605019814,
 'test_loss': 0.6026988625526428,
 'test_precision': 0.8808834178131788,
 'test_recall': 0.738615664845173,
 'test_runtime': 114.6228,
 'test_samples_per_second': 57.475,
 'test_steps_per_second': 28.738}


In [None]:
#!gsutil cp -r $outputdir gs://w266finalproject/

In [None]:
!rm -rf saved_model
!mkdir saved_model

In [None]:
global_data.trainer.save_model('saved_model')


Saving model checkpoint to saved_model
Configuration saved in saved_model/config.json
Model weights saved in saved_model/pytorch_model.bin


In [None]:
!gsutil cp -r saved_model/* gs://w266finalproject/$outputdir

Copying file://saved_model/config.json [Content-Type=application/json]...
Copying file://saved_model/pytorch_model.bin [Content-Type=application/octet-stream]...
==> NOTE: You are uploading one or more large file(s), which would run
significantly faster if you enable parallel composite uploads. This
feature can be enabled by editing the
"parallel_composite_upload_threshold" value in your .boto
configuration file. However, note that if you do this large files will
be uploaded as `composite objects
<https://cloud.google.com/storage/docs/composite-objects>`_,which
means that any user who downloads such objects will need to have a
compiled crcmod installed (see "gsutil help crcmod"). This is because
without a compiled crcmod, computing checksums on composite objects is
so slow that gsutil disables downloads of composite objects.

Copying file://saved_model/training_args.bin [Content-Type=application/octet-stream]...
/ [3 files][  1.2 GiB/  1.2 GiB]   21.5 MiB/s                             

In [None]:
!ls -al saved_model

total 1309340
drwxr-xr-x 2 root root       4096 Mar 28 02:36 .
drwxr-xr-x 1 root root       4096 Mar 28 02:36 ..
-rw-r--r-- 1 root root        713 Mar 28 02:36 config.json
-rw-r--r-- 1 root root 1340739309 Mar 28 02:36 pytorch_model.bin
-rw-r--r-- 1 root root       3055 Mar 28 02:36 training_args.bin


In [None]:
#!gsutil cp -R gs://w266finalproject/resultsBERT03272022_053224_512/checkpoint-662/* saved_model

In [None]:
!gsutil ls gs://w266finalproject/$outputdir

gs://w266finalproject/bert-large-uncased-whole-word-masking03282022_011556_512/config.json
gs://w266finalproject/bert-large-uncased-whole-word-masking03282022_011556_512/optimizer.pt
gs://w266finalproject/bert-large-uncased-whole-word-masking03282022_011556_512/pytorch_model.bin
gs://w266finalproject/bert-large-uncased-whole-word-masking03282022_011556_512/rng_state.pth
gs://w266finalproject/bert-large-uncased-whole-word-masking03282022_011556_512/scaler.pt
gs://w266finalproject/bert-large-uncased-whole-word-masking03282022_011556_512/scheduler.pt
gs://w266finalproject/bert-large-uncased-whole-word-masking03282022_011556_512/trainer_state.json
gs://w266finalproject/bert-large-uncased-whole-word-masking03282022_011556_512/training_args.bin


In [None]:
!ls bert-large-uncased-whole-word-masking03282022_011556_512/checkpoint-331
!gsutil cp -R bert-large-uncased-whole-word-masking03282022_011556_512/checkpoint-331/* gs://w266finalproject/bert-large-uncased-whole-word-masking03282022_011556_512/


config.json   pytorch_model.bin  scaler.pt     trainer_state.json
optimizer.pt  rng_state.pth	 scheduler.pt  training_args.bin
Copying file://bert-large-uncased-whole-word-masking03282022_011556_512/checkpoint-331/config.json [Content-Type=application/json]...
Copying file://bert-large-uncased-whole-word-masking03282022_011556_512/checkpoint-331/optimizer.pt [Content-Type=application/octet-stream]...
==> NOTE: You are uploading one or more large file(s), which would run
significantly faster if you enable parallel composite uploads. This
feature can be enabled by editing the
"parallel_composite_upload_threshold" value in your .boto
configuration file. However, note that if you do this large files will
be uploaded as `composite objects
<https://cloud.google.com/storage/docs/composite-objects>`_,which
means that any user who downloads such objects will need to have a
compiled crcmod installed (see "gsutil help crcmod"). This is because
without a compiled crcmod, computing checksums on com

In [None]:
!ls saved_model


config.json  pytorch_model.bin	training_args.bin


In [None]:
# load model and tokenizer and define length of the text sequence
device = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")
model = AutoModelForSequenceClassification.from_pretrained("./saved_model",
                num_labels = 2,
                cache_dir='data',
                return_dict=True).to(device)


# The tokenizer was not saved with the model, so reload it from the base checkpoint
tokenizer = AutoTokenizer.from_pretrained(global_data.classifier,
                                          model_max_length = 512,
                                          cache_dir='data',)

global_data.model = model
model = None
global_data.tokenizer = tokenizer
tokenizer = None

loading configuration file ./saved_model/config.json
Model config BertConfig {
  "_name_or_path": "./saved_model",
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "torch_dtype": "float32",
  "transformers_version": "4.17.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

loading weights file ./saved_model/pytorch_model.bin
All model checkpoint weights were used when initializing BertForSequenceClassification.

All the weights of BertForSequenceClassification were initialized from the model checkpoint

In [None]:

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }

In [None]:
# define the training arguments
training_args = TrainingArguments(
    output_dir = "saved",
    num_train_epochs = 4,
    per_device_train_batch_size = 8,
    gradient_accumulation_steps = 32,    
    per_device_eval_batch_size= 16,
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    disable_tqdm = False, 
    load_best_model_at_end=True,
    metric_for_best_model='eval_f1',
    greater_is_better=True,
    warmup_steps=160,
    weight_decay=0.01,
    logging_steps = 4,
    learning_rate = 1e-5,
    fp16 = True,
    logging_dir='logs',
    dataloader_num_workers = 0,
#    run_name = 'bigbird_classification_1e5'
)
# instantiate the trainer class and check for available devices
trainer = Trainer(
    model=global_data.model,
    args=training_args,
    compute_metrics=compute_metrics,
    # train_dataset=train_data,
    # eval_dataset=valid_data
)
global_data.trainer = trainer
trainer = None

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
Using amp half precision backend


In [None]:
#!gsutil cp gs://w266finalproject/plagA20162017.tar plag2.tar
!gsutil cp gs://w266finalproject/plag2.tar plag2.tar
!tar xvf plag2.tar
!ls -l

!mv train2.csv train.csv
!mv test2.csv test.csv

test_df = pd.read_csv("test.csv")
oversample = RandomOverSampler(sampling_strategy='minority',random_state=1234)
test_over, y_test_over = oversample.fit_resample(test_df, test_df.plagiarized)
print(test_over.plagiarized.value_counts())
test_data = datasets.Dataset.from_pandas(test_over)
def tokenization(batched_text):
    return global_data.tokenizer(batched_text['source0'],batched_text['source1'], padding = 'max_length', truncation=True, max_length = 512)
test_data = test_data.map(tokenization, batched = True, batch_size = 256)
test_data = test_data.map(lambda examples: {'label': examples['plagiarized']}, batched=True)
test_data.set_format(type='torch', columns=['input_ids', 'attention_mask', 'token_type_ids', 'label'])

global_data.test = test_data
global_data.test_over = test_over
test_df = None
test_over = None
y_test_over = None
test_data = None

Copying gs://w266finalproject/plag2.tar...
- [1 files][ 77.8 MiB/ 77.8 MiB]                                                
Operation completed over 1 objects/77.8 MiB.                                     
alldata2.csv
groundtruth2.csv
test2.csv
train2.csv
total 237868
-rw-r--r-- 1 root root       720 Mar 28 01:13 adc.json
-rw-r--r-- 1  501 staff  1114619 Mar 16 08:22 alldata2.csv
drwxr-xr-x 6 root root      4096 Mar 28 02:32 bert-large-uncased-whole-word-masking03282022_011556_512
drwxr-xr-x 2 root root      4096 Mar 28 01:14 data
-rw-r--r-- 1  501 staff   203396 Mar 16 08:19 groundtruth2.csv
drwxr-xr-x 3 root root      4096 Mar 28 02:34 logs
-rw-r--r-- 1 root root  81619968 Mar 28 02:37 plag2.tar
drwxr-xr-x 1 root root      4096 Mar 23 14:22 sample_data
drwxr-xr-x 2 root root      4096 Mar 28 02:37 saved
drwxr-xr-x 2 root root      4096 Mar 28 02:36 saved_model
-rw-r--r-- 1  501 staff 15819857 Mar 16 08:22 test2.csv
-rw-r--r-- 1  501 staff 15819857 Mar 16 08:22 test.csv
-rw-r--r-- 1 

  0%|          | 0/26 [00:00<?, ?ba/s]

  0%|          | 0/7 [00:00<?, ?ba/s]

In [None]:
predictions = global_data.trainer.predict(global_data.test)
pprint.pprint(predictions.metrics)

The following columns in the test set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: lines, percent0, percent1, filename0, percent, source0, plagiarized, filename1, source1. If lines, percent0, percent1, filename0, percent, source0, plagiarized, filename1, source1 are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 6588
  Batch size = 16


{'test_accuracy': 0.8193685488767456,
 'test_f1': 0.8035006605019814,
 'test_loss': 0.6027801036834717,
 'test_precision': 0.8808834178131788,
 'test_recall': 0.738615664845173,
 'test_runtime': 76.3557,
 'test_samples_per_second': 86.28,
 'test_steps_per_second': 5.396}


In [None]:
preds = np.argmax(predictions.predictions, axis=-1)
print(preds)

[0 0 0 ... 0 1 1]


In [None]:

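# Predictions are passed first, so rows are predicted classes and columns are true labels
cm = confusion_matrix(preds, predictions.label_ids)
print(cm)

# In this orientation the row-major order is TN, FN, FP, TP
tn, fn, fp, tp = cm.ravel()
print(tn,fn,fp,tp)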




[[2965  861]
 [ 329 2433]]
2965 861 329 2433
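
The four counts can be checked against the metrics reported by `predict`. Because the predictions are the first argument to `confusion_matrix`, rows index the predicted class and columns the true label, so 861 counts pairs predicted 0 with true label 1 (false negatives) and 329 counts pairs predicted 1 with true label 0 (false positives). The sketch below recomputes the metrics from those counts.

```python
# Counts read from the matrix above: rows = predicted class, columns = true label
tn, fn = 2965, 861
fp, tp = 329, 2433

precision = tp / (tp + fp)                 # of pairs flagged, how many were plagiarized
recall = tp / (tp + fn)                    # of plagiarized pairs, how many were flagged
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tn + fn + fp + tp)

print(round(accuracy, 4), round(f1, 4), round(precision, 4), round(recall, 4))
```

These values agree with `test_accuracy`, `test_f1`, `test_precision`, and `test_recall` above. Missed plagiarism (861 pairs) is the dominant error type, outnumbering false accusations (329 pairs).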


In [None]:
print(global_data.test['source0'])

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



In [None]:
sourcefalsepos = global_data.test_over[np.logical_and(preds == 1,predictions.label_ids==0)][['source0','source1','filename0','filename1']]
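
The boolean-mask selection used here and in the cells below can be sketched on toy data with plain lists. The names `toy_preds`, `toy_labels`, `toy_pairs`, and `select` are hypothetical; the point is that each outcome group is the set of rows where a given prediction coincides with a given label.

```python
# Toy predictions and ground-truth labels for six code pairs
toy_preds = [0, 1, 1, 0, 1, 0]
toy_labels = [0, 0, 1, 1, 1, 0]
toy_pairs = ["p0", "p1", "p2", "p3", "p4", "p5"]

def select(pred_value, label_value):
    """Return the pairs whose prediction and label match the requested values."""
    return [
        pair
        for pair, y_hat, y in zip(toy_pairs, toy_preds, toy_labels)
        if y_hat == pred_value and y == label_value
    ]

false_pos = select(1, 0)   # flagged, but not plagiarized
false_neg = select(0, 1)   # plagiarized, but missed
true_pos = select(1, 1)
true_neg = select(0, 0)

print(false_pos, false_neg, true_pos, true_neg)
```

This kind of selection is only valid when the predictions are in the same row order as the data frame being indexed, which holds in the notebook because the dataset was built from `test_over` without shuffling.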

In [None]:
pp = pprint.PrettyPrinter(depth=6)

In [None]:
print(sourcefalsepos[['filename0','filename1']].iloc[0])

filename0    A2016/Z1/Z4/student5611
filename1    A2016/Z1/Z4/student2967
Name: 25, dtype: object


In [None]:
pp.pprint(sourcefalsepos['source0'].iloc[0])

('#include <stdio.h>\n'
 '\n'
 'int main() {\n'
 '\tint n,i,j;\n'
 '\t\n'
 '\tprintf("Unesite broj n: ");\n'
 '\tscanf("%d",&n);\n'
 '\t\n'
 '\tif(n<=0 || n>50){\n'
 '\t\tprintf("Pogresan unos");\n'
 '\t\tprintf("\\nUnesite broj n: ");\n'
 '\t\tscanf("%d",&n);\n'
 '\t\tif(n<=0 || n>50){\n'
 '\t\t\tprintf("Pogresan unos");\n'
 '\t\t\tprintf("\\nUnesite broj n: ");\n'
 '\t\t\tscanf("%d",&n);\n'
 '\t\t\tif(n<=0 || n>50){\n'
 '\t\t\t\tprintf("Pogresan unos");\n'
 '\t\t\t\tprintf("\\nUnesite broj n: ");\n'
 '\t\t\t\tscanf("%d",&n);\n'
 '\t\t\t\tif(n<=0 || n>50){\n'
 '\t\t\t\t\tprintf("Pogresan unos");\n'
 '\t\t\t\t\tprintf("\\nUnesite broj n: ");\n'
 '\t\t\t\t\tscanf("%d",&n);\n'
 '\t\t\t\t\tif(n<=0 || n>50){\n'
 '\t\t\t\t\t\tprintf("Pogresan unos");\n'
 '\t\t\t\t\t\tprintf("\\nUnesite broj n: ");\n'
 '\t\t\t\t\t\tscanf("%d",&n);\n'
 '\t\t\t\t\t}\n'
 '\t\t\t\t}\n'
 '\t\t\t}\n'
 '\t\t}\n'
 '\t\t\n'
 '\t}\n'
 '\tfor(i=0; i<=n-1; i++){\n'
 '\t\tfor(j=0; j<=(n-1)*4; j++){\n'
 '\t\t\tif(n==j+i-(

In [None]:
pp.pprint(sourcefalsepos['source1'].iloc[0])

('#include <stdio.h>\n'
 '\n'
 'int main() {\n'
 '\tint n,i,j;\n'
 '\t\n'
 '\tprintf("Unesite broj n: ");\n'
 '\tscanf("%d",&n);\n'
 '\t\n'
 '\tif(n<=0 || n>50){\n'
 '\t\tprintf("Pogresan unos");\n'
 '\t\tprintf("\\nUnesite broj n: ");\n'
 '\t\tscanf("%d",&n);\n'
 '\t\tif(n<=0 || n>50){\n'
 '\t\t\tprintf("Pogresan unos");\n'
 '\t\t\tprintf("\\nUnesite broj n: ");\n'
 '\t\t\tscanf("%d",&n);\n'
 '\t\t\tif(n<=0 || n>50){\n'
 '\t\t\t\tprintf("Pogresan unos");\n'
 '\t\t\t\tprintf("\\nUnesite broj n: ");\n'
 '\t\t\t\tscanf("%d",&n);\n'
 '\t\t\t\tif(n<=0 || n>50){\n'
 '\t\t\t\t\tprintf("Pogresan unos");\n'
 '\t\t\t\t\tprintf("\\nUnesite broj n: ");\n'
 '\t\t\t\t\tscanf("%d",&n);\n'
 '\t\t\t\t\tif(n<=0 || n>50){\n'
 '\t\t\t\t\t\tprintf("Pogresan unos");\n'
 '\t\t\t\t\t\tprintf("\\nUnesite broj n: ");\n'
 '\t\t\t\t\t\tscanf("%d",&n);\n'
 '\t\t\t\t\t\tif(n<=0 || n>50){\n'
 '\t\t\t\t\t\t\tprintf("Pogresan unos");\n'
 '\t\t\t\t\t\t\tprintf("\\nUnesite broj n: ");\n'
 '\t\t\t\t\t\t\tscanf("%d",&n);

In [None]:
sourcefalseneg = global_data.test_over[np.logical_and(preds == 0,predictions.label_ids==1)][['source0','source1']]

In [None]:
pp.pprint(sourcefalseneg['source0'].iloc[2])

('#include <stdio.h>\n'
 '\n'
 'int main() {\n'
 '\tint M, N, i, j, z, k, w, p, r, b, t, d, x, c, a, h, niz[200], niz2[200], '
 'niz3[200];\n'
 '\tint matrica[200][200];\n'
 '\n'
 '\tdo //unos dimenzija matrice i provjera \n'
 '\t{\n'
 '\tprintf("Unesite brojeve M i N: ");\n'
 '\tscanf("%d %d", &M, &N);\n'
 '\tif(M<=0 || N<=0 || M>200 || N>200) printf("Brojevi nisu u trazenom '
 'opsegu.\\n");\n'
 '\t\t\n'
 '\t} while(M<=0 || N<=0 || M>200 || N>200);\n'
 '\t\n'
 '\t//unos matrice \n'
 '\tprintf("Unesite elemente matrice: ");\n'
 '\tfor(i=0; i<M; i++) \n'
 '\t{\n'
 '\t\tfor(j=0; j<N; j++)\n'
 '\t\t{\n'
 '\t\t\tscanf("%d", &matrica[i][j]);\n'
 '\t\t}\n'
 '\t}\n'
 '\t\n'
 '\tfor(i=0; i<M; i++)\n'
 '\t{\n'
 '\n'
 '\t\t\n'
 '\t\tfor(j=0; j<N; j++)\n'
 '\t\t{\n'
 '\t\t\tniz[j]=matrica[i][j]; \n'
 '\t\t}\n'
 '\t\t\n'
 '\t\n'
 '\n'
 '\t\tfor(t=i+1; t<M; t++)\n'
 '\t\t{\n'
 '\t\t\tniz2[t]=1;\n'
 '\t\t}\n'
 '\t\tfor(z=i+1; z<M; z++)\n'
 '\t\t{\n'
 '\t\t\tfor(k=0; k<N; k++)\n'
 '\t\t\t{\n'
 '\t\t

In [None]:
pp.pprint(sourcefalseneg['source1'].iloc[2])

('#include <stdio.h>\n'
 '\n'
 'int main() {\n'
 '\tint M=1,N=1;\n'
 '\tint i,j,k,m,n,p,q,g,h,a,b,x,d,e;\n'
 '\tint matrica[200][200];\n'
 '\tint red1[200],red2[200],red3[200]; \n'
 '\t/* uvodjenje pomocnih redova radi kasnijeg poredjenja sa matricom */\n'
 '\t\n'
 '\tdo{\n'
 '\t\tprintf("Unesite brojeve M i N: ");\n'
 '\t\tscanf("%d %d",&M,&N);\n'
 '\t\tif(M<=0 || N<=0 || M>200 || N>200)\n'
 '\t\tprintf("Brojevi nisu u trazenom opsegu.\\n");\n'
 '\t}while(M<=0 || M>200 || N<=0 || N>200);\n'
 '\tprintf("Unesite elemente matrice: ");\n'
 '\tfor(i=0;i<M;i++)\n'
 '\t{\n'
 '\t\tfor(j=0;j<N;j++)\n'
 '\t\t{\n'
 '\t\t\tscanf("%d",&matrica[i][j]);\n'
 '\t\t}\n'
 '\t}\n'
 '\tfor(i=0;i<M;i++)\n'
 '\t{\n'
 '\t\tfor(j=0;j<N;j++)\n'
 '\t\t\t{\n'
 '\t\t\t\tred1[j]=matrica[i][j];\n'
 '\t\t\t}\n'
 '\t\t\t/* sve jedinice, svi elementi jedne vrste jednaki elementima druge '
 'vrste */\n'
 '\t\tfor(k=i+1;k<M;k++)\n'
 '\t\t{\n'
 '\t\t\tred2[k]=1;\n'
 '\t\t}\n'
 '\t\tfor(m=i+1;m<M;m++)\n'
 '\t\t{\n'
 '\t\t

In [None]:
sourcetruepos = global_data.test_over[np.logical_and(preds == 1,predictions.label_ids==1)][['source0','source1']]

In [None]:
pp.pprint(sourcetruepos['source0'].iloc[0])

('#include <stdio.h>\n'
 '#include <math.h>\n'
 '#define PI 3.1415926\n'
 '#include<stdlib.h>\n'
 '\n'
 'int main() {\n'
 '    double rad[500];\n'
 '    double stepen;\n'
 '    double minuta;\n'
 '    double sekunda;\n'
 '    double ugao;\n'
 '    int x, i, j;\n'
 '    printf("Unesite broj uglova: ");\n'
 '    scanf("%d", &x);\n'
 '    for (i = 0; i < x; i++){\n'
 '        scanf("%lf", &rad[i]);\n'
 '    }\n'
 '    for(i = 0; i < x; i++){\n'
 '    ugao = fabs((rad[i]*180)/PI); minuta = (ugao - (int)ugao)*60;\n'
 '    sekunda = round((minuta - (int)minuta)*60);\n'
 '    /*Algoritam za izbacivanje clanove iz niza uz ocuvanje redoslijeda*/\n'
 '    if (sekunda == 60){\n'
 '        sekunda = 0;\n'
 '        minuta++;\n'
 '    }\n'
 '    if (sekunda > 30){\n'
 '        for(j = i; j < x - 1; j++){\n'
 '            rad[j] = rad[j+1];\n'
 '        }\n'
 '        i--;\n'
 '        x--;\n'
 '    }\n'
 '    }\n'
 '    printf("Uglovi su:\\n");\n'
 '    for(i = 0; i < x; i++){\n'
 '        ugao = (

In [None]:
pp.pprint(sourcetruepos['source1'].iloc[0])

('#include<stdio.h>\n'
 '#include<stdlib.h>\n'
 '#include<math.h>\n'
 '#define PI 3.1415926\n'
 '\n'
 'int main() {\n'
 '\t\n'
 '\tdouble niz[500];\n'
 '\tint i,j,n;\n'
 '\tdouble stepeni, minute, sekunde;\n'
 '\tdouble ugao;\n'
 '\t\n'
 '\tprintf("Unesite broj uglova: ");\n'
 '\tscanf("%d", &n);\n'
 '\t\n'
 '\tfor(i=0;i<n;i++) {\n'
 '\t\tscanf("%lf", &niz[i]);\n'
 '\t\t}\n'
 '\t\t\n'
 '\tfor(i=0; i<n; i++){\n'
 '\t\t\n'
 '\t\tugao=fabs((niz[i]*180)/PI);\n'
 '\t\tminute=(ugao-(int)ugao)*60;\n'
 '\t\tsekunde=round((minute-(int)minute)*60);\n'
 '\t\tif(sekunde==60){sekunde=0, minute++;}\n'
 '\t\t\n'
 '\t\tif(sekunde>30){\n'
 '\t\t\t\n'
 '\t\tfor(j=i; j<n-1; j++) {\n'
 '\t\t\t\n'
 '\t\t\tniz[j]=niz[j+1];\n'
 '\t\t}\n'
 '\t\t\n'
 '\t\tn--;\n'
 '\t\ti--;\n'
 '\t\t}\n'
 '\t}\n'
 '\t\t\n'
 '\t\t\n'
 '\t\tprintf("Uglovi su:\\n");\n'
 '\t\tfor(i=0; i<n; i++){\n'
 '\t\t\n'
 '\t\tugao=(niz[i]*180)/PI;\n'
 '\t\tminute=fabs((ugao-(int)ugao)*60);\n'
 '\t\tsekunde=round((minute-(int)minute)*60);\n'
 

In [None]:
sourcetrueneg = global_data.test_over[np.logical_and(preds == 0,predictions.label_ids==0)][['source0','source1']]

In [None]:
pp.pprint(sourcetrueneg['source0'].iloc[0])

('#include <stdio.h>\n'
 '\n'
 '\n'
 'int main() {\n'
 '\tint x=0,y=0,i=0,j=0,br_tacaka=0;\n'
 '\tchar mat[20][20];\n'
 '\t\n'
 '\tfor (i=0;i<20;i++) {\n'
 '\t\tfor (j=0;j<20;j++) {\n'
 "\t\t\tmat[i][j]=' ';\n"
 '\t\t}\n'
 '\t}\n'
 '    \n'
 '    do {\n'
 '    printf("Unesite broj tacaka: ");\n'
 '    scanf("%d", &br_tacaka);\n'
 '    if (br_tacaka<=0 || br_tacaka>10) printf("Pogresan unos\\n");\n'
 '    } while(br_tacaka<=0 || br_tacaka>10);\n'
 '    \n'
 '\n'
 '         for (i=0;i<br_tacaka;i++) {\n'
 '            do {\n'
 '       \t    printf ("Unesite %d. tacku: ",i+1);\n'
 '\t        scanf("%d %d", &x,&y);\n'
 '\t        if (x>0 || x<19 || y>0 || y<19) \n'
 '\t        break;\n'
 '\t        if (x<0 || x>19 || y<0 || y>19) printf("Pogresan unos\\n");\n'
 '\t        } while(x<0 || x>19 || y<0 || y>19);\n'
 "\t        mat[y][x]='*';\n"
 '}\n'
 '\t    \n'
 '\t \n'
 '\t for(i=0;i<20;i++) {\n'
 '\t \tfor(j=0;j<20;j++) {\n'
 '\t \t\tprintf("%c", mat[i][j]);\n'
 '\t }\n'
 '\t printf("\\n")

In [None]:
pp.pprint(sourcetrueneg['source1'].iloc[0])

('/*3. (0,5 bodova) Zamislimo da na ekranu imamo koordinatni sistem sastavljen '
 'od 20x20 mjesta. \n'
 'Ishodište koordinatnog sistema je u gornjem lijevom uglu i ono odgovara '
 'koordinatama (0,0).\n'
 '\n'
 '\n'
 'Omogućite korisniku da unese najviše 10 tačaka koristeći koordinate [0,19]. '
 'Zatim iscrtajte oblik \n'
 'sastavljen od znakova zvjezdica (asterisk) na onim koordinatama koje je '
 'korisnik unio, a na ostalim lokacijama \n'
 'prazno mjesto. U slučaju da je unesen neispravan broj tačaka ili koordinate '
 'izvan dozvoljenog opsega treba \n'
 'ispisati poruku "Pogresan unos" i zatražiti da se ponovo unese broj tačaka '
 'odnosno koordinate te tačke.\n'
 '\n'
 '\n'
 'Primjer ulaza i izlaza:\n'
 '\tUnesite broj tacaka: 4\n'
 '\tUnesite 1. tacku: 1 1\n'
 '\tUnesite 2. tacku: 2 2\n'
 '\tUnesite 3. tacku: 3 1\n'
 '\tUnesite 4. tacku: 4 0\n'
 '\t    *\n'
 '\t * *\n'
 '\t  *\n'
 '(radi uštede prostora izostavili smo 16 praznih redova ispod nacrtanog '
 'oblika)\n'
 '\n'
 '\n'
 

### Main import of all appropriate libraries for BigBird

## Configuration

## Load the Data

Dataset Overview:

- source0: Homework submission from the first student.
- source1: Homework submission from the second student.
- label: Whether the pair is labeled as plagiarized.

Here are the "similarity" label values in our dataset:

- 0: no similarity
- 1: similarity

Let's look at one sample from the dataset:
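
A single record in this layout might look as follows. This is only a sketch: the two C fragments are invented, and `LABEL_NAMES` and `describe` are hypothetical helpers that mirror the label values listed above.

```python
import pprint

# Mapping taken from the label values listed above
LABEL_NAMES = {0: "no similarity", 1: "similarity"}

# Hypothetical record: the two C fragments are invented for illustration
sample = {
    "source0": "int main() {\n\treturn 0;\n}\n",
    "source1": "int main()\n{\n    return 0;\n}\n",
    "label": 1,
}

def describe(record):
    """Summarize one pair: line counts of both sources and the readable label."""
    return {
        "source0_lines": len(record["source0"].splitlines()),
        "source1_lines": len(record["source1"].splitlines()),
        "label": LABEL_NAMES[record["label"]],
    }

pprint.pprint(sample["source0"])
print(describe(sample))
```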

## Preprocessing

Distribution of our validation targets.
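
One way to inspect such a distribution is to count the labels directly. The sketch below uses only the standard library. The label list is invented, and `label_distribution` is a hypothetical helper rather than a function from the notebook.

```python
from collections import Counter

def label_distribution(labels):
    """Return {label: (count, fraction)} for a non-empty sequence of integer labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in sorted(counts.items())}

# Invented labels: 0 = no similarity, 1 = similarity
toy_valid_labels = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
for label, (n, frac) in label_distribution(toy_valid_labels).items():
    print(f"label {label}: {n} pairs ({frac:.0%})")
```

A skew like the one in this toy list is the kind of imbalance that oversampling the minority class is meant to address.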

One-hot encode training, validation, and test labels.
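
As a rough illustration of what one-hot encoding does to the binary labels, here is a small NumPy sketch. The `one_hot` helper is hypothetical. HuggingFace sequence-classification models generally accept integer class ids for single-label tasks, so this step may not be needed when training with `Trainer`.

```python
import numpy as np

def one_hot(labels, num_classes=2):
    """Map integer class ids to rows of an identity matrix."""
    labels = np.asarray(labels, dtype=np.int64)
    return np.eye(num_classes, dtype=np.float32)[labels]

# 0 = no similarity, 1 = similarity
encoded = one_hot([0, 1, 1])
print(encoded)
```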

## Keras Custom Data Generator

## Build the model.

Create train and validation data generators

## Train the Model

At first only the top layers are trained ("feature extraction"), so the
model reuses the representations of the pretrained model unchanged.
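
In PyTorch terms, feature extraction amounts to freezing the pretrained body and training only the new head. The sketch below assumes a tiny stand-in model so that it runs without downloading weights. `TinyPairClassifier` and `freeze` are hypothetical names, not part of the notebook or of `transformers`.

```python
import torch
from torch import nn

class TinyPairClassifier(nn.Module):
    """Stand-in for a pretrained encoder plus a classification head."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(8, 4)     # plays the role of the pretrained body
        self.classifier = nn.Linear(4, 2)  # plays the role of the new top layer

    def forward(self, x):
        return self.classifier(torch.relu(self.encoder(x)))

def freeze(module):
    """Exclude a module's parameters from gradient updates."""
    for param in module.parameters():
        param.requires_grad = False

model = TinyPairClassifier()
freeze(model.encoder)

# Only parameters that still require gradients are handed to the optimizer
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3
)
print(trainable)
```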

## Fine-tuning

This step must only be performed after the feature extraction model has
been trained to convergence on the new data.

This is an optional last step where `bert_model` is unfrozen and retrained
with a very low learning rate. This can deliver meaningful improvement by
incrementally adapting the pretrained features to the new data.
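
A minimal PyTorch sketch of this second phase follows, again on a toy stand-in model rather than the real network. The learning rate of `1e-5` is an illustrative assumption for "very low", not a value taken from the notebook.

```python
import torch
from torch import nn

# Toy stand-in: layer 0 plays the pretrained encoder, layer 2 the new head
model = nn.Sequential(nn.Linear(8, 4), nn.ReLU(), nn.Linear(4, 2))
encoder = model[0]

# State left over from the feature-extraction phase: encoder frozen
for param in encoder.parameters():
    param.requires_grad = False

# Fine-tuning: unfreeze every parameter and use a much smaller learning rate
for param in model.parameters():
    param.requires_grad = True
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# One end-to-end training step on random data
inputs = torch.randn(16, 8)
targets = torch.randint(0, 2, (16,))
loss = nn.functional.cross_entropy(model(inputs), targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

After `backward()`, the encoder receives gradients again, which is what distinguishes this phase from feature extraction.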

Train the entire model end-to-end.

## Evaluate model on the test set

## Inference on custom sentences

In [None]:
!ls /usr


bin  games  grte  include  lib	lib32  lib64-nvidia  local  sbin  share  src


In [None]:

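def check_similarity(source0, source1):
  # Tokenize the two programs as one (source0, source1) pair. Indexing a one-row
  # pair array with [1] raises IndexError, so the strings are passed directly.
  encoding = tokenizer(str(source0), str(source1), padding='max_length', truncation=True, max_length=3072)
  # Trainer.predict expects a dataset of examples; a one-element list of feature dicts is enough.
  test_dataset = [dict(encoding)]
  test_results = trainer.predict(test_dataset)
  print(test_results)
  return test_results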


Check results on some example code pairs.

In [None]:
source0 = """int obrni(int broj)
{
        int cifra,nova=0;
        while(broj>0) {
                cifra=broj%10;
                nova=nova*10+cifra;
                broj/=10;
        }
        return nova;
}
"""
source1 = """int okreni_cifre(int broj)
{
        int cifra;
        int nova=0;
        while(broj>0) {
                cifra=broj%10;
                nova=nova*10+cifra;
                broj=broj/10;
        }
        return nova;
}"""
check_similarity(source0, source1)

IndexError: ignored
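
Once `check_similarity` returns a prediction, the raw model outputs still have to be turned into a label and a confidence. The sketch below assumes the `predictions` field of the returned object holds one row of two logits per pair. `interpret_logits` is a hypothetical helper, and the logits are invented rather than produced by the model.

```python
import numpy as np

# Mapping taken from the label values described in this notebook
LABEL_NAMES = {0: "no similarity", 1: "similarity"}

def interpret_logits(logits):
    """Convert rows of class logits into (label name, probability) tuples."""
    logits = np.asarray(logits, dtype=np.float64)
    # Subtract the row maximum before exponentiating for numerical stability
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    probs = exp / exp.sum(axis=-1, keepdims=True)
    idx = probs.argmax(axis=-1)
    return [(LABEL_NAMES[int(i)], float(row[i])) for i, row in zip(idx, probs)]

# Invented logits for a single pair
toy_logits = [[-1.0, 2.0]]
results = interpret_logits(toy_logits)
print(results)
```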