<a href="https://colab.research.google.com/github/rjenez/W266-final-project/blob/main/notebooks/Plagiarism_with_Chinese_Bert.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Plagiarism with ChineseBERT

**Author:** Ricardo Jenez, heavily modified from HuggingFace examples
**Description:** NLP code to detect plagiarism in code.

## Introduction

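This is a preliminary model for code plagiarism detection. The goal is to identify when students in a class have plagiarized a coding example. The notebook fine-tunes `bert-base-chinese` as a binary classifier over pairs of source files (`source0`, `source1`), using the `plagiarized` column as the label.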

### References

* [BERT base model pre-trained on Chinese](https://huggingface.co/bert-base-chinese)
* [Plagiarism Detection in Computer Programming Using Feature Extraction From Ultra-Fine-Grained Repositories](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9097285)

## Setup

Note: install HuggingFace `transformers` via `pip install transformers` (version >= 2.11.0).

In [None]:
%%capture
!pip3 install transformers
!pip3 install sentence_transformers
!pip3 install imbalanced-learn
!pip3 install datasets
#!pip3 install wandb

In [None]:
import torch
import datasets
import transformers
import pandas as pd
import numpy as np
from transformers import (
    AutoTokenizer,
    BertForSequenceClassification,
    BertTokenizer,
    BertTokenizerFast,
    EvalPrediction,
    Trainer,
    TrainingArguments,
)
from torch.utils.data import Dataset, DataLoader
#import wandb
import random
from imblearn.over_sampling import RandomOverSampler
import pprint


In [None]:
#!gsutil cp gs://w266finalproject/plagA20162017.tar plag2.tar
!gsutil cp gs://w266finalproject/plag2.tar plag2.tar

Copying gs://w266finalproject/plag2.tar...
- [1 files][ 77.8 MiB/ 77.8 MiB]                                                
Operation completed over 1 objects/77.8 MiB.                                     


In [None]:

!tar xvf plag2.tar
!ls -l
# !mv trainA*.csv train.csv
# !mv testA*.csv test.csv
!mv train2.csv train.csv
!mv test2.csv test.csv

alldata2.csv
groundtruth2.csv
test2.csv
train2.csv
total 159428
-rw-r--r-- 1  501 staff  1114619 Mar 16 08:22 alldata2.csv
-rw-r--r-- 1  501 staff   203396 Mar 16 08:19 groundtruth2.csv
-rw-r--r-- 1 root root  81619968 Mar 24 06:58 plag2.tar
drwxr-xr-x 1 root root      4096 Mar  9 14:48 sample_data
-rw-r--r-- 1  501 staff 15819857 Mar 16 08:22 test2.csv
-rw-r--r-- 1  501 staff 64478135 Mar 16 08:22 train2.csv


In [None]:
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")
# Hold out the last 20% of the training rows for validation (sequential split, no shuffling).
split_idx = int(len(train_df) * 0.8)
valid_df = train_df[split_idx:]
train_df = train_df[:split_idx]

In [None]:
print("Train Target Distribution")
print(train_df.plagiarized.value_counts())

Train Target Distribution
0    10595
1      463
Name: plagiarized, dtype: int64


In [None]:
# Balance the classes by randomly duplicating minority-class (plagiarized) rows.
# Note: the validation and test sets are oversampled too, so the metrics reported
# below are measured on balanced data, not on the original class distribution.
oversample = RandomOverSampler(sampling_strategy='minority', random_state=1234)
train_over, y_train_over = oversample.fit_resample(train_df, train_df.plagiarized)
print("Train Target Distribution")
print(train_over.plagiarized.value_counts())

valid_over, y_valid_over = oversample.fit_resample(valid_df, valid_df.plagiarized)
print("Valid Target Distribution")
print(valid_over.plagiarized.value_counts())

test_over, y_test_over = oversample.fit_resample(test_df, test_df.plagiarized)
print("Test Target Distribution")
print(test_over.plagiarized.value_counts())

Train Target Distribution
0    10595
1    10595
Name: plagiarized, dtype: int64
Valid Target Distribution
0    2654
1    2654
Name: plagiarized, dtype: int64
Test Target Distribution
0    3294
1    3294
Name: plagiarized, dtype: int64
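
Before oversampling, only 463 of the 11,058 training pairs (about 4.2%) are labeled as plagiarized. The snippet below is a minimal pure-Python sketch of what random oversampling does to such a set. It is not the `imbalanced-learn` implementation; `random_oversample` and the toy rows are hypothetical names used only for illustration.

```python
import random
from collections import Counter


def random_oversample(rows, label_key, seed=1234):
    """Duplicate randomly chosen rows of the smaller classes until all classes are the same size."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Sample with replacement to fill the gap up to the majority-class size.
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


# Toy data: 8 non-plagiarized pairs and 2 plagiarized pairs.
toy_rows = [{"pair_id": i, "plagiarized": 0} for i in range(8)]
toy_rows += [{"pair_id": 100 + i, "plagiarized": 1} for i in range(2)]

balanced_rows = random_oversample(toy_rows, "plagiarized")
class_counts = Counter(row["plagiarized"] for row in balanced_rows)
print(class_counts)
```

Because the extra minority rows are exact duplicates, the model sees the same plagiarized pairs many times per epoch, which is one plausible reason the training loss falls much faster than the validation loss later in this notebook.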


In [None]:
# train_data, test_data = datasets.load_dataset('imdb', split =['train', 'test'], 
#                                              cache_dir='/media/data_files/github/website_tutorials/data')

# train_data, test_data = datasets.load_dataset('csv',  split =['train', 'test'], data_files={'train': 'train.csv',
#                                               'test': 'test.csv'},cache_dir='data')

train_data = datasets.Dataset.from_pandas(train_over)
valid_data = datasets.Dataset.from_pandas(valid_over)
test_data = datasets.Dataset.from_pandas(test_over)

In [None]:
print(len(train_data),type(train_data),train_data)

21190 <class 'datasets.arrow_dataset.Dataset'> Dataset({
    features: ['label', 'filename0', 'filename1', 'source0', 'source1', 'percent', 'percent0', 'percent1', 'lines', 'plagiarized'],
    num_rows: 21190
})


In [None]:
# load model and tokenizer and define length of the text sequence
device = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")
model = BertForSequenceClassification.from_pretrained("bert-base-chinese",
#                gradient_checkpointing=False,
                num_labels = 2,
                cache_dir='data',
                return_dict=True).to(device)

# Only the fast tokenizer is used; an earlier AutoTokenizer load was immediately overwritten by it.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")

def tokenization(batched_text):
    # Encode each (source0, source1) pair as one sequence, padded or truncated to 512 tokens.
    return tokenizer(batched_text['source0'], batched_text['source1'], padding='max_length', truncation=True, max_length=512)



Some weights of the model checkpoint at bert-base-chinese were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at
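
The `tokenization` function passes `source0` and `source1` together, so each example is encoded as a single sentence pair. The snippet below is a rough pure-Python sketch of that layout, not the HuggingFace implementation. `encode_pair` and the toy tokens are hypothetical, and the tie-breaking rule when both sides are equally long is an assumption.

```python
def encode_pair(tokens_a, tokens_b, max_length):
    """Lay out a pair as [CLS] A [SEP] B [SEP], trimming the longer side first, then padding."""
    a, b = list(tokens_a), list(tokens_b)
    budget = max_length - 3  # room left after [CLS] and the two [SEP] tokens
    while len(a) + len(b) > budget:
        # Drop one token from the end of whichever side is currently longer
        # (assumption: ties are resolved by trimming the second sequence).
        if len(a) > len(b):
            a.pop()
        else:
            b.pop()
    tokens = ["[CLS]"] + a + ["[SEP]"] + b + ["[SEP]"]
    token_type_ids = [0] * (len(a) + 2) + [1] * (len(b) + 1)
    pad = max_length - len(tokens)
    attention_mask = [1] * len(tokens) + [0] * pad
    tokens = tokens + ["[PAD]"] * pad
    token_type_ids = token_type_ids + [0] * pad
    return tokens, token_type_ids, attention_mask


# A long first file and a short second file, squeezed into 10 positions.
long_tokens, long_types, long_mask = encode_pair(
    ["int", "main", "(", ")", "{", "}"], ["def", "f", ":"], max_length=10
)
# Two short files, padded out to 8 positions.
short_tokens, short_types, short_mask = encode_pair(["x", "=1"], ["y"], max_length=8)
print(long_tokens)
print(short_tokens, short_mask)
```

With `max_length = 512`, at most 509 positions are left for the two source files combined, so long submissions are cut off and the classifier only sees the beginning of each file.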


In [None]:
train_data = train_data.map(tokenization, batched = True, batch_size = 128)
valid_data = valid_data.map(tokenization, batched = True, batch_size = 128)
test_data = test_data.map(tokenization, batched = True, batch_size = 128)

In [None]:
# Copy the `plagiarized` column into `label`, the column the Trainer passes to the model as labels.
train_data = train_data.map(lambda examples: {'label': examples['plagiarized']}, batched=True)
valid_data = valid_data.map(lambda examples: {'label': examples['plagiarized']}, batched=True)
test_data = test_data.map(lambda examples: {'label': examples['plagiarized']}, batched=True)
# Expose only the model inputs and the label as PyTorch tensors.
train_data.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])
valid_data.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])
test_data.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])

In [None]:
# define accuracy metrics
from sklearn.metrics import precision_recall_fscore_support, accuracy_score
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }

In [None]:
# define the training arguments
training_args = TrainingArguments(
    output_dir = 'results',
    num_train_epochs = 4,
    per_device_train_batch_size = 2,
    gradient_accumulation_steps = 32,    
    per_device_eval_batch_size= 16,
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    disable_tqdm = False, 
    load_best_model_at_end=True,
    metric_for_best_model='eval_f1',
    greater_is_better=True,
    warmup_steps=160,
    weight_decay=0.01,
    logging_steps = 4,
    learning_rate = 1e-5,
    fp16 = True,
    logging_dir='logs',
    dataloader_num_workers = 0,
#    run_name = 'bigbird_classification_1e5'
)
# instantiate the trainer class and check for available devices
trainer = Trainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_data,
    eval_dataset=valid_data
)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
device


Using amp half precision backend


'cuda'

In [None]:
# see how the basic model would perform
trainer.evaluate()

The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: source0, source1, filename0, plagiarized, filename1, lines, percent0, percent, percent1. If source0, source1, filename0, plagiarized, filename1, lines, percent0, percent, percent1 are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 5308
  Batch size = 16


  _warn_prf(average, modifier, msg_start, len(result))


{'eval_accuracy': 0.5,
 'eval_f1': 0.0,
 'eval_loss': 0.7628493309020996,
 'eval_precision': 0.0,
 'eval_recall': 0.0,
 'eval_runtime': 23.9638,
 'eval_samples_per_second': 221.501,
 'eval_steps_per_second': 13.854}
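
These baseline numbers are what a classifier that always predicts "not plagiarized" would score on the balanced validation set (2,654 pairs per class): accuracy 0.5 with recall 0.0 leaves no room for any positive prediction. The sketch below reproduces that case in pure Python. `binary_metrics` is a hypothetical helper, and returning 0.0 when a denominator is zero is an assumption chosen to match the zeros reported above.

```python
def binary_metrics(labels, preds):
    """Accuracy, precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    # Undefined ratios (no predicted or no actual positives) are reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / len(labels)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Balanced validation set from this notebook, with every pair predicted as class 0.
labels = [0] * 2654 + [1] * 2654
always_negative = [0] * len(labels)
baseline = binary_metrics(labels, always_negative)
print(baseline)
```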

In [None]:
!nvidia-smi -L 

GPU 0: Tesla V100-SXM2-16GB (UUID: GPU-06e097e7-5423-698b-40d4-2efdbb168733)


In [None]:
torch.cuda.empty_cache()
import gc
gc.collect()

298

In [None]:
# train the model
trainer.train()

The following columns in the training set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: source0, source1, filename0, plagiarized, filename1, lines, percent0, percent, percent1. If source0, source1, filename0, plagiarized, filename1, lines, percent0, percent, percent1 are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 21190
  Num Epochs = 4
  Instantaneous batch size per device = 2
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 32
  Total optimization steps = 1324


| Epoch | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
|---|---|---|---|---|---|---|
| 0 | 0.3298 | 0.528028 | 0.763564 | 0.792733 | 0.705675 | 0.904295 |
| 1 | 0.1915 | 0.523032 | 0.813489 | 0.826741 | 0.771895 | 0.889977 |
| 2 | 0.1342 | 0.525742 | 0.82159 | 0.812438 | 0.856367 | 0.772796 |
| 3 | 0.0857 | 0.59335 | 0.801055 | 0.784753 | 0.854796 | 0.72532 |


The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: source0, source1, filename0, plagiarized, filename1, lines, percent0, percent, percent1. If source0, source1, filename0, plagiarized, filename1, lines, percent0, percent, percent1 are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 5308
  Batch size = 16


Saving model checkpoint to results/checkpoint-331
Configuration saved in results/checkpoint-331/config.json
Model weights saved in results/checkpoint-331/pytorch_model.bin
The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: source0, source1, filename0, plagiarized, filename1, lines, percent0, percent, percent1. If source0, source1, filename0, plagiarized, filename1, lines, percent0, percent, percent1 are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 5308
  Batch size = 16
Saving model checkpoint to results/checkpoint-662
Configuration saved in results/checkpoint-662/config.json
Model weights saved in results/checkpoint-662/pytorch_model.bin
The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: source0, sour

TrainOutput(global_step=1324, training_loss=0.2539435678423351, metrics={'train_runtime': 2252.2858, 'train_samples_per_second': 37.633, 'train_steps_per_second': 0.588, 'total_flos': 2.229971438598144e+16, 'train_loss': 0.2539435678423351, 'epoch': 4.0})
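
The step counts in the training log follow from the batch settings. The arithmetic below is a sketch that reproduces the logged values (total batch size 64, a checkpoint every 331 steps, 1,324 optimization steps). It assumes a single GPU and that the number of updates per epoch is rounded down, which matches this log but may differ in other `transformers` versions.

```python
import math

num_examples = 21190          # oversampled training set
per_device_batch_size = 2
gradient_accumulation_steps = 32
num_epochs = 4
warmup_steps = 160
num_devices = 1               # assumption: one GPU, as reported by nvidia-smi above

# Gradients from 32 mini-batches of 2 examples are accumulated before each update.
effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_devices

batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
# Assumption: leftover mini-batches that do not fill an accumulation window are not counted.
updates_per_epoch = max(batches_per_epoch // gradient_accumulation_steps, 1)
total_steps = updates_per_epoch * num_epochs
warmup_fraction = warmup_steps / total_steps

print(effective_batch_size, updates_per_epoch, total_steps, round(warmup_fraction, 3))
```

Under these assumptions the learning rate warms up over roughly the first 12% of training.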

In [None]:
# Evaluate the results
trainer.evaluate()

The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: source0, source1, filename0, plagiarized, filename1, lines, percent0, percent, percent1. If source0, source1, filename0, plagiarized, filename1, lines, percent0, percent, percent1 are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 5308
  Batch size = 16


{'epoch': 4.0,
 'eval_accuracy': 0.8134890730972117,
 'eval_f1': 0.8267413370668535,
 'eval_loss': 0.5230316519737244,
 'eval_precision': 0.7718954248366013,
 'eval_recall': 0.8899773926149209,
 'eval_runtime': 23.9393,
 'eval_samples_per_second': 221.727,
 'eval_steps_per_second': 13.868}
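
These metrics match the second row (epoch index 1) of the training table rather than the final epoch. That is consistent with `load_best_model_at_end=True` and `metric_for_best_model='eval_f1'`. The sketch below shows the selection rule on the values from the table. `epoch_metrics` and `best` are hypothetical names, and the mapping from row to checkpoint assumes one checkpoint every 331 steps, as in the log.

```python
# Validation metrics copied from the per-epoch training table.
epoch_metrics = [
    {"epoch": 0, "eval_loss": 0.528028, "eval_f1": 0.792733},
    {"epoch": 1, "eval_loss": 0.523032, "eval_f1": 0.826741},
    {"epoch": 2, "eval_loss": 0.525742, "eval_f1": 0.812438},
    {"epoch": 3, "eval_loss": 0.593350, "eval_f1": 0.784753},
]

# greater_is_better=True: keep the checkpoint with the highest validation F1.
best = max(epoch_metrics, key=lambda row: row["eval_f1"])

# Assumption: checkpoints are saved every 331 optimization steps (one per epoch).
steps_per_epoch = 331
best_checkpoint = f"results/checkpoint-{(best['epoch'] + 1) * steps_per_epoch}"
print(best, best_checkpoint)
```

Validation loss rises again in the last epoch while training loss keeps falling, so selecting on validation F1 also guards against the overfitting visible in the later rows.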

In [None]:
results = trainer.predict(test_data)
pprint.pprint(results.metrics)

The following columns in the test set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: source0, source1, filename0, plagiarized, filename1, lines, percent0, percent, percent1. If source0, source1, filename0, plagiarized, filename1, lines, percent0, percent, percent1 are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 6588
  Batch size = 16


{'test_accuracy': 0.7676077717061324,
 'test_f1': 0.7773738548785808,
 'test_loss': 0.6332941055297852,
 'test_precision': 0.7460228858498466,
 'test_recall': 0.8114754098360656,
 'test_runtime': 30.3321,
 'test_samples_per_second': 217.195,
 'test_steps_per_second': 13.583}
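
The test metrics can be unpacked into a confusion matrix, which makes the error counts easier to read. The sketch below derives it from the reported precision and recall and the balanced test set (3,294 pairs per class), then recomputes accuracy and F1 as a consistency check. The variable names are hypothetical, and the counts are inferred from the metrics rather than taken from the notebook's output.

```python
num_positive = 3294   # plagiarized pairs in the oversampled test set
num_negative = 3294   # non-plagiarized pairs
precision = 0.7460228858498466
recall = 0.8114754098360656

# recall = TP / (TP + FN), with TP + FN equal to the number of positive pairs.
tp = round(recall * num_positive)
fn = num_positive - tp
# precision = TP / (TP + FP), so TP / precision is the number of predicted positives.
fp = round(tp / precision) - tp
tn = num_negative - fp

accuracy = (tp + tn) / (num_positive + num_negative)
f1 = 2 * tp / (2 * tp + fp + fn)
print({"tp": tp, "fp": fp, "fn": fn, "tn": tn})
print(round(accuracy, 6), round(f1, 6))
```

Since the test set was oversampled, many of the positive rows are duplicates of the same plagiarized pairs, so these counts describe the balanced set rather than the original class distribution.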
