Thanks to Tifo: https://www.kaggle.com/competitions/us-patent-phrase-to-phrase-matching/discussion/324216

pppm_abstract: https://www.kaggle.com/datasets/fankaixie/pppm-abstract

## Import Library

In [None]:
#!/usr/bin/env python
# -*- coding:utf-8 -*-
# author:quincy qiang

import os
import random
import warnings
import pandas as pd

import numpy as np
import torch
from transformers import (AutoModelForMaskedLM,
                          AutoTokenizer, LineByLineTextDataset,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

warnings.filterwarnings('ignore')


def seed_torch(seed=42):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True


seed_torch(42)



## Generate Pretraining Corpus

In [None]:
df=pd.read_csv('../input/pppm-abstract/pppm_abstract.csv')

In [None]:
df=df.dropna().reset_index(drop=True)
df

In [None]:
with open('corpus.txt','w',encoding='utf-8') as f:
    for ab in df['abstract']:
        f.write(ab+'\n')
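
Because the corpus is later read line by line, an abstract that contains an embedded line break would be split into several training examples. The sketch below shows one way to guard against that by collapsing whitespace before writing. The helper names `to_single_line` and `write_corpus` are hypothetical and are not part of the original notebook.

```python
def to_single_line(text):
    """Collapse all runs of whitespace (including newlines) into single spaces."""
    return " ".join(str(text).split())


def write_corpus(abstracts, path):
    """Write one normalized abstract per line, skipping empty ones.

    Returns the number of lines written.
    """
    written = 0
    with open(path, "w", encoding="utf-8") as f:
        for abstract in abstracts:
            line = to_single_line(abstract)
            if line:  # skip abstracts that are empty after normalization
                f.write(line + "\n")
                written += 1
    return written
```

With the DataFrame from the cell above, this could be called as `write_corpus(df['abstract'], 'corpus.txt')`.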

## Training Model

In [None]:


model_name = 'microsoft/deberta-v3-large'

model = AutoModelForMaskedLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.save_pretrained('pretrained_models/microsoft/deberta-v3-large')

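train_dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="./corpus.txt",  # training text file, one example per line
    block_size=256)

# Note: this reuses the training corpus, so the reported validation loss is
# measured on text the model has already seen. Point file_path at a held-out
# file for an independent estimate.
valid_dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="./corpus.txt",  # validation text file
    block_size=256)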

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

training_args = TrainingArguments(
    output_dir="pretrained_models/microsoft/deberta-v3-large-pretrain",  # select model path for checkpoint
    overwrite_output_dir=True,
    num_train_epochs=8,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=2,
    evaluation_strategy='steps',
    save_total_limit=2,
    eval_steps=5000,
    metric_for_best_model='eval_loss',
    greater_is_better=False,
    load_best_model_at_end=False,
    prediction_loss_only=True,
    report_to="none"
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset)

trainer.train()
# Save the final weights next to the tokenizer files written earlier
trainer.save_model('pretrained_models/microsoft/deberta-v3-large')


| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 5000 | 1.816200 | 1.677946 |
| 10000 | 1.521900 | 1.444387 |
| 15000 | 1.398000 | 1.323548 |
| 20000 | 1.313400 | 1.239668 |
| 25000 | 1.229300 | 1.177530 |
| 30000 | 1.168100 | 1.124718 |
| 35000 | 1.162000 | 1.083667 |
| 40000 | 1.101400 | 1.045740 |
| 45000 | 1.081000 | 1.023710 |
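
To relate the step numbers in the log to the training configuration, the sketch below estimates how many optimizer steps a run takes. It assumes that one logged step corresponds to one optimizer update, that is, `per_device_train_batch_size * gradient_accumulation_steps` examples per device. The function names are hypothetical, the exact rounding may differ between `transformers` versions, and the corpus size in the example is illustrative only, not the real size of `corpus.txt`.

```python
import math


def effective_batch_size(per_device_batch_size, grad_accum_steps, num_devices=1):
    """Number of examples consumed per optimizer update."""
    return per_device_batch_size * grad_accum_steps * num_devices


def estimate_total_steps(num_examples, epochs, per_device_batch_size,
                         grad_accum_steps, num_devices=1):
    """Rough estimate of the total number of optimizer updates in a run."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
    # Several batches are accumulated into a single optimizer update.
    steps_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return steps_per_epoch * epochs


# Settings from the TrainingArguments above, with a made-up corpus of 10,000 lines.
print(effective_batch_size(8, 2))             # 16
print(estimate_total_steps(10_000, 8, 8, 2))  # 5000
```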

Pre-training takes a long time, about 15 hours, because I set the number of epochs to 10 (note that the code above shows `num_train_epochs=8`). The validation data can be reduced to shorten the time.
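
One way to reduce the validation data is to hold out a small random sample of corpus lines in a separate file. This is a minimal sketch, not the setup used for the run above: the function name `split_corpus`, the output file names, and the default `valid_fraction=0.02` are all assumptions chosen for illustration.

```python
import random


def split_corpus(src_path, train_path, valid_path, valid_fraction=0.02, seed=42):
    """Shuffle the non-empty lines of src_path and write a train/valid split.

    Returns (number of training lines, number of validation lines).
    """
    with open(src_path, encoding="utf-8") as f:
        lines = [line for line in f.read().splitlines() if line.strip()]

    rng = random.Random(seed)  # local RNG so the global seed is left untouched
    rng.shuffle(lines)

    n_valid = max(1, int(len(lines) * valid_fraction))
    valid, train = lines[:n_valid], lines[n_valid:]

    for path, subset in ((train_path, train), (valid_path, valid)):
        with open(path, "w", encoding="utf-8") as f:
            f.write("\n".join(subset) + "\n")
    return len(train), len(valid)
```

The two output files could then be passed as `file_path` to the training and validation `LineByLineTextDataset` instances, for example after calling `split_corpus('corpus.txt', 'corpus_train.txt', 'corpus_valid.txt')`. A smaller validation file makes each evaluation pass faster and keeps validation text out of the training set.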

The pretrained model is available here: https://www.kaggle.com/datasets/quincyqiang/deberta-v3-large-pretrain

## Conclusions

- Fine-tuning from this pre-trained checkpoint lowers the offline CV score: 0.8321 -> 0.8081.
- Other people in the discussion forum report the same situation. Can someone explain? See [The key idea?](https://www.kaggle.com/competitions/us-patent-phrase-to-phrase-matching/discussion/324389)