# U.S. Patent Phrase to Phrase Matching 
Help Identify Similar Phrases in U.S. Patents  

[U.S. Patent Phrase to Phrase Matching | Kaggle](https://www.kaggle.com/competitions/us-patent-phrase-to-phrase-matching)

### Introduction to the competition 
In this competition, the model will be trained on a new **semantic similarity dataset** to extract relevant information by matching **key phrases in patent documents**. During patent search and review, determining semantic similarity between phrases is crucial to determining whether an invention has been previously described. 
+ Type of competition: Deep Learning / Natural Language Processing. Recommended models: BERT, DeBERTa, ELECTRA.
+ Problem data: Contestants are presented with pairs of phrases (an anchor and a target) and asked to rate how similar they are on a scale from 0 to 1. The officially provided training set has about 36,000 pairs of phrases, and the test set has about 12,000 pairs.
+ Evaluation criteria: **Pearson correlation coefficient**. [U.S. Patent Phrase to Phrase Matching | Kaggle](https://www.kaggle.com/competitions/us-patent-phrase-to-phrase-matching/overview/evaluation) 
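
As a reminder of what the metric measures, here is a minimal NumPy sketch of the Pearson correlation coefficient. The function name `pearson` and the sample values are illustrative and not taken from the competition code. The last line checks that rescaling the predictions leaves the score unchanged, so predictions do not strictly need to lie in [0, 1].

```python
import numpy as np

def pearson(y_true, y_pred):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    dt = y_true - y_true.mean()   # centered targets
    dp = y_pred - y_pred.mean()   # centered predictions
    return float((dt * dp).sum() / np.sqrt((dt ** 2).sum() * (dp ** 2).sum()))

# Made-up scores for five phrase pairs.
y_true = [0.0, 0.25, 0.5, 0.75, 1.0]
y_pred = [0.1, 0.2, 0.6, 0.7, 0.9]

r = pearson(y_true, y_pred)
# A positive affine transform of the predictions gives the same correlation.
r_scaled = pearson(y_true, [3.0 * p - 1.0 for p in y_pred])
```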

### Preparation done before the competition  
+ Deep learning course: [Neural Networks and Deep Learning | Coursera](https://www.coursera.org/learn/neural-networks-deep-learning)  
  My certificate for the course: [Neural Networks and Deep Learning | Coursera](https://www.coursera.org/account/accomplishments/verify/RWVTZ62KDKR5) 
+ A Kaggle notebook on EDA (Exploratory Data Analysis): [EDA | Kaggle](https://www.kaggle.com/code/remekkinas/eda-and-feature-engineering)
 
### Data description 
In this dataset, you are presented with pairs of phrases (an anchor and a target) and asked to rate how similar they are on a scale from 0 (not similar at all) to 1 (identical in meaning). This task differs from the standard semantic similarity task in that similarity here is scored in the context of the patent, specifically its **CPC classification**, which indicates the subject matter covered by the patent. For example, while the phrases "bird" and "Cape Cod" might have low semantic similarity in normal language, their meanings are much closer when considered in the context of "house". 

The official data description is here: [U.S. Patent Phrase to Phrase Matching | Kaggle](https://www.kaggle.com/competitions/us-patent-phrase-to-phrase-matching/data) 

+ train.csv: the training set, containing phrases, contexts and their similarity scores.
+ test.csv: same structure as train.csv, without scores.
+ sample_submission.csv: a sample submission file in the correct format.
+ train.csv/test.csv fields: 
     - id: a unique identifier for a pair of phrases.
     - anchor: the first phrase.
     - target: the second phrase.
     - context: the **CPC classification** (version 2021.05), which indicates the subject within which similarity is to be scored.
     - score: the similarity, from 0 to 1 in steps of 0.25:
          * 1.0 - Very close match. This is typically an exact match except possibly for differences in conjugation, quantity (e.g. singular vs. plural), and addition or removal of stopwords (e.g. “the”, “and”, “or”).
          * 0.75 - Close synonym, e.g. “mobile phone” vs. “cellphone”. This also includes abbreviations, e.g. "TCP" -> "transmission control protocol".
          * 0.5 - Synonyms which don’t have the same meaning (same function, same properties). This includes broad-narrow (hyponym) and narrow-broad (hypernym) matches.
          * 0.25 - Somewhat related, e.g. the two phrases are in the same high level domain but are not synonyms. This also includes antonyms.
          * 0.0 - Unrelated. 

### Solution idea 

#### Data processing 
+ We first look up the title of each CPC code in the external **CPC** file and use it as the title text. We then **groupby** anchor and context to collect the list of targets that share them (gp_targets). The training text is built as "anchor [SEP] target [SEP] title [SEP] gp_targets", where gp_targets excludes the row's own target. 
+ The data is split into training and validation folds with a **GroupKFold**-style split grouped on the anchor column, so that no anchor appears in both the training and validation folds. 
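
The text construction described above can be sketched in plain Python. The actual code further down uses pandas; the rows, the `titles` lookup and the helper `build_text` here are invented purely for illustration.

```python
from collections import defaultdict

# Toy rows; the values are made up for illustration only.
rows = [
    {"anchor": "abatement", "target": "abatement of pollution", "context": "A47"},
    {"anchor": "abatement", "target": "act of abating", "context": "A47"},
    {"anchor": "abatement", "target": "forest region", "context": "A47"},
]
# Hypothetical, shortened CPC title lookup (context code -> title text).
titles = {"A47": "human necessities. furniture"}

# Collect all targets that share the same (anchor, context) pair.
groups = defaultdict(list)
for row in rows:
    groups[(row["anchor"], row["context"])].append(row["target"])

def build_text(row, sep="[SEP]"):
    """Join anchor, target, CPC title and the sibling targets of the same group."""
    siblings = [t for t in groups[(row["anchor"], row["context"])] if t != row["target"]]
    return sep.join([row["anchor"], row["target"], titles[row["context"]], ", ".join(siblings)])

texts = [build_text(r) for r in rows]
```

Each row's own target is left out of its sibling list, so the model sees the other targets annotated for the same anchor and context without the target being repeated.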

#### Model selection/structure
+ We choose the multi-model fusion of Bert For Patent + DeBERTa + ELECTRA + Funnel-Transformer
+ The model structure uses the above model as backbone, adding a Linear layer and Sigmoid. 

| Model  | seq_length  | CV score | PB score |
| :----: | :----: | :----: | :----: |
| deberta-v3-large | 200 | 0.844 | 0.842 |
| electra-large | 200 | 0.832 | 0.833 |
| funnel-large | 200 | 0.824 | 0.825 |
| bert-for-patents | 200 | 0.824 | 0.824 |
| **ensemble** |  | **0.855** | **0.868** |    
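
The head on top of each backbone is small: masked mean pooling over the token vectors, one linear layer, and a sigmoid. The sketch below reproduces that computation in NumPy with random stand-in values, assuming a hidden size of 3 for readability. The real model uses PyTorch and the backbone's `last_hidden_state`; the helper names here are illustrative.

```python
import numpy as np

def masked_mean_pool(hidden, mask):
    """Average token vectors over real tokens only (mask == 1), ignoring padding."""
    mask = mask[..., None].astype(hidden.dtype)       # (batch, seq, 1)
    return (hidden * mask).sum(axis=1) / mask.sum(axis=1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
hidden = rng.normal(size=(2, 4, 3))                   # stand-in for last_hidden_state
mask = np.array([[1, 1, 0, 0], [1, 1, 1, 1]])         # first sequence has two padding tokens
w, b = rng.normal(size=(3,)) * 0.02, 0.0              # stand-in for the Linear layer

pooled = masked_mean_pool(hidden, mask)               # (batch, hidden)
scores = sigmoid(pooled @ w + b)                      # (batch,), each strictly between 0 and 1
```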

#### Datasets
+ Official datasets (train.csv, test.csv) [U.S. Patent Phrase to Phrase Matching | Kaggle](https://www.kaggle.com/competitions/us-patent-phrase-to-phrase-matching/data)
+ [CPC data](https://www.kaggle.com/datasets/yasufuminakama/cpc-data)
+ [deberta](https://www.kaggle.com/datasets/xhlulu/deberta)
+ [deberta-v3-base](https://www.kaggle.com/datasets/jonathanchan/deberta-v3-base)
+ [deberta-v3-large](https://www.kaggle.com/datasets/jonathanchan/deberta-v3-large)
+ [electra](https://www.kaggle.com/datasets/xhlulu/electra)
+ [funnel-large](https://www.kaggle.com/datasets/goldenlock/funnel-large)
+ [bert-for-patent](https://www.kaggle.com/datasets/ksork6s4/bert-for-patents) 

#### Other useful methods 
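+ Loss function: MSELoss (the implementation in the code below takes the square root, so it is effectively RMSE)
+ Optimizer: AdamW
+ Scheduler: CosineAnnealingWarmRestarts
+ FGM (Fast Gradient Method) adversarial training  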
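
To make the scheduler setting concrete, the following sketch evaluates the cosine annealing formula at integer epochs. It uses the values from the training configuration below (learning rate 1e-5, minimum of one fifth of that, restart period of 5 epochs). The function name is hypothetical and the formula assumes the restart period stays constant.

```python
import math

def cosine_warm_restart_lr(epoch, base_lr=1e-5, eta_min=2e-6, t_0=5):
    """Learning rate at an integer epoch for cosine annealing that restarts every t_0 epochs."""
    t_cur = epoch % t_0  # epochs elapsed since the last restart
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * t_cur / t_0))

# Epochs 0..4 decay from base_lr towards eta_min; epoch 5 would restart at base_lr.
schedule = [cosine_warm_restart_lr(e) for e in range(6)]
```

With only 5 training epochs and a period of 5, the run covers a single cosine cycle and never actually reaches a restart.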

### Code

In [None]:
import gc
import os
import random
import re
import time
import warnings

import numpy as np
import pandas as pd
import scipy.stats
import torch
import torch.nn as nn
from torch.optim import AdamW  # transformers.AdamW is deprecated; the torch version accepts the same arguments used below
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts
from torch.utils.data import DataLoader, Dataset
from transformers import AutoConfig, AutoModel, AutoTokenizer

class CFG:
    result_dir = '/home/huaxuechun/workspace/pppm' # result dir
    data_dir = '/home/huaxuechun/workspace/us-patent-phrase-to-phrase-matching' # data dir
    k_folds = 5 # k folds
    n_jobs = 5 # n_jobs
    seed = 42 # random seed
    device = torch.cuda.is_available() # use cuda
    print_freq = 100 # print frequency
    
    model_name = 'bert-for-patents' # model name  # electra-large / deberta-v3-large / funnel-large / bert-for-patents
    base_epoch = 5 # epoch
    batch_size = 32 # batch size
    lr = 1e-5 # learning rate
    seq_length = 200 # sequence length
    max_grad_norm = 1 # gradient clipping
    

def seed_everything(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False

seed_everything(CFG.seed)


class KFold(object):
    """
    Group split by group_col
    """
    def __init__(self, k_folds=10, flag_name='fold_flag'):
        self.k_folds = k_folds # k folds
        self.flag_name = flag_name # fold_flag

    def group_split(self, train_df, group_col): 
        group_value = list(set(train_df[group_col])) # group value
        group_value.sort() # sort
        fold_flag = [i % self.k_folds for i in range(len(group_value))] # fold_flag
        np.random.shuffle(fold_flag) # shuffle
        train_df = train_df.merge(pd.DataFrame({group_col: group_value, self.flag_name: fold_flag}), how='left', on=group_col) # merge
        return train_df

def get_data():
    train_df = pd.read_csv(CFG.data_dir + '/train.csv') # train data
    train_df = KFold(CFG.k_folds).group_split(train_df, group_col='anchor') # kfold group split
    titles = get_cpc_texts() # cpc texts
    train_df = get_text(train_df, titles) # # train data get text
    test_df = pd.read_csv(CFG.data_dir + '/test.csv') # test data
    test_df['score'], test_df['fold_flag'] = 0, -1 # test fill score and fold_flag
    test_df = get_text(test_df, titles) # # test data get text
    print(train_df.shape, test_df.shape) # print shape
    return train_df, test_df # return train and test data

def get_text(df, titles):
    df['anchor'] = df['anchor'].apply(lambda x:x.lower()) # anchor lower
    df['target'] = df['target'].apply(lambda x:x.lower()) # target lower
    # title
    df['title'] = df['context'].map(titles)
    df['title'] = df['title'].apply(lambda x:x.lower().replace(';', '').replace('  ',' ').strip())

    df = df.join(df.groupby(['anchor', 'context']).target.agg(list).rename('gp_targets'), on=['anchor', 'context']) # group by anchor and context and get target_list
    df['gp_targets'] = df.apply(lambda x: ', '.join([i for i in x['gp_targets'] if i != x['target']]), axis=1) # get gp_targets
    df['text'] = df['anchor'] + '[SEP]' + df['target'] + '[SEP]'  + df['title'] + '[SEP]'  + df['gp_targets'] # anchor [SEP] target [SEP] title [SEP] gp_targets

    return df


def get_cpc_texts():
    '''
    get cpc texts
    '''
    # get cpc codes
    contexts = []  
    pattern = r'[A-Z]\d+'  # raw string so \d is not treated as a string escape
    for file_name in os.listdir(f'{CFG.data_dir}/cpc-data/CPCSchemeXML202105'):
        result = re.findall(pattern, file_name)
        if result:
            contexts.append(result)
    contexts = sorted(set(sum(contexts, []))) # all unique cpc codes
    # like ['A01', 'A21', 'A22', 'A23', 'A24', 'A41', 'A42', 'A43', 'A44', 'A45']
    
    results = {}
    for cpc in ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'Y']:
        with open(f'{CFG.data_dir}/cpc-data/CPCTitleList202202/cpc-section-{cpc}_20220201.txt') as f:
            s = f.read()
        # Section-level entry and its text, e.g. "A\t\tHUMAN NECESSITIES"
        pattern = f'{cpc}\t\t.+' 
        result = re.findall(pattern, s)
        pattern = "^"+pattern[:-2]
        cpc_result = re.sub(pattern, "", result[0]) # keep only the description, e.g. 'HUMAN NECESSITIES'

        for context in [c for c in contexts if c[0] == cpc]:
            pattern = f'{context}\t\t.+'
            result = re.findall(pattern, s) # CPC code and its text, e.g. 'A01\t\tAGRICULTURE; FORESTRY; ANIMAL HUSBANDRY; HUNTING; TRAPPING; FISHING'
            pattern = "^"+pattern[:-2]
            results[context] = cpc_result + ". " + re.sub(pattern, "", result[0]) # build the dict, e.g. {'A01': 'HUMAN NECESSITIES. AGRICULTURE; FORESTRY; ANIMAL HUSBANDRY; HUNTING; TRAPPING; FISHING'}
    return results

#### Dataset

In [None]:
class PatentDataset(Dataset):
    def __init__(self, meta_data: pd.DataFrame, tokenizer, fold: int = -1, mode='train'):
        self.meta_data = meta_data.copy() # meta_data
        self.meta_data.reset_index(drop=True, inplace=True) # reset index
        if mode == 'train':
            self.meta_data = self.meta_data[self.meta_data['fold_flag'] != fold].copy() # train data
        elif mode == 'valid':
            self.meta_data = self.meta_data[self.meta_data['fold_flag'] == fold].copy() # valid data
        elif mode == 'test':
            pass
        else:
            raise ValueError(mode)
        self.meta_data.reset_index(drop=True, inplace=True) # reset index
        self.seq_length = CFG.seq_length # seq_length
        if tokenizer.sep_token != '[SEP]': 
            self.meta_data['text'] = self.meta_data['text'].apply(lambda x:x.replace('[SEP]', tokenizer.sep_token )) # replace [SEP] to tokenizer.sep_token
        self.text = self.meta_data['text'].values # text
        self.target = self.meta_data['score'].values # target
        self.mode = mode
        self.tokenizer = tokenizer

    def __getitem__(self, index):
        seq = self.text[index] # seq
        target = self.target[index] # target
        encoded = self.tokenizer.encode_plus(
            text=seq, # text
            add_special_tokens=True, # add_special_tokens 
            max_length=self.seq_length, # max_length
            padding='max_length', # padding
            return_attention_mask=True, # return_attention_mask
            return_tensors='pt', # return_tensors
            truncation=True # truncation
        )
        input_ids = encoded['input_ids'][0] # input_ids
        attention_mask = encoded['attention_mask'][0] # attention_mask

        return input_ids, attention_mask, np.array(target, dtype=np.float32) 

    def __len__(self):
        return len(self.meta_data) # len

#### Models

In [None]:
class PatentModel(nn.Module):
    def __init__(self, name, num_classes=1, pretrained=True):
        super(PatentModel, self).__init__()
        self.config = AutoConfig.from_pretrained(name) # config
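        # NOTE: these two assignments set attributes on the module itself, so they do not
        # change the encoder's dropout. To disable dropout, the values would have to be set
        # on self.config before the encoder is built; config field names vary by architecture.
        self.attention_probs_dropout_prob=0. # attention_probs_dropout_prob
        self.hidden_dropout_prob=0. # hidden_dropout_prob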
        if pretrained:
            self.encoder = AutoModel.from_pretrained(name, config=self.config) 
        else:
            self.encoder = AutoModel.from_config(self.config)
        in_dim = self.encoder.config.hidden_size # get hidden_size
        self.last_fc = nn.Linear(in_dim, num_classes) # last_fc
        torch.nn.init.normal_(self.last_fc.weight, std=0.02) # init last_fc
        self.sig = nn.Sigmoid() # Sigmoid

    def forward(self, seq, seq_mask):
        x = self.encoder(seq, attention_mask=seq_mask)["last_hidden_state"] # forward                       # torch.Size([32, 200, 1024])
        x = torch.sum(x * seq_mask.unsqueeze(-1), dim=1) / torch.sum(seq_mask, dim=1).unsqueeze(-1) # mean  # torch.Size([32, 1024])
        out = self.last_fc(x) # last_fc                                                                     # torch.Size([32, 1])
        out = self.sig(out) # Sigmoid                                                                       # torch.Size([32, 1])
        out = out.squeeze(-1)  # torch.Size([32]); squeezing only the last dim keeps a batch of size 1 as shape [1]
        return out

#### Utils

In [None]:
def get_sorted_test_df(df, tokenizer, batch_size):
    # input ids lengths list 
    input_lengths = [] 
    for text in df['text'].fillna("").values:
        length = len(tokenizer(text, add_special_tokens=True)['input_ids'])
        input_lengths.append(length)
    df['input_lengths'] = input_lengths
    length_sorted_idx = np.argsort([-l for l in input_lengths])

    # sort dataframe by lengths
    sort_df = df.iloc[length_sorted_idx].copy() # copy so the column assignment below does not write to a view
    # calc max_len per batch
    sorted_input_length = sort_df['input_lengths'].values
    batch_max_length = np.zeros_like(sorted_input_length)
    # step through the batches; range(0, n, batch_size) avoids an empty final slice when n is a multiple of batch_size
    for start in range(0, len(sorted_input_length), batch_size):
        batch_max_length[start:start+batch_size] = np.max(sorted_input_length[start:start+batch_size]) # max input length in every batch
    sort_df['batch_max_length'] = batch_max_length
    return sort_df, length_sorted_idx
    

class AverageMeter(object):
    """Computes and stores the average and current value"""

    def __init__(self):
        self.reset() # reset

    def reset(self):
        self.val = 0. 
        self.avg = 0.
        self.sum = 0.
        self.count = 0

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count

def save_model(model, save_path, model_name):
    '''
    save model
    '''
    if not os.path.exists(save_path):
        os.makedirs(save_path)
    filename = os.path.join(save_path, model_name + '.pth.tar')
    torch.save({'state_dict': model.state_dict(), }, filename)

def worker_init_fn(worker_id):
    """
    Handles PyTorch x Numpy seeding issues.

    Args:
        worker_id (int): Id of the worker.
    """
    np.random.seed(np.random.get_state()[1][0] + worker_id)


class MSELoss(nn.Module):
    '''
    Despite the name, this returns the root of the mean squared error (RMSE).
    '''
    def __init__(self):
        super().__init__()

    def forward(self, inputs, targets):
        loss = (inputs - targets) ** 2
        loss = loss.mean()
        loss = torch.sqrt(loss)
        return loss

def get_score(y_true, y_pred):
    return scipy.stats.pearsonr(y_true, y_pred)[0] # pearsonr

class FGM():
    def __init__(self, model):
        self.model = model
        self.backup = {}

    def attack(self, epsilon=1., emb_name='emb'):
        # emb_name should be set to the name of the embedding parameter in your model
        for name, param in self.model.named_parameters():
            if param.requires_grad and emb_name in name and param.grad is not None:
                # print(name, param)
                self.backup[name] = param.data.clone()
                norm = torch.norm(param.grad)
                if norm != 0 and not torch.isnan(norm):
                    r_at = epsilon * param.grad / max(norm, 0.001)
                    param.data.add_(r_at)

    def restore(self, emb_name='emb'):
        # emb_name should be set to the name of the embedding parameter in your model
        for name, param in self.model.named_parameters():
            if param.requires_grad and emb_name in name and param.grad is not None:
                assert name in self.backup
                param.data = self.backup[name]
        self.backup = {}

def get_model_path(model_name):
    '''
    get model path
    '''
    res = CFG.result_dir
    if model_name in ['electra-base', 'electra-large']:
        res += '/electra/' + model_name.split('-')[1] + '-discriminator'
    elif model_name == 'deberta-v3-large':
        res += '/deberta-v3-large/'
    elif model_name == 'funnel-large':
        res += '/funnel-large/'
    elif model_name == 'bert-for-patents':
        res += '/bert-for-patents/'
    else:
        raise ValueError(model_name)
    return res

#### Train

In [None]:
def train(model, train_loader, criterion, scheduler, optimizer, epoch, is_adversial=False):
    # training
    batch_time = AverageMeter() # batch time
    losses = AverageMeter() # loss
    # switch to train mode
    model.train() # train mode
    fgm = FGM(model) if is_adversial else None # fgm
    start = time.time()
    for i, batch_data in enumerate(train_loader):
        # optimizer.zero_grad()
        if CFG.device:
            batch_data = (t.cuda() for t in batch_data)
        seq, seq_mask, target = batch_data
        # print(seq.shape,seq_mask.shape,target.shape)
        output = model(seq, seq_mask)
        # print(seq.shape,seq_mask.shape,target.shape,output.shape)
        loss = criterion(output, target) # loss
        losses.update(loss.item())
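        loss.backward()
        if is_adversial:
            # adversarial training (FGM)
            fgm.attack()  # add an adversarial perturbation to the embedding weights
            output = model(seq, seq_mask) # forward pass on the perturbed embeddings
            loss_adv = criterion(output, target)  # adversarial loss
            loss_adv.backward()  # backpropagate; the adversarial gradients accumulate on top of the clean ones
            fgm.restore()  # restore the original embedding weights
        if CFG.max_grad_norm > 0:
            torch.nn.utils.clip_grad_norm_(model.parameters(), CFG.max_grad_norm) # gradient clipping
        optimizer.step() # update parameters
        optimizer.zero_grad() # clear gradients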

        batch_time.update(time.time() - start) # update batch time
        start = time.time() # update start time
        if i % CFG.print_freq == 0: 
            print('Epoch: [{0}][{1}/{2}], Loss {loss:.4f}\n'.format(epoch, i, len(train_loader), loss=loss.item())) # print info
    return losses.avg, batch_time.sum 


def validate(model, valid_loader, tokenizer):
    model.eval() # eval mode
    y_pred = [] # y_pred
    for i, batch_data in enumerate(valid_loader):
        if CFG.device:
            batch_data = (t.cuda() for t in batch_data) # cuda
        seq, seq_mask, target = batch_data # batch_data
        output = model(seq, seq_mask) # model output
        y_pred.append(output.detach().cpu().numpy()) # y_pred
    y_pred = np.concatenate(y_pred) # y_pred
    score = get_score(valid_loader.dataset.target, y_pred) # score
    # scoring
    return y_pred, score # y_pred, score


print('------------------------------------------------------Training-------------------------------------------------\n')
warnings.filterwarnings('ignore') # ignore warnings
model_dir = os.path.join(CFG.result_dir, 'models') # model dir
os.makedirs(model_dir, exist_ok=True) # make dir
os.environ["TOKENIZERS_PARALLELISM"] = "false"
print('>> data_processing...\n')

# load data
train_df, test_df = get_data() 

oof_prediction = np.zeros((len(train_df))) # oof prediction
eval_loss = [] # eval loss

for fold in range(CFG.k_folds):
    model = PatentModel(get_model_path(CFG.model_name), pretrained=True) # load model
    model.zero_grad() # zero grad
    model = model.cuda() # cuda
    tokenizer = AutoTokenizer.from_pretrained(get_model_path(CFG.model_name)) # load tokenizer
    train_model_filename = os.path.join(model_dir, CFG.model_name + '_fold{}.pth.tar'.format(fold)) # train model filename

    if fold == 0:
        col_lengths = [] # col lengths
        for text in train_df['text'].fillna("").values: # get col lengths
            length = len(tokenizer(text, add_special_tokens=True)['input_ids'])
            col_lengths.append(length)
        print(f'text max(lengths): {max(col_lengths)} {np.percentile(col_lengths, 95)}')

    train_dataset = PatentDataset(train_df, tokenizer, fold, mode='train') # train dataset
    train_loader = DataLoader(train_dataset, shuffle=True, batch_size=CFG.batch_size, num_workers=CFG.n_jobs, pin_memory=True, worker_init_fn=worker_init_fn) # train loader
    valid_dataset = PatentDataset(train_df, tokenizer, fold, mode='valid') # valid dataset 
    valid_loader = DataLoader(valid_dataset, shuffle=False, batch_size=CFG.batch_size * 4, num_workers=CFG.n_jobs, pin_memory=True) # valid loader
    criterion = MSELoss() # criterion

    best_score = -1
    patience_cnt = 0
    is_improved = True
    optimizer = AdamW(model.parameters(), lr=CFG.lr, betas=(0.9, 0.999), eps=1e-6, weight_decay=0) # optimizer
    scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=CFG.base_epoch, eta_min=CFG.lr / 5) # scheduler
    for epoch in range(CFG.base_epoch):
        scheduler.step(epoch=epoch) # scheduler
        print('Fold: [{0}] Epoch: [{1}], lr:[{2}]\n'.format(fold, epoch, optimizer.param_groups[0]['lr'])) # print Fold, Epoch, lr
        is_adversial = True # enable adversarial training
        train_loss, train_batch_time = train(model, train_loader, criterion, scheduler, optimizer, epoch, is_adversial) # train
        print('Epoch avg loss: {0}, Epoch cost time:{1} min\n'.format(train_loss, train_batch_time / 60)) # print Epoch avg loss, Epoch cost time
        with torch.no_grad(): 
            y_pred, score = validate(model, valid_loader, tokenizer) # validate
            print('Epoch score: {0}\n'.format(score)) # print Epoch score
            if score > best_score:
                best_score, best_epoch = score, epoch # best_score, best_epoch
                oof_prediction[np.where(train_df['fold_flag'] == fold)] = y_pred.copy() # oof prediction
                save_model(model, model_dir, '{}_fold{}_seed{}'.format(CFG.model_name, fold, CFG.seed)) # save model
                print('********Best Epoch: [{0}], Best Score:{1}********\n'.format(best_epoch, best_score)) # print Best Epoch, Best Score
            else:
                is_improved = False # is_improved
                patience_cnt += 1 # patience_cnt

    eval_loss.append(best_score) # eval loss
    del model, y_pred
    _ = gc.collect()
    torch.cuda.empty_cache()
print('CV mean:{} std:{}.'.format(np.mean(eval_loss), np.std(eval_loss))) # CV mean, std
print('detail:{}'.format(np.round(eval_loss, 4))) # detail
np.save(os.path.join(CFG.result_dir, CFG.model_name + '_oof.npy'), oof_prediction) # save oof prediction

#### Inference

In [None]:
import numpy as np
import pandas as pd
import os
import re
import sys
import gc
import time
from transformers import BertTokenizer, RobertaTokenizerFast, AutoTokenizer
import torch
from torch.utils.data import DataLoader

CFG.batch_size = 32 # batch size
CFG.n_jobs = 4 # n_jobs
CFG.seq_length = 512 # seq_length
os.environ["TOKENIZERS_PARALLELISM"] = "false" # TOKENIZERS_PARALLELISM

def predict(model, data_loader):
    # switch to evaluate mode
    model.eval() # model
    y_pred = []
    for i, batch_data in enumerate(data_loader): # iterate over the batches
        batch_data = (t.cuda() for t in batch_data)
        seq, seq_mask, _ = batch_data # seq, seq_mask, target
        outputs = model(seq, seq_mask).detach().cpu().numpy() # outputs
        y_pred.append(outputs)
    y_pred = np.concatenate(y_pred)
    return y_pred

def get_preds(my_df, my_loader, my_model, model_path, model_name=''):
    my_model.load_state_dict(torch.load(model_path)['state_dict']) # load the saved weights
    my_model = my_model.cuda()
    with torch.no_grad():
        y_pred = predict(my_model, my_loader) # get predictions
    return y_pred

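train_df, test_df = get_data() # load the training and test sets
ensemble_weight = [0.2, 0.6, 0.1, 0.1] # weights for res1..res4 below: bert-for-patents, deberta-v3-large, electra-large, funnel-large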
    
print('>> predicting...\n')
start = time.time()
# -------------------- Model 1 --------------------
model_name = 'bert-for-patents' # model_name
tokenizer_path = get_model_path(model_name) # get_model_path
tokenizer = AutoTokenizer.from_pretrained(tokenizer_path) # tokenizer

sort_df, length_sorted_idx = get_sorted_test_df(test_df.copy(), tokenizer, batch_size=CFG.batch_size) # sort_df, length_sorted_idx
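# NOTE: PatentDatasetV2 is not defined in this write-up. PatentDataset(sort_df, tokenizer, mode='test')
# should work as a substitute, padding to CFG.seq_length rather than to batch_max_length.
test_dataset = PatentDatasetV2(sort_df, tokenizer) # test_dataset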
test_loader = DataLoader(test_dataset, shuffle=False, batch_size=CFG.batch_size, num_workers=CFG.n_jobs, drop_last=False, pin_memory=True) # test_loader

res1 = []
folds = range(CFG.k_folds)
for fold in folds:
    model = PatentModel(get_model_path(model_name), pretrained=False) # model
    model_path = '/home/huaxuechun/workspace/pppm/models/{}_fold{}_seed{}.pth.tar'.format(model_name, fold, 42) # model_path
    print(model_path)
    y_preds = get_preds(test_df, test_loader, model, model_path) # y_preds
    y_preds = y_preds[np.argsort(length_sorted_idx)] # y_preds
    res1.append(y_preds)
    del model
    gc.collect()
    torch.cuda.empty_cache()
res1 = np.mean(res1, axis=0)

# -------------------- Model 2 --------------------
model_name = 'deberta-v3-large' # model_name
tokenizer_path = get_model_path(model_name) # get_model_path
tokenizer = AutoTokenizer.from_pretrained(tokenizer_path) # tokenizer

sort_df, length_sorted_idx = get_sorted_test_df(test_df.copy(), tokenizer, batch_size=CFG.batch_size) # sort_df, length_sorted_idx
test_dataset = PatentDatasetV2(sort_df, tokenizer) # test_dataset
test_loader = DataLoader(test_dataset, shuffle=False, batch_size=CFG.batch_size, num_workers=CFG.n_jobs, drop_last=False, pin_memory=True) # test_loader
res2 = []
folds = range(CFG.k_folds)
for fold in folds:
    model = PatentModel(get_model_path(model_name), pretrained=False) # model
    model_path = '/home/huaxuechun/workspace/pppm/models/{}_fold{}_seed{}.pth.tar'.format(model_name, fold, 42) # model_path
    print(model_path)
    y_preds = get_preds(test_df, test_loader, model, model_path) # y_preds
    y_preds = y_preds[np.argsort(length_sorted_idx)] # y_preds
    res2.append(y_preds)
    del model
    gc.collect()
    torch.cuda.empty_cache()
res2 = np.mean(res2, axis=0)

# -------------------- Model 3 --------------------
model_name = 'electra-large' # model_name
tokenizer_path = get_model_path(model_name)# get_model_path
tokenizer = AutoTokenizer.from_pretrained(tokenizer_path) # tokenizer

sort_df, length_sorted_idx = get_sorted_test_df(test_df.copy(), tokenizer, batch_size=CFG.batch_size) # sort_df, length_sorted_idx
test_dataset = PatentDatasetV2(sort_df, tokenizer) # test_dataset
test_loader = DataLoader(test_dataset, shuffle=False, batch_size=CFG.batch_size, num_workers=CFG.n_jobs, drop_last=False, pin_memory=True) # test_loader
res3 = []
folds = range(CFG.k_folds)
for fold in folds:
    model = PatentModel(get_model_path(model_name), pretrained=False) # model
    model_path = '/home/huaxuechun/workspace/pppm/models/{}_fold{}_seed{}.pth.tar'.format(model_name, fold, 42) # model_path
    print(model_path)
    y_preds = get_preds(test_df, test_loader, model, model_path) # y_preds
    y_preds = y_preds[np.argsort(length_sorted_idx)] # y_preds
    res3.append(y_preds)
    del model
    gc.collect()
    torch.cuda.empty_cache()
res3 = np.mean(res3, axis=0)

# -------------------- Model 4 --------------------
model_name = 'funnel-large' # model_name
tokenizer_path = get_model_path(model_name) # get_model_path
tokenizer = AutoTokenizer.from_pretrained(tokenizer_path) # tokenizer

sort_df, length_sorted_idx = get_sorted_test_df(test_df.copy(), tokenizer, batch_size=CFG.batch_size) # sort_df, length_sorted_idx
test_dataset = PatentDatasetV2(sort_df, tokenizer) # test_dataset
test_loader = DataLoader(test_dataset, shuffle=False, batch_size=CFG.batch_size, num_workers=CFG.n_jobs, drop_last=False, pin_memory=True) # test_loader
res4 = []
folds = range(CFG.k_folds)
for fold in folds:
    model = PatentModel(get_model_path(model_name), pretrained=False) # model
    model_path = '/home/huaxuechun/workspace/pppm/models/{}_fold{}_seed{}.pth.tar'.format(model_name, fold, 42) # model_path
    print(model_path)
    y_preds = get_preds(test_df, test_loader, model, model_path) # y_preds
    y_preds = y_preds[np.argsort(length_sorted_idx)] # y_preds
    res4.append(y_preds)
    del model
    gc.collect()
    torch.cuda.empty_cache()
res4 = np.mean(res4, axis=0)

# ensemble
res = [res1,res2,res3,res4]
for i in range(len(res)):
    res[i] = (res[i] - res[i].mean())/res[i].std()
test_df['score'] = np.sum([res[i] * ensemble_weight[i] for i in range(len(res))], axis=0)
test_df['score'] = (test_df['score'] - test_df['score'].mean()) /test_df['score'].std()

# get submission
print(test_df.shape)
test_df[['id', 'score']].to_csv("submission.csv", index=False)
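
The blending step at the end of the inference code can be isolated as follows. This is a minimal sketch with made-up predictions from two hypothetical models; `zscore` and `weighted_ensemble` are illustrative names. Each model's predictions are standardized before the weighted sum so that models with different output scales contribute according to their weights. Because the Pearson correlation is unaffected by shifting and positive rescaling, the blended scores do not need to be mapped back to [0, 1].

```python
import numpy as np

def zscore(x):
    """Standardize to zero mean and unit (population) standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

def weighted_ensemble(preds, weights):
    """Standardize each model's predictions, then take their weighted sum."""
    return np.sum([w * zscore(p) for p, w in zip(preds, weights)], axis=0)

# Made-up predictions from two hypothetical models on four phrase pairs.
preds = [np.array([0.1, 0.4, 0.6, 0.9]), np.array([0.2, 0.3, 0.8, 0.7])]
blend = weighted_ensemble(preds, weights=[0.6, 0.4])
```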

### Specific Process 
+ Start from a **deberta-base baseline**, PB score: 0.789
+ Add the **title text** from the CPC file to the training text, PB score: 0.795
+ Group by anchor and context to obtain the aggregated targets list (gp_targets), PB score: 0.810
+ Replace the **deberta-base baseline** with **deberta-v3-large**, PB score: 0.841
+ Use **electra-large**, PB score: 0.831
+ Use **funnel-large**, PB score: 0.825
+ Use **bert-for-patents**, PB score: 0.822
+ Ensemble, PB score: 0.847
+ Add FGM adversarial training, PB score: 0.849
+ Tune hyperparameters such as learning rate, seq_length and number of folds, PB score: 0.855
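
The adversarial training step in the list above perturbs the embedding weights along the normalized gradient before a second backward pass. The toy NumPy sketch below shows only that perturbation rule, applied to a made-up quadratic loss rather than a transformer. The names and values are illustrative, and the norm handling is simplified compared with the `FGM` class in the code.

```python
import numpy as np

def fgm_perturbation(grad, epsilon=1.0):
    """Return epsilon * grad / ||grad||_2, or zeros when the gradient vanishes."""
    norm = np.linalg.norm(grad)
    if norm == 0 or np.isnan(norm):
        return np.zeros_like(grad)
    return epsilon * grad / norm

# Toy stand-in for a model: loss(e) = (w . e - y)^2 for an "embedding" vector e.
w = np.array([0.5, -1.0, 2.0])
e = np.array([0.1, 0.2, 0.3])
y = 1.0

def loss(emb):
    return float((w @ emb - y) ** 2)

grad = 2.0 * (w @ e - y) * w             # analytic gradient of the loss w.r.t. e
r_adv = fgm_perturbation(grad, epsilon=0.1)
clean_loss, adv_loss = loss(e), loss(e + r_adv)
```

Moving along the gradient direction increases the loss, which is what makes the perturbed input a harder training example; in the training loop the gradients from both the clean and the perturbed pass are accumulated before the optimizer step.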