## U.S. Patent Phrase to Phrase Matching
### Help Identify Similar Phrases in U.S. Patents

![](https://storage.googleapis.com/kaggle-competitions/kaggle/33657/logos/header.png?t=2022-02-23-06-26-59)

## Data Description
In this dataset, you are presented with pairs of phrases (an anchor and a target phrase) and asked to rate how similar they are on a scale from 0 (not at all similar) to 1 (identical in meaning). This challenge differs from a standard semantic similarity task in that similarity is scored within a patent's context, specifically its CPC classification (version 2021.05), which indicates the subject to which the patent relates. For example, while the phrases "bird" and "Cape Cod" may have low semantic similarity in everyday language, their meanings are much closer when considered in the context of "house".

This is a code competition, in which you will submit code that will be run against an unseen test set. The unseen test set contains approximately 12k pairs of phrases. A small public test set has been provided for testing purposes, but is not used in scoring.

Information on the meaning of CPC codes may be found on the USPTO website. The CPC version 2021.05 can be found on the CPC archive website.

## Score meanings
The scores are in the 0-1 range with increments of 0.25 with the following meanings:

>- 1.0 - Very close match. This is typically an exact match except possibly for differences in conjugation, quantity (e.g. singular vs. plural), and addition or removal of stopwords (e.g. “the”, “and”, “or”).
>- 0.75 - Close synonym, e.g. “mobile phone” vs. “cellphone”. This also includes abbreviations, e.g. "TCP" -> "transmission control protocol".
>- 0.5 - Synonyms which don’t have the same meaning (same function, same properties). This includes broad-narrow (hyponym) and narrow-broad (hypernym) matches.
>- 0.25 - Somewhat related, e.g. the two phrases are in the same high level domain but are not synonyms. This also includes antonyms.
>- 0.0 - Unrelated.

## Files
>- train.csv - the training set, containing phrases, contexts, and their similarity scores
>- test.csv - the test set, identical in structure to the training set but without the score
>- sample_submission.csv - a sample submission file in the correct format

## Columns
>- id - a unique identifier for a pair of phrases
>- anchor - the first phrase
>- target - the second phrase
>- context - the CPC classification (version 2021.05), which indicates the subject within which the similarity is to be scored
>- score - the similarity. This is sourced from a combination of one or more manual expert ratings.

## CPC
The Cooperative Patent Classification (CPC) is a patent classification system, which has been jointly developed by the European Patent Office (EPO) and the United States Patent and Trademark Office (USPTO). The CPC is substantially based on the previous European classification system (ECLA), which itself was a more specific and detailed version of the International Patent Classification (IPC) system.

Each classification term consists of a symbol such as "A01B33/00" (which represents "tilling implements with rotary driven tools"). The first letter is the "section symbol" consisting of a letter from "A" ("Human Necessities") to "H" ("Electricity") or "Y" for emerging cross-sectional technologies. This is followed by a two-digit number to give a "class symbol" ("A01" represents "Agriculture; forestry; animal husbandry; trapping; fishing"). The final letter makes up the "subclass" (A01B represents "Soil working in agriculture or forestry, parts, details, or accessories of agricultural machines or implements, in general"). The subclass is then followed by a 1- to 3-digit "group" number, an oblique stroke and a number of at least two digits representing a "main group" ("00") or "subgroup". A patent examiner assigns a classification to the patent application or other document at the most detailed level which is applicable to its contents.
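
The symbol layout described above can be sketched as a small parser. This is a minimal illustration that assumes exactly the format given in the paragraph (1- to 3-digit group, at least two digits after the stroke); `CPC_PATTERN`, `CPCSymbol` and `parse_cpc_symbol` are hypothetical names, not part of the competition data or any library. The `context` column used later in this notebook appears to hold class-level codes such as `H01`, so a parser like this is only needed for full symbols.

```python
import re
from typing import NamedTuple

# Assumed layout, taken from the description above: section letter, two-digit
# class, subclass letter, 1- to 3-digit group, "/", and a main-group/subgroup
# number of at least two digits.
CPC_PATTERN = re.compile(r"^([A-HY])(\d{2})([A-Z])(\d{1,3})/(\d{2,})$")


class CPCSymbol(NamedTuple):
    section: str    # e.g. "A"
    cpc_class: str  # e.g. "A01"
    subclass: str   # e.g. "A01B"
    group: str      # e.g. "33"
    subgroup: str   # e.g. "00" (main group)


def parse_cpc_symbol(symbol: str) -> CPCSymbol:
    """Split a full CPC symbol such as 'A01B33/00' into its parts."""
    match = CPC_PATTERN.match(symbol.replace(" ", ""))
    if match is None:
        raise ValueError(f"Not a full CPC symbol: {symbol!r}")
    section, class_digits, subclass_letter, group, subgroup = match.groups()
    return CPCSymbol(
        section=section,
        cpc_class=section + class_digits,
        subclass=section + class_digits + subclass_letter,
        group=group,
        subgroup=subgroup,
    )


parsed = parse_cpc_symbol("A01B33/00")
print(parsed)
```

Keeping the cumulative prefixes (`A`, `A01`, `A01B`) makes it easy to look up titles at each level of the hierarchy.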

### Nomenclature:
>- A: Human Necessities
>- B: Operations and Transport
>- C: Chemistry and Metallurgy
>- D: Textiles
>- E: Fixed Constructions
>- F: Mechanical Engineering
>- G: Physics
>- H: Electricity
>- Y: Emerging Cross-Sectional Technologies

## Evaluation:
Submissions are evaluated on the Pearson correlation coefficient between the predicted and actual similarity scores.
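
As a rough illustration of the metric, the sketch below computes the Pearson coefficient in plain Python on made-up scores. The function name `pearson` and the sample values are assumptions for this example only; the notebook itself relies on SciPy for scoring.

```python
import math


def pearson(y_true, y_pred):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    mean_pred = sum(y_pred) / n
    cov = sum((t - mean_true) * (p - mean_pred) for t, p in zip(y_true, y_pred))
    var_true = sum((t - mean_true) ** 2 for t in y_true)
    var_pred = sum((p - mean_pred) ** 2 for p in y_pred)
    return cov / math.sqrt(var_true * var_pred)


# Made-up scores for illustration only.
actual = [0.0, 0.25, 0.5, 0.75, 1.0]
predicted = [0.1, 0.2, 0.6, 0.7, 0.9]

r = pearson(actual, predicted)
# A positive linear rescaling of the predictions leaves the coefficient unchanged.
r_rescaled = pearson(actual, [2.0 * p + 3.0 for p in predicted])
print(round(r, 4), round(r_rescaled, 4))
```

Because the coefficient ignores positive linear rescaling, what matters is how well predictions track the labels linearly, not whether they land exactly on the 0.25 increments.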

### If this notebook is helpful, feel free to upvote! :)

Reference notebooks used: 
>- https://www.kaggle.com/code/yasufuminakama/pppm-deberta-v3-large-baseline-w-w-b-train/notebook

## Import packages

In [None]:
import pandas as pd
import numpy as np
import math
import os
import re
import time
import random
import scipy as sp
import scipy.stats  # make sure sp.stats is loaded for get_score
import gc
import warnings
warnings.filterwarnings("ignore")
from tqdm import tqdm
tqdm.pandas()
import plotly.express as px
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly as py
import plotly.graph_objs as go
init_notebook_mode(connected=True)
from sklearn import model_selection as sk_model_selection
import torch
print(f"torch.__version__: {torch.__version__}")
import torch.nn as nn
from torch.nn import Parameter
import torch.nn.functional as F
from torch.optim import Adam, SGD, AdamW
from torch.utils.data import DataLoader, Dataset
import tokenizers
import transformers
print(f"tokenizers.__version__: {tokenizers.__version__}")
print(f"transformers.__version__: {transformers.__version__}")
from transformers import AutoTokenizer, AutoModel, AutoConfig
from transformers import get_linear_schedule_with_warmup, get_cosine_schedule_with_warmup
%env TOKENIZERS_PARALLELISM=true

SEED=42
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

## Load data

In [None]:
train_data = pd.read_csv("../input/us-patent-phrase-to-phrase-matching/train.csv")
test_data = pd.read_csv("../input/us-patent-phrase-to-phrase-matching/test.csv")
sample_submission_data = pd.read_csv("../input/us-patent-phrase-to-phrase-matching/sample_submission.csv")

In [None]:
OUTPUT_DIR = './'

## EDA

In [None]:
print(train_data.shape, train_data['id'].nunique(), train_data[['anchor','target','context']].drop_duplicates().shape[0])
train_data.head()

In [None]:
train_data.isnull().sum()

In [None]:
train_data[(train_data['anchor'].str.strip()=='') | (train_data['context'].str.strip()=='') | (train_data['target'].str.strip()=='')]

In [None]:
train_data['anchor'].nunique(), train_data['target'].nunique(), train_data['context'].nunique()

In [None]:
train_data.groupby(['anchor']).agg({'target':'nunique','context':'nunique'}).reset_index().sort_values(by = ['target'], ascending = False).reset_index(drop = True)

In [None]:
# 1 context can be associated with more than 1 anchor
train_data.groupby(['context']).agg({'anchor':'nunique'}).reset_index().sort_values(by = ['anchor'], ascending = False).reset_index(drop = True)

In [None]:
train_data[train_data['context']=='H01']

In [None]:
print(test_data.shape)
test_data.head()

In [None]:
test_data[~(test_data['anchor'].isin(train_data['anchor'].unique().tolist()))], test_data[~(test_data['target'].isin(train_data['target'].unique().tolist()))], test_data[~(test_data['context'].isin(train_data['context'].unique().tolist()))]

In [None]:
sample_target = 'inorganic photoconductor drum'
print('\nTest data record:\n')
display(test_data[test_data['target']==sample_target])
print('\nTrain data record:\n')
display(train_data[train_data['target']==sample_target])

In [None]:
# Checking if public test data is a sample drawn from train data
test_data[~(test_data['id'].isin(train_data['id'].unique().tolist()))]

In [None]:
print(sample_submission_data.shape)
sample_submission_data.head()

In [None]:
# Reference: https://www.kaggle.com/code/yasufuminakama/pppm-deberta-v3-large-baseline-w-w-b-train
# ====================================================
# CPC Data
# ====================================================
def get_cpc_texts():
    contexts = []
    pattern = r'[A-Z]\d+'
    for file_name in os.listdir('../input/cpc-data/CPCSchemeXML202105'):
        result = re.findall(pattern, file_name)
        if result:
            contexts.append(result)
    contexts = sorted(set(sum(contexts, [])))
    results = {}
    for cpc in ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'Y']:
        with open(f'../input/cpc-data/CPCTitleList202202/cpc-section-{cpc}_20220201.txt') as f:
            s = f.read()
        pattern = f'{cpc}\t\t.+'
        result = re.findall(pattern, s)
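        # str.lstrip strips a set of characters, not a prefix, so it can eat
        # leading letters of the title; split on the tab separator instead.
        cpc_result = result[0].split('\t\t', 1)[1]
        for context in [c for c in contexts if c[0] == cpc]:
            pattern = f'{context}\t\t.+'
            result = re.findall(pattern, s)
            results[context] = cpc_result + ". " + result[0].split('\t\t', 1)[1]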
    return results


cpc_texts = get_cpc_texts()
torch.save(cpc_texts, OUTPUT_DIR+"cpc_texts.pth")
train_data['context_text'] = train_data['context'].map(cpc_texts)
test_data['context_text'] = test_data['context'].map(cpc_texts)
display(train_data.head())
display(test_data.head())

In [None]:
train_data[train_data['context_text'].isnull()]

In [None]:
train_data['text'] = train_data['anchor'] + '[SEP]' + train_data['target'] + '[SEP]'  + train_data['context_text']
test_data['text'] = test_data['anchor'] + '[SEP]' + test_data['target'] + '[SEP]'  + test_data['context_text']
display(train_data.head())
display(test_data.head())

### Distribution of # records

In [None]:
train_data['# Words in text'] = train_data['text'].apply(lambda x: len(x.split(' ')))
fig = px.histogram(train_data, x='# Words in text', title = 'Distribution of text length')
py.offline.iplot(fig)

In [None]:
fig = px.histogram(train_data, x='score', title = 'Distribution of similarity score')
py.offline.iplot(fig)

In [None]:
num_records_by_patent_category = train_data['context'].apply(lambda x: x[0]).value_counts().reset_index()
num_records_by_patent_category.columns = ['Category','# Records']
fig = px.bar(num_records_by_patent_category.sort_values(by = ['# Records'], ascending = False), x='Category', y='# Records', title = 'Top patent categories by record count')
fig.update_traces(marker_color='green')
py.offline.iplot(fig)

## Modelling

In [None]:
# ====================================================
# Utils
# ====================================================
def get_score(y_true, y_pred):
    score = sp.stats.pearsonr(y_true, y_pred)[0]
    return score

def seed_everything(seed=42):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    
seed_everything(seed=SEED)

In [None]:
# ====================================================
# CFG
# ====================================================
class CFG:
    debug=False
    apex=True
    print_freq=100
    num_workers=4
    model="microsoft/deberta-v3-large"
    scheduler='cosine' # ['linear', 'cosine']
    batch_scheduler=True
    num_cycles=0.5
    num_warmup_steps=0
    epochs=4
    encoder_lr=2e-5
    decoder_lr=2e-5
    min_lr=1e-6
    eps=1e-6
    betas=(0.9, 0.999)
    batch_size=16
    fc_dropout=0.2
    target_size=1
    max_len=512
    weight_decay=0.01
    gradient_accumulation_steps=1
    max_grad_norm=1000
    seed=SEED
    n_fold=4
    trn_fold=[0, 1, 2, 3]
    train=True
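
With `scheduler='cosine'`, `num_cycles=0.5` and `num_warmup_steps=0`, the learning rate is expected to follow half a cosine wave from the base rate down toward zero. The sketch below re-derives that multiplier in plain Python. It is intended to mirror what `get_cosine_schedule_with_warmup` computes, but it is not the library code, so treat the formula as an assumption to check against the installed `transformers` version. `cosine_lr_factor` and `total_steps` are hypothetical names and values.

```python
import math


def cosine_lr_factor(step, num_training_steps, num_warmup_steps=0, num_cycles=0.5):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Linear warmup from 0 to 1.
        return step / max(1, num_warmup_steps)
    progress = (step - num_warmup_steps) / max(1, num_training_steps - num_warmup_steps)
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * num_cycles * 2.0 * progress)))


base_lr = 2e-5       # CFG.encoder_lr
total_steps = 1000   # hypothetical; the notebook derives this from the data size

for step in (0, 250, 500, 750, 1000):
    print(step, base_lr * cosine_lr_factor(step, total_steps))
```

Note that `CFG.min_lr` is not referenced elsewhere in this notebook, so under this schedule the rate decays toward zero rather than stopping at `min_lr`.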

In [None]:
tokenizer = AutoTokenizer.from_pretrained(CFG.model)
tokenizer.save_pretrained(OUTPUT_DIR+'tokenizer/')
CFG.tokenizer = tokenizer

In [None]:
# ====================================================
# Define max_len
# ====================================================
lengths_dict = {}

lengths = []
tk0 = tqdm(cpc_texts.values(), total=len(cpc_texts))
for text in tk0:
    length = len(tokenizer(text, add_special_tokens=False)['input_ids'])
    lengths.append(length)
lengths_dict['context_text'] = lengths

for text_col in ['anchor', 'target']:
    lengths = []
    tk0 = tqdm(train_data[text_col].fillna("").values, total=len(train_data))
    for text in tk0:
        length = len(tokenizer(text, add_special_tokens=False)['input_ids'])
        lengths.append(length)
    lengths_dict[text_col] = lengths
    
CFG.max_len = max(lengths_dict['anchor']) + max(lengths_dict['target'])\
                + max(lengths_dict['context_text']) + 4 # CLS + SEP + SEP + SEP
CFG.max_len

In [None]:
print('Text:', train_data['text'].iloc[0])
print('\nTokenizer output\n')
tokenizer(train_data['text'].iloc[0], add_special_tokens=False)

In [None]:
# ====================================================
# Dataset
# ====================================================
def prepare_input(cfg, text):
    inputs = cfg.tokenizer(text,
                           add_special_tokens=True,
                           max_length=cfg.max_len,
                           padding="max_length",
                           return_offsets_mapping=False)
    for k, v in inputs.items():
        inputs[k] = torch.tensor(v, dtype=torch.long)
    return inputs


class TrainDataset(Dataset):
    def __init__(self, cfg, df):
        self.cfg = cfg
        self.texts = df['text'].values
        self.labels = df['score'].values

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, item):
        inputs = prepare_input(self.cfg, self.texts[item])
        label = torch.tensor(self.labels[item], dtype=torch.float)
        return inputs, label

In [None]:
# Reference: https://www.kaggle.com/code/yasufuminakama/pppm-deberta-v3-large-baseline-w-w-b-train/notebook
# ====================================================
# Model
# ====================================================
class CustomModel(nn.Module):
    def __init__(self, cfg, config_path=None, pretrained=False):
        super().__init__()
        self.cfg = cfg
        if config_path is None:
            self.config = AutoConfig.from_pretrained(cfg.model, output_hidden_states=True)
        else:
            self.config = torch.load(config_path)
        if pretrained:
            self.model = AutoModel.from_pretrained(cfg.model, config=self.config)
        else:
            self.model = AutoModel.from_config(self.config)
        self.fc_dropout = nn.Dropout(cfg.fc_dropout)
        self.fc = nn.Linear(self.config.hidden_size, self.cfg.target_size)
        self._init_weights(self.fc)
        self.attention = nn.Sequential(
            nn.Linear(self.config.hidden_size, 512),
            nn.Tanh(),
            nn.Linear(512, 256),
            nn.Tanh(),
            nn.Linear(256, 1),
            nn.Softmax(dim=1)
        )
        self._init_weights(self.attention)
        
    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
            if module.bias is not None:
                module.bias.data.zero_()
        elif isinstance(module, nn.Embedding):
            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
            if module.padding_idx is not None:
                module.weight.data[module.padding_idx].zero_()
        elif isinstance(module, nn.LayerNorm):
            module.bias.data.zero_()
            module.weight.data.fill_(1.0)
        
    def feature(self, inputs):
        outputs = self.model(**inputs)
        last_hidden_states = outputs[0]
        # feature = torch.mean(last_hidden_states, 1)
        weights = self.attention(last_hidden_states)
        feature = torch.sum(weights * last_hidden_states, dim=1)
        return feature

    def forward(self, inputs):
        feature = self.feature(inputs)
        output = self.fc(self.fc_dropout(feature))
        return output
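
The `feature` method above pools token states with learned attention weights instead of a plain mean. A minimal sketch of that pooling step, written in plain Python so it runs without PyTorch: the per-token scores are supplied by hand here, whereas in the model they come from the small `self.attention` network. `softmax` and `attention_pool` are hypothetical helper names and the vectors are made up.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(token_vectors, token_scores):
    """Weighted sum of token vectors, with weights = softmax(token_scores)."""
    weights = softmax(token_scores)
    hidden_size = len(token_vectors[0])
    pooled = [
        sum(w * vec[d] for w, vec in zip(weights, token_vectors))
        for d in range(hidden_size)
    ]
    return pooled, weights


# Toy sequence of 3 tokens with hidden size 2 (values are made up).
token_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
token_scores = [0.0, 0.0, 0.0]  # equal scores reduce to mean pooling

pooled, weights = attention_pool(token_vectors, token_scores)
print(weights, pooled)
```

With equal scores the result matches mean pooling (the commented-out alternative in `feature`); unequal scores shift the pooled vector toward the higher-scored tokens.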

In [None]:
# ====================================================
# Helper functions
# ====================================================
class AverageMeter(object):
    """Computes and stores the average and current value"""
    def __init__(self):
        self.reset()

    def reset(self):
        self.val = 0
        self.avg = 0
        self.sum = 0
        self.count = 0

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count


def asMinutes(s):
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)


def timeSince(since, percent):
    now = time.time()
    s = now - since
    es = s / (percent)
    rs = es - s
    return '%s (remain %s)' % (asMinutes(s), asMinutes(rs))


def train_fn(train_loader, model, criterion, optimizer, epoch, scheduler, device):
    model.train()
    scaler = torch.cuda.amp.GradScaler(enabled=CFG.apex)
    losses = AverageMeter()
    start = end = time.time()
    global_step = 0
    for step, (inputs, labels) in enumerate(train_loader):
        for k, v in inputs.items():
            inputs[k] = v.to(device)
        labels = labels.to(device)
        batch_size = labels.size(0)
        with torch.cuda.amp.autocast(enabled=CFG.apex):
            y_preds = model(inputs)
        loss = criterion(y_preds.view(-1, 1), labels.view(-1, 1))
        if CFG.gradient_accumulation_steps > 1:
            loss = loss / CFG.gradient_accumulation_steps
        losses.update(loss.item(), batch_size)
        scaler.scale(loss).backward()
        grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), CFG.max_grad_norm)
        if (step + 1) % CFG.gradient_accumulation_steps == 0:
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()
            global_step += 1
            if CFG.batch_scheduler:
                scheduler.step()
        end = time.time()
        if step % CFG.print_freq == 0 or step == (len(train_loader)-1):
            print('Epoch: [{0}][{1}/{2}] '
                  'Elapsed {remain:s} '
                  'Loss: {loss.val:.4f}({loss.avg:.4f}) '
                  'Grad: {grad_norm:.4f}  '
                  'LR: {lr:.8f}  '
                  .format(epoch+1, step, len(train_loader), 
                          remain=timeSince(start, float(step+1)/len(train_loader)),
                          loss=losses,
                          grad_norm=grad_norm,
                          lr=scheduler.get_last_lr()[0]))
    return losses.avg


def valid_fn(valid_loader, model, criterion, device):
    losses = AverageMeter()
    model.eval()
    preds = []
    start = end = time.time()
    for step, (inputs, labels) in enumerate(valid_loader):
        for k, v in inputs.items():
            inputs[k] = v.to(device)
        labels = labels.to(device)
        batch_size = labels.size(0)
        with torch.no_grad():
            y_preds = model(inputs)
        loss = criterion(y_preds.view(-1, 1), labels.view(-1, 1))
        if CFG.gradient_accumulation_steps > 1:
            loss = loss / CFG.gradient_accumulation_steps
        losses.update(loss.item(), batch_size)
        preds.append(y_preds.sigmoid().to('cpu').numpy())
        end = time.time()
        if step % CFG.print_freq == 0 or step == (len(valid_loader)-1):
            print('EVAL: [{0}/{1}] '
                  'Elapsed {remain:s} '
                  'Loss: {loss.val:.4f}({loss.avg:.4f}) '
                  .format(step, len(valid_loader),
                          loss=losses,
                          remain=timeSince(start, float(step+1)/len(valid_loader))))
    predictions = np.concatenate(preds)
    predictions = np.concatenate(predictions)
    return losses.avg, predictions


def inference_fn(test_loader, model, device):
    preds = []
    model.eval()
    model.to(device)
    tk0 = tqdm(test_loader, total=len(test_loader))
    for inputs in tk0:
        for k, v in inputs.items():
            inputs[k] = v.to(device)
        with torch.no_grad():
            y_preds = model(inputs)
        preds.append(y_preds.sigmoid().to('cpu').numpy())
    predictions = np.concatenate(preds)
    return predictions

In [None]:
train_data['score'].unique()

In [None]:
display(train_data['score'].describe())
train_data['score bin'] = pd.cut(train_data['score'], 5)

df_train, df_valid = sk_model_selection.train_test_split(
    train_data, 
    test_size=0.1, 
    random_state=SEED,
    stratify = train_data['score'])

In [None]:
train_data['score bin'].unique()

In [None]:
display(df_train['score'].describe())
display(df_valid['score'].describe())

In [None]:
train_dataset = TrainDataset(CFG, df_train)
valid_dataset = TrainDataset(CFG, df_valid)

train_loader = DataLoader(train_dataset,
                          batch_size=CFG.batch_size,
                          shuffle=True,
                          num_workers=CFG.num_workers, pin_memory=True, drop_last=True)
valid_loader = DataLoader(valid_dataset,
                          batch_size=CFG.batch_size,
                          shuffle=False,
                          num_workers=CFG.num_workers, pin_memory=True, drop_last=False)

# ====================================================
# model & optimizer
# ====================================================
model = CustomModel(CFG, config_path=None, pretrained=True)
torch.save(model.config, OUTPUT_DIR+'config.pth')
model.to(device)

def get_optimizer_params(model, encoder_lr, decoder_lr, weight_decay=0.0):
    param_optimizer = list(model.named_parameters())
    no_decay = ["bias", "LayerNorm.bias", "LayerNorm.weight"]
    optimizer_parameters = [
        {'params': [p for n, p in model.model.named_parameters() if not any(nd in n for nd in no_decay)],
         'lr': encoder_lr, 'weight_decay': weight_decay},
        {'params': [p for n, p in model.model.named_parameters() if any(nd in n for nd in no_decay)],
         'lr': encoder_lr, 'weight_decay': 0.0},
        {'params': [p for n, p in model.named_parameters() if "model" not in n],
         'lr': decoder_lr, 'weight_decay': 0.0}
    ]
    return optimizer_parameters

optimizer_parameters = get_optimizer_params(model,
                                            encoder_lr=CFG.encoder_lr, 
                                            decoder_lr=CFG.decoder_lr,
                                            weight_decay=CFG.weight_decay)
optimizer = AdamW(optimizer_parameters, lr=CFG.encoder_lr, eps=CFG.eps, betas=CFG.betas)

# ====================================================
# scheduler
# ====================================================
def get_scheduler(cfg, optimizer, num_train_steps):
    if cfg.scheduler == 'linear':
        scheduler = get_linear_schedule_with_warmup(
            optimizer, num_warmup_steps=cfg.num_warmup_steps, num_training_steps=num_train_steps
        )
    elif cfg.scheduler == 'cosine':
        scheduler = get_cosine_schedule_with_warmup(
            optimizer, num_warmup_steps=cfg.num_warmup_steps, num_training_steps=num_train_steps, num_cycles=cfg.num_cycles
        )
    return scheduler

num_train_steps = int(len(df_train) / CFG.batch_size * CFG.epochs)
scheduler = get_scheduler(CFG, optimizer, num_train_steps)

criterion = nn.BCEWithLogitsLoss(reduction="mean")
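
The labels here are fractional (0, 0.25, ..., 1), yet the criterion is a binary cross-entropy on logits. This works because, for a target in [0, 1], the loss is smallest when the sigmoid of the logit equals the target, which is also why `valid_fn` and `inference_fn` apply `.sigmoid()` to the outputs. The sketch below checks this numerically in plain Python; `bce_with_logits` is a hypothetical re-implementation of the per-example loss, not the PyTorch class.

```python
import math


def bce_with_logits(logit, target):
    """Binary cross-entropy on a raw logit; target may be any value in [0, 1]."""
    # Numerically stable form: max(x, 0) - x * y + log(1 + exp(-|x|))
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


target = 0.75  # one of the five score levels
candidate_logits = [i / 100.0 for i in range(-500, 501)]  # -5.00 .. 5.00
best_logit = min(candidate_logits, key=lambda x: bce_with_logits(x, target))
print(best_logit, sigmoid(best_logit))
```

The logit that minimizes the loss on this grid maps through the sigmoid to a value close to the 0.75 target, up to the grid resolution.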

### Training & validation

In [None]:
valid_labels = df_valid['score'].values

In [None]:
best_score = -np.inf  # Pearson correlation: higher is better

for epoch in range(CFG.epochs):

    start_time = time.time()

    # train
    avg_loss = train_fn(train_loader, model, criterion, optimizer, epoch, scheduler, device)

    # eval
    avg_val_loss, predictions = valid_fn(valid_loader, model, criterion, device)

    # scoring
    score = get_score(valid_labels, predictions)

    elapsed = time.time() - start_time

    print(f'Epoch {epoch+1} - avg_train_loss: {avg_loss:.4f}  avg_val_loss: {avg_val_loss:.4f}  time: {elapsed:.0f}s')
    print(f'Epoch {epoch+1} - Score: {score:.4f}')

    if score > best_score:
        best_score = score
        print(f'Epoch {epoch+1} - Save Best Score: {best_score:.4f} Model')
        torch.save({'model': model.state_dict(),
                    'predictions': predictions},
                    OUTPUT_DIR+f"{CFG.model.replace('/', '-')}_best.pth")
        
    # Save model for every epoch
    torch.save({'model': model.state_dict(),
                'predictions': predictions},
                OUTPUT_DIR+f"{CFG.model.replace('/', '-')}_epoch_{epoch}.pth")

predictions = torch.load(OUTPUT_DIR+f"{CFG.model.replace('/', '-')}_best.pth", 
                         map_location=torch.device('cpu'))['predictions']

torch.cuda.empty_cache()
gc.collect()