<a href="https://colab.research.google.com/github/rahiakela/kaggle-competition-projects/blob/master/us-patent-phrase-competition/02_us_patent_phrase_maching_roberta_baseline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## US Patent Phrase to Phrase Matching - RoBERTa baseline

In this dataset, you are presented with pairs of phrases (an anchor and a target phrase) and asked to rate how similar they are on a scale from 0 (not at all similar) to 1 (identical in meaning). This challenge differs from a standard semantic similarity task in that similarity has been scored here within a patent's context, specifically its CPC classification (version 2021.05), which indicates the subject to which the patent relates. For example, while the phrases "bird" and "Cape Cod" may have low semantic similarity in normal language, the likeness of their meaning is much closer if considered in the context of "house".

This is a code competition, in which you will submit code that will be run against an unseen test set. The unseen test set contains approximately 12k pairs of phrases. A small public test set has been provided for testing purposes, but is not used in scoring.

Information on the meaning of CPC codes may be found on the USPTO website. The CPC version 2021.05 can be found on the CPC archive website.

**Score meanings**

The scores are in the 0-1 range with increments of 0.25 with the following meanings:

* **1.0 - Very close match**. This is typically an exact match except possibly for differences in conjugation, quantity (e.g. singular vs. plural), and addition or removal of stopwords (e.g. “the”, “and”, “or”).
* **0.75 - Close synonym**, e.g. “mobile phone” vs. “cellphone”. This also includes abbreviations, e.g. "TCP" -> "transmission control protocol".
* **0.5 - Synonyms** which don’t have the same meaning (same function, same properties). This includes broad-narrow (hyponym) and narrow-broad (hypernym) matches.
* **0.25 - Somewhat related**, e.g. the two phrases are in the same high level domain but are not synonyms. This also includes antonyms.
* **0.0 - Unrelated**.
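
To make the rubric concrete, here is a minimal sketch that stores the five levels in a lookup table and snaps an arbitrary model output to the nearest valid level. The names `SCORE_MEANINGS`, `describe_score`, and `nearest_level` are hypothetical helpers for illustration; nothing in this notebook or the competition requires snapping predictions.

```python
# Rubric levels copied from the list above.
SCORE_MEANINGS = {
    1.0: "Very close match",
    0.75: "Close synonym",
    0.5: "Synonyms which don't have the same meaning",
    0.25: "Somewhat related",
    0.0: "Unrelated",
}


def describe_score(score):
    """Return the rubric label for a valid score level."""
    if score not in SCORE_MEANINGS:
        raise ValueError(f"{score} is not one of {sorted(SCORE_MEANINGS)}")
    return SCORE_MEANINGS[score]


def nearest_level(prediction):
    """Clip a raw prediction to [0, 1] and snap it to the nearest 0.25 step."""
    clipped = min(max(prediction, 0.0), 1.0)
    return round(clipped * 4) / 4


print(describe_score(0.75))  # Close synonym
print(nearest_level(0.62))   # 0.5
```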

**Columns**

* **id** - a unique identifier for a pair of phrases
* **anchor** - the first phrase
* **target** - the second phrase
* **context** - the CPC classification (version 2021.05), which indicates the subject within which the similarity is to be scored
* **score** - the similarity. This is sourced from a combination of one or more manual expert ratings.

## Setup

In [None]:
%%shell

pip -q install transformers
pip -q install datasets

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.graph_objects as go
from wordcloud import WordCloud, STOPWORDS
from termcolor import colored

import datasets
import transformers

from transformers import TrainingArguments, Trainer
from transformers import AutoModelForSequenceClassification, AutoTokenizer

In [5]:
import os

os.environ["WANDB_DISABLED"] = "true"

colors = ["#A2A21C", "#CBCB1A", "#E1E10B", "#F6F605", "#838305"]

Let's load the dataset from Kaggle.

In [None]:
from google.colab import files
files.upload() # upload kaggle.json file

In [None]:
%%shell

mkdir -p ~/.kaggle
mv kaggle.json ~/.kaggle/
ls ~/.kaggle
chmod 600 ~/.kaggle/kaggle.json

# Download the dataset from Kaggle. URL: https://www.kaggle.com/competitions/us-patent-phrase-to-phrase-matching
kaggle competitions download -c us-patent-phrase-to-phrase-matching
unzip -qq us-patent-phrase-to-phrase-matching.zip

Let's define the config.

In [13]:
class config:
  input_path = '/us-patent-phrase-to-phrase-matching/'
  model_path = '/roberta-base'
  model = 'roberta-base'
  
  learning_rate = 2e-5
  weight_decay = 0.01
  
  epochs = 5
  batch_size = 32

In [14]:
sections = {
 'A': 'Human Necessities',
 'B': 'Operations and Transport',
 'C': 'Chemistry and Metallurgy',
 'D': 'Textiles',
 'E': 'Fixed Constructions',
 'F': 'Mechanical Engineering',
 'G': 'Physics',
 'H': 'Electricity',
 'Y': 'Emerging Cross-Sectional Technologies'
}
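
The first character of a CPC code identifies its section, so this table is enough to turn a `context` value into a readable title. A minimal sketch of that lookup follows; `section_title` is a hypothetical helper, and the codes passed to it are illustrative inputs.

```python
sections = {
    "A": "Human Necessities",
    "B": "Operations and Transport",
    "C": "Chemistry and Metallurgy",
    "D": "Textiles",
    "E": "Fixed Constructions",
    "F": "Mechanical Engineering",
    "G": "Physics",
    "H": "Electricity",
    "Y": "Emerging Cross-Sectional Technologies",
}


def section_title(context_code):
    """Map a CPC code such as 'G02' to its section title via the first letter."""
    return sections[context_code[0]]


print(section_title("G02"))  # Physics
print(section_title("A47"))  # Human Necessities
```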

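Let's load the model. Setting `num_labels=1` gives the classification head a single output, so the model is trained as a regressor on the similarity score.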

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(config.model, num_labels=1)
tokenizer = AutoTokenizer.from_pretrained(config.model)

## Loading the dataset

In [None]:
df_train = datasets.Dataset.from_csv('train.csv')

In [10]:
df_train

Dataset({
    features: ['id', 'anchor', 'target', 'context', 'score'],
    num_rows: 36473
})

In [None]:
df_test = datasets.Dataset.from_csv('test.csv')

In [12]:
df_test

Dataset({
    features: ['id', 'anchor', 'target', 'context'],
    num_rows: 36
})

## Preprocess Dataset

In [17]:
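def preprocess(ds, eval=False):
  # The first character of the CPC code identifies the section, e.g. 'G' -> 'Physics'.
  context = ds["context"][0]
  prefix = sections[context]
  anchor = ds["anchor"]

  # Encode (section title + anchor, target) as a sentence pair; the score is the regression label.
  # Note: the prefix is concatenated to the anchor without a separator.
  return {
      **tokenizer(prefix + anchor, ds["target"]), "label": ds["score"]
  }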
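
To see what the tokenizer actually receives, here is a small sketch that rebuilds the two text segments without calling the tokenizer. `build_text_pair`, the `sep` argument, and the example row are hypothetical and only for illustration. With the default `sep=""` it mirrors the concatenation in `preprocess`; a space separator is shown as one possible variant, not as what this notebook does.

```python
# Subset of the section table, enough for the illustrative row below.
SECTIONS = {"A": "Human Necessities", "H": "Electricity"}


def build_text_pair(row, sep=""):
    """Return (first_segment, second_segment) as they would be passed to the tokenizer."""
    prefix = SECTIONS[row["context"][0]]
    return prefix + sep + row["anchor"], row["target"]


# Hypothetical row, not taken from train.csv.
row = {"anchor": "antenna", "target": "aerial", "context": "H01"}

print(build_text_pair(row))           # ('Electricityantenna', 'aerial')
print(build_text_pair(row, sep=" "))  # ('Electricity antenna', 'aerial')
```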

In [None]:
encoded_ds = df_train.map(preprocess, remove_columns=["id", "anchor", "target", "context", "score"])

In [19]:
encoded_ds[100]

{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'input_ids': [0, 48176, 37276, 2485, 873, 21113, 737, 2, 2, 873, 21113, 2],
 'label': 0.5}
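
The encoded example follows RoBERTa's sentence-pair layout: `<s>` (id 0), the first segment, a doubled `</s></s>` (id 2), the second segment, and a closing `</s>`. The sketch below splits the `input_ids` shown above back into its two segments. `split_pair` is a hypothetical helper, and it assumes the `roberta-base` special-token ids.

```python
# Special-token ids for roberta-base: <s> = 0, </s> = 2.
BOS_ID, EOS_ID = 0, 2


def split_pair(input_ids):
    """Split a RoBERTa pair encoding into two segments with special tokens removed."""
    if input_ids[0] != BOS_ID or input_ids[-1] != EOS_ID:
        raise ValueError("expected <s> ... </s> framing")
    body = input_ids[1:-1]
    # The two segments are separated by a doubled </s></s>.
    for i in range(len(body) - 1):
        if body[i] == EOS_ID and body[i + 1] == EOS_ID:
            return body[:i], body[i + 2:]
    raise ValueError("no </s></s> separator found")


ids = [0, 48176, 37276, 2485, 873, 21113, 737, 2, 2, 873, 21113, 2]
first, second = split_pair(ids)
print(first)   # [48176, 37276, 2485, 873, 21113, 737]
print(second)  # [873, 21113]
```

The first segment holds the section prefix plus the anchor, and the second holds the target.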

In [20]:
encoded_ds = encoded_ds.train_test_split(test_size=0.2)
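
As a rough check on what this split produces, the sketch below works out the expected row counts and the number of optimizer steps per epoch. It assumes the test size is rounded up (the scikit-learn convention), a single device, and no gradient accumulation; `split_sizes` is a hypothetical helper, and the row count and batch size come from the cells above.

```python
import math


def split_sizes(n_rows, test_size):
    """Return (n_train, n_test), rounding the test share up."""
    n_test = math.ceil(n_rows * test_size)
    return n_rows - n_test, n_test


n_train, n_test = split_sizes(36473, 0.2)
print(n_train, n_test)  # 29178 7295

# Steps per epoch with batch_size = 32, and total steps over 5 epochs.
steps_per_epoch = math.ceil(n_train / 32)
print(steps_per_epoch, steps_per_epoch * 5)  # 912 4560
```

Since no seed is passed, the rows that land in each split can differ between runs; `train_test_split` accepts a `seed` argument if a reproducible split is needed.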

## Training model