<a href="https://colab.research.google.com/github/rahiakela/kaggle-competition-projects/blob/master/us-patent-phrase-competition/02_us_patent_phrase_maching_roberta_baseline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## US Patent Phrase to Phrase Matching - RoBERTa baseline

In this dataset, you are presented with pairs of phrases (an anchor and a target phrase) and asked to rate how similar they are on a scale from 0 (not at all similar) to 1 (identical in meaning). This challenge differs from a standard semantic similarity task in that similarity has been scored here within a patent's context, specifically its CPC classification (version 2021.05), which indicates the subject to which the patent relates. For example, while the phrases "bird" and "Cape Cod" may have low semantic similarity in normal language, the likeness of their meaning is much closer if considered in the context of "house".

This is a code competition, in which you will submit code that will be run against an unseen test set. The unseen test set contains approximately 12k pairs of phrases. A small public test set has been provided for testing purposes, but is not used in scoring.

Information on the meaning of CPC codes may be found on the USPTO website. The CPC version 2021.05 can be found on the CPC archive website.

**Score meanings**

The scores are in the 0-1 range with increments of 0.25 with the following meanings:

* **1.0 - Very close match**. This is typically an exact match except possibly for differences in conjugation, quantity (e.g. singular vs. plural), and addition or removal of stopwords (e.g. “the”, “and”, “or”).
* **0.75 - Close synonym**, e.g. “mobile phone” vs. “cellphone”. This also includes abbreviations, e.g. "TCP" -> "transmission control protocol".
* **0.5 - Synonyms** which don’t have the same meaning (same function, same properties). This includes broad-narrow (hyponym) and narrow-broad (hypernym) matches.
* **0.25 - Somewhat related**, e.g. the two phrases are in the same high level domain but are not synonyms. This also includes antonyms.
* **0.0 - Unrelated**.
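
To make the scale above easier to work with, here is a minimal sketch that maps a raw model prediction to the nearest of the five score levels. The names `SCORE_MEANINGS`, `nearest_bucket` and `describe` are hypothetical helpers, not part of the competition or of this notebook's pipeline; snapping is shown only as a way to read predictions, not as a required post-processing step.

```python
# Score levels as described in the list above.
SCORE_MEANINGS = {
    1.0: "Very close match",
    0.75: "Close synonym",
    0.5: "Synonyms which don't have the same meaning",
    0.25: "Somewhat related",
    0.0: "Unrelated",
}


def nearest_bucket(prediction: float) -> float:
    """Clip a raw prediction to [0, 1] and snap it to the nearest 0.25 step."""
    clipped = min(1.0, max(0.0, prediction))
    return round(clipped * 4) / 4


def describe(prediction: float) -> str:
    """Return the textual meaning of the score level closest to the prediction."""
    return SCORE_MEANINGS[nearest_bucket(prediction)]


if __name__ == "__main__":
    # A regression head can produce values outside [0, 1], hence the clipping.
    for raw in (-0.2, 0.3, 0.8, 1.3):
        print(raw, nearest_bucket(raw), describe(raw))
```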

**Columns**

* **id** - a unique identifier for a pair of phrases
* **anchor** - the first phrase
* **target** - the second phrase
* **context** - the CPC classification (version 2021.05), which indicates the subject within which the similarity is to be scored
* **score** - the similarity. This is sourced from a combination of one or more manual expert ratings.

## Setup

In [None]:
%%shell

pip -q install transformers
pip -q install datasets

In [6]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.graph_objects as go
from wordcloud import WordCloud, STOPWORDS
from termcolor import colored

import datasets
import transformers

from transformers import TrainingArguments, Trainer
from transformers import AutoModelForSequenceClassification, AutoTokenizer

In [7]:
import os

os.environ["WANDB_DISABLED"] = "true"

colors = ["#A2A21C", "#CBCB1A", "#E1E10B", "#F6F605", "#838305"]

Let's download the dataset from Kaggle.

In [None]:
from google.colab import files
files.upload() # upload kaggle.json file

In [34]:
%%shell

mkdir -p ~/.kaggle
mv kaggle.json ~/.kaggle/
ls ~/.kaggle
chmod 600 ~/.kaggle/kaggle.json

# Download the dataset from Kaggle. URL: https://www.kaggle.com/competitions/us-patent-phrase-to-phrase-matching
kaggle competitions download -c us-patent-phrase-to-phrase-matching
unzip -qq us-patent-phrase-to-phrase-matching.zip

kaggle.json
Downloading us-patent-phrase-to-phrase-matching.zip to /content
  0% 0.00/682k [00:00<?, ?B/s]
100% 682k/682k [00:00<00:00, 68.2MB/s]
replace sample_submission.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
replace test.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
replace train.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: y




Let's define the config.

In [10]:
class config:
  input_path = '/us-patent-phrase-to-phrase-matching/'
  model_path = '/roberta-base'
  model = 'roberta-base'
  
  learning_rate = 2e-5
  weight_decay = 0.01
  
  epochs = 5
  batch_size = 32

In [11]:
sections = {
 'A': 'Human Necessities',
 'B': 'Operations and Transport',
 'C': 'Chemistry and Metallurgy',
 'D': 'Textiles',
 'E': 'Fixed Constructions',
 'F': 'Mechanical Engineering',
 'G': 'Physics',
 'H': 'Electricity',
 'Y': 'Emerging Cross-Sectional Technologies'
}
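
As a small illustration of how this mapping is used later, the sketch below looks up the section title from the first character of a CPC context code. The helper name `section_for` and the sample codes are assumptions made for the example; the dictionary simply repeats the one defined above so the block runs on its own.

```python
# Same mapping as the `sections` dictionary above, repeated so this runs standalone.
CPC_SECTIONS = {
    "A": "Human Necessities",
    "B": "Operations and Transport",
    "C": "Chemistry and Metallurgy",
    "D": "Textiles",
    "E": "Fixed Constructions",
    "F": "Mechanical Engineering",
    "G": "Physics",
    "H": "Electricity",
    "Y": "Emerging Cross-Sectional Technologies",
}


def section_for(context: str) -> str:
    """Return the section title for a CPC code such as 'A47' (hypothetical helper)."""
    # The first character of a CPC code is its section letter.
    return CPC_SECTIONS[context[0]]


if __name__ == "__main__":
    # Illustrative context codes, chosen only for the example.
    for code in ("A47", "G06", "H04"):
        print(code, "->", section_for(code))
```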

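Let's load the model. With `num_labels=1`, the classification head outputs a single value, so the task is treated as regression.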

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(config.model, num_labels=1)
tokenizer = AutoTokenizer.from_pretrained(config.model)

## Loading the dataset

In [None]:
df_train = datasets.Dataset.from_csv('train.csv')

In [14]:
df_train

Dataset({
    features: ['id', 'anchor', 'target', 'context', 'score'],
    num_rows: 36473
})

In [None]:
df_test = datasets.Dataset.from_csv('test.csv')

In [16]:
df_test

Dataset({
    features: ['id', 'anchor', 'target', 'context'],
    num_rows: 36
})

## Preprocess Dataset

In [17]:
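def preprocess(ds, eval=False):
  # The first character of the CPC code (e.g. "A" in "A47") identifies the section.
  context = ds["context"][0]
  prefix = sections[context]
  anchor = ds["anchor"]

  # Encode (section title + anchor, target) as a sentence pair.
  # Note: no separator is inserted between the section title and the anchor.
  return {
      **tokenizer(prefix + anchor, ds["target"]),
      "label": ds["score"],
  }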
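
To make the input format concrete, the following sketch builds the two text segments that are handed to the tokenizer, without loading the tokenizer itself. The helper `build_pair`, its `sep` parameter and the context code `H04` are assumptions for illustration; the phrases come from the score description above. It also shows that plain concatenation glues the section title to the anchor, and how an optional separator would change that.

```python
# Subset of the section mapping, repeated so the example is self-contained.
SECTIONS = {"A": "Human Necessities", "H": "Electricity"}


def build_pair(context: str, anchor: str, target: str, sep: str = "") -> tuple:
    """Return the (first, second) text segments for one row (hypothetical helper).

    With sep="" this mirrors the concatenation used in `preprocess`;
    sep=" " is an alternative that keeps the section title and anchor apart.
    """
    prefix = SECTIONS[context[0]]
    return prefix + sep + anchor, target


if __name__ == "__main__":
    first, second = build_pair("H04", "mobile phone", "cellphone")
    print(repr(first), repr(second))

    spaced, _ = build_pair("H04", "mobile phone", "cellphone", sep=" ")
    print(repr(spaced))
```

RoBERTa encodes such a pair as `<s> first </s></s> second </s>`, which is consistent with the `0 ... 2, 2 ... 2` pattern in the encoded example shown in the notebook.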

In [None]:
encoded_ds = df_train.map(preprocess, remove_columns=["id", "anchor", "target", "context", "score"])

In [19]:
encoded_ds[100]

{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'input_ids': [0, 48176, 37276, 2485, 873, 21113, 737, 2, 2, 873, 21113, 2],
 'label': 0.5}

In [20]:
encoded_ds = encoded_ds.train_test_split(test_size=0.2)

## Training the model

In [21]:
def compute_metrics(eval_pred):
  predictions, labels = eval_pred
  predictions = predictions.reshape(len(predictions))
  return {"pearson": np.corrcoef(predictions, labels)[0][1]}
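
For reference, the Pearson coefficient that `np.corrcoef` returns here can be written out by hand. The sketch below is a plain-Python version under the usual definition (covariance divided by the product of the standard deviations); the function name `pearson` and the sample values are illustrative only. It returns NaN when either input has zero variance, which is the case when all labels are identical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equally long sequences (illustrative helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # The coefficient is undefined if one of the inputs is constant.
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    labels = [0.0, 0.25, 0.5, 1.0]
    # Predictions that are a linear function of the labels correlate perfectly.
    print(pearson([0.1, 0.3, 0.5, 0.9], labels))
    # Constant labels (e.g. a placeholder value) give an undefined correlation.
    print(pearson([0.1, 0.3, 0.5, 0.9], [-1, -1, -1, -1]))
```

Because the metric is invariant to positive linear rescaling of the predictions, it rewards correct ordering and spacing rather than exact values.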

In [22]:
args = TrainingArguments("uspppm",
                         evaluation_strategy="epoch",
                         save_strategy="epoch",
                         learning_rate=config.learning_rate,
                         per_device_train_batch_size=config.batch_size,
                         per_device_eval_batch_size=config.batch_size,
                         num_train_epochs=config.epochs,
                         weight_decay=config.weight_decay,
                         load_best_model_at_end=True,
                         metric_for_best_model="pearson")

Using the `WAND_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


In [23]:
trainer = Trainer(model, args,
                  train_dataset=encoded_ds["train"],
                  eval_dataset=encoded_ds["test"],
                  tokenizer=tokenizer,
                  compute_metrics=compute_metrics)

Let's evaluate the model before fine-tuning to get a baseline.

In [24]:
trainer.evaluate()

***** Running Evaluation *****
  Num examples = 7295
  Batch size = 32


{'eval_loss': 0.1443788707256317,
 'eval_pearson': -0.18269885965715912,
 'eval_runtime': 19.344,
 'eval_samples_per_second': 377.12,
 'eval_steps_per_second': 11.787}

Let's train the model.

In [25]:
trainer.train()

***** Running training *****
  Num examples = 29178
  Num Epochs = 5
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 4560


| Epoch | Training Loss | Validation Loss | Pearson |
|---|---|---|---|
| 1 | 0.0535 | 0.032402 | 0.731429 |
| 2 | 0.0338 | 0.028503 | 0.779076 |
| 3 | 0.0271 | 0.026226 | 0.796185 |
| 4 | 0.0226 | 0.026281 | 0.805003 |
| 5 | 0.0196 | 0.025503 | 0.808865 |


***** Running Evaluation *****
  Num examples = 7295
  Batch size = 32


Saving model checkpoint to uspppm/checkpoint-912
Configuration saved in uspppm/checkpoint-912/config.json
Model weights saved in uspppm/checkpoint-912/pytorch_model.bin
tokenizer config file saved in uspppm/checkpoint-912/tokenizer_config.json
Special tokens file saved in uspppm/checkpoint-912/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 7295
  Batch size = 32
Saving model checkpoint to uspppm/checkpoint-1824
Configuration saved in uspppm/checkpoint-1824/config.json
Model weights saved in uspppm/checkpoint-1824/pytorch_model.bin
tokenizer config file saved in uspppm/checkpoint-1824/tokenizer_config.json
Special tokens file saved in uspppm/checkpoint-1824/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 7295
  Batch size = 32
Saving model checkpoint to uspppm/checkpoint-2736
Configuration saved in uspppm/checkpoint-2736/config.json
Model weights saved in uspppm/checkpoint-2736/pytorch_model.bin
tokenizer config file saved in uspppm/check

TrainOutput(global_step=4560, training_loss=0.030024773588306027, metrics={'train_runtime': 1536.6571, 'train_samples_per_second': 94.94, 'train_steps_per_second': 2.967, 'total_flos': 1517723995694376.0, 'train_loss': 0.030024773588306027, 'epoch': 5.0})
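
The counts reported in the logs above can be checked with a little arithmetic. This is a sketch under two assumptions: that the 20% evaluation split is rounded up to a whole row, and that the last partial batch of each epoch counts as one optimization step. All input numbers are taken from the notebook output; the variable names are only for this example.

```python
import math

total_rows = 36473        # rows in train.csv
test_size = 0.2           # fraction passed to train_test_split
batch_size = 32           # config.batch_size
epochs = 5                # config.epochs
train_runtime = 1536.6571 # seconds, from TrainOutput

# Assumption: the evaluation split size is rounded up.
eval_rows = math.ceil(total_rows * test_size)
train_rows = total_rows - eval_rows

# Assumption: a final partial batch still counts as one step.
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs

samples_per_second = train_rows * epochs / train_runtime
steps_per_second = total_steps / train_runtime

if __name__ == "__main__":
    print("eval rows:", eval_rows)
    print("train rows:", train_rows)
    print("steps per epoch:", steps_per_epoch)
    print("total steps:", total_steps)
    print("samples/s:", round(samples_per_second, 2))
    print("steps/s:", round(steps_per_second, 3))
```

Under these assumptions the values line up with the log: 7295 evaluation examples, 29178 training examples, a checkpoint every 912 steps, and 4560 optimization steps in total.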

Let's make predictions on the test set.

In [26]:
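def test_preprocess(ds, eval=False):
  # Same encoding as for training: section title + anchor, paired with the target.
  context = ds["context"][0]
  prefix = sections[context]
  anchor = ds["anchor"]

  # The test set has no scores, so -1 is used as a placeholder label.
  return {
      **tokenizer(prefix + anchor, ds["target"]),
      "label": -1,
  }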

In [None]:
test = datasets.Dataset.from_csv("test.csv")
encoded_test = test.map(test_preprocess, remove_columns=["id", "anchor", "target", "context"])

In [28]:
outputs = trainer.predict(encoded_test)
predictions = outputs.predictions.reshape(-1)

***** Running Prediction *****
  Num examples = 36
  Batch size = 32



## Submission

In [37]:
submission = datasets.Dataset.from_dict({"id": test["id"], "score": predictions})
submission.to_csv('submission.csv', index = False)

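# Submit the file to Kaggle. As noted in the introduction, this is a code competition,
# so a direct file upload through the CLI may be rejected.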
!kaggle competitions submit us-patent-phrase-to-phrase-matching -f "submission.csv" -m 'Yeah! I submit my file through the Google Colab!'

Creating CSV from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

100% 1.00k/1.00k [00:01<00:00, 633B/s]
400 - Bad Request
