# Note
Special thanks
- https://www.kaggle.com/code/remekkinas/quick-look-into-data-eda
- https://www.kaggle.com/code/phantivia/uspppm-huggingface-train-inference-baseline

# EDA

In [None]:
import pandas as pd
from termcolor import colored
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from wordcloud import WordCloud, STOPWORDS
import numpy as np

### EVALUATION METRIC

**Pearson correlation** is a coefficient that measures the strength of the linear relationship between two random variables. Its value ranges from -1 to +1. It is the covariance of the two variables normalized by the product of their standard deviations.

$$
\operatorname{PCC}(X, Y)=\frac{\operatorname{COV}(X, Y)}{SD_{x} \cdot SD_{y}}
$$
```
X, Y: Two random variables
COV(): covariance
SD: standard deviation
```
About Covariance:
$$
\operatorname{COV}(X, Y)=\frac{1}{n} * \sum_{i=1}^{n}\left(\left(X_{i}-\bar{X}\right) *\left(Y_{i}-\bar{Y}\right)\right)
$$
```
X, Y: Two random variables
X_bar: mean of random variable X
Y_bar: mean of random variable Y
n: number of paired observations of X and Y
```
About standard deviation:
$$
SD_{x}=\sqrt{\frac{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}}{n}}
$$
```
X: a random variable
X_bar: mean of random variable X
n: number of observations of X
```
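
As a quick check of the formulas above, here is a minimal sketch that computes the coefficient directly from the covariance and the standard deviations and compares it with `np.corrcoef`. The function name `pearson_corr` and the sample values are made up for illustration; they are not part of the competition data.

```python
import numpy as np

def pearson_corr(x, y):
    """Pearson correlation from the covariance and the standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Population covariance: mean of the products of deviations (divides by n)
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    # np.std uses ddof=0 by default, i.e. it also divides by n
    return cov / (x.std() * y.std())

# Hypothetical true scores and predictions, for illustration only
y_true = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
y_pred = np.array([0.1, 0.2, 0.6, 0.7, 0.9])

manual = pearson_corr(y_true, y_pred)
reference = np.corrcoef(y_true, y_pred)[0, 1]
print(manual, reference)
```

The two values should agree up to floating-point error, because the factors of `n` cancel between the numerator and the denominator.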



The host provided two files: the training and test datasets.

In [None]:
train_df = pd.read_csv("../input/us-patent-phrase-to-phrase-matching/train.csv")
test_df = pd.read_csv("../input/us-patent-phrase-to-phrase-matching/test.csv")

In [None]:
print(f"Shape (rows, columns) of TRAIN: {colored(train_df.shape, 'yellow')}")
print(f"Shape (rows, columns) of TEST: {colored(test_df.shape, 'yellow')}")

Let's look at the first 20 observations in the training dataset.

In [None]:
train_df.head(20)

In this dataset, you are presented with pairs of phrases (an **anchor and a target phrase**) and asked to rate how similar they are on a scale from 0 (not at all similar) to 1 (identical in meaning). This challenge differs from a standard semantic similarity task in that similarity has been scored here within a patent's context, specifically its CPC classification (version 2021.05), which indicates the subject to which the patent relates. For example, while the phrases "bird" and "Cape Cod" may have low semantic similarity in normal language, the likeness of their meaning is much closer if considered in the context of "house".

This is a code competition, in which you will submit code that will be run against an unseen test set. The unseen test set contains approximately 12k pairs of phrases. A small public test set has been provided for testing purposes, but is not used in scoring.

- id 
    - a unique identifier for a pair of phrases
- anchor 
    - the first phrase
- target 
    - the second phrase
- context 
    - the CPC classification (version 2021.05), which indicates the subject within which the similarity is to be scored
- score 
    - the similarity. This is sourced from a combination of one or more manual expert ratings.


## Train Data

In [None]:
train_df.groupby(['anchor']).count()['target'].sort_values()

In [None]:
print(f"Number of unique values in ANCHOR column: {colored(train_df.anchor.nunique(), 'yellow')}")
# train_df.anchor.nunique()

In [None]:
print(f"Number of unique values in TARGET column: {colored(train_df.target.nunique(), 'yellow')}")

In [None]:
temp = train_df.groupby(["anchor"])["target"].nunique()
df = pd.DataFrame({'anchor': temp.index,
                   'target': temp.values
                  })
df = df.sort_values(['target'], ascending=False)
print(f"Average number of unique values in TARGET column per ANCHOR: {colored(round(df.target.mean(), 3), 'yellow')}")

The number of unique anchors is 733.

The number of unique targets is 29340.

On average, each anchor has about 46 unique targets.
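
The per-anchor average above comes from `groupby(...).nunique()`, which counts distinct targets rather than rows. A minimal sketch on a toy table shows the difference; the phrases and the name `toy_df` are invented for illustration and are not rows from `train.csv`.

```python
import pandas as pd

# Hypothetical toy data: one (anchor, target) pair appears twice,
# for example because it was scored under two different contexts
toy_df = pd.DataFrame({
    "anchor": ["abatement", "abatement", "abatement", "absorbent", "absorbent"],
    "target": ["noise reduction", "noise reduction", "pollution control", "sponge", "towel"],
})

# Rows per anchor counts repeated pairs
rows_per_anchor = toy_df.groupby("anchor")["target"].count()
# Distinct targets per anchor ignores repeated pairs
targets_per_anchor = toy_df.groupby("anchor")["target"].nunique()
avg_targets = targets_per_anchor.mean()

print(rows_per_anchor.to_dict())
print(targets_per_anchor.to_dict())
print(avg_targets)
```

If a pair repeats within an anchor, the row count and the distinct-target count diverge, which is why the two statistics are worth reporting separately.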

In [None]:
# pd.options.display.max_rows = None
train_df.groupby(['anchor','target']).count()['id'].head(30)

In [None]:
temp = train_df.groupby(["anchor"])["target"].nunique()
df = pd.DataFrame({'anchor': temp.index,
                   'target': temp.values
                  })
df = df.sort_values(['target'], ascending=False)[0:50]
plt.figure(figsize = (15,6))
plt.title('Number of unique targets per anchor (top 50 anchors)')
sns.set_color_codes("pastel")
s = sns.barplot(x = 'anchor', y="target", data=df)
plt.xticks(rotation=90)  # rotate the anchor labels so they stay readable
plt.show()

In [None]:
temp = train_df.groupby(["anchor"])["target"].nunique()
df = pd.DataFrame({'anchor': temp.index,
                   'target': temp.values
                  })
df = df.sort_values(['target'], ascending=False)[50:100]
plt.figure(figsize = (15,6))
plt.title('Number of unique targets per anchor (anchors ranked 51-100)')
sns.set_color_codes("pastel")
s = sns.barplot(x = 'anchor', y="target", data=df)
plt.xticks(rotation=90)  # rotate the anchor labels so they stay readable
plt.show()

In [None]:
temp = train_df.groupby(["anchor"])["target"].nunique()
df = pd.DataFrame({'anchor': temp.index,
                   'target': temp.values
                  })
df = df.sort_values(['target'], ascending=False)
df.head(20)

## Context

In [None]:
train_df['context'].value_counts()

In [None]:
temp = train_df.groupby(["anchor"])["context"].nunique()
df = pd.DataFrame({'anchor': temp.index,
                   'context': temp.values
                  })
df = df.sort_values(['context'], ascending=False)[0:50]
plt.figure(figsize = (15,6))
plt.title('Number of unique contexts per anchor (top 50 anchors)')
sns.set_color_codes("pastel")
s = sns.barplot(x = 'anchor', y="context", data=df)
plt.xticks(rotation=90)  # rotate the anchor labels so they stay readable
plt.show()

In [None]:
# df = train_df.groupby('anchor').agg(['min', 'count'])
df = train_df.groupby('anchor').nunique().head(20)
df

In [None]:
test_df.head(10)

### ANCHOR COLUMN - the first phrase

In [None]:
print(f"Number of unique values in ANCHOR column: {colored(train_df.anchor.nunique(), 'yellow')}")

In [None]:
# TOP 20 anchors values
train_df.anchor.value_counts().head(20)

In [None]:
anchor_desc = train_df[train_df.anchor.notnull()].anchor.values
stopwords = set(STOPWORDS) 
wordcloud = WordCloud(width = 800, 
                      height = 800,
                      background_color ='white',
                      min_font_size = 10,
                      stopwords = stopwords,).generate(' '.join(anchor_desc)) 

# plot the WordCloud image                        
plt.figure(figsize = (8, 8), facecolor = None) 
plt.imshow(wordcloud) 
plt.axis("off") 
plt.tight_layout(pad = 0) 

plt.show()

### TARGET COLUMN - the second phrase

In [None]:
print(f"Number of unique values in TARGET column: {colored(train_df.target.nunique(), 'yellow')}")

In [None]:
train_df.target.value_counts().head(20)

In [None]:
target_desc = train_df[train_df.target.notnull()].target.values
stopwords = set(STOPWORDS) 
wordcloud = WordCloud(width = 800, 
                      height = 800,
                      background_color ='white',
                      min_font_size = 10,
                      stopwords = stopwords,).generate(' '.join(target_desc)) 

# plot the WordCloud image                        
plt.figure(figsize = (8, 8), facecolor = None) 
plt.imshow(wordcloud) 
plt.axis("off") 
plt.tight_layout(pad = 0) 

plt.show() 

### CONTEXT COLUMN

Source: https://en.wikipedia.org/wiki/Cooperative_Patent_Classification

The first letter is the "section symbol" consisting of a letter from "A" ("Human Necessities") to "H" ("Electricity") or "Y" for emerging cross-sectional technologies. This is followed by a two-digit number to give a "class symbol" ("A01" represents "Agriculture; forestry; animal husbandry; trapping; fishing"). 

* A: Human Necessities
* B: Operations and Transport
* C: Chemistry and Metallurgy
* D: Textiles
* E: Fixed Constructions
* F: Mechanical Engineering
* G: Physics
* H: Electricity
* Y: Emerging Cross-Sectional Technologies

* Hierarchy
    * Section (one letter A to H and also Y)
        * Class (two digits)
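
To make the hierarchy concrete, here is a minimal sketch that splits a context code into its section letter and class digits and looks up the section title. The helper `split_context`, the dictionary name `SECTION_NAMES` and the example codes are assumptions made for illustration.

```python
# Section titles as listed above
SECTION_NAMES = {
    "A": "Human Necessities",
    "B": "Operations and Transport",
    "C": "Chemistry and Metallurgy",
    "D": "Textiles",
    "E": "Fixed Constructions",
    "F": "Mechanical Engineering",
    "G": "Physics",
    "H": "Electricity",
    "Y": "Emerging Cross-Sectional Technologies",
}

def split_context(code):
    """Split a context code such as 'A47' into (section letter, class digits)."""
    code = str(code).strip()
    return code[0], code[1:]

# Hypothetical example codes
for code in ["A47", "H04"]:
    section, cpc_class = split_context(code)
    print(code, "->", section, f"({SECTION_NAMES[section]})", "class", cpc_class)
```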

In [None]:
print(f"Number of unique values in CONTEXT column: {colored(train_df.context.nunique(), 'yellow')}")

In [None]:
train_df.context.value_counts().head(20)

We can create separate columns for **Section** and **Class**:

In [None]:
train_df['section'] = train_df['context'].astype(str).str[0]
train_df['classes'] = train_df['context'].astype(str).str[1:]
train_df.head(10)

In [None]:
print(f"Number of unique SECTIONS: {colored(train_df.section.nunique(), 'yellow')}")
print(f"Number of unique CLASSES: {colored(train_df.classes.nunique(), 'yellow')}")

In [None]:
di = {"A" : "A - Human Necessities", 
      "B" : "B - Operations and Transport",
      "C" : "C - Chemistry and Metallurgy",
      "D" : "D - Textiles",
      "E" : "E - Fixed Constructions",
      "F" : "F - Mechanical Engineering",
      "G" : "G - Physics",
      "H" : "H - Electricity",
      "Y" : "Y - Emerging Cross-Sectional Technologies"}

In [None]:
train_df.replace({"section": di}).section.hist(orientation='horizontal')

In [None]:
train_df.classes.value_counts()

### Score meanings
The scores are in the 0-1 range with increments of 0.25 with the following meanings:

- 1.0 - Very close match. 
    - This is typically an exact match except possibly for differences in conjugation, quantity (e.g. singular vs. plural), and addition or removal of stopwords (e.g. “the”, “and”, “or”).
- 0.75 - Close synonym, e.g. “mobile phone” vs. “cellphone”. 
    - This also includes abbreviations, e.g. "TCP" -> "transmission control protocol".
- 0.5 - Synonyms which don’t have the same meaning (same function, same properties). 
    - This includes broad-narrow (hyponym) and narrow-broad (hypernym) matches.
- 0.25 - Somewhat related, e.g. the two phrases are in the same high level domain but are not synonyms. 
    - This also includes antonyms.
- 0.0 - Unrelated.
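
Since the labels can only take five values, a small sanity check can flag anything that falls off the 0.25 grid. This is a minimal sketch; the helper `on_score_grid`, the dictionary `SCORE_LABELS` and the sample values are assumptions for illustration.

```python
import numpy as np

# Short labels for the five score levels described above
SCORE_LABELS = {
    1.0: "Very close match",
    0.75: "Close synonym",
    0.5: "Synonyms which don't have the same meaning",
    0.25: "Somewhat related",
    0.0: "Unrelated",
}

def on_score_grid(scores, step=0.25):
    """Return a boolean mask: True where a score lies in [0, 1] on a multiple of `step`."""
    scores = np.asarray(scores, dtype=float)
    in_range = (scores >= 0.0) & (scores <= 1.0)
    ratio = scores / step
    on_step = np.isclose(ratio, np.round(ratio))
    return in_range & on_step

# Hypothetical sample: the last value is deliberately off the grid
sample_scores = [0.0, 0.25, 0.5, 0.75, 1.0, 0.3]
mask = on_score_grid(sample_scores)
print(mask)
```

On the real data the same check could be applied to `train_df['score']`; model predictions, by contrast, are continuous and are not expected to land on the grid.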

In [None]:
train_df['score'].hist(bins=20, figsize=(12,8))
plt.grid(False)
plt.title('Number of scores', fontsize=16)
plt.show()

In [None]:
train_df.score.value_counts()

In [None]:
train_df['score'].agg(['min', 'max', 'mean'])

Look at very close matches (score == 1):

In [None]:
train_df[['anchor', 'target', 'section', 'classes', 'score']].replace({"section": di}).query('score==1.0')

Look at unrelated pairs (score == 0):

In [None]:
train_df[['anchor', 'target', 'section', 'classes', 'score']].replace({"section": di}).query('score==0.0')

### SUBMISSION FILE

In [None]:
sub = pd.read_csv("../input/us-patent-phrase-to-phrase-matching/sample_submission.csv")
sub.head(10)

# SUBMISSION TIME
Reference
- https://www.kaggle.com/code/phantivia/uspppm-huggingface-train-inference-baseline/data

In [None]:
import os
import datasets, transformers

from transformers import TrainingArguments, Trainer
from transformers import AutoModelForSequenceClassification, AutoTokenizer


import numpy as np

os.environ["WANDB_DISABLED"] = "true"

## Config

In [None]:
class CFG:
    
    input_path = '../input/us-patent-phrase-to-phrase-matching/'
    model_path = '../input/roberta-base'
    model = 'roberta-base'
    
    learning_rate = 2e-5
    weight_decay = 0.01
    
    epochs = 5
    batch_size = 32
    

## Preprocess

In [None]:
table = """
A: Human Necessities
B: Operations and Transport
C: Chemistry and Metallurgy
D: Textiles
E: Fixed Constructions
F: Mechanical Engineering
G: Physics
H: Electricity
Y: Emerging Cross-Sectional Technologies
"""
# Parse each "X: Title" line into a {section letter: section title} mapping
splits = [i for i in table.split('\n') if i != '']
table = {e.split(': ', 1)[0]: e.split(': ', 1)[1] for e in splits}
table

## Load model and tokenizer

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(CFG.model_path, num_labels=1)

tokenizer = AutoTokenizer.from_pretrained(CFG.model_path)

## Load Dataset

In [None]:
train = datasets.Dataset.from_csv(CFG.input_path + 'train.csv')
train

## Tokenize

In [None]:
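def process(unit, eval = False):
    # Look up the CPC section title from the first letter of the context code
    sig = unit['context'][0]
    prefix = table[sig]
    text = unit['anchor']
    
    # Encode (section title + anchor, target) as a sentence pair. The title and
    # the anchor are joined without a separator, exactly as in test_process.
    return {
        **tokenizer( prefix + text, unit['target']),
        'label':unit['score']
    }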

encoded_ds = train.map(process, remove_columns= ['id', 'anchor', 'target', 'context', 'score'])
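
To see what text actually reaches the tokenizer, here is a minimal sketch of the first-segment construction without loading any model. The helper `build_first_segment`, its `sep` parameter, the two-entry `SECTION_TITLES` dictionary and the example row are all assumptions for illustration; `process` above has no separator option.

```python
# Subset of the section table, enough for the example
SECTION_TITLES = {"A": "Human Necessities", "H": "Electricity"}

def build_first_segment(context, anchor, sep=""):
    """Join the section title and the anchor; sep="" mirrors plain concatenation."""
    return SECTION_TITLES[context[0]] + sep + anchor

# Hypothetical row, not taken from train.csv
row = {"context": "A47", "anchor": "abatement", "target": "abatement of pollution"}

as_concatenated = build_first_segment(row["context"], row["anchor"])
with_space = build_first_segment(row["context"], row["anchor"], sep=" ")

print(repr(as_concatenated))
print(repr(with_space))
```

Plain concatenation fuses the last word of the title with the first word of the anchor. Whether adding a separator helps the score is an open question here; if one is tried, the same choice has to be made for both training and test preprocessing so the inputs stay consistent.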

In [None]:
encoded_ds = encoded_ds.train_test_split(test_size=0.1)
encoded_ds

## Training

In [None]:
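def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    
    # With num_labels=1 the model emits one value per example; flatten to 1-D
    predictions = predictions.reshape(len(predictions))
    # Off-diagonal entry of the 2x2 correlation matrix is the Pearson coefficient
    return {
        'pearson': np.corrcoef(predictions, labels)[0][1]
    }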


args = TrainingArguments(
    "uspppm",
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=CFG.learning_rate,
    per_device_train_batch_size=CFG.batch_size,
    per_device_eval_batch_size=CFG.batch_size,
    num_train_epochs=CFG.epochs,
    weight_decay=CFG.weight_decay,
    load_best_model_at_end=True,
    metric_for_best_model="pearson",
)

trainer = Trainer(
    model,
    args,
    train_dataset=encoded_ds["train"],
    eval_dataset=encoded_ds["test"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)
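
The metric function can be exercised on its own before starting a training run. The sketch below is a standalone variant under the assumption that predictions arrive as an `(n, 1)` array, which is what a single-output regression head typically produces; the name `pearson_metric` and the numbers are made up for illustration.

```python
import numpy as np

def pearson_metric(eval_pred):
    """Standalone variant of compute_metrics: flatten both inputs, return Pearson r."""
    predictions, labels = eval_pred
    predictions = np.asarray(predictions).reshape(-1)
    labels = np.asarray(labels).reshape(-1)
    return {"pearson": float(np.corrcoef(predictions, labels)[0, 1])}

# Hypothetical model outputs with shape (4, 1) and their true scores
logits = np.array([[0.1], [0.4], [0.35], [0.8]])
labels = np.array([0.0, 0.5, 0.25, 1.0])

result = pearson_metric((logits, labels))
# Rescaling and shifting the predictions leaves the correlation unchanged
rescaled = pearson_metric((2.0 * logits + 1.0, labels))
print(result, rescaled)
```

Because Pearson correlation is invariant to a positive linear rescaling of the predictions, the regression outputs do not need to lie exactly in the 0-1 range for the metric to be meaningful.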

In [None]:
trainer.evaluate()

In [None]:
trainer.train()

## Prediction

In [None]:
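def test_process(unit, eval = False):
    # Same input construction as process, so test inputs match training inputs
    sig = unit['context'][0]
    prefix = table[sig]
    text = unit['anchor']
    
    # The test set has no scores; -1 is only a placeholder label
    return {
        **tokenizer( prefix + text, unit['target']),
        'label':-1
    }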



test = datasets.Dataset.from_csv(CFG.input_path + 'test.csv')

encoded_test = test.map(test_process, remove_columns= ['id', 'anchor', 'target', 'context'])

outputs = trainer.predict(encoded_test)
predictions = outputs.predictions.reshape(-1)

In [None]:
submission = datasets.Dataset.from_dict({
    'id': test['id'],
    'score': predictions,
})

submission.to_csv('submission.csv', index=False)