- Kindly understand that English is not my mother tongue.


# Table of Contents

<a id="toc"></a>
- [1. The Goal and Explanation for This Competition ](#1)
    
- [2. Data Structure ](#2)

    - [2.1 Example data for explanation ](#2.1)
    
    - [2.2 Real Data Sample ](#2.2)
    
    - [2.3 What is Context ( CPC Scheme ) ](#2.3)
    
    - [2.4 What is similarity score?](#2.4)

- [3. How to evaluate?](#3)

- [4. ETC](#4)
    

<a id="1"></a>
### 1. The Goal and Explanation for This Competition

- The USPTO (United States Patent and Trademark Office) holds a very large number of patent documents.

- When a **keyword is searched** in the patent documents, the match seems to **be used to find similar patents.**

- The goal of this competition is to **determine the similarity** between **the searched keyword** and **a specific keyword in the patent document.**

- Therefore, we must score the similarity between **the phrase being searched (anchor)** and **the phrase it is compared against (target)**.



<a id="2"></a>
### 2. Data Structure

<a id="2.1"></a>

#### 2.1 Example data for explanation

- This is fake data, made up for easier understanding.

- id : Unique identifier of each row

- anchor : keyword to search

- target : keyword to compare in the patent document

- context : Category of the patent

    - Z : Material
    
    - Z19 : Metal
    
    - Z20 : Fabric
    
    - ...

- score : Similarity score between anchor and target in the context 

|  id   |      anchor      |  target | context | score |
|:----------:|:--------------------:|:----------:|:----:| :----:|
| 37d61fd2272659b2 |  Rigid Material | Rigid Materials | Z19 ( Metal ) | 1.0 |
| 37d61fd2272659b3 |  Rigid Material | Material with rigidity | Z19 ( Metal ) | 0.75 |
| 37d61fd2272659b1 |  Rigid Material | Steel | Z19 ( Metal ) | 0.5 |
| 37d61fd2272659b4 |  Rigid Material | Flexible Material | Z19 ( Metal ) | 0.25 |
| 37d61fd2272659b5 |  Rigid Material | Nylon | Z19 ( Metal ) | 0.25 |
| 37d61fd2272659b6 |  Rigid Material | Office furniture | Z19 ( Metal ) | 0.00 |
| 37d61fd2272659b8 |  Rigid Material | Rigid Materials | Z20 ( Fabric ) | 1.0 |
| 37d61fd2272659b9 |  Rigid Material | Material with rigidity | Z20 ( Fabric ) | 0.75 |
| 37d61fd2272659b7 |  Rigid Material | Nylon | Z20 ( Fabric ) | 0.5 |
| 37d61fd2272659ba |  Rigid Material | Flexible material | Z20 ( Fabric ) | 0.25 |
| 37d61fd2272659bb |  Rigid Material | Steel | Z20 ( Fabric ) | 0.25 |
| 37d61fd2272659bc |  Rigid Material | Office furniture | Z20 ( Fabric ) | 0.00 |
| ... |  ... | ... | ... | ... |


<a id="2.2"></a>
#### 2.2 Real Data Sample

In [None]:
import os
import pandas as pd

INPUT_DIR = '../input/us-patent-phrase-to-phrase-matching/'

train = pd.read_csv(INPUT_DIR+'train.csv')

display(train.head())

<a id="2.3"></a>

#### 2.3 What is Context ( CPC Scheme )

- The CPC (Cooperative Patent Classification) scheme is a table that classifies patents by their technical characteristics.

- A full CPC code is very detailed, such as ***'A01B 15/12'***, but this competition uses only the first three characters.

- Reading from the first character, the code means ***Section(A) > Class(01) > Subclass(B) > Group(15/12)***.

- The competition uses only ***Section and Class***.

- You can see details as below : https://www.uspto.gov/web/patents/classification/cpc/html/cpc.html

- ***CPC-Section***

|  Section   |      content      | 
|:----------:|:--------------------:|
|A |HUMAN NECESSITIES|
|B |PERFORMING OPERATIONS; TRANSPORTING|
|C |CHEMISTRY; METALLURGY| 
|D |TEXTILES; PAPER| 
|E |FIXED CONSTRUCTIONS| 
|F |MECHANICAL ENGINEERING; LIGHTING; HEATING; WEAPONS; BLASTING| 
|G |PHYSICS| 
|H |ELECTRICITY| 
|Y |GENERAL TAGGING OF NEW TECHNOLOGICAL ...| 

- ***CPC Class in section A***

|  Class   |      content      |   
|:----------:|:--------------------:|
| A01 | AGRICULTURE; FORESTRY; ANIMAL HUSBANDRY; HUNTING; TRAPPING; FISHING |
| A21 | BAKING; EDIBLE DOUGHS |  
| A22 | BUTCHERING; MEAT TREATMENT; PROCESSING POULTRY OR FISH |
| A23 | FOODS OR FOODSTUFFS; TREATMENT THEREOF, NOT COVERED BY OTHER CLASSES | 
| ... | ... | 
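
To make the Section > Class > Subclass > Group structure concrete, here is a minimal sketch that splits a full code such as 'A01B 15/12' and keeps only the part used as `context`. The names `CpcCode`, `parse_cpc` and `to_context`, and the regular expression itself, are my own assumptions for illustration; they are not part of the competition data or any official tool.

```python
import re
from typing import NamedTuple


class CpcCode(NamedTuple):
    section: str   # one letter, e.g. 'A'
    class_: str    # two digits, e.g. '01'
    subclass: str  # one letter, e.g. 'B'
    group: str     # e.g. '15/12'


# Assumed pattern for codes written like 'A01B 15/12'.
_CPC_PATTERN = re.compile(r'^([A-HY])(\d{2})([A-Z])\s*(\d+/\d+)$')


def parse_cpc(code: str) -> CpcCode:
    """Split a full CPC code into its four parts."""
    match = _CPC_PATTERN.match(code.strip())
    if match is None:
        raise ValueError(f'unrecognized CPC code: {code!r}')
    return CpcCode(*match.groups())


def to_context(code: str) -> str:
    """Keep only Section + Class, i.e. the first three characters."""
    parsed = parse_cpc(code)
    return parsed.section + parsed.class_


print(parse_cpc('A01B 15/12'))
print(to_context('A01B 15/12'))  # A01
```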

<a id="2.4"></a>

#### 2.4 What is similarity score?

- 1.0 : Very close match ( e.g. singular vs. plural, or added / removed stop words such as "and", "or", etc. )

- 0.75 : Close synonym ( synonyms, abbreviations and their full forms, etc. )

- 0.5 :  Synonyms which don’t have the same meaning (same function, same properties). This includes broad-narrow (hyponym) and narrow-broad (hypernym) matches.

- 0.25 : Somewhat related, e.g. the two phrases are in the same high level domain but are not synonyms. This also includes antonyms.

- 0.0 : Unrelated.

##### Example data for score.

- Z : Material

|  id   |      anchor      |  target | context | score |
|:----------:|:--------------------:|:----------:|:----:| :----:|
| 37d61fd2272659b2 |  Rigid Material | Rigid Materials | Z19 ( Metal ) | 1.0 |
| 37d61fd2272659b3 |  Rigid Material | Material with rigidity | Z19 ( Metal ) | 0.75 |
| 37d61fd2272659b1 |  Rigid Material | Steel | Z19 ( Metal ) | 0.5 |
| 37d61fd2272659b4 |  **Rigid Material** | **Flexible Material** | **Z19 ( Metal )** | 0.25 |
| 37d61fd2272659b5 |  **Rigid Material** | **Nylon** | **Z19 ( Metal )** | 0.25 |
| 37d61fd2272659b6 |  Rigid Material | Office furniture | Z19 ( Metal ) | 0.00 |
| 37d61fd2272659b8 |  Rigid Material | Rigid Materials | Z20 ( Fabric ) | 1.0 |
| 37d61fd2272659b9 |  Rigid Material | Material with rigidity | Z20 ( Fabric ) | 0.75 |
| 37d61fd2272659b7 |  Rigid Material | Nylon | Z20 ( Fabric ) | 0.5 |
| 37d61fd2272659ba |  Rigid Material | Flexible material | Z20 ( Fabric ) | 0.25 |
| 37d61fd2272659bb |  Rigid Material | Steel | Z20 ( Fabric ) | 0.25 |
| 37d61fd2272659bc |  Rigid Material | Office furniture | Z20 ( Fabric ) | 0.00 |
| ... |  ... | ... | ... | ... |


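- Nylon belongs to the Material category ( Z section ) but is not a metal. So, Rigid Material and Nylon are only somewhat related in the Z19 ( Metal ) context, and the score is 0.25.

- Rigid Material and Flexible Material are antonyms. So, the score is 0.25.

- Both the section and the class of the context are important: the same anchor and target can get different scores in different contexts.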
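
As a small illustration of why the context matters, the sketch below groups a few rows of the fake example table by (anchor, target) and keeps the pairs whose score changes with the context. The variable names are hypothetical and the rows are copied from the made-up table, not from the real training data.

```python
from collections import defaultdict

# (anchor, target, context, score) rows copied from the fake example table.
rows = [
    ('Rigid Material', 'Steel', 'Z19', 0.5),
    ('Rigid Material', 'Nylon', 'Z19', 0.25),
    ('Rigid Material', 'Office furniture', 'Z19', 0.0),
    ('Rigid Material', 'Nylon', 'Z20', 0.5),
    ('Rigid Material', 'Steel', 'Z20', 0.25),
    ('Rigid Material', 'Office furniture', 'Z20', 0.0),
]

# Collect the score of every pair under each context.
scores_by_pair = defaultdict(dict)
for anchor, target, context, score in rows:
    scores_by_pair[(anchor, target)][context] = score

# Keep only the pairs whose score differs between contexts.
context_dependent = {
    pair: by_context
    for pair, by_context in scores_by_pair.items()
    if len(set(by_context.values())) > 1
}

for (anchor, target), by_context in context_dependent.items():
    print(f'{anchor} / {target}: {by_context}')
```

In this fake data, Steel and Nylon swap scores between Z19 and Z20, while Office furniture is unrelated in both.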

<a id="3"></a>

### 3. How to evaluate?

- Submissions are evaluated with the Pearson correlation coefficient between the predicted and the actual similarity scores :


$$
    \mathrm{PCC}(X, Y)=\frac{\mathrm{COV}(X, Y)}{\mathrm{SD}_{x} \cdot \mathrm{SD}_{y}}
$$

$$
    \mathrm{PCC}(X, Y)= \frac{\sum_{i=1}^{n}\left(x_{i}-m_{x}\right)\left(y_{i}-m_{y}\right)}{\sqrt{\sum_{i=1}^{n}\left(x_{i}-m_{x}\right)^{2} \sum_{i=1}^{n}\left(y_{i}-m_{y}\right)^{2}}}
$$


- It is a measure of the linear correlation between two sets of data.

- PCC(X,Y) == 1.0 : X and Y have a perfect positive linear relationship, with all data points lying on a line with positive slope.

- PCC(X,Y) == 0.0 : There is no linear dependency between the variables.

- PCC(X,Y) == -1.0 : X and Y have a perfect negative linear relationship, with all data points lying on a line with negative slope.

In [None]:
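import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    eps = 1e-10  # avoids division by zero when x or y is constant

    # Deviations from the mean.
    dx = x - np.mean(x)
    dy = y - np.mean(y)

    # Numerator: sum of the products of the deviations.
    cov = np.sum(dx * dy)

    # Denominator: square root of the product of the sums of squares.
    div = np.sqrt(np.sum(dx * dx) * np.sum(dy * dy)) + eps

    r = cov / div

    return r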



In [None]:

uniform_distribution = np.array([0.5, 0.5, 0.5, 0.5, 0.5]) # the same constant prediction for every row

predicted = np.array([0.1, 0.4, 0.4, 0.5, 0.9]) # predicted score example

label = np.array([0.0, 0.25, 0.5, 0.75, 1.0]) # ground truth (GT)

reversed_label = np.array([1.0, 0.75, 0.5, 0.25, 0.0]) # reversed GT

print('pearson(uniform_distribution, label) : ', f'{pearson(uniform_distribution,label):.3f}')
print()
print('pearson(predicted, label) : ', f'{pearson(predicted, label):.3f}')
print()
print('pearson(label, label) : ', f'{pearson(label,label):.3f}')
print()
print('pearson(label, reversed_label) : ', f'{pearson(label, reversed_label):.3f}')
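
Because the metric only measures linear correlation, rescaling or shifting the predictions with a positive factor does not change it. The sketch below checks this with a plain-Python helper; `pearson_py` is a hypothetical name of mine and, unlike the function above, it has no epsilon term.

```python
import math


def pearson_py(x, y):
    """Plain-Python Pearson correlation (no epsilon term)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)


label = [0.0, 0.25, 0.5, 0.75, 1.0]
predicted = [0.1, 0.4, 0.4, 0.5, 0.9]

# Apply a positive linear transform to the predictions: 2 * p + 3.
rescaled = [2.0 * p + 3.0 for p in predicted]

r_original = pearson_py(predicted, label)
r_rescaled = pearson_py(rescaled, label)

print(f'original : {r_original:.3f}')
print(f'rescaled : {r_rescaled:.3f}')
```

Both values agree up to floating-point error, so what matters for the score is how the predictions move linearly with the labels, not their absolute scale.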

<a id="4"></a>

### 4. ETC

#### 4.1 Distribution for each score

In [None]:
#x = train['score'].hist()
#display(train['score'].value_counts())
import matplotlib.pyplot as plt
keys = [0.00, 0.25, 0.5, 0.75, 1.0]
cnt = train['score'].value_counts()
counters = []
for key in keys :
    counters.append(cnt[key])

plt.bar(range(len(keys)), counters)
plt.xticks(range(len(keys)), keys)
plt.ylabel('freq')
plt.xlabel('score')
plt.show()

#### 4.2 Distribution for each section.

In [None]:

keys = ['A','B','C','D','E','F','G','H' ]
# The first character of the context is the CPC section.
cnt = train['context'].apply(lambda x: x[0]).value_counts()
counters = []
for key in keys :
    counters.append(cnt[key])
plt.bar(range(len(keys)), counters)
plt.xticks(range(len(keys)), keys)
plt.ylabel('freq')
plt.xlabel('context(Section)')
plt.show()


#### 4.3 External Data

##### ⓐ Big patent Dataset ( https://www.tensorflow.org/datasets/catalog/big_patent )

- size : 10 GB.

- It provides the abstract and the description of patents for each CPC code.

<br>

##### ⓑ Language models ( LM ) for patents on Hugging Face

- https://huggingface.co/google/bigbird-pegasus-large-bigpatent

- https://huggingface.co/anferico/bert-for-patents

- https://huggingface.co/AI-Growth-Lab/PatentSBERTa

- Unfortunately, I could not find a DeBERTa model trained on patents.

#### 4.4 EDA with Jaccard Similarity

$$
    \mathrm{JaccardSimilarity}(X, Y)=\frac{|X \cap Y|}{|X \cup Y|}
$$

- 4.4.1 Remove the words 'and' and 'or' from anchor and target.

- 4.4.2 Split each anchor and target into words on whitespace.

- 4.4.3 Calculate the Jaccard similarity of the two word sets.

- 4.4.4 Display the rows where Jaccard similarity == 1.0 and train['score'] != 1.0.

- Beware of word-order inversion: a pair can contain exactly the same words ( Jaccard similarity : 1.0 ) and still have a score between 0.25 and 0.75.



In [None]:
STOP_WORDS = {'and', 'or'}

for index, row in train.iterrows():
    # Split on whitespace first, then drop the stop words,
    # so that neighbouring words are never glued together.
    anchor = set(row['anchor'].split()) - STOP_WORDS
    target = set(row['target'].split()) - STOP_WORDS
    score = row['score']
    union = anchor | target
    jaccard = len(anchor & target) / len(union) if union else 0.0
    # Same word set, but not labeled as a very close match.
    if score < 0.95 and jaccard > 0.95: 
        print(f'score({score:<5})\tjaccard : ({jaccard:^5}) \t', f"anchor : {row['anchor']:<30} target : {row['target']:<30} context : {row['context']:<30}"  )        
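
For a quick check without the training data, the steps above can be sketched as one small function. The helper names `tokenize` and `jaccard`, the lowercasing, and the example phrases are assumptions of mine for illustration.

```python
STOP_WORDS = {'and', 'or'}


def tokenize(phrase):
    """Lowercase, split on whitespace, and drop 'and' / 'or'."""
    return {word for word in phrase.lower().split() if word not in STOP_WORDS}


def jaccard(anchor, target):
    """Jaccard similarity of the two word sets (0.0 if both are empty)."""
    a, b = tokenize(anchor), tokenize(target)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Made-up phrases, not rows from train.csv.
print(jaccard('rigid material', 'material rigid'))      # 1.0, word order is ignored
print(jaccard('rigid material', 'rigid and material'))  # 1.0, 'and' is dropped
print(jaccard('rigid material', 'flexible material'))   # 1 shared word out of 3
```

The first call shows the inversion problem: the word sets are identical, so the similarity is 1.0 even though the phrases may not mean the same thing.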

    
    
    


### 5. Input and Output for classification

#### 5.1  Input

    - Input format : "<s>anchor<sep>target<sep><context text></s>"
    
    - Example data > anchor : "abatement", target : "abatement of pollution", context : A23, score : 0.50
    
    - Input for LM > <s>abatement<sep>abatement of pollution<sep>HUMAN NECESSITIES; FOODS OR FOODSTUFFS; TREATMENT THEREOF, NOT COVERED BY OTHER CLASSES</s>
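
The input format above can be sketched as a small string builder. `SECTION_TITLES`, `CLASS_TITLES`, `context_text` and `build_input` are hypothetical names, and the two dictionaries hold only the entries needed for the example. In practice a tokenizer would normally insert its own special tokens, so the literal `<s>`, `<sep>` and `</s>` markers here are only illustrative.

```python
# Only the entries needed for the example; a real table would cover every code.
SECTION_TITLES = {'A': 'HUMAN NECESSITIES'}
CLASS_TITLES = {
    'A23': 'FOODS OR FOODSTUFFS; TREATMENT THEREOF, NOT COVERED BY OTHER CLASSES',
}


def context_text(context):
    """Join the section title and the class title of a context such as 'A23'."""
    return f'{SECTION_TITLES[context[0]]}; {CLASS_TITLES[context]}'


def build_input(anchor, target, context, bos='<s>', sep='<sep>', eos='</s>'):
    """Concatenate anchor, target and context text into one input string."""
    return f'{bos}{anchor}{sep}{target}{sep}{context_text(context)}{eos}'


text = build_input('abatement', 'abatement of pollution', 'A23')
print(text)
```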
    
#### 5.2 Output :

    - Classification : 
    
        - LM ( input )
        
        - Get sentence embedding from cls token
    
        - Linear( hidden_dim, 5 ) # 0 : 0.0, 1 : 0.25, 2 : 0.5, 3 : 0.75, 4 : 1.0

        - MSE(label, pred)

    - Regression : 
    
        - LM ( input )
        
        - Get sentence embedding from cls token
    
        - Linear( hidden_dim, 1 ) # score regression

        - MSE(label, pred) # e.g. label = 0.5 for the example data above
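
To show how the two output heads relate to the score, here is a minimal sketch in plain Python. The class-to-score mapping follows the comment on the 5-way linear layer. The argmax decoding, the helper names, and all numbers are made-up assumptions for illustration, not the output of a real model.

```python
SCORES = [0.0, 0.25, 0.5, 0.75, 1.0]  # class index i corresponds to SCORES[i]


def score_to_class(score):
    """Map a similarity score to its class index (0.5 -> 2)."""
    return SCORES.index(score)


def class_to_score(index):
    """Map a class index back to its similarity score (2 -> 0.5)."""
    return SCORES[index]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


def mse(labels, preds):
    """Mean squared error between labels and predictions."""
    return sum((l - p) ** 2 for l, p in zip(labels, preds)) / len(labels)


# Classification head: made-up 5 outputs of Linear(hidden_dim, 5).
logits = [-1.2, 0.3, 2.1, 0.4, -0.5]
predicted_score = class_to_score(argmax(logits))
print(predicted_score)  # 0.5

# Regression head: made-up outputs of Linear(hidden_dim, 1) for two rows.
regression_labels = [0.5, 1.0]
regression_preds = [0.4, 0.9]
loss = mse(regression_labels, regression_preds)
print(f'{loss:.4f}')
```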
    