## About the data

The U.S. Patent and Trademark Office (USPTO) offers one of the largest repositories of scientific, technical, and commercial information in the world through its Open Data Portal. Patents are a form of intellectual property granted in exchange for the public disclosure of new and useful inventions. Because patents undergo an intensive vetting process prior to grant, and because the history of U.S. innovation spans over two centuries and 11 million patents, the U.S. patent archives stand as a rare combination of data volume, quality, and diversity.

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import missingno as msno
import numpy as np
from pathlib import Path
import os
import warnings
warnings.filterwarnings("ignore")

In this competition, you will train your models on a novel semantic similarity dataset to extract relevant information by matching key phrases in patent documents. Determining the semantic similarity between phrases is critically important during the patent search and examination process to determine if an invention has been described before. For example, if one invention claims "television set" and a prior publication describes "TV set", the two phrases should be recognized as referring to the same thing.

## Evaluation metric

The evaluation metric for this competition is the Pearson correlation coefficient between the predicted and actual scores. It is the ratio of the covariance of two variables to the product of their standard deviations, so it is essentially a normalized covariance whose value always lies between −1 and 1.

$$ \rho = \frac{ \text{cov}(pred, target)}{\sigma_{pred}\sigma_{target}} $$
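
As a quick sanity check of the formula above, the following sketch computes the coefficient by hand with NumPy and compares it with `np.corrcoef`. The arrays `pred` and `target` are made-up values for illustration, not competition data, and `pearson` is a hypothetical helper name.

```python
import numpy as np

def pearson(pred, target):
    """Pearson correlation: cov(pred, target) / (std(pred) * std(target))."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    # Population covariance and population standard deviations (ddof=0),
    # so the normalisation factors cancel in the ratio.
    cov = np.mean((pred - pred.mean()) * (target - target.mean()))
    return cov / (pred.std() * target.std())

# Made-up predictions and labels, for illustration only.
pred = np.array([0.1, 0.4, 0.35, 0.8, 0.9])
target = np.array([0.0, 0.25, 0.5, 0.75, 1.0])

rho = pearson(pred, target)
print(rho, np.corrcoef(pred, target)[0, 1])
```

Because the coefficient is unchanged by a positive linear rescaling of the predictions, what matters for this metric is how the predictions vary together with the labels rather than their absolute scale.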

In [None]:
data_path = Path('../input/us-patent-phrase-to-phrase-matching')
os.listdir(data_path)

In [None]:
# Reuse data_path from the previous cell instead of repeating the directory
train_df = pd.read_csv(data_path / "train.csv")
test_df = pd.read_csv(data_path / "test.csv")

In [None]:
train_df.sample(5)

In [None]:
test_df.sample(5)

The training data has five columns; the test data has the same columns except `score`:

`id` - a unique identifier for a pair of phrases

`anchor` - the first phrase

`target` - the second phrase

`context` - the CPC classification, which indicates the subject within which the similarity is to be scored

`score` - the similarity. This is sourced from a combination of **one or more manual expert ratings**.

Let's get some more information:

In [None]:
msno.matrix(train_df, figsize = (10,5))
plt.show()

In [None]:
msno.matrix(test_df, figsize = (10,5))
plt.show()

In [None]:
print("Count of duplicates: ",train_df.duplicated().sum())
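
Because every row carries a unique `id`, `duplicated()` on full rows cannot flag a repeated phrase pair. A minimal sketch of a stricter check, restricted to the `anchor`, `target`, and `context` columns, is shown here on a small made-up frame. `toy_df` and its contents are hypothetical; with the real data one would pass `train_df` instead.

```python
import pandas as pd

# Made-up rows: the first two share anchor, target and context but not id.
toy_df = pd.DataFrame({
    "id": ["a1", "a2", "a3"],
    "anchor": ["abatement", "abatement", "abatement"],
    "target": ["act of abating", "act of abating", "forest region"],
    "context": ["A47", "A47", "A47"],
})

# Full-row comparison includes the unique id, so nothing is flagged.
full_row_dupes = toy_df.duplicated().sum()
# Comparing only the phrase pair and its context ignores the id.
pair_dupes = toy_df.duplicated(subset=["anchor", "target", "context"]).sum()

print("Full-row duplicates:", full_row_dupes)
print("Repeated (anchor, target, context) triples:", pair_dupes)
```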

# Anchor and target columns

In [None]:
train_df['anchorlen'] = train_df['anchor'].str.split().str.len()
train_df['targetlen'] = train_df['target'].str.split().str.len()
test_df['anchorlen'] = test_df['anchor'].str.split().str.len()
test_df['targetlen'] = test_df['target'].str.split().str.len()
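
To make the length feature concrete: `.str.split()` with no arguments splits on runs of whitespace, and `.str.len()` then counts the resulting tokens. The short sketch below applies the same chain to a few made-up phrases (`phrases` is a hypothetical series, not taken from the dataset).

```python
import pandas as pd

# Made-up phrases, for illustration only.
phrases = pd.Series(["abatement", "tv set", "transmission control protocol"])

# Same chain as used for anchorlen / targetlen: split on whitespace, count tokens.
word_counts = phrases.str.split().str.len()
print(word_counts.tolist())
```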

In [None]:
sns.histplot(train_df.anchorlen.astype(str)).set_title('Train data')
plt.show()

In [None]:
sns.histplot(test_df.anchorlen.astype(str)).set_title('Test data')
plt.show()

In [None]:
sns.histplot(train_df.targetlen.astype(str)).set_title('Train data')
plt.show()

In [None]:
sns.histplot(test_df.targetlen.astype(str)).set_title('Test data')
plt.show()

In [None]:
train_df.targetlen.value_counts()

In [None]:
test_df.targetlen.value_counts()

In [None]:
train_df.target.value_counts().head(20)

In [None]:
test_df.target.value_counts().head(20)

### Observations
1. No missing values.
2. No duplicate rows.
3. The longest anchor is 5 words in the training data but only 3 words in the test data.
4. The longest target is 15 words in the training data but only 4 words in the test data.


# Context column

Thanks to REMEK KINAS for the code and description below 
https://www.kaggle.com/code/remekkinas/eda-and-feature-engineering

Source: https://en.wikipedia.org/wiki/Cooperative_Patent_Classification

A CPC code begins with the "section symbol", a single letter from "A" ("Human Necessities") to "H" ("Electricity"), or "Y" for emerging cross-sectional technologies. It is followed by a two-digit number that completes the "class symbol"; for example, "A01" represents "Agriculture; forestry; animal husbandry; trapping; fishing".

* A: Human Necessities
* B: Operations and Transport
* C: Chemistry and Metallurgy
* D: Textiles
* E: Fixed Constructions
* F: Mechanical Engineering
* G: Physics
* H: Electricity
* Y: Emerging Cross-Sectional Technologies

* Hierarchy
    * Section (one letter A to H and also Y)
        * Class (two digits)
        
<div align="center"><img src="https://www.researchgate.net/publication/348420976/figure/fig2/AS:979346684645380@1610505853859/Example-of-a-simplified-Cooperative-Patent-Classification-CPC-tree-of-a-patent-parsed.ppm"/></div>
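
The section/class structure described above can be sketched as a small parser. This is only an illustration under the assumption that every context code is exactly one section letter followed by two digits; `CPC_CONTEXT` and `split_context` are hypothetical names, not part of any library.

```python
import re

# Assumed shape of a context code: section letter (A-H or Y) + two-digit class.
CPC_CONTEXT = re.compile(r"^([A-HY])(\d{2})$")

def split_context(code):
    """Return (section, class_digits) for a context code such as 'A01'."""
    match = CPC_CONTEXT.match(code)
    if match is None:
        # Fail loudly on anything that does not fit the assumed pattern.
        raise ValueError(f"unexpected context code: {code!r}")
    return match.group(1), match.group(2)

print(split_context("A01"))
```

Validating the pattern, rather than slicing blindly, surfaces any code that does not follow the expected layout before it ends up in a derived column.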

In [None]:
train_df.context.value_counts()

In [None]:
train_df['section'] = train_df['context'].astype(str).str[0]
train_df['classes'] = train_df['context'].astype(str).str[1:]
test_df['section'] = test_df['context'].astype(str).str[0]
test_df['classes'] = test_df['context'].astype(str).str[1:]

In [None]:
train_df.context.value_counts().head(20)

In [None]:
test_df.context.value_counts().head(20)

In [None]:
# Human-readable labels for the CPC section letters
di = {"A" : "A - Human Necessities",
      "B" : "B - Operations and Transport",
      "C" : "C - Chemistry and Metallurgy",
      "D" : "D - Textiles",
      "E" : "E - Fixed Constructions",
      "F" : "F - Mechanical Engineering",
      "G" : "G - Physics",
      "H" : "H - Electricity",
      "Y" : "Y - Emerging Cross-Sectional Technologies"}

train_df.head(10)

In [None]:
# Replace section letters with their labels for the plot only
sns.histplot(y=train_df["section"].replace(di))
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(16, 8))
sns.countplot(data=train_df, x='classes', ax=ax);

# Score column
The scores are in the 0-1 range with increments of 0.25 with the following meanings:

* 1.0 - Very close match. This is typically an exact match except possibly for differences in conjugation, quantity (e.g. singular vs. plural), and addition or removal of stopwords (e.g. “the”, “and”, “or”).
* 0.75 - Close synonym, e.g. “mobile phone” vs. “cellphone”. This also includes abbreviations, e.g. "TCP" -> "transmission control protocol".
* 0.5 - Synonyms which don’t have the same meaning (same function, same properties). This includes broad-narrow (hyponym) and narrow-broad (hypernym) matches.
* 0.25 - Somewhat related, e.g. the two phrases are in the same high level domain but are not synonyms. This also includes antonyms.
* 0.0 - Unrelated.
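
The rating scale above can be attached to the numeric scores as readable labels, which makes tables and plots easier to interpret. The sketch below uses a small made-up series; `SCORE_LABELS` and `toy_scores` are hypothetical names, and the label strings are shortened paraphrases of the list above.

```python
import pandas as pd

# Short labels paraphrasing the rating scale described above.
SCORE_LABELS = {
    1.0: "very close match",
    0.75: "close synonym",
    0.5: "synonyms with different meaning",
    0.25: "somewhat related",
    0.0: "unrelated",
}

# Made-up scores, for illustration only.
toy_scores = pd.Series([0.5, 0.0, 0.25, 0.5, 1.0, 0.75, 0.5])

# Readable label for every score.
labelled = toy_scores.map(SCORE_LABELS)
# Share of each score level, ordered from 0.0 to 1.0.
shares = toy_scores.value_counts(normalize=True).sort_index()

print(labelled.tolist())
print(shares)
```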

In [None]:
sns.histplot(train_df.score.astype(str)).set_title('Score')
plt.show()

In [None]:
train_df.score.value_counts()