In [None]:
import numpy as np
import pandas as pd
import os
import seaborn as sns
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS


## EDA

This section explores the competition datasets.

### load competition data

In [None]:
input_dir='../input/us-patent-phrase-to-phrase-matching'

In [None]:
train=pd.read_csv(f'{input_dir}/train.csv')
test=pd.read_csv(f'{input_dir}/test.csv')
sub=pd.read_csv(f'{input_dir}/sample_submission.csv')

In [None]:
print("train.shape",train.shape)
print("test.shape",test.shape)
print("sub.shape",sub.shape)

Show a few training samples:

In [None]:
train.head()

* id - a unique identifier for a pair of phrases
* anchor - the first phrase
* target - the second phrase
* context - the CPC classification (version 2021.05), which indicates the subject within which the similarity is to be scored
* score - the similarity. This is sourced from a combination of one or more manual expert ratings.

The meaning of score is:
* 1.0 - Very close match. This is typically an exact match except possibly for differences in conjugation, quantity (e.g. singular vs. plural), and addition or removal of stopwords (e.g. “the”, “and”, “or”).
* 0.75 - Close synonym, e.g. “mobile phone” vs. “cellphone”. This also includes abbreviations, e.g. "TCP" -> "transmission control protocol".
* 0.5 - Synonyms which don’t have the same meaning (same function, same properties). This includes broad-narrow (hyponym) and narrow-broad (hypernym) matches.
* 0.25 - Somewhat related, e.g. the two phrases are in the same high level domain but are not synonyms. This also includes antonyms.
* 0.0 - Unrelated.
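
As a small illustration, the five score levels can be kept in a lookup table. This is only a sketch: the names `SCORE_LABELS` and `describe_score` are hypothetical, and the labels are shortened paraphrases of the definitions above, not official wording.

```python
# Short labels paraphrasing the score definitions listed above (not official wording).
SCORE_LABELS = {
    1.0: "very close match",
    0.75: "close synonym",
    0.5: "synonym with a different meaning",
    0.25: "somewhat related",
    0.0: "unrelated",
}

def describe_score(score):
    """Return the short label for one of the five allowed score values."""
    return SCORE_LABELS[float(score)]

print(describe_score(0.75))  # close synonym
```

With the `train` frame loaded earlier, `train['score'].map(SCORE_LABELS)` should give the same labels row by row.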


### check data

Number of unique values per column in train:

In [None]:
train.nunique()

Check whether train has any null values:

In [None]:
train.isnull().sum()

### score value counts

In [None]:
train['score'].value_counts()

We can see that the most frequent score is 0.50, followed by 0.25.

In [None]:
sns.countplot(x='score',data=train)
plt.show()

### context column

In [None]:
train['context'].value_counts()[:10]

In [None]:
train['context'].map(len).unique()

Every context code is exactly 3 characters long (a section letter followed by a two-digit class).

In [None]:
train['section']=train['context'].apply(lambda x:x[0])

In [None]:
train['section'].unique()

In [None]:
train['section'].value_counts()

In [None]:
table = [
["A", "Human Necessities"],
["B", "Operations and Transport"],
["C", "Chemistry and Metallurgy"],
["D", "Textiles"],
["E", "Fixed Constructions"],
["F", "Mechanical Engineering"],
["G", "Physics"],
["H", "Electricity"],
["Y", "Emerging Cross-Sectional Technologies"]]
table

In [None]:
sns.countplot(x='section',data=train)
plt.show()

Context column: https://www.kaggle.com/competitions/us-patent-phrase-to-phrase-matching/discussion/316138


For anyone who hasn't followed the details of the context column, this is what the hierarchy looks like:

* Section (one letter A to H and also Y)
* Class (two digits)
* Subclass (one letter)
* Group (one to three digits)
* Main group and subgroups (at least two digits)

For example, in "A01B33/00":

* Section A
* Class 01
* Subclass B
* Group 33
* Main group 00
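
The breakdown above can be sketched with a regular expression. This is a minimal illustration, not an official CPC parser: `CPC_PATTERN` and `parse_cpc` are hypothetical names, the field names simply follow the list above, and the parts after the class are treated as optional so that the 3-character context codes in `train` also match.

```python
import re

# Section letter and two-digit class are required; subclass, group and
# main group are optional so that short codes such as "G02" also match.
CPC_PATTERN = re.compile(
    r"(?P<section>[A-HY])"
    r"(?P<cpc_class>\d{2})"
    r"(?:(?P<subclass>[A-Z])"
    r"(?:(?P<group>\d{1,3})/(?P<main_group>\d{2,}))?)?"
)

def parse_cpc(code):
    """Split a CPC code into its parts; missing parts come back as None."""
    match = CPC_PATTERN.fullmatch(code.strip())
    if match is None:
        raise ValueError(f"not a recognised CPC code: {code!r}")
    return match.groupdict()

print(parse_cpc("A01B33/00"))
print(parse_cpc("G02"))
```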

These are the section values:

* A: Human Necessities
* B: Operations and Transport
* C: Chemistry and Metallurgy
* D: Textiles
* E: Fixed Constructions
* F: Mechanical Engineering
* G: Physics
* H: Electricity
* Y: Emerging Cross-Sectional Technologies
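
A small sketch of how these titles could be attached to a context code. The names `SECTION_TITLES` and `section_title` are hypothetical; the lookup only uses the first character of the code, as the `section` column above does.

```python
# Section letters and titles, copied from the list above.
SECTION_TITLES = {
    "A": "Human Necessities",
    "B": "Operations and Transport",
    "C": "Chemistry and Metallurgy",
    "D": "Textiles",
    "E": "Fixed Constructions",
    "F": "Mechanical Engineering",
    "G": "Physics",
    "H": "Electricity",
    "Y": "Emerging Cross-Sectional Technologies",
}

def section_title(context):
    """Look up the section title from the first letter of a context code."""
    return SECTION_TITLES[context[0]]

print(section_title("H04"))  # Electricity
```

On the `train` frame, `train['section'].map(SECTION_TITLES)` should give a readable title per row.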

Wikipedia link: https://en.wikipedia.org/wiki/Cooperative_Patent_Classification

P.S.: The link is already given in the data tab; I added it here in case someone missed it.

### text phrase values

- anchor

In [None]:
train['anchor'].value_counts()

In [None]:
train['anchor_word_count']=train['anchor'].apply(lambda x:len(x.split(' ')))

In [None]:
train['anchor_word_count'].describe()
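
One caveat about the word count: `split(' ')` splits on every single space, so repeated spaces produce empty strings that are counted as words, while `split()` with no argument treats any run of whitespace as one separator. Whether the competition phrases actually contain repeated spaces is not checked here; the phrase below is a made-up example.

```python
phrase = "mobile  phone"  # hypothetical phrase with two spaces between the words

# Splitting on a single space keeps the empty string between the two spaces.
print(phrase.split(' '))       # ['mobile', '', 'phone']
print(len(phrase.split(' ')))  # 3

# Splitting on any whitespace run gives the two actual words.
print(phrase.split())          # ['mobile', 'phone']
print(len(phrase.split()))     # 2
```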

In [None]:
sns.kdeplot(x='anchor_word_count',data=train)
plt.show()

- target

In [None]:
train['target_word_count']=train['target'].apply(lambda x:len(x.split(' ')))

In [None]:
train['target_word_count'].describe()

In [None]:
sns.kdeplot(x='target_word_count',data=train)
plt.show()

### cpc-codes:titles.csv
Let's take a deeper look at what each section means.

In [None]:
cpc_codes_df = pd.read_csv("../input/cpc-codes/titles.csv")

In [None]:
cpc_codes_df.head()

In [None]:
len(STOPWORDS)

In [None]:
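stopwords = set(STOPWORDS)

def show_wordcloud(data, title=None):
    # Join the individual strings into one text; str(data) on a pandas Series
    # would only give its truncated printed representation.
    text = data if isinstance(data, str) else ' '.join(map(str, data))

    wordcloud = WordCloud(
        background_color='white',
        stopwords=stopwords,
        max_words=400,
        max_font_size=40,
        scale=12,
        random_state=1
    ).generate(text)

    fig = plt.figure(1, figsize=(16, 10))
    plt.axis('off')
    if title:
        fig.suptitle(title, fontsize=20)
        fig.subplots_adjust(top=2.3)

    plt.imshow(wordcloud)
    plt.show()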

In [None]:
show_wordcloud(cpc_codes_df.loc[cpc_codes_df["section"]=="H", "title"], title = '')

In [None]:
show_wordcloud(cpc_codes_df.loc[cpc_codes_df["section"]=="A", "title"], title = '')

## Pretrained Patent Models on HuggingFace
https://www.kaggle.com/competitions/us-patent-phrase-to-phrase-matching/discussion/316706

1. https://huggingface.co/anferico/bert-for-patents
2. https://huggingface.co/google/bigbird-pegasus-large-bigpatent
3. https://huggingface.co/AI-Growth-Lab/PatentSBERTa
4. https://huggingface.co/Kevincp560/bigbird-pegasus-large-bigpatent-finetuned-pubMed
5. https://huggingface.co/google/pegasus-big_patent

## Pearsonr Metric
- https://www.kaggle.com/code/pukkinming/let-s-understand-the-metrics-a-bit?scriptVersionId=93832489
- https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html


scipy.stats.pearsonr(x, y)
Pearson correlation coefficient and p-value for testing non-correlation.

The Pearson correlation coefficient  measures the linear relationship between two datasets. The calculation of the p-value relies on the assumption that each dataset is normally distributed. (See Kowalski for a discussion of the effects of non-normality of the input on the distribution of the correlation coefficient.) Like other correlation coefficients, this one varies between -1 and +1 with 0 implying no correlation. Correlations of -1 or +1 imply an exact linear relationship.

![](https://www.lsbin.com/wp-content/uploads/2021/04/formula6.png)
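
To make the definition concrete, here is a minimal pure-Python sketch of the standard Pearson formula: the sum of products of the deviations from the means, divided by the square root of the product of the two sums of squared deviations. The function name `pearson` is hypothetical, and this is not how scipy implements it (scipy also returns a p-value).

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Numerator: sum of products of deviations from the means
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Denominator: square root of the product of the squared deviations
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

print(pearson([1, 2, 3], [2, 4, 6]))  # 1.0, exact increasing linear relationship
print(pearson([1, 2, 3], [3, 2, 1]))  # -1.0, exact decreasing linear relationship
```

Note that the sketch divides by zero if either input is constant, a case in which the coefficient is undefined.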

In [None]:
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import mean_squared_error

In [None]:
actual = [0, 0.24, 0.25, 0.5, 0]
pred = [0, 0.5, 0.25, 1, 0]

print(f"pearson: {pearsonr(actual, pred)}")
print(f"mse: {mean_squared_error(y_true=actual, y_pred=pred)}")

## Reference
- [US Patent Phrase to Phrase Matching EDA](https://www.kaggle.com/code/gpreda/us-patent-phrase-to-phrase-matching-eda)
- [In Depth EDA_PatentChallenge](https://www.kaggle.com/code/valentinwerner/in-depth-eda-patentchallenge)