In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
palette = sns.color_palette('RdPu_r', 15)

## Data Loading

In [None]:
train_data_path = '../input/us-patent-phrase-to-phrase-matching/train.csv'
test_data_path = '../input/us-patent-phrase-to-phrase-matching/test.csv'
sample_submission_path = '../input/us-patent-phrase-to-phrase-matching/sample_submission.csv'

In [None]:
train_df = pd.read_csv(train_data_path)
test_df = pd.read_csv(test_data_path)
sample_submission_df = pd.read_csv(sample_submission_path)

Let's take a look at the shape of our datasets

In [None]:
print(f"Train dataset shape: {train_df.shape}")
print(f"Test dataset shape: {test_df.shape}")
print(f"Submission sample dataset shape: {sample_submission_df.shape}")

And also take a look at our datasets

Train

In [None]:
train_df.sample(5)

Test

In [None]:
test_df.sample(5)

## EDA

Let's go deeper and explore our features

### Anchors

**Anchor**: the first phrase of each pair.

In [None]:
n = 15

In [None]:
anchor_df = train_df.groupby('anchor')['id'].count().sort_values(ascending=False).head(n).reset_index()
anchor_df.columns = ['Anchor', '# freq']

In [None]:
plt.figure(figsize=(15, 10))
sns.barplot(x='Anchor', y='# freq', data=anchor_df, palette=palette)
plt.xticks(rotation=30)
plt.title(f"Top {n} anchors")

### Target

**Target**: the second phrase of each pair.

In [None]:
target_df = train_df.groupby('target')['id'].count().sort_values(ascending=False).head(n).reset_index()
target_df.columns = ['Target', '# freq']

In [None]:
plt.figure(figsize=(15, 10))
sns.barplot(x='Target', y='# freq', data=target_df, palette=palette)
plt.xticks(rotation=30)
plt.title(f"Top {n} targets")

### Context

**Context**: the CPC classification (version 2021.05), which indicates the subject within which the similarity is to be scored.

The main levels inside patents:

* A: Human Necessities
* B: Operations and Transport
* C: Chemistry and Metallurgy
* D: Textiles
* E: Fixed Constructions
* F: Mechanical Engineering
* G: Physics
* H: Electricity
* Y: Emerging Cross-Sectional Technologies

In [None]:
context_df = train_df.groupby('context')['id'].count().sort_values(ascending=False).head(n).reset_index()
context_df.columns = ['Context', '# freq']

In [None]:
plt.figure(figsize=(15, 10))
sns.barplot(x='Context', y='# freq', data=context_df, palette=palette)
plt.xticks(rotation=30)
plt.title(f"Top {n} contexts")

### Score

**Score**: the similarity between the anchor and the target, sourced from a combination of one or more manual expert ratings.

In [None]:
score_df = train_df.groupby('score')['id'].count().sort_values(ascending=False).head(n).reset_index()
score_df.columns = ['Score', '# freq']
rank = score_df['Score'].argsort().argsort()
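
The double `argsort` above turns values into ranks, which are then used to pick one palette colour per bar. A minimal sketch of the trick on a plain NumPy array (the `scores` values here are made up for illustration; the pandas Series used in the notebook behaves the same way):

```python
import numpy as np

# Hypothetical score values, in arbitrary order
scores = np.array([0.5, 0.0, 0.25, 1.0, 0.75])

# First argsort: indices that would sort the array in ascending order
order = scores.argsort()

# Second argsort: for every element, its position in the sorted order (its rank)
rank = order.argsort()

print(rank)  # [2 0 1 4 3]
```

So `rank[i]` is 0 for the smallest value and `len(scores) - 1` for the largest, which makes it usable as an index into an ordered colour palette.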

In [None]:
plt.figure(figsize=(15, 5))
sns.barplot(x='# freq', y='Score', data=score_df, palette=np.array(palette)[rank], orient='h')
plt.title("Scores")

### Target length

Let's explore the length of the targets (in characters)

In [None]:
train_df['target length'] = train_df['target'].str.len()

The 10 longest targets

In [None]:
train_df.sort_values('target length', ascending=False)[:10]

The 10 shortest targets

In [None]:
train_df.sort_values('target length', ascending=False)[-10:]

Extract the patent level (the first letter of the CPC context code) to explore the target length within each level

In [None]:
train_df['Patent level'] = train_df['context'].str[0]

In [None]:
plt.figure(figsize=(15, 10))
sns.boxplot(x='Patent level', y='target length', data=train_df, palette=palette)
plt.title("Patent level target length distribution")

We can see that the variance of the **target length** is largest for the **C** patent level.

## Text Distance Metrics

Let's take a look at some *Text Distance Metrics*:

* Hamming Distance
* Levenshtein Distance
* Cosine Distance

### Hamming Distance

The *Hamming Distance* compares every letter of the two strings based purely on position.

So if you compare `abcdefg` and `bcdefgh`, you get *Hamming Distance* = 7, because none of the 7 positions match.
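
A minimal sketch of this position-by-position comparison (the function name `hamming_distance` is my own choice for illustration, not something from a library):

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        # The metric is only defined for strings of the same length
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))


print(hamming_distance("abcdefg", "bcdefgh"))  # 7
print(hamming_distance("karolin", "kathrin"))  # 3
```

The explicit length check is the reason this metric is awkward for free-form phrases: two strings of different lengths cannot be compared at all without padding or truncation.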

### Levenshtein Distance

The *Levenshtein Distance* is the minimum number of single-character operations needed to convert one string into another.

Three types of operations count:
    
* Inserting/adding a character counts as an operation
* Deleting a character counts as an operation
* Substituting a character counts as an operation

So if you compare `abcdefg` and `bcdefgh` again, you get *Levenshtein Distance* = 2: delete the leading `a`, then insert `h` at the end.
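
To make the three operations concrete, here is a minimal pure-Python sketch of the usual dynamic-programming formulation. The function name `levenshtein_distance` is illustrative only; it is not the library function used elsewhere in this notebook.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into an empty string takes i deletions
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep, if equal)
            ))
        previous = current
    return previous[-1]


print(levenshtein_distance("abcdefg", "bcdefgh"))  # 2
print(levenshtein_distance("kitten", "sitting"))   # 3
```

Unlike the Hamming distance, this works for strings of different lengths, because insertions and deletions can absorb the length difference.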

### Cosine Distance

The Cosine Distance applies to the vector representation of documents.

So first you need to turn each text into a vector and then compute the *Cosine Similarity* (the *Cosine Distance* is 1 minus this value) using

\begin{equation}
\cos ({\bf A},{\bf B}) = \frac{{\bf A} \cdot {\bf B}}{\|{\bf A}\| \, \|{\bf B}\|} = \frac{ \sum_{i=1}^{n}{A_i B_i} }{ \sqrt{\sum_{i=1}^{n}{A_i^2}} \, \sqrt{\sum_{i=1}^{n}{B_i^2}} }
\end{equation}
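
The formula maps directly onto a few NumPy calls. This is a minimal sketch; the helper name `cosine_similarity` and the choice to return `nan` for zero vectors are my assumptions for illustration.

```python
import numpy as np


def cosine_similarity(a, b) -> float:
    """Dot product of a and b divided by the product of their Euclidean norms."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denominator = np.linalg.norm(a) * np.linalg.norm(b)
    if denominator == 0.0:
        # The similarity is undefined when either vector is all zeros
        return float("nan")
    return float(np.dot(a, b) / denominator)


same_direction = cosine_similarity([1, 2, 3], [2, 4, 6])  # parallel vectors
orthogonal = cosine_similarity([1, 0], [0, 1])            # perpendicular vectors
print(same_direction, orthogonal)
```

Parallel vectors give a similarity close to 1 and perpendicular vectors give 0, regardless of the vectors' magnitudes.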

Now let's compute the *Levenshtein Distance* and the *Cosine Distance* for our train dataset.

Using *Hamming Distance* is not a good idea, because our `anchor` and `target` don't have the same length.

#### Levenshtein Distance

Once we have the distance, we need to convert it into a similarity score: we divide it by the length of the longer string and then compute `score = 1 - normalized distance`. For `abcdefg` and `bcdefgh` this gives `1 - 2/7 ≈ 0.71`.

In [None]:
from enchant.utils import levenshtein

In [None]:
train_df['levenshtein_score'] = train_df.apply(lambda x: 1 - levenshtein(x['anchor'], x['target']) / max(len(x['anchor']), len(x['target'])), axis=1)

#### Cosine Distance

I will use a character-level `TfidfVectorizer` to get the vectors. It's pretty straightforward and is good enough for a baseline.

Note that I do this without any text preprocessing, just to compare the 2 approaches.

In [None]:
from scipy import spatial
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
tfidf = TfidfVectorizer(analyzer='char')

In [None]:
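# The vocabulary and IDF weights are learned from the anchors only;
# characters that appear only in targets are ignored by transform()
tfidf_anchor = tfidf.fit_transform(train_df['anchor']).toarray()
tfidf_target = tfidf.transform(train_df['target']).toarray()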

In [None]:
cosine_similarity_values = []

for a, t in zip(tfidf_anchor, tfidf_target):
    sim = 1 - spatial.distance.cosine(a, t)
    cosine_similarity_values.append(sim)

In [None]:
train_df['cosine_score'] = cosine_similarity_values

In [None]:
train_df.sample(5)

### Which one is better?

As we need to find the most similar metric to `score`, we can compare our metrics.

Since we compare 2 metrics against 1 control (`score`), we run 2 tests and should apply the *Bonferroni Correction*: the overall significance level (0.05) is divided by the number of tests.

In [None]:
n_tests = 2

In [None]:
overall_alpha = 1 - 0.95
alpha = overall_alpha / n_tests

In [None]:
alpha

Ok, we're ready to conduct our tests. I will use the paired **t-test** (`ttest_rel`), because both samples are measured on the same rows and are therefore related.
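
To make clear what this test actually measures, here is a minimal sketch on synthetic data (the arrays `reference` and `candidate` are made up for illustration). The paired test is equivalent to a one-sample t-test on the row-wise differences, so it asks whether the mean difference is zero.

```python
import numpy as np
from scipy.stats import ttest_1samp, ttest_rel

rng = np.random.default_rng(0)

# Synthetic "true" values and a noisy, slightly shifted estimate of them
reference = rng.uniform(0, 1, size=200)
candidate = reference + rng.normal(0.1, 0.2, size=200)

# Paired t-test on the two related samples
paired = ttest_rel(reference, candidate)

# Same test expressed as a one-sample test on the differences
one_sample = ttest_1samp(reference - candidate, 0.0)

print(paired.statistic, paired.pvalue)
print(one_sample.statistic, one_sample.pvalue)
```

Both calls report the same statistic and p-value, which is a useful reminder that the result depends only on the mean and spread of the differences.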

In [None]:
from scipy.stats import ttest_rel

In [None]:
levenshtein_test = ttest_rel(train_df['score'], train_df['levenshtein_score'])
cosine_test = ttest_rel(train_df['score'], train_df['cosine_score'])

In [None]:
levenshtein_test

In [None]:
cosine_test

Here we see that `cosine_score` is far from `score`, so it's better to use `levenshtein_score` in the baseline.

Levenshtein is also far, but it's definitely closer.
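
As a complementary check, assuming the goal is for a metric to track `score` row by row rather than only to match its mean, one could also look at the Pearson correlation. The sketch below uses synthetic arrays (all names and values are made up for illustration) to show how the two views can disagree.

```python
import numpy as np
from scipy.stats import pearsonr, ttest_rel

rng = np.random.default_rng(42)
reference = rng.uniform(0, 1, size=500)

# Same values in random order: identical mean, but no row-wise relationship
shuffled = rng.permutation(reference)

# Shifted copy with a little noise: different mean, but almost perfectly aligned
shifted = reference + 0.3 + rng.normal(0, 0.01, size=500)

shuffled_p = ttest_rel(reference, shuffled).pvalue
shifted_p = ttest_rel(reference, shifted).pvalue
shuffled_r = pearsonr(reference, shuffled)[0]
shifted_r = pearsonr(reference, shifted)[0]

print(f"shuffled: t-test p={shuffled_p:.3f}, Pearson r={shuffled_r:.3f}")
print(f"shifted:  t-test p={shifted_p:.3g}, Pearson r={shifted_r:.3f}")
```

On the notebook's data the analogous calls would be `pearsonr(train_df['score'], train_df['levenshtein_score'])` and the same for `cosine_score`; I have not run them here, so no result is implied.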

## Inference

Here I'll use *Levenshtein Distance*

In [None]:
test_df['score'] = test_df.apply(lambda x: 1 - levenshtein(x['anchor'], x['target']) / max(len(x['anchor']), len(x['target'])), axis=1)

In [None]:
submit_df = test_df[['id', 'score']]

In [None]:
submit_df.to_csv('levenshtein_baseline_submission.csv', index=False)

In [None]:
submit_df.head()