## About the dataset
**In this dataset, you are presented with pairs of phrases (an anchor and a target phrase) and asked to rate how similar they are on a scale from 0 (not at all similar) to 1 (identical in meaning). This challenge differs from a standard semantic similarity task in that similarity has been scored here within a patent's context, specifically its CPC classification (version 2021.05), which indicates the subject to which the patent relates. For example, while the phrases "bird" and "Cape Cod" may have low semantic similarity in normal language, the likeness of their meaning is much closer if considered in the context of "house".**

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

#### Importing the training dataset

In [None]:
train = pd.read_csv("../input/us-patent-phrase-to-phrase-matching/train.csv")
train.head()

**In this dataset, we primarily need to assess whether the given *target* phrase is semantically related to the *anchor* phrase with respect to the *context* category.**

**Importing the submission file**

In [None]:
submission = pd.read_csv("../input/us-patent-phrase-to-phrase-matching/sample_submission.csv")
submission.head()

In [None]:
test = pd.read_csv("../input/us-patent-phrase-to-phrase-matching/test.csv")
test.head()

In [None]:
# The shape of the training dataset
train.shape

In [None]:
test.shape

#### The number of unique categories in the *context* feature is 106, as we can see below.

In [None]:
np.sort(train['context'].unique())

In [None]:
train['context'].nunique()

In [None]:
train['score'].unique()
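
Beyond listing the distinct score values, it can help to see how many rows fall on each level. The following is a minimal sketch on a toy series; the values in `scores` are illustrative assumptions and are not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for train['score']; values are illustrative only.
scores = pd.Series([0.0, 0.25, 0.5, 0.5, 0.75, 1.0, 0.5, 0.25])

# Count rows per score level, ordered by score rather than by frequency.
score_counts = scores.value_counts().sort_index()
print(score_counts)
```

On the real data, the same call on `train['score']` would show whether the score levels are balanced.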

### Data preprocessing of the training and test datasets

#### Transform the *anchor* and *target* features to lowercase.

In [None]:
train['anchor'] = train['anchor'].str.lower()
train['target'] = train['target'].str.lower()

In [None]:
test['anchor'] = test['anchor'].str.lower()
test['target'] = test['target'].str.lower()

#### Removing punctuation from the *anchor* and *target* features

In [None]:
# Raw strings keep \w and \s from being read as (invalid) string escapes
train['anchor'] = train['anchor'].str.replace(r'[^\w\s]', '', regex=True)
train['target'] = train['target'].str.replace(r'[^\w\s]', '', regex=True)

In [None]:
# Raw strings keep \w and \s from being read as (invalid) string escapes
test['anchor'] = test['anchor'].str.replace(r'[^\w\s]', '', regex=True)
test['target'] = test['target'].str.replace(r'[^\w\s]', '', regex=True)

#### Removing stopwords from the *anchor* and *target* features, and creating the new features *new_anchor* and *new_target*, respectively.

In [None]:
from nltk.corpus import stopwords

# English stopword list from NLTK (the corpus must be available locally,
# e.g. via nltk.download('stopwords'))
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    """Return text with English stopwords removed, joined by single spaces."""
    return " ".join(word for word in str(text).split() if word not in stop_words)

train['new_target'] = train['target'].apply(remove_stopwords)
train['new_anchor'] = train['anchor'].apply(remove_stopwords)
train.head()

In [None]:
test['new_target'] = test['target'].apply(remove_stopwords)
test['new_anchor'] = test['anchor'].apply(remove_stopwords)
test.head()
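
One side effect of stopword removal is that a phrase made up only of stopwords becomes an empty string. The sketch below shows this on toy input. To stay self-contained it uses a small hand-written stopword set (`STOP_WORDS`, an assumption) rather than the NLTK list, and the example phrases are made up.

```python
# Hypothetical subset of English stopwords, used here instead of the NLTK list.
STOP_WORDS = {"the", "of", "and", "a", "in", "on", "it"}

def remove_stopwords(text, stop_words=STOP_WORDS):
    """Drop every whitespace-separated token that appears in stop_words."""
    return " ".join(word for word in str(text).split() if word not in stop_words)

examples = ["the rotation of a shaft", "on it", "water"]
cleaned = [remove_stopwords(phrase) for phrase in examples]
print(cleaned)
```

The second phrase collapses to an empty string, which is why the minimum string length of the cleaned columns is worth checking.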

**Analyzing further the features *new_target* and *new_anchor*.**

In [None]:
# The minimum string length in new_target and target features
print(f"The minimum string length of target feature is {train['target'].str.len().min() }, and for the new_target feature the minimum string length is {train['new_target'].str.len().min()}")
# Printing those rows which have minimum string length for the new_target feature
train[train['new_target'].str.len()==train['new_target'].str.len().min()]

In [None]:
# Printing those rows which have minimum string length for the target feature
train[train['target'].str.len()==train['target'].str.len().min()]

In [None]:
# The minimum string length in the new_target and target features of the test set
print(f"The minimum string length of target feature is {test['target'].str.len().min() }, and for the new_target feature the minimum string length is {test['new_target'].str.len().min()}")
# Printing those rows which have minimum string length for the new_target feature
test[test['new_target'].str.len()==test['new_target'].str.len().min()]

In [None]:
# Printing those rows which have minimum string length for the target feature
test[test['target'].str.len()==test['target'].str.len().min()]

**Wherever the string length of *new_target* is zero, the value is replaced by the corresponding *target* value. Thereafter, I drop the *target* column and rename *new_target* to *target* for the sake of continuity.**

In [None]:
# Replacing those values in new_target where string length == 0 by the corresponding target values
# (an explicit zero-length mask, so nothing is overwritten if no phrase is empty)
empty_target = train['new_target'].str.len() == 0
train.loc[empty_target, 'new_target'] = train.loc[empty_target, 'target']

In [None]:
# Printing those rows which have minimum string length for the new_target feature
train[train['new_target'].str.len()==train['new_target'].str.len().min()]
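
The replacement above is applied to the training set only. If the test set also contains phrases that become empty after stopword removal, the same fallback could be applied there. Below is a minimal sketch on toy data; the helper name `restore_empty` and the toy frame are hypothetical.

```python
import pandas as pd

def restore_empty(df: pd.DataFrame, cleaned_col: str, original_col: str) -> pd.DataFrame:
    """Return a copy of df where empty cleaned phrases are replaced by the original phrase."""
    out = df.copy()
    empty = out[cleaned_col].str.len() == 0  # explicit zero-length mask
    out.loc[empty, cleaned_col] = out.loc[empty, original_col]
    return out

# Toy frame: the first phrase consisted only of stopwords.
toy = pd.DataFrame({"target": ["on it", "metal plate"],
                    "new_target": ["", "metal plate"]})
fixed = restore_empty(toy, "new_target", "target")
print(fixed)
```

Under these assumptions, usage would be `train = restore_empty(train, 'new_target', 'target')` and likewise for `test`.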

In [None]:
# Dropping the target column
train = train.drop(["target"],axis=1)
train.head()

In [None]:
# Dropping the target column
test = test.drop(["target"],axis=1)
test.head()

In [None]:
# Renaming the new_target as target
train.rename(columns={"new_target":"target"}, inplace=True)
train.head()

In [None]:
# Renaming the new_target as target
test.rename(columns={"new_target":"target"}, inplace=True)
test.head()

In [None]:
# Similarly as above we observe for anchor and new_anchor feature
# The minimum string length in new_anchor and anchor features
print(f"The minimum string length of anchor feature is {train['anchor'].str.len().min() }, and for the new_anchor feature the minimum string length is {train['new_anchor'].str.len().min()}")

The *anchor* and *new_anchor* features have the same minimum string length, so we can directly drop the *anchor* feature and rename the *new_anchor* feature to *anchor*.

In [None]:
# The same check for the test set:
# the minimum string length in the new_anchor and anchor features
print(f"The minimum string length of anchor feature is {test['anchor'].str.len().min() }, and for the new_anchor feature the minimum string length is {test['new_anchor'].str.len().min()}")

In [None]:
# Dropping the anchor feature
train = train.drop(["anchor"],axis=1)
# Rename the new_anchor feature as anchor feature
train.rename(columns={"new_anchor":"anchor"}, inplace=True)
train.head()

In [None]:
# Dropping the anchor feature
test = test.drop(["anchor"],axis=1)
# Rename the new_anchor feature as anchor feature
test.rename(columns={"new_anchor":"anchor"}, inplace=True)
test.head()

In [None]:
# Simply rearranging the columns
train = train[['id','anchor','target','context','score']]
train.head()

In [None]:
# Simply rearranging the columns
test = test[['id','anchor','target','context']]
test.head()

#### Exploratory Data Analysis (EDA)
**Now we check whether there are any null elements in the dataframes.**

In [None]:
train.isnull().sum()

In [None]:
test.isnull().sum()
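
Note that `isnull()` does not flag empty strings, which are the kind of "missing" value that stopword removal can produce. A minimal sketch of counting both, on a made-up frame:

```python
import pandas as pd

# Toy frame with one empty string and one true null in 'target'.
toy = pd.DataFrame({"anchor": ["shaft", "valve", "gear"],
                    "target": ["rod", "", None]})

null_counts = toy.isnull().sum()                         # counts None/NaN only
empty_counts = toy.apply(lambda col: col.eq("").sum())   # counts '' only
print(null_counts)
print(empty_counts)
```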

#### First we analyze the *target* feature.

**Distribution of sentence length in the *target* feature**

In [None]:
import matplotlib.pyplot as plt
train['target'].str.split(" ").apply(lambda x: len(x)).hist()
plt.title("Histogram of sentence length of the target feature")

In [None]:
# Distribution of number of words in a sentence of the target feature
sent_max = train['target'].str.split(" ").apply(lambda x: len(x)).max()
sent_min = train['target'].str.split(" ").apply(lambda x: len(x)).min()
print(f"The minimum and maximum sentence length in the target feature are {sent_min} and {sent_max}, respectively.")
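
A caveat on the word counts above: splitting on a literal space treats an empty string as one "word" and counts repeated spaces as extra words, whereas splitting on any whitespace does not. The sketch below contrasts the two on made-up phrases.

```python
import pandas as pd

# Toy phrases: a normal one, an empty one, and one with a double space.
phrases = pd.Series(["metal plate", "", "heat  exchanger"])

counts_space = phrases.str.split(" ").str.len()  # split on a single literal space
counts_any = phrases.str.split().str.len()       # split on runs of whitespace
print(counts_space.tolist(), counts_any.tolist())
```

Since `remove_stopwords` joins tokens with single spaces, the two variants should agree on the cleaned columns except for empty strings.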

**Most phrases in the *target* feature are about 2 words long.**

**Distribution of character length in the *target* feature**

In [None]:
train['target'].str.len().hist()
plt.title("Histogram of character length of the target feature")

In [None]:
char_max = train['target'].str.len().max()
char_min = train['target'].str.len().min()
print(f"The minimum and maximum character length in the target feature are {char_min} and {char_max}, respectively.")

**Most phrases in the *target* feature are about 20 characters long.**

**Observing the Word Cloud of the *target* feature.**

In [None]:
from wordcloud import WordCloud
# Joining all phrases of the target feature into one string
text = " ".join(sent for sent in train.target)
# Limit the maximum font size and the number of words shown
wordcloud = WordCloud(max_font_size=50, max_words=100).generate(text)
# Plotting the figure
plt.figure(figsize=(12, 10))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.title("Word cloud of the target feature")
plt.show()
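
A word cloud gives only a visual impression. Explicit counts make the "most frequent words" claim checkable. This is a minimal sketch using `collections.Counter` on made-up phrases.

```python
from collections import Counter

# Toy phrases standing in for train['target']; not taken from the dataset.
phrases = ["control device", "display device", "control system", "base member"]

# Count every whitespace-separated word across all phrases.
word_counts = Counter(word for phrase in phrases for word in phrase.split())
top_words = word_counts.most_common(2)
print(top_words)
```

As far as I know, `WordCloud` applies its own stopword list and by default also counts two-word collocations, so its font sizes need not match these raw single-word counts exactly.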

**Frequently occurring words in the *target* feature are *device*, *system*, *material*, *member*, etc.**

#### Now we look into the *anchor* feature.

In [None]:
# Distribution of number of words in a sentence of the anchor feature
train['anchor'].str.split(" ").apply(lambda x: len(x)).hist(align='mid')
plt.title("Histogram of sentence length of the anchor feature")

In [None]:
sent_max = train['anchor'].str.split(" ").apply(lambda x: len(x)).max()
sent_min = train['anchor'].str.split(" ").apply(lambda x: len(x)).min()
print(f"The minimum and maximum sentence length in the anchor feature are {sent_min} and {sent_max}, respectively.")

**The most common number of words in a phrase of the *anchor* feature is 2. The minimum length is 1 and the maximum length is 5.**

In [None]:
# Distribution of number of characters in a sentence of the anchor feature
train['anchor'].str.len().hist()
plt.title("Histogram of character length of the anchor feature")

In [None]:
char_max = train['anchor'].str.len().max()
char_min = train['anchor'].str.len().min()
print(f"The minimum and maximum character length in the anchor feature are {char_min} and {char_max}, respectively.")

**The most common number of characters in a phrase of the *anchor* feature is in the range 10 to 20. The minimum length is 3 and the maximum length is 38.**

In [None]:
# Joining all phrases of the anchor feature into one string
text = " ".join(sent for sent in train.anchor)
# Limit the maximum font size and the number of words shown
wordcloud = WordCloud(max_font_size=50, max_words=100).generate(text)
# Plotting the figure
plt.figure(figsize=(12, 10))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.title("Word cloud of the anchor feature")
plt.show()

In [None]:
np.sort(train['context'].unique())

#### Importing the *Cooperative Patent Classification Codes Meaning* data, available as a Kaggle dataset [here](https://www.kaggle.com/datasets/xhlulu/cpc-codes).

In [None]:
df_cpccm = pd.read_csv("../input/cpcc-dataset/Cooperative_Patent_Classification_Codes_Meaning.csv")
df_cpccm.head()

#### Left join the *train* and *test* datasets with the *df_cpccm* dataset

In [None]:
train = pd.merge(train, df_cpccm[["code","title"]], 
                 left_on = "context", right_on = "code",
                 how='left')

test = pd.merge(test, df_cpccm[["code","title"]], 
                 left_on = "context", right_on = "code",
                 how='left')

In [None]:
train.head()

In [None]:
test.head()

In [None]:
train.isnull().sum()

In [None]:
test.isnull().sum()
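
A left join keeps every row of the left frame, so a context code without a match shows up only as a null *title*. The sketch below shows one way to make this explicit using the `validate` and `indicator` options of `merge`; the toy frames and their titles are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for train and df_cpccm; codes and titles are illustrative.
phrases = pd.DataFrame({"id": [1, 2, 3], "context": ["A47", "H01", "Z99"]})
codes = pd.DataFrame({"code": ["A47", "H01"], "title": ["title A", "title B"]})

merged = phrases.merge(
    codes,
    left_on="context", right_on="code",
    how="left",
    validate="many_to_one",  # raises if a code appears more than once in `codes`
    indicator=True,          # adds a `_merge` column describing the match
)
unmatched = merged.loc[merged["_merge"] == "left_only", "context"].tolist()
print(unmatched)
```

`validate="many_to_one"` also guards against duplicated codes in the lookup table, which would otherwise silently multiply rows.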

#### The 5 most frequent titles

In [None]:
train['title'].value_counts(dropna=False)[:5]

In [None]:
# Saving the preprocessed dataset
train.to_csv("train_cleaned_us_patent.csv",index=False)

test.to_csv("test_cleaned_us_patent.csv",index=False)
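
One thing to keep in mind when reloading the saved files: an empty string is written as an empty CSV field, and `pd.read_csv` reads it back as NaN by default. The sketch below shows this on a toy frame using an in-memory buffer instead of a file. Whether the cleaned data actually contains empty strings is an assumption to check.

```python
import io
import pandas as pd

# Toy frame with one empty phrase.
toy = pd.DataFrame({"id": ["a", "b"], "target": ["", "metal plate"]})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)

buffer.seek(0)
default_read = pd.read_csv(buffer)                        # '' comes back as NaN
buffer.seek(0)
strict_read = pd.read_csv(buffer, keep_default_na=False)  # '' stays ''
print(default_read["target"].isnull().sum(), strict_read["target"].tolist())
```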