**Content:**
* [Data Description](#section-one)
* [Load all dependencies we need](#section-two)
* [EDA](#section-three)
    - [General](#one)
    - [Most common words in anchors](#two)
    - [Most common words in targets](#three)
    - [WordClouds](#foor)
* [Create 5 folds cross validation](#folds)
* [Supervised Learning](#supervised-learning)
    - [Hyperparameters](#h)
    - [Seed Everything](#s)
    - [Dataset](#d)
    - [Model](#m)
    - [Utils + Train Function](#u)
    - [Train](#t)




<a id="section-one"></a>
# Data Description

In this dataset, you are presented pairs of phrases (an anchor and a target phrase) and asked to rate how similar they are on a scale from 0 (not at all similar) to 1 (identical in meaning). This challenge differs from a standard semantic similarity task in that similarity has been scored here within a patent's context, specifically its CPC classification (version 2021.05), which indicates the subject to which the patent relates. For example, while the phrases "bird" and "Cape Cod" may have low semantic similarity in normal language, the likeness of their meaning is much closer if considered in the context of "house".



<a id="section-two"></a>
# Load all dependencies we need

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.pyplot as plt
from plotly import graph_objs as go
from collections import Counter
import plotly.express as px
import seaborn as sns

### dependencies we need to create the 5-fold cross-validation split
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedGroupKFold

### dependencies we need to do supervised learning (Multi class classification)
import torch
import torch.nn as nn
from scipy import stats
from torch.utils.data import DataLoader
from tqdm import tqdm
import torch.optim as optim
import scipy as sp
from transformers import AutoModel, AutoConfig, AutoTokenizer, get_linear_schedule_with_warmup


In [None]:
train             = pd.read_csv('/kaggle/input/us-patent-phrase-to-phrase-matching/train.csv')
test              = pd.read_csv('/kaggle/input/us-patent-phrase-to-phrase-matching/test.csv')
sample_submission = pd.read_csv('/kaggle/input/us-patent-phrase-to-phrase-matching/sample_submission.csv')

<a id="section-three"></a>
# EDA

<a id="one"></a>
**General**

In [None]:
print(train.shape)
print(test.shape)

So we have 36473 samples in the train set and 36 samples in the test set.


In [None]:
train.info()

In [None]:
test.info()

There are no null values in either the train set or the test set.

In [None]:
train.head()

Let's look at the distribution of `score` in the train set.

In [None]:
plt.figure(figsize=(12,6))
sns.countplot(x='score',data=train)

Let's draw a funnel chart to see the share of each score more clearly.

In [None]:
temp = train.groupby('score').count()['target'].reset_index().sort_values(by='target',ascending=False)
fig = go.Figure(go.Funnelarea(
    text =temp.score,
    values = temp.target,
    title = {"position": "top center", "text": "Funnel-Chart of Score Distribution"}
    ))
fig.show()

In [None]:
print(f"Number of unique values in ANCHOR column: {train.anchor.nunique()} in train set")
print(f"Number of unique values in TARGET column: {train.target.nunique()} in train set")
print(f"Number of unique values in CONTEXT column: {train.context.nunique()} in train set")


print(f"Number of unique values in ANCHOR column: {test.anchor.nunique()} in test set")
print(f"Number of unique values in TARGET column: {test.target.nunique()} in test set")
print(f"Number of unique values in CONTEXT column: {test.context.nunique()} in test set")

What do we currently know about our data:

Before going further, let's list a few things we already know about the data that will help us gain new insights:
* The data is imbalanced: only 3.16% of samples have score 1, while 33.7% of samples have score 0.5. One way to address this is to augment the under-represented classes until each has as many samples as the largest class (score=0.5).
* Anchors are heavily repeated: the train set has only 733 unique anchors compared with 29340 unique targets.
* Thanks to this discussion: https://www.kaggle.com/competitions/us-patent-phrase-to-phrase-matching/discussion/315220 we know that the anchors present in the test set are not included in the training data.
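
The class shares quoted above can be recomputed with `value_counts(normalize=True)`, and the simplest version of the balancing idea is random oversampling with replacement. The sketch below runs on a small made-up frame (`toy`), not on the competition data, and `oversample_to_largest` is a hypothetical helper that is not part of this notebook. Duplicating rows is only one possible reading of "augment".

```python
import pandas as pd

# Toy stand-in for train.csv; the rows and counts here are made up.
toy = pd.DataFrame({
    "anchor": ["a"] * 10,
    "target": [f"t{i}" for i in range(10)],
    "score": [0.5] * 5 + [0.25] * 3 + [0.0] * 1 + [1.0] * 1,
})

# Share of each score class, in percent.
share = toy["score"].value_counts(normalize=True).mul(100).round(2)
print(share)


def oversample_to_largest(df, label_col="score", seed=42):
    """Resample every class with replacement up to the size of the largest class."""
    largest = df[label_col].value_counts().max()
    parts = [
        group.sample(n=largest, replace=True, random_state=seed)
        for _, group in df.groupby(label_col)
    ]
    return pd.concat(parts).reset_index(drop=True)


balanced = oversample_to_largest(toy)
print(balanced["score"].value_counts())
```

If something like this were used, it would make sense to apply it to the training part of each fold only, so that duplicated rows never land in the validation part.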


Top 20 anchor values

In [None]:
train.anchor.value_counts().head(20)

Top 20 target values

In [None]:
train.target.value_counts().head(20)

TOP 20 context values

In [None]:
train.context.value_counts().head(20)

<a id="two"></a>
**Most Common words in anchor**

In [None]:
train['temp_list'] = train['anchor'].apply(lambda x:str(x).split())
top = Counter([item for sublist in train['temp_list'] for item in sublist])
temp = pd.DataFrame(top.most_common(20))
temp.columns = ['Common_words','count']
temp.style.background_gradient(cmap='Blues')

Oops, we forgot to remove stopwords, but there are not many of them in the top 20, so I guess there is no need.
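
For reference, filtering stopwords before counting would only take one extra condition. This is a minimal sketch on made-up phrases; the `STOP` set is a tiny hand-written list used purely for illustration (the `STOPWORDS` set imported from `wordcloud` at the top of the notebook could be substituted for it).

```python
from collections import Counter

# Assumption: a tiny hand-written stopword list, for illustration only.
STOP = {"of", "the", "and", "for", "in", "a", "to"}

# Made-up phrases standing in for train['anchor'].
phrases = ["method of forming", "source of light", "the control system"]

# Split every phrase into words and drop the stopwords before counting.
tokens = [w for p in phrases for w in p.lower().split() if w not in STOP]
counts = Counter(tokens)
print(counts.most_common(3))
```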

In [None]:
fig = px.bar(temp, x="count", y="Common_words", title='Common Words in anchor', orientation='h', 
             width=700, height=700,color='Common_words')
fig.show()

In [None]:
fig = px.treemap(temp, path=['Common_words'], values='count',title='Tree of Most Common Words in anchor')
fig.show()

<a id="three"></a>
**Most Common words in target**


In [None]:
train['temp_list'] = train['target'].apply(lambda x:str(x).split())
top = Counter([item for sublist in train['temp_list'] for item in sublist])
temp = pd.DataFrame(top.most_common(20))
temp.columns = ['Common_words','count']
temp.style.background_gradient(cmap='Purples')

In [None]:
fig = px.bar(temp, x="count", y="Common_words", title='Common Words in target', orientation='h', 
             width=700, height=700,color='Common_words')
fig.show()

In [None]:
fig = px.treemap(temp, path=['Common_words'], values='count',title='Tree of Most Common Words in target')
fig.show()

So we can see that many of the most common words are shared between anchors and targets, such as 'source', 'system', 'surface', and 'member'. That is expected, because we are asked to rate how similar each anchor is to its target.
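
The overlap can also be checked directly instead of by eye, by intersecting the two top-word lists. A minimal sketch on made-up phrases; `top_words` is a hypothetical helper that mirrors the `Counter` logic used in the cells above.

```python
from collections import Counter


def top_words(phrases, n=20):
    """Return the n most frequent whitespace-separated words."""
    counts = Counter(w for p in phrases for w in str(p).split())
    return [w for w, _ in counts.most_common(n)]


# Made-up phrases standing in for train['anchor'] and train['target'].
anchors = ["light source", "control system", "light source"]
targets = ["source of light", "system controller", "surface member"]

shared = set(top_words(anchors)) & set(top_words(targets))
print(sorted(shared))
```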

<a id="foor"></a>

**It's Time For WordClouds**

We will be building wordclouds in the following order:

* WordCloud of anchors
* WordCloud of targets

In [None]:
def generate_wordCloud(text, color, title, title_size):
    wordcloud = WordCloud(background_color=color,
                    min_font_size = 5,
                    random_state = 42,
                    width=400, 
                    height=200)
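    # Join every phrase into one string; str() on a Series would only give its truncated printout
    wordcloud.generate(" ".join(text.astype(str)))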

    plt.imshow(wordcloud);
    plt.title(title, fontdict={'size': title_size, 'color': 'black', 
                              'verticalalignment': 'bottom'})
    plt.axis('off');
    plt.tight_layout()  

**WORDCLOUD OF ANCHORS**

We have already visualized the most common anchor words, but a word cloud gives a much clearer picture.

In [None]:
generate_wordCloud(train['anchor'],color='black',title_size=15,title="WordCloud of Anchors")

**WORDCLOUD OF TARGETS**

We have already visualized the most common target words, but a word cloud gives a much clearer picture.

In [None]:
generate_wordCloud(train['target'],color='black',title_size=15,title="WordCloud of Targets")

<a id="folds"></a>

# Create folds

In [None]:
df = pd.read_csv('../input/us-patent-phrase-to-phrase-matching/train.csv')
df['score_map'] = df['score'].map({0.00: 0, 0.25: 1, 0.50: 2, 0.75: 3, 1.00: 4})

encoder = LabelEncoder()
df['anchor_map'] = encoder.fit_transform(df['anchor'])

kf = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
for n, (_, valid_index) in enumerate(kf.split(df, df['score_map'], groups=df['anchor_map'])):
    df.loc[valid_index, 'fold'] = int(n)

df['fold'] = df['fold'].astype(int)
df.to_csv('folds.csv', index=False)  # skip the index so no 'Unnamed: 0' column appears on reload
df

<a id="supervised-learning"></a>
# Supervised learning (multi-class classification)

<a id="h"></a>
**Hyperparameters**
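
The batch size, number of epochs and accumulation steps set below determine how many optimizer updates a training run performs. The training cell later derives `num_train_steps` with plain division; the sketch here rounds up per epoch instead, on the assumption that the `DataLoader` keeps the last partial batch (its default). `count_steps` is a hypothetical helper and the sample count is made up. A total like this is what a scheduler such as `get_linear_schedule_with_warmup` (imported above but not used in this notebook) would take as `num_training_steps`.

```python
import math


def count_steps(n_samples, batch_size, accumulation_steps, epochs):
    """Return (batches per epoch, total optimizer updates) for a training run."""
    # The last partial batch still counts as a batch.
    batches_per_epoch = math.ceil(n_samples / batch_size)
    # Assumption: one optimizer update every `accumulation_steps` batches.
    updates_per_epoch = math.ceil(batches_per_epoch / accumulation_steps)
    return batches_per_epoch, updates_per_epoch * epochs


# Made-up dataset size; the other values mirror the args class below.
batches, total_updates = count_steps(
    n_samples=1000, batch_size=64, accumulation_steps=1, epochs=5
)
print(batches, total_updates)
```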

In [None]:
class args:
    model = "anferico/bert-for-patents"
    max_len = 32
    accumulation_steps = 1
    batch_size = 64
    epochs = 5
    learning_rate = 2e-5

<a id="s"></a>

**Seed Everything**

In [None]:
import os
import random

def seed_everything(seed=42):
    # Seed every random number generator the notebook relies on
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    
seed_everything(seed=42)

<a id="d"></a>
**Dataset**
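
The dataset class below turns each score into a one-hot vector over the five possible values (0, 0.25, 0.5, 0.75, 1). Because the scores are evenly spaced, the class index is simply `score * 4`, so the same encoding and its inverse can be written compactly. A minimal NumPy sketch; `score_to_one_hot` and `one_hot_to_score` are hypothetical helpers, not functions used elsewhere in the notebook.

```python
import numpy as np

NUM_CLASSES = 5  # scores 0, 0.25, 0.5, 0.75, 1


def score_to_one_hot(score):
    """Map a score in {0, 0.25, 0.5, 0.75, 1} to a one-hot vector of length 5."""
    idx = int(round(score * 4))
    return np.eye(NUM_CLASSES, dtype=np.float32)[idx]


def one_hot_to_score(vec):
    """Inverse mapping: position of the largest entry, rescaled to [0, 1]."""
    return int(np.argmax(vec)) / 4


encoded = score_to_one_hot(0.75)
print(encoded, one_hot_to_score(encoded))
```

The inverse is the same `argmax(...) / 4` decoding that the validation function applies to the model outputs.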

In [None]:
class PhraseDataset:
    def __init__(self, anchor, target, context, score, tokenizer, max_len):
        self.anchor = anchor
        self.target = target
        self.context = context
        self.score = score
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.anchor)

    def __getitem__(self, item):
        anchor = self.anchor[item]
        context = self.context[item]
        target = self.target[item]
        score = self.score[item]
        
        # One-hot encode the score over the five possible values
        if score == 0.0    : score = [1,0,0,0,0]
        elif score == 0.25 : score = [0,1,0,0,0]
        elif score == 0.5  : score = [0,0,1,0,0]
        elif score == 0.75 : score = [0,0,0,1,0]
        elif score == 1.0  : score = [0,0,0,0,1]


        encoded_text = self.tokenizer.encode_plus(
            context + " " + anchor,
            target,
            padding="max_length",
            max_length=self.max_len,
            truncation=True,
        )
        input_ids = encoded_text["input_ids"]
        attention_mask = encoded_text["attention_mask"]
        token_type_ids = encoded_text["token_type_ids"]

        return {
            "ids": torch.tensor(input_ids, dtype=torch.long),
            "mask": torch.tensor(attention_mask, dtype=torch.long),
            "token_type_ids": torch.tensor(token_type_ids, dtype=torch.long),
            "score": torch.tensor(score, dtype=torch.float32),
        }

<a id="m"></a>
**Model**
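
The model below returns five raw logits per phrase pair, and the training cell compares them with the one-hot targets using `nn.BCEWithLogitsLoss`. That loss treats each of the five outputs as an independent yes/no decision, which is different from a softmax cross-entropy over five classes. To make explicit what is being minimized, here is a NumPy sketch of the same quantity with mean reduction; `bce_with_logits` is a hypothetical re-implementation for illustration and the logits are made up.

```python
import numpy as np


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy computed from raw logits (numerically stable form)."""
    x = np.asarray(logits, dtype=np.float64)
    z = np.asarray(targets, dtype=np.float64)
    # Equivalent to -[z*log(sigmoid(x)) + (1-z)*log(1-sigmoid(x))], without overflow.
    per_element = np.maximum(x, 0) - x * z + np.log1p(np.exp(-np.abs(x)))
    return per_element.mean()


# One made-up sample: the true class is index 0 (score 0.0).
logits = np.array([[2.0, -1.0, 0.0, -3.0, -2.0]])
target = np.array([[1.0, 0.0, 0.0, 0.0, 0.0]])

loss = bce_with_logits(logits, target)
print(loss)
```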

In [None]:
class PhraseModel(nn.Module):
    def __init__(self, model_name, learning_rate, num_train_steps, steps_per_epoch):
        super().__init__()
        self.learning_rate = learning_rate
        self.model_name = model_name
        self.num_train_steps = num_train_steps
        self.steps_per_epoch = steps_per_epoch

        config = AutoConfig.from_pretrained(model_name)
        config.update(
            {
                "output_hidden_states": True,
                "add_pooling_layer": True,
                "num_labels": 5,
            }
        )
        self.transformer = AutoModel.from_pretrained(model_name, config=config)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.output = nn.Linear(config.hidden_size, 5)

    def forward(self, ids, mask, token_type_ids, score):
        transformer_out = self.transformer(ids, mask, token_type_ids)
        output = transformer_out.pooler_output
        output = self.dropout(output)
        output = self.output(output)

        return output

<a id="u"></a>
**Utils and train function**
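
The validation function below decodes targets and outputs with `argmax(...) / 4` and averages the Pearson correlation over batches. An average of per-batch correlations is not the same number as the correlation over the whole validation set, so one alternative is to collect all decoded values first and compute the correlation once. The sketch below shows that on made-up arrays; `decode` and `pearson` are hypothetical helpers, and `np.corrcoef` is used here in place of `scipy.stats.pearsonr` only to keep the example dependency-light.

```python
import numpy as np


def decode(one_hot_or_logits):
    """Pick the most likely class per row and rescale it to a score in [0, 1]."""
    return np.argmax(one_hot_or_logits, axis=1) / 4


def pearson(y_true, y_pred):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(y_true, y_pred)[0, 1])


# Made-up one-hot targets and model logits for four phrase pairs.
targets = np.array([
    [1, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 1, 0, 0, 0],
])
logits = np.array([
    [2.0, 0.1, -1.0, -2.0, -3.0],
    [-1.0, 0.5, 0.4, -1.0, -2.0],
    [-3.0, -2.0, -1.0, 0.2, 1.5],
    [0.3, 0.2, -1.0, -2.0, -3.0],
])

y_true = decode(targets)
y_pred = decode(logits)
print(y_true, y_pred, pearson(y_true, y_pred))
```

Note that the correlation is undefined when every prediction in the sample is the same value, which can happen for a single small batch.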

In [None]:
def save_checkpoint(state, filename): 
    print("=> Saving checkpoint")
    torch.save(state, filename)

def get_score(y_true, y_pred):
    score = sp.stats.pearsonr(y_true, y_pred)
    return score[0]

def check_acc(loader, model, loss):
    model.eval()
    loop       = tqdm(loader)
    total_loss = 0
    total_p    = 0
    with torch.no_grad():
        for batch_idx, inputs in enumerate(loop):

            ids            = inputs['ids'].to('cuda')
            mask           = inputs['mask'].to('cuda')
            token_type_ids = inputs['token_type_ids'].to('cuda')
            score          = inputs['score'].to('cuda')
            
            output         = model(ids, mask, token_type_ids, score)
            # BCEWithLogitsLoss expects (logits, targets), in that order
            total_loss    += loss(output, score).item()
            # Pearson correlation between the decoded true and predicted scores of this batch
            y_true         = (torch.argmax(score, dim=1) / 4).cpu().numpy()
            y_pred         = (torch.argmax(output, dim=1) / 4).cpu().numpy()
            total_p       += get_score(y_true, y_pred)
    print(total_loss/len(loader), total_p/len(loader))
    # Restore training mode so that dropout is active again in train_fn
    model.train()
    return total_p/len(loader)

    
def train_fn(loader, model, optimizer, loss_fn):
    loop = tqdm(loader)
    
    for batch_idx, inputs in enumerate(loop):
    
        ids            = inputs['ids'].to('cuda')
        mask           = inputs['mask'].to('cuda')
        token_type_ids = inputs['token_type_ids'].to('cuda')
        score          = inputs['score'].to('cuda')
  
        optimizer.zero_grad()
        output = model(ids, mask, token_type_ids, score)
        loss   = loss_fn(output.squeeze(), score.squeeze())

        loss.backward()
        optimizer.step()
        # update tqdm loop
        loop.set_postfix(loss=loss.item())

<a id="t"></a>
**Train**
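
The training cell below replaces each CPC context code with the title of its section, using only the first letter of the code, and the dataset then feeds `context + " " + anchor` and `target` to the tokenizer as a sentence pair. The text that reaches the tokenizer can be previewed without loading any model. A minimal sketch; `build_text_pair` is a hypothetical helper, and the anchor, target and code in the example are illustrative values.

```python
# Section titles keyed by the first letter of the CPC code (same mapping as in the training cell).
SECTION_TITLES = {
    "A": "Human Necessities",
    "B": "Operations and Transport",
    "C": "Chemistry and Metallurgy",
    "D": "Textiles",
    "E": "Fixed Constructions",
    "F": "Mechanical Engineering",
    "G": "Physics",
    "H": "Electricity",
    "Y": "Emerging Cross-Sectional Technologies",
}


def build_text_pair(anchor, target, context_code):
    """Return the two text segments that are passed to the tokenizer."""
    title = SECTION_TITLES[context_code[0]]  # digits after the letter are ignored
    return f"{title} {anchor}", target


first, second = build_text_pair("abatement", "abatement of pollution", "A47")
print(first, "|", second)
```

Since everything after the first letter is dropped, two pairs whose contexts differ only in the class digits (for example `A47` and `A61`) get identical context text.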

In [None]:
for fold_ in range(5):
    print('#############################' + str(fold_))
    df = pd.read_csv("./folds.csv")

    context_mapping = {
        "A": "Human Necessities",
        "B": "Operations and Transport",
        "C": "Chemistry and Metallurgy",
        "D": "Textiles",
        "E": "Fixed Constructions",
        "F": "Mechanical Engineering",
        "G": "Physics",
        "H": "Electricity",
        "Y": "Emerging Cross-Sectional Technologies",
    }

    df.context = df.context.apply(lambda x: context_mapping[x[0]])

    train_df = df[df["fold"] != fold_].reset_index(drop=True)
    valid_df = df[df["fold"] == fold_].reset_index(drop=True)

    tokenizer = AutoTokenizer.from_pretrained(args.model)
    train_dataset = PhraseDataset(
        anchor=train_df.anchor.values,
        target=train_df.target.values,
        context=train_df.context.values,
        score=train_df.score.values,
        tokenizer=tokenizer,
        max_len=args.max_len,
    )
    train_loader = DataLoader(
        train_dataset,
        batch_size=args.batch_size,
        shuffle=True,
    )
    valid_dataset = PhraseDataset(
        anchor=valid_df.anchor.values,
        target=valid_df.target.values,
        context=valid_df.context.values,
        score=valid_df.score.values,
        tokenizer=tokenizer,
        max_len=args.max_len,
    )
    val_loader = DataLoader(
        valid_dataset,
        batch_size=args.batch_size,
        shuffle=False,
    )

    num_train_steps = int(len(train_dataset) / args.batch_size / args.accumulation_steps * args.epochs)
    steps_per_epoch = len(train_dataset) / args.batch_size

    model = PhraseModel(
        model_name      = args.model,
        learning_rate   = args.learning_rate,
        num_train_steps = num_train_steps,
        steps_per_epoch = steps_per_epoch,
    ).to('cuda')
    optimizer = optim.Adam(model.parameters(), lr=args.learning_rate) 
    loss_fn   = nn.BCEWithLogitsLoss()
    
    check_acc(val_loader, model, loss_fn)
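    best_score = 0  # best validation Pearson correlation seen so far for this fold
    for epoch in range(args.epochs):
        print(epoch)
        train_fn(train_loader, model, optimizer, loss_fn)
        val_score = check_acc(val_loader, model, loss_fn)
        
        # check_acc returns a correlation (higher is better), so keep the best-scoring checkpoint
        if val_score > best_score:
            best_score = val_score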

            checkpoint = {
            "state_dict": model.state_dict(),
            "optimizer":  optimizer.state_dict(),
            }

            save_checkpoint(checkpoint, filename='my_checkpoint.pth.tar'+ str(fold_))