### Dear reader, please note that the EDA for this notebook was written in Jupyter Notebook, so some features (particularly graphs) may not work on Kaggle as intended. I have marked the respective parts and generally recommend downloading the notebook and opening it in Jupyter Notebook to follow my thoughts behind some graphs.

In [None]:
import warnings
warnings.filterwarnings("ignore") #can get annoying and visually distracting

In [None]:
import os
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.patches as mpatches
import seaborn as sns

def ecdf(data):
    #sort the data for the x-axis
    x_axis = np.sort(data)
    #assign each sorted value its cumulative proportion: 1/n, 2/n, ..., n/n (with n = len(data)),
    #so the y-axis runs from just above 0 up to exactly 1
    ##this allows quantile-like readings later on -> Y% of the data is at or below X
    y_axis = np.arange(1, len(data)+1)/len(data)
    #returning a tuple lets the caller unpack both arrays at once
    return x_axis, y_axis

import pyLDAvis
import pyLDAvis.gensim #for kaggle
#import pyLDAvis.gensim_models #for Jupyter
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_extraction.text import TfidfVectorizer

import gensim
from gensim import corpora
import en_core_web_sm
import re
import spacy
from wordcloud import WordCloud
from transformers import pipeline
from transformers import AutoModelForSequenceClassification,AutoTokenizer, DataCollatorWithPadding
from datasets import load_dataset, Dataset, DatasetDict

seed_value = 2
os.environ['PYTHONHASHSEED'] = str(seed_value)
random.seed(seed_value)
np.random.seed(seed_value)
import tensorflow as tf
tf.random.set_seed(seed_value)

In [None]:
#Loading data directly in kaggle
df = pd.read_csv("../input/us-patent-phrase-to-phrase-matching/train.csv")
test = pd.read_csv("../input/us-patent-phrase-to-phrase-matching/test.csv")

print(f"train data shape: {df.shape}; test data shape: {test.shape}")
#notably, the test data is very short and contains no output feature

Breakdown of features:
1. id: unique identifier (won't be used)
2. anchor: first phrase
3. target: second phrase
4. context: CPC classification code; similarity is scored within these groups (https://en.wikipedia.org/wiki/Cooperative_Patent_Classification)
5. score: similarity score = outcome variable
      * 1.0 = very close; 0.75 = close; 0.5 = synonyms with a different meaning; 0.25 = somewhat related; 0.0 = unrelated
     

# Goal:
Predict the score, i.e. the similarity between anchor and target within each context.


-> While we want to score the similarity between anchor and target, the context can heavily impact this similarity!

As a result, all columns of the data set (except id) need to be explored.

# Preprocessing

In [None]:
df.head()

In [None]:
#Are all IDs unique identifiers? (because you never know)
print(f"{len(np.unique(df.id))} out of {df.shape[0]} samples are unique")
#the length of unique values matches the train shape; there are no duplicates in the dataset

#unique values per feature (not including ID)
vals = [len(np.unique(df.anchor)), len(np.unique(df.target)), len(np.unique(df.context))]
sns.barplot(x = ["anchor", "target", "context"], y = vals);
#notably, although anchor and target are heavily related by meaning, their numbers of unique values differ greatly.
#However, ~7000 target values seem to be repeats, given that the df has 36473 rows.

## Feature: Anchor

In [None]:
#How often do anchors occur?
df.anchor.value_counts().reset_index().describe().T
#The 733 anchors appear 50 times on average; each appears at least once (duh) and at most 152 times

In [None]:
#tip: double clicking the plot will increase readability.
sns.set(font_scale = 0.5)
#compute the counts once and reuse them for the bar order and the quartile lines
anchor_counts = df.anchor.value_counts()
fig, ax = plt.subplots(figsize = (65,30))
sns.countplot(x = df.anchor, order = anchor_counts.index, ax = ax, color = "b")
ax.set_xticklabels(ax.get_xticklabels(), rotation=90);
ax.axhline(anchor_counts.quantile(0.25), color = "r", label = "25th percentile")
ax.axhline(anchor_counts.quantile(0.50), color = "orange", label = "50th percentile")
ax.axhline(anchor_counts.quantile(0.75), color = "r", label = "75th percentile")
plt.title("Counts of Anchors", fontsize = 40)
plt.legend(fontsize=40)
#many anchors lie far above the 3rd quartile or far below the 1st quartile

In [None]:
x,y = ecdf(df.anchor.value_counts())
plt.plot(x, y, marker = ".", linestyle ="None")
plt.xlabel("Anchor occurrences");
#We can see that the anchor occurrences are pretty unbalanced
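To make the reading of the ECDF concrete, here is a minimal sketch on made-up counts. It repeats the `ecdf` helper from above so that it runs on its own; `toy_counts`, `x_toy`, `y_toy` and `share_at_most_3` are names invented for this illustration and are not part of the dataset.

```python
import numpy as np

def ecdf(data):
    """Return the sorted values and their cumulative proportions."""
    x_axis = np.sort(data)
    y_axis = np.arange(1, len(data) + 1) / len(data)
    return x_axis, y_axis

# hypothetical occurrence counts for five anchors (not taken from the data)
toy_counts = np.array([3, 1, 10, 2, 4])
x_toy, y_toy = ecdf(toy_counts)

# share of anchors that occur at most 3 times:
# find the last sorted value <= 3 and read off its cumulative proportion
last_idx = np.searchsorted(x_toy, 3, side="right") - 1
share_at_most_3 = y_toy[last_idx]
print(share_at_most_3)
```

In the plot above, the same reading works visually: pick a count on the x-axis and the height of the curve gives the share of anchors that occur at most that often.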

In [None]:
fig, ax = plt.subplots(1,2, figsize = (30,10))
sns.set(font_scale = 0.9)
#characters per anchor (spaces included)
symbols = df.anchor.str.len()

sns.countplot(x = symbols, color = "b", ax = ax[0])
ax[0].set_title("Number of characters in Anchors");
#the number of characters per anchor is roughly normally distributed
#most values are in the range of 12-19 characters

#words per anchor
word_count = df.anchor.str.split().str.len()

sns.countplot(x = word_count, color = "b", ax = ax[1])
ax[1].set_title("Number of words in Anchors");
#the anchors contain 1-5 words; most of them contain 2
#this may be relevant for truncation and padding in the modelling process

## Target

In [None]:
df.target.value_counts().reset_index().describe().T
#there are a lot more unique values in the target and most of them only appear once
#however, some of them appear up to 24 times; on average targets appear 1.24 times

#this may also mean that it is hard to train (and also overfit) the models for specific targets!

In [None]:
#Checking numbers in anchor feature
#Code from: https://www.kaggle.com/code/remekkinas/eda-and-feature-engineering/notebook

pattern = '[0-9]'
mask = df['anchor'].str.contains(pattern, na=False)
df['num_anchor'] = mask
df[mask]['anchor'].value_counts()
#5 anchors contain numbers
#generally these names are rather cryptic

In [None]:
df[df.anchor == "conh2"]
#there is a lot of domain knowledge necessary here

In [None]:
fig, ax = plt.subplots(1,2, figsize = (30,10))
sns.set(font_scale = 0.9)

#characters per target (spaces included)
symbols = df.target.str.len()

sns.countplot(x = symbols, color = "b", ax = ax[0])
ax[0].set_title("Number of characters in Targets");
#the number of characters per target is (beautifully) close to normally distributed

sns.set(font_scale = 0.75)
#words per target
word_count = df.target.str.split().str.len()

sns.countplot(x = word_count, color = "b", ax = ax[1])
ax[1].set_title("Number of words in Targets");
#the targets contain 1-15 words; most of them contain 1 to 3 words
#some of the targets are very long (15 words / 98 characters)

#this will be relevant for modelling later on
#but given that these long targets are very rare, we can easily truncate them without hard feelings

## Context

In [None]:
#Dropping the digits of the context to cluster on the general category (called gen_cat)
#the vectorised string accessor avoids the slow row-by-row chained assignment
df["gen_cat"] = df.context.str[0]

In [None]:
context = df.context.value_counts().reset_index().describe().T
pd.concat([context, df.gen_cat.value_counts().reset_index().describe().T])
#there are 106 different context codes from 8 overall categories
#each context appears at least 18 times, while each general category appears at least 1279 times
#the most common context appears 2186 times and the most common general category appears 8019 times

#this imbalance could potentially lead to overfitting on certain contexts / categories

In [None]:
#Checking numbers in target feature
#Code from: https://www.kaggle.com/code/remekkinas/eda-and-feature-engineering/notebook

pattern = '[0-9]'
mask = df['target'].str.contains(pattern, na=False)
df['num_target'] = mask
df[mask]['target'].value_counts()
#there are more values in target containing numbers, but they are always less frequent.

In [None]:
df[df.target == "h2o product"]
#this should have a higher score in my opinion.
#0.5 implies synonyms with a different meaning; I disagree with this score :)

In [None]:
#tip: double clicking the plot will increase readability.
sns.set(font_scale = 1.5)
#compute the counts once and reuse them for the bar order and the quartile lines
ctx_counts = df.context.value_counts()
fig, ax = plt.subplots(figsize = (65,30))
sns.countplot(x = df.context, order = ctx_counts.index, ax = ax, color = "b")
ax.set_xticklabels(ax.get_xticklabels(), rotation=90);
ax.axhline(ctx_counts.quantile(0.25), color = "r", linewidth = 3, label = "25th percentile")
ax.axhline(ctx_counts.quantile(0.50), color = "orange",linewidth = 3, label = "50th percentile")
ax.axhline(ctx_counts.quantile(0.75), color = "r", linewidth = 3, label = "75th percentile")
plt.title("Counts of Context", fontsize = 40)
plt.legend(fontsize=40)
#a few contexts heavily outweigh most of the others

In [None]:
#tip: double clicking the plot will increase readability.
sns.set(font_scale = 1.5)
#compute the counts once and reuse them for the bar order and the quartile lines
cat_counts = df.gen_cat.value_counts()
fig, ax = plt.subplots(figsize = (25,10))
sns.countplot(x = df.gen_cat, order = cat_counts.index, ax = ax, color = "b")
ax.axhline(cat_counts.quantile(0.25), color = "r", linewidth = 3, label = "25th percentile")
ax.axhline(cat_counts.quantile(0.50), color = "orange",linewidth = 3, label = "50th percentile")
ax.axhline(cat_counts.quantile(0.75), color = "r", linewidth = 3, label = "75th percentile")
plt.title("Counts of general Categories", fontsize = 25)
plt.legend(fontsize=15)
#unlike the individual contexts, the general categories are more balanced
#However, there is only little data for the general categories E & D

In [None]:
#since there are many more anchors in the anchor-count plot than in the context-count plot, we know that some contexts
#have multiple anchors; at the same time: multiple contexts can also have the same anchor!
print(df[df.anchor == "activating position"].context.nunique(), df[df.anchor == "activating position"].gen_cat.nunique())
df[df.anchor == "activating position"]
#this example shows that some anchors are shared among contexts (in this case 3 different contexts in 3 different general categories)

In [None]:
#How many unique contexts are given in train?
np.unique(df.context), f"{len(np.unique(df.context))} unique values"

In [None]:
#How many unique contexts are given in test?
np.unique(test.context), f"{len(np.unique(test.context))} unique values"
#all the contexts from test are included in train

#notably, many context values in the training data are not contained in the test data
#However, this does not mean that the final Kaggle result will not contain the missing 77 values!
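Comparing the two printed arrays by eye is error-prone, so a set comparison can confirm the overlap. The following is a minimal sketch on toy frames; `toy_train` and `toy_test` are hypothetical stand-ins for `df` and `test`, and the context codes are only examples.

```python
import pandas as pd

# hypothetical stand-ins for the notebook's df and test frames
toy_train = pd.DataFrame({"context": ["A47", "B60", "C07", "H04", "A47"]})
toy_test = pd.DataFrame({"context": ["B60", "H04", "H04"]})

train_contexts = set(toy_train.context)
test_contexts = set(toy_test.context)

# contexts that would be unseen at prediction time
only_in_test = test_contexts - train_contexts
# contexts that the visible test data does not cover
only_in_train = train_contexts - test_contexts

print(f"contexts only in test: {sorted(only_in_test)}")
print(f"contexts only in train: {sorted(only_in_train)}")
```

Applied to the real frames, an empty first set would back the statement that all test contexts are known from training, and the size of the second set would correspond to the number of contexts missing from the visible test data.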

In [None]:
#Closer look at the contexts which only have a few entries
df[df.context == "F26"]
#it may be hard to train models on this little data; however, the target words are very similar without much variation.
#is there a way to artificially increase the combinations for these contexts?

In [None]:
#Closer look at the contexts which only have a few entries
df[df.context == "A62"]
#some of these word combinations seem wildly different.
#also, some of these word combinations again seem ambiguously scored:
#metal phase -> metal of material = 0.5
#metal phase -> metal material = 0.25

#with this context containing many anchors, it may make less sense to include context in the anchor for these samples
#some of these anchors are particularly rare in this context

In [None]:
list(df["gen_cat"].unique())
#we would expect B, E, F, G and H to be close to one another! (just from general domains)

    A: Human Necessities
    B: Operations and Transport
    C: Chemistry and Metallurgy
    D: Textiles
    E: Fixed Constructions
    F: Mechanical Engineering
    G: Physics
    H: Electricity
    Y: Emerging Cross-Sectional Technologies

In [None]:
#Wordcloud per (general) context (most frequent words per context)
wc_a = WordCloud(width = 800, height = 400, background_color="white").generate(" ".join(target for target in df[df.gen_cat == "A"].target))
wc_b = WordCloud(width = 800, height = 400, background_color="white").generate(" ".join(target for target in df[df.gen_cat == "B"].target))
wc_c = WordCloud(width = 800, height = 400, background_color="white").generate(" ".join(target for target in df[df.gen_cat == "C"].target))
wc_d = WordCloud(width = 800, height = 400, background_color="white").generate(" ".join(target for target in df[df.gen_cat == "D"].target))
wc_e = WordCloud(width = 800, height = 400, background_color="white").generate(" ".join(target for target in df[df.gen_cat == "E"].target))
wc_f = WordCloud(width = 800, height = 400, background_color="white").generate(" ".join(target for target in df[df.gen_cat == "F"].target))
wc_g = WordCloud(width = 800, height = 400, background_color="white").generate(" ".join(target for target in df[df.gen_cat == "G"].target))
wc_h = WordCloud(width = 800, height = 400, background_color="white").generate(" ".join(target for target in df[df.gen_cat == "H"].target))

In [None]:
#Show the wordclouds
fig = plt.figure(figsize = (40,40))
ims = [[wc_a, "Wordcloud: Context A"],
       [wc_b, "Wordcloud: Context B"],
       [wc_c, "Wordcloud: Context C"],
       [wc_d, "Wordcloud: Context D"],
       [wc_e, "Wordcloud: Context E"],
       [wc_f, "Wordcloud: Context F"],
       [wc_g, "Wordcloud: Context G"],
       [wc_h, "Wordcloud: Context H"]]

for a, b in enumerate(ims):
    fig.add_subplot(4,2, a+1)
    plt.imshow(b[0], interpolation='bilinear')
    plt.title(b[1], fontsize = 30)
    plt.axis("off")
    
#Double-clicking may increase readability :)
#Let's quickly look at the first things we can notice:
    #Looking at the wordclouds, we see that the word "device" is common in contexts A, B, E, G, H
    #Contexts B and D both have the word "layer" as a common occurrence
    #Contexts B, D, E and F all have the word "water" as a common occurrence
    #Contexts A, B, E and F all have the word "member" as a common occurrence
    #Contexts B and C both have the word "metal" as a common occurrence
#As a result, none of the wordclouds are fully disconnected from the others
    #C seems "the most disconnected"

In [None]:
#Lengths of target (in words) per context
#the vectorised string accessor avoids the slow row-by-row chained assignment
df["target_length"] = df.target.str.split().str.len()

sns.boxplot(x = "target_length", y = "gen_cat", data = df, color = "b")
plt.xticks([1,2,3,4,5, 10, 15]);
#most context categories are in the area of 2-3 words for target
#C has, relatively speaking, the longest targets
#C and D have, relatively speaking, the shortest targets

In [None]:
fig = plt.figure(figsize = (15,60))
sns.boxplot(x = "target_length", y = "context", data = df, hue = "gen_cat")
#interestingly, the targets of some contexts (such as C07 and C08) are very short but also have the strongest outliers
#we can see that the target lengths of the subcategories are often similar within their categories

In [None]:
#Looking at these word lengths, let's have a look at the scores they receive
#(because maybe they have a terrible score just because of the lengths)
df[df.target_length >= 6].head(25)
#index 3402 seems particularly fun (also has a very high score)
#again we can see that two basically identical lines have a different context number (7341, 7369)

In [None]:
df[df.target_length >= 6].boxplot(column = "score", by = "target_length")
#it seems like longer targets rarely receive a full score

In [None]:
df[(df.target_length >= 6) & (df.score == 1)]
#the only case of a perfect score with a long target has a very long anchor itself (so it's only 2 words longer)

In [None]:
#Maybe instead of looking at absolute lengths, we should look at relative lengths compared to the anchor
df["length_diff"] = df["target_length"] - df.anchor.str.split().str.len()

df.boxplot(column = "score", by = "length_diff")
#it seems like a length difference of more than 3 or less than -2 does not allow a perfect score
#while a target that is much shorter than the anchor generally seems bad for the score,
#a target that is longer than the anchor generally seems to have a positive impact

#these findings need to be treated with some caution, though, given that they are based on only a few data points
#Accordingly, this may be completely different for unknown test data

## Score

In [None]:
sns.set(font_scale = 1)
sns.boxplot(x = df.score)
#scores of 1 are so rare that they are considered outliers

In [None]:
sns.histplot(x = df.score, bins = 5)
plt.xticks([0.0, 0.25, np.mean(df.score), 0.5, 0.75, 1.0]);
plt.axvline(np.mean(df.score), color = "red", label = "mean")
plt.legend()
plt.title("Histogram of Score");

In [None]:
#Which entries have a score of 1?
df[df.score == 1].head(15)
#it seems like patents with the same anchor and target sometimes have different contexts (B65 & G06; A41 & B23)

In [None]:
#How many are there per context group?
context_counts = df[df.score == 1].groupby("context").id.count().reset_index().sort_values("id", ascending = False)
context_counts.T
#100 contexts (out of 106) have perfect scores
#however, 9 of them only have one perfect score, which basically allows no training for perfect synonyms

In [None]:
#looking at an example of a context with only one perfect score (out of 70 entries)
print(f"there are {df[df.context == 'A22'].shape[0]} samples for this context")
df[df.context == "A22"].head(20)
#maybe turning word groups into syllables will help in prediction
#alternatively, it probably makes sense to reduce key words into their parts for abbreviations
#such as electromagnetic -> electro magnetic -> em

In [None]:
#Counting the samples per context and score level for a stacked barchart
#pd.crosstab replaces the chain of merges; contexts that lack a score level are kept with a count of 0
scores_plot = pd.crosstab(df.context, df.score)
scores_plot.columns = [f"count: score {s:.2f}" for s in scores_plot.columns]
#order the contexts by their overall number of samples (most frequent first)
scores_plot = scores_plot.loc[df.context.value_counts().index]

#Creating the stacked barchart for scores
fig, ax =plt.subplots(figsize = (65,30))
scores_plot.plot(kind = "bar", stacked = True, ax = ax)
plt.legend(fontsize = 40)
#This plot underlines how rare perfect scores are and how very common 0.25 and 0.5 are as score.

In [None]:
perfect_scores = df[df.score == 1].groupby("context").id.count().reset_index().sort_values("id", ascending = False)

#tip: double clicking the plot will increase readability.
sns.set(font_scale = 1.5)
fig, ax = plt.subplots(figsize = (65,30))
sns.barplot(x = "context", y ="id", data = perfect_scores, ax = ax, color = "b")
ax.set_xticklabels(ax.get_xticklabels(), rotation=90);
ax.axhline(perfect_scores.id.quantile(0.25), color = "r", linewidth = 3, label = "25th percentile")
ax.axhline(perfect_scores.id.quantile(0.50), color = "orange",linewidth = 3, label = "50th percentile")
ax.axhline(perfect_scores.id.quantile(0.75), color = "r", linewidth = 3, label = "75th percentile")
plt.title("Counts of perfect scores per context", fontsize = 40)
plt.ylabel("count")
plt.legend(fontsize=40);
#again, some contexts heavily outweigh the other contexts
#However, the order of perfect scores is not identical to the order of overall counts per context

In [None]:
#Which entries have a score of 0?
df[df.score == 0].head(25)
#some of these seem unjustifiably scored low: abatement - rent abatement; abatement - tax abatement

In [None]:
df[df.score == 0.75].head(25)
#stopwords matter! (last two lines) -> if you kick them out, the target and anchor would be identical

## Similarities
Exploring further the ideas that were first shown in the wordclouds

In [None]:
#This thing will take a hot minute but will help for word clouds and clustering
nlp = en_core_web_sm.load()
#Lemmatize the data 
data_lem = []
for i in list(df.target): 
    lemma = nlp(i)
    data_lem.append(" ".join([word.lemma_ for word in lemma]))

In [None]:
#Create dictionary and bag of words from the data
tokens = [[word for word in data.split()] for data in data_lem]
dictionary = corpora.Dictionary(tokens)
doc_term_matrix = [dictionary.doc2bow(patent) for patent in tokens]

In [None]:
#Initiate the gensim LDA model for pyLDAvis (also will take a short while)
LDA = gensim.models.ldamodel.LdaModel
ldamodel = LDA(corpus = doc_term_matrix,
               id2word = dictionary,
               num_topics = len(list(df["gen_cat"].unique())), 
               #it might make sense to explore how many ACTUALLY different topics there are based on the targets (probably less than 8)
               random_state = 0,
               chunksize = 2000,
               passes = 50, 
               iterations = 100)

In [None]:
#check coherence (high = good) and perplexity (low = good)
from gensim.models import CoherenceModel
coherence_model = CoherenceModel(model = ldamodel, texts = tokens, dictionary = dictionary, coherence = "c_v")
ldamodel.log_perplexity(doc_term_matrix, total_docs = df.shape[0]), coherence_model.get_coherence()

In [None]:
#Looks a lot better on white background ;)
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(ldamodel, doc_term_matrix, dictionary) #for Kaggle
#vis = pyLDAvis.gensim_models.prepare(ldamodel, doc_term_matrix, dictionary) #for Jupyter
vis

#we can see that stopwords are often among the most salient terms.
#However, since the targets are very short, it doesn't make sense to remove them, since they sometimes reduce the overall score

In [None]:
#Now, let's replicate the results with scikit-learn (which shows the clusters less "beautifully")
#scikit-learn is a great alternative, because we can see how the groups are actually located

#Vectorize data
idf = TfidfVectorizer(min_df = 0.001) 
#min_df = 0.001 will reduce computing time (a lot) and increase the variance ratio of the first 3 components
text_idf = idf.fit_transform(df.target).toarray()
y = list(df["gen_cat"])

In [None]:
#Fit classifier (may take a while)
clf = LinearDiscriminantAnalysis()
X_r2 = clf.fit(text_idf, y).transform(text_idf)

In [None]:
#the first 3 components explain 70% of variance
clf.explained_variance_ratio_

In [None]:
map_col = {"A":"blue",
          "B":"green",
          "C":"black",
          "D":"red",
          "E":"yellow",
          "F":"purple",
          "G":"brown",
          "H":"orange"}
df["colours"] = df["gen_cat"].map(map_col)
df.head()

In [None]:
#this plot was created to be opened in jupyter notebook (to have an interactive 3D Chart and being able to see the clusters better)
#%matplotlib notebook #activate this in jupyter
fig = plt.figure(figsize = (8,8))
ax = fig.add_subplot(111, projection = '3d')


x = X_r2[:,0]
y = X_r2[:,1]
z = X_r2[:,2]

ax.scatter(x,y,z, c = df["colours"], marker = ".")

col_a = mpatches.Patch(color='blue', label='A / Human Necessities')
col_b = mpatches.Patch(color='green', label='B / Operations and Transport')
col_c = mpatches.Patch(color='black', label='C / Chemistry and Metallurgy')
col_d = mpatches.Patch(color='red', label='D / Textiles')
col_e = mpatches.Patch(color='yellow', label='E / Fixed Constructions')
col_f = mpatches.Patch(color='purple', label='F / Mechanical Engineering')
col_g = mpatches.Patch(color='brown', label='G / Physics')
col_h = mpatches.Patch(color='orange', label='H / Electricity')
handles=[col_a, col_b, col_c, col_d, col_e, col_f, col_g, col_h]
plt.legend(handles=handles, loc = "upper right", fontsize = 8);

#we expected B, E, F, G and H to be close to one another! (green, yellow, purple, brown, orange); just by topic names
#this means, everything but: black, blue, red
#However, we can see that only black is clustered apart (and still some outliers fall into other clusters)
#interestingly, purple seems somewhat separated as well.

## Summarizing EDA & Preprocessing:
 - Some anchors are shared among several contexts (and general categories)
 - Most contexts contain several different anchors
 - Some contexts heavily outweigh others in overall occurrence (heavily right-skewed)
 - In general, the contexts of the categories D & E are under-represented in the data
 - The proportions of the scores are more or less similar across all contexts
 - Most of the contexts in which we want to predict the scores are similar with regard to the words used and the word lengths
 
 
 - It does not make sense to remove stop words or short words, since they actually impact the score ("accept information -> accept this information" = 0.75)
 - Abbreviations are a thing in the dataset (e.g., Electromagnetic = em; Water = h2o) -> It might make sense to find a model for domain-specific abbreviations (also for possibly unknown categories & abbreviations in the test set)
     -> BUT: abbreviations also penalize the score!
 - Synonyms are often not as heavily penalized as abbreviations - a good synonym finder will be helpful
     -> generally, the penalization of synonyms sometimes seems weird (e.g., absorbant properties and absorbant characteristics is a perfect match at one point (id: 621b048d70aa8867), but absorption characteristics is an imperfect match (0.75) at another point (id: e6f92889099fd908)) -> maybe lemmatization will mess up these relationships where they are considered "imperfect" because there are two small misalignments

# Modeling

In [None]:
#Custom callback to report the Pearson correlation on the val set after every epoch; maybe only predict on parts to enhance speed
class callback_pearson(tf.keras.callbacks.Callback):
    def __init__(self):
        super().__init__() #initialise the Keras base class
        self.Y_val = np.array(val_ds["label"]).reshape(1,-1)
    def on_epoch_end(self, epoch, logs = None):
        logs = logs if logs is not None else {}
        X_val_preds = self.model.predict(tf_validation_dataset)["logits"].reshape(1,-1)
        #np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is Pearson's r
        pearson_corr = np.corrcoef(X_val_preds, self.Y_val)
        print("pearson r on the validation set =", pearson_corr[0][1])
        logs["val_corr"] = pearson_corr[0][1]
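As a cross-check of what the callback reports, Pearson's r can also be written out by hand and compared with `np.corrcoef`. This is a minimal sketch on made-up numbers; `pearson_r`, `toy_labels` and `toy_preds` are hypothetical names used only for this illustration.

```python
import numpy as np

def pearson_r(preds, labels):
    """Pearson correlation between two arrays of equal length (flattened first)."""
    preds = np.asarray(preds, dtype=float).ravel()
    labels = np.asarray(labels, dtype=float).ravel()
    preds_c = preds - preds.mean()
    labels_c = labels - labels.mean()
    # covariance divided by the product of the standard deviations
    return float((preds_c * labels_c).sum()
                 / np.sqrt((preds_c ** 2).sum() * (labels_c ** 2).sum()))

# made-up scores and predictions; the predictions are a column vector like model logits
toy_labels = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
toy_preds = np.array([[0.1], [0.2], [0.6], [0.7], [0.9]])

r_manual = pearson_r(toy_preds, toy_labels)
r_numpy = np.corrcoef(toy_preds.ravel(), toy_labels)[0, 1]
# r does not change if the predictions are shifted or scaled by a positive factor
r_rescaled = pearson_r(2 * toy_preds + 1, toy_labels)
print(r_manual, r_numpy, r_rescaled)
```

Because r is unchanged under shifting and positive scaling, the raw logits do not have to be calibrated to the 0-1 range for this metric; only their linear relationship to the labels matters.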

In [None]:
#increased train data from 70% to 75% 
#notably, several other notebooks (e.g., https://www.kaggle.com/code/jhoward/iterate-like-a-grandmaster#Improving-the-model)
#have achieved higher correlation with less training; this may be due to a higher training size.

#Creating validation set; again, copied mostly from: https://www.kaggle.com/code/jhoward/iterate-like-a-grandmaster
#the split is made on randomly shuffled anchors because the hidden test data does not overlap with the known training data
anchors = df.anchor.unique()
np.random.shuffle(anchors)

#select the anchors that go into the validation set
val_prop = 0.25 #this was commonly used in other notebooks
val_sz = int(len(anchors)*val_prop)
val_anchors = anchors[:val_sz]

#decide on validation indices and train indices
is_val = np.isin(df.anchor, val_anchors)
idxs = np.arange(len(df))
val_idxs = idxs[ is_val]
trn_idxs = idxs[~is_val]
len(val_idxs),len(trn_idxs) #print length of validation and train set
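The point of splitting on anchors rather than on rows is that no anchor ends up on both sides. The following minimal sketch repeats the same steps on a toy frame and checks that property; the frame `toy`, its phrases and the seed are made up for illustration, and `np.random.default_rng` is used here instead of the notebook's global seed.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)  # seed chosen for illustration only

# toy data: four anchors with several rows each (hypothetical phrases)
toy = pd.DataFrame({
    "anchor": ["abatement"] * 3 + ["acid absorption"] * 2
              + ["metal phase"] * 4 + ["wiring trough"] * 3,
    "target": [f"target {i}" for i in range(12)],
})

# shuffle the unique anchors and reserve a share of them for validation
toy_anchors = rng.permutation(np.asarray(toy.anchor.unique(), dtype=object))
toy_val_prop = 0.25
toy_val_anchors = toy_anchors[: int(len(toy_anchors) * toy_val_prop)]

# every row follows its anchor into either the validation or the training part
toy_is_val = toy.anchor.isin(toy_val_anchors).to_numpy()
toy_val, toy_trn = toy[toy_is_val], toy[~toy_is_val]

shared_anchors = set(toy_val.anchor) & set(toy_trn.anchor)
print(len(toy_val), len(toy_trn), shared_anchors)
```

Note that the validation share is a share of anchors, not of rows, so the row-level proportion can deviate from 25% when anchors differ in frequency.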

In [None]:
#Using the context code instead of the full name, I achieved a score of 0.815 on large -> let's try with more world clues
#created this map (as above) from wikipedia

#Hopefully with this extra information it will work better
gen_cat_map = {"A": "Human Necessities",
    "B": "Operations and Transport",
    "C": "Chemistry and Metallurgy",
    "D": "Textiles",
    "E": "Fixed Constructions",
    "F": "Mechanical Engineering",
    "G": "Physics",
    "H": "Electricity",
    "Y": "Emerging Cross-Sectional Technologies"}

df["full"] = df.gen_cat.map(gen_cat_map)
#same vectorised extraction of the general category as for the training data
test["gen_cat"] = test.context.str[0]

test["full"] = test.gen_cat.map(gen_cat_map)

In [None]:
cpc = pd.read_csv("../input/cpc-codes/titles.csv")
cpc = cpc[["code","title"]]
cpc.head()

In [None]:
df = df.merge(cpc, left_on = "context", right_on = "code", how = "left")
test = test.merge(cpc, left_on = "context", right_on = "code", how = "left")

In [None]:
df["title"] = df["title"].str.lower()
df["full"] = df["full"].str.lower() #using lower seems to be a best practice 
df.head()

In [None]:
test["title"] = test["title"].str.lower()
test["full"] = test["full"].str.lower() #using lower seems to be a best practice 
test.head()

# Fitting

## DeBERTa small

Things that influenced finetuning DeBERTa small:
- using context instead of sep token reduced score by 0.005
- smaller batches did not change score, but increased fitting time
- reducing the validation share of the train / validation split from 0.3 to 0.25 increased score by 0.005
- higher learning rate (1e-4 instead of 5e-5) drastically reduced score, by 0.05

Best score achieved with DeBERTa small (did not go back with further optimisations afterwards): 0.781

## DeBERTa base

Things that influenced finetuning DeBERTa base:
- added a warm-up ratio -> increased score by 0.02
- used recommended hyperparameters -> had no impact

## Electra -> Val Score: 0.8264

Things that influenced Electra:
- changed input to include "full" instead of context -> boosted from 0.805 to 0.815
- changed input to include "context" and "title" instead of "full" + dynamic padding -> boosted to 0.820
- changed the warmup ratio a little (0.15 instead of 0.1) -> boosted to 0.825
- had to set shuffle = False on fit and set random states to get consistent results; otherwise the model often died
- also had to put Electra in first position so it runs with fewer problems (before, it was third)

In general, the consensus in many discussions seems to be that Electra is a rather hard model to train
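To illustrate what a warmup ratio controls, here is a minimal sketch of a learning-rate schedule with linear warmup followed by linear decay. The shape of the schedule is an assumption (the notes above only name the ratio), `lr_at_step` and `total` are hypothetical names, and the peak rate of 5e-5 is borrowed from the DeBERTa small notes purely as an example value.

```python
def lr_at_step(step, total_steps, peak_lr=5e-5, warmup_ratio=0.1):
    """Learning rate at a given optimizer step (assumed: linear warmup, linear decay)."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # ramp up linearly so that the peak is reached at the end of the warmup
        return peak_lr * (step + 1) / warmup_steps
    # afterwards decay linearly to zero at total_steps
    decay_steps = max(total_steps - warmup_steps, 1)
    return peak_lr * max(total_steps - step, 0) / decay_steps

total = 1000  # hypothetical number of optimizer steps
for ratio in (0.1, 0.15):
    rates = [lr_at_step(s, total, warmup_ratio=ratio) for s in (0, 99, 549, 999)]
    print(ratio, rates)
```

With a larger ratio the model spends more of the early steps at small learning rates, which is one plausible reason why a model that is hard to train reacts to this setting.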

In [None]:
import warnings
warnings.filterwarnings("ignore") #can get annoying and visually distracting

In [None]:
electra = "../input/google-electra-large-discriminator"
#ELECTRA is a new method for self-supervised language representation learning. 
#It can be used to pre-train transformer networks using relatively little compute. 
#ELECTRA models are trained to distinguish "real" input tokens vs "fake" input tokens 
#generated by another neural network, similar to the discriminator of a GAN. 
#TLDR; google did something cool and it works

tokenizer = AutoTokenizer.from_pretrained(electra)
sep = tokenizer.sep_token #is "[SEP]"
df["inputs"] = df.context + sep + df.title + sep + df.anchor + sep + df.target + sep 
df1 = df[["inputs","score"]]


def tok_func(x): return tokenizer(x["inputs"])

#create ds from dataframe
ds = Dataset.from_pandas(df).rename_column('score', 'label')
#split into separate sections
new_ds = DatasetDict({"train":ds.select(trn_idxs),
             "val": ds.select(val_idxs)})
#split into separate ds
train_ds = new_ds["train"]
val_ds  = new_ds["val"]
#tokenize
tok_train = train_ds.map(tok_func,batched = True) 
tok_val = val_ds.map(tok_func, batched = True)

from transformers import DataCollatorWithPadding
#dynamic padding just decreased score
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, padding = True, return_tensors="tf")
tf_train_dataset = tok_train.to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=["labels"],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=16,
)
tf_validation_dataset = tok_val.to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=["labels"],
    shuffle=False,
    collate_fn=data_collator,
    batch_size=16,
)

In [None]:
#electra was trained on google colab:
#https://colab.research.google.com/drive/1whTtxMpZ0BGO-TdhVCW54prJ5FEK0MuC?authuser=1#scrollTo=in7NUwNkcRuV

In [None]:
from transformers import TFAutoModelForSequenceClassification
model_elec = TFAutoModelForSequenceClassification.from_pretrained(electra, num_labels=1, from_pt = True) #importantly, we are loading from pytorch as nobody likes TF
model_elec.load_weights("../input/valentin-electra-patents-finetuned/tf-electra-large-finetuned.h5")

In [None]:
#lets also analyse the error
preds_elec = model_elec.predict(tf_validation_dataset)
#quick reality check if the model took best model or last model
np.corrcoef(np.array(val_ds["label"]).reshape(1,-1), preds_elec["logits"].reshape(1,-1))[0][1]
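
The validation score used throughout is the Pearson correlation between labels and predictions. As a small sanity check of what `np.corrcoef(...)[0][1]` returns, here is a sketch that writes the formula out by hand. The helper name `pearson` and the numbers are made up for illustration; only the shape `(n, 1)` of the logits mirrors the model output above.

```python
import numpy as np

def pearson(y_true, y_pred):
    """Pearson correlation written out by hand (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float).reshape(-1)
    y_pred = np.asarray(y_pred, dtype=float).reshape(-1)
    yt = y_true - y_true.mean()
    yp = y_pred - y_pred.mean()
    return float((yt * yp).sum() / np.sqrt((yt ** 2).sum() * (yp ** 2).sum()))

# made-up labels and made-up model outputs of shape (n, 1)
labels = np.array([0.0, 0.25, 0.5, 0.75, 1.0, 0.5])
logits = np.array([[0.1], [0.2], [0.55], [0.7], [1.05], [0.4]])

manual = pearson(labels, logits)
library = np.corrcoef(labels, logits.reshape(-1))[0][1]
print(manual, library)

# scaling and shifting the predictions does not change the correlation
print(pearson(labels, 2 * logits + 3))
```

Since the metric ignores scale and offset, predictions slightly below 0 or above 1 are not a problem for the score by themselves.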

In [None]:
val_df_l = df.iloc[val_idxs].copy() #copy, so adding columns does not write into a slice of df
val_df_l["pred_elec"] = np.array(preds_elec["logits"]).reshape(1,-1)[0]
val_df_l["diff_elec"] = val_df_l.score - val_df_l.pred_elec
val_df_l["good_pred_elec"] = np.abs(val_df_l["diff_elec"]) < 0.125

sns.histplot(x = "diff_elec", data = val_df_l, bins = 100)
plt.axvline(0.125, color = "r")
plt.axvline(-0.125, color = "r")
#predictions between the red lines are within 0.125 of the label and count as "correct predictions"

#share of "correct predictions"
print("accuracy", val_df_l["good_pred_elec"].mean())

In [None]:
#add noise to score to have a proper scatterplot
noise = 0.075 * np.random.randn(val_df_l.shape[0])
val_df_l["score_noised"] = val_df_l.score + noise
plt.scatter(val_df_l.pred_elec, val_df_l.score_noised, color = "b", marker = ".")
#we should clip values under 0 and over 1
#the gap between 0.9 and 1.0 seems very interesting

In [None]:
#Create test predictions
#prepare input; mostly copied from https://www.kaggle.com/code/jhoward/iterate-like-a-grandmaster
tokenizer = AutoTokenizer.from_pretrained(electra)
sep = tokenizer.sep_token #is "[SEP]" #also try different separators later for performance
test["inputs"] = test.context + sep + test.title + sep + test.anchor + sep + test.target + sep
eval_ds = Dataset.from_pandas(test)
#tokenize
tok_eval = eval_ds.map(tok_func, batched = True)

data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")
tf_test_dataset = tok_eval.to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=["labels"],
    shuffle=False,
    collate_fn=data_collator,
    batch_size=1,
)

preds_test_elec = model_elec.predict(tf_test_dataset)
elec_pred = preds_test_elec["logits"]

## DeBERTa Large -> Val Score: 0.8321

Things that influenced DeBERTa large
- short input (context + anchor + target) -> put score at 0.805
- changed inputs to context + title + dynamic padding -> put score at 0.815
- truncation and fixed padding reduced the score by .015
- changed inputs to full instead, turned dynamic padding off -> put score at 0.829
- turned dynamic padding back on -> score at .832
- tried out custom separator tokens with the format ["context"] -> reduced score to 0.77
- doesn't run on anything higher than batch size 16
- higher learning rate reduced score; no warmup reduced score
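
Since dynamic vs. fixed padding comes up several times in these notes, here is a minimal sketch of the difference in plain Python. It only imitates the idea behind `DataCollatorWithPadding` and is not how the library implements it. The helper `pad_batch` and the token ids are made up; `max_length=17` mirrors the `maxlen = 17` from the commented-out tokenizer call.

```python
def pad_batch(batch, pad_id=0, max_length=None):
    """Pad (and, if needed, truncate) token-id lists to one common length (hypothetical helper)."""
    # dynamic padding: longest sequence in this batch; fixed padding: max_length
    target = max_length if max_length is not None else max(len(seq) for seq in batch)
    return [seq[:target] + [pad_id] * (target - len(seq[:target])) for seq in batch]

# two made-up batches of token ids with different sequence lengths
batch_a = [[101, 7, 8, 102], [101, 7, 102]]
batch_b = [[101, 7, 8, 9, 10, 11, 102], [101, 102]]

dynamic = [pad_batch(b) for b in (batch_a, batch_b)]
fixed = [pad_batch(b, max_length=17) for b in (batch_a, batch_b)]

print([len(b[0]) for b in dynamic])  # each batch only as long as its longest sequence
print([len(b[0]) for b in fixed])    # every batch padded to 17
```

With dynamic padding each batch carries fewer padding tokens, and nothing gets truncated, which fixed padding with a small `max_length` cannot guarantee.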

In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
#model was trained on Colab:
#https://colab.research.google.com/drive/1vxAQCJM3hPdUp0DrrlHF2vg9gcIFEfVC?authuser=2#scrollTo=8b0wmHGsxQMC

In [None]:
#Different model: try deberta large instead of small
deberta = "../input/deberta-v3-large/deberta-v3-large"
#DeBERTa improves the BERT and RoBERTa models using disentangled attention and enhanced mask decoder.
#With those two improvements, DeBERTa outperforms RoBERTa on a majority of NLU tasks with 80GB training data.
#The V3 version significantly improves the model performance on downstream tasks.
#TLDR; it's the hot shit!

tokenizer = AutoTokenizer.from_pretrained(deberta)
sep = tokenizer.sep_token #is "[SEP]"
#recommendation for building on deberta: [CLS] A [SEP] B [SEP] #this came out at 0.8192
df["inputs"] = df.full + sep + df.anchor + sep + df.target + sep 
#deberta large performs better on full than on title + context
df1 = df[["inputs","score"]]

#Function to apply tokenizer
#maxlen = 17 #cls + 4 for context + sep + 4 for anchor + sep + 6 for target
#i tested on a boxplot; this will truncate 387 out of the 27346 inputs (~1.4%)
def tok_func(x): return tokenizer(x["inputs"])#.batch_encode_plus(x["inputs"], max_length = maxlen, padding = "max_length", truncation = True) #reduced score

In [None]:
#Down to 16 for RAM purposes
#create ds from dataframe
ds = Dataset.from_pandas(df).rename_column('score', 'label')
#split into separate sections
new_ds = DatasetDict({"train": ds.select(trn_idxs),
                      "val": ds.select(val_idxs)})
#split into separate ds
train_ds = new_ds["train"]
val_ds  = new_ds["val"]
#tokenize
tok_train = train_ds.map(tok_func,batched = True) #I overwrite the earlier datasets for RAM and not-having-to-change-my-code purposes
tok_val = val_ds.map(tok_func, batched = True)

from transformers import DataCollatorWithPadding
#changed padding = True along the way -> dynamic padding (was supposed to reduce time, but did effectively nothing), if anything: score decreased a little
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")
tf_train_dataset = tok_train.to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=["labels"],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=16,
)
tf_validation_dataset = tok_val.to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=["labels"],
    shuffle=False,
    collate_fn=data_collator,
    batch_size=16,
)

In [None]:
from transformers import TFAutoModelForSequenceClassification
model_l = TFAutoModelForSequenceClassification.from_pretrained(deberta, num_labels=1)
model_l.load_weights("../input/valentin-debertalarge-patents-finetuned/tf-deberta-large-finetuned.h5")

In [None]:
#lets also analyse the error
preds_l = model_l.predict(tf_validation_dataset)
#quick reality check if the model took best model or last model
np.corrcoef(np.array(val_ds["label"]).reshape(1,-1), preds_l["logits"].reshape(1,-1))[0][1]

In [None]:
val_df_l["pred"] = np.array(preds_l["logits"]).reshape(1,-1)[0]
val_df_l["diff"] = val_df_l.score - val_df_l.pred
val_df_l["good_pred"] = np.abs(val_df_l["diff"]) < 0.125

In [None]:
sns.histplot(x = "diff", data = val_df_l, bins = 100)
plt.axvline(0.125, color = "r")
plt.axvline(-0.125, color = "r")
#predictions between the red lines are within 0.125 of the label and count as "correct predictions"

#share of "correct predictions"
print("accuracy", val_df_l["good_pred"].mean())

In [None]:
val_df_l.groupby("gen_cat")[["diff","good_pred"]].mean() #list of columns needs double brackets

In [None]:
plt.scatter(val_df_l.pred, val_df_l.score_noised, color = "b", marker = ".")
#the gap between 0.9 and 1.0 seems very interesting but goes hand in hand with our findings from the EDA

In [None]:
#remove values lower than 0 and higher than 1
val_df_l["preds_lim"] = val_df_l.pred.clip(0, 1)

#map to nearest valid score (0, 0.25, 0.5, 0.75, 1); values exactly halfway go to the higher score
val_df_l.reset_index(inplace = True)
val_df_l["preds_map"] = np.floor(val_df_l.pred.clip(0, 1) * 4 + 0.5) / 4

In [None]:
val_df_l.head()

In [None]:
print(f"""
 normal: {np.corrcoef(np.array(val_ds["label"]), val_df_l["pred"])[0][1]}
 with limits: {np.corrcoef(np.array(val_ds["label"]), val_df_l["preds_lim"])[0][1]}
 with mapping: {np.corrcoef(np.array(val_ds["label"]), val_df_l["preds_map"])[0][1]}
 """)
#normal has best score
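
A possible explanation for the raw predictions scoring best: Pearson correlation does not care about scale or offset, while clipping and snapping are non-linear and throw away part of the ordering between predictions. The sketch below shows the two transformations on made-up numbers. The helper names `clip_scores` and `snap_scores` are hypothetical, and the toy values are not taken from the validation set.

```python
import numpy as np

def clip_scores(pred):
    """Limit predictions to the valid score range [0, 1]."""
    return np.clip(pred, 0.0, 1.0)

def snap_scores(pred):
    """Map predictions to the nearest of 0, 0.25, 0.5, 0.75, 1 (halfway values go up)."""
    return np.floor(np.clip(pred, 0.0, 1.0) * 4 + 0.5) / 4

# made-up labels and predictions
labels = np.array([0.0, 0.25, 0.5, 0.75, 1.0, 0.5])
raw = np.array([-0.1, 0.124, 0.6, 0.9, 1.2, 0.45])

for name, p in [("raw", raw), ("clipped", clip_scores(raw)), ("snapped", snap_scores(raw))]:
    print(name, p, np.corrcoef(labels, p)[0][1])
```

Which variant wins on a toy example like this is arbitrary; the point is only that both transformations change the correlation, so they have to be checked on the validation set as done above.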

In [None]:
#Create test predictions
#prepare input; mostly copied from https://www.kaggle.com/code/jhoward/iterate-like-a-grandmaster
tokenizer = AutoTokenizer.from_pretrained(deberta)
sep = tokenizer.sep_token #is "[SEP]" #also try different separators later for performance
test["inputs"] = test.full + sep + test.anchor + sep + test.target + sep 
eval_ds = Dataset.from_pandas(test)
#tokenize
tok_eval = eval_ds.map(tok_func, batched = True)

data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")
tf_test_dataset = tok_eval.to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=["labels"],
    shuffle=False,
    collate_fn=data_collator,
    batch_size=1,
)

preds_test_l = model_l.predict(tf_test_dataset) #these are test predictions, not train predictions
deberta_pred = preds_test_l["logits"]

In [None]:
deberta_pred

# Patent BERT -> Val Score: 0.8205

Things that influenced Patent BERT
- the model rarely died
- dynamic padding reduced score by .004 with short input (context + anchor + target)
- score was better with full + anchor + target (~.005 better than short input)
- score was best with context + title + anchor + target (~.01 better than short input) & dynamic padding
- model was more consistent with batch size = 32
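
The input formats compared above differ only in which columns get glued together with the separator token. Here is a small sketch on a single made-up row. The column values are invented for illustration, and `sep` is hard-coded to `"[SEP]"` under the assumption that this is what `tokenizer.sep_token` returns.

```python
import pandas as pd

sep = "[SEP]"  # assumption: value of tokenizer.sep_token

# one made-up row; the values are for illustration only
toy = pd.DataFrame({
    "context": ["A47"],
    "title": ["FURNITURE"],
    "anchor": ["abatement"],
    "target": ["abatement of pollution"],
})

# short input: context + anchor + target
short_input = toy.context + sep + toy.anchor + sep + toy.target + sep
# long input: context + title + anchor + target
long_input = toy.context + sep + toy.title + sep + toy.anchor + sep + toy.target + sep

print(short_input[0])
print(long_input[0])
```

The longer format gives the model the written-out meaning of the context code, which might be why it scored better here.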

In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
patentbert = "../input/bert-for-patents/bert-for-patents"
#BERT for Patents is a model trained by Google on 100M+ patents (not just US patents). It is based on BERT-large.

tokenizer = AutoTokenizer.from_pretrained(patentbert)
sep = tokenizer.sep_token #is "[SEP]"
#df["inputs"] = df.full + sep + df.anchor + sep + df.target + sep #using the translated gen_cat ("full") came out at ~0.814
df["inputs"] = df.context + sep + df.title + sep + df.anchor + sep + df.target + sep 
#df["inputs"] = df.context + sep + df.anchor + sep + df.target + sep #same shape as deberta
df1 = df[["inputs","score"]]


def tok_func(x): return tokenizer(x["inputs"])

#create ds from dataframe
ds = Dataset.from_pandas(df).rename_column('score', 'label')
#split into separate sections
new_ds = DatasetDict({"train": ds.select(trn_idxs),
                      "val": ds.select(val_idxs)})
#split into separate ds
train_ds = new_ds["train"]
val_ds  = new_ds["val"]
#tokenize
tok_train = train_ds.map(tok_func,batched = True) 
tok_val = val_ds.map(tok_func, batched = True)

from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, padding = True, return_tensors="tf")
tf_train_dataset = tok_train.to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=["labels"],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=32
)
tf_validation_dataset = tok_val.to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=["labels"],
    shuffle=False,
    collate_fn=data_collator,
    batch_size=32
)

In [None]:
#training of patent bert happened on Colab:
#https://colab.research.google.com/drive/1-bQ6bmojllp5bNMQb182KiM8f6QWpwvp?authuser=1#scrollTo=GggDEa5dXIAj

In [None]:
from transformers import TFAutoModelForSequenceClassification
model_pb = TFAutoModelForSequenceClassification.from_pretrained(patentbert, num_labels=1, from_pt = True) #importantly, we are loading from pytorch as nobody likes TF
model_pb.load_weights("../input/valentin-patent-bert-finetuned/tf-patent-bert-finetuned(1).h5")

In [None]:
#lets also analyse the error
preds_pb = model_pb.predict(tf_validation_dataset)
#quick reality check if the model took best model or last model
np.corrcoef(np.array(val_ds["label"]).reshape(1,-1), preds_pb["logits"].reshape(1,-1))[0][1]

In [None]:
val_df_l["preds_pb"] = np.array(preds_pb["logits"]).reshape(1,-1)[0]
val_df_l["diff_pb"] = val_df_l.score - val_df_l.preds_pb
val_df_l["good_pred_pb"] = np.abs(val_df_l["diff_pb"]) < 0.125

sns.histplot(x = "diff_pb", data = val_df_l, bins = 100)
plt.axvline(0.125, color = "r")
plt.axvline(-0.125, color = "r")
#predictions between the red lines are within 0.125 of the label and count as "correct predictions"

#share of "correct predictions"
print("accuracy", val_df_l["good_pred_pb"].mean())
#accuracy is around the same as deberta large

In [None]:
#plotting predictions vs noised scores
plt.scatter(val_df_l.preds_pb, val_df_l.score_noised, color = "b", marker = ".")

In [None]:
val_df_l.groupby("gen_cat")[["diff_pb","good_pred_pb"]].mean() #list of columns needs double brackets

In [None]:
#Create test predictions
#prepare input; mostly copied from https://www.kaggle.com/code/jhoward/iterate-like-a-grandmaster
tokenizer = AutoTokenizer.from_pretrained(patentbert)
sep = tokenizer.sep_token #is "[SEP]" #also try different separators later for performance
test["inputs"] = test.context + sep + test.title + sep + test.anchor + sep + test.target + sep
eval_ds = Dataset.from_pandas(test)
#tokenize
tok_eval = eval_ds.map(tok_func, batched = True)

data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")
tf_test_dataset = tok_eval.to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=["labels"],
    shuffle=False,
    collate_fn=data_collator,
    batch_size=1,
)

preds_test_pb = model_pb.predict(tf_test_dataset)
pb_pred = preds_test_pb["logits"]

In [None]:
#compare the share of in-range predictions for each model
val_df_l.groupby("gen_cat")[["good_pred","good_pred_elec","good_pred_pb"]].mean()
#patent bert actually performs best on E, while it lags behind on all other categories
#electra outperforms deberta on almost all categories with regard to accuracy

In [None]:
#Gridsearch for the best ratio of models
parts = np.arange(0, 1.005, 0.005)
comb_preds = []
for ratio_pb in parts:
    comb_preds.append([ratio_pb, np.corrcoef((ratio_pb * preds_pb["logits"] + (1 - ratio_pb) * preds_elec["logits"]).reshape(-1),val_ds["label"])[0][1]])

#Plot the score of ratio
comb = pd.DataFrame(comb_preds, columns = ["RatioPatentBert", "correlation"])
sns.lineplot(x = "RatioPatentBert", y = "correlation", data = comb)
plt.xlabel("Ratio of PatentBert model (electra ratio = 1 - ratio of PB)");
#in combination the two models are really a lot better!
#also around 0.5 seems to be the sweet spot

In [None]:
#Gridsearch for the best ratio of models
parts = np.arange(0, 1.005, 0.005)
comb_preds = []
for ratio_pb in parts:
    comb_preds.append([ratio_pb, np.corrcoef((ratio_pb * preds_pb["logits"] + (1 - ratio_pb) * preds_l["logits"]).reshape(-1),val_ds["label"])[0][1]])

#Plot the score of ratio
comb = pd.DataFrame(comb_preds, columns = ["RatioPatentBert", "correlation"])
sns.lineplot(x = "RatioPatentBert", y = "correlation", data = comb)
plt.xlabel("Ratio of PatentBert model (DeBERTa ratio = 1 - ratio of PB)");
#in combination the two models are really a lot better!
#also around 0.5 seems to be the sweet spot

In [None]:
#Gridsearch for the best ratio of models
parts = np.arange(0, 1.005, 0.005)
comb_preds = []
for ratio_elec in parts:
    comb_preds.append([ratio_elec, np.corrcoef((ratio_elec * preds_elec["logits"] + (1 - ratio_elec) * preds_l["logits"]).reshape(-1),val_ds["label"])[0][1]])

#Plot the score of ratio
comb = pd.DataFrame(comb_preds, columns = ["RatioElectra", "correlation"])
sns.lineplot(x = "RatioElectra", y = "correlation", data = comb)
plt.xlabel("Ratio of Electra model (DeBERTa ratio = 1 - ratio of Electra)");
#in combination the two models are really a lot better!
#also around 0.5 seems to be the sweet spot
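
The three grid searches above all follow the same pattern, so they could be folded into one helper. This is only a sketch: `blend_search` is a hypothetical name, and the data below is synthetic and deliberately artificial (the two "models" have exactly opposite errors, so a 50/50 blend cancels them). It only illustrates the mechanics, not the real model behaviour.

```python
import numpy as np

def blend_search(preds_a, preds_b, labels, steps=201):
    """Try ratio * a + (1 - ratio) * b for evenly spaced ratios; return the best ratio and its score."""
    preds_a = np.asarray(preds_a, dtype=float).reshape(-1)
    preds_b = np.asarray(preds_b, dtype=float).reshape(-1)
    labels = np.asarray(labels, dtype=float).reshape(-1)
    ratios = np.linspace(0.0, 1.0, steps)  # 201 steps -> step size 0.005 as above
    scores = np.array([
        np.corrcoef(r * preds_a + (1 - r) * preds_b, labels)[0][1] for r in ratios
    ])
    best = int(scores.argmax())
    return ratios[best], scores[best], ratios, scores

# synthetic example: both "models" are the label plus/minus the same error
rng = np.random.default_rng(2)
labels = rng.choice([0, 0.25, 0.5, 0.75, 1.0], size=500)
err = rng.normal(0, 0.2, size=500)
model_a = labels + err
model_b = labels - err

best_ratio, best_score, ratios, scores = blend_search(model_a, model_b, labels)
print(best_ratio, best_score)
```

Real models are of course not this complementary, but the less correlated their errors are, the more a blend can gain over the single models.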

In [None]:
#GridSearch again
#for 3 models, we just search for the ratio of the best model and give (1-ratio) / 2 to each of the other two models
parts = np.arange(0, 1.005, 0.005)
comb_preds = []
for ratio_db in parts:
    comb_preds.append([ratio_db, np.corrcoef((((1-ratio_db)/2) * preds_pb["logits"].reshape(-1) + ((1-ratio_db)/2) * preds_elec["logits"].reshape(-1) + ratio_db * preds_l["logits"].reshape(-1)),val_ds["label"])[0][1]])

comb = pd.DataFrame(comb_preds, columns = ["RatioDeberta", "correlation"])
best_score = round(comb.correlation.max(), 3) #this is the best score, not the ratio
best_ratio = round(comb.RatioDeberta.iloc[comb.correlation.idxmax()], 3)
print(f"maximum val score was reached on a ratio of {best_ratio} with a score of {best_score}")
sns.lineplot(x = "RatioDeberta", y = "correlation", data = comb)
plt.xlabel("Ratio of DeBERTa-large model (electra & patent bert ratio = (1 - ratio of DeBERTa)/2)");
#the ensemble of 3 outperforms every ensemble of 2
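
The search above only varies the DeBERTa ratio and splits the rest evenly. A variation (not what was done in this notebook) would be to try all weight triples that sum to 1. The sketch below does this on synthetic data; `simplex_search`, the step size and the noise levels are all made up for illustration.

```python
import numpy as np

def simplex_search(preds, labels, step=0.05):
    """Try all weight triples (w1, w2, w3) on a grid with w1 + w2 + w3 = 1."""
    p1, p2, p3 = (np.asarray(p, dtype=float).reshape(-1) for p in preds)
    labels = np.asarray(labels, dtype=float).reshape(-1)
    n = int(round(1 / step))
    best_w, best_score = None, -np.inf
    for i in range(n + 1):
        for j in range(n + 1 - i):
            w = (i / n, j / n, (n - i - j) / n)
            score = np.corrcoef(w[0] * p1 + w[1] * p2 + w[2] * p3, labels)[0][1]
            if score > best_score:
                best_w, best_score = w, score
    return best_w, best_score

# synthetic "models": labels plus independent noise of different strength
rng = np.random.default_rng(2)
labels = rng.choice([0, 0.25, 0.5, 0.75, 1.0], size=300)
preds = [labels + rng.normal(0, s, size=300) for s in (0.15, 0.20, 0.25)]

weights, score = simplex_search(preds, labels)
singles = [np.corrcoef(p, labels)[0][1] for p in preds]
print(weights, score, singles)
```

Because the single models are corner points of the grid, the best blend can never score below the best single model on the data it was searched on. Whether that carries over to unseen data is a separate question.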

In [None]:
#for simplicity's sake we will do a 3/3/4 split, with deberta having the highest portion
ensemble_val = np.corrcoef((0.3* preds_pb["logits"].reshape(-1) + 0.3 * preds_elec["logits"].reshape(-1) + 0.4 * preds_l["logits"].reshape(-1)),val_ds["label"])[0][1]
ensemble_val

In [None]:
#How is the distribution for this?
#for graphical purposes; lets plot all 4 next to another
val_df_l["ensemble"] = (0.3 * preds_pb["logits"] + 0.3 * preds_elec["logits"] + 0.4 * preds_l["logits"]).reshape(-1)

fig, ax = plt.subplots(1,4, figsize = (20,4))
ax[0].scatter(val_df_l.pred_elec, val_df_l.score_noised, color = "b", marker = ".")
ax[0].set_title("Predictions vs. Label - Electra")
ax[1].scatter(val_df_l.pred, val_df_l.score_noised, color = "b", marker = ".")
ax[1].set_title("Predictions vs. Label - DeBERTa Large")
ax[2].scatter(val_df_l.preds_pb, val_df_l.score_noised, color = "b", marker = ".")
ax[2].set_title("Predictions vs. Label - Patent Bert")
ax[3].scatter(val_df_l.ensemble, val_df_l.score_noised, color = "y", marker = ".")
ax[3].set_title("Predictions vs. Label - Ensemble")

#we can see that the ensembling pulls the outliers of single models further into the middle
#in general the patterns are very similar

In [None]:
#example data for validation set
preds = pd.DataFrame((val_df_l.gen_cat, val_df_l.pred_elec, val_df_l.pred, val_df_l.preds_pb, val_df_l.ensemble, val_df_l.score)).T
preds.columns = ["category","electra","deberta", "bert", "ensemble", "score"]
preds.head(25)

In [None]:
cats = preds.category.unique()
deb_corr_cats = []
pb_corr_cats = []
elec_corr_cats = []
avg_corr_cats = []
for category in cats:
    deb_corr_cats.append(np.corrcoef(list(preds[preds.category == category].deberta), list(preds[preds.category == category].score))[0][1])
    pb_corr_cats.append(np.corrcoef(list(preds[preds.category == category].bert), list(preds[preds.category == category].score))[0][1])
    elec_corr_cats.append(np.corrcoef(list(preds[preds.category == category].electra), list(preds[preds.category == category].score))[0][1])
    avg_corr_cats.append(np.corrcoef(list(preds[preds.category == category].ensemble), list(preds[preds.category == category].score))[0][1])

In [None]:
corrs = pd.DataFrame(preds.category.unique(), columns = ["category"])
corrs["correlation_deberta"] = deb_corr_cats
corrs["correlation_pb"] = pb_corr_cats
corrs["correlation_electra"] = elec_corr_cats
corrs["avg_corr"] = avg_corr_cats
corrs.sort_values("category")
#the ensemble outperforms every individual model along all categories

#Strong categories: B, D, E, F
#Weak categories: A, C, G, H

In [None]:
corrs = corrs.merge(test.groupby("gen_cat").id.count().reset_index(), left_on = "category", right_on = "gen_cat").drop(columns = ["gen_cat"])
corrs
#our weaker categories make up around 53% of the dataset
#cross-validation would probably help with this

In [None]:
corrs["weighted"] = corrs.avg_corr * corrs.id / sum(corrs.id)
sum(corrs.weighted) #we would expect a lb score around this 

#Note from future Valentin: Since we performed worse than this, it seems like the categories in the test set are among those that were harder to classify
#As shown in the EDA, several contexts are in validation but not in test; the ones in test may be harder to classify
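
As a worked example of the weighting above, here is the same calculation on made-up numbers (three invented categories with invented correlations and row counts).

```python
import pandas as pd

# made-up per-category validation correlations and test row counts
toy = pd.DataFrame({
    "category": ["A", "B", "C"],
    "avg_corr": [0.80, 0.90, 0.70],
    "id": [10, 30, 60],
})

# weight each category's correlation by its share of the test rows
toy["weighted"] = toy.avg_corr * toy.id / toy.id.sum()
expected = toy.weighted.sum()
print(expected)  # 0.8 * 0.1 + 0.9 * 0.3 + 0.7 * 0.6
```

One caveat: a row-weighted mean of per-category correlations is only a rough estimate, since the overall Pearson correlation also depends on how the categories differ from each other and not just on the correlation within each one.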

# Submission

In [None]:
#print all predictions and the ensembled score in a dataframe
pred_ensemble = 0.3 * elec_pred.reshape(-1) + 0.4 * deberta_pred.reshape(-1) + 0.3 * preds_test_pb["logits"].reshape(-1)
preds = pd.DataFrame((elec_pred.reshape(-1), deberta_pred.reshape(-1), preds_test_pb["logits"].reshape(-1), pred_ensemble)).T
preds.columns = ["electra","deberta", "bert", "ensemble"]
preds.head(36)

In [None]:
sub = pd.read_csv("../input/us-patent-phrase-to-phrase-matching/sample_submission.csv")
sub["score"] = pred_ensemble
sub[["id","score"]].to_csv("submission.csv", index=False)

# Open To-Dos that would influence the performance but are out of scope
- **Cross Validation** (commonly used for the fit and resulted in significantly better scores)
    - sadly, this would mean training the whole models 3+ times, which would heavily increase GPU runtime
    - there are workarounds, but sadly I did not have the time for them
- weight decay (somehow doesn't seem to be a thing with Keras; would have to use PyTorch / Trainer API for it)
    - I tried with AdamW from tensorflow but it didn't work with the setup
- trying more models; see how well an ensemble of 4 or 5 would perform; the current model selection was not analysed
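
For reference, the index handling for cross validation could look roughly like the sketch below. It assumes a plain random K-fold split; how `trn_idxs` and `val_idxs` were originally built is not shown in this part of the notebook, so the helper `kfold_indices` and its defaults are assumptions. The seed of 2 just mirrors the `seed_value` set at the top.

```python
import numpy as np

def kfold_indices(n_rows, n_folds=3, seed=2):
    """Yield (trn_idxs, val_idxs) pairs so that every row is in validation exactly once."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)
    folds = np.array_split(shuffled, n_folds)
    for k in range(n_folds):
        val_idxs = folds[k]
        trn_idxs = np.concatenate([folds[m] for m in range(n_folds) if m != k])
        yield trn_idxs, val_idxs

for fold, (trn_idxs, val_idxs) in enumerate(kfold_indices(10, n_folds=3)):
    # here each model would be fine-tuned on trn_idxs and scored on val_idxs
    print(fold, len(trn_idxs), len(val_idxs))
```

Each fold means one full fine-tuning run per model, which is exactly the GPU cost mentioned above.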

# What I have learned from this Challenge
- PyTorch seems to be way more popular than Keras, but PyTorch models can also be run from Keras
- Batch size matters for RAM (very important on Kaggle / Colab); the limit can be "tricked" with gradient accumulation (which is apparently not a thing in Keras)
- Fine-tuning includes a lot of things apart from hyperparameters, such as deciding on input formats, sep-tokens etc.
- there is a multitude of fine-tunable models for the problem, even though it doesn't seem like it at first
- Custom metrics in Keras are a thing (but are hard to get to work with normal Keras functions)
- **Ensemble models** are a thing (and often outperform their strongest individual models)
    - it seems like even though the individual models perform at different levels, simply averaging them will give the best result
- Managing GPU usage time is critical in a Kaggle competition
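
On the gradient accumulation point: the idea is to process several small batches, add up their gradients, and only then update the weights, so the update matches one large batch without needing the RAM for it. The sketch below shows this on a toy linear model in NumPy instead of a transformer. All names and numbers are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(32, 4))   # one "large" batch of 32 rows
y = rng.normal(size=32)
w = np.zeros(4)                # weights of a toy linear model

def mse_grad(X_batch, y_batch, w):
    """Gradient of the mean squared error of a linear model with respect to w."""
    residual = X_batch @ w - y_batch
    return 2 * X_batch.T @ residual / len(y_batch)

# gradient of the full batch of 32
full_grad = mse_grad(X, y, w)

# the same gradient, accumulated over 4 micro-batches of 8 rows each
micro_batches = 4
accumulated = np.zeros_like(w)
for X_mb, y_mb in zip(np.array_split(X, micro_batches), np.array_split(y, micro_batches)):
    accumulated += mse_grad(X_mb, y_mb, w) / micro_batches

print(np.allclose(full_grad, accumulated))
```

The two gradients match here because all micro-batches have the same size; only 8 rows need to be held in memory at a time.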

- Warmup ratio matters a lot
- Playing around with inputs matters a lot
- Sometimes the order in which the models are computed seems to matter! (must be some hardware / GPU thing)
- Submissions to Kaggle are painful: it is so hard to debug Kaggle code before submitting, and often submissions are wasted or fail on the last lines of code due to little changes
    - from my submissions (33):
    - 1 submission solely for EDA
    - 3 versions that actually worked as intended
    - 20 crashed submissions to figure out how to submit on Kaggle
    - 4 included models that did not train properly
    - 3 were just saves to clean up code
    - 2 were misclicks that were submitted when they weren't ready

- just don't spend 4 hours on every submission; instead train the model once on Colab -> save weights -> upload to Kaggle -> restore weights
    