The goal of this notebook is to evaluate how well each language model's vocabulary represents the patent data. If a model's tokenizer does not contain a word, it splits that word into subwords, so we can look at the distribution of the number of tokens per word. A rate close to 1 means most words are in the vocabulary, which might be a good signal for whether a model is suitable for this competition.
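
To make the idea of subword splitting concrete, here is a minimal sketch of a greedy longest-match split in the style of WordPiece. It is only an illustration: `toy_vocab`, `domain_vocab`, `split_into_subwords` and `tokens_per_word` are made up for this example, and the real tokenizers used below have much larger vocabularies and differ in their details.

```python
def split_into_subwords(word, vocab):
    """Greedy longest-match-first split of one word (toy WordPiece-style)."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word continuation
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # nothing matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def tokens_per_word(phrase, vocab):
    """Average number of subword tokens per whitespace-separated word."""
    words = phrase.split()
    return sum(len(split_into_subwords(w, vocab)) for w in words) / len(words)


# Hypothetical vocabularies: the second one also contains the domain word.
toy_vocab = {"water", "filter", "filtr", "##ation", "abate", "##ment"}
domain_vocab = toy_vocab | {"filtration"}

print(split_into_subwords("filtration", toy_vocab))    # ['filtr', '##ation']
print(tokens_per_word("water filtration", toy_vocab))     # 1.5
print(tokens_per_word("water filtration", domain_vocab))  # 1.0
```

The same phrase costs fewer tokens per word once the domain term is in the vocabulary, which is the effect the rest of the notebook measures on the real tokenizers.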

I created the notebook following a suggestion from the user @hengck23.

In [None]:
! pip install -q transformers sentencepiece
! git clone https://github.com/microsoft/COCO-LM.git
! cp -r COCO-LM/huggingface/* .

In [None]:
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt

import transformers
import transformers.utils.logging
from cocolm.tokenization_cocolm import COCOLMTokenizer

transformers.utils.logging.disable_progress_bar()

df_train = pd.read_csv('../input/us-patent-phrase-to-phrase-matching/train.csv')

def plot_avg_tokens_per_word(df_train, model_name):
    df_train = df_train.copy(deep=True)
    
    if 'cocolm' in model_name:
        tokz = COCOLMTokenizer.from_pretrained(model_name)
    elif 'roberta' in model_name:
        tokz = transformers.AutoTokenizer.from_pretrained(model_name, add_prefix_space=True)
    else:
        tokz = transformers.AutoTokenizer.from_pretrained(model_name)

        
    # encode() is assumed to add two special tokens (e.g. [CLS] and [SEP]); subtract them from the count.
    df_train['num_toks'] = df_train.anchor.apply(lambda x: len(tokz.encode(x)) - 2)
    df_train['num_words'] = df_train.anchor.apply(lambda x: len(x.split(' ')))
    df_train['tok_rate'] = df_train['num_toks'] / df_train['num_words']

    # Distribution of tokens per word over all anchors.
    df_train['tok_rate'].hist()
    plt.xlabel('Number of Tokens per Word')
    plt.ylabel('Frequency')
    plt.title(f'model: {model_name}')

    # Average number of tokens per word for this tokenizer.
    return df_train['tok_rate'].mean()

avg_tokens_per_word = {}

In [None]:
#avg_tokens_per_word['microsoft/cocolm-large'] = plot_avg_tokens_per_word(df_train,'microsoft/cocolm-large')
#avg_tokens_per_word['bert-base-uncased'] = plot_avg_tokens_per_word(df_train, 'bert-base-uncased')

In [None]:
avg_tokens_per_word['anferico/bert-for-patents'] = plot_avg_tokens_per_word(df_train,'anferico/bert-for-patents')

In [None]:
avg_tokens_per_word['roberta-large'] = plot_avg_tokens_per_word(df_train,'roberta-large')

In [None]:
avg_tokens_per_word['microsoft/deberta-v3-small'] = plot_avg_tokens_per_word(df_train,'microsoft/deberta-v3-small')

In [None]:
avg_tokens_per_word['microsoft/deberta-v3-large'] = plot_avg_tokens_per_word(df_train,'microsoft/deberta-v3-large')

In [None]:
avg_tokens_per_word['allenai/scibert_scivocab_uncased'] = plot_avg_tokens_per_word(df_train,'allenai/scibert_scivocab_uncased')

**Comparison of model tokenizers by average number of tokens per word**

In [None]:
pd.DataFrame.from_dict(avg_tokens_per_word, orient='index', columns=['avg_tokens_per_word']).sort_values(by='avg_tokens_per_word').plot(kind='bar')

Now let's consider how having the target word in the model's vocabulary affects the predictions we need to make in the competition.

In [None]:
model_name = 'microsoft/deberta-v3-large'
tokz = transformers.AutoTokenizer.from_pretrained(model_name) 
df_oof = pd.read_pickle('../input/pppmdebertaoof/oof_df.pkl')

df_oof['toks'] = df_oof.target.apply(lambda x: tokz.convert_ids_to_tokens(tokz.encode(x.split(' '), is_split_into_words=True)))
df_oof['num_toks'] = df_oof.toks.apply(lambda x: len(x)-2)
df_oof['num_words'] = df_oof.target.apply(lambda x: len(x.split(' ')))
df_oof['tok_rate'] = df_oof['num_toks'] / df_oof['num_words']
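# Standardize scores and predictions (z-score: subtract the mean, divide by the standard deviation) before comparing them.
score_z = (df_oof['score'] - df_oof['score'].mean()) / df_oof['score'].std()
pred_z = (df_oof['pred'] - df_oof['pred'].mean()) / df_oof['pred'].std()
df_oof['abs_err'] = np.abs(score_z - pred_z)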
df_oof['tok_rate_adj'] = df_oof['tok_rate'].apply(lambda x: 2 if x > 1 else 1)


# Overlay the error distributions of the two groups on the same axes.
df_oof[df_oof.tok_rate_adj == 1].abs_err.hist(ax=plt.gca(), alpha=0.5, density=True, color='b', bins=50)
df_oof[df_oof.tok_rate_adj == 2].abs_err.hist(ax=plt.gca(), alpha=0.5, density=True, color='r', bins=50)
plt.xlabel('Absolute Standardized Error')
plt.ylabel('Density')
plt.legend(['One token per word', 'More than one token per word'])

You can see from the plot that targets with one token per word tend to have lower errors than targets with more than one token per word. Now let's look at the statistics:

In [None]:
df_summary = df_oof.groupby('tok_rate_adj').agg({'abs_err': ['mean', 'var', 'max', 'count']})
df_summary

Here tok_rate_adj=2 is the group of samples with more than one token per word. It has a higher average error and a higher error variance.
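
The error metric and the grouping used above can be checked on a small example. The sketch below runs on made-up numbers (the `toy` frame is synthetic, not competition data) and assumes a z-score that divides by the standard deviation; `zscore` is a helper introduced only for this illustration.

```python
import numpy as np
import pandas as pd

# Synthetic scores, predictions and tokens-per-word rates (illustration only).
toy = pd.DataFrame({
    "score": [0.0, 0.25, 0.5, 0.75, 1.0, 0.5],
    "pred": [0.1, 0.2, 0.5, 0.9, 0.6, 0.1],
    "tok_rate": [1.0, 1.0, 1.0, 1.5, 2.0, 1.5],
})


def zscore(s):
    """Standardize a Series: zero mean, unit (sample) standard deviation."""
    return (s - s.mean()) / s.std()


# Absolute difference between standardized score and standardized prediction.
toy["abs_err"] = (zscore(toy["score"]) - zscore(toy["pred"])).abs()

# Group 1: exactly one token per word; group 2: more than one token per word.
toy["tok_rate_adj"] = np.where(toy["tok_rate"] > 1, 2, 1)

summary = toy.groupby("tok_rate_adj")["abs_err"].agg(["mean", "var", "max", "count"])
print(summary)
```

Standardizing both columns first puts scores and predictions on the same scale, so the absolute difference is not dominated by an overall offset or spread mismatch between them.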