This notebook is a copy of [Jeremy Howard's Getting started with NLP for absolute beginners](https://www.kaggle.com/code/jhoward/getting-started-with-nlp-for-absolute-beginners/notebook). Here we explore the deberta-v3-small model from the Hugging Face library.

In [1]:
import os
iskaggle = os.environ.get('KAGGLE_KERNEL_RUN_TYPE', '')

In [2]:
from pathlib import Path

if iskaggle:
    path = Path('../input/us-patent-phrase-to-phrase-matching')
    ! pip install -q datasets

    path2 = Path('../input/deberta-v3-small/deberta-v3-small')

# Import Data and EDA

In [3]:
!ls {path}

sample_submission.csv  test.csv  train.csv


In [4]:
!ls {path2}

config.json    pytorch_model.bin  spm.model    tokenizer_config.json
gitattributes  README.md	  tf_model.h5


In [5]:
import numpy as np 
import pandas as pd 

In [6]:
df = pd.read_csv(path/'train.csv')

In [7]:
df

Unnamed: 0,id,anchor,target,context,score
0,37d61fd2272659b1,abatement,abatement of pollution,A47,0.50
1,7b9652b17b68b7a4,abatement,act of abating,A47,0.75
2,36d72442aefd8232,abatement,active catalyst,A47,0.25
3,5296b0c19e1ce60e,abatement,eliminating process,A47,0.50
4,54c1e3b9184cb5b6,abatement,forest region,A47,0.00
...,...,...,...,...,...
36468,8e1386cbefd7f245,wood article,wooden article,B44,1.00
36469,42d9e032d1cd3242,wood article,wooden box,B44,0.50
36470,208654ccb9e14fa3,wood article,wooden handle,B44,0.50
36471,756ec035e694722b,wood article,wooden material,B44,0.75


In [8]:
df.describe(include='object')

Unnamed: 0,id,anchor,target,context
count,36473,36473,36473,36473
unique,36473,733,29340,106
top,8d135da0b55b8c88,component composite coating,composition,H01
freq,1,152,24,2186


We can observe that, across the 36,473 rows, only 733 anchors are unique, while there are about 29k unique targets (29,340) and 106 unique contexts.

As input to the model, we can concatenate the anchor, target and context into a single string and then try to predict the score from that string. This takes a similarity problem and turns it into something that looks like a classification problem. The general trick is to take a problem that looks novel or different and recast it as one we already recognize.

In [9]:
df['input'] = 'TEXT1: ' + df.context + '; TEXT2: ' + df.target + '; ANC1: ' + df.anchor

In [10]:
df.input.head()

0    TEXT1: A47; TEXT2: abatement of pollution; ANC...
1    TEXT1: A47; TEXT2: act of abating; ANC1: abate...
2    TEXT1: A47; TEXT2: active catalyst; ANC1: abat...
3    TEXT1: A47; TEXT2: eliminating process; ANC1: ...
4    TEXT1: A47; TEXT2: forest region; ANC1: abatement
Name: input, dtype: object

# Tokenization

A deep learning model expects numbers as inputs, so we need to do two things:
* Tokenization: Split each text into words/tokens
* Numericalization: Convert each word/token into a number
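As a rough picture of these two steps, here is a minimal sketch that splits on whitespace and builds a hypothetical vocabulary on the fly. It is an illustration only: the real tokenizer used later in this notebook splits text into subword pieces and ships with a fixed, pretrained vocabulary.

```python
# Toy illustration only: real tokenizers split text into subword pieces
# rather than whitespace-separated words.
texts = ["abatement of pollution", "act of abating"]

# Tokenization: split each text into tokens (here, naively on whitespace).
tokenized = [t.split() for t in texts]

# Build a hypothetical vocabulary: one unique integer per distinct token.
vocab = {}
for tokens in tokenized:
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))

# Numericalization: replace every token with its integer ID.
ids = [[vocab[tok] for tok in tokens] for tokens in tokenized]

print(tokenized)
print(vocab)
print(ids)
```

The shared word "of" receives the same ID in both texts, which is the property the model relies on.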

To convert the input into tokens, we are going to convert the pandas dataframe into HuggingFace "Dataset".

In [11]:
from datasets import Dataset,DatasetDict

ds = Dataset.from_pandas(df)

In [12]:
ds

Dataset({
    features: ['id', 'anchor', 'target', 'context', 'score', 'input'],
    num_rows: 36473
})

Tokenization and numericalization are model-specific, so we first need to choose a model from the Hugging Face library and then reuse its tokenizer.

In [13]:
model_nm = 'microsoft/deberta-v3-small'

To tokenize the text the same way the people who built the model did, we use AutoTokenizer, which creates a tokenizer appropriate for a given model. AutoTokenizer.from_pretrained loads the vocabulary and the details of how this particular model tokenizes text. Given a model name it downloads them; given a local directory, as in the offline setup below, it reads them from disk.

In [14]:
from transformers import AutoModelForSequenceClassification,AutoTokenizer

In [15]:
# !rm -r models
# !ls

In [16]:
# !ls models/

In [17]:
# def setup_for_offline_use():
#     model_name = 'microsoft/deberta-v3-small'
#     local_dir = "deberta-v3-small"
    
#     tokz = AutoTokenizer.from_pretrained(model_name)
#     model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)
    
#     tokz.save_pretrained(local_dir)
#     model.save_pretrained(local_dir)
    
#     return tokz, model

# # Run this once with internet
# tokz, model = setup_for_offline_use()

In [18]:
def offline_load_model():
    # path_to_zip_file = "../input/deberta-v3-small.zip"
    local_dir = "../input/deberta-v3-small/deberta-v3-small"
    
    # import zipfile
    # with zipfile.ZipFile(path_to_zip_file, 'r') as zip_ref:
    #   zip_ref.extractall(local_dir)
    
    
    if not os.path.exists(local_dir):
        raise FileNotFoundError(f"Model directory {local_dir} not found. Run setup_for_offline_use() first with internet.")
    
    tokz = AutoTokenizer.from_pretrained(local_dir, local_files_only=True)
    model = AutoModelForSequenceClassification.from_pretrained(local_dir, num_labels=1, local_files_only=True)
    
    return tokz, model

# This works offline
tokz, model = offline_load_model()

Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at ../input/deberta-v3-small/deberta-v3-small and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [19]:
tokz.tokenize("Hi, we are using Huggingface transformers for out NLP model!")

['▁Hi',
 ',',
 '▁we',
 '▁are',
 '▁using',
 '▁Hugg',
 'ing',
 'face',
 '▁transformers',
 '▁for',
 '▁out',
 '▁NLP',
 '▁model',
 '!']

Uncommon words will be split into pieces. The start of a new word is represented by ▁:

In [20]:
tokz.tokenize("Sometimes I get intoxicated by the exuberance of my own verbosity")

['▁Sometimes',
 '▁I',
 '▁get',
 '▁intoxicated',
 '▁by',
 '▁the',
 '▁exuberance',
 '▁of',
 '▁my',
 '▁own',
 '▁verb',
 'osity']
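To make the role of the ▁ marker concrete, the following sketch rebuilds the original sentence from the pieces printed above. It uses plain string handling rather than any tokenizer API, and the variable names are only illustrative.

```python
# Pieces copied from the tokenizer output shown above.
pieces = ['▁Sometimes', '▁I', '▁get', '▁intoxicated', '▁by', '▁the',
          '▁exuberance', '▁of', '▁my', '▁own', '▁verb', 'osity']

# A piece starting with '▁' begins a new word; any other piece
# continues the previous word.
words = []
for p in pieces:
    if p.startswith('▁') or not words:
        words.append(p.lstrip('▁'))
    else:
        words[-1] += p

sentence = ' '.join(words)
print(sentence)
```

The last two pieces, '▁verb' and 'osity', merge back into the single word "verbosity".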

In [21]:
# Function that tokenizes the "input" field of a row (or batch of rows)
def tok_func(x): return tokz(x["input"])

In [22]:
# To apply this to every row in our dataset, a batch at a time, we use map:
tok_ds = ds.map(tok_func, batched=True)

Map:   0%|          | 0/36473 [00:00<?, ? examples/s]
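A batched map can be pictured as follows: the function receives a dict of columns (lists) for a slice of rows and returns a dict of new columns, which are appended to the dataset. The sketch below is pure Python and is not the datasets implementation; toy_tok_func and toy_batched_map are hypothetical names, and word counts stand in for real token IDs.

```python
def toy_tok_func(batch):
    # Receives a dict of columns (lists) and returns new columns.
    # Word counts stand in for real token IDs here.
    return {"n_words": [len(text.split()) for text in batch["input"]]}

def toy_batched_map(columns, func, batch_size=2):
    # Copy the existing columns, then append whatever func returns per batch.
    n_rows = len(next(iter(columns.values())))
    out = {name: list(vals) for name, vals in columns.items()}
    for start in range(0, n_rows, batch_size):
        batch = {name: vals[start:start + batch_size]
                 for name, vals in columns.items()}
        for name, vals in func(batch).items():
            out.setdefault(name, []).extend(vals)
    return out

toy_ds = {"input": ["TEXT1: A47; TEXT2: act of abating; ANC1: abatement",
                    "TEXT1: A47; TEXT2: forest region; ANC1: abatement",
                    "TEXT1: B44; TEXT2: wooden box; ANC1: wood article"]}
result = toy_batched_map(toy_ds, toy_tok_func)
print(list(result))       # column names after the map
print(result["n_words"])  # one value per row
```

Processing rows in batches lets the function handle many rows per call instead of being invoked once per row.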

The map call adds new columns (input_ids, token_type_ids and attention_mask) to our dataset. For example, here are the input and input_ids of the first row:

In [23]:
row = tok_ds[0]
row['input'], row['input_ids']

('TEXT1: A47; TEXT2: abatement of pollution; ANC1: abatement',
 [1,
  54453,
  435,
  294,
  336,
  5753,
  346,
  54453,
  445,
  294,
  47284,
  265,
  6435,
  346,
  23702,
  435,
  294,
  47284,
  2])

These numbers come from a list called vocab in the tokenizer, which contains a unique integer for every possible token string.

In [24]:
tokz.vocab['▁of'], tokz.vocab['of']

(265, 1580)
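Because the vocabulary is a fixed lookup table, a token that occurs more than once in a text receives the same ID each time. A small check on the IDs printed for the first row illustrates this. The list is copied from the output above, and nothing here calls the tokenizer.

```python
from collections import Counter

# input_ids of the first row, copied from the output above.
input_ids = [1, 54453, 435, 294, 336, 5753, 346, 54453, 445, 294,
             47284, 265, 6435, 346, 23702, 435, 294, 47284, 2]

counts = Counter(input_ids)
repeated = {tok_id: n for tok_id, n in counts.items() if n > 1}

print(len(input_ids), "tokens,", len(counts), "distinct IDs")
print(repeated)
print(265 in input_ids)  # 265 is the ID reported above for '▁of'
```

ID 265 shows up once, matching the single "of" in "abatement of pollution", and 47284 shows up twice, which is consistent with "abatement" appearing twice in the input text.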

Transformers assumes that the label column is named labels, but in this dataset it is called score, so we need to rename it.

In [25]:
tok_ds = tok_ds.rename_columns({'score':'labels'})

Now that we have created our tokens and labels, we need to create a validation set.

# Test and Validation sets

In [26]:
eval_df = pd.read_csv(path/'test.csv')
eval_df.describe()

Unnamed: 0,id,anchor,target,context
count,36,36,36,36
unique,36,34,36,29
top,4112d61851461f60,hybrid bearing,inorganic photoconductor drum,G02
freq,1,2,1,3


Transformers uses a DatasetDict for holding your training and validation sets. To create one that contains 25% of our data for the validation set, and 75% for the training set, use train_test_split:

In [27]:
# train and validation set
dds = tok_ds.train_test_split(0.25, seed=42)
dds

DatasetDict({
    train: Dataset({
        features: ['id', 'anchor', 'target', 'context', 'labels', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 27354
    })
    test: Dataset({
        features: ['id', 'anchor', 'target', 'context', 'labels', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 9119
    })
})

Here, the validation set is named test rather than valid.
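The row counts above can be reproduced with a little arithmetic. The sketch below assumes the test share is rounded up and the remainder goes to train; that assumption matches the counts shown but is not taken from the library source, and the shuffle here will not reproduce the library's actual row assignment.

```python
import math
import random

n_rows = 36473     # rows in the training data, from the output above
test_size = 0.25

# Assumption: the test share is rounded up and the rest goes to train.
n_test = math.ceil(n_rows * test_size)
n_train = n_rows - n_test

# Shuffle row indices reproducibly, then cut into two disjoint parts.
rng = random.Random(42)
indices = list(range(n_rows))
rng.shuffle(indices)
valid_idx, train_idx = indices[:n_test], indices[n_test:]

print(n_train, n_test)
```

Fixing the seed means the same rows land in the validation set on every run, so scores from different runs stay comparable.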

In [28]:
# test set
eval_df['input'] = 'TEXT1: ' + eval_df.context + '; TEXT2: ' + eval_df.target + '; ANC1: ' + eval_df.anchor
eval_ds = Dataset.from_pandas(eval_df).map(tok_func, batched=True)

Map:   0%|          | 0/36 [00:00<?, ? examples/s]

# Metrics and correlation

The competition mentions the evaluation will be done using the Pearson correlation coefficient between the predicted and actual similarity scores.
Transformers expects metrics to be returned as a dict, since that way the trainer knows what name to report each metric under.

In [29]:
def corr(x,y): return np.corrcoef(x,y)[0][1]
def corr_d(eval_pred): return {'pearson': corr(*eval_pred)}
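For intuition about what the metric measures, here is a pure-Python sketch of the Pearson correlation for two 1-D sequences: the covariance divided by the product of the standard deviations. The pearson helper and the sample values are hypothetical and only for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical labels and predictions, for illustration only.
labels = [0.0, 0.25, 0.5, 0.75, 1.0]
rescaled = [0.1, 0.2, 0.3, 0.4, 0.5]      # linear in labels, different scale
reversed_ = [1.0, 0.75, 0.5, 0.25, 0.0]   # exactly the opposite ordering

print(pearson(rescaled, labels))
print(pearson(reversed_, labels))
```

The metric rewards predictions that move linearly with the labels, regardless of their scale or offset, and ranges from -1 to 1.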

# Training

To train our model, we'll need these from transformers:

In [30]:
from transformers import TrainingArguments,Trainer

In [31]:
from transformers import TrainingArguments
print(TrainingArguments.__init__.__code__.co_varnames)


('self', 'output_dir', 'overwrite_output_dir', 'do_train', 'do_eval', 'do_predict', 'eval_strategy', 'prediction_loss_only', 'per_device_train_batch_size', 'per_device_eval_batch_size', 'per_gpu_train_batch_size', 'per_gpu_eval_batch_size', 'gradient_accumulation_steps', 'eval_accumulation_steps', 'eval_delay', 'torch_empty_cache_steps', 'learning_rate', 'weight_decay', 'adam_beta1', 'adam_beta2', 'adam_epsilon', 'max_grad_norm', 'num_train_epochs', 'max_steps', 'lr_scheduler_type', 'lr_scheduler_kwargs', 'warmup_ratio', 'warmup_steps', 'log_level', 'log_level_replica', 'log_on_each_node', 'logging_dir', 'logging_strategy', 'logging_first_step', 'logging_steps', 'logging_nan_inf_filter', 'save_strategy', 'save_steps', 'save_total_limit', 'save_safetensors', 'save_on_each_node', 'save_only_model', 'restore_callback_states_from_checkpoint', 'no_cuda', 'use_cpu', 'use_mps_device', 'seed', 'data_seed', 'jit_mode_eval', 'use_ipex', 'bf16', 'fp16', 'fp16_opt_level', 'half_precision_backend',

We pick a batch size that fits in GPU memory and a small number of epochs so that training runs quickly.

In [32]:
bs = 128
epochs = 5

In [33]:
lr = 8e-5
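These settings determine how many optimizer steps training takes. The estimate below is a sketch under stated assumptions: a single device, no gradient accumulation, and a final partial batch that is kept rather than dropped. With several GPUs the effective batch size, and hence the step count, would differ.

```python
import math

n_train = 27354   # training rows, from the split shown earlier
bs = 128
epochs = 5

# Assumptions: one device, no gradient accumulation, last partial batch kept.
steps_per_epoch = math.ceil(n_train / bs)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)
```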

Transformers uses the TrainingArguments class to set up arguments.

In [34]:
args = TrainingArguments('outputs', learning_rate=lr, warmup_ratio=0.1, lr_scheduler_type='cosine', fp16=True,
    eval_strategy="epoch", logging_strategy="epoch", per_device_train_batch_size=bs, per_device_eval_batch_size=bs*2,
    num_train_epochs=epochs, weight_decay=0.01, report_to='none')
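The combination of warmup_ratio=0.1 and lr_scheduler_type='cosine' means the learning rate ramps up linearly for the first tenth of training and then decays along a cosine curve. The sketch below shows that general shape. The lr_at function is a hypothetical helper, not the library's scheduler, and details such as how the warmup step count is rounded may differ.

```python
import math

def lr_at(step, total_steps, base_lr=8e-5, warmup_ratio=0.1):
    """Learning rate at a given optimizer step (illustrative sketch)."""
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear warmup from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Cosine decay from base_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

total = 1000  # hypothetical number of optimizer steps
for s in (0, 50, 100, 550, 1000):
    print(s, f"{lr_at(s, total):.2e}")
```

The rate peaks at the base value when warmup ends, is back to half of it midway through the decay, and reaches zero at the final step.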

In [35]:
# model = AutoModelForSequenceClassification.from_pretrained(model_nm, num_labels=1)
trainer = Trainer(model, args, train_dataset=dds['train'], eval_dataset=dds['test'],
                  tokenizer=tokz, compute_metrics=corr_d)



In [36]:
trainer.train();



Epoch,Training Loss,Validation Loss,Pearson
1,0.068,0.03025,0.774524
2,0.029,0.025263,0.792014
3,0.02,0.022544,0.821885
4,0.0152,0.023553,0.830098
5,0.013,0.022942,0.831233




In [37]:
preds = trainer.predict(eval_ds).predictions.astype(float)
preds

array([[ 0.50079411],
       [ 0.69522429],
       [ 0.6164003 ],
       [ 0.34443745],
       [-0.04182976],
       [ 0.53344709],
       [ 0.5098325 ],
       [-0.01717768],
       [ 0.27085549],
       [ 1.10770631],
       [ 0.26348093],
       [ 0.23570023],
       [ 0.78616691],
       [ 1.01833248],
       [ 0.74594665],
       [ 0.3691071 ],
       [ 0.26086798],
       [-0.05681508],
       [ 0.62070602],
       [ 0.36089557],
       [ 0.48032713],
       [ 0.26338986],
       [ 0.15293182],
       [ 0.23470487],
       [ 0.55423379],
       [-0.02562906],
       [-0.05165453],
       [-0.03737988],
       [-0.04751722],
       [ 0.73212296],
       [ 0.31206596],
       [ 0.03996582],
       [ 0.69898397],
       [ 0.50855434],
       [ 0.47621   ],
       [ 0.21128435]])

In [38]:
preds = np.clip(preds, 0, 1)

In [39]:
preds

array([[0.50079411],
       [0.69522429],
       [0.6164003 ],
       [0.34443745],
       [0.        ],
       [0.53344709],
       [0.5098325 ],
       [0.        ],
       [0.27085549],
       [1.        ],
       [0.26348093],
       [0.23570023],
       [0.78616691],
       [1.        ],
       [0.74594665],
       [0.3691071 ],
       [0.26086798],
       [0.        ],
       [0.62070602],
       [0.36089557],
       [0.48032713],
       [0.26338986],
       [0.15293182],
       [0.23470487],
       [0.55423379],
       [0.        ],
       [0.        ],
       [0.        ],
       [0.        ],
       [0.73212296],
       [0.31206596],
       [0.03996582],
       [0.69898397],
       [0.50855434],
       [0.47621   ],
       [0.21128435]])

In [40]:
import datasets

submission = datasets.Dataset.from_dict({
    'id': eval_ds['id'],
    'score': preds.flatten()
})

submission.to_csv("/kaggle/working/submission.csv", index=False)

Creating CSV from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

1178

In [41]:
submission

Dataset({
    features: ['id', 'score'],
    num_rows: 36
})

In [42]:
# Reload the saved submission
check = pd.read_csv("/kaggle/working/submission.csv")

print("Shape:", check.shape)
print("Dtypes:\n", check.dtypes)
print("\nFirst few rows:")
print(check.head())

# Checks
assert "id" in check.columns, "Missing 'id' column"
assert "score" in check.columns, "Missing 'score' column"
assert check['score'].dtype in [float, 'float64'], "Scores must be floats"
assert not check['score'].isnull().any(), "Found NaNs in scores"
assert check['id'].equals(eval_df['id']), "IDs don't match test.csv order"
print("\n✅ Submission file is Kaggle-ready!")

Shape: (36, 2)
Dtypes:
 id        object
score    float64
dtype: object

First few rows:
                 id     score
0  4112d61851461f60  0.500794
1  09e418c93a776564  0.695224
2  36baf228038e314b  0.616400
3  1f37ead645e7f0c8  0.344437
4  71a5b6ad068d531f  0.000000

✅ Submission file is Kaggle-ready!
