# NLP

NLP using Hugging Face Transformers.

## Pretraining and finetuning neural nets

A pretrained model is a bunch of parameters that have already been fit. We are confident about where some of them should be, and less confident about others.

Finetuning is working out the ones we had less of an idea about and wiggling the others a bit.

### ULMFit

This approach was first proposed by ULMFiT, which used an RNN rather than a transformer. It trained three models in sequence:

- Wikitext 103 language model - predicts the next word of Wikipedia text, which requires some "understanding" of language structure, the world, maths etc. It got to about 30% accuracy.
- IMDb language model - uses the Wikitext model as the pretrained starting point and trains the same next-word task on IMDb reviews.
- IMDb classifier - fine-tunes the IMDb language model to detect sentiment.

### Transformers

Transformers are not structured the same way, so pretraining needs a different but similar idea: take a chunk of text, hide some of the words, and ask the model to predict the missing ones.
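
As a rough illustration of that pretraining task (not how any particular library implements it), the sketch below hides a fraction of the words in a sentence and keeps the hidden words as the answers a model would be asked to predict. The `mask_words` helper, the `[MASK]` placeholder and the 30% rate are all assumptions made for this example.

```python
import random


def mask_words(text, mask_rate=0.3, seed=0):
    """Hide a random subset of words; return the masked text and the answers."""
    rng = random.Random(seed)
    words = text.split()
    # Always hide at least one word so there is something to predict
    n_masked = max(1, round(len(words) * mask_rate))
    positions = sorted(rng.sample(range(len(words)), n_masked))
    targets = {pos: words[pos] for pos in positions}
    masked = ["[MASK]" if i in targets else w for i, w in enumerate(words)]
    return " ".join(masked), targets


masked_text, targets = mask_words("the cat sat on the mat because it was warm")
print(masked_text)  # the input the model would see
print(targets)      # position -> word the model has to recover
```

A real model works on tokens rather than whole words, but the training signal is the same idea: the loss compares the model's guesses with the hidden originals.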

## NLP

The most common/useful/practical area for NLP is classification (sentiment analysis, author identification, topics etc.).

One example is [U.S. Patent Phrase to Phrase Matching](https://www.kaggle.com/competitions/us-patent-phrase-to-phrase-matching/overview).

The task is to compare two words or short phrases and score them on how similar they are, based on the patent class they were used in.

```
1    == Very close match
0.75 == Close synonym (eg. “mobile phone” vs. “cellphone”)
0.5  == Synonyms which don’t have the same meaning (eg. "abatement" and "eliminating process")
0.25 == Somewhat related
0    == Unrelated
```

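To treat this as a categorisation problem, just consider each score as a category. Note that the training code further down takes a different route: it sets `num_labels=1`, which gives the model a single continuous output and treats the score as a regression target.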
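
A minimal sketch of the category framing, assuming the five score values listed above are the only ones that occur. Each score is mapped to an integer class id and back; the names `SCORES`, `score_to_class` and `class_to_score` are made up for this example.

```python
# The five possible scores, in ascending order
SCORES = [0.0, 0.25, 0.5, 0.75, 1.0]

# Map each score to an integer class id (0..4) and back again
score_to_class = {score: i for i, score in enumerate(SCORES)}
class_to_score = {i: score for score, i in score_to_class.items()}

labels = [0.5, 0.75, 0.25, 0.5, 0.0]  # example scores
classes = [score_to_class[s] for s in labels]
print(classes)  # [2, 3, 1, 2, 0]

# Converting back recovers the original scores
print([class_to_score[c] for c in classes])
```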



## Init

In [1]:
import os
from pathlib import Path

iskaggle = os.environ.get("KAGGLE_KERNEL_RUN_TYPE", "")

In [2]:
path = Path("data/us-patent-phrase-to-phrase-matching")
if iskaggle and not path.exists():
    path = Path("../input/us-patent-phrase-to-phrase-matching")

## Data

In [3]:
import pandas as pd

train_frame = pd.read_csv(path / "train.csv")
train_frame

| | id | anchor | target | context | score |
|---|---|---|---|---|---|
| 0 | 37d61fd2272659b1 | abatement | abatement of pollution | A47 | 0.50 |
| 1 | 7b9652b17b68b7a4 | abatement | act of abating | A47 | 0.75 |
| 2 | 36d72442aefd8232 | abatement | active catalyst | A47 | 0.25 |
| 3 | 5296b0c19e1ce60e | abatement | eliminating process | A47 | 0.50 |
| 4 | 54c1e3b9184cb5b6 | abatement | forest region | A47 | 0.00 |
| ... | ... | ... | ... | ... | ... |
| 36468 | 8e1386cbefd7f245 | wood article | wooden article | B44 | 1.00 |
| 36469 | 42d9e032d1cd3242 | wood article | wooden box | B44 | 0.50 |
| 36470 | 208654ccb9e14fa3 | wood article | wooden handle | B44 | 0.50 |
| 36471 | 756ec035e694722b | wood article | wooden material | B44 | 0.75 |


In [4]:
train_frame.describe(include="object")

| | id | anchor | target | context |
|---|---|---|---|---|
| count | 36473 | 36473 | 36473 | 36473 |
| unique | 36473 | 733 | 29340 | 106 |
| top | 37d61fd2272659b1 | component composite coating | composition | H01 |
| freq | 1 | 152 | 24 | 2186 |


## Input

We need to provide input to the model in some format. Let's put it into a semi-structured string of the form:

`TEXT1: {context}; TEXT2: {target}; ANC1: {anchor};`

In [5]:
train_frame["input"] = (
    "TEXT1: " + train_frame.context + "; TEXT2: " + train_frame.target + "; ANC1: " + train_frame.anchor + ";"
)
train_frame.input.head()

0    TEXT1: A47; TEXT2: abatement of pollution; ANC...
1    TEXT1: A47; TEXT2: act of abating; ANC1: abate...
2    TEXT1: A47; TEXT2: active catalyst; ANC1: abat...
3    TEXT1: A47; TEXT2: eliminating process; ANC1: ...
4    TEXT1: A47; TEXT2: forest region; ANC1: abatem...
Name: input, dtype: object

## Tokenisation and numericalisation

- Tokenisation - split the text into pieces. Not every language separates words with spaces, so we split into subwords ("tokens") rather than words.
- Numericalisation - convert each token into a number for our model.
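
A toy sketch of the two steps, using whitespace splitting and a hand-built vocabulary. Real tokenisers split into subwords and ship with a fixed vocabulary from pretraining; the `toy_tokenise` and `build_vocab` helpers here are invented purely for illustration.

```python
def toy_tokenise(text):
    """Tokenisation: split text into tokens (here, just lowercase words)."""
    return text.lower().split()


def build_vocab(texts):
    """Assign each distinct token an integer id, in order of first appearance."""
    vocab = {}
    for text in texts:
        for token in toy_tokenise(text):
            vocab.setdefault(token, len(vocab))
    return vocab


texts = ["abatement of pollution", "act of abating"]
vocab = build_vocab(texts)

# Numericalisation: replace each token with its id
ids = [[vocab[t] for t in toy_tokenise(text)] for text in texts]
print(vocab)  # {'abatement': 0, 'of': 1, 'pollution': 2, 'act': 3, 'abating': 4}
print(ids)    # [[0, 1, 2], [3, 1, 4]]
```

Note that "of" gets the same id in both phrases. That consistency is why the vocabulary has to match the one the model was pretrained with.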

We'll use Hugging Face Datasets (the `datasets` library), which is what Hugging Face Transformers uses.

In [6]:
from datasets import Dataset, DatasetDict

train_dataset = Dataset.from_pandas(train_frame)
train_dataset

Dataset({
    features: ['id', 'anchor', 'target', 'context', 'score', 'input'],
    num_rows: 36473
})

We need to pick a model first, as tokenisation/numericalisation is a bit different for each pretrained model. We need to do the same thing as the people who trained the model did, so that the model gets the same kind of inputs.

In [35]:
# can use -small or -large
model_nm = "microsoft/deberta-v3-small"

if iskaggle:
    model_nm = "../input/debertav3small/"

In [8]:
# AutoTokenizer creates an appropriate tokeniser for the model
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tok = AutoTokenizer.from_pretrained(model_nm)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [9]:
# An example of the tokenisation
tok.tokenize("Hello, this is a sentence WITHSOMEWORDs that will be tok'ed into sub-words, FUN!")

['▁Hello',
 ',',
 '▁this',
 '▁is',
 '▁a',
 '▁sentence',
 '▁WITH',
 'SOME',
 'WORD',
 's',
 '▁that',
 '▁will',
 '▁be',
 '▁to',
 'k',
 "'",
 'ed',
 '▁into',
 '▁sub',
 '-',
 'words',
 ',',
 '▁FUN',
 '!']

In [10]:
# A function to tokenise our inputs


def tok_fn(x):
    return tok(x["input"])

In [11]:
# Run it on the training data
tok_train = train_dataset.map(tok_fn, batched=True)
tok_train

Dataset({
    features: ['id', 'anchor', 'target', 'context', 'score', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 36473
})

In [12]:
# We now have a new input_ids field: a list of numbers, each mapping to a token in the vocab. For example:
tok.vocab["▁of"]

265

In [13]:
# We need the labels (categories) for our training input. transformers assumes the column is called labels, but it's currently called score
tok_train = tok_train.rename_columns({"score": "labels"})
tok_train

Dataset({
    features: ['id', 'anchor', 'target', 'context', 'labels', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 36473
})

## Test/validation data sets

We also have a test set. Held-out data like this helps us identify underfitting/overfitting.

* Underfit - not enough complexity in the model to represent things. Systematically biased. Training data will not fit well.
* Overfit - too much complexity in the model, represents our training set but not our general problem. Test/validation data will not fit well.
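
To make the two failure modes concrete, here is a small sketch on made-up data (a noisy straight line, which is an assumption of the example and nothing to do with the patent data). It compares a model that is too simple, a model that memorises the training set, and a straight-line fit. All names are hypothetical.

```python
import random

rng = random.Random(0)


def make_data(n):
    """Noisy samples from the hypothetical relationship y = 2x + noise."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    return [(x, 2 * x + rng.gauss(0, 1)) for x in xs]


def mse(predict, data):
    """Mean squared error of a prediction function over (x, y) pairs."""
    return sum((predict(x) - y) ** 2 for x, y in data) / len(data)


train, valid = make_data(50), make_data(50)
mean_x = sum(x for x, _ in train) / len(train)
mean_y = sum(y for _, y in train) / len(train)


def constant(x):
    # Underfit: always predict the mean of the training targets
    return mean_y


def memoriser(x):
    # Overfit: return the target of the nearest training point
    return min(train, key=lambda row: abs(row[0] - x))[1]


# Reasonable fit: least-squares straight line through the training data
slope = sum((x - mean_x) * (y - mean_y) for x, y in train) / sum(
    (x - mean_x) ** 2 for x, _ in train
)


def line(x):
    return mean_y + slope * (x - mean_x)


for name, model in [("underfit", constant), ("overfit", memoriser), ("line", line)]:
    print(f"{name:9s} train mse={mse(model, train):.2f}  valid mse={mse(model, valid):.2f}")
```

The underfit model is poor on both sets. The memoriser is perfect on the training set and worse on the validation set, which is the gap a held-out set is there to reveal.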

We need to be careful about how we create test/validation data, as we could accidentally create a set that an overfitted model still matches. For example, with a time-series dataset, sampling points uniformly at random is bad (each held-out point sits between training points the model has already seen), whereas holding out the last week is better.
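
The time-series example can be sketched as follows, using four weeks of hypothetical daily rows. The dates and variable names are assumptions for illustration only.

```python
import random
from datetime import date, timedelta

# Four weeks of hypothetical daily observations (the values are placeholders)
start = date(2024, 1, 1)
rows = [(start + timedelta(days=i), i) for i in range(28)]

# Random split: validation days are scattered between training days, so a model
# that has memorised the neighbouring days can still score well
rng = random.Random(0)
shuffled = rows[:]
rng.shuffle(shuffled)
random_valid = sorted(shuffled[:7])

# Time-based split: hold out the final week, which lies entirely after the training data
cutoff = start + timedelta(days=21)
time_train = [r for r in rows if r[0] < cutoff]
time_valid = [r for r in rows if r[0] >= cutoff]

print([d.isoformat() for d, _ in random_valid])
print([d.isoformat() for d, _ in time_valid])
```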

So we need a validation set.

In [14]:
# datasets can split a Dataset into two
# Note, it's a bit confusing as our validation set is called test

dataset_dict = tok_train.train_test_split(0.25)
dataset_dict

DatasetDict({
    train: Dataset({
        features: ['id', 'anchor', 'target', 'context', 'labels', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 27354
    })
    test: Dataset({
        features: ['id', 'anchor', 'target', 'context', 'labels', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 9119
    })
})

Let's also load the test set.

In [30]:
test_frame = pd.read_csv(path / "test.csv")
test_frame["input"] = (
    "TEXT1: " + test_frame.context + "; TEXT2: " + test_frame.target + "; ANC1: " + test_frame.anchor + ";"
)

test_dataset = Dataset.from_pandas(test_frame).map(tok_fn, batched=True)
test_dataset

Dataset({
    features: ['id', 'anchor', 'target', 'context', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 36
})

## Metrics

Measurements we are interested in maximising or minimising.

This Kaggle competition is scored using the [Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient).

Note that in the real world optimising for metrics is a double-edged sword. You have to optimise for something, so you need metrics, but any given metric is unlikely to represent the full complexity of performance in the real world. AI can sometimes be a bit too good at optimising for a metric rather than doing its actual job.

Let's define our own correlation function to use as the compute metric:

In [25]:
import numpy as np


def corr(x, y):
    return np.corrcoef(x, y)[0][1]


def corr_d(eval_pred):
    return {"pearson": corr(*eval_pred)}
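
To show what the metric measures, here is the Pearson formula written out in plain Python. This is a sketch for intuition, not a replacement for the NumPy version above; the `pearson` function and the sample numbers are made up.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


predictions = [0.1, 0.4, 0.5, 0.9]  # made-up model outputs
targets = [0.0, 0.25, 0.5, 1.0]     # made-up true scores
print(round(pearson(predictions, targets), 3))

# A perfectly linear relationship scores 1.0 even when the scale is off
print(pearson([1, 2, 3], [10, 20, 30]))
```

The second call shows that the metric rewards getting the ordering and relative spacing right, not the absolute values.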

## Training

In [27]:
from transformers import TrainingArguments, Trainer

bs = 128
epochs = 10
lr = 8e-5

# Arguments for the trainer
args = TrainingArguments(
    "outputs",
    learning_rate=lr,
    warmup_ratio=0.1,
    lr_scheduler_type="cosine",
    fp16=True,
    evaluation_strategy="epoch",  # evaluate after every epoch (newer transformers releases call this eval_strategy)
    per_device_train_batch_size=bs,
    per_device_eval_batch_size=bs * 2,
    num_train_epochs=epochs,
    weight_decay=0.01,
    report_to="none",
)

# Create the model and the trainer
model = AutoModelForSequenceClassification.from_pretrained(model_nm, num_labels=1)
trainer = Trainer(
    model,
    args,
    train_dataset=dataset_dict["train"],
    eval_dataset=dataset_dict["test"],
    tokenizer=tok,
    compute_metrics=corr_d,
)

Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-small and are newly initialized: ['pooler.dense.bias', 'classifier.bias', 'classifier.weight', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
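
The `warmup_ratio=0.1` and `lr_scheduler_type="cosine"` arguments shape the learning rate over training. The sketch below approximates that shape (linear warmup, then cosine decay) and is not the library's own code. The `lr_at_step` helper is hypothetical, and the 2140 total steps are an assumption: 27354 training rows in batches of 128 is 214 steps per epoch, times 10 epochs, on a single device.

```python
import math


def lr_at_step(step, total_steps, base_lr, warmup_ratio):
    """Linear warmup to base_lr, then cosine decay towards zero."""
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1 + math.cos(math.pi * progress))


total_steps = 2140  # assumed for illustration: 10 epochs x 214 batches of 128
for step in (0, 107, 214, 1177, 2140):
    print(step, f"{lr_at_step(step, total_steps, base_lr=8e-5, warmup_ratio=0.1):.2e}")
```

The rate climbs from zero to the peak of 8e-5 over the first 10% of steps, is back down to half the peak midway through the decay, and ends near zero.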


In [28]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Pearson |
|---|---|---|---|
| 1 | No log | 0.027138 | 0.79576 |
| 2 | No log | 0.022707 | 0.825082 |
| 3 | 0.036800 | 0.024571 | 0.826926 |
| 4 | 0.036800 | 0.020292 | 0.838661 |
| 5 | 0.013600 | 0.023074 | 0.83977 |
| 6 | 0.013600 | 0.021925 | 0.842033 |
| 7 | 0.013600 | 0.021572 | 0.841824 |
| 8 | 0.008100 | 0.021221 | 0.843823 |
| 9 | 0.008100 | 0.021251 | 0.843621 |
| 10 | 0.005900 | 0.021292 | 0.843733 |


TrainOutput(global_step=2140, training_loss=0.015394991691981521, metrics={'train_runtime': 412.252, 'train_samples_per_second': 663.526, 'train_steps_per_second': 5.191, 'total_flos': 1867646572494840.0, 'train_loss': 0.015394991691981521, 'epoch': 10.0})

## Submission

In [31]:
# Generate predictions from the test data set
preds = trainer.predict(test_dataset).predictions.astype(float)
preds

array([[ 0.42993164],
       [ 0.75341797],
       [ 0.51318359],
       [ 0.24438477],
       [-0.02354431],
       [ 0.50439453],
       [ 0.45068359],
       [-0.01178741],
       [ 0.28466797],
       [ 1.04101562],
       [ 0.25048828],
       [ 0.22973633],
       [ 0.83105469],
       [ 0.84277344],
       [ 0.76367188],
       [ 0.41601562],
       [ 0.30200195],
       [-0.03231812],
       [ 0.56152344],
       [ 0.43774414],
       [ 0.53417969],
       [ 0.24853516],
       [ 0.16418457],
       [ 0.22900391],
       [ 0.58935547],
       [-0.01359558],
       [-0.02694702],
       [-0.02174377],
       [-0.01811218],
       [ 0.65136719],
       [ 0.12145996],
       [-0.01974487],
       [ 0.66162109],
       [ 0.55029297],
       [ 0.30200195],
       [ 0.2467041 ]])

In [39]:
# Some of them are > 1 or < 0, so let's just clip them into [0, 1]
preds = np.clip(preds, 0, 1)
preds = preds.flatten()
preds

array([0.42993164, 0.75341797, 0.51318359, 0.24438477, 0.        ,
       0.50439453, 0.45068359, 0.        , 0.28466797, 1.        ,
       0.25048828, 0.22973633, 0.83105469, 0.84277344, 0.76367188,
       0.41601562, 0.30200195, 0.        , 0.56152344, 0.43774414,
       0.53417969, 0.24853516, 0.16418457, 0.22900391, 0.58935547,
       0.        , 0.        , 0.        , 0.        , 0.65136719,
       0.12145996, 0.        , 0.66162109, 0.55029297, 0.30200195,
       0.2467041 ])

In [40]:
# And now generate the submission.csv
import datasets

submission = datasets.Dataset.from_dict({"id": test_dataset["id"], "score": preds})

submission.to_csv("submission.csv", index=False)

1038

When submitted to the Kaggle competition this scored ~0.8.