# 04_Lecture_NLP

Fine-tuning a pre-trained NLP model for classification using the Hugging Face Transformers library.
As a dataset we use U.S. Patent Phrase to Phrase Matching from the Kaggle competition (https://www.kaggle.com/competitions/us-patent-phrase-to-phrase-matching/data).

**Comments:**  
Hugging Face model hub: huggingface.co/models. Models differ in the corpora they were trained on and in the problems they were trained to solve. By searching with a keyword, you can often find a model trained on something pretty similar to your task (on the same kind of documents).

Article on how to split off a validation set from the training data: https://www.fast.ai/2017/11/13/validation-sets/

Kaggle has a second test set, which is yet another held-out dataset that's only used at the end of the competition to assess your predictions. That's called the "private leaderboard". Here's a great post about what can happen if you overfit to the public leaderboard: https://gregpark.io/blog/Kaggle-Psychopathy-Postmortem/

Metrics are a big problem for AI: https://www.fast.ai/2019/09/24/metrics/  
In real life, outside of Kaggle, we don't really know which metrics to use. There is more detail under the section "Metrics and correlation": https://www.kaggle.com/code/jhoward/getting-started-with-nlp-for-absolute-beginners#Tokenization  
When you work with a new statistic or evaluation metric, plot some graphs to get a feel for it: what 0.34 looks like on that metric, or 0.68, or other values.
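
One way to build that intuition is to generate synthetic data with a known correlation and look at it. The sketch below is only an illustration: the helper names `pearson` and `make_pairs`, the sample size and the seed are my assumptions, not part of the lecture notebook. It draws pairs whose expected Pearson correlation is `r` and prints the measured value; plotting `xs` against `ys` (for example as a scatter plot) then shows what that value looks like.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def make_pairs(r, n=20000, seed=0):
    """Draw n (x, y) pairs whose expected correlation is r (hypothetical helper)."""
    rng = random.Random(seed)
    xs = [rng.gauss(0, 1) for _ in range(n)]
    # Mixing x with independent noise gives corr(x, y) = r in expectation.
    ys = [r * x + math.sqrt(1 - r * r) * rng.gauss(0, 1) for x in xs]
    return xs, ys

for target in (0.34, 0.68):
    xs, ys = make_pairs(target)
    print(f"target r={target}, measured r={pearson(xs, ys):.2f}")
```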

Another notebook that does the same task at a higher level: Iterate Like a Grandmaster (https://www.kaggle.com/code/jhoward/iterate-like-a-grandmaster/)

# Import libraries

In [34]:
import pandas as pd  # for work with dataframes
from datasets import Dataset, DatasetDict  # for Hugging Face dataset (needed for tokenization) 
from transformers import AutoModelForSequenceClassification, AutoTokenizer  # for tokenization
from transformers import TrainingArguments, Trainer  # for training the model
import numpy as np  # for operations with numbers and arrays
import datasets  # for saving results as csv

# Step 1. Download the dataset
1) Download the dataframe

The dataset was downloaded from https://www.kaggle.com/competitions/us-patent-phrase-to-phrase-matching/data and saved to the working directory.

2) Investigate the dataframe

In [2]:
train_df = pd.read_csv('/Users/hela/Code/fast_ai/us-patent-phrase-to-phrase-matching/train.csv')
train_df

Unnamed: 0,id,anchor,target,context,score
0,37d61fd2272659b1,abatement,abatement of pollution,A47,0.50
1,7b9652b17b68b7a4,abatement,act of abating,A47,0.75
2,36d72442aefd8232,abatement,active catalyst,A47,0.25
3,5296b0c19e1ce60e,abatement,eliminating process,A47,0.50
4,54c1e3b9184cb5b6,abatement,forest region,A47,0.00
...,...,...,...,...,...
36468,8e1386cbefd7f245,wood article,wooden article,B44,1.00
36469,42d9e032d1cd3242,wood article,wooden box,B44,0.50
36470,208654ccb9e14fa3,wood article,wooden handle,B44,0.50
36471,756ec035e694722b,wood article,wooden material,B44,0.75


In [3]:
train_df.describe(include='object')

Unnamed: 0,id,anchor,target,context
count,36473,36473,36473,36473
unique,36473,733,29340,106
top,37d61fd2272659b1,component composite coating,composition,H01
freq,1,152,24,2186


# Step 2. Train_data: prepare input and output

## 1) Input (tokenization & numericalization)

### 1) Code the input
We can represent the input to the model as something like "TEXT1: A47; TEXT2: eliminating process; ANC1: abatement". (And the output will be a category of meaning similarity: "Different; Similar; Identical")  
(We can refer to a column (also known as a series) either using regular Python "dotted" notation, or by indexing it like a dictionary. To create a new column, use the dictionary-style indexing. To get the first few rows, use head().)

In [4]:
train_df['input'] = 'TEXT1: ' + train_df.context + '; TEXT2: ' + train_df.target + '; ANC1: ' + train_df.anchor
train_df.input.head()

0    TEXT1: A47; TEXT2: abatement of pollution; ANC...
1    TEXT1: A47; TEXT2: act of abating; ANC1: abate...
2    TEXT1: A47; TEXT2: active catalyst; ANC1: abat...
3    TEXT1: A47; TEXT2: eliminating process; ANC1: ...
4    TEXT1: A47; TEXT2: forest region; ANC1: abatement
Name: input, dtype: object

### 2) Create Hugging Face dataset
But we can't pass the texts directly into a model. A deep learning model expects numbers as inputs, not English sentences! So we need to do two things:

1) Tokenization - split each text up into tokens
2) Numericalization: Convert each token into a number

These steps will be performed with Hugging Face transformers and Hugging Face datasets. So we need to turn our pandas dataframe into a Hugging Face "datasets" dataset.

In [5]:
train_ds = Dataset.from_pandas(train_df)
train_ds

Dataset({
    features: ['id', 'anchor', 'target', 'context', 'score', 'input'],
    num_rows: 36473
})

### 3) Find a pre-trained model for tokenization
The details of how tokenization and numericalization are done depend on the particular model we use.   
Hugging Face transformers is like timm, with a lot of pre-built models.  

So first we'll need to pick a model. There are thousands of models available on Hugging Face...

In [6]:
# the model from the course had issues with tokenizer loading, so I tried another:
# model_nm = 'distilbert/distilbert-base-uncased-finetuned-sst-2-english' # good, but num_labels=2 (trained for binary classification)
model_nm = 'distilbert-base-uncased'
model_nm

'distilbert-base-uncased'

### 4) Extract AutoTokenizer from the model
The reason we pick the model first is that we have to make sure we tokenize in the same way the model was trained. (That's why I looked for a model trained on English and, ideally, on a classification task. The ideal scenario is a model trained exactly for your task, of course, but I did not find a suitable one.)  
So to tell transformers that we want to tokenize the same way that the people who built the model did, we use:  
AutoTokenizer

In [7]:
tokz = AutoTokenizer.from_pretrained(model_nm)

Check the tokenizer on any text:
Here's an example of how the tokenizer splits a text into "tokens" (which are like words, but can be sub-word pieces). Every one of these tokens is stored in the vocabulary that was created when the model was trained, and each has a number. So our tokens will be given numbers, too.

In [8]:
tokz.tokenize("A platypus is an ornithorhynchus anatinus.")  #uncommon words
# or
tokz.tokenize("G'day folks, I'm Jeremy from fast.ai!")

['g',
 "'",
 'day',
 'folks',
 ',',
 'i',
 "'",
 'm',
 'jeremy',
 'from',
 'fast',
 '.',
 'ai',
 '!']
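
To make the two steps (tokenization, then numericalization) concrete, here is a toy sketch in plain Python. Everything in it is an assumption for illustration: the tiny `toy_vocab`, its ids and the `wordpiece` helper are made up, and the real tokenizer loaded above has its own vocabulary and extra normalization rules. The sketch only shows the general idea of greedy longest-match sub-word splitting followed by a vocabulary lookup.

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first split of one lowercase word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            cand = word[start:end]
            if start > 0:
                cand = "##" + cand  # pieces that continue a word are marked with '##'
            if cand in vocab:
                break
            end -= 1
        if end == start:
            return ["[UNK]"]  # no piece matched: the whole word is unknown
        pieces.append(cand)
        start = end
    return pieces

# Hypothetical vocabulary; real vocabularies hold tens of thousands of entries.
toy_vocab = {"[UNK]": 0, "abate": 1, "##ment": 2, "of": 3, "pollution": 4}

text = "abatement of pollution"
tokens = [p for w in text.split() for p in wordpiece(w, toy_vocab)]  # tokenization
ids = [toy_vocab[t] for t in tokens]                                 # numericalization
print(tokens)
print(ids)
```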

### 5) Tokenize our input with the tokenizer
Let's now tokenize our input. This adds a new item to our dataset called input_ids.

In [9]:
# create a fn
def tok_fn(x):
    return tokz(x['input'])

In [10]:
# run fn quickly in parallel on every row in our dataset using 'map':
tok_train_ds = train_ds.map(tok_fn, batched=True)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Map:   0%|          | 0/36473 [00:00<?, ? examples/s]
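
The message above is a warning rather than an error, and it names its own remedy: set the `TOKENIZERS_PARALLELISM` environment variable explicitly. A minimal sketch, assuming the line is run at the top of the notebook before the tokenizer is first used:

```python
import os

# "false" turns off the tokenizer's internal parallelism, which silences the fork warning.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```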

Check the result.  
For instance, let's look at the input and the corresponding IDs (token numbers) for the first row of our data:

In [11]:
print(tok_train_ds['input'][0])
print(tok_train_ds['input_ids'][0])

TEXT1: A47; TEXT2: abatement of pollution; ANC1: abatement
[101, 3793, 2487, 1024, 1037, 22610, 1025, 3793, 2475, 1024, 19557, 18532, 4765, 1997, 10796, 1025, 2019, 2278, 2487, 1024, 19557, 18532, 4765, 102]


In [12]:
tokz.vocab['a']  # 1037 also appears in the input_ids above, where the text contains 'A'

1037

## 2) Output
Finally, we need to prepare our labels. Transformers always assumes that the label column is named labels, but in our dataset it's currently "score". Therefore, we need to rename it:

In [13]:
tok_train_ds = tok_train_ds.rename_columns({'score':'labels'})

# Step 3. Prepare test_data & valid_data
## 1) Valid_data
Transformers uses a DatasetDict for holding our training and validation sets. To create one that contains 25% of our data for the validation set, and 75% for the training set, use train_test_split.  
As you can see in the output below, the validation set here is called test and not validate, so be careful!  
Also, in practice, a random split like the one we've used here might not be a good idea.

In [14]:
dds = tok_train_ds.train_test_split(0.25, seed=42)
dds

DatasetDict({
    train: Dataset({
        features: ['id', 'anchor', 'target', 'context', 'labels', 'input', 'input_ids', 'attention_mask'],
        num_rows: 27354
    })
    test: Dataset({
        features: ['id', 'anchor', 'target', 'context', 'labels', 'input', 'input_ids', 'attention_mask'],
        num_rows: 9119
    })
})
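
One reason a random split can mislead here (my assumption, the notes above do not spell it out) is that the same anchor can land in both the training and the validation set, so the validation score may look better than it would on unseen anchors. The sketch below shows one possible alternative in plain Python: split by anchor, so that each anchor lives in exactly one split. The `split_by_group` helper is hypothetical, and the toy rows reuse a few anchor/target pairs from the table above plus two made-up anchors.

```python
import random

def split_by_group(rows, key, valid_frac=0.25, seed=42):
    """Split rows so that every distinct value of `key` lands in exactly one split."""
    groups = sorted({row[key] for row in rows})
    random.Random(seed).shuffle(groups)
    n_valid = max(1, int(len(groups) * valid_frac))
    valid_groups = set(groups[:n_valid])
    train = [r for r in rows if r[key] not in valid_groups]
    valid = [r for r in rows if r[key] in valid_groups]
    return train, valid

rows = [
    {"anchor": "abatement", "target": "abatement of pollution"},
    {"anchor": "abatement", "target": "act of abating"},
    {"anchor": "wood article", "target": "wooden box"},
    {"anchor": "wood article", "target": "wooden handle"},
    {"anchor": "made-up anchor 1", "target": "made-up target a"},  # illustrative only
    {"anchor": "made-up anchor 2", "target": "made-up target b"},  # illustrative only
]

train, valid = split_by_group(rows, key="anchor")
print(sorted({r["anchor"] for r in train}))
print(sorted({r["anchor"] for r in valid}))
```

With the real data, the same idea could be applied to the anchor column before tokenization; whether it gives a more trustworthy validation score for this competition is something to check rather than assume.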

## 2) Test_data
Our directory contains another file: the test data. We'll use 'eval' as our name for the test set, to avoid confusion with the test split that was created above for validation:

In [15]:
eval_df = pd.read_csv('/Users/hela/Code/fast_ai/us-patent-phrase-to-phrase-matching/test.csv')
eval_df
eval_df.describe(include='object')

Unnamed: 0,id,anchor,target,context
count,36,36,36,36
unique,36,34,36,29
top,4112d61851461f60,el display,inorganic photoconductor drum,G02
freq,1,2,1,3


In [16]:
# create an input for the evaluation data:
eval_df['input'] = 'TEXT1: ' + eval_df.context + '; TEXT2: ' + eval_df.target + '; ANC1: ' + eval_df.anchor
# turn into a dataset and perform tokenization:
tok_eval_ds = Dataset.from_pandas(eval_df).map(tok_fn, batched=True)

Map:   0%|          | 0/36 [00:00<?, ? examples/s]

Check the results:

In [17]:
print(tok_eval_ds['input'][0])
print(tok_eval_ds['input_ids'][0])

TEXT1: G02; TEXT2: inorganic photoconductor drum; ANC1: opc drum
[101, 3793, 2487, 1024, 1043, 2692, 2475, 1025, 3793, 2475, 1024, 28256, 6302, 8663, 8566, 16761, 6943, 1025, 2019, 2278, 2487, 1024, 6728, 2278, 6943, 102]


## 3) Evaluation metrics: Pearson corr
For this particular task, the Kaggle competition says that the evaluation metric is the Pearson correlation coefficient between the predicted and actual output (that is, the similarity score of anchor and target). Remember that with Pearson, even a few outliers can significantly influence your results, so they need to be dealt with. The same goes for a neural network: even a couple of rows that are badly wrong can wreck the score. So you want to be sure that the model does a good job on every row.

But in real life, outside of Kaggle, we don't really know what metrics to use... Here is more detail under the section "Metrics and correlation": https://www.kaggle.com/code/jhoward/getting-started-with-nlp-for-absolute-beginners#Tokenization

We need a report of the correlation after each epoch.  
Transformers expects you to return metrics as a dictionary, because it will use the keys of the dictionary to label each metric. So let's create a function that computes the correlation and returns it as a dictionary with the key "pearson":

In [39]:
# Old from the lecture:
def corr_d(eval_pred):
    return {'pearson': pearsonr(*eval_pred)}

In [18]:
# Corrected by me, so it works:
import numpy as np
from scipy.stats import pearsonr

def corr_d(eval_pred):
    predictions, label_ids = eval_pred
    print("Predictions shape:", predictions.shape)
    print("Label_ids shape:", label_ids.shape)
    # Flatten arrays to ensure 1D
    predictions = np.array(predictions).flatten()
    label_ids = np.array(label_ids).flatten()
    print("Flattened predictions shape:", predictions.shape)
    print("Flattened label_ids shape:", label_ids.shape)
    if len(predictions) != len(label_ids):
        raise ValueError(f"Mismatch: predictions ({len(predictions)}) and labels ({len(label_ids)}) have different lengths")
    if len(predictions) < 2:
        return {'pearson': 0.0}  # Handle edge case
    return {'pearson': pearsonr(predictions, label_ids)[0]}
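
The earlier remark that a couple of badly wrong rows can dominate Pearson correlation is easy to check on toy numbers. This is a sketch in plain Python with made-up data (20 evenly spaced labels and a hand-written `pearson` helper, both my assumptions), not the competition data:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

labels = [i / 19 for i in range(20)]  # toy labels spread over [0, 1]
good_preds = list(labels)             # every row predicted perfectly
bad_preds = list(labels)
bad_preds[0] = 5.0                    # one wildly wrong prediction, 19 rows still perfect

r_good = pearson(good_preds, labels)
r_bad = pearson(bad_preds, labels)
print(f"all rows right:      r = {r_good:.2f}")
print(f"one row badly wrong: r = {r_bad:.2f}")
```

A single extreme row is enough to pull the correlation far away from 1, which is why it pays to look at the largest individual errors and not only at the summary metric.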

# Step 4. Train the model
We will train the model in Transformers.

## 1) Define hyperparameters
Learning rate: fastai provides a learning rate finder to help you figure this out, but Transformers doesn't, so you'll just have to use trial and error. The idea is to find the largest value that doesn't cause training to fail.

In [19]:
bs = 128  # batch size: find the one that fits your GPU
epochs = 4  # to run quickly
lr = 8e-5

## 2) Set up arguments for Transformers
Transformers uses the TrainingArguments class to set up arguments. Don't worry too much about the values we're using here -- they should generally work fine in most cases. It's just the 3 parameters above that you may need to change for different models.

In [20]:
args = TrainingArguments('outputs', 
                         learning_rate=lr, 
                         warmup_ratio=0.1, 
                         lr_scheduler_type='cosine', 
                         fp16=False,  # Disable FP16 to use fp32
                         bf16=False,  # Disable BF16 to use fp32
                         eval_strategy="epoch", 
                         per_device_train_batch_size=bs, 
                         per_device_eval_batch_size=bs*2,
                         num_train_epochs=epochs, 
                         weight_decay=0.01, 
                         report_to='none')
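
To see what `warmup_ratio=0.1` combined with `lr_scheduler_type='cosine'` means for the learning rate, here is a simplified sketch of that schedule shape: a linear ramp up to the peak, followed by a cosine decay towards zero. It is an approximation written for illustration, not the Transformers implementation (rounding of the warmup steps may differ), and the `lr_at` helper is hypothetical. The step count assumes a single device and no gradient accumulation.

```python
import math

def lr_at(step, total_steps, peak_lr=8e-5, warmup_ratio=0.1):
    """Learning rate at a given step: linear warmup, then cosine decay (simplified)."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))

steps_per_epoch = math.ceil(27354 / 128)  # training rows / batch size = 214
total_steps = steps_per_epoch * 4         # 4 epochs = 856 steps

for step in (0, 40, 85, 470, 856):
    print(f"step {step:3d}: lr = {lr_at(step, total_steps):.2e}")
```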

## 3) Create a model and a Trainer


In [21]:
model = AutoModelForSequenceClassification.from_pretrained(model_nm, num_labels=1)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [22]:
trainer = Trainer(model, 
                  args,
                  train_dataset=dds['train'],
                  eval_dataset=dds['test'],
                  processing_class=tokz, 
                  compute_metrics=corr_d)
# I used processing_class instead of tokenizer because newer versions of Trainer renamed this argument

## 4) Train our model
The key thing to look at is the "Pearson" value in the table below. As you can see, it increases with each epoch and reaches about 0.79 after four epochs. That's great news! 

In [23]:
trainer.train();



Epoch,Training Loss,Validation Loss,Pearson
1,No log,0.034373,0.730294
2,No log,0.026375,0.776283
3,0.034600,0.024545,0.793398
4,0.034600,0.025271,0.793651


Predictions shape: (9119, 1)
Label_ids shape: (9119,)
Flattened predictions shape: (9119,)
Flattened label_ids shape: (9119,)
Predictions shape: (9119, 1)
Label_ids shape: (9119,)
Flattened predictions shape: (9119,)
Flattened label_ids shape: (9119,)




Predictions shape: (9119, 1)
Label_ids shape: (9119,)
Flattened predictions shape: (9119,)
Flattened label_ids shape: (9119,)




Predictions shape: (9119, 1)
Label_ids shape: (9119,)
Flattened predictions shape: (9119,)
Flattened label_ids shape: (9119,)
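
To put the validation loss in the table above on a more familiar scale: as far as I understand, with `num_labels=1` Transformers treats the task as regression and uses mean squared error as the loss by default. Under that assumption, the square root of the loss is the typical prediction error in score units. The `mse` helper and the three sample pairs below are made up for illustration; only the loss value is taken from the table.

```python
import math

def mse(preds, labels):
    """Mean squared error between predictions and labels."""
    return sum((p - y) ** 2 for p, y in zip(preds, labels)) / len(labels)

# Made-up predictions and labels, just to show the calculation.
print(mse([0.46, 0.54, 0.10], [0.50, 0.75, 0.00]))

val_loss = 0.025271          # validation loss after epoch 4, from the table above
rmse = math.sqrt(val_loss)   # back on the 0-1 score scale
print(f"typical error: about {rmse:.2f}")
```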


# Step 5. Check the results
## 1) Run model on test data
Let's get some predictions on the test set:

In [26]:
preds = trainer.predict(tok_eval_ds).predictions.astype(float)
# Round to 2 decimal places for readability
preds_rounded = np.round(preds, decimals=2)
preds_rounded

array([[ 0.46],
       [ 0.54],
       [ 0.47],
       [ 0.29],
       [-0.01],
       [ 0.54],
       [ 0.52],
       [-0.03],
       [ 0.21],
       [ 0.98],
       [ 0.19],
       [ 0.34],
       [ 0.76],
       [ 0.76],
       [ 0.73],
       [ 0.42],
       [ 0.28],
       [ 0.02],
       [ 0.56],
       [ 0.35],
       [ 0.44],
       [ 0.17],
       [ 0.09],
       [ 0.24],
       [ 0.42],
       [ 0.  ],
       [-0.02],
       [ 0.  ],
       [-0.03],
       [ 0.69],
       [ 0.32],
       [ 0.  ],
       [ 0.73],
       [ 0.51],
       [ 0.32],
       [ 0.27]])

## 2) Fix out-of-bounds predictions
Look out: some of our predictions are <0 (and in general they can also come out >1)! This once again shows the value of remembering to actually look at your data. Let's fix those out-of-bounds predictions. For now we will do it in a primitive way: values <0 are set to 0, and values >1 are set to 1.

In [31]:
preds = np.clip(preds_rounded, 0, 1)
preds

array([[0.46],
       [0.54],
       [0.47],
       [0.29],
       [0.  ],
       [0.54],
       [0.52],
       [0.  ],
       [0.21],
       [0.98],
       [0.19],
       [0.34],
       [0.76],
       [0.76],
       [0.73],
       [0.42],
       [0.28],
       [0.02],
       [0.56],
       [0.35],
       [0.44],
       [0.17],
       [0.09],
       [0.24],
       [0.42],
       [0.  ],
       [0.  ],
       [0.  ],
       [0.  ],
       [0.69],
       [0.32],
       [0.  ],
       [0.73],
       [0.51],
       [0.32],
       [0.27]])

# Step 6. Save the results as CSV
OK, now we're ready to create our submission file

In [33]:
submission = datasets.Dataset.from_dict({
    'id': tok_eval_ds['id'],
    # flatten (36, 1) to (36,) so each score is a plain number, not a one-element array
    'score': preds.flatten()
})

submission.to_csv('submission.csv', index=False)

Creating CSV from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

859
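
Before uploading, it can be worth checking that the file really has the expected shape. The sketch below is a hypothetical helper (`check_submission` is my name, not part of any library) that uses only the standard library: it verifies the `id,score` header and that every score parses as a number in [0, 1]. The sample row is just an example in the expected format.

```python
import csv
import io

def check_submission(lines):
    """Return the number of data rows; raise ValueError if the file is malformed."""
    reader = csv.DictReader(lines)
    if reader.fieldnames != ["id", "score"]:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    n = 0
    for row in reader:
        score = float(row["score"])  # fails on values such as "[0.46]"
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score out of range for id {row['id']}: {score}")
        n += 1
    return n

sample = io.StringIO("id,score\n4112d61851461f60,0.46\n")
print(check_submission(sample))

# For the real file:
# with open('submission.csv', newline='') as f:
#     print(check_submission(f))
```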