# ICPR 2024 Competition on Claim Span Identification

## Disclaimer
- The dataset may contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety
- The dataset may identify individuals (i.e., one or more natural persons), either directly or indirectly
- The dataset may contain data that might be considered sensitive in any way (e.g., data that reveals race or ethnic origins, sexual orientations, religious beliefs, political opinions, etc.)

## Data Format
The English and Hindi sets of the data have been split into train **(~6k samples)** and validation **(500 samples)** sets. These splits are stored as standard JSON files in the ```data``` directory. You may open the ```.json``` files in any text editor to inspect the data structure.

Note that the usernames have been anonymized by replacing them with unique IDs.

Loading a data file (say, with ```json.load()``` in Python) returns a list of dictionaries, one for each of the data points. Each dictionary has an ***"index"*** key (0, 1, 2, ...) and the following two important keys: ***"text_tokens"*** and ***"claims"***.

- The *"text_tokens"* key contains a list of tokens that, when joined, form the text input.
*Note that the output vectors (described below) for each data point need to be of the same size as the "text_tokens" list*

- The *"claims"* key contains a list of dictionaries, one for each of the disjoint claim-spans present in the corresponding text. An empty list denotes that there are no claim spans in the text.
Each of these dictionaries contains the ***"start"*** and ***"end"*** indices of that particular claim-span in the text.

*Note that the claim span-start index is inclusive, the claim span-end index is exclusive, and the indexing of tokens starts from 0. For example, if a span's start and end are 3 and 7, and the text is "I read that mrna vaccines cause cancer !", then the claim is "mrna vaccines cause cancer" (i.e., consisting of the tokens indexed 3, 4, 5, 6).*
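
The indexing convention can be checked with a few lines of Python. This is a minimal sketch: the helper names `claim_spans_to_mask` and `claim_texts` and the hand-built `record` are illustrative assumptions, not part of the released code.

```python
def claim_spans_to_mask(text_tokens, claims):
    """Return a 0/1 list with one entry per token; 1 marks tokens inside a claim span."""
    mask = [0] * len(text_tokens)
    for claim in claims:
        # "start" is inclusive and "end" is exclusive, exactly like a Python slice.
        for i in range(claim["start"], claim["end"]):
            mask[i] = 1
    return mask


def claim_texts(text_tokens, claims):
    """Join the tokens of each span back into a readable string."""
    return [" ".join(text_tokens[c["start"]:c["end"]]) for c in claims]


# Hypothetical record built by hand from the example sentence above.
record = {
    "index": 0,
    "text_tokens": "I read that mrna vaccines cause cancer !".split(),
    "claims": [{"start": 3, "end": 7}],
}

print(claim_texts(record["text_tokens"], record["claims"]))
# ['mrna vaccines cause cancer']
print(claim_spans_to_mask(record["text_tokens"], record["claims"]))
# [0, 0, 0, 1, 1, 1, 1, 0]
```

Because the end index is exclusive, `text_tokens[start:end]` selects exactly the claim tokens without any off-by-one adjustment.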


## Output Predictions Format
The output predictions file should again be a ```.json``` file, containing a list of lists, one list for each of the data points. Each of the inner lists should be the same size as the corresponding *"text_tokens"* list (**0/1** for each of the tokens). An element marked **1** denotes that the corresponding token is part of a claim-span, and **0** denotes that it is *NOT* part of any claim.

For example, consider two texts --
- *"I  read  that  mrna  vaccines  cause  cancer  !"*
- *"I  will  never  take  a  covid  vaccine  .  .  ."*

Then the output JSON-file should look something like:
```
[
    [0, 0, 0, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
]
```
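
Before submitting, it may help to check a predictions list against the input data. The following is a sketch only: `check_predictions` is a hypothetical helper and the two records are rebuilt by hand from the example texts.

```python
import json


def check_predictions(predictions, data):
    """Raise ValueError if the predictions do not match the expected format."""
    if len(predictions) != len(data):
        raise ValueError(f"expected {len(data)} lists, got {len(predictions)}")
    for i, (pred, item) in enumerate(zip(predictions, data)):
        if len(pred) != len(item["text_tokens"]):
            raise ValueError(f"data point {i}: length mismatch")
        if any(p not in (0, 1) for p in pred):
            raise ValueError(f"data point {i}: values must be 0 or 1")


# Hypothetical data points built from the two example texts.
data = [
    {"text_tokens": "I read that mrna vaccines cause cancer !".split()},
    {"text_tokens": "I will never take a covid vaccine . . .".split()},
]
predictions = [
    [0, 0, 0, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
]

check_predictions(predictions, data)
# json.dump(predictions, file) would write the same content to disk.
serialized = json.dumps(predictions)
print(serialized)
```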

*Note that the number of tokens generated by a model's tokenizer may not be equal to the actual number of text tokens. You may need to convert the predicted vectors so that the final output vectors are the same size as text_tokens.*
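
One possible conversion, shown here as a sketch, keeps the prediction of the first sub-token of every word. It assumes the tokenizer exposes a `word_ids` mapping (as Hugging Face fast tokenizers do); the helper name and the example tokenization are made up for illustration, and other strategies (such as voting over sub-tokens) are equally valid.

```python
def subtokens_to_words(subtoken_preds, word_ids, num_words):
    """Keep the prediction of the first sub-token of each word.

    word_ids[i] is the index of the word that sub-token i came from, or None
    for special tokens such as [CLS] and [SEP].
    """
    word_preds = [0] * num_words  # words that produced no sub-token default to 0
    seen = set()
    for pred, word_id in zip(subtoken_preds, word_ids):
        if word_id is None or word_id in seen:
            continue
        seen.add(word_id)
        word_preds[word_id] = pred
    return word_preds


# Hypothetical tokenization: the second word is split into two sub-tokens.
text_tokens = ["that", "mrna", "vaccines"]
word_ids = [None, 0, 1, 1, 2, None]
subtoken_preds = [0, 0, 1, 1, 1, 0]

word_preds = subtokens_to_words(subtoken_preds, word_ids, len(text_tokens))
print(word_preds)
# [0, 1, 1]
```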


## Evaluation Metrics
We calculate the standard Macro-F1 and Jaccard scores for each individual data point, and then average them over all data points in the validation / test set.
You can run the following to print the scores:

```python  metrics.py  <path_to_input_data_file>  <path_to_output_preds_file>```

*Note: you may need to install "numpy" and "scikit-learn" libraries*.
The input data file should be of the same format as given, and the output preds file should be of the format as described above.
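
To make the per-data-point averaging concrete, here is a dependency-free sketch. It is not the code in `metrics.py`: the function names are illustrative, and the edge-case choice (a score of 0 when a denominator is empty) is an assumption that may differ from the official script.

```python
def jaccard(true, pred):
    """Jaccard index of the positive class (label 1); 0.0 when neither side has a 1."""
    inter = sum(1 for t, p in zip(true, pred) if t == 1 and p == 1)
    union = sum(1 for t, p in zip(true, pred) if t == 1 or p == 1)
    return inter / union if union else 0.0


def macro_f1(true, pred):
    """Unweighted mean of the per-class F1 over the classes seen in true or pred."""
    scores = []
    for cls in sorted(set(true) | set(pred)):
        tp = sum(1 for t, p in zip(true, pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(true, pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(true, pred) if t == cls and p != cls)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def average_scores(all_true, all_pred):
    """Score every data point separately, then average over data points."""
    f1s = [macro_f1(t, p) for t, p in zip(all_true, all_pred)]
    jacs = [jaccard(t, p) for t, p in zip(all_true, all_pred)]
    return sum(f1s) / len(f1s), sum(jacs) / len(jacs)


# Two hypothetical data points: one claim with a missed token, one text without claims.
all_true = [[0, 0, 0, 1, 1, 1, 1, 0], [0] * 10]
all_pred = [[0, 0, 0, 1, 1, 1, 0, 0], [0] * 10]

avg_f1, avg_jaccard = average_scores(all_true, all_pred)
print(round(avg_f1, 4), round(avg_jaccard, 4))
# 0.9365 0.375
```

Under this assumption, a text without any claim span contributes a Jaccard of 0 even when the prediction is entirely correct, which pulls the averaged Jaccard down relative to Macro-F1.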


## Scores on Validation set
As a baseline we used a basic [BERTforTokenClassification](https://huggingface.co/docs/transformers/en/model_doc/bert#transformers.BertForTokenClassification) model which can generate a class prediction (here 0/1) for each of the tokens in the text.

We loaded the model encoder with weights from [bert-base-multilingual-uncased](https://huggingface.co/google-bert/bert-base-multilingual-uncased) and fine-tuned it on the respective train sets for 5 epochs. This model achieved a score of 49.71 Jaccard and 74.09 M-F1 on the English validation set, and 65.64 Jaccard and 79.19 M-F1 on the Hindi validation set.


## More Queries
For any further clarification, please write to [csi.icpr2024@gmail.com](mailto:csi.icpr2024@gmail.com)

In [1]:
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
import json
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader

from transformers import (
    AdamW,
    get_cosine_schedule_with_warmup,
    get_linear_schedule_with_warmup,
    AutoConfig,
    AutoModel,
    AutoModelForSequenceClassification,
    AutoModelForTokenClassification,
    AutoTokenizer,
    BertConfig,
    BertForSequenceClassification,
    BertModel,
    BertPreTrainedModel,
    BertTokenizerFast,
    DebertaConfig,
    DebertaModel,
    DebertaTokenizerFast,
    DebertaV2Config,
    DebertaV2Model,
    DebertaV2TokenizerFast,
    GemmaConfig,
    GemmaModel,
    GemmaTokenizerFast,
    PreTrainedModel,
    XLMRobertaConfig,
    XLMRobertaModel,
    XLMRobertaTokenizerFast,
)

from sklearn.metrics import (
    accuracy_score,
    classification_report,
    f1_score,
    jaccard_score,
    precision_score,
    recall_score,
)

from tqdm import tqdm

# Preprocessing

In [2]:
train_en = pd.read_json('/kaggle/input/icpr-csi/train-en.json')
val_en = pd.read_json('/kaggle/input/icpr-csi/val-en.json')

In [3]:
print(train_en.iloc[9])

index                                                        509
claims         [{'index': 0, 'start': 26, 'end': 33, 'terms':...
text_tokens    [COVID, 19, mortality, Stats, have, been, hamm...
Name: 9, dtype: object


In [4]:
print(train_en.iloc[9]['text_tokens'][26:33])

['imprisonment', 'to', 'prove', 'living', 'in', 'abject', 'fear']


In [5]:
model_checkpoint = "google-bert/bert-large-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, add_prefix_space=True)

In [6]:
def identify_problematic_chars(data, tokenizer):
    problematic = []
    for idx,item in tqdm(data.iterrows(), total = len(data)):
        words = item['text_tokens']
        for word_idx,word in enumerate(words):
            if not tokenizer.tokenize(word):
                problematic.append((idx,word_idx))
    return problematic

In [7]:
train_problematic = identify_problematic_chars(train_en,tokenizer)
val_problematic = identify_problematic_chars(val_en,tokenizer)

100%|██████████| 5999/5999 [00:13<00:00, 428.56it/s]
100%|██████████| 500/500 [00:01<00:00, 433.11it/s]


In [8]:
for x,y in train_problematic:
    train_en.iloc[x,2][y] =  tokenizer.pad_token
for x,y in val_problematic:
    val_en.iloc[x,2][y] =  tokenizer.pad_token

In [9]:
def preprocess(data):
    processed_data = []

    for idx, item in tqdm(data.iterrows(), total=len(data)):
        text_tokens = item['text_tokens']
        claims = item['claims']
        # Create label sequence (O: 0, B: 1, I: 2)
        labels = [0] * len(text_tokens)  # Initialize all as 'O'
        for claim in claims:
            start, end = claim['start'], claim['end']
            # 'B' marks the beginning of a claim (num_labels = 3); for plain
            # Inside/Outside tagging, use the 'I' value (2) here instead.
            labels[start] = 1
            for i in range(start + 1, end):
                labels[i] = 2  # 'I' for inside of claim

        # Tokenize without padding
        encoded = tokenizer(
            text_tokens,
            is_split_into_words=True,
            return_tensors='pt',
            padding=False,
            truncation=False
        )

        input_ids = encoded['input_ids'].squeeze()
        attention_mask = encoded['attention_mask'].squeeze()

        # Align labels with tokenized input
        word_ids = encoded.word_ids()
        aligned_labels = [-100] * len(input_ids)  # -100 is ignored by PyTorch loss functions

        for i, word_id in enumerate(word_ids):
            if word_id is not None:
                aligned_labels[i] = labels[word_id]

        processed_data.append({
            'input_ids': input_ids,
            'attention_mask': attention_mask,
            'labels': torch.tensor(aligned_labels),
            'word_ids': word_ids  # Store word_ids for later use
        })

    return processed_data

class ClaimDataset(Dataset):
    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]
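
The label alignment inside `preprocess` can be illustrated without a tokenizer. This is a sketch with a made-up `word_ids` list; `align_labels` is a hypothetical stand-in for the loop over `encoded.word_ids()`.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def align_labels(word_labels, word_ids):
    """Give every sub-token the label of its word; special tokens get IGNORE_INDEX."""
    return [IGNORE_INDEX if w is None else word_labels[w] for w in word_ids]


word_labels = [0, 1, 2]              # O, B, I for three words
word_ids = [None, 0, 1, 1, 2, None]  # hypothetical: the 'B' word splits into two sub-tokens

aligned = align_labels(word_labels, word_ids)
print(aligned)
# [-100, 0, 1, 1, 2, -100]
```

Note that a word labelled 'B' which splits into several sub-tokens yields several consecutive 'B' labels under this scheme.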

In [10]:
train_processed = preprocess(train_en)
val_processed = preprocess(val_en)

100%|██████████| 5999/5999 [00:04<00:00, 1237.60it/s]
100%|██████████| 500/500 [00:00<00:00, 1280.76it/s]


In [11]:
train_dataset = ClaimDataset(train_processed)
val_dataset = ClaimDataset(val_processed)

In [12]:
train_en.iloc[9,1]

[{'index': 0,
  'start': 26,
  'end': 33,
  'terms': 'imprisonment to prove living in abject fear'},
 {'index': 1, 'start': 41, 'end': 44, 'terms': 'most expensive vaccine'}]

In [13]:
train_dataset[9]

{'input_ids': tensor([  101,  2522, 17258,  2539, 13356, 26319,  2031,  2042, 25756,  2046,
          2256,  4641,  2484,  1013,  1021,  2011,  5796,  2213,  2296,  2154,
          2027,  3189,  2524,  1000,  8866,  1000,  2000, 16114,  2256, 10219,
          2000,  6011,  2542,  1999, 11113, 20614,  3571,  2003,  1996,  2069,
         21082,  4668,  1998,  2069,  1996,  2087,  6450, 17404,  2412, 14917,
          2064,  4298,  3828,  2149,  3531,  5256,  2039,   102]),
 'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1]),
 'labels': tensor([-100,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    1,    2,    2,    2,    2,    2,    2,
            2,    0,    0,    0,    0,    0,    0, 

In [14]:
from torch.nn.utils.rnn import pad_sequence

def collate_fn(batch):
    input_ids = [item['input_ids'] for item in batch]
    attention_masks = [item['attention_mask'] for item in batch]
    labels = [item['labels'] for item in batch]
    word_ids = [item['word_ids'] for item in batch]

    # Pad sequences
    input_ids = pad_sequence(input_ids, batch_first=True, padding_value=tokenizer.pad_token_id)
    attention_masks = pad_sequence(attention_masks, batch_first=True, padding_value=0)
    labels = pad_sequence(labels, batch_first=True, padding_value=-100)

    return {
        'input_ids': input_ids,
        'attention_mask': attention_masks,
        'labels': labels,
        'word_ids': word_ids
    }

In [15]:
print(f"Training samples: {len(train_dataset)}")
print(f"Validation samples: {len(val_dataset)}")

Training samples: 5999
Validation samples: 500


In [16]:
BATCH_SIZE = 16

train_dataloader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_fn)
val_dataloader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_fn)

In [17]:
import torch
from transformers import AdamW, get_linear_schedule_with_warmup
from tqdm import tqdm
import torch.nn as nn
from sklearn.metrics import f1_score
import numpy as np

def calculate_f1_score(preds, labels):
    mask = labels != -100
    preds = preds[mask]
    labels = labels[mask]
    return f1_score(labels, preds, average='macro', zero_division=0)

def calculate_accuracy_score(preds, labels):
    mask = labels != -100
    preds = preds[mask]
    labels = labels[mask]
    return (preds == labels).sum().item() / len(labels) if len(labels) != 0 else 0

def train_model(model, train_dataloader, val_dataloader, device, num_labels = 3, num_epochs=3, lr=2e-5, warmup_steps=0, accumulation_steps=1):
    optimizer = AdamW([
        {'params': model.plm.parameters(), 'lr': lr},
        {'params': model.classifier.parameters(), 'lr': lr * 30}
    ])

    total_steps = len(train_dataloader) * num_epochs // accumulation_steps
    scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=warmup_steps, num_training_steps=total_steps)
    loss_function = nn.CrossEntropyLoss(ignore_index=-100, label_smoothing=0.2)

    train_losses, val_losses, train_f1s, val_f1s = [], [], [], []

    for epoch in range(num_epochs):
        # Training phase
        model.train()
        train_loss, train_f1_scores, train_acc_scores = 0.0, [], []
        print(f"Epoch {epoch+1}/{num_epochs}")
        train_progress_bar = tqdm(train_dataloader, desc=f'Epoch {epoch+1}/{num_epochs} - Training')

        for step, batch in enumerate(train_progress_bar):
            ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)
            logits, probs = model(ids, attention_mask=attention_mask)
            loss = loss_function(logits.view(-1, num_labels), labels.view(-1))
            loss = loss / accumulation_steps  # Normalize loss for gradient accumulation

            loss.backward()

            if (step + 1) % accumulation_steps == 0:
                optimizer.step()
                scheduler.step()
                optimizer.zero_grad()

            train_loss += loss.item() * accumulation_steps  # Scale loss back to original
            _, preds = torch.max(probs, dim=2)
            
            for pred, label in zip(preds, labels):
                train_f1_scores.append(calculate_f1_score(pred.cpu().numpy(), label.cpu().numpy()))
                train_acc_scores.append(calculate_accuracy_score(pred.cpu().numpy(), label.cpu().numpy()))

            train_progress_bar.set_postfix({
                'Training Loss': train_loss / (step + 1),
                'Training F1': np.nanmean(train_f1_scores),
                'Training Acc': np.nanmean(train_acc_scores)
            })

        avg_train_loss = train_loss / len(train_dataloader)
        avg_train_f1 = np.nanmean(train_f1_scores)
        avg_train_acc = np.nanmean(train_acc_scores)
        train_losses.append(avg_train_loss)
        train_f1s.append(avg_train_f1)
        print(f"Training Loss: {avg_train_loss}, F1 Score: {avg_train_f1}, Accuracy: {avg_train_acc}")

        # Validation phase
        model.eval()
        val_loss, val_f1_scores, val_acc_scores = 0.0, [], []
        val_progress_bar = tqdm(val_dataloader, desc=f'Epoch {epoch+1}/{num_epochs} - Validation')

        with torch.no_grad():
            for batch in val_progress_bar:
                ids = batch['input_ids'].to(device)
                attention_mask = batch['attention_mask'].to(device)
                labels = batch['labels'].to(device)
                logits, probs = model(ids, attention_mask=attention_mask)
                loss = loss_function(logits.view(-1, num_labels), labels.view(-1))

                val_loss += loss.item()
                _, preds = torch.max(probs, dim=2)

                for pred, label in zip(preds, labels):
                    val_f1_scores.append(calculate_f1_score(pred.cpu().numpy(), label.cpu().numpy()))
                    val_acc_scores.append(calculate_accuracy_score(pred.cpu().numpy(), label.cpu().numpy()))

                val_progress_bar.set_postfix({
                    'Validation Loss': val_loss / (val_progress_bar.n + 1),
                    'Validation F1': np.nanmean(val_f1_scores),
                    'Validation Acc': np.nanmean(val_acc_scores)
                })

        avg_val_loss = val_loss / len(val_dataloader)
        avg_val_f1 = np.nanmean(val_f1_scores)
        avg_val_acc = np.nanmean(val_acc_scores)
        val_losses.append(avg_val_loss)
        val_f1s.append(avg_val_f1)
        print(f"Validation Loss: {avg_val_loss}, F1 Score: {avg_val_f1}, Accuracy: {avg_val_acc}")

    return model, train_losses, val_losses, train_f1s, val_f1s
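
The interaction between gradient accumulation and the scheduler length can be checked with simple counting. A sketch: `count_updates` is a hypothetical helper that mirrors the `(step + 1) % accumulation_steps == 0` test, with `step` restarting at 0 every epoch. The example values (375 batches, accumulation of 2, 5 epochs) are the ones used in the run later in this notebook.

```python
def count_updates(num_batches, accumulation_steps, num_epochs):
    """Return (optimizer updates performed, leftover batches per epoch, scheduler steps planned)."""
    per_epoch = sum(
        1 for step in range(num_batches) if (step + 1) % accumulation_steps == 0
    )
    leftover = num_batches % accumulation_steps
    planned = num_batches * num_epochs // accumulation_steps
    return per_epoch * num_epochs, leftover, planned


print(count_updates(375, 2, 5))
# (935, 1, 937)
```

With an odd number of batches, the last batch of each epoch never triggers `optimizer.step()`, so its gradient stays in the buffers and is folded into the first update of the next epoch, and the linear schedule stops slightly short of its planned end.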

In [18]:
class BaseModel(PreTrainedModel):
    def __init__(self, config, num_labels):
        super(BaseModel, self).__init__(config)
        self.plm = AutoModel.from_pretrained(model_checkpoint, output_hidden_states=True)
        self.num_labels = num_labels
        self.high_dropout = torch.nn.Dropout(0.3)
        self.classifier = nn.Linear(config.hidden_size,self.num_labels)


        self.softmax = nn.Softmax(dim=-1)

    def forward(self, ids, attention_mask):
        outputs = self.plm(ids, attention_mask=attention_mask)
        out = outputs.last_hidden_state

        logits = torch.mean(torch.stack([
            self.classifier(self.high_dropout(out))
            for _ in range(8)
        ], dim=0), dim=0)


        probs = self.softmax(logits)

        return logits, probs

In [19]:
class BaseModelLSTM(PreTrainedModel):
    def __init__(self, config):
        super(BaseModelLSTM, self).__init__(config)
        self.plm = AutoModel.from_pretrained(model_checkpoint, output_hidden_states=True)
        self.high_dropout = torch.nn.Dropout(0.3)

        # Define LSTM layers
        self.lstm = nn.LSTM(input_size=config.hidden_size, hidden_size=512, num_layers=2, batch_first=True, bidirectional=True)

        # Define a linear layer to map LSTM output to the number of classes
        self.classifier = nn.Linear(512 * 2, 2)  # 512 * 2 because of bidirectional LSTM

        self.softmax = nn.Softmax(dim=-1)

    def forward(self, ids, attention_mask):
        outputs = self.plm(ids, attention_mask=attention_mask)
        out = outputs.last_hidden_state

        # Apply high dropout and LSTM layers
        out = self.high_dropout(out)
        lstm_out, _ = self.lstm(out)

        # Apply the classifier to each token
        logits = self.classifier(lstm_out)

        # Apply softmax to get probabilities for each token
        probs = self.softmax(logits)

        return logits, probs

In [20]:
class BaseModelCNN(BertPreTrainedModel):
    def __init__(self, conf):
        super(BaseModelCNN, self).__init__(conf)
        self.plm = AutoModel.from_pretrained(model_checkpoint, output_hidden_states=True)
        self.high_dropout = torch.nn.Dropout(0.3)

        self.conv1 = nn.Conv1d(in_channels=conf.hidden_size, out_channels=512, kernel_size=3, padding=1)  # hidden_size is 1024 for bert-large, not 768
        self.conv2 = nn.Conv1d(in_channels=512, out_channels=256, kernel_size=3, padding=1)
        self.conv3 = nn.Conv1d(in_channels=256, out_channels=3, kernel_size=3, padding=1)

        self.relu = nn.ReLU()
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, ids, attention_mask):
        outputs = self.plm(ids, attention_mask=attention_mask)
        out = outputs.last_hidden_state # Shape: (batch_size, sequence_length, hidden_size)

        out = out.permute(0, 2, 1)  # Change to (batch_size, hidden_size, sequence_length)

        # Apply dropout and convolutional layers
        out = torch.mean(torch.stack([
            self.conv3(self.relu(self.conv2(self.relu(self.conv1(self.high_dropout(out))))))
            for _ in range(8)
        ], dim=0), dim=0) # Ensemble averaging

        logits = out.permute(0, 2, 1)  # Change back to (batch_size, sequence_length, 3)

        probs = self.softmax(logits)

        return logits, probs  # Same (logits, probs) pair that train_model unpacks

In [21]:
import logging

# Get a list of all loggers
loggers = [logging.getLogger(name) for name in logging.root.manager.loggerDict]

# Set the logging level to ERROR for any logger related to "transformers"
for logger in loggers:
    if "transformers" in logger.name.lower():
        logger.setLevel(logging.ERROR)

In [22]:
# Set up the training
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

EPOCHS = 5
LEARNING_RATE = 2e-5
WARMUP_STEPS = 1
ACCUMULATION_STEPS = 2

configuration = AutoConfig.from_pretrained(model_checkpoint, use_safetensors=True)
model = BaseModel(configuration, 3).to(device)

Using device: cuda


In [23]:
trained_model, train_losses, val_losses, train_f1s, val_f1s = train_model(
    model,
    train_dataloader,
    val_dataloader,
    device,
    num_labels = 3,
    num_epochs=EPOCHS,
    lr=LEARNING_RATE,
    warmup_steps=WARMUP_STEPS,
    accumulation_steps=ACCUMULATION_STEPS
)

Epoch 1/5


Epoch 1/5 - Training: 100%|██████████| 375/375 [03:07<00:00,  2.00it/s, Training Loss=0.76, Training F1=0.489, Training Acc=0.771] 


Training Loss: 0.7601133948961893, F1 Score: 0.4885916722443396, Accuracy: 0.7714730761927527


Epoch 1/5 - Validation: 100%|██████████| 32/32 [00:05<00:00,  5.61it/s, Validation Loss=0.732, Validation F1=0.516, Validation Acc=0.788]


Validation Loss: 0.7320959828794003, F1 Score: 0.515835384125386, Accuracy: 0.7877227461387426
Epoch 2/5


Epoch 2/5 - Training: 100%|██████████| 375/375 [03:08<00:00,  1.99it/s, Training Loss=0.709, Training F1=0.554, Training Acc=0.818]


Training Loss: 0.7086982873280843, F1 Score: 0.5542257979565599, Accuracy: 0.8177673480861705


Epoch 2/5 - Validation: 100%|██████████| 32/32 [00:05<00:00,  5.61it/s, Validation Loss=0.72, Validation F1=0.568, Validation Acc=0.8]   


Validation Loss: 0.7199755664914846, F1 Score: 0.5682780302667482, Accuracy: 0.7999620497281118
Epoch 3/5


Epoch 3/5 - Training: 100%|██████████| 375/375 [03:08<00:00,  1.99it/s, Training Loss=0.688, Training F1=0.61, Training Acc=0.836] 


Training Loss: 0.688026177406311, F1 Score: 0.6101793001107689, Accuracy: 0.8364205353415347


Epoch 3/5 - Validation: 100%|██████████| 32/32 [00:05<00:00,  5.61it/s, Validation Loss=0.721, Validation F1=0.586, Validation Acc=0.799]


Validation Loss: 0.7207291703671217, F1 Score: 0.5860333654973973, Accuracy: 0.7986102581557942
Epoch 4/5


Epoch 4/5 - Training: 100%|██████████| 375/375 [03:08<00:00,  1.99it/s, Training Loss=0.671, Training F1=0.641, Training Acc=0.852]


Training Loss: 0.6708872640927632, F1 Score: 0.6410127681618185, Accuracy: 0.8520113252734108


Epoch 4/5 - Validation: 100%|██████████| 32/32 [00:05<00:00,  5.60it/s, Validation Loss=0.719, Validation F1=0.601, Validation Acc=0.805]


Validation Loss: 0.7189784348011017, F1 Score: 0.6005905117163747, Accuracy: 0.8053716724258224
Epoch 5/5


Epoch 5/5 - Training: 100%|██████████| 375/375 [03:08<00:00,  1.99it/s, Training Loss=0.657, Training F1=0.66, Training Acc=0.866] 


Training Loss: 0.656572429339091, F1 Score: 0.6599623933137496, Accuracy: 0.8655327202637692


Epoch 5/5 - Validation: 100%|██████████| 32/32 [00:05<00:00,  5.61it/s, Validation Loss=0.721, Validation F1=0.606, Validation Acc=0.807]

Validation Loss: 0.7210478130728006, F1 Score: 0.605685851467926, Accuracy: 0.8065693179059776





# Post-Processing

In [24]:
def align_predictions(predictions, word_ids):
    aligned_predictions = []
    for pred, word_id in zip(predictions, word_ids):
        word_id_to_preds = {}

        # Collect predictions for each word_id
        for p, w_id in zip(pred, word_id):
            if w_id is not None:
                if w_id not in word_id_to_preds:
                    word_id_to_preds[w_id] = []
                if p == 1 or p == 2:
                    word_id_to_preds[w_id].append(1)
                else:
                    word_id_to_preds[w_id].append(0)

        # Average predictions for each word_id and round
        aligned_pred = []
        if word_id_to_preds:  # Check if word_id_to_preds is not empty
            max_word_id = max(word_id_to_preds.keys())
            for i in range(max_word_id + 1):
                if i in word_id_to_preds:
                    avg_pred = np.mean(word_id_to_preds[i])
                    aligned_pred.append(round(avg_pred))
                else:
                    aligned_pred.append(0)  # Default to 'O' for any missing word_ids
        else:
            aligned_pred = []  # Default to 'O' for all if no valid word_ids ( for handling examples with empty spans)

        aligned_predictions.append(aligned_pred)
    return aligned_predictions


# Function to convert logits to predictions
def logits_to_predictions(logits, word_ids_batch):
    predictions = torch.argmax(logits, dim=2).cpu().numpy()
    aligned_predictions = align_predictions(predictions, word_ids_batch)
    return aligned_predictions

# Function to generate the output JSON file
def generate_output_json(predictions, output_path):
    with open(output_path, 'w', encoding='utf-8') as file:
        json.dump(predictions, file, indent=2)

# Postprocess predictions and save to file
def postprocess_and_save_predictions(model, dataloader, tokenizer, device, output_path):
    model.eval()
    predictions = []

    with torch.no_grad():
        for batch in tqdm(dataloader, desc="Postprocessing"):
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            word_ids_batch = batch['word_ids']

            logits, probs = model(input_ids, attention_mask)

            batch_predictions = logits_to_predictions(logits, word_ids_batch)
            predictions.extend(batch_predictions)

    generate_output_json(predictions, output_path)

# Function to calculate Macro-F1 and Jaccard scores
def evaluate(predictions, ground_truth):
    macro_f1_scores = []
    jaccard_scores = []

    for idx ,(pred, true) in enumerate(zip(predictions, ground_truth)):
        pred = np.array(pred)
        true = np.array(true)
#         print(f'Index: {idx}')
#         print(f'Pred type: {type(pred)}, True type: {type(true)}')
#         print(f'Pred: {pred}, True: {true}')

        # Ensure the lengths match
        if len(pred) != len(true):
            print(idx)
        else:
            macro_f1 = f1_score(true, pred, average='macro',zero_division = 0)
            jaccard = jaccard_score(true, pred, zero_division = 0)

            macro_f1_scores.append(macro_f1)
            jaccard_scores.append(jaccard)

    avg_macro_f1 = np.nanmean(macro_f1_scores)
    avg_jaccard = np.nanmean(jaccard_scores)

    return avg_macro_f1, avg_jaccard

# Function to extract ground truth labels from the validation data
def extract_ground_truth(data):
    ground_truth = []
    for item in data:
        text_tokens = item['text_tokens']
        claims = item['claims']
        labels = [0] * len(text_tokens)  # Initialize all as 'O'
        for claim in claims:
            start, end = claim['start'], claim['end']
            labels[start] = 1
            for i in range(start + 1, end):
                labels[i] = 1
        ground_truth.append(labels)
    return ground_truth

# Evaluate predictions against ground truth
def evaluate_predictions(predictions_path, ground_truth):
    with open(predictions_path, 'r', encoding='utf-8') as file:
        predictions = json.load(file)

    avg_macro_f1, avg_jaccard = evaluate(predictions, ground_truth)
    print(f"Macro-F1 Score: {avg_macro_f1:.4f}")
    print(f"Jaccard Score: {avg_jaccard:.4f}")
    return avg_macro_f1,avg_jaccard
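
The voting step in `align_predictions` has one subtle case worth seeing in isolation. This sketch reproduces the logic with plain Python arithmetic instead of `np.mean`; `vote` is a hypothetical helper. Python's built-in `round` rounds halves to the nearest even integer, and NumPy is expected to behave the same way for a mean of exactly 0.5.

```python
def vote(subtoken_labels):
    """Collapse B (1) and I (2) to 1, then round the mean over a word's sub-tokens."""
    binary = [1 if label in (1, 2) else 0 for label in subtoken_labels]
    return round(sum(binary) / len(binary))


print(vote([1, 2]))     # 1: both sub-tokens are inside a claim
print(vote([0, 2, 2]))  # 1: mean 0.667 rounds up
print(vote([1, 0]))     # 0: a tie (mean 0.5) rounds to the even value 0
```

A word split into an even number of sub-tokens with a tied vote is therefore marked as outside any claim.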

In [25]:
# req_idxs = []
# for idx,item in val_en.iterrows():
#     if item['claims']:
#         req_idxs.append(idx)

In [26]:
ground_truth = extract_ground_truth(val_en.to_dict('records'))

In [27]:
output_predictions_path = 'output_predictions.json'

postprocess_and_save_predictions(model, val_dataloader, tokenizer, device, output_predictions_path)

Postprocessing: 100%|██████████| 32/32 [00:05<00:00,  6.25it/s]


In [28]:
avg_macro_f1, avg_jaccard = evaluate_predictions(output_predictions_path, ground_truth)

Macro-F1 Score: 0.7107
Jaccard Score: 0.4609


# PLM + CRF Head

In [29]:
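# CRF comes from the pytorch-crf package: `from torchcrf import CRF` must run
# before PLMCRF is instantiated.
class PLMCRF(PreTrainedModel):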
    def __init__(self, config, num_labels=3):
        super(PLMCRF, self).__init__(config)
        self.config = config
        self.num_labels = num_labels
        self.plm = AutoModel.from_pretrained(model_checkpoint, add_pooling_layer=False)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, num_labels)
        self.crf = CRF(num_labels, batch_first=True)
        self.init_weights()

    def forward(self, input_ids, attention_mask=None, labels=None):
        outputs = self.plm(input_ids, attention_mask=attention_mask)
        sequence_output = self.dropout(outputs[0])
        emissions = self.classifier(sequence_output)

        if labels is not None:
            # Create a mask for non-padded positions (using attention_mask)
            crf_mask = attention_mask.byte()

            # Create a mask for non-ignored positions (where labels != -100)
            active_loss = labels != -100

            # Replace -100 with a valid label (0) just for the CRF
            labels_for_crf = labels.clone()
            labels_for_crf[labels == -100] = 0


            loss = -self.crf(emissions, labels_for_crf, mask=crf_mask)


            loss = loss * active_loss.float().sum(dim=1) / crf_mask.float().sum(dim=1)


            loss = loss.mean()

            return loss
        else:

            predictions = self.crf.decode(emissions, mask=attention_mask.byte())

            padded_predictions = []
            for pred, mask in zip(predictions, attention_mask):
                seq_len = mask.sum().item()
                padded_pred = pred[:seq_len] + [-100] * (mask.size(0) - seq_len)
                padded_predictions.append(padded_pred)
            return padded_predictions
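
The `crf.decode` call searches for the highest-scoring tag sequence rather than taking a per-token argmax. The sketch below shows the underlying Viterbi idea in plain Python. It is not the pytorch-crf implementation: it omits start/end transition scores, masking, and batching, and the scores in the example are made up.

```python
def viterbi_decode(emissions, transitions):
    """emissions[t][j]: score of tag j at position t.
    transitions[i][j]: score of moving from tag i to tag j."""
    num_tags = len(emissions[0])
    score = list(emissions[0])
    backpointers = []
    for emission in emissions[1:]:
        new_score, pointers = [], []
        for j in range(num_tags):
            # Best previous tag for landing on tag j at this position.
            best_i = max(range(num_tags), key=lambda i: score[i] + transitions[i][j])
            pointers.append(best_i)
            new_score.append(score[best_i] + transitions[best_i][j] + emission[j])
        score = new_score
        backpointers.append(pointers)
    # Follow the back-pointers from the best final tag.
    path = [max(range(num_tags), key=lambda j: score[j])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Tags: 0 = O, 1 = B, 2 = I. Hypothetical scores; O -> I is heavily penalised.
transitions = [[0.0, 0.0, -10.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
emissions = [[2.0, 0.0, 0.0], [0.0, 1.0, 1.5], [0.0, 0.0, 2.0]]

greedy = [max(range(3), key=lambda j: e[j]) for e in emissions]
decoded = viterbi_decode(emissions, transitions)
print(greedy)   # [0, 2, 2]
print(decoded)  # [0, 1, 2]
```

In this toy case the per-token argmax yields O, I, I, whereas the transition penalty steers the decoder to O, B, I.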

In [30]:
import torch
from torch.utils.data import DataLoader
from transformers import AdamW, get_scheduler
from tqdm.auto import tqdm
from sklearn.metrics import f1_score
import numpy as np

def train_plmcrf_model(model, train_dataloader: DataLoader, val_dataloader: DataLoader, 
                       device: torch.device, num_epochs: int = 5, lr: float = 3e-5, 
                       warmup_steps: int = 0, accumulation_steps: int = 1, log_interval: int = 50):
    
    model.to(device)

    # Initialize optimizer
    optimizer = AdamW(model.parameters(), lr=lr)

    # Scheduler for learning rate decay
    num_training_steps = num_epochs * len(train_dataloader)
    lr_scheduler = get_scheduler(
        "linear", optimizer=optimizer, num_warmup_steps=warmup_steps, num_training_steps=num_training_steps
    )

    train_losses = []
    val_losses = []
    train_f1s = []
    val_f1s = []

    progress_bar = tqdm(range(num_training_steps))
    for epoch in range(num_epochs):
        model.train()
        total_train_loss = 0
        true_labels = []
        pred_labels = []
        
        for step, batch in enumerate(train_dataloader):
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)

            outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs
            loss = loss / accumulation_steps
            loss.backward()

            total_train_loss += loss.item()

            if (step + 1) % accumulation_steps == 0:
                optimizer.step()
                lr_scheduler.step()
                optimizer.zero_grad()
                progress_bar.update(1)
            
            # Accumulate predictions and labels for F1 score calculation
            with torch.no_grad():
                predictions = model(input_ids, attention_mask=attention_mask)
                true_labels.extend(labels.cpu().numpy().flatten())
                pred_labels.extend(np.concatenate(predictions).flatten())
            
            if step % log_interval == 0 and step > 0:
                avg_loss = total_train_loss / log_interval
                train_losses.append(avg_loss)
                if true_labels and pred_labels:
                    f1 = f1_score(true_labels, pred_labels, average='weighted')
                else:
                    f1 = 0.0
                train_f1s.append(f1)
                print(f"Epoch [{epoch+1}/{num_epochs}], Step [{step}/{len(train_dataloader)}], "
                      f"Loss: {avg_loss:.4f}, Train F1: {f1:.4f}")
                total_train_loss = 0
                true_labels = []
                pred_labels = []
        
        # Validation phase
        model.eval()
        total_val_loss = 0
        val_true_labels = []
        val_pred_labels = []
        
        with torch.no_grad():
            for batch in val_dataloader:
                input_ids = batch['input_ids'].to(device)
                attention_mask = batch['attention_mask'].to(device)
                labels = batch['labels'].to(device)

                loss = model(input_ids, attention_mask=attention_mask, labels=labels)
                total_val_loss += loss.item()

                predictions = model(input_ids, attention_mask=attention_mask)
                val_true_labels.extend(labels.cpu().numpy().flatten())
                val_pred_labels.extend(np.concatenate(predictions).flatten())
        
        avg_val_loss = total_val_loss / len(val_dataloader)
        val_losses.append(avg_val_loss)
        if val_true_labels and val_pred_labels:
            val_f1 = f1_score(val_true_labels, val_pred_labels, average='weighted')
        else:
            val_f1 = 0.0
        val_f1s.append(val_f1)
        print(f"Epoch [{epoch+1}/{num_epochs}] Validation Loss: {avg_val_loss:.4f}, Validation F1: {val_f1:.4f}")

    print("Training complete.")
    
    return train_losses, val_losses, train_f1s, val_f1s
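
With gradient accumulation, the optimizer and the scheduler advance only when `(step + 1) % accumulation_steps == 0`, so the number of updates is smaller than the number of batches. The sketch below replays just that step condition to count the updates; the helper name `count_optimizer_updates` is hypothetical and not part of the notebook, and the numbers (375 batches per epoch, 5 epochs, 2 accumulation steps) are taken from the run logged further down.

```python
def count_optimizer_updates(num_batches: int, num_epochs: int, accumulation_steps: int) -> int:
    """Count how often the `(step + 1) % accumulation_steps == 0` branch fires.

    Hypothetical helper: it replays only the step counter of the training
    loop, not the model or the optimizer.
    """
    updates = 0
    for _ in range(num_epochs):
        for step in range(num_batches):
            if (step + 1) % accumulation_steps == 0:
                updates += 1
    return updates


# Values from the logged run: 375 batches per epoch, 5 epochs.
planned = 5 * 375                              # total handed to the scheduler
actual = count_optimizer_updates(375, 5, 2)    # updates with 2 accumulation steps
print(planned, actual)                         # 1875 935
```

Under these assumptions the linear schedule is stepped 935 times out of a planned 1875, so the learning rate would finish at roughly half its peak instead of decaying to zero. Passing the update count as `num_training_steps` is one possible way to align the two.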

In [31]:
!pip install pytorch-crf

Collecting pytorch-crf
  Downloading pytorch_crf-0.7.2-py3-none-any.whl.metadata (2.4 kB)
Downloading pytorch_crf-0.7.2-py3-none-any.whl (9.5 kB)
Installing collected packages: pytorch-crf
Successfully installed pytorch-crf-0.7.2


In [32]:
from torchcrf import CRF

In [34]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

EPOCHS = 5
LEARNING_RATE = 3e-5
WARMUP_STEPS = 1
ACCUMULATION_STEPS = 2

model = PLMCRF(AutoConfig.from_pretrained(model_checkpoint)).to(device)

# Train model
train_losses, val_losses, train_f1s, val_f1s = train_plmcrf_model(
    model,
    train_dataloader,
    val_dataloader,
    device,
    num_epochs=EPOCHS,
    lr=LEARNING_RATE,
    warmup_steps=WARMUP_STEPS,
    accumulation_steps=ACCUMULATION_STEPS
)

Using device: cuda


Epoch [1/5], Step [50/375], Loss: 288.0034, Train F1: 0.7862
Epoch [1/5], Step [100/375], Loss: 227.9701, Train F1: 0.8234
Epoch [1/5], Step [150/375], Loss: 215.7917, Train F1: 0.8447
Epoch [1/5], Step [200/375], Loss: 206.7438, Train F1: 0.8461
Epoch [1/5], Step [250/375], Loss: 204.3820, Train F1: 0.8505
Epoch [1/5], Step [300/375], Loss: 184.2770, Train F1: 0.8636
Epoch [1/5], Step [350/375], Loss: 195.8674, Train F1: 0.8542
Epoch [1/5] Validation Loss: 372.1465, Validation F1: 0.8609
Epoch [2/5], Step [50/375], Loss: 169.1743, Train F1: 0.8815
Epoch [2/5], Step [100/375], Loss: 173.6890, Train F1: 0.8707
Epoch [2/5], Step [150/375], Loss: 165.5052, Train F1: 0.8806
Epoch [2/5], Step [200/375], Loss: 168.1667, Train F1: 0.8760
Epoch [2/5], Step [250/375], Loss: 159.7442, Train F1: 0.8817
Epoch [2/5], Step [300/375], Loss: 157.7958, Train F1: 0.8833
Epoch [2/5], Step [350/375], Loss: 168.4650, Train F1: 0.8718
Epoch [2/5] Validation Loss: 373.4070, Validation F1: 0.8656
Epoch [3/5],

## Postprocessing for the Model with a CRF Head

In [56]:
def align_predictions(predictions, word_ids):
    """Collapse sub-word predictions into one binary label per word."""
    aligned_predictions = []
    for pred, word_id in zip(predictions, word_ids):
        word_id_to_preds = {}

        # Collect binary votes for each word_id; positions without a word are skipped
        for p, w_id in zip(pred, word_id):
            if w_id is not None:
                # Labels 1 and 2 both count as claim (1); anything else is 'O' (0)
                word_id_to_preds.setdefault(w_id, []).append(1 if p in (1, 2) else 0)

        # Average the votes for each word_id and round to 0 or 1
        aligned_pred = []
        if word_id_to_preds:
            max_word_id = max(word_id_to_preds.keys())
            for i in range(max_word_id + 1):
                if i in word_id_to_preds:
                    avg_pred = np.mean(word_id_to_preds[i])
                    # int() keeps the output JSON-friendly regardless of NumPy version
                    aligned_pred.append(int(round(avg_pred)))
                else:
                    aligned_pred.append(0)  # Default to 'O' for any missing word_ids
        # With no valid word_ids, aligned_pred stays an empty list

        aligned_predictions.append(aligned_pred)
    return aligned_predictions

# Function to generate the output JSON file
def generate_output_json(predictions, output_path):
    with open(output_path, 'w', encoding='utf-8') as file:
        json.dump(predictions, file, indent=2)

# Postprocess predictions and save to file
# (the `tokenizer` argument is not used in the body)
def postprocess_and_save_predictions(model, dataloader, tokenizer, device, output_path):
    model.eval()
    predictions = []

    with torch.no_grad():
        for batch in tqdm(dataloader, desc="Postprocessing"):
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            word_ids_batch = batch['word_ids']
            # Called without labels, the model returns predictions instead of a loss
            batch_predictions = model(input_ids=input_ids, attention_mask=attention_mask)
            aligned_predictions = align_predictions(batch_predictions, word_ids_batch)
            predictions.extend(aligned_predictions)

    generate_output_json(predictions, output_path)

# Function to calculate Macro-F1 and Jaccard scores, averaged over samples
def evaluate(predictions, ground_truth):
    macro_f1_scores = []
    jaccard_scores = []

    for idx, (pred, true) in enumerate(zip(predictions, ground_truth)):
        pred = np.array(pred)
        true = np.array(true)

        # Samples whose lengths do not match are reported and left out of the averages
        if len(pred) != len(true):
            print(f"Skipping sample {idx}: length mismatch (pred={len(pred)}, true={len(true)})")
        else:
            macro_f1 = f1_score(true, pred, average='macro', zero_division=0)
            jaccard = jaccard_score(true, pred, zero_division=0)

            macro_f1_scores.append(macro_f1)
            jaccard_scores.append(jaccard)

    avg_macro_f1 = np.nanmean(macro_f1_scores)
    avg_jaccard = np.nanmean(jaccard_scores)

    return avg_macro_f1, avg_jaccard

# Function to extract ground truth labels from the validation data
def extract_ground_truth(data):
    ground_truth = []
    for item in data:
        text_tokens = item['text_tokens']
        claims = item['claims']
        labels = [0] * len(text_tokens)  # Initialize all as 'O'
        for claim in claims:
            # 'start' is inclusive and 'end' is exclusive
            start, end = claim['start'], claim['end']
            labels[start] = 1
            for i in range(start + 1, end):
                labels[i] = 1
        ground_truth.append(labels)
    return ground_truth

# Evaluate predictions against ground truth
def evaluate_predictions(predictions_path, ground_truth):
    with open(predictions_path, 'r', encoding='utf-8') as file:
        predictions = json.load(file)

    avg_macro_f1, avg_jaccard = evaluate(predictions, ground_truth)
    print(f"Macro-F1 Score: {avg_macro_f1:.4f}")
    print(f"Jaccard Score: {avg_jaccard:.4f}")
    return avg_macro_f1,avg_jaccard
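
To make the sub-word voting in `align_predictions` concrete, here is a minimal, dependency-free sketch for a single sample. The function name `vote_per_word` and the toy inputs are hypothetical; the sketch follows the same collect, average, and round steps but uses plain Python arithmetic instead of NumPy.

```python
def vote_per_word(token_preds, word_ids):
    """Vote binary sub-word predictions into one label per word.

    Hypothetical stand-in for one sample; labels 1 and 2 count as claim.
    """
    votes = {}
    for p, w_id in zip(token_preds, word_ids):
        if w_id is not None:                      # skip positions without a word
            votes.setdefault(w_id, []).append(1 if p in (1, 2) else 0)
    if not votes:
        return []
    labels = []
    for i in range(max(votes) + 1):
        word_votes = votes.get(i)
        # Words with no votes default to 0; round() sends a 0.5 tie to 0
        labels.append(round(sum(word_votes) / len(word_votes)) if word_votes else 0)
    return labels


# Word 0 has one sub-word, word 1 has two (a tie), word 2 has three.
token_preds = [0, 1, 2, 0, 2, 2, 0, 0]
word_ids = [None, 0, 1, 1, 2, 2, 2, None]
print(vote_per_word(token_preds, word_ids))       # [1, 0, 1]
```

Note the tie on word 1: Python's built-in `round` rounds halves to the nearest even integer, so a word split into two sub-words with disagreeing tags ends up labelled as non-claim.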

In [57]:
output_predictions_path = 'output_predictions_CRFHead.json'

postprocess_and_save_predictions(model, val_dataloader, tokenizer, device, output_predictions_path)

In [58]:
avg_macro_f1, avg_jaccard = evaluate_predictions(output_predictions_path, ground_truth)

Macro-F1 Score: 0.7137
Jaccard Score: 0.4767
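
Both scores are computed per sample and then averaged. As a rough illustration of how they differ, the sketch below builds the label vector for the example sentence from the Data Format section and scores a made-up prediction against it. The helpers `spans_to_labels` and `binary_scores` are hypothetical, hand-rolled approximations of the scikit-learn calls used in `evaluate` for 0/1 labels, not the official scorer.

```python
def spans_to_labels(num_tokens, claims):
    """Mark tokens inside each [start, end) claim span with 1, others with 0."""
    labels = [0] * num_tokens
    for claim in claims:
        for i in range(claim["start"], claim["end"]):
            labels[i] = 1
    return labels


def binary_scores(true, pred):
    """Return (macro_f1, jaccard) for one sample of 0/1 labels.

    Hand-rolled approximation; undefined ratios are treated as 0.
    """
    def f1_for(label):
        tp = sum(t == label and p == label for t, p in zip(true, pred))
        fp = sum(t != label and p == label for t, p in zip(true, pred))
        fn = sum(t == label and p != label for t, p in zip(true, pred))
        denom = 2 * tp + fp + fn
        return 2 * tp / denom if denom else 0.0

    present = sorted(set(true) | set(pred))       # classes seen in this sample
    if not present:
        return 0.0, 0.0
    macro_f1 = sum(f1_for(c) for c in present) / len(present)

    # Jaccard looks only at the claim class: overlap divided by union
    tp = sum(t == 1 and p == 1 for t, p in zip(true, pred))
    union = sum(t == 1 or p == 1 for t, p in zip(true, pred))
    jaccard = tp / union if union else 0.0
    return macro_f1, jaccard


tokens = "I read that mrna vaccines cause cancer !".split()
true = spans_to_labels(len(tokens), [{"start": 3, "end": 7}])
pred = [0, 0, 1, 1, 1, 0, 0, 0]                   # made-up prediction
macro_f1, jaccard = binary_scores(true, pred)
print(true)                                       # [0, 0, 0, 1, 1, 1, 1, 0]
print(round(macro_f1, 4), round(jaccard, 4))      # 0.619 0.4
```

In this toy case the macro-F1 also credits correctly labelled non-claim tokens, while the Jaccard score only measures overlap on claim tokens. That may be part of why the Jaccard score reported above is lower than the Macro-F1.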
