# **Import data**

Preprocess CLEF TAR datasets for model input. 

Implement a transformer-based or simpler ML classification model (e.g., logistic regression, BERT). 

Train the model on CLEF TAR datasets. 

Evaluate its performance (accuracy, F1-score). 
📌 Deliverable: A trained citation screening model with evaluation metrics. 

## **Model training**

## **Model evaluation**

## **Summary**

## **Data preprocessing**

The process involved:

1.  **Extracting PIDs:** Unique PubMed IDs (PIDs) were collected from the training and test qrels files of the CLEF TAR 2019 Task 2 dataset. These qrels define the topic-article mappings.
2.  **Fetching Article Data:** For these unique PIDs, article titles and abstracts were fetched from NCBI PubMed using the E-utilities API. This was done in batches (chunks of 300 PIDs) and parallelized using a thread pool executor for efficiency.
3.  **Parsing & Storing:** The XML responses from PubMed were parsed to extract the PID, title, and abstract for each article. This data was initially saved into intermediate chunk CSV files.
4.  **Consolidation & Merging:** (Implicitly, the chunk CSVs were combined into a single 'articles' DataFrame). This consolidated article data (PID, title, abstract) was then merged with the original training and test qrels DataFrames based on 'PID'. This step incorporates the topic information (which likely corresponds to `title_id`) with the fetched textual content.
5.  **Final Output:** The merged data resulted in `train.csv` and `test.csv` files, each containing `PID`, `title`, `abstract`, and the associated `title_id` (from the qrels).

In [9]:
df_train.head()

Unnamed: 0,topic_id,PID,relevance,title,abstract
0,CD007431,7072537,0,Lumbar spondylolisthesis. Clinical syndrome an...,"The paper gives a survey, based on literature ..."
1,CD007431,8748845,0,The C-reactive protein for detection of early ...,The tendency for short hospitalization after l...
2,CD007431,3819738,0,Pain in sciatica depresses lower limb nocicept...,The inhibitory effects of acute pain produced ...
3,CD007431,7941692,0,[Satisfaction following automated percutaneous...,182 patients assessed their condition after au...
4,CD007431,16261104,0,Adjacent segment degeneration at T1-T2 present...,A case report of a T1-T2 herniated disc adjace...


### **Preprocessed data analysis**

#### **Amount of unique topics - train**

In [99]:
uq_topics = df_train['topic_id'].unique()

len(uq_topics)

99

#### **Amount of unique topics - test**

In [100]:
uq_topics_test = df_test['topic_id'].unique()

len(uq_topics_test)

28

#### **Topics from train that are also in test**

In [12]:
[x for x in uq_topics if x in uq_topics_test]

['CD011686', 'CD011571', 'CD012164']

#### **Topics with most positive 'relevance'**

In [102]:
df_train.groupby('topic_id').sum('relevance').sort_values(by='relevance', ascending=False)

Unnamed: 0_level_0,PID,relevance
topic_id,Unnamed: 1_level_1,Unnamed: 2_level_1
CD011975,67435351855,414
CD012599,66093127885,402
CD010213,158206893401,402
CD009925,43777216611,314
CD011984,67413914183,307
...,...,...
CD005253,12483082898,1
CD010386,9445416789,1
CD011549,78868246171,1
CD012083,558537196,0


#### **Top topics**

In [108]:
top_topics = df_train.groupby('topic_id').sum('relevance')[
                df_train.groupby('topic_id').sum('relevance')['relevance'] >= 70
                ].sort_values(by='relevance', ascending=False).index
top_topics

Index(['CD011975', 'CD012599', 'CD010213', 'CD009925', 'CD011984', 'CD012165',
       'CD012010', 'CD012179', 'CD011431', 'CD010502', 'CD011145', 'CD011134',
       'CD008122', 'CD009591', 'CD008054', 'CD010657', 'CD009020', 'CD009579',
       'CD009263', 'CD009944', 'CD011515', 'CD005139', 'CD007394'],
      dtype='object', name='topic_id')

#### **The most balanced topics**

In [101]:
balanced_topics = df_train \
  .groupby('topic_id')['relevance'] \
  .agg(
      rel_count=lambda x: (x == 1).sum(),
      not_rel_count=lambda x: (x == 0).sum(),
      count='count'
  ) \
  .sort_values(by='rel_count', ascending=False) \
  .iloc[0:15]

balanced_topics['balance_score'] = abs(balanced_topics['rel_count'] - balanced_topics['not_rel_count'])

balanced_topics.sort_values(by='balance_score', ascending=True)


Unnamed: 0_level_0,rel_count,not_rel_count,count,balance_score
topic_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
CD011431,184,661,845,477
CD008122,133,847,980,714
CD011134,141,1074,1215,933
CD008054,113,1823,1936,1710
CD010502,166,1975,2141,1809
CD009925,314,4113,4427,3799
CD012010,209,4697,4906,4488
CD012599,402,5165,5567,4763
CD011975,414,5214,5628,4800
CD011984,307,5320,5627,5013


## **Model implementation**

A transformer-based classification model, `BERTClassifier`, was implemented using PyTorch.

1.  **Base Model:** It utilizes the pre-trained `bert-base-uncased` model, specifically loaded using `BertForSequenceClassification` from the Hugging Face `transformers` library. This base model already includes a classification head that will be possible to be used with `sentiment-analysis` pipeline.
2.  **Custom Head:**
    * The output logits from the underlying `BertForSequenceClassification` are taken.
    * A dropout layer is applied to these logits for regularization.
    * A custom fully connected (linear) layer (`nn.Linear(2, 1)`) is then applied, taking the 2 features from the BERT model's classification head output and mapping them to a single output feature.
3.  **Output:** The model outputs a single logit per input sequence, which is suitable for binary classification tasks.

In [109]:
from transformers import AutoConfig

class BERTClassifier(nn.Module):
    def __init__(self, dropout=0.3):
        super().__init__()
        self.bert = BertForSequenceClassification.from_pretrained('bert-base-uncased')
        self.dropout = nn.Dropout(dropout)

        self.fc = nn.Linear(2, 1)
        self.config = AutoConfig.from_pretrained('bert-base-uncased')
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')


    def forward(self, input_ids, attention_mask, token_type_ids):
        outputs = self.bert(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)

        x = outputs.logits
        x = self.dropout(x)
        x = self.fc(x)

        return x.squeeze(-1)

# **Training**

In [24]:
NUM_OF_EPOCHS = 10

In [32]:
def prepare_for_training(train_dataset, train_loader, lr=2e-5):
  set_seed(42)

  model = BERTClassifier()
  optimizer = AdamW(model.parameters(), lr=lr)
  loss_fn = nn.BCEWithLogitsLoss()  # For binary classification

  total_steps = len(train_loader) * NUM_OF_EPOCHS
  scheduler = get_linear_schedule_with_warmup(
      optimizer,
      num_warmup_steps=int(0.1 * total_steps),
      num_training_steps=total_steps
  )

  y_train = train_dataset.y
  num_pos = (y_train == 1).sum()
  num_neg = (y_train == 0).sum()
  pos_weight = torch.tensor([num_neg / num_pos], dtype=torch.float32)

  loss_fn = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

  device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
  model = model.to(device)
  loss_fn = loss_fn.to(device)

  return model, optimizer, loss_fn, scheduler, device

In [33]:
def train(model, optimizer, loss_fn, scheduler, device, train_loader, num_epochs=NUM_OF_EPOCHS):
    model.train()

    for epoch in range(num_epochs):
        total_loss = 0

        for batch in tqdm(train_loader, desc=f"Epoch {epoch+1}"):
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            token_type_ids = batch['token_type_ids'].to(device)
            labels = batch['labels'].to(device)

            outputs = model(input_ids, attention_mask, token_type_ids)
            loss = loss_fn(outputs, labels)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            scheduler.step()

            total_loss += loss.item()

        avg_loss = total_loss / len(train_loader)
        print(f"Epoch {epoch+1} | Loss: {avg_loss:.4f}")

    return model

# **Evaluation**

In [34]:
def evaluate(model, test_loader, device, threshold=0.5):
    model.eval()
    all_preds, all_labels = [], []

    with torch.no_grad():
        for batch in tqdm(test_loader, desc="Evaluating"):
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            token_type_ids = batch['token_type_ids'].to(device)
            labels = batch['labels'].to(device)

            outputs = model(input_ids, attention_mask, token_type_ids)
            probs = torch.sigmoid(outputs)  # [batch_size]

            all_preds.append(probs.cpu())
            all_labels.append(labels.cpu())

    all_preds = torch.cat(all_preds)
    all_labels = torch.cat(all_labels)

    thresholds = [0.1 * i for i in range(1, 10)]
    for t in thresholds:
        preds_binary = (all_preds >= t).int()
        f1 = f1_score(all_labels, preds_binary)
        precision = precision_score(all_labels, preds_binary, zero_division=0)
        recall = recall_score(all_labels, preds_binary, zero_division=0)
        accuracy = accuracy_score(all_labels, preds_binary)
        print(f"Threshold: {t:.1f} | F1: {f1:.4f} | Precision: {precision:.4f} | Recall: {recall:.4f} | Accuracy: {accuracy:.4f}")


# **Whole pipeline**

In [35]:
def train_and_evaluate(topic, df=df_train):
  set_seed(42)

  train_dataset, test_dataset, train_loader, test_loader = preprocess(topic, df)

  model, optimizer, loss_fn, scheduler, device = prepare_for_training(train_dataset, train_loader)

  model = train(model, optimizer, loss_fn, scheduler, device, train_loader)

  yield model, test_loader, device

  save_dir = '/content/drive/MyDrive/DevWorkshop_data/models'
  os.makedirs(save_dir, exist_ok=True)

  save_path = os.path.join(save_dir, f"bert_classifier_{topic}_balanced_1.pt")
  torch.save(model.state_dict(), save_path)
  print(f"Model saved to: {save_path}")

  evaluate(model, test_loader, device)

# **TESTS**

In [36]:
top_topics[0]

'CD011975'

In [37]:
# IGNORE ERRORS
import transformers
transformers.logging.set_verbosity_error()

In [42]:
model, test_loader, device = train_and_evaluate(top_topics[0])

Epoch 1: 100%|██████████| 36/36 [00:32<00:00,  1.11it/s]


Epoch 1 | Loss: 0.7117


Epoch 2: 100%|██████████| 36/36 [00:33<00:00,  1.07it/s]


Epoch 2 | Loss: 0.6303


Epoch 3: 100%|██████████| 36/36 [00:33<00:00,  1.07it/s]


Epoch 3 | Loss: 0.5494


Epoch 4: 100%|██████████| 36/36 [00:33<00:00,  1.09it/s]


Epoch 4 | Loss: 0.4773


Epoch 5: 100%|██████████| 36/36 [00:33<00:00,  1.07it/s]


Epoch 5 | Loss: 0.4127


Epoch 6: 100%|██████████| 36/36 [00:33<00:00,  1.07it/s]


Epoch 6 | Loss: 0.3669


Epoch 7: 100%|██████████| 36/36 [00:33<00:00,  1.08it/s]


Epoch 7 | Loss: 0.2930


Epoch 8: 100%|██████████| 36/36 [00:33<00:00,  1.07it/s]


Epoch 8 | Loss: 0.2869


Epoch 9: 100%|██████████| 36/36 [00:33<00:00,  1.07it/s]


Epoch 9 | Loss: 0.2784


Epoch 10: 100%|██████████| 36/36 [00:33<00:00,  1.07it/s]


Epoch 10 | Loss: 0.2415
Model saved to: /content/drive/MyDrive/DevWorkshop_data/models/bert_classifier_CD011975_balanced_1.pt


Evaluating: 100%|██████████| 12/12 [00:03<00:00,  3.18it/s]

Threshold: 0.1 | F1: 0.8052 | Precision: 0.6838 | Recall: 0.9789 | Accuracy: 0.7632
Threshold: 0.2 | F1: 0.8186 | Precision: 0.7333 | Recall: 0.9263 | Accuracy: 0.7947
Threshold: 0.3 | F1: 0.8230 | Precision: 0.7544 | Recall: 0.9053 | Accuracy: 0.8053
Threshold: 0.4 | F1: 0.8252 | Precision: 0.7658 | Recall: 0.8947 | Accuracy: 0.8105
Threshold: 0.5 | F1: 0.8195 | Precision: 0.7636 | Recall: 0.8842 | Accuracy: 0.8053
Threshold: 0.6 | F1: 0.8159 | Precision: 0.7736 | Recall: 0.8632 | Accuracy: 0.8053
Threshold: 0.7 | F1: 0.8081 | Precision: 0.7767 | Recall: 0.8421 | Accuracy: 0.8000
Threshold: 0.8 | F1: 0.8021 | Precision: 0.7938 | Recall: 0.8105 | Accuracy: 0.8000
Threshold: 0.9 | F1: 0.0000 | Precision: 0.0000 | Recall: 0.0000 | Accuracy: 0.5000





ValueError: not enough values to unpack (expected 3, got 1)

In [51]:
train_dataset, test_dataset, train_loader, test_loader = preprocess(top_topics[0])
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = BERTClassifier()
model.load_state_dict(torch.load(
    '/content/drive/MyDrive/DevWorkshop_data/models/bert_classifier_CD011975_balanced_1.pt',
    map_location=torch.device('cpu')
))
model = model.to(device)

# **Model explainability**

In [57]:
import shap
import transformers

In [43]:
sent_analyzer = transformers.pipeline(
    "sentiment-analysis",
    tokenizer=tokenizer,
    return_all_scores=True,
    model=model.bert,
)

The model 'BertModel' is not supported for sentiment-analysis. Supported models are ['AlbertForSequenceClassification', 'BartForSequenceClassification', 'BertForSequenceClassification', 'BigBirdForSequenceClassification', 'BigBirdPegasusForSequenceClassification', 'BioGptForSequenceClassification', 'BloomForSequenceClassification', 'CamembertForSequenceClassification', 'CanineForSequenceClassification', 'LlamaForSequenceClassification', 'ConvBertForSequenceClassification', 'CTRLForSequenceClassification', 'Data2VecTextForSequenceClassification', 'DebertaForSequenceClassification', 'DebertaV2ForSequenceClassification', 'DiffLlamaForSequenceClassification', 'DistilBertForSequenceClassification', 'ElectraForSequenceClassification', 'ErnieForSequenceClassification', 'ErnieMForSequenceClassification', 'EsmForSequenceClassification', 'FalconForSequenceClassification', 'FlaubertForSequenceClassification', 'FNetForSequenceClassification', 'FunnelForSequenceClassification', 'GemmaForSequenceCla

In [60]:
test_dataset.X.loc[290]['abstract']

'To compare the yield of multiple-marker biochemical screening with that of minor fetal anomalies observed on ultrasound for detection of aneuploidy in low-risk patients.'

In [45]:
explainer = shap.Explainer(sent_analyzer, output_names=['Irrelavant', 'Relevant'])

In [46]:
shaps = explainer(test_dataset.X.loc[290])
shap.plots.text(shaps)

NameError: name 'test_dataset' is not defined