# Baseline for POLAR - Subtask 1

**SemEval 2026 Task 9: Multilingual Text Classification Challenge - Polarization Detection**

In [1]:
# Verify GPU availability
!nvidia-smi

Mon Dec  1 15:15:08 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.172.08             Driver Version: 570.172.08     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   39C    P8              9W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  Tesla T4                       Off |   00

In [2]:
!pip uninstall -y transformers
!pip install -q transformers==4.44.0 accelerate wandb

Found existing installation: transformers 4.44.0
Uninstalling transformers-4.44.0:
  Successfully uninstalled transformers-4.44.0


In [3]:
# import os
# os.kill(os.getpid(), 9)

In [4]:
# List available input data
!ls /kaggle/input/

dataset


In [5]:
import numpy as np
import pandas as pd
import torch

from sklearn.metrics import f1_score, precision_score, recall_score

from transformers import (
    AutoTokenizer,
    AutoConfig,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
    DataCollatorWithPadding
)
from torch.utils.data import Dataset

2025-12-01 15:15:23.374710: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1764602123.396571     181 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1764602123.403340     181 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'


In [6]:
import wandb

# Disable wandb logging for this script
wandb.init(mode="disabled")



## Data Import

The training data consists of short texts, each with a binary polarization label.

The data is stored as a CSV file with the following fields:
- id: a unique identifier for the sample
- text: a sentence or short text
- polarization: 1 if the text is polarized, 0 if it is not

In [7]:
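# Load the training and validation data for subtask 1.
# NOTE: both frames are read from train_eng.csv, so the "validation" scores
# reported further down are measured on the training data, not on held-out data.

train = pd.read_csv('/kaggle/input/dataset/dataset/train_eng.csv')
val = pd.read_csv('/kaggle/input/dataset/dataset/train_eng.csv')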

train.head()

Unnamed: 0,id,text,polarization
0,eng_973938b90b0ff5d87d35a582f83f5c89,is defending imperialism in the dnd chat,0
1,eng_07dfd4600426caca6e2c5883fcbea9ea,Still playing with this. I am now following Ra...,0
2,eng_f14519ff2302b6cd47712073f13bc461,.senate.gov Theres 3 groups out there Republic...,0
3,eng_e48b7e7542faafa544ac57b64bc80daf,"""ABC MD, David Anderson, said the additional f...",0
4,eng_7c581fb77bce8033aeba3d6dbd6273eb,"""bad people"" I have some conservative values s...",0
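
Since `train` and `val` are both read from `train_eng.csv` in the cell above, one way to obtain a score on unseen data is to hold out part of the labeled rows. The sketch below is not part of the original baseline: the helper name `split_train_val`, the 20% fraction, and the seed are illustrative assumptions, and the toy frame only mimics the column layout shown above.

```python
import pandas as pd

def split_train_val(df, val_fraction=0.2, seed=42):
    """Hold out a stratified fraction of df for validation (hypothetical helper)."""
    # Sample the same fraction from each label so the class balance is preserved
    val_part = df.groupby("polarization").sample(frac=val_fraction, random_state=seed)
    # Every row that was not sampled stays in the training part
    train_part = df.drop(val_part.index)
    return train_part.reset_index(drop=True), val_part.reset_index(drop=True)

# Toy frame with the same columns as train_eng.csv (contents are made up)
toy = pd.DataFrame({
    "id": [f"eng_{i}" for i in range(10)],
    "text": [f"example text {i}" for i in range(10)],
    "polarization": [0, 1] * 5,
})
toy_train, toy_val = split_train_val(toy)
print(len(toy_train), len(toy_val))
```

In the notebook, the same helper could be applied to the frame loaded from `train_eng.csv`, with `train_dataset` and `val_dataset` then built from the two parts.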


## Dataset
- Create a PyTorch `Dataset` class for handling the data
- Wrap the raw texts and labels in a format that Hugging Face's `Trainer` can use for training and evaluation

In [8]:
# Dataset wrapper; inherits from torch.utils.data.Dataset so Trainer can index it
class PolarizationDataset(torch.utils.data.Dataset):
    def __init__(self, texts, labels, tokenizer, max_length=128):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length  # truncation limit, in tokens

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]
        # No padding here: DataCollatorWithPadding pads each batch later
        encoding = self.tokenizer(
            text,
            truncation=True,
            padding=False,
            max_length=self.max_length,
            return_tensors='pt',
        )

        # Drop the leading batch dimension of size 1 from every tensor
        item = {key: encoding[key].squeeze(0) for key in encoding.keys()}
        item['labels'] = torch.tensor(label, dtype=torch.long)
        return item
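
The dataset returns unpadded sequences (`padding=False`), so the examples in a batch can have different lengths; `DataCollatorWithPadding` pads them when the batch is assembled. The pure-Python sketch below only illustrates that idea and is not the library's implementation; `pad_batch`, the pad id of 0, and the token ids are assumptions made for the example.

```python
def pad_batch(batch_input_ids, pad_id=0):
    """Pad token-id lists to the longest sequence in this batch (illustrative helper)."""
    longest = max(len(ids) for ids in batch_input_ids)
    padded, masks = [], []
    for ids in batch_input_ids:
        n_pad = longest - len(ids)
        padded.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding that the model should ignore
        masks.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": padded, "attention_mask": masks}

# Two made-up sequences of different lengths
batch = pad_batch([[101, 2023, 102], [101, 2023, 2003, 2460, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding to the longest sequence in each batch, rather than to `max_length` for every example, keeps batches of short texts small.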

Now we'll load the `bert-base-uncased` tokenizer and create the datasets. The texts are tokenized lazily, one example at a time, whenever an item is requested from the dataset.

In [9]:
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Create datasets
train_dataset = PolarizationDataset(train['text'].tolist(), train['polarization'].tolist(), tokenizer)
val_dataset = PolarizationDataset(val['text'].tolist(), val['polarization'].tolist(), tokenizer)




Next, we'll load the pre-trained `bert-base-uncased` model for sequence classification. Since this is a binary classification task (Polarized/Not Polarized), we set `num_labels=2`.

In [10]:
# Load the model
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Now, we'll define the training arguments and the evaluation metric. We'll use macro F1 score for evaluation.

In [11]:
# Define metrics function
def compute_metrics(p):
    preds = np.argmax(p.predictions, axis=1)
    return {'f1_macro': f1_score(p.label_ids, preds, average='macro')}

# Define training arguments
training_args = TrainingArguments(
    output_dir="./",
    num_train_epochs=3,
    learning_rate=2e-5,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=8,
    eval_strategy="epoch",
    save_strategy="no",
    logging_steps=100,
    disable_tqdm=False,
)
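
Macro F1 is the unweighted mean of the per-class F1 scores, so the minority class counts as much as the majority class. The following sketch recomputes it by hand with NumPy on made-up labels; `macro_f1_by_hand` is a hypothetical helper for illustration only, and the notebook itself keeps using `sklearn.metrics.f1_score`.

```python
import numpy as np

def macro_f1_by_hand(y_true, y_pred, labels=(0, 1)):
    """Average the per-class F1 scores without class weighting (illustrative helper)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    scores = []
    for c in labels:
        tp = np.sum((y_pred == c) & (y_true == c))  # correctly predicted as c
        fp = np.sum((y_pred == c) & (y_true != c))  # predicted c, actually other
        fn = np.sum((y_pred != c) & (y_true == c))  # actually c, predicted other
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); score 0 when the class never appears
        scores.append(2 * tp / denom if denom > 0 else 0.0)
    return float(np.mean(scores))

# Made-up labels: class 0 gets F1 = 2/3, class 1 gets F1 = 1/2
score = macro_f1_by_hand([0, 0, 0, 1, 1], [0, 0, 1, 1, 0])
print(round(score, 4))  # 0.5833
```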

Finally, we'll initialize the `Trainer` and start training.

In [12]:
# Initialize the Trainer
trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=val_dataset,            # evaluation dataset
    compute_metrics=compute_metrics,     # the callback that computes metrics of interest
    data_collator=DataCollatorWithPadding(tokenizer) # Data collator for dynamic padding
)

# Train the model
trainer.train()

# Evaluate the model on the validation set
eval_results = trainer.evaluate()
print(f"Macro F1 score on validation set: {eval_results['eval_f1_macro']}")



Epoch,Training Loss,Validation Loss,F1 Macro
1,No log,0.521276,0.716115
2,No log,0.448673,0.763781
3,No log,0.42011,0.787005




Macro F1 score on validation set: 0.7870053297882424


# Submission

In [25]:
# Load dev set for prediction
dev = pd.read_csv('/kaggle/input/dataset/eng.csv')
dev.head()



Unnamed: 0,id,text,polarization
0,eng_f66ca14d60851371f9720aaf4ccd9b58,God is with Ukraine and Zelensky,
1,eng_3a489aa7fed9726aa8d3d4fe74c57efb,"4 Dems, 2 Republicans Luzerne County Council s...",
2,eng_95770ff547ea5e48b0be00f385986483,Abuse Survivor Recounts Her Struggles at YWCA ...,
3,eng_2048ae6f9aa261c48e6d777bcc5b38bf,"After Rwanda, another deportation camp disaster",
4,eng_07781aa88e61e7c0a996abd1e5ea3a20,Another plea in Trump election interference probe,


In [26]:
# Create dataset for dev set (use dummy labels since we don't have ground truth)
dev_dataset = PolarizationDataset(dev['text'].tolist(), [0]*len(dev), tokenizer)

In [27]:
# Generate predictions
predictions = trainer.predict(dev_dataset)
preds = np.argmax(predictions.predictions, axis=1)
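
`predictions.predictions` holds raw logits with one column per class. When probabilities are wanted, for example to inspect low-confidence predictions, a softmax can be applied first; the arg-max label does not change. A minimal NumPy sketch on made-up logits, where `softmax`, `toy_logits`, and `toy_preds` are names introduced here for illustration:

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the row maximum avoids overflow in exp."""
    logits = np.asarray(logits, dtype=np.float64)
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Made-up logits for two examples and two classes
toy_logits = np.array([[2.0, -1.0], [0.3, 0.9]])
probs = softmax(toy_logits)
toy_preds = probs.argmax(axis=1)  # same labels as np.argmax(toy_logits, axis=1)
print(probs.round(3))
print(toy_preds)
```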



In [28]:
# Create submission dataframe
submission = pd.DataFrame({
    'id': dev['id'],
    'polarization': preds
})
submission.head()

Unnamed: 0,id,polarization
0,eng_f66ca14d60851371f9720aaf4ccd9b58,0
1,eng_3a489aa7fed9726aa8d3d4fe74c57efb,0
2,eng_95770ff547ea5e48b0be00f385986483,0
3,eng_2048ae6f9aa261c48e6d777bcc5b38bf,0
4,eng_07781aa88e61e7c0a996abd1e5ea3a20,0


In [29]:
# Create submission folder and save prediction file
import os

os.makedirs('subtask_1', exist_ok=True)
submission.to_csv('subtask_1/pred_eng.csv', index=False)

print("Prediction file saved!")
print(f"Total predictions: {len(submission)}")
print(f"Label distribution:\n{submission['polarization'].value_counts()}")

Prediction file saved!
Total predictions: 160
Label distribution:
polarization
0    110
1     50
Name: count, dtype: int64
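
Before zipping, it can be worth checking the prediction file against the layout written above: an `id` column, a `polarization` column, unique ids, and labels restricted to 0 and 1. The checker below is a sketch under those assumptions; `check_submission` is a hypothetical helper, the toy file is made up, and the official format requirements should still be confirmed on the task page.

```python
import csv
import os
import tempfile

def check_submission(path, expected_rows=None):
    """Validate a prediction CSV and return its row count (hypothetical helper)."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        if reader.fieldnames != ["id", "polarization"]:
            raise ValueError(f"unexpected columns: {reader.fieldnames}")
        rows = list(reader)
    ids = [row["id"] for row in rows]
    if len(set(ids)) != len(ids):
        raise ValueError("duplicate ids found")
    # CSV values are read as strings, so compare against "0" and "1"
    if any(row["polarization"] not in {"0", "1"} for row in rows):
        raise ValueError("labels must be 0 or 1")
    if expected_rows is not None and len(rows) != expected_rows:
        raise ValueError(f"expected {expected_rows} rows, found {len(rows)}")
    return len(rows)

# Toy file written to a temporary directory, mirroring the pred_eng.csv layout
with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "pred_eng.csv")
    with open(toy_path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows([["id", "polarization"], ["eng_a", 0], ["eng_b", 1]])
    n_rows = check_submission(toy_path, expected_rows=2)
print(n_rows)
```

In the notebook, the corresponding call would be `check_submission('subtask_1/pred_eng.csv', expected_rows=len(dev))`.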


In [30]:
# Create zip file for submission
!zip -r submission.zip subtask_1/
print("submission.zip created! Download and upload to Codabench.")

updating: subtask_1/ (stored 0%)
updating: subtask_1/pred_eng.csv (deflated 48%)
submission.zip created! Download and upload to Codabench.


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
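
The warning above appears because the `!zip` shell call forks the notebook process after the tokenizer has already used parallelism. As the message itself suggests, setting `TOKENIZERS_PARALLELISM` explicitly silences it. A sketch, assuming the lines are run in an early cell before the tokenizer is first used:

```python
import os

# Set before the tokenizer is first used; "false" disables tokenizer parallelism
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```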


In [31]:
# Verify submission file
!unzip -l submission.zip

Archive:  submission.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
        0  2025-12-01 15:43   subtask_1/
     6256  2025-12-01 15:52   subtask_1/pred_eng.csv
---------                     -------
     6256                     2 files




In [32]:
# # Load dev set for prediction
# dev = pd.read_csv('/kaggle/input/dataset/eng.csv')
# print(f"Total rows: {len(dev)}")
# print(f"Columns: {dev.columns.tolist()}")
# dev.head()

Total rows: 160
Columns: ['id', 'text', 'polarization']


