# **POLAR @ SemEval-2026 — Task9**
## Subtask 1: Polarization Detection    
### Authors: Sujay Nalimela 
### Date: <auto-update>

---

This notebook contains the *full pipeline* for all POLAR shared task subtasks:

### ✔ Subtask 1 — Polarization Detection (binary)  
### ⬜ Subtask 2 — Polarization Type Classification (multi-class)  
### ⬜ Subtask 3 — Polarization Manifestation Identification


In [1]:
import pandas as pd
import numpy as np
import torch
from datasets import Dataset
from sklearn.model_selection import train_test_split

from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
)

from transformers.training_args import TrainingArguments

from shared.preprocess import clean_text
from shared.metrics import macro_f1

torch.backends.mps.allow_tf32 = True

print("Torch version:", torch.__version__)


  from .autonotebook import tqdm as notebook_tqdm


Torch version: 2.9.1


We begin with **Subtask 1 (binary)** using **XLM-RoBERTa-Base**.

In [2]:
BASE_DATA = "data/subtask1"
LANG = "eng"  # change to arb, spa, amh, etc.

MODEL_NAME = "bert-base-multilingual-cased"
MAX_LEN = 128
EPOCHS = 2
BATCH_SIZE = 4

## **Load Dataset**

Each language includes:

- `train/<lang>.csv`  
- `dev/<lang>.csv`  
- `test/<lang>.csv`  

We read all splits and inspect the schema.

In [3]:
train_path = f"{BASE_DATA}/train/{LANG}.csv"
dev_path   = f"{BASE_DATA}/dev/{LANG}.csv"

train_df = pd.read_csv(train_path)
dev_df   = pd.read_csv(dev_path)

print("Train size:", len(train_df))
print("Dev size:", len(dev_df))

train_df.head()


Train size: 2676
Dev size: 133


Unnamed: 0,id,text,polarization
0,eng_973938b90b0ff5d87d35a582f83f5c89,is defending imperialism in the dnd chat,0
1,eng_07dfd4600426caca6e2c5883fcbea9ea,Still playing with this. I am now following Ra...,0
2,eng_f14519ff2302b6cd47712073f13bc461,.senate.gov Theres 3 groups out there Republic...,0
3,eng_e48b7e7542faafa544ac57b64bc80daf,"""ABC MD, David Anderson, said the additional f...",0
4,eng_7c581fb77bce8033aeba3d6dbd6273eb,"""bad people"" I have some conservative values s...",0


## **Preprocessing**

We apply light normalization (lowercasing, URL removal, emoji normalization)
using our shared `clean_text()` function.


In [4]:
train_df["text"] = train_df["text"].apply(clean_text)
dev_df["text"]   = dev_df["text"].apply(clean_text)


In [5]:
train_df_split, val_df_split = train_test_split(
    train_df,
    test_size=0.1,
    stratify=train_df["polarization"],
    random_state=42
)
train_dataset = Dataset.from_pandas(train_df)
val_dataset   = Dataset.from_pandas(val_df_split)


## **Tokenization**

We use a single tokenizer across all languages (multilingual model).


In [6]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

def tokenize(batch):
    return tokenizer(
        batch["text"],
        truncation=True,
        padding="max_length",
        max_length=MAX_LEN
    )

# tokenization
train_dataset = train_dataset.map(tokenize, batched=True)
val_dataset   = val_dataset.map(tokenize, batched=True)

# rename correct label column
train_dataset = train_dataset.rename_column("polarization", "labels")
val_dataset   = val_dataset.rename_column("polarization", "labels")

# set PyTorch formatting
train_dataset.set_format("torch", columns=["input_ids", "attention_mask", "labels"])
val_dataset.set_format("torch", columns=["input_ids", "attention_mask", "labels"])


Map: 100%|██████████| 2676/2676 [00:00<00:00, 24420.08 examples/s]
Map: 100%|██████████| 268/268 [00:00<00:00, 21809.73 examples/s]


In [7]:
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=2
)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [8]:
training_args = TrainingArguments(
    output_dir=f"subtask1/{LANG}_checkpoints",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=3e-5,
    per_device_train_batch_size=4,   # reduce load
    per_device_eval_batch_size=4,
    num_train_epochs=2,
    fp16=False,     # MPS doesn't support FP16 fully
    bf16=False,
    logging_steps=50,
    report_to="none"
)


def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = logits.argmax(axis=1)
    return {"macro_f1": macro_f1(preds, labels)}

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)
trainer.train()

  4%|▎         | 50/1338 [00:34<12:31,  1.71it/s]

{'loss': 0.6748, 'grad_norm': 11.767070770263672, 'learning_rate': 2.8878923766816144e-05, 'epoch': 0.07}


  7%|▋         | 100/1338 [01:04<12:17,  1.68it/s]

{'loss': 0.6206, 'grad_norm': 8.858292579650879, 'learning_rate': 2.7757847533632287e-05, 'epoch': 0.15}


 11%|█         | 150/1338 [01:44<13:14,  1.49it/s]

{'loss': 0.6654, 'grad_norm': 11.481993675231934, 'learning_rate': 2.663677130044843e-05, 'epoch': 0.22}


 15%|█▍        | 200/1338 [02:21<13:07,  1.45it/s]

{'loss': 0.6378, 'grad_norm': 6.704477310180664, 'learning_rate': 2.5515695067264577e-05, 'epoch': 0.3}


 19%|█▊        | 250/1338 [02:55<14:39,  1.24it/s]

{'loss': 0.6735, 'grad_norm': 13.358336448669434, 'learning_rate': 2.4394618834080717e-05, 'epoch': 0.37}


 22%|██▏       | 300/1338 [03:27<10:33,  1.64it/s]

{'loss': 0.5299, 'grad_norm': 8.486689567565918, 'learning_rate': 2.3273542600896863e-05, 'epoch': 0.45}


 26%|██▌       | 350/1338 [04:00<10:04,  1.63it/s]

{'loss': 0.587, 'grad_norm': 8.289909362792969, 'learning_rate': 2.2152466367713003e-05, 'epoch': 0.52}


 30%|██▉       | 400/1338 [04:34<09:22,  1.67it/s]

{'loss': 0.5678, 'grad_norm': 7.295685768127441, 'learning_rate': 2.1031390134529146e-05, 'epoch': 0.6}


 34%|███▎      | 450/1338 [05:06<09:16,  1.60it/s]

{'loss': 0.5586, 'grad_norm': 0.6503819227218628, 'learning_rate': 1.9910313901345293e-05, 'epoch': 0.67}


 37%|███▋      | 500/1338 [05:40<11:24,  1.22it/s]

{'loss': 0.6287, 'grad_norm': 5.932218551635742, 'learning_rate': 1.8789237668161433e-05, 'epoch': 0.75}


 41%|████      | 550/1338 [06:13<08:19,  1.58it/s]

{'loss': 0.5487, 'grad_norm': 7.555407524108887, 'learning_rate': 1.766816143497758e-05, 'epoch': 0.82}


 45%|████▍     | 600/1338 [06:47<07:36,  1.62it/s]

{'loss': 0.5107, 'grad_norm': 12.692445755004883, 'learning_rate': 1.6547085201793723e-05, 'epoch': 0.9}


 49%|████▊     | 650/1338 [07:23<11:39,  1.02s/it]

{'loss': 0.4913, 'grad_norm': 28.551973342895508, 'learning_rate': 1.5426008968609866e-05, 'epoch': 0.97}


                                                  
 50%|█████     | 669/1338 [07:41<06:53,  1.62it/s]

{'eval_loss': 0.686150074005127, 'eval_macro_f1': 0.7764614405881105, 'eval_runtime': 6.4391, 'eval_samples_per_second': 41.621, 'eval_steps_per_second': 10.405, 'epoch': 1.0}


 52%|█████▏    | 700/1338 [08:11<07:41,  1.38it/s]

{'loss': 0.5128, 'grad_norm': 71.91007995605469, 'learning_rate': 1.430493273542601e-05, 'epoch': 1.05}


 56%|█████▌    | 750/1338 [08:47<09:56,  1.01s/it]

{'loss': 0.6211, 'grad_norm': 60.96859359741211, 'learning_rate': 1.3183856502242152e-05, 'epoch': 1.12}


 60%|█████▉    | 800/1338 [09:22<05:56,  1.51it/s]

{'loss': 0.5812, 'grad_norm': 7.747817039489746, 'learning_rate': 1.2062780269058296e-05, 'epoch': 1.2}


 64%|██████▎   | 850/1338 [09:57<05:22,  1.51it/s]

{'loss': 0.5457, 'grad_norm': 8.988685607910156, 'learning_rate': 1.094170403587444e-05, 'epoch': 1.27}


 67%|██████▋   | 900/1338 [10:31<04:37,  1.58it/s]

{'loss': 0.5474, 'grad_norm': 36.022884368896484, 'learning_rate': 9.820627802690584e-06, 'epoch': 1.35}


 71%|███████   | 950/1338 [11:06<03:56,  1.64it/s]

{'loss': 0.569, 'grad_norm': 11.950432777404785, 'learning_rate': 8.699551569506727e-06, 'epoch': 1.42}


 75%|███████▍  | 1000/1338 [11:41<03:50,  1.47it/s]

{'loss': 0.6502, 'grad_norm': 2.3797364234924316, 'learning_rate': 7.578475336322871e-06, 'epoch': 1.49}


 78%|███████▊  | 1050/1338 [12:15<03:16,  1.47it/s]

{'loss': 0.41, 'grad_norm': 64.84599304199219, 'learning_rate': 6.457399103139014e-06, 'epoch': 1.57}


 82%|████████▏ | 1100/1338 [12:48<02:31,  1.57it/s]

{'loss': 0.4531, 'grad_norm': 52.04609680175781, 'learning_rate': 5.336322869955157e-06, 'epoch': 1.64}


 86%|████████▌ | 1150/1338 [13:21<01:58,  1.59it/s]

{'loss': 0.6556, 'grad_norm': 6.106030464172363, 'learning_rate': 4.215246636771301e-06, 'epoch': 1.72}


 90%|████████▉ | 1200/1338 [13:55<01:30,  1.53it/s]

{'loss': 0.3987, 'grad_norm': 51.52360534667969, 'learning_rate': 3.0941704035874443e-06, 'epoch': 1.79}


 93%|█████████▎| 1250/1338 [14:28<00:54,  1.63it/s]

{'loss': 0.5128, 'grad_norm': 10.010797500610352, 'learning_rate': 1.9730941704035875e-06, 'epoch': 1.87}


 97%|█████████▋| 1300/1338 [15:02<00:25,  1.51it/s]

{'loss': 0.5369, 'grad_norm': 27.298784255981445, 'learning_rate': 8.520179372197309e-07, 'epoch': 1.94}


                                                   
100%|██████████| 1338/1338 [15:34<00:00,  1.60it/s]

{'eval_loss': 0.417903333902359, 'eval_macro_f1': 0.8527363184079602, 'eval_runtime': 6.2116, 'eval_samples_per_second': 43.145, 'eval_steps_per_second': 10.786, 'epoch': 2.0}


100%|██████████| 1338/1338 [15:40<00:00,  1.42it/s]


{'train_runtime': 939.9256, 'train_samples_per_second': 5.694, 'train_steps_per_second': 1.424, 'train_loss': 0.5633094913043605, 'epoch': 2.0}


TrainOutput(global_step=1338, training_loss=0.5633094913043605, metrics={'train_runtime': 939.9256, 'train_samples_per_second': 5.694, 'train_steps_per_second': 1.424, 'total_flos': 352042592071680.0, 'train_loss': 0.5633094913043605, 'epoch': 2.0})

In [9]:
full_train_dataset = Dataset.from_pandas(train_df)

full_train_dataset = full_train_dataset.map(tokenize, batched=True)
full_train_dataset = full_train_dataset.rename_column("polarization", "labels")
full_train_dataset.set_format("torch", columns=["input_ids", "attention_mask", "labels"])

full_training_args = TrainingArguments(
    output_dir=f"subtask1/{LANG}_checkpoints",
    eval_strategy="no",       # ⛔ Disable evaluation
    save_strategy="no",       # Optional: don’t save intermediate checkpoints
    learning_rate=3e-5,
    per_device_train_batch_size=4,
    num_train_epochs=2,
    logging_steps=50,
    report_to="none"
)

trainer = Trainer(
    model=model,
    args=full_training_args,
    train_dataset=full_train_dataset,
    tokenizer=tokenizer,
)

trainer.train()


Map: 100%|██████████| 2676/2676 [00:01<00:00, 2156.85 examples/s]
  4%|▎         | 50/1338 [00:35<18:37,  1.15it/s]

{'loss': 0.4968, 'grad_norm': 0.4871438145637512, 'learning_rate': 2.8878923766816144e-05, 'epoch': 0.07}


  7%|▋         | 100/1338 [01:06<12:24,  1.66it/s]

{'loss': 0.5278, 'grad_norm': 13.145429611206055, 'learning_rate': 2.7757847533632287e-05, 'epoch': 0.15}


 11%|█         | 150/1338 [01:37<11:59,  1.65it/s]

{'loss': 0.5541, 'grad_norm': 24.226577758789062, 'learning_rate': 2.663677130044843e-05, 'epoch': 0.22}


 15%|█▍        | 200/1338 [02:08<11:21,  1.67it/s]

{'loss': 0.4142, 'grad_norm': 40.36188507080078, 'learning_rate': 2.5515695067264577e-05, 'epoch': 0.3}


 19%|█▊        | 250/1338 [02:41<11:23,  1.59it/s]

{'loss': 0.6511, 'grad_norm': 424.3372802734375, 'learning_rate': 2.4394618834080717e-05, 'epoch': 0.37}


 22%|██▏       | 300/1338 [03:14<10:52,  1.59it/s]

{'loss': 0.4692, 'grad_norm': 0.39467954635620117, 'learning_rate': 2.3273542600896863e-05, 'epoch': 0.45}


 26%|██▌       | 350/1338 [03:46<10:15,  1.61it/s]

{'loss': 0.4857, 'grad_norm': 43.22133255004883, 'learning_rate': 2.2152466367713003e-05, 'epoch': 0.52}


 30%|██▉       | 400/1338 [04:18<09:17,  1.68it/s]

{'loss': 0.4102, 'grad_norm': 141.78317260742188, 'learning_rate': 2.1031390134529146e-05, 'epoch': 0.6}


 34%|███▎      | 450/1338 [04:53<11:16,  1.31it/s]

{'loss': 0.428, 'grad_norm': 0.07539859414100647, 'learning_rate': 1.9910313901345293e-05, 'epoch': 0.67}


 37%|███▋      | 500/1338 [05:30<09:28,  1.47it/s]

{'loss': 0.4019, 'grad_norm': 0.18938903510570526, 'learning_rate': 1.8789237668161433e-05, 'epoch': 0.75}


 41%|████      | 550/1338 [06:02<09:35,  1.37it/s]

{'loss': 0.3534, 'grad_norm': 0.46135228872299194, 'learning_rate': 1.766816143497758e-05, 'epoch': 0.82}


 45%|████▍     | 600/1338 [06:38<07:44,  1.59it/s]

{'loss': 0.4242, 'grad_norm': 0.49064335227012634, 'learning_rate': 1.6547085201793723e-05, 'epoch': 0.9}


 49%|████▊     | 650/1338 [07:11<07:45,  1.48it/s]

{'loss': 0.4009, 'grad_norm': 225.22157287597656, 'learning_rate': 1.5426008968609866e-05, 'epoch': 0.97}


 52%|█████▏    | 700/1338 [07:45<06:49,  1.56it/s]

{'loss': 0.3362, 'grad_norm': 0.20043176412582397, 'learning_rate': 1.430493273542601e-05, 'epoch': 1.05}


 56%|█████▌    | 750/1338 [08:21<06:30,  1.50it/s]

{'loss': 0.213, 'grad_norm': 0.5800285935401917, 'learning_rate': 1.3183856502242152e-05, 'epoch': 1.12}


 60%|█████▉    | 800/1338 [08:58<05:26,  1.65it/s]

{'loss': 0.1316, 'grad_norm': 6.251967906951904, 'learning_rate': 1.2062780269058296e-05, 'epoch': 1.2}


 64%|██████▎   | 850/1338 [09:32<05:29,  1.48it/s]

{'loss': 0.23, 'grad_norm': 0.1772935688495636, 'learning_rate': 1.094170403587444e-05, 'epoch': 1.27}


 67%|██████▋   | 900/1338 [10:06<06:48,  1.07it/s]

{'loss': 0.2213, 'grad_norm': 0.03188279643654823, 'learning_rate': 9.820627802690584e-06, 'epoch': 1.35}


 71%|███████   | 950/1338 [10:41<04:03,  1.59it/s]

{'loss': 0.2063, 'grad_norm': 48.558536529541016, 'learning_rate': 8.699551569506727e-06, 'epoch': 1.42}


 75%|███████▍  | 1000/1338 [11:16<03:54,  1.44it/s]

{'loss': 0.4537, 'grad_norm': 0.38463738560676575, 'learning_rate': 7.578475336322871e-06, 'epoch': 1.49}


 78%|███████▊  | 1050/1338 [11:54<03:09,  1.52it/s]

{'loss': 0.2643, 'grad_norm': 209.21890258789062, 'learning_rate': 6.457399103139014e-06, 'epoch': 1.57}


 82%|████████▏ | 1100/1338 [12:35<05:54,  1.49s/it]

{'loss': 0.2896, 'grad_norm': 0.07160954177379608, 'learning_rate': 5.336322869955157e-06, 'epoch': 1.64}


 86%|████████▌ | 1150/1338 [13:17<03:29,  1.11s/it]

{'loss': 0.5259, 'grad_norm': 0.7199346423149109, 'learning_rate': 4.215246636771301e-06, 'epoch': 1.72}


 90%|████████▉ | 1200/1338 [13:55<01:29,  1.55it/s]

{'loss': 0.3368, 'grad_norm': 0.30561378598213196, 'learning_rate': 3.0941704035874443e-06, 'epoch': 1.79}


 93%|█████████▎| 1250/1338 [14:33<00:55,  1.58it/s]

{'loss': 0.2958, 'grad_norm': 36.98379898071289, 'learning_rate': 1.9730941704035875e-06, 'epoch': 1.87}


 97%|█████████▋| 1300/1338 [15:10<00:25,  1.50it/s]

{'loss': 0.3612, 'grad_norm': 72.72859954833984, 'learning_rate': 8.520179372197309e-07, 'epoch': 1.94}


100%|██████████| 1338/1338 [15:37<00:00,  1.43it/s]

{'train_runtime': 937.0419, 'train_samples_per_second': 5.712, 'train_steps_per_second': 1.428, 'train_loss': 0.37606397302339606, 'epoch': 2.0}





TrainOutput(global_step=1338, training_loss=0.37606397302339606, metrics={'train_runtime': 937.0419, 'train_samples_per_second': 5.712, 'train_steps_per_second': 1.428, 'total_flos': 352042592071680.0, 'train_loss': 0.37606397302339606, 'epoch': 2.0})

In [11]:
# -------------------- Load official test data (dev/) --------------------

dev_pred_path = f"{BASE_DATA}/dev/{LANG}.csv"
test_df = pd.read_csv(dev_pred_path)

print("Loaded test (dev) size:", len(test_df))
test_df.head()

Loaded test (dev) size: 133


Unnamed: 0,id,text,polarization
0,eng_f66ca14d60851371f9720aaf4ccd9b58,God is with Ukraine and Zelensky,
1,eng_3a489aa7fed9726aa8d3d4fe74c57efb,"4 Dems, 2 Republicans Luzerne County Council s...",
2,eng_95770ff547ea5e48b0be00f385986483,Abuse Survivor Recounts Her Struggles at YWCA ...,
3,eng_2048ae6f9aa261c48e6d777bcc5b38bf,"After Rwanda, another deportation camp disaster",
4,eng_07781aa88e61e7c0a996abd1e5ea3a20,Another plea in Trump election interference probe,


In [12]:
# Clean text using your preprocessing function
test_df["text"] = test_df["text"].apply(clean_text)


In [13]:
# Prepare inputs
inputs = tokenizer(
    test_df["text"].tolist(),
    truncation=True,
    padding=True,
    max_length=MAX_LEN,
    return_tensors="pt"
)

# Move to MPS or CPU
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
model.to(device)
inputs = {k: v.to(device) for k, v in inputs.items()}


In [14]:
# Disable gradient calculation for faster inference
with torch.no_grad():
    logits = model(**inputs).logits

# Predicted labels: 0 or 1
preds = logits.argmax(dim=1).cpu().numpy()


In [16]:
submission_df = pd.DataFrame({
    "id": test_df["id"],
    "polarization": preds.astype(int)
})

submission_df.head()


Unnamed: 0,id,polarization
0,eng_f66ca14d60851371f9720aaf4ccd9b58,0
1,eng_3a489aa7fed9726aa8d3d4fe74c57efb,0
2,eng_95770ff547ea5e48b0be00f385986483,0
3,eng_2048ae6f9aa261c48e6d777bcc5b38bf,0
4,eng_07781aa88e61e7c0a996abd1e5ea3a20,1


In [17]:
submission_df = pd.DataFrame({
    "id": test_df["id"],
    "polarization": preds.astype(int)
})

submission_df.head()


Unnamed: 0,id,polarization
0,eng_f66ca14d60851371f9720aaf4ccd9b58,0
1,eng_3a489aa7fed9726aa8d3d4fe74c57efb,0
2,eng_95770ff547ea5e48b0be00f385986483,0
3,eng_2048ae6f9aa261c48e6d777bcc5b38bf,0
4,eng_07781aa88e61e7c0a996abd1e5ea3a20,1


In [18]:
import os

SAVE_DIR = "submissions/subtask1"
os.makedirs(SAVE_DIR, exist_ok=True)

OUT_PATH = f"{SAVE_DIR}/pred_{LANG}.csv"
submission_df.to_csv(OUT_PATH, index=False)

print("Saved prediction file:", OUT_PATH)


Saved prediction file: submissions/subtask1/pred_eng.csv


In [None]:
import shutil

zip_name = "subtask_1"   # final file will be submission_subtask1.zip
shutil.make_archive(zip_name, "zip", SAVE_DIR)

print("Created ZIP:", zip_name + ".zip")


Created ZIP: submission_subtask1.zip
