
# Vietnamese News Sentiment – 2-Stage Pipeline (PhoBERT)

A two-stage pipeline:
1. **Stage 1 (6 emotions)** – Fine-tune PhoBERT on UIT-VSMEC (Excel).
2. **Stage 2 (3 polarity labels)** – Adapt to the news domain (using a previously generated synthetic CSV).

> The input files must be in the **same directory as the notebook** (or change the paths):

- `train_nor_811.xlsx`
- `valid_nor_811.xlsx`
- `test_nor_811.xlsx`
- `news_sentiment_fake_large.csv` *(or `news_sentiment_fake.csv`)*
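
Before running the cells, it can help to confirm that the Excel inputs are where the notebook expects them. The snippet below is a minimal sketch, not part of the original notebook: `missing_inputs` and `REQUIRED_FILES` are hypothetical names, and the directory passed in is an assumption you should replace with your own (for example the Drive folder used in the cells below). The news CSV is left out because the notebook accepts more than one file name for it.

```python
from pathlib import Path

# Hypothetical helper (not in the original notebook): the three UIT-VSMEC
# Excel files listed above.
REQUIRED_FILES = [
    "train_nor_811.xlsx",
    "valid_nor_811.xlsx",
    "test_nor_811.xlsx",
]

def missing_inputs(base_dir, names=REQUIRED_FILES):
    """Return the required file names that do not exist under base_dir."""
    base = Path(base_dir)
    return [name for name in names if not (base / name).is_file()]

# Example: check the current working directory (an assumed location).
missing = missing_inputs(".")
if missing:
    print("Missing input files:", ", ".join(missing))
else:
    print("All required input files found.")
```
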


In [None]:


from google.colab import drive
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## 1) Install libraries

In [None]:

!pip install -q transformers datasets accelerate evaluate underthesea scikit-learn openpyxl


## 2) Convert UIT-VSMEC (Excel) → standardized CSV (6 emotions)

In [None]:
import pandas as pd

INPUT_TRAIN_XLS = "/content/drive/MyDrive/data for sen/train_nor_811.xlsx"
INPUT_VALID_XLS = "/content/drive/MyDrive/data for sen/valid_nor_811.xlsx"
INPUT_TEST_XLS  = "/content/drive/MyDrive/data for sen/test_nor_811.xlsx"

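# NOTE: the label-fix cell below reads these CSVs from ".../MyDrive/data for sen/";
# move the files there or align the paths before running it.
OUT_TRAIN_CSV = "/content/drive/MyDrive/uit_vsmec_train.csv"
OUT_VALID_CSV = "/content/drive/MyDrive/uit_vsmec_valid.csv"
OUT_TEST_CSV  = "/content/drive/MyDrive/uit_vsmec_test.csv"


# Column-name aliases (case-insensitive)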
TEXT_ALIASES  = {"text","sentence","content","utterance"}
LABEL_ALIASES = {"label","emotion","emo"}

def standardize_columns(df: pd.DataFrame) -> pd.DataFrame:
    # Drop leftover index columns from the Excel export ("Unnamed: 0", ...)
    drop_cols = [c for c in df.columns if str(c).strip().lower().startswith("unnamed")]
    if drop_cols:
        df = df.drop(columns=drop_cols)

    # Normalize column names: original name -> stripped, lower-cased name
    lower_map = {c: str(c).strip().lower() for c in df.columns}
    df = df.rename(columns=lower_map)

    # Find the text and label columns by alias (first match in column order)
    text_col  = next((c for c in df.columns if c in TEXT_ALIASES), None)
    label_col = next((c for c in df.columns if c in LABEL_ALIASES), None)
    if text_col is None or label_col is None:
        raise ValueError(f"Text/label column not found. Available columns: {list(df.columns)}")

    # Keep only the two required columns. Drop missing values BEFORE casting to str:
    # otherwise NaN becomes the string "nan" and dropna() no longer removes it.
    out = df[[text_col, label_col]].rename(columns={text_col:"text", label_col:"label"}).dropna().copy()
    out["text"] = out["text"].astype(str).str.strip()
    out["label"] = out["label"].astype(str).str.strip().str.lower()
    out = out.drop_duplicates(subset=["text"])
    return out

def convert_vsmec_excels_to_csv(train_xls, valid_xls, test_xls):
    train_df = pd.read_excel(train_xls)
    valid_df = pd.read_excel(valid_xls)
    test_df  = pd.read_excel(test_xls)

    train = standardize_columns(train_df)
    valid = standardize_columns(valid_df)
    test  = standardize_columns(test_df)

    allowed6 = {"anger","disgust","fear","happiness","sadness","surprise"}
    for name, df in [("train",train),("valid",valid),("test",test)]:
        bad = set(df["label"]) - allowed6
        if bad:
            print(f"[Warning] The {name} split has unexpected labels: {bad}; map/correct them before training.")

    train.to_csv(OUT_TRAIN_CSV, index=False, encoding="utf-8-sig")
    valid.to_csv(OUT_VALID_CSV, index=False, encoding="utf-8-sig")
    test.to_csv(OUT_TEST_CSV,  index=False, encoding="utf-8-sig")

    print("✅ Saved:", OUT_TRAIN_CSV, OUT_VALID_CSV, OUT_TEST_CSV)
    print("\nLabel distribution (train):\n", train["label"].value_counts())
    print("\nLabel distribution (valid):\n", valid["label"].value_counts())
    print("\nLabel distribution (test):\n",  test["label"].value_counts())

convert_vsmec_excels_to_csv(INPUT_TRAIN_XLS, INPUT_VALID_XLS, INPUT_TEST_XLS)


[Warning] The train split has unexpected labels: {'other', 'enjoyment'}; map/correct them before training.
[Warning] The valid split has unexpected labels: {'other', 'enjoyment'}; map/correct them before training.
[Warning] The test split has unexpected labels: {'other', 'enjoyment'}; map/correct them before training.
✅ Saved: /content/drive/MyDrive/uit_vsmec_train.csv /content/drive/MyDrive/uit_vsmec_valid.csv /content/drive/MyDrive/uit_vsmec_test.csv

Label distribution (train):
 label
enjoyment    1558
disgust      1070
other        1020
sadness       947
anger         391
fear          316
surprise      242
Name: count, dtype: int64

Label distribution (valid):
 label
enjoyment    214
other        141
disgust      135
sadness       86
anger         49
fear          31
surprise      30
Name: count, dtype: int64

Label distribution (test):
 label
enjoyment    193
disgust      131
other        129
sadness      116
fear          46
anger         40
surprise      37
Name: count, dtype: int64


In [None]:
import pandas as pd

def fix_labels_6(df):
    df = df.copy()
    # Map onto the 6 standard classes: UIT-VSMEC's "enjoyment" becomes "happiness"
    map6 = {"enjoyment": "happiness"}
    df["label"] = df["label"].replace(map6)
    # Keep only the 6 target classes; this drops "other" and any other unexpected label
    allowed6 = {"anger","disgust","fear","happiness","sadness","surprise"}
    df = df[df["label"].isin(allowed6)]
    return df

# Google Drive paths
train_path = "/content/drive/MyDrive/data for sen/uit_vsmec_train.csv"
valid_path = "/content/drive/MyDrive/data for sen/uit_vsmec_valid.csv"
test_path  = "/content/drive/MyDrive/data for sen/uit_vsmec_test.csv"

train = pd.read_csv(train_path)
valid = pd.read_csv(valid_path)
test  = pd.read_csv(test_path)

train = fix_labels_6(train)
valid = fix_labels_6(valid)
test  = fix_labels_6(test)

print("Distribution (train):\n", train["label"].value_counts())
print("Distribution (valid):\n", valid["label"].value_counts())
print("Distribution (test):\n",  test["label"].value_counts())

train.to_csv(train_path, index=False, encoding="utf-8-sig")
valid.to_csv(valid_path, index=False, encoding="utf-8-sig")
test.to_csv(test_path,  index=False, encoding="utf-8-sig")
print("✅ Updated the 3 CSV files for Stage 1.")


Distribution (train):
 label
happiness    1558
disgust      1070
sadness       947
anger         391
fear          316
surprise      242
Name: count, dtype: int64
Distribution (valid):
 label
happiness    214
disgust      135
sadness       86
anger         49
fear          31
surprise      30
Name: count, dtype: int64
Distribution (test):
 label
happiness    193
disgust      131
sadness      116
fear          46
anger         40
surprise      37
Name: count, dtype: int64
✅ Updated the 3 CSV files for Stage 1.
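
As a sanity check on the label fix, the same remapping can be applied to the label counts alone. The sketch below is illustrative only: `remap_counts`, `RENAME` and `ALLOWED6` are hypothetical names, and the counts are the train-split figures printed by the conversion step above. The resulting total of 4524 matches the number of training examples reported by the `Map` progress bar in Stage 1.

```python
from collections import Counter

# Train-split label counts as printed by the conversion step (7 raw labels).
raw_train = Counter({
    "enjoyment": 1558, "disgust": 1070, "other": 1020, "sadness": 947,
    "anger": 391, "fear": 316, "surprise": 242,
})

RENAME = {"enjoyment": "happiness"}
ALLOWED6 = {"anger", "disgust", "fear", "happiness", "sadness", "surprise"}

def remap_counts(counts):
    """Rename labels, then keep only the six target classes."""
    out = Counter()
    for label, n in counts.items():
        label = RENAME.get(label, label)
        if label in ALLOWED6:  # anything else (e.g. "other") is dropped
            out[label] += n
    return out

fixed_train = remap_counts(raw_train)
print(dict(fixed_train))
print("total:", sum(fixed_train.values()))  # 4524
```
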


## 3) (Optional) Merge 6 emotions → 3 polarity labels (neg/neu/pos)

In [None]:
import pandas as pd

# Google Drive paths
train_path = "/content/drive/MyDrive/data for sen/uit_vsmec_train.csv"
valid_path = "/content/drive/MyDrive/data for sen/uit_vsmec_valid.csv"
test_path  = "/content/drive/MyDrive/data for sen/uit_vsmec_test.csv"

train_out = "/content/drive/MyDrive/data for sen/uit_vsmec_train_pol3.csv"
valid_out = "/content/drive/MyDrive/data for sen/uit_vsmec_valid_pol3.csv"
test_out  = "/content/drive/MyDrive/data for sen/uit_vsmec_test_pol3.csv"

# Read the 6-class CSVs
train = pd.read_csv(train_path)
valid = pd.read_csv(valid_path)
test  = pd.read_csv(test_path)

# Map emotions to polarity
map_to_pol = {
    "happiness": "pos",
    "sadness": "neg", "anger": "neg", "disgust": "neg", "fear": "neg",
    "surprise": "neu"
}

train_pol = train.assign(label=train["label"].map(map_to_pol))
valid_pol = valid.assign(label=valid["label"].map(map_to_pol))
test_pol  = test.assign(label=test["label"].map(map_to_pol))

# Save to Google Drive
train_pol.to_csv(train_out, index=False, encoding="utf-8-sig")
valid_pol.to_csv(valid_out, index=False, encoding="utf-8-sig")
test_pol.to_csv(test_out,  index=False, encoding="utf-8-sig")

print("✅ Saved:", train_out, valid_out, test_out)
print("\nLabel distribution (train_pol):\n", train_pol["label"].value_counts())


✅ Saved: /content/drive/MyDrive/data for sen/uit_vsmec_train_pol3.csv /content/drive/MyDrive/data for sen/uit_vsmec_valid_pol3.csv /content/drive/MyDrive/data for sen/uit_vsmec_test_pol3.csv

Label distribution (train_pol):
 label
neg    2724
pos    1558
neu     242
Name: count, dtype: int64
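
The polarity counts above follow directly from the 6-class train distribution. The sketch below reproduces the arithmetic with plain Python. It uses the same emotion-to-polarity mapping as `map_to_pol`; the variable names are only for illustration. It also shows how small the `neu` class is (it contains only `surprise`), which is worth keeping in mind when reading macro-averaged metrics.

```python
from collections import Counter

# 6-class train distribution printed after the label fix.
emotion_counts = {
    "happiness": 1558, "disgust": 1070, "sadness": 947,
    "anger": 391, "fear": 316, "surprise": 242,
}

# Same mapping as map_to_pol in the cell above.
MAP_TO_POL = {
    "happiness": "pos",
    "sadness": "neg", "anger": "neg", "disgust": "neg", "fear": "neg",
    "surprise": "neu",
}

polarity_counts = Counter()
for emotion, n in emotion_counts.items():
    polarity_counts[MAP_TO_POL[emotion]] += n

total = sum(polarity_counts.values())
for label in ("neg", "pos", "neu"):
    share = 100 * polarity_counts[label] / total
    print(f"{label}: {polarity_counts[label]} ({share:.1f}%)")
```
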


## 4) Stage 1 – Fine-tune PhoBERT on UIT-VSMEC (6 classes)

In [None]:
# eval_strategy (used in TrainingArguments below) requires transformers >= 4.41
!pip install -U "transformers>=4.41,<5"




In [None]:
# !pip install -U transformers datasets evaluate accelerate

from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
from datasets import load_dataset
import evaluate, numpy as np

MODEL_NAME = "vinai/phobert-base"
MAX_LEN = 256
emo_labels = ["anger","disgust","fear","happiness","sadness","surprise"]
emo2id = {l:i for i,l in enumerate(emo_labels)}

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=False)

# If the files are on Google Drive:
train_csv = "/content/drive/MyDrive/data for sen/uit_vsmec_train.csv"
valid_csv = "/content/drive/MyDrive/data for sen/uit_vsmec_valid.csv"

vsmec = load_dataset("csv", data_files={"train": train_csv, "valid": valid_csv})

def preprocess_vsmec(batch):
    enc = tokenizer(batch["text"], truncation=True, padding="max_length", max_length=MAX_LEN)
    # any remaining unknown label raises KeyError; make sure the data has been fixed to 6 classes
    enc["labels"] = [emo2id[x] for x in batch["label"]]
    return enc

vsmec_enc = vsmec.map(preprocess_vsmec, batched=True, remove_columns=vsmec["train"].column_names)

metric_f1 = evaluate.load("f1")

def compute_metrics(p):
    preds = np.argmax(p.predictions, axis=1)
    f1_macro = metric_f1.compute(predictions=preds, references=p.label_ids, average="macro")["f1"]
    return {"f1_macro": f1_macro}

model_stage1 = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=len(emo_labels),
    id2label={i:l for i,l in enumerate(emo_labels)},
    label2id={l:i for i,l in enumerate(emo_labels)},
)

args1 = TrainingArguments(
    output_dir="/content/drive/MyDrive/runs/stage1_vsmec",
    learning_rate=2e-5,
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    eval_strategy="epoch",   # correct parameter name for transformers >= 4.41 (formerly evaluation_strategy)
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1_macro",
    greater_is_better=True,
    logging_steps=50,
    report_to="none",
)

trainer1 = Trainer(
    model=model_stage1,
    args=args1,
    train_dataset=vsmec_enc["train"],
    eval_dataset=vsmec_enc["valid"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

trainer1.train()

save_dir = "/content/drive/MyDrive/models/stage1_vsmec"
trainer1.save_model(save_dir)
tokenizer.save_pretrained(save_dir)
print(f"✅ Saved to {save_dir}")


Map:   0%|          | 0/4524 [00:00<?, ? examples/s]

Map:   0%|          | 0/545 [00:00<?, ? examples/s]

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at vinai/phobert-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  trainer1 = Trainer(


Epoch,Training Loss,Validation Loss,F1 Macro
1,1.154,1.022282,0.421996
2,0.8645,0.959774,0.563965
3,0.7026,0.907575,0.583089


✅ Saved to /content/drive/MyDrive/models/stage1_vsmec
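
The model is selected on macro F1 rather than accuracy, which matters here because the classes are imbalanced (for example 1558 `happiness` versus 242 `surprise` training examples). The following is a minimal pure-Python sketch of macro F1, written only to illustrate the metric; `macro_f1` is a hypothetical helper, not the `evaluate`/scikit-learn implementation used above. One assumption to note: it averages over all `num_classes` classes and scores a class with no true and no predicted examples as 0.

```python
def macro_f1(y_true, y_pred, num_classes):
    """Unweighted mean of per-class F1 scores."""
    f1_scores = []
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / num_classes

# Toy example: the single class-1 example is missed entirely.
y_true = [0, 0, 0, 0, 1, 2]
y_pred = [0, 0, 0, 0, 0, 2]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy = {accuracy:.3f}")
print(f"macro F1 = {macro_f1(y_true, y_pred, num_classes=3):.3f}")
```

In this toy case accuracy is 5/6 while macro F1 is noticeably lower, because the missed minority class contributes an F1 of 0 to the average.
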


## 5) Stage 2 – Adapt to the news domain (3 polarity classes)

In [None]:
# !pip install -U "transformers>=4.41,<5" datasets evaluate accelerate scikit-learn

from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
from datasets import load_dataset
import evaluate, numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import torch



# NEWS_CSV = "/content/drive/MyDrive/data for sen/News_Sentiment_3000.csv"
NEWS_CSV = "/content/drive/MyDrive/data for sen/News_Sentiment_3000__balanced_.csv"
STAGE1_DIR = "/content/drive/MyDrive/models/stage1_vsmec"
RUNS_DIR   = "/content/drive/MyDrive/runs/stage2_news"
OUT_DIR    = "/content/drive/MyDrive/models/news_sentiment_pol3"

# ====== Configuration ======
pol_labels = ["neg","neu","pos"]
pol2id = {l:i for i,l in enumerate(pol_labels)}
id2pol = {i:l for l,i in pol2id.items()}
MAX_LEN = 256

# ====== Tokenizer from Stage 1 ======
tokenizer = AutoTokenizer.from_pretrained(STAGE1_DIR, use_fast=False)

# ====== Data preparation ======
df_news = pd.read_csv(NEWS_CSV)

# Normalize labels to neg/neu/pos (case-insensitive)
label_map = {"negative":"neg","neutral":"neu","positive":"pos",
             "neg":"neg","neu":"neu","pos":"pos"}
raw_label = df_news["label"].copy()
norm_label = raw_label.astype(str).str.strip().str.lower()
df_news["label"] = norm_label.map(label_map)

# Validate labels here: unmapped values become NaN and would be silently dropped by dropna() below
bad = set(norm_label[df_news["label"].isna() & raw_label.notna()])
if bad:
    raise ValueError(f"Unexpected labels after normalization: {bad}. Valid: {set(pol_labels)}")

# Make sure the text column is named "text"
if "text" not in df_news.columns:
    if "content" in df_news.columns:
        df_news = df_news.rename(columns={"content":"text"})
    elif "title" in df_news.columns:
        df_news["text"] = df_news["title"].astype(str)
    else:
        raise ValueError("Text column not found. Need 'text', 'content' or 'title'.")

# Minimal cleaning
df_news = df_news.dropna(subset=["text","label"]).drop_duplicates(subset=["text"]).reset_index(drop=True)

print("Label distribution after normalization:\n", df_news["label"].value_counts())

# Stratify split 80/10/10
df_train, df_temp = train_test_split(df_news, test_size=0.2, random_state=42, stratify=df_news["label"])
df_valid, df_test = train_test_split(df_temp, test_size=0.5, random_state=42, stratify=df_temp["label"])

train_csv = "/content/drive/MyDrive/data for sen/news_labeled_train.csv"
valid_csv = "/content/drive/MyDrive/data for sen/news_labeled_valid.csv"
test_csv  = "/content/drive/MyDrive/data for sen/news_labeled_test.csv"

df_train.to_csv(train_csv, index=False, encoding="utf-8-sig")
df_valid.to_csv(valid_csv, index=False, encoding="utf-8-sig")
df_test.to_csv(test_csv,   index=False, encoding="utf-8-sig")

news = load_dataset("csv", data_files={"train": train_csv, "valid": valid_csv, "test": test_csv})

def preprocess_news(batch):
    enc = tokenizer(batch["text"], truncation=True, padding="max_length", max_length=MAX_LEN)
    enc["labels"] = [pol2id[x] for x in batch["label"]]
    return enc

news_enc = news.map(preprocess_news, batched=True, remove_columns=news["train"].column_names)

# ====== Metric ======
metric_f1 = evaluate.load("f1")
def compute_metrics(p):
    preds = np.argmax(p.predictions, axis=1)
    f1_macro = metric_f1.compute(predictions=preds, references=p.label_ids, average="macro")["f1"]
    return {"f1_macro": f1_macro}

# ====== Model (initialized from Stage 1, with a 3-class head) ======
model_stage2 = AutoModelForSequenceClassification.from_pretrained(
    STAGE1_DIR,
    num_labels=3,
    id2label=id2pol,
    label2id=pol2id,
    ignore_mismatched_sizes=True   # skip the head size mismatch 6 -> 3 (out_proj is re-initialized)
)

# Enable fp16 when a GPU is available (faster training)
use_fp16 = torch.cuda.is_available()

args2 = TrainingArguments(
    output_dir=RUNS_DIR,
    learning_rate=1e-5,
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=32,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1_macro",
    greater_is_better=True,
    weight_decay=0.01,
    warmup_ratio=0.1,
    fp16=use_fp16,
    logging_steps=50,
    report_to="none",
    seed=42,
)

trainer2 = Trainer(
    model=model_stage2,
    args=args2,
    train_dataset=news_enc["train"],
    eval_dataset=news_enc["valid"],
    tokenizer=tokenizer,           # the deprecation warning can be ignored
    compute_metrics=compute_metrics
)

trainer2.train()
print("✅ Eval (test):", trainer2.evaluate(news_enc["test"]))

trainer2.save_model(OUT_DIR)
tokenizer.save_pretrained(OUT_DIR)
print(f"✅ Saved to {OUT_DIR}")


Label distribution after normalization:
 label
pos    1000
neg    1000
neu    1000
Name: count, dtype: int64


Generating train split: 0 examples [00:00, ? examples/s]

Generating valid split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

Map:   0%|          | 0/2400 [00:00<?, ? examples/s]

Map:   0%|          | 0/300 [00:00<?, ? examples/s]

Map:   0%|          | 0/300 [00:00<?, ? examples/s]

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at /content/drive/MyDrive/models/stage1_vsmec and are newly initialized because the shapes did not match:
- classifier.out_proj.bias: found shape torch.Size([6]) in the checkpoint and torch.Size([3]) in the model instantiated
- classifier.out_proj.weight: found shape torch.Size([6, 768]) in the checkpoint and torch.Size([3, 768]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  trainer2 = Trainer(


Epoch,Training Loss,Validation Loss,F1 Macro
1,0.0105,0.006112,1.0
2,0.0046,0.002831,1.0
3,0.0036,0.002278,1.0


✅ Eval (test): {'eval_loss': 0.006115557160228491, 'eval_f1_macro': 1.0, 'eval_runtime': 1.3043, 'eval_samples_per_second': 230.015, 'eval_steps_per_second': 7.667, 'epoch': 3.0}
✅ Saved to /content/drive/MyDrive/models/news_sentiment_pol3
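
The 80/10/10 stratified split explains the 2400/300/300 example counts in the `Map` progress output above: each of the three balanced classes contributes 800/100/100 rows. The sketch below illustrates the idea in plain Python. It is an assumption-laden illustration, not scikit-learn's `train_test_split` algorithm, and `stratified_split` is a hypothetical helper, so the exact rows selected will differ from the notebook's split.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, fractions=(0.8, 0.1, 0.1), seed=42):
    """Split indices so every label keeps roughly the same share in each part."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    splits = [[] for _ in fractions]
    for label in sorted(by_label):
        idxs = by_label[label]
        rng.shuffle(idxs)
        start = 0
        for k, frac in enumerate(fractions):
            # The last part takes the remainder so no example is lost to rounding.
            is_last = k == len(fractions) - 1
            end = len(idxs) if is_last else start + round(frac * len(idxs))
            splits[k].extend(idxs[start:end])
            start = end
    return splits

# Same shape as the balanced news data: 1000 rows per class.
labels = ["neg"] * 1000 + ["neu"] * 1000 + ["pos"] * 1000
train_idx, valid_idx, test_idx = stratified_split(labels)

print(len(train_idx), len(valid_idx), len(test_idx))
print(Counter(labels[i] for i in valid_idx))
```
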


## 6) (Optional) Pseudo-label unlabeled news data

In [None]:

import torch, pandas as pd
from transformers import AutoTokenizer, AutoModelForSequenceClassification

UNLABELED_CSV = "news_unlabeled.csv"  # replace with your own file, if you have one
MODEL_DIR = "/content/drive/MyDrive/models/news_sentiment_pol3"  # Stage 2 OUT_DIR
CONF_TH = 0.9
pol_labels = ["neg","neu","pos"]

# Only the CSV read is guarded: a missing model directory should fail loudly
try:
    df_u = pd.read_csv(UNLABELED_CSV)
except FileNotFoundError:
    df_u = None
    print("UNLABELED_CSV not found. Skipping the pseudo-labeling step.")

if df_u is not None:
    infer_tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR, use_fast=False)
    infer_model = AutoModelForSequenceClassification.from_pretrained(MODEL_DIR)
    infer_model.eval()
    softmax = torch.nn.Softmax(dim=-1)

    text_col = "text" if "text" in df_u.columns else "content"
    keep_rows = []
    with torch.no_grad():
        # skip empty rows so the tokenizer always receives a string
        for text in df_u[text_col].dropna().astype(str):
            enc = infer_tokenizer(text, truncation=True, padding="max_length", max_length=256, return_tensors="pt")
            logits = infer_model(**enc).logits
            prob = softmax(logits)[0]
            conf, pred = torch.max(prob, dim=0)
            # keep only predictions at or above the confidence threshold
            if conf.item() >= CONF_TH:
                keep_rows.append({"text": text, "label": pol_labels[pred.item()], "conf": float(conf.item())})

    if keep_rows:
        pd.DataFrame(keep_rows).to_csv("news_pseudo.csv", index=False, encoding="utf-8-sig")
        print(f"Saved news_pseudo.csv ({len(keep_rows)} samples, conf>={CONF_TH})")
    else:
        print("No pseudo-labeled sample reached the threshold. Lower CONF_TH or add more data.")
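
The filtering rule in this cell is: convert logits to probabilities with softmax, then keep a prediction only if its top probability reaches `CONF_TH`. The sketch below isolates that rule in plain Python so it can be inspected without loading the model. The `softmax` and `pseudo_label` functions and the example logits are made up for illustration.

```python
import math

POL_LABELS = ["neg", "neu", "pos"]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pseudo_label(logits, threshold=0.9):
    """Return (label, confidence), or None when the model is not confident enough."""
    probs = softmax(logits)
    conf = max(probs)
    if conf < threshold:
        return None
    return POL_LABELS[probs.index(conf)], conf

# A peaked distribution passes the threshold, a flat one is rejected.
print(pseudo_label([4.0, 0.5, 0.2]))
print(pseudo_label([1.0, 0.8, 0.9]))
```

A higher threshold yields fewer but cleaner pseudo-labels; a lower one yields more data at the cost of more label noise.
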


## 7) Inference for long news articles (aggregate per-segment probabilities)

In [None]:

import torch, numpy as np
from transformers import AutoTokenizer, AutoModelForSequenceClassification

LABELS = ["neg","neu","pos"]

class NewsSentimentClassifier:
    def __init__(self, model_dir="/content/drive/MyDrive/models/news_sentiment_pol3", max_len=256):
        self.tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=False)
        self.model = AutoModelForSequenceClassification.from_pretrained(model_dir)
        self.model.eval()
        self.max_len = max_len
        self.softmax = torch.nn.Softmax(dim=-1)

    def _chunk_text(self, text, max_tokens=256):
        # Split into segments on newlines; fall back to splitting on ". " when there are fewer than 2 segments.
        # max_tokens is currently unused: each segment is truncated by the tokenizer in predict().
        parts = [p.strip() for p in text.split("\n") if p.strip()]
        if len(parts) < 2:
            parts = [p.strip() for p in text.split(". ") if p.strip()]
        return parts[:6] if parts else [text]  # keep at most 6 segments

    def predict(self, title, content):
        segments = [title] + self._chunk_text(content, self.max_len)
        prob_sum = np.zeros(len(LABELS), dtype=float)
        with torch.no_grad():
            for seg in segments:
                enc = self.tokenizer(seg, truncation=True, padding="max_length", max_length=self.max_len, return_tensors="pt")
                logits = self.model(**enc).logits
                prob = self.softmax(logits)[0].numpy()
                prob_sum += prob
        pred_id = int(prob_sum.argmax())
        return LABELS[pred_id], (prob_sum / prob_sum.sum()).tolist()
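
The aggregation step of `predict` can be examined without loading the model: per-segment probability vectors are summed, the sum is renormalized, and the arg-max gives the article label. The sketch below mirrors that logic and the chunking rule in plain Python. `chunk_text` and `aggregate` are standalone stand-ins for the class methods, and the per-segment probabilities are invented for illustration.

```python
LABELS = ["neg", "neu", "pos"]

def chunk_text(text, max_segments=6):
    """Split on newlines; fall back to '. ' when fewer than 2 segments result."""
    parts = [p.strip() for p in text.split("\n") if p.strip()]
    if len(parts) < 2:
        parts = [p.strip() for p in text.split(". ") if p.strip()]
    return parts[:max_segments] if parts else [text]

def aggregate(segment_probs):
    """Sum per-segment probabilities, renormalize, and pick the top label."""
    totals = [sum(column) for column in zip(*segment_probs)]
    norm = sum(totals)
    probs = [t / norm for t in totals]
    return LABELS[probs.index(max(probs))], probs

print(chunk_text("First paragraph.\nSecond paragraph."))
print(chunk_text("One sentence. Another sentence. A third"))

# Made-up probabilities for a title and two body segments.
segment_probs = [
    [0.7, 0.2, 0.1],
    [0.2, 0.5, 0.3],
    [0.6, 0.3, 0.1],
]
print(aggregate(segment_probs))
```

Because every segment carries equal weight, the title counts as much as any body segment, and text beyond the first 6 segments is ignored.
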



---
**Done.** Run the cells in order from top to bottom. To customize paths, edit the constants at the top of each cell.