# **Final Project**

## **Problem statement:**

The widespread dissemination of fake news and propaganda presents serious societal risks, including the erosion of public trust, political polarization, manipulation of elections, and the spread of harmful misinformation during crises such as pandemics or conflicts. From an NLP perspective, detecting fake news is fraught with challenges. Linguistically, fake news often mimics the tone and structure of legitimate journalism, making it difficult to distinguish using surface-level features. The absence of reliable and up-to-date labeled datasets, especially across multiple languages and regions, hampers the effectiveness of supervised learning models. Additionally, the dynamic and adversarial nature of misinformation means that malicious actors constantly evolve their language and strategies to bypass detection systems. Cultural context, sarcasm, satire, and implicit bias further complicate automated analysis. Moreover, NLP models risk amplifying biases present in training data, leading to unfair classifications and potential censorship of legitimate content. These challenges underscore the need for cautious, context-aware approaches, as the failure to address them can inadvertently contribute to misinformation, rather than mitigate it.



Use the datasets at https://drive.google.com/drive/folders/1mrX3vPKhEzxG96OCPpCeh9F8m_QKCM4z?usp=sharing
to complete the requirements below.

## **About the dataset:**

* **True Articles**:

  * **File**: `MisinfoSuperset_TRUE.csv`
  * **Sources**:

    * Reputable media outlets like **Reuters**, **The New York Times**, **The Washington Post**, etc.

* **Fake/Misinformation/Propaganda Articles**:

  * **File**: `MisinfoSuperset_FAKE.csv`
  * **Sources**:

    * **American right-wing extremist websites** (e.g., Redflag Newsdesk, Breitbart, Truth Broadcast Network)
    * **Public dataset** from:

      * Ahmed, H., Traore, I., & Saad, S. (2017): "Detection of Online Fake News Using N-Gram Analysis and Machine Learning Techniques" *(Springer LNCS 10618)*



## **Requirement**

A team consisting of three members must complete a project that involves applying the methods learned from the beginning of the course up to the present. The team is expected to follow and document the entire machine learning workflow, which includes the following steps:

1. **Data Preprocessing**: Clean and prepare the dataset (handle missing values, duplicates, noisy text, etc.).

2. **Exploratory Data Analysis (EDA)**: Explore and visualize the data.

3. **Model Building**: Select and build one or more machine learning models suitable for the problem at hand.

4. **Hyperparameter Setup**: Set and adjust the model's hyperparameters using appropriate methods to improve performance.

5. **Model Training**: Train the model(s) on the training dataset.

6. **Performance Evaluation**: Evaluate the trained model(s) using appropriate metrics (e.g., accuracy, precision, recall, F1-score, confusion matrix, etc.) and validate their performance on unseen data.

7. **Conclusion**: Summarize the results, discuss the model's strengths and weaknesses, and suggest possible improvements or future work.





# Read dataset

In [3]:
!pip install contractions


In [4]:
!pip install langdetect



In [5]:
!pip install emoji



## Import libraries

In [10]:
import html
import os
import quopri
import re
from collections import Counter

import emoji
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from langdetect import detect, LangDetectException
from tqdm import tqdm

import contractions
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.optim import AdamW
from torch.utils.data import Dataset, DataLoader
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    BertForSequenceClassification,
    BertTokenizer,
    RobertaForSequenceClassification,
    XLNetForSequenceClassification,
    get_scheduler,
)

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import (
    accuracy_score,
    auc,
    classification_report,
    confusion_matrix,
    roc_curve,
)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer

import matplotlib.pyplot as plt
import seaborn as sns

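import csv
import pickle

# Needed by the EDA cells below (wordcloud, textblob and textstat may need `pip install`)
import textstat
from sklearn.feature_extraction.text import CountVectorizer
from textblob import TextBlob
from wordcloud import WordCloud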


In [11]:
true_df = pd.read_csv("/kaggle/input/misinfo/DataSet_Misinfo_TRUE.csv")
true_df

Unnamed: 0.1,Unnamed: 0,text
0,0,The head of a conservative Republican faction ...
1,1,Transgender people will be allowed for the fir...
2,2,The special counsel investigation of links bet...
3,3,Trump campaign adviser George Papadopoulos tol...
4,4,President Donald Trump called on the U.S. Post...
...,...,...
34970,34970,Most conservatives who oppose marriage equalit...
34971,34971,The freshman senator from Georgia quoted scrip...
34972,34972,The State Department told the Republican Natio...
34973,34973,"ADDIS ABABA, Ethiopia —President Obama convene..."


In [12]:
fake_df = pd.read_csv("/kaggle/input/misinfo/DataSet_Misinfo_FAKE.csv")
fake_df

Unnamed: 0.1,Unnamed: 0,text
0,0,Donald Trump just couldn t wish all Americans ...
1,1,House Intelligence Committee Chairman Devin Nu...
2,2,"On Friday, it was revealed that former Milwauk..."
3,3,"On Christmas day, Donald Trump announced that ..."
4,4,Pope Francis used his annual Christmas Day mes...
...,...,...
43637,44422,The USA wants to divide Syria.\r\n\r\nGreat Br...
43638,44423,The Ukrainian coup d'etat cost the US nothing ...
43639,44424,The European Parliament falsifies history by d...
43640,44425,The European Parliament falsifies history by d...


In [13]:
# Drop the redundant index column
true_df = true_df.drop('Unnamed: 0', axis=1)
fake_df = fake_df.drop('Unnamed: 0', axis=1)

In [None]:
fake_df.info()

In [None]:
true_df.describe()

In [None]:
fake_df.describe()

# Data Preprocessing

- Handle null values

In [None]:
true_df.isnull().sum()

In [None]:
fake_df.isnull().sum()

In [14]:
true_df = true_df.dropna()

- Handle duplicate rows

In [None]:
true_df.duplicated().sum()

In [None]:
fake_df.duplicated().sum()

In [15]:
true_df = true_df.drop_duplicates()
fake_df = fake_df.drop_duplicates()

- Add labels and merge the two datasets

In [16]:
true_df['label'] = 1
fake_df['label'] = 0

df = pd.concat([true_df, fake_df], ignore_index=True)
df

Unnamed: 0,text,label
0,The head of a conservative Republican faction ...,1
1,Transgender people will be allowed for the fir...,1
2,The special counsel investigation of links bet...,1
3,Trump campaign adviser George Papadopoulos tol...,1
4,President Donald Trump called on the U.S. Post...,1
...,...,...
68599,"Apparently, the new Kyiv government is in a hu...",0
68600,The USA wants to divide Syria.\r\n\r\nGreat Br...,0
68601,The Ukrainian coup d'etat cost the US nothing ...,0
68602,The European Parliament falsifies history by d...,0


In [17]:
df = df.sample(frac=1, random_state=42).reset_index(drop=True) # Shuffle dataset
df

Unnamed: 0,text,label
0,Former Russian economy minister Alexei Ulyukay...,1
1,Republicans were just given a leg up over Demo...,0
2,This has to be one of the best remix videos ev...,0
3,"In line with the new Language Law, Russian is ...",0
4,JERUSALEM — A day after approving the const...,1
...,...,...
68599,The Super Bowl had not yet begun and Trump fan...,0
68600,U.S. House Republicans on Friday won passage o...,1
68601,Share on Facebook Share on Twitter Known to th...,0
68602,A New Jersey man who worked at the World Trade...,1


* Check for class imbalance

In [None]:
df['label'].value_counts()

=> The data is not imbalanced.

- Handle non-English texts

In [18]:
def safe_detect(x):
    if isinstance(x, str) and x.strip() and len(x.strip()) > 20:
        try:
            return detect(x)
        except LangDetectException:
            return 'unknown'
    return 'unknown'

df['lang'] = df['text'].apply(safe_detect)
non_english = df[df['lang'] != 'en']
print(non_english)

                                                    text  label     lang
127                                   Florida for Trump!      0  unknown
253    0 комментариев 0 поделились Фото: AP \nКоммент...      0       ru
294    0 комментариев 7 поделились \n"Это полный бред...      0       ru
297    +++ Beim Jupiter! Spuren römischer Zivilisatio...      0       de
433    0 комментариев 0 поделились \n23 октября в Ниж...      0       ru
...                                                  ...    ...      ...
68104  Mittwoch, 16. November 2016 Neue App ruft auto...      0       de
68347  +++ Muhten ihm einiges zu: Bauer soll Streit u...      0       de
68388  — The Sun (@TheSun) 23. November 2016 Laut Fer...      0       de
68406                       President Obama is a Muslim.      0       ca
68444  Страна: Китай Заявления КНДР о завершении свое...      0       ru

[646 rows x 3 columns]


In [19]:
df = df.drop(columns="lang")  # Drop the helper column after processing

=> Non-English rows are kept, because the majority of them have label 0 (fake news).
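The decision above rests on how the non-English rows are distributed across labels. A minimal sketch of that check follows; the `toy` frame and its values are made up for illustration and merely stand in for `df` while it still has the `lang` column.

```python
import pandas as pd

# Toy stand-in for the notebook's df with the 'lang' column (values are invented)
toy = pd.DataFrame({
    "lang":  ["en", "en", "ru", "de", "unknown", "en", "ru"],
    "label": [1, 0, 0, 0, 0, 1, 0],
})

# Rows = detected language, columns = label (0 = fake, 1 = true)
non_english = toy[toy["lang"] != "en"]
lang_by_label = pd.crosstab(non_english["lang"], non_english["label"])
print(lang_by_label)

# Share of non-English rows that are labelled fake
fake_share = (non_english["label"] == 0).mean()
print(f"fake share among non-English rows: {fake_share:.2f}")
```

On the real data, a share close to 1 would mean that language alone is a strong hint for the fake label, which is worth keeping in mind when interpreting model performance.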

- Handle texts with fewer than 5 words

In [20]:
# Count the words in each row and select rows with fewer than 5 words
short_texts = df[df['text'].apply(lambda x: len(str(x).split()) < 5)]

# Print these rows
print(short_texts)

                                                  text  label
127                                 Florida for Trump!      0
288    A MUST watch video!https://youtu.be/-5Z-jJ2Z4bU      0
772                                               Cool      0
965                    That would be unconstitutional.      0
1115                   Around 120,000 displaced people      1
...                                                ...    ...
67547                           TRUMP VICTORY FOR SURE      0
67689                                        Brilliant      0
67766                                  Good guy.\n👍👍👍👍      0
67797      https://www.youtube.com/watch?v=gqxwF-TeYas      0
67834                                        Horseshit      0

[170 rows x 2 columns]


In [21]:
short_texts[short_texts['label']==1]

Unnamed: 0,text,label
1115,"Around 120,000 displaced people",1
20229,Republican Congressman Will Hurd,1
24713,Ted Cruz,1
26250,Four U.S. senators,1
28034,“On 1/20,1
31892,No.,1
40323,(Reuters),1
57596,Jan 29 (Reuters),1
65117,advertisement,1


=> Rows whose text has fewer than 5 words carry little meaning, so they are removed.

In [22]:
# Remove rows with fewer than 5 words
df = df[df['text'].apply(lambda x: len(str(x).split()) >= 5)]

## Clean text

- Clean the text (lowercasing, punctuation removal, stopwords, stemming, ...) + tokenizer

In [23]:
# Only needed on the first run
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [24]:
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

In [25]:
def clean_text(row):
    row = str(row).lower()

    # Remove email headers
    row = re.sub(r'(?i)\b(from|to|cc|bcc|subject|date|return-path|message-id|thread-topic|thread-index|content-type|mime-version|boundary|received|x-[\w-]+):.*', ' ', row)

    # Remove mailto links
    row = re.sub(r'mailto:[^\s]+', ' ', row)

    # Decode quoted-printable
    row = quopri.decodestring(row.encode('utf-8')).decode('utf-8', errors='ignore')

    # Unescape HTML entities
    row = html.unescape(row)

    # Strip HTML tags
    if '<' in row and '>' in row:
        row = BeautifulSoup(row, "lxml").get_text()

    # Normalize
    row = re.sub(r'[\t\r\n]', ' ', row)
    row = re.sub(r'[_~+\-]{2,}', ' ', row)
    row = re.sub(r"[<>()|&©ø%\[\]\\~*\$€£¥]", ' ', row)
    row = re.sub(r"\\x[0-9a-fA-F]{2}", ' ', row)
    row = re.sub(r'(https?://)([^/\s]+)([^\s]*)', r'\2', row)
    row = re.sub(r'[a-f0-9]{16,}', ' ', row)
    row = re.sub(r'([.?!])[\s]*\1+', r'\1', row)
    row = re.sub(r'\s+', ' ', row)

    # Remove code-like keywords
    row = re.sub(r'\b(function|var|return|typeof|window|document|eval|\.split)\b', ' ', row)

    # Remove programming symbols
    row = re.sub(r'[{}=<>\[\]^~|`#@*]', ' ', row)

    # Remove all emoji
    row = emoji.replace_emoji(row, replace='')

    # Cut off minified JS or base36-encoded gibberish
    code_gibberish = re.search(r'[a-z0-9]{20,}', row)
    if code_gibberish and len(row) - code_gibberish.start() > 50:
        row = row[:code_gibberish.start()]

    # Cut off JS/CDATA tail
    cutoff = re.search(
        r'(//\s*!?\s*cdata|function\s*\(|var\s+[a-zA-Z]|window\s*\.\s*|document\s*\.\s*|this\s*\.)',
        row
    )
    if cutoff and len(row) - cutoff.start() > 10:
        row = row[:cutoff.start()]

    row = re.sub(r'!+\s*cdata\s*!+', ' ', row, flags=re.IGNORECASE)

    return row.strip()

df['clean_text'] = df['text'].apply(clean_text)
df['clean_text']

0        former russian economy minister alexei ulyukay...
1        republicans were just given a leg up over demo...
2        this has to be one of the best remix videos ev...
3        in line with the new language law, russian is ...
4        jerusalem — a day after approving the construc...
                               ...                        
68599    the super bowl had not yet begun and trump fan...
68600    u.s. house republicans on friday won passage o...
68601    share on facebook share on twitter known to th...
68602    a new jersey man who worked at the world trade...
68603    turkey and iran have agreed to discuss within ...
Name: clean_text, Length: 68434, dtype: object
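To make two of the normalisation rules in `clean_text` easier to follow, the sketch below applies them in isolation. The regular expressions are copied from the function above; the helper name `normalize_urls_and_punct` and the sample string are assumptions made for illustration.

```python
import re

def normalize_urls_and_punct(text):
    """Apply two of the normalisation rules from clean_text in isolation."""
    # Keep only the host part of a URL: scheme and path are dropped
    text = re.sub(r'(https?://)([^/\s]+)([^\s]*)', r'\2', text)
    # Collapse runs of the same sentence-ending mark ("!!!" becomes "!")
    text = re.sub(r'([.?!])[\s]*\1+', r'\1', text)
    return text

sample = "a must watch video! https://youtu.be/-5Z-jJ2Z4bU wow!!! really??"
print(normalize_urls_and_punct(sample))
# a must watch video! youtu.be wow! really?
```

Keeping the host name rather than deleting the URL preserves the source domain as a token, which may itself be informative for the classifier.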

# Exploratory Data Analysis (EDA)

## Label Analysis

In [None]:
def plot_label_distribution(df, label_col):
    ax = sns.countplot(x=label_col, data=df, hue=label_col, palette='pastel', dodge=False)

    counts = df[label_col].value_counts().sort_index()
    for x, y in enumerate(counts.values):
        ax.text(x, y, f'{y}', ha='center', va='bottom', fontsize=11)

    plt.legend(title='Label', labels=sorted(df[label_col].unique()))
    plt.title('Label Distribution')
    plt.xlabel('Label')
    plt.ylabel('Count')
    plt.show()

plot_label_distribution(df, 'label')

The difference between the two labels is very small (only 448 samples), which shows that the dataset is fairly balanced between the two classes.

## Distribution Analysis

In [None]:
df['word_count'] = df['clean_text'].apply(lambda x: len(x.split()))
plt.figure(figsize=(12, 6))
sns.histplot(data=df, x='word_count', hue='label', multiple='dodge', bins=20)
plt.title('Distribution of Word Count in True vs Fake News')
plt.xlabel('Word Count')
plt.ylabel('Number of Articles')
plt.show()

Most articles (about 30,000) contain between 0 and 5,000 words, for both True News and Fake News, with label 0 (Fake News) slightly ahead. Very few articles exceed 10,000 words, so the distribution is concentrated on short to medium-length articles.

In [None]:
df['char_count'] = df['clean_text'].apply(lambda x: len(x))
plt.figure(figsize=(12, 6))
sns.histplot(data=df, x='char_count', hue='label', multiple='dodge', bins=20)
plt.title('Distribution of Character Count in True vs Fake News')
plt.xlabel('Character Count')
plt.ylabel('Number of Articles')
plt.show()

Most articles (about 30,000) contain between 0 and 20,000 characters, with label 0 (Fake News) predominating. The count drops sharply beyond 20,000 characters, so the distribution is concentrated on articles with a low to medium character count.

## Word Frequency Analysis

In [None]:
true_words = ' '.join(df[df['label'] == 1]['clean_text']).split()
true_words = set(true_words)

fake_words = ' '.join(df[df['label'] == 0]['clean_text']).split()
fake_words = set(fake_words)

common_words = true_words.intersection(fake_words)

unique_true_words = true_words - common_words
unique_fake_words = fake_words - common_words

print(f"Number of common words between true and fake news: {len(common_words)}")
print(f"Number of unique words in true news: {len(unique_true_words)}")
print(f"Number of unique words in fake news: {len(unique_fake_words)}")

In [None]:
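# Count every occurrence: true_words is a set, so Counter(true_words) would give 1 per word
true_word_freq = Counter(' '.join(df[df['label'] == 1]['clean_text']).split())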
most_common_true = true_word_freq.most_common(20)

plt.figure(figsize=(10, 6))
sns.barplot(x=[word[1] for word in most_common_true], y=[word[0] for word in most_common_true])
plt.title('Top 20 Most Common Words in True News')
plt.xlabel('Frequency')
plt.ylabel('Words')
plt.show()

In [None]:
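# Count every occurrence: fake_words is a set, so Counter(fake_words) would give 1 per word
fake_word_freq = Counter(' '.join(df[df['label'] == 0]['clean_text']).split())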
most_common_fake = fake_word_freq.most_common(20)

plt.figure(figsize=(10, 6))
sns.barplot(x=[word[1] for word in most_common_fake], y=[word[0] for word in most_common_fake])
plt.title('Top 20 Most Common Words in Fake News')
plt.xlabel('Frequency')
plt.ylabel('Words')
plt.show()

Both charts, "Top 20 Most Common Words in True News" and "Top 20 Most Common Words in Fake News", show the frequency of the most common words in each class. `the` has the highest frequency in both cases, followed by `to`, `of`, and `and`, all of which are common function words. The maximum frequency in True News is about 1 million, whereas in Fake News it is considerably higher, at nearly 8 million, reflecting a higher word density in Fake News.
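Frequency charts like these depend on counting every token occurrence. A `set` keeps each distinct word once, so counting over it yields 1 for every word. The sketch below contrasts the two on a made-up two-document corpus (`docs` is an assumption, standing in for the `clean_text` of one class).

```python
from collections import Counter

# Toy corpus standing in for df['clean_text'] of one class
docs = ["the senate passed the bill", "the president signed the bill"]

tokens = " ".join(docs).split()   # every occurrence, duplicates kept
vocab = set(tokens)               # each distinct word once

freq_from_tokens = Counter(tokens)
freq_from_vocab = Counter(vocab)

print(freq_from_tokens.most_common(2))   # [('the', 4), ('bill', 2)]
print(max(freq_from_vocab.values()))     # 1
```

The set remains the right structure for the vocabulary-overlap statistics (common and unique words); only the frequency counts need the full token list.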

In [None]:
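# Count occurrences over the whole corpus, restricted to words shared by both classes
common_word_freq = Counter(w for w in ' '.join(df['clean_text']).split() if w in common_words)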
most_common_shared = common_word_freq.most_common(20)

plt.figure(figsize=(10, 6))
sns.barplot(x=[word[1] for word in most_common_shared], y=[word[0] for word in most_common_shared])
plt.title('Top 20 Most Common Words Shared Between True and Fake News')
plt.xlabel('Frequency')
plt.ylabel('Words')
plt.show()

In [None]:
# Build the cloud from the full text so that word size reflects frequency
true_wordcloud = WordCloud(width=800, height=400, background_color='white').generate(' '.join(df[df['label'] == 1]['clean_text']))

plt.figure(figsize=(10, 6))
plt.imshow(true_wordcloud, interpolation='bilinear')
plt.title('Word Cloud for True News')
plt.axis('off')
plt.show()

In [None]:
# Build the cloud from the full text so that word size reflects frequency
fake_wordcloud = WordCloud(width=800, height=400, background_color='white').generate(' '.join(df[df['label'] == 0]['clean_text']))

plt.figure(figsize=(10, 6))
plt.imshow(fake_wordcloud, interpolation='bilinear')
plt.title('Word Cloud for Fake News')
plt.axis('off')
plt.show()

Both word clouds show `trump` and `clinton` in large type, but Fake News also features media-related words (such as `twitter`, `youtube`, `video`) and emotive words (such as `good`, `attack`). This suggests a difference in style and content from True News, which focuses only on political and administrative terms.

## n-grams

In [None]:
def get_top_n_grams(corpus, ngram_range=(2, 2), n=None):
    vec = CountVectorizer(ngram_range=ngram_range, stop_words='english').fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key = lambda x: x[1], reverse=True)
    return words_freq[:n]

In [None]:
top_positive_unigrams = get_top_n_grams(df[df['label'] == 1]['clean_text'], ngram_range=(1, 1), n=20)
top_negative_unigrams = get_top_n_grams(df[df['label'] == 0]['clean_text'], ngram_range=(1, 1), n=20)

plt.figure(figsize=(10, 6))
sns.barplot(x=[word[1] for word in top_positive_unigrams], y=[word[0] for word in top_positive_unigrams])
plt.title('Top 20 Unigrams in True News')
plt.xlabel('Frequency')
plt.ylabel('Unigrams')
plt.show()

plt.figure(figsize=(10, 6))
sns.barplot(x=[word[1] for word in top_negative_unigrams], y=[word[0] for word in top_negative_unigrams])
plt.title('Top 20 Unigrams in Fake News')
plt.xlabel('Frequency')
plt.ylabel('Unigrams')
plt.show()

In True News, `said` leads with the highest frequency (nearly 160,000), followed by `trump`, `mr`, `president`, and `new`, showing a focus on statements and political figures. In Fake News, `trump` comes first by a wide margin (nearly 80,000), followed by `people`, `said`, `clinton`, and `president`, reflecting an emphasis on political figures and the public.

True News has higher frequencies overall (up to 160,000), while Fake News has a lower frequency range (at most 80,000) but a more varied word list, with terms such as `election` and `world`.

In [None]:
top_positive_bigrams = get_top_n_grams(df[df['label'] == 1]['clean_text'], ngram_range=(2, 2), n=20)
top_negative_bigrams = get_top_n_grams(df[df['label'] == 0]['clean_text'], ngram_range=(2, 2), n=20)

plt.figure(figsize=(10, 6))
sns.barplot(x=[word[1] for word in top_positive_bigrams], y=[word[0] for word in top_positive_bigrams])
plt.title('Top 20 Bigrams in True News')
plt.xlabel('Frequency')
plt.ylabel('Bigrams')
plt.show()

plt.figure(figsize=(10, 6))
sns.barplot(x=[word[1] for word in top_negative_bigrams], y=[word[0] for word in top_negative_bigrams])
plt.title('Top 20 Bigrams in Fake News')
plt.xlabel('Frequency')
plt.ylabel('Bigrams')
plt.show()

Both classes feature `trump`, `clinton`, and `united states` prominently, but True News concentrates on formal terms (such as `prime minister`, `supreme court`) with steadily decreasing frequencies, whereas Fake News also contains media-related bigrams (such as `twitter com`, `pic twitter`) and image-related ones (such as `featured image`, `getty images`), indicating a difference in style and information sources.
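Bigrams such as `twitter com` and `pic twitter` arise because the default `CountVectorizer` token pattern keeps only runs of two or more word characters, so `pic.twitter.com` is split into three tokens. The sketch below mimics that tokenisation with the standard library; it is a simplified illustration (no stop-word removal), and `top_ngrams` and the toy texts are assumptions.

```python
import re
from collections import Counter

def top_ngrams(texts, n=2, k=2):
    """Return the k most frequent word n-grams, tokenising like CountVectorizer's default pattern."""
    counts = Counter()
    for text in texts:
        # Runs of 2+ word characters; punctuation such as '.' and '/' acts as a separator
        tokens = re.findall(r"(?u)\b\w\w+\b", text.lower())
        # Slide a window of width n over the token list
        counts.update(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts.most_common(k)

toy = ["photo via pic.twitter.com/abc", "see pic.twitter.com/xyz"]
print(top_ngrams(toy, n=2, k=2))
```

If these link fragments are considered noise rather than signal, they could be stripped during cleaning before the n-gram analysis is run.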

## TF-IDF

In [None]:
true_reviews = df[df['label'] == 1]['clean_text']
tfidf_vectorizer_true = TfidfVectorizer(stop_words='english', max_features=5000)
tfidf_true = tfidf_vectorizer_true.fit_transform(true_reviews)
true_top_words = pd.Series(
    np.asarray(tfidf_true.mean(axis=0)).ravel(),  # column means without densifying the matrix
    index=tfidf_vectorizer_true.get_feature_names_out(),
).sort_values(ascending=False)[:20]

plt.figure(figsize=(10, 6))
sns.barplot(x=true_top_words.values, y=true_top_words.index)
plt.title('Top 20 TF-IDF Words in True News')
plt.xlabel('TF-IDF Score')
plt.ylabel('Words')
plt.show()

In [None]:
fake_reviews = df[df['label'] == 0]['clean_text']
tfidf_vectorizer_fake = TfidfVectorizer(stop_words='english', max_features=5000)
tfidf_fake = tfidf_vectorizer_fake.fit_transform(fake_reviews)
fake_top_words = pd.Series(
    np.asarray(tfidf_fake.mean(axis=0)).ravel(),  # column means without densifying the matrix
    index=tfidf_vectorizer_fake.get_feature_names_out(),
).sort_values(ascending=False)[:20]

plt.figure(figsize=(10, 6))
sns.barplot(x=fake_top_words.values, y=fake_top_words.index)
plt.title('Top 20 TF-IDF Words in Fake News')
plt.xlabel('TF-IDF Score')
plt.ylabel('Words')
plt.show()

Both classes feature `trump`, `clinton`, `president`, and `said` prominently, but True News emphasizes administrative terms (such as `government`, `states`) with steadily decreasing TF-IDF scores, while Fake News stands out with words such as `hillary`, `obama`, and `russia`, suggesting a focus on specific individuals and events.
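For readers who want to see how a single TF-IDF score is formed, here is a minimal pure-Python sketch. It follows the weighting that scikit-learn documents for `TfidfVectorizer` defaults (raw term count, smoothed idf, L2 normalisation); the function `tfidf_row` and the three-document corpus are hypothetical.

```python
import math

def tfidf_row(doc_tokens, corpus_tokens):
    """TF-IDF weights for one document, using smoothed idf and L2 normalisation."""
    n_docs = len(corpus_tokens)
    weights = {}
    for term in set(doc_tokens):
        tf = doc_tokens.count(term)                       # raw term frequency
        df = sum(term in doc for doc in corpus_tokens)    # documents containing the term
        idf = math.log((1 + n_docs) / (1 + df)) + 1       # smoothed inverse document frequency
        weights[term] = tf * idf
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

corpus = [["trump", "said", "tax"], ["trump", "said", "russia"], ["trump", "court"]]
row = tfidf_row(corpus[0], corpus)
print(row)
```

A term that appears in every document (`trump` here) receives the lowest weight, which is why very common names can rank lower by TF-IDF than by raw frequency. Because the two vectorizers above are fitted separately per class, their scores are not directly comparable between the two charts.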

## Textual Feature Distribution Analysis

In [None]:
def get_polarity(text):
    return TextBlob(text).sentiment.polarity

df['polarity'] = df['clean_text'].apply(get_polarity)

plt.figure(figsize=(12, 6))
sns.histplot(data=df, x='polarity', hue='label', multiple='stack', bins=50, kde=True)
plt.title('Polarity Distribution by Label')
plt.xlabel('Polarity')
plt.ylabel('Number of Articles')
plt.show()

The distributions of both labels (0 and 1) are concentrated around a polarity close to 0. Very few articles have extreme polarity (below -0.75 or above 0.75), so most articles are only moderately positive or negative. Label 0 (Fake News) has a considerably higher number of articles than label 1 (True News).

In [None]:
def get_subjectivity(text):
    return TextBlob(text).sentiment.subjectivity

df['subjectivity'] = df['clean_text'].apply(get_subjectivity)

plt.figure(figsize=(12, 6))
sns.histplot(data=df, x='subjectivity', hue='label', multiple='stack', bins=50, kde=True)
plt.title('Subjectivity Distribution by Label')
plt.xlabel('Subjectivity')
plt.ylabel('Number of Articles')
plt.show()

The distributions of both labels (0 and 1) are concentrated at subjectivity values between 0 and 0.6, with very few articles above 0.6. Label 0 (Fake News) predominates overall and far exceeds label 1 at a subjectivity of 0.1, with about 3,000 articles.

In [None]:
def get_flesch_kincaid(text):
    return textstat.flesch_kincaid_grade(text)

df['readability_score'] = df['clean_text'].apply(get_flesch_kincaid)

plt.figure(figsize=(12, 6))
sns.histplot(data=df, x='readability_score', hue='label', multiple='stack', bins=50, kde=True)
plt.title('Readability Score Distribution for True vs Fake News')
plt.xlabel('Flesch-Kincaid Grade Level')
plt.ylabel('Number of Articles')
plt.legend(labels=['Fake', 'True'])
plt.show()

The distributions of both true news (label 1) and fake news (label 0) are concentrated at Flesch-Kincaid grade levels between 0 and 40. Both drop sharply above 40, with very few articles above 80, which suggests that the two classes have similar readability.
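Grade levels of 40 and above are far outside the usual school-grade range, so it helps to look at the standard Flesch-Kincaid formula. The sketch below evaluates it on hypothetical counts; `textstat` does its own sentence and syllable counting, so its exact values may differ.

```python
def flesch_kincaid_grade(n_words, n_sentences, n_syllables):
    """Flesch-Kincaid grade level computed from raw counts."""
    return 0.39 * (n_words / n_sentences) + 11.8 * (n_syllables / n_words) - 15.59

# Hypothetical counts: same words and syllables, different sentence segmentation
print(round(flesch_kincaid_grade(100, 5, 150), 2))   # 9.91  (five 20-word sentences)
print(round(flesch_kincaid_grade(100, 1, 150), 2))   # 41.11 (one 100-word sentence)
```

The score grows linearly with the average sentence length, so texts in which few sentence boundaries are detected can reach very high grades. Whether that explains the long tail here has not been verified.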

## Bonus

In [None]:
def count_punctuation(text, punct):
    return text.count(punct)

df['exclamation_count'] = df['clean_text'].apply(lambda x: count_punctuation(x, '!'))
df['question_count'] = df['clean_text'].apply(lambda x: count_punctuation(x, '?'))

plt.figure(figsize=(12, 6))
sns.histplot(data=df, x='exclamation_count', hue='label', multiple='stack', bins=30)
plt.title('Exclamation Mark (!) Distribution by Label')
plt.xlabel('Number of Exclamation Marks')
plt.ylabel('Number of Articles')
plt.show()

plt.figure(figsize=(12, 6))
sns.histplot(data=df, x='question_count', hue='label', multiple='stack', bins=30)
plt.title('Question Mark (?) Distribution by Label')
plt.xlabel('Number of Question Marks')
plt.ylabel('Number of Articles')
plt.show()

In [None]:
df = df[['text', 'label', 'clean_text']]

# Model building

## Set parameters

In [26]:
RANDOM_STATE = 42
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

## Train-test split

In [27]:
X = df['clean_text'].tolist()
y = df['label'].tolist()

In [28]:
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, random_state=RANDOM_STATE, stratify=df['label'])

X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=RANDOM_STATE, stratify=y_temp)

* Count the labels in each split

In [29]:
print("Train label distribution:", Counter(y_train))
print("Validation label distribution:", Counter(y_val))
print("Test label distribution:", Counter(y_test))

Train label distribution: Counter({1: 24161, 0: 23742})
Validation label distribution: Counter({1: 5178, 0: 5087})
Test label distribution: Counter({1: 5178, 0: 5088})
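The two-stage split yields roughly 70% train, 15% validation and 15% test. As a sanity check, the sketch below reproduces the split sizes, assuming (as scikit-learn does, to my understanding) that a float `test_size` is rounded up. `split_sizes` is a hypothetical helper, and 68434 is the row count shown in the `clean_text` output above.

```python
import math

def split_sizes(n_samples, test_size):
    """Sizes produced by a float test_size: the test part is rounded up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

n_total = 68434                               # rows left after cleaning
n_train, n_temp = split_sizes(n_total, 0.3)   # 70% train, 30% held out
n_val, n_test = split_sizes(n_temp, 0.5)      # held-out part halved
print(n_train, n_val, n_test)                 # 47903 10265 10266
```

These sizes match the sums of the label counts printed above (for example 24161 + 23742 = 47903 for the training set).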


## The necessary functions

In [30]:
def evaluate(y_true, y_pred):
  label_description = {"0": "Fake News", "1": "True News"}
  print("Classification report: \n", classification_report(y_true , y_pred))

  print("Confusion matrix: \n")
  conf_matrix = confusion_matrix(y_true , y_pred)
  plt.figure(figsize=(10, 7))
  sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues", xticklabels=list(label_description.values()), yticklabels=list(label_description.values()))
  plt.xlabel('Predicted Class')
  plt.ylabel('True Class')
  plt.title('Confusion Matrix')
  plt.show()
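The figures in the classification report can be derived directly from the confusion matrix. The following sketch does so for the positive class (label 1, True News). The counts are invented, and `binary_metrics` is an illustrative helper rather than part of the notebook.

```python
def binary_metrics(tn, fp, fn, tp):
    """Precision, recall and F1 for the positive class, plus overall accuracy."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return precision, recall, f1, accuracy

# Hypothetical counts, in the order given by confusion_matrix(y_true, y_pred).ravel()
p, r, f1, acc = binary_metrics(tn=90, fp=10, fn=20, tp=80)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f} accuracy={acc:.3f}")
# precision=0.889 recall=0.800 f1=0.842 accuracy=0.850
```

In this task a false positive is a fake article classified as true, so precision on label 1 measures how far a "true" prediction can be trusted.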

In [31]:
def plot_curves(train_loss, val_loss, val_acc, title="Learning Curve"):
    epochs = range(1, len(train_loss) + 1)
    fig, axes = plt.subplots(1, 2, figsize=(10, 4))

    # Loss
    axes[0].plot(epochs, train_loss, label="Train loss")
    axes[0].plot(epochs, val_loss,   label="Val loss")
    axes[0].set_xlabel("Epoch"); axes[0].set_ylabel("Loss")
    axes[0].set_title("Loss"); axes[0].legend(); axes[0].grid(ls="--", alpha=.4)

    # Val accuracy
    axes[1].plot(epochs, val_acc, label="Val acc", color="tab:orange")
    axes[1].set_xlabel("Epoch"); axes[1].set_ylabel("Accuracy")
    axes[1].set_title("Validation accuracy")
    axes[1].legend(); axes[1].grid(ls="--", alpha=.4)

    plt.suptitle(title)
    plt.tight_layout()
    plt.show()

In [32]:
def plot_roc_curve(y_true, y_score, pos_label=1, title="ROC Curve"):
    fpr, tpr, _ = roc_curve(y_true, y_score, pos_label=pos_label)
    roc_auc = auc(fpr, tpr)

    fig, ax = plt.subplots(figsize=(6, 4))

    ax.plot(fpr, tpr, color="tab:red",
            label=f"User model (AUC = {roc_auc:.2f})", lw=2)

    # Random baseline (diagonal)
    ax.plot([0, 1], [0, 1], "k--", lw=2)

    # perfect
    ax.plot([0, 0, 1], [0, 1, 1], color="green",
            label="Perfect model", lw=1)

    ax.set_xlim([0.0, 1.0]); ax.set_ylim([0.0, 1.02])
    ax.set_xlabel("False Positive Rate")
    ax.set_ylabel("True Positive Rate")
    ax.set_title(title)
    ax.legend(loc="lower right")
    ax.grid(alpha=0.3, ls="--")

    plt.show()

    return ax
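One way to read the AUC reported by `plot_roc_curve` is as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it that way on made-up scores. It is quadratic in the number of samples, so it is meant for illustration only, and `auc_by_pairs` is a hypothetical helper.

```python
def auc_by_pairs(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and decision scores
print(auc_by_pairs([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))   # 0.75
```

Because only the ranking matters, the raw margins returned by `decision_function` can be passed as `y_score` without converting them to probabilities.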

## Build model

### Model ML

In [None]:
stemmer    = PorterStemmer()
stop_words = set(stopwords.words("english"))

In [None]:
def tokenize_and_filter(text):
    text = contractions.fix(text)
    tokens = word_tokenize(text)
    return [stemmer.stem(w)
            for w in tokens
            if w.lower() not in stop_words
            and w.isalpha()]

In [None]:
def build_ml_model(X, y):
    model = Pipeline([
        ("tfidf", TfidfVectorizer(
            tokenizer=tokenize_and_filter,
            lowercase=False,
            preprocessor=None,
            token_pattern=None,
            ngram_range=(1, 2)
        )),
        ("svc", SVC(kernel='linear'))
    ])

    model.fit(X, y)

    return model

In [None]:
model = build_ml_model(X_train, y_train)

pred = model.predict(X_val)
evaluate(y_val, pred)

In [None]:
# Define the parameter grid
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "tfidf__min_df":      [2, 5, 10],
    "tfidf__max_df":      [0.85, 0.9, 0.95],
    "tfidf__max_features": [None, 50_000, 100_000],

    "svc__C":            [0.1, 1, 2, 5],
    "svc__class_weight": [None, "balanced"],
}

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(
        tokenizer      = tokenize_and_filter,
        lowercase      = False,   # already handled in the tokenizer
        preprocessor   = None,
        token_pattern  = None     # disable the default regex
    )),
    ("svc",  SVC(kernel="linear", random_state=RANDOM_STATE))
])

# Set up GridSearchCV on the pipeline defined above (named `pipe`)
gridsearch = GridSearchCV(pipe, param_grid, cv=5, scoring="f1", verbose=1)

# Find the best hyperparameters
gridsearch.fit(X_train, y_train)

# Print the best hyperparameters found and the best cross-validation score
print("Best Parameters:", gridsearch.best_params_)
print("Best Cross-Validation Score:", gridsearch.best_score_)

# Save the fitted GridSearchCV object to a pkl file
with open('best_svm.pkl', 'wb') as file:
    pickle.dump(gridsearch, file)
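
Before launching the search it can help to estimate its cost. The sketch below counts the candidate combinations in the grid above and multiplies by the number of folds. It only does arithmetic and fits nothing; the variable names are chosen for this example.

```python
from itertools import product

# Same grid as in the cell above.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "tfidf__min_df": [2, 5, 10],
    "tfidf__max_df": [0.85, 0.9, 0.95],
    "tfidf__max_features": [None, 50_000, 100_000],
    "svc__C": [0.1, 1, 2, 5],
    "svc__class_weight": [None, "balanced"],
}
cv = 5

# Every combination of one value per parameter is one candidate.
candidates = list(product(*param_grid.values()))
n_candidates = len(candidates)
n_fits = n_candidates * cv  # the final refit on the full training set adds one more

print(f"{n_candidates} candidates x {cv} folds = {n_fits} fits")
```

Each fit re-runs the tokenizer and trains a linear-kernel `SVC`, so a grid of this size can take a long time on a large corpus. Passing `n_jobs=-1` to `GridSearchCV` or trimming the grid are common ways to shorten it.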

In [None]:
# Load the GridSearchCV object from the pickle file
with open('best_svm.pkl', 'rb') as file:
    loaded_gridsearch = pickle.load(file)

print("Best Parameters:", loaded_gridsearch.best_params_)

best_model = loaded_gridsearch.best_estimator_

y_score = best_model.decision_function(X_test)

### Basic DL models

#### 1. Multilevel-CNN
- Captures features at the word, phrase, and sentence level using kernels of different sizes (3, 4, 5, ...).
- Suited to long, varied text; unlike an RNN, it does not depend on long-range sequential order.
- Usually trains faster than an LSTM and tends to be more accurate than a plain single-kernel CNN.

#### 2. CNN + BiLSTM
- The CNN extracts local features, then the BiLSTM models context in both directions (preceding and following).
- Suited to text with a linear logical flow (such as news articles).
- High accuracy; somewhat slower than Multilevel-CNN, but still performs well when tuned properly.

#### Multilevel-CNN

**Suggested architecture:**

Input (text sequence)
→ Embedding Layer
→ Conv1D (kernel size 3) → GlobalMaxPool
→ Conv1D (kernel size 4) → GlobalMaxPool
→ Conv1D (kernel size 5) → GlobalMaxPool
→ Concatenate
→ Dense layers → Dropout (not yet in the code)
→ Output (Sigmoid / Softmax)
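
The shapes flowing through this architecture can be checked with a little arithmetic. The sketch below applies the standard `Conv1d` output-length formula to the three branches, using the sequence length of 512 and the 32 output channels per branch that the code below uses. The helper `conv1d_out_len` is a name introduced for this example.

```python
def conv1d_out_len(seq_len, kernel_size, stride=1, padding=0, dilation=1):
    """Output length of a 1D convolution (same formula as documented for torch.nn.Conv1d)."""
    return (seq_len + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


seq_len = 512        # padded sequence length
out_channels = 32    # filters per convolution branch

# Length of each feature map before global max pooling.
lengths = {k: conv1d_out_len(seq_len, k) for k in (3, 4, 5)}

# Global max pooling reduces each branch to out_channels values,
# so the concatenated vector has 3 * 32 entries.
concat_dim = out_channels * len(lengths)

print(lengths, concat_dim)
```

The concatenated size of 96 is what the first dense layer must accept, which matches `nn.Linear(32*3, 10)` in the model definition.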

In [None]:
# Tokenization
# Fit the tokenizer on the training data
tokenizer = Tokenizer(num_words=5000)  # keep only the most frequent words (indices below 5000)
tokenizer.fit_on_texts(X_train)
# Convert texts to integer sequences and pad them
X_train_seq = tokenizer.texts_to_sequences(X_train)
X_test_seq = tokenizer.texts_to_sequences(X_test)
X_train_pad = pad_sequences(X_train_seq, maxlen=512, padding='post', truncating='post')
X_test_pad = pad_sequences(X_test_seq, maxlen=512, padding='post', truncating='post')

# Convert to tensor
X_train_tensor = torch.tensor(X_train_pad, dtype=torch.long)
X_test_tensor = torch.tensor(X_test_pad, dtype=torch.long)
y_train_tensor = torch.tensor(y_train.values, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test.values, dtype=torch.float32)

# Dataset & DataLoader
class TextDataset(Dataset):
    '''
    Custom Dataset built from the padded sequences and their labels.
    '''
    def __init__(self, X, y):
        self.X = X
        self.y = y
    def __len__(self):
        return len(self.X)
    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

train_dataset = TextDataset(X_train_tensor, y_train_tensor)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_dataset = TextDataset(X_test_tensor, y_test_tensor)
test_loader = DataLoader(test_dataset, batch_size=32)

# Define the model
class MultilevelCNN(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_classes=1):
        super(MultilevelCNN, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim) # Embedding layer
        self.conv3 = nn.Conv1d(in_channels=embed_dim, out_channels=32, kernel_size=3)
        self.conv4 = nn.Conv1d(in_channels=embed_dim, out_channels=32, kernel_size=4)
        self.conv5 = nn.Conv1d(in_channels=embed_dim, out_channels=32, kernel_size=5)
        self.fc = nn.Linear(32*3, 10)
        self.out = nn.Linear(10, num_classes)
    def forward(self, x):
        x = self.embedding(x)  # (batch_size, seq_len, embed_dim)
        x = x.permute(0, 2, 1) # (batch_size, embed_dim, seq_len)

        x1 = F.relu(self.conv3(x)) # Conv1d with kernel_size = 3
        x2 = F.relu(self.conv4(x)) # Conv1d with kernel_size = 4
        x3 = F.relu(self.conv5(x)) # Conv1d with kernel_size = 5

        x1 = F.max_pool1d(x1, x1.size(2)).squeeze(2)
        x2 = F.max_pool1d(x2, x2.size(2)).squeeze(2)
        x3 = F.max_pool1d(x3, x3.size(2)).squeeze(2)

        x = torch.cat((x1, x2, x3), 1) # Concatenate the pooled features
        x = F.relu(self.fc(x))
        x = torch.sigmoid(self.out(x)) # Binary classification

        return x

# Model, loss, optimizer
model = MultilevelCNN(vocab_size=5000, embed_dim=16).to(device)
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

best_val_loss = float('inf')
patience = 2
wait = 0
num_epochs = 20
# Training loop
for epoch in range(num_epochs):
    model.train()
    total_loss = 0
    loop = tqdm(train_loader, desc=f"Epoch {epoch+1} [Train]", leave=True)
    for batch_X, batch_y in loop:
        batch_X, batch_y = batch_X.to(device), batch_y.to(device)
        optimizer.zero_grad()
        outputs = model(batch_X).squeeze(1)
        loss = criterion(outputs, batch_y)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
        loop.set_postfix(loss=loss.item())

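    # === Validation after each epoch ===
    # NOTE: this loop validates (and early-stops) on test_loader; building a loader
    # from the validation split (X_val, y_val) would keep the test set unseen.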
    model.eval()
    correct = 0
    val_loss_total = 0
    total = 0
    val_loop = tqdm(test_loader, desc=f"Epoch {epoch+1} [Val]", leave=True)
    with torch.no_grad():
        for batch_X, batch_y in val_loop:
            batch_X, batch_y = batch_X.to(device), batch_y.to(device)
            outputs = model(batch_X).squeeze(1)
            val_loss = criterion(outputs, batch_y)
            val_loss_total += val_loss.item()

            preds = (outputs > 0.5).float()
            correct += (preds == batch_y).sum().item()
            total += batch_y.size(0)
            val_loop.set_postfix(val_loss=val_loss.item())

    val_acc = correct / total
    val_loss_avg = val_loss_total / len(test_loader)
    total_loss_avg = total_loss / len(train_loader)  # average over training batches
    print(f"Epoch {epoch+1} | Train Loss: {total_loss_avg:.4f} | Val Loss: {val_loss_avg:.4f} | Val Acc: {val_acc*100:.2f}%")

    # === Early stopping check ===
    if val_loss_avg < best_val_loss:
        best_val_loss = val_loss_avg
        wait = 0
    else:
        wait += 1
        if wait >= patience:
            print("Early stopping triggered.")
            break
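
The early-stopping rule used in the loop above can be isolated from the training code. The sketch below replays it on a list of made-up validation losses; the function name `early_stopping_epoch` and the loss values are assumptions for illustration only.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (epoch at which training stops, best validation loss seen so far)."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss   # improvement: remember it and reset the counter
            wait = 0
        else:
            wait += 1     # no improvement this epoch
            if wait >= patience:
                return epoch, best
    return len(val_losses), best


# Hypothetical per-epoch validation losses.
val_losses = [0.60, 0.45, 0.47, 0.46, 0.40]
stopped_at, best = early_stopping_epoch(val_losses)
print(stopped_at, best)
```

With `patience=2`, training stops after two consecutive epochs without a new best loss, even though a later epoch in this toy sequence would have improved. Note also that the loop above keeps the weights from the last epoch rather than the best one; saving a checkpoint on each improvement, as the `Trainer` class later in this notebook does, avoids that.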

In [None]:
# Predict
def predict_text(texts):
    model.eval()
    seq = tokenizer.texts_to_sequences(texts)
    padded = pad_sequences(seq, maxlen=512, padding='post', truncating='post')
    tensor = torch.tensor(padded, dtype=torch.long).to(device)
    with torch.no_grad():
        outputs = model(tensor)
        probs = outputs.cpu().numpy()
        return probs, (probs > 0.5).astype(int)

probs, preds = predict_text(X_test)

### Model DL

In [33]:
learning_rate = 2e-5
epochs = 5
patience = 2

In [34]:
def get_tokenizer_and_model(model_name: str, num_labels: int):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)
    return tokenizer, model

In [35]:
class TextClassificationDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length=256):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        encoding = self.tokenizer(
            self.texts[idx],
            truncation=True,
            padding='max_length',
            max_length=self.max_length,
            return_tensors='pt'
        )

        item = {key: val.squeeze(0) for key, val in encoding.items()}
        item['labels'] = torch.tensor(self.labels[idx], dtype=torch.long)
        return item
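
The `truncation=True` and `padding='max_length'` arguments make every item the same length. The sketch below mimics that behaviour on plain lists of token ids. It is a simplification: a real Hugging Face tokenizer also inserts special tokens and uses a model-specific padding id, and the helper `pad_or_truncate` and its `pad_id=0` default are assumptions for this example.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Cut or right-pad a list of ids to max_length and build the attention mask."""
    ids = token_ids[:max_length]
    mask = [1] * len(ids)              # 1 marks real tokens
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad   # 0 marks padding


# A short and a long sequence of arbitrary ids.
short_ids, short_mask = pad_or_truncate([5, 6, 7], max_length=5)
long_ids, long_mask = pad_or_truncate([5, 6, 7, 8, 9, 10, 11], max_length=5)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask is what lets the model ignore padded positions. With `max_length=256`, any article longer than 256 tokens is cut, so only its opening part reaches the model.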


In [36]:
class DataLoaderBuilder:
    def __init__(self, dataset, batch_size=32, shuffle=True, num_workers=2):
        self.dataset = dataset
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.num_workers = num_workers

    def get_dataloader(self):
        return DataLoader(
            dataset=self.dataset,
            batch_size=self.batch_size,
            shuffle=self.shuffle,
            num_workers=self.num_workers
        )

In [37]:
class Trainer:
    def __init__(self, model, train_loader, val_loader, model_name, lr=2e-5, epochs=5, patience=2, device='cuda'):
        self.model = model.to(device)
        self.train_loader = train_loader
        self.val_loader = val_loader
        self.model_name = model_name
        self.lr = lr
        self.epochs = epochs
        self.patience = patience
        self.device = device
        self.optimizer = AdamW(self.model.parameters(), lr=lr)
        self.scheduler = get_scheduler("linear", self.optimizer, num_warmup_steps=0,
                                       num_training_steps=len(train_loader) * epochs)
        
        # Track history
        self.train_losses = []
        self.val_losses = []
        self.val_accuracies = []

    def train_one_epoch(self):
        self.model.train()
        total_loss = 0
        for batch in tqdm(self.train_loader, desc="Training"):
            batch = {k: v.to(self.device) for k, v in batch.items()}
            outputs = self.model(**batch)
            loss = outputs.loss
            loss.backward()
            torch.nn.utils.clip_grad_norm_(self.model.parameters(), 1.0)
            self.optimizer.step()
            self.scheduler.step()
            self.optimizer.zero_grad()
            total_loss += loss.item()
        return total_loss / len(self.train_loader)

    def evaluate(self):
        self.model.eval()
        all_preds, all_labels = [], []
        total_loss = 0
        with torch.no_grad():
            for batch in tqdm(self.val_loader, desc="Validating"):
                batch = {k: v.to(self.device) for k, v in batch.items()}
                outputs = self.model(**batch)
                loss = outputs.loss
                logits = outputs.logits
                preds = torch.argmax(logits, dim=-1)
                total_loss += loss.item()
                all_preds.extend(preds.cpu().numpy())
                all_labels.extend(batch['labels'].cpu().numpy())
        acc = accuracy_score(all_labels, all_preds)
        avg_loss = total_loss / len(self.val_loader)
        return acc, avg_loss, all_preds, all_labels

    def train(self):
        best_loss = float('inf')
        stop_count = 0
        save_path = os.path.join("/kaggle/working", f"{self.model_name}_best.pt")

        for epoch in range(1, self.epochs + 1):
            print(f"\nEpoch {epoch}/{self.epochs}")
            train_loss = self.train_one_epoch()
            val_acc, val_loss, _, _ = self.evaluate()

            self.train_losses.append(train_loss)
            self.val_losses.append(val_loss)
            self.val_accuracies.append(val_acc)

            print(f"Train Loss: {train_loss:.4f}")
            print(f"Val Loss:   {val_loss:.4f} | Accuracy: {val_acc:.4f}")

            if val_loss < best_loss:
                best_loss = val_loss
                stop_count = 0
                torch.save(self.model.state_dict(), save_path)
            else:
                stop_count += 1
                if stop_count >= self.patience:
                    print("Early stopping triggered.")
                    break

        # Reload best model
        self.model.load_state_dict(torch.load(save_path))
        
        # Save the train/val history
        log_df = pd.DataFrame({
            "epoch": list(range(1, len(self.train_losses) + 1)),
            "train_loss": self.train_losses,
            "val_loss": self.val_losses,
            "val_acc": self.val_accuracies
        })
        log_df.to_csv(f"/kaggle/working/{self.model_name}_training_log.csv", index=False)
        print("Training log saved.")

        return self.train_losses, self.val_losses, self.val_accuracies

    def test_model(self, test_loader):
        """
        Evaluate the saved best model on the test set.
        Returns: acc, loss, preds, true_labels, probs
        """
        save_path = os.path.join("/kaggle/working", f"{self.model_name}_best.pt")
        self.model.load_state_dict(torch.load(save_path))
        self.model.eval()

        all_preds, all_labels, all_probs = [], [], []
        total_loss = 0.0

        with torch.no_grad():
            for batch in tqdm(test_loader, desc="Testing"):
                batch = {k: v.to(self.device) for k, v in batch.items()}
                outputs = self.model(**batch)
                loss = outputs.loss
                logits = outputs.logits
                probs = torch.softmax(logits, dim=-1)
                preds = torch.argmax(probs, dim=-1)

                total_loss += loss.item()
                all_preds.extend(preds.cpu().numpy())
                all_labels.extend(batch['labels'].cpu().numpy())
                all_probs.extend(probs[:, 1].cpu().numpy() if probs.shape[1] > 1 else probs[:, 0].cpu().numpy())

        avg_loss = total_loss / len(test_loader)
        acc = accuracy_score(all_labels, all_preds)

        # Save for later analysis
        results_df = pd.DataFrame({
            "true_label": all_labels,
            "pred_label": all_preds,
            "prob_class1": all_probs
        })
        results_path = f"/kaggle/working/{self.model_name}_test_results.csv"
        results_df.to_csv(results_path, index=False)
        print(f"Test results saved to {results_path}")

        return acc, avg_loss, all_preds, all_labels, all_probs
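
The `"linear"` scheduler created in `Trainer.__init__` with `num_warmup_steps=0` lowers the learning rate from its initial value to zero over `len(train_loader) * epochs` optimizer steps. The sketch below reproduces that rule in plain Python. The function `linear_lr` is a stand-in written for this example, and the figure of 749 batches per epoch is an assumption taken from the progress bar of the BERT run shown later in this notebook.

```python
def linear_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate after `step` optimizer steps under linear warmup and decay."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return base_lr * max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 2e-5
steps_per_epoch = 749            # assumed number of batches per epoch
epochs = 5
total_steps = steps_per_epoch * epochs

# Learning rate at the start of each epoch and at the very end.
for epoch in range(epochs + 1):
    step = epoch * steps_per_epoch
    print(f"step {step:4d}: lr = {linear_lr(step, base_lr, total_steps):.2e}")
```

Because the schedule is sized for all five epochs, an early stop leaves the learning rate above zero when training ends.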

In [40]:
def run_model(model_name, batch_size):
    global df, learning_rate, epochs, patience, X_train, X_val, X_test, y_train, y_val, y_test

    print("========== LOAD TOKENIZER & MODEL ==========")
    tokenizer, model = get_tokenizer_and_model(model_name, num_labels=len(set(df["label"])))

    print("========== CREATE DATASETS ==========")
    train_dataset = TextClassificationDataset(X_train, y_train, tokenizer)
    val_dataset = TextClassificationDataset(X_val, y_val, tokenizer)
    test_dataset = TextClassificationDataset(X_test, y_test, tokenizer)

    print("========== CREATE DATALOADERS ==========")
    train_loader = DataLoaderBuilder(train_dataset, batch_size=batch_size, shuffle=True).get_dataloader()
    val_loader = DataLoaderBuilder(val_dataset, batch_size=batch_size, shuffle=False).get_dataloader()
    test_loader = DataLoaderBuilder(test_dataset, batch_size=batch_size, shuffle=False).get_dataloader()

    print("========== INITIALIZE TRAINER ==========")
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    trainer = Trainer(
        model=model,
        train_loader=train_loader,
        val_loader=val_loader,
        model_name=model_name,
        lr=learning_rate,
        epochs=epochs,
        patience=patience,
        device=device
    )

    print(f"\n{'-'*50}")
    print(f"\tTRAINING MODEL: {model_name}")
    print(f"{'-'*50}")
    train_losses, val_losses, val_accuracies = trainer.train()

    print(f"\n{'-'*50}")
    print(f"\tEVALUATION ON TEST SET")
    print(f"{'-'*50}")
    test_acc, test_loss, test_preds, test_true_labels, test_probs = trainer.test_model(test_loader)

    print("\nClassification Report:")
    print(classification_report(test_true_labels, test_preds, target_names=[str(i) for i in sorted(df['label'].unique())]))

    cm = confusion_matrix(test_true_labels, test_preds)
    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=sorted(df['label'].unique()),
                yticklabels=sorted(df['label'].unique()))
    plt.title('Confusion Matrix')
    plt.xlabel('Predicted')
    plt.ylabel('True')
    plt.show()

    print(f"Test Accuracy: {test_acc:.4f} | Test Loss: {test_loss:.4f}")

    # Return the data for later plotting or additional logging
    return {
        "train_losses": train_losses,
        "val_losses": val_losses,
        "val_accuracies": val_accuracies,
        "test_acc": test_acc,
        "test_loss": test_loss,
        "test_preds": test_preds,
        "test_labels": test_true_labels,
        "test_probs": test_probs
    }
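
The `test_probs` returned above are the softmax probabilities of class 1, and `test_preds` are the arg-max classes. The sketch below shows both steps on one pair of made-up logits and checks that, for two classes, the softmax probability of class 1 equals the sigmoid of the logit difference. The logit values are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for one article: [class 0, class 1].
logits = [-1.2, 2.3]

probs = softmax(logits)
prob_class1 = probs[1]                                   # value used for the ROC curve
pred = max(range(len(probs)), key=probs.__getitem__)     # arg-max class

# Two-class identity: softmax(z)[1] == sigmoid(z1 - z0).
sigmoid_diff = 1.0 / (1.0 + math.exp(-(logits[1] - logits[0])))

print(prob_class1, pred, sigmoid_diff)
```

Taking the arg-max of two softmax outputs is the same as thresholding `prob_class1` at 0.5, which is the rule the basic CNN model applies to its sigmoid output.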


In [None]:
results = run_model("bert-base-uncased", batch_size=64)

plot_curves(results["train_losses"], results["val_losses"], results["val_accuracies"])
plot_roc_curve(results["test_labels"], results["test_probs"])
evaluate(results["test_labels"], results["test_preds"])



Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



--------------------------------------------------
	TRAINING MODEL: bert-base-uncased
--------------------------------------------------

Epoch 1/5


Training:   1%|▏         | 11/749 [00:18<19:52,  1.62s/it]

In [None]:
results_2 = run_model("roberta-base", batch_size=64)

plot_curves(results_2["train_losses"], results_2["val_losses"], results_2["val_accuracies"])
plot_roc_curve(results_2["test_labels"], results_2["test_probs"])
evaluate(results_2["test_labels"], results_2["test_preds"])

In [None]:
results_3 = run_model(model_name="xlnet-base-cased", batch_size=32)

plot_curves(results_3["train_losses"], results_3["val_losses"], results_3["val_accuracies"])
plot_roc_curve(results_3["test_labels"], results_3["test_probs"])
evaluate(results_3["test_labels"], results_3["test_preds"])