# **Final Project**

## **Problem stament :**     

The widespread dissemination of fake news and propaganda presents serious societal risks, including the erosion of public trust, political polarization, manipulation of elections, and the spread of harmful misinformation during crises such as pandemics or conflicts. From an NLP perspective, detecting fake news is fraught with challenges. Linguistically, fake news often mimics the tone and structure of legitimate journalism, making it difficult to distinguish using surface-level features. The absence of reliable and up-to-date labeled datasets, especially across multiple languages and regions, hampers the effectiveness of supervised learning models. Additionally, the dynamic and adversarial nature of misinformation means that malicious actors constantly evolve their language and strategies to bypass detection systems. Cultural context, sarcasm, satire, and implicit bias further complicate automated analysis. Moreover, NLP models risk amplifying biases present in training data, leading to unfair classifications and potential censorship of legitimate content. These challenges underscore the need for cautious, context-aware approaches, as the failure to address them can inadvertently contribute to misinformation, rather than mitigate it.



Use datasets in link : https://drive.google.com/drive/folders/1mrX3vPKhEzxG96OCPpCeh9F8m_QKCM4z?usp=sharing
to complete requirement.

## **About dataset:**

* **True Articles**:

  * **File**: `MisinfoSuperset_TRUE.csv`
  * **Sources**:

    * Reputable media outlets like **Reuters**, **The New York Times**, **The Washington Post**, etc.

* **Fake/Misinformation/Propaganda Articles**:

  * **File**: `MisinfoSuperset_FAKE.csv`
  * **Sources**:

    * **American right-wing extremist websites** (e.g., Redflag Newsdesk, Breitbart, Truth Broadcast Network)
    * **Public dataset** from:

      * Ahmed, H., Traore, I., & Saad, S. (2017): "Detection of Online Fake News Using N-Gram Analysis and Machine Learning Techniques" *(Springer LNCS 10618)*



## **Requirement**

A team consisting of three members must complete a project that involves applying the methods learned from the beginning of the course up to the present. The team is expected to follow and document the entire machine learning workflow, which includes the following steps:

1. **Data Preprocessing**: Clean and prepare the dataset,etc.

2. **Exploratory Data Analysis (EDA)**: Explore and visualize the data.

3. **Model Building**: Select and build one or more machine learning models suitable for the problem at hand.

4. **Hyperparameter set up**: Set and adjust the model's hyperparameters using appropriate methods to improve performance.

5. **Model Training**: Train the model(s) on the training dataset.

6. **Performance Evaluation**: Evaluate the trained model(s) using appropriate metrics (e.g., accuracy, precision, recall, F1-score, confusion matrix, etc.) and validate their performance on unseen data.

7. **Conclusion**: Summarize the results, discuss the model's strengths and weaknesses, and suggest possible improvements or future work.





# Read dataset

In [1]:
!pip install contractions

Collecting contractions
  Downloading contractions-0.1.73-py2.py3-none-any.whl.metadata (1.2 kB)
Collecting textsearch>=0.0.21 (from contractions)
  Downloading textsearch-0.0.24-py2.py3-none-any.whl.metadata (1.2 kB)
Collecting anyascii (from textsearch>=0.0.21->contractions)
  Downloading anyascii-0.3.2-py3-none-any.whl.metadata (1.5 kB)
Collecting pyahocorasick (from textsearch>=0.0.21->contractions)
  Downloading pyahocorasick-2.1.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (13 kB)
Downloading contractions-0.1.73-py2.py3-none-any.whl (8.7 kB)
Downloading textsearch-0.0.24-py2.py3-none-any.whl (7.6 kB)
Downloading anyascii-0.3.2-py3-none-any.whl (289 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m289.9/289.9 kB[0m [31m9.2 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading pyahocorasick-2.1.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (118 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m118.3/118.3 kB[0m 

In [2]:
!pip install langdetect

Collecting langdetect
  Downloading langdetect-1.0.9.tar.gz (981 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m981.5/981.5 kB[0m [31m21.7 MB/s[0m eta [36m0:00:00[0m00:01[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: langdetect
  Building wheel for langdetect (setup.py) ... [?25l[?25hdone
  Created wheel for langdetect: filename=langdetect-1.0.9-py3-none-any.whl size=993223 sha256=67e37f002872a376fe2fd5ff8557bd9e3aa066f69f8039da117809d1d7dc10fd
  Stored in directory: /root/.cache/pip/wheels/0a/f2/b2/e5ca405801e05eb7c8ed5b3b4bcf1fcabcd6272c167640072e
Successfully built langdetect
Installing collected packages: langdetect
Successfully installed langdetect-1.0.9


In [3]:
import pandas as pd
import re
import contractions

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from transformers import BertTokenizer

import html
import quopri
import emoji
from bs4 import BeautifulSoup
from langdetect import detect, LangDetectException

In [4]:
true_df = pd.read_csv("/kaggle/input/misinfo/DataSet_Misinfo_TRUE.csv")
true_df

Unnamed: 0.1,Unnamed: 0,text
0,0,The head of a conservative Republican faction ...
1,1,Transgender people will be allowed for the fir...
2,2,The special counsel investigation of links bet...
3,3,Trump campaign adviser George Papadopoulos tol...
4,4,President Donald Trump called on the U.S. Post...
...,...,...
34970,34970,Most conservatives who oppose marriage equalit...
34971,34971,The freshman senator from Georgia quoted scrip...
34972,34972,The State Department told the Republican Natio...
34973,34973,"ADDIS ABABA, Ethiopia —President Obama convene..."


In [5]:
fake_df = pd.read_csv("/kaggle/input/misinfo/DataSet_Misinfo_FAKE.csv")
fake_df

Unnamed: 0.1,Unnamed: 0,text
0,0,Donald Trump just couldn t wish all Americans ...
1,1,House Intelligence Committee Chairman Devin Nu...
2,2,"On Friday, it was revealed that former Milwauk..."
3,3,"On Christmas day, Donald Trump announced that ..."
4,4,Pope Francis used his annual Christmas Day mes...
...,...,...
43637,44422,The USA wants to divide Syria.\r\n\r\nGreat Br...
43638,44423,The Ukrainian coup d'etat cost the US nothing ...
43639,44424,The European Parliament falsifies history by d...
43640,44425,The European Parliament falsifies history by d...


In [6]:
# Delete order column
true_df = true_df.drop('Unnamed: 0', axis=1)
fake_df = fake_df.drop('Unnamed: 0', axis=1)

In [7]:
true_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34975 entries, 0 to 34974
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    34946 non-null  object
dtypes: object(1)
memory usage: 273.4+ KB


In [8]:
fake_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 43642 entries, 0 to 43641
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    43642 non-null  object
dtypes: object(1)
memory usage: 341.1+ KB


In [9]:
true_df.describe()

Unnamed: 0,text
count,34946
unique,34526
top,"Killing Obama administration rules, dismantlin..."
freq,58


In [10]:
fake_df.describe()

Unnamed: 0,text
count,43642
unique,34078
top,Leave a Reply Click here to get more info on f...
freq,38


# Data Preprocessing

- Xử lý giá trị null

In [11]:
true_df.isnull().sum()

text    29
dtype: int64

In [12]:
fake_df.isnull().sum()

text    0
dtype: int64

In [13]:
true_df = true_df.dropna()

- Xử lý giá trị duplicate

In [14]:
true_df.duplicated().sum()

420

In [15]:
fake_df.duplicated().sum()

9564

In [16]:
true_df = true_df.drop_duplicates()
fake_df = fake_df.drop_duplicates()

- Thêm label và gộp 2 tập dữ liệu

In [17]:
true_df['label'] = 1
fake_df['label'] = 0

df = pd.concat([true_df, fake_df], ignore_index=True)
df

Unnamed: 0,text,label
0,The head of a conservative Republican faction ...,1
1,Transgender people will be allowed for the fir...,1
2,The special counsel investigation of links bet...,1
3,Trump campaign adviser George Papadopoulos tol...,1
4,President Donald Trump called on the U.S. Post...,1
...,...,...
68599,"Apparently, the new Kyiv government is in a hu...",0
68600,The USA wants to divide Syria.\r\n\r\nGreat Br...,0
68601,The Ukrainian coup d'etat cost the US nothing ...,0
68602,The European Parliament falsifies history by d...,0


In [18]:
df = df.sample(frac=1, random_state=42).reset_index(drop=True) # Shuffle dataset
df

Unnamed: 0,text,label
0,Former Russian economy minister Alexei Ulyukay...,1
1,Republicans were just given a leg up over Demo...,0
2,This has to be one of the best remix videos ev...,0
3,"In line with the new Language Law, Russian is ...",0
4,JERUSALEM — A day after approving the const...,1
...,...,...
68599,The Super Bowl had not yet begun and Trump fan...,0
68600,U.S. House Republicans on Friday won passage o...,1
68601,Share on Facebook Share on Twitter Known to th...,0
68602,A New Jersey man who worked at the World Trade...,1


* Kiểm tra imbalance

In [19]:
df['label'].value_counts()

label
1    34526
0    34078
Name: count, dtype: int64

=> Dữ liệu không bị imbalance

- Xử lý các văn bản không phải là tiếng Anh

In [20]:
# def safe_detect(x):
#     if isinstance(x, str) and x.strip() and len(x.strip()) > 20:
#         try:
#             return detect(x)
#         except LangDetectException:
#             return 'unknown'
#     return 'unknown'

# df['lang'] = df['text'].apply(safe_detect)
# non_english = df[df['lang'] != 'en']
# print(non_english)

In [21]:
# len(non_english[non_english['label']==1])

In [22]:
# df = df.drop(columns="lang", axis=1) # Xóa cột phụ sau khi xử lý

=> Không xóa các dòng văn bản không phải tiếng Anh vì label 0 - fake news chiếm đa số

- Xử lý các văn bản với số từ ít hơn 5

In [23]:
# Đếm số từ trong mỗi dòng
short_texts = df[df['text'].apply(lambda x: len(str(x).split()) < 5)]

# In ra các dòng này
print(short_texts)

                                                  text  label
127                                 Florida for Trump!      0
288    A MUST watch video!https://youtu.be/-5Z-jJ2Z4bU      0
772                                               Cool      0
965                    That would be unconstitutional.      0
1115                   Around 120,000 displaced people      1
...                                                ...    ...
67547                           TRUMP VICTORY FOR SURE      0
67689                                        Brilliant      0
67766                                  Good guy.\n👍👍👍👍      0
67797      https://www.youtube.com/watch?v=gqxwF-TeYas      0
67834                                        Horseshit      0

[170 rows x 2 columns]


In [24]:
short_texts[short_texts['label']==1]

Unnamed: 0,text,label
1115,"Around 120,000 displaced people",1
20229,Republican Congressman Will Hurd,1
24713,Ted Cruz,1
26250,Four U.S. senators,1
28034,“On 1/20,1
31892,No.,1
40323,(Reuters),1
57596,Jan 29 (Reuters),1
65117,advertisement,1


=> Xóa các văn bản có số từ <5 vì với label 1 - true news thì thật sự không có ý nghĩa -> có thể làm model dự đoán sai.

In [25]:
# Loại bỏ các dòng có số từ < 5
df = df[df['text'].apply(lambda x: len(str(x).split()) >= 5)]

## Clean text

- Làm sạch văn bản (lower, bỏ dấu câu, stopwords, stemming...) + Tokenizer

In [26]:
# The first running
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [27]:
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

In [28]:
def clean_text(row):
    row = str(row).lower()

    # Remove email headers
    row = re.sub(r'(?i)\b(from|to|cc|bcc|subject|date|return-path|message-id|thread-topic|thread-index|content-type|mime-version|boundary|received|x-[\w-]+):.*', ' ', row)
    
    # Remove mailto links
    row = re.sub(r'mailto:[^\s]+', ' ', row)
    
    # Decode quoted-printable
    row = quopri.decodestring(row.encode('utf-8')).decode('utf-8', errors='ignore')
    
    # Unescape HTML entities
    row = html.unescape(row)
    
    # Strip HTML tags
    if '<' in row and '>' in row:
        row = BeautifulSoup(row, "lxml").get_text()
    
    # Normalize
    row = re.sub(r'[\t\r\n]', ' ', row)
    row = re.sub(r'[_~+\-]{2,}', ' ', row)
    row = re.sub(r"[<>()|&©ø%\[\]\\~*\$€£¥]", ' ', row)
    row = re.sub(r"\\x[0-9a-fA-F]{2}", ' ', row)
    row = re.sub(r'(https?://)([^/\s]+)([^\s]*)', r'\2', row)
    row = re.sub(r'[a-f0-9]{16,}', ' ', row)
    row = re.sub(r'([.?!])[\s]*\1+', r'\1', row)
    row = re.sub(r'\s+', ' ', row)

    # Remove code-like keywords
    row = re.sub(r'\b(function|var|return|typeof|window|document|eval|\.split)\b', ' ', row)
    
    # Remove programming symbols
    row = re.sub(r'[{}=<>\[\]^~|`#@*]', ' ', row)

    # Remove all emoji
    row = emoji.replace_emoji(row, replace='')

    # Cut code JS minify or base36 encode
    code_gibberish = re.search(r'[a-z0-9]{20,}', row)
    if code_gibberish and len(row) - code_gibberish.start() > 50:
        row = row[:code_gibberish.start()]

    # Cut off JS/CDATA tail
    cutoff = re.search(
        r'(//\s*!?\s*cdata|function\s*\(|var\s+[a-zA-Z]|window\s*\.\s*|document\s*\.\s*|this\s*\.)',
        row
    )
    if cutoff and len(row) - cutoff.start() > 10:
        row = row[:cutoff.start()]

    row = re.sub(r'!+\s*cdata\s*!+', ' ', row, flags=re.IGNORECASE)

    return row.strip()

df['clean_text'] = df['text'].apply(clean_text)
df['clean_text']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['clean_text'] = df['text'].apply(clean_text)


0        former russian economy minister alexei ulyukay...
1        republicans were just given a leg up over demo...
2        this has to be one of the best remix videos ev...
3        in line with the new language law, russian is ...
4        jerusalem — a day after approving the construc...
                               ...                        
68599    the super bowl had not yet begun and trump fan...
68600    u.s. house republicans on friday won passage o...
68601    share on facebook share on twitter known to th...
68602    a new jersey man who worked at the world trade...
68603    turkey and iran have agreed to discuss within ...
Name: clean_text, Length: 68434, dtype: object

In [29]:
print("Discovery text of data:\n")

print(f"Data 1454: {df['text'][1454]}\n")
print(f"Clean data 1454: {df['clean_text'][1454]}\n")

print(f"Data 19802: {df['text'][19802]}\n")
print(f"Clean data 19802: {df['clean_text'][19802]}\n")

Discovery text of data:

Data 1454: View source Re: Obama Says He Didn’t Know Hillary Clinton Was Using Private Email Address - NYTimes.com From:pir@hrcoffice.com To: jennifer.m.palmieri@gmail.com Date: 2015-03-08 10:21 Subject: Re: Obama Says He Didn’t Know Hillary Clinton Was Using Private Email Address - NYTimes.com Ok. Sounds like people are putting words into his mouth. On Mar 8, 2015, at 7:56 AM, Jennifer Palmieri <jennifer.m.palmieri@gmail.com<mailto:jennifer.m.palmieri@gmail.com>> wrote: Suggest Philippe talk to Josh or Eric. They know POTUS and HRC emailed. Josh has been asked about that. Standard practice is not to confirm anything about his email, so his answer to press was that he would not comment/confirm. I recollect that Josh was also asked if POTUS ever noticed her personal email account and he said something like POTUS likely had better things to do than focus on his Cabinet's email addresses. Sent from my iPad On Mar 8, 2015, at 12:40 AM, Philippe Reines <pir@hrcoffic

In [30]:
def tokenize_and_filter(text):  # dùng cho các mô hình truyền thống như TF-IDF
    # Mở rộng các từ viết tắt (contractions)
    text = contractions.fix(text)

    # Tokenize
    tokens = word_tokenize(text)

    # Lọc stopwords và chỉ giữ từ alphabet -> stemming
    return [stemmer.stem(w) for w in tokens if w.lower() not in stop_words and w.isalpha()]

In [31]:
def tokenize_texts(tokenizer, texts, max_length=512): # dùng cho các mô hình như BERT
    return tokenizer(
        texts,
        truncation=True,      # Cắt bớt nếu dài hơn max_length
        padding=True,         # Tự động thêm padding để đồng đều độ dài
        max_length=max_length # Độ dài tối đa mỗi đoạn văn
    )

# Model building

In [32]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from tqdm import tqdm
from torch.utils.data import Dataset, DataLoader
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np
from sklearn.model_selection import train_test_split

# Chuẩn bị dữ liệu
X = df['clean_text']
y = df['label']

# Chia tập huấn luyện (train) và tập kiểm tra (test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Lưu lại index gốc
X_test_indices = X_test.index

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

2025-06-12 07:01:50.800154: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1749711710.981868      35 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1749711711.035778      35 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


## 1. Multilevel-CNN
- Bắt được đặc trưng từ từ, cụm từ, câu bằng các kernel kích thước khác nhau (3, 4, 5,...).
- Phù hợp với dữ liệu dài + đa dạng, không phụ thuộc vào thứ tự quá dài như RNN.
- Huấn luyện nhanh hơn LSTM, độ chính xác cao hơn CNN đơn thuần.

## 2. CNN + BiLSTM
- CNN trích đặc trưng cục bộ, sau đó BiLSTM hiểu ngữ cảnh hai chiều (trước và sau).
- Phù hợp cho dữ liệu có logic tuyến tính (như tin tức).
- Độ chính xác cao, tuy chậm hơn Multilevel-CNN chút nhưng vẫn tốt nếu tối ưu đúng.

## Multilevel-CNN

**Kiến trúc gợi ý:**

Input (chuỗi văn bản)
→ Embedding Layer
→ Conv1D (kernel size 3) → GlobalMaxPool
→ Conv1D (kernel size 4) → GlobalMaxPool
→ Conv1D (kernel size 5) → GlobalMaxPool
→ Concatenate
→ Dense layers → Dropout (chưa có trong code)
→ Output (Sigmoid / Softmax)

In [33]:
# Tokenization
# Fit tokenizer trên dữ liệu train
tokenizer = Tokenizer(num_words=5000) # giữ lại 5000 từ phổ biến nhất
tokenizer.fit_on_texts(X_train)
# Convert văn bản thành câu và padding
X_train_seq = tokenizer.texts_to_sequences(X_train)
X_test_seq = tokenizer.texts_to_sequences(X_test)
X_train_pad = pad_sequences(X_train_seq, maxlen=512, padding='post', truncating='post')
X_test_pad = pad_sequences(X_test_seq, maxlen=512, padding='post', truncating='post')

# Convert to tensor
X_train_tensor = torch.tensor(X_train_pad, dtype=torch.long)
X_test_tensor = torch.tensor(X_test_pad, dtype=torch.long)
y_train_tensor = torch.tensor(y_train.values, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test.values, dtype=torch.float32)

# Dataset & DataLoader
class TextDataset(Dataset):
    '''
    Tạo custom Dataset từ dữ liệu đã padding và label.
    '''
    def __init__(self, X, y):
        self.X = X
        self.y = y
    def __len__(self):
        return len(self.X)
    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

train_dataset = TextDataset(X_train_tensor, y_train_tensor)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_dataset = TextDataset(X_test_tensor, y_test_tensor)
test_loader = DataLoader(test_dataset, batch_size=32)

# Định nghĩa model
class MultilevelCNN(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_classes=1):
        super(MultilevelCNN, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim) # Embedding layer
        self.conv3 = nn.Conv1d(in_channels=embed_dim, out_channels=32, kernel_size=3)
        self.conv4 = nn.Conv1d(in_channels=embed_dim, out_channels=32, kernel_size=4)
        self.conv5 = nn.Conv1d(in_channels=embed_dim, out_channels=32, kernel_size=5)
        self.fc = nn.Linear(32*3, 10)
        self.out = nn.Linear(10, num_classes)
    def forward(self, x):
        x = self.embedding(x)  # (batch_size, seq_len, embed_dim)
        x = x.permute(0, 2, 1) # (batch_size, embed_dim, seq_len)

        x1 = F.relu(self.conv3(x)) # Conv1d với kernel_size = 3
        x2 = F.relu(self.conv4(x)) # Conv1d với kernel_size = 4
        x3 = F.relu(self.conv5(x)) # Conv1d với kernel_size = 5

        x1 = F.max_pool1d(x1, x1.size(2)).squeeze(2)
        x2 = F.max_pool1d(x2, x2.size(2)).squeeze(2)
        x3 = F.max_pool1d(x3, x3.size(2)).squeeze(2)

        x = torch.cat((x1, x2, x3), 1) # Nối lại các features
        x = F.relu(self.fc(x))
        x = torch.sigmoid(self.out(x)) # Binary classification

        return x

# Model, loss, optimizer
model = MultilevelCNN(vocab_size=5000, embed_dim=16).to(device)
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

best_val_loss = float('inf')
patience = 2
wait = 0
num_epochs = 20
# Training loop
for epoch in range(num_epochs):
    model.train()
    total_loss = 0
    loop = tqdm(train_loader, desc=f"Epoch {epoch+1} [Train]", leave=True)
    for batch_X, batch_y in loop:
        batch_X, batch_y = batch_X.to(device), batch_y.to(device)
        optimizer.zero_grad()
        outputs = model(batch_X).squeeze(1)
        loss = criterion(outputs, batch_y)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
        loop.set_postfix(loss=loss.item())

    # === Validation sau mỗi epoch ===
    model.eval()
    correct = 0
    val_loss_total = 0
    total = 0
    val_loop = tqdm(test_loader, desc=f"Epoch {epoch+1} [Val]", leave=True)
    with torch.no_grad():
        for batch_X, batch_y in val_loop:
            batch_X, batch_y = batch_X.to(device), batch_y.to(device)
            outputs = model(batch_X).squeeze(1)
            val_loss = criterion(outputs, batch_y)
            val_loss_total += val_loss.item()
            
            preds = (outputs > 0.5).float()
            correct += (preds == batch_y).sum().item()
            total += batch_y.size(0)
            val_loop.set_postfix(val_loss=val_loss.item())

    val_acc = correct / total
    val_loss_avg = val_loss_total / len(test_loader)
    total_loss_avg = total_loss / len(test_loader)
    print(f"Epoch {epoch+1} | Train Loss: {total_loss_avg:.4f} | Val Loss: {val_loss_avg:.4f} | Val Acc: {val_acc*100:.2f}%")

    # === Early stopping check ===
    if val_loss_avg < best_val_loss:
        best_val_loss = val_loss_avg
        wait = 0
    else:
        wait += 1
        if wait >= patience:
            print("Early stopping triggered.")
            break

Epoch 1 [Train]: 100%|██████████| 1497/1497 [00:07<00:00, 203.58it/s, loss=0.0772]
Epoch 1 [Val]: 100%|██████████| 642/642 [00:01<00:00, 361.95it/s, val_loss=0.201] 


Epoch 1 | Train Loss: 0.6831 | Val Loss: 0.1689 | Val Acc: 93.41%


Epoch 2 [Train]: 100%|██████████| 1497/1497 [00:06<00:00, 232.74it/s, loss=0.116]  
Epoch 2 [Val]: 100%|██████████| 642/642 [00:01<00:00, 367.45it/s, val_loss=0.184]  


Epoch 2 | Train Loss: 0.3135 | Val Loss: 0.1353 | Val Acc: 94.97%


Epoch 3 [Train]: 100%|██████████| 1497/1497 [00:06<00:00, 237.27it/s, loss=0.058] 
Epoch 3 [Val]: 100%|██████████| 642/642 [00:01<00:00, 364.11it/s, val_loss=0.135]  


Epoch 3 | Train Loss: 0.2246 | Val Loss: 0.1495 | Val Acc: 94.58%


Epoch 4 [Train]: 100%|██████████| 1497/1497 [00:06<00:00, 241.06it/s, loss=0.0139] 
Epoch 4 [Val]: 100%|██████████| 642/642 [00:01<00:00, 370.14it/s, val_loss=0.103]  

Epoch 4 | Train Loss: 0.1640 | Val Loss: 0.1373 | Val Acc: 95.19%
Early stopping triggered.





In [34]:
# Predict
def predict_text(texts):
    model.eval()
    seq = tokenizer.texts_to_sequences(texts)
    padded = pad_sequences(seq, maxlen=512, padding='post', truncating='post')
    tensor = torch.tensor(padded, dtype=torch.long).to(device)
    with torch.no_grad():
        outputs = model(tensor)
        probs = outputs.cpu().numpy()
        return probs, (probs > 0.5).astype(int)

probs, preds = predict_text(X_test)

# Lấy lại văn bản gốc từ df['text'] theo index
raw_texts = df.loc[X_test_indices, 'text'].tolist()
# Tạo DataFrame kết quả
results_df = pd.DataFrame({
    "index": X_test_indices,                     # giữ lại chỉ số gốc
    "text": raw_texts,                           # văn bản
    "prob": np.round(probs.flatten(), 4),        # xác suất dự đoán (real)
    "predict": preds.flatten(),                  # nhãn dự đoán (0/1)
    "label": y_test.tolist()                     # nhãn thật
})

# Hiển thị 20 dòng đầu
print(results_df.head(20))

    index                                               text    prob  predict  \
0   55768  U.S. President Donald Trump will strike a blow...  0.9997        1   
1   64071  The EU is a nazi brainchild: it is a bureaucra...  0.0044        0   
2   57308  White House press secretary Sean Spicer insist...  0.9977        1   
3   64897  Time to exhale James Barack Obama supporter Ja...  0.0212        0   
4    9694  Right-wing Christian extremism is a cancer tha...  0.0000        0   
5    2625  A joint U.S.-South Korean military exercise wi...  0.9998        1   
6   18564  Shelling killed three people in the last major...  0.9999        1   
7   30908  http://www.shtfplan.com/headline-news/you-will...  0.0000        0   
8   47973  21st Century Wire says..We ll believe it  when...  0.0675        0   
9   31723  CHICAGO —Donald Trump, the GOP presidential fr...  0.9969        1   
10  34110  U.S. House of Representatives Speaker Paul Rya...  0.9999        1   
11   5897  The DC Mayor deci

In [35]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import pandas as pd

# Chuyển y_test về numpy array (nếu chưa)
true_labels = y_test.values

# Accuracy
acc = accuracy_score(true_labels, preds)
print(f"\nAccuracy: {acc:.4f}")

# Confusion matrix
conf_matrix = confusion_matrix(true_labels, preds)
conf_df = pd.DataFrame(conf_matrix)

print("\nConfusion Matrix:")
print(conf_df)

# Classification report
print("\nClassification Report:")
print(classification_report(true_labels, preds))


Accuracy: 0.9519

Confusion Matrix:
      0      1
0  9522    693
1   294  10022

Classification Report:
              precision    recall  f1-score   support

           0       0.97      0.93      0.95     10215
           1       0.94      0.97      0.95     10316

    accuracy                           0.95     20531
   macro avg       0.95      0.95      0.95     20531
weighted avg       0.95      0.95      0.95     20531



## CNN + BiLSTM

Kiến trúc gợi ý: