# EDA and Data Preprocessing
- Analyze the data distribution, missing values, and outliers.
- Normalize numeric variables and encode categorical variables.
- Select existing features or create new ones (feature engineering).
- Split the dataset into train/validation/test sets at fixed ratios using stratified sampling.
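
As a minimal sketch of the first item in the list above, the snippet below counts missing values and flags unusually long or short articles by word count. The toy `sample_df`, the `word_count` series, and the 1.5 * IQR threshold are illustrative assumptions and are not part of the pipeline that follows.

```python
import pandas as pd

# Hypothetical stand-in for the article data; the real files are loaded further down.
sample_df = pd.DataFrame(
    {"text": ["one two three", "one two", None, "one two three four", "word " * 50]}
)

# Missing values in the text column.
missing_text = int(sample_df["text"].isna().sum())

# Article length in whitespace-separated words, computed on non-missing rows only.
word_count = sample_df["text"].dropna().str.split().str.len()

# Flag outliers with the 1.5 * IQR rule (an assumed, commonly used heuristic).
q1 = word_count.quantile(0.25)
q3 = word_count.quantile(0.75)
iqr = q3 - q1
outliers = word_count[(word_count < q1 - 1.5 * iqr) | (word_count > q3 + 1.5 * iqr)]

print(f"missing: {missing_text}")
print(f"outlier word counts: {outliers.tolist()}")
```

On real articles the word-count distribution is usually heavily skewed, so a percentile cut-off may be a better fit than the IQR rule; that choice is left open here.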


# Processing

In [1]:
import pandas as pd

The source dataset can be downloaded using the link: https://drive.google.com/drive/folders/1mrX3vPKhEzxG96OCPpCeh9F8m_QKCM4z?usp=sharing

In [2]:
true_df = pd.read_csv("./DataSet_Misinfo_TRUE.csv", index_col=0)
fake_df = pd.read_csv("./DataSet_Misinfo_FAKE.csv", index_col=0)

In [3]:
true_df.shape, fake_df.shape

((34975, 1), (43642, 1))

Before cleaning, there are 34975 true articles and 43642 fake articles.

In [4]:
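def process_df(df: pd.DataFrame) -> pd.DataFrame:
    """Drop NaN rows and duplicate rows, reporting how many of each were found."""
    nan_count = int(df.isna().sum()["text"])
    # Counted before the NaN rows are dropped, so repeated NaN rows are included here too.
    duplicate_count = df.duplicated().sum()
    df = df.dropna()
    df = df.drop_duplicates()
    print(
        f"{nan_count} NaN, {duplicate_count} duplicate in the provided DataFrame. Have cleaned them all up."
    )
    return df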

The true-articles dataset contains 29 NaN entries; the fake-articles dataset contains none. `process_df` drops the NaN rows and the duplicate rows from both datasets.

In [5]:
true_df = process_df(true_df)

29 NaN, 448 duplicate in the provided DataFrame. Have cleaned them all up.


In [6]:
fake_df = process_df(fake_df)

0 NaN, 9564 duplicate in the provided DataFrame. Have cleaned them all up.


In [7]:
true_df.head()

Unnamed: 0,text
0,The head of a conservative Republican faction ...
1,Transgender people will be allowed for the fir...
2,The special counsel investigation of links bet...
3,Trump campaign adviser George Papadopoulos tol...
4,President Donald Trump called on the U.S. Post...


In [8]:
fake_df.head()

Unnamed: 0,text
0,Donald Trump just couldn t wish all Americans ...
1,House Intelligence Committee Chairman Devin Nu...
2,"On Friday, it was revealed that former Milwauk..."
3,"On Christmas day, Donald Trump announced that ..."
4,Pope Francis used his annual Christmas Day mes...


The dataset contains only the article text. No metadata such as publisher, date, or author is attached.

In [9]:
true_df.shape[0] + fake_df.shape[0]

68604
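
The total above is 28 larger than the reported counts alone would suggest: 34975 - 29 - 448 + 43642 - 9564 = 68576. A likely explanation, consistent with how `process_df` is written, is that `DataFrame.duplicated()` treats NaN rows as duplicates of one another, so 28 of the 29 NaN rows are counted once as NaN and once more as duplicates. The toy frame below is a hypothetical illustration of that behaviour, not the real data.

```python
import pandas as pd

# Hypothetical toy data: one repeated text, three NaN rows, one unique text.
toy_df = pd.DataFrame({"text": ["a", "a", None, None, None, "b"]})

nan_count = int(toy_df["text"].isna().sum())
# Duplicates counted on the raw frame: the repeated "a" plus two of the three NaN rows.
dup_before_dropna = int(toy_df.duplicated().sum())
# Duplicates counted after the NaN rows are gone: only the repeated "a".
dup_after_dropna = int(toy_df.dropna().duplicated().sum())

cleaned = toy_df.dropna().drop_duplicates()

print(nan_count, dup_before_dropna, dup_after_dropna, len(cleaned))
```

Counting duplicates after `dropna()` would make the two reported numbers add up to the number of removed rows; the cleaned result is the same either way.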

# Splitting

In [10]:
from sklearn.model_selection import train_test_split

In [11]:
true_df["label"] = 1
fake_df["label"] = 0

In [12]:
combined_df = pd.concat([true_df, fake_df], ignore_index=True)

In [13]:
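random_seed = 42
test_size = 0.1  # fraction of the full dataset held out for testing
valid_size = 0.1  # fraction of the full dataset held out for validation

# First split: hold out the test set, keeping the label ratio the same in both parts.
train_stratified, test_stratified = train_test_split(
    combined_df,
    test_size=test_size,
    random_state=random_seed,
    stratify=combined_df["label"],
)

# Second split: take the validation set from the remaining data. The fraction is
# rescaled so that the validation set is valid_size of the full dataset.
train_stratified, valid_stratified = train_test_split(
    train_stratified,
    test_size=valid_size / (1 - test_size),
    random_state=random_seed,
    stratify=train_stratified["label"],
)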

train_stratified.to_csv("./stratify_train.csv")
valid_stratified.to_csv("./stratify_valid.csv")
test_stratified.to_csv("./stratify_test.csv")
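
Because the validation set is taken from the 90% that remains after the test split, its fraction is rescaled to 0.1 / 0.9 ≈ 0.111, which yields roughly an 80/10/10 split of the full data. The sketch below repeats the same two-stage split on a hypothetical toy frame (the 1000 rows and the 60/40 label mix in `toy_df` are assumptions) and checks the resulting sizes and label shares.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 600 rows labelled 1 and 400 rows labelled 0.
toy_df = pd.DataFrame(
    {"text": [f"doc {i}" for i in range(1000)], "label": [1] * 600 + [0] * 400}
)

test_size = 0.1
valid_size = 0.1

# Stage 1: hold out the test set.
train_part, test_part = train_test_split(
    toy_df, test_size=test_size, random_state=42, stratify=toy_df["label"]
)
# Stage 2: take the validation set from what is left, with the rescaled fraction.
train_part, valid_part = train_test_split(
    train_part,
    test_size=valid_size / (1 - test_size),
    random_state=42,
    stratify=train_part["label"],
)

for name, part in [("train", train_part), ("valid", valid_part), ("test", test_part)]:
    share = len(part) / len(toy_df)
    print(f"{name}: {len(part)} rows ({share:.1%}), label mean {part['label'].mean():.3f}")
```

The sizes can be off by one row because scikit-learn rounds the test fraction up to a whole number of samples.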

In [14]:
train_normal, test_normal = train_test_split(
    combined_df,
    test_size=test_size,
    random_state=random_seed,
)

train_normal, valid_normal = train_test_split(
    train_normal,
    test_size=valid_size / (1 - test_size),
    random_state=random_seed,
)

train_normal.to_csv("./normal_train.csv")
valid_normal.to_csv("./normal_valid.csv")
test_normal.to_csv("./normal_test.csv")
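
The second set of files uses the same sizes without `stratify`, so the label proportions in each split are left to chance. A minimal sketch of how the two strategies could be compared is shown below; the helper `label_share` and the deliberately imbalanced `toy_df` are hypothetical and only serve to make the difference visible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical, deliberately imbalanced toy data: 180 rows of label 1, 20 of label 0.
toy_df = pd.DataFrame(
    {"text": [f"doc {i}" for i in range(200)], "label": [1] * 180 + [0] * 20}
)


def label_share(df: pd.DataFrame) -> float:
    """Fraction of rows with label 1 (hypothetical helper)."""
    return float(df["label"].mean())


# Same test size and seed, with and without stratification.
_, test_strat = train_test_split(
    toy_df, test_size=0.1, random_state=42, stratify=toy_df["label"]
)
_, test_plain = train_test_split(toy_df, test_size=0.1, random_state=42)

print(f"full data:       {label_share(toy_df):.3f}")
print(f"stratified test: {label_share(test_strat):.3f}")
print(f"plain test:      {label_share(test_plain):.3f}")
```

With classes as close to balanced as in this dataset, the two strategies will probably give similar label shares; stratification matters more when one class is rare or the splits are small.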