# **Final Project**

## **Problem statement:**

The widespread dissemination of fake news and propaganda presents serious societal risks, including the erosion of public trust, political polarization, manipulation of elections, and the spread of harmful misinformation during crises such as pandemics or conflicts. From an NLP perspective, detecting fake news is fraught with challenges. Linguistically, fake news often mimics the tone and structure of legitimate journalism, making it difficult to distinguish using surface-level features. The absence of reliable and up-to-date labeled datasets, especially across multiple languages and regions, hampers the effectiveness of supervised learning models. Additionally, the dynamic and adversarial nature of misinformation means that malicious actors constantly evolve their language and strategies to bypass detection systems. Cultural context, sarcasm, satire, and implicit bias further complicate automated analysis. Moreover, NLP models risk amplifying biases present in training data, leading to unfair classifications and potential censorship of legitimate content. These challenges underscore the need for cautious, context-aware approaches, as the failure to address them can inadvertently contribute to misinformation, rather than mitigate it.



Use the datasets at the following link to complete the requirements: https://drive.google.com/drive/folders/1mrX3vPKhEzxG96OCPpCeh9F8m_QKCM4z?usp=sharing

## **About dataset:**

* **True Articles**:

  * **File**: `MisinfoSuperset_TRUE.csv`
  * **Sources**:

    * Reputable media outlets like **Reuters**, **The New York Times**, **The Washington Post**, etc.

* **Fake/Misinformation/Propaganda Articles**:

  * **File**: `MisinfoSuperset_FAKE.csv`
  * **Sources**:

    * **American right-wing extremist websites** (e.g., Redflag Newsdesk, Breitbart, Truth Broadcast Network)
    * **Public dataset** from:

      * Ahmed, H., Traore, I., & Saad, S. (2017): "Detection of Online Fake News Using N-Gram Analysis and Machine Learning Techniques" *(Springer LNCS 10618)*



## **Requirement**

A team consisting of three members must complete a project that involves applying the methods learned from the beginning of the course up to the present. The team is expected to follow and document the entire machine learning workflow, which includes the following steps:

1. **Data Preprocessing**: Clean and prepare the dataset.

2. **Exploratory Data Analysis (EDA)**: Explore and visualize the data.

3. **Model Building**: Select and build one or more machine learning models suitable for the problem at hand.

4. **Hyperparameter Tuning**: Set and adjust the model's hyperparameters using appropriate methods to improve performance.

5. **Model Training**: Train the model(s) on the training dataset.

6. **Performance Evaluation**: Evaluate the trained model(s) using appropriate metrics (e.g., accuracy, precision, recall, F1-score, confusion matrix, etc.) and validate their performance on unseen data.

7. **Conclusion**: Summarize the results, discuss the model's strengths and weaknesses, and suggest possible improvements or future work.
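
As a reference for step 6, here is a minimal sketch of how the listed metrics follow from the four confusion-matrix counts. The counts and the helper name `classification_metrics` are hypothetical, chosen only for illustration; they are not results from this project.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts, chosen only for illustration
metrics = classification_metrics(tp=80, fp=20, fn=10, tn=90)
print(metrics)
```

In practice, a library such as scikit-learn would typically compute these from the predicted and true labels; the sketch only makes the definitions explicit.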





# Read dataset

In [1]:
!pip install contractions



In [2]:
import pandas as pd
import re
import contractions

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from transformers import BertTokenizer

In [3]:
true_df = pd.read_csv("/kaggle/input/misinfo/DataSet_Misinfo_TRUE.csv")
true_df

Unnamed: 0.1,Unnamed: 0,text
0,0,The head of a conservative Republican faction ...
1,1,Transgender people will be allowed for the fir...
2,2,The special counsel investigation of links bet...
3,3,Trump campaign adviser George Papadopoulos tol...
4,4,President Donald Trump called on the U.S. Post...
...,...,...
34970,34970,Most conservatives who oppose marriage equalit...
34971,34971,The freshman senator from Georgia quoted scrip...
34972,34972,The State Department told the Republican Natio...
34973,34973,"ADDIS ABABA, Ethiopia —President Obama convene..."


In [4]:
fake_df = pd.read_csv("/kaggle/input/misinfo/DataSet_Misinfo_FAKE.csv")
fake_df

Unnamed: 0.1,Unnamed: 0,text
0,0,Donald Trump just couldn t wish all Americans ...
1,1,House Intelligence Committee Chairman Devin Nu...
2,2,"On Friday, it was revealed that former Milwauk..."
3,3,"On Christmas day, Donald Trump announced that ..."
4,4,Pope Francis used his annual Christmas Day mes...
...,...,...
43637,44422,The USA wants to divide Syria.\r\n\r\nGreat Br...
43638,44423,The Ukrainian coup d'etat cost the US nothing ...
43639,44424,The European Parliament falsifies history by d...
43640,44425,The European Parliament falsifies history by d...


In [5]:
# Drop the leftover row-number column
true_df = true_df.drop('Unnamed: 0', axis=1)
fake_df = fake_df.drop('Unnamed: 0', axis=1)

In [6]:
true_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34975 entries, 0 to 34974
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    34946 non-null  object
dtypes: object(1)
memory usage: 273.4+ KB


In [7]:
fake_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 43642 entries, 0 to 43641
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    43642 non-null  object
dtypes: object(1)
memory usage: 341.1+ KB


In [8]:
true_df.describe()

Unnamed: 0,text
count,34946
unique,34526
top,"Killing Obama administration rules, dismantlin..."
freq,58


In [9]:
fake_df.describe()

Unnamed: 0,text
count,43642
unique,34078
top,Leave a Reply Click here to get more info on f...
freq,38


# Data Preprocessing

- Handle null values

In [10]:
true_df.isnull().sum()

text    29
dtype: int64

In [11]:
fake_df.isnull().sum()

text    0
dtype: int64

In [12]:
true_df = true_df.dropna()

- Handle duplicate values

In [13]:
true_df.duplicated().sum()

420

In [14]:
fake_df.duplicated().sum()

9564

In [15]:
true_df = true_df.drop_duplicates()
fake_df = fake_df.drop_duplicates()
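
The null and duplicate handling above can be reproduced on a toy frame. This is a minimal sketch, assuming a single `text` column as in the notebook; the toy rows are made up and are not taken from the project files.

```python
import pandas as pd

# Toy data (hypothetical), not taken from the project files
toy = pd.DataFrame({"text": ["a b c", None, "a b c", "d e f"]})

n_null = int(toy["text"].isnull().sum())   # rows with a missing text
toy = toy.dropna()

n_dup = int(toy.duplicated().sum())        # repeats of an earlier row
toy = toy.drop_duplicates().reset_index(drop=True)

print(n_null, n_dup, len(toy))
```

`duplicated()` flags every occurrence after the first, so the count equals the number of rows that `drop_duplicates()` removes.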

- Add labels and merge the two datasets

In [16]:
true_df['label'] = 1
fake_df['label'] = 0

df = pd.concat([true_df, fake_df], ignore_index=True)
df

Unnamed: 0,text,label
0,The head of a conservative Republican faction ...,1
1,Transgender people will be allowed for the fir...,1
2,The special counsel investigation of links bet...,1
3,Trump campaign adviser George Papadopoulos tol...,1
4,President Donald Trump called on the U.S. Post...,1
...,...,...
68599,"Apparently, the new Kyiv government is in a hu...",0
68600,The USA wants to divide Syria.\r\n\r\nGreat Br...,0
68601,The Ukrainian coup d'etat cost the US nothing ...,0
68602,The European Parliament falsifies history by d...,0


In [17]:
df = df.sample(frac=1, random_state=42).reset_index(drop=True) # Shuffle dataset
df

Unnamed: 0,text,label
0,Former Russian economy minister Alexei Ulyukay...,1
1,Republicans were just given a leg up over Demo...,0
2,This has to be one of the best remix videos ev...,0
3,"In line with the new Language Law, Russian is ...",0
4,JERUSALEM — A day after approving the const...,1
...,...,...
68599,The Super Bowl had not yet begun and Trump fan...,0
68600,U.S. House Republicans on Friday won passage o...,1
68601,Share on Facebook Share on Twitter Known to th...,0
68602,A New Jersey man who worked at the World Trade...,1


* Check for class imbalance

In [None]:
df['label'].value_counts()

The dataset is not imbalanced.
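
The class shares can be checked by hand from the unique-text counts that `describe()` reported earlier (34,526 true and 34,078 fake). The sketch below assumes those counts equal the per-class row counts after null and duplicate removal, which is consistent with the merged length of 68,604.

```python
# Per-class row counts after cleaning, taken from the describe() outputs
n_true = 34526   # label 1
n_fake = 34078   # label 0
total = n_true + n_fake

share_true = n_true / total
share_fake = n_fake / total
print(f"true: {share_true:.2%}, fake: {share_fake:.2%}")
```

Both shares are within one percentage point of 50%, so no resampling or class weighting appears necessary at this stage.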

In [None]:
import re
import pandas as pd

def print_matching_rows(df, column_name='text'):
    """Print every row whose text contains an HTML <br> tag or a literal \\x.. hex code."""
    for i, text in df[column_name].items():
        found_br = re.search(r'<\s*br\s*/?\s*>', text)
        found_hex = re.search(r"\\x[0-9a-fA-F]{2}", text)

        if found_br or found_hex:
            print(f"\nRow {i} contains:")
            if found_br:
                print("  - <br /> tag")
            if found_hex:
                print("  - Hex code of the form \\x..")
            print("  => Content:", text)

print_matching_rows(df)

## Clean text

- Clean the text (lowercasing, punctuation removal, stopword removal, stemming, ...) and tokenize

In [18]:
# Run once to download the required NLTK resources
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [19]:
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

In [20]:
def clean_text(row):
    """
    Clean a single text value (applied row by row via `Series.apply`).
    Lowercases the text, then uses regex patterns to strip HTML <br> tags,
    hex codes, URLs (keeping only the domain), special characters,
    and redundant whitespace and punctuation.
    """
    row = str(row).lower()

    # Remove HTML <br /> tags (must run before '<' and '>' are stripped below)
    row = re.sub(r'<\s*br\s*/?\s*>', ' ', row)

    # Remove literal hex codes like \x92 (must run before '\' is stripped below)
    row = re.sub(r"\\x[0-9a-fA-F]{2}", ' ', row)

    # Replace each URL with its domain (before symbols inside the URL are stripped)
    row = re.sub(r'(https?://)([^/\s]+)([^\s]*)', r'\2', row)

    # Replace tab, carriage-return, and newline characters with spaces
    row = re.sub(r'[\t\r\n]', ' ', row)

    # Remove runs of repeated special characters (e.g. __, --, ++)
    row = re.sub(r'[_~+\-]{2,}', ' ', row)

    # Remove unwanted symbols while preserving ., !, ?, and single -
    row = re.sub(r"[<>()|&©ø%\[\]\\~*\$€£¥]", ' ', row)

    # Normalize multiple spaces
    row = re.sub(r'\s+', ' ', row)

    # Collapse repeated punctuation such as ". .", "!!", "??"
    row = re.sub(r'([.?!])[\s]*\1+', r'\1', row)

    return row.strip()

df['clean_text'] = df['text'].apply(clean_text)  # could also be done at model-training time
df['clean_text']

0        former russian economy minister alexei ulyukay...
1        republicans were just given a leg up over demo...
2        this has to be one of the best remix videos ev...
3        in line with the new language law, russian is ...
4        jerusalem — a day after approving the construc...
                               ...                        
68599    the super bowl had not yet begun and trump fan...
68600    u.s. house republicans on friday won passage o...
68601    share on facebook share on twitter known to th...
68602    a new jersey man who worked at the world trade...
68603    turkey and iran have agreed to discuss within ...
Name: clean_text, Length: 68604, dtype: object

In [None]:
print("Sample texts before and after cleaning:\n")

print(f"Data 1454: {df['text'][1454]}\n")
print(f"Clean data 1454: {df['clean_text'][1454]}\n")

print(f"Data 19802: {df['text'][19802]}\n")
print(f"Clean data 19802: {df['clean_text'][19802]}\n")
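
To see what individual patterns in `clean_text` do in isolation, the sketch below applies two of them, copied from the function, to a made-up sentence. The sample string and the `example.com` URL are hypothetical.

```python
import re

# Hypothetical input, chosen to trigger both patterns
sample = "Read more at https://example.com/news?id=1 !! Really??"

# Keep only the domain of each URL
step1 = re.sub(r'(https?://)([^/\s]+)([^\s]*)', r'\2', sample)

# Collapse repeated punctuation such as "!!" or "??"
step2 = re.sub(r'([.?!])[\s]*\1+', r'\1', step1)

print(step1)
print(step2)
```

The first substitution keeps capture group 2 (the host) and discards the scheme and path; the second keeps one copy of any repeated `.`, `?` or `!`.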

In [21]:
def tokenize_and_filter(text):  # for traditional models such as TF-IDF
    # Expand contractions (e.g. "don't" -> "do not")
    text = contractions.fix(text)

    # Tokenize
    tokens = word_tokenize(text)

    # Drop stopwords, keep alphabetic tokens only, then stem
    return [stemmer.stem(w) for w in tokens if w.lower() not in stop_words and w.isalpha()]
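
The filtering and stemming step can be tried without downloading any NLTK data. This is a simplified sketch and not the function above: a regex stands in for `word_tokenize`, a tiny hard-coded set stands in for NLTK's English stopword list, and contraction expansion is omitted. The name `simple_tokenize_and_filter` and the sample sentence are hypothetical.

```python
import re
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Tiny illustrative stopword set (assumption); the notebook uses NLTK's English list
toy_stop_words = {"the", "were", "to", "in"}

def simple_tokenize_and_filter(text):
    # Regex tokenization stands in for word_tokenize: keeps alphabetic runs only,
    # so numbers and punctuation are dropped at this stage
    tokens = re.findall(r"[a-z]+", text.lower())
    # Drop stopwords, then reduce each remaining word to its stem
    return [stemmer.stem(w) for w in tokens if w not in toy_stop_words]

stems = simple_tokenize_and_filter("The voters were running to the polls in 2020")
print(stems)
```

Stemming maps inflected forms such as "running" to a shared root, which shrinks the vocabulary that a bag-of-words or TF-IDF model has to learn.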

In [22]:
def tokenize_texts(tokenizer, texts, max_length=512): # for models such as BERT
    return tokenizer(
        texts,
        truncation=True,      # Truncate sequences longer than max_length
        padding=True,         # Pad to the longest sequence in the batch
        max_length=max_length # Maximum number of tokens per text
    )
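
To make the effect of `truncation` and `padding` concrete, here is a pure-Python sketch on toy token-id lists. It does not call the `transformers` library; `pad_and_truncate` is a hypothetical helper, the ids are made up, and special tokens such as `[CLS]` and `[SEP]` are ignored.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Toy illustration: cut each sequence to max_length, then pad to the longest one."""
    truncated = [seq[:max_length] for seq in batch]
    longest = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in truncated]
    # 1 marks a real token, 0 marks padding that the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq))
                      for seq in truncated]
    return input_ids, attention_mask

# Hypothetical token ids for a batch of two texts
ids, mask = pad_and_truncate([[5, 6, 7, 8, 9, 10], [11, 12]], max_length=4)
print(ids)
print(mask)
```

The first sequence is cut from six ids to four, and the second is padded from two ids to four, so both rows have equal length and can be stacked into one tensor.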