# **Final Project**

## **Problem statement:**

The widespread dissemination of fake news and propaganda presents serious societal risks, including the erosion of public trust, political polarization, manipulation of elections, and the spread of harmful misinformation during crises such as pandemics or conflicts. From an NLP perspective, detecting fake news is fraught with challenges. Linguistically, fake news often mimics the tone and structure of legitimate journalism, making it difficult to distinguish using surface-level features. The absence of reliable and up-to-date labeled datasets, especially across multiple languages and regions, hampers the effectiveness of supervised learning models. Additionally, the dynamic and adversarial nature of misinformation means that malicious actors constantly evolve their language and strategies to bypass detection systems. Cultural context, sarcasm, satire, and implicit bias further complicate automated analysis. Moreover, NLP models risk amplifying biases present in training data, leading to unfair classifications and potential censorship of legitimate content. These challenges underscore the need for cautious, context-aware approaches, as the failure to address them can inadvertently contribute to misinformation, rather than mitigate it.



Use the datasets at the following link to complete the requirements: https://drive.google.com/drive/folders/1mrX3vPKhEzxG96OCPpCeh9F8m_QKCM4z?usp=sharing

## **About the dataset:**

* **True Articles**:

  * **File**: `MisinfoSuperset_TRUE.csv`
  * **Sources**:

    * Reputable media outlets like **Reuters**, **The New York Times**, **The Washington Post**, etc.

* **Fake/Misinformation/Propaganda Articles**:

  * **File**: `MisinfoSuperset_FAKE.csv`
  * **Sources**:

    * **American right-wing extremist websites** (e.g., Redflag Newsdesk, Breitbart, Truth Broadcast Network)
    * **Public dataset** from:

      * Ahmed, H., Traore, I., & Saad, S. (2017): "Detection of Online Fake News Using N-Gram Analysis and Machine Learning Techniques" *(Springer LNCS 10618)*



## **Requirements**

A team consisting of three members must complete a project that involves applying the methods learned from the beginning of the course up to the present. The team is expected to follow and document the entire machine learning workflow, which includes the following steps:

1. **Data Preprocessing**: Clean and prepare the dataset, etc.

2. **Exploratory Data Analysis (EDA)**: Explore and visualize the data.

3. **Model Building**: Select and build one or more machine learning models suitable for the problem at hand.

4. **Hyperparameter Setup**: Set and tune the model's hyperparameters using appropriate methods to improve performance.

5. **Model Training**: Train the model(s) on the training dataset.

6. **Performance Evaluation**: Evaluate the trained model(s) using appropriate metrics (e.g., accuracy, precision, recall, F1-score, confusion matrix, etc.) and validate their performance on unseen data.

7. **Conclusion**: Summarize the results, discuss the model's strengths and weaknesses, and suggest possible improvements or future work.
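
The metrics named in step 6 can all be derived from the four cells of a binary confusion matrix. The following is a minimal sketch in plain Python, assuming label 1 (true news) is treated as the positive class; the function name `classification_metrics` and the toy labels are illustrative and not part of the project code.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    # Guard every ratio against an empty denominator.
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"confusion": {"tp": tp, "fp": fp, "fn": fn, "tn": tn},
            "accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels for illustration only (1 = true news, 0 = fake news).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Reporting precision and recall separately matters here because the two error types have different costs: a false positive lets a fake article pass as true, while a false negative flags a legitimate article as fake.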





# Read dataset

In [1]:
!pip install contractions



In [2]:
!pip install langdetect



In [44]:
import pandas as pd
import re
import contractions

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from transformers import BertTokenizer

import html
import quopri
import emoji
from bs4 import BeautifulSoup
from langdetect import detect, LangDetectException

In [4]:
true_df = pd.read_csv("/kaggle/input/misinfo/DataSet_Misinfo_TRUE.csv")
true_df

Unnamed: 0.1,Unnamed: 0,text
0,0,The head of a conservative Republican faction ...
1,1,Transgender people will be allowed for the fir...
2,2,The special counsel investigation of links bet...
3,3,Trump campaign adviser George Papadopoulos tol...
4,4,President Donald Trump called on the U.S. Post...
...,...,...
34970,34970,Most conservatives who oppose marriage equalit...
34971,34971,The freshman senator from Georgia quoted scrip...
34972,34972,The State Department told the Republican Natio...
34973,34973,"ADDIS ABABA, Ethiopia —President Obama convene..."


In [5]:
fake_df = pd.read_csv("/kaggle/input/misinfo/DataSet_Misinfo_FAKE.csv")
fake_df

Unnamed: 0.1,Unnamed: 0,text
0,0,Donald Trump just couldn t wish all Americans ...
1,1,House Intelligence Committee Chairman Devin Nu...
2,2,"On Friday, it was revealed that former Milwauk..."
3,3,"On Christmas day, Donald Trump announced that ..."
4,4,Pope Francis used his annual Christmas Day mes...
...,...,...
43637,44422,The USA wants to divide Syria.\r\n\r\nGreat Br...
43638,44423,The Ukrainian coup d'etat cost the US nothing ...
43639,44424,The European Parliament falsifies history by d...
43640,44425,The European Parliament falsifies history by d...


In [6]:
# Drop the leftover row-number column
true_df = true_df.drop('Unnamed: 0', axis=1)
fake_df = fake_df.drop('Unnamed: 0', axis=1)

In [7]:
true_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34975 entries, 0 to 34974
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    34946 non-null  object
dtypes: object(1)
memory usage: 273.4+ KB


In [8]:
fake_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 43642 entries, 0 to 43641
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    43642 non-null  object
dtypes: object(1)
memory usage: 341.1+ KB


In [9]:
true_df.describe()

Unnamed: 0,text
count,34946
unique,34526
top,"Killing Obama administration rules, dismantlin..."
freq,58


In [10]:
fake_df.describe()

Unnamed: 0,text
count,43642
unique,34078
top,Leave a Reply Click here to get more info on f...
freq,38


# Data Preprocessing

- Handle null values

In [11]:
true_df.isnull().sum()

text    29
dtype: int64

In [12]:
fake_df.isnull().sum()

text    0
dtype: int64

In [13]:
true_df = true_df.dropna()

- Handle duplicate values

In [14]:
true_df.duplicated().sum()

420

In [15]:
fake_df.duplicated().sum()

9564

In [16]:
true_df = true_df.drop_duplicates()
fake_df = fake_df.drop_duplicates()
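
The duplicate counts can be cross-checked against the `describe()` outputs: with no nulls left, the number of duplicated rows equals the row count minus the number of unique texts (34946 - 34526 = 420 for the true set, 43642 - 34078 = 9564 for the fake set). A small sketch of that relationship on a toy frame, which is illustrative only and not the project data:

```python
import pandas as pd

# Toy frame for illustration only; the real data lives in true_df / fake_df.
toy = pd.DataFrame({"text": ["a", "b", "a", "a", "c"]})

n_rows = len(toy)
n_unique = toy["text"].nunique()
n_dup = int(toy.duplicated().sum())  # rows that repeat an earlier row

# With no nulls, duplicates = rows - unique values.
assert n_dup == n_rows - n_unique

# drop_duplicates keeps the first occurrence of each text.
deduped = toy.drop_duplicates().reset_index(drop=True)
print(n_rows, n_unique, n_dup, len(deduped))
```

Note that this step only removes duplicates within each file. A text that appears in both files would survive with two conflicting labels, which may be worth checking after the merge.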

- Add labels and merge the two datasets

In [17]:
true_df['label'] = 1
fake_df['label'] = 0

df = pd.concat([true_df, fake_df], ignore_index=True)
df

Unnamed: 0,text,label
0,The head of a conservative Republican faction ...,1
1,Transgender people will be allowed for the fir...,1
2,The special counsel investigation of links bet...,1
3,Trump campaign adviser George Papadopoulos tol...,1
4,President Donald Trump called on the U.S. Post...,1
...,...,...
68599,"Apparently, the new Kyiv government is in a hu...",0
68600,The USA wants to divide Syria.\r\n\r\nGreat Br...,0
68601,The Ukrainian coup d'etat cost the US nothing ...,0
68602,The European Parliament falsifies history by d...,0


In [18]:
df = df.sample(frac=1, random_state=42).reset_index(drop=True) # Shuffle dataset
df

Unnamed: 0,text,label
0,Former Russian economy minister Alexei Ulyukay...,1
1,Republicans were just given a leg up over Demo...,0
2,This has to be one of the best remix videos ev...,0
3,"In line with the new Language Law, Russian is ...",0
4,JERUSALEM — A day after approving the const...,1
...,...,...
68599,The Super Bowl had not yet begun and Trump fan...,0
68600,U.S. House Republicans on Friday won passage o...,1
68601,Share on Facebook Share on Twitter Known to th...,0
68602,A New Jersey man who worked at the World Trade...,1


- Check for class imbalance

In [19]:
df['label'].value_counts()

label
1    34526
0    34078
Name: count, dtype: int64

=> The dataset is not imbalanced.
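
To put a number on the balance, the class shares can be computed from the counts printed by `value_counts()`. The sketch below hard-codes those two counts; the larger share is also the accuracy of a trivial classifier that always predicts the majority class, which gives a floor for later model comparisons.

```python
# Class counts copied from the value_counts() output.
counts = {1: 34526, 0: 34078}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# A classifier that always predicts the larger class reaches this accuracy.
majority_baseline = max(shares.values())

print(f"total={total}, shares={shares}, baseline={majority_baseline:.4f}")
```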

- Handle non-English texts

In [21]:
def safe_detect(x):
    if isinstance(x, str) and x.strip() and len(x.strip()) > 20:
        try:
            return detect(x)
        except LangDetectException:
            return 'unknown'
    return 'unknown'

df['lang'] = df['text'].apply(safe_detect)
non_english = df[df['lang'] != 'en']
print(non_english)

                                                    text  label     lang
127                                   Florida for Trump!      0  unknown
253    0 комментариев 0 поделились Фото: AP \nКоммент...      0       ru
294    0 комментариев 7 поделились \n"Это полный бред...      0       ru
297    +++ Beim Jupiter! Spuren römischer Zivilisatio...      0       de
433    0 комментариев 0 поделились \n23 октября в Ниж...      0       ru
...                                                  ...    ...      ...
68104  Mittwoch, 16. November 2016 Neue App ruft auto...      0       de
68347  +++ Muhten ihm einiges zu: Bauer soll Streit u...      0       de
68388  — The Sun (@TheSun) 23. November 2016 Laut Fer...      0       de
68406                       President Obama is a Muslim.      0       ca
68444  Страна: Китай Заявления КНДР о завершении свое...      0       ru

[649 rows x 3 columns]


In [35]:
len(non_english[non_english['label']==1])

10

In [40]:
df = df.drop(columns="lang")  # Drop the helper column after processing

=> Rows not detected as English are kept, because label 0 (fake news) accounts for the vast majority of them (639 of 649).
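
The length guard in `safe_detect` explains part of the output: "Florida for Trump!" has 18 characters and is never passed to the detector, whereas "President Obama is a Muslim." has 28 characters, passes the guard, and was labelled `ca`. Short texts that clear the threshold can therefore still be misclassified. The sketch below mirrors only the guard logic; `safe_detect_with` and the stand-in detector `always_english` are hypothetical names used so the example runs without `langdetect`.

```python
def safe_detect_with(detect_fn, text, min_chars=20):
    """Mirror of the length guard in safe_detect, with the detector injected."""
    if isinstance(text, str) and len(text.strip()) > min_chars:
        try:
            return detect_fn(text)
        except Exception:  # the notebook catches LangDetectException here
            return "unknown"
    return "unknown"


def always_english(_text):
    # Hypothetical stand-in for langdetect.detect, used only for this demo.
    return "en"


samples = ["Florida for Trump!", "President Obama is a Muslim.", None]
results = [safe_detect_with(always_english, s) for s in samples]
print(results)
```

Only the second sample reaches the detector; the first is too short and the third is not a string.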

- Handle texts with fewer than 5 words

In [36]:
# Count the words in each row and select rows with fewer than 5
short_texts = df[df['text'].apply(lambda x: len(str(x).split()) < 5)]

# Print these rows
print(short_texts)

                                                  text  label     lang
127                                 Florida for Trump!      0  unknown
288    A MUST watch video!https://youtu.be/-5Z-jJ2Z4bU      0       en
772                                               Cool      0  unknown
965                    That would be unconstitutional.      0       en
1115                   Around 120,000 displaced people      1       en
...                                                ...    ...      ...
67547                           TRUMP VICTORY FOR SURE      0       de
67689                                        Brilliant      0  unknown
67766                                  Good guy.\n👍👍👍👍      0  unknown
67797      https://www.youtube.com/watch?v=gqxwF-TeYas      0  unknown
67834                                        Horseshit      0  unknown

[170 rows x 3 columns]


In [37]:
short_texts[short_texts['label']==1]

Unnamed: 0,text,label,lang
1115,"Around 120,000 displaced people",1,en
20229,Republican Congressman Will Hurd,1,en
24713,Ted Cruz,1,unknown
26250,Four U.S. senators,1,unknown
28034,“On 1/20,1,unknown
31892,No.,1,unknown
40323,(Reuters),1,unknown
57596,Jan 29 (Reuters),1,unknown
65117,advertisement,1,unknown


=> Remove texts with fewer than 5 words: for label 1 (true news) such texts carry no real meaning and could lead the model to wrong predictions.

In [38]:
# Keep only rows with at least 5 words
df = df[df['text'].apply(lambda x: len(str(x).split()) >= 5)]
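
The filter counts whitespace-separated tokens, so punctuation attached to a word and a run of emoji each count as part of one "word". A short sketch of that counting rule, using texts taken from the `short_texts` output; the helper name `word_count` is illustrative.

```python
def word_count(text):
    """Number of whitespace-separated tokens, as used by the filter."""
    return len(str(text).split())


# Examples taken from the short_texts output.
examples = [
    "Florida for Trump!",
    "Good guy.\n👍👍👍👍",
    "Around 120,000 displaced people",
    "That would be unconstitutional.",
]
counts = [word_count(text) for text in examples]
for text, n in zip(examples, counts):
    print(repr(text), n, "kept" if n >= 5 else "removed")
```

All four examples fall below the threshold of 5 and are removed by the filter.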

## Clean text

- Clean the text (lowercasing, punctuation removal, stopwords, stemming, ...) and tokenize

In [42]:
# Needed on the first run only
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [43]:
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

In [45]:
def clean_text(row):
    row = str(row).lower()

    # Remove email headers
    row = re.sub(r'(?i)\b(from|to|cc|bcc|subject|date|return-path|message-id|thread-topic|thread-index|content-type|mime-version|boundary|received|x-[\w-]+):.*', ' ', row)
    
    # Remove mailto links
    row = re.sub(r'mailto:[^\s]+', ' ', row)
    
    # Decode quoted-printable
    row = quopri.decodestring(row.encode('utf-8')).decode('utf-8', errors='ignore')
    
    # Unescape HTML entities
    row = html.unescape(row)
    
    # Strip HTML tags
    if '<' in row and '>' in row:
        row = BeautifulSoup(row, "lxml").get_text()
    
    # Normalize
    row = re.sub(r'[\t\r\n]', ' ', row)
    row = re.sub(r'[_~+\-]{2,}', ' ', row)
    row = re.sub(r"[<>()|&©ø%\[\]\\~*\$€£¥]", ' ', row)
    row = re.sub(r"\\x[0-9a-fA-F]{2}", ' ', row)
    row = re.sub(r'(https?://)([^/\s]+)([^\s]*)', r'\2', row)
    row = re.sub(r'[a-f0-9]{16,}', ' ', row)
    row = re.sub(r'([.?!])[\s]*\1+', r'\1', row)
    row = re.sub(r'\s+', ' ', row)

    # Remove code-like keywords
    row = re.sub(r'\b(function|var|return|typeof|window|document|eval|\.split)\b', ' ', row)
    
    # Remove programming symbols
    row = re.sub(r'[{}=<>\[\]^~|`#@*]', ' ', row)

    # Remove all emoji
    row = emoji.replace_emoji(row, replace='')

    # Truncate at minified JS or base36-like gibberish
    code_gibberish = re.search(r'[a-z0-9]{20,}', row)
    if code_gibberish and len(row) - code_gibberish.start() > 50:
        row = row[:code_gibberish.start()]

    # Cut off JS/CDATA tail
    cutoff = re.search(
        r'(//\s*!?\s*cdata|function\s*\(|var\s+[a-zA-Z]|window\s*\.\s*|document\s*\.\s*|this\s*\.)',
        row
    )
    if cutoff and len(row) - cutoff.start() > 10:
        row = row[:cutoff.start()]

    row = re.sub(r'!+\s*cdata\s*!+', ' ', row, flags=re.IGNORECASE)

    return row.strip()

df['clean_text'] = df['text'].apply(clean_text)
df['clean_text']

0        former russian economy minister alexei ulyukay...
1        republicans were just given a leg up over demo...
2        this has to be one of the best remix videos ev...
3        in line with the new language law, russian is ...
4        jerusalem — a day after approving the construc...
                               ...                        
68599    the super bowl had not yet begun and trump fan...
68600    u.s. house republicans on friday won passage o...
68601    share on facebook share on twitter known to th...
68602    a new jersey man who worked at the world trade...
68603    turkey and iran have agreed to discuss within ...
Name: clean_text, Length: 68434, dtype: object
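
Two of the normalization rules in `clean_text` are easier to follow in isolation: the URL rule keeps only the host name of each link, and the punctuation rule collapses repeated `.`, `?` or `!` into a single character. The sketch below reuses those two regular expressions unchanged; the wrapper names `keep_domain` and `collapse_punct` are illustrative.

```python
import re


def keep_domain(text):
    """Replace each http(s) URL with its host name only."""
    return re.sub(r'(https?://)([^/\s]+)([^\s]*)', r'\2', text)


def collapse_punct(text):
    """Collapse runs of the same ., ? or ! (optionally space-separated)."""
    return re.sub(r'([.?!])[\s]*\1+', r'\1', text)


url_demo = keep_domain("a must watch video!https://youtu.be/-5Z-jJ2Z4bU")
punct_demo = collapse_punct("wow!!! really?? ok...")
print(url_demo)
print(punct_demo)
```

Keeping the host name rather than deleting the whole URL preserves the source domain as a potential signal, while discarding path components that are mostly noise.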

In [46]:
print("Raw vs. cleaned text samples:\n")

print(f"Data 1454: {df['text'][1454]}\n")
print(f"Clean data 1454: {df['clean_text'][1454]}\n")

print(f"Data 19802: {df['text'][19802]}\n")
print(f"Clean data 19802: {df['clean_text'][19802]}\n")

Raw vs. cleaned text samples:

Data 1454: View source Re: Obama Says He Didn’t Know Hillary Clinton Was Using Private Email Address - NYTimes.com From:pir@hrcoffice.com To: jennifer.m.palmieri@gmail.com Date: 2015-03-08 10:21 Subject: Re: Obama Says He Didn’t Know Hillary Clinton Was Using Private Email Address - NYTimes.com Ok. Sounds like people are putting words into his mouth. On Mar 8, 2015, at 7:56 AM, Jennifer Palmieri <jennifer.m.palmieri@gmail.com<mailto:jennifer.m.palmieri@gmail.com>> wrote: Suggest Philippe talk to Josh or Eric. They know POTUS and HRC emailed. Josh has been asked about that. Standard practice is not to confirm anything about his email, so his answer to press was that he would not comment/confirm. I recollect that Josh was also asked if POTUS ever noticed her personal email account and he said something like POTUS likely had better things to do than focus on his Cabinet's email addresses. Sent from my iPad On Mar 8, 2015, at 12:40 AM, Philippe Reines <pir@hrcoffic

In [47]:
def tokenize_and_filter(text):  # for traditional models, e.g. with TF-IDF features
    # Expand contractions (e.g. "don't" -> "do not")
    text = contractions.fix(text)

    # Tokenize
    tokens = word_tokenize(text)

    # Drop stopwords, keep alphabetic tokens only, then stem
    return [stemmer.stem(w) for w in tokens if w.lower() not in stop_words and w.isalpha()]
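
The `w.isalpha()` condition is stricter than it may look: it drops not only numbers and punctuation but also any token containing a non-letter character, such as abbreviations with periods. The sketch below isolates that filter on a hand-written token list. The token list and the tiny stopword set are assumptions for illustration (the notebook uses `word_tokenize` and NLTK's English stopword list), and stemming is left out.

```python
# Tiny stopword set for the demo; the notebook uses NLTK's English list.
demo_stop_words = {"the", "is", "a", "in", "of", "not"}

# Hypothetical tokenizer output, written by hand for illustration.
tokens = ["The", "U.S.", "Senate", "did", "not", "vote", "in", "2015", "."]

# Same condition as in tokenize_and_filter, without the stemming step.
kept = [w for w in tokens if w.lower() not in demo_stop_words and w.isalpha()]
dropped = [w for w in tokens if w not in kept]
print(kept)
print(dropped)
```

If entities such as "U.S." or years turn out to be informative for the classifier, the filter condition is the place to relax.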

In [48]:
def tokenize_texts(tokenizer, texts, max_length=512):  # for transformer models such as BERT
    return tokenizer(
        texts,
        truncation=True,      # Truncate inputs longer than max_length
        padding=True,         # Pad to the longest sequence in the batch
        max_length=max_length # Maximum number of tokens per text
    )
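
The interaction of `truncation`, `padding` and `max_length` can be illustrated without loading a tokenizer. The following is a simplified sketch of the semantics only, not the `transformers` implementation: each sequence is cut to `max_length`, then the batch is padded to its longest remaining sequence, and the attention mask marks real tokens with 1 and padding with 0. The function name `pad_and_truncate`, the `pad_id` default and the token ids are assumptions; a real tokenizer also reserves room for special tokens when truncating, which this sketch ignores.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Truncate each sequence to max_length, then pad to the longest one."""
    truncated = [ids[:max_length] for ids in batch]
    target = max((len(ids) for ids in truncated), default=0)
    input_ids = [ids + [pad_id] * (target - len(ids)) for ids in truncated]
    attention_mask = [[1] * len(ids) + [0] * (target - len(ids))
                      for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token ids for illustration only.
encoded = pad_and_truncate([[101, 7, 8, 9, 10, 102], [101, 7, 102]], max_length=4)
print(encoded)
```

Because padding follows the longest sequence in the batch, tokenizing the whole dataset in one call pads nearly every text to 512 tokens; tokenizing per batch may keep tensors smaller. The tokenizer generally expects a string or a list of strings, so a pandas column would likely need `.tolist()` before being passed as `texts`.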