**FakeNewsNet Cleaning Techniques**

**Dataset 1: PolitiFact**

## Data Cleaning Strategy for PolitiFact Dataset

This section outlines the cleaning techniques applied to the PolitiFact dataset:

1. **Deduplication**: Removes duplicate articles/posts so the same title cannot appear in both the training and test splits (data leakage)
2. **Missing values/nulls**: Eliminates rows with missing critical data
3. **Lowercasing**: Standardizes text for consistent tokenization
4. **URL/user mention removal**: Cleans Twitter metadata (e.g., "@user", "http://")
5. **Punctuation removal**: Eliminates noise from text
6. **Emoji/HTML tag stripping**: Removes emojis and symbols; note that the cleaning function in Step 4 removes `<` and `>` as punctuation but leaves HTML tag names in place
7. **Non-English removal**: Filters to keep only English-language content using language detection
8. **Class imbalance check**: Verifies that the class distribution (fake vs. real) is acceptable for training

## Step 1: Upload PolitiFact Datasets

This cell uploads the PolitiFact fake and real news CSV files from your local machine to Google Colab for processing.

In [None]:
from google.colab import files

# Upload the politifact_real.csv and politifact_fake.csv files
uploaded = files.upload()


## Step 2: Load and Combine Datasets

This cell performs the initial data loading and labeling:
- Loads separate `politifact_fake.csv` and `politifact_real.csv` files
- Adds a `label` column to each dataset ('fake' or 'real')
- Combines both datasets into a single DataFrame for unified processing
- Displays the first few rows to verify the structure

In [None]:
import pandas as pd

df_fake = pd.read_csv("politifact_fake.csv")
df_real = pd.read_csv("politifact_real.csv")

df_fake['label'] = 'fake'
df_real['label'] = 'real'

df = pd.concat([df_fake, df_real], ignore_index=True)
df.head()
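
The loading and labeling steps above can also be wrapped in a small helper that fails early if a file lacks the `title` column the later steps depend on. This is a minimal sketch, not part of the original notebook: `load_labeled` is a hypothetical name, and the in-memory CSV text is made up so the example runs without the real files.

```python
import io
import pandas as pd

# Hypothetical helper: read one CSV, check required columns, attach a label.
def load_labeled(source, label, required=("title",)):
    frame = pd.read_csv(source)
    missing = [col for col in required if col not in frame.columns]
    if missing:
        raise ValueError(f"{label} file is missing columns: {missing}")
    frame["label"] = label
    return frame

# Made-up stand-ins for politifact_fake.csv and politifact_real.csv.
fake_csv = io.StringIO("id,title\n1,Example fake headline\n")
real_csv = io.StringIO("id,title\n2,Example real headline\n3,Another real headline\n")

combined = pd.concat(
    [load_labeled(fake_csv, "fake"), load_labeled(real_csv, "real")],
    ignore_index=True,
)
print(combined)
```

With the real files, the `io.StringIO` objects would be replaced by the CSV file names.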

## Step 3: Remove Duplicates and Handle Missing Values

This cell performs initial data quality checks:
- **Deduplication**: Identifies and removes duplicate entries based on the `title` column
- **Missing value analysis**: Displays count of null values per column
- **Essential field filtering**: Drops rows missing `title` or `label` (critical for modeling)

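Deduplicating before any train/test split keeps identical titles out of both splits, and rows without a title or label cannot be used for supervised training. Note that deduplication here is an exact match on the raw `title`, so variants that differ only in case or punctuation are kept as separate rows.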

In [None]:
# Check for duplicates in title (most common text field for detection)
print("Duplicates in title:", df.duplicated(subset='title').sum())

# Remove duplicates by title
df = df.drop_duplicates(subset='title')

# Check for nulls
print("\nMissing values per column:")
print(df.isnull().sum())

# Drop rows with missing title or label (essential for modeling)
df = df.dropna(subset=['title', 'label'])

# Preview cleaned structure
df.head()
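
Two things the exact-match deduplication above does not surface are near-duplicates (same title with different case or spacing) and titles that occur under both labels, where `drop_duplicates` silently keeps whichever row comes first. The sketch below shows one possible way to check for both. The rows are made up, and `title_key` is a hypothetical helper column.

```python
import pandas as pd

# Made-up rows: an exact duplicate, a near-duplicate, and a title under both labels.
sample = pd.DataFrame({
    "title": ["Budget passes", "Budget passes", "budget passes ",
              "Senator resigns", "Senator resigns"],
    "label": ["fake", "fake", "fake", "fake", "real"],
})

# Normalized key so case and surrounding-whitespace variants compare equal.
sample["title_key"] = sample["title"].str.lower().str.strip()

# Titles that occur under more than one label.
labels_per_title = sample.groupby("title_key")["label"].nunique()
conflicting = labels_per_title[labels_per_title > 1].index.tolist()

# Deduplicate on the normalized key, then drop the helper column.
deduped = sample.drop_duplicates(subset="title_key").drop(columns="title_key")

print("Conflicting titles:", conflicting)
print("Rows before/after:", len(sample), len(deduped))
```

How to resolve a conflicting title (drop both rows, or trust one label) is a modeling decision that the notebook does not specify.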

## Step 4: Text Cleaning Function

This cell defines and applies a comprehensive text cleaning function that:
- **Lowercases** all text for consistency
- **Removes URLs** (http/https links)
- **Removes user mentions** (@username patterns)
- **Removes punctuation** (special characters)
- **Removes emojis and symbols** (Unicode characters)
- **Strips whitespace** from beginning and end

The cleaned text is stored in a new `clean_title` column, preserving the original for reference.

In [None]:
import re

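def clean_text(text):
    text = str(text).lower()                          # Lowercase
    text = re.sub(r"http\S+", "", text)               # Remove URLs
    text = re.sub(r"@\w+", "", text)                  # Remove mentions
    text = re.sub(r"[^\w\s]", "", text)               # Remove punctuation (also drops most emoji)
    # Remove leftover emoji/symbols: Miscellaneous Symbols, Dingbats, and the main emoji blocks.
    # A single span from \u263a to \U0001f645 would also delete CJK, Hangul, and other letters.
    text = re.sub(r"[\u2600-\u27bf\U0001f300-\U0001faff]", "", text)
    return text.strip()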

# Apply to title column
df['clean_title'] = df['title'].apply(clean_text)

# Preview result
df[['title', 'clean_title', 'label']].head()
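
The cleaning function above removes angle brackets as punctuation but leaves tag names and HTML entities behind (for example, `<b>` becomes `b`). If the titles do contain markup, a pre-processing step along the following lines could run before `clean_text`. This is a sketch using only the standard library; `strip_html` is a hypothetical name, and the regex is a simple heuristic rather than a full HTML parser.

```python
import html
import re

# Hypothetical pre-processing step, intended to run before punctuation removal.
def strip_html(text):
    text = html.unescape(str(text))           # decode entities such as &amp;
    text = re.sub(r"<[^>]+>", " ", text)      # replace tags such as <b> or <br/> with a space
    return re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace

print(strip_html("<b>Breaking:</b> Tax &amp; spending <br/>bill passes"))
# prints: Breaking: Tax & spending bill passes
```

Because the tag pattern matches anything between `<` and `>`, it would also remove text in a title that uses both characters literally, so it is worth spot-checking the output.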

## Step 5: Install Language Detection Library

Installs the `langdetect` library, which will be used to identify and filter non-English text in the next step.

In [None]:
!pip install langdetect

## Step 6: Language Detection and Filtering

This cell filters the dataset to include only English-language content:
- Defines a safe language detection function that handles exceptions
- Applies language detection to all cleaned titles
- Filters to keep only rows where language is detected as English ('en')
- Removes the temporary `language` helper column

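Keeping only English titles gives the model a single-language vocabulary. Note that `langdetect` is less reliable on short texts such as headlines, so some English titles may be dropped and some non-English ones kept.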

In [None]:
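from langdetect import DetectorFactory, detect
from langdetect.lang_detect_exception import LangDetectException

# langdetect is non-deterministic by default; fixing the seed makes results reproducible
DetectorFactory.seed = 0

# Function to safely detect language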
def detect_lang(text):
    try:
        return detect(text)
    except LangDetectException:
        return "unknown"

# Apply to cleaned titles
df['language'] = df['clean_title'].apply(detect_lang)

# Filter only English rows
df = df[df['language'] == 'en']

# Drop helper column
df = df.drop(columns=['language'])

# Preview
df[['clean_title', 'label']].sample(5)
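
Because the filter above drops every row that is not detected as `'en'`, it can be useful to tally the detected languages before filtering and inspect what is about to be removed. The sketch below takes the detector as a parameter so it runs without `langdetect` installed. `language_tally` and `stub_detector` are hypothetical names, and the stub is only a stand-in for illustration, not a real language model.

```python
from collections import Counter

# Hypothetical audit helper: count the language code returned for each text.
# `detector` is any callable mapping a string to a language code,
# for example the `detect_lang` wrapper defined in this step.
def language_tally(texts, detector):
    return Counter(detector(text) for text in texts)

# Stand-in detector for illustration only: ASCII-only text counts as "en".
def stub_detector(text):
    if not text.strip():
        return "unknown"
    return "en" if text.isascii() else "other"

titles = ["senate passes budget bill", "", "el senado aprobó la ley"]
tally = language_tally(titles, stub_detector)
print(tally)
```

In the notebook, the equivalent call would be `language_tally(df['clean_title'], detect_lang)`, run before the filtering line.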

## Step 7: Class Distribution Analysis

This cell visualizes the balance between fake and real news:
- Counts the number of samples for each label
- Creates a bar chart showing the distribution
- Helps identify if class imbalance exists

A heavily skewed distribution can bias a classifier toward the majority class, so the counts are checked before training.

In [None]:
import matplotlib.pyplot as plt

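# Count fake vs real, in a fixed order so the colors always map to the same class
class_counts = df['label'].value_counts().reindex(['fake', 'real'], fill_value=0)
print("Class Distribution:\n", class_counts)

# Plot bar chart (red = fake, green = real)
class_counts.plot(kind='bar', color=['red', 'green'])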
plt.title("Class Distribution After Cleaning")
plt.xlabel("Label")
plt.ylabel("Number of Samples")
plt.xticks(rotation=0)
plt.grid(axis='y')
plt.show()
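
Alongside the bar chart, the imbalance can be expressed as numbers: the share of each class and the ratio of the largest class to the smallest. The sketch below uses made-up labels in place of `df['label']`. What counts as an "acceptable" ratio is not defined in this notebook and depends on the model and metric used.

```python
import pandas as pd

# Made-up labels standing in for df['label'].
labels = pd.Series(["real"] * 6 + ["fake"] * 4)

counts = labels.value_counts()
shares = labels.value_counts(normalize=True)

# 1.0 means perfectly balanced; larger values mean more skew.
imbalance_ratio = counts.max() / counts.min()

print(shares.round(2).to_dict())
print("Majority-to-minority ratio:", imbalance_ratio)
```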

## Step 8: Save Cleaned PolitiFact Dataset

This cell exports the cleaned dataset:
- Saves only the essential columns (`clean_title` and `label`) to CSV
- Downloads the file to your local machine for use in model training

The cleaned dataset is now ready for machine learning pipelines.

In [None]:
# Save to CSV
df[['clean_title', 'label']].to_csv('clean_fakenewsnet.csv', index=False)

# Download locally
from google.colab import files
files.download('clean_fakenewsnet.csv')
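
Before using the exported file for training, a quick round-trip check can confirm that it reloads with the same number of rows and no nulls. By default `pd.read_csv` reads an empty string back as `NaN`, so an empty `clean_title` would show up here. The sketch below uses made-up rows and a temporary directory, and the file name `clean_sample.csv` is hypothetical.

```python
import os
import tempfile
import pandas as pd

# Made-up cleaned rows standing in for df[['clean_title', 'label']].
cleaned = pd.DataFrame({
    "clean_title": ["senate passes budget bill", "governor signs new law"],
    "label": ["real", "fake"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "clean_sample.csv")  # hypothetical file name
    cleaned.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# The reloaded frame should have the same shape and contain no nulls.
same_shape = reloaded.shape == cleaned.shape
no_nulls = not reloaded.isnull().any().any()
print(same_shape, no_nulls)
```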

**Dataset 2: GossipCop**

## Data Cleaning Strategy for GossipCop Dataset

This section applies the same comprehensive cleaning techniques to the GossipCop dataset:

1. **Deduplication**: Removes duplicate articles/posts
2. **Missing values/nulls**: Eliminates rows with missing critical data
3. **Lowercasing**: Standardizes text for consistent tokenization
4. **URL/user mention removal**: Cleans Twitter metadata
5. **Punctuation removal**: Eliminates noise from text
6. **Emoji/HTML tag stripping**: Removes irrelevant characters
7. **Non-English removal**: Filters to keep only English content
8. **Class imbalance handling**: Addresses class imbalance by randomly undersampling the majority class if needed

## Step 9: Upload GossipCop Datasets

This cell uploads the GossipCop fake and real news CSV files from your local machine to Google Colab for processing.

In [None]:
# Upload the gossipcop_real.csv and gossipcop_fake.csv files
from google.colab import files
uploaded = files.upload()

## Step 10: Load and Combine GossipCop Datasets

This cell performs initial data loading for GossipCop:
- Loads separate `gossipcop_fake.csv` and `gossipcop_real.csv` files
- Adds a `label` column to each dataset ('fake' or 'real')
- Combines both datasets into a single DataFrame
- Displays the first few rows to verify the structure

In [None]:
import pandas as pd

# Load files
df_fake = pd.read_csv("gossipcop_fake.csv")
df_real = pd.read_csv("gossipcop_real.csv")

# Label the data
df_fake['label'] = 'fake'
df_real['label'] = 'real'

# Combine
df = pd.concat([df_fake, df_real], ignore_index=True)

# Preview
df.head()

## Step 11: Remove Duplicates and Handle Missing Values (GossipCop)

This cell performs data quality checks for GossipCop:
- **Deduplication**: Identifies and removes duplicate entries based on `title`
- **Missing value analysis**: Displays count of null values per column
- **Essential field filtering**: Drops rows missing `title` or `label`

This is the same approach used for PolitiFact, which keeps the two datasets consistent.

In [None]:
# Check for duplicate titles
print("Duplicates in title:", df.duplicated(subset='title').sum())

# Remove duplicate titles
df = df.drop_duplicates(subset='title')

# Check for missing values
print("\nMissing values per column:")
print(df.isnull().sum())

# Drop rows missing title or label (core for modeling)
df = df.dropna(subset=['title', 'label'])

# Preview cleaned structure
df.head()

## Step 12: Text Cleaning (GossipCop)

This cell applies the same comprehensive text cleaning function to GossipCop:
- Lowercases text
- Removes URLs, user mentions, punctuation
- Removes emojis and symbols
- Strips whitespace

The processed text is stored in a new `clean_title` column.

In [None]:
import re

# Define clean text function
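def clean_text(text):
    text = str(text).lower()  # Lowercase
    text = re.sub(r"http\S+", "", text)  # Remove URLs
    text = re.sub(r"@\w+", "", text)     # Remove mentions
    text = re.sub(r"[^\w\s]", "", text)  # Remove punctuation (also drops most emoji)
    # Remove leftover emoji/symbols: Miscellaneous Symbols, Dingbats, and the main emoji blocks.
    # A single span from \u263a to \U0001f645 would also delete CJK, Hangul, and other letters.
    text = re.sub(r"[\u2600-\u27bf\U0001f300-\U0001faff]", "", text)
    return text.strip()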

# Apply to title column
df['clean_title'] = df['title'].apply(clean_text)

# Preview result
df[['title', 'clean_title', 'label']].head()

## Step 13: Language Detection and Filtering (GossipCop)

This cell filters GossipCop to include only English-language content:
- Applies safe language detection to all cleaned titles
- Filters to keep only English ('en') content
- Removes the temporary `language` helper column

This keeps the processing consistent with the PolitiFact dataset.

In [None]:
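from langdetect import DetectorFactory, detect
from langdetect.lang_detect_exception import LangDetectException

# langdetect is non-deterministic by default; fixing the seed makes results reproducible
DetectorFactory.seed = 0

# Define function for language detection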
def detect_lang(text):
    try:
        return detect(text)
    except LangDetectException:
        return "unknown"

# Apply to cleaned titles
df['language'] = df['clean_title'].apply(detect_lang)

# Keep only English
df = df[df['language'] == 'en']

# Drop helper column
df = df.drop(columns=['language'])

# Preview result
df[['clean_title', 'label']].sample(5)

## Step 14: Class Distribution Analysis (GossipCop)

This cell visualizes the balance between fake and real news in GossipCop:
- Counts samples for each label
- Creates a bar chart showing the distribution
- Identifies if class imbalance needs to be addressed

GossipCop typically contains far more real than fake articles, so the imbalance is corrected in the next step.

In [None]:
import matplotlib.pyplot as plt

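# Count labels, in a fixed order so the colors always map to the same class
class_counts = df['label'].value_counts().reindex(['fake', 'real'], fill_value=0)
print("\nClass Distribution:\n", class_counts)

# Plot distribution (red = fake, green = real)
class_counts.plot(kind='bar', color=['red', 'green'])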
plt.title("Class Distribution After Cleaning")
plt.xlabel("Label")
plt.ylabel("Number of Samples")
plt.xticks(rotation=0)
plt.grid(axis='y')
plt.show()

## Step 15: Address Class Imbalance with Random Undersampling

This cell balances the GossipCop dataset using random undersampling:
- **Reference**: Based on "A comprehensive survey of fake news in social networks: Attributes, features, and detection approaches"
- **Method**: Random undersampling of the majority class
- **Process**:
  - Separates data by class (fake vs. real)
  - Downsamples the majority class (real news) to match minority class size
  - Combines and shuffles the balanced dataset
  - Verifies equal class distribution

This prevents model bias toward the majority class during training.

In [None]:
# Reference for the undersampling technique: "A comprehensive survey of fake news in social networks: Attributes, features, and detection approaches" (random undersampling)

# Separate by class
df_real = df[df['label'] == 'real']
df_fake = df[df['label'] == 'fake']

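# Downsample real news (assumes real is the majority class; sample() raises ValueError if n exceeds the number of real rows)
df_real_downsampled = df_real.sample(n=len(df_fake), random_state=42)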

# Combine and shuffle
df_balanced = pd.concat([df_fake, df_real_downsampled], ignore_index=True)
df_balanced = df_balanced.sample(frac=1, random_state=42).reset_index(drop=True)

# Confirm balance
print(df_balanced['label'].value_counts())
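
The cell above assumes that real news is the majority class. A variant that does not depend on which class is larger can downsample every class to the size of the smallest one. The sketch below is one way to write that, using made-up data. `undersample` is a hypothetical helper, not taken from the referenced survey.

```python
import pandas as pd

# Hypothetical helper: randomly downsample every class to the smallest class size.
def undersample(frame, label_col="label", seed=42):
    smallest = frame[label_col].value_counts().min()
    parts = [
        group.sample(n=smallest, random_state=seed)
        for _, group in frame.groupby(label_col)
    ]
    balanced = pd.concat(parts, ignore_index=True)
    # Shuffle so the classes are not stored in contiguous blocks.
    return balanced.sample(frac=1, random_state=seed).reset_index(drop=True)

# Made-up data: 5 real rows and 2 fake rows.
toy = pd.DataFrame({
    "clean_title": [f"headline {i}" for i in range(7)],
    "label": ["real"] * 5 + ["fake"] * 2,
})

balanced = undersample(toy)
print(balanced["label"].value_counts().to_dict())
```

As with the original cell, undersampling discards majority-class rows, so the balanced dataset is smaller than the cleaned one.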

## Step 16: Save Cleaned GossipCop Dataset

This cell exports the cleaned and balanced GossipCop dataset:
- Saves only essential columns (`clean_title` and `label`) to CSV
- Downloads the file to your local machine

The cleaned GossipCop dataset is now ready for model training.

In [None]:
# Save to CSV
df_balanced[['clean_title', 'label']].to_csv('clean_gossipcop.csv', index=False)

# Download locally (Colab)
from google.colab import files
files.download('clean_gossipcop.csv')