## Notebook 1: Dataset Analysis and Pre-Processing

This series of notebooks demonstrates a comparative approach to fake news detection using natural language processing (NLP) techniques. We use a Kaggle dataset of news articles, each labeled as real or fake.


## Import Libraries and Download Resources

This cell imports the libraries needed for data processing, analysis, and NLP tasks. It also downloads the NLTK resources 'punkt', 'stopwords', and 'wordnet', which support tokenization, stop word removal, and lemmatization, respectively.


In [1]:
import shutil
import os
import numpy as np
import pandas as pd
import kagglehub

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

## Download and Copy the Fake News Dataset from Kaggle

This cell downloads the "english-fake-news-dataset" from Kaggle using the `kagglehub` library. It then copies the dataset files to a specific directory within the Colab environment (`/content/english-fake-news-dataset`).

In [2]:
dataset_path = kagglehub.dataset_download("evilspirit05/english-fake-news-dataset")
print("Original path to dataset files:", dataset_path)

target_directory = "/content/english-fake-news-dataset"

os.makedirs(target_directory, exist_ok=True)

for item in os.listdir(dataset_path):
    source = os.path.join(dataset_path, item)
    destination = os.path.join(target_directory, item)
    if os.path.isdir(source):
        shutil.copytree(source, destination)
    else:
        shutil.copy2(source, destination)

print("Files copied to:", target_directory)

Downloading from https://www.kaggle.com/api/v1/datasets/download/evilspirit05/english-fake-news-dataset?dataset_version_number=1...


100%|██████████| 9.55M/9.55M [00:00<00:00, 30.6MB/s]

Extracting files...





Original path to dataset files: /root/.cache/kagglehub/datasets/evilspirit05/english-fake-news-dataset/versions/1
Files copied to: /content/english-fake-news-dataset
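
One caveat with the copy loop: `shutil.copytree` raises `FileExistsError` when the destination directory already exists, so re-running the cell could fail if the dataset contains subdirectories. The sketch below is one possible way to make the copy repeatable, assuming Python 3.8+ for the `dirs_exist_ok` argument. The helper name `copy_dataset` and the temporary directories are illustrative and not part of the original notebook.

```python
import os
import shutil
import tempfile


def copy_dataset(src_dir, dst_dir):
    """Copy every item in src_dir into dst_dir; safe to call repeatedly."""
    os.makedirs(dst_dir, exist_ok=True)
    for item in os.listdir(src_dir):
        source = os.path.join(src_dir, item)
        destination = os.path.join(dst_dir, item)
        if os.path.isdir(source):
            # dirs_exist_ok=True (Python 3.8+) avoids FileExistsError on re-runs
            shutil.copytree(source, destination, dirs_exist_ok=True)
        else:
            shutil.copy2(source, destination)
    return sorted(os.listdir(dst_dir))


# Demo on throwaway directories rather than the real Kaggle cache
with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
    os.makedirs(os.path.join(src, "extras"))
    with open(os.path.join(src, "final_en.csv"), "w") as fh:
        fh.write("title,text,lebel\n")
    first = copy_dataset(src, dst)
    second = copy_dataset(src, dst)  # the second call does not raise
    print(first == second)
```

In the notebook, the same function could be called as `copy_dataset(dataset_path, target_directory)`.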


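## Data Cleaning

This code loads the CSV file and performs initial cleaning: it drops rows that are entirely empty, removes duplicate rows, drops rows that are missing a title or text, and resets the index.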

In [3]:
%cd english-fake-news-dataset

/content/english-fake-news-dataset


In [4]:
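data = pd.read_csv("final_en.csv")

data.dropna(how='all', inplace=True)                 # drop rows where every column is NaN
data.drop_duplicates(inplace=True)                   # drop exact duplicate rows
data.dropna(subset=['title', 'text'], inplace=True)  # drop rows missing a title or text
data.reset_index(drop=True, inplace=True)            # renumber rows from 0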

data.head()

| Unnamed: 0 | title | text | lebel |
|---|---|---|---|
| 0 | Trump backs off praise of Russia's Putin after... | HENDERSON, Nev. (Reuters) - U.S. Republican pr... | 1 |
| 1 | Trump's funding request for U.S. border wall h... | WASHINGTON (Reuters) - President Donald Trump’... | 1 |
| 2 | As Votes For Trump Went Up, Canada’s Immigrat... | Well, sad to say, it s a sure chance the next ... | 0 |
| 3 | U.S. Navy, shipbuilders ready for Trump's expa... | SIMI VALLEY, Calif. (Reuters) - The U.S. arms ... | 1 |
| 4 | Trump defends DACA move, urges Congress to ena... | WASHINGTON (Reuters) - President Donald Trump ... | 1 |
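
To make the effect of each cleaning call concrete, here is a minimal sketch on a made-up five-row frame. The toy values are invented for illustration; only the column names and the pandas calls mirror the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one duplicate, one fully empty row, two partial rows
toy = pd.DataFrame({
    "title": ["A", "A", None, "B", None],
    "text": ["x", "x", None, None, "y"],
    "lebel": [1, 1, np.nan, 0, 0],
})

step1 = toy.dropna(how="all")                   # removes only the fully empty row
step2 = step1.drop_duplicates()                 # removes the repeated ("A", "x", 1) row
step3 = step2.dropna(subset=["title", "text"])  # removes rows missing title or text
cleaned = step3.reset_index(drop=True)

print(len(toy), len(step1), len(step2), len(step3))
```

Note that `dropna(how='all')` alone would keep the two partially filled rows; it is the later `subset=['title', 'text']` call that removes them.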


## Displaying Class Distribution

This cell prints the distribution of the target variable, `lebel` (the column name is spelled this way in the dataset itself), to show the number of real (1) and fake (0) news samples.

In [5]:
print(data["lebel"].value_counts())

lebel
1    5000
0    4730
Name: count, dtype: int64


## Downsampling the Majority Class

This cell downsamples the majority class (real news) to balance the dataset. It randomly selects samples from the majority class to match the number of samples in the minority class.

In [6]:
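# Downsample the majority class (label 1) to the size of the minority class (label 0)
majority_class = data[data['lebel'] == 1]
minority_class = data[data['lebel'] == 0]

# Use the minority count (4730 here) instead of a hardcoded number
majority_downsampled = majority_class.sample(n=len(minority_class), random_state=42)
balanced_data = pd.concat([majority_downsampled, minority_class])
# Shuffle so the two classes are interleaved
balanced_data = balanced_data.sample(frac=1, random_state=42).reset_index(drop=True)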

print(balanced_data['lebel'].value_counts())

lebel
0    4730
1    4730
Name: count, dtype: int64
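
A more general version of this step can derive the target size from the data and handle whichever class happens to be larger. The following is a sketch under that assumption; the helper `downsample_to_minority` and the toy frame are hypothetical and not part of the original notebook.

```python
import pandas as pd


def downsample_to_minority(df, label_col, seed=42):
    """Randomly downsample every class to the size of the smallest class."""
    n_min = int(df[label_col].value_counts().min())
    parts = [
        group.sample(n=n_min, random_state=seed)
        for _, group in df.groupby(label_col)
    ]
    # Shuffle so the classes are interleaved, then renumber the rows
    return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)


# Hypothetical toy frame with a 6-to-4 class imbalance
toy = pd.DataFrame({
    "lebel": [1] * 6 + [0] * 4,
    "text": [f"doc {i}" for i in range(10)],
})
balanced = downsample_to_minority(toy, "lebel")
print(balanced["lebel"].value_counts().to_dict())
```

Fixing `random_state` keeps the selection reproducible across runs, as in the cell above.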


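## Feature Engineering

This cell combines the 'title' and 'text' columns into a single feature, separated by a `<BREAK>` marker, so that the headline and the body are available together. Note that the code reads from `data` rather than `balanced_data`, so the downsampling from the previous step is not carried into the features. Substituting `balanced_data` would carry it forward, at the cost of changing the outputs shown below.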

In [None]:
features = data['title'] + " <BREAK> " + data['text']

labels = data['lebel'].values

# Convert to lists if needed
features = features.tolist()
labels = labels.tolist()

print("Features:", features[5])
print("Labels:", labels[5])


Features: MUST WATCH VIDEO: WATCH WHAT TRACK & FIELD OLYMPIAN Does When Our National Anthem Is Played [Video] <BREAK> Usain Bolt was mid-interview when our National Anthem began to play he stops to honor it! Way to go! 
Labels: 0
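
Because the separator is a fixed string, a combined feature can later be split back into its two parts. The sketch below illustrates this with made-up strings. The helper names are hypothetical, and the split assumes the title itself never contains the separator.

```python
SEPARATOR = " <BREAK> "  # same marker used when building the features


def combine(title, text):
    """Join a title and body with the separator."""
    return f"{title}{SEPARATOR}{text}"


def split_feature(feature):
    """Recover (title, text); splits on the first separator only."""
    title, text = feature.split(SEPARATOR, 1)
    return title, text


feature = combine("Some headline", "Body of the article.")
recovered = split_feature(feature)
print(recovered)
```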


## Filtering Features and Labels

This cell filters the features and labels by the character length of the combined title and text to reduce computational load. It retains only samples of at most 2,500 characters.

In [None]:

# Filter features and labels based on character length
filtered_features = []
filtered_labels = []

for feature, label in zip(features, labels):
    if len(feature) <= 2500:
        filtered_features.append(feature)
        filtered_labels.append(label)

print("Filtered Features")
print(len(filtered_features))


Filtered Features
5562
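
The same filter can be wrapped in a small function so that the length cutoff becomes a parameter. This is a sketch only: `filter_by_length` and the toy inputs are illustrative names and values, not part of the original notebook.

```python
def filter_by_length(features, labels, max_chars=2500):
    """Keep only (feature, label) pairs whose feature has at most max_chars characters."""
    kept = [(f, l) for f, l in zip(features, labels) if len(f) <= max_chars]
    if not kept:
        return [], []
    kept_features, kept_labels = zip(*kept)
    return list(kept_features), list(kept_labels)


# Hypothetical inputs: one at the limit, one just over, one short
toy_features = ["a" * 2500, "b" * 2501, "short"]
toy_labels = [1, 0, 1]

kept_features, kept_labels = filter_by_length(toy_features, toy_labels)
print(len(kept_features), kept_labels)
```

Filtering features and labels as pairs keeps the two lists aligned, which is what the `zip` loop in the cell above also relies on.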


## Save the Processed Dataset

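This cell saves the processed dataset, containing the filtered features and labels, into an `.npz` archive named "data.npz", which bundles both arrays in one file for easy loading. Note that `np.savez` stores the arrays uncompressed; `np.savez_compressed` is the compressed variant.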

In [None]:
filtered_features = np.array(filtered_features)
filtered_labels = np.array(filtered_labels)  # convert the filtered labels, not the unfiltered `labels`

np.savez("data.npz", features=filtered_features, labels=filtered_labels)
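
To check that the archive can be read back in a later notebook, a round trip along the following lines should work. It is a sketch with two made-up samples written to a temporary directory. The keys `features` and `labels` match the keyword arguments passed to `np.savez` above.

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-ins for the filtered features and labels
features = np.array(["headline <BREAK> body", "another <BREAK> text"])
labels = np.array([1, 0])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.npz")
    np.savez(path, features=features, labels=labels)

    # np.load returns a dict-like archive; arrays are read when indexed
    with np.load(path) as archive:
        loaded_features = archive["features"]
        loaded_labels = archive["labels"]

print(loaded_features.dtype.kind, loaded_labels.tolist())
```

Because the features are stored as a fixed-width Unicode string array rather than an object array, loading does not require `allow_pickle=True`.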