**Exploratory Data Analysis (EDA) Introduction**

Before diving into feature engineering or model training, it's essential to develop a thorough understanding of the dataset. This notebook focuses on exploratory data analysis (EDA) to identify patterns, spot anomalies, and uncover meaningful insights. The findings here will guide our decisions in the preprocessing, feature selection, and modeling phases of the project.

In [None]:
import pandas as pd

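# Load the WELFake dataset from the project's Data directory
df = pd.read_csv('../Data/WELFake_Dataset.csv')

# read_csv raises if the file is missing, so this check only guards against an empty frame
if not df.empty:
    print('✅ Data read successfully')
    print(df.head())
else:
    print('❌ Data reading unsuccessful')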

**Dataset Overview**

In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 72134 entries, 0 to 72133
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  72134 non-null  int64 
 1   title       71576 non-null  object
 2   text        72095 non-null  object
 3   label       72134 non-null  int64 
dtypes: int64(2), object(2)
memory usage: 2.2+ MB


In [15]:
df.describe(include = 'all')

        Unnamed: 0    title                                              text     label
count   72134.0       71576                                              72095.0  72134.0
unique                62347                                              62718.0
top                   Factbox: Trump fills top jobs for his administ...
freq                  14                                                 738.0
mean    36066.5                                                                   0.514404
std     20823.436496                                                              0.499796
min     0.0                                                                       0.0
25%     18033.25                                                                  0.0
50%     36066.5                                                                   1.0
75%     54099.75                                                                  1.0


In [16]:
df.isnull().sum()

Unnamed: 0      0
title         558
text           39
label           0
dtype: int64

These initial commands summarize the structure and quality of the dataset. It contains **72,134 news articles** described by four columns: an unnamed index column (serving as an article ID), the article's title, the full text, and a binary label indicating whether the article is real or fake. The label mean of roughly 0.51 shows that the two classes are close to balanced. The `describe` output also reports fewer unique titles (62,347) and texts (62,718) than non-null entries, so some values are repeated. Notably, **558** entries (about 0.8%) are missing titles and **39** (about 0.05%) are missing text. Handling these missing values is an essential part of the preprocessing phase. Because they affect fewer than 1% of records, rows lacking core textual information will likely be removed to maintain the integrity of the analysis.
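
The missing-value handling described above could look roughly like the sketch below. It is illustrative rather than the notebook's actual preprocessing step: the helper names `summarize_missing` and `drop_incomplete` are hypothetical, and the small `sample` frame is made-up data that only mirrors the dataset's column names.

```python
import pandas as pd


def summarize_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column."""
    counts = df.isnull().sum()
    return pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })


def drop_incomplete(df: pd.DataFrame, required=("title", "text")) -> pd.DataFrame:
    """Drop rows where any of the required text columns is missing."""
    return df.dropna(subset=list(required)).reset_index(drop=True)


# Tiny stand-in frame with the same column names as the dataset (values are made up).
sample = pd.DataFrame({
    "title": ["A", None, "C", "D"],
    "text": ["body a", "body b", None, "body d"],
    "label": [0, 1, 1, 0],
})

summary = summarize_missing(sample)
cleaned = drop_incomplete(sample)

print(summary)
print(f"Rows before: {len(sample)}, after: {len(cleaned)}")
```

Applied to the real data, the same two calls would take `df` in place of `sample`. Note that `dropna` only removes true nulls. The blank `top` value for `text` in the `describe` output may indicate empty or whitespace-only strings, which would need a separate check.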