**Objectives:**

Apply text cleaning & basic feature extraction

Build a simple classifier to distinguish FAKE vs REAL articles

Explore word usage differences between fake and real article text

**Data Source:**

https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset

**A. Data Preparation:**

Use Fake.csv and True.csv from Kaggle

In [2]:
from google.colab import files
uploaded = files.upload()

Saving True.csv to True.csv


In [3]:
from google.colab import files
uploaded = files.upload()

Saving Fake.csv to Fake.csv


In [4]:
import pandas as pd

fake = pd.read_csv('Fake.csv')
true = pd.read_csv('True.csv')
fake['label'] = 1  # fake=1, real=0
true['label'] = 0
df = pd.concat([fake, true]).sample(frac=1).reset_index(drop=True)


**B. Text Cleaning & Preprocessing:**

In [5]:
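# Lowercase, replace anything that is not a-z or whitespace with a space,
# then collapse runs of whitespace. regex=True is needed in pandas >= 2.0,
# where str.replace treats the pattern as a literal string by default.
df['clean'] = df['text'].str.lower().str.replace(r'[^a-z\s]', ' ', regex=True)
df['clean'] = df['clean'].str.replace(r'\s+', ' ', regex=True).str.strip()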
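
To see what the cleaning step does to a single string, here is a minimal sketch using only the standard library. The function name `clean_text` and the example sentence are made up for illustration; the two regular expressions are the ones used in the cell above. It trims leading and trailing spaces as a final step.

```python
import re

def clean_text(text):
    """Lowercase, keep only a-z and whitespace, collapse repeated whitespace."""
    text = text.lower()
    text = re.sub(r'[^a-z\s]', ' ', text)   # digits and punctuation become spaces
    text = re.sub(r'\s+', ' ', text)        # collapse runs of whitespace
    return text.strip()

# Hypothetical input, not taken from the dataset.
example = "BREAKING: Senate votes 52-48, officials said!"
print(clean_text(example))
```

Two side effects are worth knowing about: numbers disappear entirely, and contractions are split at the apostrophe (for example, "Don't" becomes "don t").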


**C. Simple Feature Extraction:**


In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1,2))
X = vectorizer.fit_transform(df['clean'])
y = df['label']
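
For intuition about the numbers inside `X`, the following is a rough pure-Python sketch of TF-IDF weighting on a toy corpus. It assumes the scikit-learn defaults (smoothed IDF of the form ln((1 + n) / (1 + df)) + 1, followed by L2 normalisation of each row) and simplifies everything else: it splits on whitespace and uses unigrams only, whereas the cell above also builds bigrams and caps the vocabulary at 5000 features. The function name `tfidf` and the three toy documents are illustrative only.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, plus the IDF table."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    # Smoothed inverse document frequency.
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1.0
           for term, count in doc_freq.items()}

    rows = []
    for tokens in tokenized:
        term_counts = Counter(tokens)
        weights = {term: count * idf[term] for term, count in term_counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        # L2-normalise so that long and short documents are comparable.
        rows.append({term: w / norm for term, w in weights.items()} if norm else {})
    return rows, idf

# Toy corpus, already cleaned; not taken from the dataset.
docs = [
    "breaking news watch this",
    "officials said on tuesday",
    "officials said this",
]
rows, idf = tfidf(docs)
print(rows[0])
```

Terms that occur in fewer documents (such as "breaking" here) receive a larger IDF than terms shared across documents (such as "this"), which is what lets the classifier focus on distinctive vocabulary.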


**D. Training a Classifier:**

In [7]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = LogisticRegression(max_iter=500)
clf.fit(X_train, y_train)
preds = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, preds))
print(classification_report(y_test, preds))


Accuracy: 0.987750556792873
              precision    recall  f1-score   support

           0       0.98      0.99      0.99      4212
           1       0.99      0.98      0.99      4768

    accuracy                           0.99      8980
   macro avg       0.99      0.99      0.99      8980
weighted avg       0.99      0.99      0.99      8980
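
As a reminder of what the precision, recall and f1-score columns mean, here is a small sketch that computes them by hand for one class. The helper name `precision_recall_f1` and the eight example labels are hypothetical; they are not the predictions from the run above.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if (tp + fp) else 0.0   # of predicted fakes, how many were fake
    recall = tp / (tp + fn) if (tp + fn) else 0.0      # of actual fakes, how many were caught
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Made-up labels: 1 = fake, 0 = real (same encoding as in the notebook).
y_true_demo = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred_demo = [1, 1, 0, 0, 0, 1, 1, 0]
print(precision_recall_f1(y_true_demo, y_pred_demo))
```

Calling the helper with `positive=0` gives the corresponding figures for the REAL class, which is how the two rows of the report relate to each other.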



**E. Exploratory Analysis:**

In [8]:
import numpy as np

feature_names = vectorizer.get_feature_names_out()
coef = clf.coef_[0]                 # positive weights push toward label 1 (FAKE)
top_fake = np.argsort(coef)[-20:]   # 20 largest coefficients, strongest last
top_real = np.argsort(coef)[:20]    # 20 most negative coefficients, strongest first

print("Words associated with FAKE news:")
print([feature_names[i] for i in top_fake])

print("\nWords associated with REAL news:")
print([feature_names[i] for i in top_real])


Words associated with FAKE news:
['sen', 'watch', 'featured', 'featured image', 'obama', 'com', 'is', 'image', 'hillary', 'that', 'just', 'gop', 'the us', 'read', 'mr', 'president trump', 'read more', 'us', 'this', 'via']

Words associated with REAL news:
['reuters', 'said', 'washington reuters', 'on', 'said on', 'reuters the', 'president donald', 'on wednesday', 'washington', 'on tuesday', 'on thursday', 'in', 'republican', 'on friday', 'reuters president', 'on monday', 'bit', 'said in', 'wednesday', 'minister']
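
The REAL list is led by 'reuters' and 'washington reuters', which suggests the model may be keying on a source marker rather than on writing style. One way to test this would be to strip the dateline before cleaning and retrain. The sketch below assumes real articles open with a prefix of the form "CITY (Reuters) - "; the pattern, the helper name `strip_dateline` and both sample strings are assumptions for illustration, not taken from the dataset.

```python
import re

# Assumed dateline shape: short location text, "(Reuters)", then a dash.
DATELINE = re.compile(r'^[^()]{0,80}\(Reuters\)\s*-\s*')

def strip_dateline(text):
    """Remove a leading 'CITY (Reuters) - ' prefix if one is present."""
    return DATELINE.sub('', text, count=1)

# Made-up examples in the style of the two classes.
sample_real = "WASHINGTON (Reuters) - Lawmakers said on Tuesday they would vote."
sample_fake = "Watch what happened next. Featured image via Twitter."

print(strip_dateline(sample_real))
print(strip_dateline(sample_fake))
```

In the notebook this could be applied with `df['text'].map(strip_dateline)` before the cleaning cell, followed by rerunning the remaining cells to see how the accuracy and the top-weighted words change. Mentions of Reuters elsewhere in the body would not be removed by this pattern.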


**What You’ll Learn & Showcase**

Loading and cleaning real-world fake news data

TF–IDF extraction and training a basic text classifier

Evaluating performance with accuracy and precision/recall

Interpreting model weights to find words that indicate fake news

This mini-project covers content-based detection, one of the foundational approaches discussed in Shu et al.’s article.