# Fake News Detection with TF-IDF and Logistic Regression

This notebook shows how to:
1. Load a CSV dataset of news articles.
2. Preprocess and combine title/text.
3. Compute TF-IDF features.
4. Train a Logistic Regression classifier.
5. Evaluate the model with accuracy, a confusion matrix, and a classification report.

---

## 1. Install and Import Libraries
If you don't already have the required packages, uncomment and run the `pip install` line below.


In [1]:
# Uncomment if needed (seaborn is only used for the optional heatmap in 6.5):
# !pip install pandas scikit-learn matplotlib seaborn


In [2]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt
%matplotlib inline


## 2. Load and Inspect the Dataset

- Place your CSV (e.g. `train.csv`) in the same folder as this notebook.
- The CSV is expected to have the columns `id`, `title`, `author`, `text`, `label`,  
  where `label` is 0 for real news and 1 for fake news. Only `title`, `text`, and `label` are used below.
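
Before loading the real file, it can help to confirm that the columns the notebook relies on are present. The following is a minimal sketch, not part of the original notebook: `REQUIRED_COLUMNS`, `check_schema`, and the toy frame are hypothetical names and data introduced only for illustration.

```python
import pandas as pd

# Columns the later cells actually read (assumption based on this notebook's code).
REQUIRED_COLUMNS = {"title", "text", "label"}


def check_schema(frame):
    """Return a sorted list of required columns that are missing from `frame`."""
    return sorted(REQUIRED_COLUMNS - set(frame.columns))


# Toy frame standing in for the real CSV (hypothetical data, `label` left out on purpose).
toy = pd.DataFrame({"id": [0, 1], "title": ["a", None], "text": ["x", "y"]})

missing = check_schema(toy)
print("Missing columns:", missing)
```

On the real data you would call `check_schema(df)` right after `pd.read_csv` and stop early if the returned list is not empty.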


In [3]:
# Adjust the filename/path if necessary
DATA_PATH = "train.csv"

# Load the dataset
df = pd.read_csv(DATA_PATH)

# Quick overview
print(f"Total rows: {df.shape[0]}, columns: {df.shape[1]}")
print(df.head(3))



### 2.1. Handle Missing Values and Combine Text

- Fill any missing `title` or `text` with an empty string.
- Create a new column `content = title + " " + text`.


In [None]:
df['title'] = df['title'].fillna('')
df['text']  = df['text'].fillna('')
df['content'] = df['title'] + " " + df['text']

# Confirm
print(df[['content','label']].sample(3))


## 3. TF-IDF Vectorization

- We use `TfidfVectorizer` to convert each `content` into a sparse TF-IDF feature vector.
- We remove English stopwords and limit to the top 5,000 features.


In [None]:
vectorizer = TfidfVectorizer(
    stop_words='english',
    max_features=5000  # keep the 5,000 most frequent terms
)

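# Fit on the entire corpus and transform.
# Note: this lets test-set text influence the vocabulary and IDF weights;
# for a stricter evaluation, fit the vectorizer on the training split only.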
X = vectorizer.fit_transform(df['content'])
y = df['label'].values

print(f"TF-IDF matrix shape: {X.shape}")
# e.g., (number_of_articles, 5000); fewer columns if the vocabulary is smaller


## 4. Train/Test Split

- Split data into 80% training and 20% testing.
- Set a `random_state` for reproducibility.
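
Because the TF-IDF matrix in Section 3 is fitted on the whole corpus, the split here operates on features that have already seen the test articles. One possible alternative, sketched below under stated assumptions, is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn `Pipeline`. The toy texts, labels, and variable names are hypothetical; `stratify` is an addition that keeps the class balance equal in both splits.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for df['content'] and df['label'].
texts = [
    "senate passes budget after long debate",
    "miracle pill melts fat overnight doctors stunned",
    "city council approves new transit plan",
    "aliens endorse candidate in shocking secret video",
    "central bank holds rates steady this quarter",
    "celebrity reveals secret cure they hide from you",
    "court rules on state election boundaries",
    "shocking trick erases debt overnight",
]
labels = [0, 1, 0, 1, 0, 1, 0, 1]

# Split the raw text first so the vectorizer never sees the test articles.
text_train, text_test, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english", max_features=5000)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Vocabulary and IDF weights are learned from the training text only.
pipe.fit(text_train, y_tr)
test_preds = pipe.predict(text_test)
print(len(text_train), "training texts,", len(text_test), "test texts")
```

With only eight made-up sentences the predictions themselves are meaningless; the point is the order of operations.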


In [4]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

print("Training set:", X_train.shape, y_train.shape)
print("Test set:",     X_test.shape,  y_test.shape)



## 5. Train Logistic Regression Model

- Initialize `LogisticRegression` with a higher `max_iter` than the default (100) to give the solver room to converge on sparse, high-dimensional data.


In [5]:
model = LogisticRegression(
    solver='lbfgs',    # default solver; works well for many problems
    max_iter=1000      # increase if the solver doesn't converge
    # n_jobs is omitted: it only parallelizes over classes, so it has
    # no effect on a binary problem like this one
)

# Train on the TF-IDF features
model.fit(X_train, y_train)
print("Model training complete.")



## 6. Evaluate on Test Set

- Compute accuracy, confusion matrix, and classification report (precision/recall/F1).
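
As a reading aid for these metrics, the sketch below uses hand-written labels (hypothetical, not from the dataset) to show how scikit-learn lays out a binary confusion matrix: rows are true classes, columns are predicted classes, and class 1 ("Fake") is treated as the positive class.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 0 = Real, 1 = Fake.
y_true_toy = [0, 0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 0, 1, 1, 1, 1, 0]

cm_toy = confusion_matrix(y_true_toy, y_pred_toy)

# For binary labels, ravel() yields the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = cm_toy.ravel()

precision_fake = tp / (tp + fp)   # of the articles flagged Fake, how many really are
recall_fake = tp / (tp + fn)      # of the truly Fake articles, how many were caught
accuracy = (tp + tn) / cm_toy.sum()

print(cm_toy)
print(f"precision={precision_fake:.2f} recall={recall_fake:.2f} accuracy={accuracy:.2f}")
```

These are the same quantities that `classification_report` prints for the "Fake" row.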


In [6]:
# 6.1. Predictions
y_pred = model.predict(X_test)

# 6.2. Accuracy
acc = accuracy_score(y_test, y_pred)
print(f"Accuracy on test set: {acc:.2%}")

# 6.3. Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", cm)

# 6.4. Classification Report
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=["Real", "Fake"]))



### 6.5. Optional: Plot Confusion Matrix

A simple heatmap to visualize the confusion matrix.


In [7]:
import seaborn as sns  # seaborn is optional but makes plotting nicer

sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", 
            xticklabels=["Pred_Real","Pred_Fake"],
            yticklabels=["True_Real","True_Fake"])
plt.ylabel("True label")
plt.xlabel("Predicted label")
plt.title("Confusion Matrix")
plt.show()



## 7. (Optional) Inspect Top TF-IDF Features

See which terms have the highest weights (coefficients) for predicting “fake” vs. “real.”

- `model.coef_` is an array of shape `(1, n_features)` because this is binary classification.
- We sort terms by their coefficient value: positive coefficients → indicative of class “Fake”; negative → indicative of “Real.”
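
The sign convention follows from how a binary logistic regression turns features into a probability: the probability of class 1 is the sigmoid of `X @ coef + intercept`. The sketch below checks this on a two-feature toy problem; the data and the names `X_toy`, `toy_model`, and `manual_prob` are illustrative assumptions, not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: feature 0 is high for class 1, feature 1 is high for class 0.
X_toy = np.array([[0.0, 1.0], [0.2, 0.9], [1.0, 0.0], [0.9, 0.1]])
y_toy = np.array([0, 0, 1, 1])

toy_model = LogisticRegression().fit(X_toy, y_toy)

w = toy_model.coef_[0]        # one weight per feature
b = toy_model.intercept_[0]

# Probability of class 1, recomputed by hand from the coefficients.
manual_prob = 1.0 / (1.0 + np.exp(-(X_toy @ w + b)))

print("coef_ shape:", toy_model.coef_.shape)
print("weights:", w)
print("matches predict_proba:",
      np.allclose(manual_prob, toy_model.predict_proba(X_toy)[:, 1]))
```

A larger value of a positively weighted feature raises the class-1 probability, which is why positive coefficients point toward "Fake" in the notebook's labeling.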


In [8]:
# Get feature names (terms) from the vectorizer
terms = vectorizer.get_feature_names_out()

# Get the coefficients for the positive class "Fake" (shape: (n_features,), at most 5000)
coefs = model.coef_[0]

# Pair terms with their coefficient, and sort by coefficient descending
coef_df = pd.DataFrame({
    'term': terms,
    'coef': coefs
}).sort_values(by='coef', ascending=False)

# Top 10 terms most indicative of Fake
print("Top 10 terms indicative of Fake news:")
print(coef_df.head(10))

# Top 10 terms most indicative of Real news (most negative coefficients)
print("\nTop 10 terms indicative of Real news:")
print(coef_df.tail(10))



## 8. Save the Trained Model (Optional)

- If you want to reuse the trained model later, save it with `joblib` or `pickle`.
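
Saving the vectorizer and the model as two files works, but they must always be loaded together and in matching versions. One possible variation, sketched here with hypothetical toy data and a hypothetical file name, is to store both steps as a single scikit-learn `Pipeline`. The temporary directory is used only to keep the example from writing into the notebook folder.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for the real training set.
toy_texts = [
    "senate passes budget after long debate",
    "secret miracle cure doctors stunned",
    "council approves transit plan",
    "shocking trick erases debt overnight",
]
toy_labels = [0, 1, 0, 1]

bundle = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", LogisticRegression(max_iter=1000)),
]).fit(toy_texts, toy_labels)

# Round trip through disk: dump once, load once, no separate vectorizer file.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "fake_news_pipeline.joblib")  # hypothetical name
    joblib.dump(bundle, path)
    restored = joblib.load(path)

samples = ["secret miracle cure revealed"]
same = (restored.predict(samples) == bundle.predict(samples)).all()
print("Round trip preserved predictions:", bool(same))
```

The restored pipeline accepts raw strings directly, so no separate `transform` call is needed before `predict`. As with `pickle`, only load joblib files from sources you trust.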


In [9]:
import joblib

# Save both the vectorizer and the model
joblib.dump(vectorizer, "tfidf_vectorizer.joblib")
joblib.dump(model,       "logreg_fake_news_model.joblib")

print("Saved vectorizer and model to disk.")



## 9. (Optional) Load and Test Saved Model on New Samples

- Demonstrates how to load the pipeline components and predict on new text strings.


In [10]:
# Load saved objects
loaded_vectorizer = joblib.load("tfidf_vectorizer.joblib")
loaded_model      = joblib.load("logreg_fake_news_model.joblib")

# Example new samples
new_articles = [
    "Breaking news: Scientists discover cure for common cold...",
    "Exclusive: Celebrity endorses miracle diet pill, doctors shocked..."
]

# Transform with TF-IDF
X_new = loaded_vectorizer.transform(new_articles)

# Predict
predictions = loaded_model.predict(X_new)
probabilities = loaded_model.predict_proba(X_new)[:, 1]  # probability of class 1 ("Fake")

for text, pred, prob in zip(new_articles, predictions, probabilities):
    label = "Fake" if pred == 1 else "Real"
    print(f"\nArticle: {text[:60]}...")
    print(f"Predicted label: {label} (prob_fake = {prob:.2f})")

