# Fake News Detection with TF-IDF and Logistic Regression

This notebook shows how to:
1. Load a CSV dataset of news articles.
2. Preprocess and combine title/text.
3. Compute TF-IDF features.
4. Train a Logistic Regression classifier.
5. Evaluate the model with accuracy and a confusion matrix.

---

## 1. Install and Import Libraries
If you don’t already have the required packages, uncomment and run the `pip install` line below.


In [None]:
# Uncomment if needed:
# !pip install pandas scikit-learn matplotlib


In [None]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt
%matplotlib inline


## 2. Load and Inspect the Dataset

- Place your CSV (e.g. `train.csv`) in the same folder as this notebook.
- The CSV must have columns: `id`, `title`, `author`, `text`, `label`  
  where `label` is 0 for real news, 1 for fake news.


In [None]:
# Adjust the filename/path if necessary
DATA_PATH = "train.csv"

# Load the dataset
df = pd.read_csv(DATA_PATH)

# Quick overview
print(f"Total rows: {df.shape[0]}, columns: {df.shape[1]}")
print(df.head(3))
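
Before going further, it can help to confirm that the file really matches the expected layout. The snippet below is a minimal sketch, not part of the original notebook: `check_schema` and the toy frame are hypothetical names used for illustration, and the toy frame stands in for the real `df` so the cell runs without `train.csv`.

```python
import pandas as pd

# Columns the notebook expects in the CSV
REQUIRED_COLUMNS = ["id", "title", "author", "text", "label"]


def check_schema(frame: pd.DataFrame) -> None:
    """Raise ValueError if a required column is missing or a label is not 0/1."""
    missing = [col for col in REQUIRED_COLUMNS if col not in frame.columns]
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    bad_labels = set(frame["label"].dropna().unique()) - {0, 1}
    if bad_labels:
        raise ValueError(f"Unexpected label values: {sorted(bad_labels)}")


# Toy frame standing in for the loaded CSV (illustrative data only)
toy = pd.DataFrame({
    "id": [0, 1, 2],
    "title": ["Headline A", None, "Headline C"],
    "author": ["Someone", "Someone Else", None],
    "text": ["Body A", "Body B", "Body C"],
    "label": [0, 1, 1],
})

check_schema(toy)
print(toy["label"].value_counts())  # class balance
print(toy.isna().sum())             # missing values per column
```

In the notebook itself you would call `check_schema(df)` right after `pd.read_csv`.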


### 2.1. Handle Missing Values and Combine Text

- Fill any missing `title` or `text` with an empty string.
- Create a new column `content = title + " " + text`.


In [None]:
df['title'] = df['title'].fillna('')
df['text']  = df['text'].fillna('')
df['content'] = df['title'] + " " + df['text']

# Confirm
print(df[['content','label']].sample(3))
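
To see why the `fillna('')` step matters, here is a small sketch on made-up rows (the toy frame is an assumption for illustration). Concatenating a string with a missing value yields a missing value, so a row without a title would lose its text as well. The trailing `.str.strip()` is an optional extra that the notebook does not use; it only removes the stray space left when one side is empty.

```python
import pandas as pd

# Two illustrative rows: one without text, one without a title
toy = pd.DataFrame({
    "title": ["Breaking story", None],
    "text": [None, "Body without a headline"],
})

# Without fillna, the missing value propagates through the concatenation
naive = toy["title"] + " " + toy["text"]
print(naive.isna().sum(), "rows lost out of", len(naive))

# With fillna, every row keeps whatever text it has
combined = (toy["title"].fillna("") + " " + toy["text"].fillna("")).str.strip()
print(combined.tolist())
```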


## 3. TF-IDF Vectorization

- We use `TfidfVectorizer` to convert each `content` string into a sparse TF-IDF feature vector.
- We remove English stopwords and limit the vocabulary to the 5,000 terms with the highest frequency across the corpus.


In [None]:
vectorizer = TfidfVectorizer(
    stop_words='english',
    max_features=5000  # keep the 5,000 most frequent terms
)

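# Fit on the entire corpus and transform.
# Note: the vocabulary and IDF weights then also see the articles that end up in the test set;
# for a stricter evaluation, split first and fit the vectorizer on the training portion only.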
X = vectorizer.fit_transform(df['content'])
y = df['label'].values

print(f"TF-IDF matrix shape: {X.shape}")
# e.g., (number_of_articles, 5000)


## 4. Train/Test Split

- Split data into 80% training and 20% testing.
- Set a `random_state` for reproducibility.


In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

print("Training set:", X_train.shape, y_train.shape)
print("Test set:",     X_test.shape,  y_test.shape)
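
A possible variation, sketched here under stated assumptions rather than as the notebook's method: split the raw text first, then fit the vectorizer on the training portion only, so the test articles influence neither the vocabulary nor the IDF weights. The sketch also passes `stratify`, which keeps the real/fake ratio the same in both parts. The toy data and the names `text_train`, `text_test`, `y_tr`, `y_te`, and `leakfree_vectorizer` are hypothetical.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Ten illustrative rows standing in for df[['content', 'label']]
toy = pd.DataFrame({
    "content": [
        "senate passes budget bill", "aliens endorse candidate",
        "court upholds ruling", "miracle cure hidden by doctors",
        "city opens new library", "celebrity secretly replaced by clone",
        "storm closes coastal roads", "moon landing staged again",
        "school board approves plan", "secret tunnel under stadium found",
    ],
    "label": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})

# Split the raw text, keeping the class ratio equal in both parts
text_train, text_test, y_tr, y_te = train_test_split(
    toy["content"], toy["label"],
    test_size=0.20, random_state=42, stratify=toy["label"],
)

# Learn vocabulary and IDF weights from the training text only
leakfree_vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
X_tr = leakfree_vectorizer.fit_transform(text_train)
X_te = leakfree_vectorizer.transform(text_test)  # reuse the fitted vocabulary

print("Training set:", X_tr.shape, y_tr.shape)
print("Test set:", X_te.shape, y_te.shape)
```

On a large corpus the difference in measured accuracy is often small, but fitting on the training text alone keeps the test set strictly unseen.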


## 5. Train Logistic Regression Model

- Initialize `LogisticRegression` with `max_iter` raised above its default of 100, which gives the solver more room to converge on sparse, high-dimensional data.
