# Text Classification using TF-IDF and Logistic Regression

## Objective
Build a baseline text classification model with traditional machine learning: TF-IDF features fed to a Logistic Regression classifier.

## Dataset
A four-category subset of the 20 Newsgroups dataset, loaded with `fetch_20newsgroups`:
- comp.graphics
- rec.autos
- sci.med
- sci.space

## Steps
1. Load dataset (headers, footers, and quoted replies removed)  
2. Split into training and test sets  
3. TF-IDF vectorization with English stop-word removal  
4. Train Logistic Regression model  
5. Evaluate performance  
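
The steps above can also be chained into a single scikit-learn pipeline, so that vectorization and classification are always fitted and applied together. The following is a minimal sketch on a made-up toy corpus (the texts and the `autos`/`space` labels are illustrative assumptions, not the notebook's data).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy corpus, invented for illustration only
texts = [
    "the engine and brakes of the car need service",
    "new tires improved how the car handles",
    "the rocket reached orbit after launch",
    "the satellite entered orbit around the moon",
]
labels = ["autos", "autos", "space", "space"]

# Vectorizer and classifier are fitted together in one call
pipeline = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LogisticRegression(max_iter=1000),
)
pipeline.fit(texts, labels)

# Raw strings go in; the pipeline applies TF-IDF before predicting
prediction = pipeline.predict(["the car needs new brakes"])[0]
print(prediction)
```

Keeping both stages in one object makes it harder to fit the vectorizer on test data by accident.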


In [2]:
# Import required libraries
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report


In [3]:
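# Load four categories of the 20 Newsgroups dataset.
# Headers, footers, and quoted replies are stripped so the model
# learns from the message bodies rather than from metadata.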
categories = ['rec.autos', 'sci.med', 'comp.graphics', 'sci.space']

data = fetch_20newsgroups(
    subset='all',
    categories=categories,
    remove=('headers', 'footers', 'quotes')
)

X = data.data
y = data.target

print("Number of samples:", len(X))
print("Number of classes:", len(set(y)))


Number of samples: 3940
Number of classes: 4
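
Before splitting, it can help to check how many samples each class has. A small sketch of that check follows; the `target_names` list and the short `y` list are stand-ins (assumptions) for `data.target_names` and `data.target` from the cell above.

```python
from collections import Counter

# Stand-ins for data.target_names and data.target (illustrative values only)
target_names = ["comp.graphics", "rec.autos", "sci.med", "sci.space"]
y = [0, 3, 1, 1, 2, 0, 3, 3]

# Count integer labels, then map each label to its readable name
counts = Counter(y)
class_counts = {target_names[label]: counts[label] for label in sorted(counts)}

for name, count in class_counts.items():
    print(f"{name}: {count}")
```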


In [4]:
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42
)

print("Training samples:", len(X_train))
print("Testing samples:", len(X_test))


Training samples: 3152
Testing samples: 788
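
The split above is purely random. If class proportions should be preserved in both subsets, `train_test_split` accepts a `stratify` argument. A minimal sketch on made-up, deliberately imbalanced labels (not the notebook's data):

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced dataset: 10, 6, and 4 samples per class
X = [f"doc {i}" for i in range(20)]
y = [0] * 10 + [1] * 6 + [2] * 4

# stratify=y keeps the class ratio the same in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.5,
    random_state=42,
    stratify=y,
)

test_counts = Counter(y_test)
print(sorted(test_counts.items()))
```

Note that adding `stratify` to the notebook's own split would change which samples land in the test set, so the scores reported below would shift slightly.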


In [5]:
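# Convert text to TF-IDF features.
# The vocabulary and IDF weights are fitted on the training set only,
# then reused unchanged to transform the test set.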
tfidf = TfidfVectorizer(
    stop_words='english',
    max_features=5000
)

X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

print("TF-IDF train shape:", X_train_tfidf.shape)
print("TF-IDF test shape:", X_test_tfidf.shape)


TF-IDF train shape: (3152, 5000)
TF-IDF test shape: (788, 5000)
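
To make the weighting concrete, here is a from-scratch sketch of the formula `TfidfVectorizer` uses with its default settings: raw term counts multiplied by a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each document vector. The pre-tokenized toy documents are invented for illustration.

```python
import math
from collections import Counter

# Toy, already-tokenized documents (illustrative only)
docs = [
    ["space", "shuttle", "launch"],
    ["car", "engine", "engine"],
    ["space", "engine"],
]
n_docs = len(docs)

# Document frequency: number of documents containing each term
df = Counter(term for doc in docs for term in set(doc))

# Smoothed inverse document frequency
idf = {term: math.log((1 + n_docs) / (1 + d)) + 1 for term, d in df.items()}

def tfidf_vector(doc):
    """Return an L2-normalized TF-IDF weight for each term in one document."""
    counts = Counter(doc)
    weights = {term: count * idf[term] for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

vectors = [tfidf_vector(doc) for doc in docs]
for vector in vectors:
    print({term: round(weight, 3) for term, weight in vector.items()})
```

Terms that occur in fewer documents (such as `shuttle` here) receive a larger weight than terms shared across documents (such as `space`).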


In [6]:
# Train Logistic Regression model
# (max_iter is raised from the default of 100 to give the solver room to converge)
model = LogisticRegression(
    max_iter=1000
)

model.fit(X_train_tfidf, y_train)

print("Model training completed")


Model training completed


In [7]:
# Make predictions on the test set
y_pred = model.predict(X_test_tfidf)

# Evaluate performance
accuracy = accuracy_score(y_test, y_pred)

print("Test Accuracy:", accuracy)
print("\nClassification Report:\n")
print(classification_report(y_test, y_pred))


Test Accuracy: 0.8769035532994924

Classification Report:

              precision    recall  f1-score   support

           0       0.86      0.89      0.88       183
           1       0.80      0.93      0.86       193
           2       0.94      0.86      0.90       200
           3       0.93      0.83      0.87       212

    accuracy                           0.88       788
   macro avg       0.88      0.88      0.88       788
weighted avg       0.88      0.88      0.88       788
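
The report above identifies classes by integer label only. Passing `target_names` to `classification_report` prints readable names, and a confusion matrix shows which classes get mistaken for each other. A minimal sketch with made-up predictions follows; the name list is an assumption standing in for `data.target_names`, whose order determines what each integer label means.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Stand-in for data.target_names (assumed order; check the loaded dataset)
target_names = ["comp.graphics", "rec.autos", "sci.med", "sci.space"]

# Hypothetical true and predicted labels, for illustration only
y_true = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred = [0, 0, 1, 3, 2, 2, 3, 1]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Same metrics as before, labeled with class names instead of integers
report = classification_report(y_true, y_pred, target_names=target_names)
print(report)
```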

