Multi-class text classification dataset with labeled documents for news categorization, topic classification, and document analysis.
| Property | Value |
|---|---|
| Project | Text Classification Dataset |
| Category | Text Data / NLP |
| Author | Molla Samser |
| Designer & Tester | Rima Khatun |
| Website | https://rskworld.in |
| Email | help@rskworld.in |
| Phone | +91 93305 39277 |

This dataset includes labeled documents across six categories for text classification tasks. It is suited to:
- 📰 News Categorization - Classify news articles into categories
- 🏷️ Topic Classification - Identify main topics from text
- 📄 Document Analysis - Analyze and categorize documents
- 🤖 NLP Model Training - Train and fine-tune models

Features:
- ✅ Multiple document categories (6 classes)
- ✅ Labeled dataset (240 training samples)
- ✅ Train/Validation/Test splits
- ✅ Multiple formats (CSV, JSON, TXT)
- ✅ Transformer-ready format (BERT, RoBERTa)
- 🔥 Interactive Data Explorer - Visual data exploration tool
- 🔥 REST API Server - Flask-based prediction API
- 🔥 Data Augmentation - 6 augmentation techniques
- 🔥 Model Explainability - LIME-based explanations
- 🔥 Batch Processing - High-throughput classification
- 🔥 Advanced Visualizations - Word clouds, confusion matrices
- 🔥 Performance Benchmarking - Model comparison tools
- 🔥 Cross-Validation - Robust model evaluation

| Metric | Value |
|---|---|
| Training Samples | 240 |
| Validation Set | 30 |
| Test Set | 30 |
| Categories | 6 |
| Avg. Text Length | ~20 words |
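
The 240/30/30 figures above amount to an 80/10/10 partition of 300 documents. The sketch below shows one way such a stratified split could be reproduced with the standard library. It is not the procedure used to build the shipped files: `stratified_split` is a hypothetical helper, and the even 50-documents-per-class layout is an assumption, since per-class counts are not stated here.

```python
import random
from collections import defaultdict


def stratified_split(rows, train=0.8, val=0.1, seed=42):
    """Split (text, label) rows into train/validation/test, keeping label ratios."""
    by_label = defaultdict(list)
    for text, label in rows:
        by_label[label].append((text, label))

    rng = random.Random(seed)
    splits = {"train": [], "validation": [], "test": []}
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_train = round(len(group) * train)
        n_val = round(len(group) * val)
        splits["train"] += group[:n_train]
        splits["validation"] += group[n_train:n_train + n_val]
        splits["test"] += group[n_train + n_val:]  # remainder goes to test
    return splits


# Synthetic stand-in data: 6 labels x 50 placeholder documents (an assumption).
rows = [(f"document {i} of class {label}", label)
        for label in range(6) for i in range(50)]
splits = stratified_split(rows)
print({name: len(part) for name, part in splits.items()})
# {'train': 240, 'validation': 30, 'test': 30}
```

Splitting per label rather than over the whole pool keeps every class represented in the small validation and test sets.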

| Label | Category | Description | Color |
|---|---|---|---|
| 0 | Technology | Tech news, gadgets, software, AI | 🔵 Blue |
| 1 | Sports | Athletics, competitions, leagues | 🟢 Green |
| 2 | Politics | Government, policy, elections | 🟣 Purple |
| 3 | Entertainment | Movies, music, TV shows, celebrities | 🩷 Pink |
| 4 | Business | Finance, markets, economy | 🟡 Amber |
| 5 | Science | Research, discoveries, space, health | 🔵 Cyan |
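
Models trained on this data return integer labels, so a small lookup built from the table above can turn predictions back into category names. A minimal sketch; `LABEL_NAMES`, `NAME_TO_LABEL`, and `decode_label` are illustrative names and are not taken from the repository's scripts.

```python
# Label ids and names copied from the category table.
LABEL_NAMES = {
    0: "Technology",
    1: "Sports",
    2: "Politics",
    3: "Entertainment",
    4: "Business",
    5: "Science",
}

# Reverse lookup, useful when the source data stores category names.
NAME_TO_LABEL = {name: label for label, name in LABEL_NAMES.items()}


def decode_label(label):
    """Return the category name for an integer (or integer-like) label."""
    return LABEL_NAMES[int(label)]


print(decode_label(5))            # Science
print(NAME_TO_LABEL["Business"])  # 4
```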

```
text-classification/
├── index.html                    # Main showcase page
├── explorer.html                 # 🆕 Interactive data explorer
├── README.md                     # Documentation
├── requirements.txt              # Python dependencies
├── text-classification.svg       # Project logo
│
├── assets/
│   ├── css/
│   │   └── style.css             # Styles
│   ├── js/
│   │   └── main.js               # Scripts
│   └── favicon.svg               # Favicon
│
├── data/
│   ├── csv/
│   │   ├── train.csv             # Training data (240 samples)
│   │   ├── validation.csv        # Validation data
│   │   ├── test.csv              # Test data
│   │   └── full_dataset.csv      # Complete dataset
│   ├── json/
│   │   ├── dataset.json          # JSON format
│   │   └── full_dataset.json     # Complete JSON
│   └── txt/
│       ├── categories.txt        # Category labels
│       └── sample_documents.txt
│
├── scripts/
│   ├── preprocessing.py          # Text preprocessing
│   ├── train_classifier.py       # Traditional ML training
│   ├── train_transformers.py     # BERT/Transformer training
│   ├── data_augmentation.py      # 🆕 6 augmentation techniques
│   ├── visualizations.py         # 🆕 Word clouds, charts
│   ├── api_server.py             # 🆕 REST API server
│   ├── model_explainability.py   # 🆕 LIME explanations
│   └── batch_processor.py        # 🆕 Batch classification
│
└── notebooks/
    └── text_classification_tutorial.ipynb  # Complete tutorial
```

```bash
# Download the dataset
wget https://rskworld.in/datasets/text-classification.zip
unzip text-classification.zip
cd text-classification

# Install dependencies
pip install -r requirements.txt
```

```python
import pandas as pd

# Load training data (comment='#' skips lines that start with '#')
train_df = pd.read_csv('data/csv/train.csv', comment='#')
print(f"Training samples: {len(train_df)}")
print(train_df.head())
```

```bash
# Traditional ML model
python scripts/train_classifier.py

# View visualizations
python scripts/visualizations.py ../data
```

Open explorer.html in your browser to:
- Filter documents by category
- Search through the dataset
- View category distribution charts
- Analyze word count distributions
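
The same four operations can be approximated in a script if you would rather not open a browser. This is a rough sketch using only the standard library, not the explorer's own logic: the three rows are stand-ins copied from example sentences in this README, the helper names are hypothetical, and case-insensitive substring search is an assumption about how searching should behave.

```python
from collections import Counter

# Stand-in rows; the real documents live in data/csv/train.csv.
docs = [
    {"text": "Apple announces new iPhone with AI features", "label": 0},
    {"text": "Scientists discover new planet in nearby galaxy", "label": 5},
    {"text": "New AI-powered smartphone released", "label": 0},
]


def filter_by_label(rows, label):
    """Keep only the documents with the given integer label."""
    return [row for row in rows if row["label"] == label]


def search(rows, query):
    """Case-insensitive substring search over the document text."""
    needle = query.lower()
    return [row for row in rows if needle in row["text"].lower()]


def label_distribution(rows):
    """Count documents per label."""
    return Counter(row["label"] for row in rows)


def word_counts(rows):
    """Whitespace-delimited word count for each document."""
    return [len(row["text"].split()) for row in rows]


print(len(filter_by_label(docs, 0)))     # 2
print(search(docs, "planet")[0]["label"])  # 5
print(label_distribution(docs))          # Counter({0: 2, 5: 1})
print(word_counts(docs))                 # [7, 7, 4]
```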

```bash
# Start the API server
cd scripts
python api_server.py --demo --port 5000
```

API Endpoints:

```
GET  /               - API info
GET  /health         - Health check
GET  /categories     - List all categories
POST /predict        - Classify single text
POST /predict/batch  - Classify multiple texts
POST /analyze        - Detailed text analysis
```

Example API Call:

```bash
curl -X POST http://localhost:5000/predict \
  -H "Content-Type: application/json" \
  -d '{"text": "Apple announces new iPhone with AI features"}'
```

Data Augmentation:

```python
from scripts.data_augmentation import TextAugmenter

augmenter = TextAugmenter(num_aug=5, random_state=42)
text = "Apple announces revolutionary new iPhone"
augmented = augmenter.augment(text)

for i, aug_text in enumerate(augmented, 1):
    print(f"{i}. {aug_text}")
```

Supported Techniques:
- Synonym Replacement (SR)
- Random Insertion (RI)
- Random Swap (RS)
- Random Deletion (RD)
- Character-level augmentation
- Keyboard error simulation
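
To make two of these techniques concrete, here is a minimal sketch of Random Swap and Random Deletion on a whitespace-tokenized sentence. It illustrates the general idea only and is not the code in `scripts/data_augmentation.py`; the function names, the `p` deletion probability, and the keep-at-least-one-word rule are assumptions made for this example.

```python
import random


def random_swap(words, n_swaps, rng):
    """Swap two randomly chosen positions, repeated n_swaps times."""
    words = list(words)
    if len(words) < 2:
        return words
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words


def random_deletion(words, p, rng):
    """Drop each word with probability p, always keeping at least one word."""
    if len(words) <= 1:
        return list(words)
    kept = [word for word in words if rng.random() > p]
    return kept if kept else [rng.choice(words)]


rng = random.Random(42)  # fixed seed so runs are repeatable
words = "Apple announces revolutionary new iPhone".split()
print(" ".join(random_swap(words, 1, rng)))
print(" ".join(random_deletion(words, 0.2, rng)))
```

Both functions return a new list and leave the input untouched, so one source sentence can be reused to produce several variants.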

Model Explainability:

```python
from scripts.model_explainability import TextExplainer

# classifier_fn is your own prediction function; it is not defined in this snippet
explainer = TextExplainer(classifier_fn)
explanation = explainer.explain("New AI-powered smartphone released")

print(f"Predicted: {explanation['predicted_category']}")
print("Important words:")
for item in explanation['word_importance'][:5]:
    print(f"  {item['word']}: {item['importance']:.4f}")
```

Batch Processing:

```bash
# Process a file of texts
python scripts/batch_processor.py process \
  --input input.csv \
  --output predictions.csv \
  --model model.joblib \
  --batch-size 100

# Evaluate predictions
python scripts/batch_processor.py evaluate \
  --predictions predictions.csv \
  --ground-truth ground_truth.csv \
  --output report.json
```

Visualizations:

```bash
# Generate all visualizations
python scripts/visualizations.py ../data

# Outputs:
# - visualizations/category_distribution.png
# - visualizations/text_length_distribution.png
# - visualizations/wordcloud_all.png
# - visualizations/wordclouds_by_category/
```

Traditional ML (scikit-learn):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
import pandas as pd

# Load data
train_df = pd.read_csv('data/csv/train.csv', comment='#')

# Vectorize
tfidf = TfidfVectorizer(max_features=10000)
X = tfidf.fit_transform(train_df['text'])
y = train_df['label']

# Train (a higher max_iter helps avoid convergence warnings)
model = LogisticRegression(max_iter=1000)
model.fit(X, y)

# Predict
text = "Apple unveils new iPhone with AI features"
prediction = model.predict(tfidf.transform([text]))
print(f"Predicted: {prediction[0]}")  # 0 = Technology
```

Transformers (BERT):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load model. The 6-way classification head is newly initialized here,
# so fine-tune on the training split before relying on its predictions.
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=6)
model.eval()

# Tokenize
text = "Scientists discover new planet in nearby galaxy"
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)

# Predict (no gradients needed for inference)
with torch.no_grad():
    outputs = model(**inputs)
prediction = outputs.logits.argmax(dim=-1).item()
print(f"Predicted label: {prediction}")  # 5 = Science once the model is fine-tuned
```

Model comparison:

| Model | Accuracy | F1 Score | Inference (ms) |
|---|---|---|---|
| Naive Bayes | 85.2% | 0.847 | ~1 |
| Logistic Regression | 89.7% | 0.892 | ~2 |
| Linear SVM | 88.9% | 0.885 | ~2 |
| BERT (fine-tuned) | 94.3% | 0.941 | ~50 |

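
For readers who want to check numbers like these on their own predictions, the sketch below computes accuracy and F1 by hand. The table does not say how F1 was averaged, so macro-averaging (an unweighted mean of per-class F1) is an assumption here, and the label lists are made-up toy values rather than results from this dataset.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels for illustration only.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(f"Accuracy: {accuracy(y_true, y_pred):.3f}")  # Accuracy: 0.667
print(f"Macro F1: {macro_f1(y_true, y_pred):.3f}")  # Macro F1: 0.656
```

With equal class sizes, macro and weighted F1 coincide, which is one reason macro-averaging is a reasonable default for a balanced dataset.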
This dataset is provided for educational purposes only.
Copyright (c) 2026 RSK World - All Rights Reserved

Author: Molla Samser
- 🌐 Website: https://rskworld.in
- 📧 Email: help@rskworld.in
- 📱 Phone: +91 93305 39277

Designer & Tester: Rima Khatun

If you have any questions or need support:
- 📧 Email: support@rskworld.in
- 📞 Contact: https://rskworld.in/contact.php

Made with ❤️ by RSK World (rskworld.in)