rskworld/text-classification

📄 Text Classification Dataset

Multi-class text classification dataset with labeled documents for news categorization, topic classification, and document analysis.

📋 Project Information

| Property | Value |
| --- | --- |
| Project | Text Classification Dataset |
| Category | Text Data / NLP |
| Author | Molla Samser |
| Designer & Tester | Rima Khatun |
| Website | https://rskworld.in |
| Email | help@rskworld.in |
| Phone | +91 93305 39277 |

📖 Description

This dataset contains labeled documents across six categories for text classification tasks. It is suited to:

  • 📰 News Categorization - Classify news articles into categories
  • 🏷️ Topic Classification - Identify the main topic of a text
  • 📑 Document Analysis - Analyze and categorize documents
  • 🤖 NLP Model Training - Train and fine-tune models

✨ Features

Core Features

  • ✅ Multiple document categories (6 classes)
  • ✅ Labeled dataset (240 training samples, plus 30 validation and 30 test samples)
  • ✅ Train/Validation/Test splits
  • ✅ Multiple formats (CSV, JSON, TXT)
  • ✅ Transformer-ready format (BERT, RoBERTa)

🆕 Advanced Features

  • 🔥 Interactive Data Explorer - Visual data exploration tool
  • 🔥 REST API Server - Flask-based prediction API
  • 🔥 Data Augmentation - 6 augmentation techniques
  • 🔥 Model Explainability - LIME-based explanations
  • 🔥 Batch Processing - High-throughput classification
  • 🔥 Advanced Visualizations - Word clouds, confusion matrices
  • 🔥 Performance Benchmarking - Model comparison tools
  • 🔥 Cross-Validation - Robust model evaluation

📊 Dataset Statistics

| Metric | Value |
| --- | --- |
| Training Samples | 240 |
| Validation Set | 30 |
| Test Set | 30 |
| Categories | 6 |
| Avg. Text Length | ~20 words |

Categories

| Label | Category | Description | Color |
| --- | --- | --- | --- |
| 0 | Technology | Tech news, gadgets, software, AI | 🔵 Blue |
| 1 | Sports | Athletics, competitions, leagues | 🟢 Green |
| 2 | Politics | Government, policy, elections | 🟣 Purple |
| 3 | Entertainment | Movies, music, TV shows, celebrities | 🩷 Pink |
| 4 | Business | Finance, markets, economy | 🟡 Amber |
| 5 | Science | Research, discoveries, space, health | 🔵 Cyan |
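
The label table above can be kept in code as a small lookup. This is a minimal sketch; the names `ID2LABEL`, `LABEL2ID`, and `decode` are illustrative and are not taken from the repository scripts.

```python
# Label ids and category names as listed in the Categories table.
ID2LABEL = {
    0: "Technology",
    1: "Sports",
    2: "Politics",
    3: "Entertainment",
    4: "Business",
    5: "Science",
}

# Reverse lookup: category name -> integer label.
LABEL2ID = {name: idx for idx, name in ID2LABEL.items()}


def decode(labels):
    """Turn a sequence of integer labels into category names."""
    return [ID2LABEL[int(label)] for label in labels]


print(decode([0, 5, 1]))  # ['Technology', 'Science', 'Sports']
```

Dictionaries of this shape can also be passed as `id2label` and `label2id` when loading a Hugging Face sequence classification model, so that predictions are reported by name.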

🛠️ Technologies

CSV, TXT, JSON, Transformers, BERT, Flask, Scikit-learn


📁 Project Structure

text-classification/
├── index.html                  # Main showcase page
├── explorer.html               # 🆕 Interactive data explorer
├── README.md                   # Documentation
├── requirements.txt            # Python dependencies
├── text-classification.svg     # Project logo
│
├── assets/
│   ├── css/
│   │   └── style.css           # Styles
│   ├── js/
│   │   └── main.js             # Scripts
│   └── favicon.svg             # Favicon
│
├── data/
│   ├── csv/
│   │   ├── train.csv           # Training data (240 samples)
│   │   ├── validation.csv      # Validation data
│   │   ├── test.csv            # Test data
│   │   └── full_dataset.csv    # Complete dataset
│   ├── json/
│   │   ├── dataset.json        # JSON format
│   │   └── full_dataset.json   # Complete JSON
│   └── txt/
│       ├── categories.txt      # Category labels
│       └── sample_documents.txt
│
├── scripts/
│   ├── preprocessing.py        # Text preprocessing
│   ├── train_classifier.py     # Traditional ML training
│   ├── train_transformers.py   # BERT/Transformer training
│   ├── data_augmentation.py    # 🆕 6 augmentation techniques
│   ├── visualizations.py       # 🆕 Word clouds, charts
│   ├── api_server.py           # 🆕 REST API server
│   ├── model_explainability.py # 🆕 LIME explanations
│   └── batch_processor.py      # 🆕 Batch classification
│
└── notebooks/
    └── text_classification_tutorial.ipynb  # Complete tutorial

🚀 Quick Start

1. Clone or Download

# Download the dataset
wget https://rskworld.in/datasets/text-classification.zip
unzip text-classification.zip
cd text-classification

2. Install Dependencies

pip install -r requirements.txt

3. Load Dataset

import pandas as pd

# Load training data
train_df = pd.read_csv('data/csv/train.csv', comment='#')
print(f"Training samples: {len(train_df)}")
print(train_df.head())
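
The `comment='#'` argument tells pandas to skip anything from a `#` to the end of a line, which suggests the CSV files begin with `#` header comments. The sketch below reproduces that behavior on an in-memory file. The sample rows and the comment line are made up for illustration; only the `text` and `label` column names come from the examples in this README.

```python
import io

import pandas as pd

# Hypothetical file content: a comment line followed by a normal CSV.
raw = io.StringIO(
    "# illustrative header comment\n"
    "text,label\n"
    "New phone ships with an AI assistant,0\n"
    "Home team wins the league final,1\n"
    "Home side lifts the cup again,1\n"
)

# Lines that start with '#' are skipped entirely.
df = pd.read_csv(raw, comment="#")

print(df.shape)                                  # (3, 2)
print(df["label"].value_counts().sort_index())   # samples per label
```

Checking the per-label counts right after loading is a quick way to confirm that the split is balanced across categories.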

4. Train a Model

# Traditional ML model (run from the repository root)
python scripts/train_classifier.py

# View visualizations (run from scripts/ so that ../data points at the data folder)
cd scripts
python visualizations.py ../data

🆕 Advanced Features Usage

📊 Interactive Data Explorer

Open explorer.html in your browser to:

  • Filter documents by category
  • Search through the dataset
  • View category distribution charts
  • Analyze word count distributions

🌐 REST API Server

# Start the API server
cd scripts
python api_server.py --demo --port 5000

API Endpoints:

GET  /                  - API info
GET  /health            - Health check
GET  /categories        - List all categories
POST /predict           - Classify single text
POST /predict/batch     - Classify multiple texts
POST /analyze           - Detailed text analysis

Example API Call:

curl -X POST http://localhost:5000/predict \
  -H "Content-Type: application/json" \
  -d '{"text": "Apple announces new iPhone with AI features"}'
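
The same call can be made from Python. The sketch below uses only the standard library and mirrors the curl example; the helper names `build_predict_request` and `predict` are hypothetical, and the structure of the JSON response is not documented here, so it is returned as-is.

```python
import json
import urllib.request


def build_predict_request(text, base_url="http://localhost:5000"):
    """Build the same POST /predict request as the curl example."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(text, base_url="http://localhost:5000", timeout=10):
    """Send the request and return the decoded JSON response."""
    request = build_predict_request(text, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (requires the API server to be running locally):
# result = predict("Apple announces new iPhone with AI features")
```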

🔄 Data Augmentation

from scripts.data_augmentation import TextAugmenter

augmenter = TextAugmenter(num_aug=5, random_state=42)
text = "Apple announces revolutionary new iPhone"

augmented = augmenter.augment(text)
for i, aug_text in enumerate(augmented, 1):
    print(f"{i}. {aug_text}")

Supported Techniques:

  • Synonym Replacement (SR)
  • Random Insertion (RI)
  • Random Swap (RS)
  • Random Deletion (RD)
  • Character-level augmentation
  • Keyboard error simulation
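
To make two of these techniques concrete, here is a minimal standalone sketch of Random Swap and Random Deletion. It is not the implementation behind `TextAugmenter`; the function names and parameters are illustrative.

```python
import random


def random_swap(words, n_swaps, rng):
    """Swap two randomly chosen positions, n_swaps times."""
    words = list(words)
    if len(words) < 2:
        return words
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words


def random_deletion(words, p, rng):
    """Drop each word with probability p, keeping at least one word."""
    words = list(words)
    if not words:
        return words
    kept = [w for w in words if rng.random() >= p]
    return kept if kept else [rng.choice(words)]


rng = random.Random(42)  # fixed seed for repeatable output
tokens = "Apple announces revolutionary new iPhone".split()
print(" ".join(random_swap(tokens, n_swaps=1, rng=rng)))
print(" ".join(random_deletion(tokens, p=0.2, rng=rng)))
```

Both functions leave the label unchanged, which is why such perturbations can enlarge a small training set without relabeling.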

🔍 Model Explainability

from scripts.model_explainability import TextExplainer

# classifier_fn is not defined in this README: supply your own prediction
# callable, for example a wrapper around a trained model.
explainer = TextExplainer(classifier_fn)
explanation = explainer.explain("New AI-powered smartphone released")

print(f"Predicted: {explanation['predicted_category']}")
print("Important words:")
for item in explanation['word_importance'][:5]:
    print(f"  {item['word']}: {item['importance']:.4f}")
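
The signature expected of `classifier_fn` is not shown above. LIME-style explainers conventionally take a callable that maps a list of texts to per-class probabilities, and the sketch below assumes that convention. It uses a simpler leave-one-out (occlusion) score rather than LIME itself, and `toy_classifier_fn` with its keyword list is a hypothetical stand-in for a trained model.

```python
def toy_classifier_fn(texts):
    """Hypothetical stand-in: returns [P(other), P(technology)] per text."""
    tech_words = {"ai", "smartphone", "software"}
    probs = []
    for text in texts:
        words = text.lower().split()
        hits = sum(w in tech_words for w in words)
        p_tech = (hits + 1) / (len(words) + 2)  # smoothed keyword share
        probs.append([1.0 - p_tech, p_tech])
    return probs


def leave_one_out_importance(text, classifier_fn, class_index):
    """Score each word by how much removing it lowers the class probability."""
    words = text.split()
    base = classifier_fn([text])[0][class_index]
    scores = []
    for i, word in enumerate(words):
        reduced = " ".join(words[:i] + words[i + 1:])
        drop = base - classifier_fn([reduced])[0][class_index]
        scores.append({"word": word, "importance": drop})
    return sorted(scores, key=lambda s: s["importance"], reverse=True)


ranking = leave_one_out_importance("New AI smartphone released", toy_classifier_fn, 1)
for item in ranking:
    print(f"{item['word']}: {item['importance']:+.4f}")
```

Words with a positive score support the chosen class, while a negative score means the prediction becomes more confident once the word is removed.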

📦 Batch Processing

# Process a file of texts
python scripts/batch_processor.py process \
  --input input.csv \
  --output predictions.csv \
  --model model.joblib \
  --batch-size 100

# Evaluate predictions
python scripts/batch_processor.py evaluate \
  --predictions predictions.csv \
  --ground-truth ground_truth.csv \
  --output report.json
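
The `--batch-size 100` option presumably controls how many texts are classified per model call; that is an assumption about `batch_processor.py`, not something this README states. The general pattern can be sketched as follows, with `iter_batches`, `classify_in_batches`, and the dummy predictor all being illustrative names.

```python
def iter_batches(items, batch_size=100):
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def classify_in_batches(texts, predict_fn, batch_size=100):
    """Apply predict_fn to each batch and collect predictions in order."""
    predictions = []
    for batch in iter_batches(texts, batch_size):
        predictions.extend(predict_fn(batch))
    return predictions


# Hypothetical predictor that labels every text 0, just to exercise the loop.
texts = [f"document {i}" for i in range(250)]
labels = classify_in_batches(texts, lambda batch: [0] * len(batch), batch_size=100)
print(len(labels))  # 250
```

Batching bounds memory use per call while preserving input order, so the output rows line up with the input rows.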

📈 Visualizations

# Generate all visualizations (run from scripts/ so that ../data points at the data folder)
cd scripts
python visualizations.py ../data

# Outputs:
# - visualizations/category_distribution.png
# - visualizations/text_length_distribution.png
# - visualizations/wordcloud_all.png
# - visualizations/wordclouds_by_category/

📝 Usage Examples

Basic Classification

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
import pandas as pd

# Load data
train_df = pd.read_csv('data/csv/train.csv', comment='#')

# Vectorize
tfidf = TfidfVectorizer(max_features=10000)
X = tfidf.fit_transform(train_df['text'])
y = train_df['label']

# Train (max_iter raised from the default of 100 to help the solver converge)
model = LogisticRegression(max_iter=1000)
model.fit(X, y)

# Predict
text = "Apple unveils new iPhone with AI features"
prediction = model.predict(tfidf.transform([text]))
print(f"Predicted: {prediction[0]}")  # label 0 corresponds to Technology
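
A variant of the example above wraps the vectorizer and the classifier in a single scikit-learn pipeline, so raw text goes in and the held-out split is transformed with the vocabulary learned from training data only. The miniature dataset below is made up for illustration; in practice the texts and labels would come from `train.csv` and `validation.csv`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline

# Made-up miniature data: label 0 = Technology, label 1 = Sports.
train_texts = [
    "new smartphone software update released",
    "chip maker unveils faster processor",
    "software company launches cloud service",
    "team wins championship final match",
    "striker scores twice in league match",
    "coach praises team after final",
]
train_labels = [0, 0, 0, 1, 1, 1]
val_texts = ["processor software update", "league team final"]
val_labels = [0, 1]

# The pipeline applies TF-IDF and the classifier as one estimator.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(train_texts, train_labels)

val_pred = clf.predict(val_texts)
val_accuracy = accuracy_score(val_labels, val_pred)
print(f"Validation accuracy: {val_accuracy:.2f}")
```

Keeping both steps in one object also means a single file can be saved and reloaded for inference, instead of a separate vectorizer and model.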

Using Transformers (BERT)

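import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the base model with a 6-class head. The head is randomly initialized,
# so fine-tune first (see scripts/train_transformers.py) for meaningful output.
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=6)
model.eval()

# Tokenize
text = "Scientists discover new planet in nearby galaxy"
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)

# Predict (no gradients are needed at inference time)
with torch.no_grad():
    outputs = model(**inputs)
prediction = outputs.logits.argmax(dim=-1).item()
print(f"Predicted label: {prediction}")  # a fine-tuned model should give 5 (Science)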

📊 Model Performance

| Model | Accuracy | F1 Score | Inference time |
| --- | --- | --- | --- |
| Naive Bayes | 85.2% | 0.847 | ~1 ms |
| Logistic Regression | 89.7% | 0.892 | ~2 ms |
| Linear SVM | 88.9% | 0.885 | ~2 ms |
| BERT (fine-tuned) | 94.3% | 0.941 | ~50 ms |
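
For reference, the two metrics in the table can be computed as in the sketch below. The README does not say which F1 averaging was used, so macro averaging is an assumption here, and the label lists are made up purely to exercise the functions.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    scores = []
    for cls in sorted(set(y_true)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up labels, only to exercise the functions.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(f"accuracy={accuracy(y_true, y_pred):.3f}, macro_f1={macro_f1(y_true, y_pred):.3f}")
```

Macro averaging weights every category equally, which suits a dataset where the six classes are meant to be balanced.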

📜 License

This dataset is provided for educational purposes only.

Copyright (c) 2026 RSK World - All Rights Reserved


👨‍💻 Author

Molla Samser

Designer & Tester

Rima Khatun


🤝 Support

If you have any questions or need support, email help@rskworld.in or visit https://rskworld.in.


Made with ❤️ by RSK World
rskworld.in
