A UNIX-style Python CLI that ingests massive text logs, streams them line-by-line, and produces human-friendly summaries. The tool highlights error hot spots, top tokens, timestamp ranges, and lets you benchmark the naive “read everything” approach against the optimized streaming pipeline. Caching makes repeat runs nearly instant, which is perfect for on-call triage and observability work.
## Table of Contents

- Features
- Architecture
- Getting Started
- Usage
- Benchmarking & Profiling
- Testing & Quality Gates
- Project Structure
- Contributing
- Changelog
## Features

- Streaming summarization – custom iterators in `texttool.streaming` keep memory flat, even for multi-GB files.
- Rich stats – totals, severity counts, top tokens, timestamp bounds, and sample lines to orient quickly.
- Disk-backed memoization – summaries are cached by path/mtime/size, turning repeat runs into ~120 ms lookups.
- Baseline vs. optimized benchmarking – run both strategies and capture per-line microsecond metrics.
- Built-in profiling – `--profile` pipes the summarization step through `cProfile` for micro-optimizations.
- Professional tooling – Ruff + Black lint/format, pytest coverage gates, GitHub Actions CI, and reproducible synthetic data generators.
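The streaming approach above can be sketched as a pair of generators. This is a minimal illustration, not the actual `texttool.streaming` code; names like `iter_lines` and `top_tokens` are hypothetical:

```python
import os
import tempfile
from collections import Counter
from pathlib import Path
from typing import Iterator

def iter_lines(path: Path) -> Iterator[str]:
    """Yield one line at a time; peak memory stays flat regardless of file size."""
    with path.open("r", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            yield line.rstrip("\n")

def top_tokens(lines: Iterator[str], n: int = 5) -> list[tuple[str, int]]:
    """Count tokens incrementally while consuming the stream."""
    counts: Counter[str] = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts.most_common(n)

# Tiny demo on a temporary file
with tempfile.NamedTemporaryFile("w", suffix=".log", delete=False) as tmp:
    tmp.write("ERROR disk full\nINFO ok\nERROR disk full\n")
    demo_path = Path(tmp.name)
top = top_tokens(iter_lines(demo_path), n=2)
os.unlink(demo_path)
print(top)  # [('error', 2), ('disk', 2)]
```

Because the file is consumed lazily, memory usage depends only on the longest line, not on the file size.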
## Architecture

```
texttool/
    cli.py                    # argparse entry point, caching + benchmark orchestration
    summarizer.py             # shared summarization core + baseline/streaming strategies
    streaming.py              # buffered iterators/generators for large file handling
    cache.py                  # JSON-based memoization keyed by file metadata
benchmarks/
    generate_sample_log.py    # deterministic synthetic log generator
    results.md                # captured before/after metrics
benchmark.py                  # convenience wrapper to synthesize + compare runs
```
## Getting Started

- Prerequisites
  - Python 3.11+
  - `pip` (ships with Python)
- Install (editable mode for development):

  ```bash
  python3.11 -m venv .venv
  source .venv/bin/activate
  pip install -e .[dev]
  ```

- Regenerate sample data (optional):

  ```bash
  python3.11 benchmarks/generate_sample_log.py sample_logs/synthetic.log --lines 200000
  ```
## Usage

Summarize any log file (streaming strategy, cache enabled by default):

```bash
python3.11 -m texttool.cli summarize /path/to/log --top-words 20
```

Helpful flags:

- `--strategy baseline|streaming` – compare the naive and streaming summarizers.
- `--no-cache` – bypass the memoization layer (the cache lives at `~/.cache/text_processing_cli/summaries.json`).
- `--json` – emit machine-readable output for scripts/dashboards.
- `--profile` – capture cProfile stats (top 25 rows printed to stdout).
- `--compare --repeats N` – benchmark both strategies and print a timing table.
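The behavior behind `--profile` can be approximated with the standard `cProfile` and `pstats` modules. This sketch uses a hypothetical `summarize` stand-in; the CLI's internals may differ:

```python
import cProfile
import io
import pstats

def summarize(lines):
    # Hypothetical stand-in for the real summarization step
    return {"total": sum(1 for _ in lines)}

profiler = cProfile.Profile()
profiler.enable()
result = summarize(iter(["ERROR a", "INFO b", "ERROR c"]))
profiler.disable()

# Render the top 25 rows sorted by cumulative time, as the flag describes
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(25)
report = buf.getvalue()
print(report.splitlines()[0])
```

Sorting by cumulative time surfaces the call paths that dominate a run, which is usually the first question when micro-optimizing a summarizer.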
## Benchmarking & Profiling

Generate a deterministic log and compare strategies in one command:

```bash
python3.11 benchmark.py sample_logs/synthetic.log --generate --lines 300000 --repeats 3
```

A manual benchmark produced the following before/after highlights (see `benchmarks/results.md` for raw data):
- Baseline read-all strategy: ~2.00 s for a 200k-line log.
- Streaming iterator: ~1.98 s for the same log with flat memory usage.
- Cache hit: ~0.12 s after the summary is memoized, ideal for repeat investigations.
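Timings like those above can be reproduced with a small `time.perf_counter` harness. This is illustrative only; the real `--compare` implementation may measure differently:

```python
import statistics
import time

def bench(fn, repeats: int = 3) -> tuple[float, float]:
    """Return (best, mean) wall-clock seconds over `repeats` runs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()  # the workload under test, e.g. one summarization pass
        timings.append(time.perf_counter() - start)
    return min(timings), statistics.mean(timings)

best, mean = bench(lambda: sum(range(100_000)), repeats=3)
print(f"best={best:.6f}s mean={mean:.6f}s")
```

Reporting the best-of-N alongside the mean helps separate steady-state cost from one-off noise such as cold file-system caches.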
## Testing & Quality Gates

Quality tooling follows PEP 8 via Black + Ruff.

```bash
ruff check .
black .
python3.11 -m pytest   # runs with coverage + an 85% gate
```

CI mirrors these steps in `.github/workflows/ci.yml`. Coverage snapshots are stored in `docs/coverage.md`.
## Project Structure

```
.
├── texttool/          # Core package
├── benchmarks/        # Synthetic data + captured metrics
├── sample_logs/       # Empty placeholder folder (logs are .gitignored)
├── tests/             # Pytest suite covering cache, streaming, summarizer
├── docs/coverage.md   # Latest coverage summary
├── CHANGELOG.md       # Release history
└── pyproject.toml     # Tooling/lint/test configuration
```
## Contributing

- Fork & clone the repository.
- Create a virtualenv and install dev dependencies (`pip install -e .[dev]`).
- Run `ruff`, `black`, and `pytest` locally before opening a PR.
- Follow conventional commit messages or squash into logical commits (e.g., `feat: add streaming iterator`, `docs: expand README`).
- Update `CHANGELOG.md` and `docs/coverage.md` when behavior changes.
## Changelog

See `CHANGELOG.md` for release notes. Versioning follows SemVer; the current release is 0.1.0.