A UNIX-style Python CLI that ingests massive text logs, streams them line-by-line, and produces human-friendly summaries. The tool highlights error hot spots, top tokens, timestamp ranges, and lets you benchmark the naive “read everything” approach against the optimized streaming pipeline. Caching makes repeat runs nearly instant, which is perfect for on-call triage and observability work.
## Table of Contents

- Features
- Architecture
- Getting Started
- Usage
- Benchmarking & Profiling
- Testing & Quality Gates
- Project Structure
- Contributing
- Changelog
## Features

- Streaming summarization – custom iterators in `texttool.streaming` keep memory flat, even for multi-GB files.
- Rich stats – totals, severity counts, top tokens, timestamp bounds, and sample lines to orient quickly.
- Disk-backed memoization – summaries are cached by path/mtime/size, turning repeat runs into ~120 ms lookups.
- Baseline vs. optimized benchmarking – run both strategies and capture per-line microsecond metrics.
- Built-in profiling – `--profile` pipes the summarization step through `cProfile` for micro-optimizations.
- Professional tooling – Ruff + Black lint/format, pytest coverage gates, GitHub Actions CI, and reproducible synthetic data generators.
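The streaming approach above can be sketched as a pair of generators. This is a minimal illustration, not the actual `texttool.streaming` code; names like `iter_lines` and `top_tokens` are hypothetical:

```python
import os
import tempfile
from collections import Counter
from pathlib import Path
from typing import Iterator

def iter_lines(path: Path) -> Iterator[str]:
    """Yield one line at a time; peak memory stays flat regardless of file size."""
    with path.open("r", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            yield line.rstrip("\n")

def top_tokens(lines: Iterator[str], n: int = 5) -> list[tuple[str, int]]:
    """Count tokens incrementally while consuming the stream."""
    counts: Counter[str] = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts.most_common(n)

# Tiny demo on a temporary file
with tempfile.NamedTemporaryFile("w", suffix=".log", delete=False) as tmp:
    tmp.write("ERROR disk full\nINFO ok\nERROR disk full\n")
    demo_path = Path(tmp.name)
top = top_tokens(iter_lines(demo_path), n=2)
os.unlink(demo_path)
print(top)  # [('error', 2), ('disk', 2)]
```

Because the file is consumed lazily, memory usage depends only on the longest line, not on the file size.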
## Architecture

```
texttool/
    cli.py                    # argparse entry point, caching + benchmark orchestration
    summarizer.py             # shared summarization core + baseline/streaming strategies
    streaming.py              # buffered iterators/generators for large file handling
    cache.py                  # JSON-based memoization keyed by file metadata
benchmarks/
    generate_sample_log.py    # deterministic synthetic log generator
    results.md                # captured before/after metrics
benchmark.py                  # convenience wrapper to synthesize + compare runs
```
## Getting Started

- Prerequisites
  - Python 3.11+
  - `pip` (ships with Python)
- Install (editable mode for development):

  ```bash
  python3.11 -m venv .venv
  source .venv/bin/activate
  pip install -e .[dev]
  ```

- Regenerate sample data (optional):

  ```bash
  python3.11 benchmarks/generate_sample_log.py sample_logs/synthetic.log --lines 200000
  ```
## Usage

Summarize any log file (streaming strategy, cache enabled by default):

```bash
python3.11 -m texttool.cli summarize /path/to/log --top-words 20
```

Helpful flags:

- `--strategy baseline|streaming` – compare the naive and streaming summarizers.
- `--no-cache` – bypass the memoization layer (the cache lives at `~/.cache/text_processing_cli/summaries.json`).
- `--json` – emit machine-readable output for scripts/dashboards.
- `--profile` – capture cProfile stats (top 25 rows printed to stdout).
- `--compare --repeats N` – benchmark both strategies and print a timing table.
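The behavior behind `--profile` can be approximated with the standard `cProfile` and `pstats` modules. This sketch uses a hypothetical `summarize` stand-in; the CLI's internals may differ:

```python
import cProfile
import io
import pstats

def summarize(lines):
    # Hypothetical stand-in for the real summarization step
    return {"total": sum(1 for _ in lines)}

profiler = cProfile.Profile()
profiler.enable()
result = summarize(iter(["ERROR a", "INFO b", "ERROR c"]))
profiler.disable()

# Render the top 25 rows sorted by cumulative time, as the flag describes
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(25)
report = buf.getvalue()
print(report.splitlines()[0])
```

Sorting by cumulative time surfaces the call paths that dominate a run, which is usually the first question when micro-optimizing a summarizer.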
## Benchmarking & Profiling

Generate a deterministic log and compare strategies in one command:

```bash
python3.11 benchmark.py sample_logs/synthetic.log --generate --lines 300000 --repeats 3
```

A manual benchmark produced the following before/after highlights (see `benchmarks/results.md` for raw data):
- Baseline read-all strategy: ~2.00 s for a 200k-line log.
- Streaming iterator: ~1.98 s for the same log with flat memory usage.
- Cache hit: ~0.12 s after the summary is memoized, ideal for repeat investigations.
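Timings like those above can be reproduced with a small `time.perf_counter` harness. This is illustrative only; the real `--compare` implementation may measure differently:

```python
import statistics
import time

def bench(fn, repeats: int = 3) -> tuple[float, float]:
    """Return (best, mean) wall-clock seconds over `repeats` runs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()  # the workload under test, e.g. one summarization pass
        timings.append(time.perf_counter() - start)
    return min(timings), statistics.mean(timings)

best, mean = bench(lambda: sum(range(100_000)), repeats=3)
print(f"best={best:.6f}s mean={mean:.6f}s")
```

Reporting the best-of-N alongside the mean helps separate steady-state cost from one-off noise such as cold file-system caches.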
## Testing & Quality Gates

Quality tooling follows PEP 8 via Black + Ruff.

```bash
ruff check .
black .
python3.11 -m pytest   # runs with coverage + an 85% gate
```

CI mirrors these steps in `.github/workflows/ci.yml`. Coverage snapshots are stored in `docs/coverage.md`.
## Project Structure

```
.
├── texttool/          # Core package
├── benchmarks/        # Synthetic data + captured metrics
├── sample_logs/       # Empty placeholder folder (logs are .gitignored)
├── tests/             # Pytest suite covering cache, streaming, summarizer
├── docs/coverage.md   # Latest coverage summary
├── CHANGELOG.md       # Release history
└── pyproject.toml     # Tooling/lint/test configuration
```
## Contributing

- Fork & clone the repository.
- Create a virtualenv and install dev dependencies (`pip install -e .[dev]`).
- Run `ruff`, `black`, and `pytest` locally before opening a PR.
- Follow conventional commit messages or squash into logical commits (e.g., `feat: add streaming iterator`, `docs: expand README`).
- Update `CHANGELOG.md` and `docs/coverage.md` when behavior changes.
## Changelog

See `CHANGELOG.md` for release notes. Versioning follows SemVer; the current release is 0.1.0.