Skip to content

veyorokon/code-harvest

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

19 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

harvest-code

Extract codebases into portable JSON for RAG, tooling, and analysis.

PyPI Python 3.9+ License: MIT

Quick Start

pip install harvest-code
harvest serve  # Watch + serve current directory at http://localhost:8787

harvest Web UI

Features

  • 🔍 Zero dependencies - Pure Python, no external packages required
  • 📊 Smart chunking - Extracts functions, classes, exports with stable IDs
  • 🌐 Interactive UI - Web interface with search, filtering, and syntax highlighting
  • 🎯 Intelligent filtering - Automatically skips build artifacts, binaries, and test files
  • Live updates - Auto-refreshes when files change (watch mode)
  • 🚀 Scales - Handles 50k+ files with progressive loading

Commands

harvest reap - Extract codebase to JSON

# Basic usage - output named after directory
harvest reap .                    # → ./current-dir-name.harvest.json
harvest reap /path/to/project     # → ./project.harvest.json

# Custom output
harvest reap . -o analysis.json

# Control what's included
harvest reap . --include metadata       # File inventory only
harvest reap . --include data          # File contents only  
harvest reap . --exclude chunks        # Skip code parsing
harvest reap . --format jsonl          # Line-delimited JSON

# File type filtering (NEW)
harvest reap . --only-ext .py          # Only Python files
harvest reap . --only-ext py,ts,js     # Multiple extensions (dot optional)
harvest reap . --skip-ext .log,.tmp    # Skip specific types

# Common combinations
harvest reap . --include data --only-ext .py    # Python source only (for LLMs)
harvest reap . --include chunks --only-ext .ts  # TypeScript symbols only

# Override default filters
harvest reap . --no-default-excludes   # Include hidden files, tests, etc.

harvest serve - Interactive web UI

# Watch + serve (default)
harvest serve                    # Current directory on port 8787
harvest serve /path/to/project   # Specific directory
harvest serve --port 8080        # Custom port

# Control options
harvest serve --no-watch         # Disable auto-refresh
harvest serve --only-ext py,ts   # Watch specific file types

Web UI Features:

  • Real-time search and filtering
  • Syntax highlighting with toggle
  • Skeleton view - Show only function/class signatures
  • Auto-refresh on file changes
  • Infinite scroll for large codebases
  • Deep linking to specific files/lines

harvest query - Search harvest data

# Find specific code elements
harvest query data.json --entity chunks --language python --public true
harvest query data.json --entity files --export-named "MyComponent"
harvest query data.json --path-glob "src/**" --fields path,symbol,kind

harvest watch - Monitor directory changes

# Continuous harvesting
harvest watch .                        # → ./<dirname>.harvest.json
harvest watch /path/to/src -o out.json # Custom output
harvest watch . --only-ext py,js,ts    # Specific extensions

harvest sow - Generate artifacts

# Create React barrel exports
harvest sow data.json --react src/index.ts

harvest winnow - Extract code skeletons

# Extract function/class signatures without bodies
harvest winnow data.harvest.json --skeleton                    # → data.skeleton.harvest.jsonl
harvest winnow data.harvest.json --skeleton --language python  # Python only
harvest winnow data.harvest.json --skeleton --out custom.jsonl # Custom output

# Output formats
harvest winnow data.harvest.json --skeleton --format jsonl     # → data.skeleton.harvest.jsonl (default)
harvest winnow data.harvest.json --skeleton --format json      # → data.skeleton.harvest.json
harvest winnow data.harvest.json --skeleton --format md        # → data.skeleton.harvest.md
harvest winnow data.harvest.json --skeleton --out -            # Output to stdout

Output Structure

harvest generates JSON with three sections:

{
  "metadata": {
    "schema": "harvest/v1.2",
    "source": {"type": "local", "root": "/path/to/code"},
    "counts": {"total_files": 150, "total_bytes": 524288}
  },
  "data": [
    {
      "path": "src/utils.py",
      "language": "python",
      "content": "def helper():\n    pass",
      "exports": null,
      "py_symbols": {"functions": ["helper"]}
    }
  ],
  "chunks": [
    {
      "id": "abc123...",
      "file_path": "src/utils.py",
      "kind": "function",
      "symbol": "helper",
      "start_line": 1,
      "end_line": 2,
      "public": true
    }
  ]
}

Output Sections

  • metadata - File inventory, counts, timestamps
  • data - File contents and language metadata
  • chunks - Parsed symbols (functions, classes, exports)

Control output with --include and --exclude:

harvest reap . --include metadata       # Inventory only (fast)
harvest reap . --exclude chunks         # Skip parsing (smaller)
harvest reap . --include chunks         # Symbols only (for analysis)

File Handling

Smart Filtering

harvest uses a three-tier filtering system:

Completely Skipped:

  • Hidden files and directories (.git/, .env)
  • Test directories (tests/, __tests__/)
  • Build artifacts (dist/, build/, node_modules/)
  • Binaries and media (.exe, .mp3, .db)
  • Logs and temp files (.log, .tmp)

Path-Only (no content):

  • Images (.jpg, .png, .svg)
  • Fonts (.ttf, .woff)
  • Documents (.pdf, .doc)

Fully Processed:

  • Source code (.py, .js, .ts, etc.)
  • Config files (.json, .yaml, .toml)
  • Documentation (.md, .txt)

Override with --no-default-excludes to include everything.

Language Support

Full parsing (chunks + symbols):

  • Python - Functions, classes via AST
  • JavaScript/TypeScript - Functions, classes, React components, ES6/CommonJS exports
  • JSON/YAML/TOML - Single file chunks

Syntax highlighting in web UI: Python, JavaScript, TypeScript, JSON, YAML, TOML, Markdown, Shell, Go, Rust

API Endpoints

The web server exposes REST APIs:

# Search chunks
curl http://localhost:8787/api/search?entity=chunks&language=python

# Get metadata
curl http://localhost:8787/api/meta

# Download full harvest
curl http://localhost:8787/api/harvest

Use Cases

RAG/LLM Context

# Get clean Python source for LLM context
harvest reap . --include data --only-ext .py

# Extract API signatures only
harvest winnow project.harvest.json --skeleton --language python
import json

with open('project.harvest.json') as f:
    harvest = json.load(f)

# Extract public functions for context
api_functions = [
    chunk for chunk in harvest['chunks'] 
    if chunk['public'] and chunk['kind'] == 'function'
]

Documentation Generation

# Generate architectural overview in Markdown
harvest winnow app.harvest.json --skeleton --format md  # → app.skeleton.harvest.md

# Extract all exports
harvest query app.json --entity files --has-default-export --fields path,exports

Code Analysis

# Find all React components
harvest query app.json --entity chunks --kind export_default --language typescript

# Find large classes
harvest query app.json --entity chunks --kind class --min-lines 100

File Naming

  • Output files use pattern: <directory-name>.harvest.json
  • Created in current working directory (where command is run)
  • Use -o flag for custom names/paths
  • All *.harvest.json files are automatically ignored to prevent recursion

Development

git clone https://github.com/veyorokon/code-harvest.git
cd code-harvest
pip install -e .
harvest reap . -o self.harvest.json

Links

License

MIT - see LICENSE file

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages