Full Marks — Every source, Perfect Markdown
Convert ANY source format into clean, structured Markdown — with images, videos, diagrams, and full provenance tracking. One command. No cloud required.
FullMark takes files, folders, URLs, videos, images, and archives and converts them into well-structured Markdown. It auto-detects the source type, routes it to the right conversion engine, and writes the result with a stable identity so you never convert the same thing twice.
| Category | Formats | How it works |
|---|---|---|
| Documents | PDF, DOCX, RTF, TXT, EPUB | Text and tables extracted; scanned PDFs fall back to OCR |
| Spreadsheets | XLSX, XLS, CSV, ODS | Each sheet becomes a GFM pipe table |
| Presentations | PPTX, ODP | Slide titles, text blocks, and speaker notes |
| Notebooks | IPYNB | Markdown cells as-is; code cells in fenced blocks; outputs as text |
| Email | MSG (Outlook), EML | Headers + body; HTML email → Markdown |
| Images | JPG, PNG, BMP, TIFF, WebP | OCR text extraction (Tesseract → EasyOCR fallback); table images → GFM tables; decorative images embedded as base64 or described by vision LLM |
| SVG | SVG | Shapes and paths parsed → Mermaid diagram code block; text elements as bullet list |
| Video | MP4, AVI, MOV, MKV, WEBM | Audio → Whisper transcription with timecodes; frames sampled every 10 s, head overlay cropped, Tesseract OCR-diff detects slide changes (ignores head movement), upscaled and OCR'd (Tesseract → EasyOCR → vision LLM fallback); transcript and slide content interleaved chronologically |
| Audio | MP3, WAV, M4A | Whisper transcription with [MM:SS] timestamps |
| Web | HTTP/HTTPS URLs, HTML, RSS, YouTube | Page content + images → Markdown; SVG logos detected by content-type; YouTube transcripts; RSS entries as numbered list |
| Archives | ZIP | Auto-unpacked; each file routed to the right agent individually |
| URL Lists | TXT, DOCX, XLSX, CSV | One URL per line/cell — each fetched and converted; non-URL lines skipped |
| Source Code | .py .js .ts .go .rs .java .c .cpp .cs .rb .php .sh .sql .json .yaml .toml .tf and 50+ more | Each file becomes a syntax-highlighted fenced code block |
| GitHub Repos | https://github.com/owner/repo | Full repo tree via GitHub API; no git clone needed |
Every output file carries:
- YAML front matter: source URL/path, conversion timestamp, agent name, and a stable source_id
- Provenance footnote: at the bottom of every file, so the origin travels with the document
Files over 120,000 characters are split into name_001.md, name_002.md, etc.
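A minimal sketch of how such segmentation could work. The function name `split_markdown` and the choice to break on line boundaries are assumptions for illustration; only the 120,000-character limit and the name_001.md naming come from the text above.

```python
def split_markdown(text: str, stem: str, limit: int = 120_000) -> dict[str, str]:
    """Map output filenames to content, splitting only when text exceeds `limit`."""
    if len(text) <= limit:
        return {f"{stem}.md": text}

    segments: list[str] = []
    current: list[str] = []
    size = 0
    for line in text.splitlines(keepends=True):
        # A single line longer than the limit is hard-wrapped so that
        # no segment can ever exceed the limit.
        pieces = [line[i:i + limit] for i in range(0, len(line), limit)]
        for piece in pieces:
            if current and size + len(piece) > limit:
                segments.append("".join(current))
                current, size = [], 0
            current.append(piece)
            size += len(piece)
    if current:
        segments.append("".join(current))

    # Segments are numbered from 001, matching name_001.md, name_002.md, ...
    return {f"{stem}_{i:03d}.md": seg for i, seg in enumerate(segments, start=1)}
```

Breaking on line boundaries keeps tables and fenced blocks from being cut mid-line, although a block that spans many lines can still straddle two segments.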
Many computers have Python, ffmpeg, or Tesseract installed but not on their system PATH.
This means commands like python, ffmpeg, or tesseract might fail even though the software
is physically installed.
Create a test.py file anywhere on your machine and run it first:
# test.py (paste this, then run: python test.py)
import shutil
import sys

# Map each tool to its resolved location (None when it is not on PATH)
tools = {
    "python": sys.executable,
    "ffmpeg": shutil.which("ffmpeg"),
    "tesseract": shutil.which("tesseract"),
}
print(f"Python version : {sys.version}")
for name, path in tools.items():
    status = "✓ found" if path else "✗ NOT FOUND (may need PATH fix)"
    print(f"{name:12} {status} {path or ''}")

⚠ A tool showing as "NOT FOUND" here might still be installed; it just isn't on your PATH. See the system binaries section for install links.
FullMark requires Python 3.11 or higher. Check your version:
python --version

If you have multiple Python versions installed, use py -3.11 on Windows to select explicitly.
git clone https://github.com/tmprabubiz/fullmark.git
cd fullmark

# Windows (PowerShell)
python -m venv .venv
.venv\Scripts\Activate.ps1
# macOS / Linux
python3 -m venv .venv
source .venv/bin/activate

pip install -r requirements.txt

Optional extras (install only what you need):
pip install pymupdf # enhanced PDF extraction
pip install xlrd # legacy .xls support
pip install cairosvg # SVG → raster for vision fallback

These are not Python packages; install them separately:
| Tool | Purpose | Download |
|---|---|---|
| ffmpeg | Video/audio extraction | https://ffmpeg.org/download.html |
| Tesseract OCR | Image text extraction | https://github.com/UB-Mannheim/tesseract/wiki (Windows) |
After installing, confirm they are on your PATH:
ffmpeg -version
tesseract --version

If either command fails, add the install directory to your system PATH, or set the full path in .env:
FFMPEG_PATH=C:\FFmpeg\bin\ffmpeg.exe
TESSERACT_CMD=C:\Program Files\Tesseract-OCR\tesseract.exe

# Windows
copy .env.template .env
# macOS / Linux
cp .env.template .env

Then open .env in any text editor. Everything is optional: FullMark always has a free fallback path.
python fullmark_preflight.py

python fullmark_cli.py report.pdf
python fullmark_cli.py presentation.pptx
python fullmark_cli.py lecture.mp4
python fullmark_cli.py diagram.svg # → Mermaid code block
python fullmark_cli.py screenshot.png # → OCR extracted text

python fullmark_cli.py https://example.com/article

Web pages are converted with images intact:
- Raster images (JPG/PNG/WebP) → OCR run, text extracted into the document
- SVG images (logos, icons) → content-type detected, saved as .svg, converted to Mermaid or text bullets
- Decorative images with no extractable text → base64-embedded inline
When you run a URL for the first time, FullMark asks whether to follow links:
Source: https://example.com/docs
Single-page conversion by default (no link following).
Follow links on this page and convert multiple pages? [y/N]:
Type n (or just press Enter) for a single page. Type y to see an estimated page count and time before confirming a crawl.
python fullmark_cli.py https://www.youtube.com/watch?v=VIDEO_ID

python fullmark_cli.py lecture.mp4
python fullmark_cli.py interview.mp3

Output structure:
## Scene 1 — [00:00]
[Transcript for this time segment...]
[Frame OCR text if present]
## Scene 2 — [01:42]
...

Whisper runs locally; no audio is sent to any cloud service.
Set model size with --whisper-model tiny|base|small|medium|large (default: base).
GPU acceleration (NVIDIA): The default pip install -r requirements.txt installs CPU-only PyTorch. If you have an NVIDIA GPU, install the CUDA build for significantly faster Whisper transcription and EasyOCR:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

Replace cu121 with your CUDA version (cu118, cu124, etc.; check with nvidia-smi). Without this, Whisper and EasyOCR log informational messages about running in CPU/FP32 mode. These are not errors.
Video frame OCR pipeline (fully local, no API required):
| Step | Tool | Notes |
|---|---|---|
| 1. Sample frames | ffmpeg | Every VIDEO_FRAME_INTERVAL seconds (default: 10 s) |
| 2. Crop head overlay | Pillow | Top 20 % of each frame is removed before comparison — ignores webcam/head movement |
| 3. Slide-change detection | Tesseract + Jaccard similarity | OCR the content region; keep frame only when slide text changed ≥ 30 % from previous kept frame |
| 4. Upscale for final OCR | Pillow LANCZOS | Kept frames narrower than 1280 px are upscaled before final OCR pass |
| 5. OCR primary | Tesseract | Best for text-heavy slides and screen recordings |
| 6. OCR fallback | EasyOCR | Catches stylised fonts and graphics that Tesseract misses |
| 7. Vision LLM fallback | VISION_CHAIN | Only if both OCR tools return empty — e.g. hand-drawn diagrams or decorative slides. Disabled by default. Enable by setting VISION_CHAIN in .env. |
The Tesseract-first change detection (steps 2–3) means:
- Human head movement is invisible to the frame selector
- A new slide triggers a capture even if it appears after only 10 s
- Duplicate frames are discarded immediately to save disk space
- If Tesseract is not installed, FullMark falls back to PySceneDetect, then fixed-interval sampling
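The slide-change rule in step 3 can be sketched as follows. This is an illustration under stated assumptions, not FullMark's actual code: the function names are hypothetical, text is compared as a set of lowercased whitespace-separated words, and the first frame is always kept. Only the Jaccard measure and the 30 % threshold come from the table above.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two word sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def select_slide_frames(
    frames: list[tuple[float, str]], min_change: float = 0.30
) -> list[tuple[float, str]]:
    """Keep a frame only when its OCR text differs enough from the last kept frame.

    `frames` is a list of (timecode_seconds, ocr_text) in chronological order.
    """
    kept: list[tuple[float, str]] = []
    prev_words: set[str] | None = None
    for timecode, ocr_text in frames:
        words = set(ocr_text.lower().split())
        # Change is measured as 1 - similarity against the previous KEPT frame,
        # so slow drift across several skipped frames still triggers a capture.
        if prev_words is None or 1.0 - jaccard(words, prev_words) >= min_change:
            kept.append((timecode, ocr_text))
            prev_words = words
    return kept
```

Because only the OCR text is compared, pixel-level differences such as a presenter moving in the cropped-out region never reach the comparison at all.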
The final output interleaves audio transcript and frame OCR chronologically by timecode. Each ## [MM:SS] section shows the slide OCR content followed by the speech that occurred while that slide was visible. Whisper segments are never split mid-sentence.
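One way to picture the interleaving is the sketch below. The names `mmss` and `interleave` are hypothetical, and the sketch assumes each Whisper segment is assigned whole to the slide that was visible when the segment started, which is one simple way to avoid splitting a segment.

```python
def mmss(seconds: float) -> str:
    """Format a timecode in seconds as [MM:SS]."""
    minutes, secs = divmod(int(seconds), 60)
    return f"[{minutes:02d}:{secs:02d}]"


def interleave(
    slides: list[tuple[float, str]], segments: list[tuple[float, str]]
) -> str:
    """Build one '## [MM:SS]' section per slide: OCR text first, then speech.

    `slides` holds (start_seconds, ocr_text) and `segments` holds
    (start_seconds, transcript_text); both are sorted by start time.
    Speech that starts before the first slide is not emitted in this sketch.
    """
    sections = []
    for i, (start, ocr_text) in enumerate(slides):
        # A slide is "visible" until the next kept slide begins.
        end = slides[i + 1][0] if i + 1 < len(slides) else float("inf")
        speech = " ".join(text for t, text in segments if start <= t < end)
        sections.append(f"## {mmss(start)}\n\n{ocr_text}\n\n{speech}")
    return "\n\n".join(sections)
```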
Tune extraction with .env settings:
VIDEO_FRAME_INTERVAL=10 # seconds between sampled frames (default: 10)
WHISPER_MODEL=base # tiny | base | small | medium | large

For cases where the standard OCR pipeline is insufficient (such as very low-resolution videos that cannot be usefully upscaled, or videos consisting entirely of diagrams with no machine-readable text), a standalone script is included:
python video_vision_extractor.py "input/lecture.mp4"

This sends each unique frame to an OpenAI vision model (GPT-4o-mini by
default) and writes a separate _Video.md file alongside the main transcript.
It requires OPENAI_API_KEY in your .env.
OPENAI_API_KEY=sk-...
OPENAI_VISION_MODEL=gpt-4o-mini # or gpt-4o for higher accuracy

Key options:
# Skip PySceneDetect (much faster for long videos — uses fixed-interval only)
python video_vision_extractor.py lecture.mp4 --skip-scene-detect
# Adjust sampling density (lower = more frames)
python video_vision_extractor.py lecture.mp4 --interval 10 --hash-threshold 10
# Preview which frames would be sent without spending any API tokens
python video_vision_extractor.py lecture.mp4 --dry-run --skip-scene-detect

Cost note: GPT-4o-mini vision at detail: high costs roughly $0.001–$0.003 per frame. A 1-hour lecture at 10 s intervals ≈ 300 unique frames ≈ $0.30–$0.90. The standard OCR pipeline (Tesseract + EasyOCR) is completely free and should be tried first; it handles most screen-recording and slide-deck videos well.
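The arithmetic behind that estimate, as a small sketch. The helper name `vision_cost_range` is hypothetical; the per-frame prices are the rough figures quoted above and will drift as provider pricing changes.

```python
def vision_cost_range(
    frames: int, low: float = 0.001, high: float = 0.003
) -> tuple[float, float]:
    """Return the (low, high) cost estimate in USD for a number of frames."""
    return round(frames * low, 2), round(frames * high, 2)


# A 1-hour video sampled every 10 s yields 360 frames before deduplication;
# the note above assumes roughly 300 of them are unique.
sampled_frames = 3600 // 10
print(sampled_frames, vision_cost_range(300))
```

Running with --dry-run first gives the real unique-frame count to plug in.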
python fullmark_cli.py ./my_documents/

Place files into input/ and run with no arguments:
python fullmark_cli.py

FullMark scans input/ and converts everything it finds.
Create a .txt file with one URL per line:
https://example.com/page-one
https://example.com/page-two
https://docs.python.org/3/library/os.html
Save as input/urls.txt and run:
python fullmark_cli.py input/urls.txt

python fullmark_cli.py https://example.com/docs --follow-links --crawl-depth 2 --max-pages 30 --crawl-delay 2

⚠ Large crawls can consume significant LLM tokens. Start with --max-pages 10 to sample first. Interrupted runs resume automatically: already-converted sources are skipped.
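To show how --crawl-depth and --max-pages interact, here is a minimal breadth-first sketch. It is not FullMark's crawler: `crawl_order` and the `get_links` callback are hypothetical, the same-domain restriction is an assumption, and the delay between requests is left out so the example runs offline.

```python
from collections import deque
from typing import Callable, Iterable
from urllib.parse import urlparse


def crawl_order(
    start: str,
    get_links: Callable[[str], Iterable[str]],
    depth: int = 1,
    max_pages: int = 50,
) -> list[str]:
    """Return the pages a breadth-first crawl would visit, in order."""
    domain = urlparse(start).netloc
    seen = {start}
    queue = deque([(start, 0)])
    order: list[str] = []
    while queue and len(order) < max_pages:
        url, hops = queue.popleft()
        order.append(url)  # a real crawler would fetch and convert here
        if hops >= depth:
            continue  # link-hop budget used up for this branch
        for link in get_links(url):
            # Assumption: stay on the starting domain and never revisit a page.
            if link not in seen and urlparse(link).netloc == domain:
                seen.add(link)
                queue.append((link, hops + 1))
    return order
```

Breadth-first order means a low --max-pages value samples the pages closest to the start URL first, which is why a small trial run is a reasonable preview of a larger crawl.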
# Entire repo
python fullmark_cli.py https://github.com/owner/repo
# Specific branch
python fullmark_cli.py https://github.com/owner/repo/tree/main
# Just a subdirectory (recommended for large repos)
python fullmark_cli.py https://github.com/owner/repo/tree/main/src

Output contains the repo tree overview, all text files grouped by directory in syntax-highlighted fenced code blocks, and binary files noted but not embedded.
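The three URL shapes above can be told apart with a few lines of parsing. This is a sketch with a hypothetical function name, and it assumes branch names contain no "/" (a branch such as feature/x cannot be distinguished from a subdirectory by the URL alone).

```python
from typing import Optional
from urllib.parse import urlparse


def parse_repo_url(url: str) -> Optional[tuple[str, str, Optional[str], str]]:
    """Return (owner, repo, branch, subdir) for repo-style URLs, else None."""
    parsed = urlparse(url)
    if parsed.netloc.lower() not in ("github.com", "www.github.com"):
        return None
    parts = [p for p in parsed.path.split("/") if p]
    if len(parts) < 2:
        return None
    owner, repo = parts[0], parts[1].removesuffix(".git")
    if len(parts) == 2:
        return owner, repo, None, ""  # entire repo, default branch
    if parts[2] == "tree" and len(parts) >= 4:
        # /tree/<branch>[/<subdir>...]
        return owner, repo, parts[3], "/".join(parts[4:])
    return None  # issues, blob, pulls, etc. are ordinary web pages
```

Returning None for everything that is not a repo root or a tree URL leaves those pages to normal web conversion.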
Rate limits:
| Mode | Limit |
|---|---|
| Unauthenticated | 60 API requests/hour |
| With GITHUB_TOKEN | 5,000 API requests/hour |
GITHUB_TOKEN=ghp_yourTokenHere

Get a free token at github.com/settings/tokens. No scopes are needed for public repos.
python fullmark_cli.py ./my-project/

Supported code/config extensions (50+):
| Category | Extensions |
|---|---|
| Python | .py .pyw .pyi |
| JavaScript / TypeScript | .js .mjs .cjs .jsx .ts .tsx |
| JVM | .java .kt .scala .groovy |
| C family | .c .h .cpp .cs |
| Systems | .go .rs .swift .zig .dart |
| Scripting | .rb .php .pl .lua .r |
| Shell | .sh .bash .ps1 .bat .cmd |
| Data / config | .json .yaml .toml .ini .cfg .xml |
| Database | .sql .graphql .proto |
| Infrastructure | .tf .tfvars .bicep .nix |
| Web front-end | .css .scss .vue .svelte |
| Docs-as-code | .rst .mdx |
| Misc | .dockerfile .gitignore .editorconfig .lock |
python fullmark_cli.py --help
Arguments:
SOURCE File, directory, or URL (optional — omit to use input/)
Options:
-o, --output DIR Output directory (default: ./output)
-w, --whisper-model Whisper model: tiny|base|small|medium|large
-v, --verbose Show debug logs
--follow-links Follow hyperlinks found on URL sources
--crawl-depth N Link-hop depth (default: 1)
--crawl-delay SECS Sleep between requests (default: 2.0)
--max-pages N Hard cap on crawled pages (default: 50)
--force Reconvert even if already in conversion_log.json
--version Show version
FullMark tracks every conversion in three files inside output/:
| File | Purpose |
|---|---|
| conversion_log.json | Machine-readable dedup index; the tool reads this |
| conversion_log.md | Human-readable summary table with source ID column |
| conversion_skipped.log | Plain-text log of every skipped source with date, file location, and how to reconvert |
Re-running FullMark on the same source is automatically skipped — no flag needed.
Every source gets a stable source_id (fm-<16hex>) computed from its normalised URL or its file content:
| Source type | Identity basis |
|---|---|
| URL | SHA256 of normalised URL — tracking params stripped (utm_*, fbclid, gclid, ref), .git suffix removed, trailing / removed |
| File < 10 MB | SHA256 of full file bytes — same file under a different name is detected |
| File ≥ 10 MB (video/audio) | SHA256 of first 4 MB — fast fingerprint without reading the whole file |
This means https://example.com/page?utm_source=newsletter and https://example.com/page
produce the same source_id and are treated as the same source.
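A sketch of how an identity like this could be computed. The function names are hypothetical, and several details are assumptions rather than documented behaviour: taking the first 16 hex characters of the digest, lowercasing the scheme and host, and dropping the URL fragment. The stripped parameters and the 10 MB / 4 MB thresholds come from the table above.

```python
import hashlib
import os
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TRACKING_KEYS = {"fbclid", "gclid", "ref"}


def normalise_url(url: str) -> str:
    """Strip tracking params, a trailing '/', and a '.git' suffix."""
    parts = urlsplit(url)
    query = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if not k.startswith("utm_") and k not in TRACKING_KEYS
    ]
    path = parts.path.rstrip("/").removesuffix(".git")
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(query), "")
    )


def url_source_id(url: str) -> str:
    digest = hashlib.sha256(normalise_url(url).encode("utf-8")).hexdigest()
    return f"fm-{digest[:16]}"  # assumption: first 16 hex characters


def file_source_id(
    path: str, small_limit: int = 10 * 1024 * 1024, head_bytes: int = 4 * 1024 * 1024
) -> str:
    """Hash the whole file below 10 MB, otherwise only its first 4 MB."""
    hasher = hashlib.sha256()
    with open(path, "rb") as fh:
        if os.path.getsize(path) < small_limit:
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                hasher.update(chunk)
        else:
            hasher.update(fh.read(head_bytes))
    return f"fm-{hasher.hexdigest()[:16]}"
```

Hashing file bytes rather than the path is what allows a renamed copy to be recognised; the partial hash for large files trades a small collision risk for speed.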
When a source is skipped, a notice is appended to output/conversion_skipped.log:
[2026-06-07 21:47:00 UTC] SKIPPED — already converted
Source : https://docs.cloud.google.com/managed-spark/docs
ID : fm-5003ea12a85b9f75
Converted : 2026-06-06 21:47:01 UTC
Output : output\doc_managed-spark_docs\doc_managed-spark_docs.md
To reconvert : fullmark_cli.py https://... --force
To delete & redo: delete the output file(s) above, then run again
The skip log is plain text — easy to grep by date, filename, or source ID.
If you delete an output .md file, FullMark detects that the file is gone and
reconverts automatically — no --force flag needed. The dedup check verifies
files actually exist on disk, not just that they appear in the log.
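The skip decision described here amounts to a check along these lines. The log schema (a dict keyed by source_id with an "outputs" list) and the function name are hypothetical; the real conversion_log.json layout may differ.

```python
import os
from typing import Callable


def should_skip(
    log: dict,
    source_id: str,
    force: bool = False,
    exists: Callable[[str], bool] = os.path.isfile,
) -> bool:
    """Skip only if the source is logged AND all its outputs are still on disk."""
    if force:
        return False  # --force always reconverts
    entry = log.get(source_id)
    if entry is None:
        return False  # never converted before
    outputs = entry.get("outputs", [])
    # A log entry whose files were deleted does not count as converted.
    return bool(outputs) and all(exists(path) for path in outputs)
```

The `exists` parameter is there only so the check can be exercised without touching the file system.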
Workflow for replacing an output:
- Delete the .md file in output/
- Run python fullmark_cli.py <source>; FullMark reconverts without objection
python fullmark_cli.py https://example.com/article --force

Or set for a whole batch run:
FORCE_RECONVERT=true

Every converted .md file has a provenance trail built in so the source and identity
travel with the document wherever it goes.
YAML front matter (top of every file):
---
source: https://docs.cloud.google.com/managed-spark/docs
converted: 2026-06-07T10:22:33Z
agent: WebAgent
source_id: fm-5003ea12a85b9f75
---

Footnote (bottom of every file / last segment):
---
*Converted by [FullMark](https://github.com/tmprabubiz/fullmark) · source: `https://...` · id: `fm-5003ea12a85b9f75`*
Copy the file into a knowledge base, share it, or embed it in a document store — the
origin is always traceable. The source_id ties the output back to the log entry.
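A sketch of how the front matter and footnote could be generated. The function names are hypothetical. Unlike the example above, this version always double-quotes the source value using JSON string syntax, which is also valid YAML, so that values containing #, : or quotes cannot break the block; whether FullMark quotes this way is an assumption.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def front_matter(
    source: str, agent: str, source_id: str, converted: Optional[datetime] = None
) -> str:
    """Build the YAML front matter block; `converted` should be timezone-aware."""
    moment = converted or datetime.now(timezone.utc)
    stamp = moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    lines = [
        "---",
        f"source: {json.dumps(source)}",  # JSON string = safe YAML scalar
        f"converted: {stamp}",
        f"agent: {agent}",
        f"source_id: {source_id}",
        "---",
    ]
    return "\n".join(lines) + "\n"


def footnote(source: str, source_id: str) -> str:
    """Build the provenance footnote appended at the end of the document."""
    return (
        "\n---\n"
        "*Converted by [FullMark](https://github.com/tmprabubiz/fullmark)"
        f" · source: `{source}` · id: `{source_id}`*\n"
    )
```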
| Image type | What happens |
|---|---|
| Photo / screenshot with text | Tesseract OCR → text blocks; aligned columns → GFM table |
| SVG logo / icon | Content-type detected from HTTP header (not blindly saved as .jpg); shapes → Mermaid code block; <text> elements → bullet list |
| Decorative image (no text) | Base64-embedded inline or described by vision LLM |
| Web page images | Downloaded alongside the page; extension set from Content-Type header with byte-sniff fallback (PNG magic, JPEG FF D8, SVG <svg tag) |
OCR pipeline: Tesseract → EasyOCR fallback → embed if both fail.
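The extension detection for downloaded images can be sketched like this. The function name, the MIME table, and the 1024-byte search window for the <svg tag are assumptions; the header-first order and the three byte signatures come from the table above.

```python
from typing import Optional

MIME_TO_EXT = {
    "image/png": ".png",
    "image/jpeg": ".jpg",
    "image/svg+xml": ".svg",
    "image/webp": ".webp",
}


def sniff_image_extension(data: bytes, content_type: Optional[str] = None) -> Optional[str]:
    """Pick a file extension from the Content-Type header, else from the bytes."""
    if content_type:
        # Drop parameters such as "; charset=utf-8" before the lookup.
        mime = content_type.split(";", 1)[0].strip().lower()
        if mime in MIME_TO_EXT:
            return MIME_TO_EXT[mime]
    if data.startswith(b"\x89PNG\r\n\x1a\n"):  # PNG magic number
        return ".png"
    if data.startswith(b"\xff\xd8"):  # JPEG start-of-image marker
        return ".jpg"
    if b"<svg" in data[:1024].lower():  # SVG root tag near the start
        return ".svg"
    return None  # unknown: the caller decides what to do
```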
MP4 / AVI / MOV / MKV / WEBM
└─ ffmpeg extracts audio
└─ Whisper (local, free) → timestamped transcript
└─ PySceneDetect finds scene changes
└─ opencv extracts one frame per scene → ImageAgent OCR
└─ CompilerAgent merges transcript + frame OCR → structured Markdown
MP3 / WAV / M4A
└─ Whisper → [MM:SS] timestamped Markdown
Whisper runs fully locally — no audio leaves your machine.
FullMark uses an LLM to structure video/audio transcripts and describe decorative images. Entirely optional — if you have no API keys, it falls back to mechanical formatting.
# Try left-to-right; first to respond wins.
# Put your most reliable/preferred provider first.
# Recommended default (Gemini free tier is fast and generous):
COMPILER_CHAIN=gemini,groq,openrouter_free,ollama
# VISION_CHAIN: used only when Tesseract + EasyOCR both return empty on a frame.
# Put your preferred vision provider first — order is fully user-configurable.
# OpenAI (gpt-4o-mini) is a reliable, low-cost vision option:
VISION_CHAIN=openai,gemini,gemini_free,anthropic,openrouter_free

Important: If .env contains more than one active COMPILER_CHAIN= or VISION_CHAIN= line, python-dotenv silently uses only the first one. FullMark logs a warning at startup if duplicates are detected, so this is never invisible.
| Tier | Providers | Cost |
|---|---|---|
| Free APIs | Gemini free, Groq, Cerebras, NVIDIA, OpenRouter (free models), Mistral free | Free |
| Low-cost | OpenAI (gpt-4o-mini), DeepSeek, Together AI, Fireworks, Cohere | Pay-per-use, cheap |
| Premium | OpenAI (gpt-4o), Anthropic, Gemini Pro | Pay-per-use |
| Local / offline | Ollama (any model you've pulled) | Free |
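The left-to-right chain, combined with the PROVIDER_MAX_RETRIES and PROVIDER_RETRY_DELAY settings, could behave roughly as sketched here. Everything named in the code is hypothetical (`RateLimited`, `call_chain`, the provider callables); it only illustrates "retry on 429 with doubling delay, then move on".

```python
import time
from typing import Callable, Optional


class RateLimited(Exception):
    """Hypothetical: raised by a provider call on HTTP 429."""


def call_chain(
    chain: list[str],
    providers: dict[str, Callable[[str], str]],
    prompt: str,
    max_retries: int = 2,
    retry_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[Optional[str], Optional[str]]:
    """Try providers in order; return (provider_name, response) from the first success."""
    for name in chain:
        call = providers.get(name)
        if call is None:
            continue  # provider not configured (e.g. no API key)
        for attempt in range(max_retries + 1):
            try:
                return name, call(prompt)
            except RateLimited:
                if attempt == max_retries:
                    break  # retries exhausted: next provider
                sleep(retry_delay * 2 ** attempt)  # 5 s, then 10 s
            except Exception:
                break  # any other failure: next provider
    return None, None  # caller falls back to mechanical formatting


# Usage with two fake providers: the first is always rate-limited.
def always_limited(prompt: str) -> str:
    raise RateLimited()


def echo(prompt: str) -> str:
    return "ok:" + prompt
```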
PROVIDER_MAX_RETRIES=2 # retries per provider on 429 before moving on
PROVIDER_RETRY_DELAY=5 # seconds (exponential: 5s, 10s)

output/
  report.md ← single file (< 120k chars)
  big_document_001.md ← auto-segmented (> 120k chars)
  big_document_002.md
  doc_managed-spark_docs/
    doc_managed-spark_docs.md ← web conversion in subfolder
    image-001.svg ← companion images, correct extension
    image-002.png
  git_owner_repo/
    git_owner_repo_001.md ← large repo, 3 segments
    git_owner_repo_002.md
    git_owner_repo_003.md
  conversion_log.json ← machine-readable dedup index
  conversion_log.md ← human-readable summary table with source_id
  conversion_skipped.log ← skip notices: date, file, how to reconvert
FullMark is designed to run in a Command Prompt or PowerShell window — not by double-clicking.
cd G:\fullmark
python fullmark_cli.py input/

# Windows PowerShell: capture everything to a file
python fullmark_cli.py ./docs/ 2>&1 | Tee-Object -FilePath run.log
# Bash / macOS / Linux
python fullmark_cli.py ./docs/ 2>&1 | tee run.log

Verbose mode (see every routing decision, provider attempt, and retry):
python fullmark_cli.py report.pdf -v

ORCHESTRATOR (extension + MIME routing, dedup, source_id, footnote)
│
├── DocumentAgent → PDF DOCX XLSX CSV PPTX EPUB IPYNB MSG EML RTF TXT
├── CodeAgent → .py .js .ts .go .rs .java .json .yaml .toml + 50 more
├── WebAgent → URLs HTML RSS YouTube + UrlListAgent
│ └── ImageAgent (per downloaded image — OCR / SVG / embed)
├── ImageAgent → JPG PNG SVG BMP TIFF WebP (Tesseract → EasyOCR → base64)
├── VideoAgent → MP4 AVI MOV MP3 WAV M4A (ffmpeg + Whisper + PySceneDetect)
│ └── CompilerAgent (LLM merge of transcript + frame OCR)
└── RepoAgent → https://github.com/owner/repo (GitHub Trees API, no clone)
python -m pytest tests/ -v

147 tests, all mocked; no internet connection or external tools required.
The unit suite mocks all I/O. The following test cases are not yet written and should be added before a production release:
Security (from Opus 4.8 review)
- ZIP traversal: member ../../evil.txt must not escape temp_dir
- YAML front matter: source values containing #, :, or " produce valid YAML
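For the ZIP traversal case, a test could target a check like the one sketched below. The function names are hypothetical, and rejecting the whole archive is an assumption; the real unpacker might skip or rename the offending member instead.

```python
import zipfile
from pathlib import Path


def is_safe_member(name: str, temp_dir: str) -> bool:
    """True if extracting `name` under `temp_dir` would stay inside it."""
    root = Path(temp_dir).resolve()
    # Joining an absolute member name replaces the root entirely, so
    # absolute paths are caught by the same containment check.
    target = (root / name).resolve()
    return target.is_relative_to(root)


def safe_extract(zip_path: str, temp_dir: str) -> list[Path]:
    """Extract an archive only if every member stays inside temp_dir."""
    with zipfile.ZipFile(zip_path) as archive:
        names = archive.namelist()
        unsafe = [n for n in names if not is_safe_member(n, temp_dir)]
        if unsafe:
            raise ValueError(f"refusing to extract unsafe members: {unsafe}")
        archive.extractall(temp_dir)
    return [Path(temp_dir) / n for n in names]
```

Validating every member before extracting anything means a crafted archive leaves no partial output behind.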
Output naming
- Two sources with the same filename stem → two distinct output paths (hash suffix)
- Summary links for subfolder outputs are relative, not bare filenames
CLI / routing
- https://github.com/a/b → RepoAgent (not crawlable)
- https://github.com/a/b/issues/1 → WebAgent (crawlable)
- https://github.com/a/b/blob/main/f.py → WebAgent (crawlable)
Video OCR pipeline
- Frame is upscaled to ≥1280px before Tesseract runs
- EasyOCR is not instantiated when Tesseract returns text
- Vision LLM fallback is gated on the VISION_CHAIN env var
These require real files or network and cannot be mocked:
| # | Input | What to verify |
|---|---|---|
| 1 | PDF with text | Output contains paragraph text, not just front matter |
| 2 | Scanned (image) PDF | pytesseract OCR fallback produces text |
| 3 | XLSX | GFM table output |
| 4 | HTTP URL | Images download, HTML → clean Markdown |
| 5 | YouTube URL | Transcript with [MM:SS] timestamps |
| 6 | Short MP4 (< 5 min) | Whisper transcript + at least some OCR frame text |
| 7 | ZIP with mixed files | Each file routes to correct agent, all outputs collected |
| 8 | Crafted ZIP (traversal) | ../../test.txt member extracts safely, not outside temp dir |
| 9 | Two files with same name from different dirs | Both saved, no silent overwrite |
| 10 | GitHub repo URL | RepoAgent runs, not WebAgent |
fullmark/
__init__.py ← AgentError, FullMarkError
orchestrator.py ← routing, dedup, source_id injection, output writing
agents/
document_agent.py ← PDF DOCX XLSX CSV PPTX EPUB IPYNB MSG EML RTF TXT
code_agent.py ← source code + config files (50+ extensions)
web_agent.py ← URLs HTML RSS YouTube; image content-type detection
image_agent.py ← raster OCR + SVG→Mermaid
video_agent.py ← Whisper + scene detection + frame OCR
compiler_agent.py ← LLM merge of transcript + frame data
repo_agent.py ← GitHub repo → Markdown (no clone, GitHub Trees API)
utils/
model_client.py ← provider fallback chain (Gemini → OpenAI-compat → Ollama)
markdown_utils.py ← front matter, inject_source_id, append_footnote, GFM tables
file_utils.py ← extension detection, ZIP unpacking, URL naming
metadata_logger.py ← JSON + Markdown log, dedup, skip notices
crawler.py ← recursive URL crawler (BFS, depth/delay/domain control)
tests/ ← pytest suite (147 tests)
fullmark_cli.py ← CLI entry point
fullmark_preflight.py ← system dependency checker
.env.template ← configuration template
MIT © tmprabubiz