Master Build: Production-Grade Python "ExamKit" Project Generator (macOS, Offline, OSS)

Here is a **ready-to-use master prompt** you can paste into ChatGPT (or your preferred code-generation assistant) to generate the full Python project from the requirements defined earlier.

---

# ✅ Master Build Prompt — “ExamKit” (Python, Local-Only, macOS)

**Role:** You are a senior Python engineer. Generate a complete, production-grade Python project called **`examkit`** that runs fully **offline** on **macOS** using only **free/open-source** tools. The project ingests **lecture video, transcripts, slides, session topics, exam topics, and exam papers** and outputs an **exam-ready PDF** with citations, formulas, diagrams, and a coverage report.

Follow every instruction precisely. Produce **all files** exactly as specified, with **type hints**, **docstrings**, and **clear comments**.

---

## 1) Objectives & Constraints

* **Local-only (offline):** No network calls during processing. Everything must run on macOS with Apple Silicon/Intel.
* **Free/Open-source:** Use `faster-whisper`, `PyMuPDF`, `python-pptx`, `tesseract`, `ffmpeg`, `faiss-cpu`, `sentence-transformers`, `spaCy`, `matplotlib`, `jinja2`, **Typst** (preferred) OR `pandoc+wkhtmltopdf` fallback. Use **Ollama** for the local LLM (`llama3.1:8b` default; the Llama 3.2 family on Ollama ships in 1B and 3B sizes, so an 8B default means Llama 3.1).
* **Reproducible CLI pipeline** with config (`config/config.yml`) and **deterministic** outputs.
* **Traceability:** Every paragraph in PDF must cite sources (video timecodes, slide numbers, exam question ids).
* **Portability:** No Docker required. Poetry environment or uv is fine.

---

## 2) Deliverables (Create all these files)

**Project root**

```
examkit/
  pyproject.toml
  README.md
  LICENSE
  Makefile
  .gitignore
  .env.example
  examkit/                      # Python package
    __init__.py
    cli.py                      # Typer-based CLI (or Click), entrypoint
    config.py                   # Pydantic models for config
    logging_utils.py
    utils/
      __init__.py
      io_utils.py
      text_utils.py
      timecode.py
      math_utils.py
    ingestion/
      __init__.py
      ingest.py                 # Manifest, validation, ffmpeg extract
      transcript_normalizer.py  # VTT/SRT/TXT → jsonl segments
      slides_parser.py          # PPTX→JSONL, images; PDF→JSONL via PyMuPDF+OCR
      exam_parser.py            # Exam paper structure/marks extraction
      ocr.py                    # Tesseract helper
    asr/
      __init__.py
      whisper_runner.py         # faster-whisper wrapper (offline)
    nlp/
      __init__.py
      splitter.py               # sentence/paragraph segmentation
      embeddings.py             # sentence-transformers; FAISS index
      topic_mapping.py          # syllabus mapping, coverage matrix
      retrieval.py              # RAG over FAISS
      spacy_nlp.py              # NER, cleanup (en_core_web_sm)
    synthesis/
      __init__.py
      prompts.py                # Jinja templates for prompts
      ollama_client.py          # local LLM calls via subprocess/http
      composer.py               # section builders: def/intuit/derivation/examples/common mistakes
      citations.py              # manage refs: [vid hh:mm:ss][slide N][exam Q2b]
      diagrams.py               # Graphviz/Mermaid helpers
    render/
      __init__.py
      templater.py              # Jinja2 → Markdown/Typst
      typst_renderer.py         # Typst compile
      pandoc_renderer.py        # Fallback path
    qa/
      __init__.py
      checks.py                 # formulas compile, link checker, keyword recall
    reports/
      __init__.py
      coverage.py               # topic coverage csv/json
      export.py                 # write citations.json, coverage.csv
  config/
    config.yml
    templates/
      typst/
        main.typ                # Typst main template
        theme.typ               # typography/theme
      markdown/
        section.md.j2           # per-topic section template
        pdf_main.md.j2          # stitched MD template
      prompts/
        definition.j2
        derivation.j2
        mistakes.j2
        compare.j2
        fast_revision.j2
  input/
    sample/
      video/sample.mp4          # (stub, small or placeholder note)
      transcript/sample.vtt
      slides/sample.pptx
      exam/sample_exam.pdf
      topics/session_topics.yml
      topics/exam_topics.yml
  out/                          # build artifacts
  cache/
  logs/
  tests/
    test_ingestion.py
    test_parsers.py
    test_topic_mapping.py
    test_render.py
```

---

## 3) `pyproject.toml` (Poetry) — Required Dependencies

Include at least:

* `typer[all]`, `rich`, `pydantic`, `pyyaml`, `tqdm`
* `faster-whisper`, `ffmpeg-python`
* `pymupdf`, `pdfminer.six`, `python-pptx`
* `pytesseract`, `Pillow`
* `sentence-transformers`, `faiss-cpu`, `spacy` (`en_core_web_sm` in README)
* `matplotlib`, `pandas`, `numpy`, `scikit-learn`
* `jinja2`
* Optional Mermaid CLI (npm package `@mermaid-js/mermaid-cli`, command `mmdc`) noted in README (not mandatory, not a Python dependency)
* Typst is installed via system (brew), not Python—document in README

Set Python `>=3.11`.

---

## 4) CLI Commands (Typer)

Implement a Typer CLI in `examkit/cli.py` with subcommands:

1. `examkit ingest --manifest manifest.json`

   * Validates inputs, extracts audio with ffmpeg, normalizes transcripts → `cache/transcript.jsonl`, parses slides → `cache/slides.jsonl`, parses exam → `cache/exam.jsonl`, exports a combined `cache/manifest.normalized.json`.

2. `examkit build --config config/config.yml --out out/lec05.pdf --offline`

   * Pipeline: preprocess → embeddings → topic mapping → RAG synthesis with Ollama → diagrams → templating → Typst render.
   * Save artifacts: `out/lec05.pdf`, `out/citations.json`, `out/coverage.csv`, `out/notes.md`.

3. `examkit report --session <id> --open`

   * Summarize coverage, QA checks, broken links, missing formulas; print Rich table; optionally open coverage csv.

4. `examkit cache clear`

   * Clear cached embeddings, segments, temp files safely.

Each command prints **clear progress** using `rich` and `tqdm`.

---

## 5) Config (`config/config.yml`) — Pydantic Model in `config.py`

Sample defaults:

```yaml
asr:
  engine: faster-whisper
  model: small
  language: en
  vad: true

llm:
  engine: ollama
  model: llama3.1:8b
  temperature: 0.2
  max_tokens: 900
  system_prompt: "You create exam-ready, cited study notes. Be precise, concise, and grounded in sources."

embedding:
  model: all-MiniLM-L6-v2
  dim: 384
  batch_size: 32

retrieval:
  top_k: 8
  max_context_tokens: 2000

pdf:
  engine: typst
  theme: classic
  font_size: 11
  include_appendix: true

diagrams:
  graphviz: true

offline: true
logging:
  level: INFO
```

---

## 6) Core Behaviors (Implement exactly)

### Ingestion

* `ffmpeg` → 16kHz mono WAV.
* Transcript normalize: support VTT/SRT/TXT → `jsonl` (`{source,type,start,end,text}`).
* Slides:

  * `.pptx` → titles/bullets/notes/images; export text to `slides.jsonl`.
  * `.pdf` → PyMuPDF; low text density → OCR with Tesseract; preserve reading order.
* Exam paper: headings (Section A/B), marks via regex, question ids.
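
The transcript normalization step above can be sketched as follows. This is a minimal illustration for WebVTT only, using the standard library; the function names (`parse_vtt`, `to_jsonl`) and the `"transcript"` value for the `type` field are assumptions, and SRT/TXT handling is left out.

```python
import json
import re
from typing import Dict, Iterable, Iterator, Optional

# WebVTT cue timing, e.g. "00:01:02.500 --> 00:01:05.000" (the hours part is optional).
CUE_RE = re.compile(
    r"(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})\s+-->\s+"
    r"(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})"
)


def _to_seconds(hours: Optional[str], minutes: str, seconds: str, millis: str) -> float:
    """Convert the captured timestamp parts to seconds."""
    return int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000


def parse_vtt(text: str, source: str) -> Iterator[Dict[str, object]]:
    """Yield one {source,type,start,end,text} segment per WebVTT cue."""
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        for index, line in enumerate(lines):
            match = CUE_RE.search(line)
            if match is None:
                continue  # header line ("WEBVTT") or an optional cue identifier
            groups = match.groups()
            yield {
                "source": source,
                "type": "transcript",  # assumed label, not fixed by the spec
                "start": _to_seconds(*groups[:4]),
                "end": _to_seconds(*groups[4:]),
                # Multi-line cue text is joined into a single line.
                "text": " ".join(part.strip() for part in lines[index + 1:]),
            }
            break


def to_jsonl(segments: Iterable[Dict[str, object]]) -> str:
    """Serialize segments as JSON Lines (one JSON object per line)."""
    return "\n".join(json.dumps(segment) for segment in segments)


SAMPLE_VTT = """WEBVTT

00:00:01.000 --> 00:00:04.500
Welcome to lecture five.

01:02.500 --> 01:05.000
Bayes' theorem relates
conditional probabilities.
"""

segments = list(parse_vtt(SAMPLE_VTT, "sample.vtt"))
print(to_jsonl(segments))
```

Storing `start` and `end` as float seconds keeps the segments easy to sort and to convert back to `hh:mm:ss` when citations are rendered.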

### NLP & Mapping

* Segment transcript to sentences/paragraphs.
* Embed chunks with `sentence-transformers` → FAISS index.
* Topic mapping: use keywords + embeddings to compute coverage per topic; write `out/coverage.csv`.
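
One way to combine keywords and embeddings into a per-topic coverage count is sketched below. It uses toy two-dimensional vectors in place of real `sentence-transformers` embeddings, and the blending weight `alpha`, the `threshold`, and all function names are illustrative assumptions rather than part of the spec.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def keyword_score(text: str, keywords: Sequence[str]) -> float:
    """Fraction of the topic keywords found in the chunk (case-insensitive substring match)."""
    lowered = text.lower()
    hits = sum(1 for keyword in keywords if keyword.lower() in lowered)
    return hits / len(keywords) if keywords else 0.0


def topic_score(
    chunk_text: str,
    chunk_vec: Sequence[float],
    keywords: Sequence[str],
    topic_vec: Sequence[float],
    alpha: float = 0.5,
) -> float:
    """Blend keyword overlap and embedding similarity; alpha weights the keyword part."""
    return alpha * keyword_score(chunk_text, keywords) + (1 - alpha) * cosine(chunk_vec, topic_vec)


def coverage(
    chunks: List[Tuple[str, Sequence[float]]],
    topics: Dict[str, Tuple[Sequence[str], Sequence[float]]],
    threshold: float = 0.5,
) -> Dict[str, int]:
    """Count, per topic, the chunks whose blended score reaches the threshold."""
    counts = {name: 0 for name in topics}
    for text, vec in chunks:
        for name, (keywords, topic_vec) in topics.items():
            if topic_score(text, vec, keywords, topic_vec) >= threshold:
                counts[name] += 1
    return counts


# Toy data: real vectors would come from the embedding model named in the config.
chunks = [
    ("Bayes theorem updates the prior.", [1.0, 0.0]),
    ("Gradient descent minimizes loss.", [0.0, 1.0]),
]
topics = {
    "bayes": (["bayes", "prior"], [1.0, 0.0]),
    "optimization": (["gradient", "loss"], [0.0, 1.0]),
}
print(coverage(chunks, topics))
```

The resulting counts per topic are the kind of rows that could be written to `out/coverage.csv`.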

### Synthesis (Local LLM via Ollama)

* RAG retrieve top-k chunks per subsection.
* Use Jinja **prompt templates** for:

  * `definition.j2`, `derivation.j2`, `mistakes.j2`, `compare.j2`, `fast_revision.j2`.
* Composer stitches sections per topic: Definition, Intuition, Key Formulae (LaTeX/Typst), Derivation (step-by-step), Worked Example, Common Mistakes, Quick Revision.
* **Citations**: Each paragraph includes `[vid hh:mm:ss] [slide N] [exam Q2b]`.
* No hallucinations: if a claim is not supported by the retrieved sources, tag it as `[Assumed]` and give a rationale.
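
The citation format above could be produced by a small helper along these lines. This is a sketch only: the `Citation` class, `cite_paragraph`, and the way the `[Assumed]` rationale is appended are assumptions about how `citations.py` might look.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


def format_timecode(seconds: float) -> str:
    """Render seconds as hh:mm:ss, dropping fractional seconds."""
    total = int(seconds)
    return f"{total // 3600:02d}:{total % 3600 // 60:02d}:{total % 60:02d}"


@dataclass(frozen=True)
class Citation:
    kind: str  # "vid", "slide", or "exam"
    ref: str   # e.g. "01:02:05", "12", "Q2b"

    def render(self) -> str:
        return f"[{self.kind} {self.ref}]"


def cite_paragraph(
    text: str,
    citations: Sequence[Citation],
    rationale: Optional[str] = None,
) -> str:
    """Append source tags to a paragraph, or an [Assumed] tag when nothing supports it."""
    if citations:
        return f"{text} " + " ".join(citation.render() for citation in citations)
    return f"{text} [Assumed] ({rationale or 'no supporting source retrieved'})"


supported = cite_paragraph(
    "Bayes' theorem inverts conditionals.",
    [
        Citation("vid", format_timecode(3725.4)),
        Citation("slide", "12"),
        Citation("exam", "Q2b"),
    ],
)
unsupported = cite_paragraph(
    "This result is usually proved by induction.",
    [],
    rationale="standard textbook fact, not in lecture",
)
print(supported)
print(unsupported)
```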

### Diagrams

* `diagrams.py` auto-generates Graphviz DOT for flows or processes detected by pattern mining; export to SVG/PNG and include in PDF.
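
As a rough illustration of the DOT generation step, the sketch below turns an ordered list of process steps into a left-to-right Graphviz graph. The pattern mining that would detect those steps is not shown, and `steps_to_dot` is a hypothetical helper name.

```python
from typing import Sequence


def _quote(label: str) -> str:
    """Quote a label for DOT, escaping backslashes and double quotes."""
    return '"' + label.replace("\\", "\\\\").replace('"', '\\"') + '"'


def steps_to_dot(steps: Sequence[str], name: str = "flow") -> str:
    """Build DOT source for a linear flow: one box per step, one edge per consecutive pair."""
    lines = [f"digraph {name} {{", "  rankdir=LR;", "  node [shape=box];"]
    for step in steps:
        lines.append(f"  {_quote(step)};")
    for src, dst in zip(steps, steps[1:]):
        lines.append(f"  {_quote(src)} -> {_quote(dst)};")
    lines.append("}")
    return "\n".join(lines)


dot_source = steps_to_dot(["Ingest", "Embed", "Render"])
print(dot_source)
```

With Graphviz installed, the output can be saved to a file and rendered with `dot -Tsvg flow.dot -o flow.svg` (or `-Tpng`) before it is included in the PDF.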

### Render

* `templater.py`: Jinja → Markdown & Typst glue.
* `typst_renderer.py`: compile `config/templates/typst/main.typ` → PDF; fallback: Pandoc+wkhtmltopdf if `pdf.engine` = `pandoc`.
* Add ToC, headers/footers, figure/table numbering. Large font option.
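
The engine selection and fallback described above might look like the following sketch. It assumes the `typst` and `pandoc` executables are on `PATH` and that the Pandoc path takes the stitched Markdown as input; `render_pdf`, `RendererNotFound`, and the command builders are hypothetical names.

```python
import shutil
import subprocess
from pathlib import Path
from typing import Callable, Dict, List


class RendererNotFound(RuntimeError):
    """Raised when the selected PDF engine is not installed."""


def typst_command(source: Path, output: Path) -> List[str]:
    # `typst compile <input> <output>` is the Typst CLI's compile form.
    return ["typst", "compile", str(source), str(output)]


def pandoc_command(source: Path, output: Path) -> List[str]:
    # Pandoc accepts wkhtmltopdf as a PDF engine via --pdf-engine.
    return ["pandoc", str(source), "-o", str(output), "--pdf-engine=wkhtmltopdf"]


BUILDERS: Dict[str, Callable[[Path, Path], List[str]]] = {
    "typst": typst_command,
    "pandoc": pandoc_command,
}


def render_pdf(engine: str, source: Path, output: Path) -> None:
    """Run the renderer selected by `pdf.engine`, failing with an actionable message."""
    if engine not in BUILDERS:
        raise ValueError(f"unknown pdf.engine {engine!r}; expected one of {sorted(BUILDERS)}")
    command = BUILDERS[engine](source, output)
    if shutil.which(command[0]) is None:
        raise RendererNotFound(
            f"'{command[0]}' was not found on PATH; install it (e.g. via Homebrew) "
            "or choose a different pdf.engine"
        )
    subprocess.run(command, check=True)  # raises CalledProcessError if compilation fails


print(typst_command(Path("main.typ"), Path("lec05.pdf")))
```

Keeping the command builders separate from `render_pdf` makes it straightforward to unit-test the command lines while mocking the actual compile.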

### QA

* `checks.py`: math block compilation check (Typst math syntax; Typst does not parse LaTeX, so any LaTeX formulas must be converted first), internal link checker, keyword recall vs exam topics, equation symbol consistency (simple regex rules).
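
The keyword recall check could be as simple as the sketch below, which reports the fraction of exam-topic keywords that appear in the generated notes. Whole-word, case-insensitive matching is an assumption here, and `keyword_recall` is a hypothetical function name.

```python
import re
from typing import Iterable, List, Tuple


def keyword_recall(notes: str, exam_keywords: Iterable[str]) -> Tuple[float, List[str]]:
    """Return (recall, missing keywords) using whole-word, case-insensitive matching."""
    keywords = list(exam_keywords)
    missing = [
        keyword
        for keyword in keywords
        if not re.search(rf"\b{re.escape(keyword)}\b", notes, flags=re.IGNORECASE)
    ]
    # An empty keyword list counts as fully covered.
    recall = 1.0 if not keywords else (len(keywords) - len(missing)) / len(keywords)
    return recall, missing


notes_text = "Bayes' theorem relates the prior and the posterior."
recall, missing = keyword_recall(notes_text, ["prior", "posterior", "likelihood", "Bayes"])
print(f"recall={recall:.2f} missing={missing}")
```

Returning the missing keywords alongside the score lets the `report` command list exactly which exam topics the notes fail to mention.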

### Reports

* Save `citations.json`, `coverage.csv`, and a `report.txt` summary.
* CLI `report` prints pretty summary tables.

---

## 7) Typst Templates (Minimal but elegant)

* `config/templates/typst/main.typ`:

  * A4, ToC, page numbers, header with course/session, footer with date.
  * Figure/table numbering, code block styling, math environment.
* `theme.typ` for fonts (use system fonts like Inter/Noto if installed).

---

## 8) Testing (pytest)

Write meaningful unit tests:

* `test_ingestion.py`: verify audio extraction called; transcript normalize order; OCR fallback triggers.
* `test_parsers.py`: pptx/pdf parsing counts; exam marks extraction.
* `test_topic_mapping.py`: check known chunk maps to expected topic.
* `test_render.py`: template renders without undefined vars; dummy Typst compile mocked.

Add `Makefile` targets: `make setup`, `make test`, `make lint`, `make build-demo`.
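
As an illustration of the exam marks test in `test_parsers.py`, the sketch below pairs a hypothetical `extract_marks` helper with a pytest-style test. The question-id and marks patterns (`Q2(b)`, `Q2b`, `[6 marks]`, `(4 marks)`) are assumptions about the exam paper layout, not formats fixed by the spec.

```python
import re
from typing import List, Tuple

# Question id at the start of a line: "Q1", "Q2(b)", or "Q2b".
ID_RE = re.compile(r"^Q(\d+)(?:\(([a-z])\)|([a-z])\b)?")
# Marks in square or round brackets: "[6 marks]", "(1 mark)".
MARKS_RE = re.compile(r"[\[(](\d+)\s*marks?[\])]", re.IGNORECASE)


def extract_marks(text: str) -> List[Tuple[str, int]]:
    """Return (question id, marks) for each line that has both an id and a marks tag."""
    results = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        id_match = ID_RE.match(line)
        marks_match = MARKS_RE.search(line)
        if id_match and marks_match:
            number, part_in_parens, part_plain = id_match.groups()
            question_id = f"Q{number}{part_in_parens or part_plain or ''}"
            results.append((question_id, int(marks_match.group(1))))
    return results


SAMPLE_EXAM = """Section A
Q1 State Bayes' theorem. (4 marks)
Q2(a) Define a conjugate prior. [2 marks]
Q2(b) Derive the posterior. [6 marks]
Section B
Q3 Compare MAP and MLE. (1 mark)
"""


def test_exam_marks_extraction() -> None:
    """pytest collects this function; section headings must not produce entries."""
    assert extract_marks(SAMPLE_EXAM) == [("Q1", 4), ("Q2a", 2), ("Q2b", 6), ("Q3", 1)]
```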

---

## 9) README.md

Include:

* Feature list & pipeline diagram (ASCII or Mermaid code fenced).
* macOS install steps (brew: `ffmpeg`, `tesseract`, `graphviz`, Typst).
* Python env via Poetry/uv, spaCy model install.
* Ollama install + `ollama pull llama3.1:8b`.
* Quickstart:

  ```
  examkit ingest --manifest manifest.json
  examkit build --config config/config.yml --out out/lec05.pdf --offline
  examkit report --session lec05 --open
  ```
* Troubleshooting (OCR confidence, ASR model size, typst not found).

---

## 10) Code Quality

* Use **type hints**, **docstrings (Google style)**, and **logging** (info/debug).
* No network calls if `offline=true` (enforce/guard).
* Fail gracefully with actionable error messages.
* Keep modules small, cohesive, and unit-testable.
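
One possible way to enforce the offline rule is to patch `socket.socket.connect` for the duration of a pipeline run, as sketched below. Loopback addresses stay allowed because Ollama is served locally. This is an assumption-laden sketch: `offline_guard`, `OfflineViolation`, and `is_allowed` are hypothetical names, and the guard does not cover DNS lookups or network access from child processes.

```python
import socket
from contextlib import contextmanager
from typing import Iterator

LOOPBACK_HOSTS = {"127.0.0.1", "::1", "localhost"}


class OfflineViolation(RuntimeError):
    """Raised when code tries to reach a non-local host while offline mode is on."""


def is_allowed(address: object) -> bool:
    """Allow non-IP sockets (e.g. Unix domain paths) and loopback (host, port) tuples."""
    return not isinstance(address, tuple) or address[0] in LOOPBACK_HOSTS


@contextmanager
def offline_guard(enabled: bool = True) -> Iterator[None]:
    """Temporarily make outbound connections to non-loopback hosts raise OfflineViolation."""
    if not enabled:
        yield
        return
    original_connect = socket.socket.connect

    def guarded_connect(self, address):
        if not is_allowed(address):
            raise OfflineViolation(f"blocked connection to {address[0]!r} while offline=true")
        return original_connect(self, address)

    socket.socket.connect = guarded_connect
    try:
        yield
    finally:
        # Always restore the original method, even if the pipeline raised.
        socket.socket.connect = original_connect
```

A build command could wrap its pipeline in `with offline_guard(config.offline):` (where `config.offline` is the assumed config field) so that an accidental model download fails fast with a clear message.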

---

## 11) Sample Fixtures

Under `input/sample/` provide tiny placeholder files or instruction stubs. Include minimal `session_topics.yml` and `exam_topics.yml` with 3–5 topics and weights so tests can run.

---

## 12) Acceptance Criteria

* Running the sample pipeline produces:

  * `out/lec05.pdf` (non-empty, ToC present),
  * `out/citations.json` (≥10 entries),
  * `out/coverage.csv` (topics × sources),
  * `out/notes.md` (compiled source).
* `--help` works for every CLI command, progress bars and logs are shown, and no external network calls are made.
* All tests pass with `pytest -q`.
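
The artifact checks above could be automated with a small script such as the following sketch. It assumes `citations.json` holds a top-level JSON list and does not verify that the PDF contains a ToC; `check_artifacts` is a hypothetical helper name.

```python
import csv
import json
from pathlib import Path
from typing import List


def check_artifacts(out_dir: Path, pdf_name: str = "lec05.pdf", min_citations: int = 10) -> List[str]:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems: List[str] = []

    pdf = out_dir / pdf_name
    if not pdf.is_file() or pdf.stat().st_size == 0:
        problems.append(f"{pdf_name} is missing or empty")

    if not (out_dir / "notes.md").is_file():
        problems.append("notes.md is missing")

    try:
        entries = json.loads((out_dir / "citations.json").read_text(encoding="utf-8"))
    except (OSError, ValueError):
        problems.append("citations.json is missing or not valid JSON")
    else:
        # Assumption: citations are stored as a top-level JSON list.
        if not isinstance(entries, list) or len(entries) < min_citations:
            problems.append(f"citations.json has fewer than {min_citations} entries")

    try:
        with (out_dir / "coverage.csv").open(newline="", encoding="utf-8") as handle:
            rows = list(csv.reader(handle))
    except OSError:
        problems.append("coverage.csv is missing")
    else:
        if len(rows) < 2:  # header plus at least one topic row
            problems.append("coverage.csv has no data rows")

    return problems


for problem in check_artifacts(Path("out")):
    print("FAIL:", problem)
```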

---

**Now generate the entire repository exactly as specified above. Provide each file’s full content with proper code blocks and file headers.**

