
Master Build: Production-Grade Python "ExamKit" Project Generator (macOS, Offline, OSS) #1

@thecoder8890

Here is a ready-to-use master prompt that you can paste into ChatGPT (or your favorite code-generation assistant) to generate the full Python project from the requirements defined earlier.


✅ Master Build Prompt — “ExamKit” (Python, Local-Only, macOS)

Role: You are a senior Python engineer. Generate a complete, production-grade Python project called examkit that runs fully offline on macOS using only free/open-source tools. The project ingests lecture video, transcripts, slides, session topics, exam topics, and exam papers and outputs an exam-ready PDF with citations, formulas, diagrams, and a coverage report.

Follow every instruction precisely. Produce all files exactly as specified, with type hints, docstrings, and clear comments.


1) Objectives & Constraints

  • Local-only (offline): No network calls during processing. Everything must run on macOS with Apple Silicon/Intel.
  • Free/Open-source: Use faster-whisper, PyMuPDF, python-pptx, tesseract, ffmpeg, faiss-cpu, sentence-transformers, spaCy, matplotlib, jinja2, Typst (preferred) OR pandoc+wkhtmltopdf fallback. Use Ollama for local LLM (llama3.1:8b default).
  • Reproducible CLI pipeline with config (config/config.yml) and deterministic outputs.
  • Traceability: Every paragraph in the PDF must cite its sources (video timecodes, slide numbers, exam question ids).
  • Portability: No Docker required. Poetry environment or uv is fine.

2) Deliverables (Create all these files)

Project root

examkit/
  pyproject.toml
  README.md
  LICENSE
  Makefile
  .gitignore
  .env.example
  examkit/                      # Python package
    __init__.py
    cli.py                      # Typer-based CLI (or Click), entrypoint
    config.py                   # Pydantic models for config
    logging_utils.py
    utils/
      __init__.py
      io_utils.py
      text_utils.py
      timecode.py
      math_utils.py
    ingestion/
      __init__.py
      ingest.py                 # Manifest, validation, ffmpeg extract
      transcript_normalizer.py  # VTT/SRT/TXT → jsonl segments
      slides_parser.py          # PPTX→JSONL, images; PDF→JSONL via PyMuPDF+OCR
      exam_parser.py            # Exam paper structure/marks extraction
      ocr.py                    # Tesseract helper
    asr/
      __init__.py
      whisper_runner.py         # faster-whisper wrapper (offline)
    nlp/
      __init__.py
      splitter.py               # sentence/paragraph segmentation
      embeddings.py             # sentence-transformers; FAISS index
      topic_mapping.py          # syllabus mapping, coverage matrix
      retrieval.py              # RAG over FAISS
      spacy_nlp.py              # NER, cleanup (en_core_web_sm)
    synthesis/
      __init__.py
      prompts.py                # Jinja templates for prompts
      ollama_client.py          # local LLM calls via subprocess/http
      composer.py               # section builders: def/intuit/derivation/examples/common mistakes
      citations.py              # manage refs: [vid hh:mm:ss][slide N][exam Q2b]
      diagrams.py               # Graphviz/Mermaid helpers
    render/
      __init__.py
      templater.py              # Jinja2 → Markdown/Typst
      typst_renderer.py         # Typst compile
      pandoc_renderer.py        # Fallback path
    qa/
      __init__.py
      checks.py                 # formulas compile, link checker, keyword recall
    reports/
      __init__.py
      coverage.py               # topic coverage csv/json
      export.py                 # write citations.json, coverage.csv
  config/
    config.yml
    templates/
      typst/
        main.typ                # Typst main template
        theme.typ               # typography/theme
      markdown/
        section.md.j2           # per-topic section template
        pdf_main.md.j2          # stitched MD template
      prompts/
        definition.j2
        derivation.j2
        mistakes.j2
        compare.j2
        fast_revision.j2
  input/
    sample/
      video/sample.mp4          # (stub, small or placeholder note)
      transcript/sample.vtt
      slides/sample.pptx
      exam/sample_exam.pdf
      topics/session_topics.yml
      topics/exam_topics.yml
  out/                          # build artifacts
  cache/
  logs/
  tests/
    test_ingestion.py
    test_parsers.py
    test_topic_mapping.py
    test_render.py

3) pyproject.toml (Poetry) — Required Dependencies

Include at least:

  • typer[all], rich, pydantic, pyyaml, tqdm
  • faster-whisper, ffmpeg-python
  • pymupdf, pdfminer.six, python-pptx
  • pytesseract, Pillow
  • sentence-transformers, faiss-cpu, spacy (document the en_core_web_sm model install in the README)
  • matplotlib, pandas, numpy, scikit-learn
  • jinja2
  • Optional mermaid-cli via npm noted in README (not mandatory)
  • Typst is installed via system (brew), not Python—document in README

Set Python >=3.11.


4) CLI Commands (Typer)

Implement a Typer CLI in examkit/cli.py with subcommands:

  1. examkit ingest --manifest manifest.json

    • Validates inputs, extracts audio with ffmpeg, normalizes transcripts → cache/transcript.jsonl, parses slides → cache/slides.jsonl, parses exam → cache/exam.jsonl, exports a combined cache/manifest.normalized.json.
  2. examkit build --config config/config.yml --out out/lec05.pdf --offline

    • Pipeline: preprocess → embeddings → topic mapping → RAG synthesis with Ollama → diagrams → templating → Typst render.
    • Save artifacts: out/lec05.pdf, out/citations.json, out/coverage.csv, out/notes.md.
  3. examkit report --session <id> --open

    • Summarize coverage, QA checks, broken links, missing formulas; print Rich table; optionally open coverage csv.
  4. examkit cache clear

    • Clear cached embeddings, segments, temp files safely.

Each command prints clear progress using rich and tqdm.
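
A minimal sketch of the subcommand layout described above. The prompt calls for Typer; the standard library's argparse is swapped in here only so the sketch runs without extra dependencies. The function name build_parser and the attribute names (open_csv, cache_command) are illustrative assumptions, not part of the specification.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the four examkit subcommands (argparse stand-in for Typer)."""
    parser = argparse.ArgumentParser(prog="examkit")
    sub = parser.add_subparsers(dest="command", required=True)

    ingest = sub.add_parser("ingest", help="Validate and normalize inputs")
    ingest.add_argument("--manifest", required=True)

    build = sub.add_parser("build", help="Run the pipeline and render the PDF")
    build.add_argument("--config", default="config/config.yml")
    build.add_argument("--out", required=True)
    build.add_argument("--offline", action="store_true")

    report = sub.add_parser("report", help="Summarize coverage and QA checks")
    report.add_argument("--session", required=True)
    report.add_argument("--open", action="store_true", dest="open_csv")

    # "examkit cache clear" is modeled as a nested subcommand.
    cache = sub.add_parser("cache", help="Manage cached artifacts")
    cache_sub = cache.add_subparsers(dest="cache_command", required=True)
    cache_sub.add_parser("clear", help="Remove cached embeddings and segments")
    return parser


args = build_parser().parse_args(
    ["build", "--config", "config/config.yml", "--out", "out/lec05.pdf", "--offline"]
)
print(args.command, args.out, args.offline)
```

In a Typer version, each subparser would presumably become a decorated command function, with the nested cache group registered as a sub-application.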


5) Config (config/config.yml) — Pydantic Model in config.py

Sample defaults:

asr:
  engine: faster-whisper
  model: small
  language: en
  vad: true

llm:
  engine: ollama
  model: llama3.1:8b
  temperature: 0.2
  max_tokens: 900
  system_prompt: "You create exam-ready, cited study notes. Be precise, concise, and grounded in sources."

embedding:
  model: all-MiniLM-L6-v2
  dim: 384
  batch_size: 32

retrieval:
  top_k: 8
  max_context_tokens: 2000

pdf:
  engine: typst
  theme: classic
  font_size: 11
  include_appendix: true

diagrams:
  graphviz: true

offline: true
logging:
  level: INFO
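
One way the defaults above could be validated at load time is sketched below. The prompt asks for Pydantic models; frozen dataclasses are swapped in here so the example has no third-party dependencies, and only the embedding, retrieval, and offline keys are modeled. The class names and load_config are hypothetical, and the input is assumed to be a mapping already parsed from config.yml.

```python
from dataclasses import dataclass, field
from typing import Any, Mapping


@dataclass(frozen=True)
class EmbeddingConfig:
    model: str = "all-MiniLM-L6-v2"
    dim: int = 384
    batch_size: int = 32

    def __post_init__(self) -> None:
        if self.dim <= 0 or self.batch_size <= 0:
            raise ValueError("embedding.dim and embedding.batch_size must be positive")


@dataclass(frozen=True)
class RetrievalConfig:
    top_k: int = 8
    max_context_tokens: int = 2000

    def __post_init__(self) -> None:
        if self.top_k <= 0:
            raise ValueError("retrieval.top_k must be positive")


@dataclass(frozen=True)
class ExamKitConfig:
    embedding: EmbeddingConfig = field(default_factory=EmbeddingConfig)
    retrieval: RetrievalConfig = field(default_factory=RetrievalConfig)
    offline: bool = True


def load_config(raw: Mapping[str, Any]) -> ExamKitConfig:
    """Build a validated config from an already-parsed YAML mapping."""
    return ExamKitConfig(
        embedding=EmbeddingConfig(**raw.get("embedding", {})),
        retrieval=RetrievalConfig(**raw.get("retrieval", {})),
        offline=bool(raw.get("offline", True)),
    )


# Missing sections fall back to the documented defaults.
cfg = load_config({"retrieval": {"top_k": 5}})
print(cfg.retrieval.top_k, cfg.embedding.dim, cfg.offline)
```

Unknown keys raise a TypeError from the dataclass constructor, which gives a rough equivalent of Pydantic's strict mode.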

6) Core Behaviors (Implement exactly)

Ingestion

  • ffmpeg → 16kHz mono WAV.

  • Transcript normalize: support VTT/SRT/TXT → jsonl ({source,type,start,end,text}).

  • Slides:

    • .pptx → titles/bullets/notes/images; export text to slides.jsonl.
    • .pdf → PyMuPDF; low text density → OCR with Tesseract; preserve reading order.
  • Exam paper: headings (Section A/B), marks via regex, question ids.
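
The transcript normalization step could look roughly like the sketch below, which turns VTT or SRT cue text into the {source,type,start,end,text} records listed above. The function name normalize_cues, the "transcript" value for type, and the use of seconds as floats for start/end are assumptions made for illustration.

```python
import json
import re
from typing import Iterator, Optional

# Cue timing line, e.g. "00:00:01.000 --> 00:00:04.500" (VTT) or with commas (SRT).
# The hours field is optional, as WebVTT permits "mm:ss.ttt".
CUE_RE = re.compile(
    r"(?:(\d+):)?(\d{2}):(\d{2})[.,](\d{3})\s+-->\s+(?:(\d+):)?(\d{2}):(\d{2})[.,](\d{3})"
)


def _seconds(h: Optional[str], m: str, s: str, ms: str) -> float:
    return int(h or 0) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000


def normalize_cues(text: str, source: str) -> Iterator[dict]:
    """Yield one record per cue, in file order; blocks without a timing line are skipped."""
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.strip().splitlines()
        for i, line in enumerate(lines):
            match = CUE_RE.search(line)
            if not match:
                continue
            groups = match.groups()
            body = " ".join(part.strip() for part in lines[i + 1:] if part.strip())
            if body:
                yield {
                    "source": source,
                    "type": "transcript",
                    "start": _seconds(*groups[:4]),
                    "end": _seconds(*groups[4:]),
                    "text": body,
                }
            break


sample = """WEBVTT

00:00:01.000 --> 00:00:04.500
Welcome to lecture five.

00:00:04.500 --> 00:00:09.000
Today we cover
gradient descent.
"""

for record in normalize_cues(sample, "sample.vtt"):
    print(json.dumps(record))  # one JSON object per line, i.e. JSONL
```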

NLP & Mapping

  • Segment transcript to sentences/paragraphs.
  • Embed chunks with sentence-transformers → FAISS index.
  • Topic mapping: use keywords + embeddings to compute coverage per topic; write out/coverage.csv.
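
The "keywords + embeddings" blend can be illustrated with a small scoring function. This is only a sketch: bag-of-words count vectors stand in for sentence-transformers embeddings so that it runs on the standard library alone, and the equal weighting (alpha = 0.5) and the name map_chunk are assumptions rather than anything the specification fixes.

```python
import math
import re
from collections import Counter


def tokens(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def map_chunk(chunk: str, topics: dict[str, list[str]], alpha: float = 0.5) -> tuple[str, float]:
    """Return (best_topic, score), blending keyword hit rate with vector similarity."""
    chunk_vec = Counter(tokens(chunk))
    best = ("", 0.0)
    for name, keywords in topics.items():
        kw_tokens = [t for kw in keywords for t in tokens(kw)]
        unique = set(kw_tokens)
        hit_rate = sum(1 for t in unique if t in chunk_vec) / max(len(unique), 1)
        score = alpha * hit_rate + (1 - alpha) * cosine(chunk_vec, Counter(kw_tokens))
        if score > best[1]:
            best = (name, score)
    return best


topics = {
    "gradient descent": ["gradient", "descent", "learning rate"],
    "regularization": ["ridge", "lasso", "penalty"],
}
chunk = "We update weights along the negative gradient with a small learning rate."
print(map_chunk(chunk, topics))
```

Aggregating these per-chunk scores by topic and by source type would then give the rows of out/coverage.csv.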

Synthesis (Local LLM via Ollama)

  • RAG retrieve top-k chunks per subsection.

  • Use Jinja prompt templates for:

    • definition.j2, derivation.j2, mistakes.j2, compare.j2, fast_revision.j2.
  • Composer stitches sections per topic: Definition, Intuition, Key Formulae (LaTeX/Typst), Derivation (step-by-step), Worked Example, Common Mistakes, Quick Revision.

  • Citations: Each paragraph includes [vid hh:mm:ss] [slide N] [exam Q2b].

  • No hallucinations—if unsupported, tag as [Assumed] with rationale.
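
The citation format and the [Assumed] rule above can be sketched as follows. The helper names (format_timecode, cite, ensure_grounded) and the exact wording of the rationale are hypothetical; only the bracketed marker shapes come from the specification.

```python
import re
from typing import Optional


def format_timecode(seconds: float) -> str:
    """Convert seconds to hh:mm:ss, truncating fractional seconds."""
    total = int(seconds)
    return f"{total // 3600:02d}:{(total % 3600) // 60:02d}:{total % 60:02d}"


def cite(
    video_s: Optional[float] = None,
    slide: Optional[int] = None,
    exam_q: Optional[str] = None,
) -> str:
    """Build a citation string from whichever sources are available."""
    parts = []
    if video_s is not None:
        parts.append(f"[vid {format_timecode(video_s)}]")
    if slide is not None:
        parts.append(f"[slide {slide}]")
    if exam_q:
        parts.append(f"[exam {exam_q}]")
    return " ".join(parts)


CITATION_RE = re.compile(r"\[(?:vid \d{2}:\d{2}:\d{2}|slide \d+|exam Q\w+)\]")


def ensure_grounded(paragraph: str) -> str:
    """Leave cited paragraphs untouched; tag uncited ones as [Assumed] with a rationale."""
    if CITATION_RE.search(paragraph):
        return paragraph
    return f"{paragraph} [Assumed: no supporting source retrieved]"


print(cite(video_s=725, slide=4, exam_q="Q2b"))
print(ensure_grounded("The learning rate controls the step size."))
```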

Diagrams

  • diagrams.py auto-generates Graphviz DOT for flows or processes detected by pattern mining; export to SVG/PNG and include in PDF.
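
As a rough illustration of the DOT generation step, the sketch below renders an ordered list of steps as a Graphviz digraph. It assumes the pattern-mining stage has already produced the step list; flow_to_dot and its layout choices (left-to-right, box nodes) are illustrative assumptions.

```python
def flow_to_dot(steps: list[str], name: str = "flow") -> str:
    """Render an ordered list of steps as a left-to-right Graphviz digraph."""

    def quote(label: str) -> str:
        # Escape backslashes and double quotes so labels stay valid DOT strings.
        return '"' + label.replace("\\", "\\\\").replace('"', '\\"') + '"'

    lines = [f"digraph {name} {{", "  rankdir=LR;", "  node [shape=box];"]
    for step in steps:
        lines.append(f"  {quote(step)};")
    for src, dst in zip(steps, steps[1:]):
        lines.append(f"  {quote(src)} -> {quote(dst)};")
    lines.append("}")
    return "\n".join(lines)


dot = flow_to_dot(["Forward pass", "Compute loss", "Backpropagate"])
print(dot)
```

The resulting text can be written to a file and converted with the Graphviz CLI, for example dot -Tsvg flow.dot -o flow.svg.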

Render

  • templater.py: Jinja → Markdown & Typst glue.
  • typst_renderer.py: compile config/templates/typst/main.typ → PDF; fallback: Pandoc+wkhtmltopdf if pdf.engine = pandoc.
  • Add ToC, headers/footers, figure/table numbering. Large font option.
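
The engine switch between Typst and the Pandoc fallback might be sketched as a function that builds the command line and checks the tool is installed before running it. The function names are hypothetical; the commands themselves use the typst compile subcommand and Pandoc's --pdf-engine option.

```python
import shutil
from pathlib import Path


def render_command(engine: str, source: Path, out_pdf: Path) -> list[str]:
    """Return the command line for the configured pdf.engine."""
    if engine == "typst":
        return ["typst", "compile", str(source), str(out_pdf)]
    if engine == "pandoc":
        return ["pandoc", str(source), "-o", str(out_pdf), "--pdf-engine=wkhtmltopdf"]
    raise ValueError(f"Unsupported pdf.engine: {engine!r}")


def check_tool(cmd: list[str]) -> None:
    """Fail early with an actionable message when the renderer is not on PATH."""
    if shutil.which(cmd[0]) is None:
        raise RuntimeError(f"{cmd[0]} not found on PATH; install it (e.g. via Homebrew) and retry")


cmd = render_command("typst", Path("main.typ"), Path("lec05.pdf"))
print(" ".join(cmd))
```

After check_tool(cmd) passes, the list can be handed to subprocess.run(cmd, check=True); keeping command construction separate from execution also makes the Typst compile easy to mock in tests.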

QA

  • checks.py: math block compilation check (Typst math, or LaTeX on the Pandoc fallback path), internal link checker, keyword recall vs exam topics, equation symbol consistency (simple regex rules).
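
Of the checks listed, keyword recall is the simplest to pin down. A minimal sketch, assuming exam topics are available as a flat list of keyword phrases and that a case-insensitive whole-phrase match counts as a hit (keyword_recall is a hypothetical name):

```python
import re


def keyword_recall(notes: str, exam_keywords: list[str]) -> tuple[float, list[str]]:
    """Return the fraction of exam keywords found in the notes, plus the missing ones."""
    text = notes.lower()
    missing = [
        kw
        for kw in exam_keywords
        if not re.search(r"\b" + re.escape(kw.lower()) + r"\b", text)
    ]
    found = len(exam_keywords) - len(missing)
    recall = found / len(exam_keywords) if exam_keywords else 1.0
    return recall, missing


notes = "Gradient descent uses a learning rate. Overfitting is reduced by regularization."
keywords = ["gradient descent", "learning rate", "bias-variance tradeoff", "regularization"]
recall, missing = keyword_recall(notes, keywords)
print(recall, missing)
```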

Reports

  • Save citations.json, coverage.csv, and a report.txt summary.
  • CLI report prints pretty summary tables.
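
The export step could be as small as the sketch below. It assumes coverage is held as a topic → source → score mapping and citations as a list of dicts; both shapes, and the name export_reports, are illustrative. Keys are sorted so that repeated runs write byte-identical files, in line with the deterministic-output requirement.

```python
import csv
import json
import tempfile
from pathlib import Path


def export_reports(
    out_dir: Path,
    coverage: dict[str, dict[str, float]],
    citations: list[dict],
) -> None:
    """Write coverage.csv (topics x sources) and citations.json into out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    sources = sorted({src for row in coverage.values() for src in row})
    with (out_dir / "coverage.csv").open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=["topic", *sources])
        writer.writeheader()
        for topic in sorted(coverage):
            row = {src: coverage[topic].get(src, 0.0) for src in sources}
            writer.writerow({"topic": topic, **row})
    (out_dir / "citations.json").write_text(
        json.dumps(citations, indent=2, sort_keys=True), encoding="utf-8"
    )


# Demo against a throwaway directory instead of the real out/ folder.
with tempfile.TemporaryDirectory() as tmp:
    export_reports(
        Path(tmp),
        {"entropy": {"video": 0.5}, "bayes": {"slides": 1.0}},
        [{"topic": "bayes", "ref": "[slide 4]"}],
    )
    print((Path(tmp) / "coverage.csv").read_text(encoding="utf-8"))
```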

7) Typst Templates (Minimal but elegant)

  • config/templates/typst/main.typ:

    • A4, ToC, page numbers, header with course/session, footer with date.
    • Figure/table numbering, code block styling, math environment.
  • theme.typ for fonts (use system fonts like Inter/Noto if installed).


8) Testing (pytest)

Write meaningful unit tests:

  • test_ingestion.py: verify audio extraction called; transcript normalize order; OCR fallback triggers.
  • test_parsers.py: pptx/pdf parsing counts; exam marks extraction.
  • test_topic_mapping.py: check known chunk maps to expected topic.
  • test_render.py: template renders without undefined vars; dummy Typst compile mocked.
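
To make "meaningful" concrete, here is a sketch of what the exam marks test in test_parsers.py might look like. The parser is inlined so the example is self-contained; parse_question, both regexes, and the sample question lines are hypothetical stand-ins for whatever exam_parser.py ends up exposing.

```python
import re
from typing import Optional

# "[8 marks]" or "(1 mark)", case-insensitive.
MARKS_RE = re.compile(r"[\[(](\d+)\s*marks?[\])]", re.IGNORECASE)
# Leading question id such as "Q2b)" or "3.".
QUESTION_RE = re.compile(r"^\s*(Q?\d+[a-z]?)[.)]\s", re.IGNORECASE)


def parse_question(line: str) -> Optional[dict]:
    """Return {"id", "marks"} for a question line, or None for anything else."""
    question = QUESTION_RE.match(line)
    marks = MARKS_RE.search(line)
    if not question or not marks:
        return None
    return {"id": question.group(1), "marks": int(marks.group(1))}


def test_marks_extraction() -> None:
    assert parse_question("Q2b) Derive the normal equations. [8 marks]") == {
        "id": "Q2b",
        "marks": 8,
    }
    assert parse_question("3. Define entropy. (1 mark)") == {"id": "3", "marks": 1}
    # Section headings carry no marks and must not be treated as questions.
    assert parse_question("Section A") is None


test_marks_extraction()
```

Under pytest the final call is unnecessary, since test functions are collected by name.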

Add Makefile targets: make setup, make test, make lint, make build-demo.


9) README.md

Include:

  • Feature list & pipeline diagram (ASCII or Mermaid code fenced).

  • macOS install steps (brew: ffmpeg, tesseract, graphviz, Typst).

  • Python env via Poetry/uv, spaCy model install.

  • Ollama install + ollama pull llama3.1:8b.

  • Quickstart:

    examkit ingest --manifest manifest.json
    examkit build --config config/config.yml --out out/lec05.pdf --offline
    examkit report --session lec05 --open
    
  • Troubleshooting (OCR confidence, ASR model size, typst not found).


10) Code Quality

  • Use type hints, docstrings (Google style), and logging (info/debug).
  • No network calls if offline=true (enforce/guard).
  • Fail gracefully with actionable error messages.
  • Keep modules small, cohesive, and unit-testable.
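
One possible way to enforce the offline rule is a guard that rejects non-loopback socket connections while the pipeline runs, so that a local Ollama server stays reachable but nothing else is. This is a sketch under stated assumptions: offline_guard and OfflineViolation are hypothetical names, only TCP/UDP connect calls are intercepted, and DNS lookups are not blocked.

```python
import socket
from contextlib import contextmanager
from typing import Iterator

LOOPBACK = {"127.0.0.1", "::1", "localhost"}


class OfflineViolation(RuntimeError):
    """Raised when code attempts a non-loopback connection in offline mode."""


@contextmanager
def offline_guard(enabled: bool = True) -> Iterator[None]:
    """Temporarily patch socket.connect to allow only loopback destinations."""
    if not enabled:
        yield
        return
    original_connect = socket.socket.connect

    def guarded_connect(self, address):
        # IP sockets pass (host, port, ...) tuples; Unix socket paths are left alone.
        if isinstance(address, tuple) and address[0] not in LOOPBACK:
            raise OfflineViolation(
                f"Blocked network call to {address[0]!r} while offline=true"
            )
        return original_connect(self, address)

    socket.socket.connect = guarded_connect
    try:
        yield
    finally:
        socket.socket.connect = original_connect


with offline_guard():
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.connect(("93.184.216.34", 80))
    except OfflineViolation as exc:
        print(exc)
    finally:
        sock.close()
```

Wrapping the body of the build command in offline_guard(enabled=config.offline) would turn an accidental download attempt into an immediate, actionable error.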

11) Sample Fixtures

Under input/sample/ provide tiny placeholder files or instruction stubs. Include minimal session_topics.yml and exam_topics.yml with 3–5 topics and weights so tests can run.


12) Acceptance Criteria

  • Running the sample pipeline produces:

    • out/lec05.pdf (non-empty, ToC present),
    • out/citations.json (≥10 entries),
    • out/coverage.csv (topics × sources),
    • out/notes.md (compiled source).
  • CLI help text is available, progress bars and logs are shown, and no external network calls are made.

  • All tests pass with pytest -q.


Now generate the entire repository exactly as specified above. Provide each file’s full content with proper code blocks and file headers.
