Description
Here is a ready-to-use master prompt you can paste into ChatGPT (or your favorite code-gen assistant) to generate the full Python project from the requirements we just defined.
✅ Master Build Prompt — “ExamKit” (Python, Local-Only, macOS)
Role: You are a senior Python engineer. Generate a complete, production-grade Python project called examkit that runs fully offline on macOS using only free/open-source tools. The project ingests lecture video, transcripts, slides, session topics, exam topics, and exam papers and outputs an exam-ready PDF with citations, formulas, diagrams, and a coverage report.
Follow every instruction precisely. Produce all files exactly as specified, with type hints, docstrings, and clear comments.
1) Objectives & Constraints
- Local-only (offline): No network calls during processing. Everything must run on macOS with Apple Silicon/Intel.
- Free/Open-source: Use faster-whisper, PyMuPDF, python-pptx, tesseract, ffmpeg, faiss-cpu, sentence-transformers, spaCy, matplotlib, jinja2, and Typst (preferred) OR a pandoc + wkhtmltopdf fallback. Use Ollama for the local LLM (llama3.2:8b default).
- Reproducible CLI pipeline with config (config/config.yml) and deterministic outputs.
- Traceability: Every paragraph in the PDF must cite sources (video timecodes, slide numbers, exam question ids).
- Portability: No Docker required. Poetry environment or uv is fine.
2) Deliverables (Create all these files)
Project root
examkit/
  pyproject.toml
  README.md
  LICENSE
  Makefile
  .gitignore
  .env.example
  examkit/                          # Python package
    __init__.py
    cli.py                          # Typer-based CLI (or Click), entrypoint
    config.py                       # Pydantic models for config
    logging_utils.py
    utils/
      __init__.py
      io_utils.py
      text_utils.py
      timecode.py
      math_utils.py
    ingestion/
      __init__.py
      ingest.py                     # Manifest, validation, ffmpeg extract
      transcript_normalizer.py      # VTT/SRT/TXT → jsonl segments
      slides_parser.py              # PPTX→JSONL, images; PDF→JSONL via PyMuPDF+OCR
      exam_parser.py                # Exam paper structure/marks extraction
      ocr.py                        # Tesseract helper
    asr/
      __init__.py
      whisper_runner.py             # faster-whisper wrapper (offline)
    nlp/
      __init__.py
      splitter.py                   # sentence/paragraph segmentation
      embeddings.py                 # sentence-transformers; FAISS index
      topic_mapping.py              # syllabus mapping, coverage matrix
      retrieval.py                  # RAG over FAISS
      spaCy_nlp.py                  # NER, cleanup (en_core_web_sm)
    synthesis/
      __init__.py
      prompts.py                    # Jinja templates for prompts
      ollama_client.py              # local LLM calls via subprocess/http
      composer.py                   # section builders: def/intuit/derivation/examples/common mistakes
      citations.py                  # manage refs: [vid hh:mm:ss][slide N][exam Q2b]
      diagrams.py                   # Graphviz/Mermaid helpers
    render/
      __init__.py
      templater.py                  # Jinja2 → Markdown/Typst
      typst_renderer.py             # Typst compile
      pandoc_renderer.py            # Fallback path
    qa/
      __init__.py
      checks.py                     # formulas compile, link checker, keyword recall
    reports/
      __init__.py
      coverage.py                   # topic coverage csv/json
      export.py                     # write citations.json, coverage.csv
  config/
    config.yml
    templates/
      typst/
        main.typ                    # Typst main template
        theme.typ                   # typography/theme
      markdown/
        section.md.j2               # per-topic section template
        pdf_main.md.j2              # stitched MD template
      prompts/
        definition.j2
        derivation.j2
        mistakes.j2
        compare.j2
        fast_revision.j2
  input/
    sample/
      video/sample.mp4              # (stub, small or placeholder note)
      transcript/sample.vtt
      slides/sample.pptx
      exam/sample_exam.pdf
      topics/session_topics.yml
      topics/exam_topics.yml
  out/                              # build artifacts
  cache/
  logs/
  tests/
    test_ingestion.py
    test_parsers.py
    test_topic_mapping.py
    test_render.py
3) pyproject.toml (Poetry) — Required Dependencies
Include at least:
- typer[all], rich, pydantic, pyyaml, tqdm
- faster-whisper, ffmpeg-python
- pymupdf, pdfminer.six, python-pptx
- pytesseract, Pillow
- sentence-transformers, faiss-cpu, spacy (en_core_web_sm in README)
- matplotlib, pandas, numpy, scikit-learn
- jinja2
- Optional: mermaid-cli via npm, noted in README (not mandatory)
- Typst is installed via the system (brew), not Python; document this in README
Set Python >=3.11.
4) CLI Commands (Typer)
Implement a Typer CLI in examkit/cli.py with subcommands:
- examkit ingest --manifest manifest.json
  - Validates inputs, extracts audio with ffmpeg, normalizes transcripts → cache/transcript.jsonl, parses slides → cache/slides.jsonl, parses exam → cache/exam.jsonl, exports a combined cache/manifest.normalized.json.
- examkit build --config config/config.yml --out out/lec05.pdf --offline
  - Pipeline: preprocess → embeddings → topic mapping → RAG synthesis with Ollama → diagrams → templating → Typst render.
  - Save artifacts: out/lec05.pdf, out/citations.json, out/coverage.csv, out/notes.md.
- examkit report --session <id> --open
  - Summarize coverage, QA checks, broken links, missing formulas; print Rich table; optionally open coverage csv.
- examkit cache clear
  - Clear cached embeddings, segments, temp files safely.
Each command prints clear progress using rich and tqdm.
5) Config (config/config.yml) — Pydantic Model in config.py
Sample defaults:
asr:
  engine: faster-whisper
  model: small
  language: en
  vad: true
llm:
  engine: ollama
  model: llama3.2:8b
  temperature: 0.2
  max_tokens: 900
  system_prompt: "You create exam-ready, cited study notes. Be precise, concise, and grounded in sources."
embedding:
  model: all-MiniLM-L6-v2
  dim: 384
  batch_size: 32
retrieval:
  top_k: 8
  max_context_tokens: 2000
pdf:
  engine: typst
  theme: classic
  font_size: 11
  include_appendix: true
diagrams:
  graphviz: true
offline: true
logging:
  level: INFO
6) Core Behaviors (Implement exactly)
Ingestion
- ffmpeg → 16 kHz mono WAV.
- Transcript normalize: support VTT/SRT/TXT → jsonl ({source,type,start,end,text}).
- Slides:
  - .pptx → titles/bullets/notes/images; export text to slides.jsonl.
  - .pdf → PyMuPDF; low text density → OCR with Tesseract; preserve reading order.
- Exam paper: headings (Section A/B), marks via regex, question ids.
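The transcript normalization step above could be sketched roughly as follows. This is only an illustration using the standard library: the names `normalize_cues` and `to_jsonl` are hypothetical, the `"transcript"` value for the `type` field is an assumption, and plain TXT input (which has no timecodes) is not handled here.

```python
import json
import re
from typing import Iterator

# Matches cue timing lines from VTT ("00:00:01.000 --> 00:00:04.500")
# and SRT ("00:00:01,000 --> 00:00:04,500"); the hours part is optional.
CUE_RE = re.compile(
    r"(?:(\d+):)?(\d{2}):(\d{2})[.,](\d{3})\s*-->\s*"
    r"(?:(\d+):)?(\d{2}):(\d{2})[.,](\d{3})"
)


def _seconds(h, m, s, ms) -> float:
    """Convert regex groups (hours may be None) into seconds."""
    return int(h or 0) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000


def normalize_cues(text: str, source: str) -> Iterator[dict]:
    """Yield {source,type,start,end,text} segments from VTT or SRT text."""
    start = end = None
    lines: list[str] = []
    # The extra empty string flushes the final cue if the file lacks a trailing blank line.
    for raw in text.splitlines() + [""]:
        line = raw.strip()
        match = CUE_RE.search(line)
        if match:
            groups = match.groups()
            start, end = _seconds(*groups[:4]), _seconds(*groups[4:])
            lines = []
        elif not line:
            if start is not None and lines:
                yield {
                    "source": source,
                    "type": "transcript",  # assumed value; the spec only names the field
                    "start": start,
                    "end": end,
                    "text": " ".join(lines),
                }
            start = end = None
            lines = []
        elif start is not None:
            # Text lines are collected only inside a cue, so the WEBVTT header
            # and SRT index numbers are skipped.
            lines.append(line)


def to_jsonl(segments) -> str:
    """Serialize segments as one JSON object per line."""
    return "\n".join(json.dumps(seg, ensure_ascii=False) for seg in segments)
```

Because a single pattern accepts both `.` and `,` as the millisecond separator, the same function covers VTT and SRT in this sketch.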
NLP & Mapping
- Segment transcript to sentences/paragraphs.
- Embed chunks with sentence-transformers → FAISS index.
- Topic mapping: use keywords + embeddings to compute coverage per topic; write out/coverage.csv.
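One possible way to combine keywords and embeddings into a per-topic coverage count is sketched below. It uses toy vectors and pure Python so it runs without sentence-transformers or FAISS. The blending weight `alpha`, the `threshold`, and all function names are assumptions for illustration, not part of the specification.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def keyword_score(text: str, keywords: list[str]) -> float:
    """Fraction of topic keywords that occur in the chunk text."""
    if not keywords:
        return 0.0
    lowered = text.lower()
    return sum(1 for kw in keywords if kw.lower() in lowered) / len(keywords)


def topic_score(text, vec, keywords, topic_vec, alpha: float = 0.5) -> float:
    """Blend keyword overlap and embedding similarity (alpha is an assumed weight)."""
    similarity = max(0.0, cosine(vec, topic_vec))
    return alpha * keyword_score(text, keywords) + (1 - alpha) * similarity


def coverage(chunks, topics, threshold: float = 0.5) -> dict[str, int]:
    """Count, per topic, the chunks whose blended score reaches the threshold.

    chunks: list of (text, embedding); topics: {name: (keywords, embedding)}.
    """
    return {
        name: sum(
            1
            for text, vec in chunks
            if topic_score(text, vec, keywords, topic_vec) >= threshold
        )
        for name, (keywords, topic_vec) in topics.items()
    }
```

In the real pipeline the vectors would come from the configured embedding model and the counts would be written to out/coverage.csv; here they are passed in directly to keep the example self-contained.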
Synthesis (Local LLM via Ollama)
- RAG: retrieve top-k chunks per subsection.
- Use Jinja prompt templates for: definition.j2, derivation.j2, mistakes.j2, compare.j2, fast_revision.j2.
- Composer stitches sections per topic: Definition, Intuition, Key Formulae (LaTeX/Typst), Derivation (step-by-step), Worked Example, Common Mistakes, Quick Revision.
- Citations: Each paragraph includes [vid hh:mm:ss] [slide N] [exam Q2b].
- No hallucinations: if a statement is unsupported by the sources, tag it as [Assumed] with a rationale.
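The citation rule could look something like the sketch below. The source dictionaries (keys `type`, `start`, `number`, `question_id`) and the way the rationale is appended after `[Assumed]` are assumptions made for this example; only the three tag formats come from the text above.

```python
def format_timecode(seconds: float) -> str:
    """Render seconds as hh:mm:ss (fractions of a second are dropped)."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def cite(paragraph: str, sources: list[dict], rationale: str = "") -> str:
    """Append citation tags to a paragraph, or [Assumed] when no source supports it."""
    tags = []
    for src in sources:
        kind = src.get("type")
        if kind == "video":
            tags.append(f"[vid {format_timecode(src['start'])}]")
        elif kind == "slide":
            tags.append(f"[slide {src['number']}]")
        elif kind == "exam":
            tags.append(f"[exam {src['question_id']}]")
    if not tags:
        # Unsupported paragraph: mark it explicitly instead of citing nothing.
        note = f" ({rationale})" if rationale else ""
        tags.append(f"[Assumed]{note}")
    return f"{paragraph} {' '.join(tags)}"
```

For instance, a paragraph backed by a video segment starting at 3725.4 seconds, slide 12, and exam question Q2b would end with `[vid 01:02:05] [slide 12] [exam Q2b]`.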
Diagrams
- diagrams.py auto-generates Graphviz DOT for flows or processes detected by pattern mining; exports to SVG/PNG and includes them in the PDF.
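As a rough illustration of the DOT generation step, the sketch below turns an ordered list of step labels into a left-to-right flow. It assumes the pattern mining has already produced the steps; `steps_to_dot` is a hypothetical helper name.

```python
def steps_to_dot(steps: list[str], name: str = "flow") -> str:
    """Build a Graphviz DOT digraph that chains the given steps in order."""

    def quote(label: str) -> str:
        # Escape backslashes and double quotes so the label stays a valid DOT string.
        return '"' + label.replace("\\", "\\\\").replace('"', '\\"') + '"'

    lines = [f"digraph {name} {{", "  rankdir=LR;", "  node [shape=box];"]
    for index, step in enumerate(steps):
        lines.append(f"  n{index} [label={quote(step)}];")
    for index in range(len(steps) - 1):
        lines.append(f"  n{index} -> n{index + 1};")
    lines.append("}")
    return "\n".join(lines)
```

The resulting text can be saved to a file and rendered with the Graphviz command line, for example `dot -Tsvg flow.dot -o flow.svg`.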
Render
- templater.py: Jinja → Markdown & Typst glue.
- typst_renderer.py: compile config/templates/typst/main.typ → PDF; fallback: Pandoc + wkhtmltopdf if pdf.engine=pandoc.
- Add ToC, headers/footers, figure/table numbering. Large font option.
QA
- checks.py: math block compilation check (for Typst LaTeX), internal link checker, keyword recall vs exam topics, equation symbol consistency (simple regex rules).
Reports
- Save citations.json, coverage.csv, and a report.txt summary.
- CLI report prints pretty summary tables.
7) Typst Templates (Minimal but elegant)
- config/templates/typst/main.typ:
  - A4, ToC, page numbers, header with course/session, footer with date.
  - Figure/table numbering, code block styling, math environment.
- theme.typ for fonts (use system fonts like Inter/Noto if installed).
8) Testing (pytest)
Write meaningful unit tests:
- test_ingestion.py: verify audio extraction called; transcript normalize order; OCR fallback triggers.
- test_parsers.py: pptx/pdf parsing counts; exam marks extraction.
- test_topic_mapping.py: check known chunk maps to expected topic.
- test_render.py: template renders without undefined vars; dummy Typst compile mocked.
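To make the level of detail expected from these tests concrete, here is a small sketch of an exam marks extraction test in the style of test_parsers.py. The `extract_questions` helper, its regular expression, and the sample question layout ("Q2b. ... [8 marks]") are assumptions for illustration; real exam papers will need more patterns.

```python
import re

# Hypothetical layout: "Q<number><optional part letter>" then the question text,
# ending with the marks in square brackets or parentheses.
QUESTION_RE = re.compile(
    r"^Q(?P<num>\d+)(?P<part>[a-z])?[.)]?[ \t]+(?P<text>.+?)[ \t]*"
    r"[\[(](?P<marks>\d+)[ \t]*marks?[\])][ \t]*$",
    re.MULTILINE,
)


def extract_questions(paper: str) -> list[dict]:
    """Return question id, text, and marks for every matching line."""
    return [
        {
            "id": f"Q{m.group('num')}{m.group('part') or ''}",
            "text": m.group("text"),
            "marks": int(m.group("marks")),
        }
        for m in QUESTION_RE.finditer(paper)
    ]


def test_exam_marks_extraction():
    paper = (
        "Section A\n"
        "Q1a) Define bias. (2 marks)\n"
        "Q2b. Derive the normal equations. [8 marks]\n"
        "Q3. State Bayes' theorem. [1 mark]\n"
    )
    questions = extract_questions(paper)
    # Section headings are not questions and must be skipped.
    assert [q["id"] for q in questions] == ["Q1a", "Q2b", "Q3"]
    assert sum(q["marks"] for q in questions) == 11
    assert questions[1]["text"] == "Derive the normal equations."
```

The test uses plain assert statements, so pytest can collect it without extra fixtures.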
Add Makefile targets: make setup, make test, make lint, make build-demo.
9) README.md
Include:
- Feature list & pipeline diagram (ASCII or Mermaid code fenced).
- macOS install steps (brew: ffmpeg, tesseract, graphviz, Typst).
- Python env via Poetry/uv, spaCy model install.
- Ollama install + ollama pull llama3.2:8b.
- Quickstart:
  examkit ingest --manifest manifest.json
  examkit build --config config/config.yml --out out/lec05.pdf --offline
  examkit report --session lec05 --open
- Troubleshooting (OCR confidence, ASR model size, typst not found).
10) Code Quality
- Use type hints, docstrings (Google style), and logging (info/debug).
- No network calls if offline=true (enforce/guard).
- Fail gracefully with actionable error messages.
- Keep modules small, cohesive, and unit-testable.
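One way the offline guard might be enforced is to intercept outgoing socket connections while processing runs. The sketch below is an assumption about how this could be done, not a required design: `offline_guard` and `OfflineViolation` are hypothetical names, and loopback addresses stay allowed on the assumption that Ollama is reached over local HTTP.

```python
import socket
from contextlib import contextmanager


class OfflineViolation(RuntimeError):
    """Raised when code attempts a non-local connection in offline mode."""


@contextmanager
def offline_guard(enabled: bool = True,
                  allow_hosts: tuple = ("127.0.0.1", "localhost", "::1")):
    """Block socket connections to non-local hosts while the context is active."""
    if not enabled:
        yield
        return

    original_connect = socket.socket.connect

    def guarded_connect(self, address):
        # IP sockets pass (host, port, ...); Unix sockets pass a plain path.
        host = address[0] if isinstance(address, tuple) else address
        if host not in allow_hosts:
            raise OfflineViolation(
                f"Network access to {host!r} is blocked because offline=true. "
                "Set offline: false in the config to allow it."
            )
        return original_connect(self, address)

    socket.socket.connect = guarded_connect
    try:
        yield
    finally:
        # Always restore the original method, even if processing failed.
        socket.socket.connect = original_connect
```

Wrapping the build pipeline in `with offline_guard(config_offline):` would turn an accidental download attempt into an immediate, actionable error. Libraries that open connections through other low-level paths would need additional handling.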
11) Sample Fixtures
Under input/sample/ provide tiny placeholder files or instruction stubs. Include minimal session_topics.yml and exam_topics.yml with 3–5 topics and weights so tests can run.
12) Acceptance Criteria
- Running the sample pipeline produces: out/lec05.pdf (non-empty, ToC present), out/citations.json (≥10 entries), out/coverage.csv (topics × sources), out/notes.md (compiled source).
- CLI help works, progress bars/logs are shown, and no external calls are made.
- All tests pass with pytest -q.
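Part of these criteria can be checked automatically. The sketch below is one possible shape for such a check; `check_artifacts` is a hypothetical helper, it assumes citations.json holds a top-level list or object, and it does not verify that the PDF contains a ToC, which would need a PDF library.

```python
import json
from pathlib import Path


def _non_empty(path: Path) -> bool:
    """True when the file exists and has at least one byte."""
    return path.is_file() and path.stat().st_size > 0


def check_artifacts(out_dir: Path, min_citations: int = 10) -> list[str]:
    """Return a list of human-readable problems; an empty list means success."""
    problems: list[str] = []

    if not _non_empty(out_dir / "lec05.pdf"):
        problems.append("lec05.pdf is missing or empty")

    try:
        entries = json.loads((out_dir / "citations.json").read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        problems.append("citations.json is missing or not valid JSON")
    else:
        count = len(entries) if isinstance(entries, (list, dict)) else 0
        if count < min_citations:
            problems.append(
                f"citations.json has {count} entries, expected at least {min_citations}"
            )

    for name in ("coverage.csv", "notes.md"):
        if not _non_empty(out_dir / name):
            problems.append(f"{name} is missing or empty")

    return problems
```

A make build-demo target could call this after the sample build and fail when the returned list is not empty.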
Now generate the entire repository exactly as specified above. Provide each file’s full content with proper code blocks and file headers.