This repository contains a collection of Python scripts and tools for text extraction, categorisation, and prompt engineering. Its main components are a JSON database of research papers with their extracted Prompt Patterns (PPs) and Prompt Examples (PEs), and scripts for extracting text from PDFs, categorising text using Cosine Similarity, and generating and testing prompts for AI models.
- Installation
- Usage
- Directory Structure
- Contributing
- License
- Security Policy
- Responsible Use Guidelines
- Prompt Pattern Dictionary (Web App)
- Clone the repository:
git clone https://github.com/yourusername/your-repo.git
cd your-repo
- Create and activate a virtual environment:
python -m venv .venv
source .venv/bin/activate # On Windows use `.venv\Scripts\activate`
- Install the required dependencies:
pip install -r requirements.txt
- Set up environment variables by creating a `.env` file in the root directory and adding the necessary keys:
AZURE_OPENAI_MODEL=<your-model>
API_VERSION=<your-api-version>
AZURE_OPENAI_KEY=<your-api-key>
AZURE_OPENAI_ENDPOINT=<your-endpoint>
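Scripts that call Azure OpenAI need all four keys to be present, so a start-up check can fail early with a clear message. The following is a minimal sketch, not code from this repository: the helper name `load_azure_settings` is hypothetical, and it assumes the `.env` values have already been loaded into the process environment (for example by a loader such as python-dotenv).

```python
import os
from typing import Dict, Mapping

# The four keys listed in the .env example above.
REQUIRED_KEYS = (
    "AZURE_OPENAI_MODEL",
    "API_VERSION",
    "AZURE_OPENAI_KEY",
    "AZURE_OPENAI_ENDPOINT",
)


def load_azure_settings(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return the required settings, or raise if any are missing or empty.

    Hypothetical helper for illustration; the repository's scripts may
    read these variables differently.
    """
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {key: env[key] for key in REQUIRED_KEYS}
```

Passing the environment as a parameter keeps the check easy to exercise with a plain dictionary instead of real credentials.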
To extract text from a PDF file, use the extractTextFromPDF.py script. Below are some examples:
python extractTextFromPDF.py -filename "Test.pdf"
python extractTextFromPDF.py -filename "Test.pdf" -pages 1-10
python extractTextFromPDF.py -filename "Test.pdf" -pages 1-10 -extractexamples True
python extractTextFromPDF.py -filename "Test.pdf" -pages 1-10 -summary True
python extractTextFromPDF.py -filename "Test.pdf" -pages 1-10 -keypoints True
To categorise text using Cosine Similarity, use the categorisation_cosine_similarity.py script:
python categorisation_cosine_similarity.py --top_n 5
python categorisation_cosine_similarity.py --threshold 0.5
To generate and test prompts, use the testPrompts.py script:
python testPrompts.py
python vision_testPrompts.py
To export and count the PPs and PEs from the promptpatterns.json JSON file, use the exportPromptPatternsJSONfile.py script.
Below are some example usages:
- Print the PPs and PEs to the console:
This will print the PPs and PEs to the console in a formatted way.
python exportPromptPatternsJSONfile.py --format console
- Write the PPs and PEs to an HTML file with the default filename `promptpatterns.html`:
This will write the PPs and PEs to an HTML file called promptpatterns.html in the same directory as the script.
python exportPromptPatternsJSONfile.py --format html
- Write the PPs and PEs to an HTML file with a custom filename:
This will write the PPs and PEs to an HTML file called `mypromptpatterns.html` in the same directory as the script.
python exportPromptPatternsJSONfile.py --format html --filename mypromptpatterns.html
- Include the current date in the filename of the HTML file:
This will write the PPs and PEs to an HTML file with a filename that includes the current date in the format `promptpatterns_YYYYmmdd.html`.
python exportPromptPatternsJSONfile.py --format html --filename promptpatterns_{date}.html
- Count the number of Titles, PatternCategory, and pattern name:
This will count the number of Titles, PatternCategory, and pattern name and output it to the console.
python exportPromptPatternsJSONfile.py --count
Contributions are welcome! Please open an issue or submit a pull request for any improvements, research paper additions or bug fixes.
For ethical / dual‑use concerns use the "Responsible Use Report" issue template. For AI-enriched metadata corrections use the "AI-Assisted Field Correction" template.
Security vulnerabilities should follow the coordinated disclosure process in SECURITY.md (private advisory or email) rather than a public issue.
This project is licensed under the MIT License. See the LICENSE file for details.
See prompt-pattern-dictionary/SECURITY.md for supported versions and coordinated disclosure steps. Avoid including sensitive exploit payloads or personal data in reports.
A standalone page at /responsible-use in the web app and the Orientation "Accessibility & Responsible Use" section document:
- Core principles: transparency, defensive focus, privacy, inclusivity.
- Acceptable uses: research, defensive tooling, education, evaluation with non-sensitive data.
- Prohibited uses: real exploit/malware generation, phishing deployment, guardrail bypass attempts.
- Safeguards: provenance badges, planned caution indicators, issue templates, minimal telemetry (opt-in, excludes prompt content).
Report ethical concerns with the issue label responsible-use-review; corrections to AI-assisted fields with ai-assist-correction.
The prompt-pattern-dictionary/ subfolder contains a Next.js application and a data pipeline to build a searchable dictionary of prompt patterns.
Key build notes:
- Data pipeline script: `prompt-pattern-dictionary/scripts/build-data.js`
- Python steps (embeddings, categorization, enrichment) auto-detect and prefer `uv run` when available. To force uv on Windows PowerShell:
$env:USE_UV = "1"
node .\prompt-pattern-dictionary\scripts\build-data.js --enrich --enrich-limit 10 --enrich-fields template
- Enrichment flags:
  - `--enrich` to enable optional enrichment via Azure OpenAI (GPT-5)
  - `--enrich-limit <n>` to cap items processed
  - `--enrich-fields <csv>` to scope fields: `template,application,dependentLLM,turn`
- GPT-5 temperature behavior: The enrichment pipeline does not set `temperature` for GPT-5 (Azure requires default temperature). The client also retries without `temperature` if the service rejects the parameter.
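The temperature behavior described above amounts to a small retry rule. The sketch below illustrates that rule only; it is not the pipeline's actual client. `send`, `ParameterRejected`, and `call_with_temperature_fallback` are hypothetical names, and `send` stands in for whatever function really calls the service.

```python
from typing import Any, Callable, Dict, Optional


class ParameterRejected(Exception):
    """Hypothetical error raised when the service rejects a request parameter."""


def call_with_temperature_fallback(
    send: Callable[[Dict[str, Any]], Any],
    request: Dict[str, Any],
    temperature: Optional[float] = None,
) -> Any:
    """Send a request, retrying once without `temperature` if it is rejected."""
    if temperature is None:
        # GPT-5 path: the parameter is never set in the first place.
        return send(dict(request))
    try:
        return send({**request, "temperature": temperature})
    except ParameterRejected:
        # The service refused the parameter: retry with its default temperature.
        return send(dict(request))
```

Only the rejection error triggers the retry, so unrelated failures (network errors, bad credentials) still surface to the caller.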
Run these from the repository root unless noted.
- Install dependencies for the web app:
cd .\prompt-pattern-dictionary
npm install
- Build data (required before first run and whenever source JSON changes):
# Optional: prefer uv for any Python steps in the pipeline
$env:USE_UV = "1"
node .\scripts\build-data.js
- Start in development mode:
npm run dev
# Open http://localhost:3000
- Build for production and start the server:
npm run build
npm start
# Open http://localhost:3000
Notes:
- The `npm run build` script runs the full pipeline: data transform, normalized schema, semantic categories, and `next build`.
- Use `npm run export` if you want a static export (files in `prompt-pattern-dictionary/out`).
If the repo is inside a OneDrive-synced directory (including OneNote notebooks), the .next build folder may be locked or partially synced, causing build or dev server errors (e.g., EBUSY/EPERM on Windows).
Workarounds:
- Exclude the project (or at least the `.next` folder) from OneDrive sync.
- Move the project outside OneDrive-synced paths (recommended for Next.js development).
- If a lock occurs, close OneNote/OneDrive temporarily, delete `.next`, and re-run `npm run dev` or `npm run build`.
The Orientation content (how to use the dictionary) was refactored from a single long page into a hybrid, multi-page structure:
- Hub: `/orientation` – overview cards linking to each section plus links to “All Sections” and the Cheat Sheet.
- Per-section routes: `/orientation/{slug}` – focused pages (quick-start, what-is-a-pattern, pattern-anatomy, lifecycle, choosing-patterns, combining-patterns, adaptation, anti-patterns, quality-evaluation, accessibility-responsible-use, glossary, faq, feedback, next-steps).
- Consolidated legacy view: `/orientation/all` – full scrollable content (retains original anchors for deep link continuity).
- Printable / rapid reference: `/orientation/cheatsheet` – condensed key constructs and workflows.
Sections are metadata-driven via ORIENTATION_SECTIONS (number, slug, title, component). Navigation components (sidebar + inline chip set) render from this single source of truth; the pager component wires previous/next traversal.
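To show how a single metadata list can drive previous/next traversal, here is a Python model of the idea. The real `ORIENTATION_SECTIONS` lives in the web app and also carries a title and component per entry; the section order, the 1-based numbering, and the function name `pager_links` below are assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# Slugs as listed in this README; their order and numbering are assumed.
SECTION_SLUGS: List[str] = [
    "quick-start", "what-is-a-pattern", "pattern-anatomy", "lifecycle",
    "choosing-patterns", "combining-patterns", "adaptation", "anti-patterns",
    "quality-evaluation", "accessibility-responsible-use", "glossary", "faq",
    "feedback", "next-steps",
]

# One metadata record per section (title and component omitted in this model).
SECTIONS = [{"number": i, "slug": slug} for i, slug in enumerate(SECTION_SLUGS, start=1)]


def pager_links(slug: str) -> Tuple[Optional[str], Optional[str]]:
    """Return the (previous, next) routes for a section, with None at either end."""
    slugs = [section["slug"] for section in SECTIONS]
    index = slugs.index(slug)  # raises ValueError for an unknown slug
    previous = f"/orientation/{slugs[index - 1]}" if index > 0 else None
    following = f"/orientation/{slugs[index + 1]}" if index < len(slugs) - 1 else None
    return previous, following
```

Because the sidebar, chip set, and pager would all read the same list, adding or reordering a section is a one-line change.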
User-adjustable preferences enhance accessibility and reading comfort:
- Font scale: data attribute `data-font-scale` applied to `<html>` with supported values -1, 0, 1, 2 (base, + steps). CSS scales body text and headings accordingly.
- Width mode: `data-width-mode=default|relaxed`; relaxed widens prose up to ~85ch for users needing fewer line wraps.
- Theme / contrast: `data-theme=light|dark|high-contrast` with a `system` option that removes the attribute and defers to `prefers-color-scheme` media queries. (The legacy value `hc` is auto-migrated if found in saved preferences.) See `prompt-pattern-dictionary/docs/THEMING.md` for full token architecture.
- Persistence: Stored under localStorage key `orientation:readability:v1`; hydration script replays settings and applies attributes with minimal layout shift.
- UI: `ReadabilityControls` component (toolbar) with buttons for font scaling, width toggle, and a select for theme mode. Appears in orientation layout (sidebar desktop + inline mobile). Can be reused site‑wide later.
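The mapping from saved preferences to `data-*` attributes can be modelled as a pure function. This is a Python sketch of the rules listed above, not the app's hydration script: the stored key names (`fontScale`, `widthMode`, `theme`), the fallback defaults, and the choice to always emit the font and width attributes are assumptions.

```python
from typing import Any, Dict

FONT_SCALES = {-1, 0, 1, 2}
WIDTH_MODES = {"default", "relaxed"}
THEMES = {"light", "dark", "high-contrast", "system"}


def _pick(value: Any, allowed: set, default: Any) -> Any:
    """Return value if it is an allowed value of the default's type, else default."""
    return value if type(value) is type(default) and value in allowed else default


def html_attributes(saved: Dict[str, Any]) -> Dict[str, str]:
    """Map saved preferences to the data-* attributes set on <html>."""
    theme = saved.get("theme", "system")
    if theme == "hc":
        theme = "high-contrast"  # migrate the legacy value
    theme = _pick(theme, THEMES, "system")

    attributes = {
        "data-font-scale": str(_pick(saved.get("fontScale", 0), FONT_SCALES, 0)),
        "data-width-mode": _pick(saved.get("widthMode", "default"), WIDTH_MODES, "default"),
    }
    if theme != "system":
        # "system" leaves data-theme unset so prefers-color-scheme applies.
        attributes["data-theme"] = theme
    return attributes
```

Unknown or malformed values fall back to the defaults, so a stale entry in localStorage cannot put the page into an unsupported state.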
A lightweight client component (LegacyHashRedirect) preserves backward compatibility for old single-page anchors:
- On mount, it inspects `window.location.hash`.
- If the hash matches a known section slug and you are not already on that route, it calls `router.replace()` to navigate to `/orientation/{slug}`.
- If the hash does not match a known slug and you are not on `/orientation/all`, it redirects to `/orientation/all#hash`, ensuring deep links to sub‑headings still land meaningfully.
- If already on `/orientation/all`, no action is taken.
This maintains existing external links and bookmarks without server‑side redirects. A future enhancement may introduce explicit server (or Next.js middleware) 301 mappings for improved SEO signals—tracked as a backlog item.
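The redirect rules listed above can be expressed as one decision function. The component itself runs in the browser; the Python below only models its decisions. The function name `legacy_hash_redirect` is hypothetical, and two behaviours are assumptions: an empty hash does nothing, and the `/orientation/all` check takes priority over slug matching.

```python
from typing import Optional, Set


def legacy_hash_redirect(pathname: str, url_hash: str, known_slugs: Set[str]) -> Optional[str]:
    """Return the route to replace the current one with, or None for no action."""
    fragment = url_hash.lstrip("#")
    if not fragment:
        return None  # no hash, nothing to redirect (assumed behaviour)
    if pathname == "/orientation/all":
        return None  # the consolidated view already contains every anchor
    if fragment in known_slugs:
        target = f"/orientation/{fragment}"
        return None if pathname == target else target
    # Unknown hash: send it to the consolidated view, keeping the anchor.
    return f"/orientation/all#{fragment}"
```

For example, an old bookmark to `/orientation#glossary` resolves to `/orientation/glossary`, while a hash that is not a section slug lands on `/orientation/all` with the anchor preserved.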
To add a new user preference:
- Extend the `usePreferences` hook (under `prompt-pattern-dictionary/src/app/orientation/hooks/`) with state, localStorage serialization, and dataset syncing.
- Choose a descriptive `data-*` attribute name; keep attribute count minimal (prefer reusing existing tokens vs. additive bespoke classes).
- Update the `ReadabilityControls` UI (or create a new modular control) with accessible semantics (button labels, `aria-pressed`, or form elements as appropriate).
- Document the accepted values and rationale in this README (and optionally in a dedicated `docs/ACCESSIBILITY.md` or `docs/PREFERENCES.md`).
- Add non-destructive CSS tied to the attribute in `globals.css` guarded by clear comments.
Guardrails:
- Avoid preferences that require layout reflow more than once per interaction.
- Provide reversible changes (toggle or reset option) if adding multi-step controls.
- Maintain WCAG contrast compliance across all themes and states.
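One way to check the contrast guardrail when adding theme colours is to compute the WCAG 2.x contrast ratio directly. The sketch below follows the published relative-luminance formula; it is not part of this repository. The 4.5:1 figure mentioned afterwards is the WCAG AA threshold for normal-size text, and which conformance level this project targets is not stated here.

```python
def _linear(channel: int) -> float:
    """Convert an 8-bit sRGB channel to its linear-light value (WCAG 2.x formula)."""
    c = channel / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_colour: str) -> float:
    """Relative luminance of a colour given as '#rrggbb'."""
    value = hex_colour.lstrip("#")
    r, g, b = (int(value[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linear(r) + 0.7152 * _linear(g) + 0.0722 * _linear(b)


def contrast_ratio(foreground: str, background: str) -> float:
    """WCAG contrast ratio, from 1.0 (identical colours) to 21.0 (black on white)."""
    lighter, darker = sorted(
        (relative_luminance(foreground), relative_luminance(background)),
        reverse=True,
    )
    return (lighter + 0.05) / (darker + 0.05)
```

As a sanity check, `contrast_ratio("#767676", "#ffffff")` comes out at about 4.54, just above 4.5:1, while `#777777` on white falls just below it.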