Promptwright - Synthetic Dataset Generation Library

Promptwright is a Python library from Stacklok designed for generating large synthetic datasets using a local LLM. The library offers a flexible and easy-to-use set of interfaces that enable users to generate prompt-led synthetic datasets.

Promptwright was inspired by redotvideo/pluto; in fact, it started as a fork, but ended up largely a rewrite to allow dataset generation against a local LLM model, as opposed to OpenAI, where costs can be prohibitively expensive.

The library interfaces with Ollama, making it easy to just pull a model and run Promptwright.

Features

  • Local LLM Client Integration: Interact with Ollama-based models
  • Configurable Instructions and Prompts: Define custom instructions and system prompts
  • Push to Hugging Face: Push the generated dataset to the Hugging Face Hub

Getting Started

Prerequisites

  • Python, with pip available to install the library
  • Ollama, installed locally to serve the model

Installation

To install the library and prepare a model, run the following commands:

pip install promptwright
ollama serve
ollama pull {model_name} # whichever model you want to use

Example Usage

There are a few examples in the examples directory that demonstrate how to use the library to generate different topic-based datasets.

Running an Example

To run an example:

  1. Ensure you have started Ollama by running ollama serve.
  2. Verify that the required model is downloaded (e.g. llama3.2:latest).
  3. Set the model_name in the chosen example file to the model you have downloaded:

    engine = LocalDataEngine(
        args=LocalEngineArguments(
            instructions="Generate creative writing prompts and example responses.",
            system_prompt="You are a creative writing instructor providing writing prompts and example responses.",
            model_name="llama3.2:latest",
            temperature=0.9,  # Higher temperature for more creative variations
            max_retries=2,
        )
    )

  4. Run your chosen example file:

    python examples/creative_writing.py

  5. The generated dataset will be saved as a JSONL file at the path passed to dataset.save(); a full end-to-end sketch follows below.
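Putting the steps together, a minimal end-to-end sketch might look like the following. The engine construction mirrors the snippet above; the import path, the create_data() call, and its parameters are assumptions based on the class overview further down, not confirmed API.

from promptwright import LocalDataEngine, LocalEngineArguments  # import path assumed

engine = LocalDataEngine(
    args=LocalEngineArguments(
        instructions="Generate creative writing prompts and example responses.",
        system_prompt="You are a creative writing instructor providing writing prompts and example responses.",
        model_name="llama3.2:latest",
        temperature=0.9,  # higher temperature for more creative variations
        max_retries=2,
    )
)

# Assumed generation entry point; check the examples directory for the real call.
dataset = engine.create_data(num_steps=5, batch_size=1)

# dataset.save() writes the generated records to a JSONL file at the path you choose.
dataset.save("creative_writing.jsonl")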

Prompt Output Examples

{
  "messages": [
    {
      "role": "system",
      "content": "You are tasked with designing an immersive virtual reality experience that transports users to a fantastical world of wonder."
    },
    {
      "role": "user",
      "content": "Create a descriptive passage about a character discovering their hidden talents."
    },
    {
      "role": "assistant",
      "content": "As she stared at the canvas, Emma's fingers hovered above the paintbrushes, as if hesitant to unleash the colors that had been locked within her. The strokes began with bold abandon, swirling blues and greens merging into a mesmerizing dance of light and shadow. With each passing moment, she felt herself becoming the art – her very essence seeping onto the canvas like watercolors in a spring storm. The world around her melted away, leaving only the vibrant symphony of color and creation."
    }
  ]
}
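Because the dataset is saved as JSONL, each line of the output file is one standalone JSON object in the chat format above, so it can be inspected with the standard library alone (the filename here is whatever was passed to dataset.save()):

import json

records = []
with open("creative_writing.jsonl") as f:
    for line in f:
        records.append(json.loads(line))

# Every record carries a "messages" list of system/user/assistant turns.
first = records[0]["messages"]
print(first[0]["role"], "->", first[0]["content"][:60])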

Library Overview

Classes

  • Dataset: A class for managing generated datasets.
  • LocalDataEngine: The main engine responsible for interacting with the LLM client and generating datasets.
  • LocalEngineArguments: A configuration class that defines the instructions, system prompt, model name, temperature, retries, and prompt templates used for generating data.
  • OllamaClient: A client class for interacting with the Ollama API.
  • HFUploader: A utility class for uploading datasets to Hugging Face (pass in the path to the dataset and a token).
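Going by the one-line description above, an upload might look like the sketch below; the constructor and method names are guesses from that description, so check the source for the actual signatures.

# Hypothetical sketch only: HFUploader's real constructor and method names
# are not documented here, so treat every call below as an assumption.
from promptwright import HFUploader

uploader = HFUploader("hf_your_access_token")       # token assumed as constructor arg
uploader.push_to_hub("your-username/your-dataset",  # target repo id, assumed parameter
                     "creative_writing.jsonl")      # path to the saved dataset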

Troubleshooting

If you encounter any errors while running an example, here are a few common troubleshooting steps:

  1. Restart Ollama:

    killall ollama && ollama serve
  2. Verify Model Installation:

    ollama pull {model_name}
  3. Check Ollama Logs:
    Inspect the logs for any error messages that might provide more context on what went wrong; these can be found in the ~/.ollama/logs directory.
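As an extra check, Ollama exposes an HTTP API on localhost:11434 by default, and its /api/tags endpoint lists the locally installed models, so a short snippet can confirm both that the server is up and that your model is pulled:

import json
import urllib.request

# Ollama's HTTP API listens on localhost:11434 by default.
with urllib.request.urlopen("http://localhost:11434/api/tags") as resp:
    models = json.load(resp)["models"]

# Your model (e.g. llama3.2:latest) should appear in this list.
print([m["name"] for m in models])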

Model Compatibility

The library should work with most LLM models. It has been tested with the following models so far:

  • LLaMA3: The library is designed to work with the LLaMA family, specifically the llama3:latest model.
  • Mistral: The library is compatible with Mistral models served through Ollama.

If you test any more models, please open a pull request to update this list!

Contributing

If something here could be improved, please open an issue or submit a pull request.

License

This project is licensed under the Apache 2.0 License. See the LICENSE file for more details.
