🧩 LLM Structured Output Benchmarks


Benchmark various LLM Structured Output frameworks: Instructor, Mirascope, Langchain, LlamaIndex, Fructose, Marvin, Outlines, LMFormatEnforcer, etc., on tasks like multi-label classification, named entity recognition, and synthetic data generation.

🏆 Benchmark Results [2024-07-20]

  1. Multi-label classification

     | Framework | Model | Reliability | Latency p95 (s) |
     |---|---|---|---|
     | Instructor | gpt-4o-mini-2024-07-18 | 1.000 | 1.096 |
     | Mirascope | gpt-4o-mini-2024-07-18 | 1.000 | 1.523 |
     | Fructose | gpt-4o-mini-2024-07-18 | 1.000 | 2.256 |
     | Outlines | unsloth/llama-3-8b-Instruct-bnb-4bit | 1.000 | 1.891* |
     | LMFormatEnforcer | unsloth/llama-3-8b-Instruct-bnb-4bit | 1.000 | 2.994* |
     | Llamaindex | gpt-4o-mini-2024-07-18 | 0.999 | 0.936 |
     | Modelsmith | gpt-4o-mini-2024-07-18 | 0.999 | 1.333 |
     | Marvin | gpt-4o-mini-2024-07-18 | 0.998 | 1.722 |
  2. Named Entity Recognition

     | Framework | Model | Reliability | Latency p95 (s) | Precision | Recall | F1 Score |
     |---|---|---|---|---|---|---|
     | Instructor | gpt-4o-mini-2024-07-18 | 1.000 | 3.319 | 0.807 | 0.733 | 0.768 |
     | LMFormatEnforcer | unsloth/llama-3-8b-Instruct-bnb-4bit | 1.000 | 9.655* | 0.761 | 0.488 | 0.595 |
     | Mirascope | gpt-4o-mini-2024-07-18 | 0.998 | 6.531 | 0.805 | 0.644 | 0.715 |
     | Llamaindex | gpt-4o-mini-2024-07-18 | 0.997 | 2.212 | 0.770 | 0.106 | 0.186 |
     | Marvin | gpt-4o-mini-2024-07-18 | 0.936 | 4.179 | 0.815 | 0.797 | 0.806 |

* Local models benchmarked on an NVIDIA GeForce RTX 4080 Super GPU

🏃 Run the benchmark

  1. Install the requirements using pip install -r requirements.txt
  2. Set the OpenAI API key: export OPENAI_API_KEY=sk-...
  3. Run the benchmark using python -m main run-benchmark
  4. Raw results are stored in the results directory.
  5. Generate the results using:
    • Multilabel classification: python -m main generate-results
    • NER: python -m main generate-results --results-data-path ./results/ner --task ner
  6. To get help on the command line arguments, add --help after the command, e.g., python -m main run-benchmark --help

🧪 Benchmark methodology

  1. Multi-label classification:
    • Task: Given a text, predict the labels associated with it.
    • Data:
      • Base data: Alexa intent detection dataset
      • The benchmark is run on synthetic data generated by running: python -m data_sources.generate_dataset generate-multilabel-data.
      • The synthetic data is generated by sampling and combining rows from the base data so that each row has multiple classes, following a configurable distribution of the number of classes per row. See python -m data_sources.generate_dataset generate-multilabel-data --help for more details.
    • Prompt: "Classify the following text: {text}"
    • Evaluation Metrics:
      1. Reliability: The percentage of times the framework returns valid labels without errors, computed as the average of each row's percent_successful value (see the sketch after this list).
      2. Latency: The 95th percentile of the time taken to run the framework on the data.
    • Experiment Details: Each row is run through the framework n_runs times, and the percentage of successful runs is logged for each row.
  2. Named Entity Recognition
    • Task: Given a text, extract the entities present in it.
    • Data:
      • Base data: Synthetic PII Finance dataset
      • The benchmark is run on data sampled by running: python -m data_sources.generate_dataset generate-ner-data.
      • The data is sampled from the base data so that the number of entities per row follows a configurable distribution. See python -m data_sources.generate_dataset generate-ner-data --help for more details.
    • Prompt: "Extract and resolve a list of entities from the following text: {text}"
    • Evaluation Metrics:
      1. Reliability: The percentage of times the framework returns valid labels without errors, computed as the average of each row's percent_successful value.
      2. Latency: The 95th percentile of the time taken to run the framework on the data.
      3. Precision: The micro average of the precision of the framework on the data.
      4. Recall: The micro average of the recall of the framework on the data.
      5. F1 Score: The micro average of the F1 score of the framework on the data.
    • Experiment Details: Each row is run through the framework n_runs times, and the percentage of successful runs is logged for each row.
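
For reference, the sketch below shows one way these metrics could be computed from per-row outputs. The function names and signatures are illustrative assumptions, not the repository's actual API; it assumes numpy is available.

```python
# Illustrative only: how the reported metrics can be derived from per-row results.
import numpy as np

def reliability(percent_successful: list[float]) -> float:
    # Average of each row's percent_successful value.
    return float(np.mean(percent_successful))

def latency_p95(latencies_s: list[float]) -> float:
    # 95th percentile of the per-call latencies, in seconds.
    return float(np.percentile(latencies_s, 95))

def micro_prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    # Micro averaging: pool true positives, false positives, and false negatives
    # across all rows before computing precision, recall, and F1.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```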

📊 Adding new data

  1. Create a new pandas dataframe pickle file with the following columns:
    • text: The text to be sent to the framework
    • labels: List of labels associated with the text
    • See data/multilabel_classification.pkl for an example; a sketch of creating such a file is shown after this list.
  2. Add the path to the new pickle file in the ./config.yaml file under the source_data_pickle_path key for all the frameworks you want to test.
  3. Run the benchmark using python -m main run-benchmark to test the new data on all the frameworks!
  4. Generate the results using python -m main generate-results
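
A minimal sketch of creating a compatible source-data pickle is shown below. The example texts, labels, and output path are illustrative assumptions.

```python
import pandas as pd

# Each row needs a "text" column and a "labels" column holding a list of labels.
df = pd.DataFrame(
    {
        "text": [
            "Play some jazz and dim the lights",
            "What's the weather in Singapore tomorrow?",
        ],
        "labels": [
            ["play_music", "iot_hue_lightdim"],
            ["weather_query"],
        ],
    }
)

# Save the dataframe as a pickle, then point source_data_pickle_path in
# ./config.yaml to this file for every framework you want to test.
df.to_pickle("data/my_custom_dataset.pkl")
```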

🏗️ Adding a new framework

The easiest way to create a new framework is to reference the ./frameworks/instructor_framework.py file. Detailed steps are as follows (a minimal sketch of such a class is shown after this list):

  1. Create a .py file in the frameworks directory with the name of the framework, e.g., instructor_framework.py for the instructor framework.
  2. In this .py file create a class that inherits BaseFramework from frameworks.base.
  3. The class should define an init method that initializes the base class. Here are the arguments the base class expects:
    • task (str): the task that the framework is being tested on. Obtained from ./config.yaml file. Allowed values are "multilabel_classification" and "ner"
    • prompt (str): Prompt template used. Obtained from the init_kwargs in the ./config.yaml file.
    • llm_model (str): LLM model to be used. Obtained from the init_kwargs in the ./config.yaml file.
    • llm_model_family (str): LLM model family to be used. Currently supported values are "openai" and "transformers". Obtained from the init_kwargs in the ./config.yaml file.
    • retries (int): Number of retries for the framework. Default is 0. Obtained from the init_kwargs in the ./config.yaml file.
    • source_data_pickle_path (str): Path to the source data pickle file. Obtained from the init_kwargs in the ./config.yaml file.
    • sample_rows (int): Number of rows to sample from the source data. Useful for testing on a smaller subset of data. Default is 0, which uses all rows in source_data_pickle_path for the benchmarking. Obtained from the init_kwargs in the ./config.yaml file.
    • response_model (Any): The response model to be used. Internally passed by the benchmarking script.
  4. The class should define a run method that takes the following arguments:
    • inputs: a dictionary of {"text": str} where str is the text to be sent to the framework
    • n_runs: number of times to repeat each text
    • expected_response: Output expected from the framework
    • task: The task that the framework is being tested on. Obtained from the task in the ./config.yaml file, e.g., "multilabel_classification"
  5. Inside this run method, create a run_experiment function that takes inputs as its argument, runs that input through the framework, and returns the output.
  6. The run_experiment function should be annotated with the @experiment decorator from frameworks.base, with n_runs, expected_response, and task as arguments.
  7. The run method should call the run_experiment function and return the four outputs: predictions, percent_successful, metrics, and latencies.
  8. Import this new class in frameworks/__init__.py.
  9. Add a new entry in the ./config.yaml file with the name of the class as the key. The yaml entry can have the following fields
    • task: the task that the framework is being tested on. Obtained from ./config.yaml file. Allowed values are "multilabel_classification" and "ner"
    • n_runs: number of times to repeat each text
    • init_kwargs: all the arguments that need to be passed to the init method of the class, including those mentioned in step 3 above.
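
The sketch below ties the steps above together. It is illustrative only: the exact base-class attribute names (self.prompt, self.response_model), the run signature, and the call_my_framework helper are assumptions; see frameworks/instructor_framework.py for the working pattern.

```python
from typing import Any

from frameworks.base import BaseFramework, experiment


def call_my_framework(prompt: str, response_model: Any) -> Any:
    # Placeholder: replace with the structured-output call of the framework
    # you are wrapping.
    raise NotImplementedError


class MyFramework(BaseFramework):
    def __init__(self, **kwargs) -> None:
        # Forward task, prompt, llm_model, llm_model_family, retries,
        # source_data_pickle_path, sample_rows, etc. to the base class.
        super().__init__(**kwargs)

    def run(self, inputs: dict, n_runs: int, expected_response: Any, task: str):
        # The decorator repeats the call n_runs times and scores the outputs
        # against expected_response for the given task.
        @experiment(n_runs=n_runs, expected_response=expected_response, task=task)
        def run_experiment(inputs):
            # Fill the prompt template and call the framework under test.
            prompt = self.prompt.format(**inputs)
            return call_my_framework(prompt, self.response_model)

        predictions, percent_successful, metrics, latencies = run_experiment(inputs)
        return predictions, percent_successful, metrics, latencies
```

After this, import the class in frameworks/__init__.py and add a matching entry to ./config.yaml with the task, n_runs, and init_kwargs fields described in step 9.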

🧭 Roadmap

  1. Framework related tasks:

     | Framework | Multi-label classification | Named Entity Recognition | Synthetic Data Generation |
     |---|---|---|---|
     | Instructor | ✅ OpenAI | ✅ OpenAI | 💭 Planning |
     | Mirascope | ✅ OpenAI | ✅ OpenAI | 💭 Planning |
     | Fructose | ✅ OpenAI | 🚧 In Progress | 💭 Planning |
     | Marvin | ✅ OpenAI | ✅ OpenAI | 💭 Planning |
     | Llamaindex | ✅ OpenAI | ✅ OpenAI | 💭 Planning |
     | Modelsmith | ✅ OpenAI | 🚧 In Progress | 💭 Planning |
     | Outlines | ✅ HF Transformers | 🚧 In Progress | 💭 Planning |
     | LM format enforcer | ✅ HF Transformers | ✅ HF Transformers | 💭 Planning |
     | Jsonformer | ❌ No Enum Support | 💭 Planning | 💭 Planning |
     | Strictjson | ❌ Non-standard schema | ❌ Non-standard schema | ❌ Non-standard schema |
     | Guidance | 💭 Planning | 💭 Planning | 💭 Planning |
     | DsPy | 💭 Planning | 💭 Planning | 💭 Planning |
     | Langchain | 💭 Planning | 💭 Planning | 💭 Planning |
  2. Others
    • Latency metrics
    • CI/CD pipeline for benchmark run automation
    • Async run

💡 Contribution guidelines

Contributions are welcome! Here are the steps to contribute:

  1. Please open an issue with any new framework you would like to add. This will help avoid duplication of effort.
  2. Once the issue is assigned to you, please submit a PR with the new framework!

🎓 Citation

To cite LLM Structured Output Benchmarks in your work, please use the following bibtex reference:

@software{marie_stephen_leo_2024_12327267,
  author       = {Marie Stephen Leo},
  title        = {{stephenleo/llm-structured-output-benchmarks: 
                   Release for Zenodo}},
  month        = jun,
  year         = 2024,
  publisher    = {Zenodo},
  version      = {v0.0.1},
  doi          = {10.5281/zenodo.12327267},
  url          = {https://doi.org/10.5281/zenodo.12327267}
}

🙏 Feedback

If this work helped you in any way, please consider giving this repository a ⭐ to give me feedback so I can spend more time on this project.
