Hierarchical Prompting Taxonomy

A Universal Evaluation Framework for Large Language Models Aligned with Human Cognitive Principles
Table of Contents
  1. News
  2. Introduction
  3. Usage
  4. Datasets and Models
  5. References
  6. Contributing
  7. Cite Us

News

  • [02-26-25] HPT is accepted at the AAAI 2025 CogSci-AI Bridge! Check out the presentation here.
  • [06-18-24] HPT is published! Check out the paper here.

↑ Back to Top ↑

Introduction

  • Hierarchical Prompting Taxonomy (HPT) is a set of rules that maps prompting strategies onto human cognitive principles, enabling a universal measure of task complexity for LLMs.
  • Hierarchical Prompting Framework (HPF) selects the most effective prompt from five distinct prompting strategies, minimizing the cognitive load imposed on an LLM during task resolution. This allows for more accurate evaluation of LLMs and delivers more transparent insights.
  • Hierarchical Prompting Index (HPI) quantifies the task complexity each dataset poses to an LLM, offering insight into the cognitive demands a task places on different models (a sketch of one way to compute such an index appears below).

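For intuition, here is a minimal sketch of how such an index could be computed. It assumes each task sample records the lowest prompt level (1-5, per the five strategies listed under Prompt Engineering) at which the model succeeds, with a penalty score for unsolved samples; the penalty value and the plain averaging scheme are illustrative assumptions, not the exact formula from the paper.

    # Illustrative sketch of a Hierarchical Prompting Index (HPI).
    # Assumption: each sample stores the lowest prompt level (1..5) at which
    # the model answered correctly, or None if every level failed.
    # The penalty score and the plain average are illustrative choices,
    # not the paper's exact definition.
    from typing import Optional, Sequence

    PENALTY_LEVEL = 6  # assumed score when no prompting strategy solves the sample

    def hierarchical_prompting_index(solved_levels: Sequence[Optional[int]]) -> float:
        """Average prompt level needed per sample; higher = harder for the model."""
        scores = [lvl if lvl is not None else PENALTY_LEVEL for lvl in solved_levels]
        return sum(scores) / len(scores)

    # Example: two samples solved at levels 1 and 4, plus one unsolved sample.
    print(hierarchical_prompting_index([1, 4, None]))  # (1 + 4 + 6) / 3 ≈ 3.67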
↑ Back to Top ↑

Usage

After Cloning the Repository

Linux

To get started on Linux, run the following setup commands:

  1. Activate your conda environment:

    conda activate hpt
  2. Navigate to the main codebase:

    cd HPT/hierarchical_prompt
  3. Install the dependencies:

    pip install -r requirements.txt
  4. Add the required API keys:

    • Create a .env file with the following keys (see the loading sketch after this list):
    HF_TOKEN = "your HF token"
    OPENAI_API_KEY = "your API key"
    ANTHROPIC_API_KEY = "your API key"
  5. To run both frameworks, use the following command structure

    bash run.sh method model dataset [--thres num]
    • method

      • man
      • auto
    • model

      • gpt4o
      • claude
      • gemma2
      • nemo
      • llama3
      • phi3
      • gemma
      • mistral
    • dataset

      • mmlu
      • gsm8k
      • humaneval
      • boolq
      • csqa
      • iwslt
      • samsum
    • If the dataset is IWSLT or SamSum, add '--thres num'

    • num

      • 0.15
      • 0.20
      • 0.25
      • 0.30
      • or a higher threshold beyond those used in our experiments.
    • Example commands:

      bash run.sh man llama3 iwslt --thres 0.15
      bash run.sh auto phi3 boolq 
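As promised in step 4, here is a minimal sketch of how the .env keys might be consumed, assuming the codebase loads them with python-dotenv (an assumption; check requirements.txt for the actual mechanism):

    # Sketch: loading the API keys from the .env file created in step 4.
    # Assumes python-dotenv is installed; the actual HPT code may differ.
    import os
    from dotenv import load_dotenv

    load_dotenv()  # reads .env from the current working directory

    hf_token = os.getenv("HF_TOKEN")
    openai_key = os.getenv("OPENAI_API_KEY")
    anthropic_key = os.getenv("ANTHROPIC_API_KEY")

    if not all([hf_token, openai_key, anthropic_key]):
        raise RuntimeError("One or more API keys are missing; check your .env file")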

To Run the LLM-as-a-judge Experiment

  1. Navigate to the prompt_complexity directory:
      cd HPT/prompt_complexity
  2. Run the prompt_complexity script:
      python prompt_complexity.py
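For intuition, the sketch below shows what an LLM-as-a-judge complexity rating could look like: a judge model is asked to rate a task prompt against the five HPT levels. The judge model, prompt wording, and parsing here are illustrative assumptions, not the contents of prompt_complexity.py.

    # Illustrative LLM-as-a-judge loop: ask a judge model to rate how complex
    # a task prompt is on the five HPT levels. Model name, instructions, and
    # parsing are assumptions for illustration, not the actual script.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    JUDGE_INSTRUCTIONS = (
        "Rate the complexity of the following task on a scale of 1-5, where "
        "1 means solvable with simple role prompting and 5 means it needs "
        "generated knowledge prompting. Reply with a single integer."
    )

    def judge_complexity(task_prompt: str, judge_model: str = "gpt-4o") -> int:
        response = client.chat.completions.create(
            model=judge_model,
            messages=[
                {"role": "system", "content": JUDGE_INSTRUCTIONS},
                {"role": "user", "content": task_prompt},
            ],
        )
        return int(response.choices[0].message.content.strip())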

↑ Back to Top ↑

Datasets and Models

HPT currently supports the following datasets, models, and prompt engineering methods employed by HPF. You are welcome to add more.

Datasets

  • Reasoning datasets:
    • MMLU
    • CommonsenseQA
  • Coding datasets:
    • HumanEval
  • Mathematics datasets:
    • GSM8K
  • Question-answering datasets:
    • BoolQ
  • Translation datasets:
    • IWSLT-2017 en-fr
  • Summarization datasets:
    • SamSum

Models

  • Language models:
    • GPT-4o
    • Claude 3.5 Sonnet
    • Mistral Nemo 12B
    • Gemma 2 9B
    • Llama 3 8B
    • Mistral 7B
    • Phi 3 3.8B
    • Gemma 7B

Prompt Engineering

  • Role Prompting [1]
  • Zero-shot Chain-of-Thought Prompting [2]
  • Three-shot Chain-of-Thought Prompting [3]
  • Least-to-Most Prompting [4]
  • Generated Knowledge Prompting [5]
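Assuming the list order reflects the taxonomy's levels from lowest to highest cognitive load, the templates below sketch what each strategy looks like in practice; the wording is illustrative, not HPF's actual prompts.

    # Illustrative templates for the five prompting strategies, ordered by
    # assumed cognitive load. Wording is a sketch, not HPF's actual prompts.
    PROMPT_LEVELS = {
        1: "You are an expert {role}. {task}",               # Role Prompting [1]
        2: "{task}\nLet's think step by step.",              # Zero-shot CoT [2]
        3: "{examples}\n{task}\nLet's think step by step.",  # Three-shot CoT [3]
        4: ("Break the problem into simpler subproblems, solve them in order, "
            "then combine the answers.\n{task}"),            # Least-to-Most [4]
        5: ("First list facts relevant to the question, then use them to "
            "answer it.\n{task}"),                           # Generated Knowledge [5]
    }

    # Example: render the Zero-shot CoT prompt for a GSM8K-style question.
    print(PROMPT_LEVELS[2].format(task="If 3 pens cost $6, how much do 7 pens cost?"))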

↑ Back to Top ↑

References

  1. Kong, A., Zhao, S., Chen, H., Li, Q., Qin, Y., Sun, R., & Zhou, X. (2023). Better Zero-Shot Reasoning with Role-Play Prompting. ArXiv, abs/2308.07702.
  2. Kojima, T., Gu, S.S., Reid, M., Matsuo, Y., & Iwasawa, Y. (2022). Large Language Models are Zero-Shot Reasoners. ArXiv, abs/2205.11916.
  3. Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E.H., Xia, F., Le, Q., & Zhou, D. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. ArXiv, abs/2201.11903.
  4. Zhou, D., Schärli, N., Hou, L., Wei, J., Scales, N., Wang, X., Schuurmans, D., Bousquet, O., Le, Q., & Chi, E.H. (2022). Least-to-Most Prompting Enables Complex Reasoning in Large Language Models. ArXiv, abs/2205.10625.
  5. Liu, J., Liu, A., Lu, X., Welleck, S., West, P., Le Bras, R., Choi, Y., & Hajishirzi, H. (2021). Generated Knowledge Prompting for Commonsense Reasoning. Annual Meeting of the Association for Computational Linguistics.

↑ Back to Top ↑

Contributing

This project aims to build open-source evaluation frameworks for assessing LLMs and other agents. Contributions and suggestions are welcome; please see the details on how to contribute.

If you are new to GitHub, here is a detailed guide on getting involved with development on GitHub.

↑ Back to Top ↑

Cite Us

If you find our work useful, please cite us!

@misc{budagam2024hierarchicalpromptingtaxonomyuniversal,
      title={Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for Large Language Models Aligned with Human Cognitive Principles}, 
      author={Devichand Budagam and Ashutosh Kumar and Mahsa Khoshnoodi and Sankalp KJ and Vinija Jain and Aman Chadha},
      year={2024},
      eprint={2406.12644},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2406.12644}, 
}

↑ Back to Top ↑
