yanshanjing/awesome-instruction-dataset
awesome-instruction-tuning(ChatGPT|LLaMA)-dataset

A collection of open-source instruction tuning datasets to train chat-based LLMs (ChatGPT, LLaMA, Alpaca)

Instruction tuning / Reinforcement Learning from Human Feedback (RLHF) datasets are a key component of instruction-following LLMs such as ChatGPT. This repo is dedicated to providing a comprehensive list of datasets used for instruction tuning in various LLMs, making it easier for researchers and developers to access and use these resources.

Other relevant awesome-list: nichtdax/awesome-totally-open-chatgpt

Size: The number of instruction tuning pairs

Lingual-Tags:

  • EN: Instruction datasets in English
  • CN: Instruction datasets in Chinese
  • ML: [Multi-lingual] Instruction datasets in multiple languages

Task-Tags:

  • MT: [Multi-task] Datasets containing multiple tasks
  • TS: [Task-specific] Datasets tailored for specific tasks

Generation-method:

  • HG: [Human Generated Dataset] Datasets created by humans
  • SI: [Self-Instruct] Datasets generated using self-instruct methods
  • MIX: [Mixed Dataset] Dataset contains both human and machine generated data
  • COL: [Collection of Dataset] Dataset made from a collection of other datasets
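The tag notation above is used throughout this list in the form `({owner}/{project})|{size}|{lingual}|{task}|{generation}` (e.g. `(tatsu-lab/Alpaca)|52K|EN|MT|SI`). As a small sketch, a hypothetical helper for splitting such a tag string into structured fields could look like this (`parse_entry_tags` is illustrative, not part of any repo tooling):

```python
import re

# Tag vocabularies defined in the sections above.
LINGUAL = {"EN", "CN", "ML"}
TASK = {"MT", "TS"}
GENERATION = {"HG", "SI", "MIX", "COL"}

def parse_entry_tags(tag_string):
    """Split a tag string like '(tatsu-lab/Alpaca)|52K|EN|MT|SI' into fields."""
    match = re.match(r"\((?P<project>[^)]+)\)\|(?P<rest>.+)", tag_string)
    if not match:
        raise ValueError(f"unrecognized tag string: {tag_string!r}")
    parts = match.group("rest").split("|")
    size, tags = parts[0], set(parts[1:])
    return {
        "project": match.group("project"),
        "size": size,
        "lingual": sorted(tags & LINGUAL),
        "task": sorted(tags & TASK),
        "generation": sorted(tags & GENERATION),
    }
```

For example, `parse_entry_tags("(tatsu-lab/Alpaca)|52K|EN|MT|SI")` yields the project name, the 52K size, and the `EN`/`MT`/`SI` tags in their respective categories.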

Table of Contents

  1. The template
  2. The Instruction-following Datasets
  3. Reinforcement Learning from Human Feedback (RLHF) Datasets
  4. Datasets without license information
  5. Open-source Codebase For Instruction-following LLMs

The template

Append the new project at the end of the file.

## [({owner}/{project-name})|{Tags}](https://github.com/link/to/project)

- summary:
- Data generation model:
- paper:
- Cost:
- Related: (if applicable)
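For illustration, here is the template filled in with a hypothetical entry (the project name, size, and figures below are placeholders, not a real dataset; the URL is the template's placeholder link):

```
## [(example-org/example-dataset)|10K|EN|MT|SI](https://github.com/link/to/project)

- summary: A hypothetical 10K-pair English multi-task self-instruct dataset, shown only to illustrate the template.
- Data generation model: text-davinci-003
- paper: N/A
- Cost: N/A
```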

The Instruction-following Datasets

  • Summary: 52K data generated from a modified self-instruct pipeline with 175 human-written seed tasks.
  • Data generation model: text-davinci-003
  • paper: alpaca-blog
  • Cost: $600
  • Summary: A project that manually cleaned the Alpaca 52K dataset.
  • Data generation model: text-davinci-003
  • paper: N/A
  • Cost: N/A
  • Summary: 52K data generated from a modified self-instruct pipeline with 429 human-written seed tasks.
  • Data generation model: text-davinci-003
  • paper: N/A
  • Cost: $880
  • Summary: 52K instruction data generated from a modified self-instruct pipeline with 429 human-written seed tasks.
  • Data generation model: text-davinci-003
  • Cost: $6000
  • Summary: A dataset for Chain-of-Thought reasoning based on LLaMA and Alpaca. Note: their repository will continuously collect various instruction tuning datasets. GitHub repo
  • paper: N/A
  • Cost: N/A
  • Summary: A collection of modular datasets generated by GPT-4: General-Instruct, Roleplay-Instruct, Code-Instruct, and Toolformer.
  • Data generation model: GPT-4
  • paper: N/A
  • Cost: N/A
  • Summary: UltraChat aims to construct an open-source, large-scale, multi-round dialogue dataset. The first part of UltraChat (i.e., the Questions about the World sector) has been released and contains 280k diverse and informative dialogues. More dialogues about writing, creation, and assistance with existing materials are to come.
  • Data generation model: GPT-3.5-turbo
  • paper: N/A
  • Cost: N/A
  • Summary: Based on the Stanford Alpaca data, ChatAlpaca extends it to multi-turn instructions and their corresponding responses. More data (20k) and a Chinese-translated version are to come.
  • Data generation model: GPT-3.5-turbo
  • paper: N/A
  • Cost: N/A
  • Related: (tatsu-lab/Alpaca)|52K|EN|MT|SI
  • Summary: Chinese datasets of 23 tasks combined with human-written instruction templates.
  • Data generation model: N/A
  • paper: N/A
  • Cost: N/A
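Several entries above describe a "modified self-instruct pipeline" seeded with human-written tasks. As a rough sketch only (not the actual Alpaca code), the core loop samples a few tasks from the pool as few-shot examples, asks a model for a new instruction, and keeps it only if it is not a near-duplicate of an existing one. Here `generate_instruction` is a hypothetical stand-in for an LLM call (e.g. text-davinci-003), and `similarity` is a crude word-overlap proxy for the ROUGE-based filtering real pipelines use:

```python
import random

def similarity(a, b):
    """Crude word-overlap similarity (real pipelines use ROUGE-L)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

def self_instruct(seed_tasks, generate_instruction, target_size, threshold=0.7, rng=None):
    """Grow a pool of instructions from human-written seeds."""
    rng = rng or random.Random(0)
    pool = list(seed_tasks)
    while len(pool) < target_size:
        # Build a few-shot prompt from tasks sampled out of the pool.
        examples = rng.sample(pool, k=min(3, len(pool)))
        candidate = generate_instruction(examples)
        # Keep only candidates that are not near-duplicates, for diversity.
        if all(similarity(candidate, t) < threshold for t in pool):
            pool.append(candidate)
    return pool
```

A pipeline like Alpaca's starts this loop from 175 seed tasks and runs it (with additional filtering and an answer-generation step) until it reaches tens of thousands of instruction pairs.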

Reinforcement Learning from Human Feedback (RLHF) Datasets

  • Summary: Each example is a Reddit post with a question/instruction and a pair of top-level comments for that post, where one comment is more preferred by Reddit users (collectively).
  • Data generation model: N/A
  • paper: N/A
  • Cost: N/A
  • Summary: Ranked responses (note: the data is evaluated by the GPT-4 model, NOT by humans) to Alpaca prompts from three models (GPT-4, GPT-3.5, and OPT-IML), obtained by asking GPT-4 to rate the quality. The authors believe "GPT-4 is capable of identifying and fixing its own mistakes, and accurately judging the quality of responses".
  • Data generation model: GPT-4
  • paper: Instruction Tuning with GPT-4
  • Cost: N/A
  • Related: (tatsu-lab/Alpaca)|52K|EN|MT|SI
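Pairwise preference data like the entries above (one chosen and one rejected response per prompt) is typically used to train a reward model: the model scores both responses, and training minimizes `-log sigmoid(r_chosen - r_rejected)` so the preferred response gets the higher score. A minimal sketch of that Bradley-Terry style loss, with `reward` as a hypothetical scoring function (real setups use a fine-tuned LM head):

```python
import math

def preference_loss(reward, prompt, chosen, rejected):
    """Bradley-Terry style loss on one (chosen, rejected) response pair."""
    margin = reward(prompt, chosen) - reward(prompt, rejected)
    # -log sigmoid(margin): near zero when the chosen response scores
    # much higher, large when the rejected one is preferred.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

For instance, with a toy reward that scores responses by length, a pair whose chosen response scores 8 points higher gives a loss of about 3.4e-4, while a reversed pair gives a large loss.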

Datasets without license information

  • Summary: A compilation of tatsu-lab/alpaca, Dahoas/instruct-human-assistant-prompt, and allenai/prosocial-dialog
  • Data generation model: N/A
  • paper: N/A
  • Cost: N/A

Open-source Codebase For Instruction-following LLMs

  • Summary: Alternative projects featuring different instruction-finetuned language models for chat.
