Generating Instruction Datasets Locally Using Meta-Llama-3-8B-Instruct

Instruction tuning involves fine-tuning a pretrained language model on a dataset of instruction prompts paired with the desired outputs.

This kind of dataset, known as an instruction dataset, is essential but hard to collect.
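For illustration, a single record in an Alpaca-style instruction dataset pairs an instruction (and an optional input) with the desired output, roughly like this (the field names follow the Alpaca release; the content here is made up):

```python
# An illustrative entry in an Alpaca-style instruction dataset.
example = {
    "instruction": "Rewrite the sentence in the passive voice.",
    "input": "The chef prepared the meal.",
    "output": "The meal was prepared by the chef.",
}
```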

The Alpaca model addressed this issue by using OpenAI's text-davinci-003 model to generate synthetic data based on a small, human-written seed dataset. However, this relies on the OpenAI API, which can be costly.

To avoid that cost, I replicated the Alpaca dataset generation process with a local Meta-Llama-3-8B-Instruct model to produce a self-instruct dataset.

This method is cost-effective and efficient for creating high-quality datasets for NLP tasks.

Still, I don't recommend using it directly, as I created this notebook mainly for educational purposes.
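For intuition, here is a minimal sketch of one self-instruct generation round with a local Meta-Llama-3-8B-Instruct model and the Hugging Face transformers library. The prompt wording, helper names, and the seed_tasks.jsonl path are illustrative assumptions, not the exact ones used in the notebook:

```python
# Minimal sketch of one self-instruct generation round, assuming local access
# to meta-llama/Meta-Llama-3-8B-Instruct via Hugging Face transformers.
# Prompt wording, helper names, and file paths are illustrative only.
import json
import random

import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

def build_prompt(seed_tasks, num_examples=3):
    """Sample a few seed instructions and ask the model to continue the list."""
    examples = random.sample(seed_tasks, num_examples)
    numbered = "\n".join(
        f"{i + 1}. {task['instruction']}" for i, task in enumerate(examples)
    )
    return (
        "You are asked to come up with new task instructions.\n"
        "Here are some examples:\n"
        f"{numbered}\n"
        f"{num_examples + 1}."
    )

def generate_instructions(seed_tasks, rounds=1):
    """One generation round: prompt the model and collect its continuations."""
    new_instructions = []
    for _ in range(rounds):
        prompt = build_prompt(seed_tasks)
        output = generator(prompt, max_new_tokens=256, do_sample=True, temperature=0.7)
        completion = output[0]["generated_text"][len(prompt):]
        # Treat each non-empty line of the continuation as a candidate instruction.
        new_instructions.extend(
            line.strip() for line in completion.split("\n") if line.strip()
        )
    return new_instructions

if __name__ == "__main__":
    with open("seed_tasks.jsonl") as f:  # hypothetical path to the seed data
        seeds = [json.loads(line) for line in f]
    print(generate_instructions(seeds)[:5])
```

The full Alpaca pipeline goes further: generated instructions are filtered against existing ones to drop near-duplicates, and the model is prompted again to produce the corresponding inputs and outputs. The sketch above only shows the instruction-generation step.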

Sources:

Alpaca: A Strong, Replicable Instruction-Following Model

Github (stanford_alpaca)

Self-Instruct: Aligning Language Models with Self-Generated Instructions

meta-llama/Meta-Llama-3-8B-Instruct
