# Generating An Instruction Dataset via Llama 3 and Ollama

* This notebook uses the [llama3.1:8b](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B) model to generate synthetic data. It follows this [notebook](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch07/05_dataset-generation/llama3-ollama.ipynb) created by Sebastian Raschka.
* The generated dataset is an instruction dataset with "instruction" and "output" fields, similar to the format used in Alpaca:
```
{
    "instruction": "What is the atomic number of helium?",
    "output": "The atomic number of helium is 2."
}
```

In [1]:
from importlib.metadata import version

pkgs = [
    "tqdm",    # Progress bar
]

for p in pkgs:
    print(f"{p} version: {version(p)}")

tqdm version: 4.66.5


## Install Ollama and Download Llama3.1

* Prior to running the code below, install Ollama by visiting https://ollama.com and following the instructions.
* Start the Ollama application or run `ollama serve` in a separate terminal so that the server runs in the background.
* Run `ollama run llama3` to download and use the `llama3:8b` model.
* Next, we can check that the model is running by querying it via the REST API. An alternative is to use the [Ollama Python Library](https://pypi.org/project/ollama/), which the `query_model` function below uses.
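The REST check mentioned above can be sketched with only the standard library. This is a minimal sketch, assuming Ollama's default local address `http://localhost:11434` and its `/api/chat` endpoint; the helper names `build_chat_payload` and `query_model_rest` are hypothetical and not part of any library.

```python
import json
import urllib.request

# Assumption: Ollama is listening on its default local port.
OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"


def build_chat_payload(prompt, model="llama3", role="user"):
    """Build the JSON body for a single-turn, non-streaming chat request."""
    return {
        "model": model,
        "messages": [{"role": role, "content": prompt}],
        "stream": False,  # ask for one complete JSON object instead of a stream
    }


def query_model_rest(prompt, model="llama3", url=OLLAMA_CHAT_URL):
    """Send the prompt to a running Ollama server and return the reply text."""
    body = json.dumps(build_chat_payload(prompt, model)).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        reply = json.loads(response.read().decode("utf-8"))
    return reply["message"]["content"]
```

With the server running, `print(query_model_rest("What do Llamas eat?"))` should print a reply. If the server is not reachable, `urlopen` raises a `URLError`, which makes this a quick way to confirm that Ollama is up.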

In [63]:
import ollama

def query_model(prompt, model="llama3", role="user"):

    response = ollama.chat(
        messages=[
            {
                'role': role,
                'content': prompt,
            }
        ],
        model=model,
        options={
            "seed": None,        # no fixed seed, so responses differ between runs
            "temperature": 1.,   # sample at full temperature for varied outputs
            "top_p": 1,
        }
    )
    return response['message']['content']

In [31]:
result = query_model("What do Llamas eat?")
print(result)

Llamas are herbivores, which means they primarily eat plants and plant-based foods. Their diet consists of:

1. **Grasses**: Tall grasses, such as bunchgrasses and short grasses like Bermuda grass.
2. **Leaves**: Leaves from trees and shrubs, including the leaves of plants like cattails and cottonwood.
3. **Fruits**: Various fruits like apples, carrots (yes, they love them!), berries, and leafy greens like collard greens.
4. **Hay**: High-quality hay, such as timothy or alfalfa hay, is a staple in many llamas' diets. Hay is rich in fiber and serves as a nutritious, low-calorie food source.
5. **Grains** (in moderation): Some llamas may receive small amounts of grains like oats or corn as treats or for added nutrition.

In their natural habitat, llamas are browsers, which means they roam freely and eat what's available. To replicate this in captivity, llamas require access to a variety of fresh plants and a diet rich in fiber to maintain good digestive health.

Do you have any other que

## Extract Instructions

Now, let's use the "hack" proposed in the Magpie paper: we provide only the empty prompt template `<|begin_of_text|><|start_header_id|>user<|end_header_id|>`, which causes the instruction-finetuned Llama 3 model to generate an instruction.
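To see why an empty template produces an instruction, it helps to compare it with a complete user turn. The sketch below assumes the usual Llama 3 instruct chat format; `PRE_QUERY_TEMPLATE` and `format_user_turn` are illustrative names, not part of Ollama or any library.

```python
# Assumption: the Llama 3 instruct format wraps each message in header tokens
# and terminates it with <|eot_id|>.
PRE_QUERY_TEMPLATE = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>"


def format_user_turn(content):
    """Return a complete user turn followed by the opening of the assistant turn."""
    return (
        f"{PRE_QUERY_TEMPLATE}\n\n{content}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )


full_prompt = format_user_turn("What do Llamas eat?")
print(full_prompt)
```

In a normal prompt the model continues after the assistant header. When the text stops right after the user header, the most likely continuation is the user message itself, which is the instruction we want to collect.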

In [52]:
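def extract_instruction(text):
    """Return the first non-empty line of the model output, or None if there is none."""
    for content in text.split("\n"):
        content = content.strip()
        if content:  # skip blank and whitespace-only lines
            return content
    return None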

In [64]:
from pprint import pprint

# Ref: https://github.com/lmstudio-ai/.github/issues/43
query = "{{ system_prompt }}<|eot_id|><|start_header_id|>user<|end_header_id|>"

result = query_model(query, role="assistant")
instruction = extract_instruction(result)
pprint(instruction)

("I'd like to know more about the history of a famous scientific instrument, "
 'such as a telescope or a microscope.')


* As we can see above, surprisingly, the model indeed generated an instruction.

## Generate Responses

* Let's generate 5 synthetic examples.
* We can scale up this approach to an arbitrary number of data samples.
* To generate a dataset suitable for instruction finetuning, we would increase this to somewhere between 1k and 50k examples.
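When scaling up, some generations will be empty or repeat an earlier instruction. The following is a minimal sketch of one way to handle that, not the approach used in this notebook: `generate_dataset` is a hypothetical helper that takes the query function as an argument, skips empty and duplicate instructions, and stops after a bounded number of attempts. A stand-in `fake_query` replaces the real model so the sketch runs without an Ollama server.

```python
def generate_dataset(query_fn, pre_query, target_size, max_attempts=None):
    """Collect unique instruction/response pairs using the supplied query function."""
    if max_attempts is None:
        max_attempts = target_size * 3  # arbitrary safety bound for this sketch
    seen = set()
    dataset = []
    for _ in range(max_attempts):
        if len(dataset) >= target_size:
            break
        raw = query_fn(pre_query, role="assistant")
        # Keep only the first non-empty line as the instruction.
        instruction = next(
            (line.strip() for line in raw.splitlines() if line.strip()), None
        )
        # Skip empty generations and case-insensitive exact duplicates.
        if not instruction or instruction.lower() in seen:
            continue
        seen.add(instruction.lower())
        response = query_fn(instruction, role="user")
        dataset.append({"instruction": instruction, "output": response})
    return dataset


# Canned "model outputs" for the demo: one valid, one empty, one duplicate, one valid.
canned = iter(["Name a prime.\nextra", "", "name a prime.", "Define entropy."])


def fake_query(prompt, role="user"):
    """Stand-in for query_model so the sketch runs offline."""
    if role == "assistant":
        return next(canned)
    return f"Answer to: {prompt}"


demo = generate_dataset(fake_query, "<pre-query template>", target_size=2)
print(demo)
```

Passing `query_model` and the real pre-query template instead of the stand-ins would run the same loop against the local model.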

In [65]:
from tqdm import tqdm

dataset_size = 5
dataset = []

for i in tqdm(range(dataset_size)):

    result = query_model(query, role="assistant")
    instruction = extract_instruction(result)
    response = query_model(instruction, role="user")
    entry = {
        "instruction": instruction,
        "output": response
    }
    dataset.append(entry)

100%|██████████| 5/5 [02:23<00:00, 28.78s/it]


In [66]:
import json

with open("instruction-data-llama3-8b.json", "w") as file:
    json.dump(dataset, file, indent=4)
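Before using a saved file for finetuning, it can be worth checking that it loads back and that every entry has both fields. This is a minimal sketch under that assumption; `load_instruction_data` is a hypothetical helper, and the demo writes a small sample to a temporary directory rather than touching the file created above.

```python
import json
import os
import tempfile

REQUIRED_KEYS = {"instruction", "output"}


def load_instruction_data(path):
    """Load a JSON list of entries and check that each has non-empty string fields."""
    with open(path, "r", encoding="utf-8") as file:
        data = json.load(file)
    for index, entry in enumerate(data):
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            raise ValueError(f"entry {index} is missing keys: {sorted(missing)}")
        for key in REQUIRED_KEYS:
            if not isinstance(entry[key], str) or not entry[key].strip():
                raise ValueError(f"entry {index} has an empty or non-string {key!r}")
    return data


# Round-trip demo with a single sample entry.
sample = [
    {
        "instruction": "What is the atomic number of helium?",
        "output": "The atomic number of helium is 2.",
    }
]
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.json")
    with open(sample_path, "w", encoding="utf-8") as file:
        json.dump(sample, file, indent=4)
    loaded = load_instruction_data(sample_path)

print(len(loaded))
```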

In [67]:
!cat instruction-data-llama3-8b.json

[
    {
        "instruction": "I was wondering if there are any strategies or tactics that can be used to improve communication and feedback in agile teams?",
        "output": "Excellent question! Effective communication and feedback are crucial for the success of agile teams. Here are some strategies and tactics you can use to improve communication and feedback:\n\n**Communication Strategies:**\n\n1. **Regular Meetings**: Hold daily stand-up meetings (Scrum) or daily syncs (Kanban) to ensure everyone is aligned and aware of progress.\n2. **Open Communication Channels**: Encourage team members to share their thoughts, concerns, and ideas openly and respectfully.\n3. **Collaborative Tools**: Use tools like Trello, Asana, or Jira to facilitate collaboration, tracking progress, and sharing information.\n4. **Active Listening**: Make sure team members listen attentively to each other and respond thoughtfully.\n\n**Feedback Strategies:**\n\n1. **Regular Retrospectives**: Hold retrospectiv

## Bonus

We can also use the LIMA dataset instead.

In [72]:
from datasets import load_dataset
from dotenv import load_dotenv

load_dotenv()  # load environment variables from a local .env file (presumably a Hugging Face token)

dataset = load_dataset("GAIR/lima", split="train")

In [79]:
pprint(dataset[502])

{'conversations': ['What are some current jobs that will become completely '
                   'automated or obsolete within the next decade?',
                   'Here are some examples of jobs that may become fully '
                   "automated by the 2030's:\n"
                   '\n'
                   '*  Truck Drivers. With the current pace of advances in AI, '
                   'we may see more and more semi-autonomous and even '
                   'fully-autonomous vehicles on the road over the next few '
                   'years. As of 2021, there are about 3.5 million truck '
                   'drivers in the US alone, many of which may be replaced by '
                   'fully-autonomous trucks by 2030.\n'
                   '\n'
                   '*  Customer Service Representatives. As of the early '
                   "2020's, text and voice-based chatbots are already "
                   'supplementing agents in call centers. This trend is '
                   'e
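
Each LIMA record stores its turns in a `conversations` list, so it can be mapped to the same "instruction"/"output" format used earlier. This is a minimal sketch, assuming the first turn is the prompt and the second is the reply; `lima_to_instruction_entry` is a hypothetical helper, and the sample record is made up so the block runs without downloading the dataset.

```python
def lima_to_instruction_entry(record):
    """Map the first two turns of a LIMA-style record to an instruction/output pair."""
    turns = record["conversations"]
    if len(turns) < 2:
        return None  # no reply to pair with the prompt
    return {"instruction": turns[0], "output": turns[1]}


# Made-up record in the same shape as the LIMA output shown above.
sample_record = {
    "conversations": [
        "What is the atomic number of helium?",
        "The atomic number of helium is 2.",
    ]
}
entry = lima_to_instruction_entry(sample_record)
print(entry)
```

Applied to the loaded dataset, a list comprehension over the records that drops `None` results would give a list in the same shape as the generated data. Later turns in multi-turn conversations are ignored by this sketch.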