![NVIDIA Logo](images/nvidia.png)

# List Generation - LoRA Data

In this notebook you resume your work on synthetic data generation (SDG).

Previously you used a GPT43B-based list generation function to generate many seed topics. In this notebook you will use the current 43B-based function along with the seed topics to generate synthetic data (prompts and labels) that you will use in a later notebook to fine-tune a GPT8B model for list generation.

---

## Learning Objectives

By the time you complete this notebook you will:
- Generate synthetic data for fine-tuning an 8B model for list generation.

---

## Imports

In [None]:
import json
import random

from tqdm.notebook import tqdm

from llm_utils.llm_functions import generate_list_43B as generate_list
from llm_utils.prompt_templates import gen_list_template_zero_shot as gen_list_template

---

## Review List Generation

In a previous section, you worked to create a `generate_list` function that relied on a prompt-engineered GPT43B model to generate lists of seed topics. We have provided our implementation of this function for you here.

In [None]:
ways = generate_list(5, 'ways to feel great every day.')

In [None]:
ways

---

## List Generation Prompt Template

Also in the previous notebook, you synthetically generated 100 unique seed topics. Recall that these topics were intended for use in the prompts we send to `generate_list`, namely:

```python
f'Make a python list of {len} {topic}.'
```

Since you will use this prompt template extensively inside loops, we've provided the helper function `gen_list_template`, which accepts the arguments `len` and `topic` and returns the prompt with those values filled in.

In [None]:
gen_list_template(5, 'good ideas')
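For intuition, a template helper of this kind can be as small as a single f-string. The sketch below is an assumption about its general shape, not the actual `llm_utils` implementation. The name `sketch_gen_list_template` is hypothetical, and the first parameter is called `n` rather than `len` to avoid shadowing Python's built-in `len`.

```python
def sketch_gen_list_template(n: int, topic: str) -> str:
    """Hypothetical stand-in for gen_list_template (not the llm_utils code)."""
    # Fill the list length and topic into the prompt shown above.
    return f'Make a python list of {n} {topic}.'


print(sketch_gen_list_template(5, 'good ideas'))
# Make a python list of 5 good ideas.
```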

---

## Review Seed Topics

Using the same process you used earlier, but with extra time on our hands, we generated a list of 254 seed topics that we have provided for you to use in this notebook.

In [None]:
with open('data/254_seed_topics.json', 'r') as f:
    seed_topics = json.load(f)

In [None]:
len(seed_topics)

In [None]:
seed_topics[:10]

---

## Generate List Generation Prompts

These seed topics can be passed into `gen_list_template` to generate appropriate prompts for list generation.

In [None]:
for i, seed_topic in enumerate(seed_topics[:5]):
    print(gen_list_template(i+2, seed_topic))

---

## Exercise: Generate Data for Fine-Tuning List Generation Model

![List Gen Data](images/list_gen_data.png)

In this exercise you will populate `prompts_with_labels` with 100 prompt/label pairs intended for fine-tuning a GPT8B model to perform list generation. This is in line with the virtuous cycle we have discussed: prompt engineer a larger model to perform a task, then use that larger model's responses to create data for fine-tuning a smaller model to perform the same task.

Your `prompts_with_labels` should be a list of 2-tuples in the format `(prompt, label)`, where `prompt` is an appropriately formatted prompt for list generation and `label` is a well-formed list of the correct size with no duplicates.

Here is an example:

In [None]:
example_prompts_with_labels = []
topics = ['colors', 'magicians']
for i, topic in enumerate(topics):
    n = i + 2  # Start at 2: we don't want lists of 0 or 1 items
    prompt = gen_list_template(n, topic)
    label = generate_list(n, prompt)
    prompt_with_label = (prompt, label)
    example_prompts_with_labels.append(prompt_with_label)

In [None]:
example_prompts_with_labels
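The format requirements described above (a 2-tuple, a list label of the correct size, and no duplicates) can be checked programmatically. The following is a minimal sketch; `is_well_formed` is a hypothetical helper that is not part of `llm_utils`, and it assumes list items are hashable values such as strings.

```python
def is_well_formed(pair, expected_len):
    """Hypothetical check that `pair` is a valid (prompt, label) 2-tuple."""
    if not (isinstance(pair, tuple) and len(pair) == 2):
        return False
    prompt, label = pair
    if not isinstance(prompt, str) or not isinstance(label, list):
        return False
    # Assumes items are hashable (e.g. strings) so they can go in a set.
    no_duplicates = len(set(label)) == len(label)
    return no_duplicates and len(label) == expected_len


good = ('Make a python list of 2 colors.', ['red', 'blue'])
bad = ('Make a python list of 2 colors.', ['red', 'red'])
print(is_well_formed(good, 2))  # True
print(is_well_formed(bad, 2))   # False
```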

### Considerations

Before you begin, it might be worth keeping the following in mind:
- Since this data is intended for training, you want a diversity of data. Be sure to create lists of varying length, and use as many different `seed_topics` in your prompts as you can.
- `generate_list` performs best with `n` < 7.
- `random.randint(begin, end)` returns a random integer between `begin` and `end`, inclusive of both endpoints.
- `generate_list` will return an empty list if it fails. Be sure to handle this when populating `prompts_with_labels`.
- In spite of your efforts, there's always a chance `prompts_with_labels` ends up with duplicate items. Consider taking care to address this.
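The considerations above can be illustrated without calling the model. The sketch below is not the exercise solution: `fake_generate_list` is a hypothetical stand-in for `generate_list` that deliberately fails on odd lengths, so that the empty-list handling is exercised. Treating a repeated prompt as a duplicate is also an assumption.

```python
import random


def fake_generate_list(n, topic):
    """Hypothetical stand-in for generate_list: returns [] (failure) on odd n."""
    if n % 2 == 1:
        return []
    return [f'{topic} item {i}' for i in range(n)]


rng = random.Random(0)  # Local RNG so the sketch is reproducible
topics = ['colors', 'magicians', 'colors', 'rivers']  # Note the repeated topic
pairs = []
seen_prompts = set()

for topic in topics:
    n = rng.randint(2, 6)  # Inclusive on both ends: 2, 3, 4, 5, or 6
    prompt = f'Make a python list of {n} {topic}.'
    label = fake_generate_list(n, topic)
    if not label:  # An empty list signals a failed generation
        continue
    if prompt in seen_prompts:  # Skip exact duplicate prompts
        continue
    seen_prompts.add(prompt)
    pairs.append((prompt, label))

print(pairs)
```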

### Your Work Here

Here is the actual `prompts_with_labels` you should populate.

In [None]:
prompts_with_labels = []

If you get stuck, feel free to check out our solution below.

### Solution

In [None]:
random.seed(1) # We set a seed just for reproducibility
solution_prompts_with_labels = []
# Set to 10 for the sake of time.
# In reality we would set to ~120 to overshoot our target of 100 in case of duplicates.
for seed_topic in tqdm(seed_topics[:10]):
    n = random.randint(2, 8)  # Inclusive on both ends: n is between 2 and 8
    prompt = gen_list_template(n, seed_topic)
    label = generate_list(n, prompt)
    # Keep only well-formed labels. An empty list from a failed call never equals n, since n >= 2.
    if len(label) == n:
        # Not required, but will help avoid some duplication and possibly diversify data.
        random.shuffle(label)
        solution_prompts_with_labels.append((prompt, label))

In [None]:
for prompt, label in solution_prompts_with_labels[:10]:
    print(f'Prompt: {prompt}')
    print(f'List: {label}\n')
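The solution comments mention overshooting the target in case of duplicates. One way to finish that step is sketched below, under assumptions: `dedupe_and_trim` is a hypothetical helper, and it treats a pair as a duplicate when either its prompt or its label (ignoring item order) has already been seen. Other definitions of "duplicate" may suit your data better.

```python
def dedupe_and_trim(pairs, target):
    """Hypothetical helper: drop duplicate pairs, then keep at most `target`."""
    seen_prompts = set()
    seen_labels = set()
    unique = []
    for prompt, label in pairs:
        label_key = tuple(sorted(label))  # Order-insensitive key for the label
        if prompt in seen_prompts or label_key in seen_labels:
            continue
        seen_prompts.add(prompt)
        seen_labels.add(label_key)
        unique.append((prompt, label))
    return unique[:target]


raw = [
    ('Make a python list of 2 colors.', ['red', 'blue']),
    ('Make a python list of 2 colors.', ['green', 'pink']),  # Duplicate prompt
    ('Make a python list of 2 hues.', ['blue', 'red']),      # Same label, reordered
    ('Make a python list of 3 magicians.', ['Houdini', 'Penn', 'Teller']),
]
result = dedupe_and_trim(raw, target=100)
print(len(result))  # 2
```

With real data you would pass the overshoot list (for example, about 120 pairs) and `target=100`.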