![NVIDIA Logo](images/nvidia.png)

# Project: Seed Topic Generation

In this notebook you will prepare synthetic data for fine-tuning a GPT8B model on list generation, using our current GPT43B list generator to generate that data.

---

## Learning Objectives

By the time you complete this notebook you will be able to:
- Use `generate_list` to generate 100 seed topics for future efforts in synthetic data generation.
- Efficiently edit simple synthetically generated text data.

---

## Imports

In [None]:
import json
import ast
import random

from tqdm.notebook import tqdm

from llm_utils.models import Models
from llm_utils.nemo_service_models import NemoServiceBaseModel
from llm_utils.helpers import edit_list
from llm_utils.llm_functions import generate_list_43B as generate_list

---

## Sufficient Amounts of Diverse Data

When we generate synthetic data we need a sufficient amount of data and often, as is the case with PEFT, data that is diverse.

In the last notebook we created a tool, `generate_list`, that starts us in the right direction. It can successfully (though not always perfectly) create roughly 7 items on just about any topic you wish.

With that in mind, if we wanted to begin generating some kind of synthetic data, a logical next step would be to combine `generate_list` with a loop, passing it a variety of seed topics and values for `n`:

```python
seed_topics = [...] # Imagine `seed_topics` contains some number of list generation topics
new_topics = []
for seed_topic in seed_topics:
    n = random.randint(1, 7)  # Pick a random list length between 1 and 7 (inclusive)
    new_topics.extend(generate_list(n, seed_topic))
```

Given this setup, the number of `new_topics` we could generate would be limited primarily by the number of unique topics in `seed_topics`.
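To make that limit concrete, here is a minimal sketch of the same loop that runs without a model. `fake_generate_list` is a hypothetical stand-in for `generate_list` (it only returns placeholder strings), and the three seed topics are illustrative.

```python
import random

def fake_generate_list(n, topic):
    # Hypothetical stand-in for `generate_list`: returns n placeholder items, no model call.
    return [f'{topic} #{i + 1}' for i in range(n)]

seed_topics = ['dogs', 'flavors of ice cream', 'flowers']  # illustrative seed topics

new_topics = []
for seed_topic in seed_topics:
    n = random.randint(1, 7)  # same range as the loop above
    new_topics.extend(fake_generate_list(n, seed_topic))

# With at most 7 items per topic, the total can never exceed 7 * len(seed_topics).
max_possible = 7 * len(seed_topics)
print(f'{len(new_topics)} items generated; upper bound is {max_possible}')
```

Under these assumptions, the only way to raise the ceiling is to add more unique seed topics, which is the motivation for the exercise that follows.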



---

## Exercise Objective

![Seed Topics](images/seed_topics.png)

Your goal then in this challenge is to generate 100 unique `seed_topics` that could each be appropriately passed as a `topic` into the list generation prompt template we created in the previous notebook. Here's a simplified version of the prompt template (without the logic for generating few-shot prompts).

In [None]:
def gen_list_template(n, topic):
    return f'Make a python list of {n} {topic}.'

Assume that `n` will be greater than 1, which means that each `topic` needs to be a plural noun (or noun phrase) that has diverse subtypes.

### Examples of Good Seed Topics

Let's take a look at a few good `seed_topics` that would fit well in the `gen_list_template` we've provided you.

In [None]:
n = 3

In [None]:
good = 'dogs'
gen_list_template(3, good)

In [None]:
good = 'flavors of ice cream'
gen_list_template(3, good)

### Examples of Bad Seed Topics

Now let's look at some seed topics that would not work well as arguments to `gen_list_template`. First, singular nouns.

In [None]:
singular = 'flower' # Singular
gen_list_template(3, singular)

Next, a verb in a tense that does not fit the template.

In [None]:
verb = 'jumped' # Verb in this tense doesn't work
gen_list_template(3, verb)

And finally, things that have no diverse subtypes, even when they are given in plural noun form.

In [None]:
actually_unique = 'Jensen Huangs' # There's actually only one
gen_list_template(3, actually_unique)

### Example Seed Topics

We've provided a file with 100 example seed topics to further your understanding of what is being asked of you.

In [None]:
with open('data/100_seed_topics.json', 'r') as f:
    seed_topics = json.load(f)

In [None]:
for seed_topic in seed_topics[-10:]:
    print(seed_topic)

---

## List Editor

As an aside, it's important when working with data, synthetically generated or not, to make sure your samples are what you are actually looking for.

To support your efforts here and elsewhere in the workshop, we've provided you with the `edit_list` function, which lets you iterate through a list, view each item, and choose to keep, remove, or edit it.

In [None]:
list_to_edit = ['keep this', 'remove this', 'edit this to be "keep this"', 'keep this']

Execute the cell below to edit `list_to_edit`. Your efforts should result in a list of 3 `'keep this'` items.

In [None]:
edit_list(list_to_edit)

View your list to confirm.

In [None]:
list_to_edit

If you messed something up, try again until you feel comfortable using the list editor.

---

## Make a List of 100 Unique Seed Topics That Each Have Diverse Subtypes

`generate_list` has been imported into this notebook to use here.

If you're up for an extra challenge, feel free to jump right in. If you'd like step-by-step guidance for the challenge, expand the _Guidelines_ section below.

### Your Work Here

In [None]:
seed_topics = [] # TODO: `seed_topics` should contain at least 100 unique seed topics well-suited for being passed to `generate_list`

---

# Guidelines

## Make Seed Topic Generation Prompt

Let's begin by engineering a prompt we can send to `generate_list` that will generate well-formatted seed topics. While we iterate on the prompt, we will use a value of 4 for `n`.

If you get stuck, feel free to check out the *Solution* for this step just below.

### Your Work Here

In [None]:
seed_topic_prompt = '' # TODO: engineer a prompt that will result in `generate_list` generating well-formed seed topics.

In [None]:
generate_list(4, seed_topic_prompt)

### Solution

We found the following prompt resulted in well-formed seed topics.

In [None]:
seed_topic_prompt = 'plural nouns with diverse subtypes'

In [None]:
generate_list(4, seed_topic_prompt)

---

## Make Variations on Seed Topic Generation Prompts

Since we are aiming for significant volume here, we should consider where we can add variety to our prompts in hopes of generating diverse outputs. With that in mind, before we attempt to start using `generate_list` to generate lots of seed topics, let's create a few variations of prompts we can use.

Create 5 variations on your working `seed_topic_prompt` above that all result in `generate_list` generating well-formed seed topics.

If you get stuck, check out the *Solution* below.

### Your Work Here

### Solution

We got good results by using the following seed topic variations...

In [None]:
seed_topic_variations = ['things not in nature', 
                         'technical objects', 
                         'everyday things', 
                         'non-tangible objects', 
                         'things in nature']

...and then adding them to our original seed topic prompt.

In [None]:
seed_topic_prompts = []
for seed_topic_variation in seed_topic_variations:
    seed_topic_prompt = f'plural nouns with diverse subtypes of {seed_topic_variation}'
    print(seed_topic_prompt)
    seed_topic_prompts.append(seed_topic_prompt)

In [None]:
for seed_topic_prompt in seed_topic_prompts:
    print(generate_list(4, seed_topic_prompt))

---

## Reuse the Same Prompts With Diverse Outputs

Remember that you can set `top_k` and `temperature` to get potentially distinct outputs from the same prompt. Let's attempt to get more unique responses for each of the `seed_topic_prompts` you created in the previous step by passing them to `generate_list` multiple times along with higher values for `top_k` and `temperature`.

If you haven't been doing so yet, start to store your seed topics in a `seed_topics` list.

Additionally, be sure to deduplicate your `seed_topics` list, which can be done by casting it to a set and then back to a list: `seed_topics = list(set(seed_topics))`.
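One caveat with the set round-trip is that it does not preserve order and treats `'dogs'` and `'Dogs'` as different topics. If that matters for your data, a slightly stricter deduplication might look like the sketch below. `dedupe_topics` is a hypothetical helper, not part of `llm_utils`, and the normalization rule (lowercasing and collapsing whitespace) is an assumption about what should count as a duplicate.

```python
def dedupe_topics(topics):
    """Deduplicate topics, keeping first-seen order.

    Topics are compared case-insensitively with whitespace collapsed;
    empty strings are dropped.
    """
    seen = set()
    unique = []
    for topic in topics:
        key = ' '.join(topic.split()).lower()  # normalized form used only for comparison
        if key and key not in seen:
            seen.add(key)
            unique.append(topic.strip())
    return unique

example = ['dogs', 'Dogs', ' flavors of ice cream', 'dogs', 'trees']
deduped = dedupe_topics(example)
print(deduped)  # ['dogs', 'flavors of ice cream', 'trees']
```

For this exercise the simpler `list(set(...))` approach is sufficient; the stricter version is mainly useful when near-duplicates would otherwise inflate your count.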

If you get stuck, check out the *Solution* below.

### Your Work Here

In [None]:
seed_topics = [] # TODO: Populate by looping over `seed_topic_prompts` multiple times while generating more random responses.

### Solution

In [None]:
seed_topics = []
for seed_topic_prompt in tqdm(seed_topic_prompts*5):  # To generate more seed topics we loop through our prompts 5 times.
                                                      # Because of our top_k and temperature settings we hope for new values each iteration.
    
    new_seed_topics = generate_list(5, seed_topic_prompt, top_k=16, temperature=0.9, top_p=0.8)
    print(new_seed_topics)
    seed_topics.extend(new_seed_topics)
    seed_topics = list(set(seed_topics))
    print(len(seed_topics))

You may have noticed some empty lists in the solution output above. Our implementation of `generate_list` returns an empty list whenever the model generates a response that is not a well-formed list.
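If you are curious how often that happens, one option is to keep each call's result and summarize the batches afterwards. The sketch below assumes you have collected the per-call results in a list of lists; `summarize_batches` and the sample data are hypothetical and do not call the model.

```python
def summarize_batches(batches):
    """Count attempts, empty results, and total items across generation calls."""
    empty = sum(1 for batch in batches if not batch)
    items = sum(len(batch) for batch in batches)
    return {'attempts': len(batches), 'empty': empty, 'items': items}

# Illustrative results from four calls, two of which came back empty.
batches = [['dogs', 'trees'], [], ['rivers'], []]
summary = summarize_batches(batches)
print(summary)  # {'attempts': 4, 'empty': 2, 'items': 3}
```

A high share of empty results can be a hint that the prompt or sampling settings are producing malformed lists too often.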

---

## Use a While Loop

If at this point you still haven't reached your goal of 100 seed topics, try creating a `while` loop that repeats your generation steps, cleaning out duplicates at the end of each iteration, until you've reached your goal.

Feel free to check out the *Solution* below if you get stuck.

### Your Work Here

In [None]:
# Your work here

### Solution

Here we slightly modify what we did in the last step by creating a `while` loop that runs until `seed_topics` contains at least 150 unique items. We deliberately overshoot the target of 100 because we expect to edit out items that are not well-formed. We ensure the items are unique by deduplicating `seed_topics` on each iteration.

We've added a `tqdm` progress bar for convenience.

In [None]:
seed_topics = []
progress_bar = tqdm(total=150)
while len(seed_topics) < 150: # We are going to overshoot our target assuming we will edit out many items that are not well-formed
    for seed_topic_prompt in seed_topic_prompts:

        new_seed_topics = generate_list(5, seed_topic_prompt, top_k=16, temperature=0.9, top_p=0.8)
        seed_topics.extend(new_seed_topics)
        seed_topics = list(set(seed_topics))
        progress_bar.update(len(seed_topics) - progress_bar.n)

progress_bar.close()
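A `while` loop like this never exits if the generator stops producing new items. As a hedged sketch, you could cap the number of rounds. `collect_topics`, `max_rounds`, and both stub generators below are hypothetical and stand in for `generate_list` so the example runs without a model.

```python
import itertools

def collect_topics(generate, prompts, target, max_rounds=20):
    """Loop over prompts until `target` unique topics are collected or `max_rounds` is hit."""
    topics = []
    rounds = 0
    while len(topics) < target and rounds < max_rounds:
        for prompt in prompts:
            topics.extend(generate(5, prompt))
        topics = list(dict.fromkeys(topics))  # deduplicate while keeping order
        rounds += 1
    return topics, rounds

counter = itertools.count()

def fake_generate(n, prompt):
    # Hypothetical stub: always returns n new, unique placeholder topics.
    return [f'topic {next(counter)}' for _ in range(n)]

def always_empty(n, prompt):
    # Hypothetical stub: simulates a generator that never returns a well-formed list.
    return []

topics, rounds = collect_topics(fake_generate, ['a', 'b'], target=25)
stalled, stalled_rounds = collect_topics(always_empty, ['a', 'b'], target=25, max_rounds=3)
print(len(topics), rounds)          # 30 3
print(len(stalled), stalled_rounds)  # 0 3
```

Passing the generator in as an argument keeps the loop testable; in the notebook you would pass a small wrapper around `generate_list` with your chosen sampling settings.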

## (Optional) Clean Your Seed Topics

This may be more effort than you'd like to put into this challenge right now, but in real settings you need to take care that your data is of good quality.

If you'd like to go through the seed topics you have so far and clean them up, feel free to pass your `seed_topics` list into the provided `edit_list` function, which will let you look through the items in your list and delete or edit the ones you don't want. Remember, each `seed_topic` in the list should fit sensibly into the template `Make a python list of 3 {seed_topic}`.
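One quick way to judge fit is to read each topic inside the rendered template rather than on its own. The sketch below redefines the simplified `gen_list_template` from earlier in this notebook so that it is self-contained; the sample topics are illustrative.

```python
def gen_list_template(n, topic):
    # Simplified template from earlier in the notebook (no few-shot logic).
    return f'Make a python list of {n} {topic}.'

sample_topics = ['dogs', 'flower', 'flavors of ice cream']  # illustrative; 'flower' is singular on purpose
rendered = [gen_list_template(3, topic) for topic in sample_topics]

for line in rendered:
    print(line)
```

Reading the output aloud makes poorly formed topics (such as the singular `'flower'`) easy to spot before you decide what to keep, edit, or remove.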

In [None]:
# Optional
bk_seed_topics = seed_topics.copy() # Make a backup in case we mess up editing
edit_list(seed_topics)

In [None]:
len(set(seed_topics))  # Number of unique seed topics remaining after editing