![NVIDIA Logo](images/nvidia.png)

# Project: Generate Detailed Synthetic Emails

In this notebook you will use your fine-tuned GPT8B-based `generate_list` LLM function, in conjunction with GPT43B, to generate a collection of synthetic customer emails, each containing a variety of distinct and appropriate details.

---

## Learning Objectives

By the time you complete this notebook you will:
- Generate several collections of synthetic data for use in customer email generation.
- Generate at least 50 unique customer emails, each with distinct and appropriate details, written to a fictitious company of our creation.

---

## Imports

In [None]:
import random
import json

from tqdm.notebook import tqdm

from llm_utils.helpers import edit_list
from llm_utils.nemo_service_models import NemoServiceBaseModel
from llm_utils.models import Models
from llm_utils.llm_functions import (
    make_llm_function,
    generate_list_8B_lora as generate_list, 
    generate_customer_email as example_generate_customer_email
)

---

## Models

In [None]:
Models.list_models()

---

## Exercise Main Objective

![Customer Emails](images/customer_emails.png)

The main objective of this project is to synthetically generate 50 unique customer emails, roughly 150 words each, addressed to a fictitious company of our creation called StarBikes, using an LLM function we will call `generate_customer_email`. Aside from being a coherent, well-formed message, each email should contain text that specifies:
- The customer's distinct mood.
- The sender's name.
- Your company's name.
- The product that the email is about, which should be appropriate to your company's industry.
- The location of the store where the product was purchased.
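
As a rough aid for reviewing generated emails against the list above, here is a minimal sketch of a substring check. The function name `missing_details` and the sample email are hypothetical and not part of `llm_utils`. Mood is left out on purpose, since it is conveyed by tone rather than by a literal string.

```python
def missing_details(email, sender_name, company_name, product, store_location):
    """Return the names of required details whose text does not appear in the email."""
    required = {
        'sender_name': sender_name,
        'company_name': company_name,
        'product': product,
        'store_location': store_location,
    }
    email_lower = email.lower()
    # Case-insensitive substring match; paraphrased details will be reported as missing.
    return [name for name, value in required.items() if value.lower() not in email_lower]


# Hypothetical sample email, written by hand for illustration only.
sample_email = (
    "Hello StarBikes team, I bought a mountain bike at your Santa Clara store "
    "and I could not be happier. Thanks, Josh"
)
print(missing_details(sample_email, 'Josh', 'StarBikes', 'mountain bike', 'Santa Clara'))
print(missing_details(sample_email, 'Josh', 'StarBikes', 'helmet', 'Oakland'))
```

A check like this only flags candidates for manual review; an email that paraphrases a product name can still be perfectly acceptable.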

---

## Exercise Example

Below is an example of the kind of email each of your 50 should resemble, generated with an LLM function of our creation, `example_generate_customer_email`.

In [None]:
sender_name = 'Josh'
company_name = 'NVIDI-OMG'
industry = 'tech'
product = 'H100000'
mood = 'happy'
store_location = 'Santa Clara'

In [None]:
example_customer_email = example_generate_customer_email(sender_name, 
                                                         company_name, 
                                                         industry, 
                                                         product, 
                                                         mood, 
                                                         store_location, 
                                                         top_k=4, 
                                                         temperature=0.7)

In [None]:
print(example_customer_email)

---

## Exercise Constraints

In addition to the main objective just discussed, please adhere to the following exercise constraints.

### Create Company

Your customer emails should be addressed to StarBikes, our fictitious company, which is part of the bike industry.

In [None]:
company_name = 'StarBikes'
company_industry = 'bike'

### Synthetic Email Details

Aside from the company's name and industry, each customer email should contain unique instances of the customer's name, mood, product, and store location. To this end, you will need to create synthetically generated data representing instances of each of these categories for use in generating 50 unique customer emails.

For this part of your work, constrain yourself to using the LoRA fine-tuned GPT8B-based `generate_list` LLM function you created earlier. We have imported our solution implementation for you.

In [None]:
generate_list(5, 'affirmations')

### Synthetic Email Generation

When it comes time to generate the emails themselves, please use GPT43B.

In [None]:
email_generator_llm = NemoServiceBaseModel(Models.gpt43b.value)

---

## Begin Your Work

If you're up for a big challenge, you can jump right in. If you'd like additional support, expand the _Exercise Walkthrough_ section below, which works through the challenge step by step.

### Your Work Here

---

# Exercise Walkthrough

## Generate Synthetic Customer Emails

This is the time to put your prompt engineering skills to the test. Begin by creating a prompt template you can use for synthetic email generation. The prompt template should include all of the details needed for an email discussed above.

Start by iteratively developing a prompt with GPT43B, and once you're satisfied with your prompt, capture it in a prompt template function that takes relevant email details and returns a well-formed prompt.

If you'd like to see a solution for this section, expand the _Solution_ section below.

### Your Work Here

In [None]:
def customer_email_prompt_template(sender_name, company_name, industry, product, mood, store_location):
    return 'TODO' # TODO: Create a prompt template to generate synthetic customer emails as specified

In [None]:
customer_email_prompt = customer_email_prompt_template('Josh', 'NVIDI-OMG', 'tech', 'H100000', 'happy', 'Santa Clara')

In [None]:
customer_email_prompt

In [None]:
email_generator_llm.generate(customer_email_prompt)

### Solution

In [None]:
def customer_email_prompt_template(sender_name, company_name, industry, product, mood, store_location):
    return f"""\
Write a 150 word email from a customer named {sender_name} to the fictitious company {company_name} \
that ends in the customer's name.

Context: The mood of the customer {sender_name} is {mood}.

Instructions: Take the following steps in drafting this email from the customer {sender_name} to the {industry} company {company_name}:
1) The customer asks a question or makes a complaint about the following product: {product}.
2) The customer tells a brief story about a relevant experience they had with their {company_name} {product}.
3) The customer describes that they purchased the product at a {company_name} store location in {store_location}.
4) The customer signs off in a way that matches their mood using their name {sender_name}.
5) If the customer hasn't signed off with their name ({sender_name}) they sign off with their name ({sender_name}).
"""

In [None]:
customer_email_prompt = customer_email_prompt_template('Josh', 'NVIDI-OMG', 'tech', 'H100000', 'happy', 'Santa Clara')

In [None]:
email_generator_llm.generate(customer_email_prompt)

---

## Create LLM Function

![Email LLM Function](images/email_llm_function.png)

With an appropriate model and prompt template, we can now create an LLM function, which we'll call `generate_customer_email` to encapsulate the synthetic customer email generation task.

You can use `make_llm_function` along with your `email_generator_llm` model, your `customer_email_prompt_template`, and the `strip` function below as the `postprocessor`.
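
To make the idea of an LLM function concrete, here is a minimal sketch of how a factory like this could compose a prompt template, a model, and a postprocessor. This is an assumption about the general pattern, not the actual implementation of `make_llm_function`. The names `sketch_make_llm_function`, `EchoModel`, and `greeting_prompt_template` are hypothetical, and `EchoModel` does not call any LLM service.

```python
class EchoModel:
    """Hypothetical stand-in for a model object; it only needs a `generate` method."""

    def generate(self, prompt, **generation_kwargs):
        # A real model would call an LLM service; this one pads and echoes the prompt.
        return f"  [response to: {prompt}]  "


def sketch_make_llm_function(model, prompt_template, postprocessor=None):
    """Compose template -> model -> postprocessor into a single callable."""
    def llm_function(*template_args, **generation_kwargs):
        # Positional arguments fill the template; keyword arguments go to the model.
        prompt = prompt_template(*template_args)
        response = model.generate(prompt, **generation_kwargs)
        return postprocessor(response) if postprocessor else response
    return llm_function


def greeting_prompt_template(name):
    return f"Write a greeting for {name}."


def strip(response):
    return response.strip()


greet = sketch_make_llm_function(EchoModel(), greeting_prompt_template, postprocessor=strip)
print(greet('Josh', temperature=0.7))
```

Under this sketch, keyword arguments such as `top_k` and `temperature` pass straight through to the model's `generate` call, which matches how the LLM functions in this notebook are invoked.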

If you get stuck, feel free to check the *Solution* below.

### Your Work Here

In [None]:
def strip(response):
    return response.strip()

In [None]:
generate_customer_email = "TODO" # TODO: Make an LLM function to encapsulate the customer email generation task.

### Solution

In [None]:
def strip(response):
    return response.strip()

In [None]:
generate_customer_email = make_llm_function(NemoServiceBaseModel(Models.gpt43b.value), 
                                            customer_email_prompt_template, 
                                            postprocessor=strip)

In [None]:
generate_customer_email('Josh', 'NVIDI-OMG', 'tech', 'H100000', 'happy', 'Santa Clara')

---

## Generate Synthetic Data for Prompt Template Parameters

![Email Details](images/email_details.png)

Ultimately, we want to scale the use of `generate_customer_email`, but before we can, we need to generate synthetic data for the following template parameters:

- sender_names
- products
- moods
- store_locations

To do this you will be using your LoRA fine-tuned GPT8B powered `generate_list` function, along with any other code required, to populate these lists.

---

## Generate Sender Names

Use `generate_list` to populate a `sender_names` list with 30 unique names. Don't forget you can pass in named arguments for `top_k` and `temperature` to get more variety out of your generated lists. Also recall from the previous section that `generate_list` with GPT8B performs best with list lengths of 7 or fewer, and that `generate_list` returns an empty list if the underlying model response is malformed.
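
Because each call returns only a short list, and sometimes an empty one, you will need to call the generator repeatedly and deduplicate as you go. Here is a minimal sketch of that pattern, with a cap on the number of rounds so that repeated empty responses cannot loop forever. The names `collect_unique` and `fake_generate_list` are hypothetical; `fake_generate_list` is a deterministic stand-in used only so the example runs without a model.

```python
import itertools


def collect_unique(list_generator, topics, target, batch_size=5, max_rounds=20, **generation_kwargs):
    """Call `list_generator` in small batches until `target` unique items are collected.

    Stops after `max_rounds` passes over `topics`, so empty or duplicate
    responses cannot cause an endless loop. May return more than `target` items.
    """
    collected = []
    seen = set()
    for _ in range(max_rounds):
        for topic in topics:
            for item in list_generator(batch_size, topic, **generation_kwargs):
                if item not in seen:  # keep first-seen order while deduplicating
                    seen.add(item)
                    collected.append(item)
        if len(collected) >= target:
            break
    return collected


_call_counter = itertools.count()


def fake_generate_list(n, topic, **generation_kwargs):
    """Hypothetical stand-in: every third call returns [] like a malformed response."""
    call_number = next(_call_counter)
    if call_number % 3 == 2:
        return []
    return [f"{topic} {call_number}-{i}" for i in range(n)]


names = collect_unique(fake_generate_list, ['male names', 'female names'], target=30)
print(len(names))
```

In the notebook you would pass `generate_list` in place of the stand-in, along with keyword arguments such as `top_k` and `temperature`.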

Use the `edit_list` helper on your generated list to clean any responses you don't want.

If you'd like to see a solution for this section, expand the _Solution_ section below.

### Your Work Here

In [None]:
sender_names = [] # TODO: populate with 30 unique names

In [None]:
edit_list(sender_names)

### Solution

In [None]:
sender_name_queues = [
    "male names",
    "female names",
    "unusual names",
    "names that start with the letter V"
]

In [None]:
sender_names = []
while len(sender_names) < 30:
    for sender_name_queue in sender_name_queues:
        sender_name = generate_list(5, sender_name_queue, top_k=8, temperature=1)
        sender_names.extend(sender_name)
        sender_names = list(set(sender_names))
        print(len(sender_names))

In [None]:
len(sender_names)

In [None]:
edit_list(sender_names)

---

## Generate Products

Use your LoRA fine-tuned GPT8B powered `generate_list` function, along with any other code required, to populate a `products` list with 30 products appropriate to your company and its industry.

Don't forget you can pass in named arguments for `top_k` and `temperature`.

Use the `edit_list` helper on your generated list to clean any responses you don't want.

If you'd like to see a solution for this section, expand the _Solution_ section below.

### Your Work Here

In [None]:
products = [] # TODO: populate `products` with 30 products appropriate to your fictitious company

In [None]:
edit_list(products)

### Solution

In [None]:
bike_queues = [
    "parts sold at a bicycle shop",
    "kinds of bike",
    "bike accessories",
    "unusual things I would find at a bike store"
]

In [None]:
products = []
while len(products) < 40: # Overshooting in case some need to be edited out
    for bike_queue in bike_queues:
        bike_product = generate_list(5, bike_queue, top_k=8, temperature=1)
        products.extend(bike_product)
        products = list(set(products))
        print(len(products))

In [None]:
len(products)

In [None]:
edit_list(products)

---

## Generate Moods

Use your LoRA fine-tuned GPT8B powered `generate_list` function, along with any other code required, to populate `moods` with 30 moods the customer might be in.

Don't forget you can pass in named arguments for `top_k` and `temperature`.

Use the `edit_list` helper on your generated list to clean any responses you don't want.

If you'd like to see a solution for this section, expand the _Solution_ section below.

### Your Work Here

In [None]:
moods = [] # TODO: populate `moods` with 30 distinct moods a person (customer) might be in.

In [None]:
edit_list(moods)

### Solution

In [None]:
mood_queues = [
    "moods a happy customer might be in",
    "moods a disgruntled customer might be in",
    "moods an inquisitive customer might be in"
]

In [None]:
moods = []
while len(moods) < 30:
    for mood_queue in mood_queues:
        mood = generate_list(5, mood_queue, top_k=8, temperature=1)
        moods.extend(mood)
        moods = list(set(moods))
        print(len(moods))

In [None]:
len(moods)

In [None]:
edit_list(moods)

---

## Generate Store Locations

Use your LoRA fine-tuned GPT8B powered `generate_list` function, along with any other code required, to populate `store_locations` with 30 physical locations (city names, for example) where a store at which the customer purchased their product might be located.

Don't forget you can pass in named arguments for `top_k` and `temperature`.

Use the `edit_list` helper on your generated list to clean any responses you don't want.

If you'd like to see a solution for this section, expand the _Solution_ section below.

### Your Work Here

In [None]:
store_locations = [] # TODO: populate `store_locations` with 30 physical locations where one of your fictitious stores might be

In [None]:
edit_list(store_locations)

### Solution

In [None]:
store_location_queues = [
    "cities in California",
    "cities in Maryland",
    "cities in Alaska"
]

In [None]:
store_locations = []
while len(store_locations) < 30:
    for store_location_queue in store_location_queues:
        store_location = generate_list(5, store_location_queue, top_k=8, temperature=1)
        store_locations.extend(store_location)
        store_locations = list(set(store_locations))
        print(len(store_locations))

In [None]:
len(store_locations)

In [None]:
edit_list(store_locations)

---

## Check List Lengths

At this point `sender_names`, `products`, `moods` and `store_locations` should each have at least 30 unique items. Please run the following cell to confirm.

In [None]:
lists = {'sender_names': sender_names, 'products': products, 'moods': moods, 'store_locations': store_locations}
good = True
for k, l in lists.items():
    if len(set(l)) < 30:
        print(f'{k} only has {len(set(l))} items, please correct.')
        good = False

if good:
    print('All your lists have at least 30 items.')

---

## Generate Synthetic Customer Emails

![Customer Emails](images/customer_emails.png)

Now that you have a `generate_customer_email` LLM function, and synthetic data for all the details we would like to include in the synthetic customer emails, you're ready to create the synthetic customer emails.

Using your `generate_customer_email` function, populate an `emails` list with 50 synthetic emails.

Remember that you can set parameters like `top_k` and `temperature` to influence the creativity of your emails.

Here are the lists and values you've created to use in your calls to `generate_customer_email`.

In [None]:
company_name

In [None]:
company_industry

In [None]:
random.choice(sender_names)

In [None]:
random.choice(products)

In [None]:
random.choice(moods)

In [None]:
random.choice(store_locations)

And for your reference, here are the arguments `generate_customer_email` expects.

```python
generate_customer_email(sender_name, company_name, industry, product, mood, store_location)
```

Before doing a large generation loop, be sure to try out one or several generations first.
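
Independent `random.choice` calls can occasionally pick the same combination of details twice. If you would like every email to start from a distinct combination, here is a minimal sketch that draws combinations without repeats. The function name `sample_detail_combinations` and the small example lists are hypothetical.

```python
import random


def sample_detail_combinations(sender_names, products, moods, store_locations, n, seed=None):
    """Draw `n` distinct (sender_name, product, mood, store_location) tuples."""
    pools = [sender_names, products, moods, store_locations]
    max_possible = 1
    for pool in pools:
        max_possible *= len(set(pool))
    if n > max_possible:
        raise ValueError(f'Only {max_possible} distinct combinations are possible.')

    rng = random.Random(seed)  # a seed makes the draw reproducible
    seen = set()
    combinations = []
    while len(combinations) < n:
        combo = tuple(rng.choice(pool) for pool in pools)
        if combo not in seen:
            seen.add(combo)
            combinations.append(combo)
    return combinations


# Hypothetical example lists, kept small for illustration.
combos = sample_detail_combinations(
    ['Josh', 'Ana'], ['helmet', 'road bike'], ['happy', 'curious'], ['Oakland', 'Juneau'],
    n=10, seed=0,
)
print(len(combos), len(set(combos)))
```

Each tuple can then be unpacked into a call to `generate_customer_email` together with `company_name` and `company_industry`.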

Feel free to check out the *Solution* below if you get stuck.

### Your Work Here

In [None]:
emails = [] # TODO: populate `emails` with 50 synthetically generated emails.

### Solution

In [None]:
emails = []
progress_bar = tqdm(total=50)
while len(emails) < 50:
    sender_name = random.choice(sender_names)
    product = random.choice(products)
    mood = random.choice(moods)
    store_location = random.choice(store_locations)

    customer_email = generate_customer_email(sender_name, 
                                             company_name, 
                                             company_industry, 
                                             product, 
                                             mood, 
                                             store_location, 
                                             top_k=4, 
                                             temperature=1.0)
    emails.append(customer_email)
    progress_bar.update(1)

progress_bar.close()

In [None]:
for email in emails[:5]:
    print(email+'\n')

---

## Check Synthetic Email List Length

At this point `emails` should have at least 50 unique items. Please run the following cell to confirm.

In [None]:
num_emails = len(set(emails))
if num_emails < 50:
    print(f'You only have {num_emails} unique emails.')
else:
    print(f'Good job, you have {num_emails} emails.')
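
Beyond uniqueness, the main objective asks for emails of roughly 150 words each. Here is a minimal sketch for spotting emails that stray far from that length. The function name `word_count_report` and the `tolerance` of 50 words are assumptions chosen for illustration, not requirements of the exercise.

```python
def word_count_report(emails, target=150, tolerance=50):
    """Return per-email word counts and the indices of emails far from `target`."""
    counts = [len(email.split()) for email in emails]  # whitespace-delimited words
    off_target = [i for i, count in enumerate(counts) if abs(count - target) > tolerance]
    return counts, off_target


# Hypothetical sample emails, for illustration only.
sample_emails = ['word ' * 150, 'too short']
counts, off_target = word_count_report(sample_emails)
print(counts, off_target)
```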