### OpenAI Access

First, you'll need to set up an account on [OpenAI](https://platform.openai.com). Once you've done that, follow [these instructions](https://help.openai.com/en/articles/4936850-where-do-i-find-my-secret-api-key) to create an API key. Make sure you save your API key!

In [1]:
import os

# Set the OPENAI_API_KEY environment variable
os.environ["OPENAI_API_KEY"] = ""
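
Leaving the key blank, or forgetting to set it, only surfaces later as an authentication error. A minimal sketch of an early check, assuming a hypothetical helper named `load_api_key` (not part of any library), reads the key from a mapping such as `os.environ` and fails loudly if it is missing:

```python
import os

def load_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the API key stored under `name`, or raise if it is missing or blank.

    `load_api_key` is a hypothetical helper used for illustration only.
    """
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; set it before running the notebook.")
    return key

# Example with an explicit mapping, so the sketch runs without a real key.
print(load_api_key({"OPENAI_API_KEY": "sk-example"}))
```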

### OpenAI API Library

We'll use the [openai-python](https://github.com/openai/openai-python) library to access OpenAI's model endpoints.

There are a number of models to choose from and you can find resources about them [here](https://platform.openai.com/docs/models) and their pricing [here](https://openai.com/pricing).

The first step is to install `openai`! The code in this tutorial uses the library's pre-1.0 interface (`openai.ChatCompletion`, `openai.Model`), so we pin the version below 1.0.

In [2]:
!pip install "openai<1" -q

Once we've installed it, we need to import it and set our API key!

In [3]:
import openai

openai.api_key = os.environ.get("OPENAI_API_KEY")

If you want to use `gpt-4`, you need an account that has closed beta access to the model endpoint.

You can check whether your API key has access using the following cell.

In [4]:
# check if acct. has gpt-4 access
"gpt-4" in [model["root"] for model in openai.Model.list()["data"]]

True
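
The same check can drive a fallback, so the rest of the notebook works whether or not the account has access. This is a minimal sketch, assuming a hypothetical helper `choose_model` and a hand-written stand-in for the models listing (real listings carry more fields than `id`):

```python
def choose_model(model_listing, preferred="gpt-4", fallback="gpt-3.5-turbo"):
    """Return `preferred` if it appears in the listing, otherwise `fallback`.

    `choose_model` is a hypothetical helper; `model_listing` is assumed to be a
    dict with a "data" list whose entries each have an "id".
    """
    available = {entry.get("id") for entry in model_listing.get("data", [])}
    return preferred if preferred in available else fallback

# Hand-written sample data, not a real API response.
sample_listing = {"data": [{"id": "gpt-3.5-turbo"}, {"id": "text-embedding-ada-002"}]}
print(choose_model(sample_listing))
```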

For the rest of the tutorial, we're going to assume you're using `gpt-3.5-turbo` as your model.

Let's make some helper functions for prompting our model and generating our prompts.

In [5]:
def prompt_model(prompt_list, model="gpt-3.5-turbo"):
  return openai.ChatCompletion.create(model=model, messages=prompt_list)

def create_prompt(role, prompt):
  return {"role" : role, "content" : prompt}

As you can see, our prompts have to follow a specific format set by OpenAI: each message is a dictionary with a `role` key and a `content` key.

Here's an example:

```
{"role" : "system", "content" : "You are an expert in Python programming."}
{"role" : "user", "content" : "Please define a function that provides the Nth number of the fibonacci sequence."}
```
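
The same format extends to longer conversations, where earlier replies are passed back with the `assistant` role. Here is a minimal sketch, assuming a hypothetical helper `build_messages` that assembles the list and rejects unknown roles:

```python
VALID_ROLES = {"system", "user", "assistant"}

def build_messages(system_text, turns):
    """Build a chat message list: one system message, then (role, text) turns in order.

    `build_messages` is a hypothetical helper used for illustration only.
    """
    messages = [{"role": "system", "content": system_text}]
    for role, text in turns:
        if role not in VALID_ROLES:
            raise ValueError(f"Unknown role: {role!r}")
        messages.append({"role": role, "content": text})
    return messages

conversation = build_messages(
    "You are an expert in Python programming.",
    [
        ("user", "Please define a function that provides the Nth number of the fibonacci sequence."),
        ("assistant", "def fibonacci(n): ..."),  # placeholder for an earlier model reply
        ("user", "Now add a docstring."),
    ],
)
for message in conversation:
    print(message["role"], "->", message["content"][:40])
```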

Let's see that in action! Remember that OpenAI's chat completion endpoint accepts a list of prompts!

In [6]:
list_of_prompts = [
    {"role" : "system", "content" : "You are an expert in Python programming."},
    {"role" : "user", "content" : "Please define a function that provides the Nth number of the fibonacci sequence."}
]

model_output = prompt_model(list_of_prompts)
print(model_output)

{
  "id": "chatcmpl-89ZOxSORfqAOThlEF95BzbMsaWI4Y",
  "object": "chat.completion",
  "created": 1697291727,
  "model": "gpt-3.5-turbo-0613",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Sure! Here's a Python function that calculates the Nth number of the Fibonacci sequence:\n\n```python\ndef fibonacci(n):\n    if n <= 0:\n        return \"N must be a positive integer.\"\n    elif n == 1:\n        return 0\n    elif n == 2:\n        return 1\n    else:\n        fib = [0, 1]\n        for i in range(2, n):\n            fib.append(fib[i-1] + fib[i-2])\n        return fib[n-1]\n```\n\nYou can use this function to find the Nth number in the Fibonacci sequence by passing the value of N as an argument. For example, if you want to find the 8th number in the sequence:\n\n```python\nn = 8\nresult = fibonacci(n)\nprint(result)\n```\n\nOutput:\n```python\n13\n```\n\nIn this example, the 8th number in the Fibonacci sequence is 13."
      

As you can see, we get a lot of information back from the endpoint.

We can see the number of tokens we used, why the output stopped, what the output is, and more!
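
To pull out just those pieces, the response can be indexed like a nested dictionary. The sketch below uses a hand-written stand-in shaped like a chat completion response (the values are illustrative, not real output), and `summarize_response` is a hypothetical helper:

```python
# Hand-written stand-in for a chat completion response; values are illustrative.
sample_response = {
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Sure! Here's a function..."},
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 30, "completion_tokens": 12, "total_tokens": 42},
}

def summarize_response(response):
    """Collect the output text, the stop reason, and the token total from a response."""
    choice = response["choices"][0]
    return {
        "content": choice["message"]["content"],
        "finish_reason": choice["finish_reason"],
        "total_tokens": response["usage"]["total_tokens"],
    }

print(summarize_response(sample_response))
```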

Let's render the response more clearly using IPython's display utilities.

In [7]:
from IPython.display import display, Markdown

markdown_output = model_output["choices"][0]["message"]["content"]

display(Markdown(markdown_output))

Sure! Here's a Python function that calculates the Nth number of the Fibonacci sequence:

```python
def fibonacci(n):
    if n <= 0:
        return "N must be a positive integer."
    elif n == 1:
        return 0
    elif n == 2:
        return 1
    else:
        fib = [0, 1]
        for i in range(2, n):
            fib.append(fib[i-1] + fib[i-2])
        return fib[n-1]
```

You can use this function to find the Nth number in the Fibonacci sequence by passing the value of N as an argument. For example, if you want to find the 8th number in the sequence:

```python
n = 8
result = fibonacci(n)
print(result)
```

Output:
```python
13
```

In this example, the 8th number in the Fibonacci sequence is 13.

### Generating Synthetic Data

Alright, now we can pull everything together and start creating our synthetic data!

**NOTE:** Using OpenAI's endpoints to create our dataset means that we cannot use the resulting model commercially. This notebook is meant to demonstrate the method, which can be extended to any open-source LLM.

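We're going to use this process to create 100 product/marketing email pairs. The prompt below requests 10 products per call, so repeat the call or raise the number in the prompt to reach 100.

We'll be doing this in two steps: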

1. Create the 100 products and short descriptions.
2. Create marketing emails for each of those 100 product/descriptions.
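
The two steps above can be sketched end to end with a stand-in for the model. Everything here is hypothetical and for illustration only: `run_pipeline` takes any callable that maps a prompt string to a reply string, and `fake_model` returns canned text so the sketch runs offline.

```python
def run_pipeline(ask_model, n_products):
    """Step 1: ask for products. Step 2: ask for one marketing email per product."""
    listing = ask_model(f"List {n_products} products, one per line, as 'Name: description'.")
    pairs = []
    for line in listing.splitlines():
        name, _, description = line.partition(":")
        name, description = name.strip(), description.strip()
        if not name or not description:
            continue  # skip lines that do not look like "Name: description"
        email = ask_model(f"Write a short marketing email for {name}: {description}")
        pairs.append({"product": name, "description": description, "marketing_email": email})
    return pairs

def fake_model(prompt):
    """Canned replies standing in for a real model call."""
    if prompt.startswith("List"):
        return "Alpha: A demo product.\nBeta: Another demo product."
    return f"EMAIL for [{prompt}]"

for pair in run_pipeline(fake_model, n_products=2):
    print(pair["product"], "|", pair["marketing_email"])
```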

Let's begin by creating the prompt for our products/descriptions!

In [8]:
datagen_prompts = [
    {"role" : "system", "content" : "You are a product innovator. You create new products that people crave."},
    {"role" : "user", "content" : "Please generate a list of 10 new products and extremely short descriptions."},
]

In [9]:
first_data_gen = prompt_model(datagen_prompts)
print(first_data_gen["choices"][0]["message"]["content"])

1. LunaSleep: A smart pillow that analyzes sleep patterns, adjusts firmness and temperature, and emits soothing aromas for a restful and rejuvenating slumber.

2. SunCharge: A solar-powered phone case that charges your device using clean energy on-the-go, ensuring your phone never runs out of battery.

3. Pet Health Tracker: A wearable device for pets that monitors their vital signs, activity levels, and nutrition, providing real-time health insights to pet owners.

4. MultitaskMaster: An intuitive smartwatch with integrated task management features, allowing you to prioritize and stay organized efficiently throughout the day.

5. EnergiStation: A sleek and compact charging station that wirelessly powers multiple devices simultaneously, eliminating messy cables and clutter.

6. FreshAir Purifier: A portable air purifier that uses advanced filtration technology to remove allergens, pollutants, and odors, ensuring clean and refreshing air wherever you go.

7. FitWave: An interactive fitn

Okay, now that we have a list of products, let's parse them into a Python list. We can also keep track of our total token usage to estimate costs!

In [10]:
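def retrieve_token_usage(open_ai_response):
  # "total_tokens" already equals prompt_tokens + completion_tokens,
  # so summing every value in "usage" would count each token twice.
  return open_ai_response["usage"]["total_tokens"]

In [11]:
f"We used {retrieve_token_usage(first_data_gen)} tokens"

'We used 355 tokens'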
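
Token counts turn into a cost estimate once they are multiplied by the per-token prices. A minimal sketch, assuming a hypothetical helper `estimate_cost`, separate prices for prompt and completion tokens, and placeholder prices (look up the current values on the pricing page):

```python
def estimate_cost(usage, price_per_1k_prompt, price_per_1k_completion):
    """Estimate the cost of one call from its usage block and per-1,000-token prices."""
    prompt_cost = usage["prompt_tokens"] / 1000 * price_per_1k_prompt
    completion_cost = usage["completion_tokens"] / 1000 * price_per_1k_completion
    return prompt_cost + completion_cost

# Illustrative usage numbers, not taken from a real response.
sample_usage = {"prompt_tokens": 50, "completion_tokens": 300, "total_tokens": 350}

# Placeholder prices in USD per 1,000 tokens; these are assumptions, not current rates.
cost = estimate_cost(sample_usage, price_per_1k_prompt=0.001, price_per_1k_completion=0.002)
print(f"Estimated cost: ${cost:.6f}")
```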

The following code might need to be modified based on how your data was returned by OpenAI's endpoint!

In [12]:
text_response = first_data_gen["choices"][0]["message"]["content"]

products_and_descriptions = []
for line in text_response.splitlines():
  # Expect lines shaped like "1. ProductName: description."
  number, _, rest = line.partition(". ")
  if not number.strip().isdigit() or ":" not in rest:
    continue
  # Split on the first colon only, so colons inside the description survive.
  product, _, description = rest.partition(":")
  products_and_descriptions.append(
      {
          "product" : product.strip(),
          "description" : description.strip()
      }
  )

In [13]:
products_and_descriptions[0]

{'product': 'LunaSleep',
 'description': 'A smart pillow that analyzes sleep patterns, adjusts firmness and temperature, and emits soothing aromas for a restful and rejuvenating slumber.'}
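
Because the list format can change between runs, a more tolerant parser can help. The sketch below is one possible approach, not the only one: it assumes numbered lines that use `.` or `)` after the number, a colon between name and description, and optional `**` bold markers. `LINE_PATTERN` and `parse_products` are hypothetical names.

```python
import re

# Matches lines such as "1. Name: description" or "2) **Name**: description".
LINE_PATTERN = re.compile(
    r"^\s*\d+[.)]\s*"                   # list number followed by "." or ")"
    r"\**\s*(?P<product>[^:*]+?)\s*\**"  # product name, optionally wrapped in **
    r"\s*:\s*\**\s*"                    # colon separator (bold may close after it)
    r"(?P<description>.+?)\s*$"         # the rest of the line is the description
)

def parse_products(text):
    """Return a list of {"product", "description"} dicts for every matching line."""
    parsed = []
    for line in text.splitlines():
        match = LINE_PATTERN.match(line)
        if match:
            parsed.append(
                {"product": match.group("product"), "description": match.group("description")}
            )
    return parsed

sample_text = """Here are some ideas:

1. LunaSleep: A smart pillow.
2) **SunCharge**: A solar-powered phone case.
"""
print(parse_products(sample_text))
```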

Now that we have our items parsed into a Python list, we can iterate through them and have whichever OpenAI model you selected create a short marketing email for each one!

First though, we'll need a system prompt to use!

In [14]:
system_prompt = create_prompt(
    "system",
    "You are a marketing executive. You are proficient at writing short, and snappy marketing emails. The emails should be easy to read, and contain excited and vibrant language."
)

We'll also need a user prompt. We'll wrap it in a function so we can call it for each of the items we created above.

In [15]:
def generate_user_prompt(product, description):
  user_prompt = create_prompt(
      "user",
      f"Please create a short marketing email using this product: {product} and this description: {description}"
  )

  return user_prompt

Now we're ready to start generating our synthetic data! We simply need to iterate through each item and collect the results into a list of dictionaries!

(Depending on which model you use, this step might take a long time and could become expensive!)

In [16]:
from openai.error import RateLimitError
total_token_usage = 0

for idx, item in enumerate(products_and_descriptions):
  # Skip items that already have an email, so the cell can be re-run safely.
  if "marketing_email" in item:
    continue
  print(f"Working on {idx}")
  user_prompt = generate_user_prompt(item["product"], item["description"])
  full_prompt = [system_prompt, user_prompt]
  try:
    prompt_response = prompt_model(full_prompt)
    item["marketing_email"] = prompt_response["choices"][0]["message"]["content"]
    total_token_usage += retrieve_token_usage(prompt_response)
  except RateLimitError:
    # Leave this item without an email; re-run the cell to retry it.
    continue

Working on 0
Working on 1
Working on 2
Working on 3
Working on 4
Working on 5
Working on 6
Working on 7
Working on 8
Working on 9
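
The loop above moves on when it hits a rate limit, which leaves that item without an email until the cell is re-run. An alternative is to retry with exponential backoff. The sketch below is a generic illustration under stated assumptions: `call_with_retries` and `FakeRateLimitError` are hypothetical names, and the `sleep` function is injectable so the example runs without real waiting.

```python
import time

def call_with_retries(func, retry_on=(Exception,), max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call `func()` and retry on the given exceptions, doubling the delay each time."""
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller see the error
            sleep(base_delay * 2 ** attempt)

class FakeRateLimitError(Exception):
    """Stand-in for a rate-limit exception so the sketch runs offline."""

attempts = {"count": 0}

def flaky_call():
    """Fail twice, then succeed, to exercise the retry path."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise FakeRateLimitError("slow down")
    return "ok"

delays = []  # record the requested delays instead of actually sleeping
result = call_with_retries(flaky_call, retry_on=(FakeRateLimitError,), sleep=delays.append)
print(result, delays)
```

In the notebook, the call to `prompt_model(full_prompt)` could be wrapped in a lambda and passed as `func`, with `RateLimitError` as the exception to retry on.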


In [17]:
products_desc_and_marktng_emails_dataset = [p_d_and_m for p_d_and_m in products_and_descriptions if "marketing_email" in p_d_and_m]

### Uploading Dataset to Hugging Face Hub

Now that we've created our synthetic dataset, let's push it to the Hugging Face Hub!

As always, the first task is to get the required dependencies.

In [18]:
!pip install huggingface_hub -q

Now we can log in to Hugging Face!

Make sure you have a Hugging Face account and have set up a token with write access, since we'll be pushing a dataset!

More info here: https://huggingface.co/docs/hub/security-tokens

In [19]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|
    
    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: 
Add token as git credential? (Y/n) y
Token is valid (permission: write).
Cannot authenticate through git-credential as no helper is defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub.
Run the following command in your terminal in case you want to set the '

Now we can load our data into the desired format and upload it to the Hub!

In [20]:
!pip install datasets -q

In [21]:
from datasets import Dataset
import pandas as pd

In [22]:
hf_dataset = Dataset.from_pandas(pd.DataFrame(data=products_desc_and_marktng_emails_dataset))

In [23]:
hf_dataset

Dataset({
    features: ['product', 'description', 'marketing_email'],
    num_rows: 10
})
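
Before uploading, it can be worth confirming that every record carries all three features shown above. This is a minimal sketch, assuming a hypothetical helper `split_valid_records` and hand-written sample records:

```python
REQUIRED_FIELDS = ("product", "description", "marketing_email")

def split_valid_records(records):
    """Separate records that have every required field as non-empty text from the rest."""
    valid, rejected = [], []
    for record in records:
        complete = all(
            isinstance(record.get(field), str) and record[field].strip()
            for field in REQUIRED_FIELDS
        )
        (valid if complete else rejected).append(record)
    return valid, rejected

# Hand-written sample records for illustration.
sample_records = [
    {"product": "Alpha", "description": "A demo product.", "marketing_email": "Try Alpha today!"},
    {"product": "Beta", "description": "Another demo product."},          # email missing
    {"product": "Gamma", "description": "A third.", "marketing_email": "  "},  # email blank
]
valid, rejected = split_valid_records(sample_records)
print(len(valid), "valid,", len(rejected), "rejected")
```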

In [24]:
hf_username = "renatomoulin"
dataset_name = "fourthbrain_synthetic_marketmail_gpt35"

hf_dataset.push_to_hub(f"{hf_username}/{dataset_name}")

### Conclusion

And that's it! You just created a synthetic dataset and pushed it to the hub!

Next stop? [Modeling!](https://colab.research.google.com/drive/1RfUuzG11Q8AaZuJIHLzXCVC087xoDeSd?usp=sharing)