# Generating Synthetic Text

This notebook walks you through generating synthetic natural-language text that resembles the data you provide. It does so by fine-tuning a large-scale language model that has been pre-trained on billions of documents and can therefore introduce realistic new variations into the data.
 
To run this notebook, you will need an API key from the Gretel console at https://console.gretel.cloud.

**Limitations and Biases**

Large-scale language models such as GPT-X may produce untrue or offensive content without warning. We recommend having a human curate or filter the outputs before releasing them, both to remove undesirable content and to improve the quality of the results. For more information and examples, see [OpenAI](https://huggingface.co/gpt2#limitations-and-bias)'s and [EleutherAI](https://huggingface.co/EleutherAI/gpt-neo-125M#limitations-and-biases)'s docs.
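
As a rough illustration of the filtering step recommended above, the sketch below splits generated records into those that pass a simple automatic check and those held back for a human reviewer. The blocklist terms, the minimum word count, and the function names are all hypothetical placeholders, and a real pipeline would likely need much richer checks.

```python
import re

# Hypothetical blocklist: replace with terms appropriate to your use case.
BLOCKLIST = {"poison", "bleach"}
MIN_WORDS = 5  # assumed threshold for dropping degenerate, very short outputs


def needs_review(text: str) -> bool:
    """Return True if a generated record should be held for human review."""
    words = re.findall(r"[a-z']+", text.lower())
    too_short = len(words) < MIN_WORDS
    has_blocked_term = any(word in BLOCKLIST for word in words)
    return too_short or has_blocked_term


def split_for_review(records):
    """Partition records into (auto_accepted, flagged_for_review)."""
    accepted, flagged = [], []
    for record in records:
        (flagged if needs_review(record) else accepted).append(record)
    return accepted, flagged


samples = [
    "Shake gin, lemon juice, and simple syrup with ice, then strain.",
    "Add a splash of bleach and stir well before serving over ice.",
    "Stir.",
]
accepted, flagged = split_for_review(samples)
print(f"accepted={len(accepted)} flagged={len(flagged)}")
```

Records in `flagged` are not discarded automatically; the intent is that a person looks at them before anything is released.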

## Configure the project
* Install dependencies
* Import libraries
* Log into Gretel and set up a project

In [None]:
%%capture
!pip install -U gretel-client

In [None]:
import json

import pandas as pd
from gretel_client import configure_session
from gretel_client.helpers import poll
from gretel_client.projects import create_or_get_unique_project, get_project

pd.set_option('max_colwidth', None)
pd.set_option('display.max_rows', 100)

In [None]:
# Log into Gretel

# api_key="prompt" asks for the key interactively.
configure_session(
    api_key="prompt",
    cache="yes",
    endpoint="https://api.gretel.cloud",
    validate=True,
    clear=True,
)

project = create_or_get_unique_project(name="synthetic-text")
project

## Create the model configuration

In this notebook we will use GPT-Neo, a transformer model based on EleutherAI's replication of OpenAI's GPT-3 architecture. The model has been pre-trained on the Pile, a large-scale dataset, for 300 billion tokens over 572,300 steps. In this introductory example, we will fine-tune GPT-Neo on a dataset of well-known cocktail recipes so that it generates synthetic (and hopefully entertaining) recipes of its own.

In [None]:
config = {
  "schema_version": 1,
  "models": [
    {
      "gpt_x": {
        "data_source": "__",
        "pretrained_model": "EleutherAI/gpt-neo-125M",
        "batch_size": 4,
        "epochs": 3,
        "weight_decay": 0.1,
        "warmup_steps": 100,
        "lr_scheduler": "cosine",
        "learning_rate": 0.0005
      }
    }
  ]
}

print(json.dumps(config, indent=2))
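
To get a feel for how `batch_size`, `epochs`, and `warmup_steps` interact, the sketch below estimates the total number of optimizer steps and the share of them spent in warmup. It assumes one optimizer step per batch with no gradient accumulation, which may not match how the Gretel worker actually schedules training, and `num_records` is a made-up dataset size rather than the size of the cocktail dataset.

```python
import math


def training_steps(num_records: int, batch_size: int, epochs: int) -> int:
    """Optimizer steps, assuming one step per batch and no gradient accumulation."""
    return math.ceil(num_records / batch_size) * epochs


# Values copied from the config above; num_records is hypothetical.
batch_size, epochs, warmup_steps = 4, 3, 100
num_records = 1_000

total = training_steps(num_records, batch_size, epochs)
warmup_fraction = warmup_steps / total
print(f"total steps: {total}, warmup fraction: {warmup_fraction:.1%}")
```

If the warmup fraction comes out close to or above 100% for a small dataset, the learning rate would spend most of training ramping up, which is a hint to lower `warmup_steps` or raise `epochs`.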

## Load and preview the text dataset
Specify a data source to train the model on. This can be a local file, web location, or HDFS file. Currently, the text dataset must be saved in single-column CSV format.


In [None]:
dataset_path = 'https://gretel-public-website.s3.us-west-2.amazonaws.com/datasets/drink-recipes.csv'
# Read the single-column CSV of recipes into a DataFrame.
df = pd.read_csv(dataset_path)

df.to_csv('training_data.csv', index=False)
df
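
Because the dataset must be a single-column CSV, it can help to check a file before submitting it. The following is a minimal sketch using only the standard library; the helper names are illustrative and are not part of `gretel_client`, and the sample strings are invented.

```python
import csv
import io


def count_columns(csv_text: str) -> set:
    """Return the set of distinct column counts found across non-empty rows."""
    reader = csv.reader(io.StringIO(csv_text))
    return {len(row) for row in reader if row}


def is_single_column(csv_text: str) -> bool:
    """True when every non-empty row parses to exactly one field."""
    return count_columns(csv_text) == {1}


# Commas inside a quoted field do not create extra columns.
good = 'text\n"Mix gin, vermouth, and bitters."\n"Serve over ice."\n'
bad = "name,text\nMartini,Stir gin and vermouth.\n"

print(is_single_column(good))
print(is_single_column(bad))
```

Free text usually contains commas, so the quoting in the `good` sample matters: an unquoted recipe would be split into several fields and fail the check.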

## Train the synthetic model
In this step, we task a worker, running either in the Gretel cloud or locally, with fine-tuning the GPT language model on the source dataset. The cell below submits the job to the cloud.

In [None]:
%%time 

model = project.create_model_obj(model_config=config)
model.data_source = "training_data.csv"
model.name = "cocktail-generator"
model.submit_cloud()

poll(model)

## Generate synthetic text data
You can now use the fine-tuned synthetic model to generate as much synthetic data as you like. The next cells walk through three ways to generate data.

1.  Generate text records from the model
2.  Generate text records using a single text seed (or prompt)
3.  Generate text records using a unique seed per record

In [None]:
%%time 

# Example 1: Generate text records from the model.
pd.set_option('display.max_rows', 20)

record_handler = model.create_record_handler_obj(
    params={"num_records": 20, "maximum_text_length": 128}
)
record_handler.submit_cloud()
poll(record_handler)

pd.read_csv(record_handler.get_artifact_link("data"), compression='gzip')

In [None]:
# Example 2: Generate text with an optional "prompt" which is used to condition
# model generation.

record_handler = model.create_record_handler_obj(
    params={"num_records": 5, 
            "maximum_text_length": 128,
            "prompt": "Two software engineers walk into a bar. What do they order?"}
)
record_handler.submit_cloud()
poll(record_handler)

pd.read_csv(record_handler.get_artifact_link("data"), compression='gzip')

In [None]:
# Example 3: Condition generation with one prompt per record, supplied as a
# single-column CSV. The same prompt is repeated five times here for simplicity.

prompts = pd.DataFrame(["Can you make me a drink with orange juice in it?"]*5, columns=["text"])
prompts.to_csv('prompts.csv', index=False)

record_handler = model.create_record_handler_obj(
    params={"maximum_text_length": 128},
    data_source='prompts.csv'
)
record_handler.submit_cloud()
poll(record_handler)

pd.read_csv(record_handler.get_artifact_link("data"), compression='gzip')
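
Example 3 repeats one prompt, but the same file layout also allows a different prompt on each row. The sketch below builds such a file with the standard library, keeping the `text` column header used above. The ingredient list is invented for illustration, and an in-memory buffer stands in for the file on disk.

```python
import csv
import io

# Hypothetical ingredients; each one yields a different prompt.
ingredients = ["orange juice", "ginger beer", "espresso", "mint", "lime"]
prompts = [f"Can you make me a drink with {item} in it?" for item in ingredients]

# Write a single-column CSV with a "text" header, matching the layout of the
# prompts file built above.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["text"])
writer.writerows([prompt] for prompt in prompts)

prompts_csv = buffer.getvalue()
print(prompts_csv)
```

Saving `prompts_csv` as `prompts.csv` and passing that file as `data_source`, as in Example 3, should then yield one generated record per distinct prompt.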