## Synthetic Data for Data Augmentation Using a Decoder-Style LLM

In [1]:
%load_ext watermark
%watermark -a 'Sebastian Raschka' -v -p transformers

Author: Sebastian Raschka

Python implementation: CPython
Python version       : 3.10.6
IPython version      : 8.12.0

transformers: 4.27.2



In [2]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer


def generate_synthetic_text(prompt, num_samples=1, do_sample=False):
    model_name = "gpt2"
    model = GPT2LMHeadModel.from_pretrained(model_name)
    tokenizer = GPT2Tokenizer.from_pretrained(model_name)

    # Tokenize once; the prompt is the same for every sample
    inputs = tokenizer(prompt, return_tensors="pt")
    input_ids = inputs["input_ids"]
    attention_mask = inputs["attention_mask"]

    synthetic_texts = []
    for _ in range(num_samples):
        sample_output = model.generate(
            input_ids,
            attention_mask=attention_mask,
            max_length=100,  # Maximum total length in tokens (prompt + generated text)
            min_length=30,   # Minimum total length in tokens
            num_return_sequences=1,
            no_repeat_ngram_size=2,  # No 2-gram may appear twice in the output
            # The default (False) is greedy decoding, which is deterministic, so
            # every sample is identical; pass do_sample=True for varied samples
            do_sample=do_sample,
        )

        text = tokenizer.decode(sample_output[0], skip_special_tokens=True)
        synthetic_texts.append(text)

    return synthetic_texts
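
The `no_repeat_ngram_size=2` argument above can be easier to reason about with a small example. The following is a minimal pure-Python sketch of the idea, not the actual `transformers` implementation: before each decoding step, any token that would complete an n-gram already present in the generated sequence is banned. The function name `banned_next_tokens` is hypothetical, and strings stand in for token IDs.

```python
def banned_next_tokens(generated, ngram_size=2):
    """Return the tokens that would repeat an n-gram already in `generated`."""
    if ngram_size < 1 or len(generated) < ngram_size:
        return set()
    # The last (n - 1) tokens are the prefix of the n-gram being built next.
    prefix = tuple(generated[len(generated) - (ngram_size - 1):])
    banned = set()
    # Scan every n-gram seen so far; if it starts with the current prefix,
    # its final token must not be generated next.
    for i in range(len(generated) - ngram_size + 1):
        ngram = tuple(generated[i:i + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned


# "the view" already occurred, so "view" may not follow "the" again.
print(banned_next_tokens(["the", "view", "of", "the"]))  # {'view'}
```

In a real decoder, the banned tokens would have their scores set to negative infinity before the next token is chosen.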

In [5]:
# Example prompt
prompt = "The weather was nice and I enjoyed"

# Generate synthetic data
synthetic_data = generate_synthetic_text(prompt)
for text in synthetic_data:
    print(text)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


The weather was nice and I enjoyed the view. I was able to get a good view of the city and the surrounding area. The weather is good and it was a nice day.

I was very impressed with the views. It was the first time I've been to the area and was really impressed. We had a great time and we were able get to see the entire city. There was also a lot of parking and there was plenty of traffic. Parking is very easy and you can
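
Because generation stops at `max_length`, the output above ends mid-sentence ("and you can"). If such text is used for data augmentation, one option is to post-process it first. The following is a minimal sketch under simple assumptions: the helper name `clean_samples` is hypothetical, sentence ends are detected only by `.`, `!`, or `?` (so abbreviations such as "e.g." are not handled specially), and duplicates are compared as exact strings.

```python
import re


def clean_samples(samples):
    """Trim each sample to its last complete sentence and drop duplicates."""
    cleaned, seen = [], set()
    for text in samples:
        # Collapse newlines and repeated spaces into single spaces.
        text = " ".join(text.split())
        # Greedy match keeps everything up to the last '.', '!' or '?'.
        match = re.match(r".*[.!?]", text)
        if match is None:
            continue  # No complete sentence in this sample
        text = match.group(0)
        if text in seen:
            continue  # Exact duplicate, e.g. from deterministic decoding
        seen.add(text)
        cleaned.append(text)
    return cleaned


samples = [
    "The weather was nice.\n\nParking is very easy and you can",
    "The weather was nice.\n\nParking is very easy and you can",
]
print(clean_samples(samples))  # ['The weather was nice.']
```

The deduplication step matters mostly when samples are generated with greedy decoding, where repeated calls return the same text.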
