<a href="https://colab.research.google.com/github/rickiepark/MLQandAI/blob/main/supplementary/q15-text-augment/synthetic-data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Generating Synthetic Data for Data Augmentation Using a Decoder-Based LLM

In [1]:
!pip install watermark

%load_ext watermark
%watermark -a 'Sebastian Raschka' -v -p transformers

Author: Sebastian Raschka

Python implementation: CPython
Python version       : 3.10.12
IPython version      : 7.34.0

transformers: 4.44.2



In [2]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer


def generate_synthetic_text(prompt, num_samples=1):
    model_name = "gpt2"
    model = GPT2LMHeadModel.from_pretrained(model_name)
    tokenizer = GPT2Tokenizer.from_pretrained(model_name,
                                              clean_up_tokenization_spaces=False)

    synthetic_texts = []
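    # generate() decodes greedily by default, so with num_samples > 1 every
    # pass returns the same text unless sampling (do_sample=True) is enabled.
    for _ in range(num_samples):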
        inputs = tokenizer(prompt, return_tensors="pt")
        input_ids = inputs["input_ids"]
        attention_mask = inputs["attention_mask"]

        sample_output = model.generate(
            input_ids,
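            max_length=100,  # maximum length of the generated text, in tokens (prompt included)
            min_length=30,   # minimum length of the generated text, in tokens (prompt included)
            num_return_sequences=1,
            attention_mask=attention_mask,
            no_repeat_ngram_size=2  # prevent any n-gram (here, 2-gram) from repeating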
        )

        text = tokenizer.decode(sample_output[0], skip_special_tokens=True)
        synthetic_texts.append(text)

    return synthetic_texts
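
The `no_repeat_ngram_size=2` argument above stops the model from emitting any bigram twice. The following is a simplified, pure-Python sketch of that idea and not the actual `transformers` implementation; the function name `banned_next_tokens` is hypothetical and introduced only for illustration. Given the tokens generated so far, it returns the tokens that would complete an n-gram that has already appeared.

```python
def banned_next_tokens(token_ids, ngram_size=2):
    """Return the tokens that would repeat an n-gram already in token_ids.

    Illustrative helper (an assumption, not a transformers API).
    """
    if ngram_size < 1 or len(token_ids) < ngram_size:
        return set()

    # The last (ngram_size - 1) tokens are the prefix of the next n-gram.
    prefix = tuple(token_ids[len(token_ids) - (ngram_size - 1):])

    banned = set()
    for start in range(len(token_ids) - ngram_size + 1):
        ngram = tuple(token_ids[start:start + ngram_size])
        # If an earlier n-gram starts with the current prefix,
        # its final token must not be generated next.
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned


tokens = ["the", "weather", "was", "nice", "and", "the"]
print(banned_next_tokens(tokens, ngram_size=2))
```

Here the sequence ends in "the", and the bigram ("the", "weather") has already occurred, so "weather" is the one token that would be blocked at the next step.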

In [3]:
# Prompt
prompt = "The weather was nice and I enjoyed"

# Generate synthetic data
synthetic_data = generate_synthetic_text(prompt)
for text in synthetic_data:
    print(text)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


The weather was nice and I enjoyed the view. I was able to get a good view of the city and the surrounding area. The weather is good and it was a nice day.

I was very impressed with the views. It was the first time I've been to the area and was really impressed. We had a great time and we were able get to see the entire city. There was also a lot of parking and there was plenty of traffic. Parking is very easy and you can
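
Because generation stops at `max_length` tokens, the output above ends mid-sentence ("and you can"). When such text is used for data augmentation, one possible cleanup step is to trim each sample back to its last complete sentence. The sketch below is a minimal heuristic under that assumption; `trim_to_last_sentence` is a hypothetical helper, not part of `transformers`, and its simple punctuation rule can misfire on abbreviations such as "e.g.".

```python
import re


def trim_to_last_sentence(text):
    """Drop the trailing fragment after the last '.', '!' or '?' in text.

    Illustrative heuristic (an assumption, not a library function).
    If no sentence-ending punctuation is found, the text is returned
    unchanged apart from stripping surrounding whitespace.
    """
    # Sentence end: punctuation, optional closing quote or bracket,
    # followed by whitespace or the end of the string.
    matches = list(re.finditer(r"[.!?][\"')\]]*(?=\s|$)", text))
    if not matches:
        return text.strip()
    return text[:matches[-1].end()].strip()


sample = "There was plenty of traffic. Parking is very easy and you can"
print(trim_to_last_sentence(sample))  # prints "There was plenty of traffic."
```

The helper could be applied to each string returned by `generate_synthetic_text` before the samples are added to a training set.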
