<a href="https://colab.research.google.com/github/rtajeong/ChatGPT_for_Management/blob/main/6_generate_a_novel.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Generating a novel
- using the pre-trained 'gpt2' model

In [None]:
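from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load the model and tokenizer
model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Define the starting prompt
prompt = "Once upon a time"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Generate text auto-regressively
output_ids = model.generate(
    input_ids,
    max_length=100,  # maximum total length in tokens, prompt included
    num_return_sequences=1,  # number of sequences to return per call
    do_sample=True,  # sample instead of decoding greedily; temperature and top_k are ignored otherwise
    temperature=0.7,  # controls sampling diversity (lower is more conservative)
    top_k=50,  # consider only the 50 most probable tokens
    repetition_penalty=1.2,  # values above 1 discourage repeating tokens
)

# Print the result (with sampling enabled, the text varies from run to run)
output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(output_text)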


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Once upon a time, the world was filled with people who were not only rich but also powerful.
The first thing that came to mind when I thought of this is how much money they had in their pockets and what kind it would be if someone took them out on an adventure or something like those things? The amount you could spend at any given moment without having anyone else's attention! It seemed so simple for me… But then again maybe there are some more interesting ways around these kinds "money
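
The `generate` call above passes `repetition_penalty`, `temperature`, and `top_k`. The following is a minimal pure-Python sketch of how these three settings reshape the next-token scores before a token is drawn, assuming sampling is enabled. The helper names and the five-token toy vocabulary are hypothetical and exist only for illustration; the library applies the same ideas to tensors over the full vocabulary.

```python
import math


def apply_repetition_penalty(logits, generated_ids, penalty):
    """Lower the score of every token that already appears in the output."""
    out = list(logits)
    for tok in set(generated_ids):
        # Positive scores are divided and negative ones multiplied,
        # so a penalty above 1 always pushes the score down.
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


def apply_temperature(logits, temperature):
    """Divide by the temperature: values below 1 sharpen the distribution."""
    return [x / temperature for x in logits]


def apply_top_k(logits, k):
    """Keep the k largest scores and mask the rest with -inf."""
    if k >= len(logits):
        return list(logits)
    threshold = sorted(logits, reverse=True)[k - 1]
    return [x if x >= threshold else -math.inf for x in logits]


def softmax(logits):
    """Turn scores into probabilities; masked entries become exactly 0."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy example (hypothetical values): 5 tokens, token 0 was already generated.
toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]
step = apply_repetition_penalty(toy_logits, generated_ids=[0], penalty=1.2)
step = apply_temperature(step, temperature=0.7)
step = apply_top_k(step, k=3)
probs = softmax(step)
print([round(p, 3) for p in probs])
```

The two lowest-scoring tokens end up with probability zero because of the top-k mask, and the remaining probability mass is split among the three survivors.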


In [None]:
output_text

'Once upon a time, the world was filled with people who were not only rich but also powerful.\nThe first thing that came to mind when I thought of this is how much money they had in their pockets and what kind it would be if someone took them out on an adventure or something like those things? The amount you could spend at any given moment without having anyone else\'s attention! It seemed so simple for me… But then again maybe there are some more interesting ways around these kinds "money'

# Training and Generating (model training and text generation)
- using an LSTM model

In [None]:
# 1. Prepare the text data (short sample; the next cell replaces it with a longer passage)

text = """
Once upon a time, there was a little prince who lived on a small planet.
He loved watching sunsets and exploring the stars. One day, he decided to visit other planets...
"""


In [None]:
text = """
Once upon a time, there was a little prince who lived on a tiny planet no larger than a house.
The planet had three volcanoes, two active and one extinct. The prince carefully tended to them,
removing weeds and cleaning the craters to prevent any eruptions.

The little prince loved watching sunsets. On his small planet, he could see the sunset simply by moving his chair a few steps.
One day, he watched the sunset forty-four times because he was feeling particularly sad.

The little prince also took care of a rose, a beautiful but proud flower that had appeared on his planet.
The rose often demanded attention and made the prince question her love for him. But deep down, he loved her dearly.

Feeling confused and lonely, the little prince decided to leave his planet and explore the universe.
He visited many planets, each inhabited by strange and interesting characters.

On one planet, he met a king who claimed to rule the stars. The king believed his orders were absolute, but the prince found him amusing.
On another planet, he encountered a businessman who spent his days counting stars, believing they were his property.
The prince also met a lamplighter who lit and extinguished a streetlamp every minute due to the fast rotation of his tiny planet.

Through these encounters, the little prince learned about grown-ups and their odd priorities.
Eventually, he arrived on Earth, where he met a fox. The fox taught the prince about relationships,
explaining that love and friendship require time, patience, and responsibility.

The fox said, "What is essential is invisible to the eye." The little prince never forgot these words.
After meeting the fox, the prince began to see the world differently. He realized the importance of the rose he had left behind
and decided to return to his planet to care for her once more.
"""



In [None]:
# 2. Text preprocessing: convert the text into numbers (integer indices) that the LSTM model can consume

import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical

# Tokenize and build the integer index
tokenizer = Tokenizer(char_level=True, lower=False)  # work at the character level, keeping case
tokenizer.fit_on_texts([text])
total_chars = len(tokenizer.word_index)

# Build (40-character window, next character) pairs with a sliding window
sequence_length = 40
sequences = []
for i in range(len(text) - sequence_length):
    seq = text[i:i + sequence_length]
    next_char = text[i + sequence_length]
    sequences.append((seq, next_char))

# Split into inputs (x) and targets (y)
x = []
y = []
for seq, next_char in sequences:
    x.append(tokenizer.texts_to_sequences([seq])[0])
    y.append(tokenizer.word_index[next_char])

x = np.array(x)
y = np.array(y)

# One-hot encode the targets (index 0 is reserved, hence total_chars + 1 classes)
y = to_categorical(y, num_classes=total_chars + 1)
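
The one-hot step above pairs with the `categorical_crossentropy` loss. As a small sketch of why the one-hot vector carries no more information than the integer index, the pure-Python example below computes the cross-entropy both ways on a made-up prediction. The function names and the four-class prediction are hypothetical.

```python
import math


def one_hot(index, num_classes):
    """Return a list with 1.0 at `index` and 0.0 elsewhere."""
    vec = [0.0] * num_classes
    vec[index] = 1.0
    return vec


def categorical_crossentropy(y_true_one_hot, y_pred):
    """-sum_i y_i * log(p_i); only the target position contributes."""
    return -sum(t * math.log(p) for t, p in zip(y_true_one_hot, y_pred) if t > 0)


def sparse_categorical_crossentropy(target_index, y_pred):
    """Same loss, computed straight from the integer class index."""
    return -math.log(y_pred[target_index])


# Hypothetical prediction over 4 classes; the true class is 2.
y_pred = [0.1, 0.2, 0.6, 0.1]
target = 2
loss_dense = categorical_crossentropy(one_hot(target, 4), y_pred)
loss_sparse = sparse_categorical_crossentropy(target, y_pred)
print(loss_dense, loss_sparse)
```

Keras offers a `sparse_categorical_crossentropy` loss that accepts integer targets directly, so one possible variation is to skip `to_categorical` and compile with that loss instead. The notebook keeps the one-hot form.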


In [None]:
sequences[:7]

[('\nOnce upon a time, there was a little pr', 'i'),
 ('Once upon a time, there was a little pri', 'n'),
 ('nce upon a time, there was a little prin', 'c'),
 ('ce upon a time, there was a little princ', 'e'),
 ('e upon a time, there was a little prince', ' '),
 (' upon a time, there was a little prince ', 'w'),
 ('upon a time, there was a little prince w', 'h')]
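
The preprocessing cell sizes its targets with `total_chars + 1` classes. The reason is that the Keras `Tokenizer` numbers characters from 1 (most frequent first) and leaves index 0 unused. The sketch below imitates that numbering with the standard library on a short sample string. `build_char_index` is a hypothetical helper, not the Keras implementation.

```python
from collections import Counter


def build_char_index(text):
    """Simplified imitation of a char-level index: the most frequent
    character gets 1, the next one 2, and so on. Index 0 is never used."""
    counts = Counter(text)
    return {ch: i for i, (ch, _) in enumerate(counts.most_common(), start=1)}


sample = "a little prince"
char_index = build_char_index(sample)
vocab_size = len(char_index)

# Encode the sample the same way the notebook encodes each window.
encoded = [char_index[ch] for ch in sample]

# Valid class ids run from 1 to vocab_size, so an output layer or a
# one-hot vector indexed by these ids needs vocab_size + 1 slots.
num_classes = vocab_size + 1
print(char_index)
print(encoded, num_classes)
```

Because index 0 never corresponds to a character, it is also the value `pad_sequences` can safely use for padding short inputs.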

In [None]:
[(x[i], y[i]) for i in range(7)]

[(array([17, 28,  4, 14,  2,  1, 15, 13,  9,  4,  1,  5,  1,  3,  6, 18,  2,
         21,  1,  3,  8,  2,  7,  2,  1, 24,  5, 10,  1,  5,  1, 11,  6,  3,
          3, 11,  2,  1, 13,  7]),
  array([0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0.])),
 (array([28,  4, 14,  2,  1, 15, 13,  9,  4,  1,  5,  1,  3,  6, 18,  2, 21,
          1,  3,  8,  2,  7,  2,  1, 24,  5, 10,  1,  5,  1, 11,  6,  3,  3,
         11,  2,  1, 13,  7,  6]),
  array([0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0.])),
 (array([ 4, 14,  2,  1, 15, 13,  9,  4,  1,  5,  1,  3,  6, 18,  2, 21,  1,
          3,  8,  2,  7,  2,  1, 24,  5, 10,  1,  5,  1, 11,  6,  3,  3, 11,
          2,  1, 13,  7,  6,  4]),
  array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 

In [None]:
# 3. Build the LSTM model: it takes a text sequence as input and predicts the next character (token)

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Embedding, Input

# Define the model
model = Sequential([
    Input(shape=(sequence_length,)),
    Embedding(input_dim=total_chars + 1, output_dim=20),
    LSTM(150, return_sequences=False, dropout=0.2),
    Dense(total_chars + 1, activation='softmax')
])

# Compile
model.compile(optimizer='adam', loss='categorical_crossentropy')
# model.build((None, sequence_length))
model.summary()
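
The output of `model.summary()` is not shown here, but the parameter counts can be worked out by hand. The sketch below does the arithmetic for the three layers, assuming a class count of 40 (`total_chars + 1`), which is read off the 40-element one-hot vectors printed earlier. The helper functions are hypothetical names for the standard formulas.

```python
def embedding_params(num_classes, embed_dim):
    """One embed_dim-sized vector per class index."""
    return num_classes * embed_dim


def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, output_dim):
    """Weight matrix plus one bias per output unit."""
    return input_dim * output_dim + output_dim


num_classes = 40   # assumption: total_chars + 1, inferred from the printed one-hot vectors
embed_dim = 20     # Embedding(output_dim=20)
units = 150        # LSTM(150)

emb = embedding_params(num_classes, embed_dim)   # 800
lstm = lstm_params(embed_dim, units)             # 102600
dense = dense_params(units, num_classes)         # 6040
print(emb, lstm, dense, emb + lstm + dense)      # total: 109440
```

Dropout adds no trainable parameters, so `dropout=0.2` does not change these counts. Nearly all of the model's capacity sits in the LSTM layer.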


In [None]:
# 4. Train the model so that it learns the patterns in the text

model.fit(x, y, epochs=100, batch_size=64, verbose=0)

<keras.src.callbacks.history.History at 0x784d4dbb8610>

In [None]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

def generate_text(model, tokenizer, seed_text, num_generate=100):
    sequence_length = 40  # model input size; must match the training window
    result = seed_text

    for i in range(num_generate):
        # Keep only the last `sequence_length` characters as model input
        input_text = result[-sequence_length:]

        # Convert characters to indices (unknown characters are skipped),
        # then left-pad with zeros up to the model input size
        input_seq = [tokenizer.word_index[char] for char in input_text if char in tokenizer.word_index]
        input_seq = pad_sequences([input_seq], maxlen=sequence_length, padding='pre')

        # Predict the next character (greedy: take the most probable index)
        predicted_probs = model.predict(input_seq, verbose=0)
        predicted_char_idx = np.argmax(predicted_probs)

        # Convert the index back to a character
        if predicted_char_idx in tokenizer.index_word:
            predicted_char = tokenizer.index_word[predicted_char_idx]
        else:
            # Index 0 (padding) has no character; stop generating
            # print(f"Warning: Index {predicted_char_idx} not in tokenizer.index_word")
            break

        # Append to the result
        result += predicted_char
        # print(result)

    return result
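
`generate_text` always picks the most probable character with `np.argmax` and stops if the model predicts index 0, which has no character. One possible alternative, sketched below under those observations, is to draw the next index at random from the predicted distribution while excluding index 0. `sample_index` is a hypothetical helper and is not part of the notebook.

```python
import math
import random


def sample_index(probs, temperature=1.0, rng=random):
    """Sample a class index from `probs`, never returning the padding index 0.

    Each probability is rescaled to p ** (1 / temperature); lower
    temperatures concentrate the draw on the most likely classes.
    """
    weights = [0.0] * len(probs)
    for i, p in enumerate(probs):
        if i == 0 or p <= 0.0:
            continue  # skip the padding slot and impossible classes
        weights[i] = math.exp(math.log(p) / temperature)
    if sum(weights) == 0.0:
        raise ValueError("no class with positive probability besides index 0")
    return rng.choices(range(len(probs)), weights=weights, k=1)[0]


# Toy distribution (hypothetical values) over 3 class indices.
rng = random.Random(0)
toy_probs = [0.5, 0.3, 0.2]
draws = [sample_index(toy_probs, temperature=0.5, rng=rng) for _ in range(200)]
print(draws.count(1), draws.count(2))
```

Inside `generate_text`, the `np.argmax(predicted_probs)` line could be swapped for something like `sample_index(predicted_probs[0], temperature=0.8)`. The value 0.8 is only an illustrative choice, and the generated text would then differ between runs.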


In [None]:
# Generate text
seed = "Once upon a time, there was a "
generated_text = generate_text(model, tokenizer, seed_text=seed, num_generate=200)
print(generated_text)

Once upon a time, there was a lemering tha cat a his pppop. the prince buster torethers, 
eFlivinitid tany pante etere the since qute he mer famd onthinclaps and theirind there ver arded on erern,, he lealined to rele to the fort.

