<a href="https://colab.research.google.com/github/urielmun/capstone-lab/blob/main/Data_preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data preprocessing
---
## Task:

#### 0. Split Scene
**The video generation API splits the entire book text into scenes so that it can generate scene-by-scene videos.**<br>
Scene: Larger than paragraph, smaller than chapter,
- Rule Based Algorithm

#### 1. Dialogue tracking LLM
**Identify who is speaking each line, even if names aren't written.**
- Prompt Enginnering (Few-shot Learning)
- Fine Tuning

#### 2. Create scene structure
**Over time, separate the narration and character dialogue and organize them into a scene structure.**<br>
Input
```Text
She smiled. “It’s a beautiful morning.”
He nodded. “Let’s go for a walk.”
```

Output
```Scene structure
{"Narrative": ["She smiled"],
"Female character": ["It’s a beautiful morning."],
"Narrative": ["He nodded"],
"Male character": ["Let’s go for a walk."]
}
```


#### 3. Automation
**Automation allows data to be entered into the dialogue tracking LLM scene-by-scene.**


## Task0

#### Rule Based Algorithm


In [44]:
import re
import os
from tqdm import tqdm
import pickle
from datasets import load_dataset
from itertools import islice
import json

In [37]:
def split_into_scenes(text: str):
    """
    도서 전체 텍스트를 문단 단위로 분리하는 함수.
    1) 빈 줄(줄바꿈 2회 이상)을 기준으로 문단 분리
    2) 문단 길이가 너무 짧으면 이전 문단과 병합
    3) 문단이 너무 길면 '\n' 기준으로 한 번 더 세분화
    """
    SHORT_PARAGRAPH=100
    LONG_PARAGRAPH=800

    raw_paragraphs = re.split(r'\n\s*\n+', text.strip())
    paragraphs = [p.strip() for p in raw_paragraphs if p.strip()]

    merged_paragraph=[]
    buffer=""
    final_paragraphs = []

    for p in paragraphs:
        if len(p) > LONG_PARAGRAPH:
            sub_paras = [sub.strip() for sub in p.split('\n') if sub.strip()]
            final_paragraphs.extend(sub_paras)
        else:
            final_paragraphs.append(p)

    for p in final_paragraphs:
        if len(p) < SHORT_PARAGRAPH:
            buffer+=" "+p
        else:
            if buffer:
                buffer+=" "+p
                merged_paragraph.append(buffer.strip())
                buffer=""
            merged_paragraph.append(p)

    if buffer:
        merged_paragraph.append(buffer.strip())

    return merged_paragraph

split_into_scenes test code

In [35]:
test="""
CHAPTER I.


Sir Walter Elliot, of Kellynch Hall, in Somersetshire, was a man who,
for his own amusement, never took up any book but the Baronetage; there
he found occupation for an idle hour, and consolation in a distressed
one; there his faculties were roused into admiration and respect, by
contemplating the limited remnant of the earliest patents; there any
unwelcome sensations, arising from domestic affairs changed naturally
into pity and contempt as he turned over the almost endless creations
of the last century; and there, if every other leaf were powerless, he
could read his own history with an interest which never failed. This
was the page at which the favourite volume always opened:

“ELLIOT OF KELLYNCH HALL.


“Walter Elliot, born March 1, 1760, married, July 15, 1784, Elizabeth,
daughter of James Stevenson, Esq. of South Park, in the county of
Gloucester, by which lady (who died 1800) he has issue Elizabeth, born
June 1, 1785; Anne, born August 9, 1787; a still-born son, November 5,
1789; Mary, born November 20, 1791.”

Precisely such had the paragraph originally stood from the printer’s
hands; but Sir Walter had improved it by adding, for the information of
himself and his family, these words, after the date of Mary’s
birth—“Married, December 16, 1810, Charles, son and heir of Charles
Musgrove, Esq. of Uppercross, in the county of Somerset,” and by
inserting most accurately the day of the month on which he had lost his
wife.

Then followed the history and rise of the ancient and respectable
family, in the usual terms; how it had been first settled in Cheshire;
how mentioned in Dugdale, serving the office of high sheriff,
representing a borough in three successive parliaments, exertions of
loyalty, and dignity of baronet, in the first year of Charles II, with
all the Marys and Elizabeths they had married; forming altogether two
handsome duodecimo pages, and concluding with the arms and
motto:—“Principal seat, Kellynch Hall, in the county of Somerset,” and
Sir Walter’s handwriting again in this finale:—

“Heir presumptive, William Walter Elliot, Esq., great grandson of the
second Sir Walter.”

Vanity was the beginning and the end of Sir Walter Elliot’s character;
vanity of person and of situation. He had been remarkably handsome in
his youth; and, at fifty-four, was still a very fine man. Few women
could think more of their personal appearance than he did, nor could
the valet of any new made lord be more delighted with the place he held
in society. He considered the blessing of beauty as inferior only to
the blessing of a baronetcy; and the Sir Walter Elliot, who united
these gifts, was the constant object of his warmest respect and
devotion.
"""


for i in split_into_scenes(test):
    print(i)
    print("---\n")

CHAPTER I. Sir Walter Elliot, of Kellynch Hall, in Somersetshire, was a man who,
for his own amusement, never took up any book but the Baronetage; there
he found occupation for an idle hour, and consolation in a distressed
one; there his faculties were roused into admiration and respect, by
contemplating the limited remnant of the earliest patents; there any
unwelcome sensations, arising from domestic affairs changed naturally
into pity and contempt as he turned over the almost endless creations
of the last century; and there, if every other leaf were powerless, he
could read his own history with an interest which never failed. This
was the page at which the favourite volume always opened:
---

Sir Walter Elliot, of Kellynch Hall, in Somersetshire, was a man who,
for his own amusement, never took up any book but the Baronetage; there
he found occupation for an idle hour, and consolation in a distressed
one; there his faculties were roused into admiration and respect, by
contemplating t

## Data Load

In [46]:
dataset = load_dataset(
    "incredible45/Gutenberg-BookCorpus-Cleaned-Data-English",
    split="train",
    streaming=True
)
count = 0
MAX_BOOKS = 5  # 저장할 최대 책 권수

# 저장 폴더 생성
save_dir = "books_pickle"
os.makedirs(save_dir, exist_ok=True)

Resolving data files:   0%|          | 0/43 [00:00<?, ?it/s]

Resolving data files:   0%|          | 0/43 [00:00<?, ?it/s]

In [47]:
save_dir = "books_jsonl"
os.makedirs(save_dir, exist_ok=True)

filename = os.path.join(save_dir, "processed_books.jsonl")

with open(filename, "w", encoding="utf-8") as f:
    for example in tqdm(dataset, desc="Processing books"):
        book_title = example.get("book_title", "untitled")
        author= example.get("author", "")
        context = example.get("context", "")
        if len(context.strip()) < 2000:  # 최소 길이 필터
                continue
        scenes = split_into_scenes(context)

        record = {
            "book_title": book_title,
            "author": author,
            "scenes": scenes
        }

        f.write(json.dumps(record, ensure_ascii=False) + "\n")

        count += 1
        if count >= MAX_BOOKS:  # 최대 저장 권수 도달 시 종료
            break

print(f"저장 완료! 총 {count}권 저장됨. 파일: {filename}")


Processing books: 4it [00:05,  1.44s/it]

저장 완료! 총 5권 저장됨. 파일: books_jsonl/processed_books.jsonl





## Task1


In [48]:
'''
!pip install openai

from openai import OpenAI
client = OpenAI(api_key="YOUR_OPENAI_API_KEY")
'''
!pip install transformers accelerate

from transformers import pipeline

# Load a chat-style text generation model
generator = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.2",
    device_map="auto"
)




config.json:   0%|          | 0.00/596 [00:00<?, ?B/s]

model.safetensors.index.json: 0.00B [00:00, ?B/s]

Fetching 3 files:   0%|          | 0/3 [00:00<?, ?it/s]

model-00002-of-00003.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00001-of-00003.safetensors:   0%|          | 0.00/4.94G [00:00<?, ?B/s]

model-00003-of-00003.safetensors:   0%|          | 0.00/4.54G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/111 [00:00<?, ?B/s]



tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

Device set to use cuda:0


In [51]:
def infer_dialogues(paragraph):
    prompt_template = """
    ### role ###
    You are a literary dialogue analyzer.

    ### instruction ###
    1. Extract all lines of dialogue enclosed in quotation marks.
    2. Identify the speaker of each line of dialogue.
    3. Segment and summarize the narration that appears between dialogues, sentence by sentence.
    4. If the speaker’s name appears nearby (e.g., “said Alice”), use that.
    5. If the speaker’s name is not written, infer it from context (gender, previous line, actions, etc.).
    6. Keep narration and dialogue separate.

    ### Handling Ambiguities ###
    - If the speaker cannot be identified, label as `"Unknown character"`.
    - If multiple narrative sentences appear in a row, combine them into one `"Narrative"` entry.
    - If there are no dialogues, output only `"Narrative"`.
    - If the story includes a child or parent, label them explicitly as `"Child"`, `"Father"`, `"Mother"`, etc., based on the text.
    - Keep capitalization consistent with input text.

    ### examples ###
    Example 1:
    Text:
    She smiled. “It’s a beautiful morning.”
    He nodded. “Let’s go for a walk.”
    Output:
    {{"Narrative": ["She smiled"], "Female character": ["It’s a beautiful morning."], "Narrative": ["He nodded"], "Male character": ["Let’s go for a walk."]}}

    Example 2:
    Text:
    The child giggled. "Can we do it again?"
    His father laughed softly. "Not this time, son."
    Output:
    {{"Narrative":["The child giggled"], "Child": ["Can we do it again?"], "Narrative": ["His Father laughed softly."], "Father": ["Not this time, son."]}}

    Now analyze this paragraph: {paragraph}

    Output:
    """

    final_prompt = prompt_template.format(paragraph=paragraph)
    result = generator(final_prompt, max_new_tokens=250)
    full_text = result[0]['generated_text']
    output_only = full_text[len(final_prompt):]

    return output_only.strip()

infer_dialogues test code




In [52]:
# Test Example
paragraph = """The wind howled outside.
“I can’t believe it’s come to this,” she whispered.
He sighed. “We knew it would, eventually.”"""

print(infer_dialogues(paragraph))

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


{"Narrative": ["The wind howled outside."], "Female character": ["I can’t believe it’s come to this.", "Whispered"], "Narrative": ["He sighed"], "Male character": ["We knew it would, eventually."]}


##Task2

In [None]:
input_filename = "books_jsonl/processed_books.jsonl"
output_filename = "books_jsonl/processed_books_with_dialogues.jsonl"

os.makedirs(os.path.dirname(output_filename), exist_ok=True)

if not os.path.exists(input_filename):
    print(f"오류: 입력 파일 '{input_filename}'을 찾을 수 없습니다.")
else:
    with open(input_filename, "r", encoding="utf-8") as infile, \
         open(output_filename, "w", encoding="utf-8") as outfile:

         for line in tqdm(infile, desc="Processing scenes"):
            record = json.loads(line.strip())
            original_scenes = record.get("scenes", [])
            processed_scenes = [infer_dialogues(scene) for scene in original_scenes]
            record["scenes"] = processed_scenes
            outfile.write(json.dumps(record, ensure_ascii=False) + "\n")

    print(f"전처리 완료! 결과가 '{output_filename}' 파일에 저장되었습니다.")

Processing scenes: 0it [00:00, ?it/s]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_