# Phase 1: Topic Modeling & Style Analysis

This notebook trains BERTopic topic models on the board data cleaned in Phase 0 and, for the 인기글 (hot posts) board, also extracts title/body style features.

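## ✅ Goals
- Load the Parquet data produced in Phase 0
- Apply Korean preprocessing (Mecab etc.) and stopword removal
- Run BERTopic on each board (익게2, 자유게시판, 연애상담소, 익게1) and save the outputs
- For hot posts (`hot_df`), compute style-feature statistics (length, punctuation, part of speech) and generate style hints
- Save the prompting JSON files (`*_topics_for_prompt.json`, `인기글_topics_for_prompt.json`) and the prompt templates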

## ⚙️ Prerequisites
- Run `phase0_setup.ipynb` first so that the `outputs/*.parquet` files exist.
- If the Colab runtime has been restarted, run the package installation cell below again.


---
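## 0. Install required packages (once per runtime)
- Skip this step if the packages are already installed.
- The MeCab installation only needs to be done once, the same way as in Phase 0.
- This notebook assumes `konlpy`/`mecab-python3` are installed. If they are not, run the installation cell from Phase 0 or install them directly as shown below.
  - Example: `%pip install -q konlpy mecab-python3`
- Note that `konlpy`'s `Mecab` class also needs the `mecab-ko-dic` dictionary, which the `pip` command above does not provide. Without it, initialization fails (as in the section 2 output) and the notebook falls back to untokenized text.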


In [1]:
# Install every package used in Phase 1.
# - Morphological analysis: konlpy, mecab-python3
# - Topic modeling: sentence-transformers, bertopic, umap-learn, hdbscan
# - Visualization: plotly, seaborn

%pip install -q konlpy mecab-python3
%pip install -q sentence-transformers bertopic umap-learn hdbscan plotly seaborn

print("✅ Phase 1 required packages installed")


✅ Phase 1 required packages installed


---
## 1. Common paths and loading the Phase 0 outputs


In [4]:
from google.colab import drive

# Mount Google Drive (can be skipped if it is already mounted)
drive.mount('/content/drive', force_remount=True)

print("✅ Google Drive mounted")


Mounted at /content/drive
✅ Google Drive mounted


In [5]:
from pathlib import Path
from typing import Tuple
import pandas as pd
import numpy as np
import json

# Same paths as used in Phase 0
PROJECT_ROOT = Path("/content/drive/MyDrive/board_crawling")
OUTPUT_DIR = PROJECT_ROOT / "outputs"
MODEL_DIR = PROJECT_ROOT / "models"

if not OUTPUT_DIR.exists():
    raise FileNotFoundError("OUTPUT_DIR does not exist. Run Phase 0 first.")

PARQUET_MAP = {
    "익게2": OUTPUT_DIR / "익게2_for_topic.parquet",
    "자유게시판": OUTPUT_DIR / "자유게시판_for_topic.parquet",
    "연애상담소": OUTPUT_DIR / "연애상담소_for_topic.parquet",
    "익게1": OUTPUT_DIR / "익게1_for_topic.parquet",
    "인기글": OUTPUT_DIR / "인기글_clean.parquet",
}

# Fail early with the full list of missing inputs.
missing = [p for p in PARQUET_MAP.values() if not p.exists()]
if missing:
    raise FileNotFoundError("The following files could not be found:\n- " + "\n- ".join(str(p) for p in missing))

board_frames = {
    name: pd.read_parquet(path)
    for name, path in PARQUET_MAP.items()
}

print("✅ Phase 0 outputs loaded")
for name, df in board_frames.items():
    print(f"  • {name}: {len(df):,} rows")


✅ Phase 0 outputs loaded
  • 익게2: 49,251 rows
  • 자유게시판: 2,285 rows
  • 연애상담소: 1,852 rows
  • 익게1: 1,736 rows
  • 인기글: 6,592 rows


In [6]:
summary_df = pd.DataFrame(
    [
        {"Board": name, "Rows": f"{len(df):,}", "Period": f"{df['date'].min():%Y-%m-%d} ~ {df['date'].max():%Y-%m-%d}"}
        for name, df in board_frames.items()
    ]
)
summary_df


| | Board | Rows | Period |
|---|---|---|---|
| 0 | 익게2 | 49,251 | 2024-09-01 ~ 2025-08-31 |
| 1 | 자유게시판 | 2,285 | 2024-09-01 ~ 2025-08-31 |
| 2 | 연애상담소 | 1,852 | 2024-09-01 ~ 2025-08-31 |
| 3 | 익게1 | 1,736 | 2024-09-01 ~ 2025-08-31 |
| 4 | 인기글 | 6,592 | 2024-09-01 ~ 2025-08-31 |


---
## 2. Initialize Mecab and preprocessing tools


In [7]:
try:
    from konlpy.tag import Mecab
    mecab = Mecab()
    MECAB_AVAILABLE = True
    print("✅ Mecab available")
except Exception as e:
    MECAB_AVAILABLE = False
    mecab = None
    print(f"⚠️ Mecab initialization failed: {e}")
    print("   → Skipping Mecab tokenization and falling back to plain string processing.")


⚠️ Mecab initialization failed: The MeCab dictionary does not exist at "/usr/local/lib/mecab/dic/mecab-ko-dic". Is the dictionary correctly installed?
You can also try entering the dictionary path when initializing the Mecab class: "Mecab('/some/dic/path')"
   → Skipping Mecab tokenization and falling back to plain string processing.
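
The failure above comes from a missing dictionary directory, so it can help to look for the dictionary before calling `Mecab()`. The following is a minimal sketch, not part of the original notebook: `find_mecab_dic` is a hypothetical helper, and every candidate path other than the one named in the error message is an assumption that should be adjusted to the actual environment.

```python
from pathlib import Path
from typing import Iterable, Optional

# Candidate locations. Only the first one appears in the error message above;
# the others are assumptions about common install prefixes.
DEFAULT_DIC_CANDIDATES = (
    "/usr/local/lib/mecab/dic/mecab-ko-dic",
    "/usr/lib/mecab/dic/mecab-ko-dic",
    "/opt/homebrew/lib/mecab/dic/mecab-ko-dic",
)


def find_mecab_dic(candidates: Iterable[str] = DEFAULT_DIC_CANDIDATES) -> Optional[str]:
    """Return the first candidate that is an existing directory, else None."""
    for candidate in candidates:
        if Path(candidate).is_dir():
            return candidate
    return None


dic_path = find_mecab_dic()
if dic_path is None:
    print("mecab-ko-dic not found; Mecab() is expected to fail in this runtime")
else:
    print(f"mecab-ko-dic found at {dic_path}")
    # As the error message suggests, the path can be passed explicitly:
    # from konlpy.tag import Mecab
    # mecab = Mecab(dic_path)
```

If the helper returns `None`, the dictionary still has to be installed (for example with the Phase 0 installation cell) before the tokenizing branch of this notebook can run.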


In [8]:
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic
import umap
import hdbscan
from sklearn.feature_extraction.text import CountVectorizer

EMB_NAME = "intfloat/multilingual-e5-base"
embedder = SentenceTransformer(EMB_NAME)
print(f"✅ SentenceTransformer loaded: {EMB_NAME}")


✅ SentenceTransformer loaded: intfloat/multilingual-e5-base
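
The E5 model card states that inputs are expected to start with a task prefix such as `query: ` or `passage: `, while this notebook encodes the raw text. The sketch below shows one way to add the prefix before encoding. `add_e5_prefix` is a hypothetical helper, and the choice of `query: ` as the default for clustering is an assumption that should be confirmed against the model card.

```python
from typing import Iterable, List


def add_e5_prefix(texts: Iterable[str], prefix: str = "query: ") -> List[str]:
    """Prepend an E5 task prefix to each text, skipping texts that already have it."""
    out = []
    for text in texts:
        text = text.strip()
        out.append(text if text.startswith(prefix) else prefix + text)
    return out


docs = ["오늘 학식 메뉴 추천", "query: 기숙사 신청 기간"]  # made-up sample documents
prefixed = add_e5_prefix(docs)
print(prefixed)

# The prefixed list would then replace `docs` in the encode call, e.g.:
# embeddings = embedder.encode(prefixed, normalize_embeddings=True)
```

Only the texts passed to the embedder need the prefix; the documents handed to BERTopic for keyword extraction should stay unprefixed so that `query` does not show up as a topic keyword.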


---
## 3. Text preprocessing functions (clean → normalize → tokenize → stopwords)


In [9]:
DEFAULT_STOPWORDS = [
    "이","가","을","를","에","의","와","과","도","로","으로",
    "은","는","에서","에게","께","한테",
    "더","그","저","이것","그것","저것","그런","이런","저런","그렇","이렇","저렇",
    "때","때문","것","수","등",
    "있다","없다","하다","되다","이다","아니다","같다",
    "게시판","게시글","글","댓글","작성","작성자","조회","추천","비추천"
]

CUSTOM_STOPWORDS = set(DEFAULT_STOPWORDS)

import re

def clean_text(text):
    """Remove URLs, emails, HTML tags, and special characters, then normalize whitespace."""
    if pd.isna(text) or text == "":
        return ""
    text = str(text)
    text = re.sub(r'http[s]?://\S+', '', text)
    text = re.sub(r'\S+@\S+', '', text)
    text = re.sub(r'<[^>]+>', '', text)
    text = re.sub(r'[^\w\s가-힣ㄱ-ㅎㅏ-ㅣ.,!?~()\-\'\"…·:;<>/@#%&\[\]\{\}]', ' ', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text


def normalize_text(text):
    """(Optional) Spacing correction. Currently returns the text unchanged."""
    if not text or text.strip() == "":
        return ""
    return text


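def tokenize_text(text, pos_filter=('NNG','NNP','VV','VA','VX','MM','IC')):
    """Return Mecab morphemes whose POS tag is in pos_filter.

    Without Mecab, fall back to whitespace tokens so the text is not silently dropped.
    """
    if not text or text.strip() == "":
        return []
    if not MECAB_AVAILABLE or mecab is None:
        return text.split()
    try:
        tokens = mecab.pos(text)
        return [w for w, pos in tokens if pos in pos_filter]
    except Exception:
        return []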


def remove_stopwords(tokens, custom_stopwords=None):
    if not tokens:
        return []
    stopwords = set(DEFAULT_STOPWORDS)
    if custom_stopwords:
        stopwords.update(custom_stopwords)
    return [t for t in tokens if t not in stopwords]


def preprocess_text(text, use_spacing=True, use_tokenize=True, custom_stopwords=None):
    cleaned = clean_text(text)
    if not cleaned:
        return ""
    normalized = normalize_text(cleaned) if use_spacing else cleaned
    if use_tokenize:
        tokens = tokenize_text(normalized)
        tokens = remove_stopwords(tokens, custom_stopwords)
        return " ".join(tokens)
    else:
        return normalized


def build_text_preprocessed(df, max_len=2000, use_spacing=True, use_tokenize=True, custom_stopwords=None):
    combined = (
        df["title"].fillna("") + " " + df["contents"].fillna("")
    ).str.replace(r"\s+", " ", regex=True)

    preprocessed = combined.apply(
        lambda x: preprocess_text(
            x,
            use_spacing=use_spacing,
            use_tokenize=use_tokenize,
            custom_stopwords=custom_stopwords
        )
    )
    return preprocessed.str.slice(0, max_len)

print("✅ Preprocessing functions loaded")


✅ Preprocessing functions loaded
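
To make the clean step concrete, here is a small worked example. `clean_text_demo` is a stand-alone copy of the regex steps in `clean_text` (without the pandas NaN check) so that it runs on its own; the input string is made up.

```python
import re


def clean_text_demo(text: str) -> str:
    """Same regex steps as clean_text, minus the pandas NaN check."""
    text = re.sub(r'http[s]?://\S+', '', text)  # URLs
    text = re.sub(r'\S+@\S+', '', text)         # email addresses
    text = re.sub(r'<[^>]+>', '', text)         # HTML tags
    # Replace anything outside the allowed character set with a space.
    text = re.sub(r'[^\w\s가-힣ㄱ-ㅎㅏ-ㅣ.,!?~()\-\'\"…·:;<>/@#%&\[\]\{\}]', ' ', text)
    return re.sub(r'\s+', ' ', text).strip()    # collapse whitespace


raw = "공지 <b>필독</b> https://example.com 문의: admin@example.com ★★ 감사합니다!!"
cleaned = clean_text_demo(raw)
print(cleaned)  # 공지 필독 문의: 감사합니다!!
```

The URL, the email address, and the tags are removed, the `★` symbols are replaced by spaces because they are not in the allowed set, and ordinary punctuation such as `:` and `!` survives for the later style analysis.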


---
## 4. Title/body style features and summary functions


In [10]:
PUNCT_PATTERN = re.compile(r'[^\w\s가-힣ㄱ-ㅎㅏ-ㅣ]')

def collect_punct_counts(text: str):
    counts = {}
    if not isinstance(text, str):
        return counts
    for ch in text:
        if PUNCT_PATTERN.match(ch):
            counts[ch] = counts.get(ch, 0) + 1
    return counts


def extract_title_features(title: str):
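    if not isinstance(title, str):
        # Treat None/NaN as empty; str(NaN) would otherwise yield the 3-character title "nan".
        title = "" if title is None or pd.isna(title) else str(title)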
    t = title.strip()
    title_len = len(t)
    punct_counts = collect_punct_counts(t)
    punct_total = sum(punct_counts.values())

    title_noun = title_verb = title_adj = 0
    if MECAB_AVAILABLE and mecab is not None and t:
        try:
            for word, pos in mecab.pos(t):
                if pos.startswith("NN"):
                    title_noun += 1
                elif pos.startswith("VV"):
                    title_verb += 1
                elif pos.startswith("VA"):
                    title_adj += 1
        except Exception:
            pass

    return {
        "title_len": title_len,
        "title_punct_total": punct_total,
        "title_punct_counts": punct_counts,
        "title_noun": title_noun,
        "title_verb": title_verb,
        "title_adj": title_adj,
    }


def extract_body_features(contents: str):
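    if not isinstance(contents, str):
        # Treat None/NaN as empty; str(NaN) would otherwise yield the body text "nan".
        contents = "" if contents is None or pd.isna(contents) else str(contents)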
    text = contents.strip()
    body_len = len(text)
    body_paragraphs = text.count("\n") + 1 if text else 0
    punct_counts = collect_punct_counts(text)
    punct_total = sum(punct_counts.values())
    return {
        "body_len": body_len,
        "body_paragraphs": body_paragraphs,
        "body_punct_total": punct_total,
        "body_punct_counts": punct_counts,
    }


def add_text_style_features(df_topics: pd.DataFrame) -> pd.DataFrame:
    df = df_topics.copy()
    title_feats = df["title"].apply(extract_title_features)
    body_feats = df["contents"].apply(extract_body_features)
    title_feat_df = pd.DataFrame(list(title_feats))
    body_feat_df = pd.DataFrame(list(body_feats))
    df = pd.concat([
        df.reset_index(drop=True),
        title_feat_df.reset_index(drop=True),
        body_feat_df.reset_index(drop=True),
    ], axis=1)
    return df


def topic_style_stats(df_topics_with_feat: pd.DataFrame) -> pd.DataFrame:
    df = df_topics_with_feat[df_topics_with_feat["topic"] != -1].copy()
    agg = df.groupby("topic").agg({
        "title_len": ["mean", "median"],
        "title_noun": ["mean"],
        "title_verb": ["mean"],
        "title_adj": ["mean"],
        "body_len": ["mean", "median"],
        "body_paragraphs": ["mean"],
    })
    agg.columns = ["_".join(col).strip() for col in agg.columns.values]
    return agg.reset_index()


def topic_punct_profile(df_topics_with_feat: pd.DataFrame,
                        title_col: str = "title_punct_counts",
                        body_col: str = "body_punct_counts",
                        topn: int = 3) -> pd.DataFrame:
    rows = []
    df = df_topics_with_feat[df_topics_with_feat["topic"] != -1].copy()
    for t, sub in df.groupby("topic"):
        title_agg, body_agg = {}, {}
        for d in sub[title_col].dropna():
            if isinstance(d, dict):
                for ch, c in d.items():
                    title_agg[ch] = title_agg.get(ch, 0) + int(c)
        for d in sub[body_col].dropna():
            if isinstance(d, dict):
                for ch, c in d.items():
                    body_agg[ch] = body_agg.get(ch, 0) + int(c)

        def summarize(agg_dict):
            if not agg_dict:
                return [], {}
            total = sum(agg_dict.values())
            sorted_items = sorted(agg_dict.items(), key=lambda x: x[1], reverse=True)
            top_chars = [ch for ch, _ in sorted_items[:topn]]
            dist = {ch: cnt / total for ch, cnt in agg_dict.items()}
            return top_chars, dist

        title_top, title_dist = summarize(title_agg)
        body_top, body_dist = summarize(body_agg)

        rows.append({
            "topic": int(t),
            "title_punct_top": title_top,
            "title_punct_dist": title_dist,
            "body_punct_top": body_top,
            "body_punct_dist": body_dist,
        })
    return pd.DataFrame(rows)


def make_style_hints(row: pd.Series):
    hints = []

    def safe_float(col):
        v = row.get(col)
        if isinstance(v, (int, float)) and not pd.isna(v):
            return float(v)
        return None

    def make_range(mean_val, med_val=None, widen=0.3, floor=0, as_int=True):
        if mean_val is None and med_val is None:
            return None, None
        center = med_val if med_val is not None else mean_val
        if center is None:
            return None, None
        low = center * (1 - widen/2)
        high = center * (1 + widen/2)
        if as_int:
            low = int(max(floor, round(low)))
            high = int(max(low + 1, round(high)))
        else:
            low = max(floor, round(low, 1))
            high = max(low + 0.1, round(high, 1))
        return low, high

    title_len_mean = safe_float("title_len_mean")
    title_len_med = safe_float("title_len_median")
    body_len_mean = safe_float("body_len_mean")
    body_len_med = safe_float("body_len_median")
    body_pars_mean = safe_float("body_paragraphs_mean")
    title_noun_mean = safe_float("title_noun_mean")
    title_verb_mean = safe_float("title_verb_mean")
    title_adj_mean = safe_float("title_adj_mean")

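    # A left merge can leave NaN here, and NaN is truthy, so check the type explicitly.
    title_punct_top = row.get("title_punct_top")
    title_punct_top = title_punct_top if isinstance(title_punct_top, list) else []
    body_punct_top = row.get("body_punct_top")
    body_punct_top = body_punct_top if isinstance(body_punct_top, list) else []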

    if title_len_mean is not None or title_len_med is not None:
        lo, hi = make_range(title_len_mean, title_len_med, widen=0.3, floor=1, as_int=True)
        if lo is not None:
            # Only mention the mean when it exists; formatting None with :.1f raises TypeError.
            mean_note = f" (평균 {title_len_mean:.1f}자)" if title_len_mean is not None else ""
            hints.append(f"제목 길이는 보통 {lo}~{hi}자{mean_note} 범위입니다.")

    if body_len_mean is not None or body_len_med is not None:
        lo, hi = make_range(body_len_mean, body_len_med, widen=0.3, floor=50, as_int=True)
        if lo is not None:
            mean_note = f" (평균 {body_len_mean:.1f}자)" if body_len_mean is not None else ""
            hints.append(f"본문 길이는 대략 {lo}~{hi}자{mean_note}입니다.")

    if body_pars_mean is not None:
        lo, hi = make_range(body_pars_mean, None, widen=0.4, floor=1, as_int=True)
        if lo is not None:
            hints.append(f"본문은 평균 {body_pars_mean:.1f}개 문단, 보통 {lo}~{hi}개 문단 구조입니다.")

    # POS counts are all zero when Mecab is unavailable, which would produce
    # misleading "0~1" hints, so emit them only when Mecab actually ran.
    if MECAB_AVAILABLE and title_noun_mean is not None:
        lo, hi = make_range(title_noun_mean, None, widen=0.5, floor=0, as_int=True)
        hints.append(f"제목에는 명사가 {lo}~{hi}개 사용되는 경우가 많습니다.")
    if MECAB_AVAILABLE and title_verb_mean is not None:
        lo, hi = make_range(title_verb_mean, None, widen=0.6, floor=0, as_int=True)
        hints.append(f"제목에는 동사가 {lo}~{hi}개 포함되는 경향이 있습니다.")
    if MECAB_AVAILABLE and title_adj_mean is not None:
        lo, hi = make_range(title_adj_mean, None, widen=0.6, floor=0, as_int=True)
        hints.append(f"제목에는 형용사가 {lo}~{hi}개 정도 사용됩니다.")

    if title_punct_top:
        hints.append(f"제목에서 자주 쓰이는 문장부호: {title_punct_top}")
    if body_punct_top:
        hints.append(f"본문에서 자주 쓰이는 문장부호: {body_punct_top}")

    return hints
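
Two steps in the cell above are easy to misread: how per-post punctuation counts become a per-topic profile, and how a centre value becomes the integer range quoted in a hint. The sketch below reproduces both with the standard library only. `punct_profile` and `hint_range` are simplified stand-ins (hypothetical names) for `topic_punct_profile` and the `as_int=True` branch of `make_range`, and the sample titles are made up.

```python
import re
from collections import Counter

# Same idea as PUNCT_PATTERN: a character that is neither a word character nor whitespace.
PUNCT = re.compile(r'[^\w\s]')


def punct_profile(texts, topn: int = 2):
    """Sum punctuation counts over texts; return top characters and each character's share."""
    total = Counter()
    for text in texts:
        total.update(ch for ch in text if PUNCT.match(ch))
    n = sum(total.values())
    top = [ch for ch, _ in total.most_common(topn)]
    dist = {ch: count / n for ch, count in total.items()} if n else {}
    return top, dist


def hint_range(center: float, widen: float = 0.3, floor: int = 0):
    """Integer range of total relative width `widen` around `center`, at least 1 wide."""
    low = int(max(floor, round(center * (1 - widen / 2))))
    high = int(max(low + 1, round(center * (1 + widen / 2))))
    return low, high


titles = ["이거 실화?? ㅋㅋ", "진짜 대박!!! 실화??", "시험 끝~"]
top, dist = punct_profile(titles)
print(top)        # ['?', '!']
print(dist["?"])  # 0.5

print(hint_range(20, widen=0.3, floor=1))   # (17, 23)
print(hint_range(0.0, widen=0.5, floor=0))  # (0, 1)
```

The last line shows why a centre of zero still produces a `0~1` range: the upper bound is forced to be at least one above the lower bound. That is the range the part-of-speech hints would report if Mecab failed to initialize and every count stayed at zero.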


---
## 5. Prompt-building helper functions (keywords / time / representative posts)


In [11]:
from collections import Counter


def top_keywords_df(model: BERTopic, topn: int = 8) -> pd.DataFrame:
    rows = []
    for topic_id, words in model.get_topics().items():
        if topic_id == -1:
            continue
        rows.append({
            "topic": int(topic_id),
            "keywords": [w for w, _ in words[:topn]],
        })
    return pd.DataFrame(rows)


def time_dow_helper(df_topics: pd.DataFrame):
    dt = pd.to_datetime(df_topics["date"], errors="coerce")
    tmp = df_topics.copy()
    tmp["dow"] = dt.dt.dayofweek
    tmp["hour"] = dt.dt.hour

    by_hour = tmp.pivot_table(index="topic", columns="hour", values="id", aggfunc="count", fill_value=0)
    by_dow = tmp.pivot_table(index="topic", columns="dow", values="id", aggfunc="count", fill_value=0)

    # Top 3 hours and top 2 weekdays per topic, ranked by post count.
    best_hours = (
        by_hour.apply(lambda row: list(row.nlargest(3).index), axis=1)
        .to_dict()
    )
    best_dows = (
        by_dow.apply(lambda row: list(row.nlargest(2).index), axis=1)
        .to_dict()
    )
    return best_hours, best_dows


def representatives(df_topics: pd.DataFrame, k: int = 3) -> pd.DataFrame:
    reps = []
    for topic_id in sorted(df_topics["topic"].dropna().unique()):
        if topic_id == -1:
            continue
        sub = (
            df_topics[df_topics["topic"] == topic_id]
            .sort_values("topic_prob", ascending=False)
            .head(k)
        )
        reps.append({
            "topic": int(topic_id),
            "representatives": sub[["id", "title", "url"]].to_dict(orient="records"),
        })
    return pd.DataFrame(reps)


def make_range_for_template(mean_val, widen=0.4, floor=50):
    if mean_val is None or pd.isna(mean_val):
        return floor, floor
    lo = int(max(mean_val * (1 - widen/2), floor))
    hi = int(max(mean_val * (1 + widen/2), lo + 10))
    return lo, hi
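
The idea behind `time_dow_helper` is simply "count posts per hour and per weekday, then keep the most frequent ones". A standard-library sketch for a single topic, using made-up timestamps, could look like this:

```python
from collections import Counter
from datetime import datetime

# Made-up timestamps for one topic.
stamps = [
    "2025-03-03 23:10", "2025-03-03 23:45", "2025-03-04 23:05",
    "2025-03-04 01:20", "2025-03-10 01:40", "2025-03-05 13:00",
]
parsed = [datetime.strptime(s, "%Y-%m-%d %H:%M") for s in stamps]

hour_counts = Counter(d.hour for d in parsed)
# weekday() uses Monday=0, the same convention as pandas' dt.dayofweek.
dow_counts = Counter(d.weekday() for d in parsed)

best_hours = [hour for hour, _ in hour_counts.most_common(2)]
best_dows = [dow for dow, _ in dow_counts.most_common(1)]
print(best_hours)  # [23, 1]
print(best_dows)   # [0]
```

For small topics the top-k list can include hours with very few posts, so downstream prompts may want to treat these values as weak signals.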


---
## 6. BERTopic run functions (regular boards / hot posts)


In [12]:
def run_bertopic_for_df(df: pd.DataFrame, board_name: str,
                        outdir: Path = OUTPUT_DIR,
                        use_spacing: bool = False,
                        use_tokenize: bool = True,
                        custom_stopwords=None,
                        min_cluster_size: int = 25,
                        min_samples: int = 10) -> Tuple[pd.DataFrame, BERTopic]:
    df = df.copy()
    df["text"] = build_text_preprocessed(
        df,
        use_spacing=use_spacing,
        use_tokenize=use_tokenize and MECAB_AVAILABLE,
        custom_stopwords=custom_stopwords,
    )
    docs = df["text"].tolist()

    embeddings = embedder.encode(
        docs,
        show_progress_bar=True,
        batch_size=64,
        normalize_embeddings=True,
    )

    umap_model = umap.UMAP(
        n_neighbors=15,
        n_components=5,
        min_dist=0.0,
        metric="cosine",
        random_state=42,
    )
    hdb_model = hdbscan.HDBSCAN(
        min_cluster_size=min_cluster_size,
        min_samples=min_samples,
        metric="euclidean",
        cluster_selection_method="eom",
        prediction_data=True,
    )
    vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b", ngram_range=(1, 2), max_df=0.95, min_df=10)

    topic_model = BERTopic(
        embedding_model=None,
        umap_model=umap_model,
        hdbscan_model=hdb_model,
        vectorizer_model=vectorizer,
        language="multilingual",
        calculate_probabilities=True,
        verbose=True,
    )

    topics, probs = topic_model.fit_transform(docs, embeddings)
    df["topic"] = topics
    df["topic_prob"] = [
        float(np.max(p)) if p is not None and len(p) > 0 else np.nan
        for p in probs
    ]

    out_prefix = f"{board_name}_topics"
    df_out = df[["id", "title", "url", "date", "contents", "topic", "topic_prob"]].copy()
    df_out.to_parquet(outdir / f"{out_prefix}.parquet", index=False)
    topic_model.get_topic_info().to_csv(outdir / f"{board_name}_topic_info.csv", index=False)
    topic_model.save(str(outdir / f"bertopic_{board_name}"))

    print(f"[{board_name}] docs={len(df_out):,} | topics={topic_model.get_topic_info().query('Topic!=-1').shape[0]}")
    return df_out, topic_model


def run_bertopic_for_hot(df: pd.DataFrame,
                         board_name: str = "인기글",
                         outdir: Path = OUTPUT_DIR,
                         use_spacing: bool = False,
                         use_tokenize: bool = True,
                         custom_stopwords=None) -> Tuple[pd.DataFrame, BERTopic]:
    df = df.copy()
    df["text"] = build_text_preprocessed(
        df,
        use_spacing=use_spacing,
        use_tokenize=use_tokenize and MECAB_AVAILABLE,
        custom_stopwords=custom_stopwords,
    )
    docs = df["text"].tolist()

    embeddings = embedder.encode(
        docs,
        show_progress_bar=True,
        batch_size=64,
        normalize_embeddings=True,
    )

    umap_model = umap.UMAP(
        n_neighbors=10,
        n_components=5,
        min_dist=0.0,
        metric="cosine",
        random_state=42,
    )
    hdb_model = hdbscan.HDBSCAN(
        min_cluster_size=15,
        min_samples=5,
        metric="euclidean",
        cluster_selection_method="eom",
        prediction_data=True,
    )
    vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b", ngram_range=(1, 2), max_df=0.95, min_df=5)

    topic_model = BERTopic(
        embedding_model=None,
        umap_model=umap_model,
        hdbscan_model=hdb_model,
        vectorizer_model=vectorizer,
        language="multilingual",
        calculate_probabilities=True,
        verbose=True,
    )

    topics, probs = topic_model.fit_transform(docs, embeddings)
    df["topic"] = topics
    df["topic_prob"] = [
        float(np.max(p)) if p is not None and len(p) > 0 else np.nan
        for p in probs
    ]

    out_prefix = f"{board_name}_topics"
    df_out = df[["id", "title", "url", "date", "contents", "topic", "topic_prob"]].copy()
    df_out.to_parquet(outdir / f"{out_prefix}.parquet", index=False)
    topic_model.get_topic_info().to_csv(outdir / f"{board_name}_topic_info.csv", index=False)
    topic_model.save(str(outdir / f"bertopic_{board_name}"))

    print(f"[{board_name}] docs={len(df_out):,} | topics={topic_model.get_topic_info().query('Topic!=-1').shape[0]}")
    return df_out, topic_model
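
The two run functions above share the same pipeline and differ only in four hyperparameters. One possible way to make that difference explicit is a small config object. This is a sketch of a refactoring idea, not the notebook's code: `TopicRunConfig` and `config_for` are hypothetical names, and the values are copied from the two functions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TopicRunConfig:
    """Hyperparameters that differ between the two run functions (hypothetical helper)."""
    n_neighbors: int       # UMAP neighbourhood size
    min_cluster_size: int  # HDBSCAN minimum cluster size
    min_samples: int       # HDBSCAN min_samples
    min_df: int            # CountVectorizer minimum document frequency


# Values copied from run_bertopic_for_df and run_bertopic_for_hot.
GENERAL_CONFIG = TopicRunConfig(n_neighbors=15, min_cluster_size=25, min_samples=10, min_df=10)
HOT_CONFIG = TopicRunConfig(n_neighbors=10, min_cluster_size=15, min_samples=5, min_df=5)


def config_for(board_name: str) -> TopicRunConfig:
    """Use the smaller-cluster preset for the hot-post board, the general preset otherwise."""
    return HOT_CONFIG if board_name == "인기글" else GENERAL_CONFIG


print(config_for("인기글"))
print(config_for("익게2"))
```

With such a config, a single run function could take a `TopicRunConfig` argument, which would keep the two code paths from drifting apart. The hot-post preset uses smaller values because that board has far fewer rows (6,592 versus 49,251 for 익게2).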


---
## 7. Run topic models for the regular boards
- Set the boards to run in `BOARD_RUN_LIST` below.
- By default only `"익게2"` is included; add other boards as needed.
- Each run can take several minutes.


In [13]:
BOARD_RUN_LIST = ["익게2"]  # Extend to ["익게2", "자유게시판", "연애상담소", "익게1"] as needed

general_results = {}
for board_name in BOARD_RUN_LIST:
    df_board = board_frames.get(board_name)
    if df_board is None:
        print(f"⚠️ {board_name}: no data available.")
        continue

    print(f"\n=== {board_name} topic modeling started ===")
    board_topics, board_model = run_bertopic_for_df(
        df_board,
        board_name=board_name,
        use_tokenize=True,
        custom_stopwords=CUSTOM_STOPWORDS,
    )
    general_results[board_name] = {"topics": board_topics, "model": board_model}

    # Per-topic keywords, posting-time patterns, and representative posts
    kw_df = top_keywords_df(board_model, topn=8)
    best_hours, best_dows = time_dow_helper(board_topics)
    rep_df = representatives(board_topics, k=3)

    payload = kw_df.merge(rep_df, on="topic", how="left")
    payload["best_hours"] = payload["topic"].map(best_hours)
    payload["best_dows"] = payload["topic"].map(best_dows)

    json_path = OUTPUT_DIR / f"{board_name}_topics_for_prompt.json"
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(payload.to_dict(orient="records"), f, ensure_ascii=False, indent=2)
    print(f"✅ {board_name} prompt JSON saved: {json_path}")



=== 익게2 topic modeling started ===

2025-11-25 02:24:06,757 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-11-25 02:25:27,632 - BERTopic - Dimensionality - Completed ✓
2025-11-25 02:25:27,634 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-11-25 02:26:56,766 - BERTopic - Cluster - Completed ✓
2025-11-25 02:26:56,783 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-11-25 02:26:59,563 - BERTopic - Representation - Completed ✓


[익게2] docs=49,251 | topics=204
✅ 익게2 prompt JSON saved: /content/drive/MyDrive/board_crawling/outputs/익게2_topics_for_prompt.json
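
For readers of the later phases, the sketch below shows the shape of one record in `<board>_topics_for_prompt.json` and why `ensure_ascii=False` is used when writing it. The keys match the payload built above; every value is made up for illustration, and the file is written to a temporary directory rather than to `OUTPUT_DIR`.

```python
import json
import tempfile
from pathlib import Path

# One made-up record in the shape written above (values are illustrative only).
record = {
    "topic": 0,
    "keywords": ["시험", "기말", "공부"],
    "representatives": [{"id": 1, "title": "예시 제목", "url": "https://example.com/1"}],
    "best_hours": [23, 0, 1],
    "best_dows": [6, 0],
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample_topics_for_prompt.json"
    with open(path, "w", encoding="utf-8") as f:
        json.dump([record], f, ensure_ascii=False, indent=2)
    raw = path.read_text(encoding="utf-8")
    loaded = json.loads(raw)

print("시험" in raw)  # True: Hangul is stored as-is rather than as \uXXXX escapes
print(loaded[0]["keywords"])
```

Keeping the Hangul readable in the file makes the JSON easier to inspect by hand and to paste into prompts.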


---
## 8. Hot-post (인기글) topic model + style analysis


In [14]:
hot_df = board_frames["인기글"].copy()
print(f"\n=== Hot-post topic + style analysis started (rows: {len(hot_df):,}) ===")

hot_topics, hot_model = run_bertopic_for_hot(
    hot_df,
    board_name="인기글",
    use_tokenize=True,
    custom_stopwords=CUSTOM_STOPWORDS,
)

# Per-post style features, then per-topic aggregates
hot_topics = add_text_style_features(hot_topics)
hot_style_stats = topic_style_stats(hot_topics)
hot_punct_profile = topic_punct_profile(hot_topics)

hot_kw_df = top_keywords_df(hot_model, topn=8)
hot_best_hours, hot_best_dows = time_dow_helper(hot_topics)
hot_rep_df = representatives(hot_topics, k=3)

hot_prompt_payload = hot_kw_df.merge(hot_rep_df, on="topic", how="left")
hot_prompt_payload["best_hours"] = hot_prompt_payload["topic"].map(hot_best_hours)
hot_prompt_payload["best_dows"] = hot_prompt_payload["topic"].map(hot_best_dows)

hot_prompt_payload = (
    hot_prompt_payload
    .merge(hot_style_stats, on="topic", how="left")
    .merge(hot_punct_profile, on="topic", how="left")
)
hot_prompt_payload["style_hints"] = hot_prompt_payload.apply(make_style_hints, axis=1)

with open(OUTPUT_DIR / "인기글_topics_for_prompt.json", "w", encoding="utf-8") as f:
    json.dump(hot_prompt_payload.to_dict(orient="records"), f, ensure_ascii=False, indent=2)
print("✅ Hot-post prompt JSON saved")

# Build prompt templates for the N largest topics (outlier topic -1 excluded)
N = 8
hot_top_topics = (
    hot_topics["topic"].value_counts()
    .drop(-1, errors="ignore")
    .head(N)
    .index
    .tolist()
)

prompt_templates = []
for topic_id in hot_top_topics:
    row = hot_prompt_payload[hot_prompt_payload["topic"] == topic_id]
    if row.empty:
        continue
    row = row.iloc[0]
    kws = row["keywords"]
    # NaN is truthy, so `or []` would not catch a missing value; check the type instead.
    hours = row["best_hours"] if isinstance(row["best_hours"], list) else []
    dows = row["best_dows"] if isinstance(row["best_dows"], list) else []
    style_hints = row.get("style_hints") or []
    body_len_mean = row.get("body_len_mean")
    if isinstance(body_len_mean, (int, float)) and not pd.isna(body_len_mean):
        lo, hi = make_range_for_template(body_len_mean, widen=0.4, floor=50)
        length_info = {"min_chars": lo, "max_chars": hi, "mean_chars": float(body_len_mean)}
    else:
        # Fallback length range when no body-length statistic is available
        length_info = {"min_chars": 400, "max_chars": 700, "mean_chars": None}
    template = {
        "topic": int(topic_id),
        "prompt": {
            "keywords": kws[:5],
            "time_windows": {"best_hours": hours, "best_dows": dows},
            "length": length_info,
            "style_hints": style_hints,
        }
    }
    prompt_templates.append(template)

with open(OUTPUT_DIR / "인기글_prompt_templates.json", "w", encoding="utf-8") as f:
    json.dump(prompt_templates, f, ensure_ascii=False, indent=2)
print("✅ Hot-post prompt templates saved")



=== Hot-post topic + style analysis started (rows: 6,592) ===

2025-11-25 02:28:05,666 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-11-25 02:28:19,344 - BERTopic - Dimensionality - Completed ✓
2025-11-25 02:28:19,346 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-11-25 02:28:22,173 - BERTopic - Cluster - Completed ✓
2025-11-25 02:28:22,179 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-11-25 02:28:23,018 - BERTopic - Representation - Completed ✓


[인기글] docs=6,592 | topics=81
✅ Hot-post prompt JSON saved
✅ Hot-post prompt templates saved
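
One caveat about the saved JSON: the left merges above can leave NaN in a record, and Python's `json.dump` writes NaN as the bare token `NaN` by default, which strict JSON parsers reject. The sketch below converts NaN to `null` before dumping. `nan_to_none` is a hypothetical helper and the record values are made up.

```python
import json
import math


def nan_to_none(obj):
    """Recursively replace float NaN with None so the result is strict JSON."""
    if isinstance(obj, float) and math.isnan(obj):
        return None
    if isinstance(obj, dict):
        return {key: nan_to_none(value) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [nan_to_none(value) for value in obj]
    return obj


record = {"topic": 3, "body_len_mean": float("nan"), "best_hours": [22, float("nan")]}
clean = nan_to_none(record)

# allow_nan=False makes json raise instead of emitting NaN, so this also acts as a check.
print(json.dumps(clean, allow_nan=False))
# {"topic": 3, "body_len_mean": null, "best_hours": [22, null]}
```

In the notebook this could be applied to `hot_prompt_payload.to_dict(orient="records")` just before `json.dump`, assuming the downstream phases accept `null` for missing statistics.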


---
## 9. Output summary
- `outputs/<board>_topics.parquet`
- `outputs/<board>_topic_info.csv`
- `outputs/<board>_topics_for_prompt.json`
- `outputs/bertopic_<board>` directory (model)
- `outputs/인기글_prompt_templates.json`

The Phase 1.5/2/3 notebooks can pick up from these outputs as needed.
