# FileCharCountDistributionStatistic 开发笔记

本 notebook 演示如何开发 FileCharCount 字符区间统计卡片，并在快速上手框架中调试单次运行与趋势分析的呈现效果。

**English:** This notebook demonstrates how to build the FileCharCount character-range statistic card and how to preview both single-run and trend outputs within the quickstart framework.  
**日本語:** このノートブックでは、FileCharCount の文字数区間統計カードの作成方法と、クイックスタートフレームワーク内で単一実行とトレンドの表示を確認する方法を紹介します。

## 数据准备

运行下方单元以加载示例运行数据。根据实际情况调整 `repo_name` 和 `run_name`，以便查看不同执行结果。

**English:** Run the cell below to load sample run data. Adjust `repo_name` and `run_name` as needed to inspect different executions.  
**日本語:** 下のセルを実行してサンプルの実行データを読み込みます。必要に応じて `repo_name` と `run_name` を変更し、別の実行結果を確認してください。

In [None]:
import sys
from pathlib import Path

NOTEBOOKS_DIR = Path.cwd().resolve().parent
if str(NOTEBOOKS_DIR) not in sys.path:
    sys.path.insert(0, str(NOTEBOOKS_DIR))

from quickstart_dashboard import RunDataLoader

loader = RunDataLoader(base_dir="../../artifacts")
repos = loader.list_repos()

if not repos:
    print("⚠️ 未找到任何项目，请确认 ../../artifacts 目录存在分析结果。")
    repo_name = None
    run_name = None
    sample_run = None
    sample_history = None
else:
    repo_name = repos[0]
    print(f"使用示例项目: {repo_name}")
    runs = loader.list_runs(repo_name)
    if not runs:
        print("⚠️ 项目下暂未找到运行记录。")
        run_name = None
        sample_run = None
        sample_history = None
    else:
        run_name = runs[0]
        print(f"使用示例运行: {run_name}")
        sample_run = loader.load_run(repo_name, run_name)
        sample_history = loader.load_history(repo_name, limit=20)


### 技术栈筛选

选择需要查看的技术栈后，重新运行预览单元即可按栈过滤统计结果。

**English:** Choose a tech stack and rerun the preview cells to filter the results for that stack.

**日本語:** 技術スタックを選択し、プレビューセルを再実行すると、そのスタックに絞った結果を確認できます。

In [None]:

import ipywidgets as widgets
from IPython.display import display

from quickstart_dashboard import RunData, TechStackClassifier


stack_classifier = TechStackClassifier.from_config()
stack_options = stack_classifier.stack_labels or [stack_classifier.all_label]
default_stack = stack_options[0] if stack_options else None


def filter_run_by_stack(run: RunData, stack_label: str | None) -> RunData:
    """Return a RunData copy filtered to the requested tech stack."""

    if run is None:
        return run

    label = stack_label or stack_classifier.all_label
    if stack_classifier.is_all(label):
        dataframes = dict(run.dataframes)
    else:
        dataframes = stack_classifier.filter_run_dataframes(run.dataframes, label)

    metadata = dict(run.metadata or {})
    metadata["selected_stack"] = label
    return RunData(
        repo=run.repo,
        run=run.run,
        path=run.path,
        metadata=metadata,
        dataframes=dataframes,
        timestamp=run.timestamp,
        ended_at=run.ended_at,
        selected_stack=label,
    )


stack_dropdown = widgets.Dropdown(
    options=stack_options,
    value=default_stack,
    description="技术栈",
    layout=widgets.Layout(width="320px"),
)

display(stack_dropdown)


## 定义统计卡片

在下方代码单元中实现字符区间统计逻辑。运行后即可在当前会话中使用该类。

**English:** Implement the character-range statistic in the code cell below. Once executed, the class becomes available for use in this session.  
**日本語:** 以下のコードセルで文字数区間統計のロジックを実装します。セルを実行すると、このセッションでクラスを利用できるようになります。

### 数据处理函数

在定义统计类之前，我们先拆分出路径归一化和 diff 元数据匹配的辅助函数，方便单独运行并快速排查问题。

**English:** Before defining the statistic class, split out helper functions for path normalization and diff metadata joins so they can be executed independently during debugging.

**日本語:** 統計クラスを定義する前に、パスの正規化や差分メタデータ結合の補助関数を分離し、デバッグ時に個別に実行しやすくします。
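
As a rough illustration of the path normalization step, the sketch below coalesces two candidate columns into one POSIX-style path on a toy frame. The frame contents and the name `coalesce_paths` are made up for this example; the notebook's own helper is `normalize_paths` in the next cell.

```python
import pandas as pd

# Toy data (hypothetical): one Windows-style path, one row that only has
# `file_path`, and one path with stray leading/trailing slashes.
toy = pd.DataFrame(
    {
        "path": ["src\\app\\main.py", None, "/docs/readme.md/"],
        "file_path": [None, "lib/util.py", "ignored/when/path/present.md"],
    }
)


def coalesce_paths(df: pd.DataFrame, columns: list[str]) -> pd.Series:
    """Return the first non-empty normalized path found across `columns`."""
    result = pd.Series([""] * len(df), index=df.index, dtype=object)
    for column in columns:
        values = df[column].fillna("").astype(str)
        # Convert backslashes to forward slashes and trim outer slashes.
        values = values.str.replace("\\", "/", regex=False).str.strip("/")
        # Keep paths already chosen; fill the remaining gaps from this column.
        result = result.where(result != "", values)
    return result


normalized = coalesce_paths(toy, ["path", "file_path"])
print(normalized.tolist())
```

Earlier columns take priority, so `file_path` is only consulted for rows where `path` is missing or empty.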


In [None]:
from typing import List, Optional, Sequence

import pandas as pd

from quickstart_dashboard import RunData

def normalize_paths(
    df: pd.DataFrame, candidate_columns: Sequence[str] | None = None
) -> pd.DataFrame:
    """Return a copy with normalized POSIX-style paths."""

    columns = list(candidate_columns or [])
    if not columns:
        columns = [col for col in ("path", "file_path") if col in df.columns]

    if not columns:
        result = df.copy()
        result["path"] = ""
        return result

    normalized = pd.Series([""] * len(df), index=df.index, dtype=object)
    for column in columns:
        values = df[column].fillna("").astype(str)
        values = values.str.replace("\\", "/", regex=False).str.strip("/")
        normalized = normalized.where(normalized != "", values)

    result = df.copy()
    result["path"] = normalized
    return result

def load_char_counts(run: RunData) -> Optional[pd.DataFrame]:
    """Extract FileCharCount rows from analysis results."""

    df = run.dataframes.get("analysis_results_df")
    if df is None or df.empty:
        return None

    if "analyzer_type" not in df.columns or "count" not in df.columns:
        return None

    filtered = df[df["analyzer_type"] == "FileCharCount"].copy()
    if filtered.empty:
        return None

    filtered["count"] = pd.to_numeric(filtered["count"], errors="coerce")
    filtered.dropna(subset=["count"], inplace=True)
    if filtered.empty:
        return None

    filtered = normalize_paths(filtered)
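    if "commit_hash" in filtered.columns:
        filtered["commit_hash"] = filtered["commit_hash"].fillna("").astype(str)
    else:
        # DataFrame.get would return the plain "" default here, which has no .fillna.
        filtered["commit_hash"] = ""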
    filtered = filtered[filtered["path"] != ""].copy()
    if filtered.empty:
        return None

    filtered.sort_values(["path", "count"], ascending=[True, False], inplace=True)
    filtered = filtered.drop_duplicates(subset=["path", "commit_hash"], keep="first")
    return filtered

def attach_diff_metadata(
    run: RunData, df: pd.DataFrame
) -> tuple[pd.DataFrame, str, bool]:
    """Merge diff metadata to highlight added and changed files."""

    run_type = (run.metadata or {}).get("run_type", "")
    enriched = df.copy()
    diff_info_available = False

    if run_type == "diff":
        diff_df = run.dataframes.get("diff_results_df")
        if diff_df is not None and not diff_df.empty:
            diff_df = diff_df.copy()
            for column in (
                "target_path",
                "source_path",
                "diff_change_type",
                "target_commit_hash",
                "base_commit_hash",
            ):
                if column not in diff_df.columns:
                    diff_df[column] = ""

            diff_df["target_path"] = diff_df["target_path"].fillna("").astype(str)
            diff_df["source_path"] = diff_df["source_path"].fillna("").astype(str)
            diff_df["diff_change_type"] = diff_df["diff_change_type"].fillna("").astype(str)
            diff_df["target_commit_hash"] = diff_df["target_commit_hash"].fillna("").astype(str)
            diff_df["base_commit_hash"] = diff_df["base_commit_hash"].fillna("").astype(str)

            diff_df["_merge_path"] = diff_df["target_path"].where(
                diff_df["target_path"] != "", diff_df["source_path"]
            )
            diff_df["_merge_path"] = (
                diff_df["_merge_path"].str.replace("\\", "/", regex=False).str.strip("/")
            )

            diff_map = diff_df[
                ["_merge_path", "diff_change_type", "target_commit_hash", "base_commit_hash"]
            ].rename(columns={"_merge_path": "path"})

            enriched = enriched.merge(diff_map, on="path", how="left")
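            # The merge above always brings in these three columns from diff_map.
            enriched["diff_change_type"] = (
                enriched["diff_change_type"].fillna("").astype(str).str.upper()
            )
            enriched["target_commit_hash"] = (
                enriched["target_commit_hash"].fillna("").astype(str)
            )
            enriched["base_commit_hash"] = (
                enriched["base_commit_hash"].fillna("").astype(str)
            )

            # Fall back to empty hashes when the input lacks a commit_hash column.
            if "commit_hash" in enriched.columns:
                commit_hash = enriched["commit_hash"].fillna("").astype(str)
            else:
                commit_hash = pd.Series("", index=enriched.index, dtype=object)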
            target_mask = enriched["target_commit_hash"] != ""
            commit_match = commit_hash == enriched["target_commit_hash"]
            keep_mask = (~target_mask) | (target_mask & commit_match)
            enriched = enriched.loc[keep_mask].copy()
            enriched.sort_values(["path", "count"], ascending=[True, False], inplace=True)

            diff_info_available = (
                enriched["diff_change_type"].replace("", pd.NA).notna().any()
            )
        else:
            enriched["diff_change_type"] = pd.NA
    else:
        enriched["diff_change_type"] = pd.NA

    enriched = enriched.drop_duplicates(subset=["path"], keep="first")
    return enriched, str(run_type), diff_info_available

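def bucket_label(value: float, bins: Sequence[tuple[str, int, Optional[int]]]) -> str:
    """Return the label matching the provided count."""

    # Each bin is a half-open interval [lower, upper); upper=None means unbounded.
    for label, lower, upper in bins:
        if value < lower:
            continue
        if upper is None or value < upper:
            return label
    # Values matching no bin (e.g. below every lower bound) fall back to the last bin.
    return bins[-1][0]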

def summarize_file_char_run(
    run: RunData, bins: Sequence[tuple[str, int, Optional[int]]]
) -> Optional[dict[str, object]]:
    """Prepare summary rows for the statistic card."""

    base_df = load_char_counts(run)
    if base_df is None:
        return None

    enriched, run_type, diff_info_available = attach_diff_metadata(run, base_df)
    if enriched.empty:
        return None

    enriched = enriched.copy()
    enriched["range_label"] = enriched["count"].apply(lambda value: bucket_label(value, bins))

    stack_label = (run.metadata or {}).get("selected_stack") or run.selected_stack or ""

    rows: List[dict[str, object]] = []
    for label, _, _ in bins:
        subset = enriched[enriched["range_label"] == label]
        total = int(subset.shape[0])
        added: Optional[int] = None
        changed: Optional[int] = None

        if run_type == "diff" and diff_info_available:
            added = int(subset["diff_change_type"].isin({"A"}).sum())
            changed = int(subset["diff_change_type"].isin({"M", "R"}).sum())

        rows.append({"range": label, "total": total, "added": added, "changed": changed})

    return {
        "rows": rows,
        "run_type": str(run_type),
        "diff_info": diff_info_available,
        "stack_label": stack_label,
    }

### 统计卡片类

接下来利用这些辅助函数实现最终的统计卡片类，便于在 Notebook 中调试并导出到仪表盘。

**English:** Next, use these helpers to implement the final statistic class so it can be debugged in the notebook and exported to the dashboard.

**日本語:** これらの補助関数を利用して最終的な統計クラスを実装し、ノートブックでデバッグしダッシュボードへエクスポートできるようにします。
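
The bins used by the class are half-open intervals `[lower, upper)`, so a file with exactly 1,000 characters lands in the second bucket rather than the first. The following minimal sketch shows that boundary behaviour with a shortened, purely illustrative bin list; `DEMO_BINS` and `demo_bucket_label` are hypothetical names that mirror the logic of `bucket_label` above.

```python
from typing import Optional, Sequence

Bin = tuple[str, int, Optional[int]]

# Shortened bin list for illustration only (the class defines six bins).
DEMO_BINS: Sequence[Bin] = (
    ("<1k", 0, 1_000),
    ("1k-2k", 1_000, 2_000),
    (">=2k", 2_000, None),
)


def demo_bucket_label(value: float, bins: Sequence[Bin]) -> str:
    """Return the label of the half-open bin [lower, upper) containing value."""
    for label, lower, upper in bins:
        if value >= lower and (upper is None or value < upper):
            return label
    # No bin matched (e.g. a negative value): fall back to the last bin.
    return bins[-1][0]


for sample in (0, 999, 1_000, 1_999, 2_000, 75_000):
    print(sample, demo_bucket_label(sample, DEMO_BINS))
```

Because the last bin has no upper bound, every non-negative count is assigned to exactly one bucket, and the per-bucket totals sum to the number of files.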


In [None]:
from typing import Optional, Sequence

import ipywidgets as widgets

from quickstart_dashboard import BaseStatistic, RunData, RunHistory, _card_container


class FileCharCountDistributionStatistic(BaseStatistic):
    """Bucket files by character count and highlight diff additions/modifications."""

    BINS: Sequence[tuple[str, int, Optional[int]]] = (
        ("<1k", 0, 1_000),
        ("1k-2k", 1_000, 2_000),
        ("2k-3k", 2_000, 3_000),
        ("3k-10k", 3_000, 10_000),
        ("10k-50k", 10_000, 50_000),
        (">=50k", 50_000, None),
    )

    def __init__(self) -> None:
        self.name = "文件字符区间"
        self.description = "基于 FileCharCount 分析结果统计字符区间，并区分新增/变更文件数量"

    def render_single(self, run: RunData) -> widgets.Widget:
        summary = summarize_file_char_run(run, self.BINS)
        if summary is None:
            return _card_container(
                self.name,
                description=self.description,
                body_html="<div style='color:#666;'>暂无 FileCharCount 分析结果可用于统计。</div>",
                min_width="360px",
            )

        diff_run = summary["run_type"] == "diff"
        diff_info = diff_run and summary["diff_info"]
        if diff_info:
            note = "统计基于 diff 运行匹配的 FileCharCount 结果。"
        elif diff_run:
            note = "diff 运行未匹配到文件级变更，新增/变更列以 “-” 显示。"
        else:
            run_label = summary["run_type"] or "未知"
            note = f"当前运行类型为 {run_label}，新增/变更列以 “-” 显示。"

        stack_label = summary.get("stack_label") or ""
        stack_note = ""
        if stack_label:
            stack_note = f"技术栈筛选：{stack_label}"

        table = self._render_distribution_table(summary["rows"], diff_info)

        body_widgets = [
            widgets.HTML(value=f"<div style='color:#666;font-size:12px;'>{note}</div>"),
        ]
        if stack_note:
            body_widgets.append(
                widgets.HTML(value=f"<div style='color:#666;font-size:12px;'>{stack_note}</div>")
            )
        body_widgets.append(table)

        return _card_container(
            self.name,
            description=self.description,
            body_widgets=body_widgets,
            min_width="360px",
        )

    def _render_distribution_table(
        self, rows: Sequence[dict[str, object]], diff_info: bool
    ) -> widgets.HTML:
        header_cells = ["区间", "文件数", "新增", "变更"]
        html = [
            "<table style='border-collapse:collapse;font-size:12px;width:100%;max-width:520px;'>",
            "<thead><tr>",
        ]
        for cell in header_cells:
            html.append(
                f"<th style='border-bottom:1px solid #ddd;padding:6px 8px;text-align:left;color:#555;font-weight:600;'>{cell}</th>"
            )
        html.append("</tr></thead><tbody>")

        for row in rows:
            total = row["total"]
            added = row.get("added")
            changed = row.get("changed")
            html.append("<tr>")
            html.append(
                f"<td style='padding:6px 8px;border-bottom:1px solid #f0f0f0;color:#333;'>{row['range']}</td>"
            )
            html.append(
                f"<td style='padding:6px 8px;border-bottom:1px solid #f0f0f0;color:#333;'>{total}</td>"
            )
            if diff_info:
                html.append(
                    f"<td style='padding:6px 8px;border-bottom:1px solid #f0f0f0;color:#333;'>{added if added is not None else 0}</td>"
                )
                html.append(
                    f"<td style='padding:6px 8px;border-bottom:1px solid #f0f0f0;color:#333;'>{changed if changed is not None else 0}</td>"
                )
            else:
                html.append("<td style='padding:6px 8px;border-bottom:1px solid #f0f0f0;color:#999;'>-</td>")
                html.append("<td style='padding:6px 8px;border-bottom:1px solid #f0f0f0;color:#999;'>-</td>")
            html.append("</tr>")

        html.append("</tbody></table>")
        return widgets.HTML(value="".join(html))

    def render_trend(self, history: RunHistory) -> widgets.Widget:
        stack_label = ""
        if history.runs:
            stack_label = history.runs[-1].selected_stack or ""

        hint = "该统计不支持趋势分析。"
        if stack_label:
            hint += f" 当前技术栈筛选：{stack_label}。"

        return _card_container(
            self.name,
            description=self.description,
            body_html=f"<div style='color:#666;'>{hint}</div>",
            min_width="360px",
            flex="1 1 100%",
        )

## 调试与预览

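运行以下单元，在 notebook 中直接查看单次运行卡片和趋势卡片，便于上线前快速验证。本统计不支持趋势分析，因此趋势卡片仅显示提示信息。

**English:** Execute the following cells to preview the single-run card and the trend card directly inside the notebook, making it easy to validate before publishing. This statistic does not support trend analysis, so the trend card only shows a hint.  
**日本語:** 次のセルを実行すると、ノートブック上で単一実行カードとトレンドカードを直接確認でき、公開前の検証が容易になります。この統計はトレンド分析に対応していないため、トレンドカードには案内メッセージのみが表示されます。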
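
When no artifacts are available, one option is to debug against a small synthetic run. The sketch below is an assumption-laden stand-in: `fake_run` is a `SimpleNamespace`, not a real `RunData`, and the rows are invented. The column names are the ones the helper functions above read (`analyzer_type`, `path`, `count`, `commit_hash`, and the diff columns).

```python
from types import SimpleNamespace

import pandas as pd

# Invented analysis rows; only FileCharCount rows matter for this card.
analysis_results_df = pd.DataFrame(
    {
        "analyzer_type": ["FileCharCount", "FileCharCount", "OtherAnalyzer"],
        "path": ["src/a.py", "src/b.py", "src/a.py"],
        "count": [850, 12_400, 3],
        "commit_hash": ["abc123", "abc123", "abc123"],
    }
)

# Invented diff rows: one added file and one modified file.
diff_results_df = pd.DataFrame(
    {
        "target_path": ["src/a.py", "src/b.py"],
        "source_path": ["", "src/b.py"],
        "diff_change_type": ["A", "M"],
        "target_commit_hash": ["abc123", "abc123"],
        "base_commit_hash": ["", "def456"],
    }
)

# Hypothetical stand-in exposing the attributes the helpers access.
fake_run = SimpleNamespace(
    repo="demo-repo",
    run="demo-run",
    path=None,
    metadata={"run_type": "diff"},
    dataframes={
        "analysis_results_df": analysis_results_df,
        "diff_results_df": diff_results_df,
    },
    timestamp=None,
    ended_at=None,
    selected_stack=None,
)

# Quick sanity check of the join the card relies on: char counts + change type.
char_rows = fake_run.dataframes["analysis_results_df"]
char_rows = char_rows[char_rows["analyzer_type"] == "FileCharCount"]
diff_map = fake_run.dataframes["diff_results_df"].rename(
    columns={"target_path": "path"}
)[["path", "diff_change_type"]]
merged = char_rows.merge(diff_map, on="path", how="left")
print(merged[["path", "count", "diff_change_type"]])
```

Within this notebook session, `summarize_file_char_run(fake_run, FileCharCountDistributionStatistic.BINS)` should accept the stand-in, since the helpers shown above only read `dataframes`, `metadata`, and `selected_stack`. The real `RunData` class may enforce more than this sketch assumes.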

In [None]:

if sample_run is None:
    print("⚠️ 没有可用的运行数据，无法预览单次统计卡片。")
else:
    stat = FileCharCountDistributionStatistic()
    stack_label = stack_dropdown.value if "stack_dropdown" in globals() else None
    preview_run = filter_run_by_stack(sample_run, stack_label)
    widget = stat.render_single(preview_run)
    display(widget)


In [None]:

if sample_history is None or not sample_history.runs:
    print("⚠️ 没有足够的历史数据，无法绘制趋势图。")
else:
    stat = FileCharCountDistributionStatistic()
    stack_label = stack_dropdown.value if "stack_dropdown" in globals() else None
    filtered_runs = [
        filter_run_by_stack(run, stack_label)
        for run in sample_history.runs
    ]
    filtered_history = RunHistory(repo=sample_history.repo, runs=filtered_runs)
    widget = stat.render_trend(filtered_history)
    display(widget)


## 导出到仪表盘

确认逻辑后，可以在 `quickstart_dashboard.ipynb` 中通过 `load_statistic_from_notebook("custom_statistics/file_char_count_distribution_statistic.ipynb")` 将此统计类加载并注册到仪表盘。

**English:** After validating the logic, load and register this statistic in `quickstart_dashboard.ipynb` via `load_statistic_from_notebook("custom_statistics/file_char_count_distribution_statistic.ipynb")`.  
**日本語:** ロジックを確認したら、`quickstart_dashboard.ipynb` で `load_statistic_from_notebook("custom_statistics/file_char_count_distribution_statistic.ipynb")` を使ってこの統計クラスを読み込み、ダッシュボードに登録してください。