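# Q1 Data Preprocessing Module

## Objectives
This notebook carries out the data preprocessing for Question Q1:

1. **Data loading**: read the official DWTS dataset
2. **Missing-value handling**: handle special values such as N/A and scores of 0
3. **Building the active contestant set $A_{s,t}$**: the contestants still competing in each week of each season
4. **Computing judge scores**: total and average scores under one consistent definition
5. **Extracting elimination labels $E_{s,t}$**: the set of contestants eliminated each week

## Notation
- $s$: season index ($s = 1, 2, \dots, 34$)
- $t$: week index ($t = 1, 2, \dots, T_s$, where $T_s$ is the number of weeks in season $s$)
- $i$: contestant index (a celebrity and professional partner pair)
- $A_{s,t}$: set of contestants still competing in week $t$ of season $s$
- $n_{s,t} = |A_{s,t}|$: number of contestants competing that week
- $E_{s,t}$: set of contestants eliminated in week $t$ of season $s$
- $X_{i,s,t,j}$: score given to contestant $i$ by judge $j$ in week $t$ of season $s$
- $S_{i,s,t}$: total judge score of contestant $i$ in week $t$ of season $s$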
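
To make the notation concrete, here is a minimal sketch for a single contestant in a single week. The scores are made up for illustration, and `None` stands in for a missing fourth judge.

```python
# Hypothetical scores X_{i,s,t,j} for one contestant in one week.
# None stands in for a missing fourth judge (N/A in the dataset).
judge_scores = [8.0, 7.0, 9.0, None]

valid = [x for x in judge_scores if x is not None]
m = len(valid)        # number of valid judges, m_{i,s,t}
total = sum(valid)    # total judge score, S_{i,s,t}
average = total / m   # average judge score

print(m, total, average)
```

The average divides by the number of valid judges rather than a fixed 4, so weeks with three and four judges stay comparable.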

---
## 1. Environment Setup and Imports

In [None]:
# ============================================================
# 1.1 Import required libraries
# ============================================================

import pandas as pd
import numpy as np
import re
import warnings
from pathlib import Path
from typing import Dict, List, Tuple, Optional, Set

# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Suppress warnings (note: this silences every warning, including pandas ones)
warnings.filterwarnings('ignore')

# Display options
pd.set_option('display.max_columns', 20)
pd.set_option('display.max_rows', 50)
pd.set_option('display.width', 200)

# Plot style
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['figure.dpi'] = 100
plt.rcParams['font.size'] = 10

print("Imports complete!")

In [None]:
# ============================================================
# 1.2 Define data paths
# ============================================================

# Data directory (relative to this notebook)
DATA_DIR = Path("../data/raw")

# Path to the raw data file
RAW_DATA_PATH = DATA_DIR / "2026_MCM_Problem_C_Data.csv"

# Check that the file exists
if RAW_DATA_PATH.exists():
    print(f"Data file found: {RAW_DATA_PATH}")
    print(f"File size: {RAW_DATA_PATH.stat().st_size / 1024:.2f} KB")
else:
    print(f"Error: data file not found - {RAW_DATA_PATH}")

---
## 2. Data Loading and Initial Exploration

In [None]:
# ============================================================
# 2.1 Load the raw data
# ============================================================

# Read the CSV file
# encoding='utf-8-sig' strips a BOM marker if one is present
raw_df = pd.read_csv(RAW_DATA_PATH, encoding='utf-8-sig')

# Print basic information
print("=" * 60)
print("Basic dataset information")
print("=" * 60)
print(f"Shape: {raw_df.shape[0]} rows × {raw_df.shape[1]} columns")
print(f"\nSeason range: Season {raw_df['season'].min()} - Season {raw_df['season'].max()}")
# nunique() counts distinct names, so a celebrity who appears in several seasons is counted once
print(f"Unique celebrity names: {raw_df['celebrity_name'].nunique()}")

In [None]:
# ============================================================
# 2.2 Inspect column names
# ============================================================

print("Columns:")
print("-" * 60)

# Show the columns by category
# Basic information columns
info_cols = ['celebrity_name', 'ballroom_partner', 'celebrity_industry', 
             'celebrity_homestate', 'celebrity_homecountry/region',
             'celebrity_age_during_season', 'season', 'results', 'placement']

# Score columns (named week{X}_judge{Y}_score)
score_cols = [col for col in raw_df.columns if 'judge' in col and 'score' in col]

print(f"\nBasic information columns ({len(info_cols)}):")
for col in info_cols:
    print(f"  - {col}")

print(f"\nScore columns ({len(score_cols)}):")
print(f"  Format: week{{1-11}}_judge{{1-4}}_score")
print(f"  Examples: {score_cols[:4]}")

In [None]:
# ============================================================
# 2.3 Preview the first rows
# ============================================================

print("Data preview (first 5 rows, basic information columns):")
print("-" * 60)
display(raw_df[info_cols].head())

In [None]:
# ============================================================
# 2.4 Check data types
# ============================================================

print("Data types:")
print("-" * 60)
print(raw_df.dtypes)

In [None]:
# ============================================================
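# 2.5 Missing-value analysis
# ============================================================

print("Missing-value analysis:")
print("=" * 60)

# Count and percentage of missing values per column
missing_count = raw_df.isnull().sum()
missing_pct = (missing_count / len(raw_df) * 100).round(2)

# Show only the columns that contain missing values
missing_df = pd.DataFrame({
    'missing_count': missing_count,
    'missing_pct': missing_pct
})
missing_df = missing_df[missing_df['missing_count'] > 0]

if len(missing_df) > 0:
    print(missing_df)
else:
    print("No column contains missing values")

# Count N/A entries in the score columns.
# pd.read_csv converts the string 'N/A' to NaN by default, so count NaN
# as well as any literal 'N/A' strings that survived parsing.
print("\nN/A counts in score columns:")
print("-" * 60)
na_counts = {}
for col in score_cols:
    na_mask = raw_df[col].isna() | (raw_df[col].astype(str).str.strip() == 'N/A')
    na_count = int(na_mask.sum())
    if na_count > 0:
        na_counts[col] = na_count

# Aggregate by week
week_na_summary = {}
for col, count in na_counts.items():
    week = int(re.search(r'week(\d+)', col).group(1))
    week_na_summary[week] = week_na_summary.get(week, 0) + count

print(f"Total N/A values per week: {week_na_summary}")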

---
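## 3. Data Cleaning and Transformation

### 3.1 Key cleaning rules

According to the problem statement, the following cases need handling:

1. **N/A values**: there was no fourth judge that week; treated as missing
2. **Scores of 0**: the contestant has already been eliminated; all later weeks are recorded as 0
3. **Number of valid judges**: a week may have either 3 or 4 judges
4. **Elimination labels**: the elimination week is parsed from the `results` field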
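
The first three rules can also be applied to whole columns at once. The sketch below uses a small made-up table (not the real dataset) and is a vectorized alternative for illustration, not the implementation used in the cells that follow.

```python
import numpy as np
import pandas as pd

# Toy data: two contestants, two weeks; the second is eliminated after week 1.
toy = pd.DataFrame({
    "week1_judge1_score": [8, 7],
    "week1_judge2_score": [7, 6],
    "week1_judge3_score": [9, 7],
    "week1_judge4_score": ["N/A", "N/A"],
    "week2_judge1_score": [9, 0],
    "week2_judge2_score": [8, 0],
    "week2_judge3_score": [8, 0],
    "week2_judge4_score": ["N/A", "N/A"],
})

# Rule 1: anything non-numeric (such as 'N/A') becomes NaN.
cleaned = toy.apply(pd.to_numeric, errors="coerce")
# Rule 2: scores of 0 mark an eliminated contestant, so mask them as well.
cleaned = cleaned.where(cleaned > 0)

week2_cols = [c for c in cleaned.columns if c.startswith("week2_")]
# min_count=1 keeps the total NaN when no judge gave a valid score.
week2_total = cleaned[week2_cols].sum(axis=1, min_count=1)
# Rule 3: the number of valid judges is the count of non-missing scores.
week2_judges = cleaned[week2_cols].notna().sum(axis=1)

print(week2_total.tolist(), week2_judges.tolist())
```

Masking 0 before summing matters: otherwise an eliminated contestant would get a total of 0 and still look like an active contestant with a very low score.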

In [None]:
# ============================================================
# 3.1 Define the elimination-week parser
# ============================================================

def parse_elimination_week(results: str) -> Optional[int]:
    """
    Parse the elimination week from the results field.
    
    Args:
        results: result string, e.g. "Eliminated Week 3", "1st Place"
    
    Returns:
        int: elimination week (positive integer)
        -1: the contestant withdrew (Withdrew)
        None: not eliminated (reached the final)
    
    Examples:
        >>> parse_elimination_week("Eliminated Week 3")
        3
        >>> parse_elimination_week("1st Place") is None
        True
        >>> parse_elimination_week("Withdrew")
        -1
    """
    # Handle missing values
    if pd.isna(results):
        return None
    
    results = str(results).strip()
    
    # Match the "Eliminated Week X" pattern
    # and extract the week number with a regular expression
    match = re.search(r'Eliminated Week (\d+)', results, re.IGNORECASE)
    if match:
        return int(match.group(1))
    
    # Special case: withdrawal (Withdrew/Quit), matched case-insensitively
    # to stay consistent with the regex above
    lowered = results.lower()
    if 'withdrew' in lowered or 'quit' in lowered:
        return -1  # -1 marks a withdrawal
    
    # Everything else (1st/2nd/3rd Place, etc.): not eliminated
    return None


# Test the function
print("Elimination-week parsing test:")
print("-" * 40)
test_cases = [
    "Eliminated Week 3",
    "1st Place",
    "2nd Place", 
    "Withdrew",
    "Eliminated Week 10"
]
for case in test_cases:
    result = parse_elimination_week(case)
    print(f"  '{case}' -> {result}")

In [None]:
# ============================================================
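# 3.2 Define the score parser
# ============================================================

def parse_score(value) -> Optional[float]:
    """
    Parse a single score value.
    
    Args:
        value: raw score (may be a number, the string 'N/A', 0, etc.)
    
    Returns:
        float: valid score (> 0)
        None: invalid score (N/A, 0, missing, etc.)
    
    Notes:
        - 'N/A' means there was no fourth judge that week
        - 0 means the contestant has already been eliminated
        - valid scores usually range from 1 to 10
    """
    # Handle missing values
    if pd.isna(value):
        return None
    
    # Handle the 'N/A' string (tolerating surrounding whitespace)
    if isinstance(value, str) and value.strip() == 'N/A':
        return None
    
    # Try converting to float
    try:
        score = float(value)
        # A score of 0 means already eliminated; treat it as invalid
        if score > 0:
            return score
        else:
            return None
    except (ValueError, TypeError):
        return None


# Test the function
print("Score parsing test:")
print("-" * 40)
test_scores = [8, 7.5, 'N/A', 0, None, '9']
for score in test_scores:
    result = parse_score(score)
    print(f"  {repr(score)} -> {result}")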

In [None]:
# ============================================================
# 3.3 Compute weekly judge scores (total and average)
# ============================================================

def compute_weekly_scores(df: pd.DataFrame, max_weeks: int = 11) -> pd.DataFrame:
    """
    Compute each contestant's total and average judge score per week.
    
    Args:
        df: raw data DataFrame
        max_weeks: maximum number of weeks (default 11)
    
    Returns:
        DataFrame with the following columns:
        - celebrity_name: celebrity name
        - ballroom_partner: partner name
        - season: season
        - placement: final placement
        - results: competition result
        - elimination_week: elimination week
        - celebrity_industry: celebrity's industry
        - celebrity_age: celebrity's age during the season
        - week{t}_total_score: total judge score in week t
        - week{t}_avg_score: average judge score in week t
        - week{t}_judge_count: number of valid judges in week t
    
    Formulas:
        Total: S_{i,s,t} = Σ_j X_{i,s,t,j} (sum over all valid judges j)
        Average: S̄_{i,s,t} = S_{i,s,t} / m_{i,s,t} (m is the number of valid judges)
    """
    result_rows = []
    
    # Iterate over contestants
    for _, row in df.iterrows():
        # Basic information
        record = {
            'celebrity_name': row['celebrity_name'],
            'ballroom_partner': row['ballroom_partner'],
            'season': row['season'],
            'placement': row['placement'],
            'results': row['results'],
            'elimination_week': parse_elimination_week(row['results']),
            'celebrity_industry': row.get('celebrity_industry', ''),
            'celebrity_age': row.get('celebrity_age_during_season', np.nan),
        }
        
        # Compute the scores for each week
        for week in range(1, max_weeks + 1):
            scores = []  # valid scores for this week
            
            # Iterate over the 4 judges
            for judge in range(1, 5):
                col = f'week{week}_judge{judge}_score'
                if col in row.index:
                    score = parse_score(row[col])
                    if score is not None:
                        scores.append(score)
            
            # Compute total and average
            if scores:
                record[f'week{week}_total_score'] = sum(scores)
                record[f'week{week}_avg_score'] = np.mean(scores)
                record[f'week{week}_judge_count'] = len(scores)
            else:
                # No valid score (did not compete or already eliminated)
                record[f'week{week}_total_score'] = np.nan
                record[f'week{week}_avg_score'] = np.nan
                record[f'week{week}_judge_count'] = 0
        
        result_rows.append(record)
    
    return pd.DataFrame(result_rows)


# Run the computation
print("Computing weekly judge scores...")
processed_df = compute_weekly_scores(raw_df)
print(f"Processed data shape: {processed_df.shape}")
print("\nExamples of new columns:")
new_cols = [col for col in processed_df.columns if 'week1_' in col]
print(f"  {new_cols}")

In [None]:
# ============================================================
# 3.4 Inspect the processed data
# ============================================================

print("Processed data preview (Season 1):")
print("=" * 60)

# Select Season 1
s1_df = processed_df[processed_df['season'] == 1].copy()

# Show the key columns
display_cols = ['celebrity_name', 'elimination_week', 'placement',
                'week1_total_score', 'week2_total_score', 'week3_total_score']
display(s1_df[display_cols])

In [None]:
# ============================================================
# 3.5 Check that elimination weeks are consistent with the scores
# ============================================================

print("Elimination week vs. score consistency check:")
print("=" * 60)
print("\nRule: after elimination, later weeks should have no valid score (NaN)")
print("-" * 60)

# Spot-check the first few eliminated contestants (only 5 rows, not a full audit)
eliminated_samples = processed_df[
    (processed_df['elimination_week'].notna()) & 
    (processed_df['elimination_week'] > 0)
].head(5)

for _, row in eliminated_samples.iterrows():
    elim_week = int(row['elimination_week'])
    name = row['celebrity_name']
    season = row['season']
    
    # Look for scores in the weeks after elimination
    post_elim_scores = []
    for w in range(elim_week + 1, 12):
        score = row.get(f'week{w}_total_score', np.nan)
        if pd.notna(score):
            post_elim_scores.append((w, score))
    
    status = "✓ OK" if len(post_elim_scores) == 0 else f"✗ anomaly: {post_elim_scores}"
    print(f"  S{season} {name}: eliminated in week {elim_week} -> {status}")

---
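## 4. Building the Active Contestant Sets $A_{s,t}$ and Elimination Sets $E_{s,t}$

### Definitions
- **Active contestant set $A_{s,t}$**: contestants still competing in week $t$ of season $s$
  - Criterion: the contestant has a valid score that week (not NaN)
  
- **Elimination set $E_{s,t}$**: contestants eliminated in week $t$ of season $s$
  - Criterion: `elimination_week == t`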
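
The two criteria can be checked against each other on a toy example. The names and scores below are invented, and the sketch assumes the simple case with no withdrawals and no returning contestants, in which $A_{s,t+1} = A_{s,t} \setminus E_{s,t}$ should hold.

```python
import numpy as np
import pandas as pd

# Toy season with three hypothetical contestants and three weeks.
toy = pd.DataFrame({
    "celebrity_name": ["Ann", "Bob", "Cara"],
    "elimination_week": [np.nan, 1, 2],      # NaN: reached the final
    "week1_total_score": [24.0, 18.0, 21.0],
    "week2_total_score": [26.0, np.nan, 20.0],
    "week3_total_score": [27.0, np.nan, np.nan],
})

active = {}
eliminated = {}
for t in (1, 2, 3):
    col = f"week{t}_total_score"
    # A_t: contestants with a valid (non-NaN) score in week t
    active[t] = set(toy.loc[toy[col].notna(), "celebrity_name"])
    # E_t: contestants whose elimination week equals t
    eliminated[t] = set(toy.loc[toy["elimination_week"] == t, "celebrity_name"])

# Consistency check: next week's active set is this week's minus the eliminated.
for t in (1, 2):
    assert active[t + 1] == active[t] - eliminated[t]

print(active)
print(eliminated)
```

On the real data this identity can fail for legitimate reasons, for example a withdrawal (coded as `-1`, so it never appears in any $E_{s,t}$), which makes it useful as a diagnostic rather than a hard constraint.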

In [None]:
# ============================================================
# 4.1 Build the active contestant sets A_{s,t}
# ============================================================

def build_active_sets(df: pd.DataFrame) -> Dict[int, Dict[int, List[str]]]:
    """
    Build the active contestant set for every week of every season.
    
    Args:
        df: processed data DataFrame
    
    Returns:
        Nested dict: {season: {week: [celebrity_names]}}
    
    Notes:
        A_{s,t} = {i : S_{i,s,t} exists and > 0}
        i.e. the contestants with a valid score that week
    """
    active_sets = {}
    
    for season in sorted(df['season'].unique()):
        season_df = df[df['season'] == season]
        active_sets[season] = {}
        
        # Find the last week with any valid score in this season
        max_week = 0
        for week in range(1, 12):
            col = f'week{week}_total_score'
            if col in season_df.columns and season_df[col].notna().any():
                max_week = week
        
        # Build the active set for each week
        for week in range(1, max_week + 1):
            col = f'week{week}_total_score'
            # Active contestants: those with a score that week (not NaN)
            active = season_df[season_df[col].notna()]['celebrity_name'].tolist()
            active_sets[season][week] = active
    
    return active_sets


# Build the active contestant sets
print("Building active contestant sets A_{s,t}...")
active_sets = build_active_sets(processed_df)
print(f"{len(active_sets)} seasons in total")

In [None]:
# ============================================================
# 4.2 Build the elimination sets E_{s,t}
# ============================================================

def build_elimination_sets(df: pd.DataFrame) -> Dict[int, Dict[int, List[str]]]:
    """
    Build the elimination set for every week of every season.
    
    Args:
        df: processed data DataFrame
    
    Returns:
        Nested dict: {season: {week: [eliminated_celebrity_names]}}
    
    Notes:
        E_{s,t} = {i : elimination_week_i == t}
        Some weeks have no elimination (E_{s,t} = ∅) and are omitted from the dict;
        some weeks eliminate several contestants (|E_{s,t}| > 1).
        Withdrawals (elimination_week == -1) never match a week and are excluded.
    """
    elimination_sets = {}
    
    for season in sorted(df['season'].unique()):
        season_df = df[df['season'] == season]
        elimination_sets[season] = {}
        
        for week in range(1, 12):
            # Contestants eliminated this week
            eliminated = season_df[
                season_df['elimination_week'] == week
            ]['celebrity_name'].tolist()
            
            if eliminated:
                elimination_sets[season][week] = eliminated
    
    return elimination_sets


# Build the elimination sets
print("Building elimination sets E_{s,t}...")
elimination_sets = build_elimination_sets(processed_df)

# Count eliminations
total_eliminations = sum(
    len(elim) 
    for season_elim in elimination_sets.values() 
    for elim in season_elim.values()
)
print(f"Total eliminations: {total_eliminations}")

In [None]:
# ============================================================
# 4.3 Inspect the sets for an example season
# ============================================================

def display_season_sets(season: int, active: Dict, elim: Dict):
    """
    Print the active and elimination sets for one season.
    """
    print(f"\n{'='*60}")
    print(f"Season {season} sets")
    print(f"{'='*60}")
    
    if season not in active:
        print(f"Season {season} does not exist")
        return
    
    season_active = active[season]
    season_elim = elim.get(season, {})
    
    print(f"\n{'Week':<8} {'Active':<10} {'Eliminated':<12} {'Names'}")
    print("-" * 60)
    
    for week in sorted(season_active.keys()):
        n_active = len(season_active[week])
        eliminated = season_elim.get(week, [])
        n_elim = len(eliminated)
        elim_names = ', '.join(eliminated) if eliminated else '-'
        print(f"Week {week:<3} {n_active:<10} {n_elim:<12} {elim_names}")


# Show Season 3 (example)
display_season_sets(3, active_sets, elimination_sets)

In [None]:
# ============================================================
# 4.4 Show more seasons (Seasons 1, 10, 20, 30)
# ============================================================

for s in [1, 10, 20, 30]:
    if s in active_sets:
        display_season_sets(s, active_sets, elimination_sets)

---
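## 5. Season Summary and Voting-Rule Classification

### Evolution of the voting rules

According to the problem statement, the DWTS voting rules changed across seasons:

| Seasons | Rule type | Description |
|---------|---------|------|
| S1-S2 | Rank-based | Judge rank + fan rank; the worst combined rank is eliminated |
| S3-S27 | Percentage-based | Judge score share (50%) + fan vote share (50%); the lowest combined share is eliminated |
| S28-S34 | Rank + Judge Save | Ranks determine the bottom two; the judges vote on which of them is eliminated |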
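
As a rough sketch of how the rank-based and percentage-based rules differ mechanically, the example below combines judge totals with fan votes for one hypothetical week. The dataset contains no fan votes, so the `fan_share` values are invented purely for illustration, and details such as tie-breaking are assumptions rather than the show's actual procedure.

```python
import pandas as pd

# One hypothetical week: judge totals are known, fan vote shares are made up.
week = pd.DataFrame({
    "name": ["Ann", "Bob", "Cara"],
    "judge_total": [27.0, 18.0, 24.0],
    "fan_share": [0.30, 0.45, 0.25],
}).set_index("name")

# Rank-based rule: rank 1 is best; the largest combined rank is eliminated.
judge_rank = week["judge_total"].rank(ascending=False, method="min")
fan_rank = week["fan_share"].rank(ascending=False, method="min")
combined_rank = judge_rank + fan_rank
rank_eliminated = combined_rank.idxmax()

# Percentage-based rule: each side contributes its share of the weekly total;
# the smallest combined share is eliminated.
judge_share = week["judge_total"] / week["judge_total"].sum()
combined_pct = judge_share + week["fan_share"]
pct_eliminated = combined_pct.idxmin()

print(rank_eliminated, pct_eliminated)
```

Ranks discard the size of the gaps between contestants while shares keep it, which is why the rule type is recorded per season before any fan-vote model is fitted. Under `rank_v2` the bottom two by combined rank would go to a judges' vote, which this sketch does not model.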

In [None]:
# ============================================================
# 5.1 Define the voting-rule classification
# ============================================================

def get_voting_rule(season: int) -> str:
    """
    Return the voting-rule type for a season.
    
    Args:
        season: season index
    
    Returns:
        str: voting-rule type
            - 'rank_v1': rank-based (S1-S2)
            - 'percentage': percentage-based (S3-S27)
            - 'rank_v2': rank-based + judge save (S28-S34)
    """
    if season <= 2:
        return 'rank_v1'
    elif season <= 27:
        return 'percentage'
    else:
        return 'rank_v2'


# Voting-rule descriptions
VOTING_RULE_DESC = {
    'rank_v1': 'Rank-based (S1-S2): judge rank + fan rank',
    'percentage': 'Percentage-based (S3-S27): judge score 50% + fan vote 50%',
    'rank_v2': 'Rank + judge save (S28-S34): ranks determine the bottom two, judges vote'
}

print("Voting-rule classification:")
print("=" * 60)
for rule, desc in VOTING_RULE_DESC.items():
    print(f"  {rule}: {desc}")

In [None]:
# ============================================================
# 5.2 Generate the season summary table
# ============================================================

def generate_season_summary(df: pd.DataFrame, active: Dict, elim: Dict) -> pd.DataFrame:
    """
    Generate a summary table with one row per season.
    
    Args:
        df: processed data DataFrame
        active: active contestant sets
        elim: elimination sets
    
    Returns:
        DataFrame with summary information for each season
    """
    summary_rows = []
    
    for season in sorted(df['season'].unique()):
        season_df = df[df['season'] == season]
        
        # Basic information
        n_contestants = len(season_df)
        max_weeks = len(active.get(season, {}))
        
        # Winner
        winner_row = season_df[season_df['placement'] == 1]
        winner = winner_row['celebrity_name'].values[0] if len(winner_row) > 0 else 'N/A'
        
        # Elimination counts
        season_elim = elim.get(season, {})
        total_elim = sum(len(e) for e in season_elim.values())
        
        # Voting rule
        voting_rule = get_voting_rule(season)
        
        summary_rows.append({
            'season': season,
            'n_contestants': n_contestants,
            'max_weeks': max_weeks,
            'total_eliminations': total_elim,
            'voting_rule': voting_rule,
            'winner': winner
        })
    
    return pd.DataFrame(summary_rows)


# Generate the season summary
print("Generating season summary table...")
season_summary = generate_season_summary(processed_df, active_sets, elimination_sets)
print(f"\nSeason summary ({len(season_summary)} seasons):")
display(season_summary)

In [None]:
# ============================================================
# 5.3 Statistics grouped by voting rule
# ============================================================

print("Statistics grouped by voting rule:")
print("=" * 60)

rule_stats = season_summary.groupby('voting_rule').agg({
    'season': 'count',
    'n_contestants': 'sum',
    'total_eliminations': 'sum'
}).rename(columns={
    'season': 'n_seasons',
    'n_contestants': 'total_contestants'
})

display(rule_stats)

In [None]:
# ============================================================
# 5.4 Visualization: contestants and weeks per season
# ============================================================

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: number of contestants per season
ax1 = axes[0]
colors = season_summary['voting_rule'].map({
    'rank_v1': '#1f77b4',
    'percentage': '#2ca02c', 
    'rank_v2': '#d62728'
})
ax1.bar(season_summary['season'], season_summary['n_contestants'], color=colors)
ax1.set_xlabel('Season')
ax1.set_ylabel('Number of Contestants')
ax1.set_title('Number of Contestants per Season')
ax1.axvline(x=2.5, color='gray', linestyle='--', alpha=0.5)
ax1.axvline(x=27.5, color='gray', linestyle='--', alpha=0.5)

# Plot 2: number of weeks per season
ax2 = axes[1]
ax2.bar(season_summary['season'], season_summary['max_weeks'], color=colors)
ax2.set_xlabel('Season')
ax2.set_ylabel('Number of Weeks')
ax2.set_title('Number of Weeks per Season')
ax2.axvline(x=2.5, color='gray', linestyle='--', alpha=0.5)
ax2.axvline(x=27.5, color='gray', linestyle='--', alpha=0.5)

# Add the legend
from matplotlib.patches import Patch
legend_elements = [
    Patch(facecolor='#1f77b4', label='Rank v1 (S1-S2)'),
    Patch(facecolor='#2ca02c', label='Percentage (S3-S27)'),
    Patch(facecolor='#d62728', label='Rank v2 (S28-S34)')
]
fig.legend(handles=legend_elements, loc='upper center', ncol=3, bbox_to_anchor=(0.5, 1.02))

plt.tight_layout()
plt.savefig('season_overview.png', dpi=150, bbox_inches='tight')
plt.show()
print("Figure saved: season_overview.png")

---
## 6. Saving and Exporting the Data

The processed data is saved in several formats for use by the downstream models.
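
One detail worth keeping in mind when reading the JSON files back: JSON object keys are always strings, so the season and week keys of the nested sets come back as `"1"`, `"2"`, and so on. The sketch below shows a round trip on a tiny made-up dictionary; `keys_to_int` is a hypothetical helper, not part of this notebook.

```python
import json
import tempfile
from pathlib import Path

# Tiny hypothetical stand-in for the nested {season: {week: [names]}} structure.
active_sets_demo = {1: {1: ["Ann", "Bob"], 2: ["Ann"]}}


def keys_to_int(nested):
    """Convert the string season/week keys of a loaded JSON object back to int."""
    return {
        int(season): {int(week): names for week, names in weeks.items()}
        for season, weeks in nested.items()
    }


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "active_sets_demo.json"
    # json.dumps writes the Python int keys as JSON strings.
    path.write_text(json.dumps(active_sets_demo), encoding="utf-8")
    loaded = json.loads(path.read_text(encoding="utf-8"))

restored = keys_to_int(loaded)
print(list(loaded.keys()), list(restored.keys()))
```

The pickle file written later in this section avoids the issue because it preserves key types, at the cost of being readable only from Python.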

In [None]:
# ============================================================
# 6.1 Create the output directory
# ============================================================

import json
import pickle

# Output directory
OUTPUT_DIR = Path("./processed_data")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

print(f"Output directory: {OUTPUT_DIR.absolute()}")

In [None]:
# ============================================================
# 6.2 Save the processed contestant data (CSV)
# ============================================================

# Save the full processed data
processed_df.to_csv(OUTPUT_DIR / 'processed_contestants.csv', index=False)
print(f"Saved: processed_contestants.csv ({len(processed_df)} rows)")

# Save the season summary
season_summary.to_csv(OUTPUT_DIR / 'season_summary.csv', index=False)
print(f"Saved: season_summary.csv ({len(season_summary)} rows)")

In [None]:
# ============================================================
# 6.3 Save the set data (JSON)
# ============================================================

# Convert dict keys to str: JSON keys are strings, and the season keys
# are NumPy integers, which json.dump does not accept as keys
def convert_keys_to_str(d: Dict) -> Dict:
    """Recursively convert the keys of a dict to str."""
    if isinstance(d, dict):
        return {str(k): convert_keys_to_str(v) for k, v in d.items()}
    return d

# Save the active contestant sets
with open(OUTPUT_DIR / 'active_sets.json', 'w', encoding='utf-8') as f:
    json.dump(convert_keys_to_str(active_sets), f, indent=2)
print("Saved: active_sets.json")

# Save the elimination sets
with open(OUTPUT_DIR / 'elimination_sets.json', 'w', encoding='utf-8') as f:
    json.dump(convert_keys_to_str(elimination_sets), f, indent=2)
print("Saved: elimination_sets.json")

In [None]:
# ============================================================
# 6.4 Save as pickle (preserves the original data types)
# ============================================================

# Bundle all the data
all_data = {
    'processed_df': processed_df,
    'active_sets': active_sets,
    'elimination_sets': elimination_sets,
    'season_summary': season_summary,
    'voting_rules': VOTING_RULE_DESC
}

# Save as pickle
with open(OUTPUT_DIR / 'all_data.pkl', 'wb') as f:
    pickle.dump(all_data, f)
print("Saved: all_data.pkl")

In [None]:
# ============================================================
# 6.5 Verify the saved data
# ============================================================

print("\nVerifying saved data:")
print("=" * 60)

# List the files in the output directory
for file in OUTPUT_DIR.iterdir():
    size_kb = file.stat().st_size / 1024
    print(f"  {file.name}: {size_kb:.2f} KB")

# Test loading the pickle
print("\nTesting pickle load...")
with open(OUTPUT_DIR / 'all_data.pkl', 'rb') as f:
    loaded_data = pickle.load(f)

print(f"  loaded_data.keys(): {list(loaded_data.keys())}")
print(f"  processed_df shape: {loaded_data['processed_df'].shape}")
print(f"  active_sets seasons: {list(loaded_data['active_sets'].keys())[:5]}...")

---
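## 7. Summary

### Completed work

1. **Data loading**: read the official DWTS dataset (421 contestants, 34 seasons)

2. **Data cleaning**:
   - Handled N/A values (no fourth judge)
   - Handled scores of 0 (already eliminated contestants)
   - Parsed the elimination week

3. **Feature computation**:
   - Computed the weekly total judge score $S_{i,s,t}$
   - Computed the weekly average judge score $\bar{S}_{i,s,t}$
   - Recorded the weekly number of valid judges $m_{i,s,t}$

4. **Set construction**:
   - Active contestant set $A_{s,t}$: contestants still competing each week
   - Elimination set $E_{s,t}$: contestants eliminated each week

5. **Voting-rule classification**:
   - rank_v1 (S1-S2): rank-based
   - percentage (S3-S27): percentage-based
   - rank_v2 (S28-S34): rank-based + judge save

### Output files

| File | Format | Contents |
|--------|------|------|
| processed_contestants.csv | CSV | Processed contestant data |
| season_summary.csv | CSV | Season summary |
| active_sets.json | JSON | Active contestant sets |
| elimination_sets.json | JSON | Elimination sets |
| all_data.pkl | Pickle | All data bundled together |

### Next step

With preprocessing complete, the next step is building the **Q1-1 fan vote estimation model**.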

In [None]:
# ============================================================
# Final statistics
# ============================================================

print("\n" + "=" * 60)
print("Data preprocessing complete - final statistics")
print("=" * 60)
print(f"\nTotal seasons: {len(season_summary)}")
print(f"Total contestants: {len(processed_df)}")
print(f"Total eliminations: {season_summary['total_eliminations'].sum()}")
print("\nVoting-rule distribution:")
for rule in ['rank_v1', 'percentage', 'rank_v2']:
    count = (season_summary['voting_rule'] == rule).sum()
    print(f"  {rule}: {count} seasons")
print(f"\nOutput files saved in: {OUTPUT_DIR.absolute()}")