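# Problem 3: A Prediction Model for Future Result Distributions

## Task Requirements
1. Build a model that predicts the result distribution (percentage solved in 1 to 6 tries, plus X) for the target word on a future date
2. Analyze the sources of model uncertainty
3. Give a concrete prediction for the target word "EERIE" on March 1, 2023
4. Assess the confidence in the model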

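## Modeling Approach
1. **Feature engineering**: extract word-attribute features (vowels, repeated letters, letter frequencies, etc.)
2. **Multi-output regression**: predict all 7 percentages at once (try_1 to try_6, try_x)
3. **Model selection**: compare several models (ridge regression, random forest, gradient boosting)
4. **Uncertainty quantification**: estimate intervals for the prediction with the bootstrap
5. **Case prediction**: predict the distribution for EERIE and analyze the confidence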
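
The seven targets describe one distribution, so a prediction should be non-negative and sum to roughly 100%. Independently fitted regressors do not guarantee this. The following is a minimal sketch of a post-processing step, assuming simple clipping and rescaling is acceptable; `normalize_distribution` is a hypothetical helper and is not part of this notebook.

```python
import numpy as np

def normalize_distribution(pred, total=100.0):
    """Clip negative values and rescale each row so it sums to `total`.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    pred = np.atleast_2d(np.asarray(pred, dtype=float))
    clipped = np.clip(pred, 0.0, None)
    row_sums = clipped.sum(axis=1, keepdims=True)
    # Avoid dividing by zero; all-zero rows fall back to a uniform distribution.
    safe = np.where(row_sums > 0, row_sums, 1.0)
    return np.where(row_sums > 0, clipped / safe * total, total / clipped.shape[1])

# Two made-up raw predictions: one sums to 102, one contains a negative value.
raw = np.array([[0.5, 6.0, 22.0, 34.0, 24.0, 12.0, 3.5],
                [-1.0, 5.0, 20.0, 30.0, 25.0, 15.0, 6.0]])
normalized = normalize_distribution(raw)
print(normalized.round(2))
print(normalized.sum(axis=1))
```

Rescaling changes every component slightly, so it is worth reporting both the raw and the normalized values.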

---
## 1. Data Loading and Preparation

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
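import os
import warnings
warnings.filterwarnings('ignore')

# Configuration
# Apply the seaborn theme first: sns.set_theme resets font.sans-serif,
# so the font list must be set afterwards to take effect.
sns.set_theme(style='whitegrid')
plt.rcParams['font.sans-serif'] = ['Arial Unicode MS', 'SimHei', 'DejaVu Sans']
plt.rcParams['axes.unicode_minus'] = False

# Create the output directory that the plt.savefig calls below write to
os.makedirs('figures', exist_ok=True)

# Standard figure sizes and color palette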
FIGSIZE_WIDE = (12, 6)
FIGSIZE_NORMAL = (10, 6)
FIGSIZE_SQUARE = (8, 8)
COLORS = {
    'primary': '#4682B4',
    'secondary': '#FF7F50',
    'accent': '#228B22',
    'neutral': '#708090'
}

print('Libraries imported')

In [None]:
# Load the preprocessed data
df = pd.read_csv('../数据预处理/data_processed.csv')
df['date'] = pd.to_datetime(df['date'])

print(f'Data loaded: {df.shape}')
print(f'Date range: {df["date"].min()} ~ {df["date"].max()}')

In [None]:
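# Define the variables
# Targets: the 7 result-distribution percentages
target_cols = ['try_1', 'try_2', 'try_3', 'try_4', 'try_5', 'try_6', 'try_x']

# Word-attribute features
feature_cols = [
    'num_vowels',           # number of vowels
    'vowel_ratio',          # share of letters that are vowels
    'num_unique_letters',   # number of distinct letters
    'num_repeated_letters', # number of repeated letters
    'has_repeated',         # whether any letter repeats
    'avg_letter_freq',      # mean letter frequency
    'min_letter_freq',      # minimum letter frequency
    'max_letter_freq',      # maximum letter frequency
    'first_letter_freq',    # frequency of the first letter
    'last_letter_freq'      # frequency of the last letter
]

print(f'Targets ({len(target_cols)}): {target_cols}')
print(f'Features ({len(feature_cols)})')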

---
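## 2. Exploratory Analysis

### 2.1 Overview of the Result Distribution

**Figure 1 note**: distribution of the percentage for each number of tries, showing how the target variables are distributed.

In [None]:
# Summary statistics of the result distribution
print('Summary statistics of the percentage for each number of tries:')
print(df[target_cols].describe().round(2))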

In [None]:
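# Figure 1: box plot of the result distribution
fig, ax = plt.subplots(figsize=FIGSIZE_WIDE)

# Tick labels are set separately because the boxplot `labels` keyword was
# renamed to `tick_labels` in Matplotlib 3.9.
bp = ax.boxplot([df[col] for col in target_cols], patch_artist=True)
ax.set_xticklabels(['1 try', '2 tries', '3 tries',
                    '4 tries', '5 tries', '6 tries', 'X (fail)'])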

colors_box = ['#2ecc71', '#3498db', '#9b59b6', '#f39c12', '#e74c3c', '#c0392b', '#7f8c8d']
for patch, color in zip(bp['boxes'], colors_box):
    patch.set_facecolor(color)
    patch.set_alpha(0.7)

ax.set_xlabel('Number of Tries', fontsize=12)
ax.set_ylabel('Percentage (%)', fontsize=12)

# Add the mean markers
means = [df[col].mean() for col in target_cols]
ax.scatter(range(1, 8), means, color='red', s=80, zorder=5, label='Mean', marker='D')
ax.legend()

plt.tight_layout()
plt.savefig('figures/fig1_distribution_boxplot.pdf', bbox_inches='tight')
plt.show()
print('Figure 1 saved: figures/fig1_distribution_boxplot.pdf')

In [None]:
# Figure 2: bar chart of the average result distribution
fig, ax = plt.subplots(figsize=FIGSIZE_NORMAL)

mean_dist = df[target_cols].mean()
bars = ax.bar(['1', '2', '3', '4', '5', '6', 'X'], mean_dist, 
              color=colors_box, edgecolor='black', alpha=0.8)

ax.set_xlabel('Number of Tries', fontsize=12)
ax.set_ylabel('Average Percentage (%)', fontsize=12)

# Add value labels
for bar, val in zip(bars, mean_dist):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.5,
            f'{val:.1f}%', ha='center', fontsize=10)

plt.tight_layout()
plt.savefig('figures/fig2_average_distribution.pdf', bbox_inches='tight')
plt.show()
print('Figure 2 saved: figures/fig2_average_distribution.pdf')

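### 2.2 Relationship Between Word Attributes and the Result Distribution

In [None]:
# Correlations between the features and each target
corr_matrix = df[feature_cols + target_cols].corr()

# Keep only the feature-by-target block
feature_target_corr = corr_matrix.loc[feature_cols, target_cols]

# Figure 3: feature-target correlation heatmap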
fig, ax = plt.subplots(figsize=(12, 8))
sns.heatmap(feature_target_corr, annot=True, cmap='RdBu_r', center=0, 
            fmt='.2f', linewidths=0.5, ax=ax, cbar_kws={'shrink': 0.8})
ax.set_xticklabels(['1 try', '2 tries', '3 tries', '4 tries', '5 tries', '6 tries', 'X'])
plt.tight_layout()
plt.savefig('figures/fig3_feature_target_correlation.pdf', bbox_inches='tight')
plt.show()
print('Figure 3 saved: figures/fig3_feature_target_correlation.pdf')

---
## 3. Model Construction

In [None]:
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Prepare the data
X = df[feature_cols].values
y = df[target_cols].values

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize (fit on the training set only)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f'Training set: {X_train.shape[0]} samples')
print(f'Test set: {X_test.shape[0]} samples')

### 3.1 Model Comparison

In [None]:
# Define the models
models = {
    'Ridge Regression': MultiOutputRegressor(Ridge(alpha=1.0)),
    'Random Forest': RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42),
    'Gradient Boosting': MultiOutputRegressor(GradientBoostingRegressor(n_estimators=100, 
                                              max_depth=5, random_state=42))
}

# Evaluate the models
results = []
for name, model in models.items():
    print(f'\nTraining {name}...')
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    
    # MAE for each target
    mae_per_target = [mean_absolute_error(y_test[:, i], y_pred[:, i]) for i in range(7)]
    avg_mae = np.mean(mae_per_target)
    
    # R2 for each target, averaged across the 7 targets
    r2_per_target = [r2_score(y_test[:, i], y_pred[:, i]) for i in range(7)]
    avg_r2 = np.mean(r2_per_target)
    
    results.append({
        'Model': name,
        'Avg MAE': avg_mae,
        'Avg R2': avg_r2,
        'MAE_try_3': mae_per_target[2],
        'MAE_try_4': mae_per_target[3],
        'MAE_try_x': mae_per_target[6]
    })
    print(f'  Avg MAE: {avg_mae:.2f}%, Avg R2: {avg_r2:.4f}')

results_df = pd.DataFrame(results)
print('\nModel comparison results:')
print(results_df.to_string(index=False))
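
A single 80/20 split on a dataset of this size can give a noisy ranking, and the notebook imports `KFold` without using it. The following is a minimal sketch of a K-fold comparison, assuming 5 folds and using synthetic arrays (`X_demo`, `y_demo`) as stand-ins for the notebook's `X` and `y`; the numbers it prints say nothing about the real data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

# Synthetic stand-in for the notebook's X (10 features) and y (7 targets).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 10))
y_demo = X_demo[:, :7] * 2.0 + rng.normal(scale=0.5, size=(120, 7))

# Both estimators accept a 2-D target directly.
candidates = {
    'Ridge': Ridge(alpha=1.0),
    'Random Forest': RandomForestRegressor(n_estimators=50, random_state=0),
}

kf = KFold(n_splits=5, shuffle=True, random_state=0)
cv_mae = {}
for name, model in candidates.items():
    fold_maes = []
    for train_idx, val_idx in kf.split(X_demo):
        model.fit(X_demo[train_idx], y_demo[train_idx])
        pred = model.predict(X_demo[val_idx])
        # Averaged over all 7 targets (the default for multi-output input).
        fold_maes.append(mean_absolute_error(y_demo[val_idx], pred))
    cv_mae[name] = (np.mean(fold_maes), np.std(fold_maes))
    print(f'{name}: MAE = {cv_mae[name][0]:.3f} +/- {cv_mae[name][1]:.3f}')
```

The fold-to-fold standard deviation gives a rough sense of whether a difference in MAE between two models is larger than the split noise.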

In [None]:
# Refit the selected model (random forest). The choice is hard-coded, so confirm it
# against the comparison table above; the hyperparameters also differ from section 3.1.
best_model = RandomForestRegressor(n_estimators=200, max_depth=10, 
                                   min_samples_split=5, random_state=42)
best_model.fit(X_train_scaled, y_train)

# Predict on the test set
y_pred_best = best_model.predict(X_test_scaled)

print('Best model (Random Forest) detailed evaluation:')
for i, col in enumerate(target_cols):
    mae = mean_absolute_error(y_test[:, i], y_pred_best[:, i])
    r2 = r2_score(y_test[:, i], y_pred_best[:, i])
    print(f'  {col}: MAE={mae:.2f}%, R2={r2:.4f}')

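### 3.2 Model Evaluation Visualization

**Figure 4 note**: MAE of each model on each target.

In [None]:
# Per-target MAE for each model (this refits the models defined in section 3.1)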
model_mae_detail = {}
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    mae_per_target = [mean_absolute_error(y_test[:, i], y_pred[:, i]) for i in range(7)]
    model_mae_detail[name] = mae_per_target

# Figure 4: model MAE comparison
fig, ax = plt.subplots(figsize=FIGSIZE_WIDE)

x = np.arange(7)
width = 0.25

for i, (name, maes) in enumerate(model_mae_detail.items()):
    ax.bar(x + i*width, maes, width, label=name, alpha=0.8)

ax.set_xlabel('Target Variable', fontsize=12)
ax.set_ylabel('Mean Absolute Error (%)', fontsize=12)
ax.set_xticks(x + width)
ax.set_xticklabels(['1 try', '2 tries', '3 tries', '4 tries', '5 tries', '6 tries', 'X'])
ax.legend()

plt.tight_layout()
plt.savefig('figures/fig4_model_comparison.pdf', bbox_inches='tight')
plt.show()
print('Figure 4 saved: figures/fig4_model_comparison.pdf')

### 3.3 Feature Importance

**Figure 5 note**: feature importance of the random forest model.

In [None]:
# Feature importance
feature_importance = pd.DataFrame({
    'Feature': feature_cols,
    'Importance': best_model.feature_importances_
}).sort_values('Importance', ascending=False)

print('Features ranked by importance:')
print(feature_importance.to_string(index=False))

# Figure 5: feature importance
fig, ax = plt.subplots(figsize=FIGSIZE_NORMAL)

fi_sorted = feature_importance.sort_values('Importance', ascending=True)
bars = ax.barh(fi_sorted['Feature'], fi_sorted['Importance'], 
               color=COLORS['primary'], edgecolor='black', alpha=0.8)

ax.set_xlabel('Feature Importance', fontsize=12)
ax.set_ylabel('Feature', fontsize=12)

for bar, val in zip(bars, fi_sorted['Importance']):
    ax.text(bar.get_width() + 0.005, bar.get_y() + bar.get_height()/2,
            f'{val:.3f}', va='center', fontsize=9)

plt.tight_layout()
plt.savefig('figures/fig5_feature_importance.pdf', bbox_inches='tight')
plt.show()
print('Figure 5 saved: figures/fig5_feature_importance.pdf')

---
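## 4. EERIE Prediction

In [None]:
# Compute the features for the word EERIE
# Letter frequency table (%): these values match the commonly cited letter frequencies
# of general English text, not frequencies computed from five-letter words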
letter_freq = {
    'E': 12.70, 'A': 8.17, 'R': 5.99, 'I': 6.97, 'O': 7.51, 'T': 9.06, 'N': 6.75,
    'S': 6.33, 'L': 4.03, 'C': 2.78, 'U': 2.76, 'D': 4.25, 'P': 1.93, 'M': 2.41,
    'H': 6.09, 'G': 2.02, 'B': 1.49, 'F': 2.23, 'Y': 1.97, 'W': 2.36, 'K': 0.77,
    'V': 0.98, 'X': 0.15, 'Z': 0.07, 'J': 0.15, 'Q': 0.10
}

def extract_word_features(word):
    """Extract the attribute features of a word."""
    word = word.upper()
    letters = list(word)
    unique_letters = set(letters)
    
    # Vowel counts
    vowels = set('AEIOU')
    num_vowels = sum(1 for l in letters if l in vowels)
    vowel_ratio = num_vowels / len(letters)
    
    # Repeated letters (occurrences beyond the first of each letter)
    num_unique = len(unique_letters)
    num_repeated = len(letters) - num_unique
    has_repeated = 1 if num_repeated > 0 else 0
    
    # Letter frequencies
    freqs = [letter_freq.get(l, 0) for l in letters]
    avg_freq = np.mean(freqs)
    min_freq = np.min(freqs)
    max_freq = np.max(freqs)
    first_freq = letter_freq.get(letters[0], 0)
    last_freq = letter_freq.get(letters[-1], 0)
    
    return {
        'num_vowels': num_vowels,
        'vowel_ratio': vowel_ratio,
        'num_unique_letters': num_unique,
        'num_repeated_letters': num_repeated,
        'has_repeated': has_repeated,
        'avg_letter_freq': avg_freq,
        'min_letter_freq': min_freq,
        'max_letter_freq': max_freq,
        'first_letter_freq': first_freq,
        'last_letter_freq': last_freq
    }

# Extract the EERIE features
eerie_features = extract_word_features('EERIE')

print('EERIE word features:')
for k, v in eerie_features.items():
    print(f'  {k}: {v}')
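
In the function above, `num_repeated_letters` is `len(letters) - num_unique`, so it counts surplus occurrences rather than the number of distinct letters that repeat. Whether the preprocessed CSV was built with the same definition cannot be confirmed from this notebook, so it is worth checking before predicting. A small sketch contrasting the two readings follows; `repeat_counts` is a hypothetical helper introduced only for illustration.

```python
from collections import Counter

def repeat_counts(word):
    """Return (surplus occurrences, number of distinct letters that repeat)."""
    counts = Counter(word.upper())
    surplus = sum(c - 1 for c in counts.values())        # equals len(word) - unique letters
    letters_that_repeat = sum(1 for c in counts.values() if c > 1)
    return surplus, letters_that_repeat

for w in ['EERIE', 'MAMMA', 'CRANE']:
    print(w, repeat_counts(w))
# EERIE (2, 1)
# MAMMA (3, 2)
# CRANE (0, 0)
```

For EERIE the two definitions give 2 and 1, so a mismatch with the training data would shift this feature for exactly the kind of word being predicted.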

In [None]:
# Build the EERIE feature vector
eerie_X = np.array([[eerie_features[col] for col in feature_cols]])
eerie_X_scaled = scaler.transform(eerie_X)

# Point prediction
eerie_pred = best_model.predict(eerie_X_scaled)[0]

print('\nEERIE result distribution (point prediction):')
for i, col in enumerate(target_cols):
    print(f'  {col}: {eerie_pred[i]:.1f}%')
print(f'  Sum: {sum(eerie_pred):.1f}%')

### 4.1 Bootstrap Uncertainty Estimation
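
The resampling loop in this section amounts to a percentile bootstrap. The notation below ($D$ for the full dataset, $f_{D_b}$ for a forest fitted on resample $D_b$, $x^*$ for the EERIE feature vector, $B$ for `n_bootstrap`) is introduced here for illustration and does not appear in the notebook.

```latex
\begin{aligned}
D_b &= \text{sample of } |D| \text{ rows drawn from } D \text{ with replacement}, \qquad b = 1, \dots, B \\
\hat{y}^{(b)}_k(x^*) &= f_{D_b}(x^*)_k, \qquad k \in \{\text{try\_1}, \dots, \text{try\_6}, \text{try\_x}\} \\
\text{interval}_k &= \Big[\, Q_{0.025}\big(\{\hat{y}^{(b)}_k\}_{b=1}^{B}\big),\; Q_{0.975}\big(\{\hat{y}^{(b)}_k\}_{b=1}^{B}\big) \Big]
\end{aligned}
```

Here $Q_p$ is the empirical $p$-quantile. The interval captures how much the fitted prediction moves when the training sample and the forest's random seed change; it does not include the noise of an individual day's observed percentages.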

In [None]:
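# Bootstrap interval for the EERIE prediction
n_bootstrap = 500
bootstrap_preds = []

np.random.seed(42)
# Use a separate scaler fitted on the full data. Refitting the global `scaler`
# here would silently change the scaling that `best_model` was trained with.
scaler_full = StandardScaler().fit(X)
X_full_scaled = scaler_full.transform(X)
eerie_X_boot = scaler_full.transform(eerie_X)  # constant, so computed once

print(f'Bootstrap resampling ({n_bootstrap} iterations)...')
for i in range(n_bootstrap):
    # Sample with replacement
    idx = np.random.choice(len(X), size=len(X), replace=True)
    X_boot = X_full_scaled[idx]
    y_boot = y[idx]
    
    # Fit the model on the resample
    model_boot = RandomForestRegressor(n_estimators=100, max_depth=10, 
                                       random_state=i, n_jobs=-1)
    model_boot.fit(X_boot, y_boot)
    
    # Predict EERIE
    pred = model_boot.predict(eerie_X_boot)[0]
    bootstrap_preds.append(pred)
    
    if (i + 1) % 100 == 0:
        print(f'  Completed {i+1}/{n_bootstrap}')

bootstrap_preds = np.array(bootstrap_preds)
print('Bootstrap complete')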

In [None]:
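# Percentile bootstrap interval. It reflects how much the fitted prediction varies
# across resamples, not the day-to-day noise in observed percentages, so it is
# narrower than a true prediction interval.
ci_lower = np.percentile(bootstrap_preds, 2.5, axis=0)
ci_upper = np.percentile(bootstrap_preds, 97.5, axis=0)
pred_mean = np.mean(bootstrap_preds, axis=0)
pred_std = np.std(bootstrap_preds, axis=0)

print('\nEERIE result distribution prediction (with 95% confidence interval):')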
print('-' * 60)
print(f'{"Target":<10} {"Mean":>8} {"Std":>8} {"95% CI":>20}')
print('-' * 60)
for i, col in enumerate(target_cols):
    print(f'{col:<10} {pred_mean[i]:>7.1f}% {pred_std[i]:>7.2f}% [{ci_lower[i]:>5.1f}%, {ci_upper[i]:>5.1f}%]')
print('-' * 60)

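### 4.2 EERIE Prediction Visualization

**Figure 6 note**: predicted result distribution for EERIE compared with the historical average.

In [None]:
# Figure 6: EERIE prediction vs. historical average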
fig, ax = plt.subplots(figsize=FIGSIZE_WIDE)

x = np.arange(7)
width = 0.35

# Historical average
hist_mean = df[target_cols].mean().values
bars1 = ax.bar(x - width/2, hist_mean, width, label='Historical Average', 
               color=COLORS['neutral'], edgecolor='black', alpha=0.7)

# EERIE prediction
bars2 = ax.bar(x + width/2, pred_mean, width, label='EERIE Prediction', 
               color=COLORS['primary'], edgecolor='black', alpha=0.8,
               yerr=[pred_mean - ci_lower, ci_upper - pred_mean], capsize=4)

ax.set_xlabel('Number of Tries', fontsize=12)
ax.set_ylabel('Percentage (%)', fontsize=12)
ax.set_xticks(x)
ax.set_xticklabels(['1', '2', '3', '4', '5', '6', 'X'])
ax.legend()

# Add value labels
for bar, val in zip(bars2, pred_mean):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 2,
            f'{val:.1f}%', ha='center', fontsize=9)

plt.tight_layout()
plt.savefig('figures/fig6_eerie_prediction.pdf', bbox_inches='tight')
plt.show()
print('Figure 6 saved: figures/fig6_eerie_prediction.pdf')

**Figure 7 note**: distribution of the bootstrap predictions, showing the model uncertainty.

In [None]:
# Figure 7: bootstrap distributions
fig, axes = plt.subplots(2, 4, figsize=(16, 8))
axes = axes.flatten()

for i, col in enumerate(target_cols):
    ax = axes[i]
    ax.hist(bootstrap_preds[:, i], bins=30, color=COLORS['primary'], 
            edgecolor='black', alpha=0.7)
    ax.axvline(x=pred_mean[i], color='red', linestyle='--', linewidth=2, label='Mean')
    ax.axvline(x=ci_lower[i], color='green', linestyle=':', linewidth=2, label='95% CI')
    ax.axvline(x=ci_upper[i], color='green', linestyle=':', linewidth=2)
    ax.set_xlabel(f'{col} (%)', fontsize=10)
    ax.set_ylabel('Frequency', fontsize=10)
    ax.legend(fontsize=8)

# Hide the last, unused subplot
axes[7].axis('off')

plt.tight_layout()
plt.savefig('figures/fig7_bootstrap_distribution.pdf', bbox_inches='tight')
plt.show()
print('Figure 7 saved: figures/fig7_bootstrap_distribution.pdf')

---
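## 5. Uncertainty Analysis

### 5.1 Sources of Uncertainty

The uncertainty in the model's predictions comes mainly from the following sources:

1. **Data randomness (aleatoric uncertainty)**
   - The Twitter sample may not be representative of all Wordle players
   - The player population may differ from day to day

2. **Model uncertainty (epistemic uncertainty)**
   - The word-attribute features may not fully capture word difficulty
   - The model structure and hyperparameter choices affect the result

3. **Incomplete feature coverage**
   - Factors not considered: word familiarity, semantic associations, player strategy, etc.
   - EERIE is an unusual word, and the historical data contain few similar words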
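
One way to put a number on "few similar words" is to measure how far the target word lies from its nearest historical neighbours in standardized feature space. The following is a minimal sketch under stated assumptions: it uses synthetic features (`X_hist`) rather than the notebook's data, 5 neighbours, and Euclidean distance; `mean_knn_distance` is a hypothetical helper.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for the historical feature matrix (rows = words).
X_hist = rng.normal(size=(300, 4))
x_typical = np.zeros((1, 4))       # near the centre of the data
x_unusual = np.full((1, 4), 4.0)   # far from every historical word

scaler_demo = StandardScaler().fit(X_hist)
X_hist_scaled = scaler_demo.transform(X_hist)
nn = NearestNeighbors(n_neighbors=5).fit(X_hist_scaled)

def mean_knn_distance(x):
    """Mean distance from x to its 5 nearest historical words."""
    dist, _ = nn.kneighbors(scaler_demo.transform(x))
    return float(dist.mean())

# Reference: each historical word's distance to its own neighbours.
# Ask for 6 and drop the first column, which is the word itself at distance 0.
ref_dist, _ = nn.kneighbors(X_hist_scaled, n_neighbors=6)
ref = ref_dist[:, 1:].mean(axis=1)

for label, x in [('typical', x_typical), ('unusual', x_unusual)]:
    d = mean_knn_distance(x)
    pct = (ref < d).mean() * 100
    print(f'{label}: mean 5-NN distance = {d:.2f} (larger than {pct:.0f}% of historical words)')
```

A target word whose distance exceeds that of nearly all historical words is an extrapolation case, where a tree-based model can only return values close to those it has already seen.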

In [None]:
# Uncertainty quantification
print('Uncertainty quantification:')
print('='*60)

# Coefficient of variation (CV)
cv = pred_std / pred_mean * 100
print('\nCoefficient of variation (CV) per target:')
for i, col in enumerate(target_cols):
    print(f'  {col}: {cv[i]:.1f}%')

# Confidence interval width
ci_width = ci_upper - ci_lower
print('\nWidth of the 95% confidence interval:')
for i, col in enumerate(target_cols):
    print(f'  {col}: {ci_width[i]:.1f}%')

### 5.2 Comparison with Historical Data

In [None]:
# Find historical words with features similar to EERIE
# EERIE itself has a vowel ratio of 0.8 and 2 repeated letters; the filter is looser
similar_words = df[(df['vowel_ratio'] >= 0.6) & (df['num_repeated_letters'] >= 1)]

print(f'Number of historical words similar to EERIE: {len(similar_words)}')
if len(similar_words) > 0:
    print('\nExamples of similar words:')
    print(similar_words[['word', 'num_vowels', 'num_repeated_letters', 'avg_tries', 'try_x']].head(10))
    
    print('\nAverage result distribution of the similar words:')
    similar_dist = similar_words[target_cols].mean()
    for col, val in similar_dist.items():
        print(f'  {col}: {val:.1f}%')

---
## 6. Results Summary

In [None]:
# Save the prediction results
eerie_results = pd.DataFrame({
    'Target': target_cols,
    'Point_Prediction': eerie_pred,
    'Bootstrap_Mean': pred_mean,
    'Bootstrap_Std': pred_std,
    'CI_Lower_95': ci_lower,
    'CI_Upper_95': ci_upper,
    'CI_Width': ci_width
})

eerie_results.to_csv('eerie_prediction.csv', index=False)
print('EERIE prediction saved to eerie_prediction.csv')
print(eerie_results.to_string(index=False))

In [None]:
# Save the model evaluation results
results_df.to_csv('model_comparison.csv', index=False)
feature_importance.to_csv('feature_importance.csv', index=False)

# Summary information
summary = {
    'task': 'Result Distribution Prediction',
    'target_word': 'EERIE',
    'target_date': '2023-03-01',
    'best_model': 'Random Forest',
    'n_bootstrap': n_bootstrap,
    'n_training_samples': len(df),
    'n_features': len(feature_cols)
}

pd.DataFrame([summary]).to_csv('results_summary.csv', index=False)
print('\nAll results saved')

In [None]:
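# Final conclusions
print('\n' + '='*70)
print('Problem 3: Final Conclusions')
print('='*70)

print('''
[Modeling method]
A random forest multi-output regression model predicts the 7 result-distribution
percentages from word-attribute features.

[EERIE prediction] (March 1, 2023)''')

print(f'  1 try:    {pred_mean[0]:.1f}% [{ci_lower[0]:.1f}%, {ci_upper[0]:.1f}%]')
print(f'  2 tries:  {pred_mean[1]:.1f}% [{ci_lower[1]:.1f}%, {ci_upper[1]:.1f}%]')
print(f'  3 tries:  {pred_mean[2]:.1f}% [{ci_lower[2]:.1f}%, {ci_upper[2]:.1f}%]')
print(f'  4 tries:  {pred_mean[3]:.1f}% [{ci_lower[3]:.1f}%, {ci_upper[3]:.1f}%]')
print(f'  5 tries:  {pred_mean[4]:.1f}% [{ci_lower[4]:.1f}%, {ci_upper[4]:.1f}%]')
print(f'  6 tries:  {pred_mean[5]:.1f}% [{ci_lower[5]:.1f}%, {ci_upper[5]:.1f}%]')
print(f'  Fail (X): {pred_mean[6]:.1f}% [{ci_lower[6]:.1f}%, {ci_upper[6]:.1f}%]')

print('''
[Sources of uncertainty]
1. Data randomness: the Twitter sample has limited representativeness
2. Model uncertainty: word attributes do not fully explain the result distribution
3. Incomplete feature coverage: EERIE is an unusual word with few historical references

[Model confidence assessment]
- The bootstrap provides a 95% confidence interval
- The prediction uncertainty is concentrated mainly in try_3 and try_4 (the most frequent bins)
- EERIE's repeated vowels make it harder than the average word
''')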
print('='*70)

---
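## Appendix: List of Figures

| No. | File name | Content | Suggested placement |
|------|--------|------|-------------|
| Figure 1 | fig1_distribution_boxplot.pdf | Box plot of the result distribution | 4.3.1 Data Description |
| Figure 2 | fig2_average_distribution.pdf | Bar chart of the average result distribution | 4.3.1 Data Description |
| Figure 3 | fig3_feature_target_correlation.pdf | Feature-target correlation heatmap | 4.3.2 Feature Analysis |
| Figure 4 | fig4_model_comparison.pdf | Model MAE comparison | 4.3.3 Model Selection |
| Figure 5 | fig5_feature_importance.pdf | Feature importance | 4.3.3 Feature Importance |
| Figure 6 | fig6_eerie_prediction.pdf | EERIE prediction vs. historical average | 4.3.4 Prediction Results |
| Figure 7 | fig7_bootstrap_distribution.pdf | Bootstrap prediction distributions | 4.3.5 Uncertainty Analysis |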