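# Problem 4: Word Difficulty Classification Model

## Task Requirements
1. Build a model that **classifies the difficulty** of a target word (Easy/Medium/Hard)
2. Identify the **word attributes** associated with difficulty
3. Use the model to assess the difficulty of "EERIE"
4. Discuss the **model's accuracy**

## Modeling Approach
1. **Difficulty definition**: difficulty levels are defined by thresholds on the average number of guesses; the failure rate is reported per level as a descriptive check
2. **Feature engineering**: extract word-attribute features
3. **Classification models**: compare logistic regression, random forest, gradient boosting, and SVM
4. **Model evaluation**: accuracy, confusion matrix, ROC curve, F1 score
5. **EERIE prediction**: predict its difficulty and analyze the confidence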

---
## 1. Data Loading and Preparation

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

# Configuration
plt.rcParams['font.sans-serif'] = ['Arial Unicode MS', 'SimHei', 'DejaVu Sans']
plt.rcParams['axes.unicode_minus'] = False
sns.set_theme(style='whitegrid')

# Standard figure sizes and color palette
FIGSIZE_WIDE = (12, 6)
FIGSIZE_NORMAL = (10, 6)
FIGSIZE_SQUARE = (8, 8)
COLORS = {
    'primary': '#4682B4',
    'secondary': '#FF7F50',
    'accent': '#228B22',
    'neutral': '#708090'
}
DIFFICULTY_COLORS = {'Easy': '#2ecc71', 'Medium': '#f39c12', 'Hard': '#e74c3c'}

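# Make sure the output directory used by plt.savefig exists
import os
os.makedirs('figures', exist_ok=True)

print('Libraries imported')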

In [None]:
# Load the preprocessed data
df = pd.read_csv('../数据预处理/data_processed.csv')
df['date'] = pd.to_datetime(df['date'])

print(f'Data loaded: {df.shape}')
print('\nDifficulty distribution:')
print(df['difficulty'].value_counts())

In [None]:
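# Define variables
# Target variable: difficulty level
target = 'difficulty'

# Word-attribute features
feature_cols = [
    'num_vowels',           # number of vowels
    'vowel_ratio',          # share of letters that are vowels
    'num_unique_letters',   # number of distinct letters
    'num_repeated_letters', # number of repeated letters
    'has_repeated',         # whether any letter repeats
    'avg_letter_freq',      # mean letter frequency
    'min_letter_freq',      # lowest letter frequency
    'max_letter_freq',      # highest letter frequency
    'first_letter_freq',    # frequency of the first letter
    'last_letter_freq'      # frequency of the last letter
]

print(f'Target variable: {target}')
print(f'Feature variables ({len(feature_cols)})')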

---
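## 2. Difficulty Definition and Analysis

### 2.1 Difficulty Criteria

The levels follow the definition used in the preprocessing stage:
- **Easy**: average number of guesses < 4.0
- **Medium**: 4.0 <= average number of guesses < 4.5
- **Hard**: average number of guesses >= 4.5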
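
The thresholds are half-open intervals, so a word averaging exactly 4.0 guesses is Medium and one averaging exactly 4.5 is Hard. The labels themselves come from the preprocessing stage, which is not shown here; the sketch below is only one way such labels could be produced with `pd.cut`. The helper name `label_difficulty` and the demo values are hypothetical.

```python
import pandas as pd

def label_difficulty(avg_tries: pd.Series) -> pd.Series:
    """Map average guess counts to Easy/Medium/Hard (hypothetical helper)."""
    bins = [float('-inf'), 4.0, 4.5, float('inf')]
    # right=False makes each bin closed on the left: [low, high)
    return pd.cut(avg_tries, bins=bins, labels=['Easy', 'Medium', 'Hard'], right=False)

# Made-up values that sit on and around the two thresholds
demo = pd.Series([3.6, 4.0, 4.49, 4.5, 5.2])
demo_labels = label_difficulty(demo).astype(str).tolist()
print(demo_labels)
```

Using `right=False` is what keeps the boundary values on the harder side, matching the `<=` and `>=` in the list above.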

In [None]:
# Per-level statistics
print('Statistics by difficulty level:')
difficulty_stats = df.groupby('difficulty').agg({
    'avg_tries': ['mean', 'std', 'min', 'max'],
    'fail_rate': ['mean', 'std'],
    'word': 'count'
}).round(2)

print(difficulty_stats)

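### 2.2 Visualizing the Difficulty Distribution

**Figure 1**: Distribution of the three difficulty levels: class counts, average number of guesses, and failure rate.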

In [None]:
# Figure 1: difficulty distribution
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# 1.1 Class counts
difficulty_counts = df['difficulty'].value_counts()
colors = [DIFFICULTY_COLORS[d] for d in difficulty_counts.index]
axes[0].bar(difficulty_counts.index, difficulty_counts.values, color=colors, 
            edgecolor='black', alpha=0.8)
axes[0].set_xlabel('Difficulty Level', fontsize=12)
axes[0].set_ylabel('Count', fontsize=12)
for i, (d, c) in enumerate(zip(difficulty_counts.index, difficulty_counts.values)):
    axes[0].text(i, c + 2, f'{c}\n({c/len(df)*100:.1f}%)', ha='center', fontsize=10)

# 1.2 Distribution of the average number of guesses
for difficulty in ['Easy', 'Medium', 'Hard']:
    subset = df[df['difficulty'] == difficulty]['avg_tries']
    axes[1].hist(subset, bins=20, alpha=0.6, label=difficulty, 
                 color=DIFFICULTY_COLORS[difficulty], edgecolor='black')
axes[1].axvline(x=4.0, color='black', linestyle='--', linewidth=2, label='Easy/Medium')
axes[1].axvline(x=4.5, color='black', linestyle=':', linewidth=2, label='Medium/Hard')
axes[1].set_xlabel('Average Tries', fontsize=12)
axes[1].set_ylabel('Frequency', fontsize=12)
axes[1].legend()

# 1.3 Distribution of the failure rate
for difficulty in ['Easy', 'Medium', 'Hard']:
    subset = df[df['difficulty'] == difficulty]['fail_rate']
    axes[2].hist(subset, bins=20, alpha=0.6, label=difficulty,
                 color=DIFFICULTY_COLORS[difficulty], edgecolor='black')
axes[2].set_xlabel('Fail Rate (%)', fontsize=12)
axes[2].set_ylabel('Frequency', fontsize=12)
axes[2].legend()

plt.tight_layout()
plt.savefig('figures/fig1_difficulty_distribution.pdf', bbox_inches='tight')
plt.show()
print('Figure 1 saved: figures/fig1_difficulty_distribution.pdf')

### 2.3 Relationship Between Features and Difficulty

**Figure 2**: Distribution of selected features at each difficulty level (box plots).
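
Box plots show where the distributions differ, but not whether the difference is larger than chance. One possible check, sketched here on synthetic data rather than the real `df`, is a Kruskal-Wallis test per feature. The group sizes and distributions below are made-up assumptions for illustration only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic stand-in for one feature split by difficulty (NOT the real data)
groups = {
    'Easy': rng.normal(loc=7.0, scale=1.0, size=80),
    'Medium': rng.normal(loc=6.5, scale=1.0, size=200),
    'Hard': rng.normal(loc=5.5, scale=1.0, size=60),
}

# Kruskal-Wallis H-test: a rank-based test of whether the groups share one distribution
h_stat, p_value = stats.kruskal(*groups.values())
print(f'H = {h_stat:.2f}, p = {p_value:.4g}')
```

With the real data, each group would be `df.loc[df['difficulty'] == d, feat]` for one difficulty level `d` and one feature `feat`.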

In [None]:
# Figure 2: features by difficulty level (box plots)
key_features = ['num_vowels', 'num_repeated_letters', 'avg_letter_freq', 'min_letter_freq']

fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes = axes.flatten()

for i, feat in enumerate(key_features):
    ax = axes[i]
    data = [df[df['difficulty'] == d][feat] for d in ['Easy', 'Medium', 'Hard']]
    bp = ax.boxplot(data, labels=['Easy', 'Medium', 'Hard'], patch_artist=True)
    
    for patch, d in zip(bp['boxes'], ['Easy', 'Medium', 'Hard']):
        patch.set_facecolor(DIFFICULTY_COLORS[d])
        patch.set_alpha(0.7)
    
    ax.set_xlabel('Difficulty Level', fontsize=11)
    ax.set_ylabel(feat, fontsize=11)

plt.tight_layout()
plt.savefig('figures/fig2_feature_by_difficulty.pdf', bbox_inches='tight')
plt.show()
print('Figure 2 saved: figures/fig2_feature_by_difficulty.pdf')

---
## 3. Building the Classification Models

In [None]:
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.metrics import (accuracy_score, classification_report, confusion_matrix,
                             roc_auc_score, f1_score, precision_recall_fscore_support)

# Prepare the data
X = df[feature_cols].values
y = df[target].values

# Encode the labels as integers
le = LabelEncoder()
y_encoded = le.fit_transform(y)

# Train/test split (stratified by class)
X_train, X_test, y_train, y_test = train_test_split(
    X, y_encoded, test_size=0.2, random_state=42, stratify=y_encoded
)

# Standardize (fit the scaler on the training set only)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f'Training set: {X_train.shape[0]} samples')
print(f'Test set: {X_test.shape[0]} samples')
print(f'Class mapping: {dict(zip(le.classes_, range(len(le.classes_))))}')

### 3.1 Model Comparison
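
The comparison in this subsection cross-validates on `X_train_scaled`, which was scaled once using the whole training set, so each validation fold has already influenced the scaler's mean and variance. The effect is usually small, and tree-based models are insensitive to feature scaling, but for logistic regression and the SVM it may be cleaner to refit the scaler inside every fold with a pipeline. A minimal sketch on synthetic data from `make_classification` (an assumption standing in for the word features):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data with 10 features (placeholder for the real feature matrix)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5,
    n_classes=3, random_state=42
)

# Inside cross_val_score the scaler is fitted on the training part of each fold only
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, random_state=42),
)

cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X_demo, y_demo, cv=cv_demo, scoring='accuracy')
print(f'CV accuracy: {scores.mean():.4f} (std {scores.std():.4f})')
```

In the notebook, the same pattern would pass the unscaled `X_train` to `cross_val_score` together with a pipeline built around each model.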

In [None]:
# Define the candidate models
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, max_depth=5, random_state=42),
    'SVM': SVC(kernel='rbf', probability=True, random_state=42)
}

# Evaluate the models (5-fold stratified cross-validation)
results = []
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for name, model in models.items():
    print(f'\nTraining {name}...')
    
    # Cross-validation on the training set
    cv_scores = cross_val_score(model, X_train_scaled, y_train, cv=cv, scoring='accuracy')
    
    # Fit on the full training set and evaluate on the test set
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    
    test_acc = accuracy_score(y_test, y_pred)
    test_f1 = f1_score(y_test, y_pred, average='weighted')
    
    results.append({
        'Model': name,
        'CV Accuracy': cv_scores.mean(),
        'CV Std': cv_scores.std(),
        'Test Accuracy': test_acc,
        'Test F1': test_f1
    })
    
    print(f'  CV Accuracy: {cv_scores.mean():.4f} (+/- {cv_scores.std()*2:.4f})')
    print(f'  Test Accuracy: {test_acc:.4f}, Test F1: {test_f1:.4f}')

results_df = pd.DataFrame(results)
print('\nModel comparison results:')
print(results_df.to_string(index=False))

### 3.2 Visualizing the Model Comparison

**Figure 3**: Cross-validation and test accuracy for each model.

In [None]:
# Figure 3: model comparison
fig, ax = plt.subplots(figsize=FIGSIZE_NORMAL)

x = np.arange(len(results_df))
width = 0.35

bars1 = ax.bar(x - width/2, results_df['CV Accuracy'], width, label='CV Accuracy',
               color=COLORS['primary'], edgecolor='black', alpha=0.8,
               yerr=results_df['CV Std'], capsize=4)
bars2 = ax.bar(x + width/2, results_df['Test Accuracy'], width, label='Test Accuracy',
               color=COLORS['secondary'], edgecolor='black', alpha=0.8)

ax.set_xlabel('Model', fontsize=12)
ax.set_ylabel('Accuracy', fontsize=12)
ax.set_xticks(x)
ax.set_xticklabels(results_df['Model'], rotation=15, ha='right')
ax.legend()
ax.set_ylim(0, 1)

# Add value labels above the test-accuracy bars
for bar in bars2:
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.02,
            f'{bar.get_height():.3f}', ha='center', fontsize=9)

plt.tight_layout()
plt.savefig('figures/fig3_model_comparison.pdf', bbox_inches='tight')
plt.show()
print('Figure 3 saved: figures/fig3_model_comparison.pdf')

### 3.3 Detailed Evaluation of the Best Model

In [None]:
# Refit a random forest as the final model (hard-coded choice; check results_df to confirm it ranks best)
best_model = RandomForestClassifier(n_estimators=200, max_depth=10, 
                                    min_samples_split=5, random_state=42)
best_model.fit(X_train_scaled, y_train)

# Predict on the test set
y_pred_best = best_model.predict(X_test_scaled)
y_pred_proba = best_model.predict_proba(X_test_scaled)

# Classification report
print('Best model (Random Forest) classification report:')
print(classification_report(y_test, y_pred_best, target_names=le.classes_))

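### 3.4 Confusion Matrix

**Figure 4**: The confusion matrix shows how the model's predictions fall across the classes. Rows and columns follow `le.classes_`, which `LabelEncoder` sorts alphabetically (Easy, Hard, Medium).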

In [None]:
# Figure 4: confusion matrix
cm = confusion_matrix(y_test, y_pred_best)

fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=le.classes_, yticklabels=le.classes_, ax=ax)
ax.set_xlabel('Predicted', fontsize=12)
ax.set_ylabel('Actual', fontsize=12)

plt.tight_layout()
plt.savefig('figures/fig4_confusion_matrix.pdf', bbox_inches='tight')
plt.show()
print('Figure 4 saved: figures/fig4_confusion_matrix.pdf')

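### 3.5 Feature Importance

**Figure 5**: Feature importances from the random forest, used to identify the word attributes associated with difficulty.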
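
Impurity-based importances from `feature_importances_` are computed on the training data and can favor features with many distinct values, so it may be worth cross-checking them with permutation importance on held-out data. The sketch below uses synthetic data and placeholder feature names (`f0` to `f5`), not the notebook's variables.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic 3-class data (placeholder for the real word features)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0
)
names = [f'f{i}' for i in range(X_demo.shape[1])]  # placeholder names

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle one column at a time on the test split and record the drop in score
perm = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)

comparison = pd.DataFrame({
    'Feature': names,
    'Impurity': rf.feature_importances_,
    'Permutation': perm.importances_mean,
}).sort_values('Permutation', ascending=False)
print(comparison.to_string(index=False))
```

If the two rankings broadly agree, the attribution is more credible; strongly correlated features (for example `num_vowels` and `vowel_ratio`) can still split importance between them under either method.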

In [None]:
# Feature importance
feature_importance = pd.DataFrame({
    'Feature': feature_cols,
    'Importance': best_model.feature_importances_
}).sort_values('Importance', ascending=False)

print('Feature importance ranking (word attributes associated with difficulty):')
print(feature_importance.to_string(index=False))

# Figure 5: feature importance
fig, ax = plt.subplots(figsize=FIGSIZE_NORMAL)

fi_sorted = feature_importance.sort_values('Importance', ascending=True)
bars = ax.barh(fi_sorted['Feature'], fi_sorted['Importance'], 
               color=COLORS['primary'], edgecolor='black', alpha=0.8)

ax.set_xlabel('Feature Importance', fontsize=12)
ax.set_ylabel('Feature', fontsize=12)

# Highlight the value labels of the three most important features
top3 = fi_sorted.tail(3)['Feature'].tolist()
for bar, feat in zip(bars, fi_sorted['Feature']):
    color = COLORS['accent'] if feat in top3 else 'black'
    ax.text(bar.get_width() + 0.005, bar.get_y() + bar.get_height()/2,
            f'{bar.get_width():.3f}', va='center', fontsize=9, color=color)

plt.tight_layout()
plt.savefig('figures/fig5_feature_importance.pdf', bbox_inches='tight')
plt.show()
print('Figure 5 saved: figures/fig5_feature_importance.pdf')

---
## 4. Assessing the Difficulty of EERIE

In [None]:
# Feature computation for EERIE (letter_freq holds relative letter frequencies in percent)
letter_freq = {
    'E': 12.70, 'A': 8.17, 'R': 5.99, 'I': 6.97, 'O': 7.51, 'T': 9.06, 'N': 6.75,
    'S': 6.33, 'L': 4.03, 'C': 2.78, 'U': 2.76, 'D': 4.25, 'P': 1.93, 'M': 2.41,
    'H': 6.09, 'G': 2.02, 'B': 1.49, 'F': 2.23, 'Y': 1.97, 'W': 2.36, 'K': 0.77,
    'V': 0.98, 'X': 0.15, 'Z': 0.07, 'J': 0.15, 'Q': 0.10
}

def extract_word_features(word):
    """Extract the word-attribute features used by the model."""
    word = word.upper()
    letters = list(word)
    unique_letters = set(letters)
    
    vowels = set('AEIOU')
    num_vowels = sum(1 for l in letters if l in vowels)
    vowel_ratio = num_vowels / len(letters)
    
    num_unique = len(unique_letters)
    num_repeated = len(letters) - num_unique
    has_repeated = 1 if num_repeated > 0 else 0
    
    freqs = [letter_freq.get(l, 0) for l in letters]
    avg_freq = np.mean(freqs)
    min_freq = np.min(freqs)
    max_freq = np.max(freqs)
    first_freq = letter_freq.get(letters[0], 0)
    last_freq = letter_freq.get(letters[-1], 0)
    
    return {
        'num_vowels': num_vowels,
        'vowel_ratio': vowel_ratio,
        'num_unique_letters': num_unique,
        'num_repeated_letters': num_repeated,
        'has_repeated': has_repeated,
        'avg_letter_freq': avg_freq,
        'min_letter_freq': min_freq,
        'max_letter_freq': max_freq,
        'first_letter_freq': first_freq,
        'last_letter_freq': last_freq
    }

# Extract the features for EERIE
eerie_features = extract_word_features('EERIE')

print('EERIE word features:')
for k, v in eerie_features.items():
    print(f'  {k}: {v}')

In [None]:
# Difficulty prediction for EERIE
eerie_X = np.array([[eerie_features[col] for col in feature_cols]])
eerie_X_scaled = scaler.transform(eerie_X)

# Predict the class and the class probabilities
eerie_pred = best_model.predict(eerie_X_scaled)[0]
eerie_proba = best_model.predict_proba(eerie_X_scaled)[0]

eerie_difficulty = le.inverse_transform([eerie_pred])[0]

print('\nEERIE difficulty prediction:')
print(f'  Predicted class: {eerie_difficulty}')
print('\nClass probabilities:')
for cls, prob in zip(le.classes_, eerie_proba):
    print(f'  {cls}: {prob*100:.1f}%')

### 4.1 EERIE Compared with Past Words

**Figure 6**: EERIE's feature values compared with the distributions for each difficulty level.

In [None]:
# Figure 6: EERIE feature comparison
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes = axes.flatten()

key_features_plot = ['num_vowels', 'num_repeated_letters', 'avg_letter_freq', 'min_letter_freq']

for i, feat in enumerate(key_features_plot):
    ax = axes[i]
    
    # Distribution for each difficulty level
    for difficulty in ['Easy', 'Medium', 'Hard']:
        subset = df[df['difficulty'] == difficulty][feat]
        ax.hist(subset, bins=15, alpha=0.5, label=difficulty, 
                color=DIFFICULTY_COLORS[difficulty], edgecolor='black')
    
    # Position of EERIE
    eerie_val = eerie_features[feat]
    ax.axvline(x=eerie_val, color='red', linestyle='--', linewidth=3, label='EERIE')
    
    ax.set_xlabel(feat, fontsize=11)
    ax.set_ylabel('Frequency', fontsize=11)
    ax.legend(fontsize=9)

plt.tight_layout()
plt.savefig('figures/fig6_eerie_comparison.pdf', bbox_inches='tight')
plt.show()
print('Figure 6 saved: figures/fig6_eerie_comparison.pdf')

### 4.2 Prediction Confidence

**Figure 7**: Predicted class probabilities for EERIE, read as the model's confidence.
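
For a random forest, `predict_proba` returns the class probabilities averaged over the trees, so the top probability alone can overstate certainty when two classes are close. Two simple complements are the margin between the two most likely classes and the normalized entropy of the distribution. The helper `confidence_summary` and the example vector below are hypothetical, not the model's actual output for EERIE.

```python
import numpy as np

def confidence_summary(proba):
    """Summarize a class-probability vector (hypothetical helper)."""
    p = np.asarray(proba, dtype=float)
    top_two = np.sort(p)[::-1][:2]
    nonzero = p[p > 0]  # skip zeros so log(0) never occurs
    entropy = -np.sum(nonzero * np.log(nonzero))
    return {
        'top_prob': float(top_two[0]),
        'margin': float(top_two[0] - top_two[1]),
        # 0 = all mass on one class, 1 = uniform over the classes
        'normalized_entropy': float(entropy / np.log(len(p))),
    }

# Made-up example vector, NOT the model's output for EERIE
example = confidence_summary([0.20, 0.50, 0.30])
print(example)
```

In the notebook this would be called as `confidence_summary(eerie_proba)`; a small margin or an entropy near 1 suggests the predicted class should be reported with caution.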

In [None]:
# Figure 7: predicted probabilities for EERIE
fig, ax = plt.subplots(figsize=(8, 6))

colors = [DIFFICULTY_COLORS[cls] for cls in le.classes_]
bars = ax.bar(le.classes_, eerie_proba * 100, color=colors, edgecolor='black', alpha=0.8)

ax.set_xlabel('Difficulty Level', fontsize=12)
ax.set_ylabel('Probability (%)', fontsize=12)
ax.set_ylim(0, 100)

# Add value labels
for bar, prob in zip(bars, eerie_proba):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 2,
            f'{prob*100:.1f}%', ha='center', fontsize=12, fontweight='bold')

# Outline the predicted class in red
pred_idx = list(le.classes_).index(eerie_difficulty)
bars[pred_idx].set_edgecolor('red')
bars[pred_idx].set_linewidth(3)

plt.tight_layout()
plt.savefig('figures/fig7_eerie_probability.pdf', bbox_inches='tight')
plt.show()
print('Figure 7 saved: figures/fig7_eerie_probability.pdf')

---
## 5. Discussion of Model Accuracy
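
`roc_auc_score` is imported in section 3 but is not called in the cells shown here. For a three-class problem it needs the full probability matrix and an explicit multiclass strategy. A sketch with one-vs-rest macro averaging on synthetic data (the data and the `*_demo` names are assumptions, not the notebook's variables):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic 3-class data (placeholder for the real word features)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5,
    n_classes=3, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

clf = RandomForestClassifier(n_estimators=200, max_depth=10, random_state=42)
clf.fit(X_tr, y_tr)

# One-vs-rest AUC per class, averaged with equal weight per class
proba_demo = clf.predict_proba(X_te)
auc_ovr = roc_auc_score(y_te, proba_demo, multi_class='ovr', average='macro')
print(f'Macro one-vs-rest ROC AUC: {auc_ovr:.4f}')
```

With the notebook's objects, the equivalent call would be `roc_auc_score(y_test, y_pred_proba, multi_class='ovr', average='macro')`.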

In [None]:
# Per-class metrics
precision, recall, f1, support = precision_recall_fscore_support(y_test, y_pred_best)

accuracy_detail = pd.DataFrame({
    'Class': le.classes_,
    'Precision': precision,
    'Recall': recall,
    'F1-Score': f1,
    'Support': support
})

print('Per-class performance metrics:')
print(accuracy_detail.to_string(index=False))

# Overall metrics
overall_acc = accuracy_score(y_test, y_pred_best)
overall_f1 = f1_score(y_test, y_pred_best, average='weighted')

print(f'\nOverall accuracy: {overall_acc:.4f}')
print(f'Weighted F1 score: {overall_f1:.4f}')

In [None]:
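# Model accuracy analysis
print('\n' + '='*60)
print('Model accuracy discussion')
print('='*60)

# Read the per-class statements from the computed metrics instead of hard-coding them
best_cls = accuracy_detail.loc[accuracy_detail['Recall'].idxmax(), 'Class']
worst_cls = accuracy_detail.loc[accuracy_detail['Recall'].idxmin(), 'Class']
majority_cls = df['difficulty'].value_counts().idxmax()

print(f'''
[Model performance]
- Overall accuracy is about {overall_acc * 100:.1f}%
- {best_cls} has the highest recall, so the model identifies this class best
- {worst_cls} has the lowest recall, so words of this class are missed most often

[Factors limiting accuracy]
1. The difficulty definition relies on a single metric (average number of guesses)
2. Limited sample size ({len(df)} words)
3. Class imbalance: {majority_cls} is the largest class
4. Incomplete feature coverage: word semantics, familiarity, and similar factors are not included

[Suggested improvements]
1. Add more training data
2. Introduce more features (e.g. word frequency, semantic complexity)
3. Define difficulty from a combination of several metrics
''')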

---
## 6. Summary of Results

In [None]:
# Save the results
eerie_results = {
    'word': 'EERIE',
    'predicted_difficulty': eerie_difficulty,
    'prob_Easy': eerie_proba[list(le.classes_).index('Easy')],
    'prob_Medium': eerie_proba[list(le.classes_).index('Medium')],
    'prob_Hard': eerie_proba[list(le.classes_).index('Hard')],
    'num_vowels': eerie_features['num_vowels'],
    'num_repeated_letters': eerie_features['num_repeated_letters'],
    'vowel_ratio': eerie_features['vowel_ratio']
}

pd.DataFrame([eerie_results]).to_csv('eerie_difficulty.csv', index=False)
print('EERIE difficulty prediction saved to eerie_difficulty.csv')

# Save the model evaluation results
results_df.to_csv('model_comparison.csv', index=False)
feature_importance.to_csv('feature_importance.csv', index=False)
accuracy_detail.to_csv('accuracy_by_class.csv', index=False)

# Summary information
summary = {
    'task': 'Word Difficulty Classification',
    'target_word': 'EERIE',
    'best_model': 'Random Forest',
    'overall_accuracy': overall_acc,
    'overall_f1': overall_f1,
    'eerie_prediction': eerie_difficulty,
    'n_training_samples': X_train.shape[0],  # training split only; len(df) would also count the test set
    'n_features': len(feature_cols)
}

pd.DataFrame([summary]).to_csv('results_summary.csv', index=False)
print('All results saved')

In [None]:
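# Final conclusions
print('\n' + '='*70)
print('Problem 4: final conclusions')
print('='*70)

# Classes with the highest and lowest test recall, read from accuracy_detail
top_recall_cls = accuracy_detail.loc[accuracy_detail['Recall'].idxmax(), 'Class']
low_recall_cls = accuracy_detail.loc[accuracy_detail['Recall'].idxmin(), 'Class']

print(f'''
[Difficulty classification model]
A random forest classifier performs three-class classification (Easy/Medium/Hard) on word-attribute features.
Test accuracy is {overall_acc*100:.1f}% and the weighted F1 score is {overall_f1:.4f}.

[Word attributes associated with difficulty] (ranked by importance)
1. {feature_importance.iloc[0]['Feature']}: {feature_importance.iloc[0]['Importance']:.3f}
2. {feature_importance.iloc[1]['Feature']}: {feature_importance.iloc[1]['Importance']:.3f}
3. {feature_importance.iloc[2]['Feature']}: {feature_importance.iloc[2]['Importance']:.3f}

[EERIE difficulty prediction]
Predicted class: {eerie_difficulty}
Class probabilities: Easy={eerie_proba[list(le.classes_).index('Easy')]*100:.1f}%, 
        Medium={eerie_proba[list(le.classes_).index('Medium')]*100:.1f}%, 
        Hard={eerie_proba[list(le.classes_).index('Hard')]*100:.1f}%

[EERIE difficulty analysis]
- High vowel ratio ({eerie_features['vowel_ratio']*100:.0f}%): letter positions are harder to pin down
- Heavy letter repetition ({eerie_features['num_repeated_letters']} repeated occurrences): the three E's can produce misleading yellow feedback
- Overall predicted difficulty: "{eerie_difficulty}"

[Model accuracy discussion]
- Strength: highest recall on the {top_recall_cls} class
- Limitation: recall on the {low_recall_cls} class needs improvement
- Next steps: more data and additional semantic features
''')
print('='*70)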

---
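## Appendix: List of Figures

| No. | File | Content | Suggested placement |
|------|--------|------|-------------|
| Figure 1 | fig1_difficulty_distribution.pdf | Difficulty distribution (3 panels) | 4.4.1 Data description |
| Figure 2 | fig2_feature_by_difficulty.pdf | Features by difficulty (4-panel box plots) | 4.4.2 Feature analysis |
| Figure 3 | fig3_model_comparison.pdf | Model accuracy comparison | 4.4.3 Model selection |
| Figure 4 | fig4_confusion_matrix.pdf | Confusion matrix | 4.4.3 Model evaluation |
| Figure 5 | fig5_feature_importance.pdf | Feature importance | 4.4.4 Feature importance |
| Figure 6 | fig6_eerie_comparison.pdf | EERIE compared with past words (4 panels) | 4.4.5 EERIE analysis |
| Figure 7 | fig7_eerie_probability.pdf | EERIE predicted probabilities (confidence) | 4.4.5 EERIE analysis |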