# End-to-End Exploratory Data Analysis (EDA)
This notebook demonstrates an end-to-end workflow from raw data to final insights.

In [None]:
# Initialization: import libraries and set plotting defaults
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
from scipy import stats
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
sns.set_theme(style='whitegrid')  # sns.set is an older alias of set_theme
plt.rcParams['figure.figsize'] = (8, 5)
print('Libraries ready')

## 1. Load Raw Data
This notebook uses the project's built-in dataset `eda_demo_1000.csv`.

In [None]:
df = pd.read_csv('../files/eda_demo_1000.csv')
df.head()
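Before cleaning, it can help to check column types and missing-value counts. The sketch below uses a tiny in-memory CSV as a stand-in for `eda_demo_1000.csv`. The column names mirror those used later in this notebook, but the rows are hypothetical and `demo` is an illustrative name.

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for the real CSV file
csv_text = """date,age,income,spend,channel,converted
2024-01-05,34,52000,310.5,online,1
2024-01-06,,48000,120.0,offline,0
2024-01-07,45,,89.9,online,0
"""

# parse_dates converts the date column to datetime64 while reading
demo = pd.read_csv(io.StringIO(csv_text), parse_dates=['date'])

# Count missing values per column and show only the affected columns
missing = demo.isna().sum()
print(demo.dtypes)
print(missing[missing > 0])
```

The same two checks (`df.dtypes` and `df.isna().sum()`) can be run directly on the loaded `df`.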

## 2. Data Cleaning
- Missing-value imputation (mean/median)
- Outlier detection (IQR/Z-score)
- Type conversion
- Date parsing

In [None]:
# Parse the date column into datetime64
df['date'] = pd.to_datetime(df['date'])
num_cols = ['age', 'income', 'spend']
# Flag rows where any numeric column falls outside the 1.5 * IQR fences.
# Comparisons against NaN evaluate to False, so missing values are never flagged.
Q1 = df[num_cols].quantile(0.25)
Q3 = df[num_cols].quantile(0.75)
IQR = Q3 - Q1
outlier_mask = ((df[num_cols] < (Q1 - 1.5 * IQR)) | (df[num_cols] > (Q3 + 1.5 * IQR))).any(axis=1)
df['is_outlier'] = outlier_mask
# Fill missing numeric values with each column's median
imputer = SimpleImputer(strategy='median')
df[num_cols] = imputer.fit_transform(df[num_cols])
df.head()
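The cleaning list also mentions Z-score outlier detection, which the cell above does not run. A minimal sketch follows, assuming a cutoff of |z| > 3 (a common convention, not something this notebook specifies). It uses synthetic data, and `demo` and `z_outlier` are illustrative names.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for two of the notebook's numeric columns (hypothetical values)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'income': rng.normal(50_000, 10_000, 200),
    'spend': rng.normal(300, 50, 200),
})
demo.loc[0, 'spend'] = 5_000  # plant one obvious extreme value

# Column-wise Z-score using the population standard deviation (ddof=0).
# pandas skips NaN when computing mean and std, so missing values stay NaN.
z = (demo - demo.mean()) / demo.std(ddof=0)

# Flag a row if any column is more than 3 standard deviations from its mean
z_outlier = (z.abs() > 3).any(axis=1)
print(z_outlier.sum(), 'rows flagged')
```

Unlike the IQR fences, the Z-score depends on the mean and standard deviation, which extreme values themselves inflate, so the two methods can flag different rows.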

## 3. Feature Engineering
Example: derive month, day-of-week, and an interaction term.

In [None]:
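# Calendar features: month (1-12) and day of week (Monday=0 ... Sunday=6)
df['month'] = df['date'].dt.month
df['dow'] = df['date'].dt.dayofweek
# Spend-to-income ratio; the 1e-6 term avoids division by zero,
# and clip bounds the ratio to [0, 10]
df['spend_per_income'] = (df['spend'] / (df['income'] + 1e-6)).clip(0, 10)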
df.head()

## 4. Visualization (Matplotlib/Seaborn/Plotly)
Show distributions, trends, and correlations.

In [None]:
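# Distribution of spend with a kernel density overlay
sns.histplot(df['spend'], kde=True, bins=30)
plt.title('Spend Distribution')
plt.show()

# Spend over time; rows that share a date are aggregated to their mean by seaborn
sns.lineplot(data=df.sort_values('date'), x='date', y='spend')
plt.title('Spend Over Time')
plt.show()

# Interactive scatter of income against spend, colored by conversion
fig = px.scatter(df, x='income', y='spend', color='converted',
                 title='Income vs Spend')
fig.show()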
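The plots above cover distributions and trends, so a correlation matrix is a natural complement. The sketch below computes pairwise Pearson correlations on synthetic data. `demo` and its values are hypothetical, with `spend` deliberately constructed to track `income`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: spend is built as a noisy linear function of income
rng = np.random.default_rng(1)
income = rng.normal(50_000, 10_000, 300)
demo = pd.DataFrame({
    'age': rng.integers(18, 70, 300),
    'income': income,
    'spend': 0.005 * income + rng.normal(0, 20, 300),
})

# Pairwise Pearson correlation between all numeric columns
corr = demo.corr(method='pearson')
print(corr.round(2))
```

In the notebook, the equivalent would be `df[num_cols].corr()`, and passing the result to `sns.heatmap(corr, annot=True)` renders it as a heatmap.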

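## 5. Statistical Tests
Examples: a two-sample t-test for a difference in means and a test for a difference in proportions. The cell below runs Welch's t-test on spend by channel and computes a 95% confidence interval for mean spend.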

In [None]:
# '线上' (online) and '线下' (offline) are the channel labels as stored in the dataset
g_online = df[df['channel']=='线上']['spend']
g_offline = df[df['channel']=='线下']['spend']
# Welch's t-test (equal_var=False) does not assume equal group variances
t_stat, p_val = stats.ttest_ind(g_online, g_offline, equal_var=False, nan_policy='omit')
print(f'T-test p-value: {p_val:.4g}')

# 95% confidence interval for mean spend (normal approximation, z = 1.96)
mean_spend = df['spend'].mean()
se = df['spend'].std(ddof=1) / np.sqrt(len(df))
ci = (mean_spend - 1.96*se, mean_spend + 1.96*se)
print('95% CI:', ci)
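The section intro also names a proportion test, which the cell above does not run. One way to compare conversion rates between channels is a chi-square test of independence on the 2x2 table of channel by conversion. The sketch below is one option, not necessarily the test the author intended. It uses synthetic data with hypothetical conversion rates, and the English labels `'online'` and `'offline'` stand in for the dataset's `'线上'` and `'线下'`.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in: hypothetical conversion rates of 30% online vs 20% offline
rng = np.random.default_rng(2)
n = 500
channel = rng.choice(['online', 'offline'], size=n)
prob = np.where(channel == 'online', 0.30, 0.20)
demo = pd.DataFrame({'channel': channel, 'converted': rng.random(n) < prob})

# 2x2 contingency table: rows are channels, columns are converted False/True
table = pd.crosstab(demo['channel'], demo['converted'])

# Chi-square test of independence; for 2x2 tables SciPy applies
# Yates' continuity correction by default (correction=True)
chi2, p_val, dof, expected = stats.chi2_contingency(table)

rates = demo.groupby('channel')['converted'].mean()
print(rates)
print(f'chi2={chi2:.3f}, p={p_val:.4g}, dof={dof}')
```

On the real data, the table would be `pd.crosstab(df['channel'], df['converted'])`.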

## 6. Feature Importance
Example: RandomForest feature importance for conversion prediction.

In [None]:
# Feature matrix and binary target
X = df[['age','income','spend','month','dow','spend_per_income']].copy()
y = df['converted'].astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Tree ensembles do not need feature scaling, so the scaler is optional here
pipe = make_pipeline(StandardScaler(with_mean=False), RandomForestClassifier(n_estimators=200, random_state=42))
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
print(classification_report(y_test, y_pred))
# make_pipeline names each step after its lowercased class name
rf = pipe.named_steps['randomforestclassifier']
# Impurity-based importances, sorted from most to least important
importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
importances
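Impurity-based importances are computed on the training data and can favor features with many distinct values. Permutation importance on held-out data is a common cross-check. The sketch below is a suggested addition rather than part of the original workflow. The data is synthetic, and the target is hypothetically driven by `income` alone so the expected ranking is known.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in features (hypothetical values)
rng = np.random.default_rng(3)
n = 600
X_demo = pd.DataFrame({
    'age': rng.integers(18, 70, n),
    'income': rng.normal(50_000, 10_000, n),
    'dow': rng.integers(0, 7, n),
})
# Hypothetical target: conversion depends on income only
y_demo = (X_demo['income'] > 50_000).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)
model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Shuffle each feature on the test set and measure the drop in accuracy
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=42)
perm = pd.Series(result.importances_mean, index=X_demo.columns).sort_values(ascending=False)
print(perm)
```

With the notebook's objects, the call would be `permutation_importance(pipe, X_test, y_test, n_repeats=10, random_state=42)`.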

## 7. Anomaly Detection & Handling
Example: drop the IQR-flagged rows, then re-plot the data.

In [None]:
# Keep only the rows that were not flagged as outliers in the cleaning step
df_clean = df[~df['is_outlier']].copy()
sns.scatterplot(data=df_clean, x='income', y='spend', hue='channel')
plt.title('Income vs Spend (Outliers Removed)')
plt.show()

## 8. Conclusions & Business Value
- The online channel shows higher spend in some regions; consider increasing its budget.
- Higher-income customers convert at a higher rate; target this segment.
- Outliers come mainly from large spend peaks; model promotional periods separately.
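Claims like the income and conversion finding can be checked with a simple grouped summary. The sketch below bins income into quartiles and compares conversion rates. It is an illustration on synthetic data, where conversion probability is hypothetically constructed to rise with income, and `demo`, `income_band`, and `rate_by_band` are illustrative names.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: conversion probability rises with income by construction
rng = np.random.default_rng(4)
n = 1000
income = rng.normal(50_000, 10_000, n)
prob = 1 / (1 + np.exp(-(income - 50_000) / 10_000))
demo = pd.DataFrame({'income': income, 'converted': rng.random(n) < prob})

# Split income into four equal-sized bands (quartiles)
demo['income_band'] = pd.qcut(demo['income'], q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])

# Conversion rate within each income band
rate_by_band = demo.groupby('income_band', observed=True)['converted'].mean()
print(rate_by_band)
```

On the real data, the same pattern would be `df.groupby(pd.qcut(df['income'], 4), observed=True)['converted'].mean()`.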