## Analysis Overview

This notebook contains the full analysis pipeline, including data cleaning,
feature engineering, exploratory analysis, modeling, and interpretation.

In [None]:
# =========================
# Global Visualization Style
# =========================
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_theme(
    style="white",        # white background, no grid
    font_scale=1.1        # uniform font scaling
)

# Global matplotlib tweaks
plt.rcParams.update({
    "axes.titlesize": 14,
    "axes.labelsize": 12,
    "legend.fontsize": 11,
    "legend.title_fontsize": 12,
    "figure.dpi": 100,
})

In [None]:
# =========================
# Color Palette (OrRd)
# =========================
ORRD = sns.color_palette("OrRd", 6)

COLOR_LIGHT = ORRD[1]   # baseline / weaker
COLOR_DARK  = ORRD[4]   # uplift / stronger

## STEP1: Define modelling question

Given post-level content and timing features, what factors are most associated with higher engagement (likes) on Red Note?

## STEP2: Data cleaning

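This step loads the data, checks missing values, parses the posting time, and measures duplication. Because posts were scraped by tag, a post with several tags can appear in several rows, so the data is deduplicated at the post level (title + posting time) to avoid sample bias.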
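
To make the bias concrete, here is a minimal sketch on toy data. The column names (`title`, `post_time`, `tag`, `likes`) and the values are hypothetical stand-ins, not the real dataset. A post scraped under two tags is counted twice in row-level statistics unless it is deduplicated first.

```python
import pandas as pd

# Toy scrape: post "A" was collected under two tags, so it appears twice.
# Column names and values are hypothetical stand-ins for the real data.
raw = pd.DataFrame({
    "title":     ["A", "A", "B", "C"],
    "post_time": ["10:00", "10:00", "11:30", "21:15"],
    "tag":       ["python", "CS", "python", "CS"],
    "likes":     [900, 900, 40, 10],
})

key = ["title", "post_time"]
dup_rate = raw.duplicated(subset=key).mean()

# Row-level mean counts post "A" twice; post-level mean counts it once.
row_mean = raw["likes"].mean()
post_mean = raw.drop_duplicates(subset=key)["likes"].mean()

print(f"duplicate rate: {dup_rate:.0%}")
print(f"mean likes, row level:  {row_mean:.1f}")
print(f"mean likes, post level: {post_mean:.1f}")
```

In this toy case the viral post inflates the row-level mean, which is the kind of distortion post-level deduplication is meant to prevent.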

In [None]:
import pandas as pd
df = pd.read_csv("python_笔记标签.csv")
display(df.head())
df.info()
df.shape

In [None]:
missing_summary = df.isna().sum().to_frame("missing_count")
missing_summary["missing_ratio"] = missing_summary["missing_count"] / len(df)
missing_summary.sort_values("missing_ratio", ascending=False)

In [None]:
df["发布时间"] = pd.to_datetime(df["发布时间"], format="%H:%M", errors="coerce")
df["post_hour"] = df["发布时间"].dt.hour
df["发布时间"] = df["发布时间"].dt.time
df["post_hour"]

In [None]:
# Check for fully duplicated posts (same title + posting time)
dup_mask = df.duplicated(subset=["笔记标题", "发布时间"], keep=False)

df[dup_mask].sort_values(["笔记标题", "发布时间"])

In [None]:
dup_rate = df.duplicated(subset=["笔记标题", "发布时间"]).mean()
print(f"Duplicate rate: {dup_rate:.2%}")

## STEP3: EDA (round 1: understand data & target)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(6,4))
sns.histplot(df["笔记点赞"], bins=50, color=COLOR_DARK)
plt.title("Distribution of Likes")
plt.xlabel("Likes")
plt.show()

df["笔记点赞"].quantile([0.95, 0.99, 0.995, 0.999])

Engagement is heavy-tailed: a small number of viral posts sit far above the bulk of the distribution.
Rather than removing extreme values, I apply a log transformation in the next step, which stabilizes variance while keeping genuinely high-performing posts in the data.

### Why not winsorize?

I avoided winsorization because high-engagement posts are meaningful rather than erroneous, and capping would assign them all the same value.
The log transformation compresses the tail instead, so relative differences between posts are preserved.
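
The difference between the two options can be sketched on synthetic data. The lognormal sample below is an assumption standing in for the real like counts, and the 99th-percentile cap is only one possible winsorization threshold.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic heavy-tailed "likes" (assumption: lognormal stands in for real data)
likes = pd.Series(np.floor(rng.lognormal(mean=4.0, sigma=1.5, size=5_000)))

# Option 1: winsorize at the 99th percentile (caps viral posts at one value)
cap = likes.quantile(0.99)
winsorized = likes.clip(upper=cap)

# Option 2: log1p keeps every post distinct and preserves rank order
log_likes_demo = np.log1p(likes)

top = likes.nlargest(50).index
print("distinct values in top 50, winsorized:", winsorized.loc[top].nunique())
print("distinct values in top 50, log1p:     ", log_likes_demo.loc[top].nunique())
print("skewness raw / log1p:", round(likes.skew(), 2), "/", round(log_likes_demo.skew(), 2))
```

After capping, the top posts become indistinguishable from each other, while `log1p` reduces skewness and still keeps them ordered.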

## STEP4: Feature engineering

In [None]:
import numpy as np
df["log_likes"] = np.log1p(df["笔记点赞"])

In [None]:
plt.figure(figsize=(6,4))
sns.histplot(df["log_likes"], bins=50, color=COLOR_DARK)
plt.title("Distribution of Log(Likes + 1)")
plt.xlabel("Log Likes")
plt.show()

In [None]:
df['笔记标签'].value_counts()

In [None]:
top_tags = ["编程", "python学习", "CS", "python"]

In [None]:
for tag in top_tags:
    # Note: `tag in x` is a substring test, so "python" also matches "python学习"
    df[f"tag_{tag}"] = df["笔记标签"].apply(
        lambda x: 1 if isinstance(x, str) and tag in x else 0
    )

In [None]:
df["num_top_tags"] = df[[f"tag_{t}" for t in top_tags]].sum(axis=1)

In [None]:
!pip install emoji

In [None]:
import emoji

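# Make sure titles are strings (note: missing values become the literal string "nan")
df["笔记标题"] = df["笔记标题"].astype(str)

# Title length in characters
df["text_length"] = df["笔记标题"].str.len()

# Number of emojis in the title
df["emoji_count"] = df["笔记标题"].apply(
    lambda x: len(emoji.emoji_list(x))
)

# Whether the title contains any emoji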
df["has_emoji"] = (df["emoji_count"] > 0).astype(int)

In [None]:
df.head()

In [None]:
df['num_top_tags'].value_counts()

`num_top_tags` takes only two distinct values, so the count itself carries less information than which tags were used. I dropped it.
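
One thing worth checking before interpreting the tag flags is how they are matched. The sketch below contrasts substring matching with exact matching on a few made-up labels. Whether the real `笔记标签` column holds exactly one tag per row is an assumption here, and the label "CSS" is purely illustrative.

```python
import pandas as pd

top_tags_demo = ["编程", "python学习", "CS", "python"]
# Made-up labels for illustration; "CSS" is a hypothetical near-miss
labels = pd.Series(["python学习", "python", "CSS", "编程", None])

# Substring match, equivalent to `tag in x`
substring = pd.DataFrame({
    f"tag_{t}": labels.apply(lambda x, t=t: int(isinstance(x, str) and t in x))
    for t in top_tags_demo
})

# Exact match on the whole label (assumption: one tag per row)
exact = pd.DataFrame({
    f"tag_{t}": (labels == t).astype(int) for t in top_tags_demo
})

print(substring.sum(axis=1).tolist())  # [2, 1, 1, 1, 0]
print(exact.sum(axis=1).tolist())      # [1, 1, 0, 1, 0]
```

With substring matching, a "python学习" label sets both `tag_python学习` and `tag_python`, which is one way a tag count could end up with only two distinct values.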

In [None]:
df = df.drop(columns='num_top_tags')

### Handle schema violations

In [None]:
display(df['text_length'].value_counts().sort_values(ascending=False))
longest = df[df['text_length']==39]
display(longest['笔记标题'])

Based on the platform's 20-character title limit, titles longer than 20 characters were treated as scraping errors, that is, as missing titles. For those rows `has_valid_title` is set to 0 and `text_length` is reset to 0.

In [None]:
MAX_TITLE_LEN = 20

df["has_valid_title"] = (df["text_length"] <= MAX_TITLE_LEN).astype(int)

In [None]:
df["text_length"] = df["text_length"].where(
    df["text_length"] <= MAX_TITLE_LEN,
    0
)
display(df['text_length'].value_counts())
display(df['has_valid_title'].value_counts())

In [None]:
display(df.head())

### Drop duplicates

In [None]:
before = len(df)

df_clean = df.drop_duplicates(
    subset=["笔记标题", "发布时间"],
    keep="first"
).copy()  # explicit copy so later column assignments do not raise SettingWithCopyWarning

after = len(df_clean)

print(f"Before dedup: {before}")
print(f"After dedup: {after}")
print(f"Removed: {before - after} ({(before - after)/before:.2%})")

### Cleaned dataset info

In [None]:
display(df_clean.head())
df_clean.info()
df_clean.shape

In [None]:
df_clean[["tag_编程", "tag_python", "tag_CS", "tag_python学习"]].mean()

In [None]:
sample_df = df_clean.sample(
    frac=0.05,
    random_state=42
)

In [None]:
sample_df.to_csv(
    "sample_data.csv",
    index=False
)

## STEP5: EDA (round 2: sanity check before modelling)

### Aims of this stage

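1: Check whether each feature actually discriminates the target \
2: Check that the direction of each relationship matches business intuition \
3: Decide whether a more complex model is needed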

In [None]:
hourly_mean = (
    df_clean
    .groupby("post_hour")["log_likes"]
    .mean()
    .reset_index()
)

plt.figure(figsize=(6,4))
sns.lineplot(x="post_hour", y="log_likes", data=hourly_mean, color=COLOR_DARK)
plt.title("Average Log Likes by Posting Hour")
plt.xlabel("Post Hour")
plt.ylabel("Avg Log Likes")
plt.show()

Posting hour shows a clear non-linear relationship with engagement, with distinct peak and trough periods rather than a monotonic trend.
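
Hourly peaks and troughs are easier to trust when each hour has enough posts behind it. The following is a minimal sketch on synthetic data (the `demo` frame and its sinusoidal pattern are assumptions, not the real `df_clean`) that reports the count and an approximate 95% interval next to each hourly mean.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Synthetic stand-in for df_clean (assumption): hour 0-23 plus a noisy target
demo = pd.DataFrame({"post_hour": rng.integers(0, 24, size=2_000)})
demo["log_likes"] = (
    5
    + 0.5 * np.sin(demo["post_hour"] / 24 * 2 * np.pi)
    + rng.normal(0, 1, len(demo))
)

hourly = (
    demo.groupby("post_hour")["log_likes"]
    .agg(mean="mean", n="count", sem="sem")
    .reset_index()
)
# Approximate 95% interval for each hourly mean (normal approximation)
hourly["ci_low"] = hourly["mean"] - 1.96 * hourly["sem"]
hourly["ci_high"] = hourly["mean"] + 1.96 * hourly["sem"]

print(hourly.head())
```

Hours with a small `n` and a wide interval are better read as noise than as genuine peaks or troughs.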

In [None]:
plt.figure(figsize=(4,4))
sns.boxplot(
    x="tag_编程",
    y="log_likes",
    data=df_clean,
    color=COLOR_DARK
)
plt.xticks([0,1], ["No Tag", "Has Tag"])
plt.title("Log Likes by Tag: 编程")
plt.show()

In [None]:
plt.figure(figsize=(4,4))
sns.boxplot(
    x="tag_python学习",
    y="log_likes",
    data=df_clean,
    color=COLOR_DARK
)
plt.xticks([0,1], ["No Tag", "Has Tag"])
plt.title("Log Likes by Tag: python学习")
plt.show()

In [None]:
plt.figure(figsize=(4,4))
sns.boxplot(
    x="tag_CS",
    y="log_likes",
    data=df_clean,
    color=COLOR_DARK
)
plt.xticks([0,1], ["No Tag", "Has Tag"])
plt.title("Log Likes by Tag: CS")
plt.show()

In [None]:
plt.figure(figsize=(4,4))
sns.boxplot(
    x="tag_python",
    y="log_likes",
    data=df_clean,
    color=COLOR_DARK
)
plt.xticks([0,1], ["No Tag", "Has Tag"])
plt.title("Log Likes by Tag: python")
plt.show()

No individual topic tag shows a strong positive effect in isolation.
This is consistent with engagement depending on combinations of content topic, posting time, and presentation style rather than on any single tag.

In [None]:
plt.figure(figsize=(4,4))
sns.boxplot(
    x="has_valid_title",
    y="log_likes",
    data=df_clean, 
    color=COLOR_DARK
)
plt.xticks([0,1], ["No Title", "Has Title"])
plt.title("Log Likes by Title Presence")
plt.show()

In [None]:
df_title = df_clean[df_clean["text_length"] > 0]

ax = sns.scatterplot(
    data=df_title,
    x="text_length",
    y="log_likes",
    color=COLOR_DARK,   # shared OrRd dark tone
    alpha=0.25,         # low opacity to reveal density
    s=30                # moderate marker size
)

ax.set_xlabel("Title Length")
ax.set_ylabel("Engagement (log scale)")
ax.set_title("Engagement vs. Title Length (Valid Titles Only)")

# Visual cleanup
sns.despine(left=True, bottom=True)
ax.xaxis.grid(False)

sns.regplot(
    data=df_title,
    x="text_length",
    y="log_likes",
    scatter=False,
    lowess=True,
    color=ORRD[5],
    line_kws={"linewidth": 2}
)

plt.tight_layout()
plt.show()

Title length exhibits a weak monotonic relationship with engagement, with diminishing marginal returns, suggesting it acts as a secondary feature rather than a primary driver.
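
A rank correlation is one simple way to put a number on "weak monotonic". The sketch below uses synthetic data with an assumed weak, saturating effect. It computes Spearman's rho as the Pearson correlation of the ranks, which is its definition.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
# Synthetic stand-in (assumption): a weak, saturating effect of title length
demo = pd.DataFrame({"text_length": rng.integers(1, 21, size=3_000)})
demo["log_likes"] = (
    5 + 0.3 * np.log(demo["text_length"]) + rng.normal(0, 1, len(demo))
)

# Spearman correlation = Pearson correlation of the ranks
rho = demo["text_length"].rank().corr(demo["log_likes"].rank())
print(f"Spearman rho: {rho:.3f}")
```

A small positive rho would match the reading above: the relationship points in one direction but explains little of the variation.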

In [None]:
plt.figure(figsize=(4,4))
sns.boxplot(
    x="has_emoji",
    y="log_likes",
    data=df_clean,
    color=COLOR_DARK
)
plt.xticks([0,1], ["No Emoji", "Has Emoji"])
plt.title("Log Likes by Emoji Usage")
plt.show()

Emoji usage is associated with higher engagement on average, but its effect is secondary compared to timing and topic.

In [None]:
sns.boxplot(
    x="emoji_count",
    y="log_likes",
    data=df_clean,
    color=COLOR_DARK
)
plt.title("Log Likes by Emoji Count")
plt.show()

Engagement increases when a small number of emojis are used, but the effect plateaus and becomes noisy as emoji count increases, indicating diminishing returns.

### Summary

Engagement is influenced by multiple weak signals rather than a single dominant factor.

Posting time shows clear non-linear patterns, while topic tags and text features exhibit weaker marginal effects.

Emoji usage provides a small but consistent uplift, with diminishing returns beyond a few emojis.

These observations support the use of a non-linear model to capture interactions across features.
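
To illustrate why a non-monotonic timing pattern favors a non-linear model, the sketch below fits a linear regression and a random forest to synthetic data with an assumed midday peak. The data-generating process is hypothetical and only meant to show the mechanism.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
hour = rng.integers(0, 24, size=3_000)
# Assumed non-monotonic pattern: engagement peaks around midday
y_demo = 5 + np.exp(-((hour - 12) ** 2) / 18) + rng.normal(0, 0.3, hour.size)
X_demo = hour.reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

linear = LinearRegression().fit(X_tr, y_tr)
forest = RandomForestRegressor(
    n_estimators=100, min_samples_leaf=20, random_state=42
).fit(X_tr, y_tr)

r2_linear = linear.score(X_te, y_te)
r2_forest = forest.score(X_te, y_te)
print(f"R2 linear: {r2_linear:.3f} | R2 forest: {r2_forest:.3f}")
```

A straight line cannot represent a peak in the middle of the day, so the linear fit recovers almost none of the pattern while the tree-based model does.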

## STEP6: Modelling

In [None]:
from sklearn.model_selection import train_test_split

features = [
    "post_hour",
    "has_valid_title",
    "text_length",
    "has_emoji",
    "emoji_count",
    "tag_编程",
    "tag_python学习",
    "tag_CS",
    "tag_python",
]

X = df_clean[features]
y = df_clean["log_likes"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42
)

In [None]:
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(
    n_estimators=300,
    max_depth=8,
    min_samples_leaf=20,
    random_state=42,
    n_jobs=-1
)

rf.fit(X_train, y_train)

In [None]:
from sklearn.metrics import r2_score, mean_squared_error

y_pred = rf.predict(X_test)

r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print(f"R2: {r2:.3f}")
print(f"RMSE: {rmse:.3f}")

The model is not intended for high-precision prediction.\
Its primary goal is to identify relative importance of content and timing features, which is why I prioritized model stability and interpretability over raw performance.
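
Stability can be checked with k-fold cross-validation instead of a single split. This is a minimal sketch on synthetic features (`X_demo` and `y_demo` are assumptions standing in for `X` and `y`), using hyperparameters similar to the model above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(4)
# Synthetic stand-in for X / y (assumption): weak signals plus a noisy target
X_demo = rng.normal(size=(1_500, 5))
y_demo = (
    0.5 * X_demo[:, 0]
    + 0.3 * np.abs(X_demo[:, 1])
    + rng.normal(0, 1, 1_500)
)

model = RandomForestRegressor(
    n_estimators=100, max_depth=8, min_samples_leaf=20, random_state=42
)
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="r2")

print(f"R2 per fold: {np.round(scores, 3)}")
print(f"mean = {scores.mean():.3f}, std = {scores.std():.3f}")
```

A low R2 with a small spread across folds fits the stated goal: the model is modest but consistent, which is what matters for comparing feature importance.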

In [None]:
# Ablation: drop the main category tag and retrain with the same settings
features_no_main_tag = [
    f for f in features if f != "tag_编程"
]
X2 = df_clean[features_no_main_tag]
y2 = df_clean["log_likes"]

X2_train, X2_test, y2_train, y2_test = train_test_split(
    X2, y2,
    test_size=0.2,
    random_state=42
)

In [None]:
rf2 = RandomForestRegressor(
    n_estimators=300,
    max_depth=8,
    min_samples_leaf=20,
    random_state=42,
    n_jobs=-1
)

rf2.fit(X2_train, y2_train)

In [None]:
y2_pred = rf2.predict(X2_test)

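# Separate names keep the baseline r2 / rmse available for comparison
r2_ablated = r2_score(y2_test, y2_pred)
rmse_ablated = np.sqrt(mean_squared_error(y2_test, y2_pred))

print(f"R2_update: {r2_ablated:.3f}")
print(f"RMSE_update: {rmse_ablated:.3f}")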

## STEP7: Permutation importance

In [None]:
from sklearn.inspection import permutation_importance

result = permutation_importance(
    rf,
    X_test,
    y_test,
    n_repeats=20,
    random_state=42,
    n_jobs=-1
)

importances = pd.DataFrame({
    "feature": X_test.columns,
    "importance_mean": result.importances_mean,
    "importance_std": result.importances_std
}).sort_values("importance_mean", ascending=False)

importances

Topic category features rank highly under permutation importance. Permutation importance measures how much the fitted model relies on a feature, so it can overstate a feature's unique contribution when correlated features carry the same information.

The ablation experiment supports this reading: retraining without the main category tag (`tag_编程`) does not significantly degrade model performance, indicating substantial information redundancy.

This suggests engagement is associated with a combination of finer-grained topic signals, timing, and presentation features rather than a single dominant tag.
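
The gap between permutation importance and ablation can be reproduced on synthetic data. In the sketch below, the feature names `main`, `proxy`, and `other` are hypothetical: `proxy` is built as a near-duplicate of `main`, so a retrained model can substitute one for the other.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
n = 2_000
main = rng.normal(size=n)
demo_X = pd.DataFrame({
    "main": main,                            # plays the role of the main tag
    "proxy": main + rng.normal(0, 0.05, n),  # near-duplicate of "main"
    "other": rng.normal(size=n),             # unrelated feature
})
demo_y = 2 * main + rng.normal(0, 0.3, n)

X_tr, X_te, y_tr, y_te = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=42
)

def fit_rf(cols):
    """Fit a forest on the given columns of the training split."""
    rf_demo = RandomForestRegressor(
        n_estimators=100, min_samples_leaf=20, random_state=42
    )
    return rf_demo.fit(X_tr[cols], y_tr)

full = fit_rf(["main", "proxy", "other"])
ablated = fit_rf(["proxy", "other"])  # retrained without "main"

perm = permutation_importance(full, X_te, y_te, n_repeats=10, random_state=42)
importance = pd.Series(perm.importances_mean, index=X_te.columns)

r2_full = r2_score(y_te, full.predict(X_te))
r2_ablated_demo = r2_score(y_te, ablated.predict(X_te[["proxy", "other"]]))

print(importance.round(3))
print(f"R2 full: {r2_full:.3f} | R2 without 'main': {r2_ablated_demo:.3f}")
```

Permuting `main` hurts the fitted model because its trees split on it, yet dropping `main` and retraining costs very little because `proxy` carries the same signal. That is the redundancy pattern described above.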

## STEP8: Finding insights (conditional analysis)

### Peak hours × emoji

In [None]:
from matplotlib.patches import Patch
peak_hours = [10, 11, 12]
df_clean["is_peak_hour"] = df_clean["post_hour"].isin(peak_hours).astype(int)

plot_df = df_clean.groupby(
    ["is_peak_hour", "has_emoji"]
)["log_likes"].mean().reset_index()

# ===== Palette (two selected colors) =====
palette = sns.color_palette("OrRd", 6)
COLOR_NO  = palette[1]   # light
COLOR_YES = palette[4]   # dark

plt.figure(figsize=(6.5, 4))

ax = sns.barplot(
    data=plot_df,
    x="is_peak_hour",
    y="log_likes",
    hue="has_emoji",
    hue_order=[0, 1],
    palette=[COLOR_NO, COLOR_YES],
    width=0.55
)

# x axis
ax.set_xticks([0, 1])
ax.set_xticklabels(["Off-Peak", "Peak"])
ax.set_xlabel("")

# y axis
ax.set_ylabel("Average Engagement (log scale)")
ax.set_ylim(4.5, 6.0)

# Title (states the conclusion)
ax.set_title(
    "Emoji Usage Provides Larger Gains During Off-Peak Hours",
    pad=12
)

# Legend (bound manually so labels always match colors)
legend_elements = [
    Patch(facecolor=COLOR_NO,  label="No Emoji"),
    Patch(facecolor=COLOR_YES, label="Has Emoji"),
]

ax.legend(
    handles=legend_elements,
    title="Emoji Usage",
    frameon=False,
    loc="upper left",
    bbox_to_anchor=(1.02, 1)
)

# Visual cleanup
sns.despine(left=True, bottom=True)
ax.yaxis.grid(True, color="#E6E6E6", linewidth=1)
ax.xaxis.grid(False)

plt.tight_layout()
import os
os.makedirs("figures", exist_ok=True)  # savefig does not create missing folders
plt.savefig("figures/peak_emoji.png", dpi=150, bbox_inches="tight")
plt.show()

### Peak hours × title presence

In [None]:
plot_df = df_clean.groupby(
    ["is_peak_hour", "has_valid_title"]
)["log_likes"].mean().reset_index()

# ========= OrRd palette (same as the first figure) =========
palette = sns.color_palette("OrRd", 6)
COLOR_NO  = palette[1]   # light orange
COLOR_YES = palette[4]   # dark orange-red

plt.figure(figsize=(6.5, 4))

ax = sns.barplot(
    data=plot_df,
    x="is_peak_hour",
    y="log_likes",
    hue="has_valid_title",
    hue_order=[0, 1],
    palette=[COLOR_NO, COLOR_YES],
    width=0.55
)

# ========= x axis =========
ax.set_xticks([0, 1])
ax.set_xticklabels(["Off-Peak", "Peak"])
ax.set_xlabel("")

# ========= y axis =========
ax.set_ylabel("Average Engagement (log scale)")
ax.set_ylim(4.5, 6.0)

# ========= Title (states the conclusion, same style as the first figure) =========
ax.set_title(
    "Titles Matter More When Baseline Visibility Is High",
    pad=12
)

# ========= Legend (bound manually so labels always match colors) =========
legend_elements = [
    Patch(facecolor=COLOR_NO,  label="No Title"),
    Patch(facecolor=COLOR_YES, label="Has Title"),
]

ax.legend(
    handles=legend_elements,
    title="Title Presence",
    frameon=False,
    loc="upper left",
    bbox_to_anchor=(1.02, 1)
)

# ========= Visual cleanup =========
sns.despine(left=True, bottom=True)
ax.yaxis.grid(True, color="#E6E6E6", linewidth=1)
ax.xaxis.grid(False)

plt.tight_layout()
plt.savefig("figures/peak_title.png", dpi=150, bbox_inches="tight")
plt.show()

### emoji count × title length

In [None]:
from matplotlib.lines import Line2D
# ===== Data preparation =====
plot_df = (
    df_clean.assign(
        short_title = (df_clean["text_length"] <= 8).astype(int)
    )
    .groupby(["emoji_count", "short_title"])["log_likes"]
    .mean()
    .reset_index()
)

# ===== Key step: restrict the main analysis range (avoids spurious trends) =====
plot_df_main = plot_df[plot_df["emoji_count"] <= 4]

# ===== OrRd palette (same as the previous two figures) =====
palette = sns.color_palette("OrRd", 6)
COLOR_LONG  = palette[1]   # Long title (light)
COLOR_SHORT = palette[4]   # Short title (dark)

# Theme and rcParams are set once in the first cell; they are not re-applied here
# so that font sizes stay consistent across figures.

# ===== Draw a single figure =====
plt.figure(figsize=(6.5, 4))

ax = sns.lineplot(
    data=plot_df_main,
    x="emoji_count",
    y="log_likes",
    hue="short_title",
    hue_order=[0, 1],
    palette=[COLOR_LONG, COLOR_SHORT],
    marker="o",
    linewidth=2.4
)

# ===== Axis settings =====
ax.set_xlabel("Emoji Count")
ax.set_ylabel("Average Engagement (log scale)")
ax.set_ylim(4.5, 6.5)

# ===== Title (states the conclusion) =====
ax.set_title(
    "Moderate Emoji Usage Enhances Short Titles, With Diminishing Returns",
    pad=12
)

# ===== Legend (Line2D handles for consistency) =====
legend_elements = [
    Line2D([0], [0], color=COLOR_LONG,  marker='o', linewidth=2.4, label="Long Title"),
    Line2D([0], [0], color=COLOR_SHORT, marker='o', linewidth=2.4, label="Short Title"),
]

ax.legend(
    handles=legend_elements,
    title="Title Length",
    frameon=False,
    loc="upper left",
    bbox_to_anchor=(1.02, 1)
)

# ===== Visual cleanup =====
sns.despine(left=True, bottom=True)
ax.yaxis.grid(True, color="#E6E6E6", linewidth=1)
ax.xaxis.grid(False)

plt.tight_layout()
plt.savefig("figures/title_emoji.png", dpi=150, bbox_inches="tight")
plt.show()

### Summary

Timing appears to be the primary driver of engagement.

Emojis act as an attention booster, especially when baseline visibility is low (off-peak hours).

Presentation features cannot replace good timing, but can partially compensate for it.

No single tactic guarantees success; strategies should adapt to posting context.
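
The conditional comparisons behind these points can be summarized as an uplift table. The sketch below uses synthetic data with a built-in interaction (the effect sizes 0.4 and 0.1 are assumptions chosen for illustration, not estimates from the real data) and computes the emoji uplift separately for peak and off-peak posts.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(6)
n = 8_000
# Synthetic stand-in for df_clean (assumption), with a built-in interaction:
# emoji adds 0.4 off-peak but only 0.1 during peak hours.
demo = pd.DataFrame({
    "is_peak_hour": rng.integers(0, 2, n),
    "has_emoji": rng.integers(0, 2, n),
})
effect = np.where(demo["is_peak_hour"] == 1, 0.1, 0.4) * demo["has_emoji"]
demo["log_likes"] = (
    5 + 0.5 * demo["is_peak_hour"] + effect + rng.normal(0, 1, n)
)

means = demo.pivot_table(
    index="is_peak_hour", columns="has_emoji",
    values="log_likes", aggfunc="mean",
)
# Uplift = mean with emoji minus mean without emoji, per timing group
means["uplift"] = means[1] - means[0]
print(means.round(3))
```

Because the target is on a log scale, a difference of d in mean `log_likes` corresponds roughly to a multiplicative change of exp(d) in likes for posts with many likes, which makes uplifts easier to compare across groups.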