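# 03 Content Semantics and Audience Profiling

Goal: topic modeling and sentiment/toxicity analysis, comparing language differences between blue-check (Blue Verified) and non-blue-check accounts.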

In [13]:
import sys
from pathlib import Path

# Add the project root to the Python path so that `src` can be imported
project_root = Path('/workspace')
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

print(f"✅ Python path configured: {project_root}")

✅ Python path configured: /workspace
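
The cell above hardcodes `/workspace`, which only works where the project is mounted at that path. As a hedged alternative, the sketch below walks upward from a starting directory until it finds a marker file. The function name `find_project_root` and the default marker `pyproject.toml` are assumptions for illustration; the original notebook does not say which marker, if any, the project uses.

```python
from pathlib import Path
from typing import Optional


def find_project_root(start: Path, marker: str = "pyproject.toml") -> Optional[Path]:
    """Walk upward from `start` and return the first directory containing `marker`.

    `marker` is a hypothetical choice; any file that exists only at the
    project root would do. Returns None if no ancestor contains it.
    """
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    return None
```

In a notebook this could be called as `find_project_root(Path.cwd())`, falling back to the hardcoded path when it returns `None`.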


## Step 1: Load the data and prepare the text

In [14]:
import polars as pl

# Load the enriched tweet data
df = pl.read_parquet("parquet/tweets_enriched.parquet")
print(f"📊 Data loaded: {df.height:,} rows")

# Keep usable text: non-null, English, longer than 10 characters
df_text = df.filter(
    pl.col('text').is_not_null()
    & (pl.col('lang') == 'en')
    & (pl.col('text').str.len_chars() > 10)
)
print(f"📝 Valid English tweets: {df_text.height:,}")

# Count tweets by blue-check status (a null flag is counted as non-blue-check)
total = df_text.height
blue_verified = df_text.filter(pl.col('author_isBlueVerified')).height
non_blue = total - blue_verified
print(f"\n🔵 Blue-check accounts: {blue_verified:,} tweets ({blue_verified / total * 100:.1f}%)")
print(f"⚪ Non-blue-check accounts: {non_blue:,} tweets ({non_blue / total * 100:.1f}%)")

📊 Data loaded: 508,954 rows
📝 Valid English tweets: 415,912

🔵 Blue-check accounts: 143,119 tweets (34.4%)
⚪ Non-blue-check accounts: 272,793 tweets (65.6%)
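
The filter above removes 93,042 rows in one step, so the output does not show which condition is responsible for most of the loss. The sketch below illustrates one way to get a per-condition funnel. It uses plain Python dictionaries and made-up rows rather than the real data, and `filter_funnel` is a hypothetical helper, not part of the project.

```python
def filter_funnel(rows, steps):
    """Apply predicates in order and record how many rows survive each one."""
    remaining = list(rows)
    report = [("start", len(remaining))]
    for name, keep in steps:
        remaining = [row for row in remaining if keep(row)]
        report.append((name, len(remaining)))
    return report


# Made-up rows that mimic the `text` and `lang` columns
rows = [
    {"text": "A long enough English tweet", "lang": "en"},
    {"text": None, "lang": "en"},
    {"text": "Un tweet en français assez long", "lang": "fr"},
    {"text": "short", "lang": "en"},
]

# Same three conditions as the polars filter, applied one at a time
steps = [
    ("text is not null", lambda r: r["text"] is not None),
    ("lang == 'en'", lambda r: r["lang"] == "en"),
    ("more than 10 chars", lambda r: len(r["text"]) > 10),
]

for name, n in filter_funnel(rows, steps):
    print(f"{name}: {n}")
```

The same idea carries over to polars by calling `df.filter(...)` once per condition and printing `.height` after each step.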


## Step 2: Text feature extraction

In [15]:
import textstat

# Sample the data to speed up processing
sample_size = min(10000, df_text.height)
df_sample = df_text.sample(n=sample_size, seed=42)
print(f"📋 Sampled for analysis: {sample_size:,} tweets")

# Compute text features: Flesch Reading Ease and whitespace-delimited word count
texts = df_sample['text'].to_list()
readability_scores = [textstat.flesch_reading_ease(str(t)) for t in texts]
word_counts = [len(str(t).split()) for t in texts]

# Add the feature columns
df_sample = df_sample.with_columns([
    pl.Series('readability', readability_scores),
    pl.Series('word_count', word_counts)
])

print(f"\n📊 Text feature statistics:")
print(f"  Mean readability: {df_sample['readability'].mean():.2f}")
print(f"  Mean word count: {df_sample['word_count'].mean():.1f}")

📋 Sampled for analysis: 10,000 tweets

📊 Text feature statistics:
  Mean readability: 60.25
  Mean word count: 33.0
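
Flesch Reading Ease is computed as 206.835 − 1.015 × (words per sentence) − 84.6 × (syllables per word), so tokens such as URLs, @mentions and hashtags may skew the score when they are counted as words. The notebook scores the raw tweet text. The sketch below shows one possible cleaning pass that could be applied first. The function name `clean_tweet` and the regular expressions are assumptions for illustration and are deliberately simple.

```python
import re

# Illustrative patterns; real tweets may need more careful handling
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")


def clean_tweet(text: str) -> str:
    """Drop URLs and @mentions, keep hashtag words without '#', collapse whitespace."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = HASHTAG_RE.sub(r"\1", text)
    return " ".join(text.split())


print(clean_tweet("@user Breaking news: read this https://t.co/abc123 #Utah now"))
```

Whether cleaning is appropriate depends on the question: if hashtags and links are considered part of an account's style, scoring the raw text, as the notebook does, is a defensible choice.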


## Step 3: Topic modeling (BERTopic)

In [16]:
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic

# Load a lightweight embedding model
print("🤖 Loading the Sentence Transformer model...")
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

# Generate text embeddings
print("🔢 Generating text embeddings...")
embeddings = embedding_model.encode(texts, show_progress_bar=True)

# Fit the BERTopic model on the precomputed embeddings
print("📚 Training the topic model...")
topic_model = BERTopic(
    language='english',
    calculate_probabilities=False,
    verbose=True,
    min_topic_size=30
)
topics, probs = topic_model.fit_transform(texts, embeddings)

# Attach the topic label to each sampled tweet
df_sample = df_sample.with_columns(pl.Series('topic', topics))

# Count topics without the outlier label -1 (correct whether or not -1 occurs)
print(f"\n✅ Topic modeling complete:")
print(f"  Topics found: {len(set(topics) - {-1})}")
print(f"  (topic -1 denotes noise/outliers)")

# Show the five lowest-numbered topics; sorting avoids relying on dict order
print(f"\n🏆 Top 5 topics:")
for topic_id in sorted(t for t in topic_model.get_topics() if t != -1)[:5]:
    words = [word for word, _ in topic_model.get_topic(topic_id)[:5]]
    print(f"  Topic {topic_id}: {', '.join(words)}")

🤖 Loading the Sentence Transformer model...
🔢 Generating text embeddings...

2025-10-29 09:22:41,007 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm

📚 Training the topic model...

2025-10-29 09:22:43,438 - BERTopic - Dimensionality - Completed ✓
2025-10-29 09:22:43,439 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-10-29 09:22:43,701 - BERTopic - Cluster - Completed ✓
2025-10-29 09:22:43,715 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-10-29 09:22:43,877 - BERTopic - Representation - Completed ✓

✅ Topic modeling complete:
  Topics found: 48
  (topic -1 denotes noise/outliers)

🏆 Top 5 topics:
  Topic 0: kirk, charlie, you, was, about
  Topic 1: charliekirk, charliekirkshot, the, this, is
  Topic 2: israel, netanyahu, gaza, the, to
  Topic 3: kirk, charlie, this, song, was
  Topic 4: robinson, tyler, 22yearold, suspect, utah

## Step 4: Blue-check vs. non-blue-check comparison

In [17]:
# Aggregate text and engagement features by blue-check status
comparison = df_sample.group_by('author_isBlueVerified').agg([
    pl.len().alias('count'),
    pl.col('readability').mean().alias('avg_readability'),
    pl.col('word_count').mean().alias('avg_word_count'),
    pl.col('retweetCount').mean().alias('avg_retweets'),
    pl.col('likeCount').mean().alias('avg_likes')
]).sort('author_isBlueVerified')

print("📊 Blue-check vs. non-blue-check comparison:")
print(comparison)

# Topic counts per group, excluding outliers (topic -1), largest first within each group
print("\n📈 Topic distribution differences:")
topic_by_verified = (
    df_sample
    .filter(pl.col('topic') >= 0)
    .group_by(['author_isBlueVerified', 'topic'])
    .len()
    .sort(['author_isBlueVerified', 'len'], descending=[False, True])
)
print(topic_by_verified.head(10))

📊 Blue-check vs. non-blue-check comparison:
shape: (2, 6)
┌───────────────────────┬───────┬─────────────────┬────────────────┬──────────────┬───────────┐
│ author_isBlueVerified ┆ count ┆ avg_readability ┆ avg_word_count ┆ avg_retweets ┆ avg_likes │
│ ---                   ┆ ---   ┆ ---             ┆ ---            ┆ ---          ┆ ---       │
│ bool                  ┆ u32   ┆ f64             ┆ f64            ┆ f64          ┆ f64       │
╞═══════════════════════╪═══════╪═════════════════╪════════════════╪══════════════╪═══════════╡
│ false                 ┆ 6589  ┆ 62.22907        ┆ 27.533313      ┆ 2.92275      ┆ 29.238731 │
│ true                  ┆ 3411  ┆ 56.435233       ┆ 43.564644      ┆ 54.681618    ┆ 395.50513 │
└───────────────────────┴───────┴─────────────────┴────────────────┴──────────────┴───────────┘

📈 Topic distribution differences:
shape: (10, 3)
┌───────────────────────┬───────┬─────┐
│ author_isBlueVerified ┆ topic ┆ len │
│ ---                   ┆ ---   ┆ --- │
│ bool                  ┆ i64   ┆ u32 │
╞════════════════
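
The topic table holds raw counts, and the two groups differ in size (6,589 non-blue-check vs. 3,411 blue-check tweets in the sample), so counts are not directly comparable across groups. The sketch below converts counts to within-group shares in plain Python. `topic_share_by_group` is a hypothetical helper and the example pairs are made up.

```python
from collections import Counter, defaultdict


def topic_share_by_group(pairs):
    """Return {group: {topic: share}} from (group, topic) pairs, skipping topic -1.

    Shares are relative to the number of non-outlier tweets in each group.
    """
    counts = defaultdict(Counter)
    for group, topic in pairs:
        if topic >= 0:
            counts[group][topic] += 1
    return {
        group: {topic: n / sum(c.values()) for topic, n in c.items()}
        for group, c in counts.items()
    }


# Made-up (is_blue_verified, topic) pairs for illustration only
pairs = [
    (False, 0), (False, 0), (False, 1), (False, -1),
    (True, 0), (True, 1), (True, 1), (True, 1),
]
shares = topic_share_by_group(pairs)
print(shares)
```

With the notebook's data, the pairs could come from `zip(df_sample['author_isBlueVerified'], df_sample['topic'])`. A topic whose share differs markedly between the two groups is a more meaningful signal than a difference in raw counts.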

## Step 5: Save the analysis results

In [18]:
from src import io

# Save the sampled tweets with their topic labels
content_path = Path("parquet/content_analysis.parquet")
io.materialize_parquet(df_sample.lazy(), content_path)
print(f"✅ Content analysis results saved: {content_path}")

# Save the comparison statistics
comparison_path = Path("parquet/verified_comparison.parquet")
io.materialize_parquet(comparison.lazy(), comparison_path)
print(f"✅ Comparison statistics saved: {comparison_path}")

# Save the topic distribution
topic_dist_path = Path("parquet/topic_distribution.parquet")
io.materialize_parquet(topic_by_verified.lazy(), topic_dist_path)
print(f"✅ Topic distribution saved: {topic_dist_path}")

print(f"\n📂 All generated files:")
for f in io.list_parquet_files():
    print(f"  - {f}")

✅ Content analysis results saved: parquet/content_analysis.parquet
✅ Comparison statistics saved: parquet/verified_comparison.parquet
✅ Topic distribution saved: parquet/topic_distribution.parquet

📂 All generated files:
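
The recorded output shows no entries under the file listing, even though three files were reported as saved just above it. How the project's `io.list_parquet_files` resolves its directory is not shown in this notebook, so the sketch below is an independent check using only `pathlib`. `list_parquet` is a hypothetical helper, not the project's function.

```python
from pathlib import Path


def list_parquet(directory):
    """Return (file name, size in bytes) for each .parquet file directly under `directory`."""
    return [
        (path.name, path.stat().st_size)
        for path in sorted(Path(directory).glob("*.parquet"))
    ]
```

Calling `list_parquet("parquet")` from the notebook's working directory should list the three files written above if they landed where the save messages say.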


## ✅ Content semantics analysis complete!

All analysis data is ready and can be used for dashboard visualization.