A Dataset for Capturing Temporal Dynamics of AI-Generated Text in Social Media arxiv
The proliferation of Large Language Models (LLMs) has led to widespread AI-Generated Text (AIGT) on social media platforms, creating unique challenges where content dynamics are driven by user engagement and evolve over time. However, existing datasets mainly depict static AIGT detection. In this work, we introduce RedNote-Vibe, the first longitudinal (5-years) dataset for social media AIGT analysis. This dataset is sourced from Xiaohongshu platform, containing user engagement metrics (e.g., likes, comments) and timestamps spanning from the pre-LLM period to July 2025, which enables research into the temporal dynamics and user interaction patterns of AIGT.
All data files are available at the following link:
The directory contains:
training_set_human.jsonl— Human-authored posts (pre-LLM period, before Nov 2022)training_set_aigc.jsonl— AI-generated posts using 17 LLMsexploration_set.jsonl— Post-LLM period posts (2023-2025) for temporal analysis
If you would like to receive notifications about updates, please STAR to subscribe.
The dataset spans 11 content domains. The table below summarizes per-domain statistics: number of posts (#), average text length (Length), average number of tags (#Tags), and average user engagement (Likes, Comments, Collections).
| Domain | # | Length | #Tags | Likes | Comm. | Colls. |
|---|---|---|---|---|---|---|
| Health | 19.2k | 333.6 | 7.6 | 2814.3 | 97.1 | 2468.8 |
| Fashion | 17.7k | 191.2 | 10.9 | 1850.5 | 62.5 | 681.3 |
| Food | 10.5k | 398.4 | 3.0 | 66.2 | 11.9 | 45.0 |
| Career | 11.7k | 406.9 | 10.0 | 2047.2 | 102.6 | 1445.9 |
| Pets | 3.9k | 433.2 | 2.7 | 57.5 | 12.6 | 34.1 |
| Education | 5.9k | 496.0 | 3.7 | 55.9 | 6.1 | 38.3 |
| Sports | 17.7k | 528.8 | 3.2 | 60.1 | 9.8 | 36.4 |
| Relation. | 22.5k | 308.2 | 7.1 | 2663.6 | 275.8 | 593.9 |
| Travel | 14.8k | 452.8 | 10.7 | 1102.4 | 86.6 | 662.0 |
| Wellness | 16.0k | 396.5 | 10.5 | 2348.6 | 217.7 | 1025.8 |
| Others | 16.9k | 220.9 | 8.9 | 208.8 | 1266.2 | 3225.0 |
Each entry contains the following fields:
{
"note_title": "Post title",
"local_time": "Publication timestamp (YYYYMMDDHH)",
"note_content": "Main text content",
"likes": 123,
"collections": 45,
"comments": 67,
"domain": "Content category",
"model_family": "AI provider (for AIGC data)",
"model": "Specific AI model (for AIGC data)"
}pip install torch transformers scikit-learn pandas numpy datasetsRun the quick start example to train a BERT model for AIGT detection:
python quick_start.pyThis script will:
- Load the human and AI-generated datasets
- Prepare balanced training/validation/test splits
- Fine-tune a BERT model for binary classification
- Evaluate performance and show prediction examples
- Save the trained model for future use
Distinguish between human-written and AI-generated content.
Identify the source among six major AI providers:
- OpenAI
- Anthropic
- DeepSeek
- Qwen
Fine-grained identification among 17 specific AI models.
If you use this dataset in your research, please cite our paper:
@misc{li2025rednote,
title={RedNote-Vibe: A Dataset for Capturing Temporal Dynamics of AI-Generated Text in Social Media},
author={Yudong Li and Yufei Sun and Yuhan Yao and Peiru Yang and Wanyue Li and Jiajun Zou and Yongfeng Huang and Linlin Shen},
year={2025},
eprint={2509.22055},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2509.22055},
}This dataset is released under the CC BY-NC 4.0 License.
