RedNote-Vibe Dataset

A Dataset for Capturing Temporal Dynamics of AI-Generated Text in Social Media arxiv

Abstract

The proliferation of Large Language Models (LLMs) has led to widespread AI-Generated Text (AIGT) on social media platforms, creating unique challenges where content dynamics are driven by user engagement and evolve over time. However, existing datasets mainly depict static AIGT detection. In this work, we introduce RedNote-Vibe, the first longitudinal (5-years) dataset for social media AIGT analysis. This dataset is sourced from Xiaohongshu platform, containing user engagement metrics (e.g., likes, comments) and timestamps spanning from the pre-LLM period to July 2025, which enables research into the temporal dynamics and user interaction patterns of AIGT.

Dataset Structure

All data files are available at the following link:

Download from Google Drive

The directory contains:

training_set_human.jsonl — Human-authored posts (pre-LLM period, before Nov 2022)
training_set_aigc.jsonl — AI-generated posts using 17 LLMs
exploration_set.jsonl — Post-LLM period posts (2023-2025) for temporal analysis

If you would like to receive notifications about updates, please STAR to subscribe.

Dataset Statistics

The dataset spans 11 content domains. The table below summarizes per-domain statistics: number of posts (#), average text length (Length), average number of tags (#Tags), and average user engagement (Likes, Comments, Collections).

Domain	#	Length	#Tags	Likes	Comm.	Colls.
Health	19.2k	333.6	7.6	2814.3	97.1	2468.8
Fashion	17.7k	191.2	10.9	1850.5	62.5	681.3
Food	10.5k	398.4	3.0	66.2	11.9	45.0
Career	11.7k	406.9	10.0	2047.2	102.6	1445.9
Pets	3.9k	433.2	2.7	57.5	12.6	34.1
Education	5.9k	496.0	3.7	55.9	6.1	38.3
Sports	17.7k	528.8	3.2	60.1	9.8	36.4
Relation.	22.5k	308.2	7.1	2663.6	275.8	593.9
Travel	14.8k	452.8	10.7	1102.4	86.6	662.0
Wellness	16.0k	396.5	10.5	2348.6	217.7	1025.8
Others	16.9k	220.9	8.9	208.8	1266.2	3225.0

Data Format

Each entry contains the following fields:

{
  "note_title": "Post title",
  "local_time": "Publication timestamp (YYYYMMDDHH)",
  "note_content": "Main text content",
  "likes": 123,
  "collections": 45,
  "comments": 67,
  "domain": "Content category",
  "model_family": "AI provider (for AIGC data)",
  "model": "Specific AI model (for AIGC data)"
}

🚀 Quick Start

Prerequisites

pip install torch transformers scikit-learn pandas numpy datasets

Basic Usage

Run the quick start example to train a BERT model for AIGT detection:

python quick_start.py

This script will:

Load the human and AI-generated datasets
Prepare balanced training/validation/test splits
Fine-tune a BERT model for binary classification
Evaluate performance and show prediction examples
Save the trained model for future use

Supported Tasks

1. AIGT Classification (Binary)

Distinguish between human-written and AI-generated content.

2. AI Provider Identification (6-way)

Identify the source among six major AI providers:

OpenAI
Google
Anthropic
DeepSeek
Qwen

3. Model Identification (17-way)

Fine-grained identification among 17 specific AI models.

Citation

If you use this dataset in your research, please cite our paper:

@misc{li2025rednote,
      title={RedNote-Vibe: A Dataset for Capturing Temporal Dynamics of AI-Generated Text in Social Media}, 
      author={Yudong Li and Yufei Sun and Yuhan Yao and Peiru Yang and Wanyue Li and Jiajun Zou and Yongfeng Huang and Linlin Shen},
      year={2025},
      eprint={2509.22055},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.22055}, 
}

License

This dataset is released under the CC BY-NC 4.0 License.

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
README.md		README.md
README_zh.md		README_zh.md
overview.png		overview.png
quick_start.py		quick_start.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

RedNote-Vibe Dataset

Abstract

Dataset Structure

Dataset Statistics

Data Format

🚀 Quick Start

Prerequisites

Basic Usage

Supported Tasks

1. AIGT Classification (Binary)

2. AI Provider Identification (6-way)

3. Model Identification (17-way)

Citation

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

RedNote-Vibe Dataset

Abstract

Dataset Structure

Dataset Statistics

Data Format

🚀 Quick Start

Prerequisites

Basic Usage

Supported Tasks

1. AIGT Classification (Binary)

2. AI Provider Identification (6-way)

3. Model Identification (17-way)

Citation

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages