Skip to content

ydli-ai/RedNote-Vibe

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

12 Commits
 
 
 
 
 
 
 
 

Repository files navigation

RedNote-Vibe Dataset

English | 中文

A Dataset for Capturing Temporal Dynamics of AI-Generated Text in Social Media arxiv

Dataset Overview

Abstract

The proliferation of Large Language Models (LLMs) has led to widespread AI-Generated Text (AIGT) on social media platforms, creating unique challenges where content dynamics are driven by user engagement and evolve over time. However, existing datasets mainly depict static AIGT detection. In this work, we introduce RedNote-Vibe, the first longitudinal (5-years) dataset for social media AIGT analysis. This dataset is sourced from Xiaohongshu platform, containing user engagement metrics (e.g., likes, comments) and timestamps spanning from the pre-LLM period to July 2025, which enables research into the temporal dynamics and user interaction patterns of AIGT.

Dataset Structure

All data files are available at the following link:

Download from Google Drive

The directory contains:

  • training_set_human.jsonl — Human-authored posts (pre-LLM period, before Nov 2022)
  • training_set_aigc.jsonl — AI-generated posts using 17 LLMs
  • exploration_set.jsonl — Post-LLM period posts (2023-2025) for temporal analysis

If you would like to receive notifications about updates, please STAR to subscribe.

Dataset Statistics

The dataset spans 11 content domains. The table below summarizes per-domain statistics: number of posts (#), average text length (Length), average number of tags (#Tags), and average user engagement (Likes, Comments, Collections).

Domain # Length #Tags Likes Comm. Colls.
Health 19.2k 333.6 7.6 2814.3 97.1 2468.8
Fashion 17.7k 191.2 10.9 1850.5 62.5 681.3
Food 10.5k 398.4 3.0 66.2 11.9 45.0
Career 11.7k 406.9 10.0 2047.2 102.6 1445.9
Pets 3.9k 433.2 2.7 57.5 12.6 34.1
Education 5.9k 496.0 3.7 55.9 6.1 38.3
Sports 17.7k 528.8 3.2 60.1 9.8 36.4
Relation. 22.5k 308.2 7.1 2663.6 275.8 593.9
Travel 14.8k 452.8 10.7 1102.4 86.6 662.0
Wellness 16.0k 396.5 10.5 2348.6 217.7 1025.8
Others 16.9k 220.9 8.9 208.8 1266.2 3225.0

Data Format

Each entry contains the following fields:

{
  "note_title": "Post title",
  "local_time": "Publication timestamp (YYYYMMDDHH)",
  "note_content": "Main text content",
  "likes": 123,
  "collections": 45,
  "comments": 67,
  "domain": "Content category",
  "model_family": "AI provider (for AIGC data)",
  "model": "Specific AI model (for AIGC data)"
}

🚀 Quick Start

Prerequisites

pip install torch transformers scikit-learn pandas numpy datasets

Basic Usage

Run the quick start example to train a BERT model for AIGT detection:

python quick_start.py

This script will:

  1. Load the human and AI-generated datasets
  2. Prepare balanced training/validation/test splits
  3. Fine-tune a BERT model for binary classification
  4. Evaluate performance and show prediction examples
  5. Save the trained model for future use

Supported Tasks

1. AIGT Classification (Binary)

Distinguish between human-written and AI-generated content.

2. AI Provider Identification (6-way)

Identify the source among six major AI providers:

  • OpenAI
  • Google
  • Anthropic
  • DeepSeek
  • Qwen

3. Model Identification (17-way)

Fine-grained identification among 17 specific AI models.

Citation

If you use this dataset in your research, please cite our paper:

@misc{li2025rednote,
      title={RedNote-Vibe: A Dataset for Capturing Temporal Dynamics of AI-Generated Text in Social Media}, 
      author={Yudong Li and Yufei Sun and Yuhan Yao and Peiru Yang and Wanyue Li and Jiajun Zou and Yongfeng Huang and Linlin Shen},
      year={2025},
      eprint={2509.22055},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.22055}, 
}

License

This dataset is released under the CC BY-NC 4.0 License.

About

小红书AIGT数据集

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages