
OpenDataLab

OpenDataLab provides access to numerous significant open-source datasets.


Note

📚 In 2025, we open-sourced WanJuan 3.0 (WanJuan Silu), a high-quality multilingual dataset comprising over 1.2TB of native-language text corpora from five countries. Each subset includes seven major categories and 34 subcategories covering a wide range of local content, such as history, politics, culture, real estate, shopping, weather, dining, encyclopedic knowledge, and professional expertise. The download links for the five subsets are listed below; everyone is welcome to download and use them.

WanJuan3.0: Korean | Arabic | Vietnamese | Russian | Thai


🔥🔥🔥 OpenDataLab provides an ecosystem of high-quality datasets for the community. It offers:

🌟 Extensive open data resources for AI models

● A fast, simple way to access open datasets
● 7,700+ large-scale, high-quality open datasets for large models
● 1,200+ open datasets for computer vision
● 200+ open datasets from CVPR
● Categorized datasets for hot topics

✨ Open-source data processing toolkits

● Data acquisition toolkits supporting large datasets
● Data acquisition toolkits supporting various kinds of tasks
● Open-source intelligent toolbox for labeling

💫 Dataset description language

● Format standardization
● DSDL: Dataset Description Language
● Defining a CV dataset with DSDL
● 100+ CV datasets standardized by OpenDataLab

Check out our tutorial videos (in Chinese) to get started.


📣 We have launched an upgraded feature that lets authors upload datasets themselves. We invite you to use it to promote your open-source datasets and AI research results, so that more people can find, download, and use your datasets.

For an introduction to the self-service dataset upload feature, see the help doc. You can create and share your dataset by following our guidelines. 💪

If you have any questions or run into problems, please contact us at OpenDataLab@pjlab.org.cn.

Popular repositories

  1. MinerU Public

    A high-quality tool for converting PDF to Markdown and JSON. A one-stop, open-source, high-quality data extraction tool that converts PDF to Markdown and JSON formats.

    Python · 28.9k stars · 2.3k forks

  2. PDF-Extract-Kit Public

    A Comprehensive Toolkit for High-Quality PDF Content Extraction

    Python · 7.1k stars · 489 forks

  3. labelU Public

    A data annotation toolbox that supports image, audio, and video data.

    Python · 1.1k stars · 111 forks

  4. DocLayout-YOLO Public

    DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception

    Python · 967 stars · 74 forks

  5. LabelLLM Public

    The Open-Source Data Annotation Platform

    TypeScript · 741 stars · 64 forks

  6. WanJuan1.0 Public

    WanJuan 1.0 multimodal corpus

    556 stars · 28 forks

Repositories

Showing 10 of 41 repositories
  • OHR-Bench Public

    OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation

    Python · 68 stars · 12 forks · 1 issue · 0 pull requests · Updated Mar 23, 2025
  • MinerU Public

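    A high-quality tool for converting PDF to Markdown and JSON. A one-stop, open-source, high-quality data extraction tool that converts PDF to Markdown and JSON formats.

    Python · 28,901 stars · AGPL-3.0 · 2,253 forks · 114 issues · 7 pull requests · Updated Mar 22, 2025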
  • FakeVLM Public

    FakeVLM: Advancing Synthetic Image Detection through Explainable Multimodal Models and Fine-Grained Artifact Analysis

    Python · 11 stars · 0 forks · 1 issue · 0 pull requests · Updated Mar 22, 2025
  • LEGION Public

    The official implementation of the paper "LEGION: Learning to Ground and Explain for Synthetic Image Detection"

    Python · 11 stars · 1 fork · 1 issue · 0 pull requests · Updated Mar 21, 2025
  • LOKI Public

    [ICLR 2025 Spotlight] The official implementation of the paper “LOKI: A Comprehensive Synthetic Data Detection Benchmark using Large Multimodal Models”

    Python · 135 stars · 1 fork · 1 issue · 0 pull requests · Updated Mar 19, 2025
  • skydiffusion Public

    The official implementation of the paper “Street-to-Satellite Image Synthesis with Diffusion Models and BEV Paradigm”

    Python · 44 stars · Apache-2.0 · 2 forks · 2 issues · 0 pull requests · Updated Mar 18, 2025
  • labelU Public

    A data annotation toolbox that supports image, audio, and video data.

    Python · 1,097 stars · Apache-2.0 · 111 forks · 9 issues · 0 pull requests · Updated Mar 18, 2025
  • labelU-Kit Public

    Data annotation component library, provided as NPM packages

    TypeScript · 83 stars · Apache-2.0 · 24 forks · 3 issues · 2 pull requests · Updated Mar 18, 2025
  • magic-html Public

    Python · 412 stars · Apache-2.0 · 35 forks · 6 issues · 0 pull requests · Updated Mar 13, 2025
  • opendatalab-datasets Public

    Dataset resources

    106 stars · 11 forks · 3 issues · 0 pull requests · Updated Mar 12, 2025