Unstructured.IO: ETL for LLMs

Welcome to Unstructured.IO! We're here on a mission to make all of your documents available for LLM applications, from PDFs and Word Docs to emails and markdown. To get started, check out our open source offerings.

Tried the open source library and ready for more power? Check out our products page to learn more about our paid API and the Unstructured Platform, an ETL tool built around our core file transformation capabilities.

Learn more

Section | Description
Company Website | Unstructured.io product and company info
Documentation | Full unstructured documentation

Popular repositories

  1. unstructured Public

    Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.

    HTML 10.6k 884

  2. unstructured-api Public

    Python 693 154

  3. unstructured-inference Public

    Python 176 59

  4. pipeline-sec-filings Public archive

    Preprocessing pipeline notebooks and API supporting text extraction from SEC documents

    Jupyter Notebook 143 31

  5. unstructured-python-client Public

    A Python client for the Unstructured Platform API

    Python 98 17

  6. unstructured-ingest Public

    HTML 70 35

Repositories

Showing 10 of 37 repositories
  • HTML 70 Apache-2.0 35 61 27 Updated Mar 25, 2025
  • base-images Public

    Store Dockerfiles and Packer configs for images to use as a base to build upon

    Shell 4 Apache-2.0 2 1 3 Updated Mar 25, 2025
  • UNS-MCP Public
    Python 9 1 0 3 Updated Mar 25, 2025
  • unstructured-python-client Public

    A Python client for the Unstructured Platform API

    Python 98 MIT 17 10 6 Updated Mar 25, 2025
  • docs Public

    Documentation for all Unstructured products and libraries

    MDX 7 22 0 7 Updated Mar 24, 2025
  • unstructured Public

    Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.

    HTML 10,643 Apache-2.0 884 151 (3 issues need help) 49 Updated Mar 24, 2025
  • unstructured-js-client Public

    A JavaScript/TypeScript client for the Unstructured Platform API

    TypeScript 50 MIT 15 7 2 Updated Mar 19, 2025
  • .github Public
    0 2 2 1 Updated Mar 19, 2025
  • Python 176 Apache-2.0 59 20 11 Updated Mar 18, 2025
  • unstructured.PaddleOCR Public Forked from PaddlePaddle/PaddleOCR

    Awesome multilingual OCR toolkits based on PaddlePaddle (practical ultra lightweight OCR system, support 80+ languages recognition, provide data annotation and synthesis tools, support training and deployment among server, mobile, embedded and IoT devices)

    Python 34 Apache-2.0 8,261 0 0 Updated Mar 17, 2025

