<p align="center">
  <img src="https://github.com/wisupai/e2m/blob/main/docs/images/wisup_e2m_banner.jpg?raw=true" width="100%" alt="wisup_e2m Logo">
</p>


# 👏🏻 Welcome to E2M (Everything2Markdown) 👏🏻

## 📖 Introduction

E2M is a tool that converts various content into Markdown format, supporting multiple input formats:

- **Text**
  - doc
  - docx
  - epub
  - html
  - htm
  - pdf (Note: includes plain text, text + image, and image-only PDF files)
  - ppt
  - pptx
- **Links**
  - url
- **Audio**
  - mp3
  - m4a
- **Video**
  - mp4 (in progress)

## 🚀 Quick Start

### 🔧 Installation

```bash
pip install wisup_e2m
```

## Core Feature 1: Parser

The purpose of the Parser is to extract text or images from various types of files. Since the main inputs for large models are text and images, the parser serves as a preprocessing step before running the Converter.

The data format returned after parsing is `E2MParsedData`:

```python
class E2MParsedData(BaseModel):
    text: Optional[str] = Field(None, description="Parsed text")
    images: Optional[List[str]] = Field([], description="Parsed image paths")
    attached_images: Optional[List[str]] = Field(
        [], description="Attached image paths, like 1_0.png, 1_1.png, etc."
    )
    attached_images_map: Optional[Dict[str, List[str]]] = Field(
        {},
        description="Attached image paths map, like {1.png: ['/path/to/1_0.png'], 2.png: ['/path/to/2_1.png']}, only available for layout detection.",
    )
    metadata: Optional[List[Any] | Dict[str, Any]] = Field(
        {}, description="Metadata of the parsed data, including engine, etc."
    )
```
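As a rough illustration of how these fields fit together, the sketch below works on a plain dictionary shaped like the model above (for example, the result of `to_dict()`, used later in this document). The helper name `summarize_parsed` and all sample values are hypothetical and are not part of `wisup_e2m`.

```python
from typing import Any, Dict, List


def summarize_parsed(data: Dict[str, Any]) -> Dict[str, Any]:
    """Summarize a dict shaped like E2MParsedData (hypothetical helper)."""
    text = data.get("text") or ""
    images: List[str] = data.get("images") or []
    attached_map: Dict[str, List[str]] = data.get("attached_images_map") or {}
    return {
        "text_chars": len(text),
        "image_count": len(images),
        # Number of attached (cropped) images recorded for each page image.
        "attached_per_image": {k: len(v) for k, v in attached_map.items()},
    }


# Made-up sample values that follow the field descriptions above.
sample = {
    "text": None,
    "images": ["/path/to/1.png", "/path/to/2.png"],
    "attached_images": ["1_0.png", "2_1.png"],
    "attached_images_map": {
        "1.png": ["/path/to/1_0.png"],
        "2.png": ["/path/to/2_1.png"],
    },
    "metadata": {"engine": "surya_layout"},
}

summary = summarize_parsed(sample)
print(summary)
```

Working on a plain dictionary keeps the sketch independent of the library; with real parsed data, the same keys should be reachable as attributes of the returned object.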

### URL Parser

```python
from wisup_e2m import UrlParser

url = "https://www.osar.fr/notes/justintonation"
parser = UrlParser(engine="jina")  # url engines: jina, firecrawl, unstructured

url_data = parser.parse(url)
print(url_data.text)
```

### PDF Parser

```python
# The Marker engine downloads models from the Hugging Face Hub and uses them for parsing.
# If you encounter network issues, try running the following code first:

import os

os.environ["CURL_CA_BUNDLE"] = ""
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"
```

```python
from wisup_e2m import PdfParser

pdf_path = "./test.pdf"
parser = PdfParser(engine="marker")  # pdf engines: marker, unstructured, surya_layout

# By default, images are saved to the ./figure folder; this can be changed with parameters.
pdf_data = parser.parse(pdf_path)

print(pdf_data.text)
```

### PPT Parser

Parsing PPT or DOC files requires the `libreoffice` dependency. Install it as follows:

#### Mac

```bash
brew install libreoffice
```

#### Linux

```bash
sudo apt-get install libreoffice
```

#### Windows

Visit the official website [https://www.libreoffice.org/](https://www.libreoffice.org/) to download.
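
Once LibreOffice is installed, a quick check from Python can confirm that its command-line binary is reachable. This is a minimal sketch using only the standard library. It assumes the executable is named `soffice` or `libreoffice` and is on `PATH`, which may not hold for every installation (some macOS and Windows installs place it elsewhere). The helper name `find_libreoffice` is hypothetical.

```python
import shutil
from typing import Optional


def find_libreoffice() -> Optional[str]:
    """Return the path of the first LibreOffice binary found on PATH, or None."""
    # Assumed executable names; adjust these for your installation.
    for name in ("soffice", "libreoffice"):
        path = shutil.which(name)
        if path:
            return path
    return None


binary = find_libreoffice()
if binary is None:
    print("LibreOffice not found on PATH")
else:
    print(f"LibreOffice found at {binary}")
```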

```python
from wisup_e2m import PptParser

ppt_path = "./test.ppt"
parser = PptParser(engine="unstructured")  # ppt engines: unstructured

# By default, images are saved to the ./figure folder; this can be changed with parameters.
ppt_data = parser.parse(ppt_path)

print(ppt_data.text)
```

### DOCX Parser

```python
from wisup_e2m import DocxParser

docx_path = "./test.docx"
parser = DocxParser(engine="xml")  # docx engines: xml

# By default, images are saved to the ./figure folder; this can be changed with parameters.
docx_data = parser.parse(docx_path)

# Images from the docx file appear in the text as ![]() references.
print(docx_data.text)
```

## Core Feature 2: Converter

The purpose of the Converter is to transform text or images after successful parsing. Currently, it supports converting these formats into Markdown using various model engines.

Currently supported engines are `litellm` and `zhipuai`, with more engines to be supported in the future.

- To see the models supported by Litellm, visit:
  - [https://docs.litellm.ai/docs/providers/](https://docs.litellm.ai/docs/providers/)
- To see the models supported by Zhipuai, visit:
  - [https://open.bigmodel.cn/dev/howuse/model](https://open.bigmodel.cn/dev/howuse/model)

### Text Converter

```python
from wisup_e2m import TextConverter

text_converter = TextConverter(
    engine="litellm",
    api_key="<your api key>",
    model="deepseek/deepseek-chat",
    caching=True,
    cache_type="disk-cache",
)

# Reuse the text parsed from the docx file above.
raw_text = docx_data.text
```

✨ Currently, the Converter only supports the default strategy.  
Under the `default` strategy, the model will first determine the **text type** and **text format** and then perform sequential conversion to ensure the continuity of the generated Markdown text.
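
To give an intuition for what sequential conversion can look like, here is a toy sketch. It is not E2M's implementation: the chunking by character count, the `convert_chunk` callback, and the idea of passing the tail of the previous output as context are all assumptions made for illustration. A real converter would call a model where the stub `fake_convert` is.

```python
from typing import Callable, List


def split_into_chunks(text: str, max_chars: int) -> List[str]:
    """Split text into consecutive pieces of at most max_chars characters."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


def convert_sequentially(
    text: str,
    convert_chunk: Callable[[str, str], str],
    max_chars: int = 1000,
    context_chars: int = 200,
) -> str:
    """Convert chunks in order, giving each call the tail of the output so far."""
    output = ""
    for chunk in split_into_chunks(text, max_chars):
        context = output[-context_chars:]  # empty string for the first chunk
        output += convert_chunk(chunk, context)
    return output


def fake_convert(chunk: str, context: str) -> str:
    """Stub standing in for a model call: it simply upper-cases the chunk."""
    return chunk.upper()


result = convert_sequentially("hello world", fake_convert, max_chars=5)
print(result)
```

Passing the previous output forward is one simple way to keep heading levels and list numbering consistent across chunk boundaries.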

```python
markdown_text = text_converter.convert(raw_text)  # strategy="default"

print(markdown_text)
```

### Image Converter

The Image Converter (`ImageConverter`) uses multimodal large models for recognition and can work with the `surya_layout` engine of the `PdfParser` for more detailed image recognition.

Parser engines whose names end in `layout` do not generate text; their main function is to recognize the layout within page images. They then extract the regions that cannot be converted into text (such as figures) as separate images and mark them accordingly.

```python
import os
from wisup_e2m import PdfParser, ImageConverter

work_dir = os.getcwd()
image_dir = os.path.join(work_dir, "figure")

test_surya_layout_pdf = "./test_surya_layout_pdf.pdf"

# Load the parser with the layout-detection engine.
pdf_parser = PdfParser(engine="surya_layout")

image_converter = ImageConverter(
    engine="litellm",
    api_key="<your api key>",
    model="gpt-4o",
    base_url="<your base url>",
    caching=True,
    cache_type="disk-cache",
)
```

```python
# Parse the PDF with layout analysis.
test_surya_layout_pdf_data = pdf_parser.parse(
    test_surya_layout_pdf,
    start_page=0,
    end_page=20,
    work_dir=work_dir,
    image_dir=image_dir,  # extracted images will be saved to this directory
    relative_path=True,  # whether to save extracted images with relative or absolute paths
)
```

```python
# Inspect the parsed data as a dictionary, then list its keys.
test_surya_layout_pdf_data.to_dict()

test_surya_layout_pdf_data.to_dict().keys()
```

```python
# Check the detected layout of one page of the PDF file.
# The header part of the page is hidden, and the image part is recognized correctly.
from IPython.display import Image, display

image_16 = test_surya_layout_pdf_data.images[16]
print(image_16)

display(Image(image_16))
```


```python
# Then use ImageConverter to convert the images to Markdown.
image_text = image_converter.convert(
    images=test_surya_layout_pdf_data.images,
    attached_images_map=test_surya_layout_pdf_data.attached_images_map,
    work_dir=work_dir,  # image paths will be relative to work_dir; the default is absolute paths
)
```

```python
# Save the converted Markdown (here, "Lecture Notes in Artificial Intelligence") to a file.
with open("Lecture Notes in Artificial Intelligence.md", "w", encoding="utf-8") as f:
    f.write(image_text)
```