# Using MarkItDown's Python API: Comprehensive Conversion Examples

MarkItDown is a powerful Python library designed to convert a wide array of file formats into Markdown, facilitating tasks like documentation, indexing, and text analysis. In this article, we'll explore practical examples of using MarkItDown's Python API to convert various types of files. We'll cover the following input files located in the `/kaggle/input/unstructured-files/` directory:

- **Documents**:
  - `/kaggle/input/unstructured-files/data.pdf`
  - `/kaggle/input/unstructured-files/data.docx`
  - `/kaggle/input/unstructured-files/data.xlsx`
  - `/kaggle/input/unstructured-files/data.pptx`

- **Web and Text Content**:
  - `/kaggle/input/unstructured-files/data.html`
  - `/kaggle/input/unstructured-files/data.txt`
  - `/kaggle/input/unstructured-files/data.csv`

- **Media Files**:
  - `/kaggle/input/unstructured-files/data.mp3`
  - `/kaggle/input/unstructured-files/bar.jpg`
  - `/kaggle/input/unstructured-files/tabular.jpg`
  - `/kaggle/input/unstructured-files/bubble.png`
  - `/kaggle/input/unstructured-files/line.png`

Let's dive into each example to understand how to effectively utilize MarkItDown for diverse file conversions.

## Prerequisites

Before we begin, ensure you have the following:

- **Python Environment**: Make sure you have Python 3.10 or later installed, which is the minimum version MarkItDown supports.
- **MarkItDown and OpenAI Libraries**: Install both libraries using `pip` (prefix each command with `!` when running it in a notebook cell):

    ```bash
    pip install -qU openai
    pip install -qU markitdown
    ```

- **Optional Dependencies**:
  - **LLM Integration**: If you intend to use Large Language Models (e.g., OpenAI's GPT-4) for enhanced image descriptions, ensure you have the necessary API clients installed and configured.
  - **Audio Transcription**: For transcribing audio files, install `pydub` and `SpeechRecognition` (imported in Python as `speech_recognition`):

    ```bash
    pip install pydub speechrecognition
    ```

    Also, ensure that `ffmpeg` is installed on your system, as `pydub` relies on it for audio processing.
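
Before running the audio examples, it can help to confirm that the optional pieces are actually present. The following is a minimal sketch using only the standard library; the helper name `check_optional_dependencies` is hypothetical, and it only checks that the modules and the `ffmpeg` executable can be found, not that they work correctly.

```python
import importlib.util
import shutil

# Import names of the optional Python packages mentioned above.
OPTIONAL_MODULES = ["pydub", "speech_recognition"]
# External command-line tool that pydub relies on.
OPTIONAL_TOOLS = ["ffmpeg"]


def check_optional_dependencies(modules=OPTIONAL_MODULES, tools=OPTIONAL_TOOLS):
    """Return a {name: bool} mapping: True if the module or tool was found."""
    status = {}
    for name in modules:
        # find_spec returns None when a top-level module cannot be located.
        status[name] = importlib.util.find_spec(name) is not None
    for name in tools:
        # shutil.which returns None when the executable is not on PATH.
        status[name] = shutil.which(name) is not None
    return status


for name, available in check_optional_dependencies().items():
    print(f"{name}: {'found' if available else 'missing'}")
```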

In [None]:
!pip install -qU openai
!pip install -qU markitdown

## Initialization

First, let's initialize the OpenAI client that MarkItDown will use for LLM-based image descriptions (the code uses `gpt-4o`). The API key is fetched securely with `kaggle_secrets`. Then, let's import the necessary classes and initialize the `MarkItDown` converter. If you plan to use LLM integration for image descriptions, configure the `llm_client` and `llm_model` accordingly.

**Explanation**:

1. **Secure API Key Retrieval**:
   - **`kaggle_secrets`**: Utilizes Kaggle's `UserSecretsClient` to securely fetch your OpenAI API key, ensuring that sensitive information isn't hard-coded or exposed in your scripts.
   
2. **OpenAI Client Initialization**:
   - **`OpenAI`**: Initializes the OpenAI client with the retrieved API key, enabling integration with models like GPT-4 for enhanced functionalities such as image descriptions.
   
3. **MarkItDown Initialization**:
   - **`MarkItDown`**: Initializes the MarkItDown converter with the LLM client and specifies the model to use (`gpt-4o` in this case).
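
The secure key retrieval in step 1 depends on `kaggle_secrets`, which is only available inside Kaggle notebooks. As a sketch of how the same step might be made portable, the hypothetical helper `get_api_key` below falls back to an environment variable when Kaggle Secrets cannot be used. The variable name `OPENAI_API_KEY` and the fallback behavior are assumptions for illustration.

```python
import os


def get_api_key(secret_name="my-openai-api-key", env_var="OPENAI_API_KEY"):
    """Return the API key from Kaggle Secrets, or from an environment variable.

    Hypothetical helper: returns None if neither source provides a key.
    """
    try:
        # Only importable inside a Kaggle notebook.
        from kaggle_secrets import UserSecretsClient
    except ImportError:
        return os.environ.get(env_var)

    try:
        return UserSecretsClient().get_secret(secret_name)
    except Exception:
        # Assumption: any failure to read the secret falls back to the environment.
        return os.environ.get(env_var)
```

The returned value can then be passed as `api_key` when constructing the `OpenAI` client.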

In [None]:
from openai import OpenAI
from markitdown import MarkItDown
from kaggle_secrets import UserSecretsClient

# Fetch API key securely
user_secrets = UserSecretsClient()

# Initialize the OpenAI client
client = OpenAI(api_key=user_secrets.get_secret("my-openai-api-key"))

# Initialize MarkItDown with LLM integration
md_converter = MarkItDown(llm_client=client, llm_model="gpt-4o")

---

## Converting Individual Files
Below are examples of converting each specified file type to Markdown using MarkItDown's Python API.

### PDF Conversion
MarkItDown allows you to effortlessly convert PDF documents into Markdown format, extracting the text content for easy integration into your documentation or analysis workflows.

In [None]:
# Path to the PDF file
pdf_path = "/kaggle/input/unstructured-files/data.pdf"

# Convert PDF to Markdown
pdf_result = md_converter.convert(pdf_path)

# Save the Markdown content to a file
with open("/kaggle/working/data.pdf.md", "w", encoding="utf-8") as f:
    f.write(pdf_result.text_content)

print("PDF conversion completed. Output saved to data.pdf.md")
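
The same convert-then-save pattern applies to every format in this article, so it could be wrapped in a small helper. The sketch below uses a hypothetical function, `convert_and_save`, which accepts any object with a `convert(path)` method whose result has a `text_content` attribute, as the `md_converter` used in this article does. The default output directory is taken from the examples here.

```python
from pathlib import Path


def convert_and_save(converter, input_path, output_dir="/kaggle/working"):
    """Convert one file and write `<original name>.md` into output_dir.

    `converter` is any object exposing convert(path) -> result, where
    result.text_content holds the Markdown string.
    """
    input_path = Path(input_path)
    output_path = Path(output_dir) / f"{input_path.name}.md"

    result = converter.convert(str(input_path))
    output_path.write_text(result.text_content, encoding="utf-8")
    return output_path
```

For example, `convert_and_save(md_converter, "/kaggle/input/unstructured-files/data.pdf")` would write `/kaggle/working/data.pdf.md`.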

### Word Document Conversion

Convert Word documents (`.docx`) to Markdown seamlessly, preserving essential formatting elements like headings and tables for consistency in your documentation.

**Explanation**:
- `mammoth` is used internally to convert `.docx` to HTML, which is then transformed into Markdown.
- The converted content retains formatting elements like headings and tables.

In [None]:
# Path to the DOCX file
docx_path = "/kaggle/input/unstructured-files/data.docx"

# Convert DOCX to Markdown
docx_result = md_converter.convert(docx_path)

# Save the Markdown content to a file
with open("/kaggle/working/data.docx.md", "w", encoding="utf-8") as f:
    f.write(docx_result.text_content)

print("Word document conversion completed. Output saved to data.docx.md")

### Excel Spreadsheet Conversion

Efficiently convert Excel spreadsheets (`.xlsx`) into Markdown tables, facilitating data analysis and reporting within Markdown-supported environments.

**Explanation**:
- Each sheet in the Excel file is converted into its own Markdown table.
- Sheet names are used as section headers in the Markdown document.
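
To make the output shape concrete, here is a hand-rolled illustration of "one section header and one table per sheet". It is not MarkItDown's implementation; the function `sheets_to_markdown` and the sample data are hypothetical, and the actual output's exact heading level and spacing may differ.

```python
def sheets_to_markdown(sheets):
    """Render {sheet_name: rows} as one Markdown section and table per sheet.

    The first row of each sheet is treated as the table header.
    """
    sections = []
    for name, rows in sheets.items():
        header, *body = rows
        lines = [f"## {name}"]
        lines.append("| " + " | ".join(str(cell) for cell in header) + " |")
        lines.append("| " + " | ".join("---" for _ in header) + " |")
        for row in body:
            lines.append("| " + " | ".join(str(cell) for cell in row) + " |")
        sections.append("\n".join(lines))
    return "\n\n".join(sections)


# Hypothetical workbook with two sheets.
example = {
    "Sales": [["Region", "Total"], ["North", 10], ["South", 7]],
    "Costs": [["Item", "Amount"], ["Rent", 3]],
}
print(sheets_to_markdown(example))
```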

In [None]:
# Path to the XLSX file
xlsx_path = "/kaggle/input/unstructured-files/data.xlsx"

# Convert XLSX to Markdown
xlsx_result = md_converter.convert(xlsx_path)

# Save the Markdown content to a file
with open("/kaggle/working/data.xlsx.md", "w", encoding="utf-8") as f:
    f.write(xlsx_result.text_content)

print("Excel spreadsheet conversion completed. Output saved to data.xlsx.md")

### PowerPoint Presentation Conversion

Transform PowerPoint presentations (`.pptx`) into Markdown, extracting slide titles, text, images, tables, and charts for comprehensive documentation.

**Explanation**:
- Slides are processed to extract titles, text, images, tables, and charts.
- Each slide is annotated with its slide number and notes (if available).

In [None]:
# Path to the PPTX file
pptx_path = "/kaggle/input/unstructured-files/data.pptx"

# Convert PPTX to Markdown
pptx_result = md_converter.convert(pptx_path)

# Save the Markdown content to a file
with open("/kaggle/working/data.pptx.md", "w", encoding="utf-8") as f:
    f.write(pptx_result.text_content)

print("PowerPoint presentation conversion completed. Output saved to data.pptx.md")

### HTML Page Conversion

Convert HTML pages (`.html`) to Markdown, preserving the structure and essential elements like headings, links, and images for streamlined documentation.

**Explanation**:
- The HTML content is parsed, and scripts and styles are removed.
- The main content is converted into Markdown, preserving headings, links, and images.
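
The first step, dropping scripts and styles before conversion, can be illustrated with the standard library alone. This is a simplified sketch rather than MarkItDown's own code: the class `VisibleTextExtractor` and function `visible_text` are hypothetical names, and the sketch only collects visible text without producing Markdown headings or links.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect text nodes while ignoring anything inside <script> or <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the list of visible text fragments in document order."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks
```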

In [None]:
# Path to the HTML file
html_path = "/kaggle/input/unstructured-files/data.html"

# Convert HTML to Markdown
html_result = md_converter.convert(html_path)

# Save the Markdown content to a file
with open("/kaggle/working/data.html.md", "w", encoding="utf-8") as f:
    f.write(html_result.text_content)

print("HTML page conversion completed. Output saved to data.html.md")

### Plain Text Conversion

Directly convert plain text files (`.txt`) to Markdown, facilitating the integration of raw text into Markdown-supported platforms with minimal formatting.

**Explanation**:
- Plain text files are directly converted to Markdown by extracting the text content.
- Minimal formatting is applied since the source is plain text.

In [None]:
# Path to the TXT file
txt_path = "/kaggle/input/unstructured-files/data.txt"

# Convert TXT to Markdown
txt_result = md_converter.convert(txt_path)

# Save the Markdown content to a file
with open("/kaggle/working/data.txt.md", "w", encoding="utf-8") as f:
    f.write(txt_result.text_content)

print("Plain text conversion completed. Output saved to data.txt.md")

### CSV Conversion

Convert structured data formats like CSV into Markdown, enabling easy embedding of data tables and formatted content within Markdown documents.

In [None]:
# Path to the CSV file
csv_path = "/kaggle/input/unstructured-files/data.csv"

# Convert CSV to Markdown
csv_result = md_converter.convert(csv_path)

# Save the Markdown content to a file
with open("/kaggle/working/data.csv.md", "w", encoding="utf-8") as f:
    f.write(csv_result.text_content)

print("CSV conversion completed. Output saved to data.csv.md")

### MP3 Audio Conversion

Leverage MarkItDown to convert MP3 audio files into Markdown, extracting metadata and transcribing speech for comprehensive documentation.

**Explanation**:
- **Metadata Extraction**: Extracts metadata like Title, Artist, Album, Genre, and Duration using `exiftool`.
- **Speech Transcription**: Transcribes speech from the audio file using `speech_recognition` if available.
- **Output**: Includes metadata and the transcription in the Markdown file.

In [None]:
# Path to the MP3 file
mp3_path = "/kaggle/input/unstructured-files/data.mp3"

# Convert MP3 to Markdown
mp3_result = md_converter.convert(mp3_path)

# Save the Markdown content to a file
with open("/kaggle/working/data.mp3.md", "w", encoding="utf-8") as f:
    f.write(mp3_result.text_content)

print("MP3 audio conversion completed. Output saved to data.mp3.md")

### Image Conversion

Convert image files (`.jpg`, `.jpeg`, `.png`) to Markdown, extracting metadata and generating detailed descriptions using Large Language Models for enhanced documentation.

**Input Files**:
- `/kaggle/input/unstructured-files/bar.jpg`
- `/kaggle/input/unstructured-files/tabular.jpg`
- `/kaggle/input/unstructured-files/bubble.png`
- `/kaggle/input/unstructured-files/line.png`

**Explanation**:
- **Metadata Extraction**: Extracts metadata like Image Size, Title, Description, etc., using `exiftool`.
- **LLM-Based Description**: Uses a Large Language Model to generate a detailed description of the image.
- **Output**: Includes metadata and the generated description in the Markdown file.

**Repeat the steps below for the other image files (`tabular.jpg`, `bubble.png`, `line.png`) by updating the `image_path` and output filenames accordingly.**

In [None]:
# Path to the image file
image_path = "/kaggle/input/unstructured-files/bar.jpg"

# Convert Image to Markdown with LLM-based description
image_result = md_converter.convert(image_path)

# Save the Markdown content to a file
with open("/kaggle/working/bar.jpg.md", "w", encoding="utf-8") as f:
    f.write(image_result.text_content)

print("Image conversion completed. Output saved to bar.jpg.md")
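
Rather than editing `image_path` by hand for each remaining image, the input and output paths could be generated in one place. The sketch below assumes the directory layout used throughout this article; `image_jobs` is a hypothetical helper that only builds path pairs and performs no conversion itself.

```python
from pathlib import PurePosixPath

# Directories and file names as used in this article.
INPUT_DIR = PurePosixPath("/kaggle/input/unstructured-files")
OUTPUT_DIR = PurePosixPath("/kaggle/working")
IMAGE_NAMES = ["bar.jpg", "tabular.jpg", "bubble.png", "line.png"]


def image_jobs(names=IMAGE_NAMES, input_dir=INPUT_DIR, output_dir=OUTPUT_DIR):
    """Return (input_path, output_path) string pairs, one per image."""
    return [
        (str(input_dir / name), str(output_dir / f"{name}.md"))
        for name in names
    ]


for src, dst in image_jobs():
    print(f"{src} -> {dst}")
```

Each pair can then go through the same convert-and-write steps shown for `bar.jpg`.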

---

## Batch Processing Multiple Files

MarkItDown converts one file per `convert()` call, so batch conversion is a matter of looping over the files. Here's how you can convert all the specified files in one go using a Python script.

### Batch Conversion Script

The script is a batch file conversion tool designed to transform various file formats (PDF, DOCX, Excel, images, etc.) into Markdown format. It uses the `MarkItDown` library with optional LLM integration through OpenAI's API for enhanced conversion capabilities.

Key features:
- Secure API key handling via Kaggle Secrets
- Comprehensive logging
- Error handling for unsupported formats and failed conversions
- Support for multiple input file types
- LLM integration for improved content interpretation

In [None]:
from markitdown import MarkItDown, FileConversionException, UnsupportedFormatException
from openai import OpenAI  # Optional: For LLM integration
from kaggle_secrets import UserSecretsClient
import os
import logging

# Configure logging
logging.basicConfig(
    filename='/kaggle/working/markitdown_conversion.log',
    filemode='w',
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)

# Fetch API key securely
user_secrets = UserSecretsClient()

# Initialize the OpenAI client (replace 'my-openai-api-key' with your actual secret key name)
client = OpenAI(api_key=user_secrets.get_secret("my-openai-api-key"))

# Initialize MarkItDown with LLM integration
md_converter = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o"  # Specify the desired LLM model
)

# Directory containing the input files
input_dir = "/kaggle/input/unstructured-files/"

# Output directory
output_dir = "/kaggle/working/"

# List of input files
input_files = [
    "data.pdf",
    "data.docx",
    "data.xlsx",
    "data.pptx",
    "data.html",
    "data.txt",
    "data.csv",
    "data.mp3",
    "bar.jpg",
    "tabular.jpg",
    "bubble.png",
    "line.png"
]

for file_name in input_files:
    input_path = os.path.join(input_dir, file_name)
    output_file_name = f"{file_name}.md"
    output_path = os.path.join(output_dir, output_file_name)
    
    try:
        # Check if the input file exists
        if not os.path.isfile(input_path):
            logging.warning(f"File not found: {input_path}")
            print(f"Warning: File not found: {file_name}")
            continue
        
        # Convert the file to Markdown
        result = md_converter.convert(input_path)
        
        # Save the result to a Markdown file
        with open(output_path, "w", encoding="utf-8") as f:
            f.write(result.text_content)
        
        logging.info(f"Successfully converted {file_name} to {output_file_name}")
        print(f"Successfully converted {file_name} to {output_file_name}")
        
    except UnsupportedFormatException as ufe:
        logging.error(f"Unsupported format for file {file_name}: {ufe}")
        print(f"Error: Unsupported format for file {file_name}: {ufe}")
        
    except FileConversionException as fce:
        logging.error(f"Conversion failed for file {file_name}: {fce}")
        print(f"Error: Conversion failed for file {file_name}: {fce}")
        
    except Exception as e:
        logging.error(f"Unexpected error for file {file_name}: {e}")
        print(f"Error: Unexpected error for file {file_name}: {e}")

print("\nBatch conversion completed. Check 'markitdown_conversion.log' for details.")
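
The script lists its input files by hand. As a possible alternative, the list could be built by scanning the input directory. The sketch below is an assumption-laden illustration: `discover_input_files` is a hypothetical helper, and `ARTICLE_EXTENSIONS` contains only the extensions used in this article, not everything MarkItDown supports.

```python
from pathlib import Path

# Only the extensions that appear in this article's examples.
ARTICLE_EXTENSIONS = {
    ".pdf", ".docx", ".xlsx", ".pptx", ".html",
    ".txt", ".csv", ".mp3", ".jpg", ".png",
}


def discover_input_files(input_dir, extensions=ARTICLE_EXTENSIONS):
    """Return sorted file names in input_dir whose extension is in `extensions`.

    Subdirectories are ignored; the extension match is case-insensitive.
    """
    return sorted(
        path.name
        for path in Path(input_dir).iterdir()
        if path.is_file() and path.suffix.lower() in extensions
    )
```

The result could stand in for the hard-coded `input_files` list, for example `input_files = discover_input_files(input_dir)`.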

## Handling Exceptions

It's essential to handle potential errors during the conversion process, such as unsupported file formats or corrupted files. The batch conversion script already guards against these cases; it is repeated here in full so that its exception handling and logging can be read in one place.

Key features:
- Detailed exception hierarchy (UnsupportedFormatException, FileConversionException)
- Comprehensive logging with timestamps and severity levels
- File existence validation before conversion
- Clear error messages for different failure scenarios
- Safe file handling with proper encoding
- Progress tracking through console output
- Detailed logging to a separate file for debugging

In [None]:
from markitdown import MarkItDown, FileConversionException, UnsupportedFormatException
from openai import OpenAI  # Optional: For LLM integration
from kaggle_secrets import UserSecretsClient
import os
import logging

# Configure logging
logging.basicConfig(
    filename='/kaggle/working/markitdown_conversion.log',
    filemode='w',
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)

# Fetch API key securely
user_secrets = UserSecretsClient()

# Initialize the OpenAI client (replace 'my-openai-api-key' with your actual secret key name)
client = OpenAI(api_key=user_secrets.get_secret("my-openai-api-key"))

# Initialize MarkItDown with LLM integration
md_converter = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o"  # Specify the desired LLM model
)

# Directory containing the input files
input_dir = "/kaggle/input/unstructured-files/"

# Output directory
output_dir = "/kaggle/working/"

# List of input files
input_files = [
    "data.pdf",
    "data.docx",
    "data.xlsx",
    "data.pptx",
    "data.html",
    "data.txt",
    "data.csv",
    "data.mp3",
    "bar.jpg",
    "tabular.jpg",
    "bubble.png",
    "line.png"
]

for file_name in input_files:
    input_path = os.path.join(input_dir, file_name)
    output_file_name = f"{file_name}.md"
    output_path = os.path.join(output_dir, output_file_name)
    
    try:
        # Check if the input file exists
        if not os.path.isfile(input_path):
            logging.warning(f"File not found: {input_path}")
            print(f"Warning: File not found: {file_name}")
            continue
        
        # Convert the file to Markdown
        result = md_converter.convert(input_path)
        
        # Save the result to a Markdown file
        with open(output_path, "w", encoding="utf-8") as f:
            f.write(result.text_content)
        
        logging.info(f"Successfully converted {file_name} to {output_file_name}")
        print(f"Successfully converted {file_name} to {output_file_name}")
        
    except UnsupportedFormatException as ufe:
        logging.error(f"Unsupported format for file {file_name}: {ufe}")
        print(f"Error: Unsupported format for file {file_name}: {ufe}")
        
    except FileConversionException as fce:
        logging.error(f"Conversion failed for file {file_name}: {fce}")
        print(f"Error: Conversion failed for file {file_name}: {fce}")
        
    except Exception as e:
        logging.error(f"Unexpected error for file {file_name}: {e}")
        print(f"Error: Unexpected error for file {file_name}: {e}")

print("\nBatch conversion completed. Check 'markitdown_conversion.log' for details.")
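
After a run, the log file can be summarized to see at a glance how many files succeeded or failed. The sketch below assumes the `'%(asctime)s - %(levelname)s - %(message)s'` format configured in the script; `summarize_log` is a hypothetical helper, and lines that do not match that format are skipped.

```python
from collections import Counter

# Standard logging level names that may appear in the log.
KNOWN_LEVELS = {"DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"}


def summarize_log(log_path):
    """Count log records per severity level in a file written with the
    'asctime - levelname - message' format."""
    counts = Counter()
    with open(log_path, encoding="utf-8") as log_file:
        for line in log_file:
            # Split into at most three parts so " - " inside the message is kept.
            parts = line.rstrip("\n").split(" - ", 2)
            if len(parts) == 3 and parts[1] in KNOWN_LEVELS:
                counts[parts[1]] += 1
    return counts
```

In this script, `INFO` records correspond to successful conversions, `WARNING` to missing files, and `ERROR` to failed conversions, so `summarize_log("/kaggle/working/markitdown_conversion.log")` gives a quick tally of each.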

---

## Command-Line Usage

MarkItDown provides a straightforward Command-Line Interface (CLI) for converting files to Markdown without writing any code. You can perform conversions directly from your terminal using simple commands.

**Basic Conversion:**

```bash
markitdown path-to-file.pdf > document.md
```

**Specify Output File:**

```bash
markitdown path-to-file.pdf -o document.md
```

**Piping Content:**

```bash
cat path-to-file.pdf | markitdown
```

**Explanation**:
- **Basic Conversion**: Converts `path-to-file.pdf` to Markdown and redirects the output to `document.md`.
- **Specify Output File**: Uses the `-o` flag to specify the name of the output Markdown file.
- **Piping Content**: Allows you to pipe the content of a file directly into MarkItDown for conversion.
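
If the CLI needs to be driven from a Python script, the commands above can be assembled and run with `subprocess`. This is a minimal sketch: `build_markitdown_command` and `run_markitdown` are hypothetical helpers, and they use only the positional input path and the `-o` flag shown above.

```python
import shutil
import subprocess


def build_markitdown_command(input_path, output_path=None):
    """Return the argument list for the markitdown CLI.

    Without output_path, the Markdown is written to standard output.
    """
    command = ["markitdown", str(input_path)]
    if output_path is not None:
        command += ["-o", str(output_path)]
    return command


def run_markitdown(input_path, output_path=None):
    """Run the CLI and return the CompletedProcess; raise if it is not installed."""
    if shutil.which("markitdown") is None:
        raise FileNotFoundError("The 'markitdown' command was not found on PATH.")
    return subprocess.run(
        build_markitdown_command(input_path, output_path),
        check=True,           # raise CalledProcessError on a non-zero exit code
        capture_output=True,  # collect stdout/stderr instead of printing them
        text=True,
    )
```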


---

## Conclusion
MarkItDown offers a robust and flexible solution for converting a diverse range of file formats into Markdown. Whether you're dealing with documents, spreadsheets, presentations, media files, or web content, MarkItDown streamlines the conversion process, saving you time and effort.

**Key Takeaways**:
- **Versatility**: Supports numerous file formats, making it suitable for various use cases.
- **Ease of Use**: Simple API allows for straightforward integration into Python projects.
- **Extensibility**: Easily add support for new formats or integrate advanced features like LLM-based image descriptions.
- **Batch Processing**: Convert multiple files in a single run with comprehensive logging and error handling.

By incorporating MarkItDown into your workflow, you can enhance productivity, maintain consistency across documentation, and leverage the power of Markdown for your projects. Whether you're a developer, data scientist, or content creator, MarkItDown is an invaluable tool in your arsenal.