# Example without LLM

First, let's try the URL-to-Markdown converter without an LLM, so that only the standard HTML-to-Markdown conversion runs.

## Medium Article

In [1]:
from doctomarkdown import DocToMarkdown
from dotenv import load_dotenv
load_dotenv()

app = DocToMarkdown()

# Convert Medium article
result = app.convert_url_to_markdown(
    urlpath="https://medium.com/the-ai-forum/build-a-local-reliable-rag-agent-using-crewai-and-groq-013e5d557bcd",
    extract_images=True,
    extract_tables=True,
    output_path="markdown_output"
)

# Display first 500 chars to preview
for page in result.pages:
    print(f"Page Number: {page.page_number}")
    print(f"Content Preview: {page.page_content[:500]}...")
    print(f"Total Length: {len(page.page_content)} characters")

Page Number: 1
Content Preview: # Build A Local Reliable RAG Agent Using CrewAI And Groq | by Plaban Nayak | The AI Forum | Medium

# Build A Local Reliable RAG Agent Using CrewAI And Groq

[![Plaban Nayak](https://miro.medium.com/v2/resize:fill:64:64/1*oFXd8MlaJnMFie2YKsWB_Q.jpeg)](</@nayakpplaban?source=post_page---byline--013e5d557bcd--------------------------------------->)

[Plaban Nayak](</@nayakpplaban?source=post_page---byline--013e5d557bcd--------------------------------------->)

Follow

36 min read

·

Jun 16, 2024
...
Total Length: 69327 characters
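
The preview loop above is repeated in every cell of this notebook, so it can be factored into a small helper. The sketch below is an illustration, not part of doctomarkdown: `Page` is a hypothetical stand-in for the objects in `result.pages`, and only the `page_number` and `page_content` attributes used above are assumed.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class Page:
    """Hypothetical stand-in for the page objects found in `result.pages`."""
    page_number: int
    page_content: Optional[str]


def preview_pages(pages: Iterable[Page], limit: int = 500) -> List[str]:
    """Return one summary string per page, truncating content to `limit` characters."""
    summaries = []
    for page in pages:
        content = page.page_content or ""
        # Append an ellipsis only when the content was actually cut short.
        suffix = "..." if len(content) > limit else ""
        summaries.append(
            f"Page Number: {page.page_number}\n"
            f"Content Preview: {content[:limit]}{suffix}\n"
            f"Total Length: {len(content)} characters"
        )
    return summaries


if __name__ == "__main__":
    for summary in preview_pages([Page(1, "# Title\n\nBody text")], limit=7):
        print(summary)
```

With a real conversion result, `print("\n".join(preview_pages(result.pages)))` should give the same kind of preview as the loop above, assuming the page objects expose those two attributes.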


# Example with Wikipedia (Non-LLM)

Let's try converting a Wikipedia article without using an LLM.

In [2]:
from doctomarkdown import DocToMarkdown
from dotenv import load_dotenv
load_dotenv()

app = DocToMarkdown()

# Convert Wikipedia article
result = app.convert_url_to_markdown(
    urlpath="https://en.wikipedia.org/wiki/India",
    extract_images=True,
    extract_tables=True,
    output_path="markdown_output"
)

# Display first 500 chars to preview
for page in result.pages:
    print(f"Page Number: {page.page_number}")
    print(f"Content Preview: {page.page_content[:500]}...")
    print(f"Total Length: {len(page.page_content)} characters")

Page Number: 1
Content Preview: # India - Wikipedia

## Etymology

## History

## Geography

## Biodiversity

## Politics and government

## Foreign, economic, and strategic relations

## Economy

## Demographics, languages, and religion

## Culture

## See also

## Notes

## References

## Bibliography

## External links

### Ancient India

### Medieval India

### Early modern India

### Modern India

### Politics

### Government

### Administrative divisions

### Industries

### Energy

### Socio-economic challenges

### Vis...
Total Length: 184589 characters
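
In the preview above, the `##` headings appear as one run before the `###` headings, with no body text in between. One quick way to inspect what the converter produced is to list the headings of the Markdown in order. The helper below is a minimal sketch using only the standard library; `heading_outline` is a hypothetical name, and it does not skip `#` lines inside fenced code blocks.

```python
import re
from typing import List, Tuple

# ATX-style headings: 1 to 6 '#' characters, whitespace, then the title.
HEADING_RE = re.compile(r"^(#{1,6})[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def heading_outline(markdown: str) -> List[Tuple[int, str]]:
    """Return (level, title) pairs for every heading, in document order."""
    return [(len(m.group(1)), m.group(2)) for m in HEADING_RE.finditer(markdown)]


if __name__ == "__main__":
    sample = "# India - Wikipedia\n\n## Etymology\n\n### Ancient India\n"
    for level, title in heading_outline(sample):
        print("  " * (level - 1) + title)
```

For the result above, `heading_outline(result.pages[0].page_content)` would show whether the headings come before the article text or are interleaved with it.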


# Example using LLM

Now, let's try the URL-to-Markdown converter with an LLM (here `gemma2-9b-it`, served through Groq).
The LLM is meant to format and clean the extracted content.
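
The LLM cells read the API key with `os.environ.get("GROQ_API_KEY")`, which returns `None` when the variable is missing. A small guard like the one sketched here fails early with a clearer message; `require_env` is a hypothetical helper, not part of `groq` or `doctomarkdown`.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; add it to your .env file or shell."
        )
    return value
```

It could be used as `api_key=require_env("GROQ_API_KEY")` after `load_dotenv()` has run.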

In [3]:
from groq import Groq
from doctomarkdown import DocToMarkdown
from dotenv import load_dotenv
import os
load_dotenv()

client_groq = Groq(
    api_key=os.environ.get("GROQ_API_KEY"),
)
app = DocToMarkdown(
    llm_client=client_groq,
    llm_model='gemma2-9b-it'
)

# Convert Medium article with LLM
result = app.convert_url_to_markdown(
    urlpath="https://medium.com/@gsayantan1999/inferential-statistics-types-of-hypothesis-testing-207bd345b6b3",
    output_path="markdown_output"
)

# Display first 500 chars to preview
for page in result.pages:
    print(f"Page Number: {page.page_number}")
    print(f"Content Preview: {page.page_content[:500]}...")
    print(f"Total Length: {len(page.page_content)} characters")

Extraction failed retry left over exception Error code: 400 - {'error': {'message': "'messages.1' : for 'role:user' the following must be satisfied[('messages.1.content' : one of the following must be satisfied[('messages.1.content' : value must be a string) OR ('messages.1.content.0' : one of the following must be satisfied[('messages.1.content.0.text' : Value is not nullable) OR ('messages.1.content.0.type' : value is not one of the allowed values ['image_url'])])])]", 'type': 'invalid_request_error'}} : 2
Extraction failed retry left over exception Error code: 400 - {'error': {'message': "'messages.1' : for 'role:user' the following must be satisfied[('messages.1.content' : one of the following must be satisfied[('messages.1.content' : value must be a string) OR ('messages.1.content.0' : one of the following must be satisfied[('messages.1.content.0.text' : Value is not nullable) OR ('messages.1.content.0.type' : value is not one of the allowed values ['image_url'])])])]", 'type': 'i

Page Number: 1
Content Preview: # Inferential Statistics : Types of Hypothesis testing | by Sayantan Ghosh | May, 2025 | Medium

```
None
```
ns--207bd345b6b3---------------------bookmark_footer------------------>)

Listen

Share

Hypothesis testing is a fundamental concept in statistics, enabling data scientists and researchers to make informed decisions based on sample data. **It involves evaluating a hypothesis about a population parameter using sample data and determining the likelihood that the hypothesis is true.** This ...
Total Length: 7845 characters
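
In the output above, the extraction was retried and failed, and the page content contains a fenced block whose only content is `None`, apparently where the LLM output would have gone. A heuristic check for that placeholder is sketched below; `contains_none_block` is a hypothetical helper, and treating the `None` block as a failure marker is an assumption based on this output only.

```python
import re

# A fenced block (optionally with an info string) whose body is just "None".
NONE_BLOCK_RE = re.compile(r"^```[^\n]*\n\s*None\s*\n```[ \t]*$", re.MULTILINE)


def contains_none_block(markdown: str) -> bool:
    """Return True if the Markdown contains a fenced block holding only the text 'None'."""
    return NONE_BLOCK_RE.search(markdown) is not None


if __name__ == "__main__":
    failed = "# Title\n\n```\nNone\n```\nrest of the page"
    print(contains_none_block(failed))
```

Running `contains_none_block(page.page_content)` on each page is one way to flag results where the LLM step likely did not contribute.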


# Example with Wikipedia (LLM)

Next, let's try converting a Wikipedia article using the LLM for better formatting.

In [4]:
from groq import Groq
from doctomarkdown import DocToMarkdown
from dotenv import load_dotenv
import os
load_dotenv()

client_groq = Groq(
    api_key=os.environ.get("GROQ_API_KEY"),
)
app = DocToMarkdown(
    llm_client=client_groq,
    llm_model='gemma2-9b-it'
)

# Convert Wikipedia article with LLM
result = app.convert_url_to_markdown(
    urlpath="https://en.wikipedia.org/wiki/Summer_Olympic_Games",
    output_path="markdown_output"
)

# Display first 500 chars to preview
for page in result.pages:
    print(f"Page Number: {page.page_number}")
    print(f"Content Preview: {page.page_content[:500]}...")
    print(f"Total Length: {len(page.page_content)} characters")

Extraction failed retry left over exception Error code: 400 - {'error': {'message': "'messages.1' : for 'role:user' the following must be satisfied[('messages.1.content' : one of the following must be satisfied[('messages.1.content' : value must be a string) OR ('messages.1.content.0' : one of the following must be satisfied[('messages.1.content.0.text' : Value is not nullable) OR ('messages.1.content.0.type' : value is not one of the allowed values ['image_url'])])])]", 'type': 'invalid_request_error'}} : 2
Extraction failed retry left over exception Error code: 400 - {'error': {'message': "'messages.1' : for 'role:user' the following must be satisfied[('messages.1.content' : one of the following must be satisfied[('messages.1.content' : value must be a string) OR ('messages.1.content.0' : one of the following must be satisfied[('messages.1.content.0.text' : Value is not nullable) OR ('messages.1.content.0.type' : value is not one of the allowed values ['image_url'])])])]", 'type': 'i

Page Number: 1
Content Preview: # Summer Olympic Games - Wikipedia

```
None
```
al multi-sport event of its kind, organised by the [International Olympic Committee](</wiki/International_Olympic_Committee> "International Olympic Committee") (IOC) founded by [Pierre de Coubertin](</wiki/Pierre_de_Coubertin> "Pierre de Coubertin").[1] The tradition of awarding medals began in [1904](</wiki/1904_Summer_Olympics> "1904 Summer Olympics"); in each [Olympic](</wiki/Olympic_Games> "Olympic Games") event, [gold medals](</wiki/Gold_meda...
Total Length: 108595 characters
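
The 400 error above says that the user message's `content` must be either a string or a list of parts in which a text part has a non-null `text`. The sketch below checks a `content` value against those constraints before a request is sent. It is one reading of the error message, not doctomarkdown's actual request-building code, which is not shown in this notebook; `validate_user_content` is a hypothetical name.

```python
from typing import Any


def validate_user_content(content: Any) -> None:
    """Raise ValueError if `content` would violate the constraints quoted in the 400 error."""
    if isinstance(content, str):
        return  # A plain string is always acceptable.
    if not isinstance(content, list) or not content:
        raise ValueError("content must be a string or a non-empty list of parts")
    for i, part in enumerate(content):
        if not isinstance(part, dict):
            raise ValueError(f"content[{i}] must be a dict")
        kind = part.get("type")
        if kind == "text":
            # This mirrors the "'messages.1.content.0.text' : Value is not nullable" clause.
            if not isinstance(part.get("text"), str):
                raise ValueError(f"content[{i}].text must be a non-null string")
        elif kind != "image_url":
            raise ValueError(f"content[{i}].type {kind!r} is not supported")


if __name__ == "__main__":
    validate_user_content("Clean up this Markdown.")  # passes silently
    try:
        validate_user_content([{"type": "text", "text": None}])
    except ValueError as exc:
        print(exc)
```

A text part whose `text` is `None` would fail this check, which is consistent with the error reported by the API.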


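# Example with LangChain Documentation (LLM)

The same LLM-backed conversion can also be run on a documentation page.

In [5]: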
from groq import Groq
from doctomarkdown import DocToMarkdown
from dotenv import load_dotenv
import os
load_dotenv()

client_groq = Groq(
    api_key=os.environ.get("GROQ_API_KEY"),
)
app = DocToMarkdown(
    llm_client=client_groq,
    llm_model='gemma2-9b-it'
)

# Convert LangChain documentation page with LLM
result = app.convert_url_to_markdown(
    urlpath="https://python.langchain.com/docs/introduction/",
    output_path="markdown_output"
)

# Display first 500 chars to preview
for page in result.pages:
    print(f"Page Number: {page.page_number}")
    print(f"Content Preview: {page.page_content[:500]}...")
    print(f"Total Length: {len(page.page_content)} characters")

Extraction failed retry left over exception Error code: 400 - {'error': {'message': "'messages.1' : for 'role:user' the following must be satisfied[('messages.1.content' : one of the following must be satisfied[('messages.1.content' : value must be a string) OR ('messages.1.content.0' : one of the following must be satisfied[('messages.1.content.0.text' : Value is not nullable) OR ('messages.1.content.0.type' : value is not one of the allowed values ['image_url'])])])]", 'type': 'invalid_request_error'}} : 2
Extraction failed retry left over exception Error code: 400 - {'error': {'message': "'messages.1' : for 'role:user' the following must be satisfied[('messages.1.content' : one of the following must be satisfied[('messages.1.content' : value must be a string) OR ('messages.1.content.0' : one of the following must be satisfied[('messages.1.content.0.text' : Value is not nullable) OR ('messages.1.content.0.type' : value is not one of the allowed values ['image_url'])])])]", 'type': 'i

Page Number: 1
Content Preview: # Introduction | 🦜️🔗 LangChain

```
None
```
eady APIs and Assistants with [LangGraph Platform](<https://langchain-ai.github.io/langgraph/cloud/>).

![Diagram outlining the hierarchical organization of the LangChain framework, displaying the interconnected parts across multiple layers.](/svg/langchain_stack_112024.svg)![Diagram outlining the hierarchical organization of the LangChain framework, displaying the interconnected parts across multiple layers.](/svg/langchain_stack_112024_dark.svg)

La...
Total Length: 6052 characters
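
Since the LLM step failed in all three runs above without raising an exception, one option is to fall back to the plain conversion whenever the LLM result looks unusable. The sketch below is a generic pattern with injected callables, so it makes no assumptions about doctomarkdown internals; `convert_with_fallback`, `llm_convert`, `plain_convert`, and `looks_failed` are all hypothetical names.

```python
from typing import Callable, Tuple


def convert_with_fallback(
    url: str,
    llm_convert: Callable[[str], str],
    plain_convert: Callable[[str], str],
    looks_failed: Callable[[str], bool],
) -> Tuple[str, str]:
    """Try the LLM-backed conversion first; fall back to the plain one on error or bad output.

    Returns (markdown, source), where source is "llm" or "plain".
    """
    try:
        markdown = llm_convert(url)
    except Exception:
        # The LLM path raised; use the plain converter instead.
        return plain_convert(url), "plain"
    if looks_failed(markdown):
        # The LLM path returned, but its output is flagged as unusable.
        return plain_convert(url), "plain"
    return markdown, "llm"


if __name__ == "__main__":
    md, source = convert_with_fallback(
        "https://example.com",
        llm_convert=lambda u: "```\nNone\n```",
        plain_convert=lambda u: "# Plain conversion",
        looks_failed=lambda m: "None" in m,
    )
    print(source)
```

In this notebook, the two converters could wrap `convert_url_to_markdown` on a `DocToMarkdown` instance created with and without `llm_client`, and `looks_failed` could check for the `None` block seen in the outputs.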
