Code blocks lose formatting when converting from HTML to markdown #325

@AswanthManoj

Description

When converting HTML <pre> and <code> tags to markdown, the current implementation strips newlines and indentation from the code, so the original code structure is lost. This hurts the readability and usability of the generated markdown.

Example:
Original HTML:

<pre><code>
def example():
    print("Hello")
    return True
</code></pre>

Current markdown output:

def example(): print("Hello") return True


Expected markdown output:
```python
def example():
    print("Hello")
    return True
```

Impact

  • Code indentation is lost
  • Line breaks are removed
  • Code blocks become harder to read
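
For reference, the expected behavior can be sketched without any third-party dependency. This is a minimal sketch, not the project's actual converter: it uses Python's standard html.parser, the names PreBlockExtractor and pre_to_fenced are hypothetical, and it assumes the language is advertised as a language-* class on the <pre> or <code> tag.

```python
from html.parser import HTMLParser

FENCE = "`" * 3  # markdown code fence


class PreBlockExtractor(HTMLParser):
    """Collect the text of each <pre> element without touching its whitespace."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []      # list of (language, code) tuples
        self._depth = 0       # > 0 while inside a <pre>
        self._language = ""
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self._depth += 1
        # Assumption: the language is a "language-xxx" class on <pre> or <code>.
        if self._depth and tag in ("pre", "code") and not self._language:
            classes = (dict(attrs).get("class") or "").split()
            for name in classes:
                if name.startswith("language-"):
                    self._language = name[len("language-"):]
                    break

    def handle_endtag(self, tag):
        if tag == "pre" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                # Drop only the newlines that pad the block; keep indentation.
                code = "".join(self._chunks).strip("\n")
                self.blocks.append((self._language, code))
                self._language, self._chunks = "", []

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)  # kept verbatim, newlines included


def pre_to_fenced(html: str) -> list[str]:
    """Return one fenced markdown block per <pre> element in the HTML."""
    parser = PreBlockExtractor()
    parser.feed(html)
    parser.close()
    return [f"{FENCE}{lang}\n{code}\n{FENCE}" for lang, code in parser.blocks]
```

Calling pre_to_fenced on the example HTML above, with class="language-python" added to the <code> tag, should produce the expected fenced block with indentation and line breaks intact.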

Note: To reproduce, crawl the URL "https://www.firecrawl.dev/blog/automated-web-scraping-free-2025".

Current Workaround

I implemented a workaround using markdownify with a custom conversion for <pre> tags:

from markdownify import MarkdownConverter


class CustomMarkdownify(MarkdownConverter):
    def convert_pre(self, el, text, convert_as_inline=False):
        # Use a "language-xxx" class on the <pre> tag as the fence info string.
        language = ''
        pre_class = el.get('class', '')
        # The class attribute may be a plain string or a list of class names.
        if isinstance(pre_class, str):
            pre_class = pre_class.split()
        for class_name in pre_class:
            if class_name.startswith('language-'):
                language = class_name[len('language-'):]
                break
        # An empty language yields a plain fence.
        return f'```{language}\n{text}\n```\n'

I then extracted the code snippets using a regex:

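import re


def extract_code_snippets(markdown: str) -> list[tuple[str, str]]:
    # Match fenced blocks that are preceded and followed by a newline;
    # group 1 captures the code between the fences.
    pattern = r'\n```.*?\n(.*?)\n```\n'
    snippets = []
    for match in re.finditer(pattern, markdown, re.DOTALL):
        full_block = match.group(0)  # fence lines plus surrounding newlines
        content = match.group(1)     # code only
        snippets.append((full_block, content))
    return snippets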
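
One caveat with the pattern above: because it requires a literal newline before the opening fence and after the closing fence, it skips a block at the very start or end of the document, and the second of two back-to-back blocks. A possible variant, offered only as a sketch (extract_code_snippets_anchored is a hypothetical name), anchors the fences to line boundaries with re.MULTILINE instead:

```python
import re

FENCE = "`" * 3  # markdown code fence

# "^" and "$" match at line boundaries under re.MULTILINE, so no surrounding
# newline characters are consumed by the match.
BLOCK_RE = re.compile(
    rf"^{FENCE}[^\n]*\n(.*?)\n{FENCE}[ \t]*$",
    re.DOTALL | re.MULTILINE,
)


def extract_code_snippets_anchored(markdown: str) -> list[tuple[str, str]]:
    """Return (full_block, content) pairs for every fenced block."""
    return [(m.group(0), m.group(1)) for m in BLOCK_RE.finditer(markdown)]
```

Here full_block no longer includes the surrounding newlines, which also makes it safer to use as a search-and-replace key later.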

This workaround involves:

  1. Extracting snippets from both the original markdown and newly converted markdown
  2. Using token-based similarity matching to identify corresponding code blocks
  3. Replacing the poorly formatted blocks with properly formatted ones
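
The three steps above can be sketched roughly as follows. The issue does not say which similarity measure was used, so this sketch assumes Jaccard similarity over whitespace-insensitive tokens; the function names and the 0.8 threshold are likewise hypothetical. Both inputs are lists of (full_block, content) pairs, the shape returned by extract_code_snippets.

```python
import re


def tokens(code: str) -> set[str]:
    """Split code into identifiers/numbers and punctuation, ignoring whitespace."""
    return set(re.findall(r"\w+|[^\w\s]", code))


def jaccard(a: str, b: str) -> float:
    """Token-set overlap in [0, 1]; whitespace differences do not matter."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def restore_code_blocks(
    markdown: str,
    flattened: list[tuple[str, str]],   # (full_block, content) from the bad markdown
    formatted: list[tuple[str, str]],   # (full_block, content) from the good markdown
    threshold: float = 0.8,             # assumed cutoff, tune as needed
) -> str:
    """Replace each flattened block with its best-matching formatted block."""
    for bad_block, bad_code in flattened:
        best = max(formatted, key=lambda pair: jaccard(bad_code, pair[1]), default=None)
        if best is not None and jaccard(bad_code, best[1]) >= threshold:
            markdown = markdown.replace(bad_block, best[0], 1)
    return markdown
```

Because tokenization discards whitespace, a flattened block and its correctly formatted counterpart share the same token set and score 1.0, while unrelated blocks score low and are left untouched.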

While this solution helps in some cases, it doesn't address all formatting issues and requires additional processing steps that shouldn't be necessary.

I can help implement a fix for this issue. Could you please point me to:

  1. The relevant code handling HTML to markdown conversion for code blocks
  2. Any specific patterns/conventions to follow

I'll submit a PR with the necessary changes once I have this guidance.

Labels

🐞 Bug: Something isn't working
