Code blocks lose formatting when converting from HTML to markdown

When converting HTML `<pre>` and `<code>` elements to markdown, the current implementation strips newlines and collapses the block into a single inline code span, so the original code structure is lost. This hurts the readability and usability of the generated markdown.

Example:
Original HTML:
```html
<pre><code>
def example():
    print("Hello")
    return True
</code></pre>
```

Current markdown output:
```
`def example(): print("Hello") return True`
```

Expected markdown output:
```python
def example():
    print("Hello")
    return True
```
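
To make the expected behavior concrete, here is a minimal sketch using only Python's standard-library `html.parser`. It is not the project's converter; the class `PreservingConverter` and the function `html_to_markdown` are hypothetical names used for illustration. It assumes the language is advertised through a `language-*` class on `<pre>` or `<code>`. The example HTML above carries no such class, so it would produce an untagged fence.

```python
from html.parser import HTMLParser


class PreservingConverter(HTMLParser):
    """Hypothetical sketch: emit fenced blocks for <pre>, keeping whitespace."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []        # pieces of the markdown output
        self.in_pre = False  # True while inside a <pre> element
        self.language = ""   # taken from a "language-xxx" class, if present
        self.code = []       # raw text collected inside <pre>

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self.in_pre = True
            self.code = []
            self.language = ""
        if self.in_pre and tag in ("pre", "code"):
            classes = (dict(attrs).get("class") or "").split()
            for name in classes:
                if name.startswith("language-"):
                    self.language = name[len("language-"):]

    def handle_endtag(self, tag):
        if tag == "pre" and self.in_pre:
            # Drop only the blank lines around the body, not its indentation.
            body = "".join(self.code).strip("\n")
            fence = "`" * 3
            self.out.append(f"{fence}{self.language}\n{body}\n{fence}\n")
            self.in_pre = False

    def handle_data(self, data):
        if self.in_pre:
            # Keep newlines and indentation verbatim inside <pre>.
            self.code.append(data)
        else:
            # Outside <pre>, collapsing whitespace is harmless.
            self.out.append(" ".join(data.split()))


def html_to_markdown(html: str) -> str:
    parser = PreservingConverter()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

The key design choice is that text inside `<pre>` bypasses the whitespace normalization applied to the rest of the document.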

## Impact
- Code indentation is lost
- Line breaks are removed
- Code blocks become harder to read

**Note: To reproduce, crawl https://www.firecrawl.dev/blog/automated-web-scraping-free-2025**

## Current Workaround
I implemented a custom solution using `markdownify` with a custom pre tag conversion:

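```python
from markdownify import MarkdownConverter


class CustomMarkdownify(MarkdownConverter):
    def convert_pre(self, el, text, convert_as_inline=False):
        # Look for a "language-xxx" class on the <pre> element to tag the fence.
        language = ''
        pre_class = el.get('class', '')
        # The class attribute may arrive as a list or as a plain string.
        if isinstance(pre_class, str):
            pre_class = pre_class.split()
        for class_name in pre_class:
            if class_name.startswith('language-'):
                language = class_name[len('language-'):]
                break
        # Emit the text unchanged inside a fence so newlines and indentation survive.
        return f'```{language}\n{text}\n```\n'
```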

I then extracted the code snippets using a regex:

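```python
import re


def extract_code_snippets(markdown: str) -> list[tuple[str, str]]:
    # Match a fenced block: opening fence line, body, closing fence line.
    pattern = r'\n```.*?\n(.*?)\n```\n'
    snippets = []
    # re.DOTALL lets '.' span newlines so multi-line bodies are captured.
    for match in re.finditer(pattern, markdown, re.DOTALL):
        full_block = match.group(0)  # fences included
        content = match.group(1)     # code only
        snippets.append((full_block, content))
    return snippets
```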
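
One caveat: the pattern above requires a newline on both sides of every block, so a block at the very start or end of the document, or two blocks separated by a single newline, will not be matched correctly. A possible variant that anchors fences at line starts with `re.MULTILINE` is sketched below. The name `extract_code_snippets_multiline` is hypothetical, and the sketch assumes plain three-backtick fences with no nesting.

```python
import re

FENCE = "`" * 3

# An opening fence at a line start, an optional info string, the body,
# then a closing fence alone on its own line.
BLOCK_RE = re.compile(
    rf"^{FENCE}([^\n]*)\n(.*?)^{FENCE}[ \t]*$",
    re.DOTALL | re.MULTILINE,
)


def extract_code_snippets_multiline(markdown: str) -> list[tuple[str, str]]:
    """Return (full_block, content) pairs for every fenced block."""
    snippets = []
    for match in BLOCK_RE.finditer(markdown):
        content = match.group(2)
        # The body capture includes the newline before the closing fence.
        if content.endswith("\n"):
            content = content[:-1]
        snippets.append((match.group(0), content))
    return snippets
```

Because `^` and `$` match at line boundaries, no surrounding newlines are consumed, so adjacent blocks do not interfere with each other.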

This workaround involves:
1. Extracting snippets from both the original markdown and newly converted markdown
2. Using token-based similarity matching to identify corresponding code blocks
3. Replacing the poorly formatted blocks with properly formatted ones
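
The matching step can be sketched roughly as follows. The similarity metric here is an assumption: it uses Jaccard overlap of token sets, and the names `tokens`, `jaccard`, and `match_blocks` as well as the `threshold` value are illustrative rather than taken from my actual code. Since whitespace is ignored during tokenization, a flattened block and its properly formatted counterpart score close to 1.0.

```python
import re


def tokens(code: str) -> set[str]:
    """Split code into words/numbers and single punctuation characters."""
    return set(re.findall(r"\w+|[^\w\s]", code))


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the two token sets (1.0 means identical sets)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def match_blocks(
    flattened: list[str], formatted: list[str], threshold: float = 0.8
) -> dict[int, int]:
    """Map each flattened block index to its most similar formatted block index."""
    pairs: dict[int, int] = {}
    for i, flat in enumerate(flattened):
        scores = [jaccard(flat, good) for good in formatted]
        if not scores:
            continue
        best = max(range(len(scores)), key=scores.__getitem__)
        # Skip weak matches so unrelated blocks are never swapped in.
        if scores[best] >= threshold:
            pairs[i] = best
    return pairs
```

Each matched pair can then be used to replace the flattened block in the original markdown with its formatted counterpart, for example via `str.replace` on the full block text.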

While this solution helps in some cases, it doesn't address all formatting issues and requires additional processing steps that shouldn't be necessary.



I can help implement a fix for this issue. Could you please point me to:

1. The relevant code handling HTML to markdown conversion for code blocks
2. Any specific patterns/conventions to follow

I'll submit a PR with the necessary changes once I have this guidance.
