When converting HTML `<pre>` and `<code>` tags to markdown, the current implementation strips newlines and indentation from code blocks, so the original code structure is not preserved. This hurts the readability and usability of the generated markdown.
Example:
Original HTML:
<pre><code>
def example():
    print("Hello")
    return True
</code></pre>
Current markdown output:
def example(): print("Hello") return True
Expected markdown output:
```python
def example():
    print("Hello")
    return True
```
Impact
- Code indentation is lost
- Line breaks are removed
- Code blocks become harder to read
Note: To reproduce, crawl "https://www.firecrawl.dev/blog/automated-web-scraping-free-2025".
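To make the expected behavior concrete, here is a minimal sketch that uses only Python's standard `html.parser`. It is not the project's converter; the names `PreBlockExtractor` and `pre_to_fenced` are hypothetical, and it assumes the language, when present, is advertised as a `language-xxx` class on the `<pre>` or `<code>` tag. The point is only that text inside `<pre>` is copied verbatim, with no whitespace normalization.

```python
from html.parser import HTMLParser


class PreBlockExtractor(HTMLParser):
    """Collect the raw text of each <pre> element with whitespace intact."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []     # list of (language, code) tuples
        self._depth = 0      # greater than 0 while inside a <pre>
        self._language = ""
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self._depth += 1
            if self._depth == 1:
                self._language, self._chunks = "", []
        if self._depth and tag in ("pre", "code") and not self._language:
            # Assumption: the language is given as a "language-xxx" class.
            for name in (dict(attrs).get("class") or "").split():
                if name.startswith("language-"):
                    self._language = name[len("language-"):]
                    break

    def handle_endtag(self, tag):
        if tag == "pre" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                # Trim only the newlines that pad the block, never indentation.
                code = "".join(self._chunks).strip("\n")
                self.blocks.append((self._language, code))

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)  # keep newlines and indentation as-is


def pre_to_fenced(html: str) -> list[str]:
    """Return one fenced markdown block per <pre> element found in html."""
    parser = PreBlockExtractor()
    parser.feed(html)
    parser.close()
    return [f"```{lang}\n{code}\n```" for lang, code in parser.blocks]
```

Feeding it the example HTML above (with properly indented source and a `language-python` class on the `<code>` tag) should yield the expected fenced block, with line breaks and indentation preserved.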
Current Workaround
I implemented a custom solution using `markdownify` with a custom `pre` tag conversion:
from markdownify import MarkdownConverter

class CustomMarkdownify(MarkdownConverter):
    def convert_pre(self, el, text, convert_as_inline=False):
        # Look for a "language-xxx" class on the <pre> element.
        language = ''
        pre_class = el.get('class', '')
        # BeautifulSoup usually returns a list of class names; normalize a plain string.
        if isinstance(pre_class, str):
            pre_class = pre_class.split()
        for class_name in pre_class:
            if class_name.startswith('language-'):
                language = class_name.replace('language-', '', 1)
                break
        # An empty language simply produces a fence without an info string.
        return f'```{language}\n{text}\n```\n'
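I then extracted code snippets using a regex:
import re

def extract_code_snippets(markdown: str) -> list[tuple[str, str]]:
    # Match fenced blocks; group 1 is the code between the fences.
    # Note: a block is only matched when a newline precedes and follows it.
    pattern = r'\n```.*?\n(.*?)\n```\n'
    snippets = []
    for match in re.finditer(pattern, markdown, re.DOTALL):
        full_block = match.group(0)  # fences included
        content = match.group(1)     # code only
        snippets.append((full_block, content))
    return snippets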
This workaround involves:
- Extracting snippets from both the original markdown and newly converted markdown
- Using token-based similarity matching to identify corresponding code blocks
- Replacing the poorly formatted blocks with properly formatted ones
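The three steps above can be sketched roughly as follows. This is an illustration rather than my exact code: the helper names (`tokens`, `jaccard`, `replace_matching_blocks`), the choice of Jaccard similarity over whitespace-insensitive tokens, and the `0.8` threshold are all assumptions. Both snippet lists are assumed to be `(full_block, content)` tuples as returned by `extract_code_snippets`.

```python
import re


def tokens(code: str) -> set[str]:
    """Split code into word and punctuation tokens, ignoring all whitespace."""
    return set(re.findall(r"\w+|[^\w\s]", code))


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the token sets of two code strings."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def replace_matching_blocks(crawled_md, original_snippets, fixed_snippets,
                            threshold=0.8):
    """Swap each flattened block in crawled_md for its best well-formed match.

    original_snippets: (full_block, content) tuples from the crawled markdown.
    fixed_snippets:    (full_block, content) tuples from the custom converter.
    Blocks whose best match scores below threshold are left untouched.
    """
    result = crawled_md
    for bad_block, bad_content in original_snippets:
        best = max(fixed_snippets,
                   key=lambda s: jaccard(bad_content, s[1]),
                   default=None)
        if best is not None and jaccard(bad_content, best[1]) >= threshold:
            # Replace only the first occurrence to avoid touching duplicates.
            result = result.replace(bad_block, best[0], 1)
    return result
```

Because flattening only changes whitespace, a flattened block and its well-formed counterpart share the same token set, so they score close to 1.0 while unrelated blocks score near 0. The extra pass over every block is exactly the kind of post-processing that should not be needed.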
While this solution helps in some cases, it doesn't address all formatting issues and requires additional processing steps that shouldn't be necessary.
I can help implement a fix for this issue. Could you please point me to:
- The relevant code handling HTML to markdown conversion for code blocks
- Any specific patterns/conventions to follow
I'll submit a PR with the necessary changes once I have this guidance.