Converting non-english anchor tags leads to "-x" values (or umlauts are replaced) #534

Closed
HardMax71 opened this issue Nov 6, 2023 · 2 comments · Fixed by #535

Comments

@HardMax71

HardMax71 commented Nov 6, 2023

Describe the bug
When converting non-English UTF-8 text with anchor tags to HTML, the generated header ids are "-1", "-2", ... instead of either an error being thrown or an id being generated from the Russian text. In the case of German (not tested with other languages that use umlauts), umlauts (ä, ö, ü, ...) are replaced with their base letters (a, o, u, ...) in the ids.

To Reproduce
With english text:

# main.py
import markdown2
help_text = '''
# Header

## Table of Contents
1. [Getting Started](#getting-started)

### Getting Started {#}
To begin using the application, launch `main.py`.
'''

help_text_html = markdown2.markdown(help_text, extras=['header-ids'])
print(help_text_html)

Result (all ok):

<h1 id="header">Header</h1>

<h2 id="table-of-contents">Table of Contents</h2>

<ol>
<li><a href="#getting-started">Getting Started</a></li>
</ol>

<h3 id="getting-started">Getting Started {#}</h3>

<p>To begin using the application, launch <code>main.py</code>.</p>

With Russian text, encoding - UTF-8:

import markdown2
help_text = '''
# Руководство 

## Содержание
1. [Начало работы](#начало-работы)

### Начало работы {#}
Для начала работы запустите `main.py`.
'''

help_text_html = markdown2.markdown(help_text, extras=['header-ids'])
print(help_text_html)

Output (the ids come out as "-1", "-2", "-3"):

<h1 id="-1">Руководство</h1>

<h2 id="-2">Содержание</h2>

<ol>
<li><a href="#начало-работы">Начало работы</a></li>
</ol>

<h3 id="-3">Начало работы {#}</h3>

<p>Для начала работы запустите <code>main.py</code>.</p>

With German text, encoding - UTF-8 (umlauts are replaced in the ids):
(I had to change the text a bit because the German translation of the text above doesn't contain any umlauts.)

import markdown2
help_text = '''
## Handbuch 

## Inhalt
1. [ü-umlaut-test-encoding](#ü-umlaut-test-encoding)

### ü-umlaut-test-encoding {#}
Führen Sie `main.py` aus, um loszulegen.
'''

help_text_html = markdown2.markdown(help_text, extras=['header-ids'])
print(help_text_html)

Output:

<h2 id="handbuch">Handbuch</h2>

<h2 id="inhalt">Inhalt</h2>

<ol>
<li><a href="#ü-umlaut-test-encoding">ü-umlaut-test-encoding</a></li>
</ol>

<h3 id="u-umlaut-test-encoding">ü-umlaut-test-encoding {#}</h3>

<p>Führen Sie <code>main.py</code> aus, um loszulegen.</p>

Expected behavior
If only ASCII is supported, I would like to see an error thrown with a description along the lines of "Unsupported character at position XYZ". I would also expect a warning and/or an error in the German case, where "ü" is preserved in the text and in the link (#ü-umlaut-test-encoding), BUT not in the id: <h3 id="**u**-umlaut-test-encoding">

Debug info
markdown2 version = 2.4.10

Any extras being used: 'header-ids'

HardMax71 changed the title from 'Converting non-english anchor tags leads to "-x" values' to 'Converting non-english anchor tags leads to "-x" values (or umlauts are replaced)' on Nov 6, 2023
@Crozzers
Contributor

Crozzers commented Nov 6, 2023

Seems to be a problem in the _slugify function where we encode all chars as ascii and ignore all errors.

value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore').decode()

Git blame shows this was last touched in April 2012, so I guess this was a compatibility limitation at the time? The wiki page also explicitly says header IDs are ASCII.
@nicholasserra can you see any issues with bumping this up to utf-8?
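
To illustrate why the quoted line produces the results reported above, here is a minimal standalone sketch. The helper names `ascii_slug` and `unicode_slug` are hypothetical, and the two regular expressions are assumptions that only approximate what a slugify function typically does; this is not the library's actual `_slugify`, and `unicode_slug` is just one possible Unicode-preserving variant, not necessarily the change made for this issue.

```python
import re
import unicodedata

# Assumed patterns for illustration only (not taken from markdown2).
STRIP_RE = re.compile(r'[^\w\s-]')   # drop anything that is not a word char, space or hyphen
HYPHENATE_RE = re.compile(r'[-\s]+')  # collapse runs of spaces/hyphens into one hyphen


def ascii_slug(value: str) -> str:
    """Hypothetical helper: slugify using the ASCII-only normalization quoted above."""
    # NFKD splits "ü" into "u" + a combining diaeresis; encoding to ASCII with
    # errors ignored then silently drops the diaeresis and every Cyrillic letter.
    value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore').decode()
    value = STRIP_RE.sub('', value).strip().lower()
    return HYPHENATE_RE.sub('-', value)


def unicode_slug(value: str) -> str:
    """Hypothetical helper: the same steps, but without the ASCII round trip."""
    # NFKC keeps letters such as "ü" in composed form; \w matches Unicode letters
    # for str patterns in Python 3.
    value = unicodedata.normalize('NFKC', value)
    value = STRIP_RE.sub('', value).strip().lower()
    return HYPHENATE_RE.sub('-', value)


for text in ('Getting Started', 'Начало работы', '\u00fc-umlaut-test-encoding'):
    print(repr(text), '->', repr(ascii_slug(text)), 'vs', repr(unicode_slug(text)))
```

With the ASCII round trip, the Russian heading collapses to an empty string, which would explain why only a counter suffix such as "-1" is left as the id, and the German heading loses its umlaut. Without it, both headings keep their original letters.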

@nicholasserra
Collaborator

I have no idea what the effects might be. If we switch and do some proper testing I don't see any reason not to.

nicholasserra added a commit that referenced this issue Nov 9, 2023
Update `_slugify` to use utf-8 encoding (issue #534)