Other issue with Markdown style links. #28

Closed
PhilipMay opened this issue Feb 13, 2024 · 1 comment

@PhilipMay

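Markdown links wrapped in emphasis markers, such as "*[Neubau](https://www.some-link.com)*", are tokenized incorrectly: the closing parenthesis and the trailing asterisk end up inside the URL token.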

Code:

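from somajo import SoMaJo

# Assumption: German model; the issue does not show how `somajo` was created.
somajo = SoMaJo("de_CMC")

text = "*[Neubau](https://www.some-link.com)*"
sentences = somajo.tokenize_text([text])
for sentence in sentences:
    for token in sentence:
        print(f"{token.text}\t{token.token_class}\t{token.extra_info}")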

Returns:

*	symbol	SpaceAfter=No
[	symbol	SpaceAfter=No
Neubau	regular	SpaceAfter=No
]	symbol	SpaceAfter=No
(	symbol	SpaceAfter=No
https://www.some-link.com)*	URL	

Should return something like this:

*	symbol	SpaceAfter=No
[	symbol	SpaceAfter=No
Neubau	regular	SpaceAfter=No
]	symbol	SpaceAfter=No
(	symbol	SpaceAfter=No
https://www.some-link.com	URL	SpaceAfter=No
)	symbol	SpaceAfter=No
*	symbol	

Full code: https://colab.research.google.com/drive/16-CKdzp20Gin02emrLVeHfFFir2veK8M?usp=sharing

tsproisl added a commit that referenced this issue Feb 19, 2024
@tsproisl
Owner

I’ve decided to add explicit handling for Markdown links, so this should be fixed now. One caveat: it will still fail if the link description contains square brackets or if the URL contains parentheses.
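
To make the caveat concrete, here is a minimal sketch of how a simple pattern for Markdown links behaves. The regular expression and the helper `find_markdown_links` are hypothetical illustrations, not the rule SoMaJo actually uses; they only assume a pattern that forbids square brackets in the description and parentheses in the URL.

```python
import re

# Hypothetical pattern, NOT SoMaJo's actual rule: the description may not
# contain square brackets, the URL may not contain parentheses or whitespace.
MARKDOWN_LINK = re.compile(r"\[([^\[\]]+)\]\(([^()\s]+)\)")


def find_markdown_links(text):
    """Return (description, url) pairs for every simple Markdown link in text."""
    return MARKDOWN_LINK.findall(text)


# The link from this issue matches, so the URL can end before ")".
print(find_markdown_links("*[Neubau](https://www.some-link.com)*"))

# Square brackets inside the description: the pattern finds no link.
print(find_markdown_links("[a [b] c](https://www.some-link.com)"))

# Parentheses inside the URL: the pattern finds no link either.
print(find_markdown_links("[Foo](https://en.wikipedia.org/wiki/Foo_(bar))"))
```

With a pattern of this kind, the two failing inputs would fall through to generic URL handling, which would explain the limitation described above.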
