Hashtags with extended alphabet characters aren't recognized as hashtags, AP=>Bluesky #1131

Open
MS-potilas opened this issue Jun 13, 2024 · 10 comments
Labels
compat Protocol differences that need special handling.

Comments

@MS-potilas

ActivityPub (AP) hashtags containing non-ASCII letters, like ä (a with dots) and ö (o with dots), aren't recognized as hashtags when bridged. They show up as plain text in Bluesky.

Example:
https://mementomori.social/@rolle/112586679114646311
https://bsky.app/profile/rolle.mementomori.social.ap.brid.gy/post/3kuikyelvzdc2

Here, #Äänestäminen was not recognized as a hashtag.

@snarfed snarfed added the now label Jun 27, 2024
@snarfed
Owner

snarfed commented Jun 27, 2024

Huh, this turned out to be more interesting than I thought. Mastodon's AS2 JSON for this post strips the umlauts from those characters in the tag objects. It still renders them in content and in the UI:

[image: screenshot of the post in the Mastodon UI]

...but the AS2 tag has "name" : "#aanestaminen", no umlauts. Full object below.

Interestingly, if you click on the #Äänestäminen hashtag chip in the UI, it goes to the hashtag page, https://mementomori.social/tags/%C3%84%C3%A4nest%C3%A4minen , which has the umlauts. They're evidently only for show, though: they're not in the underlying hashtag index. If you remove them from that URL to get https://mementomori.social/tags/Aanestaminen , it renders the hashtag without them but shows the same results.

{
   "type" : "Note",
   "id" : "https://mementomori.social/users/rolle/statuses/112586679114646311",
   "url" : "https://mementomori.social/@rolle/112586679114646311",
   "attributedTo" : "https://mementomori.social/users/rolle",
   "content" : "<p>Muista käydä äänestämässä! Klo 20 asti aikaa. On tyhmää olla vaikuttamatta, kun siihen demokratiassa on mahdollisuus. Kaikille maailmassa ei tällaista suoda.</p><p><a href=\"https://mementomori.social/tags/Eurovaalit2024\" class=\"mention hashtag\" rel=\"tag\">#<span>Eurovaalit2024</span></a> <a href=\"https://mementomori.social/tags/Eurovaalit\" class=\"mention hashtag\" rel=\"tag\">#<span>Eurovaalit</span></a> <a href=\"https://mementomori.social/tags/%C3%84%C3%A4nest%C3%A4minen\" class=\"mention hashtag\" rel=\"tag\">#<span>Äänestäminen</span></a> <a href=\"https://mementomori.social/tags/Politiikka\" class=\"mention hashtag\" rel=\"tag\">#<span>Politiikka</span></a></p>",
   "tag" : [
      {
         "href" : "https://mementomori.social/tags/eurovaalit2024",
         "name" : "#eurovaalit2024",
         "type" : "Hashtag"
      },
      {
         "href" : "https://mementomori.social/tags/eurovaalit",
         "name" : "#eurovaalit",
         "type" : "Hashtag"
      },
      {
         "href" : "https://mementomori.social/tags/aanestaminen",
         "name" : "#aanestaminen",
         "type" : "Hashtag"
      },
      {
         "href" : "https://mementomori.social/tags/politiikka",
         "name" : "#politiikka",
         "type" : "Hashtag"
      }
   ]
}

@snarfed
Owner

snarfed commented Jun 27, 2024

I actually like this, it seems clever and a good UX idea, but it's definitely more difficult to translate. Bluesky uses index-based facets for hashtags and other rich text, but Mastodon's AS2 tags don't have indices, so we have to search for their name in the content, which doesn't work in this case because the name is the normalized text, eg #aanestaminen, which doesn't have the umlauts.
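To illustrate the failure, here is a minimal sketch of a name-in-content search. It is not Bridgy Fed's actual code: `find_hashtag` is a hypothetical helper, and the text is a shortened plain-text version of the post above.

```python
def find_hashtag(text: str, tag_name: str) -> int:
    """Return the character index of tag_name in text, or -1 if it is absent.

    Hypothetical helper: a plain case-insensitive substring search.
    """
    return text.lower().find(tag_name.lower())


# Shortened plain-text version of the post content.
text = "Muista käydä äänestämässä! #Eurovaalit2024 #Äänestäminen"

# The tag names as Mastodon serves them in the AS2 tag objects.
found = find_hashtag(text, "#eurovaalit2024")   # ASCII-only name: found
missing = find_hashtag(text, "#aanestaminen")   # normalized name: not in the text

print(found >= 0, missing)
```

The second search returns -1 because the content contains `#Äänestäminen`, while the tag object only carries the normalized `#aanestaminen`.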

I could do something Mastodon-specific: parse content as HTML and search for class="hashtag" or rel="tag". I'd still have to map the umlaut text there to the plain Latin text in tag.name, though, and that's a proprietary special case that I'd rather avoid. Or I could ignore tags entirely and only look at the parsed HTML, but that's even more proprietary. Hrm.
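For reference, the HTML-parsing option could look roughly like the sketch below. It uses only Python's standard `html.parser`; the class and function names are hypothetical, and it only collects the visible text of `rel="tag"` links without mapping it back to `tag.name`.

```python
from html.parser import HTMLParser


class HashtagLinkParser(HTMLParser):
    """Collects the visible text of <a rel="tag"> links (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hashtags = []
        self._in_tag_link = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        rel = dict(attrs).get("rel") or ""
        if tag == "a" and "tag" in rel.split():
            self._in_tag_link = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_tag_link:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_tag_link:
            self.hashtags.append("".join(self._buffer))
            self._in_tag_link = False


def extract_hashtag_text(html: str) -> list:
    parser = HashtagLinkParser()
    parser.feed(html)
    parser.close()
    return parser.hashtags


# One hashtag link, copied from the content of the post above.
html = ('<p><a href="https://mementomori.social/tags/%C3%84%C3%A4nest%C3%A4minen" '
        'class="mention hashtag" rel="tag">#<span>Äänestäminen</span></a></p>')
print(extract_hashtag_text(html))
```

This recovers the text as displayed, umlauts included, but it depends on markup conventions that other ActivityPub servers may not share.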

@snarfed snarfed removed the now label Jun 27, 2024
@snarfed
Owner

snarfed commented Aug 13, 2024

More details on Mastodon's behavior are in mastodon/mastodon#26518. No response from their team, though.

@MS-potilas
Author

FYI, it looks like this is fixed in Iceshrimp, as the hashtag #härkis is working: https://bsky.app/profile/AlderForrest.1m2lab.anvil.top.ap.brid.gy/post/3l25re3eiu7c2 (from https://1m2lab.anvil.top/).

@snarfed
Owner

snarfed commented Aug 20, 2024

@MS-potilas nice! Or maybe it always worked in Iceshrimp? Here are the key parts of the AS2 for that post:

  "content": "<p><span>h\u00e4rkisdolmiospagettikastike. Ehdottomasti jatkoon!<br><br></span><a href=\"https://1m2lab.anvil.top/tags/h\u00e4rkis\" rel=\"tag\">#h\u00e4rkis</a></p>",
  "tag": [{
      "type": "Hashtag",
      "href": "https://1m2lab.anvil.top/tags/h%C3%A4rkis",
      "name": "#h\u00e4rkis"
    }]

Unlike Mastodon, Iceshrimp preserves the ä in the tag's name, so Bridgy Fed is able to translate it.

@MS-potilas
Author

Ah, I thought Iceshrimp was a Mastodon fork, but it is a Misskey fork, so maybe it did work from the beginning.

@MS-potilas
Author

What if we searched the content with the umlauts removed to get the indices? Those indices would also work with the original content that has the umlauts. That would be simpler than parsing the HTML tags in the content. This would of course only apply to Mastodon. Just a thought.
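A rough sketch of that idea, using character indices: strip diacritics from a copy of the text, search the copy, and reuse the index on the original. The helper names are hypothetical, and the approach assumes every character maps to exactly one base character, which does not hold for all scripts. It also only approximates whatever normalization Mastodon actually applies.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks one character at a time, keeping the length stable.

    Assumption: each input character decomposes to a single base character
    plus combining marks. Characters that do not are left untouched.
    """
    out = []
    for ch in text:
        decomposed = unicodedata.normalize("NFD", ch)
        base = "".join(c for c in decomposed if not unicodedata.combining(c))
        out.append(base if len(base) == 1 else ch)
    return "".join(out)


def find_tag_chars(text: str, tag_name: str) -> int:
    """Return the character index of the normalized tag name, or -1."""
    return strip_diacritics(text).lower().find(tag_name.lower())


text = "Klo 20 asti aikaa. #Äänestäminen"
name = "#aanestaminen"
start = find_tag_chars(text, name)
print(text[start:start + len(name)])
```

Here the character index found in the stripped copy points at `#Äänestäminen` in the original text.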

@snarfed
Owner

snarfed commented Aug 28, 2024

Sadly, Bluesky facet indices are bytes, not characters/graphemes, so they won't match. E.g. a is one byte in UTF-8, ä is two.
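A small sketch of the difference, assuming UTF-8 byte offsets. `byte_span` is a hypothetical helper, not part of any Bluesky library.

```python
def byte_span(text: str, char_start: int, char_end: int) -> tuple:
    """Convert a character range into a UTF-8 byte range over the same text."""
    byte_start = len(text[:char_start].encode("utf-8"))
    byte_end = len(text[:char_end].encode("utf-8"))
    return byte_start, byte_end


text = "käydä #härkis"
tag = "#härkis"

char_start = text.index(tag)          # 6: counted in characters
char_end = char_start + len(tag)      # 13

print(char_start, char_end)
print(byte_span(text, char_start, char_end))   # (8, 16): each ä takes two bytes
```

Any offset computed on text with the umlauts removed counts one byte per letter, so it falls short of the byte offset in the original by one for every ä that precedes it.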

@Tamschi Tamschi added the compat Protocol differences that need special handling. label Oct 31, 2024
@gunchleoc

I guess the solution here would be to use the "name" : "#Gàidhlig", from Mastodon and then encode it to G%C3%A0idhlig?

Here's another broken example that you can use for testing:

Corresponding hashtag searches:

Testing whether conversion is needed could be done either by querying the Fediverse node software, or by splitting the last element off the href and comparing it to the name. If they're identical, no conversion needs to be done.
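The href-versus-name comparison could be sketched as follows. `href_matches_name` is a hypothetical helper; the first two tag objects are the ones quoted earlier in this thread, and the third is an invented mismatch for contrast.

```python
from urllib.parse import unquote, urlparse


def href_matches_name(tag: dict) -> bool:
    """Compare the last path segment of tag['href'] with tag['name'] (minus '#')."""
    last_segment = unquote(urlparse(tag["href"]).path.rsplit("/", 1)[-1])
    return last_segment.lower() == tag["name"].lstrip("#").lower()


mastodon_tag = {
    "href": "https://mementomori.social/tags/aanestaminen",
    "name": "#aanestaminen",
}
iceshrimp_tag = {
    "href": "https://1m2lab.anvil.top/tags/h%C3%A4rkis",
    "name": "#härkis",
}
# Hypothetical object, invented only to show a mismatch.
mismatched_tag = {
    "href": "https://example.com/tags/aanestaminen",
    "name": "#Äänestäminen",
}

for tag in (mastodon_tag, iceshrimp_tag, mismatched_tag):
    print(tag["name"], href_matches_name(tag))
```

On the two tag objects quoted in this thread, href and name agree with each other, so this check on its own would not flag the Mastodon post above.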

@snarfed
Owner

snarfed commented Nov 25, 2024

%-encoding like that is for URLs. Here, we have the un-encoded Unicode text.
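For comparison, a short sketch of the two forms using Python's standard `urllib.parse`:

```python
from urllib.parse import quote, unquote

name = "Gàidhlig"              # Unicode text, as it appears in post content
encoded = quote(name)          # percent-encoded UTF-8 bytes, for a URL path
decoded = unquote(encoded)     # back to the original Unicode text

print(encoded)
print(decoded == name)
```

The percent-encoded form belongs in the tag's href, while the text being searched in the content is the plain Unicode form.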
