@Skyliife
Contributor

link to issue: #470

  • Wrap incoming HTML in charset.NewReader before goquery parsing
  • Ensures ISO-8859-1 (and other legacy) input is normalized to UTF-8
  • Prevents “mojibake” (e.g. “Ã¤” instead of “ä”)
  • Updated TestWorldAntica to simulate Latin-1 input and verify correct umlaut decoding
  • Added Antica.html for parsing the character Näurin

Closes #470
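
For illustration, the mojibake described above can be reproduced with the standard library alone. This is a minimal sketch and not the code from this PR: `latin1ToUTF8` is a hypothetical helper that decodes bytes as ISO-8859-1. It shows that the decoding is correct for real Latin-1 input but garbles input that is already UTF-8.

```go
package main

import "fmt"

// latin1ToUTF8 (hypothetical helper) decodes b as ISO-8859-1.
// Every Latin-1 byte maps to the Unicode code point with the same value.
func latin1ToUTF8(b []byte) string {
	runes := make([]rune, len(b))
	for i, c := range b {
		runes[i] = rune(c)
	}
	return string(runes)
}

func main() {
	// "Näurin" encoded as ISO-8859-1: ä is the single byte 0xE4.
	latin1 := []byte{'N', 0xE4, 'u', 'r', 'i', 'n'}
	fmt.Println(latin1ToUTF8(latin1)) // Näurin

	// "Näurin" already encoded as UTF-8: ä is the two bytes 0xC3 0xA4.
	// Decoding those bytes as Latin-1 turns them into "Ã" and "¤".
	alreadyUTF8 := []byte("N\u00e4urin")
	fmt.Println(latin1ToUTF8(alreadyUTF8)) // NÃ¤urin
}
```

The second call is the failure mode to watch for: a Latin-1 conversion applied unconditionally to a UTF-8 response double-encodes every non-ASCII character.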

@Skyliife marked this pull request as ready for review on April 18, 2025, 12:23
- Fix for the character endpoint
- Replace custom TibiaDataConvertEncodingtoUTF8 with golang.org/x/net/html/charset.NewReader
- Use the actual Content-Type header from Tibia.com to normalize response bytes into UTF-8
- Remove resIo/resIo2 steps and feed the UTF-8 reader directly into goquery
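
The header-driven idea in this commit can be sketched with the standard library only. This is a simplified stand-in, not the `charset.NewReader` implementation, which supports many more encodings: `decodeBody` and `latin1ToUTF8` are hypothetical names, only two charsets are handled, and treating a missing charset as UTF-8 is an assumption made for the example.

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

// latin1ToUTF8 (hypothetical helper) decodes b as ISO-8859-1.
func latin1ToUTF8(b []byte) string {
	runes := make([]rune, len(b))
	for i, c := range b {
		runes[i] = rune(c)
	}
	return string(runes)
}

// decodeBody (hypothetical) picks a decoding based on the charset
// parameter of the Content-Type header value.
func decodeBody(body []byte, contentType string) (string, error) {
	_, params, err := mime.ParseMediaType(contentType)
	if err != nil {
		return "", err
	}
	switch strings.ToLower(params["charset"]) {
	case "iso-8859-1", "latin1":
		return latin1ToUTF8(body), nil
	case "", "utf-8":
		// Assumption for this sketch: no charset means UTF-8.
		return string(body), nil
	default:
		return "", fmt.Errorf("unsupported charset %q", params["charset"])
	}
}

func main() {
	latin1 := []byte{'N', 0xE4, 'u', 'r', 'i', 'n'}
	s, err := decodeBody(latin1, "text/html; charset=ISO-8859-1")
	fmt.Println(s, err) // Näurin <nil>

	s, err = decodeBody([]byte("N\u00e4urin"), "text/html; charset=utf-8")
	fmt.Println(s, err) // Näurin <nil>
}
```

Because the decoding follows the header, the same call site keeps working whether the server sends Latin-1 or UTF-8.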
@Skyliife
Contributor Author

@tobiasehlert I’ve updated the HTML collector to use charset.NewReader with the real Content-Type header instead of our custom converter, so incoming pages should now be normalized to proper UTF-8 and preserve umlauts (e.g. “Näurin”). I’m not super familiar with all the Go idioms here, so I’d really appreciate it if someone could double-check my changes.

@tobiasehlert
Member

A list of some characters with umlauts in their names:

  • Näurin
  • Hidofäs
  • König der Toten
  • Torbjörn
  • Sir Pösi
  • Wiliam Lundström
  • Der Nachtjäger
  • Stählerner Krieger
  • Nöber of Guards
  • Skalle pär
  • Höfix
  • Bürgy
  • Wächter der Hölle
  • Gordon Dödsmetal
  • Nöber
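
Names like these could double as fixtures for a regression check. The sketch below is a heuristic and not part of this PR: `repairDoubleEncoded` is a hypothetical function that undoes one round of "UTF-8 bytes decoded as ISO-8859-1" and leaves correctly decoded names untouched.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// repairDoubleEncoded (hypothetical) reverses one round of double encoding.
// It maps each rune back to a single Latin-1 byte and accepts the result
// only if those bytes form valid UTF-8. Otherwise it returns s unchanged.
func repairDoubleEncoded(s string) string {
	raw := make([]byte, 0, len(s))
	for _, r := range s {
		if r > 0xFF {
			return s // not representable in Latin-1, so not this kind of mojibake
		}
		raw = append(raw, byte(r))
	}
	if !utf8.Valid(raw) {
		return s // a lone 0xE4 or 0xF6 is not valid UTF-8: the text was fine
	}
	return string(raw)
}

func main() {
	names := []string{
		"N\u00e4urin",        // Näurin, correctly decoded
		"N\u00c3\u00a4urin",  // NÃ¤urin, double-encoded
		"Torbj\u00f6rn",      // Torbjörn, correctly decoded
	}
	for _, n := range names {
		fixed := repairDoubleEncoded(n)
		fmt.Printf("%q -> %q (changed: %v)\n", n, fixed, fixed != n)
	}
}
```

A test could assert that no parsed name changes when passed through such a function, which would flag double-encoded output without hard-coding every expected name.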

@tobiasehlert
Member

Thanks for your PR @Skyliife, but I've created #506 to address only the umlaut issue itself.

Is there any particular reason why we should switch to charset.NewReader?
I can see a possible benefit in using the Content-Type header, but maybe I'm missing something else.

@tobiasehlert
Member

@Skyliife, I didn't notice that the encoding from tibia.com is UTF-8 now, so I should have given you credit in #511.
