Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ignore low-confidence CharlockHolmes guesses when parsing link cards #9510

Merged
merged 3 commits into from
Dec 17, 2018

Conversation

ClearlyClaire
Copy link
Contributor

Fixes #9466 by ignoring low-confidence CharlockHolmes charset guesses.

High-confidence guesses will still override server or document-specified charset, and low confidence guesses won't be used at all even if there is no server or document-specified charset, so this might not be the best way to proceed.

Note also that the “hint” given to CharlockHolmes as very little importance, and feeding it the correct charset may not prevent it from outputing an incorrect, low-confidence guess.

@@ -137,7 +137,8 @@ def attempt_opengraph
detector.strip_tags = true

guess = detector.detect(@html, @html_charset)
page = Nokogiri::HTML(@html, nil, guess&.fetch(:encoding, nil))
encoding = guess&.fetch(:confidence, 0) > 60 ? guess&.fetch(:encoding, nil) : nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think nil > 60 will throw no method error

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed! Addressed.

@Gargron Gargron merged commit e709b8d into mastodon:master Dec 17, 2018
@jomo
Copy link
Contributor

jomo commented Dec 17, 2018

Thank you, @ThibG!

@ClearlyClaire ClearlyClaire deleted the fixes/fetch-link-card-charset branch March 14, 2019 15:42
hiyuki2578 pushed a commit to ProjectMyosotis/mastodon that referenced this pull request Oct 2, 2019
…mastodon#9510)

* Add failing test for windows-1251 link cards

* Ignore low-confidence CharlockHolmes guesses

Fixes mastodon#9466

* Fix no method error when charlock holmes cannot detect charset
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants