This repository has been archived by the owner on Sep 24, 2019. It is now read-only.

test for WCAG technique H58: Using language attributes to identify changes in the human language #58

Closed
hannolans opened this issue Dec 4, 2013 · 20 comments

Comments

@hannolans
Contributor

Check that the human language of the content of the element is the same as the inherited language for the element.

This can be done with a guess-language script. Such a script checks which characters are used (e.g. Chinese, Japanese) and looks for common function words ('and', 'und', 'et', etc.).

Here is one we can implement:
https://github.com/richtr/guessLanguage.js
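For reference, the two-step idea could be sketched like this (the function name and marker-word lists are hypothetical, not guessLanguage.js's actual API):

```javascript
// Sketch of script-block detection followed by a common-word fallback.
function sketchGuessLanguage(text) {
  // Step 1: singleton scripts map directly to a language.
  if (/[\u3040-\u30ff]/.test(text)) return 'ja'; // Hiragana/Katakana
  if (/[\u4e00-\u9fff]/.test(text)) return 'zh'; // CJK ideographs
  // Step 2: count well-known function words for Latin-script languages.
  var markers = { en: ['and', 'the'], de: ['und', 'der'], fr: ['et', 'le'] };
  var words = text.toLowerCase().split(/\s+/);
  var best = 'unknown', bestScore = 0;
  Object.keys(markers).forEach(function (lang) {
    var score = words.filter(function (w) {
      return markers[lang].indexOf(w) !== -1;
    }).length;
    if (score > bestScore) { best = lang; bestScore = score; }
  });
  return best;
}
```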

@hannolans
Contributor Author

guesslanguage.js has a license that is not compatible with the MIT license. It's based on older scripts in other languages.

@kevee
Collaborator

kevee commented Jan 23, 2014

I've gone through all the older scripts it's based on and found it's GPL all the way down, but there are projects like SpamAssassin that are Apache-licensed and use the same method. Basically, all of these libraries use the standard Unicode blocks for languages plus a simple algorithm: if characters from a given block make up over 40% of the characters in a set of strings above a certain length, the text is classified as that language.
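That heuristic could be sketched like this (block ranges, the helper name, and the minimum-length default are illustrative, not code from any of these libraries):

```javascript
// If characters from one Unicode block exceed 40% of a long-enough string,
// assume that block's language.
function dominantBlock(text, minLength) {
  minLength = minLength || 20;
  if (text.length < minLength) return null;
  var blocks = {
    cyrillic: /[\u0400-\u04ff]/,
    greek: /[\u0370-\u03ff]/,
    arabic: /[\u0600-\u06ff]/
  };
  var chars = text.split('');
  var result = null;
  Object.keys(blocks).forEach(function (name) {
    var count = chars.filter(function (c) { return blocks[name].test(c); }).length;
    if (count / text.length > 0.4) result = name;
  });
  return result;
}
```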

I think we might want to contact the author of guesslanguage and ask if we can include his regex in quail.

@hannolans
Contributor Author

Here is another code base, with a BSD license, based on the same algorithm principle you mentioned: https://github.com/webmil/text-language-detect

@hannolans
Contributor Author

And this one is MIT-licensed and very interesting: https://github.com/shuyo/ldig
This script uses an infinity-gram model and is especially designed to detect the language even of short texts like tweets. See "Language Detection for twitter with 99.1% Accuracy" http://shuyo.wordpress.com/2012/02/21/language-detection-for-twitter-with-99-1-accuracy/
The problem with this method seems to be the size of the dataset (>20 MB).
In the slides, Shuyo compares other open-source n-gram projects:
http://code.google.com/p/chromium-compact-language-detector/ (New BSD-license, code ported from Google: https://code.google.com/p/cld2/)
http://tika.apache.org/ (Apache License)

@kevee
Collaborator

kevee commented Jan 28, 2014

Looking at a few of the projects mentioned, they all use the same kind of Unicode ranges, but implementations in languages like Python or C have language shortcuts that map to Unicode groups, like https://code.google.com/p/chromium-compact-language-detector/source/browse/encodings.cc.

@kevee
Collaborator

kevee commented Jan 31, 2014

I've started a language-detection feature branch to clean up all our language detection into a single component that will include a simple Unicode-set regular-expression solution for now. While this is not very accurate when given a single word, in initial tests it worked fine with languages that have a distinct Unicode set (Japanese, Cyrillic scripts), but detection dropped off as languages became more similar (like English and Spanish, where you could go a few sentences without using diacritical marks).

I'm getting these unicode sets from documents like http://unicode.org/cldr/trac/browser/trunk/common/main/ar.xml, which list exemplarCharacters as characters or a unicode range, like:

<exemplarCharacters>[\u064B \u064C \u064D \u064E \u064F \u0650 \u0651 \u0652 \u0670 ء آ أ ؤ إ ئ ا ب ت ة ث ج ح خ د ذ ر ز س ش ص ض ط ظ ع غ ف ق ك ل م ن ه و ي ى]
</exemplarCharacters>
<exemplarCharacters type="auxiliary">[\u200C\u200D\u200E\u200F پ چ ژ ڜ ڢ ڤ ڥ ٯ ڧ ڨ ک گ ی]</exemplarCharacters>
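As a sketch, the main exemplar set above could be turned into a detection regex like this (the helper name is hypothetical, and the 40% cutoff mirrors the heuristic discussed earlier):

```javascript
// The combining marks are copied from the ar.xml main set above as \u escapes;
// the base letters are included literally.
var arabicExemplar = '\u064B\u064C\u064D\u064E\u064F\u0650\u0651\u0652\u0670' +
  'ءآأؤإئابتةثجحخدذرزسشصضطظعغفقكلمنهويى';
var arabicPattern = new RegExp('[' + arabicExemplar + ']');

function looksArabic(text) {
  var matches = text.split('').filter(function (c) {
    return arabicPattern.test(c);
  }).length;
  return matches / text.length > 0.4;
}
```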

@kevee
Collaborator

kevee commented Jan 31, 2014

One question I have is how to determine what language the current document is in if it's not set. Right now I first look at the element quail was asked to run against, then its nearest ancestor with a lang attribute, and finally fall back to the browser. I don't like the last one, but the only alternative is to not run the test at all. If there is no element on the page with a lang attribute, how are we to determine a baseline? I could check whether there are two different scripts or languages with singleton Unicode blocks, and just say "hey, it looks like there are two languages here."
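The lookup order could be sketched roughly like this (hypothetical helper, not quail's actual code; the fallback is passed in so the caller can supply navigator.language):

```javascript
// Walk from the element up through ancestors looking for a lang attribute,
// then fall back to a caller-supplied default (e.g. the browser locale).
function baselineLanguage(element, fallback) {
  var node = element;
  while (node && node.getAttribute) {
    var lang = node.getAttribute('lang');
    if (lang) return lang.toLowerCase();
    node = node.parentNode;
  }
  return (fallback || 'en').toLowerCase();
}
```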

@hannolans
Contributor Author

We don't need a base language. We could run the test on the text in each HTML element, compare the guessed language with the language of the other elements, and if there is a difference, check whether there is a lang attribute in between. This won't work if there is mixed language within an element.
I don't know if matching one word in a foreign script should trigger a fail. Foreign websites contain well-known English words that shouldn't trigger it.

@kevee
Collaborator

kevee commented Jan 31, 2014

I was also just thinking about non-Basic-Latin words that are used in English and should not be considered a context switch.

Can we say for now that we will just check all text-containing elements and compare them to each other?

@hannolans
Contributor Author

Yes, that's a good solution. And if the website contains the right semantic elements (covered by another test), this will work for block and inline quotations as well.

@kevee
Collaborator

kevee commented Jan 31, 2014

Great! I'm also thinking that later down the line we can use the Content-Language HTTP header on the page as well, but that would require making an additional HTTP request for the current page, and I have found that this can cause problems when a page initiates an action (e.g. visiting page/123/edit and then re-opening another HTTP request on the same URL might cause the backend application to lock the content).

@hannolans
Contributor Author

Great to have this in the code. Just curious, I couldn't track down the code for this, but what happens when there is a single word in Latin script, like: 'এটি একটি ভাষা একক IBM স্ক্রিপ্ট'.
Single words like that should pass at the moment (or at least not fail). Does it behave like that?

@kevee
Collaborator

kevee commented Feb 4, 2014

Right now it would throw an error, but we could capture only strings of text that are either longer than n characters or that form complete character groups (like a sentence).
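A minimal sketch of that filter (name, default length, and the sentence check are illustrative):

```javascript
// Skip strings shorter than minLength; optionally require that the string
// ends like a complete sentence before it's worth language-testing.
function worthTesting(text, minLength, requireSentence) {
  var trimmed = text.replace(/\s+/g, ' ').trim();
  if (trimmed.length < (minLength || 30)) return false;
  if (requireSentence && !/[.!?。]\s*$/.test(trimmed)) return false;
  return true;
}
```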

I've also reached out to the author of a good trigram database about including it in quail. I think it's too heavy to include in a browser, but quail could accept a trigram database passed to it and use it to determine language changes more accurately.

@kevee
Collaborator

kevee commented Feb 12, 2014

Added an issue to the guesslanguage repo to discuss including the trigram database with quail.
richtr/guessLanguage.js#7

@hannolans
Contributor Author

This trigram set has a BSD license: https://github.com/webmil/text-language-detect/blob/master/lib/data/lang.dat and is 340 KB.
It's a fork of https://github.com/pear/Text_LanguageDetect, also BSD-licensed.
I assume the databases are the same: https://github.com/pear/Text_LanguageDetect/tree/trunk/data
It can detect 52 languages. You can try the script at http://languagedetect.org/

@hannolans
Contributor Author

It would be very interesting to find a way to include this even in the plugin. We could probably preload quail with the trigrams of only the given base language, and start loading the other trigrams as soon as a certain threshold is reached indicating a text is not in the base language. If we make the test more lightweight, we don't even have to know which language it is, only that it is not the base language.

@kevee
Collaborator

kevee commented Feb 22, 2014

Given that we can only really accurately detect changes in script using regex, not language, we will definitely need a trigram database of some kind; however, any database would be too large to just include in the plugin code. For now, I'm going to just accept the format of the Trigram databases you identified (which are all identical) and that way a project just provides quail with the database, which could include just one language, or all of them.

You are correct that if a base language is identified, we can just run through the text and see if anything does not match the trigrams in question, although we should probably only throw an error in that case if there are enough characters in the element for a trigram test to be viable.

If no trigram database is provided, there is a method for generating trigrams, and we could go element by element, creating a trigram database on the fly and then throwing errors on elements that diverge wildly from the rest of the page.
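A rough sketch of on-the-fly trigram profiles (function names are hypothetical, and the overlap measure is a simple stand-in for the ranked "out-of-place" distance real libraries use):

```javascript
// Build a frequency map of all three-character sequences in a string.
function trigramProfile(text) {
  var clean = text.toLowerCase().replace(/[^a-z\u00c0-\u024f]+/g, ' ');
  var profile = {};
  for (var i = 0; i < clean.length - 2; i++) {
    var tri = clean.substr(i, 3);
    profile[tri] = (profile[tri] || 0) + 1;
  }
  return profile;
}

// Fraction of profile a's trigrams that also appear in profile b;
// wildly divergent elements score near 0.
function profileOverlap(a, b) {
  var shared = 0, total = 0;
  Object.keys(a).forEach(function (tri) {
    total++;
    if (b[tri]) shared++;
  });
  return total ? shared / total : 0;
}
```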

I've started the trigram-language branch to capture this work.

@kevee
Collaborator

kevee commented Feb 22, 2014

I've spent a lot of time building and evaluating different implementations, and at this point I'm just going to say guessLanguage.js is our best hope, but I haven't gotten a response from the maintainer (the repo's last commit is over a year old).

At this point, I'd like to suggest we add guessLanguage as a dev dependency (i.e. it won't be rolled into a release, but people can run bower install or grunt package locally to download it). This tool is the most robust one out there, and I don't think we want to commit any more dev time (although it has been an educational experience) to rebuilding this particular wheel.

Thoughts? I'm going to go forward with this model in the trigram-language branch for the time being; I don't think there are any licensing issues with doing it this way.

@kevee
Collaborator

kevee commented Feb 24, 2014

I'm almost ready to merge the branch, but I'm still playing with how many characters long a string should be before running guessLanguage even makes sense. There's some research on the subject (including math that I'm not going to pretend to understand) that is leading me toward a simple model:

  1. If the language is character-based, like Chinese, only use it if string.length > 40.
  2. If the language is letter-based, like English, only use it if string.length > 300.

Luckily we have the unicode blocks to separate these out.
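That model could be sketched like this (thresholds taken from the list above; the script regex is an approximation covering CJK ideographs, kana, and Hangul, and the helper name is hypothetical):

```javascript
// Character-based scripts carry more information per character, so they
// need a much shorter string before guessLanguage is worth running.
var CHARACTER_BASED = /[\u4e00-\u9fff\u3040-\u30ff\uac00-\ud7af]/;

function longEnoughForGuess(text) {
  var minLength = CHARACTER_BASED.test(text) ? 40 : 300;
  return text.length > minLength;
}
```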

@kevee
Collaborator

kevee commented Feb 25, 2014

It turns out the string-length check is unneeded, since all character-based languages occupy distinct Unicode singletons and are therefore caught even without guessLanguage. I'm going to merge into dev.

kevee pushed a commit that referenced this issue Feb 25, 2014
Use guessLanguage.js to find language changes without unicode singletons. Closes #58