This repository has been archived by the owner on Sep 24, 2019. It is now read-only.

test for WCAG technique H58: Using language attributes to identify changes in the human language #58

Closed
hannolans opened this issue Dec 4, 2013 · 20 comments

Comments

@hannolans
Contributor

Check that the human language of the content of the element is the same as the inherited language for the element.

This can be done with a guess-language script. Such a script checks which characters are used (e.g. Chinese, Japanese) and looks for common function words ('and', 'und', 'et', etc.).

Here is one we can implement:
https://github.com/richtr/guessLanguage.js
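For reference, the two-step idea could be sketched like this (the function name and marker-word lists are hypothetical, not guessLanguage.js's actual API):

```javascript
// Sketch of script-block detection followed by a common-word fallback.
function sketchGuessLanguage(text) {
  // Step 1: singleton scripts map directly to a language.
  if (/[\u3040-\u30ff]/.test(text)) return 'ja'; // Hiragana/Katakana
  if (/[\u4e00-\u9fff]/.test(text)) return 'zh'; // CJK ideographs
  // Step 2: count well-known function words for Latin-script languages.
  var markers = { en: ['and', 'the'], de: ['und', 'der'], fr: ['et', 'le'] };
  var words = text.toLowerCase().split(/\s+/);
  var best = 'unknown', bestScore = 0;
  Object.keys(markers).forEach(function (lang) {
    var score = words.filter(function (w) {
      return markers[lang].indexOf(w) !== -1;
    }).length;
    if (score > bestScore) { best = lang; bestScore = score; }
  });
  return best;
}
```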

@hannolans
Contributor Author

guesslanguage.js has a license that is not compatible with the MIT license. It's based on older scripts in other languages.

@kevee
Collaborator

kevee commented Jan 23, 2014

I've gone through all the older scripts it's based on and found it's GPL all the way down, but there are projects like SpamAssassin that are Apache-licensed and use the same method. Basically, all of these libraries use the standard Unicode blocks for languages plus a simple algorithm: if characters from a given block make up over 40% of the characters in a set of strings above a certain length, the text is classified as that language.
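That heuristic could be sketched like this (block ranges, the helper name, and the minimum-length default are illustrative, not code from any of these libraries):

```javascript
// If characters from one Unicode block exceed 40% of a long-enough string,
// assume that block's language.
function dominantBlock(text, minLength) {
  minLength = minLength || 20;
  if (text.length < minLength) return null;
  var blocks = {
    cyrillic: /[\u0400-\u04ff]/,
    greek: /[\u0370-\u03ff]/,
    arabic: /[\u0600-\u06ff]/
  };
  var chars = text.split('');
  var result = null;
  Object.keys(blocks).forEach(function (name) {
    var count = chars.filter(function (c) { return blocks[name].test(c); }).length;
    if (count / text.length > 0.4) result = name;
  });
  return result;
}
```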

I think we might want to contact the author of guesslanguage and ask if we can include his regex in quail.

@hannolans
Contributor Author

Here is another code base, with a BSD license, based on the same algorithm principle you mentioned: https://github.com/webmil/text-language-detect

@hannolans
Contributor Author

And this one is MIT-licensed and very interesting: https://github.com/shuyo/ldig
This script uses an infinity-gram model and is especially designed to detect the language even of short texts like tweets. See "Language Detection for twitter with 99.1% Accuracy" http://shuyo.wordpress.com/2012/02/21/language-detection-for-twitter-with-99-1-accuracy/
The problem with this method seems to be the size of the dataset (>20 MB).
In the slides, Shuyo compares other open-source n-gram projects:
http://code.google.com/p/chromium-compact-language-detector/ (New BSD-license, code ported from Google: https://code.google.com/p/cld2/)
http://tika.apache.org/ (Apache License)

@kevee
Collaborator

kevee commented Jan 28, 2014

Looking at a few of the projects mentioned, they all use the same kind of Unicode ranges, but implementations in languages like Python or C have language shortcuts that map to Unicode groups, like https://code.google.com/p/chromium-compact-language-detector/source/browse/encodings.cc.

@kevee
Collaborator

kevee commented Jan 31, 2014

I've started a language-detection feature branch to clean up all our language detection into a single component that will include a simple Unicode-set regular-expression solution for now. While this is not very accurate when given a single word, in initial tests it worked fine with languages that have a distinct Unicode set (Japanese, Cyrillic scripts), but detection dropped off as languages became more similar (like English and Spanish, where you could go a few sentences without using diacritical marks).

I'm getting these unicode sets from documents like http://unicode.org/cldr/trac/browser/trunk/common/main/ar.xml, which list exemplarCharacters as characters or a unicode range, like:

<exemplarCharacters>[\u064B \u064C \u064D \u064E \u064F \u0650 \u0651 \u0652 \u0670 ء آ أ ؤ إ ئ ا ب ت ة ث ج ح خ د ذ ر ز س ش ص ض ط ظ ع غ ف ق ك ل م ن ه و ي ى]
</exemplarCharacters>
<exemplarCharacters type="auxiliary">[\u200C\u200D\u200E\u200F پ چ ژ ڜ ڢ ڤ ڥ ٯ ڧ ڨ ک گ ی]</exemplarCharacters>
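As a sketch, the main exemplar set above could be turned into a detection regex like this (the helper name is hypothetical, and the 40% cutoff mirrors the heuristic discussed earlier):

```javascript
// The combining marks are copied from the ar.xml main set above as \u escapes;
// the base letters are included literally.
var arabicExemplar = '\u064B\u064C\u064D\u064E\u064F\u0650\u0651\u0652\u0670' +
  'ءآأؤإئابتةثجحخدذرزسشصضطظعغفقكلمنهويى';
var arabicPattern = new RegExp('[' + arabicExemplar + ']');

function looksArabic(text) {
  var matches = text.split('').filter(function (c) {
    return arabicPattern.test(c);
  }).length;
  return matches / text.length > 0.4;
}
```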

@kevee
Collaborator

kevee commented Jan 31, 2014

One question I have is how to determine what language the current document is in if it's not set. Right now I first look at the element quail was asked to run against, then its nearest ancestor with a lang attribute, and finally fall back to the browser. I don't like the last one, but the only alternative is to not run the test at all. If there is no element on the page with a lang attribute, how are we to determine a baseline? I could check whether there are two different scripts or languages with singleton Unicode blocks, and just say "hey, it looks like there are two languages here."
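The lookup order could be sketched roughly like this (hypothetical helper, not quail's actual code; the fallback is passed in so the caller can supply navigator.language):

```javascript
// Walk from the element up through ancestors looking for a lang attribute,
// then fall back to a caller-supplied default (e.g. the browser locale).
function baselineLanguage(element, fallback) {
  var node = element;
  while (node && node.getAttribute) {
    var lang = node.getAttribute('lang');
    if (lang) return lang.toLowerCase();
    node = node.parentNode;
  }
  return (fallback || 'en').toLowerCase();
}
```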

@hannolans
Contributor Author

We don't need a base language. We could run the test on the text in each HTML element, compare the guessed language with the language of the other elements, and if there is a difference, check whether there is a lang attribute in between. This won't work if there is mixed language within an element.
I don't know if matching one word in a foreign script should trigger a fail. Foreign websites contain well-known English words that shouldn't trigger it.

@kevee
Collaborator

kevee commented Jan 31, 2014

I was also just thinking about non-Basic-Latin words that are used in English and should not be considered a context switch.

Can we say for now that we will just check all text-containing elements and compare them to each other?

@hannolans
Contributor Author

Yes, that's a good solution. And if the website contains the right semantic elements (covered by another test), this will work for block and inline quotations as well.

@kevee
Collaborator

kevee commented Jan 31, 2014

Great! I'm also thinking that later down the line we can use the Content-Language HTTP header on the page as well, but that would require making an additional HTTP request for the current page, and I have found that this can cause problems when a page initiates an action (e.g. visiting page/123/edit and then re-opening another HTTP request on the same URL might cause the backend application to lock the content).

@hannolans
Contributor Author

Great to have this in the code. Just curious, I couldn't track down the code for this, but what happens when there is a single word in Latin script, like: 'এটি একটি ভাষা একক IBM স্ক্রিপ্ট'.
Single words like that should pass at the moment (or at least not fail). Does it behave like that?

@kevee
Collaborator

kevee commented Feb 4, 2014

Right now it would throw an error, but we could capture only strings of text that are either longer than n characters or that form complete character groups (like a sentence).
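A minimal sketch of that filter (name, default length, and the sentence check are illustrative):

```javascript
// Skip strings shorter than minLength; optionally require that the string
// ends like a complete sentence before it's worth language-testing.
function worthTesting(text, minLength, requireSentence) {
  var trimmed = text.replace(/\s+/g, ' ').trim();
  if (trimmed.length < (minLength || 30)) return false;
  if (requireSentence && !/[.!?。]\s*$/.test(trimmed)) return false;
  return true;
}
```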

I've also reached out to the author of a good trigram database about including it in quail. I think it's too heavy to include in a browser, but quail could accept a trigram database passed to it and use it to determine language changes more accurately.

@kevee
Collaborator

kevee commented Feb 12, 2014

Added an issue to the guesslanguage repo to discuss including the trigram database with quail.
richtr/guessLanguage.js#7

@hannolans
Contributor Author

This trigram set has a BSD license: https://github.com/webmil/text-language-detect/blob/master/lib/data/lang.dat and is 340 KB.
It's a fork of https://github.com/pear/Text_LanguageDetect, also BSD-licensed.
I assume the databases are the same: https://github.com/pear/Text_LanguageDetect/tree/trunk/data
It can detect 52 languages. You can try the script at http://languagedetect.org/

@hannolans
Contributor Author

It would be very interesting to find a way to include this even in the plugin. We could probably preload quail with the trigrams of only the given base language, and start loading the other trigrams as soon as a certain threshold is reached indicating a text is not in the base language. If we make the test more lightweight, we don't even have to know which language it is, only that it is not the base language.

@kevee
Collaborator

kevee commented Feb 22, 2014

Given that we can only really accurately detect changes in script using regex, not language, we will definitely need a trigram database of some kind; however, any database would be too large to just include in the plugin code. For now, I'm going to just accept the format of the Trigram databases you identified (which are all identical) and that way a project just provides quail with the database, which could include just one language, or all of them.

You are correct that if a base language is identified, we can just run through the text and see if anything does not match the trigrams in question, although we should probably only throw an error in that case if there are enough characters in the element for a trigram test to be viable.

If no trigram database is provided, there is a method for generating trigrams, and we could go element by element, creating a trigram database on the fly and then throwing errors on elements that diverge wildly from the rest of the page.
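A rough sketch of on-the-fly trigram profiles (function names are hypothetical, and the overlap measure is a simple stand-in for the ranked "out-of-place" distance real libraries use):

```javascript
// Build a frequency map of all three-character sequences in a string.
function trigramProfile(text) {
  var clean = text.toLowerCase().replace(/[^a-z\u00c0-\u024f]+/g, ' ');
  var profile = {};
  for (var i = 0; i < clean.length - 2; i++) {
    var tri = clean.substr(i, 3);
    profile[tri] = (profile[tri] || 0) + 1;
  }
  return profile;
}

// Fraction of profile a's trigrams that also appear in profile b;
// wildly divergent elements score near 0.
function profileOverlap(a, b) {
  var shared = 0, total = 0;
  Object.keys(a).forEach(function (tri) {
    total++;
    if (b[tri]) shared++;
  });
  return total ? shared / total : 0;
}
```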

I've started the trigram-language branch to capture this work.

@kevee
Collaborator

kevee commented Feb 22, 2014

I've spent a lot of time building and evaluating different implementations, and at this point I'm just going to say guessLanguage.js is our best hope, but I haven't gotten a response from the maintainer (the repo's last commit is over a year old).

At this point, I'd like to suggest we add guessLanguage as a dev dependency (i.e. it won't be rolled into a release, but people can run bower install or grunt package locally to download it). This tool is the most robust one out there, and I don't think we want to commit any more dev time (although it has been an educational experience) to rebuilding this particular wheel.

Thoughts? I'm going to go forward with this model in the trigram-language branch for the time being; I don't think there are any licensing issues with doing it this way.

@kevee
Collaborator

kevee commented Feb 24, 2014

I'm almost ready to merge the branch, but I'm still playing with how many characters long a string should be before running guessLanguage even makes sense. There's some research on the subject (including math that I'm not going to pretend to understand) that is leading me toward a simple model:

  1. If the language is character-based, like Chinese, only use it if string.length > 40.
  2. If the language is letter-based, like English, only use it if string.length > 300.

Luckily we have the unicode blocks to separate these out.
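That model could be sketched like this (thresholds taken from the list above; the script regex is an approximation covering CJK ideographs, kana, and Hangul, and the helper name is hypothetical):

```javascript
// Character-based scripts carry more information per character, so they
// need a much shorter string before guessLanguage is worth running.
var CHARACTER_BASED = /[\u4e00-\u9fff\u3040-\u30ff\uac00-\ud7af]/;

function longEnoughForGuess(text) {
  var minLength = CHARACTER_BASED.test(text) ? 40 : 300;
  return text.length > minLength;
}
```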

@kevee
Collaborator

kevee commented Feb 25, 2014

It turns out the string-length check is unneeded, since all character-based languages occupy distinct Unicode singletons and are therefore caught even without guessLanguage. I'm going to merge into dev.

kevee pushed a commit that referenced this issue Feb 25, 2014
Use guessLanguage.js to find language changes without unicode singletons. Closes #58