TAWS Technical Documentation

Pēteris Ņikiforovs edited this page Jun 19, 2013 · 5 revisions

The Technical Documentation of the Terminology Use Case

The Terminology Use Case is designed and implemented by Tilde. It consists of three logical projects:

  1. Tilde ITS – a reusable software library that implements the Internationalization Tag Set (ITS) Version 2.0
  2. Tilde TAWS API – a web service that annotates terminology in the submitted document
  3. Tilde TAWS Showcase – a user interface for TAWS in the form of a web site. For a detailed user guide refer to the PDF version of the documentation or the PPTX presentation of the showcase.

The figure below reflects the architecture of the Terminology Use Case.

[Figure: Overall architecture of the Terminology Use Case]

Tilde ITS

Tilde.Its is a .NET class library written in C# for parsing ITS 2.0 annotated content in XML and HTML formats. All functionality is located in the Tilde.Its namespace.

ItsDocument

The abstract class ItsDocument represents an ITS 2.0 annotated document. It provides the following functionality:

  • loads and parses the ITS 2.0 enriched document;
  • finds and loads the global rules associated with the document.

The class ItsXmlDocument, which inherits from ItsDocument, represents an XML document. Similarly, the class ItsHtmlDocument represents an HTML document.

Code examples for usage of the ItsXmlDocument class are given below. The ItsXmlDocument class requires the declaration of the Tilde.Its namespace:

using Tilde.Its;

A new document can be created in the following two ways:

  • from a file:

    ItsXmlDocument doc = new ItsXmlDocument(uri: "document.xml");

  • from a string:

    ItsXmlDocument doc = new ItsXmlDocument(xml: "<root/>");

If a new document is created from a string and it has external rules with relative paths, the user can specify the base path for these rules:

ItsXmlDocument doc = new ItsXmlDocument(
    xml: "<root><its:rules ... xlink:href='rules.xml'/></root>",
    uri: "http://example.com"); // will load rules from http://example.com/rules.xml

A new document can also be created by cloning an existing document:

ItsXmlDocument doc2 = new ItsXmlDocument(doc);

A document can be converted to an XML document or a string (e.g., to save it in a file):

System.Xml.Linq.XDocument xmlDoc = doc.Document;  
string xml = doc.Document.ToString();

Data categories

Each data category is represented by a class that inherits from DataCategory. Class names are formed using the data category name and the suffix DataCategory, for example: TranslateDataCategory, ElementsWithinTextDataCategory, MtConfidenceDataCategory, IdValueDataCategory.

The constructor of these classes accepts two arguments: an ItsDocument, which contains the global rules, and the node (element or attribute) to analyse.

using Tilde.Its;
using System.Xml.Linq; // XElement, XAttribute
using Microsoft.VisualStudio.TestTools.UnitTesting; // Assert (a unit-test framework such as MSTest is assumed)

ItsHtmlDocument doc = 
    new ItsHtmlDocument("<html><body translate=no>0x01 0x02 0x03</body></html>");
XElement html = doc.Document.Root;
XElement body = html.Element(ItsHtmlDocument.XhtmlNamespace + "body");
XAttribute bodyAttribute = body.Attribute("translate");

TranslateDataCategory htmlTranslate = new TranslateDataCategory(doc, html);
Assert.IsTrue(htmlTranslate.IsTranslatable); // default value

TranslateDataCategory bodyTranslate = new TranslateDataCategory(doc, body);
Assert.IsFalse(bodyTranslate.IsTranslatable); // local value

TranslateDataCategory bodyAttributeTranslate = 
    new TranslateDataCategory(doc, bodyAttribute);
Assert.IsFalse(bodyAttributeTranslate.IsTranslatable); // default value for attributes

The annotatorsRef attribute is not a data category but it can be used in a similar way.

AnnotatorsAnnotation annotators = new AnnotatorsAnnotation(doc, element);
var value = annotators.AnnotatorRef; // terminology|http://1 text-analysis|http://2
var terminology = annotators["terminology"]; // AnnotatorsRef with "terminology" and "http://1"
// annotators is IEnumerable<AnnotatorsRef>

Annotations

The class System.Xml.Linq.XObject (from which XElement and XAttribute inherit) supports adding annotations.

html.AddAnnotation(new TranslateDataCategory(doc, html));
TranslateDataCategory htmlTranslate = html.Annotation<TranslateDataCategory>();

The class ItsDocument takes advantage of this functionality and provides a quick way to annotate all elements and attributes in the document. In order to find all data categories in Tilde.Its and annotate all elements and attributes in the document, use:

doc.AnnotateAll();

In order to find all data categories in Tilde.Its and annotate the html element and its attributes, use:

doc.AnnotateAll(html);

In order to annotate the html element and its attributes with the Translate data category, use:

doc.Annotate<TranslateDataCategory>(html);

In order to annotate all elements and attributes in the document with the Translate data category, use:

doc.Annotate<TranslateDataCategory>(doc.Document.Descendants());

Before adding an annotation, previous annotations of the same type are removed, so the Annotate*() methods can also be used to re-annotate content. Adding an annotation does not compute its value; the value is computed lazily when it is first read. For instance, given the following document:

string html = "<html><body><p></p></body></html>";

We add all data categories to all nodes (no value computations are performed):

doc.AnnotateAll();

The following Assert computes the value for <body> (result: no local, no global rules), then for <html> (result: no local, no global rules), then it computes the default value (true); <p> is not computed as it is not necessary:

Assert.IsTrue(body.Annotation<TranslateDataCategory>().IsTranslatable);

Once the value has been computed for an annotation, it is cached. In the following code, the values for <body> and <html> have already been computed in the last example (both true), so nothing is computed again:

Assert.IsTrue(body.Annotation<TranslateDataCategory>().IsTranslatable);

Some of the data categories support inheritance. When a node is annotated and looks for the inherited value, it uses the annotation on the parent element if there is one (if there is no annotation, none is added). Thus, parent elements that are already annotated do not have to be annotated again, which improves performance.

Overriding values

You can override the computed values for some data categories.

doc.AnnotateAll();
body.Annotation<TranslateDataCategory>().IsTranslatable = false;
// <p>   : false   (will be computed and taken from <body>)
// <body>: false   (already computed = overridden)
// <html>: true    (will be computed)

However, if you override a value and the values for child nodes have already been computed, you will get incorrect results.

doc.AnnotateAll(); // adds annotations; no values are computed yet
// computes and caches the values for <p>, <body> and <html>
Assert.IsTrue(p.Annotation<TranslateDataCategory>().IsTranslatable);
// override the value
body.Annotation<TranslateDataCategory>().IsTranslatable = false;
// <p>   : true    (already computed in Assert.IsTrue)   <- INCORRECT
// <body>: false   (already computed = overridden)
// <html>: true    (already computed in Assert.IsTrue)

When you override a value, you should re-annotate all descendants to avoid such situations.

doc.AnnotateAll(); // adds annotations; no values are computed yet
// computes and caches the values for <p>, <body> and <html>
Assert.IsTrue(p.Annotation<TranslateDataCategory>().IsTranslatable);
// override the value
body.Annotation<TranslateDataCategory>().IsTranslatable = false;
// re-annotate descendants (this discards their cached values)
doc.Annotate<TranslateDataCategory>(body.Descendants());
// <p>   : false   (will be computed and taken from <body>)
// <body>: false   (already computed = overridden)
// <html>: true    (already computed in Assert.IsTrue)

Tilde TAWS API

The Terminology Annotation Web Service (TAWS) can annotate terminology in plaintext and in ITS 2.0 enriched HTML5 and XLIFF documents. Tilde.Taws is an ASP.NET Web API project written in C#. It exposes a RESTful API over HTTP.

API

TAWS exposes a RESTful API over HTTP.

HTML5

Request:

POST /api/html5 HTTP/1.1
Host: taws.tilde.com
Content-Length: 62

<!DOCTYPE html><html lang="en"><body>hello world</body></html>

Response:

HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8

<!DOCTYPE html>
<html lang="en">
<body its-annotators-ref="terminology|http://tilde.com/term-annotation-service">
<span its-term="yes" its-term-confidence="1">hello world</span>
</body></html>

XLIFF

Request:

POST /api/xliff HTTP/1.1
Host: taws.tilde.com
Content-Length: 307

<?xml version="1.0" encoding="utf-8"?>
<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
<file original="hello.txt" source-language="en-us" target-language="lv-lv" datatype="plaintext">
<body>
<trans-unit id='1'>
<source>hello world</source>
</trans-unit>
</body>
</file>
</xliff>

Response:

HTTP/1.1 200 OK
Content-Type: text/xml; charset=utf-8

<?xml version="1.0" encoding="utf-8"?>
<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2" xmlns:its="http://www.w3.org/2005/11/its" xmlns:itsx="http://www.w3.org/ns/its-xliff/"  its:annotatorsRef="terminology|http://tilde.com/term-annotation-service">
    <file original="hello.txt" source-language="en-us" datatype="plaintext">
        <body>
            <trans-unit id="1">
                <source><mrk mtype="term" itsx:termConfidence="1">hello world</mrk></source>
            </trans-unit>
        </body>
    </file>
</xliff>

Every single piece of text in a document must have a language identifier or it will not be annotated since the language is unknown. You can add a lang parameter to the query string to set the default language of the content without modifying the markup.

/api/html5?lang=en

The Domain data category identifies the topic of the document content. If the document contains no domain information, terminology from all domains is annotated. You can optionally add one or more domain parameters to the query string to set the default domain(s) of the content without modifying the markup. Each domain must be a EuroVoc domain code (two or four digit string). A parent domain (two digit code) includes all child domains.

/api/html5?domain=32
/api/html5?domain=66&domain=2441
/api/html5?lang=en&domain=32&domain=40

If your document contains references to external rules with relative paths (e.g., <link rel="its-rules" href="rules.xml">), you can add a baseUri parameter specifying an accessible base path (e.g., http://example.org/its/), otherwise the rules cannot be loaded by TAWS.

/api/html5?baseUri=http://example.org/its/
/api/html5?lang=en&baseUri=http://example.org/its/
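The effect of baseUri on a relative rules reference can be illustrated with standard URI resolution. The sketch below uses Python's urllib.parse.urljoin and assumes that TAWS resolves references in the same way, which this documentation does not state explicitly; resolve_rules_href is a hypothetical helper name.

```python
from urllib.parse import urljoin


def resolve_rules_href(base_uri: str, href: str) -> str:
    """Resolve an ITS rules reference against a base URI (standard URI resolution)."""
    return urljoin(base_uri, href)


# Base path from the example above; note the trailing slash.
print(resolve_rules_href("http://example.org/its/", "rules.xml"))
# Without the trailing slash, the last path segment ("its") is replaced.
print(resolve_rules_href("http://example.org/its", "rules.xml"))
# An absolute reference ignores the base path entirely.
print(resolve_rules_href("http://example.org/its/", "http://example.com/other.xml"))
```

Under standard resolution rules the trailing slash matters, so a base path such as http://example.org/its/ is the safer form to submit.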

Plaintext

For convenience, it is possible to annotate terminology in plaintext documents as well.

Request:

POST /api/plaintext?lang=en HTTP/1.1
Host: taws.tilde.com
Content-Length: 11

hello world

Response:

HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8

<!DOCTYPE html>
<html>
    <head>
        <meta charset="utf-8" />
    </head>
    <body its-annotators-ref="terminology|http://tilde.com/term-annotation-service">
        <span its-term="yes" its-term-confidence="1">hello world</span>
    </body>
</html>

The text will be converted and returned as an HTML5 document since there is no standard way to annotate terminology in plaintext using ITS 2.0 metadata.

Because there is no ITS 2.0 markup present in plaintext content, the lang parameter is mandatory for plaintext documents. You can optionally add one or more domain parameters to specify the domain of the text.

Terminology Annotation Methods

By default, terminology is annotated using both the “Statistical terminology annotation” (statistical) and the “Term bank based terminology annotation” (termbank) method.

To use only one method for annotation of terminology, specify it with a method parameter in the query string:

/api/html5?method=statistical
/api/html5?method=termbank
/api/html5?lang=en&domain=32&method=termbank

Note that the Domain data category is ignored if the “Term bank based terminology annotation” (termbank) method is not used.
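The query string parameters described so far (lang, domain, baseUri and method) can be combined by a small client-side helper. The following is a minimal sketch, not part of TAWS itself: build_taws_url is a hypothetical name, and the http scheme in front of taws.tilde.com is an assumption.

```python
from urllib.parse import urlencode

METHODS = ("statistical", "termbank")


def build_taws_url(endpoint, lang=None, domains=(), method=None, base_uri=None,
                   host="http://taws.tilde.com"):  # scheme is an assumption
    """Build a TAWS request URL for the html5, xliff or plaintext endpoint."""
    params = []
    if lang:
        params.append(("lang", lang))
    # The domain parameter may be repeated, once per domain.
    params.extend(("domain", domain) for domain in domains)
    if method:
        if method not in METHODS:
            raise ValueError(f"unknown method: {method}")
        params.append(("method", method))
    if base_uri:
        params.append(("baseUri", base_uri))
    url = f"{host}/api/{endpoint}"
    return f"{url}?{urlencode(params)}" if params else url


print(build_taws_url("html5", lang="en", domains=["32"], method="termbank"))
```

urlencode percent-encodes the baseUri value (for example, ":" becomes "%3A"), which is equivalent to the unencoded form shown in the examples above.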

HTTP Status Codes

TAWS will respond with one of the following status codes:

  • 200 OK – document was annotated successfully;
  • 400 Bad Request – invalid document or parameters passed to TAWS;
  • 500 Internal Server Error – there is a problem with the service.

The content of the response will be the annotated document or an error message in case of an error.

An example of a response indicating an error is as follows:

HTTP/1.1 400 Bad Request
Content-Type: text/plain; charset=utf-8

Invalid root element.
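A client can map the three status codes to outcomes as sketched below. This is an illustration only: post_document and interpret_response are hypothetical helper names, and the request headers that TAWS expects are not described in this documentation.

```python
import urllib.error
import urllib.request


def post_document(url: str, document: str):
    """POST a UTF-8 document and return (status code, response text)."""
    # urllib adds a default Content-Type header to requests with a body;
    # whether TAWS inspects that header is not documented here.
    request = urllib.request.Request(url, data=document.encode("utf-8"), method="POST")
    try:
        with urllib.request.urlopen(request) as response:
            return response.status, response.read().decode("utf-8")
    except urllib.error.HTTPError as error:
        # 4xx and 5xx responses arrive as exceptions; the body holds the error message.
        return error.code, error.read().decode("utf-8")


def interpret_response(status: int, body: str) -> str:
    """Return the annotated document or raise an error carrying the TAWS message."""
    if status == 200:
        return body
    if status == 400:
        raise ValueError(f"invalid document or parameters: {body}")
    raise RuntimeError(f"service problem (HTTP {status}): {body}")
```

Usage would be along the lines of interpret_response(*post_document(url, document)), where url is a TAWS endpoint address.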

Limitations

  • Only input in UTF-8 is supported.
  • Only the first 50 000 characters of the submitted document will be annotated. The rest of the document will be returned to the user without annotated terminology. This is a limitation for showcase purposes in order not to allow misuse of the service.
  • Domain values are limited to EuroVoc codes (EuroVoc Edition 4.4 is used). The domains, their codes and more information about the EuroVoc classification can be found on the EuroVoc website.

API Implementation

This section describes the organisation of the code, the responsibilities of the classes and how the software works.

Requests

The implementation uses the Model-View-Controller (MVC) design pattern. Routes describe how requests should be mapped to controllers. The front controller (provided by the framework) uses the routes to determine which Controller to use. The Controller creates and operates Models that do the work and generates the output (or View) which is returned as the response.

When the application is started for the first time, the method WebApiApplication.Application_Start (in Global.asax) is called. The method configures the Web API by calling the methods in the static WebApiConfig class. One of these methods, WebApiConfig.RegisterRoutes, tells the server how to route incoming requests:

/api/plaintext -> Tilde.Taws.Controllers.ApiController.Plaintext
/api/html5     -> Tilde.Taws.Controllers.ApiController.Html5
/api/xliff     -> Tilde.Taws.Controllers.ApiController.Xliff

All routes accept only POST requests.

When /api/html5 is requested, the method ApiController.ModelBinder.BindModel() is called. The method creates a new ApiDocument instance, which holds all current request parameters. Parameters at this stage are not validated and no error/warning message is given about any parameters which are not recognised or used.

Then the ApiController.Html5(ApiDocument) method is called with an ApiDocument as the only parameter.

In the controller method, a new instance of the Html5Annotator class is created and the ApiDocument is passed as a parameter. The constructor will validate the document and throw an exception if it is invalid. Then the Html5Annotator.Annotate() method, which performs the annotation, is called asynchronously.

The return value of the Html5Annotator.Annotate() method is returned as the response to the client.

The Plaintext and Xliff methods work similarly to the Html5 method.

Annotators

Annotators parse the submitted document, annotate terms in the document and then return the document. To accomplish this, annotators perform a number of steps.

Annotators take the submitted document as a parameter in the constructor, where it is parsed and the ITS 2.0 rules are loaded. When an annotation is requested, the content of the document is split into chunks and these chunks are sent to the Terminology as a Service (TaaS) API, which finds and annotates terms in these chunks. Annotated chunks are then merged back into the document and post-processed to make sure there is no invalid markup, and finally additional information is added to the document to indicate that it has been processed. Lastly, the document is returned.

The abstract Annotator<T> class contains most of the functionality while inherited classes add implementation specific functionality and define the order of execution.

PlaintextAnnotator

It is not really possible to visualise terminology in plaintext. Therefore, the visualisation of terminology in plaintext is done in HTML5. Since it is possible for an HTML5 document to look exactly like a plaintext document, a plaintext document is converted to an HTML5 document and then annotated using the Html5Annotator to avoid duplicating functionality.

Before conversion, HTML entities are escaped and new lines are preserved by using the <br> HTML tag. The processed text is placed in the <body> tag and there is a tag identifying the text encoding (<meta charset="utf-8">) added to the <head> element.
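The conversion described above can be sketched in a few lines. This is a simplified illustration rather than the PlaintextAnnotator code; plaintext_to_html5 is a hypothetical name, and the exact markup that TAWS emits (for example, whitespace between tags) may differ.

```python
import html


def plaintext_to_html5(text: str) -> str:
    """Wrap plaintext in a minimal HTML5 document."""
    # Escape characters that have a special meaning in HTML (&, <, >).
    escaped = html.escape(text, quote=False)
    # Preserve line breaks with <br>, treating Windows line endings as one break.
    with_breaks = escaped.replace("\r\n", "\n").replace("\n", "<br>")
    return (
        "<!DOCTYPE html>"
        '<html><head><meta charset="utf-8"></head>'
        f"<body>{with_breaks}</body></html>"
    )


print(plaintext_to_html5("a < b\nc & d"))
```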

Html5Annotator

The Html5Annotator annotates terminology in HTML5 documents. XHTML and older versions of HTML may also be processed (however, they are not officially supported). Invalid markup is tolerated since we use a forgiving HTML5 parser (HtmlParserSharp, see the section about dependencies).

In the constructor, an instance of the class Tilde.Its.ItsHtmlDocument is created, which parses the submitted document (which is a string at this point) and loads ITS 2.0 rules.

Annotate() is an asynchronous method which performs the annotation. All annotation is performed on elements in the <body> element.

Default values

If <body> has the default Language Information data category value (i.e., there is no explicit language defined within the ITS 2.0 enriched content), the submitted language is used as the <body> language (and it will be inherited by all child nodes that have no language defined). Similarly, if <body> has the default Domain data category value (i.e., no domain metadata specified in the ITS 2.0 enriched content), the submitted domains are used (if there are any) and these will be also inherited by child nodes.

Chunking

The content of <body> is split into chunks by Html5Chunker (see Chunkers for more details).

Chunks are sent to TaaS

After chunking, the chunks are sent to the TaaS API asynchronously and in parallel. In parallel means that all chunks are sent at the same time and responses can also be processed at the same time. Asynchronously means that the threads will not block and are able to perform other tasks while the I/O operation is processing.
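The same pattern can be sketched with Python's asyncio. Here annotate_chunk is a stand-in for a single TaaS request (it only sleeps and upper-cases the text), so no names or behaviour of the real service are implied. Unlike the .NET implementation, asyncio interleaves the requests on a single thread, but the principle of not blocking during I/O is the same.

```python
import asyncio


async def annotate_chunk(chunk: str) -> str:
    """Stand-in for one TaaS request; the sleep simulates network I/O."""
    await asyncio.sleep(0.01)
    return chunk.upper()


async def annotate_all(chunks):
    # gather() starts all requests at once and returns the results
    # in the same order as the input chunks.
    return await asyncio.gather(*(annotate_chunk(chunk) for chunk in chunks))


results = asyncio.run(annotate_all(["hello world", "sveika, pasaule"]))
print(results)
```

Because the results come back in input order, each annotated chunk can be matched to the place in the document it was taken from.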

Integration of chunks within the document

Annotated chunks are received from the API and integrated back in the document. At this point, terms in the document are represented by the Tename class (which inherits from the XElement class) in the document XML tree and appear as <tename xmlns="">term</tename> in XML.

Removal of duplicates

If a piece of text was in two domains (e.g., law and finance) and there was a word in the text that was a term in both domains, the word would be annotated twice. Annotator<T>.MergeTenames() merges such terms into one by averaging the confidence value and preserving term information (TBX).
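The merge step can be sketched as follows. How MergeTenames identifies duplicates is not documented here, so this sketch assumes that two annotations are duplicates when they cover the same character span; Term and merge_duplicates are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    start: int          # offset of the first character of the term
    end: int            # offset just past the last character
    confidence: float
    term_ids: list = field(default_factory=list)  # references to TBX entries


def merge_duplicates(terms):
    """Merge annotations of the same span: average confidence, keep all term IDs."""
    by_span = {}
    for term in terms:
        by_span.setdefault((term.start, term.end), []).append(term)
    merged = []
    for (start, end), group in by_span.items():
        confidence = sum(t.confidence for t in group) / len(group)
        term_ids = [term_id for t in group for term_id in t.term_ids]
        merged.append(Term(start, end, confidence, term_ids))
    return merged


law = Term(0, 16, 0.5, ["law-1"])
finance = Term(0, 16, 1.0, ["fin-2"])
print(merge_duplicates([law, finance]))
```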

Removal of invalid annotations

The Terminology data category does not support inheritance, and thus some annotations that are not valid (e.g., overlapping annotations) are removed. For instance, <tename>terminology <span>annotation</span></tename> is not valid and the annotation will be removed.

Removal of existing terms

If a term was already annotated in the original document with either local or global markup, the added annotation is removed in order not to conflict with the existing annotation. To identify the annotation service used for annotation, the its-annotators-ref attribute is added to the <body> element (if there are no terms in the document that were annotated by another annotator).

Replacement

All remaining term annotations (which are still Tename instances) are renamed to HTML tags. If its-annotators-ref for terminology is not inherited (i.e. it was not added to the <body> element), it is added as a local attribute.

<span its-term="yes" its-term-confidence="0.5">term</span>

For each term that has additional terminological information provided by the TaaS terminology annotation service in the TBX format, this information is added in a <script> tag to the <head> element and identified by a unique ID based on the term ID(s) (prefixed with tilde-tbx). This ID is then added as an its-term-info-ref reference to each term.

<span its-term="yes" its-term-info-ref="#tilde-tbx-1" ...>

TBX is an XML document. Because terms could have been merged, each TBX can contain several entries.

<script type="text/xml" id="tilde-tbx-1">
<?xml version='1.0'?>
<!DOCTYPE martif SYSTEM "TBXcoreStructV02.dtd">
    <martif type="TBX">
        <martifHeader>
            <fileDesc>
                <sourceDesc>
                    <p>Tilde Terminology Annotation Service</p>
                </sourceDesc>
            </fileDesc>
            <encodingDesc>
                <p type="XCSURI">http://www.ttt.org/oscarstandards/tbx/TBXXCS.xcs</p>
            </encodingDesc>
        </martifHeader>
        <text>
            <body>
                <termEntry>...</termEntry>
                <termEntry>...</termEntry>
            </body>
        </text>
    </martif>
</script>

Finally, the underlying XML document is converted to an HTML5 document (e.g., <script/> is invalid, it must be <script></script>) and returned as a string.

XliffAnnotator

XliffAnnotator works similarly to Html5Annotator.

The XliffAnnotator annotates terminology in XLIFF 1.0, 1.1 and 1.2 documents.

In the constructor, an instance of the class Tilde.Its.ItsXmlDocument is created, which parses the submitted document. At this point the XML document is validated: the document must be in an XLIFF namespace and the root element must be <xliff>; otherwise an exception is thrown. Before the ITS 2.0 rules are loaded, the following custom rules are transparently inserted into the document (they can be overridden by user-defined ITS 2.0 rules). These rules are based on the information provided in the XLIFF Mapping wiki.

<its:rules version='2.0'
    xmlns:its='http://www.w3.org/2005/11/its'
    xmlns:itsx='http://www.w3.org/ns/its-xliff/'
    xmlns:xliff11='urn:oasis:names:tc:xliff:document:1.1'
    xmlns:xliff12='urn:oasis:names:tc:xliff:document:1.2'>
    
    <its:termRule selector='/xliff//mrk[@mtype="term"]' term="yes" />
    <its:termRule selector='//xliff11:mrk[@mtype="term"]' term="yes" />
    <its:termRule selector='//xliff12:mrk[@mtype="term"]' term="yes" />
    
    <its:termRule selector='/xliff//mrk[@mtype="x-its-term-no"]' term="no" />
    <its:termRule selector='//xliff11:mrk[@mtype="x-its-term-no"]' term="no" />
    <its:termRule selector='//xliff12:mrk[@mtype="x-its-term-no"]' term="no" />
    
    <its:domainRule selector='//*' domainPointer='@itsx:domains'/>
    
    <its:withinTextRule withinText='yes' selector='/xliff//g | /xliff//x | /xliff//bx | /xliff//ex | /xliff//bpt | /xliff//ept | /xliff//it | /xliff//ph | /xliff//mrk'/>
    <its:withinTextRule withinText='yes' selector='//xliff11:g | //xliff11:x | //xliff11:bx | //xliff11:ex | //xliff11:bpt | //xliff11:ept | //xliff11:it | //xliff11:ph | //xliff11:mrk'/>
    <its:withinTextRule withinText='yes' selector='//xliff12:g | //xliff12:x | //xliff12:bx | //xliff12:ex | //xliff12:bpt | //xliff12:ept | //xliff12:it | //xliff12:ph | //xliff12:mrk'/>
    
    <its:withinTextRule withinText='nested' selector='/xliff//sub'/>
    <its:withinTextRule withinText='nested' selector='//xliff11:sub'/>
    <its:withinTextRule withinText='nested' selector='//xliff12:sub'/>
    
</its:rules>

Annotate() is an asynchronous method which performs the annotation. All annotation is performed only on <source>, <seg-source> and <target> elements.

Default values

If the XLIFF document has missing language information or has the default Language Information data category value (i.e., there is no explicit language defined within the ITS 2.0 enriched content), the submitted language is used as the source language of the document (and it will be inherited by all child nodes that have no language defined). Similarly, if <xliff> has the default Domain data category value (i.e., no domain metadata specified in the ITS 2.0 enriched content), the submitted domains are used (if there are any) and these will be also inherited by child nodes.

In XLIFF 1.2, only the following elements are allowed to have an xml:lang attribute: <xliff>, <note>, <prop>, <source>, <target>, <alt-trans>. If any of these elements has this attribute defined, as per ITS 2.0 rules, it will be used as the value for the Language Information data category and thus inherited by child elements. This is why the value of the xml:lang attribute is ignored on other elements (i.e. not <xliff>, <note>, <prop>, etc.) outside <source>, <seg-source> and <target>.

Chunking

The content of <source>, <seg-source> and <target> elements is split into chunks by XliffChunker (see Chunkers for more details).

Chunks are sent to TaaS

After chunking, the chunks are sent to the TaaS API asynchronously and in parallel. In parallel means that all chunks are sent at the same time and responses can also be processed at the same time. Asynchronously means that the threads will not block and are able to perform other tasks while the I/O operation is processing.

Integration of chunks within the document

Annotated chunks are received from the API and integrated back in the document. At this point, terms in the document are represented by the Tename class (which inherits from the XElement class) in the document XML tree and appear as <tename xmlns="">term</tename> in XML.

Removal of duplicates

If a piece of text was in two domains (e.g., law and finance) and there was a word in the text that was a term in both domains, the word would be annotated twice. Annotator<T>.MergeTenames() merges such terms into one by averaging the confidence value and preserving term information (TBX).

Removal of invalid annotations

The Terminology data category does not support inheritance, and thus some annotations that are not valid (e.g., overlapping annotations) are removed. For instance, <tename>terminology <g id='1'>annotation</g></tename> is not valid and the annotation will be removed.

Removal of existing terms

If a term was already annotated in the original document with either local or global markup, the added annotation is removed in order not to conflict with the existing annotation.

To identify the annotation service used for annotation, the its:annotatorsRef attribute is added to the <xliff> element (if there are no terms in the document that were annotated by another annotator).

If the ITS 2.0 namespace has not been defined in the document, it is added with the prefix its. If this prefix is already used by another namespace, other prefixes (such as its2, its3 etc.) are used. Similarly, the XLIFF Mapping namespace (http://www.w3.org/ns/its-xliff/) is defined with the prefix itsx.
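The prefix selection described above can be sketched as follows. choose_prefix is a hypothetical helper, and the sketch assumes that an existing declaration of the namespace is reused as-is, which the text implies but does not state.

```python
ITS_NS = "http://www.w3.org/2005/11/its"


def choose_prefix(declared: dict, namespace: str, preferred: str) -> str:
    """Pick a prefix for a namespace; declared maps prefix -> namespace URI in scope."""
    for prefix, uri in declared.items():
        if uri == namespace:
            return prefix  # the namespace is already declared, so reuse its prefix
    candidate, counter = preferred, 2
    while candidate in declared:
        # The preferred prefix is bound to another namespace: try its2, its3, ...
        candidate = f"{preferred}{counter}"
        counter += 1
    return candidate


print(choose_prefix({}, ITS_NS, "its"))
print(choose_prefix({"its": "urn:example:other"}, ITS_NS, "its"))
```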

Replacement

All remaining term annotations (which are still Tename instances) are renamed to XLIFF tags. If its:annotatorsRef for terminology is not inherited (i.e. it was not added to the <xliff> element), it is added as a local attribute.

<mrk mtype="term" itsx:termConfidence="0.5">term</mrk>

For each term that has additional terminological information provided by the TaaS terminology annotation service in the TBX format, this information is added in a <martif> tag to the <header> element of each file, and each term in it is identified by a unique ID based on the term ID(s) (prefixed with tilde-tbx). This ID is then added as an itsx:termInfoRef reference to each term.

<mrk mtype="term" itsx:termInfoRef="#tilde-tbx-1" ...>

TBX is an XML document embedded in the header element.

<xliff ...>
    <file original='tbx.txt' source-language='en'>
        <header>
            <martif type='TBX' xmlns='http://www.ttt.org/oscarstandards/tbx/TBXcoreStructV02.dtd'>
                <martifHeader>
                    <fileDesc>
                        <sourceDesc>
                            <p>Tilde Terminology Annotation Service</p>
                        </sourceDesc>
                    </fileDesc>
                    <encodingDesc>
                        <p type='XCSURI'>http://www.ttt.org/oscarstandards/tbx/TBXXCS.xcs</p>
                    </encodingDesc>
                </martifHeader>
                <text>
                    <body>
                        <termEntry xml:id='tilde-tbx-etb-1'>...</termEntry>
                    </body>
                </text>
            </martif>
        </header>
    ...
    </file>
</xliff>

Finally, the underlying XML document is returned as a string.

Chunkers

Chunkers are classes that split the content of a document into smaller fragments called chunks. The text in each chunk has the same properties (i.e., the same language, the same domain). A fragment of a text can be in two chunks (e.g., a paragraph can be in two domains like finance and law) at the same time.

Chunking is necessary for the TaaS terminology annotation service as it is designed to process text of one language and one domain at a time (for more details see the TaaS terminology annotation service).

Chunkers use the Language Information, Domain, Elements Within Text and Locale Filter data categories but ignore the Terminology data category, which is taken into account by annotators.

“Split” and “splitting” refer to splitting the document into chunks, whereas “join” and “joining” refer to the opposite of splitting.

Chunkers operate according to the following algorithm:

  • Find all unique languages in the document (e.g., <none>, en, en-US, lv).
  • Find all unique domains in the document (e.g., <none>, auto, finance, law).
  • For each language/domain pair, find elements with independent text flow (Elements Within Text is not equal to “yes”), ignoring some elements (e.g., <script> in HTML5, everything but <source>, <target> in XLIFF). The content of these elements is in the selected language and the content is also in the selected domain.
  • For each independent element, extract text from it:
    • If it is an ignored element, return nothing at all;
    • If it is a whitespace element (e.g., <br> in HTML5) and it is not the first node, represent it with a space;
    • If it is in another language or domain, ignore the text;
    • If it is excluded by the Locale Filter data category, ignore the text;
    • Otherwise include the node text;
    • If the text is followed by another independent text flow, add a break;
    • Repeat the process with all the children.

Example in HTML5:

<body lang='en'>
    <p lang='lv'>Sveika, pasaule</p>
    <div lang='en'>Hello, <strong>world</strong> <div>:)</div>!</div>
    <p lang='en-US' domains='law, finance'>money laundering</p>
    <p its-locale-filter-list='lv'>High-five!</p>
    <script>...</script>
</body>

Languages: en, lv, en-us
Domains: <none>, law, finance

Language: en, Domain: <none>
Independent text flows: 
    <body lang='en'>
    (<p lang='lv'> is in another language)
    <div lang='en'> 
    <div> 
    (<p lang='en-US'> is in another language and domain)
    <p its-locale-filter-list='lv'>
Extracted text:
    <body lang='en'>: (nothing because it contains only child elements)
    <div lang='en'>: Hello, world <break>! (<break> because of another independent element)
    <div>: :)
    <p its-locale-filter-list='lv'>: (nothing because the content applies only to 'lv')

Language: lv, Domain: <none>
Independent text flows: <p lang='lv'> (not <p its-locale-filter-list='lv'> because the language is 'en', inherited from <body>)
Extracted text: <p lang='lv'>: Sveika, pasaule

Language: lv, Domain: finance
Independent text flows: none

Language: en, Domain: finance
Independent text flows: none

Language: en-us, Domain: finance
Independent text flows: <p lang='en-US' domains='law, finance'>
Extracted text: money laundering

Language: en-us, Domain: law
Independent text flows: <p lang='en-US' domains='law, finance'>
Extracted text: money laundering

The extracted text is then sent to the TaaS terminology annotation service, which adds terminology annotation to the text but does not change anything else.
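The first steps of the chunking algorithm, finding language/domain pairs and collecting the text that belongs to each pair, can be sketched as follows. This is a heavily simplified illustration and not the Html5Chunker code: it reads lang and the illustrative domains attribute directly, lets child elements inherit both, skips <script>, and ignores Elements Within Text, Locale Filter, text after child elements and break handling. All names are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

SAMPLE = """<body lang='en'>
    <p lang='lv'>Sveika, pasaule</p>
    <p lang='en-US' domains='law, finance'>money laundering</p>
    <script>ignored()</script>
</body>"""

IGNORED = {"script"}


def walk(element, lang="", domains=("",)):
    """Yield (element, language, domains), inheriting values from the parent."""
    if element.tag in IGNORED:
        return
    lang = element.get("lang", lang).lower()
    if element.get("domains"):
        domains = tuple(d.strip() for d in element.get("domains").split(","))
    yield element, lang, domains
    for child in element:
        yield from walk(child, lang, domains)


def chunks_by_pair(root):
    """Group the leading text of each element by (language, domain) pair."""
    chunks = defaultdict(list)
    for element, lang, domains in walk(root):
        text = (element.text or "").strip()
        if text:
            # Text in several domains lands in one chunk per domain.
            for domain in domains:
                chunks[(lang, domain)].append(text)
    return dict(chunks)


chunks = chunks_by_pair(ET.fromstring(SAMPLE))
for pair, texts in chunks.items():
    print(pair, texts)
```

An empty domain string plays the role of <none> in the example above.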

The TaaS terminology annotation service

Terminology as a Service (TaaS) exposes a RESTful API over HTTP. It takes four parameters (the text, its language, its domain and the terminology annotation method to use) and annotates terminology in the text accordingly. At the time of writing, the TaaS terminology annotation service can annotate texts in English, Latvian and Lithuanian.

The TaaS terminology annotation service can annotate terminology using three methods:

  • Method1: statistical;
  • Method2: term bank based;
  • Method3: both statistical and term bank based.

The Domain values have to be valid EuroVoc codes (strings of two or four digits). Parent domains (two-digit codes) include all their child domains (four-digit codes). An empty domain means that terms from all domains should be annotated. Domains are only taken into account when the term bank based terminology annotation method is used.
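The domain rules above can be sketched as a small check. This is only an illustration, not part of TAWS: the class and method names are hypothetical, and it assumes that a four-digit child code starts with the two digits of its parent domain.

```csharp
using System;
using System.Linq;

public static class EuroVocDomain
{
    // A code is accepted if it is empty (all domains) or consists of two or four digits.
    public static bool IsValid(string code) =>
        code.Length == 0 ||
        ((code.Length == 2 || code.Length == 4) && code.All(c => c >= '0' && c <= '9'));

    // True if a term filed under termDomain falls within the requested domain.
    public static bool Covers(string requested, string termDomain)
    {
        if (!IsValid(requested))
            throw new ArgumentException("Not a two- or four-digit code: " + requested);
        if (requested.Length == 0)
            return true; // an empty domain means all domains
        // Assumption: a child code begins with its parent's two digits.
        return termDomain.StartsWith(requested, StringComparison.Ordinal);
    }
}
```

Under these assumptions, `EuroVocDomain.Covers("04", "0406")` is true, while `EuroVocDomain.Covers("0406", "04")` is false.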

The TaaS terminology annotation service returns the same text with terms enclosed in XML tags, although the text itself is not an XML document.

This is a <TENAME SCORE="1.0" LEMMA="term" MSD="NN">term</TENAME>.
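Because the returned text is not a well-formed XML document, one way to read the annotations is with regular expressions rather than an XML parser. The following is a minimal sketch; the class name is hypothetical, and it assumes that attribute values are double-quoted and that TENAME elements are not nested.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TenameParser
{
    // Matches one <TENAME ...>...</TENAME> element: group 1 = attributes, group 2 = term text.
    static readonly Regex Tag =
        new Regex(@"<TENAME\b([^>]*)>(.*?)</TENAME>", RegexOptions.Singleline);

    // Matches one name="value" attribute pair.
    static readonly Regex Attr = new Regex(@"(\w+)=""([^""]*)""");

    // Returns one dictionary per annotated term: its attributes plus the term text under "text".
    public static List<Dictionary<string, string>> Parse(string annotated)
    {
        var terms = new List<Dictionary<string, string>>();
        foreach (Match tag in Tag.Matches(annotated))
        {
            var term = new Dictionary<string, string> { ["text"] = tag.Groups[2].Value };
            foreach (Match attr in Attr.Matches(tag.Groups[1].Value))
                term[attr.Groups[1].Value] = attr.Groups[2].Value;
            terms.Add(term);
        }
        return terms;
    }
}
```

For the example line above, `Parse` yields a single entry whose `LEMMA` and `text` values are both `term`.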

If the Term bank based terminology annotation method is used, annotated terms may contain a termID attribute and a corresponding termEntry (in the TBX format) containing additional terminological information.

This is a <TENAME termID="etb-1" SCORE="1.0" LEMMA="term" MSD="NN">term</TENAME>.

<martif type="TBX" xmlns='http://www.ttt.org/oscarstandards/tbx/TBXcoreStructV02.dtd'>
    <martifHeader>
        <fileDesc>
            <sourceDesc>
                <p>Tilde Terminology Annotation Service</p>
            </sourceDesc>
        </fileDesc>
        <encodingDesc>
            <p type="XCSURI">http://www.ttt.org/oscarstandards/tbx/TBXXCS.xcs</p>
        </encodingDesc>
    </martifHeader>
    <text>
        <body>
        
            <termEntry id="etb-1">
                <admin type="sourceLanguage">en</admin>
                <descrip type="subjectField">04</descrip>
                <langSet xml:lang="en">
                    <ntig>
                        <termGrp>
                            <term>Translation</term>
                            <termCompList type="lemma">
                                <termCompGrp>
                                    <termComp>translation</termComp>
                                    <termNote type="partOfSpeech">noun</termNote>
                                    <termNote type="grammaticalNumber">singular</termNote>
                                </termCompGrp>
                            </termCompList>
                        </termGrp>
                        <descrip type="reliabilityCode">1</descrip>
                        <admin type="score">0.24</admin>
                        <xref target="http://www.eurotermbank.com/Collection.aspx?collectionid=382" type="xSource">Eesti Õigustõlke Keskuse terminibaas ESTERM</xref>
                    </ntig>
                </langSet>
            </termEntry>
            
        </body>
    </text>
</martif>
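A TBX document such as the one above can be read with LINQ to XML. The sketch below is an illustration only (the class and method names are hypothetical): it looks up a termEntry by its id and returns the first term for a given language, using the default namespace declared on the martif element of the sample.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class TbxReader
{
    // Default namespace declared on the <martif> root of the sample.
    static readonly XNamespace Tbx =
        "http://www.ttt.org/oscarstandards/tbx/TBXcoreStructV02.dtd";

    // Finds the <termEntry> with the given id, or null if there is none.
    public static XElement FindTermEntry(XDocument tbx, string termId) =>
        tbx.Descendants(Tbx + "termEntry")
           .FirstOrDefault(e => (string)e.Attribute("id") == termId);

    // Returns the first <term> in the langSet for the given language, or null.
    public static string FirstTerm(XElement termEntry, string language) =>
        termEntry.Elements(Tbx + "langSet")
                 .Where(ls => (string)ls.Attribute(XNamespace.Xml + "lang") == language)
                 .Descendants(Tbx + "term")
                 .Select(t => t.Value)
                 .FirstOrDefault();
}
```

Applied to the sample, `FindTermEntry(doc, "etb-1")` followed by `FirstTerm(entry, "en")` would return `Translation`.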

The class TaaS is the TaaS terminology annotation service client.

It can be configured with two properties:

  • UseStatisticalExtraction (boolean),
  • UseTermBankExtraction (boolean).
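The two properties plausibly correspond to the three annotation methods listed earlier (Method1, Method2, Method3). The mapping below is an assumption made for illustration, not the documented behaviour of the TaaS class, and the helper name is hypothetical.

```csharp
using System;

public static class AnnotationMethod
{
    // Assumed mapping from the two boolean properties to a method name.
    public static string Select(bool useStatisticalExtraction, bool useTermBankExtraction)
    {
        if (useStatisticalExtraction && useTermBankExtraction)
            return "Method3"; // both statistical and term bank based
        if (useTermBankExtraction)
            return "Method2"; // term bank based
        if (useStatisticalExtraction)
            return "Method1"; // statistical
        // Assumption: annotating with no method enabled is treated as an error.
        throw new InvalidOperationException("At least one extraction method must be enabled.");
    }
}
```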

To perform annotation, the method async Task<string[][]> Annotate(string language, string domain, string[][] text) is used.

The first dimension of string[][] text defines paragraphs. They are independent units of text. The second dimension of the array defines the text fragments.

The following HTML5 segment

<p>Hello <span>world</span>!</p>
<div>Beginning <div>middle</div> end</div>

is represented as:

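string[][] text = new[] { 
    // <span> has the Elements Within Text value "yes", so it stays inside the paragraph's single fragment
    new[] { "Hello world!" }, 
    // the nested <div> is an independent element: it splits the outer text into two fragments, so "Beginning end" cannot be a term
    new[] { "Beginning ", " end" },
    // the nested <div>, like <p>, has its own independent text flow
    new[] { "middle" } 
};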

The two-dimensional array is transformed into a string:

Hello world!
 ... ... 
Beginning 
 .. .. 
end
 ... ... 
middle

Here \n ... ... \n marks a paragraph boundary and \n .. .. \n marks a fragment boundary.
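The transformation and its inverse can be sketched as follows. The class and method names are hypothetical; only the two boundary strings are taken from the description above. Because the annotation service changes nothing except for adding term tags, the boundaries survive the round trip and the annotated string can be split back into the same shape.

```csharp
using System;
using System.Linq;

public static class TextFlowSerializer
{
    const string ParagraphBoundary = "\n ... ... \n";
    const string FragmentBoundary = "\n .. .. \n";

    // Joins fragments within each paragraph, then joins the paragraphs.
    public static string Join(string[][] text) =>
        string.Join(ParagraphBoundary,
            text.Select(fragments => string.Join(FragmentBoundary, fragments)));

    // Splits a joined (and possibly annotated) string back into paragraphs and fragments.
    public static string[][] Split(string joined) =>
        joined.Split(new[] { ParagraphBoundary }, StringSplitOptions.None)
              .Select(p => p.Split(new[] { FragmentBoundary }, StringSplitOptions.None))
              .ToArray();
}
```

The sketch assumes that the boundary strings never occur in the document text itself.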

If term information is returned along with the annotations, it is stored. Because a text fragment can be annotated in more than one domain, the terminology annotation service is called several times, and the term IDs returned by different calls can conflict. Therefore, each ID is prefixed with taas- and the number (from 1 to N) of the call to the terminology annotation service.

domain: law      etb-1 => taas-1-etb-1
domain: finance  etb-1 => taas-2-etb-1
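The renaming shown in the example above (etb-1 => taas-2-etb-1) could be applied to the annotated text along these lines. This is a sketch with a hypothetical helper name, assuming double-quoted termID attributes; the id attributes of the corresponding termEntry elements would need the same treatment.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TermIdRewriter
{
    // Matches a termID="..." attribute; group 1 is the original ID.
    static readonly Regex TermId = new Regex(@"\btermID=""([^""]*)""");

    // Rewrites every termID so that it carries the number of the service call.
    public static string Prefix(string annotated, int callNumber) =>
        TermId.Replace(annotated,
            m => "termID=\"taas-" + callNumber + "-" + m.Groups[1].Value + "\"");
}
```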

The additional term information can be retrieved with XElement GetTermEntry(string termID).

TaaS Configuration

The TaaS terminology annotation service is an external Web service and not in the scope of this project. The TaaS terminology annotation service is developed as part of the TaaS project (see http://taas-project.eu for more details). The credentials to access the TaaS terminology annotation service are not provided in the source code. For more details on how to access the TaaS terminology annotation service, please refer to the TaaS project or contact Tilde. The access details can be changed in the Web.config file.

<configuration>
    <appSettings>
        <add key="TaaS_Username" value="..." />
        <add key="TaaS_Password" value="..." />
        <add key="TaaS_Server" value="http://..." />
        <add key="TaaS_Timeout" value="00:10:00" />
        <add key="TaaS_MaxLength" value="50000" /> <!-- Max number of characters to send to TaaS. If text length exceeds this number, the remaining part will not be sent to TaaS and will not be annotated. -->
    </appSettings>
</configuration>
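As an illustration of how these settings might be consumed, the sketch below parses them from a NameValueCollection (the type returned by ConfigurationManager.AppSettings) and applies the TaaS_MaxLength cut-off. The TaaSSettings class is hypothetical and is not part of the TAWS source code.

```csharp
using System;
using System.Collections.Specialized;
using System.Globalization;

public sealed class TaaSSettings
{
    public string Server { get; private set; }
    public TimeSpan Timeout { get; private set; }
    public int MaxLength { get; private set; }

    // Reads the TaaS_* keys shown in the configuration fragment above.
    public static TaaSSettings From(NameValueCollection appSettings)
    {
        return new TaaSSettings
        {
            Server = appSettings["TaaS_Server"],
            // "00:10:00" is parsed as hours:minutes:seconds, i.e. ten minutes.
            Timeout = TimeSpan.Parse(appSettings["TaaS_Timeout"], CultureInfo.InvariantCulture),
            MaxLength = int.Parse(appSettings["TaaS_MaxLength"], CultureInfo.InvariantCulture)
        };
    }

    // Keeps only the first MaxLength characters; the remainder would not be annotated.
    public string Truncate(string text) =>
        text.Length > MaxLength ? text.Substring(0, MaxLength) : text;
}
```

In a web application, the settings could be loaded with `TaaSSettings.From(ConfigurationManager.AppSettings)`, which requires a reference to System.Configuration.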

Other dependencies

This section lists software libraries that are used in the project.