# FoLiApy

## Introduction
This Python module provides an extensive library for parsing, creating and processing documents in the Format for Linguistic Annotation, aka <b>FoLiA</b>. It has been in active development since 2010 and is used by numerous Natural Language Processing (NLP) tools. The library provides an Application Programming Interface (API) for reading, creating and manipulating FoLiA XML documents. It is written for Python 3.5 and above.

## Installation

To install use the following pip command:
<b> pip3 install folia</b>

## Getting started with FoLiA

### FoLiA: Format for Linguistic Annotation

FoLiA, an acronym for Format for Linguistic Annotation, is a data model and file format to represent digitised language resources enriched with linguistic annotation, e.g. linguistically enriched textual documents or transcriptions of speech. <b>The format is intended to provide a standard for the storage and exchange of such language resources, including corpora, and to promote interoperability amongst Natural Language Processing tools that use the format.</b>

The aim was to introduce a single rich format that can accommodate a wide variety of linguistic annotation types through a single generalised paradigm. FoLiA has the following characteristics:

- Expressive
- Generic
- Specific
- Formalized
- Practical

FoLiA is a document-based format, representing each document and all relevant annotations in a single XML file.
The following is the basic structure of a FoLiA document. It should always be UTF-8 encoded.

```xml
<?xml version="1.0" encoding="utf-8"?>
<FoLiA xmlns="http://ilk.uvt.nl/folia"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  version="2.0"
  xml:id="example">
  <metadata>
      <annotations>
          ...
      </annotations>
      <provenance>
          ..
      </provenance>
      ...
  </metadata>
  <text xml:id="example.text">
     ...
  </text>
</FoLiA>
```

The structure of a FoLiA document can be divided into two parts: the metadata section and a body. The body is formed by either the ``` <text> ``` element or the ``` <speech> ``` element. The body elements (``` <text>/<speech> ```) are structural elements, but they take no set or class and require no declaration.
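
The two-part layout can be checked mechanically. The sketch below uses only Python's standard ```xml.etree.ElementTree``` module rather than FoLiApy; the ```SKELETON``` string and the ```split_document``` helper are illustrative assumptions, not part of the library.

```python
import xml.etree.ElementTree as ET

FOLIA_NS = "http://ilk.uvt.nl/folia"  # XML namespace declared on the root element

# Hand-written skeleton for illustration; a real document carries declarations and content.
SKELETON = f"""<?xml version="1.0" encoding="utf-8"?>
<FoLiA xmlns="{FOLIA_NS}" version="2.0" xml:id="example">
  <metadata>
    <annotations/>
    <provenance/>
  </metadata>
  <text xml:id="example.text"/>
</FoLiA>"""


def split_document(xml_string):
    """Return the (metadata, body) elements of a FoLiA document string."""
    root = ET.fromstring(xml_string.encode("utf-8"))
    metadata = root.find(f"{{{FOLIA_NS}}}metadata")
    # The body is either a <text> or a <speech> element
    body = root.find(f"{{{FOLIA_NS}}}text")
    if body is None:
        body = root.find(f"{{{FOLIA_NS}}}speech")
    return metadata, body


metadata, body = split_document(SKELETON)
print([child.tag.split("}")[1] for child in metadata])  # ['annotations', 'provenance']
print(body.tag.split("}")[1])                           # text
```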

All forms of annotation in FoLiA are encoded using a distinct XML element. The first few layers of nested XML elements are structural elements such as divisions, paragraphs and sentences. The deepest structure layer is tokenisation (``` <w> ```, Token Annotation). Within these structures, there can be inline annotation elements encoding linguistic information, as well as layers with span annotation, which refer back to the tokens/words in a stand-off fashion.

An example of a simple speech document might look like this:
```xml
<?xml version="1.0" encoding="utf-8"?>
<FoLiA xmlns="http://ilk.uvt.nl/folia" version="2.0" xml:id="example">
  <metadata>
      <annotations>
          <phon-annotation>
              <annotator processor="p1" />
          </phon-annotation>
          <utterance-annotation>
              <annotator processor="p1" />
          </utterance-annotation>
          <token-annotation>
              <annotator processor="p1" />
          </token-annotation>
      </annotations>
      <provenance>
         <processor xml:id="p1" name="proycon" type="manual" />
      </provenance>
  </metadata>
  <speech xml:id="example.speech">
    <utt xml:id="example.utt.1" src="helloworld.mp3" begintime="00:00:00.000" endtime="00:00:02.000">
        <ph>helˈoʊ wɝːld</ph>
        <w xml:id="example.utt.1.w.1" begintime="00:00:00.000" endtime="00:00:01.000">
            <ph>helˈoʊ</ph>
        </w>
        <w xml:id="example.utt.1.w.2" begintime="00:00:01.000" endtime="00:00:02.000">
            <ph>wɝːld</ph>
        </w>
    </utt>
  </speech>
</FoLiA>
```
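
To make the role of the timing attributes concrete, here is a minimal sketch that reads the per-word phonetic content and duration from a trimmed copy of the speech body. It relies only on the standard library; ```to_seconds``` and ```word_timings``` are hypothetical helper names introduced for this illustration.

```python
import xml.etree.ElementTree as ET

NS = {"f": "http://ilk.uvt.nl/folia"}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Body of the speech example, trimmed to the parts used here
SPEECH = """<FoLiA xmlns="http://ilk.uvt.nl/folia" version="2.0" xml:id="example">
  <speech xml:id="example.speech">
    <utt xml:id="example.utt.1" src="helloworld.mp3">
      <w xml:id="example.utt.1.w.1" begintime="00:00:00.000" endtime="00:00:01.000">
        <ph>helˈoʊ</ph>
      </w>
      <w xml:id="example.utt.1.w.2" begintime="00:00:01.000" endtime="00:00:02.000">
        <ph>wɝːld</ph>
      </w>
    </utt>
  </speech>
</FoLiA>"""


def to_seconds(timestamp):
    """Convert an HH:MM:SS.mmm timestamp to seconds."""
    hours, minutes, seconds = timestamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def word_timings(xml_string):
    """Return (id, phonetic content, duration in seconds) for every <w> element."""
    root = ET.fromstring(xml_string)
    result = []
    for w in root.iterfind(".//f:w", NS):
        phon = w.find("f:ph", NS).text
        duration = to_seconds(w.get("endtime")) - to_seconds(w.get("begintime"))
        result.append((w.get(XML_ID), phon, duration))
    return result


for word_id, phon, duration in word_timings(SPEECH):
    print(word_id, phon, f"{duration:.1f}s")
```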

## Creating a FoLiA Document

Now let's start using the FoLiApy library. Any script that uses FoLiA has to start with the following import:

In [1]:
import folia.main as folia

To create a FoLiA document, we use the Document constructor and give our document an ID.

In [2]:
doc = folia.Document(id='example')

<b> Adding structure to our document:</b>
First, we add a Text element. Then we can add paragraphs, sentences, or other structural elements. The ``` AbstractElement.add() ``` method adds new children to an element:

In [3]:
text = doc.add(folia.Text)
paragraph = text.add(folia.Paragraph)
sentence = paragraph.add(folia.Sentence)
sentence.add(folia.Word, 'This')
sentence.add(folia.Word, 'is')
sentence.add(folia.Word, 'a')
sentence.add(folia.Word, 'test')
sentence.add(folia.Word, '.')

<Word at 305086068720 id=example.text.1.p.1.s.1.w.5 set=None class=None>

<b> Next, we add annotations to our document: </b>
Adding annotations is also done using the ``` AbstractElement.add() ``` method on the intended parent element. Let's add some to our previous example document:

In [4]:
# First, grab the fourth word, 'test', from the sentence (indices start at 0)
word = sentence.words(3)

# Add a part-of-speech tag
word.add(folia.PosAnnotation(doc, set='brown-tagset', cls='n'))

# Add a lemma
word.add(folia.LemmaAnnotation(doc, set='cgn-tag-set', cls='test'))

<LemmaAnnotation at 305083480720 id=None set=cgn-tag-set class=test>

<b>Provenance Information:</b>
Since FoLiA documents can be large, several contributors may work on the same document, and each may declare new annotations. Contributors can therefore record the origin of an annotation, i.e. who or what produced it. This is done using the ``` processor ``` attribute, which refers to a ``` <processor> ``` element in the provenance metadata.

In [5]:
from folia.main import Processor

In [6]:
#Start a new paragraph
paragraph = text.add(folia.Paragraph)
sentence = paragraph.add(folia.Sentence)
sentence.add(folia.Word, 'This')
sentence.add(folia.Word, 'is')
sentence.add(folia.Word, 'the')
sentence.add(folia.Word, '2nd')
sentence.add(folia.Word, 'test')
sentence.add(folia.Word, '.')
word = sentence.words(3)

# First, declare the annotation type with a processor
posprocessor = doc.declare(folia.PosAnnotation, set='brown-tagset', processor=Processor.create(name="mypostagger"))

# Then add an annotation to our word, attributed to that processor
word.add(folia.PosAnnotation, set='brown-tagset', cls='n', processor=posprocessor)

<PosAnnotation at 305085944496 id=None set=brown-tagset class=n>
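
To illustrate how a processor reference ties an annotation back to the provenance metadata, the sketch below resolves it by hand with the standard library. The XML fragment is hand-written and simplified for this purpose: the ID ```proc.mypostagger``` and the word ID are assumptions (the library generates its own identifiers), and ```annotator_of``` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

NS = {"f": "http://ilk.uvt.nl/folia"}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Simplified, hand-written fragment; all IDs are assumed for illustration
FRAGMENT = """<FoLiA xmlns="http://ilk.uvt.nl/folia" version="2.0" xml:id="example">
  <metadata>
    <annotations>
      <pos-annotation set="brown-tagset">
        <annotator processor="proc.mypostagger"/>
      </pos-annotation>
    </annotations>
    <provenance>
      <processor xml:id="proc.mypostagger" name="mypostagger"/>
    </provenance>
  </metadata>
  <text xml:id="example.text">
    <w xml:id="example.w.1">
      <t>2nd</t>
      <pos set="brown-tagset" class="n" processor="proc.mypostagger"/>
    </w>
  </text>
</FoLiA>"""

root = ET.fromstring(FRAGMENT)

# Map each processor ID in the provenance block to its name
processors = {
    p.get(XML_ID): p.get("name")
    for p in root.iterfind(".//f:provenance/f:processor", NS)
}


def annotator_of(annotation):
    """Look up the name of the processor an annotation element refers to."""
    return processors.get(annotation.get("processor"))


pos = root.find(".//f:pos", NS)
print(pos.get("class"), "annotated by", annotator_of(pos))
```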

<b> Adding a span annotation: </b> Span annotation uses a stand-off annotation embedded in annotation layers. These layers are in turn embedded in structural elements such as sentences. Consider the following example of a named entity:

In [7]:
doc.declare(folia.Entity, "https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/namedentities.foliaset.xml")
text = doc.add(folia.Text)

sentence = text.add(folia.Sentence)
sentence.add(folia.Word, 'I', id='example.s.1.w.1')
sentence.add(folia.Word, 'saw', id='example.s.1.w.2')
sentence.add(folia.Word, 'the', id='example.s.1.w.3')
word = sentence.add(folia.Word, 'Dalai', id='example.s.1.w.4')
word2 = sentence.add(folia.Word, 'Lama', id='example.s.1.w.5')
sentence.add(folia.Word, '.', id='example.s.1.w.6')

word.add(folia.Entity, word, word2, cls="per")

<Entity at 305093224936 id=None set=https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/namedentities.foliaset.xml class=per>
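
The stand-off idea can be seen by resolving the word references of a span annotation by hand. The following is a sketch using only the standard library; the XML is a hand-written, simplified fragment (assumed for illustration, not output produced by the code above), and ```resolve_spans``` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

NS = {"f": "http://ilk.uvt.nl/folia"}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Simplified sentence: the entity does not contain the words, it points at them
SENTENCE = """<s xmlns="http://ilk.uvt.nl/folia" xml:id="example.s.1">
  <w xml:id="example.s.1.w.4"><t>Dalai</t></w>
  <w xml:id="example.s.1.w.5"><t>Lama</t></w>
  <entities>
    <entity class="per">
      <wref id="example.s.1.w.4"/>
      <wref id="example.s.1.w.5"/>
    </entity>
  </entities>
</s>"""


def resolve_spans(sentence_xml):
    """Return (class, text) pairs for each entity, following its word references."""
    root = ET.fromstring(sentence_xml)
    # Map word IDs to their text content
    words = {w.get(XML_ID): w.find("f:t", NS).text for w in root.iterfind("f:w", NS)}
    spans = []
    for entity in root.iterfind("f:entities/f:entity", NS):
        tokens = [words[ref.get("id")] for ref in entity.iterfind("f:wref", NS)]
        spans.append((entity.get("class"), " ".join(tokens)))
    return spans


print(resolve_spans(SENTENCE))  # [('per', 'Dalai Lama')]
```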

For the next example we can try to do things more explicitly. Let's first create a sentence and then add a syntax parse, consisting of nested elements as follows:

In [8]:
from folia.main import SyntacticUnit

In [9]:
doc.declare(folia.SyntaxLayer, 'some-syntax-set')
text = doc.add(folia.Text)

sentence = text.add(folia.Sentence)
sentence.add(folia.Word, 'The', id='example.s.1.w.7')
sentence.add(folia.Word, 'boy', id='example.s.1.w.8')
sentence.add(folia.Word, 'pets', id='example.s.1.w.9')
sentence.add(folia.Word, 'the', id='example.s.1.w.10')
sentence.add(folia.Word, 'cat', id='example.s.1.w.11')
sentence.add(folia.Word, '.', id='example.s.1.w.12')

# Add a syntax layer to the sentence
layer = sentence.add(folia.SyntaxLayer)

# Add the nested syntactic units
layer.add(SyntacticUnit(doc, cls='s', contents=[
    SyntacticUnit(doc, cls='np', contents=[
        SyntacticUnit(doc, doc['example.s.1.w.7'], cls='det'),
        SyntacticUnit(doc, doc['example.s.1.w.8'], cls='n'),
    ]),
    SyntacticUnit(doc, cls='vp', contents=[
        SyntacticUnit(doc, doc['example.s.1.w.9'], cls='v'),
        SyntacticUnit(doc, cls='np', contents=[
            SyntacticUnit(doc, doc['example.s.1.w.10'], cls='det'),
            SyntacticUnit(doc, doc['example.s.1.w.11'], cls='n'),
        ]),
    ]),
    SyntacticUnit(doc, doc['example.s.1.w.12'], cls='fin'),
]))

<SyntacticUnit at 305093269656 id=None set=some-syntax-set class=s>
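
The nesting of syntactic units mirrors a constituency tree. As a library-independent sketch, the same structure can be written as plain Python tuples (a representation assumed here purely for illustration) and printed in bracket notation; ```bracket``` is a hypothetical helper.

```python
# Phrases are (class, [children]); leaves are (class, word)
tree = ("s", [
    ("np", [("det", "The"), ("n", "boy")]),
    ("vp", [
        ("v", "pets"),
        ("np", [("det", "the"), ("n", "cat")]),
    ]),
    ("fin", "."),
])


def bracket(node):
    """Render a (class, content) node recursively in bracket notation."""
    cls, content = node
    if isinstance(content, str):
        return f"({cls} {content})"
    return f"({cls} " + " ".join(bracket(child) for child in content) + ")"


print(bracket(tree))
# (s (np (det The) (n boy)) (vp (v pets) (np (det the) (n cat))) (fin .))
```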

<b>Saving a FoLiA doc file:</b> To save a file we can use the save method of the document object.

In [10]:
doc.save(filename='example.folia.xml')

<b>Deleting Annotation:</b> What if we want to detach an element from its parent? We might want to delete annotations that were assigned by mistake or that are no longer required. Any element can be deleted by calling the ``` AbstractElement.remove() ``` method on its parent. Suppose we want to delete ``` word ``` (the word 'Dalai' added earlier):

In [11]:
# Check the parent of word:
word.parent

<Sentence at 305093223816 id=example.text.2.s.1 set=None class=None>

In [12]:
# Remove the word from its parent
word.parent.remove(word)

In [13]:
# Rechecking the association with the parent
print(word.parent)

None


## Reading a FoLiA Document

To read a document from file, we can instantiate a document as follows:

In [14]:
doc = folia.Document(file="example.folia.xml")

The returned Document instance holds the entire document in memory, so large FoLiA documents may consume a considerable amount of memory. Once a document is loaded, all of its data is available to read and manipulate. Let's first look at some simple use cases:

<b> Printing text: </b> We can simply print all (plain) text contained in the document, which is as easy as:

In [15]:
print(doc)

This is a test .

This is the 2nd test .


I saw the Dalai Lama .


The boy pets the cat .


Obtaining the text as a string is done by invoking the document's ```Document.text()``` method:

In [16]:
text = doc.text()
print(text)

# Or alternatively
text = str(doc)
print(text)

This is a test .

This is the 2nd test .


I saw the Dalai Lama .


The boy pets the cat .
This is a test .

This is the 2nd test .


I saw the Dalai Lama .


The boy pets the cat .


<b>Index:</b> A document instance has an index which you can use to grab any of its elements by ID. Querying the index works much like looking up a key in a Python dictionary:

In [17]:
word = doc['example.s.1.w.2']
print(word)

saw


<b>Obtaining list of elements:</b> To iterate over all elements of a particular type, we can use convenience methods such as ```sentences()``` and ```words()```, which search the document or element recursively:

In [18]:
for sentence in doc.sentences():
    for word in sentence.words():
        print(word)

This
is
a
test
.
This
is
the
2nd
test
.
I
saw
the
Dalai
Lama
.
The
boy
pets
the
cat
.


<b>Loading a Document with processor: </b> Instead of explicitly assigning a processor to individual annotations, we can associate a processor with the Document itself; it will then automatically be used for any annotations we subsequently add. The processor can be associated immediately upon document instantiation:

In [19]:
doc = folia.Document(file="example.folia.xml", processor=Processor.create(name="myscript", version="0.1"))

## Searching in a FoLiA Document

The FoLiA library supports a language called <b>FoLiA Query Language</b> (FQL), which provides an efficient way of searching in a document. The ```fql``` module needs to be imported first:

In [20]:
from folia import fql

All FQL processing is done via the ```Query``` class. Selecting a word with a particular text is done as follows:

In [21]:
query = fql.Query('SELECT w WHERE text = "boy"')
for word in query(doc):
    print(word)  #this will be an instance of folia.Word

boy


Regular expression matching can be done using the ```MATCHES``` operator:

In [22]:
query = fql.Query('SELECT w WHERE text MATCHES "^Th.*$"')
for word in query(doc):
    print(word)

This
This
The


We can also constrain our queries to a particular target selection using the ```FOR``` keyword:

In [23]:
query = fql.Query('SELECT w WHERE text MATCHES "^Th.*$" FOR s WHERE text CONTAINS "pets"')
for word in query(doc):
    print(word)

The
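
For intuition, the selection logic of the last two queries can be approximated in plain Python with the ```re``` module. This is only a sketch over a hand-copied token list; it does not use FQL, and ```select_words``` is a hypothetical helper.

```python
import re

# Tokenised sentences copied by hand from the example document
sentences = [
    ["This", "is", "a", "test", "."],
    ["This", "is", "the", "2nd", "test", "."],
    ["I", "saw", "the", "Dalai", "Lama", "."],
    ["The", "boy", "pets", "the", "cat", "."],
]

pattern = re.compile(r"^Th.*$")


def select_words(sentences, pattern, sentence_contains=None):
    """Yield words matching the pattern, optionally restricted to sentences
    whose space-joined text contains the given substring."""
    for words in sentences:
        if sentence_contains is not None and sentence_contains not in " ".join(words):
            continue
        for word in words:
            if pattern.search(word):
                yield word


print(list(select_words(sentences, pattern)))          # ['This', 'This', 'The']
print(list(select_words(sentences, pattern, "pets")))  # ['The']
```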


That concludes our basic overview of FoLiA! Don't forget to check out the official documentation for more detailed explanations and many more useful methods. See the links in the References section below.

## References

- https://github.com/proycon/foliapy
- https://folia.readthedocs.io/en/latest/
- https://foliapy.readthedocs.io/en/latest/index.html