# Annotation of Corpora

### Part-of-Speech (POS) Annotation
The most basic linguistic annotation is part-of-speech (POS) tagging: each token is classified using a predefined set of tags. These can be very simple, such as:
- Noun (claw, hyphen, etc)
- Adjective (red, small, etc)
- Verb (encourage, betray, etc)

Or more detailed:

- Singular common noun (elephant, table, etc)
- Comparative adjective (larger, neater, etc)
- Past participle (listened, written, etc)

One approach would be to look up each token in a dictionary and assign its tag. However, some words have several possible POS tags. An example is "like", which could be a singular noun, a verb or a preposition. To resolve such ambiguity we need to use semantic analysis.

One such approach is probabilistic POS tagging, where we analyse the frequencies of POS tags and other context clues in order to tag the text accurately.
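
As a rough illustration of the probabilistic idea, the sketch below counts how often each word receives each tag in a tiny hand-tagged sample, both on its own and given the previous tag, and then picks the most frequent option. The training sentences, the tag names and the helper names `train` and `tag` are all made up for this example; real taggers use far larger corpora and more careful models.

```python
from collections import Counter, defaultdict

# Hypothetical hand-tagged training data (invented for illustration).
TRAIN = [
    [("I", "PRON"), ("like", "VERB"), ("tea", "NOUN")],
    [("they", "PRON"), ("like", "VERB"), ("cake", "NOUN")],
    [("it", "PRON"), ("looks", "VERB"), ("like", "PREP"), ("rain", "NOUN")],
    [("it", "PRON"), ("tastes", "VERB"), ("like", "PREP"), ("tea", "NOUN")],
    [("it", "PRON"), ("smells", "VERB"), ("like", "PREP"), ("cake", "NOUN")],
]

def train(sentences):
    """Count tag frequencies per word and per (previous tag, word) pair."""
    word_tags = defaultdict(Counter)     # word -> Counter of tags
    context_tags = defaultdict(Counter)  # (previous tag, word) -> Counter of tags
    for sentence in sentences:
        prev = "<s>"  # marker for the start of a sentence
        for word, tag_ in sentence:
            w = word.lower()
            word_tags[w][tag_] += 1
            context_tags[(prev, w)][tag_] += 1
            prev = tag_
    return word_tags, context_tags

def tag(words, word_tags, context_tags, default="NOUN"):
    """Tag each word with its most frequent tag, preferring counts seen in context."""
    tagged = []
    prev = "<s>"
    for word in words:
        w = word.lower()
        if (prev, w) in context_tags:
            best = context_tags[(prev, w)].most_common(1)[0][0]
        elif w in word_tags:
            best = word_tags[w].most_common(1)[0][0]
        else:
            best = default  # unseen word: fall back to an assumed default tag
        tagged.append((word, best))
        prev = best
    return tagged

WORD_TAGS, CONTEXT_TAGS = train(TRAIN)
print(tag(["they", "like", "rain"], WORD_TAGS, CONTEXT_TAGS))
print(tag(["it", "smells", "like", "tea"], WORD_TAGS, CONTEXT_TAGS))
```

In this toy data "like" is most often a preposition overall, but after a pronoun it is tagged as a verb, which is the kind of context clue a plain dictionary lookup cannot use.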

### XML for Corpora

XML is the most widely used markup language for corpora. The following is an example from the British National Corpus:
```xml
<wtext type="FICTION">
    <div level="1">
        <head> <s n="1">
            <w c5="NN1" hw="chapter" pos="SUBST"> CHAPTER </w>
            <w c5="CRD" hw="1" pos="ADJ"> 1 </w>
        </s> </head>
        <p> <s n="2">
            <c c5="PUQ"> ‘ </c>
            <w c5="CJC" hw="but" pos="CONJ"> But </w>
            <c c5="PUN"> , </c> <c c5="PUQ"> ’ </c>
            <w c5="VVD" hw="say" pos="VERB"> said </w>
            <w c5="NP0" hw="owen" pos="SUBST"> Owen </w>
            <c c5="PUN"> , </c> <c c5="PUQ"> ‘ </c>
            <w c5="AVQ" hw="where" pos="ADV"> where </w>
            <w c5="VBZ" hw="be" pos="VERB"> is </w>
            <w c5="AT0" hw="the" pos="ART"> the </w>
            <w c5="NN1" hw="body" pos="SUBST"> body </w>
            <c c5="PUN"> ? </c> <c c5="PUQ"> ’ </c>
        </s> </p>
        ....
    </div>
</wtext>
```

- `wtext` is a _written text_ element, with a `type` attribute indicating the type of written text.
- The `s` element marks a sentence.
- The `w` element marks a word: the `pos` attribute gives a basic POS tag, the `c5` attribute gives the more detailed CLAWS C5 tag, and the `hw` attribute gives the headword.
- The `c` element marks punctuation.
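
Markup like this can be read with any XML parser. The following is a minimal sketch using Python's standard `xml.etree.ElementTree` module on an abridged copy of the sample; the helper name `read_sentences` is an assumption made for this example, not part of any corpus toolkit.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the sample above (quotation marks and later words left out).
SAMPLE = """
<wtext type="FICTION">
  <div level="1">
    <p> <s n="2">
      <w c5="CJC" hw="but" pos="CONJ"> But </w>
      <c c5="PUN"> , </c>
      <w c5="VVD" hw="say" pos="VERB"> said </w>
      <w c5="NP0" hw="owen" pos="SUBST"> Owen </w>
    </s> </p>
  </div>
</wtext>
""".strip()

def read_sentences(xml_text):
    """Return (sentence number, [(token, c5 tag, basic pos tag), ...]) per <s> element."""
    root = ET.fromstring(xml_text)
    sentences = []
    for s in root.iter("s"):
        tokens = []
        for el in s.iter():
            if el.tag in ("w", "c"):  # words and punctuation marks
                text = (el.text or "").strip()
                tokens.append((text, el.get("c5"), el.get("pos")))
        sentences.append((s.get("n"), tokens))
    return sentences

for number, tokens in read_sentences(SAMPLE):
    print(number, tokens)
```

Punctuation elements carry no `pos` attribute in the sample, so their basic tag comes back as `None`.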

### Syntactic Annotation
The next level of annotation provides information on the structure of whole sentences, rather than individual words. Linguists break sentences down using the following _phrase markers_:
- __noun phrase:__ a noun and its adjectives, determiners, etc.
- __verb phrase:__ a verb and its object.
- __prepositional phrase:__ a preposition and its noun phrase.
- __sentence:__ a verb phrase and its subject.

Here is an example for the sentence "They saw the president of the company":
![Screenshot%20from%202017-08-25%2012-30-40.png](attachment:Screenshot%20from%202017-08-25%2012-30-40.png)
In XML this looks like:
```xml
<s>
    <np> <w pos="PRP"> They </w></np>
    <vp> <w pos="VBD"> saw </w>
        <np>
            <np> <w pos="DT"> the </w>
                <w pos="NN"> president </w></np>
            <pp> <w pos="IN"> of </w>
                <np> <w pos="DT"> the </w>
                    <w pos="NN"> company </w></np>
            </pp>
        </np>
    </vp>
</s>
```
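
Because the phrase structure is encoded as nested elements, it can be traversed like any other tree. The sketch below, which assumes the same element names as the example (`s`, `np`, `vp`, `pp`, `w`) and leaves out the `pos` attributes for brevity, lists the text of every phrase of a given type and prints the tree in bracketed form. The function names `phrases` and `to_brackets` are invented for this illustration.

```python
import xml.etree.ElementTree as ET

# Same structure as the example above, with the pos attributes omitted.
PARSE = """
<s>
  <np> <w> They </w> </np>
  <vp> <w> saw </w>
    <np>
      <np> <w> the </w> <w> president </w> </np>
      <pp> <w> of </w>
        <np> <w> the </w> <w> company </w> </np>
      </pp>
    </np>
  </vp>
</s>
""".strip()

def phrases(root, label):
    """Return the text of every <label> element, in document order."""
    return [" ".join(w.text.strip() for w in el.iter("w")) for el in root.iter(label)]

def to_brackets(node):
    """Render the tree in bracketed notation, e.g. (NP the company)."""
    if node.tag == "w":
        return node.text.strip()
    inner = " ".join(to_brackets(child) for child in node)
    return f"({node.tag.upper()} {inner})"

ROOT = ET.fromstring(PARSE)
print(phrases(ROOT, "np"))
print(to_brackets(ROOT))
```

Note that noun phrases can nest: "the president of the company" is itself a noun phrase containing two smaller ones.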

### Analysis of Corpora

Once we have annotated a corpus, we can begin to answer questions about it, such as where a word occurs, how frequent it is, and which words group together.

#### Concordance

__Concordance:__ all occurrences of a given word, in context.

A concordance is a useful tool for analysing corpora. Given a search expression, a concordance program finds and displays all matches, typically as keyword in context (KWIC): the matched expression in the middle, with a fixed-size context on either side.

Here is an example of a concordance for all forms of "remember":
```
s cellar . Scrooge then <remembered> to have heard that ghost
, for your own sake , you <remember> what has passed between
e-quarters more , when he <remembered> , on a sudden , that the
corroborated everything , <remembered> everything , enjoyed eve
urned from them , that he <remembered> the Ghost , and became c
ht be pleasant to them to <remember> upon Christmas Day , who
its festivities ; and had <remembered> those he cared for at a
wn that they delighted to <remember> him . It was a great sur
ke ceased to vibrate , he <remembered> the prediction of old Ja
as present myself , and I <remember> to have felt quite uncom
```
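
A keyword-in-context display of this kind can be sketched in a few lines. The version below assumes the corpus is already a list of tokens and that the search expression is a regular expression matched against whole tokens; the function name `kwic` and the toy text are made up for illustration, and real concordance programs support much richer queries.

```python
import re

def kwic(tokens, pattern, width=30):
    """Return one keyword-in-context line per token that fully matches `pattern`."""
    regex = re.compile(pattern, re.IGNORECASE)
    lines = []
    for i, token in enumerate(tokens):
        if regex.fullmatch(token):
            # Fixed-size context: the last `width` characters before the match
            # and the first `width` characters after it.
            left = " ".join(tokens[:i])[-width:]
            right = " ".join(tokens[i + 1:])[:width]
            lines.append(f"{left:>{width}} <{token}> {right}")
    return lines

# Toy text, invented for this example.
TOY_TEXT = "he remembered the Ghost , and I remember it".split()
for line in kwic(TOY_TEXT, r"remember(s|ed|ing)?", width=12):
    print(line)
```

Right-aligning the left context keeps every match in the same column, which is what makes the display easy to scan.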

#### Frequencies
Frequency information is used to identify characteristics of a language.
- __Token count (N):__ number of tokens in a corpus.
- __Type count:__ number of distinct tokens (types) in a corpus.
- __Absolute frequency f(t) of type t:__ the number of tokens of type t in a corpus.
- __Relative frequency of type t:__ the absolute frequency of t scaled by the token count, f(t) / N.
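
These four quantities are straightforward to compute once a corpus is tokenised. The following is a small sketch using `collections.Counter`; the function name `frequency_table` and the toy sentence are assumptions made for this example.

```python
from collections import Counter

def frequency_table(tokens):
    """Compute token count N, type count, absolute and relative frequencies."""
    counts = Counter(tokens)  # absolute frequency f(t) for each type t
    n = len(tokens)           # token count N
    return {
        "token_count": n,
        "type_count": len(counts),
        "absolute": counts,
        "relative": {t: c / n for t, c in counts.items()},  # f(t) / N
    }

# Toy corpus, invented for this example.
TABLE = frequency_table("the cat sat on the mat".split())
print(TABLE["token_count"], TABLE["type_count"])
print(TABLE["absolute"]["the"], TABLE["relative"]["the"])
```

Here "the" occurs twice among six tokens, so its absolute frequency is 2 and its relative frequency is 2 / 6.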

#### Unigrams
A unigram ranking is a list of the words in the corpus ordered by frequency. We can generalize this to bigrams (pairs of adjacent words), trigrams, and n-grams in general.
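
A minimal sketch of an n-gram ranking is given below. It assumes a tokenised corpus held in a list; the names `ngrams` and `ranking` and the toy token list are invented for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all runs of n adjacent tokens as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))

def ranking(tokens, n=1):
    """Return (n-gram, count) pairs, most frequent first."""
    return Counter(ngrams(tokens, n)).most_common()

# Toy corpus, invented for this example.
TOKENS = "a little more tea and a little more toast".split()
print(ranking(TOKENS, 1)[:3])  # unigram ranking
print(ranking(TOKENS, 2)[:3])  # bigram ranking
```

A corpus of N tokens yields N - n + 1 n-grams, so longer n-grams are both fewer and sparser.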

Here is an example of a concordance of bigrams, consisting of an adjective followed by the word "tea", using the expression `[pos="J.*"][word="tea"]`:
```
87773: now , notwithstanding the <hot tea> they had given me before
281162: .’ ’ Shall I put a little <more tea> in the pot afore I go ,
565002: o moisten a box-full with <cold tea> , stir it up on a piece
607297: tween eating , drinking , <hot tea> , devilled grill , muffi
663703: e , handed round a little <stronger tea> . The harp was there ;
692255: e so repentant over their <early tea> , at home , that by eigh
1141472: rs. Sparsit took a little <more tea> ; and , as she bent her
1322382: s illness ! Dry toast and <warm tea> offered him every night
1456507: of robing , after which , <strong tea> and brandy were administ
1732571: rsty . You may give him a <little tea> , ma’am , and some dry t
```
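
To give a feel for how an expression such as `[pos="J.*"][word="tea"]` selects bigrams, here is a simplified sketch. It assumes each bracket is a set of attribute constraints written as regular expressions that must match the whole attribute value, and it represents the query as a list of dictionaries rather than parsing the query syntax. The function name `find_matches` and the tagged toy tokens are hypothetical; this is not the implementation of any actual corpus query engine.

```python
import re

def find_matches(tagged, pattern):
    """Find every position where consecutive tokens satisfy the pattern.

    tagged:  list of dicts such as {"word": "tea", "pos": "NN"}
    pattern: one dict of {attribute: regex} per token position
    """
    hits = []
    size = len(pattern)
    for start in range(len(tagged) - size + 1):
        window = tagged[start:start + size]
        if all(
            re.fullmatch(regex, token[attr])
            for token, constraints in zip(window, pattern)
            for attr, regex in constraints.items()
        ):
            hits.append((start, [token["word"] for token in window]))
    return hits

# Toy tagged text, invented for this example (tags chosen by hand).
TAGGED = [
    {"word": "a", "pos": "DT"},
    {"word": "little", "pos": "JJ"},
    {"word": "more", "pos": "JJR"},
    {"word": "tea", "pos": "NN"},
    {"word": "and", "pos": "CC"},
    {"word": "strong", "pos": "JJ"},
    {"word": "tea", "pos": "NN"},
]

# Equivalent of [pos="J.*"][word="tea"] in this simplified representation.
QUERY = [{"pos": "J.*"}, {"word": "tea"}]
print(find_matches(TAGGED, QUERY))
```

The pattern `J.*` covers every tag beginning with "J" in this toy tagset, so plain and comparative adjectives are both found.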