<a href="https://colab.research.google.com/github/tlu-dt-nlp/POSgram-finder/blob/main/posgram_finder_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Error Detection Based on Part-of-Speech Sequences

This notebook demonstrates the `posgram-finder` error detection tool, which is part of the https://koodivaramu.eesti.ee/tartunlp/corrector toolkit.

The application finds unlikely part-of-speech (POS) sequences. These are detected based on the probability of a POS trigram (a sequence of three POS tags) occurring in a given context, that is, how likely the trigram is to appear together with the preceding or following POS tag, or at the beginning or end of a sentence. Punctuation is skipped in the analysis.
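As a rough illustration of the idea (not the tool's actual implementation), the sketch below lists every POS trigram in a sentence together with the tag that precedes it and the tag that follows it. It assumes a sentence is encoded as a string of one-letter POS tags delimited by `^` and `$`, with `Z` marking punctuation, which matches the strings printed later in this notebook. The function name `trigram_contexts` is hypothetical.

```python
def trigram_contexts(sentence_posgram, punctuation_tag="Z"):
    """Yield (trigram, preceding, following) for each POS trigram in a sentence.

    Illustrative only: assumes one-letter POS tags between '^' and '$'.
    """
    # Skip punctuation, as the analysis does.
    tags = sentence_posgram.replace(punctuation_tag, "")
    # Index 0 is '^' and the last index is '$'; trigrams cover only real tags.
    for i in range(1, len(tags) - 3):
        yield tags[i:i + 3], tags[i - 1], tags[i + 3]


for trigram, preceding, following in trigram_contexts("^SPVSZ$"):
    print(f"{preceding} + {trigram} + {following}")
```

Each yielded triple corresponds to two possible checks: the trigram with its preceding context and the trigram with its following context.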


## Setup

Clone the repository, install dependencies and import the `PosgramFinder` class.

In [1]:
! git clone https://github.com/tlu-dt-nlp/posgram-finder.git

Cloning into 'posgram-finder'...
remote: Enumerating objects: 43, done.
remote: Counting objects: 100% (43/43), done.
remote: Compressing objects: 100% (42/42), done.
remote: Total 43 (delta 19), reused 0 (delta 0), pack-reused 0
Receiving objects: 100% (43/43), 358.29 KiB | 3.85 MiB/s, done.
Resolving deltas: 100% (19/19), done.


In [2]:
%cd posgram-finder

/content/posgram-finder


In [3]:
! pip install stanza

Collecting stanza
  Downloading stanza-1.6.1-py3-none-any.whl (881 kB)
Collecting emoji (from stanza)
  Downloading emoji-2.8.0-py2.py3-none-any.whl (358 kB)
Installing collected packages: emoji, stanza
Successfully installed emoji-2.8.0 stanza-1.6.1


In [5]:
from posgram_finder import PosgramFinder
p = PosgramFinder()
# By default, a trigram context is flagged as low-probability if its relative
# frequency in the language model is below 5%.
# The threshold can be changed with the "lower_percentage_limit" argument:
#p = PosgramFinder(lower_percentage_limit=2.5)

INFO:stanza:Loading these models for language: et (Estonian):
| Processor | Package      |
----------------------------
| tokenize  | edt          |
| pos       | edt_nocharlm |

INFO:stanza:Using device: cpu
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: pos
INFO:stanza:Done loading processors!
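
To make the threshold concrete, here is a minimal sketch with made-up counts. It assumes that the percentage is the share of a trigram's occurrences in which it appears with a given context. The notebook does not spell out the exact formula, so this definition, the counts, and the helper name `low_probability_contexts` should all be read as hypothetical.

```python
from collections import Counter


def low_probability_contexts(context_counts, lower_percentage_limit=5.0):
    """Return {context: percent} for contexts rarer than the limit.

    Assumption: percent = 100 * count(trigram with context) / count(trigram).
    """
    total = sum(context_counts.values())
    percents = {ctx: 100 * n / total for ctx, n in context_counts.items()}
    return {ctx: pct for ctx, pct in percents.items() if pct < lower_percentage_limit}


# Invented counts of the POS tag that follows some trigram in a toy corpus.
following_tag_counts = Counter({"S": 600, "A": 250, "V": 120, "D": 30})

# With the default 5% limit, only the context 'D' (3%) is flagged.
print(low_probability_contexts(following_tag_counts))
# With a stricter 2.5% limit, nothing is flagged.
print(low_probability_contexts(following_tag_counts, lower_percentage_limit=2.5))
```

Lowering the limit therefore reports fewer candidates, and raising it reports more.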


## Finding errors in non-corrected texts

The tool can be used for analysing original, non-corrected texts as well as automated correction output.

In the following two examples, we analyse original sentences written by learners of Estonian as a second language.

### 1. Word choice error

In [6]:
result = p.posgram_errors("See on väga ilus, kiiresti ja mugav punane auto.")

^PVDAZDJAASZ$


Here the adverb *kiiresti* has been used instead of the adjective *kiire*. Three atypical POS sequences are found, all of which include the erroneous word.

In [7]:
from pprint import pprint
pprint(result, sort_dicts=False)

[{'sentence': 'See on väga ilus, kiiresti ja mugav punane auto.',
  'sentence_posgram': '^PVDAZDJAASZ$',
  'error_candidates': [{'value': 'on väga ilus , kiiresti',
                        'posgram': 'VDAZD',
                        'start_token': 1,
                        'end_token': 5,
                        'trigram': 'VDA',
                        'type': 'post',
                        'context': 'D',
                        'percent': 3.085772778093205},
                       {'value': 'ilus , kiiresti ja mugav',
                        'posgram': 'AZDJA',
                        'start_token': 3,
                        'end_token': 7,
                        'trigram': 'DJA',
                        'type': 'pre',
                        'context': 'A',
                        'percent': 1.9070039340843379},
                       {'value': 'kiiresti ja mugav punane',
                        'posgram': 'DJAA',
                        'start_token': 5,
                      

### 2. Missing verb

In [8]:
sentence = "Eile koristasin vannituba ja näen nagu kodumasin katki."
result = p.posgram_errors(sentence)

^DVSJVJSDZ$


The existential verb *olema* is missing from the subordinate clause at the end of the sentence (*nagu kodumasin katki*). This is reflected in three consecutive, overlapping unlikely POS sequences.

In [9]:
pprint(result[0]["error_candidates"], sort_dicts=False)

[{'value': 'ja näen nagu kodumasin',
  'posgram': 'JVJS',
  'start_token': 3,
  'end_token': 6,
  'trigram': 'VJS',
  'type': 'pre',
  'context': 'J',
  'percent': 4.7853188712363295},
 {'value': 'näen nagu kodumasin katki',
  'posgram': 'VJSD',
  'start_token': 4,
  'end_token': 7,
  'trigram': 'VJS',
  'type': 'post',
  'context': 'D',
  'percent': 4.402904046086682},
 {'value': 'nagu kodumasin katki .',
  'posgram': 'JSDZ$',
  'start_token': 5,
  'end_token': 9,
  'trigram': 'JSD',
  'type': 'post',
  'context': '$',
  'percent': 4.90649701176017}]
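
The candidate dictionaries can be condensed into one line each for quicker reading. The sketch below relies only on the keys visible in the output above. The reading of `type` (`'pre'` meaning the context tag precedes the trigram, `'post'` meaning it follows) is inferred from that output, and the helper name `summarize_candidates` is hypothetical.

```python
def summarize_candidates(candidates):
    """Format each error candidate as one readable line."""
    lines = []
    for c in candidates:
        # Inferred from the output: 'pre' = context before the trigram,
        # 'post' = context after it.
        if c["type"] == "pre":
            pattern = f"{c['context']} + {c['trigram']}"
        else:
            pattern = f"{c['trigram']} + {c['context']}"
        lines.append(f"{c['value']!r}: {pattern} ({c['percent']:.1f}%)")
    return lines


# Two candidates copied from the output above (unused keys omitted).
example_candidates = [
    {"value": "ja näen nagu kodumasin", "trigram": "VJS",
     "type": "pre", "context": "J", "percent": 4.7853188712363295},
    {"value": "näen nagu kodumasin katki", "trigram": "VJS",
     "type": "post", "context": "D", "percent": 4.402904046086682},
]

for line in summarize_candidates(example_candidates):
    print(line)
```

In the notebook itself, the same helper could be called as `summarize_candidates(result[0]["error_candidates"])`.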


## Finding errors in correction output

Next, we analyse two sentences that have already been processed by the best-performing grammatical error correction model.

The corrector has not made changes to the first sentence. The second sentence has been edited considerably. Originally, it was *Seal palju kohvikud, muuseumusid kino*.

In [10]:
text = "Vitebskis me kuulasime kontserti. Seal on palju kohvikuid, muuseume, kinosid."
result = p.posgram_errors(text)

^SPVSZ$
^DVDSZSZSZ$


### 1. Word order error

The first sentence starts with a noun-pronoun-verb sequence that does not align with the verb-second (V2) word order. It would be advisable to begin the sentence with *Vitebskis kuulasime me*.

### 2. Missing conjunction

An adverb is typically not followed by three consecutive nouns without a conjunction. While not grammatically incorrect, it would be more fluent to use the conjunction *ja* when listing the nouns in the second sentence (between the words *muuseume* and *kinosid*).

In [11]:
pprint(result, sort_dicts=False)

[{'sentence': 'Vitebskis me kuulasime kontserti.',
  'sentence_posgram': '^SPVSZ$',
  'error_candidates': [{'value': 'Vitebskis me kuulasime',
                        'posgram': 'SPV',
                        'start_token': 0,
                        'end_token': 2,
                        'trigram': 'SPV',
                        'type': 'pre',
                        'context': '^',
                        'percent': 4.862718076808973}]},
 {'sentence': 'Seal on palju kohvikuid, muuseume, kinosid.',
  'sentence_posgram': '^DVDSZSZSZ$',
  'error_candidates': [{'value': 'palju kohvikuid , muuseume , kinosid',
                        'posgram': 'DSZSZS',
                        'start_token': 2,
                        'end_token': 7,
                        'trigram': 'SSS',
                        'type': 'pre',
                        'context': 'D',
                        'percent': 3.9641244155294872}]}]
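
When a text contains several sentences, it can help to rank all candidates by how unlikely they are. The following sketch flattens a result list with the structure shown above and sorts it by `percent`; the helper name `rank_candidates` is hypothetical, and the example data is a trimmed copy of the output above.

```python
def rank_candidates(result):
    """Return (sentence, candidate) pairs sorted from least to most likely."""
    pairs = [
        (entry["sentence"], candidate)
        for entry in result
        for candidate in entry["error_candidates"]
    ]
    return sorted(pairs, key=lambda pair: pair[1]["percent"])


# Trimmed copy of the output above (only the keys used here).
example_result = [
    {"sentence": "Vitebskis me kuulasime kontserti.",
     "error_candidates": [{"value": "Vitebskis me kuulasime",
                           "percent": 4.862718076808973}]},
    {"sentence": "Seal on palju kohvikuid, muuseume, kinosid.",
     "error_candidates": [{"value": "palju kohvikuid , muuseume , kinosid",
                           "percent": 3.9641244155294872}]},
]

for sentence, candidate in rank_candidates(example_result):
    print(f"{candidate['percent']:.2f}%  {candidate['value']}  [{sentence}]")
```

Sentences without any candidates simply contribute nothing to the ranked list.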
