### RWSE-Checker: false-positive (false-alarm) statistics

In [1]:
from rwse import RWSE_Checker



In [2]:
rwse = RWSE_Checker()
rwse.set_confusion_sets('../data/confusion_sets.csv')

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


#### Read from corpus and transform to CAS

In [3]:
# One sentence per line; each line is expected to be '<id>\t<sentence>'
with open('../data/eng_news_2023_10K-sentences.txt', 'r', encoding='utf-8') as f:
    sentences = f.readlines()

In [4]:
from cassis import Cas, load_typesystem

path = '../data/TypeSystem.xml'

with open(path, 'rb') as f:
    ts = load_typesystem(f)
cas = Cas(ts)

sentences_cleaned = [sentence.split('\t')[1].strip() for sentence in sentences]
# Use only the first 8,000 sentences because of limited computational resources
cas.sofa_string = ' '.join(sentences_cleaned[:8000])
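
The list comprehension above assumes that every corpus line has the form `<id>\t<sentence>`; a line without a tab would raise an `IndexError`. A minimal sketch of a more defensive variant, run on made-up lines rather than the real corpus. The helper name `parse_line` is hypothetical and not part of the notebook:

```python
def parse_line(line):
    """Return the sentence part of an '<id>\\t<sentence>' line, or None if there is no tab."""
    parts = line.rstrip('\n').split('\t', 1)
    if len(parts) != 2:
        return None
    return parts[1].strip()

# Made-up sample lines, not taken from the corpus
sample = ['1\tFirst sentence.\n', '2\tSecond one.\n', 'broken line\n']
cleaned = [s for s in map(parse_line, sample) if s is not None]
print(cleaned)
```

Unlike `split('\t')[1]`, splitting with `maxsplit=1` keeps any further tabs inside the sentence text, which may or may not be what is wanted here.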

In [5]:
import spacy

T_SENTENCE = 'de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Sentence'
T_RWSE = 'de.tudarmstadt.ukp.dkpro.core.api.anomaly.type.RWSE'
T_TOKEN = 'de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token'

nlp = spacy.load('en_core_web_sm')

S = ts.get_type(T_SENTENCE)
T = ts.get_type(T_TOKEN)

doc = nlp(cas.sofa_string)
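# One Sentence annotation per spaCy sentence, using character offsets
for sent in doc.sents:
    cas_sentence = S(begin=sent.start_char, end=sent.end_char)
    cas.add(cas_sentence)
# One Token annotation per spaCy token; `id` stores the token index
for token in doc:
    cas_token = T(begin=token.idx, end=token.idx+len(token.text), id=token.i)
    cas.add(cas_token)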

#### Find RWSEs in CAS

In [6]:
rwse.check_cas(cas, ts)

d.t.u.d.c.a.a.t.RWSE(begin=81820, end=81826)
d.t.u.d.c.a.a.t.RWSE(begin=115783, end=115789)
d.t.u.d.c.a.a.t.RWSE(begin=182246, end=182250)
d.t.u.d.c.a.a.t.RWSE(begin=216127, end=216131)
d.t.u.d.c.a.a.t.RWSE(begin=340555, end=340561)
d.t.u.d.c.a.a.t.RWSE(begin=569782, end=569787)
d.t.u.d.c.a.a.t.RWSE(begin=617422, end=617426)
d.t.u.d.c.a.a.t.RWSE(begin=630500, end=630505)
d.t.u.d.c.a.a.t.RWSE(begin=696814, end=696817)
d.t.u.d.c.a.a.t.RWSE(begin=806823, end=806828)


In [7]:
# The sofa string of the CAS is assumed to be error-free, so every detection counts as a false positive
false_positives = cas.select(T_RWSE)

false_positives_transformed = [(token.begin, token.end, cas.sofa_string[token.begin:token.end]) for token in false_positives]

for sent in doc.sents:
    for token in false_positives_transformed:
        if sent.start_char <= token[0] and sent.end_char >= token[1]:
            print(sent.text, '[['+token[2]+']]')

An Officer of the OBE is awarded for distinguished regional or county-wide role in any field, through achievement or service to the community. [[county]]
"As the county's Destination Management Organisation, we trust in the planning process to ensure a decision that balances all these important factors." [[county]]
But they are two distinct and separate occurrences. [[they]]
Crowds were entertained by the Broke FMX Motocross Stunt Team, as well as a crowd pleasing display form the Pony club Games. [[form]]
Henrico must continue to invest in the most up-to-date resources and infrastructure in its schools, and make sure to do so in both the east and west sides of the county, Rogish said. [[county]]
but you can refer to their to confirm if your local restaurant is open during the Christmas period and what times they are trading. [[their]]
Next thing your going to post is that chopping the foreskin of babies isn't a symbol of virtuous enlightenment. [[your]]
Now we get to the timeliness — 

In [8]:
len(false_positives_transformed)

10
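
The nested loop in cell 7 compares every sentence with every detection. With only 10 detections this is cheap, but the same lookup can be done with a binary search over the sentence start offsets. A minimal sketch with toy offsets, assuming the sentence spans are sorted and non-overlapping; `build_sentence_lookup` is a hypothetical helper, not part of `rwse` or `cassis`:

```python
from bisect import bisect_right

def build_sentence_lookup(sentence_spans):
    """Return a function mapping a (begin, end) span to the index of its covering sentence.

    sentence_spans: sorted, non-overlapping (start_char, end_char) pairs.
    """
    starts = [start for start, _ in sentence_spans]

    def lookup(begin, end):
        # Last sentence whose start is <= begin
        i = bisect_right(starts, begin) - 1
        if i >= 0 and end <= sentence_spans[i][1]:
            return i
        return None  # span crosses a sentence boundary or lies outside all sentences

    return lookup

# Toy character offsets, not taken from the corpus
lookup = build_sentence_lookup([(0, 10), (11, 25), (26, 40)])
inside = lookup(12, 18)
crossing = lookup(8, 12)
print(inside, crossing)
```

In the notebook, the spans would presumably be `[(s.start_char, s.end_char) for s in doc.sents]`.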

#### Analyze members of confusion sets in CAS

In [9]:
keys = rwse.confusion_sets.keys()

key_count = {key: 0 for key in keys}

# Count by lemma; inflected set members (e.g. 'brakes') will likely not match
for token in doc:
    if token.lemma_ in keys:
        key_count[token.lemma_] += 1

confusion_set_counts = dict()

# Sum the per-word counts for each confusion set (one set per CSV line)
with open('../data/confusion_sets.csv', 'r') as f:
    for line in f:
        line_cleaned = line.strip()
        members = line_cleaned.split(',')
        confusion_set_counts[line_cleaned] = sum(
            value for key, value in key_count.items() if key in members
        )

for key, value in confusion_set_counts.items():
    print(key, '=', value)

accept,except = 28
advise,advice = 27
affect,effect = 56
begin,being = 64
bitch,pitch = 10
brakes,breaks = 0
burrows,borrows = 0
sight,site = 36
cords,chords = 0
country,county = 126
crap,crab = 2
dessert,desert = 7
ease,easy = 44
effects,affects = 0
extend,extent = 15
feet,feat = 3
few,view = 110
form,from = 752
forth,fourth = 28
forums,forms = 0
fund,found = 44
lead,led = 124
life,live = 178
loose,lose = 64
mad,made = 2
or,ore = 357
passed,past = 38
peace,piece = 31
plane,plain = 11
principal,principle = 18
quite,quiet = 33
raise,rise = 100
safe,save = 62
spit,split = 10
than,then = 366
their,there,they = 1513
theme,them = 8
things,thinks = 0
trail,trial = 29
tree,three = 151
two,too,to = 4756
weak,week = 145
weather,whether = 46
weed,wheat = 2
where,were = 131
which,witch = 345
whole,hole = 29
with,width = 1191
world,word = 115
you,your = 712
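
Several sets made up of inflected forms (`brakes,breaks`, `things,thinks`, `forums,forms`) have a count of 0. One plausible cause is that the counts are taken on lemmas, while these members are plural or third-person forms. A sketch of an alternative that counts lowercased surface forms instead, shown on a toy token list and two example sets; `count_confusion_sets` is a hypothetical helper, and in the notebook the tokens would presumably be `token.lower_` for each token in `doc`:

```python
from collections import Counter

def count_confusion_sets(tokens, confusion_sets):
    """Count occurrences of each confusion set's members among lowercased surface forms."""
    counts = Counter(t.lower() for t in tokens)
    return {','.join(cs): sum(counts[w] for w in cs) for cs in confusion_sets}

# Toy data, not taken from the corpus
tokens = ['The', 'brakes', 'failed', 'and', 'then', 'he', 'breaks', 'more', 'than', 'brakes']
sets = [('brakes', 'breaks'), ('than', 'then')]
result = count_confusion_sets(tokens, sets)
print(result)
```

Whether surface forms or lemmas are the right unit depends on how `RWSE_Checker` itself matches tokens, which this notebook does not show.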


In [10]:
sum(confusion_set_counts.values())

11919
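
Combining the two totals gives a rough false-alarm rate: 10 flagged tokens out of 11,919 counted confusion-set members. A minimal sketch of that arithmetic, with both values copied from the cell outputs above. The denominator is lemma-based and may not match exactly the set of tokens the checker examined, so the result is an approximation:

```python
false_positives = 10   # len(false_positives_transformed)
candidates = 11919     # sum(confusion_set_counts.values())

rate = false_positives / candidates
print(f'{rate:.4%}')   # 0.0839%
```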