### RWSE-Checker: unknown words

In [1]:
from rwse import RWSE_Checker


In [2]:
rwse = RWSE_Checker()
rwse.set_confusion_sets('../data/confusion_sets.csv')

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


#### Read from corpus and transform to CAS

In [3]:
with open('../data/eng_news_2023_10K-sentences.txt', 'r', encoding='utf-8') as f:
    sentences = f.readlines()

In [4]:
from cassis import Cas, load_typesystem

path = '../data/TypeSystem.xml'

with open(path, 'rb') as f:
    ts = load_typesystem(f)
cas = Cas(ts)

sentences_cleaned = [sentence.split('\t')[1].strip() for sentence in sentences]
cas.sofa_string = ' '.join(sentences_cleaned[:8000]) # first 8000 sentences only, due to limited computational resources

In [5]:
import spacy
T_SENTENCE = 'de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Sentence'
T_RWSE = 'de.tudarmstadt.ukp.dkpro.core.api.anomaly.type.RWSE'
T_TOKEN = 'de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token'

nlp = spacy.load('en_core_web_sm')

S = ts.get_type(T_SENTENCE)
T = ts.get_type(T_TOKEN)

doc = nlp(cas.sofa_string)
for sent in doc.sents:
    cas_sentence = S(begin=sent.start_char, end=sent.end_char)
    cas.add(cas_sentence)
for token in doc:
    cas_token = T(begin=token.idx, end=token.idx+len(token.text), id=token.i)
    cas.add(cas_token)

#### Find RWSEs in CAS

In [6]:
rwse.check_cas(cas, ts)

The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `cite` does not exist in the model vocabulary. Replacing with `c`.
The specified target token `capitol` does not exist in the model vocabulary. Replacing with `cap`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exis

d.t.u.d.c.a.a.t.RWSE(begin=26932, end=26939)


The specified target token `cite` does not exist in the model vocabulary. Replacing with `c`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `capitol` does not exist in the model vocabulary. Replacing with `cap`.
The specified target token `corse` does not exi

d.t.u.d.c.a.a.t.RWSE(begin=81820, end=81826)


The specified target token `cite` does not exist in the model vocabulary. Replacing with `c`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `capitol` does not exist in the model vocabulary. Replacing with `cap`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `capitol` does not exist in the model vocabulary. Replacing with `cap`.


d.t.u.d.c.a.a.t.RWSE(begin=95290, end=95297)


The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `capitol` does not exist in the model vocabulary. Replacing with `cap`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.


d.t.u.d.c.a.a.t.RWSE(begin=115783, end=115789)


The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `corse` does not exist 

d.t.u.d.c.a.a.t.RWSE(begin=182246, end=182250)


The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `cite` does not exist in the model vocabulary. Replacing with `c`.


d.t.u.d.c.a.a.t.RWSE(begin=189408, end=189412)


The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `cite` does not exist in the model vocabulary. Replacing with `c`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `cite` does not exist in the model vocabulary. Replacing with `c`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in 

d.t.u.d.c.a.a.t.RWSE(begin=216127, end=216131)


The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `manly` does not exist in the model vocabulary. Replacing with `man`.
The specified target token `cite` does not exist in the model vocabulary. Replacing with `c`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist 

d.t.u.d.c.a.a.t.RWSE(begin=323598, end=323601)


The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `capitol` does not exist in the model vocabulary. Replacing with `cap`.
The specified target token `cite` does not exist in the model vocabulary. Replacing with `c`.


d.t.u.d.c.a.a.t.RWSE(begin=324642, end=324646)


The specified target token `corse` does not exist in the model vocabulary. Replacing with `co`.
The specified target token `provence` does not exist in the model vocabulary. Replacing with `proven`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.


d.t.u.d.c.a.a.t.RWSE(begin=340555, end=340561)


The specified target token `breadth` does not exist in the model vocabulary. Replacing with `bread`.
The specified target token `corse` does not exist in the model vocabulary. Replacing with `co`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `corse` does not exist in the model vocabulary. Replacing with `co`.
The specified target token `corse` does no

d.t.u.d.c.a.a.t.RWSE(begin=513607, end=513614)


The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `corse` does not exist in the model vocabulary. Replacing with `co`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `corse` does not exist in the model vocabulary. Replacing with `co`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist

d.t.u.d.c.a.a.t.RWSE(begin=569782, end=569787)


The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `corse` does not exist in the model vocabulary. Replacing with `co`.
The specified target token `it's` does not exist 

d.t.u.d.c.a.a.t.RWSE(begin=617422, end=617426)


The specified target token `capitol` does not exist in the model vocabulary. Replacing with `cap`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `corse` does not exist in the model vocabulary. Replacing with `co`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `corse` does not exist in the model vocabulary. Replacing with `co`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.


d.t.u.d.c.a.a.t.RWSE(begin=630500, end=630505)


The specified target token `corse` does not exist in the model vocabulary. Replacing with `co`.
The specified target token `corse` does not exist in the model vocabulary. Replacing with `co`.
The specified target token `corse` does not exist in the model vocabulary. Replacing with `co`.
The specified target token `corse` does not exist in the model vocabulary. Replacing with `co`.
The specified target token `corse` does not exist in the model vocabulary. Replacing with `co`.
The specified target token `corse` does not exist in the model vocabulary. Replacing with `co`.
The specified target token `corse` does not exist in the model vocabulary. Replacing with `co`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `capitol` does not exist in the model vocabulary. Replacing with `cap`.
The specified target token `it's` does 

d.t.u.d.c.a.a.t.RWSE(begin=696814, end=696817)


The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `capitol` does not exist in the model vocabulary. Replacing with `cap`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exi

d.t.u.d.c.a.a.t.RWSE(begin=806823, end=806828)


The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `manly` does not exist in the model vocabulary. Replacing with `man`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `capitol` does not exist in the model vocabulary. Replacing with `cap`.
The specified target token `it's` does not exist in the model vocabulary. Replacing with `it`.
The specified target token `it's` does not e
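The replacements reported in the log above (`it's` → `it`, `capitol` → `cap`, `cite` → `c`) are consistent with a target that is missing from the model vocabulary being tokenized and then reduced to its first subword piece. The following is a minimal sketch of that behaviour, not the checker's or the model's actual code: `TOY_VOCAB`, `wordpiece` and `first_piece` are hypothetical names, and the toy vocabulary merely stands in for the much larger `bert-base-cased` vocabulary.

```python
import re

# Hypothetical toy vocabulary (assumption); the real vocabulary is far larger.
TOY_VOCAB = {"it", "'", "s", "c", "##ite", "cap", "##itol", "course"}


def wordpiece(word, vocab):
    """Greedy longest-match-first split of a single word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matched at this position
        pieces.append(piece)
        start = end
    return pieces


def first_piece(target, vocab):
    """Return the target itself if it is in the vocabulary, else its first piece."""
    if target in vocab:
        return target
    # Split off punctuation first, so that "it's" becomes "it", "'", "s".
    words = re.findall(r"\w+|[^\w\s]", target)
    return wordpiece(words[0], vocab)[0]


for target in ["it's", "capitol", "cite", "course"]:
    print(target, "->", first_piece(target, TOY_VOCAB))
```

Under this reading, a confusion-set member such as `capitol` is scored as the piece `cap` rather than as the full word, so its probability is not directly comparable to that of an in-vocabulary alternative such as `capital`.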

In [7]:
false_positives = cas.select(T_RWSE) # Sofa string of the CAS is assumed to be error-free, so every RWSE annotation is a false positive

false_positives_transformed = [(token.begin, token.end, cas.sofa_string[token.begin:token.end]) for token in false_positives]

for sent in doc.sents:
    for token in false_positives_transformed:
        if sent.start_char <= token[0] and sent.end_char >= token[1]:
            print(sent.text, '[['+token[2]+']]')

After discussions with researchers from Washington, D.C., King chose the Jefferson elm, a disease-resistant species with a sturdy structure currently thriving on the National Mall in the nation’s capitol. [[capitol]]
An Officer of the OBE is awarded for distinguished regional or county-wide role in any field, through achievement or service to the community. [[county]]
Around that March arrest, Bristol police announced that they had been asked to perform a wellness check on Hernandez the previous week after receiving an anonymous complaint saying Hernandez planned to smash windows at ESPN and at the state capitol. [[capitol]]
"As the county's Destination Management Organisation, we trust in the planning process to ensure a decision that balances all these important factors." [[county]]
But they are two distinct and separate occurrences. [[they]]
Can you cite that CA law? [[cite]]
Crowds were entertained by the Broke FMX Motocross Stunt Team, as well as a crowd pleasing display form the 

In [8]:
len(false_positives_transformed)

16

#### Analyze members of confusion sets in CAS

In [9]:
keys = rwse.confusion_sets.keys()

key_count = {key: 0 for key in keys}

for token in doc:
    if token.lemma_ in keys:
        key_count[token.lemma_] += 1

confusion_set_counts = dict()

with open('../data/confusion_sets.csv', 'r') as f:
    lines = f.readlines()
    for line in lines:
        line_cleaned = line.strip()
        confusion_set_counts[line_cleaned] = 0
        line_split = line_cleaned.split(',')
        for key, value in key_count.items():
            if key in line_split:
                confusion_set_counts[line_cleaned] += value

for key, value in confusion_set_counts.items():
    print(key, '=', value)

accept,except = 28
advise,advice = 27
affect,effect = 56
begin,being = 64
bitch,pitch = 10
brakes,breaks = 0
breath,breadth = 1
burrows,borrows = 0
capitol,capital = 26
cite,sight,site = 47
cords,chords = 0
corse,course = 37
country,county = 126
crap,crab = 2
dessert,desert = 7
ease,easy = 44
effects,affects = 0
extend,extent = 15
feet,feat = 3
few,view = 110
form,from = 752
forth,fourth = 28
forums,forms = 0
fund,found = 44
its,it's = 366
lead,led = 124
life,live = 178
loose,lose = 64
mad,made = 2
manly,mainly = 8
or,ore = 357
passed,past = 38
peace,piece = 31
plane,plain = 11
principal,principle = 18
provence,province = 6
quite,quiet = 33
raise,rise = 100
safe,save = 62
simulate,stimulate = 2
spit,split = 10
than,then = 366
their,there,they = 1513
theme,them = 8
things,thinks = 0
trail,trial = 29
tree,three = 151
two,too,to = 4756
weak,week = 145
weather,whether = 46
weed,wheat = 2
where,were = 131
which,witch = 345
whole,hole = 29
with,width = 1191
world,word = 115
you,your = 712
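The counts above are obtained by matching `token.lemma_` against the confusion-set keys. This may explain why sets made up of inflected forms, such as `brakes,breaks` or `things,thinks`, come out as 0, since a lemmatizer would typically map those tokens to `brake`, `break`, `thing` and `think`. As an alternative, here is a minimal sketch that counts lowercased surface forms instead. The function name, the toy tokens and the toy sets are assumptions for illustration only and are not taken from the notebook's data.

```python
from collections import Counter


def count_confusion_sets(tokens, confusion_sets):
    """Count how often any member of each confusion set occurs as a surface form."""
    surface_counts = Counter(token.lower() for token in tokens)
    return {
        ",".join(members): sum(surface_counts[member] for member in members)
        for members in confusion_sets
    }


# Toy input (assumption); in the notebook the tokens would come from `doc`.
toy_tokens = "He breaks the brakes and the car brakes late".split()
toy_sets = [("brakes", "breaks"), ("things", "thinks")]

counts = count_confusion_sets(toy_tokens, toy_sets)
for key, value in counts.items():
    print(key, "=", value)
```

Surface-form counting keeps inflected members visible, at the cost of no longer grouping different inflections of the same lemma.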


In [10]:
sum(confusion_set_counts.values())

12376
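Taken together with the 16 false positives found earlier, this total allows a rough false-positive rate to be estimated. The sketch below assumes that each of the 12376 counted occurrences corresponds to exactly one check by the RWSE checker. That is an approximation, since the counts are lemma-based and the checker may see a different number of candidates.

```python
# Values taken from the notebook outputs above.
false_positive_count = 16
confusion_member_count = 12376

# Assumption: every counted occurrence was checked exactly once.
false_positive_rate = false_positive_count / confusion_member_count

print(f"False-positive rate: {false_positive_rate:.2%}")  # roughly 0.13%
```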