# Exercises for Class 12: Augmentation

1) Solve the above mystery: why doesn't the model's estimate change when uppercasing? *Hint*: Check the model's tokenizer.
2) Examining the data, I seemed to notice that spelling errors were more common among offensive tweets. Is this correct? [*Hint*](https://kennethenevoldsen.github.io/augmenty/augmenty.character.html?highlight=keystroke#augmenty.character.replace.create_keystroke_error_augmenter)
3) Examine the data yourself and create three hypotheses about which augmentations might change the performance.
4) Outline how you could apply augmentation (behavioral testing) to examine a model (or pipeline) in your project.
5) (Optional): Apply this behavioral testing to your model.

#### Setup before exercises (from augmentation.ipynb)

In [1]:
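import ssl

# Workaround: skip HTTPS certificate verification so the downloads below succeed.
# This is insecure; only use it for exercises like this one.
ssl._create_default_https_context = ssl._create_unverified_context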

In [2]:
from danlp.datasets import DKHate
import pandas as pd
dkhate = DKHate()
test, train = dkhate.load_with_pandas()

samples = 20

# make sure to sample evenly from the two labels
n_labels = len(test["subtask_a"].unique())
samples_pr_lab = samples//n_labels

off = test[test["subtask_a"] == "OFF"].sample(samples_pr_lab)
not_off = test[test["subtask_a"] == "NOT"].sample(samples_pr_lab)
mini_test = pd.concat([off, not_off])



In [3]:
from transformers import pipeline

model_name = "DaNLP/da-bert-hatespeech-detection"
pipe = pipeline("sentiment-analysis", # text classification == sentiment analysis (don't ask me why, but they removed textcat in the latest version)
               model=model_name)



In [4]:
texts = mini_test["tweet"].to_list()

def apply(texts):
    output = pipe(texts, truncation=True)
    return [t["score"] if t["label"] == "offensive" else 1 - t["score"] for t in output]


# first without augmentations
mini_test["p_offensive_no_aug"] = apply(texts)

In [5]:
import augmenty
import spacy

nlp = spacy.load("da_core_news_lg")

In [6]:
# load an augmenter
upper_case_augmenter = augmenty.load("upper_case.v1", level=1.00) # augment 100% 

In [7]:
aug_texts = augmenty.texts(texts, augmenter=upper_case_augmenter, nlp=nlp)
mini_test["p_offensive_upper"] = apply(list(aug_texts))

In [8]:
def compare_cols(
    augmentation,
    baseline=mini_test["p_offensive_no_aug"],
    category=mini_test["subtask_a"],
):
    """Compares augmentation with the baseline for each of the categories"""
    changes = ((augmentation > 0.5) != (baseline > 0.5)).sum()
    n = len(augmentation)
    print(f"The augmentation lead to classification changes in {changes}/{n}")
    for cat in set(category):
        aug_cat_mean = augmentation[category == cat].mean().round(3)
        aug_cat_std = augmentation[category == cat].std().round(3)
        cat_mean = baseline[category == cat].mean().round(3)
        cat_std = baseline[category == cat].std().round(3)
        print(
            f"The average prob. of {cat} went from {cat_mean}({cat_std}) to {aug_cat_mean}({aug_cat_std})."
        )

compare_cols(mini_test["p_offensive_upper"])

The augmentation lead to classification changes in 0/20
The average prob. of OFF went from 0.408(0.493) to 0.408(0.493).
The average prob. of NOT went from 0.004(0.006) to 0.004(0.006).


### 1) Solve the above mystery: why doesn't the model's estimate change when uppercasing?

In [17]:
texts_batch = []
for i, txt in enumerate(augmenty.texts(texts, augmenter=upper_case_augmenter, nlp=nlp)):
    print(txt, end='\n')
    texts_batch.append(txt)
    if i == 9:
        break

NORMALT SYNES JEG MARX VAR LIGE HØJREORIENTERET NOK. MEN PÅ DETTE PUNKT KAN JEG MÆRKE EN EKSTREMT LIBERAL TILGANG TIL EMNE. TIL DENNE DEBAT VIL JEG SIGE, AT KRÆNKELSESKULTUREN ER EN AF DE DUMMESTE IMPORTER, VI HAR GJORT OS FRA NORDAMERIKA.  JEG AFVENTER PERSONLIGT, AT VI KOLLEKTIVT VEDTAGER, AT DET ER FLØJTENDE LIGEGYLDIGT OM MAN ER FØDT MED TAP ELLER EJ, HVAD FARVE MAN HAR, OG HVEM MAN BEDST KAN LIDE AT KYSSE PÅ.  NEJ, VI VIL IKKE HAVE HATE CRIMES. NEJ, VI GIDER IKKE SE PÅ PASSIV HUMOR PÅ BEKOSTNING AF ANDRE. MEN MAN SKULLE FANDME TRO, AT DANMARK HAVDE ASSIMILERET DEN MELLEMØSTLIGE AVERSION OVERFOR SELVIRONI. HVILKET ER NOGET BEKLAGELIGT EFTERSOM SELVIRONI ER ET RET STORT INTERPERSONLIGT FÆNOMEN FOR DANSKERE.   MEN ER EN JOKE OM, AT NOGEN GANGE VENTEDE JEG LIDT LÆNGE PÅ, AT MIN EKS GJORDE SIG KLAR PÅ BEKOSTNING AF HENDE? ELLER ER DEN RELATERBAR FOR ANDRE MÆND OG KVINDER, DER GODT LIDT KAN SE, AT DER (FOR MANGE) ER NOGET OM SNAKKEN? I VIRKELIGHEDEN SIGER JEG JO MED EN SÅDAN KOMMENTAR, 

In [18]:
for t in pipe(texts_batch, truncation=True):
    print(t)

{'label': 'not offensive', 'score': 0.8743921518325806}
{'label': 'not offensive', 'score': 0.9776104092597961}
{'label': 'offensive', 'score': 0.9834025502204895}
{'label': 'not offensive', 'score': 0.9994292259216309}
{'label': 'offensive', 'score': 0.9637059569358826}
{'label': 'not offensive', 'score': 0.998209536075592}
{'label': 'offensive', 'score': 0.9789860248565674}
{'label': 'not offensive', 'score': 0.9995690584182739}
{'label': 'not offensive', 'score': 0.9866626858711243}
{'label': 'offensive', 'score': 0.9946720004081726}
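The scores above are identical to the un-augmented ones, which is what one would expect if the model's tokenizer lowercases its input before the model sees it. That is the situation the hint points toward. The sketch below shows one way to test it: compare the tokens for a text with the tokens for its uppercased version. `toy_uncased_tokenize` and `toy_cased_tokenize` are hypothetical stand-ins so the snippet runs without downloading anything; they are not the model's real tokenizer.

```python
from typing import Callable, List


def toy_uncased_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in for an uncased tokenizer: lowercase, then split on whitespace."""
    return text.lower().split()


def toy_cased_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in for a cased tokenizer: split on whitespace only."""
    return text.split()


def is_case_insensitive(tokenize: Callable[[str], List[str]], text: str) -> bool:
    """Return True if uppercasing `text` leaves the token sequence unchanged."""
    return tokenize(text) == tokenize(text.upper())


sample = "Det er ikke okay"
print(is_case_insensitive(toy_uncased_tokenize, sample))  # True
print(is_case_insensitive(toy_cased_tokenize, sample))    # False

# To run the same check on the real model (assumes the model can be downloaded):
# from transformers import AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained("DaNLP/da-bert-hatespeech-detection")
# print(is_case_insensitive(tokenizer.tokenize, sample))
```

If the check returns `True` for the real tokenizer, the upper-case augmentation cannot change the model's input, so it says nothing about the model's robustness to casing.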


### 2) Examining the data, I seemed to notice that spelling errors were more common among offensive tweets. Is this correct?

In [19]:
mini_test

| id | tweet | subtask_a | p_offensive_no_aug | p_offensive_upper |
|---|---|---|---|---|
| 544 | Normalt synes jeg Marx var lige højreorientere... | OFF | 0.125608 | 0.125608 |
| 3326 | Det fandme en fugtig migmig det der | OFF | 0.02239 | 0.02239 |
| 1425 | [FUCK YOU!! INGLIN WILL RISE AGAIN!!](URL | OFF | 0.983403 | 0.983403 |
| 3299 | Det er kraftedme en stor præstation der her, e... | OFF | 0.000571 | 0.000571 |
| 962 | Passiv aggressiv måde at kalde dig for et pikfjæs | OFF | 0.963706 | 0.963706 |
| 546 | Norge er nu ikke så slemt igen... AKA Svensken... | OFF | 0.00179 | 0.00179 |
| 1251 | Han EJER ikke respekt for nogen eller noget...... | OFF | 0.978986 | 0.978986 |
| 519 | Det sgu heller ikke okay. jeg havde sgu også b... | OFF | 0.000431 | 0.000431 |
| 658 | Fuckkkkk det her er mig.... | OFF | 0.013337 | 0.013337 |
| 890 | Syge kælling, hun skulle i fængsel for det stu... | OFF | 0.994672 | 0.994672 |


In [24]:
keystroke_error_augmenter = augmenty.load("keystroke_error.v1", level=0.10) # augment 10% 

In [25]:
aug_texts = augmenty.texts(texts, augmenter=keystroke_error_augmenter, nlp=nlp)
mini_test["p_offensive_keyerror"] = apply(list(aug_texts))

compare_cols(mini_test["p_offensive_keyerror"])

The augmentation lead to classification changes in 5/20
The average prob. of OFF went from 0.408(0.493) to 0.125(0.309).
The average prob. of NOT went from 0.004(0.006) to 0.007(0.015).
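The summary above reports how many classifications changed, but not which ones or in which direction. A minimal sketch for listing them follows. `flipped` is a hypothetical helper, and the numbers in the usage example are made up for illustration; they are not taken from the run above. With the notebook's data, one could pass `mini_test["p_offensive_no_aug"].to_list()` and `mini_test["p_offensive_keyerror"].to_list()` instead.

```python
from typing import List, Sequence, Tuple


def flipped(
    baseline: Sequence[float],
    augmented: Sequence[float],
    threshold: float = 0.5,
) -> List[Tuple[int, str]]:
    """Return (position, direction) for every item whose predicted class changed."""
    changes = []
    for i, (before, after) in enumerate(zip(baseline, augmented)):
        was_off = before > threshold
        is_off = after > threshold
        if was_off and not is_off:
            changes.append((i, "offensive -> not offensive"))
        elif is_off and not was_off:
            changes.append((i, "not offensive -> offensive"))
    return changes


# Illustrative values only, not results from the notebook
toy_baseline = [0.98, 0.02, 0.96, 0.01]
toy_augmented = [0.10, 0.03, 0.97, 0.60]
for position, direction in flipped(toy_baseline, toy_augmented):
    print(position, direction)
```

In the run above, the mean offensive probability for OFF tweets fell from 0.408 to 0.125, so most of the flips presumably go from offensive to not offensive. That suggests the model is sensitive to typos in offensive tweets. It does not by itself show that spelling errors are more common in them.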


### 3) Examine the data yourself and create three hypotheses about which augmentations might change the performance.

In [32]:
synonym_augmenter = augmenty.load("wordnet_synonym.v1", level=0.50) # augment 50%
grundtvig_augmenter = augmenty.load("grundtvigian_spacing_augmenter.v1", level=0.05) # augment 5%

In [37]:
for i, txt in enumerate(augmenty.texts(texts, augmenter=grundtvig_augmenter, nlp=nlp)):
    print(txt, end='\n')
    if i == 5:
        break

Normalt synes jeg Marx var lige højreorienteret nok. Men på dette punkt kan jeg mærke en ekstremt liberal tilgang til emne. Til denne debat vil jeg sige, at krænkelseskulturen er en af de dummeste importer, vi har gjort os fra Nordamerika.  Jeg afventer personligt, a t vi kollektivt vedtager, at det er fløjtende ligegyldigt om man e r født med tap eller ej, hvad farve man har, og h v e m man b e d s t kan l i d e at kysse på.  Nej, vi vil ikke have hate crimes. Nej, vi gider ikke se på passiv humor på bekostning af andre. Men man skulle fandme tro, at Danmark havde assimileret den Mellemøstlige aversion overfor selvironi. Hvilket er noget beklageligt eftersom selvironi er et ret stort interpersonligt fænomen for danskere.   Men er en j o k e o m, at nogen gange ventede jeg lidt længe på, at min eks gjorde sig klar på bekostning af hende? Eller er den relaterbar for andre mænd og kvinder, der godt lidt kan se, at der (for mange) er noget om snakken? I virkeligheden siger jeg jo med en s
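The output above shows what the augmenter does: some tokens get a space inserted between every character. The sketch below is a simplified stand-in for that behavior in plain Python, not augmenty's implementation. `space_out_tokens` and its `seed` parameter are hypothetical names, and it splits on whitespace rather than using spaCy tokens.

```python
import random


def space_out_tokens(text: str, level: float, seed: int = 0) -> str:
    """Insert spaces between the characters of roughly `level` of the whitespace-separated tokens."""
    rng = random.Random(seed)  # local RNG so the result is reproducible
    out = []
    for token in text.split():
        if rng.random() < level:
            out.append(" ".join(token))  # "bedst" -> "b e d s t"
        else:
            out.append(token)
    return " ".join(out)


print(space_out_tokens("hvem man bedst kan lide", level=1.0))
# h v e m m a n b e d s t k a n l i d e
print(space_out_tokens("hvem man bedst kan lide", level=0.0))
# hvem man bedst kan lide
```

A spaced-out word is no longer a single whitespace-delimited token, so a tokenizer will likely split it into very different pieces than the original word. This kind of change can therefore plausibly shift a model's prediction even at low levels.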

In [35]:
aug_texts = augmenty.texts(texts, augmenter=grundtvig_augmenter, nlp=nlp)
mini_test["p_offensive_grundtvig"] = apply(list(aug_texts))

compare_cols(mini_test["p_offensive_grundtvig"])

The augmentation lead to classification changes in 1/20
The average prob. of OFF went from 0.408(0.493) to 0.497(0.514).
The average prob. of NOT went from 0.004(0.006) to 0.017(0.024).


In [36]:
aug_texts = augmenty.texts(texts, augmenter=synonym_augmenter, nlp=nlp)
mini_test["p_offensive_synonym"] = apply(list(aug_texts))

compare_cols(mini_test["p_offensive_synonym"])

The augmentation lead to classification changes in 0/20
The average prob. of OFF went from 0.408(0.493) to 0.412(0.492).
The average prob. of NOT went from 0.004(0.006) to 0.004(0.006).
