# Model validation using Augmentation
For this class we will conduct model validation using augmentation, we will especially use the package [Augmenty](https://kennethenevoldsen.github.io/augmenty/).

## Setup

We will need to set up a few things before we start.

### Packages:
For this tutorial you will need the following packages:

- spaCy and augmenty are used for the augmentation
- transformers are use to run the model we wish to validate
- danlp is used to download the dataset we want to use

In [1]:
# !pip install augmenty spacy==3.1.1 transformers==4.2.2 danlp==0.0.12
# !python -m spacy download da_core_news_lg

SyntaxError: invalid syntax (<ipython-input-1-5a5edb48c1f0>, line 2)

## Dataset
For this dataset we will be using [DKHate](https://github.com/alexandrainst/danlp/blob/master/docs/docs/datasets.md#dkhate). The DKHate dataset contains user-generated comments from social media platforms (Facebook and Reddit) annotated for various types and target of offensive language. Note that only labels for the sub-task A (Offensive language identification), i.e. NOT (Not Offensive) / OFF (Offensive), are available.

In [1]:
from danlp.datasets import DKHate
import pandas as pd
dkhate = DKHate()
test, train = dkhate.load_with_pandas()

to make everything run faster we will only be using a subsample of the dataset:

In [2]:
samples = 20

# make sure to sample evenly from the two samples
n_labels = len(test["subtask_a"].unique())
samples_pr_lab = samples//n_labels

off = test[test["subtask_a"] == "OFF"].sample(samples_pr_lab)
not_off = test[test["subtask_a"] == "NOT"].sample(samples_pr_lab)
mini_test = pd.concat([off, not_off])

We can now inspect the data using:

In [3]:
mini_test

Unnamed: 0_level_0,tweet,subtask_a
id,Unnamed: 1_level_1,Unnamed: 2_level_1
3299,"Det er kraftedme en stor præstation der her, e...",OFF
305,De små får fri på vores skole fordi en knægt b...,OFF
2041,HEY! Bilar er jo sådan det eneste gode der er ...,OFF
701,"@USER hvis hun ikke kan koge pastaen rigtigt, ...",OFF
962,Passiv aggressiv måde at kalde dig for et pikfjæs,OFF
799,hvorfor i den fucking store helvede skal man f...,OFF
519,Det sgu heller ikke okay. jeg havde sgu også b...,OFF
987,"Tak, fordi du ikke vanærede @USER ved at sætte...",OFF
1251,Han EJER ikke respekt for nogen eller noget......,OFF
1167,"Lækkert lorteindslag v1, jeg giver d1 1/1.",OFF


## Loading the model
For this dataset we will be using a model trained on the train set of the corpus:

In [4]:
from transformers import pipeline
import torch

model_name = "DaNLP/da-bert-hatespeech-detection"
pipe = pipeline("sentiment-analysis", # text classification == sentiment analysis (don't ask me why, but they removed textcat in the latest version)
               model=model_name)

We can quickly check the output using:

In [5]:
pipe(["Gamle stupide idiot", "Lækkert vejr i dag"]) # old stupid idiot, nice weather today

[{'label': 'offensive', 'score': 0.9902199506759644},
 {'label': 'not offensive', 'score': 0.9998297691345215}]

We can quickly apply this model to all our examples and save them in the dataset:

In [6]:
texts = mini_test["tweet"].to_list()

def apply(texts):
    output = pipe(texts, truncation=True)
    return [t["score"] if t["label"] == "offensive" else 1 - t["score"] for t in output]


# first without augmentations
mini_test["p_offensive_no_aug"] = apply(texts)

# Behavioural check using Augmentation

In the following we want to examine the behavioural consistency of the model using augmentation. The idea is to check the behavioural consistently of the model for instance if we introduce slight spelling errors we the model should still be able to recognize names. If this is not the case it might be unwise to apply the model to domains where spelling errors are common such as social media.  

![](img/aug.png)
**Figure 1**: Examples of augmentation applied by Enevoldsen et al. (2020) and what domains they might be of relevance.




## Augmenty
For the augmentation we will be using the package augmenty, the following provides a brief introduction to it.

**NOTE**: You are naturally not forced to use augmenty, you implement your own augmenters i.e. the following example with uppercasing is easy to implement by hand.  For example if you want to examine the effect of questionmarks you could make the augmentation:
```py
q_aug = [text + "?" for text in texts]
```

In [7]:
import augmenty
import spacy

nlp = spacy.load("da_core_news_lg")

# a list of augmenters
for augmenter in augmenty.augmenters():
    print(augmenter)


spacy.orth_variants.v1
spacy.lower_case.v1
random_casing.v1
char_replace_random.v1
char_replace.v1
keystroke_error.v1
remove_spacing.v1
char_swap.v1
random_starting_case.v1
conditional_token_casing.v1
token_dict_replace.v1
wordnet_synonym.v1
token_replace.v1
word_embedding.v1
grundtvigian_spacing_augmenter.v1
spacing_insertion.v1
token_swap.v1
token_insert.v1
token_insert_random.v1
duplicate_token.v1
random_synonym_insertion.v1
ents_replace.v1
per_replace.v1
ents_format.v1
upper_case.v1
spongebob.v1
da_æøå_replace.v1
da_historical_noun_casing.v1


A list naturally does not give you all the information you need. You can always examine a specific augmenter more en detain in the [documentation](https://kennethenevoldsen.github.io/augmenty/).


Let us try one of the augmenters. We can use the `augmenty.load` as a common interface for all augmenters.

In [8]:
# load an augmenter
upper_case_augmenter = augmenty.load("upper_case.v1", level=1.00) # augment 100% 

In [9]:
random_synonym = augmenty.load("random_synonym_insertion.v1", level=1.00)

In [19]:
svampebobben = augmenty.load("spongebob.v1", level=1.00)

These augmenters are made to work on the SpaCy data class Examples which allows for much more detailed augmentation, however augmenty have utility function to allow us to use them for strings:

In [20]:
examples = ["this is an example", "and another one"]
aug_texts = augmenty.texts(examples, augmenter=svampebobben, nlp=nlp)
list(aug_texts)

['ThIs iS An eXaMpLe', 'AnD AnOtHeR OnE']

## Is uppercasing more offensive?

Now we will can apply our model to the augmented examples to see if it changes predictions of the model.


In [15]:
aug_texts = augmenty.texts(texts, augmenter=random_synonym, nlp=nlp)
mini_test["p_offensive_upper"] = apply(list(aug_texts))

Examining the output of our models we quickly see that it doesn't change the result at all! 

In [16]:
mini_test

Unnamed: 0_level_0,tweet,subtask_a,p_offensive_no_aug,p_offensive_upper
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
3299,"Det er kraftedme en stor præstation der her, e...",OFF,0.000571,0.000896
305,De små får fri på vores skole fordi en knægt b...,OFF,0.030936,0.014478
2041,HEY! Bilar er jo sådan det eneste gode der er ...,OFF,0.01471,0.003938
701,"@USER hvis hun ikke kan koge pastaen rigtigt, ...",OFF,0.973846,0.260128
962,Passiv aggressiv måde at kalde dig for et pikfjæs,OFF,0.963706,0.030268
799,hvorfor i den fucking store helvede skal man f...,OFF,0.595546,0.988877
519,Det sgu heller ikke okay. jeg havde sgu også b...,OFF,0.000431,0.000431
987,"Tak, fordi du ikke vanærede @USER ved at sætte...",OFF,0.965915,0.015198
1251,Han EJER ikke respekt for nogen eller noget......,OFF,0.978986,0.97228
1167,"Lækkert lorteindslag v1, jeg giver d1 1/1.",OFF,0.004151,0.004151


To be a bit more explicit we can also compare it using summary information:

In [18]:
def compare_cols(
    augmentation,
    baseline=mini_test["p_offensive_no_aug"],
    category=mini_test["subtask_a"],
):
    """Compares augmentation with the baseline for each of the categories"""
    changes = ((augmentation > 0.5) != (baseline > 0.5)).sum()
    n = len(augmentation)
    print(f"The augmentation lead to classification changes in {changes}/{n}")
    for cat in set(category):
        aug_cat_mean = augmentation[category == cat].mean().round(3)
        aug_cat_std = augmentation[category == cat].std().round(3)
        cat_mean = baseline[category == cat].mean().round(3)
        cat_std = baseline[category == cat].std().round(3)
        print(
            f"The average prob. of {cat} went from {cat_mean}({cat_std}) to {aug_cat_mean}({aug_cat_std})."
        )

compare_cols(mini_test["p_offensive_upper"])

The augmentation lead to classification changes in 6/20
The average prob. of NOT went from 0.298(0.471) to 0.012(0.018).
The average prob. of OFF went from 0.453(0.48) to 0.229(0.404).


# Exercises:

1) Solve the above mystery, why doesn't the model estimate change might when uppercasing? *Hint*: Check the tokenizer of the model
2) Examining the data, I seemed to notice that spelling error were more common among offensive tweets. Is this correct? [*Hint*](https://kennethenevoldsen.github.io/augmenty/augmenty.character.html?highlight=keystroke#augmenty.character.replace.create_keystroke_error_augmenter)
3) Examine the data yourself and create three hypothesis on what augmentation might change the performance.
4) Outline how you could apply augmentation (behavioral testing) to examine a model (or pipeline) in your project
5) (Optional): Apply this behavioural testing to your model