
# 07. Data Augmentation Logic (Sanitized)

This notebook documents the **data augmentation strategy** used to improve model robustness,
especially on informal and noisy text sources. The implementation is fully **sanitized**:
it exposes no protected data while preserving the structure of the experimental logic.



## 1. Motivation

Augmentation is introduced to:
- Mitigate class imbalance across DSM-5 labels
- Improve robustness to linguistic variation
- Enhance generalization on informal user-generated text

No augmented text originates from or reconstructs protected datasets.



## 2. Augmentation Categories

We categorize augmentation operations into three groups:
1. Lexical-level perturbation
2. Syntactic-level perturbation
3. Semantic-preserving paraphrase (interface-level)


In [None]:

import random



## 3. Lexical-level Augmentation (Sanitized)

Operations include token deletion, insertion, and swap. The sanitized cell below implements deletion and swap.


In [None]:

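def random_deletion(tokens, p=0.2):
    # Drop each token independently with probability p.
    if len(tokens) <= 1:
        return list(tokens)
    kept = [t for t in tokens if random.random() > p]
    # Never return an empty sequence: fall back to one randomly chosen token.
    return kept if kept else [random.choice(tokens)]

def random_swap(tokens):
    # Swap two randomly chosen positions; work on a copy so the input is not mutated.
    tokens = list(tokens)
    if len(tokens) < 2:
        return tokens
    i, j = random.sample(range(len(tokens)), 2)
    tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens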
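
Token insertion is listed among the lexical operations but has no cell of its own. The sketch below is one possible form, not the original implementation: it assumes no synonym lexicon is available in the sanitized setting, so it re-inserts tokens drawn from the sequence itself. The name `random_insertion` and the parameters `n` and `rng` are hypothetical.

```python
import random

def random_insertion(tokens, n=1, rng=random):
    """Insert n randomly chosen existing tokens at random positions.

    Assumption: inserted tokens are sampled from the input sequence,
    because no external vocabulary is part of this notebook.
    """
    if not tokens:
        return []
    result = list(tokens)  # copy so the caller's list is not mutated
    for _ in range(n):
        token = rng.choice(tokens)
        position = rng.randint(0, len(result))  # inclusive upper bound allows appending
        result.insert(position, token)
    return result
```

Sampling from the sequence itself keeps the vocabulary of the augmented sample identical to that of the source sample, which may or may not be desirable depending on the experiment.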



## 4. Syntactic-level Augmentation

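We simulate minor word-order variations by shuffling tokens within small, non-overlapping windows, so that each token stays close to its original position and meaning is largely preserved.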


In [None]:

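def shuffle_within_window(tokens, window=3):
    # Shuffle tokens inside consecutive, non-overlapping windows of the given size.
    tokens = list(tokens)
    for i in range(0, len(tokens), window):
        chunk = tokens[i:i + window]
        random.shuffle(chunk)  # shuffling the slice expression directly would only shuffle a temporary copy
        tokens[i:i + window] = chunk
    return tokens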
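
A quick way to sanity-check a window-level shuffle is to confirm that each window of the output is a permutation of the same window of the input, so no token crosses a window boundary. The helper below is a hypothetical check and not part of the original notebook.

```python
from collections import Counter

def windows_preserved(original, shuffled, window=3):
    """Return True if every window of `shuffled` is a permutation
    of the corresponding window of `original`."""
    if len(original) != len(shuffled):
        return False
    return all(
        Counter(original[i:i + window]) == Counter(shuffled[i:i + window])
        for i in range(0, len(original), window)
    )
```

For example, `windows_preserved(["a", "b", "c", "d"], ["c", "a", "b", "d"])` holds, while moving `"d"` into the first window would fail the check.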



## 5. Semantic-preserving Paraphrase (Interface)

Paraphrasing is applied **only at the interface level** using placeholder functions.
No actual paraphrase model outputs are released.


In [None]:

def fake_paraphrase(tokens):
    # Placeholder for the paraphrase interface; no real paraphrase model is called.
    return tokens[::-1]  # token reversal is a dummy transformation, not a meaning-preserving paraphrase
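
To make the "interface-level" idea concrete, the sketch below treats a paraphraser as any callable from a token list to a token list, so a real model could be swapped in without touching the rest of the pipeline. The names `Paraphraser`, `identity_paraphrase`, and `apply_paraphrase`, and the fallback to the original tokens on empty output, are assumptions made for illustration.

```python
from typing import Callable, List

# Assumed interface: a paraphraser maps a token list to a token list.
Paraphraser = Callable[[List[str]], List[str]]

def identity_paraphrase(tokens: List[str]) -> List[str]:
    """Default placeholder: return the tokens unchanged."""
    return list(tokens)

def apply_paraphrase(tokens: List[str],
                     paraphraser: Paraphraser = identity_paraphrase) -> List[str]:
    """Run a paraphraser on a copy of the tokens.

    Falls back to the original tokens if the paraphraser returns nothing,
    so downstream code never receives an empty sample.
    """
    output = paraphraser(list(tokens))
    return list(output) if output else list(tokens)
```

Any placeholder with the same signature, including a simple reversal, can be passed as `paraphraser`.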



## 6. Example Augmentation Flow


In [None]:

original = ["feel", "tired", "cannot", "sleep"]

augmented_samples = [
    random_deletion(original),
    random_swap(original.copy()),
    shuffle_within_window(original),
    fake_paraphrase(original)
]

augmented_samples



## 7. Label Preservation Policy

- Augmentation is applied **after label assignment**
- All augmented samples inherit the original multi-label vector
- No label synthesis or hallucination is performed
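
The policy above can be sketched as a small helper that pairs every augmented token sequence with an untouched copy of the source label vector. The function name `augment_with_labels` and the list-of-pairs return format are assumptions, not the original implementation.

```python
def augment_with_labels(tokens, labels, operations):
    """Apply each operation to a copy of `tokens` and attach a copy of `labels`.

    Labels are copied verbatim: nothing is re-predicted or synthesized.
    """
    return [(op(list(tokens)), list(labels)) for op in operations]
```

Copying the label vector for each sample avoids accidental sharing, so editing one sample's labels later cannot silently change another's.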



## 8. Experimental Notes

- Augmentation is used only in training
- Validation and test sets remain untouched
- Identical augmentation policies are applied across models
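
These notes can be sketched as a function that augments only the training split and passes the validation and test splits through unchanged. The split layout (a dict with `"train"`, `"val"`, and `"test"` keys holding `(tokens, labels)` pairs), the function name, and the `seed` parameter are hypothetical. Note that the seed here only controls which operation is chosen per sample; operations that draw from the global `random` module would need their own seeding.

```python
import random

def augment_training_split(splits, operations, seed=0):
    """Return a copy of `splits` with one augmented sample added per training example.

    Validation and test entries are carried over untouched.
    """
    rng = random.Random(seed)  # fixed seed so every model sees the same operation choices
    augmented = dict(splits)   # shallow copy: non-training splits are reused as-is
    train = list(splits["train"])
    extra = []
    for tokens, labels in train:
        op = rng.choice(operations)
        extra.append((op(list(tokens)), list(labels)))
    augmented["train"] = train + extra
    return augmented
```

Reusing the same `operations` list and `seed` for every model is one way to keep the augmentation policy identical across experiments.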



## 9. Ethics and Reproducibility

- Augmented text does not reconstruct or approximate original user data
- All operations are generic and dataset-agnostic
- This strategy aligns with responsible NLP practices in mental health research
