<a href="https://colab.research.google.com/github/xiashi123/ML-Notebooks/blob/main/Text_Data_Augmentation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Augmentation
Data augmentation increases the amount of training data by adding slightly modified copies of existing examples or synthetic examples derived from them. It acts as a regularizer and helps reduce overfitting when training a machine learning model. It is also closely related to oversampling.

source: [Wikipedia](https://en.wikipedia.org/wiki/Data_augmentation)

In this notebook we will go through the most commonly used data augmentation techniques for text data and their implementation using the [Text-Data-Augmentation](https://github.com/Ritvik19/Text-Data-Augmentation) package.

In [None]:
%%capture
!pip install sentencepiece
!pip install git+https://github.com/Ritvik19/Text-Data-Augmentation.git
!python -m spacy download en_core_web_lg

In [None]:
%%capture
import nltk
nltk.download('all')

## Abstractive Summarization

Abstractive Summarization Augmentation uses state-of-the-art transformer models to summarize the given text. [[17]](#ref-17) [[18]](#ref-18)

In [None]:
from text_data_augmentation import AbstractiveSummarization
aug = AbstractiveSummarization()
aug(["""Abstractive Summarization is a task in Natural Language Processing (NLP) that aims to generate a concise summary of a source text. Unlike extractive summarization,
 abstractive summarization does not simply copy important phrases from the source text but also potentially come up with new phrases that are relevant, which can be seen as
 paraphrasing. Abstractive summarization yields a number of applications in different domains, from books and literature, to science and R&D, to financial research and legal
  documents analysis."""])

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 (https://huggingface.co/sshleifer/distilbart-cnn-12-6)


  0%|          | 0/1 [00:00<?, ?it/s]

Your max_length is set to 142, but you input_length is only 109. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=54)


['Abstractive Summarization is a task in Natural Language Processing (NLP) that aims to generate a concise summary of a source text. Unlike extractive summarization,\n abstractive summarization does not simply copy important phrases from the source text but also potentially come up with new phrases that are relevant, which can be seen as \n paraphrasing. Abstractive summarization yields a number of applications in different domains, from books and literature, to science and R&D, to financial research and legal\n  documents analysis.',
 ' Abstractive Summarization is a task in Natural Language Processing (NLP) that aims to generate a concise summary of a source text . Unlike extractive summarization, it does not simply copy important phrases from the source text but also potentially come up with new phrases that are relevant, which can be seen as paraphrasing .']

## Back Translation

Back Translation Augmentation translates text into another language and then translates it back into the original language. This produces text whose wording differs from the original while preserving its context and meaning. [[1]](#ref-1) [[2]](#ref-2) [[10]](#ref-10)

In [None]:
from text_data_augmentation import BackTranslation
aug = BackTranslation()
aug(['A quick brown fox jumps over the lazy dog'])

  0%|          | 0/1 [00:00<?, ?it/s]

['A quick brown fox jumps over the lazy dog',
 'A quick brown fox jumps on the lazy dog']
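
The round trip described above can be sketched without any translation model. In this minimal illustration the two "translators" are toy word-level lookup tables (`EN_TO_FR`, `FR_TO_EN`, and `word_translator` are hypothetical names, not part of the package); in practice each would be a real machine translation model.

```python
from typing import Callable, Dict


def back_translate(text: str,
                   forward: Callable[[str], str],
                   backward: Callable[[str], str]) -> str:
    """Translate into a pivot language, then back into the source language."""
    return backward(forward(text))


# Toy word-level tables standing in for real translation models (assumption).
EN_TO_FR: Dict[str, str] = {"quick": "rapide", "over": "sur", "dog": "chien"}
FR_TO_EN: Dict[str, str] = {"rapide": "fast", "sur": "on", "chien": "dog"}


def word_translator(table: Dict[str, str]) -> Callable[[str], str]:
    """Build a translator that maps each word through `table`, leaving unknown words as-is."""
    return lambda s: " ".join(table.get(w, w) for w in s.split())


original = "A quick brown fox jumps over the lazy dog"
augmented = back_translate(original,
                           word_translator(EN_TO_FR),
                           word_translator(FR_TO_EN))
print(augmented)  # A fast brown fox jumps on the lazy dog
```

The wording changes because the backward mapping is not the exact inverse of the forward one, which is the same effect a pair of real translation models tends to produce.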

## Character Noise

Character Noise Augmentation adds character-level noise by randomly inserting, deleting, swapping, or replacing some characters in the input text. [[2]](#ref-2) [[9]](#ref-9)

In [None]:
from text_data_augmentation import CharacterNoise
aug = CharacterNoise(alpha=0.2, n_aug=1)
aug(['A quick brown fox jumps over the lazy dog'])

  0%|          | 0/1 [00:00<?, ?it/s]

['A quick brown fox jumps over the lazy dog',
 'A 3quick brown fox jumps over the lazy dog']
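
To make the four operations concrete, here is a small pure-Python sketch. It assumes `alpha` is the probability that a given word receives one random edit; how the package itself interprets `alpha` is not confirmed here, and `char_noise` is a hypothetical helper.

```python
import random
import string


def char_noise(text: str, alpha: float = 0.2, seed: int = 0) -> str:
    """Apply one random insert/delete/swap/replace to roughly `alpha` of the words."""
    rng = random.Random(seed)
    words = text.split()
    for i, word in enumerate(words):
        if rng.random() >= alpha:
            continue  # leave this word untouched
        pos = rng.randrange(len(word))
        op = rng.choice(["insert", "delete", "swap", "replace"])
        ch = rng.choice(string.ascii_lowercase + string.digits)
        if op == "insert":
            word = word[:pos] + ch + word[pos:]
        elif op == "delete" and len(word) > 1:
            word = word[:pos] + word[pos + 1:]
        elif op == "swap" and len(word) > 1:
            pos = min(pos, len(word) - 2)  # keep pos + 1 inside the word
            word = word[:pos] + word[pos + 1] + word[pos] + word[pos + 2:]
        elif op == "replace":
            word = word[:pos] + ch + word[pos + 1:]
        words[i] = word
    return " ".join(words)


print(char_noise("A quick brown fox jumps over the lazy dog", alpha=0.3))
```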

## Contextual Word Replacement

Contextual Word Replacement Augmentation creates augmented samples by randomly replacing some words with a mask token and then using a masked language model to fill in the masks. The sampling of words can also be weighted by their TF-IDF values. [[2]](#ref-2) [[3]](#ref-3) [[11]](#ref-11) [[19]](#ref-19)

In [None]:
from text_data_augmentation import ContextualWordReplacement
aug = ContextualWordReplacement(n_aug=1)
aug(['A quick brown fox jumps over the lazy dog'])

No model was supplied, defaulted to distilroberta-base (https://huggingface.co/distilroberta-base)


  0%|          | 0/1 [00:00<?, ?it/s]


['A quick brown fox jumps over the lazy dog',
 'A quick brown fox jumps over the lazy dog']
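
The TF-IDF weighting mentioned above can be sketched as follows. This is a rough pure-Python reading of the word-replacement scheme in [[19]](#ref-19), where uninformative (low TF-IDF) words are replaced more often; whether the package computes it exactly this way is an assumption, and `tfidf_scores`, `replacement_probs`, and the uniform fallback are illustrative choices.

```python
import math
from collections import Counter


def tfidf_scores(tokens, corpus):
    """TF-IDF of every token in `tokens`; IDF is estimated from `corpus` (a list of token lists)."""
    tf = Counter(tokens)
    n_docs = len(corpus)
    scores = []
    for tok in tokens:
        df = sum(1 for doc in corpus if tok in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0  # smoothed IDF
        scores.append(tf[tok] / len(tokens) * idf)
    return scores


def replacement_probs(scores, p=0.7):
    """Probability of replacing each word: min(p * (C - score) / Z, 1)."""
    c = max(scores)                                  # C: highest TF-IDF in the sentence
    z = sum(c - s for s in scores) / len(scores)     # Z: average gap to C
    if z == 0:
        return [p] * len(scores)  # all words equally informative: uniform fallback (assumption)
    return [min(p * (c - s) / z, 1.0) for s in scores]


corpus = [["the", "dog", "barks"], ["the", "cat", "sleeps"], ["the", "fox", "jumps"]]
tokens = ["the", "fox", "jumps"]
probs = replacement_probs(tfidf_scores(tokens, corpus))
print(dict(zip(tokens, probs)))
```

In this toy corpus "the" appears in every document, so it gets the highest replacement probability, while the rarer "fox" and "jumps" are protected.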

## Easy Data Augmentation

Easy Data Augmentation adds word-level noise by randomly inserting, deleting, or swapping some words in the input text, or by shuffling the sentences in the input text. [[4]](#ref-4) [[5]](#ref-5) [[9]](#ref-9) [[12]](#ref-12) [[13]](#ref-13)

In [None]:
from text_data_augmentation import EasyDataAugmentation
aug = EasyDataAugmentation(n_aug=1)
aug(['A quick brown fox jumps over the lazy dog'])

  0%|          | 0/1 [00:00<?, ?it/s]

['A quick brown fox jumps over the lazy dog',
 'A quick brown fox jumps over the lazy dog']
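
Three of these operations are easy to sketch in plain Python. The helpers below (`random_swap`, `random_delete`, `shuffle_sentences`) are hypothetical illustrations, not the package's API; random insertion is left out because it needs a synonym source.

```python
import random


def random_swap(words, rng):
    """Swap the words at two distinct, randomly chosen positions."""
    words = list(words)
    if len(words) >= 2:
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words


def random_delete(words, p, rng):
    """Drop each word with probability `p`, always keeping at least one word."""
    kept = [w for w in words if rng.random() >= p]
    return kept or [rng.choice(words)]


def shuffle_sentences(text, rng):
    """Reorder the period-separated sentences of `text`."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    rng.shuffle(sentences)
    return ". ".join(sentences) + "."


rng = random.Random(0)
words = "A quick brown fox jumps over the lazy dog".split()
print(" ".join(random_swap(words, rng)))
print(" ".join(random_delete(words, 0.2, rng)))
print(shuffle_sentences("The fox jumps. The dog sleeps. The cat watches.", rng))
```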

## KeyBoard Noise

KeyBoard Noise Augmentation adds character-level spelling mistakes to the input text by mimicking typographical errors made on a QWERTY keyboard. [[2]](#ref-2) [[9]](#ref-9)

In [None]:
from text_data_augmentation import KeyBoardNoise
aug = KeyBoardNoise(alpha=0.1, n_aug=1)
aug(['A quick brown fox jumps over the lazy dog'])

  0%|          | 0/1 [00:00<?, ?it/s]

['A quick brown fox jumps over the lazy dog',
 'A qujck brown fox jum)s over the lazy dot']
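
The idea can be sketched with an adjacency map of neighbouring keys. The map below is deliberately partial and `keyboard_noise` is a hypothetical helper; it assumes `alpha` is a per-character substitution probability, which may differ from the package's definition.

```python
import random

# Partial QWERTY adjacency map (illustrative; a full map would cover every key).
QWERTY_NEIGHBORS = {
    "q": "wa", "u": "yij", "i": "uok", "c": "xvd", "k": "jlm",
    "o": "ipl", "e": "wrd", "g": "fht", "p": "ol", "d": "sfe",
}


def keyboard_noise(text, alpha=0.1, seed=0):
    """Replace each mapped character with a neighbouring key with probability `alpha`."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        neighbors = QWERTY_NEIGHBORS.get(ch.lower())
        if neighbors and rng.random() < alpha:
            out.append(rng.choice(neighbors))  # simulate hitting an adjacent key
        else:
            out.append(ch)
    return "".join(out)


print(keyboard_noise("A quick brown fox jumps over the lazy dog", alpha=0.3))
```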

## OCR Noise

OCR Noise Augmentation adds character-level spelling mistakes to the input text by mimicking OCR errors. [[6]](#ref-6)

In [None]:
from text_data_augmentation import OCRNoise
aug = OCRNoise(alpha=0.1, n_aug=1)
aug(['A quick brown fox jumps over the lazy dog'])

  0%|          | 0/1 [00:00<?, ?it/s]

['A quick brown fox jumps over the lazy dog',
 'A quick brown fox turnqs over the lazx dog']
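
OCR errors are usually confusions between visually similar glyphs, so a confusion table captures the gist. The table and `ocr_noise` below are hypothetical and far smaller than any table derived from real OCR statistics.

```python
import random

# Small, hand-picked table of visually confusable glyphs (assumption, not measured data).
OCR_CONFUSIONS = {
    "l": ["1", "I"], "O": ["0"], "o": ["0"], "m": ["rn"],
    "e": ["c"], "S": ["5"], "B": ["8"], "y": ["v"],
}


def ocr_noise(text, alpha=0.1, seed=0):
    """Replace each confusable character with a look-alike with probability `alpha`."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        options = OCR_CONFUSIONS.get(ch)
        if options and rng.random() < alpha:
            out.append(rng.choice(options))
        else:
            out.append(ch)
    return "".join(out)


print(ocr_noise("modem", alpha=1.0))  # rn0dcrn
```

Note that a replacement may be longer than the character it replaces ("m" becomes "rn"), so the output length can differ from the input length.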

## Paraphrase

Paraphrase Augmentation rephrases the input sentences using T5 models. [[2]](#ref-2)

In [None]:
from text_data_augmentation import Paraphrase
aug = Paraphrase("hetpandya/t5-small-tapaco", n_aug=1)
aug(['A quick brown fox jumps over the lazy dog'])

  0%|          | 0/1 [00:00<?, ?it/s]



['A quick brown fox jumps over the lazy dog',
 'A quick brown fox jumps over a lazy dog with a quick brown fox']

## Similar Word Replacement

Similar Word Replacement Augmentation creates augmented samples by randomly replacing some words with the word whose embedding vector is most similar. The sampling of words can also be weighted by their TF-IDF values. [[2]](#ref-2) [[7]](#ref-7) [[15]](#ref-15) [[16]](#ref-16) [[19]](#ref-19)

In [None]:
from text_data_augmentation import SimilarWordReplacement
aug = SimilarWordReplacement("en_core_web_lg",  alpha=0.2, n_aug=1)
aug(['A quick brown fox jumps over the lazy dog'])

  0%|          | 0/1 [00:00<?, ?it/s]


['A quick brown fox jumps over the lazy dog',
 'A quick brown fox jumps over oF lazy dog']
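
"Most similar vector" typically means highest cosine similarity. The sketch below uses made-up 3-dimensional vectors (`VECTORS` is hypothetical; real embeddings such as those in `en_core_web_lg` have hundreds of dimensions) to show the nearest-neighbour lookup.

```python
import math

# Made-up toy embeddings for illustration only.
VECTORS = {
    "quick": (0.9, 0.1, 0.0),
    "fast":  (0.8, 0.2, 0.1),
    "lazy":  (-0.7, 0.3, 0.1),
    "dog":   (0.0, 0.9, 0.4),
    "puppy": (0.1, 0.8, 0.5),
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(word):
    """Return the vocabulary word closest to `word` by cosine similarity, excluding itself."""
    candidates = (w for w in VECTORS if w != word)
    return max(candidates, key=lambda w: cosine(VECTORS[word], VECTORS[w]))


print(most_similar("quick"))  # fast
```

A nearest neighbour is not guaranteed to be a synonym, which is one reason the real output above contains an odd replacement.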

## Synonym Replacement

Synonym Replacement Augmentation creates augmented samples by randomly replacing some words with their synonyms from the WordNet database. The sampling of words can also be weighted by their TF-IDF values. [[2]](#ref-2) [[4]](#ref-4) [[8]](#ref-8) [[13]](#ref-13) [[19]](#ref-19)

In [None]:
from text_data_augmentation import SynonymReplacement
aug = SynonymReplacement(alpha=0.2, n_aug=1)
aug(['A quick brown fox jumps over the lazy dog'])

  0%|          | 0/1 [00:00<?, ?it/s]


['A quick brown fox jumps over the lazy dog',
 'A quick brown fox jumps over the lazy dog']
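
A minimal sketch of the replacement loop, with a tiny hand-written dictionary standing in for WordNet (`SYNONYMS` and `replace_synonyms` are hypothetical; a WordNet-backed function could be passed as `lookup` instead):

```python
import random

# Tiny stand-in for a WordNet lookup (assumption, for illustration only).
SYNONYMS = {
    "quick": ["fast", "speedy"],
    "jumps": ["leaps"],
    "lazy": ["idle", "indolent"],
}


def replace_synonyms(text, alpha=0.2, seed=0, lookup=SYNONYMS.get):
    """Replace each word that has synonyms with a random one, with probability `alpha`."""
    rng = random.Random(seed)
    out = []
    for word in text.split():
        options = lookup(word)
        if options and rng.random() < alpha:
            out.append(rng.choice(options))
        else:
            out.append(word)
    return " ".join(out)


print(replace_synonyms("A quick brown fox jumps over the lazy dog", alpha=0.5))
```

With a small `alpha` and a short sentence, it is quite possible that no word is selected, which would explain an augmented sample identical to the input.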

## Word Split

Word Split Augmentation adds word-level spelling mistakes by randomly splitting words in the input text. [[2]](#ref-2) [[14]](#ref-14)

In [None]:
from text_data_augmentation import WordSplit
aug = WordSplit(alpha=0.15, n_aug=1)
aug(['A quick brown fox jumps over the lazy dog'])

  0%|          | 0/1 [00:00<?, ?it/s]

['A quick brown fox jumps over the lazy dog',
 'A quick brow n fox jumps over the lazy dog']
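
A short sketch of the splitting step, assuming `alpha` is the per-word probability of inserting a space at a random interior position (`word_split` is a hypothetical helper, not the package's implementation):

```python
import random


def word_split(text, alpha=0.15, seed=0):
    """Split each multi-character word in two with probability `alpha`."""
    rng = random.Random(seed)
    out = []
    for word in text.split():
        if len(word) > 1 and rng.random() < alpha:
            cut = rng.randrange(1, len(word))  # interior position, so both halves are non-empty
            word = word[:cut] + " " + word[cut:]
        out.append(word)
    return " ".join(out)


print(word_split("A quick brown fox jumps over the lazy dog", alpha=0.3))
```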

# References

1. <a href="https://arxiv.org/pdf/2106.04681.pdf" id="ref-1">Data Expansion Using Back Translation and Paraphrasing for Hate Speech Detection</a>
2. <a href="https://arxiv.org/ftp/arxiv/papers/2107/2107.03158.pdf" id="ref-2">A Survey on Data Augmentation for Text Classification</a>
3. <a href="https://arxiv.org/pdf/1805.06201.pdf" id="ref-3">Contextual Augmentation: Data Augmentation by Words with Paradigmatic Relations</a>
4. <a href="https://arxiv.org/pdf/1901.11196.pdf" id="ref-4">EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks</a>
5. <a href="https://aclanthology.org/2020.coling-main.343.pdf" id="ref-5">An Analysis of Simple Data Augmentation for Named Entity Recognition</a>
6. <a href="https://zenodo.org/record/3245169/files/JCDL2019_Deep_Analysis.pdf" id="ref-6">Deep Statistical Analysis of OCR Errors for Effective Post-OCR Processing</a>
7. <a href="https://www.researchgate.net/publication/331784439_A_Study_of_Various_Text_Augmentation_Techniques_for_Relation_Classification_in_Free_Text" id="ref-7">A Study of Various Text Augmentation Techniques for Relation Classification in Free Text</a>
8. <a href="http://ceur-ws.org/Vol-2268/paper11.pdf" id="ref-8">Text Augmentation for Neural Networks</a>
9. <a href="https://arxiv.org/pdf/1711.02173.pdf" id="ref-9">Synthetic And Natural Noise Both Break Neural Machine Translation</a>
10. <a href="https://arxiv.org/pdf/1511.06709.pdf" id="ref-10">Improving Neural Machine Translation Models with Monolingual Data</a>
11. <a href="https://arxiv.org/pdf/2003.02245.pdf" id="ref-11">Data Augmentation Using Pre-trained Transformer Models</a>
12. <a href="https://arxiv.org/pdf/1903.09460.pdf" id="ref-12">Data Augmentation via Dependency Tree Morphing for Low-Resource Languages</a>
13. <a href="https://arxiv.org/pdf/1809.02079.pdf" id="ref-13">Adversarial Over-Sensitivity and Over-Stability Strategies for Dialogue Models</a>
14. <a href="https://arxiv.org/pdf/1812.05271v1.pdf" id="ref-14">TextBugger: Generating Adversarial Text Against Real-world Applications</a>
15. <a href="https://arxiv.org/pdf/1804.07998.pdf" id="ref-15">Generating Natural Language Adversarial Examples</a>
16. <a href="https://arxiv.org/pdf/1509.01626.pdf" id="ref-16">Character-level Convolutional Networks for Text Classification</a>
17. <a href="https://arxiv.org/pdf/1812.02303.pdf" id="ref-17">Neural Abstractive Text Summarization with Sequence-to-Sequence Models</a>
18. <a href="https://arxiv.org/pdf/1910.13461v1.pdf" id="ref-18">BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension</a>
19. <a href="https://arxiv.org/pdf/1904.12848.pdf" id="ref-19">Unsupervised Data Augmentation for Consistency Training</a>
20. <a href="https://arxiv.org/pdf/2007.02033.pdf" id="ref-20">Text Data Augmentation: Towards better detection of spear-phishing emails</a>
