# Unit 4

## Data Augmentation Techniques for Large-Scale LLM Training

### Introduction to Data Augmentation

Welcome to the lesson on **Data Augmentation** for **LLM Training**. In this lesson, we'll explore how data augmentation can enhance the training of large-scale language models (LLMs). Data augmentation involves creating new data samples from existing ones, which can help improve model performance and generalization. By the end of this lesson, you'll understand how to apply various data augmentation techniques to your datasets.

-----

### Recall: Importance of Clean Data

Before we dive into data augmentation, let's briefly recall the importance of having clean and well-prepared data. In previous lessons, we discussed techniques for efficient data storage, deduplication, and filtering. These steps remove duplicates, non-English content, and toxic text from your dataset. This matters because augmentation multiplies whatever is already in the data, so any noise left behind is amplified along with the useful examples. Remember, **clean data is the foundation for successful data augmentation.**

-----

### Synonym Replacement using WordNet

One common data augmentation technique is **synonym replacement**, where words in a sentence are replaced with their synonyms. This can help create diverse training samples. We'll use the `WordNetAugmenter` from the `textattack` library to perform synonym replacement.

First, let's import the necessary library and create a sample text:

```python
from textattack.augmentation import WordNetAugmenter

# Sample text
txt = "The cheerful child played in the sunny park."
```

Next, we create an instance of `WordNetAugmenter` and use it to augment the text:

```python
# Synonym Replacement using WordNet
wordnet_aug = WordNetAugmenter()
synonym_augmented = wordnet_aug.augment(txt)
```

In this code, `WordNetAugmenter` replaces words in the text with their synonyms. The `augment` method returns a list of augmented strings rather than a single string, so `synonym_augmented` is a list even when it holds only one sentence. Let's see the output:

```python
print("WordNet Synonym Replacement:", synonym_augmented)
```

An augmented sentence in the returned list might look like this (the chosen synonyms vary between runs):

```
"The blithe child frolicked in the sunny park."
```
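
To make the mechanics concrete, here is a minimal standard-library sketch of synonym replacement. It is not how `textattack` is implemented: the table `TOY_SYNONYMS`, the function `replace_synonyms`, and the `swap_fraction` parameter are all hypothetical names chosen for illustration, and WordNet supplies a far larger vocabulary than this toy table.

```python
import random

# Hypothetical toy synonym table; WordNet provides a much richer one.
TOY_SYNONYMS = {
    "cheerful": ["happy", "joyful"],
    "played": ["frolicked"],
    "sunny": ["bright"],
}

def replace_synonyms(text, synonyms, swap_fraction=0.5, seed=0):
    """Return a copy of `text` with some known words swapped for synonyms."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    words = text.split()
    # Positions whose punctuation-stripped, lowercase form has a synonym entry.
    candidates = [
        i for i, w in enumerate(words)
        if w.strip(".,!?").lower() in synonyms
    ]
    # Swap at least one candidate when any exist.
    n_swaps = max(1, round(len(candidates) * swap_fraction)) if candidates else 0
    for i in rng.sample(candidates, n_swaps):
        core = words[i].strip(".,!?")
        replacement = rng.choice(synonyms[core.lower()])
        # Keep any trailing punctuation attached to the word.
        words[i] = words[i].replace(core, replacement)
    return " ".join(words)

augmented = replace_synonyms(
    "The cheerful child played in the sunny park.", TOY_SYNONYMS
)
print(augmented)
```

The sentence keeps its length and structure; only individual words change, which is why synonym replacement tends to preserve labels in classification-style data.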

-----

### Easy Data Augmentation (EDA) Techniques

**Easy Data Augmentation** (EDA) combines four simple operations: synonym replacement, random insertion, random swap, and random deletion. These techniques help create diverse training samples with minimal effort. We'll use the `EasyDataAugmenter` class to demonstrate EDA.

First, import the necessary library and create a sample text:

```python
from textattack.augmentation import EasyDataAugmenter

# Sample text
txt = "The cheerful child played in the sunny park."
```

Now, create an instance of `EasyDataAugmenter` and use it to augment the text:

```python
# Easy Data Augmentation (EDA)
eda_aug = EasyDataAugmenter()
eda_augmented = eda_aug.augment(txt)
```

The `EasyDataAugmenter` applies the EDA operations to the text. The `augment` method returns a list of augmented sentences, usually more than one, since each operation can contribute its own variant. Let's see the output:

```python
print("Easy Data Augmentation:", eda_augmented)
```

One of the augmented sentences in the returned list might look like this (results vary between runs):

```
"The cheerful child played in the sunny park joyfully."
```
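
Two of the EDA operations, random swap and random deletion, are simple enough to sketch with the standard library alone. The functions `random_swap` and `random_deletion` below are hypothetical illustrations of the idea, not the `textattack` implementation, and the deletion probability `p=0.2` is an arbitrary choice.

```python
import random

def random_swap(words, rng):
    """Swap two randomly chosen positions (the random swap operation)."""
    if len(words) < 2:
        return list(words)
    i, j = rng.sample(range(len(words)), 2)  # two distinct indices
    out = list(words)
    out[i], out[j] = out[j], out[i]
    return out

def random_deletion(words, rng, p=0.2):
    """Drop each word with probability p, keeping at least one word."""
    if not words:
        return []
    kept = [w for w in words if rng.random() >= p]
    # Never return an empty sentence: fall back to one random word.
    return kept if kept else [rng.choice(words)]

rng = random.Random(0)  # seeded so the sketch is reproducible
words = "The cheerful child played in the sunny park.".split()

swapped = random_swap(words, rng)
deleted = random_deletion(words, rng)

print(" ".join(swapped))
print(" ".join(deleted))
```

Unlike synonym replacement, these two operations can produce ungrammatical sentences. That is intended: the goal is robustness to small perturbations, not fluent paraphrases.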

-----

### Back-Translation for Data Augmentation

**Back-translation** is a technique where a sentence is translated to another language and then back to the original language. This can create diverse training samples by altering sentence structure while preserving meaning. We'll use the `BackTranslationAugmenter` for this purpose.

First, import the necessary library and create a sample text:

```python
from textattack.augmentation import BackTranslationAugmenter

# Sample text
txt = "The cheerful child played in the sunny park."
```

Now, create an instance of `BackTranslationAugmenter` and use it to augment the text:

```python
# Back-Translation (English → pivot language → English)
backtrans_aug = BackTranslationAugmenter()
back_translated = backtrans_aug.augment(txt)
```

The `BackTranslationAugmenter` translates the text to another language and back to English. As with the other augmenters, the `augment` method returns a list of strings. Let's see the output:

```python
print("Back-Translated:", back_translated)
```

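The back-translated sentence in the returned list might look like this (the wording depends on the translation models used):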

```
"The happy child was playing in the sunny park."
```
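
The round trip itself is just two translation steps composed together. The sketch below shows that structure with toy lookup tables standing in for real translation models. Everything here is hypothetical and for illustration only: `make_toy_translator`, `back_translate`, and the two tables are not part of `textattack`, and a real pipeline would call machine-translation models instead.

```python
from typing import Callable, Dict

def make_toy_translator(table: Dict[str, str]) -> Callable[[str], str]:
    """Build a lookup-table 'translator'; unknown sentences pass through unchanged."""
    def translate(text: str) -> str:
        return table.get(text.strip().lower(), text)
    return translate

def back_translate(text: str,
                   to_pivot: Callable[[str], str],
                   from_pivot: Callable[[str], str]) -> str:
    """Translate into a pivot language, then back to the source language."""
    return from_pivot(to_pivot(text))

# Hypothetical one-entry tables standing in for translation models.
TOY_EN_TO_FR = {"we are late": "nous sommes en retard"}
TOY_FR_TO_EN = {"nous sommes en retard": "we are running late"}

to_french = make_toy_translator(TOY_EN_TO_FR)
to_english = make_toy_translator(TOY_FR_TO_EN)

paraphrase = back_translate("We are late", to_french, to_english)
print(paraphrase)  # we are running late
```

The paraphrase arises because the return translation is not an exact inverse of the outbound one. Real models introduce the same kind of drift, which is the source of the variation back-translation provides.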

-----

### Summary and Preparation for Practice

In this lesson, you learned about three data augmentation techniques: **synonym replacement**, **Easy Data Augmentation (EDA)**, and **back-translation**. Each one creates new training samples from existing text, which can improve the performance and generalization of large-scale language models. In the practice exercises, you'll apply these techniques to small sets of sentences and inspect what they produce. Experiment with different methods to gain a deeper understanding of data augmentation. Keep up the great work, and enjoy the hands-on practice!

## Synonym Replacement with WordNet

You've learned about synonym replacement using WordNet, and now it's time to put that knowledge into action. Your task is to implement a function that uses WordNet synonym replacement to augment a dataset of sentences.

Here's what you need to do:

- Create a function that takes a list of sentences and an augmentation factor as input.
- Use the `WordNetAugmenter` from the `textattack` library to replace words with their synonyms.
- Return a new list containing both the original sentences and their augmented versions.

This exercise will help you see how synonym replacement can diversify your dataset. Dive in and see the impact of your work!

```python
from textattack.augmentation import WordNetAugmenter

def augment_sentences(sentences, augmentation_factor):
    # TODO: Create an instance of WordNetAugmenter
    augmented_sentences = []

    for sentence in sentences:
        # TODO: Add the original sentence to the augmented_sentences list
        for _ in range(augmentation_factor):
            # TODO: Use the augment method to create augmented versions
            # TODO: Extend the augmented_sentences list with the augmented version
            pass  # placeholder so the skeleton parses; remove once implemented

    # TODO: Return the list containing both original and augmented sentences

# Example usage
sentences = ["The cheerful child played in the sunny park.", "The quick brown fox jumps over the lazy dog."]
augmented = augment_sentences(sentences, 2)
print(augmented)

```

```python
from textattack.augmentation import WordNetAugmenter

def augment_sentences(sentences, augmentation_factor):
    # Create an instance of WordNetAugmenter
    augmenter = WordNetAugmenter()
    
    augmented_sentences = []
    
    for sentence in sentences:
        # Add the original sentence to the augmented_sentences list
        augmented_sentences.append(sentence)
        
        # Create and add augmented versions
        for _ in range(augmentation_factor):
            # Use the augment method to create augmented versions
            augmented_text = augmenter.augment(sentence)
            # Extend the augmented_sentences list with the augmented version
            augmented_sentences.extend(augmented_text)
            
    # Return the list containing both original and augmented sentences
    return augmented_sentences

# Example usage
sentences = ["The cheerful child played in the sunny park.", "The quick brown fox jumps over the lazy dog."]
augmented = augment_sentences(sentences, 2)
print(augmented)
```
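
Because each call to `augment` picks its replacements independently, repeated calls may return a sentence that is already in the list. A small post-processing step can remove such repeats. The helper below is a hypothetical sketch, not part of the exercise solution; `dedupe_preserving_order` treats sentences as duplicates when they match after lowercasing and collapsing whitespace.

```python
def dedupe_preserving_order(sentences):
    """Remove duplicate sentences, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for sentence in sentences:
        # Normalize case and whitespace so trivial variants count as duplicates.
        key = " ".join(sentence.split()).lower()
        if key not in seen:
            seen.add(key)
            unique.append(sentence)
    return unique

sample = ["The cat sat.", "the  cat sat.", "A dog ran.", "The cat sat."]
unique = dedupe_preserving_order(sample)
print(unique)  # ['The cat sat.', 'A dog ran.']
```

Applying it to the return value of `augment_sentences` keeps the originals first in the list while dropping augmented copies that add nothing new.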

## Easy Data Augmentation Techniques

Nice job on learning about synonym replacement! Now, let's explore Easy Data Augmentation (EDA) techniques. Your task is to create a function that applies EDA to a list of sentences.

Here's what you need to do:

- Use the `EasyDataAugmenter` from the `textattack` library.
- Configure it with the `transformations_per_example` parameter.
- Return a combined dataset of original and augmented sentences.

This exercise will help you see how EDA can diversify your dataset. Dive in and observe the impact of your work!


```python
from textattack.augmentation import EasyDataAugmenter

def augment_sentences(sentences, augmentation_factor):
    # TODO: Initialize the EasyDataAugmenter with the transformations_per_example parameter

    augmented_sentences = []
    for sentence in sentences:
        # TODO: Apply EDA to the sentence using the augment method
        # TODO: Combine the original sentence with its augmented versions
        pass  # placeholder so the skeleton parses; remove once implemented

    return augmented_sentences

# Example usage
sentences = ["The cheerful child played in the sunny park.", "The quick brown fox jumps over the lazy dog."]

# TODO: Call the augment_sentences function with a list of sentences and an augmentation factor

print(augmented_data)

```

```python
from textattack.augmentation import EasyDataAugmenter

def augment_sentences(sentences, augmentation_factor):
    """
    Applies Easy Data Augmentation (EDA) to a list of sentences.

    Args:
        sentences (list): A list of sentences to be augmented.
        augmentation_factor (int): The number of augmented sentences to generate for each original sentence.

    Returns:
        list: A combined list of original and augmented sentences.
    """
    # Initialize the EasyDataAugmenter with the transformations_per_example parameter
    augmenter = EasyDataAugmenter(transformations_per_example=augmentation_factor)

    augmented_sentences = []
    # Iterate over each sentence in the input list
    for sentence in sentences:
        # Apply EDA to each sentence using the augment method
        augmented_versions = augmenter.augment(sentence)
        
        # Combine the original sentence with its augmented versions
        augmented_sentences.append(sentence)
        augmented_sentences.extend(augmented_versions)
    
    return augmented_sentences

# Example usage
sentences = ["The cheerful child played in the sunny park.", "The quick brown fox jumps over the lazy dog."]

# Call the augment_sentences function with a list of sentences and an augmentation factor
augmented_data = augment_sentences(sentences, 2)

print(augmented_data)
```

## Back-Translation Augmentation Task

Now let's practice back-translation, following the same approach as the lesson. Your task is to implement a function using the `BackTranslationAugmenter` from the `textattack` library.

Here's what you need to do:

- Create a function that takes a single sentence.
- Use `BackTranslationAugmenter` to translate the sentence to another language and back to English.
- Return the back-translated sentence.

This exercise will help you understand how back-translation can diversify your dataset. Due to resource limits, you may want to test your implementation with a short sentence, such as "We are late."

```python
from textattack.augmentation import BackTranslationAugmenter

def back_translation_augmentation(sentence):
    # Initialize the BackTranslationAugmenter
    backtrans_aug = None  # TODO: Initialize with BackTranslationAugmenter()
    
    # Perform back-translation augmentation
    back_translated_sentence = ______  # TODO: Use backtrans_aug to augment the sentence
    
    # Return the back-translated sentence
    return _________

# Example usage
sentence = "We are late"
augmented_sentence = back_translation_augmentation(sentence)
print("Back-Translated:", augmented_sentence)

```

```python
from textattack.augmentation import BackTranslationAugmenter

def back_translation_augmentation(sentence):
    # Initialize the BackTranslationAugmenter
    backtrans_aug = BackTranslationAugmenter()
    
    # Perform back-translation augmentation
    back_translated_sentences = backtrans_aug.augment(sentence)
    
    # Return the first (and only) back-translated sentence
    # Note: The augment method returns a list, so we take the first element.
    if back_translated_sentences:
        return back_translated_sentences[0]
    else:
        return None

# Example usage
sentence = "We are late"
augmented_sentence = back_translation_augmentation(sentence)
print("Back-Translated:", augmented_sentence)

```