## Overview of the code below

The code below defines a Python class, `TextAugmentation`, that implements a range of text augmentation techniques using `nltk`, `torch`, `transformers`, and `deep_translator`. Its components and functionality are summarized here:

### Components and Functionalities

1. **Class Initialization**:
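   - The `TextAugmentation` class is initialized with a configuration dictionary (`config`), which is stored on the instance as `self.config`. The methods shown here do not read it, so every augmentation is applied regardless of its contents.
   - It loads the `t5-small` tokenizer and model and the `bert-base-uncased` tokenizer and model from the `transformers` library.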

2. **Text Cleaning**:
   - `clean_text`: Cleans the input text by removing specific tokens and extra whitespace.

3. **Text Augmentation Methods**:
   - `synonym_replacement`: Replaces words in the text with their synonyms using WordNet.
   - `get_synonyms`: Retrieves synonyms for a given word using WordNet.
   - `add_noise`: Replaces alphabetic characters with random letters with a small probability (0.1 at the default noise level).
   - `random_insertion`: Inserts random lowercase letters as separate tokens at random word positions.
   - `random_deletion`: Drops each word with a given probability (0.2 by default).
   - `random_swap`: Swaps the positions of randomly chosen pairs of words.
   - `paraphrasing`: Replaces words with WordNet synonyms, using the same logic as `synonym_replacement`.
   - `style_transfer_nmt`: Despite its name, only changes case in the code shown: uppercase when `target_style` is `'en'`, lowercase otherwise.
   - `stochastic_text_generation`: Keeps each word with probability 0.5 and otherwise substitutes a word drawn at random from the same text.
   - `masking`: Replaces each word with `[MASK]` with probability 0.3.
   - `correct_grammar`: Prompts a T5 model with a `correct grammar:` prefix and returns the decoded output.
   - `generation`: Passes the input sentence to a T5 model and returns the generated text.
   - `combination`: Returns a random-length prefix of the input's words as a new sentence.
   - `conditional_text_generation`: Prompts a T5 model with a `generate response for:` prefix and samples an output.
   - `bert_augmentation`: Masks one randomly chosen token and replaces it with a BERT prediction.
   - `hierarchical_text_generation`: Chains `synonym_replacement`, `paraphrasing`, and `stochastic_text_generation`.
   - `back_translation`: Translates text to an intermediate language and back to the original language.

4. **Applying Augmentations**:
   - `start_augmentation_process`: Runs every augmentation method for each column listed in `text_columns`, adds one new column per technique to the DataFrame, and returns it.
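
The word-level perturbations in the list above (deletion, swapping, masking) can be illustrated without any external libraries. The following is a minimal sketch, not the class's own code: it splits on whitespace instead of using `nltk.word_tokenize`, and the explicit `rng` argument (a seeded `random.Random`) is an assumption added here so that runs are reproducible.

```python
import random

def random_deletion(words, probability, rng):
    """Keep each word only if a random draw exceeds the deletion probability."""
    return [w for w in words if rng.random() > probability]

def random_swap(words, num_swaps, rng):
    """Swap two randomly chosen positions, num_swaps times."""
    out = list(words)
    for _ in range(num_swaps):
        if len(out) > 1:
            i, j = rng.randrange(len(out)), rng.randrange(len(out))
            out[i], out[j] = out[j], out[i]
    return out

def mask_words(words, probability, rng):
    """Replace each word with the literal token [MASK] with the given probability."""
    return ["[MASK]" if rng.random() < probability else w for w in words]

rng = random.Random(0)  # seeded so repeated runs give the same result
words = "the quick brown fox jumps over the lazy dog".split()
print(" ".join(random_deletion(words, 0.2, rng)))
print(" ".join(random_swap(words, 2, rng)))
print(" ".join(mask_words(words, 0.3, rng)))
```

Swapping only reorders words, masking keeps the word count unchanged, and deletion can only shorten the text, which is a quick way to sanity-check each operation.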

### Detailed Method Descriptions

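- **Initialization**:
  - Loads `T5Tokenizer` and `T5ForConditionalGeneration` for `t5-small`, and a BERT tokenizer and model for `bert-base-uncased`.

- **Text Cleaning**:
  - **clean_text**: Removes the special tokens `[CLS]`, `[SEP]`, and `[PAD]`, collapses runs of whitespace into single spaces, and strips the ends.

- **Augmentation Methods**:
  - **synonym_replacement**: Tokenizes with `nltk.word_tokenize` and replaces every word that has WordNet synonyms with one chosen at random; returns an empty string for missing values.
  - **get_synonyms**: Collects the lemma names of all WordNet synsets for a word into a set; returns an empty set on error.
  - **add_noise**: Takes `noise_level` (`'low'`, `'medium'`, `'high'`; default `'low'`) and replaces alphabetic characters with random ASCII letters.
  - **random_insertion**: Takes `num_inserts` (default 3) and inserts that many random lowercase letters as separate tokens.
  - **random_deletion**: Takes `probability` (default 0.2), the chance of dropping each word.
  - **random_swap**: Takes `num_swaps` (default 2), the number of word-pair swaps.
  - **paraphrasing**: Takes only the text; behaves like `synonym_replacement`.
  - **style_transfer_nmt**: Takes `target_style` (default `'en'`); no translation model is involved, only a case change.
  - **stochastic_text_generation**: Produces an output with the same number of words as the input, each either kept or replaced by a random word from the input.
  - **masking**: Replaces words with the literal string `[MASK]`; no model is used to fill the masks.
  - **correct_grammar**: Uses beam search with 5 beams and a maximum length of 512 tokens; returns the input unchanged on error.
  - **generation**: Uses the same T5 model and beam search settings without a task prefix.
  - **combination**: Picks a random count between 1 and the number of words and keeps that many words from the start.
  - **conditional_text_generation**: Samples with temperature 0.7 and a maximum output length of 150 tokens.
  - **bert_augmentation**: Works on BERT word-piece tokens and rebuilds the string with `convert_tokens_to_string`.
  - **hierarchical_text_generation**: Applies the three chained augmentations in a fixed order, each to the output of the previous one.
  - **back_translation**: Uses `GoogleTranslator` with `source_lang` (default `'en'`) and `intermediate_lang` (default `'es'`); returns the input unchanged on error.

- **Applying Augmentations**:
  - **start_augmentation_process**: Iterates over the columns in `text_columns` and, for each one, creates a new column named `<column>_<technique>` per augmentation. Errors are printed, and the DataFrame is returned with whatever columns were added before the error.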
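
A configuration dictionary that selects which techniques run could be wired in with a name-to-function registry. The sketch below is one possible design, not what the class does: the `apply_selected` helper, the stand-in augmentation functions, and the config layout (technique name mapped to a boolean) are all assumptions made for illustration. It uses plain lists rather than a DataFrame so that it runs without `pandas`.

```python
def apply_selected(texts, column, config, registry):
    """Return {new_column_name: augmented_texts} for every enabled technique."""
    augmented = {}
    for name, func in registry.items():
        if config.get(name, False):  # techniques missing from config are skipped
            augmented[f"{column}_{name}"] = [func(t) for t in texts]
    return augmented

# Stand-in augmentations so the sketch runs without nltk or transformers.
registry = {
    "uppercase": str.upper,
    "reverse_words": lambda t: " ".join(reversed(t.split())),
}
config = {"uppercase": True, "reverse_words": False}  # hypothetical config layout

result = apply_selected(["hello world", "text data"], "Sentence", config, registry)
print(result)  # {'Sentence_uppercase': ['HELLO WORLD', 'TEXT DATA']}
```

With a DataFrame, the same loop would assign `df[col].apply(func)` to each new column instead of building lists.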



In [None]:
import pandas as pd
import nltk
import re
import random
import string
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration, BertTokenizer, BertModel
from deep_translator import GoogleTranslator
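from itertools import chain

# word_tokenize and WordNet need their NLTK data packages to be available locally.
# Newer NLTK releases look for 'punkt_tab', older ones for 'punkt'.
for resource in ('punkt', 'punkt_tab', 'wordnet'):
    nltk.download(resource, quiet=True)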

# References: https://dzlab.github.io/dltips/en/pytorch/text-augmentation/
# https://www.kaggle.com/code/nandhuelan/nlp-augmentation-bert
# https://github.com/topics/text-augmentation?o=asc&s=forks
class TextAugmentation:
    """
    Class for performing text augmentation using various techniques.
    """

    def __init__(self, config):
        """
        Initialize the TextAugmentation object.

        Args:
            config (dict): Configuration for text augmentation.
        """
        self.config = config

        # Initialize T5 tokenizer and model
        self.t5_tokenizer = T5Tokenizer.from_pretrained('t5-small', legacy=False)
        self.t5_model = T5ForConditionalGeneration.from_pretrained('t5-small')

        # Initialize BERT tokenizer and model
        self.bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
        self.bert_model = BertModel.from_pretrained('bert-base-uncased')

    def clean_text(self, text):
        """
        Clean the input text.

        Args:
            text (str): Input text to be cleaned.

        Returns:
            str: Cleaned text.
        """
        text = re.sub(r'\[CLS\]|\[SEP\]|\[PAD\]', '', text)
        text = re.sub(r'\s+', ' ', text).strip()
        return text

    def synonym_replacement(self, text):
        """
        Perform synonym replacement on the input text.

        Args:
            text (str): Input text for synonym replacement.

        Returns:
            str: Text after synonym replacement.
        """
        if pd.isna(text):
            return ""
        words = nltk.word_tokenize(text)
        augmented_text = []
        for word in words:
            synonyms = self.get_synonyms(word)
            if synonyms:
                synonym = random.choice(list(synonyms))
                augmented_text.append(synonym)
            else:
                augmented_text.append(word)
        return " ".join(augmented_text)

    def get_synonyms(self, word):
        """
        Get synonyms for a word using WordNet.

        Args:
            word (str): Input word to find synonyms for.

        Returns:
            set: Set of synonyms for the input word.
        """
        try:
            synonyms = nltk.corpus.wordnet.synsets(word)
            return set(chain(*[syn.lemma_names() for syn in synonyms]))
        except Exception as e:
            print(f"Error getting synonyms: {e}")
            return set()

    def add_noise(self, text, noise_level='low'):
        """
        Add noise to the input text.

        Args:
            text (str): Input text to add noise to.
            noise_level (str): Level of noise to add ('low', 'medium', 'high').

        Returns:
            str: Text after adding noise.
        """
        if pd.isna(text):
            return ""
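        # Map the noise level to a per-character replacement probability.
        # Only the 0.1 value comes from the original code; the 'medium' and
        # 'high' values are illustrative assumptions.
        noise_probability = {'low': 0.1, 'medium': 0.2, 'high': 0.3}.get(noise_level, 0.1)
        noisy_text = []
        for char in text:
            if char.isalpha():
                if random.random() < noise_probability: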
                    noisy_text.append(random.choice(string.ascii_letters))
                else:
                    noisy_text.append(char)
            else:
                noisy_text.append(char)
        return ''.join(noisy_text)

    def random_insertion(self, text, num_inserts=3):
        """
        Perform random insertion on the input text.

        Args:
            text (str): Input text for random insertion.
            num_inserts (int): Number of random insertions to perform.

        Returns:
            str: Text after random insertion.
        """
        if pd.isna(text):
            return ""
        words = nltk.word_tokenize(text)
        augmented_text = words[:]
        for _ in range(num_inserts):
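            # Any slot in the growing list is valid, including the end; this also avoids
            # the ValueError that randint(0, -1) raises when the input has no tokens
            random_index = random.randint(0, len(augmented_text))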
            random_letter = random.choice(string.ascii_lowercase)
            augmented_text.insert(random_index, random_letter)
        return " ".join(augmented_text)

    def random_deletion(self, text, probability=0.2):
        """
        Perform random deletion on the input text.

        Args:
            text (str): Input text for random deletion.
            probability (float): Probability of deleting each word.

        Returns:
            str: Text after random deletion.
        """
        if pd.isna(text):
            return ""
        words = nltk.word_tokenize(text)
        augmented_text = []
        for word in words:
            if random.random() > probability:
                augmented_text.append(word)
        return " ".join(augmented_text)

    def random_swap(self, text, num_swaps=2):
        """
        Perform random swap on the input text.

        Args:
            text (str): Input text for random swap.
            num_swaps (int): Number of random swaps to perform.

        Returns:
            str: Text after random swap.
        """
        if pd.isna(text):
            return ""
        words = nltk.word_tokenize(text)
        augmented_text = words[:]
        for _ in range(num_swaps):
            if len(augmented_text) > 1:
                i = random.randint(0, len(augmented_text) - 1)
                j = random.randint(0, len(augmented_text) - 1)
                augmented_text[i], augmented_text[j] = augmented_text[j], augmented_text[i]
        return " ".join(augmented_text)

    def paraphrasing(self, text):
        """
        Perform paraphrasing on the input text.

        This uses the same word-level WordNet substitution as
        synonym_replacement, so it delegates to that method.

        Args:
            text (str): Input text for paraphrasing.

        Returns:
            str: Text after paraphrasing.
        """
        return self.synonym_replacement(text)

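    def style_transfer_nmt(self, text, target_style='en'):
        """
        Placeholder for style transfer via neural machine translation (NMT).

        No translation model is used here: the text is upper-cased when
        target_style is 'en' and lower-cased otherwise.

        Args:
            text (str): Input text for style transfer.
            target_style (str, optional): Target style for style transfer. Defaults to 'en'.

        Returns:
            str: Text after the case change, or the input unchanged on error.
        """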
        try:
            transformed_text = text.upper() if target_style == 'en' else text.lower()
            return transformed_text
        except Exception as e:
            print(f"Error during style transfer: {e}")
            return text

    def stochastic_text_generation(self, text):
        """
        Perform stochastic text generation on the input text.

        Args:
            text (str): Input text for stochastic text generation.

        Returns:
            str: Generated text.
        """
        if pd.isna(text):
            return ""
        words = text.split()
        generated_words = []

        for word in words:
            if random.random() < 0.5:
                generated_words.append(word)
            else:
                random_word = random.choice(words)
                generated_words.append(random_word)

        generated_text = " ".join(generated_words)
        return generated_text

    def masking(self, text):
        """
        Perform masking on the input text.

        Args:
            text (str): Input text for masking.

        Returns:
            str: Text after masking.
        """
        if pd.isna(text):
            return ""
        words = text.split()
        masked_words = []

        for word in words:
            if random.random() < 0.3:
                masked_words.append("[MASK]")
            else:
                masked_words.append(word)

        masked_text = " ".join(masked_words)
        return masked_text

    def correct_grammar(self, text):
        """
        Correct grammar of the input text.

        Args:
            text (str): Input text for grammar correction.

        Returns:
            str: Text after grammar correction.
        """
        try:
            input_text = "correct grammar: " + text
            input_ids = self.t5_tokenizer.encode(input_text, return_tensors="pt", max_length=512, truncation=True)
            outputs = self.t5_model.generate(input_ids, max_length=512, num_beams=5, early_stopping=True)
            corrected_text = self.t5_tokenizer.decode(outputs[0], skip_special_tokens=True)
            return corrected_text
        except Exception as e:
            print(f"Error during grammar correction: {e}")
            return text

    def generation(self, sentence):
        """
        Generate text based on the input sentence.

        Args:
            sentence (str): Input sentence for text generation.

        Returns:
            str: Generated text.
        """
        try:
            input_ids = self.t5_tokenizer.encode(sentence, return_tensors="pt", max_length=512, truncation=True)
            # temperature only takes effect when sampling (do_sample=True), so it is
            # omitted for plain beam search
            outputs = self.t5_model.generate(input_ids, max_length=512, num_beams=5, early_stopping=True)
            generated_text = self.t5_tokenizer.decode(outputs[0], skip_special_tokens=True)
            return generated_text
        except Exception as e:
            print(f"Error during text generation: {e}")
            return sentence

    def combination(self, text):
        """
        Build a shorter text from a random-length prefix of the input words.

        Args:
            text (str): Input text for combination.

        Returns:
            str: The first k words of the input, for a randomly chosen k.
        """
        if pd.isna(text):
            return ""
        try:
            words = text.split()

            # Choose how many leading words to keep (raises ValueError for empty input,
            # which the except branch handles by returning the text unchanged)
            num_words_to_select = random.randint(1, len(words))

            # Keep that many words from the start of the text
            selected_words = words[:num_words_to_select]

            # Join the kept words back into a single string
            combined_sentence = ' '.join(selected_words)

            return combined_sentence
        except Exception as e:
            print(f"Error during sentence combination: {e}")
            return text

    def conditional_text_generation(self, text):
        """
        Perform conditional text generation on the input text.

        Args:
            text (str): Input text for conditional text generation.

        Returns:
            str: Generated text.
        """
        if pd.isna(text):
            return ""
        try:
            input_ids = self.t5_tokenizer.encode("generate response for: " + text, return_tensors="pt", max_length=512,
                                                 truncation=True)
            outputs = self.t5_model.generate(input_ids, max_length=150, num_beams=5, temperature=0.7, do_sample=True,
                                             early_stopping=True)
            generated_text = self.t5_tokenizer.decode(outputs[0], skip_special_tokens=True)
            return generated_text
        except Exception as e:
            print(f"Error during conditional text generation: {e}")
            return text

    def bert_augmentation(self, text):
        """
        Perform BERT-based text augmentation based on input text.

        Args:
            text (str): Input text.

        Returns:
            str: Augmented text.
        """
        if pd.isna(text):
            return ""
        try:
            # BertModel has no language-modeling head, so its hidden states cannot be
            # turned into vocabulary ids. Load a masked-LM model lazily for prediction.
            from transformers import BertForMaskedLM
            if not hasattr(self, 'bert_mlm_model'):
                self.bert_mlm_model = BertForMaskedLM.from_pretrained('bert-base-uncased')
                self.bert_mlm_model.eval()

            # Tokenize the input text
            tokens = self.bert_tokenizer.tokenize(text)
            if not tokens:
                return text

            # Randomly choose a token to mask
            masked_index = random.randint(0, len(tokens) - 1)
            tokens[masked_index] = '[MASK]'

            # Add the special tokens BERT expects and convert to a tensor
            model_tokens = ['[CLS]'] + tokens + ['[SEP]']
            indexed_tokens = self.bert_tokenizer.convert_tokens_to_ids(model_tokens)
            tokens_tensor = torch.tensor([indexed_tokens])

            # Score every vocabulary entry at each position
            with torch.no_grad():
                logits = self.bert_mlm_model(tokens_tensor).logits

            # Pick the most likely token at the masked position (+1 accounts for [CLS])
            predicted_token_id = torch.argmax(logits[0, masked_index + 1]).item()
            predicted_token = self.bert_tokenizer.convert_ids_to_tokens([predicted_token_id])[0]

            # Replace the masked token with the predicted token
            tokens[masked_index] = predicted_token

            # Reconstruct the augmented text
            augmented_text = self.bert_tokenizer.convert_tokens_to_string(tokens)
            return augmented_text
        except Exception as e:
            print(f"Error during BERT-based text augmentation: {e}")
            return text

    def hierarchical_text_generation(self, text):
        """
        Perform hierarchical text generation on the input text.

        Args:
            text (str): Input text for hierarchical text generation.

        Returns:
            str: Generated text.
        """
        text = self.synonym_replacement(text)
        text = self.paraphrasing(text)
        text = self.stochastic_text_generation(text)
        return text

    def back_translation(self, text, source_lang='en', intermediate_lang='es'):
        """
        Perform back translation on a given text.

        Args:
            text (str): The original text to be back-translated.
            source_lang (str): Source language of the original text. Defaults to 'en' (English).
            intermediate_lang (str): Language to translate to and back from. Defaults to 'es' (Spanish).

        Returns:
            str: The back-translated text, or the input unchanged on error.
        """
        try:
            # Translate the text to the intermediate language
            translated_text = GoogleTranslator(source=source_lang, target=intermediate_lang).translate(text)

            # Translate the text back to the original language
            back_translated_text = GoogleTranslator(source=intermediate_lang, target=source_lang).translate(
                translated_text)
            return back_translated_text
        except Exception as e:
            print(f"Error during back translation: {e}")
            return text

    def start_augmentation_process(self, df1, text_columns):
        """
        Start the text augmentation process on the given DataFrame.

        Args:
            df1 (pd.DataFrame): The input DataFrame containing text data.
            text_columns (list): List of text column names to apply augmentation to.

        Returns:
            pd.DataFrame: The input DataFrame with augmented text.
        """
        try:
            for col in text_columns:
                # Read from the column being processed rather than a hard-coded 'Sentence' column
                df1[f'{col}_synonym_replacement'] = df1[col].apply(self.synonym_replacement)
                df1[f'{col}_add_noise'] = df1[col].apply(self.add_noise)
                df1[f'{col}_random_insertion'] = df1[col].apply(self.random_insertion)
                df1[f'{col}_random_deletion'] = df1[col].apply(self.random_deletion)
                df1[f'{col}_random_swap'] = df1[col].apply(self.random_swap)
                df1[f'{col}_paraphrasing'] = df1[col].apply(self.paraphrasing)
                df1[f'{col}_style_transfer_nmt'] = df1[col].apply(
                    lambda x: self.style_transfer_nmt(x, target_style='en'))
                df1[f'{col}_stochastic_text_generation'] = df1[col].apply(self.stochastic_text_generation)
                df1[f'{col}_hierarchical_text_generation'] = df1[col].apply(self.hierarchical_text_generation)
                df1[f'{col}_masking'] = df1[col].apply(self.masking)
                df1[f'{col}_correct_grammar'] = df1[col].apply(self.correct_grammar)
                df1[f'{col}_generation'] = df1[col].apply(self.generation)
                df1[f'{col}_conditional_text_generation'] = df1[col].apply(self.conditional_text_generation)
                df1[f'{col}_bert_augmentation'] = df1[col].apply(self.bert_augmentation)
                df1[f'{col}_combination'] = df1[col].apply(self.combination)
                df1[f'{col}_back_translation'] = df1[col].apply(self.back_translation)
        except Exception as e:
            print(f"Error during augmentation process: {e}")
        # Return outside of a finally block so that interrupts are not silently swallowed
        return df1
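
`clean_text` is defined on the class but is not called by the other methods shown, so its effect is easiest to see in isolation. The snippet below is a standalone copy of the same two regular-expression steps, written as a plain function so that it runs without loading any models; the sample string is made up for illustration.

```python
import re

def clean_text(text):
    """Remove [CLS], [SEP], [PAD] markers and collapse repeated whitespace."""
    text = re.sub(r'\[CLS\]|\[SEP\]|\[PAD\]', '', text)
    return re.sub(r'\s+', ' ', text).strip()

print(clean_text("[CLS] the cat   sat [SEP] [PAD] [PAD]"))  # the cat sat
```

One place this could be useful is on text that still contains BERT special tokens, for example a sequence decoded without skipping them.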
