Overview of the code below:

This code consists of two main classes designed for handling text data, particularly in the context of Natural Language Processing (NLP). The two classes are `CustomDataset` and `TextPreprocessing`. Here's an overview of their functionalities and workflow:

### 1. `CustomDataset` Class

#### Purpose
The `CustomDataset` class is designed to load and tokenize text data from a CSV file, preparing it for further use in machine learning models, especially those involving transformers like BERT.

#### Attributes
- `data`: A list of dictionaries, one per row, each holding the tokenized text as `input_ids` and `attention_mask` tensors. No labels are stored.
- `tokenizer`: A `BertTokenizer` instance used for tokenizing the text data.
- `max_length`: Maximum length of the tokenized sequences.
- `df`: A DataFrame containing the rows read from the CSV file (the code reads only the first 50 rows).

#### Methods
- `__init__(self, file_path, tokenizer, max_length, text_columns)`: Initializes the dataset by reading the CSV file, validating columns, and tokenizing the text data.
- `__len__(self)`: Returns the total number of samples in the dataset.
- `__getitem__(self, idx)`: Retrieves the sample at the specified index, returning a dictionary with the `input_ids` and `attention_mask` tensors.
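
The dataset contract described above (fixed-length `input_ids`, a matching `attention_mask`, and index-based access) can be illustrated without downloading a BERT model. The following is a minimal sketch, not the notebook's implementation: `ToyTokenizer`, `ToyDataset`, the vocabulary, and the sample rows are all hypothetical stand-ins, and plain lists replace tensors.

```python
from typing import Dict, List


class ToyTokenizer:
    """Hypothetical whitespace tokenizer; a stand-in for BertTokenizer."""

    PAD_ID = 0
    UNK_ID = 1

    def __init__(self, vocab: List[str]):
        # Ids 0 and 1 are reserved for padding and unknown words.
        self.vocab = {word: i + 2 for i, word in enumerate(vocab)}

    def encode(self, text: str, max_length: int) -> Dict[str, List[int]]:
        ids = [self.vocab.get(w, self.UNK_ID) for w in text.lower().split()]
        ids = ids[:max_length]                       # truncate long inputs
        pad = max_length - len(ids)
        return {
            "input_ids": ids + [self.PAD_ID] * pad,  # pad to a fixed length
            "attention_mask": [1] * len(ids) + [0] * pad,  # 1 = token, 0 = padding
        }


class ToyDataset:
    """Mirrors the __len__/__getitem__ contract of CustomDataset."""

    def __init__(self, rows, tokenizer, max_length, text_columns):
        self.data = []
        for row in rows:
            # Join the text columns, treating missing values as empty strings.
            combined = " ".join(str(row.get(col) or "") for col in text_columns)
            self.data.append(tokenizer.encode(combined, max_length))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]


rows = [
    {"title": "Good phone", "body": "Battery lasts long"},
    {"title": "Bad", "body": None},
]
tok = ToyTokenizer(["good", "phone", "battery", "lasts", "long", "bad"])
ds = ToyDataset(rows, tok, max_length=6, text_columns=["title", "body"])
print(len(ds), ds[1])
```

Every sample has the same length, which is what allows a `DataLoader` to stack samples into a batch; the mask tells the model which positions are padding.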

### 2. `TextPreprocessing` Class

#### Purpose
The `TextPreprocessing` class handles the preprocessing pipeline for text data, including tokenization, filtering, and various text preprocessing tasks such as lowercasing, contraction fixing, and URL removal.

#### Attributes
- `config`: Configuration dictionary containing file paths and settings for preprocessing.

#### Methods
- `__init__(self, config)`: Initializes the preprocessing object and validates the configuration.
- `start_preprocessing(self)`: Executes the entire preprocessing pipeline, including tokenization, filtering, and text cleaning. Returns a DataFrame with the preprocessed text data.
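
For reference, the configuration keys that the code reads can be sketched as a nested dictionary. Only the key names come from the code; the values (`data/raw`, `reviews.csv`, the column names) and the `validate_config` helper are hypothetical placeholders for illustration.

```python
# All values are hypothetical placeholders; only the key names
# are taken from the code in this notebook.
config = {
    "data": {"base_input_path": "data/raw"},
    "data_augmentation": {
        "text_augmentation": {
            "file_path": "reviews.csv",
            "text_columns": ["title", "body"],
        }
    },
}


def validate_config(config: dict) -> None:
    """Raise KeyError if a key read by the pipeline is missing."""
    section = config.get("data_augmentation", {}).get("text_augmentation", {})
    missing = [k for k in ("file_path", "text_columns") if k not in section]
    if "base_input_path" not in config.get("data", {}):
        missing.append("base_input_path")
    if missing:
        raise KeyError(f"Configuration is missing: {', '.join(missing)}")


validate_config(config)  # passes silently for a complete configuration
```

Checking all required keys up front, including `base_input_path`, surfaces configuration mistakes at construction time rather than midway through `start_preprocessing`.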

#### Workflow
1. **Tokenize and Create Dataset**: Initializes a `BertTokenizer` and creates a `CustomDataset` instance using the provided file path and text columns.
2. **Filter Dataset**: Keeps only the first 10 samples of the dataset (the slice size is hard-coded).
3. **Text Preprocessing**: Applies various text preprocessing steps:
    - Converts text to lowercase.
    - Converts emojis to their text names with the `emoji` library, then expands contractions using the `contractions` library.
    - Removes URLs by replacing them with the word "website".
    - Converts text back to tokenized format and updates the dataset.
4. **Convert to DataFrame**: Converts the preprocessed text data back to a DataFrame.
5. **Additional Preprocessing**: Further cleans the text data in the DataFrame by:
    - Removing punctuation.
    - Removing numbers.
    - Removing newline and carriage return characters.
6. **Handle NaN Values**: Replaces NaN values with empty strings to ensure data consistency.
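
The regex-based cleaning in steps 3 and 5 can be sketched on a plain string using only the standard library. This is an illustrative sketch under stated assumptions, not the notebook's code: `clean_text` is a hypothetical helper, contraction and emoji handling are omitted because they need third-party packages, and newlines are replaced with a space (rather than removed outright) so that adjacent words do not merge.

```python
import re

# Same URL pattern as in the notebook, compiled once for reuse.
URL_RE = re.compile(r"https?://\S+|www\.\S+")


def clean_text(text: str) -> str:
    text = text.lower()                      # lowercase
    text = URL_RE.sub("website", text)       # replace URLs before stripping punctuation
    text = re.sub(r"[^\w\s]", "", text)      # remove punctuation
    text = re.sub(r"\d+", "", text)          # remove numbers
    text = re.sub(r"[\r\n]+", " ", text)     # newlines and carriage returns become a space
    return re.sub(r"\s+", " ", text).strip() # collapse repeated whitespace


print(clean_text("Visit https://example.com NOW!\nOnly 3 left..."))
# visit website now only left
```

The order matters: URLs must be replaced before punctuation is stripped, otherwise `https://example.com` is reduced to `httpsexamplecom` and no longer matches the pattern.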

### Summary
The `CustomDataset` class handles the loading and tokenization of text data, while the `TextPreprocessing` class performs a series of preprocessing steps to clean and prepare the text for further use. This setup is particularly useful for preparing text data for training NLP models using transformer architectures like BERT.

In [2]:
import torch
from torch.utils.data import Dataset
from transformers import BertTokenizer
import nltk
import contractions
import emoji
import pandas as pd
import os

# Ensure NLTK resources are available
nltk.download('wordnet')

class CustomDataset(Dataset):
    """
    Custom Dataset class for loading and tokenizing text data.

    Args:
        file_path (str): Path to the CSV file containing data.
        tokenizer (BertTokenizer): Tokenizer to be used for text tokenization.
        max_length (int): Maximum length of the tokenized sequences.
        text_columns (list): List of column names containing the text data.

    Attributes:
        data (list): List of dictionaries containing 'input_ids' and 'attention_mask' tensors.
        tokenizer (BertTokenizer): Tokenizer used for text tokenization.
        max_length (int): Maximum length of the tokenized sequences.
    """

    def __init__(self, file_path, tokenizer, max_length, text_columns):
        self.data = []
        self.tokenizer = tokenizer
        self.max_length = max_length

        # Read the CSV file using pandas
        print(file_path)
        self.df = pd.read_csv(file_path, nrows=50)

        # Ensure the CSV has the correct columns
        for col in text_columns:
            if col not in self.df.columns:
                raise ValueError(f"CSV file must contain '{col}' column.")

        # Process each row in the DataFrame
        for _, row in self.df.iterrows():
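            # Join the text columns, treating missing values (NaN) as empty strings
            combined_text = " ".join("" if pd.isna(row[col]) else str(row[col]) for col in text_columns)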
            encoding = self.tokenizer.encode_plus(
                combined_text,
                add_special_tokens=True,
                truncation=True,
                padding='max_length',
                max_length=self.max_length,
                return_tensors='pt',
                return_token_type_ids=False,
                return_attention_mask=True,
                return_overflowing_tokens=False,
                return_special_tokens_mask=False,
            )
            self.data.append({
                'input_ids': encoding['input_ids'].flatten(),
                'attention_mask': encoding['attention_mask'].flatten(),
            })

    def __len__(self):
        """
        Returns the total number of samples in the dataset.

        Returns:
            int: Number of samples in the dataset.
        """
        return len(self.data)

    def __getitem__(self, idx):
        """
        Retrieves the sample at the specified index.

        Args:
            idx (int): Index of the sample to retrieve.

        Returns:
            dict: Dictionary containing 'input_ids' and 'attention_mask' tensors.
        """
        return self.data[idx]


class TextPreprocessing:
    """
    Class for text preprocessing.

    Args:
        config (dict): Configuration dictionary containing file paths and other settings.

    Attributes:
        config (dict): Configuration dictionary.
    """

    def __init__(self, config):
        self.config = config  # Store the configuration

        # Validate that the "text_augmentation" section contains "file_path" and "text_columns"
        text_cfg = config.get("data_augmentation", {}).get("text_augmentation", {})
        if "file_path" not in text_cfg or "text_columns" not in text_cfg:
            raise KeyError("Configuration is missing 'file_path' or 'text_columns' in the 'text_augmentation' section.")

    def start_preprocessing(self):
        """
        Start the preprocessing pipeline.

        Steps:
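            1. Tokenize and create dataset.
            2. Filter dataset (keep the first 10 samples).
            3. Apply text preprocessing (lowercase, emojis, contractions, URLs).
            4. Convert preprocessed dataset to DataFrame.
            5. Remove punctuation, numbers, and newline characters.
            6. Replace NaN values with empty strings.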

        Returns:
            pd.DataFrame: Preprocessed dataset as a DataFrame.
        """
        # Step 1: Tokenize and create dataset
        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

        dataset = CustomDataset(
            os.path.join(self.config["data"]["base_input_path"], self.config["data_augmentation"]["text_augmentation"]["file_path"]),
            tokenizer,
            max_length=128,
            text_columns=self.config["data_augmentation"]["text_augmentation"]["text_columns"]
        )
        print("Dataset created")

        # Step 2: Filter dataset
        dataset_filtered = dataset[:10]
        print("Dataset filtered")

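        # Step 3: Text preprocessing (lowercase, emojis, contractions, URLs)
        import re  # Standard library; needed for regex-based URL replacement
        url_pattern = re.compile(r"https?://\S+|www\.\S+")
        for data in dataset_filtered:
            text = data['input_ids'].tolist()
            text = tokenizer.decode(text, skip_special_tokens=True)  # Drop special tokens such as [CLS], [SEP], [PAD]
            text = text.lower()  # Convert to lowercase
            text = contractions.fix(emoji.demojize(text))  # Convert emojis to text names, then expand contractions
            # str.replace treats its pattern as a literal string, so use a regex substitution instead.
            # Decoded WordPiece text may contain spaces around punctuation, so some URLs can still be missed.
            text = url_pattern.sub("website", text)  # Replace URLs with "website"
            encoded_text = tokenizer.encode(text, add_special_tokens=False)
            data['input_ids'] = torch.tensor(encoded_text)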

        print("Text preprocessing applied")

        # Step 4: Convert preprocessed dataset to DataFrame
        df1 = pd.DataFrame({
            'Sentence': [tokenizer.decode(data['input_ids'].tolist()) for data in dataset_filtered],
        })

        # Additional preprocessing steps
        df1["Sentence"] = df1["Sentence"].str.lower()  # Convert text to lowercase
        df1["Sentence"] = df1["Sentence"].str.replace(r"[^\w\s]", "", regex=True)  # Remove punctuation
        df1["Sentence"] = df1["Sentence"].str.replace(r"\d+", "", regex=True)  # Remove numbers
        # Series.replace matches whole values only, so strip both characters with one substring regex
        df1["Sentence"] = df1["Sentence"].str.replace(r"[\r\n]+", "", regex=True)  # Remove newline and carriage return characters

        # Handle NaN values
        df1.fillna("", inplace=True)

        return df1  # Return the preprocessed DataFrame


[nltk_data] Downloading package wordnet to /home/bcae/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
