In [None]:
%pip install transformers soundfile pandas librosa

In [None]:
import pandas as pd
import torch
import soundfile as sf
import librosa
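# Load the processor, the CTC model, and the training utilities
from transformers import AutoProcessor, AutoModelForCTC, TrainingArguments, Trainer
# Note: DataCollatorCTCWithPadding is not exported by transformers. It is a custom
# class and has to be defined in the notebook before it is used.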
from torch.utils.data import Dataset, DataLoader

In [None]:
processor = AutoProcessor.from_pretrained("auditi41/wav2vec2-large-xlsr-53-Bangla")
model = AutoModelForCTC.from_pretrained("auditi41/wav2vec2-large-xlsr-53-Bangla")

When you pass data as an argument to a function in Python, the behavior in terms of memory usage can be nuanced. It's important to understand how Python handles data passing and memory management:

Pass by Object Reference: Python uses a mechanism known as "pass by object reference." When you pass an argument to a function, you're actually passing a reference to the object, not the actual object itself. This means that the function accesses the same object in memory as the caller, rather than a separate copy.

Memory Impact: Because a reference is passed, not the object, passing an argument doesn't inherently consume additional memory. However, what happens inside the function can affect memory usage:

- If the function only reads the passed object, no copy is made and no additional memory is allocated for it.
- If the function modifies a mutable object (such as a list or a dictionary) in place, the change is applied to the caller's object; again, no copy is made.
- If the function creates new objects or grows existing ones (for example, appending items to a list, which may trigger reallocation of the list's storage), this consumes additional memory.

Temporary Objects: Any new objects created within the function (such as the values bound to local variables) consume memory while the function runs.

Garbage Collection: When the function returns, its local names go out of scope. Objects that are no longer referenced anywhere are then freed (in CPython this happens immediately, through reference counting). Anything the function returns stays in memory for as long as the caller keeps a reference to it.
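The points above can be checked with a small standard-library sketch. The function names `read_only`, `mutate_in_place`, and `make_new` are made up for illustration and are not part of the notebook.

```python
def read_only(samples):
    # Reads the caller's list; no copy is made.
    return sum(samples)

def mutate_in_place(samples):
    # Appends to the caller's list; the caller sees the change.
    samples.append(0.0)
    return samples

def make_new(samples):
    # Builds a brand-new list; the caller's list is untouched.
    return [s * 2 for s in samples]

audio = [0.1, 0.2, 0.3]
total = read_only(audio)
same = mutate_in_place(audio)
doubled = make_new(audio)

print(same is audio)     # True: the function received a reference, not a copy
print(len(audio))        # 4: the append happened on the original object
print(doubled is audio)  # False: a new object was allocated
```

Only `make_new` allocates a second list, and that list stays alive because the caller stored it in `doubled`.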

In [None]:
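def preprocess_data(row):
    # Load the audio file with librosa, resampling to 16 kHz (the rate passed to the processor)
    speech, sr = librosa.load(f"/kaggle/input/bengaliai-speech/train_mp3s/{row['id']}.mp3", sr=16000)

    # Convert the waveform into model inputs; the result has shape (1, num_samples)
    input_values = processor(speech, sampling_rate=sr, return_tensors="pt").input_values

    # Tokenize the transcript with the processor's tokenizer
    # (processor.as_target_processor() is deprecated in recent transformers releases)
    labels = processor.tokenizer(row['sentence'], return_tensors="pt").input_ids

    # Drop only the leading batch dimension so both fields are 1-D tensors
    return {"input_values": input_values.squeeze(0), "labels": labels.squeeze(0)}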

In [None]:
# Load and preprocess the dataset
df = pd.read_csv('/kaggle/input/bengaliai-speech/train.csv')

# Split the dataset based on the 'split' column
train_df = df[df['split'] == 'train']
valid_df = df[df['split'] == 'valid']

In this step, you create two new DataFrames (train_df and valid_df) that hold the rows of df selected by each condition. This operation is cheap in terms of memory for two reasons:

View vs Copy: Boolean-mask indexing such as df[df['split'] == 'train'] returns a copy of the selected rows, not a view. The copy is small, though, because df only holds the columns of train.csv (such as id, sentence, and split), not the audio itself.

Lazy Evaluation: Pandas does not evaluate lazily; the filter runs immediately. The work is only a comparison and a row selection, so no audio file is opened and no tensor is created at this point.

Executing the cell below will give the following error: "Your notebook tried to allocate more memory than is available. It has restarted."

In [None]:
#train_dataset = train_df.apply(preprocess_data, axis=1)
#val_dataset = valid_df.apply(preprocess_data, axis=1)

The error "Your notebook tried to allocate more memory than is available. It has restarted." is a common issue when working with large datasets or performing intensive computations in notebooks on platforms like Kaggle. It means the code tried to use more memory than is allocated to the notebook environment.

This operation differs from the split above in terms of memory usage for several reasons:

Processing Overhead: The apply method calls preprocess_data on each row of the DataFrame. Here that means decoding and resampling an audio file and building tensors, which needs far more memory than the row itself.

New Data Creation: The apply method does not just reference existing data; it collects the value returned by preprocess_data for every row. These results are held in memory in addition to the original DataFrame, and each one (a full waveform tensor plus its labels) is much larger than the CSV row it came from.

No Lazy Evaluation: The apply operation runs eagerly. It processes every row and keeps every result before it returns, so memory use grows with each row until the whole dataset has been processed or the notebook runs out of memory.

**Understanding Memory Overhead in Preprocessing**

The memory overhead in your case likely comes from the nature of the preprocess_data function. This function is applied to each row of your DataFrames and can involve:

- Loading audio files, which can be large.
- Performing transformations or computations that increase the data size or complexity.
- Creating new data structures (like tensors) that are more memory-intensive than the original row data.

If these operations are memory-intensive and you're doing them for every row in your dataset, it can quickly lead to high memory usage, especially compared to the relatively lightweight operation of just splitting the DataFrame. This is why batch processing or using generators, as discussed below, becomes crucial in managing memory usage effectively when working with large datasets or complex processing functions.
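A back-of-the-envelope estimate shows how quickly the processed audio adds up. The clip count and average duration below are placeholder assumptions, not figures measured on this dataset; only the 16 kHz rate comes from the preprocessing code, and 4 bytes per sample corresponds to float32, the default dtype returned by `librosa.load`.

```python
SAMPLE_RATE = 16_000   # Hz, as passed to librosa.load
BYTES_PER_SAMPLE = 4   # float32

def processed_audio_bytes(num_clips, avg_seconds):
    """Approximate bytes needed to hold every clip's waveform at once."""
    return num_clips * avg_seconds * SAMPLE_RATE * BYTES_PER_SAMPLE

# Hypothetical numbers, chosen only to illustrate the arithmetic
num_clips = 100_000
avg_seconds = 5

per_clip = processed_audio_bytes(1, avg_seconds)
total = processed_audio_bytes(num_clips, avg_seconds)
print(f"one clip:  {per_clip / 1e6:.2f} MB")   # 0.32 MB
print(f"all clips: {total / 1e9:.1f} GB")      # 32.0 GB
```

Under these assumptions a single clip is tiny, but keeping every processed clip at once needs tens of gigabytes, which is what `apply` ends up doing.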

Here are some strategies to mitigate this issue:

1. **Batch Processing**: Instead of processing the entire dataset at once, break it into smaller chunks. Process each chunk separately and then combine the results. This can be done using a for loop or more sophisticated batch processing techniques.

2. **Optimize Data Processing**: Look for ways to make your data processing more memory-efficient. For example, you could:
   - Use more memory-efficient data types (e.g., `float32` instead of `float64`).
   - Reduce the precision of your audio data, if high precision is not necessary.
   - Clear variables that are no longer needed using `del variable_name` and periodically call `gc.collect()` to free up memory.

3. **Reduce Dataset Size**: If feasible, consider using a smaller subset of your dataset for training and validation.

4. **Optimize Librosa Load**: When loading audio files with Librosa, consider loading them in a more memory-efficient manner. For example, you can load only a certain duration of the audio files instead of full length, if that's suitable for your task.

5. **Use Generator Functions**: Instead of applying the preprocessing function to the entire DataFrame at once, use a generator function that processes and yields one row at a time. This can significantly reduce memory usage.

6. **Optimize Pandas Operations**: Ensure that your Pandas operations are efficient. For example, using `apply` with `axis=1` can be inefficient for large datasets. Vectorized operations or using `itertuples()` for row-wise operations can be more efficient.

7. **Move to a Platform with More Memory**: If none of these solutions work, you might need to run your notebook in an environment with more available memory.

To implement batch processing, you could modify your code as shown in the next cell. It processes the dataset in batches of batch_size rows, which limits how many rows are handled in each call to apply.

In [None]:
#def process_batch(dataframe):
#    return dataframe.apply(preprocess_data, axis=1)

#batch_size = 100  # Adjust this based on your memory constraints
#train_dataset_batches = [process_batch(train_df.iloc[i:i + batch_size]) for i in range(0, len(train_df), batch_size)]
#val_dataset_batches = [process_batch(valid_df.iloc[i:i + batch_size]) for i in range(0, len(valid_df), batch_size)]

Example
Suppose train_df has 5 rows and batch_size is 2. The slicing works like this:

First iteration: i = 0, slice is train_df.iloc[0:2] – This includes rows 0 and 1.

Second iteration: i = 2, slice is train_df.iloc[2:4] – This includes rows 2 and 3.

Third iteration: i = 4, slice is train_df.iloc[4:6] – This includes only row 4. Row 5 does not exist, but Pandas handles this gracefully by ending the slice at the last available row.
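The same slicing pattern can be tried without a DataFrame. In this sketch a plain Python list stands in for the 5 rows of train_df (an assumption made to keep the example self-contained); list slicing, like `.iloc` slicing, tolerates an end index past the last element.

```python
rows = ["row0", "row1", "row2", "row3", "row4"]  # stand-in for the 5 rows of train_df
batch_size = 2

starts = list(range(0, len(rows), batch_size))
batches = [rows[i:i + batch_size] for i in starts]

for i, batch in zip(starts, batches):
    print(f"i = {i}: {batch}")
# i = 0: ['row0', 'row1']
# i = 2: ['row2', 'row3']
# i = 4: ['row4']
```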

Let's clarify the differences between using `DataFrame.apply` and batch processing, especially in terms of processing "one row at a time."

### 1. `DataFrame.apply` Method:

- **Simultaneous Processing**: `apply` with `axis=1` calls the function on one row at a time, but that does not make it memory-efficient. The key point is that the results for all rows are stored simultaneously.
- **Result Storage**: The result of the `apply` function is a new Series or DataFrame (depending on the function applied), which is held in memory. If the function you're applying generates large amounts of data for each row, the combined result can be quite large and can consume significant memory.
- **In-Memory Data**: Even though each row is processed individually, the entire DataFrame (and the resulting Series/DataFrame) must fit into memory.

### 2. Batch Processing:

- **Chunk-by-Chunk Processing**: In batch processing, the DataFrame is divided into smaller chunks or batches, and each batch is processed separately. This means the processed results of only one batch need to be in memory at any given time, provided the results of earlier batches are released.
- **Memory Management**: After processing a batch, you can either store the result and free up the memory used by that batch before moving to the next, or process and use the data immediately (e.g., training a model on that batch) without storing it. This approach significantly reduces the overall memory footprint.
- **Control Over Memory Usage**: You have control over the size of each batch, allowing you to manage the memory usage more effectively, especially with very large DataFrames.

### Key Differences:

- **Memory Footprint**: `apply` can result in a larger memory footprint as it stores the result of processing the entire DataFrame. Batch processing, on the other hand, limits memory usage to the size of the current batch.
- **Control and Flexibility**: Batch processing gives more control over how much data is processed at a time and how memory is managed, which is crucial for large datasets or memory-intensive operations.

### Example Scenario:

Suppose you have a DataFrame `df` with 1 million rows, and you're applying a function that significantly expands the size of each row:

- Using `apply`: The entire DataFrame and the expanded results need to fit into memory, which might not be feasible and could lead to out-of-memory issues.
- Using Batch Processing: You can process, say, 10,000 rows at a time. Only these 10,000 rows and their processed results need to fit into memory at any given point, making it much more manageable.

In summary, while `apply` processes rows one at a time, it still requires enough memory for the entire DataFrame and the results. Batch processing, in contrast, limits memory usage to the size of the current batch, making it a more memory-efficient approach for large datasets.

Still, this batch processing implementation may not reduce memory usage enough. Why?

The way batch processing is implemented in the code above will eventually lead to high memory usage, negating its benefits. Let's break down why this happens:

Understanding the Issue

`train_dataset_batches = [process_batch(train_df.iloc[i:i + batch_size]) for i in range(0, len(train_df), batch_size)]`

This line is a list comprehension that processes each batch and stores all the results in the train_dataset_batches list. Here's what's happening:

Processing in Batches: The data is indeed processed in batches (train_df.iloc[i:i + batch_size]), which is good for memory management during the processing step.

Storing Results in a List: However, the results of processing each batch are stored in a list (train_dataset_batches). This means that as each batch is processed, its output is kept in memory, accumulating with the output of previous batches.

Memory Accumulation: By the end of the loop, the entire processed dataset is held in memory, which is essentially what you were trying to avoid by using batch processing. If the processing function significantly expands each row's data or if the original dataset is very large, this can lead to substantial memory usage.

Therefore, we will use generator functions.

In [None]:
# Generator function
#def data_generator(dataframe):
#    for _, row in dataframe.iterrows():
#        yield preprocess_data(row)

In [None]:
# Create generators
#train_generator = data_generator(train_df)
#val_generator = data_generator(valid_df)

In [None]:
# Collecting processed training data
#train_data = []
#for data in train_generator:
#    train_data.append(data)

# Collecting processed validation data
#val_data = []
#for data in val_generator:
#    val_data.append(data)

Even in the above cell, memory accumulation happens.

Memory Accumulation: The generator yields one processed sample at a time, but every sample is appended to train_data or val_data. By the end of the loop, the entire processed dataset is held in memory, which is exactly what the generator was meant to avoid.
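The difference between consuming a generator and collecting its output can be made visible with a small sketch. `Sample` is a hypothetical stand-in for one preprocessed example, and the counts rely on CPython's reference counting, so treat this as an illustration rather than a measurement of the real pipeline.

```python
import weakref

class Sample:
    """Stand-in for one preprocessed example (hypothetical)."""
    live = weakref.WeakSet()  # tracks which samples are still alive

    def __init__(self, idx):
        self.idx = idx
        Sample.live.add(self)

def sample_generator(n):
    for i in range(n):
        yield Sample(i)

def peak_live_when_streaming(n):
    peak = 0
    for sample in sample_generator(n):
        # Use the sample here (e.g. feed it to the model), then let it go
        peak = max(peak, len(Sample.live))
    return peak

def peak_live_when_collecting(n):
    collected = []
    for sample in sample_generator(n):
        collected.append(sample)  # keeps every sample alive
    return len(Sample.live)

streaming_peak = peak_live_when_streaming(5)
collecting_peak = peak_live_when_collecting(5)
print(streaming_peak, collecting_peak)
```

When each sample is used and dropped, the number of live samples stays small and constant; when they are appended to a list, it grows to the size of the dataset.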

In [None]:
# Wrap the already-processed lists in a Dataset, the format expected by Hugging Face
#class CustomDataset(Dataset):
#    def __init__(self, data):
#        self.data = data  # list of preprocessed samples

#    def __len__(self):
#        return len(self.data)

#    def __getitem__(self, idx):
#        return self.data[idx]  # a list is indexed directly, not with .iloc

#train_dataset = CustomDataset(train_data)
#valid_dataset = CustomDataset(val_data)

When we iterate over the CustomDataset, it calls __getitem__ for each requested index. With a DataLoader, only as many samples as the batch size are fetched at a time. This is good, but here the whole processed dataset has already been accumulated in train_data and val_data above. So, we need to implement the CustomDataset class differently, so that preprocessing happens inside __getitem__ and nothing is accumulated beforehand.

In [None]:
# Custom Dataset
class CustomDataset(Dataset):
    def __init__(self, dataframe):
        self.dataframe = dataframe

    def __len__(self):
        return len(self.dataframe)

    def __getitem__(self, idx):
        return preprocess_data(self.dataframe.iloc[idx])

In [None]:
# Create Dataset objects
train_dataset = CustomDataset(train_df)
valid_dataset = CustomDataset(valid_df)

In [None]:
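# DataLoader (for manual iteration)
# Note: the default collate function cannot stack clips of different lengths,
# so a padding collate_fn is needed before iterating over these loaders.
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
valid_loader = DataLoader(valid_dataset, batch_size=32, shuffle=False)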

The `CustomDataset` class in the provided code example reduces memory usage by processing each audio file and its corresponding label on-demand, rather than pre-loading and processing the entire dataset into memory at once. This approach is beneficial, especially when dealing with large datasets. Here's how it works:

1. **Lazy Loading**: The `CustomDataset` class implements lazy loading of data. This means that the data for each sample (i.e., each audio file and its corresponding label) is only loaded and processed when it's needed - specifically, when the `__getitem__` method is called. 

2. **The `__getitem__` Method**: This is a standard Python special method that a map-style PyTorch `Dataset` must implement. It's designed to fetch a single data point. In our case, this method calls `preprocess_data`, which loads an audio file, processes it, and returns the processed audio data (`input_values`) and its corresponding label (`labels`). This happens once for each sample requested during training.

3. **Batch Processing with `DataLoader`**: When the `DataLoader` iterates over the `CustomDataset`, it calls `__getitem__` for each index in the batch. This means the data for only a few samples (as many as the batch size) is in memory at any given time, not the entire dataset.

4. **Efficient Memory Usage**: By processing data in batches and loading each sample only when needed, the `CustomDataset` class avoids the high memory cost of loading the entire dataset into memory. This is especially important for large datasets or when working with limited memory resources.

5. **On-the-Fly Processing**: The preprocessing (like resampling and tokenizing) is done in real-time for each batch. This approach is different from pre-processing the entire dataset beforehand and storing it in memory, which can be very memory-intensive.

In summary, the `CustomDataset` class in conjunction with `DataLoader` enables efficient memory usage by loading and processing data in smaller, manageable batches, rather than loading the entire dataset at once. This approach is particularly useful in scenarios where you have large datasets or limited memory resources.
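The lazy-loading behaviour can be observed without PyTorch or audio files. In this sketch, `LazyDataset`, `iter_batches`, and `fake_load` are hypothetical stand-ins for `CustomDataset`, `DataLoader`, and `preprocess_data`; the counter shows that samples are loaded only when a batch asks for them.

```python
class LazyDataset:
    """Map-style dataset: defines __len__ and __getitem__, loads on access."""

    def __init__(self, ids, load_fn):
        self.ids = ids          # cheap metadata, like the DataFrame rows
        self.load_fn = load_fn  # expensive step, like preprocess_data
        self.loads = 0          # counts how many samples were actually loaded

    def __len__(self):
        return len(self.ids)

    def __getitem__(self, idx):
        self.loads += 1
        return self.load_fn(self.ids[idx])

def iter_batches(dataset, batch_size):
    """Yield lists of samples, fetching each one only when its batch is built."""
    for start in range(0, len(dataset), batch_size):
        stop = min(start + batch_size, len(dataset))
        yield [dataset[i] for i in range(start, stop)]

def fake_load(clip_id):
    # Hypothetical stand-in for decoding an audio file
    return {"id": clip_id, "input_values": [0.0] * 4}

dataset = LazyDataset(["a", "b", "c", "d", "e"], fake_load)
print(dataset.loads)              # 0: nothing is loaded at construction time

batches = iter_batches(dataset, batch_size=2)
first = next(batches)
print(len(first), dataset.loads)  # 2 2: only the first batch was loaded
```

The trade-off is that every sample is loaded again each time it is requested, so the saving in memory is paid for with repeated decoding work in every epoch.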

In [None]:
# Training Arguments
training_args = TrainingArguments(
    output_dir="./wav2vec2-finetuned",
    group_by_length=False,  # True makes the sampler measure every sample first, which loads each audio file up front
    per_device_train_batch_size=32,
    gradient_accumulation_steps=2,
    evaluation_strategy="steps",
    num_train_epochs=3,
    fp16=True,
    save_steps=500,
    eval_steps=500,
    logging_steps=500,
    learning_rate=3e-4,
    warmup_steps=500,
    save_total_limit=2,
)

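# Data Collator
# transformers does not ship DataCollatorCTCWithPadding, so a minimal version is
# defined here (adapted from the Hugging Face wav2vec2 fine-tuning example).
class DataCollatorCTCWithPadding:
    def __init__(self, processor, padding=True):
        self.processor = processor
        self.padding = padding

    def __call__(self, features):
        # Pad audio inputs and label ids separately, since their lengths differ
        inputs = [{"input_values": f["input_values"]} for f in features]
        labels = [{"input_ids": f["labels"]} for f in features]
        batch = self.processor.feature_extractor.pad(inputs, padding=self.padding, return_tensors="pt")
        labels_batch = self.processor.tokenizer.pad(labels, padding=self.padding, return_tensors="pt")
        # Replace label padding with -100 so the CTC loss ignores it
        batch["labels"] = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)
        return batch

data_collator = DataCollatorCTCWithPadding(processor=processor, padding=True)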


# Trainer
# Trainer builds its own DataLoaders, so it takes the Dataset objects, not the loaders
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    tokenizer=processor.feature_extractor,
)

# Train the model
trainer.train()

In [None]:
def predict(audio_filepath):
    # Load the audio and resample it to 16 kHz, matching the training preprocessing
    speech, sr = librosa.load(audio_filepath, sr=16000)
    input_values = processor(speech, sampling_rate=sr, return_tensors="pt").input_values

    # Perform inference in evaluation mode, on the same device as the model
    model.eval()
    with torch.no_grad():
        logits = model(input_values.to(model.device)).logits

    # Decode the prediction: most likely token per frame, then CTC decoding
    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.batch_decode(predicted_ids)

    return transcription[0]

# Example usage
audio_file = "path_to_audio_file.mp3"  # Replace with your audio file path
transcription = predict(audio_file)
print("Predicted transcription:", transcription)
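The decoding step relies on the CTC rule for turning per-frame predictions into a sequence: merge consecutive repeats, then drop the blank token. A minimal standard-library sketch of that rule follows. The function name `ctc_greedy_collapse` and the blank id of 0 are assumptions for illustration; the real processor also maps the remaining ids back to characters.

```python
from itertools import groupby

def ctc_greedy_collapse(frame_ids, blank_id=0):
    """Merge consecutive repeated ids, then remove blanks."""
    merged = [token for token, _ in groupby(frame_ids)]
    return [token for token in merged if token != blank_id]

# One predicted id per audio frame (made-up values)
frame_ids = [0, 3, 3, 0, 3, 5, 5, 0, 0]
print(ctc_greedy_collapse(frame_ids))  # [3, 3, 5]
```

The blank between the two runs of 3 is what lets a genuinely repeated token survive the merge.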