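import numpy as np


def batch_generator(features, labels, batch_size):
    """Yield shuffled (features, labels) mini-batches covering the data once."""
    total_samples = len(features)
    # Shuffle indices rather than the arrays so features and labels stay aligned.
    indices = np.arange(total_samples)
    np.random.shuffle(indices)

    for start in range(0, total_samples, batch_size):
        # The last batch may be smaller than batch_size.
        end = min(start + batch_size, total_samples)
        batch_indices = indices[start:end]
        yield features[batch_indices], labels[batch_indices]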
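
A usage sketch for this kind of generator, written as a self-contained variant so it runs on its own. The `seed` argument and the name `seeded_batch_generator` are hypothetical additions for reproducible shuffling, and the toy arrays are made up for illustration.

```python
import numpy as np


def seeded_batch_generator(features, labels, batch_size, seed=None):
    """Yield shuffled (features, labels) mini-batches; `seed` is a hypothetical addition."""
    rng = np.random.default_rng(seed)
    # A single permutation of row indices keeps features and labels aligned.
    indices = rng.permutation(len(features))
    for start in range(0, len(features), batch_size):
        batch_indices = indices[start:start + batch_size]
        yield features[batch_indices], labels[batch_indices]


# Toy data: 10 samples with 2 features each; row i is [2*i, 2*i + 1].
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

batches = list(seeded_batch_generator(X, y, batch_size=4, seed=0))
sizes = [len(batch_y) for _, batch_y in batches]
```

With 10 samples and a batch size of 4, the final batch holds the 2 leftover samples. Calling the generator again starts a new pass over the data, which is how it would typically be used once per epoch.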


# Loading from Disk
Stream data from disk without loading the entire dataset into memory.
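
One way to picture streaming is a generator that reads a file a few rows at a time. The sketch below uses only the Python standard library rather than any particular framework; the function name `stream_csv_batches`, the file name, and the column names are all hypothetical.

```python
import csv
import os
import tempfile
from itertools import islice


def stream_csv_batches(path, batch_size):
    """Lazily yield lists of row dicts from a CSV file, batch_size rows at a time."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        while True:
            # islice pulls at most batch_size rows, so only one batch is in memory.
            batch = list(islice(reader, batch_size))
            if not batch:
                return
            yield batch


# Demo on a small temporary file standing in for a large dataset on disk.
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "example.csv")
    with open(csv_path, "w", newline="") as handle:
        handle.write("age,fare\n22,7.25\n38,71.28\n26,7.92\n")

    streamed = list(stream_csv_batches(csv_path, batch_size=2))

batch_lengths = [len(batch) for batch in streamed]
```

Values come back as strings because `csv.DictReader` does no type conversion; a real pipeline would parse them in a later preprocessing step.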


# Pipelining
Create efficient input pipelines by chaining multiple dataset transformations (e.g., `map`, `batch`, `shuffle`).

# Parallelizing

Parallelize data loading and preprocessing with functions like `map`, which can significantly speed up training.
### `tf.data.Dataset`
Chain `prefetch` as the last step, after the other data processing methods like `map`, `batch`, and `shuffle`, so that the pipeline can prepare the next batch while the current one is being consumed.
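
A minimal sketch of such a pipeline, assuming TensorFlow 2.4 or later (for `tf.data.AUTOTUNE`). The toy arrays, the `scale` function, and the buffer and batch sizes are made up for illustration.

```python
import numpy as np
import tensorflow as tf

# Hypothetical toy data: 10 samples with 3 features each, plus binary labels.
features = np.arange(30, dtype=np.float32).reshape(10, 3)
labels = (np.arange(10) % 2).astype(np.int32)


def scale(x, y):
    # Illustrative preprocessing step; the divisor is an arbitrary choice.
    return x / 30.0, y


dataset = (
    tf.data.Dataset.from_tensor_slices((features, labels))
    .shuffle(buffer_size=10, seed=42)                  # randomize sample order
    .map(scale, num_parallel_calls=tf.data.AUTOTUNE)   # preprocess in parallel
    .batch(4)                                          # group samples into batches
    .prefetch(tf.data.AUTOTUNE)                        # overlap loading with consumption
)

batch_shapes = [x.numpy().shape for x, _ in dataset]
```

Here `num_parallel_calls` is what lets `map` process several elements concurrently, and `tf.data.AUTOTUNE` leaves the degree of parallelism and the prefetch buffer size to the runtime.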


Hugging Face's `datasets` library provides functionality similar to TensorFlow's `tf.data.Dataset` within the Hugging Face ecosystem, where it is most often used for NLP tasks. It is designed to handle and process large datasets efficiently. The equivalent steps with Hugging Face `datasets` are:

### Hugging Face Datasets

1. **Loading Data:**
   Hugging Face provides convenient methods to load various datasets.

   ```python
   from datasets import load_dataset

   # Load a local CSV file; the path is a placeholder
   dataset = load_dataset('csv', data_files='path/to/titanic.csv')
   ```

2. **Mapping/Preprocessing:**
   Use the `map` method to apply transformations to your dataset.

   ```python
   def preprocess_function(example):
       # Apply preprocessing (e.g., normalization).
       # Assumes each example has a 'features' column holding a list of numbers.
       example['features'] = [x / 255.0 for x in example['features']]
       return example

   dataset = dataset.map(preprocess_function)
   ```

3. **Shuffling:**
   You can shuffle the dataset to randomize the order of elements.

   ```python
   dataset = dataset.shuffle(seed=42)
   ```

4. **Batching:**
   Batching for training is usually left to the data loader of your deep learning framework (e.g., PyTorch `DataLoader`) rather than done with a TensorFlow-style `batch` call. Separately, `map` accepts `batched=True` to apply a function to many examples at once.

5. **Prefetching:**
   Hugging Face datasets do not rely on a TensorFlow-style `prefetch` call. Instead, they can be used with data loaders that support prefetching, e.g., PyTorch’s `DataLoader`, whose `num_workers` parameter loads batches in background worker processes ahead of the training loop.



```python
from datasets import load_dataset
from torch.utils.data import DataLoader

# Load dataset
dataset = load_dataset('csv', data_files='path/to/titanic.csv')

# Example preprocessing function
def preprocess_function(example):
    # Assume 'features' is a list of feature values and 'label' is the target
    example['features'] = [x / 255.0 for x in example['features']]
    return example

# Apply the preprocessing function
dataset = dataset.map(preprocess_function)

# Shuffle the dataset
dataset = dataset.shuffle(seed=42)

# Convert to PyTorch DataLoader
def collate_fn(batch):
    # Custom collate function to handle batching
    features = [item['features'] for item in batch]
    labels = [item['label'] for item in batch]
    return {'features': features, 'labels': labels}

batch_size = 4
dataloader = DataLoader(dataset['train'], batch_size=batch_size, collate_fn=collate_fn, num_workers=4)

# Iterate over the DataLoader
for batch in dataloader:
    print("Features:")
    print(batch['features'])
    print("Labels:")
    print(batch['labels'])
    print()
```

