
5.1 Introduction to the Datasets Library


5.1.1 Overview and Purpose

The 🤗 Datasets library provides a unified interface for accessing and working with datasets for NLP tasks.
It simplifies dataset loading, preprocessing, and integration with machine learning pipelines.

5.1.2 Key Features

Extensive Collection: Access to over 900 datasets for various NLP tasks, including text classification, language modeling, and question answering.
Efficient Handling: Optimized for large-scale datasets with memory-mapped files and streaming capabilities.
Interoperability: Seamless integration with 🤗 Transformers and other machine learning frameworks.


5.2 Installing and Importing the Datasets Library

5.2.1 Installation

Install the 🤗 Datasets library via pip:

In [1]:
pip install datasets



5.2.2 Importing Modules

Import necessary modules for working with datasets:

In [3]:
# load_metric is no longer available in recent releases of datasets;
# metrics now live in the separate evaluate library.
from datasets import load_dataset, Dataset, DatasetDict

5.3 Loading and Exploring Datasets


5.3.1 Loading a Dataset

Load a dataset from the 🤗 Hub by name:

In [6]:
dataset = load_dataset("imdb")

5.3.2 Dataset Structure

Explore dataset structure and metadata:

In [7]:
# print(dataset.info)
print(dataset["train"].features)
print(dataset["train"][0])

{'text': Value('string'), 'label': ClassLabel(names=['neg', 'pos'])}
{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was consider

5.3.3 Accessing Dataset Splits

Access specific splits of the dataset:

In [9]:
train_dataset = dataset["train"]
test_dataset = dataset["test"]

5.4 Preprocessing Datasets

5.4.1 Data Cleaning and Formatting

*   Preprocess data to prepare for model training:

In [10]:
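def preprocess_function(examples):
    # With batched=True, examples["text"] is a list of strings.
    examples["text"] = [text.lower() for text in examples["text"]]
    return examples

# batched=True is required here: without it the function receives a single
# string and the list comprehension would iterate over its characters.
train_dataset = train_dataset.map(preprocess_function, batched=True)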

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]
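
By default, map passes one example at a time to the function; with batched=True it passes a dict of columns instead. The plain-Python sketch below imitates both calling conventions to show what examples["text"] looks like in each mode. It is only an illustration: fake_map, lowercase_batch, and the sample rows are made up, and this is not how the library is implemented.

```python
# Illustration only: a toy stand-in for Dataset.map, not the library's code.
rows = [{"text": "Great MOVIE"}, {"text": "Dull Plot"}]

def fake_map(rows, function, batched=False):
    if batched:
        # Batched mode: the function sees a dict of columns (lists of values).
        batch = {key: [row[key] for row in rows] for key in rows[0]}
        out = function(batch)
        # Turn the returned columns back into a list of rows.
        return [dict(zip(out, values)) for values in zip(*out.values())]
    # Default mode: the function sees one example (a dict of single values).
    return [function(dict(row)) for row in rows]

def lowercase_batch(examples):
    examples["text"] = [text.lower() for text in examples["text"]]
    return examples

batched_result = fake_map(rows, lowercase_batch, batched=True)
per_example_result = fake_map(rows, lowercase_batch, batched=False)

print(batched_result[0]["text"])      # the whole string, lowercased
print(per_example_result[0]["text"])  # a list of single characters
```

A function written with a list comprehension over a column therefore belongs with batched=True; a per-example function would call text.lower() on the string directly.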

5.4.2 Tokenization

*   Tokenize text data using 🤗 Transformers tokenizers:

In [11]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_dataset = train_dataset.map(tokenize_function, batched=True)

Dataset({
    features: ['text', 'label'],
    num_rows: 25000
})
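
To make the two tokenizer arguments concrete, here is a minimal sketch of what padding="max_length" and truncation=True do to a list of token ids. The helper pad_or_truncate, the token ids, and pad_id=0 are assumptions chosen for illustration; the real tokenizer does considerably more.

```python
# Illustration only: mimics the effect of padding="max_length", truncation=True
# on one sequence of token ids. The ids and pad_id=0 are example values.
def pad_or_truncate(token_ids, max_length, pad_id=0):
    ids = token_ids[:max_length]             # truncation: drop tokens past max_length
    mask = [1] * len(ids)                    # 1 marks real tokens
    padding = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * padding,   # pad up to the fixed length
        "attention_mask": mask + [0] * padding,  # 0 marks padding positions
    }

short = pad_or_truncate([101, 2023, 102], max_length=5)
long = pad_or_truncate([101, 2023, 3185, 2001, 2307, 102], max_length=5)

print(short)
print(long)
```

Unlike this sketch, the real tokenizer keeps special tokens such as [SEP] in place when it truncates. Padding every example to the model's maximum length is simple but wasteful for short texts, which is what per-batch padding in section 5.6.2 addresses.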

5.5 Dataset Splitting and Sampling


5.5.1 Splitting Datasets

*   Split dataset into training and validation sets:

In [None]:
# Split the tokenized data so both parts carry input_ids and attention_mask.
train_test_split = tokenized_dataset.train_test_split(test_size=0.2, seed=42)
train_dataset = train_test_split["train"]
val_dataset = train_test_split["test"]
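
Conceptually, a random split shuffles the row indices and cuts them at the requested fraction. The sketch below shows that idea in plain Python; split_indices is a hypothetical helper and the library's own shuffling differs in detail, so the selected rows will not match.

```python
import random

def split_indices(num_rows, test_size=0.2, seed=42):
    """Shuffle row indices and cut them into a train part and a test part."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)  # seeded, so the split is repeatable
    num_test = int(round(num_rows * test_size))
    return indices[num_test:], indices[:num_test]

train_idx, val_idx = split_indices(25000, test_size=0.2)
print(len(train_idx), len(val_idx))  # 20000 5000
```

Passing a fixed seed to train_test_split serves the same purpose: reruns of the notebook produce the same training and validation rows.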

5.5.2 Sampling

* Create smaller samples for quick experimentation:

In [None]:
small_train_dataset = train_dataset.shuffle(seed=42).select(range(1000))

5.6 Dataset Transformation

5.6.1 Applying Transformations

* Apply transformations such as data augmentation or feature engineering:

In [None]:
def add_feature(examples):
    # Word count per review; expects a batch, so examples["text"] is a list.
    examples["length"] = [len(text.split()) for text in examples["text"]]
    return examples

train_dataset = train_dataset.map(add_feature, batched=True)

5.6.2 Batching and Padding

* Handle batching and padding for model training:

In [None]:
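from torch.utils.data import DataLoader
from transformers import DataCollatorWithPadding

# Pads each batch to the length of its longest sequence.
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

def model_inputs(ds):
    # Assumes ds is already tokenized. The collator cannot batch raw strings,
    # so drop the non-tensor columns and keep tokenizer outputs and labels.
    return ds.remove_columns([c for c in ds.column_names if c in ("text", "length")])

train_dataloader = DataLoader(model_inputs(train_dataset), shuffle=True, batch_size=16, collate_fn=data_collator)
val_dataloader = DataLoader(model_inputs(val_dataset), batch_size=16, collate_fn=data_collator)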
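
DataCollatorWithPadding pads each batch only up to its own longest sequence. The following plain-Python sketch illustrates why that saves work compared with padding everything to one global length. The helpers pad_batch and batches, the token ids, and pad_id=0 are made up for the example.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence in one batch to the longest sequence of that batch."""
    longest = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (longest - len(seq)) for seq in sequences]

def batches(items, batch_size):
    """Yield consecutive slices of batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

token_ids = [[1, 2], [3, 4, 5], [6], [7, 8, 9, 10, 11, 12]]
padded_batches = [pad_batch(batch) for batch in batches(token_ids, batch_size=2)]

# Number of token positions the model would process under each strategy.
dynamic_cost = sum(len(row) for batch in padded_batches for row in batch)
fixed_cost = len(token_ids) * max(len(seq) for seq in token_ids)

print(padded_batches)
print(dynamic_cost, fixed_cost)  # 18 24
```

The first batch is padded to length 3 and the second to length 6, so fewer padding positions are processed than when all four sequences are padded to length 6.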

5.7 Integration with 🤗 Transformers


5.7.1 Preparing Datasets for Training

* Fine-tune a model on the preprocessed dataset with the 🤗 Transformers Trainer:

In [None]:
from transformers import Trainer, TrainingArguments, AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",  # called evaluation_strategy before transformers 4.41
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

trainer.train()

5.7.2 Evaluation and Metrics

* Evaluate model performance on the validation dataset:

In [None]:
results = trainer.evaluate()
print(results)

5.7.3 Custom Metrics

* Implement custom metrics for evaluation:

In [None]:
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
)

trainer.train()
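
For reference, the four numbers returned by compute_metrics can be worked out by hand from the confusion counts. This is a minimal sketch with made-up labels and predictions; binary_metrics is a hypothetical helper, and scikit-learn remains the better choice in real code.

```python
def binary_metrics(labels, preds, positive=1):
    """Accuracy, precision, recall and F1 for a binary task, from raw counts."""
    pairs = list(zip(labels, preds))
    tp = sum(1 for y, p in pairs if y == positive and p == positive)
    fp = sum(1 for y, p in pairs if y != positive and p == positive)
    fn = sum(1 for y, p in pairs if y == positive and p != positive)
    correct = sum(1 for y, p in pairs if y == p)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / len(pairs),
        "f1": f1,
        "precision": precision,
        "recall": recall,
    }

labels = [1, 0, 1, 1, 0, 0]
preds = [1, 0, 0, 1, 1, 0]
metrics = binary_metrics(labels, preds)
print(metrics)
```

Here two positives are found, one is missed, and one negative is flagged by mistake, so precision, recall, and F1 all equal 2/3 while accuracy is 4/6.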

5.8 Best Practices and Tips

5.8.1 Efficient Data Handling

* Use memory-mapped files and streaming datasets for efficient data processing, especially with large datasets:

In [None]:
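# Assumption: recent library versions load Wikipedia from the maintained
# "wikimedia/wikipedia" repository; older ones used load_dataset("wikipedia", "20200501.en", ...).
dataset = load_dataset("wikimedia/wikipedia", "20231101.en", split="train", streaming=True)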
for example in dataset:
    print(example)
    break
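
Streaming works like a Python generator: examples are produced one at a time, so only the ones actually consumed occupy memory. The sketch below shows the idea with a made-up generator, stream_examples, standing in for a streamed dataset.

```python
from itertools import islice

def stream_examples(num_examples):
    """Yield examples one by one instead of building the full list in memory."""
    for i in range(num_examples):
        yield {"id": i, "text": f"document {i}"}

# A billion examples are "available", but only three are ever created.
stream = stream_examples(10**9)
first_three = list(islice(stream, 3))
print(first_three)
```

On a streamed dataset, dataset.take(3) plays the role that islice plays here.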

5.8.2 Dataset Caching

* Downloaded and processed data is cached on disk and reused on later calls, which avoids redundant downloads and computation; cache_dir sets where the cache is stored:

In [None]:
dataset = load_dataset("imdb", cache_dir="./cache")
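
The general idea behind on-disk caching can be sketched as follows: derive a key from the input and the transformation, and reuse the stored result when the same key comes up again. This is a simplified illustration, not the library's mechanism; cached_transform, the key scheme, and the JSON file format are all assumptions made for the example.

```python
import hashlib
import json
import os
import tempfile

def cached_transform(data, transform, cache_dir):
    """Apply transform to data, reusing a result stored under cache_dir if present."""
    # Toy fingerprint: hash of the input data plus the transform's name.
    key_source = json.dumps(data, sort_keys=True) + transform.__name__
    key = hashlib.sha256(key_source.encode("utf-8")).hexdigest()[:16]
    path = os.path.join(cache_dir, f"cache-{key}.json")

    if os.path.exists(path):
        with open(path) as f:
            return json.load(f), True   # cache hit: nothing is recomputed
    result = transform(data)
    with open(path, "w") as f:
        json.dump(result, f)
    return result, False                # cache miss: computed and stored

def lowercase(texts):
    return [text.lower() for text in texts]

with tempfile.TemporaryDirectory() as cache_dir:
    first, first_hit = cached_transform(["Good", "BAD"], lowercase, cache_dir)
    second, second_hit = cached_transform(["Good", "BAD"], lowercase, cache_dir)

print(first_hit, second_hit)  # False True
```

The second call returns the stored result without running the transformation again, which is why rerunning an unchanged preprocessing cell is fast.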

5.8.3 Handling Imbalanced Datasets

* Use techniques like oversampling, undersampling, or class weighting to handle imbalanced datasets:

In [None]:
import numpy as np
from imblearn.over_sampling import RandomOverSampler

def oversample_dataset(dataset, seed=42):
    # Resample row indices rather than raw text so that every column
    # (including tokenizer outputs) and the label feature type are kept.
    indices = np.arange(len(dataset)).reshape(-1, 1)
    ros = RandomOverSampler(random_state=seed)
    resampled_indices, _ = ros.fit_resample(indices, np.array(dataset["label"]))
    return dataset.select(resampled_indices.flatten().tolist())

train_dataset = oversample_dataset(train_dataset)
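
Class weighting, the third technique listed above, leaves the data unchanged and instead makes errors on rare classes count more in the loss. A common heuristic weights each class by n_samples / (n_classes * class_count). The sketch below computes those weights for a made-up label list; balanced_class_weights is a hypothetical helper.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    num_samples = len(labels)
    num_classes = len(counts)
    return {label: num_samples / (num_classes * count) for label, count in counts.items()}

labels = [0] * 8 + [1] * 2   # 80% negative, 20% positive
weights = balanced_class_weights(labels)
print(weights)  # {0: 0.625, 1: 2.5}
```

The resulting weights could then be passed to a weighted loss function so that each mistake on the minority class costs four times as much as one on the majority class.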