# Dataset Loading and Exploration

This notebook loads semantic similarity datasets for fine-tuning EmbeddingGemma. We'll explore the toy dataset and prepare it for training.


In [1]:
# Import data loading functions from the src package
# Project rule: ALL functions must live in scripts under /src, not in notebooks
from src.data.loaders import load_toy_dataset, validate_pairs
import pandas as pd


## Load Toy Dataset

The toy dataset contains 4 pairs of semantically similar sentences. It is small enough to inspect by hand, which makes it a convenient way to demonstrate the fine-tuning process.


In [2]:
# Using the imported function to load the toy dataset
# No logic is embedded in the notebook - notebooks are for workflow execution only
train_data = load_toy_dataset()

# Display the dataset in a table format for exploration
df = pd.DataFrame(train_data)
df


|   | anchor | positive |
|---|--------|----------|
| 0 | I love playing football. | Playing soccer is my favorite hobby. |
| 1 | The weather is really nice today. | It's quite a sunny day outside. |
| 2 | The soccer game ended with no winner. | The football match ended in a draw. |
| 3 | There was a heavy downpour all day. | It rained heavily throughout the day. |
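
The notebook does not show the body of `load_toy_dataset`. Since its result is passed to `pd.DataFrame` and iterated as records with `anchor` and `positive` keys, one plausible shape is a list of dictionaries. The sketch below is a hypothetical stand-in (the name `load_toy_dataset_sketch` is an assumption, not the project's function) that returns the four pairs shown in the output above.

```python
from typing import Dict, List


def load_toy_dataset_sketch() -> List[Dict[str, str]]:
    """Hypothetical stand-in for src.data.loaders.load_toy_dataset.

    Assumes the loader returns a list of {"anchor", "positive"} records;
    the real implementation may differ.
    """
    return [
        {
            "anchor": "I love playing football.",
            "positive": "Playing soccer is my favorite hobby.",
        },
        {
            "anchor": "The weather is really nice today.",
            "positive": "It's quite a sunny day outside.",
        },
        {
            "anchor": "The soccer game ended with no winner.",
            "positive": "The football match ended in a draw.",
        },
        {
            "anchor": "There was a heavy downpour all day.",
            "positive": "It rained heavily throughout the day.",
        },
    ]


pairs = load_toy_dataset_sketch()
print(f"Loaded {len(pairs)} pairs with keys {sorted(pairs[0])}")
```

A list of flat records like this converts directly to a DataFrame with one column per key, which matches how the notebook displays the data.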


## Dataset Statistics

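Let's examine the characteristics of our dataset: the number of pairs, the average anchor and positive lengths in characters, and whether any strings are empty.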


In [3]:
# Validate and get statistics about the dataset
stats = validate_pairs(train_data)

print("Dataset Statistics:")
print("-" * 40)
print(f"Number of pairs: {stats['count']}")
print(f"Average anchor length: {stats['avg_anchor_length']:.1f} characters")
print(f"Average positive length: {stats['avg_positive_length']:.1f} characters")
print(f"Contains empty strings: {stats['has_empty']}")


Dataset Statistics:
----------------------------------------
Number of pairs: 4
Average anchor length: 32.2 characters
Average positive length: 34.8 characters
Contains empty strings: False
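
The statistics above could be computed along the following lines. This is only a sketch: `validate_pairs_sketch` is a hypothetical name, the dictionary keys are taken from the cell above, and treating whitespace-only strings as empty is an assumption about how the real `validate_pairs` behaves.

```python
from typing import Dict, List, Union

Stats = Dict[str, Union[int, float, bool]]


def validate_pairs_sketch(pairs: List[Dict[str, str]]) -> Stats:
    """Hypothetical sketch of a pair validator; not the project's code."""
    if not pairs:
        raise ValueError("Dataset must contain at least one pair")

    # Every record needs both fields before lengths can be measured.
    for index, pair in enumerate(pairs):
        missing = {"anchor", "positive"} - set(pair)
        if missing:
            raise KeyError(f"Pair {index} is missing keys: {sorted(missing)}")

    anchor_lengths = [len(pair["anchor"]) for pair in pairs]
    positive_lengths = [len(pair["positive"]) for pair in pairs]

    # Assumption: a string containing only whitespace counts as empty.
    has_empty = any(
        not pair["anchor"].strip() or not pair["positive"].strip()
        for pair in pairs
    )

    return {
        "count": len(pairs),
        "avg_anchor_length": sum(anchor_lengths) / len(pairs),
        "avg_positive_length": sum(positive_lengths) / len(pairs),
        "has_empty": has_empty,
    }


sample = [
    {"anchor": "abc", "positive": "abcde"},
    {"anchor": "a", "positive": "abc"},
]
sample_stats = validate_pairs_sketch(sample)
print(sample_stats)
```

Lengths here are measured in characters, matching the units printed by the notebook; a token-based count would need the model's tokenizer instead.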


## Dataset Overview

Each pair consists of an anchor sentence and a positive (semantically similar) sentence. During fine-tuning, the model learns to place the embeddings of the two sentences in each pair close together in vector space.


In [4]:
# Display each pair for inspection
for i, pair in enumerate(train_data, 1):
    print(f"\nPair {i}:")
    print(f"  Anchor:   '{pair['anchor']}'")
    print(f"  Positive: '{pair['positive']}'")



Pair 1:
  Anchor:   'I love playing football.'
  Positive: 'Playing soccer is my favorite hobby.'

Pair 2:
  Anchor:   'The weather is really nice today.'
  Positive: 'It's quite a sunny day outside.'

Pair 3:
  Anchor:   'The soccer game ended with no winner.'
  Positive: 'The football match ended in a draw.'

Pair 4:
  Anchor:   'There was a heavy downpour all day.'
  Positive: 'It rained heavily throughout the day.'
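
To make "close together in vector space" concrete, the sketch below computes cosine similarity, a common way to compare embeddings. The notebook does not state which distance measure the training uses, so the choice of cosine similarity is an assumption, and the three-dimensional vectors are made-up illustrations rather than real EmbeddingGemma outputs.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("Vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("Cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Made-up toy vectors for illustration only (not model outputs).
anchor_vec = [0.9, 0.1, 0.2]
positive_vec = [0.8, 0.2, 0.1]
unrelated_vec = [0.1, 0.9, 0.3]

sim_positive = cosine_similarity(anchor_vec, positive_vec)
sim_unrelated = cosine_similarity(anchor_vec, unrelated_vec)

print(f"anchor vs positive:  {sim_positive:.3f}")
print(f"anchor vs unrelated: {sim_unrelated:.3f}")
```

A well-trained model would be expected to score each anchor higher against its own positive than against sentences from other pairs, which is the pattern the toy vectors imitate.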
