# Memory Bacon: Embedding Exploration

This notebook documents the embedding generation process for the `memory_bacon` dataset.

## What We Did Today

1. **Parsed essay text** (`3essay.txt`) into structured sentences with chapter and position metadata
2. **Generated embeddings** using OpenAI's `text-embedding-3-small` model (1536 dimensions)
3. **Computed contextual statistics** including similarity with neighboring sentences
4. **Built graph edges** connecting sentences based on adjacency, questions, and definitions
5. **Organized data** into a structured directory: `core/`, `graph/`, and `meta/`

## Dataset Overview

- **99 sentences** from 3 chapters
- **1536-dimensional embeddings** per sentence
- **201 graph edges** connecting related sentences
- **Contextual features** including similarity scores with neighbors



In [14]:
# Setup and imports
import json
import numpy as np
from pathlib import Path
from numpy.linalg import norm
from numpy import dot

# Load data paths
BASE_DIR = Path("memory_bacon")
CORE_DIR = BASE_DIR / "core"
GRAPH_DIR = BASE_DIR / "graph"
META_DIR = BASE_DIR / "meta"

print("Data directories:")
print(f"  Core: {CORE_DIR}")
print(f"  Graph: {GRAPH_DIR}")
print(f"  Meta: {META_DIR}")

# Verify files exist
assert (CORE_DIR / "sentences.jsonl").exists(), "sentences.jsonl not found"
assert (CORE_DIR / "embeddings.npy").exists(), "embeddings.npy not found"
assert (CORE_DIR / "statistics.jsonl").exists(), "statistics.jsonl not found"
assert (META_DIR / "embedding_meta.json").exists(), "embedding_meta.json not found"

print("\n[OK] All required files found")



Data directories:
  Core: memory_bacon\core
  Graph: memory_bacon\graph
  Meta: memory_bacon\meta

[OK] All required files found


In [15]:
# Load metadata and overview
with open(META_DIR / "embedding_meta.json", "r", encoding="utf-8") as f:
    meta = json.load(f)

print("Embedding Metadata:")
print(f"  Model: {meta['model']}")
print(f"  Timestamp: {meta['timestamp']}")
print(f"  Number of sentences: {meta['num_sentences']}")
print(f"  Embedding dimension: {meta['embedding_dim']}")

# Load embeddings
embeddings = np.load(CORE_DIR / "embeddings.npy")
print(f"\nEmbeddings array shape: {embeddings.shape}")
print(f"  - Rows (sentences): {embeddings.shape[0]}")
print(f"  - Columns (dimensions): {embeddings.shape[1]}")
print(f"  - Data type: {embeddings.dtype}")
print(f"  - Memory size: {embeddings.nbytes:,} bytes ({embeddings.nbytes / 1024 / 1024:.2f} MB)")

meta



Embedding Metadata:
  Model: text-embedding-3-small
  Timestamp: 2025-11-12 04:18:39
  Number of sentences: 99
  Embedding dimension: 1536

Embeddings array shape: (99, 1536)
  - Rows (sentences): 99
  - Columns (dimensions): 1536
  - Data type: float32
  - Memory size: 608,256 bytes (0.58 MB)


{'model': 'text-embedding-3-small',
 'timestamp': '2025-11-12 04:18:39',
 'num_sentences': 99,
 'embedding_dim': 1536}
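
A quick consistency check between the metadata and the array can catch a stale `embedding_meta.json`. The sketch below is illustrative only: `check_embedding_meta` is a hypothetical helper, and a zero-filled array stands in for the real `embeddings.npy` so the block runs on its own.

```python
import numpy as np

def check_embedding_meta(meta: dict, embeddings: np.ndarray) -> int:
    """Validate that the metadata matches the array; return the expected byte size."""
    rows, cols = embeddings.shape
    if rows != meta["num_sentences"]:
        raise ValueError(f"row count {rows} != num_sentences {meta['num_sentences']}")
    if cols != meta["embedding_dim"]:
        raise ValueError(f"column count {cols} != embedding_dim {meta['embedding_dim']}")
    # Memory footprint = rows * columns * bytes per element
    return rows * cols * embeddings.dtype.itemsize

# Synthetic stand-in (assumption) with the same shape and dtype as the real array
demo_meta = {"num_sentences": 99, "embedding_dim": 1536}
demo_embeddings = np.zeros((99, 1536), dtype=np.float32)

expected_bytes = check_embedding_meta(demo_meta, demo_embeddings)
print(f"{expected_bytes:,} bytes")  # 99 * 1536 * 4 bytes per float32
```

The result agrees with the memory size reported in the cell output above (608,256 bytes).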

## Understanding Embeddings

Embeddings are dense vector representations that capture semantic meaning. Each sentence is mapped to a 1536-dimensional vector where:

- **Similar sentences** have vectors that are close together (high cosine similarity)
- **Different sentences** have vectors that are far apart (low cosine similarity)
- **Each dimension** captures some aspect of meaning (though individual dimensions are not directly interpretable)

Let's examine a few example sentences and their embeddings:



In [16]:
# Load sentences and select examples
sentences = []
with open(CORE_DIR / "sentences.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        sentences.append(json.loads(line))

# Select interesting examples
example_indices = [0, 1, 8, 22, 47]  # First, second, question, start of chapter 2, start of chapter 3

print("Selected Example Sentences:\n")
examples = []
for idx in example_indices:
    sent = sentences[idx]
    emb = embeddings[idx]
    examples.append({
        "index": idx,
        "id": sent["id"],
        "chapter": sent["chapter"],
        "position": sent["position"],
        "text": sent["text"],
        "embedding": emb
    })
    print(f"[{idx}] {sent['id']} (Chapter {sent['chapter']}, Position {sent['position']})")
    print(f"     Text: {sent['text'][:80]}...")
    print()

examples



Selected Example Sentences:

[0] bacon_001 (Chapter 1, Position 1)
     Text: What is truth? said jesting Pilate, and would not stay for an answer....

[1] bacon_002 (Chapter 1, Position 2)
     Text: Certainly there be, that delight in giddiness, and count it a bondage to fix a b...

[8] bacon_009 (Chapter 1, Position 9)
     Text: Doth any man doubt, that if there were taken out of men's minds, vain opinions, ...

[22] bacon_023 (Chapter 2, Position 1)
     Text: Men fear death, as children fear to go in the dark; and as that natural fear in ...

[47] bacon_048 (Chapter 3, Position 1)
     Text: Religion being the chief band of human society, is a happy thing, when itself is...



[{'index': 0,
  'id': 'bacon_001',
  'chapter': 1,
  'position': 1,
  'text': 'What is truth? said jesting Pilate, and would not stay for an answer.',
  'embedding': array([-0.02344766,  0.03554578, -0.03653585, ...,  0.01509247,
          0.03518356,  0.02172108], shape=(1536,), dtype=float32)},
 {'index': 1,
  'id': 'bacon_002',
  'chapter': 1,
  'position': 2,
  'text': 'Certainly there be, that delight in giddiness, and count it a bondage to fix a belief; affecting free-will in thinking, as well as in acting.',
  'embedding': array([ 0.01569673,  0.02020729,  0.0174038 , ...,  0.03453004,
         -0.01758423,  0.01815325], shape=(1536,), dtype=float32)},
 {'index': 8,
  'id': 'bacon_009',
  'chapter': 1,
  'position': 9,
  'text': "Doth any man doubt, that if there were taken out of men's minds, vain opinions, flattering hopes, false valuations, imaginations as one would, and the like, but it would leave the minds, of a number of men, poor shrunken things, full of melancholy and i

In [17]:
# Examine embedding properties for first example
first_emb = examples[0]["embedding"]

print(f"Embedding for: {examples[0]['text'][:60]}...")
print(f"\nBasic Statistics:")
print(f"  Shape: {first_emb.shape}")
print(f"  Min value: {first_emb.min():.4f}")
print(f"  Max value: {first_emb.max():.4f}")
print(f"  Mean: {first_emb.mean():.4f}")
print(f"  Std: {first_emb.std():.4f}")
print(f"  Norm (length): {norm(first_emb):.4f}")

print(f"\nFirst 10 dimensions:")
for i in range(10):
    print(f"  dim[{i:4d}] = {first_emb[i]:8.4f}")

print(f"\nLast 10 dimensions:")
for i in range(len(first_emb) - 10, len(first_emb)):
    print(f"  dim[{i:4d}] = {first_emb[i]:8.4f}")



Embedding for: What is truth? said jesting Pilate, and would not stay for a...

Basic Statistics:
  Shape: (1536,)
  Min value: -0.0793
  Max value: 0.0906
  Mean: -0.0001
  Std: 0.0255
  Norm (length): 1.0000

First 10 dimensions:
  dim[   0] =  -0.0234
  dim[   1] =   0.0355
  dim[   2] =  -0.0365
  dim[   3] =   0.0289
  dim[   4] =  -0.0269
  dim[   5] =   0.0027
  dim[   6] =  -0.0496
  dim[   7] =  -0.0257
  dim[   8] =  -0.0022
  dim[   9] =   0.0010

Last 10 dimensions:
  dim[1526] =   0.0224
  dim[1527] =  -0.0026
  dim[1528] =  -0.0008
  dim[1529] =   0.0305
  dim[1530] =  -0.0178
  dim[1531] =  -0.0018
  dim[1532] =  -0.0077
  dim[1533] =   0.0151
  dim[1534] =   0.0352
  dim[1535] =   0.0217
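
The norm of 1.0000 and the standard deviation of 0.0255 are linked. For a unit vector the squared entries sum to 1, so when the mean is close to zero the standard deviation is roughly `1 / sqrt(d)`, which is about 0.0255 for `d = 1536`. The sketch below checks this on a random unit vector; the random vector is a stand-in, not a real embedding.

```python
import numpy as np

dim = 1536
rng = np.random.default_rng(0)

# Random stand-in vector (assumption), scaled to unit length like the embedding above
v = rng.standard_normal(dim).astype(np.float32)
v /= np.linalg.norm(v)

# With sum(v**2) == 1 and mean(v) ~ 0: std**2 = mean(v**2) - mean(v)**2 ~ 1 / dim
predicted_std = 1.0 / np.sqrt(dim)

print(f"norm:          {np.linalg.norm(v):.4f}")
print(f"predicted std: {predicted_std:.4f}")
print(f"observed std:  {v.std():.4f}")
```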


## What Do Embeddings Represent?

The embedding vector captures semantic meaning in a high-dimensional space. While individual dimensions aren't directly interpretable, the **relationships between vectors** are meaningful:

- **Cosine similarity** measures how similar two sentences are semantically
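- **Euclidean distance** can also be used; for unit-norm vectors (the first example above has norm 1.0000) it ranks pairs in the same order as cosine similarity, which is the more common choice for embeddings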
- **Clustering** in this space groups semantically related sentences

Let's compute similarities between our example sentences:



In [18]:
# Compute cosine similarity between example sentences
def cosine_similarity(a, b):
    """Compute cosine similarity between two vectors (0.0 if either has zero norm)."""
    denom = norm(a) * norm(b)  # compute the denominator once
    if denom == 0:
        return 0.0
    return float(dot(a, b) / denom)

print("Cosine Similarity Matrix (Example Sentences):\n")
print(" " * 30, end="")
for ex in examples:
    print(f"{ex['id']:>12}", end="")
print()

similarity_matrix = []
for i, ex1 in enumerate(examples):
    row = []
    print(f"{ex1['id']:>12} {ex1['text'][:15]:>15}...", end="")
    for j, ex2 in enumerate(examples):
        sim = cosine_similarity(ex1["embedding"], ex2["embedding"])
        row.append(sim)
        print(f"{sim:>12.4f}", end="")
    similarity_matrix.append(row)
    print()

# Find most similar pair
max_sim = -1
max_pair = None
for i in range(len(examples)):
    for j in range(i + 1, len(examples)):
        sim = similarity_matrix[i][j]
        if sim > max_sim:
            max_sim = sim
            max_pair = (i, j)

if max_pair:
    print(f"\nMost similar pair:")
    print(f"  {examples[max_pair[0]]['id']}: {examples[max_pair[0]]['text'][:60]}...")
    print(f"  {examples[max_pair[1]]['id']}: {examples[max_pair[1]]['text'][:60]}...")
    print(f"  Similarity: {max_sim:.4f}")



Cosine Similarity Matrix (Example Sentences):

                                 bacon_001   bacon_002   bacon_009   bacon_023   bacon_048
   bacon_001 What is truth? ...      1.0000      0.2458      0.3058      0.1623      0.2190
   bacon_002 Certainly there...      0.2458      1.0000      0.4520      0.2786      0.3321
   bacon_009 Doth any man do...      0.3058      0.4520      1.0000      0.2953      0.2362
   bacon_023 Men fear death,...      0.1623      0.2786      0.2953      1.0000      0.1638
   bacon_048 Religion being ...      0.2190      0.3321      0.2362      0.1638      1.0000

Most similar pair:
  bacon_002: Certainly there be, that delight in giddiness, and count it ...
  bacon_009: Doth any man doubt, that if there were taken out of men's mi...
  Similarity: 0.4520
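
The nested loops above can also be written as a single matrix product: normalize each row, then multiply the matrix by its transpose. The sketch below uses random vectors as a stand-in for the real embeddings, and `cosine_similarity_matrix` is a hypothetical helper name. It also checks the identity that ties Euclidean distance to cosine similarity for unit vectors.

```python
import numpy as np

def cosine_similarity_matrix(vectors: np.ndarray) -> np.ndarray:
    """Return all pairwise cosine similarities for a float array of row vectors."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    # Rows with zero norm stay all-zero instead of triggering a division by zero
    unit = np.divide(vectors, norms, out=np.zeros_like(vectors), where=norms > 0)
    return unit @ unit.T

# Random stand-in data (assumption): 5 vectors with 1536 dimensions
rng = np.random.default_rng(0)
demo = rng.standard_normal((5, 1536))
sims = cosine_similarity_matrix(demo)

# For unit vectors a and b: ||a - b||^2 = 2 - 2 * cos(a, b)
unit = demo / np.linalg.norm(demo, axis=1, keepdims=True)
sq_dist = np.sum((unit[0] - unit[1]) ** 2)
identity_holds = bool(np.isclose(sq_dist, 2 - 2 * sims[0, 1]))
print(identity_holds)
```

Because of that identity, sorting unit vectors by Euclidean distance (ascending) and by cosine similarity (descending) gives the same order.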


In [19]:
# Load contextual statistics to see neighbor similarities
statistics = []
with open(CORE_DIR / "statistics.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        statistics.append(json.loads(line))

# Show statistics for first few sentences
print("Contextual Statistics (First 5 Sentences):\n")
for i in range(5):
    stat = statistics[i]
    sent = sentences[i]
    print(f"[{i}] {stat['id']}: {sent['text'][:50]}...")
    print(f"    sim_prev: {stat['sim_prev']}")
    print(f"    sim_next: {stat['sim_next']}")
    print(f"    context_avg: {stat['context_avg']}")
    print(f"    context_delta: {stat['context_delta']}")
    print()



Contextual Statistics (First 5 Sentences):

[0] bacon_001: What is truth? said jesting Pilate, and would not ...
    sim_prev: None
    sim_next: 0.2458
    context_avg: 0.2458
    context_delta: None

[1] bacon_002: Certainly there be, that delight in giddiness, and...
    sim_prev: 0.2458
    sim_next: 0.3099
    context_avg: 0.2778
    context_delta: 0.0641

[2] bacon_003: And though the sects of philosophers of that kind ...
    sim_prev: 0.3099
    sim_next: 0.3168
    context_avg: 0.3134
    context_delta: 0.0069

[3] bacon_004: But it is not only the difficulty and labor, which...
    sim_prev: 0.3168
    sim_next: 0.6
    context_avg: 0.4584
    context_delta: 0.2832

[4] bacon_005: One of the later school of the Grecians, examineth...
    sim_prev: 0.6
    sim_next: 0.357
    context_avg: 0.4785
    context_delta: 0.243
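
The four contextual fields can be reproduced from the embeddings and chapter labels. The sketch below is inferred from the output above, not taken from the original generation script: `contextual_stats` is a hypothetical name, and the rule that `context_avg` falls back to the single available similarity at a chapter boundary is read off the first record (`sim_prev: None`, `context_avg: 0.2458`).

```python
import numpy as np

def contextual_stats(embeddings, chapters):
    """Compute neighbor similarities per sentence, never crossing chapter boundaries."""

    def cos(a, b):
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return float(np.dot(a, b) / denom) if denom > 0 else 0.0

    rows = []
    n = len(chapters)
    for i in range(n):
        # A neighbor only counts if it belongs to the same chapter
        has_prev = i > 0 and chapters[i - 1] == chapters[i]
        has_next = i < n - 1 and chapters[i + 1] == chapters[i]
        sim_prev = cos(embeddings[i - 1], embeddings[i]) if has_prev else None
        sim_next = cos(embeddings[i], embeddings[i + 1]) if has_next else None
        available = [s for s in (sim_prev, sim_next) if s is not None]
        rows.append({
            "sim_prev": sim_prev,
            "sim_next": sim_next,
            # Mean of whichever similarities exist; None if there are none
            "context_avg": sum(available) / len(available) if available else None,
            # Needs both neighbors, otherwise there is nothing to compare
            "context_delta": abs(sim_next - sim_prev) if len(available) == 2 else None,
        })
    return rows

# Random stand-in data (assumption): 5 sentences across 2 chapters
rng = np.random.default_rng(0)
demo_embeddings = rng.standard_normal((5, 8))
demo_chapters = [1, 1, 1, 2, 2]
demo_rows = contextual_stats(demo_embeddings, demo_chapters)
for row in demo_rows:
    print(row)
```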



## Statistics Dimensions Explained

Each record in `statistics.jsonl` has an `id` plus **14 feature fields** that capture different aspects of the sentence. The next cell walks through each field with concrete examples:



In [20]:
# Detailed explanation of each statistics dimension
# Let's load a few diverse examples to illustrate

example_stats = [
    statistics[0],   # First sentence (chapter 1 start, flagged definition-like)
    statistics[8],   # Question sentence
    statistics[21],  # Last sentence of chapter 1 (no next neighbor)
    statistics[22],  # First sentence of chapter 2 (no previous neighbor)
]

print("=" * 80)
print("STATISTICS DIMENSIONS EXPLAINED")
print("=" * 80)
print()

for i, stat in enumerate(example_stats):
    sent_idx = int(stat['id'].replace('bacon_', '')) - 1
    sent_text = sentences[sent_idx]['text']
    
    print(f"\n{'='*80}")
    print(f"Example {i+1}: {stat['id']}")
    print(f"Text: {sent_text}")
    print(f"{'='*80}\n")
    
    # Basic structural features
    print("1. BASIC STRUCTURAL FEATURES:")
    print(f"   len_char = {stat['len_char']}")
    print(f"      → Number of characters in the sentence")
    print(f"      → Example: '{sent_text[:30]}...' has {stat['len_char']} characters")
    print()
    
    print(f"   len_tok = {stat['len_tok']}")
    print(f"      → Number of tokens (words) when split by whitespace")
    print(f"      → Example: '{sent_text[:30]}...' has {stat['len_tok']} words")
    print()
    
    print(f"   chapter = {stat['chapter']}")
    print(f"      → Which chapter this sentence belongs to (1-indexed)")
    print()
    
    print(f"   position = {stat['position']}")
    print(f"      → Position of this sentence within its chapter (1-indexed)")
    print()
    
    # Content features
    print("2. CONTENT FEATURES:")
    print(f"   is_question = {stat['is_question']}")
    if stat['is_question'] == 1:
        print(f"      → 1 means the sentence ends with '?' (it's a question)")
        print(f"      → Example: '{sent_text}' ends with '?'")
    else:
        print(f"      → 0 means the sentence does NOT end with '?'")
        print(f"      → Example: '{sent_text[:50]}...' does not end with '?'")
    print()
    
    print(f"   is_definition_like = {stat['is_definition_like']}")
    if stat['is_definition_like'] == 1:
        print(f"      → 1 means the sentence contains definition patterns")
        print(f"      → Patterns: ' is ', ' means ', ' refers to ' (case-insensitive)")
        print(f"      → Example: '{sent_text[:60]}...' contains a definition pattern")
    else:
        print(f"      → 0 means no definition patterns found")
        print(f"      → Example: '{sent_text[:50]}...' does not contain definition patterns")
    print()
    
    # Stylistic features
    print("3. STYLISTIC FEATURES:")
    print(f"   punct_density = {stat['punct_density']:.4f}")
    print(f"      → Ratio of punctuation marks to total characters")
    print(f"      → Formula: (count of .,;:!?-\"()[] etc.) / len_char")
    print(f"      → Range: 0.0 (no punctuation) to ~0.1+ (very punctuated)")
    print(f"      → Example: {stat['punct_density']:.4f} means {stat['punct_density']*100:.2f}% of characters are punctuation")
    print()
    
    print(f"   rel_pos_in_chapter = {stat['rel_pos_in_chapter']:.4f}")
    print(f"      → Relative position within the chapter (greater than 0.0, up to 1.0)")
    print(f"      → Formula: position / total_sentences_in_chapter")
    print(f"      → 1/N = first sentence (N = sentences in chapter), ~0.5 = middle, 1.0 = last sentence")
    print(f"      → Example: {stat['rel_pos_in_chapter']:.4f} means this is {stat['rel_pos_in_chapter']*100:.1f}% through the chapter")
    print()
    
    # Contextual similarity features
    print("4. CONTEXTUAL SIMILARITY FEATURES:")
    print(f"   sim_prev = {stat['sim_prev']}")
    if stat['sim_prev'] is not None:
        print(f"      → Cosine similarity with the PREVIOUS sentence in the same chapter")
        print(f"      → Range: -1.0 to 1.0 (typically 0.2 to 0.8 for related text)")
        print(f"      → Higher values = more semantically similar to previous sentence")
        print(f"      → Example: {stat['sim_prev']:.4f} means moderate similarity with previous")
    else:
        print(f"      → null because this is the FIRST sentence in the chapter")
        print(f"      → No previous sentence to compare with")
    print()
    
    print(f"   sim_next = {stat['sim_next']}")
    if stat['sim_next'] is not None:
        print(f"      → Cosine similarity with the NEXT sentence in the same chapter")
        print(f"      → Range: -1.0 to 1.0 (typically 0.2 to 0.8 for related text)")
        print(f"      → Higher values = more semantically similar to next sentence")
        print(f"      → Example: {stat['sim_next']:.4f} means moderate similarity with next")
    else:
        print(f"      → null because this is the LAST sentence in the chapter")
        print(f"      → No next sentence to compare with")
    print()
    
    print(f"   context_avg = {stat['context_avg']}")
    if stat['context_avg'] is not None:
        print(f"      → Average of the available neighbor similarities")
        print(f"      → Formula: (sim_prev + sim_next) / 2 when both neighbors exist")
        print(f"      → With only one neighbor (chapter start or end), equals that single similarity")
        print(f"      → Measures overall contextual similarity (stability indicator)")
        print(f"      → High values = sentence fits well in its context")
        print(f"      → Example: {stat['context_avg']:.4f} means average similarity with neighbors")
    else:
        print(f"      → null only when the sentence has no neighbors in its chapter")
    print()
    
    print(f"   context_delta = {stat['context_delta']}")
    if stat['context_delta'] is not None:
        print(f"      → Absolute difference between sim_next and sim_prev")
        print(f"      → Formula: |sim_next - sim_prev|")
        print(f"      → Measures asymmetry in contextual similarity")
        print(f"      → Low values = similar to both neighbors (balanced)")
        print(f"      → High values = more similar to one neighbor than the other (transition point)")
        print(f"      → Example: {stat['context_delta']:.4f} means {'balanced' if stat['context_delta'] < 0.1 else 'asymmetric'} context")
    else:
        print(f"      → null when only one neighbor exists (can't compute difference)")
    print()
    
    # Positional features
    print("5. POSITIONAL FEATURES:")
    print(f"   is_paragraph_start = {stat['is_paragraph_start']}")
    if stat['is_paragraph_start'] == 1:
        print(f"      → 1 means this sentence starts a new paragraph")
        print(f"      → Either: position == 1 (chapter start) OR previous line was blank")
        print(f"      → Example: '{sent_text[:50]}...' starts a paragraph")
    else:
        print(f"      → 0 means this sentence continues a paragraph")
    print()
    
    print(f"   is_paragraph_end = {stat['is_paragraph_end']}")
    if stat['is_paragraph_end'] == 1:
        print(f"      → 1 means this sentence ends a paragraph")
        print(f"      → Either: last sentence in chapter OR next line is blank")
        print(f"      → Example: '{sent_text[:50]}...' ends a paragraph")
    else:
        print(f"      → 0 means this sentence continues a paragraph")
    print()
    
    if i < len(example_stats) - 1:
        print("\n" + "─" * 80 + "\n")



STATISTICS DIMENSIONS EXPLAINED


Example 1: bacon_001
Text: What is truth? said jesting Pilate, and would not stay for an answer.

1. BASIC STRUCTURAL FEATURES:
   len_char = 69
      → Number of characters in the sentence
      → Example: 'What is truth? said jesting Pi...' has 69 characters

   len_tok = 13
      → Number of tokens (words) when split by whitespace
      → Example: 'What is truth? said jesting Pi...' has 13 words

   chapter = 1
      → Which chapter this sentence belongs to (1-indexed)

   position = 1
      → Position of this sentence within its chapter (1-indexed)

2. CONTENT FEATURES:
   is_question = 0
      → 0 means the sentence does NOT end with '?'
      → Example: 'What is truth? said jesting Pilate, and would not ...' does not end with '?'

   is_definition_like = 1
      → 1 means the sentence contains definition patterns
      → Patterns: ' is ', ' means ', ' refers to ' (case-insensitive)
      → Example: 'What is truth? said jesting Pilate, and would n

## Statistics Summary Table

Here's a quick reference table for all statistics dimensions:

| Dimension | Type | Range | Description |
|-----------|------|-------|-------------|
| `id` | string | - | Sentence identifier (e.g., "bacon_001") |
| `chapter` | int | 1+ | Chapter number |
| `position` | int | 1+ | Position within chapter |
| `len_char` | int | 1+ | Number of characters |
| `len_tok` | int | 1+ | Number of tokens (words) |
| `is_question` | int | 0 or 1 | 1 if ends with "?" |
| `is_definition_like` | int | 0 or 1 | 1 if contains " is ", " means ", " refers to " |
| `punct_density` | float | 0.0-1.0 | Punctuation count / character count |
| `rel_pos_in_chapter` | float | >0.0 to 1.0 | position / total_sentences_in_chapter (1.0 = last sentence) |
| `sim_prev` | float/null | -1.0 to 1.0 | Cosine similarity with previous sentence |
| `sim_next` | float/null | -1.0 to 1.0 | Cosine similarity with next sentence |
| `context_avg` | float/null | -1.0 to 1.0 | Average of sim_prev and sim_next (the single available value at a chapter boundary) |
| `context_delta` | float/null | 0.0+ | Absolute difference `abs(sim_next - sim_prev)`; null at a chapter boundary |
| `is_paragraph_start` | int | 0 or 1 | 1 if starts a paragraph |
| `is_paragraph_end` | int | 0 or 1 | 1 if ends a paragraph |
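
The text-only fields in this table can be approximated with a few string operations. The sketch below is an assumption about how they might be computed, not the original script: `surface_features` is a hypothetical helper, and the punctuation set (`string.punctuation`) is a guess, so `punct_density` may differ from the stored values.

```python
import string

# Patterns listed in the table for is_definition_like
DEFINITION_PATTERNS = (" is ", " means ", " refers to ")

def surface_features(text: str) -> dict:
    """Compute simple length, question, definition, and punctuation features."""
    lowered = text.lower()
    punct_count = sum(ch in string.punctuation for ch in text)
    return {
        "len_char": len(text),
        "len_tok": len(text.split()),  # whitespace tokens
        "is_question": int(text.rstrip().endswith("?")),
        "is_definition_like": int(any(p in lowered for p in DEFINITION_PATTERNS)),
        "punct_density": punct_count / len(text) if text else 0.0,
    }

features = surface_features(
    "What is truth? said jesting Pilate, and would not stay for an answer."
)
print(features)
```

For this sentence the sketch reproduces the values printed earlier for `bacon_001`: 69 characters, 13 tokens, `is_question = 0` (the `?` is not at the end), and `is_definition_like = 1` (the text contains " is ").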

### Use Cases

These statistics can be used for:
- **XGBoost feature engineering**: All numeric features can be used as model inputs
- **Graph edge weighting**: Use similarity scores to weight edges
- **Content analysis**: Identify questions, definitions, and structural patterns
- **Context understanding**: Measure how well sentences fit in their context
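
Before the statistics can feed a model, the `null` entries at chapter boundaries need a numeric representation. One option, sketched below, is to map them to NaN; as far as its documentation describes, XGBoost treats NaN as a missing value by default. `stats_to_matrix` is a hypothetical helper, and the two demo records carry only the fields and values visible in the output earlier in this notebook.

```python
import numpy as np

def stats_to_matrix(records, fields):
    """Stack the chosen fields into a float matrix; None or absent values become NaN."""
    return np.array(
        [[np.nan if rec.get(f) is None else float(rec[f]) for f in fields]
         for rec in records],
        dtype=np.float64,
    )

# Subset of fields for illustration; the real records have more
demo_fields = ["chapter", "position", "sim_prev", "sim_next", "context_avg", "context_delta"]
demo_records = [
    {"id": "bacon_001", "chapter": 1, "position": 1,
     "sim_prev": None, "sim_next": 0.2458, "context_avg": 0.2458, "context_delta": None},
    {"id": "bacon_002", "chapter": 1, "position": 2,
     "sim_prev": 0.2458, "sim_next": 0.3099, "context_avg": 0.2778, "context_delta": 0.0641},
]

X = stats_to_matrix(demo_records, demo_fields)
print(X)
```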



In [21]:
# Compute pairwise similarities on a sample to keep the output small
# The full dataset has 99 * 98 / 2 = 4,851 unique pairs
# Sample: the first 20 sentences (20 * 19 / 2 = 190 unique pairs)

sample_size = 20
sample_embs = embeddings[:sample_size]
sample_sents = sentences[:sample_size]

similarities = []
for i in range(sample_size):
    for j in range(i + 1, sample_size):
        sim = cosine_similarity(sample_embs[i], sample_embs[j])
        similarities.append(sim)

similarities = np.array(similarities)

print(f"Similarity Statistics (first {sample_size} sentences):")
print(f"  Total pairs: {len(similarities)}")
print(f"  Mean similarity: {similarities.mean():.4f}")
print(f"  Std similarity: {similarities.std():.4f}")
print(f"  Min similarity: {similarities.min():.4f}")
print(f"  Max similarity: {similarities.max():.4f}")
print(f"  Median similarity: {np.median(similarities):.4f}")

# Show distribution
print(f"\nSimilarity Distribution:")
print(f"  < 0.2: {np.sum(similarities < 0.2)} pairs")
print(f"  0.2-0.4: {np.sum((similarities >= 0.2) & (similarities < 0.4))} pairs")
print(f"  0.4-0.6: {np.sum((similarities >= 0.4) & (similarities < 0.6))} pairs")
print(f"  0.6-0.8: {np.sum((similarities >= 0.6) & (similarities < 0.8))} pairs")
print(f"  >= 0.8: {np.sum(similarities >= 0.8)} pairs")



Similarity Statistics (first 20 sentences):
  Total pairs: 190
  Mean similarity: 0.3734
  Std similarity: 0.0901
  Min similarity: 0.1202
  Max similarity: 0.6169
  Median similarity: 0.3747

Similarity Distribution:
  < 0.2: 4 pairs
  0.2-0.4: 117 pairs
  0.4-0.6: 66 pairs
  0.6-0.8: 3 pairs
  >= 0.8: 0 pairs
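
The same pair statistics can be computed without Python loops. The sketch below takes the strict upper triangle of a similarity matrix, so each unordered pair is counted once, and bins the values with `np.histogram`. Random vectors stand in for the real embeddings, so the counts will not match the output above.

```python
import numpy as np

# Random stand-in data (assumption): 20 vectors with 1536 dimensions
rng = np.random.default_rng(0)
demo = rng.standard_normal((20, 1536))
unit = demo / np.linalg.norm(demo, axis=1, keepdims=True)
sim_matrix = unit @ unit.T

# Strict upper triangle (k=1): every unordered pair exactly once, no self-pairs
i_idx, j_idx = np.triu_indices(len(unit), k=1)
# Clip guards against rounding slightly past the valid cosine range
pair_sims = np.clip(sim_matrix[i_idx, j_idx], -1.0, 1.0)

# Bins are half-open [lo, hi), except the last one, which includes 1.0
bin_edges = [-1.0, 0.2, 0.4, 0.6, 0.8, 1.0]
counts, _ = np.histogram(pair_sims, bins=bin_edges)

print(f"Total pairs: {len(pair_sims)}")
for lo, hi, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"  [{lo:.1f}, {hi:.1f}): {count} pairs")
```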


In [22]:
# Find nearest neighbors for a specific sentence
target_idx = 0  # First sentence
target_emb = embeddings[target_idx]
target_sent = sentences[target_idx]

print(f"Target sentence: {target_sent['text']}\n")

# Compute similarities with all other sentences
all_similarities = []
for i in range(len(embeddings)):
    if i != target_idx:
        sim = cosine_similarity(target_emb, embeddings[i])
        all_similarities.append((i, sim, sentences[i]))

# Sort by similarity (descending)
all_similarities.sort(key=lambda x: x[1], reverse=True)

print("Top 5 Most Similar Sentences:\n")
for rank, (idx, sim, sent) in enumerate(all_similarities[:5], 1):
    print(f"{rank}. Similarity: {sim:.4f}")
    print(f"   {sent['id']} (Chapter {sent['chapter']}, Position {sent['position']})")
    print(f"   {sent['text']}")
    print()

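print("\nTop 5 Least Similar Sentences:\n")
# The list is sorted in descending order, so reverse the tail to rank the least similar first
for rank, (idx, sim, sent) in enumerate(reversed(all_similarities[-5:]), 1):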
    print(f"{rank}. Similarity: {sim:.4f}")
    print(f"   {sent['id']} (Chapter {sent['chapter']}, Position {sent['position']})")
    print(f"   {sent['text']}")
    print()



Target sentence: What is truth? said jesting Pilate, and would not stay for an answer.

Top 5 Most Similar Sentences:

1. Similarity: 0.4772
   bacon_085 (Chapter 3, Position 38)
   For truth and falsehood, in such things, are like the iron and clay, in the toes of Nebuchadnezzar's image; they may cleave, but they will not incorporate.

2. Similarity: 0.4413
   bacon_007 (Chapter 1, Position 7)
   Truth may perhaps come to the price of a pearl, that showeth best by day; but it will not rise to the price of a diamond, or carbuncle, that showeth best in varied lights.

3. Similarity: 0.4284
   bacon_017 (Chapter 1, Position 17)
   To pass from theological, and philosophical truth, to the truth of civil business; it will be acknowledged, even by those that practise it not, that clear, and round dealing, is the honor of man's nature; and that mixture of falsehoods, is like alloy in coin of gold and silver, which may make the metal work the better, but it embaseth it.

4. Similarity: 0.4217

## Summary

### What Embeddings Enable

1. **Semantic Search**: Find sentences similar to a query
2. **Clustering**: Group related sentences together
3. **Graph Construction**: Connect semantically similar sentences
4. **Feature Engineering**: Use embedding dimensions as features for ML models
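
As a minimal sketch of the semantic search use case, the function below ranks stored vectors against a query vector by cosine similarity. `top_k_similar` is a hypothetical helper that assumes no row has zero norm. In practice the query would be embedded with the same model as the corpus; here a random matrix with one planted near-duplicate stands in for real data.

```python
import numpy as np

def top_k_similar(query, matrix, k=5, exclude=None):
    """Return (row index, cosine similarity) for the k rows most similar to query."""
    q = query / np.linalg.norm(query)
    m = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    sims = m @ q
    if exclude is not None:
        sims[exclude] = -np.inf  # e.g. skip the query's own row
    order = np.argsort(-sims)[:k]  # highest similarity first
    return [(int(i), float(sims[i])) for i in order]

# Random stand-in data (assumption) with a near-duplicate of row 0 planted at row 7
rng = np.random.default_rng(0)
demo = rng.standard_normal((50, 64))
demo[7] = demo[0] + 0.01 * rng.standard_normal(64)

results = top_k_similar(demo[0], demo, k=3, exclude=0)
for idx, sim in results:
    print(f"row {idx}: {sim:.4f}")
```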

### Key Insights

- Each sentence is represented as a **1536-dimensional vector**
- Similarity is measured via **cosine similarity** (normalized dot product)
- Embeddings capture **semantic meaning**, not just word overlap
- The model (`text-embedding-3-small`) was trained on large text corpora to learn these representations

### Next Steps

- Use embeddings for **FAISS indexing** (fast similarity search)
- Build **graph edges** based on semantic similarity thresholds
- Train **XGBoost models** using embedding features + contextual statistics
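
The threshold-based edge idea could look roughly like the sketch below. `similarity_edges` is a hypothetical helper and the threshold is an assumption to be tuned; in the 20-sentence sample above only 3 of 190 pairs reached 0.6, so a high cutoff would yield a sparse graph.

```python
import numpy as np

def similarity_edges(embeddings, threshold):
    """Return (i, j, similarity) for each unordered pair with similarity >= threshold."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = unit @ unit.T
    i_idx, j_idx = np.triu_indices(len(unit), k=1)  # each pair once, no self-loops
    keep = sims[i_idx, j_idx] >= threshold
    return [(int(i), int(j), float(sims[i, j]))
            for i, j in zip(i_idx[keep], j_idx[keep])]

# Tiny hand-made example (assumption): rows 0 and 1 point in nearly the same direction
demo = np.array([[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]])
edges = similarity_edges(demo, threshold=0.9)
print(edges)
```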

