# Bridge Paragraph Embeddings with BGE-M3

Generate embeddings for bridge technical paragraphs using the BAAI/bge-m3 model.

In [1]:
%pip install sentence-transformers torch pandas numpy

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.





In [1]:
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
from pathlib import Path
import json
import time

def print_section(title, char="=", width=80):
    """Print formatted section header."""
    print(f"\n{char * width}")
    print(title.center(width))
    print(f"{char * width}\n")

print("✅ Imports loaded successfully")

✅ Imports loaded successfully


## Load BGE-M3 Model

Download (on first use) and initialize the BGE-M3 embedding model (~2.3 GB).

In [2]:
print_section("Loading BGE-M3 Model")

# Load the model
model = SentenceTransformer('BAAI/bge-m3')

# Report the device the model was actually placed on (e.g. 'cpu', 'cuda', 'mps')
device = model.device.type
print(f"✓ Model loaded on: {device}")
print(f"✓ Max sequence length: {model.max_seq_length} tokens")
print(f"✓ Embedding dimension: {model.get_sentence_embedding_dimension()}D")


                              Loading BGE-M3 Model                              

✓ Model loaded on: cpu
✓ Max sequence length: 8192 tokens
✓ Embedding dimension: 1024D
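
The model above loaded on `cpu`. As a minimal sketch, assuming PyTorch may or may not see an accelerator on a given machine, the device can be chosen explicitly and passed to the `device` argument of `SentenceTransformer`. The helper name `pick_device` is hypothetical and not part of the original notebook.

```python
def pick_device() -> str:
    """Return 'cuda', 'mps', or 'cpu' depending on what PyTorch reports."""
    try:
        import torch  # imported lazily so the helper falls back to 'cpu' without it
    except ImportError:
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # Apple-silicon backend; guarded because older PyTorch builds lack it
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


device = pick_device()
print(f"Selected device: {device}")

# Assumed usage (not executed here because it downloads the model):
# model = SentenceTransformer("BAAI/bge-m3", device=device)
```

On a machine without a supported accelerator this still returns `cpu`, so the rest of the notebook runs unchanged.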


## Load CSV and Generate Embeddings

Process the bridge paragraph CSV and generate embeddings for each paragraph.

In [3]:
print_section("Loading CSV")

# Define paths
csv_path = Path(r'C:\Users\wongb\Bridge-ML\Bridge-ML-LLM-Embedding-Architecture\nlp_processing\nlp_data\bridge_paragraphs.csv')
output_path = Path(r'C:\Users\wongb\Bridge-ML\Bridge-ML-LLM-Embedding-Architecture\nlp_data\bridge_paragraphs_embedded.csv')

# Load the CSV
df = pd.read_csv(csv_path)

print(f"✓ CSV loaded: {csv_path}")
print(f"  Rows: {len(df):,}")
print(f"  Columns: {list(df.columns)}")
print(f"\nFirst paragraph preview:")
print(f"  Structure ID: {df.iloc[0, 0]}")
print(f"  Coordinates: {df.iloc[0, 1]}")
print(f"  Paragraph length: {len(str(df.iloc[0, 2])):,} characters")
print(f"  Paragraph preview: {str(df.iloc[0, 2])[:200]}...")


                                  Loading CSV                                   

✓ CSV loaded: C:\Users\wongb\Bridge-ML\Bridge-ML-LLM-Embedding-Architecture\nlp_processing\nlp_data\bridge_paragraphs.csv
  Rows: 4,914
  Columns: ['STRUCTURE_ID', 'COORDINATES', 'PARAGRAPH']

First paragraph preview:
  Structure ID:       1W
  Coordinates: (48.29745556, -122.6078139)
  Paragraph length: 21,169 characters
  Paragraph preview: Design Peak Ground Acceleration: High seismic demand bridges with PGA values between 0.4 and 0.6 must incorporate advanced engineering techniques to withstand significant seismic forces and prevent st...
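
The preview shows that `STRUCTURE_ID` carries leading whitespace and that `COORDINATES` is stored as a string that looks like a Python tuple. The following is a minimal sketch for cleaning both, assuming every row follows the two-number format seen in the first row and that the order is (latitude, longitude); the source does not confirm either assumption. The helper names `clean_structure_id` and `parse_coordinates` are hypothetical.

```python
import ast


def clean_structure_id(raw: str) -> str:
    """Strip the padding whitespace seen in the STRUCTURE_ID preview."""
    return str(raw).strip()


def parse_coordinates(raw: str) -> tuple[float, float]:
    """Parse a string such as '(48.29, -122.60)' into a pair of floats."""
    value = ast.literal_eval(raw.strip())  # safe parsing of a literal tuple
    if not (isinstance(value, tuple) and len(value) == 2):
        raise ValueError(f"Unexpected coordinate format: {raw!r}")
    first, second = value
    return float(first), float(second)


# Values copied from the first-row preview above
print(clean_structure_id("      1W"))
print(parse_coordinates("(48.29745556, -122.6078139)"))
```

Assuming both columns are read as strings, the same helpers could be applied to the DataFrame with `df['STRUCTURE_ID'].str.strip()` and `df['COORDINATES'].apply(parse_coordinates)`.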


In [12]:
print_section("Generating Embeddings")

# Extract paragraphs from third column (index 2)
paragraphs = df.iloc[:, 2].astype(str).tolist()

print(f"Total paragraphs: {len(paragraphs):,}")
print(f"Average length: {sum(len(p) for p in paragraphs) / len(paragraphs):,.0f} characters")
print(f"Max length: {max(len(p) for p in paragraphs):,} characters")

# Detect the device and report the available CPU cores.
# Note: num_workers is informational only. It is never passed to
# model.encode() below, so encoding runs in a single process.
device = model.device.type
if device == 'cpu':
    import multiprocessing
    num_workers = max(1, multiprocessing.cpu_count() - 1)  # Leave 1 core free
    print(f"\nUsing CPU parallelization with {num_workers} workers")
else:
    # No CPU worker processes are needed when an accelerator is used
    num_workers = 0
    print(f"\nUsing GPU acceleration")

# Generate embeddings and time the run
print(f"\nProcessing {len(paragraphs):,} paragraphs...")
start_time = time.time()

# Encode all paragraphs in batches of 32. Inputs longer than
# model.max_seq_length tokens are truncated by the model.
# normalize_embeddings=False keeps the raw (not unit-length) vectors.
all_embeddings = model.encode(
    paragraphs,
    batch_size=32,
    show_progress_bar=True,
    convert_to_numpy=True,
    normalize_embeddings=False
)

elapsed = time.time() - start_time
print(f"\n✓ Generated {len(all_embeddings):,} embeddings in {elapsed/60:.1f} minutes")
print(f"  Rate: {len(all_embeddings)/elapsed:.1f} embeddings/sec")
print(f"  Embedding shape: {all_embeddings[0].shape}")
print(f"  Embedding type: {type(all_embeddings[0])}")

print_section("Saving Embeddings to CSV")

# Convert each embedding to a plain Python list; to_csv writes it as a bracketed list string
embeddings_as_lists = [emb.tolist() for emb in all_embeddings]

# Add embeddings as fourth column
df['EMBEDDING'] = embeddings_as_lists

# Save to CSV
output_path.parent.mkdir(parents=True, exist_ok=True)
df.to_csv(output_path, index=False)

print(f"✓ CSV saved: {output_path}")
print(f"  Rows: {len(df):,}")
print(f"  Columns: {list(df.columns)}")
print(f"\nFirst embedding preview:")
print(f"  Embedding length: {len(df.iloc[0, 3])}")
print(f"  First 10 values: {df.iloc[0, 3][:10]}")
print(f"\n✅ Complete! Embeddings added to CSV as column 4.")


                             Generating Embeddings                              

Total paragraphs: 4,914
Average length: 21,794 characters
Max length: 23,344 characters

Using CPU parallelization with 19 workers

Processing 4,914 paragraphs...




✓ Generated 4,914 embeddings in 1476.8 minutes
  Rate: 0.1 embeddings/sec
  Embedding shape: (1024,)
  Embedding type: <class 'numpy.ndarray'>

                            Saving Embeddings to CSV                            

✓ CSV saved: C:\Users\wongb\Bridge-ML\Bridge-ML-LLM-Embedding-Architecture\nlp_data\bridge_paragraphs_embedded.csv
  Rows: 4,914
  Columns: ['STRUCTURE_ID', 'COORDINATES', 'PARAGRAPH', 'EMBEDDING']

First embedding preview:
  Embedding length: 1024
  First 10 values: [-0.04067869856953621, 0.006592259742319584, -0.00884220376610756, -0.012684069573879242, 0.03733628988265991, -0.05344384163618088, 0.002753559732809663, 0.02851228415966034, -0.0036807351280003786, 0.03813118115067482]

✅ Complete! Embeddings added to CSV as column 4.
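
Because `to_csv` writes each list as its string representation, the `EMBEDDING` column comes back as text when the file is read again. The sketch below illustrates that round trip with the standard-library `csv` and `json` modules on an in-memory file. The rows and their 3-value vectors are invented toy data, not real embeddings, and the approach assumes the vectors contain only finite floats so that the list string is valid JSON.

```python
import csv
import io
import json

# Toy rows standing in for the real data; the 3-value vectors are invented
rows = [
    {"STRUCTURE_ID": "A1", "EMBEDDING": [0.1, -0.2, 0.3]},
    {"STRUCTURE_ID": "B2", "EMBEDDING": [0.0, 0.5, -1.0]},
]

# Write: each list is stored as its string form, as in the CSV saved above
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["STRUCTURE_ID", "EMBEDDING"])
writer.writeheader()
for row in rows:
    writer.writerow({**row, "EMBEDDING": str(row["EMBEDDING"])})

# Read: the column comes back as text and must be parsed into floats again
buffer.seek(0)
restored = [
    {**row, "EMBEDDING": json.loads(row["EMBEDDING"])}
    for row in csv.DictReader(buffer)
]

print(type(restored[0]["EMBEDDING"]), restored[0]["EMBEDDING"])
```

With pandas, the equivalent step after `pd.read_csv(output_path)` would be `df['EMBEDDING'].apply(json.loads)`. Storing the array separately with `np.save` is an alternative that avoids parsing text altogether.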
