# Embed Data and Upload to Azure Search

This notebook:
1. Loads data from **CSV** or **Celonis EMS** (configurable below)
2. Generates vector embeddings using Azure OpenAI
3. Uploads documents to Azure Search
4. Provides checkpoint/resume functionality for large datasets

## Features:
- ✅ Automatic checkpointing every 250 rows
- ✅ Resume from checkpoint if interrupted
- ✅ Progress tracking
- ✅ Error handling
- ✅ CSV or Celonis data source

## 1. Setup - Import utilities and load configuration

In [None]:
import sys
from pathlib import Path
import time
import warnings
from datetime import datetime, timezone
import pandas as pd

# Suppress SSL warnings (needed for IP-based endpoints)
import urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

# Add parent directory to path
sys.path.insert(0, str(Path().absolute().parent))

from utils import (
    config,
    load_csv,
    format_datetime_column,
    init_embedding_tracking,
    save_checkpoint,
    load_checkpoint,
    get_embedding,
    upload_documents,
    print_embedding_summary,
    get_index_stats
)

print("✅ Setup complete")
print(f"   Index: {config.azure_search_index_name}")
print(f"   CSV: {config.csv_file_path}")

## 2. Configuration - Customize for Your Project

**Update these mappings to match your data:**

In [None]:
# Map CSV columns to Azure Search index fields
FIELD_MAPPING = {
    # CSV Column Name : Index Field Name
    "Id": "contract_item_id",
    "SystemContractNumber": "contract_number",
    "SystemContractItemNumber": "contract_item_number",
    "ShortText": "item_text",
    "Name": "vendor_name",
    "NetUnitPrice": "unit_price",
    "Currency": "currency",
    "ValidityPeriodStartDate": "contract_start",
    "ValidityPeriodEndDate": "contract_end"
}

# Which CSV column contains the text to embed?
TEXT_COLUMN_TO_EMBED = "ShortText"

# Which columns contain dates that need formatting?
DATETIME_COLUMNS = ["ValidityPeriodStartDate", "ValidityPeriodEndDate"]

print("✅ Configuration set")

## 2b. Data Source Selection

Choose your data source: `"csv"` (default) or `"celonis"`.

If using Celonis, configure the PQL columns below and ensure your `.env` has the Celonis credentials set.
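
The `name` of each PQL column configured in the next cell has to match a key in `FIELD_MAPPING`; a column whose name does not match is skipped without warning when documents are built. A minimal sketch of a pre-flight check follows. `find_unmapped_columns` is a hypothetical helper, not part of `utils`, and the two sample structures are shortened stand-ins for the real configuration.

```python
def find_unmapped_columns(pql_columns, field_mapping):
    """Compare PQL column names against mapping keys.

    Returns (missing, extra):
      missing - mapping keys that no PQL column provides
      extra   - PQL columns that no mapping key consumes
    """
    pql_names = {col["name"] for col in pql_columns}
    mapped = set(field_mapping)
    return sorted(mapped - pql_names), sorted(pql_names - mapped)


# Shortened, illustrative configuration (not the full notebook config)
sample_mapping = {
    "Id": "contract_item_id",
    "ShortText": "item_text",
    "Name": "vendor_name",
}
sample_pql = [
    {"name": "Id", "query": '"o_celonis_ContractItem"."ID"'},
    {"name": "ShortText", "query": '"o_celonis_ContractItem"."ShortText"'},
    {"name": "Vendor", "query": '"o_celonis_ContractItem"."Name"'},  # name mismatch
]

missing, extra = find_unmapped_columns(sample_pql, sample_mapping)
print("Missing from PQL:", missing)
print("Not in FIELD_MAPPING:", extra)
```

Running the same check against the real `CELONIS_PQL_COLUMNS` and `FIELD_MAPPING` before loading should surface renamed or mistyped columns early.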

In [None]:
# ---- Choose data source ----
DATA_SOURCE = "csv"  # Change to "celonis" to load from Celonis EMS

# ---- Celonis PQL columns (only used when DATA_SOURCE = "celonis") ----
# Each column "name" must match the keys in FIELD_MAPPING above.
CELONIS_PQL_COLUMNS = [
    {"name": "Id",                          "query": '"o_celonis_ContractItem"."ID"'},
    {"name": "SystemContractNumber",        "query": '"o_celonis_ContractItem"."SystemContractNumber"'},
    {"name": "SystemContractItemNumber",    "query": '"o_celonis_ContractItem"."SystemContractItemNumber"'},
    {"name": "ShortText",                   "query": '"o_celonis_ContractItem"."ShortText"'},
    {"name": "Name",                        "query": '"o_celonis_ContractItem"."Name"'},
    {"name": "NetUnitPrice",                "query": '"o_celonis_ContractItem"."NetUnitPrice"'},
    {"name": "Currency",                    "query": '"o_celonis_ContractItem"."Currency"'},
    {"name": "ValidityPeriodStartDate",     "query": '"o_celonis_ContractItem"."ValidityPeriodStartDate"'},
    {"name": "ValidityPeriodEndDate",       "query": '"o_celonis_ContractItem"."ValidityPeriodEndDate"'},
]

print(f"✅ Data source: {DATA_SOURCE}")

## 3. Load Data

Loads from checkpoint (if resuming), Celonis (if selected), or CSV (default).
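
The internals of `load_checkpoint` and `save_checkpoint` are not shown in this notebook. The sketch below illustrates the general pattern under stated assumptions: the checkpoint is a CSV file, a missing file means "no checkpoint" (returned as `None`, matching how the cell below tests the result), and writes go through a temporary file so an interrupted save cannot corrupt the previous checkpoint. `save_rows` and `load_rows` are hypothetical names, the status values are illustrative, and plain dictionaries stand in for the DataFrame.

```python
import csv
import os
import tempfile


def save_rows(path, rows):
    """Write a non-empty list of dicts to a temp file, then swap it into place."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    # os.replace overwrites the target in one step
    os.replace(tmp_path, path)


def load_rows(path):
    """Return the checkpointed rows, or None if no checkpoint exists."""
    if not os.path.exists(path):
        return None
    with open(path, newline="", encoding="utf-8") as fh:
        return list(csv.DictReader(fh))


checkpoint_path = os.path.join(tempfile.mkdtemp(), "checkpoint.csv")

first_load = load_rows(checkpoint_path)  # nothing saved yet
save_rows(checkpoint_path, [
    {"Id": "1", "embedded_status": "success"},
    {"Id": "2", "embedded_status": "pending"},
])
resumed = load_rows(checkpoint_path)

print("Before save:", first_load)
print("After save:", resumed)
```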

In [None]:
# Try to load from checkpoint first
df = load_checkpoint(config.checkpoint_file_path)

if df is not None:
    print("✅ Resumed from checkpoint\n")
elif DATA_SOURCE == "celonis":
    from utils import load_celonis_data
    config.validate_celonis()
    print("📡 Loading data from Celonis EMS...\n")
    df = load_celonis_data(columns=CELONIS_PQL_COLUMNS)
else:
    print("ℹ️  Loading from CSV...\n")
    df = load_csv(config.csv_file_path)

print(f"\nüìä Dataset shape: {df.shape}")
df.head()

## 4. Prepare Data

Format the datetime columns and initialize the tracking columns.
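
Azure AI Search date fields of type `Edm.DateTimeOffset` expect ISO 8601 timestamps that carry a UTC offset or a `Z` suffix. What `format_datetime_column` does internally is not shown here; the sketch below illustrates one plausible conversion, assuming the index fields use that type and that timestamps without a timezone are meant as UTC. `to_search_datetime` is a hypothetical helper.

```python
from datetime import datetime, timezone


def to_search_datetime(value):
    """Convert a date/datetime string or datetime object to 'YYYY-MM-DDTHH:MM:SSZ'.

    Empty values map to None so the field can be uploaded as null.
    Assumption: naive timestamps are interpreted as UTC.
    """
    if value is None or value == "":
        return None
    parsed = value if isinstance(value, datetime) else datetime.fromisoformat(str(value))
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


for raw in ["2024-01-31", "2024-01-31 12:30:00", "2024-01-31T12:30:00+02:00", ""]:
    print(repr(raw), "->", to_search_datetime(raw))
```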

In [None]:
# Format datetime columns for Azure
print("üìÖ Formatting datetime columns...")
for col in DATETIME_COLUMNS:
    if col in df.columns:
        format_datetime_column(df, col)
        print(f"   ‚úì {col}")

# Initialize tracking columns
df = init_embedding_tracking(df)

print("\n‚úÖ Data prepared")

## 5. Test Embedding on One Row

Before processing all the data, test on a single row to verify that everything works.
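
Beyond checking that an embedding came back at all, it can help to confirm that its length matches the vector field of the index, since a mismatch is typically rejected at upload time. A minimal sketch, assuming a 1536-dimensional field (the size produced by `text-embedding-ada-002`; the model and index dimension used in this project are not stated here). `check_embedding` is a hypothetical helper.

```python
import math


def check_embedding(vector, expected_dim):
    """Return a list of problems; an empty list means the vector looks usable."""
    if not vector:
        return ["embedding is empty or None"]
    problems = []
    if len(vector) != expected_dim:
        problems.append(f"expected {expected_dim} dimensions, got {len(vector)}")
    if any(not isinstance(v, (int, float)) or not math.isfinite(v) for v in vector):
        problems.append("embedding contains non-numeric or non-finite values")
    return problems


good = [0.1] * 1536               # stand-in for a real embedding
bad = [0.1, float("nan")]         # wrong length and a NaN value

print(check_embedding(good, 1536))
print(check_embedding(bad, 1536))
```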

In [None]:
# Get a sample row
test_row = df.iloc[0]
test_text = test_row[TEXT_COLUMN_TO_EMBED]

print(f"üß™ Testing embedding on: '{test_text}'\n")

# Generate embedding
embedding = get_embedding(test_text)

if embedding:
    print("✅ Embedding generated successfully")
    print(f"   Dimensions: {len(embedding)}")
    print(f"   First 5 values: {embedding[:5]}")
else:
    print("❌ Failed to generate embedding")

## 6. Check Current Progress

In [None]:
print_embedding_summary(df)

## 7. Process Small Batch (Test)

Process just 5 rows as a test before running the full dataset.
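
The cell below only picks up rows whose `embedded_status` is not `"success"`, which is what makes re-running it safe: completed rows are skipped and failed rows are retried. The sketch below isolates that behaviour with plain dictionaries and a fake embedder that fails once. `process_pending` and `make_flaky_embedder` are hypothetical and involve no Azure calls.

```python
def process_pending(rows, embed, max_rows=None):
    """Embed rows that are not yet 'success'; return (success_count, fail_count)."""
    pending = [r for r in rows if r["embedded_status"] != "success"]
    if max_rows is not None:
        pending = pending[:max_rows]
    ok = failed = 0
    for row in pending:
        vector = embed(row["ShortText"])
        if vector is None:
            row["embedded_status"] = "failed"
            failed += 1
        else:
            row["embedded_status"] = "success"
            ok += 1
    return ok, failed


def make_flaky_embedder(fail_once_for):
    """Fake embedder that returns None the first time it sees the given text."""
    seen = set()

    def embed(text):
        if text == fail_once_for and text not in seen:
            seen.add(text)
            return None
        return [0.0, 1.0]

    return embed


rows = [
    {"ShortText": "CHAIR", "embedded_status": ""},
    {"ShortText": "DESK", "embedded_status": ""},
    {"ShortText": "LAMP", "embedded_status": ""},
]
embed = make_flaky_embedder("DESK")

first_run = process_pending(rows, embed)   # DESK fails once
second_run = process_pending(rows, embed)  # only DESK is retried
print("First run:", first_run)
print("Second run:", second_run)
```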

In [None]:
def process_rows(df, max_rows=None, checkpoint_every=250):
    """Process rows: embed and upload"""
    
    # Filter to pending rows
    pending = df[df["embedded_status"] != "success"]
    
    if len(pending) == 0:
        print("✅ All rows already processed!")
        return
    
    # Limit if requested (report the actual number of rows selected)
    if max_rows is not None:
        pending = pending.head(max_rows)
        print(f"⚙️  Processing {len(pending)} rows (test mode)\n")
    
    total = len(pending)
    success_count = 0
    fail_count = 0
    
    for i, (idx, row) in enumerate(pending.iterrows(), 1):
        doc_id = row.get("Id", f"row_{idx}")
        
        print(f"→ [{i}/{total}] {doc_id}...", end=" ")
        
        # 1. Generate embedding
        text = row[TEXT_COLUMN_TO_EMBED]
        embedding = get_embedding(text)
        
        if embedding is None:
            print("❌ Embedding failed")
            df.at[idx, "embedded_status"] = "failed"
            df.at[idx, "embedded_error"] = "Embedding generation failed"
            df.at[idx, "embedded_at"] = datetime.now(timezone.utc)
            fail_count += 1
            continue
        
        # 2. Prepare document
        doc = {}
        for csv_col, index_field in FIELD_MAPPING.items():
            if csv_col in row.index:
                value = row[csv_col]
                if pd.isna(value):
                    value = None
                doc[index_field] = value
        
        doc["embedding"] = embedding
        
        # 3. Upload
        try:
            success = upload_documents([doc])
            if success:
                print("✅")
                df.at[idx, "embedded_status"] = "success"
                df.at[idx, "embedded_error"] = ""
                df.at[idx, "embedded_at"] = datetime.now(timezone.utc)
                success_count += 1
            else:
                print("❌ Upload failed")
                df.at[idx, "embedded_status"] = "failed"
                df.at[idx, "embedded_error"] = "Upload failed"
                df.at[idx, "embedded_at"] = datetime.now(timezone.utc)
                fail_count += 1
        except Exception as e:
            print(f"❌ {str(e)[:50]}")
            df.at[idx, "embedded_status"] = "failed"
            df.at[idx, "embedded_error"] = str(e)[:2000]
            df.at[idx, "embedded_at"] = datetime.now(timezone.utc)
            fail_count += 1
        
        # Checkpoint periodically
        processed = success_count + fail_count
        if processed % checkpoint_every == 0:
            save_checkpoint(df, config.checkpoint_file_path)
        
        # Rate limiting
        time.sleep(config.sleep_between_requests)
    
    # Final checkpoint
    save_checkpoint(df, config.checkpoint_file_path)
    
    print(f"\n✅ Batch complete: {success_count} success, {fail_count} failed")


# Process 5 rows as a test
process_rows(df, max_rows=5)

## 8. Process All Remaining Rows

⚠️ **This will process ALL remaining rows**

Omit the `max_rows` argument to process everything.

In [None]:
# Process all rows (remove max_rows parameter)
process_rows(df)  # or process_rows(df, max_rows=100) for another test batch

# Show final summary
print("\n" + "="*60)
print_embedding_summary(df)
print("="*60)

## 9. Verify Upload - Check Index Statistics

In [None]:
print("üìä Checking index statistics...\n")
get_index_stats()

## 10. Test Search

Try out all three search modes: text, vector, and hybrid.

In [None]:
from utils import text_search, vector_search, hybrid_search
import json

# Your search query
QUERY = "CHAIR"

def print_results(results, title):
    """Pretty print search results"""
    print(f"\n{'='*80}")
    print(f"{title}")
    print('='*80)
    
    if not results or "value" not in results:
        print("No results found")
        return
    
    for i, doc in enumerate(results["value"], 1):
        score = doc.get("@search.score", "N/A")
        item_id = doc.get("contract_item_id", "N/A")
        item_text = doc.get("item_text", "N/A")
        vendor = doc.get("vendor_name", "N/A")
        price = doc.get("unit_price", "N/A")
        
        print(f"\n{i}. {item_text}")
        print(f"   Score: {score}")
        print(f"   ID: {item_id}")
        print(f"   Vendor: {vendor}")
        print(f"   Price: {price}")

print(f"üîç Searching for: '{QUERY}'")

### Text Search (BM25 - Keyword Matching)

In [None]:
results = text_search(QUERY, top_k=3)
print_results(results, "Text Search (BM25)")

### Vector Search (Semantic Similarity)

In [None]:
results = vector_search(QUERY, top_k=3)
print_results(results, "Vector Search (Semantic)")

### Hybrid Search (Best of Both Worlds) - RECOMMENDED
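
Hybrid search merges the keyword ranking and the vector ranking with Reciprocal Rank Fusion (RRF): each document scores `1 / (k + rank)` in every list it appears in, and the scores are summed. The sketch below shows the idea on two toy rankings. It is an illustration only; the fusion itself happens inside the search service, and `k=60` is the constant commonly quoted for RRF, assumed here rather than read from this project's configuration.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one list of (doc_id, score).

    Each ranking is ordered best-first. Ties are broken by document ID
    so the output is deterministic.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


text_ranking = ["a", "b", "c"]     # toy BM25 result order
vector_ranking = ["b", "c", "d"]   # toy vector result order

fused = reciprocal_rank_fusion([text_ranking, vector_ranking])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.5f}")
```

Documents that appear in both lists ("b" and "c") rise above documents that rank first in only one, which is the behaviour that makes hybrid results robust to a weak keyword or weak semantic match.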

In [None]:
# Specify which fields to search for the BM25 leg
search_fields = ["item_text", "vendor_name", "contract_number"]

results = hybrid_search(
    QUERY, 
    top_k=3,
    search_fields=search_fields
)
print_results(results, "Hybrid Search (Text + Vector with RRF)")

## Next Steps

Your index is now populated with embedded data! You can:

1. **Use the search script:** `python scripts/search_index.py --query "your query" --mode hybrid`
2. **Integrate into your application:** Import the search functions from `utils`
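3. **Process more data:** Run this notebook again with new CSV data. Section 3 tries the checkpoint at `config.checkpoint_file_path` first, so remove or rename a checkpoint left over from an earlier run before loading a new file.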
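
For the integration route, the response shape can be read off `print_results` in section 10: a dictionary with a `value` list whose items carry `@search.score` alongside the index fields. A minimal sketch of turning such a response into plain records follows. `to_records` is a hypothetical helper, and the sample response is made up for illustration rather than taken from a real query.

```python
def to_records(results, min_score=0.0):
    """Flatten a search response into a list of dicts, dropping low-scoring hits."""
    if not results or "value" not in results:
        return []
    records = []
    for doc in results["value"]:
        score = doc.get("@search.score", 0.0)
        if score < min_score:
            continue
        records.append({
            "id": doc.get("contract_item_id"),
            "text": doc.get("item_text"),
            "vendor": doc.get("vendor_name"),
            "price": doc.get("unit_price"),
            "score": score,
        })
    return records


# Made-up response in the shape that print_results expects
sample_response = {
    "value": [
        {"@search.score": 0.9, "contract_item_id": "1", "item_text": "OFFICE CHAIR",
         "vendor_name": "Example Vendor", "unit_price": 120.0},
        {"@search.score": 0.2, "contract_item_id": "2", "item_text": "DESK LAMP",
         "vendor_name": "Example Vendor", "unit_price": 35.0},
    ]
}

for record in to_records(sample_response, min_score=0.5):
    print(record)
```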