## Prerequisites

- Java 8, 11, or 17 installed and configured
- Apache Spark installed, with `spark-submit` on PATH
- Python 3.12+ with dependencies installed (`uv sync`)

In [2]:
import json
import os

# Ensure we are in the project root
if os.path.basename(os.getcwd()) == "notebooks":
    os.chdir("..")
print(f"Current working directory: {os.getcwd()}")

Current working directory: /Users/r39132/Projects/spark-solr-indexer
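
The directory check above only covers the case where the notebook is started from `notebooks/`. A slightly more general approach, sketched below, walks up from the current directory until it finds a marker file. The helper name `find_project_root` and the choice of `pyproject.toml` as the marker are assumptions for illustration (the project uses `uv sync`, which normally reads a `pyproject.toml`); adjust the marker to whatever file identifies the root in your checkout.

```python
from pathlib import Path
from typing import Optional


def find_project_root(start: Path, marker: str = "pyproject.toml") -> Optional[Path]:
    """Return the nearest directory at or above `start` that contains `marker`.

    Returns None if no ancestor contains the marker file.
    """
    start = start.resolve()
    # Check the starting directory first, then each parent up to the filesystem root
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    return None
```

A possible use is `root = find_project_root(Path.cwd())` followed by `os.chdir(root)` when `root` is not `None`.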


## 1. Check Environment

Verify that Java and Spark are properly configured.

In [3]:
# Check Java version
print("Java Version:")
!java -version

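print("\nSpark Version:")
# spark-submit may log warnings before its version banner,
# so the first line of output is not always the version itself
!spark-submit --version 2>&1 | head -n 1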

Java Version:
openjdk version "17.0.8.1" 2023-08-22 LTS
OpenJDK Runtime Environment Corretto-17.0.8.8.1 (build 17.0.8.1+8-LTS)
OpenJDK 64-Bit Server VM Corretto-17.0.8.8.1 (build 17.0.8.1+8-LTS, mixed mode, sharing)

Spark Version:
25/11/24 12:53:59 WARN Utils: Your hostname, Sid-Home-MBP.local resolves to a loopback address: 127.0.0.1; using 10.0.4.171 instead (on interface en0)
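
The recorded output shows a WARN line instead of the Spark version because `head -n 1` keeps only the first line that `spark-submit` prints. The sketch below shows one way to run the same checks from Python and pick out the line of interest. The helper names (`tool_output`, `first_line_containing`, `parse_java_version`) are hypothetical, and the assumption that the relevant line contains the word `version` is based only on the Java output shown above.

```python
import re
import shutil
import subprocess
from typing import List, Optional


def tool_output(command: List[str]) -> Optional[str]:
    """Run `command` and return stdout plus stderr, or None if the tool is not on PATH."""
    if shutil.which(command[0]) is None:
        return None
    result = subprocess.run(command, capture_output=True, text=True, check=False)
    # `java -version` and `spark-submit --version` write to stderr, so keep both streams
    return result.stdout + result.stderr


def first_line_containing(output: str, needle: str) -> Optional[str]:
    """Return the first line that contains `needle`, skipping unrelated log lines."""
    for line in output.splitlines():
        if needle in line:
            return line.strip()
    return None


def parse_java_version(output: str) -> Optional[str]:
    """Extract the quoted version string from `java -version` output."""
    match = re.search(r'version "([^"]+)"', output)
    return match.group(1) if match else None
```

For example, `parse_java_version(tool_output(["java", "-version"]) or "")` returns `None` when Java is missing rather than raising.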


## 2. Generate Data

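Run `data_gen/generate_data.py` to generate dummy JSON data locally, one document per line in `data/dummy_data.json`. Generation is skipped if the file already contains records.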

In [4]:
# Check if data already exists
data_exists = False
if os.path.exists("data/dummy_data.json"):
    with open("data/dummy_data.json") as f:
        lines = f.readlines()
    if len(lines) > 0:
        print(f"⏭️  Skipping data generation: {len(lines)} documents already exist")
        data_exists = True

# Generate data if needed
if not data_exists:
    print("Generating data...")
    !python3 data_gen/generate_data.py

# Verify data was created/exists (an empty file counts as a failure)
lines = []
if os.path.exists("data/dummy_data.json"):
    with open("data/dummy_data.json") as f:
        lines = f.readlines()
if lines:
    print(f"✓ Data ready: {len(lines)} records")
    print(f"Sample record: {lines[0][:100]}...")
else:
    print("✗ Data generation failed")

⏭️  Skipping data generation: 1000 documents already exist
✓ Data ready: 1000 records
Sample record: {"id": "0", "title": "Itself begin trip fly with you too.", "description": "Game small too thus quic...
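
The cell above counts lines but does not confirm that each line is valid JSON. A minimal sketch of a stricter check follows, assuming the file holds one JSON object per line and that every record carries an `id` field, as the sample record does. The function name `count_json_lines` is hypothetical.

```python
import json
from typing import Tuple


def count_json_lines(path: str) -> Tuple[int, int]:
    """Return (valid, invalid) record counts for a file with one JSON object per line."""
    valid = 0
    invalid = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # ignore blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                invalid += 1
                continue
            # Count a record as valid only if it is an object with an "id" field
            if isinstance(record, dict) and "id" in record:
                valid += 1
            else:
                invalid += 1
    return valid, invalid
```

Calling `count_json_lines("data/dummy_data.json")` before indexing gives an early signal if the generator produced malformed lines.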


## 3. Setup Local Solr

Download and start a local Solr instance on port 8983 using `scripts/setup_solr.sh`. Setup is skipped if Solr already responds on that port.

In [5]:
import time

import requests

# Check if Solr is already running
try:
    response = requests.get("http://localhost:8983/solr/admin/info/system", timeout=2)
    if response.status_code == 200:
        print("⏭️  Skipping Solr setup: already running")
        solr_info = response.json()
        print(f"  Version: {solr_info['lucene']['solr-spec-version']}")
        solr_running = True
    else:
        solr_running = False
except Exception:
    solr_running = False

# Start Solr if not running
if not solr_running:
    print("Setting up Solr...")
    !./scripts/setup_solr.sh
    time.sleep(5)  # Fixed delay to give Solr time to start; not a readiness check

# Final verification
try:
    response = requests.get("http://localhost:8983/solr/admin/info/system", timeout=5)
    if response.status_code == 200:
        print("✓ Solr is running")
        solr_info = response.json()
        print(f"  Version: {solr_info['lucene']['solr-spec-version']}")
    else:
        print("✗ Solr is not responding")
except Exception as e:
    print(f"✗ Cannot connect to Solr: {e}")

⏭️  Skipping Solr setup: already running
  Version: 8.11.3
✓ Solr is running
  Version: 8.11.3
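
The setup cell waits a fixed five seconds after starting Solr, which may be too short on a slow machine. A common alternative is to poll the system-info endpoint until it answers. The sketch below is one way to do that; `wait_until` and `solr_is_up` are hypothetical helpers, and the probe uses the standard library's `urllib` instead of `requests` so that it has no extra dependency.

```python
import time
import urllib.request
from typing import Callable


def wait_until(check: Callable[[], bool], timeout_s: float = 60.0, interval_s: float = 1.0) -> bool:
    """Call `check` repeatedly until it returns True or `timeout_s` seconds elapse."""
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


def solr_is_up(url: str = "http://localhost:8983/solr/admin/info/system") -> bool:
    """Return True if the Solr system-info endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as response:
            return response.status == 200
    except OSError:
        # Covers connection refused, timeouts, and HTTP error responses
        return False
```

In the notebook, `wait_until(solr_is_up, timeout_s=60)` could stand in for the fixed `time.sleep(5)`; the 60-second limit is an arbitrary choice.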


## 4. Index Data with Local Spark

Submit the Spark job to index the generated data into local Solr.

In [6]:
# Check if indexing is already complete
def check_indexing_complete():
    """Return (True, count) if Solr already holds every local document, else (False, 0)."""
    try:
        # Get local document count
        with open("data/dummy_data.json") as f:
            local_count = sum(1 for _ in f)

        # Get Solr document count
        response = requests.get(
            "http://localhost:8983/solr/dummy_data/select?q=*:*&rows=0", timeout=5
        )
        if response.status_code == 200:
            solr_count = response.json()["response"]["numFound"]

            if local_count == solr_count and solr_count > 0:
                # Verify sample document exists
                with open("data/dummy_data.json") as f:
                    first_doc = json.loads(f.readline())
                    doc_id = first_doc["id"]

                check_response = requests.get(
                    f"http://localhost:8983/solr/dummy_data/select?q=id:{doc_id}&rows=1", timeout=5
                )
                if check_response.status_code == 200:
                    match_count = check_response.json()["response"]["numFound"]
                    if match_count > 0:
                        return True, solr_count
        return False, 0
    except Exception:
        return False, 0


already_indexed, doc_count = check_indexing_complete()

if already_indexed:
    print(f"⏭️  Skipping indexing: {doc_count} documents already indexed and verified")
else:
    print("Indexing data with Spark...")
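    !spark-submit \
        --master "local[*]" \
        --packages com.lucidworks.spark:spark-solr:4.0.0 \
        spark_job/index_to_solr.py
    # IPython stores the exit status of the last `!` command in _exit_code
    if _exit_code == 0:
        print("✓ Indexing complete")
    else:
        print(f"✗ spark-submit exited with code {_exit_code}")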

⏭️  Skipping indexing: 1000 documents already indexed and verified
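
The check above interpolates the document ID directly into the query string (`q=id:{doc_id}`). That works for the numeric IDs in the dummy data, but an ID containing characters such as `:` or a space would change the meaning of the query. The sketch below escapes the term and URL-encodes the parameters. Both helper names are hypothetical, and the set of escaped characters is modeled on my understanding of what SolrJ's `ClientUtils.escapeQueryChars` escapes, so treat it as an assumption and verify it against your Solr version.

```python
from urllib.parse import urlencode

# Characters treated as special by Solr's standard query parser (assumed set)
_SOLR_SPECIAL_CHARS = set('\\+-!():^[]"{}~*?|&;/')


def escape_solr_term(value: str) -> str:
    """Backslash-escape query syntax characters and whitespace in a single term."""
    return "".join(
        "\\" + ch if ch in _SOLR_SPECIAL_CHARS or ch.isspace() else ch
        for ch in value
    )


def build_select_url(base_url: str, core: str, query: str, rows: int = 0) -> str:
    """Build a /select URL with the query parameters percent-encoded."""
    params = urlencode({"q": query, "rows": rows})
    return f"{base_url}/solr/{core}/select?{params}"
```

With these helpers, the sample-document lookup could be written as `build_select_url("http://localhost:8983", "dummy_data", "id:" + escape_solr_term(doc_id), rows=1)`.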


## 5. Verify Indexing

Query local Solr to ensure data was indexed successfully.

In [7]:
# Query Solr for document count
try:
    response = requests.get("http://localhost:8983/solr/dummy_data/select?q=*:*&rows=0", timeout=5)
    if response.status_code == 200:
        result = response.json()
        num_docs = result["response"]["numFound"]
        print(f"✓ Indexed {num_docs} documents in Solr")
    else:
        print("✗ Failed to query Solr")
except Exception as e:
    print(f"✗ Query failed: {e}")

# Show sample documents
print("\nSample documents:")
!curl -s "http://localhost:8983/solr/dummy_data/select?q=*:*&rows=3" | python3 -m json.tool

✓ Indexed 1000 documents in Solr

Sample documents:
{
    "responseHeader": {
        "zkConnected": true,
        "status": 0,
        "QTime": 0,
        "params": {
            "q": "*:*",
            "rows": "3"
        }
    },
    "response": {
        "numFound": 1000,
        "start": 0,
        "numFoundExact": true,
        "docs": [
            {
                "author": "Danielle Schmidt",
                "category": "Music",
                "created_at": "2025-10-16T12:04:32.042766",
                "description": "Memory character his mean. Center measure thousand player thought become tough else.",
                "id": "0",
                "in_stock": true,
                "price": 219.12,
                "title": "Seven reach sport middle.",
                "_version_": 1849702511848456192
            },
            {
                "author": "Todd Stevens",
                "category": "Science",
                "created_at": "2025-04-24T14:51:10.656735",
           
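
The raw JSON is verbose for a quick sanity check. A minimal sketch that reduces a select response to the hit count and a few fields per document is shown here. The function name `summarize_select_response` is hypothetical, and the default field names are taken from the sample output above.

```python
from typing import Any, Dict, List, Tuple


def summarize_select_response(
    payload: Dict[str, Any],
    fields: Tuple[str, ...] = ("id", "title", "category", "price"),
) -> Tuple[int, List[Dict[str, Any]]]:
    """Return (numFound, rows), where each row keeps only the requested fields."""
    response = payload.get("response", {})
    num_found = int(response.get("numFound", 0))
    # Missing fields become None rather than raising KeyError
    rows = [{name: doc.get(name) for name in fields} for doc in response.get("docs", [])]
    return num_found, rows
```

Passing the result of `requests.get(...).json()` to this function yields a compact summary that is easier to scan than the full response.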

## 6. Cleanup (Optional)

Stop Solr and delete the generated data. The commands are commented out so that running every cell does not remove anything by accident.

In [8]:
# Uncomment to stop Solr and clean data
# !./solr-dist/solr-8.11.3/bin/solr stop -all
# !rm -f data/dummy_data.json
# print("✓ Cleaned up")
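
If a shell is not available, the data file can also be removed from Python. The sketch below deletes a single regular file and reports whether anything was removed; the helper name `remove_file_if_present` is hypothetical. Stopping Solr still requires the `bin/solr stop` command shown above.

```python
from pathlib import Path


def remove_file_if_present(path: str) -> bool:
    """Delete `path` if it is a regular file; return True if a file was removed."""
    target = Path(path)
    if target.is_file():
        target.unlink()
        return True
    # Directories and missing paths are left untouched
    return False
```

For this notebook the call would be `remove_file_if_present("data/dummy_data.json")`.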