# Spark Solr Indexer Pipeline

This notebook runs the full data pipeline end to end: it authenticates with GCP, generates dummy data, sets up a local Solr instance, and indexes the data with Spark.

In [None]:
import os
from google.cloud import storage

## 1. GCP Authentication
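Authenticate with Google Cloud Platform using the gcloud CLI. The first command signs in the CLI itself; the second creates Application Default Credentials, which the `google-cloud-storage` Python client picks up in the next step.

In [None]:
!gcloud auth login
!gcloud auth application-default login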

## 2. Verify GCP Connection
List storage buckets to verify that we are authenticated and have access.

In [None]:
def list_buckets():
    """Lists all buckets."""
    try:
        storage_client = storage.Client()
        buckets = list(storage_client.list_buckets())
        print(f"Found {len(buckets)} buckets.")
        for bucket in buckets[:5]:  # Print only the first 5
            print(bucket.name)
    except Exception as e:
        print(f"Failed to list buckets: {e}")
        print("Make sure you have set your project ID: gcloud config set project <PROJECT_ID>")

list_buckets()

## 3. Generate Data
Run the Python script to generate dummy JSON data.

In [None]:
# Ensure we are in the project root
if os.path.basename(os.getcwd()) == "notebooks":
    os.chdir("..")
print(f"Current working directory: {os.getcwd()}")

!python3 data_gen/generate_data.py
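
The contents of `data_gen/generate_data.py` are not shown in this notebook. As a rough illustration of what generating dummy JSON data can look like, here is a minimal sketch. The field names, the JSON Lines layout, and the helper names `make_record` and `write_json_lines` are all hypothetical and may not match the real script.

```python
import json
import random
import uuid


def make_record(rng):
    """Build one dummy document; every field name here is hypothetical."""
    return {
        "id": str(uuid.UUID(int=rng.getrandbits(128))),
        "title": f"Item {rng.randint(1, 10_000)}",
        "category": rng.choice(["books", "music", "games"]),
        "price": round(rng.uniform(1.0, 100.0), 2),
    }


def write_json_lines(path, count, seed=42):
    """Write `count` records as JSON Lines, one document per line."""
    rng = random.Random(seed)  # seeded so repeated runs produce the same data
    with open(path, "w", encoding="utf-8") as handle:
        for _ in range(count):
            handle.write(json.dumps(make_record(rng)) + "\n")
    return count
```

One document per line is a convenient layout here because `spark.read.json` reads JSON Lines files by default.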

## 4. Setup Solr
Download and start a local Solr instance.

In [None]:
!./scripts/setup_solr.sh
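
Solr can take a few seconds to start accepting requests after the setup script returns. The sketch below polls Solr's system-info endpoint until it responds. It assumes Solr listens on `http://localhost:8983/solr` (the address used in the verification step); the helper name `wait_for_solr` and its timeout values are illustrative, not part of the project.

```python
import time
import urllib.error
import urllib.request


def wait_for_solr(base_url="http://localhost:8983/solr", timeout_s=60.0, interval_s=2.0):
    """Poll Solr's system-info endpoint until it answers or the timeout expires."""
    url = f"{base_url}/admin/info/system?wt=json"
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # Solr is not accepting connections yet
        time.sleep(interval_s)
    return False
```

Calling `wait_for_solr()` in a cell before the Spark step returns `True` once Solr answers and `False` if the timeout expires first.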

## 5. Index Data with Spark
Submit the Spark job to index the generated data into Solr.

In [None]:
!spark-submit --conf spark.jars.ivy=/tmp/antigravity_ivy --packages com.lucidworks.spark:spark-solr:4.0.0 spark_job/index_to_solr.py
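
The source of `spark_job/index_to_solr.py` is not included here. For orientation, a job of this kind could be sketched as follows. The sketch assumes Solr runs in cloud mode with its embedded ZooKeeper on `localhost:9983`, that the target collection is `dummy_data` (the name queried in the verification step), and that the input is JSON readable by Spark. The function names and the chosen option values are hypothetical. `zkhost`, `collection`, `gen_uniq_key`, and `commit_within` are options documented in the spark-solr README.

```python
def solr_write_options(zkhost="localhost:9983", collection="dummy_data"):
    """Return spark-solr writer options; the default values are assumptions."""
    return {
        "zkhost": zkhost,          # ZooKeeper connection string for SolrCloud
        "collection": collection,  # target Solr collection
        "gen_uniq_key": "true",    # generate a unique key when documents lack one
        "commit_within": "5000",   # ask Solr to commit within 5 seconds
    }


def index_json_to_solr(input_path, options):
    """Read JSON with Spark and write it to Solr through the spark-solr data source."""
    # Imported lazily so the options helper can be used without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("index_to_solr").getOrCreate()
    try:
        df = spark.read.json(input_path)
        # The overwrite save mode mirrors the spark-solr README write example.
        df.write.format("solr").options(**options).mode("overwrite").save()
    finally:
        spark.stop()
```

The `solr` format is only available when the spark-solr package is on the classpath, which is what the `--packages` flag in the `spark-submit` command provides.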

## 6. Verify Indexing
Query the `dummy_data` collection and fetch up to five documents to confirm that the data was indexed.

In [None]:
!curl "http://localhost:8983/solr/dummy_data/select?q=*:*&rows=5"
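
The same check can be done from Python, which makes it easy to read the total document count instead of scanning raw output. This is a minimal sketch that assumes the Solr address and collection name from the `curl` command; the helper names are illustrative. It relies on the standard Solr select response, where the match count is stored under `response.numFound`.

```python
import json
import urllib.parse
import urllib.request


def build_select_url(base_url, collection, query="*:*", rows=5):
    """Build a Solr select URL with properly encoded query parameters."""
    params = urllib.parse.urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{base_url}/{collection}/select?{params}"


def count_indexed(payload):
    """Return the number of matching documents from a parsed select response."""
    return payload["response"]["numFound"]


def fetch_count(base_url="http://localhost:8983/solr", collection="dummy_data"):
    """Query Solr with rows=0 and return only the total document count."""
    url = build_select_url(base_url, collection, rows=0)
    with urllib.request.urlopen(url, timeout=10) as response:
        return count_indexed(json.load(response))
```

With Solr running, `fetch_count()` returns the number of documents in `dummy_data`; a value of zero suggests the Spark job did not write or commit anything.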