# 🎓 Conference RAG - Complete Setup Guide

Welcome! In this notebook, you'll build a **production-ready Retrieval Augmented Generation (RAG) application** that lets users ask questions about conference talks using semantic search and AI-generated answers.

## What You'll Build

A full-stack web application with:

- ✅ User authentication (Supabase magic links)
- ✅ Vector embeddings & semantic search (pgvector)
- ✅ Server-side API key management (Edge Functions)
- ✅ Row Level Security (RLS)
- ✅ Deployed on GitHub Pages

## Architecture

```
┌─────────────┐
│   Browser   │  Student asks question
│  (GitHub    │
│   Pages)    │
└──────┬──────┘
       │
       ├─── Supabase Auth (magic link)
       │
       ├─── Edge Function: embed-question
       │         ↓ OpenAI API (server-side key 🔒)
       │         ↓ Returns embedding vector
       │
       ├─── Supabase Database (pgvector)
       │         ↓ Vector similarity search
       │         ↓ Returns top matching sentences
       │
       └─── Edge Function: generate-answer
                 ↓ OpenAI GPT-4 (server-side key 🔒)
                 ↓ Returns final answer
```

## Learning Objectives

You'll learn:

1. **Vector Embeddings** - How to represent text as numbers
2. **Semantic Search** - Finding similar content without exact keyword matches
3. **RAG Architecture** - Combining retrieval + generation
4. **Server-side Security** - Protecting API keys with Edge Functions
5. **Row Level Security** - User-specific data isolation
6. **Production Deployment** - Real-world application architecture

## Time Estimate

⏱️ **~85 minutes** (grab a coffee!)

## Cost Estimate

💰 **~$0.60** in OpenAI API usage (for 5 years of conference talks)

Let's get started! 🚀

# Part 1: Repository Setup (5 min)

## Step 1: Get Your Own Copy of the Code

Before we begin in Colab, you need your own copy of the conference-rag repository:

### Option A: Using GitHub Template (Recommended)

1. Go to: https://github.com/YOUR-ORG/conference-rag
2. Click **"Use this template"** → **"Create a new repository"**
3. Name it: `my-conference-rag` (or anything you'd like)
4. Make it **public** (required for GitHub Pages free hosting)
5. Click **"Create repository"**

### Option B: Fork the Repository

1. Go to: https://github.com/YOUR-ORG/conference-rag
2. Click **"Fork"** in the top right
3. Create the fork

✅ **You're all set!** Continue below to configure your project.

# Part 2: Supabase Project Setup (10 min)

## Step 2a: Create a Supabase Project

1. Go to [https://supabase.com](https://supabase.com)
2. Sign up / Sign in
3. Click **"New Project"**
4. Fill in:
   - **Name**: `conference-rag` (or anything)
   - **Database Password**: Choose a strong password (save it!)
   - **Region**: Choose closest to you
5. Click **"Create new project"** (takes ~2 minutes)

## Step 2b: Get Your Credentials

Once the project is created:

1. Go to **Settings** (gear icon) → **API**
2. You'll need these values:
   - **Project URL**: `https://xyzabc123.supabase.co`
   - **anon public** key: Long string starting with `eyJ...`
   - **service_role** key: Long string starting with `eyJ...` (click "Reveal")
3. Extract your **Project Reference ID** from the URL:
   - Example: `https://xyzabc123.supabase.co` → Reference ID is `xyzabc123`
4. Get a **Personal Access Token**:
   - Go to [https://supabase.com/dashboard/account/tokens](https://supabase.com/dashboard/account/tokens)
   - Click "Generate new token"
   - Name: "Conference RAG Setup"
   - Copy the token (starts with `sbp_`)
5. Get an **OpenAI API Key**:
   - Go to [https://platform.openai.com/api-keys](https://platform.openai.com/api-keys)
   - Click "Create new secret key"
   - Copy the key (starts with `sk-`)

Now add these to **Colab Secrets** 🔑

## Step 2c: Load Credentials

In [None]:
# @title 🔐 Load Your Credentials from Colab Secrets
# To add secrets in Colab:
# 1. Click the 🔑 key icon in the left sidebar
# 2. Add each secret below (click "+ Add new secret")
# 3. Toggle "Notebook access" ON for each

from google.colab import userdata
import os

# Required secrets:
# - SUPABASE_URL
# - SUPABASE_ANON_KEY
# - SUPABASE_SERVICE_KEY
# - SUPABASE_PROJECT_REF
# - SUPABASE_ACCESS_TOKEN
# - OPENAI_API_KEY

try:
    SUPABASE_URL = userdata.get('SUPABASE_URL')
    SUPABASE_ANON_KEY = userdata.get('SUPABASE_ANON_KEY')
    SUPABASE_SERVICE_KEY = userdata.get('SUPABASE_SERVICE_KEY')
    SUPABASE_PROJECT_REF = userdata.get('SUPABASE_PROJECT_REF')
    SUPABASE_ACCESS_TOKEN = userdata.get('SUPABASE_ACCESS_TOKEN')
    OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')

    # Set environment variable for Supabase CLI
    os.environ['SUPABASE_ACCESS_TOKEN'] = SUPABASE_ACCESS_TOKEN

    print("✅ All credentials loaded!")
    print(f"   Project: {SUPABASE_URL}")
    print(f"   OpenAI Key: {OPENAI_API_KEY[:8]}...")
except Exception as e:
    print(f"❌ Error: {e}")
    print("\nAdd credentials to Colab Secrets (🔑 icon)")
    raise
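
The Project Reference ID described in Step 2b is the first label of the project URL's hostname, so it can be derived from `SUPABASE_URL` instead of being copied by hand. A minimal sketch, assuming the URL has the `https://<ref>.supabase.co` shape shown in Step 2b; `project_ref_from_url` is a hypothetical helper, not part of the Supabase client:

```python
from urllib.parse import urlparse


def project_ref_from_url(url: str) -> str:
    """Return the project reference from a https://<ref>.supabase.co URL."""
    host = urlparse(url).hostname or ""
    # Assumption: hosted projects use the <ref>.supabase.co pattern from Step 2b.
    if not host.endswith(".supabase.co"):
        raise ValueError(f"Not a Supabase project URL: {url!r}")
    return host.split(".")[0]


print(project_ref_from_url("https://xyzabc123.supabase.co"))
```

Comparing this value with the `SUPABASE_PROJECT_REF` secret is a cheap way to catch a copy-paste mistake before the CLI steps later in the notebook.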

# Part 3: Database Schema (10 min)

## Step 3a: Create Database Schema

Now we'll create the database table with pgvector support for semantic search.

**What's pgvector?** It's a PostgreSQL extension that lets you store and search vector embeddings efficiently using vector similarity (cosine distance).
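
To make "cosine distance" concrete, here is a small pure-Python sketch of the quantity the schema's `<=>` operator computes; the `match_sentences()` function below turns it into a similarity with `1 - distance`. The function name `cosine_distance` and the toy 2-dimensional vectors are illustrative assumptions (real embeddings here have 1,536 dimensions):

```python
import math


def cosine_distance(a, b):
    """Cosine distance: 0 for same direction, 1 for orthogonal, 2 for opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


same = cosine_distance([1.0, 0.0], [2.0, 0.0])       # same direction
unrelated = cosine_distance([1.0, 0.0], [0.0, 1.0])  # orthogonal
print(same, unrelated)
```

Because only direction matters, a vector and a scaled copy of it have distance 0, which is why cosine distance suits embeddings whose magnitude carries little meaning.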

In [None]:
# @title 🗄️ Create Database Schema

# Install Supabase Python client
!pip install -q supabase

from supabase import create_client

# Create admin client (uses service_role key)
supabase_admin = create_client(SUPABASE_URL, SUPABASE_SERVICE_KEY)

# SQL to create schema
schema_sql = """
-- Enable pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Create sentence_embeddings table
CREATE TABLE IF NOT EXISTS sentence_embeddings (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    talk_id UUID NOT NULL,
    title TEXT NOT NULL,
    speaker TEXT,
    calling TEXT,
    year INTEGER,
    season TEXT,
    url TEXT,
    sentence_num INTEGER,
    text TEXT NOT NULL,
    embedding vector(1536),
    created_at TIMESTAMPTZ DEFAULT NOW()
);

-- Create index for vector similarity search
CREATE INDEX IF NOT EXISTS sentence_embeddings_embedding_idx
ON sentence_embeddings
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);

-- Create index for talk_id grouping
CREATE INDEX IF NOT EXISTS sentence_embeddings_talk_id_idx
ON sentence_embeddings(talk_id);

-- Enable Row Level Security
ALTER TABLE sentence_embeddings ENABLE ROW LEVEL SECURITY;

-- RLS policy: authenticated users can read
DROP POLICY IF EXISTS "Allow authenticated users to read" ON sentence_embeddings;
CREATE POLICY "Allow authenticated users to read"
ON sentence_embeddings FOR SELECT
TO authenticated
USING (true);

-- Create function for similarity search
CREATE OR REPLACE FUNCTION match_sentences(
  query_embedding vector(1536),
  match_threshold float DEFAULT 0.7,
  match_count int DEFAULT 20
)
RETURNS TABLE (
  id uuid,
  talk_id uuid,
  title text,
  speaker text,
  text text,
  similarity float
)
LANGUAGE sql STABLE
AS $$
  SELECT
    sentence_embeddings.id,
    sentence_embeddings.talk_id,
    sentence_embeddings.title,
    sentence_embeddings.speaker,
    sentence_embeddings.text,
    1 - (sentence_embeddings.embedding <=> query_embedding) as similarity
  FROM sentence_embeddings
  WHERE 1 - (sentence_embeddings.embedding <=> query_embedding) > match_threshold
  ORDER BY sentence_embeddings.embedding <=> query_embedding
  LIMIT match_count;
$$;
"""

print("📝 Running SQL script...")
print("   This creates:")
print("   - pgvector extension")
print("   - sentence_embeddings table")
print("   - Vector similarity search index")
print("   - Row Level Security policies")
print("   - match_sentences() function")
print()

# Execute via Supabase SQL editor (manual step for now)
print("⚠️  Please run this SQL manually:")
print("")
print("1. Go to your Supabase Dashboard")
print("2. Click 'SQL Editor' in the left sidebar")
print("3. Click 'New Query'")
print("4. Paste the SQL below and click 'Run'")
print("")
print("="*60)
print(schema_sql)
print("="*60)
print("")
print("5. Come back here and run the checkpoint below")

## Step 3b: Verify Schema

In [None]:
# ✅ CHECKPOINT 1: Verify Database Setup
try:
    result = supabase_admin.table('sentence_embeddings').select('id', count='exact').limit(1).execute()
    print("✅ Database connection successful!")
    print("   Table 'sentence_embeddings' exists")
    print(f"   Current rows: {result.count or 0}")
except Exception as e:
    print(f"❌ Database check failed: {e}")
    print("   Make sure you ran the SQL above before continuing")
    raise

### 💡 Learning Checkpoint

**What is Row Level Security (RLS)?**

RLS lets you control who can access which rows in a table. In our case:

- ✅ Authenticated users can **read** all sentences
- ❌ Unauthenticated users cannot read anything
- This protects your data even if someone gets your anon key!

**Why sentence-level chunks?**

- Higher precision for fact-based queries
- Natural semantic boundaries
- Can aggregate by talk for context

# Part 4: Frontend Deployment (15 min)

Now let's get your frontend app online! This is where students will actually use the RAG system.

## Step 4a: Update config.js

In your GitHub repository, edit the `config.js` file:

1. Go to your repository on GitHub
2. Click on `config.js`
3. Click the pencil icon (✏️) to edit
4. Replace the placeholder values:

```javascript
const SUPABASE_CONFIG = {
    url: 'YOUR_SUPABASE_URL',      // Replace with your actual URL
    anonKey: 'YOUR_ANON_KEY'       // Replace with your actual anon key
};
```

5. Click "Commit changes"

## Step 4b: Deploy to GitHub Pages

1. Go to your repository **Settings**
2. Click **Pages** in the left sidebar
3. Under "Source":
   - Select **Deploy from a branch**
   - Branch: **main** (or **master**)
   - Folder: **/ (root)**
4. Click **Save**
5. Wait ~2 minutes for deployment

Your site will be at: `https://YOUR-USERNAME.github.io/my-conference-rag/`

## Step 4c: Configure Auth Redirect

Copy your deployed URL and add it to Supabase:

1. Go to Supabase Dashboard → **Authentication** → **URL Configuration**
2. Under "Redirect URLs", click **Add URL**
3. Paste: `https://YOUR-USERNAME.github.io/my-conference-rag/`
4. Click **Save**

## Step 4d: Test Login

1. Visit your deployed site
2. Enter your email
3. Click "Sign In with Magic Link"
4. Check your inbox
5. Click the magic link
6. You should be logged in! ✅

**Expected behavior**: You can log in, but asking questions will fail (we haven't deployed Edge Functions yet).

## ✅ Checkpoint 2

In [None]:
# Verify your deployment
print("🌐 Check list:")
print("")
print("1. ✅ config.js updated with your credentials?")
print("2. ✅ Site deployed to GitHub Pages?")
print("3. ✅ Redirect URL added to Supabase?")
print("4. ✅ Successfully logged in?")
print("")
print("If yes to all, continue! If not, review the steps above.")
print("")
print("Your deployed URL should be:")
print("https://YOUR-USERNAME.github.io/REPO-NAME/")

### 💡 Learning Checkpoint

**Why can't we ask questions yet?**

The frontend is trying to call Edge Functions that don't exist yet:

1. `embed-question` - converts question to vector
2. `generate-answer` - calls GPT-4 for final answer

We'll deploy those next!

# Part 5: Deploy Edge Functions (10 min)

Edge Functions let us call OpenAI's API server-side, keeping our API keys secret. We'll deploy two functions:

1. `embed-question` - Converts user questions to embeddings
2. `generate-answer` - Calls GPT-4 to generate final answers

## Step 5a: Install Supabase CLI

In [None]:
# @title 📦 Install Supabase CLI

# Install Node.js tools (already available in Colab)
!npm install -g supabase@latest

# Verify installation
!supabase --version

print("✅ Supabase CLI installed!")

## Step 5b: Create Edge Function Files

In [None]:
# @title 📝 Create Edge Function Code

import os

# Create directories
!mkdir -p supabase/functions/embed-question
!mkdir -p supabase/functions/generate-answer

# Edge Function 1: embed-question
embed_function_code = '''
import { serve } from "https://deno.land/std@0.168.0/http/server.ts"
import { createClient } from 'https://esm.sh/@supabase/supabase-js@2'

const corsHeaders = {
  'Access-Control-Allow-Origin': '*',
  'Access-Control-Allow-Headers': 'authorization, x-client-info, apikey, content-type',
}

serve(async (req) => {
  if (req.method === 'OPTIONS') {
    return new Response('ok', { headers: corsHeaders })
  }

  try {
    const { question } = await req.json()
    const openaiKey = Deno.env.get('OPENAI_API_KEY')

    // Call OpenAI embeddings API
    const response = await fetch('https://api.openai.com/v1/embeddings', {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'Authorization': `Bearer ${openaiKey}`
      },
      body: JSON.stringify({
        model: 'text-embedding-3-small',
        input: question
      })
    })

    const data = await response.json()

    return new Response(
      JSON.stringify({ embedding: data.data[0].embedding }),
      { headers: { ...corsHeaders, 'Content-Type': 'application/json' } }
    )
  } catch (error) {
    return new Response(
      JSON.stringify({ error: error.message }),
      { headers: { ...corsHeaders, 'Content-Type': 'application/json' }, status: 500 }
    )
  }
})
'''

# Edge Function 2: generate-answer
answer_function_code = '''
import { serve } from "https://deno.land/std@0.168.0/http/server.ts"

const corsHeaders = {
  'Access-Control-Allow-Origin': '*',
  'Access-Control-Allow-Headers': 'authorization, x-client-info, apikey, content-type',
}

serve(async (req) => {
  if (req.method === 'OPTIONS') {
    return new Response('ok', { headers: corsHeaders })
  }

  try {
    const { question, context_talks } = await req.json()
    const openaiKey = Deno.env.get('OPENAI_API_KEY')

    // Build context from talks
    const context = context_talks.map((talk, i) =>
      `Talk ${i+1}: "${talk.title}" by ${talk.speaker}\\n${talk.text}`
    ).join('\\n\\n')

    // Call the OpenAI chat completions API
    const response = await fetch('https://api.openai.com/v1/chat/completions', {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'Authorization': `Bearer ${openaiKey}`
      },
      body: JSON.stringify({
        model: 'gpt-4o-mini',
        messages: [
          {
            role: 'system',
            content: 'You are a helpful assistant answering questions based on conference talks. Use only the provided talks to answer. Cite speakers and talk titles.'
          },
          {
            role: 'user',
            content: `Question: ${question}\\n\\nRelevant Talks:\\n${context}`
          }
        ],
        temperature: 0.7,
        max_tokens: 500
      })
    })

    const data = await response.json()

    return new Response(
      JSON.stringify({ answer: data.choices[0].message.content }),
      { headers: { ...corsHeaders, 'Content-Type': 'application/json' } }
    )
  } catch (error) {
    return new Response(
      JSON.stringify({ error: error.message }),
      { headers: { ...corsHeaders, 'Content-Type': 'application/json' }, status: 500 }
    )
  }
})
'''

# Write files
with open('supabase/functions/embed-question/index.ts', 'w') as f:
    f.write(embed_function_code)

with open('supabase/functions/generate-answer/index.ts', 'w') as f:
    f.write(answer_function_code)

print("✅ Edge Function code created!")
print("   - supabase/functions/embed-question/index.ts")
print("   - supabase/functions/generate-answer/index.ts")

## Step 5c: Deploy Edge Functions

In [None]:
# @title 🚀 Deploy Edge Functions to Supabase

# Link to your project
!supabase link --project-ref {SUPABASE_PROJECT_REF}

# Note: --no-verify-jwt turns off Supabase's JWT check for the function,
# so the endpoint accepts requests that carry no valid user token.

# Deploy embed-question function
print("Deploying embed-question...")
!supabase functions deploy embed-question --no-verify-jwt

# Deploy generate-answer function
print("\nDeploying generate-answer...")
!supabase functions deploy generate-answer --no-verify-jwt

# Set OpenAI API key as secret
print("\nSetting OpenAI API key secret...")
!supabase secrets set OPENAI_API_KEY={OPENAI_API_KEY}

print("\n✅ Edge Functions deployed successfully!")

## Step 5d: Test Edge Functions

In [None]:
# ✅ CHECKPOINT 3: Test Edge Functions
import requests
import json

print("Testing Edge Functions...\n")

# Test embed-question
test_question = "What is faith?"
embed_url = f"{SUPABASE_URL}/functions/v1/embed-question"

try:
    response = requests.post(
        embed_url,
        headers={
            "Authorization": f"Bearer {SUPABASE_ANON_KEY}",
            "Content-Type": "application/json"
        },
        json={"question": test_question}
    )
    result = response.json()

    if 'embedding' in result:
        print("✅ embed-question function works!")
        print(f"   Embedding length: {len(result['embedding'])} dimensions")
    else:
        print(f"❌ Error: {result}")
except Exception as e:
    print(f"❌ Test failed: {e}")

print()

# Test generate-answer
answer_url = f"{SUPABASE_URL}/functions/v1/generate-answer"
test_talks = [
    {
        "title": "Test Talk",
        "speaker": "Test Speaker",
        "text": "This is a test talk about faith. Faith is belief in things hoped for."
    }
]

try:
    response = requests.post(
        answer_url,
        headers={
            "Authorization": f"Bearer {SUPABASE_ANON_KEY}",
            "Content-Type": "application/json"
        },
        json={"question": test_question, "context_talks": test_talks}
    )
    result = response.json()

    if 'answer' in result:
        print("✅ generate-answer function works!")
        print(f"   Answer: {result['answer'][:100]}...")
    else:
        print(f"❌ Error: {result}")
except Exception as e:
    print(f"❌ Test failed: {e}")

### 💡 Learning Checkpoint

**Why Edge Functions instead of client-side API calls?**

🔒 **Security**: API keys stay on the server, never exposed to users

**Compare:**

- ❌ Bad: API key in browser → anyone can steal it
- ✅ Good: API key in Edge Function → only Supabase can access it

This is a **production best practice**!

# Part 6: Scrape Conference Data (20 min)

Now let's get the actual data! We'll scrape 5 years of conference talks from the Church's website.

## Step 6a: Install Dependencies

In [None]:
# @title 📦 Install Scraping Libraries
!pip install -q beautifulsoup4 requests pandas tqdm

print("✅ Libraries installed!")

## Step 6b: Scrape Conference Talks

In [None]:
# @title 🌐 Scrape Conference Talks (5 years)
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
from tqdm.auto import tqdm
from concurrent.futures import ThreadPoolExecutor, as_completed

# How many years to scrape (the range below is inclusive on both ends)
YEARS_TO_SCRAPE = 5
END_YEAR = 2025
START_YEAR = END_YEAR - YEARS_TO_SCRAPE + 1


def setup_session():
    """Create a session with a browser-like User-Agent"""
    session = requests.Session()
    session.headers.update({
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    })
    return session


def get_conference_urls(start_year, end_year):
    """Generate URLs for conferences"""
    base_url = 'https://www.churchofjesuschrist.org/study/general-conference/{year}/{month}?lang=eng'
    return [(base_url.format(year=year, month=month), str(year), month)
            for year in range(start_year, end_year + 1)
            for month in ['04', '10']]


def get_talk_urls(conference_url, year, month, session):
    """Fetch talk URLs from a conference page"""
    try:
        response = session.get(conference_url, timeout=10)
        response.raise_for_status()
    except requests.RequestException:
        # Conference page missing or unreachable: skip it
        return []

    soup = BeautifulSoup(response.text, 'html.parser')
    talk_urls = []
    seen_urls = set()

    # Session slugs to exclude
    session_slugs = [
        'saturday-morning', 'saturday-afternoon', 'sunday-morning', 'sunday-afternoon',
        'priesthood-session', 'women-session', 'womens-session', 'session', 'video'
    ]

    for link in soup.select('div.talk-list a[href*="/study/general-conference/"]'):
        href = link.get('href')
        if not href or 'lang=eng' not in href:
            continue

        canonical_url = 'https://www.churchofjesuschrist.org' + href
        if canonical_url in seen_urls:
            continue
        seen_urls.add(canonical_url)

        # Skip session videos
        if any(slug in canonical_url.lower() for slug in session_slugs):
            continue

        talk_urls.append(canonical_url)

    return talk_urls


def scrape_talk(talk_url, session):
    """Scrape a single talk"""
    try:
        response = session.get(talk_url, timeout=10)
        response.raise_for_status()
    except requests.RequestException:
        # Talk page missing or unreachable: skip it
        return None

    soup = BeautifulSoup(response.text, 'html.parser')

    def clean_text(text):
        if not text:
            return text
        return text.strip()

    title = clean_text(soup.find("h1").text) if soup.find("h1") else "No Title"
    speaker_tag = soup.find("p", {"class": "author-name"})
    speaker = clean_text(speaker_tag.text) if speaker_tag else "Unknown"

    calling_tag = soup.find("p", {"class": "author-role"})
    calling = clean_text(calling_tag.text) if calling_tag else ""

    content_div = soup.find("div", {"class": "body-block"})
    if not content_div:
        return None

    content = " ".join(clean_text(p.text) for p in content_div.find_all("p"))

    year_match = re.search(r'/(\d{4})/', talk_url)
    year = int(year_match.group(1)) if year_match else None
    season = "April" if "/04/" in talk_url else "October"

    return {
        "title": title,
        "speaker": speaker,
        "calling": calling,
        "year": year,
        "season": season,
        "url": talk_url,
        "text": content
    }


# Main scraping logic
print(f"📰 Scraping {YEARS_TO_SCRAPE} years of conference talks ({START_YEAR}-{END_YEAR})...\n")

session = setup_session()
conference_urls = get_conference_urls(START_YEAR, END_YEAR)

# Get all talk URLs
print("Finding talk URLs...")
all_talk_urls = []
for conf_url, year, month in tqdm(conference_urls):
    talk_urls = get_talk_urls(conf_url, year, month, session)
    all_talk_urls.extend(talk_urls)

print(f"Found {len(all_talk_urls)} talks\n")

# Scrape talks in parallel
print("Scraping talk content...")
talks_data = []

with ThreadPoolExecutor(max_workers=10) as executor:
    futures = {executor.submit(scrape_talk, url, session): url for url in all_talk_urls}
    for future in tqdm(as_completed(futures), total=len(all_talk_urls)):
        talk = future.result()
        if talk:
            talks_data.append(talk)

talks_df = pd.DataFrame(talks_data)

print(f"\n✅ Scraped {len(talks_df)} talks successfully!")
print(f"   Years: {talks_df['year'].min()} - {talks_df['year'].max()}")
print(f"   Total words: {talks_df['text'].str.split().str.len().sum():,}")

# Preview
print("\nSample talks:")
print(talks_df[['year', 'season', 'title', 'speaker']].head(10))

### 💡 Learning Checkpoint

The scraper:

1. Finds all conference URLs for the year range
2. Extracts talk URLs (excluding session videos)
3. Scrapes each talk in parallel (10 at a time)
4. Cleans and structures the data

This is real **web scraping** - a valuable data engineering skill!
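
Step 2 of that list (keeping talk links while dropping session pages and duplicates) can be isolated and checked offline. A minimal sketch: the slug list is copied from the scraper cell, while `is_talk_url`, `dedupe_talk_urls`, and the example paths are hypothetical names and placeholder URLs made up for illustration:

```python
# Slugs that mark session or video pages rather than individual talks
# (copied from the scraper cell above).
SESSION_SLUGS = (
    "saturday-morning", "saturday-afternoon", "sunday-morning", "sunday-afternoon",
    "priesthood-session", "women-session", "womens-session", "session", "video",
)


def is_talk_url(url: str) -> bool:
    """True for English-language links that do not contain a session slug."""
    lowered = url.lower()
    return "lang=eng" in url and not any(slug in lowered for slug in SESSION_SLUGS)


def dedupe_talk_urls(urls):
    """Keep the first occurrence of each talk URL, preserving order."""
    seen, kept = set(), []
    for url in urls:
        if url not in seen and is_talk_url(url):
            seen.add(url)
            kept.append(url)
    return kept


base = "https://www.churchofjesuschrist.org/study/general-conference/2024/04/"
sample = [
    base + "example-talk?lang=eng",       # placeholder talk path
    base + "example-talk?lang=eng",       # duplicate link
    base + "saturday-morning?lang=eng",   # session page
    base + "another-example?lang=spa",    # not English
]
print(dedupe_talk_urls(sample))
```

Substring matching is deliberately coarse: any talk whose slug happens to contain "session" or "video" would be dropped too, which is a trade-off of the slug list rather than of this helper.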

# Part 7: Generate Embeddings & Import Data (25 min)

Now we'll convert the text to embeddings and import everything to Supabase.

## Step 7a: Split Talks into Sentences

In [None]:
# @title ✂️ Split Talks into Sentences
import uuid
import re


def split_into_sentences(text):
    """Split text into sentences (simple approach)"""
    # Split on a period followed by a space and a capital letter
    sentences = re.split(r'\. (?=[A-Z])', text)
    # Clean up: trim whitespace and restore the period consumed by the split
    sentences = [s.strip() for s in sentences]
    sentences = [s if s.endswith(('.', '!', '?')) else s + '.' for s in sentences]
    return [s for s in sentences if len(s) > 20]  # Filter very short sentences


# Create sentence records
sentence_records = []

for _, talk in tqdm(talks_df.iterrows(), total=len(talks_df), desc="Splitting into sentences"):
    talk_id = str(uuid.uuid4())
    sentences = split_into_sentences(talk['text'])

    for i, sentence in enumerate(sentences, 1):
        sentence_records.append({
            'talk_id': talk_id,
            'title': talk['title'],
            'speaker': talk['speaker'],
            'calling': talk['calling'],
            'year': talk['year'],
            'season': talk['season'],
            'url': talk['url'],
            'sentence_num': i,
            'text': sentence
        })

sentences_df = pd.DataFrame(sentence_records)

print(f"\n✅ Split {len(talks_df)} talks into {len(sentences_df):,} sentences")
print(f"   Average sentences per talk: {len(sentences_df) / len(talks_df):.1f}")
print(f"   Average sentence length: {sentences_df['text'].str.len().mean():.0f} characters")
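
The splitter in this step only breaks on periods, so questions and exclamations stay glued to the sentence that follows them. As a possible refinement (not what the notebook uses), here is a sketch that splits after `.`, `!`, or `?` and keeps the punctuation by using a lookbehind; `split_sentences` and `min_chars` are hypothetical names:

```python
import re

# Split at whitespace that follows sentence-ending punctuation and precedes
# a capital letter. The lookbehind keeps the punctuation on the sentence.
_SENTENCE_BOUNDARY = re.compile(r'(?<=[.!?])\s+(?=[A-Z])')


def split_sentences(text, min_chars=20):
    """Split text into sentences, dropping fragments of min_chars or fewer."""
    parts = (p.strip() for p in _SENTENCE_BOUNDARY.split(text.strip()))
    return [p for p in parts if len(p) > min_chars]


sample = ("Faith is a principle of action and power. "
          "Why does it matter so much to us? "
          "It changes how we live each day!")
for sentence in split_sentences(sample):
    print(sentence)
```

Like the original, this is a heuristic: abbreviations such as "Dr. Smith" still trigger a split, so a dedicated sentence tokenizer would be the next step if chunk quality matters.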

## Step 7b: Generate Embeddings

In [None]:
# @title 🧠 Generate OpenAI Embeddings (this may take 10-15 minutes)
import time
from openai import OpenAI

# Initialize OpenAI client
client = OpenAI(api_key=OPENAI_API_KEY)


def get_embedding_batch(texts, model="text-embedding-3-small"):
    """Get embeddings for a batch of texts (returns None if the request fails)"""
    try:
        response = client.embeddings.create(
            model=model,
            input=texts
        )
        return [item.embedding for item in response.data]
    except Exception as e:
        print(f"Error: {e}")
        return None


# Process in batches to avoid rate limits
BATCH_SIZE = 100
embeddings = []
failed_indices = []

print(f"Generating embeddings for {len(sentences_df):,} sentences...")
print(f"Batch size: {BATCH_SIZE}\n")

for i in tqdm(range(0, len(sentences_df), BATCH_SIZE)):
    batch_texts = sentences_df['text'].iloc[i:i+BATCH_SIZE].tolist()

    batch_embeddings = get_embedding_batch(batch_texts)

    if batch_embeddings:
        embeddings.extend(batch_embeddings)
    else:
        failed_indices.extend(range(i, min(i+BATCH_SIZE, len(sentences_df))))
        # Add empty embeddings as placeholders so positions stay aligned
        embeddings.extend([None] * len(batch_texts))

    # Light throttling between requests (actual rate limits depend on your account)
    time.sleep(0.1)

# Add embeddings to dataframe
sentences_df['embedding'] = embeddings

# Remove failed embeddings
sentences_df = sentences_df[sentences_df['embedding'].notna()]

print(f"\n✅ Generated {len(sentences_df):,} embeddings")
if failed_indices:
    print(f"   ⚠️ {len(failed_indices)} failed (removed)")

# Estimate cost (word count is only a rough proxy for token count)
total_tokens = sentences_df['text'].str.split().str.len().sum()
cost = (total_tokens / 1_000_000) * 0.020  # $0.020 per 1M tokens
print(f"\n💰 Estimated cost: ${cost:.2f}")
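
In the cell above, a batch that fails once is dropped for good. A common alternative is to retry with exponential backoff before giving up. The sketch below is generic Python, not an OpenAI API; `with_retries` and its parameters are hypothetical names, and the `sleep` argument exists only so the delay can be replaced in tests:

```python
import time


def with_retries(fn, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay, 2x, 4x, ... and try again.

    Re-raises the last exception once max_attempts calls have failed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


# Demo with a stand-in function that fails twice before succeeding.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"

delays = []
print(with_retries(flaky, sleep=delays.append), delays)
```

To use this with the embedding step, the wrapped call would need to raise on failure (for example, wrapping `client.embeddings.create(...)` directly), since `get_embedding_batch` as written catches the exception and returns `None`.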

## Step 7c: Import to Supabase

In [None]:
# @title 💾 Import Data to Supabase

# Convert to list of dicts for insertion
records = sentences_df.to_dict('records')

# Convert embeddings to lists (from numpy arrays if needed)
for record in records:
    if hasattr(record['embedding'], 'tolist'):
        record['embedding'] = record['embedding'].tolist()

print(f"Importing {len(records):,} sentence embeddings to Supabase...")
print("This may take 5-10 minutes...\n")

# Insert in batches
BATCH_SIZE = 100
success_count = 0
error_count = 0

for i in tqdm(range(0, len(records), BATCH_SIZE)):
    batch = records[i:i+BATCH_SIZE]

    try:
        result = supabase_admin.table('sentence_embeddings').insert(batch).execute()
        success_count += len(batch)
    except Exception as e:
        print(f"\nError inserting batch {i//BATCH_SIZE + 1}: {e}")
        error_count += len(batch)
        continue

    # Small delay to avoid overwhelming Supabase
    time.sleep(0.1)

print("\n✅ Import complete!")
print(f"   Success: {success_count:,} sentences")
if error_count > 0:
    print(f"   Errors: {error_count:,} sentences")

## Step 7d: Verify Import

In [None]:
# ✅ CHECKPOINT 4: Verify Data Import

# Check row count
result = supabase_admin.table('sentence_embeddings').select('id', count='exact').limit(1).execute()
row_count = result.count or 0

print(f"✅ Database contains {row_count:,} sentence embeddings")

# Test vector search
if row_count > 0:
    # Get an embedding from our data (sentences_df holds only successful embeddings)
    test_embedding = sentences_df['embedding'].iloc[0]

    # Try the match_sentences function
    result = supabase_admin.rpc('match_sentences', {
        'query_embedding': test_embedding,
        'match_threshold': 0.7,
        'match_count': 5
    }).execute()

    if result.data:
        print("\n✅ Vector search working!")
        print(f"   Found {len(result.data)} similar sentences")
        print("\nTop match:")
        print(f"   Title: {result.data[0]['title']}")
        print(f"   Speaker: {result.data[0]['speaker']}")
        print(f"   Text: {result.data[0]['text'][:100]}...")
        print(f"   Similarity: {result.data[0]['similarity']:.3f}")
    else:
        print("⚠️ No results from vector search (this might be normal)")
else:
    print("❌ No data in database! Check import step above.")

### 💡 Learning Checkpoint

**What just happened?**

1. **Sentence splitting**: ~400 talks → ~80,000 sentences
2. **Embedding generation**: Each sentence → 1,536-dimensional vector
3. **Vector database**: Stored in pgvector for fast similarity search

**Why sentence-level?**

- Research shows sentences preserve semantic meaning
- Higher precision for specific queries
- Can aggregate by talk for context

This is the **core of RAG**: converting text to searchable vectors!
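
To see what the similarity search is doing conceptually, here is a brute-force, in-memory sketch of the same filter-sort-limit logic as `match_sentences()`: score every row, keep those above the threshold, return the best few. The name `match_sentences_local` and the toy 2-dimensional rows are illustrative assumptions; pgvector answers the same question using its index rather than a full scan:

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def match_sentences_local(query, rows, match_threshold=0.7, match_count=20):
    """rows: iterable of (text, embedding). Returns [(similarity, text)], best first."""
    scored = [(cosine_similarity(query, emb), text) for text, emb in rows]
    kept = [pair for pair in scored if pair[0] > match_threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept[:match_count]


rows = [
    ("close match", [0.9, 0.1]),
    ("exact match", [1.0, 0.0]),
    ("unrelated", [0.0, 1.0]),
]
for similarity, text in match_sentences_local([1.0, 0.0], rows):
    print(f"{similarity:.3f}  {text}")
```

The defaults mirror the SQL function's `match_threshold` and `match_count` parameters, so lowering the threshold trades precision for recall in the same way.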

# Part 8: Test Your RAG System! (10 min)

🎉 **Everything is set up!** Let's test the complete system.

## Step 8a: Test from Frontend

1. Go to your deployed site: `https://YOUR-USERNAME.github.io/my-conference-rag/`
2. Make sure you're logged in
3. Ask a question: **"How can I find peace during difficult times?"**
4. Watch the magic happen!

**What's happening behind the scenes:**

```
Your Question
    ↓
Edge Function: embed-question
    ↓ (OpenAI embedding)
Vector Search in pgvector
    ↓ (top 20 sentences)
Group by talk_id, rank
    ↓ (top 3 talks)
Edge Function: generate-answer
    ↓ (GPT-4 with context)
Final Answer! ✨
```

## Step 8b: Test from Colab

In [None]:
# @title 🧪 Test RAG Pipeline End-to-End
from collections import defaultdict


def test_rag_system(question):
    """Test the complete RAG pipeline"""
    print(f"Question: {question}\n")

    # Step 1: Get embedding for question
    print("1️⃣ Getting embedding for question...")
    embed_response = requests.post(
        f"{SUPABASE_URL}/functions/v1/embed-question",
        headers={
            "Authorization": f"Bearer {SUPABASE_ANON_KEY}",
            "Content-Type": "application/json"
        },
        json={"question": question}
    )
    embedding = embed_response.json()['embedding']
    print(f"   ✅ Got {len(embedding)}-dimensional embedding\n")

    # Step 2: Search for similar sentences
    print("2️⃣ Searching for similar sentences...")
    search_result = supabase_admin.rpc('match_sentences', {
        'query_embedding': embedding,
        'match_threshold': 0.6,
        'match_count': 20
    }).execute()

    sentences = search_result.data
    print(f"   ✅ Found {len(sentences)} similar sentences\n")

    # Step 3: Group by talk and rank
    print("3️⃣ Ranking talks by relevance...")
    talk_sentences = defaultdict(list)

    for sent in sentences:
        talk_sentences[sent['talk_id']].append(sent)

    # Sort talks by number of matching sentences
    ranked_talks = sorted(
        talk_sentences.items(),
        key=lambda x: len(x[1]),
        reverse=True
    )[:3]  # Top 3 talks

    print("   ✅ Top 3 relevant talks:\n")
    context_talks = []
    for i, (talk_id, sents) in enumerate(ranked_talks, 1):
        # Get full talk text, in original sentence order
        full_talk_result = supabase_admin.table('sentence_embeddings') \
            .select('title, speaker, text') \
            .eq('talk_id', talk_id) \
            .order('sentence_num') \
            .execute()

        talk_sentences_texts = [s['text'] for s in full_talk_result.data]
        full_text = ' '.join(talk_sentences_texts)

        context_talks.append({
            'title': sents[0]['title'],
            'speaker': sents[0]['speaker'],
            'text': full_text
        })

        print(f"      {i}. \"{sents[0]['title']}\" by {sents[0]['speaker']}")
        print(f"         ({len(sents)} matching sentences)\n")

    # Step 4: Generate answer
    print("4️⃣ Generating answer with GPT-4...")
    answer_response = requests.post(
        f"{SUPABASE_URL}/functions/v1/generate-answer",
        headers={
            "Authorization": f"Bearer {SUPABASE_ANON_KEY}",
            "Content-Type": "application/json"
        },
        json={
            "question": question,
            "context_talks": context_talks
        }
    )
    answer = answer_response.json()['answer']

    print("   ✅ Generated answer!\n")
    print("="*60)
    print("ANSWER:")
    print("="*60)
    print(answer)
    print("="*60)

    return answer


# Test questions
test_questions = [
    "How can I strengthen my faith?",
    "What does the church teach about prayer?",
    "How can I find peace during trials?"
]

print("Testing RAG system with sample questions...\n")
print("="*60)

for q in test_questions:
    test_rag_system(q)
    print("\n" + "="*60 + "\n")
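
The pipeline above ranks talks by how many of their sentences matched, which treats a weak match the same as a strong one. One possible variant (an assumption for illustration, not the notebook's method) ranks by the summed similarity instead, using the `talk_id` and `similarity` fields that `match_sentences()` returns; `rank_talks` is a hypothetical helper:

```python
from collections import defaultdict


def rank_talks(sentences, top_n=3):
    """Rank talk_ids by total similarity of their matched sentences.

    sentences: list of dicts with 'talk_id' and 'similarity' keys.
    Ties on total similarity fall back to the number of matches.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for sent in sentences:
        totals[sent["talk_id"]] += sent["similarity"]
        counts[sent["talk_id"]] += 1
    ranked = sorted(totals, key=lambda t: (totals[t], counts[t]), reverse=True)
    return ranked[:top_n]


matches = (
    [{"talk_id": "A", "similarity": 0.60}] * 3    # three weak matches
    + [{"talk_id": "B", "similarity": 0.95}] * 2  # two strong matches
)
print(rank_talks(matches))
```

With count-based ranking talk A would come first (3 matches against 2); summing similarity puts B ahead because its matches are much closer to the question.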

## ✅ CHECKPOINT 5: Final Verification

In [None]:
# Final system check
print("🎉 FINAL SYSTEM CHECK\n")
print("="*60)

checks = {
    "Database has data": False,
    "Vector search works": False,
    "Embed function works": False,
    "Answer function works": False
}

# Check 1: Database
try:
    result = supabase_admin.table('sentence_embeddings').select('id', count='exact').limit(1).execute()
    if result.count > 0:
        checks["Database has data"] = True
except Exception:  # Any failure leaves the check marked as failed
    pass

# Check 2: Vector search
try:
    result = supabase_admin.rpc('match_sentences', {
        'query_embedding': embeddings[0],
        'match_threshold': 0.6,  # Same threshold as the end-to-end test cell
        'match_count': 5
    }).execute()
    if result.data:
        checks["Vector search works"] = True
except Exception:
    pass

# Check 3: Embed function
try:
    response = requests.post(
        f"{SUPABASE_URL}/functions/v1/embed-question",
        headers={"Authorization": f"Bearer {SUPABASE_ANON_KEY}", "Content-Type": "application/json"},
        json={"question": "test"}
    )
    if response.ok:
        checks["Embed function works"] = True
except Exception:
    pass

# Check 4: Answer function
try:
    response = requests.post(
        f"{SUPABASE_URL}/functions/v1/generate-answer",
        headers={"Authorization": f"Bearer {SUPABASE_ANON_KEY}", "Content-Type": "application/json"},
        json={"question": "test", "context_talks": [{"title": "Test", "speaker": "Test", "text": "Test"}]}
    )
    if response.ok:
        checks["Answer function works"] = True
except Exception:
    pass

# Print results
for check, passed in checks.items():
    status = "✅" if passed else "❌"
    print(f"{status} {check}")

all_passed = all(checks.values())

print("\n" + "="*60)
if all_passed:
    print("🎉 ALL SYSTEMS GO! Your RAG application is ready!")
    print("\nNext: Visit your deployed site and try asking questions!")
else:
    print("⚠️ Some checks failed. Review the steps above.")
print("="*60)
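The checks above discard any exception, so a ❌ tells you that something failed but not why. If you need to debug a failing check, a small runner that records the error message can help. This is only a sketch: `run_checks` and the two demo checks are hypothetical names, and in practice you would pass in functions that wrap the Supabase and Edge Function calls from the cell above.

```python
def run_checks(checks):
    """Run each named zero-argument check and return {name: (passed, detail)}."""
    results = {}
    for name, check in checks.items():
        try:
            passed = bool(check())
            detail = "" if passed else "check returned a falsy value"
            results[name] = (passed, detail)
        except Exception as exc:
            # Keep the exception type and message so the failure is explainable
            results[name] = (False, f"{type(exc).__name__}: {exc}")
    return results


def _always_ok():
    return True


def _raises():
    raise ConnectionError("network unreachable")


demo = run_checks({"ok": _always_ok, "broken": _raises})
for name, (passed, detail) in demo.items():
    print("PASS" if passed else "FAIL", name, detail)
```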

# Part 9: Reflection & Next Steps

## 🎓 What You Learned

Congratulations! You just built a production-ready RAG application from scratch.

### Technical Skills

✅ **Vector Embeddings** - Converted text to 1,536-dimensional vectors  
✅ **Semantic Search** - Used pgvector for similarity search  
✅ **RAG Architecture** - Combined retrieval + generation  
✅ **Edge Functions** - Deployed serverless functions  
✅ **Row Level Security** - Protected data with RLS policies  
✅ **Production Deployment** - Deployed to GitHub Pages  

### Key Concepts

**Why RAG instead of fine-tuning?**
- ✅ Cheaper (no model training)
- ✅ Updatable (just add new data)
- ✅ Transparent (shows sources)
- ✅ Grounded (answers are based on the retrieved talk text)

**Why sentence-level chunking?**
- A sentence usually carries one idea, so its embedding is not diluted by unrelated text
- Higher precision for factual queries
- Can aggregate by document for context

**Why Edge Functions?**
- 🔒 Keeps API keys server-side
- 🚀 Serverless (scales automatically)
- 💰 Cost-effective (pay per request)

### Architecture You Built

```
Student Question
    ↓
Frontend (GitHub Pages)
    ↓ (authenticated via Supabase Auth)
Edge Function: embed-question
    ↓ (converts to 1,536-dim vector)
Supabase Database (pgvector)
    ↓ (finds top 20 similar sentences)
    ↓ (groups by talk, ranks by count)
    ↓ (returns top 3 talks)
Edge Function: generate-answer
    ↓ (GPT-4 with talk context)
Final Answer ✨
```

## 🚀 Optional Extensions

Want to take this further? Try these challenges:

### 1. Add Question History

**Goal**: Track each user's past questions and answers

**How**:
- Add `question_history` table
- Store: user_id, question, answer, timestamp
- Display in sidebar

**Learning**: Database design, user-specific data

### 2. Implement Caching

**Goal**: Save money by reusing embeddings for common questions

**How**:
- Hash questions → cache key
- Store in `cached_embeddings` table
- Check cache before calling OpenAI

**Learning**: Performance optimization, caching strategies

### 3. Add Talk Recommendations

**Goal**: "You might also like these talks..."

**How**:
- After showing answer, find similar talks
- Use the same embedding, but exclude already shown talks
- Display 3 recommendations

**Learning**: Recommendation systems

### 4. Build Analytics Dashboard

**Goal**: See what people are asking about

**How**:
- Track popular questions
- Track popular talks (based on matches)
- Create charts with Chart.js

**Learning**: Data analytics, visualization

### 5. Multi-language Support

**Goal**: Support Spanish, Portuguese, etc.

**How**:
- Scrape talks in other languages
- Translate questions before embedding
- Return answers in user's language

**Learning**: Internationalization, translation APIs

### 6. Improved Chunking

**Goal**: Compare different chunking strategies

**How**:
- Try paragraph-level chunks
- Try semantic chunks (LangChain)
- A/B test which performs better

**Learning**: Advanced RAG techniques, experimentation

## 📚 Additional Resources

### RAG & Vector Databases
- [Supabase pgvector Guide](https://supabase.com/docs/guides/ai)
- [OpenAI Embeddings Guide](https://platform.openai.com/docs/guides/embeddings)
- [RAG Best Practices (Weaviate)](https://weaviate.io/blog/rag-evaluation)

### Chunking Strategies
- [Chunking for RAG (LangChain)](https://python.langchain.com/docs/modules/data_connection/document_transformers/)
- [Chunking Research (2024)](https://www.superlinked.com/vectorhub/articles/chunking-vs-semantic-splitting)

### Production Deployment
- [Supabase Edge Functions Docs](https://supabase.com/docs/guides/functions)
- [GitHub Pages Guide](https://pages.github.com/)

## 🎉 You Did It!

You now have:
- A working RAG application
- Hands-on experience with vector databases
- Knowledge of production architecture patterns
- A portfolio project to show employers!

**What's next?** Share your project, try the extensions, or help a classmate!

---

**Questions or issues?** Check the troubleshooting guide in the repository README.

**Enjoyed this?** Give the repo a ⭐ on GitHub!
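If you try Extension 2 (Implement Caching), the "hash questions into a cache key" step could look something like the sketch below. Everything here is an assumption for illustration: `question_cache_key`, `EmbeddingCache`, and `fake_embed` are hypothetical names, and the in-memory dict stands in for the `cached_embeddings` table. Normalizing case and whitespace before hashing lets trivially different spellings of the same question share one cache entry.

```python
import hashlib
import re


def question_cache_key(question: str) -> str:
    """Normalize case and whitespace, then hash to a fixed-length cache key."""
    normalized = re.sub(r"\s+", " ", question.strip().lower())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class EmbeddingCache:
    """In-memory stand-in for a `cached_embeddings` table."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn  # Called only on a cache miss
        self._store = {}
        self.misses = 0

    def get(self, question: str):
        key = question_cache_key(question)
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed_fn(question)
        return self._store[key]


def fake_embed(question: str):
    # Hypothetical placeholder for the call to the embed-question Edge Function
    return [float(len(question))]


cache = EmbeddingCache(fake_embed)
first = cache.get("How can I strengthen my faith?")
second = cache.get("  how can I   strengthen my FAITH? ")
print(cache.misses)
```

Both calls resolve to the same key, so the placeholder embed function runs only once. A real version would read and write rows in the database instead of a dict.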