# World_History_Snowflake
Project to create and populate a DB with both structured and unstructured data


# Instructions

First, set the timeout on this notebook to 1 (or 2) hours.  The entire run takes a while, and if the notebook times out you will have to work out where to resume it.

There are two (2) notebooks that set up, process, and create the necessary data.
1.  `1-WORLD_HISTORY_Setup_and_PDF_Ingestion.ipynb` 
    - Creates the database, stage and tables for the unstructured PDF data and processing
    - The first couple of cells set up the database. A series of cells then connects to an external website to download 32 chapters in PDF format.  This can be done entirely in Snowflake, but it requires AccountAdmin access to enable an external access integration, plus a restart of the Snowflake notebook.  
    The alternative is to use `pdf_downloader.py` to download the files to a local machine and then manually upload them to the Snowflake @PDF_DOCUMENTS stage, either through Snowsight or by other methods.
    - The balance of the cells use either SQL or Python to process the files into smaller parts and pages, extract raw text, summarize the content at the page/part/chapter level and collate the data into the world_history_rag table.
    - This notebook also creates a graph knowledge base of the content so the Agent can find and follow references (e.g., "See Page 543") to additional content.
    - A Cortex Search service is created to use vector embeddings to find content within the document.
2. `2-WORLD_HISTORY_SCHOOL_DATA.ipynb` 
    - Creates a second schema, `Schools`, with tables and data related to cities, school districts, schools, classes, students, test questions, chapter exams, and a realistic distribution of grades for exams.
    - Exam questions are retrieved from the PDF documents.
    - Correct and incorrect answers are generated for individual student test results.
    - There are 5 cities and school districts, 15 high schools, 45 classes, 45 teachers, 1350 students, 526 test questions, 710k individual test question responses, 43.2k exam results.
    - A YAML file defining a Cortex Agent is uploaded to a public.config_files stage.
3. This setup _does not_ cover creating an Agent.  There is currently no UI or API to create agents.  See below.
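
The dataset sizes quoted above can be sanity-checked with a little arithmetic. This is a minimal sketch, assuming one exam per chapter PDF (32 chapters) and one response per student per test question; those ratios are inferred from the totals and are not stated explicitly in the notebooks.

```python
# Figures quoted in the list above.
students = 1350
test_questions = 526

# Assumption: one chapter exam per downloaded chapter PDF.
chapters = 32

# Assumption: every student sits every chapter exam once.
exam_results = students * chapters

# Assumption: every student answers every test question once.
question_responses = students * test_questions

print(f"{exam_results:,} exam results")              # quoted as 43.2k
print(f"{question_responses:,} question responses")  # quoted as 710k
```

Under these assumptions the products come out at 43,200 and 710,100, which line up with the rounded totals in the list.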



# Agent Setup
This is how you can set up an Agent to use all of the services and tools to come up with accurate answers to complicated questions.

## About
- Display Name: World History Agent
- Description: This agent has access to both information about schools, tests, grades, test questions, student responses (structured data) and a World History Textbook (unstructured data).  The content is related in the fact that the exams and questions and responses are based on the content from the World History Textbook.  Anyone can ask questions about content in the textbook, relate that back to student performance, and seamlessly use the agent to go back and forth between the different modalities.

## Instructions
- Response Instructions: Show any percentages as 0.00%.  If the question isn't extremely clear, ask for clarification.  You are an expert professor in World History.  You have the knowledge of 1060 pages of World History and access to student exam performance data.  Your keen observations, suggestions and insights will be highly prized.  Don't be afraid to make suggestions for how tests can be improved or how individual teachers, or schools, can teach the content differently.  
- Sample Questions:
    - Which is the first page that has references to other pages about the Enlightenment? What pages does it reference and what content is on those pages?
    - I want to compare the military of Classical Athens to that of the late Roman Republic. Your primary method for finding the Roman comparison must be to execute a search for explicit connections starting from the pages discussing the Athenian military during the Persian Wars. First, summarize the Athenian model, then use the connection-finding tool to locate the relevant Roman content and provide the comparison.
    - Analyze the policy of 'War Communism' implemented by the Bolsheviks during the Russian Civil War. First, summarize the policy's immediate historical context. Then, trace its ideological foundation by following any explicit cross-references in that chapter back to the introduction of Marxist theory earlier in the textbook.
    - Compare the citizen-soldier model of Classical Athens during the Persian Wars with the professionalized army of the late Roman Republic. Begin by finding the section on the Persian Wars, summarize the chapter's discussion of the Athenian state and its military, and then use any direct textual cross-references to locate and analyze the author's comparison with the Roman military system.
    - What were the hardest questions about the Roman Empire, which answer was chosen wrong the most, and what pages should students study to get more familiar with the content?
    - How closely does the exam for Emerging Europe and the Byzantine Empire follow the textbook material?

## Tools
- Cortex Analyst: Add the World_History.public.config_files/world_history_semantic_model.yaml.  Let Cortex create the description.
- Cortex Search: Add WORLD_HISTORY_QA.PUBLIC.WORLD_HISTORY_RAG_SEARCH
   - Description: Runs vector-based searches over the World History textbook, returning pages, parts, page summaries, part summaries, or chapter summaries.
   - ID Column: PDF_URL
   - Title Column: ENHANCED_CITATION

- Custom Tools
    - Multihop_Search_Results.  Add WORLD_HISTORY.PUBLIC.MULTIHOP_SEARCH_RESULTS as a function.  
        - page_id_param description: This is the page_id param that needs to be passed in the format of CHxx_Pyyyy.  Example: SELECT WORLD_HISTORY_QA.PUBLIC.MULTIHOP_SEARCH_RESULTS_FN('CH23_P0777');
        - description: Use this tool to enrich context for a known page ID. When you have a specific page from a vector search, use this tool to retrieve the page summary, part summary, and chapter summary. This is best for answering questions about the broader theme, context, or significance of information found on a specific page.  This tool returns the connected pages (hops) for references.  It should be used to find if there are any connected edges.  Then move to the find_connected_edges tool to recursively follow those edges in the knowledge graph.
    - Find_Connected_Edges.  Add WORLD_HISTORY.PUBLIC.FIND_CONNECTED_PAGES as a function.
        - max_hops description: This is the number of connections from the source page.  If the source page is page 10 and has a reference to page 20, that would be the first hop.  If page 20 has a reference to page 30 that's the 2nd hop.  Default to 2.
        - starting_page_id description: This is the starting page id in the format "CHxx_Pyyyy".  Example "CH23_P0772".  It is a combination chapter and page number that we will get from the prior steps.
        - description: Always use this tool _after_ multihop_search_results.  That tool tells you _if_ there are connected graph edges.  This tool then allows you to recursively follow the relationships of the material.  Use this tool to answer questions about explicit connections, direct links, or tracing a topic's influence across the textbook. It traverses the book's graph of 'see page...' cross-references. Prioritize this tool when a user asks to 'trace the origins of,' 'find the connection to,' 'see what this is linked to,' or analyze how the author explicitly compares two disparate topics.
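
To make the `CHxx_Pyyyy` page ID format and the `max_hops` parameter concrete, here is a small sketch in plain Python. It is not the implementation of `FIND_CONNECTED_PAGES`; the helper names (`make_page_id`, `parse_page_id`, `connected_pages`) and the toy edge list are hypothetical and exist only for illustration.

```python
import re
from collections import deque

# Two-digit chapter, four-digit page, e.g. CH23_P0777.
PAGE_ID_RE = re.compile(r"^CH(\d{2})_P(\d{4})$")


def make_page_id(chapter: int, page: int) -> str:
    """Build a page ID in the CHxx_Pyyyy format."""
    return f"CH{chapter:02d}_P{page:04d}"


def parse_page_id(page_id: str) -> tuple[int, int]:
    """Split a page ID into (chapter, page), rejecting malformed IDs."""
    match = PAGE_ID_RE.match(page_id)
    if match is None:
        raise ValueError(f"Not a CHxx_Pyyyy page ID: {page_id!r}")
    return int(match.group(1)), int(match.group(2))


def connected_pages(edges: dict, start: str, max_hops: int = 2) -> dict:
    """Breadth-first walk returning {page_id: hop_count} within max_hops."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if hops[current] == max_hops:
            continue  # do not expand beyond the hop limit
        for target in edges.get(current, []):
            if target not in hops:
                hops[target] = hops[current] + 1
                queue.append(target)
    del hops[start]  # report only the pages reached, not the source
    return hops


# Toy graph (made up): page 10 -> page 20 -> page 30 -> page 40.
toy_edges = {
    "CH01_P0010": ["CH01_P0020"],
    "CH01_P0020": ["CH02_P0030"],
    "CH02_P0030": ["CH02_P0040"],
}

result = connected_pages(toy_edges, "CH01_P0010", max_hops=2)
print(result)  # page 20 is hop 1, page 30 is hop 2, page 40 is out of range
```

With `max_hops=2` the walk stops at page 30, mirroring the page 10 / 20 / 30 example in the `max_hops` description.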

## Orchestration
Planning instructions
```
Step 1: Question Routing 🚦
The router should prioritize tools in order from most specialized to most general.

Is the user asking to trace a connection or find an explicit link? (e.g., using words like "trace," "connect," "link," "cross-reference," "compare to what the author links").

If yes, prioritize the Multihop_Search_Results + Find_Connected_Pages tool path.  

Is the user asking for the summary, context, or significance of a known topic? (e.g., "Summarize the chapter on the Persian Wars").

If yes, use the Cortex Search + Multihop_Search_Results Path.

Is it a general knowledge question about the text? (e.g., "Tell me about the Roman military").

If yes, use the standard Cortex Search Path, potentially enriched with Multihop_Search_Results.

Is it a question about structured data?

If yes, use the Cortex Analyst Path.

-- ANY TIME Multihop_Search_Results comes back with connected_pages, use Find_Connected_Pages to get more information.
```
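
The routing order above can be pictured as a simple keyword check. In practice the agent's planner interprets these instructions itself, so the following is only a toy sketch: the `route` function and the structured-data keyword list are assumptions, while the trace keywords come from the planning instructions. The structured-data check runs before the general fallback because the general case is treated here as the catch-all.

```python
# Keywords for the trace path are taken from the planning instructions.
TRACE_WORDS = ("trace", "connect", "link", "cross-reference")
# Keywords below are assumptions chosen for illustration.
SUMMARY_WORDS = ("summarize", "summary", "context", "significance")
STRUCTURED_WORDS = ("student", "exam", "grade", "score", "school", "teacher")


def route(question: str) -> str:
    """Pick a tool path, checking the most specialized path first."""
    q = question.lower()
    if any(word in q for word in TRACE_WORDS):
        return "Multihop_Search_Results + Find_Connected_Pages"
    if any(word in q for word in SUMMARY_WORDS):
        return "Cortex Search + Multihop_Search_Results"
    if any(word in q for word in STRUCTURED_WORDS):
        return "Cortex Analyst"
    # General knowledge questions about the text fall through to search.
    return "Cortex Search"


print(route("Trace the origins of War Communism"))
print(route("Summarize the chapter on the Persian Wars"))
print(route("Tell me about the Roman military"))
print(route("Which school had the highest exam scores?"))
```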

In [None]:
# Cell 1: Setup and Configuration
import io
import os
import tempfile
import json
import requests
import pypdf
from snowflake.snowpark.context import get_active_session
from snowflake.snowpark.exceptions import SnowparkSQLException
from urllib.parse import urlparse

# --- Configuration ---
SNOWFLAKE_DATABASE = "WORLD_HISTORY"
SNOWFLAKE_SCHEMA = "public"
SNOWFLAKE_ROLE = "SYSADMIN"
TARGET_WEBSITE_URL = "https://glhssocialstudies.weebly.com/world-history-textbook---pdf-copy.html"

# Define stage names WITHOUT a leading '@'
SOURCE_STAGE_NAME = "pdf_documents"

# Dynamic names based on database
EXTERNAL_ACCESS_INTEGRATION_NAME = f"{SNOWFLAKE_DATABASE}_WEB_ACCESS"
NETWORK_RULE_NAME = f"{SNOWFLAKE_DATABASE}_WEBSITE_ACCESS"

ADAPTIVE_SPLIT_TARGET_PATH = f"{SOURCE_STAGE_NAME}/parts"
SINGLE_PAGE_TARGET_PATH = f"{SOURCE_STAGE_NAME}/pages"

# --- Session Initialization ---
session = get_active_session()
session.sql(f"CREATE DATABASE IF NOT EXISTS {SNOWFLAKE_DATABASE};").collect()
session.use_database(SNOWFLAKE_DATABASE)
session.use_schema(SNOWFLAKE_SCHEMA)

print(f"✅ Setup complete.")
print(f"  - Current Role: {session.get_current_role()}")
print(f"  - Configured Role: {SNOWFLAKE_ROLE}")
print(f"  - Database: {session.get_current_database()}")
print(f"  - Schema: {session.get_current_schema()}")
print(f"  - Target Website: {TARGET_WEBSITE_URL}")
print(f"  - Source Stage: @{SOURCE_STAGE_NAME}")
print(f"  - External Access Integration: {EXTERNAL_ACCESS_INTEGRATION_NAME}")
print(f"  - Network Rule: {NETWORK_RULE_NAME}")


In [None]:
# Create the PDF documents stage if it doesn't exist
session.sql(f"""
    CREATE STAGE IF NOT EXISTS {SOURCE_STAGE_NAME}
        ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE')
        DIRECTORY = (
           ENABLE = TRUE
           AUTO_REFRESH = TRUE
        )
        COMMENT = 'Stage for PDF documents'
""").collect()
print(f"✅ Stage @{SOURCE_STAGE_NAME} created/verified")

# Manual File Upload - Skip to `file_upload_complete`

## Download through Snowflake - Option 1

Continue to run the cells if you want to download the PDF files through Snowflake.  This requires AccountAdmin access to create the network rule and integration.  

## Download to a local machine and upload - Option 2

To download manually, get `pdf_downloader.py` from the sidebar, install its dependencies, and run `python3 pdf_downloader.py`. Then upload the files to the stage @PDF_DOCUMENTS in the database.  After you complete the upload, continue at cell `file_upload_complete`.
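
If you prefer to script the upload instead of using Snowsight, something like the following could work from a local Python environment that already has a Snowpark session. This is a sketch only: `upload_pdfs` is a hypothetical helper, `./downloads` is a placeholder folder, and creating the session is left to you. It reuses the same `session.file.put` arguments as the download cell later in this notebook.

```python
from pathlib import Path


def upload_pdfs(session, local_dir: str, stage_name: str = "pdf_documents") -> list[str]:
    """Upload every *.pdf in local_dir to @stage_name and return the names sent."""
    uploaded = []
    for pdf_path in sorted(Path(local_dir).glob("*.pdf")):
        session.file.put(
            local_file_name=str(pdf_path),
            stage_location=f"@{stage_name}",
            auto_compress=False,  # keep the files as plain PDFs on the stage
            overwrite=True,
        )
        uploaded.append(pdf_path.name)
    return uploaded


# Example call (assumes `session` is an existing Snowpark session and
# ./downloads is wherever pdf_downloader.py saved the files):
# upload_pdfs(session, "./downloads")
```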

#  🚨 🚨 Add external integration to the notebook  🚨 🚨

The next cell will switch to AccountAdmin and create the proper network rule and access integration.

Follow these instructions:
1. Run the following cell, _and only the next cell_, to create the network access.
2. Open Notebook Settings -> External Access and enable "World_History_Web_Access" (or the downloads will fail.)
3. Run the `Setup` cell again
4. Skip to the `continue_here_for_snowflake_upload` cell and continue running the notebook


In [None]:
# Cell 2: External Access Setup
print("🔐 Setting up external access with proper role management...")

# Switch to ACCOUNTADMIN for setup
session.use_role("ACCOUNTADMIN")
print(f"   Switched to role: {session.get_current_role()}")

# Extract domain from the configured URL
target_domain = urlparse(TARGET_WEBSITE_URL).netloc
print(f"   Target domain: {target_domain}")

# IMPORTANT: Also allow icomets.org where the PDFs are actually hosted
pdf_domain = "icomets.org"
print(f"   PDF domain: {pdf_domain}")

try:
    # Create network rule for BOTH the target website AND the PDF hosting domain
    session.sql(f"""
        CREATE OR REPLACE NETWORK RULE {NETWORK_RULE_NAME}
        MODE = EGRESS
        TYPE = HOST_PORT
        VALUE_LIST = ('{target_domain}:443', '{target_domain}:80', '{pdf_domain}:443', '{pdf_domain}:80')
    """).collect()
    print(f"✅ Network rule created: {NETWORK_RULE_NAME} for {target_domain} + {pdf_domain}")
    
    # Create external access integration using configured variables
    session.sql(f"""
        CREATE OR REPLACE EXTERNAL ACCESS INTEGRATION {EXTERNAL_ACCESS_INTEGRATION_NAME}
        ALLOWED_NETWORK_RULES = ({NETWORK_RULE_NAME})
        ENABLED = TRUE
    """).collect()
    print(f"✅ External access integration created: {EXTERNAL_ACCESS_INTEGRATION_NAME}")

    # Grant access for the role to use the access integration
    session.sql(f"""
        GRANT USAGE ON INTEGRATION {EXTERNAL_ACCESS_INTEGRATION_NAME} to {SNOWFLAKE_ROLE}
    """).collect()
    print(f"✅ Integration {EXTERNAL_ACCESS_INTEGRATION_NAME} access granted to {SNOWFLAKE_ROLE}")
    
    
except Exception as e:
    print(f"❌ Error during setup: {str(e)}")

finally:
    # Switch back to configured role
    try:
        session.use_role(SNOWFLAKE_ROLE)
        print(f"✅ Switched back to role: {session.get_current_role()}")
    except Exception as role_error:
        print(f"⚠️  Warning: Could not switch to {SNOWFLAKE_ROLE}: {role_error}")

print(f"🔧 External access setup completed!")


#  🚨 🚨 Upload your PDF Documents to the Stage using Snowflake  🚨 🚨


Continue with this cell after enabling the External Access Integration and running the `Setup` cell again.

In [None]:
-- Scrapes website to get PDF download links
CREATE OR REPLACE FUNCTION get_pdf_links_from_website(website_url STRING)
RETURNS VARIANT
LANGUAGE PYTHON
RUNTIME_VERSION = '3.12'
EXTERNAL_ACCESS_INTEGRATIONS = (WORLD_HISTORY_WEB_ACCESS)
PACKAGES = ('requests', 'beautifulsoup4', 'lxml')
HANDLER = 'scrape_pdfs'
AS
$$
import requests
import re
import json
from urllib.parse import urljoin, urlparse
from bs4 import BeautifulSoup

def scrape_pdfs(website_url):
    try:
        # Fetch webpage
        headers = {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
        }
        
        response = requests.get(website_url, headers=headers, timeout=30)
        response.raise_for_status()
        
        soup = BeautifulSoup(response.text, 'html.parser')
        pdf_links = []
        
        # Method 1: Direct PDF links
        for link in soup.find_all('a', href=True):
            href = link['href']
            if href.lower().endswith('.pdf'):
                full_url = urljoin(website_url, href)
                filename = urlparse(href).path.split('/')[-1]
                text = link.get_text(strip=True)
                
                pdf_links.append({
                    'url': full_url,
                    'text': text,
                    'filename': filename,
                    'method': 'direct_link'
                })
        
        # Method 2: Regex patterns if no direct links found
        if len(pdf_links) == 0:
            content = response.text
            pdf_patterns = [
                r'https?://[^"\s]+\.pdf',
                r'/files/[^"\s]+\.pdf',
                r'uploads/[^"\s]+\.pdf'
            ]
            
            found_urls = set()
            for pattern in pdf_patterns:
                matches = re.findall(pattern, content, re.IGNORECASE)
                for match in matches:
                    url = match.strip('"\'')
                    if url.startswith('/'):
                        url = urljoin(website_url, url)
                    found_urls.add(url)
            
            for i, url in enumerate(found_urls, 1):
                filename = urlparse(url).path.split('/')[-1]
                if not filename or not filename.endswith('.pdf'):
                    filename = f"chapter_{i:02d}.pdf"
                
                pdf_links.append({
                    'url': url,
                    'text': f'Chapter {i}',
                    'filename': filename,
                    'method': 'regex_pattern'
                })
        
        return {
            "success": True,
            "website_url": website_url,
            "pdf_count": len(pdf_links),
            "pdf_links": pdf_links,
            "scraped_at": str(response.headers.get('date', 'unknown'))
        }
        
    except Exception as e:
        return {
            "success": False,
            "error": str(e),
            "website_url": website_url,
            "pdf_count": 0,
            "pdf_links": []
        }
$$;

In [None]:
# Python code to download source files
import shutil  # required by shutil.disk_usage() in the free-space check below
print(f"📥 Downloading PDFs to @{SOURCE_STAGE_NAME}")

# Clean up existing stage files first
try:
    session.sql(f"REMOVE @{SOURCE_STAGE_NAME}").collect()
    print(f"🧹 Cleaned up existing stage files")
except Exception:  # best-effort cleanup; an empty stage is fine
    pass

# Get PDF links
try:
    result = session.sql(f"SELECT get_pdf_links_from_website('{TARGET_WEBSITE_URL}') as result").collect()
    
    if result:
        result_raw = result[0]['RESULT']
        if isinstance(result_raw, str):
            result_data = json.loads(result_raw)
        else:
            result_data = result_raw
        
        if result_data and result_data.get('success', False):
            pdf_links = result_data.get('pdf_links', [])
            total_pdfs = len(pdf_links)
            
            print(f"📊 Found {total_pdfs} PDFs to download")
            
            success_count = 0
            error_count = 0
            
            for i, pdf_info in enumerate(pdf_links, 1):
                pdf_url = pdf_info['url']
                filename = pdf_info['filename']
                
                print(f"\n📄 {i}/{total_pdfs}: {filename}")
                
                # Check available space
                try:
                    temp_dir = tempfile.gettempdir()
                    _, _, free_space = shutil.disk_usage(temp_dir)
                    free_mb = free_space / (1024**2)
                    
                    if free_mb < 150:
                        print(f"   ⚠️  Low space ({free_mb:.1f} MB) - cleaning up...")
                        # Clean up any leftover temp files
                        for temp_file in os.listdir(temp_dir):
                            if temp_file.endswith('.pdf') and 'tmp' in temp_file:
                                try:
                                    os.unlink(os.path.join(temp_dir, temp_file))
                                except Exception:
                                    pass
                        
                        if free_mb < 100:
                            print(f"   ❌ Insufficient space - skipping")
                            error_count += 1
                            continue
                            
                except Exception:
                    pass
                
                temp_path = None
                try:
                    headers = {
                        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
                    }
                    
                    # Streaming download
                    with requests.get(pdf_url, headers=headers, timeout=90, stream=True) as response:
                        response.raise_for_status()
                        
                        # Create temp file with correct name for proper upload
                        temp_dir = tempfile.gettempdir()
                        temp_path = os.path.join(temp_dir, filename)
                        
                        try:
                            total_size = 0
                            with open(temp_path, 'wb') as temp_file:
                                for chunk in response.iter_content(chunk_size=8192):
                                    if chunk:
                                        temp_file.write(chunk)
                                        total_size += len(chunk)
                            
                            print(f"   📦 Downloaded {total_size:,} bytes ({total_size/1024/1024:.1f} MB)")
                            
                            if total_size < 10000:
                                raise Exception(f"File too small: {total_size} bytes")
                            
                            # Upload with proper filename and no compression
                            put_result = session.file.put(
                                local_file_name=temp_path,
                                stage_location=f"@{SOURCE_STAGE_NAME}",
                                auto_compress=False,
                                overwrite=True
                            )
                            
                            if put_result and put_result[0].status == 'UPLOADED':
                                print(f"   ✅ Uploaded {filename}")
                                success_count += 1
                            else:
                                print(f"   ⚠️  Upload failed")
                                error_count += 1
                            
                        finally:
                            if temp_path and os.path.exists(temp_path):
                                os.unlink(temp_path)
                    
                except Exception as e:
                    print(f"   ❌ Error: {str(e)}")
                    error_count += 1
                    if temp_path and os.path.exists(temp_path):
                        try:
                            os.unlink(temp_path)
                        except Exception:
                            pass
                
                # Progress every 5 files
                if i % 5 == 0 or i == total_pdfs:
                    print(f"\n📈 Progress: {i}/{total_pdfs}, ✅{success_count} success, ❌{error_count} errors")
            
            # Final verification  
            print(f"\n📁 Stage verification...")
            stage_files = session.sql(f"LIST @{SOURCE_STAGE_NAME}").collect()
            
            if stage_files:
                print(f"✅ {len(stage_files)} files in @{SOURCE_STAGE_NAME}:")
                total_size = 0
                
                # Sort files by name for better display
                sorted_files = sorted(stage_files, key=lambda x: x['name'])
                
                for file_info in sorted_files:
                    # Get the name field and clean it up
                    full_path = file_info['name']
                    
                    # Handle different path formats from LIST command
                    if full_path.startswith('pdf_documents/'):
                        name = full_path.replace('pdf_documents/', '')
                    elif '/' in full_path:
                        name = full_path.split('/')[-1]
                    else:
                        name = full_path
                    
                    size = file_info['size']
                    total_size += size
                    print(f"   📄 {name} - {size:,} bytes ({size/1024/1024:.1f} MB)")
                
                print(f"\n🎉 SUCCESS! Total: {total_size:,} bytes ({total_size/1024/1024:.1f} MB)")
            else:
                print(f"❌ No files in stage")
                
        else:
            print("❌ Failed to get PDF links")
    else:
        print("❌ No result from discovery")
        
except Exception as e:
    print(f"❌ Download error: {str(e)}")

print(f"\n✅ Download completed!")


In [None]:
# Ensure the stage is refreshed with the uploaded files
session.sql(f"ALTER STAGE {SOURCE_STAGE_NAME} REFRESH").collect()

In [None]:
# Create Filename-Chapter Associations Table
# Todo: These aren't used anywhere yet.  They could be added to the world_history_rag table and/or used as topics for the test questions
import re
print("📋 Creating filename-chapter associations table...")

# Create the table with separate chapter number and title
try:
    session.sql(f"""
        CREATE OR REPLACE TABLE FILENAME_CHAPTER_ASSOCIATIONS (
            FILENAME VARCHAR(100),
            CHAPTER_NUMBER INTEGER,
            CHAPTER_TITLE VARCHAR(500),
            ORIGINAL_TEXT VARCHAR(500),
            UPLOAD_TIMESTAMP TIMESTAMP DEFAULT CURRENT_TIMESTAMP(),
            PRIMARY KEY (FILENAME)
        )
    """).collect()
    print("✅ Table FILENAME_CHAPTER_ASSOCIATIONS created/verified")
    
    # Get PDF links to extract titles
    result = session.sql(f"SELECT get_pdf_links_from_website('{TARGET_WEBSITE_URL}') as result").collect()
    
    if result:
        result_raw = result[0]['RESULT']
        if isinstance(result_raw, str):
            result_data = json.loads(result_raw)
        else:
            result_data = result_raw
        
        if result_data and result_data.get('success', False):
            pdf_links = result_data.get('pdf_links', [])
            
            print(f"📥 Parsing and inserting {len(pdf_links)} filename-chapter associations...")
            
            # Insert each filename-title mapping with parsed data
            for pdf_info in pdf_links:
                filename = pdf_info['filename']
                original_text = pdf_info.get('text', filename)
                
                # Parse chapter number and title
                # Pattern: "Chapter X: Title (size)" or similar
                chapter_number = None
                chapter_title = original_text
                
                # Try to extract chapter number
                chapter_match = re.search(r'Chapter\s+(\d+)', original_text, re.IGNORECASE)
                if chapter_match:
                    chapter_number = int(chapter_match.group(1))
                    
                    # Extract title after the colon, before any parentheses
                    title_match = re.search(r'Chapter\s+\d+:\s*([^(]+)', original_text, re.IGNORECASE)
                    if title_match:
                        chapter_title = title_match.group(1).strip()
                
                # If no "Chapter X:" pattern, try to extract number from filename
                if chapter_number is None:
                    filename_match = re.search(r'chap(\d+)', filename, re.IGNORECASE)
                    if filename_match:
                        chapter_number = int(filename_match.group(1))
                
                # Clean up chapter title (remove extra spaces, size info)
                chapter_title = re.sub(r'\s*\(\d+[A-Za-z]*\)\s*$', '', chapter_title).strip()
                
                try:
                    session.sql(f"""
                        INSERT INTO FILENAME_CHAPTER_ASSOCIATIONS 
                        (FILENAME, CHAPTER_NUMBER, CHAPTER_TITLE, ORIGINAL_TEXT)
                        VALUES ('{filename.replace("'", "''")}', {'NULL' if chapter_number is None else chapter_number}, '{chapter_title.replace("'", "''")}', '{original_text.replace("'", "''")}')
                    """).collect()
                except Exception as e:
                    print(f"   ⚠️  Error inserting {filename}: {str(e)}")
            
            # Verify the data
            associations = session.sql("SELECT * FROM FILENAME_CHAPTER_ASSOCIATIONS ORDER BY CHAPTER_NUMBER").collect()
            
            print(f"\n✅ {len(associations)} associations created:")
            for assoc in associations[:10]:  # Show first 10
                filename = assoc['FILENAME']
                chapter_num = assoc['CHAPTER_NUMBER']
                title = assoc['CHAPTER_TITLE'][:50] + "..." if len(assoc['CHAPTER_TITLE']) > 50 else assoc['CHAPTER_TITLE']
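                # CHAPTER_NUMBER may be NULL, and None cannot be formatted with ":2d"
                chapter_label = f"{chapter_num:2d}" if chapter_num is not None else " ?"
                print(f"   📄 {filename:<15} → Ch.{chapter_label}: {title}")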
            
            if len(associations) > 10:
                print(f"   ... and {len(associations) - 10} more associations")
                
            print(f"\n🎯 Query examples:")
            print(f"   • SELECT * FROM FILENAME_CHAPTER_ASSOCIATIONS WHERE CHAPTER_NUMBER = 1")
            print(f"   • SELECT FILENAME, CHAPTER_TITLE FROM FILENAME_CHAPTER_ASSOCIATIONS ORDER BY CHAPTER_NUMBER")
                
        else:
            print("❌ Failed to get PDF links for associations")
    else:
        print("❌ No result from PDF discovery")
        
except Exception as e:
    print(f"❌ Error creating associations: {str(e)}")

print(f"\n✅ Filename-chapter associations completed!")
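
The chapter number and title parsing used in the cell above can be tried in isolation. The sketch below wraps the same regular expressions in a hypothetical `parse_chapter` helper; the two sample link texts and filenames are made up for illustration and are not taken from the real website.

```python
import re


def parse_chapter(link_text: str, filename: str):
    """Return (chapter_number, chapter_title) using the cell's regex rules."""
    chapter_number = None
    chapter_title = link_text

    # Preferred source: "Chapter X: Title (size)" in the link text.
    number_match = re.search(r"Chapter\s+(\d+)", link_text, re.IGNORECASE)
    if number_match:
        chapter_number = int(number_match.group(1))
        title_match = re.search(r"Chapter\s+\d+:\s*([^(]+)", link_text, re.IGNORECASE)
        if title_match:
            chapter_title = title_match.group(1).strip()

    # Fallback: take the number from a filename such as chap03.pdf.
    if chapter_number is None:
        filename_match = re.search(r"chap(\d+)", filename, re.IGNORECASE)
        if filename_match:
            chapter_number = int(filename_match.group(1))

    # Drop a trailing size marker such as "(12MB)".
    chapter_title = re.sub(r"\s*\(\d+[A-Za-z]*\)\s*$", "", chapter_title).strip()
    return chapter_number, chapter_title


print(parse_chapter("Chapter 3: Early Civilizations (12MB)", "chap03.pdf"))
print(parse_chapter("Prologue (4MB)", "chap00.pdf"))
```

The second call shows the filename fallback: the link text has no "Chapter X" prefix, so the number comes from `chap00`.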


# Done with file upload!

If you manually uploaded files to the Snowflake stage, continue processing the files starting from the cell below.
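
The next cells split any PDF larger than about 25 MB into consecutive page ranges. The grouping idea can be sketched without pypdf as a greedy pass over per-page sizes. This is a simplification: it assumes a part's size is the sum of its page sizes, whereas the notebook re-serializes the pages with pypdf to measure the real size. `plan_parts` and the sample sizes are hypothetical.

```python
def plan_parts(page_sizes_mb, max_size_mb=25.0):
    """Group consecutive pages into (start, end) index ranges, inclusive.

    Pages are added to a part while the running total stays within
    max_size_mb. A single page over the limit still gets its own part.
    """
    parts = []
    start = 0
    while start < len(page_sizes_mb):
        end = start
        total = page_sizes_mb[start]
        # Extend the part while the next page still fits.
        while end + 1 < len(page_sizes_mb) and total + page_sizes_mb[end + 1] <= max_size_mb:
            end += 1
            total += page_sizes_mb[end]
        parts.append((start, end))
        start = end + 1
    return parts


# Made-up page sizes in MB; the 30 MB page exceeds the limit on its own.
plan = plan_parts([10, 10, 10, 30, 5])
print(plan)
```

The index ranges are zero-based and inclusive, matching how `start_page_of_chunk` and `end_page_of_chunk` are used in the helper below.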

In [None]:
#
# Helper Function (Adaptive Splitting) to chunk source files into ~25 MB parts
#
def get_page_range_desc(page_labels, start_idx, end_idx):
    start_label = page_labels[start_idx] if page_labels and start_idx < len(page_labels) else start_idx + 1
    end_label = page_labels[end_idx] if page_labels and end_idx < len(page_labels) else end_idx + 1
    clean_desc = f"pages{start_label}to{end_label}".replace('/', '_').replace(' ', '')
    readable_desc = f"pages {start_label}-{end_label}"
    return clean_desc, readable_desc

def split_large_pdf_on_stage(session, file_content_stream, original_filename, target_stage_path, max_size_mb=25):
    original_size_mb = file_content_stream.getbuffer().nbytes / (1024 * 1024)
    print(f"\n📄 Processing {original_filename} ({original_size_mb:.1f} MB)")

    # Files at or under the size limit are left in place; nothing is copied or split.
    if original_size_mb <= max_size_mb:
        print(f"✅ File is under {max_size_mb}MB and in the correct stage. No action needed.")
        return [{'filename': original_filename, 'page_range': 'all', 'size_mb': original_size_mb}]

    # Larger files are split into consecutive parts of at most max_size_mb each.
    with tempfile.TemporaryDirectory() as temp_dir:
        reader = pypdf.PdfReader(file_content_stream)
        total_pages, page_labels = len(reader.pages), reader.page_labels
        print(f"Total pages: {total_pages}. Starting adaptive split...")
        
        output_parts, start_page_of_chunk, part_num = [], 0, 1
        
        while start_page_of_chunk < total_pages:
            size_test_writer, end_page_of_chunk = pypdf.PdfWriter(), start_page_of_chunk - 1
            for i in range(start_page_of_chunk, total_pages):
                size_test_writer.add_page(reader.pages[i])
                with io.BytesIO() as buffer:
                    size_test_writer.write(buffer)
                    if buffer.tell() / (1024 * 1024) > max_size_mb and i > start_page_of_chunk: break
                end_page_of_chunk = i
            
            final_writer = pypdf.PdfWriter()
            for i in range(start_page_of_chunk, end_page_of_chunk + 1):
                final_writer.add_page(reader.pages[i])

            base_name = original_filename.rsplit('.', 1)[0]
            clean_desc, readable_desc = get_page_range_desc(page_labels, start_page_of_chunk, end_page_of_chunk)
            output_filename = f"{base_name}_{clean_desc}.pdf"
            
            temp_part_path = os.path.join(temp_dir, output_filename)
            with open(temp_part_path, "wb") as f: final_writer.write(f)

            final_size_mb = os.path.getsize(temp_part_path) / (1024 * 1024)
            session.file.put(
                local_file_name=temp_part_path,
                #stage_location=f"{target_stage_path}/{output_filename}",
                stage_location=f"{target_stage_path}",
                auto_compress=False, overwrite=True
            )
            print(f"  - Part {part_num}: Uploaded {output_filename} ({final_size_mb:.1f} MB, {readable_desc})")
            
            output_parts.append({'filename': output_filename, 'page_range': readable_desc, 'size_mb': final_size_mb})
            part_num += 1
            start_page_of_chunk = end_page_of_chunk + 1
            
    return output_parts

In [None]:
#
# Adaptive Splitting for Large Files Code
#
print("--- Starting Task 1: Adaptive splitting for files > 25MB ---")
all_results = {}
try:
    # The pattern matches only files in the stage root, so previously split files in sub-directories are not processed again
    staged_files = session.sql(f"LS @{SOURCE_STAGE_NAME} PATTERN='[^/]+'").collect()
    files_to_process = [f["name"] for f in staged_files if f["name"].lower().endswith('.pdf') and 'pages' not in f["name"]]

    if not files_to_process:
        print("No matching PDF files found in the stage to process.")
    else:
        print(f"Found {len(files_to_process)} PDF(s) to check...")
        for file_path_on_stage in sorted(files_to_process):
            try:
                stage_file_path = f"@{file_path_on_stage}"
                file_name_only = file_path_on_stage.split('/')[-1]
                
                with session.file.get_stream(stage_file_path) as instream:
                    pdf_bytes_io = io.BytesIO(instream.read())
                    parts = split_large_pdf_on_stage(
                        session=session,
                        file_content_stream=pdf_bytes_io,
                        original_filename=file_name_only,
                        target_stage_path=ADAPTIVE_SPLIT_TARGET_PATH, # Use correct target
                    )
                    if len(parts) > 1:
                        all_results[file_name_only.rsplit('.', 1)[0]] = parts
            except Exception as e:
                print(f"❌ Error processing {file_path_on_stage}: {e}")
except SnowparkSQLException as e:
    print(f"❌ SQL Error: Could not list files in stage '@{SOURCE_STAGE_NAME}'. Please check permissions.")
    print(e)
print("\n--- Task 1 Complete ---")

In [None]:
#
# Single-Page Splitting Helper Function
#
def get_page_label(pdf_reader, page_num):
    """Extracts the page label for a given page number."""
    try:
        return pdf_reader.page_labels[page_num]
    except (IndexError, KeyError):
        return None

def split_pdf_to_single_pages_on_stage(session, file_content_stream, original_filename, target_stage_path):
    """Splits a PDF from a stream into single pages and uploads them to a stage directory."""
    print(f"\n📄 Splitting {original_filename} into single pages...")
    
    with tempfile.TemporaryDirectory() as temp_dir:
        reader = pypdf.PdfReader(file_content_stream)
        total_pages = len(reader.pages)
        print(f"Total pages: {total_pages}")

        base_name = original_filename.rsplit('.', 1)[0]
        page_upload_info = []

        for page_num in range(total_pages):
            pdf_writer = pypdf.PdfWriter()
            pdf_writer.add_page(reader.pages[page_num])
            
            page_label = get_page_label(reader, page_num)
            
            if page_label:
                clean_label = str(page_label).replace('/', '_').replace('\\', '_')
                try:
                    # Attempt to convert and format
                    page_number_str = f"{int(clean_label):04d}"
                    output_filename = f"{base_name}_page{page_number_str}.pdf"
                except ValueError:
                    # Handle cases where clean_label is not a valid number
                    output_filename = f"{base_name}_page{clean_label}.pdf" 
            else:
                output_filename = f"{base_name}_page{page_num+1:04d}_nolabel.pdf"
            
            temp_page_path = os.path.join(temp_dir, output_filename)
            with open(temp_page_path, "wb") as f:
                pdf_writer.write(f)

            # Upload into the target directory; the staged file keeps its local file name.
            session.file.put(
                local_file_name=temp_page_path,
                stage_location=target_stage_path,
                auto_compress=False, overwrite=True
            )
            page_upload_info.append({'filename': output_filename, 'page_label': page_label})
            
            if (page_num + 1) % 20 == 0 or page_num == total_pages - 1:
                print(f"  ...uploaded {page_num+1}/{total_pages} pages")
    
    print(f"✅ Split into {len(page_upload_info)} individual pages in @{target_stage_path}")
    return page_upload_info
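
The file name chosen for each single page determines what the later SQL cells can extract from `RELATIVE_PATH`, so it helps to see the naming rule in isolation. This is a minimal sketch of the same branching as the loop above, with no PDF or Snowflake dependency. `page_filename` and the `chap01` base name are illustrative assumptions.

```python
def page_filename(base_name, page_label, page_index):
    """Build the single-page file name from a PDF page label, or from the 0-based index if unlabeled."""
    if page_label:
        # Slashes would be interpreted as directories on the stage.
        clean_label = str(page_label).replace("/", "_").replace("\\", "_")
        try:
            # Numeric labels are zero-padded so files sort in page order.
            return f"{base_name}_page{int(clean_label):04d}.pdf"
        except ValueError:
            # Non-numeric labels (for example "R15" or "iv") are kept verbatim.
            return f"{base_name}_page{clean_label}.pdf"
    return f"{base_name}_page{page_index + 1:04d}_nolabel.pdf"


for label, index in [("12", 0), ("R15", 3), ("iv", 5), (None, 4)]:
    print(page_filename("chap01", label, index))
```

A label such as `R15` still contains digits that a pattern like `page[A-Z]*(\d+)` can pick up later, while a label with no digits at all (such as `iv`) yields no page number.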

In [None]:
#
# Single-Page Splitting Main Code
#
print("\n--- Starting Task 2: Splitting all original files into single pages ---")
single_page_results = {} # To store results for the summary

try:
    # The pattern matches only files in the stage root; otherwise files in sub-directories would be split again recursively
    staged_files = session.sql(f"LS @{SOURCE_STAGE_NAME} PATTERN='[^/]+'").collect()
    # Find original PDFs, excluding any adaptively split parts
    files_to_process = [f["name"] for f in staged_files if f["name"].lower().endswith('.pdf') and '_pages' not in f["name"]]

    if not files_to_process:
        print("No original PDF files found in the stage to process.")
    else:
        print(f"Found {len(files_to_process)} original PDF(s) to split into single pages...")
        for file_path_on_stage in sorted(files_to_process):
            try:
                stage_file_path = f"@{file_path_on_stage}"
                file_name_only = file_path_on_stage.split('/')[-1]
                
                with session.file.get_stream(stage_file_path) as instream:
                    pdf_bytes_io = io.BytesIO(instream.read())
                    # Capture the returned info
                    page_upload_info = split_pdf_to_single_pages_on_stage(
                        session=session,
                        file_content_stream=pdf_bytes_io,
                        original_filename=file_name_only,
                        target_stage_path=SINGLE_PAGE_TARGET_PATH,
                    )
                    single_page_results[file_name_only] = page_upload_info
            except Exception as e:
                print(f"❌ Error splitting {file_path_on_stage} into single pages: {e}")
except SnowparkSQLException as e:
    print(f"❌ SQL Error: Could not list files in stage '@{SOURCE_STAGE_NAME}'. Please check permissions.")
    print(e)
print("\n--- Task 2 Complete ---")

In [None]:
#
# Final Summary of Adaptive Page Split and Single Page Split
#
print("\n" + "="*60)
print("✅ All Tasks Complete")
print("="*60)

# --- Summary for Task 1: Adaptive Splitting ---
print("\n📋 Summary of Adaptive Splitting (Task 1)")
if all_results:
    total_parts = sum(len(parts) for parts in all_results.values())
    print(f"  - Total: {len(all_results)} large PDF(s) were split into {total_parts} parts.")
    print(f"  - Destination: @{ADAPTIVE_SPLIT_TARGET_PATH}/")
else:
    print("  - No files required adaptive splitting.")

# --- Summary for Task 2: Single-Page Splitting ---
print("\n📋 Summary of Single-Page Splitting (Task 2)")
if single_page_results:
    total_pages_created = sum(len(pages) for pages in single_page_results.values())
    print(f"  - Total: {len(single_page_results)} original PDF(s) were split into {total_pages_created} single pages.")
    print(f"  - Destination: @{SINGLE_PAGE_TARGET_PATH}/")
else:
    print("  - Single-page splitting was not run or no files were processed.")

print("\n" + "="*60)

In [None]:
session.sql(f"ALTER STAGE {SOURCE_STAGE_NAME} REFRESH").collect()

# Start Table Creation for Document Analysis

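- DOCUMENT_PAGES = text, summary, and metadata for each individual page
- DOCUMENT_PARTS = raw text and metadata for multi-page parts (chunks) of a chapter
- DOCUMENT_ANALYSIS = cross-references the AI extracted from each page
- DOCUMENT_EDGES = knowledge graph edges linking a page to the pages it references
- WORLD_HISTORY_RAG = final table holding raw text and summaries at page, part, and chapter level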

In [None]:
-- Main table for individual document pages
CREATE OR REPLACE TABLE DOCUMENT_PAGES (
    PAGE_ID STRING PRIMARY KEY,
    CHAPTER_NUMBER INTEGER,
    PAGE_NUMBER INTEGER,                    -- Actual page number from PDF
    PART_IDENTIFIER STRING,                 -- Links pages to their parent parts (e.g., 'pages64to76')
    CHAPTER_TITLE STRING,
    -- UNIT_NUMBER INTEGER,
    PAGE_CONTENT_RAW TEXT,
    PAGE_CONTENT TEXT,                      -- Content for this specific page
    PAGE_SUMMARY TEXT,                      -- AI-generated page summary
    PAGE_KEYWORDS ARRAY,                    -- Key terms on this page
    --CONTENT_VECTOR VECTOR(FLOAT, 768),     -- Page embedding for search
    
    -- Content metadata
    CONTENT_TYPE STRING,                    -- 'text_light', 'text_heavy', 'mixed'
    WORD_COUNT INTEGER,
    CHAR_COUNT INTEGER,
    
    -- URL and citation
    CHAPTER_PDF_URL STRING,                 -- Presigned URL to PDF
    SOURCE_FILENAME STRING,                 -- Original source filename (RELATIVE_PATH)
    ENHANCED_CITATION STRING,               -- "Chapter Title, Chapter X, Page Y"
    
    -- Processing metadata
    PROCESSING_STATUS STRING DEFAULT 'PENDING',
    PROCESSING_TIMESTAMP TIMESTAMP_LTZ DEFAULT CURRENT_TIMESTAMP()
);

-- ================================================================================
-- HIERARCHICAL RAG STAGING TABLE - For chapter and part-level content
-- ================================================================================

-- Staging table for chapter and multi-page part PDFs
CREATE OR REPLACE TABLE DOCUMENT_PARTS (
    CHAPTER_NUMBER INTEGER,
    PART_IDENTIFIER STRING,      -- e.g., 'pages64to76', 'full' for complete chapters
    PART_CONTENT_RAW STRING,     -- Raw text extracted from PDF
    SOURCE_FILENAME STRING,      -- Original filename
    PDF_URL STRING,              -- Presigned URL to the source PDF
    
    -- Processing metadata
    WORD_COUNT INTEGER,
    CHAR_COUNT INTEGER,
    ENHANCED_CITATION STRING,
    PROCESSING_STATUS STRING DEFAULT 'PENDING',
    CREATED_TIMESTAMP TIMESTAMP_LTZ DEFAULT CURRENT_TIMESTAMP()
)
COMMENT = 'Staging table for chapter-level and multi-page part PDFs used in hierarchical RAG pipeline';

-- AI analysis results for cross-reference extraction
CREATE OR REPLACE TABLE DOCUMENT_ANALYSIS (
    PAGE_ID STRING PRIMARY KEY,
    CHAPTER_NUMBER INTEGER,
    PAGE_NUMBER INTEGER,
    PAGE_CONTENT TEXT,
    EXTRACTED_REFERENCES VARIANT,          -- JSON array of cross-references
    AI_MODEL STRING,
    PROCESSING_TIMESTAMP TIMESTAMP_LTZ DEFAULT CURRENT_TIMESTAMP()
);

-- Knowledge graph edges between pages/chapters
CREATE OR REPLACE TABLE DOCUMENT_EDGES (
    EDGE_ID STRING DEFAULT CONCAT('EDGE_', UNIFORM(1, 999999999, RANDOM())) PRIMARY KEY,
    
    -- Source page
    SRC_PAGE_ID STRING,
    SRC_CHAPTER_NUMBER INTEGER,
    SRC_PAGE_NUMBER INTEGER,
    
    -- Destination page (may not exist yet)
    DST_PAGE_ID STRING,                     -- NULL if referenced page doesn't exist
    DST_CHAPTER_NUMBER INTEGER,
    DST_PAGE_NUMBER INTEGER,
    
    -- Reference details
    REFERENCE_TYPE STRING,                  -- 'page_reference', 'chapter_reference', 'figure_reference'
    REFERENCE_CONTEXT STRING,              -- "see page 759", "discussed in Chapter 24"
    REFERENCE_EXPLANATION STRING,          -- AI explanation of the connection
    CONFIDENCE_SCORE FLOAT,
    
    PROCESSING_TIMESTAMP TIMESTAMP_LTZ DEFAULT CURRENT_TIMESTAMP()
);

-- ================================================================================
-- HIERARCHICAL RAG TABLE - Final unified table for multi-granularity RAG
-- ================================================================================

-- Final table for hierarchical multi-hop RAG - stores content at all granularity levels
CREATE OR REPLACE TABLE WORLD_HISTORY_RAG (
    CHAPTER_NUMBER INTEGER,
    PART_IDENTIFIER STRING,     -- e.g., 'pages64to76', 'full', or NULL for chapter-wide summaries
    PAGE_NUMBER INTEGER,        -- The specific page number, or NULL for part/chapter summaries
    CONTENT_TYPE STRING,        -- 'ChapterSummary', 'PartSummary', 'PageSummary', 'RawText'
    TEXT_CONTENT STRING,        -- The actual text or summary
    SOURCE_FILENAME STRING,     -- The original file the content was extracted from
    PDF_URL STRING,             -- A presigned URL to the source PDF
    PAGE_ID STRING,             -- The page ID used in multi-hop search
    ENHANCED_CITATION STRING,   -- "Chapter Title, Chapter X, Page Y"
    -- Processing metadata
    CREATED_TIMESTAMP TIMESTAMP_LTZ DEFAULT CURRENT_TIMESTAMP()
)    
    COMMENT = 'Hierarchical RAG table storing content at multiple granularities: raw text, page summaries, part summaries, and chapter summaries';


In [None]:
-- Cortex Parse_Document to extract text from individual pages
INSERT INTO DOCUMENT_PAGES (
    PAGE_ID, CHAPTER_NUMBER, PAGE_NUMBER, PART_IDENTIFIER, CHAPTER_TITLE,
     PAGE_CONTENT_RAW, PAGE_CONTENT, CONTENT_TYPE, WORD_COUNT, CHAR_COUNT, CHAPTER_PDF_URL,
    SOURCE_FILENAME, ENHANCED_CITATION, PROCESSING_STATUS
)
WITH page_processing AS (
    SELECT
        -- Extract chapter and page numbers using regex
        ZEROIFNULL(REGEXP_SUBSTR(RELATIVE_PATH, 'chap(\\d+)', 1, 1, 'ie', 1))::INT AS chapter_number,
        REGEXP_SUBSTR(RELATIVE_PATH, 'page[A-Z]*(\\d+)', 1, 1, 'ie', 1)::INT AS page_number,
        
        -- Extract part identifier if this page belongs to a multi-page chunk (usually NULL for individual pages)
        REGEXP_SUBSTR(RELATIVE_PATH, 'pages(\\d+to\\d+)', 1, 1, 'ie', 1) as part_identifier,
        
        -- Generate consistent page ID
        'CH' || LPAD(ZEROIFNULL(REGEXP_SUBSTR(RELATIVE_PATH, 'chap(\\d+)', 1, 1, 'ie', 1))::INT, 2, '0') || 
        '_P' || LPAD(REGEXP_SUBSTR(RELATIVE_PATH, 'page[A-Z]*(\\d+)', 1, 1, 'ie', 1)::INT, 4, '0') as page_id,
        
        -- Parse PDF content
        SNOWFLAKE.CORTEX.PARSE_DOCUMENT('@PDF_DOCUMENTS', RELATIVE_PATH, {'mode': 'LAYOUT'}) as parse_result,
        RELATIVE_PATH,
        
        -- Generate presigned URL
        GET_PRESIGNED_URL('@PDF_DOCUMENTS', RELATIVE_PATH, 604800) as pdf_url
    FROM
        DIRECTORY(@PDF_DOCUMENTS)
    WHERE
        RELATIVE_PATH LIKE 'pages/%.pdf'
        AND page_id
            NOT IN (SELECT PAGE_ID FROM DOCUMENT_PAGES WHERE PAGE_ID IS NOT NULL)  -- Idempotency check
)
SELECT
    -- Core identifiers
    page_id,
    chapter_number,
    page_number,
    part_identifier,
    'Chapter ' || chapter_number as chapter_title,
    
    
    -- Content fields
    COALESCE(parse_result['content']::STRING, '') as page_content_raw,
    COALESCE(parse_result['content']::STRING, '') as page_content,  -- Same as raw for now

    
    -- Content metadata
    CASE 
        WHEN LENGTH(TRIM(COALESCE(parse_result['content']::STRING, ''))) < 100 THEN 'text_light'
        WHEN LENGTH(TRIM(COALESCE(parse_result['content']::STRING, ''))) > 2000 THEN 'text_heavy'
        ELSE 'mixed'
    END as content_type,
    
    -- Counts
    CASE 
        WHEN LENGTH(TRIM(COALESCE(parse_result['content']::STRING, ''))) = 0 THEN 0
        ELSE ARRAY_SIZE(SPLIT(COALESCE(parse_result['content']::STRING, ''), ' '))
    END as word_count,
    LENGTH(COALESCE(parse_result['content']::STRING, '')) as char_count,
    
    -- URLs and citations
    pdf_url as chapter_pdf_url,
    RELATIVE_PATH as source_filename,
    -- Single-page file names contain no 'NtoM' page range, so the citation uses the page number
    'Chapter ' || chapter_number || ', Page ' || page_number AS enhanced_citation,
    
    -- Processing status
    'PROCESSED' as processing_status

FROM page_processing
WHERE chapter_number IS NOT NULL 
  AND page_number IS NOT NULL
  AND LENGTH(TRIM(COALESCE(parse_result['content']::STRING, ''))) > 50;  -- Quality filter
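
The chapter number, page number, and `PAGE_ID` above all come from regular expressions applied to `RELATIVE_PATH`. As a rough Python equivalent for experimenting with file names before running the insert, the sketch below mirrors the two patterns and the `CH..._P....` format. The example paths are hypothetical, and `parse_page_path` is an illustrative helper, not part of the notebook.

```python
import re


def parse_page_path(relative_path):
    """Return (chapter_number, page_number, page_id) derived from a staged page file path."""
    chapter_match = re.search(r"chap(\d+)", relative_path, re.IGNORECASE)
    page_match = re.search(r"page[A-Z]*(\d+)", relative_path, re.IGNORECASE)

    # ZEROIFNULL in the SQL turns a missing chapter number into 0.
    chapter_number = int(chapter_match.group(1)) if chapter_match else 0
    # A missing page number stays NULL in the SQL, so the row is filtered out later.
    page_number = int(page_match.group(1)) if page_match else None

    page_id = None
    if page_number is not None:
        page_id = f"CH{chapter_number:02d}_P{page_number:04d}"
    return chapter_number, page_number, page_id


for path in ["pages/chap03_page0045.pdf", "pages/chap03_pageR15.pdf", "pages/chap03_pageiv.pdf"]:
    print(path, "->", parse_page_path(path))
```

Note that the `pages/` directory prefix does not confuse the page pattern, because it is not followed by digits.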


In [None]:
-- Cortex Parse_Document to extract text from chunks/parts
INSERT INTO DOCUMENT_PARTS (CHAPTER_NUMBER, PART_IDENTIFIER, PART_CONTENT_RAW, SOURCE_FILENAME, PDF_URL, WORD_COUNT, CHAR_COUNT, ENHANCED_CITATION, PROCESSING_STATUS)
WITH parsed_docs AS (
  SELECT
    RELATIVE_PATH,
    SNOWFLAKE.CORTEX.PARSE_DOCUMENT('@PDF_DOCUMENTS', RELATIVE_PATH, {'mode': 'LAYOUT'})['content']::STRING AS content_raw
  FROM DIRECTORY(@PDF_DOCUMENTS)
  WHERE
    RELATIVE_PATH LIKE 'parts/%.pdf' -- Only files in parts subdirectory
    AND RELATIVE_PATH NOT IN (SELECT SOURCE_FILENAME FROM DOCUMENT_PARTS WHERE SOURCE_FILENAME IS NOT NULL) -- Idempotency check
)
SELECT
    ZEROIFNULL(REGEXP_SUBSTR(pd.RELATIVE_PATH, 'chap(\\d+)', 1, 1, 'ie', 1))::INT as chapter_number,
    REGEXP_SUBSTR(pd.RELATIVE_PATH, '(pages\\d+to\\d+)', 1, 1, 'ie', 1) as part_identifier,
    pd.content_raw,
    pd.RELATIVE_PATH,
    GET_PRESIGNED_URL('@PDF_DOCUMENTS', pd.RELATIVE_PATH, 604800),
    -- Calculate word and character counts from the content_raw field
    ARRAY_SIZE(SPLIT(pd.content_raw, ' ')),
    LENGTH(pd.content_raw),
    'Chapter ' || chapter_number || ', Part ' || part_identifier as enhanced_citation,
    'PROCESSED'
FROM parsed_docs AS pd
WHERE
    -- Must have a valid chapter number (test the raw match; wrapping it in ZEROIFNULL would turn a missing number into 0 and never filter anything)
    REGEXP_SUBSTR(pd.RELATIVE_PATH, 'chap(\\d+)', 1, 1, 'ie', 1) IS NOT NULL;
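
The word counts in the two inserts above come from splitting the text on a single space, and the page-level `CONTENT_TYPE` comes from two length thresholds. The sketch below reproduces that arithmetic in Python to show how the numbers behave on awkward input. `content_metadata` is an illustrative name, and the comparison with whitespace-based splitting is only an observation, not something the notebook does.

```python
def content_metadata(text):
    """Mimic the content_type, word_count, and char_count expressions used for DOCUMENT_PAGES."""
    text = text or ""                       # COALESCE(..., '')
    trimmed_length = len(text.strip(" "))   # TRIM removes leading/trailing spaces

    if trimmed_length < 100:
        content_type = "text_light"
    elif trimmed_length > 2000:
        content_type = "text_heavy"
    else:
        content_type = "mixed"

    # ARRAY_SIZE(SPLIT(text, ' ')): every single space is a separator,
    # so double spaces add empty "words" and newlines do not separate words.
    word_count = 0 if trimmed_length == 0 else len(text.split(" "))
    return {"content_type": content_type, "word_count": word_count, "char_count": len(text)}


print(content_metadata("a  b"))      # double space counts an empty token
print(content_metadata("one\ntwo"))  # newline is not treated as a separator
print(len("one\ntwo".split()))       # whitespace-aware split, for comparison
```

The counts are therefore best read as approximate size indicators rather than exact word totals.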

In [None]:
-- Insert the LLM-cleaned raw text of each page into world_history_rag (CONTENT_TYPE = 'RawText')
INSERT INTO WORLD_HISTORY_RAG (CHAPTER_NUMBER, PART_IDENTIFIER, PAGE_NUMBER, CONTENT_TYPE, TEXT_CONTENT, SOURCE_FILENAME, PDF_URL, PAGE_ID, ENHANCED_CITATION)
SELECT
    p.CHAPTER_NUMBER, 
    p.PART_IDENTIFIER, 
    p.PAGE_NUMBER, 
    'RawText' AS CONTENT_TYPE,
    AI_COMPLETE(
        'llama3.3-70b', 
        'Return the raw text of the page with obvious errors fixed.  This includes spelling errors, hyphenated words, combined words, etc.  If only a part of the text is unclear, you can try to figure out what it means, but if there is not enough context then just return the raw text.  If any historical names, dates, etc. seem way off, try to correct them, but do not make up anything.  Do NOT add any additional text or commentary like "Here is the text with obvious errors fixed".  The text will be used later in a search engine so it should be as close to the original text as possible. :\n' || p.PAGE_CONTENT_RAW
    ) as TEXT_CONTENT,
    p.SOURCE_FILENAME, 
    p.CHAPTER_PDF_URL,
    'CH' || LPAD(p.CHAPTER_NUMBER, 2, '0') || '_P' || LPAD(p.PAGE_NUMBER, 4, '0') as PAGE_ID,
    p.ENHANCED_CITATION
FROM DOCUMENT_PAGES p
WHERE p.PROCESSING_STATUS = 'PROCESSED'
  AND p.PAGE_CONTENT_RAW IS NOT NULL
  AND LENGTH(p.PAGE_CONTENT_RAW) > 50;  -- Only process pages with substantial content


In [None]:
-- insert AI_COMPLETE page summaries to world_history_rag
INSERT INTO WORLD_HISTORY_RAG (CHAPTER_NUMBER, PART_IDENTIFIER, PAGE_NUMBER, CONTENT_TYPE, TEXT_CONTENT, SOURCE_FILENAME, PDF_URL, PAGE_ID, ENHANCED_CITATION)
SELECT
    p.CHAPTER_NUMBER, 
    p.PART_IDENTIFIER, 
    p.PAGE_NUMBER, 
    'PageSummary' AS CONTENT_TYPE,
    AI_COMPLETE(
        'llama3.3-70b', 
        'Do not add any additional text or commentary like "Here is the summary".  The text will be used later in a search engine so it should be as close to the original text as possible.  Summarize this page content in 1-2 concise sentences:\n' || p.PAGE_CONTENT_RAW
    ),
    p.SOURCE_FILENAME, 
    p.CHAPTER_PDF_URL,
    'CH' || LPAD(p.CHAPTER_NUMBER, 2, '0') || '_P' || LPAD(p.PAGE_NUMBER, 4, '0') as PAGE_ID,
    p.ENHANCED_CITATION
FROM DOCUMENT_PAGES p
WHERE p.PROCESSING_STATUS = 'PROCESSED'
  AND p.PAGE_CONTENT_RAW IS NOT NULL
  AND LENGTH(p.PAGE_CONTENT_RAW) > 50  -- Only process pages with substantial content
  AND CONCAT(p.CHAPTER_NUMBER, '_', p.PAGE_NUMBER, '_PageSummary') NOT IN (
      SELECT CONCAT(CHAPTER_NUMBER, '_', PAGE_NUMBER, '_', CONTENT_TYPE) 
      FROM WORLD_HISTORY_RAG 
      WHERE CONTENT_TYPE = 'PageSummary'
  ); -- Idempotency check for page summaries (compares against all existing summaries)

In [None]:
-- Cortex AI_COMPLETE inserts cleaned up text from parts into world_history_rag
INSERT INTO WORLD_HISTORY_RAG (CHAPTER_NUMBER, PART_IDENTIFIER, PAGE_NUMBER, CONTENT_TYPE, TEXT_CONTENT, SOURCE_FILENAME, PDF_URL, PAGE_ID, ENHANCED_CITATION)
SELECT
    p.CHAPTER_NUMBER,
    p.PART_IDENTIFIER,
    NULL AS PAGE_NUMBER,  -- Parts don't have specific page numbers
    'RawText' AS CONTENT_TYPE,
    AI_COMPLETE(
      'llama3.3-70b', 
      'Return the raw text of the page with obvious errors fixed.  This includes spelling errors, hyphenated words, combined words, etc.  If only a part of the text is unclear, you can try to figure out what it means, but if there is not enough context then just return the raw text.  If any historical names, dates, etc. seem way off, try to correct them, but do not make up anything.  Do NOT add any additional text or commentary like "Here is the text with obvious errors fixed".  The text will be used later in a search engine so it should be as close to the original text as possible. :\n' ||  p.PART_CONTENT_RAW
    ) as TEXT_CONTENT,
    p.SOURCE_FILENAME,
    p.PDF_URL,
    'chap' || LPAD(p.CHAPTER_NUMBER, 2, '0') || '_part' || p.PART_IDENTIFIER as PAGE_ID,
    p.ENHANCED_CITATION
FROM DOCUMENT_PARTS p
WHERE p.PART_CONTENT_RAW IS NOT NULL
  -- AND LENGTH(TRIM(p.PART_CONTENT_RAW)) > 100  -- Quality filter
  AND CONCAT(p.CHAPTER_NUMBER, '_', p.PART_IDENTIFIER, '_RawText') NOT IN (
      SELECT CONCAT(CHAPTER_NUMBER, '_', PART_IDENTIFIER, '_', CONTENT_TYPE) 
      FROM WORLD_HISTORY_RAG 
      WHERE CONTENT_TYPE = 'RawText' AND PART_IDENTIFIER IS NOT NULL
  ); -- Idempotency check


In [None]:
-- AI_COMPLETE part summaries inserted into world_history_rag
INSERT INTO WORLD_HISTORY_RAG (CHAPTER_NUMBER, PART_IDENTIFIER, PAGE_NUMBER, CONTENT_TYPE, TEXT_CONTENT, SOURCE_FILENAME, PDF_URL, PAGE_ID, ENHANCED_CITATION)
SELECT
    p.CHAPTER_NUMBER,
    p.PART_IDENTIFIER,
    NULL AS PAGE_NUMBER,  -- Parts don't have specific page numbers
    'PartSummary' AS CONTENT_TYPE,
    AI_COMPLETE(
        'llama3.3-70b',
        'Do not add any additional text or commentary like "Here is the summary".  If there is not enough context to summarize the part then just return the raw text.  Summarize the following history textbook section in 2-3 concise paragraphs. Focus on the main themes, key events, important people, and historical significance:\n\n' || 
        p.PART_CONTENT_RAW ||  
        '\n\nProvide a clear, educational summary suitable for a history student.'
    ) AS TEXT_CONTENT,
    p.SOURCE_FILENAME,
    p.PDF_URL,
    'CH' || LPAD(p.CHAPTER_NUMBER, 2, '0') || '_' || p.PART_IDENTIFIER as PAGE_ID,
    p.ENHANCED_CITATION
FROM DOCUMENT_PARTS p
WHERE p.PART_CONTENT_RAW IS NOT NULL
  -- AND LENGTH(TRIM(p.PART_CONTENT_RAW)) > 100  -- Quality filter
  AND CONCAT(p.CHAPTER_NUMBER, '_', p.PART_IDENTIFIER, '_PartSummary') NOT IN (
      SELECT CONCAT(CHAPTER_NUMBER, '_', PART_IDENTIFIER, '_', CONTENT_TYPE) 
      FROM WORLD_HISTORY_RAG 
      WHERE CONTENT_TYPE = 'PartSummary' AND PART_IDENTIFIER IS NOT NULL
  ); -- Idempotency check
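
Several of the inserts above guard against duplicates with a `NOT IN` check on a composite key, so a cell can be re-run without summarizing the same content twice. The idea is sketched below in plain Python with hypothetical in-memory rows. `rows_to_insert` is an illustrative helper, and a tuple key stands in for the `CONCAT(...)` string used in the SQL.

```python
def rows_to_insert(candidates, existing_rows, content_type):
    """Return only the candidate parts that have no row of content_type yet."""
    existing_keys = {
        (row["chapter_number"], row["part_identifier"])
        for row in existing_rows
        if row["content_type"] == content_type and row["part_identifier"] is not None
    }
    return [
        part for part in candidates
        if (part["chapter_number"], part["part_identifier"]) not in existing_keys
    ]


# Hypothetical state: one part of chapter 1 already has a summary.
existing = [{"chapter_number": 1, "part_identifier": "pages1to20", "content_type": "PartSummary"}]
parts = [
    {"chapter_number": 1, "part_identifier": "pages1to20"},
    {"chapter_number": 1, "part_identifier": "pages21to40"},
]
print(rows_to_insert(parts, existing, "PartSummary"))
```

Running the same call again after the new row has been added to `existing` returns an empty list, which is the behavior the SQL check aims for.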

In [None]:
-- AI_COMPLETE to aggregate part summaries into chapter summaries and insert into world_history_rag
-- Explore: AI_AGG and how it compares to this method
INSERT INTO WORLD_HISTORY_RAG (CHAPTER_NUMBER, PART_IDENTIFIER, PAGE_NUMBER, CONTENT_TYPE, TEXT_CONTENT, SOURCE_FILENAME, PDF_URL)
WITH aggregated_part_summaries AS (
    SELECT
        CHAPTER_NUMBER,
        ANY_VALUE(PDF_URL) as PDF_URL,
        ANY_VALUE(SOURCE_FILENAME) as SOURCE_FILENAME,
        LISTAGG(TEXT_CONTENT, '\n\n---\n\n') WITHIN GROUP (ORDER BY PART_IDENTIFIER) as full_chapter_text,
        COUNT(*) as part_count
    FROM WORLD_HISTORY_RAG
    WHERE CONTENT_TYPE = 'PartSummary'
      AND CHAPTER_NUMBER NOT IN (
          SELECT DISTINCT CHAPTER_NUMBER 
          FROM WORLD_HISTORY_RAG 
          WHERE CONTENT_TYPE = 'ChapterSummary'
      ) -- Idempotency check for chapter summaries
    GROUP BY CHAPTER_NUMBER
)
SELECT
    a.CHAPTER_NUMBER, 
    NULL AS PART_IDENTIFIER, 
    NULL AS PAGE_NUMBER, 
    'ChapterSummary' AS CONTENT_TYPE,
    AI_COMPLETE(
        'llama3.3-70b', 
        'You are given several high-level summaries from different sections of a history chapter. Combine them into a single, comprehensive 2-3 paragraph summary of the entire chapter.  Do not add any additional text or commentary like "Here is the summary".  The text will be used later in a search engine so it should be as close to the original text as possible. :\n\n' || 
        a.full_chapter_text ||
        '\n\nCreate a cohesive narrative that synthesizes the key themes, events, and historical significance covered throughout this chapter.'
    ),
    a.SOURCE_FILENAME, 
    a.PDF_URL
FROM aggregated_part_summaries a
WHERE a.part_count > 0;  -- Only process chapters that have part summaries
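
The chapter summary is built in two steps: part summaries are concatenated per chapter with a `---` separator, and the combined text is then passed to the model. The sketch below mimics only the concatenation step with hypothetical rows, to make the ordering visible. `aggregate_part_summaries` is an illustrative name. One thing it highlights, as an observation rather than a confirmed problem, is that ordering by `PART_IDENTIFIER` is a string sort, so `pages100to120` comes before `pages64to76`.

```python
from collections import defaultdict

SEPARATOR = "\n\n---\n\n"  # same separator as the LISTAGG call


def aggregate_part_summaries(rows):
    """Group PartSummary rows by chapter and join them in PART_IDENTIFIER (string) order."""
    summarized = {r["chapter_number"] for r in rows if r["content_type"] == "ChapterSummary"}
    by_chapter = defaultdict(list)
    for row in rows:
        if row["content_type"] == "PartSummary" and row["chapter_number"] not in summarized:
            by_chapter[row["chapter_number"]].append(row)

    return {
        chapter: SEPARATOR.join(
            r["text_content"] for r in sorted(parts, key=lambda r: r["part_identifier"])
        )
        for chapter, parts in by_chapter.items()
    }


# Hypothetical rows for one chapter split into two parts.
rows = [
    {"chapter_number": 3, "part_identifier": "pages64to76", "content_type": "PartSummary", "text_content": "early part"},
    {"chapter_number": 3, "part_identifier": "pages100to120", "content_type": "PartSummary", "text_content": "late part"},
]
print(aggregate_part_summaries(rows)[3])
```

If page order matters for the combined prompt, sorting on the numeric start page extracted from the identifier would be one option to explore.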

In [None]:
-- Presigned URLs expire. They are treated here as lasting only about 24 hours (even though 604800 seconds,
-- i.e. 7 days, is requested below), so this task regenerates them daily; Cortex Search picks up the new URLs at its next refresh
CREATE OR REPLACE TASK DAILY_PDF_URL_REFRESH
    USER_TASK_MANAGED_INITIAL_WAREHOUSE_SIZE = 'XSMALL'
    SCHEDULE = 'USING CRON 0 6 * * * UTC'  -- Daily at 6 AM UTC
    COMMENT = 'Daily refresh of presigned PDF URLs before they expire'
AS
    update world_history_rag
    set pdf_url = GET_PRESIGNED_URL(@WORLD_HISTORY.PUBLIC.PDF_DOCUMENTS, source_filename, 604800);

alter task daily_pdf_url_refresh resume;
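
To sanity-check the refresh cadence against the requested URL lifetime, the arithmetic can be written out as below. This is only a sketch: it assumes the URL really lives for the requested number of seconds, which may not hold on every platform, and `url_valid_at_next_refresh` is an illustrative helper.

```python
from datetime import datetime, timedelta, timezone

EXPIRATION_SECONDS = 604800          # value passed to GET_PRESIGNED_URL above
REFRESH_INTERVAL = timedelta(days=1)  # the task runs once a day


def url_valid_at_next_refresh(issued_at, expiration_seconds=EXPIRATION_SECONDS,
                              refresh_interval=REFRESH_INTERVAL):
    """Return True if a URL issued at issued_at is still valid when the next refresh runs."""
    expires_at = issued_at + timedelta(seconds=expiration_seconds)
    next_refresh = issued_at + refresh_interval
    return expires_at > next_refresh


issued = datetime(2025, 1, 1, 6, 0, tzinfo=timezone.utc)  # hypothetical run time
print(timedelta(seconds=EXPIRATION_SECONDS))               # requested lifetime
print(url_valid_at_next_refresh(issued))                   # 7-day lifetime, daily refresh
print(url_valid_at_next_refresh(issued, expiration_seconds=3600))  # 1-hour lifetime
```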

# Create Cortex Search Service

In [None]:
CREATE OR REPLACE CORTEX SEARCH SERVICE WORLD_HISTORY_RAG_SEARCH
    ON TEXT_CONTENT
    ATTRIBUTES 
        CHAPTER_NUMBER,
        PAGE_NUMBER,
        SOURCE_FILENAME,
        CONTENT_TYPE
    WAREHOUSE = WAREHOUSE_XL_G2
    TARGET_LAG = '1 day' -- to pick up get_presigned_url changes
    EMBEDDING_MODEL = 'snowflake-arctic-embed-l-v2.0'
    AS (
        SELECT 
            wh.TEXT_CONTENT,
            wh.CHAPTER_NUMBER,
            wh.PAGE_NUMBER,
            wh.PART_IDENTIFIER,
            wh.CONTENT_TYPE,
            wh.SOURCE_FILENAME,
            wh.PDF_URL,
            wh.PAGE_ID,
            wh.ENHANCED_CITATION
        FROM WORLD_HISTORY_RAG wh
        WHERE wh.TEXT_CONTENT IS NOT NULL
          AND LENGTH(TRIM(wh.TEXT_CONTENT)) > 50
    );
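
Once the service exists, queries can be restricted by the columns listed under `ATTRIBUTES`. The sketch below only builds a request payload as a Python dictionary; it does not call Snowflake. The operator names (`@eq`, `@gte`, `@lte`, `@and`) follow the Cortex Search filter syntax as I understand it and should be checked against the current documentation. `build_search_request` and the chosen columns are illustrative assumptions.

```python
import json

# Columns declared under ATTRIBUTES in the service definition above.
FILTERABLE = {"CHAPTER_NUMBER", "PAGE_NUMBER", "SOURCE_FILENAME", "CONTENT_TYPE"}


def build_search_request(query, chapter_number=None, content_type=None, limit=5):
    """Build a query payload, optionally filtered by chapter and content type."""
    conditions = []
    if chapter_number is not None:
        # A numeric equality expressed as a closed range.
        conditions.append({"@gte": {"CHAPTER_NUMBER": chapter_number}})
        conditions.append({"@lte": {"CHAPTER_NUMBER": chapter_number}})
    if content_type is not None:
        conditions.append({"@eq": {"CONTENT_TYPE": content_type}})

    request = {
        "query": query,
        "columns": ["TEXT_CONTENT", "PAGE_ID", "ENHANCED_CITATION", "PDF_URL"],
        "limit": limit,
    }
    if len(conditions) == 1:
        request["filter"] = conditions[0]
    elif conditions:
        request["filter"] = {"@and": conditions}
    return request


print(json.dumps(build_search_request("causes of the French Revolution", content_type="PageSummary"), indent=2))
```

The resulting JSON string is the kind of payload that could be passed to `SNOWFLAKE.CORTEX.SEARCH_PREVIEW` together with the service name for a quick test from SQL.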

In [None]:
-- View to retrieve connections for each page
CREATE OR REPLACE VIEW MULTIHOP_SEARCH_RESULTS AS
WITH search_base AS (
    -- All searchable content with page IDs
    SELECT 
        page_id,
        wh.CHAPTER_NUMBER,
        wh.PAGE_NUMBER,
        wh.CONTENT_TYPE,
        'Chapter ' || wh.CHAPTER_NUMBER || COALESCE(', Page ' || wh.PAGE_NUMBER, '') || ' (' || wh.CONTENT_TYPE || ')' as citation,
        LEFT(wh.TEXT_CONTENT, 500) as content_preview,
        wh.PDF_URL
    FROM WORLD_HISTORY_RAG wh
    WHERE wh.TEXT_CONTENT IS NOT NULL
)
SELECT 
    sb.*,
    COALESCE(connected.related_pages, ARRAY_CONSTRUCT()) as connected_pages,
    COALESCE(connected.connection_count, 0) as connection_count
FROM search_base sb
LEFT JOIN (
    SELECT 
        e.SRC_PAGE_ID,
        ARRAY_AGG(OBJECT_CONSTRUCT(
            'page_id', e.DST_PAGE_ID,
            'citation', 'Chapter ' || e.DST_CHAPTER_NUMBER || ', Page ' || e.DST_PAGE_NUMBER,
            'context', e.REFERENCE_CONTEXT,
            'confidence', e.CONFIDENCE_SCORE
        )) as related_pages,
        COUNT(*) as connection_count
    FROM DOCUMENT_EDGES e
    WHERE e.CONFIDENCE_SCORE > 0.5
    GROUP BY e.SRC_PAGE_ID
) connected ON sb.page_id = connected.SRC_PAGE_ID;

-- Function to be used as a tool by the Cortex Agent.
--   The Agent only accepts functions or procedures as tools, so this wraps the view
--   and returns the matching rows as a JSON array that the Agent can consume easily.
--   Takes a single page ID.
CREATE OR REPLACE FUNCTION MULTIHOP_SEARCH_RESULTS_FN(page_id_param STRING)
RETURNS array
LANGUAGE SQL
AS
$$
    SELECT 
    ARRAY_AGG(OBJECT_CONSTRUCT(*))
        FROM multihop_search_results
        WHERE page_id = page_id_param
$$;

In [None]:
-- Function to traverse multiple graph connections
-- e.g., 2 hops will follow "See page 10" -> "See page 20" -> "See page 30"
CREATE OR REPLACE FUNCTION FIND_CONNECTED_PAGES(starting_page_id STRING, max_hops INT)
RETURNS array
LANGUAGE SQL
AS
$$
    WITH RECURSIVE page_traversal AS (
        -- Starting page (hop 0)
        SELECT 
            page_id as dest_page_id,
            chapter_number as dest_chapter_number,
            page_number as dest_page_number,
            enhanced_citation,
            ARRAY_CONSTRUCT('Starting page: ' || enhanced_citation) as connection_path,
            0 as hop_count
        FROM DOCUMENT_PAGES 
        WHERE page_id = starting_page_id

        UNION ALL

        -- Connected pages (hop 1+)
        SELECT 
            COALESCE(e.DST_PAGE_ID, 'MISSING_CH' || e.DST_CHAPTER_NUMBER || '_P' || LPAD(e.DST_PAGE_NUMBER, 4, '0')) as dest_page_id,
            e.DST_CHAPTER_NUMBER as dest_chapter_number,
            e.DST_PAGE_NUMBER as dest_page_number,
            COALESCE(dp.ENHANCED_CITATION, 'Referenced: Chapter ' || e.DST_CHAPTER_NUMBER || ', Page ' || e.DST_PAGE_NUMBER) as enhanced_citation,
            ARRAY_APPEND(pt.connection_path, e.REFERENCE_EXPLANATION || ' (' || e.REFERENCE_CONTEXT || ')') as connection_path,
            pt.hop_count + 1
        FROM page_traversal pt
        JOIN DOCUMENT_EDGES e ON pt.dest_page_id = e.SRC_PAGE_ID
        LEFT JOIN DOCUMENT_PAGES dp ON e.DST_PAGE_ID = dp.PAGE_ID
        WHERE pt.hop_count < max_hops
          AND e.CONFIDENCE_SCORE > 0.5
    )
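    SELECT 
        -- Order inside the aggregate; a trailing ORDER BY cannot sort the elements of the single aggregated array
        ARRAY_AGG(OBJECT_CONSTRUCT(*)) WITHIN GROUP (ORDER BY hop_count, dest_chapter_number, dest_page_number)
    FROM page_traversal 
    WHERE hop_count > 0  -- Exclude starting page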
$$;
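
The recursive CTE above walks the reference graph level by level: it starts from one page, follows every edge whose confidence is above 0.5, and stops after `max_hops` levels. The sketch below shows the same traversal over a small in-memory edge list. The edge data and the dictionary keys (`src`, `dst`, `context`, `confidence`) are hypothetical stand-ins for the `DOCUMENT_EDGES` columns.

```python
def find_connected_pages(edges, starting_page_id, max_hops, min_confidence=0.5):
    """Follow edges from starting_page_id for up to max_hops levels, recording each path."""
    results = []
    frontier = [(starting_page_id, [f"Starting page: {starting_page_id}"], 0)]
    while frontier:
        next_frontier = []
        for page_id, path, hops in frontier:
            if hops >= max_hops:
                continue
            for edge in edges:
                if edge["src"] == page_id and edge["confidence"] > min_confidence:
                    new_path = path + [edge["context"]]
                    results.append({
                        "dest_page_id": edge["dst"],
                        "hop_count": hops + 1,
                        "connection_path": new_path,
                    })
                    next_frontier.append((edge["dst"], new_path, hops + 1))
        frontier = next_frontier
    return results


# Hypothetical edges: P10 -> P20 -> P30, plus a low-confidence edge that is ignored.
edges = [
    {"src": "CH01_P0010", "dst": "CH01_P0020", "context": "See page 20", "confidence": 0.9},
    {"src": "CH01_P0020", "dst": "CH01_P0030", "context": "See page 30", "confidence": 0.8},
    {"src": "CH01_P0010", "dst": "CH01_P0040", "context": "See page 40", "confidence": 0.3},
]
for hit in find_connected_pages(edges, "CH01_P0010", max_hops=2):
    print(hit["hop_count"], hit["dest_page_id"], hit["connection_path"])
```

Like the SQL, this sketch does not track visited pages; `max_hops` is what bounds the walk if the references form a cycle.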

# End document ingestion and analysis

# Set Up School District Data

This section creates a new schema, `SCHOOLS`, for all data related to cities, schools, teachers, tests, test questions, etc.

In [None]:
USE DATABASE WORLD_HISTORY;
CREATE SCHEMA IF NOT EXISTS SCHOOLS;
USE SCHEMA SCHOOLS;

In [None]:
-- =====================================================================================
-- SCHOOL DISTRICTS DATABASE - NORMALIZED DESIGN
-- Create normalized tables for school district data from the 5 biggest US cities
-- =====================================================================================

-- School district data is created directly in the SCHOOLS schema selected above (no separate stage needed)

-- =====================================================================================
-- NORMALIZED TABLE STRUCTURES
-- =====================================================================================

-- 1. CITIES TABLE
-- Contains the 5 biggest US cities by population
CREATE OR REPLACE TABLE cities (
    city_id INTEGER,
    city_name VARCHAR(100),
    state VARCHAR(50),
    population INTEGER,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP()
);

-- 2. SCHOOL DISTRICTS TABLE  
-- Contains the largest school district in each city
CREATE OR REPLACE TABLE school_districts (
    district_id INTEGER,
    district_name VARCHAR(200),
    city_id INTEGER,
    total_students INTEGER,
    total_schools INTEGER,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP()
);

-- 3. HIGH SCHOOLS TABLE
-- Contains 3 high schools from each district (15 total)
CREATE OR REPLACE TABLE high_schools (
    school_id INTEGER,
    school_name VARCHAR(200),
    district_id INTEGER,
    school_type VARCHAR(50), -- Regular, Magnet, Charter, etc.
    enrollment INTEGER,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP()
);

-- 4. TEACHERS TABLE
-- Contains 3 world history teachers per school (45 total)
CREATE OR REPLACE TABLE teachers (
    teacher_id INTEGER,
    teacher_name VARCHAR(100),
    school_id INTEGER,
    subject VARCHAR(50),
    years_experience INTEGER,
    education_level VARCHAR(50),
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP()
);

-- 5. CLASSES TABLE
-- Contains 3 world history classes per school (45 total)
CREATE OR REPLACE TABLE classes (
    class_id INTEGER,
    class_name VARCHAR(100),
    teacher_id INTEGER,
    school_id INTEGER,
    period INTEGER,
    room_number VARCHAR(20),
    max_capacity INTEGER,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP()
);

-- 6. STUDENTS TABLE
-- Contains 30 students per class (1,350 total students)
CREATE OR REPLACE TABLE students (
    student_id INTEGER,
    student_name VARCHAR(100),
    class_id INTEGER,
    grade_level INTEGER,
    age INTEGER,
    gender VARCHAR(10),
    enrollment_date DATE,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP()
);


In [None]:

-- =====================================================================================
-- DATA INSERTION
-- =====================================================================================

-- Insert Cities Data
INSERT INTO cities VALUES
(1, 'New York City', 'New York', 8500000, CURRENT_TIMESTAMP()),
(2, 'Los Angeles', 'California', 3900000, CURRENT_TIMESTAMP()),
(3, 'Chicago', 'Illinois', 2700000, CURRENT_TIMESTAMP()),
(4, 'Houston', 'Texas', 2300000, CURRENT_TIMESTAMP()),
(5, 'Phoenix', 'Arizona', 1700000, CURRENT_TIMESTAMP());

-- Insert School Districts Data
INSERT INTO school_districts VALUES
(1, 'New York City Department of Education', 1, 1100000, 1800, CURRENT_TIMESTAMP()),
(2, 'Los Angeles Unified School District', 2, 565000, 1300, CURRENT_TIMESTAMP()),
(3, 'Chicago Public Schools', 3, 330000, 650, CURRENT_TIMESTAMP()),
(4, 'Houston Independent School District', 4, 190000, 280, CURRENT_TIMESTAMP()),
(5, 'Phoenix Union High School District', 5, 28000, 22, CURRENT_TIMESTAMP());

-- Insert High Schools Data (3 per district)
INSERT INTO high_schools VALUES
-- NYC Schools
(1, 'Stuyvesant High School', 1, 'Specialized', 3300, CURRENT_TIMESTAMP()),
(2, 'Bronx High School of Science', 1, 'Specialized', 3000, CURRENT_TIMESTAMP()),
(3, 'Brooklyn Technical High School', 1, 'Specialized', 5400, CURRENT_TIMESTAMP()),

-- LA Schools  
(4, 'Garfield High School', 2, 'Regular', 2800, CURRENT_TIMESTAMP()),
(5, 'Hollywood High School', 2, 'Magnet', 2100, CURRENT_TIMESTAMP()),
(6, 'Lincoln High School', 2, 'Regular', 3200, CURRENT_TIMESTAMP()),

-- Chicago Schools
(7, 'Whitney Young Magnet High School', 3, 'Selective Enrollment', 2200, CURRENT_TIMESTAMP()),
(8, 'Lane Tech College Prep High School', 3, 'Selective Enrollment', 4400, CURRENT_TIMESTAMP()),
(9, 'Jones College Prep High School', 3, 'Selective Enrollment', 1800, CURRENT_TIMESTAMP()),

-- Houston Schools
(10, 'Bellaire High School', 4, 'Regular', 3600, CURRENT_TIMESTAMP()),
(11, 'Lamar High School', 4, 'Regular', 3100, CURRENT_TIMESTAMP()),
(12, 'Carnegie Vanguard High School', 4, 'Magnet', 1200, CURRENT_TIMESTAMP()),

-- Phoenix Schools
(13, 'Central High School', 5, 'Regular', 1800, CURRENT_TIMESTAMP()),
(14, 'Skyline High School', 5, 'Regular', 2400, CURRENT_TIMESTAMP()),
(15, 'Peoria High School', 5, 'Regular', 2900, CURRENT_TIMESTAMP());

-- Insert Teachers Data (3 world history teachers per school)
INSERT INTO teachers VALUES
-- NYC Teachers (Schools 1-3)
(1, 'Sarah Johnson', 1, 'World History', 12, 'Masters', CURRENT_TIMESTAMP()),
(2, 'Michael Chen', 1, 'World History', 8, 'Masters', CURRENT_TIMESTAMP()),
(3, 'Rebecca Martinez', 1, 'World History', 15, 'Masters', CURRENT_TIMESTAMP()),
(4, 'David Thompson', 2, 'World History', 10, 'Masters', CURRENT_TIMESTAMP()),
(5, 'Lisa Wang', 2, 'World History', 6, 'Bachelors', CURRENT_TIMESTAMP()),
(6, 'James Rodriguez', 2, 'World History', 14, 'Masters', CURRENT_TIMESTAMP()),
(7, 'Amanda Foster', 3, 'World History', 9, 'Masters', CURRENT_TIMESTAMP()),
(8, 'Robert Kim', 3, 'World History', 11, 'Masters', CURRENT_TIMESTAMP()),
(9, 'Jennifer Lopez', 3, 'World History', 7, 'Bachelors', CURRENT_TIMESTAMP()),

-- LA Teachers (Schools 4-6)
(10, 'Carlos Gutierrez', 4, 'World History', 13, 'Masters', CURRENT_TIMESTAMP()),
(11, 'Michelle Davis', 4, 'World History', 5, 'Bachelors', CURRENT_TIMESTAMP()),
(12, 'Anthony Wilson', 4, 'World History', 16, 'Doctorate', CURRENT_TIMESTAMP()),
(13, 'Maria Hernandez', 5, 'World History', 8, 'Masters', CURRENT_TIMESTAMP()),
(14, 'Kevin Park', 5, 'World History', 12, 'Masters', CURRENT_TIMESTAMP()),
(15, 'Rachel Green', 5, 'World History', 4, 'Bachelors', CURRENT_TIMESTAMP()),
(16, 'Daniel Lee', 6, 'World History', 10, 'Masters', CURRENT_TIMESTAMP()),
(17, 'Nicole Brown', 6, 'World History', 14, 'Masters', CURRENT_TIMESTAMP()),
(18, 'Steven Garcia', 6, 'World History', 7, 'Bachelors', CURRENT_TIMESTAMP()),

-- Chicago Teachers (Schools 7-9)
(19, 'Emily Anderson', 7, 'World History', 11, 'Masters', CURRENT_TIMESTAMP()),
(20, 'Matthew Taylor', 7, 'World History', 9, 'Masters', CURRENT_TIMESTAMP()),
(21, 'Ashley Miller', 7, 'World History', 6, 'Bachelors', CURRENT_TIMESTAMP()),
(22, 'Brian Jackson', 8, 'World History', 15, 'Doctorate', CURRENT_TIMESTAMP()),
(23, 'Samantha White', 8, 'World History', 8, 'Masters', CURRENT_TIMESTAMP()),
(24, 'Christopher Moore', 8, 'World History', 12, 'Masters', CURRENT_TIMESTAMP()),
(25, 'Jessica Clark', 9, 'World History', 5, 'Bachelors', CURRENT_TIMESTAMP()),
(26, 'Ryan Lewis', 9, 'World History', 13, 'Masters', CURRENT_TIMESTAMP()),
(27, 'Lauren Adams', 9, 'World History', 10, 'Masters', CURRENT_TIMESTAMP()),

-- Houston Teachers (Schools 10-12)
(28, 'Mark Turner', 10, 'World History', 14, 'Masters', CURRENT_TIMESTAMP()),
(29, 'Stephanie Phillips', 10, 'World History', 7, 'Bachelors', CURRENT_TIMESTAMP()),
(30, 'Joseph Campbell', 10, 'World History', 11, 'Masters', CURRENT_TIMESTAMP()),
(31, 'Melissa Parker', 11, 'World History', 9, 'Masters', CURRENT_TIMESTAMP()),
(32, 'William Evans', 11, 'World History', 16, 'Doctorate', CURRENT_TIMESTAMP()),
(33, 'Kimberly Scott', 11, 'World History', 6, 'Bachelors', CURRENT_TIMESTAMP()),
(34, 'Thomas Roberts', 12, 'World History', 12, 'Masters', CURRENT_TIMESTAMP()),
(35, 'Heather Carter', 12, 'World History', 8, 'Masters', CURRENT_TIMESTAMP()),
(36, 'Jason Mitchell', 12, 'World History', 10, 'Masters', CURRENT_TIMESTAMP()),

-- Phoenix Teachers (Schools 13-15)
(37, 'Andrea Perez', 13, 'World History', 13, 'Masters', CURRENT_TIMESTAMP()),
(38, 'Gregory Hall', 13, 'World History', 5, 'Bachelors', CURRENT_TIMESTAMP()),
(39, 'Vanessa Young', 13, 'World History', 15, 'Doctorate', CURRENT_TIMESTAMP()),
(40, 'Scott Hernandez', 14, 'World History', 9, 'Masters', CURRENT_TIMESTAMP()),
(41, 'Brittany King', 14, 'World History', 7, 'Bachelors', CURRENT_TIMESTAMP()),
(42, 'Nathan Wright', 14, 'World History', 14, 'Masters', CURRENT_TIMESTAMP()),
(43, 'Crystal Lopez', 15, 'World History', 11, 'Masters', CURRENT_TIMESTAMP()),
(44, 'Derek Hill', 15, 'World History', 6, 'Bachelors', CURRENT_TIMESTAMP()),
(45, 'Tiffany Green', 15, 'World History', 12, 'Masters', CURRENT_TIMESTAMP());

-- Insert Classes Data (3 world history classes per school, 1 per teacher)
INSERT INTO classes VALUES
-- NYC Classes (Schools 1-3)
(1, 'World History Period 1', 1, 1, 1, 'A101', 30, CURRENT_TIMESTAMP()),
(2, 'World History Period 3', 2, 1, 3, 'A102', 30, CURRENT_TIMESTAMP()),
(3, 'World History Period 5', 3, 1, 5, 'A103', 30, CURRENT_TIMESTAMP()),
(4, 'World History Period 2', 4, 2, 2, 'B201', 30, CURRENT_TIMESTAMP()),
(5, 'World History Period 4', 5, 2, 4, 'B202', 30, CURRENT_TIMESTAMP()),
(6, 'World History Period 6', 6, 2, 6, 'B203', 30, CURRENT_TIMESTAMP()),
(7, 'World History Period 1', 7, 3, 1, 'C301', 30, CURRENT_TIMESTAMP()),
(8, 'World History Period 3', 8, 3, 3, 'C302', 30, CURRENT_TIMESTAMP()),
(9, 'World History Period 5', 9, 3, 5, 'C303', 30, CURRENT_TIMESTAMP()),

-- LA Classes (Schools 4-6)
(10, 'World History Period 2', 10, 4, 2, 'D401', 30, CURRENT_TIMESTAMP()),
(11, 'World History Period 4', 11, 4, 4, 'D402', 30, CURRENT_TIMESTAMP()),
(12, 'World History Period 6', 12, 4, 6, 'D403', 30, CURRENT_TIMESTAMP()),
(13, 'World History Period 1', 13, 5, 1, 'E501', 30, CURRENT_TIMESTAMP()),
(14, 'World History Period 3', 14, 5, 3, 'E502', 30, CURRENT_TIMESTAMP()),
(15, 'World History Period 5', 15, 5, 5, 'E503', 30, CURRENT_TIMESTAMP()),
(16, 'World History Period 2', 16, 6, 2, 'F601', 30, CURRENT_TIMESTAMP()),
(17, 'World History Period 4', 17, 6, 4, 'F602', 30, CURRENT_TIMESTAMP()),
(18, 'World History Period 6', 18, 6, 6, 'F603', 30, CURRENT_TIMESTAMP()),

-- Chicago Classes (Schools 7-9)
(19, 'World History Period 1', 19, 7, 1, 'G701', 30, CURRENT_TIMESTAMP()),
(20, 'World History Period 3', 20, 7, 3, 'G702', 30, CURRENT_TIMESTAMP()),
(21, 'World History Period 5', 21, 7, 5, 'G703', 30, CURRENT_TIMESTAMP()),
(22, 'World History Period 2', 22, 8, 2, 'H801', 30, CURRENT_TIMESTAMP()),
(23, 'World History Period 4', 23, 8, 4, 'H802', 30, CURRENT_TIMESTAMP()),
(24, 'World History Period 6', 24, 8, 6, 'H803', 30, CURRENT_TIMESTAMP()),
(25, 'World History Period 1', 25, 9, 1, 'I901', 30, CURRENT_TIMESTAMP()),
(26, 'World History Period 3', 26, 9, 3, 'I902', 30, CURRENT_TIMESTAMP()),
(27, 'World History Period 5', 27, 9, 5, 'I903', 30, CURRENT_TIMESTAMP()),

-- Houston Classes (Schools 10-12)
(28, 'World History Period 2', 28, 10, 2, 'J1001', 30, CURRENT_TIMESTAMP()),
(29, 'World History Period 4', 29, 10, 4, 'J1002', 30, CURRENT_TIMESTAMP()),
(30, 'World History Period 6', 30, 10, 6, 'J1003', 30, CURRENT_TIMESTAMP()),
(31, 'World History Period 1', 31, 11, 1, 'K1101', 30, CURRENT_TIMESTAMP()),
(32, 'World History Period 3', 32, 11, 3, 'K1102', 30, CURRENT_TIMESTAMP()),
(33, 'World History Period 5', 33, 11, 5, 'K1103', 30, CURRENT_TIMESTAMP()),
(34, 'World History Period 2', 34, 12, 2, 'L1201', 30, CURRENT_TIMESTAMP()),
(35, 'World History Period 4', 35, 12, 4, 'L1202', 30, CURRENT_TIMESTAMP()),
(36, 'World History Period 6', 36, 12, 6, 'L1203', 30, CURRENT_TIMESTAMP()),

-- Phoenix Classes (Schools 13-15)
(37, 'World History Period 1', 37, 13, 1, 'M1301', 30, CURRENT_TIMESTAMP()),
(38, 'World History Period 3', 38, 13, 3, 'M1302', 30, CURRENT_TIMESTAMP()),
(39, 'World History Period 5', 39, 13, 5, 'M1303', 30, CURRENT_TIMESTAMP()),
(40, 'World History Period 2', 40, 14, 2, 'N1401', 30, CURRENT_TIMESTAMP()),
(41, 'World History Period 4', 41, 14, 4, 'N1402', 30, CURRENT_TIMESTAMP()),
(42, 'World History Period 6', 42, 14, 6, 'N1403', 30, CURRENT_TIMESTAMP()),
(43, 'World History Period 1', 43, 15, 1, 'O1501', 30, CURRENT_TIMESTAMP()),
(44, 'World History Period 3', 44, 15, 3, 'O1502', 30, CURRENT_TIMESTAMP()),
(45, 'World History Period 5', 45, 15, 5, 'O1503', 30, CURRENT_TIMESTAMP());


In [None]:

-- =====================================================================================
-- GENERATE STUDENT DATA 
-- Note: The INSERT below generates the full dataset, 30 students for each of the 45 classes
-- =====================================================================================

-- Generate all 1,350 students with systematic unique naming pattern
-- 45 classes × 30 students per class = 1,350 total students

INSERT INTO students (student_id, student_name, class_id, grade_level, age, gender, enrollment_date, created_at)
WITH student_generator AS (
    SELECT 
        c.class_id,
        sd.district_id,
        hs.school_id,
        ROW_NUMBER() OVER (ORDER BY c.class_id) as class_seq,
        sd.district_name,
        hs.school_name
    FROM classes c
    JOIN high_schools hs ON c.school_id = hs.school_id
    JOIN school_districts sd ON hs.district_id = sd.district_id
),
student_numbers AS (
    SELECT 
        SEQ4() as student_seq
    FROM TABLE(GENERATOR(ROWCOUNT => 30))
),
all_students AS (
    SELECT 
        ROW_NUMBER() OVER (ORDER BY sg.class_id, sn.student_seq) as student_id,
        'Student_' || ROW_NUMBER() OVER (ORDER BY sg.class_id, sn.student_seq) || 
        '_D' || sg.district_id || 
        '_S' || sg.school_id || 
        '_C' || sg.class_id as student_name,
        sg.class_id,
        -- One random draw per column keeps grade levels (9-12) and ages (14-18) evenly distributed
        8 + UNIFORM(1, 4, RANDOM()) as grade_level,
        13 + UNIFORM(1, 5, RANDOM()) as age,
        CASE WHEN UNIFORM(1,2,RANDOM()) = 1 THEN 'Male' ELSE 'Female' END as gender,
        '2024-09-01'::DATE as enrollment_date
    FROM student_generator sg
    CROSS JOIN student_numbers sn
)
SELECT 
    student_id,
    student_name, 
    class_id,
    grade_level,
    age,
    gender,
    enrollment_date,
    CURRENT_TIMESTAMP()
FROM all_students;
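
As a rough illustration of what the cross join above produces, the following Python sketch builds the same `Student_<id>_D<district>_S<school>_C<class>` names for 45 classes of 30 students. It assumes, as in the insert statements earlier, that classes are numbered three per school and schools three per district; `build_students` is a hypothetical helper and not part of the notebook.

```python
def build_students(num_classes=45, per_class=30):
    """Mimic the class x student-number cross join and its naming pattern."""
    students = []
    student_id = 0
    for class_id in range(1, num_classes + 1):
        # Assumption: classes 1-3 belong to school 1, 4-6 to school 2, and so on;
        # schools 1-3 belong to district 1, 4-6 to district 2, and so on.
        school_id = (class_id - 1) // 3 + 1
        district_id = (school_id - 1) // 3 + 1
        for _ in range(per_class):
            student_id += 1
            name = f"Student_{student_id}_D{district_id}_S{school_id}_C{class_id}"
            students.append((student_id, name, class_id))
    return students


students = build_students()
print(len(students))    # 1350
print(students[0][1])   # Student_1_D1_S1_C1
print(students[-1][1])  # Student_1350_D5_S15_C45
```

Because the running student number is part of the name, every generated name is unique even though the district, school and class suffixes repeat.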


In [None]:

-- =====================================================================================
-- CREATE DENORMALIZED VIEW
-- Single view that joins all tables to show complete student hierarchy
-- =====================================================================================

CREATE OR REPLACE VIEW student_school_complete_vw AS
SELECT 
    -- Student Information
    s.student_id,
    s.student_name,
    s.grade_level,
    s.age,
    s.gender,
    s.enrollment_date,
    
    -- Class Information
    c.class_id,
    c.class_name,
    c.period,
    c.room_number,
    c.max_capacity,
    
    -- Teacher Information  
    t.teacher_id,
    t.teacher_name,
    t.subject,
    t.years_experience,
    t.education_level,
    
    -- School Information
    hs.school_id,
    hs.school_name,
    hs.school_type,
    hs.enrollment AS school_enrollment,
    
    -- District Information
    sd.district_id,
    sd.district_name,
    sd.total_students AS district_total_students,
    sd.total_schools AS district_total_schools,
    
    -- City Information
    ct.city_id,
    ct.city_name,
    ct.state,
    ct.population AS city_population
    
FROM students s
JOIN classes c ON s.class_id = c.class_id
JOIN teachers t ON c.teacher_id = t.teacher_id
JOIN high_schools hs ON c.school_id = hs.school_id
JOIN school_districts sd ON hs.district_id = sd.district_id
JOIN cities ct ON sd.city_id = ct.city_id;

In [None]:
-- Create tables specifically for the test schedule, questions, and responses
CREATE TABLE IF NOT EXISTS WORLD_HISTORY_TEST_QUESTIONS (
    QUESTION_ID NUMBER AUTOINCREMENT PRIMARY KEY,
    FILEPATH STRING,
    CHAPTER_NUMBER NUMBER,
    PAGE_NUMBER NUMBER,
    QUESTION_NUMBER NUMBER,
    QUESTION_TEXT STRING,
    OPTION_A STRING,
    OPTION_B STRING,
    OPTION_C STRING,
    OPTION_D STRING,
    CORRECT_ANSWER STRING,
    DIFFICULTY_LEVEL STRING,
    TOPIC STRING,
    CREATED_TIMESTAMP TIMESTAMP DEFAULT CURRENT_TIMESTAMP()
);

CREATE TABLE IF NOT EXISTS TEST_SCHEDULE (
    SCHEDULE_ID NUMBER AUTOINCREMENT,
    CLASS_ID NUMBER,
    TEACHER_ID NUMBER,
    SCHOOL_ID NUMBER,
    CHAPTER_NUMBER NUMBER,
    TEST_ID NUMBER, -- Will link to CHAPTER_TESTS when tests are generated
    SCHEDULED_DATE DATE,
    SCHEDULED_WEEK NUMBER, -- Week number of academic year (1-40)
    ACADEMIC_YEAR VARCHAR(20) DEFAULT '2024-2025',
    TEST_STATUS VARCHAR(20) DEFAULT 'SCHEDULED',
    CREATED_TIMESTAMP TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
);

CREATE TABLE IF NOT EXISTS STUDENT_QUESTION_RESPONSES (
    RESPONSE_ID INTEGER AUTOINCREMENT PRIMARY KEY,
    TEST_RESULT_ID INTEGER,
    QUESTION_ID INTEGER,
    STUDENT_ID INTEGER,
    STUDENT_ANSWER CHAR(1),
    CORRECT_ANSWER CHAR(1),
    IS_CORRECT BOOLEAN,
    RESPONSE_TIME_SECONDS INTEGER,
    CHAPTER_NUMBER INTEGER,
    DIFFICULTY_LEVEL VARCHAR(20),
    TOPIC VARCHAR(200),
    CREATED_AT TIMESTAMP DEFAULT CURRENT_TIMESTAMP()
);

-- Doesn't need to be a Dynamic Table for the demo, but this showcases how it could be used in a real-world example
CREATE DYNAMIC TABLE IF NOT EXISTS STUDENT_CHAPTER_TEST_RESULTS 
   TARGET_LAG = '10 minutes'
   WAREHOUSE = WAREHOUSE_XL_G2
AS
    SELECT
        STUDENT_ID,
        CHAPTER_NUMBER,
        COUNT(*) AS total_questions,
        SUM(IFF(r.IS_CORRECT, 1, 0)) as correct_answers,
        (correct_answers / total_questions) * 100 AS final_score_percent,
        CASE 
        WHEN GREATEST(65, final_score_percent) >= 97 THEN 'A+'
        WHEN GREATEST(65, final_score_percent) >= 93 THEN 'A'
        WHEN GREATEST(65, final_score_percent) >= 90 THEN 'A-'
        WHEN GREATEST(65, final_score_percent) >= 87 THEN 'B+'
        WHEN GREATEST(65, final_score_percent) >= 83 THEN 'B'
        WHEN GREATEST(65, final_score_percent) >= 80 THEN 'B-'
        WHEN GREATEST(65, final_score_percent) >= 77 THEN 'C+'
        WHEN GREATEST(65, final_score_percent) >= 73 THEN 'C'
        WHEN GREATEST(65, final_score_percent) >= 70 THEN 'C-'
        WHEN GREATEST(65, final_score_percent) >= 67 THEN 'D+'
        WHEN GREATEST(65, final_score_percent) >= 65 THEN 'D'
        ELSE 'F'
    END as LETTER_GRADE
    FROM
        WORLD_HISTORY.SCHOOLS.STUDENT_QUESTION_RESPONSES r
    GROUP BY
        1, 2
    ORDER BY
        STUDENT_ID, CHAPTER_NUMBER;
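
The grading logic in the dynamic table can be sketched in Python as follows. This is only an illustration of the CASE expression above; `letter_grade` and `GRADE_BANDS` are hypothetical names. Note that the score used for the comparison is floored at 65, so the `ELSE 'F'` branch is never reached, while the stored `final_score_percent` remains the raw value.

```python
# Cutoffs copied from the CASE expression, highest first.
GRADE_BANDS = [
    (97, "A+"), (93, "A"), (90, "A-"),
    (87, "B+"), (83, "B"), (80, "B-"),
    (77, "C+"), (73, "C"), (70, "C-"),
    (67, "D+"), (65, "D"),
]


def letter_grade(correct_answers, total_questions):
    """Map a raw result to a letter grade, flooring the score at 65."""
    score = 100 * correct_answers / total_questions
    floored = max(65, score)  # same role as GREATEST(65, final_score_percent)
    for cutoff, grade in GRADE_BANDS:
        if floored >= cutoff:
            return grade
    return "F"  # unreachable because of the floor at 65


print(letter_grade(10, 10))  # A+
print(letter_grade(8, 10))   # B-
print(letter_grade(0, 10))   # D
```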

In [None]:
-- Mock up a school calendar for date based analysis
CREATE OR REPLACE TEMPORARY TABLE academic_calendar AS
WITH date_series AS (
    SELECT 
        DATEADD(WEEK, ROW_NUMBER() OVER (ORDER BY NULL) - 1, '2024-08-05'::DATE) as week_start_date, -- Monday of the first week of the 2024-2025 academic year
        ROW_NUMBER() OVER (ORDER BY NULL) as week_number
    FROM TABLE(GENERATOR(ROWCOUNT => 45)) -- Generate 45 weeks to cover full academic year
),
academic_weeks AS (
    SELECT 
        week_start_date,
        week_number,
        MONTH(week_start_date) as month_num,
        CASE 
            WHEN week_start_date BETWEEN '2024-12-23' AND '2025-01-06' THEN 'Winter Break'
            WHEN week_start_date BETWEEN '2025-03-10' AND '2025-03-17' THEN 'Spring Break'
            WHEN week_start_date BETWEEN '2024-11-25' AND '2024-11-29' THEN 'Thanksgiving Break'
            WHEN DAYOFWEEKISO(week_start_date) IN (6, 7) THEN 'Weekend' -- ISO numbering: 6 = Saturday, 7 = Sunday
            ELSE 'Regular School Week'
        END as week_type
    FROM date_series
    WHERE week_start_date <= '2025-06-15' -- End of academic year
)
SELECT 
    week_number,
    week_start_date,
    week_type,
    -- Adjust to use school days (Tuesday-Thursday for testing)
    DATEADD(DAY, 2, week_start_date) as suggested_test_date -- Wednesday of each week
FROM academic_weeks
WHERE week_type = 'Regular School Week'
ORDER BY week_number;

-- Show academic calendar
SELECT 
    week_number,
    week_start_date,
    suggested_test_date,
    week_type
FROM academic_calendar
LIMIT 40;


INSERT INTO TEST_SCHEDULE (
    CLASS_ID, TEACHER_ID, SCHOOL_ID, CHAPTER_NUMBER, 
    SCHEDULED_DATE, SCHEDULED_WEEK
)
WITH all_classes AS (
    SELECT 
        c.CLASS_ID,
        c.TEACHER_ID,
        c.SCHOOL_ID
    FROM CLASSES c
),
chapter_sequence AS (
    SELECT 
        ROW_NUMBER() OVER (ORDER BY NULL) as chapter_number
    FROM TABLE(GENERATOR(ROWCOUNT => 32)) -- 32 chapters
),
class_schedules AS (
    SELECT 
        ac.CLASS_ID,
        ac.TEACHER_ID,
        ac.SCHOOL_ID,
        cs.chapter_number,
        -- Hash-based (deterministic) starting week 1-8 for each class, then sequential weekly spacing.
        -- ABS() keeps the modulo non-negative, since HASH() returns a signed value.
        LEAST(40, GREATEST(1, 
            (ABS(HASH(ac.CLASS_ID, ac.TEACHER_ID)) % 8) + 1 + (cs.chapter_number - 1)
        )) as assigned_week
    FROM all_classes ac
    CROSS JOIN chapter_sequence cs
),
scheduled_tests AS (
    SELECT 
        cls.CLASS_ID,
        cls.TEACHER_ID,
        cls.SCHOOL_ID,
        cls.chapter_number,
        cls.assigned_week,
        COALESCE(cal.suggested_test_date, 
                DATEADD(WEEK, cls.assigned_week - 1, '2024-08-07')
        ) as scheduled_date
    FROM class_schedules cls
    LEFT JOIN academic_calendar cal ON cls.assigned_week = cal.week_number
    WHERE cls.assigned_week <= 40 -- Ensure we don't exceed academic year
)
SELECT 
    CLASS_ID,
    TEACHER_ID,
    SCHOOL_ID,
    chapter_number,
    scheduled_date,
    assigned_week
FROM scheduled_tests
WHERE scheduled_date IS NOT NULL;
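
The week assignment above can be sketched in Python as below. This is a minimal illustration, assuming `start_offset` stands in for the hash-derived value in the range 0 to 7; Python's own `hash` is deliberately not used because it does not match Snowflake's `HASH`. The function name `assigned_week` is hypothetical.

```python
def assigned_week(start_offset, chapter_number, max_week=40):
    """Week in which a class takes the test for a given chapter.

    start_offset is a stand-in for HASH(class_id, teacher_id) % 8 (assumed 0-7).
    The result is clamped to the range 1..max_week, like LEAST/GREATEST in the SQL.
    """
    return min(max_week, max(1, start_offset + 1 + (chapter_number - 1)))


# A class whose offset is 3 tests chapter 1 in week 4 and chapter 32 in week 35.
weeks = [assigned_week(3, chapter) for chapter in range(1, 33)]
print(weeks[0], weeks[-1])  # 4 35
```

With 32 chapters and a largest offset of 7, the latest assigned week is 39, so the upper clamp at week 40 only acts as a safety net.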

In [None]:
-- Python based function to do a simple text extraction of review questions and answers
-- Note: Originally tried this with AI_COMPLETE and PARSE_DOCUMENT but neither was sufficient.
-- To do: ~8/21 a new version of PARSE_DOCUMENT will be going GA and should be much better; evaluate that after it's available
CREATE OR REPLACE FUNCTION EXTRACT_QUESTIONS("STAGED_FILE_PATH" VARCHAR)
RETURNS VARCHAR
LANGUAGE PYTHON
RUNTIME_VERSION = '3.12'
PACKAGES = ('snowflake-snowpark-python','pypdf')
HANDLER = 'extract_questions'
AS '
import re
import json
from pypdf import PdfReader
from snowflake.snowpark.files import SnowflakeFile

def extract_questions(staged_file_path: str) -> str:
    """
    Extracts numbered questions from a PDF file stored in a Snowflake stage.

    Args:
        staged_file_path: The path to the PDF file within a Snowflake stage.

    Returns:
        A JSON string representing a list of question objects,
        each with a ''number'' and ''text'' key.
    """
    try:
        with SnowflakeFile.open(staged_file_path, ''rb'') as f:
            reader = PdfReader(f)
            full_text = "".join(page.extract_text() or "" for page in reader.pages)
        
        # Modified regex:
        # (\\d+)      - Capturing group 1: one or more digits (the question number).
        # \\.\\s       - A literal period followed by a space.
        # (.*?)      - Capturing group 2: the question text (non-greedy).
        # (?=...)    - Positive lookahead to stop at the next number or end of string.
        found_questions = re.findall(r"(\\d+)\\.\\s(.*?)(?=\\d+\\.\\s|\\Z)", full_text, re.DOTALL)
        
        # Format the list of tuples [(''1'', ''text...''), (''2'', ''text...'')] into a list of dicts.
        questions_list = [
            {"number": int(number), "text": text.strip()}
            for number, text in found_questions
        ]
        
        # Return the list as a JSON string.
        return json.dumps(questions_list)

    except Exception as e:
        return f"Error processing file: {staged_file_path}. Details: {str(e)}"
';
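
To see what the regular expression in the UDF does without a staged PDF, the sketch below runs the same pattern on a short string. The sample text is made up for illustration and `split_questions` is a hypothetical helper; inside the SQL string literal the backslashes are doubled, while plain Python needs only one.

```python
import json
import re

# Same pattern as in the UDF, written with single backslashes for plain Python.
QUESTION_PATTERN = re.compile(r"(\d+)\.\s(.*?)(?=\d+\.\s|\Z)", re.DOTALL)


def split_questions(full_text):
    """Split text into numbered questions: '<number>. <text>' up to the next number."""
    return [
        {"number": int(number), "text": text.strip()}
        for number, text in QUESTION_PATTERN.findall(full_text)
    ]


# Made-up sample text, not taken from the textbook.
sample = (
    "1. Who built the pyramids?\nA Romans\nB Egyptians\n"
    "2. Name one river in Mesopotamia.\n"
)
questions = split_questions(sample)
print(json.dumps(questions, indent=2))
```

One limitation worth keeping in mind: any number followed by a period and whitespace is treated as the start of a question, so a sentence such as "War began in 1914. Why?" is split at the year.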

In [None]:
-- Find each page with "standardized test practice" and construct a url to that page and the following 2 pages
-- Call the `extract_questions` function on each page
-- No insert statement here because the next cell will call the results

WITH test_assessment_start_page AS (
  SELECT
    page_id,
    chapter_number,
    page_number
  FROM
    world_history.public.world_history_rag
  WHERE
    content_type = 'RawText' AND page_number IS NOT NULL AND text_content ILIKE '%STANDARDIZED TEST PRACTICE%'
    -- and chapter_number like 23
    -- limit 5
), 
ASSESSMENT_PAGES AS (
    SELECT
      f.value :: STRING AS file_path,
      chapter_number,
      page_number
    FROM
      test_assessment_start_page,
      LATERAL FLATTEN(
        input => ARRAY_CONSTRUCT(
          '/pages/chap' || LPAD(chapter_number, 2, '0') || '_page' || LPAD(page_number, 4, '0') || '.pdf',
          '/pages/chap' || LPAD(chapter_number, 2, '0') || '_page' || LPAD(page_number + 1, 4, '0') || '.pdf',
          '/pages/chap' || LPAD(chapter_number, 2, '0') || '_page' || LPAD(page_number + 2, 4, '0') || '.pdf'
        )
      ) f
)

select 
    ap.file_path,
    extract_questions(BUILD_SCOPED_FILE_URL(@world_history.public.pdf_documents, ap.file_path)) as extracted_questions,
    ap.chapter_number,
    ap.page_number
FROM
    assessment_pages ap
GROUP BY
    1, 3, 4;


In [None]:
-- Flatten the questions retrieved in the {{extract_test_questions}} cell
SELECT
        eqc.file_path,
        eqc.page_number,
        eqc.chapter_number,
        f.value:number::INT % 100 AS question_number,
        f.value:text::STRING AS question_text
    FROM
        {{extract_test_questions}} eqc,
        LATERAL FLATTEN(input => PARSE_JSON(eqc.extracted_questions)) f
    WHERE
        -- This WHERE clause is now the main filter for finding MCQs.
        -- It's more robust to do this here than in the Python UDF.
        -- Using LIKE without periods is slightly more forgiving.
        question_text LIKE '%A %'
        AND question_text LIKE '%B %'
        AND question_text LIKE '%C %'
        AND question_text LIKE '%D %'
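
The multiple-choice filter in the WHERE clause can be sketched in Python as below. It is a minimal illustration with hypothetical helper names; like `LIKE`, the substring checks are case-sensitive.

```python
def looks_like_mcq(question_text):
    """True when the text contains 'A ', 'B ', 'C ' and 'D ' (case-sensitive)."""
    return all(f"{letter} " in question_text for letter in "ABCD")


def normalize_question_number(raw_number):
    """Mirror the `number % 100` expression in the SELECT list."""
    return raw_number % 100


print(looks_like_mcq("Which river?\nA Nile\nB Tigris\nC Indus\nD Danube"))  # True
print(looks_like_mcq("Explain the causes of the war."))                    # False
print(normalize_question_number(312))                                      # 12
```

The check is intentionally forgiving: any text that happens to contain all four capital letters followed by a space passes, so some non-question text can slip through to the next step.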

In [None]:
-- AI_COMPLETE to format the questions from {{chunk_test_questions}} in JSON, 
--- select the correct answer, assign it a topic and difficulty level 
SELECT
    ctq.*,
       try_parse_json(AI_COMPLETE(
        'mixtral-8x7b',
        'You are formatting test questions that have been extracted from a World History textbook page. 
        
        TASK: Extract ALL multiple choice test questions from this textbook page, format it properly and supply the right answer.
        
        REQUIREMENTS:
        1. Extract ONLY multiple choice questions (ignore essay, short answer, fill-in-the-blank)
        2. Each question must have exactly 4 answer choices (A, B, C, D)
        3. If the correct answer is provided, include it; if not, research the answer to find it.  Do not leave an answer blank or null.  If you cannot find an answer put "C"
        4. Extract the topic/concept being tested based on other questions in the input.
        5. Return results in this EXACT JSON format:
        
        {
          "extracted_questions": [
            {
              "question_text": "Which type of scientist uses fossils and artifacts to study early humans?",
              "option_a": "Chemists",
              "option_b": "Physicists", 
              "option_c": "Anthropologists",
              "option_d": "Geologists",
              "correct_answer": "C",
              "topic": "Early Human Study",
              "difficulty_level": "Medium"
            }
          ],
          "extraction_notes": "Found X questions on page",
          "page_info": {
            "chapter_number": ' || ctq.chapter_number || ',
            "page_number": ' || ctq.page_number || ',
            "contains_test_questions": true
          }
        }
        
        IMPORTANT RULES:
        - Return ONLY valid JSON, no other text
        - Use "Easy", "Medium", or "Hard" for difficulty_level
        - If no test questions found, return empty array for extracted_questions
        - If questions are incomplete or don\'t have 4 options, skip them
        - If any answers are more than one choice (eg "C, D"), skip them.  
        - Extract the question text exactly as written, EXCEPT extract the specific options to the option_a, option_b, option_c or option_d field. Do NOT include the letter for the option in the value field. For example, if two of the potential answers are "A Conscription\nB War communism\n" then the JSON should only include the word that would fill in the blank => {"option_a": "Conscription", "option_b": "War communism"} 
        - If there is a fill in the blank (missing word), replace the empty space with "_____".

        EXAMPLE:
        INPUT TEXT: "is the process of assembling troops and supplies to \nget ready for war. \nA Conscription\nB War communism\nC Armistice\nD Mobilization\n"
        DESIRED OUTPUT:
            {
              "correct_answer": "D",
              "difficulty_level": "Easy",
              "option_a": "Conscription",
              "option_b": "War communism",
              "option_c": "Armistice",
              "option_d": "Mobilization",
              "question_number": 3,
              "question_text": "_____ is the process of assembling troops and supplies to get ready for war.",
              "topic": "War Preparation"
            },
        
        PAGE CONTENT:
        ' || ctq.question_text
    )) AS extraction_result
FROM {{chunk_test_questions}} ctq
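
`TRY_PARSE_JSON` returns NULL when the model output is not valid JSON, but valid JSON can still be missing fields. The sketch below shows one way such a response could be checked before it is used. It is an assumption about a possible validation step, not something the notebook does; `parse_extraction` and `REQUIRED_KEYS` are hypothetical names.

```python
import json

REQUIRED_KEYS = {
    "question_text", "option_a", "option_b", "option_c", "option_d", "correct_answer",
}


def parse_extraction(raw_response):
    """Return only well-formed questions from a model response, or [] on any problem."""
    try:
        payload = json.loads(raw_response)
    except (TypeError, ValueError):
        return []  # same spirit as TRY_PARSE_JSON returning NULL
    if not isinstance(payload, dict):
        return []
    questions = payload.get("extracted_questions", [])
    if not isinstance(questions, list):
        return []
    return [
        q for q in questions
        if isinstance(q, dict)
        and REQUIRED_KEYS <= q.keys()
        and q["correct_answer"] in ("A", "B", "C", "D")
    ]


good = {
    "question_text": "_____ is the process of assembling troops and supplies to get ready for war.",
    "option_a": "Conscription", "option_b": "War communism",
    "option_c": "Armistice", "option_d": "Mobilization",
    "correct_answer": "D",
}
raw = json.dumps({"extracted_questions": [good, {"question_text": "incomplete"}]})
print(len(parse_extraction(raw)))  # 1
```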

In [None]:
-- Finally, parse the json and insert the questions from {{parse_test_questions}} into the world_history_test_questions table

INSERT INTO WORLD_HISTORY_TEST_QUESTIONS(
    FILEPATH,
    CHAPTER_NUMBER,
    PAGE_NUMBER,
    QUESTION_NUMBER,
    QUESTION_TEXT,
    OPTION_A,
    OPTION_B,
    OPTION_C,
    OPTION_D,
    CORRECT_ANSWER,
    DIFFICULTY_LEVEL,
    TOPIC
)

SELECT
 ptq.file_path,
 ptq.chapter_number,
 ptq.page_number,
 ptq.question_number,
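 f.value:question_text::STRING as question_text,
 f.value:option_a::STRING as option_a,
 f.value:option_b::STRING as option_b,
 f.value:option_c::STRING as option_c,
 f.value:option_d::STRING as option_d,
 -- Casting to STRING turns a JSON null into a SQL NULL, so COALESCE can apply the 'C' default
 coalesce(f.value:correct_answer::STRING, 'C') as correct_answer,
 f.value:difficulty_level::STRING as difficulty_level,
 f.value:topic::STRING as topic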
FROM {{parse_test_questions}} ptq,
   LATERAL FLATTEN(input => ptq.extraction_result:extracted_questions) f

In [None]:
-- This single INSERT statement generates and inserts all the mock student responses.
-- It is designed to work with your table where RESPONSE_ID is AUTOINCREMENT
-- and CREATED_AT has a DEFAULT value.

INSERT INTO WORLD_HISTORY.SCHOOLS.STUDENT_QUESTION_RESPONSES (
    TEST_RESULT_ID,
    QUESTION_ID,
    STUDENT_ID,
    STUDENT_ANSWER,
    CORRECT_ANSWER,
    IS_CORRECT,
    RESPONSE_TIME_SECONDS,
    CHAPTER_NUMBER,
    DIFFICULTY_LEVEL,
    TOPIC
)
WITH
-- CTE 1: Get a distinct list of all chapters that have questions.
ALL_CHAPTERS AS (
    SELECT DISTINCT CHAPTER_NUMBER
    FROM WORLD_HISTORY.SCHOOLS.WORLD_HISTORY_TEST_QUESTIONS
),

-- CTE 2: Assign a target score (65-100) for each student for each chapter.
TARGET_SCORES AS (
    SELECT
        s.STUDENT_ID,
        c.CHAPTER_NUMBER,
        -- Generate a score from a normal distribution with mean=82.5, stddev=5
        -- and clamp the result between 65 and 100.
        LEAST(100.0, GREATEST(65.0, NORMAL(82.5, 5, RANDOM()))) AS TARGET_PERCENT_CORRECT
    FROM
        WORLD_HISTORY.SCHOOLS.STUDENTS s
    CROSS JOIN
        ALL_CHAPTERS c
),

-- CTE 3: Get all questions for each chapter and find the total question count.
CHAPTER_QUESTIONS AS (
    SELECT
        CHAPTER_NUMBER,
        QUESTION_ID,
        CORRECT_ANSWER,
        DIFFICULTY_LEVEL,
        TOPIC,
        COUNT(QUESTION_ID) OVER (PARTITION BY CHAPTER_NUMBER) AS TOTAL_QUESTIONS_IN_CHAPTER
    FROM
        WORLD_HISTORY.SCHOOLS.WORLD_HISTORY_TEST_QUESTIONS
),

-- CTE 4: Combine scores and questions, then randomly rank questions for each student's test.
FINAL_RESPONSES AS (
    SELECT
        ts.STUDENT_ID,
        cq.QUESTION_ID,
        cq.CHAPTER_NUMBER,
        cq.CORRECT_ANSWER,
        cq.DIFFICULTY_LEVEL,
        cq.TOPIC,
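        -- Assign a random rank to each question within a student's chapter test attempt.
        -- RANDOM() gives a distinct sort key per row; UNIFORM with the integer bounds 0 and 1 would only yield 0 or 1.
        ROW_NUMBER() OVER (PARTITION BY ts.STUDENT_ID, ts.CHAPTER_NUMBER ORDER BY RANDOM()) as random_rank,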
        -- Calculate the exact number of questions that should be correct based on the target score
        ROUND(cq.TOTAL_QUESTIONS_IN_CHAPTER * ts.TARGET_PERCENT_CORRECT / 100) AS num_to_be_correct
    FROM
        TARGET_SCORES ts
    JOIN
        CHAPTER_QUESTIONS cq ON ts.CHAPTER_NUMBER = cq.CHAPTER_NUMBER
)

-- Final SELECT to format and insert the data
SELECT
    NULL AS TEST_RESULT_ID, -- Populated with NULL as it's a separate entity
    fr.QUESTION_ID,
    fr.STUDENT_ID,
    -- If the answer is correct, student_answer is the correct answer. Otherwise, NULL.
    IFF((fr.random_rank <= fr.num_to_be_correct), fr.CORRECT_ANSWER, NULL) AS STUDENT_ANSWER,
    fr.CORRECT_ANSWER,
    (fr.random_rank <= fr.num_to_be_correct) AS IS_CORRECT,
    UNIFORM(15, 90, RANDOM()) AS RESPONSE_TIME_SECONDS, -- Random response time: 15-90s
    fr.CHAPTER_NUMBER,
    fr.DIFFICULTY_LEVEL,
    fr.TOPIC
FROM
    FINAL_RESPONSES fr;
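
The approach in the INSERT above (draw a target score, convert it to a number of correct answers, then mark a random subset of questions as correct) can be sketched in Python as follows. This is an illustration under stated assumptions: `random.gauss` stands in for Snowflake's `NORMAL`, rounding is done half up to avoid Python's banker's rounding, and `simulate_chapter` is a hypothetical name.

```python
import random


def simulate_chapter(question_ids, rng, mean=82.5, stddev=5.0):
    """Return (target_percent, {question_id: is_correct}) for one student and chapter."""
    # Target score from a normal distribution, clamped to 65..100.
    target = min(100.0, max(65.0, rng.gauss(mean, stddev)))
    # Number of questions that should be correct, rounded half up.
    num_correct = int(len(question_ids) * target / 100 + 0.5)
    # Random ranking of the questions; the first num_correct are answered correctly.
    shuffled = list(question_ids)
    rng.shuffle(shuffled)
    responses = {qid: rank < num_correct for rank, qid in enumerate(shuffled)}
    return target, responses


rng = random.Random(42)  # fixed seed so the sketch is repeatable
target, responses = simulate_chapter(range(1, 21), rng)
print(round(target, 1), sum(responses.values()), "of", len(responses))
```

Because the target is clamped at 65, a 20-question chapter always ends up with between 13 and 20 correct answers.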


In [None]:
-- Fun with creating incorrect test question answers
UPDATE
  STUDENT_QUESTION_RESPONSES
SET CORRECT_ANSWER = UPPER(CORRECT_ANSWER);

UPDATE
  STUDENT_QUESTION_RESPONSES
SET
  -- Pick one of the three wrong letters at random. UNIFORM(1, 3, RANDOM()) is evaluated per row,
  -- whereas an uncorrelated scalar subquery may be evaluated only once for the whole statement.
  STUDENT_ANSWER = CASE UPPER(CORRECT_ANSWER)
    WHEN 'A' THEN SUBSTR('BCD', UNIFORM(1, 3, RANDOM()), 1)
    WHEN 'B' THEN SUBSTR('ACD', UNIFORM(1, 3, RANDOM()), 1)
    WHEN 'C' THEN SUBSTR('ABD', UNIFORM(1, 3, RANDOM()), 1)
    WHEN 'D' THEN SUBSTR('ABC', UNIFORM(1, 3, RANDOM()), 1)
  END
WHERE
  STUDENT_ANSWER IS NULL OR STUDENT_ANSWER = '';
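
The idea behind the update above, choosing one of the three letters that is not the correct answer, can be sketched in Python as below. `wrong_answer` is a hypothetical helper for illustration only.

```python
import random

LETTERS = ("A", "B", "C", "D")


def wrong_answer(correct_answer, rng):
    """Pick a random option other than the correct one; None if the key is not A-D."""
    correct = correct_answer.upper()
    if correct not in LETTERS:
        return None  # the CASE expression has no matching branch either
    return rng.choice([letter for letter in LETTERS if letter != correct])


rng = random.Random(7)  # fixed seed so the sketch is repeatable
print([wrong_answer("c", rng) for _ in range(5)])
```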

# End school data section

# Upload Semantic Model

In [None]:
CREATE OR REPLACE STAGE WORLD_HISTORY.PUBLIC.CONFIG_FILES
    ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE')
    DIRECTORY = (
        ENABLE = TRUE
        AUTO_REFRESH = TRUE
    );

In [None]:
from snowflake.snowpark.context import get_active_session

session = get_active_session()

# 1. Define the path to your YAML file
local_file_path = "world_history_analytics.yaml"

# 2. Define the target stage
target_stage = "@WORLD_HISTORY.PUBLIC.CONFIG_FILES"

# 3. Use the put command to upload the file
put_result = session.file.put(local_file_path, target_stage, auto_compress=False, overwrite=True)

# Print the result to confirm the upload
print(f"File uploaded successfully: {put_result[0].target} {put_result[0].status}")
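
The `print` above reports success whatever `status` contains. A small check like the following could make a failed upload more obvious. It is a sketch that assumes each element of `put_result` exposes `target` and `status` attributes (as used above) and that a successful PUT reports `UPLOADED` or `SKIPPED`. `check_put_results` is a hypothetical helper.

```python
from types import SimpleNamespace

OK_STATUSES = {"UPLOADED", "SKIPPED"}  # assumed success values for a PUT

def check_put_results(results):
    """Raise if any uploaded file reports an unexpected status."""
    failed = [r for r in results if str(r.status).upper() not in OK_STATUSES]
    if failed:
        details = ", ".join(f"{r.target}: {r.status}" for r in failed)
        raise RuntimeError(f"Upload did not complete: {details}")
    return [r.target for r in results]

# Stand-in objects so the sketch runs without a Snowflake session;
# in the notebook you would pass `put_result` instead.
fake = [SimpleNamespace(target="world_history_analytics.yaml", status="UPLOADED")]
print(check_put_results(fake))
```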

# End Upload Semantic Model

The semantic model for Cortex Analyst is now available in the `@WORLD_HISTORY.PUBLIC.CONFIG_FILES` stage.

# ✅ All Done!  (Almost)

Everything except the Snowflake Agent has been created.  The steps below repeat the agent setup instructions from the `Instructions` cell.

# Agent Setup
This is how to set up an Agent that uses all of the services and tools to produce accurate answers to complicated questions.

## About
- Display Name: World History Agent
- Description: This agent has access to structured data about schools, tests, grades, test questions, and student responses, and to a World History Textbook (unstructured data).  The two are related because the exams, questions, and responses are based on content from the World History Textbook.  Anyone can ask questions about content in the textbook, relate that back to student performance, and seamlessly use the agent to move back and forth between the two modalities.

## Instructions
- Response Instructions: Show any percentages as 0.00%.  If the question isn't extremely clear, ask for clarification.  You are an expert professor in World History.  You have the knowledge of 1060 pages of World History and access to student exam performance data.  Your keen observations, suggestions and insights will be highly prized.  Don't be afraid to make suggestions for how tests can be improved or how individual teachers, or schools, can teach the content differently.  
- Sample Questions:
    - Which is the first page that has references to other pages about the Enlightenment? What pages does it reference and what content is on those pages?
    - I want to compare the military of Classical Athens to that of the late Roman Republic. Your primary method for finding the Roman comparison must be to execute a search for explicit connections starting from the pages discussing the Athenian military during the Persian Wars. First, summarize the Athenian model, then use the connection-finding tool to locate the relevant Roman content and provide the comparison.
    - Analyze the policy of 'War Communism' implemented by the Bolsheviks during the Russian Civil War. First, summarize the policy's immediate historical context. Then, trace its ideological foundation by following any explicit cross-references in that chapter back to the introduction of Marxist theory earlier in the textbook.
    - Compare the citizen-soldier model of Classical Athens during the Persian Wars with the professionalized army of the late Roman Republic. Begin by finding the section on the Persian Wars, summarize the chapter's discussion of the Athenian state and its military, and then use any direct textual cross-references to locate and analyze the author's comparison with the Roman military system.
    - What were the hardest questions about the Roman Empire, which answer was chosen wrong the most, and what pages should students study to get more familiar with the content?
    - How closely does the exam for Emerging Europe and the Byzantine Empire follow the textbook material?

## Tools
- Cortex Analyst: Add the World_History.public.config_files/world_history_semantic_model.yaml.  Let Cortex create the description.
- Cortex Search: Add WORLD_HISTORY_QA.PUBLIC.WORLD_HISTORY_RAG_SEARCH
   - Description: Returns vector-based search results from the World History textbook as pages, parts, page summaries, part summaries, or chapter summaries.
   - ID Column: PDF_URL
   - Title Column: ENHANCED_CITATION

- Custom Tools
    - Multihop_Search_Results.  Add WORLD_HISTORY.PUBLIC.MULTIHOP_SEARCH_RESULTS as a function.  
        - page_id_param description: This is the page_id param that needs to be passed in the format of CHxx_Pyyyy.  Example: SELECT WORLD_HISTORY_QA.PUBLIC.MULTIHOP_SEARCH_RESULTS_FN('CH23_P0777');
        - description: Use this tool to enrich context for a known page ID. When you have a specific page from a vector search, use this tool to retrieve the page summary, part summary, and chapter summary. This is best for answering questions about the broader theme, context, or significance of information found on a specific page.  This tool returns the connected pages (hops) for references.  It should be used to find if there are any connected edges.  Then move to the find_connected_edges tool to recursively follow those edges in the knowledge graph.
    - Find_Connected_Edges. Add WORLD_HISTORY.PUBLIC.FIND_CONNECTED_PAGES as a function.
        - max_hops description: This is the number of connections from the source page.  If the source page is page 10 and has a reference to page 20, that would be the first hop.  If page 20 has a reference to page 30 that's the 2nd hop.  Default to 2.
        - starting_page_id description: This is the starting page id in the format "CHxx_Pyyyy".  Example "CH23_P0772".  It is a combination chapter and page number that we will get from the prior steps.
        - description: Always use this tool _after_ multihop_search_results.  That tool tells you _if_ there are connected graph edges.  This tool then allows you to recursively follow the relationships of the material.  Use this tool to answer questions about explicit connections, direct links, or tracing a topic's influence across the textbook. It traverses the book's graph of 'see page...' cross-references. Prioritize this tool when a user asks to 'trace the origins of,' 'find the connection to,' 'see what this is linked to,' or analyze how the author explicitly compares two disparate topics.
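
To make the `CHxx_Pyyyy` page ID format and the meaning of `max_hops` concrete, here is a small Python sketch. It is not the implementation of `FIND_CONNECTED_PAGES`; it only illustrates a hop-limited breadth-first traversal over a toy cross-reference graph. The helper names, the zero-padding widths (inferred from examples such as `CH23_P0777`), and the sample edges are assumptions.

```python
import re
from collections import deque

PAGE_ID_RE = re.compile(r"^CH(\d{2})_P(\d{4})$")

def make_page_id(chapter, page):
    """Build an ID such as CH23_P0777 from chapter and page numbers."""
    return f"CH{chapter:02d}_P{page:04d}"

def parse_page_id(page_id):
    """Split a page ID into (chapter, page); raise on a malformed ID."""
    match = PAGE_ID_RE.match(page_id)
    if not match:
        raise ValueError(f"Expected CHxx_Pyyyy, got {page_id!r}")
    return int(match.group(1)), int(match.group(2))

def find_connected(edges, starting_page_id, max_hops=2):
    """Return {page_id: hop_count} for pages reachable within max_hops."""
    parse_page_id(starting_page_id)  # validate the format up front
    hops = {starting_page_id: 0}
    queue = deque([starting_page_id])
    while queue:
        current = queue.popleft()
        if hops[current] == max_hops:
            continue  # do not expand beyond the hop limit
        for target in edges.get(current, []):
            if target not in hops:
                hops[target] = hops[current] + 1
                queue.append(target)
    del hops[starting_page_id]  # report only the connected pages
    return hops

# Toy graph: page 10 references page 20, which references page 30, then 40.
toy_edges = {
    make_page_id(1, 10): [make_page_id(1, 20)],
    make_page_id(1, 20): [make_page_id(2, 30)],
    make_page_id(2, 30): [make_page_id(2, 40)],
}
print(find_connected(toy_edges, "CH01_P0010", max_hops=2))
```

With `max_hops=2` the traversal reaches pages 20 and 30 but stops before page 40, which would be a third hop.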

## Orchestration
Planning instructions
```
Step 1: Question Routing 🚦
The router should prioritize tools in order from most specialized to most general.

Is the user asking to trace a connection or find an explicit link? (e.g., using words like "trace," "connect," "link," "cross-reference," "compare to what the author links").

If yes, prioritize the Multihop_Search_Results + Find_Connected_Pages tool path.  

Is the user asking for the summary, context, or significance of a known topic? (e.g., "Summarize the chapter on the Persian Wars").

If yes, use the Cortex Search + Multihop_Search_Results Path.

Is it a general knowledge question about the text? (e.g., "Tell me about the Roman military").

If yes, use the standard Cortex Search Path, potentially enriched with Multihop_Search_Results.

Is it a question about structured data?

If yes, use the Cortex Analyst Path.

-- ANY TIME Multihop_Search_Results comes back with connected_pages, use Find_Connected_Pages to get more information.
```
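
The routing order above can be summarized as a simple priority check. In the agent, the LLM makes this decision from the planning instructions. The keyword lists and the `route_question` function below are hypothetical and only illustrate the "most specialized first" ordering.

```python
TRACE_WORDS = ("trace", "connect", "link", "cross-reference")
SUMMARY_WORDS = ("summarize", "summary", "context", "significance")
STRUCTURED_WORDS = ("student", "exam", "score", "grade", "school", "teacher")

def route_question(question):
    """Return the tool path for a question, most specialized first."""
    text = question.lower()
    if any(word in text for word in TRACE_WORDS):
        return ["Multihop_Search_Results", "Find_Connected_Pages"]
    if any(word in text for word in SUMMARY_WORDS):
        return ["Cortex Search", "Multihop_Search_Results"]
    if any(word in text for word in STRUCTURED_WORDS):
        return ["Cortex Analyst"]
    # General knowledge questions about the text fall through to search
    return ["Cortex Search"]

for q in [
    "Trace the origins of Marxist theory in the textbook",
    "Summarize the chapter on the Persian Wars",
    "Tell me about the Roman military",
    "Which school had the highest exam scores?",
]:
    print(q, "->", route_question(q))
```

Plain keyword matching is far cruder than the agent's own planning, so treat this as a way to reason about the priority order and not as a replacement for it.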