# Notebook 01: Data Ingestion and Stage Check

**Purpose:** Verify Snowflake connection and check that all required data files are staged correctly.

**Steps:**
1. Load environment variables and configuration
2. Create Snowpark session
3. Verify stage files exist
4. Configure access to the training dataset (the data itself is loaded in Notebook 02)
5. Review the submission template path and expected format

**Expected Outcome:** Successful connection to Snowflake and confirmation of staged data.

In [None]:
# Import required libraries
import os
import sys
from pathlib import Path

import pandas as pd
from dotenv import load_dotenv

# Add src to path for imports
sys.path.append(str(Path.cwd().parent / "src"))

from wqsa.io.snowflake_session import create_snowpark_session, get_stage_path
from wqsa.utils.config import load_config
from wqsa.utils.logging import setup_logging

In [None]:
# Setup logging
setup_logging(level="INFO")

# Load environment variables from .env (variables that are already set are kept)
load_dotenv()

# Load configuration
config = load_config("../config/project.yaml")
print("✓ Configuration loaded successfully")
print(f"  Targets: {config['targets']}")
n_features = len(config["features"]["landsat"]) + len(config["features"]["terraclimate"])
print(f"  Feature count: {n_features}")

## Create Snowflake Session

Connect to Snowflake using the credentials stored in the `.env` file.
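
`create_snowpark_session` is a project helper whose implementation is not shown in this notebook. As a rough sketch of what such a helper might do, the code below collects connection parameters from environment variables and passes them to `Session.builder`. The `SNOWFLAKE_*` variable names and both function names are assumptions for illustration; the names actually used in `.env` may differ.

```python
import os
from typing import Mapping, Optional

# Hypothetical variable names; check .env for the names the project really uses.
PREFIX = "SNOWFLAKE_"
REQUIRED_VARS = ("SNOWFLAKE_ACCOUNT", "SNOWFLAKE_USER", "SNOWFLAKE_PASSWORD")
OPTIONAL_VARS = (
    "SNOWFLAKE_ROLE",
    "SNOWFLAKE_WAREHOUSE",
    "SNOWFLAKE_DATABASE",
    "SNOWFLAKE_SCHEMA",
)


def build_connection_params(env: Optional[Mapping[str, str]] = None) -> dict:
    """Collect Snowpark connection parameters from environment variables."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError(f"Missing required environment variables: {', '.join(missing)}")

    params = {}
    for name in REQUIRED_VARS + OPTIONAL_VARS:
        value = env.get(name)
        if value:
            # SNOWFLAKE_ACCOUNT -> "account", SNOWFLAKE_USER -> "user", ...
            params[name[len(PREFIX):].lower()] = value
    return params


def create_session_sketch(env: Optional[Mapping[str, str]] = None):
    """Open a Snowpark session from the collected parameters."""
    # Imported lazily so the parameter logic can be checked without Snowpark installed.
    from snowflake.snowpark import Session

    return Session.builder.configs(build_connection_params(env)).create()
```

Failing early on a missing variable gives a clearer message than a connection error raised later by the driver.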

In [None]:
# Create Snowpark session
session = create_snowpark_session()

# Test connection
result = session.sql("SELECT CURRENT_VERSION(), CURRENT_DATABASE(), CURRENT_SCHEMA()").collect()
print("\n✓ Connected to Snowflake successfully")
print(f"  Version: {result[0][0]}")
print(f"  Database: {result[0][1]}")
print(f"  Schema: {result[0][2]}")

## Verify Stage Files

Check that all required data files are present in the Snowflake stage.
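
Listing the stage shows what is there, but it does not by itself confirm that each configured path has a matching file. One way to make that check explicit is sketched below. `relative_path` and `find_missing` are hypothetical helpers, and the matching rule (compare the part of the path after the stage name against the names returned by `LIST`) is an assumption about how the stage is laid out.

```python
from typing import Dict, Iterable, List


def relative_path(stage_path: str) -> str:
    """Drop the leading '@STAGE_NAME/' part and any surrounding slashes."""
    _, _, rest = stage_path.lstrip("@").partition("/")
    return rest.strip("/")


def find_missing(stage_paths: Dict[str, str], listed_names: Iterable[str]) -> List[str]:
    """Return the labels whose configured path matches none of the listed files."""
    names = ["/" + name for name in listed_names]
    missing = []
    for label, path in stage_paths.items():
        rel = relative_path(path)
        if not rel:
            # The path is the stage root: any listed file counts
            found = bool(names)
        else:
            found = any(
                name.endswith("/" + rel)      # exact file match
                or ("/" + rel + "/") in name  # file located under a directory
                for name in names
            )
        if not found:
            missing.append(label)
    return missing
```

With the objects defined in the next cell, this could be called as `find_missing(stage_paths, [f["name"] for f in files])`; an empty list would mean every configured path has at least one matching file.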

In [None]:
# Get stage paths from environment
stage_paths = {
    "Training CSV": get_stage_path("STAGE_TRAIN_CSV"),
    "Submission CSV": get_stage_path("STAGE_SUBMISSION_CSV"),
    "Landsat Dir": get_stage_path("STAGE_LANDSAT_DIR"),
    "TerraClimate Dir": get_stage_path("STAGE_TERRACLIMATE_DIR"),
}

print("\nConfigured stage paths:")
for name, path in stage_paths.items():
    print(f"  {name:20} {path}")

# List files in main stage
main_stage = get_stage_path("SF_STAGE", "@AI_CHALLENGE_STAGE")
print(f"\nListing files in {main_stage}...")
try:
    files = session.sql(f"LIST {main_stage}").collect()
    print(f"\n✓ Found {len(files)} files:")
    for file_info in files[:10]:  # Show first 10
        print(f"  - {file_info['name']}")
    if len(files) > 10:
        print(f"  ... and {len(files) - 10} more")
except Exception as e:
    print(f"⚠ Error listing stage: {e}")

## Load Training Dataset

Set up access to the training data in the Snowflake stage. This notebook only creates the CSV file format; the data itself is loaded and inspected in Notebook 02.
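
A staged CSV can also be queried in place, without creating a table, by selecting columns by position and naming a file format. The helper below is a minimal sketch of that idea. `build_preview_query` and its defaults are hypothetical, and it assumes the `CSV_FORMAT` file format created in the next cell exists in the current schema.

```python
def build_preview_query(
    stage_path: str,
    n_columns: int = 3,
    limit: int = 5,
    file_format: str = "CSV_FORMAT",
) -> str:
    """Build a SELECT that reads the first rows of a staged CSV by column position."""
    if n_columns < 1 or limit < 1:
        raise ValueError("n_columns and limit must be positive")
    # Staged files have no column names; columns are addressed as $1, $2, ...
    columns = ", ".join(f"${i}" for i in range(1, n_columns + 1))
    return (
        f"SELECT {columns} FROM {stage_path} "
        f"(FILE_FORMAT => '{file_format}') LIMIT {limit}"
    )
```

Once the file format exists, `session.sql(build_preview_query(train_path)).show()` would print a few rows. Because `CSV_FORMAT` sets `SKIP_HEADER = 1`, the header line is not part of the result.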

In [None]:
# Configure access to the training data in the stage
train_path = get_stage_path("STAGE_TRAIN_CSV")
print(f"Training data path: {train_path}")

try:
    # Create file format if needed
    session.sql("""
        CREATE FILE FORMAT IF NOT EXISTS CSV_FORMAT
        TYPE = CSV
        FIELD_OPTIONALLY_ENCLOSED_BY = '"'
        SKIP_HEADER = 1
    """).collect()

    # The data is not read here. To load it, use session.read.csv(...) or
    # COPY INTO <table> FROM the stage, depending on the actual stage structure.
    print("\n✓ Training data access configured")
    print("  Note: Actual data loading will occur in notebook 02")

except Exception as e:
    print(f"⚠ Could not create file format: {e}")
    print("  Ensure data is staged before proceeding to feature engineering")

## Load Submission Template

Check the submission template structure to understand the required output format.
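
The format requirements listed in the next cell (three columns in a fixed order, no index column) can be checked mechanically before a file is submitted. The function below is a minimal sketch using only the standard library. `check_submission` is a hypothetical helper, and it assumes the file starts with a header row containing exactly the three column names.

```python
import csv
import io

EXPECTED_COLUMNS = ["ALKALINITY", "EC", "DRP"]


def check_submission(csv_text: str) -> int:
    """Validate the layout of a submission CSV and return the number of data rows."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)
    # An exported index column would show up as an extra leading header field
    if header != EXPECTED_COLUMNS:
        raise ValueError(f"Expected columns {EXPECTED_COLUMNS}, got {header}")

    n_rows = 0
    for line_no, row in enumerate(reader, start=2):
        if len(row) != len(EXPECTED_COLUMNS):
            raise ValueError(
                f"Line {line_no}: expected {len(EXPECTED_COLUMNS)} values, got {len(row)}"
            )
        n_rows += 1
    return n_rows
```

With pandas, writing a frame whose columns are already in this order via `df.to_csv(path, index=False)` produces a file that passes this check.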

In [None]:
# Show the submission template path and the expected output format
submission_path = get_stage_path("STAGE_SUBMISSION_CSV")
print(f"\nSubmission template path: {submission_path}")
print("\nExpected format:")
print("  - Columns: ALKALINITY, EC, DRP (in that order)")
print("  - Rows: ~200")
print("  - Format: CSV without index")

## Summary

This notebook verified:
1. ✓ Snowflake connection established
2. ✓ Configuration loaded correctly
3. ✓ Stage paths configured
4. ⚠ Data files accessible (ensure staging is complete)

**Next Step:** Proceed to Notebook 02 for feature engineering with Landsat and TerraClimate data.

In [None]:
# Close session
session.close()
print("\n✓ Session closed successfully")