# Feature Engineering with FAERS+HCLS Integration

**Simple, focused feature engineering using integrated FDA adverse events + healthcare claims data**

## Goals:
1. **Load FAERS+HCLS integrated data** from notebook 3b
2. **Engineer advanced risk features** for ML training
3. **Create target variables** for adverse event prediction
4. **Save ML-ready dataset** for training
5. **Register features** in the Snowflake Feature Store

**Next Step:** Notebook 5 uses the registered features for ML training

## Prerequisites
- Running in Snowflake Notebooks environment
- Previous notebooks completed (01, 02, 03, 03b)
- FAERS+HCLS integrated dataset available


In [None]:
# Initialize Snowflake Session for Feature Engineering
print("Initializing Snowflake session for feature engineering...")

# Import Snowpark session and functions (available in Snowflake Notebooks)
from snowflake.snowpark.context import get_active_session
from snowflake.snowpark.functions import col, lit, when, array_to_string

# Get the active Snowflake session
session = get_active_session()

print("SUCCESS: Snowflake session initialized for feature engineering")

# Verify context
current_context = session.sql("""
    SELECT 
        CURRENT_DATABASE() as database,
        CURRENT_SCHEMA() as schema,
        CURRENT_WAREHOUSE() as warehouse
""").collect()[0]

print(f"   Database: {current_context['DATABASE']}")
print(f"   Schema: {current_context['SCHEMA']}")
print(f"   Warehouse: {current_context['WAREHOUSE']}")
print("SUCCESS: Environment ready for feature engineering")


In [None]:
# Load FAERS+HCLS Integrated Data
print("Loading FAERS+HCLS integrated data from notebook 3b...")

try:
    # Load the integrated dataset from notebook 3b
    integrated_df = session.table("ADVERSE_EVENT_MONITORING.DEMO_ANALYTICS.FAERS_HCLS_INTEGRATED_DATASET")
    print(f"SUCCESS: Loaded FAERS+HCLS integrated data: {integrated_df.count():,} patients")
except Exception as e:
    print(f"FAILED: Error loading integrated data: {e}")
    print("Please run notebook 3b first to create the integrated dataset")
    raise

# Show data structure
print(f"\nAvailable columns: {len(integrated_df.columns)}")
print(f"   - Key columns: PATIENT_ID, AGE, NUM_CONDITIONS, NUM_MEDICATIONS...")
print(f"   - FAERS features: MAX_MEDICATION_RISK, HIGH_RISK_MEDICATION_COUNT...")

# Display schema to understand column types
print(f"\nSchema details:")
for field in integrated_df.schema.fields:
    print(f"   - {field.name}: {field.datatype}")

# Show a few sample rows to understand the data format
print(f"\nSample data:")
sample_data = integrated_df.limit(3).collect()
for i, row in enumerate(sample_data):
    print(f"   Row {i+1}:")
    # Show first few columns to understand the data
    for j, field in enumerate(integrated_df.schema.fields[:5]):  # Show first 5 columns
        try:
            value = row[j]
            print(f"      {field.name}: {value}")
        except Exception:
            print(f"      {field.name}: <error accessing>")
    print()
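

The cells that follow reference a fixed set of columns from the integrated dataset, so a presence check right after loading can fail fast with a clearer message than a later SQL compilation error. The sketch below is plain Python. `REQUIRED_COLUMNS` is collected from the column names used later in this notebook, and `missing_columns` is a hypothetical helper, not part of Snowpark. Upper-casing and stripping double quotes is a simplification that treats quoted and unquoted identifiers alike.

```python
# Columns referenced by later cells in this notebook.
REQUIRED_COLUMNS = [
    "PATIENT_ID", "AGE", "GENDER", "NUM_CONDITIONS", "NUM_MEDICATIONS",
    "NUM_CLAIMS", "CURRENT_MEDICATIONS", "MAX_MEDICATION_RISK",
    "HIGH_RISK_MEDICATION_COUNT", "HAS_HIGH_RISK_INTERACTION",
]


def missing_columns(available, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from `available`.

    Names are compared after stripping double quotes and upper-casing,
    which is a simplification of Snowflake's identifier rules.
    """
    normalized = {name.strip('"').upper() for name in available}
    return [name for name in required if name.upper() not in normalized]


# Example with a hypothetical, incomplete column list.
example_columns = ["PATIENT_ID", "AGE", '"GENDER"', "num_conditions"]
print(missing_columns(example_columns))
```

In the notebook this would be called as `missing_columns(integrated_df.columns)`, stopping with an error if the returned list is not empty.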


In [None]:
# Core Feature Engineering
print("Engineering advanced features for ML training...")

# Start with the integrated data
feature_df = integrated_df

# 0. Create binary gender indicator for Feature Store
feature_df = feature_df.with_column(
    "IS_MALE",
    when(col("GENDER") == lit("M"), lit(1)).otherwise(lit(0))
)

# 1. Enhanced complexity scoring
feature_df = feature_df.with_column(
    "ENHANCED_COMPLEXITY_SCORE",
    (col("AGE") / 100.0 * 10) + 
    (col("NUM_CONDITIONS") * 3) + 
    (col("NUM_MEDICATIONS") * 2) +
    (col("MAX_MEDICATION_RISK") * 5)
)

# 2. FAERS-enhanced risk score
feature_df = feature_df.with_column(
    "FAERS_ENHANCED_RISK",
    col("MAX_MEDICATION_RISK") * 10 + col("HIGH_RISK_MEDICATION_COUNT") * 5
)

# 3. Advanced chronic disease indicators
chronic_diseases = {
    "HAS_CARDIOVASCULAR_DISEASE": ["cardiovascular", "heart", "cardiac", "hypertension"],
    "HAS_DIABETES": ["diabetes", "diabetic", "insulin"],
    "HAS_KIDNEY_DISEASE": ["kidney", "renal", "nephritis"],
    "HAS_LIVER_DISEASE": ["liver", "hepatic", "cirrhosis"]
}

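# Snowflake's CONTAINS is case-sensitive, so lower-case the joined medication
# names once up front. COALESCE turns a NULL array into an empty string so the
# flags come out as 0 rather than NULL.
from snowflake.snowpark.functions import coalesce, lower

medications_as_string = lower(
    coalesce(array_to_string(col("CURRENT_MEDICATIONS"), lit(",")), lit(""))
)

for disease_flag, keywords in chronic_diseases.items():
    # Flag the patient if any keyword appears in the medication list.
    # Note: this matches disease keywords against medication names, so it is
    # a rough proxy rather than a diagnosis lookup.
    disease_condition = lit(False)
    for keyword in keywords:
        disease_condition = disease_condition | medications_as_string.contains(lit(keyword))

    feature_df = feature_df.with_column(disease_flag, disease_condition.cast("int"))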

print("SUCCESS: Enhanced feature engineering complete")
print(f"   Total features: {len(feature_df.columns)}")
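

To make the weights in the cell above concrete, here is a plain-Python sketch that applies the same formulas to a single patient. The patient values are hypothetical and the function names are illustrative only; the weights and keyword lists are copied from the Snowpark expressions above. The sketch lower-cases the medication string before matching, whereas Snowflake's `CONTAINS` is case-sensitive on its own.

```python
# Keyword lists copied from the feature engineering cell.
CHRONIC_DISEASES = {
    "HAS_CARDIOVASCULAR_DISEASE": ["cardiovascular", "heart", "cardiac", "hypertension"],
    "HAS_DIABETES": ["diabetes", "diabetic", "insulin"],
    "HAS_KIDNEY_DISEASE": ["kidney", "renal", "nephritis"],
    "HAS_LIVER_DISEASE": ["liver", "hepatic", "cirrhosis"],
}


def complexity_score(age, num_conditions, num_medications, max_medication_risk):
    """Same weighted sum as ENHANCED_COMPLEXITY_SCORE."""
    return (age / 100.0 * 10) + (num_conditions * 3) + (num_medications * 2) + (max_medication_risk * 5)


def faers_enhanced_risk(max_medication_risk, high_risk_medication_count):
    """Same weighted sum as FAERS_ENHANCED_RISK."""
    return max_medication_risk * 10 + high_risk_medication_count * 5


def disease_flags(medications):
    """Keyword match against the comma-joined, lower-cased medication names."""
    joined = ",".join(medications).lower()
    return {
        flag: int(any(keyword in joined for keyword in keywords))
        for flag, keywords in CHRONIC_DISEASES.items()
    }


# Hypothetical patient, for illustration only.
score = complexity_score(age=65, num_conditions=3, num_medications=5, max_medication_risk=0.4)
risk = faers_enhanced_risk(max_medication_risk=0.4, high_risk_medication_count=1)
flags = disease_flags(["Insulin glargine", "Lisinopril"])

print(round(score, 2), round(risk, 2))
print(flags)
```

For this patient only `HAS_DIABETES` is set. Lisinopril is commonly prescribed for hypertension, but it does not set the cardiovascular flag because the match is on literal keywords in the medication names.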


In [None]:
# Create Target Variables
print("Creating target variables for ML training...")

# 1. Continuous risk target (weighted sum, nominally 0-100; not clamped)
feature_df = feature_df.with_column(
    "CONTINUOUS_RISK_TARGET",
    # Weighted sum of risk factors; only AGE is rescaled (divided by 100)
    ((col("AGE") / 100.0 * 20) + 
     (col("NUM_CONDITIONS") * 4) + 
     (col("NUM_MEDICATIONS") * 3) + 
     (col("MAX_MEDICATION_RISK") * 15) +
     (col("HIGH_RISK_MEDICATION_COUNT") * 8))
)

# 2. High adverse event risk target (binary)
feature_df = feature_df.with_column(
    "HIGH_ADVERSE_EVENT_RISK_TARGET",
    when(col("CONTINUOUS_RISK_TARGET") > 70, lit(1)).otherwise(lit(0))
)

print("SUCCESS: Target variables created")
print("   - CONTINUOUS_RISK_TARGET: continuous risk score (nominally 0-100, not clamped)")
print("   - HIGH_ADVERSE_EVENT_RISK_TARGET: Binary high-risk flag")
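

The target definitions above can be checked by hand with a plain-Python sketch. The weights and the threshold of 70 are taken from the cell above; the three patients and the function names are hypothetical. Because the weighted sum is not clamped, large inputs push the score past 100, as the third patient shows.

```python
def continuous_risk_target(age, num_conditions, num_medications,
                           max_medication_risk, high_risk_medication_count):
    """Same weighted sum as CONTINUOUS_RISK_TARGET."""
    return ((age / 100.0 * 20)
            + (num_conditions * 4)
            + (num_medications * 3)
            + (max_medication_risk * 15)
            + (high_risk_medication_count * 8))


def high_risk_flag(score, threshold=70):
    """Same rule as HIGH_ADVERSE_EVENT_RISK_TARGET: strictly above the threshold."""
    return 1 if score > threshold else 0


# Hypothetical patients: (age, conditions, medications, max risk, high-risk count)
patients = {
    "moderate": (65, 3, 5, 0.4, 1),
    "severe": (80, 5, 8, 0.9, 3),
    "extreme": (90, 8, 12, 1.0, 4),
}

results = {}
for label, values in patients.items():
    score = continuous_risk_target(*values)
    results[label] = (score, high_risk_flag(score))
    print(f"{label}: score={score:.1f}, high_risk={results[label][1]}")
```

The comparison is strict, so a score of exactly 70 is labeled 0.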


In [None]:
# Save ML-Ready Dataset
print("Saving ML-ready feature dataset...")

# Save the final feature dataset
features_table = "ADVERSE_EVENT_MONITORING.DEMO_ANALYTICS.FAERS_HCLS_FEATURES_FINAL"
feature_df.write.mode("overwrite").save_as_table(features_table)

# Verification: count rows in the saved table instead of recomputing feature_df
final_count = session.table(features_table).count()
feature_count = len(feature_df.columns)

print(f"SUCCESS: Feature engineering complete!")
print(f"   Patients: {final_count:,}")
print(f"   Features: {feature_count}")
print(f"   Saved as: FAERS_HCLS_FEATURES_FINAL")

print(f"\nReady for Feature Store setup and ML training!")
print(f"   Next: Set up Feature Store in next cell")
print(f"   Then: Run notebook 05_Model_Training.ipynb")


In [None]:
# Feature Store Setup & Registration
print("Setting up Snowflake Feature Store...")

try:
    # Import native Snowflake Feature Store APIs
    from snowflake.ml.feature_store import FeatureStore, FeatureView, Entity, CreationMode
    print("SUCCESS: Snowflake Feature Store APIs imported")
    
    # Create Feature Store with proper creation mode
    fs = FeatureStore(
        session=session,
        database="ADVERSE_EVENT_MONITORING",
        name="ML_FEATURE_STORE",
        default_warehouse="ADVERSE_EVENT_WH",  # Use existing warehouse
        creation_mode=CreationMode.CREATE_IF_NOT_EXIST
    )
    print("SUCCESS: Feature Store created: ADVERSE_EVENT_MONITORING.ML_FEATURE_STORE")
    
    # Create and register Patient Entity
    print("\nCreating Patient Entity...")
    patient_entity = Entity(
        name="PATIENT",
        join_keys=["PATIENT_ID"],  # Required join key
        desc="Healthcare patient entity for adverse event prediction"
    )
    
    # Register the entity
    fs.register_entity(patient_entity)
    print("SUCCESS: Patient entity registered with join key: PATIENT_ID")
    
    # Create Feature Views
    print("\nCreating Feature Views...")
    
    # 1. Demographics Feature View
    demographics_fv = FeatureView(
        name="PATIENT_DEMOGRAPHICS",
        entities=[patient_entity],
        feature_df=feature_df.select(["PATIENT_ID", "AGE", "IS_MALE"]),
        desc="Patient demographic features"
    )
    
    # 2. FAERS Risk Feature View
    faers_fv = FeatureView(
        name="FAERS_RISK_FEATURES", 
        entities=[patient_entity],
        feature_df=feature_df.select([
            "PATIENT_ID", "MAX_MEDICATION_RISK", "HIGH_RISK_MEDICATION_COUNT",
            "HAS_HIGH_RISK_INTERACTION", "CONTINUOUS_RISK_TARGET"
        ]),
        desc="FAERS adverse event risk features"
    )
    
    # 3. Healthcare Utilization Feature View
    healthcare_fv = FeatureView(
        name="HEALTHCARE_UTILIZATION",
        entities=[patient_entity], 
        feature_df=feature_df.select([
            "PATIENT_ID", "NUM_CONDITIONS", "NUM_MEDICATIONS", "NUM_CLAIMS",
            "HAS_CARDIOVASCULAR_DISEASE", "HAS_DIABETES", "HAS_KIDNEY_DISEASE"
        ]),
        desc="Healthcare utilization and chronic disease features"
    )
    
    # Register Feature Views
    print("\nRegistering Feature Views...")
    
    fs.register_feature_view(demographics_fv, version="1.0", block=True)
    print("   SUCCESS: PATIENT_DEMOGRAPHICS registered")
    
    fs.register_feature_view(faers_fv, version="1.0", block=True)
    print("   SUCCESS: FAERS_RISK_FEATURES registered")
    
    fs.register_feature_view(healthcare_fv, version="1.0", block=True)
    print("   SUCCESS: HEALTHCARE_UTILIZATION registered")
    
    # Verify Feature Store setup
    print("\nVerifying Feature Store setup...")
    
    # List entities
    entities = fs.list_entities()
    print(f"   Entities: {entities.count()} registered")
    
    # List feature views
    feature_views = fs.list_feature_views()
    print(f"   Feature Views: {feature_views.count()} registered")
    
    if feature_views.count() > 0:
        print("   Registered Feature Views:")
        try:
            # Convert to list to iterate through feature views
            fv_list = feature_views.collect()
            for fv in fv_list:
                print(f"      • {fv['NAME']}: {fv['DESC']}")
        except Exception as e:
            print(f"      (Could not list feature view details: {e})")
    
    print("\nFeature Store setup complete!")
    print("Location: ADVERSE_EVENT_MONITORING.ML_FEATURE_STORE")
    print("Check Snowflake UI > Data > Features to see your Feature Store objects")
    
except ImportError:
    print("FAILED: Feature Store API not available")
    print("Requires: snowflake-ml-python v1.5.0+ and Enterprise Edition")
    print("Try: add snowflake-ml-python via the notebook's Packages menu, or elsewhere: pip install snowflake-ml-python --upgrade")
    
except Exception as e:
    print(f"FAILED: Feature Store setup failed: {e}")
    print("Feature data still available in FAERS_HCLS_FEATURES_FINAL table")
    print("Documentation: https://docs.snowflake.com/en/developer-guide/snowflake-ml/feature-store/overview")

print(f"\nComplete! Features ready for ML training in notebook 5")
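

One thing worth checking before training is whether any target column is also registered as a feature. The sketch below is plain Python: the column lists are copied from the Feature View definitions above, and `find_target_leakage` is a hypothetical helper, not a Feature Store API. Whether an overlap is a real problem depends on how notebook 5 selects its inputs, which is not shown here.

```python
# Target columns created earlier in this notebook.
TARGET_COLUMNS = {"CONTINUOUS_RISK_TARGET", "HIGH_ADVERSE_EVENT_RISK_TARGET"}

# Columns selected for each Feature View in the cell above.
FEATURE_VIEW_COLUMNS = {
    "PATIENT_DEMOGRAPHICS": ["PATIENT_ID", "AGE", "IS_MALE"],
    "FAERS_RISK_FEATURES": [
        "PATIENT_ID", "MAX_MEDICATION_RISK", "HIGH_RISK_MEDICATION_COUNT",
        "HAS_HIGH_RISK_INTERACTION", "CONTINUOUS_RISK_TARGET",
    ],
    "HEALTHCARE_UTILIZATION": [
        "PATIENT_ID", "NUM_CONDITIONS", "NUM_MEDICATIONS", "NUM_CLAIMS",
        "HAS_CARDIOVASCULAR_DISEASE", "HAS_DIABETES", "HAS_KIDNEY_DISEASE",
    ],
}


def find_target_leakage(feature_views, targets):
    """Map each Feature View name to the target columns it contains, if any."""
    leaks = {}
    for name, columns in feature_views.items():
        overlap = sorted(set(columns) & set(targets))
        if overlap:
            leaks[name] = overlap
    return leaks


leaks = find_target_leakage(FEATURE_VIEW_COLUMNS, TARGET_COLUMNS)
print(leaks)  # {'FAERS_RISK_FEATURES': ['CONTINUOUS_RISK_TARGET']}
```

Since `HIGH_ADVERSE_EVENT_RISK_TARGET` is a threshold on `CONTINUOUS_RISK_TARGET`, a model that predicts the binary target while reading this Feature View as input would see the label directly, so the column would need to be excluded from the inputs in that case.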


## Snowflake ML Feature Engineering Complete!

**What we accomplished using Snowpark and the Snowflake Feature Store:**
- **Feature Engineering**: Created `IS_MALE`, `ENHANCED_COMPLEXITY_SCORE`, and `FAERS_ENHANCED_RISK` with Snowpark DataFrame expressions
- **Chronic Disease Flags**: Derived four indicator columns by keyword matching on `CURRENT_MEDICATIONS`
- **ML-Ready Output**: Saved the feature dataset as `FAERS_HCLS_FEATURES_FINAL`
- **Feature Store**: Registered the `PATIENT` entity and three Feature Views (`PATIENT_DEMOGRAPHICS`, `FAERS_RISK_FEATURES`, `HEALTHCARE_UTILIZATION`)

**Snowpark and Feature Store Benefits:**
- **In-Warehouse Processing**: Snowpark DataFrame transformations execute as SQL on the Snowflake warehouse
- **Native Integration**: Feature Views are defined directly from Snowpark DataFrames
- **Reusable Features**: Versioned Feature Views can be shared across training and inference

**Features Created:**
- **Demographic Features**: Age and the binary `IS_MALE` indicator
- **Derived Scores**: `ENHANCED_COMPLEXITY_SCORE` and `FAERS_ENHANCED_RISK`
- **Chronic Disease Flags**: Cardiovascular, diabetes, kidney, and liver indicators
- **Target Variables**: `CONTINUOUS_RISK_TARGET` and the binary `HIGH_ADVERSE_EVENT_RISK_TARGET` (score above 70)

**Next Steps:**
1. **Run Notebook 5**: Comprehensive ML workflow with Feature Store
2. **Model Training**: Unsupervised + supervised learning with distributed training
3. **Model Registry**: Log all models with metadata and lineage
4. **ML Observability**: Set up native monitoring in notebook 7

Ready for the complete Snowflake ML platform workflow!
