# AI-Powered Data Quality Monitor Demo

This notebook demonstrates how to use the AI-powered data quality monitor to validate a dataset, generate LLM insights for failed checks, and get suggested fixes for the issues it finds.

## Setup

First, let's set up our environment and import the necessary modules.

In [None]:
import os
import sys
import json
import pandas as pd
from datetime import datetime

# Add the project root to the path
sys.path.append('..')

# Import the necessary modules
from app.validator.run_checks import run_validation, DataQualityValidator
from app.data_ingestion.ingest import ingest
from llm_agent.insight_generator import generate_llm_insights
from llm_agent.fix_suggestor import suggest_fixes
from llm_agent.expectation_generator import generate_expectations_config, analyze_dataset

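# Set up the OpenAI API key. setdefault keeps a key that is already exported
# in the environment; otherwise replace the placeholder with your actual key.
os.environ.setdefault("OPENAI_API_KEY", "your-api-key")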
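
Typing a key into a notebook cell makes it easy to commit the key by accident. The following is a minimal sketch of one alternative, assuming the key is either exported already or can be typed at a prompt. The helper name `resolve_api_key` is hypothetical and not part of the project.

```python
import os
from getpass import getpass


def resolve_api_key(env=os.environ, prompt=getpass):
    """Return the OpenAI API key from the environment, prompting only if it is absent."""
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key or key == "your-api-key":
        # Fall back to an interactive prompt so the key never lands in the notebook file.
        key = prompt("OpenAI API key: ").strip()
    if not key:
        raise ValueError("OPENAI_API_KEY is not set")
    return key
```

In the setup cell this could be used as `os.environ["OPENAI_API_KEY"] = resolve_api_key()`.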

## 1. Load and Examine Sample Data

Let's load our sample transactions dataset and examine it.

In [None]:
# Path to our sample data
data_path = "../data/transactions.csv"

# Load the data
df = pd.read_csv(data_path)

# Display the first few rows
df.head()

In [None]:
# Check data info
df.info()

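The preview and the `df.info()` output point to several data quality issues:
- Missing values in `customer_id` and `transaction_date`
- Non-numeric values in `amount` ('null', 'abc'), which force the column to load as `object` instead of a numeric type. Note that `pd.read_csv` treats the string `null` as a missing value by default, so only entries such as `abc` survive as text.
- Negative values in `amount`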

Let's use our data quality monitor to detect these issues.
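
Before handing the file to the monitor, the issues listed above can be counted directly with pandas. This is a rough sketch on a hypothetical four-row stand-in for `transactions.csv`; the helper name `profile_issues` and the sample rows are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical stand-in rows; the real notebook reads ../data/transactions.csv.
sample = pd.DataFrame(
    {
        "customer_id": ["C1", None, "C3", "C4"],
        "amount": ["10.5", "abc", "-20", None],
        "transaction_date": ["2024-01-01", None, "2024-01-03", "2024-01-04"],
    }
)


def profile_issues(frame):
    """Count the kinds of problems listed above."""
    amount = pd.to_numeric(frame["amount"], errors="coerce")
    return {
        "missing_customer_id": int(frame["customer_id"].isna().sum()),
        "missing_transaction_date": int(frame["transaction_date"].isna().sum()),
        # Present in the raw column but not parseable as a number.
        "non_numeric_amount": int((frame["amount"].notna() & amount.isna()).sum()),
        "negative_amount": int((amount < 0).sum()),
    }


print(profile_issues(sample))
```

Running `profile_issues(df)` on the loaded dataset gives a quick baseline to compare against the monitor's findings.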

## 2. Auto-generate a Great Expectations Test Suite

We can use the LLM agent to automatically generate a Great Expectations test suite based on the dataset.

In [None]:
# Analyze the dataset
dataset_info = analyze_dataset(data_path)
print(json.dumps(dataset_info, indent=2, default=str))

In [None]:
# Generate a Great Expectations test suite
output_path = "../expectations/auto_generated_suite.yml"
config = generate_expectations_config(data_path, output_path)
print(config)

## 3. Run Data Quality Validation

Let's run the data quality validation using our existing expectation suite.

In [None]:
# Run validation using the transactions_suite
suite_name = "transactions_suite"
validation_results = run_validation(data_path, suite_name)

# Display validation results summary
print(f"Validation success: {validation_results['success']}")
print(f"Total checks: {validation_results['statistics']['evaluated_expectations']}")
print(f"Passed checks: {validation_results['statistics']['successful_expectations']}")
print(f"Failed checks: {validation_results['statistics']['unsuccessful_expectations']}")
print(f"Success rate: {validation_results['statistics']['success_percent']}%")
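
The cell above indexes nested keys directly, so it raises `KeyError` if a field is missing. A more tolerant summary might look like the sketch below. It assumes the result dictionary has the shape used above, and the helper name `summarize` is hypothetical.

```python
def summarize(results):
    """Build a one-line summary from a validation result dictionary."""
    stats = results.get("statistics", {})
    total = stats.get("evaluated_expectations", 0)
    passed = stats.get("successful_expectations", 0)
    # Derive the rate from the counts so the summary works even if success_percent is absent.
    rate = 100.0 * passed / total if total else 0.0
    status = "PASS" if results.get("success") else "FAIL"
    return f"{status}: {passed}/{total} checks passed ({rate:.1f}%)"
```

Calling `print(summarize(validation_results))` would then replace the five separate `print` statements.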

In [None]:
# Display details of failed checks
for i, check in enumerate(validation_results.get('failed_checks', [])):
    print(f"\nFailed Check #{i+1}: {check['check_name']}")
    print(f"  Check Type: {check['check_type']}")
    print(f"  Failed Rows: {check['failed_rows']} ({check['failure_percentage']}%)")
    print(f"  Expected Value: {check['expected_value']}")
    print(f"  Actual Value: {check['actual_value']}")
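
When many checks fail, it can help to look at the worst offenders first. The sketch below ranks failed checks by `failure_percentage`, assuming that field is numeric. The helper name `rank_failed_checks` and the example payload are hypothetical.

```python
def rank_failed_checks(results, top_n=3):
    """Return names of failed checks, highest failure percentage first."""
    failed = results.get("failed_checks", [])
    ordered = sorted(failed, key=lambda check: check.get("failure_percentage", 0), reverse=True)
    return [check["check_name"] for check in ordered[:top_n]]


# Hypothetical result payload; check names and numbers are made up for illustration.
example_results = {
    "failed_checks": [
        {"check_name": "amount_is_numeric", "failure_percentage": 4.0},
        {"check_name": "customer_id_not_null", "failure_percentage": 12.5},
        {"check_name": "amount_non_negative", "failure_percentage": 2.0},
    ]
}
print(rank_failed_checks(example_results, top_n=2))
```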

## 4. Generate LLM Insights

Let's use the LLM agent to generate insights for the failed checks.

In [None]:
# Get the path to the validation results file
results_path = f"../data/validation_results/{datetime.now().strftime('%Y-%m-%d')}/transactions/results.json"

# Generate insights
insights = generate_llm_insights(results_path)

# Display insights
for check_name, insight in insights.items():
    print(f"\nInsight for {check_name}:")
    print(f"  Issue Description: {insight.get('issue_description')}")
    print(f"  Impact Level: {insight.get('impact_level')}")
    print(f"  Business Impact: {insight.get('business_impact')}")
    
    print("  Possible Causes:")
    for cause in insight.get('possible_causes', []):
        print(f"    - {cause}")
    
    print("  Recommended Actions:")
    for action in insight.get('recommended_actions', []):
        print(f"    - {action}")
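
The `results_path` in the cell above is built from today's date, so it points at a missing file if validation ran on a different day. One possible workaround is to pick the newest dated folder instead. This sketch assumes the `<root>/<YYYY-MM-DD>/<dataset>/results.json` layout implied by that path; the helper name `latest_results_path` is hypothetical.

```python
from pathlib import Path


def latest_results_path(root, dataset="transactions"):
    """Return the newest results.json under root/<YYYY-MM-DD>/<dataset>/, or None."""
    candidates = sorted(Path(root).glob(f"*/{dataset}/results.json"))
    # ISO dates sort lexicographically, so the last match is the most recent run.
    return candidates[-1] if candidates else None
```

For example, `latest_results_path("../data/validation_results")` could replace the date-formatted string, with a `None` check before calling `generate_llm_insights`.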

## 5. Generate Fix Suggestions

Now let's use the LLM agent to suggest fixes for the failed checks.

In [None]:
# Generate fix suggestions
fixes = suggest_fixes(results_path)

# Display fix suggestions
for check_name, fix in fixes.items():
    print(f"\nFix Suggestion for {check_name}:")
    print(f"  Approach: {fix.get('fix_approach')}")
    print(f"  Rationale: {fix.get('rationale')}")
    print(f"  Confidence: {fix.get('confidence')}")
    
    print("  Implementation:")
    print(f"```\n{fix.get('implementation')}\n```")
    
    print("  Alternative Approaches:")
    for alt in fix.get('alternative_approaches', []):
        print(f"    - {alt}")
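
LLM-suggested code should be reviewed before anyone runs it. As a cheap first filter, the sketch below checks whether a suggested `implementation` at least parses as Python, without executing it. It assumes the field holds Python source, which the notebook does not confirm; a suggestion written in SQL, for instance, would simply be reported as not parseable. The helper name `parses_as_python` is hypothetical.

```python
import ast


def parses_as_python(snippet):
    """Return True if the snippet is non-empty, syntactically valid Python (it is never executed)."""
    if not snippet:
        return False
    try:
        ast.parse(snippet)
    except (SyntaxError, ValueError):
        return False
    return True
```

A syntax check says nothing about whether the fix is correct or safe, so it complements manual review rather than replacing it.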

## 6. Implement a Fix and Revalidate

Let's implement one of the suggested fixes and revalidate the dataset.

In [None]:
# Implement a fix for the dataset
def fix_dataset(df):
    # 1. Handle missing values. Assign the result back instead of using a chained
    #    inplace=True call, which warns in recent pandas 2.x releases and does not
    #    update df under copy-on-write.
    df['customer_id'] = df['customer_id'].fillna('UNKNOWN')
    
    # 2. Handle non-numeric values in amount (unparseable entries become NaN)
    df['amount'] = pd.to_numeric(df['amount'], errors='coerce')
    
    # 3. Fill missing amounts with the mean
    mean_amount = df['amount'].mean()
    df['amount'] = df['amount'].fillna(mean_amount)
    
    # 4. Handle negative amounts - convert to positive and mark as refund
    refunds = df['amount'] < 0
    df.loc[refunds, 'status'] = 'Refunded'
    df.loc[refunds, 'amount'] = df.loc[refunds, 'amount'].abs()
    
    # 5. Convert transaction_date to proper format (unparseable dates become NaT)
    df['transaction_date'] = pd.to_datetime(df['transaction_date'], errors='coerce')
    
    return df

# Apply the fix
df_fixed = fix_dataset(df.copy())

# Save the fixed dataset
fixed_data_path = "../data/transactions_fixed.csv"
df_fixed.to_csv(fixed_data_path, index=False)

# Display the fixed dataset
df_fixed.head()
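
Mean imputation and sign flipping both rewrite values silently, which makes the cleaned file hard to audit later. The sketch below is a variation on the `amount` handling above, not the same procedure: it records which rows were altered, and it takes absolute values before computing the mean so that negative entries do not pull the imputed value down. The helper name `clean_amount` and the flag column names are hypothetical.

```python
import pandas as pd


def clean_amount(frame):
    """Return a copy with a numeric, non-negative amount plus flags for altered rows."""
    out = frame.copy()
    parsed = pd.to_numeric(out["amount"], errors="coerce")
    out["amount_imputed"] = parsed.isna()
    out["amount_sign_flipped"] = parsed < 0
    # Take absolute values first so the imputed mean is not dragged down by negative entries.
    parsed = parsed.abs()
    out["amount"] = parsed.fillna(parsed.mean())
    return out


# Hypothetical rows for illustration only.
sample = pd.DataFrame({"amount": ["10", "abc", "-20", None]})
cleaned = clean_amount(sample)
print(cleaned)
```

Keeping the flags in the output lets downstream users exclude imputed or flipped rows if they need to.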

In [None]:
# Revalidate the fixed dataset
fixed_validation_results = run_validation(fixed_data_path, suite_name)

# Display validation results summary
print(f"Validation success: {fixed_validation_results['success']}")
print(f"Total checks: {fixed_validation_results['statistics']['evaluated_expectations']}")
print(f"Passed checks: {fixed_validation_results['statistics']['successful_expectations']}")
print(f"Failed checks: {fixed_validation_results['statistics']['unsuccessful_expectations']}")
print(f"Success rate: {fixed_validation_results['statistics']['success_percent']}%")
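
To see at a glance whether the fix helped, the two runs can be compared. This is a small sketch assuming both result dictionaries carry the `statistics` fields printed above; the helper name `compare_runs` is hypothetical.

```python
def compare_runs(before, after):
    """Return the change in failed-check count and success rate between two runs."""
    def stat(results, key):
        return results.get("statistics", {}).get(key, 0)

    return {
        "failed_delta": stat(after, "unsuccessful_expectations")
        - stat(before, "unsuccessful_expectations"),
        "success_percent_delta": round(
            stat(after, "success_percent") - stat(before, "success_percent"), 2
        ),
    }
```

In this notebook the call would be `compare_runs(validation_results, fixed_validation_results)`. A negative `failed_delta` means fewer checks fail after the fix.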

## 7. Conclusion

In this notebook, we've demonstrated the core functionality of the AI-powered data quality monitor:

1. Auto-generating test suites based on data
2. Validating datasets against expectations
3. Generating LLM insights for failed checks
4. Suggesting fixes for data quality issues
5. Implementing fixes and revalidating

This system helps data teams identify and resolve data quality issues more efficiently by combining traditional rule-based validation with AI-powered insights and recommendations.