<a href="https://colab.research.google.com/github/LexusMaximus/Automated-EDA-Narrator-Data-Quality-Scoring-Tool/blob/main/Demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# DatasetSense: Automated EDA Narrator + Data Quality Scoring Tool

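This notebook runs the DatasetSense pipeline on a sample dataset (`data/sample.csv`) with several quality-scoring weight configurations, compares the resulting overall scores, and shows how invalid weights are rejected.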

## Setup: Clone Repository

In [None]:
# List files in current Colab environment
!ls

# Remove any previous copy of the repo so the next cell clones a fresh one (replace with your repo name)
!rm -rf Automated-EDA-Narrator-Data-Quality-Scoring-Tool

In [None]:
!git clone https://github.com/LexusMaximus/Automated-EDA-Narrator-Data-Quality-Scoring-Tool.git

In [None]:
import os
os.chdir("/content/Automated-EDA-Narrator-Data-Quality-Scoring-Tool")
!ls

## Example 1: Default Weights

Run the pipeline with default quality scoring weights:
- Missing: 35%
- Duplicates: 15%
- Outliers: 25%
- Balance: 25%
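
This notebook does not show how the four weights combine into one score. A minimal sketch, assuming each metric yields a sub-score on a 0-100 scale and the overall score is their weighted sum. The function `weighted_overall` and the sub-score values are hypothetical illustrations, not part of the DatasetSense API:

```python
# Hypothetical sketch: combine per-metric sub-scores into one overall score.
# Assumes sub-scores are on a 0-100 scale and the weights sum to 1.0.
DEFAULT_WEIGHTS = {'missing': 0.35, 'duplicates': 0.15, 'outliers': 0.25, 'balance': 0.25}

def weighted_overall(sub_scores, weights=DEFAULT_WEIGHTS):
    """Return the weighted sum of the sub-scores, one weight per metric."""
    return sum(weights[metric] * sub_scores[metric] for metric in weights)

# Made-up sub-scores, for illustration only
example_scores = {'missing': 80.0, 'duplicates': 100.0, 'outliers': 60.0, 'balance': 90.0}
overall = weighted_overall(example_scores)
print(round(overall, 2))  # 0.35*80 + 0.15*100 + 0.25*60 + 0.25*90 = 80.5
```

Under this assumption, raising a metric's weight pulls the overall score toward that metric's sub-score, which is what the later examples vary.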

In [None]:
import sys
import importlib
sys.path.insert(0, '/content/Automated-EDA-Narrator-Data-Quality-Scoring-Tool/src')

# Import the module first, then reload it to ensure latest changes are picked up
import orchestrator
importlib.reload(orchestrator)
from orchestrator import DatasetPipeline

# Initialize pipeline with default weights
pipeline = DatasetPipeline("data/sample.csv")

# Run pipeline and print report
report = pipeline.run()
print(report)

## Example 2: Custom Weights - Prioritize Missing Values

Use custom weights that give missing values the largest share of the overall score:
- Missing: 50% (high priority)
- Duplicates: 10%
- Outliers: 20%
- Balance: 20%

In [None]:
# Custom weights focusing on missing values
custom_weights_missing = {
    'missing': 0.50,
    'duplicates': 0.10,
    'outliers': 0.20,
    'balance': 0.20
}

pipeline_custom = DatasetPipeline("data/sample.csv", custom_weights=custom_weights_missing)
report_custom = pipeline_custom.run()
print(report_custom)

## Example 3: Custom Weights - Prioritize Outliers and Duplicates

Use custom weights that give outliers and duplicates the largest shares of the overall score:
- Missing: 20%
- Duplicates: 30% (high priority)
- Outliers: 40% (high priority)
- Balance: 10%

In [None]:
# Custom weights focusing on outliers and duplicates
custom_weights_outliers = {
    'missing': 0.20,
    'duplicates': 0.30,
    'outliers': 0.40,
    'balance': 0.10
}

pipeline_outliers = DatasetPipeline("data/sample.csv", custom_weights=custom_weights_outliers)
report_outliers = pipeline_outliers.run()
print(report_outliers)

## Example 4: Equal Weights for All Metrics

Treat all quality metrics equally:
- Missing: 25%
- Duplicates: 25%
- Outliers: 25%
- Balance: 25%

In [None]:
# Equal weights for all metrics
equal_weights = {
    'missing': 0.25,
    'duplicates': 0.25,
    'outliers': 0.25,
    'balance': 0.25
}

pipeline_equal = DatasetPipeline("data/sample.csv", custom_weights=equal_weights)
report_equal = pipeline_equal.run()
print(report_equal)

## Example 5: Compare Overall Scores Across Different Weights

Run multiple configurations and compare the overall quality scores.

In [None]:
import pandas as pd

# Define different weight configurations
weight_configs = {
    'Default': {'missing': 0.35, 'duplicates': 0.15, 'outliers': 0.25, 'balance': 0.25},
    'Missing Focus': {'missing': 0.50, 'duplicates': 0.10, 'outliers': 0.20, 'balance': 0.20},
    'Outlier Focus': {'missing': 0.20, 'duplicates': 0.30, 'outliers': 0.40, 'balance': 0.10},
    'Equal Weights': {'missing': 0.25, 'duplicates': 0.25, 'outliers': 0.25, 'balance': 0.25},
    'Balance Focus': {'missing': 0.20, 'duplicates': 0.20, 'outliers': 0.20, 'balance': 0.40}
}

# Run pipeline with each configuration
results = []
for name, weights in weight_configs.items():
    pipeline = DatasetPipeline("data/sample.csv", custom_weights=weights)
    pipeline.run()
    results.append({
        'Configuration': name,
        'Overall Score': round(pipeline.scores['overall'], 2),
        'Missing Weight': weights['missing'],
        'Duplicates Weight': weights['duplicates'],
        'Outliers Weight': weights['outliers'],
        'Balance Weight': weights['balance']
    })

# Display comparison table
comparison_df = pd.DataFrame(results)
print("\n" + "="*80)
print("COMPARISON OF DIFFERENT WEIGHT CONFIGURATIONS")
print("="*80)
print(comparison_df.to_string(index=False))
print("="*80)

## Example 6: Error Handling - Invalid Weights

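Demonstrate the validation that rejects invalid weight configurations: weights that do not sum to 1.0, missing keys, and negative values. Each case is expected to raise a `ValueError`.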

In [None]:
print("Testing error handling for invalid weights...\n")

# Each case breaks exactly one validation rule
invalid_cases = {
    # Weights sum to 1.20 instead of 1.0
    "Test 1: Weights sum > 1.0": {
        'missing': 0.50, 'duplicates': 0.30, 'outliers': 0.30, 'balance': 0.10
    },
    # 'outliers' and 'balance' are absent
    "Test 2: Missing required keys": {
        'missing': 0.50, 'duplicates': 0.50
    },
    # Sums to 1.0, but 'duplicates' is negative
    "Test 3: Negative weights": {
        'missing': 0.50, 'duplicates': -0.10, 'outliers': 0.40, 'balance': 0.20
    },
}

failures = []
for label, weights in invalid_cases.items():
    print(label)
    try:
        DatasetPipeline("data/sample.csv", custom_weights=weights).run()
    except ValueError as e:
        print(f"✓ Error correctly caught: {e}\n")
    else:
        # No exception means validation let an invalid configuration through
        print("✗ No error raised for invalid weights\n")
        failures.append(label)

print("="*80)
if failures:
    print(f"{len(failures)} error handling test(s) failed: {failures}")
else:
    print("All error handling tests passed! ✓")
print("="*80)

## Summary

This notebook demonstrated:

1. **Default weights**: 35% missing, 15% duplicates, 25% outliers, 25% balance
2. **Custom weights**: Flexibility to prioritize specific metrics
3. **Multiple configurations**: Comparing different weight strategies
4. **Error handling**: Validation prevents invalid configurations

### Key Takeaways:

- Weights must sum to **1.0** (100%)
- All four metrics must be specified: `missing`, `duplicates`, `outliers`, `balance`
- All weights must be **non-negative**
- The same dataset can receive a different overall quality score under different weights (compare the table in Example 5)
- Choose weights based on your data quality priorities
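
The three rules above can be sketched as a standalone check. This is an illustration under stated assumptions, not DatasetSense's own validation code: the name `validate_weights`, the error messages, and the tolerance are all hypothetical.

```python
import math

REQUIRED_KEYS = {'missing', 'duplicates', 'outliers', 'balance'}

def validate_weights(weights):
    """Raise ValueError if the weights break any of the three rules."""
    missing_keys = REQUIRED_KEYS - set(weights)
    if missing_keys:
        raise ValueError(f"Missing required weight keys: {sorted(missing_keys)}")
    negative = {key: value for key, value in weights.items() if value < 0}
    if negative:
        raise ValueError(f"Weights must be non-negative, got: {negative}")
    total = sum(weights[key] for key in REQUIRED_KEYS)
    # Compare with a tolerance: a floating-point sum can differ from 1.0
    # by a tiny rounding error even when the weights are valid.
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError(f"Weights must sum to 1.0, got {total:.4f}")
    return weights

# Usage: a valid configuration passes, an invalid one raises
validate_weights({'missing': 0.35, 'duplicates': 0.15, 'outliers': 0.25, 'balance': 0.25})
try:
    validate_weights({'missing': 0.50, 'duplicates': 0.50})
except ValueError as error:
    print(f"Rejected: {error}")
```

Checking the keys first keeps the later checks simple, because the sum can then assume every required key is present.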

### OOP Concepts Demonstrated:

- **Encapsulation**: Weights are validated internally
- **Default parameters**: Optional custom weights with sensible defaults
- **Composition**: Pipeline orchestrates multiple classes
- **Error handling**: Proper validation with meaningful error messages
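
To make these concepts concrete, here is a deliberately reduced sketch with only two metrics and plain Python rows instead of a CSV. Every class name here (`MissingCheck`, `DuplicateCheck`, `SketchPipeline`), the scoring formulas, and the 0.7/0.3 defaults are hypothetical; the real `DatasetPipeline` may be structured differently.

```python
import math

class MissingCheck:
    """Scores completeness as the percentage of non-empty cells."""
    name = 'missing'

    def score(self, rows):
        cells = [value for row in rows for value in row.values()]
        filled = sum(value is not None for value in cells)
        return 100.0 * filled / len(cells) if cells else 100.0

class DuplicateCheck:
    """Scores uniqueness as the percentage of distinct rows."""
    name = 'duplicates'

    def score(self, rows):
        distinct = {tuple(sorted(row.items())) for row in rows}
        return 100.0 * len(distinct) / len(rows) if rows else 100.0

class SketchPipeline:
    DEFAULT_WEIGHTS = {'missing': 0.7, 'duplicates': 0.3}

    def __init__(self, rows, custom_weights=None):  # default parameter
        self._rows = rows
        self._checks = [MissingCheck(), DuplicateCheck()]  # composition
        weights = self.DEFAULT_WEIGHTS if custom_weights is None else custom_weights
        self._weights = self._validated(weights)  # encapsulation
        self.scores = {}

    def _validated(self, weights):
        """Reject bad weights early, with a message that names the problem."""
        expected = {check.name for check in self._checks}
        if set(weights) != expected:
            raise ValueError(f"Weights must have exactly these keys: {sorted(expected)}")
        if any(value < 0 for value in weights.values()):
            raise ValueError("Weights must be non-negative")
        if not math.isclose(sum(weights.values()), 1.0):
            raise ValueError("Weights must sum to 1.0")
        return dict(weights)

    def run(self):
        """Run every check, then combine the sub-scores with the weights."""
        sub_scores = {check.name: check.score(self._rows) for check in self._checks}
        overall = sum(self._weights[name] * value for name, value in sub_scores.items())
        self.scores = {**sub_scores, 'overall': overall}
        return self.scores

# Made-up rows: one empty cell and one repeated row
rows = [
    {'id': 1, 'city': 'Oslo'},
    {'id': 2, 'city': None},
    {'id': 1, 'city': 'Oslo'},
    {'id': 3, 'city': 'Rome'},
]
print(SketchPipeline(rows).run())
```

The pipeline never computes a metric itself; it delegates to the check objects it holds, so adding a metric in this sketch would mean adding a class and a weight rather than editing `run`.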