Title: AdaptiveDataDoctor – Automated Data Quality Agent

This notebook demonstrates the ADK-inspired AdaptiveDataDoctor Agent, which automatically analyzes, cleans, and improves the quality of messy datasets.

It showcases:

Schema inference

Data profiling

Missing-value imputation

Outlier detection

Duplicate resolution

Drift detection

Automated audit report generation

This notebook is the reference implementation for the Kaggle capstone submission.

GitHub Repo:
https://github.com/shishir-katakam/google-x-kaggle

1. Motivation

Real-world datasets often contain:

Missing values

Outliers

Duplicates

Wrong data types

Irregular categories

Corrupted entries

These issues break downstream ML models, dashboards, and enterprise pipelines.

The goal of AdaptiveDataDoctor is to automate data quality improvements using tool-based orchestration, inspired by ADK (Agent Development Kit) concepts.

Raw Dataset
     ↓
SchemaInferTool
     ↓
DataProfilerTool
     ↓
OutlierDetectorTool
     ↓
DataImputerTool
     ↓
DuplicateResolverTool
     ↓
FixGeneratorTool
     ↓
ReportWriter ➜ audit_report.md
               cleaned_output.csv
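
The flow above amounts to a fixed chain of tools, each taking the dataset plus a shared context and handing its result to the next stage. Below is a minimal, dependency-free sketch of that orchestration pattern. The `Tool` alias, `run_pipeline`, and the two toy tools are hypothetical names invented for illustration and are not taken from `src.agent`; the real tools presumably work on pandas DataFrames rather than lists of dicts.

```python
from typing import Callable

# Hypothetical types: a dataset is a list of row dicts, and a tool takes
# (rows, context) and returns the possibly modified rows.
Rows = list[dict[str, str]]
Tool = Callable[[Rows, dict], Rows]


def count_missing(rows: Rows, context: dict) -> Rows:
    """Toy stand-in for a profiling stage: records empty cells per column."""
    missing: dict[str, int] = {}
    for row in rows:
        for column, value in row.items():
            if value.strip() == "":
                missing[column] = missing.get(column, 0) + 1
    context["missing_per_column"] = missing
    return rows


def drop_exact_duplicates(rows: Rows, context: dict) -> Rows:
    """Toy stand-in for a duplicate-resolution stage."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    context["duplicates_removed"] = len(rows) - len(unique)
    return unique


def run_pipeline(rows: Rows, tools: list[Tool]) -> tuple[Rows, dict]:
    """Run each tool in order, threading the rows and a shared context."""
    context: dict = {"steps": []}
    for tool in tools:
        rows = tool(rows, context)
        context["steps"].append(tool.__name__)
    return rows, context


raw = [
    {"name": "alice", "age": "34"},
    {"name": "alice", "age": "34"},
    {"name": "bob", "age": ""},
]
cleaned, audit = run_pipeline(raw, [count_missing, drop_exact_duplicates])
print(cleaned)
print(audit)
```

Keeping every stage behind the same signature means stages can be reordered or swapped without touching the runner, and the shared context doubles as raw material for an audit report.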


In [None]:
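from src.agent import AdaptiveDataDoctorAgent

# Run the full cleaning pipeline on the sample corrupted dataset.
agent = AdaptiveDataDoctorAgent()
result = agent.run("data/sample_corrupted.csv")

# Show the result; later cells read its "cleaned_path" and "report_path" keys.
result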

In [None]:
import pandas as pd

pd.read_csv(result["cleaned_path"]).head()


In [None]:
with open(result["report_path"]) as report_file:  # close the file after reading
    print(report_file.read())


3. Example Input (Corrupted Dataset)

The dataset below intentionally contains:

Missing values

Invalid strings in numeric columns

Duplicate rows

Extreme outlier values

Inconsistent upper/lower casing

This allows the agent to demonstrate cleaning functionality.
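
To make these defect types concrete, the sketch below builds a small made-up CSV containing each kind of issue and counts them using only the standard library. The column names and values are invented for illustration and are not the contents of `data/sample_corrupted.csv`.

```python
import csv
import io

# Made-up sample; NOT the real data/sample_corrupted.csv.
RAW_CSV = """name,age,city
Alice,34,Paris
Bob,,paris
Carol,abc,London
Alice,34,Paris
Dave,999,LONDON
"""

rows = list(csv.DictReader(io.StringIO(RAW_CSV)))


def is_number(text: str) -> bool:
    """Return True if the text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


# Missing values: empty cells in any column.
missing = sum(1 for row in rows for value in row.values() if value == "")

# Invalid strings in a numeric column: non-empty cells that do not parse.
invalid_age = [row["age"] for row in rows if row["age"] and not is_number(row["age"])]

# Duplicate rows: rows identical to an earlier row.
seen = set()
duplicates = 0
for row in rows:
    key = tuple(row.values())
    if key in seen:
        duplicates += 1
    seen.add(key)

# Case inconsistencies: distinct spellings that collapse to one lowercase value.
spellings: dict[str, set[str]] = {}
for row in rows:
    spellings.setdefault(row["city"].lower(), set()).add(row["city"])
inconsistent = {key: sorted(forms) for key, forms in spellings.items() if len(forms) > 1}

print(missing, invalid_age, duplicates, inconsistent)
```

The age of 999 parses as a valid number, so a type check alone does not catch it; flagging it needs a statistical rule applied to the column as a whole.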

In [None]:
import pandas as pd

pd.read_csv("data/sample_corrupted.csv")


4. Results Summary

After running the agent:

Missing numeric values were imputed with the column median

Missing categorical values were imputed with the most frequent value (most_frequent strategy)

Numeric outliers were detected

Duplicate rows were removed

An audit report was generated with column profiles and a schema summary
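
The imputation and outlier steps can be sketched on one column at a time. The example uses median imputation for a numeric column and most-frequent imputation for a categorical one, as described above. The source does not say which outlier rule the agent applies, so the 1.5 * IQR fence here is an assumption, as are the helper names and the sample values.

```python
import statistics
from typing import Optional


def impute_median(values: list[Optional[float]]) -> list[float]:
    """Replace None with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed)
    return [fill if v is None else v for v in values]


def impute_most_frequent(values: list[Optional[str]]) -> list[str]:
    """Replace None with the most common observed value."""
    observed = [v for v in values if v is not None]
    fill = statistics.mode(observed)
    return [fill if v is None else v for v in values]


def iqr_outliers(values: list[float], k: float = 1.5) -> list[float]:
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (assumed rule)."""
    # method="inclusive" treats the data as the full population, which keeps
    # a single extreme value from stretching the quartiles in a tiny sample.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


ages = impute_median([34.0, None, 29.0, 41.0, 999.0])
cities = impute_most_frequent(["paris", None, "london", "paris"])
print(ages)                # [34.0, 37.5, 29.0, 41.0, 999.0]
print(cities)              # ['paris', 'paris', 'london', 'paris']
print(iqr_outliers(ages))  # [999.0]
```

The median is preferred over the mean here because one extreme value (999) would otherwise drag the fill value far from typical ages. In pandas, the corresponding building blocks are `Series.fillna`, `Series.median`, `Series.mode`, and `DataFrame.drop_duplicates`.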

5. Real-World Use Cases

Data engineering workflows

ETL pipeline validation

Automatic cleaning before ML model training

Enterprise data observability tools

Quality checks in daily batch pipelines

6. Future Work

Planned improvements:

Multi-imputation strategy selection

Visualization dashboards for drift

Multi-agent architecture (Supervisor + Tools)

Interactive UI for human-in-the-loop corrections

7. Conclusion

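AdaptiveDataDoctor is a lightweight ADK-style agent that chains schema inference, profiling, outlier detection, imputation, and duplicate resolution into a single pipeline, producing a cleaned CSV and an audit report from a messy input file.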