# 🔬 AI Data Quality Copilot
A data engineering project that automatically detects and explains data quality issues, built with Python, Pandas, SQL, and Claude AI.
```
ai_dataquality_copilot/
├── streamlit_app.py    # UI + app logic (replaces Flask + HTML + CSS)
├── data_quality.py     # Core DQ checks (Pandas + SQLite)
├── llm_engine.py       # Claude AI integration
├── requirements.txt    # Python dependencies
├── .env                # Your API key (never share or commit this)
├── .gitignore          # Keeps .env out of GitHub
└── README.md
```
Step 1: Install dependencies

```bash
pip install -r requirements.txt
```

Step 2: Create a `.env` file in the project folder containing your API key:

```
ANTHROPIC_API_KEY=sk-ant-your-key-here
```

No key? The app still works using rule-based analysis, just without AI insights.

Step 3: Run the app

```bash
streamlit run streamlit_app.py
```
Your browser opens automatically at http://localhost:8501
| Check | Method |
|---|---|
| Null values | Pandas `.isnull()` |
| Duplicate rows | Pandas `.duplicated()` |
| Statistical outliers | IQR method |
| Data type mismatches | Pandas `to_numeric`, `to_datetime` |
| Inconsistent casing | Pandas string methods |
| Invalid email formats | Python regex |
| Live SQL queries | SQLite in-memory |
| AI fix recommendations | Claude API |
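As one example of the checks in the table, the IQR method flags values outside `[Q1 - 1.5·IQR, Q3 + 1.5·IQR]`. This is a minimal sketch of that check; the actual `data_quality.py` implementation may differ in details.

```python
import pandas as pd


def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


s = pd.Series([10, 11, 12, 11, 10, 11, 300])  # 300 is an obvious outlier
mask = iqr_outliers(s)                        # True only at the last position
```

The `k=1.5` multiplier is the conventional Tukey fence; a larger `k` makes the check more tolerant.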
```
streamlit_app.py
├── calls → data_quality.py → run_full_profile(df)
└── calls → llm_engine.py → analyze_with_llm(report)
```
`data_quality.py` and `llm_engine.py` are plain Python modules: their functions take data in and return results, with no knowledge of Streamlit. That means you could swap the UI for anything else and the logic would stay the same.
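That decoupling can be demonstrated with a toy stand-in for `run_full_profile` (the real function runs many more checks): because it only sees a DataFrame and returns a dict, any caller works, whether that is a Streamlit page, a CLI script, or a scheduled job.

```python
import pandas as pd


def run_full_profile(df: pd.DataFrame) -> dict:
    """Toy stand-in: report null and duplicate counts, per the checks table."""
    return {
        "nulls": int(df.isnull().sum().sum()),     # total missing cells
        "duplicates": int(df.duplicated().sum()),  # fully repeated rows
    }


# No Streamlit anywhere: this runs fine from a plain script.
df = pd.DataFrame({"a": [1, 1, None], "b": ["x", "x", "y"]})
report = run_full_profile(df)  # one null cell, one duplicate row
```

Keeping the profiling logic UI-agnostic is also what makes the future ideas below (cron/Airflow scheduling, score tracking) straightforward to bolt on.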
- Export quality report as PDF
- Connect to PostgreSQL or MySQL instead of SQLite
- Track quality scores over time
- Add schema validation with Great Expectations
- Schedule automated checks with Airflow or cron