### Task 1: Automated Data Profiling

**Steps**:
1. Using Pandas-Profiling
    - Generate a profile report for an existing CSV file.
    - Customize the profile report to include correlations.
    - Profile a specific subset of columns.
2. Using Great Expectations
    - Create a basic expectation suite for your data.
    - Validate data against an expectation suite.
    - Add multiple expectations to a suite.

In [None]:
# Write your code from here

In [1]:
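import pandas as pd
# Note: pandas-profiling has been renamed ydata-profiling; on newer installs,
# use `from ydata_profiling import ProfileReport` instead.
from pandas_profiling import ProfileReport
import great_expectations as ge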

# Load data
df = pd.read_csv("your_data.csv")  # Replace with your CSV path

# 1. Pandas Profiling

# Generate full profile with correlations
profile = ProfileReport(df, title="Data Profiling Report", correlations={"pearson": {"calculate": True}})

# Save to HTML
profile.to_file("full_profile_report.html")

# Profile subset of columns (example: age and income)
subset_profile = ProfileReport(df[['age', 'income']], title="Subset Profile Report")
subset_profile.to_file("subset_profile_report.html")

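# 2. Great Expectations (legacy Pandas dataset API from pre-1.0 releases)

# Wrap the DataFrame so that expectation methods are available on it
ge_df = ge.from_pandas(df)

# Add multiple expectations; each call is evaluated immediately and
# recorded in the suite attached to ge_df
ge_df.expect_column_values_to_not_be_null("age")
ge_df.expect_column_values_to_be_between("age", min_value=0, max_value=120)
ge_df.expect_column_values_to_not_be_null("income")
ge_df.expect_column_values_to_be_between("income", min_value=0)

# Retrieve the suite, keeping expectations that failed on this data
suite = ge_df.get_expectation_suite(discard_failed_expectations=False)

# Save expectation suite to JSON
ge_df.save_expectation_suite("basic_suite.json", discard_failed_expectations=False)

# Validate data against the suite object (a string argument would be
# treated as a path to a suite JSON file, not as a suite name)
results = ge_df.validate(expectation_suite=suite)

print(results)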

FileNotFoundError: [Errno 2] No such file or directory: 'your_data.csv'
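
The `FileNotFoundError` above appears because `your_data.csv` is only a placeholder path. One way to make the cell runnable for experimentation is to write a small stand-in file first. The sketch below is an assumption, not part of the original exercise: the rows are hypothetical sample values (borrowed from the Task 3 example data), and the summary table is just a quick pandas-only sanity check, not a replacement for the profile report.

```python
import pandas as pd

# Hypothetical sample rows; column names match the ones profiled above.
sample = pd.DataFrame(
    {
        "age": [25, 30, 35, 40, 45],
        "income": [50000, 60000, 75000, None, 100000],
    }
)
sample.to_csv("your_data.csv", index=False)

# Read the file back and build a small per-column summary.
df = pd.read_csv("your_data.csv")
summary = pd.DataFrame(
    {
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "unique": df.nunique(),
    }
)
print(summary)
```

With the file in place, the profiling and validation cell can be rerun. Note that the missing `income` value means a not-null expectation on that column would be expected to fail, which is useful for exercising the alerting code in Task 2.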

### Task 2: Real-time Monitoring of Data Quality

**Steps**:
1. Setting up Alerts for Quality Drops
    - Use the logging library to set up a basic alert on failed expectations.
    - Implement alerts using email notifications.
    - Use a dashboard such as Grafana for visual alerts.
        - Note: Example assumes integration with a monitoring system
        - Alert setup would involve creating a data source and alert rule in Grafana

In [None]:
# Write your code from here

In [2]:
import logging
import smtplib
from email.message import EmailMessage

# Set up logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

def send_email_alert(subject, body, to_email):
    msg = EmailMessage()
    msg.set_content(body)
    msg['Subject'] = subject
    msg['From'] = 'your_email@example.com'
    msg['To'] = to_email

    # Setup SMTP server (example Gmail)
    with smtplib.SMTP_SSL('smtp.gmail.com', 465) as smtp:
        smtp.login('your_email@example.com', 'your_app_password')  # Use app password or OAuth2
        smtp.send_message(msg)

# Example: check GE validation results and alert
def alert_on_validation(results):
    if not results['success']:
        logging.error("Data quality validation failed!")
        send_email_alert(
            subject="Data Quality Alert: Validation Failed",
            body=f"Details:\n{results}",
            to_email="alert_recipient@example.com"
        )
    else:
        logging.info("Data quality validation passed.")

# Example usage:
# alert_on_validation(results)

# Grafana integration:
# - Export metrics or logs from your validation process to a time series DB (Prometheus, InfluxDB)
# - Create dashboards/alert rules in Grafana using that data source
# (This part involves infrastructure and is out of scope for pure Python code)
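
The Grafana notes above stop at "export metrics". As a rough sketch of what that export step could look like, the code below renders a validation summary in the Prometheus text exposition format and writes it to a file. Everything specific here is an assumption: the metric names, the `to_prometheus_text` helper, the output file name, and the hand-written `summary` dictionary, which stands in for whatever counts your validation process reports.

```python
from pathlib import Path


def to_prometheus_text(success, evaluated, failed):
    """Render validation counts in the Prometheus text exposition format."""
    metrics = [
        # (hypothetical metric name, help text, value)
        ("data_quality_validation_success",
         "1 if the last validation run passed, else 0.",
         int(bool(success))),
        ("data_quality_expectations_evaluated",
         "Expectations evaluated in the last run.",
         int(evaluated)),
        ("data_quality_expectations_failed",
         "Expectations that failed in the last run.",
         int(failed)),
    ]
    lines = []
    for name, help_text, value in metrics:
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hand-written stand-in for a validation summary, not real output
summary = {"success": False, "evaluated": 4, "failed": 1}
text = to_prometheus_text(**summary)

# Hypothetical output path; a file-based collector could pick it up from here
Path("data_quality.prom").write_text(text)
print(text)
```

Once a time series database is ingesting these gauges, a Grafana alert rule could fire when the success gauge drops to 0 or the failed count rises above a threshold. How the file is collected depends on your monitoring setup.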


### Task 3: Using AI for Data Quality Monitoring

**Steps**:
1. Basic AI Models for Monitoring
    - Train a simple anomaly detection model using Isolation Forest.
    - Write a simple custom, rule-based function for outlier detection.
    - Create a monitoring function that uses a pre-trained machine learning model.

In [None]:
# Write your code from here

In [3]:
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Example data (replace with your actual dataset)
data = np.array([[25, 50000], [30, 60000], [35, 75000], [40, None], [45, 100000]], dtype=object)

# Convert to DataFrame and clean: the object array yields object-dtype columns,
# so coerce both to numeric and fill the missing income with the median
df = pd.DataFrame(data, columns=["age", "income"])
df['age'] = pd.to_numeric(df['age'], errors='coerce')
df['income'] = pd.to_numeric(df['income'], errors='coerce')
df['income'] = df['income'].fillna(df['income'].median())

# Train Isolation Forest for anomaly detection on the feature columns only
features = ['age', 'income']
iso_forest = IsolationForest(contamination=0.2, random_state=42)
iso_forest.fit(df[features])

# Predict anomalies (-1 anomaly, 1 normal)
df['anomaly'] = iso_forest.predict(df[features])

# AI-based outlier detection function (simple rule-based)
def simple_ai_outlier_check(row):
    if row['age'] < 0 or row['income'] < 0:
        return True
    if row['income'] > 200000:  # example threshold
        return True
    return False

df['ai_outlier'] = df.apply(simple_ai_outlier_check, axis=1)

# Monitoring function that integrates model predictions
def monitor_data_quality(df):
    anomalies = df[df['anomaly'] == -1]
    ai_outliers = df[df['ai_outlier']]
    if not anomalies.empty or not ai_outliers.empty:
        print(f"ALERT: Found {len(anomalies)} anomalies and {len(ai_outliers)} AI outliers")
        print("Anomalies:")
        print(anomalies)
        print("AI Outliers:")
        print(ai_outliers)
    else:
        print("Data quality looks good!")

# Run monitor
monitor_data_quality(df)


ALERT: Found 1 anomalies and 0 AI outliers
Anomalies:
  age    income  anomaly  ai_outlier
4  45  100000.0       -1       False
AI Outliers:
Empty DataFrame
Columns: [age, income, anomaly, ai_outlier]
Index: []
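
The steps call for a monitoring function that uses a pre-trained model, while the cell above trains and scores in a single pass. A minimal sketch of separating the two, assuming `joblib` (installed alongside scikit-learn) for persistence: the model is fitted once on reference data and saved, and the monitoring function loads it to score new batches. The file name, the `monitor_batch` helper, and all data values are hypothetical.

```python
import joblib
import pandas as pd
from sklearn.ensemble import IsolationForest

FEATURES = ["age", "income"]
MODEL_PATH = "iso_forest.joblib"  # hypothetical file name

# Training step (run once): fit on reference data and persist the model
reference = pd.DataFrame(
    {
        "age": [25, 30, 35, 40, 45],
        "income": [50000, 60000, 75000, 67500, 100000],
    }
)
trained = IsolationForest(contamination=0.2, random_state=42)
trained.fit(reference[FEATURES])
joblib.dump(trained, MODEL_PATH)


def monitor_batch(batch, model_path=MODEL_PATH):
    """Score a new batch with the saved model and return the flagged rows."""
    model = joblib.load(model_path)
    scored = batch.copy()  # leave the caller's DataFrame untouched
    scored["anomaly"] = model.predict(scored[FEATURES])  # -1 anomaly, 1 normal
    return scored[scored["anomaly"] == -1]


# Monitoring step (run per batch): hypothetical incoming rows
new_batch = pd.DataFrame({"age": [33, 38], "income": [70000, 5000000]})
flagged = monitor_batch(new_batch)
print(f"{len(flagged)} of {len(new_batch)} rows flagged")
print(flagged)
```

Which rows end up flagged depends on the fitted trees and the `contamination` setting, so treat the result as a signal to investigate, not a verdict. With so few reference rows, the model here is only illustrative.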
