### Task 1: Automated Data Profiling

**Steps**:
1. Using Pandas-Profiling
    - Generate a profile report for an existing CSV file.
    - Customize the profile report to include correlations.
    - Profile a specific subset of columns.
2. Using Great Expectations
    - Create a basic expectation suite for your data.
    - Validate data against an expectation suite.
    - Add multiple expectations to a suite.
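
To make the expectation-suite steps above concrete before reaching for the library, here is a minimal plain-Python sketch of the idea: a suite is a list of named checks, and validation reports an overall `success` flag plus per-check results. The helper names (`not_null`, `in_set`, `between`, `validate`) are hypothetical and are not part of the Great Expectations API.

```python
from typing import Any, Callable

# Hypothetical stand-in for an expectation suite: each entry pairs a readable
# name with a check that takes the table (a list of row dicts) and returns bool.
Row = dict[str, Any]
Expectation = tuple[str, Callable[[list[Row]], bool]]


def not_null(column: str) -> Expectation:
    return (f"{column} is not null",
            lambda rows: all(r.get(column) is not None for r in rows))


def in_set(column: str, allowed: set) -> Expectation:
    return (f"{column} in {sorted(allowed)}",
            lambda rows: all(r.get(column) in allowed for r in rows))


def between(column: str, low: float, high: float) -> Expectation:
    return (f"{low} <= {column} <= {high}",
            lambda rows: all(r.get(column) is not None and low <= r[column] <= high
                             for r in rows))


def validate(rows: list[Row], suite: list[Expectation]) -> dict:
    """Run every expectation and report overall success plus per-check results."""
    results = [{"expectation": name, "success": check(rows)} for name, check in suite]
    return {"success": all(r["success"] for r in results), "results": results}


# Three expectations in one suite, mirroring the "add multiple expectations" step
suite = [not_null("col1"), in_set("col2", {"a", "b", "c", "d"}), between("col3", 1, 4)]

good_rows = [{"col1": 1, "col2": "a", "col3": 1.1}, {"col1": 2, "col2": "b", "col3": 2.2}]
bad_rows = [{"col1": None, "col2": "e", "col3": 5.0}]

good_report = validate(good_rows, suite)
bad_report = validate(bad_rows, suite)
print(good_report["success"], bad_report["success"])
```

The same shape (a suite, a batch of data, a result with a `success` flag) is what the library-based solution needs to produce, so the sketch can serve as a mental model when reading validation output.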

In [6]:
# Write your code from here

### Task 2: Real-time Monitoring of Data Quality

**Steps**:
1. Setting up Alerts for Quality Drops
    - Use the logging library to set up a basic alert on failed expectations.
    - Implement alerts using email notifications.
    - Use a dashboard such as Grafana for visual alerts.
        - Note: the example assumes integration with a monitoring system.
        - Alert setup involves creating a data source and an alert rule in Grafana.

In [7]:
# Write your code from here

### Task 3: Using AI for Data Quality Monitoring

**Steps**:
1. Basic AI Models for Monitoring
    - Train a simple anomaly detection model using Isolation Forest.
    - Write a simple custom function, based on AI logic, for outlier detection.
    - Create a monitoring function that uses a pre-trained machine learning model.
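
For the custom outlier-detection step, a rule-based baseline is a useful point of comparison for the Isolation Forest. The sketch below is not a machine learning model: it applies the interquartile-range (IQR) rule using only the standard library. The function name `iqr_outliers` and the default multiplier `k=1.5` are illustrative assumptions.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # quantiles(n=4) returns the three quartile cut points: Q1, median, Q3
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


readings = [10, 12, 11, 13, 12, 11, 10, 95]
outliers = iqr_outliers(readings)
print(outliers)
```

Running both this rule and the trained model on the same column makes it easier to judge whether the model's threshold is too strict or too lenient.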

In [8]:
%pip install pandas ydata-profiling "great-expectations<1.0" scikit-learn

Note: you may need to restart the kernel to use updated packages.


In [9]:
import pandas as pd
from ydata_profiling import ProfileReport  # pandas-profiling was renamed to ydata-profiling
import great_expectations as gx
from sklearn.ensemble import IsolationForest
import logging
import smtplib
from email.mime.text import MIMEText
import numpy as np
import time
from datetime import datetime

# Configure logging for alerts
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

# --- Task 1: Automated Data Profiling ---
print("\n--- Task 1: Automated Data Profiling ---")

# Assuming you have a CSV file named 'your_data.csv' in the same directory
try:
    df_profile = pd.read_csv('your_data.csv')
except FileNotFoundError:
    print("Error: 'your_data.csv' not found. Please create a sample CSV file for profiling.")
    df_profile = pd.DataFrame({'col1': [1, 2, 3], 'col2': ['a', 'b', 'c'], 'col3': [1.1, 2.2, 3.3]}) # Sample DataFrame

# 1. Using Pandas-Profiling
print("\n1. Using Pandas-Profiling:")
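profile = ProfileReport(
    df_profile,
    title="Pandas Profiling Report",
    explorative=True,
    # correlations takes a per-method config dict, not a boolean
    correlations={"pearson": {"calculate": True}, "spearman": {"calculate": True}},
)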
profile.to_file("data_profile.html")
print("Pandas Profiling report generated as 'data_profile.html'")

# Profile a specific subset of columns
subset_profile = ProfileReport(df_profile[['col1', 'col3']], title="Subset Profile", explorative=True)
subset_profile.to_file("subset_profile.html")
print("Subset profile generated as 'subset_profile.html' for columns 'col1' and 'col3'")

# 2. Using Great Expectations
print("\n2. Using Great Expectations:")
expectation_suite_name = "data_quality_suite"

try:
    # get_context() returns the project's file-backed context if one exists,
    # otherwise an in-memory one. The calls below follow the 0.17/0.18 fluent
    # datasource API; later major versions renamed several of them.
    context = gx.get_context()
    datasource = context.sources.add_or_update_pandas(name="pandas_source")
    data_asset = datasource.add_dataframe_asset(name="in_memory_data")
except Exception as e:
    print(f"Error setting up Great Expectations: {e}")
    context = None

def get_validator_for(df):
    """Return a validator that checks df against the saved expectation suite."""
    batch_request = data_asset.build_batch_request(dataframe=df)
    return context.get_validator(
        batch_request=batch_request,
        expectation_suite_name=expectation_suite_name,
    )

if context:
    # Create the suite (or reset it if it already exists)
    context.add_or_update_expectation_suite(expectation_suite_name=expectation_suite_name)
    validator = get_validator_for(df_profile)

    # Add multiple expectations to the suite; they are declared through the validator
    validator.expect_column_values_to_not_be_null("col1")
    validator.expect_column_values_to_be_in_set("col2", ["a", "b", "c", "d"])
    validator.expect_column_values_to_be_between("col3", min_value=1, max_value=4)
    validator.save_expectation_suite(discard_failed_expectations=False)

    print("\nValidation Results:")
    validation_result = validator.validate()
    print(validation_result)

    if not validation_result.success:
        print("Data quality issues found based on Great Expectations!")

# --- Task 2: Real-time Monitoring of Data Quality ---
print("\n--- Task 2: Real-time Monitoring of Data Quality ---")

# Dummy function to simulate data arrival
def fetch_real_time_data():
    # In a real scenario, this would fetch data from a stream or source
    time.sleep(2)
    new_data = {'col1': [4], 'col2': ['e'], 'col3': [5.0]}
    return pd.DataFrame(new_data)

# Email configuration (replace with your actual details)
SMTP_SERVER = 'your_smtp_server.com'
SMTP_PORT = 587
SMTP_USERNAME = 'your_email@example.com'
SMTP_PASSWORD = 'your_email_password'
SENDER_EMAIL = 'your_email@example.com'
RECEIVER_EMAIL = 'recipient@example.com'

def send_email_alert(subject, body):
    try:
        msg = MIMEText(body)
        msg['Subject'] = subject
        msg['From'] = SENDER_EMAIL
        msg['To'] = RECEIVER_EMAIL

        # A timeout keeps the monitoring loop from hanging on an unreachable server
        with smtplib.SMTP(SMTP_SERVER, SMTP_PORT, timeout=10) as server:
            server.starttls()
            server.login(SMTP_USERNAME, SMTP_PASSWORD)
            server.sendmail(SENDER_EMAIL, [RECEIVER_EMAIL], msg.as_string())
        logging.info(f"Email alert sent: {subject}")
    except Exception as e:
        logging.error(f"Error sending email: {e}")

def monitor_data_quality():
    print("\nStarting data quality monitoring...")
    for _ in range(3): # Simulate monitoring over a few intervals
        new_df = fetch_real_time_data()
        print(f"\nFetched new data:\n{new_df}")

        if context:
            validation_result = get_validator_for(new_df).validate()
            print("Real-time Validation Results:")
            print(validation_result)

            if not validation_result.success:
                logging.warning("Data quality check failed!")
                # Report only the expectations that failed, not the full result dump
                failed = [
                    r.expectation_config.expectation_type
                    for r in validation_result.results
                    if not r.success
                ]
                alert_subject = "Data Quality Alert - Failed Expectations"
                alert_body = (
                    f"Data at {datetime.now()} failed the following expectations:\n"
                    + "\n".join(failed)
                )
                send_email_alert(alert_subject, alert_body)
        else:
            logging.warning("Great Expectations context not initialized, skipping real-time validation.")

monitor_data_quality()

# --- Task 3: Using AI for Data Quality Monitoring ---
print("\n--- Task 3: Using AI for Data Quality Monitoring ---")

# Assuming you have some numerical data for anomaly detection
numerical_data = df_profile[['col1', 'col3']].dropna().values

if numerical_data.shape[0] > 0:
    # 1. Train a simple anomaly detection model using Isolation Forest
    print("\n1. Training Isolation Forest for anomaly detection:")
    model = IsolationForest(contamination='auto', random_state=42)
    model.fit(numerical_data)

    # Use a simple custom function based on AI logic for outlier detection (e.g., using the model's decision function)
    def detect_anomalies(data, model, threshold=-0.1): # Adjust threshold as needed
        scores = model.decision_function(data)
        anomalies = data[scores < threshold]
        return anomalies

    # Detect anomalies in the original numerical data
    anomalous_points = detect_anomalies(numerical_data, model)
    print("\nAnomalous points detected by Isolation Forest:")
    print(anomalous_points)

    # Creating a monitoring function that utilizes a pre-trained machine learning model
    def monitor_with_ai(new_numerical_data, model, threshold=-0.1):
        anomalies = detect_anomalies(new_numerical_data, model, threshold)
        if len(anomalies) > 0:
            logging.warning(f"AI-based anomaly detection found {len(anomalies)} potential issues: {anomalies}")
            alert_subject = "AI-Based Data Quality Alert - Anomalies Detected"
            alert_body = f"Potential data anomalies detected at {datetime.now()}:\n{anomalies}"
            send_email_alert(alert_subject, alert_body)
        else:
            logging.info("AI-based monitoring: No anomalies detected.")

    # Simulate monitoring with AI on new numerical data
    new_numerical_data = np.array([[6, 7.0], [1.5, 2.0], [10, 11.0]])
    print("\nMonitoring new numerical data with the AI model:")
    monitor_with_ai(new_numerical_data, model)

else:
    print("Not enough numerical data for anomaly detection.")
