### Task 1: Understanding and Defining Data Quality Metrics
**Description**: Learn how to define basic data quality metrics such as completeness, validity, and uniqueness for a simple dataset.

**Steps**:
1. Dataset: Use a CSV with columns such as `Name`, `Email`, and `Age`.
2. Metric Definitions:
    - Completeness: Percentage of non-null values.
    - Validity: Percentage of email values containing `@`.
    - Uniqueness: Number of distinct entries in the `Email` column.
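
As a minimal sketch of the three definitions above, the following computes each metric on a small in-memory table. The sample rows and the variable names (`sample`, `emails`, `validity`, `uniqueness`) are made up for illustration and are not part of the exercise dataset.

```python
import pandas as pd

# Hypothetical sample data: one missing name, one missing email,
# one email without '@', and one duplicated email.
sample = pd.DataFrame({
    "Name": ["Asha", "Ben", None, "Dev", "Asha"],
    "Email": ["asha@example.com", "ben.example.com", "cy@example.com", None, "asha@example.com"],
    "Age": [31, 25, 40, 22, 31],
})

# Completeness: fraction of non-null values in each column.
completeness = sample.notna().mean()

# Validity: fraction of non-null emails that contain '@'.
emails = sample["Email"].dropna()
validity = emails.str.contains("@", regex=False).mean()

# Uniqueness: number of distinct non-null emails.
uniqueness = emails.nunique()

print(completeness)
print(f"Validity: {validity:.0%}")
print(f"Distinct emails: {uniqueness} of {len(emails)}")
```

With this sample, `Name` and `Email` are each 80% complete, three of the four non-null emails contain `@`, and three of them are distinct.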

In [None]:
# Write your code from here

In [2]:
import pandas as pd

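# Load dataset. The file must contain Name, Email, and Age columns;
# a KeyError: 'Email' means the CSV has no column with that exact name.
df = pd.read_csv("swiggy.csv")

# Completeness: fraction of non-null values per column
completeness = df.notna().mean()

# Validity: fraction of non-null emails containing '@'
# (astype(str) guards against non-string values in the column)
valid_email_ratio = df['Email'].dropna().astype(str).str.contains('@', regex=False).mean()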

# Uniqueness: count distinct emails and ratio
unique_emails = df['Email'].nunique()
total_emails = df['Email'].count()
uniqueness_ratio = unique_emails / total_emails if total_emails > 0 else 0

print(f"Completeness:\n{completeness}")
print(f"Email Validity: {valid_email_ratio:.2%}")
print(f"Email Uniqueness: {unique_emails} unique emails out of {total_emails} ({uniqueness_ratio:.2%})")


KeyError: 'Email'

### Task 2: Calculating Data Quality Score
**Description**: Aggregate multiple metrics to calculate an overall data quality score.

**Steps**:
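1. Formula: Simple average of all metrics defined in Task 1, each expressed as a fraction between 0 and 1 (for uniqueness, distinct emails divided by non-null emails).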
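
A worked example of the formula, assuming the three metric values are already fractions between 0 and 1. The numbers and the helper name `simple_average_score` are hypothetical; a raw distinct count would first have to be divided by the number of non-null entries before it could be averaged with the other two.

```python
def simple_average_score(metrics):
    """Return the unweighted mean of metric values that each lie in [0, 1]."""
    if not metrics:
        raise ValueError("at least one metric is required")
    for name, value in metrics.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be a fraction in [0, 1], got {value}")
    return sum(metrics.values()) / len(metrics)


# Hypothetical metric values for illustration only.
metrics = {"completeness": 0.90, "validity": 0.75, "uniqueness": 0.60}
score = simple_average_score(metrics)
print(f"Overall Data Quality Score: {score:.2%}")
```

Because every metric carries the same weight, a single weak metric pulls the score down by exactly one third of its shortfall.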

In [None]:
# Write your code from here

In [3]:
# Average completeness across all columns
avg_completeness = completeness.mean()

# Simple average of the three metrics
quality_score = (avg_completeness + valid_email_ratio + uniqueness_ratio) / 3
print(f"Overall Data Quality Score: {quality_score:.2%}")


NameError: name 'valid_email_ratio' is not defined

### Task 3: Creating Expectations for a CSV
**Description**: Develop basic data quality expectations using Great Expectations.

**Steps**:
1. Expectation Suite
2. Define Expectations for Completeness

In [None]:
# Write your code from here

In [4]:
import great_expectations as ge

# NOTE: ge.from_pandas belongs to the legacy Pandas API. Newer Great
# Expectations releases no longer provide it and raise
# AttributeError: module 'great_expectations' has no attribute 'from_pandas'.
ge_df = ge.from_pandas(df)

# File that will hold the expectation suite
suite_name = "basic_data_quality_suite"
suite_path = f"{suite_name}.json"

# Completeness expectations (each call is recorded on ge_df)
for column in ["Name", "Email", "Age"]:
    ge_df.expect_column_values_to_not_be_null(column)

# Validity expectation: emails must look like local@domain.tld
ge_df.expect_column_values_to_match_regex("Email", r".+@.+\..+")

# Save every recorded expectation, including ones that currently fail
ge_df.save_expectation_suite(suite_path, discard_failed_expectations=False)


AttributeError: module 'great_expectations' has no attribute 'from_pandas'

### Task 4: Running and Validating Expectations
**Description**: Run the created expectations and generate an output report.

**Steps**:
1. Validate
2. Generate HTML Report

In [None]:
# Write your code from here


In [5]:
import html

# Validate against the expectations recorded on ge_df in the previous cell
results = ge_df.validate()

# Print summary
print(f"Validation Success: {results['success']}")
print(results)

# Generate a basic hand-written HTML summary of the results.
# html.escape keeps characters such as '<' in the result text from breaking the page.
html_report = f"""
<html>
<head><title>Validation Report</title></head>
<body>
<h1>Validation Results</h1>
<p>Success: {results['success']}</p>
<pre>{html.escape(str(results))}</pre>
</body>
</html>
"""

with open("validation_report.html", "w", encoding="utf-8") as f:
    f.write(html_report)


NameError: name 'ge_df' is not defined

### Task 5: Automating Data Quality Score Calculation
**Description**: Automate the data quality score via a script that integrates with Great
Expectations.

In [None]:
# Write your code from here


In [7]:
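def calculate_quality_score(df):
    """Average of mean completeness, email validity, and email uniqueness."""
    completeness = df.notna().mean().mean()
    emails = df['Email'].dropna().astype(str)
    total_emails = len(emails)
    # Guard against an empty Email column, where the ratios would be NaN
    valid_email = emails.str.contains('@', regex=False).mean() if total_emails > 0 else 0
    uniqueness = emails.nunique() / total_emails if total_emails > 0 else 0
    return (completeness + valid_email + uniqueness) / 3

def run_quality_check_and_validation(csv_path, suite_path="basic_data_quality_suite.json"):
    # suite_path: JSON file holding a saved expectation suite
    df = pd.read_csv(csv_path)
    score = calculate_quality_score(df)
    print(f"Data Quality Score: {score:.2%}")

    # Legacy Pandas API; validate() accepts the path to a saved suite file
    ge_df = ge.from_pandas(df)
    results = ge_df.validate(expectation_suite=suite_path)

    print(f"Validation passed: {results['success']}")
    return score, results

# Example usage (the CSV must contain an Email column, otherwise KeyError: 'Email')
score, validation_results = run_quality_check_and_validation("swiggy.csv")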


KeyError: 'Email'

### Task 6: Leveraging Data Quality Metrics for Automated Data Cleaning
**Description**: Implement a system where if data quality metrics fall below a threshold,
automated data cleaning scripts are triggered.

**Steps**:
1. Define Cleaning Logic
2. Integrate with Great Expectations:
    - Add an action to the Great Expectations action list that runs the cleaning only when the quality score falls below a threshold.
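
The threshold gate described above can be sketched without any Great Expectations machinery. This is a plain pandas stand-in, not a Great Expectations action: `mean_completeness`, `drop_incomplete_rows`, and `clean_if_below` are hypothetical names, and the scoring rule here is simply mean completeness.

```python
import pandas as pd


def mean_completeness(df):
    """Score a frame as the average fraction of non-null cells per column."""
    return float(df.notna().mean().mean())


def drop_incomplete_rows(df):
    """Toy cleaning step: keep only rows with no missing values."""
    return df.dropna().reset_index(drop=True)


def clean_if_below(df, score_fn, clean_fn, threshold=0.8):
    """Run clean_fn only when score_fn(df) is below threshold.

    Returns (frame, score_before, was_cleaned).
    """
    score = score_fn(df)
    if score < threshold:
        return clean_fn(df), score, True
    return df, score, False


# Hypothetical data: 3 of 8 cells are missing, so completeness is 62.5%.
raw = pd.DataFrame({
    "Name": ["Asha", None, "Cy", "Dev"],
    "Age": [31, None, None, 22],
})
result, before, cleaned = clean_if_below(raw, mean_completeness, drop_incomplete_rows)
print(f"score before: {before:.1%}, cleaned: {cleaned}, rows kept: {len(result)}")
```

Passing the scoring and cleaning steps in as functions keeps the gate itself independent of any particular metric or cleaning rule, so the same wrapper could sit around a stricter score later.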

In [None]:
# Write your code from here


In [8]:
def clean_data(df):
    # Example cleaning: fill missing Age with median, fill missing Name,
    # remove invalid emails. Work on a copy so the caller's frame is untouched.
    df = df.copy()
    df['Age'] = df['Age'].fillna(df['Age'].median())
    df['Name'] = df['Name'].fillna("Unknown")
    return df[df['Email'].str.contains("@", na=False)]

def automated_data_quality_pipeline(csv_path, threshold=0.8,
                                    suite_path="basic_data_quality_suite.json"):
    # suite_path: JSON file holding a saved expectation suite
    df = pd.read_csv(csv_path)
    score = calculate_quality_score(df)
    print(f"Initial Data Quality Score: {score:.2%}")

    if score < threshold:
        print(f"Data quality below {threshold:.2%}, cleaning data...")
        df = clean_data(df)
        df.to_csv("cleaned_data.csv", index=False)
        print("Data cleaned and saved as 'cleaned_data.csv'")
    else:
        print("Data quality above threshold, no cleaning needed.")

    # Validate the data as it stands now (cleaned if cleaning was triggered)
    ge_df = ge.from_pandas(df)
    validation_results = ge_df.validate(expectation_suite=suite_path)
    print(f"Validation success: {validation_results['success']}")

    return score, validation_results

# Run the pipeline. "your_data.csv" is a placeholder: replace it with the path
# to a real CSV, otherwise pd.read_csv raises FileNotFoundError.
automated_data_quality_pipeline("your_data.csv", threshold=0.8)


FileNotFoundError: [Errno 2] No such file or directory: 'your_data.csv'