### Task 1: Understanding and Defining Data Quality Metrics
**Description**: Learn how to define basic data quality metrics such as completeness, validity, and uniqueness for a simple dataset.

**Steps**:
1. Dataset: Use a CSV with columns such as `Name`, `Email`, and `Age`.
2. Metric Definitions:
    - Completeness: Percentage of non-null values in each column.
    - Validity: Percentage of `Email` values containing `@`.
    - Uniqueness: Count of distinct entries in the `Email` column.

In [1]:
# Write your code from here
import pandas as pd
import io

# Sample CSV data as a string
csv_data = """Name,Email,Age
Alice,alice@example.com,30
Bob,bob@example.com,25
Charlie,charlie@example.com,
David,,40
Eve,eve.example.com,35
Alice,alice@example.com,30
"""

# Read the CSV data into a pandas DataFrame
df = pd.read_csv(io.StringIO(csv_data))

# 1. Completeness: Percentage of non-null values for each column
completeness = df.notnull().sum() / len(df) * 100
print("Completeness:\n", completeness)

# 2. Validity (for Email column): Percentage of email fields containing '@'
# Missing emails count as invalid (na=False), so the denominator is all rows
total_emails = len(df['Email'])
valid_emails = df['Email'].str.contains('@', na=False).sum()
validity_email = (valid_emails / total_emails) * 100
print("\nEmail Validity (contains '@'):", validity_email, "%")

# 3. Uniqueness (for Email column): Count of distinct entries
# (nunique() ignores missing values by default)
unique_emails = df['Email'].nunique()
print("\nEmail Uniqueness (count of distinct emails):", unique_emails)

Completeness:
 Name     100.000000
Email     83.333333
Age       83.333333
dtype: float64

Email Validity (contains '@'): 66.66666666666666 %

Email Uniqueness (count of distinct emails): 4


### Task 2: Calculating Data Quality Score
**Description**: Aggregate multiple metrics to calculate an overall data quality score.

**Steps**:
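1. Formula: Simple average of the percentage metrics defined in Task 1 (the three completeness values and email validity). The uniqueness metric is a count, not a percentage, so it is left out of the average.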

In [2]:
# Write your code from here
# Completeness values from Task 1
completeness_name = completeness['Name']
completeness_email = completeness['Email']
completeness_age = completeness['Age']

# Email validity from Task 1: validity_email is reused as computed there.
# (The completeness values above already require the Task 1 cell to have run.)

# Calculate the average data quality score over the four percentage metrics.
# The uniqueness count is not a percentage, so it is not part of the average.
data_quality_score = (completeness_name + completeness_email + completeness_age + validity_email) / 4

print("Data Quality Score:", data_quality_score, "%")

Data Quality Score: 83.33333333333334 %


### Task 3: Creating Expectations for a CSV
**Description**: Develop basic data quality expectations using Great Expectations.

**Steps**:
1. Expectation Suite
2. Define Expectations for Completeness

In [4]:
%pip install great_expectations

Note: you may need to restart the kernel to use updated packages.


In [5]:
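# Write your code from here
import io

import pandas as pd
# PandasDataset belongs to the legacy Great Expectations Dataset API. Recent
# releases no longer ship the great_expectations.dataset module, so this
# import raises ModuleNotFoundError there.
from great_expectations.dataset import PandasDataset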

# Our sample CSV data again
csv_data = """Name,Email,Age
Alice,alice@example.com,30
Bob,bob@example.com,25
Charlie,charlie@example.com,
David,,40
Eve,eve.example.com,35
Alice,alice@example.com,30
"""

# Read the CSV data into a pandas DataFrame
df = pd.read_csv(io.StringIO(csv_data))

# Create a Great Expectations dataset from the pandas DataFrame
ge_df = PandasDataset(df)

# Define Expectations for Completeness

# Expect 'Name' column to have no missing values (100% complete)
ge_df.expect_column_values_to_not_be_null(column="Name")

# Expect 'Email' column to have at least 80% completeness
# (mostly is the minimum fraction of rows that must be non-null)
ge_df.expect_column_values_to_not_be_null(column="Email", mostly=0.8)

# Expect 'Age' column to have at least 70% completeness
ge_df.expect_column_values_to_not_be_null(column="Age", mostly=0.7)

# You can add more expectations here for other quality aspects like validity and uniqueness

# To see the defined expectations (optional); failed expectations are kept too
print(ge_df.get_expectation_suite(discard_failed_expectations=False))

# To validate the data against the defined expectations (optional for this task, but good to know)
validation_results = ge_df.validate()
print("\nValidation Results:\n", validation_results)

ModuleNotFoundError: No module named 'great_expectations.dataset'
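
When the legacy `great_expectations.dataset` module is not available, the same completeness thresholds (100% for `Name`, 80% for `Email`, 70% for `Age`) can still be checked directly. The following is a minimal pandas-only sketch, not a Great Expectations replacement; the `min_completeness` and `results` names are made up for illustration.

```python
import io

import pandas as pd

csv_data = """Name,Email,Age
Alice,alice@example.com,30
Bob,bob@example.com,25
Charlie,charlie@example.com,
David,,40
Eve,eve.example.com,35
Alice,alice@example.com,30
"""
df = pd.read_csv(io.StringIO(csv_data))

# Hypothetical mapping: minimum fraction of non-null values expected per column
min_completeness = {"Name": 1.0, "Email": 0.8, "Age": 0.7}

# Observed fraction of non-null values per column
observed = df.notnull().mean()

# True where the column meets its completeness threshold
results = {
    column: bool(observed[column] >= minimum)
    for column, minimum in min_completeness.items()
}

for column, passed in results.items():
    status = "PASS" if passed else "FAIL"
    print(f"{column}: {observed[column]:.1%} non-null -> {status}")
```

On the sample data all three checks pass, since `Email` and `Age` are each 5/6 complete.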

### Task 4: Running and Validating Expectations
**Description**: Run the created expectations and generate an output report.

**Steps**:
1. Validate
2. Generate HTML Report

In [6]:
# Write your code from here
# Continue from Task 3 (ge_df must have been created successfully in that cell)

# Define an expectation for Email validity (contains '@')
ge_df.expect_column_values_to_match_regex(
    column="Email", regex=r".+@.+"
)

# Define an expectation for Email uniqueness
ge_df.expect_column_values_to_be_unique(column="Email")

# 1. Validate the data against the defined expectations
validation_results = ge_df.validate()
print("Validation Results:\n", validation_results)

# 2. Generate an HTML Report (basic, without full Great Expectations setup)
# For a more comprehensive HTML report, you'd typically initialize a DataContext
# in Great Expectations. However, for this basic exercise, we can extract
# the validation results and create a simple HTML output.

html_report = """
<!DOCTYPE html>
<html>
<head>
    <title>Data Quality Validation Report</title>
</head>
<body>
    <h1>Data Quality Validation Report</h1>
    <table>
        <tr>
            <th>Expectation</th>
            <th>Success</th>
            <th>Details</th>
        </tr>
"""

for result in validation_results['results']:
    expectation = result['expectation_config']['expectation_type']
    column = result['expectation_config'].get('kwargs', {}).get('column', 'N/A')
    success = result['success']
    details = result.get('result', {}).get('unexpected_percent')
    # unexpected_percent may be absent or None; only format real numbers
    if details is None:
        details = 'N/A'
    else:
        details = f"{details:.2f}% unexpected"

    html_report += f"""
        <tr>
            <td>{expectation} (Column: {column})</td>
            <td>{'Yes' if success else 'No'}</td>
            <td>{details}</td>
        </tr>
    """

html_report += """
    </table>
</body>
</html>
"""

# Save the HTML report to a file
with open("data_quality_report.html", "w") as f:
    f.write(html_report)

print("\nBasic HTML report 'data_quality_report.html' generated.")

# Note: For a more feature-rich and automated HTML report, you would typically
# initialize a Great Expectations DataContext and configure data docs.

NameError: name 'ge_df' is not defined

### Task 5: Automating Data Quality Score Calculation
**Description**: Automate the data quality score via a script that integrates with Great
Expectations.
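
One possible starting point is to wrap the Task 1 and Task 2 calculations in a reusable function. The sketch below uses pandas only, because the Great Expectations API differs between releases; it does not call Great Expectations itself. The function name `data_quality_score` and its `email_column` parameter are assumptions made for illustration.

```python
import io

import pandas as pd


def data_quality_score(df: pd.DataFrame, email_column: str = "Email") -> float:
    """Average of per-column completeness and email validity, in percent."""
    # Completeness: percentage of non-null values in each column
    completeness = df.notnull().mean() * 100
    # Validity: percentage of all rows whose email contains '@'
    # (missing emails count as invalid)
    validity = df[email_column].str.contains("@", na=False).mean() * 100
    metrics = list(completeness) + [validity]
    return sum(metrics) / len(metrics)


csv_data = """Name,Email,Age
Alice,alice@example.com,30
Bob,bob@example.com,25
Charlie,charlie@example.com,
David,,40
Eve,eve.example.com,35
Alice,alice@example.com,30
"""
df = pd.read_csv(io.StringIO(csv_data))

score = data_quality_score(df)
print(f"Data Quality Score: {score:.2f} %")
```

For the sample data this reproduces the Task 2 score of about 83.33%. In a script, the same function could be called on each newly loaded file.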

In [None]:
# Write your code from here


### Task 6: Leveraging Data Quality Metrics for Automated Data Cleaning
**Description**: Implement a system where if data quality metrics fall below a threshold,
automated data cleaning scripts are triggered.

**Steps**:
1. Define Cleaning Logic
2. Integrate with Great Expectations:
    - Use an action within the Great Expectations action list that only triggers if quality score is below a threshold, automating the cleaning.
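
The threshold-triggered flow can be sketched without Great Expectations as a plain function; wiring it into a Great Expectations action list is not shown here. Everything in this sketch is an assumption for illustration: the 90% threshold, the function names, and the cleaning rules (drop duplicate rows, drop rows without a valid email, fill missing ages with the median).

```python
import io

import pandas as pd

QUALITY_THRESHOLD = 90.0  # hypothetical threshold, in percent


def quality_score(df: pd.DataFrame) -> float:
    """Average of per-column completeness and email validity, in percent."""
    completeness = df.notnull().mean() * 100
    validity = df["Email"].str.contains("@", na=False).mean() * 100
    metrics = list(completeness) + [validity]
    return sum(metrics) / len(metrics)


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical cleaning rules for the sample dataset."""
    cleaned = df.drop_duplicates()
    # Keep only rows whose email is present and contains '@'
    cleaned = cleaned[cleaned["Email"].str.contains("@", na=False)].copy()
    # Fill missing ages with the median of the remaining ages
    cleaned["Age"] = cleaned["Age"].fillna(cleaned["Age"].median())
    return cleaned.reset_index(drop=True)


def clean_if_needed(df: pd.DataFrame, threshold: float = QUALITY_THRESHOLD):
    """Return (data, score); cleaning runs only when the score is below threshold."""
    score = quality_score(df)
    if score >= threshold:
        return df, score
    cleaned = clean(df)
    return cleaned, quality_score(cleaned)


csv_data = """Name,Email,Age
Alice,alice@example.com,30
Bob,bob@example.com,25
Charlie,charlie@example.com,
David,,40
Eve,eve.example.com,35
Alice,alice@example.com,30
"""
df = pd.read_csv(io.StringIO(csv_data))

score_before = quality_score(df)
cleaned_df, score_after = clean_if_needed(df)
print(f"Score before: {score_before:.2f} %")
print(f"Score after:  {score_after:.2f} %")
print(cleaned_df)
```

Dropping rows is a lossy choice: the sample shrinks from six rows to three. A real pipeline might instead quarantine the rejected rows for review.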

In [None]:
# Write your code from here
