## Automated Data Quality Monitoring
**Objective**: Use Great Expectations to perform data profiling and write validation rules.

1. Data Profiling with Great Expectations

### Profile a JSON dataset with product sales data to check for null values in the 'ProductID' and 'Price' fields.
- Create an expectation suite and connect it to the data context.
- Use the `expect_column_values_to_not_be_null` expectation to profile these fields.
- Review the summary to identify any unexpected null values.
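
Before involving Great Expectations, it can help to see what a not-null check amounts to. The following is a minimal plain-Python sketch, not the library's implementation: the sample records and the helper `count_nulls` are hypothetical, and `None` stands in for a JSON `null`.

```python
# Hypothetical sample records; None stands in for a JSON null.
records = [
    {"ProductID": "P001", "Price": 19.99},
    {"ProductID": None, "Price": 5.50},
    {"ProductID": "P003", "Price": None},
    {"ProductID": "P004", "Price": 12.00},
]


def count_nulls(rows, column):
    """Count rows where `column` is missing or None (hypothetical helper)."""
    return sum(1 for row in rows if row.get(column) is None)


# One count per field that the profiling task cares about.
null_counts = {col: count_nulls(records, col) for col in ("ProductID", "Price")}
for col, n in null_counts.items():
    status = "OK" if n == 0 else f"{n} null value(s)"
    print(f"{col}: {status}")
```

An expectation suite adds to this bare count a stored, reusable definition of the rule and a structured validation report.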

In [2]:
import great_expectations as gx
import json

# 1. Create an Expectation Suite and connect it to the Data Context
# gx.DataContext() is not available in recent releases; get_context() returns the Data Context
context = gx.get_context()

expectation_suite_name = "product_sales_null_check_suite"
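# add_or_update_expectation_suite() replaces the deprecated create_expectation_suite()
# in the 0.17/0.18 releases that also provide context.sources
suite = context.add_or_update_expectation_suite(
    expectation_suite_name=expectation_suite_name,
)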

# Path to the JSON file; use a full path if it is not in the working directory.
# The file is expected to hold a list of records (one JSON object per sale).
json_file_path = "product_sales.json"  # Replace with your actual JSON file name

# Load the JSON data using Python's built-in json library
with open(json_file_path, 'r') as f:
    data = json.load(f)

# Great Expectations can work with various backends. For a quick check on in-memory data,
# we can use a Pandas DataFrame.
import pandas as pd
df = pd.DataFrame(data)

# Add a Pandas DataFrame Data Source
datasource_name = "my_pandas_datasource"
datasource = context.sources.add_pandas(name=datasource_name)

# Create a Data Asset from the Pandas DataFrame
data_asset_name = "product_sales_data"
data_asset = datasource.add_dataframe_asset(name=data_asset_name)

batch_request = data_asset.build_batch_request(dataframe=df)

validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite=suite,  # Use the suite we just created
)

print(f"Created validator for data asset: {validator.active_batch.data_asset.name}")
print(f"Using Expectation Suite: {validator.expectation_suite.name}")

# 2. Use the expect_column_values_to_not_be_null expectation to profile these fields.
validator.expect_column_values_to_not_be_null(column="ProductID")
validator.expect_column_values_to_not_be_null(column="Price")

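# Save the Expectation Suite. By default, expectations that failed on this batch
# are dropped on save, so keep them explicitly or the null checks may disappear.
validator.save_expectation_suite(discard_failed_expectations=False)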

# 3. Review the summary to identify any unexpected null values.
# To see the results, you'll typically run a Checkpoint. Let's configure and run one.

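checkpoint_name = "product_sales_null_check_checkpoint"
# run_checkpoint() can only run a Checkpoint that already exists,
# so register the Checkpoint first and then run it.
checkpoint = context.add_or_update_checkpoint(
    name=checkpoint_name,
    validations=[
        {
            "batch_request": batch_request,
            "expectation_suite_name": expectation_suite_name,
        }
    ],
)
checkpoint_result = checkpoint.run()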

# Print the validation results
print("\nValidation Results:")
validation_result = checkpoint_result.list_validation_results()[0]
for expectation_result in validation_result["results"]:
    config = expectation_result["expectation_config"]
    column = config["kwargs"].get("column")
    is_null_check = config["expectation_type"] == "expect_column_values_to_not_be_null"
    if is_null_check and column in ["ProductID", "Price"]:
        print(f"Column '{column}': {expectation_result['success']}")
        if not expectation_result["success"]:
            result = expectation_result["result"]
            print(f"  - Unexpected null values found: {result.get('unexpected_count', 'N/A')}")
            if "partial_unexpected_list" in result:
                print(f"  - Partial list of unexpected nulls: {result['partial_unexpected_list']}")

# You can also view a more detailed report in the Data Docs:
print("\nTo view the detailed validation report in Data Docs:")
print("- Navigate to your Great Expectations Data Context directory.")
print("- Run the command: `great_expectations docs build`")
# The f prefix is required here so the two names are interpolated
print(f"- Open the generated `index.html` file and find the results for the '{checkpoint_name}' Checkpoint and the '{expectation_suite_name}' Expectation Suite.")

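Note: on recent Great Expectations releases, calling `gx.DataContext()` fails with `AttributeError: module 'great_expectations' has no attribute 'DataContext'`. Use `gx.get_context()` to obtain the Data Context instead.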
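
The cell above assumes that `product_sales.json` already exists and holds a list of records. For experimenting without real data, a sample file could be generated along these lines. This is only a sketch: the records, the `Quantity` field, and the temporary location are all made up for illustration.

```python
import json
import os
import tempfile

# Hypothetical records; only ProductID and Price come from the exercise,
# the Quantity field is invented for illustration.
sample_records = [
    {"ProductID": "P001", "Price": 19.99, "Quantity": 3},
    {"ProductID": "P002", "Price": None, "Quantity": 1},
    {"ProductID": None, "Price": 4.25, "Quantity": 7},
]

# Write into a temporary directory so an existing product_sales.json is not overwritten.
sample_path = os.path.join(tempfile.mkdtemp(), "product_sales.json")
with open(sample_path, "w") as f:
    json.dump(sample_records, f, indent=2)

# Read the file back the same way the notebook cell does.
with open(sample_path) as f:
    loaded = json.load(f)

print(f"Wrote {len(loaded)} records to {sample_path}")
```

Pointing `json_file_path` at `sample_path` would then exercise both null checks, since the sample contains one null in each field.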

2. Writing Validation Rules for Data Ingestion

### Define validation rules for an API data source to confirm that 'Status' field contains only predefined statuses ('Active', 'Inactive').

- Apply `expect_column_values_to_be_in_set` to check field values during data ingestion.
- Execute the validation and review any mismatches.
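
As a rough illustration of what a set-membership rule with a tolerance evaluates, here is a plain-Python sketch. It is not Great Expectations code: `check_values_in_set` is a hypothetical helper, and the way it skips `None` values and applies `mostly` reflects an assumption about the expectation's behavior rather than a statement of the library's exact semantics. The statuses are the six used in the simulated API data in this section.

```python
def check_values_in_set(values, allowed, mostly=1.0):
    """Return (success, unexpected) for a membership check (hypothetical helper).

    None values are skipped; success means the share of the remaining
    values found in `allowed` is at least `mostly`.
    """
    non_null = [v for v in values if v is not None]
    unexpected = [v for v in non_null if v not in allowed]
    if not non_null:
        return True, unexpected
    valid_share = 1 - len(unexpected) / len(non_null)
    return valid_share >= mostly, unexpected


statuses = ["Active", "Inactive", "Pending", "Active", "Unknown", "Inactive"]
allowed_set = {"Active", "Inactive"}

# Strict check: every value must be in the allowed set.
strict_ok, mismatches = check_values_in_set(statuses, allowed_set)
# Lenient check: tolerate up to 40% of values outside the set.
lenient_ok, _ = check_values_in_set(statuses, allowed_set, mostly=0.6)

print(f"strict: {strict_ok}, mismatches: {mismatches}")
print(f"lenient (mostly=0.6): {lenient_ok}")
```

With `mostly=1.0` a single unexpected status fails the rule, which is usually what an ingestion gate wants. A lower threshold trades strictness for tolerance of noisy sources.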

In [None]:
import great_expectations as gx
import pandas as pd

# 1. Get a Data Context (gx.DataContext() is not available in recent releases)
context = gx.get_context()

# 2. Define the Expectation Suite name
expectation_suite_name = "api_status_validation_suite"

# Create the Expectation Suite, or replace it if it already exists.
# add_or_update_expectation_suite() covers both cases, so no try/except is needed.
suite = context.add_or_update_expectation_suite(
    expectation_suite_name=expectation_suite_name,
)
print(f"Expectation Suite ready: {expectation_suite_name}")

# 3. Simulate fetching data from an API (replace with your actual API interaction)
api_data = [
    {"UserID": 1, "Name": "Alice", "Status": "Active"},
    {"UserID": 2, "Name": "Bob", "Status": "Inactive"},
    {"UserID": 3, "Name": "Charlie", "Status": "Pending"},
    {"UserID": 4, "Name": "David", "Status": "Active"},
    {"UserID": 5, "Name": "Eve", "Status": "Unknown"},
    {"UserID": 6, "Name": "Frank", "Status": "Inactive"},
]

# Convert the API data to a Pandas DataFrame (a common way to work with data in GE)
df = pd.DataFrame(api_data)

# 4. Add a Pandas DataFrame Data Source and Data Asset
datasource_name = "api_data_source"
datasource = context.sources.add_pandas(name=datasource_name)

data_asset_name = "api_data"
data_asset = datasource.add_dataframe_asset(name=data_asset_name)

batch_request = data_asset.build_batch_request(dataframe=df)

# 5. Get a Validator
validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite=suite,
)

print(f"Using Expectation Suite: {validator.expectation_suite.name}")

# 6. Apply the expect_column_values_to_be_in_set expectation
allowed_statuses = ["Active", "Inactive"]
validator.expect_column_values_to_be_in_set(
    column="Status",
    value_set=allowed_statuses,
    mostly=1.0,  # Optional: You can adjust this if a certain percentage of invalid values is acceptable
)

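# 7. Save the Expectation Suite. The sample data violates the rule, and failed
# expectations are dropped on save by default, so keep them explicitly.
validator.save_expectation_suite(discard_failed_expectations=False)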

# 8. Execute the validation and review any mismatches using a Checkpoint
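checkpoint_name = "api_data_validation_checkpoint"
# run_checkpoint() can only run a Checkpoint that already exists,
# so register the Checkpoint first and then run it.
checkpoint = context.add_or_update_checkpoint(
    name=checkpoint_name,
    validations=[
        {
            "batch_request": batch_request,
            "expectation_suite_name": expectation_suite_name,
        }
    ],
)
checkpoint_result = checkpoint.run()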

# 9. Review the validation results
print("\nValidation Results:")
validation_result = checkpoint_result.list_validation_results()[0]
for expectation_result in validation_result["results"]:
    config = expectation_result["expectation_config"]
    column = config["kwargs"].get("column")
    is_set_check = config["expectation_type"] == "expect_column_values_to_be_in_set"
    if is_set_check and column == "Status":
        print(f"Expectation for 'Status' column: {expectation_result['success']}")
        if not expectation_result["success"]:
            result = expectation_result["result"]
            print(f"  - Unexpected values found: {result.get('unexpected_count', 'N/A')}")
            if "partial_unexpected_list" in result:
                print(f"  - Partial list of unexpected values: {result['partial_unexpected_list']}")

# 10. Optionally, view the detailed report in Data Docs
print("\nTo view the detailed validation report in Data Docs:")
print("- Navigate to your Great Expectations Data Context directory.")
print("- Run the command: `great_expectations docs build`")
# The f prefix is required here so the two names are interpolated
print(f"- Open the generated `index.html` file and find the results for the '{checkpoint_name}' Checkpoint and the '{expectation_suite_name}' Expectation Suite.")
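
Both exercises walk the validation results in the same way, so the reporting could be factored into one small helper. The sketch below is plain Python and works on ordinary dictionaries: `summarize_failures` is a hypothetical function, and `example_results` is hand-written data that mirrors only the keys read in the loops above, not real library output.

```python
def summarize_failures(results):
    """Return one summary line per failed expectation (hypothetical helper)."""
    lines = []
    for item in results:
        if item["success"]:
            continue
        config = item["expectation_config"]
        column = config["kwargs"].get("column", "<table>")
        count = item["result"].get("unexpected_count", "N/A")
        lines.append(f"{config['expectation_type']} on '{column}': {count} unexpected")
    return lines


# Hand-written results shaped like the keys used in this notebook (illustrative only).
example_results = [
    {
        "success": True,
        "expectation_config": {
            "expectation_type": "expect_column_values_to_not_be_null",
            "kwargs": {"column": "ProductID"},
        },
        "result": {"unexpected_count": 0},
    },
    {
        "success": False,
        "expectation_config": {
            "expectation_type": "expect_column_values_to_be_in_set",
            "kwargs": {"column": "Status", "value_set": ["Active", "Inactive"]},
        },
        "result": {
            "unexpected_count": 2,
            "partial_unexpected_list": ["Pending", "Unknown"],
        },
    },
]

for line in summarize_failures(example_results):
    print(line)
```

Reporting only the failures keeps the output short when a suite grows beyond a handful of expectations.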