### Task 1: Data Profiling to Understand Data Quality
**Description**: Use basic statistical methods to profile a dataset and identify potential quality issues.

**Steps**:
1. Load the dataset using pandas in Python.
2. Understand the data by checking its basic statistics.
3. Identify null values.
4. Check unique values for categorical columns.
5. Review outliers using box plots.

In [1]:
# write your code from here

### Task 2: Implement Simple Data Validation
**Description**: Write a Python script to validate the data types and constraints of each column in a dataset.

**Steps**:
1. Define constraints for each column.
2. Validate each column based on its constraints.

In [2]:
# write your code from here

### Task 3: Detect Missing Data Patterns
**Description**: Analyze and visualize missing data patterns in a dataset.

**Steps**:
1. Visualize missing data using a heatmap.
2. Identify patterns in missing data.

In [3]:
# write your code from here

### Task 4: Integrate Automated Data Quality Checks
**Description**: Integrate automated data quality checks using the Great Expectations library for a dataset.

**Steps**:
1. Install and initialize Great Expectations.
2. Set up Great Expectations.
3. Add further checks and validate.

In [None]:
# Task 1: Data Profiling to Understand Data Quality
# Description: Use basic statistical methods to profile a dataset and identify potential quality issues.
# Steps:
# 1. Load the dataset using pandas in Python.
# 2. Understand the data by checking its basic statistics.
# 3. Identify null values.
# 4. Check unique values for categorical columns.
# 5. Review outliers using box plots.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# 1. Load the dataset
try:
    df = pd.read_csv('your_dataset.csv')  # Replace 'your_dataset.csv' with the actual file name
    print("Dataset loaded successfully:")
    print(df.head())
except FileNotFoundError:
    print("Error: The file 'your_dataset.csv' was not found. Please make sure the file is in the correct directory.")
    exit()

# 2. Basic statistics
print("\nBasic statistics of the dataset:")
print(df.describe(include='all'))

# 3. Identify null values
print("\nNull values in each column:")
print(df.isnull().sum())

# 4. Check unique values for categorical columns
print("\nUnique values for categorical columns:")
for column in df.select_dtypes(include='object').columns:
    print(f"\nColumn: {column}")
    print(df[column].unique())
    print(f"Number of unique values: {df[column].nunique()}")

# 5. Review outliers using box plots
print("\nBox plots to visualize outliers:")
for column in df.select_dtypes(include=np.number).columns:
    plt.figure(figsize=(8, 6))
    sns.boxplot(x=df[column])
    plt.title(f'Box plot of {column}')
    plt.xlabel(column)
    plt.show()

# Task 2: Implement Simple Data Validation
# Description: Write a Python script to validate the data types and constraints of each column in a dataset.
# Steps:
# 1. Define constraints for each column.
# 2. Validate each column based on its constraints.

import pandas as pd

# Assuming the same DataFrame 'df' from Task 1

# 1. Define constraints for each column
constraints = {
    'column_name_1': {'dtype': 'int64', 'min': 0, 'max': 100},
    'column_name_2': {'dtype': 'object', 'allowed_values': ['A', 'B', 'C']},
    'column_name_3': {'dtype': 'float64', 'min': 1.0, 'max': 5.0},
    # Add constraints for other columns as needed
}

# 2. Validate each column based on its constraints
def validate_data(df, constraints):
    validation_errors = {}
    for column, rules in constraints.items():
        if column not in df.columns:
            validation_errors[column] = "Column not found"
            continue

        errors = []
        if 'dtype' in rules:
            if df[column].dtype != rules['dtype']:
                errors.append(f"Data type mismatch: expected {rules['dtype']}, got {df[column].dtype}")

        if 'min' in rules:
            if df[column].dtype in ['int64', 'float64']:
                if any(df[column] < rules['min']):
                    errors.append(f"Values below minimum: {rules['min']}")
            else:
                errors.append("Cannot check minimum for non-numeric data type")

        if 'max' in rules:
            if df[column].dtype in ['int64', 'float64']:
                if any(df[column] > rules['max']):
                    errors.append(f"Values above maximum: {rules['max']}")
            else:
                errors.append("Cannot check maximum for non-numeric data type")

        if 'allowed_values' in rules:
            if df[column].dtype == 'object':
                invalid_values = df[column][~df[column].isin(rules['allowed_values'])].unique()
                if len(invalid_values) > 0:
                    errors.append(f"Invalid values: {invalid_values}")
            else:
                errors.append("Cannot check allowed values for non-object data type")

        if errors:
            validation_errors[column] = errors

    return validation_errors

validation_results = validate_data(df, constraints)

if validation_results:
    print("\nData Validation Errors:")
    for column, errors in validation_results.items():
        print(f"Column '{column}':")
        for error in errors:
            print(f"- {error}")
else:
    print("\nData validation successful. No errors found.")

# Task 3: Detect Missing Data Patterns
# Description: Analyze and visualize missing data patterns in a dataset.
# Steps:
# 1. Visualize missing data using a heatmap.
# 2. Identify patterns in missing data.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Assuming the same DataFrame 'df' from Task 1

# 1. Visualize missing data using a heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Data Heatmap')
plt.show()

# 2. Identify patterns in missing data
missing_counts = df.isnull().sum().sort_values(ascending=False)
print("\nNumber of missing values per column:")
print(missing_counts)

total_missing = df.isnull().sum().sum()
total_cells = df.size
percent_missing = (total_missing / total_cells) * 100
print(f"\nTotal missing values: {total_missing}")
print(f"Percentage of missing data: {percent_missing:.2f}%")

# Investigate if missing values in one column are related to missing values in another
missing_matrix = df.isnull()
missing_pairs = {}
for col1 in df.columns:
    for col2 in df.columns:
        if col1 != col2:
            both_missing = missing_matrix[col1] & missing_matrix[col2]
            if both_missing.sum() > 0:
                key = tuple(sorted((col1, col2)))
                if key not in missing_pairs:
                    missing_pairs[key] = both_missing.sum()
print("\nColumns with co-occurring missing values:")
print(missing_pairs)

# Task 4: Integrate Automated Data Quality Checks
# Description: Integrate automated data quality checks using the Great Expectations library for a dataset.
# Steps:
# 1. Install and initialize Great Expectations.
# 2. Set up Great Expectations.
# 3. Add further checks and validate.

# Note: This task requires the Great Expectations library to be installed.
# You can install it using: pip install great_expectations

# The following code provides a basic structure. Running it fully requires
# setting up a Great Expectations project, which involves more steps than can
# be included in a simple code snippet.

# 1. Install and initialize Great Expectations (command-line steps usually)
#    - Run: great_expectations init

# 2. Set up Great Expectations (involves configuring data sources and expectations)
#    - This typically involves editing configuration files created by the init command.
#    - You would define a data source pointing to your 'your_dataset.csv' file.
#    - You would then create an Expectation Suite, which is a collection of data quality checks.

# 3. Add further checks and validate (using Python code)
# Assuming you have initialized Great Expectations and have a DataContext
# and an Expectation Suite set up.

# from great_expectations.data_context import DataContext

# # Load your DataContext (replace with your actual project directory)
# context = DataContext("./great_expectations")

# # Get a validator for your data (assuming a Pandas DataFrame datasource named 'pandas_default' and a data asset named 'my_data')
# validator = context.get_validator(
#     datasource_name="pandas_default",
#     data_asset_name="your_dataset", # You might need to configure this in GE
# )

# # Add expectations (data quality checks) to the validator
# validator.expect_column_to_exist("column_name_1")
# validator.expect_column_values_to_be_of_type("column_name_1", "INTEGER")
# validator.expect_column_values_to_be_between("column_name_1", min_value=0, max_value=100)
# validator.expect_column_values_to_not_be_null("column_name_2")
# validator.expect_column_values_to_be_in_set("column_name_2", ["A", "B", "C"])
# # Add more expectations for other columns

# # Validate the data against the expectations
# validation_results = validator.validate()

# # Print the validation results
# print("\nGreat Expectations Validation Results:")
# print(validation_results)

# # You can then use the validation_results to determine if your data meets the defined quality standards.
# if validation_results["success"]:
#     print("Data quality checks passed!")
# else:
#     print("Data quality checks failed. See the validation results for details.")

print("\nFor Task 4, please ensure you have the Great Expectations library installed and initialized.")
print("Refer to the Great Expectations documentation for detailed steps on setting up data sources, expectation suites, and running validations.")
print("A basic initialization can be done via the command line: `great_expectations init`")
print("Then, you would typically edit the `great_expectations.yml` file and use Python code to define and run expectations.")

import numpy as np

Error: The file 'your_dataset.csv' was not found. Please make sure the file is in the correct directory.

Basic statistics of the dataset:


NameError: name 'df' is not defined

: 