### Task 1: Data Profiling to Understand Data Quality
**Description**: Use basic statistical methods to profile a dataset and identify potential quality issues.

**Steps**:
1. Load the dataset using pandas in Python.
2. Understand the data by checking its basic statistics.
3. Identify null values.
4. Check unique values for categorical columns.
5. Review outliers using box plots.

In [None]:
# write your code from here

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Step 1: Load the dataset
df = pd.read_csv('your_dataset.csv')  # Replace with your actual CSV file path

# Step 2: Basic statistics
print("Basic Statistics:\n", df.describe(include='all'))

# Step 3: Identify null values
print("\nNull Values:\n", df.isnull().sum())

# Step 4: Unique values for categorical columns
categorical_columns = df.select_dtypes(include=['object', 'category']).columns
for col in categorical_columns:
    print(f"\nUnique values in '{col}':\n", df[col].unique())

# Step 5: Review outliers using box plots
numerical_columns = df.select_dtypes(include='number').columns  # all numeric dtypes, not only int64/float64
for col in numerical_columns:
    plt.figure(figsize=(6, 4))
    sns.boxplot(x=df[col])
    plt.title(f'Box plot of {col}')
    plt.show()


FileNotFoundError: [Errno 2] No such file or directory: 'your_dataset.csv'
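
The cell above cannot run until `your_dataset.csv` is replaced with a real file, so the following is a minimal sketch of the same profiling steps on a small in-memory DataFrame. The `sample` data, its column names, and the `profile` helper are hypothetical and only serve as an illustration. Instead of drawing box plots, it counts values outside 1.5 times the interquartile range, which is the rule box-plot whiskers use by default.

```python
import pandas as pd

# Hypothetical sample data standing in for your_dataset.csv
sample = pd.DataFrame({
    "age": [25, 31, 47, 52, 38, 29, 120],
    "city": ["Pune", "Delhi", None, "Pune", "Delhi", "Mumbai", "Pune"],
    "salary": [42000.0, 51000.0, 58000.0, None, 49500.0, 45000.0, 400000.0],
})

def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize dtype, null count, distinct count, and IQR outliers per column."""
    rows = []
    for col in df.columns:
        series = df[col]
        outliers = None  # stays empty for non-numeric columns
        if pd.api.types.is_numeric_dtype(series):
            q1, q3 = series.quantile(0.25), series.quantile(0.75)
            iqr = q3 - q1
            # Flag values beyond 1.5 * IQR from the quartiles
            mask = (series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)
            outliers = int(mask.sum())
        rows.append({
            "column": col,
            "dtype": str(series.dtype),
            "nulls": int(series.isnull().sum()),
            "unique": int(series.nunique()),  # distinct non-null values
            "iqr_outliers": outliers,
        })
    return pd.DataFrame(rows).set_index("column")

report = profile(sample)
print(report)
```

Calling `profile(df)` on the real dataset should give a one-row-per-column overview that can be read before looking at the individual plots.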

### Task 2: Implement Simple Data Validation
**Description**: Write a Python script to validate the data types and constraints of each column in a dataset.

**Steps**:
1. Define constraints for each column.
2. Validate each column based on its constraints.

In [None]:
# write your code from here

In [2]:
# Define constraints
constraints = {
    'age': {'type': 'int', 'min': 0, 'max': 120},
    'name': {'type': 'str'},
    'email': {'type': 'str'},
    'salary': {'type': 'float', 'min': 0},
}

# Validate function
def validate_data(df, constraints):
    errors = []
    for column, rules in constraints.items():
        if column in df.columns:
            col_data = df[column]
            if rules['type'] == 'int':
                if not pd.api.types.is_integer_dtype(col_data):
                    errors.append(f"Column '{column}' should be integer.")
            elif rules['type'] == 'float':
                if not pd.api.types.is_float_dtype(col_data):
                    errors.append(f"Column '{column}' should be float.")
            elif rules['type'] == 'str':
                if not pd.api.types.is_string_dtype(col_data):
                    errors.append(f"Column '{column}' should be string.")

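            # Range checks only make sense for numeric data; skipping them otherwise
            # avoids a TypeError when a column has the wrong type.
            if pd.api.types.is_numeric_dtype(col_data):
                if 'min' in rules and (col_data < rules['min']).any():
                    errors.append(f"Column '{column}' has values below {rules['min']}.")

                if 'max' in rules and (col_data > rules['max']).any():
                    errors.append(f"Column '{column}' has values above {rules['max']}.")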
        else:
            errors.append(f"Column '{column}' not found in dataset.")
    return errors

validation_errors = validate_data(df, constraints)
if validation_errors:
    print("Validation Errors:")
    for err in validation_errors:
        print("-", err)
else:
    print("All columns passed validation.")


NameError: name 'df' is not defined
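
The validation function above reports only that a column violates a constraint. As a possible extension, the sketch below returns the index labels of the offending rows so they can be inspected. The `sample` data and the `out_of_range_rows` helper are hypothetical; the bounds mirror the `age` and `salary` constraints defined above.

```python
import pandas as pd

# Hypothetical sample rows; column names mirror the constraints above
sample = pd.DataFrame({
    "age": [34, -2, 130, 45],
    "salary": [52000.0, 61000.0, -100.0, 48000.0],
})

def out_of_range_rows(df, column, minimum=None, maximum=None):
    """Return the index labels of rows whose value lies outside [minimum, maximum]."""
    # Non-numeric entries become NaN and are never flagged by the comparisons below
    values = pd.to_numeric(df[column], errors="coerce")
    bad = pd.Series(False, index=df.index)
    if minimum is not None:
        bad |= values < minimum
    if maximum is not None:
        bad |= values > maximum
    return df.index[bad].tolist()

bad_age = out_of_range_rows(sample, "age", minimum=0, maximum=120)
bad_salary = out_of_range_rows(sample, "salary", minimum=0)
print("Rows with invalid age:", bad_age)
print("Rows with invalid salary:", bad_salary)
```

Passing the returned labels to `sample.loc[...]` shows the full records behind each violation.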

### Task 3: Detect Missing Data Patterns
**Description**: Analyze and visualize missing data patterns in a dataset.

**Steps**:
1. Visualize missing data using a heatmap.
2. Identify patterns in missing data.

In [None]:
# write your code from here

In [3]:
import seaborn as sns
import matplotlib.pyplot as plt

# Step 1: Visualize missing data
plt.figure(figsize=(12, 6))
sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
plt.title("Missing Data Heatmap")
plt.show()

# Step 2: Identify missing patterns
missing_pattern = df.isnull().sum()
print("Missing Values per Column:\n", missing_pattern)

# Optional: Visualize missing data patterns with a matrix
import missingno as msno
msno.matrix(df)
plt.show()

msno.heatmap(df)
plt.show()


NameError: name 'df' is not defined

<Figure size 1200x600 with 0 Axes>
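
Per-column counts do not show whether values go missing together. The following is a minimal sketch, using only pandas and NumPy, of two ways to look for such patterns: counting the distinct row-wise combinations of missing columns, and correlating the missingness indicators. The `sample` data and its column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample in which age and income are always missing together
sample = pd.DataFrame({
    "age": [25, np.nan, 47, np.nan, 38],
    "income": [42000, np.nan, 58000, np.nan, 49500],
    "city": ["Pune", "Delhi", None, "Pune", "Delhi"],
})

# True where a value is missing
is_missing = sample.isnull()

# How often each combination of missing columns occurs across rows
pattern_counts = is_missing.value_counts()
print("Row-wise missingness patterns:\n", pattern_counts)

# Correlation between missingness indicators; values near 1 suggest
# that two columns tend to be missing in the same rows
missing_corr = is_missing.astype(int).corr()
print("\nMissingness correlation:\n", missing_corr)
```

A high correlation between two indicators is a hint that the gaps share a cause, for example fields that were skipped together on the same form.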

### Task 4: Integrate Automated Data Quality Checks
**Description**: Integrate automated data quality checks using the Great Expectations library for a dataset.

**Steps**:
1. Install and initialize Great Expectations.
2. Wrap the DataFrame in a Great Expectations dataset.
3. Add expectations for the columns and run the validation.

In [None]:
# write your code from here

In [4]:
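# Step 1: Install and initialize Great Expectations.
# These are shell commands, not Python, so a notebook cell needs the %pip magic.
# The version pin is an assumption: the `init` CLI command and the legacy
# dataset API used in the next cell are not available in recent releases.
%pip install "great_expectations<0.18"

# Run the following in a terminal, since the command may prompt for confirmation:
#   great_expectations init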

In [5]:
# Step 2: Load Great Expectations
# ge.from_pandas belongs to the legacy dataset API; this cell assumes an
# older release (before 0.18) in which that API is still available.
import great_expectations as ge

# Wrap the DataFrame so that the expectation methods are available on it
df_ge = ge.from_pandas(df)

# Step 3: Add basic expectations
df_ge.expect_column_values_to_not_be_null('age')
df_ge.expect_column_values_to_be_between('age', 0, 120)
df_ge.expect_column_values_to_match_regex('email', r'^[\w\.-]+@[\w\.-]+\.\w+$')

# Step 4: Run validations
results = df_ge.validate()
print("Validation Results:\n", results)
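
To make clear what the three expectations in the cell above check, independent of which Great Expectations release is installed, here is a rough plain-pandas sketch of the same rules. It is not the Great Expectations API: the `sample` data, the `checks` dictionary, and the shape of `summary` are hypothetical. Null ages are skipped in the range check here so that a missing value is reported only once.

```python
import pandas as pd

# Hypothetical sample with one missing age, one out-of-range age, one malformed email
sample = pd.DataFrame({
    "age": [34, None, 130, 45],
    "email": ["asha@example.com", "ben@example", "chen@example.org", "dee@example.com"],
})

# Same pattern as in the expectation above
EMAIL_PATTERN = r'^[\w\.-]+@[\w\.-]+\.\w+$'

# Each check is a boolean Series: True where the row satisfies the rule
checks = {
    "age_not_null": sample["age"].notnull(),
    "age_between_0_and_120": sample["age"].between(0, 120) | sample["age"].isnull(),
    "email_matches_pattern": sample["email"].str.match(EMAIL_PATTERN, na=False),
}

# Collapse each check into a pass/fail flag and a count of failing rows
summary = {
    name: {"success": bool(passed.all()), "unexpected_count": int((~passed).sum())}
    for name, passed in checks.items()
}

for name, result in summary.items():
    print(name, result)
```

Keeping each rule as a boolean Series makes it easy to pull out the failing rows with `sample[~checks[name]]`.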