**Task 1**: Checking Null Values for Completeness

**Description**: Verify if there are any null values in a dataset, which indicate incomplete data.

In [1]:
# Write your code from here
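
As a minimal sketch of this check, the snippet below measures the share of nulls per column rather than the raw count. The `sample` frame and its column names are hypothetical and used only for illustration.

```python
import pandas as pd

# Hypothetical sample data; the column names are illustrative only.
sample = pd.DataFrame({
    "name": ["Alice", None, "Charlie", "Dana"],
    "country_code": ["US", "CA", None, None],
})

# isna() yields a boolean frame; the column-wise mean is the fraction of nulls.
null_share = sample.isna().mean()

# Columns that contain at least one null value.
incomplete_columns = null_share[null_share > 0].index.tolist()

print(null_share)
print(incomplete_columns)
```

Reporting a fraction instead of a count makes columns of different lengths or datasets of different sizes easier to compare.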

**Task 2**: Checking Data Type Validity

**Description**: Ensure that columns contain data of expected types, e.g., ages are integers.

In [2]:
# Write your code from here
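
One possible approach, sketched under the assumption that numeric strings such as `"30"` should count as convertible, is to coerce the column with `pd.to_numeric` and flag whatever fails to parse. The `ages` series is hypothetical sample data.

```python
import pandas as pd

# Hypothetical column with mixed types, as often seen after a CSV import.
ages = pd.Series([25, "30", 125, 40, "twenty", 30], name="age")

# errors="coerce" turns values that cannot be parsed as numbers into NaN.
numeric_ages = pd.to_numeric(ages, errors="coerce")

# A value is integer-like if it parsed and has no fractional part.
is_integer_like = numeric_ages.notna() & (numeric_ages % 1 == 0)

print(pd.DataFrame({"age": ages, "is_integer_like": is_integer_like}))
```

This is looser than an `isinstance(x, int)` test: it accepts `"30"` but still rejects `"twenty"`. Which behavior is appropriate depends on whether the stored type or the convertible value matters.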

**Task 3**: Verify Uniqueness of Identifiers

**Description**: Check if a dataset has unique identifiers (e.g., emails).

In [3]:
# Write your code from here
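
A short sketch of the uniqueness check using `Series.is_unique`, followed by listing every copy of a repeated value. The `emails` series is hypothetical.

```python
import pandas as pd

# Hypothetical identifier column with one repeated value.
emails = pd.Series(
    ["a@example.com", "b@example.com", "a@example.com"], name="email"
)

# is_unique is True only when no value occurs more than once.
has_unique_emails = emails.is_unique

# keep=False marks every occurrence of a repeated value, not just the later ones.
repeated = emails[emails.duplicated(keep=False)]

print(has_unique_emails)
print(repeated)
```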

**Task 4**: Validate Email Format Using Regex

**Description**: Validate if email addresses in a dataset have the correct format.

In [4]:
# Write your code from here
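
The format check can also be written without a Python-level loop. The sketch below assumes a deliberately simple pattern (it is not a full RFC 5322 validator) and uses hypothetical sample addresses.

```python
import pandas as pd

# Hypothetical sample addresses, including a missing value.
emails = pd.Series(
    ["alice@example.com", "bob@example", "no-at-sign.com", None], name="email"
)

# Simplified pattern: local part, "@", domain, a dot, and a 2+ letter suffix.
pattern = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"

# fullmatch requires the whole string to match; na=False treats missing values as invalid.
email_valid = emails.str.fullmatch(pattern, na=False)

print(pd.DataFrame({"email": emails, "email_valid": email_valid}))
```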

**Task 5**: Check for Logical Age Validity

**Description**: Ensure ages are within a reasonable human range (e.g., 0-120).

In [5]:
# Write your code from here
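
A minimal sketch of the range check, assuming unparseable or missing ages should be treated as invalid. The sample values are hypothetical.

```python
import pandas as pd

# Hypothetical ages with a string, an out-of-range value, a negative, and a null.
ages = pd.Series([25, "30", 125, -1, "twenty", None], name="age")

# Coerce to numbers first so the comparison never sees a string.
numeric_ages = pd.to_numeric(ages, errors="coerce")

# between() is inclusive on both ends by default; NaN compares as False.
age_in_range = numeric_ages.between(0, 120)

print(pd.DataFrame({"age": ages, "age_in_range": age_in_range}))
```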

**Task 6**: Identify and Handle Missing Data

**Description**: Identify missing values in a dataset and impute them using a simple strategy (e.g., mean).

In [6]:
# Write your code from here
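
Mean imputation, the strategy named in the description, might look like the sketch below. The `amounts` series is hypothetical; for skewed data the median is often a safer choice than the mean.

```python
import pandas as pd

# Hypothetical numeric column with two missing values.
amounts = pd.Series([100.0, None, 300.0, None], name="transaction_amount")

# Identify: count the missing entries before changing anything.
missing_before = int(amounts.isna().sum())

# Impute: mean() skips NaN by default, so it averages only the observed values.
mean_amount = amounts.mean()
amounts_filled = amounts.fillna(mean_amount)

print(f"Missing before: {missing_before}, mean used: {mean_amount}")
print(amounts_filled)
```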

**Task 7**: Detect Duplicates

**Description**: Detect duplicate rows in the dataset.

In [7]:
# Write your code from here
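
The sketch below contrasts the two common ways of flagging duplicates and shows how they could be dropped. The `rows` frame is hypothetical.

```python
import pandas as pd

# Hypothetical frame in which the second and third rows are identical.
rows = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "name": ["Alice", "Bob", "Bob", "Cara"],
})

# Default keep="first": only the later copies are marked as duplicates.
later_copies = rows.duplicated()

# keep=False: every member of a duplicate group is marked.
all_copies = rows[rows.duplicated(keep=False)]

# Keep one row per duplicate group.
deduplicated = rows.drop_duplicates()

print(later_copies.tolist())
print(all_copies)
print(deduplicated)
```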

**Task 8**: Validate Correctness of Numerical Values

**Description**: Ensure numerical columns are within a specified range.

In [8]:
# Write your code from here
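
When several columns each have their own bounds, the rules can be kept in one place. This is a sketch only: the `rules` mapping, its bounds, and the `quantity` column are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical orders and hypothetical (lower, upper) bounds per column.
orders = pd.DataFrame({
    "transaction_amount": [100.5, -50.0, 20000.0],
    "quantity": [1, 5, 0],
})
rules = {"transaction_amount": (0, 10_000), "quantity": (1, 100)}

# One boolean column per rule: True where the value falls outside its range.
violations = pd.DataFrame({
    column: ~orders[column].between(lower, upper)
    for column, (lower, upper) in rules.items()
})

# Rows that break at least one rule.
out_of_range_rows = orders[violations.any(axis=1)]

print(violations)
print(out_of_range_rows)
```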

**Task 9**: Custom Completeness Rule Violation Report

**Description**: Create a report showing which rows violate specific completeness rules, such as mandatory fields being empty.

In [9]:
# Write your code from here
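
A report is more useful when it names the fields that are missing in each row. The following sketch assumes `name` and `email` are the mandatory fields; the `records` frame and the report layout are hypothetical.

```python
import pandas as pd

# Hypothetical records; 'country_code' is optional, so its null is not a violation.
records = pd.DataFrame({
    "name": ["Alice", None, "Cara"],
    "email": ["alice@example.com", "bob@example.com", None],
    "country_code": [None, "CA", "GB"],
})
mandatory_fields = ["name", "email"]

# Boolean frame restricted to the mandatory fields, then to the violating rows.
missing = records[mandatory_fields].isna()
violating = missing[missing.any(axis=1)]

# One report line per violating row, listing which mandatory fields are empty.
report = pd.DataFrame({
    "row": violating.index,
    "missing_fields": [
        ", ".join(field for field in mandatory_fields if flags[field])
        for _, flags in violating.iterrows()
    ],
})

print(report)
```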

**Task 10**: Advanced Regex for Data Validity Check

**Description**: Check for validity with advanced regex patterns, such as validating complex fields with multi-level rules.

In [10]:
# Write your code from here
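
One way to read "multi-level rules" is a format check followed by a semantic check. The sketch below first matches a date against a regex with a backreference (so both separators must agree) and then confirms the date exists on the calendar. The function name and the accepted formats are assumptions.

```python
import re
from datetime import datetime

# Level 1: YYYY-MM-DD or YYYY/MM/DD; \2 forces the second separator to equal the first.
DATE_PATTERN = re.compile(r"(\d{4})([-/])(\d{2})\2(\d{2})")

def is_valid_order_date(value):
    """Return True if value is a well-formed date string for a real calendar day."""
    if not isinstance(value, str):
        return False
    match = DATE_PATTERN.fullmatch(value)
    if match is None:
        return False
    year, _separator, month, day = match.groups()
    # Level 2: datetime() raises ValueError for impossible dates such as 2023-02-30.
    try:
        datetime(int(year), int(month), int(day))
    except ValueError:
        return False
    return True

for candidate in ["2023-01-15", "2023/04/05", "2023-02-30", "2023-04/05", None]:
    print(candidate, is_valid_order_date(candidate))
```

A regex alone would accept `2023-02-30` because it only sees digits; the second level catches what the pattern cannot express.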

In [11]:
import pandas as pd
import numpy as np
import re
from datetime import datetime

# Assume your dataset is loaded into a pandas DataFrame called 'df'
# For demonstration, let's create a sample DataFrame:
data = {
    'id': [1, 2, 3, 4, 5, 5],
    'name': ['Alice', 'Bob', 'Charlie', None, 'Eve', 'Eve'],
    'age': [25, '30', 125, 40, 'twenty', 30],
    'email': ['alice@example.com', 'bob@example.com', 'charlie@example.com', 'david@example.com', 'eve@example.com', 'eve@example.com'],
    'transaction_amount': [100.50, 200.75, -50.00, 150.20, 300.00, 300.00],
    'order_date': ['2023-01-15', '2023-02-20', '2023-03-10', '2023/04/05', '2023-05-01', '2023-05-01'],
    'product_code': ['ABC123', 'DEF456', 'GHI789', 'JKL012', 'MNO345', 'PQR678'],
    'country_code': ['US', 'CA', 'GB', 'US', 'FR', None]
}
df = pd.DataFrame(data)

print("Original DataFrame:")
print(df)
print("-" * 50)

# --- Task 1: Checking Null Values for Completeness ---
print("\n--- Task 1: Checking Null Values for Completeness ---")
null_values = df.isnull().sum()
print("Number of null values per column:")
print(null_values)
total_null = df.isnull().sum().sum()
total_cells = df.size
completeness = (total_cells - total_null) / total_cells * 100
print(f"Overall completeness: {completeness:.2f}%")
print("-" * 50)

# --- Task 2: Checking Data Type Validity ---
print("\n--- Task 2: Checking Data Type Validity ---")
print("Data types of each column:")
print(df.dtypes)

# Example: 'age' should hold integers; flag each value that is a Python int
df['age_valid_type'] = df['age'].apply(lambda x: isinstance(x, int))
print("\nValidity of 'age' data type (should be int):")
print(df[['age', 'age_valid_type']])
print("-" * 50)

# --- Task 3: Verify Uniqueness of Identifiers ---
print("\n--- Task 3: Verify Uniqueness of Identifiers ---")
# Example: Check uniqueness of 'id' and 'email'
print(f"Number of unique IDs: {df['id'].nunique()}")
print(f"Number of IDs: {len(df)}")
if df['id'].nunique() < len(df):
    print("Warning: 'id' column contains duplicate values.")
else:
    print("'id' column has unique values.")

print(f"Number of unique emails: {df['email'].nunique()}")
print(f"Number of emails: {len(df)}")
if df['email'].nunique() < len(df):
    print("Warning: 'email' column contains duplicate values.")
else:
    print("'email' column has unique values.")
print("-" * 50)

# --- Task 4: Validate Email Format Using Regex ---
print("\n--- Task 4: Validate Email Format Using Regex ---")
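def is_valid_email(email):
    """Return True if email is a string matching a simple address pattern."""
    if isinstance(email, str):
        pattern = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
        # fullmatch must consume the whole string, so a trailing newline is rejected
        return re.fullmatch(pattern, email) is not None
    return False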

df['email_valid_format'] = df['email'].apply(is_valid_email)
print("Email format validity:")
print(df[['email', 'email_valid_format']])
print("-" * 50)

# --- Task 5: Check for Logical Age Validity ---
print("\n--- Task 5: Check for Logical Age Validity ---")
def is_logical_age(age):
    try:
        age = int(age)
        return 0 <= age <= 120
    except (ValueError, TypeError):
        return False

df['age_logical'] = df['age'].apply(is_logical_age)
print("Logical age validity (0-120):")
print(df[['age', 'age_logical']])
print("-" * 50)

# --- Task 6: Identify and Handle Missing Data ---
print("\n--- Task 6: Identify and Handle Missing Data ---")
print("Missing values before imputation:")
print(df.isnull().sum())

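# Impute missing 'name' with 'Unknown'
# Assign the result back: fillna(inplace=True) on a selected column is chained
# assignment and can leave df unchanged in newer pandas versions
df['name'] = df['name'].fillna('Unknown')

# Impute missing 'country_code' with the mode (most frequent value), if one exists
country_modes = df['country_code'].mode()
if not country_modes.empty:
    df['country_code'] = df['country_code'].fillna(country_modes[0])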

print("\nMissing values after simple imputation:")
print(df.isnull().sum())
print("-" * 50)

# --- Task 7: Detect Duplicates ---
print("\n--- Task 7: Detect Duplicates ---")
duplicate_rows = df[df.duplicated()]
print("Duplicate rows:")
print(duplicate_rows)

# Detect duplicates based on specific columns (e.g., 'name' and 'email')
duplicate_specific = df[df.duplicated(subset=['name', 'email'], keep=False)]
print("\nDuplicate rows based on 'name' and 'email':")
print(duplicate_specific)
print("-" * 50)

# --- Task 8: Validate Correctness of Numerical Values ---
print("\n--- Task 8: Validate Correctness of Numerical Values ---")
# Example: Ensure 'transaction_amount' is within a reasonable range (e.g., >= 0)
df['amount_valid_range'] = df['transaction_amount'] >= 0
print("Validity of 'transaction_amount' (>= 0):")
print(df[['transaction_amount', 'amount_valid_range']])
print("-" * 50)

# --- Task 9: Custom Completeness Rule Violation Report ---
print("\n--- Task 9: Custom Completeness Rule Violation Report ---")
# Example: 'name' and 'email' are mandatory fields
# Task 6 has already imputed 'name' in df, so check the original, un-imputed data
mandatory_fields = ['name', 'email']
original_df = pd.DataFrame(data)
completeness_violations = original_df[original_df[mandatory_fields].isnull().any(axis=1)]
print(f"Rows violating completeness rules for mandatory fields ({mandatory_fields}):")
print(completeness_violations[mandatory_fields])
print("-" * 50)

# --- Task 10: Advanced Regex for Data Validity Check ---
print("\n--- Task 10: Advanced Regex for Data Validity Check ---")
# Example: Validate 'product_code' format (e.g., 3 uppercase letters followed by 3 digits)
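def is_valid_product_code(code):
    """Return True if code is exactly 3 uppercase letters followed by 3 digits."""
    if isinstance(code, str):
        pattern = r"[A-Z]{3}\d{3}"
        # fullmatch must consume the whole string, so a trailing newline is rejected
        return re.fullmatch(pattern, code) is not None
    return False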

df['product_code_valid'] = df['product_code'].apply(is_valid_product_code)
print("Product code format validity (3 uppercase letters followed by 3 digits):")
print(df[['product_code', 'product_code_valid']])

# Example: Validate 'order_date' is in YYYY-MM-DD or YYYY/MM/DD format
# (format only; this does not check that the date exists on the calendar)
def is_valid_order_date(date_str):
    if isinstance(date_str, str):
        # \1 requires the second separator to match the first ('-' or '/')
        pattern = r"\d{4}([-/])\d{2}\1\d{2}"
        return re.fullmatch(pattern, date_str) is not None
    return False

df['order_date_valid_format'] = df['order_date'].apply(is_valid_order_date)
print("\nOrder date format validity (YYYY-MM-DD or YYYY/MM/DD):")
print(df[['order_date', 'order_date_valid_format']])

Original DataFrame:
   id     name     age                email  transaction_amount  order_date  \
0   1    Alice      25    alice@example.com              100.50  2023-01-15   
1   2      Bob      30      bob@example.com              200.75  2023-02-20   
2   3  Charlie     125  charlie@example.com              -50.00  2023-03-10   
3   4     None      40    david@example.com              150.20  2023/04/05   
4   5      Eve  twenty      eve@example.com              300.00  2023-05-01   
5   5      Eve      30      eve@example.com              300.00  2023-05-01   

  product_code country_code  
0       ABC123           US  
1       DEF456           CA  
2       GHI789           GB  
3       JKL012           US  
4       MNO345           FR  
5       PQR678         None  
--------------------------------------------------

--- Task 1: Checking Null Values for Completeness ---
Number of null values per column:
id                    0
name                  1
age                   0
emai