## Calculate Data Quality Score
**Introduction**: In this activity, you will calculate data quality scores for datasets using three metrics: completeness, accuracy, and consistency. Each task below works through one metric on a small example dataset.

### Task 1: Completeness Score
1. Objective: Determine the percentage of non-missing values in a dataset.
2. Steps:
    - Load a sample dataset using Pandas.
    - Identify the columns with missing values.
    - Calculate the completeness score as the ratio of non-missing values to total values.
    - E.g., a dataset with customer information.
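
The steps above can be sketched as follows. This is a minimal illustration, not a reference solution: the `customers` DataFrame and its column names are made up for the example, and it reports a per-column breakdown alongside the overall score so that the columns with missing values are easy to spot.

```python
import pandas as pd

# Hypothetical customer data; None marks a missing value.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "name": ["Alice", "Bob", None, "Dana"],
    "email": ["alice@example.com", None, None, "dana@example.com"],
})

# Columns that contain at least one missing value.
columns_with_missing = customers.columns[customers.isna().any()].tolist()

# Per-column completeness: share of non-missing cells, as a percentage.
per_column = customers.notna().mean() * 100

# Overall completeness: non-missing cells divided by all cells.
overall = customers.notna().sum().sum() / customers.size * 100

print(columns_with_missing)
print(per_column.round(2))
print(f"Overall completeness: {overall:.2f}%")
```

Here 9 of the 12 cells are filled, so the overall score is 75%, while the per-column view shows that `email` (50%) is the weakest column.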

In [None]:
# Write your code from here

### Task 2: Accuracy Score

1. Objective: Measure the accuracy of a dataset by comparing it against a reference dataset.
2. Steps:
    - Load the main dataset and a reference dataset.
    - Select key columns for accuracy check.
    - Match values from both datasets and calculate the accuracy percentage.
    - E.g., an existing dataset with sales information.
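
One way to carry out these steps is a merge followed by a column-wise comparison. The sketch below assumes hypothetical `main` and `reference` sales tables keyed by `order_id`; it scores only the orders that appear in both tables and treats two missing values as a match, which is one possible convention rather than the only one.

```python
import pandas as pd

# Hypothetical sales data; order 3 has a different price in the reference.
main = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "price": [10.0, 20.0, 30.0, 40.0],
    "product": ["pen", "ink", "pad", "cap"],
})
reference = pd.DataFrame({
    "order_id": [1, 2, 3, 5],
    "price": [10.0, 20.0, 35.0, 50.0],
    "product": ["pen", "ink", "pad", "bag"],
})

key = "order_id"
columns_to_check = ["price", "product"]

# Inner merge keeps only the orders present in both datasets.
merged = main.merge(reference, on=key, suffixes=("_main", "_ref"))

# A row is accurate when every checked column matches in both datasets.
row_matches = pd.Series(True, index=merged.index)
for col in columns_to_check:
    left, right = merged[f"{col}_main"], merged[f"{col}_ref"]
    # Treat a pair of missing values as a match; otherwise compare values.
    row_matches &= (left == right) | (left.isna() & right.isna())

accuracy = row_matches.mean() * 100
print(f"Accuracy: {accuracy:.2f}% over {len(merged)} matched orders")
```

Orders 4 and 5 have no counterpart and are left out of the score, so two of the three matched orders agree. If no keys match at all, `row_matches.mean()` is `NaN`, so a real implementation would need to handle that case explicitly.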

In [None]:
# Write your code from here


### Task 3: Consistency Score

1. Objective: Evaluate the consistency within a dataset for specific columns.
2. Steps:
    - Choose a column expected to have consistent values.
    - Use statistical or rule-based checks to identify inconsistencies.
    - Calculate the consistency score as the ratio of consistent entries to total entries.
    - E.g., validating phone number formats in a contact list.
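
A rule-based check of this kind can be sketched with a regular expression. The `contacts` data and the `PHONE_PATTERN` rule below are assumptions made for illustration: the rule accepts only `XXX-XXX-XXXX` and `(XXX) XXX-XXXX`, and missing values are excluded from the denominator.

```python
import pandas as pd

# Hypothetical contact list; None marks a missing phone number.
contacts = pd.DataFrame({
    "phone": ["123-456-7890", "(123) 456-7890", "1234567890", "n/a", None],
})

# Assumed rule: accept only XXX-XXX-XXXX or (XXX) XXX-XXXX.
PHONE_PATTERN = r"(?:\(\d{3}\) |\d{3}-)\d{3}-\d{4}"

# Missing values are excluded: they count against completeness, not consistency.
phones = contacts["phone"].dropna().astype(str)

# fullmatch requires the entire string to match the pattern.
is_consistent = phones.str.fullmatch(PHONE_PATTERN)

consistency = is_consistent.mean() * 100
print(f"Consistency: {consistency:.2f}%")
```

Two of the four non-missing numbers follow the assumed format, giving 50%. A looser or stricter pattern changes the score, so the rule itself should be agreed on before the metric is reported.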

In [None]:
# Write your code from here


In [1]:
import pandas as pd
import re

def calculate_completeness_score(df):
    """
    Calculates the completeness score for a Pandas DataFrame.

    The completeness score is the percentage of non-missing values in the dataset.

    Args:
        df (pd.DataFrame): The input DataFrame.

    Returns:
        float: The completeness score (between 0 and 100).
    """
    if df.empty:
        print("Warning: DataFrame is empty. Completeness score will be 0.")
        return 0.0

    # Total number of cells in the DataFrame (rows x columns).
    # The df.empty guard above guarantees this is greater than zero.
    total_values = df.size

    # Number of missing cells (NaN, None, NaT)
    missing_values = df.isnull().sum().sum()

    # Share of non-missing cells, as a percentage
    completeness_score = ((total_values - missing_values) / total_values) * 100
    return completeness_score

def calculate_accuracy_score(main_df, reference_df, key_column, columns_to_check):
    """
    Measures the accuracy of a dataset by comparing it against a reference dataset.

    Accuracy is calculated based on matching values in specified columns for records
    identified by a key column.

    Args:
        main_df (pd.DataFrame): The main dataset to be checked for accuracy.
        reference_df (pd.DataFrame): The reference dataset for comparison.
        key_column (str): The name of the column used to match records between the two DataFrames.
        columns_to_check (list): A list of column names whose values will be compared for accuracy.

    Returns:
        float: The accuracy score (between 0 and 100).
    """
    if main_df.empty or reference_df.empty:
        print("Warning: One or both DataFrames are empty. Accuracy score will be 0.")
        return 0.0

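    # Every compared column must exist in both DataFrames. Otherwise pandas adds
    # no suffix during the merge, the lookups below find nothing on either side,
    # and the record would silently be counted as accurate.
    missing_columns = [
        col for col in columns_to_check
        if col not in main_df.columns or col not in reference_df.columns
    ]
    if missing_columns:
        print(f"Warning: Columns {missing_columns} are missing from one or both DataFrames. Accuracy score will be 0.")
        return 0.0

    # Merge the two DataFrames on the key column. The default inner merge keeps
    # only records present in both, so unmatched records are not scored.
    merged_df = pd.merge(main_df, reference_df, on=key_column, suffixes=('_main', '_ref'))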

    if merged_df.empty:
        print(f"No matching records found on key column '{key_column}'. Accuracy score will be 0.")
        return 0.0

    accurate_records_count = 0
    total_records_checked = len(merged_df)

    # Iterate through each merged record and check accuracy for specified columns
    for index, row in merged_df.iterrows():
        is_accurate_record = True
        for col in columns_to_check:
            main_value = row.get(f"{col}_main")
            ref_value = row.get(f"{col}_ref")

            # Two missing values count as a match. A missing value paired with a
            # present one, or two different values, marks the record inaccurate.
            if pd.isna(main_value) and pd.isna(ref_value):
                continue
            elif pd.isna(main_value) != pd.isna(ref_value):
                is_accurate_record = False
                break
            elif main_value != ref_value:
                is_accurate_record = False
                break
        if is_accurate_record:
            accurate_records_count += 1

    # merged_df is non-empty here, so total_records_checked is at least 1
    accuracy_score = (accurate_records_count / total_records_checked) * 100
    return accuracy_score

def calculate_consistency_score(df, column, consistency_rule=None):
    """
    Evaluates the consistency within a dataset for a specific column based on a rule.

    Args:
        df (pd.DataFrame): The input DataFrame.
        column (str): The name of the column to check for consistency.
        consistency_rule (callable, optional): A function that takes a value from the
                                               specified column and returns True if it's
                                               consistent, False otherwise.
                                               If None, a default rule for phone number format
                                               (e.g., (XXX) XXX-XXXX or XXX-XXX-XXXX) will be used.

    Returns:
        float: The consistency score (between 0 and 100).
    """
    if df.empty or column not in df.columns:
        print(f"Warning: DataFrame is empty or column '{column}' not found. Consistency score will be 0.")
        return 0.0

    consistent_entries = 0
    total_entries = len(df[column].dropna()) # Only consider non-null entries for consistency check

    if total_entries == 0:
        print(f"Warning: Column '{column}' has no non-missing values to check for consistency. Score will be 0.")
        return 0.0

    # Default rule for phone number consistency (e.g., (XXX) XXX-XXXX or XXX-XXX-XXXX)
    def default_phone_rule(phone_number):
        if not isinstance(phone_number, str):
            return False
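        # Regex for common US phone formats such as (123) 456-7890 or 123-456-7890.
        # Parentheses and separators ('-', '.', whitespace) are all optional, so
        # looser variants such as 1234567890 or 555-1234567 also match.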
        pattern = re.compile(r"^\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}$")
        return bool(pattern.match(phone_number))

    rule_to_apply = consistency_rule if consistency_rule is not None else default_phone_rule

    for value in df[column].dropna():
        if rule_to_apply(value):
            consistent_entries += 1

    consistency_score = (consistent_entries / total_entries) * 100
    return consistency_score

# --- Example Usage ---

# Task 1: Completeness Score
print("--- Task 1: Completeness Score ---")
# Create a sample dataset
data_completeness = {
    'CustomerID': [1, 2, 3, 4, 5, 6],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', None, 'Eve'],
    'Email': ['alice@example.com', 'bob@example.com', None, 'david@example.com', 'frank@example.com', 'eve@example.com'],
    'Phone': ['123-456-7890', '987-654-3210', '555-123-4567', None, '111-222-3333', '444-555-6666'],
    'Address': ['1 Main St', '2 Side Rd', None, '4 Oak Ave', '5 Pine Blvd', None]
}
df_completeness = pd.DataFrame(data_completeness)
print("\nSample Dataset for Completeness:")
print(df_completeness)

completeness = calculate_completeness_score(df_completeness)
print(f"\nCompleteness Score: {completeness:.2f}%")

# Task 2: Accuracy Score
print("\n--- Task 2: Accuracy Score ---")
# Create main dataset
data_main = {
    'OrderID': [101, 102, 103, 104, 105],
    'ProductName': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Webcam'],
    'Price': [1200.00, 25.00, 75.00, 300.00, 50.00],
    'Quantity': [1, 2, 1, 1, 3],
    'CustomerName': ['Alice', 'Bob', 'Charlie', 'David', 'Eve']
}
df_main = pd.DataFrame(data_main)

# Create reference dataset (with some discrepancies)
data_reference = {
    'OrderID': [101, 102, 103, 106, 105], # 106 is a new order, 104 is missing
    'ProductName': ['Laptop', 'Mouse', 'Keybrd', 'Speaker', 'Webcam'], # 'Keybrd' is a typo
    'Price': [1200.00, 25.00, 70.00, 150.00, 50.00], # Price for 103 is different
    'CustomerName': ['Alice', 'Bob', 'Charlie', 'Frank', 'Eve']
}
df_reference = pd.DataFrame(data_reference)

print("\nMain Dataset for Accuracy:")
print(df_main)
print("\nReference Dataset for Accuracy:")
print(df_reference)

# Define key column and columns to check
key_col_accuracy = 'OrderID'
cols_to_check_accuracy = ['ProductName', 'Price', 'CustomerName']

accuracy = calculate_accuracy_score(df_main, df_reference, key_col_accuracy, cols_to_check_accuracy)
print(f"\nAccuracy Score: {accuracy:.2f}%")

# Task 3: Consistency Score
print("\n--- Task 3: Consistency Score ---")
# Create a sample dataset for consistency
data_consistency = {
    'ContactID': [1, 2, 3, 4, 5, 6, 7],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace'],
    'PhoneNumber': ['(123) 456-7890', '987-654-3210', '555-1234567', '111-222-3333', 'not a phone', '000-000-0000', None],
    'Email': ['a@example.com', 'b@example.com', 'c@example.com', 'd@example.com', 'e@example.com', 'f@example.com', 'g@example.com']
}
df_consistency = pd.DataFrame(data_consistency)

print("\nSample Dataset for Consistency (Phone Numbers):")
print(df_consistency)

# Calculate consistency using the default phone number rule
consistency_phone = calculate_consistency_score(df_consistency, 'PhoneNumber')
print(f"\nConsistency Score for Phone Numbers (Default Rule): {consistency_phone:.2f}%")

# Example of a custom consistency rule (checks that the email contains both '@' and '.')
def email_format_rule(email):
    if not isinstance(email, str):
        return False
    return '@' in email and '.' in email

consistency_email = calculate_consistency_score(df_consistency, 'Email', consistency_rule=email_format_rule)
print(f"Consistency Score for Emails (Custom Rule): {consistency_email:.2f}%")



--- Task 1: Completeness Score ---

Sample Dataset for Completeness:
   CustomerID     Name              Email         Phone      Address
0           1    Alice  alice@example.com  123-456-7890    1 Main St
1           2      Bob    bob@example.com  987-654-3210    2 Side Rd
2           3  Charlie               None  555-123-4567         None
3           4    David  david@example.com          None    4 Oak Ave
4           5     None  frank@example.com  111-222-3333  5 Pine Blvd
5           6      Eve    eve@example.com  444-555-6666         None

Completeness Score: 83.33%

--- Task 2: Accuracy Score ---

Main Dataset for Accuracy:
   OrderID ProductName   Price  Quantity CustomerName
0      101      Laptop  1200.0         1        Alice
1      102       Mouse    25.0         2          Bob
2      103    Keyboard    75.0         1      Charlie
3      104     Monitor   300.0         1        David
4      105      Webcam    50.0         3          Eve

Reference Dataset for Accuracy:
   