## Calculate Data Quality Score
**Introduction**: In this activity, you will calculate data quality scores for datasets using different metrics. You will explore examples where you assess completeness, accuracy, and consistency.

### Task 1: Completeness Score
1. Objective: Determine the percentage of non-missing values in a dataset.
2. Steps:
    - Load a sample dataset using Pandas.
    - Identify the columns with missing values.
    - Calculate the completeness score as the ratio of non-missing values to total values.
    - E.g., a dataset with customer information.

In [1]:
# Write your code from here
import pandas as pd
import io

# Sample dataset with customer information.
# Bob's phone, Charlie's email, and David's address are intentionally missing.
csv_data = """CustomerID,Name,Email,Phone,Address
1,Alice,alice@example.com,123-456-7890,"123 Main St"
2,Bob,bob@example.com,,"456 Oak Ave"
3,Charlie,,789-012-3456,"789 Pine Ln"
4,David,david@example.com,901-234-5678,
5,Eve,eve@example.com,234-567-8901,"321 Elm Rd"
"""

# Load the dataset into a pandas DataFrame
df = pd.read_csv(io.StringIO(csv_data))

# 1. Identify columns with missing values
missing_counts = df.isnull().sum()
print("Columns with missing values:\n", missing_counts[missing_counts > 0])

# 2. Calculate the completeness score for each column
total_rows = len(df)
completeness_score = (df.notnull().sum() / total_rows) * 100

print("\nCompleteness Score for each column (%):\n", completeness_score)

# You can also calculate an overall completeness score as the average of
# the completeness scores of all columns:
overall_completeness = completeness_score.mean()
print("\nOverall Completeness Score (%):", overall_completeness)


### Task 2: Accuracy Score

1. Objective: Measure the accuracy of a dataset by comparing it against a reference dataset.
2. Steps:
    - Load the main dataset and a reference dataset.
    - Select key columns for accuracy check.
    - Match values from both datasets and calculate the accuracy percentage.
    - E.g., an existing dataset with sales information.

In [None]:
# Write your code from here

import pandas as pd
import io

# Main dataset with sales information
main_data = """OrderID,ProductID,Quantity,Price,CustomerID
101,A1,2,10.50,1
102,B2,1,25.00,2
103,A1,3,10.50,3
104,C3,1,5.75,4
105,B2,2,24.99,1
"""
main_df = pd.read_csv(io.StringIO(main_data))

# Reference dataset (considered the "ground truth")
reference_data = """OrderID,ProductID,Quantity,Price,CustomerID
101,A1,2,10.50,1
102,B2,1,25.00,2
103,A1,3,10.50,3
104,C3,1,5.75,4
105,B2,2,25.00,1
"""
reference_df = pd.read_csv(io.StringIO(reference_data))

# 1. Select key columns for accuracy check
#    ('OrderID' is the join key, so it gets no _main/_ref suffix after the merge)
key_columns = ['ProductID', 'Quantity', 'Price']

# 2. Merge the two DataFrames on 'OrderID' to compare corresponding records
merged_df = pd.merge(main_df, reference_df, on='OrderID', suffixes=('_main', '_ref'), how='inner')

# 3. Match values from both datasets and calculate accuracy for each key column
accuracy = {}
for col in key_columns:
    accuracy[col] = (merged_df[f'{col}_main'] == merged_df[f'{col}_ref']).sum() / len(merged_df) * 100

print("Accuracy for each key column (%):\n", accuracy)

# 4. Calculate an overall accuracy score (optional: average of key column accuracies)
overall_accuracy = sum(accuracy.values()) / len(accuracy)
print("\nOverall Accuracy Score (%):", overall_accuracy)
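An inner merge only compares records that appear in both datasets, so a record missing from the main dataset does not lower the score. The sketch below is one possible variant, not part of the original exercise: it uses a left merge from the reference and counts unmatched records as mismatches. The helper name `column_accuracy` and the sample rows are hypothetical, and the sketch assumes `OrderID` is unique in both datasets.

```python
import io
import pandas as pd

def column_accuracy(main_df, reference_df, key, columns):
    """Return, per column, the percentage of reference records whose value
    matches the main dataset. Reference records with no matching key in
    main_df count as mismatches. Assumes `key` is unique in both frames."""
    merged = reference_df.merge(
        main_df, on=key, how="left", suffixes=("_ref", "_main"), indicator=True
    )
    # '_merge' is 'both' only when the key exists in both datasets
    matched = merged["_merge"] == "both"
    scores = {}
    for col in columns:
        same = (merged[f"{col}_main"] == merged[f"{col}_ref"]) & matched
        scores[col] = float(same.sum() * 100 / len(merged))
    return scores

# Hypothetical sample: the main dataset is missing order 105
main_csv = """OrderID,ProductID,Quantity,Price
101,A1,2,10.50
102,B2,1,25.00
103,A1,3,10.50
104,C3,1,5.75
"""
reference_csv = """OrderID,ProductID,Quantity,Price
101,A1,2,10.50
102,B2,1,25.00
103,A1,3,10.50
104,C3,1,5.75
105,B2,2,25.00
"""
main_sample = pd.read_csv(io.StringIO(main_csv))
reference_sample = pd.read_csv(io.StringIO(reference_csv))

scores = column_accuracy(
    main_sample, reference_sample, key="OrderID",
    columns=["ProductID", "Quantity", "Price"],
)
print(scores)
```

With an inner merge, the same data would score 100% on every column, because order 105 would be dropped before the comparison.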

### Task 3: Consistency Score

1. Objective: Evaluate the consistency within a dataset for specific columns.
2. Steps:
    - Choose a column expected to have consistent values.
    - Use statistical or rule-based checks to identify inconsistencies.
    - Calculate the consistency score as the ratio of consistent entries to total entries.
    - E.g., validating phone number formats in a contact list.
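
As a starting point, the steps above could be sketched with a rule-based check. This is a minimal illustration, not a reference solution: the contact list is made up, and the rule that a consistent phone number has the form `NNN-NNN-NNNN` is an assumption you should replace with your own format. Missing phone numbers are counted as inconsistent here, which is also a design choice.

```python
import pandas as pd

# Hypothetical contact list; the values are illustrative only
contacts = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "Phone": ["123-456-7890", "(456) 789-0123", "789-012-3456", "9012345678", None],
})

# Assumed rule: a consistent phone number looks like NNN-NNN-NNNN
PHONE_PATTERN = r"\d{3}-\d{3}-\d{4}"

# fullmatch requires the whole string to match; na=False treats missing values as inconsistent
is_consistent = contacts["Phone"].str.fullmatch(PHONE_PATTERN, na=False)

# Consistency score: consistent entries divided by total entries, as a percentage
consistency_score = is_consistent.sum() * 100 / len(contacts)

print("Inconsistent entries:\n", contacts.loc[~is_consistent, ["Name", "Phone"]])
print(f"\nConsistency Score (%): {consistency_score:.2f}")
```

If missing values should be scored under completeness rather than consistency, one option is to drop them first with `contacts["Phone"].dropna()` and divide by the number of remaining entries.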

In [None]:
# Write your code from here
