### Task 1: Measure Data Accuracy using a Trusted Source

**Description**: You have two datasets of product prices: `company_prices.csv` and
`trusted_prices.csv`. Check whether the prices in `company_prices.csv` match the prices in
`trusted_prices.csv`. Assume both files have a "product_id" and a "price" column.

In [None]:
# Write your code from here

In [1]:
import pandas as pd

# --- Task 1: Measure Data Accuracy using a Trusted Source ---
print("--- Task 1: Measure Data Accuracy using a Trusted Source ---")

# Load the datasets
try:
    company_df = pd.read_csv('company_prices.csv')
    trusted_df = pd.read_csv('trusted_prices.csv')
except FileNotFoundError:
    print("Error: Make sure 'company_prices.csv' and 'trusted_prices.csv' are in the same directory.")
    company_df = pd.DataFrame() # Create empty DataFrame to avoid errors later
    trusted_df = pd.DataFrame() # Create empty DataFrame to avoid errors later

if not company_df.empty and not trusted_df.empty:
    # Merge the two DataFrames on 'product_id'.
    # An inner merge (the pd.merge default, made explicit here) keeps only
    # products present in both datasets.
    merged_df = pd.merge(company_df, trusted_df, on='product_id',
                         how='inner', suffixes=('_company', '_trusted'))

    # Check if prices match (exact equality on the parsed values)
    merged_df['price_match'] = (merged_df['price_company'] == merged_df['price_trusted'])

    # Calculate accuracy
    if not merged_df.empty:
        total_compared_products = len(merged_df)
        matching_prices_count = merged_df['price_match'].sum()
        accuracy = (matching_prices_count / total_compared_products) * 100
        print(f"Total products compared (present in both files): {total_compared_products}")
        print(f"Number of matching prices: {matching_prices_count}")
        print(f"Data Accuracy (matching prices) against trusted source: {accuracy:.2f}%")

        # Optionally, show products with price discrepancies
        discrepancies = merged_df[~merged_df['price_match']]
        if not discrepancies.empty:
            print("\nProducts with price discrepancies:")
            print(discrepancies[['product_id', 'price_company', 'price_trusted']])
        else:
            print("\nNo price discrepancies found between company and trusted data for common products.")
    else:
        print("No common products found between company_prices.csv and trusted_prices.csv to compare.")
else:
    print("Skipping Task 1 due to file loading error.")

--- Task 1: Measure Data Accuracy using a Trusted Source ---
Total products compared (present in both files): 7
Number of matching prices: 4
Data Accuracy (matching prices) against trusted source: 57.14%

Products with price discrepancies:
  product_id  price_company  price_trusted
2       P003           5.75            6.0
3       P004         -12.00           12.0
6       P007          -5.00            5.0

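The inner merge above ignores products that appear in only one file, and exact equality can misfire on prices that are the result of floating-point arithmetic. A minimal sketch of one way to address both, using hypothetical inline data in place of the CSV files and an assumed half-cent tolerance (`atol=0.005`); neither the data nor the tolerance comes from the original task:

```python
import numpy as np
import pandas as pd

# Hypothetical inline data standing in for the two CSV files.
company = pd.DataFrame({
    "product_id": ["P001", "P002", "P003", "P009"],
    "price": [10.0, 0.1 + 0.2, 5.75, 3.0],
})
trusted = pd.DataFrame({
    "product_id": ["P001", "P002", "P003", "P010"],
    "price": [10.0, 0.3, 6.0, 8.0],
})

# An outer merge with indicator=True records where each product_id was found.
merged = company.merge(
    trusted, on="product_id", how="outer",
    suffixes=("_company", "_trusted"), indicator=True,
)

# Restrict the accuracy calculation to products present in both sources.
both = merged[merged["_merge"] == "both"].copy()

# np.isclose treats values within a small tolerance as equal
# (so 0.1 + 0.2 matches 0.3). The tolerance is an assumed value.
both["price_match"] = np.isclose(
    both["price_company"], both["price_trusted"], atol=0.005
)

accuracy = both["price_match"].mean() * 100
only_company = merged.loc[merged["_merge"] == "left_only", "product_id"].tolist()
only_trusted = merged.loc[merged["_merge"] == "right_only", "product_id"].tolist()

print(f"Accuracy on shared products: {accuracy:.2f}%")
print(f"Only in company file: {only_company}")
print(f"Only in trusted file: {only_trusted}")
```

Reporting the unmatched products separately keeps the accuracy figure honest: a product missing from the trusted source is a coverage problem, not a price mismatch.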

### Task 2: Detect Incorrect Values

**Description**: In `company_prices.csv`, detect any negative price values, which are incorrect values for prices.

In [None]:
# Write your code from here

In [2]:
print("\n--- Task 2: Detect Incorrect Values ---")

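# Reuse company_df from Task 1 if it exists and is non-empty; otherwise load it again.
# Checking globals() avoids a NameError when this cell is run without Task 1
# (pandas must still be imported as pd).
if 'company_df' not in globals() or company_df.empty:
    try:
        company_df = pd.read_csv('company_prices.csv')
    except FileNotFoundError:
        print("Error: 'company_prices.csv' not found. Cannot perform Task 2.")
        company_df = pd.DataFrame() # Create empty DataFrame to avoid errors later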

if not company_df.empty:
    # Detect negative price values
    incorrect_prices_df = company_df[company_df['price'] < 0]

    if not incorrect_prices_df.empty:
        print("Detected incorrect (negative) price values:")
        print(incorrect_prices_df)
    else:
        print("No negative price values found in company_prices.csv. All prices appear valid.")
else:
    print("Skipping Task 2 due to file loading error.")



--- Task 2: Detect Incorrect Values ---
Detected incorrect (negative) price values:
  product_id  price
3       P004  -12.0
6       P007   -5.0

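Negative values are only one kind of incorrect price. If the price column may also contain text or blanks, a comparison such as `price < 0` can fail or silently skip rows. A minimal sketch of a broader check, using hypothetical inline data (the `"n/a"` and empty entries are invented for illustration, not taken from `company_prices.csv`):

```python
import pandas as pd

# Hypothetical inline data; the last two rows simulate unparseable and missing prices.
prices = pd.DataFrame({
    "product_id": ["P001", "P004", "P007", "P008", "P009"],
    "price": ["10.0", "-12.0", "-5.0", "n/a", None],
})

# Coerce to numeric; values that cannot be parsed become NaN instead of raising.
numeric_price = pd.to_numeric(prices["price"], errors="coerce")

# Negative prices: parsed successfully but below zero.
negative = prices[numeric_price < 0]
# Unparseable prices: a value was present but could not be converted.
unparseable = prices[numeric_price.isna() & prices["price"].notna()]
# Missing prices: no value at all.
missing = prices[prices["price"].isna()]

print("Negative prices:", negative["product_id"].tolist())
print("Unparseable prices:", unparseable["product_id"].tolist())
print("Missing prices:", missing["product_id"].tolist())
```

Keeping the three categories separate makes it easier to choose a different remedy for each one.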

### Task 3: Check Missing Data Rates

**Description**: Calculate the percentage of missing values in `customer_data.csv`.

In [None]:
# Write your code from here

In [4]:
print("\n--- Task 3: Check Missing Data Rates ---")

# Load the customer data
try:
    customer_df = pd.read_csv('customer_data.csv')
except FileNotFoundError:
    print("Error: Make sure 'customer_data.csv' is in the same directory.")
    customer_df = pd.DataFrame() # Create empty DataFrame to avoid errors later

if not customer_df.empty:
    # Calculate the total number of missing values per column
    missing_values_count = customer_df.isnull().sum()

    # Calculate the percentage of missing values per column
    total_rows = len(customer_df)
    missing_values_percentage = (missing_values_count / total_rows) * 100

    print("Missing values per column:")
    print(missing_values_count)
    print("\nPercentage of missing values per column:")
    print(missing_values_percentage.round(2).astype(str) + '%')

    # Optionally, calculate overall missing data rate
    total_cells = customer_df.size
    total_missing_cells = missing_values_count.sum()
    if total_cells > 0: # Avoid division by zero if DataFrame is empty
        overall_missing_rate = (total_missing_cells / total_cells) * 100
        print(f"\nOverall missing data rate across all columns: {overall_missing_rate:.2f}%")
    else:
        print("Customer data DataFrame is empty, cannot calculate overall missing rate.")
else:
    print("Skipping Task 3 due to file loading error.")


--- Task 3: Check Missing Data Rates ---
Missing values per column:
customer_id    0
name           1
age            3
email          1
city           1
dtype: int64

Percentage of missing values per column:
customer_id     0.0%
name           12.5%
age            37.5%
email          12.5%
city           12.5%
dtype: object

Overall missing data rate across all columns: 15.00%

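The same rates can be computed more compactly, because the mean of a boolean mask is the fraction of `True` values. A minimal sketch on hypothetical inline data (not the contents of `customer_data.csv`), which also adds a per-row rate that the cell above does not compute:

```python
import numpy as np
import pandas as pd

# Hypothetical inline data, not the contents of customer_data.csv.
df = pd.DataFrame({
    "customer_id": ["C1", "C2", "C3", "C4"],
    "age": [30.0, np.nan, 41.0, np.nan],
    "email": ["a@example.com", None, "c@example.com", "d@example.com"],
})

# Per-column missing rate in percent: mean of the boolean isna() mask.
col_missing_pct = df.isna().mean() * 100

# Per-row missing rate in percent: share of empty fields in each record.
row_missing_pct = df.isna().mean(axis=1) * 100

# Overall rate: missing cells divided by all cells.
overall_pct = df.isna().to_numpy().mean() * 100

print(col_missing_pct.round(2))
print(row_missing_pct.round(2))
print(f"Overall missing rate: {overall_pct:.2f}%")
```

The per-row view helps spot records that are mostly empty, which a per-column summary alone would hide.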

### Task 4: Handling Partially Available Records

**Description**: In `customer_data.csv`, identify records with a missing "email" or "phone number" and decide whether to drop or fill them.

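The contact columns named in the task may not all be present in a given file. A minimal sketch of a check that uses only the columns that exist, followed by one possible drop-or-fill rule. The inline data, the `DROP_THRESHOLD` cutoff, and the `"unknown"` placeholder are all hypothetical choices for illustration, not part of the task:

```python
import pandas as pd

# Hypothetical inline data; 'phone number' is deliberately absent here.
customers = pd.DataFrame({
    "customer_id": ["C101", "C104", "C105"],
    "email": ["alice@example.com", None, "eve@example.com"],
})

wanted = ["email", "phone number"]
# Keep only the contact columns that actually exist in this DataFrame.
present = [col for col in wanted if col in customers.columns]

# A record is flagged if any of the present contact columns is missing.
missing_contact = customers[customers[present].isna().any(axis=1)]

# Hypothetical rule: drop when few rows are affected, otherwise fill.
DROP_THRESHOLD = 0.05  # assumed cutoff, tune for your data
share_missing = len(missing_contact) / len(customers)
if share_missing <= DROP_THRESHOLD:
    result = customers.dropna(subset=present)
else:
    result = customers.fillna({col: "unknown" for col in present})

print(f"Share of records with missing contact info: {share_missing:.2%}")
print(result)
```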
In [None]:
# Write your code from here

In [5]:
print("\n--- Task 4: Handling Partially Available Records ---")

if not customer_df.empty:
    # --- Part A: Identify records with missing 'email' or 'phone number' ---
    # First, let's check if 'phone number' column exists in customer_data.csv
    # Our sample data doesn't have 'phone number', so we'll adjust to only check 'email'
    # If your actual data has 'phone number', you would include it like:
    # missing_contact_records = customer_df[customer_df['email'].isnull() | customer_df['phone number'].isnull()]

    # For our sample customer_data.csv, we will only check 'email'
    missing_email_records = customer_df[customer_df['email'].isnull()]

    print("Records with missing 'email' (or 'phone number' if it existed):")
    if not missing_email_records.empty:
        print(missing_email_records)
    else:
        print("No records found with missing 'email' (or 'phone number' if present).")

    # --- Part B: Decide whether to drop or fill them ---

    # Option 1: Drop records with missing 'email' (or 'phone number')
    print("\n--- Option 1: Dropping records with missing contact info ---")
    # Using 'email' for our sample, replace with ['email', 'phone number'] if both exist
    customer_df_dropped = customer_df.dropna(subset=['email'])

    print(f"Original number of records: {len(customer_df)}")
    print(f"Number of records after dropping missing 'email': {len(customer_df_dropped)}")
    print("DataFrame after dropping:")
    print(customer_df_dropped)


    # Option 2: Fill missing 'email' (or 'phone number') with a placeholder
    print("\n--- Option 2: Filling missing contact info with a placeholder ---")
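    customer_df_filled = customer_df.copy() # Create a copy to avoid modifying original df
    # Fill missing 'email' with 'missing@example.com'.
    # Assign the result back instead of calling fillna(..., inplace=True) on the
    # column selection: that form triggers a chained-assignment warning in recent
    # pandas and does not update the DataFrame under Copy-on-Write.
    customer_df_filled['email'] = customer_df_filled['email'].fillna('missing@example.com')
    # If you had a 'phone number' column, you might do:
    # customer_df_filled['phone number'] = customer_df_filled['phone number'].fillna('N/A')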


    print("DataFrame after filling missing 'email' with 'missing@example.com':")
    print(customer_df_filled)

    # You would choose either Option 1 or Option 2 based on your data analysis needs
    # and the impact of dropping/filling on your downstream tasks.
else:
    print("Skipping Task 4 due to 'customer_data.csv' loading error.")



--- Task 4: Handling Partially Available Records ---
Records with missing 'email' (or 'phone number' if it existed):
  customer_id   name   age email     city
3        C104  David  45.0   NaN  Houston

--- Option 1: Dropping records with missing contact info ---
Original number of records: 8
Number of records after dropping missing 'email': 7
DataFrame after dropping:
  customer_id     name   age                email         city
0        C101    Alice  30.0    alice@example.com     New York
1        C102      Bob  24.0      bob@example.com  Los Angeles
2        C103  Charlie   NaN  charlie@example.com      Chicago
4        C105      Eve  28.0      eve@example.com          NaN
5        C106    Frank   NaN    frank@example.com        Miami
6        C107    Grace  35.0    grace@example.com      Seattle
7        C108      NaN   NaN   oliver@example.com       Boston

--- Option 2: Filling missing contact info with a placeholder ---
DataFrame after filling missing 'email' with 'missing@exa