### Task 1: Detecting Missing Values during Data Ingestion
**Description**: You have a CSV file with missing values in some columns. Write a Python script to detect and report missing values during the ingestion process.

**Steps**:
1. Load data
2. Check for missing values
3. Report missing values

In [1]:
# Write your code from here
import pandas as pd

def detect_missing_values(file_path):
    """
    Loads data from a CSV file, detects missing values, and reports them.

    Args:
        file_path (str): The path to the CSV file.
    """
    try:
        # Step 1: Load data
        df = pd.read_csv(file_path)
        print(f"Data loaded successfully from: {file_path}\n")

        # Step 2: Check for missing values
        missing_values = df.isnull().sum()

        # Step 3: Report missing values
        if missing_values.any():
            print("Missing values detected in the following columns:")
            print(missing_values[missing_values > 0])
            total_missing = missing_values.sum()
            total_rows = len(df)
            total_cells = df.size  # rows * columns
            percentage_missing = (total_missing / total_cells) * 100
            print(f"\nTotal number of missing values: {total_missing}")
            print(f"Total number of rows: {total_rows}")
            print(f"Total number of cells: {total_cells}")
            print(f"Percentage of missing values in the entire dataset: {percentage_missing:.2f}%")
        else:
            print("No missing values found in the dataset.")

    except FileNotFoundError:
        print(f"Error: File not found at {file_path}")
    except Exception as e:
        print(f"An error occurred: {e}")

# Example usage (with the placeholder path left unchanged, this prints the file-not-found message):
file_path = 'your_data.csv'  # Replace 'your_data.csv' with the actual path to your CSV file
detect_missing_values(file_path)

Error: File not found at your_data.csv
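
Because the placeholder file does not exist, the run above only exercises the error branch. The following is a minimal sketch of the detection step on in-memory data, assuming a small hypothetical CSV (the column names and values are invented for illustration). It also reports a per-column percentage, which the function above does not compute.

```python
import io

import pandas as pd

# Hypothetical sample data: the column names and values are made up for illustration.
csv_text = """name,age,city
Alice,30,Bengaluru
Bob,,Mumbai
Charlie,40,
Dana,,Delhi
"""

df = pd.read_csv(io.StringIO(csv_text))

# Count missing cells per column, then express each count as a share of the rows.
missing_counts = df.isnull().sum()
missing_percent = df.isnull().mean() * 100

# Keep only the columns that actually contain missing values.
report = pd.DataFrame({"missing": missing_counts, "percent": missing_percent})
print(report[report["missing"] > 0])
```

Reading from `io.StringIO` keeps the example self-contained; passing a real file path to `pd.read_csv` works the same way.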


### Task 2: Validate Data Types during Extraction
**Description**: You have a JSON file that should have specific data types for each field. Write a script to validate if the data types match the expected schema.

**Steps**:
1. Define expected schema
2. Validate data types

In [2]:
# Write your code from here
import json

def validate_data_types(file_path, expected_schema):
    """
    Loads data from a JSON file and validates if the data types of each field
    match the expected schema.

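    Args:
        file_path (str): The path to the JSON file.
        expected_schema (dict): A dictionary defining the expected data types
                                 for each field. For example:
                                 {'name': str, 'age': int, 'is_active': bool, 'salary': float}

    Returns:
        bool: True if every record matches the schema, False if validation
              fails or the file cannot be read.
    """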
    try:
        with open(file_path, 'r') as f:
            data = json.load(f)
        print(f"Data loaded successfully from: {file_path}\n")

        if not isinstance(data, list):
            data = [data]  # Handle single JSON object as a list of one

        validation_errors = []

        for index, record in enumerate(data):
            for field, expected_type in expected_schema.items():
                if field not in record:
                    validation_errors.append(f"Record {index + 1}: Missing field '{field}'")
                    continue

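                actual_value = record[field]
                actual_type = type(actual_value)

                # Exact type comparison: an int is rejected where a float is expected,
                # and a bool is rejected where an int is expected.
                if actual_type != expected_type:
                    validation_errors.append(
                        f"Record {index + 1}, Field '{field}': Expected type '{expected_type.__name__}', but got '{actual_type.__name__}'"
                    )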

        if validation_errors:
            print("Data type validation failed. Errors found:")
            for error in validation_errors:
                print(f"- {error}")
            return False
        else:
            print("Data type validation successful. All fields match the expected schema.")
            return True

    except FileNotFoundError:
        print(f"Error: File not found at {file_path}")
        return False
    except json.JSONDecodeError:
        print(f"Error: Could not decode JSON from {file_path}")
        return False
    except Exception as e:
        print(f"An error occurred: {e}")
        return False

# Example Usage:
file_path = 'your_data.json'  # Replace 'your_data.json' with the actual path to your JSON file

# Define the expected schema
expected_schema = {
    'name': str,
    'age': int,
    'is_active': bool,
    'salary': float,
    'city': str  # Example of another field
}

# Create a sample JSON file for testing. This overwrites file_path, so skip this step when validating your own file.
sample_data = [
    {'name': 'Alice', 'age': 30, 'is_active': True, 'salary': 60000.50, 'city': 'Bengaluru'},
    {'name': 'Bob', 'age': '25', 'is_active': 'true', 'salary': 55000, 'city': 'Mumbai'},
    {'name': 'Charlie', 'age': 40, 'salary': 70000.00, 'city': 'Delhi'}
]

with open(file_path, 'w') as f:
    json.dump(sample_data, f, indent=4)

# Validate the data types
validation_result = validate_data_types(file_path, expected_schema)
print(f"\nValidation Result: {validation_result}")

Data loaded successfully from: your_data.json

Data type validation failed. Errors found:
- Record 2, Field 'age': Expected type 'int', but got 'str'
- Record 2, Field 'is_active': Expected type 'bool', but got 'str'
- Record 2, Field 'salary': Expected type 'float', but got 'int'
- Record 3: Missing field 'is_active'

Validation Result: False
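
The run above flags `salary: 55000` because `json.load` returns whole numbers as `int` and the check compares types exactly. If whole numbers should be accepted for float fields, one possible variant is sketched below. `matches_type` is a hypothetical helper, not part of the script above, and the leniency rule is an assumption about what the schema intends.

```python
def matches_type(value, expected_type):
    """Hypothetical helper: lenient type check for values loaded from JSON."""
    # bool is a subclass of int in Python, so handle it first to keep
    # True/False from passing as numbers.
    if isinstance(value, bool):
        return expected_type is bool
    # Accept whole numbers where a float is expected (JSON has a single number type).
    if expected_type is float:
        return isinstance(value, (int, float))
    return isinstance(value, expected_type)


print(matches_type(55000, float))  # whole number in a float field
print(matches_type(True, int))     # bool in an int field
print(matches_type("25", int))     # string in an int field
```

Swapping this in for the `actual_type != expected_type` comparison would accept the `salary` value in record 2 while still rejecting the string `age` and `is_active` values.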


### Task 3: Remove Duplicate Records in Data
**Description**: You have a dataset with duplicate entries. Write a Python script to find and remove duplicate records using Pandas.

**Steps**:
1. Find duplicate records
2. Remove duplicates
3. Report results
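
Finding and removing duplicates both depend on the `keep` argument, which decides which members of a duplicate group get flagged. The sketch below uses a small hypothetical in-memory DataFrame (not the task's data file) to show why the number of rows found with `keep=False` can exceed the number of rows removed with `keep='first'`.

```python
import pandas as pd

# Hypothetical in-memory data: two of the five rows repeat earlier rows.
df = pd.DataFrame({"col1": ["A", "B", "A", "C", "B"],
                   "col2": [1, 2, 1, 3, 2]})

# keep=False marks every member of a duplicate group, including the first occurrence.
all_members = df.duplicated(keep=False)

# keep='first' marks only the later repeats, which are the rows drop_duplicates removes.
later_repeats = df.duplicated(keep="first")

print(f"Rows in duplicate groups: {all_members.sum()}")
print(f"Rows that would be removed: {later_repeats.sum()}")
print(df.drop_duplicates(keep="first"))
```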

In [3]:
# Write your code from here
import pandas as pd

def remove_duplicate_records(file_path, columns_to_consider=None):
    """
    Loads data from a CSV file, finds and removes duplicate records using Pandas,
    and reports the results.

    Args:
        file_path (str): The path to the CSV file.
        columns_to_consider (list, optional): A list of column names to consider
                                              when identifying duplicates. If None,
                                              all columns are used. Defaults to None.

    Returns:
        pandas.DataFrame or None: The de-duplicated data, or None if loading
                                  or processing fails.
    """
    try:
        # Step 1: Load data
        df = pd.read_csv(file_path)
        print(f"Data loaded successfully from: {file_path}\n")

        # Step 2: Find duplicate records
        initial_row_count = len(df)
        # An empty or missing column list means all columns are compared
        duplicate_rows = df[df.duplicated(subset=columns_to_consider or None, keep=False)]

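        # Counts every row that belongs to a duplicate group, including first occurrences,
        # so this number can be larger than the number of rows removed below.
        num_duplicates = len(duplicate_rows)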
        print(f"Number of duplicate rows found: {num_duplicates}\n")
        if not duplicate_rows.empty:
            print("Example of duplicate rows:")
            print(duplicate_rows.head())
            print("\nNote: 'keep=False' shows all rows that are duplicates.")

        # Step 3: Remove duplicates
        df_cleaned = df.drop_duplicates(subset=columns_to_consider or None, keep='first')

        final_row_count = len(df_cleaned)
        removed_count = initial_row_count - final_row_count
        print(f"\nNumber of rows after removing duplicates: {final_row_count}")
        print(f"Number of duplicate rows removed: {removed_count}")

        return df_cleaned

    except FileNotFoundError:
        print(f"Error: File not found at {file_path}")
        return None
    except Exception as e:
        print(f"An error occurred: {e}")
        return None

# Example Usage:
file_path = 'your_data_with_duplicates.csv'  # Replace with the actual path

# Create a sample CSV file with duplicates for testing. This overwrites file_path, so skip this step when cleaning your own file.
data = {'col1': ['A', 'B', 'C', 'A', 'B', 'D'],
        'col2': [1, 2, 3, 1, 2, 4],
        'col3': [True, False, True, True, False, False]}
df_sample = pd.DataFrame(data)
df_sample.to_csv(file_path, index=False)

# Remove duplicates considering all columns
cleaned_df_all_cols = remove_duplicate_records(file_path)
if cleaned_df_all_cols is not None:
    print("\nCleaned DataFrame (considering all columns):")
    print(cleaned_df_all_cols)

# Remove duplicates considering only specific columns
columns_to_check = ['col1', 'col2']
cleaned_df_subset_cols = remove_duplicate_records(file_path, columns_to_check)
if cleaned_df_subset_cols is not None:
    print(f"\nCleaned DataFrame (considering columns: {columns_to_check}):")
    print(cleaned_df_subset_cols)

Data loaded successfully from: your_data_with_duplicates.csv

Number of duplicate rows found: 4

Example of duplicate rows:
  col1  col2   col3
0    A     1   True
1    B     2  False
3    A     1   True
4    B     2  False

Note: 'keep=False' shows all rows that are duplicates.

Number of rows after removing duplicates: 4
Number of duplicate rows removed: 2

Cleaned DataFrame (considering all columns):
  col1  col2   col3
0    A     1   True
1    B     2  False
2    C     3   True
5    D     4  False
Data loaded successfully from: your_data_with_duplicates.csv

Number of duplicate rows found: 4

Example of duplicate rows:
  col1  col2   col3
0    A     1   True
1    B     2  False
3    A     1   True
4    B     2  False

Note: 'keep=False' shows all rows that are duplicates.

Number of rows after removing duplicates: 4
Number of duplicate rows removed: 2

Cleaned DataFrame (considering columns: ['col1', 'col2']):
  col1  col2   col3
0    A     1   True
1    B     2  False
2    C     3