# Data Ingestion

### The loading stage focuses on ensuring data is correctly ingested with proper data types and meaningful column names.

**Best Practices Followed:**
1. **Defining Data Types:**
   - A `dtype_dict` is created to explicitly specify data types for each column in the dataset. This improves memory efficiency and ensures correct data interpretation.
   - Examples include treating `gender` as a categorical variable and `bmi` as a float for precise numerical analysis.

2. **Using a Column Lookup Table:**
   - A dictionary (`column_lookup`) is used to rename columns to more descriptive and meaningful names. This makes the dataset easier to understand and work with.

3. **Efficient Loading:**
   - The dataset is loaded using `pd.read_csv` with the `dtype_dict`, minimizing post-load type conversions and errors.

4. **Validation:**
   - The data types and the first few rows are printed to verify successful loading and renaming.
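
One way to make the validation step stricter is to compare the declared column names against the file header before loading. The sketch below is a minimal illustration under stated assumptions: `sample_csv`, `sample_dtypes`, and `sample_lookup` are hypothetical stand-ins (an in-memory CSV and a four-column subset of the mappings), not the real dataset.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-in for the raw CSV file.
sample_csv = io.StringIO(
    "MEMBER_CODE,Age,GENDER,BMI\n"
    "101,54,M,27.5\n"
    "102,61,F,31.2\n"
)

# Illustrative subsets of the dtype and renaming dictionaries.
sample_dtypes = {"MEMBER_CODE": "int64", "Age": "int64", "GENDER": "category", "BMI": "float64"}
sample_lookup = {"MEMBER_CODE": "member_code", "Age": "age", "GENDER": "gender", "BMI": "bmi"}

# Read only the header (nrows=0) to compare column names cheaply.
header = pd.read_csv(sample_csv, nrows=0).columns
missing_from_file = sorted(set(sample_dtypes) - set(header))
untyped_in_file = sorted(set(header) - set(sample_dtypes))
print("Typed columns missing from file:", missing_from_file)
print("File columns without a declared dtype:", untyped_in_file)

# Rewind the buffer, then load with explicit dtypes and rename the columns.
sample_csv.seek(0)
sample = pd.read_csv(sample_csv, dtype=sample_dtypes).rename(columns=sample_lookup)
print(sample.dtypes)
```

If an integer column can contain blanks, `"int64"` will fail at load time; the nullable `"Int64"` dtype is one option in that case.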


In [None]:
import os
import re

import pandas as pd

from utils import data_path, list_data_files

# Print the list of data files
print(list_data_files())

# Define the path to the raw data file (os.path.join avoids hard-coded separators)
raw_data_path = os.path.join(data_path(), "raw", "sample_set_1 1-1.csv")

In [None]:
# Define the data types for each column in the dataset
dtype_dict = {
    "MEMBER_CODE": "int64",    # De-identified member ID, stored as integer
    "Age": "int64",             # Age of the member
    "GENDER": "category",       # Gender is a categorical variable
    "POLICY_NO": "int64",       # Policy number, stored as integer
    "CMS_Score": "int64",       # Charlson comorbidity index score, stored as integer
    "ICD_CODE": "category",     # ICD-10 codes are categorical
    "ICD_desc": "string",       # ICD-10 description as a string
    "City": "string",           # City as a string, handling missing values separately
    "CLAIM_TYPE": "category",   # Claim type is categorical
    "BMI": "float64"            # BMI as a float
}

# Column renaming lookup table
column_lookup = {
    "MEMBER_CODE": "member_code",
    "Age": "age",
    "GENDER": "gender",
    "POLICY_NO": "policy_number",
    "CMS_Score": "cms_score",
    "ICD_CODE": "icd_code",
    "ICD_desc": "icd_description",
    "City": "city",
    "CLAIM_TYPE": "claim_type",
    "BMI": "bmi"
}


# Load the dataset with specified data types
raw_data = pd.read_csv(raw_data_path, dtype=dtype_dict)

# Rename columns using the lookup table
raw_data.rename(columns=column_lookup, inplace=True)

# Verify the data types after loading
print(raw_data.dtypes)

# Display the first few rows to confirm successful loading
print(raw_data.head())

# Save the typed and renamed data to Parquet format for efficient storage and column type preservation
raw_data.to_parquet("..\\data\\raw\\diabetic_claims.parquet", index=False)


# Data Cleaning
## Handling Missing Data and Duplicates Documentation

1. **Identifying Missing Data:**
   - A summary of missing values is generated to identify columns with missing entries.

2. **Handling Missing Data:**
   - Missing values in the `city` column are replaced with "Unknown" as an example strategy.
      - There are `4700` missing values in `city`.
   - Other strategies can include imputation or dropping rows/columns based on context.

3. **Checking for Duplicates:**
   - Duplicate rows are identified, counted, and removed to ensure data uniqueness and prevent bias. 
      - There are `2` duplicate rows
   - Rows appear to be duplicated across `claim_type`: most rows with the value `I` have an otherwise identical row with the value `O`. To test this, we check whether each row with an `I` value has an identical row with an `O` value.
      - Number of rows with 'I' value that have identical rows with 'O' value: 34233 out of 34235

4. **Validation:**
   - After handling missing data and duplicates, the data is re-checked to confirm integrity.
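
The alternative strategies mentioned above (imputation, dropping rows) can be sketched as follows. This is only an illustration on a hypothetical toy frame; `toy` and its values are invented, and whether median imputation is appropriate for `bmi` depends on the analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in a text column and a numeric column.
toy = pd.DataFrame({
    "city": ["CityA", None, "CityB", None],
    "bmi": [27.5, np.nan, 31.0, 24.5],
})

# Strategy 1: constant fill for a text column (the approach used for `city`).
filled_city = toy["city"].fillna("Unknown")

# Strategy 2: median imputation for a numeric column.
median_bmi = toy["bmi"].median()
imputed_bmi = toy["bmi"].fillna(median_bmi)

# Strategy 3: drop rows where a required field is missing.
complete_rows = toy.dropna(subset=["bmi"])

print(filled_city.tolist())
print(imputed_bmi.tolist())
print(len(complete_rows))
```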


In [None]:
# Handle missing values
missing_summary = raw_data.isnull().sum()
print("Missing Values Summary:\n", missing_summary)

# Filling missing city values with 'Unknown'
raw_data["city"] = raw_data["city"].fillna("Unknown")


In [None]:
# Check and document duplicates
duplicate_count = raw_data.duplicated().sum()
display(f"Number of duplicate rows: {duplicate_count}")

# Print duplicate rows if they exist
if duplicate_count > 0:
    display("Duplicate rows:", raw_data[raw_data.duplicated()])

# Remove duplicate rows
raw_data = raw_data.drop_duplicates()

# Verify changes after handling missing values and duplicates
display("Data after handling missing values and duplicates:", raw_data.head())

In [None]:
# Test whether rows with claim_type 'I' have an otherwise identical row with claim_type 'O'

# Separate the DataFrame into two subsets: one with 'I' and one with 'O'
df_I = raw_data[raw_data['claim_type'] == 'I']
df_O = raw_data[raw_data['claim_type'] == 'O']

# Merge the subsets on all columns except 'claim_type' to find identical rows
common_columns = [col for col in raw_data.columns if col != 'claim_type']
merged_df = pd.merge(df_I, df_O, on=common_columns, suffixes=('_I', '_O'))

# Check if rows with 'I' have an identical row with 'O'
identical_rows = merged_df[common_columns]

# Quantify the number of identical rows
num_identical_rows = len(identical_rows)

# Print the result
display("Identical rows with 'I' and 'O' claim_type:")
display(identical_rows)
display(f"Number of rows with 'I' value that have identical rows with 'O' value: {num_identical_rows} out of {len(df_I)}")
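
The inner merge above counts the matched rows but does not show which `I` rows lack an `O` counterpart. A possible way to isolate them is a left merge with `indicator=True`. The sketch below uses a hypothetical toy frame (`claims`) rather than the real data, and the column set is reduced for brevity.

```python
import pandas as pd

# Hypothetical toy claims: members 1 and 2 have both 'I' and 'O' rows, member 3 only 'I'.
claims = pd.DataFrame({
    "member_code": [1, 1, 2, 2, 3],
    "icd_code": ["E11", "E11", "I10", "I10", "E11"],
    "claim_type": ["I", "O", "I", "O", "I"],
})

key_columns = [col for col in claims.columns if col != "claim_type"]
rows_i = claims[claims["claim_type"] == "I"]
rows_o = claims[claims["claim_type"] == "O"]

# Left merge keeps every 'I' row; `_merge` records whether an 'O' match exists.
matched = rows_i.merge(
    rows_o[key_columns].drop_duplicates(),
    on=key_columns,
    how="left",
    indicator=True,
)

# 'left_only' marks 'I' rows with no identical 'O' row.
unmatched_i = matched[matched["_merge"] == "left_only"]
print(unmatched_i[key_columns])
```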

# Data Quality Checks Documentation

### The data quality checks phase ensures the dataset is accurate, consistent, and logically sound for analysis. 

**Steps to be Implemented:**

1. **Logical Value Checks:**
   - Validate values within logical ranges for key variables:
     - `age`: Ensure all ages are within a plausible range (e.g., 0-120).
     - `bmi`: Confirm all BMI values fall within a reasonable range (e.g., 10-80).
     - `gender`: Ensure only valid values are present (e.g., "M", "F").

2. **Consistency Check for Member Data:**
   - Verify that variables that should not change across rows for a member (e.g., `member_code`, `gender`) are consistent.
   - We identified that `member_code` does not represent a unique individual. This was confirmed by counting members grouped by `policy_number`, `member_code`, `age`, and `gender` (see the `member_count` frequency table below). It is plausible that a member code represents a family unit with multiple individuals where applicable.
   - This has been flagged so that a unique identifier can be created for each individual.
   - In addition, a feature table indicating the family unit size per `policy_number` and `member_code` will be created for further analysis.

3. **ICD Code and Description Lookup:**
   - Create a unique lookup table of `icd_code` and `icd_description` to identify duplicate or erroneous mappings.

4. **Save Intermediate Quality Report:**
   - Document any issues found during checks and save.

5. **Addressing the data quality issues:**
   - The `non_unique_member_codes.csv` file documents the identified `member_code` issues.
   - To address this, we will create a unique ID by grouping `policy_number`, `member_code`, `gender`, and `age` to differentiate separate individuals.
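
The unique ID and the family unit size table described above could be built roughly as follows. This is a sketch under assumptions, not the notebook's actual implementation: the `rows` frame is hypothetical toy data, and the names `individual_id` and `family_unit_size` are illustrative.

```python
import pandas as pd

# Hypothetical toy rows: member_code 7 on policy 10 covers two different people.
rows = pd.DataFrame({
    "policy_number": [10, 10, 10, 20, 20],
    "member_code": [7, 7, 7, 7, 8],
    "gender": ["M", "F", "M", "F", "F"],
    "age": [70, 66, 70, 40, 35],
})

id_columns = ["policy_number", "member_code", "gender", "age"]

# One integer label per distinct combination of the identifying columns.
# Note: if a grouping column is categorical, consider passing observed=True.
rows["individual_id"] = rows.groupby(id_columns, sort=False).ngroup()

# Family unit size: distinct individuals per (policy_number, member_code).
family_size = (
    rows.drop_duplicates(subset=id_columns)
        .groupby(["policy_number", "member_code"])
        .size()
        .reset_index(name="family_unit_size")
)

print(rows)
print(family_size)
```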


In [None]:
# 1 - Logical checks
invalid_age_rows = raw_data[~raw_data['age'].between(0, 120)]
assert invalid_age_rows.empty, f"Age values that are not logical: {invalid_age_rows[['age', 'member_code']].to_dict(orient='records')}"

invalid_bmi_rows = raw_data[~raw_data['bmi'].between(10, 80)]
assert invalid_bmi_rows.empty, f"BMI values that are extreme: {invalid_bmi_rows[['bmi', 'member_code']].to_dict(orient='records')}"


In [94]:
# 2- Consistency checks for member_code gender
valid_genders = ["M", "F"]
assert raw_data['gender'].isin(valid_genders).all(), "Invalid gender values found."

# Group by member_code and check for consistent gender
member_gender_consistency = raw_data.groupby(['member_code'])['gender']

# Identify inconsistencies
inconsistent_rows = member_gender_consistency.transform(lambda group: group.nunique() > 1)
inconsistent_gender_members = raw_data[inconsistent_rows][['member_code', 'gender']].drop_duplicates()

print("Inconsistent rows due to gender:")
print(inconsistent_gender_members)

Inconsistent rows due to gender:
            member_code gender
121        720035000000      M
131        720035000000      F
144        760902000000      M
176        760902000000      F
244       1000110000000      F
...                 ...    ...
173196    2502550000000      F
173476    2530880000000      M
173477    2530880000000      F
173577  247070000000000      M
173583  247070000000000      F

[5738 rows x 2 columns]


In [None]:
# 3- checking the frequency of member counts per policy number, member code, gender and age
member_table = raw_data[['policy_number', 'member_code', 'gender', 'age']].drop_duplicates()
grouped_counts = member_table.groupby(['policy_number', 'member_code']).size().reset_index(name='member_count')

# Create a table of the counts of counts
counts_of_counts = grouped_counts['member_count'].value_counts().reset_index()
counts_of_counts.columns = ['member_count', 'frequency']
counts_of_counts

   member_count  frequency
0             1      13555
1             2       1678
2             3        271
3             4        132
4             5         55
5             6         14
6             7          6
7             8          4
8             9          1


In [104]:
duplicates = member_table[member_table.duplicated(subset=['member_code'], keep=False)]
duplicates

        policy_number      member_code gender  age
121          26730932     720035000000      M   78
131          26730932     720035000000      F   72
144          26730932     760902000000      M   71
176          26730932     760902000000      F   67
244         202320252    1000110000000      F   58
...               ...              ...    ...  ...
173196       30894217    2502550000000      F   56
173476       26730932    2530880000000      M   68
173477       26730932    2530880000000      F   66
173577       26955367  247070000000000      M   73


In [None]:
# Create unique lookup table
icd_lookup = raw_data[['icd_code', 'icd_description']].drop_duplicates()

# Check for duplicates or missing values
duplicate_icd = icd_lookup['icd_code'].duplicated().sum()
print(f"Number of duplicate ICD codes: {duplicate_icd}")
missing_icd_desc = icd_lookup['icd_description'].isnull().sum()
print(f"Number of missing ICD descriptions: {missing_icd_desc}")
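
The duplicate count above says how many extra code rows exist, but not which codes are affected. A possible follow-up is to list the codes that map to more than one description. The sketch below uses a hypothetical toy frame (`icd_pairs`) with illustrative values, not the real lookup.

```python
import pandas as pd

# Hypothetical toy pairs: one code with two spellings, one code with no description.
icd_pairs = pd.DataFrame({
    "icd_code": ["E11.9", "E11.9", "I10", "E66.9"],
    "icd_description": [
        "Type 2 diabetes mellitus without complications",
        "Type 2 diabetes w/o complications",
        "Essential (primary) hypertension",
        None,
    ],
})

lookup = icd_pairs.drop_duplicates()

# Count distinct non-missing descriptions per code (nunique skips missing values).
descriptions_per_code = lookup.groupby("icd_code")["icd_description"].nunique()
ambiguous_codes = descriptions_per_code[descriptions_per_code > 1].index.tolist()

# Rows to review: every description attached to an ambiguous code.
conflicts = lookup[lookup["icd_code"].isin(ambiguous_codes)]

# Codes whose description is missing.
codes_missing_description = lookup.loc[lookup["icd_description"].isna(), "icd_code"].tolist()

print(ambiguous_codes)
print(conflicts)
print(codes_missing_description)
```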



# Data Quality Report

In [None]:
import json
import numpy as np

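# Convert NumPy scalars to native Python types for JSON serialization
def convert_to_serializable(obj):
    if isinstance(obj, np.integer):
        return int(obj)
    if isinstance(obj, np.floating):
        # Keep floats as floats so values such as BMI are not truncated
        return float(obj)
    raise TypeError(f"Type {type(obj)} not serializable")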


# Save intermediate quality report
quality_issues = {
    "Duplicate ICD Codes": int(duplicate_icd),
    "Missing ICD Descriptions": int(missing_icd_desc),
    "Invalid/Extreme Age Rows": invalid_age_rows[['age', 'member_code']].to_dict(orient='records') if not invalid_age_rows.empty else [],
    "Invalid/Extreme BMI Rows": invalid_bmi_rows[['bmi', 'member_code']].to_dict(orient='records') if not invalid_bmi_rows.empty else [],
    # Store the affected member codes rather than DataFrame row indices
    "Member codes with inconsistent gender": inconsistent_gender_members['member_code'].unique().tolist() if not inconsistent_gender_members.empty else [],
}

with open('..\\reports\\raw_data_quality_report.json', 'w') as f:
    json.dump(quality_issues, f, indent=4, default=convert_to_serializable)

print("Quality report saved as raw_data_quality_report.json in the reports directory.")

In [116]:
# saving the intermediate data
intermediate_data = raw_data.copy()

# specify the path to save the intermediate data
intermediate_data_path = data_path() + '\\intermediate'+'\\intermediate_data.parquet'

# Save the intermediate data to a parquet file
intermediate_data.to_parquet(intermediate_data_path, index=False)