**Table of contents**<a id='toc0_'></a>    
- [Data Ingestion](#toc1_)    
- [Data Cleaning](#toc2_)    
  - [Handling Missing Data and Duplicates Documentation](#toc2_1_)    
- [Data Quality Checks Documentation](#toc3_)    
    - [The data quality checks phase ensures the dataset is accurate, consistent, and logically sound for analysis.](#toc3_1_1_)    
- [Data Quality Report](#toc4_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# <a id='toc1_'></a>[Data Ingestion](#toc0_)

<a id='toc1_1_1_'></a>[The loading stage focuses on ensuring data is correctly ingested with proper data types and meaningful column names.](#toc0_)

**Best Practices Followed:**
1. **Defining Data Types:**
   - A `dtype_dict` is created to explicitly specify data types for each column in the dataset. This improves memory efficiency and ensures correct data interpretation.
   - Examples include treating `gender` as a categorical variable and `bmi` as a float for precise numerical analysis.

2. **Using a Column Lookup Table:**
   - A dictionary (`column_lookup`) is used to rename columns to more descriptive and meaningful names. This makes the dataset easier to understand and work with.

3. **Efficient Loading:**
   - The dataset is loaded using `pd.read_csv` with the `dtype_dict`, minimizing post-load type conversions and errors.

4. **Validation:**
   - The data types and the first few rows are printed to verify successful loading and renaming.


In [None]:
import json
import pandas as pd
import numpy as np

from utils import data_path, list_data_files

# Prin the list of data files
print(list_data_files())

# Define the path to the raw data file
raw_data_path =  data_path()+ "\\" + 'raw\\sample_set_1 1-1.csv'

In [None]:
# Define the data types and column renaming mappings
dtype_dict = {
    "MEMBER_CODE": "int64",
    "Age": "int64",
    "GENDER": "category",
    "POLICY_NO": "int64",
    "CMS_Score": "int64",
    "ICD_CODE": "string",
    "ICD_desc": "string",
    "City": "string",
    "CLAIM_TYPE": "category",
    "BMI": "float64"
}

column_lookup = {
    "MEMBER_CODE": "member_code",
    "Age": "age",
    "GENDER": "gender",
    "POLICY_NO": "policy_number",
    "CMS_Score": "cms_score",
    "ICD_CODE": "icd_code",
    "ICD_desc": "icd_description",
    "City": "city",
    "CLAIM_TYPE": "claim_type",
    "BMI": "bmi"
}

# Load the dataset with specified data types and rename columns in one step
raw_data = pd.read_csv(raw_data_path, dtype=dtype_dict).rename(columns=column_lookup)

# Display data types and first few rows to confirm
print(f"Data types after loading:\n{raw_data.dtypes}\n")
print("First few rows of the data:\n", raw_data.head())

# Save the cleaned data to Parquet format for efficient storage and type preservation
raw_data.to_parquet("..\\data\\raw\\diabetic_claims.parquet", index=False)


In [None]:
display(raw_data.shape)
display(raw_data.info())
display(raw_data.head())
display(raw_data.tail())
display(raw_data.describe())
display(raw_data.isnull().sum())
display(raw_data.nunique())

In [None]:
# Displayt the total null values in the dataset
display(raw_data.isnull().sum())


# <a id='toc2_'></a>[Data Cleaning](#toc0_)
## <a id='toc2_1_'></a>[Handling Missing Data and Duplicates Documentation](#toc0_)

1. **Identifying Missing Data:**
   - A summary of missing values is generated to identify columns with missing entries.

2. **Handling Missing Data:**
   - Missing values in the `city` column are replaced with "Unknown" as an example strategy.
      - there are `4700` missing values for `city`  
   - Other strategies can include imputation or dropping rows/columns based on context.

3. **Checking for Duplicates:**
   - Duplicate rows are identified, counted, and removed to ensure data uniqueness and prevent bias. 
      - There are `2` duplicate rows
   - Identified that `claim_type` is a duplicate of another row (I and O values). To test this we check if rows with an I value have a identical row with an O value
      - Number of rows with 'I' value that have identical rows with 'O' value: 34233 out of 34235

4. **Validation:**
   - After handling missing data and duplicates, the data is re-checked to confirm integrity.


In [None]:
# Handle missing values: summary and filling city missing values with 'Unknown'
missing_summary = raw_data.isnull().sum()
print(f"Missing Values Summary:\n{missing_summary}\n")

# Fill missing city values with 'Unknown'
raw_data['city'].fillna('Unknown', inplace=True)


In [None]:
# Check for and remove duplicates
duplicate_count = raw_data.duplicated().sum()
print(f"Number of duplicate rows: {duplicate_count}")

# Display duplicate rows if they exist, then remove duplicates
if duplicate_count > 0:
    print("Duplicate rows:\n", raw_data[raw_data.duplicated()])
    raw_data.drop_duplicates(inplace=True)

# Verify changes after handling missing values and duplicates
print("Data after handling missing values and duplicates:\n", raw_data.head())


In [None]:
# Separate the DataFrame into two subsets: one with 'I' and one with 'O'
df_I = raw_data[raw_data['claim_type'] == 'I']
df_O = raw_data[raw_data['claim_type'] == 'O']

# Merge the subsets on all columns except 'claim_type' to find identical rows
common_columns = [col for col in raw_data.columns if col != 'claim_type']
merged_df = pd.merge(df_I, df_O, on=common_columns, suffixes=('_I', '_O'))

# Identify and count identical rows
identical_rows = merged_df[common_columns]
num_identical_rows = len(identical_rows)

# Display the results
print(f"Identical rows with 'I' and 'O' claim_type:\n{identical_rows}")
print(f"Number of identical rows: {num_identical_rows} out of {len(df_I)}")


# <a id='toc3_'></a>[Data Quality Checks Documentation](#toc0_)

### <a id='toc3_1_1_'></a>[The data quality checks phase ensures the dataset is accurate, consistent, and logically sound for analysis.](#toc0_)

**Steps to be Implemented:**

1. **Logical Value Checks:**
   - Validate values within logical ranges for key variables:
     - `age`: Ensure all ages are within a plausible range (e.g., 0-120).
     - `bmi`: Confirm all BMI values fall within a reasonable range (e.g., 10-80).
     - `gender`: Ensure only valid values are present (e.g., "M", "F").

2. **Consistency Check for Member Data:**
   - Verify that variables that should not change across rows for a member (e.g., `member_code`, `gender`) are consistent.
   - We identified that the member_count is does not represent a unique individual, this has been confirmed by conducting a member count grouped by policy_number, member_code, age and gender (show table). It is plausable that the member code represents a famliy unit with multiple individuals when applicable.
   - This has been flagged to create a unique identifier for an individual.
   - In addition to create a feature table indicating the family unit size across policy_number and member_code for further analysis. 

3. **ICD Code and Description Lookup:**
   - Create a unique lookup table of `icd_code` and `icd_description` to identify duplicate or erroneous mappings.

4. **Save Intermediate Quality Report:**
   - Document any issues found during checks and save.

5.  **Addressing the quality issues of the data:**
   - The `non_unique_member_codes.csv` file documents the identified `member_code` issues.
   - To adress this we will create a unique ID by grouping `policy_number`, `member_code`, `gender`, and `age` to differentiate separate individuals.


In [None]:
# 1 - Logical checks for invalid age, BMI, and gender

# Check for invalid age
invalid_age_rows = raw_data[~raw_data['age'].between(0, 120)]
if not invalid_age_rows.empty:
    print(f"\nNumber of rows with invalid age values: {invalid_age_rows.shape[0]}")
    print(f"Rows with invalid age values:\n{invalid_age_rows[['age', 'member_code']]}\n")

# Check for invalid BMI
invalid_bmi_rows = raw_data[~raw_data['bmi'].between(10, 80)]
if not invalid_bmi_rows.empty:
    print(f"\nNumber of rows with extreme BMI values: {invalid_bmi_rows.shape[0]}")
    print(f"Rows with extreme BMI values:\n{invalid_bmi_rows[['bmi', 'member_code', 'policy_number']]}\n")
    print(f"Rows with extreme BMI values:\n{invalid_bmi_rows[['bmi', 'member_code', 'policy_number']].drop_duplicates()}\n")
    print(f"Unique Rows with extreme BMI values:\n{invalid_bmi_rows.drop_duplicates()}\n")

# Check for invalid gender
valid_genders = ["M", "F"]
invalid_gender_rows = raw_data[~raw_data['gender'].isin(valid_genders)]
if not invalid_gender_rows.empty:
    print(f"\nNumber of rows with invalid gender values: {invalid_gender_rows.shape[0]}")
    print(f"Rows with invalid gender values:\n{invalid_gender_rows[['gender', 'member_code']]}\n")


In [None]:
# 2- Check for inconsistent gender in member_code
inconsistent_gender_members = raw_data.groupby('member_code').filter(lambda group: group['gender'].nunique() > 1)

# Display inconsistent rows
if not inconsistent_gender_members.empty:
    print("Inconsistent rows due to gender:")
    print(inconsistent_gender_members[['member_code', 'gender']].drop_duplicates())


In [None]:
# 3- Checking the frequency of member counts per policy number, member code, gender, and age
member_table = raw_data[['policy_number', 'member_code', 'gender', 'age']].drop_duplicates()

# Group by policy_number and member_code, then count the members
counts_of_member = member_table.groupby(['policy_number', 'member_code']).size() \
    .value_counts().reset_index(name='frequency') \
    .rename(columns={'index': 'member_count'})

# Display the result
display(counts_of_member)

# Identify and display duplicates based on member_code
duplicates = member_table[member_table.duplicated(subset=['member_code'], keep=False)]
display(duplicates.sort_values(by='member_code'))


In [None]:
# Create a unique ICD lookup table by selecting distinct icd_code and icd_description
icd_lookup = raw_data[['icd_code', 'icd_description']].drop_duplicates()

# Check for duplicate ICD codes
duplicate_icd = icd_lookup['icd_code'].duplicated().sum()
print(f"Checking for duplicates: Number of duplicate ICD codes: {duplicate_icd}")

# Check for missing ICD descriptions
missing_icd_desc = icd_lookup['icd_description'].isnull().sum()
print(f"Checking for missing values: Number of missing ICD descriptions: {missing_icd_desc}")


In [24]:
raw_data['city'].unique()

<StringArray>
[           'RIYADH',                <NA>,            'JEDDAH',
          'ALKHOBAR',            'MAKKAH',              'HAIL',
             'TABUK',            'MADINA',            'DAMMAM',
              'TAIF',             'YANBU',          'AL-AHASA',
            'ONAIZA',              'Abha',          'BUREIDAH',
    'KHAMIS MUSHAIT',              'ARAR',            'SAKAKA',
             'JIZAN',            'JUBAIL',            'KHAFJI',
        'Al Quraiat',          'AL KHARJ',             'HOFUF',
             'RAFHA',           'MAJMAAH',             'SIHAT',
            'RABEGH',             'SAFWA',              'HAQL',
      'HAFR ALBATEN',             'QATIF',        'BUKAIRIYAH',
            'MAHAIL',             'DHEBA',           'AL BAHA',
              'WAJH',            'NAJRAN',           'AL RUSS',
             'ZULFI',            'BADAIE',             'SABYA',
          'SHAROURA',              'AQIQ',         'AL DWADMI',
     'SABT AL-ALAYA',     

# <a id='toc4_'></a>[Data Quality Report](#toc0_)

In [None]:
# Convert numpy types to native Python types for JSON serialization
def convert_to_serializable(obj):
    if isinstance(obj, (np.int64, np.float64)):
        return int(obj)
    raise TypeError(f"Type {type(obj)} not serializable")

# Create the quality report data
quality_issues = {
    "Duplicate ICD Codes": int(duplicate_icd),
    "Missing ICD Descriptions": int(missing_icd_desc),
    "Invalid/Extreme Age Rows": invalid_age_rows[['age', 'member_code']].to_dict(orient='records') if not invalid_age_rows.empty else [],
    "Invalid/Extreme BMI Rows": invalid_bmi_rows[['bmi', 'member_code']].to_dict(orient='records') if not invalid_bmi_rows.empty else [],
    "Members code with inconsistent Gender": inconsistent_gender_members.index.tolist() if not inconsistent_gender_members.empty else [],
}

# Save the quality issues to a JSON file
report_path = '..\\reports\\raw_data_quality_report.json'
with open(report_path, 'w') as f:
    json.dump(quality_issues, f, indent=4, default=convert_to_serializable)

print(f"Quality report saved as {report_path}")


In [None]:
# Create a copy of the raw data for intermediate processing
intermediate_data = raw_data.copy()

# Specify the path to save the intermediate data
intermediate_data_path = f"{data_path()}\\intermediate\\intermediate_data.parquet"

# Save the intermediate data to a Parquet file
intermediate_data.to_parquet(intermediate_data_path, index=False)

print(f"Intermediate data saved to {intermediate_data_path}")
