# Data-Centric AI/ML: Diabetes Dataset Example


## Overview
In this notebook, we explore the concept of **Data-Centric AI/ML**, where the focus is on improving the quality of the dataset to enhance model performance. Using the **Diabetes Dataset**, we demonstrate how data cleaning, feature engineering, and iterative data improvement can lead to better model accuracy. This approach emphasizes the importance of high-quality data over complex model architectures.


## Learning Objectives
By the end of this notebook, you will:
* Understand the principles of **Data-Centric AI/ML**.
* Learn how to clean and preprocess a biomedical dataset effectively.
* Perform **feature engineering** to create meaningful features.
* Identify and correct **noisy labels** in the dataset.
* Evaluate the impact of data-centric improvements on model performance.


## Prerequisites
Before starting, ensure you have the following:
* Basic knowledge of Python and Pandas.
* Familiarity with machine learning concepts (e.g., classification, Random Forests).
* Libraries installed: `pandas`, `numpy`, `scikit-learn`, `matplotlib`.

## Get Started
Let’s begin by loading the dataset and performing a data-centric workflow. The workflow includes:
* **Data Cleaning**: Handling missing values and outliers.
* **Feature Engineering**: Creating new features like BMI categories.
* **Model Training**: Training a baseline Random Forest model.
* **Data-Centric Iteration**: Identifying and correcting noisy labels to improve model performance.

### Install required packages

In [None]:
# Install essential Python libraries for data analysis, machine learning, and visualization
# - pandas: For data manipulation and handling the diabetes dataset
# - numpy: For numerical operations and array management
# - scikit-learn: For machine learning models (e.g., RandomForestClassifier) and metrics (e.g., accuracy_score)
# - matplotlib: For plotting the model performance comparison
%pip install pandas numpy scikit-learn matplotlib

### Import necessary libraries

In [None]:
# Importing essential libraries
import pandas as pd                # For data manipulation and analysis
import numpy as np                 # For numerical operations and handling arrays

# Importing machine learning libraries
from sklearn.model_selection import train_test_split  # To split data into training and testing sets
from sklearn.ensemble import RandomForestClassifier   # Random Forest algorithm for classification
from sklearn.metrics import accuracy_score            # To evaluate the accuracy of the model

# Importing visualization library
import matplotlib.pyplot as plt    # For plotting graphs and visualizations

### Load the real diabetes dataset

In [None]:
def load_diabetes_data():
    # Define the path to the diabetes dataset (Pima Indians Diabetes Dataset)
    diabetes_data = "../../Data/pima-indians-diabetes.csv"

    # Define the column names for the dataset
    columns = [
        'Pregnancies',               # Number of times pregnant
        'Glucose',                   # Plasma glucose concentration (mg/dL)
        'BloodPressure',             # Diastolic blood pressure (mm Hg)
        'SkinThickness',             # Triceps skinfold thickness (mm)
        'Insulin',                   # 2-Hour serum insulin (mu U/ml)
        'BMI',                       # Body mass index (weight in kg/(height in m)^2)
        'DiabetesPedigreeFunction',  # Diabetes pedigree function (genetic risk)
        'Age',                       # Age in years
        'Outcome'                    # Class variable (0: Non-diabetic, 1: Diabetic)
    ]

    # Load the dataset into a DataFrame
    df = pd.read_csv(
        diabetes_data,   # File path to the CSV data
        header=None,     # No header row in the original file
        names=columns,   # Assign column names defined above
        na_values="?",   # Treat "?" as NaN (this dataset mostly encodes missing values as 0, handled later)
        sep=','          # CSV file uses commas as the delimiter
    )
    
    # Display the shape of the dataset (rows, columns)
    print("Dataset Shape:", df.shape)
    
    # Show the count of missing values in each column
    print("Initial Missing Values:\n", df.isnull().sum())
    
    return df
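
A note on the missing-value count printed by the loader: in this dataset, missing measurements show up as zeros rather than as blank cells or `?`, so `isnull().sum()` will likely report no missing values. The sketch below uses a small made-up frame to show one way to surface them, by marking impossible zeros as `NaN`. The names `toy` and `zero_invalid` and the values in the frame are illustrative assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for a few rows of the real data (values are made up)
toy = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "Insulin": [0, 0, 94, 168],
    "Pregnancies": [6, 0, 8, 1],  # zero is a valid value in this column
})

# Columns where a zero cannot be a real measurement (illustrative subset)
zero_invalid = ["Glucose", "Insulin"]

# Mark impossible zeros as NaN so pandas counts them as missing
marked = toy.copy()
marked[zero_invalid] = marked[zero_invalid].replace(0, np.nan)

missing_before = toy.isnull().sum()    # zeros are invisible here
missing_after = marked.isnull().sum()  # zeros now show up as missing
print("Before marking:\n", missing_before)
print("After marking:\n", missing_after)
```

Once the zeros are marked this way, the usual pandas tools (`isnull`, `fillna`, `median`, which skips `NaN` by default) operate on the valid readings only.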

### Data-Centric Workflow with Synthetic Noise

In [None]:
# Data-Centric Workflow with Synthetic Noise
def data_centric_workflow():
    # Load the raw diabetes dataset (cleaning happens in Step 1 below)
    df = load_diabetes_data()
    
    # Step 1: Data Cleaning
    # Replace zeros with the median of the non-zero values in columns where zero is not a valid measurement
    for col in ['Glucose', 'BloodPressure', 'BMI', 'SkinThickness', 'Insulin']:
        zero_count = (df[col] == 0).sum()  # Count zeros in the column
        print(f"Zeros in {col}: {zero_count}")
        col_median = df.loc[df[col] != 0, col].median()  # Median of valid (non-zero) readings only
        df[col] = df[col].replace(0, col_median)  # Replace zeros with that median
    
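    # Remove extreme outliers for 'BMI' and 'BloodPressure' columns
    # .copy() makes the filtered frame independent, avoiding SettingWithCopyWarning on later assignments
    df = df[(df['BMI'] <= 60) & (df['BloodPressure'] <= 200)].copy()
    print("\nAfter Cleaning Shape:", df.shape)  # Display new shape after cleaning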
    
    # Step 2: Feature Engineering
    # Create a new categorical feature based on BMI ranges
    df['BMI_category'] = pd.cut(
        df['BMI'], 
        bins=[0, 18.5, 25, 30, 100],  # Define bins for BMI categories
        labels=['underweight', 'normal', 'overweight', 'obese'],  # Category labels
        right=False  # Left-closed bins, so a BMI of exactly 25 counts as overweight
    )
    
    # Step 3: Introduce Synthetic Label Noise
    np.random.seed(42)  # Set seed for reproducibility
    noise_idx = np.random.choice(df.index, size=20, replace=False)  # Select 20 random indices
    df.loc[noise_idx, 'Outcome'] = 1 - df.loc[noise_idx, 'Outcome']  # Flip 0 to 1 and 1 to 0
    print("\nIntroduced synthetic noise to 20 labels.")
    
    # Step 4: Initial Model Training
    # Prepare features and labels
    X = df.drop(['Outcome', 'BMI_category'], axis=1)  # Drop target and new feature for model input
    y = df['Outcome']  # Target variable
    
    # Split data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    # Train a Random Forest model
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    
    # Evaluate initial model performance
    initial_pred = model.predict(X_test)
    initial_acc = accuracy_score(y_test, initial_pred)
    print("\nInitial Model Accuracy (with noise):", initial_acc)
    
    # Step 5: Iterative Data Improvement - Detect and Fix Noisy Labels
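    # Use model predictions to identify potentially mislabeled data
    # (the forest has already seen the training rows, so most disagreements
    # are likely to come from the test rows)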
    full_pred = model.predict(X)
    
    # Define a rule for suspicious labels:
    # If the model predicts 0 but the label is 1, and glucose is in the lowest quartile
    suspicious_idx = df[
        (full_pred == 0) & 
        (df['Outcome'] == 1) & 
        (df['Glucose'] < df['Glucose'].quantile(0.25))
    ].index
    print("Suspicious Labels Found:", len(suspicious_idx))
    
    # Correct suspicious labels
    if len(suspicious_idx) > 0:
        df.loc[suspicious_idx, 'Outcome'] = 0  # Change labels from 1 to 0 based on rule
        print(f"Corrected {len(suspicious_idx)} labels from 1 to 0.")
    else:
        print("No suspicious labels found with current rule.")
    
    # Retrain the model with the improved dataset
    X = df.drop(['Outcome', 'BMI_category'], axis=1)
    y = df['Outcome']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    model.fit(X_train, y_train)
    
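    # Evaluate the retrained model
    # Note: y_test now comes from the corrected labels, so this score is measured
    # against a slightly different target than the initial accuracy above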
    improved_pred = model.predict(X_test)
    improved_acc = accuracy_score(y_test, improved_pred)
    print("Improved Model Accuracy:", improved_acc)
    
    # Visualize the improvement in model accuracy
    plt.bar(['Initial (Noisy)', 'Improved'], [initial_acc, improved_acc])
    plt.ylim(0, 1)  # Set y-axis limits for clarity
    plt.ylabel('Accuracy')  # Label y-axis
    plt.title('Model Performance Before and After Data-Centric Improvement')
    plt.show()
    
    return df, model
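
The BMI binning in Step 2 depends on how `pd.cut` treats bin edges. By default bins are closed on the right, so a BMI of exactly 25 lands in the lower category; passing `right=False` closes them on the left instead. The sketch below compares the two on a few hand-picked boundary values (the sample values and variable names are illustrative assumptions).

```python
import pandas as pd

# Hand-picked values that sit on or near the bin edges
bmi = pd.Series([18.5, 24.9, 25.0, 30.0])
bins = [0, 18.5, 25, 30, 100]
labels = ["underweight", "normal", "overweight", "obese"]

# Default: right-closed bins, e.g. (18.5, 25]
right_closed = pd.cut(bmi, bins=bins, labels=labels).tolist()

# right=False: left-closed bins, e.g. [18.5, 25)
left_closed = pd.cut(bmi, bins=bins, labels=labels, right=False).tolist()

print("right-closed:", right_closed)
print("left-closed: ", left_closed)
```

Which convention is appropriate depends on the category definition you want to follow; the point is to choose it deliberately rather than rely on the default.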

### Run the workflow

In [None]:
# Entry point of the script
if __name__ == "__main__":
    # Execute the data-centric machine learning workflow
    # The function returns the cleaned DataFrame and the trained model
    cleaned_df, final_model = data_centric_workflow()

**Suspicious Labels**: The rule typically flags a small number of labels (roughly 5-15; the exact count depends on the split and random seed), some of which overlap with the synthetic noise.
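
To put a number on that overlap, one option is to compare the flagged indices with the indices where noise was injected. The helper below is a minimal sketch; `noise_detection_report` is a hypothetical name, and the index values in the example call are made up rather than taken from a real run.

```python
def noise_detection_report(flagged_idx, noise_idx):
    """Compare flagged row indices with the indices where labels were flipped."""
    flagged, noisy = set(flagged_idx), set(noise_idx)
    hits = flagged & noisy  # flagged rows that really were flipped
    precision = len(hits) / len(flagged) if flagged else 0.0
    recall = len(hits) / len(noisy) if noisy else 0.0
    return {"hits": len(hits), "precision": precision, "recall": recall}

# Made-up indices for illustration only
report = noise_detection_report(
    flagged_idx=[3, 7, 11, 20],
    noise_idx=[7, 20, 42, 56, 90],
)
print(report)
```

In the workflow, `suspicious_idx` and `noise_idx` could be passed in directly. Low precision would suggest the rule is also rewriting labels that were never flipped.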

**Accuracy Improvement**: You should see a small increase in accuracy (e.g., from 0.68 to 0.72; exact values depend on the split and random seed), illustrating the value of data refinement.
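
One way to make the before/after comparison stricter is to keep the test labels untouched: split first, inject and correct noise only in the training labels, and flag suspicious rows with out-of-fold predictions so the model never scores rows it was trained on. The sketch below illustrates this variant on synthetic data from `make_classification`, not on the diabetes dataset. The 0.8 confidence threshold, the 10% noise rate, and all variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict, train_test_split

# Synthetic stand-in data (not the diabetes dataset)
X, y = make_classification(n_samples=600, n_features=8, n_informative=5, random_state=42)

# Split first so the test labels stay clean throughout
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Flip 10% of the training labels only
rng = np.random.default_rng(42)
flip_pos = rng.choice(len(y_train), size=48, replace=False)
y_noisy = y_train.copy()
y_noisy[flip_pos] = 1 - y_noisy[flip_pos]

model = RandomForestClassifier(n_estimators=100, random_state=42)

# Out-of-fold probabilities: each row is scored by a model that did not train on it
oof_proba = cross_val_predict(model, X_train, y_noisy, cv=5, method="predict_proba")
oof_pred = oof_proba.argmax(axis=1)
confidence = oof_proba.max(axis=1)

# Flag rows where a confident out-of-fold prediction disagrees with the label
suspicious = (oof_pred != y_noisy) & (confidence >= 0.8)
y_fixed = y_noisy.copy()
y_fixed[suspicious] = oof_pred[suspicious]

# Both scores are measured against the same clean test labels
noisy_acc = accuracy_score(y_test, model.fit(X_train, y_noisy).predict(X_test))
fixed_acc = accuracy_score(y_test, model.fit(X_train, y_fixed).predict(X_test))
print(f"Flagged {suspicious.sum()} training labels")
print(f"Accuracy with noisy labels: {noisy_acc:.3f}")
print(f"Accuracy after correction:  {fixed_acc:.3f}")
```

Because both models are scored on the same unmodified `y_test`, any change in accuracy reflects the training data rather than a change in the evaluation target. The correction may or may not help on a given dataset, which is itself useful to find out.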

## Conclusion
In this notebook, we demonstrated how a data-centric approach can improve model performance by focusing on:
* **Iterative Data Improvement**: The code detects and corrects noisy labels, then retrains the model to show the effect on performance.
* **Label Consistency and Noise Reduction**: Synthetic noise is introduced and then mitigated, mimicking real-world data imperfections.
* **Data Quality Over Model Complexity**: The focus remains on fixing the data, not tweaking the model.
* **Domain Knowledge Integration**: The rule uses Glucose (a key diabetes indicator) and model predictions, reflecting biomedical intuition.
* **Quantifying Improvements**: The accuracy increase and plot clearly show the impact of data-centric changes.

## Clean up

Remember to shut down your Jupyter Notebook environment and delete any unnecessary files or resources once you've completed the tutorial.