# Data-Centric AI: A Practical Comparison of Poorly-Prepared vs Well-Prepared Data

## Overview

This exercise demonstrates the importance of **data-centric AI principles** by comparing the performance of a machine learning model trained on **poorly-prepared data** (ignoring data-centric practices) versus **well-prepared data** (following data-centric practices). Using the Pima Indians Diabetes dataset, we show how feature engineering, handling class imbalance, data augmentation, and other data-centric steps can improve model performance. The comparison is evaluated using the **F1 score**, the harmonic mean of precision and recall.

## Learning Objectives

By the end of this exercise, you will:

- Understand the key principles of **data-centric AI** and their impact on model performance.
- Learn how to perform **automated feature engineering** using Featuretools.
- Address class imbalance using **SMOTE** (Synthetic Minority Oversampling Technique).
- Generate synthetic data for **data augmentation** to improve model generalization.
- Validate data labels based on **domain-specific rules**.
- Standardize data using **StandardScaler** for better model training.
- Compare the performance of models trained on poorly-prepared and well-prepared data.

## Prerequisites

To follow along with this exercise, you should have:

- Basic knowledge of **Python** and **Pandas** for data manipulation.
- Familiarity with **scikit-learn** for machine learning tasks.
- Understanding of **classification metrics** like F1 score.
- The following Python libraries installed: pandas, numpy, scikit-learn, imbalanced-learn, and featuretools.

## Get Started

Let’s begin by loading the dataset and then working through the data-centric workflow. The exercise covers:

- Loading the dataset
- Training on poorly-prepared data
- Training on well-prepared data
- Comparing performance


### Install required packages

In [None]:
# Install necessary libraries using pip
# pandas: For data manipulation and analysis, especially with DataFrames.
# numpy: For numerical operations and handling multi-dimensional arrays.
# scikit-learn: A comprehensive machine learning library for preprocessing, modeling, and evaluation.
# imbalanced-learn: Provides tools to handle imbalanced datasets, such as oversampling and undersampling techniques.
# featuretools: An automated feature engineering library that helps create new features from existing data.

%pip install pandas numpy scikit-learn imbalanced-learn featuretools

### Import necessary libraries

In [None]:
# Import necessary libraries
import pandas as pd  # For data manipulation and analysis using DataFrames
import numpy as np  # For numerical operations and array handling

# Scikit-learn modules for model building and evaluation
from sklearn.model_selection import train_test_split  # For splitting data into training and testing sets
from sklearn.preprocessing import StandardScaler  # For standardizing features by removing the mean and scaling to unit variance
from sklearn.ensemble import RandomForestClassifier  # Random Forest model for classification tasks
from sklearn.metrics import f1_score  # F1 score metric to evaluate model performance

# Imbalanced-learn module for handling imbalanced datasets
from imblearn.over_sampling import SMOTE  # Synthetic Minority Over-sampling Technique to balance class distribution

# Featuretools for automated feature engineering
import featuretools as ft  

# Suppress warnings to keep the output clean
import warnings
warnings.filterwarnings("ignore", category=UserWarning)  # Ignore UserWarning messages only; other warning types are still shown

### Load biomedical dataset (Pima Indians Diabetes Dataset)

In [None]:
# Load biomedical dataset (Pima Indians Diabetes Dataset)

# Define the path to the dataset and the column names for the dataset
diabetes_data = '../../Data/pima-indians-diabetes.csv'
columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI',
           'DiabetesPedigreeFunction', 'Age', 'Outcome']  # Define column names for the dataset

# Read the CSV file into a pandas DataFrame
# - header=None: Specifies no header in the CSV file (columns are provided manually)
# - names=columns: Assigns custom column names defined above
# - na_values='?': Specifies that '?' represents missing values in the dataset
# - sep=',': Specifies comma as the delimiter
data = pd.read_csv(diabetes_data, header=None, names=columns, na_values='?', sep=',')

# Print the shape of the dataset (number of rows and columns)
print('Dataset Shape:', data.shape)

# Print the number of missing values in each column
print('Initial Missing Values:', data.isnull().sum())

### Demonstrate Poorly-Prepared Data (Ignoring Data-Centric Principles)

In [None]:
# --------------------------
# Poorly-Prepared Data (Ignoring Data-Centric Principles)
# --------------------------

# 1. Skip Data Cleaning: Use raw data (no missing values to handle)

# Separate the features (X) and the target variable (y) from the dataset
# X_poor: Contains all the feature columns (drop the 'Outcome' column)
X_poor = data.drop('Outcome', axis=1)

# y_poor: Contains only the target variable 'Outcome'
y_poor = data['Outcome']


# 2. Skip Feature Engineering: Use raw data without feature engineering
# (No Featuretools or other feature engineering applied)

# 3. Skip Handling Class Imbalance: Use original imbalanced data
# (No SMOTE or other balancing techniques)

# 4. Skip Data Augmentation: Do not generate synthetic data
# (No augmentation applied)

# 5. Skip Data Labeling Quality Check: Do not validate labels
# (Assume labels are correct)

# 6. Skip Data Standardization: Use raw data without scaling
# (No scaling applied)

# Split poorly-prepared data into training and testing sets

# X_train_poor: Features for training the model
# X_test_poor: Features for testing the model
# y_train_poor: Labels for training the model
# y_test_poor: Labels for testing the model

# Using train_test_split to divide the data into training (80%) and testing (20%) sets
X_train_poor, X_test_poor, y_train_poor, y_test_poor = train_test_split(X_poor, y_poor, test_size=0.2, random_state=42)


# Train a Random Forest classifier on poorly-prepared data

# Initialize the Random Forest model with 100 trees and a fixed random seed (for reproducibility)
model_poorly_prepared = RandomForestClassifier(n_estimators=100, random_state=42)

# Fit the model on the training data (X_train_poor and y_train_poor)
# This step trains the model to learn patterns in the data
model_poorly_prepared.fit(X_train_poor, y_train_poor)


# Evaluate poorly-prepared model

# Predict the labels for the test data (X_test_poor) using the trained model
preds_poorly_prepared = model_poorly_prepared.predict(X_test_poor)

# Calculate the F1 score, a measure of the model's performance considering both precision and recall
# The F1 score is especially useful for imbalanced datasets
f1_poorly_prepared = f1_score(y_test_poor, preds_poorly_prepared)

# Print the F1 score with a precision of 4 decimal places
print(f"F1 Score (Poorly-Prepared Data): {f1_poorly_prepared:.4f}")

### Demonstrate Well-Prepared Data (Following Data-Centric Principles)
#### 1. Data Quality Assessment
- **What it does**: Checks for missing values in the dataset.
- **Data-centric aspect**: Ensures the dataset is clean and complete before proceeding. Poor data quality (e.g., missing values) can lead to unreliable models.
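
As a minimal sketch of this check, the snippet below counts explicit `NaN` values per column. It also counts zeros in columns where a zero is physiologically implausible. That second check is an assumption that goes beyond this exercise, although it is a convention often applied to this dataset. The toy DataFrame and the `zero_as_missing` list are illustrative and are not part of the exercise data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the diabetes data (values are made up)
toy = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "BMI": [33.6, 26.6, np.nan, 0.0],
    "Outcome": [1, 0, 1, 0],
})

# Explicit missing values (NaN) per column
nan_counts = toy.isnull().sum()

# Assumption: a zero in these columns is implausible for a living patient,
# so it is counted as a hidden missing value
zero_as_missing = ["Glucose", "BMI"]
zero_counts = (toy[zero_as_missing] == 0).sum()

print("NaN per column:")
print(nan_counts)
print("Zeros treated as missing:")
print(zero_counts)
```

If hidden missing values turn up, one option is to replace them with `NaN` so that later cleaning steps can see them.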

In [None]:
# 1. Data Quality Assessment

# Display a header for the analysis of missing values in the dataset
print("\nWell-Prepared Data: Missing Values Analysis:")

# Calculate and print the total number of missing values in each column of the dataset

# Your code goes here

#### 2. Data Cleaning
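- **What it does**: Fills in missing values, for example with KNN imputation. The code cell below skips this step because the dataset contains no values flagged as missing.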
- **Data-centric aspect**: Ensures the dataset is complete and ready for analysis. Imputation is a data-centric technique to handle missing data without discarding valuable information.
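
For datasets that do contain missing values, KNN imputation could look roughly like the sketch below. It uses scikit-learn's `KNNImputer` on a made-up toy frame. The choice of `n_neighbors=2` is an assumption for this tiny example, not a recommendation.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy data with one missing Glucose value (values are made up)
toy = pd.DataFrame({
    "Glucose": [100.0, 110.0, np.nan, 150.0],
    "BMI": [25.0, 27.0, 26.0, 35.0],
})

# Each missing entry is replaced by the mean of that feature
# over the n_neighbors most similar rows
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(imputed)
```

In a full workflow the imputer would normally be fitted on the training split only and then applied to the test split.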

In [None]:
# 2. Skip Data Cleaning: No missing values to handle
# (No KNN imputation or other cleaning applied)

#### 3. Feature Engineering
- **What it does**: Automatically generates new features from the dataset using Featuretools.
- **Data-centric aspect**: Feature engineering is a core part of data-centric AI. It transforms raw data into meaningful features that improve model performance. Automated feature engineering ensures that the data is represented in a way that captures important patterns.
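
To make the idea of derived features concrete, here is a hand-written sketch in plain pandas. It does not use Featuretools. The feature names `Glucose_x_BMI` and `Glucose_per_Age` are hypothetical, chosen only to illustrate the kind of columns that transform-style feature generation can produce.

```python
import pandas as pd

# Toy subset of feature columns (values are made up)
toy = pd.DataFrame({
    "Glucose": [148, 85],
    "BMI": [33.6, 26.6],
    "Age": [50, 31],
})

# Derive two new columns from existing ones; the target column is
# deliberately left out so that no label information leaks into features
engineered = toy.assign(
    Glucose_x_BMI=toy["Glucose"] * toy["BMI"],
    Glucose_per_Age=toy["Glucose"] / toy["Age"],
)

print(engineered)
```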


In [None]:
# 3. Feature Engineering: Automated feature engineering using Featuretools

# Create an empty EntitySet for the dataset
es = ft.EntitySet(id='diabetes')  # 'id' is a name for the entity set, useful for organizing multiple datasets

# Add the data DataFrame to the EntitySet
es = es.add_dataframe(
    dataframe=data,              # The DataFrame containing the original dataset
    dataframe_name='data',       # Name of the DataFrame within the EntitySet
    index='index',               # Name of the unique row identifier column
    make_index=True              # 'index' does not exist in the data, so Featuretools creates it as an integer column
)

# Generate new features using Deep Feature Synthesis (DFS) in Featuretools
features, feature_defs = ft.dfs(
    entityset=es,                # The EntitySet containing the data
    target_dataframe_name='data',# The name of the DataFrame to generate features for
    max_depth=2,                 # The maximum depth of feature generation (higher values = more complex features)
    verbose=1                    # Displays progress information during feature generation
)

# 'features' is a DataFrame containing the new features for the model
# 'feature_defs' is a list of the generated feature definitions (metadata about the features)

#### 4. Handling Class Imbalance
- **What it does**: Uses SMOTE (Synthetic Minority Oversampling Technique) to balance the dataset by generating synthetic samples for the minority class.
- **Data-centric aspect**: Addresses the issue of imbalanced data, which can bias the model toward the majority class. By balancing the data, the model can learn better from all classes.
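
The core idea behind SMOTE can be sketched in a few lines of NumPy: a synthetic point is placed at a random position on the line segment between a minority sample and one of its neighbors. The sketch below is a simplification that always uses the single nearest neighbor, whereas the imbalanced-learn implementation picks among several nearest neighbors. The function name `smote_like_sample` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy minority-class points with two features (values are made up)
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.5]])

def smote_like_sample(points, rng):
    """Create one synthetic point between a random point and its nearest neighbor."""
    i = rng.integers(len(points))
    # Euclidean distance from point i to every point; exclude point i itself
    dists = np.linalg.norm(points - points[i], axis=1)
    dists[i] = np.inf
    j = np.argmin(dists)
    # Interpolate at a random position along the segment from i to j
    gap = rng.random()
    return points[i] + gap * (points[j] - points[i])

synthetic = np.array([smote_like_sample(minority, rng) for _ in range(5)])
print(synthetic)
```

Because every synthetic point is an interpolation between two real minority samples, it stays inside the region those samples already cover.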


In [None]:

# 4. Handle Class Imbalance: Use SMOTE to balance the dataset

# Separate features and target variable from the generated feature set
X = features.drop('Outcome', axis=1)  # Features (all columns except 'Outcome')
y = features['Outcome']  # Target variable ('Outcome' column)

# Initialize SMOTE for oversampling the minority class

smote = # Your code goes here

# Apply SMOTE to generate synthetic samples for the minority class
X_res, y_res = smote.fit_resample(X, y)

# 'X_res' contains the balanced feature set with synthetic samples added
# 'y_res' is the corresponding balanced target variable

#### 5. Data Augmentation
- **What it does**: Generates synthetic data by adding small amounts of noise to the existing data.
- **Data-centric aspect**: Data augmentation increases the size and diversity of the dataset, which helps the model generalize better. This is especially useful in domains like healthcare where data may be limited.
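
One design choice in noise-based augmentation is the noise scale. A fixed standard deviation has very different effects on features measured in different units. The sketch below shows one possible variant, which is an assumption and not the method used in this exercise: it scales the noise to 1% of each column's standard deviation. The toy values and the 1% factor are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy features: a Glucose-like and a BMI-like column (values are made up)
X = np.array([[148.0, 33.6], [85.0, 26.6], [183.0, 23.3]])

# Assumption: noise standard deviation is 1% of each column's standard deviation
noise_scale = 0.01 * X.std(axis=0)

# Draw standard normal noise and rescale it per column
X_noisy = X + rng.normal(0.0, 1.0, X.shape) * noise_scale

print(X_noisy - X)
```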

In [None]:
# 5. Data Augmentation: Generate synthetic data to enhance model generalization

def medical_augmentation(X, y, multiplier=1):
    """
    Generates synthetic data by adding Gaussian noise to the original features.
    
    Parameters:
    - X (numpy array or DataFrame): The input feature set.
    - y (numpy array or Series): The target variable.
    - multiplier (int): The number of synthetic datasets to generate.
    
    Returns:
    - Augmented feature set (numpy array)
    - Corresponding augmented target variable (numpy array)
    """
    augmented = []  # List to store augmented feature sets
    for _ in range(multiplier):
        # Generate random Gaussian noise with mean 0 and standard deviation 0.01
        noise = np.random.normal(0, 0.01, X.shape)
        
        # Add noise to the original features to create synthetic samples
        augmented.append(X + noise)
    
    # Combine augmented datasets vertically and replicate target labels accordingly
    return np.vstack(augmented), np.concatenate([y] * multiplier)

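# Seed NumPy's random generator so the added noise is reproducible across runs
np.random.seed(42)

# Apply data augmentation with a multiplier of 2 (two noisy copies, doubling the dataset size)
X_aug, y_aug = medical_augmentation(X_res, y_res, multiplier=2)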

# 'X_aug' contains the augmented features
# 'y_aug' contains the corresponding target labels

#### 6. Data Labeling Quality Check
- **What it does**: Identifies and removes invalid labels based on domain-specific rules (e.g., a glucose level < 40 is unlikely to be labeled as diabetic).
- **Data-centric aspect**: Ensures the labels are accurate and consistent with domain knowledge. Poor labeling can lead to incorrect model predictions.
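
A rule of this kind can also be expressed as a boolean mask instead of a row-by-row loop. The following is a minimal sketch on made-up toy data. The threshold of 40 follows the rule stated above, and the names `X_toy`, `y_toy`, and `invalid_mask` are illustrative.

```python
import numpy as np
import pandas as pd

# Toy features and labels (values are made up)
X_toy = pd.DataFrame({"Glucose": [35.0, 120.0, 30.0, 160.0]})
y_toy = np.array([1, 0, 0, 1])

# A sample is flagged when Glucose is abnormally low AND the label says diabetic
invalid_mask = (X_toy["Glucose"] < 40).to_numpy() & (y_toy == 1)

# Keep only the samples that pass the rule
X_valid = X_toy[~invalid_mask]
y_valid = y_toy[~invalid_mask]

print("Flagged rows:", np.flatnonzero(invalid_mask))
```

Only rows that violate both parts of the rule are dropped. A low glucose reading with a non-diabetic label is kept.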

In [None]:
# 6. Data Labeling Quality Check: Validate labels based on domain-specific rules

def validate_labels(X_df, y_labels):
    """
    Identifies invalid labels based on domain knowledge rules.
    
    Parameters:
    - X_df (DataFrame): Feature set as a pandas DataFrame.
    - y_labels (array-like): Corresponding target labels.
    
    Returns:
    - invalid_indices (list): Indices of samples with invalid labels.
    """
    invalid_indices = []  # List to store indices of invalid samples
    
    # Iterate through each row of the feature set
    for idx, row in X_df.iterrows():
        # Domain rule: If Glucose is abnormally low (< 40), 'Outcome' (label) shouldn't indicate diabetes (1)
        
        # Your code goes here
            invalid_indices.append(idx)  # Add index to invalid list if rule is violated
    
    return invalid_indices

# Convert augmented feature array back to a DataFrame for validation
X_aug_df = pd.DataFrame(X_aug, columns=X.columns)

# Identify invalid samples based on the domain rule
invalid_ids = validate_labels(X_aug_df, y_aug)

# Remove invalid samples from both features and labels
X_clean = np.delete(X_aug, invalid_ids, axis=0)  # axis=0 indicates row deletion
y_clean = np.delete(y_aug, invalid_ids)  # Corresponding labels are also removed

# 'X_clean' and 'y_clean' now contain only valid data samples

#### 7. Data Standardization
- **What it does**: Standardizes the features to have zero mean and unit variance.
- **Data-centric aspect**: Ensures that all features are on the same scale, which is important for many machine learning algorithms (e.g., those using distance metrics or gradient descent).
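
The effect of standardization can be checked on a small example. The sketch below uses made-up toy arrays with illustrative names. It fits the scaler on a training array and reuses the learned mean and standard deviation on a separate test array, which is the usual way to keep test data out of the fitting step.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy training and test features (values are made up)
X_train_toy = np.array([[100.0, 20.0], [120.0, 30.0], [140.0, 40.0]])
X_test_toy = np.array([[120.0, 30.0]])

scaler_toy = StandardScaler()

# Learn per-column mean and standard deviation from the training data only
X_train_scaled = scaler_toy.fit_transform(X_train_toy)

# Apply the same transformation to the test data without refitting
X_test_scaled = scaler_toy.transform(X_test_toy)

print("Train column means:", X_train_scaled.mean(axis=0))
print("Train column stds:", X_train_scaled.std(axis=0))
```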

In [None]:
# 7. Data Standardization: Scale features to improve model performance

# Initialize a StandardScaler to standardize features

# Your code goes here

# Fit the scaler to the data and transform it
X_scaled = scaler.fit_transform(X_clean)

# 'X_scaled' is the standardized feature set, where each feature has:
# - Mean ≈ 0
# - Standard Deviation ≈ 1

#### 8. Model Training
- **What it does**: Trains a Random Forest classifier on the processed data.
- **Data-centric aspect**: The model is trained on high-quality, well-prepared data, which is the foundation of data-centric AI. The focus is on ensuring the data is clean, balanced, and representative.

In [None]:
# 8. Split well-prepared data into training and testing sets

# Split the standardized data into 80% training and 20% testing subsets

X_train, X_test, y_train, y_test = # Your code goes here

# 'X_train', 'y_train': Data used to train the model
# 'X_test', 'y_test': Data used to evaluate model performance on unseen data

# Train a Random Forest classifier on well-prepared data

# Initialize the Random Forest model with 100 decision trees

model_well_prepared = # Your code goes here

# Fit the model to the training data
model_well_prepared.fit(X_train, y_train)

# The model is now trained on clean, balanced, and standardized data

#### 9. Evaluation
- **What it does**: Evaluates the model using the F1 score.
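- **Data-centric aspect**: The evaluation metric reflects the quality of the data and the preprocessing steps. An F1 score higher than the poorly-prepared baseline suggests that the data-centric steps helped. Because SMOTE, augmentation, and scaling are applied before the train/test split in this exercise, the test set contains synthetic near-copies of training rows, so the score is likely to be optimistic.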
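
To see what the F1 score measures, the sketch below computes it by hand from true positives, false positives, and false negatives. It then compares the result with scikit-learn's `f1_score`. The label lists are made up for illustration.

```python
from sklearn.metrics import f1_score

# Toy ground-truth labels and predictions (values are made up)
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]

# Count the outcomes for the positive class (label 1)
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

precision = tp / (tp + fp)  # Share of predicted positives that are correct
recall = tp / (tp + fn)     # Share of actual positives that were found

# F1 is the harmonic mean of precision and recall
f1_manual = 2 * precision * recall / (precision + recall)

print(f"Manual F1: {f1_manual:.4f}")
print(f"sklearn F1: {f1_score(y_true, y_pred):.4f}")
```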

In [None]:
# 9. Evaluate well-prepared model performance on the test set

# Make predictions on the test data using the trained model
preds_well_prepared = model_well_prepared.predict(X_test)

# Calculate the F1 score for the well-prepared model
f1_well_prepared = f1_score(y_test, preds_well_prepared)

# Output the F1 score with four decimal precision
print(f"F1 Score (Well-Prepared Data): {f1_well_prepared:.4f}")

In [None]:
# -------------------------------------------------
# Performance Comparison between Data Approaches
# -------------------------------------------------

# Display a header for clarity in the output
print("\nPerformance Comparison:")

# Print the F1 score of the model trained on poorly-prepared data
print(f"Poorly-Prepared Data F1 Score: {f1_poorly_prepared:.4f}")

# Print the F1 score of the model trained on well-prepared data
print(f"Well-Prepared Data F1 Score: {f1_well_prepared:.4f}")

#### Summary of Data-Centric Principles
- **Focus on Data Quality**: Cleaning, imputation, and validation ensure the dataset is reliable.
- **Feature Engineering**: Transforming raw data into meaningful features improves model performance.
- **Handling Imbalance**: Balancing the dataset ensures the model learns from all classes.
- **Data Augmentation**: Increasing dataset size and diversity helps the model generalize.
- **Domain-Specific Validation**: Ensuring labels and data align with domain knowledge improves reliability.
- **Standardization**: Preparing data for modeling by scaling features appropriately.

By focusing on these data-centric steps, the exercise demonstrates that high-quality data is the foundation of effective machine learning, and improving the data can lead to better model performance.


## Conclusion  
This exercise demonstrates the critical role of data-centric AI principles in building effective machine learning models. By comparing the performance of models trained on poorly-prepared and well-prepared data, we observe that:

- **Poorly-Prepared Data:** Skipping data-centric steps leads to a lower F1 score (**0.65**), indicating poor model performance.  

- **Well-Prepared Data:** Following data-centric practices results in a higher F1 score (**0.85**), showcasing the importance of clean, balanced, and well-represented data.  

The key takeaway is that **high-quality data** is the foundation of successful machine learning. By investing time in data preparation, you can significantly improve model performance and reliability.  


## Next Steps  
- Experiment with other datasets to see how data-centric practices impact performance.  
- Explore additional data augmentation techniques, such as **GANs** (*Generative Adversarial Networks*).  
- Try different machine learning models (e.g., **XGBoost**, **SVM**) to see how they perform on well-prepared data. 


## Clean up

Remember to shut down your Jupyter Notebook environment and delete any unnecessary files or resources once you've completed the exercise. 
