# Unit 1 Building Reusable Data Processing Functions

Hello and welcome to the first lesson of "Building Reusable Pipeline Functions"! This is where our journey into the world of MLOps begins, as we take our first steps in the "Deploying ML Models in Production" course path.

Throughout this path, you'll learn how to transform experimental Machine Learning models into robust production systems. We'll start by laying the foundations of our ML system in this course, covering data processing, model training, evaluation, and persistence. In later courses, we'll move on to integrating an API to serve our ML model as well as adding an automated retraining pipeline with Apache Airflow.

In today's lesson, we'll focus on building reusable data processing functions — a critical foundation for any reliable ML system. We'll work with a diamond price prediction dataset to create well-structured functions that can be reused throughout our ML pipeline. Let's get started!

## Understanding MLOps Fundamentals
MLOps (Machine Learning Operations) combines Machine Learning, DevOps practices, and data engineering to streamline the process of taking ML models to production and maintaining them effectively.

In traditional ML workflows, data scientists often create one-off scripts for data preparation. This approach works for exploration but quickly becomes problematic in production settings where data changes over time and multiple team members need to understand and modify the code. By creating modular, well-documented data processing functions, you're establishing the foundation for a reliable ML pipeline that can evolve with your project needs.

Key benefits of structuring data processing this way include:

- Reproducibility: Data processing steps can be repeated exactly the same way each time.
- Maintainability: Code isolated in focused functions is easier to update and debug.
- Consistency: The same transformations are applied during training and inference.
- Scalability: The same processing can be applied to datasets of varying sizes.
- Testing: Individual pipeline components can be unit tested.

## Exploring the Diamonds Dataset
In this course path, we'll be developing an application for diamond price prediction using the classic diamonds.csv dataset from Kaggle. This dataset is a staple in the data science community, offering a rich collection of attributes for nearly 54,000 diamonds.

The dataset's attributes are well-suited for building a predictive model. For instance, the carat column represents the weight of the diamond, ranging from 0.2 to 5.01, while the cut column describes the quality of the cut, with categories like Fair, Good, Very Good, Premium, and Ideal. The color and clarity columns provide additional qualitative measures, with color ranging from J (worst) to D (best) and clarity from I1 (worst) to IF (best). The dataset also includes numerical features such as depth, table, and the dimensions x, y, and z, which describe the diamond's physical characteristics. Here are the first few records from the dataset:

```text
   carat      cut color clarity  depth  table  price     x     y     z
1   0.23    Ideal     E     SI2   61.5   55.0    326  3.95  3.98  2.43
2   0.21  Premium     E     SI1   59.8   61.0    326  3.89  3.84  2.31
3   0.23     Good     E     VS1   56.9   65.0    327  4.05  4.07  2.31
4   0.29  Premium     I     VS2   62.4   58.0    334  4.20  4.23  2.63
5   0.31     Good     J     SI2   63.3   58.0    335  4.34  4.35  2.75
```
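To get a feel for which columns are numeric and which are text, here is a small sketch that rebuilds the five rows above by hand. The `sample` frame is only a stand-in for the real file, and its dtypes are whatever pandas infers for these values.

```python
import pandas as pd

# The five sample rows shown above, rebuilt by hand for illustration
sample = pd.DataFrame(
    {
        "carat": [0.23, 0.21, 0.23, 0.29, 0.31],
        "cut": ["Ideal", "Premium", "Good", "Premium", "Good"],
        "color": ["E", "E", "E", "I", "J"],
        "clarity": ["SI2", "SI1", "VS1", "VS2", "SI2"],
        "depth": [61.5, 59.8, 56.9, 62.4, 63.3],
        "table": [55.0, 61.0, 65.0, 58.0, 58.0],
        "price": [326, 326, 327, 334, 335],
        "x": [3.95, 3.89, 4.05, 4.20, 4.34],
        "y": [3.98, 3.84, 4.07, 4.23, 4.35],
        "z": [2.43, 2.31, 2.31, 2.63, 2.75],
    },
    index=range(1, 6),
)

# Split the column names by broad type
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
text_cols = sample.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)  # ['carat', 'depth', 'table', 'price', 'x', 'y', 'z']
print(text_cols)     # ['cut', 'color', 'clarity']
```

Note that `price` shows up among the numeric columns. It is the prediction target, so it has to be separated from the features before any scaling or encoding.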

## Creating Reusable Data Loading Functions
The first step in any ML pipeline is loading and exploring the data. Let's examine how we can create a reusable function for this purpose:

```python
import pandas as pd


def load_diamonds_data(file_path):
    """
    Load the diamonds dataset from a CSV file.

    Args:
        file_path (str): Path to the CSV file

    Returns:
        pd.DataFrame: Loaded diamonds data
    """
    # Load the data
    df = pd.read_csv(file_path, index_col=0)

    return df
```

This function loads the dataset with pd.read_csv. Passing index_col=0 tells pandas to use the file's first column, a running row number, as the DataFrame index instead of treating it as a feature.
By isolating data loading in a dedicated function, you make your code more maintainable. If your data source changes in the future — perhaps from CSV to a database or cloud storage — you'll only need to update this one function rather than changing code throughout your project.
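
The effect of index_col=0 is easiest to see side by side. The sketch below uses a hypothetical three-row excerpt held in memory, on the assumption that the real file has the same unnamed leading counter column.

```python
import io

import pandas as pd

# Hypothetical three-row excerpt; the unnamed first column is a row counter
csv_text = """,carat,cut,price
1,0.23,Ideal,326
2,0.21,Premium,326
3,0.23,Good,327
"""

with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)
without_index = pd.read_csv(io.StringIO(csv_text))

print(with_index.columns.tolist())     # ['carat', 'cut', 'price']
print(without_index.columns.tolist())  # ['Unnamed: 0', 'carat', 'cut', 'price']
```

Without index_col=0, the counter would survive as an `Unnamed: 0` column and later be scaled and fed to the model as if it were a real feature.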

## Designing Effective Preprocessing Functions
After loading the data, preprocessing is the next critical step. Let's look at how we can design the beginning of our preprocessing function:

```python
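from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def preprocess_diamonds_data(df, test_size=0.2, random_state=42):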
    """
    Preprocess the diamonds dataset for ML model training.

    Args:
        df (pd.DataFrame): Raw diamonds data
        test_size (float): Proportion of data to use for testing
        random_state (int): Random seed for reproducibility

    Returns:
        tuple: (X_train, X_test, y_train, y_test)
    """
    # Separate features and target
    X = df.drop('price', axis=1)  # Features - everything except price
    y = df['price']               # Target - what we want to predict

    # Split into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
```

This portion of the code illustrates several important design principles. The function accepts flexible parameters with sensible defaults. It starts by separating the prediction target (price) from the features and creating training and testing splits. By using a fixed random state, you ensure that your splits are reproducible — absolutely essential when you're debugging or comparing different modeling approaches.
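
As a rough illustration of that reproducibility claim, the following sketch splits a small synthetic frame twice with the same seed and compares the resulting test rows. The frame is made up for the example and is not the diamonds data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Small synthetic frame standing in for the diamonds data
df = pd.DataFrame(
    {
        "carat": [0.2 + 0.1 * i for i in range(10)],
        "price": [300 + 50 * i for i in range(10)],
    }
)
X = df.drop("price", axis=1)
y = df["price"]

# Two independent calls with the same seed
split_a = train_test_split(X, y, test_size=0.2, random_state=42)
split_b = train_test_split(X, y, test_size=0.2, random_state=42)

# Element 1 of each result is X_test; compare which rows landed there
same_seed_matches = split_a[1].index.equals(split_b[1].index)
print(same_seed_matches)                 # True
print(len(split_a[0]), len(split_a[1]))  # 8 2
```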

## Creating Smart Feature Transformations
Now, let's examine how we build the actual preprocessing pipeline using scikit-learn:

```python
    # Identify categorical and numerical columns automatically
    categorical_cols = X.select_dtypes(include=['object']).columns.tolist()
    numerical_cols = X.select_dtypes(include=['int64', 'float64']).columns.tolist()

    # Create preprocessing transformers for both categorical and numerical data
    categorical_transformer = OneHotEncoder(handle_unknown='ignore')
    numerical_transformer = StandardScaler()

    # Combine preprocessing steps
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numerical_transformer, numerical_cols),
            ('cat', categorical_transformer, categorical_cols)
        ])
```

This code elegantly solves the challenge of mixed data types in ML pipelines by automating preprocessing through dynamic column identification, specialized transformers, and a unified ColumnTransformer. Instead of manual column-by-column processing, the approach automatically detects categorical and numerical features, applies appropriate transformations to each, and combines them into a cohesive pipeline.
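The resulting preprocessing adapts to some changes in dataset structure without code modifications: columns that are added or removed are picked up automatically as long as their dtype is object, int64, or float64. Columns of any other dtype (for example bool or category) match neither selector and are dropped, because ColumnTransformer's default is remainder='drop'. This flexibility is valuable for production systems where data evolves over time. Additionally, the handle_unknown='ignore' parameter in OneHotEncoder lets the pipeline handle categories not seen during training, a common real-world scenario, by encoding them as all zeros instead of raising an error.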

## Preventing Data Leakage in Preprocessing
The final part of our preprocessing function applies the transformations and returns the processed data:

```python
    # Fit the preprocessor on training data only
    X_train_processed = preprocessor.fit_transform(X_train)  # Learn parameters and transform
    X_test_processed = preprocessor.transform(X_test)        # Apply learned parameters without fitting

    return X_train_processed, X_test_processed, y_train, y_test
```

This code demonstrates a crucial ML practice: using fit_transform() on training data to learn parameters, but only transform() on test data to apply those parameters. This approach prevents data leakage—where test data information inadvertently influences training, such as when standardizing all data together before splitting. By fitting exclusively on training data, you simulate how your model will perform on truly unseen production data, maintaining the integrity of your evaluation metrics.
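
A small numeric sketch shows what leakage looks like in practice. The single synthetic feature is chosen so that the test rows are much larger than the training rows, which exaggerates the effect.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic single feature: the test rows contain much larger values
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[10.0], [20.0]])

# Correct: scaling statistics come from the training rows only
scaler = StandardScaler().fit(X_train)
print(scaler.mean_)  # [2.]

# Leaky: statistics computed over train and test together
leaky_scaler = StandardScaler().fit(np.vstack([X_train, X_test]))
print(leaky_scaler.mean_)  # [7.2]
```

In the leaky version, the training data would be centered around a mean that already reflects the test rows, so the evaluation would no longer measure performance on unseen data.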

## Orchestrating the Data Pipeline
Now that we've built our individual components, let's see how they work together in a complete workflow:

```python
def main():
    """Main function to demonstrate data processing."""
    # Step 1: Load the data
    print("Loading diamonds dataset...")
    data_path = "diamonds.csv"  # Relative paths resolve against the current working directory
    diamonds_df = load_diamonds_data(data_path)

    # Step 2: Preprocess the data
    print("\nPreprocessing the dataset...")
    X_train, X_test, y_train, y_test = preprocess_diamonds_data(diamonds_df)

    # Print preprocessing results
    print("\nPreprocessing complete:")
    print(f"  - Training features shape: {X_train.shape}")
    print(f"  - Testing features shape: {X_test.shape}")
    print(f"  - Training target shape: {y_train.shape}")
    print(f"  - Testing target shape: {y_test.shape}")

    print("\nData processing pipeline is ready for model training!")


if __name__ == "__main__":
    main()
```

This orchestration function demonstrates how our individual components combine into a cohesive pipeline with a clear, sequential workflow. By structuring code so that high-level functions call more specialized functions in sequence, we create a maintainable ML system that balances big-picture clarity with encapsulated implementation details. This orchestration pattern is particularly valuable in production environments, where it enables easier debugging, promotes collaboration among team members, and facilitates future modifications as requirements evolve.
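
One payoff of this structure is that each function can be exercised in isolation, without the real CSV file. The smoke test below is a minimal sketch: it assumes a condensed copy of the lesson's preprocess_diamonds_data (the four-value version) and a hypothetical `make_sample_frame` helper that builds a tiny synthetic frame.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def preprocess_diamonds_data(df, test_size=0.2, random_state=42):
    """Condensed copy of the lesson's preprocessing function."""
    X = df.drop("price", axis=1)
    y = df["price"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    categorical_cols = X.select_dtypes(include=["object"]).columns.tolist()
    numerical_cols = X.select_dtypes(include=["int64", "float64"]).columns.tolist()
    preprocessor = ColumnTransformer(
        transformers=[
            ("num", StandardScaler(), numerical_cols),
            ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
        ]
    )
    # Fit on the training rows only, then apply to the test rows
    X_train_processed = preprocessor.fit_transform(X_train)
    X_test_processed = preprocessor.transform(X_test)
    return X_train_processed, X_test_processed, y_train, y_test


def make_sample_frame(n_rows=20):
    """Hypothetical helper: a tiny synthetic stand-in for diamonds.csv."""
    cuts = ["Fair", "Good", "Ideal", "Premium"]
    return pd.DataFrame(
        {
            "carat": [0.2 + 0.05 * i for i in range(n_rows)],
            "cut": [cuts[i % len(cuts)] for i in range(n_rows)],
            "price": [300 + 25 * i for i in range(n_rows)],
        }
    )


def test_preprocess_shapes():
    """Row counts must follow the 80/20 split and column counts must agree."""
    X_train, X_test, y_train, y_test = preprocess_diamonds_data(make_sample_frame())
    assert X_train.shape[0] == len(y_train) == 16
    assert X_test.shape[0] == len(y_test) == 4
    assert X_train.shape[1] == X_test.shape[1]


test_preprocess_shapes()
print("Smoke test passed")
```

A test like this can run with a test runner such as pytest or, as here, as a plain function call.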

## Conclusion and Next Steps
In this first lesson, you've learned how to build the foundation of a robust ML pipeline by creating reusable functions for data loading and preprocessing. These functions aren't just convenient abstractions — they're essential building blocks for production ML systems that can handle changing data and requirements. By separating concerns, preventing data leakage, and creating adaptable transformations, you've taken the first steps toward MLOps best practices.

As you continue through this course, you'll build upon this foundation, adding functions for model training, evaluation, and persistence. These components will eventually come together to form a complete, production-ready ML system that can reliably deliver predictions and adapt to new data. The skills you're developing now — structuring code for reusability, preventing common ML pitfalls, and thinking in pipelines — will serve you throughout your journey into MLOps.

## Fix the Data Processing Bug

Welcome to your first hands-on exercise in building reusable data processing functions! In this exercise, you'll apply what you've learned about creating effective preprocessing functions.

Today, you'll be working on the preprocess_diamonds_data function, which is crucial for preparing our diamonds dataset for machine learning. However, there's a small mix-up in the current code: your mission is to identify and fix this bug. Happy debugging!


```python
"""
Data Processing Module for ML Pipeline

This module handles the processing and preparation of the diamonds dataset
for machine learning tasks.
"""

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split

def load_diamonds_data(file_path):
    """
    Load the diamonds dataset from a CSV file.
    
    Args:
        file_path (str): Path to the CSV file
        
    Returns:
        pd.DataFrame: Loaded diamonds data
    """
    # Load the data
    df = pd.read_csv(file_path, index_col=0)
    
    return df

def preprocess_diamonds_data(df, test_size=0.2, random_state=42):
    """
    Preprocess the diamonds dataset for ML model training.
    
    Args:
        df (pd.DataFrame): Raw diamonds data
        test_size (float): Proportion of data to use for testing
        random_state (int): Random seed for reproducibility
        
    Returns:
        tuple: (X_train, X_test, y_train, y_test, preprocessor)
               - Training features
               - Testing features
               - Training target
               - Testing target
               - Preprocessor object for transforming new data
    """
    # Separate features and target
    X = df
    y = df.drop('price', axis=1)
    
    # Split into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    
    # Identify categorical and numerical columns
    categorical_cols = X.select_dtypes(include=['object']).columns.tolist()
    numerical_cols = X.select_dtypes(include=['int64', 'float64']).columns.tolist()
    
    # Create preprocessing transformers for both categorical and numerical data
    categorical_transformer = OneHotEncoder(handle_unknown='ignore')
    numerical_transformer = StandardScaler()

    # Combine preprocessing steps
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numerical_transformer, numerical_cols),
            ('cat', categorical_transformer, categorical_cols)
        ])
    
    # Fit the preprocessor on the training data
    X_train_processed = preprocessor.fit_transform(X_train)
    X_test_processed = preprocessor.transform(X_test)
    
    return X_train_processed, X_test_processed, y_train, y_test, preprocessor

```

You've spotted a critical bug in the data separation step! The code incorrectly assigns the features and target variables, which would lead to incorrect model training and predictions.

The bug is in these lines:

```python
    # Separate features and target
    X = df
    y = df.drop('price', axis=1)
```

Here, `X` (the features) is assigned the entire DataFrame `df`, which still includes the `price` column, so the target leaks into the features. Meanwhile, `y` (the target) is assigned the DataFrame with the `price` column dropped, which means it holds every feature column and none of the price values.

The fix is to swap the assignments: `X` should be the DataFrame without the target column, and `y` should be a Series containing only the `price` column.
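
The difference is easy to verify on a tiny frame. This sketch uses two made-up rows and contrasts the buggy assignments with the corrected ones.

```python
import pandas as pd

# Two made-up rows with the same columns the bug touches
df = pd.DataFrame(
    {"carat": [0.23, 0.21], "cut": ["Ideal", "Premium"], "price": [326, 326]}
)

# Buggy separation
X_bad = df
y_bad = df.drop("price", axis=1)
print("price" in X_bad.columns)  # True: the target is still among the features
print(type(y_bad).__name__)      # DataFrame

# Corrected separation
X = df.drop("price", axis=1)
y = df["price"]
print("price" in X.columns)  # False
print(type(y).__name__)      # Series
```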

Here is the corrected code with the bug fixed:

```python
"""
Data Processing Module for ML Pipeline

This module handles the processing and preparation of the diamonds dataset
for machine learning tasks.
"""

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split

def load_diamonds_data(file_path):
    """
    Load the diamonds dataset from a CSV file.
    
    Args:
        file_path (str): Path to the CSV file
        
    Returns:
        pd.DataFrame: Loaded diamonds data
    """
    # Load the data
    df = pd.read_csv(file_path, index_col=0)
    
    return df

def preprocess_diamonds_data(df, test_size=0.2, random_state=42):
    """
    Preprocess the diamonds dataset for ML model training.
    
    Args:
        df (pd.DataFrame): Raw diamonds data
        test_size (float): Proportion of data to use for testing
        random_state (int): Random seed for reproducibility
        
    Returns:
        tuple: (X_train, X_test, y_train, y_test, preprocessor)
               - Training features
               - Testing features
               - Training target
               - Testing target
               - Preprocessor object for transforming new data
    """
    # Separate features and target (FIXED BUG HERE)
    X = df.drop('price', axis=1)
    y = df['price']
    
    # Split into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    
    # Identify categorical and numerical columns
    categorical_cols = X.select_dtypes(include=['object']).columns.tolist()
    numerical_cols = X.select_dtypes(include=['int64', 'float64']).columns.tolist()
    
    # Create preprocessing transformers for both categorical and numerical data
    categorical_transformer = OneHotEncoder(handle_unknown='ignore')
    numerical_transformer = StandardScaler()

    # Combine preprocessing steps
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numerical_transformer, numerical_cols),
            ('cat', categorical_transformer, categorical_cols)
        ])
    
    # Fit the preprocessor on the training data
    X_train_processed = preprocessor.fit_transform(X_train)
    X_test_processed = preprocessor.transform(X_test)
    
    return X_train_processed, X_test_processed, y_train, y_test, preprocessor
```
