# Unit 2 Model Training and Prediction Functions in ML Pipelines

# Building Reusable Pipeline Functions

Welcome back to lesson 2 of "Building Reusable Pipeline Functions"! In the previous lesson, we learned how to create modular, reusable functions for loading and processing data, the crucial first steps in our ML pipeline. Now that our data is properly prepared, we are ready to move on to the next key component: model training and prediction functions.

Throughout this lesson, we will build on the foundation laid by our data processing functions and create equally robust functions for training machine learning models and generating predictions. As with data processing, following **good design principles** for these functions helps keep our ML system maintainable, flexible, and production-ready.

By the end of this lesson, you will understand how to create model training functions that adapt to different modeling approaches while keeping a consistent interface, a **key skill for building production ML systems** that can evolve over time.

## Understanding Model Functions in ML Pipelines

Before we dive into the code, let's discuss why well-designed model training functions are **essential** in an ML pipeline:

  * **Consistency**: A standard interface for model training and prediction ensures that different models can be swapped easily without changing the surrounding code.
  * **Experimentation**: Well-structured model functions make it easier to experiment with different algorithms and hyperparameters.
  * **Maintainability**: Isolating model logic in dedicated functions makes the code easier to understand and update.
  * **Reproducibility**: Well-designed functions with proper parameter handling ensure that training results can be reproduced reliably.
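
To make the reproducibility point concrete, here is a minimal sketch. It assumes synthetic data generated with NumPy (not the lesson's diamonds dataset) and shows that two Random Forest models built with the same hyperparameters and the same `random_state` produce the same predictions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (an assumption, for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=100)

# Two models with identical hyperparameters and the same seed
model_a = RandomForestRegressor(n_estimators=10, random_state=42).fit(X, y)
model_b = RandomForestRegressor(n_estimators=10, random_state=42).fit(X, y)

# Compare the two sets of predictions element by element
same_predictions = np.array_equal(model_a.predict(X), model_b.predict(X))
print(f"Identical predictions with a fixed seed: {same_predictions}")
```

Leaving `random_state` unset would let each run draw different bootstrap samples, so results could differ between runs.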

When building model functions for production, we need to consider **both current and future requirements**. Although we may start with a single model type, our pipeline should be flexible enough to accommodate different approaches as the project evolves. This **forward-looking design** is a hallmark of production-grade ML systems.

## Designing a Flexible Model Training Interface

Let's start by designing the interface for our model training function. A good interface should be intuitive and flexible, accommodating different model types while maintaining a consistent structure:

```python
def train_model(X_train, y_train, model_type="random_forest", **model_params):
    """
    Train a machine learning model on the preprocessed data.
    
    Args:
        X_train (array-like): Preprocessed training features
        y_train (array-like): Training target values
        model_type (str): Type of model to train ('random_forest' or 'linear')
        **model_params: Additional parameters to pass to the model constructor
        
    Returns:
        object: Trained model
    """
    print(f"Training a {model_type} model...")
    
    # Model selection and training logic will go here
    
    print("Model training completed!")
    return model
```

This function signature demonstrates several important design principles:

  * **Required parameters** (`X_train`, `y_train`) for the training data.
  * **A default model type** (`"random_forest"`) that provides a sensible default choice.
  * **Flexible parameter passing** using `**model_params` to accept any number of model-specific parameters.
  * **A clear docstring** that explains what the function does, its parameters, and its return value.
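
As a small illustration of how this signature binds its arguments, the sketch below uses a hypothetical helper, `show_call`, that has the same parameters as `train_model` but returns what it received instead of training anything.

```python
def show_call(X_train, y_train, model_type="random_forest", **model_params):
    """Hypothetical helper: report how the arguments were bound, without training."""
    return model_type, model_params


# Only the required arguments: the default model type applies, no extra parameters
default_call = show_call([[1.0]], [2.0])

# Extra keyword arguments are collected into the model_params dictionary
custom_call = show_call([[1.0]], [2.0], model_type="linear", fit_intercept=False)

print(default_call)  # ('random_forest', {})
print(custom_call)   # ('linear', {'fit_intercept': False})
```

Because `model_params` arrives as a plain dictionary, the function can inspect it, fill in defaults, and forward it to a model constructor with `**model_params`.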

## Implementing the Model Selection Logic

Now let's add the logic for selecting and training different types of models:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

def train_model(X_train, y_train, model_type="random_forest", **model_params):
    """
    Train a machine learning model on the preprocessed data.
    
    Args:
        X_train (array-like): Preprocessed training features
        y_train (array-like): Training target values
        model_type (str): Type of model to train ('random_forest' or 'linear')
        **model_params: Additional parameters to pass to the model constructor
        
    Returns:
        object: Trained model
    """
    print(f"Training a {model_type} model...")
    
    if model_type == "random_forest":
        # Default parameters if not specified
        if 'n_estimators' not in model_params:
            model_params['n_estimators'] = 10  # Using a sensible default
        if 'random_state' not in model_params:
            model_params['random_state'] = 42  # For reproducibility
            
        model = RandomForestRegressor(**model_params)
    elif model_type == "linear":
        model = LinearRegression(**model_params)
    else:
        raise ValueError(f"Unsupported model type: {model_type}")
        
    # Train the model
    model.fit(X_train, y_train)
    
    print("Model training completed!")
    return model
```

This code defines a function for training different types of machine learning models:

  * It accepts the training data and a model type as input.
  * It selects a model based on the `model_type` parameter.
  * For the `"random_forest"` model, it sets default values for `n_estimators` and `random_state` if they are not provided.
  * It raises a `ValueError` if an unsupported model type is specified.
  * It trains the selected model on the provided data with `model.fit(X_train, y_train)` and returns the trained model.
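
The default-handling step can also be written more compactly. The sketch below is one possible alternative, not the lesson's implementation: it uses `dict.setdefault`, which assigns a value only when the key is absent, matching the behavior of the `if ... not in model_params` checks. The helper name `fill_random_forest_defaults` is hypothetical.

```python
def fill_random_forest_defaults(model_params):
    """Hypothetical helper: fill in defaults in place, keeping caller-supplied values."""
    model_params.setdefault("n_estimators", 10)  # only set if the caller omitted it
    model_params.setdefault("random_state", 42)  # fixed seed for reproducibility
    return model_params


# The caller overrides n_estimators, so only random_state is filled in
params = fill_random_forest_defaults({"n_estimators": 50, "max_depth": 5})
print(params)
```

Either form works. The explicit `if` checks are easier to step through, while `setdefault` keeps each default on a single line.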

## Creating the Prediction Function

A good ML pipeline needs functions not only for training models but also for making predictions with them. Let's implement a prediction function that works with any model we have trained:

```python
def predict_with_model(model, X):
    """
    Make predictions using a trained model.
    
    Args:
        model (object): Trained model
        X (array-like): Preprocessed features
        
    Returns:
        array: Predictions
    """
    return model.predict(X)
```

This function is intentionally simple, and that simplicity is a **strength**. By creating a dedicated function for prediction:

  * We create a **consistent interface** for generating predictions, regardless of the underlying model type.
  * We establish a **single point of modification** if we need to change how predictions are generated.
  * We follow the **single responsibility principle** by separating training logic from prediction logic.

Although it is simple today, this function can grow to include additional functionality such as input validation, error handling, post-processing of predictions, logging, and monitoring. By isolating prediction in its own function, we make it **easier to add these capabilities** later without disrupting other parts of the pipeline.
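
As a rough sketch of how such an extension might look, the example below adds logging and a simple post-processing step. Everything here is an assumption for illustration: the function name `predict_with_postprocessing`, the `min_value` parameter, and the `FixedOutputModel` stand-in are not part of the lesson's code.

```python
import logging

import numpy as np

logger = logging.getLogger(__name__)


def predict_with_postprocessing(model, X, min_value=None):
    """Hypothetical extension of predict_with_model with logging and clipping."""
    predictions = np.asarray(model.predict(X))
    if min_value is not None:
        # Post-processing: raise any prediction below min_value up to min_value
        predictions = np.maximum(predictions, min_value)
    logger.info("Generated %d predictions", len(predictions))
    return predictions


class FixedOutputModel:
    """Stand-in for a trained model; predict() returns preset values."""

    def __init__(self, outputs):
        self.outputs = outputs

    def predict(self, X):
        return np.array(self.outputs[: len(X)])


model = FixedOutputModel([-20.0, 350.0, 1200.0])
clipped = predict_with_postprocessing(model, [[0.3], [0.5], [1.1]], min_value=0.0)
print(clipped)
```

Callers that already use a single prediction function pick up changes like these without modifying their own code.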

## Integrating Model Training into the Pipeline

Finally, let's see how our model training and prediction functions integrate into a complete pipeline:

```python
def main():
    """Main function to demonstrate the model training pipeline."""
    # Step 1: Load the data
    print("Loading diamonds dataset...")
    data_path = "diamonds.csv"
    diamonds_df = load_diamonds_data(data_path)
    print(f"Dataset loaded with shape: {diamonds_df.shape}")
    
    # Step 2: Preprocess the dataset
    print("\nPreprocessing the dataset...")
    X_train, X_test, y_train, _, _ = preprocess_diamonds_data(diamonds_df)
    print("Preprocessing complete.")
    print(f"Training set size: {X_train.shape[0]}, Test set size: {X_test.shape[0]}")
    
    # Step 3: Train a Random Forest model with custom parameters
    print("\n--- Training Random Forest Model ---")
    rf_model = train_model(
        X_train,
        y_train,
        model_type="random_forest",
        n_estimators=10,
        max_depth=5,
        random_state=42
    )
    
    # Step 4: Generate predictions with the trained model
    y_pred_rf = predict_with_model(rf_model, X_test)
    print(f"Generated {len(y_pred_rf)} predictions")
```

This main function demonstrates the complete workflow, from data loading to prediction:

  * We load the dataset using our function from Lesson 1.
  * We preprocess the data using our preprocessing function, also from Lesson 1.
  * We train a **Random Forest model** with specific hyperparameters using the new `train_model` function.
  * We generate predictions on the test set using our `predict_with_model` function.

Notice how each step builds on the previous one, creating a **clear, linear workflow**. This organized approach makes the code easy to understand and modify. It also illustrates the practical benefit of our modular design: we can swap in a different model or change hyperparameters without disrupting the overall pipeline structure.
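
The following sketch illustrates that swap. It is self-contained, so it assumes synthetic NumPy data in place of the diamonds dataset and calls the scikit-learn estimators directly rather than the lesson's helper functions. The point is that the train and predict steps stay identical while the model changes.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Synthetic data standing in for the preprocessed dataset (an assumption)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)
X_train, X_test, y_train = X[:150], X[150:], y[:150]

# Candidate models: only this dictionary changes when trying another approach
candidates = {
    "random_forest": RandomForestRegressor(n_estimators=10, max_depth=5, random_state=42),
    "linear": LinearRegression(),
}

# The same two steps run for every candidate
predictions = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    predictions[name] = model.predict(X_test)
    print(f"{name}: generated {len(predictions[name])} predictions")
```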

## Conclusion and Next Steps

In this second lesson, we extended our ML pipeline by creating **robust, flexible functions** for model training and prediction. These functions follow the same design principles we established in our data processing module: modularity, flexibility, and clear interfaces. We have designed a system that accommodates different model types, handles a variety of parameters, and provides consistent prediction capabilities, all while maintaining a clean, readable codebase that follows software engineering best practices.

As you move on to the practice exercises, you will have the chance to experiment with these functions firsthand. This hands-on experience will reinforce the concepts we have covered and help you develop the skills needed to build production-quality ML pipelines in your own projects. Happy coding!

## Setting Default Hyperparameters in Models

Welcome to your first hands-on practice with model training functions! In our previous lesson, we explored how to design flexible model training interfaces. Now, it's time to put that knowledge into action.

In this exercise, you'll enhance the flexibility of our `train_model` function by ensuring it sets sensible default hyperparameters for a Random Forest model. Here's what you need to do:

  * Ensure that `n_estimators` is set to 10 if not specified in `model_params`.
  * Ensure that `random_state` is set to 42 if not specified in `model_params`.

Happy coding!

```python
"""
Model Training Module for ML Pipeline

This module provides functions for training machine learning models
on the preprocessed diamonds dataset.
"""

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score


def train_model(X_train, y_train, model_type="random_forest", **model_params):
    """
    Train a machine learning model on the preprocessed data.
    
    Args:
        X_train (array-like): Preprocessed training features
        y_train (array-like): Training target values
        model_type (str): Type of model to train ('random_forest' or 'linear')
        **model_params: Additional parameters to pass to the model constructor
        
    Returns:
        object: Trained model
    """
    print(f"Training a {model_type} model...")
    
    if model_type == "random_forest":
        # Default parameters if not specified
        # TODO: Add code to set default value for 'n_estimators' to 10 if not provided in model_params
        
        # TODO: Add code to set default value for 'random_state' to 42 if not provided in model_params
            
        model = RandomForestRegressor(**model_params)
    
    elif model_type == "linear":
        model = LinearRegression(**model_params)
    
    else:
        raise ValueError(f"Unsupported model type: {model_type}")
    
    # Train the model
    model.fit(X_train, y_train)
    
    print("Model training completed!")
    return model


def predict_with_model(model, X):
    """
    Make predictions using a trained model.
    
    Args:
        model (object): Trained model
        X (array-like): Preprocessed features
        
    Returns:
        array: Predictions
    """
    return model.predict(X)

```

```python
"""
Model Training Module for ML Pipeline

This module provides functions for training machine learning models
on the preprocessed diamonds dataset.
"""

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score


def train_model(X_train, y_train, model_type="random_forest", **model_params):
    """
    Train a machine learning model on the preprocessed data.
    
    Args:
        X_train (array-like): Preprocessed training features
        y_train (array-like): Training target values
        model_type (str): Type of model to train ('random_forest' or 'linear')
        **model_params: Additional parameters to pass to the model constructor
        
    Returns:
        object: Trained model
    """
    print(f"Training a {model_type} model...")
    
    if model_type == "random_forest":
        # Default parameters if not specified
        # Set 'n_estimators' to 10 if not provided in model_params
        if 'n_estimators' not in model_params:
            model_params['n_estimators'] = 10  # Using a sensible default
        
        # Set 'random_state' to 42 if not provided in model_params
        if 'random_state' not in model_params:
            model_params['random_state'] = 42  # For reproducibility
            
        model = RandomForestRegressor(**model_params)
    
    elif model_type == "linear":
        model = LinearRegression(**model_params)
    
    else:
        raise ValueError(f"Unsupported model type: {model_type}")
    
    # Train the model
    model.fit(X_train, y_train)
    
    print("Model training completed!")
    return model


def predict_with_model(model, X):
    """
    Make predictions using a trained model.
    
    Args:
        model (object): Trained model
        X (array-like): Preprocessed features
        
    Returns:
        array: Predictions
    """
    return model.predict(X)

```

## Linear Regression Model Training

Welcome back! In the previous exercise, you successfully set default hyperparameters for a Random Forest model. Now, let's turn our attention to Linear Regression. Your objective is to complete the `train_model` function by adding the missing line of code that creates a `LinearRegression` model when `model_type` is set to `"linear"`.

```python
"""
Model Training Module for ML Pipeline

This module provides functions for training machine learning models
on the preprocessed diamonds dataset.
"""

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score


def train_model(X_train, y_train, model_type="random_forest", **model_params):
    """
    Train a machine learning model on the preprocessed data.
    
    Args:
        X_train (array-like): Preprocessed training features
        y_train (array-like): Training target values
        model_type (str): Type of model to train ('random_forest' or 'linear')
        **model_params: Additional parameters to pass to the model constructor
        
    Returns:
        object: Trained model
    """
    print(f"Training a {model_type} model...")
    
    if model_type == "random_forest":
        # Default parameters if not specified
        if 'n_estimators' not in model_params:
            model_params['n_estimators'] = 10
        if 'random_state' not in model_params:
            model_params['random_state'] = 42
            
        model = RandomForestRegressor(**model_params)
    
    elif model_type == "linear":
        # TODO: Insert the line to instantiate a LinearRegression model using model_params
        
    
    else:
        raise ValueError(f"Unsupported model type: {model_type}")
    
    # Train the model
    model.fit(X_train, y_train)
    
    print("Model training completed!")
    return model


def predict_with_model(model, X):
    """
    Make predictions using a trained model.
    
    Args:
        model (object): Trained model
        X (array-like): Preprocessed features
        
    Returns:
        array: Predictions
    """
    return model.predict(X)
```

```python
"""
Model Training Module for ML Pipeline

This module provides functions for training machine learning models
on the preprocessed diamonds dataset.
"""

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score


def train_model(X_train, y_train, model_type="random_forest", **model_params):
    """
    Train a machine learning model on the preprocessed data.
    
    Args:
        X_train (array-like): Preprocessed training features
        y_train (array-like): Training target values
        model_type (str): Type of model to train ('random_forest' or 'linear')
        **model_params: Additional parameters to pass to the model constructor
        
    Returns:
        object: Trained model
    """
    print(f"Training a {model_type} model...")
    
    if model_type == "random_forest":
        # Default parameters if not specified
        if 'n_estimators' not in model_params:
            model_params['n_estimators'] = 10
        if 'random_state' not in model_params:
            model_params['random_state'] = 42
            
        model = RandomForestRegressor(**model_params)
    
    elif model_type == "linear":
        # Instantiate a LinearRegression model using model_params
        model = LinearRegression(**model_params)
    
    else:
        raise ValueError(f"Unsupported model type: {model_type}")
    
    # Train the model
    model.fit(X_train, y_train)
    
    print("Model training completed!")
    return model


def predict_with_model(model, X):
    """
    Make predictions using a trained model.
    
    Args:
        model (object): Trained model
        X (array-like): Preprocessed features
        
    Returns:
        array: Predictions
    """
    return model.predict(X)
```

-----

Great job! The missing line now correctly instantiates the **LinearRegression** model within the `train_model` function.

With this in place, your `train_model` function can train both Random Forest and Linear Regression models based on the `model_type` argument.
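
If more model types are added later, the `if`/`elif` chain could be replaced by a lookup table. The sketch below is one possible design under stated assumptions, not the exercise's solution: `MODEL_REGISTRY` and `build_model` are hypothetical names, and the function only constructs the estimator without fitting it.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Hypothetical registry: model_type -> (estimator class, default parameters)
MODEL_REGISTRY = {
    "random_forest": (RandomForestRegressor, {"n_estimators": 10, "random_state": 42}),
    "linear": (LinearRegression, {}),
}


def build_model(model_type="random_forest", **model_params):
    """Create an untrained model; caller-supplied parameters override the defaults."""
    if model_type not in MODEL_REGISTRY:
        raise ValueError(f"Unsupported model type: {model_type}")
    model_class, defaults = MODEL_REGISTRY[model_type]
    return model_class(**{**defaults, **model_params})


rf = build_model("random_forest", max_depth=5)
lin = build_model("linear", fit_intercept=False)
print(type(rf).__name__, rf.get_params()["n_estimators"], rf.get_params()["max_depth"])
print(type(lin).__name__, lin.get_params()["fit_intercept"])
```

With this layout, supporting another estimator means adding one entry to the dictionary rather than another branch.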


## Consistent Model Predictions

Welcome back! In the previous exercises, you successfully set default hyperparameters and trained models. Now, it's time to shift your focus to the prediction phase of your machine learning pipeline.

Your goal is to complete the `predict_with_model` function. This function is crucial for generating predictions consistently, regardless of the model type. Happy coding!

```python
"""
Model Training Module for ML Pipeline

This module provides functions for training machine learning models
on the preprocessed diamonds dataset.
"""

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score


def train_model(X_train, y_train, model_type="random_forest", **model_params):
    """
    Train a machine learning model on the preprocessed data.
    
    Args:
        X_train (array-like): Preprocessed training features
        y_train (array-like): Training target values
        model_type (str): Type of model to train ('random_forest' or 'linear')
        **model_params: Additional parameters to pass to the model constructor
        
    Returns:
        object: Trained model
    """
    print(f"Training a {model_type} model...")
    
    if model_type == "random_forest":
        # Default parameters if not specified
        if 'n_estimators' not in model_params:
            model_params['n_estimators'] = 10
        if 'random_state' not in model_params:
            model_params['random_state'] = 42
            
        model = RandomForestRegressor(**model_params)
    
    elif model_type == "linear":
        model = LinearRegression(**model_params)
    
    else:
        raise ValueError(f"Unsupported model type: {model_type}")
    
    # Train the model
    model.fit(X_train, y_train)
    
    print("Model training completed!")
    return model


def predict_with_model(model, X):
    """
    Make predictions using a trained model.
    
    Args:
        model (object): Trained model
        X (array-like): Preprocessed features
        
    Returns:
        array: Predictions
    """
    # TODO: Insert code to call the model's predict method with X and return the result
    

```

```python
"""
Model Training Module for ML Pipeline

This module provides functions for training machine learning models
on the preprocessed diamonds dataset.
"""

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score


def train_model(X_train, y_train, model_type="random_forest", **model_params):
    """
    Train a machine learning model on the preprocessed data.
    
    Args:
        X_train (array-like): Preprocessed training features
        y_train (array-like): Training target values
        model_type (str): Type of model to train ('random_forest' or 'linear')
        **model_params: Additional parameters to pass to the model constructor
        
    Returns:
        object: Trained model
    """
    print(f"Training a {model_type} model...")
    
    if model_type == "random_forest":
        # Default parameters if not specified
        if 'n_estimators' not in model_params:
            model_params['n_estimators'] = 10
        if 'random_state' not in model_params:
            model_params['random_state'] = 42
            
        model = RandomForestRegressor(**model_params)
    
    elif model_type == "linear":
        model = LinearRegression(**model_params)
    
    else:
        raise ValueError(f"Unsupported model type: {model_type}")
    
    # Train the model
    model.fit(X_train, y_train)
    
    print("Model training completed!")
    return model


def predict_with_model(model, X):
    """
    Make predictions using a trained model.
    
    Args:
        model (object): Trained model
        X (array-like): Preprocessed features
        
    Returns:
        array: Predictions
    """
    # Call the model's predict method with X and return the result
    return model.predict(X)
```

-----

### Consistent Predictions Across Models

Great job! The `predict_with_model` function is now complete with the line `return model.predict(X)`.

This is a powerful and flexible approach. Because both **RandomForestRegressor** and **LinearRegression** (like other scikit-learn regressors and classifiers) expose a `.predict()` method that is called the same way, your function can generate predictions regardless of the model type passed into it. This is a key principle of a well-structured machine learning pipeline.
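
The function relies only on the object having a `predict` method, so any object that provides one will work. The sketch below makes that assumption explicit with a hypothetical variant, `predict_if_supported`, that fails early with a clear message when the method is missing. `MeanModel` is a stand-in used only for this illustration.

```python
def predict_if_supported(model, X):
    """Hypothetical variant: check for a callable predict() before using it."""
    predict = getattr(model, "predict", None)
    if not callable(predict):
        raise TypeError(f"{type(model).__name__} has no callable predict() method")
    return predict(X)


class MeanModel:
    """Stand-in model: predicts the mean of each sample's features."""

    def predict(self, X):
        return [sum(row) / len(row) for row in X]


result = predict_if_supported(MeanModel(), [[1.0, 3.0], [2.0, 6.0]])
print(result)  # [2.0, 4.0]
```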

## Enhance Prediction Function Reliability

Welcome back! You've done a fantastic job in the previous exercise by focusing on training models and generating predictions. Now, let's take it a step further by ensuring that our predictions are reliable and robust.

Your goal is to enhance the `predict_with_model` function by adding input validation. This involves checking that the input features array `X` is non-empty and, optionally, has the correct number of dimensions. If these conditions aren't met, the function should raise a `ValueError` with a clear message explaining that valid feature data must be provided.

Here's what you need to do:

  * Convert `X` to a NumPy array for easier validation.
  * Check whether the array is empty (`size == 0`).
  * Optionally, verify that `X` is two-dimensional.
  * If any validation fails, raise a `ValueError` with an appropriate message.

By completing this, you'll gain valuable experience in implementing input validation, a crucial aspect of building dependable machine learning systems. Dive in and make your prediction function bulletproof!

```python
"""
Model Training Module for ML Pipeline

This module provides functions for training machine learning models
on the preprocessed diamonds dataset.
"""

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score


def train_model(X_train, y_train, model_type="random_forest", **model_params):
    """
    Train a machine learning model on the preprocessed data.
    
    Args:
        X_train (array-like): Preprocessed training features
        y_train (array-like): Training target values
        model_type (str): Type of model to train ('random_forest' or 'linear')
        **model_params: Additional parameters to pass to the model constructor
        
    Returns:
        object: Trained model
    """
    print(f"Training a {model_type} model...")
    
    if model_type == "random_forest":
        # Default parameters if not specified
        if 'n_estimators' not in model_params:
            model_params['n_estimators'] = 10
        if 'random_state' not in model_params:
            model_params['random_state'] = 42
            
        model = RandomForestRegressor(**model_params)
    
    elif model_type == "linear":
        model = LinearRegression(**model_params)
    
    else:
        raise ValueError(f"Unsupported model type: {model_type}")
    
    # Train the model
    model.fit(X_train, y_train)
    
    print("Model training completed!")
    return model


def predict_with_model(model, X):
    """
    Make predictions using a trained model.
    
    Args:
        model (object): Trained model
        X (array-like): Preprocessed features
        
    Returns:
        array: Predictions
    """
    # TODO: Convert X to a numpy array for easier validation
    # TODO: Check if the array is empty (size == 0)
    # TODO: Optionally, verify that X has the correct number of dimensions (should be 2D)
    # TODO: If any validation fails, raise a ValueError with an appropriate message
    
    return model.predict(X)

```

```python
"""
Model Training Module for ML Pipeline

This module provides functions for training machine learning models
on the preprocessed diamonds dataset.
"""

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score


def train_model(X_train, y_train, model_type="random_forest", **model_params):
    """
    Train a machine learning model on the preprocessed data.
    
    Args:
        X_train (array-like): Preprocessed training features
        y_train (array-like): Training target values
        model_type (str): Type of model to train ('random_forest' or 'linear')
        **model_params: Additional parameters to pass to the model constructor
        
    Returns:
        object: Trained model
    """
    print(f"Training a {model_type} model...")
    
    if model_type == "random_forest":
        # Default parameters if not specified
        if 'n_estimators' not in model_params:
            model_params['n_estimators'] = 10
        if 'random_state' not in model_params:
            model_params['random_state'] = 42
            
        model = RandomForestRegressor(**model_params)
    
    elif model_type == "linear":
        model = LinearRegression(**model_params)
    
    else:
        raise ValueError(f"Unsupported model type: {model_type}")
    
    # Train the model
    model.fit(X_train, y_train)
    
    print("Model training completed!")
    return model


def predict_with_model(model, X):
    """
    Make predictions using a trained model.
    
    Args:
        model (object): Trained model
        X (array-like): Preprocessed features
        
    Returns:
        array: Predictions
    """
    # Convert X to a numpy array for easier validation
    X = np.array(X)
    
    # Check if the array is empty
    if X.size == 0:
        raise ValueError("Input features array 'X' cannot be empty.")
    
    # Verify that X has the correct number of dimensions (should be 2D)
    if X.ndim != 2:
        raise ValueError(f"Input features array 'X' must be 2-dimensional. Got {X.ndim} dimensions.")
    
    return model.predict(X)
```

-----

### Robust Prediction Function

Excellent work! You've successfully added the necessary input validation to the `predict_with_model` function.

The code now performs these crucial checks:

  - **Ensures a non-empty array**: If an empty array is passed, the function raises a `ValueError` with a clear message instead of letting the failure surface later inside the model.
  - **Verifies array dimensions**: By checking that `X` is two-dimensional, it enforces the standard format for scikit-learn models, which expect features in a `(n_samples, n_features)` shape.

This enhancement makes your function significantly more reliable and easier for other developers to use, as it provides clear, immediate feedback when invalid data is provided.
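
To see how these checks behave on concrete inputs, here is a small sketch. It factors the same two checks into a hypothetical helper, `validate_features`, and runs it on three sample inputs. The helper names and sample values are assumptions made for this illustration.

```python
import numpy as np


def validate_features(X):
    """Hypothetical helper: raise ValueError unless X is a non-empty 2-D array-like."""
    X_array = np.asarray(X)
    if X_array.size == 0:
        raise ValueError("Input features array 'X' cannot be empty.")
    if X_array.ndim != 2:
        raise ValueError(
            f"Input features array 'X' must be 2-dimensional. Got {X_array.ndim} dimensions."
        )


def collect_validation_results(candidates):
    """Validate each candidate and record 'ok' or the resulting error message."""
    results = {}
    for name, X in candidates.items():
        try:
            validate_features(X)
            results[name] = "ok"
        except ValueError as exc:
            results[name] = str(exc)
    return results


results = collect_validation_results({
    "empty": [],
    "one_dimensional": [0.3, 0.5, 1.1],
    # A single sample passes once it is reshaped to (1, n_features)
    "single_sample_reshaped": np.array([0.3, 0.5, 1.1]).reshape(1, -1),
})
for name, outcome in results.items():
    print(f"{name}: {outcome}")
```

The last case shows a common fix for a rejected 1-D input: reshaping one sample with `reshape(1, -1)` so that it has the `(n_samples, n_features)` layout.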

## Building a Complete ML Pipeline

Congratulations on reaching the final exercise of this unit! You've done an excellent job building modular functions for data processing and model training. Now, it's time to bring everything together and create a complete machine learning pipeline using the diamonds dataset.

Here's what you'll do:

  * Begin by loading the dataset with the `load_diamonds_data` function.
  * Preprocess the data using `preprocess_diamonds_data` to prepare it for model training.
  * Choose either a Random Forest or a Linear Regression model and train it using the `train_model` function.
  * Finally, generate predictions with the `predict_with_model` function and print them out.

This exercise will test your ability to integrate all the components into a cohesive and functional pipeline. Dive in and demonstrate your expertise!

```python
"""
Main script for Unit 2: Modular Model Training Pipeline

This script demonstrates the full pipeline from data processing to model training.
"""

# TODO: Import the necessary functions from data and model modules


def main():
    # TODO: Load the diamonds dataset from "diamonds.csv" and print its shape
    
    # TODO: Preprocess the dataset using preprocess_diamonds_data and print training/test set sizes
    
    # TODO: Train a model (either Random Forest or Linear Regression) using train_model
    
    # TODO: Generate predictions using predict_with_model and print them


if __name__ == "__main__":
    main()
```

```python
"""
Main script for Unit 2: Modular Model Training Pipeline

This script demonstrates the full pipeline from data processing to model training.
All data processing and model training/prediction functions are included in this single file.
"""

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression


def load_diamonds_data(file_path):
    """
    Loads the diamonds dataset from a CSV file.

    Args:
        file_path (str): The path to the diamonds CSV file.

    Returns:
        pd.DataFrame: The loaded diamonds DataFrame.
    """
    try:
        df = pd.read_csv(file_path)
        print(f"Dataset loaded successfully from {file_path}")
        return df
    except FileNotFoundError:
        print(f"Error: The file '{file_path}' was not found.")
        return None
    except Exception as e:
        print(f"An error occurred while loading the dataset: {e}")
        return None


def preprocess_diamonds_data(df):
    """
    Preprocesses the diamonds dataset by:
    1. Separating features (X) and target (y).
    2. Applying one-hot encoding to categorical features.
    3. Splitting the data into training and testing sets.

    Args:
        df (pd.DataFrame): The raw diamonds DataFrame.

    Returns:
        tuple: (X_train, X_test, y_train, y_test), or
               (None, None, None, None) if preprocessing fails.
    """
    if df is None:
        print("Preprocessing skipped: Input DataFrame is None.")
        return None, None, None, None

    # Define features (X) and target (y)
    # 'price' is the target variable
    X = df.drop('price', axis=1)
    y = df['price']

    # Identify categorical and numerical columns
    categorical_features = ['cut', 'color', 'clarity']
    numerical_features = X.select_dtypes(include=['number']).columns.tolist()

    # Create a column transformer for preprocessing
    # One-hot encode categorical features, pass through numerical features
    preprocessor = ColumnTransformer(
        transformers=[
            ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features),
            ('num', 'passthrough', numerical_features)
        ])

    # Apply preprocessing and split data
    try:
        # Fit and transform X to get the preprocessed features
        X_processed = preprocessor.fit_transform(X)

        # Split the preprocessed data into training and testing sets
        X_train, X_test, y_train, y_test = train_test_split(
            X_processed, y, test_size=0.2, random_state=42
        )

        print("Data preprocessing complete.")
        return X_train, X_test, y_train, y_test
    except Exception as e:
        print(f"An error occurred during preprocessing: {e}")
        return None, None, None, None


def train_model(X_train, y_train, model_type="random_forest", **model_params):
    """
    Train a machine learning model on the preprocessed data.

    Args:
        X_train (array-like): Preprocessed training features
        y_train (array-like): Training target values
        model_type (str): Type of model to train ('random_forest' or 'linear')
        **model_params: Additional parameters to pass to the model constructor

    Returns:
        object: Trained model, or None if the training data is missing,
                the model type is unsupported, or training fails
    """
    if X_train is None or y_train is None:
        print("Model training skipped: Training data is None.")
        return None

    print(f"Training a {model_type} model...")

    model = None
    if model_type == "random_forest":
        # Fill in defaults only for parameters the caller did not specify
        model_params.setdefault('n_estimators', 10)
        model_params.setdefault('random_state', 42)

        model = RandomForestRegressor(**model_params)

    elif model_type == "linear":
        model = LinearRegression(**model_params)

    else:
        print(f"Error: Unsupported model type: {model_type}")
        return None

    try:
        # Train the model
        model.fit(X_train, y_train)
        print("Model training completed!")
        return model
    except Exception as e:
        print(f"An error occurred during model training: {e}")
        return None


def predict_with_model(model, X):
    """
    Make predictions using a trained model.

    Args:
        model (object): Trained model
        X (array-like): Preprocessed features

    Returns:
        array: Predictions, or None if prediction is skipped or fails
    """
    if model is None or X is None:
        print("Prediction skipped: Model or test data is None.")
        return None

    try:
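        # Convert plain array-likes (e.g. lists) to a NumPy array for validation.
        # Objects that already expose .ndim (NumPy arrays, DataFrames, SciPy sparse
        # matrices) are left as-is, since np.array() would wrap a sparse matrix
        # in a 0-dimensional object array and fail the 2D check below.
        if not hasattr(X, "ndim"):
            X = np.array(X)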

        # Check if the array is empty
        if X.size == 0:
            print("Error: Input features array 'X' cannot be empty.")
            return None

        # Verify that X has the correct number of dimensions (should be 2D)
        if X.ndim != 2:
            print(f"Error: Input features array 'X' must be 2-dimensional. Got {X.ndim} dimensions.")
            return None

        predictions = model.predict(X)
        print("Predictions generated.")
        return predictions
    except Exception as e:
        print(f"An error occurred during prediction: {e}")
        return None


def main():
    # Load the diamonds dataset from "diamonds.csv" and print its shape
    print("--- Step 1: Loading Data ---")
    diamonds_df = load_diamonds_data("diamonds.csv")
    if diamonds_df is not None:
        print(f"Diamonds dataset shape: {diamonds_df.shape}")
    else:
        print("Failed to load dataset. Exiting.")
        return # Exit if data loading fails

    # Preprocess the dataset using preprocess_diamonds_data and print training/test set sizes
    print("\n--- Step 2: Preprocessing Data ---")
    X_train, X_test, y_train, y_test = preprocess_diamonds_data(diamonds_df)

    if X_train is not None and X_test is not None:
        print(f"Training set size (X_train): {X_train.shape}")
        print(f"Test set size (X_test): {X_test.shape}")
        print(f"Training target size (y_train): {y_train.shape}")
        print(f"Test target size (y_test): {y_test.shape}")
    else:
        print("Failed to preprocess data. Exiting.")
        return # Exit if preprocessing fails

    # Train a model (either Random Forest or Linear Regression) using train_model
    print("\n--- Step 3: Training Model ---")
    # Pass model_type='linear' instead to train a Linear Regression model
    model = train_model(X_train, y_train, model_type='random_forest')

    if model is None:
        print("Failed to train model. Exiting.")
        return # Exit if model training fails

    # Generate predictions using predict_with_model and print them
    print("\n--- Step 4: Generating Predictions ---")
    predictions = predict_with_model(model, X_test)

    if predictions is not None:
        print("Generated predictions (first 5):")
        print(predictions[:5])
        # Optionally, you can add evaluation metrics here:
        # from sklearn.metrics import mean_squared_error, r2_score
        # mse = mean_squared_error(y_test, predictions)
        # r2 = r2_score(y_test, predictions)
        # print(f"Mean Squared Error on test set: {mse:.2f}")
        # print(f"R-squared on test set: {r2:.2f}")
    else:
        print("Failed to generate predictions.")


if __name__ == "__main__":
    main()

```
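
Because `train_model` exposes the same interface for both model types, comparing them takes only a short loop. The sketch below is a minimal illustration under stated assumptions, not part of the exercise solution: it redefines a condensed stand-in for `train_model` so the block runs on its own (it raises `ValueError` for an unknown model type instead of printing), introduces a hypothetical `compare_models` helper, and uses synthetic data in place of the preprocessed diamonds features.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split


def train_model(X_train, y_train, model_type="random_forest", **model_params):
    """Condensed stand-in for the train_model function in the solution above."""
    if model_type == "random_forest":
        model_params.setdefault("n_estimators", 10)
        model_params.setdefault("random_state", 42)
        model = RandomForestRegressor(**model_params)
    elif model_type == "linear":
        model = LinearRegression(**model_params)
    else:
        # Assumption for this sketch: raise instead of printing and returning None
        raise ValueError(f"Unsupported model type: {model_type}")
    return model.fit(X_train, y_train)


def compare_models(X_train, X_test, y_train, y_test):
    """Hypothetical helper: train each model type and return its test-set R^2."""
    scores = {}
    for model_type in ("random_forest", "linear"):
        model = train_model(X_train, y_train, model_type=model_type)
        scores[model_type] = r2_score(y_test, model.predict(X_test))
    return scores


# Synthetic, roughly linear data stands in for the preprocessed diamonds features
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = X @ np.array([3.0, -2.0, 0.5, 1.0]) + rng.normal(scale=0.1, size=500)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scores = compare_models(X_train, X_test, y_train, y_test)
for name, score in scores.items():
    print(f"{name}: R^2 = {score:.3f}")
```

In the real pipeline, the four arrays returned by `preprocess_diamonds_data` would take the place of the synthetic split, and the surrounding code would not need to change when a new model type is added to `train_model`.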