### Ensuring Feature Consistency Between Training & InferencePipelines:

**Task 1**: Consistent Feature Preparation
- Step 1: Write a function for data preprocessing and imputation shared by both training and inference pipelines.
- Step 2: Demonstrate consistent application on both datasets.

In [1]:
# write your code from here

**Task 2**: Pipeline Integration
- Step 1: Use sklearn pipelines to encapsulate the preprocessing steps.
- Step 2: Configure identical pipelines for both training and building inference models.

In [2]:
# write your code from here

**Task 3**: Saving and Loading Preprocessing Models
- Step 1: Save the transformation model after fitting it to the training data.
- Step 2: Load and apply the saved model during inference.

In [3]:
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
import joblib

# --- Task 1: Consistent Feature Preparation ---
print("--- Task 1: Consistent Feature Preparation ---")

# Create sample training and inference datasets
train_data = {'numerical_col': [1, 2, None, 4, 5],
              'categorical_col': ['A', None, 'B', 'A', 'C']}
train_df = pd.DataFrame(train_data)

inference_data = {'numerical_col': [6, None, 8, 9, 10],
                  'categorical_col': ['B', 'A', None, 'C', 'B']}
inference_df = pd.DataFrame(inference_data)

print("Training Data:")
print(train_df)
print("\nInference Data:")
print(inference_df)

def preprocess_data(df):
    """
    Function for preprocessing data, including imputation.
    """
    numerical_cols = df.select_dtypes(include=['number']).columns
    categorical_cols = df.select_dtypes(include=['object']).columns

    for col in numerical_cols:
        imputer_numerical = SimpleImputer(strategy='mean')
        df[col] = imputer_numerical.fit_transform(df[[col]])

    for col in categorical_cols:
        imputer_categorical = SimpleImputer(strategy='most_frequent')
        df[col] = imputer_categorical.fit_transform(df[[col]])

    return df

# Apply the consistent preprocessing function to both datasets
train_df_processed = preprocess_data(train_df.copy())
inference_df_processed = preprocess_data(inference_df.copy())

print("\nProcessed Training Data:")
print(train_df_processed)
print("\nProcessed Inference Data:")
print(inference_df_processed)

print("\nThis demonstrates consistent application of the same preprocessing steps (mean imputation for numerical, mode for categorical) to both training and inference datasets, ensuring feature consistency.")

print("\n" + "="*50 + "\n")

# --- Task 2: Pipeline Integration ---
print("--- Task 2: Pipeline Integration ---")

# Create sample training and inference datasets (with a target variable for training)
train_data_pipeline = {'numerical_col': [1, 2, None, 4, 5],
                       'categorical_col': ['A', None, 'B', 'A', 'C'],
                       'target': [0, 1, 0, 1, 0]}
train_df_pipeline = pd.DataFrame(train_data_pipeline)

inference_data_pipeline = {'numerical_col': [6, None, 8, 9, 10],
                           'categorical_col': ['B', 'A', None, 'C', 'B']}
inference_df_pipeline = pd.DataFrame(inference_data_pipeline)

X_train = train_df_pipeline.drop('target', axis=1)
y_train = train_df_pipeline['target']

# Identify numerical and categorical columns
numerical_cols_pipeline = X_train.select_dtypes(include=['number']).columns
categorical_cols_pipeline = X_train.select_dtypes(include=['object']).columns

# Create preprocessing pipelines
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', pd.get_dummies, {'handle_unknown': 'ignore'}) # handle_unknown for inference
])

# Combine preprocessing steps using ColumnTransformer
from sklearn.compose import ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols_pipeline),
        ('cat', categorical_transformer, categorical_cols_pipeline)
    ])

# Create the full training pipeline with a model
from sklearn.linear_model import LogisticRegression
training_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                                   ('classifier', LogisticRegression())])

# Train the pipeline
training_pipeline.fit(X_train, y_train)

# Configure an identical preprocessing pipeline for inference
inference_pipeline = Pipeline(steps=[('preprocessor', preprocessor)])

# Apply the preprocessing pipeline to the inference data
inference_df_processed_pipeline = inference_pipeline.transform(inference_df_pipeline)

print("\nProcessed Inference Data using Pipeline:")
print(inference_df_processed_pipeline)

print("\nUsing scikit-learn pipelines encapsulates all preprocessing steps, ensuring that the exact same transformations are applied to both training and inference data. The `ColumnTransformer` helps apply different transformations to different columns.")

print("\n" + "="*50 + "\n")

# --- Task 3: Saving and Loading Preprocessing Models ---
print("--- Task 3: Saving and Loading Preprocessing Models ---")

# Create a simpler preprocessing pipeline (just numerical scaling for demonstration)
numerical_data_for_save = pd.DataFrame({'feature1': [1, 2, 3, 4, 5],
                                        'feature2': [5, 4, None, 2, 1]})

numerical_processor_save = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# Fit the preprocessing pipeline to the training data
numerical_processor_save.fit(numerical_data_for_save)

# Save the fitted preprocessing model
preprocessing_model_path = 'preprocessing_model.joblib'
joblib.dump(numerical_processor_save, preprocessing_model_path)
print(f"Preprocessing model saved to: {preprocessing_model_path}")

# Load the saved preprocessing model during inference
loaded_preprocessor = joblib.load(preprocessing_model_path)

# Create new inference data
inference_data_for_load = pd.DataFrame({'feature1': [6, 7, 8, None, 10],
                                         'feature2': [1, 3, 2, 4, None]})

# Apply the loaded preprocessing model to the inference data
inference_data_processed_loaded = loaded_preprocessor.transform(inference_data_for_load)

print("\nInference Data before loading and applying preprocessor:")
print(inference_data_for_load)
print("\nInference Data after loading and applying the saved preprocessor:")
print(inference_data_processed_loaded)

print("\nSaving the fitted preprocessing model (e.g., using `joblib`) allows you to reuse the exact same transformation learned from the training data during inference, ensuring consistency without retraining the preprocessor on the inference data.")

--- Task 1: Consistent Feature Preparation ---
Training Data:
   numerical_col categorical_col
0            1.0               A
1            2.0            None
2            NaN               B
3            4.0               A
4            5.0               C

Inference Data:
   numerical_col categorical_col
0            6.0               B
1            NaN               A
2            8.0            None
3            9.0               C
4           10.0               B


ValueError: 2