### Ensuring Feature Consistency Between Training & Inference Pipelines

**Task 1**: Consistent Feature Preparation
- Step 1: Write a function for data preprocessing and imputation shared by both training and inference pipelines.
- Step 2: Demonstrate consistent application on both datasets.


In [1]:
import pandas as pd
from sklearn.impute import SimpleImputer

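def preprocess_data(df, imputer=None):
    """Impute numeric columns; fit a new imputer only when none is passed in."""
    # Select all numeric columns ('number' also covers int32, float32, etc.)
    numeric_cols = df.select_dtypes(include='number').columns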
    
    if imputer is None:
        # Initialize imputer (mean strategy here)
        imputer = SimpleImputer(strategy='mean')
        imputed_data = imputer.fit_transform(df[numeric_cols])
    else:
        # Use the passed imputer to transform new data (inference)
        imputed_data = imputer.transform(df[numeric_cols])
    
    # Replace the original numeric data with imputed data
    df_imputed = df.copy()
    df_imputed[numeric_cols] = imputed_data
    
    return df_imputed, imputer

# Example usage:
train_df = pd.DataFrame({'A': [1, 2, None, 4], 'B': [5, None, 7, 8]})
test_df = pd.DataFrame({'A': [None, 2, 3], 'B': [5, 6, None]})

# Fit on train, transform train
train_preprocessed, fitted_imputer = preprocess_data(train_df)

# Use the same imputer to transform test
test_preprocessed, _ = preprocess_data(test_df, imputer=fitted_imputer)

print("Train after preprocessing:\n", train_preprocessed)
print("Test after preprocessing:\n", test_preprocessed)


Train after preprocessing:
           A         B
0  1.000000  5.000000
1  2.000000  6.666667
2  2.333333  7.000000
3  4.000000  8.000000
Test after preprocessing:
           A         B
0  2.333333  5.000000
1  2.000000  6.000000
2  3.000000  6.666667
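
To make Step 2 checkable rather than something read off the printout, the cell below is a minimal sketch that asserts every value filled in on the test set equals the mean learned from the training set. It assumes the same toy `train_df` and `test_df` as above; the names `train_means`, `test_filled`, and `all_match` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

train_df = pd.DataFrame({'A': [1, 2, None, 4], 'B': [5, None, 7, 8]})
test_df = pd.DataFrame({'A': [None, 2, 3], 'B': [5, 6, None]})

# Fit on training data only; statistics_ stores the learned per-column means
imputer = SimpleImputer(strategy='mean').fit(train_df)
train_means = dict(zip(train_df.columns, imputer.statistics_))

# Apply the already-fitted imputer to the test data (no refitting)
test_filled = pd.DataFrame(imputer.transform(test_df), columns=test_df.columns)

# Each originally missing test value should equal the training mean
all_match = all(
    np.allclose(test_filled.loc[test_df[col].isna(), col], train_means[col])
    for col in test_df.columns
)
print("Test imputations use training means:", all_match)
```

If the imputer were refitted on `test_df` instead, the filled values would come from the test means and this check would fail, which is the inconsistency the task is meant to prevent.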


**Task 2**: Pipeline Integration
- Step 1: Use sklearn pipelines to encapsulate the preprocessing steps.
- Step 2: Configure identical pipelines for both training and inference.


In [2]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Define pipeline with imputer + scaler (example)
preprocessing_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# Fit pipeline on training data
preprocessing_pipeline.fit(train_df)

# Transform train and test data with the same pipeline
train_processed = preprocessing_pipeline.transform(train_df)
test_processed = preprocessing_pipeline.transform(test_df)

print("Train pipeline processed:\n", train_processed)
print("Test pipeline processed:\n", test_processed)


Train pipeline processed:
 [[-1.2344268 -1.5430335]
 [-0.3086067  0.       ]
 [ 0.         0.3086067]
 [ 1.5430335  1.2344268]]
Test pipeline processed:
 [[ 0.        -1.5430335]
 [-0.3086067 -0.6172134]
 [ 0.6172134  0.       ]]
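
One possible way to keep the training and inference configurations identical is to build the preprocessing steps in a single factory function and chain them with the estimator in one pipeline. The cell below is a sketch under assumptions: `make_preprocessing_pipeline` is a hypothetical helper, `y_train` is a made-up target, and `LinearRegression` is only a stand-in model.

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def make_preprocessing_pipeline():
    """Hypothetical helper: the single definition of the preprocessing steps."""
    return Pipeline([
        ('imputer', SimpleImputer(strategy='mean')),
        ('scaler', StandardScaler()),
    ])

train_df = pd.DataFrame({'A': [1, 2, None, 4], 'B': [5, None, 7, 8]})
test_df = pd.DataFrame({'A': [None, 2, 3], 'B': [5, 6, None]})
y_train = pd.Series([10.0, 12.0, 14.0, 18.0])  # made-up target for illustration

# Preprocessing and model live in one object, so predict() reuses
# the imputation means and scaling parameters learned during fit()
model_pipeline = Pipeline([
    ('preprocess', make_preprocessing_pipeline()),
    ('model', LinearRegression()),
])
model_pipeline.fit(train_df, y_train)

predictions = model_pipeline.predict(test_df)
print("Predictions on test data:\n", predictions)
```

Because the fitted preprocessing travels with the model, there is no separate inference-side configuration that could drift from the training one.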


**Task 3**: Saving and Loading Preprocessing Models
- Step 1: Save the transformation model after fitting it to the training data.
- Step 2: Load and apply the saved model during inference.


In [3]:
import joblib

# Save the fitted pipeline to disk
joblib.dump(preprocessing_pipeline, 'preprocessing_pipeline.pkl')

# Later, load it back during inference
loaded_pipeline = joblib.load('preprocessing_pipeline.pkl')

# Use loaded pipeline to transform inference data
inference_df = pd.DataFrame({'A': [2, None, 5], 'B': [None, 1, 8]})
inference_processed = loaded_pipeline.transform(inference_df)

print("Inference data after applying loaded pipeline:\n", inference_processed)


Inference data after applying loaded pipeline:
 [[-0.3086067  0.       ]
 [ 0.        -5.2463139]
 [ 2.4688536  1.2344268]]
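
A quick sanity check after saving is to confirm that the reloaded pipeline reproduces the output of the in-memory one. The cell below is a minimal sketch that rebuilds the same toy pipeline and writes to a temporary directory so it leaves no files behind; `round_trip_ok` is a name introduced here for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

train_df = pd.DataFrame({'A': [1, 2, None, 4], 'B': [5, None, 7, 8]})
inference_df = pd.DataFrame({'A': [2, None, 5], 'B': [None, 1, 8]})

pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
]).fit(train_df)

# Save and reload inside a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'preprocessing_pipeline.pkl')
    joblib.dump(pipeline, path)
    loaded = joblib.load(path)

# The reloaded pipeline should transform data exactly like the original
expected = pipeline.transform(inference_df)
actual = loaded.transform(inference_df)
round_trip_ok = np.allclose(expected, actual)
print("Round trip matches:", round_trip_ok)
```

Note that a pickled scikit-learn object is not guaranteed to load correctly under a different scikit-learn version, so it is worth recording the library version alongside the saved file.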
