<a href="https://colab.research.google.com/github/yassmin1/MLflow_step_by_step/blob/main/MLflow_3-1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

✅ **Module 3.1: Custom PythonModel for Real-World Use**
The sections below break the learning goal down step by step.

## 🎯 **Goal Expanded**

### **Packaging Pre/Post Processing with MLflow**

* **What it means:**
  Combining every data-processing step, both preprocessing (transforming raw data before prediction) and postprocessing (transforming model outputs), into a single, unified model package using MLflow.

---

### ✅ **Why This Matters:**

When deploying machine learning models, it's crucial to replicate the exact steps taken during training at prediction time. This includes:

* **Preprocessing:** Steps like scaling features, encoding categories, and handling missing values.
* **Prediction:** Using the trained model to make predictions.
* **Postprocessing:** Steps like converting numerical predictions into human-readable labels, applying thresholds, or formatting outputs.

Bundling these together ensures that the same transformations run at training time and at serving time, so predictions stay consistent across deployment environments.
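
To make the risk concrete, here is a minimal sketch of training/serving skew. It assumes the Iris dataset and a scaler plus logistic regression, the same setup the notebook cell later in this module uses; the variable names are illustrative only.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=200)),
])
pipeline.fit(X_train, y_train)

# Consistent serving: the fitted scaler runs before the classifier.
consistent = pipeline.predict(X_test)

# Skewed serving: the classifier alone receives raw, unscaled features.
skewed = pipeline.named_steps["clf"].predict(X_test)

n_changed = int((consistent != skewed).sum())
print(f"{n_changed} of {len(X_test)} predictions change when scaling is skipped")
```

Shipping the scaler and the classifier as one object removes the opportunity to call the classifier without its preprocessing.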

---

### 🧰 **Detailed Steps to Achieve This:**

1. **Create a Preprocessing Pipeline**

   * Use tools like Scikit-learn’s pipelines to standardize and automate preprocessing.

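   ```python
   from sklearn.linear_model import LogisticRegression
   from sklearn.pipeline import Pipeline
   from sklearn.preprocessing import StandardScaler

   # Scaling and the classifier live in one object, so they are always applied together
   pipeline = Pipeline([
       ('scaler', StandardScaler()),
       ('model', LogisticRegression())
   ])
   ```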

2. **Define Postprocessing Logic**

   * Include any required logic that processes predictions after the model runs. For example, converting class probabilities into labels:

   ```python
   def postprocess(pred_probs):
       return ["positive" if prob > 0.5 else "negative" for prob in pred_probs]
   ```

3. **Combine Using MLflow’s PythonModel**

   * Define a custom MLflow PythonModel that encapsulates both preprocessing and postprocessing:

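   ```python
   import joblib
   import mlflow.pyfunc

   class CustomModel(mlflow.pyfunc.PythonModel):
       def load_context(self, context):
           # Load the fitted pipeline from the artifacts logged with the model
           self.pipeline = joblib.load(context.artifacts["pipeline"])

       def predict(self, context, model_input):
           # Preprocess with every step except the final estimator
           preprocessed_input = self.pipeline[:-1].transform(model_input)
           # Probability of the positive class from the final estimator
           pred_probs = self.pipeline[-1].predict_proba(preprocessed_input)[:, 1]
           # Postprocess with the function defined in step 2
           return postprocess(pred_probs)
   ```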

4. **Log Everything Together**

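   * Log this bundled model to MLflow. The path in `artifacts` must point to a fitted pipeline that has already been saved to disk (for example with `joblib.dump`):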

   ```python
   mlflow.pyfunc.log_model(
       artifact_path="bundled_model",
       python_model=CustomModel(),
       artifacts={"pipeline": "pipeline.pkl"}
   )
   ```

5. **Deploy as One Unit**

   * The logged model can now be deployed and served as a single unit, for example as a REST API with `mlflow models serve` or as a Docker image with `mlflow models build-docker`.
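
As a rough sketch of what calling such a deployment might look like: assuming the model is served locally (for instance with `mlflow models serve -m runs:/<run_id>/bundled_model -p 5000`), MLflow 2.x scoring servers accept JSON at `/invocations` in the `dataframe_split` format. The URL, port, column names, and helper functions below are assumptions for illustration. The request function is defined but not called, since it needs a running server.

```python
import json
import urllib.request

def build_payload(columns, rows):
    """Build a request body in MLflow's `dataframe_split` JSON format."""
    return json.dumps({"dataframe_split": {"columns": columns, "data": rows}})

def score(url, payload):
    """POST the payload to a running scoring server and decode the JSON reply."""
    request = urllib.request.Request(
        url,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))

# Hypothetical column names and a single input row
payload = build_payload(
    columns=["feature_0", "feature_1", "feature_2", "feature_3"],
    rows=[[5.1, 3.5, 1.4, 0.2]],
)
print(payload)
# predictions = score("http://127.0.0.1:5000/invocations", payload)  # assumed local endpoint
```

Because preprocessing and postprocessing are inside the model, the client sends raw feature values and receives final labels; no transformation code lives on the caller's side.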

---

### 🚀 **Benefits of Packaging Pre/Post Processing:**

* **Consistency**: Ensures identical transformations during training and serving.
* **Reproducibility**: Makes models easier to replicate across environments.
* **Simplicity**: Reduces complexity during deployment, as you handle only one package instead of multiple separate scripts.


```python
# 📓 Module 3.2: Packaging Pre/Post Processing with MLflow
# Goal: Bundle preprocessing and postprocessing into a single deployable MLflow model

# ✅ Step 1: Install required packages
!pip install -q mlflow scikit-learn pandas joblib

# ✅ Step 2: Import required libraries
import mlflow.pyfunc
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import joblib
import os

# ✅ Step 3: Create preprocessing + model pipeline
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=200))
])

pipeline.fit(X_train, y_train)

# ✅ Step 4: Save pipeline to disk
os.makedirs("artifacts", exist_ok=True)
joblib.dump(pipeline, "artifacts/iris_pipeline.pkl")

# ✅ Step 5: Define PythonModel with postprocessing (convert probs to labels)
class WrappedPipelineModel(mlflow.pyfunc.PythonModel):
    def load_context(self, context):
        self.pipeline = joblib.load(context.artifacts["pipeline_file"])

    def predict(self, context, model_input):
        probs = self.pipeline.predict_proba(model_input)
        return pd.DataFrame({
            "predicted_label": probs.argmax(axis=1),
            "probability": probs.max(axis=1)
        })

# ✅ Step 6: Log the wrapped model with MLflow
model_path = "iris_wrapped_model"
artifacts = {"pipeline_file": "artifacts/iris_pipeline.pkl"}

with mlflow.start_run():
    mlflow.pyfunc.log_model(
        artifact_path=model_path,
        python_model=WrappedPipelineModel(),
        artifacts=artifacts
    )
    print("✅ Pipeline with preprocessing and postprocessing logged.")

# ✅ Step 7: Load and test the model
loaded = mlflow.pyfunc.load_model(f"runs:/{mlflow.last_active_run().info.run_id}/{model_path}")
test_input = pd.DataFrame(X_test, columns=[f"feature_{i}" for i in range(X_test.shape[1])])
predictions = loaded.predict(test_input)
print("\n🔮 Predictions with label and probability:")
print(predictions.head())
```



```
✅ Pipeline with preprocessing and postprocessing logged.

🔮 Predictions with label and probability:
   predicted_label  probability
0                1     0.875979
1                0     0.964411
2                2     0.997118
3                1     0.759399
4                1     0.752136
```
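
The `predicted_label` column in this output holds integer class indices. As one possible extension of the postprocessing step, those indices could be mapped to readable species names. The helper below is a hypothetical sketch (`to_readable` is not part of MLflow or of the notebook); the class order is the one scikit-learn's `load_iris` reports in `target_names`.

```python
import numpy as np
import pandas as pd

# Class order as reported by sklearn.datasets.load_iris().target_names
IRIS_CLASSES = ["setosa", "versicolor", "virginica"]

def to_readable(probs, class_names=IRIS_CLASSES):
    """Turn a (n_samples, n_classes) probability array into labels plus confidence."""
    probs = np.asarray(probs)
    idx = probs.argmax(axis=1)  # index of the most likely class per row
    return pd.DataFrame({
        "predicted_label": [class_names[i] for i in idx],
        "probability": probs.max(axis=1),
    })

# Two made-up probability rows, used only to exercise the helper
example = to_readable([[0.1, 0.8, 0.1], [0.9, 0.05, 0.05]])
print(example)
```

Inside `WrappedPipelineModel.predict`, the returned DataFrame could then be built with `to_readable(probs)` so that callers receive names rather than indices.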





## 📝 Assessment: Custom PythonModel for Real-World Use

### 📘 Multiple Choice (Answers in **bold**)

**1. What does `load_context()` allow your custom `PythonModel` to do?**   
A. Set model parameters manually   
**B. Load external artifacts like pipelines or tokenizers** ✅   
C. Tune hyperparameters dynamically   
D. Fetch model inputs from MLflow UI   

---

**2. In the custom model example, what happens if a negative value appears in the input?**      
A. An exception is raised      
**B. The model returns -1 for each row** ✅      
C. The model returns NaN      
D. Prediction is skipped for that row      

---

**3. What is the correct method to store files like pickled pipelines in MLflow?**   
A. `mlflow.save_model()`   
B. `mlflow.log_file()`   
**C. `mlflow.pyfunc.log_model(..., artifacts={...})`** ✅   
D. `mlflow.register_artifact()`   

---

**4. Why would you use a custom `pyfunc` model over a standard flavor like `mlflow.sklearn`?**   
A. To reduce log file size   
B. To skip preprocessing   
**C. To wrap custom logic such as input checks, transformation, or ensemble voting** ✅   
D. To avoid using artifacts   
   
---

### ✏️ Short Answer

**5. What is the advantage of using a pipeline + custom logic in a `PythonModel`?**   
*It combines feature engineering and model prediction logic in one deployable unit, which keeps the training and inference environments consistent.*   

---

**6. How do artifacts make your MLflow model more powerful and reusable?**   
*Artifacts allow models to include reusable components like encoders, scalers, vocabularies, or other external files necessary for prediction.*   

---

### 🧪 Mini Project

**7. Task:**   

* Modify the example so that instead of rejecting negative values, it replaces them with zero   
* Log and test the modified model   
* Use `mlflow.pyfunc.load_model()` and predict on `pd.DataFrame([[10, -10], [0, 100]])`   
* Output the predictions   