# **Part 4: Model Deployment**

### **Objective**
This notebook demonstrates the final step of our project: deploying the trained model to make predictions on new data. It simulates a production environment where raw data comes in, and our saved model pipeline predicts the outcome.

### **Workflow**
1.  **Setup**: Import necessary libraries.
2.  **Load Production Artifacts**: Load the saved `RandomForestClassifier` model, the `StandardScaler`, and the `LabelEncoders`. These are all the components needed to make a prediction.
3.  **Simulate New Data**: Create a sample of new, raw data as if it were just collected. This data will be in the original format, before any preprocessing.
4.  **Create Prediction Pipeline**: Build a function that encapsulates all the preprocessing steps (encoding, scaling) and then calls the model to predict the outcome. This ensures that new data is treated exactly the same way as the training data.
5.  **Make Predictions**: Use the pipeline to predict whether the new users will generate revenue.

---

## Setup and Imports

In [4]:
import pandas as pd
import joblib
import os
import warnings

warnings.filterwarnings('ignore')


## Load Production Artifacts
We now load all the components saved during the preprocessing and training phases. These are the essential building blocks of our prediction service.

In [5]:
models_path = '../models/'
model_path = os.path.join(models_path, 'final_revenue_prediction_model.joblib')
scaler_path = os.path.join(models_path, 'scaler.pkl')
encoders_path = os.path.join(models_path, 'label_encoders.pkl')

try:
    model = joblib.load(model_path)
    scaler = joblib.load(scaler_path)
    encoders = joblib.load(encoders_path)
    print("All production artifacts loaded successfully!")
    print(f"Model: {type(model)}")
    print(f"Scaler: {type(scaler)}")
    print(f"Encoders: {type(encoders)}")
except FileNotFoundError as e:
    print(f"Error loading artifacts: {e}. Please ensure the paths are correct.")

All production artifacts loaded successfully!
Model: <class 'sklearn.ensemble._forest.RandomForestClassifier'>
Scaler: <class 'sklearn.preprocessing._data.StandardScaler'>
Encoders: <class 'dict'>
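
Loading without an exception does not by itself show that the artifacts belong together. The following is a minimal sketch of a consistency check, assuming the model was trained on the 17 columns and the scaler on the 10 numerical columns listed later in this notebook. The helper name `check_feature_counts` and the stand-in objects are hypothetical; `n_features_in_` is the attribute that fitted scikit-learn estimators use to record the number of input columns.

```python
from types import SimpleNamespace


def check_feature_counts(model, scaler, n_model_features, n_scaled_features):
    """Return a list of mismatch messages; an empty list means the counts agree."""
    problems = []
    # Fitted scikit-learn estimators expose n_features_in_ (columns seen at fit time).
    # getattr keeps the check from failing if the attribute is absent.
    model_n = getattr(model, "n_features_in_", None)
    scaler_n = getattr(scaler, "n_features_in_", None)
    if model_n is not None and model_n != n_model_features:
        problems.append(f"model expects {model_n} features, got {n_model_features}")
    if scaler_n is not None and scaler_n != n_scaled_features:
        problems.append(f"scaler expects {scaler_n} features, got {n_scaled_features}")
    return problems


# Stand-in objects so the sketch runs without the saved files (assumed counts);
# in the notebook you would pass the loaded `model` and `scaler` instead.
fake_model = SimpleNamespace(n_features_in_=17)
fake_scaler = SimpleNamespace(n_features_in_=10)

print(check_feature_counts(fake_model, fake_scaler, 17, 10))  # []
print(check_feature_counts(fake_model, fake_scaler, 16, 10))
```

An empty list suggests the artifacts are at least dimensionally compatible; it does not confirm that the columns are in the same order as during training.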


In [7]:
model

| Parameter | Value |
| --- | --- |
| n_estimators | 200 |
| criterion | 'gini' |
| max_depth | None |
| min_samples_split | 2 |
| min_samples_leaf | 1 |
| min_weight_fraction_leaf | 0.0 |
| max_features | 'sqrt' |
| max_leaf_nodes | None |
| min_impurity_decrease | 0.0 |
| bootstrap | True |


In [8]:
scaler

| Parameter | Value |
| --- | --- |
| copy | True |
| with_mean | True |
| with_std | True |


In [9]:
encoders

{'Month': LabelEncoder(),
 'OperatingSystems': LabelEncoder(),
 'Browser': LabelEncoder(),
 'Region': LabelEncoder(),
 'TrafficType': LabelEncoder(),
 'VisitorType': LabelEncoder()}
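
One practical caveat with `LabelEncoder` in a prediction service is that `transform` raises a `ValueError` for any label it did not see at fit time. The sketch below illustrates that behaviour on a toy encoder (not the saved artifact) and shows one possible vectorized fallback that maps unseen labels to a sentinel value. The helper name `encode_with_fallback` and the sentinel `-1` are assumptions made for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy encoder fitted on a few month labels (illustrative, not the saved artifact).
encoder = LabelEncoder().fit(["Dec", "Jun", "May"])

# transform() raises ValueError for a label that was not present at fit time.
try:
    encoder.transform(["Jan"])
except ValueError as err:
    print(f"Unseen label rejected: {err}")


def encode_with_fallback(series, encoder, unknown=-1):
    """Map each value to its encoder index, using `unknown` for unseen labels."""
    # classes_ is sorted, and a label's position in it is its encoded value.
    mapping = {label: idx for idx, label in enumerate(encoder.classes_)}
    return series.map(mapping).fillna(unknown).astype(int)


months = pd.Series(["May", "Jan", "Dec"])
print(encode_with_fallback(months, encoder).tolist())  # [2, -1, 0]
```

Whether a sentinel is appropriate depends on the model: a tree ensemble will accept the value, but it has never seen it during training, so predictions for such rows deserve extra scrutiny.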

## Simulate New, Unseen Data

In [25]:
new_data = pd.DataFrame({
    # Numerical Features
    'Administrative': [2, 0, 5],
    'Administrative_Duration': [120.5, 0.0, 310.7],
    'Informational': [1, 0, 4],
    'Informational_Duration': [45.0, 0.0, 220.3],
    'ProductRelated': [25, 3, 60],
    'ProductRelated_Duration': [800.0, 0.0, 2500.0],
    'BounceRates': [0.05, 0.2, 0.01],
    'ExitRates': [0.1, 0.2, 0.03],
    'PageValues': [15.0, 0.0, 40.5],
    'SpecialDay': [0.0, 0.0, 0.8],

    # Categorical Features
    'Month': ['May', 'Jun', 'Dec'],
    'OperatingSystems': [2, 1, 3],
    'Browser': [2, 2, 4],
    'Region': [1, 4, 2],
    'TrafficType': [3, 1, 8],
    'VisitorType': ['Returning_Visitor', 'New_Visitor', 'Returning_Visitor'],
    'Weekend': [False, True, False]
})

print("New raw data to predict:")
display(new_data)

New raw data to predict:


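|   | Administrative | Administrative_Duration | Informational | Informational_Duration | ProductRelated | ProductRelated_Duration | BounceRates | ExitRates | PageValues | SpecialDay | Month | OperatingSystems | Browser | Region | TrafficType | VisitorType | Weekend |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2 | 120.5 | 1 | 45.0 | 25 | 800.0 | 0.05 | 0.1 | 15.0 | 0.0 | May | 2 | 2 | 1 | 3 | Returning_Visitor | False |
| 1 | 0 | 0.0 | 0 | 0.0 | 3 | 0.0 | 0.2 | 0.2 | 0.0 | 0.0 | Jun | 1 | 2 | 4 | 1 | New_Visitor | True |
| 2 | 5 | 310.7 | 4 | 220.3 | 60 | 2500.0 | 0.01 | 0.03 | 40.5 | 0.8 | Dec | 3 | 4 | 2 | 8 | Returning_Visitor | False |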


## Create Prediction Pipeline Function

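This function is the core of our deployment. It takes raw data, preprocesses it using the loaded artifacts, and returns a human-readable prediction. Keeping every step in one function makes the prediction process repeatable and less error-prone, because new data always goes through the same transformations as the training data.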

In [15]:
def predict_revenue(data):
    """
    Preprocesses raw input data and returns a revenue prediction.

    Args:
        data (pd.DataFrame): Raw data with the same columns as the original dataset.

    Returns:
        list: A list of human-readable predictions.
    """
    processed_data = data.copy()

    # --- Step 1: Encode Categorical Variables ---
    # Labels that an encoder did not see during fitting are mapped to -1.
    for col, encoder in encoders.items():
        processed_data[col] = processed_data[col].apply(lambda x: encoder.transform([x])[0] if x in encoder.classes_ else -1)

    # --- Step 2: Encode Boolean Variable ---
    processed_data['Weekend'] = processed_data['Weekend'].astype(int)

    # --- Step 3: Scale Numerical Features ---
    numerical_features = [
        "Administrative", "Administrative_Duration", "Informational",
        "Informational_Duration", "ProductRelated", "ProductRelated_Duration",
        "BounceRates", "ExitRates", "PageValues", "SpecialDay"
    ]
    processed_data[numerical_features] = scaler.transform(processed_data[numerical_features])

    # --- Step 4: Ensure Column Order ---
    training_columns = [
        "Administrative", "Administrative_Duration", "Informational",
        "Informational_Duration", "ProductRelated", "ProductRelated_Duration",
        "BounceRates", "ExitRates", "PageValues", "SpecialDay", "Month",
        "OperatingSystems", "Browser", "Region", "TrafficType", "VisitorType", "Weekend"
    ]
    processed_data = processed_data[training_columns]

    # --- Step 5: Make Prediction ---
    predictions_numeric = model.predict(processed_data)

    # --- Step 6: Convert to Human-Readable Format ---
    predictions = ['Revenue' if pred == 1 else 'No Revenue' for pred in predictions_numeric]

    return predictions
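
Because `predict_revenue` assumes the input has the same columns as the original dataset, a missing column would only surface as a `KeyError` partway through preprocessing. A small guard in front of the function can report the problem more clearly. The following is a sketch under that assumption; the names `EXPECTED_COLUMNS` and `validate_input` are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Column names taken from the raw dataset layout used in this notebook.
EXPECTED_COLUMNS = [
    "Administrative", "Administrative_Duration", "Informational",
    "Informational_Duration", "ProductRelated", "ProductRelated_Duration",
    "BounceRates", "ExitRates", "PageValues", "SpecialDay", "Month",
    "OperatingSystems", "Browser", "Region", "TrafficType", "VisitorType", "Weekend",
]


def validate_input(data, expected_columns=EXPECTED_COLUMNS):
    """Raise ValueError if a required column is missing; return columns in order."""
    missing = [col for col in expected_columns if col not in data.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
    # Extra columns are dropped and the rest are put in the expected order.
    return data[expected_columns]


# A deliberately incomplete frame to show the error message.
incomplete = pd.DataFrame({"Administrative": [2], "Month": ["May"]})
try:
    validate_input(incomplete)
except ValueError as err:
    print(err)
```

One way to use it would be `predict_revenue(validate_input(new_data))`, so that malformed requests fail before any encoder or scaler is touched.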

## Make and Display Predictions

**Predicting on a Sample from the Original Raw Data**

In [19]:
try:
    original_df = pd.read_csv('../data/raw/online_shoppers_intention.csv')
except FileNotFoundError:
    print("Could not find the original raw data file. Please check the path.")
    original_df = None

if original_df is not None:
    sample_data = original_df.sample(3, random_state=42)
    
    sample_X = sample_data.drop('Revenue', axis=1)
    sample_y_true = sample_data['Revenue']

    display(sample_X)

    predictions_on_sample = predict_revenue(sample_X)
    
    validation_results = pd.DataFrame({
        'VisitorType': sample_X['VisitorType'],
        'PageValues': sample_X['PageValues'].round(2),
        'Month': sample_X['Month'],
        'True_Revenue': sample_y_true.map({True: 'Revenue', False: 'No Revenue'}),
        'Predicted_Revenue': predictions_on_sample
    })

    print("\n--- Side-by-Side Comparison: True vs. Predicted ---")
    display(validation_results)

|   | Administrative | Administrative_Duration | Informational | Informational_Duration | ProductRelated | ProductRelated_Duration | BounceRates | ExitRates | PageValues | SpecialDay | Month | OperatingSystems | Browser | Region | TrafficType | VisitorType | Weekend |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 8916 | 3 | 142.5 | 0 | 0.0 | 48 | 1052.255952 | 0.004348 | 0.013043 | 0.0 | 0.0 | Nov | 1 | 8 | 6 | 11 | Returning_Visitor | False |
| 772 | 6 | 437.391304 | 2 | 235.55 | 83 | 2503.881781 | 0.002198 | 0.004916 | 2.086218 | 0.0 | Mar | 2 | 2 | 3 | 2 | Returning_Visitor | False |
| 12250 | 1 | 41.125 | 0 | 0.0 | 126 | 4310.004668 | 0.000688 | 0.012823 | 3.451072 | 0.0 | Nov | 2 | 2 | 2 | 2 | Returning_Visitor | False |



--- Side-by-Side Comparison: True vs. Predicted ---


|   | VisitorType | PageValues | Month | True_Revenue | Predicted_Revenue |
| --- | --- | --- | --- | --- | --- |
| 8916 | Returning_Visitor | 0.0 | Nov | No Revenue | No Revenue |
| 772 | Returning_Visitor | 2.09 | Mar | Revenue | Revenue |
| 12250 | Returning_Visitor | 3.45 | Nov | No Revenue | Revenue |
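
The comparison can be summarized as an agreement rate. The snippet below is a small sketch that uses the three rows shown above, copied by hand into plain lists. With only three samples the figure is a sanity check on the pipeline, not an estimate of model accuracy.

```python
# Values copied from the three-row comparison above.
true_labels = ["No Revenue", "Revenue", "No Revenue"]
predicted_labels = ["No Revenue", "Revenue", "Revenue"]

# Count positions where the prediction equals the true label.
matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
agreement = matches / len(true_labels)

print(f"{matches} of {len(true_labels)} predictions match ({agreement:.0%})")
# prints: 2 of 3 predictions match (67%)
```

The one mismatch (session 12250) is a false positive: the model predicted revenue for a session that did not generate any.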


**Get predictions for the simulated new data we created earlier**

In [26]:
predictions = predict_revenue(new_data)

results_df = new_data.copy()
results_df['Predicted_Revenue'] = predictions

print("\n--- Prediction Results on Simulated New Data ---")
display(results_df)


--- Prediction Results on Simulated New Data ---


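|   | Administrative | Administrative_Duration | Informational | Informational_Duration | ProductRelated | ProductRelated_Duration | BounceRates | ExitRates | PageValues | SpecialDay | Month | OperatingSystems | Browser | Region | TrafficType | VisitorType | Weekend | Predicted_Revenue |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2 | 120.5 | 1 | 45.0 | 25 | 800.0 | 0.05 | 0.1 | 15.0 | 0.0 | May | 2 | 2 | 1 | 3 | Returning_Visitor | False | Revenue |
| 1 | 0 | 0.0 | 0 | 0.0 | 3 | 0.0 | 0.2 | 0.2 | 0.0 | 0.0 | Jun | 1 | 2 | 4 | 1 | New_Visitor | True | No Revenue |
| 2 | 5 | 310.7 | 4 | 220.3 | 60 | 2500.0 | 0.01 | 0.03 | 40.5 | 0.8 | Dec | 3 | 4 | 2 | 8 | Returning_Visitor | False | No Revenue |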


The `predict_revenue` function can now be integrated into applications, for example behind a FastAPI REST endpoint that serves real-time revenue predictions.