In [8]:
import pandas as pd
import numpy as np
import joblib
import warnings
import sys

# Suppress minor warnings for cleaner output
warnings.filterwarnings('ignore')

# --- 1. Define Assets and New Data (Simulation) ---

# CRITICAL: The list of 17 final feature names and order the model expects (Schema)
FINAL_COLUMNS = [
    'CreditScore', 'Age', 'Tenure', 'Balance', 'EstimatedSalary', 'Point Earned',
    'NumOfProducts', 'HasCrCard', 'IsActiveMember', 'Complain', 'Satisfaction Score',
    'Geography_Germany', 'Geography_Spain', 'Gender_Male', 'Card Type_GOLD', 
    'Card Type_PLATINUM', 'Card Type_SILVER'
]
numeric_cols = ['CreditScore', 'Age', 'Tenure', 'Balance', 'EstimatedSalary', 'Point Earned']
categorical_cols_ohe = ['Geography', 'Gender', 'Card Type']
# Variable for the ordinal column name (used for .cat.codes)
ordinal_col_name = 'Satisfaction Score'

# Raw data for a new customer (simulated input)
new_customer_raw = pd.DataFrame({
    'CreditScore': [619], 'Age': [42], 'Tenure': [2], 'Balance': [0],
    'EstimatedSalary': [101348.88], 'Point Earned': [464], 'Geography': ['France'],
    'Gender': ['Female'], 'NumOfProducts': [1], 'HasCrCard': [1],
    'IsActiveMember': [1], 'Complain': [1], 'Satisfaction Score': [2], 
    'Card Type': ['DIAMOND']
})


# --- 2. Load Deployment Assets (Model and Scaler) ---

try:
    # Load the saved model object (e.g., best XGBoost)
    best_model = joblib.load('model.pkl') 
    # Load the trained scaler object (for consistent normalization)
    scaler = joblib.load('scaler.pkl')
    # Assign the critical column list
    training_columns = FINAL_COLUMNS
    print("Assets (Model, Scaler, Columns) loaded successfully.")
except FileNotFoundError:
    # Exit if model or scaler files are missing
    print("Error: Ensure model.pkl and scaler.pkl files are in the same directory.")
    sys.exit(1)


# --- 3. Preprocessing New Data (Preventing Schema Drift) ---

new_data = new_customer_raw.copy()

# A. Log1p Transformation (Replicating training step for 'Age' and 'Balance')
new_data['Age'] = np.log1p(new_data['Age'])
new_data['Balance'] = np.log1p(new_data['Balance'])

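# B. Ordinal Encoding (Accessing column as Series for .cat.codes)
# Converts ordinal scores (e.g., 1, 2, 3) into numerical codes (0, 1, 2).
# NOTE: .astype('category') infers its categories from the data it is given. On a
# single-row frame the only category is that row's value, so the code is always 0.
# To reproduce the training codes, cast to a categorical dtype that lists the full
# set of training categories instead.
new_data[ordinal_col_name] = new_data[ordinal_col_name].astype('category').cat.codes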

# C. Scaling Numerical Features
# Apply scaler.transform() using stats learned from the training data.
new_data_scaled = scaler.transform(new_data[numeric_cols])
# Convert the scaled array back to a DataFrame
new_data_scaled_df = pd.DataFrame(new_data_scaled, columns=numeric_cols, index=new_data.index)

# D. Categorical Encoding and Concatenation
# Isolate non-numerical features
data_non_numeric = new_data.drop(columns=numeric_cols) 
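# Apply OHE to nominal categorical columns (Geography, Gender, Card Type).
# drop_first=False on purpose: with a single row, drop_first=True would drop the only
# category present, so a customer from Germany would lose 'Geography_Germany'.
# The reindex in step E discards the baseline columns that are not in training_columns.
new_data_ohe = pd.get_dummies(data_non_numeric, columns=categorical_cols_ohe, drop_first=False)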

# Combine scaled numerical features with encoded categorical features
new_data_processed = pd.concat([new_data_scaled_df, new_data_ohe], axis=1)


# E. Finalization: Reindex (CRITICAL STEP for Schema Alignment)
# Reindexes the data to match the exact column names and order from training_columns.
# fill_value=0 is essential to create missing OHE columns (e.g., 'Geography_Germany') and set them to 0.
X_predict_final = new_data_processed.reindex(columns=training_columns, fill_value=0)

print("New customer data processed and aligned to model schema.")
print(f"Final Data Shape: {X_predict_final.shape}")


# --- 4. Prediction Using Loaded Model ---

try:
    # Predict the class (0 or 1)
    prediction = best_model.predict(X_predict_final)[0]

    # Predict the probability of Churn (Class 1)
    proba = best_model.predict_proba(X_predict_final)[0][1]

    # --- 5. Final Results ---
    status = "CHURN" if prediction == 1 else "NO CHURN (Stay)"

    print("\n--- MODEL PREDICTION RESULTS ---")
    print(f"Model ({type(best_model).__name__}) predicted:")
    print(f"Customer Status: {status}")
    print(f"Churn Probability (Class 1): {proba:.4f}")

except ValueError as e:
    print("\nPREDICTION FAILED.")
    print(f"Error: {e}")

Assets (Model, Scaler, Columns) loaded successfully.
New customer data processed and aligned to model schema.
Final Data Shape: (1, 17)

--- MODEL PREDICTION RESULTS ---
Model (LogisticRegression) predicted:
Customer Status: CHURN
Churn Probability (Class 1): 0.9995


This code **simulates a complete deployment pipeline**: a saved model (`model.pkl`) is loaded and used to predict churn for a single new customer record.

The main goal of this code is to **demonstrate how to handle *Schema Drift*** by ensuring that the new customer data (`new_customer_raw`) goes through **exactly the same preprocessing steps** as the training data.

---

## 🚀 Code Explanation: Simulating Prediction in Production

### 1. Load Deployment Assets

* **Purpose:** Import the trained model and preprocessing objects that store statistics from the training data.
* **Process:**

  * `joblib.load()` is used to load **`best_model`** (the decision-making model) and **`scaler`** (which stores the statistics learned from the training data, such as the mean and standard deviation for a standard scaler).
  * The final list of columns (`training_columns`) is prepared to align the schema of new data.
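
One way to keep the column list from drifting away from the model is to save it in the same artifact. The sketch below is an assumption, not what the notebook does (the notebook hard-codes `FINAL_COLUMNS`). The file name `churn_bundle.pkl`, the dictionary keys, and the placeholder strings standing in for the fitted model and scaler are all hypothetical.

```python
import os
import tempfile

import joblib

# Hypothetical bundle: in practice "model" and "scaler" would be the fitted
# objects; placeholder strings keep this sketch self-contained.
bundle = {
    "model": "fitted-model-placeholder",
    "scaler": "fitted-scaler-placeholder",
    "columns": ["CreditScore", "Age", "Geography_Germany"],  # shortened for illustration
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "churn_bundle.pkl")  # hypothetical file name
    joblib.dump(bundle, path)    # training side: write everything once
    loaded = joblib.load(path)   # serving side: read everything back

# The schema now travels with the model instead of living in a second script.
training_columns = loaded["columns"]
print(training_columns)
```

With a bundle like this, the serving code cannot load a model without also getting the column order it was trained on.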

### 2. Preprocess New Data 

This step strictly replicates the preprocessing pipeline used during training:

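* **Log1p Transformation:** `Age` and `Balance` are transformed with `np.log1p()`, as in training, to reduce skew. Because `log1p(x) = log(1 + x)` is defined at zero, `Balance = 0` maps to `0`.
* **Scaling:** `scaler.transform(...)` scales the numeric columns using the **same** statistics ($\mu$ and $\sigma$) learned **only from the training data**; the scaler is never refit on new data.
* **Ordinal Encoding:** `Satisfaction Score` is converted to ordered numeric codes (`.cat.codes`). The codes match training only if the categorical uses the same category list as training; categories inferred from a single row always yield code `0`.
* **One-Hot Encoding (OHE):** Nominal features (`Geography`, `Gender`, `Card Type`) are converted to binary columns (`pd.get_dummies`). The baseline category of each feature (`France`, `Female`, `DIAMOND`) has no column in the final schema.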
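
The ordinal step is the easiest one to get subtly wrong at inference time. The sketch below contrasts categories inferred from a single row with categories fixed up front. The level list `[1, 2, 3, 4, 5]` is an assumption about the training data, and `SATISFACTION_LEVELS` is a hypothetical name, not something defined in the notebook.

```python
import pandas as pd

# Assumption: the training data used satisfaction levels 1 to 5.
SATISFACTION_LEVELS = [1, 2, 3, 4, 5]

one_row = pd.DataFrame({"Satisfaction Score": [2]})

# Categories inferred from this single row: the only category is 2, so its code is 0.
inferred = one_row["Satisfaction Score"].astype("category").cat.codes

# Categories fixed up front: 2 is the second level, so its code is 1.
fixed_dtype = pd.CategoricalDtype(categories=SATISFACTION_LEVELS, ordered=True)
fixed = one_row["Satisfaction Score"].astype(fixed_dtype).cat.codes

print(inferred.iloc[0], fixed.iloc[0])  # 0 1
```

Only the second variant gives the same code for a score of `2` regardless of how many rows are in the batch.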

### 3. Finalization: Schema Alignment

```python
X_predict_final = new_data_processed.reindex(columns=training_columns, fill_value=0)
```

* **Purpose:** This addresses **Schema Drift**.
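* **`reindex()`** ensures that the final DataFrame (`X_predict_final`) has **exactly the same column names and order** as seen by the model during training (`training_columns`); columns that are not in the list are dropped.
* `fill_value=0`: OHE columns that are absent from the new data (e.g., `Geography_Spain`) are added and set to `0`. This is correct only when the customer does not belong to that category, so the encoding step must not drop the column for the customer's actual category (a risk when `drop_first=True` is applied to a single row).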
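
To see why the interaction between `pd.get_dummies` and `reindex` matters, the sketch below encodes a single customer from Germany both ways. The four-column schema and the `encode` helper are illustrative assumptions, shortened from the real 17-column list.

```python
import pandas as pd

# Shortened, illustrative schema (the real list has 17 columns).
training_columns = ["Tenure", "Geography_Germany", "Geography_Spain", "Gender_Male"]

customer = pd.DataFrame({"Tenure": [3], "Geography": ["Germany"], "Gender": ["Male"]})


def encode(frame, drop_first):
    """One-hot encode, then align to the training schema (hypothetical helper)."""
    dummies = pd.get_dummies(frame, columns=["Geography", "Gender"], drop_first=drop_first)
    aligned = dummies.reindex(columns=training_columns, fill_value=0)
    return aligned.astype(int)


dropped = encode(customer, drop_first=True)
kept = encode(customer, drop_first=False)

print(dropped.iloc[0].tolist())  # [3, 0, 0, 0]  Germany and Male are lost
print(kept.iloc[0].tolist())     # [3, 1, 0, 1]  matches the training encoding
```

With `drop_first=True`, each feature has only one category in a single-row frame, so that category is dropped and `reindex` silently fills the column with `0`. Encoding all categories and letting `reindex` discard the baseline columns avoids this.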

### 4. Prediction and Output

* **`best_model.predict(X_predict_final)`:** The loaded model makes a class prediction (`0` or `1`).
* **`best_model.predict_proba(...)`:** Returns the model's probability for each class; `[0][1]` selects the probability of class `1` (churn) for the single row.
* **Result:** The code prints the predicted churn status and probability (here `CHURN` with a probability of `0.9995`). This is the **production output** that can be used by business systems.
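
The relationship between `predict` and `predict_proba` can be sketched on toy data. Everything below is hypothetical (the data, the `0.8` threshold, the variable names); it only illustrates that, for a binary `LogisticRegression`, the predicted label corresponds to a 0.5 cut on the class-1 probability, and that a business system could apply a different cut.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data (hypothetical): one feature, label 1 for larger values.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression().fit(X, y)

# predict_proba columns follow model.classes_, so look up class 1 explicitly
# instead of assuming it is the second column.
churn_col = list(model.classes_).index(1)
proba = model.predict_proba(X)[:, churn_col]

# For a binary LogisticRegression, predict() matches a 0.5 cut on that probability.
labels_from_proba = (proba > 0.5).astype(int)
print(np.array_equal(labels_from_proba, model.predict(X)))

# A stricter, hypothetical business threshold flags fewer (or equally many) customers.
flagged = (proba >= 0.8).astype(int)
print(flagged.sum(), labels_from_proba.sum())
```

Keeping the probability alongside the label lets downstream systems choose their own threshold without retraining the model.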
