## Create and Test Pipeline for the Final XGBoost Model

**Steps**
- Create a pipeline that:
    - Preprocesses new data points (in this case, new credit card transactions)
    - Applies the final XGBoost model from the previous task
    - Outputs a prediction of whether each new point is fraudulent
- Once the pipeline is created, run it on a simulated new data point as an example

In [1]:
#Import libraries
import pandas as pd
import numpy as np
import joblib

from xgboost import XGBClassifier

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

In [2]:
#Load the tuned XGBoost model (final model) and best threshold
final_model = joblib.load('../data/final_xgboost_model.pkl')
final_threshold = joblib.load('../data/final_xgboost_threshold.pkl')

#Load raw training data
X_train_raw = joblib.load('../data/X_train_raw.pkl')

In [3]:
final_model

In [4]:
#Create preprocessor compatible with scikit-learn
#This applies log transformation to 'Amount' and scales it
class Preprocessor(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.scaler = RobustScaler()

    def fit(self, X, y = None):
        log_amount = np.log1p(X['Amount']).values.reshape(-1,1)
        self.scaler.fit(log_amount)
        return self

    def transform(self, X):
        X_copy = X.copy()

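        #Create Log_Amount and scale it
        #The scaler was fit on a NumPy array, so pass an array here too (avoids a feature-name warning)
        X_copy['Log_Amount'] = np.log1p(X_copy['Amount'])
        X_copy['Log_Amount_Scaled'] = self.scaler.transform(X_copy[['Log_Amount']].values).ravel()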

        #Drop unwanted columns
        X_processed = X_copy.drop(columns = ['Time', 'Amount', 'Log_Amount'], errors = 'ignore')
        return X_processed[['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10',
                            'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19',
                            'V20', 'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28',
                            'Log_Amount_Scaled']]
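
To make the transformation inside `Preprocessor` concrete, here is a minimal sketch on made-up amounts (the `toy` values are hypothetical, not taken from the dataset). It applies `np.log1p` and `RobustScaler` and checks the result against the median and interquartile range computed by hand, which is what `RobustScaler` uses with its default settings.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Hypothetical transaction amounts, including one large outlier
toy = pd.DataFrame({'Amount': [0.0, 1.5, 20.0, 99.99, 25000.0]})

# log1p compresses the right tail and is defined at Amount = 0
log_amount = np.log1p(toy['Amount']).to_numpy().reshape(-1, 1)

# RobustScaler (default settings) subtracts the median and divides by the IQR
scaler = RobustScaler().fit(log_amount)
scaled = scaler.transform(log_amount).ravel()

# The same result computed by hand
median = np.median(log_amount)
q1, q3 = np.percentile(log_amount, [25, 75])
manual = (log_amount.ravel() - median) / (q3 - q1)

print(np.allclose(scaled, manual))
```

Because the scaler centers on the median rather than the mean, a single very large amount has little influence on how the other values are scaled.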



In [5]:
#Create pipeline
pipeline = Pipeline([
    ('preprocessing', Preprocessor()),
    ('model', final_model)
    ])

#Fraud prediction function
def predict_fraud(df_new):
    """
    Predicts fraud on new data using the final pipeline and best threshold
    Parameters:
        df_new (DataFrame): Raw transaction data with original features including 'Amount' and 'Time'
    Returns:
        DataFrame with prediction scores ('fraud_probability') and fraud label ('predicted_fraud')
    """
    probs = pipeline.predict_proba(df_new)[:,1]
    preds = (probs >= final_threshold).astype(int)

    output = df_new.copy()
    output['fraud_probability'] = probs
    output['predicted_fraud'] = preds
    return output
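
The thresholding step in `predict_fraud` can be illustrated in isolation. This is a minimal sketch with made-up probabilities and placeholder thresholds; `apply_threshold` is a hypothetical helper, and the actual tuned value is whatever `final_threshold` holds.

```python
import numpy as np

def apply_threshold(probs, threshold):
    """Label a probability as fraud (1) when it is at or above the threshold."""
    return (np.asarray(probs) >= threshold).astype(int)

# Hypothetical fraud probabilities, for illustration only
probs = np.array([0.02, 0.35, 0.60, 0.97])

# Placeholder thresholds; the notebook loads its tuned value from disk
default_preds = apply_threshold(probs, 0.5)
lower_preds = apply_threshold(probs, 0.3)

print(default_preds)  # [0 0 1 1]
print(lower_preds)    # [0 1 1 1]
```

Lowering the threshold flags more transactions as fraud, trading precision for recall, which is why the threshold is tuned separately from the model.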



In [6]:
#Fit only the preprocessing step; the XGBoost model is already trained and is not refit here
pipeline.named_steps['preprocessing'].fit(X_train_raw)

In [7]:
#Save the pipeline
joblib.dump(pipeline, "../data/final_pipeline.pkl")

['../data/final_pipeline.pkl']
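
A saved pipeline can be reloaded with `joblib.load` in a later session. The sketch below shows the round trip on a toy stand-in pipeline (synthetic data and a `LogisticRegression`, both assumptions for illustration, not the fraud model). Note that pickling stores a reference to a custom class rather than its code, so the `Preprocessor` class would need to be defined or importable wherever `final_pipeline.pkl` is loaded.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# Toy stand-in data and pipeline (not the fraud model)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] > 0).astype(int)

toy_pipeline = Pipeline([
    ('scale', RobustScaler()),
    ('model', LogisticRegression()),
]).fit(X, y)

# Save to a temporary directory, then load the pipeline back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_pipeline.pkl')
    joblib.dump(toy_pipeline, path)
    reloaded = joblib.load(path)

# The reloaded pipeline should reproduce the original predictions
same = np.allclose(toy_pipeline.predict_proba(X), reloaded.predict_proba(X))
print(same)
```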

In [8]:
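#Simulate a new data point by sampling one row from the original Kaggle data
#(the row may also be in the training set, so this is a smoke test of the pipeline, not an evaluation)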
df = pd.read_csv('../data/creditcard.csv')
df_x = df.drop(columns=['Class'])

In [9]:
df_x1 = df_x.sample(n = 1, random_state = 42)

In [10]:
df_x1

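| | Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V20 | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 43428 | 41505.0 | -16.526507 | 8.584972 | -18.649853 | 9.505594 | -13.793819 | -2.832404 | -16.701694 | 7.517344 | -8.507059 | ... | -1.514923 | 1.190739 | -1.12767 | -2.358579 | 0.673461 | -1.4137 | -0.462762 | -2.018575 | -1.042804 | 364.19 |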


In [11]:
#Apply the pipeline to predict if the new data point is a fraud
result = predict_fraud(df_x1)
print(result[['fraud_probability','predicted_fraud']])

       fraud_probability  predicted_fraud
43428           0.999966                1




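The model assigns this transaction a fraud probability of about 0.99997, which is at or above the tuned threshold, so the pipeline classifies it as fraudulent (`predicted_fraud` = 1).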