# Risk Score Calculator

This notebook loads a saved `fraud_detection_pipeline.pkl` (a scikit-learn pipeline including preprocessing + classifier) and calculates a **Risk Score** (0-100) for input transactions. If the pipeline is not found, the notebook offers to train a simple pipeline from a provided dataset.

**Outputs:** `riskscore_results.csv` (and a sample preview).

**How to use:**
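- Place your saved pipeline file `fraud_detection_pipeline.pkl` in the same folder as this notebook, and provide the transactions to score as a CSV whose feature columns are named like your training data. Set `NEW_TXN_PATH` in the parameters cell to that file, for example `new_transactions.csv`.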
- Run cells top-to-bottom.

In [1]:

# 1) Imports & helper functions
import os
import pandas as pd
import numpy as np
import joblib
from sklearn.pipeline import Pipeline

def risk_bucket(score):
    if score >= 75:
        return "Very High"
    elif score >= 50:
        return "High"
    elif score >= 25:
        return "Medium"
    else:
        return "Low"


In [6]:

# 2) User parameters - update paths here if needed
PIPELINE_PATH = "fraud_detection_pipeline.pkl"   # path to saved pipeline (preferred)
NEW_TXN_PATH = "processed_fraud_dataset2.csv"            # path to new transactions CSV (raw features matching training columns)
OUTPUT_PATH = "riskscore_results.csv"


In [7]:

# 3) Load pipeline if exists
pipeline = None
if os.path.exists(PIPELINE_PATH):
    pipeline = joblib.load(PIPELINE_PATH)
    print(f"Loaded pipeline from: {PIPELINE_PATH}")
else:
    print(f"Pipeline file not found at: {PIPELINE_PATH}")
    print("If you want to train a pipeline here, provide a training CSV and uncomment the training block below.")


Loaded pipeline from: fraud_detection_pipeline.pkl


## Optional: Train a simple pipeline (only if you don't have a saved pipeline)

If you don't have a saved pipeline, you can train a simple pipeline using a labeled CSV. The CSV must contain the target column `isFraud` and the same feature columns used during original model training. Uncomment and modify the block below to train.

In [None]:

# --- TRAINING BLOCK (OPTIONAL) ---
# To use: set TRAIN_DATA_PATH to a CSV that contains 'isFraud' and all features.
# Then uncomment and run this cell.

TRAIN_DATA_PATH = "processed_fraud_dataset2.csv"  # <-- change if different

# Uncomment the following to train a simple pipeline if you need it.
# from sklearn.model_selection import train_test_split
# from sklearn.preprocessing import StandardScaler, OneHotEncoder
# from sklearn.compose import ColumnTransformer
# from sklearn.linear_model import LogisticRegression
# from sklearn.pipeline import Pipeline
# 
# if os.path.exists(TRAIN_DATA_PATH):
#     df = pd.read_csv(TRAIN_DATA_PATH)
#     X = df.drop(['isFraud','nameOrig','nameDest','isFlaggedFraud'], axis=1, errors='ignore')
#     y = df['isFraud']
#     # Simple handling: detect numeric/categorical
#     numeric_cols = X.select_dtypes(include=['int64','float64']).columns.tolist()
#     cat_cols = X.select_dtypes(include=['object','category']).columns.tolist()
#     preprocessor = ColumnTransformer([
#         ('num', StandardScaler(), numeric_cols),
#         ('cat', OneHotEncoder(handle_unknown='ignore', drop='first'), cat_cols)
#     ], remainder='drop')
# 
#     pipeline = Pipeline([
#         ('prep', preprocessor),
#         ('clf', LogisticRegression(max_iter=1000, class_weight='balanced'))
#     ])
# 
#     X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)
#     pipeline.fit(X_train, y_train)
#     joblib.dump(pipeline, PIPELINE_PATH)
#     print("Trained and saved pipeline to", PIPELINE_PATH)
# else:
#     print("Training CSV not found at", TRAIN_DATA_PATH)


## Load new transactions and compute RiskScore

This cell will:
- load `NEW_TXN_PATH`
- compute `fraud_proba` using the pipeline
- create `risk_score` (0-100) and `risk_bucket`
- save results to `OUTPUT_PATH`

If `NEW_TXN_PATH` is not found, the cell will instead compute RiskScores for the test set if the pipeline was trained in this notebook.
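
The score and bucket steps listed above can be sketched in plain Python. This is a minimal illustration, not the cell itself: `to_risk_score` is a hypothetical helper name, and the three probabilities are copied from the sample output rows shown further down.

```python
def risk_bucket(score):
    """Map a 0-100 risk score to the notebook's four buckets."""
    if score >= 75:
        return "Very High"
    elif score >= 50:
        return "High"
    elif score >= 25:
        return "Medium"
    return "Low"


def to_risk_score(fraud_proba):
    """Scale a probability in [0, 1] to a 0-100 score, rounded to 2 decimals."""
    return round(fraud_proba * 100, 2)


# Probabilities copied from the sample output of this notebook (illustrative only)
probas = [2.968551e-11, 0.4394917, 0.7571538]
for p in probas:
    score = to_risk_score(p)
    print(f"{p:.6g} -> {score} -> {risk_bucket(score)}")
```

Because the bucket boundaries are inclusive (`>=`), a score of exactly 75 lands in "Very High" and a score of exactly 25 lands in "Medium".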

In [8]:

# 4) Compute RiskScore
if pipeline is None:
    print("No pipeline available. Cannot compute risk scores. Please provide a pipeline or train one using the optional training block.")
else:
    if os.path.exists(NEW_TXN_PATH):
        df_new = pd.read_csv(NEW_TXN_PATH)
        print(f"Loaded new transactions: {NEW_TXN_PATH} (rows: {len(df_new)})")
        # Predict probabilities
        proba = pipeline.predict_proba(df_new)
        # find index of fraud class (in case classes_ ordering differs)
        if hasattr(pipeline, 'classes_'):
            try:
                fraud_index = list(pipeline.classes_).index(1)
            except ValueError:
                # fallback to column 1
                fraud_index = 1
        else:
            fraud_index = 1

        fraud_proba = proba[:, fraud_index]
        df_new = df_new.reset_index(drop=True)
        df_new['fraud_proba'] = fraud_proba
        df_new['risk_score'] = (fraud_proba * 100).round(2)
        df_new['risk_bucket'] = df_new['risk_score'].apply(risk_bucket)
        df_new.to_csv(OUTPUT_PATH, index=False)
        print(f"Risk scores saved to {OUTPUT_PATH}")
        display(df_new.head(10))
    else:
        print(f"New transactions file not found at: {NEW_TXN_PATH}")
        # As a helpful fallback, check if pipeline was just trained in this notebook (variable 'X_test' may exist)
        if 'X_test' in globals() and 'y_test' in globals():
            print("Using X_test from training to compute sample risk scores.")
            df_sample = X_test.copy().reset_index(drop=True)
            proba = pipeline.predict_proba(df_sample)
            fraud_index = 1 if not hasattr(pipeline, 'classes_') else list(pipeline.classes_).index(1)
            fraud_proba = proba[:, fraud_index]
            df_sample['fraud_proba'] = fraud_proba
            df_sample['risk_score'] = (fraud_proba * 100).round(2)
            df_sample['risk_bucket'] = df_sample['risk_score'].apply(risk_bucket)
            df_sample.to_csv(OUTPUT_PATH, index=False)
            print(f"Sample risk scores saved to {OUTPUT_PATH}")
            display(df_sample.head(10))
        else:
            print("No data to score. Provide a new_transactions.csv or train a pipeline.")


Loaded new transactions: processed_fraud_dataset2.csv (rows: 6362620)
Risk scores saved to riskscore_results.csv


Unnamed: 0,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud,balanceDiffOrig,balanceDiffDest,fraud_proba,risk_score,risk_bucket
0,PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,0,0,9839.64,0.0,2.968551e-11,0.0,Low
1,PAYMENT,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.0,0.0,0,0,1864.28,0.0,2.250958e-11,0.0,Low
2,TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,1,0,181.0,0.0,0.7571538,75.72,Very High
3,CASH_OUT,181.0,C840083671,181.0,0.0,C38997010,21182.0,0.0,1,0,181.0,-21182.0,0.4394917,43.95,Medium
4,PAYMENT,11668.14,C2048537720,41554.0,29885.86,M1230701703,0.0,0.0,0,0,11668.14,0.0,2.835938e-11,0.0,Low
5,PAYMENT,7817.71,C90045638,53860.0,46042.29,M573487274,0.0,0.0,0,0,7817.71,0.0,2.625431e-11,0.0,Low
6,PAYMENT,7107.77,C154988899,183195.0,176087.23,M408069119,0.0,0.0,0,0,7107.77,0.0,2.818642e-11,0.0,Low
7,PAYMENT,7861.64,C1912850431,176087.23,168225.59,M633326333,0.0,0.0,0,0,7861.64,0.0,2.852525e-11,0.0,Low
8,PAYMENT,4024.36,C1265012928,2671.0,0.0,M1176932104,0.0,0.0,0,0,2671.0,0.0,2.259326e-11,0.0,Low
9,DEBIT,5337.77,C712410124,41720.0,36382.23,C195600860,41898.0,40348.79,0,0,5337.77,-1549.21,1.015537e-10,0.0,Low


## Notes
- Tune bucket thresholds to match your business needs.
- If probabilities are poorly calibrated, consider using `CalibratedClassifierCV`.
- If your pipeline's preprocessing expects a fixed set of columns, ensure the CSV at `NEW_TXN_PATH` has the same column names.
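
The column check in the last note can be done before scoring. The sketch below assumes the expected feature names are the ones printed by the training cell in this notebook; `check_columns` and the `incoming` header are hypothetical and only there for illustration.

```python
# Feature columns printed by the training cell in this notebook
EXPECTED_COLUMNS = [
    "amount", "oldbalanceOrg", "newbalanceOrig", "oldbalanceDest",
    "newbalanceDest", "balanceDiffOrig", "balanceDiffDest", "type",
]


def check_columns(expected, actual):
    """Return (missing, extra) column names, preserving input order."""
    expected_set, actual_set = set(expected), set(actual)
    missing = [c for c in expected if c not in actual_set]
    extra = [c for c in actual if c not in expected_set]
    return missing, extra


# Hypothetical header of an incoming CSV: one feature is absent, one ID column is extra
incoming = ["type", "amount", "nameOrig", "oldbalanceOrg", "newbalanceOrig",
            "oldbalanceDest", "newbalanceDest", "balanceDiffOrig"]
missing, extra = check_columns(EXPECTED_COLUMNS, incoming)
print("Missing:", missing)
print("Extra:", extra)
```

In the notebook, `actual` would be `list(df_new.columns)`. With the `ColumnTransformer(remainder='drop')` used in the training cell, extra columns should simply be ignored, while a missing feature column is expected to raise an error at prediction time.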

In [10]:
# === Step 5: Calibration, threshold tuning, buckets, plots & alerts ===
# Paste this after your Step 4 cell that produced `df_new` or `results`.
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.metrics import roc_curve, roc_auc_score, precision_recall_curve, confusion_matrix, classification_report

# ---------------------
# PARAMETERS (edit as needed)
PIPELINE_PATH = "fraud_detection_pipeline.pkl"   # path to trained pipeline (used if you want to re-fit calibrated classifier)
SCORED_PATH = "riskscore_results.csv"            # results produced in Step 4
CALIBRATE = False                                 # set True to calibrate model probabilities (requires pipeline + training data)
CALIBRATION_METHOD = 'isotonic'                   # 'isotonic' or 'sigmoid'
ALERT_THRESHOLD = 50.0                            # default RiskScore threshold (0-100) to mark "alert"
SAVE_ALERTS = True
ALERTS_PATH = "alerts.csv"
# ---------------------

# Load scored data (use existing DataFrame if present)
if 'df_new' in globals():
    scored = df_new.copy()
elif 'results' in globals():
    scored = results.copy()
elif os.path.exists(SCORED_PATH):
    scored = pd.read_csv(SCORED_PATH)
else:
    raise FileNotFoundError("No scored results found. Run Step 4 first or ensure SCORED_PATH is correct.")

print(f"Loaded scored dataset with {len(scored)} rows.")

# Ensure columns exist
if 'fraud_proba' not in scored.columns and 'risk_score' in scored.columns:
    scored['fraud_proba'] = scored['risk_score'] / 100.0

# 1) Optional calibration (requires the saved pipeline plus labeled data in the notebook environment)
if CALIBRATE:
    if not os.path.exists(PIPELINE_PATH):
        raise FileNotFoundError("Pipeline file not found for calibration. Set CALIBRATE=False or provide pipeline.")
    # Labeled data must already exist in the notebook, e.g. X_train/X_test/y_train/y_test from a training cell.
    if 'X_test' not in globals() or 'y_test' not in globals():
        raise RuntimeError("Calibration requested but no X_test/y_test found in the notebook environment. Provide validation data.")
    print("Calibrating classifier using", CALIBRATION_METHOD)
    import joblib
    from sklearn.base import clone
    from sklearn.pipeline import Pipeline
    orig_pipeline = joblib.load(PIPELINE_PATH)
    clf = orig_pipeline.named_steps['clf'] if hasattr(orig_pipeline, 'named_steps') else orig_pipeline
    # clone() returns an unfitted copy, so calibrate with internal cross-validation (cv=5)
    # instead of cv='prefit'. `estimator=` needs scikit-learn >= 1.2; older releases call it `base_estimator`.
    calibrator = CalibratedClassifierCV(estimator=clone(clf), method=CALIBRATION_METHOD, cv=5)
    # Keep the original preprocessing step in front of the calibrated classifier, if there is one
    preproc = orig_pipeline.named_steps.get('prep') if hasattr(orig_pipeline, 'named_steps') else None
    if preproc is not None:
        model_for_cal = Pipeline([('prep', preproc), ('clf', calibrator)])
    else:
        model_for_cal = calibrator
    # Fit on the training split when available, otherwise fall back to X_test/y_test.
    # Assumes these are feature-only DataFrames with the columns the pipeline expects.
    if 'X_train' in globals() and 'y_train' in globals():
        model_for_cal.fit(X_train, y_train)
    else:
        model_for_cal.fit(X_test, y_test)
    # Re-score the data with the calibrated model (column 1 is assumed to be the fraud class)
    features = scored.drop(columns=['fraud_proba', 'risk_score', 'risk_bucket'], errors='ignore')
    scored_proba = model_for_cal.predict_proba(features)[:, 1]
    scored['fraud_proba_calibrated'] = scored_proba
    scored['risk_score_calibrated'] = (scored_proba * 100).round(2)
    print("Calibration complete. Added columns: fraud_proba_calibrated, risk_score_calibrated")

# 2) Compute ROC / AUC if true labels available
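# y_test is only usable when it lines up row-for-row with the scored data
if 'y_true' in scored.columns or ('y_test' in globals() and len(y_test) == len(scored)):
    y_true = scored['y_true'] if 'y_true' in scored.columns else y_test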
    proba_col = 'fraud_proba_calibrated' if 'fraud_proba_calibrated' in scored.columns else 'fraud_proba'
    auc = roc_auc_score(y_true, scored[proba_col])
    fpr, tpr, thr = roc_curve(y_true, scored[proba_col])
    print(f"ROC AUC = {auc:.4f}")
    # Plot ROC
    plt.figure(figsize=(6,5))
    plt.plot(fpr, tpr, label=f'AUC = {auc:.4f}')
    plt.plot([0,1],[0,1],'k--')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC Curve')
    plt.legend(loc='lower right')
    plt.show()
    # Compute Youden's J to suggest optimal threshold
    youden = tpr - fpr
    best_idx = np.argmax(youden)
    suggested_thr = thr[best_idx]
    suggested_score = round(suggested_thr * 100, 2)
    print(f"Suggested threshold (Youden's J): prob >= {suggested_thr:.4f} -> RiskScore >= {suggested_score}")
else:
    print("No true labels available: skipping ROC/AUC and threshold-suggestion steps.")

# 3) Determine final risk score column to use
final_proba_col = 'fraud_proba_calibrated' if 'fraud_proba_calibrated' in scored.columns else 'fraud_proba'
scored['risk_score'] = (scored[final_proba_col] * 100).round(2)

# 4) Re-compute/adjust buckets (you can update thresholds here)
def bucket_by_thresholds(score):
    # Customize thresholds here (these are example defaults)
    if score >= 90:
        return "Critical"
    elif score >= 75:
        return "Very High"
    elif score >= 50:
        return "High"
    elif score >= 25:
        return "Medium"
    else:
        return "Low"

scored['risk_bucket'] = scored['risk_score'].apply(bucket_by_thresholds)

# 5) Export alerts for high-risk transactions
alerts = scored[scored['risk_score'] >= ALERT_THRESHOLD].copy()
print(f"Alerts found: {len(alerts)} (threshold >= {ALERT_THRESHOLD})")
if SAVE_ALERTS and len(alerts) > 0:
    alerts.to_csv(ALERTS_PATH, index=False)
    print("Saved alerts to", ALERTS_PATH)

# 6) Save the updated scored file (with buckets and calibrated columns if any)
scored.to_csv(SCORED_PATH, index=False)
print("Updated scored results saved to", SCORED_PATH)

# 7) Quick summary
display_columns = ['risk_score','risk_bucket']
for col in ['fraud_proba', 'fraud_proba_calibrated', 'risk_score_calibrated']:
    if col in scored.columns:
        display_columns.insert(0, col)

print("Sample rows:")
display(scored.head(10))

# 8) (Optional) Calibration plot
if 'y_true' in scored.columns and ('fraud_proba_calibrated' in scored.columns or 'fraud_proba' in scored.columns):
    probcol = 'fraud_proba_calibrated' if 'fraud_proba_calibrated' in scored.columns else 'fraud_proba'
    fraction_of_positives, mean_predicted_value = calibration_curve(y_true, scored[probcol], n_bins=10)
    plt.figure(figsize=(6,5))
    plt.plot(mean_predicted_value, fraction_of_positives, "s-", label="Calibration")
    plt.plot([0,1],[0,1],"k--", label="Perfectly calibrated")
    plt.xlabel("Mean predicted probability")
    plt.ylabel("Fraction of positives")
    plt.title("Calibration curve")
    plt.legend()
    plt.show()



Loaded scored dataset with 6362620 rows.
No true labels available: skipping ROC/AUC and threshold-suggestion steps.
Alerts found: 350626 (threshold >= 50.0)
Saved alerts to alerts.csv
Updated scored results saved to riskscore_results.csv
Sample rows:


Unnamed: 0,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud,balanceDiffOrig,balanceDiffDest,fraud_proba,risk_score,risk_bucket
0,PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,0,0,9839.64,0.0,2.968551e-11,0.0,Low
1,PAYMENT,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.0,0.0,0,0,1864.28,0.0,2.250958e-11,0.0,Low
2,TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,1,0,181.0,0.0,0.7571538,75.72,Very High
3,CASH_OUT,181.0,C840083671,181.0,0.0,C38997010,21182.0,0.0,1,0,181.0,-21182.0,0.4394917,43.95,Medium
4,PAYMENT,11668.14,C2048537720,41554.0,29885.86,M1230701703,0.0,0.0,0,0,11668.14,0.0,2.835938e-11,0.0,Low
5,PAYMENT,7817.71,C90045638,53860.0,46042.29,M573487274,0.0,0.0,0,0,7817.71,0.0,2.625431e-11,0.0,Low
6,PAYMENT,7107.77,C154988899,183195.0,176087.23,M408069119,0.0,0.0,0,0,7107.77,0.0,2.818642e-11,0.0,Low
7,PAYMENT,7861.64,C1912850431,176087.23,168225.59,M633326333,0.0,0.0,0,0,7861.64,0.0,2.852525e-11,0.0,Low
8,PAYMENT,4024.36,C1265012928,2671.0,0.0,M1176932104,0.0,0.0,0,0,2671.0,0.0,2.259326e-11,0.0,Low
9,DEBIT,5337.77,C712410124,41720.0,36382.23,C195600860,41898.0,40348.79,0,0,5337.77,-1549.21,1.015537e-10,0.0,Low
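
The threshold suggestion in the cell above relies on Youden's J statistic, J = TPR - FPR, and picks the threshold where J is largest. The run shown here skipped that step because no labels were found, so the following is a small pure-Python sketch on toy data. The labels, scores, and the `youden_threshold` helper are all hypothetical, and the helper only approximates what `roc_curve` plus `np.argmax` do in the cell.

```python
def youden_threshold(y_true, scores):
    """Return (threshold, J) maximising J = TPR - FPR over the observed score values."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    best_thr, best_j = None, -1.0
    # Try each distinct score as a cut-off, from strictest to loosest
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 0)
        j = tp / positives - fp / negatives
        if j > best_j:
            best_thr, best_j = thr, j
    return best_thr, best_j


# Toy labels (1 = fraud) and predicted probabilities, made up for illustration
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.75, 0.90]

thr, j = youden_threshold(y_true, scores)
print(f"Suggested threshold: prob >= {thr} (J = {j:.2f}) -> RiskScore >= {thr * 100:.0f}")
```

Youden's J weights false positives and false negatives equally, which may not match the cost structure of fraud alerts, so the suggested value is best treated as a starting point for `ALERT_THRESHOLD` rather than a final setting.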


In [12]:
# === Train and Save Fraud Detection Model ===
import pandas as pd
import joblib
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

# 1) Load dataset (update path if needed)
DATA_PATH = "processed_fraud_dataset2.csv"
df = pd.read_csv(DATA_PATH)
print(f"Loaded dataset: {DATA_PATH} with shape {df.shape}")

# 2) Prepare features and target
# drop ID-like and irrelevant columns
X = df.drop(["isFraud", "nameOrig", "nameDest", "isFlaggedFraud"], axis=1, errors="ignore")
y = df["isFraud"]

# 3) Identify column types
numeric_features = X.select_dtypes(include=["int64", "float64"]).columns.tolist()
categorical_features = X.select_dtypes(include=["object", "category"]).columns.tolist()
print("Numeric:", numeric_features)
print("Categorical:", categorical_features)

# 4) Preprocessing pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numeric_features),
        ("cat", OneHotEncoder(handle_unknown="ignore", drop="first"), categorical_features),
    ],
    remainder="drop"
)

# 5) Model pipeline
pipeline = Pipeline([
    ("prep", preprocessor),
    ("clf", LogisticRegression(max_iter=1000, class_weight="balanced", solver="lbfgs"))
])

# 6) Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)
print(f"Train size: {X_train.shape}, Test size: {X_test.shape}")

# 7) Train model
pipeline.fit(X_train, y_train)
print("✅ Model trained successfully!")

# 8) Evaluate model
y_pred = pipeline.predict(X_test)
y_proba = pipeline.predict_proba(X_test)[:, 1]

print("\n=== Classification Report ===")
print(classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print(f"ROC-AUC Score: {roc_auc_score(y_test, y_proba):.4f}")

# 9) Save model
MODEL_PATH = "fraud_detection_pipeline_risk.pkl"
joblib.dump(pipeline, MODEL_PATH)
print(f"💾 Pipeline saved to: {MODEL_PATH}")

# 10) Optional: save test data for risk-score testing
X_test.to_csv("x_test_risk.csv", index=False)
y_test.to_csv("y_test_risk.csv", index=False)
print("Saved x_test_risk.csv and y_test_risk.csv for later use.")


Loaded dataset: processed_fraud_dataset2.csv with shape (6362620, 12)
Numeric: ['amount', 'oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest', 'newbalanceDest', 'balanceDiffOrig', 'balanceDiffDest']
Categorical: ['type']
Train size: (4453834, 8), Test size: (1908786, 8)
✅ Model trained successfully!

=== Classification Report ===
              precision    recall  f1-score   support

           0       1.00      0.95      0.97   1906322
           1       0.02      0.94      0.04      2464

    accuracy                           0.95   1908786
   macro avg       0.51      0.94      0.51   1908786
weighted avg       1.00      0.95      0.97   1908786

Confusion Matrix:
 [[1805846  100476]
 [    150    2314]]
ROC-AUC Score: 0.9889
💾 Pipeline saved to: fraud_detection_pipeline_risk.pkl
Saved x_test.csv and y_test.csv for later use.
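
The fraud-class figures in the classification report can be re-derived from the confusion matrix above. The short sketch below only redoes that arithmetic with the four counts printed in the output. It assumes scikit-learn's layout, where rows are true classes and columns are predicted classes.

```python
# Counts copied from the confusion matrix printed above
tn, fp = 1805846, 100476   # true class 0: predicted 0, predicted 1
fn, tp = 150, 2314         # true class 1: predicted 0, predicted 1

precision = tp / (tp + fp)   # share of flagged transactions that are really fraud
recall = tp / (tp + fn)      # share of fraud cases that were flagged
f1 = 2 * precision * recall / (precision + recall)
false_alerts_per_hit = fp / tp

print(f"precision = {precision:.4f}")
print(f"recall    = {recall:.4f}")
print(f"f1        = {f1:.4f}")
print(f"false alerts per true fraud = {false_alerts_per_hit:.1f}")
```

Rounded to two decimals these match the report's 0.02, 0.94 and 0.04 for class 1. At the default 0.5 decision threshold the model catches most fraud, but roughly 43 legitimate transactions are flagged for each true fraud case.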
