# Library Imports & Environment Setup
In this cell, we import all necessary Python libraries for our analysis. pandas and numpy handle data manipulation and numerical computations, while matplotlib and seaborn support data visualization. We also bring in modules from scikit-learn for splitting the dataset, building a Logistic Regression model, and evaluating its performance using metrics like accuracy, confusion matrix, ROC curve, and AUC. Finally, joblib is imported for saving and loading trained models.

This setup ensures our environment is fully equipped for data processing, modeling, and evaluation tasks.

In [None]:
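# Import libraries
import pandas as pd                  # For data manipulation
import numpy as np                   # For numerical operations
import matplotlib.pyplot as plt      # For plotting
import seaborn as sns                # For visualizations
import joblib                        # For saving and loading trained models
from sklearn.model_selection import train_test_split   # For splitting the dataset
from sklearn.linear_model import LogisticRegression    # For the Logistic Regression model
from sklearn.metrics import accuracy_score, confusion_matrix, roc_curve, auc  # For evaluation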

# Data Loading & Initial Exploration
In this section, we load the cleaned fraud detection dataset using pandas. After importing the data, we perform a preliminary examination by displaying its structure (info()), the first few rows (head()), and basic statistical summaries (describe()). This helps us understand the dataset’s shape, types of variables, and general distribution before moving into deeper analysis.

In [None]:

# Load dataset
df = pd.read_csv("../Data/Fraud_Cleaned.csv")  # Ensure the correct file path

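# Display basic info
# A notebook cell renders only its last expression, so display() is used explicitly
df.info()                # Prints column dtypes and non-null counts
display(df.head())       # First five rows
display(df.describe())   # Summary statistics for numeric columns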


# Data Exploration
This section checks for potential data quality issues: missing values and duplicated rows. These checks establish data integrity and reliability before proceeding to model training and analysis.

In [None]:
# Data Exploration

# Check for missing values
print("Missing Values:\n", df.isnull().sum())

# Check for duplicates
print("\nDuplicate Rows:", df.duplicated().sum())
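The two checks above can be combined into a single per-column report. The following is a minimal sketch, not the notebook's own method: the helper name `quality_report` and the toy frame are assumptions made for illustration, and the values are made up.

```python
import numpy as np
import pandas as pd


def quality_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Summarize dtype, missing values, and unique counts per column."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "missing": frame.isnull().sum(),
        "missing_pct": (frame.isnull().mean() * 100).round(2),
        "n_unique": frame.nunique(),  # NaN is not counted as a value
    })


# Hypothetical toy frame standing in for df (values are made up).
toy = pd.DataFrame({
    "trustLevel": [1, 2, 2, 6, 1],
    "grandTotal": [10.5, np.nan, 10.5, 99.0, 10.5],
    "fraud": [1, 0, 0, 0, 1],
})

report = quality_report(toy)
n_duplicates = int(toy.duplicated().sum())  # rows identical to an earlier row

print(report)
print("Duplicate rows:", n_duplicates)
```

On the real dataset, the same helper could be called as `quality_report(df)`.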

# Class Distribution
We visualize the distribution of fraud vs. non-fraud cases and highlight the fraud class for clarity. Class percentages are also printed to check for imbalance, which is important for model performance.

In [None]:
# Visualize class distribution with custom color for fraud = 1
ax = sns.countplot(x='fraud', data=df)
ax.patches[1].set_color('#DD8452')  # Set fraud = 1 bar to orange
plt.title('Class Distribution')
plt.show()

# Print class percentages
print("Class distribution (%):")
print(df['fraud'].value_counts(normalize=True) * 100)


The target variable fraud is highly imbalanced: approximately 95% of transactions are labeled non-fraudulent (fraud = 0) and only about 5% are labeled fraudulent (fraud = 1).
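To see why this imbalance matters for evaluation, consider a model that always predicts the majority class. The sketch below uses hypothetical labels with a 95/5 split (not the actual dataset) and shows that such a model reaches high accuracy while detecting no fraud at all.

```python
import pandas as pd

# Hypothetical labels with the roughly 95/5 split described above.
y = pd.Series([0] * 95 + [1] * 5, name="fraud")

class_share = y.value_counts(normalize=True)
majority_class = class_share.idxmax()

# Accuracy of a model that always predicts the majority class.
baseline_accuracy = (y == majority_class).mean()

# Recall on the fraud class for the same model: it never flags fraud.
predictions = pd.Series(majority_class, index=y.index)
fraud_recall = ((predictions == 1) & (y == 1)).sum() / (y == 1).sum()

print(f"Baseline accuracy: {baseline_accuracy:.2%}")
print(f"Fraud recall: {fraud_recall:.2%}")
```

Any trained model should therefore be judged against this baseline and with class-sensitive metrics, not accuracy alone.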

# Feature Distribution by Fraud Status
We examine how each feature is distributed across fraud and non-fraud cases. Features with more than 20 unique values are treated as continuous and shown as step histograms with a KDE overlay; the remaining discrete features use stacked histograms. This helps identify patterns and potential predictors of fraud.

In [None]:

features_to_plot = df.columns.drop("fraud")  # exclude target

plt.figure(figsize=(16, 12))
for i, feature in enumerate(features_to_plot, 1):
    plt.subplot(3, 3, i)
    
    unique_vals = df[feature].nunique()
    
    # Treat features with more than 20 unique values as continuous and overlay a KDE
    if unique_vals > 20:  # arbitrary threshold
        sns.histplot(data=df, x=feature, hue="fraud", kde=True, bins=30, element="step")
    else:
        sns.histplot(data=df, x=feature, hue="fraud", kde=False, bins=unique_vals, multiple="stack", shrink=0.8)
    
    plt.title(f"{feature} by Fraud Status")

plt.tight_layout()
plt.show()


The trustLevel plot shows that most non-fraudulent transactions occur at higher trust levels (3–6), while fraud is concentrated at the lowest levels (1 and 2), indicating that trust level is a strong fraud indicator. totalScanTimeInSeconds and grandTotal have wide distributions, with slightly more fraud at longer scan durations and higher totals, though the two classes are not visually well separated. lineItemVoids, scansWithoutRegistration, and quantityModification appear to show more fraud cases at higher values; because stacked counts can mislead when category sizes differ, the per-value fraud rates in the breakdown that follows are the better check (for quantityModification they turn out to be nearly flat). In contrast, scannedLineItemsPerSecond and valuePerSecond are extremely skewed, with a small number of high-value outliers, and may be more useful after a transformation. Finally, lineItemVoidsPerPosition shows a clear concentration of fraud at higher values, suggesting a strong predictive signal.
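One common transformation for heavily right-skewed, non-negative features is `log1p`. The sketch below is illustrative only: the data are synthetic (a lognormal sample standing in for valuePerSecond), and the notebook itself does not state which transformation, if any, is applied later.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical right-skewed feature standing in for valuePerSecond.
value_per_second = pd.Series(rng.lognormal(mean=0.0, sigma=1.5, size=5000))

# log1p computes log(1 + x): defined at zero, and it compresses the long right tail.
transformed = np.log1p(value_per_second)

skew_before = value_per_second.skew()
skew_after = transformed.skew()

print(f"Skewness before: {skew_before:.2f}, after: {skew_after:.2f}")
```

Comparing skewness before and after gives a quick check of whether the transformation helps for a given feature.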

# Fraud Breakdown by Low-Cardinality Features
This section analyzes low-cardinality features (here, those with at most 12 unique values) to see how fraud rates vary across their values. For each feature value it lists fraud and non-fraud counts and percentages, helping reveal patterns and potential risk indicators.

In [None]:

# Define the target column
target_col = "fraud"

# Define what qualifies as "low-cardinality"
max_unique = 12
candidate_features = [col for col in df.columns if col != target_col and df[col].nunique() <= max_unique]

# Loop over each low-cardinality feature
for feature in candidate_features:
    summary_rows = []

    for val in sorted(df[feature].unique()):
        total_count = df[df[feature] == val].shape[0]
        count_yes = df[(df[feature] == val) & (df[target_col] == 1)].shape[0]
        count_no = df[(df[feature] == val) & (df[target_col] == 0)].shape[0]
        pct_yes = (count_yes / total_count) * 100 if total_count > 0 else 0
        pct_no = (count_no / total_count) * 100 if total_count > 0 else 0

        summary_rows.append({
            "Value": val,
            "Total Count": total_count,
            "Fraud Count (Yes)": count_yes,
            "Non-Fraud Count (No)": count_no,
            "% Fraud (Yes)": round(pct_yes, 2),
            "% Non-Fraud (No)": round(pct_no, 2),
        })

    summary_df = pd.DataFrame(summary_rows)

    print(f"\n=== Feature: {feature} ===")
    display(summary_df)
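The same per-value breakdown can be expressed more compactly with `groupby`. This is a sketch under assumptions: the helper name `fraud_rate_by_value` and the toy data are hypothetical, and it assumes the target is coded 0/1 with no missing values.

```python
import pandas as pd


def fraud_rate_by_value(frame: pd.DataFrame, feature: str, target: str = "fraud") -> pd.DataFrame:
    """Count rows and fraud cases per feature value and derive the fraud rate."""
    # With a 0/1 target, "count" gives the group size and "sum" the number of fraud cases.
    summary = frame.groupby(feature)[target].agg(total="count", fraud="sum")
    summary["non_fraud"] = summary["total"] - summary["fraud"]
    summary["fraud_pct"] = (summary["fraud"] / summary["total"] * 100).round(2)
    return summary


# Hypothetical toy data (values are made up).
toy = pd.DataFrame({
    "trustLevel": [1, 1, 1, 1, 2, 2, 3, 3],
    "fraud":      [1, 0, 0, 0, 0, 0, 0, 0],
})

breakdown = fraud_rate_by_value(toy, "trustLevel")
print(breakdown)
```

A single grouped pass avoids filtering the full DataFrame several times per value, which matters mainly on larger datasets.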


## trustLevel
The breakdown shows that fraud occurs only at the lowest trust levels (1 and 2), with fraud rates of 22.41% and 6.26% respectively. From trust level 3 onward, no fraud is recorded. This makes trustLevel one of the strongest features for distinguishing fraudulent from non-fraudulent behavior: the lower the trust level, the higher the risk.

## lineItemVoids
The fraud rate increases gradually with the number of voided items, growing from about 2.6% at 1 void to 6.68% at 11 voids. This steady climb suggests that frequent voiding is a useful, though moderate, fraud indicator that the model should be able to pick up.

## scansWithoutRegistration
This feature also shows a positive association with fraud likelihood. The fraud rate rises from 2.09% at 0 attempts to over 7% at 10 attempts. This supports the idea that customers who repeatedly scan without registering items are exhibiting suspicious behavior, making this another behavioral predictor of fraud.

## quantityModification
Fraud rates across the quantity modification values (0–5) are nearly constant, ranging only between 4.7% and 4.85%. This lack of variation suggests that quantityModification does not differentiate between fraud and non-fraud cases on its own. It is likely less valuable as a standalone feature, but it could still contribute through interactions with other features.



# Feature Correlation Matrix
We use a heatmap to visualize correlations between features. This helps identify multicollinearity and relationships that may influence model performance.

In [None]:
selected_features = [
    "trustLevel", "totalScanTimeInSeconds", "lineItemVoids", 
    "quantityModification", "grandTotal", "scannedLineItemsPerSecond", 
    "valuePerSecond", "lineItemVoidsPerPosition", "scansWithoutRegistration"
]

X = df[selected_features]

# Visualize feature correlation matrix
plt.figure(figsize=(10, 6))
sns.heatmap(X.corr(), annot=True, cmap='coolwarm')
plt.title("Feature Correlation Matrix")
plt.show()

The correlation matrix shows that most features are only weakly correlated, which limits the risk of multicollinearity in the logistic regression model. The one notable exception is the strong correlation (0.75) between scannedLineItemsPerSecond and valuePerSecond, which is plausible given that both are per-second rates. While both features are retained for now to preserve model performance, they may be revisited during feature importance analysis or model refinement. Overall, the selected features appear well suited for modeling, and no removals are needed on correlation grounds.
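Rather than reading strongly correlated pairs off the heatmap, they can be listed programmatically. The following is a minimal sketch on synthetic data: the helper name `top_correlated_pairs`, the 0.7 threshold, and the generated values are all assumptions for illustration.

```python
import numpy as np
import pandas as pd


def top_correlated_pairs(frame: pd.DataFrame, threshold: float = 0.7) -> pd.Series:
    """Return feature pairs whose absolute Pearson correlation meets the threshold."""
    corr = frame.corr().abs()
    # Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs[pairs >= threshold].sort_values(ascending=False)


# Synthetic stand-in data: two related columns and one independent column.
rng = np.random.default_rng(0)
base = rng.normal(size=500)
toy = pd.DataFrame({
    "scannedLineItemsPerSecond": base,
    "valuePerSecond": base + rng.normal(scale=0.5, size=500),
    "grandTotal": rng.normal(size=500),
})

high_pairs = top_correlated_pairs(toy, threshold=0.7)
print(high_pairs)
```

On the real data, the call would be `top_correlated_pairs(X)`, with the threshold chosen to suit the analysis.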

# Outlier Detection with Boxplots
Boxplots are used to visually inspect each feature for potential outliers. This helps identify extreme values that may affect model accuracy or require preprocessing.

In [None]:
# Visualize outliers using boxplots
plt.figure(figsize=(15, 8))
sns.boxplot(data=df, orient="h")
plt.title("Feature Boxplot for Outlier Detection")
plt.show()


The boxplot shows that trustLevel, lineItemVoids, quantityModification, and scansWithoutRegistration are tightly distributed with limited spread; because all columns share one axis, these small-range features also appear compressed near zero. In contrast, totalScanTimeInSeconds and grandTotal show a wider spread and contain clear outliers, with totalScanTimeInSeconds extending well beyond 1500 seconds, indicating unusually long scanning sessions. valuePerSecond and scannedLineItemsPerSecond are heavily skewed and exhibit extreme outliers, suggesting the need for normalization or transformation prior to modeling. The boxplot thus highlights which features may require special handling to limit the influence of outliers on model performance.
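The visual impression from the boxplot can be quantified with the same 1.5 × IQR rule that boxplot whiskers use by default. This is a sketch under assumptions: the helper name `iqr_outlier_counts` and the toy values are hypothetical.

```python
import pandas as pd


def iqr_outlier_counts(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values per column lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column bounds with the frame's columns.
    outside = (frame < q1 - k * iqr) | (frame > q3 + k * iqr)
    return outside.sum()


# Hypothetical toy data: one extreme scan time, no extreme trust levels.
toy = pd.DataFrame({
    "totalScanTimeInSeconds": [100, 110, 120, 130, 140, 150, 160, 170, 180, 5000],
    "trustLevel": [1, 2, 3, 4, 5, 6, 1, 2, 3, 4],
})

counts = iqr_outlier_counts(toy)
print(counts)
```

Per-feature counts like these make it easier to decide which columns need clipping or transformation, independent of how the shared axis scales the plot.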