# PaisaBazaar Banking Fraud Analysis

### Project Summary  
This project builds a **machine learning pipeline** to analyze financial customer data and predict credit risk.  
The goal is to classify customers into categories of creditworthiness (e.g., Good, Standard, Poor) and highlight potential fraud or default risks.  
The workflow covers data cleaning, preprocessing, exploratory analysis, model training, tuning, and evaluation with business insights.

### GitHub Link  
[🔗 Project Repository](https://github.com/your-username/your-repo)  
_(Replace with your actual repository link)_

### Problem Statement  
Financial institutions face increasing challenges in accurately assessing the creditworthiness of customers.  
Manual processes are **time-consuming, error-prone, and biased**, while fraud risks are rising in the digital lending ecosystem.  

This project aims to develop a **Credit Score Prediction System** that automates classification of customers based on financial attributes,  
reduces default risk, and supports decision-making for loan approvals, credit card issuance, and fraud detection.

### Compact version adapted to the sample structure
This notebook is a compact version of the project: it keeps the **outer heading** and key section names from the sample, and trims each section to the essential code from the **final project**.


# Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings, io, traceback, gc

from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, classification_report, confusion_matrix

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier

warnings.filterwarnings('ignore')
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (7,4)

# ==== Low‑RAM switches ====
LOW_RAM = True
N_ROWS = None            # e.g., 150_000 to cap rows or None
PLOT_SAMPLE = 800        # sample size for charts
CV_FOLDS = 3             # lighter cross‑validation
N_JOBS = 1               # set to 1 to avoid RAM spikes
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

# XGBoost light config
XGB_PARAMS = dict(
    n_estimators=50,
    max_depth=3,
    learning_rate=0.1,
    subsample=0.8,
    colsample_bytree=0.8,
    eval_metric='mlogloss',
    tree_method='hist',   # lower memory
    verbosity=0,
    n_jobs=N_JOBS,        # preferred over the older nthread alias
    random_state=RANDOM_STATE
)

def safe_show():
    plt.show()
    plt.close()
    gc.collect()

# Load Dataset & Memory Optimization

In [None]:
def reduce_mem_usage(df):
    for col in df.columns:
        col_type = df[col].dtype
        if str(col_type).startswith('float'):
            df[col] = pd.to_numeric(df[col], downcast='float')
        elif str(col_type).startswith('int'):
            df[col] = pd.to_numeric(df[col], downcast='integer')
    return df

try:
    from google.colab import files
    uploaded = files.upload()
    key = 'dataset-2.csv' if 'dataset-2.csv' in uploaded else next(iter(uploaded))
    df = pd.read_csv(io.BytesIO(uploaded[key]), low_memory=True, nrows=N_ROWS)
except Exception:
    df = pd.read_csv('dataset-2.csv', low_memory=True, nrows=N_ROWS)

df = reduce_mem_usage(df)
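
To check how much downcasting saves, one option is to compare `memory_usage(deep=True)` before and after. The sketch below repeats the idea of `reduce_mem_usage` on a synthetic frame; the helper name `downcast_numeric`, the column names, and the sizes are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

def downcast_numeric(df):
    """Downcast float and integer columns to the smallest dtype pandas allows."""
    for col in df.columns:
        if pd.api.types.is_float_dtype(df[col]):
            df[col] = pd.to_numeric(df[col], downcast='float')
        elif pd.api.types.is_integer_dtype(df[col]):
            df[col] = pd.to_numeric(df[col], downcast='integer')
    return df

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'age': rng.integers(18, 80, size=1000),           # int64 by default
    'income': rng.normal(50_000, 10_000, size=1000),  # float64 by default
})

before = demo.memory_usage(deep=True).sum()
demo = downcast_numeric(demo)
after = demo.memory_usage(deep=True).sum()

print(demo.dtypes.to_dict())
print(f'Memory: {before} -> {after} bytes')
```

Integer columns with a small value range shrink the most. Float columns are only downcast when pandas considers the conversion to `float32` close enough to the original values, so a column such as `income` may stay `float64`.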

# Dataset First Look

In [None]:
# 3.1 Head
display(df.head())

# Dataset Rows & Columns count

In [None]:
# 3.2 Shape
print('Shape:', df.shape)

# Dataset Info (Numeric vs Categorical)

In [None]:
# 3.3 Data Types (numeric vs categorical)
num_cols = df.select_dtypes(include=['number']).columns.tolist()
cat_cols = df.select_dtypes(include=['object','category','bool']).columns.tolist()
print('Numeric columns (sample):', num_cols[:10])
print('Categorical columns (sample):', cat_cols[:10])

# Missing Values Overview & Handling

In [None]:
# 4.1 Missing values overview
missing = df.isnull().sum().sort_values(ascending=False)
display(missing[missing>0].to_frame('missing_count'))

In [None]:
# 4.2 Handle missing values (categorical -> mode, numeric -> median), without inplace
for c in cat_cols:
    if df[c].isna().any():
        mode = df[c].mode(dropna=True)
        if not mode.empty:  # an all-NaN column has no mode to fill with
            df[c] = df[c].fillna(mode.iloc[0])
for c in num_cols:
    if df[c].isna().any():
        df[c] = df[c].fillna(df[c].median())

print('Missing after fill:')
display(df.isnull().sum().sort_values(ascending=False).head(10))

# Duplicates

In [None]:
# 4.3 Duplicates
before = df.shape[0]
df = df.drop_duplicates().reset_index(drop=True)
after = df.shape[0]
print('Dropped duplicates:', before - after)

# Outlier Handling (Percentile Capping)

In [None]:
# 4.4 Simple outlier capping for numeric features (1% / 99%)
for c in num_cols:
    if c == 'Credit_Score':  # never clip the target, in case it is stored as a number
        continue
    lo, hi = df[c].quantile(0.01), df[c].quantile(0.99)
    df[c] = df[c].clip(lo, hi)
print('Applied percentile capping on numerical columns.')
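
The capping step computes its percentiles on the full dataset, before the train/test split, so test rows influence the bounds. A stricter variant learns the bounds on the training rows only and then applies them to both parts. The sketch below shows this on synthetic data; the helpers `fit_caps` and `apply_caps` are hypothetical names, not part of the notebook.

```python
import numpy as np
import pandas as pd

def fit_caps(train, cols, lower=0.01, upper=0.99):
    """Learn per-column (low, high) percentile bounds from training rows only."""
    return {c: (train[c].quantile(lower), train[c].quantile(upper)) for c in cols}

def apply_caps(frame, caps):
    """Clip each column to the bounds learned by fit_caps; returns a copy."""
    out = frame.copy()
    for c, (lo, hi) in caps.items():
        out[c] = out[c].clip(lo, hi)
    return out

# Synthetic data with one extreme value, for illustration only
rng = np.random.default_rng(42)
toy = pd.DataFrame({'debt': rng.normal(1000, 200, size=500)})
toy.loc[0, 'debt'] = 1e6

train, test = toy.iloc[:400], toy.iloc[400:]
caps = fit_caps(train, ['debt'])
train_capped = apply_caps(train, caps)
test_capped = apply_caps(test, caps)

print(caps)
print(train_capped['debt'].max(), test_capped['debt'].max())
```

For a quick exploratory notebook the difference is usually small, but keeping every fitted statistic on the training side makes the test score a cleaner estimate.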

# Exploratory Visuals (Optional)

In [None]:
try:
    candidates = ['Age','Annual_Income','Monthly_Inhand_Salary','Num_Bank_Accounts','Num_Credit_Card','Interest_Rate','Outstanding_Debt','Monthly_Balance']
    plot_cols = [c for c in candidates if c in df.columns]
    if len(plot_cols) < 5:
        plot_cols = list(df.select_dtypes(include=['number']).columns[:5])

    df_plot = df.sample(min(PLOT_SAMPLE, len(df)), random_state=RANDOM_STATE)

    for col in plot_cols[:5]:
        plt.figure()
        sns.histplot(df_plot[col], bins=30, kde=True)
        plt.title(f'Distribution of {col}')
        plt.xlabel(col); plt.ylabel('Count')
        safe_show()
except Exception as e:
    print('Univariate EDA failed:', e)
    traceback.print_exc()

In [None]:
try:
    if 'Credit_Score' not in df.columns:
        raise ValueError("Target column 'Credit_Score' not found.")

    df_plot = df.sample(min(PLOT_SAMPLE, len(df)), random_state=RANDOM_STATE)

    # Numeric vs target (boxplot)
    num_for_box = None
    for c in ['Annual_Income','Age','Outstanding_Debt','Interest_Rate']:
        if c in df_plot.columns:
            num_for_box = c; break
    if num_for_box is None:
        num_candidates = df_plot.select_dtypes(include=['number']).columns.tolist()
        num_candidates = [c for c in num_candidates if c != 'Credit_Score']
        if len(num_candidates): num_for_box = num_candidates[0]

    if num_for_box is not None:
        plt.figure()
        sns.boxplot(x='Credit_Score', y=num_for_box, data=df_plot)
        plt.title(f'{num_for_box} by Credit_Score')
        safe_show()

    # Categorical vs target (stacked proportions) — choose a meaningful cat, not Name
    cat_candidate = None
    for c in ['Occupation','Gender','Type_of_Loan','Payment_Behaviour']:
        if c in df_plot.columns:
            cat_candidate = c; break
    if cat_candidate is None:
        cats = df_plot.select_dtypes(include=['object','category','bool']).columns.tolist()
        cats = [c for c in cats if c.lower() != 'name']
        if len(cats): cat_candidate = cats[0]

    if cat_candidate is not None:
        ct = pd.crosstab(df_plot[cat_candidate], df_plot['Credit_Score'], normalize='index')
        ct.plot(kind='bar', stacked=True)
        plt.title(f'{cat_candidate} vs Credit_Score (proportions)')
        plt.ylabel('Proportion'); plt.xlabel(cat_candidate)
        safe_show()
except Exception as e:
    print('Bivariate EDA failed:', e)
    traceback.print_exc()

In [None]:
try:
    df_plot = df.sample(min(PLOT_SAMPLE, len(df)), random_state=RANDOM_STATE)
    num_cols_plot = df_plot.select_dtypes(include=['number']).columns.tolist()
    if 'Credit_Score' in num_cols_plot:
        num_cols_plot.remove('Credit_Score')
    corr_cols = num_cols_plot[:6]

    if len(corr_cols) >= 2:
        plt.figure(figsize=(7,5))
        sns.heatmap(df_plot[corr_cols].corr(), annot=True, cmap='coolwarm', center=0)
        plt.title('Correlation Heatmap (selected numeric features)')
        safe_show()

    if 'Credit_Score' in df_plot.columns and len(num_cols_plot) >= 2:
        plt.figure()
        try:
            sns.scatterplot(x=df_plot[num_cols_plot[0]], y=df_plot[num_cols_plot[1]], hue=df_plot['Credit_Score'], alpha=0.6)
        except Exception:
            sns.scatterplot(x=df_plot[num_cols_plot[0]], y=df_plot[num_cols_plot[1]], alpha=0.6)
        plt.title(f'{num_cols_plot[0]} vs {num_cols_plot[1]} by Credit_Score')
        safe_show()
except Exception as e:
    print('Multivariate EDA failed:', e)
    traceback.print_exc()

# Train/Test Split & Preprocessing

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Initialize placeholders so they exist even if try-block fails
preprocess = None
X_train_df = X_test_df = y_train = y_test = None

try:
    if 'Credit_Score' not in df.columns:
        raise ValueError("Target column 'Credit_Score' not found.")

    target = df['Credit_Score'].astype(str)
    classes = sorted(target.unique())
    class_to_idx = {c:i for i,c in enumerate(classes)}
    y = target.map(class_to_idx).values

    X_df = df.drop(columns=['Credit_Score']).copy()
    num_cols = X_df.select_dtypes(include=['number']).columns.tolist()
    cat_cols = X_df.select_dtypes(include=['object','category','bool']).columns.tolist()

    preprocess = ColumnTransformer(
        transformers=[
            ('num', StandardScaler(with_mean=False), num_cols),
            ('cat', OneHotEncoder(handle_unknown='ignore', sparse_output=True), cat_cols)
        ],
        remainder='drop',
        sparse_threshold=1.0
    )

    X_train_df, X_test_df, y_train, y_test = train_test_split(
        X_df, y, test_size=0.2, random_state=RANDOM_STATE, stratify=y
    )

    print('Train DF:', X_train_df.shape, '| Test DF:', X_test_df.shape)
    print('Classes mapping:', class_to_idx)

except Exception as e:
    print('Preprocessing failed:', e)
    traceback.print_exc()

# Baseline Models

In [None]:
baseline_df = pd.DataFrame()  # ensure defined

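from sklearn.base import clone

def build_pipe(estimator):
    # Clone the shared transformer so each pipeline fits its own copy
    if preprocess is None:
        raise RuntimeError('Run the preprocessing cell successfully before building pipelines.')
    return Pipeline([('prep', clone(preprocess)), ('clf', estimator)])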

def eval_model(model, X_tr, y_tr, X_te, y_te, name='Model'):
    model.fit(X_tr, y_tr)
    preds = model.predict(X_te)
    acc = accuracy_score(y_te, preds)
    pr, rc, f1, _ = precision_recall_fscore_support(y_te, preds, average='weighted', zero_division=0)
    print('{} Accuracy: {:.4f}'.format(name, acc))
    print('Precision: {:.4f} | Recall: {:.4f} | F1: {:.4f}'.format(pr, rc, f1))
    print('Confusion Matrix:\n', confusion_matrix(y_te, preds))
    print('Classification Report:\n', classification_report(y_te, preds, zero_division=0))
    return {'Accuracy': acc, 'Precision': pr, 'Recall': rc, 'F1': f1}

try:
    cv = StratifiedKFold(n_splits=CV_FOLDS, shuffle=True, random_state=RANDOM_STATE)

    models = {
        'Logistic Regression': LogisticRegression(max_iter=300, random_state=RANDOM_STATE, n_jobs=N_JOBS),
        'Decision Tree': DecisionTreeClassifier(max_depth=5, random_state=RANDOM_STATE),
        'Random Forest': RandomForestClassifier(n_estimators=100, max_depth=6, random_state=RANDOM_STATE, n_jobs=N_JOBS),
        'XGBoost': XGBClassifier(**XGB_PARAMS),
        'KNN': KNeighborsClassifier(n_neighbors=5)
    }

    baseline_results = {}
    for name, est in models.items():
        print('\n###', name, '—', CV_FOLDS, 'fold CV (weighted F1)')
        pipe = build_pipe(est)
        f1_cv = cross_val_score(pipe, X_train_df, y_train, cv=cv, scoring='f1_weighted', n_jobs=N_JOBS)
        print('CV F1 (mean±std): {:.4f} ± {:.4f}'.format(f1_cv.mean(), f1_cv.std()))
        baseline_results[name] = eval_model(pipe, X_train_df, y_train, X_test_df, y_test, name=name)

    baseline_df = pd.DataFrame(baseline_results).T
    display(baseline_df)

except Exception as e:
    print('Baseline modeling failed:', e)
    traceback.print_exc()
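
The baseline scores use `average='weighted'`, which weights each class's F1 by its support. On imbalanced data this can hide a class the model never predicts. The sketch below compares weighted and macro F1 on made-up labels (not results from this project).

```python
import numpy as np
from sklearn.metrics import f1_score

# Made-up labels: 8 samples of class 0, 2 samples of class 1
y_true = np.array([0] * 8 + [1] * 2)
y_pred = np.array([0] * 10)  # the model never predicts class 1

per_class = f1_score(y_true, y_pred, average=None, zero_division=0)
weighted = f1_score(y_true, y_pred, average='weighted', zero_division=0)
macro = f1_score(y_true, y_pred, average='macro', zero_division=0)

print('Per-class F1:', per_class.round(3))  # class 0 ~0.889, class 1 = 0.0
print('Weighted F1 :', round(weighted, 3))  # ~0.711
print('Macro F1    :', round(macro, 3))     # ~0.444
```

If the minority credit classes matter for the business decision, reporting macro F1 or the per-class values alongside the weighted score gives a fuller picture.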

# Hyperparameter Tuning (Optional)

In [None]:
tuned_df = pd.DataFrame()  # ensure defined

try:
    tuned_results = {}

    # Random Forest tuning (small grid)
    rf_grid = {
        'clf__n_estimators': [100, 150],
        'clf__max_depth': [6, 8],
        'clf__min_samples_split': [2, 5]
    }
    rf_pipe = build_pipe(RandomForestClassifier(random_state=RANDOM_STATE, n_jobs=N_JOBS))
    rf_gs = GridSearchCV(rf_pipe, rf_grid, scoring='f1_weighted', cv=CV_FOLDS, n_jobs=N_JOBS)
    rf_gs.fit(X_train_df, y_train)
    rf_best = rf_gs.best_estimator_
    print('RF best params:', rf_gs.best_params_)
    tuned_results['Random Forest (Tuned)'] = eval_model(rf_best, X_train_df, y_train, X_test_df, y_test, 'Random Forest (Tuned)')

    # XGBoost tuning (small grid)
    xgb_grid = {
        'clf__n_estimators': [50, 80],
        'clf__max_depth': [3, 4],
        'clf__learning_rate': [0.05, 0.1]
    }
    xgb_pipe = build_pipe(XGBClassifier(**XGB_PARAMS))
    xgb_gs = GridSearchCV(xgb_pipe, xgb_grid, scoring='f1_weighted', cv=CV_FOLDS, n_jobs=N_JOBS)
    xgb_gs.fit(X_train_df, y_train)
    xgb_best = xgb_gs.best_estimator_
    print('XGB best params:', xgb_gs.best_params_)
    tuned_results['XGBoost (Tuned)'] = eval_model(xgb_best, X_train_df, y_train, X_test_df, y_test, 'XGBoost (Tuned)')

    tuned_df = pd.DataFrame(tuned_results).T
    display(tuned_df)

except Exception as e:
    print('Tuning failed:', e)
    traceback.print_exc()
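
Because the target is encoded through `class_to_idx`, the fitted models return integer class indices. A minimal sketch of mapping predictions back to readable labels follows; the label set and the `preds` array are assumed for illustration, and `idx_to_class` is a hypothetical helper name.

```python
import numpy as np

# Assumed label set, mirroring the categories named in the project summary
labels = ['Good', 'Poor', 'Standard']
classes = sorted(set(labels))
class_to_idx = {c: i for i, c in enumerate(classes)}

# Invert the mapping: index -> label
idx_to_class = {i: c for c, i in class_to_idx.items()}

preds = np.array([2, 0, 1, 2])  # stand-in for model.predict(X_test_df)
decoded = [idx_to_class[i] for i in preds]
print(decoded)  # ['Standard', 'Good', 'Poor', 'Standard']
```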

# Model Comparison & Summary

In [None]:
try:
    if 'baseline_df' in globals() and not baseline_df.empty:
        comp = pd.DataFrame({'Model': list(baseline_df.index), 'F1': baseline_df['F1'].values})
        if 'tuned_df' in globals() and not tuned_df.empty:
            for m in tuned_df.index:
                comp = pd.concat([comp, pd.DataFrame({'Model':[m], 'F1':[tuned_df.loc[m,'F1']]})], ignore_index=True)

        plt.figure(figsize=(8,4))
        comp.sort_values('F1', ascending=False, inplace=True)
        plt.bar(comp['Model'], comp['F1'])
        plt.xticks(rotation=30, ha='right')
        plt.ylabel('Weighted F1')
        plt.title('Baseline vs Tuned — F1 Comparison')
        plt.tight_layout()
        safe_show()
    else:
        print('No baseline results available to plot.')
except Exception as e:
    print('Improvement chart failed:', e)
    traceback.print_exc()

# Conclusion & Business Impact

### Conclusion
- We successfully built a **Credit Score Prediction** model pipeline.
- Data cleaning included handling missing values, duplicates, and outliers.
- Five baseline models (Logistic Regression, Decision Tree, Random Forest, XGBoost, and KNN) were cross-validated on weighted F1 and evaluated on a held-out test set for accuracy, precision, recall, and F1-score.
- A small grid search tuned Random Forest and XGBoost; the comparison chart shows whether tuning improved on the baselines.

### Business Impact
- **Risk Assessment:** Financial institutions can more accurately classify customers into creditworthiness categories (e.g., Good, Standard, Poor).
- **Decision Support:** Helps in **loan approval**, **credit card issuance**, and **interest rate adjustment** decisions.
- **Operational Efficiency:** Automates manual assessment, saving time and reducing human bias.
- **Customer Management:** Enables proactive measures for customers likely to default, such as restructuring loans or offering financial advice.

This project demonstrates how machine learning can directly contribute to better financial decision-making and reduced default risk.
