# Credit Risk Assessment Dataset Preparation

This notebook prepares the Credit Risk dataset for training a neural network.
We will explore the dataset, suggest modifications, and select evaluation metrics.
No training is performed here.

In [2]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from imblearn.over_sampling import SMOTE
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

ModuleNotFoundError: No module named 'sklearn'

## Load the Dataset

This notebook assumes the dataset is stored in a CSV file named 'credit-g.csv' in the working directory. Save the provided data to that file before running the next cell.

In [None]:
df = pd.read_csv('credit-g.csv')
df.head()

## Dataset Exploration

In [None]:
print('Shape:', df.shape)
print('\nInfo:')
df.info()
print('\nDescribe:')
df.describe()

In [None]:
# Check for missing values
print('Missing values:\n', df.isnull().sum())

In [None]:
# Class distribution
sns.countplot(x='class', data=df)
plt.title('Class Distribution')
plt.show()
print(df['class'].value_counts(normalize=True))
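
One way to read these proportions: the largest share is also the accuracy of a trivial classifier that always predicts the majority class, which gives a floor for judging later results. A minimal sketch on made-up labels follows; the 70/30 split and the name toy_labels are illustrative assumptions, not figures read from this dataset.

```python
import pandas as pd

# Hypothetical labels for illustration only; not taken from credit-g.csv
toy_labels = pd.Series(['good'] * 70 + ['bad'] * 30, name='class')

shares = toy_labels.value_counts(normalize=True)
majority_class = shares.idxmax()
baseline_accuracy = shares.max()

print('Majority class:', majority_class)
print('Always-majority baseline accuracy:', baseline_accuracy)
```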

## Dataset Modifications

1. **Handle Categorical Variables:** Use Label Encoding for binary/ordinal, One-Hot for nominal.
2. **Scale Numerical Features:** Use StandardScaler.
3. **Handle Class Imbalance:** Use SMOTE if imbalanced.
4. **Split Data:** Train/Validation/Test split.

The dataset appears clean, with no missing values; confirm this against the output of the missing-value check above.
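
The encoding and scaling steps listed above can also be bundled into a single preprocessing object. The following is a minimal sketch using scikit-learn's ColumnTransformer on a toy frame. The column names and values are illustrative assumptions, and this is one possible arrangement rather than the approach taken in the cells below.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data; column names and values are made up for illustration
toy = pd.DataFrame({
    'duration': [6, 12, 24, 48],
    'credit_amount': [1000.0, 2500.0, 4000.0, 8000.0],
    'purpose': ['car', 'tv', 'car', 'education'],
})
num_cols = ['duration', 'credit_amount']
cat_cols = ['purpose']

preprocess = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), num_cols),
        ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols),
    ],
    sparse_threshold=0.0,  # always return a dense array
)

# Fit on the training rows only, then reuse the fitted object on held-out rows
train_rows, held_out_rows = toy.iloc[:3], toy.iloc[3:]
train_features = preprocess.fit_transform(train_rows)
held_out_features = preprocess.transform(held_out_rows)

print(train_features.shape)      # scaled numeric columns plus one column per training category
print(held_out_features.shape)   # same width; the unseen category encodes as all zeros
```

Fitting on the training rows and only transforming the held-out rows keeps test-set statistics out of the scaler and encoder.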

In [None]:
# Identify categorical and numerical columns
categorical_cols = df.select_dtypes(include=['object']).columns.drop('class')
numerical_cols = df.select_dtypes(include=['int64', 'float64']).columns

print('Categorical:', list(categorical_cols))
print('Numerical:', list(numerical_cols))

In [None]:
# Encode target
le = LabelEncoder()
df['class'] = le.fit_transform(df['class'])  # labels are sorted alphabetically, so expect bad=0, good=1; verify with the mapping below
print('Class mapping:', dict(zip(le.classes_, le.transform(le.classes_))))

In [None]:
# One-Hot Encode categorical
df_encoded = pd.get_dummies(df, columns=categorical_cols, drop_first=True)
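
To make the effect of drop_first=True concrete: a column with k categories becomes k - 1 indicator columns, and the dropped category is represented by all of its indicators being zero. A small sketch on a made-up frame follows; the column name housing and its values are assumptions for illustration.

```python
import pandas as pd

# Toy frame; 'housing' and its values are illustrative, not read from the dataset
toy = pd.DataFrame({
    'amount': [1200, 3400, 560],
    'housing': ['own', 'rent', 'free'],
})

encoded = pd.get_dummies(toy, columns=['housing'], drop_first=True)

# 'free' sorts first, so it becomes the dropped reference level
print(list(encoded.columns))
print(encoded)
```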

In [None]:
# Split features and target
X = df_encoded.drop('class', axis=1)
y = df_encoded['class']

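In [None]:
# Train-test split first, so the test set is untouched by scaling and resampling
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

In [None]:
# Scale numerical features: fit on the training split only, then apply to both
X_train, X_test = X_train.copy(), X_test.copy()
scaler = StandardScaler()
X_train[numerical_cols] = scaler.fit_transform(X_train[numerical_cols])
X_test[numerical_cols] = scaler.transform(X_test[numerical_cols])

In [None]:
# Handle imbalance if needed: oversample the training split only
minority_share = y_train.value_counts(normalize=True).min()
if minority_share < 0.4:  # illustrative cutoff; adjust to taste
    smote = SMOTE(random_state=42)
    X_train, y_train = smote.fit_resample(X_train, y_train)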
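
The modification list mentions a validation set, whereas the split above produces only training and test sets. A minimal sketch of a stratified three-way split follows, assuming a 60/20/20 division and synthetic stand-in data; the names with a _toy suffix are hypothetical and do not touch the notebook's variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 30% of them in class 0
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(100, 3))
y_toy = np.array([0] * 30 + [1] * 70)

# First hold out 20% of all rows as the test set
X_trainval, X_test_toy, y_trainval, y_test_toy = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
# Then take 25% of the remainder (20% of the total) as the validation set
X_train_toy, X_val_toy, y_train_toy, y_val_toy = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42, stratify=y_trainval
)

print(len(y_train_toy), len(y_val_toy), len(y_test_toy))
print(y_train_toy.mean(), y_val_toy.mean(), y_test_toy.mean())
```

Stratifying both calls keeps the class proportions close to the original in all three parts.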

## Evaluation Metrics

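For credit risk (binary classification, imbalanced), the metrics below treat bad credit as the positive class. If the label encoding maps bad to 0, pass pos_label=0 to scikit-learn's precision, recall, and F1 functions, which otherwise score label 1.
- **Accuracy:** Overall correctness. It can mislead under imbalance, since always predicting the majority class already scores the majority share.
- **Precision:** Share of predicted bad credits that are truly bad. Low precision means good applicants are wrongly flagged (false positives).
- **Recall:** Share of actual bad credits that are caught. Low recall means bad credits are classified as good (false negatives).
- **F1-Score:** Harmonic mean of precision and recall.
- **AUC-ROC:** Threshold-independent measure of how well the predicted probabilities rank the two classes; it does not depend on the class proportions.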

Primary: Recall and AUC-ROC, as missing bad credits is costly.
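
The definitions above can be checked by hand on a small example. The sketch below counts the confusion-matrix cells in plain Python with bad credit as the positive class; the labels are made up for illustration and the encoding 0 = bad, 1 = good is an assumption.

```python
# Hand-made labels for illustration: 0 = bad, 1 = good (assumed encoding)
y_true = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
y_pred = [0, 1, 1, 1, 1, 1, 1, 1, 1, 1]

POSITIVE = 0  # score the bad class

pairs = list(zip(y_true, y_pred))
tp = sum(t == POSITIVE and p == POSITIVE for t, p in pairs)  # bad, flagged as bad
fp = sum(t != POSITIVE and p == POSITIVE for t, p in pairs)  # good, flagged as bad
fn = sum(t == POSITIVE and p != POSITIVE for t, p in pairs)  # bad, passed as good
tn = sum(t != POSITIVE and p != POSITIVE for t, p in pairs)  # good, passed as good

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f'accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}')
```

Here accuracy is 0.80 even though only one of the three bad credits is caught (recall of one third), which is the failure mode that motivates recall as a primary metric.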

In [None]:
# Example of how to compute the metrics (after model prediction).
# pos_label=0 scores the bad class, assuming the encoding bad=0, good=1.
# y_pred = model.predict(X_test)
# y_prob = model.predict_proba(X_test)[:, 1]  # probability of class 1
#
# print('Accuracy:', accuracy_score(y_test, y_pred))
# print('Precision:', precision_score(y_test, y_pred, pos_label=0))
# print('Recall:', recall_score(y_test, y_pred, pos_label=0))
# print('F1:', f1_score(y_test, y_pred, pos_label=0))
# print('AUC:', roc_auc_score(y_test, y_prob))