# German Credit Risk Analysis and Predictive Modeling using Machine Learning Techniques

This notebook implements a machine learning pipeline to predict credit risk using the **German Credit Dataset**. The goal is to classify customers as either "good" or "bad" credit risks based on various features such as age, job status, loan amount, and credit history.

---

## Table of Contents:
1. [Introduction](#Introduction)
2. [Loading the Dataset](#Loading-the-Dataset)
3. [Exploratory Data Analysis (EDA)](#Exploratory-Data-Analysis)
4. [Data Preprocessing](#Data-Preprocessing)
5. [Model Training](#Model-Training)
6. [Model Evaluation](#Model-Evaluation)
7. [Making Predictions on New Data](#Making-Predictions-on-New-Data)
8. [Conclusion](#Conclusion)

---

## 1. Introduction <a id="Introduction"></a>

In this notebook, we will:
- Load the **German Credit Dataset**.
- Preprocess the data by encoding categorical variables and scaling numerical features.
- Train three machine learning models: **Logistic Regression**, **Random Forest**, and **Gradient Boosting**.
- Evaluate the models using accuracy and F1 score.
- Make predictions on new customer data.

---

## 2. Loading the Dataset <a id="Loading-the-Dataset"></a>

We will start by loading the dataset from a `.data` file and assigning appropriate column names based on the dataset documentation.

In [3]:
import pandas as pd

# Define column names based on german.doc (documentation)
columns = [
    'Status_of_existing_checking_account', 'Duration_in_month', 'Credit_history',
    'Purpose', 'Credit_amount', 'Savings_account_bonds', 'Present_employment_since',
    'Installment_rate_in_percentage_of_disposable_income', 'Personal_status_and_sex',
    'Other_debtors_guarantors', 'Present_residence_since', 'Property',
    'Age_in_years', 'Other_installment_plans', 'Housing',
    'Number_of_existing_credits_at_this_bank', 'Job',
    'Number_of_people_being_liable_to_provide_maintenance_for',
    'Telephone', 'Foreign_worker', 'Credit_risk'
]

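# Load dataset (space-separated values, no header row in the file)
# Note: this absolute path is specific to the author's machine; point it at your own copy of german.data.
file_path = "C:/Users/shash/OneDrive/Desktop/Repos/german-credit-risk-analysis/data/german.data"
df = pd.read_csv(file_path, sep=' ', header=None, names=columns)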

# Display basic information about the dataset
print(f"Dataset Shape: {df.shape}")
df.head()

Dataset Shape: (1000, 21)


| | Status_of_existing_checking_account | Duration_in_month | Credit_history | Purpose | Credit_amount | Savings_account_bonds | Present_employment_since | Installment_rate_in_percentage_of_disposable_income | Personal_status_and_sex | Other_debtors_guarantors | ... | Property | Age_in_years | Other_installment_plans | Housing | Number_of_existing_credits_at_this_bank | Job | Number_of_people_being_liable_to_provide_maintenance_for | Telephone | Foreign_worker | Credit_risk |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A11 | 6 | A34 | A43 | 1169 | A65 | A75 | 4 | A93 | A101 | ... | A121 | 67 | A143 | A152 | 2 | A173 | 1 | A192 | A201 | 1 |
| 1 | A12 | 48 | A32 | A43 | 5951 | A61 | A73 | 2 | A92 | A101 | ... | A121 | 22 | A143 | A152 | 1 | A173 | 1 | A191 | A201 | 2 |
| 2 | A14 | 12 | A34 | A46 | 2096 | A61 | A74 | 2 | A93 | A101 | ... | A121 | 49 | A143 | A152 | 1 | A172 | 2 | A191 | A201 | 1 |
| 3 | A11 | 42 | A32 | A42 | 7882 | A61 | A74 | 2 | A93 | A103 | ... | A122 | 45 | A143 | A153 | 1 | A173 | 2 | A191 | A201 | 1 |
| 4 | A11 | 24 | A33 | A40 | 4870 | A61 | A73 | 3 | A93 | A101 | ... | A124 | 53 | A143 | A153 | 2 | A173 | 2 | A191 | A201 | 2 |


### Explanation:
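- The dataset contains 1,000 rows and 21 columns: 20 features plus the target, `Credit_risk`.
- Each row represents a customer applying for credit, described by features such as `Credit_amount` and `Duration_in_month`.
- Categorical values are stored as codes (for example `A11` or `A34`) defined in the dataset documentation; the target is coded 1 for a good credit risk and 2 for a bad one.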
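
The dataset documentation encodes the target as 1 = good and 2 = bad. As a minimal sketch of how those codes can be decoded and tallied, the snippet below uses only the five `Credit_risk` values visible in the `df.head()` output; the names `RISK_LABELS` and `class_counts` are illustrative and do not appear in the notebook.

```python
from collections import Counter

# Target encoding from the dataset documentation: 1 = good, 2 = bad.
RISK_LABELS = {1: "good", 2: "bad"}

def class_counts(codes):
    """Count how many records fall into each credit-risk class."""
    return Counter(RISK_LABELS[code] for code in codes)

# Credit_risk values of the five rows displayed by df.head().
head_codes = [1, 2, 1, 1, 2]
counts = class_counts(head_codes)
print(counts)
```

On the full DataFrame, the same tally should be obtainable with `df['Credit_risk'].map(RISK_LABELS).value_counts()`.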

## 3. Exploratory Data Analysis (EDA) <a id="Exploratory-Data-Analysis"></a>

Before preprocessing, let's explore the dataset to understand its structure.

In [6]:
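# Note: a notebook cell displays only its last expression, so only the
# df.describe() result appears below. Wrap the earlier calls in print()
# or run them in separate cells to see their output.

# Check for missing values
df.isnull().sum()

# Check data types of each column
df.dtypes

# Check unique values in categorical columns (e.g., Credit_history)
df['Credit_history'].unique()

# Summary statistics for numerical columns
df.describe()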

| | Duration_in_month | Credit_amount | Installment_rate_in_percentage_of_disposable_income | Present_residence_since | Age_in_years | Number_of_existing_credits_at_this_bank | Number_of_people_being_liable_to_provide_maintenance_for | Credit_risk |
|---|---|---|---|---|---|---|---|---|
| count | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 |
| mean | 20.903 | 3271.258 | 2.973 | 2.845 | 35.546 | 1.407 | 1.155 | 1.3 |
| std | 12.058814 | 2822.736876 | 1.118715 | 1.103718 | 11.375469 | 0.577654 | 0.362086 | 0.458487 |
| min | 4.0 | 250.0 | 1.0 | 1.0 | 19.0 | 1.0 | 1.0 | 1.0 |
| 25% | 12.0 | 1365.5 | 2.0 | 2.0 | 27.0 | 1.0 | 1.0 | 1.0 |
| 50% | 18.0 | 2319.5 | 3.0 | 3.0 | 33.0 | 1.0 | 1.0 | 1.0 |
| 75% | 24.0 | 3972.25 | 4.0 | 4.0 | 42.0 | 2.0 | 1.0 | 2.0 |
| max | 72.0 | 18424.0 | 4.0 | 4.0 | 75.0 | 4.0 | 2.0 | 2.0 |


### Explanation:
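- There are no missing values in the dataset.
- The dataset contains both categorical features (stored as `A...` codes) and numerical features. `df.describe()` summarizes only the numerical columns: seven features plus the target.
- The mean of `Credit_risk` is 1.3. Because the target takes only the values 1 and 2, this means 30% of the customers belong to class 2 (bad risk), so the classes are imbalanced.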
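
The two checks behind these observations can be sketched in plain Python. This is only an illustration of the idea, not how pandas implements `isnull()` or `dtypes`; the helper names `split_columns` and `count_missing` are hypothetical, and the three sample rows reuse values from the `df.head()` output.

```python
from numbers import Number

def split_columns(records):
    """Return (categorical, numerical) column names from a list of dict rows."""
    categorical, numerical = [], []
    for name in records[0]:
        values = [row[name] for row in records]
        # A column counts as numerical only if every value is a number.
        if all(isinstance(v, Number) for v in values):
            numerical.append(name)
        else:
            categorical.append(name)
    return categorical, numerical

def count_missing(records):
    """Count None values per column."""
    return {name: sum(row[name] is None for row in records) for name in records[0]}

# Three columns taken from the first rows shown by df.head().
rows = [
    {"Credit_history": "A34", "Duration_in_month": 6, "Credit_amount": 1169},
    {"Credit_history": "A32", "Duration_in_month": 48, "Credit_amount": 5951},
    {"Credit_history": "A34", "Duration_in_month": 12, "Credit_amount": 2096},
]
categorical, numerical = split_columns(rows)
missing = count_missing(rows)
print(categorical, numerical, missing)
```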

## 4. Data Preprocessing <a id="Data-Preprocessing"></a>

We will now preprocess the data by:
1. Encoding categorical variables using **One-Hot Encoding**.
2. Scaling numerical features using **StandardScaler**.
3. Balancing classes using **SMOTE** (Synthetic Minority Over-sampling Technique).
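
Steps 1 and 2 can be illustrated with a toy example before running the real pipeline. The sketch below is plain Python rather than scikit-learn, and all function names are hypothetical. It learns the scaling statistics and the category list from "training" values only, then reuses them on a "test" value, which mirrors the fit-on-train, transform-on-test pattern. The sample values come from the `Credit_amount` and `Credit_history` columns of the `df.head()` output.

```python
from statistics import mean, pstdev

def fit_scaler(values):
    """Learn the mean and population standard deviation from training values."""
    return mean(values), pstdev(values)

def scale(values, mu, sigma):
    """Standardize values using statistics learned elsewhere."""
    return [(v - mu) / sigma for v in values]

def fit_categories(values):
    """Learn the sorted list of categories seen in the training values."""
    return sorted(set(values))

def one_hot(values, categories):
    """Encode each value as a 0/1 indicator vector over the learned categories."""
    return [[1 if v == c else 0 for c in categories] for v in values]

# Numerical feature: statistics come from the training values only.
train_amount = [1169, 5951, 2096, 7882]
test_amount = [4870]
mu, sigma = fit_scaler(train_amount)
train_scaled = scale(train_amount, mu, sigma)
test_scaled = scale(test_amount, mu, sigma)

# Categorical feature: the category list also comes from the training values only.
train_history = ["A34", "A32", "A34", "A32"]
test_history = ["A33"]  # a category not seen during fitting
categories = fit_categories(train_history)
train_encoded = one_hot(train_history, categories)
test_encoded = one_hot(test_history, categories)

print(train_scaled, test_scaled)
print(categories, train_encoded, test_encoded)
```

In this sketch an unseen category becomes an all-zero row, which corresponds to `OneHotEncoder(handle_unknown='ignore')`. With its default settings, `OneHotEncoder` raises an error on unseen categories instead.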

In [7]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from imblearn.over_sampling import SMOTE

# Separate features and target variable ('Credit_risk')
X = df.drop('Credit_risk', axis=1)
y = df['Credit_risk']

# Identify categorical and numerical columns
categorical_cols = X.select_dtypes(include=['object']).columns.tolist()
numerical_cols = X.select_dtypes(include=['int64', 'float64']).columns.tolist()

# Preprocessing pipeline for numerical and categorical features
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_cols),
        ('cat', OneHotEncoder(), categorical_cols)
    ])

# Split data into training and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Apply preprocessing pipeline to training and test data
X_train_preprocessed = preprocessor.fit_transform(X_train)
X_test_preprocessed = preprocessor.transform(X_test)

# Apply SMOTE to balance classes in training set
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train_preprocessed, y_train)

print(f"Training Data Shape: {X_train_smote.shape}")
print(f"Test Data Shape: {X_test_preprocessed.shape}")

Training Data Shape: (1118, 61)
Test Data Shape: (200, 61)


### Explanation:
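- **One-Hot Encoding** converts each categorical variable into binary indicator columns. Together with the 7 numerical columns, this expands the 20 features to 61 columns.
- **StandardScaler** rescales the numerical features to zero mean and unit variance. The preprocessor is fitted on the training set only and then applied to the test set, so no test-set statistics leak into training.
- **SMOTE** is applied to the training set only. It grows the training set from 800 to 1,118 rows (559 per class), while the 200-row test set keeps its original class distribution.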
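
The core idea of SMOTE can be sketched in a few lines: each synthetic sample lies on the line segment between a minority-class point and one of its nearest minority-class neighbours. The version below is a deliberate simplification that always uses the single nearest neighbour, whereas the library implementation chooses among several. The function names and toy points are hypothetical.

```python
import random

def nearest_neighbor(point, others):
    """Return the point in `others` closest to `point` (squared Euclidean distance)."""
    return min(others, key=lambda o: sum((a - b) ** 2 for a, b in zip(point, o)))

def synthesize(point, neighbor, rng):
    """Create one synthetic sample on the segment between point and neighbor."""
    lam = rng.random()  # interpolation factor in [0, 1)
    return [a + lam * (b - a) for a, b in zip(point, neighbor)]

def oversample(minority, n_new, seed=42):
    """Generate n_new synthetic minority samples by interpolation."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        point = rng.choice(minority)
        others = [m for m in minority if m is not point]
        neighbor = nearest_neighbor(point, others)
        synthetic.append(synthesize(point, neighbor, rng))
    return synthetic

# Toy minority class with two features per sample.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_samples = oversample(minority, n_new=2)
print(new_samples)
```

Because the samples are interpolated, one-hot columns of synthetic rows can take fractional values. imbalanced-learn also provides `SMOTENC` for data that mixes categorical and numerical features, which may be worth considering here.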

## 5. Model Training <a id="Model-Training"></a>

We will now train three machine learning models:
1. Logistic Regression
2. Random Forest Classifier
3. Gradient Boosting Classifier

In [8]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# Train Logistic Regression model
log_reg_model = LogisticRegression(max_iter=1000)
log_reg_model.fit(X_train_smote, y_train_smote)

# Train Random Forest model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train_smote, y_train_smote)

# Train Gradient Boosting model
gb_model = GradientBoostingClassifier(n_estimators=100, random_state=42)
gb_model.fit(X_train_smote, y_train_smote)

print("Models trained successfully.")

Models trained successfully.


### Explanation:
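We trained three different models on the preprocessed, SMOTE-balanced training data:
1. Logistic Regression: A linear model for binary classification. `max_iter=1000` raises the solver's iteration limit above the default of 100.
2. Random Forest: An ensemble method that builds multiple decision trees (100 here) and combines their predictions.
3. Gradient Boosting: An ensemble method that builds trees sequentially (100 here), each one fitted to reduce the errors of the trees before it.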
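
To make the "linear model" description concrete, the sketch below shows how a fitted logistic regression turns a feature vector into a class: a weighted sum is passed through the sigmoid function and compared with a threshold. The weights, bias, and feature values are made up for illustration and are not taken from `log_reg_model`. The class codes follow the dataset's convention, with the probability modelled for class 2.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability of class 2 under a linear model (hypothetical parameters)."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

def predict_class(features, weights, bias, threshold=0.5):
    """Return class code 2 if the probability reaches the threshold, else 1."""
    return 2 if predict_proba(features, weights, bias) >= threshold else 1

# Made-up parameters for a model with two standardized features.
weights = [0.8, -0.5]
bias = -0.2

customer_a = [1.5, 0.3]   # score z = 0.85, probability about 0.70
customer_b = [-1.0, 1.0]  # score z = -1.5, probability about 0.18

print(predict_class(customer_a, weights, bias))
print(predict_class(customer_b, weights, bias))
```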

## 6. Model Evaluation <a id="Model-Evaluation"></a>

We will evaluate each model using accuracy and F1 score.

In [9]:
from sklearn.metrics import accuracy_score, f1_score

def evaluate_model(model, X_test, y_test):
    predictions = model.predict(X_test)
    
    acc = accuracy_score(y_test, predictions)
    f1 = f1_score(y_test, predictions)
    
    print(f"Accuracy: {acc:.4f}")
    print(f"F1 Score: {f1:.4f}")

print("Evaluating Logistic Regression Model:")
evaluate_model(log_reg_model, X_test_preprocessed, y_test)

print("\nEvaluating Random Forest Model:")
evaluate_model(rf_model, X_test_preprocessed, y_test)

print("\nEvaluating Gradient Boosting Model:")
evaluate_model(gb_model, X_test_preprocessed, y_test)

Evaluating Logistic Regression Model:
Accuracy: 0.7450
F1 Score: 0.8061

Evaluating Random Forest Model:
Accuracy: 0.8000
F1 Score: 0.8649

Evaluating Gradient Boosting Model:
Accuracy: 0.7800
F1 Score: 0.8462


### Explanation:
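The models are evaluated on the 200-row test set using two key metrics:
1. **Accuracy**: The fraction of predictions that are correct.
2. **F1 Score**: The harmonic mean of precision and recall (useful when dealing with imbalanced datasets). With its default settings, `f1_score` treats label 1 as the positive class, so the scores above measure how well each model identifies *good* credit risks. Passing `pos_label=2` would score the bad-risk class instead.

On this split, Random Forest scores highest on both metrics (accuracy 0.8000, F1 0.8649).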
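
Both metrics can be computed by hand from the predictions, which also shows why the choice of positive class matters for F1. The labels below are a small made-up example using the dataset's 1/2 coding; they are not the notebook's actual test predictions, and the helper functions are illustrative stand-ins for the scikit-learn ones.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def f1(y_true, y_pred, pos_label):
    """F1 score for the class given by pos_label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == pos_label and p == pos_label for t, p in pairs)
    fp = sum(t != pos_label and p == pos_label for t, p in pairs)
    fn = sum(t == pos_label and p != pos_label for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up labels: seven good (1) and three bad (2) customers.
y_true = [1, 1, 1, 1, 1, 1, 1, 2, 2, 2]
y_pred = [1, 1, 1, 1, 1, 1, 2, 1, 2, 2]

acc = accuracy(y_true, y_pred)
f1_good = f1(y_true, y_pred, pos_label=1)
f1_bad = f1(y_true, y_pred, pos_label=2)
print(f"Accuracy: {acc:.4f}")
print(f"F1 (good, pos_label=1): {f1_good:.4f}")
print(f"F1 (bad, pos_label=2): {f1_bad:.4f}")
```

In this toy case the same predictions score noticeably higher for the majority class than for the minority class, so it can be worth reporting F1 for the bad-risk class as well.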

## 7. Making Predictions on New Data <a id="Making-Predictions-on-New-Data"></a>

We will now simulate predictions on new customer data using all three models.

In [10]:
import numpy as np

def make_prediction(model, new_data):
    prediction = model.predict(new_data)
    return prediction

# Stand-in for a new customer: reuse the first preprocessed row of the test set.
# Real new data must first go through preprocessor.transform() before prediction.
new_customer_data = np.array([X_test_preprocessed[0]])

print("\nMaking Predictions on New Customer Data:")

log_reg_prediction = make_prediction(log_reg_model, new_customer_data)
print(f"Logistic Regression Prediction: {log_reg_prediction}")

rf_prediction = make_prediction(rf_model, new_customer_data)
print(f"Random Forest Prediction: {rf_prediction}")

gb_prediction = make_prediction(gb_model, new_customer_data)
print(f"Gradient Boosting Prediction: {gb_prediction}")


Making Predictions on New Customer Data:
Logistic Regression Prediction: [2]
Random Forest Prediction: [2]
Gradient Boosting Prediction: [2]


### Explanation:
We simulated a new customer's data (using a sample from the test set) and passed it through each of the trained models to get predictions.
All three models output `[2]`, which in this dataset's encoding (1 = good, 2 = bad) means they classify this customer as a bad credit risk.
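
As a small convenience, the raw class codes can be decoded and compared across models. This is a sketch that is not part of the original notebook: `summarize_predictions` is a hypothetical helper, and the majority vote is only an illustration of how the three outputs could be combined.

```python
from collections import Counter

# Target encoding from the dataset documentation: 1 = good, 2 = bad.
RISK_LABELS = {1: "good", 2: "bad"}

def summarize_predictions(predictions):
    """Decode each model's class code and report the majority label."""
    decoded = {name: RISK_LABELS[code] for name, code in predictions.items()}
    majority_label, votes = Counter(decoded.values()).most_common(1)[0]
    return decoded, majority_label, votes

# Class codes printed by the prediction cell above.
predictions = {
    "Logistic Regression": 2,
    "Random Forest": 2,
    "Gradient Boosting": 2,
}
decoded, majority_label, votes = summarize_predictions(predictions)
print(decoded)
print(f"Majority: {majority_label} ({votes}/{len(predictions)} models)")
```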

## 8. Conclusion <a id="Conclusion"></a>

In this project:
- We loaded and preprocessed the German Credit Dataset.
- We trained three machine learning models (Logistic Regression, Random Forest, Gradient Boosting).
- We evaluated their performance using accuracy and F1 score.
- We made predictions on new customer data.

The best-performing model on the 200-row test set was **Random Forest**, which achieved an accuracy of 80% and an F1 score of 0.8649.

Possible next steps include tuning hyperparameters and deploying the model for real-time credit risk prediction.

---