# Credit Risk Prediction
## Vince Alihan

### Objective:

The goal of this project is to develop a model that predicts the likelihood of loan default. 
We will begin by implementing traditional methods such as logistic regression, 
and then explore more advanced techniques including XGBoost and neural networks.

### Key Learning Goals:

- Feature engineering and selection from financial data.
- Handling imbalanced datasets (since defaults are often rare).
- Experimenting with different evaluation metrics (e.g., ROC-AUC, precision-recall).
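
To see why accuracy alone is a poor guide when defaults are rare, consider a minimal sketch. The labels below are synthetic, and the 5% default rate is an assumption chosen for illustration, not a figure taken from the dataset used in this project.

```python
import numpy as np

# Synthetic labels (assumption): 50 defaults among 1,000 clients, a 5% positive rate
y_true = np.array([1] * 50 + [0] * 950)

# A trivial baseline that predicts "no default" for every client
y_pred = np.zeros_like(y_true)

# Defaults the baseline caught, and defaults it missed
true_pos = int(np.sum((y_pred == 1) & (y_true == 1)))
false_neg = int(np.sum((y_pred == 0) & (y_true == 1)))

accuracy = float(np.mean(y_pred == y_true))
recall = true_pos / (true_pos + false_neg)

print(f"accuracy = {accuracy:.2f}")
print(f"recall   = {recall:.2f}")
```

The baseline scores high on accuracy yet catches no defaults at all, which is why recall, precision-recall curves, and ROC-AUC are more informative for this kind of problem.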

In [25]:
import numpy as np
import pandas as pd

In [26]:
# Path to our dataset
dataset = "C:\\Users\\vince\\OneDrive\\Documents\\Projects\\Credit Risk Prediction\\default of credit card clients.xls"

# Importing our Excel Dataset
df = pd.read_excel(dataset)

# Print out the first few rows of our dataset
print(df.head())

# Check for any missing values
missing_values = df.isnull().sum()
if missing_values.sum() == 0:
    print("No missing values found.")
else: 
    print(missing_values)

# Print the datatype of each column (df.info() prints directly and returns None)
df.info()

  Unnamed: 0         X1   X2         X3        X4   X5     X6     X7     X8  \
0         ID  LIMIT_BAL  SEX  EDUCATION  MARRIAGE  AGE  PAY_0  PAY_2  PAY_3   
1          1      20000    2          2         1   24      2      2     -1   
2          2     120000    2          2         2   26     -1      2      0   
3          3      90000    2          2         2   34      0      0      0   
4          4      50000    2          2         1   37      0      0      0   

      X9  ...        X15        X16        X17       X18       X19       X20  \
0  PAY_4  ...  BILL_AMT4  BILL_AMT5  BILL_AMT6  PAY_AMT1  PAY_AMT2  PAY_AMT3   
1     -1  ...          0          0          0         0       689         0   
2      0  ...       3272       3455       3261         0      1000      1000   
3      0  ...      14331      14948      15549      1518      1500      1000   
4      0  ...      28314      28959      29547      2000      2019      1200   

        X21       X22       X23             

The output above shows that the row holding the real column names (ID, LIMIT_BAL, SEX, ...) was loaded as data, and all of the columns appear to be marked as the 'object' datatype. We must convert the values to the 'integer' datatype before we can use machine learning algorithms later on.
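
One possible alternative to converting after the fact is to tell pandas at load time which row holds the column names. The sketch below is illustrative only: it uses a small in-memory CSV (toy rows modeled on a few columns of the output above) and `pd.read_csv`, so it runs without the spreadsheet. `pd.read_excel` accepts the same `header` argument, but whether `header=1` matches the layout of the actual file is an assumption that should be checked against the data.

```python
import io
import pandas as pd

# Toy stand-in for the spreadsheet (assumption): line 0 holds the generic
# X1, X2, ... labels and line 1 holds the real column names
raw = io.StringIO(
    ",X1,X2,X5\n"
    "ID,LIMIT_BAL,SEX,AGE\n"
    "1,20000,2,24\n"
    "2,120000,2,26\n"
)

# header=1 uses the second line as column names and skips the line above it
toy = pd.read_csv(raw, header=1)

# With the names out of the data rows, pandas infers integer columns directly
print(toy.dtypes)
```

Loading the data this way would avoid the separate conversion step, at the cost of discarding the generic X1, X2, ... labels.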

In [32]:
# Store the column headers
headers = df.iloc[0]

# Create new dataframe starting from row 1 (skipping headers)
dfcopy = df.iloc[1:].copy()

# Convert all values to integers; errors='coerce' turns unparseable entries
# into NaN, which fillna(0) then replaces with 0
dfcopy = dfcopy.apply(pd.to_numeric, errors='coerce').fillna(0).astype(int)

# Restore the column headers
dfcopy.columns = headers
df = dfcopy

# Verify the conversion
print("Data Types after conversion:")
print(df.dtypes)
print("\nFirst few rows of converted data:")
print(df.head())

Data Types after conversion:
0
ID                            int32
LIMIT_BAL                     int32
SEX                           int32
EDUCATION                     int32
MARRIAGE                      int32
AGE                           int32
PAY_0                         int32
PAY_2                         int32
PAY_3                         int32
PAY_4                         int32
PAY_5                         int32
PAY_6                         int32
BILL_AMT1                     int32
BILL_AMT2                     int32
BILL_AMT3                     int32
BILL_AMT4                     int32
BILL_AMT5                     int32
BILL_AMT6                     int32
PAY_AMT1                      int32
PAY_AMT2                      int32
PAY_AMT3                      int32
PAY_AMT4                      int32
PAY_AMT5                      int32
PAY_AMT6                      int32
default payment next month    int32
dtype: object

First few rows of converted data:
0  ID  LIMIT_BAL  SE
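
Because `errors='coerce'` followed by `fillna(0)` replaces any unparseable entry with 0 without warning, it may be worth counting those entries before filling them. A minimal sketch on toy data follows; the frame and its `"n/a"` entry are made up for illustration, and `coerced`, `newly_missing`, and `clean` are hypothetical names.

```python
import pandas as pd

# Toy object-typed frame (assumption): one entry cannot be parsed as a number
toy = pd.DataFrame(
    {"LIMIT_BAL": ["20000", "120000", "n/a"], "AGE": ["24", "26", "34"]},
    dtype=object,
)

# Coerce without filling, so unparseable entries stay visible as NaN
coerced = toy.apply(pd.to_numeric, errors="coerce")

# Count entries that were present before coercion but are NaN afterwards
newly_missing = (coerced.isna() & toy.notna()).sum()
print(newly_missing)

# Fill and cast only after inspecting the counts
clean = coerced.fillna(0).astype(int)
```

If every count is zero, the conversion changed no values and filling with 0 is harmless. A nonzero count points to entries that deserve a closer look first.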

