## <u>2. Data Preparation: Bureau</u>

This notebook removes variables from the bureau dataset that are irrelevant to the analysis. The categorical variables (nominal and ordinal) are examined first, followed by the metric variables. Unlike the application data, the bureau dataset has a 1:N relationship, because a borrower may have held several loans in the past. The historical records therefore have to be grouped per borrower.
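
The 1:N relationship can be illustrated with a minimal sketch. The toy frame and the output column names `n_loans` and `mean_credit` are made up for illustration and do not appear in the dataset; only the grouping by `SK_ID_CURR` follows the notebook.

```python
import pandas as pd

# Toy stand-in for bureau.csv: one row per historical loan (values are made up).
toy_bureau = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 1, 2],
    "SK_ID_BUREAU": [10, 11, 12, 20],
    "AMT_CREDIT_SUM": [1000.0, 2000.0, 3000.0, 500.0],
})

# Collapse the N loan rows of each borrower into a single row.
per_borrower = toy_bureau.groupby("SK_ID_CURR").agg(
    n_loans=("SK_ID_BUREAU", "count"),
    mean_credit=("AMT_CREDIT_SUM", "mean"),
)
print(per_borrower)
```

After the aggregation there is exactly one row per borrower, so the result can be joined 1:1 to the application data.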

*Procedure for categorical variables:*
- Grouping of the variables
- Removal of variables with more than 60% missing data
- Removal of nominal variables with less than 5 percentage points (pp) difference in relative share between paybacks and defaults
- Formation of correlation clusters (contingency coefficient for nominal data)
- Removal of variables with no causal influence on the creditworthiness assessment
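
The screening steps listed above could be sketched roughly as follows. The helper names, the toy data, and the reading of the 5 pp rule (largest gap between category shares among paybacks and among defaults) are assumptions made for illustration, not the notebook's actual implementation. The contingency coefficient uses the standard formula $C = \sqrt{\chi^2 / (\chi^2 + n)}$.

```python
import numpy as np
import pandas as pd
from scipy import stats

def missing_share(frame):
    """Fraction of missing values per column."""
    return frame.isna().mean()

def max_share_difference(feature, target):
    """Largest absolute gap (in percentage points) between the category
    shares among paybacks (target == 0) and defaults (target == 1)."""
    shares = pd.crosstab(feature, target, normalize="columns")
    return float((shares[0] - shares[1]).abs().max() * 100)

def contingency_coefficient(a, b):
    """Pearson's contingency coefficient C = sqrt(chi2 / (chi2 + n))."""
    table = pd.crosstab(a, b)
    chi2 = stats.chi2_contingency(table, correction=False)[0]
    n = table.to_numpy().sum()
    return float(np.sqrt(chi2 / (chi2 + n)))

# Toy data (made up): TARGET 0 = payback, 1 = default.
toy = pd.DataFrame({
    "TARGET": [0, 0, 0, 0, 1, 1, 1, 1],
    "FLAG": ["a", "a", "a", "b", "a", "b", "b", "b"],
    "SPARSE": [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, "x", "y"],
})

too_sparse = missing_share(toy) > 0.60              # 60% threshold from the list
share_gap = max_share_difference(toy["FLAG"], toy["TARGET"])
coefficient = contingency_coefficient(toy["FLAG"], toy["TARGET"])
print(too_sparse, share_gap, coefficient, sep="\n")
```

In this sketch a column would be dropped if `too_sparse` is true for it or if its `share_gap` stays below 5.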

*Procedure for metric variables:*
- Grouping of the variables
- Removal of variables with more than 60% missing data
- Formation of correlation clusters (Pearson correlation coefficient)
- Removal of variables with no causal influence on the creditworthiness assessment
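
One possible way to form correlation clusters is to link every pair of columns whose absolute Pearson correlation exceeds a threshold and to treat each connected group as one cluster. The function name `correlation_clusters`, the threshold of 0.8, and the toy data are assumptions for this sketch; the notebook does not state a threshold.

```python
import pandas as pd

def correlation_clusters(frame, threshold=0.8):
    """Group columns whose absolute Pearson correlation exceeds `threshold`
    (connected components of the thresholded correlation graph)."""
    corr = frame.corr(method="pearson").abs()
    unvisited = list(corr.columns)
    clusters = []
    while unvisited:
        stack = [unvisited.pop(0)]
        cluster = []
        while stack:
            col = stack.pop()
            cluster.append(col)
            # Columns strongly correlated with `col` join the same cluster.
            linked = [c for c in unvisited if corr.loc[col, c] > threshold]
            for c in linked:
                unvisited.remove(c)
            stack.extend(linked)
        clusters.append(sorted(cluster))
    return clusters

# Toy data (made up): `x_scaled` is almost a multiple of `x`.
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x_scaled": [2.0, 4.0, 6.0, 8.0, 10.5],
    "noise": [3.0, -1.0, 2.0, -2.0, 1.0],
})
clusters = correlation_clusters(toy)
print(clusters)
```

From each cluster with more than one member, a single representative variable could then be kept.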

## Initialization

In [1]:
from pathlib import Path
from scipy import stats

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns

np.set_printoptions(suppress=True)

pd.options.display.max_columns = None
pd.options.display.max_rows = None
pd.options.display.max_colwidth = None

from sklearn.linear_model import LogisticRegression

from IPython.display import display, Markdown

In [2]:
path1 = Path(r"A:\Workspace\Python\Masterarbeit\Kaggle Home Credit Datensatz")
path2 = Path(r"C:\Users\rober\Documents\Workspace\Python\Masterarbeit\Kaggle Home Credit Datensatz")

if path1.is_dir():
    DATASET_DIR = path1
else:
    DATASET_DIR = path2

In [34]:
app_test = pd.read_csv(DATASET_DIR / "application_test.csv")
bureau = pd.read_csv(DATASET_DIR / "bureau.csv")
bureau2 = pd.read_csv(DATASET_DIR / "2. Datenaufbereitung" / "bureau.csv", index_col=0)
description = pd.read_csv(DATASET_DIR / "HomeCredit_columns_description.csv", encoding="latin", index_col=0)

In [4]:
des = description.loc[description['Table']=="bureau.csv", "Row":"Special"]

In [4]:
bureau.head()

Unnamed: 0,SK_ID_CURR,SK_ID_BUREAU,CREDIT_ACTIVE,CREDIT_CURRENCY,DAYS_CREDIT,CREDIT_DAY_OVERDUE,DAYS_CREDIT_ENDDATE,DAYS_ENDDATE_FACT,AMT_CREDIT_MAX_OVERDUE,CNT_CREDIT_PROLONG,AMT_CREDIT_SUM,AMT_CREDIT_SUM_DEBT,AMT_CREDIT_SUM_LIMIT,AMT_CREDIT_SUM_OVERDUE,CREDIT_TYPE,DAYS_CREDIT_UPDATE,AMT_ANNUITY
0,215354,5714462,Closed,currency 1,-497,0,-153.0,-153.0,,0,91323.0,0.0,,0.0,Consumer credit,-131,
1,215354,5714463,Active,currency 1,-208,0,1075.0,,,0,225000.0,171342.0,,0.0,Credit card,-20,
2,215354,5714464,Active,currency 1,-203,0,528.0,,,0,464323.5,,,0.0,Consumer credit,-16,
3,215354,5714465,Active,currency 1,-203,0,,,,0,90000.0,,,0.0,Credit card,-16,
4,215354,5714466,Active,currency 1,-629,0,1197.0,,77674.5,0,2700000.0,,,0.0,Consumer credit,-21,


In [5]:
# Columns that must not be modified during preparation
skip = ["TARGET", "SK_ID_CURR", "SK_ID_BUREAU"]

In [7]:
# Nominal and metric columns
n_heads = [element for element in bureau.columns if bureau[element].dtype.name == "object"]
m_heads = [element for element in bureau.columns if bureau[element].dtype.name != "object"]

## <u>Categorical variables</u>

In [8]:
df = bureau[["SK_ID_BUREAU", "SK_ID_CURR"] + n_heads].copy()

In [11]:
df.head()

Unnamed: 0,SK_ID_BUREAU,SK_ID_CURR,CREDIT_ACTIVE,CREDIT_CURRENCY,CREDIT_TYPE
0,5714462,215354,Closed,currency 1,Consumer credit
1,5714463,215354,Active,currency 1,Credit card
2,5714464,215354,Active,currency 1,Consumer credit
3,5714465,215354,Active,currency 1,Credit card
4,5714466,215354,Active,currency 1,Consumer credit


# Information content:
- Number of loans per borrower
- Credit status of the loans
- Credit type

In [12]:
# Number of loans

cnt = df[["SK_ID_CURR", "SK_ID_BUREAU"]].groupby(by=["SK_ID_CURR"]).count()
cnt.columns = ["CNT_BURAEU"]
cnt.head()

SK_ID_CURR,CNT_BURAEU
100001,7
100002,8
100003,4
100004,2
100005,3


Over their credit history, borrower 100002 has taken out 8 loans with external lenders.

In [13]:
# Credit status

status = df[["SK_ID_CURR", "CREDIT_ACTIVE"]].groupby(by=["SK_ID_CURR", "CREDIT_ACTIVE"]).size().unstack(fill_value=0)
status.head()

SK_ID_CURR,Active,Bad debt,Closed,Sold
100001,3,0,4,0
100002,2,0,6,0
100003,1,0,3,0
100004,0,0,2,0
100005,2,0,1,0


At the time of the loan application recorded in the application dataset, borrower 100002 has 2 active loans with external lenders.
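
The `groupby(...).size().unstack(fill_value=0)` pattern counts the loans per borrower and category and spreads the categories into columns. As a cross-check, `pd.crosstab` produces the same count table; the toy data below is made up for illustration.

```python
import pandas as pd

# Toy data (made up): borrower 1 has three loans, borrower 2 has one.
toy = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 1, 2],
    "CREDIT_ACTIVE": ["Active", "Closed", "Closed", "Active"],
})

# Pattern used in the notebook: count per (borrower, status), then pivot.
via_groupby = (
    toy.groupby(["SK_ID_CURR", "CREDIT_ACTIVE"]).size().unstack(fill_value=0)
)

# Alternative with the same result: a cross-tabulation.
via_crosstab = pd.crosstab(toy["SK_ID_CURR"], toy["CREDIT_ACTIVE"])

print(via_groupby)
```

Combinations that never occur, such as a closed loan for borrower 2, appear as 0 rather than as missing values.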

In [14]:
# Credit type

typ = df[["SK_ID_CURR", "CREDIT_TYPE"]].groupby(by=["SK_ID_CURR", "CREDIT_TYPE"]).size().unstack(fill_value=0)
typ.head()

SK_ID_CURR,Another type of loan,Car loan,Cash loan (non-earmarked),Consumer credit,Credit card,Interbank credit,Loan for business development,Loan for purchase of shares (margin lending),Loan for the purchase of equipment,Loan for working capital replenishment,Microloan,Mobile operator loan,Mortgage,Real estate loan,Unknown type of loan
100001,0,0,0,7,0,0,0,0,0,0,0,0,0,0,0
100002,0,0,0,4,4,0,0,0,0,0,0,0,0,0,0
100003,0,0,0,2,2,0,0,0,0,0,0,0,0,0,0
100004,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0
100005,0,0,0,2,1,0,0,0,0,0,0,0,0,0,0


Over their credit history, borrower 100002 has taken out 4 consumer credits and 4 credit cards.

In [15]:
result = pd.DataFrame(index=bureau.SK_ID_CURR.unique())
result.index.name = "SK_ID_CURR"

In [16]:
result = pd.merge(result, cnt, how="left", left_index=True, right_index=True)
result = pd.merge(result, status, how="left", left_index=True, right_index=True)
result = pd.merge(result, typ, how="left", left_index=True, right_index=True)

In [17]:
df = result
df.head()

SK_ID_CURR,CNT_BURAEU,Active,Bad debt,Closed,Sold,Another type of loan,Car loan,Cash loan (non-earmarked),Consumer credit,Credit card,Interbank credit,Loan for business development,Loan for purchase of shares (margin lending),Loan for the purchase of equipment,Loan for working capital replenishment,Microloan,Mobile operator loan,Mortgage,Real estate loan,Unknown type of loan
215354,11,6,0,5,0,0,1,0,7,3,0,0,0,0,0,0,0,0,0,0
162297,6,3,0,3,0,0,0,0,3,2,0,0,0,0,0,0,0,1,0,0
402440,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
238881,8,3,0,5,0,0,0,0,5,3,0,0,0,0,0,0,0,0,0,0
222183,8,5,0,3,0,0,1,0,4,3,0,0,0,0,0,0,0,0,0,0


In [18]:
df = df.add_prefix("B_")

In [19]:
df.head()

SK_ID_CURR,B_CNT_BURAEU,B_Active,B_Bad debt,B_Closed,B_Sold,B_Another type of loan,B_Car loan,B_Cash loan (non-earmarked),B_Consumer credit,B_Credit card,B_Interbank credit,B_Loan for business development,B_Loan for purchase of shares (margin lending),B_Loan for the purchase of equipment,B_Loan for working capital replenishment,B_Microloan,B_Mobile operator loan,B_Mortgage,B_Real estate loan,B_Unknown type of loan
215354,11,6,0,5,0,0,1,0,7,3,0,0,0,0,0,0,0,0,0,0
162297,6,3,0,3,0,0,0,0,3,2,0,0,0,0,0,0,0,1,0,0
402440,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
238881,8,3,0,5,0,0,0,0,5,3,0,0,0,0,0,0,0,0,0,0
222183,8,5,0,3,0,0,1,0,4,3,0,0,0,0,0,0,0,0,0,0


### Storing the categorical values

In [20]:
cats = df

## <u>Metric variables</u>

In [21]:
df = bureau[m_heads].copy()

In [22]:
df.head()

Unnamed: 0,SK_ID_CURR,SK_ID_BUREAU,DAYS_CREDIT,CREDIT_DAY_OVERDUE,DAYS_CREDIT_ENDDATE,DAYS_ENDDATE_FACT,AMT_CREDIT_MAX_OVERDUE,CNT_CREDIT_PROLONG,AMT_CREDIT_SUM,AMT_CREDIT_SUM_DEBT,AMT_CREDIT_SUM_LIMIT,AMT_CREDIT_SUM_OVERDUE,DAYS_CREDIT_UPDATE,AMT_ANNUITY
0,215354,5714462,-497,0,-153.0,-153.0,,0,91323.0,0.0,,0.0,-131,
1,215354,5714463,-208,0,1075.0,,,0,225000.0,171342.0,,0.0,-20,
2,215354,5714464,-203,0,528.0,,,0,464323.5,,,0.0,-16,
3,215354,5714465,-203,0,,,,0,90000.0,,,0.0,-16,
4,215354,5714466,-629,0,1197.0,,77674.5,0,2700000.0,,,0.0,-21,


# Information content:
(Only loans whose end date lies at most half a year, i.e. 180 days, in the past are considered.)
- Variable creation: debt ratio

- Sums: CREDIT_DAY_OVERDUE (days overdue)
- Means: DAYS_CREDIT_ENDDATE (remaining term), AMT_CREDIT_SUM (loan amount), AMT_CREDIT_SUM_DEBT (outstanding debt), AMT_CREDIT_SUM_OVERDUE (overdue amount), AMT_ANNUITY (annual payment amount), DEBT_PER_LIMIT (debt ratio)
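
The list names a debt ratio `DEBT_PER_LIMIT` without giving its formula. A minimal sketch, assuming it is defined as `AMT_CREDIT_SUM_DEBT` divided by `AMT_CREDIT_SUM_LIMIT` and averaged per borrower, could look like this. The definition, the handling of zero limits, and the toy data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (made up), one row per loan.
toy = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2, 2],
    "DAYS_CREDIT_ENDDATE": [-400.0, 120.0, -30.0, 900.0],
    "AMT_CREDIT_SUM_DEBT": [0.0, 500.0, 100.0, 300.0],
    "AMT_CREDIT_SUM_LIMIT": [0.0, 1000.0, 400.0, 0.0],
})

# Keep loans whose end date is at most 180 days in the past (or in the future).
recent = toy[toy["DAYS_CREDIT_ENDDATE"] > -180].copy()

# Assumed definition: outstanding debt divided by credit limit.
# A limit of 0 is mapped to NaN to avoid division by zero.
limit = recent["AMT_CREDIT_SUM_LIMIT"].replace(0, np.nan)
recent["DEBT_PER_LIMIT"] = recent["AMT_CREDIT_SUM_DEBT"] / limit

# Mean debt ratio per borrower; NaN ratios are skipped by mean().
debt_ratio = recent.groupby("SK_ID_CURR")["DEBT_PER_LIMIT"].mean()
print(debt_ratio)
```

Whether loans without a limit should be skipped or counted differently depends on the intended meaning of the ratio.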

In [23]:
df = df[df["DAYS_CREDIT_ENDDATE"] > -180]

In [24]:
result = pd.DataFrame(index=bureau.SK_ID_CURR.unique())
result.index.name = "SK_ID_CURR"

In [25]:
# Sum of days overdue
CREDIT_DAY_OVERDUE = df[["SK_ID_CURR", "CREDIT_DAY_OVERDUE"]].groupby(by=["SK_ID_CURR"]).sum()
result = pd.merge(result, CREDIT_DAY_OVERDUE, how="left", left_index=True, right_index=True)

In [26]:
# Means
mean_heads = ["DAYS_CREDIT_ENDDATE", "AMT_CREDIT_SUM", "AMT_CREDIT_SUM_DEBT", "AMT_CREDIT_SUM_OVERDUE", "AMT_ANNUITY"]

for head in mean_heads:
    A = df[["SK_ID_CURR", head]]
    # Missing values are counted as 0 in the mean
    A = A.fillna(0)
    A = A.groupby(by=["SK_ID_CURR"]).mean()
    result = pd.merge(result, A, how="left", left_index=True, right_index=True)

In [27]:
df = result
df.head()

SK_ID_CURR,CREDIT_DAY_OVERDUE,DAYS_CREDIT_ENDDATE,AMT_CREDIT_SUM,AMT_CREDIT_SUM_DEBT,AMT_CREDIT_SUM_OVERDUE,AMT_ANNUITY
215354,0.0,5031.0,617125.05,47410.53,0.0,0.0
162297,0.0,5261.0,7033500.0,0.0,0.0,0.0
402440,0.0,269.0,89910.0,76905.0,0.0,0.0
238881,0.0,821.5,87342.03,4065.75,0.0,0.0
222183,0.0,929.5,977065.5,197513.64,0.0,0.0
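
The averaging loop fills missing values with 0 before taking the mean, which is why a column such as `AMT_ANNUITY` can come out as 0.0 for borrowers without any recorded value. The sketch below contrasts this with the default behaviour of `mean()`, which skips missing values; the toy data is made up, and the second variant is shown only for comparison, not as the notebook's method.

```python
import numpy as np
import pandas as pd

# Toy data (made up): borrower 1 has one known value, borrower 2 has none.
toy = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2, 2],
    "AMT_ANNUITY": [1000.0, np.nan, np.nan, np.nan],
})

# Variant used in the notebook: NaN -> 0, then mean.
filled_mean = toy.fillna(0).groupby("SK_ID_CURR")["AMT_ANNUITY"].mean()

# Alternative (not from the notebook): mean() skips NaN by default.
skipna_mean = toy.groupby("SK_ID_CURR")["AMT_ANNUITY"].mean()

print(filled_mean)
print(skipna_mean)
```

With the zero fill, a mean of 0 can mean either "no annuity" or "no data", whereas the second variant keeps the two cases apart.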


### Result

In [30]:
df = df.add_prefix("B_")

### Storing the metric values

Merging the loan count with the nominal variables

In [31]:
cats = pd.merge(cnt, cats, left_index=True, right_index=True)

Merging the metric and categorical variables

In [32]:
df = pd.merge(cats, df, left_index=True, right_index=True)

In [35]:
df = df[bureau2.columns.values]

In [36]:
df.head()

SK_ID_CURR,CNT_BURAEU,B_Active,B_Closed,B_CREDIT_DAY_OVERDUE,B_DAYS_CREDIT_ENDDATE,B_AMT_CREDIT_SUM,B_AMT_CREDIT_SUM_DEBT,B_AMT_CREDIT_SUM_OVERDUE,B_AMT_ANNUITY
100001,7,3,4,0.0,728.0,290936.25,149171.625,0.0,6204.375
100002,8,2,6,0.0,309.0,212745.0,81927.0,0.0,0.0
100003,4,1,3,0.0,1216.0,810000.0,0.0,0.0,0.0
100004,2,0,2,,,,,,
100005,3,2,1,0.0,439.333333,219042.0,189469.5,0.0,1420.5


In [37]:
df.to_csv(DATASET_DIR / "2. Datenaufbereitung" / "bureau_all.csv")