## <u>2. Data Preparation: Application</u>

This document removes variables from the Application dataset that are irrelevant to the analysis. The categorical variables (nominal and ordinal) are examined first, followed by the metric variables.

*Procedure for categorical variables:*
- Drop variables with more than 60% missing data
- Drop nominal variables in which no category differs by at least 5 percentage points (pp) in share between paybacks and defaults
- Build correlation clusters (contingency coefficient for nominal data)
- Build correlation clusters (Spearman rank correlation coefficient for ordinal data)
- Drop variables without a causal influence on the creditworthiness assessment
- Transform the nominal data into integers

*Procedure for metric variables:*
- Drop variables with more than 60% missing data
- Build correlation clusters (Pearson correlation coefficient)
- Drop variables without a causal influence on the creditworthiness assessment

## Initialization

In [1]:
from pathlib import Path
from scipy import stats

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns

np.set_printoptions(suppress=True)

pd.options.display.max_columns = None
pd.options.display.max_rows = None
pd.options.display.max_colwidth = None

from sklearn.linear_model import LogisticRegression

from IPython.display import display, Markdown

In [2]:
path1 = Path(r"A:\Workspace\Python\Masterarbeit\Kaggle Home Credit Datensatz")
path2 = Path(r"C:\Users\rober\Documents\Workspace\Python\Masterarbeit\Kaggle Home Credit Datensatz")

if path1.is_dir():
    DATASET_DIR = path1
else:
    DATASET_DIR = path2

In [3]:
app_train = pd.read_csv(DATASET_DIR / "application_train.csv", index_col="SK_ID_CURR")
description = pd.read_csv(DATASET_DIR / "HomeCredit_columns_description.csv", encoding="latin", index_col=0)

In [4]:
des = description.loc[description['Table']=="application_{train|test}.csv", "Row":"Special"]

In [5]:
app_train.head()

SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,NAME_TYPE_SUITE,NAME_INCOME_TYPE,NAME_EDUCATION_TYPE,NAME_FAMILY_STATUS,NAME_HOUSING_TYPE,REGION_POPULATION_RELATIVE,DAYS_BIRTH,DAYS_EMPLOYED,DAYS_REGISTRATION,DAYS_ID_PUBLISH,OWN_CAR_AGE,FLAG_MOBIL,FLAG_EMP_PHONE,FLAG_WORK_PHONE,FLAG_CONT_MOBILE,FLAG_PHONE,FLAG_EMAIL,OCCUPATION_TYPE,CNT_FAM_MEMBERS,REGION_RATING_CLIENT,REGION_RATING_CLIENT_W_CITY,WEEKDAY_APPR_PROCESS_START,HOUR_APPR_PROCESS_START,REG_REGION_NOT_LIVE_REGION,REG_REGION_NOT_WORK_REGION,LIVE_REGION_NOT_WORK_REGION,REG_CITY_NOT_LIVE_CITY,REG_CITY_NOT_WORK_CITY,LIVE_CITY_NOT_WORK_CITY,ORGANIZATION_TYPE,EXT_SOURCE_1,EXT_SOURCE_2,EXT_SOURCE_3,APARTMENTS_AVG,BASEMENTAREA_AVG,YEARS_BEGINEXPLUATATION_AVG,YEARS_BUILD_AVG,COMMONAREA_AVG,ELEVATORS_AVG,ENTRANCES_AVG,FLOORSMAX_AVG,FLOORSMIN_AVG,LANDAREA_AVG,LIVINGAPARTMENTS_AVG,LIVINGAREA_AVG,NONLIVINGAPARTMENTS_AVG,NONLIVINGAREA_AVG,APARTMENTS_MODE,BASEMENTAREA_MODE,YEARS_BEGINEXPLUATATION_MODE,YEARS_BUILD_MODE,COMMONAREA_MODE,ELEVATORS_MODE,ENTRANCES_MODE,FLOORSMAX_MODE,FLOORSMIN_MODE,LANDAREA_MODE,LIVINGAPARTMENTS_MODE,LIVINGAREA_MODE,NONLIVINGAPARTMENTS_MODE,NONLIVINGAREA_MODE,APARTMENTS_MEDI,BASEMENTAREA_MEDI,YEARS_BEGINEXPLUATATION_MEDI,YEARS_BUILD_MEDI,COMMONAREA_MEDI,ELEVATORS_MEDI,ENTRANCES_MEDI,FLOORSMAX_MEDI,FLOORSMIN_MEDI,LANDAREA_MEDI,LIVINGAPARTMENTS_MEDI,LIVINGAREA_MEDI,NONLIVINGAPARTMENTS_MEDI,NONLIVINGAREA_MEDI,FONDKAPREMONT_MODE,HOUSETYPE_MODE,TOTALAREA_MODE,WALLSMATERIAL_MODE,EMERGENCYSTATE_MODE,OBS_30_CNT_SOCIAL_CIRCLE,DEF_30_CNT_SOCIAL_CIRCLE,OBS_60_CNT_SOCIAL_CIRCLE,DEF_60_CNT_SOCIAL_CIRCLE,DAYS_LAST_PHONE_CHANGE,FLAG_DOCUMENT_2,FLAG_DOCUMENT_3,FLAG_DOCUMENT_4,FLAG_DOCUMENT_5,FLAG_DOCUMENT_6,FLAG_DOCUMENT_7,FLAG_DOCUMENT_8,FLAG_DOCUMENT_9,FLAG_DOCUMENT_10,FLAG_DOCUMENT_11,FLAG_DOCUMENT_12,FLAG_DOCUMENT_13,FLAG_DOCUMENT_14,FLAG_DOCUMENT_15,FLAG_DOCUMENT_16,FLAG_DOCUMENT_17,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,351000.0,Unaccompanied,Working,Secondary / secondary special,Single / not married,House / apartment,0.018801,-9461,-637,-3648.0,-2120,,1,1,0,1,1,0,Laborers,1.0,2,2,WEDNESDAY,10,0,0,0,0,0,0,Business Entity Type 3,0.083037,0.262949,0.139376,0.0247,0.0369,0.9722,0.6192,0.0143,0.0,0.069,0.0833,0.125,0.0369,0.0202,0.019,0.0,0.0,0.0252,0.0383,0.9722,0.6341,0.0144,0.0,0.069,0.0833,0.125,0.0377,0.022,0.0198,0.0,0.0,0.025,0.0369,0.9722,0.6243,0.0144,0.0,0.069,0.0833,0.125,0.0375,0.0205,0.0193,0.0,0.0,reg oper account,block of flats,0.0149,"Stone, brick",No,2.0,2.0,2.0,2.0,-1134.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
100003,0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,1129500.0,Family,State servant,Higher education,Married,House / apartment,0.003541,-16765,-1188,-1186.0,-291,,1,1,0,1,1,0,Core staff,2.0,1,1,MONDAY,11,0,0,0,0,0,0,School,0.311267,0.622246,,0.0959,0.0529,0.9851,0.796,0.0605,0.08,0.0345,0.2917,0.3333,0.013,0.0773,0.0549,0.0039,0.0098,0.0924,0.0538,0.9851,0.804,0.0497,0.0806,0.0345,0.2917,0.3333,0.0128,0.079,0.0554,0.0,0.0,0.0968,0.0529,0.9851,0.7987,0.0608,0.08,0.0345,0.2917,0.3333,0.0132,0.0787,0.0558,0.0039,0.01,reg oper account,block of flats,0.0714,Block,No,1.0,0.0,1.0,0.0,-828.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
100004,0,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,135000.0,Unaccompanied,Working,Secondary / secondary special,Single / not married,House / apartment,0.010032,-19046,-225,-4260.0,-2531,26.0,1,1,1,1,1,0,Laborers,1.0,2,2,MONDAY,9,0,0,0,0,0,0,Government,,0.555912,0.729567,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,0.0,0.0,-815.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
100006,0,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,297000.0,Unaccompanied,Working,Secondary / secondary special,Civil marriage,House / apartment,0.008019,-19005,-3039,-9833.0,-2437,,1,1,0,1,0,0,Laborers,2.0,2,2,WEDNESDAY,17,0,0,0,0,0,0,Business Entity Type 3,,0.650442,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2.0,0.0,2.0,0.0,-617.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,,,,
100007,0,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5,513000.0,Unaccompanied,Working,Secondary / secondary special,Single / not married,House / apartment,0.028663,-19932,-3038,-4311.0,-3458,,1,1,0,1,0,0,Core staff,1.0,2,2,THURSDAY,11,0,0,0,0,1,1,Religion,,0.322738,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,0.0,0.0,-1106.0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0


In [6]:
# Columns that must not be changed during preparation
skip = ["TARGET", "SK_ID_CURR"]

In [7]:
# nominal (n_heads), ordinal (o_heads), and metric (m_heads) columns
n_heads = ['TARGET', 'NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL', 'OCCUPATION_TYPE', 'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY', 'WEEKDAY_APPR_PROCESS_START', 'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION', 'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY', 'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY', 'ORGANIZATION_TYPE', 'FONDKAPREMONT_MODE', 'HOUSETYPE_MODE', 'WALLSMATERIAL_MODE', 'EMERGENCYSTATE_MODE', 'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_12', 'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15', 'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_21']
o_heads = ["CNT_CHILDREN", "CNT_FAM_MEMBERS", "HOUR_APPR_PROCESS_START", "OBS_30_CNT_SOCIAL_CIRCLE", "DEF_30_CNT_SOCIAL_CIRCLE", "OBS_60_CNT_SOCIAL_CIRCLE", "DEF_60_CNT_SOCIAL_CIRCLE", "AMT_REQ_CREDIT_BUREAU_HOUR", "AMT_REQ_CREDIT_BUREAU_DAY", "AMT_REQ_CREDIT_BUREAU_WEEK", "AMT_REQ_CREDIT_BUREAU_MON", "AMT_REQ_CREDIT_BUREAU_QRT", "AMT_REQ_CREDIT_BUREAU_YEAR"]
m_heads = [head for head in app_train.columns.values if head not in n_heads and head not in o_heads]

## <u>Categorical Variables</u>

In [8]:
df = app_train[n_heads + o_heads].copy()

In [9]:
df.head()

SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,NAME_TYPE_SUITE,NAME_INCOME_TYPE,NAME_EDUCATION_TYPE,NAME_FAMILY_STATUS,NAME_HOUSING_TYPE,FLAG_MOBIL,FLAG_EMP_PHONE,FLAG_WORK_PHONE,FLAG_CONT_MOBILE,FLAG_PHONE,FLAG_EMAIL,OCCUPATION_TYPE,REGION_RATING_CLIENT,REGION_RATING_CLIENT_W_CITY,WEEKDAY_APPR_PROCESS_START,REG_REGION_NOT_LIVE_REGION,REG_REGION_NOT_WORK_REGION,LIVE_REGION_NOT_WORK_REGION,REG_CITY_NOT_LIVE_CITY,REG_CITY_NOT_WORK_CITY,LIVE_CITY_NOT_WORK_CITY,ORGANIZATION_TYPE,FONDKAPREMONT_MODE,HOUSETYPE_MODE,WALLSMATERIAL_MODE,EMERGENCYSTATE_MODE,FLAG_DOCUMENT_2,FLAG_DOCUMENT_3,FLAG_DOCUMENT_4,FLAG_DOCUMENT_5,FLAG_DOCUMENT_6,FLAG_DOCUMENT_7,FLAG_DOCUMENT_8,FLAG_DOCUMENT_9,FLAG_DOCUMENT_10,FLAG_DOCUMENT_11,FLAG_DOCUMENT_12,FLAG_DOCUMENT_13,FLAG_DOCUMENT_14,FLAG_DOCUMENT_15,FLAG_DOCUMENT_16,FLAG_DOCUMENT_17,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,CNT_CHILDREN,CNT_FAM_MEMBERS,HOUR_APPR_PROCESS_START,OBS_30_CNT_SOCIAL_CIRCLE,DEF_30_CNT_SOCIAL_CIRCLE,OBS_60_CNT_SOCIAL_CIRCLE,DEF_60_CNT_SOCIAL_CIRCLE,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
100002,1,Cash loans,M,N,Y,Unaccompanied,Working,Secondary / secondary special,Single / not married,House / apartment,1,1,0,1,1,0,Laborers,2,2,WEDNESDAY,0,0,0,0,0,0,Business Entity Type 3,reg oper account,block of flats,"Stone, brick",No,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1.0,10,2.0,2.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,1.0
100003,0,Cash loans,F,N,N,Family,State servant,Higher education,Married,House / apartment,1,1,0,1,1,0,Core staff,1,1,MONDAY,0,0,0,0,0,0,School,reg oper account,block of flats,Block,No,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2.0,11,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
100004,0,Revolving loans,M,Y,Y,Unaccompanied,Working,Secondary / secondary special,Single / not married,House / apartment,1,1,1,1,1,0,Laborers,2,2,MONDAY,0,0,0,0,0,0,Government,,,,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1.0,9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
100006,0,Cash loans,F,N,Y,Unaccompanied,Working,Secondary / secondary special,Civil marriage,House / apartment,1,1,0,1,0,0,Laborers,2,2,WEDNESDAY,0,0,0,0,0,0,Business Entity Type 3,,,,,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2.0,17,2.0,0.0,2.0,0.0,,,,,,
100007,0,Cash loans,M,N,Y,Unaccompanied,Working,Secondary / secondary special,Single / not married,House / apartment,1,1,0,1,0,0,Core staff,2,2,THURSDAY,0,0,0,0,1,1,Religion,,,,,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1.0,11,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Dropping columns with less than 40% populated data

In [10]:
result = {
          "header":[],
          "rate":[],
          "des":[]
         }
for key in df.keys():
    if key in skip:
        continue
    rate = df[key].isna().sum() / len(df[key]) * 100
    
    if rate > 60:
        result["header"].append(key)
        result["rate"].append(rate)
        result["des"].append(des[des["Row"] == key]["Description"])

result = pd.DataFrame(result)
result

Unnamed: 0,header,rate,des
0,FONDKAPREMONT_MODE,68.386172,"89 Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor Name: Description, dtype: object"


FONDKAPREMONT_MODE is dropped because 68.39% of its values are missing.
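The missing-rate filter above can also be expressed without an explicit loop. The following is a minimal sketch on a toy frame, not the notebook's own code; the helper name `high_missing_columns` and the toy columns `a` and `b` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def high_missing_columns(frame, threshold=60.0, skip=()):
    """Return the missing share (in percent) of every column above `threshold`."""
    rates = frame.isna().mean() * 100                        # percent missing per column
    rates = rates.drop(labels=list(skip), errors="ignore")   # never report protected columns
    return rates[rates > threshold]


# Toy data (hypothetical): column "b" is 75% missing, column "a" is complete.
toy = pd.DataFrame({
    "TARGET": [0, 1, 0, 0],
    "a": [1, 2, 3, 4],
    "b": [np.nan, np.nan, np.nan, 1.0],
})

to_drop = high_missing_columns(toy, threshold=60.0, skip=["TARGET"])
toy_reduced = toy.drop(columns=to_drop.index)
```

Because `isna().mean()` averages a boolean mask, it yields the same rate as `isna().sum() / len(...)` in the loop.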

In [11]:
n_heads = [head for head in n_heads if head not in result.header.values]
o_heads = [head for head in o_heads if head not in result.header.values]

In [12]:
df = df.drop(result.header.values, axis=1)

### Discriminability: at least 5 pp difference in one category

In [13]:
ID_Payback = app_train[app_train["TARGET"] == 0].index.values
ID_Default = app_train[app_train["TARGET"] == 1].index.values

In [14]:
payback = df.loc[ID_Payback]
default = df.loc[ID_Default]

In [15]:
result = {
    "head" : [],
    "cat" : [],
    "payback" : [],
    "default" : [],
    "diff" : []
}

for head in df[n_heads].columns.values:
    df1 = payback[head].value_counts().rename_axis(head).reset_index(name='payback')
    df2 = default[head].value_counts().rename_axis(head).reset_index(name='default')
    
    df1["payback"] = df1["payback"]/df1["payback"].sum()*100
    df2["default"] = df2["default"]/df2["default"].sum()*100
    
    df_ = df1.merge(df2, how="outer", on=head)
    
    df_["diff"] = (df_["default"]-df_["payback"])
    
    df_ = df_.sort_values("diff", ascending=False)
    
    for diff in df_["diff"]:
        if np.isnan(diff):
            continue
        if diff > 5 or diff < -5:
            row = df_.loc[df_["diff"] == diff]
            cat = row[head][row[head].index[0]]
            
            result["head"].append(head)
            result["cat"].append(cat)
            result["payback"].append(round(row["payback"].values[0],2))
            result["default"].append(round(row["default"].values[0],2))
            result["diff"].append(round(diff,2))

result = pd.DataFrame(result)
result.sort_values("diff", ascending=False)

Unnamed: 0,head,cat,payback,default,diff
2,NAME_INCOME_TYPE,Working,50.78,61.33,10.54
0,CODE_GENDER,M,33.4,42.92,9.53
4,NAME_EDUCATION_TYPE,Secondary / secondary special,70.35,78.65,8.3
11,REG_CITY_NOT_WORK_CITY,1,22.41,30.29,7.88
14,FLAG_DOCUMENT_3,1,70.41,77.79,7.39
6,FLAG_EMP_PHONE,1,81.47,87.95,6.49
9,REGION_RATING_CLIENT,3,15.2,21.62,6.42
10,REGION_RATING_CLIENT_W_CITY,3,13.75,20.15,6.4
8,OCCUPATION_TYPE,Laborers,25.63,31.48,5.85
7,FLAG_EMP_PHONE,0,18.53,12.05,-6.49


Defaults tend to be:
- workers (income type "Working", occupation "Laborers")
- men
- people with secondary education
- people whose work address differs from their home address (commuters, travelers, etc.)
- people who provided document 3

Paybacks tend to be:
- women
- academics
- people whose work address matches their home address
- people who did not provide document 3
- pensioners
- people working in sectors that are not recorded in the dataset (very specialized sectors, or none because they are pensioners)
- people who did not provide a work phone number (FLAG_EMP_PHONE = 0)

All nominal variables not listed here carry too little information for identifying defaults and are dropped.
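The 5 pp criterion compares, per category, the share among paybacks with the share among defaults. A compact way to compute these shares is sketched below with `pd.crosstab`; the helper name `share_differences` and the toy data are assumptions for illustration and are not taken from the notebook.

```python
import pandas as pd


def share_differences(frame, column, target="TARGET"):
    """Per-category share (in percent) among paybacks and defaults, plus the difference in pp."""
    # normalize="columns" turns each target group into shares that sum to 1
    shares = pd.crosstab(frame[column], frame[target], normalize="columns") * 100
    shares = shares.rename(columns={0: "payback", 1: "default"})
    shares["diff"] = shares["default"] - shares["payback"]
    return shares


# Toy data (hypothetical): 10 paybacks and 10 defaults.
toy = pd.DataFrame({
    "TARGET": [0] * 10 + [1] * 10,
    "CODE_GENDER": ["F"] * 7 + ["M"] * 3 + ["F"] * 5 + ["M"] * 5,
})

shares = share_differences(toy, "CODE_GENDER")
keep = (shares["diff"].abs() > 5).any()   # True if any category differs by more than 5 pp
```

One difference from the merge-based loop: a category that occurs in only one group gets a share of 0 here instead of `NaN`, so it is compared rather than skipped.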

In [16]:
keep = list(result["head"].unique())
remove = [head for head in n_heads if head not in keep + skip]

In [17]:
n_heads = [head for head in n_heads if head in df.columns.values]

In [18]:
df = df.drop(remove, axis=1)

### Building correlation clusters

#### Nominal data

In [19]:
def contingency_corr(df, head_1, head_2):
    
    data = pd.crosstab(df[head_1], df[head_2], margins = False)
    
    sum_cols = data.sum(axis=0)
    sum_rows = data.sum(axis=1)
    sum_all = data.sum().sum()
    
    n = data.to_numpy()
    n = n.reshape(1,-1).squeeze()
    
    e = []
    for row in sum_rows:
        for col in sum_cols:
            e.append(row * col / sum_all)
    e = np.array(e)
    
    chi2 = (np.subtract(n, e)**2 / e).sum()
    
    K = np.sqrt(chi2/(sum_all+chi2))
    
    i = min(data.shape)
    i = (i-1)/i
    i = np.sqrt(i)
    
    K_ = K / i
    
    return K_
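The function computes the corrected contingency coefficient: with the chi-squared statistic of the cross table, `K = sqrt(chi2 / (n + chi2))` is divided by its maximum `sqrt((k - 1) / k)`, where `k` is the smaller table dimension. As a cross-check, the same quantity can be sketched with `scipy.stats.chi2_contingency`; the function name `corrected_contingency` and the toy data are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats


def corrected_contingency(frame, col_1, col_2):
    """Corrected contingency coefficient in [0, 1] for two categorical columns."""
    table = pd.crosstab(frame[col_1], frame[col_2])
    # correction=False disables the Yates continuity correction for 2x2 tables,
    # so chi2 is the plain Pearson statistic used in the manual computation.
    chi2 = stats.chi2_contingency(table, correction=False)[0]
    n = table.to_numpy().sum()
    k = min(table.shape)
    K = np.sqrt(chi2 / (n + chi2))
    K_max = np.sqrt((k - 1) / k)
    return K / K_max


# Toy data (hypothetical): "y" is fully determined by "x", "z" is only weakly related.
toy = pd.DataFrame({
    "x": ["a", "a", "b", "b", "a", "b"],
    "y": ["u", "u", "v", "v", "u", "v"],
    "z": ["p", "q", "p", "q", "q", "p"],
})

perfect = corrected_contingency(toy, "x", "y")
weak = corrected_contingency(toy, "x", "z")
```

The correction by `K_max` matters because the uncorrected coefficient cannot reach 1 and its upper bound depends on the table size, which would make values for variables with different numbers of categories hard to compare.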

In [20]:
n_heads = [head for head in df.columns.values if head in n_heads]
result = {}
for head1 in n_heads:
    row = []
    for head2 in n_heads:
        corr = contingency_corr(df, head1, head2)
        row.append(corr)
    result[head1] = row
c = pd.DataFrame(result, index=n_heads) * 100

Contingency coefficients of the individual variables (in percent)

In [21]:
c

Unnamed: 0,TARGET,CODE_GENDER,NAME_INCOME_TYPE,NAME_EDUCATION_TYPE,FLAG_EMP_PHONE,OCCUPATION_TYPE,REGION_RATING_CLIENT,REGION_RATING_CLIENT_W_CITY,REG_CITY_NOT_WORK_CITY,ORGANIZATION_TYPE,FLAG_DOCUMENT_3
TARGET,100.0,7.727091,9.010698,8.12828,6.496004,11.489928,8.316753,8.60769,7.202348,10.203796,6.265363
CODE_GENDER,7.727091,100.0,20.501998,3.253923,21.969022,60.845086,2.344752,2.38931,19.276634,39.184487,12.321069
NAME_INCOME_TYPE,9.010698,20.501998,100.0,22.760297,99.987941,36.23248,23.044339,21.988003,35.379443,79.683528,35.168159
NAME_EDUCATION_TYPE,8.12828,3.253923,22.760297,100.0,19.544416,44.871758,12.334248,12.276476,4.260959,27.875122,8.628719
FLAG_EMP_PHONE,6.496004,21.969022,99.987941,19.544416,100.0,1.336271,5.360049,5.654068,35.127768,99.993393,34.110302
OCCUPATION_TYPE,11.489928,60.845086,36.23248,44.871758,1.336271,100.0,9.864538,9.893453,16.222021,80.159933,19.113187
REGION_RATING_CLIENT,8.316753,2.344752,23.044339,12.334248,5.360049,9.864538,100.0,98.485788,4.608254,16.101935,12.821528
REGION_RATING_CLIENT_W_CITY,8.60769,2.38931,21.988003,12.276476,5.654068,9.893453,98.485788,100.0,5.193054,16.019063,12.638246
REG_CITY_NOT_WORK_CITY,7.202348,19.276634,35.379443,4.260959,35.127768,16.222021,4.608254,5.193054,100.0,38.386727,7.945939
ORGANIZATION_TYPE,10.203796,39.184487,79.683528,27.875122,99.993393,80.159933,16.101935,16.019063,38.386727,100.0,37.462472


In [22]:
families = []
for i, row in c.iterrows():
    r = row[row > 80]
    if len(r) > 1 and set(r.index) not in families:
        families.append(set(r.index))

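# Drop every family that is a proper subset of another family.
# A new list is built because removing items while iterating skips elements.
families = [A for A in families if not any(A < B for B in families)]
families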

[{'FLAG_EMP_PHONE', 'NAME_INCOME_TYPE', 'ORGANIZATION_TYPE'},
 {'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY'},
 {'FLAG_EMP_PHONE', 'OCCUPATION_TYPE', 'ORGANIZATION_TYPE'}]
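Each family collects all variables that exceed the threshold in one row of the coefficient matrix, so two members of a family need not be strongly related to each other. The sketch below reproduces the grouping on a toy matrix; the function name `correlation_families` and the matrix values are assumptions for illustration.

```python
import pandas as pd


def correlation_families(corr, threshold=80.0):
    """Group variables whose coefficient with a row variable (in percent) exceeds `threshold`."""
    families = []
    for _, row in corr.iterrows():
        members = set(row[row > threshold].index)
        if len(members) > 1 and members not in families:
            families.append(members)
    # Keep only families that are not a proper subset of another family.
    return [a for a in families if not any(a < b for b in families)]


# Toy matrix (hypothetical): A is close to B and C, but B and C are only weakly related.
names = ["A", "B", "C", "D"]
toy = pd.DataFrame(
    [[100, 95, 85, 10],
     [95, 100, 20, 10],
     [85, 20, 100, 10],
     [10, 10, 10, 100]],
    index=names, columns=names,
)

families = correlation_families(toy)
```

Here B and C end up in the same family through A even though their own coefficient is only 20, which mirrors how OCCUPATION_TYPE and FLAG_EMP_PHONE share a family through ORGANIZATION_TYPE.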

In [23]:
result = {
          "family":[],
          "head":[],
          "r2":[],
          "na":[],
          "rate":[]
         }

for i, family in enumerate(families):
    headers = list(family)
    
    result["family"].append("")
    result["head"].append("")
    result["r2"].append("")
    result["na"].append("")
    result["rate"].append("")
    
    for head in headers:
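        d = df[["TARGET"] + [head]].copy()
        # Encode categories as integers; pd.factorize maps missing values to -1,
        # so the missing share computed below is always 0 for these columns.
        d.loc[:,head], cats = pd.factorize(d.loc[:,head])
        na = d[head].isna().sum() / len(d) * 100
        d = d.dropna()
        y = d[["TARGET"]]
        x = d[[head]]
        
        # Note: model.score returns the mean accuracy of the classifier, not R².
        model = LogisticRegression().fit(x, y.values.ravel())
        r2 = model.score(x,y)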
        
        result["family"].append(i)
        result["head"].append(head)
        result["r2"].append(round(r2,5))
        result["na"].append(round(na,2))
        result["rate"].append(r2/na)

result = pd.DataFrame(result)
result

Warning output (once per fitted variable): division by zero in result["rate"].append(r2/na), because na is 0.


Unnamed: 0,family,head,r2,na,rate
0,,,,,
1,0.0,NAME_INCOME_TYPE,0.91927,0.0,inf
2,0.0,ORGANIZATION_TYPE,0.91927,0.0,inf
3,0.0,FLAG_EMP_PHONE,0.91927,0.0,inf
4,,,,,
5,1.0,REGION_RATING_CLIENT,0.91927,0.0,inf
6,1.0,REGION_RATING_CLIENT_W_CITY,0.91927,0.0,inf
7,,,,,
8,2.0,OCCUPATION_TYPE,0.91927,0.0,inf
9,2.0,ORGANIZATION_TYPE,0.91927,0.0,inf


Three correlation clusters emerge whose variables have a contingency coefficient above 80%. One variable is dropped from cluster 1: both of its variables reflect the rating of the region, so the information is duplicated. The other variables are kept, because despite their high correlation they are not causally the same thing. Otherwise, a laborer (tends to default) who provides a private phone number (tends to pay back) could no longer be identified by the model.
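The `r2` column in the table above comes from `LogisticRegression.score`, which returns mean accuracy. An identical value for every variable is consistent with a model that always predicts the majority class, although whether this is what happens in the notebook is an assumption. The sketch below illustrates the effect on synthetic data and shows ROC AUC as one possible alternative measure; all data and names in it are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic, imbalanced target with roughly 8% positives (hypothetical data).
y = (rng.random(5000) < 0.08).astype(int)
# A single feature that carries no information about y.
x = rng.normal(size=(5000, 1))

model = LogisticRegression().fit(x, y)

accuracy = model.score(x, y)          # mean accuracy, not R²
majority_share = (y == 0).mean()      # accuracy of always predicting class 0
auc = roc_auc_score(y, model.predict_proba(x)[:, 1])
```

With such an imbalance, accuracy equals the majority share regardless of the feature, so it cannot rank the variables of a cluster, whereas a threshold-free measure such as AUC can differ between features.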

In [24]:
keep = ["NAME_INCOME_TYPE", "FLAG_EMP_PHONE", "ORGANIZATION_TYPE", "OCCUPATION_TYPE", "REGION_RATING_CLIENT"]

In [25]:
remove = [head for head in result["head"].values if head not in [""] + keep]

In [26]:
remove

['REGION_RATING_CLIENT_W_CITY']

In [27]:
df = df.drop(remove, axis=1)

#### Ordinal data

In [28]:
c = df[o_heads].corr(method='spearman') * 100
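Spearman's rank correlation is the Pearson correlation of the ranked values, which is why it suits ordinal counts. A minimal sketch on toy data (the column names and values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Toy ordinal data (hypothetical values).
toy = pd.DataFrame({
    "children": [0, 1, 2, 0, 3, 1],
    "family_members": [2, 3, 4, 1, 5, 2],
    "hour": [10, 9, 17, 11, 12, 8],
})

spearman = toy.corr(method="spearman") * 100
# Same result: rank every column (ties get the average rank), then use Pearson.
pearson_on_ranks = toy.rank().corr(method="pearson") * 100
```

Both matrices agree on complete data; with missing values, `corr` excludes them pairwise, so ranking the full columns beforehand can give slightly different numbers.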

In [29]:
families = []
for i, row in c.iterrows():
    r = row[row > 80]
    if len(r) > 1 and set(r.index) not in families:
        families.append(set(r.index))

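# Drop every family that is a proper subset of another family.
# A new list is built because removing items while iterating skips elements.
families = [A for A in families if not any(A < B for B in families)]
families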

[{'CNT_CHILDREN', 'CNT_FAM_MEMBERS'},
 {'OBS_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE'},
 {'DEF_30_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE'}]

In [30]:
result = {
          "family":[],
          "head":[],
          "r2":[],
          "na":[],
          "rate":[]
         }

for i, family in enumerate(families):
    headers = list(family)
    
    result["family"].append("")
    result["head"].append("")
    result["r2"].append("")
    result["na"].append("")
    result["rate"].append("")
    
    for head in headers:
        d = df[["TARGET"] + [head]].copy()
        na = d[head].isna().sum() / len(d) * 100
        d = d.dropna()
        y = d[["TARGET"]]
        x = d[[head]]
        
        model = LogisticRegression().fit(x, y.values.ravel())
        r2 = round(model.score(x,y),5)
        
        result["family"].append(i)
        result["head"].append(head)
        result["r2"].append(round(r2,5))
        result["na"].append(round(na,2))
        result["rate"].append(r2/na)
    
result = pd.DataFrame(result)
result

Warning output: division by zero in result["rate"].append(r2/na) for CNT_CHILDREN, whose na is 0.


Unnamed: 0,family,head,r2,na,rate
0,,,,,
1,0.0,CNT_FAM_MEMBERS,0.91927,0.0,1413.43
2,0.0,CNT_CHILDREN,0.91927,0.0,inf
3,,,,,
4,1.0,OBS_60_CNT_SOCIAL_CIRCLE,0.91912,0.33,2.76826
5,1.0,OBS_30_CNT_SOCIAL_CIRCLE,0.91912,0.33,2.76826
6,,,,,
7,2.0,DEF_60_CNT_SOCIAL_CIRCLE,0.91912,0.33,2.76826
8,2.0,DEF_30_CNT_SOCIAL_CIRCLE,0.91912,0.33,2.76826


The number of children (CNT_CHILDREN) closely tracks the number of family members (CNT_FAM_MEMBERS). Because CNT_FAM_MEMBERS contains missing values, CNT_CHILDREN is chosen as the cluster representative.

The number of observable defaults in the social circle at the 30-day mark (OBS_30_CNT_SOCIAL_CIRCLE) mostly matches the number at the 60-day mark (OBS_60_CNT_SOCIAL_CIRCLE).

The number of actual defaults in the social circle at the 30-day mark (DEF_30_CNT_SOCIAL_CIRCLE) mostly matches the number at the 60-day mark (DEF_60_CNT_SOCIAL_CIRCLE).

In [31]:
remove = ["CNT_FAM_MEMBERS", "OBS_30_CNT_SOCIAL_CIRCLE", "DEF_30_CNT_SOCIAL_CIRCLE"]

In [32]:
o_heads = [head for head in o_heads if head not in remove]

In [33]:
remove

['CNT_FAM_MEMBERS', 'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE']

In [34]:
df = df.drop(remove, axis=1)

In [35]:
df.head()

SK_ID_CURR,TARGET,CODE_GENDER,NAME_INCOME_TYPE,NAME_EDUCATION_TYPE,FLAG_EMP_PHONE,OCCUPATION_TYPE,REGION_RATING_CLIENT,REG_CITY_NOT_WORK_CITY,ORGANIZATION_TYPE,FLAG_DOCUMENT_3,CNT_CHILDREN,HOUR_APPR_PROCESS_START,OBS_60_CNT_SOCIAL_CIRCLE,DEF_60_CNT_SOCIAL_CIRCLE,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
100002,1,M,Working,Secondary / secondary special,1,Laborers,2,0,Business Entity Type 3,1,0,10,2.0,2.0,0.0,0.0,0.0,0.0,0.0,1.0
100003,0,F,State servant,Higher education,1,Core staff,1,0,School,1,0,11,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
100004,0,M,Working,Secondary / secondary special,1,Laborers,2,0,Government,0,0,9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
100006,0,F,Working,Secondary / secondary special,1,Laborers,2,0,Business Entity Type 3,1,0,17,2.0,0.0,,,,,,
100007,0,M,Working,Secondary / secondary special,1,Core staff,2,1,Religion,0,0,11,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Transforming categories into numeric values

In [36]:
# transform categorical variables into integers
HEADS = [head for head in df.columns.values if df[head].dtype == "object"]

for head in HEADS:
    df[head], cats = pd.factorize(df[head])
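`pd.factorize` assigns integer codes in order of first appearance and encodes missing values as -1. Since the loop above overwrites `cats` on every iteration, the label mapping is lost; the sketch below shows how the codes relate to the labels on a toy series (the values are assumptions for illustration).

```python
import numpy as np
import pandas as pd

# Toy column (hypothetical) with one missing value.
toy = pd.Series(["Laborers", "Core staff", np.nan, "Laborers"])

codes, categories = pd.factorize(toy)
# codes holds one integer per row; the missing value is encoded as -1
# categories holds the distinct labels in order of first appearance

# Map the codes back to labels; -1 becomes None again.
restored = [categories[code] if code >= 0 else None for code in codes]
```

Keeping one `categories` index per column, for example in a dictionary keyed by column name, would make the integer codes interpretable later on.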

### Causality assessment

In [37]:
result = {
    "head":[],
    "des":[]
}

for head in df.columns.values:
    if head in skip:
        continue
    result["head"].append(head)
    result["des"].append(des[des["Row"] == head]["Description"])
    
result = pd.DataFrame(result)
result

Unnamed: 0,head,des
0,CODE_GENDER,"6 Gender of the client Name: Description, dtype: object"
1,NAME_INCOME_TYPE,"15 Clients income type (businessman, working, maternity leave,) Name: Description, dtype: object"
2,NAME_EDUCATION_TYPE,"16 Level of highest education the client achieved Name: Description, dtype: object"
3,FLAG_EMP_PHONE,"26 Did client provide work phone (1=YES, 0=NO) Name: Description, dtype: object"
4,OCCUPATION_TYPE,"31 What kind of occupation does the client have Name: Description, dtype: object"
5,REGION_RATING_CLIENT,"33 Our rating of the region where client lives (1,2,3) Name: Description, dtype: object"
6,REG_CITY_NOT_WORK_CITY,"41 Flag if client's permanent address does not match work address (1=different, 0=same, at city level) Name: Description, dtype: object"
7,ORGANIZATION_TYPE,"43 Type of organization where client works Name: Description, dtype: object"
8,FLAG_DOCUMENT_3,"100 Did client provide document 3 Name: Description, dtype: object"
9,CNT_CHILDREN,"9 Number of children the client has Name: Description, dtype: object"


Creditworthiness does not depend on the time of day at which the application was started, so HOUR_APPR_PROCESS_START is dropped.

In [38]:
remove = ["HOUR_APPR_PROCESS_START"]

In [39]:
df = df.drop(remove, axis=1)

### Adding a prefix for recognizability
A = Application

In [40]:
df = df.add_prefix("A_")

In [41]:
df = df.rename(columns={"A_TARGET": "TARGET"})

### Result

In [42]:
df.head()

SK_ID_CURR,TARGET,A_CODE_GENDER,A_NAME_INCOME_TYPE,A_NAME_EDUCATION_TYPE,A_FLAG_EMP_PHONE,A_OCCUPATION_TYPE,A_REGION_RATING_CLIENT,A_REG_CITY_NOT_WORK_CITY,A_ORGANIZATION_TYPE,A_FLAG_DOCUMENT_3,A_CNT_CHILDREN,A_OBS_60_CNT_SOCIAL_CIRCLE,A_DEF_60_CNT_SOCIAL_CIRCLE,A_AMT_REQ_CREDIT_BUREAU_HOUR,A_AMT_REQ_CREDIT_BUREAU_DAY,A_AMT_REQ_CREDIT_BUREAU_WEEK,A_AMT_REQ_CREDIT_BUREAU_MON,A_AMT_REQ_CREDIT_BUREAU_QRT,A_AMT_REQ_CREDIT_BUREAU_YEAR
100002,1,0,0,0,1,0,2,0,0,1,0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,1.0
100003,0,1,1,1,1,1,1,0,1,1,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
100004,0,0,0,0,1,0,2,0,2,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
100006,0,1,0,0,1,0,2,0,0,1,0,2.0,0.0,,,,,,
100007,0,0,0,0,1,1,2,1,3,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [43]:
cats = df

## <u>Metric Variables</u>

In [44]:
df = app_train[["TARGET"] + m_heads].copy()

In [45]:
df.head()

SK_ID_CURR,TARGET,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,REGION_POPULATION_RELATIVE,DAYS_BIRTH,DAYS_EMPLOYED,DAYS_REGISTRATION,DAYS_ID_PUBLISH,OWN_CAR_AGE,EXT_SOURCE_1,EXT_SOURCE_2,EXT_SOURCE_3,APARTMENTS_AVG,BASEMENTAREA_AVG,YEARS_BEGINEXPLUATATION_AVG,YEARS_BUILD_AVG,COMMONAREA_AVG,ELEVATORS_AVG,ENTRANCES_AVG,FLOORSMAX_AVG,FLOORSMIN_AVG,LANDAREA_AVG,LIVINGAPARTMENTS_AVG,LIVINGAREA_AVG,NONLIVINGAPARTMENTS_AVG,NONLIVINGAREA_AVG,APARTMENTS_MODE,BASEMENTAREA_MODE,YEARS_BEGINEXPLUATATION_MODE,YEARS_BUILD_MODE,COMMONAREA_MODE,ELEVATORS_MODE,ENTRANCES_MODE,FLOORSMAX_MODE,FLOORSMIN_MODE,LANDAREA_MODE,LIVINGAPARTMENTS_MODE,LIVINGAREA_MODE,NONLIVINGAPARTMENTS_MODE,NONLIVINGAREA_MODE,APARTMENTS_MEDI,BASEMENTAREA_MEDI,YEARS_BEGINEXPLUATATION_MEDI,YEARS_BUILD_MEDI,COMMONAREA_MEDI,ELEVATORS_MEDI,ENTRANCES_MEDI,FLOORSMAX_MEDI,FLOORSMIN_MEDI,LANDAREA_MEDI,LIVINGAPARTMENTS_MEDI,LIVINGAREA_MEDI,NONLIVINGAPARTMENTS_MEDI,NONLIVINGAREA_MEDI,TOTALAREA_MODE,DAYS_LAST_PHONE_CHANGE
100002,1,202500.0,406597.5,24700.5,351000.0,0.018801,-9461,-637,-3648.0,-2120,,0.083037,0.262949,0.139376,0.0247,0.0369,0.9722,0.6192,0.0143,0.0,0.069,0.0833,0.125,0.0369,0.0202,0.019,0.0,0.0,0.0252,0.0383,0.9722,0.6341,0.0144,0.0,0.069,0.0833,0.125,0.0377,0.022,0.0198,0.0,0.0,0.025,0.0369,0.9722,0.6243,0.0144,0.0,0.069,0.0833,0.125,0.0375,0.0205,0.0193,0.0,0.0,0.0149,-1134.0
100003,0,270000.0,1293502.5,35698.5,1129500.0,0.003541,-16765,-1188,-1186.0,-291,,0.311267,0.622246,,0.0959,0.0529,0.9851,0.796,0.0605,0.08,0.0345,0.2917,0.3333,0.013,0.0773,0.0549,0.0039,0.0098,0.0924,0.0538,0.9851,0.804,0.0497,0.0806,0.0345,0.2917,0.3333,0.0128,0.079,0.0554,0.0,0.0,0.0968,0.0529,0.9851,0.7987,0.0608,0.08,0.0345,0.2917,0.3333,0.0132,0.0787,0.0558,0.0039,0.01,0.0714,-828.0
100004,0,67500.0,135000.0,6750.0,135000.0,0.010032,-19046,-225,-4260.0,-2531,26.0,,0.555912,0.729567,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,-815.0
100006,0,135000.0,312682.5,29686.5,297000.0,0.008019,-19005,-3039,-9833.0,-2437,,,0.650442,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,-617.0
100007,0,121500.0,513000.0,21865.5,513000.0,0.028663,-19932,-3038,-4311.0,-3458,,,0.322738,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,-1106.0


### Dropping columns with less than 40% non-missing data

In [46]:
result = {
          "header":[],
          "rate":[],
          "des":[]
         }
for key in df.keys():
    if key in skip:
        continue
    rate = df[key].isna().sum() / len(df[key]) * 100
    if rate > 60:
        result["header"].append(key)
        result["rate"].append(rate)
        result["des"].append(des[des["Row"] == key]["Description"])

result = pd.DataFrame(result)
result

Unnamed: 0,header,rate,des
0,OWN_CAR_AGE,65.99081,"24 Age of client's car Name: Description, dtype: object"
1,YEARS_BUILD_AVG,66.497784,"50 Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor Name: Description, dtype: object"
2,COMMONAREA_AVG,69.872297,"51 Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor Name: Description, dtype: object"
3,FLOORSMIN_AVG,67.84863,"55 Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor Name: Description, dtype: object"
4,LIVINGAPARTMENTS_AVG,68.354953,"57 Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor Name: Description, dtype: object"
5,NONLIVINGAPARTMENTS_AVG,69.432963,"59 Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor Name: Description, dtype: object"
6,YEARS_BUILD_MODE,66.497784,"64 Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor Name: Description, dtype: object"
7,COMMONAREA_MODE,69.872297,"65 Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor Name: Description, dtype: object"
8,FLOORSMIN_MODE,67.84863,"69 Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor Name: Description, dtype: object"
9,LIVINGAPARTMENTS_MODE,68.354953,"71 Normalized information about building where the client lives, What is average (_AVG suffix), modus (_MODE suffix), median (_MEDI suffix) apartment size, common area, living area, age of building, number of elevators, number of entrances, state of the building, number of floor Name: Description, dtype: object"
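
The `des` column above shows the full `Series` representation (index label, `Name: Description, dtype: object`) because a boolean filter on `des` returns a `Series` rather than a string. A minimal sketch of a scalar lookup follows, assuming every `Row` value is unique within the filtered description table; the toy `des` frame and the name `lookup` are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy stand-in for the filtered column description table (illustrative only).
des = pd.DataFrame(
    {
        "Row": ["OWN_CAR_AGE", "AMT_CREDIT"],
        "Description": ["Age of client's car", "Credit amount of the loan"],
    },
    index=[24, 11],
)

# Map column name -> description text; assumes "Row" values are unique.
lookup = des.set_index("Row")["Description"]

# .get returns the plain string, or the default for unknown columns.
text = lookup.get("OWN_CAR_AGE", "")
unknown = lookup.get("NOT_A_COLUMN", "")
print(text)
```

Storing `text` instead of the filtered `Series` would keep the result table free of the index and dtype residue.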


In [47]:
df = df.drop(result.header.values, axis=1)

16 variables with more than 60% missing data are dropped.
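
The same filter can be written without an explicit loop. The sketch below is one possible vectorised form on toy data; the helper name `columns_above_missing_share` and the toy columns are assumptions made for illustration, and `skip` stands in for the list of the same name defined earlier in the notebook.

```python
import numpy as np
import pandas as pd


def columns_above_missing_share(frame, threshold=60.0, skip=()):
    """Return the missing share (in percent) of columns exceeding threshold."""
    share = frame.isna().mean() * 100          # per-column missing share in %
    share = share.drop(labels=list(skip), errors="ignore")
    return share[share > threshold]


# Toy data: one column is 80% missing, one 40%, one complete.
toy = pd.DataFrame(
    {
        "TARGET": [0, 1, 0, 0, 1],
        "mostly_missing": [np.nan, np.nan, np.nan, np.nan, 1.0],
        "half_missing": [1.0, np.nan, 2.0, np.nan, 3.0],
        "complete": [1, 2, 3, 4, 5],
    }
)

too_sparse = columns_above_missing_share(toy, skip=["TARGET"])
reduced = toy.drop(columns=too_sparse.index)
print(too_sparse)
```

Only `mostly_missing` exceeds the 60% threshold here, so it is the only column removed from `reduced`.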

### Forming correlation clusters

In [48]:
c = df.corr(method='pearson') * 100

In [49]:
c

Unnamed: 0,TARGET,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,REGION_POPULATION_RELATIVE,DAYS_BIRTH,DAYS_EMPLOYED,DAYS_REGISTRATION,DAYS_ID_PUBLISH,EXT_SOURCE_1,EXT_SOURCE_2,EXT_SOURCE_3,APARTMENTS_AVG,BASEMENTAREA_AVG,YEARS_BEGINEXPLUATATION_AVG,ELEVATORS_AVG,ENTRANCES_AVG,FLOORSMAX_AVG,LANDAREA_AVG,LIVINGAREA_AVG,NONLIVINGAREA_AVG,APARTMENTS_MODE,BASEMENTAREA_MODE,YEARS_BEGINEXPLUATATION_MODE,ELEVATORS_MODE,ENTRANCES_MODE,FLOORSMAX_MODE,LANDAREA_MODE,LIVINGAREA_MODE,NONLIVINGAREA_MODE,APARTMENTS_MEDI,BASEMENTAREA_MEDI,YEARS_BEGINEXPLUATATION_MEDI,ELEVATORS_MEDI,ENTRANCES_MEDI,FLOORSMAX_MEDI,LANDAREA_MEDI,LIVINGAREA_MEDI,NONLIVINGAREA_MEDI,TOTALAREA_MODE,DAYS_LAST_PHONE_CHANGE
TARGET,100.0,-0.398187,-3.036929,-1.281656,-3.964528,-3.722715,7.823931,-4.493166,4.197486,5.145717,-15.531713,-16.047167,-17.89187,-2.949756,-2.274574,-0.972767,-3.419879,-1.917218,-4.400337,-1.088482,-3.299712,-1.357807,-2.728387,-1.995228,-0.903645,-3.213117,-1.738742,-4.322626,-1.01741,-3.068462,-1.271054,-2.918376,-2.208126,-0.99931,-3.386288,-1.902476,-4.376792,-1.125583,-3.273928,-1.333672,-3.259555,5.521848
AMT_INCOME_TOTAL,-0.398187,100.0,15.687027,19.165743,15.961006,7.47957,2.726087,-6.42234,2.780542,0.850624,2.623237,6.092464,-3.022905,3.450054,1.730277,0.56581,4.505278,0.539441,6.017114,-0.159823,3.997567,7.460417,2.999424,1.28213,0.528382,4.103182,0.202701,5.767546,-0.367382,3.491465,6.177807,3.379773,1.638131,0.563881,4.41597,0.478697,5.968199,-0.189161,3.926142,7.084449,4.198466,-1.858512
AMT_CREDIT,-3.036929,15.687027,100.0,77.0138,98.696831,9.973788,-5.543595,-6.683834,0.962133,-0.657477,16.84289,13.122794,4.351626,6.043923,3.922616,0.624876,8.063502,1.492928,10.329577,0.621797,7.214592,3.788452,5.307205,3.121269,0.480385,7.473989,0.936122,10.041809,0.253212,6.414174,3.239041,5.868221,3.728099,0.576509,7.909438,1.369207,10.277029,0.541497,7.085962,3.582943,7.281803,-7.370111
AMT_ANNUITY,-1.281656,19.165743,77.0138,100.0,77.510927,11.842909,0.944547,-10.433186,3.851425,1.126805,11.939838,12.5804,3.075182,7.621259,4.450711,1.32981,10.143934,1.474534,13.017365,0.861073,8.965913,4.959269,6.640083,3.444344,1.255473,9.31311,0.736217,12.63016,0.362104,7.942566,4.14177,7.398706,4.245173,1.293984,9.955854,1.327609,12.917901,0.771076,8.813429,4.700866,9.041456,-6.374689
AMT_GOODS_PRICE,-3.964528,15.961006,98.696831,77.510927,100.0,10.351971,-5.344231,-6.484199,1.156498,-0.926681,17.550222,13.93668,4.771684,6.491767,4.398154,0.724476,8.373558,1.876357,10.851246,1.329653,7.730695,4.191172,5.752268,3.581102,0.579881,7.797752,1.333479,10.553212,0.937892,6.932426,3.6403,6.318678,4.198048,0.683868,8.226581,1.758526,10.793597,1.255026,7.602761,3.987395,7.75269,-7.631286
REGION_POPULATION_RELATIVE,-3.722715,7.47957,9.973788,11.842909,10.351971,100.0,-2.958228,-0.397979,-5.381964,-0.399337,9.999729,19.892417,-0.600113,20.594209,9.842337,-0.668261,28.068507,3.684005,32.265151,-5.115989,21.349113,7.481561,17.502863,6.577913,-0.683724,25.153927,1.598336,30.399203,-6.055819,18.093152,4.96955,20.138043,9.407531,-0.671069,27.43904,3.362828,31.831898,-5.254729,20.93472,6.606027,20.214519,-4.401321
DAYS_BIRTH,7.823931,2.726087,-5.543595,0.944547,-5.344231,-2.958228,100.0,-61.58642,33.191208,27.269066,-60.060997,-9.199587,-20.54776,0.477888,-0.439976,0.074389,-0.14475,-1.028123,0.163364,0.301341,-0.033354,0.57124,0.49592,-0.43191,0.071388,-0.088877,-0.937417,0.129378,0.379983,-0.002027,0.553791,0.509188,-0.434478,0.078726,-0.129852,-0.987594,0.179054,0.334058,0.009954,0.632317,0.132948,8.293907
DAYS_EMPLOYED,-4.493166,-6.42234,-6.683834,-10.433186,-6.484199,-0.397979,-61.58642,100.0,-21.024178,-27.237804,28.98478,-2.07671,11.343448,-1.635863,-0.138204,0.862909,-0.979156,0.40177,-1.596957,-0.960876,-1.24381,-1.285785,-1.463241,-0.024291,0.863706,-0.880208,0.49429,-1.506369,-0.875768,-1.084709,-1.172571,-1.623898,-0.153533,0.840453,-0.981703,0.413798,-1.605082,-0.962651,-1.263053,-1.29754,-1.51262,2.303239
DAYS_REGISTRATION,4.197486,2.780542,0.962133,3.851425,1.156498,-5.381964,33.191208,-21.024178,100.0,10.189559,-18.109459,-5.991282,-10.754884,1.318569,-2.072946,1.10347,0.074619,-6.309432,4.970077,0.310937,0.741906,4.901173,1.273227,-2.001059,0.878266,0.272931,-6.008719,4.965687,0.335356,0.745654,4.719655,1.346335,-2.150375,1.103856,0.17259,-6.259933,5.007612,0.296825,0.792554,4.993146,1.949523,5.698257
DAYS_ID_PUBLISH,5.145717,0.850624,-0.657477,1.126805,-0.926681,-0.399337,27.269066,-27.237804,10.189559,100.0,-13.23749,-5.095513,-13.159687,-0.732157,-1.403181,-0.341554,-1.055493,-1.537463,-1.023531,-0.749625,-1.15147,0.230794,-0.712937,-1.319181,-0.259041,-1.063364,-1.554783,-1.030816,-0.802611,-1.209608,0.132127,-0.713933,-1.412888,-0.319344,-1.044818,-1.564717,-1.030744,-0.763417,-1.131529,0.24877,-1.115253,8.85759


In [50]:
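# Collect, for every variable, the set of variables whose Pearson
# correlation with it exceeds 80 (in percent; positive correlations only).
families = []
for i, row in c.iterrows():
    r = row[row > 80]
    if len(r) > 1 and set(r.index) not in families:
        families.append(set(r.index))

# Keep only maximal sets: drop every family that is a proper subset of
# another one. A new list is built instead of calling remove() while
# iterating over the same list, which can skip elements.
families = [A for A in families if not any(A < B for B in families)]
families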

[{'AMT_CREDIT', 'AMT_GOODS_PRICE'},
 {'APARTMENTS_AVG',
  'APARTMENTS_MEDI',
  'APARTMENTS_MODE',
  'ELEVATORS_AVG',
  'ELEVATORS_MEDI',
  'ELEVATORS_MODE',
  'LIVINGAREA_AVG',
  'LIVINGAREA_MEDI',
  'LIVINGAREA_MODE',
  'TOTALAREA_MODE'},
 {'BASEMENTAREA_AVG', 'BASEMENTAREA_MEDI', 'BASEMENTAREA_MODE'},
 {'YEARS_BEGINEXPLUATATION_AVG',
  'YEARS_BEGINEXPLUATATION_MEDI',
  'YEARS_BEGINEXPLUATATION_MODE'},
 {'ENTRANCES_AVG', 'ENTRANCES_MEDI', 'ENTRANCES_MODE'},
 {'FLOORSMAX_AVG', 'FLOORSMAX_MEDI', 'FLOORSMAX_MODE'},
 {'LANDAREA_AVG', 'LANDAREA_MEDI', 'LANDAREA_MODE'},
 {'NONLIVINGAREA_AVG', 'NONLIVINGAREA_MEDI', 'NONLIVINGAREA_MODE'}]
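
The row-wise sets above are not necessarily disjoint: if A correlates with B and B with C, but A not with C, two overlapping families can remain. As a possible alternative (not the method used in this notebook), the sketch below merges such chains into connected clusters with a small union-find. The function name `connected_clusters` and the toy correlation matrix are assumptions for illustration.

```python
import pandas as pd


def connected_clusters(corr_percent, threshold=80.0):
    """Group columns linked, directly or via a chain, by correlations above threshold."""
    columns = list(corr_percent.columns)
    parent = {col: col for col in columns}

    def find(col):
        # Follow parent links to the cluster root, shortening the path on the way.
        while parent[col] != col:
            parent[col] = parent[parent[col]]
            col = parent[col]
        return col

    for a in columns:
        for b in columns:
            if a < b and corr_percent.loc[a, b] > threshold:
                parent[find(a)] = find(b)      # union the two clusters

    clusters = {}
    for col in columns:
        clusters.setdefault(find(col), set()).add(col)
    # Singletons are not clusters in the sense used here.
    return [members for members in clusters.values() if len(members) > 1]


# Toy correlation matrix in percent: A-B, B-C and C-D are linked, E is isolated.
names = ["A", "B", "C", "D", "E"]
toy_corr = pd.DataFrame(
    [
        [100, 90, 40, 30, 5],
        [90, 100, 85, 45, 3],
        [40, 85, 100, 88, 2],
        [30, 45, 88, 100, 1],
        [5, 3, 2, 1, 100],
    ],
    index=names,
    columns=names,
    dtype=float,
)

clusters = connected_clusters(toy_corr)
print(clusters)
```

On this toy matrix the row-wise approach would leave the overlapping sets `{A, B, C}` and `{B, C, D}`, whereas the connected variant returns a single cluster of four variables. Which behaviour is preferable depends on how strictly a cluster should be defined.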

In [51]:
result = {
          "family":[],
          "head":[],
          "r2":[],
          "na":[],
          "rate":[]
         }

for i, family in enumerate(families):
    headers = list(family)
    
    # Empty separator row between clusters in the displayed table
    result["family"].append("")
    result["head"].append("")
    result["r2"].append("")
    result["na"].append("")
    result["rate"].append("")
    
    for head in headers:
        d = df[["TARGET"] + [head]]
        # Share of missing values in percent (computed before dropping them)
        na = d[head].isna().sum() / len(d) * 100
        d = d.dropna()
        y = d[["TARGET"]]
        x = d[[head]]
        
        # Note: LogisticRegression.score returns the mean accuracy, not R²;
        # the name "r2" is kept for consistency with the output below.
        model = LogisticRegression().fit(x, y.values.ravel())
        r2 = round(model.score(x,y),5)
        
        result["family"].append(i)
        result["head"].append(head)
        result["r2"].append(r2)
        result["na"].append(round(na,2))
        # Accuracy per percentage point of missing data; inf when na == 0
        result["rate"].append(r2/na)
    
result = pd.DataFrame(result)
result


Unnamed: 0,family,head,r2,na,rate
0,,,,,
1,0.0,AMT_CREDIT,0.91927,0.0,inf
2,0.0,AMT_GOODS_PRICE,0.91927,0.09,10.1685
3,,,,,
4,1.0,ELEVATORS_MODE,0.931,53.3,0.0174685
5,1.0,LIVINGAREA_MODE,0.93005,50.19,0.0185294
6,1.0,APARTMENTS_MODE,0.93041,50.75,0.0183333
7,1.0,TOTALAREA_MODE,0.9301,48.27,0.0192693
8,1.0,ELEVATORS_MEDI,0.931,53.3,0.0174685
9,1.0,LIVINGAREA_AVG,0.93005,50.19,0.0185294
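
Because `LogisticRegression.score` returns mean accuracy, the `r2` values on a strongly imbalanced target may be worth comparing against the share of the majority class. The sketch below uses synthetic data (not the Home Credit data) with an uninformative feature; the 8% positive rate is an assumption chosen for illustration, and ROC AUC is shown as one possible alternative score rather than the notebook's method.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic, imbalanced target: 80 positives out of 1000 (assumed rate).
y = np.array([1] * 80 + [0] * 920)
# A feature that carries no information about y.
x = rng.normal(size=(1000, 1))

model = LogisticRegression().fit(x, y)

accuracy = model.score(x, y)                  # mean accuracy
baseline = max(y.mean(), 1 - y.mean())        # always predict the majority class
auc = roc_auc_score(y, model.predict_proba(x)[:, 1])

print(f"accuracy={accuracy:.3f} baseline={baseline:.3f} auc={auc:.3f}")
```

In this setup the accuracy equals the majority-class baseline even though the feature is pure noise, while the AUC stays close to 0.5. A ranking-based score of this kind may therefore separate the variables within a cluster more clearly than accuracy.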


Eight clusters form, mostly because the average, median, and mode of the same feature are all present. In these cases the average (`_AVG`) is always chosen as the representative. Cluster #1 is best represented by TOTALAREA_MODE, as its score (the `rate` column) indicates. Cluster #0 is represented by AMT_CREDIT, because AMT_GOODS_PRICE contains missing data.
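
Dividing the score by `na` yields `inf` for a column without missing values, as for AMT_CREDIT above. One hedged alternative is to weight the score by the share of available rows instead, which stays finite. The sketch below reuses four rows from the table above; the column name `weighted` and the weighting itself are assumptions for illustration, not the notebook's method.

```python
import pandas as pd

# Four rows copied from the result table above.
scores = pd.DataFrame(
    {
        "family": [0, 0, 1, 1],
        "head": ["AMT_CREDIT", "AMT_GOODS_PRICE", "TOTALAREA_MODE", "ELEVATORS_MODE"],
        "r2": [0.91927, 0.91927, 0.9301, 0.931],
        "na": [0.0, 0.09, 48.27, 53.3],
    }
)

# Score weighted by the share of non-missing rows (finite even for na == 0).
scores["weighted"] = scores["r2"] * (1 - scores["na"] / 100)

# Pick the row with the highest weighted score within each family.
best_rows = scores.groupby("family")["weighted"].idxmax()
representatives = scores.loc[best_rows, "head"].tolist()
print(representatives)
```

For these rows the weighting selects the same representatives as the text above, AMT_CREDIT and TOTALAREA_MODE.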

In [52]:
keep = ["AMT_CREDIT", "TOTALAREA_MODE", "BASEMENTAREA_AVG", "YEARS_BEGINEXPLUATATION_AVG", "ENTRANCES_AVG", "FLOORSMAX_AVG", "LANDAREA_AVG", "NONLIVINGAREA_AVG"]

In [53]:
remove = [head for head in list(result["head"]) if head not in keep + [""]]

In [54]:
remove

['AMT_GOODS_PRICE',
 'ELEVATORS_MODE',
 'LIVINGAREA_MODE',
 'APARTMENTS_MODE',
 'ELEVATORS_MEDI',
 'LIVINGAREA_AVG',
 'LIVINGAREA_MEDI',
 'APARTMENTS_AVG',
 'APARTMENTS_MEDI',
 'ELEVATORS_AVG',
 'BASEMENTAREA_MEDI',
 'BASEMENTAREA_MODE',
 'YEARS_BEGINEXPLUATATION_MODE',
 'YEARS_BEGINEXPLUATATION_MEDI',
 'ENTRANCES_MODE',
 'ENTRANCES_MEDI',
 'FLOORSMAX_MODE',
 'FLOORSMAX_MEDI',
 'LANDAREA_MEDI',
 'LANDAREA_MODE',
 'NONLIVINGAREA_MEDI',
 'NONLIVINGAREA_MODE']
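
The removal list can also be derived directly from the cluster sets, which makes it easy to check that every cluster keeps exactly one representative. The following is a small sketch using two of the clusters shown above; the names `clustered` and `kept_per_family` are illustrative.

```python
# Two of the clusters from the output above, and their chosen representatives.
families = [
    {"AMT_CREDIT", "AMT_GOODS_PRICE"},
    {"BASEMENTAREA_AVG", "BASEMENTAREA_MEDI", "BASEMENTAREA_MODE"},
]
keep = ["AMT_CREDIT", "BASEMENTAREA_AVG"]

# All variables that belong to any cluster.
clustered = set().union(*families)

# Everything in a cluster that is not a representative gets removed.
remove = sorted(clustered - set(keep))

# Sanity check: how many representatives survive per cluster.
kept_per_family = [len(family & set(keep)) for family in families]
print(remove, kept_per_family)
```

A count other than 1 in `kept_per_family` would point to a cluster with no representative or with more than one.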

In [55]:
df = df.drop(remove, axis=1)

In [56]:
df.head()

SK_ID_CURR,TARGET,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,REGION_POPULATION_RELATIVE,DAYS_BIRTH,DAYS_EMPLOYED,DAYS_REGISTRATION,DAYS_ID_PUBLISH,EXT_SOURCE_1,EXT_SOURCE_2,EXT_SOURCE_3,BASEMENTAREA_AVG,YEARS_BEGINEXPLUATATION_AVG,ENTRANCES_AVG,FLOORSMAX_AVG,LANDAREA_AVG,NONLIVINGAREA_AVG,TOTALAREA_MODE,DAYS_LAST_PHONE_CHANGE
100002,1,202500.0,406597.5,24700.5,0.018801,-9461,-637,-3648.0,-2120,0.083037,0.262949,0.139376,0.0369,0.9722,0.069,0.0833,0.0369,0.0,0.0149,-1134.0
100003,0,270000.0,1293502.5,35698.5,0.003541,-16765,-1188,-1186.0,-291,0.311267,0.622246,,0.0529,0.9851,0.0345,0.2917,0.013,0.0098,0.0714,-828.0
100004,0,67500.0,135000.0,6750.0,0.010032,-19046,-225,-4260.0,-2531,,0.555912,0.729567,,,,,,,,-815.0
100006,0,135000.0,312682.5,29686.5,0.008019,-19005,-3039,-9833.0,-2437,,0.650442,,,,,,,,,-617.0
100007,0,121500.0,513000.0,21865.5,0.028663,-19932,-3038,-4311.0,-3458,,0.322738,,,,,,,,,-1106.0


### Assessing causality

In [57]:
result = {
    "head":[],
    "des":[]
}

for head in df.columns.values:
    if head in skip:
        continue
    result["head"].append(head)
    result["des"].append(des[des["Row"] == head]["Description"])
    
result = pd.DataFrame(result)
result

Unnamed: 0,head,des
0,AMT_INCOME_TOTAL,"10 Income of the client Name: Description, dtype: object"
1,AMT_CREDIT,"11 Credit amount of the loan Name: Description, dtype: object"
2,AMT_ANNUITY,"12 Loan annuity Name: Description, dtype: object"
3,REGION_POPULATION_RELATIVE,"19 Normalized population of region where client lives (higher number means the client lives in more populated region) Name: Description, dtype: object"
4,DAYS_BIRTH,"20 Client's age in days at the time of application Name: Description, dtype: object"
5,DAYS_EMPLOYED,"21 How many days before the application the person started current employment Name: Description, dtype: object"
6,DAYS_REGISTRATION,"22 How many days before the application did client change his registration Name: Description, dtype: object"
7,DAYS_ID_PUBLISH,"23 How many days before the application did client change the identity document with which he applied for the loan Name: Description, dtype: object"
8,EXT_SOURCE_1,"44 Normalized score from external data source Name: Description, dtype: object"
9,EXT_SOURCE_2,"45 Normalized score from external data source Name: Description, dtype: object"


Creditworthiness does not depend on information such as the number of floors, the number of entrances, or phone changes.

In [58]:
remove = ["ENTRANCES_AVG", "FLOORSMAX_AVG", "DAYS_LAST_PHONE_CHANGE"]

In [59]:
df = df.drop(remove, axis=1)

In [60]:
df = df.drop(["TARGET"], axis=1)

### Feature creation

Credit amount / income

In [61]:
df["CREDIT/INCOME"] = df["AMT_CREDIT"] / df["AMT_INCOME_TOTAL"]
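
A ratio feature of this kind produces `inf` if the denominator is zero. The sketch below shows one way to guard against that, under the assumption that a zero income should be treated as missing. The first two rows reuse values from the tables above; the third row (ID 999999) is hypothetical and only there to exercise the guard.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "AMT_CREDIT": [406597.5, 135000.0, 50000.0],
        "AMT_INCOME_TOTAL": [202500.0, 67500.0, 0.0],  # last value is hypothetical
    },
    index=pd.Index([100002, 100004, 999999], name="SK_ID_CURR"),
)

# Treat a zero income as missing so the ratio becomes NaN instead of inf.
income = toy["AMT_INCOME_TOTAL"].replace(0, np.nan)
toy["CREDIT/INCOME"] = toy["AMT_CREDIT"] / income
print(toy)
```

Whether zero incomes occur in the dataset at all is not checked here; the guard simply makes the behaviour explicit.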

### Result

In [62]:
df = df.add_prefix("A_")

In [63]:
df.head()

SK_ID_CURR,A_AMT_INCOME_TOTAL,A_AMT_CREDIT,A_AMT_ANNUITY,A_REGION_POPULATION_RELATIVE,A_DAYS_BIRTH,A_DAYS_EMPLOYED,A_DAYS_REGISTRATION,A_DAYS_ID_PUBLISH,A_EXT_SOURCE_1,A_EXT_SOURCE_2,A_EXT_SOURCE_3,A_BASEMENTAREA_AVG,A_YEARS_BEGINEXPLUATATION_AVG,A_LANDAREA_AVG,A_NONLIVINGAREA_AVG,A_TOTALAREA_MODE,A_CREDIT/INCOME
100002,202500.0,406597.5,24700.5,0.018801,-9461,-637,-3648.0,-2120,0.083037,0.262949,0.139376,0.0369,0.9722,0.0369,0.0,0.0149,2.007889
100003,270000.0,1293502.5,35698.5,0.003541,-16765,-1188,-1186.0,-291,0.311267,0.622246,,0.0529,0.9851,0.013,0.0098,0.0714,4.79075
100004,67500.0,135000.0,6750.0,0.010032,-19046,-225,-4260.0,-2531,,0.555912,0.729567,,,,,,2.0
100006,135000.0,312682.5,29686.5,0.008019,-19005,-3039,-9833.0,-2437,,0.650442,,,,,,,2.316167
100007,121500.0,513000.0,21865.5,0.028663,-19932,-3038,-4311.0,-3458,,0.322738,,,,,,,4.222222


### Saving the results

In [64]:
df = pd.merge(cats, df, left_index=True, right_index=True)

In [65]:
df.to_csv(DATASET_DIR / "2. Datenaufbereitung" / "app_train.csv")
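
`pd.merge` performs an inner join by default, so rows missing from either frame would be dropped silently, and `to_csv` fails if the target folder does not exist. The sketch below shows one way to make both assumptions explicit on toy frames. It writes to a temporary directory instead of `DATASET_DIR`, and the toy frames `cats` and `metric` are stand-ins for the two prepared tables.

```python
import tempfile
from pathlib import Path

import pandas as pd

ids = pd.Index([100002, 100003, 100004], name="SK_ID_CURR")
cats = pd.DataFrame({"A_CODE_GENDER": [0, 1, 0]}, index=ids)
metric = pd.DataFrame({"A_AMT_CREDIT": [406597.5, 1293502.5, 135000.0]}, index=ids)

# validate raises if an index value occurs more than once on either side.
merged = pd.merge(
    cats, metric, left_index=True, right_index=True,
    how="inner", validate="one_to_one",
)
rows_lost = len(cats) - len(merged)   # 0 when every ID is present in both frames

# Create the output folder if needed, then write and re-read the file.
out_dir = Path(tempfile.mkdtemp()) / "2. Datenaufbereitung"
out_dir.mkdir(parents=True, exist_ok=True)
out_file = out_dir / "app_train.csv"
merged.to_csv(out_file)

reloaded = pd.read_csv(out_file, index_col="SK_ID_CURR")
print(rows_lost, reloaded.shape)
```

Reading the file back with `index_col="SK_ID_CURR"` restores the same index that the notebook uses when loading `application_train.csv`.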