### Data Analysis

This document serves as an introduction to the dataset. The goal is to understand the individual attributes and to get a feel for how they interact. In addition, faulty data should be identified.

**Preparation: importing the required libraries and loading the data**

In [70]:
from pathlib import Path
from scipy import stats

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns

np.set_printoptions(suppress=True)

pd.options.display.max_columns = None
pd.options.display.max_rows = None
pd.options.display.max_colwidth = None

from IPython.display import display

In [71]:
path1 = Path(r"A:\Workspace\Python\Masterarbeit\Kaggle Home Credit Datensatz")
path2 = Path(r"C:\Users\rober\Documents\Workspace\Python\Masterarbeit\Kaggle Home Credit Datensatz")

if path1.is_dir():
    DATASET_DIR = path1
else:
    DATASET_DIR = path2
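
The cell above falls back to the second path without checking that it exists. A minimal sketch of a stricter variant follows; the helper name `first_existing` is hypothetical, and a temporary directory stands in for the real dataset folders.

```python
from pathlib import Path
import tempfile

def first_existing(candidates):
    """Return the first candidate that is an existing directory, else raise."""
    for candidate in candidates:
        if candidate.is_dir():
            return candidate
    raise FileNotFoundError(f"None of the candidate directories exist: {candidates}")

# Demonstration with a temporary directory instead of the real dataset paths
with tempfile.TemporaryDirectory() as tmp:
    candidates = [Path(tmp) / "does_not_exist", Path(tmp)]
    chosen = first_existing(candidates)
    print(chosen)
```

In the notebook this could be called as `first_existing([path1, path2])`, so a missing dataset folder fails here rather than later in `pd.read_csv`.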

In [72]:
app_test = pd.read_csv(DATASET_DIR / "application_test.csv")
app_train = pd.read_csv(DATASET_DIR / "application_train.csv")

#bureau = pd.read_csv(DATASET_DIR / "bureau.csv")
#bureau_balance = pd.read_csv(DATASET_DIR / "bureau_balance.csv")
#previous_application = pd.read_csv(DATASET_DIR / "previous_application.csv")
#credit_card_balance = pd.read_csv(DATASET_DIR / "credit_card_balance.csv")
#installments_payments = pd.read_csv(DATASET_DIR / "installments_payments.csv")
#pcb = pd.read_csv(DATASET_DIR / "POS_CASH_balance.csv")

description = pd.read_csv(DATASET_DIR / "HomeCredit_columns_description.csv", encoding="latin", index_col=0)

In [None]:
app_train.head()

In [None]:
app_train["NAME_CONTRACT_TYPE"].value_counts()
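
One way to approach the stated goal of finding faulty data is a quick per-column summary of missing values and distinct values. The following is only a sketch on a small invented frame; the helper name `quality_report` and the toy values are assumptions for illustration and do not come from the notebook.

```python
import numpy as np
import pandas as pd

def quality_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Summarise missing values and the number of distinct values per column."""
    return pd.DataFrame({
        "missing_share": frame.isna().mean(),
        "n_unique": frame.nunique(dropna=True),
        "dtype": frame.dtypes.astype(str),
    }).sort_values("missing_share", ascending=False)

# Toy data standing in for app_train (values are invented)
toy = pd.DataFrame({
    "AMT_CREDIT": [1000.0, 2000.0, np.nan, 4000.0],
    "CODE_GENDER": ["F", "M", "F", "F"],
    "OWN_CAR_AGE": [np.nan, np.nan, 26.0, np.nan],
})

report = quality_report(toy)
print(report)
```

Applied to `app_train`, columns with a high `missing_share` or a single distinct value would be the first candidates for a closer look.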

**Descriptions of the attributes, grouped by the given CSV files**

In [82]:
description.loc[description['Table']=="application_{train|test}.csv", "Row":"Special"]

| Unnamed: 0 | Row | Description | Special |
|---|---|---|---|
| 1 | SK_ID_CURR | ID of loan in our sample | |
| 2 | TARGET | Target variable (1 - client with payment difficulties: he/she had late payment more than X days on at least one of the first Y installments of the loan in our sample, 0 - all other cases) | |
| 5 | NAME_CONTRACT_TYPE | Identification if loan is cash or revolving | |
| 6 | CODE_GENDER | Gender of the client | |
| 7 | FLAG_OWN_CAR | Flag if the client owns a car | |
| 8 | FLAG_OWN_REALTY | Flag if client owns a house or flat | |
| 9 | CNT_CHILDREN | Number of children the client has | |
| 10 | AMT_INCOME_TOTAL | Income of the client | |
| 11 | AMT_CREDIT | Credit amount of the loan | |
| 12 | AMT_ANNUITY | Loan annuity | |


In [None]:
description.loc[description['Table']=="bureau.csv", "Row":"Special"]

In [None]:
description.loc[description['Table']=="bureau_balance.csv", "Row":"Special"]

In [None]:
description.loc[description['Table']=="POS_CASH_balance.csv", "Row":"Special"]

In [None]:
description.loc[description['Table']=="credit_card_balance.csv", "Row":"Special"]

In [None]:
description.loc[description['Table']=="previous_application.csv", "Row":"Special"]

In [None]:
description.loc[description['Table']=="installments_payments.csv", "Row":"Special"]

**Recoding the numeric categories**

In [73]:
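# Assign the result back: replace(..., inplace=True) on a selected column
# is not guaranteed to modify app_train itself in newer pandas versions
app_train["TARGET"] = app_train["TARGET"].replace(
    {
        0: "Payback",
        1: "Default"
    }
)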

## Examining Application-Train & Application-Test

**Creating subsets**

* payback = borrowers who repaid their loan
* default = borrowers who did not repay their loan
* n = nominal data
* m = metric data
* md = discrete metric data
* mdp = discrete metric data of the borrowers who repaid their loan
* mdd = discrete metric data of the borrowers who did not repay their loan
* ms = continuous metric data
* msp = continuous metric data of the borrowers who repaid their loan
* msd = continuous metric data of the borrowers who did not repay their loan


In [74]:
n_heads = ['TARGET', 'NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL', 'OCCUPATION_TYPE', 'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY', 'WEEKDAY_APPR_PROCESS_START', 'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION', 'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY', 'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY', 'ORGANIZATION_TYPE', 'FONDKAPREMONT_MODE', 'HOUSETYPE_MODE', 'WALLSMATERIAL_MODE', 'EMERGENCYSTATE_MODE', 'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_12', 'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15', 'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_21']
m_heads = ['CNT_CHILDREN','AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'REGION_POPULATION_RELATIVE', 'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH', 'OWN_CAR_AGE', 'CNT_FAM_MEMBERS', 'HOUR_APPR_PROCESS_START', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'APARTMENTS_AVG', 'BASEMENTAREA_AVG', 'YEARS_BEGINEXPLUATATION_AVG', 'YEARS_BUILD_AVG', 'COMMONAREA_AVG', 'ELEVATORS_AVG', 'ENTRANCES_AVG', 'FLOORSMAX_AVG', 'FLOORSMIN_AVG', 'LANDAREA_AVG', 'LIVINGAPARTMENTS_AVG', 'LIVINGAREA_AVG', 'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAREA_AVG', 'APARTMENTS_MODE', 'BASEMENTAREA_MODE', 'YEARS_BEGINEXPLUATATION_MODE', 'YEARS_BUILD_MODE', 'COMMONAREA_MODE', 'ELEVATORS_MODE', 'ENTRANCES_MODE', 'FLOORSMAX_MODE', 'FLOORSMIN_MODE', 'LANDAREA_MODE', 'LIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE', 'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAREA_MODE', 'APARTMENTS_MEDI', 'BASEMENTAREA_MEDI', 'YEARS_BEGINEXPLUATATION_MEDI', 'YEARS_BUILD_MEDI', 'COMMONAREA_MEDI', 'ELEVATORS_MEDI', 'ENTRANCES_MEDI', 'FLOORSMAX_MEDI', 'FLOORSMIN_MEDI', 'LANDAREA_MEDI', 'LIVINGAPARTMENTS_MEDI', 'LIVINGAREA_MEDI', 'NONLIVINGAPARTMENTS_MEDI', 'NONLIVINGAREA_MEDI', 'TOTALAREA_MODE', 'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE', 'DAYS_LAST_PHONE_CHANGE', 'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR']
md_heads = ['CNT_CHILDREN', "CNT_FAM_MEMBERS","HOUR_APPR_PROCESS_START", "OBS_30_CNT_SOCIAL_CIRCLE","DEF_30_CNT_SOCIAL_CIRCLE", "OBS_60_CNT_SOCIAL_CIRCLE", "DEF_60_CNT_SOCIAL_CIRCLE","AMT_REQ_CREDIT_BUREAU_HOUR","AMT_REQ_CREDIT_BUREAU_DAY","AMT_REQ_CREDIT_BUREAU_WEEK","AMT_REQ_CREDIT_BUREAU_MON","AMT_REQ_CREDIT_BUREAU_QRT","AMT_REQ_CREDIT_BUREAU_YEAR"]
ms_heads = [head for head in m_heads if head not in md_heads]

In [75]:
payback = app_train[app_train["TARGET"] == "Payback"]
default = app_train[app_train["TARGET"] == "Default"]
m = app_train[m_heads]
n = app_train[n_heads]

md = m[md_heads]
mdp = md[app_train["TARGET"] == "Payback"]
mdd = md[app_train["TARGET"] == "Default"]

ms = m[ms_heads]
msp = ms[app_train["TARGET"] == "Payback"]
msd = ms[app_train["TARGET"] == "Default"]
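
The subsets above repeat the same `TARGET` comparison several times. As a sketch of an alternative, the boolean masks can be built once and reused with `.loc`; the toy frame and the shortened column lists below are invented for illustration.

```python
import pandas as pd

# Toy data standing in for app_train (values are invented)
toy = pd.DataFrame({
    "TARGET": ["Payback", "Default", "Payback", "Payback"],
    "CNT_CHILDREN": [0, 2, 1, 0],
    "AMT_CREDIT": [1000.0, 2000.0, 1500.0, 4000.0],
})

md_heads = ["CNT_CHILDREN"]  # discrete metric columns
ms_heads = ["AMT_CREDIT"]    # continuous metric columns

# Build each boolean mask once and reuse it for every subset
is_payback = toy["TARGET"] == "Payback"
is_default = toy["TARGET"] == "Default"

subsets = {
    "mdp": toy.loc[is_payback, md_heads],
    "mdd": toy.loc[is_default, md_heads],
    "msp": toy.loc[is_payback, ms_heads],
    "msd": toy.loc[is_default, ms_heads],
}

for name, frame in subsets.items():
    print(name, frame.shape)
```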

In [None]:
app_train.head()

**Helper function for drawing a pie chart**

In [None]:
# Draw one pie chart per (labels, sizes, title) tuple, side by side
def draw_piechart(arguments):

    # squeeze=False always returns a 2D array of axes, so a single plot and
    # multiple plots are handled by the same loop (no try/except needed)
    fig, axes = plt.subplots(1, len(arguments), squeeze=False)

    for (labels, sizes, title), ax in zip(arguments, axes[0]):
        # normalize=False: the sizes are drawn as given and must not sum to more than 1
        ax.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=90, normalize=False, labeldistance=1.05)
        ax.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
        ax.set_title(title)

    plt.show()

## Examining the borrowers - Payback vs. Default

**Ratio of Payback to Default in the dataset**

In [None]:
pb = len(payback.index)
df = len(default.index)
N = pb + df

labels = "Payback", "Default"
sizes = [pb/N,df/N]
title = "Payback vs Default"

arguments = [(labels, sizes, title)]

draw_piechart(arguments)
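
The same proportions can be read directly from the `TARGET` column with `value_counts(normalize=True)`. A short sketch with an invented 9:1 split (the real ratio has to be taken from `app_train`):

```python
import pandas as pd

# Invented class distribution for illustration only
target = pd.Series(["Payback"] * 9 + ["Default"] * 1, name="TARGET")

# normalize=True returns relative frequencies instead of counts
shares = target.value_counts(normalize=True)
print(shares)
```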

**Ratio of Payback to Default by gender**

In [None]:
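# Look the counts up by label instead of relying on the order of value_counts()
counts = payback["CODE_GENDER"].value_counts()
F, M = counts["F"], counts["M"]
N = counts.sum()  # also includes the remaining third category, as before

labels1 = "Female", "Male"
sizes1 = [F/N,M/N]
title1 = "Gender Payback"

counts = default["CODE_GENDER"].value_counts()
F, M = counts["F"], counts["M"]

N = F+M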

labels2 = "Female", "Male"
sizes2 = [F/N,M/N]
title2 = "Gender Default"

arguments = [(labels1, sizes1, title1),(labels2, sizes2, title2)]

draw_piechart(arguments)

**Ratio of Payback to Default by education**

In [None]:
count = payback["NAME_EDUCATION_TYPE"].value_counts()

low_sec = count["Lower secondary"]
sec = count["Secondary / secondary special"]
inc_high = count["Incomplete higher"]
high = count["Higher education"]
acad = count["Academic degree"]

N = len(payback["NAME_EDUCATION_TYPE"])

labels1 = "Secondary ", "Higher education", "Incomplete higher", "Lower secondary", "Academic degree"
sizes1 = [sec/N, high/N, inc_high/N, low_sec/N, acad/N]
title1 = "Education Payback"

count = default["NAME_EDUCATION_TYPE"].value_counts()

low_sec = count["Lower secondary"]
sec = count["Secondary / secondary special"]
inc_high = count["Incomplete higher"]
high = count["Higher education"]
acad = count["Academic degree"]

N = len(default["NAME_EDUCATION_TYPE"])

labels2 = "Secondary ", "Higher education", "Incomplete higher", "Lower secondary", "Academic degree"
sizes2 = [sec/N, high/N, inc_high/N, low_sec/N, acad/N]
title2 = "Education Default"


arguments = [(labels1, sizes1, title1),(labels2, sizes2, title2)]

draw_piechart(arguments)

**Ratio of Payback to Default - complete list of the categorical variables**

In [None]:
for head in n.columns.values:
    
    df1 = payback[head].value_counts().rename_axis(head).reset_index(name='payback').head()
    df2 = default[head].value_counts().rename_axis(head).reset_index(name='default').head()
    
    df1["payback"] = df1["payback"]/df1["payback"].sum()*100
    df2["default"] = df2["default"]/df2["default"].sum()*100
    
    df = df1.merge(df2, how="outer", on=head)
    
    df["change"] = (df["default"]-df["payback"])
    
    df = df.sort_values("change", ascending=False)
    
    display(df)
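
The loop above normalises the counts over the five most frequent categories only. As a sketch of an alternative, `pd.crosstab` with `normalize="columns"` computes the shares over all categories in one step. The helper name `compare_shares` and the toy data are assumptions for illustration.

```python
import pandas as pd

def compare_shares(frame, column, target="TARGET"):
    """Percentage share of each category within Payback and within Default."""
    shares = pd.crosstab(frame[column], frame[target], normalize="columns") * 100
    shares = shares.rename(columns={"Payback": "payback", "Default": "default"})
    # Positive values: the category is more common among defaulting borrowers
    shares["change"] = shares["default"] - shares["payback"]
    return shares.sort_values("change", ascending=False)

# Toy data standing in for app_train (values are invented)
toy = pd.DataFrame({
    "TARGET": ["Payback"] * 4 + ["Default"] * 2,
    "CODE_GENDER": ["F", "F", "F", "M", "M", "F"],
})

result = compare_shares(toy, "CODE_GENDER")
print(result)
```

Note that `pd.crosstab` leaves out rows where the attribute is missing, so the shares refer to the non-missing entries.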

**Ratio of Payback to Default - top 10 largest differences in share**

In [None]:
top10 = []

for head in n.columns.values:
    
    df1 = payback[head].value_counts().rename_axis(head).reset_index(name='payback').head()
    df2 = default[head].value_counts().rename_axis(head).reset_index(name='default').head()
    
    df1["payback"] = df1["payback"]/df1["payback"].sum()*100
    df2["default"] = df2["default"]/df2["default"].sum()*100
    
    df = df1.merge(df2, how="outer", on=head)
    
    df["change"] = (df["default"]-df["payback"])
    
    df = df.sort_values("change", ascending=False)
    
    for element in df["change"]:
        if np.isnan(element):
            continue
        if len(top10) < 10:
            t = (head, df[df["change"]==element][head].values[0], element)    
            top10.append(t)
        else:
            if element > top10[-1][-1]:
                top10.pop(-1)
                t = (head, df[df["change"]==element][head].values[0], element)
                top10.append(t)
            
        top10 = sorted(top10, key=lambda value: value[2], reverse=True)

df = pd.DataFrame(top10)
display(df)

A borrower who defaults is typically:

    - working
    - male
    - educated to secondary level
    - registered with a separate work address
    - ...
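
The top-10 ranking can also be obtained without maintaining a sorted list by hand. The sketch below collects the per-category differences of all attributes in one Series and lets `nlargest` pick the biggest ones. The helper name `share_change` and the toy data are invented, and the shares are computed over all categories rather than the five most frequent.

```python
import pandas as pd

def share_change(frame, column, target="TARGET"):
    """Difference (in percentage points) between Default and Payback shares."""
    shares = pd.crosstab(frame[column], frame[target], normalize="columns") * 100
    return shares["Default"] - shares["Payback"]

# Toy data standing in for app_train (values are invented)
toy = pd.DataFrame({
    "TARGET": ["Payback"] * 4 + ["Default"] * 2,
    "CODE_GENDER": ["F", "F", "F", "M", "M", "F"],
    "FLAG_OWN_CAR": ["Y", "N", "N", "N", "N", "N"],
})

# One Series per attribute, stacked into an (attribute, category) MultiIndex
changes = pd.concat(
    {col: share_change(toy, col) for col in ["CODE_GENDER", "FLAG_OWN_CAR"]},
    names=["attribute", "category"],
)

top = changes.nlargest(10)
print(top)
```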

**Pearson Correlation**

**Correlation of all features of the training dataset**

In [None]:
plt.figure(figsize=(120,100))
sns.heatmap(m.corr(method="pearson"), annot=True, cmap=plt.cm.Reds)
plt.show()

In [None]:
x = m.corr(method="pearson")
t = x[x > 0.85]["APARTMENTS_AVG"].dropna()
list(t.index)

In [46]:
x = m.corr(method="pearson")
for head in m_heads:
    t = x[x > 0.85][head].dropna()
    corrs = list(t.index)
    corrs.remove(head)
    if len(corrs) > 0:
        print("{} : {}".format(head, corrs))

CNT_CHILDREN : ['CNT_FAM_MEMBERS']
AMT_CREDIT : ['AMT_GOODS_PRICE']
AMT_GOODS_PRICE : ['AMT_CREDIT']
CNT_FAM_MEMBERS : ['CNT_CHILDREN']
APARTMENTS_AVG : ['LIVINGAPARTMENTS_AVG', 'LIVINGAREA_AVG', 'APARTMENTS_MODE', 'LIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE', 'APARTMENTS_MEDI', 'LIVINGAPARTMENTS_MEDI', 'LIVINGAREA_MEDI', 'TOTALAREA_MODE']
BASEMENTAREA_AVG : ['BASEMENTAREA_MODE', 'BASEMENTAREA_MEDI']
YEARS_BEGINEXPLUATATION_AVG : ['YEARS_BEGINEXPLUATATION_MODE', 'YEARS_BEGINEXPLUATATION_MEDI']
YEARS_BUILD_AVG : ['YEARS_BUILD_MODE', 'YEARS_BUILD_MEDI']
COMMONAREA_AVG : ['COMMONAREA_MODE', 'COMMONAREA_MEDI']
ELEVATORS_AVG : ['LIVINGAREA_AVG', 'ELEVATORS_MODE', 'ELEVATORS_MEDI', 'LIVINGAREA_MEDI']
ENTRANCES_AVG : ['ENTRANCES_MODE', 'ENTRANCES_MEDI']
FLOORSMAX_AVG : ['FLOORSMAX_MODE', 'FLOORSMAX_MEDI']
FLOORSMIN_AVG : ['FLOORSMIN_MODE', 'FLOORSMIN_MEDI']
LANDAREA_AVG : ['LANDAREA_MODE', 'LANDAREA_MEDI']
LIVINGAPARTMENTS_AVG : ['APARTMENTS_AVG', 'LIVINGAREA_AVG', 'APARTMENTS_MODE', 'LIVINGAPA
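
The output above lists every highly correlated pair twice, once in each direction. A possible way to report each pair only once is to keep just the upper triangle of the correlation matrix. This is a sketch under assumptions: the helper name `correlated_pairs` is hypothetical and the data is randomly generated.

```python
import numpy as np
import pandas as pd

def correlated_pairs(frame, threshold=0.85, method="pearson"):
    """Return each column pair with a correlation above threshold exactly once."""
    corr = frame.corr(method=method)
    # Keep only the strict upper triangle so (a, b) and (b, a) are not both listed
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    # Masked entries (NaN) never pass the comparison, so they drop out here
    return pairs[pairs > threshold].sort_values(ascending=False)

# Randomly generated toy data: A_COPY is almost identical to A
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "A": a,
    "A_COPY": a + rng.normal(scale=0.01, size=200),
    "NOISE": rng.normal(size=200),
})

pairs = correlated_pairs(toy)
print(pairs)
```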

In [55]:
mode = [head for head in m_heads if "MODE" in head]
medi = [head for head in m_heads if "MEDI" in head]
t = app_train.drop(mode + medi, axis=1)
t.head()

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,NAME_TYPE_SUITE,NAME_INCOME_TYPE,NAME_EDUCATION_TYPE,NAME_FAMILY_STATUS,NAME_HOUSING_TYPE,REGION_POPULATION_RELATIVE,DAYS_BIRTH,DAYS_EMPLOYED,DAYS_REGISTRATION,DAYS_ID_PUBLISH,OWN_CAR_AGE,FLAG_MOBIL,FLAG_EMP_PHONE,FLAG_WORK_PHONE,FLAG_CONT_MOBILE,FLAG_PHONE,FLAG_EMAIL,OCCUPATION_TYPE,CNT_FAM_MEMBERS,REGION_RATING_CLIENT,REGION_RATING_CLIENT_W_CITY,WEEKDAY_APPR_PROCESS_START,HOUR_APPR_PROCESS_START,REG_REGION_NOT_LIVE_REGION,REG_REGION_NOT_WORK_REGION,LIVE_REGION_NOT_WORK_REGION,REG_CITY_NOT_LIVE_CITY,REG_CITY_NOT_WORK_CITY,LIVE_CITY_NOT_WORK_CITY,ORGANIZATION_TYPE,EXT_SOURCE_1,EXT_SOURCE_2,EXT_SOURCE_3,APARTMENTS_AVG,BASEMENTAREA_AVG,YEARS_BEGINEXPLUATATION_AVG,YEARS_BUILD_AVG,COMMONAREA_AVG,ELEVATORS_AVG,ENTRANCES_AVG,FLOORSMAX_AVG,FLOORSMIN_AVG,LANDAREA_AVG,LIVINGAPARTMENTS_AVG,LIVINGAREA_AVG,NONLIVINGAPARTMENTS_AVG,NONLIVINGAREA_AVG,FONDKAPREMONT_MODE,HOUSETYPE_MODE,WALLSMATERIAL_MODE,EMERGENCYSTATE_MODE,OBS_30_CNT_SOCIAL_CIRCLE,DEF_30_CNT_SOCIAL_CIRCLE,OBS_60_CNT_SOCIAL_CIRCLE,DEF_60_CNT_SOCIAL_CIRCLE,DAYS_LAST_PHONE_CHANGE,FLAG_DOCUMENT_2,FLAG_DOCUMENT_3,FLAG_DOCUMENT_4,FLAG_DOCUMENT_5,FLAG_DOCUMENT_6,FLAG_DOCUMENT_7,FLAG_DOCUMENT_8,FLAG_DOCUMENT_9,FLAG_DOCUMENT_10,FLAG_DOCUMENT_11,FLAG_DOCUMENT_12,FLAG_DOCUMENT_13,FLAG_DOCUMENT_14,FLAG_DOCUMENT_15,FLAG_DOCUMENT_16,FLAG_DOCUMENT_17,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100002,Default,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,351000.0,Unaccompanied,Working,Secondary / secondary special,Single / not married,House / apartment,0.018801,-9461,-637,-3648.0,-2120,,1,1,0,1,1,0,Laborers,1.0,2,2,WEDNESDAY,10,0,0,0,0,0,0,Business Entity Type 3,0.083037,0.262949,0.139376,0.0247,0.0369,0.9722,0.6192,0.0143,0.0,0.069,0.0833,0.125,0.0369,0.0202,0.019,0.0,0.0,reg oper account,block of flats,"Stone, brick",No,2.0,2.0,2.0,2.0,-1134.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
1,100003,Payback,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,1129500.0,Family,State servant,Higher education,Married,House / apartment,0.003541,-16765,-1188,-1186.0,-291,,1,1,0,1,1,0,Core staff,2.0,1,1,MONDAY,11,0,0,0,0,0,0,School,0.311267,0.622246,,0.0959,0.0529,0.9851,0.796,0.0605,0.08,0.0345,0.2917,0.3333,0.013,0.0773,0.0549,0.0039,0.0098,reg oper account,block of flats,Block,No,1.0,0.0,1.0,0.0,-828.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
2,100004,Payback,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,135000.0,Unaccompanied,Working,Secondary / secondary special,Single / not married,House / apartment,0.010032,-19046,-225,-4260.0,-2531,26.0,1,1,1,1,1,0,Laborers,1.0,2,2,MONDAY,9,0,0,0,0,0,0,Government,,0.555912,0.729567,,,,,,,,,,,,,,,,,,,0.0,0.0,0.0,0.0,-815.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3,100006,Payback,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,297000.0,Unaccompanied,Working,Secondary / secondary special,Civil marriage,House / apartment,0.008019,-19005,-3039,-9833.0,-2437,,1,1,0,1,0,0,Laborers,2.0,2,2,WEDNESDAY,17,0,0,0,0,0,0,Business Entity Type 3,,0.650442,,,,,,,,,,,,,,,,,,,,2.0,0.0,2.0,0.0,-617.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,,,,
4,100007,Payback,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5,513000.0,Unaccompanied,Working,Secondary / secondary special,Single / not married,House / apartment,0.028663,-19932,-3038,-4311.0,-3458,,1,1,0,1,0,0,Core staff,1.0,2,2,THURSDAY,11,0,0,0,0,1,1,Religion,,0.322738,,,,,,,,,,,,,,,,,,,,0.0,0.0,0.0,0.0,-1106.0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
plt.figure(figsize=(120,100))
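# numeric_only=True restricts the correlation to the numeric flag and rating columns;
# newer pandas versions raise an error for the string-valued columns otherwise
sns.heatmap(n.corr(method="spearman", numeric_only=True), annot=True, cmap=plt.cm.Reds)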
plt.show()

In [79]:
n.head()


In [81]:
n.corr(method="spearman", numeric_only=True).head()
n.corr(numeric_only=True).index

Index(['FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE',
       'FLAG_PHONE', 'FLAG_EMAIL', 'REGION_RATING_CLIENT',
       'REGION_RATING_CLIENT_W_CITY', 'REG_REGION_NOT_LIVE_REGION',
       'REG_REGION_NOT_WORK_REGION', 'LIVE_REGION_NOT_WORK_REGION',
       'REG_CITY_NOT_LIVE_CITY', 'REG_CITY_NOT_WORK_CITY',
       'LIVE_CITY_NOT_WORK_CITY', 'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3',
       'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6',
       'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_9',
       'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_12',
       'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15',
       'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18',
       'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_21'],
      dtype='object')

In [80]:
x = n.corr(method="spearman", numeric_only=True)
for head in list(x.index):
    t = x[x > 0.85][head].dropna()
    corrs = list(t.index)
    corrs.remove(head)
    if len(corrs) > 0:
        print("{} : {}".format(head, corrs))

REGION_RATING_CLIENT : ['REGION_RATING_CLIENT_W_CITY']
REGION_RATING_CLIENT_W_CITY : ['REGION_RATING_CLIENT']
REG_REGION_NOT_WORK_REGION : ['LIVE_REGION_NOT_WORK_REGION']
LIVE_REGION_NOT_WORK_REGION : ['REG_REGION_NOT_WORK_REGION']


**Correlation of the features of the borrowers who repaid**

In [None]:
plt.figure(figsize=(120,100))
sns.heatmap(payback.corr(numeric_only=True), annot=True, cmap=plt.cm.Reds)
plt.show()

**Correlation of the features of the borrowers who defaulted**

In [None]:
plt.figure(figsize=(120,100))
sns.heatmap(default.corr(numeric_only=True), annot=True, cmap=plt.cm.Reds)
plt.show()

**Numeric values**

In [None]:
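for head in ms.columns.values:
    # Drop missing values explicitly so the histograms do not depend on
    # how the plotting library treats NaN
    plt.hist(ms[head].dropna())
    plt.hist(msp[head].dropna())
    plt.hist(msd[head].dropna())
    plt.legend(labels=["All", "Payback","Default"])
    plt.title(head)
    plt.show()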
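
Each `plt.hist` call above chooses its own bins, so the three histograms of one attribute are not directly comparable. A sketch of one remedy is to compute the bin edges once from all values and reuse them for every group. The data and the default flag below are randomly generated for illustration.

```python
import numpy as np

# Randomly generated stand-in for one continuous attribute
rng = np.random.default_rng(1)
all_values = rng.normal(loc=0.0, scale=1.0, size=1000)
all_values[::50] = np.nan             # sprinkle in some missing values
is_default = rng.random(1000) < 0.1   # invented default flag

clean = ~np.isnan(all_values)

# Derive the bin edges once from all non-missing values ...
edges = np.histogram_bin_edges(all_values[clean], bins=10)

# ... and reuse them for every group so that the bars line up
counts_all, _ = np.histogram(all_values[clean], bins=edges)
counts_payback, _ = np.histogram(all_values[clean & ~is_default], bins=edges)
counts_default, _ = np.histogram(all_values[clean & is_default], bins=edges)

print(edges.round(2))
print(counts_all, counts_payback, counts_default)
```

The same `edges` array can be passed to `plt.hist` through its `bins` argument.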

In [None]:
for head in md.columns.values:
    # Sort by value so that bars and tick labels appear in numeric order
    bins = md[head].value_counts().sort_index()
    bins1 = mdp[head].value_counts().sort_index()
    bins2 = mdd[head].value_counts().sort_index()

    fig, ax = plt.subplots()
    ax.bar(bins.index, bins.values, label='All')
    ax.bar(bins1.index, bins1.values, label='Payback')
    ax.bar(bins2.index, bins2.values, label='Default')

    # Place the ticks at the observed values, which is where the bars are drawn
    ax.set_xticks(bins.index)

    plt.title(head)
    plt.legend()
    plt.show()
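
For the discrete attributes, the counts per value and group can also be collected in one aligned table before plotting. The following sketch uses `pd.crosstab` on invented toy data; value and group combinations that do not occur are filled with 0.

```python
import pandas as pd

# Toy data standing in for app_train (values are invented)
toy = pd.DataFrame({
    "TARGET": ["Payback", "Payback", "Payback", "Default", "Default", "Payback"],
    "CNT_CHILDREN": [0, 0, 1, 0, 3, 1],
})

# Rows: observed values, columns: group
counts = pd.crosstab(toy["CNT_CHILDREN"], toy["TARGET"])
counts["All"] = counts.sum(axis=1)
print(counts)
```

A table of this shape can be drawn as grouped bars with `counts.plot.bar()`, which avoids bars of different groups hiding each other.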