## <u>2. Data Preparation: Point of Sales</u>

This notebook removes the variables of the POS dataset that are irrelevant to the analysis and groups the relevant ones. Unlike the application data, the POS dataset has an N:M relationship: a borrower may have had several previous credits, and each of them contains monthly records. The historical data therefore has to be grouped. To assess creditworthiness, only historical account data from at most the last six months is used.

*Procedure for categorical variables:*
- Grouping of the variables
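To make the row multiplicity concrete, here is a minimal sketch on an invented toy table. Only the column names follow `POS_CASH_balance.csv`; the frame `toy` and all of its values are assumptions for illustration.

```python
import pandas as pd

# Invented toy data: one borrower (SK_ID_CURR) with two previous credits,
# each of which has several monthly snapshots.
toy = pd.DataFrame({
    "SK_ID_PREV": [1, 1, 1, 2, 2],
    "SK_ID_CURR": [100, 100, 100, 100, 100],
    "MONTHS_BALANCE": [-8, -2, -1, -3, -1],
})

# Several rows per previous credit and several credits per borrower:
# this is why the monthly records have to be grouped.
rows_per_credit = toy.groupby("SK_ID_PREV").size()
credits_per_borrower = toy.groupby("SK_ID_CURR")["SK_ID_PREV"].nunique()

# Restrict the history to the last six months before the application.
recent = toy[toy["MONTHS_BALANCE"] >= -6]

print(rows_per_credit)
print(credits_per_borrower)
print(recent)
```

The snapshot at month -8 falls outside the six-month window and is dropped before any aggregation.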

## Initialization

In [1]:
from pathlib import Path
from scipy import stats

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns

np.set_printoptions(suppress=True)

pd.options.display.max_columns = None
pd.options.display.max_rows = None
pd.options.display.max_colwidth = None

from sklearn.linear_model import LogisticRegression

from IPython.display import display, Markdown

In [2]:
path1 = Path(r"A:\Workspace\Python\Masterarbeit\Kaggle Home Credit Datensatz")
path2 = Path(r"C:\Users\rober\Documents\Workspace\Python\Masterarbeit\Kaggle Home Credit Datensatz")

if path1.is_dir():
    DATASET_DIR = path1
else:
    DATASET_DIR = path2
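The fallback above selects `path2` even when neither directory exists, so a wrong setup only surfaces later in `pd.read_csv`. A possible variant is sketched below, assuming that failing early is preferred. `first_existing_dir` is a hypothetical helper and not part of the notebook; the demonstration uses a temporary directory instead of the real dataset paths.

```python
import tempfile
from pathlib import Path


def first_existing_dir(candidates):
    """Return the first candidate that is an existing directory, else raise."""
    for candidate in candidates:
        path = Path(candidate)
        if path.is_dir():
            return path
    raise FileNotFoundError(f"None of the candidate directories exist: {list(candidates)}")


# Demonstration with a temporary directory (hypothetical paths).
with tempfile.TemporaryDirectory() as tmp:
    chosen = first_existing_dir([Path(tmp) / "missing", tmp])
    found = chosen == Path(tmp)

print(found)
```

In the notebook this could be called as `first_existing_dir([path1, path2])`.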

In [3]:
pos = pd.read_csv(DATASET_DIR / "POS_CASH_balance.csv")
description = pd.read_csv(DATASET_DIR / "HomeCredit_columns_description.csv", encoding="latin", index_col=0)

In [4]:
des = description.loc[description['Table']=="POS_CASH_balance.csv", "Row":"Special"]

In [5]:
des

Unnamed: 0,Row,Description,Special
145,SK_ID_PREV,"ID of previous credit in Home Credit related to loan in our sample. (One loan in our sample can have 0,1,2 or more previous loans in Home Credit)",
146,SK_ID_CURR,ID of loan in our sample,
147,MONTHS_BALANCE,"Month of balance relative to application date (-1 means the information to the freshest monthly snapshot, 0 means the information at application - often it will be the same as -1 as many banks are not updating the information to Credit Bureau regularly )",time only relative to the application
148,CNT_INSTALMENT,Term of previous credit (can change over time),
149,CNT_INSTALMENT_FUTURE,Installments left to pay on the previous credit,
150,NAME_CONTRACT_STATUS,Contract status during the month,
151,SK_DPD,DPD (days past due) during the month of previous credit,
152,SK_DPD_DEF,DPD during the month with tolerance (debts with low loan amounts are ignored) of the previous credit,


# Information content:
* Number of monthly installments still to be paid in parallel with the current credit
* Number of months in which payments were past due

In [6]:
pos[(pos["SK_ID_PREV"] == 1682318) & (pos["MONTHS_BALANCE"] >= -12)].sort_values("MONTHS_BALANCE")

Unnamed: 0,SK_ID_PREV,SK_ID_CURR,MONTHS_BALANCE,CNT_INSTALMENT,CNT_INSTALMENT_FUTURE,NAME_CONTRACT_STATUS,SK_DPD,SK_DPD_DEF
8829991,1682318,161674,-10,10.0,10.0,Active,0,0
4037027,1682318,161674,-9,10.0,9.0,Active,0,0
88351,1682318,161674,-8,10.0,8.0,Active,0,0
9240316,1682318,161674,-7,10.0,7.0,Active,0,0
9021153,1682318,161674,-6,10.0,6.0,Active,0,0
8549983,1682318,161674,-5,10.0,5.0,Active,0,0
9232915,1682318,161674,-4,10.0,4.0,Active,0,0
8419944,1682318,161674,-3,10.0,3.0,Active,0,0
6518430,1682318,161674,-2,8.0,0.0,Completed,0,0


In [7]:
pos[(pos["SK_ID_PREV"] == 1487161) & (pos["MONTHS_BALANCE"] >= -12)].sort_values("MONTHS_BALANCE")

Unnamed: 0,SK_ID_PREV,SK_ID_CURR,MONTHS_BALANCE,CNT_INSTALMENT,CNT_INSTALMENT_FUTURE,NAME_CONTRACT_STATUS,SK_DPD,SK_DPD_DEF
1789040,1487161,169489,-12,24.0,15.0,Active,0,0
308815,1487161,169489,-11,24.0,14.0,Active,0,0
3369171,1487161,169489,-10,24.0,13.0,Active,0,0
1741425,1487161,169489,-9,24.0,12.0,Active,0,0
1945738,1487161,169489,-8,24.0,11.0,Active,0,0
4187515,1487161,169489,-7,24.0,10.0,Active,0,0
4461021,1487161,169489,-6,24.0,9.0,Active,0,0
5360357,1487161,169489,-5,24.0,8.0,Active,0,0
1114287,1487161,169489,-4,24.0,7.0,Active,0,0
1702339,1487161,169489,-3,24.0,6.0,Active,0,0


In [8]:
df = pos.copy()

In [9]:
result = pd.DataFrame(index=df.SK_ID_PREV.unique())
result.index.name = "SK_ID_PREV"

In [10]:
X = df[df["MONTHS_BALANCE"] == -1]  # data from the most recent monthly snapshot
X = X[["SK_ID_PREV", "CNT_INSTALMENT_FUTURE"]]
X = X.set_index("SK_ID_PREV")
X.columns = ["CNT_PAYMENTS_LEFT"]
X.head()

Unnamed: 0_level_0,CNT_PAYMENTS_LEFT
SK_ID_PREV,Unnamed: 1_level_1
2373788,12.0
1328586,10.0
1059231,21.0
1627415,6.0
2070143,5.0


In [11]:
result = pd.merge(result, X, how="left", left_index=True, right_index=True)

At the time of the loan application in the application dataset, 12 monthly installments are still outstanding on previous POS credit 2373788. Note that 2373788 is an `SK_ID_PREV`, i.e. it identifies a previous credit, not a borrower.
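Credits without a snapshot at month -1, such as credit 1682318 shown earlier, which was completed at month -2, receive no value from the left merge. The sketch below reproduces this step on invented toy data; the frame `toy` and its values are assumptions.

```python
import pandas as pd

# Invented toy data: credit 1 has a snapshot at month -1, credit 2 ended at month -2.
toy = pd.DataFrame({
    "SK_ID_PREV": [1, 1, 2, 2],
    "MONTHS_BALANCE": [-2, -1, -3, -2],
    "CNT_INSTALMENT_FUTURE": [5.0, 4.0, 1.0, 0.0],
})

# One row per previous credit, as in the notebook.
result = pd.DataFrame(index=pd.Index(toy["SK_ID_PREV"].unique(), name="SK_ID_PREV"))

# Installments left according to the most recent monthly snapshot.
latest = (
    toy.loc[toy["MONTHS_BALANCE"] == -1, ["SK_ID_PREV", "CNT_INSTALMENT_FUTURE"]]
    .set_index("SK_ID_PREV")
    .rename(columns={"CNT_INSTALMENT_FUTURE": "CNT_PAYMENTS_LEFT"})
)

# The left merge keeps every credit; credits without a -1 snapshot get NaN.
result = result.merge(latest, how="left", left_index=True, right_index=True)

# Filling with 0 treats a missing snapshot as "nothing left to pay".
filled = result.fillna(0)
print(filled)
```

Filling with 0 is a modelling choice: it is plausible for completed credits, but it would also hide credits whose latest snapshot is simply missing.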

In [12]:
# Past delinquency: sum of SK_DPD_DEF (days past due) over the last six months

In [13]:
X = df[df["MONTHS_BALANCE"] >= -6][["SK_ID_PREV", "SK_DPD_DEF"]]  # last six months only
X = X.groupby(by="SK_ID_PREV").sum()  # summed days past due per previous credit
X.columns = ["CNT_DPD"]
X.head()

Unnamed: 0_level_0,CNT_DPD
SK_ID_PREV,Unnamed: 1_level_1
1000003,0
1000007,0
1000011,0
1000017,0
1000025,0


In [14]:
result = pd.merge(result, X, how="left", left_index=True, right_index=True)

For previous credit 1000003, no days past due were recorded in the six months (roughly 180 days) before the loan application in the application dataset.
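Summing `SK_DPD_DEF` adds up the monthly days-past-due values; it does not count the months with a delay. The following sketch contrasts the two on invented toy data. The column names `SUM_DPD` and `CNT_MONTHS_DPD` are hypothetical and not used elsewhere in the notebook.

```python
import pandas as pd

# Invented toy data: credit 1 was past due in two of three months, credit 2 never.
toy = pd.DataFrame({
    "SK_ID_PREV": [1, 1, 1, 2, 2],
    "MONTHS_BALANCE": [-3, -2, -1, -2, -1],
    "SK_DPD_DEF": [10, 40, 0, 0, 0],
})

recent = toy[toy["MONTHS_BALANCE"] >= -6]
# Flag each monthly snapshot that shows a delay.
recent = recent.assign(HAS_DPD=recent["SK_DPD_DEF"] > 0)

summary = recent.groupby("SK_ID_PREV").agg(
    SUM_DPD=("SK_DPD_DEF", "sum"),     # what the notebook computes
    CNT_MONTHS_DPD=("HAS_DPD", "sum"),  # alternative: number of delayed months
)
print(summary)
```

Both aggregates are zero for credits that were never past due, so the outputs shown in this notebook are consistent with either reading.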

In [15]:
result = result.fillna(0)

### Result

In [16]:
df = result

In [17]:
df = df.add_prefix("POS_")

In [18]:
df.head()

Unnamed: 0_level_0,POS_CNT_PAYMENTS_LEFT,POS_CNT_DPD
SK_ID_PREV,Unnamed: 1_level_1,Unnamed: 2_level_1
1803195,0.0,0.0
1715348,0.0,0.0
1784872,0.0,0.0
1903291,0.0,0.0
2341044,1.0,0.0


### Saving the values

In [19]:
df.to_csv(DATASET_DIR / "2. Datenaufbereitung" / "pos.csv")
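`to_csv` does not create missing directories, so the call above fails if the subfolder does not exist yet. A minimal sketch of a more defensive save follows, assuming a temporary directory in place of `DATASET_DIR`; the frame `features` and its values are illustrative only.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Illustrative stand-in for the feature table (values are assumptions).
features = pd.DataFrame(
    {"POS_CNT_PAYMENTS_LEFT": [4.0, 0.0], "POS_CNT_DPD": [0.0, 0.0]},
    index=pd.Index([1487161, 1682318], name="SK_ID_PREV"),
)

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "2. Datenaufbereitung"
    out_dir.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    out_file = out_dir / "pos.csv"
    features.to_csv(out_file)
    # Read back with the credit ID as index to confirm the round trip.
    restored = pd.read_csv(out_file, index_col="SK_ID_PREV")

print(restored)
```

Reading the file back with `index_col="SK_ID_PREV"` restores the credit ID as the index, which a later merge on the index would rely on.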

In [20]:
df.loc[1487161]

POS_CNT_PAYMENTS_LEFT    4.0
POS_CNT_DPD              0.0
Name: 1487161, dtype: float64