*This notebook is the second in a series of notebooks that together make up a larger project. The full list of notebooks:*
1. [Understanding the problem](https://www.kaggle.com/timvh2/understanding-the-problem/edit)
2. Data cleaning
3. [Exploratory data analysis](https://www.kaggle.com/timvh2/data-exploratie-versie-2/edit)
4. [Building a model](https://www.kaggle.com/timvh2/building-a-model/edit)
5. [Model evaluation and interpretation](https://www.kaggle.com/timvh2/model-evaluation-and-interpretation/edit)

The data and Home Credit Default competition can be found [here](https://www.kaggle.com/c/home-credit-default-risk).

**Note**: remove the `/edit` suffix from the links above before publishing this notebook. Also, changing the title of a notebook changes its link!

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# Path of the file to read
application_train_file_path = '../input/home-credit-default-risk/application_train.csv'
application_test_file_path = '../input/home-credit-default-risk/application_test.csv'

bureau_file_path = '../input/home-credit-default-risk/bureau.csv'
bureau_balance_file_path = '../input/home-credit-default-risk/bureau_balance.csv'

previous_application_file_path = '../input/home-credit-default-risk/previous_application.csv'
POS_CASH_balance_file_path = '../input/home-credit-default-risk/POS_CASH_balance.csv'
installments_payments_file_path = '../input/home-credit-default-risk/installments_payments.csv'
credit_card_balance_file_path = '../input/home-credit-default-risk/credit_card_balance.csv'

# Read the file
application_train_data = pd.read_csv(application_train_file_path, index_col = "SK_ID_CURR")
application_test_data = pd.read_csv(application_test_file_path, index_col = "SK_ID_CURR")

# bureau_data = pd.read_csv(bureau_file_path)
# bureau_balance_data = pd.read_csv(bureau_balance_file_path)

# previous_application_data = pd.read_csv(previous_application_file_path)
# POS_CASH_balance_data = pd.read_csv(POS_CASH_balance_file_path)
# installments_payments_data = pd.read_csv(installments_payments_file_path)
# credit_card_balance_data = pd.read_csv(credit_card_balance_file_path)

# Work on copies, so that the dataframes as read from disk stay untouched

X_train_full = application_train_data.copy()
X_test = application_test_data.copy()

# Define target
y = X_train_full.TARGET

# Drop target from training set
X_train_full = X_train_full.drop(['TARGET'], axis = 1)

In [None]:
# X_train_full

# 2. Data cleaning

In the previous chapter we examined the problem and saw how the data are structured. We now take a first look at the data and prepare them for exploratory data analysis. To this end, we take the following steps:

1. We examine the distribution of each variable. We check that the distribution is consistent with what we expect of the variable, and deal with any anomalies.
2. We examine the missing values in the data, and either impute or drop them.
3. We examine the categorical variables, and apply either one-hot encoding or label encoding to them.

These steps will be put together into a pipeline with which we will preprocess the data.
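
To make the difference between the two encodings in step 3 concrete, here is a minimal sketch on a toy dataframe. The rows are made up for illustration; only the column names are borrowed from the application data. It is meant as an illustration of the idea, not as the final preprocessing code.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical toy rows (assumption: values chosen for illustration only)
toy = pd.DataFrame({
    "FLAG_OWN_CAR": ["Y", "N", "Y"],
    "NAME_INCOME_TYPE": ["Working", "Pensioner", "State servant"],
})

# Label (ordinal) encoding: one integer code per category, a single output column
ordinal = OrdinalEncoder().fit_transform(toy[["FLAG_OWN_CAR"]])

# One-hot encoding: one 0/1 indicator column per category
onehot_encoder = OneHotEncoder(handle_unknown="ignore")
onehot = onehot_encoder.fit_transform(toy[["NAME_INCOME_TYPE"]]).toarray()

# With handle_unknown="ignore", a category not seen during fit becomes an all-zero row
unseen = pd.DataFrame({"NAME_INCOME_TYPE": ["Student"]})
unseen_encoded = onehot_encoder.transform(unseen).toarray()

print(ordinal.ravel())
print(onehot)
print(unseen_encoded)
```

For a variable with only two categories a single 0/1 column carries all the information, whereas for more categories integer codes would suggest an ordering that is not there, which is why one-hot encoding is the safer default in that case.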

# 2.1 Distributions
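
As a sketch of the kind of check meant here, the snippet below looks at summary statistics for two columns and flags implausible values. The rows are hypothetical; the column names and the two thresholds (365243 for `DAYS_EMPLOYED`, 10**7 for `AMT_INCOME_TOTAL`) are the ones used by the transformers further down in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows (assumption: values chosen for illustration only)
toy = pd.DataFrame({
    "DAYS_EMPLOYED": [-637, -1188, 365243, -3039],
    "AMT_INCOME_TOTAL": [202500.0, 270000.0, 67500.0, 117000000.0],
})

# Summary statistics make implausible extremes easy to spot
print(toy.describe())

# Count the suspicious entries per column
n_sentinel = (toy["DAYS_EMPLOYED"] == 365243).sum()
n_huge_income = (toy["AMT_INCOME_TOTAL"] > 10**7).sum()
print(n_sentinel, n_huge_income)

# Mark them as missing, so that the imputation step can deal with them later
cleaned = toy.copy()
cleaned["DAYS_EMPLOYED"] = cleaned["DAYS_EMPLOYED"].replace(365243, np.nan)
cleaned["AMT_INCOME_TOTAL"] = cleaned["AMT_INCOME_TOTAL"].where(
    cleaned["AMT_INCOME_TOTAL"] <= 10**7
)
print(cleaned)
```

Turning anomalies into missing values, rather than dropping the rows, keeps the other information about those clients available for modelling.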

In [None]:
# X_train_full.loc[114967, ["AMT_INCOME_TOTAL"]] = None

# outliers = X_train_full.loc[X_train_full.AMT_INCOME_TOTAL > 50000000]
# # X_train_full.AMT_INCOME_TOTAL.describe()

# print(outliers)

# # X_train_full.loc[]

# X_train_full.iloc[12840].AMT_INCOME_TOTAL


# 2.2 Missing values



First, let's take inventory of the missing values. We will start with the application_train_data dataset.

In [None]:
# find the columns that contain missing values
cols_missing_vals = [col for col in application_train_data.columns
                     if application_train_data[col].isnull().any()]

n_missing_vals = application_train_data.isnull().sum()[cols_missing_vals]

n_missing_vals_percent = (n_missing_vals / len(application_train_data) * 100).sort_values(ascending = False)

# # Percentages of missing values w.r.t. the whole column
# with pd.option_context('display.max_rows', None):
#   display(n_missing_vals_percent)

# pd.DataFrame([cols_missing_vals, n_missing_vals]), index = cols_missing_vals, columns = [""])

In [None]:
# # Matrix and Dendrogram of relations between columns in term of missing values
# import missingno as mn

# mn.matrix(X_train_full.sample(500), figsize = (10,6))

# mn.dendrogram(X_train_full, figsize = (20,50))

We see that 57 variables have a large number of missing values, and 10 more have only a few. Almost all of the variables with many missing values describe the building the client lives in. Although we do not know how these data were collected, or often even exactly what they describe, they are probably not missing completely at random. However, since we suspect that the properties of the building one lives in are usually related to known variables such as income and number of children, we will use imputation to fill in the missing values.

A method of imputation we consider for this is *k-nearest neighbors (kNN)*. Note that the preprocessing pipeline further down currently uses mean imputation (`SimpleImputer`), with the `KNNImputer` code commented out.
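
To illustrate what kNN imputation does, here is a minimal sketch on a small made-up array (the numbers are not from the Home Credit data). Each missing entry is replaced by the mean of that feature over the nearest rows in which it is present, where distance is computed on the features both rows have in common.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Made-up array with two missing entries (assumption: toy values for illustration)
X_toy = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Use the 2 nearest rows to fill in each missing entry
imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X_toy)

print(X_imputed)
```

Because every row with a missing value has to be compared against the other rows, kNN imputation can become slow on a dataset of this size; mean imputation is much cheaper but ignores the relation with other variables.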

In [None]:
# from sklearn.impute import KNNImputer

# imputer = KNNImputer(n_neighbors = 5)

In [None]:
# Distinguish categorical and numerical columns
cat_cols = [col for col in X_train_full.columns if X_train_full[col].dtype == "object"]
num_cols = [col for col in X_train_full.columns if X_train_full[col].dtype in ["int64", "float64"]]

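# Note: unique() counts NaN as a value of its own, so a column with two categories
# and missing entries ends up in the "more values" group
cat_cols_with_two_vals = [col for col in cat_cols if len(X_train_full[col].unique()) == 2]

cat_cols_with_more_vals = [col for col in cat_cols if len(X_train_full[col].unique()) > 2]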

print(cat_cols_with_more_vals)

In [None]:
from sklearn.base import TransformerMixin , BaseEstimator

class KillVampires(BaseEstimator, TransformerMixin):
    """Replace the placeholder value 365243 in `feature_name` by NaN.
    
    Read as a number of days, 365243 would correspond to roughly 1000 years.
    """
    
    def __init__(self, feature_name):
        self.feature_name = feature_name
    
    def fit(self, X, y = None):
        # Nothing to learn from the data
        return self
    
    def transform(self, X, y = None):
        X_ = X.copy() # We create a copy to prevent meddling with the original dataset
        X_[self.feature_name] = X_[self.feature_name].replace(to_replace = 365243, value = np.nan)
        return X_

In [None]:
def rating(rij):
    """Return NaN for implausibly large values (above 10**7), otherwise the value itself."""
    if rij > 10**7:
        return np.nan 
    else:
        return rij

In [None]:
class KillLiars(BaseEstimator, TransformerMixin):
    """Replace implausibly large values in `feature_name` by NaN (see `rating`)."""
    
    def __init__(self, feature_name):
        self.feature_name = feature_name
    
    def fit(self, X, y = None):
        # Nothing to learn from the data
        return self
    
    def transform(self, X, y = None):
        X_ = X.copy() # We create a copy to prevent meddling with the original dataset
        X_[self.feature_name] = X_[self.feature_name].apply(rating)
        return X_

In [None]:
from sklearn.impute import KNNImputer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

num_transformer = Pipeline(steps = [
    ("killVamp", KillVampires('DAYS_EMPLOYED')),
    ("killLiar", KillLiars('AMT_INCOME_TOTAL')),
    ("scaler", StandardScaler()),
    ("imputer", SimpleImputer(strategy = "mean"))
])

# preprocessing for categorical data with two values
cat_transformer_two_vals = Pipeline(steps = [
    ("imputer", SimpleImputer(strategy = "constant", fill_value = "missing")),
    ("label_encoder", OrdinalEncoder())
])


# preprocessing for categorical data with more than two values
cat_transformer_more_vals = Pipeline(steps = [
    ("imputer", SimpleImputer(strategy = "constant", fill_value = "missing")), 
    ("onehot_encoder", OneHotEncoder(handle_unknown = "ignore"))
])

# bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers = [
        ('num', num_transformer, num_cols),
        ('cat_with_two_vals', cat_transformer_two_vals, cat_cols_with_two_vals),
        ('cat_with_more_than_two_vals', cat_transformer_more_vals, cat_cols_with_more_vals)
])

In [None]:
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import roc_auc_score

model = LogisticRegressionCV(class_weight = "balanced", max_iter = 1000, cv = 5, scoring = "roc_auc")

pipeline = Pipeline(steps = [('preprocessor', preprocessor),
                              ('model', model)
                             ])

In [None]:
from sklearn.model_selection import train_test_split

X_train_split, X_test_split, y_train_split, y_test_split = train_test_split(
    X_train_full, y, test_size = 0.2, train_size = 0.8, random_state = 0)

In [None]:
pipeline.fit(X_train_split, y_train_split)

In [None]:
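# The model inside the pipeline is fitted in place, so its coefficients can be read directly
params = model.coef_

print(params)

# Names of the one-hot encoded columns (scikit-learn >= 1.0; older versions use get_feature_names)
onehot_features = pipeline.named_steps["preprocessor"].transformers_[2][1].named_steps["onehot_encoder"].get_feature_names_out(cat_cols_with_more_vals)

# The ColumnTransformer outputs its blocks in the order in which they were declared:
# numerical columns, then two-valued categoricals, then the one-hot columns.
# The names must follow that same order to line up with the coefficients.
one_hot_col_names = num_cols + cat_cols_with_two_vals + list(onehot_features)

print(len(num_cols))

print(one_hot_col_names)
len(one_hot_col_names)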

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

keys = one_hot_col_names
vals = params.tolist()

print(keys)
print(vals[0])

print(len(keys))
print(len(vals[0]))

plt.figure(figsize=(20,6))

sns.barplot(x = keys, y = vals[0])

In [None]:
coefficients = dict(zip(keys, vals[0]))
with pd.option_context("display.max_rows", None,"display.min_rows", None):
    print(pd.Series(coefficients).sort_values())

In [None]:
from sklearn.metrics import roc_auc_score

preds = pipeline.predict_proba(X_test_split)[: ,1]
print(roc_auc_score(y_test_split, preds, average = None))

In [None]:
from sklearn.metrics import roc_auc_score
import time

tic = time.time()
# preprocess and fit model
pipeline.fit(X_train_full, y)
toc = time.time()
print(toc-tic)

tic = time.time()
# make predictions for the competition test set
preds = pipeline.predict_proba(X_test)[:,1]
toc = time.time()
print(toc-tic)

# # evaluate model
# score = roc_auc_score(y, preds, average = None)
# print('ROC AUC score:', score)

In [None]:
preds

In [None]:
# Collect the predicted probabilities in a dataframe indexed by SK_ID_CURR
submission = pd.DataFrame({'TARGET': preds}, index = X_test.index)
submission.index.names = ['SK_ID_CURR']
submission.to_csv('submission.csv', index=True)

In [None]:
submission

In [None]:
# test_case = X_train_full.iloc[0:-1]
# test_y = y.iloc[0:-1]

# test_case = X_train_full
# test_y = y

# # print(test_y)
# import time
# tic = time.time()
# result = pd.DataFrame(preprocessor.fit_transform(test_case,test_y))
# toc = time.time()
# print(toc-tic)

# preprocessor.fit_transform(test_case, test_y)


# # ----------------------------------------------------------------------------------------------------------------
# import time

# time_elapsed = {}

# for t in range(20000, 40000, 10000):
#     begin_time = time.time()
#     test_case = X_train_full.iloc[0 : t]
#     transformed_test_case = pd.DataFrame(preprocessor.fit_transform(test_case), index = test_case.index)
#     end_time = time.time()
#     time_elapsed[t] = end_time - begin_time
    
# import seaborn as sns
# import matplotlib.pyplot as plt

# keys = list(time_elapsed.keys())

# values = list(time_elapsed.values())

# plt.figure(figsize = (10, 10))
# plt.title("{} HOURS for N=5 Nearest Neighbours".format(round(37*values[-1]/60,2)))
# bar = sns.barplot(x = keys, y = values)
# for p in range(len(values)):
#     bar.text(p, values[p], str(round(values[p], 2)),
#             horizontalalignment = 'center')
# # ----------------------------------------------------------------------------------------------------------------

# # test = pd.DataFrame(preprocessor.fit_transform(test_case, test_y)
# # pipeline.fit(X_train_full, y)
