# Introduction
![](https://zeynepstefan.com/wp-content/uploads/2018/04/creditrisk.jpg)

## Aim:

Predict whether a credit applicant is a good or bad risk based on the given features.

## Context:

The original dataset contains 1000 entries with 20 categorical/symbolic attributes prepared by Prof. Hofmann. In this dataset, each entry represents a person who takes out a credit from a bank. Each person is classified as a good or bad credit risk according to the set of attributes. The link to the original dataset can be found below.

## Content:
1. [Data Understanding](#1)
    * [Load and Check Data](#2)
    * [Variable Description](#3)
    * [Data Visualization](#4)
1. [Data Preprocessing](#5)                          
1. [Modeling](#6)
    * [KNN Model](#7)
    * [SVC Model](#8)
    * [XGBoost Model](#9)

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import pandas_profiling 
import seaborn as sns # visualization
import matplotlib.pyplot as plt # visualization
from sklearn.preprocessing import LabelEncoder # label encoding
from sklearn.model_selection import train_test_split # train, test split
from sklearn.preprocessing import StandardScaler # normalization
from sklearn.neighbors import KNeighborsClassifier # KNN model
from sklearn.svm import SVC # SVC model
from xgboost import XGBClassifier # XGBoost model
from sklearn.model_selection import GridSearchCV, cross_val_score # Gridsearch 
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_curve# results

import warnings # ignore warning
warnings.filterwarnings("ignore")

%matplotlib inline

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

<a id="1"></a>
# 1) Data Understanding

<a id="2"></a>
## Load and Check Data

In [None]:
# reading and copying data
data = pd.read_csv("/kaggle/input/german-credit-data-with-risk/german_credit_data.csv", index_col = "Unnamed: 0")
df = data.copy()

* overview of data

In [None]:
# overview of data
df.head()

<a id="3"></a>
## Variable Description

Meaning of the Values:

1. Age: Age of the person applying for the credit.
1. Sex: Gender of the person applying for the credit.
1. Job: Job category, coded as 0, 1, 2 or 3.
1. Housing: own, rent or free.
1. Saving accounts: Level of the person's savings account (little, moderate, quite rich, rich).
1. Checking account: Level of the person's checking account (little, moderate, rich).
1. Credit amount: Amount of the credit.
1. Duration: Time given for credit repayment.
1. Purpose: Purpose of the credit application.
1. Risk: Credit risk label (good or bad).

In [None]:
df.info()

Summary of the Columns and Rows:

* int64(4): Age, Job, Credit amount, Duration
* object(6): Sex, Housing, Saving accounts, Checking account, Purpose, Risk
* row number: 1000
* column number: 10

In [None]:
df.describe().T

Summary of Statistics of Numerical Values:

* Age: max 75.0, min 19.0, mean 35.546
* Job: max 3.0, min 0.0, mean 1.904
* Credit amount: max 18424.0, min 250.0, mean 3271.258
* Duration: max 72.0, min 4.0, mean 20.903

In [None]:
columns = ["Age","Sex","Job","Housing","Saving accounts","Checking account","Credit amount","Duration","Purpose","Risk"]

def unique_value(data_set, column_name):
    return data_set[column_name].nunique()

print("Number of the Unique Values:\n",unique_value(df, columns))    

Number of the Unique Values:

* Age(53):
* Sex(2): (male, female)
* Job(4): (0, 1, 2, 3)
* Housing(3): (own, free, rent)
* Saving accounts(4): (little, moderate, quite rich, rich)
* Checking account(3): (little, moderate, rich)
* Credit amount(921):
* Duration(33):
* Purpose(8): (radio/TV, education, furniture/equipment, car, business, domestic appliances, repairs, vacation/others)
* Risk(2): (bad, good)

In [None]:
# Missing Value Table
def missing_value_table(df):
    missing_value = df.isna().sum().sort_values(ascending=False)
    missing_value_percent = 100 * df.isna().sum()//len(df)
    missing_value_table = pd.concat([missing_value, missing_value_percent], axis=1)
    missing_value_table_return = missing_value_table.rename(columns = {0 : 'Missing Values', 1 : '% Value'})
    cm = sns.light_palette("lightgreen", as_cmap=True)
    missing_value_table_return = missing_value_table_return.style.background_gradient(cmap=cm)
    return missing_value_table_return
  
missing_value_table(df)

Number of Missing Values:

* Checking account: 394 missing values (39%)
* Saving accounts: 183 missing values (18%)
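
The counts and percentages above can also be obtained with `isna().sum()` and `isna().mean()`. The following is a minimal sketch on a made-up toy frame (the values are hypothetical, only the column names mirror the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; values are invented for illustration only.
toy = pd.DataFrame({
    "Saving accounts": ["little", np.nan, "rich", "little"],
    "Checking account": [np.nan, np.nan, "moderate", "little"],
    "Age": [22, 35, 47, 51],
})

# isna().sum() counts missing cells per column,
# isna().mean() gives the fraction of missing cells per column.
missing = pd.DataFrame({
    "Missing Values": toy.isna().sum(),
    "% Value": toy.isna().mean() * 100,
}).sort_values("Missing Values", ascending=False)

print(missing)
```

Building the table from a dictionary of Series names the columns directly, so no renaming step is needed.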

In [None]:
date_int = ["Purpose", 'Sex']
cm = sns.light_palette("lightgreen", as_cmap=True)
pd.crosstab(df[date_int[0]], df[date_int[1]]).style.background_gradient(cmap = cm)

Sex and Purpose:

* Both women (94) and men (243) most often applied for credit to buy a car.
* Among women, the least common purpose was vacation/others (3).
* Among men, the least common purpose was domestic appliances (6).
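
Because the two groups differ in size, raw counts can be hard to compare. As a hedged sketch on invented toy data, `pd.crosstab` with `normalize="columns"` gives the share of each purpose within each sex, and `idxmax` picks the most frequent purpose per column:

```python
import pandas as pd

# Hypothetical toy data; not taken from the real dataset.
toy = pd.DataFrame({
    "Purpose": ["car", "car", "education", "car", "radio/TV", "car"],
    "Sex": ["male", "female", "male", "male", "female", "female"],
})

# Absolute counts per (Purpose, Sex) pair.
counts = pd.crosstab(toy["Purpose"], toy["Sex"])

# Most frequent purpose for each sex (one entry per column).
top_purpose = counts.idxmax()

# Share of each purpose within each sex; every column sums to 1.
shares = pd.crosstab(toy["Purpose"], toy["Sex"], normalize="columns")

print(counts)
print(top_purpose)
print(shares.round(2))
```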

<a id="4"></a>
## Data Visualization

In [None]:
fig, ax = plt.subplots(1,2,figsize=(15,5))

# pass the column as x= so the call also works with recent seaborn versions
sns.countplot(x=df['Sex'], ax=ax[0]).set_title('Male - Female Ratio');
sns.countplot(x=df['Risk'], ax=ax[1]).set_title('Good - Bad Risk Ratio');

* The charts show that there are more males than females in this data set.
* The charts show that there are more good-risk than bad-risk entries in this data set.
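
The class balance seen in the plots can be put into numbers with `value_counts`. This is a minimal sketch on a hypothetical label series (not the real data); the majority share is a useful baseline, since a model that always predicts the majority class reaches exactly that accuracy:

```python
import pandas as pd

# Hypothetical labels for illustration only.
risk = pd.Series(
    ["good", "good", "bad", "good", "bad", "good", "good", "good", "bad", "good"],
    name="Risk",
)

counts = risk.value_counts()                 # absolute class counts
shares = risk.value_counts(normalize=True)   # class proportions

# Accuracy of a trivial model that always predicts the majority class.
baseline_accuracy = shares.max()

print(counts)
print(shares)
print("Majority-class baseline accuracy:", baseline_accuracy)
```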

In [None]:
fig, ax = plt.subplots(2,1,figsize=(15,5))
plt.tight_layout(pad=2)  # pad must be passed by keyword in recent matplotlib versions
sns.lineplot(data=df, x='Age', y='Credit amount', hue='Sex', lw=2, ax=ax[0]).set_title("Credit Amount Graph Depending on Age and Duration by Sex", fontsize=15);
sns.lineplot(data=df, x='Duration', y='Credit amount', hue='Sex', lw=2, ax=ax[1]);

* In the first chart, the credit amount peaks at around the age of 60.
* In the second chart, the highest credit amounts occur for durations between 50 and 60.

In [None]:
sns.countplot(x="Housing", hue="Risk", data=df).set_title("Housing and Frequency Graph by Risk", fontsize=15);
plt.show()

* In all three housing categories (own, free and rent), good-risk entries outnumber bad-risk entries.
* People who own their home are the group that applies for credit most often.

In [None]:
fig, (ax1, ax2) = plt.subplots(1,2, figsize=(10,6))
sns.countplot(x="Saving accounts", hue="Risk", data=df, ax=ax1);
sns.countplot(x="Checking account", hue="Risk", data=df, ax=ax2);
ax1.set_xticklabels(ax1.get_xticklabels(), rotation=45)
ax2.set_xticklabels(ax2.get_xticklabels(), rotation=45)
fig.show()

* At first glance, one might expect richer applicants to receive credit more often, but the charts do not clearly show this.

In [None]:
fig, ax = plt.subplots(1,3,figsize=(20,5))
plt.suptitle('Box Plots of Age, Duration and Credit amount.',fontsize = 15)
# pass the column as x= so the calls also work with recent seaborn versions
sns.boxplot(x=df['Credit amount'], ax=ax[0]);
sns.boxplot(x=df['Duration'], ax=ax[1]);
sns.boxplot(x=df['Age'], ax=ax[2]);
plt.show()

In [None]:
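# correlate numeric columns only; recent pandas versions raise an error on text columns
cor = df.select_dtypes("number").corr()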
sns.heatmap(cor, annot=True).set_title("Correlation Graph of Data Set",fontsize=15);
plt.show()

* There is a correlation between Credit amount and Duration(0.62).

<a id="5"></a>
# 2) Data Preprocessing

In [None]:
# Label Encoding
columns_label = ["Sex","Risk"]
labelencoder = LabelEncoder()
for i in columns_label:
    df[i] = labelencoder.fit_transform(df[i])

* Label encoding

In [None]:
Cat_Age = []
for i in df["Age"]:
    if i<25:
        Cat_Age.append("0-25")
    elif (i>=25) and (i<30):
        Cat_Age.append("25-30")
    elif (i>=30) and (i<35):
        Cat_Age.append("30-35")
    elif (i>=35) and (i<40):
        Cat_Age.append("35-40")
    elif (i>=40) and (i<50):
        Cat_Age.append("40-50")
    elif (i>=50) and (i<76):
        Cat_Age.append("50-75")
        
df["Cat Age"] = Cat_Age        

* converting age to category.
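
The same binning can be written more compactly with `pd.cut`. The sketch below assumes the bin edges and labels from the loop above and uses a few hypothetical ages; `right=False` makes each interval closed on the left, which matches the `>=` / `<` comparisons:

```python
import pandas as pd

# Hypothetical ages chosen to hit the bin boundaries.
ages = pd.Series([19, 24, 25, 29, 30, 39, 40, 49, 50, 75], name="Age")

# Same edges and labels as the if/elif chain: [0, 25), [25, 30), ..., [50, 76)
bins = [0, 25, 30, 35, 40, 50, 76]
labels = ["0-25", "25-30", "30-35", "35-40", "40-50", "50-75"]

cat_age = pd.cut(ages, bins=bins, labels=labels, right=False)

print(pd.DataFrame({"Age": ages, "Cat Age": cat_age}))
```

Unlike the loop, `pd.cut` returns NaN for an age outside the edges instead of silently skipping it, so the result always has the same length as the input.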

In [None]:
# Get Dummies
columns_dummy = ['Housing','Saving accounts','Checking account',"Purpose","Cat Age"]
for i in columns_dummy:
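    # prefix keeps column names unique ("Saving accounts" and "Checking account" share level names)
    df = pd.concat([df, pd.get_dummies(df[i], prefix=i)], axis=1)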

* Used get dummies method for some columns.

In [None]:
df.drop(['Housing','Saving accounts','Checking account',"Purpose","Age","Cat Age"], axis = 1, inplace=True)

* drop unnecessary columns.

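!!! Missing values were not filled or dropped, because doing either decreased the accuracy. Since `pd.get_dummies` ignores NaN by default, a row with a missing value simply gets 0 in every dummy column of that feature.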
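
To illustrate how missing values pass through the encoding step, here is a small sketch on a hypothetical series. With the default settings a NaN row is encoded as all zeros; `dummy_na=True` is an alternative (not used in this notebook) that adds an explicit indicator column:

```python
import numpy as np
import pandas as pd

# Hypothetical values; the second entry is missing.
checking = pd.Series(["little", np.nan, "rich", "moderate"], name="Checking account")

# Default behaviour: NaN is ignored, so that row has no active dummy column.
default = pd.get_dummies(checking, prefix="Checking")

# Alternative: keep an explicit indicator column for missing values.
with_na = pd.get_dummies(checking, prefix="Checking", dummy_na=True)

print(default)
print(with_na)
```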

<a id="6"></a>
# 3) Modeling

In [None]:
y = df.Risk
X = df.drop("Risk", axis = 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

* Split the data into training and test sets.

In [None]:
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
# reuse the statistics learned from the training set; do not refit on the test set
X_test = sc.transform(X_test)

* Applied standard scaling.
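
Standard scaling replaces each feature value x with (x - mean) / std. A hedged sketch on synthetic data (the arrays are randomly generated, not the credit data) shows that the test set is scaled with the mean and standard deviation of the training set, which keeps test information out of the preprocessing:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only.
rng = np.random.default_rng(42)
X_train = rng.normal(loc=50.0, scale=10.0, size=(80, 2))
X_test = rng.normal(loc=50.0, scale=10.0, size=(20, 2))

sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)  # learn mean/std from the training set
X_test_scaled = sc.transform(X_test)        # apply the same mean/std to the test set

# The same transformation written out by hand with the training statistics.
manual = (X_test - X_train.mean(axis=0)) / X_train.std(axis=0)

print(np.allclose(X_test_scaled, manual))
```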

<a id="7"></a>
## KNN(K-Nearest Neighbors) Model

In [None]:
knn_model = KNeighborsClassifier(n_neighbors = 3)
knn_model.fit(X_train, y_train)
y_pred = knn_model.predict(X_test) 
print('With KNN (K=3) accuracy is: ',knn_model.score(X_test,y_test))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

* Accuracy is 0.69 (K = 3) 

* Checking max. accuracy with graph.

In [None]:
neig = np.arange(1, 25)
train_accuracy = []
test_accuracy = []
# Loop over different values of k
for i, k in enumerate(neig):
    # k from 1 to 25(exclude)
    knn_model = KNeighborsClassifier(n_neighbors=k)
    # Fit with knn
    knn_model.fit(X_train,y_train)
    #train accuracy
    train_accuracy.append(knn_model.score(X_train, y_train))
    # test accuracy
    test_accuracy.append(knn_model.score(X_test, y_test))

# Plot
plt.figure(figsize=[12,6])
plt.plot(neig, test_accuracy, label = 'Testing Accuracy')
plt.plot(neig, train_accuracy, label = 'Training Accuracy')
plt.legend()
plt.title('k-value VS Accuracy')
plt.xlabel('Number of Neighbors')
plt.ylabel('Accuracy')
plt.xticks(neig)
plt.savefig('graph.png')
plt.show()
print("Best accuracy is {} with K = {}".format(np.max(test_accuracy),1+test_accuracy.index(np.max(test_accuracy))))

In [None]:
knn_model = KNeighborsClassifier(n_neighbors = 23)
knn_model.fit(X_train, y_train)
y_pred = knn_model.predict(X_test) 
print('With KNN (K=23) accuracy is: ',knn_model.score(X_test,y_test))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

In [None]:
#Predicting proba
y_pred_prob = knn_model.predict_proba(X_test)[:,1]

# Generate ROC curve values: fpr, tpr, thresholds
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)

# Plot ROC curve
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()

* Max accuracy is 0.725(K=23)
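
Choosing K by its test accuracy tends to give an optimistic score, because the test set is then used for model selection. As an alternative sketch (not the approach used above), K can be selected with cross-validation on the training set and the test set used only once at the end. The data here is synthetic, generated with `make_classification`, and all sizes are arbitrary assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the credit data.
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

k_values = list(range(1, 25))
cv_scores = []
for k in k_values:
    # The pipeline refits the scaler inside every cross-validation fold.
    pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    cv_scores.append(cross_val_score(pipe, X_train, y_train, cv=5).mean())

# Pick the K with the best mean cross-validated accuracy.
best_k = k_values[int(np.argmax(cv_scores))]

# Refit on the full training set and evaluate once on the held-out test set.
final_model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=best_k))
final_model.fit(X_train, y_train)
test_accuracy = final_model.score(X_test, y_test)

print(f"best k = {best_k}, test accuracy = {test_accuracy:.3f}")
```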

<a id="8"></a>
## SVC(Support Vector Classification) Model

In [None]:
svc_model = SVC(kernel = "rbf").fit(X_train, y_train)
y_pred = svc_model.predict(X_test)
print("Accuracy Score:", accuracy_score(y_test, y_pred))

* Before Tuning accuracy score is 0.745.  

In [None]:
svc_params ={"C": [0.0001, 0.001, 0.01, 0.1, 1, 5, 10, 50, 100]
             ,"gamma": [0.0001, 0.001, 0.01, 0.1, 1, 5, 10, 50, 100]}
svc = SVC()
svc_cv_model = GridSearchCV(svc, svc_params, cv = 10, n_jobs = -1, verbose = 2)
svc_cv_model.fit(X_train, y_train)

In [None]:
print("Best Parameters: "+ str(svc_cv_model.best_params_))

In [None]:
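svc_tuned = SVC(C = 10, gamma = 0.01).fit(X_train, y_train)
# predict with the tuned model (svc_tuned), not the untuned svc_model
y_pred = svc_tuned.predict(X_test)
print("Accuracy Score:", accuracy_score(y_test, y_pred))

* The after-tuning accuracy score was reported as 0.745, the same as before tuning; that figure came from predicting with the untuned `svc_model`, so it does not measure `svc_tuned`.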
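
One way to avoid retyping the best parameters, and to make sure the tuned estimator is the one being evaluated, is to take the refitted model from the search object. The sketch below uses synthetic data and a deliberately small, arbitrary grid; `best_estimator_` is available because `GridSearchCV` refits on the whole training set by default (`refit=True`):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the credit data.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Small illustrative grid; the values are assumptions, not tuned recommendations.
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
grid = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
grid.fit(X_train, y_train)

# The best parameter combination, already refitted on all training data.
svc_tuned = grid.best_estimator_
y_pred = svc_tuned.predict(X_test)
tuned_accuracy = accuracy_score(y_test, y_pred)

print("Best Parameters:", grid.best_params_)
print("Accuracy Score:", tuned_accuracy)
```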

<a id="9"></a>
## XGBoost Model

In [None]:
xgb = XGBClassifier()
xgb.fit(X_train, y_train)
y_pred_xgb = xgb.predict(X_test)
print(accuracy_score(y_test, y_pred_xgb))

* Before tuning accuracy score is 0.735.

In [None]:
# min_samples_split is a scikit-learn tree parameter, not an XGBoost one, so it is left out of the grid
xgb_params = {"n_estimators": [100, 500, 1000, 2000],
             "subsample": [0.6, 0.8, 1.0],
             "max_depth": [3, 4, 5, 6],
             "learning_rate": [0.1, 0.01, 0.02, 0.05]}
xgb = XGBClassifier()
xgb_cv_model = GridSearchCV(xgb, xgb_params, cv = 10, n_jobs = -1, verbose = 2)
xgb_cv_model.fit(X_train, y_train)

In [None]:
print("Best Parameters: "+ str(xgb_cv_model.best_params_))

In [None]:
# min_samples_split is not an XGBoost parameter, so it is omitted here
xgb = XGBClassifier(learning_rate = 0.05, max_depth = 5, n_estimators=100, subsample=0.8)
xgb_tuned = xgb.fit(X_train,y_train)
y_pred = xgb_tuned.predict(X_test)
print("Accuracy Score:", accuracy_score(y_test, y_pred))

* After tuning accuracy score is 0.79.