**Outline**  
1. Understanding the problem
2. Getting the Data
3. Data Exploration
4. Data Preprocessing
5. Shortlisting Promising Models
6. Fine Tuning
7. Ensemble Learning
8. Conclusion
9. References

### 1. Understand and frame the problem
* a. Where will it be used?  
To predict whether a credit applicant will be rated as a good or a bad risk.
* b. Supervised/Unsupervised/RL?  
It is a supervised problem, since we are given the dependent variable, Risk (good or bad).
* c. Classification / Regression?  
It is a classification problem. We will predict whether an applicant belongs to the good or the bad class.

### 2. Getting the Data  
The data was provided by Prof. Hofmann. It is also available from the UCI Machine Learning Repository.

In [None]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
data = pd.read_csv("/kaggle/input/german-credit-data-with-risk/german_credit_data.csv",index_col=0)

### 3. Explore the Data
**3.1 General Info**

In [None]:
# Making a copy of the original data to analyze
df = data.copy()

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.describe()

**Observations**    
* 1000 data entries (X,y)
* 10 columns
* Categorical: *Sex, Housing, Saving Accounts, Checking Account, Purpose, Risk*
* Numerical  : *Age, Job, Credit Amount, Duration*
* Target Value: *Risk*
* Columns with Null Values: *Saving Accounts, Checking Account*
* Job is a skill-level code that ranges from 0 to 3, not a count of jobs
* Duration ranges from 4 to 72 months


In [None]:
df_num = df[['Age','Job','Credit amount','Duration']]
df_cat = df[['Sex','Housing','Saving accounts','Checking account','Purpose']]

**3.2 The distribution of Risk among Numerical Variables**

In [None]:
# Mean of each numerical column per Risk class (pivot_table averages by default)
pd.pivot_table(df, index = 'Risk', values = list(df_num.columns))

**Observations**  
* The average **age** of ***good*** applicants is **higher** than that of ***bad*** applicants
* The average **credit amount** of ***good*** applicants is **lower** than that of ***bad*** applicants
* The average **duration** of ***good*** applicants is **shorter** than that of ***bad*** applicants
* The average **Job** value is nearly the **same** for good and bad applicants

**3.2 The distribution of applicants with good credit rate among categorical variables**

In [None]:
for i in df_cat:
    classLabel = df.loc[df.Risk == 'good'][i].value_counts(normalize=True).index
    plt.pie(df.loc[df.Risk == 'good'][i].value_counts(normalize=True),
            labels = classLabel, startangle=90, autopct='%.1f%%')
    plt.title(i)
    plt.show()

**Observations**  
Among the applicants with **good** risk  
* In terms of Sex             :   
71% male, 29% female
* In terms of Housing         :   
75% own , 16% rent  , 9% free
* In terms of Saving Accounts :   
70% little
* In terms of Checking Account:   
47% moderate
* In terms of Purpose         :   
33% car , 31% radio,TV

In [None]:
fig, (ax1,ax2) = plt.subplots(1,2,figsize=(10,10))
ser = df.loc[df.Sex == 'female']["Risk"].value_counts(normalize = True)
ax1.pie(ser,labels = ser.index, startangle=90, autopct='%.1f%%')
ax1.set_title('Female')
ser2 = df.loc[df.Sex == 'male']["Risk"].value_counts(normalize = True)
ax2.pie(ser2,labels = ser2.index, startangle=90, autopct='%.1f%%')
ax2.set_title('Male')
fig.show()

In [None]:
fig, (ax1,ax2,ax3) = plt.subplots(1,3,figsize=(15,15))
ser = df.loc[df.Housing == 'own']["Risk"].value_counts(normalize = True)
ser2 = df.loc[df.Housing == 'rent']["Risk"].value_counts(normalize = True)
ser3 = df.loc[df.Housing == 'free']["Risk"].value_counts(normalize = True)

ax1.pie(ser,labels = ser.index, startangle=90, autopct='%.1f%%')
ax1.set_title('Housing Status:Own')
ax2.pie(ser2,labels = ser2.index, startangle=90, autopct='%.1f%%')
ax2.set_title('Housing Status:rent')
ax3.pie(ser3,labels = ser3.index, startangle=90, autopct='%.1f%%')
ax3.set_title('Housing Status:free')
fig.show()

In [None]:
fig, (ax1,ax2,ax3) = plt.subplots(1,3,figsize=(15,15))
ser = df.loc[df["Checking account"] == 'little']["Risk"].value_counts(normalize = True)
ser2 = df.loc[df["Checking account"] == 'moderate']["Risk"].value_counts(normalize = True)
ser3 = df.loc[df["Checking account"] == 'rich']["Risk"].value_counts(normalize = True)

ax1.pie(ser,labels = ser.index, startangle=90, autopct='%.1f%%')
ax1.set_title('Checking account:little')
ax2.pie(ser2,labels = ser2.index, startangle=90, autopct='%.1f%%')
ax2.set_title('Checking account:moderate')
ax3.pie(ser3,labels = ser3.index, startangle=90, autopct='%.1f%%')
ax3.set_title('Checking account:rich')
fig.show()

In [None]:
fig, (ax1,ax2,ax3,ax4) = plt.subplots(1,4,figsize=(15,15))
ser = df.loc[df["Saving accounts"] == 'little']["Risk"].value_counts(normalize = True)
ser2 = df.loc[df["Saving accounts"] == 'moderate']["Risk"].value_counts(normalize = True)
ser3 = df.loc[df["Saving accounts"] == 'rich']["Risk"].value_counts(normalize = True)
ser4 = df.loc[df["Saving accounts"] == 'quite rich']["Risk"].value_counts(normalize = True)

# Titles name the column that is actually plotted: Saving accounts
ax1.pie(ser,labels = ser.index, startangle=90, autopct='%.1f%%')
ax1.set_title('Saving accounts:little')
ax2.pie(ser2,labels = ser2.index, startangle=90, autopct='%.1f%%')
ax2.set_title('Saving accounts:moderate')
ax3.pie(ser3,labels = ser3.index, startangle=90, autopct='%.1f%%')
ax3.set_title('Saving accounts:rich')
ax4.pie(ser4,labels = ser4.index, startangle=90, autopct='%.1f%%')
ax4.set_title('Saving accounts:quite rich')
fig.show()

In [None]:
for i in df.Purpose.unique():
    # Look shares up by label: value_counts() orders by frequency, not by class
    ser = df.loc[df["Purpose"] == i]["Risk"].value_counts(normalize = True)
    print('applicants with Purpose: ',i)
    print("%.2f" % (ser.get('bad', 0)*100),'% bad')
    print("%.2f" % (ser.get('good', 0)*100),'% good')

**Observations**
* The percentage of male applicants rated as good is higher than that of female applicants
* Applicants who own their house are rated as good about 14 percentage points more often than applicants who rent or live for free
* For checking accounts, the share of good applicants is 17 percentage points higher for rich than for moderate, and 10 percentage points higher for moderate than for little
* The share of good applicants by saving account category:
    * little = 64%
    * moderate = 67%
    * rich = 87%
    * quite rich = 82%
* The share of good applicants by purpose ranks as follows:  
radio/TV > car > furniture/equipment > domestic appliances > business > repairs > education > vacation
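
The per-category shares above are read off one pie chart at a time. As a hedged alternative, a row-normalised cross-tabulation gives the same kind of numbers in a single table. The sketch below runs on a small hypothetical frame (`toy`), not on the real data; only the column names `Housing` and `Risk` are taken from the notebook.

```python
import pandas as pd

# Hypothetical toy data; the column names mirror the notebook's df.
toy = pd.DataFrame({
    "Housing": ["own", "own", "own", "rent", "rent", "free", "free", "own"],
    "Risk":    ["good", "good", "bad", "good", "bad", "bad", "good", "good"],
})

# Row-normalised cross-tabulation: each row sums to 1, so the "good"
# column is the share of good applicants within that category.
rates = pd.crosstab(toy["Housing"], toy["Risk"], normalize="index")
good_rate = rates["good"].sort_values(ascending=False)
print(good_rate)
```

On the notebook's data, the same call with `df["Housing"]` and `df["Risk"]` should reproduce the shares shown in the pie charts, assuming no other filtering is applied.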
    


**3.3 Histogram of attributes**

In [None]:
good = df.loc[df['Risk'] == 'good']
bad = df.loc[df['Risk'] == 'bad']
for i in df_num:
    good[i].hist(alpha = 0.5,label='good')
    bad[i].hist(alpha = 0.5,label='bad')
    plt.title(i)
    plt.legend(['good','bad'])
    plt.show()

**Observations**
* The chance of a good credit risk:
    * is somewhat lower for younger applicants, consistent with the higher average age of good applicants in 3.2
    * is nearly the same across Job categories
    * decreases as the credit amount increases
    * decreases as the credit duration increases

In [None]:
for i in df_cat:
    if(i != 'Purpose'):
        print(pd.pivot_table(df, index = 'Risk',values= 'Purpose', columns = i,aggfunc = 'count'))

**3.4 Correlations**

In [None]:
print(df_num.corr())
sns.heatmap(df_num.corr())

**Observation**  
* Credit amount and duration are highly correlated
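
A minimal sketch of how the strongest pairwise correlation could be picked out programmatically rather than read off the heatmap. The data is synthetic and hypothetical: `Credit amount` is deliberately constructed to grow with `Duration`, so the result says nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; "Credit amount" is built to grow with "Duration".
rng = np.random.default_rng(0)
duration = rng.integers(4, 73, size=200)
toy = pd.DataFrame({
    "Age": rng.integers(19, 76, size=200),
    "Duration": duration,
    "Credit amount": 100 * duration + rng.normal(0, 500, size=200),
})

corr = toy.corr()
# Keep only the upper triangle (k=1 drops the diagonal) so each pair appears once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()
strongest = pairs.abs().idxmax()
print(strongest, round(pairs.loc[strongest], 2))
```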

### 4. Prepare the Data  
**4.1 Missing Values**
* As observed before, only 2 columns contain null values: Saving accounts and Checking account. A null may mean that the applicant has no saving or checking account, or the value may simply be missing.
* Therefore, I treat the missing values as a separate category.

In [None]:
data.isnull().sum().sort_values(ascending=False)

In [None]:
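# Assign the result back: fillna(inplace=True) on a selected column may not update `data` in recent pandas
data['Checking account'] = data['Checking account'].fillna("No Info")
data['Saving accounts'] = data['Saving accounts'].fillna("No Info")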

**4.2 Split data into training and test set**  
* I wanted to keep the ratio of good and bad applicants the same in both sets
* Therefore, I used a stratified train/test split

In [None]:
# The ratio of good and bad applicants
data['Risk'].value_counts()

In [None]:
# set dependent and independent values
y = data['Risk']
X = data.drop('Risk',axis = 1)

In [None]:
# Split train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

In [None]:
y_train.value_counts()

* The ratio of good and bad applicants is the same for both sets
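
The cell above prints the class counts for the training set only. A minimal sketch of a side-by-side check for both subsets follows. It uses a hypothetical toy target with a 70/30 balance, which is an assumption for illustration; the names `X_toy`, `y_toy` and `ratios` are not part of the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy target with an assumed 70/30 class balance.
y_toy = pd.Series(["good"] * 70 + ["bad"] * 30, name="Risk")
X_toy = pd.DataFrame({"feature": range(100)})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Put the class shares of both subsets side by side.
ratios = pd.DataFrame({
    "train": y_tr.value_counts(normalize=True),
    "test": y_te.value_counts(normalize=True),
})
print(ratios)
```

With `stratify` set, the two columns should agree up to rounding; without it, the test share can drift noticeably on a dataset this small.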

**4.3 Feature Scaling and Pipeline Formation**

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

num_attribs = ['Age','Job','Credit amount','Duration']
cat_attribs = ['Sex','Housing','Saving accounts','Checking account','Purpose']

num_pipeline = Pipeline([
        ('std_scaler', StandardScaler())
    ])
cat_pipeline = Pipeline([
        ("encoding", OneHotEncoder())
    ])
full_pipeline = ColumnTransformer([
        ("num", num_pipeline, num_attribs),
        ("cat", cat_pipeline, cat_attribs)
    ])

In [None]:
X_train = full_pipeline.fit_transform(X_train)
X_test = full_pipeline.transform(X_test)

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y_train = le.fit_transform(y_train)
y_test = le.transform(y_test)

### 5. Shortlisting Promising Models

| Models | Accuracy Score |
|:-|:-|
| Logistic Regression | 73.5% |
| Naive Bayes | 69.5% |
| k-nn | 69.5% |
| SVM | 72.5% |
| Kernel SVM | 74.5% |
| Decision Tree | 71% |
| Random Forest | 71.5% |
| XGBoost | 71.5% |

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict

**5.1 Logistic Regression**

In [None]:
lr = LogisticRegression(random_state = 42)
y_pred_lr = cross_val_predict(lr,X_train,y_train,cv = 5)
print('Accuracy score:',accuracy_score(y_train, y_pred_lr))
print('Confusion Matrix')
print(confusion_matrix(y_train, y_pred_lr))

**5.2 Naive Bayes**

In [None]:
nb = GaussianNB()
y_pred_nb = cross_val_predict(nb,X_train,y_train,cv = 5)
print('Accuracy score:',accuracy_score(y_train, y_pred_nb))
print('Confusion Matrix')
print(confusion_matrix(y_train, y_pred_nb))

**5.3 K-Nearest Neighbors**

In [None]:
knn = KNeighborsClassifier(n_neighbors = 5)
y_pred_knn = cross_val_predict(knn,X_train,y_train,cv = 5)
print('Accuracy score:',accuracy_score(y_train, y_pred_knn))
print('Confusion Matrix')
print(confusion_matrix(y_train, y_pred_knn))

**5.4 Support Vector Machine**

In [None]:
svc = SVC(probability=True,kernel = 'linear', random_state = 42)
y_pred_svc = cross_val_predict(svc,X_train,y_train,cv = 5)
print('Accuracy score:',accuracy_score(y_train, y_pred_svc))
print('Confusion Matrix')
print(confusion_matrix(y_train, y_pred_svc))

**5.5 Kernel SVM (rbf)**

In [None]:
svcKernel = SVC(probability=True,kernel = 'rbf', random_state = 42)
y_pred_svcKernel = cross_val_predict(svcKernel,X_train,y_train,cv = 5)
print('Accuracy score:',accuracy_score(y_train, y_pred_svcKernel))
print('Confusion Matrix')
print(confusion_matrix(y_train, y_pred_svcKernel))

**5.6 Decision Tree Classification**

In [None]:
dt = DecisionTreeClassifier(max_depth = 5,random_state = 42)
y_pred_dt = cross_val_predict(dt,X_train,y_train,cv = 5)
print('Accuracy score:',accuracy_score(y_train, y_pred_dt))
print('Confusion Matrix')
print(confusion_matrix(y_train, y_pred_dt))

**5.7 Random Forest**

In [None]:
rf = RandomForestClassifier(n_estimators = 10, random_state = 42)
y_pred_rf = cross_val_predict(rf,X_train,y_train,cv = 5)
print('Accuracy score:',accuracy_score(y_train, y_pred_rf))
print('Confusion Matrix')
print(confusion_matrix(y_train, y_pred_rf))

**5.8 XGBoost**

In [None]:
xgb = XGBClassifier(random_state =42)
y_pred_xgb = cross_val_predict(xgb,X_train,y_train,cv = 5)
print('Accuracy score:',accuracy_score(y_train, y_pred_xgb))
print('Confusion Matrix')
print(confusion_matrix(y_train, y_pred_xgb))
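
The cells in this section repeat the same three steps for each model. As a hedged sketch, the repetition could be folded into one loop over a dictionary of estimators. The example uses synthetic data from `make_classification` as a stand-in for the preprocessed training set and includes only three scikit-learn models; `X_demo`, `y_demo`, `models` and `results` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed training data (an assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(random_state=42),
    "k-nn": KNeighborsClassifier(n_neighbors=5),
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=42),
}

results = {}
for name, model in models.items():
    # Out-of-fold predictions: each sample is predicted by a model
    # that was trained on the other folds only.
    y_hat = cross_val_predict(model, X_demo, y_demo, cv=5)
    results[name] = accuracy_score(y_demo, y_hat)
    print(name, round(results[name], 3))
    print(confusion_matrix(y_demo, y_hat))
```

Because `cross_val_predict` returns out-of-fold predictions, the accuracy computed this way is an estimate on unseen data even though only the training set is used.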

**5.9 Analyzing the type of errors for each model**

| Models | Accuracy Score | Type 1 Error | Type 2 Error |
|-|-|-|-|
| Logistic Regression | 73.1% | 152 | 63 |
| Naive Bayes | 67.7% | 103 | 155 |
| k-nn | 70.2% | 159 | 79 |
| SVM | 73.3% | 166 | 47 |
| Kernel SVM | 73.7% | 171 | 39 |
| Decision Tree | 71.3% | 164 | 65 |
| Random Forest | 70.3% | 126 | 111 |
| XGBoost | 72.3% | 135 | 86 |

* Kernel SVM has the lowest Type 2 Error (False Negatives)
* Naive Bayes has the lowest Type 1 Error (False Positives)
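
The table does not say which class counts as positive. Assuming the labels come from a `LabelEncoder`, which sorts classes alphabetically (`bad` becomes 0, `good` becomes 1), `good` is the positive class. Under that assumption, a Type 1 error is a bad applicant predicted as good and a Type 2 error is a good applicant predicted as bad. A minimal sketch with hypothetical labels shows how both counts can be read from a confusion matrix:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels after encoding: 0 = bad, 1 = good.
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 1])
y_hat = np.array([0, 1, 1, 1, 1, 1, 0, 1])

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
print("Type 1 errors (bad predicted as good):", fp)
print("Type 2 errors (good predicted as bad):", fn)
```

In a credit setting the two errors rarely cost the same, so the lower-accuracy model with fewer Type 1 errors may still be the more useful one.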

### 6. Fine Tuning

In [None]:
from sklearn.model_selection import GridSearchCV
param_grid = [{'weights': ["uniform", "distance"], 'n_neighbors': [4, 5, 7] }]

grid_search = GridSearchCV(knn, param_grid, cv=5, verbose=3)
grid_search.fit(X_train, y_train)

In [None]:
grid_search.best_params_

In [None]:
grid_search.best_score_

In [None]:
knn = KNeighborsClassifier(n_neighbors = 7, weights = 'uniform')
y_pred_knn = cross_val_predict(knn,X_train,y_train,cv = 5)
print('Accuracy score:',accuracy_score(y_train, y_pred_knn))
print('Confusion Matrix')
print(confusion_matrix(y_train, y_pred_knn))

* After tuning, k-nn's accuracy increased and its Type 2 errors decreased.
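
Instead of copying the best parameters into a new estimator by hand, the tuned model can be taken from the search object itself. The following is a minimal sketch on synthetic data; `X_demo`, `y_demo`, `search` and `knn_tuned` are hypothetical names, and the grid simply mirrors the one used above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the preprocessed training data (an assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

param_grid = {"weights": ["uniform", "distance"], "n_neighbors": [4, 5, 7]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X_demo, y_demo)

# With the default refit=True, best_estimator_ is already refit on all
# of X_demo using the winning parameter combination.
knn_tuned = search.best_estimator_
print(search.best_params_, round(search.best_score_, 3))
```

This keeps the tuned estimator in sync with the search if the grid is changed later.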

### 7. Ensemble Learning

* A voting classifier with soft voting is used
* First, all models are included
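
Soft voting averages the class probabilities predicted by the individual models and picks the class with the highest average, which is why each estimator must support `predict_proba` (hence `probability=True` on the SVC models). The sketch below reproduces that arithmetic with NumPy on hypothetical probabilities from three unnamed models and compares it with a simple majority (hard) vote.

```python
import numpy as np

# Hypothetical class probabilities [P(bad), P(good)] from three models
# for two applicants.
probas = np.array([
    [[0.45, 0.55], [0.20, 0.80]],   # model A
    [[0.45, 0.55], [0.30, 0.70]],   # model B
    [[0.95, 0.05], [0.55, 0.45]],   # model C
])

# Soft voting: average the probabilities across models, then take the argmax.
avg = probas.mean(axis=0)
soft_pred = avg.argmax(axis=1)

# Hard voting for comparison: each model votes with its own argmax,
# and class 1 wins if more than half of the models choose it.
votes = probas.argmax(axis=2)
hard_pred = (votes.sum(axis=0) > probas.shape[0] / 2).astype(int)
print(avg, soft_pred, hard_pred)
```

For the first applicant the two schemes disagree: two models lean slightly towards good, but one very confident model pulls the averaged probability towards bad.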

In [None]:
voting_clf = VotingClassifier(estimators = [('lr',lr),
                                            ('nb',nb),
                                            ('knn',knn),
                                            ('svc',svc),
                                            ('svcKernel',svcKernel),
                                            ('rf',rf),
                                            ('dt',dt),
                                            ('xgb',xgb)], 
                              voting = 'soft') 
y_pred = cross_val_predict(voting_clf,X_train,y_train,cv = 5)
print('Accuracy score:',accuracy_score(y_train, y_pred))
print('Confusion Matrix')
print(confusion_matrix(y_train, y_pred))

* I decided to use the 5 models with the highest accuracy

In [None]:
voting_clf2 = VotingClassifier(estimators = [('lr',lr),
                                            ('svc',svc),
                                            ('svcKernel',svcKernel),
                                            ('dt',dt),
                                            ('xgb',xgb)], 
                              voting = 'soft') 
y_pred = cross_val_predict(voting_clf2,X_train,y_train,cv = 5)
print('Accuracy score:',accuracy_score(y_train, y_pred))
print('Confusion Matrix')
print(confusion_matrix(y_train, y_pred))

**Checking the accuracy score on the test set**

In [None]:
voting_clf2 = VotingClassifier(estimators = [('lr',lr),
                                            ('svc',svc),
                                            ('svcKernel',svcKernel),
                                            ('dt',dt),
                                            ('xgb',xgb)], 
                              voting = 'soft') 
voting_clf2.fit(X_train,y_train)
y_pred = voting_clf2.predict(X_test)
print('Accuracy score:',accuracy_score(y_test, y_pred))
print('Confusion Matrix')
print(confusion_matrix(y_test, y_pred))

### 8. Conclusion
An accuracy of 77% was achieved with soft voting over the Logistic Regression, SVC, kernel SVC, Decision Tree and XGBoost models.

### 9. References
* [German Credit Analysis || A Risk Perspective](https://www.kaggle.com/janiobachmann/german-credit-analysis-a-risk-perspective)
