# Overview

This notebook is an exploratory data analysis of the [German Credit Database](https://www.kaggle.com/uciml/german-credit).

In this dataset, each entry represents a person who takes out credit from a bank.

This dataset is a subset of the full dataset by Prof. Hofmann. Original dataset: [UCI](https://archive.ics.uci.edu/ml/datasets/Statlog+%28German+Credit+Data%29)

# Data Collection

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
data = pd.read_csv("../input/german-credit/german_credit_data.csv")
data.head()

# Data Cleaning

In [None]:
data.drop("Unnamed: 0", inplace=True, axis=1)
data.head()

In [None]:
job_dictionary = {0:"unskilled and non-resident", 1:"unskilled and resident", 2:"skilled", 3:"higly skilled"}
data = data.replace({"Job":job_dictionary})
data.head()

# Data Information

In [None]:
data.shape

In [None]:
data.columns

In [None]:
data.info()

In [None]:
data.describe()

In [None]:
data.isnull().sum() / data.shape[0]

The `Checking account` and `Saving accounts` columns have a significant share of missing values. This may be because many people did not have such an account when they applied for credit.
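
One way to act on that reading, sketched here on a tiny made-up frame rather than the real data, is to keep the rows and treat a missing value as its own category. The label `"no account"` is an assumption for illustration; the dataset itself does not say why the values are missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame that mimics the two account columns (NOT the real data)
toy = pd.DataFrame({
    "Saving accounts": ["little", np.nan, "rich", "little", np.nan],
    "Checking account": [np.nan, "moderate", np.nan, "little", np.nan],
})

# Fraction of missing values per column
missing_share = toy.isnull().mean()

# Assumption: a missing value means the applicant had no such account
filled = toy.fillna("no account")

print(missing_share)
print(filled["Checking account"].value_counts())
```

Keeping the missing rows as a category means later group-wise plots do not silently discard those applicants.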

# Data Exploration

## Numeric Data

In [None]:
data.Age.hist()

Most applicants were between 25 and 30 years old.

In [None]:
data["Credit amount"].hist()

The distribution of credit amounts is right-skewed: the number of credits falls off roughly exponentially as the amount increases.
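
A long right tail like this is often easier to inspect on a log scale. The sketch below uses synthetic log-normal amounts (an assumption, not the real column) to show how the skewness can be compared before and after a log transform.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed amounts (log-normal); NOT the real credit data
rng = np.random.default_rng(0)
amount = pd.Series(rng.lognormal(mean=8.0, sigma=0.8, size=1000),
                   name="Credit amount")

raw_skew = amount.skew()          # clearly positive for a long right tail
log_skew = np.log(amount).skew()  # close to zero if roughly log-normal

print(f"skew raw: {raw_skew:.2f}, skew after log: {log_skew:.2f}")
```

On the real data the same two lines could be applied to `data["Credit amount"]`; whether the log-transformed values look symmetric there would need to be checked.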

In [None]:
data.Duration.hist()

Most credits had a duration of 1-2 years.

In [None]:
corr = data[["Age","Credit amount", "Duration"]].corr()
corr

In [None]:
cmap = sns.diverging_palette(250, 0, as_cmap=True)
sns.heatmap(corr, cmap=cmap, square=True, linewidths=.5, vmax=1, vmin=-.2)

Duration and credit amount are moderately correlated, while age shows very little correlation with either of them.
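
`DataFrame.corr()` computes Pearson correlation by default, which can be influenced by a few very large credits. A rank-based Spearman correlation is a common cross-check. The frame below is synthetic and only stands in for the three numeric columns.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the three numeric columns; NOT the real data
rng = np.random.default_rng(1)
duration = rng.integers(6, 60, size=200)
toy = pd.DataFrame({
    "Age": rng.integers(19, 75, size=200),
    "Credit amount": duration * 150 + rng.normal(0, 1500, size=200),
    "Duration": duration,
})

pearson = toy.corr(method="pearson")
spearman = toy.corr(method="spearman")  # rank-based, less sensitive to outliers

print(spearman.round(2))
```

If both methods agree on the real data, the moderate correlation is unlikely to be an artifact of a handful of extreme values.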

In [None]:
sns.regplot(x=data["Credit amount"], y=data["Duration"],order=3, line_kws={"color":"orange"})

Duration rises with credit amount, but the trend flattens as the duration approaches the 40-month mark. This suggests that people preferred not to take large credits over very long periods.
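
With `order=3`, `regplot` draws a cubic polynomial fit. A comparable fit can be computed directly with `numpy.polyfit`, which makes the flattening measurable as a decreasing slope. The points below are made up and noise-free, chosen only to illustrate the idea.

```python
import numpy as np

# Made-up (amount, duration) points on a concave curve; NOT the real data
amount = np.linspace(0.0, 10.0, 50)  # credit amount, arbitrary units
duration = 2.0 + 3.0 * amount - 0.1 * amount**2

# Cubic least-squares fit; coefficients are returned highest degree first
coefs = np.polyfit(amount, duration, deg=3)
curve = np.poly1d(coefs)

# Derivative of the fitted curve, evaluated at a small and a large amount
slope = curve.deriv()
print(slope(1.0), slope(9.0))
```

A smaller slope at large amounts than at small ones is the numerical counterpart of the flattening seen in the plot.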

## Categorical Data

In [None]:
df_cat = data[['Sex', 'Job', 'Housing', 'Saving accounts', 'Checking account','Purpose']]

In [None]:
for col in df_cat.columns:
    # Count applicants per category (missing values are excluded by default)
    cat_num = df_cat[col].value_counts()
    chart = sns.barplot(x=cat_num.index, y=cat_num.values)
    plt.xticks(rotation=90)
    # Label the axes explicitly instead of relying on the Series name
    chart.set_xlabel(col)
    chart.set_ylabel("# of applicants")
    plt.title(col)
    plt.show()

* Males submitted more credit applications than females
* Skilled workers submitted the most credit applications
* People who owned their house submitted more credit applications than those who rented or lived for free
* People with little money in their accounts submitted the most credit applications
* Most applications were for cars, radio/TV and furniture

In [None]:
data.groupby("Sex")["Credit amount"].mean().plot(kind="bar")

In [None]:
data.groupby("Sex")[["Age", "Duration"]].mean().T.plot(kind="bar")

On average, male applicants took larger credits, for longer durations, and were older than female applicants.

In [None]:
data.groupby("Job")["Credit amount"].mean().plot(kind="bar")

Highly skilled workers took larger credits on average than the other job classes, even though they submitted fewer applications.
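
Application counts and mean amounts can be placed side by side with a single `agg` call, which makes this kind of comparison explicit. The rows below are hypothetical and only illustrate the shape of the result.

```python
import pandas as pd

# Hypothetical toy rows; NOT the real data
toy = pd.DataFrame({
    "Job": ["skilled", "skilled", "skilled", "highly skilled"],
    "Credit amount": [1000, 2000, 3000, 6000],
})

# Number of applications and mean amount per job class in one table
summary = toy.groupby("Job")["Credit amount"].agg(["count", "mean"])

print(summary)
```

The `count` column also shows how many applicants each mean is based on, which matters when a class is small.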

In [None]:
data.groupby("Job")[["Age", "Duration"]].mean().T.plot(kind="bar")

Highly skilled and skilled workers took credits of longer duration. Age differs little between job classes: the mean age of every class lies between 30 and 40.

Taken together, the previous plot and this one show that highly skilled workers received larger credits for longer durations, whereas skilled workers, who submitted more applications, received smaller credits on average.

In [None]:
data.groupby("Housing")["Credit amount"].mean().plot(kind="bar")

People in free housing applied for larger credits on average. One possible explanation is that applicants who cannot afford their own home rely more heavily on bank credit, although the data alone cannot confirm this.

In [None]:
data.groupby("Housing")[["Age", "Duration"]].mean().T.plot(kind="bar")

People living in free housing were older on average.

In [None]:
data.groupby("Saving accounts")["Credit amount"].mean().plot(kind="bar")

In [None]:
data.groupby("Saving accounts")[["Age", "Duration"]].mean().T.plot(kind="bar")

In [None]:
data.groupby("Checking account")["Credit amount"].mean().plot(kind="bar")

In [None]:
data.groupby("Checking account")[["Age", "Duration"]].mean().T.plot(kind="bar")

The middle class (applicants with a moderate account balance) took larger credits and did so more frequently.

In [None]:
data.groupby("Purpose", sort=True)["Credit amount"].mean().sort_values().plot(kind="barh")

The highest credit amounts were taken for business.

# Some personal insights

We saw that people in free housing took the largest credits. Let's see what they were applying for.

In [None]:
data.groupby(["Housing","Purpose"], sort=True).count().loc["free"]["Credit amount"].T.sort_values().plot(kind="barh")

Most credit applications from this group were for cars: people who did not have a house of their own were applying to buy cars.

In [None]:
data.groupby(["Housing","Purpose"], sort=True).count().loc["own"]["Credit amount"].T.sort_values().plot(kind="barh")

In [None]:
data.groupby(["Housing","Purpose"], sort=True).count().loc["rent"]["Credit amount"].T.sort_values().plot(kind="barh")

Cars, radio/TV and furniture were the categories with the most applications.
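
The three per-housing plots each show one row of a two-way count table. As a sketch on made-up rows, `pd.crosstab` builds that whole table at once, and `normalize="index"` turns it into row-wise shares.

```python
import pandas as pd

# Hypothetical toy rows; NOT the real data
toy = pd.DataFrame({
    "Housing": ["free", "own", "own", "rent", "own", "free"],
    "Purpose": ["car", "car", "radio/TV", "car", "radio/TV", "car"],
})

# Number of applications per (housing, purpose) pair
counts = pd.crosstab(toy["Housing"], toy["Purpose"])

# Row-normalised shares make housing groups of different size comparable
shares = pd.crosstab(toy["Housing"], toy["Purpose"], normalize="index")

print(counts)
print(shares.round(2))
```

Row-wise shares are useful here because the housing groups differ a lot in size, so raw counts alone can hide differences in what each group applies for.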

In [None]:
data.groupby(["Saving accounts", "Purpose"], sort=True)["Credit amount"].mean().loc["little"].sort_values().plot(kind="barh")

Applicants with little savings took their largest credits for business, cars and repairs, which are necessities.

In [None]:
data.groupby(["Saving accounts", "Purpose"], sort=True)["Credit amount"].mean().loc["moderate"].sort_values().plot(kind="barh")

The middle class took its largest credits for education.

In [None]:
data.groupby(["Saving accounts", "Purpose"], sort=True)["Credit amount"].mean().loc["rich"].sort_values().plot(kind="barh")

In [None]:
data.groupby(["Saving accounts", "Purpose"], sort=True)["Credit amount"].mean().loc["quite rich"].sort_values().plot(kind="barh")

The rich and the quite rich took credit for luxury items such as cars, furniture and TVs.
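
Means over small groups can be noisy, so it helps to look at the group sizes next to the mean amounts. The sketch below, on hypothetical rows, uses `pivot_table` with two aggregations to produce both in one table.

```python
import pandas as pd

# Hypothetical toy rows; NOT the real data
toy = pd.DataFrame({
    "Saving accounts": ["little", "little", "rich", "rich", "little"],
    "Purpose": ["car", "car", "car", "education", "education"],
    "Credit amount": [1000, 3000, 5000, 7000, 2000],
})

# Mean amount and number of credits per (savings level, purpose) pair
table = toy.pivot_table(index="Saving accounts", columns="Purpose",
                        values="Credit amount", aggfunc=["mean", "count"])

print(table)
```

A cell with a high mean but a count of one or two is a weak basis for a conclusion about that group.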

In [None]:
data.groupby(["Sex", "Purpose"], sort=True)["Credit amount"].mean().loc["male"].sort_values().plot(kind="barh")

In [None]:
data.groupby(["Sex", "Purpose"], sort=True)["Credit amount"].mean().loc["female"].sort_values().plot(kind="barh")

Females took more credit than males for non-essential items and vacations.

In [None]:
data.groupby(["Job", "Purpose"], sort=True)["Credit amount"].mean().loc["skilled"].sort_values().plot(kind="barh")

In [None]:
data.groupby(["Job", "Purpose"], sort=True)["Credit amount"].mean().loc["higly skilled"].sort_values().plot(kind="barh")

In [None]:
data.groupby(["Job", "Purpose"], sort=True)["Credit amount"].mean().loc["unskilled and resident"].sort_values().plot(kind="barh")

In [None]:
data.groupby(["Job", "Purpose"], sort=True)["Credit amount"].mean().loc["unskilled and non-resident"].sort_values().plot(kind="barh")

# Conclusion

Some interesting points that were noticed:
* People preferred not to take large credits over very long periods
* Highly skilled workers took larger credits on average than the other job classes, even though they submitted fewer applications
* Among people in free housing, most credit applications were for cars: people who did not have a house of their own were applying to buy cars