### Exploring Credit Risks

This activity is another open exploration of a dataset using both cleaning methods and visualizations.  The data describes customers as good or bad credit risks based on a small set of features specified below.  Your task is to create a Jupyter notebook with an exploration of the data using both your `pandas` cleaning and analysis skills and your visualization skills using `matplotlib`, `seaborn`, and `plotly`.  Your final notebook should be formatted with appropriate headers and markdown cells with written explanations for the code that follows. 

Post your notebook file in Canvas, along with a brief (3-4 sentence) description of what you found through your analysis. Respond to your peers with reflections on their analysis.

-----

1. [The Science Question](#top)
2. [Cleaning and Augmenting Data](#2)
3. [Initial Conclusions](#3)
4. [Individuals who miss payments](#4)
5. [Conclusion](#c)


#### <span id='top'>The Science Question</span>

I decided to begin with a general science question to direct the investigation of this data. The goal of any bank is to minimize credit risk and thereby loss. As such, my question is simply: what makes a bad borrower bad?

In [44]:
import pandas as pd
import seaborn as sns
import plotly.express as px
import numpy as np

In [50]:
df = pd.read_csv('data/dataset_31_credit-g.csv')
df.head(5)

Unnamed: 0,checking_status,duration,credit_history,purpose,credit_amount,savings_status,employment,installment_commitment,personal_status,other_parties,...,property_magnitude,age,other_payment_plans,housing,existing_credits,job,num_dependents,own_telephone,foreign_worker,class
0,'<0',6,'critical/other existing credit',radio/tv,1169,'no known savings','>=7',4,'male single',none,...,'real estate',67,none,own,2,skilled,1,yes,yes,good
1,'0<=X<200',48,'existing paid',radio/tv,5951,'<100','1<=X<4',2,'female div/dep/mar',none,...,'real estate',22,none,own,1,skilled,1,none,yes,bad
2,'no checking',12,'critical/other existing credit',education,2096,'<100','4<=X<7',2,'male single',none,...,'real estate',49,none,own,1,'unskilled resident',2,none,yes,good
3,'<0',42,'existing paid',furniture/equipment,7882,'<100','4<=X<7',2,'male single',guarantor,...,'life insurance',45,none,'for free',1,skilled,2,none,yes,good
4,'<0',24,'delayed previously','new car',4870,'<100','1<=X<4',3,'male single',none,...,'no known property',53,none,'for free',2,skilled,2,none,yes,bad


#### <span id="2"> Data Description: Cleaning and Augmenting Data </span>

```
1. Status of existing checking account, in Deutsche Mark.
2. Duration in months
3. Credit history (credits taken, paid back duly, delays, critical accounts)
4. Purpose of the credit (car, television,...)
5. Credit amount
6. Status of savings account/bonds, in Deutsche Mark.
7. Present employment, in number of years.
8. Installment rate in percentage of disposable income
9. Personal status (married, single,...) and sex
10. Other debtors / guarantors
11. Present residence since X years
12. Property (e.g. real estate)
13. Age in years
14. Other installment plans (banks, stores)
15. Housing (rent, own,...)
16. Number of existing credits at this bank
17. Job
18. Number of people being liable to provide maintenance for
19. Telephone (yes,no)
20. Foreign worker (yes,no)
```

Additional columns are derived from the ones above.
Using checking_status and savings_status:
```
21. checking_status_ct: corresponding to 'no checking':0, '<0':1, etc.
22. savings_status_ct: similarly as above
```
And with on/off params for credit_history, housing and job:
```
23. existing paid: 1 for yes, 0 for no
24. critical/other existing credit
25. delayed previously
26. own
27. rent
28. for free
29. skilled
30. unskilled resident
31. high qualif/self emp/mgmt
32. unemp/unskilled non res
```

and lastly, one flagging whether the customer is classified as a good risk:
```
33. is_good: 1 for is good, 0 for is bad
```
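
The on/off columns listed above are one-hot indicators. As a minimal sketch of the idea, assuming a small made-up frame (`toy` and its rows are hypothetical and not taken from the dataset file), `pd.get_dummies` builds one 0/1 column per category in a single call:

```python
import pandas as pd

# Toy rows, made up for illustration only.
toy = pd.DataFrame({
    "housing": ["own", "rent", "own", "'for free'"],
    "class": ["good", "bad", "good", "bad"],
})

# One 0/1 column per housing category, named after the category value.
housing_flags = pd.get_dummies(toy["housing"]).astype(int)

# Binary target: 1 for a good risk, 0 for a bad one.
toy["is_good"] = (toy["class"] == "good").astype(int)

toy = pd.concat([toy, housing_flags], axis=1)
print(toy)
```

Note that `get_dummies` creates a column for every category present, whereas the list above names only a chosen subset of `credit_history` values, so the two approaches are not interchangeable without a column selection step.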

In [52]:
### This cell adds the derived columns explained above.

# Ordinal codes for the account-balance categories.
# The raw values keep their literal single quotes, e.g. "'<0'".
checking_codes = {"'no checking'": 0, "'<0'": 1, "'0<=X<200'": 2, "'>=200'": 3}
savings_codes = {"'no known savings'": 0, "'<100'": 1, "'100<=X<500'": 2,
                 "'500<=X<1000'": 3, "'>=1000'": 4}

# map() yields NaN for any value missing from the dictionaries,
# which makes an unexpected category easy to spot.
df['checking_status_ct'] = df['checking_status'].map(checking_codes)
df['savings_status_ct'] = df['savings_status'].map(savings_codes)

# One 0/1 indicator column per listed category value, named after the value.
indicator_values = {
    'credit_history': ["'existing paid'", "'critical/other existing credit'",
                       "'delayed previously'"],
    'housing': ['own', 'rent', "'for free'"],
    'job': ['skilled', "'unskilled resident'", "'high qualif/self emp/mgmt'",
            "'unemp/unskilled non res'"],
}
for column, values in indicator_values.items():
    for value in values:
        df[value] = (df[column] == value).astype(int)

# Target flag: 1 if the customer is classified as a good risk, 0 otherwise.
df['is_good'] = (df['class'] == 'good').astype(int)


In [24]:
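# numeric_only=True skips the text columns; pandas 2.0+ raises an error without it.
df.corr(numeric_only=True)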

Unnamed: 0,duration,credit_amount,installment_commitment,residence_since,age,existing_credits,num_dependents,checking_status_ct,savings_status_ct,'existing paid','critical/other existing credit','delayed previously',own,rent,'for free',skilled,'unskilled resident','high qualif/self emp/mgmt','unemp/unskilled non res',is_good
duration,1.0,0.624984,0.074749,0.034067,-0.036136,-0.011284,-0.023834,0.03505,-0.064526,-0.069751,-0.075575,0.136927,-0.075169,-0.064417,0.189117,0.05501,-0.181203,0.147515,-0.044043,-0.214927
credit_amount,0.624984,1.0,-0.271316,0.028926,0.032716,0.020795,0.017142,0.024561,-0.107538,-0.086682,-0.041807,0.113552,-0.117497,-0.024611,0.201643,-0.092636,-0.161757,0.319715,-0.027969,-0.154739
installment_commitment,0.074749,-0.271316,1.0,0.049302,0.058266,0.021669,-0.071207,-0.057942,-0.000805,-0.020947,0.041089,-0.014597,0.049922,-0.091373,0.040098,0.042623,-0.057237,0.042805,-0.087834,-0.072404
residence_since,0.034067,0.028926,0.049302,1.0,0.266419,0.089625,0.042643,-0.059555,-0.011772,-0.081458,0.08846,-0.020351,-0.297547,0.167285,0.227044,-0.000657,0.009065,0.004952,-0.034545,-0.002967
age,-0.036136,0.032716,0.058266,0.266419,1.0,0.149254,0.118201,-0.049058,-0.017997,-0.155848,0.163681,0.016129,0.006553,-0.21262,0.253058,-0.148283,0.043712,0.127605,0.059954,0.091127
existing_credits,-0.011284,0.020795,0.021669,0.089625,0.149254,1.0,0.109667,-0.093081,-0.004176,-0.540354,0.501364,0.141742,0.041386,-0.05807,0.011406,-0.001471,-0.010392,-0.010906,0.059582,0.045732
num_dependents,-0.023834,0.017142,-0.071207,0.042643,0.118201,0.109667,1.0,-0.040889,-0.021302,-0.078339,0.021765,0.042526,-0.027579,-0.063033,0.118047,-0.106737,0.145066,-0.015096,-0.007723,0.003015
checking_status_ct,0.03505,0.024561,-0.057942,-0.059555,-0.049058,-0.093081,-0.040889,1.0,-0.005614,0.068012,-0.143082,0.010746,-0.043246,0.015874,0.043423,-0.08148,0.03868,0.02606,0.099622,-0.197788
savings_status_ct,-0.064526,-0.107538,-0.000805,-0.011772,-0.017997,-0.004176,-0.021302,-0.005614,1.0,-0.018038,0.005297,0.004675,0.014937,0.024267,-0.051742,0.039222,0.010348,-0.064459,-0.00127,0.033871
'existing paid',-0.069751,-0.086682,-0.020947,-0.081458,-0.155848,-0.540354,-0.078339,0.068012,-0.018038,1.0,-0.683617,-0.329862,-0.043805,0.084304,-0.040281,0.017015,0.010018,-0.025052,-0.022675,-0.043722


#### <span id="3">Face-value conclusions</span>

- "Good" classification is correlated:
    - negatively with credit amount and duration of the loan taken
    - positively with critical/other existing credit
    - positively with owning a home
    
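Oddly enough, the checking/savings fields do not show the expected positive relationship with the creditworthiness of the individual: `savings_status_ct` is essentially uncorrelated with `is_good` (0.03), and `checking_status_ct` is weakly negatively correlated (-0.20). Both encodings assign 0 to customers with no known account, so these coefficients mix account holders with non-holders.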

In [5]:
df[['checking_status_ct', 'savings_status_ct', 'is_good']].query('savings_status_ct>0 and checking_status_ct>0').corr()

Unnamed: 0,checking_status_ct,savings_status_ct,is_good
checking_status_ct,1.0,0.159763,0.140457
savings_status_ct,0.159763,1.0,0.12793
is_good,0.140457,0.12793,1.0


If we enforce that the borrower has a checking/savings account, we find that the "good" classification actually does correlate with their balance in the checking or savings account.
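
A correlation on the ordinal codes assumes the categories sit on one ordered scale, which the 0-coded "no account" group may violate. One way to sidestep that assumption is to look at the share of good risks per category. The sketch below uses a small made-up frame (`toy` and its values are hypothetical); on the real data the same `groupby` would run on `df`:

```python
import pandas as pd

# Toy rows, made up for illustration; labels mimic the quoted category values.
toy = pd.DataFrame({
    "checking_status": ["'no checking'", "'no checking'", "'<0'", "'<0'",
                        "'0<=X<200'", "'>=200'"],
    "is_good": [1, 1, 0, 1, 0, 1],
})

# Share of good risks and number of rows within each checking category.
good_rate = (
    toy.groupby("checking_status")["is_good"]
       .agg(["mean", "count"])
       .rename(columns={"mean": "good_rate", "count": "n"})
)
print(good_rate)
```

Reporting the row count next to the rate helps avoid reading too much into small categories.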

Let's also look at the way age plays a role.

In [23]:
px.violin(df, x='class', y='age', color='class', title='Violin KDE by age for good and bad borrowers', box=True)


Certain age bands are associated with riskier credit, roughly individuals in their early 20s.
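
To put a number on the age observation, one option is to bin age and compute the share of bad risks per band. This is a minimal sketch on made-up rows; the band edges and labels are assumptions chosen for illustration, not values from the analysis:

```python
import pandas as pd

# Toy rows, made up for illustration only.
toy = pd.DataFrame({
    "age": [21, 23, 24, 31, 35, 42, 47, 58],
    "is_good": [0, 0, 1, 1, 1, 0, 1, 1],
})

# Hypothetical band edges; right=False makes each band [low, high).
edges = [18, 25, 35, 50, 100]
labels = ["18-24", "25-34", "35-49", "50+"]
toy["age_band"] = pd.cut(toy["age"], bins=edges, labels=labels, right=False)

# Share of bad risks per age band.
bad_rate = 1 - toy.groupby("age_band", observed=True)["is_good"].mean()
print(bad_rate)
```

On the real data, the same steps would apply to `df['age']` and `df['is_good']`, and the result could be compared against the shape of the violin plot.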

This is all intriguing because it starts to form a picture of what kinds of individuals are 'bad risks':
- Taking on high credit amount and duration
- Young age
- No home ownership/property ownership
- Has a low amount in their checking/savings accounts
- Does not possess a checking account

It's worth looking into these covariates to understand how these variables affect one another. 

#### <span id='4'>Individuals who miss payments</span>

The most interesting relationship, right away, is that high credit amount and duration correlate weakly (about 0.11 and 0.14 respectively) with the likelihood that someone has "previously delayed" their credit payments. Looking at the effect of that behavior on the classification, however, does not yield a trend.

This is unusual, because you would expect someone who misses payments to be riskier altogether. 

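People who have delayed payments previously are also more likely to have other credit lines (0.14) and are much less likely to have already paid off their credit totals (-0.33). Part of that second figure is by construction: both flags are derived from the same `credit_history` column, so a row can never have both set to 1.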
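
One way to check whether a delayed-payment history shifts the classification is a row-normalised cross-tabulation of credit history against class. The sketch below uses made-up rows (`toy` is hypothetical); on the real data the same call would take `df['credit_history']` and `df['class']`:

```python
import pandas as pd

# Toy rows, made up for illustration only.
toy = pd.DataFrame({
    "credit_history": ["'delayed previously'", "'delayed previously'",
                       "'existing paid'", "'existing paid'", "'existing paid'",
                       "'critical/other existing credit'"],
    "class": ["good", "bad", "good", "good", "bad", "good"],
})

# normalize="index" turns counts into per-row shares, so each row sums to 1
# and the good/bad split can be compared across credit histories.
shares = pd.crosstab(toy["credit_history"], toy["class"], normalize="index")
print(shares)
```

If the "bad" share for the delayed group is close to the shares of the other groups, that supports the reading that the flag alone does not separate the classes.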

In [125]:
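# Select the category by name rather than by its rank in value_counts(),
# which would silently change if the category frequencies changed.
val = "'delayed previously'"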

px.histogram(df.query('credit_history == @val'), x='credit_amount', color='existing_credits', 
             title='Subset of individuals WITH previously delayed payments', nbins=10 )

In [114]:
px.histogram(df.query('credit_history != @val and credit_amount<16_000'), x='credit_amount', color='existing_credits',
             title = 'Individuals WITHOUT previously delayed payments', nbins=10 )

I subset the data for only those who had previously delayed payments and split up the histogram of their credit amount by how many other existing credit lines they already have open.

- Individuals with delayed payments are more likely to have two other existing credit lines.

This is interesting, as this is another indicator that the individual would be more likely to miss a payment.

That said, the number of existing credit lines correlates with age (0.15), itself a weak predictor of "good" classification, and is essentially uncorrelated with whether an individual is classified as "good" (0.05). So this chain of couplings breaks down at the last step: the number of existing credit lines says little about whether the individual is 'good' or not.

As we can see in the second plot, a large number of individuals without previous delayed payments ALSO have multiple lines of credit when taking out small loans. The number of individuals in this subset with a 'good' classification actually dwarfs the number with a 'bad' one. Though the goal of the science question is to minimize loss, we stumble onto the actual motivation of a bank: to make money.

It can actually be more lucrative to take bets on less-reliable individuals in this category, since more of them repay than not. As long as the gains from repaid loans outweigh the losses from defaults, the expected value is still positive, even if the credit risk is higher.
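
The expected-value argument can be made concrete with a simple two-outcome model. This is only a sketch: the function names and every number below are hypothetical, since the dataset contains no interest or loss figures.

```python
def expected_profit(p_repay: float, gain_if_repaid: float, loss_if_default: float) -> float:
    """Expected profit per loan when it is either repaid or defaulted."""
    return p_repay * gain_if_repaid - (1.0 - p_repay) * loss_if_default


def break_even_repay_rate(gain_if_repaid: float, loss_if_default: float) -> float:
    """Repayment probability at which the expected profit is exactly zero."""
    return loss_if_default / (gain_if_repaid + loss_if_default)


# Hypothetical figures: 60% of loans are repaid and earn 300 each,
# while the remaining 40% default and lose 400 each.
profit = expected_profit(0.60, 300.0, 400.0)
threshold = break_even_repay_rate(300.0, 400.0)
print(round(profit, 2), round(threshold, 3))
```

Under these assumed numbers the break-even repayment rate is above one half, which shows that "more succeed than not" is not sufficient on its own; the size of the gain relative to the loss matters as well.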

In [126]:
# Select the category by name rather than by its rank in value_counts().
val = "'delayed previously'"

px.histogram(df.query('credit_history == @val'), x='credit_amount', color='is_good', 
             title='Subset of individuals WITH previously delayed payments', nbins=10 )

#### <span id='c'>Conclusion</span>

I sought to form a profile of what an individual with 'bad' credit looks like.

An individual who is a 'bad' credit risk is:
- Taking on high credit amount and duration
- Young age
- No home ownership/property ownership
- Has a low amount in their checking/savings accounts
- Does not possess a checking account

Additional existing credit lines do not have a direct impact on the classification of the individual, although individuals with more credit lines are more likely to have delayed payments previously. This effect is softened because individuals with multiple credit lines generally take smaller loans; some of them delay payments, but they are counterbalanced by peers who ultimately do pay off their loans. My interpretation is that a delayed payment is not necessarily the sign of a 'bad' individual.