## PROJECT 1. Data Preprocessing


## Analyzing borrowers’ risk of defaulting

The goal of this project is to prepare a report for a bank’s loan division. We need to find out whether a customer’s marital status and number of children have an impact on whether they will default on a loan.

The bank already has some data on customers’ creditworthiness.

The report will be considered when building the **credit score** of a potential customer. A **credit score** is used to evaluate the ability of a potential borrower to repay their loan.

### Step 1. Open the data file and have a look at the general information. 

In [None]:
import pandas as pd

df = pd.read_csv("/datasets/credit_scoring_eng.csv")

#print(df.info())

print(df.head())

### Conclusion

First, we imported the data from the .csv file so that we can work with it in Python.

As the output of the _head()_ and _info()_ methods shows, the data has some issues:

- the days of employment look strange (they should be integers, not floats, and they should always be positive);
- the data in the "children", "debt", "family_status_id", "education_id" and "dob_years" columns can be stored as int8 or int16 instead of int64 to save memory;
- we have to get rid of duplicates caused by inconsistent letter case (e.g. _Secondary Education/secondary Education_);
- some values may be missing (we will check this later).

### Step 2. Data preprocessing

### Processing missing values

We use a small loop to print the number of _NaN/None_ values in each column of our dataframe. More than 2,100 values are missing in both "days_employed" and "total_income".
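
The per-column loop can also be written as a single vectorized call. A minimal sketch on a small made-up frame (the values are invented for illustration and are not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset; all values are invented.
toy = pd.DataFrame({
    "days_employed": [-8437.6, np.nan, -5623.4, np.nan],
    "total_income": [40620.1, np.nan, 23341.7, 42820.5],
    "children": [1, 0, 0, 3],
})

# isna().sum() counts the missing values of every column in one pass.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# The share of affected rows helps to decide between dropping and filling.
missing_share = toy.isna().mean()
print(missing_share.round(2))
```

Looking at the share as well as the absolute count makes it easier to judge how much data a plain `dropna()` would cost.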

In real life we could ask the person who gave us the dataset and find out together why the errors occurred and how to fix them.

Since the data in the "days_employed" column is not critical for our educational project (none of our questions concerns a connection between how long a customer has been working and repaying a loan on time), we are going to drop this data.

We are also going to fill in the missing data in the "total_income" column with the median income of the customer's group (the code below groups by education and income type).
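
To illustrate how a group-wise median fill works, here is a minimal sketch on invented data. It groups by a single column for brevity; the frame and its values are assumptions made for the example:

```python
import numpy as np
import pandas as pd

# Invented example data: two income types, each with one missing income.
toy = pd.DataFrame({
    "income_type": ["employee", "employee", "employee", "retiree", "retiree"],
    "total_income": [20000.0, 30000.0, np.nan, 18000.0, np.nan],
})

# transform("median") returns a Series aligned with the original rows that
# holds the median of each row's own group (NaN values are skipped).
group_median = toy.groupby("income_type")["total_income"].transform("median")

# fillna() with an aligned Series only touches the missing positions.
toy["total_income"] = toy["total_income"].fillna(group_median)
print(toy)
```

Because the result of `transform` has the same index as the frame, it can be passed straight to `fillna()` without any merging.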

In [57]:
#print(df["income_type"].unique())
 
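# Assign the result back; calling fillna(inplace=True) on a selected column is unreliable in recent pandas versions.
df['total_income'] = df['total_income'].fillna(df.groupby(['education', 'income_type'])['total_income'].transform('median'))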

print((df['total_income'] == 0).sum())

income_type
business                       27577.2720
civil servant                  24071.6695
employee                       22815.1035
entrepreneur                   79866.1030
paternity / maternity leave     8612.6610
retiree                        18962.3180
student                        15712.2600
unemployed                     21014.3605
Name: total_income, dtype: float64
0


In [56]:
columns = df.columns

# for column in columns:
#     print(df[column].isnull().sum())
# for column in columns:
#     print(df[column].value_counts())

# Drop the rows that still contain missing values.
df = df.dropna()


print(df.info())



<class 'pandas.core.frame.DataFrame'>
Int64Index: 19351 entries, 0 to 21524
Data columns (total 12 columns):
children            19351 non-null int64
days_employed       19351 non-null float64
dob_years           19351 non-null int64
education           19351 non-null object
education_id        19351 non-null int64
family_status       19351 non-null object
family_status_id    19351 non-null int64
gender              19351 non-null object
income_type         19351 non-null object
debt                19351 non-null int64
total_income        19351 non-null float64
purpose             19351 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 1.9+ MB
None



We also use a small loop to print the output of the _value_counts()_ method for each column in our dataframe.

This revealed a few obvious problems:

_children_:
- 67 persons have 20 kids each. Not absolutely impossible, but fishy. We have to dig a bit deeper.
- 44 persons have minus one kid each. We have to check whether they actually have one kid each or whether the data should be dropped.

_days_employed_:

- we have to convert the non-negative floats to integers
- we have to decide whether to take the absolute value of the negative values, process them in some other way, or drop them

_education_:

- get rid of the duplicates

_purpose_:

- get rid of the duplicates


In [5]:

has_20_kids = df[df["children"] == 20]

#print(has_20_kids.head(50))


We did not find any pattern that explains why these people allegedly have 20 children each.
It seems that this field was filled in manually and that we are dealing with a simple
typo (20 instead of 2). In real life we could ask the person who gave us the dataset
and find out together why the errors occurred and how to fix them.
In this educational case we can either drop this data as unrealistic or,
assuming that it is a typo, substitute "20" with "2" (_deductive imputation_). We choose the substitution.

In [6]:
df['children'] = df['children'].replace(20, 2)

#print(has_20_kids.head(10))

#print(df['children'].value_counts())


In [7]:
has_negativ_kids = df[df["children"] == -1]

#print(has_negativ_kids.head(10))

Same as above: we did not find any pattern that explains why 44 people allegedly have a negative number of children.
In real life we could ask the person who gave us the dataset and find out together
why the errors occurred and how to fix them.
In this educational project we can either drop this data as unreliable or,
assuming that it is a typo, substitute "-1" with "1" (_deductive imputation_). We choose the substitution.

In [8]:
df['children'] = df['children'].replace(-1, 1)

#print(df[df['children'] == -1])

The data in the "days_employed" column is corrupted: 15809 values (more than 75%) are negative. In real life we could ask the person who gave us the dataset and find out together why the errors occurred and how to fix them.
Since this data is not critical for our educational project (we have no questions about a connection between how long the customer has been working and repaying a loan on time), we convert the negative values to positive ones by taking their absolute value and later cast all floats in this column to integers. (Just to show that we can do it; we do not use this data further.)

In [9]:
negativ_days = df[df["days_employed"] < 0]

#print(df["days_employed"].head(50))

df["days_employed"] = df["days_employed"].abs()

#print(df["days_employed"].head())

We also briefly checked the credibility of the data in the "days_employed" column and found one more reason not to trust or use it: at least one employee allegedly worked for more than 89 years, and 1409 persons worked for more than 50 years (18250 days), which can hardly be true.
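
One way to make this credibility check explicit is to compare the employment length in years with the customer's age. A sketch on invented values; the 365-day year is a simplification and the toy frame is an assumption made for the example:

```python
import pandas as pd

# Invented rows: the second one claims far more working days than a lifetime.
toy = pd.DataFrame({
    "dob_years": [42, 53, 27],
    "days_employed": [8437.7, 340266.1, 1200.3],
})

years_employed = toy["days_employed"] / 365

# Nobody can have worked for longer than they have been alive.
implausible = years_employed > toy["dob_years"]

print(toy[implausible])
print(f"{implausible.mean():.0%} of rows are implausible")
```

A stricter rule (for example subtracting a minimum working age) would flag even more rows; the comparison against the full age is the most lenient version.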

In [10]:
#print(max(df["days_employed"])/ 365)

#print(df[df["days_employed"] > 18250]) 

As we found earlier, one person has a strange value in the "gender" field. If we knew the first name of this person, we could probably determine their gender. But we do not know it, so all we can do is drop this person's data.

In [11]:
# print(df[df["gender"] == "XNA"])

df = df[df["gender"] != "XNA"]

#print(df.info())

Let's check whether we have any children as bank customers, or people who are suspiciously old to be real customers.

In [12]:
print(len(df[df['dob_years'] < 18]))

#print(df['dob_years'].value_counts(ascending=True).head(200))

#print(df[df['dob_years'] < 18].head(50))

print(len(df[df['dob_years'] > 100]))

91
0


We found 91 persons aged under 18 (if we trust our data :-) ). Some of them are married, have kids and have many years of professional experience. So we assume that the real age of these people was not saved in our dataset. Since we do not use this information in our further analysis, we leave these values as they are. Alternatively, we could substitute them with the median age.
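
The median substitution mentioned as an alternative could look roughly like this. It is a sketch on invented ages, not something the notebook actually applies:

```python
import pandas as pd

# Invented ages; values below 18 stand in for the implausible entries.
ages = pd.Series([0, 35, 41, 0, 58, 29], name="dob_years")

# Compute the median from plausible ages only, so that the bad values
# do not pull it down.
valid_median = ages[ages >= 18].median()

# mask() replaces the values where the condition is True.
ages_fixed = ages.mask(ages < 18, valid_median)
print(ages_fixed.tolist())
```

Computing the median before masking matters: taking it over the whole column would include the very values that are being replaced.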

### Conclusion

We found some missing and unreliable data in our dataset. We preprocessed the dataset by dropping some data we are not going to use.

We kept the entries of more than 2,100 bank customers for whom we had no income data by filling in the median income of their income category (employees, unemployed, retirees, students etc.).

In two cases we also corrected typos using deductive imputation and the _replace()_ method.

### Data type replacement

Let's convert some datatypes to save memory and to make the data more convenient to handle. This reduces the memory usage by roughly 40% (from 1.9+ MB to 1.1+ MB according to _info()_). It is not really important in this particular case, but it can be very useful when working with large datasets.
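
The memory saving can also be measured directly instead of being read from _info()_. A minimal sketch on an invented frame; it uses `pd.to_numeric` with `downcast`, which picks the smallest fitting integer type automatically and is a different route from the explicit `astype` calls used in this notebook:

```python
import numpy as np
import pandas as pd

# Invented frame with two small-valued integer columns stored as int64.
toy = pd.DataFrame({
    "children": np.zeros(10_000, dtype="int64"),
    "debt": np.ones(10_000, dtype="int64"),
})

before = toy.memory_usage(deep=True).sum()

# downcast="integer" chooses the smallest integer dtype that holds the values.
small = toy.apply(pd.to_numeric, downcast="integer")
after = small.memory_usage(deep=True).sum()

print(small.dtypes)
print(f"{before} -> {after} bytes")
```

Explicit `astype` calls give full control over the target type, while the downcast variant avoids choosing a type that is too small for the data.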

In [58]:
df['children']=df['children'].astype('int8')
df['dob_years']=df['dob_years'].astype('int8')
df['education_id']=df['education_id'].astype('int8')
df['family_status_id']=df['family_status_id'].astype('int8')
df['days_employed']=df['days_employed'].astype('int32')
df['debt']=df['debt'].astype('int8')
df['education']=df['education'].astype('str')
df['family_status']=df['family_status'].astype('str')
df['income_type']=df['income_type'].astype('str')
df['purpose']=df['purpose'].astype('str')
df['total_income']=df['total_income'].astype('int32')

print(df.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 19351 entries, 0 to 21524
Data columns (total 12 columns):
children            19351 non-null int8
days_employed       19351 non-null int32
dob_years           19351 non-null int8
education           19351 non-null object
education_id        19351 non-null int8
family_status       19351 non-null object
family_status_id    19351 non-null int8
gender              19351 non-null object
income_type         19351 non-null object
debt                19351 non-null int8
total_income        19351 non-null int32
purpose             19351 non-null object
dtypes: int32(2), int8(5), object(5)
memory usage: 1.1+ MB
None


### Conclusion

We cast each column in our dataset to a more suitable datatype. As a result the data is more convenient to work with, and we also saved almost 40% of the memory used.

### Processing duplicates

In [59]:
df['education']=df['education'].str.lower()
df["purpose_cleaned"] =  df['purpose'].str.lower()


#print(df['education'].unique())

purposes_to_clean = df['purpose_cleaned'].unique().tolist()
print(len(purposes_to_clean))

#print(purposes_to_clean)

38


After converting all values to lower case, we found no more duplicates in the 'education' column.
But we still have some duplicates in the 'purpose' column. Let's handle them.

In [85]:
#print(len(df["purpose_cleaned"]))
#print(df['purpose_cleaned'].head(30))  
#print()

def purpose_cats(purpose):
    # Map a single lower-case purpose string to one of five categories.
    # The first matching word fragment decides the category.
    if "hous" in purpose:
        return "real estate"
    if "estat" in purpose:
        return "real estate"
    if "propert" in purpose:
        return "real estate"
    if "wedd" in purpose:
        return "wedding"
    if "car" in purpose:
        return "car"
    if "univers" in purpose:
        return "education"
    if "education" in purpose:
        return "education"
    return "other"
    
df['purpose_cleaned'] = df['purpose_cleaned'].apply(purpose_cats)



print(df['purpose_cleaned'].value_counts())
print()
print(df['purpose_cleaned'].value_counts().sum())
print()
print(len(df))

#print(df.head(30))

real estate    9758
car            3897
education      3240
wedding        2099
other           357
Name: purpose_cleaned, dtype: int64

19351

19351


We applied the function _purpose_cats()_ to replace the 38 unique credit purposes in the "purpose" column with only 5 categories, which we saved in the "purpose_cleaned" column. For practice it would obviously be better to implement stemming as we learned in last week's theory lessons, but in this case the grouping can also be done manually.
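
The manual grouping can also be driven by a lookup table instead of a chain of `if` statements, which keeps the word fragments in one place. A minimal sketch; the names `PURPOSE_KEYWORDS` and `categorize_purpose` and the example strings are hypothetical, and this is still plain substring matching rather than real stemming:

```python
# Hypothetical keyword table: the first matching fragment decides the category.
PURPOSE_KEYWORDS = [
    ("hous", "real estate"),
    ("estat", "real estate"),
    ("propert", "real estate"),
    ("wedd", "wedding"),
    ("car", "car"),
    ("univers", "education"),
    ("education", "education"),
]


def categorize_purpose(purpose):
    """Return the category for a single purpose string."""
    text = purpose.lower()
    for fragment, category in PURPOSE_KEYWORDS:
        if fragment in text:
            return category
    return "other"


# Invented example strings, not necessarily present in the dataset.
examples = [
    "Purchase of the house",
    "to have a wedding",
    "buy a second-hand car",
    "getting an education",
    "something else",
]
categories = [categorize_purpose(p) for p in examples]
print(categories)
```

Substring matching has a known weakness: a short fragment such as "car" would also match unrelated words that contain it, so the resulting categories are worth checking with `value_counts()`.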

Last but not least in this section: we have to get rid of duplicate rows in our dataset.
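
Before dropping anything it can be useful to count how many rows are affected. A small sketch on an invented frame, which also shows why the text columns are lower-cased first:

```python
import pandas as pd

# Invented rows that differ only in letter case.
toy = pd.DataFrame({
    "education": ["secondary education", "Secondary Education", "secondary education"],
    "children": [1, 1, 1],
})

# Differences in case hide duplicates, so normalize the text first.
toy["education"] = toy["education"].str.lower()

# duplicated() marks every repeat of an earlier row.
n_duplicates = toy.duplicated().sum()

deduplicated = toy.drop_duplicates().reset_index(drop=True)
print(n_duplicates, len(deduplicated))
```

Comparing `n_duplicates` with the row counts before and after the drop is a quick sanity check that only the repeats were removed.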

In [86]:
print(df.info())
df = df.drop_duplicates().reset_index(drop=True)
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19351 entries, 0 to 19350
Data columns (total 14 columns):
children            19351 non-null int8
days_employed       19351 non-null int32
dob_years           19351 non-null int8
education           19351 non-null object
education_id        19351 non-null int8
family_status       19351 non-null object
family_status_id    19351 non-null int8
gender              19351 non-null object
income_type         19351 non-null object
debt                19351 non-null int8
total_income        19351 non-null int32
purpose             19351 non-null object
purpose_cleaned     19351 non-null object
income_level        19351 non-null object
dtypes: int32(2), int8(5), object(7)
memory usage: 1.3+ MB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19351 entries, 0 to 19350
Data columns (total 14 columns):
children            19351 non-null int8
days_employed       19351 non-null int32
dob_years           19351 non-null int8
education           1

### Conclusion

We processed the duplicates in our dataset. The most challenging part was the "purpose"/"purpose_cleaned" columns, where we grouped the 38 existing purposes into 5 categories with an extra function. This categorisation helps us to answer the questions in step 3.

### Categorizing Data

We need to categorise the income level of bank customers to perform further analysis.

We do not really know which currency is used in our dataset, so we split the customers into four groups by income quartile: low, intermediate, upper middle class and high income.
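
pandas also offers a helper for quantile-based binning. The following sketch applies `pd.qcut` to invented income values with the same four labels; it is an alternative to a hand-written function, not the approach used in this notebook:

```python
import pandas as pd

# Invented incomes; with q=4 every bin receives a quarter of the values.
income = pd.Series([10, 20, 30, 40, 50, 60, 70, 80], name="total_income")

labels = [
    "Low income",
    "Intermediate income",
    "Upper middle class income",
    "High income",
]

# qcut computes the quartile boundaries itself and assigns one label per bin.
income_level = pd.qcut(income, q=4, labels=labels)
print(income_level.value_counts(sort=False))
```

The bins are closed on the right, so a value that equals a quartile boundary falls into the lower group, which matches a chain of `<=` comparisons.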

In [88]:
# Compute the quartile boundaries once instead of on every row.
q25, q50, q75 = df["total_income"].quantile([0.25, 0.5, 0.75])


def income_level(income):
    if income <= q25:
        return 'Low income'
    if income <= q50:
        return 'Intermediate income'
    if income <= q75:
        return 'Upper middle class income'
    return "High income"


df['income_level'] = df['total_income'].apply(income_level)

#print(df.info())
#print(df['income_level'].head(30))

### Conclusion

We categorised the income level of bank customers in our dataset and now we are ready to perform an analysis. 

### Step 3. Answer these questions

_- Is there a relation between having kids and repaying a loan on time?_

In [89]:
overdues = df[(df['debt'] == 1)].count() / df[(df['debt'] == 0)].count()

#print(overdues["debt"]) 

print("{0:.3f}% of customers are not repaying debt on time".format((overdues["debt"])*100))

pivot_kids = df.pivot_table(index=["debt"], columns = "children", values = "family_status_id", aggfunc = "count")

kid0 = pivot_kids[0][1] / pivot_kids [0][0]
kid1 = pivot_kids[1][1] / pivot_kids [1][0]
kid2 = pivot_kids[2][1] / pivot_kids [2][0]
kid3 = pivot_kids[3][1] / pivot_kids [3][0]
kid4 = pivot_kids[4][1] / pivot_kids [4][0]

print()
print(("{0:.3f}% No kids and not repaying debt on time".format(kid0*100)))
print(("{0:.3f}% One kid and not repaying debt on time".format(kid1*100)))
print(("{0:.3f}% Two kids and not repaying debt on time".format(kid2*100)))
print(("{0:.3f}% Three kids and not repaying debt on time".format(kid3*100)))
print(("{0:.3f}% Four kids and not repaying debt on time".format(kid4*100)))




8.836% of customers are not repaying debt on time

8.097% No kids and not repaying debt on time
10.368% One kid and not repaying debt on time
10.573% Two kids and not repaying debt on time
8.088% Three kids and not repaying debt on time
9.677% Four kids and not repaying debt on time


### Conclusion

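The answer: Yes. People who have no kids and people who have exactly three children seem to have better repayment discipline.

In the groups of customers who have 1, 2 or 4 children, the figure is slightly above the overall value.

Note that each figure divides the number of late payers by the number of on-time payers in the group, so it is a ratio rather than a share of all customers in the group; the ranking of the groups is the same either way.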
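
The per-column divisions can be condensed with `groupby`, which also covers child counts that are not listed by hand. A minimal sketch on made-up data; keep in mind that the mean of a 0/1 column is the share of late payers within the group, a slightly different quantity from the late-to-on-time ratio printed above:

```python
import pandas as pd

# Invented data: debt is 1 for a late payer and 0 otherwise.
toy = pd.DataFrame({
    "children": [0, 0, 0, 0, 1, 1, 2, 2],
    "debt":     [0, 0, 0, 1, 0, 1, 0, 0],
})

# Because debt is 0/1, its mean within a group is the share of late payers.
late_share = toy.groupby("children")["debt"].mean()

# The group sizes show how much data each share is based on.
group_size = toy.groupby("children")["debt"].size()

print(pd.DataFrame({"late_share": late_share, "customers": group_size}))
```

Printing the group size next to the share helps to spot categories that are too small to support a conclusion.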

_- Is there a relation between marital status and repaying a loan on time?_

In [90]:
pivot_marriage = df.pivot_table(index=["debt"], columns = "family_status", values = "dob_years", aggfunc = "count")

# print(pivot_marriage)

unmarried = pivot_marriage["unmarried"][1] / pivot_marriage["unmarried"][0]
married = pivot_marriage["married"][1] / pivot_marriage["married"][0]
divorced = pivot_marriage["divorced"][1] / pivot_marriage["divorced"][0]
widow_er = pivot_marriage["widow / widower"][1] / pivot_marriage["widow / widower"][0]
partnership = pivot_marriage["civil partnership"][1] / pivot_marriage["civil partnership"][0]

print("{0:.3f}% unmarried and not repaying debt on time".format(unmarried * 100))
print("{0:.3f}% married and not repaying debt on time".format(married * 100))
print("{0:.3f}% divorced and not repaying debt on time".format(divorced * 100))
print("{0:.3f}% widow_er and not repaying debt on time".format(widow_er * 100))
print("{0:.3f}% in civil partnership and not repaying debt on time".format(partnership * 100))

11.185% unmarried and not repaying debt on time
8.216% married and not repaying debt on time
7.547% divorced and not repaying debt on time
6.922% widow_er and not repaying debt on time
9.982% in civil partnership and not repaying debt on time


### Conclusion

The answer: Yes. 

The best customers (best repayment discipline) are widows/widowers (only 6.9% late payments). Only 7.5% of divorced persons fail to pay on time. Married persons pay punctually more often than the average customer (8.2% vs. 8.8%). Persons in a civil partnership and especially unmarried persons are the least reliable customers: in these groups the percentage of late payments is noticeably higher than average (10.0% and 11.2% respectively).
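
A table of this kind can also be produced in one step with `pd.crosstab`. The sketch below uses invented rows and normalizes each row so that it sums to 1; it illustrates the technique and is not the calculation used in the cell above:

```python
import pandas as pd

# Invented rows: debt is 1 for a late payer and 0 otherwise.
toy = pd.DataFrame({
    "family_status": ["married", "married", "married", "married",
                      "unmarried", "unmarried"],
    "debt": [0, 0, 0, 1, 0, 1],
})

# normalize="index" turns the counts of each row into shares of that row.
shares = pd.crosstab(toy["family_status"], toy["debt"], normalize="index")
print(shares)
```

The column labelled `1` then holds the share of late payers for every family status, without indexing each status by name.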

_- Is there a relation between income level and repaying a loan on time?_

In [94]:
pivot_income = df.pivot_table(index=["debt"], columns = "income_level", values = "dob_years", aggfunc = "count")

#print(pivot_income)

income0 = pivot_income["Low income"][1] / pivot_income["Low income"][0]
income1 = pivot_income["Intermediate income"][1] / pivot_income["Intermediate income"][0]
income2 = pivot_income["Upper middle class income"][1] / pivot_income["Upper middle class income"][0]
income3 = pivot_income["High income"][1] / pivot_income["High income"][0]

print("{0:.3f}% Low income and not repaying debt on time".format(income0 * 100))
print("{0:.3f}% Intermediate income and not repaying debt on time".format(income1 * 100))
print("{0:.3f}% Upper middle class income and not repaying debt on time".format(income2 * 100))
print("{0:.3f}% High income and not repaying debt on time".format(income3 * 100))


8.597% Low income and not repaying debt on time
9.531% Intermediate income and not repaying debt on time
9.658% Upper middle class income and not repaying debt on time
7.583% High income and not repaying debt on time


### Conclusion

The answer: Yes. 

People with the highest income have the best repayment discipline.

It is a bit surprising, but people with low income repay their debts fairly punctually on average, even slightly better than the average bank customer.

The two middle groups, with intermediate and upper middle class income, have the worst repayment discipline.

_- How do different loan purposes affect on-time repayment of the loan?_

In [95]:
pivot_purposes = df.pivot_table(index=["debt"], columns = "purpose_cleaned", values = "dob_years", aggfunc = "count")

print(pivot_purposes)

car = pivot_purposes["car"][1] / pivot_purposes["car"][0]
education = pivot_purposes["education"][1] / pivot_purposes["education"][0]
real_estate = pivot_purposes["real estate"][1] / pivot_purposes["real estate"][0]
wedding = pivot_purposes["wedding"][1] / pivot_purposes["wedding"][0]
other = pivot_purposes["other"][1] / pivot_purposes["other"][0]

print("{0:.3f}% of car loans were not repaid on time".format(car * 100))
print("{0:.3f}% of educational loans were not repaid on time".format(education * 100))
print("{0:.3f}% of real_estate loans were not repaid on time".format(real_estate * 100))
print("{0:.3f}% of wedding loans were not repaid on time".format(wedding * 100))
print("{0:.3f}% of other loans were not repaid on time".format(other * 100))


purpose_cleaned   car  education  other  real estate  wedding
debt                                                         
0                3530       2943    323         9043     1941
1                 367        297     34          715      158
10.397% of car loans were not repaid on time
10.092% of educational loans were not repaid on time
7.907% of real_estate loans were not repaid on time
8.140% of wedding loans were not repaid on time
10.526% of other loans were not repaid on time


### Conclusion

The answer: Yes. 

The least problematic are real estate loans and wedding loans (7.9% and 8.1% late repayments).
All other loan categories carry more risk: more than 10% delinquent payers.
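
Since the pivot table above lists raw counts, the ratio printed by the cell and the share of late payers among all loans of a category can be compared directly. A short worked calculation using those counts (the variable names are chosen for this example):

```python
# Counts copied from the pivot table printed above: (on time, late).
counts = {
    "car": (3530, 367),
    "education": (2943, 297),
    "other": (323, 34),
    "real estate": (9043, 715),
    "wedding": (1941, 158),
}

for purpose, (on_time, late) in counts.items():
    ratio = late / on_time            # late payers per on-time payer
    share = late / (on_time + late)   # late payers among all loans
    print(f"{purpose:12s} ratio {ratio:.3%}  share {share:.3%}")

total_on_time = sum(on_time for on_time, _ in counts.values())
total_late = sum(late for _, late in counts.values())
overall_share = total_late / (total_on_time + total_late)
print(f"overall share {overall_share:.3%}")
```

The share is always somewhat lower than the ratio because its denominator includes the late payers, but the order of the categories does not change.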

### Step 4. General conclusion

<div class="alert alert-info" role="alert">

In this project, we processed a dataset with standard information about our bank's clients to build a portrait of the ideal borrower.

We received a batch of raw data. Some information was missing (we filled in most of the missing income values with group medians to keep incomplete entries), and some was corrupted (we dropped data that could not be corrected). We also got rid of duplicates and categorised the customers into four large income groups, from low to high.

We also cast each column to a more compact datatype, which saved almost 40% of the memory used without reducing the amount of data.

The analysis reveals that on average there are 8.836 late payers for every 100 customers who repay on time; the percentages below are ratios of this kind.

Income level, marital status, having kids and loan purpose all affect the punctual repayment of loans.

Unmarried customers and customers in a civil partnership have the worst payment discipline (11.2% and 10.0% late payments).

Customers with one or two kids are late with payments more often than the average borrower (10.4% and 10.6% late payments).

It is a bit surprising that customers with intermediate and upper middle class income are late with payments more often than customers with low or high income (8.6% and 7.6% for low and high income against 9.5% and 9.7% for intermediate and upper middle class income).

Regarding the purpose of the loan, the safest categories are real estate and wedding loans (7.9% and 8.1% late payments). The delinquency rate of education and car loans is above average (10.1% and 10.4%).

Therefore our best customers (the groups with the smallest percentage of delinquent payers) are widows/widowers or divorced persons, with three children and high income, who borrow money to buy real estate or to organise a wedding.

We recommend that our colleagues from the bank's loan division use our analysis to refine the credit scoring methodology and thereby reduce the overall number of late payments.

 </div>