## Analyzing borrowers’ risk of defaulting

Your project is to prepare a report for a bank’s loan division. You’ll need to find out if a customer’s marital status and number of children have an impact on whether they will default on a loan. The bank already has some data on customers’ creditworthiness.

Your report will be considered when building a **credit score** for a potential customer. A **credit score** is used to evaluate the ability of a potential borrower to repay their loan.

<div class="alert alert-block alert-success">
<b>Third iteration</b>
    <p>Great job D'vorah, it was a pleasure to work with you on this project. Now let's start another one :)</p>
</div>


<div class="alert alert-block alert-warning">
<b>second iteration</b>
    <p>Good job! We are almost done - in your analysis you use sums and lens when looking into the data. Try counting means there - you will be able to see what differs in debt. After that I think we will finish with this project</p>
</div>

### Reviewer's comments
Hello D'vorah! My name is Anton Leonov, I will review your work and help you improve the project. My comments will be in those Markdown blocks. You can find my comments in <font color='green'>green</font>, <font color='amber'>yellow</font> or <font color='red'>red</font> boxes:

<div class="alert alert-block alert-success">
<b>Success:</b> You did everything well, good job!
</div>

<div class="alert alert-block alert-warning">
<b>Remarks: </b> You passed the task, but I also suggest how to do this part better next time
</div>

<div class="alert alert-block alert-danger">
<b>Needs fixing:</b> those are some problems that need to be fixed before the work can be finished
</div>
I would kindly ask you not to delete my comments - this will allow both of us to be on track with the task.

### Reviewer's comments

First of all, I'd like to add your comments and conclusions as Markdown comments and not comments in code - as it makes it a lot easier to follow the work. To do so, you either can edit cells that are already there for you (like "Conclusion" one after your data initialization), or add the new one - click on "+" at the top of the page and select the column to be from "Code" to "Markdown"



 Sorry.  I didn't realize what you meant by markdown.  I hope this is much better.  Thx

### Step 1. Open the data file and have a look at the general information. 

In [1]:
import pandas as pd
credit_score = pd.read_csv('/datasets/credit_scoring_eng.csv') 

credit_score.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
children            21525 non-null int64
days_employed       19351 non-null float64
dob_years           21525 non-null int64
education           21525 non-null object
education_id        21525 non-null int64
family_status       21525 non-null object
family_status_id    21525 non-null int64
gender              21525 non-null object
income_type         21525 non-null object
debt                21525 non-null int64
total_income        19351 non-null float64
purpose             21525 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 2.0+ MB


In [2]:
#print(credit_score['children'].value_counts())
#print(credit_score['days_employed'].value_counts())
#print(credit_score['dob_years'].value_counts())
#print(credit_score['education'].value_counts())
#print(credit_score['family_status'].value_counts())
#print(credit_score['gender'].value_counts())
#print(credit_score['income_type'].value_counts())
#print(credit_score['debt'].value_counts())
#print(credit_score['total_income'].value_counts())
#print(credit_score['purpose'].value_counts())


In [3]:
#credit_score.head(10)

In [4]:
credit_score.describe()

Unnamed: 0,children,days_employed,dob_years,education_id,family_status_id,debt,total_income
count,21525.0,19351.0,21525.0,21525.0,21525.0,21525.0,19351.0
mean,0.538908,63046.497661,43.29338,0.817236,0.972544,0.080883,26787.568355
std,1.381587,140827.311974,12.574584,0.548138,1.420324,0.272661,16475.450632
min,-1.0,-18388.949901,0.0,0.0,0.0,0.0,3306.762
25%,0.0,-2747.423625,33.0,1.0,0.0,0.0,16488.5045
50%,0.0,-1203.369529,42.0,1.0,0.0,0.0,23202.87
75%,1.0,-291.095954,53.0,1.0,1.0,0.0,32549.611
max,20.0,401755.400475,75.0,4.0,4.0,1.0,362496.645


### Conclusion

<div class="alert alert-block alert-warning">
    Here you can use <b>value_counts()</b>, <b>head()</b> as well as <b>describe()</b> to have a better understanding of the data. I'd like to suggest you to create 3 cells for each function and try them out on your dataset
</div>

### Step 2. Data preprocessing

### Processing missing values

In [5]:
import pandas as pd
credit_score = pd.read_csv('/datasets/credit_scoring_eng.csv')



# Fill with the number 0 (not the string '0') so the columns stay numeric
credit_score['days_employed'] = credit_score['days_employed'].fillna(0)
credit_score['total_income'] = credit_score['total_income'].fillna(0)





print(credit_score.isna().sum())




children            0
days_employed       0
dob_years           0
education           0
education_id        0
family_status       0
family_status_id    0
gender              0
income_type         0
debt                0
total_income        0
purpose             0
dtype: int64


Missing values found: there were blank slots in the days_employed and total_income columns.
I opted to use fillna() to fill the slots with 0.
At first, I thought that the slots had been left blank because the individuals were retirees or private business owners and as such wouldn't have the standard days_employed or total_income.  That, however, proved not to be true across the board, and thus I think they were left out simply due to human error.
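
One way to check whether the blanks cluster in a particular group is to compute the share of missing values per `income_type`. The sketch below uses a small made-up frame (the rows are invented for illustration and are not taken from the bank's data). On the real table, the same `groupby` call would have to run on `credit_score` before the `fillna()` step.

```python
import pandas as pd

# Toy data standing in for credit_score; the values are invented for illustration.
toy = pd.DataFrame({
    'income_type': ['employee', 'employee', 'retiree', 'retiree', 'business', 'business'],
    'total_income': [25000.0, None, 18000.0, None, None, 40000.0],
})

# Share of rows with a missing total_income within each income type.
# isna() gives True/False, and the mean of booleans is the fraction of True values.
missing_share = toy['total_income'].isna().groupby(toy['income_type']).mean()
print(missing_share)
```

If the shares come out roughly equal across groups, that would be consistent with the blanks not being tied to a particular kind of applicant.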

In [6]:
credit_score.head(15)

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
0,1,-8437.67,42,bachelor's degree,0,married,0,F,employee,0,40620.1,purchase of the house
1,1,-4024.8,36,secondary education,1,married,0,F,employee,0,17932.8,car purchase
2,0,-5623.42,33,Secondary Education,1,married,0,M,employee,0,23341.8,purchase of the house
3,3,-4124.75,32,secondary education,1,married,0,M,employee,0,42820.6,supplementary education
4,0,340266.0,53,secondary education,1,civil partnership,1,F,retiree,0,25378.6,to have a wedding
5,0,-926.186,27,bachelor's degree,0,civil partnership,1,M,business,0,40922.2,purchase of the house
6,0,-2879.2,43,bachelor's degree,0,married,0,F,business,0,38484.2,housing transactions
7,0,-152.78,50,SECONDARY EDUCATION,1,married,0,M,employee,0,21731.8,education
8,2,-6929.87,35,BACHELOR'S DEGREE,0,civil partnership,1,F,employee,0,15337.1,having a wedding
9,0,-2188.76,41,secondary education,1,married,0,M,employee,0,23108.2,purchase of the house for my family


### Conclusion

<div class="alert alert-block alert-warning">
    You don't actually need to "print(credit_score.head(15))", the better solution would be to create a new cell and just do credit_score.head(15):
</div>

In [7]:
#credit_score.head(5)

<div class="alert alert-block alert-warning">
Additionally, you have imported pandas and assigned credit_score twice, you don't really need to duplicate this part any more.
</div>

### Data type replacement

In [9]:
credit_score['days_employed'] = pd.to_numeric(credit_score['days_employed'], errors='coerce')
credit_score['total_income'] = pd.to_numeric(credit_score['total_income'], errors='coerce')

credit_score['days_employed'] = credit_score['days_employed'].astype('int')
credit_score['total_income'] = credit_score['total_income'].astype('int')


credit_score[credit_score['children'] < 0]
credit_score.loc[credit_score['children']<0,'children']=0

credit_score['days_employed'] = credit_score['days_employed'].abs()


print(credit_score['days_employed'])
print(credit_score['total_income'])

#print(credit_score.head())






0          8437
1          4024
2          5623
3          4124
4        340266
          ...  
21520      4529
21521    343937
21522      2113
21523      3112
21524      1984
Name: days_employed, Length: 21525, dtype: int64
0        40620
1        17932
2        23341
3        42820
4        25378
         ...  
21520    35966
21521    24959
21522    14347
21523    39054
21524    13127
Name: total_income, Length: 21525, dtype: int64

Here I changed days_employed and total_income to int64.  
I also changed the negatives found in days_employed to positives.  
I wasn't sure that this was the proper thing to do.  At first, I thought that the negatives might indicate days unemployed, 
but the retirees have positive days_employed values.  

The actual numbers listed as 'days' for retirees were much too high to be actual days.  For example, 400,000 days is over 
1,000 years.  However, if the values are read as hours, 400,000 hours comes to about 45 years, which makes more sense
for the span of a retiree's working life.

If I follow the logic of hours, then a negative number such as 131 (#2694) wouldn't make sense as time unemployed
because that would mean the person had lost their job about five days before coming in for a loan for a wedding.  So, I decided to 
make these numbers positive, assuming the negatives had been a human error too.

I also initially changed the negative number of children (-1) to positive (1).  At first, I thought that maybe the negative number 
indicated individuals who had children who weren't financially dependent on their parents (i.e., adults, etc).  Some of the 
applicants, however, were in their mid 20s (14359).  I suppose technically they might not have custody of or any financial 
responsibility for their children, but I didn't think that made sense, so I assumed, yet again, that the negative was a mistake.  
Also, I found over 70 cases of individuals with 20 children.  I thought about changing the 20s to 2s, but since it's 
possible for people to have that many kids and it didn't have an impact on my analysis, I left them as 20.  
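
The unit reasoning above can be checked with a few lines of arithmetic. This is only a sketch: the constants and the two helper functions are introduced here for illustration, and the "values are hours" reading is the assumption made in the text, not something the dataset documents.

```python
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365

def hours_to_years(hours):
    """Convert a number of hours to years, ignoring leap years."""
    return hours / HOURS_PER_DAY / DAYS_PER_YEAR

def hours_to_days(hours):
    """Convert a number of hours to days."""
    return hours / HOURS_PER_DAY

# 400,000 read as days: far longer than any working life.
print(round(400000 / DAYS_PER_YEAR))       # 1096 (years)

# 400,000 read as hours: a plausible career length.
print(round(hours_to_years(400000), 1))    # 45.7 (years)

# 131 read as hours: about five and a half days.
print(round(hours_to_days(131), 1))        # 5.5 (days)
```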




### Conclusion

<div class="alert alert-block alert-success">
This is a very in-depth analysis, I like it, good job! You actually even did an extra mile here, adding the examples to your conclusions - this is very good.
</div>

<div class="alert alert-block alert-warning">
Just to consider - you have changed all the children that were negative to '1', maybe it would be better either to have them at '0' if this is indeed some type of error, or take the absolute value (there is an <b>.abs()</b> function for this). Additionally, I'd like to suggest that you change the data types of some of the columns (like days_employed, total_income - you can take int values of them to ease life a bit in future)
</div>

Hi.  I thought I did change days_employed and total_income to int64.  Did I do it incorrectly above?  

Also, how do I know whether it's better to set the number of children listed as -1 to abs() or 0, since that type of change
would impact the outcome of the stats?  In this case, I went ahead and changed it to 0 as you suggested and will see below how it changes the stats.  
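
One way to approach this kind of question is to compute the statistic under each candidate fix and compare the results. The sketch below does that on a few invented rows. `rate_with_children` is a hypothetical helper, and the toy numbers say nothing about the real dataset.

```python
import pandas as pd

# Invented rows: two applicants have the suspicious value -1.
toy = pd.DataFrame({
    'children': [0, 0, 0, 1, 2, -1, -1, 1],
    'debt':     [0, 0, 1, 0, 1,  1,  0, 0],
})

def rate_with_children(children, debt):
    """Default rate among rows that have at least one child."""
    return debt[children > 0].mean()

# Candidate fix 1: treat -1 as 0 children.
as_zero = toy['children'].where(toy['children'] >= 0, 0)
# Candidate fix 2: treat -1 as 1 child (absolute value).
as_abs = toy['children'].abs()

print('-1 -> 0:', rate_with_children(as_zero, toy['debt']))
print('-1 -> 1:', rate_with_children(as_abs, toy['debt']))
```

If the conclusion holds under both treatments, the choice matters less. If it flips, the ambiguous rows deserve a closer look or could be reported separately.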

### Processing duplicates

In [9]:
print('Duplicate entries in the table:', credit_score.duplicated().sum()) 
credit_score = credit_score.drop_duplicates()



Duplicate entries in the table: 54


I used duplicated() to find the duplicates (54) and drop_duplicates() to remove them.
Possible reasons for duplicates:  Maybe the individual had applied before for other reasons.  Maybe it was a human error.  



In [10]:
credit_score['education'].unique()



array(["bachelor's degree", 'secondary education', 'Secondary Education',
       'SECONDARY EDUCATION', "BACHELOR'S DEGREE", 'some college',
       'primary education', "Bachelor's Degree", 'SOME COLLEGE',
       'Some College', 'PRIMARY EDUCATION', 'Primary Education',
       'Graduate Degree', 'GRADUATE DEGREE', 'graduate degree'],
      dtype=object)

In [11]:
# Normalize the case of the education labels so that identical values match
credit_score['education'] = credit_score['education'].str.lower()
print('Duplicate entries in the table:', credit_score.duplicated().sum()) 
print()
print()
# Drop the rows that became exact duplicates after lowercasing
credit_score = credit_score.drop_duplicates()
print(credit_score['education'].unique())
print('Duplicate entries in the table:', credit_score.duplicated().sum()) 



Duplicate entries in the table: 17


["bachelor's degree" 'secondary education' 'some college'
 'primary education' 'graduate degree']
Duplicate entries in the table: 0


### Conclusion

<div class="alert alert-block alert-danger">
Please look into education column (tip: use unique()). It will create duplicates for you that you did not filter.
</div>

<div class="alert alert-block alert-danger">
Please check the code after this block: you use <b>credit_score_</b> to delete duplicates, but use <b>credit_score</b> after that. Is that right?
</div>

In [12]:
#credit_score_ was a mistake on my part 

### Categorizing Data

In [13]:
 #With and without children

#total_default = credit_score[credit_score['debt'] == 1].count()


without_children = credit_score[credit_score['children'] == 0]
with_children = credit_score[credit_score['children'] > 0]
#gives a full list of information about individuals with and without children

default_without_children = without_children[without_children['debt'] == 1].count()['children']
default_with_children = with_children[with_children['debt'] == 1].count()['children']
#gives single number of people with and without children

total_without_children = without_children.count()['children']
total_with_children = with_children.count()['children']
#gives total number of applicants with or without children regardless of default status 

precent_without_children_default = (default_without_children/total_without_children) * 100.0
precent_with_children_default = (default_with_children/total_with_children) * 100.0
#gives the percentage of people with and without children that defaulted based on total applicants for each group 

print('Total number of people without children: {}'.format(total_without_children))
print('Total number of people without children that defaulted: {}'.format(default_without_children))
print()
print('Total number of people with children: {}'.format(total_with_children))
print('Total number of people with children that defaulted: {}'.format(default_with_children))
print()
print('Percentage of people without children that defaulted: {:.2f}'.format(precent_without_children_default))
print('Percentage of people with children that defaulted: {:.2f}'.format(precent_with_children_default))
    


Total number of people without children: 14138
Total number of people without children that defaulted: 1064

Total number of people with children: 7316
Total number of people with children that defaulted: 677

Percentage of people without children that defaulted: 7.53
Percentage of people with children that defaulted: 9.25


In [14]:
#marital status 




married = credit_score[credit_score['family_status_id'] == 0]
civil_partnership = credit_score[credit_score['family_status_id'] == 1]
widow_widower = credit_score[credit_score['family_status_id'] == 2]
divorced = credit_score[credit_score['family_status_id'] == 3]
unmarried = credit_score[credit_score['family_status_id'] == 4]
#gives a full list of information about individuals and their relationship status

not_married = credit_score[credit_score['family_status_id'] != 0]
#combination of all unmarried groups  


default_married = married[married['debt'] == 1].count()['family_status_id']
default_not_married = not_married[not_married['debt'] == 1].count()['family_status_id']
#number of married and unmarried individuals that defaulted 

total_married = married.count()['family_status_id']
total_not_married = not_married.count()['family_status_id']
#total married and not married individuals 

precent_married_default = (default_married/total_married) * 100.0
precent_not_married_default = (default_not_married/total_not_married) * 100.0
#gives the percentage of married and unmarried people that defaulted 

print('Total number of people married: {}'.format(total_married))
print('Total number of married people that defaulted: {}'.format(default_married))
print()
print('Total number of people not married: {}'.format(total_not_married))
print('Total number of not married people that defaulted: {}'.format(default_not_married))
print()
print('Percentage of married people that defaulted: {:.2f}'.format(precent_married_default))
print('Percentage of not married people that defaulted: {:.2f}'.format(precent_not_married_default))

Total number of people married: 12339
Total number of married people that defaulted: 931

Total number of people not married: 9115
Total number of not married people that defaulted: 810

Percentage of married people that defaulted: 7.55
Percentage of not married people that defaulted: 8.89


In [15]:
#income level 

low_income = 35000
upper_income = 100000


lower_class_income = credit_score[credit_score['total_income'] <= low_income]
middle_class_income = credit_score[(credit_score['total_income'] > low_income) & (credit_score['total_income'] < upper_income)]
upper_class_income = credit_score[(credit_score['total_income'] >= upper_income)]                                   
#divides american income classes into groups

default_lower_class_income = lower_class_income[lower_class_income['debt'] == 1].count()['total_income']
default_middle_class_income = middle_class_income[middle_class_income['debt'] == 1].count()['total_income']
default_upper_class_income = upper_class_income[upper_class_income['debt'] == 1].count()['total_income']                                
#number of people from each group that defaulted

total_lower_class_income = lower_class_income.count()['total_income']
total_middle_class_income = middle_class_income.count()['total_income']
total_upper_class_income = upper_class_income.count()['total_income']                                   
#total number of individuals from each group

precent_lower_class_income_default = (default_lower_class_income/total_lower_class_income) * 100.0
precent_middle_class_income_default = (default_middle_class_income/total_middle_class_income) * 100.0
percent_upper_class_income_default = (default_upper_class_income/total_upper_class_income) * 100.0
#gives the percentage of individuals from each group that defaulted



print('Total number of people in the lower class income bracket: {}'.format(total_lower_class_income))
print('Total number of people in the lower class income bracket that defaulted: {}'.format(default_lower_class_income))
print()
print('Total number of people in the middle class income bracket: {}'.format(total_middle_class_income))
print('Total number of people in the middle class income bracket that defaulted: {}'.format(default_middle_class_income))
print()
print('Total number of people in the upper class income bracket: {}'.format(total_upper_class_income))
print('Total number of people in the upper class income bracket that defaulted: {}'.format(default_upper_class_income))
print()                              
print('Percentage of people from the lower class income bracket that defaulted: {:.2f}'.format(precent_lower_class_income_default))
print('Percentage of people from the middle class income bracket that defaulted: {:.2f}'.format(precent_middle_class_income_default))
print('Percentage of people from the upper class income bracket that defaulted: {:.2f}'.format(percent_upper_class_income_default))


Total number of people in the lower class income bracket: 17388
Total number of people in the lower class income bracket that defaulted: 1451

Total number of people in the middle class income bracket: 3967
Total number of people in the middle class income bracket that defaulted: 284

Total number of people in the upper class income bracket: 99
Total number of people in the upper class income bracket that defaulted: 6

Percentage of people from the lower class income bracket that defaulted: 8.34
Percentage of people from the middle class income bracket that defaulted: 7.16
Percentage of people from the upper class income bracket that defaulted: 6.06
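
The three income brackets could also be built in one step with `pd.cut`, which avoids repeating the filter, count and percentage code for every group. This is a sketch on invented incomes, not the notebook's method. Note one difference: `pd.cut` closes each interval on the right by default, so an income of exactly 100000 lands in the middle bracket here, whereas the cell above puts it in the upper bracket.

```python
import pandas as pd

# Invented incomes and default flags, chosen to sit near the bracket edges.
toy = pd.DataFrame({
    'total_income': [12000, 35000, 36000, 99999, 100000, 250000],
    'debt':         [1,     0,     0,     1,     0,      0],
})

# Right-closed bins: (-inf, 35000], (35000, 100000], (100000, inf)
income_bracket = pd.cut(
    toy['total_income'],
    bins=[-float('inf'), 35000, 100000, float('inf')],
    labels=['lower', 'middle', 'upper'],
)

# One row per bracket: group size, number of defaults, default rate.
summary = toy.groupby(income_bracket, observed=True)['debt'].agg(
    total='count', defaulted='sum', rate='mean'
)
summary['rate'] = summary['rate'] * 100  # express the rate in percent
print(summary)
```

Because every bracket goes through the same `rate * 100` line, all three percentages are guaranteed to be on the same scale.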


In [16]:
#Income level 2: mean/median 

mean_income = credit_score['total_income'].mean()
median_income = credit_score['total_income'].median()
print()

total_people_above_mean = credit_score[(credit_score['total_income'] > mean_income)]['total_income'].count()
total_people_below_mean = credit_score[(credit_score['total_income'] < mean_income)]['total_income'].count()

total_people_above_median = credit_score[(credit_score['total_income'] > median_income)]['total_income'].count()
total_people_below_median = credit_score[(credit_score['total_income'] < median_income)]['total_income'].count()



above_mean_default = credit_score[(credit_score['total_income'] > mean_income) & (credit_score['debt'] == 1)]['total_income'].count()
below_mean_default = credit_score[(credit_score['total_income'] < mean_income) & (credit_score['debt'] == 1)]['total_income'].count()
print()
above_median_default = credit_score[(credit_score['total_income'] > median_income) & (credit_score['debt'] == 1)]['total_income'].count()
below_median_default = credit_score[(credit_score['total_income'] < median_income) & (credit_score['debt'] == 1)]['total_income'].count()


percent_above_mean_default = (above_mean_default/total_people_above_mean) * 100
percent_below_mean_default = (below_mean_default/total_people_below_mean) * 100
percent_above_median_default = (above_median_default/total_people_above_median) * 100
percent_below_median_default = (below_median_default/total_people_below_median) * 100

print('Mean Income: {:.1f}'.format(mean_income))
print('Median Income: {:.1f}'.format(median_income))
print()
print('Total number of people above the mean: {}'.format(total_people_above_mean))
print('Total number of people below the mean: {}'.format(total_people_below_mean))
print()
print('Total number of people above median: {}'.format(total_people_above_median))
print('Total number of people below the median: {}'.format(total_people_below_median))
print()
print('Total number of people above the mean that defaulted: {} which is {:.2f} %'.format(above_mean_default, percent_above_mean_default))
print('Total number of people below the mean that defaulted: {} which is {:.2f} %'.format(below_mean_default, percent_below_mean_default))
print()
print('Total number of people above the median that defaulted: {} which is {:.2f} %'.format(above_median_default, percent_above_median_default))
print('Total number of people below the median that defaulted: {} which is {:.2f} %'.format(below_median_default, percent_below_median_default))
print()



Mean Income: 24161.3
Median Income: 21724.5

Total number of people above the mean: 9098
Total number of people below the mean: 12356

Total number of people above median: 10727
Total number of people below the median: 10727

Total number of people above the mean that defaulted: 713 which is 7.84 %
Total number of people below the mean that defaulted: 1028 which is 8.32 %

Total number of people above the median that defaulted: 864 which is 8.05 %
Total number of people below the median that defaulted: 877 which is 8.18 %



In [17]:
#event type 

#print(credit_score['purpose'].value_counts())

import nltk
from nltk.stem import WordNetLemmatizer
wordnet_lemma = WordNetLemmatizer()
from collections import Counter



def loan_purpose_lemma(row): 
    purpose = row['purpose']
    words = nltk.word_tokenize(purpose)
    lemmas = [wordnet_lemma.lemmatize(w, pos = 'n') for w in words]
   
    #print(Counter(lemmas))
    #return lemmas
    #was used for debugging
    

    if ('wedding' in lemmas) or ('ceremony' in lemmas): 
        return 'wedding'
    elif 'car' in lemmas:
        return 'car'
    elif ('education' in lemmas) or ('university' in lemmas) or ('supplementary' in lemmas) or ('educated' in lemmas):
        return 'education'
    elif ('house' in lemmas) or ('estate' in lemmas) or ('property' in lemmas) or ('housing' in lemmas)\
    or ('commercial' in lemmas) or ('building' in lemmas) or ('construction' in lemmas) or ('renovation' in lemmas)\
    or ('residential' in lemmas):
        return 'real estate'
    else: 
        return 'other'

    
 

credit_score['purpose_category'] = credit_score.apply(loan_purpose_lemma, axis=1)

print("================")
print(credit_score['purpose_category'].value_counts())





real estate    10811
car             4306
education       4013
wedding         2324
Name: purpose_category, dtype: int64


In [18]:
#wedding


wedding = credit_score[credit_score['purpose_category'] == 'wedding']
default_wedding = wedding[wedding['debt'] == 1].count()['purpose_category']
total_wedding_loans = wedding.count()['purpose_category']
precent_wedding_loan_default = (default_wedding/total_wedding_loans) * 100.0


print('Total number of loans for weddings: {}'.format(total_wedding_loans))
print('Total number of loans for weddings that defaulted {}'.format(default_wedding))
print('Percentage of loans for weddings that defaulted: {:.2f}'.format(precent_wedding_loan_default))


Total number of loans for weddings: 2324
Total number of loans for weddings that defaulted 186
Percentage of loans for weddings that defaulted: 8.00


In [19]:
#car 


car = credit_score[credit_score['purpose_category'] == 'car']
default_car = car[car['debt'] == 1].count()['purpose_category']
total_car_loans = car.count()['purpose_category']
precent_car_loan_default = (default_car/total_car_loans) * 100.0



print('Total number of loans for cars: {}'.format(total_car_loans))
print('Total number of loans for cars that defaulted {}'.format(default_car))
print('Percentage of loans for cars that defaulted: {:.2f}'.format(precent_car_loan_default))   


Total number of loans for cars: 4306
Total number of loans for cars that defaulted 403
Percentage of loans for cars that defaulted: 9.36


In [20]:
#education 

education = credit_score[credit_score['purpose_category'] == 'education']
default_education = education[education['debt'] == 1].count()['purpose_category']
total_education_loans = education.count()['purpose_category']
precent_education_loan_default = (default_education/total_education_loans) * 100.0


print('Total number of loans for education: {}'.format(total_education_loans))
print('Total number of loans for education that defaulted {}'.format(default_education))
print('Percentage of loans for education that defaulted: {:.2f}'.format(precent_education_loan_default))


Total number of loans for education: 4013
Total number of loans for education that defaulted 370
Percentage of loans for education that defaulted: 9.22


In [21]:
#real estate 

real_estate = credit_score[credit_score['purpose_category'] == 'real estate']
default_real_estate = real_estate[real_estate['debt'] == 1].count()['purpose_category']
total_real_estate_loans = real_estate.count()['purpose_category']
precent_real_estate_loan_default = (default_real_estate/total_real_estate_loans) * 100.0

   

print('Total number of loans for real estate: {}'.format(total_real_estate_loans))
print('Total number of loans for real estate that defaulted {}'.format(default_real_estate))
print('Percentage of loans for real estate that defaulted: {:.2f}'.format(precent_real_estate_loan_default))


Total number of loans for real estate: 10811
Total number of loans for real estate that defaulted 782
Percentage of loans for real estate that defaulted: 7.23


<div class="alert alert-block alert-warning">
Could you please look into lemmatization here? I think you could have missed some of the data.
    In the tasks: <b>Data preprocessing -> Looking for duplicates -> Lemmatization </b>
</div>

<div class="alert alert-block alert-warning">
I'd like to ask you to split the code a bit in these cells - it is hard to get the whole idea of what you did there. I appreciate the comments in code however! :)
</div>
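
One possible way to shorten the per-category cells above is a small helper that computes the count, the number of defaults and the default rate for every value of a column at once. The sketch below is only an illustration: `default_summary` is a hypothetical name, and the toy rows are invented rather than taken from `credit_score`.

```python
import pandas as pd

def default_summary(df, column):
    """Total rows, defaults and default rate (in percent) per value of `column`."""
    summary = df.groupby(column)['debt'].agg(
        total='count', defaulted='sum', rate='mean'
    )
    summary['rate'] = (summary['rate'] * 100).round(2)
    return summary

# Invented rows for demonstration.
toy = pd.DataFrame({
    'purpose_category': ['car', 'car', 'wedding', 'real estate', 'real estate', 'education'],
    'debt':             [1,     0,     0,         0,             1,             0],
})

result = default_summary(toy, 'purpose_category')
print(result)
```

The same call with `'family_status'` or another categorical column would produce the matching table for that grouping, so each question needs one line instead of one cell per category.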

### Conclusion

### Step 3. Answer these questions

- Is there a relation between having kids and repaying a loan on time?

Initial calculations based on -1 = 1

There does seem to be a correlation between having kids and whether or not an individual pays their loans back on time.  
Based on my calculations, the default rate for individuals with kids was 1.5 percentage points higher than for those without.  
The reasons for this could vary.  I would need to do a more in-depth analysis to be certain.  It could simply be because having 
kids is expensive.  It could also be related to whether or not a parent is single and raising a child on their own, 
education/salary level combined with the cost of raising a child, or the number of children present in the home.  

Calculations based on -1 = 0 

There is still a correlation between having kids and defaulting.  Based on the new calculations, the default rate for individuals with kids is 1.7
percentage points higher (9.25% versus 7.53%).  I'm not sure how accurate the number is since I'm not sure whether the -1 should be a 1 or 0, but 
either way, there is a correlation between having kids and defaulting.  

In [28]:

import numpy as np  # needed for np.mean in the pivot tables

def with_child_check(row):
    
    if row['children'] > 0:
        return 'With Child'
    else:
        return "Without Child"

credit_score['Has Child'] = credit_score.apply(with_child_check, axis = 1)  
############################################
#data_pivot = credit_score.pivot_table(index=['family_status'], columns='Has Child', values=[''], aggfunc='len')



#print(credit_score['Has Child'].value_counts())
#number of people with and without children
#print()
#print(credit_score['debt'].value_counts())
#number of people that defaulted or didn't 
print()

data_pivot = credit_score.pivot_table(index=[ 'Has Child'], values=['debt'],\
                                      aggfunc={'debt':[sum,len, np.mean]} )

print(data_pivot)



                  debt                  
                   len      mean     sum
Has Child                               
With Child      7316.0  0.092537   677.0
Without Child  14138.0  0.075258  1064.0


I really struggled with the pivot tables and getting them to work, and I feel like the ones I've added are lacking.  
This table shows the number of individuals with and without children (len) and how many of them defaulted (sum). I added the mean, which divides sum by len, to give a percentage comparison. 
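
As a quick check of that arithmetic, the rates can be recomputed from the counts in the table. The snippet below is a small sketch that copies the `len` and `sum` values by hand (the `groups` dictionary is just an illustrative container). It also separates the gap in percentage points from the relative increase, which are easy to mix up.

```python
# Counts copied from the pivot table: (defaults, borrowers) per group.
groups = {
    "With Child": (677, 7316),
    "Without Child": (1064, 14138),
}

# Default rate = number of defaults / number of borrowers.
rates = {name: defaults / total for name, (defaults, total) in groups.items()}

# Absolute gap in percentage points vs. relative increase in risk.
gap_pp = (rates["With Child"] - rates["Without Child"]) * 100
relative = (rates["With Child"] / rates["Without Child"] - 1) * 100

for name, rate in rates.items():
    print(f"{name}: {rate:.2%}")
print(f"Gap: {gap_pp:.2f} percentage points ({relative:.1f}% relative increase)")
```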


<div class="alert alert-block alert-danger">
It would be great to have code here that proves your conclusions. Maybe it is a good idea to reuse some of the code from the previous part, or better yet, to build pivot tables or plot some graphs. This would help you a lot in this task. And please add them as Markdown blocks for better visibility.
</div>

<div class="alert alert-block alert-danger">
<b> Second iteration</b>
We are almost done - you do sums and lens, wouldn't it be useful to do "mean" too for example?</div>

### Conclusion

- Is there a relation between marital status and repaying a loan on time?

There also seems to be a correlation between marital status and whether or not an individual defaults on their loan.  
According to my calculations, the default rate for people who aren't married was 1.3 percentage points higher than for their married counterparts.  
I struggled with how to analyze this data.  I considered combining married people and people in a civil partnership, since both 
are generally long-term relationships.  As the question specifically stated 'married,' I opted to separate them in my analysis.  
A possible reason that married couples defaulted less could be that there are two incomes in the home.  However, 
many of the individual applicants who were married had very low incomes, which indicates a single-income household.  Possibly, 
the reason could be that married couples are more 'stable' and have much more to lose, as it were, by defaulting on a loan.  

In [29]:
data_pivot = credit_score.pivot_table(index=[ 'family_status'], values=['debt'],\
                                      aggfunc={'debt':[sum, len, np.mean]} )

print(data_pivot)


                      debt                 
                       len      mean    sum
family_status                              
civil partnership   4151.0  0.093471  388.0
divorced            1195.0  0.071130   85.0
married            12339.0  0.075452  931.0
unmarried           2810.0  0.097509  274.0
widow / widower      959.0  0.065693   63.0


Again, I struggled, as I wasn't sure how to get the unmarried individuals combined into one group.  The len column shows 
the total number of individuals in each group and sum shows how many of them defaulted.  
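
One possible way to combine every status other than 'married' into a single group is sketched below. It works on the counts copied from the table, so the frame `by_status` and the column names `borrowers`, `defaults` and `default_rate` are illustrative assumptions rather than part of the notebook.

```python
import pandas as pd

# Counts copied from the family_status pivot table (len and sum columns).
by_status = pd.DataFrame({
    "family_status": ["civil partnership", "divorced", "married",
                      "unmarried", "widow / widower"],
    "borrowers": [4151, 1195, 12339, 2810, 959],
    "defaults": [388, 85, 931, 274, 63],
})

# Keep 'married' as is and relabel every other status as 'non-married'.
by_status["married_status"] = by_status["family_status"].where(
    by_status["family_status"] == "married", "non-married"
)

combined = by_status.groupby("married_status")[["borrowers", "defaults"]].sum()
combined["default_rate"] = combined["defaults"] / combined["borrowers"]
print(combined)
```

On the full data, the same `where` call could be applied to `credit_score['family_status']` before building the pivot table.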

<div class="alert alert-block alert-danger">
Same as previous - please put the code as well as a Markdown block here. If you ask me, I think being married may be connected to the person being "settled" - i.e. not willing to take risks or, for example, to leave the country without paying. I think that your idea to take only "married" was right.
</div>

### Conclusion

- Is there a relation between income level and repaying a loan on time?

This was a tricky analysis to conduct, as the way people are divided into income classes varies depending on which source you use.

I opted to go middle of the road in my breakdown, although I'm not certain I agree with my own decision.  Most sources list low income as being 45,000 or less.  I opted to go with 35,000 or less, as I, personally, am from an area where the average income per individual is 15,000-19,000 per year, and that is before taxes are taken out. 

That being said, according to my analysis, income does in fact impact whether one defaults on a loan or not.  
The default rate for individuals in the low income bracket was 1.1 percentage points higher than for individuals in the middle income bracket.  
Moreover, individuals from the lower income bracket and middle income bracket both defaulted more often (8.25% and 7.1%) 
than individuals from the higher income bracket, who defaulted about 6% of the time (6 of the 99 high-income borrowers).  



Based on the second income-level analysis with mean/median, income still seems to be a factor in whether loans are defaulted on, but less so.  Based on the median income, there was virtually no difference in default rates.  Based on the mean income, those who fell below the mean were more likely to default by 0.48%.  

In the pivot table below, 0 represents the mean and median incomes of the individuals who didn't default and 1 represents 
the mean and median incomes of the individuals who did default.  There is a difference, but it's very slight.  

I'm not convinced that this is giving an accurate picture, since it seems to indicate that there is virtually no difference 
in default rate based on income level.  

In [23]:
#income level 

import numpy as np

data_pivot = credit_score.pivot_table(index=[ 'debt'], values=['total_income'],\
                                      aggfunc={'total_income':[np.mean, np.median]} )

print(data_pivot)

      total_income         
              mean   median
debt                       
0     24215.510475  21738.0
1     23547.534750  21694.0


<div class="alert alert-block alert-danger">
Same as previous - please put the code as well as markdown block here.
</div>

<div class="alert alert-block alert-danger">
I like your idea of checking what the income classes are in your area and other areas and what counts as low income. However, for this task let me suggest doing it based on the data you have: you can check the mean/median income, see how many people are near the mean/median, and decide based on that data.
</div>

### Conclusion

- How do different loan purposes affect on-time repayment of the loan?

Default rates by loan purpose:

- wedding: 8%
- cars: 9.36%
- education: 9.22%
- real estate: 7.23%

(I think I wrote this before but it got deleted somehow.)

There does seem to be a difference in the default rate based on loan purpose.  Loans for real estate purposes have the lowest default rate.  This may be because an investment in a home or business is more serious than an investment in a car or in a university degree that one may or may not finish.  

In [30]:
data_pivot = credit_score.pivot_table(index=['purpose_category'], values=['debt'],
                                      aggfunc={'debt': [sum, len, np.mean]})

print(data_pivot)
print()

# Default rate per purpose, computed from the data rather than typed in by hand.
default_rates = credit_score.groupby('purpose_category')['debt'].mean() * 100
for purpose, rate in default_rates.items():
    print("{} loans default percent: {:.2f} %".format(purpose.capitalize(), rate))

                     debt                 
                      len      mean    sum
purpose_category                          
car                4306.0  0.093590  403.0
education          4013.0  0.092200  370.0
real estate       10811.0  0.072334  782.0
wedding            2324.0  0.080034  186.0

Car loans default percent: 9.36 %
Education loans default percent: 9.22 %
Real estate loans default percent: 7.23 %
Wedding loans default percent: 8.00 %


### Step 4. General conclusion

In general, married individuals seemed to be more stable and defaulted less.  Perhaps it's because their financial situations are 
generally more stable.  Perhaps it's because they have more than one income coming into the home.  And perhaps it's because 
they simply have more to lose by defaulting.  However, individuals with children did default more, which is most likely attributable to the cost of raising children.  Sometimes food, medicine and clothes have to come first.  

In [25]:
low_income = 35000
upper_income = 100000

def income_class_split(row):
    # Bucket total_income using the two thresholds defined above.
    income = row['total_income']
    if income <= low_income:
        return "low_income"
    elif income < upper_income:
        return "mid_income"
    else:
        return "high_income"

credit_score['income_level'] = credit_score.apply(income_class_split, axis=1)

def married_status_split(row):
    # Exact match, so only 'married' counts as married.
    if row['family_status'] == 'married':
        return 'married'
    else:
        return 'non-married'

credit_score['married_status'] = credit_score.apply(married_status_split, axis=1)
############################################
credit_score['no_debt'] = 1-credit_score['debt']
data_pivot_debt = credit_score.pivot_table(index=[ 'income_level'], columns = ['married_status', 'Has Child'], values=['debt'],\
                                       aggfunc=["sum"] )
data_pivot_total = credit_score.pivot_table(index=[ 'income_level'], columns = ['married_status', 'Has Child'], values=['debt'],\
                                       aggfunc=["count"] )
print ("Didn't pay debt (Default)\n=====================")
print(data_pivot_debt)
print ("Total in each group\n=====================")
print(data_pivot_total)


Didn't pay debt (Default)
                      sum                                        
                     debt                                        
married_status    married               non-married              
Has Child      With Child Without Child  With Child Without Child
income_level                                                     
high_income             1             2           0             3
low_income            335           430         224           462
mid_income             78            85          39            82
Total in each group
                    count                                        
                     debt                                        
married_status    married               non-married              
Has Child      With Child Without Child  With Child Without Child
income_level                                                     
high_income            29            30          16            24
low_income           3852     

<div class="alert alert-block alert-danger">
Please prove this with code :)
</div>

Again, I feel like this pivot table is a disaster.  
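
A table with several grouping levels can be easier to read when it shows default rates directly, with a second table of group sizes next to it. The following is a minimal sketch on made-up data: the `toy` frame is hypothetical, while `income_level`, `married_status` and `debt` are the column names used in the notebook.

```python
import pandas as pd

# Toy data (made up for illustration); the notebook's frame is credit_score.
toy = pd.DataFrame({
    "income_level":   ["low_income"] * 4 + ["mid_income"] * 4,
    "married_status": ["married", "married", "non-married", "non-married"] * 2,
    "debt":           [0, 1, 1, 1, 0, 0, 0, 1],
})

# Default rate per cell: the mean of the 0/1 debt column.
rates = toy.pivot_table(index="income_level", columns="married_status",
                        values="debt", aggfunc="mean")

# Group sizes, to judge how much each rate can be trusted.
counts = toy.pivot_table(index="income_level", columns="married_status",
                         values="debt", aggfunc="count")

print(rates)
print(counts)
```

Keeping the counts visible matters for small cells: a rate computed from a few dozen borrowers can swing a lot with a single default.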

### Project Readiness Checklist

Put 'x' in the completed points. Then press Shift + Enter.

- [x]  file open;
- [ ]  file examined;
- [ ]  missing values defined;
- [ ]  missing values are filled;
- [ ]  an explanation of which missing value types were detected;
- [ ]  explanation for the possible causes of missing values;
- [ ]  an explanation of how the blanks are filled;
- [ ]  replaced the real data type with an integer;
- [ ]  an explanation of which method is used to change the data type and why;
- [ ]  duplicates deleted;
- [ ]  an explanation of which method is used to find and remove duplicates;
- [ ]  description of the possible reasons for the appearance of duplicates in the data;
- [ ]  data is categorized;
- [ ]  an explanation of the principle of data categorization;
- [ ]  an answer to the question "Is there a relation between having kids and repaying a loan on time?";
- [ ]  an answer to the question "Is there a relation between marital status and repaying a loan on time?";
- [ ]  an answer to the question "Is there a relation between income level and repaying a loan on time?";
- [ ]  an answer to the question "How do different loan purposes affect on-time repayment of the loan?";
- [ ]  conclusions are present on each stage;
- [ ]  a general conclusion is made.

I have tried checking the boxes but can't get it to work.  I'm not sure what I'm doing wrong.  

<div class="alert alert-block alert-danger">
Please fill in this one to check yourself if you have done everything!
</div>

<div class="alert alert-block alert-success">
We are off to a great start! I really liked the ideas behind your analysis and think that once you fix the issues I have found, the work will be great.
</div>