## Review

Hi Jared. My name is Soslan. I'm reviewing your project. I've added all my comments to new cells with different coloring.

<div class="alert alert-success" role="alert">
  If you did something great, I use green for my comment.
</div>

<div class="alert alert-warning" role="alert">
If I want to give you advice or think that something can be improved, then I'll use yellow. This is an optional recommendation.
</div>

<div class="alert alert-danger" role="alert">
  If a topic needs extra work before I can accept the project, the comment will be red.
</div>

This is a good-quality project. All checkpoints were completed correctly, and your code and conclusions are well formatted, so I'm accepting your project. Good luck with your future learning.

---

## Analyzing borrowers’ risk of defaulting

Your project is to prepare a report for a bank’s loan division. You’ll need to find out if a customer’s marital status and number of children have an impact on whether they will default on a loan. The bank already has some data on customers’ creditworthiness.

Your report will be considered when building a **credit scoring** of a potential customer. A **credit scoring** is used to evaluate the ability of a potential borrower to repay their loan.

### Step 1. Open the data file and have a look at the general information. 

In [2]:
import pandas as pd
data = pd.read_csv('/datasets/credit_scoring_eng.csv')
data.info()
data

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
children            21525 non-null int64
days_employed       19351 non-null float64
dob_years           21525 non-null int64
education           21525 non-null object
education_id        21525 non-null int64
family_status       21525 non-null object
family_status_id    21525 non-null int64
gender              21525 non-null object
income_type         21525 non-null object
debt                21525 non-null int64
total_income        19351 non-null float64
purpose             21525 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 2.0+ MB


Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
0,1,-8437.673028,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house
1,1,-4024.803754,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase
2,0,-5623.422610,33,Secondary Education,1,married,0,M,employee,0,23341.752,purchase of the house
3,3,-4124.747207,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education
4,0,340266.072047,53,secondary education,1,civil partnership,1,F,retiree,0,25378.572,to have a wedding
...,...,...,...,...,...,...,...,...,...,...,...,...
21520,1,-4529.316663,43,secondary education,1,civil partnership,1,F,business,0,35966.698,housing transactions
21521,0,343937.404131,67,secondary education,1,married,0,F,retiree,0,24959.969,purchase of a car
21522,1,-2113.346888,38,secondary education,1,civil partnership,1,M,employee,1,14347.610,property
21523,3,-3112.481705,38,secondary education,1,married,0,M,employee,1,39054.888,buying my own car


### Conclusion

Both days_employed and total_income have 2,174 missing values each (19,351 non-null entries out of 21,525 rows).

<div class="alert alert-success" role="alert">
Correct</div>
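
The size of the gap can be read off the `info()` output, but it can also be counted directly. A minimal sketch on a small made-up frame (`sample` is hypothetical, not the project data), which also checks whether both columns are missing in the same rows:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset.
sample = pd.DataFrame({
    'days_employed': [-8437.7, np.nan, -5623.4, np.nan],
    'total_income': [40620.1, np.nan, 23341.8, np.nan],
    'debt': [0, 0, 1, 0],
})

# Number of missing values per column.
missing_counts = sample.isna().sum()
print(missing_counts)

# Check whether the two columns are missing in exactly the same rows.
same_rows = (sample['days_employed'].isna() == sample['total_income'].isna()).all()
print('Missing in the same rows:', same_rows)
```

If the two columns are always missing together, that points to a shared cause rather than independent input errors.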

### Step 2. Data preprocessing

### Processing missing values

In [3]:
data = data.dropna(subset = ['days_employed', 'total_income']).reset_index(drop = True)
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19351 entries, 0 to 19350
Data columns (total 12 columns):
children            19351 non-null int64
days_employed       19351 non-null float64
dob_years           19351 non-null int64
education           19351 non-null object
education_id        19351 non-null int64
family_status       19351 non-null object
family_status_id    19351 non-null int64
gender              19351 non-null object
income_type         19351 non-null object
debt                19351 non-null int64
total_income        19351 non-null float64
purpose             19351 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 1.8+ MB


### Conclusion

Dropped rows with missing values in days_employed and total_income. These may be due to individuals being unemployed, or could be a result of input error.

<div class="alert alert-success" role="alert">
Ok. Seems reasonable</div>
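
Dropping these rows discards about 10% of the data (2,174 of 21,525 rows). One alternative, sketched here on made-up values and not what this project did, is to fill a missing `total_income` with the median of the same `income_type`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the real column names are reused for readability.
sample = pd.DataFrame({
    'income_type': ['employee', 'employee', 'employee', 'retiree', 'retiree'],
    'total_income': [20000.0, np.nan, 30000.0, 15000.0, np.nan],
})

# Median income within each income_type, aligned back to the original rows.
group_median = sample.groupby('income_type')['total_income'].transform('median')

# Fill only the missing entries; existing values are left untouched.
sample['total_income'] = sample['total_income'].fillna(group_median)
print(sample)
```

Whether filling is better than dropping depends on why the values are missing, which this dataset does not tell us.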

### Data type replacement

In [4]:
data['days_employed'] = data['days_employed'].astype(int)
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19351 entries, 0 to 19350
Data columns (total 12 columns):
children            19351 non-null int64
days_employed       19351 non-null int64
dob_years           19351 non-null int64
education           19351 non-null object
education_id        19351 non-null int64
family_status       19351 non-null object
family_status_id    19351 non-null int64
gender              19351 non-null object
income_type         19351 non-null object
debt                19351 non-null int64
total_income        19351 non-null float64
purpose             19351 non-null object
dtypes: float64(1), int64(6), object(5)
memory usage: 1.8+ MB


### Conclusion

Changed days_employed to an integer using the .astype method. We don't need to know someone's tenure down to the second.
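
One detail worth knowing: `.astype(int)` truncates toward zero rather than rounding. A small sketch comparing the two on three values from the table in Step 1 (the series name `days` is just for illustration):

```python
import pandas as pd

days = pd.Series([-8437.673028, -4024.803754, 340266.072047])

# astype(int) drops the fractional part (truncation toward zero).
truncated = days.astype(int)

# Rounding first gives the nearest whole day instead.
rounded = days.round().astype(int)

print(truncated.tolist())
print(rounded.tolist())
```

For tenure measured in thousands of days the difference is negligible, but it matters for columns where a fraction is a large share of the value.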

### Processing duplicates

In [5]:
data['children'] = data['children'].abs()
data['days_employed'] = data['days_employed'].abs()
data['dob_years'] = data['dob_years'].abs()
data['education'] = data['education'].str.lower()
data['education_id'] = data['education_id'].abs()
data['family_status'] = data['family_status'].str.lower()
data['family_status_id'] = data['family_status_id'].abs()
data['gender'] = data['gender'].str.upper()
data['income_type'] = data['income_type'].str.lower()
data['debt'] = data['debt'].abs()
data['total_income'] = data['total_income'].abs()
data['purpose'] = data['purpose'].str.lower()

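# Iterating over a DataFrame yields column names, so this counts
# repeated values within each column (not duplicated rows).
for column in data:
    print('|{: >18}|'.format(column), '{: <10}|'.format(data[column].duplicated().sum()))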

|          children| 19344     |
|     days_employed| 10265     |
|         dob_years| 19293     |
|         education| 19346     |
|      education_id| 19346     |
|     family_status| 19346     |
|  family_status_id| 19346     |
|            gender| 19348     |
|       income_type| 19343     |
|              debt| 19349     |
|      total_income| 3         |
|           purpose| 19313     |


In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19351 entries, 0 to 19350
Data columns (total 12 columns):
children            19351 non-null int64
days_employed       19351 non-null int64
dob_years           19351 non-null int64
education           19351 non-null object
education_id        19351 non-null int64
family_status       19351 non-null object
family_status_id    19351 non-null int64
gender              19351 non-null object
income_type         19351 non-null object
debt                19351 non-null int64
total_income        19351 non-null float64
purpose             19351 non-null object
dtypes: float64(1), int64(6), object(5)
memory usage: 1.8+ MB


In [7]:
data['total_income'].value_counts()

17312.717    2
31791.384    2
42413.096    2
54857.666    1
26935.722    1
            ..
48796.341    1
34774.610    1
15710.698    1
19232.334    1
9591.824     1
Name: total_income, Length: 19348, dtype: int64

In [8]:
data[data['total_income'] == 17312.717]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
5734,0,353726,57,secondary education,1,widow / widower,2,F,retiree,0,17312.717,to become educated
18447,1,4616,35,bachelor's degree,0,civil partnership,1,M,business,0,17312.717,wedding ceremony


In [9]:
data[data['total_income'] == 31791.384]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
9324,1,1723,44,secondary education,1,civil partnership,1,M,employee,0,31791.384,to have a wedding
18120,0,1028,51,secondary education,1,married,0,F,employee,0,31791.384,property


In [10]:
data[data['total_income'] == 42413.096]

Unnamed: 0,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
5985,0,417,48,bachelor's degree,0,married,0,F,business,0,42413.096,getting higher education
7576,0,371001,56,secondary education,1,unmarried,4,M,retiree,0,42413.096,real estate transactions


In [11]:
data['children'].value_counts()

0     12710
1      4387
2      1851
3       294
20       67
4        34
5         8
Name: children, dtype: int64

<div class="alert alert-warning" role="alert">
67 people with 20 children each also looks a bit odd.</div>

In [12]:
data['purpose'].value_counts()

wedding ceremony                            721
to have a wedding                           693
having a wedding                            685
real estate transactions                    615
buy commercial real estate                  597
purchase of the house                       595
buying property for renting out             588
housing                                     587
transactions with commercial real estate    581
building a real estate                      580
housing transactions                        579
purchase of my own house                    574
property                                    572
purchase of the house for my family         570
building a property                         561
construction of own property                560
transactions with my real estate            559
buy real estate                             552
buy residential real estate                 546
housing renovation                          542
car                                     

In [13]:
from nltk.stem import SnowballStemmer

english_stemmer = SnowballStemmer('english')
# Words of interest, kept for reference; the function below checks their stems directly.
words = ['education', 'estate', 'car', 'property', 'wedding', 'house', 'housing', 'university']

def stemming_purpose(row):
    # Return the category of the first word in the purpose whose stem is recognized.
    for word in row.split():
        stemmed_word = english_stemmer.stem(word)
        if stemmed_word == 'estat' or stemmed_word == 'properti':
            return 'real estate'
        if stemmed_word == 'educ' or stemmed_word == 'univers':
            return 'education'
        if stemmed_word == 'car':
            return 'car'
        if stemmed_word == 'hous':
            return 'house'
        if stemmed_word == 'wed':
            return 'wedding'
    # Falls through and returns None if no word matches.

data['stemmed_purpose'] = data['purpose'].apply(stemming_purpose)
print(data['stemmed_purpose'].value_counts())

real estate    6311
car            3897
education      3597
house          3447
wedding        2099
Name: stemmed_purpose, dtype: int64


<div class="alert alert-warning" role="alert">
I tried to execute your lemmatization code but it produces an error. Also, data_flipped wasn't defined. Anyway, categorization was done correctly.</div>
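
If stemming or lemmatization proves troublesome, a dependency-free option is to match word prefixes. This is a sketch of that alternative, not the code used above; the `PREFIX_TO_CATEGORY` mapping and the `categorize_purpose` name are assumptions for illustration:

```python
# Prefixes chosen to cover the purposes listed above (hypothetical mapping).
PREFIX_TO_CATEGORY = {
    'estate': 'real estate',
    'propert': 'real estate',
    'educ': 'education',
    'universit': 'education',
    'car': 'car',
    'hous': 'house',
    'wedding': 'wedding',
}


def categorize_purpose(purpose):
    """Return the category of the first word that starts with a known prefix."""
    for word in purpose.lower().split():
        for prefix, category in PREFIX_TO_CATEGORY.items():
            if word.startswith(prefix):
                return category
    return 'other'  # explicit fallback instead of an implicit None


for text in ['purchase of the house', 'to become educated', 'buy commercial real estate']:
    print(text, '->', categorize_purpose(text))
```

Prefix matching is cruder than stemming (`car` would also match `care`), so the result should still be checked with `value_counts()`.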


### Conclusion

Used the value_counts function to assess how often each value occurred, and looked for values that were identical but should not have been. total_income looked like a possible source of duplicates, but the rows sharing an income differed in their other attributes, indicating they are different people.

The only real source of duplicates appeared to be the purpose of the loan, where the same purpose is worded in many different ways. I spent an excessive amount of time trying to make tokenization/lemmatization work to figure out which words to stem, and nearly lost my mind doing so.

Side note - 20 children seems unlikely. My initial thought was that these were likely errors meant to be 2, not 20. However, I don't want to assume what the data say if I don't know that this is true.
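
Note that `data[column].duplicated().sum()` counts repeated values within a single column, which is expected to be high for categorical fields such as `gender`. Full-row duplicates are a separate question. A minimal sketch of the difference on a made-up frame:

```python
import pandas as pd

# Hypothetical toy frame: rows 0 and 2 are identical.
sample = pd.DataFrame({
    'gender': ['F', 'M', 'F', 'F'],
    'total_income': [100, 200, 100, 300],
})

# Repeated values inside one column (high for any categorical column).
per_column = {column: int(sample[column].duplicated().sum()) for column in sample}
print(per_column)

# Rows that repeat an earlier row across all columns.
full_row_duplicates = int(sample.duplicated().sum())
print(full_row_duplicates)

# Dropping them, should any be found.
deduplicated = sample.drop_duplicates().reset_index(drop=True)
print(len(deduplicated))
```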

### Categorizing Data

In [14]:
data['total_income'] = data['total_income'].astype(int)

print('Minimum:',data['total_income'].min())
print('Median:', data['total_income'].median().astype(int))
print('Maximum:',data['total_income'].max())
print('')
print('Mean:', data['total_income'].mean().astype(int))
print('Standard Deviation:', data['total_income'].std().astype(int))

Minimum: 3306
Median: 23202
Maximum: 362496

Mean: 26787
Standard Deviation: 16475


In [15]:
# Print the minimum income plus successive multiples of the standard deviation.
minimum = 3306
for _ in range(15):
    print(minimum)
    minimum += 16475

3306
19781
36256
52731
69206
85681
102156
118631
135106
151581
168056
184531
201006
217481
233956
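
The first few of these boundaries can also be computed in one step rather than printed in a loop. A sketch, with the minimum and standard deviation copied from the output above:

```python
import numpy as np

minimum = 3306
std = 16475

# Edges at the minimum plus 0, 1, 2 and 3 standard deviations.
edges = minimum + std * np.arange(4)
print(edges.tolist())
```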


In [16]:
def income_grouping(income):
    if 3306 < income <= 19781:
        return 'low income'
    if 19781 < income <= 36256:
        return 'lower middle income'
    if 36256 < income <= 52731:
        return 'higher middle income'
    if 52731 < income:
        return 'high income'
    
data['income_group'] = data['total_income'].apply(income_grouping)
print(data['income_group'].value_counts())

lower middle income     8432
low income              7225
higher middle income    2588
high income             1105
Name: income_group, dtype: int64
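
The group counts sum to 19,350, one short of the 19,351 rows: the strict lower bound `3306 < income` leaves an income equal to the minimum uncategorized. A sketch of the same binning with `pd.cut`, which can include the lowest edge; the toy incomes are made up and this is not the code used above:

```python
import numpy as np
import pandas as pd

incomes = pd.Series([3306, 19781, 19782, 40000, 362496])  # hypothetical values

edges = [3306, 19781, 36256, 52731, np.inf]
labels = ['low income', 'lower middle income', 'higher middle income', 'high income']

# include_lowest=True makes the first interval closed on the left,
# so an income equal to the minimum is still labelled.
groups = pd.cut(incomes, bins=edges, labels=labels, include_lowest=True)
print(groups.tolist())
```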


### Conclusion

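Categorized total income into four income groups, with boundaries at one, two, and three standard deviations (16,475) above the minimum income (3,306).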

<div class="alert alert-success" role="alert">
Great</div>

### Step 3. Answer these questions

- Is there a relation between having kids and repaying a loan on time?

In [17]:
children_risk = data.query('children < 20').groupby('children')['debt'].mean()
print(children_risk)

children
0    0.074902
1    0.093230
2    0.095624
3    0.074830
4    0.088235
5    0.000000
Name: debt, dtype: float64


### Conclusion

- There is a slight bump in risk after having a child.
- This risk does not appear to increase after having a second child.
- Risk decreases after having a third child. This may be due to increased parent age/commensurate income, but could also be due to sample size variation.
- The sample of borrowers with 4+ children (42 people) is too small to draw conclusions from.

In sum: there may be a relation between having kids and repaying a loan on time. It appears that having children may create some heightened risk, but that this risk may dissipate over time or with additional children. More age/income data is needed to confirm this.

<div class="alert alert-success" role="alert">
Correct</div>
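
Since sample size drives how much each rate can be trusted, it helps to report the group size next to the default rate. A sketch on made-up rows, reusing the project's column names:

```python
import pandas as pd

# Hypothetical toy data with the same column names as the project.
sample = pd.DataFrame({
    'children': [0, 0, 0, 0, 1, 1, 2, 20],
    'debt':     [0, 1, 0, 0, 1, 0, 0, 0],
})

# Group size and default rate side by side; 20 is excluded as in the cell above.
summary = (
    sample.query('children < 20')
    .groupby('children')['debt']
    .agg(borrowers='count', default_rate='mean')
)
print(summary)
```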

- Is there a relation between marital status and repaying a loan on time?

In [29]:
marital_risk = data.groupby('family_status')['debt'].mean()
print(marital_risk)

family_status
civil partnership    0.090763
divorced             0.070175
married              0.075922
unmarried            0.100594
widow / widower      0.064740
Name: debt, dtype: float64


### Conclusion

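Being married or having been married appears to lower the risk of defaulting: married (7.6%), divorced (7.0%), and widowed (6.5%) borrowers default less often than unmarried borrowers (10.1%). Civil partnerships (9.1%) reduce risk slightly, but not as much as marriage.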

<div class="alert alert-success" role="alert">
Agree</div>
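
Whether a gap such as 10.1% versus 7.6% is statistically significant depends on the group sizes, which the output above does not show. A minimal sketch of a two-proportion z-test using only the standard library; the counts in the example call are hypothetical, not taken from the dataset:

```python
import math


def two_proportion_z_test(defaults_a, total_a, defaults_b, total_b):
    """Return (z, two-sided p-value) for the difference of two proportions."""
    rate_a = defaults_a / total_a
    rate_b = defaults_b / total_b
    # Pooled rate under the null hypothesis that both groups share one rate.
    pooled = (defaults_a + defaults_b) / (total_a + total_b)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (rate_a - rate_b) / std_error
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 250 defaults of 2500 versus 840 of 11000.
z, p_value = two_proportion_z_test(250, 2500, 840, 11000)
print(round(z, 2), p_value)
```

With the real data, the counts would come from a `groupby('family_status')['debt']` aggregation of `sum` and `count`.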

- Is there a relation between income level and repaying a loan on time?

In [30]:
income_risk = data.groupby('income_group')['debt'].mean()
print(income_risk)

income_group
high income             0.070588
higher middle income    0.071097
low income              0.081384
lower middle income     0.085389
Name: debt, dtype: float64


### Conclusion

As expected, a higher level of income generally decreases one's risk of defaulting.

However, there appears to be an uptick in default rates when moving from low income applicants (8.1%) to lower middle income applicants (8.5%). One theory may be that lower middle income applicants are pressured into a middle class lifestyle without the income to support it. It could also be a matter of not being poor enough to qualify for welfare while not being rich enough to fully support oneself.

You could perform some additional computations outside the scope of this project to weight these results for other attributes like age and marital status to gain a full picture.
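
One simple way to start on that is a pivot table of default rates by two attributes at once. A sketch on made-up rows; with the real data, cells with few borrowers should be read cautiously:

```python
import pandas as pd

# Hypothetical toy data reusing the project's column names.
sample = pd.DataFrame({
    'income_group': ['low income', 'low income', 'high income', 'high income'],
    'family_status': ['married', 'unmarried', 'married', 'unmarried'],
    'debt': [0, 1, 0, 0],
})

# Mean of debt (the default rate) for every income group / family status pair.
rates = sample.pivot_table(index='income_group', columns='family_status',
                           values='debt', aggfunc='mean')
print(rates)
```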

- How do different loan purposes affect on-time repayment of the loan?

In [31]:
purpose_risk = data.groupby('stemmed_purpose')['debt'].mean()
print(purpose_risk)

stemmed_purpose
car            0.094175
education      0.092021
house          0.068755
real estate    0.075741
wedding        0.075274
Name: debt, dtype: float64


### Conclusion

It appears that family- or business-oriented debt (marriage, house, real estate) poses a lower risk than personal debt (cars, education, etc). Interestingly, real estate debt poses a higher risk than personal home mortgages. This may be due to the link between a business' profitability and its ability to pay its debts, while the individual's business success is less directly linked to their ability to pay debts.

<div class="alert alert-success" role="alert">
Overall great section. Correct conclusions.</div>

### Step 4. General conclusion

Income appears to be the most accurate single predictor of default risk. Loan purpose also appears to be a decent predictor, and number of children/family status may be helpful in conjunction with loan purpose and income.

It appears that the lowest risk individuals are those who are or have been married, with middle-high level incomes, who are applying for home/real estate/wedding loans. The highest risk individuals (outside of those with no income, no job) appear to be unmarried individuals in the second-lowest income group who have children.

This is generally in line with macroeconomic principles linking marriage to financial stability, so the outcome does not appear absurd.

As for conclusions about the project itself, I need to hit the books more on lemmatization for sure. I also struggled a bit in the preprocessing of data, specifically around knowing how to address missing or incorrect values. I think in a real world scenario, I would be able to access the person providing the data to dig into potential reasons. Guessing why data is messed up is not fun.

### Project Readiness Checklist

Put 'x' in the completed points. Then press Shift + Enter.

- [x]  file open;
- [x]  file examined;
- [x]  missing values defined;
- [x]  missing values are filled;
- [x]  an explanation of which missing value types were detected;
- [x]  explanation for the possible causes of missing values;
- [x]  an explanation of how the blanks are filled;
- [x]  replaced the real data type with an integer;
- [x]  an explanation of which method is used to change the data type and why;
- [x]  duplicates deleted;
- [x]  an explanation of which method is used to find and remove duplicates;
- [x]  description of the possible reasons for the appearance of duplicates in the data;
- [x]  data is categorized;
- [x]  an explanation of the principle of data categorization;
- [x]  an answer to the question "Is there a relation between having kids and repaying a loan on time?";
- [x]  an answer to the question "Is there a relation between marital status and repaying a loan on time?";
- [x]  an answer to the question "Is there a relation between income level and repaying a loan on time?";
- [x]  an answer to the question "How do different loan purposes affect on-time repayment of the loan?";
- [x]  conclusions are present on each stage;
- [x]  a general conclusion is made.