## Review
Hi, Bianca. My name is Roman and I'm going to review your project.

You can find my comments in colored boxes like this:

<div class="alert alert-success">
    If everything is done successfully.
</div>

<div class="alert alert-warning">
    If I have some (optional) suggestions, or questions to think about, or general comments.
</div>

<div class="alert alert-danger">
    If a section requires some corrections. Work can't be accepted with red comments.
</div>

Please DON'T DELETE my comments. It would be great if you chose a **<font color="orange">visible color</font>** or a distinct **text format** for your own comments: it makes the corrections easier for us to follow.

### <font color="orange">**Summary:**</font>  
Thank you for sending your project. You've done a really good job on it!   
While there's room for improvement, on the whole, your project looks good.  
I've found some tiny mistakes in your project. They'll be easy to fix.  
"Improve" comments mean that there are tiny corrections which could help you to make your project better.   
Every issue with our code is a chance for us to learn something new.



In [1]:
import pandas as pd

## 2.1  Step 1. Open the data file and have a look at the general information.

In [2]:
try:
    df = pd.read_csv('credit_scoring_eng.csv')
except FileNotFoundError:
    #the local file is missing, so fall back to the platform path
    df = pd.read_csv('/datasets/credit_scoring_eng.csv')

df.head(100)

,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose
0,1,-8437.673028,42,bachelor's degree,0,married,0,F,employee,0,40620.102,purchase of the house
1,1,-4024.803754,36,secondary education,1,married,0,F,employee,0,17932.802,car purchase
2,0,-5623.422610,33,Secondary Education,1,married,0,M,employee,0,23341.752,purchase of the house
3,3,-4124.747207,32,secondary education,1,married,0,M,employee,0,42820.568,supplementary education
4,0,340266.072047,53,secondary education,1,civil partnership,1,F,retiree,0,25378.572,to have a wedding
...,...,...,...,...,...,...,...,...,...,...,...,...
95,0,-541.832241,51,bachelor's degree,0,married,0,F,employee,0,15070.060,car
96,0,,44,SECONDARY EDUCATION,1,married,0,F,employee,0,,buy residential real estate
97,0,,47,bachelor's degree,0,married,0,F,employee,0,,profile education
98,0,364906.205736,54,bachelor's degree,0,married,0,F,retiree,0,31953.168,buying property for renting out


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21525 non-null  int64  
 1   days_employed     19351 non-null  float64
 2   dob_years         21525 non-null  int64  
 3   education         21525 non-null  object 
 4   education_id      21525 non-null  int64  
 5   family_status     21525 non-null  object 
 6   family_status_id  21525 non-null  int64  
 7   gender            21525 non-null  object 
 8   income_type       21525 non-null  object 
 9   debt              21525 non-null  int64  
 10  total_income      19351 non-null  float64
 11  purpose           21525 non-null  object 
dtypes: float64(2), int64(5), object(5)
memory usage: 2.0+ MB


## Conclusion:
#### The file is not corrupted and can be worked on. The first 100 rows were displayed successfully. However, the data clearly needs a closer look: some values are odd even at first glance.
#### What immediately stands out is the strange values (negative and very large numbers) and the float data type of the days_employed and total_income columns, as well as the mixed letter case in the education column. This needs to be investigated.

<div class="alert alert-warning">

### Comment

Nice start!  
It would be great to add a short conclusion after loading the data!  
</div>

# 2.4 Step 2: Data preprocessing

## 2.5 Processing missing values

In [4]:
df.isnull().sum() #counts missing values

children               0
days_employed       2174
dob_years              0
education              0
education_id           0
family_status          0
family_status_id       0
gender                 0
income_type            0
debt                   0
total_income        2174
purpose                0
dtype: int64

In [5]:
total_income_mean = df['total_income'].mean()
#mean value from total_income column retrieved

In [6]:
age_mean = df['dob_years'].mean()
#mean value from dob_years column retrieved

In [7]:
df['total_income'] = df['total_income'].fillna(total_income_mean)
#total_income NaN values filled with mean value
 
df.loc[df['dob_years'] == 0, 'dob_years'] = int(round(age_mean))
#dob_years has no NaN values; its gaps are stored as 0, so the 0 values are replaced by the average age.

In [8]:
df['days_employed'] = df['days_employed'].fillna(0)
#it was not possible to derive a realistic mean, therefore NaN values were replaced by 0.

In [9]:
df.loc[df['days_employed'] < 0, 'days_employed'] = 0
#negative numbers were replaced by 0, even though this change will not affect the final result.

df.loc[df['days_employed'] > 14600, 'days_employed'] = df['days_employed'] / 24
#values above 14600 days (40 years) are assumed to be recorded in hours and are converted to days.

<div class="alert alert-danger">

### Comment
So, yeah, you've just found a number of gaps in two columns, and I'd suggest filling all of them. Let's take a closer look...
- gaps in days_employed: It would be better to fill in the gaps. They can be filled with a single median / mean or with zero, but for educational purposes it is better to handle them more carefully. There is nothing terrible there. First check the negative values. Then pay attention to the values that are too large (there are many such rows): converted to years by simply dividing by 365, they give 800-1100 years of experience, which is impossible. So maybe these are not days at all (not in every row, only in these "giants")? Perhaps it is enough to divide the large values by 24, and then the available data will be ready (processed). As for filling the gaps: ONLY now, when all the available values are correct, does it make sense to think about how to fill them. Here you can customize the filling, for example by type of employment.
    
- gaps in income: It would be better to fill in these gaps too. Filling with one common average / median or zero would be acceptable if the data did not contain any other fields by which borrowers can be divided. Do employees and students have equal average income? ;)
- gaps are not always NaN or Null. Look closely at the 0s in age...
- conclusion: It is very important to summarise everything you do.
</div>
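
The group-wise filling suggested above could look roughly like the sketch below. It is only an illustration on made-up toy data (the `toy` frame and its values are assumptions, not rows from the project), and the median is one possible choice of statistic.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real table (values are made up for illustration)
toy = pd.DataFrame({
    'income_type': ['employee', 'employee', 'employee', 'retiree', 'retiree', 'retiree'],
    'total_income': [20000.0, 30000.0, np.nan, 10000.0, np.nan, 14000.0],
    'dob_years': [30, 0, 40, 60, 70, 0],
})

# Fill income gaps with the median of the same income_type group
group_median = toy.groupby('income_type')['total_income'].transform('median')
toy['total_income'] = toy['total_income'].fillna(group_median)

# An age of 0 is a hidden gap: mark it as missing first, then fill per group
toy['dob_years'] = toy['dob_years'].mask(toy['dob_years'] == 0)
age_median = toy.groupby('income_type')['dob_years'].transform('median')
toy['dob_years'] = toy['dob_years'].fillna(age_median)

print(toy)
```

With this approach an employee's missing income is taken from other employees rather than from the whole table, which is the point of the "employees vs students" question.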

<div class="alert alert-danger">

### Comment

The two cells above refer to gap filling, not to type replacement. Please make sure that your project matches the brief.
</div>

## Conclusion
#### I used the mean to fill in the gaps in **dob_years** because I assume the 0s are human errors or missing information. This also won't have a considerable effect on the final results. I treat these as MCAR (missing completely at random) values.

#### The mean was also used for the gaps in **total_income**. I understand that around 10% of the table is not a very big share, and I could not find a relationship between this column and any other.

#### I used 14600 as the threshold for **days_employed** because it is exactly 40 years in days (40 * 365), which I consider a reasonable number of working years in an average lifespan. Values larger than 14600 were divided by 24 (hours), which gave a reasonable number of worked days. **NaN** and negative values were replaced by 0. Again, I was not able to identify a pattern in the negative numbers.
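
For comparison, here is a minimal sketch of an alternative treatment of days_employed that keeps the magnitude of the negative values instead of zeroing them. The assumption that a negative value is only a sign error is not confirmed by the data description, and the sample numbers are made up in the style of the column.

```python
import numpy as np
import pandas as pd

# Made-up values in the style of the days_employed column
days = pd.Series([-8437.67, -4024.80, 340266.07, np.nan, 364906.21])

# Assumption: a negative value is a sign error, so keep its magnitude
days = days.abs()

# Assumption: values above 40 years (14600 days) were recorded in hours
too_large = days > 14600
days = days.where(~too_large, days / 24)

print(days.round(1))
```

The gap (NaN) is left untouched here on purpose, so that it can be filled afterwards from values that are already corrected.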

## 2.8  Data type replacement

In [10]:
df['days_employed'] = df['days_employed'].astype(int)
df['total_income'] = df['total_income'].astype(int)
df.info()
#both columns' values converted to integers to match the rest of the table.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   children          21525 non-null  int64 
 1   days_employed     21525 non-null  int64 
 2   dob_years         21525 non-null  int64 
 3   education         21525 non-null  object
 4   education_id      21525 non-null  int64 
 5   family_status     21525 non-null  object
 6   family_status_id  21525 non-null  int64 
 7   gender            21525 non-null  object
 8   income_type       21525 non-null  object
 9   debt              21525 non-null  int64 
 10  total_income      21525 non-null  int64 
 11  purpose           21525 non-null  object
dtypes: int64(7), object(5)
memory usage: 2.0+ MB


In [11]:
df['education'].unique()
#checking for odd or inconsistent values in the column

df['education'] = df['education'].str.lower()
#converts all values of the education column to lowercase

<div class="alert alert-warning">

### Comment

The type replacement is done! Good. But, sorry, let's add a conclusion here.
</div>

## Conclusion

#### In order to keep the dataframe organized and easy to manipulate in the future, the values of the days_employed and total_income columns were converted from floating-point to integer numbers to match the rest of the table.

#### The values in the education column had inconsistent letter case. They were all converted to lowercase to make the duplicate count in the next step easier.

## 2.12  Processing duplicates


In [12]:
df.duplicated().sum()
#found 71 duplicates

71

<div class="alert alert-danger">

### Comment

OK... you've corrected the letter case in the education column... But what about duplicates? (I mean "df.duplicated().sum()")

</div>
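
As an illustration of why the order matters, the sketch below counts duplicates before and after normalizing the letter case and then removes them. The `toy` frame is an assumption made up for the example.

```python
import pandas as pd

# Three rows that differ only in the letter case of education, plus one distinct row
toy = pd.DataFrame({
    'education': ['secondary education', 'Secondary Education',
                  'SECONDARY EDUCATION', "bachelor's degree"],
    'dob_years': [44, 44, 44, 47],
})

n_before = toy.duplicated().sum()   # case differences hide the duplicates

toy['education'] = toy['education'].str.lower()
n_after = toy.duplicated().sum()    # duplicates become visible after normalizing

# Remove the duplicates right away and rebuild the index
toy = toy.drop_duplicates().reset_index(drop=True)

print(n_before, n_after, len(toy))
```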



## 2.15  Categorizing Data

In [13]:
def purpose_categories(purpose):
    if 'car' in purpose:
        return "Car"
    if 'education' in purpose:
        return 'Education'
    if 'wedding' in purpose:
        return 'Wedding'
    #every remaining purpose is counted as property-related; this is a catch-all,
    #so purposes that match none of the keywords above also land here
    return 'Property & Real Estate'

df['purpose_categories'] = df['purpose'].apply(purpose_categories)
df['purpose_categories'].value_counts()

#purpose column categorized into 4 categories:

Property & Real Estate    11748
Car                        4315
Education                  3114
Wedding                    2348
Name: purpose_categories, dtype: int64

In [14]:
df = df.drop(columns=['purpose'])
#the purpose column is replaced by the categorized 'purpose_categories' column.

In [15]:
df['children'] = df['children'].replace(to_replace=-1, value=1)
#in the children column, the value -1 was replaced by 1 (assumed to be a typo in the sign).
df['children'] = df['children'].replace(to_replace=20, value=2)
#in the children column, the value 20 was replaced by 2 (assumed to be a typo with an extra zero).

<div class="alert alert-success">

### Comment

Yeah, clear job!

</div>
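
One point to watch in keyword-based categorization such as `purpose_categories`: an expression like `'property' or 'real estate' or 'hous' in purpose` does not test all three keywords. Python evaluates the first non-empty string as truthy and stops there. The sketch below (the helper names `naive_match` and `keyword_match` are hypothetical) shows the difference.

```python
def naive_match(purpose):
    # Parsed as: 'property' or 'real estate' or ('hous' in purpose)
    # A non-empty string is truthy, so the result never depends on purpose
    return bool('property' or 'real estate' or 'hous' in purpose)


def keyword_match(purpose, keywords=('property', 'real estate', 'hous')):
    # True only when at least one keyword really occurs in the text
    return any(keyword in purpose for keyword in keywords)


print(naive_match('to have a wedding'))        # True, although no keyword matches
print(keyword_match('to have a wedding'))      # False
print(keyword_match('purchase of the house'))  # True
```

With an explicit check, purposes that match no keyword can be sent to a separate category and inspected instead of silently joining the last one.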



## Data Categorization (Income Type and Total Income column)

In [16]:
df['income_type'].unique()
#must be categorized

array(['employee', 'retiree', 'business', 'civil servant', 'unemployed',
       'entrepreneur', 'student', 'paternity / maternity leave'],
      dtype=object)

In [17]:
df['income_levels'] = pd.qcut(df['total_income'], q=8, precision=0)
df.head()
#the total_income column was divided into 8 quantile-based intervals; income_levels stores the interval that each customer's income falls into

,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose_categories,income_levels
0,1,0,42,bachelor's degree,0,married,0,F,employee,0,40620,Property & Real Estate,"(31286.0, 40639.0]"
1,1,0,36,secondary education,1,married,0,F,employee,0,17932,Car,"(17247.0, 20978.0]"
2,0,0,33,secondary education,1,married,0,M,employee,0,23341,Property & Real Estate,"(20978.0, 25024.0]"
3,3,0,32,secondary education,1,married,0,M,employee,0,42820,Education,"(40639.0, 362496.0]"
4,0,14177,53,secondary education,1,civil partnership,1,F,retiree,0,25378,Wedding,"(25024.0, 26787.0]"


In [18]:
df['income_categories'] = pd.qcut(df['total_income'], q=8, precision=0, labels=['Category 1: 3305 to 13430', 
'Category 2: 13430 to 17247',
'Category 3: 17247 to 20978',
'Category 4: 20978 to 25024',
'Category 5: 25024 to 26787',
'Category 6: 26787 to 31286',
'Category 7: 31286 to 40639',
'Category 8: 40639 to 362496'])
#income_categories simply gives each interval a label, making it easier to see which income types fall into which income interval.
df.head(5)

,children,days_employed,dob_years,education,education_id,family_status,family_status_id,gender,income_type,debt,total_income,purpose_categories,income_levels,income_categories
0,1,0,42,bachelor's degree,0,married,0,F,employee,0,40620,Property & Real Estate,"(31286.0, 40639.0]",Category 7: 31286 to 40639
1,1,0,36,secondary education,1,married,0,F,employee,0,17932,Car,"(17247.0, 20978.0]",Category 3: 17247 to 20978
2,0,0,33,secondary education,1,married,0,M,employee,0,23341,Property & Real Estate,"(20978.0, 25024.0]",Category 4: 20978 to 25024
3,3,0,32,secondary education,1,married,0,M,employee,0,42820,Education,"(40639.0, 362496.0]",Category 8: 40639 to 362496
4,0,14177,53,secondary education,1,civil partnership,1,F,retiree,0,25378,Wedding,"(25024.0, 26787.0]",Category 5: 25024 to 26787


In [19]:
df.groupby('income_categories')['income_type'].count()

income_categories
Category 1: 3305 to 13430      2692
Category 2: 13430 to 17247     2690
Category 3: 17247 to 20978     2690
Category 4: 20978 to 25024     2692
Category 5: 25024 to 26787     3302
Category 6: 26787 to 31286     2078
Category 7: 31286 to 40639     2690
Category 8: 40639 to 362496    2691
Name: income_type, dtype: int64

In [20]:
income_stats = df.pivot_table(index='income_categories', values='income_levels', columns='income_type', aggfunc='count')
income_stats.fillna(value=0, inplace=True)
#NaN values replaced by 0. They play no roles in the table.
income_stats = income_stats.astype(int)
#all float numbers changed to integer
income_stats

income_categories,business,civil servant,employee,entrepreneur,paternity / maternity leave,retiree,student,unemployed
Category 1: 3305 to 13430,278,168,1362,0,1,882,0,1
Category 2: 13430 to 17247,474,184,1459,0,0,572,1,0
Category 3: 17247 to 20978,526,168,1483,0,0,513,0,0
Category 4: 20978 to 25024,602,177,1491,0,0,422,0,0
Category 5: 25024 to 26787,799,221,1701,1,0,580,0,0
Category 6: 26787 to 31286,561,149,1078,0,0,290,0,0
Category 7: 31286 to 40639,790,195,1383,0,0,321,0,1
Category 8: 40639 to 362496,1055,197,1162,1,0,276,0,0


In [21]:
df = df.drop_duplicates().reset_index(drop=True)
#duplicated rows were deleted and the index was reset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21098 entries, 0 to 21097
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   children            21098 non-null  int64   
 1   days_employed       21098 non-null  int64   
 2   dob_years           21098 non-null  int64   
 3   education           21098 non-null  object  
 4   education_id        21098 non-null  int64   
 5   family_status       21098 non-null  object  
 6   family_status_id    21098 non-null  int64   
 7   gender              21098 non-null  object  
 8   income_type         21098 non-null  object  
 9   debt                21098 non-null  int64   
 10  total_income        21098 non-null  int64   
 11  purpose_categories  21098 non-null  object  
 12  income_levels       21098 non-null  category
 13  income_categories   21098 non-null  category
dtypes: category(2), int64(7), object(5)
memory usage: 2.0+ MB


## Conclusion

#### There were many repeated values in the purpose column worded in different ways; they were grouped into only 4 categories: property & real estate, car, education and wedding. The column was renamed accordingly.

#### In my understanding, the values -1 and 20 in the children column are either random errors (MCAR) or typos. I wasn't able to assign any other meaning to them, even when comparing with the rest of the table. Therefore I replaced -1 with 1 and 20 with 2. This makes the duplicate count easier.

#### The total_income column was divided into 8 quantile-based intervals. income_categories simply gives each interval a label, making it easier to see which income types fall into which income interval.

#### The income_stats pivot table was created to show how many customers of each income type fall into every income interval.

#### To finish the task, after standardizing the mentioned columns, the duplicate rows were deleted.

<div class="alert alert-warning">

### Comment

Ok. Good! A little tip:
- try to think of more appropriate category names. Customers prefer clear ones, for example: "Category 1" == "from 15000 to 75000"

</div>
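
Readable labels do not have to be typed by hand. A possible sketch, assuming toy income values and 4 bins instead of 8, asks `pd.qcut` for the bin edges via `retbins=True` and builds the labels from them:

```python
import pandas as pd

# Made-up incomes for illustration
income = pd.Series([3305, 9000, 13430, 15000, 17247, 19000, 25000, 362496])

# retbins=True also returns the bin edges that qcut computed
_, edges = pd.qcut(income, q=4, retbins=True)

# Build one readable label per interval from consecutive edges
labels = [f'from {int(low)} to {int(high)}'
          for low, high in zip(edges[:-1], edges[1:])]

# Apply the same binning again, this time with the generated labels
income_group = pd.qcut(income, q=4, labels=labels)
print(income_group.value_counts().sort_index())
```

This keeps the labels in sync with the intervals if the data or the number of bins changes.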



<div class="alert alert-danger">

### Comment

Conclusion

</div>



# 2.19  Step 3. Answer these questions

<div class="alert alert-warning">

### Comment

Hello, I am a duplicate and I would like to have been removed 2 steps ago :)

</div>



In [22]:
def relational_table(table, a, b):
    #counts the rows for every combination of a and b
    tabla = table.groupby(a)[b].value_counts().rename('members').to_frame()
    #share of each combination in the whole table
    tabla['ratio by total (%)'] = (tabla['members'] / tabla['members'].sum() * 100).round(2)
    #share of each b value within its own a group
    group_totals = tabla.groupby(level=0)['members'].transform('sum')
    tabla['ratio by debt category (%)'] = (tabla['members'] / group_totals * 100).round(2)
    print(tabla)

#The same function can be applied to the following questions

## Is there a relation between having kids and repaying a loan on time?

In [23]:
relational_table(df,'children','debt')

               members  ratio by total (%)  ratio by debt category (%)
children debt                                                         
0        0       12754               60.45                       92.32
         1        1061                5.03                        7.68
1        0        4345               20.59                       90.71
         1         445                2.11                        9.29
2        0        1912                9.06                       90.44
         1         202                0.96                        9.56
3        0         303                1.44                       91.82
         1          27                0.13                        8.18
4        0          36                0.17                       90.00
         1           4                0.02                       10.00
5        0           9                0.04                      100.00


## Conclusion
#### About 65% of all customers have no children; childless customers without debt alone make up 60.45% of the whole list.
#### The more children customers have, the fewer of them there are among the borrowers.
#### The share of defaulting clients stays within 7 - 10% for every group from 0 to 4 children; the group with 5 children has only 9 customers and none of them has debt.
#### The chance of someone not repaying a loan on time is therefore relatively low and stays in roughly the same range regardless of the number of children.


<div class="alert alert-danger">

### Comment

Look... We need to know the share of delays for each category we use. Our question is about children and debt, so let's use only these columns in the calculation. Let's have a look at the "0" category. There are 13815 borrowers in it and 1061 of them have debt. What does it mean? It means that only 7.68% (1061/13815*100) of this group are "bad debtors"... We need to do this calculation for all categories using code :). Then compare the results in the conclusion. In the conclusion you should describe all the categories you researched, not only write "yes, it is / no, it isn't" or "yes, people with children are more often debtors"... describe aaaaallll the categories...
 
</div>
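
The per-category share described above can be computed in one step, because debt is a 0/1 flag and its mean inside a group is exactly the share of debtors. A minimal sketch on made-up toy data:

```python
import pandas as pd

# Toy data: number of children and a 0/1 debt flag per borrower
toy = pd.DataFrame({
    'children': [0, 0, 0, 0, 1, 1, 2, 2],
    'debt':     [0, 0, 0, 1, 0, 1, 0, 0],
})

# count = borrowers in the group, sum = debtors, mean = share of debtors
debt_share = toy.groupby('children')['debt'].agg(['count', 'sum', 'mean'])
debt_share['debt share (%)'] = (debt_share['mean'] * 100).round(2)

print(debt_share)
```

The same call works for any grouping column, so replacing `'children'` with another column name answers the other questions too.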

<div class="alert alert-danger">

### Comment

Conclusion
 
</div>

## Is there a relation between marital status and repaying a loan on time?

In [24]:
relational_table(df,'family_status','debt')

                        members  ratio by total (%)  \
family_status     debt                                
civil partnership 0        3737               17.71   
                  1         388                1.84   
divorced          0        1107                5.25   
                  1          85                0.40   
married           0       11126               52.73   
                  1         929                4.40   
unmarried         0        2508               11.89   
                  1         274                1.30   
widow / widower   0         881                4.18   
                  1          63                0.30   

                        ratio by debt category (%)  
family_status     debt                              
civil partnership 0                          90.59  
                  1                           9.41  
divorced          0                          92.87  
                  1                           7.13  
married           0  

## Conclusion
#### About 57% of the clients are married and about 20% are in a civil partnership, so these two groups make up roughly three quarters of all borrowers.
#### More than 90% of the clients in every marital status group repay their loans on time.
#### The main conclusion is that marital status doesn't really affect how often debts occur.


<div class="alert alert-danger">

### Comment

Sorry, the same issue
 
</div>

## Is there a relation between income level and repaying a loan on time?

In [25]:
relational_table(df,'income_categories','debt')

                                  members  ratio by total (%)  \
income_categories           debt                                
Category 1: 3305 to 13430   0        2486               11.78   
                            1         206                0.98   
Category 2: 13430 to 17247  0        2469               11.70   
                            1         221                1.05   
Category 3: 17247 to 20978  0        2461               11.66   
                            1         229                1.09   
Category 4: 20978 to 25024  0        2447               11.60   
                            1         244                1.16   
Category 5: 25024 to 26787  0        2605               12.35   
                            1         271                1.28   
Category 6: 26787 to 31286  0        1896                8.99   
                            1         182                0.86   
Category 7: 31286 to 40639  0        2490               11.80   
                         

## Conclusion
#### In almost all income categories, defaulting clients make up a very low share of the whole customer list, around 1%.
#### On the other hand, within each income category around 5 to 9% of the customers have debts.
#### The salary level also does not play a big role in the loan requests.

## How do different loan purposes affect on-time repayment of the loan?

In [26]:
relational_table(df,'purpose_categories','debt')

                             members  ratio by total (%)  \
purpose_categories     debt                                
Car                    0        3870               18.34   
                       1         402                1.91   
Education              0        2797               13.26   
                       1         288                1.37   
Property & Real Estate 0       10572               50.11   
                       1         863                4.09   
Wedding                0        2120               10.05   
                       1         186                0.88   

                             ratio by debt category (%)  
purpose_categories     debt                              
Car                    0                          90.59  
                       1                           9.41  
Education              0                          90.66  
                       1                           9.34  
Property & Real Estate 0                          9

In [27]:
## Conclusion
#### Overall, around 90% - 92% of the loans are repaid on time.
#### More information or additional parameters are necessary to state a more accurate result.

<div class="alert alert-danger">

### Comment

Sorry, the same issue
 
</div>

In [28]:
relational_table(df,'debt', 'debt')

           members  ratio by total (%)  ratio by debt category (%)
debt debt                                                         
0    0       19359               91.76                       100.0
1    1        1739                8.24                       100.0


# OVERALL CONCLUSION

#### There were some typos and missing values in the original data which could lead to misunderstandings in the children, education, total_income and purpose columns. Therefore, the str.lower(), mean() and fillna() methods were necessary to standardize the data.

#### Creating the 8 categories based on income level seemed important for naming the salary ranges: they are easier to understand as labels than as raw numbers.

#### Duplicates were treated after the purpose column, with its many repeated wordings, was replaced with 4 main categories.

#### The same function was used to build all the comparison tables of the last task.

#### To conclude, after all calculations and table divisions, the goal was achieved as well as possible: on average, only 8% of the customers in the whole list have debts. This means there is no special potential of default. However, I was not able to find a clear correlation between defaulting and having children, being married or the purpose of the loan. I consider an 8% chance of a client defaulting to be a very low percentage. It is not possible to affirm that a specific group is more or less likely to repay their loans on time. In all comparisons, the maximum share of defaulting clients is 10%, and I was not provided with specific criteria to judge whether this is a big or a small number. In addition, the retrieved tables should be examined in detail by a credit specialist.

<div class="alert alert-warning">

### Comment

a tiny tip:
- I would suggest using a markdown cell whenever you want to write text. It is in the toolbar above.
 
</div>

<div class="alert alert-danger">

### Comment
    
The final conclusion, unfortunately, is not detailed enough. Customers (and they only read the output, they don't need the code) would be happy to see something more detailed and voluminous that they can read. Here is what to include:

- what was our goal?
- what data did we have in our hands?
- how we processed / modified / worked with gaps and duplicates, plus reasoning on why the duplicates / zeros / gaps / artifacts appeared.
- general figures for the entire table: the average percentage of delinquency and which groups are larger (for example, pensioners).
- answers to the questions, in as much detail as possible, for all groups, with numbers and reasoning. Remember that we are not looking for an ideal group; we need to describe all the categories we have and their metrics.
- and at the very end, "what was the goal and how the answers to the questions will help to achieve this goal".
    
This is an approximate plan of the final output for any analytical project. I suggest expanding your conclusion according to it, and everything will be fine!
</div>
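
To collect the numbers for such a conclusion, one option is to compute the overall share of debtors and one line per group for every column of interest. The sketch below uses made-up toy data; the column names follow the project, while `summary_lines` is a hypothetical helper variable.

```python
import pandas as pd

# Toy data with two of the grouping columns and the 0/1 debt flag
toy = pd.DataFrame({
    'family_status': ['married', 'married', 'unmarried', 'unmarried', 'divorced', 'married'],
    'purpose_categories': ['Car', 'Wedding', 'Car', 'Education', 'Car', 'Education'],
    'debt': [0, 1, 1, 0, 0, 0],
})

# Overall share of debtors in the whole table
overall_rate = toy['debt'].mean() * 100
print(f'Overall share of debtors: {overall_rate:.2f}%')

# One sentence per group, ready to be used in a written conclusion
summary_lines = []
for column in ['family_status', 'purpose_categories']:
    rates = (toy.groupby(column)['debt'].mean() * 100).round(2)
    for group, rate in rates.sort_values(ascending=False).items():
        summary_lines.append(f'{column} = {group}: {rate}% of borrowers have debt')

print('\n'.join(summary_lines))
```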