# Default payment by different education levels
## 1. Introduction 
* According to an Investopedia article by James Chen, a default is generally defined as a borrower's failure to make required repayments on a debt, whether the borrower is a business or an individual. Defaulting can severely affect a borrower's future financial situation, for example by lowering their credit score and limiting their ability to borrow money. Default has also been linked to education level: a research report from Aalto University in Finland argues that default payments are more prevalent among people with a lower educational background.
* In this research project, we assess whether the claim above holds by performing hypothesis tests on the `default of credit card clients.xls` dataset, which records credit card customers in Taiwan. The data were collected by the Department of Information Management at Chung Hua University and the Department of Civil Engineering at Tamkang University, both located in Taiwan. Our goal is to examine whether the proportion of credit card holders with a university education who default on their payment differs from the proportion of those with a high school education who default. The variables studied are `EDUCATION` and `default payment next month`.

## 2. Methods:
* The dataset provides 23 variables describing each customer, along with an indicator of whether the customer defaulted. These variables let us study which customer characteristics are associated with default. In this study, we examine whether education level is associated with the likelihood of default, specifically comparing high school graduates and university graduates. Because education level is recorded in distinct categories, the two groups can be separated cleanly within the sample.
* Our hypotheses are $H_0: p_1 - p_2 = 0$ and $H_1: p_1 - p_2 \neq 0$, where $p_1$ is the proportion of university-educated customers who default and $p_2$ is the proportion of high-school-educated customers who default (we treat the two education groups as independent samples). The observed difference in sample proportions, computed in Section 4, suggests that we may reject $H_0$ in favor of $H_1$. However, we still need to check whether this finding is statistically significant, so we will perform a hypothesis test for a difference in proportions. We can then create a bootstrap distribution and calculate a confidence interval for the true difference in proportions between the two groups.
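
For reference, the statistic behind a two-sample test of proportions can be written out as follows. This is the standard pooled form, given here as a sketch; the symbols $x_i$ (number of defaults in group $i$) and $n_i$ (size of group $i$) are notation introduced for this illustration and do not appear elsewhere in the report.

$$
\hat{p}_i = \frac{x_i}{n_i}, \qquad
\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}, \qquad
z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}\,(1 - \hat{p})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}
$$

Under $H_0$, $z$ is approximately standard normal when both groups are large, and $z^2$ is approximately chi-squared with one degree of freedom, so the two-sided p-value can be taken as $2\,\Phi(-|z|)$.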


## 3. Loading and cleaning data: 
### Loading data:

In [21]:
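library(tidyverse)  # data wrangling (dplyr, tidyr, forcats)
library(readxl)     # Excel file support
install.packages("rio")
library(rio)        # import() reads the .xls file directly from a URL
install.packages("infer")
library(infer)      # simulation-based inference
install.packages("reshape2")
library(reshape2)
install.packages("broom")
library(broom)      # tidy() converts test output to a data frame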




In [2]:
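url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/00350/default%20of%20credit%20card%20clients.xls"
# skip = 1 drops the first row of the spreadsheet so that the variable names
# are read from the second row.
credit_card_data <- rio::import(file = url, skip = 1)

head(credit_card_data)
nrow(credit_card_data)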

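| | ID | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_0 | PAY_2 | PAY_3 | PAY_4 | ⋯ | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | default payment next month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | ⋯ | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | 1 | 20000 | 2 | 2 | 1 | 24 | 2 | 2 | -1 | -1 | ⋯ | 0 | 0 | 0 | 0 | 689 | 0 | 0 | 0 | 0 | 1 |
| 2 | 2 | 120000 | 2 | 2 | 2 | 26 | -1 | 2 | 0 | 0 | ⋯ | 3272 | 3455 | 3261 | 0 | 1000 | 1000 | 1000 | 0 | 2000 | 1 |
| 3 | 3 | 90000 | 2 | 2 | 2 | 34 | 0 | 0 | 0 | 0 | ⋯ | 14331 | 14948 | 15549 | 1518 | 1500 | 1000 | 1000 | 1000 | 5000 | 0 |
| 4 | 4 | 50000 | 2 | 2 | 1 | 37 | 0 | 0 | 0 | 0 | ⋯ | 28314 | 28959 | 29547 | 2000 | 2019 | 1200 | 1100 | 1069 | 1000 | 0 |
| 5 | 5 | 50000 | 1 | 2 | 1 | 57 | -1 | 0 | -1 | 0 | ⋯ | 20940 | 19146 | 19131 | 2000 | 36681 | 10000 | 9000 | 689 | 679 | 0 |
| 6 | 6 | 50000 | 1 | 1 | 2 | 37 | 0 | 0 | 0 | 0 | ⋯ | 19394 | 19619 | 20024 | 2500 | 1815 | 657 | 1000 | 1000 | 800 | 0 |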


### Cleaning and wrangling data 

In [16]:
credit_card_clean <- credit_card_data %>%
    # Keep only the two variables under study, stored as character labels.
    mutate(education_level = as.character(EDUCATION),
           default = as.character(`default payment next month`)) %>%
    select(education_level, default) %>%
    # Relabel the numeric codes: EDUCATION 2 = university, 3 = high school;
    # default 1 = yes, 0 = no.
    mutate(education_level = replace(education_level, education_level == "2", "University"),
           education_level = replace(education_level, education_level == "3", "High_School"),
           default = replace(default, default == "1", "Yes"),
           default = replace(default, default == "0", "No")) %>%
    # Restrict the sample to the two education groups being compared.
    filter(education_level %in% c("University", "High_School")) %>%
    # as_factor() orders the levels by first appearance in the data.
    mutate(education_level = as_factor(education_level),
           default = as_factor(default))

head(credit_card_clean)
table(credit_card_clean)

| | education_level | default |
|---|---|---|
| | `<fct>` | `<fct>` |
| 1 | University | Yes |
| 2 | University | Yes |
| 3 | University | No |
| 4 | University | No |
| 5 | University | No |
| 6 | University | No |


               default
education_level   Yes    No
    University   3330 10700
    High_School  1237  3680
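
The asymptotic test in the next section relies on a normal approximation, which is usually considered reasonable when every cell of the table holds at least about 10 observations. A minimal sketch of that check, with the counts typed in from the table above (the object names `counts` and `min_cell` and the threshold of 10 are illustrative choices, not part of the original analysis):

```r
# Counts copied from the contingency table above
# (rows: education level, columns: default status).
counts <- matrix(c(3330, 10700,
                   1237,  3680),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(education_level = c("University", "High_School"),
                                 default = c("Yes", "No")))

# Rule-of-thumb check (assumed threshold): at least 10 observations per cell.
min_cell <- 10
all(counts >= min_cell)

# Group sizes and sample proportions of default, for reference.
group_n <- rowSums(counts)
p_hat <- counts[, "Yes"] / group_n
round(p_hat, 4)
```

Both groups are far above the threshold, so the normal approximation should be a reasonable basis for the test under this rule of thumb.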

## 4. Performing hypothesis tests:


### Hypothesis Testing Using Asymptotic Methods:


In [19]:
default_summary <-
    credit_card_clean %>%
    group_by(education_level) %>%
    # Group size and sample proportion of defaults in each education group.
    summarise(n = n(),
              p_hat = mean(default == "Yes"),
              .groups = "drop") %>%
    # One row with a column per group, then the observed difference.
    pivot_wider(names_from = education_level, values_from = c(n, p_hat)) %>%
    mutate(prop_diff = p_hat_University - p_hat_High_School)
default_summary

| n_University | n_High_School | p_hat_University | p_hat_High_School | prop_diff |
|---|---|---|---|---|
| `<int>` | `<int>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 14030 | 4917 | 0.2373485 | 0.2515762 | -0.01422763 |


In [24]:
default_prop_test <-
    tidy(
        prop.test(
            # Number of defaults in each group, recovered as n * p_hat.
            x = c(default_summary$n_University * default_summary$p_hat_University,
                  default_summary$n_High_School * default_summary$p_hat_High_School),
            # Number of customers in each group.
            n = c(default_summary$n_University,
                  default_summary$n_High_School),
            alternative = "two.sided", # alternative hypothesis: "less", "greater", "two.sided"
            conf.level = 0.95,
            correct = FALSE))          # no continuity correction

default_prop_test

| estimate1 | estimate2 | statistic | p.value | parameter | conf.low | conf.high | method | alternative |
|---|---|---|---|---|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<chr>` | `<chr>` |
| 0.2373485 | 0.2515762 | 4.028778 | 0.04473034 | 1 | -0.02825126 | -0.0002039881 | 2-sample test for equality of proportions without continuity correction | two.sided |
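
To see where these numbers come from, the test can be reproduced by hand in base R. The sketch below assumes the counts from the contingency table in Section 3 and uses the usual pooled standard error for the test and the unpooled (Wald) standard error for the interval; all object names are illustrative and not part of the original analysis.

```r
# Counts taken from the contingency table in Section 3.
x <- c(University = 3330, High_School = 1237)   # defaults per group
n <- c(University = 14030, High_School = 4917)  # customers per group

p_hat <- x / n
diff_hat <- unname(p_hat["University"] - p_hat["High_School"])

# Pooled proportion and standard error under H0: p1 = p2.
p_pool <- sum(x) / sum(n)
se_null <- sqrt(p_pool * (1 - p_pool) * sum(1 / n))

# z statistic and two-sided p-value from the normal approximation.
z <- diff_hat / se_null
p_value <- 2 * pnorm(-abs(z))

# 95% Wald interval for p1 - p2, using the unpooled standard error.
se_wald <- sqrt(sum(p_hat * (1 - p_hat) / n))
ci <- diff_hat + c(-1, 1) * qnorm(0.975) * se_wald

c(z_squared = z^2, p_value = p_value, conf_low = ci[1], conf_high = ci[2])
```

The squared z statistic, the p-value, and the interval endpoints should agree with the `statistic`, `p.value`, `conf.low`, and `conf.high` columns reported above, since the chi-squared statistic without continuity correction equals $z^2$.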
