In [1]:
library(tidyverse)
library(ggplot2)
library(PropCIs)
library(zeallot)
library(DBI)
library(vcd)
con <- DBI::dbConnect(odbc::odbc(), driver = "/usr/local/Cellar/psqlodbc/13.02.0000/lib/psqlodbcw.so", database = "yukontaf", UID = "glebsokolov", host = "localhost",
  port = 5432)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.5     ✔ purrr   0.3.4
✔ tibble  3.1.6     ✔ dplyr   1.0.8
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Loading required package: grid



In [2]:
# dbGetQuery() sends the query, fetches all rows and clears the result set.
credit_score <- dbGetQuery(con, "SELECT * FROM credit_score")

Let's convert the columns in the received data to the right types.

In [3]:
# Lower-case all column names in one vectorized call.
names(credit_score) <- tolower(names(credit_score))

In [4]:
# Drop the identifier columns, which carry no information for the analysis.
credit_score <- subset(credit_score, select = -c(index, id))
# Categorical columns become factors.
categories <- c("sex", "education", "marriage", "default")
for (col in categories) {
  credit_score[, col] <- as.factor(credit_score[, col])
}
# Every remaining column is numeric and is stored as double.
numerical <- names(subset(credit_score, select = -c(sex, education, marriage, default)))
for (n in numerical) {
  credit_score[, n] <- as.double(credit_score[, n])
}
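
The two loops above can also be written without explicit iteration. A minimal sketch on a hypothetical toy data frame (`toy` and its values are made up for illustration; only the column names follow the notebook):

```r
# Hypothetical toy data; only the column names mirror the notebook.
toy <- data.frame(
  sex = c(1L, 2L, 2L),
  education = c(2L, 1L, 3L),
  marriage = c(1L, 2L, 2L),
  default = c(0L, 1L, 0L),
  limit_bal = c(20000L, 120000L, 90000L),
  age = c(24L, 26L, 34L)
)

categories <- c("sex", "education", "marriage", "default")
numerical <- setdiff(names(toy), categories)

# Convert whole groups of columns at once instead of looping over them.
toy[categories] <- lapply(toy[categories], as.factor)
toy[numerical] <- lapply(toy[numerical], as.double)

str(toy)
```

Using `setdiff()` to derive the numeric columns avoids listing the categorical names a second time.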

In [7]:
credit_score %>% as_tibble() %>% print(n=10)

# A tibble: 30,000 × 24
   limit_bal sex   education marriage   age pay_0 pay_2 pay_3 pay_4 pay_5 pay_6
       <dbl> <fct> <fct>     <fct>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1     20000 2     2         1           24     2     2     0     0     0     0
 2    120000 2     2         2           26     0     2     0     0     0     2
 3     90000 2     2         2           34     0     0     0     0     0     0
 4     50000 2     2         1           37     0     0     0     0     0     0
 5     50000 1     2         1           57     0     0     0     0     0     0
 6     50000 1     1

Let's test two hypotheses:
- Is the mean credit limit (limit_bal) the same in the two groups, default = 0 (repaid the credit) and default = 1 (did not repay)?
- Is the distribution of limit_bal the same in these two groups?

To answer these and the following questions, I will calculate **confidence intervals**.
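
To make the confidence interval for a difference in means less of a black box, here is a sketch that computes the Welch interval by hand and compares it with `t.test`. The data are simulated stand-ins (the vectors `x` and `y` and their parameters are assumptions, not the real limit_bal values).

```r
set.seed(1)
# Simulated stand-ins for limit_bal in the two groups (not the real data).
x <- rnorm(200, mean = 180, sd = 130)
y <- rnorm(80, mean = 130, sd = 115)

# Squared standard errors of the two group means.
vx <- var(x) / length(x)
vy <- var(y) / length(y)

# Standard error of the difference in means with unequal variances.
se <- sqrt(vx + vy)

# Welch-Satterthwaite approximation of the degrees of freedom.
df <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))

# 95% confidence interval for mean(x) - mean(y).
ci <- (mean(x) - mean(y)) + c(-1, 1) * qt(0.975, df) * se

ci
t.test(x, y)$conf.int  # should agree with ci
```

If the interval excludes zero, the test rejects equal means at the 5% level; the width and position of the interval then indicate how large the difference plausibly is.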

In [6]:
p <- ggplot(credit_score, aes(x = default, y = limit_bal)) + geom_boxplot()
t.test(limit_bal ~ default, data = credit_score)
wilcox.test(limit_bal ~ default, data = credit_score)


	Welch Two Sample t-test

data:  limit_bal by default
t = 28.952, df = 11982, p-value < 2.2e-16
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
 44740.91 51239.23
sample estimates:
mean in group 0 mean in group 1 
       178099.7        130109.7 



	Wilcoxon rank sum test with continuity correction

data:  limit_bal by default
W = 95786286, p-value < 2.2e-16
alternative hypothesis: true location shift is not equal to 0


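**These results are clearly practically significant:** the mean credit limit of clients who repaid is about 48,000 higher than that of clients who defaulted (95% confidence interval from 44,741 to 51,239), against group means of about 178,100 and 130,110.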

Now let's test another pair of hypotheses:
- Are the mean age and the age distribution the same in these two groups?

In [7]:
t.test(age ~ default, credit_score)
wilcox.test(age ~ default, credit_score)


	Welch Two Sample t-test

data:  age by default
t = -2.3195, df = 10173, p-value = 0.02039
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
 -0.56915863 -0.04778641
sample estimates:
mean in group 0 mean in group 1 
       35.41727        35.72574 



	Wilcoxon rank sum test with continuity correction

data:  age by default
W = 76966880, p-value = 0.3725
alternative hypothesis: true location shift is not equal to 0


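The t-test says that the mean ages *are* different at the 5% level (p = 0.02), while the Wilcoxon test does not reject equality of the distributions (p = 0.37). The confidence interval puts the difference in mean age between 0.05 and 0.57 years, so **this difference is hardly practically significant.**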
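
The rank-based test can also report an interval, which makes it easier to compare with the t-test. A sketch on simulated data (the vectors `age_good` and `age_bad` are assumptions, not the real ages): with `conf.int = TRUE`, `wilcox.test` returns the Hodges-Lehmann estimate of the location shift and a confidence interval for it.

```r
set.seed(2)
# Simulated stand-ins for age in the two groups (not the real data).
age_good <- rnorm(300, mean = 35.4, sd = 9)
age_bad <- rnorm(100, mean = 35.7, sd = 9.5)

# conf.int = TRUE adds the Hodges-Lehmann estimate of the location shift
# and a confidence interval for it.
res <- wilcox.test(age_good, age_bad, conf.int = TRUE, conf.level = 0.95)

res$estimate  # estimated shift between the groups
res$conf.int  # 95% confidence interval for that shift
```

An interval that contains zero corresponds to a test that does not reject at the same level.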

Now let's see if the gender composition of the two groups differs.

In [6]:
good <- filter(credit_score, default == 0)
bad <- filter(credit_score, default == 1)
c(ngoodmen, total_good, nbadmen, total_bad) %<-% c(table(good$sex)[1], sum(table(good$sex)), table(bad$sex)[1], sum(table(bad$sex)))
diffscoreci(ngoodmen, total_good, nbadmen, total_bad, conf.level = 0.95)




data:  

95 percent confidence interval:
 -0.06057240 -0.03366348


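The interval covers the difference between the share of men among clients who repaid and the share of men among clients who defaulted. It is negative, so men are over-represented among defaulters by about 3 to 6 percentage points; in other words, men fail to repay their credits **slightly more often** than women.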
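
A more direct way to read "who defaults more often" is to compare the default rates of the two sexes rather than the sex shares of the two default groups. A sketch with base R's `prop.test` on hypothetical counts (the numbers are made up for illustration and are not taken from credit_score):

```r
# Hypothetical counts, made up for illustration (not taken from credit_score).
defaults <- c(men = 240, women = 210)    # number of defaulters per sex
totals   <- c(men = 1000, women = 1000)  # number of clients per sex

# Two-sample test of equal proportions with a 95% CI for the difference
# (default rate of men minus default rate of women).
res <- prop.test(x = defaults, n = totals, conf.level = 0.95)

res$estimate  # default rate in each group
res$conf.int  # CI for the difference in default rates
```

`prop.test` builds its interval differently from `diffscoreci`, so the bounds from the two functions can differ slightly on the same counts.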

Now let's see whether the education level affects the default rate. First, build a table with the sizes of the default and no-default groups for each education level. Second, see how these sizes differ from the ones expected under independence. Finally, calculate the test statistic.

In [13]:
crosstab <- table(credit_score$education, credit_score$default)
crosstab
crosstab - chisq.test(crosstab)$expected
chisq.test(crosstab)
assocstats(crosstab)

   
        0     1
  0    14     0
  1  8549  2036
  2 10700  3330
  3  3680  1237
  4   116     7
  5   262    18
  6    43     8

“Chi-squared approximation may be incorrect”


   
            0         1
  0    3.0968   -3.0968
  1  305.4020 -305.4020
  2 -226.5640  226.5640
  3 -149.3596  149.3596
  4   20.2076  -20.2076
  5   43.9360  -43.9360
  6    3.2812   -3.2812

“Chi-squared approximation may be incorrect”



	Pearson's Chi-squared test

data:  crosstab
X-squared = 163.22, df = 6, p-value < 2.2e-16


                    X^2 df P(> X^2)
Likelihood Ratio 184.71  6        0
Pearson          163.22  6        0

Phi-Coefficient   : NA 
Contingency Coeff.: 0.074 
Cramer's V        : 0.074 
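
The expected counts and the warning can be reproduced from the printed table alone. The sketch below rebuilds the education table as a matrix, computes the expected counts by hand, counts the cells with an expected count below 5 (the usual trigger for the warning), and then requests a Monte Carlo p-value as one possible workaround. The seed and the number of replicates `B` are arbitrary choices.

```r
# Education-by-default counts copied from the table printed above.
crosstab <- matrix(
  c(14, 8549, 10700, 3680, 116, 262, 43,
    0, 2036, 3330, 1237, 7, 18, 8),
  ncol = 2,
  dimnames = list(education = as.character(0:6), default = c("0", "1"))
)

# Expected counts under independence: row total * column total / grand total.
expected <- outer(rowSums(crosstab), colSums(crosstab)) / sum(crosstab)
round(crosstab - expected, 4)

# Cells with an expected count below 5 make the approximation doubtful.
sum(expected < 5)

# Monte Carlo p-value that does not rely on the chi-squared approximation.
set.seed(3)
chisq.test(crosstab, simulate.p.value = TRUE, B = 2000)
```

Another option would be to merge the sparse education levels before testing, at the cost of changing the categories being compared.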

Finally, let's see if the marriage category impacts the default category.

In [14]:
marriage_crosstab <- table(credit_score$marriage, credit_score$default)
marriage_crosstab
assocstats(marriage_crosstab)

   
        0     1
  0    49     5
  1 10453  3206
  2 12623  3341
  3   239    84

                    X^2 df   P(> X^2)
Likelihood Ratio 36.609  3 5.5663e-08
Pearson          35.662  3 8.8259e-08

Phi-Coefficient   : NA 
Contingency Coeff.: 0.034 
Cramer's V        : 0.034 

For both variables (education and marriage) we see that **they are statistically significantly associated with the default category**. However, the contingency coefficients (which tell us how strongly the features are associated) are small: 0.074 for education and 0.034 for marriage.
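
To see where these coefficients come from, Cramér's V can be computed by hand as sqrt(X² / (n · (min(rows, cols) − 1))). The helper `cramers_v` below is a hypothetical name introduced for illustration; the counts are copied from the marriage table printed above.

```r
# Cramér's V: sqrt(X^2 / (n * (min(rows, cols) - 1))).
cramers_v <- function(tab) {
  # correct = FALSE keeps the plain Pearson statistic for 2x2 tables too.
  x2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  unname(sqrt(x2 / (sum(tab) * (min(dim(tab)) - 1))))
}

# Marriage-by-default counts copied from the table printed above.
marriage_crosstab <- matrix(
  c(49, 10453, 12623, 239,
    5, 3206, 3341, 84),
  ncol = 2,
  dimnames = list(marriage = as.character(0:3), default = c("0", "1"))
)

# Should reproduce the 0.034 reported by assocstats() above.
round(cramers_v(marriage_crosstab), 3)
```

With 30,000 observations even a very weak association produces a tiny p-value, which is why the effect size is the more informative number here.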