# Predicting Education History From Credit Data

## Introduction

A credit card comes with a credit amount, which limits how much the cardholder can purchase, while the amount they actually purchase appears on their bill statement. A credit default occurs when a client cannot pay their bill statement.
Is there a relationship between how many people default and their credit limit or bill statement? How does this relationship change with a person's educational history?
We use the default of credit card clients dataset from the UCI Machine Learning Repository, which contains the credit histories of 30,000 individuals. For each individual, the dataset includes the credit amount, gender, education, marital status, age, history of past payments, bill statement amounts, and previous payment amounts.


## Preliminary Data Exploration

In [1]:
library(tidyverse)
library(repr)
library(readxl)
options(repr.matrix.max.rows = 8)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.3.2     ✔ purrr   0.3.4
✔ tibble  3.0.3     ✔ dplyr   1.0.2
✔ tidyr   1.1.2     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.5.0

“package ‘ggplot2’ was built under R version 4.0.1”
“package ‘tibble’ was built under R version 4.0.2”
“package ‘tidyr’ was built under R version 4.0.2”
“package ‘dplyr’ was built under R version 4.0.2”
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



In [13]:
# reading in the data; skip = 1 skips the first line of the file,
# so the column names are taken from the second line
url <- "https://github.com/zackhamza01/DSCI-100-Project/raw/main/data/creditcardcsv.csv"
credit_data <- read_csv(url, skip = 1)

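# tidying the data: average the six bill amounts and the six payment amounts,
# keep the columns of interest, and convert EDUCATION to a factor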
tidied_credit_data <- credit_data %>%
    mutate(BILL_AVG = (BILL_AMT1 + BILL_AMT2 + BILL_AMT3 + BILL_AMT4 + BILL_AMT5 + BILL_AMT6) / 6) %>%
    mutate(PAY_AVG = (PAY_AMT1 + PAY_AMT2 + PAY_AMT3 + PAY_AMT4 + PAY_AMT5 + PAY_AMT6) / 6) %>%
    select(EDUCATION, AGE, LIMIT_BAL, BILL_AVG, PAY_AVG) %>%
    mutate(EDUCATION = as_factor(EDUCATION))

tidied_credit_data

Parsed with column specification:
cols(
  .default = col_double()
)

See spec(...) for full column specifications.



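| EDUCATION | AGE | LIMIT_BAL | BILL_AVG | PAY_AVG |
|---|---|---|---|---|
| `<fct>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 2 | 24 | 20000 | 1284.000 | 114.8333 |
| 2 | 26 | 120000 | 2846.167 | 833.3333 |
| 2 | 34 | 90000 | 16942.167 | 1836.3333 |
| 2 | 37 | 50000 | 38555.667 | 1398.0000 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 3 | 43 | 150000 | 3530.333 | 2415.000 |
| 2 | 37 | 30000 | 11749.333 | 5216.667 |
| 3 | 41 | 80000 | 44435.167 | 24530.167 |
| 2 | 46 | 50000 | 38479.000 | 1384.667 |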
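
One way to start relating these variables to education is to summarise them by education level. The sketch below is not part of the original analysis. It uses only the eight rows displayed above as a stand-in for `tidied_credit_data`, and the names `sample_rows` and `education_summary` are illustrative assumptions.

```r
library(dplyr)
library(tibble)

# Illustrative sample only: the eight rows shown in the output above,
# not the full 30,000-row dataset
sample_rows <- tribble(
    ~EDUCATION, ~AGE, ~LIMIT_BAL, ~BILL_AVG, ~PAY_AVG,
    "2", 24,  20000,  1284.000,   114.8333,
    "2", 26, 120000,  2846.167,   833.3333,
    "2", 34,  90000, 16942.167,  1836.3333,
    "2", 37,  50000, 38555.667,  1398.0000,
    "3", 43, 150000,  3530.333,  2415.000,
    "2", 37,  30000, 11749.333,  5216.667,
    "3", 41,  80000, 44435.167, 24530.167,
    "2", 46,  50000, 38479.000,  1384.667
) %>%
    mutate(EDUCATION = factor(EDUCATION))

# Count the rows and average each numeric variable within each education level
education_summary <- sample_rows %>%
    group_by(EDUCATION) %>%
    summarise(n = n(),
              mean_limit_bal = mean(LIMIT_BAL),
              mean_bill_avg = mean(BILL_AVG),
              mean_pay_avg = mean(PAY_AVG),
              .groups = "drop")

education_summary
```

Applying the same `group_by()` and `summarise()` steps to `tidied_credit_data` would give the corresponding summary for every individual in the dataset.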
