**DSCI 100: GROUP PROJECT PROPOSAL**

**I. Introduction**

This past year, Canada’s inflation rate reached a 30-year high of 5.7% (Evans, 2022). In turn, the rising cost of living has caused a 6.4% increase in credit balances (Senett, 2022). Without a comparable increase in wages, Canadians are at greater risk of credit card default. Credit default occurs when an individual misses the minimum payment due for six months (Bucci, 2022). Its consequences can be severe, including weakened credit scores and lawsuits.

Exploring the precursors to credit default can equip individuals to identify warning signs early. This raises our predictive question: given an individual’s billing and payment history, will they default on their credit card payment?

The “Default of Credit Card Clients” dataset from the UC Irvine Machine Learning Repository allows us to answer this question. For each individual, it records age, sex, highest level of education obtained, marital status, amount of credit given, history of past payments, bill statement amounts, previous payment amounts, and, most importantly, whether the individual defaulted.

**II. Preliminary Exploratory Data Analysis**

In [1]:
library(repr)
library(tidyverse)   # attaches dplyr and ggplot2, so they need no separate calls
library(tidymodels)
library(RColorBrewer)
options(repr.matrix.max.rows = 6)

set.seed(1)

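# The data must be read before it can be renamed. Without this step,
# `credit_data` most likely resolves to the unrelated data set of the same name
# in modeldata (attached by tidymodels), which has no BILL_AMT1 column.
# Assumptions: the UCI spreadsheet has been downloaded to the hypothetical path
# below, and its first row holds placeholder labels above the real column names.
credit_data <- readxl::read_excel("data/default_of_credit_card_clients.xls",
                                  skip = 1)

# Rename the numbered bill and payment columns after the months they cover,
# then draw a 1000-row random sample and treat the label as a factor.
tidy_credit_data <- credit_data |>
  rename(BILL_SEPT = BILL_AMT1,
         BILL_AUG = BILL_AMT2,
         BILL_JUL = BILL_AMT3,
         BILL_JUN = BILL_AMT4,
         BILL_MAY = BILL_AMT5,
         BILL_APR = BILL_AMT6,
         PAY_SEPT = PAY_AMT1,
         PAY_AUG = PAY_AMT2,
         PAY_JUL = PAY_AMT3,
         PAY_JUN = PAY_AMT4,
         PAY_MAY = PAY_AMT5,
         PAY_APR = PAY_AMT6,
         DEFAULT = "default payment next month") |>
  # select(BILL_SEPT:DEFAULT) |>
  slice_sample(n = 1000) |>
  mutate(DEFAULT = as_factor(DEFAULT))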


credit_split <- initial_split(tidy_credit_data,
                              prop = 0.75, 
                              strata = DEFAULT)
credit_train <- training(credit_split)
credit_test <- testing(credit_split)

credit_train
glimpse(credit_train)

# Number and percentage of training observations in each class
num_obs <- nrow(credit_train)
credit_train |>
  group_by(DEFAULT) |>
  summarize(count = n(), percentage = n() / num_obs * 100)

options(repr.plot.width = 15, repr.plot.height = 10)

# September bill amount against September payment amount, coloured by class;
# partial transparency keeps overlapping points visible
billsept_paysept <- credit_train |>
  ggplot(aes(x = BILL_SEPT, y = PAY_SEPT, color = DEFAULT)) +
  geom_point(alpha = 0.4) +
  labs(x = "September Bill Statement",
       y = "September Previous Payment",
       color = "Default Payment") +
  theme(text = element_text(size = 25))

billsept_paysept
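
The split above passes `strata = DEFAULT` so that both sets keep roughly the same share of defaulters. The following is a minimal sketch of how that could be checked. It uses a toy stand-in for the credit data: the `toy_credit` table, its uniform bill and payment values, the 22% default rate, and the `class_share()` helper are all assumptions made for illustration, not part of the project data.

```r
library(tidymodels)  # attaches rsample, dplyr and tibble

set.seed(1)

# Toy stand-in for the credit data (hypothetical values, arbitrary 22% default rate)
toy_credit <- tibble(
  BILL_SEPT = runif(1000, 0, 100000),
  PAY_SEPT  = runif(1000, 0, 20000),
  DEFAULT   = factor(sample(c(0, 1), 1000, replace = TRUE, prob = c(0.78, 0.22)))
)

# Same call pattern as the split in the cell above
toy_split <- initial_split(toy_credit, prop = 0.75, strata = DEFAULT)

# Hypothetical helper: count and percentage of each class within a data frame
class_share <- function(df) {
  df |>
    count(DEFAULT) |>
    mutate(percentage = n / sum(n) * 100)
}

# One row per set and class, so the two sets can be compared side by side
split_balance <- bind_rows(
  train = class_share(training(toy_split)),
  test  = class_share(testing(toy_split)),
  .id = "set"
)

split_balance
```

If the percentages for each class are close in the `train` and `test` rows, the stratification has preserved the class balance. A large gap would suggest the split was not stratified on the label.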






**III. Method**

#Stephanie 

Explain how you will conduct your data analysis and which variables/columns you will use.
Note: you do not need to use all of the variables/columns that exist in the raw data set.
In fact, that is often not a good idea. For each variable, ask: is this a useful variable for prediction?

Describe at least one way that you will visualize the results


**IV. Expected outcomes and significance**

#Aaron

What do you expect to find?

What impact could such findings have?

What future questions could this lead to?