# DSCI 100 Group Project -- Aadi Kanwar

## Introduction

Our group has decided to analyze a dataset containing categorical and numerical attributes of individuals (education, income, and credit score being the ones pertinent to this project), in which each credit score is classified as "Low", "Average", or "High". An individual's credit score describes their "creditworthiness", i.e., their ability and trustworthiness to repay their debts, so a high credit score indicates high creditworthiness. Credit scores are often represented numerically, but for simplicity this project treats the credit score as a single categorical variable with three levels (low, average, and high). If one wishes to think of these categories numerically, we consider a low credit score to be approximately 300, an average credit score to be around 550, and a high credit score to be around 800 (Investopedia, 2023). Given this information, we formulate the question: "Given an individual's income and education, how is their credit score classified?" The specific dataset used to answer this question is titled "Credit Score Classification Dataset" and is provided by Sujith K. Mandala on Kaggle ([here is a direct link](https://www.kaggle.com/datasets/sujithmandala/credit-score-classification-dataset/data)).
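
As a small illustration of the numeric reading of the three categories described above, the sketch below maps each label to its approximate score. The names `approx_score` and `example_labels` are hypothetical, and the three values are the rough approximations quoted in the text, not thresholds taken from the dataset.

```r
# Approximate numeric reading of each category (values quoted in the introduction)
approx_score <- c(Low = 300, Average = 550, High = 800)

# Hypothetical labels, used only to show the lookup
example_labels <- c("High", "Low", "Average")

# Look up the approximate score for each label by name
example_scores <- unname(approx_score[example_labels])
example_scores  # 800 300 550
```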

## Preliminary Exploratory Data Analysis 

### Reading the dataset into R from the web

In [1]:
install.packages("janitor")


In [11]:
library(tidyverse)
library(janitor)

# URL of the raw dataset, as uploaded to our GitHub repo
url1 <- "https://raw.githubusercontent.com/aadikanwar/DSCI100_group_project/main/Credit%20Score%20Classification%20Dataset.csv"
dir.create("data", showWarnings = FALSE)  # make sure the target folder exists before downloading
download.file(url1, "data/credit_score_data.csv")  # download the file into the data/ folder of the working directory
data <- read_csv("data/credit_score_data.csv")  # read the downloaded CSV into the notebook
head(data)

Rows: 164 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): Gender, Education, Marital Status, Home Ownership, Credit Score
dbl (3): Age, Income, Number of Children

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


| Age | Gender | Income | Education | Marital Status | Number of Children | Home Ownership | Credit Score |
|---|---|---|---|---|---|---|---|
| `<dbl>` | `<chr>` | `<dbl>` | `<chr>` | `<chr>` | `<dbl>` | `<chr>` | `<chr>` |
| 25 | Female | 50000 | Bachelor's Degree | Single | 0 | Rented | High |
| 30 | Male | 100000 | Master's Degree | Married | 2 | Owned | High |
| 35 | Female | 75000 | Doctorate | Married | 1 | Owned | High |
| 40 | Male | 125000 | High School Diploma | Single | 0 | Owned | High |
| 45 | Female | 100000 | Bachelor's Degree | Married | 3 | Owned | High |
| 50 | Male | 150000 | Master's Degree | Married | 0 | Owned | High |


### Cleaning and Wrangling the Data 

In [18]:
clean_cs_data <- data |>
    clean_names()  # convert column names to the tidy convention: lowercase, separated by underscores

clean_cs_data <- clean_cs_data |>
    mutate(credit_score = as_factor(credit_score))  # convert the variable we wish to predict into a factor, so it can be treated as a categorical class label in the analysis

head(clean_cs_data) 

| age | gender | income | education | marital_status | number_of_children | home_ownership | credit_score |
|---|---|---|---|---|---|---|---|
| `<dbl>` | `<chr>` | `<dbl>` | `<chr>` | `<chr>` | `<dbl>` | `<chr>` | `<fct>` |
| 25 | Female | 50000 | Bachelor's Degree | Single | 0 | Rented | High |
| 30 | Male | 100000 | Master's Degree | Married | 2 | Owned | High |
| 35 | Female | 75000 | Doctorate | Married | 1 | Owned | High |
| 40 | Male | 125000 | High School Diploma | Single | 0 | Owned | High |
| 45 | Female | 100000 | Bachelor's Degree | Married | 3 | Owned | High |
| 50 | Male | 150000 | Master's Degree | Married | 0 | Owned | High |


The data are now reasonably clean and appropriately wrangled for this stage, given that we have not yet settled on which variables we will use to answer our question. Each row is a single observation, each column is a single variable, and each cell contains a single value, so the data can be considered tidy. The column names were cleaned to be lowercase with no spaces, and the `credit_score` column was converted to a factor, since it is the variable we wish to predict/classify.
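
The two cleaning steps described above can be checked on a small example. The following is a minimal sketch, assuming the same `tidyverse` and `janitor` packages loaded earlier; the tibble `toy` and its values are made up for illustration and are not rows from the real dataset.

```r
library(tidyverse)
library(janitor)

# Toy data using a few of the raw column names; the values are hypothetical
toy <- tibble(
  `Age` = c(25, 30, 35),
  `Marital Status` = c("Single", "Married", "Married"),
  `Credit Score` = c("High", "Average", "Low")
)

# Same two steps as in the cleaning cell: tidy the names, then make the label a factor
toy_clean <- toy |>
  clean_names() |>
  mutate(credit_score = as_factor(credit_score))

names(toy_clean)                # cleaned column names
levels(toy_clean$credit_score)  # factor levels, in order of first appearance
count(toy_clean, credit_score)  # number of observations per class
```

Running `count()` on the real `clean_cs_data` in the same way would show how many observations fall into each credit score class, which is worth knowing before fitting any classifier.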