# The effect of socioeconomic factors on credit card approval
**Authors:**
* Brian McGiffin / *directory id* / *uid*
* Walter Osborne / *directory id* / *uid*
* Cedric Prentice / cprentic / 117196856

## Introduction
Credit is an increasingly important tool for Americans. The rising costs of products like [housing](https://www.whitehouse.gov/cea/written-materials/2021/09/09/housing-prices-and-inflation/), [cars](https://fred.stlouisfed.org/series/CUSR0000SETA02), and home appliances mean that it is difficult or impossible for most Americans to buy them outright. Besides making larger purchases possible, good credit brings another big advantage: better terms on almost all credit products. People with good credit can get higher credit limits, larger loan amounts (for things like mortgages), longer loan terms, and lower interest rates.  
  
Unfortunately, not everyone has an equal chance to benefit from the opportunities credit provides. Historic inequalities mean that African Americans, for example, face significant financial disadvantages compared to white Americans. According to the [Center for American Progress](https://www.americanprogress.org/article/systematic-inequality/), black households have less in personal savings, and they are more likely to need to draw on those savings (because of negative income shocks). This lack of available financial resources causes black households to fall into more debt than white households, and that debt makes it harder to get lines of credit.  
  
By looking at existing credit approval data, we can investigate how socioeconomic factors, like ethnicity, citizenship, and occupation, affect credit approval and credit scores. Over the course of this tutorial, we will cover the [data science lifecycle](https://www.datascience-pm.com/data-science-life-cycle/): data collection, data processing, exploratory analysis and data visualization, analysis, and interpretation.  
  
### Table of contents:
1. *TODO: Insert a table of contents here*
  
### Aside: [credit scores](https://www.investopedia.com/terms/c/credit_score.asp)
The most important number in consumer credit is the credit score. A credit score is a number that rates a consumer’s creditworthiness. It ranges from 300 to 850, with a higher score indicating a more creditworthy consumer. Lenders use it to estimate the probability that a borrower will repay loans on time. Five main factors affect a credit score:
1. Payment history (35% of score)
2. Total amount owed (30% of score)
3. Length of credit history (15% of score)
4. Types of credit (10% of score)
5. New credit (10% of score)

## Data collection
### Modules used
TODO: Provide description about the libraries we used and provide links to official documentation. This doesn't have to be long.

In [43]:
# Import the required modules
import numpy as np
import pandas as pd
import os

### Importing the data
The first step of the data science lifecycle is importing data. Our data is downloadable from [Kaggle](https://www.kaggle.com/datasets/samuelcortinhas/credit-card-approval-clean-data), but it is originally sourced from the University of California, Irvine. **Note that certain columns have been rescaled to protect the anonymity of the applicants.** Many of the columns are self-explanatory, but a brief description of the less obvious columns is below:
* Gender: 0 = female, 1 = male
* Married: 0 = single, divorced, etc.; 1 = married
* Drivers license: 0 = no license, 1 = license
* Approved: 0 = not approved for card, 1 = approved for card
* The zip code column is randomized for the applicants’ privacy
* The outstanding debt and income columns are rescaled for privacy, but the original distribution is preserved  
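
The 0/1 encodings listed above can be produced from character codes with `Series.map`. The sketch below is illustrative only: it assumes the raw approval column holds `+` and `-`, and that `+` means approved, which this tutorial has not confirmed. The name `approved_raw` matches the column name used in the processing step.

```python
import pandas as pd

# Small stand-in for the raw approval column (illustrative values)
approved_raw = pd.Series(['+', '+', '-', '-'], name='approved_raw')

# Assumed mapping: '+' = approved (1), '-' = not approved (0)
approved = approved_raw.map({'+': 1, '-': 0})
print(approved.tolist())
```

Any value missing from the mapping dictionary becomes `NaN`, which makes unexpected codes easy to spot afterwards.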
  
The raw data is in CSV (Comma-Separated Values) format. To load the data, we used the ```read_csv``` function from the [Pandas](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) library.

In [44]:
# Load the data
cwd = os.getcwd()
df = pd.read_csv(os.path.join(cwd, 'crx.csv'))
df

Unnamed: 0,b,30.83,0,u,g,w,v,1.25,t,t.1,01,f,g.1,00202,0.1,+
0,a,58.67,4.460,u,g,q,h,3.04,t,t,6,f,g,00043,560,+
1,a,24.50,0.500,u,g,q,h,1.50,t,f,0,f,g,00280,824,+
2,b,27.83,1.540,u,g,w,v,3.75,t,t,5,t,g,00100,3,+
3,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,00120,0,+
4,b,32.08,4.000,u,g,m,v,2.50,t,f,0,t,g,00360,0,+
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
684,b,21.08,10.085,y,p,e,h,1.25,f,f,0,f,g,00260,0,-
685,a,22.67,0.750,u,g,c,v,2.00,f,t,2,t,g,00200,394,-
686,a,25.25,13.500,y,p,ff,ff,2.00,f,t,1,t,g,00200,1,-
687,b,17.92,0.205,u,g,aa,v,0.04,f,f,0,f,g,00280,750,-
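
The column labels in the output above (`b`, `30.83`, and so on) are really the file's first record, which `read_csv` treated as a header row. One possible alternative, sketched here, is to pass `header=None` together with explicit `names`. The in-memory sample holds two records copied from the output above and stands in for `crx.csv`. The column names are the ones used in the renaming step of this tutorial, and reading `zip_code` as a string is an assumption made to keep its leading zeros.

```python
import io
import pandas as pd

column_names = [
    'gender_raw', 'age', 'debt', 'married_raw', 'bank_customer_raw',
    'industry_raw', 'ethnicity_raw', 'years_employed', 'prior_default_raw',
    'employed_raw', 'credit_score', 'drivers_license_raw', 'citizen_raw',
    'zip_code', 'income', 'approved_raw',
]

# Two records copied from the output above, standing in for the real file
sample_csv = io.StringIO(
    "b,30.83,0,u,g,w,v,1.25,t,t,01,f,g,00202,0,+\n"
    "a,58.67,4.460,u,g,q,h,3.04,t,t,6,f,g,00043,560,+\n"
)

# header=None keeps the first record as data; names supplies the labels
df_sample = pd.read_csv(
    sample_csv,
    header=None,
    names=column_names,
    dtype={'zip_code': str},  # assumption: keep zip codes as text
)
print(df_sample.shape)
print(df_sample.loc[0, 'zip_code'])
```

With this approach the first record never has to be re-added by hand, so re-running the cell cannot duplicate it.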


## Processing the data
TODO: Text  
**Run the block above before running this one! Running this block more than once will append the missing first row to the dataframe again, duplicating it!**

In [45]:
# Rename all of the columns
df.rename(columns={
    'b': 'gender_raw',
    '30.83': 'age',
    '0': 'debt',
    'u': 'married_raw',
    'g': 'bank_customer_raw',
    'w': 'industry_raw',
    'v': 'ethnicity_raw',
    '1.25': 'years_employed',
    't': 'prior_default_raw',
    't.1': 'employed_raw',
    '01': 'credit_score',
    'f': 'drivers_license_raw',
    'g.1': 'citizen_raw',
    '00202': 'zip_code',
    '0.1': 'income',
    '+': 'approved_raw'
}, inplace=True)

# Re-add the first row, which read_csv consumed as the header
new_row = {
    'gender_raw': ['b'],
    'age': [30.83],
    'debt': [0],
    'married_raw': ['u'],
    'bank_customer_raw': ['g'],
    'industry_raw': ['w'],
    'ethnicity_raw': ['v'],
    'years_employed': [1.25],
    'prior_default_raw': ['t'],
    'employed_raw': ['t'],
    'credit_score': [1],
    'drivers_license_raw': ['f'],
    'citizen_raw': ['g'],
    'zip_code': ['00202'],
    'income': [0],
    'approved_raw': ['+']
}
df_temp = pd.DataFrame(new_row)
df = pd.concat([df, df_temp], ignore_index=True)

# Replace the '?' placeholders with NaN, then drop every row that has a missing value
df.replace(to_replace='?', value=np.nan, inplace=True)
df.dropna(axis=0, how='any', inplace=True)

# Replace the existing gender values with numeric values

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 653 entries, 0 to 689
Data columns (total 16 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   gender_raw           653 non-null    object 
 1   age                  653 non-null    object 
 2   debt                 653 non-null    float64
 3   married_raw          653 non-null    object 
 4   bank_customer_raw    653 non-null    object 
 5   industry_raw         653 non-null    object 
 6   ethnicity_raw        653 non-null    object 
 7   years_employed       653 non-null    float64
 8   prior_default_raw    653 non-null    object 
 9   employed_raw         653 non-null    object 
 10  credit_score         653 non-null    int64  
 11  drivers_license_raw  653 non-null    object 
 12  citizen_raw          653 non-null    object 
 13  zip_code             653 non-null    object 
 14  income               653 non-null    int64  
 15  approved_raw         653 non-null    object 
