# The effect of socioeconomic factors on credit card approval
**Authors:**
* Brian McGiffin / *directory id* / *uid*
* Walter Osborne / *directory id* / *uid*
* Cedric Prentice / cprentic / 117196856

## Introduction
Credit is an increasingly important tool for Americans. The increasing costs of products like [housing](https://www.whitehouse.gov/cea/written-materials/2021/09/09/housing-prices-and-inflation/), [cars](https://fred.stlouisfed.org/series/CUSR0000SETA02), and home appliances mean that it is difficult or impossible for most Americans to buy them outright. Besides allowing people to make larger purchases than they otherwise could have, people with good credit get another big advantage: better terms for almost all credit products. People with good credit can get higher credit limits, larger loan amounts (for things like mortgages), longer loan terms, and lower interest rates.  
  
Unfortunately, not everyone has an equal chance to reap the opportunities credit provides. Historic inequalities mean that African Americans, for example, face significant financial disadvantages compared to white Americans. According to the [Center for American Progress](https://www.americanprogress.org/article/systematic-inequality/), black households have fewer personal savings, and they are more likely to need to use those savings (because of negative income shocks). This lack of available financial resources causes black households to fall into more debt than white households. All that debt makes it harder to get lines of credit.  
  
By looking at existing credit approval data, we can investigate how socioeconomic factors, like ethnicity, citizenship, and occupation, affect credit approval and credit scores. Over this tutorial, we will cover the [data science lifecycle](https://www.datascience-pm.com/data-science-life-cycle/): data collection, data processing, exploratory analysis and data visualization, analysis, and interpretation.  
  
### Table of contents:
1. *TODO: Insert a table of contents here*
  
### Aside: [credit scores](https://www.investopedia.com/terms/c/credit_score.asp)
The most important datapoint of credit is the credit score. A credit score is a number that rates a consumer’s credit worthiness. It ranges from 300 to 850, with a higher score indicating a consumer that is more worthy. Lenders use it to evaluate the probability that a borrower will repay loans in a timely manner. There are five main factors that impact credit score:
1. Payment history (35% of score)
2. Total amount owed (30% of score)
3. Length of credit history (15% of score)
4. Types of credit (10% of score)
5. New credit (10% of score)

## Data collection
### Modules used
TODO: Provide description about the libraries we used and provide links to official documentation. This doesn't have to be long.

In [34]:
# Import the required modules
import numpy as np
import pandas as pd
import os

### Importing the data
The first step of the data science lifecycle is importing data. Our data is downloadable from [Kaggle](https://www.kaggle.com/datasets/samuelcortinhas/credit-card-approval-clean-data), but it is originally sourced from The University of California, Irvine. **Note that certain columns have been rescaled to protect the anonymity of the applicants.** The data is hard to understand now, but in the next step we will clean it up and make it readable and usable for analysis.
  
The raw data is in CSV (Comma-Separated Value) format. To upload the data, we used the ```read_csv``` function from the [Pandas](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) library.

In [35]:
# Load the data
cwd = os.getcwd()
df = pd.read_csv(cwd + '/crx.csv')
df

Unnamed: 0,b,30.83,0,u,g,w,v,1.25,t,t.1,01,f,g.1,00202,0.1,+
0,a,58.67,4.460,u,g,q,h,3.04,t,t,6,f,g,00043,560,+
1,a,24.50,0.500,u,g,q,h,1.50,t,f,0,f,g,00280,824,+
2,b,27.83,1.540,u,g,w,v,3.75,t,t,5,t,g,00100,3,+
3,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,00120,0,+
4,b,32.08,4.000,u,g,m,v,2.50,t,f,0,t,g,00360,0,+
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
684,b,21.08,10.085,y,p,e,h,1.25,f,f,0,f,g,00260,0,-
685,a,22.67,0.750,u,g,c,v,2.00,f,t,2,t,g,00200,394,-
686,a,25.25,13.500,y,p,ff,ff,2.00,f,t,1,t,g,00200,1,-
687,b,17.92,0.205,u,g,aa,v,0.04,f,f,0,f,g,00280,750,-


## Processing the data
TODO: Text  
  
Many of the columns are self-explanatory, but a brief description of the confusing columns are below:
* Gender: 0 = female, 1 = male
* Married: 0 = single, divorced, etc.; 1 = married
* Drivers license: 0 = no license, 1 = license
* Approved: 0 = not approved for card, 1 = approved for card
* The zip code column is randomized for the applicants’ privacy
* The outstanding debt and income columns are rescaled for privacy, but the original distribution is preserved  

**Do not run this block without running the above block first! Running it multiple times will cause the missing row to be duplicated in the dataframe!**

In [36]:
# Rename all of the columns
df.rename(columns={
    'b': 'gender_raw',
    '30.83': 'age',
    '0': 'debt',
    'u': 'married_raw',
    'g': 'bank_customer_raw',
    'w': 'industry',
    'v': 'ethnicity',
    '1.25': 'years_employed',
    't': 'prior_default_raw',
    't.1': 'employed_raw',
    '01': 'credit_score',
    'f': 'drivers_license_raw',
    'g.1': 'citizen',
    '00202': 'zip_code',
    '0.1': 'income',
    '+': 'approved_raw'
}, inplace=True)

# Re add the first row to the dataframe
new_row = {
    'gender_raw': ['b'],
    'age': [30.83],
    'debt': [0],
    'married_raw': ['u'],
    'bank_customer_raw': ['g'],
    'industry': ['w'],
    'ethnicity': ['v'],
    'years_employed': [1.25],
    'prior_default_raw': ['t'],
    'employed_raw': ['t'],
    'credit_score': [1],
    'drivers_license_raw': ['f'],
    'citizen': ['g'],
    'zip_code': ['00202'],
    'income': [0],
    'approved_raw': ['+']
}
df_temp = pd.DataFrame(new_row)
df = pd.concat([df, df_temp], ignore_index=True)

# Replace missing values with NaN
df.replace(to_replace='?', value=np.NaN, inplace=True)
df.dropna(axis=0, how='any', inplace=True)

# Replace the existing gender values with numeric values
df.insert(1, 'gender', 0)
for index, row in df.iterrows():
    raw = df.at[index, 'gender_raw']
    if raw == 'b':
        df.at[index, 'gender'] = 1 # 1 represents male
    else:
        df.at[index, 'gender'] = 0 # 0 represents female
df.drop(columns=['gender_raw'], inplace=True)

# Replace the existing marriage statuses with a numeric value
df.insert(4, 'married', 0)
for index, row in df.iterrows():
    raw = df.at[index, 'married_raw']
    if raw == 'u':
        df.at[index, 'married'] = 1 # 1 represents a married person
    else:
        df.at[index, 'married'] = 0 # 0 represents anyone who isn't married
df.drop(columns=['married_raw'], inplace=True)

# Replace existing bank customer values with numeric values
df.insert(5, 'bank_customer', 0)
for index, row in df.iterrows():
    raw = df.at[index, 'bank_customer_raw']
    if raw == 'p':
        df.at[index, 'bank_customer'] = 0 # 0 represents someone without a bank account
    else:
        df.at[index, 'bank_customer'] = 1 # 1 represents someone with at least one bank account
df.drop(columns=['bank_customer_raw'], inplace=True)

# Replace existing ethnicity values with a more helpful names
for index, row in df.iterrows():
    raw = df.at[index, 'ethnicity']
    match raw:
        case 'bb':
            df.at[index, 'ethnicity'] = 'asian'
        case 'ff':
            df.at[index, 'ethnicity'] = 'latino'
        case 'h':
            df.at[index, 'ethnicity'] = 'black'
        case 'v':
            df.at[index, 'ethnicity'] = 'white'
        case _:
            df.at[index, 'ethnicity'] = 'other'

# Replace existing prior default, employed, and driver's license values with numeric values
df.insert(9, 'prior_default', 0)
df.insert(10, 'employed', 0)
df.insert(14, 'drivers_license', 0)
for index, row in df.iterrows():
    raw_default = df.at[index, 'prior_default_raw']
    raw_employed = df.at[index, 'prior_default_raw']
    raw_license = df.at[index, 'prior_default_raw']

    if raw_default == 't':
        df.at[index, 'prior_default'] = 1 # has defaulted on a loan before
    else:
        df.at[index, 'prior_default'] = 0 # has not defaulted on a loan before
    
    if raw_employed == 't':
        df.at[index, 'employed'] = 1 # employed
    else:
        df.at[index, 'employed'] = 0 # not employed

    if raw_license == 't':
        df.at[index, 'drivers_license'] = 1 # has a driver's license
    else:
        df.at[index, 'drivers_license'] = 0 # doesn't have a driver's license
df.drop(columns=['prior_default_raw', 'employed_raw', 'drivers_license_raw'], inplace=True)

# Replace existing citizenship values with meaningful strings
for index, row in df.iterrows():
    raw = df.at[index, 'citizen']
    match raw:
        case 'g':
            df.at[index, 'citizen'] = 'birth'
        case 'p':
            df.at[index, 'citizen'] = 'temporary'
        case 's':
            df.at[index, 'citizen'] = 'naturalized'

# Replace existing approval values with numeric values
df.insert(16, 'approved', 0)
for index, row in df.iterrows():
    raw = df.at[index, 'approved_raw']
    if raw == '+':
        df.at[index, 'approved'] = 1
    else:
        df.at[index, 'approved'] = 0

# Drop industry and zip code because they are unusable for our analysis (and approved_raw from the last step)
df.drop(columns=['approved_raw', 'industry', 'zip_code'], inplace=True)

df

Unnamed: 0,gender,age,debt,married,bank_customer,ethnicity,years_employed,prior_default,employed,credit_score,drivers_license,citizen,income,approved
0,0,58.67,4.460,1,1,black,3.04,1,1,6,1,birth,560,1
1,0,24.50,0.500,1,1,black,1.50,1,1,0,1,birth,824,1
2,1,27.83,1.540,1,1,white,3.75,1,1,5,1,birth,3,1
3,1,20.17,5.625,1,1,white,1.71,1,1,0,1,naturalized,0,1
4,1,32.08,4.000,1,1,white,2.50,1,1,0,1,birth,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
685,0,22.67,0.750,1,1,white,2.00,0,0,2,0,birth,394,0
686,0,25.25,13.500,0,0,latino,2.00,0,0,1,0,birth,1,0
687,1,17.92,0.205,1,1,white,0.04,0,0,0,0,birth,750,0
688,1,35.00,3.375,1,1,black,8.29,0,0,0,0,birth,0,0
