# Group 13 Project Proposal

In [1]:
pip install -U altair

Note: you may need to restart the kernel to use updated packages.



**Title: Credit Score Classification**

https://www.kaggle.com/datasets/parisrohan/credit-score-classification/data 


**Introduction:**
A credit report is a summary of a person’s credit history and is created when you borrow money or apply for a credit card. A credit score is a three-digit number calculated from your credit report that summarizes how well you manage credit and how risky it would be for someone to lend you money. The higher the credit score, the better your rating.


A credit score is calculated from several factors, including:
- a person's annual income
- the number of credit cards they have
- the number of loans they have
- their credit card payment history
- the age of their credit history




**The question we aim to answer:** Can we classify someone’s credit score as Good, Standard, or Poor based on their banking history and financial traits such as those listed above?

**Dataset description:** The dataset contains bank and credit-related information about many individuals, collected by a global finance company. Its 27 columns cover bank account history, loans, debt, and EMI (equated monthly installment), along with the number of credit cards a person has and their credit card payment history.


In [2]:
import random
import altair as alt
import pandas as pd
import numpy as np
from sklearn import set_config
from sklearn.compose import make_column_transformer
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

### Reading the data from a URL

In [3]:
url_train = "https://raw.githubusercontent.com/vedika37/dsci100-proj/main/train.csv"
url_test = "https://raw.githubusercontent.com/vedika37/dsci100-proj/main/test.csv"
train = pd.read_csv(url_train, sep=",", low_memory=False)
test = pd.read_csv(url_test, sep=",", low_memory=False)

### Cleaning Training Data 

In [4]:
# dropping null values and columns not used in analysis

# Predictors:

# - number of delayed payments 
# - delay from due date 
# - Credit_Utilization_Ratio
# - credit mix
# - credit history age



train = train[['Delay_from_due_date', 'Num_of_Delayed_Payment', 'Credit_Mix', 'Credit_Utilization_Ratio','Credit_History_Age','Credit_Score']]
train = train.dropna()
train

Unnamed: 0,Delay_from_due_date,Num_of_Delayed_Payment,Credit_Mix,Credit_Utilization_Ratio,Credit_History_Age,Credit_Score
0,3,7,_,26.822620,22 Years and 1 Months,Good
2,3,7,Good,28.609352,22 Years and 3 Months,Good
3,5,4,Good,31.377862,22 Years and 4 Months,Good
5,8,4,Good,27.262259,22 Years and 6 Months,Good
6,3,8_,Good,22.537593,22 Years and 7 Months,Good
...,...,...,...,...,...,...
99994,20,6,_,39.323569,31 Years and 5 Months,Poor
99995,23,7,_,34.663572,31 Years and 6 Months,Poor
99996,18,7,_,40.565631,31 Years and 7 Months,Poor
99997,27,6,Good,41.255522,31 Years and 8 Months,Poor


In [5]:
# only the last expression in a cell is displayed, so the output below is for Credit_Score
train["Credit_Mix"].unique()
train["Credit_Score"].unique()

array(['Good', 'Standard', 'Poor'], dtype=object)

In [6]:
# dropping rows where Credit_Mix is the placeholder value "_"
train = train[train["Credit_Mix"] != "_"]
train

Unnamed: 0,Delay_from_due_date,Num_of_Delayed_Payment,Credit_Mix,Credit_Utilization_Ratio,Credit_History_Age,Credit_Score
2,3,7,Good,28.609352,22 Years and 3 Months,Good
3,5,4,Good,31.377862,22 Years and 4 Months,Good
5,8,4,Good,27.262259,22 Years and 6 Months,Good
6,3,8_,Good,22.537593,22 Years and 7 Months,Good
8,3,4,Good,24.464031,26 Years and 7 Months,Standard
...,...,...,...,...,...,...
99986,33,25,Bad,24.713861,5 Years and 10 Months,Poor
99989,33,25,Bad,33.359987,6 Years and 1 Months,Standard
99991,33,25,Bad,37.140784,6 Years and 3 Months,Standard
99997,27,6,Good,41.255522,31 Years and 8 Months,Poor


In [7]:
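# keeping only rows whose Num_of_Delayed_Payment is a string of digits; this drops malformed entries such as "8_"
train = train[train['Num_of_Delayed_Payment'].str.isdigit()]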
train

Unnamed: 0,Delay_from_due_date,Num_of_Delayed_Payment,Credit_Mix,Credit_Utilization_Ratio,Credit_History_Age,Credit_Score
2,3,7,Good,28.609352,22 Years and 3 Months,Good
3,5,4,Good,31.377862,22 Years and 4 Months,Good
5,8,4,Good,27.262259,22 Years and 6 Months,Good
8,3,4,Good,24.464031,26 Years and 7 Months,Standard
9,7,1,Good,38.550848,26 Years and 8 Months,Good
...,...,...,...,...,...,...
99986,33,25,Bad,24.713861,5 Years and 10 Months,Poor
99989,33,25,Bad,33.359987,6 Years and 1 Months,Standard
99991,33,25,Bad,37.140784,6 Years and 3 Months,Standard
99997,27,6,Good,41.255522,31 Years and 8 Months,Poor


In [8]:
# converting the digit strings to integers
train = train.astype({'Num_of_Delayed_Payment': int})
train


Unnamed: 0,Delay_from_due_date,Num_of_Delayed_Payment,Credit_Mix,Credit_Utilization_Ratio,Credit_History_Age,Credit_Score
2,3,7,Good,28.609352,22 Years and 3 Months,Good
3,5,4,Good,31.377862,22 Years and 4 Months,Good
5,8,4,Good,27.262259,22 Years and 6 Months,Good
8,3,4,Good,24.464031,26 Years and 7 Months,Standard
9,7,1,Good,38.550848,26 Years and 8 Months,Good
...,...,...,...,...,...,...
99986,33,25,Bad,24.713861,5 Years and 10 Months,Poor
99989,33,25,Bad,33.359987,6 Years and 1 Months,Standard
99991,33,25,Bad,37.140784,6 Years and 3 Months,Standard
99997,27,6,Good,41.255522,31 Years and 8 Months,Poor


In [9]:
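# keeping only the whole-year part of strings such as "22 Years and 3 Months" (the months are discarded)
train['Credit_History_Age'] = train['Credit_History_Age'].str.split(" ").str[0]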
train

Unnamed: 0,Delay_from_due_date,Num_of_Delayed_Payment,Credit_Mix,Credit_Utilization_Ratio,Credit_History_Age,Credit_Score
2,3,7,Good,28.609352,22,Good
3,5,4,Good,31.377862,22,Good
5,8,4,Good,27.262259,22,Good
8,3,4,Good,24.464031,26,Standard
9,7,1,Good,38.550848,26,Good
...,...,...,...,...,...,...
99986,33,25,Bad,24.713861,5,Poor
99989,33,25,Bad,33.359987,6,Standard
99991,33,25,Bad,37.140784,6,Standard
99997,27,6,Good,41.255522,31,Poor


In [10]:
train = train.astype({'Credit_History_Age': int})
train.dtypes

Delay_from_due_date           int64
Num_of_Delayed_Payment        int64
Credit_Mix                   object
Credit_Utilization_Ratio    float64
Credit_History_Age            int64
Credit_Score                 object
dtype: object
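The parsing step above keeps only the whole-year part of `Credit_History_Age`. If finer resolution turns out to matter, one possible alternative is to convert the string to a total number of months. This is only a sketch: the helper name `history_age_in_months` and the demo values are our own illustration, and the regular expression assumes every value follows the "N Years and M Months" pattern seen in the output above.

```python
import pandas as pd


def history_age_in_months(ages: pd.Series) -> pd.Series:
    """Convert strings like '22 Years and 3 Months' to a total month count."""
    # Hypothetical helper: pull the two numbers out with named capture groups.
    parts = ages.str.extract(
        r"(?P<years>\d+)\s+Years?\s+and\s+(?P<months>\d+)\s+Months?"
    )
    years = pd.to_numeric(parts["years"])
    months = pd.to_numeric(parts["months"])
    # Rows that are missing or do not match the pattern become <NA>.
    return (years * 12 + months).astype("Int64")


# Small made-up sample in the same format as the raw column.
demo_ages = pd.Series(
    ["22 Years and 3 Months", "5 Years and 10 Months", "6 Years and 1 Months", None]
)
demo_months = history_age_in_months(demo_ages)
print(demo_months)
```

Using the nullable `Int64` dtype keeps unparseable rows visible as missing values instead of raising an error, so they can be dropped explicitly afterwards.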

In [11]:
# note: mapping credit mix to numerical values to be able to use it as a predictor for classification
#print(train['Credit_Mix'].unique())
mapping = {'Good': 1, 'Bad': -1, 'Standard': 0}
train['Credit_Mix'] = train['Credit_Mix'].map(mapping)
train

Unnamed: 0,Delay_from_due_date,Num_of_Delayed_Payment,Credit_Mix,Credit_Utilization_Ratio,Credit_History_Age,Credit_Score
2,3,7,1,28.609352,22,Good
3,5,4,1,31.377862,22,Good
5,8,4,1,27.262259,22,Good
8,3,4,1,24.464031,26,Standard
9,7,1,1,38.550848,26,Good
...,...,...,...,...,...,...
99986,33,25,-1,24.713861,5,Poor
99989,33,25,-1,33.359987,6,Standard
99991,33,25,-1,37.140784,6,Standard
99997,27,6,1,41.255522,31,Poor


In [23]:
train = train.astype({'Credit_Mix': int})
print(train['Credit_Mix'].unique())

[ 1  0 -1]


### Cleaning Testing Data

In [12]:
# dropping null values and columns not used in analysis
test = test[['Delay_from_due_date', 'Num_of_Delayed_Payment', 'Credit_Mix', 'Credit_Utilization_Ratio','Credit_History_Age']]
test = test.dropna()
test

Unnamed: 0,Delay_from_due_date,Num_of_Delayed_Payment,Credit_Mix,Credit_Utilization_Ratio,Credit_History_Age
0,3,7,Good,35.030402,22 Years and 9 Months
1,3,9,Good,33.053114,22 Years and 10 Months
3,4,5,Good,32.430559,23 Years and 0 Months
4,3,1,Good,25.926822,27 Years and 3 Months
5,3,3,Good,30.116600,27 Years and 4 Months
...,...,...,...,...,...
49993,33,25,Bad,37.528511,6 Years and 5 Months
49994,33,22,Bad,27.027812,6 Years and 6 Months
49997,23,5,Good,36.858542,32 Years and 0 Months
49998,21,6_,Good,39.139840,32 Years and 1 Months


In [13]:
test['Credit_Mix'].unique()

array(['Good', '_', 'Standard', 'Bad'], dtype=object)

In [14]:
test = test[test["Credit_Mix"] != "_"]
test['Credit_Mix'].unique()

array(['Good', 'Standard', 'Bad'], dtype=object)

In [15]:
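# keeping only rows whose Num_of_Delayed_Payment is a string of digits, then converting to integers
test = test[test['Num_of_Delayed_Payment'].str.isdigit()]
test = test.astype({'Num_of_Delayed_Payment': int})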
test


Unnamed: 0,Delay_from_due_date,Num_of_Delayed_Payment,Credit_Mix,Credit_Utilization_Ratio,Credit_History_Age
0,3,7,Good,35.030402,22 Years and 9 Months
1,3,9,Good,33.053114,22 Years and 10 Months
3,4,5,Good,32.430559,23 Years and 0 Months
4,3,1,Good,25.926822,27 Years and 3 Months
5,3,3,Good,30.116600,27 Years and 4 Months
...,...,...,...,...,...
49990,7,12,Good,25.708414,30 Years and 7 Months
49992,33,25,Bad,32.391288,6 Years and 4 Months
49993,33,25,Bad,37.528511,6 Years and 5 Months
49994,33,22,Bad,27.027812,6 Years and 6 Months


In [16]:
test['Num_of_Delayed_Payment'].unique()

array([   7,    9,    5,    1,    3, 1942,    6,   18,    0,   17,   15,
         14,   12,    8,   19,    2,   10,   11, 1150,   23,   21,   13,
       2077,   16,   20,    4,   24,   27,   22,  429, 2806,   25, 2849,
       2465, 4246, 1186,   75,   26, 4318, 4219,   28, 1591, 3552,  420,
       2606,  186,  959, 1122, 3178, 1884, 3398, 1486, 4128,  351,  861,
        873, 3948, 2801, 3627, 3825, 3954,  376, 2354, 4343, 2279, 1958,
       2608, 1274,  452,  845, 3097, 2161,  377, 1544, 4136, 2412, 3429,
        211,  518,  657, 3477,  370, 1234, 1534, 1080, 3057, 2819, 1329,
       3689, 2273, 2403, 1266, 3684, 1802,  181, 3591, 2912, 1356, 1117,
       2672, 2001, 4298,  590, 1095,  100, 1235, 3177, 2276, 2822, 1035,
        832, 2942, 2802, 4351, 3393, 3556,  288, 1146,  975, 2424,  265,
        179, 1513, 3071,  175,  700, 2836,  434, 2649, 2903, 1891, 2999,
        687, 1437, 3898, 4122, 2568, 4278, 1633, 2431,  538, 1570, 1297,
        414, 2352, 4044,  773, 1632, 2492,  961,  9
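The training and testing data go through nearly the same cleaning steps, so one option is to collect them in a single function and call it on both frames. The sketch below is our own illustration, not part of the analysis above: `clean_credit_frame`, `PREDICTORS`, `CREDIT_MIX_CODES`, and the small `raw` frame are hypothetical names and made-up values that mirror the columns used in this notebook.

```python
import pandas as pd

PREDICTORS = [
    "Delay_from_due_date",
    "Num_of_Delayed_Payment",
    "Credit_Mix",
    "Credit_Utilization_Ratio",
    "Credit_History_Age",
]
CREDIT_MIX_CODES = {"Good": 1, "Standard": 0, "Bad": -1}


def clean_credit_frame(df, extra_columns=()):
    """Apply the cleaning steps from the cells above to a raw frame."""
    out = df[PREDICTORS + list(extra_columns)].dropna()
    # Drop the "_" placeholder in Credit_Mix.
    out = out[out["Credit_Mix"] != "_"]
    # Keep only digit-only payment counts (drops entries such as "8_").
    digits_only = out["Num_of_Delayed_Payment"].astype(str).str.isdigit()
    out = out[digits_only].copy()
    out["Num_of_Delayed_Payment"] = out["Num_of_Delayed_Payment"].astype(int)
    # Keep the whole-year part of "22 Years and 3 Months".
    years = out["Credit_History_Age"].str.split(" ").str[0]
    out["Credit_History_Age"] = years.astype(int)
    # Assumes Credit_Mix only holds the three labels in CREDIT_MIX_CODES.
    out["Credit_Mix"] = out["Credit_Mix"].map(CREDIT_MIX_CODES).astype(int)
    return out


# Made-up rows in the same shape as the raw data.
raw = pd.DataFrame({
    "Delay_from_due_date": [3, 5, 8, 33],
    "Num_of_Delayed_Payment": ["7", "8_", "4", "25"],
    "Credit_Mix": ["Good", "Good", "_", "Bad"],
    "Credit_Utilization_Ratio": [28.6, 31.4, 27.3, 24.7],
    "Credit_History_Age": [
        "22 Years and 3 Months",
        "22 Years and 4 Months",
        "22 Years and 6 Months",
        "5 Years and 10 Months",
    ],
    "Credit_Score": ["Good", "Good", "Good", "Poor"],
})
cleaned = clean_credit_frame(raw, extra_columns=["Credit_Score"])
print(cleaned)
```

With a helper like this, the training frame could be cleaned with `extra_columns=["Credit_Score"]` and the testing frame with the default, which keeps the two sets on identical preprocessing.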

### Summarizing Training Data

In [17]:
# each column along with its datatype
train.dtypes

Delay_from_due_date           int64
Num_of_Delayed_Payment        int64
Credit_Mix                    int64
Credit_Utilization_Ratio    float64
Credit_History_Age            int64
Credit_Score                 object
dtype: object

In [18]:
# generating descriptive statistics for the numeric columns (describe() skips 'object' columns such as Credit_Score by default)
train.describe()

Unnamed: 0,Delay_from_due_date,Num_of_Delayed_Payment,Credit_Mix,Credit_Utilization_Ratio,Credit_History_Age
count,65084.0,65084.0,65084.0,65084.0,65084.0
mean,21.166615,30.710205,0.059969,32.294107,17.919581
std,14.865283,224.280037,0.732188,5.115086,8.320779
min,-5.0,0.0,-1.0,20.172942,0.0
25%,10.0,9.0,0.0,28.069226,12.0
50%,18.0,14.0,0.0,32.313587,18.0
75%,28.0,18.0,1.0,36.507653,25.0
max,67.0,4397.0,1.0,50.0,33.0


### Visualizing Data - Distribution of Predictor Variables

In [19]:
# taking the first 1,000 rows, since the full data set is too large to chart
subset_train = train.iloc[:1000, :]
subset_train

Unnamed: 0,Delay_from_due_date,Num_of_Delayed_Payment,Credit_Mix,Credit_Utilization_Ratio,Credit_History_Age,Credit_Score
2,3,7,1,28.609352,22,Good
3,5,4,1,31.377862,22,Good
5,8,4,1,27.262259,22,Good
8,3,4,1,24.464031,26,Standard
9,7,1,1,38.550848,26,Good
...,...,...,...,...,...,...
1532,9,7,1,32.035662,15,Poor
1533,14,7,1,40.136062,15,Poor
1534,10,7,1,29.174795,15,Standard
1535,14,8,1,28.592943,15,Standard


In [24]:
quantitative_to_plot = ["Delay_from_due_date", "Num_of_Delayed_Payment", "Credit_Mix", "Credit_Utilization_Ratio", "Credit_History_Age"]

def remove_outliers(column):
    # returns a boolean mask that is True for values within 1.5 * IQR of the quartiles
    Q1 = column.quantile(0.25)
    Q3 = column.quantile(0.75)
    IQR = Q3 - Q1
    return (column >= Q1 - 1.5 * IQR) & (column <= Q3 + 1.5 * IQR)

# Apply outlier removal to each column
for col in quantitative_to_plot:
    subset_train = subset_train[remove_outliers(subset_train[col])]

train_pairplot = alt.Chart(subset_train).mark_point(opacity=0.4).encode(
    alt.X(alt.repeat("row"), type="quantitative"),
    alt.Y(alt.repeat("column"), type="quantitative"),
    color = alt.Color("Credit_Score").title("Credit_Score")
).properties(
    width=200,
    height=200
).repeat(
    column=quantitative_to_plot,
    row=quantitative_to_plot
)
train_pairplot
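As a small worked example of the interquartile-range (IQR) rule used in `remove_outliers`, the sketch below applies the same 1.5 × IQR bounds to five made-up values. The name `iqr_mask` and the numbers are assumptions for illustration only.

```python
import pandas as pd


def iqr_mask(column: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = column.quantile(0.25)
    q3 = column.quantile(0.75)
    iqr = q3 - q1
    # between() is inclusive on both ends, matching the >= and <= above.
    return column.between(q1 - k * iqr, q3 + k * iqr)


# Made-up values: Q1 = 2, Q3 = 4, IQR = 2, so the bounds are -1 and 7.
delayed = pd.Series([1, 2, 3, 4, 100])
mask = iqr_mask(delayed)
kept = delayed[mask]
print(kept.tolist())  # [1, 2, 3, 4]
```

Because the loop above filters one column at a time, the quartiles of each later column are computed on rows that already survived the earlier filters.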

### Methods

We plan to conduct our data analysis using the K-nearest neighbors (KNN) classification algorithm. We'll choose the best value of k using cross-validation and then use the following predictors to predict whether someone's credit score is Good, Standard, or Poor.

Predictors:

- number of delayed payments 
- delay from due date 
- Credit_Utilization_Ratio
- credit mix
- credit history age
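The planned approach could look roughly like the sketch below: standardize the predictors, fit a KNN classifier, and pick k by 5-fold cross-validation. To keep the example self-contained it runs on a synthetic stand-in frame (`demo`) with made-up labels rather than the real data, and the grid of k values and the use of `GridSearchCV` are our assumptions, not settled choices.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cleaned training data (values are made up).
rng = np.random.default_rng(0)
n = 300
demo = pd.DataFrame({
    "Delay_from_due_date": rng.integers(0, 60, n),
    "Num_of_Delayed_Payment": rng.integers(0, 25, n),
    "Credit_Mix": rng.choice([-1, 0, 1], n),
    "Credit_Utilization_Ratio": rng.uniform(20, 50, n),
    "Credit_History_Age": rng.integers(0, 33, n),
})
# Artificial labels so that the classifier has something to learn.
demo["Credit_Score"] = np.where(
    demo["Delay_from_due_date"] > 30,
    "Poor",
    np.where(demo["Credit_Mix"] == 1, "Good", "Standard"),
)

X = demo.drop(columns="Credit_Score")
y = demo["Credit_Score"]

# KNN is distance-based, so the predictors are standardized first.
knn_pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())
param_grid = {"kneighborsclassifier__n_neighbors": list(range(1, 30, 4))}

# 5-fold cross-validation over the candidate values of k.
search = GridSearchCV(knn_pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

best_k = search.best_params_["kneighborsclassifier__n_neighbors"]
print(best_k, search.best_score_)
```

Putting the scaler inside the pipeline means it is refit on each training fold, so the validation folds do not leak into the scaling.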


**Describing our visualizations**
We will plot histograms to visualize the distributions of our predictors. This will help us explore the relationship between credit score and the factors that may influence it most, such as income, missed payments, and credit utilization ratio.


### Expected outcomes and significance
- *What do you expect to find?* 
  
  We expect people with more loans to fall into a lower category, and people with a higher income, an older credit account (more credit history), and fewer delayed payments to fall into a better category. We also expect a mix of credit types (loans, credit cards, mortgages) to be associated with a better score.
  
- *What impact could such findings have?*
  - Helping banks predict whether it is a good idea to issue a credit card to a new customer.
  - Informing an individual’s credit limit or interest rate.
  - Studying how individual factors relate to credit score category.
  
- *What future questions could this lead to?* 
  
  How should we evaluate or categorize someone who has just started working and does not yet have a long enough credit history?
