# Group 13 Project Proposal

In [49]:
pip install -U altair

^C
Note: you may need to restart the kernel to use updated packages.



**Title: Credit Score Classification**

https://www.kaggle.com/datasets/parisrohan/credit-score-classification/data 


**Introduction:**
A credit report is a summary of a person’s credit history and is created when you borrow money or apply for a credit card. A credit score is a 3 digit number calculated based on your credit report that summarizes how well you manage your credit and how risky it would be for someone to lend you money. The higher credit score the better your rating is.


A credit score is calculated based on a few different factors such as:
- a person's anual income
- the number of credit cards they have
- the number of loans they have
- their credit card payment history
- how old their credit is
and more




**The Question we aim to answer:** Can we classify someone’s credit score based on certain banking history and financial traits (shown above)?

**Dataset description:** The dataset contains the bank and credit-related information of many individuals that have been amassed by a global finance company. It contains 27 columns of these information such as bank account history, loans, debt and EMI along with the number of credit cards a person has and their credit card payment history


In [None]:
import random
import altair as alt
import pandas as pd
import numpy as np
from sklearn import set_config
from sklearn.compose import make_column_transformer
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, cross_validate

np.random.seed(999) # setting the seed

### Reading the data from a URL

In [None]:
url_credit = "https://raw.githubusercontent.com/vedika37/dsci100-proj/main/train.csv"
credit = pd.read_csv(url_credit,sep = ",", low_memory=False)
credit.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 28 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   ID                        100000 non-null  object 
 1   Customer_ID               100000 non-null  object 
 2   Month                     100000 non-null  object 
 3   Name                      90015 non-null   object 
 4   Age                       100000 non-null  object 
 5   SSN                       100000 non-null  object 
 6   Occupation                100000 non-null  object 
 7   Annual_Income             100000 non-null  object 
 8   Monthly_Inhand_Salary     84998 non-null   float64
 9   Num_Bank_Accounts         100000 non-null  int64  
 10  Num_Credit_Card           100000 non-null  int64  
 11  Interest_Rate             100000 non-null  int64  
 12  Num_of_Loan               100000 non-null  object 
 13  Type_of_Loan              88592 non-null   ob

As we can see from the output of credit.info() the number of non-null observations are not the same across all variables, so to be able to analyze patterns in our data, we will first have to do some data cleaning. 

### Cleaning Data 

In [None]:
# dropping null values and columns not used in analysis

# Predictors:

# - number of delayed payments 
# - delay from due date 
# - Credit_Utilization_Ratio
# - credit history age



credit = credit[['Num_of_Delayed_Payment', 'Delay_from_due_date', 'Credit_Score']]
credit = credit.dropna()
credit

Unnamed: 0,Num_of_Delayed_Payment,Delay_from_due_date,Credit_Score
0,7,3,Good
2,7,3,Good
3,4,5,Good
5,4,8,Good
8,4,3,Standard
...,...,...,...
99994,6,20,Poor
99995,7,23,Poor
99996,7,18,Poor
99997,6,27,Poor


In [None]:
credit["Credit_Score"].unique()

array(['Good', 'Standard', 'Poor'], dtype=object)

We see here that our output variable i.e the credit score takes three unique values -> Good, Standard and Poor.

In [None]:
credit = credit[credit['Num_of_Delayed_Payment'].str.isdigit()]
credit

Unnamed: 0,Delay_from_due_date,Num_of_Delayed_Payment,Credit_Utilization_Ratio,Credit_History_Age,Credit_Score
0,3,7,26.822620,22 Years and 1 Months,Good
2,3,7,28.609352,22 Years and 3 Months,Good
3,5,4,31.377862,22 Years and 4 Months,Good
5,8,4,27.262259,22 Years and 6 Months,Good
8,3,4,24.464031,26 Years and 7 Months,Standard
...,...,...,...,...,...
99994,20,6,39.323569,31 Years and 5 Months,Poor
99995,23,7,34.663572,31 Years and 6 Months,Poor
99996,18,7,40.565631,31 Years and 7 Months,Poor
99997,27,6,41.255522,31 Years and 8 Months,Poor


In [None]:
credit= credit.astype({'Num_of_Delayed_Payment': int})
credit


Unnamed: 0,Delay_from_due_date,Num_of_Delayed_Payment,Credit_Utilization_Ratio,Credit_History_Age,Credit_Score
0,3,7,26.822620,22 Years and 1 Months,Good
2,3,7,28.609352,22 Years and 3 Months,Good
3,5,4,31.377862,22 Years and 4 Months,Good
5,8,4,27.262259,22 Years and 6 Months,Good
8,3,4,24.464031,26 Years and 7 Months,Standard
...,...,...,...,...,...
99994,20,6,39.323569,31 Years and 5 Months,Poor
99995,23,7,34.663572,31 Years and 6 Months,Poor
99996,18,7,40.565631,31 Years and 7 Months,Poor
99997,27,6,41.255522,31 Years and 8 Months,Poor


### Summarizing Data

In [None]:
# each column along with its datatype
credit.dtypes

Delay_from_due_date           int64
Num_of_Delayed_Payment        int32
Credit_Utilization_Ratio    float64
Credit_History_Age            int32
Credit_Score                 object
dtype: object

In [None]:
credit.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 81549 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Delay_from_due_date       81549 non-null  int64  
 1   Num_of_Delayed_Payment    81549 non-null  int32  
 2   Credit_Utilization_Ratio  81549 non-null  float64
 3   Credit_History_Age        81549 non-null  int32  
 4   Credit_Score              81549 non-null  object 
dtypes: float64(1), int32(2), int64(1), object(1)
memory usage: 3.1+ MB


In [None]:
# generating descriptive statistics for numeric and 'object' type columns
credit.describe()

Unnamed: 0,Delay_from_due_date,Num_of_Delayed_Payment,Credit_Utilization_Ratio,Credit_History_Age
count,81549.0,81549.0,81549.0,81549.0
mean,21.180162,31.155796,32.297847,17.919815
std,14.883628,226.908684,5.11568,8.321895
min,-5.0,0.0,20.172942,0.0
25%,10.0,9.0,28.072546,12.0
50%,18.0,14.0,32.326517,18.0
75%,28.0,18.0,36.505576,25.0
max,67.0,4397.0,50.0,33.0


### Taking a Random Sample
Since our current dataset has 81549 values, it is too  large to perform 

In [None]:
credit['Credit_Score'].value_counts(normalize=True)

Standard    0.532318
Poor        0.292058
Good        0.175624
Name: Credit_Score, dtype: float64

In [None]:
credit_sample = credit.sample(n = 1000)
credit_sample['Credit_Score'].value_counts(normalize=True)

Standard    0.541
Poor        0.297
Good        0.162
Name: Credit_Score, dtype: float64

#### Impact Of Taking Subset of Data

We have limited our sample size to 1,000 observations, despite the original dataset containing over 80,000 entries. Training our model on the full dataset might yield more representative results, but this approach demands significant computational resources and time. To mitigate this, we've ensured that the proportions of output categories in our sample closely mirror those in the original dataset, thus minimizing the risk of bias in our results. However, it's notable that the 'standard' category constitutes approximately 53.3% of our sample outputs. This could suggest a sampling bias, a factor often beyond the control of data scientists and a potential limitation of our study. To further address this, we could consider selecting a dataset where all outputs are equally represented. However, this approach may not accurately reflect the actual population demographics, presenting a trade-off between balanced representation and real-world application.

### Train - Test Split 

In [None]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(
    credit_sample, train_size=0.75, stratify=credit_sample["Credit_Score"]
)
print(train.info())
print(test.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 750 entries, 84127 to 79125
Data columns (total 3 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   Num_of_Delayed_Payment  750 non-null    int32 
 1   Delay_from_due_date     750 non-null    int64 
 2   Credit_Score            750 non-null    object
dtypes: int32(1), int64(1), object(1)
memory usage: 20.5+ KB
None
<class 'pandas.core.frame.DataFrame'>
Int64Index: 250 entries, 22730 to 33440
Data columns (total 3 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   Num_of_Delayed_Payment  250 non-null    int32 
 1   Delay_from_due_date     250 non-null    int64 
 2   Credit_Score            250 non-null    object
dtypes: int32(1), int64(1), object(1)
memory usage: 6.8+ KB
None


In [None]:
print("Value counts for training data\n",train["Credit_Score"].value_counts(normalize=True))
print("\nValue counts for testing data\n",test["Credit_Score"].value_counts(normalize=True))

Value counts for training data
 Standard    0.541333
Poor        0.297333
Good        0.161333
Name: Credit_Score, dtype: float64

Value counts for testing data
 Standard    0.540
Poor        0.296
Good        0.164
Name: Credit_Score, dtype: float64


### Visualizing Data - Distribution of Predictor Variables

In [None]:
# initializing list of predictors
predictors = ["Delay_from_due_date", "Num_of_Delayed_Payment"]

# visualizing training data

# removing outliers - just for the visualization
def remove_outliers(column):
    Q1 = column.quantile(0.25)
    Q3 = column.quantile(0.75)
    IQR = Q3 - Q1
    return (column >= Q1 - 1.5 * IQR) & (column <= Q3 + 1.5 * IQR)

# Apply outlier removal to each column
for col in predictors:
    train = train[remove_outliers(train[col])]

train_pairplot = alt.Chart(train).mark_point(opacity=0.4).encode(
    alt.X(alt.repeat("row"), type="quantitative"),
    alt.Y(alt.repeat("column"), type="quantitative"),
    color = alt.Color("Credit_Score").title("Credit_Score")
).properties(
    width=200,
    height=200
).repeat(
    column=predictors,
    row=predictors
)

train_pairplot

Delay from due date vs

### Methods

We plan to conduct our data analysis using the K Nearest Neighbors Classification algrorithm. We'll choose the best value of k using cross-validation and then use the following predictors to predict whether someone's Credit Score is Good, Standard or Poor.

Predictors:

- number of delayed payments 
- delay from due date 
- Credit_Utilization_Ratio
- credit history age


**Describing our visualizations**
We will plot histograms to visualize the distributions of our predictors. This will help us explore the relation between predicted credit score and the factors that may impact the credit score more than others like income, missed payments, credit utilization ratios.


In [None]:
# training_labels = subset_train["Credit_Score"]
# training_labels

In [None]:
knn = KNeighborsClassifier(n_neighbors=57)
knn

In [None]:
preprocessor = make_column_transformer(
    (StandardScaler(), predictors),
    remainder='passthrough',
    verbose_feature_names_out=False
)
preprocessor

In [None]:
X = train.drop(columns='Credit_Score')
y = train["Credit_Score"]

train_fit = make_pipeline(preprocessor, knn).fit(X, y)

train_fit

In [None]:
test_predictions = test.assign(
    predicted = train_fit.predict(test[predictors])
)
test_predictions

Unnamed: 0,Num_of_Delayed_Payment,Delay_from_due_date,Credit_Score,predicted
22730,19,17,Standard,Standard
73480,21,56,Standard,Poor
72345,3,7,Good,Good
14293,17,32,Poor,Standard
44,7,11,Good,Standard
...,...,...,...,...
97756,18,21,Standard,Standard
34642,5,7,Poor,Standard
1041,6,15,Standard,Standard
80943,8,3,Standard,Standard


In [None]:
correct_preds = test_predictions[
    test_predictions['Credit_Score'] == test_predictions['predicted']
]

correct_preds.shape[0] / test_predictions.shape[0]

0.648

In [None]:
X_test = test[predictors]
y_test = test["Credit_Score"]
accuracy = train_fit.score(X_test,y_test)
accuracy

0.648

### Choosing the best K

In [None]:
param_grid = {
    "kneighborsclassifier__n_neighbors": range(1, 100, 5),
}
tune_pipe = make_pipeline(preprocessor, KNeighborsClassifier())

In [None]:
knn_tune_grid = GridSearchCV(
    estimator = tune_pipe, param_grid = param_grid, cv=25,
)
knn_tune_grid

In [None]:
knn_model_grid = knn_tune_grid.fit(X, y)

accuracies_grid = pd.DataFrame(knn_model_grid.cv_results_)
#accuracies_grid

In [None]:
accuracy_versus_k_grid = alt.Chart(accuracies_grid).mark_line(point=True).encode(
    x=alt.X("param_kneighborsclassifier__n_neighbors")
        .title("neighbors")
        .scale(zero=False),
    y=alt.Y("mean_test_score")
        .title("Mean test score")
        .scale(zero=False)
)
accuracy_versus_k_grid

### Expected outcomes and significance
- *What do you expect to find?* 
  
  People with more loans to be categorized in a ‘lower’ category. People with a bigger income, older credit account  age (more credit history) and fewer delayed payments to be in  a better category. Having a mix of credit types - loans/credit cards/mortgages also results in a better score.
  
- *What impact could such findings have?*
  - Helping banks predict whether it is a good idea to issue a new credit card to a new user
  - Can influence an individual’s credit limit/interest rate.
  - Studying the relation of individual factors with credit score category classification.
  
- *What future questions could this lead to?* 
  
  How do we evaluate or categorize a new person who has just started working and does not have a long enough credit history. 


### Discussion


We would a k value of 30 leads the highest accuracy in our model. With the corresponding accuracy being 58%. Since we don't have a small k-value, it is indicative that our model is less prone to overfitting. A very large k-value on the other hand would mean high bias, however, a k-value of 30 treads a fine line between the two. An optimal k-value of 30 also means that our model is fit for larger datasets and will have a smooth decision boundary as it will factor in more neighbours while making a prediction. 